CabinSep: IR-Augmented Mask-Based MVDR for Real-Time In-car Speech Separation with Distributed Heterogeneous Arrays

0. Contents

1. Abstract
2. Samples
2.1. Non-standard Postures
2.2. Standard Postures

1. Abstract

Front-end speech separation plays a vital role in human-car interaction systems. This paper presents CabinSep, a robust neural mask-based minimum variance distortionless response (MVDR) system. It overcomes the nonlinear distortions that are prevalent in current all-neural-network-driven solutions. We train the model to estimate speech and noise masks and apply MVDR solely during inference, thereby avoiding the numerical instability and ineffective noise reduction issues encountered in previous approaches. Our system incorporates multiple efficient modules that utilize channel information to enhance the separation capability. With a computational complexity of only 0.4G MACs, CabinSep achieves a 28.8% character error rate (CER) in real-world scenarios, outperforming state-of-the-art systems. Additionally, we evaluate various data augmentation methods using real-recorded impulse responses. By simulating more realistic training data, we improve speaker positioning accuracy and further reduce the CER.
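The inference step described in the abstract (masks estimated by a network, MVDR applied afterwards) can be sketched for a single frequency bin as follows. This is a minimal illustration under stated assumptions, not the CabinSep implementation: the masks are taken as given (oracle masks in the toy data), the weights use the common reference-channel MVDR form w = Φ_n^{-1} Φ_s u / tr(Φ_n^{-1} Φ_s), and all function names and the diagonal loading constant are hypothetical.

```python
# Sketch of mask-based MVDR for ONE frequency bin, standard library only.
# Function names and the loading value are illustrative assumptions.

def spatial_covariance(frames, mask):
    """Mask-weighted covariance: sum_t m[t] * y[t] y[t]^H / sum_t m[t]."""
    n_ch = len(frames[0])
    cov = [[0j] * n_ch for _ in range(n_ch)]
    total = sum(mask) or 1.0
    for y, m in zip(frames, mask):
        for i in range(n_ch):
            for j in range(n_ch):
                cov[i][j] += m * y[i] * y[j].conjugate()
    return [[v / total for v in row] for row in cov]


def solve(a, b):
    """Solve a @ x = b (complex square a, matrix b) by Gauss-Jordan elimination."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def mvdr_weights(phi_s, phi_n, ref=0, loading=1e-6):
    """Reference-channel MVDR weights; `loading` keeps phi_n invertible."""
    n = len(phi_n)
    loaded = [[phi_n[i][j] + (loading if i == j else 0.0) for j in range(n)]
              for i in range(n)]
    ratio = solve(loaded, phi_s)                 # phi_n^{-1} phi_s
    trace = sum(ratio[i][i] for i in range(n))
    return [ratio[i][ref] / trace for i in range(n)]


def beamform(weights, frames):
    """Apply w^H y to every frame."""
    return [sum(w.conjugate() * c for w, c in zip(weights, y)) for y in frames]


# Toy two-microphone data: four speech-only frames, then four noise-only frames.
d = [1 + 0j, 1j]           # assumed speech steering vector (reference channel 0)
g = [1 + 0j, -1 + 0j]      # assumed noise steering vector
speech = [1 + 0j, -0.5 + 0.5j, 0.3j, 0.8 - 0.2j]
noise = [0.7 + 0j, -1j, 0.4 + 0.4j, -0.9 + 0j]
frames = [[s * c for c in d] for s in speech] + [[v * c for c in g] for v in noise]
speech_mask = [1.0] * 4 + [0.0] * 4    # stands in for the network's speech mask
noise_mask = [0.0] * 4 + [1.0] * 4     # stands in for the network's noise mask

w = mvdr_weights(spatial_covariance(frames, speech_mask),
                 spatial_covariance(frames, noise_mask))
out = beamform(w, frames)
```

In this toy case the speech frames pass through essentially unchanged at the reference channel while the directional noise frames are strongly attenuated. Because the beamformer is a linear filter per frequency bin, it does not introduce the nonlinear distortions mentioned above. A practical system would run this over all frequency bins with a vectorized numerical library.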

2. Samples

The audio samples fall into two categories, matching the two situations discussed in the paper: the speaker in a non-standard sitting posture and the speaker in a standard sitting posture. For the non-standard posture, we show a scene with the speaker in Zone 4; this sample demonstrates the effectiveness of our proposed data augmentation method, which incorporates real-recorded IRs. For the standard posture, we show three samples: one recorded while the car is stationary and two recorded while it is in motion. The in-motion recordings contain additional noise such as wind and tire friction. In the stationary sample, people speak in all four zones with severe overlap. The two in-motion samples contain three and two speakers respectively, also with overlapping speech.
Note that all audio samples are real recordings made inside the vehicle.

2.1. Non-standard Postures

We selected a "non-standard posture" audio clip with the speaker in Zone 4 as an example. The spoken content is the wake-up word of a keyword spotting (KWS) system. We use the wake-up accuracy of the KWS system as a proxy for positioning accuracy; in other words, in the following example the KWS system should be woken up only in Zone 4. The following figure illustrates this scene.

The figure is a schematic of the "non-standard posture" and the "standard posture". The two photos on the left show the speaker in Zone 4 in a "non-standard posture", corresponding to the blue dot in the middle diagram; this is also the speaker position in the "non-standard posture" audio sample. The two photos on the right show the speaker in Zone 4 in a "standard posture".

The following table shows the separation results of different models on this "non-standard posture" audio sample. Only CabinSep-L-Stage2 was trained with our proposed data augmentation method, and its separation quality is noticeably better.

Models	Zone 1	Zone 2	Zone 3	Zone 4
Unprocessed Mixture
FasNet-TAC
DualSep-L
CabinSep-L-Stage1
CabinSep-L-Stage2
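The basic operation behind augmentation with real-recorded IRs is convolving a dry utterance with the impulse response measured from a speaker position to each microphone. The sketch below shows only that step, under assumptions: the helper names and the toy two-tap IRs are hypothetical, and the page does not specify how CabinSep selects or combines IRs.

```python
# Illustrative only: helper names and the toy IRs are assumptions.

def convolve(signal, ir):
    """Full linear convolution of a dry signal with one impulse response."""
    out = [0.0] * (len(signal) + len(ir) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(ir):
            out[n + k] += x * h
    return out


def simulate_array(dry_speech, irs_per_mic):
    """Render one dry utterance at every microphone of the array."""
    return [convolve(dry_speech, ir) for ir in irs_per_mic]


dry = [1.0, 2.0, 3.0]              # toy dry speech samples
irs = [[1.0, 0.5], [0.0, 0.8]]     # toy IRs for two microphones
mics = simulate_array(dry, irs)    # one simulated signal per microphone
```

With measured IRs for positions such as a non-standard posture in Zone 4, the same call would produce multichannel training signals for that position. Real IRs are thousands of taps long, so an FFT-based convolution would be the practical choice.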

2.2. Standard Postures

We present three samples with the speakers in the "standard posture", along with the speech recognition results produced for each audio by the open-source WeNet ASR model.

Audio Sample 1: Stationary

Speakers are talking in all four zones.

Separated Audios

Models	Zone 1	Zone 2	Zone 3	Zone 4
Unprocessed Mixture
FasNet-TAC
DualSep-L
CabinSep-S
CabinSep-L

Speech Recognition Results of Audio Sample 1

Text Label:
Zone1: 时间换一首给我一首歌的时间 (Time. Change another song. Play "Give Me a Song's Time" for me.)
Zone2: 号门空调温度调为二十一度空调风速调一挡 (Gate. Adjust the air conditioning temperature to 21 degrees Celsius and set the air conditioning fan speed to the first gear.)
Zone3: 四个轮子的胎压都正常吗 (Are the tire pressures of all four wheels normal?)
Zone4: 导航去高新区管理委员会 (Navigate to the Administrative Committee of the High-tech Zone.)

In the following table, red text marks incorrectly recognized characters, black text marks correctly recognized characters, "*" marks a missing character, and "-" means that no speech was recognized in the audio.

Models	Zone 1	Zone 2	Zone 3	Zone 4
Text Label	时间换一首给我一首歌的时间	号门空调温度调为二十一度空调风速调一挡	四个轮子的胎压都正常吗	导航去高新区管理委员会
Unprocessed Mixture	时间导航去四个轮洞为二十五的手风路调一道	导航去四个门口刚为二十五一首叫风路朝一档	这个时间导航去四个轮子和他骸国还口正常吗风度球一档	导航去高峰去环救人了丰富条衣的
FasNet-TAC	时间差一下给我一首歌的时间	*空调温度调为二十一度空调风控潮一档	四个轮子的太阳还都正常吗	导航去高兴区管理委员会
DualSep-L	时间翻一小给我一首歌的时间	后面空调温度调约二十一度空调丰速调一*	已经死四个轮这个太阳都正常吗	导航去高山区管理馆然会
CabinSep-S	时间翻手*给我一首歌的时间	号门空调温度调约二十一度空调风度调一档	四个轮子的太阳都正常吗	导航去高新区管理委员会
CabinSep-L	时间翻手*给我一首歌的时间	号门控调温度调约二十一度空调封送调一档	四个轮子的太阳都正常吗	导航去高新区管理委员会
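The character error rate quoted in the abstract can be illustrated on individual table cells. The sketch below assumes the usual definition (Levenshtein distance between reference and hypothesis characters, divided by the reference length); the exact scoring setup behind the reported figure is not given on this page, and the function names are hypothetical. The "*" and "-" markers are removed first because they are display conventions, not recognized text.

```python
# Illustrative CER helper; names are assumptions, not from the paper or WeNet.

def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # reference character deleted
                curr[j - 1] + 1,         # spurious character inserted
                prev[j - 1] + (r != h),  # substitution, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of one hypothesis against one reference."""
    hyp = hyp.replace("*", "").replace("-", "")
    ref = ref.replace("-", "")
    if not ref:
        raise ValueError("CER is undefined for an empty reference")
    return edit_distance(ref, hyp) / len(ref)


# Zone 3 of Audio Sample 1: text label vs. the CabinSep-S output.
zone3_cer = cer("四个轮子的胎压都正常吗", "四个轮子的太阳都正常吗")
```

For this cell the hypothesis differs from the 11-character label by two substitutions (胎压 recognized as 太阳), so the per-utterance CER is 2/11. A corpus-level CER would sum the edit distances and reference lengths over all utterances before dividing.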

Audio Sample 2: Motion

Speakers are talking in Zone 2, Zone 3, and Zone 4.

Separated Audios

Models	Zone 1	Zone 2	Zone 3	Zone 4
Unprocessed Mixture
FasNet-TAC
DualSep-L
CabinSep-S
CabinSep-L

Speech Recognition Results of Audio Sample 2

Text Label:
Zone1: - (Nobody is speaking in Zone1)
Zone2: 帮我找一下附近的停车场 (Help me find a nearby parking lot.)
Zone3: 所有车窗打开百分之七十 (Open all the car windows by 70 percent.)
Zone4: 打开把那个车子主驾按摩打开 (Open it. Turn on the driver's seat massage of that car.)

In the following table, red text marks incorrectly recognized characters, black text marks correctly recognized characters, "*" marks a missing character, and "-" means that no speech was recognized in the audio.

Models	Zone 1	Zone 2	Zone 3	Zone 4
Text Label	-	帮我找一下附近的停车场	所有车窗打开百分之七十	打开把那个车子主驾按摩打开
Unprocessed Mixture	可以帮我找一下	帮我找一下自己的房车上	所有车窗打开百分之七十	我才把所有车窗打开百分按摩时*
FasNet-TAC	-	等我找跳出本图的卡啡*	***********	*************
DualSep-L	一步	帮我找一下父亲们组织找	博演车窗打开百分之七十	不告了阿伯找人******
CabinSep-S	-	帮我找一下附近的停车场	所有车窗打开百分之七十	来刚来把那个村子的总驾按摩打开
CabinSep-L	-	帮我找一下附近的停车场	所有车窗打开百分之七十	咱快把**村子总驾按摩打开

Audio Sample 3: Motion

Speakers are talking in Zone 2 and Zone 4.

Separated Audios

Models	Zone 1	Zone 2	Zone 3	Zone 4
Unprocessed Mixture
FasNet-TAC
DualSep-L
CabinSep-S
CabinSep-L

Speech Recognition Results of Audio Sample 3

Text Label:
Zone1: - (Nobody is speaking in Zone1)
Zone2: 到达目的地后还有多少电量 (How much battery power is left after reaching the destination?)
Zone3: - (Nobody is speaking in Zone3)
Zone4: 呃播放一首小白兔白又白 (Umm, play the song "The Little White Rabbit is White and White".)

In the following table, red text marks incorrectly recognized characters, black text marks correctly recognized characters, "*" marks a missing character, and "-" means that no speech was recognized in the audio.

Models	Zone 1	Zone 2	Zone 3	Zone 4
Text Label	-	到达目的地后还有多少电量	-	呃播放一首小白兔白又白
Unprocessed Mixture	到打不过放一思全白还有一回多少电量	到达莫高方一后小白才有多少电荡	报道播放一则小白兔白月白好电话	导播放一首小白兔白月白
FasNet-TAC	-	到达目击地后还有有多少架当	-	好播放一张小白兔子蓝就白
DualSep-L	-	到达目击地后才有多少电报	你	*播放好一首小白兔白月白
CabinSep-S	-	到达目的地后还有多少电话	-	*播放一首小白兔白月白
CabinSep-L	-	到达目的地后才有多少电量	-	*播放一首小白兔白月白