CabinSep: IR-Augmented Mask-Based MVDR for Real-Time In-car Speech Separation with Distributed Heterogeneous Arrays
1. Abstract
Front-end speech separation plays a vital role in human-car interaction systems. This paper presents CabinSep, a robust neural mask-based minimum variance distortionless response (MVDR) system. It overcomes the nonlinear distortions that are prevalent in current all-neural-network-driven solutions. We train the model to estimate speech and noise masks and apply MVDR solely during inference, thereby avoiding the numerical instability and ineffective noise reduction issues encountered in previous approaches. Our system incorporates multiple efficient modules that utilize channel information to enhance the separation capability. With a computational complexity of only 0.4G MACs, CabinSep achieves a 28.8% character error rate (CER) in real-world scenarios, outperforming state-of-the-art systems. Additionally, we evaluate various data augmentation methods using real-recorded impulse responses. By simulating more realistic training data, we improve speaker positioning accuracy and further reduce the CER.
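To make the inference-time step concrete, the following is a minimal NumPy sketch of mask-based MVDR. It is not CabinSep's implementation. It assumes the widely used reference-channel formulation $\mathbf{w} = \Phi_n^{-1}\Phi_s\mathbf{u} / \mathrm{tr}(\Phi_n^{-1}\Phi_s)$ and whole-utterance covariance estimates rather than real-time recursive updates. The function name `mask_based_mvdr`, the array shapes, and the `eps` diagonal loading are illustrative assumptions.

```python
import numpy as np


def mask_based_mvdr(stft, speech_mask, noise_mask, ref_channel=0, eps=1e-6):
    """Apply a mask-driven MVDR beamformer (illustrative sketch).

    stft:        complex array, shape (channels, frames, freqs)
    speech_mask: real array in [0, 1], shape (frames, freqs)
    noise_mask:  real array in [0, 1], shape (frames, freqs)
    Returns the beamformed STFT, shape (frames, freqs).
    """
    n_ch = stft.shape[0]
    x = stft.transpose(2, 0, 1)  # (freqs, channels, frames)

    def spatial_covariance(mask):
        # Mask-weighted average of x x^H over frames, per frequency bin.
        m = mask.T[:, None, :]  # (freqs, 1, frames)
        num = (m * x) @ x.conj().transpose(0, 2, 1)  # (freqs, ch, ch)
        return num / (mask.T.sum(axis=1)[:, None, None] + eps)

    phi_s = spatial_covariance(speech_mask)
    phi_n = spatial_covariance(noise_mask)

    # Diagonal loading keeps the noise covariance invertible (assumed choice).
    tr_n = np.trace(phi_n, axis1=1, axis2=2).real
    phi_n = phi_n + (eps * tr_n[:, None, None] + eps) * np.eye(n_ch)

    # w = (Phi_n^{-1} Phi_s) u / trace(Phi_n^{-1} Phi_s), u selects ref_channel.
    numerator = np.linalg.solve(phi_n, phi_s)  # (freqs, ch, ch)
    trace = np.trace(numerator, axis1=1, axis2=2)[:, None]
    w = numerator[:, :, ref_channel] / (trace + eps)  # (freqs, ch)

    # Beamformer output y[t, f] = w[f]^H x[f, :, t].
    return np.einsum("fc,fct->tf", w.conj(), x)
```

Because the beamformer is a linear filter per frequency bin, it cannot introduce the nonlinear distortions that a network producing the output signal directly may cause. In this sketch the network would only supply `speech_mask` and `noise_mask`.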
2. Samples
The audio samples fall into two categories, matching the two conditions discussed in the paper: the speaker in a non-standard sitting posture and the speaker in a standard sitting posture.
For the non-standard posture, we show a scene in which the speaker is located in Zone 4. This sample demonstrates the effectiveness of our proposed data augmentation method, which incorporates real-recorded IRs.
For the standard posture, we present three samples: one recorded while the car is stationary and two recorded while the car is in motion. When the car is in motion, additional noise such as wind and tire friction is present.
In the stationary scene, people speak in all four zones with heavy overlap. The two in-motion samples contain three and two speakers, respectively, and also include overlapping speech.
All audio samples were recorded inside a real vehicle.
2.1. Non-standard Postures
We selected a "non-standard posture" audio clip from Zone 4 as an example. The speech content is the wake-up word of a keyword spotting (KWS) system. We use the wake-up accuracy of the KWS system as a proxy for positioning accuracy; in this example, the KWS system should be woken up only in Zone 4. The following figure illustrates this scene.

The following table shows the separation results of different models on this "non-standard posture" audio sample. Only CabinSep-L-Stage2 was trained with our proposed data augmentation method, and its separation is markedly improved.
Models | Zone 1 | Zone 2 | Zone 3 | Zone 4 |
---|---|---|---|---|
Unprocessed Mixture | ![]() | ![]() | ![]() | ![]() |
FasNet-TAC | ![]() | ![]() | ![]() | ![]() |
DualSep-L | ![]() | ![]() | ![]() | ![]() |
CabinSep-L-Stage1 | ![]() | ![]() | ![]() | ![]() |
CabinSep-L-Stage2 | ![]() | ![]() | ![]() | ![]() |
2.2. Standard Postures
We present three samples in which the speakers are in the "standard posture", along with the speech recognition results produced for each audio clip by the open-source WeNet ASR model.
Audio Sample 1: Stationary
Speakers are talking in all four zones.
Separated Audios
Models | Zone 1 | Zone 2 | Zone 3 | Zone 4 |
---|---|---|---|---|
Unprocessed Mixture | ![]() | ![]() | ![]() | ![]() |
FasNet-TAC | ![]() | ![]() | ![]() | ![]() |
DualSep-L | ![]() | ![]() | ![]() | ![]() |
CabinSep-S | ![]() | ![]() | ![]() | ![]() |
CabinSep-L | ![]() | ![]() | ![]() | ![]() |
Speech Recognition Results of Audio Sample 1
Text Label:
Zone 1: 时间 换一首给我一首歌的时间 (Time. Switch to another song. Play "Give Me a Song's Time" for me.)
Zone 2: 号门 空调温度调为二十一度空调风速调一挡 (Gate. Set the air conditioning temperature to 21 degrees and set the fan speed to level 1.)
Zone 3: 四个轮子的胎压都正常吗 (Are the tire pressures of all four wheels normal?)
Zone 4: 导航去高新区管理委员会 (Navigate to the Administrative Committee of the High-tech Zone.)
Models | Zone 1 | Zone 2 | Zone 3 | Zone 4 |
---|---|---|---|---|
Text Label | 时间 换一首给我一首歌的时间 | 号门 空调温度调为二十一度空调风速调一挡 | 四个轮子的胎压都正常吗 | 导航去高新区管理委员会 |
Unprocessed Mixture | 时间 | | | 导航去高 |
FasNet-TAC | 时间 | | 四个轮子的 | 导航去高 |
DualSep-L | 时间 | | | 导航去高 |
CabinSep-S | 时间 | 号门空调温度调 | 四个轮子的 | 导航去高新区管理委员会 |
CabinSep-L | 时间 | 号门 | 四个轮子的 | 导航去高新区管理委员会 |
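
The transcripts above can be scored against the text labels with a character error rate. This page does not specify the scoring script, so the following is a minimal sketch assuming the standard definition: character-level edit distance divided by the reference length, with whitespace ignored. The names `edit_distance` and `cer` are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of hyp against ref, ignoring whitespace."""
    ref = "".join(ref.split())
    hyp = "".join(hyp.split())
    if not ref:
        # How silent zones are scored is not stated on this page.
        raise ValueError("reference is empty; CER is undefined")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `cer("导航去高新区管理委员会", "导航去高")` compares the Zone 4 label with the unprocessed-mixture transcript: 7 of the 11 reference characters are missing, giving a CER of about 0.64, while the CabinSep transcripts for that zone match the label exactly.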
Audio Sample 2: Motion
Speakers are talking in Zone 2, Zone 3, and Zone 4.
Separated Audios
Models | Zone 1 | Zone 2 | Zone 3 | Zone 4 |
---|---|---|---|---|
Unprocessed Mixture | ![]() | ![]() | ![]() | ![]() |
FasNet-TAC | ![]() | ![]() | ![]() | ![]() |
DualSep-L | ![]() | ![]() | ![]() | ![]() |
CabinSep-S | ![]() | ![]() | ![]() | ![]() |
CabinSep-L | ![]() | ![]() | ![]() | ![]() |
Speech Recognition Results of Audio Sample 2
Text Label:
Zone 1: - (Nobody is speaking in Zone 1.)
Zone 2: 帮我找一下附近的停车场 (Help me find a nearby parking lot.)
Zone 3: 所有车窗打开百分之七十 (Open all the car windows by 70 percent.)
Zone 4: 打开把那个车子主驾按摩打开 (Open it. Turn on the driver's seat massage of that car.)
Models | Zone 1 | Zone 2 | Zone 3 | Zone 4 |
---|---|---|---|---|
Text Label | - | 帮我找一下附近的停车场 | 所有车窗打开百分之七十 | 打开把那个车子主驾按摩打开 |
Unprocessed Mixture | | 帮我找一下 | 所有车窗打开百分之七十 | |
FasNet-TAC | - | | | |
DualSep-L | | 帮我找一下 | | |
CabinSep-S | - | 帮我找一下附近的停车场 | 所有车窗打开百分之七十 | |
CabinSep-L | - | 帮我找一下附近的停车场 | 所有车窗打开百分之七十 | |
Audio Sample 3: Motion
Speakers are talking in Zone 2 and Zone 4.
Separated Audios
Models | Zone 1 | Zone 2 | Zone 3 | Zone 4 |
---|---|---|---|---|
Unprocessed Mixture | ![]() | ![]() | ![]() | ![]() |
FasNet-TAC | ![]() | ![]() | ![]() | ![]() |
DualSep-L | ![]() | ![]() | ![]() | ![]() |
CabinSep-S | ![]() | ![]() | ![]() | ![]() |
CabinSep-L | ![]() | ![]() | ![]() | ![]() |
Speech Recognition Results of Audio Sample 3
Text Label:
Zone 1: - (Nobody is speaking in Zone 1.)
Zone 2: 到达目的地后还有多少电量 (How much battery power will be left after reaching the destination?)
Zone 3: - (Nobody is speaking in Zone 3.)
Zone 4: 呃播放一首小白兔白又白 (Umm, play the song "The Little White Rabbit is White and White".)
Models | Zone 1 | Zone 2 | Zone 3 | Zone 4 |
---|---|---|---|---|
Text Label | - | 到达目的地后还有多少电量 | - | 呃播放一首小白兔白又白 |
Unprocessed Mixture | | 到达 | | |
FasNet-TAC | - | 到达目 | - | |
DualSep-L | - | 到达目 | | |
CabinSep-S | - | 到达目的地后还有多少电 | - | |
CabinSep-L | - | 到达目的地后 | - | |