CabinSep: IR-Augmented Mask-Based MVDR for Real-Time In-car Speech Separation with Distributed Heterogeneous Arrays

0. Contents

  1. Abstract
  2. Samples
    2.1. Non-standard Postures
    2.2. Standard Postures

1. Abstract

Front-end speech separation plays a vital role in human-car interaction systems. This paper presents CabinSep, a robust neural mask-based minimum variance distortionless response (MVDR) system. It overcomes the nonlinear distortions that are prevalent in current all-neural-network-driven solutions. We train the model to estimate speech and noise masks and apply MVDR solely during inference, thereby avoiding the numerical instability and ineffective noise reduction issues encountered in previous approaches. Our system incorporates multiple efficient modules that utilize channel information to enhance the separation capability. With a computational complexity of only 0.4G MACs, CabinSep achieves a 28.8% character error rate (CER) in real-world scenarios, outperforming state-of-the-art systems. Additionally, we evaluate various data augmentation methods using real-recorded impulse responses. By simulating more realistic training data, we improve speaker positioning accuracy and further reduce the CER.
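To make the inference-time step concrete, the following is a minimal NumPy sketch of a mask-based MVDR beamformer. It is not the CabinSep implementation: the function name `mask_based_mvdr`, the array shapes, the reference-channel choice, and the diagonal loading constant are all assumptions made for illustration. The weights follow the widely used reference-channel formulation w = (Φ_nn⁻¹ Φ_ss) u / tr(Φ_nn⁻¹ Φ_ss), where Φ_ss and Φ_nn are the mask-weighted speech and noise spatial covariance matrices.

```python
import numpy as np

def mask_based_mvdr(stft, speech_mask, noise_mask, ref_channel=0, eps=1e-6):
    """Apply a mask-driven MVDR beamformer (illustrative sketch).

    stft:        complex array, shape (channels, freqs, frames)
    speech_mask: real array in [0, 1], shape (freqs, frames)
    noise_mask:  real array in [0, 1], shape (freqs, frames)
    Returns the beamformed STFT, shape (freqs, frames).
    """
    n_ch, n_freq, n_frames = stft.shape
    out = np.zeros((n_freq, n_frames), dtype=complex)
    for f in range(n_freq):
        x = stft[:, f, :]  # (channels, frames)
        # Mask-weighted spatial covariance matrices for speech and noise
        phi_s = (speech_mask[f] * x) @ x.conj().T / max(speech_mask[f].sum(), eps)
        phi_n = (noise_mask[f] * x) @ x.conj().T / max(noise_mask[f].sum(), eps)
        # Diagonal loading keeps the noise covariance invertible (assumed constant)
        loading = eps * np.trace(phi_n).real / n_ch + 1e-10
        phi_n = phi_n + loading * np.eye(n_ch)
        # phi_n^{-1} phi_s, computed without forming an explicit inverse
        numerator = np.linalg.solve(phi_n, phi_s)
        w = numerator[:, ref_channel] / (np.trace(numerator) + eps)
        out[f] = w.conj() @ x  # y = w^H x
    return out
```

In this sketch the masks would come from the trained network, and the beamformer itself contains no learned parameters, which is consistent with applying MVDR only at inference time.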



2. Samples

The audio samples fall into two categories, matching the two cases discussed in the paper: a speaker in a non-standard sitting posture and a speaker in a standard sitting posture. For the non-standard posture, we show one scene with the speaker in Zone 4; this sample demonstrates the effectiveness of our proposed data augmentation with real-recorded IRs. For the standard posture, we present three samples: one recorded while the car is stationary and two recorded while it is in motion. Driving introduces additional noise such as wind and tire friction. In the stationary sample, people speak in all four zones with severe overlap. The two in-motion samples contain three and two speakers, respectively, and also include overlapping speech.
Note that all audio samples are real recordings made inside the vehicle.

2.1. Non-standard Postures

We selected a "non-standard posture" audio clip with the speaker in Zone 4 as an example. The spoken content is the wake-up word of a keyword spotting (KWS) system. We use the wake-up accuracy of the KWS system as a measure of positioning accuracy. In other words, in the following example the KWS system should be woken up only in Zone 4. The following figure illustrates this scene.

This figure is a schematic of the "non-standard posture" and the "standard posture". The two photos on the left show the speaker in Zone 4 in a "non-standard posture", corresponding to the blue dot in the middle diagram; this is also the speaker position in the "non-standard posture" audio sample. The two photos on the right show the speaker in Zone 4 in a "standard posture".
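As a small illustration of using KWS wake-ups as a proxy for positioning accuracy, the sketch below scores a set of clips. The function name `positioning_accuracy`, the input format, and the scoring rule (a clip is correct only if the KWS wakes in the target zone and in no other zone) are assumptions made for this example; the clip results are hypothetical and are not measurements from our experiments.

```python
def positioning_accuracy(wake_events, target_zone):
    """Fraction of clips where the KWS woke in the target zone only.

    wake_events: list of sets; each set holds the zones whose separated
                 output triggered the KWS for one clip.
    Assumed rule: a clip is correct only if exactly the target zone woke up.
    """
    if not wake_events:
        raise ValueError("no clips to score")
    correct = sum(1 for zones in wake_events if zones == {target_zone})
    return correct / len(wake_events)

# Hypothetical results for four clips spoken from Zone 4:
# two correct, one leaking into Zone 3, one with no wake-up at all.
clips = [{4}, {4}, {3, 4}, set()]
print(positioning_accuracy(clips, target_zone=4))  # 0.5
```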


The following table compares the separation results of different models on this "non-standard posture" audio sample. Only CabinSep-L-Stage2 was trained with our proposed data augmentation method, and its separation quality is significantly better.


Models | Zone 1 | Zone 2 | Zone 3 | Zone 4
Unprocessed Mixture | Sample 1 Image | Sample 2 Image | Sample 3 Image | Sample 4 Image
FasNet-TAC | Sample 1 Image | Sample 2 Image | Sample 3 Image | Sample 4 Image
DualSep-L | Sample 1 Image | Sample 2 Image | Sample 3 Image | Sample 4 Image
CabinSep-L-Stage1 | Sample 1 Image | Sample 2 Image | Sample 3 Image | Sample 4 Image
CabinSep-L-Stage2 | Sample 1 Image | Sample 2 Image | Sample 3 Image | Sample 4 Image
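The general idea behind simulating training data from real-recorded IRs can be sketched as follows: each dry utterance is convolved with a measured multichannel impulse response for the speaker's position, and the resulting images are summed with optional noise. This is only a generic illustration and not the specific augmentation method used for CabinSep-L-Stage2; the function name `simulate_cabin_mixture` and the array layouts are assumptions.

```python
import numpy as np

def simulate_cabin_mixture(sources, irs, noise=None):
    """Mix dry speech into a multichannel cabin recording (illustrative sketch).

    sources: list of 1-D float arrays, one dry utterance per active speaker
    irs:     list of arrays with shape (n_mics, ir_len), the measured IR
             from each speaker position to every microphone
    noise:   optional array with shape (n_mics, n), n >= longest utterance
    Returns (mixture, images); images[k] is speaker k as heard at the mics.
    """
    n_mics = irs[0].shape[0]
    n_samples = max(len(s) for s in sources)
    images = []
    for src, ir in zip(sources, irs):
        # Zero-pad shorter utterances so all images share one length
        padded = np.pad(src, (0, n_samples - len(src)))
        # Convolve the dry signal with each microphone's IR, then trim the tail
        image = np.stack(
            [np.convolve(padded, ir[m])[:n_samples] for m in range(n_mics)]
        )
        images.append(image)
    mixture = np.sum(images, axis=0)
    if noise is not None:
        mixture = mixture + noise[:, :n_samples]
    return mixture, images
```

The per-speaker images returned alongside the mixture can serve as training targets, for example when deriving reference speech and noise masks.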

2.2. Standard Postures

We present three samples with the speaker in the "standard posture", along with the speech recognition results produced for each audio clip by the open-source WeNet ASR model.

Audio Sample 1: Stationary

Speakers are talking in all four zones.

Separated Audios
Models | Zone 1 | Zone 2 | Zone 3 | Zone 4
Unprocessed Mixture | Sample 1 Image | Sample 2 Image | Sample 3 Image | Sample 4 Image
FasNet-TAC | Sample 1 Image | Sample 2 Image | Sample 3 Image | Sample 4 Image
DualSep-L | Sample 1 Image | Sample 2 Image | Sample 3 Image | Sample 4 Image
CabinSep-S | Sample 1 Image | Sample 2 Image | Sample 3 Image | Sample 4 Image
CabinSep-L | Sample 1 Image | Sample 2 Image | Sample 3 Image | Sample 4 Image
Speech Recognition Results of Audio Sample 1
Text Label:
Zone1: 时间 换一首给我一首歌的时间 (Time. Change another song. Play "Give Me a Song's Time" for me.)
Zone2: 号门 空调温度调为二十一度空调风速调一挡 (Gate. Adjust the air conditioning temperature to 21 degrees Celsius and set the air conditioning fan speed to the first gear.)
Zone3: 四个轮子的胎压都正常吗 (Are the tire pressures of all four wheels normal?)
Zone4: 导航去高新区管理委员会 (Navigate to the Administrative Committee of the High-tech Zone.)

In the following table, red text marks incorrectly recognized characters, black text marks correctly recognized characters, "*" marks a missing character, and "-" indicates that no speech was recognized in the audio.
Models | Zone 1 | Zone 2 | Zone 3 | Zone 4
Text Label | 时间 换一首给我一首歌的时间 | 号门 空调温度调为二十一度空调风速调一挡 | 四个轮子的胎压都正常吗 | 导航去高新区管理委员会
Unprocessed Mixture | 时间导航去四个轮洞为二十五手风路调一道 | 导航去四个门口刚为二十五一首叫风路朝一档 | 这个时间导航去四个轮子和他骸国还口正常吗风度球一档 | 导航去高峰去环救人了丰富条衣的
FasNet-TAC | 时间差一下给我一首歌的时间 | **空调温度调为二十一度空调风控潮*一档 | 四个轮子的太阳还都正常吗 | 导航去高区管理委员会
DualSep-L | 时间给我一首歌的时间 | 后面空调温度调约二十一度空调速调一* | 已经死四个轮这个太阳都正常吗 | 导航去高区管理馆然
CabinSep-S | 时间翻手*给我一首歌的时间 | 号门空调温度调 二十一度空调 风调一档 | 四个轮子的太阳都正常吗 | 导航去高新区管理委员会
CabinSep-L | 时间翻手*给我一首歌的时间 | 号门调温度调二十一度空调封送调一档 | 四个轮子的太阳都正常吗 | 导航去高新区管理委员会
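The error markings in the table above correspond to the edit operations counted by the character error rate (CER): substitutions, deletions, and insertions divided by the number of reference characters. The sketch below computes CER with a standard Levenshtein distance. The helper names `edit_distance` and `cer` are assumptions, as is the choice to strip spaces and the "*" placeholders before scoring; it is not necessarily the scoring script used for the reported results.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion (reference char missing)
                curr[j - 1] + 1,         # insertion (extra hypothesis char)
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate; spaces and '*' placeholders are ignored."""
    ref = ref.replace(" ", "")
    hyp = hyp.replace(" ", "").replace("*", "")
    if not ref:
        raise ValueError("reference is empty; CER is undefined")
    return edit_distance(ref, hyp) / len(ref)

# Zone 4 of Audio Sample 1: the FasNet-TAC output drops one of 11 characters.
reference = "导航去高新区管理委员会"
print(cer(reference, "导航去高区管理委员会"))    # 1/11, about 0.091
print(cer(reference, "导航去高新区管理委员会"))  # 0.0
```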

Audio Sample 2: Motion

Speakers are talking in Zone 2, Zone 3, and Zone 4.

Separated Audios
Models | Zone 1 | Zone 2 | Zone 3 | Zone 4
Unprocessed Mixture | Sample 1 Image | Sample 2 Image | Sample 3 Image | Sample 4 Image
FasNet-TAC | Sample 1 Image | Sample 2 Image | Sample 3 Image | Sample 4 Image
DualSep-L | Sample 1 Image | Sample 2 Image | Sample 3 Image | Sample 4 Image
CabinSep-S | Sample 1 Image | Sample 2 Image | Sample 3 Image | Sample 4 Image
CabinSep-L | Sample 1 Image | Sample 2 Image | Sample 3 Image | Sample 4 Image
Speech Recognition Results of Audio Sample 2
Text Label:
Zone1: - (Nobody is speaking in Zone1)
Zone2: 帮我找一下附近的停车场 (Help me find a nearby parking lot.)
Zone3: 所有车窗打开百分之七十 (Open all the car windows by 70 percent.)
Zone4: 打开把那个车子主驾按摩打开 (Open it. Turn on the driver's seat massage of that car.)

In the following table, red text marks incorrectly recognized characters, black text marks correctly recognized characters, "*" marks a missing character, and "-" indicates that no speech was recognized in the audio.
Models | Zone 1 | Zone 2 | Zone 3 | Zone 4
Text Label | - | 帮我找一下附近的停车场 | 所有车窗打开百分之七十 | 打开把那个车子主驾按摩打开
Unprocessed Mixture | 可以帮我找一下 | 帮我找一下自己 | 所有车窗打开百分之七十 | 我才所有窗打开百分按摩时*
FasNet-TAC | - | 我找跳出本图卡啡* | *********** | *************
DualSep-L | 一步 | 帮我找一下父亲们组织找 | 博演车窗打开百分之七十 | 不告了阿伯找人******
CabinSep-S | - | 帮我找一下附近的停车场 | 所有车窗打开百分之七十 | 来刚来把那个的总驾按摩打开
CabinSep-L | - | 帮我找一下附近的停车场 | 所有车窗打开百分之七十 | 咱快**村总驾按摩打开

Audio Sample 3: Motion

Speakers are talking in Zone 2 and Zone 4.

Separated Audios
Models | Zone 1 | Zone 2 | Zone 3 | Zone 4
Unprocessed Mixture | Sample 1 Image | Sample 2 Image | Sample 3 Image | Sample 4 Image
FasNet-TAC | Sample 1 Image | Sample 2 Image | Sample 3 Image | Sample 4 Image
DualSep-L | Sample 1 Image | Sample 2 Image | Sample 3 Image | Sample 4 Image
CabinSep-S | Sample 1 Image | Sample 2 Image | Sample 3 Image | Sample 4 Image
CabinSep-L | Sample 1 Image | Sample 2 Image | Sample 3 Image | Sample 4 Image
Speech Recognition Results of Audio Sample 3
Text Label:
Zone1: - (Nobody is speaking in Zone1)
Zone2: 到达目的地后还有多少电量 (How much battery power is left after reaching the destination?)
Zone3: - (Nobody is speaking in Zone3)
Zone4: 呃播放一首小白兔白又白 (Umm, play the song "The Little White Rabbit is White and White".)

In the following table, red text marks incorrectly recognized characters, black text marks correctly recognized characters, "*" marks a missing character, and "-" indicates that no speech was recognized in the audio.
Models | Zone 1 | Zone 2 | Zone 3 | Zone 4
Text Label | - | 到达目的地后还有多少电量 | - | 呃播放一首小白兔白又白
Unprocessed Mixture | 到打不过放一思全白还有一回多少电量 | 到达莫高方一小白才有多少电 | 报道播放一则小白兔白月白好电话 | 播放一首小白兔白
FasNet-TAC | - | 到达目地后还有多少架当 | - | 播放一小白兔子蓝就
DualSep-L | - | 到达目地后有多少电 |  | *播放一首小白兔白
CabinSep-S | - | 到达目的地后还有多少电 | - | *播放一首小白兔白
CabinSep-L | - | 到达目的地后有多少电量 | - | *播放一首小白兔白