LiMuSE: Lightweight Multi-modal Speaker Extraction

11/07/2021
by   Qinghua Liu, et al.

The past several years have witnessed significant progress in modeling the Cocktail Party Problem in terms of speech separation and speaker extraction. In recent years, multi-modal cues, including spatial information, facial expression, and voiceprint, have been introduced to the speaker extraction task, serving as complementary information to one another to achieve better performance. However, the front-end model for speaker extraction becomes large and hard to deploy on resource-constrained devices. In this paper, we address this problem with novel model architectures and model compression techniques, and propose a lightweight multi-modal framework for speaker extraction (dubbed LiMuSE). LiMuSE adopts group communication (GC) to split multi-modal high-dimensional features into groups of low-dimensional features with smaller width, which can be processed in parallel, and further uses an ultra-low bit quantization strategy to reduce model size. Experiments on the GRID dataset show that incorporating GC into the multi-modal framework achieves on-par or better performance with 24.86 times fewer parameters, and that applying the quantization strategy to the GC-equipped model yields a further compression ratio of about 9 times while maintaining performance comparable to the baselines. Our code will be available at https://github.com/aispeech-lab/LiMuSE.
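To make the parameter saving from group splitting concrete, here is a minimal sketch in plain Python. It is not the LiMuSE implementation. The sizes `N` and `K` and the helpers `split_into_groups` and `dense_params` are hypothetical names chosen for illustration. The sketch assumes that a single small layer is reused by every group, and it omits the inter-group communication step that GC also relies on.

```python
def split_into_groups(features, num_groups):
    """Split a flat feature vector into equal-width groups."""
    if len(features) % num_groups != 0:
        raise ValueError("feature width must be divisible by num_groups")
    width = len(features) // num_groups
    return [features[i * width:(i + 1) * width] for i in range(num_groups)]


def dense_params(in_dim, out_dim):
    """Count weights plus biases of one fully connected layer."""
    return in_dim * out_dim + out_dim


# Hypothetical sizes for illustration only; not taken from the paper.
N, K = 128, 16
features = [float(i) for i in range(N)]
groups = split_into_groups(features, K)

# One layer over the full feature width.
full = dense_params(N, N)
# One small layer, assumed here to be shared by all K groups.
shared = dense_params(N // K, N // K)

print(len(groups), len(groups[0]))
print(full, shared)
```

With these toy sizes the shared small layer needs far fewer parameters than the full-width layer. The ratio here does not correspond to the 24.86 times reduction reported above, which comes from the full model.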

research
06/13/2021

WASE: Learning When to Attend for Speaker Extraction in Cocktail Party Environments

In the speaker extraction problem, it is found that additional informati...
research
10/15/2020

Muse: Multi-modal target speaker extraction with visual cues

Speaker extraction algorithm relies on the speech sample from the target...
research
12/14/2020

Group Communication with Context Codec for Ultra-Lightweight Source Separation

Ultra-lightweight model design is an important topic for the deployment ...
research
11/17/2020

Ultra-Lightweight Speech Separation via Group Communication

Model size and complexity remain the biggest challenges in the deploymen...
research
10/28/2022

Speaker recognition with two-step multi-modal deep cleansing

Neural network-based speaker recognition has achieved significant improv...
research
10/31/2022

Model Compression for DNN-Based Text-Independent Speaker Verification Using Weight Quantization

DNN-based models achieve high performance in the speaker verification (S...
