Fast Real-time Personalized Speech Enhancement: End-to-End Enhancement Network (E3Net) and Knowledge Distillation

04/02/2022
by Manthan Thakker, et al.

This paper investigates how to improve the runtime speed of personalized speech enhancement (PSE) networks while maintaining model quality. Our approach has two parts: architecture and knowledge distillation (KD). We propose an end-to-end enhancement (E3Net) model architecture, which is 3× faster than a baseline STFT-based model. We also use KD techniques to develop compressed student models without significantly degrading quality. In addition, we investigate training the student models on noisy data without reference clean signals, combining KD with multi-task learning (MTL) using an automatic speech recognition (ASR) loss. Our results show that E3Net provides better speech and transcription quality with a lower target speaker over-suppression (TSOS) rate than the baseline model. Furthermore, we show that the KD methods can yield student models that are 2-4× faster than the teacher and provide reasonable quality. Combining KD and MTL improves the ASR and TSOS metrics without degrading the speech quality.
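
The idea of training a student on noisy data without clean references, while also applying an ASR loss, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names `kd_loss` and `combined_loss`, the choice of mean absolute error as the distillation distance, the weights `kd_weight` and `asr_weight`, and the treatment of the ASR loss as a precomputed scalar. The teacher's enhanced output stands in for the missing clean reference.

```python
from typing import Sequence


def kd_loss(student_out: Sequence[float], teacher_out: Sequence[float]) -> float:
    """Mean absolute error between student and teacher enhanced signals.

    Assumption: the teacher output acts as a pseudo-target because no
    clean reference signal is available for the noisy input.
    """
    if len(student_out) != len(teacher_out):
        raise ValueError("student and teacher outputs must have the same length")
    if len(student_out) == 0:
        raise ValueError("outputs must not be empty")
    total = sum(abs(s - t) for s, t in zip(student_out, teacher_out))
    return total / len(student_out)


def combined_loss(
    student_out: Sequence[float],
    teacher_out: Sequence[float],
    asr_loss: float,
    kd_weight: float = 1.0,   # hypothetical weight, not from the paper
    asr_weight: float = 0.1,  # hypothetical weight, not from the paper
) -> float:
    """Weighted sum of the distillation term and a precomputed ASR loss."""
    return kd_weight * kd_loss(student_out, teacher_out) + asr_weight * asr_loss


if __name__ == "__main__":
    student = [0.0, 1.0]
    teacher = [0.0, 0.0]
    print(combined_loss(student, teacher, asr_loss=2.0, kd_weight=1.0, asr_weight=0.5))
```

In a real training loop the ASR term would come from a recognizer run on the student's output, and both terms would be computed on tensors with gradients; the scalar form here only shows how the two objectives might be weighted.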

research
05/24/2023

Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement through Knowledge Distillation

Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech ...
research
08/22/2022

Multi-View Attention Transfer for Efficient Speech Enhancement

Recent deep learning models have achieved high performance in speech enh...
research
02/11/2021

An Investigation of End-to-End Models for Robust Speech Recognition

End-to-end models for robust automatic speech recognition (ASR) have not...
research
11/05/2022

Breaking the trade-off in personalized speech enhancement with cross-task knowledge distillation

Personalized speech enhancement (PSE) models achieve promising results c...
research
05/08/2021

Test-Time Adaptation Toward Personalized Speech Enhancement: Zero-Shot Learning with Knowledge Distillation

In realistic speech enhancement settings for end-user devices, we often ...
research
11/04/2022

Real-Time Joint Personalized Speech Enhancement and Acoustic Echo Cancellation with E3Net

Personalized speech enhancement (PSE), a process of estimating a clean t...
research
09/15/2023

Two-Step Knowledge Distillation for Tiny Speech Enhancement

Tiny, causal models are crucial for embedded audio machine learning appl...
