Perceptual Loss with Recognition Model for Single-Channel Enhancement and Robust ASR

12/11/2021
by Peter Plantinga, et al.

Single-channel speech enhancement approaches do not always improve automatic recognition rates in the presence of noise, because they can introduce distortions that are unhelpful for recognition. Following the trend towards end-to-end training of sequential neural network models, several research groups have addressed this problem with joint training of a front-end enhancement module and a back-end recognition module. While this approach ensures that enhancement outputs are helpful for recognition, the enhancement model can overfit to the training data, weakening the recognition model in the presence of unseen noise. To address this, we use a pre-trained acoustic model to generate a perceptual loss that makes speech enhancement more aware of the phonetic properties of the signal. This approach retains some of the benefits of joint training while alleviating the overfitting problem. Experiments on the Voicebank + DEMAND dataset for enhancement show that this approach achieves a new state of the art for some objective enhancement scores. In combination with distortion-independent training, our approach achieves a word error rate (WER) of 2.80% on the test set, a relative improvement of more than 20% over joint training and 14% over distortion-independent mask training.
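As a rough illustration of the idea, the sketch below computes a perceptual loss by passing both the enhanced and the clean signals through a frozen, pre-trained acoustic model and penalizing the distance between their representations, mixed with a conventional spectral term. This is a minimal sketch under assumptions: the PyTorch setting, the MSE distance, and the names acoustic_model, clean_feats, enhanced_feats, and alpha are illustrative choices, not the paper's exact implementation.

import torch
import torch.nn.functional as F

def perceptual_loss(acoustic_model, clean_feats, enhanced_feats, alpha=0.5):
    # Freeze the recognizer's parameters; it only supplies phonetically
    # informed targets and is never updated during enhancement training.
    for p in acoustic_model.parameters():
        p.requires_grad = False

    # Clean-speech representations serve as fixed targets, so no graph is needed.
    with torch.no_grad():
        clean_repr = acoustic_model(clean_feats)

    # Gradients flow through the frozen recognizer back to the enhancement model.
    enhanced_repr = acoustic_model(enhanced_feats)

    # Phonetic (perceptual) term plus a conventional spectral term.
    phonetic_term = F.mse_loss(enhanced_repr, clean_repr)
    spectral_term = F.mse_loss(enhanced_feats, clean_feats)
    return alpha * phonetic_term + (1 - alpha) * spectral_term

Keeping the acoustic model frozen is what separates this from joint training: the enhancer is still steered by phonetic information, but the recognizer cannot co-adapt to enhancement artifacts and thereby overfit to the training noise.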


Related research:

03/11/2019 · Bridging the Gap Between Monaural Speech Enhancement and Recognition with Distortion-Independent Acoustic Modeling
Monaural speech enhancement has made dramatic advances since the introdu...

05/29/2023 · Speech and Noise Dual-Stream Spectrogram Refine Network with Speech Distortion Loss for Robust Speech Recognition
In recent years, the joint training of speech enhancement front-end and ...

11/09/2020 · Gated Recurrent Fusion with Joint Training Framework for Robust End-to-End Speech Recognition
The joint training framework for speech enhancement and recognition meth...

07/06/2023 · Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition
Accurate recognition of cocktail party speech containing overlapping spe...

12/07/2020 · Towards end-to-end speech enhancement with a variational U-Net architecture
In this paper, we investigate the viability of a variational U-Net archi...

09/26/2019 · An Investigation into the Effectiveness of Enhancement in ASR Training and Test for CHiME-5 Dinner Party Transcription
Despite the strong modeling power of neural network acoustic models, spe...

09/21/2023 · A Multiscale Autoencoder (MSAE) Framework for End-to-End Neural Network Speech Enhancement
Neural network approaches to single-channel speech enhancement have rece...
