Joint Training of Speech Enhancement and Self-supervised Model for Noise-robust ASR

05/26/2022
by Qiu-Shi Zhu, et al.

Speech enhancement (SE) is usually required as a front end to improve speech quality in noisy environments, yet the enhanced speech might not be optimal for automatic speech recognition (ASR) systems due to speech distortion. On the other hand, self-supervised pre-training has been shown to exploit large amounts of unlabeled noisy data, which is beneficial for the noise robustness of ASR. However, how to optimally integrate SE and self-supervised pre-training remains unclear. To find an appropriate combination and to reduce the impact of the speech distortion caused by SE, in this paper we propose a joint pre-training approach for the SE module and the self-supervised model. First, in the pre-training phase, either the original noisy waveform or the waveform produced by SE is fed into the self-supervised model to learn a contextual representation, with the quantized clean speech acting as the target. Second, we propose a dual-attention fusion method to fuse the features of the noisy and enhanced speech, which can compensate for the information loss caused by using either module alone. Thanks to the flexible exploitation of the clean/noisy/enhanced branches, the proposed method generalizes several existing noise-robust ASR models, e.g., enhanced wav2vec2.0. Finally, experimental results on both synthetic and real noisy datasets show that the proposed joint training approach improves ASR performance under various noisy conditions, leading to stronger noise robustness.
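The abstract does not come with code; purely as an illustrative sketch, and not the authors' implementation, the following PyTorch snippet shows one way a dual-attention fusion block over noisy-branch and enhanced-branch features could look. All module names, dimensions, and the gating scheme are assumptions introduced here for illustration.

# Hypothetical sketch of a dual-attention fusion block for noisy/enhanced features.
# This is NOT the authors' implementation; names, dimensions, and wiring are assumed.
import torch
import torch.nn as nn


class DualAttentionFusion(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Cross-attention in both directions: noisy features attend to enhanced
        # features, and enhanced features attend to noisy features.
        self.noisy_to_enh = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.enh_to_noisy = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learned gate decides, per frame, how much of each branch to keep.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, noisy_feat: torch.Tensor, enh_feat: torch.Tensor) -> torch.Tensor:
        # noisy_feat, enh_feat: (batch, frames, dim) frame-level features from the
        # noisy branch and the speech-enhancement branch, respectively.
        n_attn, _ = self.noisy_to_enh(noisy_feat, enh_feat, enh_feat)
        e_attn, _ = self.enh_to_noisy(enh_feat, noisy_feat, noisy_feat)
        g = self.gate(torch.cat([n_attn, e_attn], dim=-1))
        # Convex combination of the two attended branches.
        return g * n_attn + (1.0 - g) * e_attn


if __name__ == "__main__":
    fusion = DualAttentionFusion(dim=512, num_heads=8)
    noisy = torch.randn(2, 100, 512)     # simulated noisy-branch features
    enhanced = torch.randn(2, 100, 512)  # simulated enhanced-branch features
    fused = fusion(noisy, enhanced)      # (2, 100, 512); would feed the encoder
    print(fused.shape)

In this sketch the fused representation would then be passed to a wav2vec2.0-style context encoder whose pre-training target is the quantized clean speech, as described in the abstract.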

Related research

01/22/2022 · A Noise-Robust Self-supervised Pre-training Model Based Speech Representation Learning for Automatic Speech Recognition
Wav2vec2.0 is a popular self-supervised pre-training framework for learn...

07/22/2021 · Multitask-Based Joint Learning Approach To Robust ASR For Radio Communication Speech
To realize robust end-to-end Automatic Speech Recognition (E2E ASR) under...

05/24/2023 · Downstream Task Agnostic Speech Enhancement with Self-Supervised Representation Loss
Self-supervised learning (SSL) is the latest breakthrough in speech proc...

02/28/2023 · deHuBERT: Disentangling Noise in a Self-supervised Model for Robust Speech Recognition
Existing self-supervised pre-trained speech models have offered an effec...

09/28/2022 · Speech Enhancement Using Self-Supervised Pre-Trained Model and Vector Quantization
With the development of deep learning, neural network-based speech enhan...

08/28/2023 · Rep2wav: Noise Robust Text-to-Speech Using Self-supervised Representations
Benefiting from the development of deep learning, text-to-speech (TTS) t...

05/26/2021 · Training Speech Enhancement Systems with Noisy Speech Datasets
Recently, deep neural network (DNN)-based speech enhancement (SE) system...
