Gated Recurrent Fusion with Joint Training Framework for Robust End-to-End Speech Recognition

11/09/2020
by Cunhang Fan, et al.

Joint training frameworks for speech enhancement and recognition have achieved good performance for robust end-to-end automatic speech recognition (ASR). However, these methods use only the enhanced features as the input to the speech recognition component, so they suffer from the speech distortion problem. To address this problem, this paper proposes a gated recurrent fusion (GRF) method with a joint training framework for robust end-to-end ASR. The GRF algorithm dynamically combines the noisy and enhanced features, so it can not only remove noise from the enhanced features but also recover the raw fine structure from the noisy features, thereby alleviating speech distortion. The proposed method consists of three components: speech enhancement, GRF, and speech recognition. First, a mask-based speech enhancement network enhances the input speech. Second, the GRF addresses the speech distortion problem. Third, to improve recognition performance, the state-of-the-art speech transformer is used as the speech recognition component. Finally, a joint training framework optimizes all three components simultaneously. Experiments are conducted on AISHELL-1, an open-source Mandarin speech corpus. The results show that the proposed method achieves a relative character error rate (CER) reduction of 10.04% over a conventional joint enhancement and transformer system that uses only the enhanced features. At low signal-to-noise ratio (0 dB), the proposed method achieves a 12.67% relative CER reduction, which suggests the potential of the approach.
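To illustrate the fusion idea described above, the following minimal PyTorch sketch shows one way a learned gate could dynamically mix noisy and enhanced feature frames. The module name GatedFusion, its layer layout, and the feature dimensions are illustrative assumptions, not the authors' code; the paper's GRF additionally uses recurrent (GRU-style) gating and is trained jointly with the enhancement and recognition components.

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    # Minimal sketch (hypothetical layer names): a sigmoid gate mixes
    # noisy and enhanced feature frames element-wise per time step.
    def __init__(self, feat_dim: int):
        super().__init__()
        # The gate is predicted from the concatenation of both feature streams.
        self.gate = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim),
            nn.Sigmoid(),
        )

    def forward(self, noisy: torch.Tensor, enhanced: torch.Tensor) -> torch.Tensor:
        # noisy, enhanced: (batch, time, feat_dim)
        g = self.gate(torch.cat([noisy, enhanced], dim=-1))
        # g close to 1 keeps the enhanced frame; g close to 0 falls back to
        # the raw noisy frame, preserving fine structure lost to distortion.
        return g * enhanced + (1.0 - g) * noisy

# Toy usage: fuse 80-dimensional filterbank features for a 2-utterance batch.
noisy = torch.randn(2, 100, 80)
enhanced = torch.randn(2, 100, 80)
fused = GatedFusion(80)(noisy, enhanced)
print(fused.shape)  # torch.Size([2, 100, 80])

In the full method, the fused features would feed the speech transformer, and the enhancement network, fusion gate, and recognizer would be optimized jointly end to end.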


Related research

05/29/2023
Speech and Noise Dual-Stream Spectrogram Refine Network with Speech Distortion Loss for Robust Speech Recognition
In recent years, the joint training of speech enhancement front-end and ...

03/11/2019
Bridging the Gap Between Monaural Speech Enhancement and Recognition with Distortion-Independent Acoustic Modeling
Monaural speech enhancement has made dramatic advances since the introdu...

07/26/2019
Correlation Distance Skip Connection Denoising Autoencoder (CDSK-DAE) for Speech Feature Enhancement
Performance of learning based Automatic Speech Recognition (ASR) is susc...

12/11/2021
Perceptual Loss with Recognition Model for Single-Channel Enhancement and Robust ASR
Single-channel speech enhancement approaches do not always improve autom...

07/22/2021
Multitask-Based Joint Learning Approach To Robust ASR For Radio Communication Speech
To realize robust end-to-end Automatic Speech Recognition (E2E ASR) under...

09/21/2023
A Multiscale Autoencoder (MSAE) Framework for End-to-End Neural Network Speech Enhancement
Neural network approaches to single-channel speech enhancement have rece...

04/02/2022
Leveraging Phone Mask Training for Phonetic-Reduction-Robust E2E Uyghur Speech Recognition
In Uyghur speech, consonant and vowel reduction are often encountered, e...
