Streaming end-to-end speech recognition with jointly trained neural feature enhancement

05/04/2021
by   Chanwoo Kim, et al.
0

In this paper, we present a streaming end-to-end speech recognition model based on Monotonic Chunkwise Attention (MoCha) jointly trained with enhancement layers. Even though the MoCha attention enables streaming speech recognition with recognition accuracy comparable to a full attention-based approach, training this model is sensitive to various factors such as the difficulty of training examples, hyper-parameters, and so on. Because of these issues, speech recognition accuracy of a MoCha-based model for clean speech drops significantly when a multi-style training approach is applied. Inspired by Curriculum Learning [1], we introduce two training strategies: Gradual Application of Enhanced Features (GAEF) and Gradual Reduction of Enhanced Loss (GREL). With GAEF, the model is initially trained using clean features. Subsequently, the portion of outputs from the enhancement layers gradually increases. With GREL, the portion of the Mean Squared Error (MSE) loss for the enhanced output gradually reduces as training proceeds. In experimental results on the LibriSpeech corpus and noisy far-field test sets, the proposed model with GAEF-GREL training strategies shows significantly better results than the conventional multi-style training approach.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
01/24/2022

Endpoint Detection for Streaming End-to-End Multi-talker ASR

Streaming end-to-end multi-talker speech recognition aims at transcribin...
research
11/03/2020

Streaming Attention-Based Models with Augmented Memory for End-to-End Speech Recognition

Attention-based models have been gaining popularity recently for their s...
research
05/03/2022

On monoaural speech enhancement for automatic recognition of real noisy speech using mixture invariant training

In this paper, we explore an improved framework to train a monoaural neu...
research
11/02/2022

Fast-U2++: Fast and Accurate End-to-End Speech Recognition in Joint CTC/Attention Frames

Recently, the unified streaming and non-streaming two-pass (U2/U2++) end...
research
12/22/2019

end-to-end training of a large vocabulary end-to-end speech recognition system

In this paper, we present an end-to-end training framework for building ...
research
11/19/2021

A comparison of streaming models and data augmentation methods for robust speech recognition

In this paper, we present a comparative study on the robustness of two d...
research
05/29/2017

DNN-based uncertainty estimation for weighted DNN-HMM ASR

In this paper, the uncertainty is defined as the mean square error betwe...

Please sign up or login with your details

Forgot password? Click here to reset