Learning Shared Encoding Representation for End-to-End Speech Recognition Models

03/31/2019
by   Thai-Son Nguyen, et al.
0

In this work, we learn a shared encoding representation for a multi-task neural network model optimized with connectionist temporal classification (CTC) and conventional framewise cross-entropy training criteria. Our experiments show that the multi-task training not only tackles the complexity of optimizing CTC models such as acoustic-to-word but also results in significant improvement compared to the plain-task training with an optimal setup. Furthermore, we propose to use the encoding representation learned by the multi-task network to initialize the encoder of attention-based models. Thereby, we train a deep attention-based end-to-end model with 10 long short-term memory (LSTM) layers of encoder which produces 12.2% and 22.6% word-error-rate on Switchboard and CallHome subsets of the Hub5 2000 evaluation.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
10/17/2016

End-to-end attention-based distant speech recognition with Highway LSTM

End-to-end attention-based models have been shown to be competitive alte...
research
12/28/2019

Improved Multi-Stage Training of Online Attention-based Encoder-Decoder Models

In this paper, we propose a refined multi-stage multi-task training stra...
research
02/02/2019

Using multi-task learning to improve the performance of acoustic-to-word and conventional hybrid models

Acoustic-to-word (A2W) models that allow direct mapping from acoustic si...
research
02/07/2023

MMA-RNN: A Multi-level Multi-task Attention-based Recurrent Neural Network for Discrimination and Localization of Atrial Fibrillation

The automatic detection of atrial fibrillation based on electrocardiogra...
research
05/13/2019

Federated Multi-task Hierarchical Attention Model for Sensor Analytics

Sensors are an integral part of modern Internet of Things (IoT) applicat...
research
09/11/2021

College Student Retention Risk Analysis From Educational Database using Multi-Task Multi-Modal Neural Fusion

We develop a Multimodal Spatiotemporal Neural Fusion network for Multi-T...
research
01/13/2021

End-to-End Speaker Height and age estimation using Attention Mechanism with LSTM-RNN

Automatic height and age estimation of speakers using acoustic features ...

Please sign up or login with your details

Forgot password? Click here to reset