An Online Attention-based Model for Speech Recognition

11/13/2018
by   Ruchao Fan, et al.
0

Attention-based end-to-end (E2E) speech recognition models such as Listen, Attend, and Spell (LAS) can achieve better results than traditional automatic speech recognition (ASR) hybrid models on LVCSR tasks. LAS combines acoustic, pronunciation and language model components of a traditional ASR system into a single neural network. However, such architectures are hard to be used for streaming speech recognition for its bidirectional listener architecture and attention mechanism. In this work, we propose to use latency-controlled bidirectional long short-term memory (LC- BLSTM) listener to reduce the delay of forward computing of listener. On the attention side, we propose an adaptive monotonic chunk-wise attention (AMoChA) to make LAS online. We explore how each part performs when it is used alone and obtain comparable or better results than LAS baseline. By combining the above two methods, we successfully stream LAS baseline with only 3.5 on our Mandarin corpus. We believe that our methods can also have the same effect on other languages.

READ FULL TEXT

page 1

page 2

page 3

page 4

page 5

research
01/15/2020

Transformer-based Online CTC/attention End-to-End Speech Recognition Architecture

Recently, Transformer has gained success in automatic speech recognition...
research
09/02/2022

TB or not TB? Acoustic cough analysis for tuberculosis classification

In this work, we explore recurrent neural network architectures for tube...
research
08/12/2020

Online Automatic Speech Recognition with Listen, Attend and Spell Model

The Listen, Attend and Spell (LAS) model and other attention-based autom...
research
05/03/2021

Quantifying and Maximizing the Benefits of Back-End Noise Adaption on Attention-Based Speech Recognition Models

This work analyzes how attention-based Bidirectional Long Short-Term Mem...
research
03/31/2022

NeuFA: Neural Network Based End-to-End Forced Alignment with Bidirectional Attention Mechanism

Although deep learning and end-to-end models have been widely used and s...
research
04/30/2021

Deformable TDNN with adaptive receptive fields for speech recognition

Time Delay Neural Networks (TDNNs) are widely used in both DNN-HMM based...
research
09/23/2020

FluentNet: End-to-End Detection of Speech Disfluency with Deep Learning

Strong presentation skills are valuable and sought-after in workplace an...

Please sign up or login with your details

Forgot password? Click here to reset