State-of-the-art Speech Recognition With Sequence-to-Sequence Models

12/05/2017
by Chung-Cheng Chiu, et al.

Attention-based encoder-decoder architectures, such as Listen, Attend, and Spell (LAS), subsume the acoustic, pronunciation, and language model components of a traditional automatic speech recognition (ASR) system into a single neural network. In our previous work, we have shown that such architectures are comparable to state-of-the-art ASR systems on dictation tasks, but it was not clear if they would be practical for more challenging tasks such as voice search. In this work, we explore a variety of structural and optimization improvements to our LAS model which significantly improve performance. On the structural side, we show that word piece models can be used instead of graphemes. We also introduce a novel multi-head attention architecture, which offers improvements over the commonly used single-head attention. On the optimization side, we explore techniques such as synchronous training, scheduled sampling, label smoothing, and minimum word error rate (MWER) optimization, all of which are shown to improve accuracy. We present results with a unidirectional LSTM encoder for streaming recognition. On a 12,500-hour voice search task, we find that the proposed changes improve the WER of the LAS system from 9.2% to 5.6%, while the best conventional system achieves 6.7%.
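
The multi-head attention mentioned in the abstract splits the decoder's attention over the encoder states into several independent heads, each attending to its own projection of the encoder output, with the per-head contexts concatenated and projected back. Below is a minimal numpy sketch of one such attention step; the parameter names, shapes, and random toy weights are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(query, enc_states, num_heads, params):
    """Attend over encoder states with several independent heads.

    query:      (d_model,)   current decoder state
    enc_states: (T, d_model) encoder outputs ("listener" features)
    params:     per-head projections W_q, W_k, W_v of shape
                (num_heads, d_model, d_head), plus an output
                projection W_o of shape (num_heads * d_head, d_model).
    """
    d_model = enc_states.shape[-1]
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        q = query @ params["W_q"][h]          # (d_head,)
        k = enc_states @ params["W_k"][h]     # (T, d_head)
        v = enc_states @ params["W_v"][h]     # (T, d_head)
        scores = (k @ q) / np.sqrt(d_head)    # (T,) scaled dot-product
        weights = softmax(scores)             # attention distribution over time
        head_outputs.append(weights @ v)      # (d_head,) context for this head
    context = np.concatenate(head_outputs)    # (num_heads * d_head,)
    return context @ params["W_o"]            # (d_model,) combined context

# Toy usage with random parameters (purely illustrative).
rng = np.random.default_rng(0)
T, d_model, H = 50, 256, 4
params = {
    "W_q": rng.standard_normal((H, d_model, d_model // H)) * 0.1,
    "W_k": rng.standard_normal((H, d_model, d_model // H)) * 0.1,
    "W_v": rng.standard_normal((H, d_model, d_model // H)) * 0.1,
    "W_o": rng.standard_normal((d_model, d_model)) * 0.1,
}
ctx = multi_head_attention(rng.standard_normal(d_model),
                           rng.standard_normal((T, d_model)), H, params)
print(ctx.shape)  # (256,)
```

The intuition is that each head is free to focus on a different region or aspect of the utterance, which is where the reported gains over single-head attention come from.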
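MWER optimization trains the model to reduce the expected number of word errors rather than the cross-entropy of the reference transcript; in practice the expectation is approximated over an n-best list of decoded hypotheses. The sketch below illustrates that n-best approximation with a mean-error baseline for variance reduction; it is a toy illustration under those assumptions, not the paper's training code.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def mwer_loss(nbest_log_probs, nbest_word_errors):
    """Expected word errors over an n-best list.

    nbest_log_probs:   (N,) total model log-probability of each hypothesis
    nbest_word_errors: (N,) word errors of each hypothesis vs. the reference
    """
    p = softmax(nbest_log_probs)  # renormalize probabilities over the n-best list
    w = nbest_word_errors - nbest_word_errors.mean()  # baseline-subtracted errors
    return float((p * w).sum())

# Toy usage: hypothesis 0 has the fewest errors, so shifting probability
# mass toward it lowers the loss.
print(mwer_loss(np.array([-3.0, -4.0, -6.0]), np.array([1.0, 3.0, 5.0])))
```

In actual training this quantity would be differentiated with respect to the model's hypothesis log-probabilities and typically interpolated with the cross-entropy loss for stability.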

Related research:

- Exploring Architectures, Data and Units For Streaming End-to-End Speech Recognition with RNN-Transducer (01/02/2018): We investigate training end-to-end speech recognition models with the re...
- Model Unit Exploration for Sequence-to-Sequence Speech Recognition (02/05/2019): We evaluate attention-based encoder-decoder models along two dimensions:...
- Multi-head Monotonic Chunkwise Attention For Online Speech Recognition (05/01/2020): The attention mechanism of the Listen, Attend and Spell (LAS) model requ...
- Efficient Sequence Training of Attention Models using Approximative Recombination (10/18/2021): Sequence discriminative training is a great tool to improve the performa...
- Improving the Performance of Online Neural Transducer Models (12/05/2017): Having a sequence-to-sequence model which can operate in an online fashi...
- RWTH ASR Systems for LibriSpeech: Hybrid vs Attention - w/o Data Augmentation (05/08/2019): We present state-of-the-art automatic speech recognition (ASR) systems e...
