Listen, Attend and Spell

by William Chan, et al.

We present Listen, Attend and Spell (LAS), a neural network that learns to transcribe speech utterances to characters. Unlike traditional DNN-HMM models, this model learns all the components of a speech recognizer jointly. Our system has two components: a listener and a speller. The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs. The speller is an attention-based recurrent network decoder that emits characters as outputs. The network produces character sequences without making any independence assumptions between the characters. This is the key improvement of LAS over previous end-to-end CTC models. On a subset of the Google voice search task, LAS achieves a word error rate (WER) of 14.1% without a dictionary or a language model, and 10.3% with language model rescoring over the top 32 beams. By comparison, the state-of-the-art CLDNN-HMM model achieves a WER of 8.0%.
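The listener and speller described above can be illustrated with a minimal NumPy sketch. It shows only the two shape-level ideas: a pyramid step that shortens the time axis, and an attention step that summarizes the listener output for one decoding step. The recurrent layers are omitted, and the function names, the number of pyramid steps (three), the frame count, the 40 filter bank channels, and the dot-product scoring are all assumptions made for illustration, not details taken from the abstract.

```python
import numpy as np


def pyramid_reduce(frames: np.ndarray) -> np.ndarray:
    """Halve the time axis by concatenating each pair of consecutive frames.

    frames: (T, D) -> (T // 2, 2 * D). A trailing odd frame is dropped.
    """
    num_frames, dim = frames.shape
    even = num_frames - (num_frames % 2)
    return frames[:even].reshape(even // 2, 2 * dim)


def attention_context(decoder_state: np.ndarray, listener_features: np.ndarray):
    """Dot-product attention (an assumed scoring function).

    Returns the context vector and the attention weights over time.
    """
    scores = listener_features @ decoder_state   # one score per time step
    scores = scores - scores.max()               # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum()
    context = weights @ listener_features        # weighted sum of features
    return context, weights


rng = np.random.default_rng(0)
# Hypothetical input: 800 frames of 40-channel filter bank features.
filter_banks = rng.standard_normal((800, 40))

# Listener (recurrent layers omitted): three pyramid steps, 8x fewer time steps.
h = filter_banks
for _ in range(3):
    h = pyramid_reduce(h)

# Speller, one decoding step: attend over the reduced sequence.
decoder_state = rng.standard_normal(h.shape[1])
context, weights = attention_context(decoder_state, h)
print(h.shape, context.shape, weights.shape)
```

Shortening the sequence before attention means the speller has far fewer time steps to attend over for each emitted character, which is the practical motivation for a pyramidal encoder.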





Code Repositories


Listen, Attend and Spell (LAS) framework for speech recognition with a DNN feature extractor
