Listen, Attend and Spell

08/05/2015 ∙ by William Chan, et al.

We present Listen, Attend and Spell (LAS), a neural network that learns to transcribe speech utterances to characters. Unlike traditional DNN-HMM models, this model learns all the components of a speech recognizer jointly. Our system has two components: a listener and a speller. The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs. The speller is an attention-based recurrent network decoder that emits characters as outputs. The network produces character sequences without making any independence assumptions between the characters. This is the key improvement of LAS over previous end-to-end CTC models. On a subset of the Google voice search task, LAS achieves a word error rate (WER) of 14.1% without a dictionary or a language model, and 10.3% with language model rescoring over the top 32 beams. By comparison, the state-of-the-art CLDNN-HMM model achieves a WER of 8.0%.
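
To make the listener/speller split concrete, here is a minimal sketch, assuming PyTorch as the framework (the paper describes the model framework-agnostically). The names `Listener` and `Speller`, the layer sizes, the number of pyramid layers, and the dot-product attention are illustrative simplifications of the paper's configuration, not its exact recipe; the sketch only aims to show the pyramidal time reduction in the encoder and character-by-character attention decoding.

```python
# Illustrative LAS-style sketch; hyperparameters are assumptions, not the paper's.
import torch
import torch.nn as nn

class Listener(nn.Module):
    """Pyramidal BiLSTM encoder: each pyramid layer concatenates adjacent
    frame pairs, halving the time resolution before the next BiLSTM."""
    def __init__(self, input_dim=40, hidden_dim=256, num_pyramid_layers=3):
        super().__init__()
        self.base = nn.LSTM(input_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.pyramid = nn.ModuleList(
            nn.LSTM(4 * hidden_dim, hidden_dim, batch_first=True, bidirectional=True)
            for _ in range(num_pyramid_layers)
        )

    def forward(self, x):  # x: (batch, time, input_dim) filter bank features
        h, _ = self.base(x)
        for layer in self.pyramid:
            if h.size(1) % 2:          # drop a trailing frame if time is odd
                h = h[:, :-1, :]
            b, t, d = h.shape
            h = h.reshape(b, t // 2, 2 * d)  # merge adjacent frame pairs
            h, _ = layer(h)
        return h  # (batch, time / 2**num_pyramid_layers, 2*hidden_dim)

class Speller(nn.Module):
    """Attention-based decoder: emits one character per step, conditioned on
    all previously emitted characters (no CTC-style independence assumption)."""
    def __init__(self, vocab_size, enc_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.rnn = nn.LSTMCell(hidden_dim + enc_dim, hidden_dim)
        self.query = nn.Linear(hidden_dim, enc_dim)
        self.out = nn.Linear(hidden_dim + enc_dim, vocab_size)

    def forward(self, enc, targets):  # enc: (B, T, enc_dim); targets: (B, L)
        B = enc.size(0)
        s = enc.new_zeros(B, self.rnn.hidden_size)   # decoder state
        c = enc.new_zeros(B, self.rnn.hidden_size)   # LSTM cell state
        ctx = enc.new_zeros(B, enc.size(2))          # attention context
        logits = []
        for t in range(targets.size(1)):             # teacher forcing
            e = self.embed(targets[:, t])            # previous character
            s, c = self.rnn(torch.cat([e, ctx], dim=-1), (s, c))
            # Dot-product attention over encoder states (the paper scores
            # energies with learned transforms; a linear query stands in here).
            scores = torch.bmm(enc, self.query(s).unsqueeze(2)).squeeze(2)
            attn = torch.softmax(scores, dim=-1)
            ctx = torch.bmm(attn.unsqueeze(1), enc).squeeze(1)
            logits.append(self.out(torch.cat([s, ctx], dim=-1)))
        return torch.stack(logits, dim=1)  # (B, L, vocab_size)

# Example: 40-dim filter banks, 100 frames, a 30-character vocabulary.
# listener, speller = Listener(), Speller(vocab_size=30)
# enc = listener(torch.randn(2, 100, 40))                      # (2, 12, 512)
# logits = speller(enc, torch.zeros(2, 12, dtype=torch.long))  # (2, 12, 30)
```

With three pyramid layers, a T-frame utterance is reduced to roughly T/8 encoder states, which keeps attention over long utterances tractable; that time reduction is the point of the pyramidal listener.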



Code Repositories

LAS-SpeechRecognition

An implementation of the Listen, Attend and Spell (LAS) framework for speech recognition (see https://arxiv.org/pdf/1508.01211.pdf) with a DNN feature extractor.

