Model Unit Exploration for Sequence-to-Sequence Speech Recognition

02/05/2019
by Kazuki Irie, et al.

We evaluate attention-based encoder-decoder models along two dimensions: choice of target unit (phoneme, grapheme, and word-piece), and the amount of available training data. We conduct experiments on the LibriSpeech 100hr, 460hr, and 960hr tasks; across all tasks, we find that grapheme or word-piece models consistently outperform phoneme-based models, even though they are evaluated without a lexicon or an external language model. On the 960hr task, the word-piece model achieves a word error rate (WER) of 4.7% on the test-clean set and 13.4% on the test-other set when decoded with an LSTM LM: the lowest reported numbers using sequence-to-sequence models. We also conduct a detailed analysis of the various models and investigate their complementarity: we find that we can improve WERs by up to 9% relative by rescoring N-best lists generated from the word-piece model with either the phoneme or the grapheme model. Rescoring an N-best list generated by the phonemic system, however, provides limited improvements. Further analysis shows that the word-piece-based models produce more diverse N-best hypotheses, and thus lower oracle WERs, than the phonemic system.
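
The complementarity analysis rests on two simple computations: rescoring a first-pass N-best list with a second model, and measuring the oracle WER of that list. The Python sketch below illustrates both under assumed interfaces; it is not the authors' implementation, and the (hypothesis, log-probability) list format, the `second_model_score` callable, and the interpolation weight `lam` are hypothetical choices made for illustration.

```python
# Minimal sketch (not the authors' code) of the two analyses described above:
# (1) rescoring a first-pass N-best list with a complementary model, and
# (2) computing the oracle WER of an N-best list. The score interface and
# the interpolation weight `lam` are assumptions for illustration.

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # delete r
                           cur[j - 1] + 1,           # insert h
                           prev[j - 1] + (r != h)))  # substitute / match
        prev = cur
    return prev[-1]

def rescore(nbest, second_model_score, lam=0.3):
    """Return the hypothesis maximizing a log-linear combination of the
    first-pass score and a second model's score.

    nbest: list of (hypothesis_words, first_pass_logprob) pairs,
           e.g. from the word-piece model's beam search.
    second_model_score: callable mapping a hypothesis to a log-probability
                        under the phoneme or grapheme model (hypothetical).
    """
    return max(nbest, key=lambda p: p[1] + lam * second_model_score(p[0]))[0]

def oracle_wer(nbest_lists, references):
    """WER achieved if an oracle always picked the N-best entry closest
    to the reference; lower values indicate a more useful N-best list."""
    errors, words = 0, 0
    for nbest, ref in zip(nbest_lists, references):
        errors += min(edit_distance(ref, hyp) for hyp, _ in nbest)
        words += len(ref)
    return errors / words
```

On this view, the paper's two findings fit together: rescoring only helps when the better hypothesis is already in the list, so the word-piece system's more diverse N-best lists (lower oracle WER) leave room for a phoneme or grapheme model to promote a better candidate, while the phonemic system's less diverse lists do not.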
