Advancing Acoustic-to-Word CTC Model

03/15/2018
by   Jinyu Li, et al.

The acoustic-to-word model based on the connectionist temporal classification (CTC) criterion has been shown to be a natural end-to-end (E2E) model that directly targets words as output units. However, the word-based CTC model suffers from the out-of-vocabulary (OOV) issue: it can model only a limited number of words in the output layer and maps all remaining words to an OOV output node. Hence, such a word-based CTC model can recognize only the frequent words modeled by the network output nodes. Our first attempt to improve the acoustic-to-word model is a hybrid CTC model that consults a letter-based CTC when the word-based CTC model emits OOV tokens at test time. We then propose a much better solution: training a mixed-unit CTC model that decomposes all OOV words into sequences of frequent words and multi-letter units. Evaluated on a 3,400-hour Microsoft Cortana voice assistant task, the final acoustic-to-word solution improves the baseline word-based CTC by a relative 12.09% word error rate (WER) reduction when combined with our proposed attention CTC. Such an E2E model, which uses no language model (LM) or complex decoder, outperforms the traditional context-dependent phoneme CTC with a strong LM and decoder by a relative 6.79%.
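To make the mixed-unit idea concrete, the following is a minimal sketch of how an OOV word might be rewritten as frequent words plus multi-letter units when preparing training targets. The abstract does not specify the decomposition procedure, so the greedy longest-match strategy, the function names `decompose_oov` and `to_mixed_units`, and the `max_letters` parameter are all assumptions made for illustration only.

```python
from typing import List, Set


def decompose_oov(word: str, frequent_words: Set[str], max_letters: int = 3) -> List[str]:
    """Greedily split an OOV word into frequent words and multi-letter units.

    Hypothetical helper: the greedy longest-match rule is an assumption,
    not the procedure described in the paper.
    """
    if max_letters < 1:
        raise ValueError("max_letters must be at least 1")
    units: List[str] = []
    i = 0
    while i < len(word):
        # Prefer the longest frequent word that starts at position i.
        match = ""
        for j in range(len(word), i, -1):
            if word[i:j] in frequent_words:
                match = word[i:j]
                break
        if not match:
            # Otherwise fall back to a chunk of at most max_letters letters.
            match = word[i:i + max_letters]
        units.append(match)
        i += len(match)
    return units


def to_mixed_units(transcript: str, vocabulary: Set[str],
                   frequent_words: Set[str], max_letters: int = 3) -> List[str]:
    """Keep in-vocabulary words as they are and decompose every other word."""
    targets: List[str] = []
    for word in transcript.split():
        if word in vocabulary:
            targets.append(word)
        else:
            targets.extend(decompose_oov(word, frequent_words, max_letters))
    return targets
```

With a toy vocabulary containing only "play" and frequent words "news" and "worth", the transcript "play newsworthy" would become the target sequence `play`, `news`, `worth`, `y`, so no word needs to be mapped to an OOV node.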


