Advancing Acoustic-to-Word CTC Model with Attention and Mixed-Units

12/31/2018
by Amit Das, et al.

The acoustic-to-word model based on the Connectionist Temporal Classification (CTC) criterion is a natural end-to-end (E2E) system that directly targets words as output units. Two issues exist in this system. First, the current output of the CTC model relies only on the current input and does not account for context-weighted inputs; this is the hard alignment issue. Second, the word-based CTC model suffers from the out-of-vocabulary (OOV) issue: it can model only frequently occurring words and tags the remaining words as OOV, so it is limited to recognizing a fixed set of frequent words. In this study, we propose addressing these problems using a combination of attention mechanisms and mixed units. In particular, we introduce Attention CTC, Self-Attention CTC, Hybrid CTC, and Mixed-unit CTC. First, we blend attention modeling capabilities directly into the CTC network with Attention CTC and Self-Attention CTC. Second, to alleviate the OOV issue, we present Hybrid CTC, which uses a word CTC and a letter CTC with shared hidden layers; the Hybrid CTC consults the letter CTC whenever the word CTC emits an OOV token. We then propose a better solution, Mixed-unit CTC, which decomposes all OOV words into sequences of frequent words and multi-letter units. Evaluated on a 3400-hour Microsoft Cortana voice assistant task, our final acoustic-to-word solution using attention and mixed units achieves a 12.09% relative reduction in word error rate (WER) over the vanilla word CTC. Without using any language model (LM) or complex decoder, it also outperforms a traditional context-dependent (CD) phoneme CTC with a strong LM and decoder by 6.79% relative.
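To make the mixed-unit idea concrete, the sketch below shows one way an OOV word could be decomposed into frequent words plus multi-letter units. The greedy longest-match segmentation, the `decompose_oov` helper, and the toy unit inventory are illustrative assumptions, not the exact decomposition procedure used in the paper.

```python
# Minimal sketch (hypothetical): split an OOV word into frequent-word and
# multi-letter units via greedy longest-match over the word's characters.

def decompose_oov(word, frequent_words, max_letter_unit=3):
    """Return a list of units covering `word`: frequent words where possible,
    otherwise multi-letter pieces of at most `max_letter_unit` characters."""
    units, i = [], 0
    while i < len(word):
        # Prefer the longest remaining prefix that is itself a frequent word.
        for j in range(len(word), i, -1):
            if word[i:j] in frequent_words:
                units.append(word[i:j])
                i = j
                break
        else:
            # Fall back to a multi-letter unit (up to max_letter_unit letters).
            j = min(i + max_letter_unit, len(word))
            units.append(word[i:j])
            i = j
    return units

if __name__ == "__main__":
    frequent = {"micro", "soft"}                    # toy frequent-word list
    print(decompose_oov("microsoftish", frequent))  # ['micro', 'soft', 'ish']
```

In the paper's setting, decomposed sequences like this replace OOV targets during training, so the CTC model can emit mixed units instead of a single OOV symbol at test time.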
