Direct Acoustics-to-Word Models for English Conversational Speech Recognition

03/22/2017
by Kartik Audhkhasi, et al.

Recent work on end-to-end automatic speech recognition (ASR) has shown that the connectionist temporal classification (CTC) loss can be used to convert acoustics to phone or character sequences. Such systems are used with a dictionary and a separately trained language model (LM) to produce word sequences. However, they are not truly end-to-end in the sense of mapping acoustics directly to words without an intermediate phone representation. In this paper, we present the first results employing direct acoustics-to-word CTC models on two well-known public benchmark tasks: Switchboard and CallHome. These models do not require an LM or even a decoder at run-time and hence recognize speech with minimal complexity. However, due to the large number of word output units, CTC word models require orders of magnitude more data to train reliably compared to traditional systems. We present some techniques to mitigate this issue. Our CTC word model achieves a word error rate of 13.0% without any LM or decoder, compared with 9.6% [...]. We also present rescoring results on CTC word model lattices to quantify the performance benefits of an LM, and contrast the performance of word and phone CTC models.
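To illustrate why a word-level CTC model can run without a dictionary, LM, or decoder, the sketch below applies the standard CTC best-path rule to word units: take the highest-scoring unit in each frame, merge consecutive repeats, and drop the blank symbol. This is a minimal sketch and not the authors' implementation; the function name `ctc_greedy_decode`, the toy vocabulary, the blank index, and the per-frame scores are all assumptions made for illustration.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical convention)


def ctc_greedy_decode(frame_scores, vocab, blank=BLANK):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    # Highest-scoring output unit for each acoustic frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # Consecutive duplicates represent one emission spread over several frames.
    collapsed = [idx for idx, _ in groupby(best_path)]
    # Blanks only separate emissions, so they are removed at the end.
    return [vocab[idx] for idx in collapsed if idx != blank]


# Toy example: three output units and five frames of made-up posteriors.
vocab = ["<blank>", "hello", "world"]
frame_scores = [
    [0.10, 0.80, 0.10],  # hello
    [0.20, 0.70, 0.10],  # hello (repeat, merged)
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.80],  # world
    [0.60, 0.10, 0.30],  # blank
]

words = ctc_greedy_decode(frame_scores, vocab)
print(" ".join(words))  # prints "hello world"
```

Because the output units are whole words, the collapsed best path is already the transcript. A phone or character CTC model would need a dictionary and an LM to turn the same kind of path into words.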


