Semi-Supervised Model Training for Unbounded Conversational Speech Recognition

05/26/2017
by Shane Walker, et al.

For conversational large-vocabulary continuous speech recognition (LVCSR) tasks, up to about two thousand hours of audio is commonly used to train state-of-the-art models. Collecting labeled conversational audio, however, is prohibitively expensive, laborious and error-prone. Furthermore, academic corpora like Fisher English (2004) or Switchboard (1992) are inadequate for training models with sufficient accuracy in the unbounded space of conversational speech. These corpora are also timeworn, owing to dated acoustic telephony features and the rapid evolution of colloquial vocabulary and idiomatic speech over the last decades. Utilizing the colossal scale of our unlabeled telephony dataset, we propose a technique to construct a modern, high-quality conversational speech training corpus on the order of hundreds of millions of utterances (or tens of thousands of hours) for both acoustic and language model training. We describe the data collection, selection and training, and evaluate the results of our updated speech recognition system on a test corpus of 7K manually transcribed utterances. We show relative word error rate (WER) reductions of 35 5 task.
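
For readers unfamiliar with the metric, the following is a minimal sketch of how word error rate and a relative WER reduction are conventionally computed. It is not the authors' scoring pipeline: the function names `word_error_rate` and `relative_wer_reduction` are hypothetical, tokenization is assumed to be plain whitespace splitting, and the numbers in the usage lines are illustrative only, not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumes whitespace tokenization with no text normalization; real scoring
    tools usually normalize case, punctuation, and so on first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the word-level edit distance between the reference
    # prefix processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new system."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative values only (not taken from the paper).
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
reduction = relative_wer_reduction(0.20, 0.15)
```

Note that a relative reduction is measured against the baseline's own error rate, so moving from a WER of 0.20 to 0.15 is a 25% relative reduction even though the absolute difference is only five points.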


research
11/05/2018

The Marchex 2018 English Conversational Telephone Speech Recognition System

In this paper, we describe recent improvements to the production Marchex...
research
11/17/2021

The People's Speech: A Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage

The People's Speech is a free-to-download 30,000-hour and growing superv...
research
06/13/2021

GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio

This paper introduces GigaSpeech, an evolving, multi-domain English spee...
research
07/13/2017

Automatic Speech Recognition with Very Large Conversational Finnish and Estonian Vocabularies

Today, the vocabulary size for language models in large vocabulary speec...
research
08/04/2023

Adapting the NICT-JLE Corpus for Disfluency Detection Models

The detection of disfluencies such as hesitations, repetitions and false...
research
11/20/2017

Speech recognition for medical conversations

In this paper we document our experiences with developing speech recogni...
research
06/15/2016

Automatic Pronunciation Generation by Utilizing a Semi-supervised Deep Neural Networks

Phonemic or phonetic sub-word units are the most commonly used atomic el...
