You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation

05/14/2020
by Aleksandr Laptev, et al.

Data augmentation is one of the most effective ways to make end-to-end automatic speech recognition (ASR) perform close to the conventional hybrid approach, especially when dealing with low-resource tasks. Using recent advances in speech synthesis (text-to-speech, or TTS), we build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model. We argue that, when the amount of training data is low, this approach can allow an end-to-end model to reach the quality of hybrid systems. For an artificial low-resource setup, we compare the proposed augmentation with a semi-supervised learning technique. We also investigate the influence of vocoder usage on final ASR performance by comparing the Griffin-Lim algorithm with our modified LPCNet. An external language model allows our approach to reach the quality of a comparable supervised setup and outperform a semi-supervised setup (both on test-clean). We establish a state-of-the-art result for end-to-end ASR trained on the LibriSpeech train-clean-100 set with WER 4.3 on test-clean and 13.5
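
The results above are reported as word error rate (WER). As a point of reference, the following is a minimal sketch of how WER is commonly computed: the word-level edit distance between a reference transcript and a recognizer hypothesis, divided by the number of reference words and expressed in percent. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the scoring tool used by the authors, and real scoring pipelines usually also normalize case and punctuation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent between two whitespace-tokenized transcripts.

    Illustrative sketch only: no text normalization is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)
```

For example, a hypothesis that substitutes one word of a four-word reference has one error over four reference words, which gives a WER of 25.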

research
12/19/2019

Generating Synthetic Audio Data for Attention-Based Speech Recognition Systems

Recent advances in text-to-speech (TTS) led to the development of flexib...
research
04/22/2020

Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription

While end-to-end ASR systems have proven competitive with the convention...
research
03/02/2020

Semi-supervised learning of glottal pulse positions in a neural analysis-synthesis framework

This article investigates recently emerging approaches that use dee...
research
12/23/2020

Speech Synthesis as Augmentation for Low-Resource ASR

Speech synthesis might hold the key to low-resource speech recognition. ...
research
03/12/2021

Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition

With the rapid development of speech assistants, adapting server-intende...
research
05/18/2020

An Effective End-to-End Modeling Approach for Mispronunciation Detection

Recently, end-to-end (E2E) automatic speech recognition (ASR) systems ha...
research
03/04/2021

A Neural Text-to-Speech Model Utilizing Broadcast Data Mixed with Background Music

Recently, it has become easier to obtain speech data from various media ...
