Comparing the Benefit of Synthetic Training Data for Various Automatic Speech Recognition Architectures

by Nick Rossenbach, et al.

Recent publications on automatic speech recognition (ASR) have a strong focus on attention encoder-decoder (AED) architectures, which work well for large datasets but tend to overfit when applied in low-resource scenarios. One solution to this issue is to generate synthetic data with a trained text-to-speech (TTS) system if additional text is available. This has been successfully applied in many publications with AED systems. We present a novel approach of silence correction in the data pre-processing for TTS systems, which increases the robustness when training on corpora targeted for ASR applications. In this work we not only show the successful application of synthetic data for AED systems, but also test the same method on a highly optimized state-of-the-art Hybrid ASR system and a competitive monophone-based system using connectionist temporal classification (CTC). We show that for the latter systems the addition of synthetic data has only a minor effect, but they still outperform the AED systems by a large margin on LibriSpeech-100h. We achieve a final word-error-rate of 3.3 clean/noisy test-sets, surpassing any previous state-of-the-art systems that do not include unlabeled audio data.



