Back-Translation-Style Data Augmentation for End-to-End ASR

07/28/2018
by Tomoki Hayashi, et al.

In this paper we propose a novel data augmentation method for attention-based end-to-end automatic speech recognition (E2E-ASR) that exploits a large amount of text not paired with speech signals. Inspired by the back-translation technique from machine translation, we build a neural text-to-encoder model that predicts, from a sequence of characters, the sequence of hidden states extracted by a pre-trained E2E-ASR encoder. Using hidden states rather than acoustic features as the target enables faster attention learning and reduces computational cost, thanks to the sub-sampling performed in the E2E-ASR encoder; moreover, unlike acoustic features, hidden states allow us to avoid modeling speaker dependencies. After training, the text-to-encoder model generates hidden states from a large amount of unpaired text, and the E2E-ASR decoder is then retrained using the generated hidden states as additional training data. Experimental evaluation on the LibriSpeech dataset demonstrates that the proposed method improves ASR performance and reduces the number of unknown words without requiring additional paired data.
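To make the two-stage idea concrete, below is a minimal PyTorch sketch of the workflow described in the abstract: first a text-to-encoder network is fit to hidden states produced by a frozen, pre-trained E2E-ASR encoder on paired data, then it generates pseudo hidden states from unpaired text for retraining the decoder. All class names, layer sizes, the L1 regression loss, and the one-hidden-state-per-character simplification are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of back-translation-style augmentation for E2E-ASR (assumed design).
import torch
import torch.nn as nn

VOCAB, EMB, HID = 50, 64, 128  # assumed vocabulary and layer sizes


class TextToEncoder(nn.Module):
    """Maps a character sequence to a sequence of encoder-like hidden states."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.LSTM(EMB, HID, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * HID, HID)

    def forward(self, chars):                     # chars: (B, T_char)
        h, _ = self.rnn(self.embed(chars))
        return self.proj(h)                       # (B, T_char, HID) pseudo states


def train_text_to_encoder(t2e, paired_batches, epochs=1, lr=1e-3):
    """Regress onto hidden states extracted by the frozen pre-trained encoder."""
    opt = torch.optim.Adam(t2e.parameters(), lr=lr)
    for _ in range(epochs):
        for chars, enc_states in paired_batches:  # enc_states: (B, T_char, HID)
            loss = nn.functional.l1_loss(t2e(chars), enc_states)
            opt.zero_grad()
            loss.backward()
            opt.step()


@torch.no_grad()
def synthesize_hidden_states(t2e, unpaired_text):
    """Generate pseudo encoder states for text that has no matching speech."""
    t2e.eval()
    return [t2e(chars) for chars in unpaired_text]


if __name__ == "__main__":
    # Toy tensors standing in for (text, encoder-state) pairs and unpaired text.
    paired = [(torch.randint(0, VOCAB, (2, 20)), torch.randn(2, 20, HID))]
    unpaired = [torch.randint(0, VOCAB, (2, 30))]

    t2e = TextToEncoder()
    train_text_to_encoder(t2e, paired)
    pseudo_states = synthesize_hidden_states(t2e, unpaired)
    # These pseudo states would be mixed with real encoder outputs when
    # retraining the attention decoder of the E2E-ASR model.
    print(pseudo_states[0].shape)  # torch.Size([2, 30, 128])
```

In this sketch the decoder retraining step is only indicated by the final comment; in practice the generated states are fed to the same attention decoder as real encoder outputs, paired with their source transcriptions as targets.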

Related research

12/19/2019 - Generating Synthetic Audio Data for Attention-Based Speech Recognition Systems
Recent advances in text-to-speech (TTS) led to the development of flexib...

02/14/2020 - Unsupervised Speaker Adaptation using Attention-based Speaker Memory for End-to-End ASR
We propose an unsupervised speaker adaptation method inspired by the neu...

10/16/2020 - Adaptive Feature Selection for End-to-End Speech Translation
Information in speech signals is not evenly distributed, making it an ad...

11/17/2022 - Back-Translation-Style Data Augmentation for Mandarin Chinese Polyphone Disambiguation
Conversion of Chinese Grapheme-to-Phoneme (G2P) plays an important role ...

05/23/2023 - Rethinking Speech Recognition with A Multimodal Perspective via Acoustic and Semantic Cooperative Decoding
Attention-based encoder-decoder (AED) models have shown impressive perfo...

03/27/2018 - Multi-Modal Data Augmentation for End-to-end ASR
We present a new end-to-end architecture for automatic speech recognitio...

10/28/2020 - Decoupling Pronunciation and Language for End-to-end Code-switching Automatic Speech Recognition
Despite the recent significant advances witnessed in end-to-end (E2E) AS...
