A Complementary Joint Training Approach Using Unpaired Speech and Text for Low-Resource Automatic Speech Recognition

04/05/2022
by Ye-Qian Du, et al.

Unpaired data has been shown to be beneficial for low-resource automatic speech recognition (ASR), where it is usually exploited through hybrid models with multi-task training or through language-model-dependent pre-training. In this work, we leverage unpaired data to train a general sequence-to-sequence model. Unpaired speech and text are used in the form of data pairs by generating the corresponding missing parts prior to model training. Inspired by the complementarity of the speech-PseudoLabel pair and the SynthesizedAudio-text pair in both acoustic and linguistic features, we propose a complementary joint training (CJT) method that trains a model alternately on the two types of data pairs. Furthermore, label masking for pseudo-labels and gradient restriction for synthesized audio are proposed to further cope with the deviations from real data; the resulting method is termed CJT++. Experimental results show that, compared to speech-only training, the proposed basic CJT achieves large performance improvements on the clean/other test sets, and CJT++ re-training yields further gains. The proposed method also outperforms the wav2vec2.0 model with the same model size and beam size, particularly in extremely low-resource cases.
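As a rough illustration of the alternating schedule and the label masking mentioned in the abstract, the following Python sketch interleaves batches from the two pair types and randomly masks pseudo-label tokens. It is not the authors' implementation: the names `mask_labels`, `alternate_batches`, and `MASK`, the per-token masking probability, and the strict one-to-one alternation are all assumptions made for this example. Gradient restriction is left out because the abstract does not describe how it works.

```python
import random
from itertools import zip_longest

MASK = "<mask>"      # hypothetical mask token, not taken from the paper
_MISSING = object()  # sentinel for an exhausted batch stream


def mask_labels(tokens, mask_prob, rng):
    """Replace each pseudo-label token with MASK with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


def alternate_batches(speech_pseudo_batches, synth_text_batches):
    """Yield (pair_type, batch), switching between the two pair types each step.

    If one stream is longer, its remaining batches are yielded on their own.
    """
    paired = zip_longest(
        speech_pseudo_batches, synth_text_batches, fillvalue=_MISSING
    )
    for speech_batch, synth_batch in paired:
        if speech_batch is not _MISSING:
            yield ("speech-PseudoLabel", speech_batch)
        if synth_batch is not _MISSING:
            yield ("SynthesizedAudio-text", synth_batch)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy batches: each batch is just a list of label tokens.
    real_speech = [["the", "cat"], ["sat", "down"]]
    synth_audio = [["a", "dog"], ["ran", "off"]]
    for pair_type, labels in alternate_batches(real_speech, synth_audio):
        if pair_type == "speech-PseudoLabel":
            # Only pseudo-labels are masked; real text is left untouched.
            labels = mask_labels(labels, mask_prob=0.3, rng=rng)
        print(pair_type, labels)
```

In a real training loop, the `print` call would be replaced by a forward and backward pass of the sequence-to-sequence model on the selected batch.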


