Semi-supervised transfer learning for language expansion of end-to-end speech recognition models to low-resource languages

11/19/2021
by   Jiyeon Kim, et al.

In this paper, we propose a three-stage training methodology to improve the speech recognition accuracy of low-resource languages. We explore and propose an effective combination of techniques: transfer learning, encoder freezing, data augmentation using Text-To-Speech (TTS), and Semi-Supervised Learning (SSL). To improve the accuracy of a low-resource Italian ASR system, we leverage a well-trained English model, an unlabeled text corpus, and an unlabeled audio corpus via transfer learning, TTS augmentation, and SSL, respectively. In the first stage, we use transfer learning from a well-trained English model. This primarily helps the model learn acoustic information from a resource-rich language, and achieves around a 24% relative Word Error Rate (WER) reduction over the baseline. In the second stage, we utilize unlabeled text data via TTS data augmentation to incorporate language information into the model; we also explore freezing the acoustic encoder at this stage. TTS data augmentation helps us further reduce the WER by around 21% relatively. Finally, we utilize unlabeled audio data via SSL to reduce the WER by another 4% relatively. Overall, our two-pass speech recognition system, with Monotonic Chunkwise Attention (MoChA) in the first pass and full attention in the second pass, achieves a WER reduction of around 42% relative to the baseline.
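The staged schedule described above can be sketched in a few lines. The following is a minimal, illustrative Python sketch, not the authors' implementation: all names (`Param`, `ASRModel`, `three_stage_train`) and the toy gradients are hypothetical, and it only demonstrates how transfer initialization, encoder freezing, and a final full-model SSL pass compose into one training pipeline.

```python
# Toy sketch of the paper's three-stage schedule (hypothetical names/values):
# stage 1 transfer-initializes the encoder from a resource-rich model,
# stage 2 freezes the encoder while (TTS-augmented) data adapts the decoder,
# stage 3 unfreezes everything for SSL-style fine-tuning on pseudo-labels.
from dataclasses import dataclass, field

@dataclass
class Param:
    value: float
    trainable: bool = True

    def step(self, grad: float, lr: float = 0.1) -> None:
        # Frozen parameters ignore their gradient entirely.
        if self.trainable:
            self.value -= lr * grad

@dataclass
class ASRModel:
    encoder: Param = field(default_factory=lambda: Param(0.0))
    decoder: Param = field(default_factory=lambda: Param(0.0))

def three_stage_train(model: ASRModel, english_encoder_value: float) -> ASRModel:
    # Stage 1: transfer learning -- initialize the acoustic encoder
    # from a well-trained English model instead of random weights.
    model.encoder.value = english_encoder_value

    # Stage 2: freeze the encoder; TTS-augmented text data adapts the
    # decoder (language information) without disturbing the acoustics.
    model.encoder.trainable = False
    for grad_enc, grad_dec in [(0.5, 0.4), (0.3, 0.2)]:
        model.encoder.step(grad_enc)   # no-op while frozen
        model.decoder.step(grad_dec)

    # Stage 3: unfreeze and fine-tune all parameters with SSL pseudo-labels.
    model.encoder.trainable = True
    for grad_enc, grad_dec in [(0.1, 0.1)]:
        model.encoder.step(grad_enc)
        model.decoder.step(grad_dec)
    return model

model = three_stage_train(ASRModel(), english_encoder_value=1.0)
```

Note how the encoder value survives stage 2 untouched: that is the point of freezing, since the decoder can absorb TTS-synthesized language data while the transferred acoustic representation stays intact.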


Related research

04/01/2022
Text-To-Speech Data Augmentation for Low Resource Speech Recognition
Nowadays, the main problem of deep learning techniques used in the devel...

04/16/2022
STRATA: Word Boundaries Phoneme Recognition From Continuous Urdu Speech using Transfer Learning, Attention, Data Augmentation
Phoneme recognition is a largely unsolved problem in NLP, especially for...

06/20/2023
Multi-pass Training and Cross-information Fusion for Low-resource End-to-end Accented Speech Recognition
Low-resource accented speech recognition is one of the important challen...

09/26/2019
DARTS: Dialectal Arabic Transcription System
We present the speech to text transcription system, called DARTS, for lo...

08/03/2020
Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech
In this work, we explore a multimodal semi-supervised learning approach ...

06/20/2022
An Empirical Analysis on the Vulnerabilities of End-to-End Speech Segregation Models
End-to-end learning models have demonstrated a remarkable capability in ...

03/04/2021
Transfer learning from High-Resource to Low-Resource Language Improves Speech Affect Recognition Classification Accuracy
Speech Affect Recognition is a problem of extracting emotional affects f...
