Multi-pass Training and Cross-information Fusion for Low-resource End-to-end Accented Speech Recognition

06/20/2023
by   Xuefei Wang, et al.

Low-resource accented speech recognition is one of the important challenges faced by current ASR technology in practical applications. In this study, we propose a Conformer-based architecture, called Aformer, to leverage acoustic information from both large non-accented and limited accented training data. Specifically, a general encoder and an accent encoder are designed in the Aformer to extract complementary acoustic information. Moreover, we propose to train the Aformer in a multi-pass manner, and investigate three cross-information fusion methods to effectively combine the information from the general and accent encoders. All experiments are conducted on both accented English and Mandarin ASR tasks. Results show that our proposed methods outperform the strong Conformer baseline by a relative word/character error rate reduction of 10.2% on six in-domain and out-of-domain accented test sets.
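
For readers unfamiliar with the metric, a relative error rate reduction is usually computed as the drop in error rate divided by the baseline error rate. The snippet below is a minimal sketch of that arithmetic; the function name `relative_error_reduction` and the example error rates are hypothetical and chosen only for illustration, they are not results reported in the paper.

```python
def relative_error_reduction(baseline_error: float, proposed_error: float) -> float:
    """Return the relative error rate reduction, in percent.

    Both arguments are error rates (WER or CER) on the same scale,
    e.g. 10.0 for 10% word error rate.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - proposed_error) / baseline_error * 100.0


# Hypothetical error rates for illustration only (not taken from the paper).
baseline_wer = 10.0
proposed_wer = 8.98

reduction = relative_error_reduction(baseline_wer, proposed_wer)
print(f"relative reduction: {reduction:.1f}%")
```

With these illustrative inputs, an absolute drop of 1.02 points on a 10.0 baseline corresponds to a relative reduction of about 10.2%.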

research
11/19/2021

Semi-supervised transfer learning for language expansion of end-to-end speech recognition models to low-resource languages

In this paper, we propose a three-stage training methodology to improve ...
research
11/12/2018

Multi-encoder multi-resolution framework for end-to-end speech recognition

Attention-based methods and Connectionist Temporal Classification (CTC) ...
research
05/01/2021

AlloST: Low-resource Speech Translation without Source Transcription

The end-to-end architecture has made promising progress in speech transl...
research
06/13/2023

Large-scale Language Model Rescoring on Long-form Data

In this work, we study the impact of Large-scale Language Models (LLM) o...
research
03/23/2023

A Deliberation-based Joint Acoustic and Text Decoder

We propose a new two-pass E2E speech recognition model that improves ASR...
research
10/23/2019

A practical two-stage training strategy for multi-stream end-to-end speech recognition

The multi-stream paradigm of audio processing, in which several sources ...
research
12/13/2021

PM-MMUT: Boosted Phone-mask Data Augmentation using Multi-modeling Unit Training for Robust Uyghur E2E Speech Recognition

Consonant and vowel reduction are often encountered in Uyghur speech, wh...
