Two-Stage Augmentation and Adaptive CTC Fusion for Improved Robustness of Multi-Stream End-to-End ASR

02/05/2021
by Ruizhi Li, et al.

Performance degradation of an Automatic Speech Recognition (ASR) system is commonly observed when the test acoustic condition differs from training. Hence, it is essential to make ASR systems robust against environmental distortions such as background noise and reverberation. In a multi-stream paradigm, improving robustness means handling both a variety of unseen single-stream conditions and inter-stream dynamics. Previously, a practical two-stage training strategy was proposed for multi-stream end-to-end ASR, where Stage-2 builds the multi-stream model on features from the Stage-1 Universal Feature Extractor (UFE). In this paper, as an extension, we introduce a two-stage augmentation scheme focusing on mismatch scenarios: Stage-1 Augmentation addresses single-stream input variation with data augmentation techniques, while Stage-2 Time Masking applies temporal masks to the UFE features of randomly selected streams to simulate diverse stream combinations. During inference, we also present adaptive Connectionist Temporal Classification (CTC) fusion guided by a hierarchical attention mechanism. Experiments were conducted on two datasets, DIRHA and AMI, as multi-stream scenarios. Compared with the previous training strategy, substantial improvements are reported, with relative word error rate reductions of 29.7-59.3%.
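The two mechanisms named in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the feature layout (one `(T, D)` array per stream), the zero-fill masking value, and the function names are all assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def stage2_time_mask(ufe_feats, mask_prob=0.5, max_mask_frames=40):
    """Sketch of Stage-2 Time Masking: for each randomly selected stream,
    zero out a contiguous span of frames in its UFE features to simulate
    a degraded or missing stream (hypothetical parameters).

    ufe_feats: list of (T, D) arrays, one per stream.
    """
    masked = []
    for feats in ufe_feats:
        feats = feats.copy()
        if rng.random() < mask_prob:  # this stream is selected for masking
            T = feats.shape[0]
            width = min(int(rng.integers(1, max_mask_frames + 1)), T)
            start = int(rng.integers(0, T - width + 1))
            feats[start:start + width, :] = 0.0  # mask the chosen span
        masked.append(feats)
    return masked

def adaptive_ctc_fusion(ctc_posteriors, stream_weights):
    """Sketch of adaptive CTC fusion: combine per-stream CTC posteriors
    with stream-level weights (e.g., taken from a hierarchical attention
    mechanism at inference time).

    ctc_posteriors: (S, T, V) array; stream_weights: (S,) summing to 1.
    """
    w = np.asarray(stream_weights, dtype=float).reshape(-1, 1, 1)
    return (w * np.asarray(ctc_posteriors)).sum(axis=0)  # (T, V)
```

In this sketch the fusion weights are supplied externally; in the paper's setting they would come from the hierarchical attention over streams, so that less reliable streams contribute less to the fused CTC output.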

