A practical two-stage training strategy for multi-stream end-to-end speech recognition

10/23/2019
by   Ruizhi Li, et al.

The multi-stream paradigm of audio processing, in which several sources are considered simultaneously, has been an active research area for information fusion. Our previous study offered a promising direction within end-to-end automatic speech recognition, where parallel encoders capture diverse information and a stream-level fusion based on attention mechanisms combines the different views. However, because the number of encoders grows with the number of streams, the previous approach could require substantial memory and massive amounts of parallel data for joint training. In this work, we propose a practical two-stage training scheme. Stage-1 trains a Universal Feature Extractor (UFE), where encoder outputs are produced from a single-stream model trained with all data. Stage-2 formulates a multi-stream scheme that trains only the attention fusion module, using the UFE features and pretrained components from Stage-1. Experiments were conducted on two datasets, DIRHA and AMI, as multi-stream scenarios. Compared with our previous method, this strategy achieves relative word error rate reductions of 8.2–32.4%, outperforming several conventional combination methods.
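
To make the stream-level fusion step concrete, the following is a minimal sketch of attention over streams for a single decoding step. It is not the authors' implementation: the function names (`softmax`, `fuse_streams`), the use of plain lists, and the scaled dot-product scoring are all assumptions chosen for illustration. In a real system the scoring function would have trainable parameters, and under the two-stage scheme those would be the only parameters updated in Stage-2, while the per-stream vectors come from the frozen Stage-1 UFE.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(stream_vectors, query):
    """Combine per-stream vectors into one fused vector (hypothetical helper).

    stream_vectors: one vector per stream, all of the same length; in the
        two-stage scheme these would come from the frozen Stage-1 encoder.
    query: a decoder-state vector of the same length (an assumption here).

    Scoring uses a scaled dot product, which is an illustrative choice and
    not necessarily the scoring function used in the paper.
    """
    scale = math.sqrt(len(query))
    scores = [
        sum(q * h for q, h in zip(query, vec)) / scale
        for vec in stream_vectors
    ]
    weights = softmax(scores)  # one weight per stream, summing to 1
    dim = len(stream_vectors[0])
    fused = [
        sum(w * vec[d] for w, vec in zip(weights, stream_vectors))
        for d in range(dim)
    ]
    return fused, weights
```

Calling `fuse_streams([[1.0, 0.0], [0.0, 1.0]], [10.0, 0.0])` gives the first stream a larger weight than the second, because its vector aligns with the query. The fused output is always a convex combination of the stream vectors.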
