Multi-Stream End-to-End Speech Recognition

06/17/2019
by   Ruizhi Li, et al.
0

Attention-based methods and Connectionist Temporal Classification (CTC) network have been promising research directions for end-to-end (E2E) Automatic Speech Recognition (ASR). The joint CTC/Attention model has achieved great success by utilizing both architectures during multi-task training and joint decoding. In this work, we present a multi-stream framework based on joint CTC/Attention E2E ASR with parallel streams represented by separate encoders aiming to capture diverse information. On top of the regular attention networks, the Hierarchical Attention Network (HAN) is introduced to steer the decoder toward the most informative encoders. A separate CTC network is assigned to each stream to force monotonic alignments. Two representative framework have been proposed and discussed, which are Multi-Encoder Multi-Resolution (MEM-Res) framework and Multi-Encoder Multi-Array (MEM-Array) framework, respectively. In MEM-Res framework, two heterogeneous encoders with different architectures, temporal resolutions and separate CTC networks work in parallel to extract complimentary information from same acoustics. Experiments are conducted on Wall Street Journal (WSJ) and CHiME-4, resulting in relative Word Error Rate (WER) reduction of 18.0-32.1 WSJ eval92 test set. The MEM-Array framework aims at improving the far-field ASR robustness using multiple microphone arrays which are activated by separate encoders. Compared with the best single-array results, the proposed framework has achieved relative WER reduction of 3.7 multi-array corpora, respectively, which also outperforms conventional fusion strategies.

READ FULL TEXT

page 1

page 8

page 10

research
11/12/2018

Multi-encoder multi-resolution framework for end-to-end speech recognition

Attention-based methods and Connectionist Temporal Classification (CTC) ...
research
11/12/2018

Stream attention-based multi-array end-to-end speech recognition

Automatic Speech Recognition (ASR) using multiple microphone arrays has ...
research
10/23/2019

A practical two-stage training strategy for multi-stream end-to-end speech recognition

The multi-stream paradigm of audio processing, in which several sources ...
research
02/05/2021

Two-Stage Augmentation and Adaptive CTC Fusion for Improved Robustness of Multi-Stream End-to-End ASR

Performance degradation of an Automatic Speech Recognition (ASR) system ...
research
11/29/2017

Stream Attention for far-field multi-microphone ASR

A stream attention framework has been applied to the posterior probabili...
research
09/17/2021

Dual-Encoder Architecture with Encoder Selection for Joint Close-Talk and Far-Talk Speech Recognition

In this paper, we propose a dual-encoder ASR architecture for joint mode...
research
12/21/2022

4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders

The network architecture of end-to-end (E2E) automatic speech recognitio...

Please sign up or login with your details

Forgot password? Click here to reset