End-to-End Multi-speaker Speech Recognition with Transformer

by Xuankai Chang, et al.
Shanghai Jiao Tong University
Johns Hopkins University

Recently, fully recurrent neural network (RNN) based end-to-end models have proven effective for multi-speaker speech recognition in both the single-channel and multi-channel scenarios. In this work, we explore the use of Transformer models for these tasks by focusing on two aspects. First, we replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture. Second, to use the Transformer in the masking network of the neural beamformer in the multi-channel case, we restrict the self-attention component to a segment rather than the whole sequence, which reduces computation. Besides these architecture improvements, we also incorporate an external dereverberation preprocessing step, the weighted prediction error (WPE), enabling our model to handle reverberated signals. Experiments on the spatialized wsj1-2mix corpus show that the Transformer-based models achieve 40.9 [...] to 12.1 [...] multi-channel tasks, respectively, while in the reverberant case, our methods achieve 41.5 [...]
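
The segment-restricted self-attention mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define the segment, so the symmetric window (`left`, `right`), the single attention head, and the use of the raw input frames as queries, keys, and values are all assumptions made for illustration. The point it shows is that each frame scores only the frames inside its segment, so the cost grows with sequence length times window size rather than with the square of the sequence length.

```python
import math


def restricted_self_attention(frames, left=2, right=2):
    """Single-head self-attention where frame t attends only to frames
    t-left .. t+right, clipped to the sequence boundaries.

    `left` and `right` are hypothetical window sizes. Queries, keys, and
    values are the raw frames; a real layer would apply learned projections.
    """
    dim = len(frames[0])
    outputs, all_weights = [], []
    for t, query in enumerate(frames):
        lo = max(0, t - left)
        hi = min(len(frames), t + right + 1)
        # Scaled dot-product scores, computed only inside the segment.
        scores = [
            sum(q * k for q, k in zip(query, frames[s])) / math.sqrt(dim)
            for s in range(lo, hi)
        ]
        # Numerically stable softmax over the segment.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors within the segment.
        out = [
            sum(w * frames[lo + i][j] for i, w in enumerate(weights))
            for j in range(dim)
        ]
        outputs.append(out)
        all_weights.append((lo, weights))
    return outputs, all_weights


if __name__ == "__main__":
    # Eight toy 2-dimensional frames; each attends to at most 3 neighbours.
    toy = [[float(i), 1.0] for i in range(8)]
    outs, ws = restricted_self_attention(toy, left=1, right=1)
    print(len(outs), [len(w) for _, w in ws])
```

Setting `left` and `right` to the sequence length recovers ordinary full self-attention, which makes the window size a direct knob for trading context against computation.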



MIMO-SPEECH: End-to-End Multi-Channel Multi-Speaker Speech Recognition

Recently, the end-to-end approach has proven its efficacy in monaural mu...

End-to-End Multi-Channel Transformer for Speech Recognition

Transformers are powerful neural architectures that allow integrating di...

Multitask Learning and Joint Optimization for Transformer-RNN-Transducer Speech Recognition

Recently, several types of end-to-end speech recognition methods named t...

Multi-Channel Transformer Transducer for Speech Recognition

Multi-channel inputs offer several advantages over single-channel, to im...

End-to-End Far-Field Speech Recognition with Unified Dereverberation and Beamforming

Despite successful applications of end-to-end approaches in multi-channe...

Joint AEC AND Beamforming with Double-Talk Detection using RNN-Transformer

Acoustic echo cancellation (AEC) is a technique used in full-duplex comm...

NatiQ: An End-to-end Text-to-Speech System for Arabic

NatiQ is an end-to-end text-to-speech system for Arabic. Our speech synthes...