
End-to-End Multi-speaker Speech Recognition with Transformer

02/10/2020
by Xuankai Chang, et al.
Shanghai Jiao Tong University
Johns Hopkins University
MERL

Recently, fully recurrent neural network (RNN) based end-to-end models have proven effective for multi-speaker speech recognition in both single-channel and multi-channel scenarios. In this work, we explore the use of Transformer models for these tasks, focusing on two aspects. First, we replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture. Second, to use the Transformer in the masking network of the neural beamformer in the multi-channel case, we restrict the self-attention component to a segment rather than the whole sequence, which reduces computation. In addition to these architectural changes, we incorporate an external dereverberation preprocessing step, the weighted prediction error (WPE) method, which enables our model to handle reverberant signals. Experiments on the spatialized wsj1-2mix corpus show that the Transformer-based models achieve 40.9% ... to 12.1% ... multi-channel tasks, respectively, while in the reverberant case, our methods achieve 41.5% ...
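The segment-restricted self-attention mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `segment_mask` and `restricted_self_attention`, the `left` and `right` window sizes, and the use of the raw input as queries, keys, and values (no learned projections, single head) are all assumptions made for illustration.

```python
import numpy as np

def segment_mask(seq_len, left, right):
    """Boolean mask: entry (i, j) is True when frame j lies inside the
    window [i - left, i + right] around query frame i (hypothetical helper)."""
    idx = np.arange(seq_len)
    offset = idx[None, :] - idx[:, None]  # key index minus query index
    return (offset >= -left) & (offset <= right)

def restricted_self_attention(x, left, right):
    """Single-head self-attention over x of shape (seq_len, dim), where each
    frame attends only to frames inside its local window.
    Assumption: queries, keys, and values are all x itself."""
    seq_len, dim = x.shape
    scores = x @ x.T / np.sqrt(dim)
    # Frames outside the window get -inf, so their softmax weight is zero.
    scores = np.where(segment_mask(seq_len, left, right), scores, -np.inf)
    # Numerically stable softmax; the diagonal is always inside the window,
    # so every row has at least one finite score.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

# Usage: 6 frames of 4-dimensional features, one frame of context per side.
x = np.random.default_rng(0).standard_normal((6, 4))
out, w = restricted_self_attention(x, left=1, right=1)
```

For clarity, this sketch still builds the full score matrix and then masks it, so it shows the attention pattern but not the computational saving. An implementation aimed at reducing computation would evaluate only the scores inside each window.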


10/15/2019

MIMO-SPEECH: End-to-End Multi-Channel Multi-Speaker Speech Recognition

Recently, the end-to-end approach has proven its efficacy in monaural mu...
02/08/2021

End-to-End Multi-Channel Transformer for Speech Recognition

Transformers are powerful neural architectures that allow integrating di...
11/02/2020

Multitask Learning and Joint Optimization for Transformer-RNN-Transducer Speech Recognition

Recently, several types of end-to-end speech recognition methods named t...
08/30/2021

Multi-Channel Transformer Transducer for Speech Recognition

Multi-channel inputs offer several advantages over single-channel, to im...
05/21/2020

End-to-End Far-Field Speech Recognition with Unified Dereverberation and Beamforming

Despite successful applications of end-to-end approaches in multi-channe...
11/09/2021

Joint AEC AND Beamforming with Double-Talk Detection using RNN-Transformer

Acoustic echo cancellation (AEC) is a technique used in full-duplex comm...
06/15/2022

NatiQ: An End-to-end Text-to-Speech System for Arabic

NatiQ is an end-to-end text-to-speech system for Arabic. Our speech synthes...