End-to-end training of time domain audio separation and recognition

12/18/2019
by Thilo von Neumann, et al.

The rising interest in single-channel multi-speaker speech separation has sparked the development of End-to-End (E2E) approaches to multi-speaker speech recognition. However, up until now, state-of-the-art neural network-based time domain source separation has not been combined with E2E speech recognition. Here, we demonstrate how to combine a separation module based on a Convolutional Time domain Audio Separation Network (Conv-TasNet) with an E2E speech recognizer, and how to train such a model jointly by distributing it over multiple GPUs or by approximating truncated back-propagation for the convolutional front-end. To put this work into perspective and illustrate the complexity of the design space, we provide a compact overview of single-channel multi-speaker recognition systems. Our experiments show a word error rate of 11.0% on WSJ0-2mix and indicate that our joint time domain model can yield substantial improvements over the cascaded DNN-HMM and monolithic E2E frequency domain systems proposed so far.
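To make the joint training idea concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' implementation): a Conv-TasNet-style time-domain separator whose estimated source waveforms are fed to a small recognizer, with the recognition loss back-propagated through the separation front-end. All module names, layer sizes, the CTC back-end, and the fixed speaker-to-transcript assignment are illustrative assumptions; the paper's system uses a full TCN mask estimator, an E2E recognizer, and proper permutation handling.

```python
# Hypothetical sketch of joint separation + recognition training (illustrative only).
import torch
import torch.nn as nn

class TinyTasNetSeparator(nn.Module):
    """Encoder -> mask estimator -> decoder, one waveform per speaker (Conv-TasNet-style)."""
    def __init__(self, num_speakers=2, enc_dim=128, kernel=16, stride=8):
        super().__init__()
        self.num_speakers = num_speakers
        self.encoder = nn.Conv1d(1, enc_dim, kernel, stride=stride, bias=False)
        self.mask_net = nn.Sequential(                # stand-in for the TCN mask estimator
            nn.Conv1d(enc_dim, enc_dim, 3, padding=1), nn.ReLU(),
            nn.Conv1d(enc_dim, enc_dim * num_speakers, 1), nn.Sigmoid())
        self.decoder = nn.ConvTranspose1d(enc_dim, 1, kernel, stride=stride, bias=False)

    def forward(self, mixture):                       # mixture: (batch, samples)
        feats = self.encoder(mixture.unsqueeze(1))    # (batch, enc_dim, frames)
        masks = self.mask_net(feats).chunk(self.num_speakers, dim=1)
        # apply each speaker mask and decode back to the time domain
        return [self.decoder(feats * m).squeeze(1) for m in masks]

class TinyCTCRecognizer(nn.Module):
    """Very small stand-in for an E2E ASR back-end operating on raw waveforms."""
    def __init__(self, vocab_size=32, feat_dim=80):
        super().__init__()
        self.frontend = nn.Conv1d(1, feat_dim, 400, stride=160)  # crude learned features
        self.rnn = nn.GRU(feat_dim, 256, batch_first=True)
        self.out = nn.Linear(256, vocab_size)

    def forward(self, wav):                           # wav: (batch, samples)
        x = torch.relu(self.frontend(wav.unsqueeze(1))).transpose(1, 2)
        h, _ = self.rnn(x)
        return self.out(h).log_softmax(-1)            # (batch, frames, vocab)

# Joint training step: the ASR loss is back-propagated through the separator.
sep, asr = TinyTasNetSeparator(), TinyCTCRecognizer()
ctc = nn.CTCLoss(blank=0)
opt = torch.optim.Adam(list(sep.parameters()) + list(asr.parameters()), lr=1e-3)

mixture = torch.randn(2, 16000)                       # dummy 1 s mixtures at 16 kHz
targets = [torch.randint(1, 32, (2, 10)), torch.randint(1, 32, (2, 10))]

est_sources = sep(mixture)
loss = 0.0
for wav, tgt in zip(est_sources, targets):            # fixed speaker order for brevity;
    logp = asr(wav)                                   # the paper handles permutations
    in_len = torch.full((2,), logp.size(1), dtype=torch.long)
    tgt_len = torch.full((2,), tgt.size(1), dtype=torch.long)
    loss = loss + ctc(logp.transpose(0, 1), tgt, in_len, tgt_len)
opt.zero_grad(); loss.backward(); opt.step()
```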


Related research:

03/24/2021: Blind Speech Separation and Dereverberation using Neural Beamforming
09/07/2020: An End-to-end Architecture of Online Multi-channel Speech Separation
10/20/2021: Time-Domain Mapping Based Single-Channel Speech Separation With Hierarchical Constraint Training
10/30/2019: SMS-WSJ: Database, performance measures, and baseline recipe for multi-channel source separation and recognition
10/10/2021: Stepwise-Refining Speech Separation Network via Fine-Grained Encoding in High-order Latent Domain
07/24/2023: Joint speech and overlap detection: a benchmark over multiple audio setup and speech domains
