Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation

07/23/2023
by   Yoshiki Masuyama, et al.
0

Neural speech separation has made remarkable progress and its integration with automatic speech recognition (ASR) is an important direction towards realizing multi-speaker ASR. This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end. In detail, we explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model. We employ the recent self-supervised learning representation (SSLR) as a feature and improve the recognition performance from the case with filterbank features. To further improve multi-speaker recognition performance, we present a carefully designed training strategy for integrating speech separation and recognition with SSLR. The proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5 word error rate in reverberant WHAMR! test set, significantly outperforming an existing mask-based MVDR beamforming and filterbank integration (28.9

READ FULL TEXT

page 1

page 2

page 3

page 4

research
10/19/2022

End-to-End Integration of Speech Recognition, Dereverberation, Beamforming, and Self-Supervised Learning Representation

Self-supervised learning representation (SSLR) has demonstrated its sign...
research
06/19/2018

Speaker Adapted Beamforming for Multi-Channel Automatic Speech Recognition

This paper presents, in the context of multi-channel ASR, a method to ad...
research
10/13/2021

All-neural beamformer for continuous speech separation

Continuous speech separation (CSS) aims to separate overlapping voices f...
research
10/22/2018

Investigation of Monaural Front-End Processing for Robust ASR without Retraining or Joint-Training

In recent years, monaural speech separation has been formulated as a sup...
research
10/22/2018

Investigation of Independent Monaural Front-End Processing for Robust ASR without Retraining and Joint-Training

In recent years, monaural speech separation has been formulated as a sup...
research
05/08/2020

Neural Spatio-Temporal Beamformer for Target Speech Separation

Purely neural network (NN) based speech separation and enhancement metho...
research
11/05/2020

Exploring End-to-End Multi-channel ASR with Bias Information for Meeting Transcription

Joint optimization of multi-channel front-end and automatic speech recog...

Please sign up or login with your details

Forgot password? Click here to reset