Directed Speech Separation for Automatic Speech Recognition of Long Form Conversational Speech

12/10/2021
by   Rohit Paturi, et al.

Many recent advances in speech separation are aimed primarily at synthetic mixtures of short audio utterances with high degrees of overlap. Such datasets differ significantly from real conversational data, so models trained and evaluated on them do not generalize to real conversational scenarios. A further issue with applying most of these models to long-form speech is the nondeterministic ordering of the separated speech segments, caused either by unsupervised clustering of time-frequency masks or by the Permutation Invariant Training (PIT) loss. This makes it difficult to accurately stitch together homogeneous speaker segments for downstream tasks such as Automatic Speech Recognition (ASR). In this paper, we propose a speaker-conditioned separator trained on speaker embeddings extracted directly from the mixed signal. We train this model with a directed loss that regulates the order of the separated segments. With this model, we achieve significant improvements in word error rate (WER) on real conversational data without the need for an additional re-stitching step.
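The contrast the abstract draws can be illustrated with a small sketch. The snippet below is not the authors' implementation; it is a minimal numpy illustration, assuming a mean-squared-error objective, of why a PIT loss leaves the output channel order nondeterministic while a directed (fixed-order) loss, as used when the separator is conditioned on per-speaker embeddings, pins each output channel to a specific speaker:

```python
import numpy as np
from itertools import permutations

def pit_loss(est, ref):
    """Permutation Invariant Training: score every ordering of the
    estimated channels against the references and keep the minimum,
    so the model is free to emit speakers in any order."""
    n = est.shape[0]
    return min(
        np.mean((est[list(p)] - ref) ** 2)
        for p in permutations(range(n))
    )

def directed_loss(est, ref):
    """Directed loss: output channel i must match reference speaker i
    (the order being fixed by the conditioning speaker embeddings),
    so no permutation search is needed and the channel order is
    consistent across segments of a long recording."""
    return np.mean((est - ref) ** 2)

# Toy example: two "speakers", estimates emitted in swapped order.
ref = np.array([[1.0, 0.0], [0.0, 1.0]])
est_swapped = ref[::-1]

print(pit_loss(est_swapped, ref))       # 0.0 -- PIT accepts either order
print(directed_loss(est_swapped, ref))  # 1.0 -- directed loss penalizes the swap
```

Under PIT, two adjacent segments of the same conversation can come out with speakers in different channel orders, which is exactly the re-stitching problem for long-form ASR; the directed loss removes that ambiguity by construction.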


Related research

- Towards Unsupervised Automatic Speech Recognition Trained by Unaligned Speech and Text only (03/29/2018)
- Multi-resolution location-based training for multi-channel continuous speech separation (01/16/2023)
- Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis (11/03/2020)
- Comparing Human and Machine Errors in Conversational Speech Transcription (08/29/2017)
- Separation Guided Speaker Diarization in Realistic Mismatched Conditions (07/06/2021)
- Input Length Matters: An Empirical Study Of RNN-T And MWER Training For Long-form Telephony Speech Recognition (10/08/2021)
- Continuous speech separation: dataset and analysis (01/30/2020)
