Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers

06/19/2020
by   Naoyuki Kanda, et al.
0

We propose an end-to-end speaker-attributed automatic speech recognition model that unifies speaker counting, speech recognition, and speaker identification on monaural overlapped speech. Our model is built on serialized output training (SOT) with attention-based encoder-decoder, a recently proposed method for recognizing overlapped speech comprising an arbitrary number of speakers. We extend SOT by introducing a speaker inventory as an auxiliary input to produce speaker labels as well as multi-speaker transcriptions. All model parameters are optimized by speaker-attributed maximum mutual information criterion, which represents a joint probability for overlapped speech recognition and speaker identification. Experiments on LibriSpeech corpus show that our proposed method achieves significantly better speaker-attributed word error rate than the baseline that separately performs overlapped speech recognition and speaker identification.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
11/03/2020

Minimum Bayes Risk Training for End-to-End Speaker-Attributed ASR

Recently, an end-to-end speaker-attributed automatic speech recognition ...
research
10/07/2021

Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers using End-to-End Speaker-Attributed ASR

This paper presents Transcribe-to-Diarize, a new approach for neural spe...
research
04/05/2021

End-to-End Speaker-Attributed ASR with Transformer

This paper presents our recent effort on end-to-end speaker-attributed a...
research
03/28/2020

Serialized Output Training for End-to-End Overlapped Speech Recognition

This paper proposes serialized output training (SOT), a novel framework ...
research
10/07/2020

Domain Adversarial Neural Networks for Dysarthric Speech Recognition

Speech recognition systems have improved dramatically over the last few ...
research
08/28/2023

The USTC-NERCSLIP Systems for the CHiME-7 DASR Challenge

This technical report details our submission system to the CHiME-7 DASR ...
research
08/22/2023

An Effective Transformer-based Contextual Model and Temporal Gate Pooling for Speaker Identification

Wav2vec2 has achieved success in applying Transformer architecture and s...

Please sign up or login with your details

Forgot password? Click here to reset