Hypothesis Stitcher for End-to-End Speaker-attributed ASR on Long-form Multi-talker Recordings

01/06/2021
by   Xuankai Chang, et al.

An end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was recently proposed to jointly perform speaker counting, speech recognition, and speaker identification. The model achieved a low speaker-attributed word error rate (SA-WER) for monaural overlapped speech comprising an unknown number of speakers. However, the E2E modeling approach is susceptible to mismatch between training and testing conditions, and it has yet to be investigated whether the E2E SA-ASR model works well for recordings that are much longer than the samples seen during training. In this work, we first apply to our E2E SA-ASR task a known decoding technique that was developed to perform single-speaker ASR on long-form audio. Then, we propose a novel method using a sequence-to-sequence model, called the hypothesis stitcher. The model takes multiple hypotheses obtained from short audio segments extracted from the original long-form input and outputs a single fused hypothesis. We propose several architectural variations of the hypothesis stitcher model and compare them with the conventional decoding methods. Experiments using the LibriSpeech and LibriCSS corpora show that the proposed method significantly improves SA-WER, especially for long-form multi-talker recordings.
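
The data flow described in the abstract (short segments cut from a long recording, one hypothesis per segment, then a single fused hypothesis) can be illustrated with a minimal Python sketch. It covers only the bookkeeping around a stitcher and is not the authors' implementation: the sequence-to-sequence model itself is omitted, and the function names, the overlapping-window scheme, the per-speaker grouping, and the `<wc>` separator token are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical separator token marking a window boundary (an assumption,
# not a detail stated in the abstract).
WINDOW_CHANGE = "<wc>"


def sliding_windows(num_samples: int, window: int, hop: int) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges that together cover the recording."""
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    spans: List[Tuple[int, int]] = []
    start = 0
    while True:
        end = min(start + window, num_samples)
        spans.append((start, end))
        if end >= num_samples:
            break
        start += hop
    return spans


def build_stitcher_inputs(
    window_hyps: List[Dict[str, List[str]]]
) -> Dict[str, List[str]]:
    """Concatenate per-window hypotheses speaker by speaker.

    Each element of window_hyps maps a speaker label to the words recognized
    for that speaker in one window. The result is one token sequence per
    speaker, with WINDOW_CHANGE inserted between consecutive windows. A
    sequence-to-sequence stitcher (not shown) would map each sequence to a
    fused hypothesis.
    """
    # Collect speakers in order of first appearance.
    speakers: List[str] = []
    for hyp in window_hyps:
        for spk in hyp:
            if spk not in speakers:
                speakers.append(spk)

    inputs: Dict[str, List[str]] = {}
    for spk in speakers:
        tokens: List[str] = []
        for i, hyp in enumerate(window_hyps):
            if i > 0:
                tokens.append(WINDOW_CHANGE)
            tokens.extend(hyp.get(spk, []))
        inputs[spk] = tokens
    return inputs


if __name__ == "__main__":
    print(sliding_windows(num_samples=10, window=4, hop=2))
    hyps = [
        {"spk1": ["hello", "world"]},
        {"spk1": ["world", "again"], "spk2": ["hi"]},
    ]
    print(build_stitcher_inputs(hyps))
```

In this sketch the overlapping windows produce duplicated words near window boundaries (for example, "world" appears on both sides of the separator), which is the kind of redundancy a fusion model would have to resolve.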

research
08/11/2020

Investigation of End-To-End Speaker-Attributed ASR for Continuous Multi-Talker Recordings

Recently, an end-to-end (E2E) speaker-attributed automatic speech recogn...
research
11/03/2020

Minimum Bayes Risk Training for End-to-End Speaker-Attributed ASR

Recently, an end-to-end speaker-attributed automatic speech recognition ...
research
05/16/2020

Speech Recognition and Multi-Speaker Diarization of Long Conversations

Speech recognition (ASR) and speaker diarization (SD) models have tradit...
research
11/02/2022

Towards End-to-end Speaker Diarization in the Wild

Speaker diarization algorithms address the "who spoke when" problem in a...
research
11/29/2022

On Word Error Rate Definitions and their Efficient Computation for Multi-Speaker Speech Recognition Systems

We present a general framework to compute the word error rate (WER) of A...
research
09/17/2019

DOVER: A Method for Combining Diarization Outputs

Speech recognition and other natural language tasks have long benefited ...
research
05/05/2016

The IBM Speaker Recognition System: Recent Advances and Error Analysis

We present the recent advances along with an error analysis of the IBM s...