Simultaneous Speech Recognition and Speaker Diarization for Monaural Dialogue Recordings with Target-Speaker Acoustic Models

09/17/2019
by   Naoyuki Kanda, et al.
0

This paper investigates the use of target-speaker automatic speech recognition (TS-ASR) for simultaneous speech recognition and speaker diarization of single-channel dialogue recordings. TS-ASR is a technique to automatically extract and recognize only the speech of a target speaker given a short sample utterance of that speaker. One obvious drawback of TS-ASR is that it cannot be used when the speakers in the recordings are unknown because it requires a sample of the target speakers in advance of decoding. To remove this limitation, we propose an iterative method, in which (i) the estimation of speaker embeddings and (ii) TS-ASR based on the estimated speaker embeddings are alternately executed. We evaluated the proposed method by using very challenging dialogue recordings in which the speaker overlap ratio was over 20 error rate (WER) and diarization error rate (DER). Our proposed method combined with i-vector speaker embeddings ultimately achieved a WER that differed by only 2.1 our method can solve speaker diarization simultaneously as a by-product and achieved better DER than that of the conventional clustering-based speaker diarization method based on i-vector.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
06/26/2019

Auxiliary Interference Speaker Loss for Target-Speaker Speech Recognition

In this paper, we propose a novel auxiliary loss function for target-spe...
research
02/12/2021

Content-Aware Speaker Embeddings for Speaker Diarisation

Recent speaker diarisation systems often convert variable length speech ...
research
12/08/2021

A study on native American English speech recognition by Indian listeners with varying word familiarity level

In this study, listeners of varied Indian nativities are asked to listen...
research
10/05/2016

Monaural Multi-Talker Speech Recognition using Factorial Speech Processing Models

A Pascal challenge entitled monaural multi-talker speech recognition was...
research
11/29/2022

On Word Error Rate Definitions and their Efficient Computation for Multi-Speaker Speech Recognition Systems

We present a general framework to compute the word error rate (WER) of A...
research
08/04/2020

"This is Houston. Say again, please". The Behavox system for the Apollo-11 Fearless Steps Challenge (phase II)

We describe the speech activity detection (SAD), speaker diarization (SD...
research
05/05/2016

The IBM Speaker Recognition System: Recent Advances and Error Analysis

We present the recent advances along with an error analysis of the IBM s...

Please sign up or login with your details

Forgot password? Click here to reset