End-to-End Joint Target and Non-Target Speakers ASR

06/04/2023
by Ryo Masumura, et al.

This paper proposes a novel automatic speech recognition (ASR) system that transcribes each speaker's speech in multi-talker overlapped audio while identifying whether that speaker is the target or a non-target speaker. Target-speaker ASR systems are a promising way to transcribe only a target speaker's speech by enrolling that speaker's information. However, conversational ASR applications often require transcribing both the target speaker's speech and the non-target speakers' speech in order to understand the interaction. To handle both target and non-target speakers naturally in a single ASR model, we extend autoregressive modeling-based multi-talker ASR systems to utilize the enrollment speech of the target speaker. The proposed system recursively generates both textual tokens and tokens that indicate whether a speaker is the target or a non-target speaker. Our experiments demonstrate the effectiveness of the proposed method.
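To make the output format concrete, the following is a minimal sketch of how a decoded token stream that interleaves speaker-role tokens with text tokens could be split into per-speaker transcripts. It is not the authors' implementation. The token names `<target>` and `<non-target>`, the function `split_by_speaker`, and the assumption that each utterance begins with its role token are all hypothetical choices made for illustration; the abstract does not specify the actual token inventory or ordering.

```python
from typing import List, Tuple

# Hypothetical speaker-role tokens (assumed names, not taken from the paper).
TARGET = "<target>"
NON_TARGET = "<non-target>"
ROLE_TOKENS = {TARGET, NON_TARGET}


def split_by_speaker(tokens: List[str]) -> List[Tuple[str, str]]:
    """Group a flat token stream into (role, transcript) pairs.

    Assumes (for illustration only) that every utterance starts with a
    role token, followed by that utterance's text tokens.
    """
    utterances: List[Tuple[str, List[str]]] = []
    for tok in tokens:
        if tok in ROLE_TOKENS:
            # A role token opens a new utterance.
            utterances.append((tok, []))
        elif utterances:
            # A text token belongs to the most recently opened utterance.
            utterances[-1][1].append(tok)
        # Text tokens seen before any role token are ignored.
    return [(role, " ".join(words)) for role, words in utterances]


# Made-up decoder output for a two-speaker mixture.
decoded = ["<target>", "hello", "how", "are", "you",
           "<non-target>", "fine", "thanks"]
result = split_by_speaker(decoded)
```

Here `result` holds one `(role, transcript)` pair per utterance, so a downstream application can keep the target speaker's text separate from everyone else's while still retaining both.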


