Unified Autoregressive Modeling for Joint End-to-End Multi-Talker Overlapped Speech Recognition and Speaker Attribute Estimation

07/04/2021
by Ryo Masumura, et al.

In this paper, we present a novel modeling method for single-channel multi-talker overlapped automatic speech recognition (ASR) systems. Fully neural-network-based end-to-end models have dramatically improved the performance of multi-talker overlapped ASR tasks. One promising approach to end-to-end modeling is autoregressive modeling with serialized output training, in which the transcriptions of multiple speakers are generated recursively, one after another. This lets the model naturally capture relationships between speakers. However, the conventional modeling method cannot explicitly take into account the speaker attributes of individual utterances, such as gender and age. In fact, its performance deteriorates when the speakers are of the same gender or are close in age. To address this problem, we propose unified autoregressive modeling for joint end-to-end multi-talker overlapped ASR and speaker attribute estimation. Our key idea is to handle the gender and age estimation tasks within the unified autoregressive model. In the proposed method, a transformer-based autoregressive model recursively generates not only the textual tokens but also the attribute tokens of each speaker. This lets the model use speaker attributes effectively to improve multi-talker overlapped ASR. Experiments on Japanese multi-talker overlapped ASR tasks demonstrate the effectiveness of the proposed method.
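
As a rough illustration of the idea described above, the following minimal sketch builds a single serialized target sequence in which each speaker's attribute tokens and textual tokens appear one speaker after another. It is not taken from the paper: the token names (`<sc>`, `<eos>`, `<female>`, `<adult>`, and so on), the `Utterance` type, the `serialize` helper, and the choice to place attribute tokens before the text are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical special tokens; the abstract does not name the actual symbols.
SC = "<sc>"    # marks a change of speaker in the serialized output
EOS = "<eos>"  # marks the end of the whole serialized sequence


@dataclass
class Utterance:
    gender: str        # assumed label set, e.g. "female" / "male"
    age_group: str     # assumed label set, e.g. "adult" / "senior"
    tokens: List[str]  # textual tokens of this speaker's transcription


def serialize(utterances: List[Utterance]) -> List[str]:
    """Flatten several speakers into one target sequence.

    For each speaker, the attribute tokens come first (an assumed ordering),
    followed by that speaker's textual tokens. Speakers are separated by SC
    and the sequence ends with EOS.
    """
    out: List[str] = []
    for i, utt in enumerate(utterances):
        if i > 0:
            out.append(SC)
        out.append(f"<{utt.gender}>")
        out.append(f"<{utt.age_group}>")
        out.extend(utt.tokens)
    out.append(EOS)
    return out


if __name__ == "__main__":
    mixture = [
        Utterance("female", "adult", ["hello"]),
        Utterance("male", "senior", ["good", "morning"]),
    ]
    print(serialize(mixture))
```

An autoregressive decoder trained on such a sequence would predict every token, attribute or textual, conditioned on all tokens generated before it, which is one way the attributes of a speaker could inform both that speaker's transcription and those of later speakers.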


