Adapting self-supervised models to multi-talker speech recognition using speaker embeddings

11/01/2022
by   Zili Huang, et al.

Self-supervised learning (SSL) methods, which learn representations of data without explicit supervision, have gained popularity in speech processing, particularly for single-talker applications. However, these models often degrade in multi-talker scenarios, possibly due to domain mismatch, which severely limits their use for such applications. In this paper, we investigate adapting upstream SSL models to the multi-talker automatic speech recognition (ASR) task under two conditions. First, when segmented utterances are given, we show that adding a target speaker extraction (TSE) module based on enrollment embeddings is complementary to mixture-aware pre-training. Second, for unsegmented mixtures, we propose a novel joint speaker modeling (JSM) approach that aggregates information from all speakers in the mixture through their embeddings. With controlled experiments on Libri2Mix, we show that using speaker embeddings provides relative WER improvements for both the segmented and unsegmented cases (9.1% in the segmented condition). We also demonstrate the effectiveness of our models for real conversational mixtures through experiments on the AMI dataset.
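The abstract describes conditioning an SSL encoder on a speaker's enrollment embedding so the downstream ASR head focuses on that speaker. The paper does not specify the fusion mechanism here; the sketch below illustrates one common choice, FiLM-style (feature-wise scale and shift) conditioning of frame-level SSL features on the embedding. All names, dimensions, and the random weights are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def condition_on_speaker(features, spk_emb, W_scale, W_shift):
    """FiLM-style conditioning (illustrative, not the paper's exact module):
    scale and shift each frame's SSL features using parameters predicted
    linearly from the enrollment speaker embedding."""
    scale = np.tanh(spk_emb @ W_scale)       # (D,) per-channel scale offset
    shift = spk_emb @ W_shift                # (D,) per-channel shift
    # broadcast over the T frames of the utterance
    return features * (1.0 + scale) + shift

# toy shapes: T frames, D-dim SSL features, E-dim speaker embedding
rng = np.random.default_rng(0)
T, D, E = 50, 768, 256
features = rng.standard_normal((T, D))   # e.g. WavLM/HuBERT frame outputs
spk_emb = rng.standard_normal(E)         # e.g. an x-vector for the target speaker
W_scale = 0.01 * rng.standard_normal((E, D))  # hypothetical learned weights
W_shift = 0.01 * rng.standard_normal((E, D))

out = condition_on_speaker(features, spk_emb, W_scale, W_shift)
print(out.shape)  # (50, 768): same shape as input, now speaker-conditioned
```

In a trained system the two projection matrices would be learned jointly with the ASR objective; the JSM variant described for unsegmented mixtures would instead aggregate embeddings from all speakers before conditioning.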


research · 04/27/2022
Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?
Recently, self-supervised learning (SSL) has demonstrated strong perform...

research · 06/04/2023
End-to-End Joint Target and Non-Target Speakers ASR
This paper proposes a novel automatic speech recognition (ASR) system th...

research · 04/15/2022
Speaker-Aware Mixture of Mixtures Training for Weakly Supervised Speaker Extraction
Dominant researches adopt supervised training for speaker extraction, wh...

research · 05/25/2023
Weakly-Supervised Speech Pre-training: A Case Study on Target Speech Recognition
Self-supervised learning (SSL) based speech pre-training has attracted m...

research · 03/20/2023
Cocktail HuBERT: Generalized Self-Supervised Pre-training for Mixture and Single-Source Speech
Self-supervised learning leverages unlabeled data effectively, improving...

research · 11/12/2020
Self-supervised reinforcement learning for speaker localisation with the iCub humanoid robot
In the future robots will interact more and more with humans and will ha...

research · 06/27/2023
3D-Speaker: A Large-Scale Multi-Device, Multi-Distance, and Multi-Dialect Corpus for Speech Representation Disentanglement
Disentangling uncorrelated information in speech utterances is a crucial...
