Cascaded encoders for fine-tuning ASR models on overlapped speech

06/28/2023
by Richard Rose, et al.

Multi-talker automatic speech recognition (MT-ASR) has been shown to improve ASR performance on speech containing overlapping utterances from more than one speaker. Multi-talker models have typically been trained from scratch on simulated or actual overlapping speech datasets. Meanwhile, the trend in ASR has been to train foundation models on massive datasets collected from a wide variety of task domains. Given the scale of these models and their ability to generalize well across domains, it makes sense to consider scenarios where a foundation model is augmented with multi-talker capability. This paper presents an MT-ASR model formed by combining a well-trained foundation model with a multi-talker mask model in a cascaded RNN-T encoder configuration. Experimental results show that the cascade configuration achieves a lower word error rate (WER) on overlapping speech utterances than a baseline multi-talker model, without sacrificing the performance the foundation model achieves on non-overlapping utterances.


