How Does Pre-trained Wav2Vec2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications

03/31/2022
by Juan Zuluaga-Gomez, et al.

Recent work on self-supervised pre-training focuses on leveraging large-scale unlabeled speech data to build robust end-to-end (E2E) acoustic models (AM) that can later be fine-tuned on downstream tasks, e.g., automatic speech recognition (ASR). Yet few works have investigated the impact on performance when the data differs substantially between the pre-training and downstream fine-tuning phases (i.e., domain shift). We target this scenario by analyzing the robustness of Wav2Vec2.0 and XLS-R models on downstream ASR for a completely unseen domain, i.e., air traffic control (ATC) communications. We benchmark the proposed models on four challenging ATC test sets (signal-to-noise ratios vary between 5 and 20 dB). Relative word error rate (WER) reductions of 20% or more over state-of-the-art ASR baselines are obtained by fine-tuning E2E acoustic models with a small fraction of labeled data. We also study the impact of fine-tuning data size on WERs, going from 5 minutes (few-shot) to 15 hours.
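
The abstract reports results as relative WER reductions. As a minimal sketch of how that metric is usually computed (not the authors' evaluation code), the Python below derives WER from a word-level edit distance and then compares a baseline WER with a new one. The function names `word_error_rate` and `relative_wer_reduction` are hypothetical, and text is assumed to be already normalized and whitespace-tokenized.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

Under these definitions, moving from a baseline WER of 0.20 to 0.15 is a 25% relative reduction, even though the absolute difference is only 5 points. These numbers are illustrative and are not taken from the paper.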

research
04/04/2022

A Study of Gender Impact in Self-supervised Models for Speech-to-Text Systems

Self-supervised models for speech processing emerged recently as popular...
research
03/31/2021

Large-Scale Pre-Training of End-to-End Multi-Talker ASR for Meeting Transcription with Single Distant Microphone

Transcribing meetings containing overlapped speech with only a single di...
research
06/01/2023

AfriNames: Most ASR models "butcher" African Names

Useful conversational agents must accurately capture named entities to m...
research
07/18/2023

Zero-shot Domain-sensitive Speech Recognition with Prompt-conditioning Fine-tuning

In this work, we propose a method to create domain-sensitive speech reco...
research
05/16/2019

Learning discriminative features in sequence training without requiring framewise labelled data

In this work, we try to answer two questions: Can deeply learned feature...
research
06/09/2023

Developing Speech Processing Pipelines for Police Accountability

Police body-worn cameras have the potential to improve accountability an...
research
09/18/2023

Are Soft Prompts Good Zero-shot Learners for Speech Recognition?

Large self-supervised pre-trained speech models require computationally ...
