Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation

09/04/2023
by Jiaxu Zhu, et al.
Mapping the two modalities, speech and text, into a shared representation space is an active research direction for using text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains. However, speech and text representations differ in length. Although the previous method up-samples the text representation to align with the acoustic modality, the up-sampled length may not match the actual speech duration. In this paper, we propose a novel representation matching strategy that down-samples the acoustic representation to align with the text modality. By introducing a continuous integrate-and-fire (CIF) module that generates acoustic representations consistent with token length, our ASR model can better learn unified representations from both modalities, allowing for domain adaptation using text-only data of the target domain. Experimental results on new-domain data demonstrate the effectiveness of the proposed method.
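
To make the down-sampling idea concrete, the following is a minimal NumPy sketch of a generic integrate-and-fire accumulation, not the authors' implementation. The function name `cif_downsample`, the fixed threshold of 1.0, and the assumption that each frame weight lies in [0, 1] are illustrative choices. In a real model the weights would be predicted from the encoder output, and CIF-style systems typically rescale them during training so that they sum to the target token count; that step is omitted here.

```python
import numpy as np

def cif_downsample(frames, weights, threshold=1.0):
    """Integrate frame vectors until the accumulated weight reaches
    `threshold`, then emit ("fire") one token-level vector.

    frames:  (T, D) frame-level acoustic representations
    weights: (T,) per-frame weights, each assumed to lie in [0, 1]
    Returns an (N, D) array, where N is roughly sum(weights) / threshold.
    """
    frames = np.asarray(frames, dtype=float)
    weights = np.asarray(weights, dtype=float)
    outputs = []
    acc_weight = 0.0                        # weight accumulated since last fire
    acc_state = np.zeros(frames.shape[1])   # weighted sum since last fire
    for h, a in zip(frames, weights):
        if acc_weight + a < threshold:
            # Not enough weight yet: keep integrating.
            acc_weight += a
            acc_state += a * h
        else:
            # Fire: use just enough of this frame's weight to hit the
            # threshold, and carry the remainder into the next token.
            used = threshold - acc_weight
            outputs.append(acc_state + used * h)
            remainder = a - used
            acc_weight = remainder
            acc_state = remainder * h
    if not outputs:
        return np.zeros((0, frames.shape[1]))
    return np.stack(outputs)

# Six frames whose weights sum to 3 are reduced to three token-level vectors.
frames = np.arange(12).reshape(6, 2)
tokens = cif_downsample(frames, [0.5, 0.5, 0.25, 0.75, 0.5, 0.5])
print(tokens.shape)  # (3, 2)
```

Because the number of emitted vectors follows the accumulated weight rather than the frame count, the acoustic sequence can be brought to the same length as the token sequence, which is the property the abstract relies on for matching the two modalities.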

research
04/27/2023

Understanding Shared Speech-Text Representations

Recently, a number of approaches to train speech models by incorporatin...
research
06/22/2022

A Simple Baseline for Domain Adaptation in End to End ASR Systems Using Synthetic Data

Automatic Speech Recognition (ASR) has been dominated by deep learning-ba...
research
04/08/2021

Exploring Machine Speech Chain for Domain Adaptation and Few-Shot Speaker Adaptation

Machine Speech Chain, which integrates both end-to-end (E2E) automatic s...
research
10/20/2022

Improving Semi-supervised End-to-end Automatic Speech Recognition using CycleGAN and Inter-domain Losses

We propose a novel method that combines CycleGAN and inter-domain losses...
research
02/27/2023

Text-only domain adaptation for end-to-end ASR using integrated text-to-mel-spectrogram generator

We propose an end-to-end ASR system that can be trained on transcribed s...
research
09/01/2023

Learning Speech Representation From Contrastive Token-Acoustic Pretraining

For fine-grained generation and recognition tasks such as minimally-supe...
research
11/27/2016

Invariant Representations for Noisy Speech Recognition

Modern automatic speech recognition (ASR) systems need to be robust unde...