Speech-text based multi-modal training with bidirectional attention for improved speech recognition

11/01/2022
by   Yuhang Yang, et al.

To let state-of-the-art end-to-end ASR models benefit from data efficiency, as well as from much larger amounts of unpaired text data via multi-modal training, two problems must be addressed: 1) the mismatch in feature sampling rates between speech and language (i.e., text) data; 2) the homogeneity of the representations learned by the two encoders. In this paper we propose a novel bidirectional attention mechanism (BiAM) to jointly learn the ASR encoder (bottom layers) and a text encoder with a multi-modal learning method. BiAM facilitates feature sampling rate exchange, so that the quality of features transformed from one modality can be measured in the other modality's space, under diversified objective functions. As a result, the speech representations are enriched with more linguistic information, while the representations generated by the text encoder become more similar to the corresponding speech ones; the shared ASR model is therefore more amenable to pretraining on unpaired text data. To validate the efficacy of the proposed method, we perform two categories of experiments, with and without extra unpaired text data. Experimental results on the Librispeech corpus show that the method achieves up to 6.15% word error rate reduction (WERR) with paired data only, and up to 9.23% when additional unpaired text data is employed.
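The abstract does not spell out the exact BiAM formulation, but the core idea it describes (a shared speech-text similarity matrix attended over in both directions, so each modality's sequence is resampled to the other's rate) can be sketched minimally. The function name, scaling, and single-head setup below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_attention(speech, text):
    """Illustrative bidirectional cross-attention sketch (assumed form).

    speech: (T_s, d) speech encoder features, T_s frames
    text:   (T_t, d) text encoder features, T_t tokens
    Returns:
      text_at_speech_rate:  (T_s, d) text features aggregated per speech frame
      speech_at_text_rate:  (T_t, d) speech features aggregated per text token
    """
    d = speech.shape[-1]
    # One shared similarity matrix between every speech frame and text token.
    sim = (speech @ text.T) / np.sqrt(d)              # (T_s, T_t)
    # Speech queries text: resamples text features to the speech frame rate.
    text_at_speech_rate = softmax(sim, axis=1) @ text    # (T_s, d)
    # Text queries speech: resamples speech features to the text token rate.
    speech_at_text_rate = softmax(sim.T, axis=1) @ speech  # (T_t, d)
    return text_at_speech_rate, speech_at_text_rate
```

Because both directions reuse one similarity matrix, each transformed sequence lives at the other modality's sampling rate, which is what makes frame-level and token-level matching losses applicable in either representation space.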

Related research:

- Masked Audio Text Encoders are Effective Multi-Modal Rescorers (05/11/2023)
- Understanding Shared Speech-Text Representations (04/27/2023)
- Almost Unsupervised Text to Speech and Automatic Speech Recognition (05/13/2019)
- Cycle-consistency training for end-to-end speech recognition (11/02/2018)
- Optimizing Alignment of Speech and Language Latent Spaces for End-to-End Speech Recognition and Understanding (10/23/2021)
- NeuFA: Neural Network Based End-to-End Forced Alignment with Bidirectional Attention Mechanism (03/31/2022)
- Improving Joint Speech-Text Representations Without Alignment (08/11/2023)
