Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition

01/17/2021
by Cheng Yi, et al.

End-to-end models have achieved impressive results on automatic speech recognition (ASR). For low-resource ASR tasks, however, labeled data is rarely sufficient to train end-to-end models. Self-supervised acoustic pre-training has already delivered strong ASR performance, yet the available transcriptions remain inadequate for language modeling within end-to-end models. In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model. The fused model only needs to learn the transfer from speech to language during fine-tuning on limited labeled data. The sequence lengths of the two modalities are matched by a monotonic attention mechanism without additional parameters, and a fully connected layer is introduced for the hidden mapping between modalities. We further propose a scheduled fine-tuning strategy to preserve and exploit the text-context modeling ability of the pre-trained linguistic encoder. Experiments show that the pre-trained modules are used effectively: our model achieves better recognition performance on the CALLHOME corpus (15 hours) than other end-to-end models.
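Since the abstract walks through the whole pipeline, a minimal sketch may help make the fusion concrete. The sketch below assumes the HuggingFace transformers implementations of wav2vec2.0 and BERT; the class and function names (FusedASR, monotonic_match), the checkpoint names, and the average-pooling rule standing in for the parameter-free monotonic length matching are illustrative assumptions, not the authors' exact design.

```python
# Sketch: fuse a pre-trained acoustic encoder (wav2vec2.0) with a
# pre-trained linguistic encoder (BERT) for end-to-end ASR.
# ASSUMPTIONS: the checkpoint names, the pooling-based length matcher,
# and the bridge/head shapes are illustrative, not the paper's design.
import torch
import torch.nn as nn
from transformers import BertModel, Wav2Vec2Model


def monotonic_match(h: torch.Tensor, target_len: int) -> torch.Tensor:
    """Parameter-free monotonic length matching (assumption: realized
    here by average-pooling contiguous acoustic frames into target_len
    segments, which preserves left-to-right order)."""
    # h: (batch, T_acoustic, dim) -> (batch, target_len, dim)
    return nn.functional.adaptive_avg_pool1d(
        h.transpose(1, 2), target_len).transpose(1, 2)


class FusedASR(nn.Module):
    """wav2vec2.0 -> length matching -> linear bridge -> BERT -> vocab head."""

    def __init__(self, vocab_size: int):
        super().__init__()
        self.acoustic = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.linguistic = BertModel.from_pretrained("bert-base-uncased")
        # Fully connected layer mapping acoustic hidden states into the
        # linguistic encoder's embedding space.
        self.bridge = nn.Linear(self.acoustic.config.hidden_size,
                                self.linguistic.config.hidden_size)
        self.head = nn.Linear(self.linguistic.config.hidden_size, vocab_size)

    def forward(self, waveform: torch.Tensor, target_len: int) -> torch.Tensor:
        # waveform: (batch, samples) of raw 16 kHz audio.
        h = self.acoustic(waveform).last_hidden_state     # (B, T_a, D_a)
        h = monotonic_match(h, target_len)                # (B, T_l, D_a)
        h = self.bridge(h)                                # (B, T_l, D_l)
        # Feed projected acoustic states into BERT as input embeddings,
        # bypassing its token-embedding lookup.
        h = self.linguistic(inputs_embeds=h).last_hidden_state
        return self.head(h)                               # (B, T_l, vocab)
```

One way to realize the scheduled fine-tuning strategy in this sketch (again an assumption about the schedule, not the paper's recipe) is to freeze self.linguistic for the first epochs so that only the acoustic encoder and the bridge adapt, then unfreeze BERT at a reduced learning rate so its text-context modeling is preserved rather than overwritten.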

Related research

10/26/2022
Efficient Use of Large Pre-Trained Models for Low Resource ASR
Automatic speech recognition (ASR) has been established as a well-perfor...

09/19/2021
Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition
Unifying acoustic and linguistic representation learning has become incr...

03/10/2021
Fine-tuning of Pre-trained End-to-end Speech Recognition with Generative Adversarial Networks
Adversarial training of end-to-end (E2E) ASR systems using generative ad...

09/24/2019
Understanding Semantics from Speech Through Pre-training
End-to-end Spoken Language Understanding (SLU) is proposed to infer the ...

11/21/2019
Speech Sentiment Analysis via Pre-trained Features from End-to-end ASR Models
In this paper, we propose to use pre-trained features from end-to-end AS...

06/28/2022
Bottleneck Low-rank Transformers for Low-resource Spoken Language Understanding
End-to-end spoken language understanding (SLU) systems benefit from pret...

11/13/2021
Prediction of Listener Perception of Argumentative Speech in a Crowdsourced Dataset Using (Psycho-)Linguistic and Fluency Features
One of the key communicative competencies is the ability to maintain flu...
