Incremental Learning for End-to-End Automatic Speech Recognition

by Li Fu, et al.

We propose an incremental learning method for end-to-end Automatic Speech Recognition (ASR) that extends the model's capacity to a new task while retaining its performance on existing ones. The proposed method works without access to the old dataset, which addresses the issues of high training cost and old dataset unavailability. To achieve this, knowledge distillation is applied as guidance to retain the recognition ability of the previous model, and is combined with the new ASR task for model optimization. With an ASR model pre-trained on 12,000h of Mandarin speech, we test our proposed method on a 300h new scenario task and a 1h new named entities task. Experiments show that our method yields 3.25 on the new scenario, when compared with the pre-trained model and the full-data retraining baseline, respectively. It even yields a surprising 0.37 CER reduction on the new scenario compared with fine-tuning. For the new named entities task, our method significantly improves the accuracy compared with the pre-trained model, i.e. 16.95 adaptions, the new models still maintain the same accuracy as the baseline on the old tasks.
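
The idea of combining a distillation term with the new-task objective can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single output frame, a plain cross-entropy task loss, a KL-divergence distillation term, and hypothetical parameter names (`kd_weight`, `temperature`) that do not appear in the abstract. The frozen pre-trained model plays the role of the teacher and the model being adapted is the student.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, optionally softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def cross_entropy(probs, target_index, eps=1e-12):
    """Negative log-likelihood of the reference token (the new-task loss here)."""
    return -math.log(probs[target_index] + eps)

def incremental_loss(student_logits, teacher_logits, target_index,
                     kd_weight=0.5, temperature=1.0):
    """New-task loss plus a weighted distillation term (illustrative only).

    kd_weight and temperature are assumed hyperparameters, not values
    taken from the paper.
    """
    task_loss = cross_entropy(softmax(student_logits), target_index)
    kd_loss = kl_divergence(
        softmax(teacher_logits, temperature),  # old model: knowledge to retain
        softmax(student_logits, temperature),  # new model: being optimized
    )
    return task_loss + kd_weight * kd_loss
```

When the student's outputs match the teacher's, the distillation term vanishes and only the new-task loss remains; the further the student drifts from the previous model, the larger the penalty. Because the teacher's outputs are computed on the new data, no sample from the old dataset is needed.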

