
Multiple-hypothesis CTC-based semi-supervised adaptation of end-to-end speech recognition

by Cong-Thanh Do, et al.

This paper proposes an adaptation method for end-to-end speech recognition in which multiple automatic speech recognition (ASR) 1-best hypotheses are integrated into the computation of the connectionist temporal classification (CTC) loss function. Integrating multiple hypotheses alleviates the impact of errors in any single hypothesis on the CTC loss when ASR hypotheses are used as training targets. In semi-supervised adaptation scenarios, where part of the adaptation data is unlabeled, the CTC loss of the proposed method is computed from different ASR 1-best hypotheses obtained by decoding the unlabeled adaptation data. Experiments are performed in clean and multi-condition training scenarios, where the CTC-based end-to-end ASR systems are trained on Wall Street Journal (WSJ) clean training data and CHiME-4 multi-condition training data, respectively, and tested on Aurora-4 test data. The proposed adaptation method yields 6.6% and 5.8% relative word error rate (WER) reductions in the clean and multi-condition training scenarios, respectively, compared to a baseline system adapted by back-propagation fine-tuning on the portion of the adaptation data that has manual transcriptions.
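As a rough illustration (not the authors' implementation), a multi-hypothesis CTC loss can be sketched as a weighted combination of standard CTC losses, one per 1-best hypothesis. The function names and the uniform default weighting below are assumptions; for clarity, the standard CTC forward (alpha) recursion is written out in plain Python over raw probabilities rather than log-probabilities:

```python
import math

def ctc_neg_log_likelihood(probs, labels, blank=0):
    """Standard CTC loss via the forward (alpha) recursion.

    probs: T frames, each a list of per-symbol probabilities.
    labels: target label sequence without blanks.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S, T = len(ext), len(probs)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][ext[0]]
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]
            if s > 0:
                a += alpha[t - 1][s - 1]
            # Skip transition is allowed only between distinct non-blank labels
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t][s] = a * probs[t][ext[s]]
    total = alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)
    return -math.log(total)

def multi_hypothesis_ctc_loss(probs, hypotheses, weights=None):
    """Weighted sum of CTC losses against several ASR 1-best hypotheses."""
    if weights is None:  # uniform weighting is an assumption, not the paper's choice
        weights = [1.0 / len(hypotheses)] * len(hypotheses)
    return sum(w * ctc_neg_log_likelihood(probs, h)
               for w, h in zip(weights, hypotheses))

# Toy usage: vocabulary {0: blank, 1, 2}, two frames of posteriors.
frames = [[1 / 3] * 3, [1 / 3] * 3]
loss = multi_hypothesis_ctc_loss(frames, [[1], [2]])
```

In an actual semi-supervised setup, the hypotheses would come from decoding the unlabeled adaptation data, and the weights could plausibly reflect per-hypothesis confidence, though the exact weighting scheme here is illustrative only.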




Multiple-hypothesis RNN-T Loss for Unsupervised Fine-tuning and Self-training of Neural Transducer

This paper proposes a new approach to perform unsupervised fine-tuning a...

Exploiting semi-supervised training through a dropout regularization in end-to-end speech recognition

In this paper, we explore various approaches for semi-supervised learnin...

Sequence-level self-learning with multiple hypotheses

In this work, we develop new self-learning techniques with an attention-...

Hystoc: Obtaining word confidences for fusion of end-to-end ASR systems

End-to-end (e2e) systems have recently gained wide popularity in automat...

DNN adaptation by automatic quality estimation of ASR hypotheses

In this paper we propose to exploit the automatic Quality Estimation (QE...

Multimodal Grounding for Sequence-to-Sequence Speech Recognition

Humans are capable of processing speech by making use of multiple sensor...

Iterative pseudo-forced alignment by acoustic CTC loss for self-supervised ASR domain adaptation

High-quality data labeling from specific domains is costly and human tim...