Robust Knowledge Distillation from RNN-T Models With Noisy Training Labels Using Full-Sum Loss

03/10/2023
by Mohammad Zeineldeen, et al.

This work studies knowledge distillation (KD) and addresses its constraints for recurrent neural network transducer (RNN-T) models. In hard distillation, a teacher model transcribes large amounts of unlabelled speech, and the resulting transcripts are used to train a student model. Soft distillation is another popular KD method that distills the output logits of the teacher model. Due to the nature of RNN-T alignments, applying soft distillation between RNN-T architectures with different posterior distributions is challenging. In addition, bad teachers with a high word error rate (WER) reduce the efficacy of KD. We investigate how to effectively distill knowledge from variable-quality ASR teachers, which, to the best of our knowledge, has not been studied before. We show that a sequence-level KD method, full-sum distillation, outperforms other distillation methods for RNN-T models, especially for bad teachers. We also propose a variant of full-sum distillation that distills the sequence-discriminative knowledge of the teacher, leading to further WER improvements. We conduct experiments on the public SpeechStew and LibriSpeech datasets, and on in-house production data.
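The quantity underlying both standard RNN-T training and the sequence-level distillation variants discussed above is the full-sum RNN-T loss: the negative log posterior -log P(y|x), obtained by summing over all monotonic alignments of the label sequence to the encoder frames. Below is a minimal, illustrative sketch of this quantity using torchaudio's `rnnt_loss`; the tensor shapes and random stand-in values are assumptions for demonstration only, not the paper's training recipe.

```python
# Minimal sketch of the full-sum RNN-T loss with torchaudio (illustrative
# shapes and random stand-in tensors; not the paper's exact setup).
import torch
import torchaudio.functional as taf

batch, T, U, vocab = 2, 50, 10, 100   # T encoder frames, U target labels
blank = 0

# Joiner output: one distribution per (frame, target-position) lattice node.
logits = torch.randn(batch, T, U + 1, vocab, requires_grad=True)
targets = torch.randint(1, vocab, (batch, U), dtype=torch.int32)
logit_lengths = torch.full((batch,), T, dtype=torch.int32)
target_lengths = torch.full((batch,), U, dtype=torch.int32)

# rnnt_loss returns -log P(y|x): the posterior of the whole label sequence,
# summed over all monotonic alignments through the lattice (the "full sum").
nll = taf.rnnt_loss(logits, targets, logit_lengths, target_lengths,
                    blank=blank, reduction="mean")
nll.backward()
```

In hard distillation, `targets` would simply be a teacher-decoded transcript of unlabelled audio; soft distillation would instead match the student's per-node logits to the teacher's, which is what becomes problematic when the two models produce differently aligned lattices.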

Related research

11/11/2020 - Efficient Knowledge Distillation for RNN-Transducer Models
Knowledge Distillation is an effective method of transferring knowledge ...

10/20/2020 - Knowledge Distillation in Wide Neural Networks: Risk Bound, Data Efficiency and Imperfect Teacher
Knowledge distillation is a strategy of training a student network with ...

10/03/2019 - On the Efficacy of Knowledge Distillation
In this paper, we present a thorough evaluation of the efficacy of knowl...

01/08/2022 - Two-Pass End-to-End ASR Model Compression
Speech recognition on smart devices is challenging owing to the small me...

12/08/2022 - Knowledge Distillation Applied to Optical Channel Equalization: Solving the Parallelization Problem of Recurrent Connection
To circumvent the non-parallelizability of recurrent neural network-base...

06/14/2021 - CoDERT: Distilling Encoder Representations with Co-learning for Transducer-based Speech Recognition
We propose a simple yet effective method to compress an RNN-Transducer (...

04/08/2019 - Knowledge Distillation For Recurrent Neural Network Language Modeling With Trust Regularization
Recurrent Neural Networks (RNNs) have dominated language modeling becaus...
