Collaborative Training of Acoustic Encoders for Speech Recognition

06/16/2021
by   Varun Nagaraja, et al.

On-device speech recognition requires training models of different sizes for deployment on devices with various computational budgets. When building such different models, we can benefit from training them jointly to take advantage of the knowledge shared between them. Joint training is also efficient, since it reduces the redundancy in the training procedure's data-handling operations. We propose a method for collaboratively training acoustic encoders of different sizes for speech recognition. We use a sequence transducer setup in which the different acoustic encoders share common predictor and joiner modules. The acoustic encoders are also trained using co-distillation through an auxiliary task for frame-level chenone prediction, along with the transducer loss. We perform experiments using the LibriSpeech corpus and demonstrate that the collaboratively trained acoustic encoders can provide up to an 11% improvement in the word error rate on both test partitions.
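The objective described above combines, for each acoustic encoder, a transducer loss (computed through the shared predictor and joiner) with an auxiliary frame-level chenone cross-entropy term. A minimal sketch of that combination follows; the function name `collaborative_loss` and the auxiliary weighting scheme are illustrative assumptions, not the paper's implementation:

```python
# Sketch of a collaborative training objective: each encoder i contributes
# its own RNN-T (transducer) loss plus a weighted auxiliary chenone
# cross-entropy loss. The weight value is an assumption for illustration.

def collaborative_loss(transducer_losses, chenone_ce_losses, aux_weight=0.1):
    """Combine per-encoder transducer and auxiliary chenone losses.

    transducer_losses: per-encoder transducer loss values (one per encoder)
    chenone_ce_losses: per-encoder frame-level chenone CE loss values
    aux_weight: scalar weight on the auxiliary task (assumed value)
    """
    assert len(transducer_losses) == len(chenone_ce_losses)
    total = 0.0
    for l_rnnt, l_ce in zip(transducer_losses, chenone_ce_losses):
        total += l_rnnt + aux_weight * l_ce
    return total

# Example: one large and one small encoder trained jointly.
loss = collaborative_loss([12.0, 15.0], [2.0, 3.0], aux_weight=0.1)
```

In this sketch, gradients from every encoder would flow into the shared predictor and joiner, which is what lets the encoders benefit from each other during joint training.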


Related research

- Echo State Speech Recognition (02/18/2021)
- Dynamic Encoder Transducer: A Flexible Solution For Trading Off Accuracy For Latency (04/05/2021)
- Scene-aware Far-field Automatic Speech Recognition (04/21/2021)
- CoLLD: Contrastive Layer-to-layer Distillation for Compressing Multilingual Pre-trained Speech Encoders (09/14/2023)
- Optimizing expected word error rate via sampling for speech recognition (06/08/2017)
- Acoustic absement in detail: Quantifying acoustic differences across time-series representations of speech data (04/12/2023)
- Contrastive Siamese Network for Semi-supervised Speech Recognition (05/27/2022)
