Comparing CTC and LFMMI for out-of-domain adaptation of wav2vec 2.0 acoustic model

04/06/2021

In this work, we investigate whether wav2vec 2.0 self-supervised pretraining helps mitigate the overfitting issues of connectionist temporal classification (CTC) training, and thereby reduces its performance gap with flat-start lattice-free MMI (E2E-LFMMI), for automatic speech recognition with limited training data. To that end, we use the pretrained wav2vec 2.0 BASE model and fine-tune it on three different datasets, including out-of-domain (Switchboard) and cross-lingual (Babel) scenarios. Our results show that for supervised adaptation of the wav2vec 2.0 model, E2E-LFMMI and CTC achieve similar results, and both significantly outperform the baselines trained only with supervised data. Fine-tuning the wav2vec 2.0 model with E2E-LFMMI and CTC, we obtain the following relative WER improvements over the supervised baseline trained with E2E-LFMMI. We get relative improvements of 40% [...] on the clean set and 64% [...]. On Switchboard (300h) we obtain relative improvements of 33% [...], respectively. Finally, for Babel languages, we obtain relative improvements of 26% [...]
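The abstract reports its gains as relative WER improvements over a baseline. As a minimal sketch of how such a figure is typically computed, assuming the usual definition (baseline WER minus system WER, divided by baseline WER), the function name and the WER values below are hypothetical and are not results from the paper:

```python
def relative_wer_improvement(baseline_wer: float, system_wer: float) -> float:
    """Return the relative WER reduction of a system over a baseline, in percent.

    Assumes the common definition: 100 * (baseline - system) / baseline.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical WERs chosen only for illustration (not taken from the paper).
    improvement = relative_wer_improvement(baseline_wer=20.0, system_wer=15.0)
    print(f"{improvement:.1f}% relative WER improvement")
```

Under this definition, a drop from a baseline WER of 20.0 to 15.0 is a 25% relative improvement, even though the absolute difference is only 5 points.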
