PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition

by Cheng-I Jeff Lai, et al.

Recent work on speech self-supervised learning (speech SSL) demonstrated the benefits of scale in learning rich and transferable representations for Automatic Speech Recognition (ASR) with limited parallel data. It is then natural to investigate the existence of sparse and transferable subnetworks in pre-trained speech SSL models that can achieve even better low-resource ASR performance. However, directly applying widely adopted pruning methods such as the Lottery Ticket Hypothesis (LTH) is suboptimal in terms of the computational cost required. Moreover, contrary to what LTH predicts, the discovered subnetworks yield minimal performance gain compared to the original dense network. In this work, we propose Prune-Adjust-Re-Prune (PARP), which discovers and finetunes subnetworks for much better ASR performance while requiring only a single downstream finetuning run. PARP is inspired by our surprising observation that subnetworks pruned for pre-training tasks need only be slightly adjusted to achieve a sizeable performance boost in downstream ASR tasks. Extensive experiments on low-resource English and multi-lingual ASR show that (1) sparse subnetworks exist in pre-trained speech SSL models, and (2) PARP offers a computational advantage and a performance gain over baseline pruning methods. On the 10min Librispeech split without LM decoding, PARP discovers subnetworks from wav2vec 2.0 with an absolute 10.9%/12.6% WER decrease compared to the full model. We demonstrate that PARP mitigates performance degradation in cross-lingual mask transfer, and investigate the possibility of discovering a single subnetwork for 10 spoken languages in one run.
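
The prune, adjust, re-prune cycle named in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation: it assumes unstructured magnitude pruning, plain gradient descent, and a hypothetical `grad_fn` standing in for the downstream finetuning gradient; the names `magnitude_mask`, `parp_sketch`, `reprune_every`, and `lr` are illustrative only. The one idea it tries to capture is that pruned weights are zeroed but remain trainable during the adjust phase, so the mask can change before the next re-prune.

```python
import numpy as np


def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask keeping the largest-magnitude (1 - sparsity) fraction."""
    n_prune = int(round(sparsity * weights.size))
    mask = np.ones(weights.size, dtype=float)
    if n_prune > 0:
        # Indices of the n_prune smallest-magnitude entries (flattened view).
        prune_idx = np.argsort(np.abs(weights), axis=None, kind="stable")[:n_prune]
        mask[prune_idx] = 0.0
    return mask.reshape(weights.shape)


def parp_sketch(weights, grad_fn, sparsity, steps=100, reprune_every=10, lr=0.1):
    """Toy prune-adjust-re-prune loop (assumed form, not the paper's code)."""
    # Prune: zero out the smallest-magnitude weights of the starting point.
    w = weights * magnitude_mask(weights, sparsity)
    for step in range(1, steps + 1):
        # Adjust: update ALL weights, including those currently zeroed,
        # so a previously pruned weight may grow back.
        w = w - lr * grad_fn(w)
        if step % reprune_every == 0:
            # Re-prune: enforce the target sparsity again.
            w = w * magnitude_mask(w, sparsity)
    final_mask = magnitude_mask(w, sparsity)
    return w * final_mask, final_mask


# Hypothetical stand-ins: "pre-trained" weights and a quadratic downstream loss
# 0.5 * ||w - target||^2, whose gradient is simply w - target.
pretrained = np.array([0.1, -2.0, 0.05, 3.0])
target = np.array([0.0, 1.0, 5.0, 0.0])

initial_mask = magnitude_mask(pretrained, 0.5)
final_w, final_mask = parp_sketch(pretrained, lambda w: w - target, sparsity=0.5)
print("initial mask:", initial_mask)
print("final mask:  ", final_mask)
```

In this toy problem the mask obtained from the starting weights and the mask after the loop keep different entries, which loosely mirrors the abstract's remark that an initial subnetwork needs only slight adjustment for the downstream task.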
