Recycle-and-Distill: Universal Compression Strategy for Transformer-based Speech SSL Models with Attention Map Reusing and Masking Distillation

05/19/2023
by Kangwook Jang, et al.

Transformer-based speech self-supervised learning (SSL) models, such as HuBERT, show remarkable performance in various speech processing tasks. However, the huge number of parameters in speech SSL models necessitates compression to a more compact model for wider usage in academia or small companies. In this study, we suggest reusing attention maps across the Transformer layers, so as to remove key and query parameters while retaining the number of layers. Furthermore, we propose a novel masking distillation strategy to improve the student model's speech representation quality. We extend the distillation loss to utilize both masked and unmasked speech frames, to fully leverage the teacher model's high-quality representation. Our universal compression strategy yields a student model that achieves a phoneme error rate (PER) of 7.72% on the SUPERB benchmark.
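As a concrete illustration of the attention map reuse described above, the sketch below shows a lower Transformer layer that computes its attention map as usual, and an upper layer that drops its query and key projections entirely, applying the reused map to its own value projection. This is a minimal PyTorch sketch under our own naming (AttentionComputingLayer and AttentionReusingLayer are hypothetical), not the authors' released code.

```python
# Minimal sketch of attention map reuse across Transformer layers.
# Module and variable names are illustrative, not the authors' code.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionComputingLayer(nn.Module):
    """Standard self-attention that also returns its attention map."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        B, T, D = x.shape
        q = self.q_proj(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        # Attention map of shape (B, heads, T, T), reused by upper layers.
        attn = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.head_dim), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, D)
        return self.out_proj(out), attn


class AttentionReusingLayer(nn.Module):
    """Layer with no query/key projections: applies a reused attention map."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.v_proj = nn.Linear(dim, dim)   # only value/output params remain
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, attn: torch.Tensor):
        B, T, D = x.shape
        v = self.v_proj(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        out = (attn @ v).transpose(1, 2).reshape(B, T, D)
        return self.out_proj(out)
```

Since a reusing layer keeps only the value and output projections, it saves the query and key parameters while the number of layers stays unchanged, matching the strategy described in the abstract.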
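The extended distillation objective over both masked and unmasked frames can likewise be sketched as below. The per-frame L1 distance and the separate weights for masked and unmasked positions are illustrative assumptions; the paper's exact loss formulation may differ.

```python
# Minimal sketch of a distillation loss computed on both masked and
# unmasked frames. The loss form and weighting are assumptions.
import torch
import torch.nn.functional as F


def masking_distillation_loss(student_repr: torch.Tensor,
                              teacher_repr: torch.Tensor,
                              mask: torch.Tensor,
                              w_masked: float = 1.0,
                              w_unmasked: float = 1.0) -> torch.Tensor:
    """Distill teacher representations into the student.

    student_repr, teacher_repr: (batch, time, dim)
    mask: (batch, time) boolean, True where the input frame was masked.
    """
    # Per-frame L1 distance between student and teacher representations.
    per_frame = F.l1_loss(student_repr, teacher_repr, reduction="none").mean(-1)
    zero = per_frame.new_zeros(())
    masked_loss = per_frame[mask].mean() if mask.any() else zero
    unmasked_loss = per_frame[~mask].mean() if (~mask).any() else zero
    return w_masked * masked_loss + w_unmasked * unmasked_loss
```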
