Deep versus Wide: An Analysis of Student Architectures for Task-Agnostic Knowledge Distillation of Self-Supervised Speech Models

07/14/2022
by Takanori Ashihara, et al.

Self-supervised learning (SSL) has shown promising performance on a variety of speech downstream tasks. However, SSL models generally have so many parameters that training and inference demand substantial memory and computation, so it is desirable to produce compact SSL models without significant performance degradation by applying compression methods such as knowledge distillation (KD). Although KD can shrink the depth and/or width of an SSL model's structure, little research has examined how varying the depth and width affects the internal representations of the resulting small-footprint model. This paper provides an empirical study that addresses this question. We evaluate performance on SUPERB while varying the student architecture and the KD method under a fixed parameter budget, which allows us to analyze the contribution of the representations induced by different model architectures. The experiments demonstrate that a certain depth is essential for solving content-oriented tasks (e.g., automatic speech recognition) accurately, whereas a certain width is necessary for achieving high performance on several speaker-oriented tasks (e.g., speaker identification). Based on these observations, we identify a model for SUPERB that is more compressed yet performs better than those of previous studies.
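To illustrate the deep-versus-wide comparison at a matched parameter budget, the following PyTorch sketch builds two hypothetical Transformer student encoders, one deeper and narrower and one shallower and wider, with roughly the same number of parameters, and shows a simple layer-wise distillation loss (L1 plus cosine similarity, in the spirit of DistilHuBERT-style KD). The layer counts, hidden sizes, and loss are illustrative assumptions, not the exact student configurations or KD objective used in the paper.

```python
# Minimal sketch (PyTorch assumed): "deep & narrow" vs. "shallow & wide" students
# at a roughly matched parameter budget, plus a simple layer-wise KD loss.
# All dimensions and the loss form are illustrative, not the paper's settings.
import torch
import torch.nn as nn


def build_student(num_layers: int, d_model: int, n_heads: int) -> nn.TransformerEncoder:
    """Build a Transformer encoder student with the given depth and width."""
    layer = nn.TransformerEncoderLayer(
        d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model, batch_first=True
    )
    return nn.TransformerEncoder(layer, num_layers=num_layers)


def count_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())


# Two hypothetical students with a similar total parameter count.
deep_narrow = build_student(num_layers=12, d_model=384, n_heads=6)   # deeper, thinner
shallow_wide = build_student(num_layers=3, d_model=768, n_heads=12)  # shallower, wider
print(f"deep & narrow : {count_params(deep_narrow) / 1e6:.1f} M params")
print(f"shallow & wide: {count_params(shallow_wide) / 1e6:.1f} M params")


def layerwise_kd_loss(student_hidden: list, teacher_hidden: list) -> torch.Tensor:
    """L1 + negative cosine-similarity loss between matched student/teacher
    hidden states, a common layer-wise KD objective for speech SSL models."""
    loss = torch.tensor(0.0)
    for s, t in zip(student_hidden, teacher_hidden):
        loss = loss + nn.functional.l1_loss(s, t) \
               - nn.functional.cosine_similarity(s, t, dim=-1).mean()
    return loss / len(student_hidden)
```

Because a Transformer layer's parameter count scales roughly with the square of its hidden size, halving the width while quadrupling the depth keeps the budget approximately constant, which is what lets depth and width be traded off fairly in such a comparison.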


