DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT

10/05/2021 ∙ by Heng-Jui Chang, et al.

Self-supervised speech representation learning methods like wav2vec 2.0 and Hidden-unit BERT (HuBERT) leverage unlabeled speech data for pre-training and offer good representations for numerous speech processing tasks. Despite the success of these methods, they require large memory and high pre-training costs, making them inaccessible for researchers in academia and small companies. Therefore, this paper introduces DistilHuBERT, a novel multi-task learning framework to distill hidden representations from a HuBERT model directly. This method reduces HuBERT's size by 75% and makes it 73% faster while retaining most performance in ten different tasks. Moreover, DistilHuBERT required little training time and data, opening the possibilities of pre-training personal and on-device SSL models for speech.




1 Introduction

Recently, self-supervised learning (SSL) methods for speech representation learning have succeeded in many tasks [21, 20, 17, 18, 19, 9, 10, 8, 22, 30, 33, 3, 4, 31, 14, 26, 29, 1]. They learn from targets derived from unlabeled speech data. They can be roughly categorized into two classes by learning objective: generative and discriminative. For generative methods, models try to either reconstruct masked acoustic features [21, 20, 17, 18, 19] or generate future acoustic features [9, 10, 8]. For discriminative methods, models learn either by contrastive learning [22, 30, 33, 3, 4, 31] or by classifying pseudo labels, as in HuBERT [14].


Many previous SSL methods for speech representation learning were designed for only a few tasks, such as speech recognition. Therefore, the Speech processing Universal PERformance Benchmark (SUPERB) [35] and LeBenchmark [11] were developed to evaluate the effectiveness of these methods on general speech processing applications. These benchmarks comprehensively assess the capability of pre-trained models by applying them to multiple downstream tasks like recognition, detection, semantics, and speaker identification.

Although recent speech SSL methods show excellent performance across various tasks, many require large memory and high pre-training costs, including wav2vec 2.0 [4] and HuBERT [14]. These limitations make the models unsuitable for on-device computation and difficult for researchers in academia and small corporations to use [12]. Knowledge distillation is a common method for compressing models [13], in which a small student model is trained to generate the teacher model’s outputs or hidden representations. Distilling knowledge has been shown to be effective for NLP, and DistilBERT [32] and TinyBERT [15] are good examples. However, we found these approaches ineffective in distilling speech SSL models, and few studies have investigated this problem [28, 16]. Therefore, we propose a novel multi-task learning framework to distill hidden representations of SSL speech models layer by layer.

In this work, we distill HuBERT and obtain DistilHuBERT. DistilHuBERT uses three prediction heads to respectively predict the outputs of the 4th, 8th, and 12th HuBERT hidden layers. After training, the heads are removed because the multi-task learning paradigm forces the DistilHuBERT model to learn representations containing rich information. DistilHuBERT reduces HuBERT’s size by 75% and speeds up inference by 73%, retaining most performance and requiring less training time. Moreover, we offer comprehensive analyses of the proposed methods and show DistilHuBERT’s ability to distill knowledge with little data. To our knowledge, this is the first attempt to distill speech representations from SSL pre-trained models directly.

2 Methods

2.1 HuBERT

The proposed distillation approach can be applied to any self-supervised model. This paper takes HuBERT [14] as the teacher because it outperforms other methods in almost all tasks in SUPERB [35], showing it offers good speech representations. HuBERT consists of a CNN and a transformer encoder [34] and is trained to classify randomly masked frames into pseudo labels. The labels are obtained by clustering MFCCs or another model’s hidden units. Despite HuBERT’s success, it has two major disadvantages. First, HuBERT models have 95M to 1B parameters, which consumes large memory and slows inference. Second, the training costs are high, e.g., HuBERT Base pre-training requires 32 GPUs and 2k GPU hours. These issues make academia and small corporations unable to pre-train their models or apply these technologies to products. Hence, this paper aims to alleviate these problems.

2.2 DistilHuBERT

Figure 1: The proposed DistilHuBERT framework. (I) Before pre-training, the student is initialized with the teacher’s parameters. Then, the student’s prediction heads learn to generate the teacher’s hidden representations by minimizing the loss function in Eq. (1). (II) The pre-trained DistilHuBERT model is frozen to generate speech representations for downstream tasks.

In a typical knowledge distillation method, a student learns to generate the teacher’s output. However, different layers in self-supervised speech models contain different information like speaker identity or semantics [4, 25, 7]. An SSL model’s output layer does not always offer rich information. For instance, wav2vec 2.0 stores phonetic information in its middle layers [25, 2]. Therefore, a student that only learns the last layer of the teacher might benefit little. One possible way to solve this problem is to let different hidden layers of the student learn from different layers of the teacher [15]. However, this requires the student model to be deep enough to have multiple layers, which conflicts with the goal of having a small model.

This paper proposes a novel teacher-student framework for speech representation learning by multi-task knowledge distillation, as shown in Fig. 1. We distill HuBERT to obtain DistilHuBERT, consisting of a CNN feature extractor and a small transformer encoder. The basic idea of our approach is to learn to generate multiple hidden representations of the teacher from shared representations. Therefore, we propose predicting the teacher’s hidden representations with separate prediction heads, as shown in Fig. 1(I). This objective is a multi-task learning paradigm. It encourages the transformer encoder to produce compact representations for multiple prediction heads. After pre-training, we remove the heads. The model parameters are frozen, and the output representations are used for various downstream tasks, as shown in Fig. 1(II).

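Objective Function. Here, we describe the objective function of DistilHuBERT. We denote the $D$-dimensional feature vectors at time $t$ produced by the teacher’s $l$-th layer and the student’s corresponding prediction head respectively as $\hat{h}_t^{(l)}$ and $h_t^{(l)}$. The loss function is

$$\mathcal{L}^{(l)} = \mathcal{L}_{\ell_1}^{(l)} + \lambda \mathcal{L}_{\cos}^{(l)} = \sum_{t=1}^{T} \left[ \frac{1}{D} \left\| h_t^{(l)} - \hat{h}_t^{(l)} \right\|_1 - \lambda \log \sigma\left( \cos\left( h_t^{(l)}, \hat{h}_t^{(l)} \right) \right) \right], \tag{1}$$

where $T$ is the number of time steps. $\sigma$ and $\cos(\cdot, \cdot)$ respectively denote sigmoid activation and cosine similarity. Minimizing $\mathcal{L}^{(l)}$ is equivalent to simultaneously minimizing the $\ell_1$ distance and maximizing the cosine similarity between the hidden representations. We empirically found that considering both $\mathcal{L}_{\ell_1}$ and $\mathcal{L}_{\cos}$ yielded better performance than considering only one of them. $\lambda > 0$ controls the contribution of the cosine similarity loss.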
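As a rough illustration of the objective for a single teacher layer, the sketch below computes a per-layer loss in plain Python. It assumes the distance term is an L1 distance averaged over the feature dimension and that the cosine term enters as the negative log-sigmoid of the cosine similarity, weighted by `lam`. The function name `distill_loss` and the list-of-lists input layout are hypothetical choices made for this example, not the authors' implementation.

```python
import math

def distill_loss(student, teacher, lam=1.0):
    """Per-layer distillation loss, summed over time steps.

    student, teacher: lists of feature vectors (one list of floats per frame).
    lam: weight of the cosine similarity term (assumed form, see lead-in).
    """
    total = 0.0
    for h, h_hat in zip(student, teacher):
        dim = len(h)
        # L1 distance averaged over the feature dimension.
        l1 = sum(abs(a - b) for a, b in zip(h, h_hat)) / dim
        # Cosine similarity between the two vectors (0.0 if either is all zeros).
        dot = sum(a * b for a, b in zip(h, h_hat))
        norm = math.sqrt(sum(a * a for a in h)) * math.sqrt(sum(b * b for b in h_hat))
        cos = dot / norm if norm > 0 else 0.0
        # log(sigmoid(x)) = -log(1 + exp(-x)); cos lies in [-1, 1], so this is stable.
        log_sigmoid = -math.log1p(math.exp(-cos))
        total += l1 - lam * log_sigmoid
    return total

# Identical frames leave only the cosine term; opposite frames are penalized more.
same = distill_loss([[1.0, 0.0]], [[1.0, 0.0]])
opposite = distill_loss([[1.0, 0.0]], [[-1.0, 0.0]])
```

With several prediction heads, one such loss would be computed per predicted teacher layer. How the per-layer losses are combined is not spelled out here, so the sketch stops at a single layer.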

Parameter Initialization. Following DistilBERT [32], we initialized DistilHuBERT with HuBERT’s CNN extractor and the first two transformer layers.

Reducing Computation for Distillation. We note that the number of prediction heads can be 1 to $L$, where $L$ is the number of hidden layers in the self-supervised speech model to be distilled. Because neighboring layers’ representations might contain similar information, the student model only predicts some specific layers in the teacher model and skips the other layers in our implementation. For example, in Fig. 1, only the outputs of the 4th, 8th, and 12th HuBERT hidden layers are predicted. In the experiments, we will show that predicting these three layers is sufficient.

| Method | # param. (Millions) | PR PER | KS Acc | IC Acc | SID Acc | ER Acc | ASR WER | QbE MTWV | SF F1 / CER | ASV EER | SD DER | Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (I) Baselines [35] | | | | | | | | | | | | |
| FBANK | 0 | 82.01 | 8.63 | 9.10 | 8.5E-4 | 35.39 | 23.18 / 15.21 | 0.0058 | 69.64 / 52.94 | 9.56 | 10.05 | 8.8 |
| TERA | 21.33 | 49.17 | 89.48 | 58.42 | 57.57 | 56.27 | 18.17 / 12.16 | 0.0013 | 67.50 / 54.17 | 15.89 | 9.96 | 7.9 |
| wav2vec | 32.54 | 31.58 | 95.59 | 84.92 | 56.56 | 59.79 | 15.86 / 11.00 | 0.0485 | 76.37 / 43.71 | 7.99 | 9.90 | 6.5 |
| DeCoAR 2.0 | 85.12 | 14.93 | 94.48 | 90.80 | 74.42 | 62.47 | 13.02 / 9.07 | 0.0406 | 83.28 / 34.73 | 7.16 | 6.59 | 4.1 |
| HuBERT (teacher) | 94.68 | 5.41 | 96.30 | 98.34 | 81.42 | 64.92 | 6.42 / 4.79 | 0.0736 | 88.53 / 25.20 | 5.11 | 5.88 | 1.0 |
| (II) Distillation Baselines | | | | | | | | | | | | |
| predict last layer | 24.67 | 16.71 | 96.07 | 95.57 | 43.44 | 62.81 | 13.67 / 9.43 | 0.0423 | 81.52 / 37.31 | 7.29 | 6.45 | 4.6 |
| predict w/ hidden | 23.49 | 17.43 | 95.46 | 94.91 | 54.90 | 61.49 | 14.95 / 10.27 | 0.0529 | 83.73 / 35.14 | 7.41 | 6.86 | 5.0 |
| (III) Proposed | | | | | | | | | | | | |
| w/ prediction heads | 27.03 | 14.35 | 96.20 | 94.17 | 72.83 | 62.73 | 13.26 / 8.99 | 0.0451 | 83.14 / 35.53 | 6.85 | 5.97 | 3.3 |
| DistilHuBERT | 23.49 | 16.27 | 95.98 | 94.99 | 73.54 | 63.02 | 13.34 / 9.21 | 0.0511 | 82.57 / 35.59 | 8.55 | 6.19 | 3.8 |

Table 1: Results on SUPERB [35]. The metrics include accuracy (Acc%), phoneme error rate (PER%), word error rate (WER%), maximum term weighted value (MTWV), F1 score (F1%), concept error rate (CER%), equal error rate (EER%), and diarization error rate (DER%).

3 Experiments

Figure 2: Comparisons of pre-trained model size vs. performance on three tasks: (a) IC, (b) SID, and (c) ASR (w/o LM).

3.1 Experimental Setup

Experiments were implemented with S3PRL [20, 21] and fairseq [23].

Model. The HuBERT model had a 7-layer CNN and a 12-layer transformer encoder, while DistilHuBERT had a similar architecture, but its transformer encoder had only two layers. DistilHuBERT had three prediction heads respectively predicting the 4th, 8th, and 12th HuBERT hidden layers. We applied past distillation methods to DistilHuBERT for comparison [32, 15]. As shown in Table 1, “predict last layer” predicted only the final output of HuBERT. “Predict w/ hidden” was trained to predict the three chosen layers with its hidden layers, i.e., the CNN and the two transformer encoder layers respectively predicted the 4th, 8th, and 12th HuBERT layers. “W/ prediction heads” preserved the three prediction heads after pre-training.


Data. We used the 960-hour LibriSpeech dataset [24] for knowledge distillation, except in Sec. 3.6, where we used the English Wall Street Journal (WSJ) [27] and the Mandarin AISHELL-1 [5] speech corpora.

Pre-training. The default DistilHuBERT was trained on a single 32GB V100 GPU for 200k updates with a batch size of 24 utterances, taking roughly 55 hours. The learning rate was linearly increased to 2e-4 in the first 7% of updates, then linearly decreased to zero. The weight of the cosine similarity loss in Eq. (1) was set to $\lambda = 1$. Note that training DistilHuBERT was cheaper than training HuBERT, since the HuBERT Base model required 2k GPU hours.
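The learning-rate schedule described above can be sketched as a small function. This is a minimal illustration, assuming the warm-up starts from zero and that the 7% warm-up is measured in update steps. The function name `learning_rate` and its parameter names are hypothetical.

```python
def learning_rate(step, total_steps=200_000, peak_lr=2e-4, warmup_ratio=0.07):
    """Linear warm-up to peak_lr, then linear decay to zero.

    Assumes the schedule starts at 0 and is indexed by update step.
    """
    warmup_steps = round(total_steps * warmup_ratio)  # 14,000 with the defaults
    if step < warmup_steps:
        # Warm-up phase: grow linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: shrink linearly from peak_lr to 0 at total_steps.
    decay_steps = total_steps - warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)

# Sample the schedule at the start, mid warm-up, the peak, and the end.
schedule = [learning_rate(s) for s in (0, 7_000, 14_000, 200_000)]
```

With the default values, the peak is reached at update 14,000 and the rate returns to zero at update 200,000.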

3.2 SUPERB

DistilHuBERT was evaluated on SUPERB [35]. Each SSL model was frozen, and a trainable weighted sum summarized the hidden representations. Each downstream task used a simple model with minimal labeled data using the summarized representations. The tasks included phoneme recognition (PR), keyword spotting (KS), intent classification (IC), speaker identification (SID), emotion recognition (ER), automatic speech recognition (ASR), query by example spoken term detection (QbE), slot filling (SF), automatic speaker verification (ASV), and speaker diarization (SD).
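The weighted-sum summarization mentioned above can be illustrated as follows. This sketch assumes the per-layer weights are normalized with a softmax before mixing, which is one common choice and not something stated in the text. The weights are fixed numbers here, whereas in a downstream task they would be trained together with the task model. The names `softmax` and `weighted_sum` are hypothetical helpers.

```python
import math

def softmax(weights):
    """Normalize raw weights into positive values that sum to one."""
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(layer_outputs, raw_weights):
    """Mix per-layer features into one sequence.

    layer_outputs: list over layers, each a list of frames, each a list of floats.
    raw_weights: one scalar per layer (assumed to be softmax-normalized).
    """
    probs = softmax(raw_weights)
    n_frames = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    summary = [[0.0] * dim for _ in range(n_frames)]
    for prob, layer in zip(probs, layer_outputs):
        for t in range(n_frames):
            for d in range(dim):
                summary[t][d] += prob * layer[t][d]
    return summary

# Two layers with equal raw weights are simply averaged.
mixed = weighted_sum([[[1.0, 2.0]], [[3.0, 4.0]]], [0.0, 0.0])
```

After training, the learned weights indicate how much each layer contributes to a given task.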

Results on SUPERB are shown in Table 1. (We only show some SUPERB results because of the paper length limitation; a complete comparison is available at https://superbbenchmark.org/.) The rightmost column was each method’s ranking score, obtained by averaging its ranks in the ten tasks, offering an easy way to compare the performance of the methods. First, the ranking scores revealed that DistilHuBERT outperformed all methods except HuBERT (its teacher), indicating DistilHuBERT retained most of HuBERT’s performance (sec. (III) vs. (I)). Next, the distillation baselines in section (II) performed worse than DistilHuBERT, especially in PR and SID tasks, showing the proposed framework distilled better representations using separate prediction heads. Moreover, DistilHuBERT slightly degraded without the prediction heads, but removing them reduced the model size by 13%, implying the prediction heads were redundant after pre-training.
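To make the ranking score concrete, the sketch below averages per-task ranks for four methods on three tasks, using the IC, SID, and ASR (w/o LM) values from Table 1. It is only an illustration: the Rank column in Table 1 is computed over ten tasks and more methods, so the numbers produced here are not expected to match it. The function name `average_ranks` is hypothetical, and ties are not handled.

```python
def average_ranks(scores, higher_is_better):
    """Average each method's rank over tasks (rank 1 is best).

    scores: {task: {method: value}}
    higher_is_better: {task: bool}, False for error rates such as WER.
    """
    methods = list(next(iter(scores.values())))
    totals = {method: 0 for method in methods}
    for task, results in scores.items():
        ordered = sorted(methods, key=lambda m: results[m],
                         reverse=higher_is_better[task])
        for rank, method in enumerate(ordered, start=1):
            totals[method] += rank
    return {method: totals[method] / len(scores) for method in methods}

# Values copied from Table 1 (IC Acc, SID Acc, ASR WER without LM).
scores = {
    "IC": {"FBANK": 9.10, "wav2vec": 84.92, "HuBERT": 98.34, "DistilHuBERT": 94.99},
    "SID": {"FBANK": 8.5e-4, "wav2vec": 56.56, "HuBERT": 81.42, "DistilHuBERT": 73.54},
    "ASR": {"FBANK": 23.18, "wav2vec": 15.86, "HuBERT": 6.42, "DistilHuBERT": 13.34},
}
ranks = average_ranks(scores, {"IC": True, "SID": True, "ASR": False})
```

On this subset, HuBERT ranks first on every task and DistilHuBERT second, which agrees with the ordering of these four methods in the full table.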

Furthermore, Fig. 2 visualizes the relation between the model sizes and their performance on IC, SID, and ASR. We chose these three tasks for comparison since the performance differences were easier to observe on them. In Fig. 2(a)(b), DistilHuBERT showed competitive performance compared with HuBERT and wav2vec 2.0 and surpassed all other similar-sized methods in IC and SID. In Fig. 2(c), DistilHuBERT degraded more in ASR yet was still better than most other SSL methods, showing a better trade-off between model size and performance.

| Model | # param. (Millions) | Inf. time (seconds) |
|---|---|---|
| HuBERT | 94.68 (100%) | 992 (1.00X) |
| DeCoAR 2.0 | 85.12 (90%) | 924 (1.07X) |
| DistilHuBERT | 23.49 (25%) | 574 (1.73X) |

Table 2: Comparison of different SSL model sizes and inference time. The inference was performed on 4 CPUs extracting all features from the LibriSpeech dev-clean set with a batch size of one. Results were averaged over three runs.

3.3 Model Size and Inference Speed

We compared the sizes and inference speed of several methods, as shown in Table 2. DistilHuBERT had a significantly smaller model size than HuBERT and offered a 73% speedup. DistilHuBERT performed comparably to DeCoAR 2.0 but was smaller and faster. Combining the results in Tables 1 and 2, DistilHuBERT was fast, small, and powerful. Note that DistilHuBERT can be further compressed by pruning or quantization for on-device computation.
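The percentages and speedup factors in Table 2 follow from simple ratios against HuBERT. The short sketch below reproduces that arithmetic from the parameter counts and inference times listed in the table. The helper names `relative_size` and `speedup` are hypothetical.

```python
# Parameter counts (millions) and inference times (seconds) from Table 2.
params = {"HuBERT": 94.68, "DeCoAR 2.0": 85.12, "DistilHuBERT": 23.49}
seconds = {"HuBERT": 992, "DeCoAR 2.0": 924, "DistilHuBERT": 574}

def relative_size(model, baseline="HuBERT"):
    """Fraction of the baseline's parameter count."""
    return params[model] / params[baseline]

def speedup(model, baseline="HuBERT"):
    """How many times faster than the baseline the model runs."""
    return seconds[baseline] / seconds[model]

size_pct = round(relative_size("DistilHuBERT") * 100)  # percent of HuBERT's size
faster = round(speedup("DistilHuBERT"), 2)             # speedup factor over HuBERT
```

A relative size of about 25% corresponds to the 75% size reduction, and a speedup factor of about 1.73 corresponds to the 73% speedup quoted in the text.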

3.4 Ablation Study

| Method | IC Acc | SID Acc | ASR WER |
|---|---|---|---|
| DistilHuBERT | 94.99 | 73.54 | 13.34 |
| w/o cosine similarity loss | 93.36 | 72.54 | 13.74 |
| w/o teacher init | 94.20 | 73.68 | 13.41 |

Table 3: Ablation study of the techniques used.

Table 3 shows the effectiveness of the cosine similarity loss in Eq. (1) and teacher initialization. Both techniques showed some benefits to training DistilHuBERT, and we thus used them in the rest of the experiments. Removing the cosine similarity loss degraded performance more than removing teacher initialization, showing that the former had more impact on improving DistilHuBERT, while the latter was less effective.

3.5 Layer Selection

Figure 3: Weights of the feature summarization in several SUPERB tasks after training on DistilHuBERT representations (predicting all hidden layers in HuBERT). The weights were normalized by the averaged norm of each layer’s output. feat and hid respectively denote the CNN’s and transformer’s outputs.
| Predicted Layers | IC Acc | SID Acc | ASR WER |
|---|---|---|---|
| 4, 8, 12 | 94.99 | 73.54 | 13.34 |
| 4 | 79.09 | 76.85 | 14.90 |
| 8 | 96.89 | 61.52 | 12.28 |
| 12 | 95.52 | 65.14 | 13.48 |
| 4, 8 | 91.67 | 75.36 | 13.73 |
| 4, 12 | 91.88 | 74.48 | 14.03 |
| 8, 12 | 96.34 | 65.30 | 13.03 |

Table 4: Results of predicting different HuBERT hidden layers.

This section discusses the selection of HuBERT’s hidden layers for DistilHuBERT to learn. Because each HuBERT layer contains different information, we first inspected each layer’s importance for several downstream tasks. We trained a DistilHuBERT model to predict all HuBERT layers with separate heads and preserved the heads after training because we wished to reveal each prediction head’s importance to each task. The hidden and predicted representations were summarized by a learnable weighted sum and applied to all SUPERB tasks. Results are shown in Fig. 3, where larger values indicate higher importance. We only showed ASR, KS, IC, and SID because they represent recognition, detection, semantics, and speaker identity tasks, while the others showed similar trends.

Fig. 3 showed that the features before the prediction heads (hid) were useful, especially for SID. These results corroborated our hypothesis that the shared representation for multiple heads carried rich information. Among the head outputs, the heads for the 6th to 12th layers were also crucial to ASR, KS, and IC because they had content and semantic information. With the above findings, we made DistilHuBERT predict the 8th and 12th HuBERT layers. DistilHuBERT additionally predicted the 4th layer because we hypothesized that the bottom layers preserved speaker identity.

To justify our selection method, we conducted experiments by predicting different HuBERT layers, as shown in Table 4. First, when predicting only one HuBERT layer, the evaluation scores among the three tasks were imbalanced. Predicting the 4th layer performed the worst in IC and ASR but better in SID than predicting three layers, corroborating our hypothesis that the bottom layers offered speaker identity. Then, when predicting two of the selected layers, the downstream tasks’ scores were also imbalanced. Therefore, the three selected layers offered a good combination for learning representations with less biased information.

3.6 Knowledge Distillation with Different Datasets

| Dataset | Data size (hours) | IC Acc | SID Acc | ASR WER |
|---|---|---|---|---|
| LibriSpeech | 960 | 94.99 | 73.54 | 13.34 |
| LibriSpeech | 100 | 93.17 | 69.46 | 14.77 |
| WSJ | 81 | 90.22 | 64.14 | 15.59 |
| AISHELL-1 | 150 | 87.29 | 67.65 | 16.42 |

Table 5: Pre-training DistilHuBERT with different datasets.

In practice, the pre-training data for a self-supervised teacher model might be inaccessible or too large to reuse for distillation. This section inspected the generalizability and flexibility of DistilHuBERT by training it with small or out-of-domain datasets. We used the 100-hour subset of LibriSpeech, 81-hour WSJ, and the Mandarin AISHELL-1 corpora to train DistilHuBERT by the proposed knowledge distillation approach. The results are shown in Table 5.

First, when trained with the 100-hour LibriSpeech subset, DistilHuBERT degraded slightly but retained better performance than many other methods (in Table 1), indicating that using a smaller dataset to distill knowledge was sufficient. Next, the smaller WSJ corpus showed more degradation, perhaps because WSJ’s average utterance duration was shorter than LibriSpeech’s [6] or simply because the corpus was small. Furthermore, training with the Mandarin AISHELL-1 corpus offered better SID than WSJ because speaker identity was independent of language and AISHELL-1 had more data than WSJ. In contrast, AISHELL-1 performed the worst in IC and ASR because of the language mismatch. Although the performance was affected by training data, DistilHuBERT offered better performance than many other SSL methods, even with smaller datasets for distillation.

4 Conclusion

This paper proposes DistilHuBERT, a novel framework to distill knowledge from HuBERT layer by layer. DistilHuBERT retained most of HuBERT’s performance and had only 25% of its size. We provided comprehensive analyses of the proposed methods and demonstrated DistilHuBERT’s flexibility and generalizability. Our method can be easily applied to more powerful SSL models.


  • [1] A. Baevski, M. Auli, and A. Mohamed (2019) Effectiveness of self-supervised pre-training for speech recognition. arXiv preprint arXiv:1911.03912. Cited by: §1.
  • [2] A. Baevski, W. Hsu, A. Conneau, and M. Auli (2021) Unsupervised speech recognition. arXiv preprint arXiv:2105.11084. Cited by: §2.2.
  • [3] A. Baevski, S. Schneider, and M. Auli (2020) Vq-wav2vec: self-supervised learning of discrete speech representations. In ICLR, Cited by: §1.
  • [4] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli (2020) Wav2vec 2.0: a framework for self-supervised learning of speech representations. In NeurIPS, Cited by: §1, §1, §2.2.
  • [5] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng (2017) AISHELL-1: an open-source mandarin speech corpus and a speech recognition baseline. In O-COCOSDA, Cited by: §3.1.
  • [6] H. Chang, H. Lee, and L. Lee (2021) Towards lifelong learning of end-to-end ASR. In Interspeech, Cited by: §3.6.
  • [7] X. Chang, T. Maekaku, P. Guo, J. Shi, Y. Lu, A. S. Subramanian, T. Wang, S. Yang, Y. Tsao, H. Lee, and S. Watanabe (2021) An exploration of self-supervised pretrained representations for end-to-end speech recognition. In ASRU, Cited by: §2.2.
  • [8] Y. Chung and J. Glass (2020) Improved speech representations with multi-target autoregressive predictive coding. In ACL, Cited by: §1.
  • [9] Y. Chung, W. Hsu, H. Tang, and J. Glass (2019) An unsupervised autoregressive model for speech representation learning. In Interspeech, Cited by: §1.
  • [10] Y. Chung, H. Tang, and J. Glass (2020) Vector-quantized autoregressive predictive coding. In Interspeech, Cited by: §1.
  • [11] S. Evain, H. Nguyen, H. Le, M. Z. Boito, S. Mdhaffar, S. Alisamir, Z. Tong, N. Tomashenko, M. Dinarelli, T. Parcollet, A. Allauzen, Y. Estève, B. Lecouteux, F. Portet, S. Rossato, F. Ringeval, D. Schwab, and L. Besacier (2021) LeBenchmark: a reproducible framework for assessing self-supervised representation learning from speech. In Interspeech, Cited by: §1.
  • [12] A. Hannun (2021) The history of speech recognition to the year 2030. arXiv preprint arXiv:2108.00084. Cited by: §1.
  • [13] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §1.
  • [14] W. Hsu, B. Bolte, Y. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed (2021) HuBERT: self-supervised speech representation learning by masked prediction of hidden units. arXiv preprint arXiv:2106.07447. Cited by: §1, §1, §2.1.
  • [15] X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu (2020) TinyBERT: distilling bert for natural language understanding. In EMNLP, Cited by: §1, §2.2, §3.1.
  • [16] C. J. Lai, Y. Zhang, A. H. Liu, S. Chang, Y. Liao, Y. Chuang, K. Qian, S. Khurana, D. Cox, and J. Glass (2021) PARP: prune, adjust and re-prune for self-supervised speech recognition. NeurIPS. Cited by: §1.
  • [17] S. Ling, Y. Liu, J. Salazar, and K. Kirchhoff (2020) Deep contextualized acoustic representations for semi-supervised speech recognition. In ICASSP, Cited by: §1.
  • [18] S. Ling and Y. Liu (2020) DeCoAR 2.0: deep contextualized acoustic representations with vector quantization. arXiv preprint arXiv:2012.06659. Cited by: §1.
  • [19] A. H. Liu, Y. Chung, and J. Glass (2021) Non-autoregressive predictive coding for learning speech representations from local dependencies. In Interspeech, Cited by: §1.
  • [20] A. T. Liu, S. Li, and H. Lee (2021) TERA: self-supervised learning of transformer encoder representation for speech. TASLP 29. Cited by: §1, §3.1.
  • [21] A. T. Liu, S. Yang, P. Chi, P. Hsu, and H. Lee (2020) Mockingjay: unsupervised speech representation learning with deep bidirectional transformer encoders. In ICASSP, Cited by: §1, §3.1.
  • [22] A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §1.
  • [23] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli (2019) Fairseq: a fast, extensible toolkit for sequence modeling. In NAACL-HLT: Demonstrations, Cited by: §3.1.
  • [24] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: an ASR corpus based on public domain audio books. In ICASSP, Cited by: §3.1.
  • [25] A. Pasad, J. Chou, and K. Livescu (2021) Layer-wise analysis of a self-supervised speech representation model. arXiv preprint arXiv:2107.04734. Cited by: §2.2.
  • [26] S. Pascual, M. Ravanelli, J. Serrà, A. Bonafonte, and Y. Bengio (2019) Learning problem-agnostic speech representations from multiple self-supervised tasks. In Interspeech, Cited by: §1.
  • [27] D. B. Paul and J. M. Baker (1992) The design for the Wall Street Journal-based CSR corpus. In HLT, Cited by: §3.1.
  • [28] Z. Peng, A. Budhkar, I. Tuil, J. Levy, P. Sobhani, R. Cohen, and J. Nassour (2021) Shrinking bigfoot: reducing wav2vec 2.0 footprint. arXiv preprint arXiv:2103.15760. Cited by: §1.
  • [29] M. Ravanelli, J. Zhong, S. Pascual, P. Swietojanski, J. Monteiro, J. Trmal, and Y. Bengio (2020) Multi-task self-supervised learning for robust speech recognition. In ICASSP, Cited by: §1.
  • [30] M. Riviere, A. Joulin, P. Mazaré, and E. Dupoux (2020) Unsupervised pretraining transfers well across languages. In ICASSP, Cited by: §1.
  • [31] S. Sadhu, D. He, C. Huang, S. H. Mallidi, M. Wu, A. Rastrow, A. Stolcke, J. Droppo, and R. Maas (2021) wav2vec-C: a self-supervised model for speech representation learning. In Interspeech, Cited by: §1.
  • [32] V. Sanh, L. Debut, J. Chaumond, and T. Wolf (2019) DistilBERT, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108. Cited by: §1, §2.2, §3.1.
  • [33] S. Schneider, A. Baevski, R. Collobert, and M. Auli (2019) Wav2vec: unsupervised pre-training for speech recognition. In Interspeech, Cited by: §1.
  • [34] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, Cited by: §2.1.
  • [35] S. Yang, P. Chi, Y. Chuang, C. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G. Lin, et al. (2021) SUPERB: speech processing universal performance benchmark. In Interspeech, Cited by: §1, §2.1, Table 1, §3.2.