UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training

by   Sanyuan Chen, et al.

Self-supervised learning (SSL) is a long-standing goal for speech processing, since it utilizes large-scale unlabeled data and avoids extensive human labeling. Recent years have witnessed great successes in applying self-supervised learning to speech recognition, while only limited exploration has been attempted in applying SSL to modeling speaker characteristics. In this paper, we aim to improve the existing SSL framework for speaker representation learning. Two methods are introduced for enhancing unsupervised speaker information extraction. First, we apply multi-task learning to the current SSL framework, where we integrate an utterance-wise contrastive loss with the SSL objective function. Second, for better speaker discrimination, we propose an utterance mixing strategy for data augmentation, where additional overlapped utterances are created in an unsupervised manner and incorporated during training. We integrate the proposed methods into the HuBERT framework. Experimental results on the SUPERB benchmark show that the proposed system achieves state-of-the-art performance in universal representation learning, especially for speaker identification oriented tasks. An ablation study is performed to verify the efficacy of each proposed method. Finally, we scale up the training dataset to 94 thousand hours of public audio data and achieve further performance improvement in all SUPERB tasks.



1 Introduction

Self-supervised learning has achieved great success in natural language processing, where a large amount of unlabeled data is used to learn universal representations. These representations enjoy outstanding generalizability, re-usability, and effectiveness, and thus bring significant performance improvements when employed by various downstream tasks. Motivated by this, a series of works in speech processing have been proposed to leverage unlabeled audio for representation learning.

Self-supervised learning methods can be categorized into discriminative methods [17, 15, 16, 1, 2, 7, 20], generative methods [5, 4, 6, 10, 12, 11], and multi-task learning methods [14]. The typical generative method is Autoregressive Predictive Coding (APC) [5, 4], where the model is similar to autoencoder architectures except that the network is trained to predict features for future time steps. The discriminative methods usually employ contrastive learning [2] or classification on weak clustering labels [7] to pre-train an encoder network with large-scale unsupervised data. Recently, the discriminative methods have achieved great success in automatic speech recognition (ASR), outperforming the best 2019 system on the Librispeech dataset with significantly less supervised data. The improved performance on different speech tasks in the SUPERB benchmark [19] also verifies the effectiveness of pre-training.

Despite these successes, most pre-training methods for speech applications focus on extracting spoken content information, i.e., learning representations optimized for tasks such as speech recognition and keyword spotting. Limited exploration has been carried out on other speech characteristics. As the speech signal contains multi-fold information, e.g., content, identity, and presentation, optimizing for one aspect might lead to sub-optimal representations for other tasks. Interestingly, even when trained with an ASR-oriented objective function, the representation learnt by unsupervised pre-training shows excellent performance in speaker identification related tasks, such as speaker verification and diarization, in the SUPERB challenge. However, whether the performance on speaker tasks can be further boosted when the embedding comes from matched pre-training is still an open question.

To answer this question, we investigate unsupervised speaker pre-training methods that encourage the preservation of speaker identity. Specifically, we propose two training methods: 1) We integrate an utterance-wise contrastive loss with the unsupervised representation learning, where the aggregated embedding from each utterance is employed for affinity computation, and a speaker-wise pseudo label is applied as reference. 2) We propose an utterance-mixing training strategy, where a partially overlapped signal is constructed for each training sample by mixing it with a randomly selected speech piece, while the training objective remains the same. We integrate our proposed training methods into the HuBERT framework [7], and conduct experiments on the Speech processing Universal PERformance Benchmark (SUPERB) [19]. The experimental results show that our method significantly improves speaker identification, speaker verification, speaker diarization, and emotion recognition, while maintaining the same speech recognition performance. Finally, we extend our pre-training data to 94k hours of public English audio, consisting of LibriVox [9], GigaSpeech [3], and VoxPopuli [18], which further increases performance on speaker tasks compared to previous work using the 60k hours of LibriVox data only.

The contributions of the paper are threefold. 1) We propose a speaker-aware pre-training method which is complementary to current ASR-oriented pre-training. 2) We empirically evaluate the model performance on the SUPERB benchmark and achieve state-of-the-art performance in the overall evaluation. 3) We release our model at https://github.com/microsoft/UniSpeech.

Figure 1: An illustration of our method. We conduct contrastive loss in the intermediate layer, and use mixed utterance as input.

2 Background

We first review HuBERT [7] for universal speech representation learning, which serves as our baseline model. HuBERT achieves state-of-the-art performance on several representation learning benchmarks [19]. The main idea of HuBERT is to learn the representation by iterative clustering. HuBERT first conducts an offline clustering step based on the MFCCs (Mel-Frequency Cepstral Coefficients) of the input signal, where the cluster center assigned to each frame is indexed as the pseudo-label for later steps. Then, a Transformer model with a CNN feature extractor is trained against these MFCC-derived pseudo-labels to form the representation for the first iteration. A mask prediction loss is used as the training criterion, where the network is required to predict the pseudo-labels of a masked region of the input sequence, with the features from the unmasked parts as input. Specifically, given a speech utterance $u = [u_1, \dots, u_T]$ with $T$ feature frames, the corresponding labels are $z = [z_1, \dots, z_T]$, and the feature sequence $x = [x_1, \dots, x_T]$ is extracted from the utterance with the CNN encoder. We denote by $M \subset [T]$ the set of masked indices in $x$, and by $\tilde{x}$ the corrupted version of $x$ in which each $x_t$ is replaced by a randomly initialized mask embedding if $t \in M$. The Transformer model is then trained to predict the labels $z_t$ corresponding to the masked indices given the corrupted feature sequence $\tilde{x}$, with the cross-entropy loss $\mathcal{L}_{content} = -\sum_{t \in M} \log p(z_t \mid \tilde{x}, t)$.
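The masked prediction loss described above can be illustrated with a small sketch. This is only a toy version in plain Python, assuming that per-frame logits over the cluster vocabulary are already available; the function name `masked_prediction_loss` and the toy values are hypothetical and not taken from the HuBERT code base.

```python
import math


def masked_prediction_loss(logits, labels, masked):
    """Cross-entropy summed over the masked frames only.

    logits: list of T rows, each a list of C unnormalized class scores
    labels: list of T integer pseudo-labels (cluster indices)
    masked: set of frame indices whose features were replaced by the mask embedding
    """
    loss = 0.0
    for t in sorted(masked):
        row = logits[t]
        # log-sum-exp with max subtraction for numerical stability
        m = max(row)
        log_norm = m + math.log(sum(math.exp(v - m) for v in row))
        # negative log-probability of the pseudo-label at frame t
        loss -= row[labels[t]] - log_norm
    return loss


# Toy example: 4 frames, 3 clusters, frames 1 and 2 are masked.
logits = [[2.0, 0.0, 0.0],
          [0.0, 0.0, 0.0],
          [0.0, 5.0, 0.0],
          [1.0, 1.0, 1.0]]
labels = [0, 2, 1, 0]
loss = masked_prediction_loss(logits, labels, masked={1, 2})
```

Unmasked frames (0 and 3 here) contribute nothing to the loss, which is what forces the model to infer the masked labels from context.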

The combination of clustering and network training is considered one iteration. Starting from the second iteration, instead of MFCC features, the embeddings generated by the network from the previous iteration are used as the input for the clustering and network training steps. Presumably, both the pseudo-labels and the embeddings are refined through the iterations.
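As a rough illustration of the clustering half of one iteration, the sketch below assigns each frame to its nearest centroid and refines the centroids. It assumes plain k-means with squared Euclidean distance, which the text above does not specify; all function names and the toy frames are hypothetical.

```python
def assign_labels(frames, centroids):
    """Return, for each frame, the index of its nearest centroid."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(centroids)), key=lambda k: sqdist(f, centroids[k]))
            for f in frames]


def update_centroids(frames, labels, centroids):
    """Mean of the frames assigned to each cluster; empty clusters are kept as-is."""
    updated = []
    for k, old in enumerate(centroids):
        members = [f for f, z in zip(frames, labels) if z == k]
        if not members:
            updated.append(list(old))
            continue
        updated.append([sum(m[d] for m in members) / len(members)
                        for d in range(len(old))])
    return updated


def kmeans_pseudo_labels(frames, centroids, n_iter=10):
    """Alternate assignment and centroid update, then return the final labels."""
    for _ in range(n_iter):
        labels = assign_labels(frames, centroids)
        centroids = update_centroids(frames, labels, centroids)
    return assign_labels(frames, centroids), centroids


# Toy "features": MFCC frames in iteration 1, network embeddings afterwards.
frames = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
labels, centroids = kmeans_pseudo_labels(frames, centroids=[[0.0, 0.0], [1.0, 1.0]])
```

In a later iteration, only `frames` would change (embeddings from the previous network instead of MFCCs); the labelling step itself stays the same.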

3 UniSpeech-SAT

We propose Universal Speech representation learning with Speaker Aware pre-Training (UniSpeech-SAT), which is shown in Figure 1. On top of the HuBERT model, two approaches are proposed, namely utterance-wise contrastive learning and utterance mixing augmentation. The former is applied to enhance single-speaker information extraction and improve downstream tasks such as speaker verification and speaker identification. The latter mainly benefits multi-speaker tasks such as speaker diarization.

3.1 Utterance-wise Contrastive Learning

We integrate an utterance-wise contrastive loss to enhance unsupervised speaker information modeling. Two assumptions are made for this integration: 1. Each training utterance contains one active speaker. 2. Each utterance in the training batch belongs to a different speaker, i.e., no speaker has two utterances in one batch. Given that the dataset is collected from various sources, we believe the two assumptions are mostly satisfied.

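In the proposed contrastive loss, embeddings within the same utterance are considered positive instances, while the negative instances consist of embeddings from other utterances in the same batch. Suppose that the input feature sequence is $x^b = [x^b_1, \dots, x^b_T]$, where $b \in [B]$ and $B$ is the batch size. We obtain the latent representation $l^b = [l^b_1, \dots, l^b_T]$ from the output of an intermediate Transformer encoder layer. Then we discretize the latent representation to a finite set of speech representations with a quantization module [2]. Suppose the quantization module has $G$ codebooks with $V$ entries each. We first linearly transform each latent representation $l^b_t$ to logits $o^b_t \in \mathbb{R}^{G \times V}$, and then use Gumbel softmax [8] to choose one discrete entry $e$ from each codebook. The probability of choosing the $v$-th entry from the $g$-th codebook is

$$p_{g,v} = \frac{\exp\big((o_{g,v} + n_v)/\tau\big)}{\sum_{k=1}^{V} \exp\big((o_{g,k} + n_k)/\tau\big)},$$

where $\tau$ is a non-negative temperature, $n = -\log(-\log(u))$, and $u$ is uniformly sampled from $\mathcal{U}(0, 1)$. Then we concatenate the selected vectors as $[e_1, \dots, e_G]$ and linearly transform the result to the quantized representation $q^b_t$. For the latent representation $l^b_t$ centered over masked step $t$ in the $b$-th utterance, the model is trained to identify the true quantized representations $Q^b$ from the same utterance within a set of quantized candidate representations $Q = \bigcup_{b'=1}^{B} Q^{b'}$ that are uniformly sampled from all the masked time steps in all the utterances within the training batch. The utterance-wise contrastive loss between $l^b_t$ and $Q$ is defined as

$$\mathcal{L}_c = -\sum_{b=1}^{B} \sum_{t} \log \frac{\sum_{q \in Q^b} \exp\big(\mathrm{sim}(l^b_t, q)/\kappa\big)}{\sum_{q \in Q} \exp\big(\mathrm{sim}(l^b_t, q)/\kappa\big)},$$

where the inner sum runs over the masked steps $t$ of the $b$-th utterance, $\mathrm{sim}(a, b)$ denotes the cosine similarity between the latent representations and quantized representations, and $\kappa$ is a temperature as in [2]. The utterance-wise contrastive loss is augmented by a codebook diversity loss to encourage the equal use of all the codebook entries,

$$\mathcal{L}_d = \frac{1}{GV} \sum_{g=1}^{G} \sum_{v=1}^{V} \bar{p}_{g,v} \log \bar{p}_{g,v},$$

where $\bar{p}_{g,v}$ is $p_{g,v}$ averaged across the batch of utterances. Finally, speaker information modeling is trained with the loss $\mathcal{L}_{speaker} = \mathcal{L}_c + \alpha \mathcal{L}_d$, where $\alpha$ is a pre-defined hyperparameter. Our model learns a combination of the speaker loss and the content loss,

$$\mathcal{L} = \mathcal{L}_{content} + \beta \mathcal{L}_{speaker},$$

where $\beta$ is a pre-defined hyperparameter and $\mathcal{L}_{content}$ is the masked prediction loss of Section 2.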
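The two ingredients of this section, Gumbel-softmax codebook selection and the utterance-wise contrastive loss, can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the released implementation: every candidate from the same utterance is treated as a positive, the similarity is scaled by a temperature `kappa` (an assumed parameter with an arbitrary default), and the function names and toy vectors are hypothetical.

```python
import math
import random


def gumbel_softmax_probs(logits, tau, rng):
    """Selection probabilities for one codebook: softmax((o_v + n_v) / tau)."""
    # n = -log(-log(u)) with u ~ U(0, 1); u is clamped away from 0 so log(u) is defined.
    noise = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in logits]
    scores = [(o + n) / tau for o, n in zip(logits, noise)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def utterance_contrastive_loss(latents, quantized, utt_ids, kappa=0.1):
    """Each latent should prefer candidates that come from its own utterance.

    latents, quantized: one vector per masked step, pooled over the whole batch
    utt_ids: utterance index of each masked step (assumed one speaker per utterance)
    """
    loss = 0.0
    for latent, own_id in zip(latents, utt_ids):
        weights = [math.exp(cosine(latent, q) / kappa) for q in quantized]
        positives = sum(w for w, other_id in zip(weights, utt_ids)
                        if other_id == own_id)
        loss -= math.log(positives / sum(weights))
    return loss


rng = random.Random(0)
probs = gumbel_softmax_probs([1.0, 0.5, -0.5], tau=2.0, rng=rng)

# Two utterances with two masked steps each.
latents = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
quantized = [[1.0, 0.1], [0.8, 0.0], [0.1, 1.0], [0.0, 0.8]]
matched = utterance_contrastive_loss(latents, quantized, utt_ids=[0, 0, 1, 1])
shuffled = utterance_contrastive_loss(latents, quantized, utt_ids=[0, 1, 0, 1])
```

With utterance labels that agree with the geometry of the toy vectors (`matched`), the loss is close to zero; with inconsistent labels (`shuffled`), it is clearly larger, which is the behaviour the speaker loss relies on.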

3.2 Utterance Mixing Augmentation

We introduce an utterance mixing strategy to further boost speaker information modeling in pre-training, especially for multi-speaker tasks such as speaker diarization. The utterance mixing method aims to simulate multi-speaker speech for self-supervised pre-training when only single-speaker pre-training data is available. Specifically, as shown in Algorithm 1, given a batch of speech utterances $U = \{u^i\}_{i=1}^{B}$ with batch size $B$, we randomly choose $S$ utterances $U^S$ from the batch. Then, for each utterance $u^{pri} \in U^S$, we randomly choose an utterance $u^{sec}$ from the batch $U$, crop a chunk of random length from $u^{sec}$, and mix it with $u^{pri}$ in a random region. With the utterance mixing method, the model is trained to extract the information of the main speaker from the mixed audio with the single-speaker information modeling loss (Section 3.1), and to predict the content information corresponding to the main speaker with the content information modeling loss (Section 2). Note that we constrain the mixing portion in each utterance to be less than 50%, avoiding a potential label permutation problem.

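given a batch of speech utterances $U = \{u^i\}_{i=1}^{B}$ with batch size $B$ and length $L$, mixing probability $p$
Choose $S$ utterances $U^S \subset U$ by Bernoulli sampling with probability $p$
for $u^{pri}$ in $U^S$ do
    Sample an utterance $u^{sec}$ from $U$ with a discrete uniform distribution with probability $1/B$
    Sample the mix length $l$ from a discrete uniform distribution with probability $2/L$
    Sample a start position $s^{pri}$ of $u^{pri}$ from a discrete uniform distribution with probability $1/(L-l)$
    Sample a start position $s^{sec}$ of $u^{sec}$ from a discrete uniform distribution with probability $1/(L-l)$
    $u^{pri}[s^{pri} : s^{pri}+l] \leftarrow u^{pri}[s^{pri} : s^{pri}+l] + u^{sec}[s^{sec} : s^{sec}+l]$
return $U$
Algorithm 1 Utterance Mixing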
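The utterance mixing procedure can be sketched in plain Python as below. This is a minimal illustration rather than the released implementation: waveforms are lists of floats of equal length, chunks are cropped from the unmixed audio (an assumption), the mix length is capped strictly below half of the utterance to respect the 50% constraint, and the function name `utterance_mixing` is hypothetical.

```python
import random


def utterance_mixing(batch, mix_prob, rng):
    """Add a random chunk of another utterance to a random subset of the batch.

    batch: list of B waveforms (lists of floats), all of the same length L >= 3
    mix_prob: probability that an utterance is selected as a primary utterance
    rng: random.Random instance, passed in for reproducibility
    """
    length = len(batch[0])
    originals = [list(u) for u in batch]  # chunks are cropped from unmixed audio
    mixed = [list(u) for u in batch]      # the input batch is left untouched
    for i in range(len(batch)):
        if rng.random() >= mix_prob:      # Bernoulli selection of primary utterances
            continue
        # The secondary utterance is drawn uniformly from the whole batch.
        secondary = originals[rng.randrange(len(batch))]
        # Mix length strictly below 50% of the utterance.
        mix_len = rng.randint(1, (length - 1) // 2)
        start_pri = rng.randrange(length - mix_len + 1)
        start_sec = rng.randrange(length - mix_len + 1)
        for k in range(mix_len):
            mixed[i][start_pri + k] += secondary[start_sec + k]
    return mixed


# Toy batch: 4 constant "waveforms" of 10 samples, so mixed samples are easy to spot.
batch = [[1.0] * 10 for _ in range(4)]
mixed = utterance_mixing(batch, mix_prob=1.0, rng=random.Random(0))
```

Because the overlapped region always covers less than half of the primary utterance, the primary speaker stays dominant, so the pseudo-labels of the primary utterance can still serve as targets.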

3.3 Large and Diverse Pre-training Data

We also propose to leverage large-scale unsupervised data from diverse domains to improve the robustness of our model. Previous works use the Librispeech [13] or LibriVox [9] datasets for pre-training, which limits the pre-trained model since the input data are all extracted from audiobooks. We extend the training dataset with (1) 10k hours of GigaSpeech data [3], which is collected from audiobooks, podcasts, and YouTube, covering both read and spontaneous speaking styles and a variety of topics such as arts, science, and sports; and (2) 24k hours of VoxPopuli data [18], which is collected from European Parliament (EP) event recordings, including plenary sessions, committee meetings, and other events. In total, we have 94k hours of data, including LibriVox, VoxPopuli, and GigaSpeech. We believe the diverse dataset can improve model performance on all tasks, because it contains diverse audio backgrounds, more speakers, and different speech content.

| Method | #Params | Corpus | Speaker: SID Acc | Speaker: ASV EER | Speaker: SD DER | Content: PR PER | Content: ASR WER (w/o LM) | Content: ASR WER (w/ LM) | Content: KS Acc | Content: QbE MTWV | Semantics: IC Acc | Semantics: SF F1 | Semantics: SF CER | ParaL: ER Acc | Overall Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FBANK | - | - | 8.5E-4 | 9.56 | 10.05 | 82.01 | 23.18 | 15.21 | 8.63 | 0.0058 | 9.10 | 69.64 | 52.94 | 35.39 | 44.2 |
| PASE+ [14] | 7.83M | LS 50 hr | 37.99 | 11.61 | 8.68 | 58.87 | 25.11 | 16.62 | 82.54 | 0.0072 | 29.82 | 62.14 | 60.17 | 57.86 | 57.5 |
| APC [5] | 4.11M | LS 360 hr | 60.42 | 8.56 | 10.53 | 41.98 | 21.28 | 14.74 | 91.01 | 0.0310 | 74.69 | 70.46 | 50.89 | 59.33 | 67.6 |
| VQ-APC [6] | 4.63M | LS 360 hr | 60.15 | 8.72 | 10.45 | 41.08 | 21.20 | 15.21 | 91.11 | 0.0251 | 74.48 | 68.53 | 52.91 | 59.66 | 67.2 |
| NPC [10] | 19.38M | LS 360 hr | 55.92 | 9.40 | 9.34 | 20.20 | 13.91 | 43.81 | 88.96 | 0.0246 | 69.44 | 72.79 | 48.44 | 59.08 | 67.0 |
| Mockingjay [12] | 85.12M | LS 360 hr | 32.29 | 11.66 | 10.54 | 22.82 | 15.48 | 70.19 | 83.67 | 6.6E-04 | 34.33 | 61.59 | 58.89 | 50.28 | 56.1 |
| TERA [11] | 21.33M | LS 360 hr | 57.57 | 15.89 | 9.96 | 18.17 | 12.16 | 49.17 | 89.48 | 0.0013 | 58.42 | 67.50 | 54.17 | 56.27 | 64.2 |
| modified CPC [15] | 1.84M | LL 60k hr | 39.63 | 12.86 | 10.38 | 42.54 | 20.18 | 13.53 | 91.88 | 0.0326 | 64.09 | 71.19 | 49.91 | 60.96 | 65.1 |
| wav2vec [16] | 32.54M | LS 960 hr | 56.56 | 7.99 | 9.90 | 31.58 | 15.86 | 11.00 | 95.59 | 0.0485 | 84.92 | 76.37 | 43.71 | 59.79 | 71.5 |
| vq-wav2vec [1] | 34.15M | LS 960 hr | 38.80 | 10.38 | 9.93 | 33.48 | 17.71 | 12.80 | 93.38 | 0.0410 | 85.68 | 77.68 | 41.54 | 58.24 | 69.3 |
| wav2vec 2.0 Base [2] | 95.04M | LS 960 hr | 75.18 | 5.74 | 6.02 | 6.08 | 6.43 | 4.79 | 96.23 | 0.0233 | 92.35 | 88.30 | 24.77 | 63.43 | 80.3 |
| HuBERT Base [7] | 94.68M | LS 960 hr | 81.42 | 5.11 | 5.88 | 5.41 | 6.42 | 4.79 | 96.30 | 0.0736 | 98.34 | 88.53 | 25.20 | 64.92 | 82.0 |
| UniSpeech-SAT Base | 94.68M | LS 960 hr | 85.76 | 4.31 | 4.41 | 5.40 | 6.75 | 4.86 | 96.75 | 0.0927 | 98.58 | 88.98 | 23.56 | 66.04 | 83.0 |
| w/o contrastive loss | 94.68M | LS 960 hr | 84.74 | 4.61 | 4.72 | 5.22 | 6.80 | 5.17 | 96.79 | 0.0956 | 98.31 | 88.56 | 24.00 | 65.60 | 82.8 |
| w/o utterance mixing | 94.68M | LS 960 hr | 85.97 | 4.35 | 5.87 | 5.06 | 7.04 | 5.05 | 96.88 | 0.0866 | 98.10 | 88.50 | 24.52 | 65.97 | 82.7 |
| UniSpeech-SAT Base+ | 94.68M | CD 94k hr | 87.59 | 4.36 | 3.80 | 4.44 | 6.44 | 4.88 | 97.40 | 0.1125 | 98.84 | 89.76 | 21.75 | 68.48 | 84.0 |
| wav2vec 2.0 Large [2] | 317.38M | LL 60k hr | 86.14 | 5.65 | 5.62 | 4.75 | 3.75 | 3.10 | 96.6 | 0.0489 | 95.28 | 87.11 | 27.31 | 65.64 | 82.1 |
| HuBERT Large [7] | 316.61M | LL 60k hr | 90.33 | 5.98 | 5.75 | 3.53 | 3.62 | 2.94 | 95.29 | 0.0353 | 98.76 | 89.81 | 21.76 | 67.62 | 83.5 |
| UniSpeech-SAT Large | 316.61M | CD 94k hr | 95.16 | 3.84 | 3.85 | 3.38 | 3.99 | 3.19 | 97.89 | 0.0836 | 99.34 | 92.13 | 18.01 | 70.68 | 85.6 |

Table 1: Universal speech representation evaluation on the SUPERB benchmark. The overall score is computed by ourselves: we multiply the QbE score by 100, replace each error rate score with (1 - error rate), and average the scores of all tasks.
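The overall score described in the caption can be reproduced with a few lines of Python. The sketch below makes two assumptions that the caption does not state explicitly: error rates are given in percent, so (1 - error rate) becomes `100 - value`, and the average runs over all twelve metric columns rather than over the ten tasks. The dictionary keys are hypothetical names for the table columns.

```python
def overall_score(row):
    """Average of the twelve per-metric scores of one Table 1 row."""
    higher_is_better = ["sid_acc", "ks_acc", "ic_acc", "sf_f1", "er_acc"]
    error_rates = ["asv_eer", "sd_der", "pr_per", "asr_wer", "asr_wer_lm", "sf_cer"]
    scores = [row[k] for k in higher_is_better]
    scores += [100.0 - row[k] for k in error_rates]  # (1 - error rate), in percent
    scores.append(100.0 * row["qbe_mtwv"])           # QbE score multiplied by 100
    return sum(scores) / len(scores)


# Values copied from the FBANK and HuBERT Base rows of Table 1.
fbank = {"sid_acc": 8.5e-4, "asv_eer": 9.56, "sd_der": 10.05, "pr_per": 82.01,
         "asr_wer": 23.18, "asr_wer_lm": 15.21, "ks_acc": 8.63, "qbe_mtwv": 0.0058,
         "ic_acc": 9.10, "sf_f1": 69.64, "sf_cer": 52.94, "er_acc": 35.39}
hubert_base = {"sid_acc": 81.42, "asv_eer": 5.11, "sd_der": 5.88, "pr_per": 5.41,
               "asr_wer": 6.42, "asr_wer_lm": 4.79, "ks_acc": 96.30, "qbe_mtwv": 0.0736,
               "ic_acc": 98.34, "sf_f1": 88.53, "sf_cer": 25.20, "er_acc": 64.92}

fbank_score = overall_score(fbank)
hubert_score = overall_score(hubert_base)
```

Under these assumptions, both rows land within rounding of the overall scores listed in the table (44.2 and 82.0).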

4 Experiment

4.1 Implementation Details

We implement and pretrain our UniSpeech-SAT model following previous work [7]. We pretrain the UniSpeech-SAT Base model for 400k steps on the 960 hours of LibriSpeech audio [13], using labels generated by clustering the 6th Transformer layer output of the first-iteration HuBERT Base model. The UniSpeech-SAT Base+ and UniSpeech-SAT Large models are pretrained for 400k steps on the 94k hours of large-scale diverse data (Section 3.3), using labels generated by clustering the 6th Transformer layer output of the HuBERT Base model. As for the model architecture and training configurations, we use the same hyperparameters as [7].

4.2 Universal Representation Evaluation

We evaluate our models on SUPERB, which is designed to provide a standard and comprehensive testbed for pretrained models on various speech tasks. It covers ten tasks, including Speaker Identification (SID), Automatic Speaker Verification (ASV), Speaker Diarization (SD), Phoneme Recognition (PR), Automatic Speech Recognition (ASR), Keyword Spotting (KS), Query by Example Spoken Term Detection (QbE), Intent Classification (IC), Slot Filling (SF), Emotion Recognition (ER). The tasks can be grouped into four aspects of speech: speaker, content, semantics, and paralinguistics. We follow the policies created by SUPERB. 1) The design of task specific layers follows the rules of SUPERB. 2) Transformer model is frozen to limit the space of fine-tuning hyper-parameter search. 3) The task specific layer uses the weighted sum results of hidden states from different layers.
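The third policy, feeding the task-specific layer with a weighted sum of hidden states, can be sketched as follows. This assumes one learnable scalar per layer that is normalized with a softmax, which the text above does not spell out; the function name `weighted_layer_sum` and the toy values are hypothetical.

```python
import math


def weighted_layer_sum(hidden_states, layer_logits):
    """Combine per-layer features into a single T x D sequence.

    hidden_states: list of n_layers items, each a T x D nested list of floats
    layer_logits: n_layers raw (learnable) scalars, normalized here with a softmax
    """
    m = max(layer_logits)
    exps = [math.exp(w - m) for w in layer_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    n_frames = len(hidden_states[0])
    dim = len(hidden_states[0][0])
    combined = [[sum(w * layer[t][d] for w, layer in zip(weights, hidden_states))
                 for d in range(dim)]
                for t in range(n_frames)]
    return combined, weights


# Two layers, one frame, two dimensions; equal logits give a plain average.
layers = [[[1.0, 2.0]], [[3.0, 4.0]]]
combined, weights = weighted_layer_sum(layers, layer_logits=[0.0, 0.0])
```

Since the pre-trained Transformer is frozen, the layer weights and the task-specific layer are the only trainable parts in this setup, and the learned weights indicate which layers a task relies on.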

Table 1 shows the evaluation results. There is a significant improvement on the speaker diarization task in both the base and large settings, where the diarization error rate (DER) is reduced by over 25% relative. The results demonstrate that the proposed utterance mixing method is very effective for the multi-talker task. Moreover, positive results are observed in speaker identification and speaker verification, which we attribute to the utterance-wise contrastive loss. Surprisingly, our model also obtains a substantial gain on emotion recognition. One possible explanation is that this task also requires utterance-level information rather than content information. However, our model shows a degradation on ASR without LM. The word error rate of our large model is 9% worse than the baseline, while the gap becomes less than 2% in the base setting. Our explanation is that speaker information and content information are orthogonal, and content information is sacrificed given that the model capacity is limited.
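The relative DER reduction quoted above can be checked directly from the SD DER column of Table 1. The helper name `relative_reduction` is hypothetical; the four values are copied from the HuBERT and UniSpeech-SAT rows trained on the same model size.

```python
def relative_reduction(baseline, ours):
    """Relative reduction of an error rate, in percent of the baseline."""
    return 100.0 * (baseline - ours) / baseline


# SD DER values from Table 1.
base_reduction = relative_reduction(5.88, 4.41)   # HuBERT Base -> UniSpeech-SAT Base
large_reduction = relative_reduction(5.75, 3.85)  # HuBERT Large -> UniSpeech-SAT Large
```

This gives roughly 25% for the base setting and roughly 33% for the large setting.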

Figure 2: Weight Analysis.
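
| Method | Ratio | Speaker: SD DER | Content: ASR WER (w/o LM) | Content: ASR WER (w/ LM) | Semantics: IC Acc | ParaL: ER Acc |
|---|---|---|---|---|---|---|
| HuBERT Base [7] | - | 5.88 | 6.42 | 4.79 | 98.34 | 64.92 |
| UniSpeech-SAT Base+ | 0.0 | 5.04 | 6.39 | 4.76 | 99.24 | 66.32 |
| UniSpeech-SAT Base+ | 0.2 | 3.80 | 6.44 | 4.88 | 98.84 | 68.48 |
| UniSpeech-SAT Base+ | 0.5 | 3.73 | 6.65 | 5.18 | 99.29 | 67.36 |

Table 2: Results of UniSpeech-SAT Base+ with various mixing ratios on 94k hours of training data.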

4.3 Analysis

Weight Analysis: Figure 2 shows the layer contribution to different tasks. For speaker verification and diarization, shallow layers contribute more, while for ASR and intent classification, the top layers are more important. The phenomenon indicates the shallow layers learn speaker information while the top layers learn content and semantic information.

Mixing ratio: We explore different utterance mixing ratios and test the performance when mixing 0%, 20%, and 50% of the utterances, as shown in Table 2. In the 94k hours setting, utterance mixing is still effective. The ratio is a trade-off between speaker and content performance: a higher ratio lowers DER but raises WER. We use 20% for our UniSpeech-SAT Base+ model.

5 Conclusion

In this work, we integrate a contrastive loss and utterance mixing into an existing framework for unsupervised speech representation learning, aiming to improve speaker discrimination in the learnt embeddings. The evaluation on the SUPERB benchmark shows that our model achieves state-of-the-art performance and outperforms other baselines by a large margin.


  • [1] A. Baevski, S. Schneider, and M. Auli (2020) Vq-wav2vec: self-supervised learning of discrete speech representations. In ICLR.
  • [2] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli (2020) Wav2vec 2.0: a framework for self-supervised learning of speech representations. In NeurIPS.
  • [3] G. Chen, S. Chai, G. Wang, J. Du, W. Zhang, C. Weng, D. Su, D. Povey, J. Trmal, J. Zhang, et al. (2021) GigaSpeech: an evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio.
  • [4] Y. Chung and J. Glass (2020) Generative pre-training for speech with autoregressive predictive coding. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3497–3501.
  • [5] Y. Chung, W. Hsu, H. Tang, and J. Glass (2019) An unsupervised autoregressive model for speech representation learning. In Interspeech, pp. 146–150.
  • [6] Y. Chung, H. Tang, and J. Glass (2020) Vector-quantized autoregressive predictive coding. In Interspeech, pp. 3760–3764.
  • [7] W. Hsu, B. Bolte, Y. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed (2021) HuBERT: self-supervised speech representation learning by masked prediction of hidden units. arXiv preprint arXiv:2106.07447.
  • [8] E. Jang, S. Gu, and B. Poole (2016) Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144.
  • [9] J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, et al. (2020) Libri-light: a benchmark for ASR with limited or no supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7669–7673.
  • [10] A. H. Liu, Y. Chung, and J. Glass (2020) Non-autoregressive predictive coding for learning speech representations from local dependencies. arXiv preprint arXiv:2011.00406.
  • [11] A. T. Liu, S. Li, and H. Lee (2020) TERA: self-supervised learning of transformer encoder representation for speech. arXiv preprint arXiv:2007.06028.
  • [12] A. T. Liu, S. Yang, P. Chi, P. Hsu, and H. Lee (2020) Mockingjay: unsupervised speech representation learning with deep bidirectional transformer encoders. In ICASSP.
  • [13] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: an ASR corpus based on public domain audio books. In ICASSP, pp. 5206–5210.
  • [14] M. Ravanelli, J. Zhong, S. Pascual, P. Swietojanski, J. Monteiro, J. Trmal, and Y. Bengio (2020) Multi-task self-supervised learning for robust speech recognition. In ICASSP, pp. 6989–6993.
  • [15] M. Rivière, A. Joulin, P. Mazaré, and E. Dupoux (2020) Unsupervised pretraining transfers well across languages. In ICASSP, pp. 7414–7418.
  • [16] S. Schneider, A. Baevski, R. Collobert, and M. Auli (2019) Wav2vec: unsupervised pre-training for speech recognition. In Interspeech.
  • [17] A. van den Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. CoRR abs/1807.03748.
  • [18] C. Wang, M. Rivière, A. Lee, A. Wu, C. Talnikar, D. Haziza, M. Williamson, J. Pino, and E. Dupoux (2021) VoxPopuli: a large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. arXiv preprint arXiv:2101.00390.
  • [19] S. Yang, P. Chi, Y. Chuang, C. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G. Lin, T. Huang, W. Tseng, K. Lee, D. Liu, Z. Huang, S. Dong, S. Li, S. Watanabe, A. Mohamed, and H. Lee (2021) SUPERB: Speech Processing Universal PERformance Benchmark. pp. 1194–1198.
  • [20] Y. Zhang, D. S. Park, W. Han, J. Qin, A. Gulati, J. Shor, A. Jansen, Y. Xu, Y. Huang, S. Wang, et al. (2021) BigSSL: exploring the frontier of large-scale semi-supervised learning for automatic speech recognition. arXiv preprint arXiv:2109.13226.