Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?

Recently, self-supervised learning (SSL) has demonstrated strong performance in speaker recognition, even when the pre-training objective is designed for speech recognition. In this paper, we study which factors lead to the success of self-supervised learning on speaker-related tasks, e.g. speaker verification (SV), through a series of carefully designed experiments. Our empirical results on the VoxCeleb1 dataset suggest that the benefit of SSL to the SV task comes from a combination of the masked speech prediction loss, data scale, and model size, while the SSL quantizer has a minor impact. We further employ the Integrated Gradients attribution method and loss landscape visualization to understand the effectiveness of self-supervised learning for speaker recognition.



1 Introduction

Recently, self-supervised learning (SSL) has achieved state-of-the-art results on a diverse array of downstream speech tasks [wav2vec2, hsu2021hubert, zhang2021bigssl, superb, chen2021unispeech, chen2021wavlm, wang2021self]. Typical SSL methods either discriminate correlated positive samples from negative ones (e.g. wav2vec 2.0) [wav2vec2], or predict discrete pseudo-labels on the masked regions (e.g. HuBERT) [hsu2021hubert]. Both methods try to implicitly learn short-time phonetic information from a huge amount of unlabeled speech, and mainly target self-supervised learning for the automatic speech recognition task (SSL4ASR).

Due to the high correlation with phoneme units, it is straightforward to understand that SSL4ASR has the potential to drastically improve the speech recognition task. Interestingly, SSL4ASR also achieves state-of-the-art performance on speaker-related tasks, e.g. speaker verification (SV). For instance, WavLM [chen2021wavlm] and BigSSL [zhang2021bigssl] show the best performance on different partitions of the VoxCeleb1 dataset [nagrani2020voxceleb], and the ensemble of the WavLM model and Res2Net [gao2019res2net, zhou2021resnext] ranks at the top position on the VoxSRC 2021 speaker verification permanent leaderboard (https://competitions.codalab.org/competitions/34066#results) under the team name Strasbourg-Spk.

In this work, our goal is to understand which factor leads to the success of SSL4ASR in speaker recognition. Specifically, we try to answer the following questions:

  1. Can supervised ASR model benefit the SV task?

  2. How does SSL benefit the SV task?

  3. What is the best SSL setup for the SV task?

To this end, we carefully design and conduct a series of experiments to investigate what is the indispensable part of SSL. We also perform Integrated Gradients attribution analysis and loss landscape visualization to further understand the contribution of SSL to SV performance.

The main findings are three-fold. First, SSL4ASR models have significantly better transferability than supervised ASR models in an apples-to-apples comparison, indicating that the SSL objective function is a key ingredient for achieving excellent transferability. Second, the HuBERT-style loss, masked speech prediction, is slightly better than other SSL losses, such as contrastive learning and Mean Squared Error (MSE) loss, while the way pseudo-labels are generated has only a minor impact on the performance of HuBERT-style models. Even pre-training with simple clustering methods on raw inputs can provide good performance on the SV task. The data augmentation proposed in WavLM [chen2021wavlm] is very helpful, even when the pre-training data is scaled up to 94k hours. In addition, data scale and model scale correlate strongly with model transferability. Third, our analysis shows that SSL models learn speaker-related knowledge only in the shallow layers during the pre-training stage, while the fine-tuning stage can unleash the full capability of the model. We observe that an SSL model provides a wider optimum in fine-tuning, which enables better resistance against small perturbations, stronger generalization capability, and easier SV model optimization.

2 Background

Figure 1: Self-supervised learning for speaker verification task.

Self-supervised learning (SSL) has been shown to be an effective means of improving state-of-the-art results on the SV task [chen2021unispeech, chen2021large, chen2021wavlm]. A common practice is to first optimize the pre-trained model with an SSL objective on large-scale unlabeled data, and then fine-tune the pre-trained model along with a downstream SV model on the annotated dataset.

The typical SSL objectives are designed for the automatic speech recognition task (SSL4ASR) by implicitly learning short-time phonetic information from unlabeled speech [wav2vec2, hsu2021hubert]. Specifically, given a raw audio $x$, a latent representation $z_{1:T}$ is obtained by a CNN feature extractor, where $T$ is the number of frames. Then the representation is fed to an $L$-layer Transformer model, yielding hidden states $h^l_{1:T}$, where $l$ denotes the $l$-th layer in the encoder. During pre-training, we employ mask-based self-supervised learning methods to optimize the Transformer-based model. Before feeding the latent representation to the Transformer model, SSL methods first mask a proportion of the frames at random, then minimize a self-supervised objective function computed on the last-layer hidden states $h^L$ in the masked regions.

During fine-tuning, we take a weighted average of the hidden states of each layer to generate the output representation $o = \sum_{l} w^l h^l$, where $w^l$ is a learnable weight for the hidden state of the $l$-th layer. Then we employ ECAPA-TDNN [desplanques2020ecapa] as the downstream SV model following [chen2021large], and feed the output representation into the downstream model to generate the speaker embedding. We use the additive angular margin (AAM) loss [deng2019arcface] as the supervised objective function, and train the downstream SV model along with the pre-trained model in two stages. In the first stage, we optimize the parameters of the downstream model with the pre-trained parameters fixed. In the second stage, we continue to optimize the parameters of the downstream model as well as the pre-trained model. In addition, we can also apply the large-margin fine-tuning strategy and score calibration to further improve the speaker verification performance [thienpondt2021idlab].
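The layer-weighted average used at fine-tuning time can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and normalizing the learnable weights with a softmax is an assumption (the text only states that each layer has a learnable weight).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_average(hidden_states, layer_logits):
    """Combine per-layer hidden states into one output representation.

    hidden_states: list over layers, each a list of T frames of D floats.
    layer_logits:  one learnable scalar per layer (softmax-normalized here,
                   which is an assumption of this sketch).
    """
    weights = softmax(layer_logits)
    num_frames = len(hidden_states[0])
    dim = len(hidden_states[0][0])
    output = [[0.0] * dim for _ in range(num_frames)]
    for weight, layer in zip(weights, hidden_states):
        for t in range(num_frames):
            for d in range(dim):
                output[t][d] += weight * layer[t][d]
    return output


# Two layers, one frame, two feature dimensions; equal logits give a plain mean.
toy_hidden = [[[1.0, 2.0]], [[3.0, 4.0]]]
toy_output = weighted_layer_average(toy_hidden, [0.0, 0.0])
```

In the first fine-tuning stage only the layer weights and the downstream model would be updated; in the second stage the hidden states themselves also change because the pre-trained encoder is trained.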

3 Why does SSL4ASR Benefit the SV task?

3.1 Can supervised ASR model benefit the SV task?

Given the similar modeling unit between SSL4ASR and supervised ASR models, it is a natural question whether the supervised ASR model can also benefit speaker verification task. To verify this hypothesis, we compare the transferability of supervised ASR and SSL4ASR, both of which are trained on LibriSpeech 960h [librispeech] and use the Transformer structure of HuBERT [hsu2021hubert].

The ASR model is trained with the Connectionist Temporal Classification (CTC) loss function [ctc] in a supervised way. We use the character sequence as the golden target labels $y$, and require the ASR model to predict the golden labels given the hidden states of the last encoder layer $h^L$: $\mathcal{L}_{\mathrm{CTC}} = -\log p_{\mathrm{CTC}}(y \mid h^L)$. Spec-augmentation is also applied following [specaug].

HuBERT, based on the masked pseudo-label prediction loss, is selected as the SSL4ASR model for the comparison [hsu2021hubert]. The pseudo labels are generated by iterative clustering. At the first iteration, we conduct an offline clustering step on the MFCC features of the input audio, where the index of the cluster center assigned to each frame is used as the pseudo label. Then we use the last-layer hidden states $h^L_t$ to predict the embeddings corresponding to the pseudo labels $c_t$ with the cross-entropy loss function in the masked regions $M$:

$$\mathcal{L}_{\mathrm{HuBERT}} = -\sum_{t \in M} \log \frac{\exp\left(\mathrm{sim}(W h^L_t, e_{c_t}) / \tau\right)}{\sum_{c'=1}^{C} \exp\left(\mathrm{sim}(W h^L_t, e_{c'}) / \tau\right)},$$

where $W$ is the projection matrix, $e_c$ is the embedding of cluster $c$, $\mathrm{sim}(\cdot, \cdot)$ denotes the cosine similarity function, $\tau$ is a pre-defined hyperparameter, and $C$ is the number of clusters. Starting from the second iteration, we perform the offline clustering step with the hidden states extracted from the HuBERT model pre-trained in the previous iteration, and then train a new HuBERT model with the pseudo labels obtained from the new cluster centers.
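The masked pseudo-label prediction loss described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the HuBERT code base: `masked_prediction_loss` and its arguments are hypothetical names, the inputs are assumed to be already projected hidden states, and all vectors are assumed to be non-zero so that cosine similarity is defined.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def masked_prediction_loss(projected, code_embeddings, labels, masked, tau=0.1):
    """Sum of cross-entropy terms over masked frames.

    projected:       per-frame vectors (hidden states after projection).
    code_embeddings: one embedding per cluster / pseudo label.
    labels:          pseudo-label index of every frame.
    masked:          indices of the masked frames that contribute to the loss.
    tau:             temperature (the default value is an assumption).
    """
    total = 0.0
    for t in masked:
        logits = [cosine(projected[t], e) / tau for e in code_embeddings]
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_norm - logits[labels[t]]
    return total


codes = [[1.0, 0.0], [0.0, 1.0]]
frames = [[1.0, 0.0], [0.0, 1.0]]
loss_correct = masked_prediction_loss(frames, codes, [0, 1], masked=[0, 1])
loss_wrong = masked_prediction_loss(frames, codes, [1, 0], masked=[0, 1])
```

Only the frames listed in `masked` contribute, which mirrors the statement that the objective is computed in the masked regions.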

We also use a randomly initialized Transformer model as a baseline, which removes the effect of the additional parameters introduced by the pre-trained model and isolates the contribution of the different pre-training methods.

Model     EER (%)
          Vox1-O   Vox1-E   Vox1-H
FBank     1.01     1.24     2.32
Random    3.696    3.71     6.034
CTC       1.159    1.256    2.434
HuBERT    0.84     0.879    1.726
Table 1: Transferability of supervised ASR and SSL4ASR

Table 1 shows that the SSL4ASR model provides a better representation than the handcrafted FBank feature, while the representations from the ASR model trained with the CTC loss and from the randomly initialized Transformer model are inferior to the FBank feature. This indicates that the key to the success of SSL4ASR on the SV task is neither the Transformer structure nor the fine-tuning pipeline, but the self-supervised learning procedure.

3.2 What is the best SSL objective for the SV task?

Besides HuBERT, which is based on masked pseudo-label prediction loss, we also evaluate the transferability of wav2vec 2.0 [wav2vec2] and Mean Squared Error (MSE) loss based pre-training method. It should be noted that all the three methods use the same mask setting proposed in HuBERT.

MSE first calculates the FBank feature $f_t$ of the raw audio, then uses the mean squared error between the FBank feature and the linear projection of the last-layer hidden states in the masked regions $M$ as the objective function: $\mathcal{L}_{\mathrm{MSE}} = \frac{1}{|M|} \sum_{t \in M} \| f_t - W h^L_t \|_2^2$, where $W$ is the projection matrix.

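Wav2vec 2.0 first discretizes the latent representation $z_t$ of each masked timestep into the quantized latent representation $q_t$, then uses the context representation $h^L_t$ to identify the true quantized latent representation out of a set of candidate representations $Q_t$ with the contrastive loss function:

$$\mathcal{L}_{\mathrm{wav2vec2}} = -\sum_{t \in M} \log \frac{\exp\left(\mathrm{sim}(W h^L_t, q_t) / \kappa\right)}{\sum_{\tilde{q} \in Q_t} \exp\left(\mathrm{sim}(W h^L_t, \tilde{q}) / \kappa\right)},$$

where $W$ is the projection matrix, $\mathrm{sim}(\cdot, \cdot)$ denotes the cosine similarity function, and $\kappa$ is a pre-defined hyperparameter.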

Model         EER (%)
              Vox1-O   Vox1-E   Vox1-H
MSE           0.979    1.075    1.98
wav2vec 2.0   0.973    0.933    1.831
HuBERT        0.84     0.879    1.726
Table 2: SSL with different objective functions

Table 2 demonstrates that all three SSL methods provide a better representation than the FBank feature (Table 1), which we attribute to the contextual speech representation learned from the masked speech. HuBERT achieves the best performance, indicating that the pseudo-label prediction loss generalizes better and is more effective than the contrastive loss and the MSE loss.

3.3 What is the best SSL quantizer for the SV task?

Since HuBERT style loss is better than others, we explore the performance of different pseudo-label creation methods (quantizers) for HuBERT loss. Besides the MFCC Clustering and Hidden State Clustering introduced by HuBERT, we also experiment with the labels obtained by Random Projection [chiu2022self], VQ-VAE quantizers [van2017neural], and frame-phoneme alignment.

With the random projection quantizer, we first extract the FBank features $f_t$ of the input audio, project $f_t$ to the vector $y_t = A f_t$ with a randomly initialized matrix $A$, and then find the closest vector in a set of randomly initialized vectors $\{c_1, \dots, c_N\}$, where $N$ is the number of vectors (codes). The pseudo label of the $t$-th frame is defined as the index of the closest vector: $\arg\min_{i} \| y_t - c_i \|_2$.
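A random projection quantizer of this kind needs no training, which a short sketch makes concrete. The code below is an assumption-laden illustration rather than the implementation of [chiu2022self]: the function names, the Gaussian initialization, and the plain Euclidean distance (without any vector normalization) are choices made for this example only.

```python
import random


def make_random_quantizer(feat_dim, code_dim, num_codes, seed=0):
    """Draw a frozen projection matrix and a frozen codebook (never trained)."""
    rng = random.Random(seed)
    projection = [[rng.gauss(0.0, 1.0) for _ in range(feat_dim)]
                  for _ in range(code_dim)]
    codebook = [[rng.gauss(0.0, 1.0) for _ in range(code_dim)]
                for _ in range(num_codes)]
    return projection, codebook


def quantize(frames, projection, codebook):
    """Return, for every frame, the index of the closest codebook vector."""
    labels = []
    for frame in frames:
        projected = [sum(p * x for p, x in zip(row, frame)) for row in projection]
        distances = [sum((y - c) ** 2 for y, c in zip(projected, code))
                     for code in codebook]
        labels.append(min(range(len(codebook)), key=lambda i: distances[i]))
    return labels


# Hand-checkable case: identity projection and two well-separated codes.
identity = [[1.0, 0.0], [0.0, 1.0]]
two_codes = [[0.0, 0.0], [10.0, 10.0]]
toy_labels = quantize([[1.0, 1.0], [9.0, 9.0]], identity, two_codes)

# Random case with 500 codes, matching one of the codebook sizes in Table 3.
proj, book = make_random_quantizer(feat_dim=8, code_dim=4, num_codes=500)
rand_labels = quantize([[0.1 * i] * 8 for i in range(5)], proj, book)
```

Because both the matrix and the codebook stay fixed, the pseudo labels depend only on the input features and the random seed.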

With the VQ-VAE quantizer, we first extract the FBank features $f$ of the input audio, and train a VQ-VAE model [van2017neural] to reconstruct the FBank features on LibriSpeech 960h [librispeech]. Given the latent variable $z_e$ obtained by a Transformer-based encoder, we discretize it with the closest vector $e$ in a latent embedding space $\{e_1, \dots, e_K\}$, where $K$ is the number of embeddings, and then reconstruct the features $\hat{f}$ with a Transformer-based decoder. The training loss of VQ-VAE minimizes the mean squared error between the reconstructed features and the input features, along with the difference between the encoded variable and the discrete variable:

$$\mathcal{L}_{\mathrm{VQ\text{-}VAE}} = \| f - \hat{f} \|_2^2 + \| \mathrm{sg}[z_e] - e \|_2^2 + \beta \, \| z_e - \mathrm{sg}[e] \|_2^2,$$

where $\mathrm{sg}[\cdot]$ is the stop-gradient operator and $\beta$ is a pre-defined hyperparameter. During inference, the pseudo label of the $t$-th frame is defined as the index of the discrete latent variable in the latent embedding space: $\arg\min_{k} \| z_{e,t} - e_k \|_2$.

In addition, we also consider using the phoneme sequence of the input audio as the pseudo label to see whether ASR-related pseudo labels can benefit the SV performance. Here, we use a forced-alignment tool [mcauliffe17_interspeech] to get the frame-phoneme pairs on the LibriSpeech 960h data.

Model                            EER (%)
                                 Vox1-O   Vox1-E   Vox1-H
MFCC Clustering                  0.872    0.917    1.766
Hidden State Clustering          0.840    0.879    1.726
Random Projection (500 codes)    0.899    0.95     1.775
Random Projection (8192 codes)   0.883    0.903    1.675
VQ-VAE                           0.824    0.899    1.655
Phoneme                          0.867    0.918    1.776
Table 3: HuBERT style loss with different quantizers.

Table 3 shows that all the quantizers have similar performance on the speaker verification task. Even when we use the phoneme sequence as the pseudo label, which carries no speaker information, we can still obtain a well-performing speaker verification model with the masked pseudo-label prediction SSL method.

3.4 Large-Scale SSL on SV task

Moreover, we leverage data augmentation and a scale-up strategy to further strengthen self-supervised learning for the speaker verification task. Following WavLM [chen2021wavlm], we employ the masked speech denoising and prediction framework as the data-augmented self-supervised learning method, which improves the robustness of the pre-trained model in complex acoustic environments and the preservation of speaker identity. We also scale up the unlabeled pre-training data to 94k hours of public audio [chen2021wavlm], including 60k hours of Libri-Light [librilight], 10k hours of GigaSpeech [GigaSpeech2021], and 24k hours of VoxPopuli [wang2021voxpopuli], and enlarge the model to a 24-layer Transformer with 316M parameters.

Model                EER (%)
                     Vox1-O   Vox1-E   Vox1-H
HuBERT 960h          0.84     0.879    1.726
WavLM 960h           0.777    0.829    1.629
HuBERT 94kh          0.734    0.847    1.725
WavLM 94kh           0.739    0.742    1.483
WavLM 94kh Large     0.505    0.579    1.176
WavLM 94kh Large †   0.308    0.462    0.906
Table 4: Data and model scale-up. † means using large-margin fine-tuning and calibration.

Table 4 shows that the data augmentation strategy used in WavLM benefits self-supervised learning for the SV task. On Vox1-E and Vox1-H, the improvement is more significant when we scale up the pre-training data to 94k hours. Thanks to its larger parameter capacity, the WavLM Large model brings more than 20% relative EER reduction compared to the WavLM Base model. With the large-margin fine-tuning strategy and score calibration, the WavLM Large model achieves 33.2%, 27.1%, and 8.8% relative EER reduction on the three VoxCeleb1 trial lists, respectively, compared to the state-of-the-art supervised model (Vox1-O: 0.461, Vox1-E: 0.634, Vox1-H: 0.993) [zhao2021speakin].
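The relative EER reductions quoted above follow directly from the numbers in Table 4 and the supervised baseline. The short check below recomputes them; `relative_reduction` is a helper name introduced only for this example.

```python
def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction in percent."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


# Supervised state-of-the-art EERs quoted in the text [zhao2021speakin].
supervised = {"Vox1-O": 0.461, "Vox1-E": 0.634, "Vox1-H": 0.993}
# WavLM 94kh Large with large-margin fine-tuning and calibration (Table 4).
wavlm_large = {"Vox1-O": 0.308, "Vox1-E": 0.462, "Vox1-H": 0.906}

reductions = {
    trial: round(relative_reduction(supervised[trial], wavlm_large[trial]), 1)
    for trial in supervised
}
print(reductions)  # {'Vox1-O': 33.2, 'Vox1-E': 27.1, 'Vox1-H': 8.8}
```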

4 Discussion and Analysis

4.1 Contribution Attribution

We employ the Integrated Gradients (IG) attribution method [sundararajan2017axiomatic] to demonstrate how each layer of the pre-trained model contributes to the final SV performance. Compared with the method in [chen2021unispeech, chen2021wavlm], IG gives a better estimate of the contribution because it considers not only the layer weight but also the magnitude of each layer's hidden states. Specifically, given a well-trained downstream model $F$, the hidden states $h^l$ extracted from all layers, and the corresponding learned weights $w^l$, the attribution score of the $l$-th layer hidden states is assigned as:

$$\mathrm{Attr}(h^l) = \sum \left( w^l h^l \odot \int_{0}^{1} \left. \frac{\partial F(\tilde{o})}{\partial \tilde{o}} \right|_{\tilde{o} = \alpha o} d\alpha \right), \qquad o = \sum_{l} w^l h^l,$$

where $\odot$ denotes the Hadamard product, $\alpha$ is the integral variable, and the outer $\sum$ denotes the summation over the time and feature dimensions. A larger attribution score indicates that the corresponding hidden states are more important. The summation of the attribution scores of all the hidden states indicates the final prediction of the SV model, i.e., $\sum_{l} \mathrm{Attr}(h^l) = F(o)$. Because the integral is intractable, we approximate it with a summation of gradients:

$$\mathrm{Attr}(h^l) \approx \sum \left( w^l h^l \odot \frac{1}{m} \sum_{k=1}^{m} \left. \frac{\partial F(\tilde{o})}{\partial \tilde{o}} \right|_{\tilde{o} = \frac{k}{m} o} \right),$$

where $m$ is the number of approximation steps for computing integrated gradients. We set $m$ to 50 in our experiments.
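The gradient-summation approximation can be sketched on a toy model whose gradient is known in closed form. This is an illustration under stated assumptions, not the analysis code used for Figure 2: the quadratic `toy_model`, the all-zero baseline, and the helper names are hypothetical, and a real SV model would obtain the gradients from automatic differentiation.

```python
def integrated_gradients(layer_terms, grad_fn, steps=50):
    """Approximate one attribution score per layer.

    layer_terms: for each layer, the flattened weighted hidden states.
    grad_fn:     gradient of the model output w.r.t. the combined input.
    steps:       number of approximation steps (50, as in the text).
    """
    dim = len(layer_terms[0])
    combined = [sum(term[d] for term in layer_terms) for d in range(dim)]
    mean_grad = [0.0] * dim
    for k in range(1, steps + 1):
        alpha = k / steps
        grad = grad_fn([alpha * v for v in combined])
        for d in range(dim):
            mean_grad[d] += grad[d] / steps
    # Hadamard product with each layer's term, then sum over all dimensions.
    return [sum(term[d] * mean_grad[d] for d in range(dim))
            for term in layer_terms]


def toy_model(inputs):
    """Stand-in for the downstream model: a simple quadratic (assumption)."""
    return sum(v * v for v in inputs)


def toy_grad(inputs):
    """Analytic gradient of toy_model."""
    return [2.0 * v for v in inputs]


toy_terms = [[0.5, 0.5], [1.5, 0.5]]   # two layers, combined input is [2, 1]
scores = integrated_gradients(toy_terms, toy_grad)
prediction = toy_model([2.0, 1.0])
```

For this toy model the scores sum to roughly 5.1 against a prediction of 5.0; the small gap is the discretization error of the 50-step sum, and the per-layer scores show how the prediction is divided between the two layers.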

(a) Stage 1: we fix the pre-trained model and only train the downstream model.
(b) Stage 2: we train both the pre-trained model and the downstream model.
Figure 2: Contribution attributed to each layer of each pre-trained model

Figure 2 shows the contribution attributed to each layer of different pre-trained models. In the first stage of fine-tuning, where we train the downstream model with the pre-trained parameters fixed, the contribution mostly comes from the output of the CNN feature extractor and the first encoder layer for all the pre-trained models. This indicates that only the shallow layers of pre-trained models learn speaker-related information during the self-supervised learning procedure. If the hidden states are extracted from the ASR model, which is trained in a supervised way with the CTC loss, only the latent feature extracted by the CNN extractor contributes to the final prediction. If the hidden states are extracted from an SSL4ASR model, such as wav2vec 2.0 or HuBERT, the contribution is also dominated by the CNN-extracted feature. In contrast, if we pre-train HuBERT with data augmentation or with a phoneme-independent quantizer, such as MFCC clustering or random projection, the hidden states encoded by the Transformer layers contribute more.

In the second stage of fine-tuning, we update the parameters of the downstream model as well as the pre-trained parameters. Since this unleashes the full capability of the pre-trained model, the higher Transformer encoder layers can also learn to model speaker information with the SV training objective, and they contribute more to the final prediction than in the first stage, leading to better speaker verification performance.

4.2 Loss Landscape Visualization

To better understand how self-supervised learning benefits the SV task, we visualize and compare the two-dimensional loss landscapes along with the optimization trajectories of different SV models. For better comparison of different input features, we plot the parameters of the downstream models, and the optimization trajectories in the first fine-tuning stage where the pre-trained parameters are kept frozen.

Following [li2018visualizing, hao2019visualizing], we first define the origin and the two axes of the loss surface as the randomly initialized downstream model's parameters and two directions in the parameter space, respectively. Then, we uniformly sample multiple points around the initialized parameters, and plot the training loss of the downstream model with the parameters of each sampled point and the input feature from the pre-trained model.

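Let $\theta_0$ and $\theta_1$ denote the randomly initialized parameters and the well-trained parameters of the SV downstream model, respectively. We define one of the axes as the optimization direction $\delta_1 = \theta_1 - \theta_0$. The other axis is set as a random direction $\delta_2 = \theta_2 - \theta_0$, where $\theta_2$ denotes randomly generated parameters. Because the parameter space is high-dimensional, the two axes $\delta_1$ and $\delta_2$ are divergent and orthogonal to each other, which our experimental results confirm. Then, the 2-D loss surface can be plotted with the function $f(\alpha, \beta) = \mathcal{L}(\theta_0 + \alpha \delta_1 + \beta \delta_2)$, where $\alpha, \beta$ are scalar values and $\mathcal{L}$ is the loss function of the SV model training. For better visualization, we scale the second direction vector to the same norm as the first one by $\delta_2 \leftarrow \frac{\| \delta_1 \|}{\| \delta_2 \|} \delta_2$, where $\| \cdot \|$ is the Euclidean norm. We restrict $\alpha$ and $\beta$ to a fixed range and uniformly sample 29 points along each axis. In addition, we also project the optimization trajectory of the SV downstream model onto the two-dimensional loss surface. Specifically, for the parameters $\theta^i$ of the downstream model at the $i$-th training epoch, let $\delta^i = \theta^i - \theta_0$ denote the optimization direction at the $i$-th epoch. We calculate the cosine similarity between the optimization direction and each of the projected directions as $\cos(\delta^i, \delta_1)$ and $\cos(\delta^i, \delta_2)$. Then, the corresponding projected point of $\theta^i$ on the 2-D loss surface can be calculated as $\left( \cos(\delta^i, \delta_1) \frac{\| \delta^i \|}{\| \delta_1 \|}, \; \cos(\delta^i, \delta_2) \frac{\| \delta^i \|}{\| \delta_2 \|} \right)$.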
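The surface sampling and trajectory projection can be sketched with plain lists standing in for flattened model parameters. This is a minimal sketch, assuming a quadratic toy loss and an arbitrary three-point grid chosen for readability (the grid in the text has 29 points per axis); all function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def rescale_to(reference, direction):
    """Scale `direction` so that it has the same Euclidean norm as `reference`."""
    factor = norm(reference) / norm(direction)
    return [factor * v for v in direction]


def loss_surface(loss_fn, origin, axis1, axis2, coords):
    """Evaluate the loss at origin + a * axis1 + b * axis2 on a square grid."""
    return [[loss_fn([o + a * u + b * v for o, u, v in zip(origin, axis1, axis2)])
             for b in coords] for a in coords]


def project(params, origin, axis1, axis2):
    """Map a parameter vector (different from origin) to 2-D coordinates."""
    delta = [p - o for p, o in zip(params, origin)]

    def coordinate(axis):
        cos = dot(delta, axis) / (norm(delta) * norm(axis))
        return cos * norm(delta) / norm(axis)

    return coordinate(axis1), coordinate(axis2)


theta_init = [0.0, 0.0]            # randomly initialized parameters (toy values)
theta_final = [2.0, 0.0]           # well-trained parameters (toy values)
axis_opt = [f - i for f, i in zip(theta_final, theta_init)]
axis_rand = rescale_to(axis_opt, [0.0, 1.0])


def toy_loss(params):
    """Quadratic bowl centred on the trained parameters (assumption)."""
    return sum((p - f) ** 2 for p, f in zip(params, theta_final))


grid = [0.0, 0.5, 1.0]
surface = loss_surface(toy_loss, theta_init, axis_opt, axis_rand, grid)
midpoint = project([1.0, 0.0], theta_init, axis_opt, axis_rand)
```

By construction the initial parameters sit at the origin of the plot and the trained parameters at coordinate (1, 0), so an epoch halfway along the optimization direction lands at (0.5, 0).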

(a) FBank feature
(b) RI-WavLM
(c) WavLM
Figure 3: Visualization of the loss landscape and optimization trajectories of the SV model with different input features (FBank feature, Random-Initialized WavLM feature, and WavLM feature). The figure below is a top view of the figure above.

Figure 3 shows the loss landscapes of the speaker verification downstream model with different input features. Compared with the FBank feature, the representation from the Random-Initialized WavLM model provides a wider optimum, which enables better resistance against small perturbations and leads to easier SV model optimization. However, without self-supervised pre-training, the speaker verification model gets stuck in a poor local minimum with worse speaker verification performance. With large-scale self-supervised learning, the pre-trained WavLM representation provides a better initial point with a much broader and deeper optimum area. Even under small disturbances, the WavLM input feature enables the downstream model to converge to the expected optimal region, and prevents it from skipping over the optimal region because of a steep loss hill.

5 Conclusion

Our experimental results demonstrate that the self-supervised learning procedure is the key to the success on SV task. Among a variety of SSL methods, the masked pseudo-label prediction loss can provide the representation with best generalization capability on SV task, regardless of the pseudo-label creation methods. We also show that data augmentation and model scale-up can further strengthen SSL for SV task. Moreover, our analyses show that two-stage fine-tuning can make use of the full capacity of SSL models, and that SSL models can facilitate the SV model optimization with a better initial point with a broader and deeper optimum area.