1 Introduction
Recently, self-supervised learning (SSL) has achieved state-of-the-art results on a diverse array of downstream speech tasks [wav2vec2, hsu2021hubert, zhang2021bigssl, superb, chen2021unispeech, chen2021wavlm, wang2021self]. Typical SSL methods either discriminate correlated positive samples from negative ones (e.g., wav2vec 2.0 [wav2vec2]) or predict discrete pseudo-labels for the masked regions (e.g., HuBERT [hsu2021hubert]). Both methods implicitly learn short-time phonetic information from a huge amount of unlabeled speech, and mainly target self-supervised learning for the automatic speech recognition task (SSL4ASR).
Due to the high correlation with phoneme units, it is straightforward to see that SSL4ASR has the potential to drastically improve speech recognition. Interestingly, SSL4ASR also achieves state-of-the-art performance on speaker-related tasks, e.g., speaker verification (SV). For instance, WavLM [chen2021wavlm] and BigSSL [zhang2021bigssl] show the best performance on different partitions of the VoxCeleb1 dataset [nagrani2020voxceleb], and an ensemble of the WavLM model and Res2Net [gao2019res2net, zhou2021resnext] ranks at the top of the VoxSRC 2021 speaker verification permanent leaderboard (https://competitions.codalab.org/competitions/34066#results) under the team name StrasbourgSpk.
In this work, our goal is to understand which factors lead to the success of SSL4ASR in speaker recognition. Specifically, we try to answer the following questions:

- Can a supervised ASR model benefit the SV task?
- How does SSL benefit the SV task?
- What is the best SSL setup for the SV task?
To this end, we carefully design and conduct a series of experiments to investigate which parts of SSL are indispensable. We also perform Integrated Gradients attribution analysis and loss landscape visualization to further understand the contribution of SSL to SV performance.
Our main findings are threefold. First, SSL4ASR models transfer significantly better than supervised ASR models in an apples-to-apples comparison, indicating that the SSL objective function is a key ingredient for achieving excellent transferability. Second, the HuBERT-style loss, masked speech prediction, is slightly better than other SSL losses such as contrastive learning and the Mean Squared Error (MSE) loss, while the choice of pseudo-label generation method has only a minor impact on the performance of HuBERT-style models; even pre-training with simple clustering on raw inputs provides good SV performance. The data augmentation proposed in WavLM [chen2021wavlm] is very helpful, even when the pre-training data is scaled up to 94k hours, and both data scale and model scale correlate strongly with model transferability. Third, our analysis shows that SSL models learn speaker-related knowledge only in their shallow layers during pre-training, while the fine-tuning stage can unleash the full capability of the model. We observe that an SSL model provides a wider optimum in fine-tuning, which enables better resistance against small perturbations, stronger generalization capability, and easier SV model optimization.
2 Background
Self-supervised learning (SSL) has been shown to be an effective means of improving state-of-the-art results on the SV task [chen2021unispeech, chen2021large, chen2021wavlm]. A common practice is to first optimize the model with an SSL objective on large-scale unsupervised data, and then fine-tune the pre-trained model together with a downstream SV model on an annotated dataset.
The typical SSL objectives are designed for the automatic speech recognition task (SSL4ASR) by implicitly learning short-time phonetic information from unlabeled speech [wav2vec2, hsu2021hubert]. Specifically, given a raw audio $x$, a latent representation $z = (z_1, \dots, z_T)$ is obtained by a CNN feature extractor, where $T$ is the number of frames. The representation is then fed to an $L$-layer Transformer model, yielding hidden states $h^l = (h^l_1, \dots, h^l_T)$, where $l \in \{1, \dots, L\}$ denotes the $l$-th layer of the encoder. During pre-training, we employ masking-based self-supervised learning methods to optimize the Transformer model: before feeding the latent representation to the Transformer, SSL methods first mask a proportion of the frames at random positions, and then minimize a self-supervised objective function computed on the last-layer hidden states $h^L$ over the set of masked regions $M$.
During fine-tuning, we take a weighted average of the hidden states of each layer to generate the output representation $o = \sum_{l=0}^{L} w_l h^l$, where $w_l$ is a learnable weight for the hidden state of the $l$-th layer and $h^0$ denotes the output of the CNN feature extractor. We then employ ECAPA-TDNN [desplanques2020ecapa] as the downstream SV model following [chen2021large], and feed the output representation into the downstream model to generate the speaker embedding. We use the additive angular margin (AAM) loss [deng2019arcface] as the supervised objective function, and train the downstream SV model together with the pre-trained model in two stages. In the first stage, we optimize the parameters of the downstream model while keeping the pre-trained parameters fixed. In the second stage, we continue to optimize the parameters of the downstream model as well as the pre-trained model. In addition, we can apply a large-margin fine-tuning strategy and score calibration to further improve speaker verification performance [thienpondt2021idlab].
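For concreteness, the weighted layer average can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation; in particular, the softmax normalization of the learnable weights is our assumption, since the text does not specify how the weights are normalized.

```python
import numpy as np

def weighted_layer_average(hidden_states, weights):
    """Combine per-layer hidden states h^l into a single representation
    o = sum_l softmax(w)_l * h^l, as used when feeding an SSL encoder's
    outputs to a downstream speaker-verification head.
    NOTE: softmax normalization of `weights` is an assumption."""
    w = np.exp(weights - np.max(weights))
    w = w / w.sum()                          # normalize over layers
    stacked = np.stack(hidden_states)        # (L+1, T, D)
    return np.tensordot(w, stacked, axes=1)  # (T, D)

# toy example: 3 "layers", 4 frames, 2 features; equal weights -> layer mean
layers = [np.full((4, 2), float(l)) for l in range(3)]
out = weighted_layer_average(layers, np.zeros(3))
```

With zero (i.e., equal) weights the result is simply the mean over layers; during fine-tuning the weights are trained jointly with the downstream model.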
3 Why does SSL4ASR Benefit the SV task?
3.1 Can a supervised ASR model benefit the SV task?
Given the similar modeling units of SSL4ASR and supervised ASR models, a natural question is whether a supervised ASR model can also benefit the speaker verification task. To verify this, we compare the transferability of supervised ASR and SSL4ASR models, both trained on LibriSpeech 960h [librispeech] with the Transformer structure of HuBERT [hsu2021hubert].
The ASR model is trained with the Connectionist Temporal Classification (CTC) loss [ctc] in a supervised way. We use the character sequence as the golden target labels $y$, and require the ASR model to predict them given the hidden states of the last encoder layer $h^L$: $\mathcal{L}_{\text{CTC}} = -\log p_{\text{CTC}}(y \mid h^L)$. SpecAugment is also applied following [specaug].

HuBERT, based on a masked pseudo-label prediction loss, is selected as the SSL4ASR model for the comparison [hsu2021hubert]. The pseudo labels are generated by iterative clustering. In the first iteration, we conduct an offline clustering step on the MFCC features of the input audio, where the index of each frame's cluster center is used as its pseudo label $c_t$. We then use the hidden states to predict the embeddings corresponding to the pseudo labels with a cross-entropy loss over the masked regions $M$:

$\mathcal{L} = -\sum_{t \in M} \log \frac{\exp(\mathrm{sim}(W h^L_t, e_{c_t}) / \tau)}{\sum_{c'=1}^{C} \exp(\mathrm{sim}(W h^L_t, e_{c'}) / \tau)}$,

where $W$ is the projection matrix, $e_c$ is the embedding of cluster $c$, $\mathrm{sim}(\cdot, \cdot)$ denotes the cosine similarity function, $\tau$ is a predefined temperature hyperparameter, and $C$ is the number of clusters. From the second iteration onward, we perform the offline clustering step on hidden states extracted from the previous iteration's pre-trained HuBERT model, and then train a new HuBERT model with the pseudo labels given by the new clustering centers.

We also use a randomly initialized Transformer model as a baseline, to rule out the effect of the additional parameters introduced by the pre-trained model and focus on the performance of the different pre-training methods.
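The masked pseudo-label prediction loss can be illustrated with a small numpy sketch. All values below are toy values, and the variable names ($W$ for the projection matrix, $E$ for the cluster embeddings, $\tau$ for the temperature) mirror the equation above; this is not the authors' implementation.

```python
import numpy as np

def hubert_masked_ce(h_last, labels, W, E, mask, tau=0.1):
    """Masked pseudo-label prediction loss (sketch): cross-entropy over
    cosine similarities between projected hidden states W @ h_t and the
    cluster embeddings, computed only on the masked frames."""
    proj = h_last @ W.T                                     # (T, D')
    proj = proj / np.linalg.norm(proj, axis=1, keepdims=True)
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    logits = proj @ En.T / tau                              # (T, C)
    logits -= logits.max(axis=1, keepdims=True)             # stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[mask, labels[mask]].mean()

# toy setup: 2 clusters; hidden states aligned with their cluster embeddings
E = np.eye(2)                  # cluster embeddings e_1, e_2
W = np.eye(2)                  # projection matrix
h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
mask = np.array([True, True, False])   # loss only on masked frames
good = hubert_masked_ce(h, np.array([0, 1, 0]), W, E, mask)
bad = hubert_masked_ce(h, np.array([1, 0, 0]), W, E, mask)
```

When the pseudo labels match the nearest embeddings the loss is near zero; flipping the labels makes it large, which is exactly the signal the masked prediction objective provides.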
Table 1: EER (%) on the VoxCeleb1 trial lists.

| Model  | Vox1-O | Vox1-E | Vox1-H |
|--------|--------|--------|--------|
| FBank  | 1.01   | 1.24   | 2.32   |
| Random | 3.696  | 3.71   | 6.034  |
| CTC    | 1.159  | 1.256  | 2.434  |
| HuBERT | 0.84   | 0.879  | 1.726  |
Table 1 shows that the SSL4ASR model provides a better representation than the handcrafted FBank feature, while the representations from the CTC-trained ASR model and the randomly initialized Transformer are inferior to FBank. This indicates that the key to the success of SSL4ASR on the SV task is neither the Transformer structure nor the fine-tuning pipeline, but the self-supervised learning procedure.
3.2 What is the best SSL objective for the SV task?
Besides HuBERT, which is based on a masked pseudo-label prediction loss, we also evaluate the transferability of wav2vec 2.0 [wav2vec2] and a pre-training method based on the Mean Squared Error (MSE) loss. Note that all three methods use the same mask settings proposed in HuBERT.
MSE first computes the FBank features $f_t$ of the raw audio, then takes as the objective the mean squared error between the FBank features and a linear projection of the last-layer hidden states in the masked regions: $\mathcal{L} = \sum_{t \in M} \| W h^L_t - f_t \|_2^2$, where $W$ is the projection matrix.
Wav2vec 2.0 first discretizes the latent representation $z_t$ of each masked time step into a quantized latent representation $q_t$, then uses the context representation $h^L_t$ to identify the true quantized latent representation out of a set of candidate representations $Q_t$ with a contrastive loss:

$\mathcal{L} = -\sum_{t \in M} \log \frac{\exp(\mathrm{sim}(W h^L_t, q_t) / \tau)}{\sum_{\tilde{q} \in Q_t} \exp(\mathrm{sim}(W h^L_t, \tilde{q}) / \tau)}$,

where $W$ is the projection matrix, $\mathrm{sim}(\cdot, \cdot)$ denotes the cosine similarity function, and $\tau$ is a predefined temperature hyperparameter.
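The contrastive identification step can likewise be sketched in numpy. The candidate sets below (the true quantized vector plus distractors from other frames) and their encoding as index tuples are simplifying assumptions for illustration, not wav2vec 2.0's actual distractor sampling.

```python
import numpy as np

def contrastive_loss(context, quantized, cand_idx, tau=0.1):
    """wav2vec 2.0-style contrastive loss (sketch): for each masked frame t,
    identify the true quantized vector q_t among a candidate set Q_t (the
    true one plus distractors from other frames) via cosine similarity."""
    c = context / np.linalg.norm(context, axis=1, keepdims=True)
    q = quantized / np.linalg.norm(quantized, axis=1, keepdims=True)
    total = 0.0
    for t, cands in enumerate(cand_idx):   # convention: cands[0] == t (true)
        sims = c[t] @ q[list(cands)].T / tau
        sims -= sims.max()                 # numerical stability
        total += -(sims[0] - np.log(np.exp(sims).sum()))
    return total / len(cand_idx)

# toy example: 3 frames whose context vectors match their quantized targets
q = np.eye(3)
cands = [(0, 1, 2), (1, 0, 2), (2, 0, 1)]
aligned = contrastive_loss(q, q, cands)
shuffled = contrastive_loss(q[[1, 2, 0]], q, cands)
```

The loss is near zero when each context vector matches its own quantized target and grows large when the pairing is broken, which is what drives the representation learning.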
Table 2: EER (%) on the VoxCeleb1 trial lists for different SSL objectives.

| Model       | Vox1-O | Vox1-E | Vox1-H |
|-------------|--------|--------|--------|
| MSE         | 0.979  | 1.075  | 1.98   |
| wav2vec 2.0 | 0.973  | 0.933  | 1.831  |
| HuBERT      | 0.84   | 0.879  | 1.726  |
Table 2 demonstrates that all three SSL methods provide better representations than the FBank feature, which we attribute to contextual speech representation learning from masked speech. HuBERT achieves the best performance, indicating that the pseudo-label prediction loss generalizes better than the contrastive and MSE losses.
3.3 What is the best SSL quantizer for the SV task?
Since the HuBERT-style loss outperforms the others, we explore the performance of different pseudo-label creation methods (quantizers) for the HuBERT loss. Besides the MFCC clustering and hidden-state clustering introduced by HuBERT, we also experiment with labels obtained by random projection [chiu2022self], a VQ-VAE quantizer [van2017neural], and frame-phoneme alignment.
With the random projection quantizer, we first extract the FBank features $f_t$ of the input audio, project them to vectors $v_t = A f_t$ with a randomly initialized matrix $A$, and then find the closest vector in a set of randomly initialized vectors $\{e_1, \dots, e_V\}$, where $V$ is the number of vectors (codes). The pseudo label of the $t$-th frame is defined as the index of the closest vector: $y_t = \arg\min_i \| e_i - v_t \|_2$.

With the VQ-VAE quantizer, we first extract the FBank features of the input audio and train a VQ-VAE model [van2017neural] to reconstruct them on LibriSpeech 960h [librispeech]. Given the latent variable $z_t$ obtained by a Transformer-based encoder, we discretize it to the closest vector in a latent embedding space $\{e_1, \dots, e_V\}$, where $V$ is the number of embeddings, and then reconstruct the features with a Transformer-based decoder. The training loss of the VQ-VAE minimizes the mean squared error between the reconstructed features $\hat{f}_t$ and the input features, along with the difference between the encoded variable and the discrete variable:

$\mathcal{L} = \sum_t \left( \| \hat{f}_t - f_t \|_2^2 + \| \mathrm{sg}(z_t) - e \|_2^2 + \beta \| z_t - \mathrm{sg}(e) \|_2^2 \right)$,

where $\mathrm{sg}(\cdot)$ is the stop-gradient operator and $\beta$ is a predefined hyperparameter. During inference, the pseudo label of the $t$-th frame is defined as the index of the closest discrete latent variable in the latent embedding space: $y_t = \arg\min_i \| e_i - z_t \|_2$.
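The random projection quantizer is simple enough to sketch end to end. Both the projection matrix and the codebook are frozen random parameters; the l2-normalization before the nearest-neighbor search follows common practice for this quantizer but is an assumption here, as the text does not state the distance used.

```python
import numpy as np

def random_projection_labels(feats, code_dim, num_codes, seed=0):
    """Assign each frame the index of the nearest code in a frozen random
    codebook, after projecting its FBank feature with a frozen random
    matrix A (pseudo label y_t = argmin_i ||e_i - v_t||)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((code_dim, feats.shape[1]))    # random projection
    codebook = rng.standard_normal((num_codes, code_dim))  # random codes e_i
    v = feats @ A.T
    v = v / np.linalg.norm(v, axis=1, keepdims=True)       # assumption: l2-norm
    e = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    dists = ((v[:, None, :] - e[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)                            # pseudo labels y_t

frames = np.random.default_rng(1).standard_normal((6, 4))  # 6 toy frames
labels = random_projection_labels(frames, code_dim=3, num_codes=8)
```

Because nothing is trained, the quantizer is deterministic for a fixed seed, which is what makes it such a cheap pseudo-label source.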
In addition, we also consider using the phoneme sequence of the input audio as pseudo labels, to see whether ASR-related pseudo labels can benefit SV performance. Here, we use a forced-alignment tool [mcauliffe17_interspeech] to obtain frame-phoneme pairs on the LibriSpeech 960h data.
Table 3: EER (%) on the VoxCeleb1 trial lists for different quantizers.

| Model                          | Vox1-O | Vox1-E | Vox1-H |
|--------------------------------|--------|--------|--------|
| MFCC Clustering                | 0.872  | 0.917  | 1.766  |
| Hidden State Clustering        | 0.840  | 0.879  | 1.726  |
| Random Projection (500 codes)  | 0.899  | 0.95   | 1.775  |
| Random Projection (8192 codes) | 0.883  | 0.903  | 1.675  |
| VQ-VAE                         | 0.824  | 0.899  | 1.655  |
| Phoneme                        | 0.867  | 0.918  | 1.776  |
Table 3 shows that all the quantizers perform similarly on the speaker verification task. Even when we use the phoneme sequence as pseudo labels, which is irrelevant to speaker information, we can still obtain a well-performing speaker verification model with the masked pseudo-label prediction SSL method.
3.4 Large-Scale SSL for the SV task
Moreover, we leverage data augmentation and a scale-up strategy to further strengthen self-supervised learning for the speaker verification task. Following WavLM [chen2021wavlm], we employ the masked speech denoising and prediction framework as a data-augmented self-supervised learning method, improving the pre-trained model's robustness to complex acoustic environments and its preservation of speaker identity. We also scale up the unlabeled pre-training data to 94k hours of public audio [chen2021wavlm], including 60k hours of Libri-Light [librilight], 10k hours of GigaSpeech [GigaSpeech2021], and 24k hours of VoxPopuli [wang2021voxpopuli], and enlarge the model to a 24-layer Transformer with 316M parameters.
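The augmentation idea, overlaying an interfering signal onto part of the input while the prediction targets stay those of the clean utterance, can be sketched as follows. The segment-length range and SNR scaling details here are illustrative assumptions, not WavLM's exact recipe.

```python
import numpy as np

def mix_for_denoising(wave, interferer, snr_db, rng):
    """Overlay an interfering segment (noise or another utterance) onto a
    random region of `wave` at the requested SNR; returns the noisy input.
    Targets for masked prediction remain those of the clean utterance.
    NOTE: segment-length range and SNR handling are illustrative."""
    seg_len = int(rng.integers(1, len(wave) // 2 + 1))   # cover at most half
    start = int(rng.integers(0, len(wave) - seg_len + 1))
    seg = interferer[:seg_len]
    p_sig = (wave[start:start + seg_len] ** 2).mean() + 1e-12
    p_int = (seg ** 2).mean() + 1e-12
    scale = np.sqrt(p_sig / (p_int * 10 ** (snr_db / 10)))
    noisy = wave.copy()
    noisy[start:start + seg_len] += scale * seg
    return noisy

rng = np.random.default_rng(0)
clean = np.ones(100)
noisy = mix_for_denoising(clean, 0.5 * np.ones(100), snr_db=5.0, rng=rng)
```

Training the model to predict clean-speech pseudo labels from such corrupted inputs is what encourages robustness to interfering speakers and noise.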
Table 4: EER (%) on the VoxCeleb1 trial lists for large-scale SSL models.

| Model                                               | Vox1-O | Vox1-E | Vox1-H |
|-----------------------------------------------------|--------|--------|--------|
| HuBERT 960h                                         | 0.84   | 0.879  | 1.726  |
| WavLM 960h                                          | 0.777  | 0.829  | 1.629  |
| HuBERT 94kh                                         | 0.734  | 0.847  | 1.725  |
| WavLM 94kh                                          | 0.739  | 0.742  | 1.483  |
| WavLM 94kh Large                                    | 0.505  | 0.579  | 1.176  |
| WavLM 94kh Large (+ large-margin FT and calibration)| 0.308  | 0.462  | 0.906  |
Table 4 shows that the data augmentation strategy used in WavLM successfully benefits self-supervised learning for the SV task. The performance improvement is more significant when the pre-training data is scaled up to 94kh. Thanks to the larger parameter capacity, the WavLM Large model brings more than 20% EER reduction compared to the WavLM Base model. With the large-margin fine-tuning strategy and score calibration methods, the WavLM Large model achieves 33.2%, 27.1%, and 8.8% relative EER reductions compared to the state-of-the-art supervised model (Vox1-O: 0.461, Vox1-E: 0.634, Vox1-H: 0.993) [zhao2021speakin] on the three VoxCeleb1 trial lists.
4 Discussion and Analysis
4.1 Contribution Attribution
We employ the Integrated Gradients (IG) attribution method [sundararajan2017axiomatic] to demonstrate how each layer of the pre-trained model contributes to the final SV performance. Compared with the method in [chen2021unispeech, chen2021wavlm], IG models contribution estimation better, as it considers not only the layer weight but also the magnitude of each layer's hidden states. Specifically, given a well-trained downstream model $F$, the hidden states $h^l$ extracted from all layers, and the corresponding learned weights $w_l$, the attribution score of the $l$-th layer's hidden states is assigned as:

$\mathrm{attr}(h^l) = \mathrm{sum}\left( (w_l h^l) \odot \int_{0}^{1} \frac{\partial F(\alpha o)}{\partial (w_l h^l)} \, d\alpha \right)$,

where $\odot$ denotes the Hadamard product, $\alpha$ is the integral variable, and $\mathrm{sum}(\cdot)$ denotes summation over the time and feature dimensions. A larger attribution score indicates that the corresponding hidden states are more important. By the completeness property of IG, the attribution scores of all hidden states sum to the final prediction of the SV model, i.e., $\sum_{l} \mathrm{attr}(h^l) = F(o) - F(0)$. Since the integral is intractable, we approximate it with a summation of gradients:

$\mathrm{attr}(h^l) \approx \mathrm{sum}\left( (w_l h^l) \odot \frac{1}{m} \sum_{k=1}^{m} \frac{\partial F(\frac{k}{m} o)}{\partial (w_l h^l)} \right)$,

where $m$ is the number of approximation steps for computing the integrated gradients. We set $m$ to 50 in our experiments.
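The gradient-summation approximation can be checked on a toy function. This is a sketch with a zero baseline; `F` and `grad_F` are stand-ins for the SV model output and its gradient with respect to the input, not the actual model.

```python
import numpy as np

def integrated_gradients(grad_F, x, steps=50):
    """Riemann approximation of Integrated Gradients with a zero baseline:
    attr_i = x_i * (1/m) * sum_k dF(k/m * x)/dx_i.
    `grad_F` returns the gradient of the model output w.r.t. its input."""
    acc = np.zeros_like(x)
    for k in range(1, steps + 1):
        acc += grad_F(k / steps * x)
    return x * acc / steps

# toy "model" F(x) = sum(x_i^2), whose exact IG attribution is x_i^2;
# by completeness, the attributions sum to F(x) - F(0)
F = lambda v: float((v ** 2).sum())
grad_F = lambda v: 2.0 * v
x = np.array([1.0, -2.0, 3.0])
attr = integrated_gradients(grad_F, x, steps=2000)
```

For this quadratic toy model the exact attribution of each coordinate is $x_i^2$, so the approximation error shrinks as the number of steps grows, which is why a moderate step count (50 in our experiments) suffices in practice.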
Figure 2 shows the contribution attribution of each layer of the different pre-trained models. In the first stage of fine-tuning, where we train the downstream model with the pre-trained parameters fixed, the contribution mostly comes from the output of the CNN feature extractor and the first encoder layer for all pre-trained models. This indicates that only the shallow layers of the pre-trained models learn speaker-related information during the self-supervised learning procedure. If the hidden states are extracted from the ASR model trained with the supervised CTC loss, only the latent features extracted by the CNN extractor contribute to the final prediction; if they are extracted from an SSL4ASR model, such as wav2vec 2.0 or HuBERT, the contribution is likewise dominated by the CNN features. In contrast, if we pre-train HuBERT with data augmentation or a phoneme-independent quantizer, such as MFCC clustering or random projection, the hidden states encoded by the Transformer layers contribute more.
In the second stage of fine-tuning, we update the parameters of the downstream model as well as the pre-trained parameters. Since this unleashes the full capability of the pre-trained model, the higher Transformer encoder layers can also learn to model speaker information under the SV training objective, and they contribute more to the final prediction than in the first stage, leading to better speaker verification performance.
4.2 Loss Landscape Visualization
To better understand how self-supervised learning benefits the SV task, we visualize and compare the two-dimensional loss landscapes and optimization trajectories of different SV models. For a fair comparison of different input features, we plot over the parameters of the downstream models, using the optimization trajectories of the first fine-tuning stage, where the pre-trained parameters are kept frozen.
Following [li2018visualizing, hao2019visualizing], we define the origin of the loss surface as the randomly initialized downstream model's parameters, and its two axes as two directions in the parameter space. We then uniformly sample multiple points around the initialized parameters, and plot the training loss of the downstream model at each sampled point, given the input features from the pre-trained model.
Let $\theta_0$ and $\theta_T$ denote the randomly initialized and well-trained parameters of the SV downstream model, respectively. We define one axis as the optimization direction $u = \theta_T - \theta_0$. The other axis is set as a random direction $v = \theta_r - \theta_0$, where $\theta_r$ are randomly generated parameters. Owing to the high-dimensional parameter space, experimental results confirm that the two axes $u$ and $v$ are divergent and nearly orthogonal to each other. The 2D loss surface can then be plotted with the function $f(\alpha, \beta) = \mathcal{L}(\theta_0 + \alpha u + \beta v)$, where $\alpha$ and $\beta$ are scalar values and $\mathcal{L}$ is the training loss of the SV model. For better visualization, we rescale the second direction to the same norm as the first by $v \leftarrow v \cdot \frac{\|u\|}{\|v\|}$, where $\|\cdot\|$ is the Euclidean norm, and uniformly sample 29 points for each axis within a fixed range of $\alpha$ and $\beta$. In addition, we project the optimization trajectory of the SV downstream model onto the two-dimensional loss surface. Specifically, let $\theta_i$ denote the parameters of the downstream model at the $i$-th training epoch and $d_i = \theta_i - \theta_0$ the optimization direction at the $i$-th epoch. We calculate the cosine similarity between the optimization direction and each of the projection axes, $\cos(d_i, u)$ and $\cos(d_i, v)$, and the corresponding projected point of $\theta_i$ on the 2D loss surface is then $(\alpha_i, \beta_i) = \left( \frac{\|d_i\| \cos(d_i, u)}{\|u\|}, \frac{\|d_i\| \cos(d_i, v)}{\|v\|} \right)$.

Figure 3 shows the visualization of the speaker verification downstream model with different input features. Compared with the FBank feature, the representation from the randomly initialized WavLM model provides a wider optimum, which gives better resistance against small perturbations and makes SV model optimization easier. However, without self-supervised pre-training, the speaker verification model gets stuck in a poor local minimum with worse speaker verification performance. With large-scale self-supervised learning, the pre-trained WavLM representation provides a better initial point with a much broader and deeper optimum area. Even under small disturbances, the WavLM input features enable the downstream model to converge to the expected optimal region, preventing it from skipping over the optimal region due to a steep loss hill.
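The surface construction described above can be sketched as follows. This is a toy sketch: `loss_fn` stands in for the SV training loss, the parameters are flattened vectors, and the grid ranges are illustrative.

```python
import numpy as np

def loss_surface(loss_fn, theta0, theta_T, theta_r, alphas, betas):
    """Evaluate f(a, b) = L(theta0 + a*u + b*v) on a grid, where
    u = theta_T - theta0 is the optimization direction and v is a random
    direction rescaled to the same Euclidean norm as u."""
    u = theta_T - theta0
    v = theta_r - theta0
    v = v * np.linalg.norm(u) / np.linalg.norm(v)   # match norms
    return np.array([[loss_fn(theta0 + a * u + b * v) for b in betas]
                     for a in alphas])

# toy quadratic loss whose minimum sits at the "trained" parameters theta_T
theta0, theta_T = np.zeros(4), np.ones(4)
theta_r = np.array([1.0, -1.0, 0.5, -0.5])          # random direction endpoint
loss = lambda th: float(((th - theta_T) ** 2).sum())
grid = loss_surface(loss, theta0, theta_T, theta_r,
                    alphas=np.linspace(-1, 2, 7), betas=np.linspace(-1, 1, 5))
```

On this toy surface the minimum lands at $(\alpha, \beta) = (1, 0)$, i.e., exactly at the trained parameters along the optimization axis, which is the behavior the visualization relies on.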
5 Conclusion
Our experimental results demonstrate that the self-supervised learning procedure is the key to success on the SV task. Among a variety of SSL methods, the masked pseudo-label prediction loss provides the representation with the best generalization capability on the SV task, regardless of the pseudo-label creation method. We also show that data augmentation and model scale-up further strengthen SSL for the SV task. Moreover, our analyses show that two-stage fine-tuning makes use of the full capacity of SSL models, and that SSL models facilitate SV model optimization by providing a better initial point with a broader and deeper optimum area.