Speech representations learned from large-scale unlabeled data have shown better generalizability than those learned with supervision, and have therefore attracted considerable interest for a range of downstream tasks. In this paper, we explore the limits of speech representations learned with different self-supervised objectives and datasets for automatic speaker verification (ASV), using a well-recognized SOTA ASV model, ECAPA-TDNN, as the downstream model. The representations from all hidden layers of the pre-trained model are first averaged with learnable weights and then fed into the ECAPA-TDNN as input features. Experimental results on the Voxceleb dataset show that the weighted average representation is significantly superior to FBank, a conventional handcrafted feature for ASV. Our best single system achieves 0.564 of VoxCeleb1, separately. Accordingly, the ensemble system with three pre-trained models can further improve the EER to 0.431. Among the three evaluation trials, our best system outperforms the winner system of the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC2021) on the VoxCeleb1-E trial.
Recent years have witnessed significant improvements in automatic speaker verification (ASV) tasks. Researchers have developed various neural network architectures [6, 16, 24, 33], training objectives [31, 34, 12, 29], and pooling functions [18, 37] to push the limits of system performance. However, these techniques typically require large amounts of well-labeled data, and collecting large-scale labeled data for real applications is challenging because of the privacy issues surrounding speaker information. Over the past years, pre-trained models have become the de facto standard for state-of-the-art performance on many natural language processing (NLP) tasks. Inspired by the great success of BERT and GPT, a series of works in the speech community, e.g. wav2vec 2.0 and HuBERT, have been proposed to leverage large-scale unlabeled data, showing impressive results on automatic speech recognition (ASR) tasks.
In the speaker verification field, many researchers have designed specific losses to train speaker embedding extractors on unlabeled data, under the assumption that each utterance contains only one speaker [35, 30, 2]. Such an assumption may limit the use of unsupervised speaker verification training on the unlimited data available from the internet. Wav2Vec 2.0 and HuBERT rely less on this assumption. These two pre-trained models have been shown to capture the phonetic structure of speech and thus benefit ASR. Probing the nature of the representations learned by different layers of pre-trained models is an interesting research topic [13, 19]. The effectiveness of Wav2Vec 2.0 in a two-stage process of pre-training and fine-tuning has been demonstrated on both speaker verification and language recognition tasks. In addition, a recently introduced benchmark evaluates pre-trained models on various downstream tasks, including ASV, and shows that speech representations learned from large-scale unlabeled data outperform Fbank. To minimize architecture changes and fine-tuning across all downstream tasks, these works use only a simple downstream model and train the ASV system on a small speaker verification dataset, Voxceleb1. Whether such speech representations can also benefit state-of-the-art (SOTA) ASV systems therefore remains an open question.
In this paper, the speech representations learned from large-scale unlabeled data are extensively investigated on a benchmark dataset for speaker verification. The major contribution of this paper is four-fold as follows:
To the best of our knowledge, this is the first attempt to use speech representations learned from large-scale unlabeled data to improve the performance of a SOTA speaker verification model (i.e., ECAPA-TDNN) on the Voxceleb dataset.
Instead of using the representations only from the final layer of the pre-trained model, we employ a weighted average of the representations from all hidden layers to fully leverage the speaker-related information embedded in the whole model.
We conduct a comprehensive study on the performance of pre-trained models with different learning methods, model sizes and large-scale training datasets.
A detailed analysis based on learnable weights is performed for probing layer-wise speaker information embedded in the pre-trained models.
Speech signals contain many kinds of information, such as phonetic structure, emotion, and speaker identity. Fbank and MFCC are the most commonly used handcrafted acoustic features; both describe sound characteristics in the frequency domain. Researchers have also done a great deal of feature engineering to improve their performance, e.g., delta features that capture the temporal dynamics of Fbank or MFCC. Prior work combined the articulation rate filter with the constant Q cepstral coefficients (CQCCs) for speaker verification and achieved a significant improvement over an MFCC baseline. To make better use of the learning ability of neural networks, Mirco et al. and Jee-weon et al. used convolutional neural networks to learn task-specific features from raw audio signals and achieved performance comparable to handcrafted features.
Recently, speech representation learning that leverages unlabeled data has been emerging. It is commonly believed that models pre-trained with self-supervised learning generalize well, and that a simple classifier added on top of their representations can obtain decent performance on many downstream tasks, even with a limited amount of labeled data. Self-supervised learning for speech representations can be categorized into three approaches: 1) reconstruction learning, which aims to reconstruct the original input using information extracted from past time steps or masked inputs; 2) contrastive learning, which learns high-level representations by solving a contrastive task in the latent embedding space; 3) multi-task learning, which uses multiple objectives and multiple inputs. A review of these approaches is given in the literature.
In this study, we leverage the representations from Wav2Vec 2.0, HuBERT and UniSpeech-SAT (a submission to ICASSP 2022; details at https://github.com/microsoft/UniSpeech) for the speaker verification task. These three models use different methods to learn the feature representation. The Wav2Vec 2.0 model uses a contrastive loss to distinguish a true speech segment from negatives. The goal of HuBERT is to predict weakly supervised labels for the masked frames. UniSpeech-SAT integrates an utterance-wise contrastive loss into HuBERT-like representation learning, which forces speaker-related information into the learned representation. Despite their different training objectives, the pre-trained models share a similar structure. As shown in the left part of Figure 1, all three consist of a convolutional feature extractor and a deep Transformer network as the encoder. Mathematically, given an input waveform $x = (x_1, \dots, x_N)$, where $N$ is the number of sampling points, the CNN feature encoder convolves the sample points into a sequence of feature vectors $z = (z_1, \dots, z_T)$. This sequence is then fed to the Transformer model, yielding a hidden state $h_t^l$ for each frame $t$ at the $l$-th layer, where $l \in \{1, \dots, L\}$.
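To make the frame rate of the convolutional feature extractor concrete, the following sketch counts how many feature vectors a waveform of a given length produces. The kernel widths and strides are the commonly published wav2vec 2.0 / HuBERT configuration and the 16 kHz sampling rate is typical for Voxceleb; both are assumptions here, since the text above does not state them.

```python
# Assumed (not stated in the text): the 7-layer convolutional stack commonly
# used by wav2vec 2.0 and HuBERT, as (kernel_width, stride) pairs, no padding.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def num_frames(num_samples, layers=CONV_LAYERS):
    """Number of feature vectors produced from `num_samples` waveform samples."""
    length = num_samples
    for kernel, stride in layers:
        if length < kernel:
            return 0  # input too short to pass through this layer
        length = (length - kernel) // stride + 1
    return length


def total_stride(layers=CONV_LAYERS):
    """Product of all strides: waveform samples between consecutive frames."""
    product = 1
    for _, stride in layers:
        product *= stride
    return product


if __name__ == "__main__":
    sample_rate = 16000          # assumed sampling rate in Hz
    segment = 3 * sample_rate    # a 3 s segment
    print(num_frames(segment), total_stride())
```

Under these assumed settings a 3 s segment gives 149 frames, one every 320 samples (20 ms), which is half the frame rate of a 10 ms Fbank front end.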
In prior work, an average pooling layer and a fully connected layer with a task-specific loss were added on top of pre-trained models, achieving results comparable to systems using handcrafted features. Other work used x-vector as the downstream model. To push the limit of downstream performance, we use the state-of-the-art speaker verification system ECAPA-TDNN as the downstream model. Compared to x-vector, ECAPA-TDNN has a more advanced design, e.g. Squeeze-Excitation Res2Blocks [11, 9] and multi-layer feature aggregation, which significantly improves system performance. The overall structure of ECAPA-TDNN is shown in the right part of Figure 1. The model takes a sequence of Fbank features as input. The frame encoder then extracts speaker information from each input frame, and the statistics pooling layer transforms the variable-length input sequence into a fixed-dimensional representation. Finally, a fully connected (FC) layer is added to extract the speaker embedding. To leverage the representations learned by the pre-trained models, we can replace Fbank with the last-layer outputs of the pre-trained models and feed them into the ECAPA-TDNN.
The pre-trained model, which has seen a vast amount of audio data, should generalize well to various downstream tasks. However, previously reported results did not show the superiority of the pre-trained representation over handcrafted features. The objectives of most pre-training tasks are not directly related to speaker recognition, and the layers close to the final objective will contain more information related to the training loss. It may therefore be better to look for speaker information in the lower layers of the pre-trained model.
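Here, similar to the implementation in [32, 20], we introduce a learnable weight $w_l$ for the hidden states from each layer $l$ of the pre-trained model. Let $h_t^l$ denote the hidden state of frame $t$ at layer $l$, with $L$ layers and $T$ frames in total. Rather than feeding the outputs of the last layer of the pre-trained model, i.e. $h_t^L$, to the downstream model, we take a weighted average of the hidden states of all layers to generate the frame representation $o_t$. Then, we replace the Fbank feature fed into the ECAPA-TDNN with the weighted average representations to extract the speaker embedding $e$:

$$o_t = \sum_{l=1}^{L} w_l \, h_t^l, \qquad e = \mathrm{ECAPA\text{-}TDNN}(o_1, \dots, o_T)$$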
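The weighted layer average described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function names are hypothetical, the softmax normalization of the weights is an assumption, and in a real system the raw weights would be trainable parameters updated together with the downstream model.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_average(hidden_states, raw_weights):
    """Combine per-layer hidden states into one frame-level feature sequence.

    hidden_states: list of L layers; each layer is a list of T frames;
                   each frame is a list of D floats.
    raw_weights:   list of L unnormalized scalars, one per layer.
    Returns a list of T frames, each a list of D floats.
    """
    if len(hidden_states) != len(raw_weights):
        raise ValueError("need exactly one weight per layer")
    weights = softmax(raw_weights)  # assumption: weights are normalized to sum to 1
    num_frames = len(hidden_states[0])
    dim = len(hidden_states[0][0])
    output = [[0.0] * dim for _ in range(num_frames)]
    for weight, layer in zip(weights, hidden_states):
        for t in range(num_frames):
            for d in range(dim):
                output[t][d] += weight * layer[t][d]
    return output


if __name__ == "__main__":
    # Two layers, one frame, two dimensions; equal raw weights give a plain mean.
    layers = [[[1.0, 2.0]], [[3.0, 4.0]]]
    print(weighted_layer_average(layers, [0.0, 0.0]))
```

The output has the same shape as a single layer's hidden states, so it can take the place of the Fbank sequence at the input of the downstream model.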
The training pipeline is divided into two stages. In the first stage, the pre-trained model is frozen, and we update only the ECAPA-TDNN and the layer weights for the hidden states. In the second stage, we fine-tune all the parameters of both the pre-trained model and the ECAPA-TDNN.
To analyze the effectiveness of pre-trained model representations for the speaker verification task, we trained and evaluated the downstream speaker verification model using the Voxceleb1 and Voxceleb2 datasets. All three official trial lists, Vox1-O, Vox1-E and Vox1-H, are used to evaluate system performance. For our baseline models using the handcrafted acoustic feature, we extract 40-dimensional Fbank features with a 25 ms window and a 10 ms frame shift. We did not apply voice activity detection (VAD) to the Voxceleb data. We also augmented the training data with MUSAN noise and RIR reverberation (https://www.openslr.org/28/), applied online with probability 0.6.
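As a rough illustration of the front-end settings above, the sketch below counts Fbank frames for a 25 ms window and 10 ms shift and draws an online augmentation decision with probability 0.6. The 16 kHz sampling rate, the no-padding framing convention, and the equal split between noise and reverberation are assumptions, and the function names are hypothetical.

```python
import random


def fbank_num_frames(num_samples, sample_rate=16000, window_ms=25, shift_ms=10):
    """Frame count without edge padding (assumed framing convention)."""
    window = sample_rate * window_ms // 1000
    shift = sample_rate * shift_ms // 1000
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift


def pick_augmentation(rng, prob=0.6):
    """Return 'noise', 'reverb', or None for one training utterance."""
    if rng.random() >= prob:
        return None  # leave the utterance clean
    # Assumption: the two augmentation types are equally likely.
    return rng.choice(["noise", "reverb"])


if __name__ == "__main__":
    rng = random.Random(0)
    print(fbank_num_frames(3 * 16000))
    print([pick_augmentation(rng) for _ in range(5)])
```

Under these assumptions a 3 s segment yields 298 Fbank frames, and about 60% of utterances are augmented on the fly in each epoch.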
The detailed information about the pre-trained models used in our experiments and the speaker verification downstream models is listed in Table 1. The HuBERT_Base, HuBERT_Large and Wav2vec2.0_Large (XLSR) models are released with the Fairseq sequence modeling toolkit (https://github.com/pytorch/fairseq). Previously reported results show that Wav2vec2.0_Base performs worse than HuBERT_Base on speaker-related tasks, so we did not use it here. UniSpeech-SAT is a recently proposed model that explicitly models speaker information during pre-training. It introduces an utterance-wise contrastive loss to model single-speaker information, where the positive instances are hidden states from the same utterance and the negative instances are hidden states from other utterances. Moreover, UniSpeech-SAT uses more synthesized or publicly available data than HuBERT. For the downstream model, we use the small ECAPA-TDNN.
We trained all the models with the Additive Angular Margin (AAM) loss and set the margin to 0.2. During training, we randomly sampled a 3 s segment from each utterance to construct each training batch. For the two-stage training pipeline described in Section 3.2.2, we first froze the pre-trained model and trained for 165 epochs. Then, we fine-tuned all the parameters for another 10 epochs. In addition, to further improve our best system, we performed large margin fine-tuning for 6 extra epochs, randomly sampling 6 s segments and setting the AAM margin to 0.5.
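The effect of the AAM margin can be seen in a small single-example sketch of the loss. The margin values come from the text; the scale factor of 30 is an assumption (the text does not give one), the function name is hypothetical, and the safeguards that practical implementations add for very large angles are omitted.

```python
import math


def aam_softmax_loss(cosines, target, margin=0.2, scale=30.0):
    """Additive angular margin softmax loss for one embedding.

    cosines: cosine similarity between the embedding and each class weight.
    target:  index of the true speaker.
    scale:   assumed logit scale, not taken from the text.
    """
    logits = []
    for index, cos_theta in enumerate(cosines):
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard acos against rounding
        if index == target:
            # Add the margin to the angle of the true class only.
            cos_theta = math.cos(math.acos(cos_theta) + margin)
        logits.append(scale * cos_theta)
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target]  # negative log-probability of the target


if __name__ == "__main__":
    toy_cosines = [0.8, 0.1, -0.2]  # toy values, not from the experiments
    for m in (0.0, 0.2, 0.5):
        print(m, aam_softmax_loss(toy_cosines, target=0, margin=m))
```

For the same embedding, a larger margin gives a larger loss, so the embedding has to sit closer to its class center before the loss becomes small; this is the motivation for raising the margin from 0.2 to 0.5 in the final fine-tuning stage.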
During evaluation, we use the cosine score to measure the similarity of trial pairs. We also use adaptive s-norm [15, 4] to normalize the scores. The embeddings extracted from the training set are averaged per speaker and used as the imposter cohort, with the cohort size set to 600. For quality-aware score calibration, we randomly generated 30k trials from the Voxceleb2 test set to train our calibration model.
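The scoring step can be sketched as follows: cosine similarity for the raw trial score, then adaptive s-norm using the statistics of the highest-scoring cohort entries for each side of the trial. This is a minimal sketch, not the authors' code; interpreting the cohort size of 600 as the number of top-scoring cohort entries kept per side is an assumption, as are the function names and the small floor on the standard deviation.

```python
import math
import statistics


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_cohort_stats(embedding, cohort, top_n):
    """Mean and standard deviation of the top_n cohort scores for one embedding."""
    scores = sorted((cosine(embedding, c) for c in cohort), reverse=True)[:top_n]
    mean = statistics.fmean(scores)
    std = max(statistics.pstdev(scores), 1e-8)  # floor avoids division by zero
    return mean, std


def adaptive_snorm(enroll, test, cohort, top_n=600):
    """Symmetric score normalization with per-trial cohort selection."""
    raw = cosine(enroll, test)
    mean_e, std_e = top_cohort_stats(enroll, cohort, top_n)
    mean_t, std_t = top_cohort_stats(test, cohort, top_n)
    return 0.5 * ((raw - mean_e) / std_e + (raw - mean_t) / std_t)


if __name__ == "__main__":
    # Toy 3-dimensional "speaker-averaged" cohort embeddings.
    cohort = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [0, 1, 1]]
    print(adaptive_snorm([1.0, 0.2, 0.1], [0.9, 0.3, 0.1], cohort, top_n=3))
```

Selecting the top-scoring cohort entries separately for the enrollment and test embeddings is what makes the normalization adaptive: each side is compared against the imposters that resemble it most.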
First, we compare the speech representations extracted from pre-trained models with the commonly used handcrafted feature. Earlier experiments have shown that Wav2Vec 2.0 pre-trained models contain speaker information and can achieve performance comparable to the handcrafted acoustic feature. In contrast to that work, in our experiments we directly replaced the handcrafted feature fed to the speaker verification model ECAPA-TDNN with the representations from pre-trained models. We also explored two ways of leveraging these representations: using the representation from the last layer only, or taking a weighted average of all the hidden representations. The results are shown in Table 2. From the upper part of the table, we find that both the last-layer representation and the weighted average of all hidden layers perform better than the handcrafted Fbank feature. Encouragingly, the weighted average of hidden representations outperforms Fbank by a very large margin (46% relative). We then augment the training data; the results are listed in the bottom part of Table 2. With data augmentation, all results improve further, and the weighted average of hidden representations again shows its superiority over the Fbank feature. For the experiments in the following sections, we use the weighted average of hidden representations for the pre-trained model and augment the training data.
To further improve the effectiveness of the representations from pre-trained models, we trained the model on a larger dataset, Voxceleb2_dev, and compared different pre-trained models and training strategies. All the results are shown in Table 3. The results show that all the large models perform better than the Fbank feature in both the Vox1_dev and Vox2_dev setups. When we unfreeze the pre-trained model and jointly fine-tune it with the downstream model, further improvements can be achieved. The improvement from fine-tuning the pre-trained model is more pronounced in the Vox2_dev setup than in the Vox1_dev setup. In addition, the Wav2vec2.0_Large (XLSR) and UniSpeech-SAT_Large pre-trained models perform better than HuBERT_Large after fine-tuning. As shown in Table 1, the training set sizes of Wav2vec2.0_Large (XLSR) and HuBERT_Large are comparable. However, the training data for Wav2vec2.0_Large (XLSR) is more diverse and better matched to the Voxceleb data, making it more suitable for this downstream task. As expected, the UniSpeech-SAT_Large model, which uses more training data, performs best among all the pre-trained models. Compared to the Fbank feature, representations from this model achieve a 30% relative EER improvement on all three trials of the Voxceleb1 evaluation set.
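For readers less familiar with the metric, the sketch below shows one simple way to estimate the equal error rate (EER) from trial scores and how a relative improvement such as the one quoted above is computed. It is a brute-force illustration with hypothetical function names and toy scores, not the evaluation code used for the reported results.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping thresholds over the observed scores.

    Returns the average of the false acceptance and false rejection rates
    at the threshold where the two are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer = float("inf"), 1.0
    for threshold in thresholds:
        # False rejection: target trials scored below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance: non-target trials scored at or above it.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


def relative_reduction(baseline_eer, new_eer):
    """Relative EER improvement of a new system over a baseline."""
    return (baseline_eer - new_eer) / baseline_eer


if __name__ == "__main__":
    # Toy scores for illustration only; not taken from the experiments.
    targets = [0.9, 0.4]
    nontargets = [0.5, 0.1]
    print(equal_error_rate(targets, nontargets))
```

The sweep is quadratic in the number of trials, which is fine for illustration; a sorted cumulative count would be the usual choice for full trial lists.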
Prior work introduced a large margin fine-tuning strategy and quality-aware score calibration for the speaker verification task and achieved impressive improvements. Here, we also adopt these two strategies to push the performance limit. The corresponding results are listed in the bottom part of Table 3. With these two strategies, our best system exceeds the state-of-the-art system (Vox1-O: 0.461, Vox1-E: 0.634, Vox1-H: 0.993) of the VoxSRC 2021 challenge on the Vox1-E trial.
The results in Section 5 have shown that it is better to leverage the representations from all the hidden layers rather than only the last layer. It is therefore meaningful to explore which layers contain more speaker information than the others. We visualize the normalized weight values for all the layers’ outputs in Figure 2. The figure shows that the speaker information in the lower layers of pre-trained models is more discriminative for the ASV task than that in the higher layers. This phenomenon is reasonable because the training objectives of the pre-trained models used in our experiments are more closely related to the speech recognition task. For the large pre-trained models in our experiments, i.e. UniSpeech-SAT_Large, HuBERT_Large and Wav2vec2.0_Large (XLSR), the learned weights assigned to the higher layers are much smaller than those of the lower layers, which indicates that we might be able to discard these higher layers to reduce model size.
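One way to turn this observation into a pruning decision is sketched below: normalize the learned layer weights and find how many of the lowest layers are needed to cover a chosen share of the total weight. This is a hypothetical heuristic for illustration only; the softmax normalization, the 95% default and the function names are assumptions, and the text does not report such an experiment.

```python
import math


def normalize(raw_weights):
    """Softmax normalization of raw layer weights (assumed normalization)."""
    peak = max(raw_weights)
    exps = [math.exp(w - peak) for w in raw_weights]
    total = sum(exps)
    return [e / total for e in exps]


def layers_to_keep(raw_weights, mass=0.95):
    """Smallest number of bottom layers whose normalized weights reach `mass`.

    raw_weights is ordered from the lowest layer to the highest layer.
    """
    cumulative = 0.0
    weights = normalize(raw_weights)
    for count, weight in enumerate(weights, start=1):
        cumulative += weight
        if cumulative >= mass:
            return count
    return len(weights)


if __name__ == "__main__":
    # Toy raw weights that decay with depth; not taken from Figure 2.
    toy = [2.0, 1.5, 1.0, -1.0, -3.0, -4.0]
    print(normalize(toy))
    print(layers_to_keep(toy, mass=0.95))
```

Layers above the returned count contribute little to the weighted average, so they are candidates for removal, although the actual effect on verification performance would have to be measured.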
In this paper, we leverage representations extracted from models pre-trained on large-scale unlabeled data for the speaker verification task. In our experiments, we first compared such representations with the handcrafted Fbank feature and verified the superiority of pre-trained representations. To comprehensively explore the speaker information in the pre-trained model, we let the model automatically learn weights for all of its hidden states, which yields a significant performance improvement over the baseline. By visualizing the learned weights, we find that the lower layers of the pre-trained model capture more speaker-related information than the higher layers. Despite the significant improvement from the pre-trained model, there is still a relatively small performance gap (on two evaluation sets) between our system and the best system in the VoxSRC2021 challenge, which uses a more aggressive augmentation strategy and dedicated training objectives. In the future, we will incorporate that system's better training setup into ours to further push the limit of speaker verification performance.
Arcface: additive angular margin loss for deep face recognition. In Proc. CVPR, pp. 4690–4699.