1 Introduction
Speaker verification is the task of determining whether two input speech utterances are spoken by the same speaker or not [1]. The general speaker verification framework consists of an embedding extraction and a scoring process. In the embedding extraction step, audio of variable duration is converted into a single fixed-dimensional vector representation called a speaker embedding, which is assumed to contain speaker-relevant information. With a sophisticated speaker embedding, even a simple scoring method such as cosine similarity or Euclidean distance has shown high speaker verification performance
[2, 3, 4]. Therefore, most studies have focused on how to extract a fine speaker embedding from the input speech. With the development of deep learning, various studies have proposed to utilize neural networks for extracting speaker representations, called deep speaker embeddings, which reflect the speaker's characteristics well
[7, 5, 6]. Despite the success of deep speaker embedding methods, there still remains the problem of performance degradation under mismatched conditions (e.g., device, noise, language). To solve this problem, there has been a demand for robust speaker embeddings that are unaffected by domain mismatch caused by speaker-irrelevant factors. Traditionally, data augmentation is the most common approach for training neural networks robust to domain mismatch. For speaker verification, simulated reverberation [8], additive noise [9], and SpecAugment [10] are good options for data augmentation, increasing the variety of acoustic environments that might be encountered in the inference phase [11, 12]. While these methods have proven effective when there are insufficient data on target conditions, they can only indirectly mitigate the domain mismatch problem.
Unlike the methods described above, various studies have proposed to disentangle speaker-irrelevant variability from the speaker embedding directly. Recently, adversarial learning-based domain adaptation methods have been studied. The works in [13, 14, 15] utilized a gradient reversal layer (GRL) to prevent the speaker embedding network from learning the information needed for the sub-task (i.e., noise classification). Although gradient reversal techniques have proven effective for performance improvement, training a network with a GRL is known to be unstable and sensitive to the hyperparameter setting. As an alternative to the GRL, domain adversarial training similar to the generative adversarial network (GAN) framework was exploited to maximize the error on the sub-task
[16, 17]. However, these domain adaptation methods have a limitation in that the adaptation is applied to a feature space shared by both speaker-relevant and speaker-irrelevant factors. Therefore, the speaker embedding is inherently hindered by speaker-independent factors. Also, adversarial training is known to be difficult and unstable [18]. Alternatively, there have been several approaches that minimize the correlation between speaker and speaker-independent embeddings in distinct spaces. For instance, the joint factor embedding (JFE) [18] framework simultaneously extracts speaker and nuisance (i.e., non-speaker) embeddings and maximizes the entropy (or uncertainty) of each embedding on its opposite task, while minimizing the correlation between the two embeddings using the mean absolute Pearson's correlation (MAPC) computed batch-wise. Similarly, [19, 20] divided features into speaker and residual embeddings and increased their uncertainty on the contrary task, and [21] minimized mutual information via the mutual information neural estimator (MINE) with a GRL. Additionally, they adopted an autoencoder framework for training the merged embedding to maintain the complete information of the input speech [19, 20, 21]. However, naively increasing uncertainty on the other task does not guarantee disentanglement.

For learning disentangled representations, mutual information (MI) minimization has gained considerable interest in various machine learning tasks
[23, 22, 24]. Since the exact computation of MI in high-dimensional space is intractable when only sample-based approaches are available, several prominent MI estimators have been proposed [26, 28, 27, 25]. Among them, the contrastive log-ratio upper bound (CLUB) [25] estimates an MI upper bound using the difference of conditional probabilities between positive and negative sample pairs in a contrastive learning manner. As our goal is to learn disentangled speaker embeddings, we utilize CLUB to explicitly reduce the interdependence between opposite latent representations.
In this work, we propose an effective learning framework for disentangled speaker representation via MI minimization. To learn a speaker embedding that is not only soundly disentangled but also has high speaker discrimination ability, we construct a three-stage structure: a front-end encoder network, a decoupling block, and classifier and MI estimator parts. Through this framework, we explicitly learn disentangled representations and obtain a practically good speaker embedding.
The rest of this paper is organized as follows: Section 2 describes MI estimation and CLUB, and Section 3 presents the proposed framework. Then, the experiments and results are addressed in Sections 4 and 5, respectively. Finally, we conclude in Section 6.
2 Mutual Information Upper Bound Estimation
Mutual information (MI) is a quantity to measure the amount of dependency between two random variables. For two continuous random variables $\mathbf{x}$ and $\mathbf{y}$, MI is defined as follows:

$I(\mathbf{x};\mathbf{y}) = \mathbb{E}_{p(\mathbf{x},\mathbf{y})}\left[\log \frac{p(\mathbf{x},\mathbf{y})}{p(\mathbf{x})\,p(\mathbf{y})}\right]$  (1)

where $p(\mathbf{x},\mathbf{y})$ is the joint distribution, and $p(\mathbf{x})$ and $p(\mathbf{y})$ denote the marginal distributions. Since our goal is to learn disentangled representations, MI minimization between two random variables is required. Therefore, we focus on the contrastive log-ratio upper bound (CLUB) [25], an MI upper bound estimator. For two given random variables $\mathbf{x}$ and $\mathbf{y}$, CLUB is formulated as follows:
$I_{\mathrm{CLUB}}(\mathbf{x};\mathbf{y}) := \mathbb{E}_{p(\mathbf{x},\mathbf{y})}\big[\log p(\mathbf{y}|\mathbf{x})\big] - \mathbb{E}_{p(\mathbf{x})}\mathbb{E}_{p(\mathbf{y})}\big[\log p(\mathbf{y}|\mathbf{x})\big]$  (2)
As the conditional distribution $p(\mathbf{y}|\mathbf{x})$ is intractable in our framework, we approximate it using a variational distribution $q_\theta(\mathbf{y}|\mathbf{x})$. In practice, the variational CLUB (vCLUB) is obtained as follows:
$\hat{I}_{\mathrm{vCLUB}}(\mathbf{x};\mathbf{y}) = \frac{1}{N}\sum_{i=1}^{N}\log q_\theta(\mathbf{y}_i|\mathbf{x}_i) - \frac{1}{N^{2}}\sum_{i=1}^{N}\sum_{j=1}^{N}\log q_\theta(\mathbf{y}_j|\mathbf{x}_i)$  (3)
where $\{(\mathbf{x}_i,\mathbf{y}_i)\}_{i=1}^{N}$ are sample pairs drawn from the joint distribution $p(\mathbf{x},\mathbf{y})$. vCLUB is no longer guaranteed to be an MI upper bound since we approximate $p(\mathbf{y}|\mathbf{x})$ with $q_\theta(\mathbf{y}|\mathbf{x})$. However, if the Kullback–Leibler (KL) divergence between the conditional and variational distributions is small enough, it can be a reliable MI upper bound estimator. Let $q_\theta(\mathbf{x},\mathbf{y}) = q_\theta(\mathbf{y}|\mathbf{x})\,p(\mathbf{x})$ be the variational joint distribution; then the KL divergence between $p(\mathbf{x},\mathbf{y})$ and $q_\theta(\mathbf{x},\mathbf{y})$ is as follows:
$\mathrm{KL}\big(p(\mathbf{x},\mathbf{y})\,\|\,q_\theta(\mathbf{x},\mathbf{y})\big) = \mathbb{E}_{p(\mathbf{x},\mathbf{y})}\left[\log\frac{p(\mathbf{y}|\mathbf{x})\,p(\mathbf{x})}{q_\theta(\mathbf{y}|\mathbf{x})\,p(\mathbf{x})}\right]$  (4)

$= \mathbb{E}_{p(\mathbf{x},\mathbf{y})}\big[\log p(\mathbf{y}|\mathbf{x}) - \log q_\theta(\mathbf{y}|\mathbf{x})\big]$  (5)

$= \mathrm{KL}\big(p(\mathbf{y}|\mathbf{x})\,\|\,q_\theta(\mathbf{y}|\mathbf{x})\big)$  (6)
where Equation (6) denotes $\mathrm{KL}(p(\mathbf{y}|\mathbf{x})\,\|\,q_\theta(\mathbf{y}|\mathbf{x}))$. Consequently, minimizing $\mathrm{KL}(p(\mathbf{x},\mathbf{y})\,\|\,q_\theta(\mathbf{x},\mathbf{y}))$ is equivalent to maximizing $\mathbb{E}_{p(\mathbf{x},\mathbf{y})}[\log q_\theta(\mathbf{y}|\mathbf{x})]$ with respect to $\theta$. We train the variational network $q_\theta(\mathbf{y}|\mathbf{x})$ by minimizing the negative log-likelihood loss function as follows:
$\mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\log q_\theta(\mathbf{y}_i|\mathbf{x}_i)$  (7)
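As a concrete illustration, the sketch below shows one way to implement the vCLUB estimate of Equation (3) and the variational training loss of Equation (7) in PyTorch, using a Gaussian variational network like the one described later in Section 3.3. The class name, hidden size, and helper functions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class GaussianVariationalNet(nn.Module):
    """Approximates q_theta(y|x) with an isotropic Gaussian (diagonal covariance)."""

    def __init__(self, x_dim: int, y_dim: int, hidden_dim: int = 1024):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(x_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, y_dim)
        )
        self.logvar = nn.Sequential(
            nn.Linear(x_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, y_dim)
        )

    def log_prob(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # log q_theta(y|x) up to an additive constant, summed over the embedding dimension
        mu, logvar = self.mu(x), self.logvar(x)
        return -0.5 * (((y - mu) ** 2) / logvar.exp() + logvar).sum(dim=-1)


def vclub_estimate(q_net: GaussianVariationalNet, x: torch.Tensor, y: torch.Tensor):
    """Sample-based vCLUB estimate of I(x; y), Eq. (3)."""
    n = x.size(0)
    positive = q_net.log_prob(x, y).mean()                  # (1/N) sum_i log q(y_i | x_i)
    # Evaluate log q(y_j | x_i) for every (i, j) pair for the negative term.
    x_rep = x.unsqueeze(1).expand(-1, n, -1)                # (N, N, D_x)
    y_rep = y.unsqueeze(0).expand(n, -1, -1)                # (N, N, D_y)
    negative = q_net.log_prob(x_rep, y_rep).mean()          # (1/N^2) sum_ij log q(y_j | x_i)
    return positive - negative


def variational_nll(q_net: GaussianVariationalNet, x: torch.Tensor, y: torch.Tensor):
    """Negative log-likelihood used to fit q_theta, Eq. (7)."""
    return -q_net.log_prob(x, y).mean()
```

In training, the variational network would be updated with `variational_nll`, while the value returned by `vclub_estimate` would be added to the main objective as a penalty.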
3 Proposed Framework
In this work, our proposed framework is constructed under the Far-Field Speaker Verification Challenge 2022 (FFSVC 2022) [29] scenario to explore more practical cases. FFSVC 2022 provides a far-field dataset collected from 155 real speakers in complex environments with multiple conditions. In particular, the datasets of FFSVC 2022 consist of noisy speech samples recorded under far-field conditions and with different devices (i.e., tablet, telephone, and microphone array). In these settings, we learn disentangled speaker and device representations. As shown in Figure 1, the overall proposed framework is composed of (A) the front-end encoder network, (B) the decoupling block, and (C) the classifier and MI estimator parts.
3.1 Front-end Encoder Network
Given an acoustic feature sequence with a variable number of frames, the front-end encoder network extracts an utterance-level initial embedding $\mathbf{x}$. To efficiently capture global and local information, we adopt the multi-scale feature aggregation Conformer (MFA-Conformer) [30] backbone and the channel- and context-dependent statistics pooling [12] for the front-end network.
When the proposed system is trained from scratch using a dataset with a limited number of speakers (i.e., the FFSVC2022 training set), our disentanglement framework does not work properly (discussed in Section 5). To this end, we first force the initial embedding to obtain sufficient speaker discrimination ability by pre-training the front-end encoder with a large-scale dataset that includes many different speakers but no device labels. Then we fine-tune the whole network with the dataset containing device labels but a limited number of speakers, so that training effectively focuses on disentangling the speaker and device factors latent in the initial embedding.
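A hypothetical sketch of this two-stage procedure is shown below; the checkpoint path, module names, and checkpoint keys are placeholders, and the optimizer settings are taken from Section 4.5.

```python
import torch

# Hypothetical two-stage setup: the front-end encoder is initialized from VoxCeleb
# pre-training, then the whole network (encoder, decoupling block, classifiers) is
# fine-tuned on FFSVC2022. Module objects and the checkpoint layout are placeholders.
ckpt = torch.load("pretrained_frontend_voxceleb.pt", map_location="cpu")
encoder.load_state_dict(ckpt["encoder"])

finetune_params = (list(encoder.parameters()) + list(decoupler.parameters())
                   + list(spk_classifier.parameters()) + list(dev_classifier.parameters()))
optimizer = torch.optim.Adam(finetune_params, lr=1e-5, weight_decay=2e-5)
```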
3.2 Decoupling Block
To explicitly divide the initial embedding $\mathbf{x}$ extracted from the front-end encoder into latent speaker and device representations, we deploy the decoupling block, as shown in Figure 2. The speaker and device embeddings are obtained via multi-layer perceptron (MLP) modules in the decoupling block. Each MLP module sequentially comprises a fully-connected (FC) layer, a batch-normalization (BN) layer, and a rectified linear unit (ReLU) activation function. Two fixed-dimensional embedding vectors, the speaker embedding $\mathbf{e}^{s}$ and the device embedding $\mathbf{e}^{d}$, are learned to represent the speaker and device characteristics of the input speech, respectively. For evaluation, the speaker embeddings are extracted and their similarities are calculated to perform verification.
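A minimal PyTorch sketch of such a block is given below, assuming 192-dimensional embeddings (Section 4.3) and one shared MLP module followed by two parallel MLP modules; the exact wiring of Figure 2 and the module names are assumptions.

```python
import torch
import torch.nn as nn


def mlp_module(in_dim: int, out_dim: int) -> nn.Sequential:
    # One MLP module: fully-connected layer, batch normalization, ReLU.
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.BatchNorm1d(out_dim), nn.ReLU())


class DecouplingBlock(nn.Module):
    """Splits the initial embedding x into speaker and device embeddings.

    Assumed layout: one shared MLP module followed by two parallel MLP modules
    (one per factor); the wiring shown in Figure 2 may differ.
    """

    def __init__(self, emb_dim: int = 192):
        super().__init__()
        self.shared = mlp_module(emb_dim, emb_dim)
        self.spk_head = mlp_module(emb_dim, emb_dim)   # speaker embedding e^s
        self.dev_head = mlp_module(emb_dim, emb_dim)   # device embedding e^d

    def forward(self, x: torch.Tensor):
        h = self.shared(x)
        return self.spk_head(h), self.dev_head(h)
```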
3.3 Classifier and MI Estimator
Analogous to previous disentanglement approaches [17, 18, 22, 24, 31], we follow a multi-task learning strategy that includes classification and MI-minimization-based disentanglement tasks. As shown in Figure 1, the classification task consists of the speaker and device classifiers. For the MI-minimization-based disentanglement task, there are three MI estimators: one between the speaker and device embeddings, and one between each embedding and the label of its opposite task.
Speaker classifier: To force the speaker embeddings to discriminate their speaker labels, we adopt the combination of the additive angular margin (AAM) softmax [32] and the angular prototypical (AP) loss [33], which has shown great performance in this field [4, 34]. Given the pairs of speaker embeddings and labels $\{(\mathbf{e}^{s}_{i}, y^{s}_{i})\}_{i=1}^{N}$, the speaker classification loss function is formulated as follows:
$\mathcal{L}_{\mathrm{AAM}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\big(s\cos(\theta_{y^{s}_{i},i}+m)\big)}{\exp\big(s\cos(\theta_{y^{s}_{i},i}+m)\big) + \sum_{j\neq y^{s}_{i}}\exp\big(s\cos\theta_{j,i}\big)}$  (8)

$\mathcal{L}_{\mathrm{AP}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\big(\cos(\mathbf{e}^{s}_{i,1},\mathbf{e}^{s}_{i,2})\big)}{\sum_{k=1}^{N}\exp\big(\cos(\mathbf{e}^{s}_{i,1},\mathbf{e}^{s}_{k,2})\big)}$  (9)

$\mathcal{L}_{\mathrm{spk}} = \mathcal{L}_{\mathrm{AAM}} + \mathcal{L}_{\mathrm{AP}}$  (10)
where $N$ is the batch size, $s$ is a scale factor, $m$ is a margin, $\cos\theta_{j,i}$ is the normalized dot product between the $j$-th class weight and $\mathbf{e}^{s}_{i}$, and $\cos(\mathbf{e}^{s}_{i,1},\mathbf{e}^{s}_{i,2})$ denotes the cosine similarity between two different utterances of the $i$-th speaker.
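The following PyTorch sketch illustrates one way to implement the two terms of Equations (8)-(10); the class names and initialization values are assumptions, and the learnable scale and bias in the AP term follow the formulation of [33] rather than anything specified here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AAMSoftmaxLoss(nn.Module):
    """Additive angular margin softmax (Eq. 8), simplified (no easy-margin handling)."""

    def __init__(self, emb_dim: int, n_classes: int, margin: float = 0.2, scale: float = 30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, emb_dim))
        nn.init.xavier_uniform_(self.weight)
        self.m, self.s = margin, scale

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        cosine = F.linear(F.normalize(emb), F.normalize(self.weight))     # cos(theta_{j,i})
        theta = torch.acos(cosine.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        target = F.one_hot(labels, cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cosine) * self.s
        return F.cross_entropy(logits, labels)


class AngularPrototypicalLoss(nn.Module):
    """Angular prototypical loss (Eq. 9) for two utterances per speaker in a batch."""

    def __init__(self, init_scale: float = 10.0, init_bias: float = -5.0):
        super().__init__()
        self.w = nn.Parameter(torch.tensor(init_scale))
        self.b = nn.Parameter(torch.tensor(init_bias))

    def forward(self, emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
        # emb_a, emb_b: (N, D), the two utterances of each of the N speakers.
        sims = F.normalize(emb_a) @ F.normalize(emb_b).t()                # (N, N) cosine matrix
        logits = self.w.clamp(min=1e-6) * sims + self.b
        labels = torch.arange(emb_a.size(0), device=emb_a.device)
        return F.cross_entropy(logits, labels)                            # match i-th query to i-th prototype
```

The speaker loss of Equation (10) is then the sum of the two terms computed on the same mini-batch.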
Device classifier: As with the speaker classifier, the device embeddings are trained to identify their device labels $y^{d}$. The device classification loss is defined as the AAM softmax:
$\mathcal{L}_{\mathrm{dev}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\big(s\cos(\theta_{y^{d}_{i},i}+m)\big)}{\exp\big(s\cos(\theta_{y^{d}_{i},i}+m)\big) + \sum_{j\neq y^{d}_{i}}\exp\big(s\cos\theta_{j,i}\big)}$  (11)
MI estimator between the speaker and device embeddings: To minimize the MI between the speaker and device embeddings, we adopt the variational CLUB estimator, which calculates the MI upper bound via the difference of variational distributions between positive and negative sample pairs. The MI upper bound between $\mathbf{e}^{s}$ and $\mathbf{e}^{d}$ is estimated as:
$I(\mathbf{e}^{s};\mathbf{e}^{d}) \leq \mathbb{E}_{p(\mathbf{e}^{s},\mathbf{e}^{d})}\big[\log q_{\theta_{1}}(\mathbf{e}^{d}|\mathbf{e}^{s})\big] - \mathbb{E}_{p(\mathbf{e}^{s})}\mathbb{E}_{p(\mathbf{e}^{d})}\big[\log q_{\theta_{1}}(\mathbf{e}^{d}|\mathbf{e}^{s})\big]$  (12)

$\hat{I}(\mathbf{e}^{s};\mathbf{e}^{d}) = \frac{1}{N}\sum_{i=1}^{N}\Big[\log q_{\theta_{1}}(\mathbf{e}^{d}_{i}|\mathbf{e}^{s}_{i}) - \frac{1}{N}\sum_{j=1}^{N}\log q_{\theta_{1}}(\mathbf{e}^{d}_{j}|\mathbf{e}^{s}_{i})\Big]$  (13)
where $q_{\theta_{1}}$ is the variational network with trainable parameters $\theta_{1}$ for approximating $p(\mathbf{e}^{d}|\mathbf{e}^{s})$, i.e., representing $q_{\theta_{1}}(\mathbf{e}^{d}|\mathbf{e}^{s})$. The variational distribution is modeled as an isotropic Gaussian with a diagonal covariance matrix, as shown in Figure 3 (left network). The mean $\boldsymbol{\mu}$ and variance $\boldsymbol{\sigma}^{2}$ are obtained via the last two MLP layers of $q_{\theta_{1}}$. The parameters of the variational network are optimized independently of the parameters of the main networks by minimizing the following negative log-likelihood:
$\mathcal{L}(\theta_{1}) = -\frac{1}{N}\sum_{i=1}^{N}\log q_{\theta_{1}}(\mathbf{e}^{d}_{i}|\mathbf{e}^{s}_{i})$  (14)
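Reusing the helpers from the Section 2 sketch, this estimator could be wired up roughly as follows; the tensor and variable names are placeholders.

```python
import torch

# Assumes GaussianVariationalNet, vclub_estimate, and variational_nll from the
# Section 2 sketch, and a decoupling block producing 192-dimensional embeddings.
q_spk_dev = GaussianVariationalNet(x_dim=192, y_dim=192)     # q_theta1(e^d | e^s)

e_spk, e_dev = torch.randn(64, 192), torch.randn(64, 192)    # placeholder batch of embeddings

# Term added to the main objective: the estimated MI upper bound, Eq. (13).
mi_penalty = vclub_estimate(q_spk_dev, e_spk, e_dev)

# Separate update for q_theta1: the NLL of Eq. (14), with embeddings detached so
# gradients do not reach the main networks.
nll_loss = variational_nll(q_spk_dev, e_spk.detach(), e_dev.detach())
```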
MI estimators between embeddings and labels: To reduce the interdependence between the embeddings and the labels of their opposite tasks, two additional estimators estimate the MI upper bounds of $I(\mathbf{e}^{s};y^{d})$ and $I(\mathbf{e}^{d};y^{s})$, respectively, through the variational CLUB as follows:
$I(\mathbf{e}^{s};y^{d}) \leq \mathbb{E}_{p(\mathbf{e}^{s},y^{d})}\big[\log q_{\phi_{1}}(y^{d}|\mathbf{e}^{s})\big] - \mathbb{E}_{p(\mathbf{e}^{s})}\mathbb{E}_{p(y^{d})}\big[\log q_{\phi_{1}}(y^{d}|\mathbf{e}^{s})\big]$  (15)

$\hat{I}(\mathbf{e}^{s};y^{d}) = \frac{1}{N}\sum_{i=1}^{N}\Big[\log q_{\phi_{1}}(y^{d}_{i}|\mathbf{e}^{s}_{i}) - \frac{1}{N}\sum_{j=1}^{N}\log q_{\phi_{1}}(y^{d}_{j}|\mathbf{e}^{s}_{i})\Big]$  (16)

$\mathcal{L}(\phi_{1}) = -\frac{1}{N}\sum_{i=1}^{N}\log q_{\phi_{1}}(y^{d}_{i}|\mathbf{e}^{s}_{i})$  (17)

$I(\mathbf{e}^{d};y^{s}) \leq \mathbb{E}_{p(\mathbf{e}^{d},y^{s})}\big[\log q_{\phi_{2}}(y^{s}|\mathbf{e}^{d})\big] - \mathbb{E}_{p(\mathbf{e}^{d})}\mathbb{E}_{p(y^{s})}\big[\log q_{\phi_{2}}(y^{s}|\mathbf{e}^{d})\big]$  (18)

$\hat{I}(\mathbf{e}^{d};y^{s}) = \frac{1}{N}\sum_{i=1}^{N}\Big[\log q_{\phi_{2}}(y^{s}_{i}|\mathbf{e}^{d}_{i}) - \frac{1}{N}\sum_{j=1}^{N}\log q_{\phi_{2}}(y^{s}_{j}|\mathbf{e}^{d}_{i})\Big]$  (19)

$\mathcal{L}(\phi_{2}) = -\frac{1}{N}\sum_{i=1}^{N}\log q_{\phi_{2}}(y^{s}_{i}|\mathbf{e}^{d}_{i})$  (20)
where $q_{\phi_{1}}$ and $q_{\phi_{2}}$ are the variational networks with trainable parameters $\phi_{1}$ and $\phi_{2}$, respectively, as shown in Figure 3 (right network). The softmax activation output of each network approximates the conditional distribution of the label given the embedding. The variational parameters $\phi_{1}$ and $\phi_{2}$ are optimized with their respective negative log-likelihood losses, in the same way as for the MI estimator between the speaker and device embeddings.
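For the label-side estimators, the variational network outputs class probabilities through a softmax. A minimal sketch is shown below; the class name, hidden size, and helper functions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CategoricalVariationalNet(nn.Module):
    """Approximates q_phi(y|e) with a softmax over class logits."""

    def __init__(self, emb_dim: int, n_classes: int, hidden_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, n_classes)
        )

    def log_prob(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        log_q = F.log_softmax(self.net(emb), dim=-1)            # log q_phi(. | e)
        return log_q.gather(1, labels.unsqueeze(1)).squeeze(1)


def vclub_label_estimate(q_net: CategoricalVariationalNet, emb: torch.Tensor, labels: torch.Tensor):
    """Sample-based vCLUB estimate of I(e; y) for discrete labels."""
    log_q = F.log_softmax(q_net.net(emb), dim=-1)               # (N, C)
    positive = log_q.gather(1, labels.unsqueeze(1)).mean()      # log q(y_i | e_i)
    negative = log_q[:, labels].mean()                          # log q(y_j | e_i) over all (i, j)
    return positive - negative


def label_nll(q_net: CategoricalVariationalNet, emb: torch.Tensor, labels: torch.Tensor):
    """Negative log-likelihood for fitting q_phi."""
    return -q_net.log_prob(emb, labels).mean()
```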
3.4 Total Objective Function
Finally, the main networks (i.e., the front-end encoder, the decoupling block, and the speaker and device classifiers) are jointly trained with the following total objective function:
$\mathcal{L}_{\mathrm{total}} = \lambda_{1}\mathcal{L}_{\mathrm{spk}} + \lambda_{2}\mathcal{L}_{\mathrm{dev}} + \lambda_{3}\hat{I}(\mathbf{e}^{s};\mathbf{e}^{d}) + \lambda_{4}\hat{I}(\mathbf{e}^{s};y^{d}) + \lambda_{5}\hat{I}(\mathbf{e}^{d};y^{s})$  (21)
where $\lambda_{1},\ldots,\lambda_{5}$ are weighting factors that balance each loss term. Algorithm 1 summarizes the overall disentangled representation learning framework, where the inputs include an optimizer, a learning rate, and the number of updates for the variational networks per epoch. The main and variational networks are updated alternately.
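A condensed sketch of this alternating update scheme, in the spirit of Algorithm 1 and built on the earlier sketches, is shown below; the per-batch alternation, the parameter groups, and the loss weights `lam` are illustrative assumptions.

```python
import torch

# Assumes: encoder, decoupler, spk_loss_fn, dev_loss_fn (main networks and losses) and
# the variational nets q_spk_dev, q_spk_ydev, q_dev_yspk with the estimate/NLL helpers
# from the earlier sketches.
main_params = (list(encoder.parameters()) + list(decoupler.parameters())
               + list(spk_loss_fn.parameters()) + list(dev_loss_fn.parameters()))
var_params = (list(q_spk_dev.parameters()) + list(q_spk_ydev.parameters())
              + list(q_dev_yspk.parameters()))
main_opt = torch.optim.Adam(main_params, lr=1e-5, weight_decay=2e-5)
var_opt = torch.optim.Adam(var_params, lr=1e-5)
lam = [1.0, 1.0, 0.1, 0.1, 0.1]           # illustrative weights, not the paper's values

for feats, spk_labels, dev_labels in loader:
    # 1) Update the variational networks on detached embeddings (NLL minimization).
    with torch.no_grad():
        e_spk, e_dev = decoupler(encoder(feats))
    var_loss = (variational_nll(q_spk_dev, e_spk, e_dev)
                + label_nll(q_spk_ydev, e_spk, dev_labels)
                + label_nll(q_dev_yspk, e_dev, spk_labels))
    var_opt.zero_grad(); var_loss.backward(); var_opt.step()

    # 2) Update the main networks with the total objective of Eq. (21).
    e_spk, e_dev = decoupler(encoder(feats))
    total = (lam[0] * spk_loss_fn(e_spk, spk_labels)
             + lam[1] * dev_loss_fn(e_dev, dev_labels)
             + lam[2] * vclub_estimate(q_spk_dev, e_spk, e_dev)
             + lam[3] * vclub_label_estimate(q_spk_ydev, e_spk, dev_labels)
             + lam[4] * vclub_label_estimate(q_dev_yspk, e_dev, spk_labels))
    main_opt.zero_grad(); total.backward(); main_opt.step()
```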
Table 1. Speaker verification performance on the FFSVC2022 development set. For each fine-tuned system, the first line reports fine-tuning a randomly initialized front-end encoder from scratch and the second line reports fine-tuning the VoxCeleb-pre-trained encoder.

| Pre-training dataset (front-end encoder) | Fine-tuning dataset (whole network) | Objective function | EER (%) | MinDCF |
|---|---|---|---|---|
| VoxCeleb | - | Initial embedding x from pre-trained encoder only | 12.09 | 0.722 |
| - | FFSVC2022 | JFE [18] | 11.98 | 0.688 |
| VoxCeleb | FFSVC2022 | JFE [18] | 7.02 | 0.460 |
| - | FFSVC2022 | $\mathcal{L}_{\mathrm{spk}}$ | 11.83 | 0.668 |
| VoxCeleb | FFSVC2022 | $\mathcal{L}_{\mathrm{spk}}$ | 7.08 | 0.468 |
| - | FFSVC2022 | $\mathcal{L}_{\mathrm{spk}} + \mathcal{L}_{\mathrm{dev}}$ | 12.20 | 0.690 |
| VoxCeleb | FFSVC2022 | $\mathcal{L}_{\mathrm{spk}} + \mathcal{L}_{\mathrm{dev}}$ | 7.15 | 0.473 |
| - | FFSVC2022 | $\mathcal{L}_{\mathrm{spk}} + \mathcal{L}_{\mathrm{dev}} + \hat{I}(\mathbf{e}^{s};\mathbf{e}^{d})$ | 12.06 | 0.718 |
| VoxCeleb | FFSVC2022 | $\mathcal{L}_{\mathrm{spk}} + \mathcal{L}_{\mathrm{dev}} + \hat{I}(\mathbf{e}^{s};\mathbf{e}^{d})$ | 7.03 | 0.467 |
| - | FFSVC2022 | $\mathcal{L}_{\mathrm{spk}} + \mathcal{L}_{\mathrm{dev}} + \hat{I}(\mathbf{e}^{s};y^{d}) + \hat{I}(\mathbf{e}^{d};y^{s})$ | 12.00 | 0.703 |
| VoxCeleb | FFSVC2022 | $\mathcal{L}_{\mathrm{spk}} + \mathcal{L}_{\mathrm{dev}} + \hat{I}(\mathbf{e}^{s};y^{d}) + \hat{I}(\mathbf{e}^{d};y^{s})$ | 6.99 | 0.461 |
| - | FFSVC2022 | $\mathcal{L}_{\mathrm{total}}$ (Eq. 21) | 11.95 | 0.684 |
| VoxCeleb | FFSVC2022 | $\mathcal{L}_{\mathrm{total}}$ (Eq. 21) | 6.95 | 0.450 |
4 Experiments
4.1 Datasets
To pre-train the front-end encoder network, we employ the development sets of the VoxCeleb1 and VoxCeleb2 datasets [35, 36, 37], which consist of 148,642 and 1,092,009 utterances from 1,211 and 5,994 speakers, respectively. The VoxCeleb dataset is one of the most popular corpora for large-scale text-independent speaker verification. The speech samples were extracted from YouTube video clips and are degraded with real-world noises, including background chatter, laughter, overlapping speech, room acoustics, etc. The front-end encoder network was trained in a fully supervised manner with the speaker classifier.
When fine-tuning the whole network with the pre-trained encoder, we use the FFSVC2022 training dataset, which is the composition of the training, development, and supplementary sets of the FFSVC 2020 challenge [38]. The FFSVC2022 training dataset contains 2,548,351 utterances in total from 155 speakers, of which we only utilize samples longer than 1 second (i.e., 2,542,392 utterances). The FFSVC2022 dataset was collected from four recording devices (i.e., iPhone, Android phone, iPad, and normal/circular microphone array) at six different recording distances (i.e., 0m, 25cm, 1m, 1.5m, 3m, and 5m). For our disentangled representation learning framework, we fine-tuned the whole network using the utterances with their corresponding speaker and device labels.



4.2 Evaluation Protocol and Metrics
To evaluate system performance, we adopt the development trial protocol provided by the FFSVC2022 challenge, which was utilized to tune hyperparameters and validate model performance during the previous competition period [29]. Since the FFSVC2022 development trial protocol contains speech samples collected from real speakers in multiple environments, we can evaluate system performance in realistic scenarios with multiple conditions. We report two performance metrics: the equal error rate (EER) and the minimum detection cost function (MinDCF). The EER is the error rate at which the false alarm rate (FAR) and the false rejection rate (FRR) are equal, and the MinDCF is defined as the minimum value of a weighted sum of the FAR and FRR. The parameters of the MinDCF were set as , , and .
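As an illustration, the EER can be computed from trial scores and labels with scikit-learn's ROC utilities as sketched below; the function name is ours.

```python
import numpy as np
from sklearn.metrics import roc_curve


def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: operating point where the false alarm and false reject rates are equal.

    scores: similarity score per trial; labels: 1 for target, 0 for non-target trials.
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))        # closest point to FAR == FRR
    return float((fpr[idx] + fnr[idx]) / 2.0)
```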
4.3 Model Architectures
For the front-end encoder network, we adopt the MFA-Conformer [30] architecture, a multi-scale feature-aggregated encoder for extracting speaker embeddings based on the convolution-augmented Transformer. We use six Conformer layers, which consist of a multi-headed self-attention module (MHSA), a convolution module (CM), a feed-forward module (FFM), and a subsampling layer (SSL). For the MHSA, the encoder dimension, the number of attention heads, and the dropout rate are set to 256, 4, and 0.1, respectively. For the CM, the kernel size is set to 15. For the FFM, FC layers with a dimension of 2,048 are used. For the SSL, a convolution layer with a subsampling rate of 2 is employed. We aggregate the frame-level output features into the 192-dimensional initial embedding x via the channel- and context-dependent statistics pooling [12]. In the decoupling block, there are three MLP layers, as shown in Figure 2, where each MLP layer consists of FC, ReLU, and BN sequentially. From the outputs of the last two MLP layers, the 192-dimensional speaker and device embeddings are obtained. The dimensions of the hidden and last FC layers of the variational networks are set to 1,024 and 192, respectively.
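The channel- and context-dependent statistics pooling can be sketched as follows, in the spirit of the attentive statistics pooling of ECAPA-TDNN [12]; the bottleneck width and the tanh nonlinearity are illustrative choices rather than the exact configuration used here.

```python
import torch
import torch.nn as nn


class AttentiveStatsPooling(nn.Module):
    """Channel- and context-dependent statistics pooling (sketch).

    Frame features are concatenated with their global mean/std (context), attention
    weights are computed per channel and frame, and the attention-weighted mean and
    standard deviation are concatenated as the utterance-level output.
    """

    def __init__(self, channels: int, bottleneck: int = 128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(channels * 3, bottleneck, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(bottleneck, channels, kernel_size=1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, channels, frames)
        t = h.size(-1)
        context = torch.cat(
            [h,
             h.mean(dim=-1, keepdim=True).expand(-1, -1, t),
             h.std(dim=-1, keepdim=True).expand(-1, -1, t)], dim=1)
        alpha = torch.softmax(self.attention(context), dim=-1)   # per-channel weights over frames
        mean = (alpha * h).sum(dim=-1)
        std = ((alpha * h ** 2).sum(dim=-1) - mean ** 2).clamp(min=1e-9).sqrt()
        return torch.cat([mean, std], dim=-1)                    # (batch, 2 * channels)
```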
4.4 Baseline: Joint Factor Embedding (JFE)
To compare with an existing disentanglement method, we adopt the joint factor embedding (JFE) framework [18]. JFE simultaneously learns speaker and nuisance (device) embeddings, where the cross-entropy on each embedding's main task is minimized while the entropy on its opposite task is maximized. Also, the MAPC between the two embeddings is jointly minimized. For our experimental setting, the speaker and device embeddings are optimized using the following JFE objective function:
$\mathcal{L}_{\mathrm{JFE}} = \mathcal{L}_{\mathrm{CE}}^{s} + \lambda_{1}\mathcal{L}_{\mathrm{CE}}^{d} - \lambda_{2}\big(\mathcal{H}^{s} + \mathcal{H}^{d}\big) + \lambda_{3}\,\mathrm{MAPC}(\mathbf{e}^{s},\mathbf{e}^{d})$  (22)

where $\mathcal{L}_{\mathrm{CE}}^{s}$ and $\mathcal{L}_{\mathrm{CE}}^{d}$ are the cross-entropy losses on the main tasks, $\mathcal{H}^{s}$ and $\mathcal{H}^{d}$ are the entropies on the opposite tasks, and $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are weighting factors.
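For reference, a batch-wise MAPC term could be computed as sketched below; whether the correlation is taken over all dimension pairs (as here) or only corresponding dimensions is an assumption, and [18] should be consulted for the exact definition.

```python
import torch


def mapc(e_a: torch.Tensor, e_b: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Mean absolute Pearson's correlation between two embedding batches (sketch).

    Each embedding dimension is treated as a variable observed over the batch;
    the absolute Pearson correlations between dimensions of e_a and e_b are averaged.
    """
    a = e_a - e_a.mean(dim=0, keepdim=True)
    b = e_b - e_b.mean(dim=0, keepdim=True)
    a = a / (a.norm(dim=0, keepdim=True) + eps)
    b = b / (b.norm(dim=0, keepdim=True) + eps)
    corr = a.t() @ b                    # (D_a, D_b) correlation matrix
    return corr.abs().mean()            # mean absolute Pearson's correlation
```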
4.5 Implementation Details
We made use of the PyTorch library and conducted experiments using NVIDIA GeForce RTX 3090 GPUs in parallel (all implementations are developed based on https://github.com/clovaai/voxceleb_trainer). During both the pre-training and fine-tuning phases, we randomly cropped each input utterance to a 200-frame segment and then applied MUSAN noises [9] or the simulated room impulse responses (RIRs) [8] for data augmentation. If an input utterance is shorter than 200 frames, we duplicated it and randomly selected a 200-frame segment. Acoustic features are 80-dimensional log mel-filterbanks with a Hamming window length of 25 ms, a hop size of 10 ms, and an FFT size of 512. Mean and variance normalization is applied to the log mel-filterbanks. The AAM-softmax loss function [32] employs a margin of 0.2 and a scale of 30. The AP loss function [33] uses the prototype with one utterance. We adopted a batch size of and an Adam optimizer with a weight decay of 2e-5. For the pre-training phase, we scheduled the learning rate via cosine annealing with warm restarts (SGDR) [39] with a cycle size of 25 epochs, a maximum learning rate of 1e-3, and a decreasing rate of 0.8 for two cycles. In the fine-tuning phase, we set the hyperparameters of the SGDR scheduler to a cycle size of 4 epochs, a maximum learning rate of 1e-5, and a minimum learning rate of 1e-8 for one cycle. The weighting factors for the total objective function are set to , , , , and . The weighting factors for the JFE objective function are set to , , and .

5 Results
5.1 Speaker Verification Performance
Table 1 shows the speaker verification performance on the FFSVC 2022 development set. We report the experimental results of seven systems to compare the verification performance of the proposed methods with the baseline and to analyze the effect of each objective function term in the proposed framework, i.e., $\mathcal{L}_{\mathrm{spk}}$, $\mathcal{L}_{\mathrm{dev}}$, $\hat{I}(\mathbf{e}^{s};\mathbf{e}^{d})$, $\hat{I}(\mathbf{e}^{s};y^{d})$, and $\hat{I}(\mathbf{e}^{d};y^{s})$.
In Table 1, the first row shows the result using only the initial embedding x from the pre-trained front-end encoder without fine-tuning. The second row indicates the performance of the JFE baseline described in Section 4.4. The systems in the third to seventh rows show the results using speaker embeddings fine-tuned with (third row) the speaker classification loss, (fourth row) multi-task learning with the speaker and device classification losses, (fifth row) multi-task learning including the estimated MI between the speaker and device embeddings, (sixth row) multi-task learning including the estimated MIs between the embeddings and the labels, and (seventh row) the total objective. Also, for each system, we report the result of fine-tuning a randomly initialized front-end encoder from scratch.
As shown in Table 1, where the upper value in each pair indicates the performance without pre-training, applying the regularization terms (i.e., multi-task learning and the CLUB estimators) alone did not significantly improve the speaker verification performance and sometimes even degraded the system. However, utilizing the front-end encoder pre-trained on a large-scale dataset without device labels significantly improved the system performance. This shows that the proposed framework works effectively when the speaker and device factors latent in the shared embedding x are separated after securing sufficient speaker discrimination ability. Comparing the third and fourth rows of Table 1 for the cases using the pre-trained encoder, we observed that multi-task learning alone does not help improve the verification performance. However, jointly employing the MI regularization terms led to consistent performance improvements, as shown in the fifth, sixth, and seventh rows. Finally, we obtained the best result using the total objective function, achieving an EER of 6.95% and a MinDCF of 0.450 on the FFSVC2022 development trial protocol. These results outperform those of the JFE baseline system in the second row of Table 1.
5.2 Visualization of Speaker Embedding Space
We also investigate the effect of our proposed framework in the embedding space by visualizing the speaker representations learned with three different training strategies, i.e., (a) only the speaker classification loss, (b) the JFE objective function [18], and (c) the proposed objective function. Figure 4 (a), (b), and (c) show the t-SNE plots of the speaker embeddings of ten speakers and three devices. Embedding points are colored by speaker label in the left parts of Figure 4 and by device label in the right parts.
As shown in the left parts of Figure 4 (a), (b), and (c), the speaker embeddings are well separated between different speakers. However, from the view of the device labels in the right parts of Figure 4, the embedding points of different devices are highly overlapped, making it difficult to identify the individual colors (red, blue, and green). In particular, the embedding points in the right part of Figure 4 (c) are more evenly dispersed over different devices than those in the right parts of Figure 4 (a) and (b). This shows that the speaker embedding extracted from the proposed framework is well discriminated on the main task while being indistinguishable on the sub-task. Furthermore, the speaker embedding learned via our proposed framework exhibits a more disentangled visualization than the speaker embeddings obtained from the other training strategies, i.e., only the speaker classification loss and the JFE objective function.
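A visualization like Figure 4 can be produced with scikit-learn's t-SNE as sketched below; the perplexity, color maps, and function name are illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt


def plot_tsne(embeddings: np.ndarray, spk_ids: np.ndarray, dev_ids: np.ndarray):
    # embeddings: (n_utterances, 192) speaker embeddings; spk_ids, dev_ids: integer labels.
    points = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].scatter(points[:, 0], points[:, 1], c=spk_ids, cmap="tab10", s=8)
    axes[0].set_title("colored by speaker")
    axes[1].scatter(points[:, 0], points[:, 1], c=dev_ids, cmap="Set1", s=8)
    axes[1].set_title("colored by device")
    fig.tight_layout()
    return fig
```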
6 Conclusion
In this paper, we propose a novel framework for directly disentangling the speaker representation from speaker-irrelevant factors. The proposed framework explicitly reduces the mutual information between the decoupled speaker and device embeddings by minimizing an estimate of its upper bound. Through mutual information minimization, the interdependence between the decoupled speaker and device embeddings is reduced. Experimental results demonstrate that our approach improves speaker verification performance by taking advantage of the pre-trained front-end encoder. Also, visualization of the speaker embedding space shows that the device-dependent factor in the speaker embedding is dispersed, which suggests that the interdependency between the two factors is largely removed.
Acknowledgment
This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021000456, Development of Ultra-high Speech Quality Technology for Remote Multi-speaker Conference System).
References
 [1] J. Hansen and T. Hasan, “Speaker recognition by machines and humans,” IEEE Signal Processing Magazine, vol. 32, no. 6, pp. 74–99, Oct. 2015.
 [2] J. Jung, Y. J. Kim, H.S. Heo, B.J. Lee, Y. Kwon, J. S. Chung, “Pushing the Limits of Raw Waveform Speaker Recognition,” arXiv preprint arXiv:2203.08488v2, 2022.
 [3] A. Brown, J. Huh, J. S. Chung, A. Nagrani, A. Zisserman, “VoxSRC 2021: The Third VoxCeleb Speaker Recognition Challenge,” arXiv preprint arXiv:2201.04583, 2022.
 [4] Y. Kwon, H. S. Heo, B.J. Lee, and J. S. Chung, “The ins and outs of speaker recognition: lessons from VoxSRC 2020,” in Proc. ICASSP, 2021.
 [5] D. Snyder et al., “Xvectors: Robust dnn embeddings for speaker recognition,” in Proc. ICASSP, IEEE, 2018.
 [6] K. Okabe et al., “Attentive statistics pooling for deep speaker embedding,” in Proc. INTERSPEECH, 2018, pp. 3573–3577.
 [7] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, “Deep speaker: an end-to-end neural speaker embedding system,” arXiv preprint arXiv:1705.02304, 2017.
 [8] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A Study on Data Augmentation of Reverberant Speech for Robust Speech Recognition,” in Proc. ICASSP, IEEE, 2017.
 [9] D. Snyder, G. Chen, and D. Povey, “MUSAN: A Music, Speech, and Noise Corpus,” arXiv preprint arXiv:1510.08484, 2015.

 [10] D. S. Park et al., “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Proc. INTERSPEECH, 2019, pp. 2613–2617.
 [11] H. S. Heo, B.-J. Lee, J. Huh, and J. S. Chung, “Clova baseline system for the VoxCeleb speaker recognition challenge 2020,” arXiv preprint arXiv:2009.14153, 2020.
 [12] B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: Emphasized channel attention, propagation, and aggregation in TDNN based speaker verification,” in Proc. INTERSPEECH, 2020, pp. 3830–3834.
 [13] Z. Meng, Y. Zhao, J. Li, and Y. Gong, “Adversarial speaker verification,” in Proc. ICASSP, IEEE, 2019, pp. 6216–6220.
 [14] Q. Wang, W. Rao, P. Gui, and L. Xie, “Adversarial training for multi-domain speaker recognition,” in Proc. ISCSLP, IEEE, 2021, pp. 1–5.
 [15] J. Huh, H. Heo, J. Kang, S. Watanabe, and J. Chung, “Augmentation adversarial training for unsupervised speaker recognition,” in Workshop on SelfSupervised Learning for Speech and Audio Processing, NeurIPS, 2020.
 [16] G. Bhattacharya, J. Monteiro, J. Alam, and P. Kenny, “Generative adversarial speaker embedding networks for domain robust end-to-end speaker verification,” in Proc. ICASSP, IEEE, 2019, pp. 6226–6230.
 [17] J. Zhou, T. Jiang, L. Li, Q. Hong, Z. Wang, and B. Xia, “Training multi-task adversarial network for extracting noise-robust speaker embedding,” in Proc. ICASSP, IEEE, 2019, pp. 6196–6200.
 [18] W. H. Kang, S. H. Mun, M. H. Han, and N. S. Kim, “Disentangled speaker embedding and nuisance attribute embedding for robust speaker verification,” IEEE Access, vol. 8, pp. 141838–141849, 2020.
 [19] J. Tai, H. Zhou, Q. Huang, and X. Jia, “Powerful speaker embedding training framework by adversarially disentangled identity representation,” arXiv preprint arXiv:1912.02608, 2019.
 [20] Y. Kwon, S.-W. Chung, and H.-G. Kang, “Intra-class variation reduction of speaker representation in disentanglement framework,” in Proc. INTERSPEECH, 2020, pp. 3231–3235.
 [21] M. Sang, W. Xia, and J. Hansen, “DEAAN: Disentangled embedding and adversarial adaptation network for robust speaker representation learning,” in Proc. ICASSP, IEEE, 2021, pp. 6169–6173.
 [22] W. Zhu, H. Zheng, H. Liao, W. Li, and J. Luo, “Learning bias-invariant representation by cross-sample mutual information minimization,” in Proc. ICCV, IEEE/CVF, 2021, pp. 15002–15012.
 [23] M. Babaie-Zadeh and C. Jutten, “A general approach for mutual information minimization and its application to blind source separation,” Signal Processing, vol. 85, no. 5, pp. 975–995, May 2005.

 [24] X. Hou, Y. Li, and S. Wang, “Disentangled representation for age-invariant face recognition: A mutual information minimization perspective,” in Proc. ICCV, IEEE/CVF, 2021, pp. 3692–3701.
 [25] P. Cheng, W. Hao, S. Dai, J. Liu, Z. Gan, and L. Carin, “CLUB: A contrastive log-ratio upper bound of mutual information,” in Proc. ICML, 2020, pp. 1779–1788.
 [26] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets,” in NeurIPS, 2016.
 [27] M. I. Belghazi, A. Baratin, A. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and R. D. Hjelm, “Mutual Information Neural Estimation,” in Proc. ICML, 2018, pp. 531–540.
 [28] A. Oord, Y. Li, and O. Vinyals, “Representation Learning with Contrastive Predictive Coding,” arXiv preprint arXiv:1807.03748, 2018.
 [29] X. Qin, M. Li, H. Bu, S. Narayanan, and H. Li, “Far-field Speaker Verification Challenge (FFSVC) 2022: Challenge Evaluation Plan,” https://ffsvc.github.io/assets/pdf/ffsvc2022_plan_v2.pdf, 2022.
 [30] Y. Zhang, Z. Lv, H. Wu, S. Zhang, P. Hu, Z. Wu, H. Lee, and H. Meng, “MFA-Conformer: Multi-scale Feature Aggregation Conformer for Automatic Speaker Verification,” arXiv preprint arXiv:2203.15249, 2022.
 [31] L. Yi and M. W. Mak, “Disentangled speaker embedding for robust speaker verification,” in Proc. ICASSP, IEEE, 2022, pp. 7662–7666.
 [32] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “ArcFace: Additive Angular Margin Loss for Deep Face Recognition,” in Proc. CVPR, 2019, pp. 4690–4699.
 [33] J. S. Chung, J. Hur, S. Mun, M. Lee, H. S. Heo, S. Choe, C. Ham, S. Jung, B.J. Lee, and I. Han, “In defence of metric learning for speaker recognition,” in Proc. INTERSPEECH, 2020.
 [34] S. H. Mun, J. Jung, M. H. Han, and N. S. Kim, “Frequency and Multi-Scale Selective Kernel Attention for Speaker Verification,” arXiv preprint arXiv:2204.01005v2, 2022.
 [35] A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: A Large-Scale Speaker Identification Dataset,” in Proc. INTERSPEECH, 2017, pp. 2616–2620.
 [36] J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep Speaker Recognition,” in Proc. INTERSPEECH, 2018, pp. 1086–1090.
 [37] A. Nagrani, J. S. Chung, X. Xie, and A. Zisserman, “VoxCeleb: Large-Scale Speaker Verification in the Wild,” Computer Speech & Language, vol. 60, Mar. 2020.
 [38] X. Qin, M. Li, H. Bu, W. Rao, R. K. Das, S. Narayanan, and H. Li, “The INTERSPEECH 2020 Far-Field Speaker Verification Challenge,” in Proc. INTERSPEECH, 2020, pp. 3456–3460.

 [39] I. Loshchilov and F. Hutter, “SGDR: Stochastic Gradient Descent with Warm Restarts,” in Proc. ICLR, 2017, pp. 1–13.