Multi-query multi-head attention pooling and Inter-topK penalty for speaker verification

10/11/2021 · Miao Zhao, et al.

This paper describes the multi-query multi-head attention (MQMHA) pooling and inter-topK penalty methods, which were first proposed in our submitted system description for the VoxCeleb speaker recognition challenge (VoxSRC) 2021. Most multi-head attention pooling mechanisms either attend to the whole feature through multiple heads or attend to several split parts of the whole feature. Our proposed MQMHA combines both of these mechanisms and gains more diversified information. Margin-based softmax loss functions are commonly adopted to obtain discriminative speaker representations. To further enhance inter-class discriminability, we propose a method that adds an extra inter-topK penalty on some confusable speakers. By adopting both MQMHA and the inter-topK penalty, we achieved state-of-the-art performance on all of the public VoxCeleb test sets.


1 Introduction

Speaker verification focuses on determining whether two utterances come from the same speaker. Over the years, various embedding-based models have been developed to encode an utterance into an embedding that carries speaker characteristics. There are two main categories of embedding-based models. One is the conventional i-vector, which estimates the speaker representation based on a Gaussian mixture model (GMM). The other is built on a deep neural network (DNN), such as the typical x-vector [16].

Recently, the x-vector framework has proved its superiority over the i-vector in many speaker verification tasks [17]. Since then, more and more optimizations have been proposed on top of the original x-vector architecture. For example, ResNet [6] was used to replace the time delay neural network (TDNN) layers of the x-vector [1, 27] due to its strong feature extraction ability. ECAPA-TDNN [5] was also proposed as an enhanced version of the x-vector and achieved performance competitive with ResNet [19]. Moreover, ECAPA-TDNN was further enhanced by involving both 1D and 2D convolutional neural networks (CNN), named ECAPA CNN-TDNN [20]. In general, these DNN-based architectures have four parts: (1) a backbone to encode the acoustic features into a high-level representation, (2) a pooling layer to map a variable-length sequence to a fixed-length embedding, (3) several segment-level layers to decode the hidden information, and (4) a loss function to classify different speakers and to learn a discriminative speaker embedding. In this paper, we mainly focus on the pooling layer and the loss function to further enhance the performance of speaker verification.

For the DNN-based architecture, the pooling layer is a key component that aggregates the variable-length sequence into an utterance-level embedding. Statistics pooling [16] has been a popular way to represent speaker characteristics, even though there are alternatives such as higher-order statistics [23] and the channel-wise correlation matrix [18]. Considering that different frames of a sequence have different importance, many works focus on weighting the frames to obtain a better segment-level representation. For example, inspired by the i-vector, learnable dictionary encoding (LDE) pooling [1] and the recent Xi-vector [9] learn the weights based on the theory of Gaussian mixture models (GMM). Meanwhile, the simpler and more efficient self-attention mechanism was also introduced to calculate a weighted mean and standard deviation, named attentive statistics (AS) pooling [12]. Moreover, multi-head mechanisms were further used to increase the diversity of attention, such as self-attentive (SA) pooling [30] and self multi-head attention (MHA) pooling [7]. However, these two typical multi-head attention poolings have completely different definitions of a head. SA defines multi-head as adding more than one group of trainable parameters, with the attention weights for every head computed from the whole feature (we prefer to call such a head a query), while MHA first splits the channels of the feature into several groups and then assigns an attention head to each group. Compared with SA, MHA makes it possible to learn weights from a part of the feature. However, a single head in each group may be insufficient to capture the patterns of speaker characteristics. To address this issue, we propose MQMHA pooling, which adds more than one query for each group. Furthermore, inspired by channel-dependent attentive statistics (CAS) pooling [5, 19] and vector-based SA (VSA) pooling [25], we also consider assigning a unique weight to each channel of a frame rather than applying the same weight to all channels. Therefore, our proposed MQMHA is a generalized pooling structure covering AS, MHA, SA, VSA, etc.

Besides the pooling layer, the loss function is also important for learning a discriminative speaker embedding. It ensures a low similarity between different speakers and a high similarity within the same speaker. Despite the popularity of AM-Softmax [21, 22] and AAM-Softmax [4] in speaker verification and the great number of successes they have achieved [11], applying the same angular margin to all speakers could be inappropriate because some speakers are more difficult to recognize than others. Recently, setting an adaptive margin for each sample was proposed in [29]. However, with many hard samples generated by data augmentation, tuning the range of the margin can be difficult. Differently, since a relatively strong penalty is desirable for similar speakers, we propose adding an extra inter-class penalty on the top-k negative speakers on top of the original AM-Softmax loss. On one hand, the proposed inter-topK penalty differs from other losses that focus on inter-class relations, such as minimum hyper-spherical energy (MHE) [10]: our method focuses on the relation between a sample and its top-k closest class centers, whereas MHE pushes different centers to be uniformly distributed. Moreover, the proposed inter-topK penalty can also be seen as a hard prototype mining (HPM) method without extra sampling requirements, since it likewise pays more attention to similar speakers. Finally, by applying both MQMHA and the inter-topK penalty, we achieved state-of-the-art performance on VoxCeleb tasks.

The organization of this paper is as follows: Section 2 describes our baseline architecture based on a 34-layer ResNet. Section 3 describes the two proposed methods, MQMHA and the inter-topK penalty. The experiments and results are given in Section 4, and Section 5 concludes this paper.

2 Baseline System

In this section, we first introduce our baseline system architecture and then describe the training protocol. As shown in Figure 1, the backbone of our baseline system is a modified version of the standard 34-layer ResNet, in which the kernel size of the first convolution is changed to 3 and the max pooling is removed. For the loss function, besides AM-Softmax, the k-subcenter method [3] is also used jointly as the basic loss function. In this case, the cosine similarity between a sample $x_i$ and speaker $j$ is given by

$$\cos\theta_{i,j} = \max_{c \in \{1,\dots,K\}} \frac{x_i^{\top} W_{j,c}}{\lVert x_i \rVert \, \lVert W_{j,c} \rVert},$$

where $W_{j,c}$ denotes the $c$-th sub-center of speaker $j$ and the $\max$ function means that the nearest sub-center is selected, which inhibits possible noisy samples from interfering with the dominant class center.
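
As a concrete illustration, the following is a minimal PyTorch sketch of this nearest-sub-center cosine computation. The module name, shapes, and initialization are our own assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn.functional as F

class SubCenterCosine(torch.nn.Module):
    """Cosine similarity to the nearest of K sub-centers per speaker."""

    def __init__(self, embed_dim: int, num_speakers: int, num_subcenters: int = 3):
        super().__init__()
        # One weight vector per (speaker, sub-center) pair.
        self.weight = torch.nn.Parameter(
            torch.randn(num_speakers, num_subcenters, embed_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, embed_dim) speaker embeddings.
        x = F.normalize(x, dim=-1)
        w = F.normalize(self.weight, dim=-1)
        # Cosine to every sub-center: (batch, num_speakers, num_subcenters).
        cos = torch.einsum("bd,csd->bcs", x, w)
        # Keep only the nearest sub-center per speaker, so noisy samples
        # are less likely to disturb the dominant class center.
        return cos.max(dim=-1).values  # (batch, num_speakers)
```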

Before training, we extracted 81-dimensional log Mel filter bank energies with Kaldi [14]. The window size is 25 ms and the frame shift is 10 ms. 200-frame chunks of features were used without extra voice activity detection (VAD), and the features were cepstral mean normalized before being fed into the networks. During training, the SGD optimizer with a momentum of 0.9 and a weight decay of 1e-3 was used. We used 8 GPUs with a mini-batch size of 1,024 and an initial learning rate of 0.08 to train our models. As described above, 200 frames of each sample in a batch were adopted to avoid over-fitting and to speed up training. We adopted the ReduceLROnPlateau scheduler, validating every 2,000 iterations with a patience of 2; the minimum learning rate is 1e-6 and the decay factor is 0.1. Furthermore, the margin gradually increases from 0 to 0.2 [11]. We used PyTorch [13] to conduct our experiments.
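
For reference, the optimizer and scheduler settings listed above might be written as follows in PyTorch; the placeholder model and the validation hook are assumptions for illustration only.

```python
import torch

# Placeholder for the modified ResNet34 backbone + pooling + linear layer.
model = torch.nn.Linear(81, 512)

optimizer = torch.optim.SGD(model.parameters(), lr=0.08,
                            momentum=0.9, weight_decay=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=2, min_lr=1e-6)

# In the training loop, validate every 2,000 iterations and step the
# scheduler on the validation metric:
#   if iteration % 2000 == 0:
#       scheduler.step(validation_metric)
```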

After training, a 512-dimensional embedding is extracted from the linear layer, and a simple cosine similarity between two embeddings is used as the verification score.
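
A minimal sketch of this scoring step, with illustrative variable names:

```python
import torch
import torch.nn.functional as F

emb_enroll = torch.randn(512)  # embedding of the enrollment utterance
emb_test = torch.randn(512)    # embedding of the test utterance
score = F.cosine_similarity(emb_enroll, emb_test, dim=0).item()
```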

Figure 1: Our baseline architecture for speaker verification. The backbone is a modified 34-layer ResNet. Statistics pooling, concatenating mean and standard deviation, is used as the basic pooling layer. The loss function is AM-Softmax with 3 sub-centers. The margin and scale of AM-Softmax are 0.2 and 35 respectively. The frequency dimension of the input features is 81. B: the mini-batch size for training. T: the number of frames of the input features.

3 The Proposed Methods

3.1 Multi-query Multi-head Attention Pooling

Most attentive pooling layers pay attention to the importance of particular features, for example giving different frames and frequencies different contributions to the speaker representation. Multiple heads are usually used to avoid the overly simple pattern learned by a single head. As described in Section 1, SA and MHA give two different definitions of a head, and the proposed MQMHA pooling combines them to attend to more patterns of the feature. Besides the definition of a head, there are two other noteworthy points in the various attentive poolings used in speaker verification. The first is that most attentive poolings assign the same weight to all channels of a frame (the shared case), except the recent VSA pooling; VSA is closer to the original self-attention mechanism, in which every value of the input feature gets a unique attention weight (the unique case). The second is that the attention module of most attentive poolings has two linear layers, whereas MHA has only one linear layer to reduce the number of parameters. To evaluate the effects of these two points, we also combine them in our proposed MQMHA method, which can be described as follows.

Suppose we have a backbone output $X = \{x_1, \dots, x_T\}$ with $x_t \in \mathbb{R}^{D}$, and each $x_t$ is split into $H$ parts $\{x_t^{(1)}, \dots, x_t^{(H)}\}$ with $x_t^{(h)} \in \mathbb{R}^{D/H}$, where $H$ is the number of heads. Each head has $Q$ trainable queries. Then the attention weight of $x_t^{(h)}$ under query $q$ is defined as:

$$w_t^{(h,q)} = \frac{\exp\big(f^{(h,q)}(x_t^{(h)})\big)}{\sum_{\tau=1}^{T}\exp\big(f^{(h,q)}(x_\tau^{(h)})\big)} \qquad (1)$$

where the function $f(\cdot)$ is an attention mechanism to calculate the weights, composed of either one linear layer or two linear layers with a nonlinearity $g(\cdot)$:

$$f(x) = \begin{cases} W x, & n = 1 \\ W_2\, g(W_1 x), & n = 2 \end{cases} \qquad (2)$$

where $W$ is a matrix of size $d_w \times D/H$, $W_1$ is a matrix of size $D' \times D/H$, and $W_2$ is a matrix of size $d_w \times D'$. $D'$ is the hidden size of the two linear layers and is set to 512 by default in our experiments. $d_w$ is the number of weights per frame: it equals 1 for the shared case and $D/H$ for the unique case. After the weights are calculated by the attention mechanism, the representations of mean and standard deviation can be formulated as Equation (3) and Equation (4) respectively:

$$\mu^{(h,q)} = \sum_{t=1}^{T} w_t^{(h,q)} \odot x_t^{(h)} \qquad (3)$$

$$\sigma^{(h,q)} = \sqrt{\sum_{t=1}^{T} w_t^{(h,q)} \odot x_t^{(h)} \odot x_t^{(h)} - \mu^{(h,q)} \odot \mu^{(h,q)}} \qquad (4)$$

Then we concatenate all of the sub-representations to get the utterance-level mean vector $e_{\mu} = [\mu^{(1,1)}; \dots; \mu^{(H,Q)}]$ with $e_{\mu} \in \mathbb{R}^{QD}$, and an extra attentive standard deviation vector $e_{\sigma}$ is obtained from all of the $\sigma^{(h,q)}$ in the same way. Finally, this representation of standard deviation is concatenated with $e_{\mu}$ to enhance the performance. As mentioned above, MQMHA contains as special cases SA ($H=1$, $Q>1$), MHA ($H>1$, $Q=1$), AS ($H=1$, $Q=1$) and VSA ($H=1$, $Q>1$, $d_w = D$). Moreover, MQMHA makes it possible to learn patterns from local features and from multiple queries at the same time.
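
For concreteness, here is a minimal PyTorch sketch of MQMHA pooling for the shared-weight, single-linear-layer case ($n=1$, $d_w=1$). The tensor layout (B, D, T), module name, and initialization are our own assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

class MQMHAPooling(torch.nn.Module):
    """Multi-query multi-head attentive statistics pooling (shared case, n=1)."""

    def __init__(self, in_dim: int, num_heads: int = 16, num_queries: int = 4):
        super().__init__()
        assert in_dim % num_heads == 0
        self.h, self.q = num_heads, num_queries
        self.sub_dim = in_dim // num_heads
        # One weight vector per (head, query) pair: f(x) = W x  (Eq. (2), n = 1).
        self.query = torch.nn.Parameter(
            torch.randn(num_heads, num_queries, self.sub_dim) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, D, T) frame-level features from the backbone.
        b, d, t = x.shape
        xs = x.view(b, self.h, self.sub_dim, t)            # split channels into H parts
        logits = torch.einsum("bhdt,hqd->bhqt", xs, self.query)
        w = F.softmax(logits, dim=-1)                       # Eq. (1), softmax over frames
        mu = torch.einsum("bhqt,bhdt->bhqd", w, xs)         # Eq. (3)
        var = torch.einsum("bhqt,bhdt->bhqd", w, xs * xs) - mu * mu
        sigma = var.clamp(min=1e-8).sqrt()                  # Eq. (4)
        # Concatenate all sub-representations: (B, 2 * Q * D).
        return torch.cat([mu.flatten(1), sigma.flatten(1)], dim=1)

# Example with the best configuration (q=4, h=16) on an assumed 2560-dim output:
# pool = MQMHAPooling(2560, num_heads=16, num_queries=4)
# emb = pool(torch.randn(8, 2560, 200))   # -> (8, 20480)
```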

3.2 Extra Inter-TopK Penalty for AM-Softmax Loss

The AM-Softmax and AAM-Softmax loss functions have been widely used in recent speaker recognition works. However, the optimization to further distinguish similar speakers could be limited because the same margin is applied to all negative classes. To mitigate this issue, we propose adding an extra inter-topK penalty to AM-Softmax. Given a batch of $N$ examples and $C$ classes, the AM-Softmax loss with the extra inter-topK penalty is given by:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\cos\theta_{i,y_i}-m)}}{e^{s(\cos\theta_{i,y_i}-m)} + \sum_{j=1, j\neq y_i}^{C} e^{s\,\phi(\cos\theta_{i,j})}} \qquad (5)$$

where $m$ is the original margin of AM-Softmax and $s$ is the scale factor. Here the $\cos\theta_{i,j}$ of AM-Softmax has been replaced by $\phi(\cos\theta_{i,j})$ to add an extra penalty on the inter-class similarities:

$$\phi(\cos\theta_{i,j}) = \begin{cases} \cos\theta_{i,j} + m', & j \in \mathrm{top}\text{-}k(i) \\ \cos\theta_{i,j}, & \text{otherwise} \end{cases} \qquad (6)$$

where $\cos\theta_{i,j} + m'$ can be replaced by $\cos(\theta_{i,j} - m')$ for the AAM-Softmax loss function. The extra penalty is only added for the $k$ class centers closest to the example $x_i$. Since the similarity between samples and centers changes during training, the penalty is not always applied to the same fixed centers. Moreover, it pays more attention to the confusable pairs of different speakers as training converges and the confidence of the target class increases. Therefore, this method is also a hard prototype mining method, but without extra sampling requirements.
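
Below is a hedged PyTorch sketch of Equations (5)-(6): an AM-Softmax cross-entropy in which the top-$k$ most similar negative classes receive the extra margin $m'$. The default values ($m=0.2$, $s=35$ from Figure 1; $m'=0.06$, $k=5$ from Section 4.4) come from the paper, but the function itself is our own illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def inter_topk_am_softmax(cos: torch.Tensor, labels: torch.Tensor,
                          m: float = 0.2, m_extra: float = 0.06,
                          k: int = 5, scale: float = 35.0) -> torch.Tensor:
    # cos: (batch, num_classes) cosine similarities (e.g. the sub-center
    # cosine above); labels: (batch,) target speaker indices.
    one_hot = F.one_hot(labels, cos.size(1)).bool()
    # Find the k most similar *negative* classes for each sample.
    neg_cos = cos.masked_fill(one_hot, float("-inf"))
    topk_idx = neg_cos.topk(k, dim=1).indices
    extra = torch.zeros_like(cos).scatter_(1, topk_idx, m_extra)
    # Target class: cos - m (AM-Softmax); negatives: cos (+ m' on top-k), Eq. (6).
    logits = torch.where(one_hot, cos - m, cos + extra)
    return F.cross_entropy(scale * logits, labels)
```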

The $k$ in the top-$k$ function is an important hyperparameter, which determines the number of nearest negative classes selected for each sample. To analyze this variable, we first transform Equation (5) into a form with the inter-class penalty only, where the numerator and denominator are multiplied by $e^{-s(\cos\theta_{i,y_i}-m)}$ at the same time:

$$L = \frac{1}{N}\sum_{i=1}^{N}\log\Big(1 + \sum_{j=1, j\neq y_i}^{C} e^{s\big(\phi(\cos\theta_{i,j}) - \cos\theta_{i,y_i} + m\big)}\Big) \qquad (7)$$

Then, for different $k$, the averaged margin between one example and all negative classes is given by:

$$\bar{m} = \frac{(C-1-k)\,m + k\,(m+m')}{C-1} = m + \frac{k}{C-1}\,m' \qquad (8)$$

where the range of $\bar{m}$ in terms of different $k$ is $[m, m+m']$. Note that $\bar{m}$ equals $m$ when $k=0$ and equals $m+m'$ when $k=C-1$; for these two cases, the loss reduces to the general AM-Softmax and there is no special optimization for hard examples. For the other cases, there are two different penalties for different negative speakers. In general, $k$ is not expected to be too large, because the negative speakers with high similarity are usually in the minority, which is similar to the selection of imposters in adaptive score normalization.
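
As a numeric illustration of Equation (8), assuming $C = 17{,}982$ training speakers (Section 4.1) and the best configuration from Section 4.4 ($k=5$, $m'=0.06$):

```python
m, m_extra, k, C = 0.20, 0.06, 5, 17_982
avg_margin = m + k * m_extra / (C - 1)
print(round(avg_margin, 5))  # 0.20002 -- the average margin barely moves,
                             # so the extra penalty acts on the few hardest
                             # negative speakers rather than behaving like
                             # a larger global margin.
```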

Category | Method | VoxCeleb1-O | VoxCeleb1-E | VoxCeleb1-H | VoxSRC20-dev | VoxSRC21-val
Baseline | Statistics & AM (m=0.2) | 1.0101 / 0.0997 | 1.0435 / 0.0962 | 1.7668 / 0.1531 | 2.7075 / 0.1380 | 2.9167 / 0.1576
Pooling | AS (q=1, h=1) [12] | 1.0313 / 0.0829 | 1.0224 / 0.0940 | 1.7356 / 0.1527 | 2.6863 / 0.1380 | 2.9317 / 0.1613
Pooling | SA (q=2, h=1) [30] | 0.9968 / 0.0800 | 1.0217 / 0.0924 | 1.7402 / 0.1493 | 2.6506 / 0.1339 | 2.9233 / 0.1572
Pooling | VSA (q=2, h=1) [25] | 0.9995 / 0.0845 | 1.0294 / 0.0924 | 1.7483 / 0.1479 | 2.6783 / 0.1333 | 2.8983 / 0.1566
Pooling | MHA (q=1, h=16) [7] | 0.9756 / 0.0840 | 1.0270 / 0.0930 | 1.7020 / 0.1467 | 2.6450 / 0.1321 | 2.7850 / 0.1503
Pooling | MQMHA (q=4, h=16) | 0.9465 / 0.0783 | 1.0090 / 0.0913 | 1.7099 / 0.1465 | 2.6172 / 0.1316 | 2.7467 / 0.1480
Loss | Inter-TopK (m'=0.06) | 0.9783 / 0.0846 | 1.0088 / 0.0883 | 1.7060 / 0.1461 | 2.5998 / 0.1317 | 2.7117 / 0.1491
Combine | MHA & Inter-TopK | 0.9730 / 0.0912 | 1.0170 / 0.0892 | 1.6860 / 0.1415 | 2.5760 / 0.1297 | 2.5800 / 0.1433
Combine | MQMHA & Inter-TopK | 0.9305 / 0.0738 | 0.9809 / 0.0879 | 1.6020 / 0.1373 | 2.5070 / 0.1246 | 2.5100 / 0.1403
Table 1: Results of Pooling, Loss and Their Combination on Five Test Sets of VoxCeleb. Each cell shows EER(%) / minDCF.

4 Experiments and Results

4.1 Training and Test Sets

The VoxCeleb2-dev dataset [2] was used as our training set. It contains 1,092,009 utterances from 5,994 speakers in total. As data augmentation makes the system more robust, we first adopted 3-fold speed augmentation [26, 24] to generate two additional copies of each speaker: each speech segment in the dataset was perturbed by a factor of 0.9 or 1.1 using the SoX speed function, giving 3,276,027 utterances and 17,982 speakers. We then used RIRs [8] and MUSAN [15] to create four extra augmented copies of the training utterances, with the augmentation process based on the Kaldi sre16/v2 recipe. After the two augmentations, 16,380,135 utterances were available for acoustic feature extraction. To evaluate our proposed methods, we used five public VoxCeleb test sets, VoxCeleb1-O, VoxCeleb1-E, VoxCeleb1-H, VoxSRC20-dev and VoxSRC21-val, which were also adopted in our system description [28] for VoxSRC 2021. It is worth mentioning that VoxSRC20-dev and VoxSRC21-val are much harder, as VoxSRC20-dev contains some out-of-domain utterances and VoxSRC21-val focuses on multi-lingual verification.
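
The corpus sizes quoted above can be checked with a few lines of arithmetic:

```python
utts, spks = 1_092_009, 5_994
utts_speed, spks_speed = utts * 3, spks * 3   # 3-fold speed perturbation
total_utts = utts_speed * 5                   # original + 4 RIRs/MUSAN copies
print(utts_speed, spks_speed, total_utts)     # 3276027 17982 16380135
```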

4.2 Results on Voxceleb Test Sets

In our experiments, the performance is evaluated using the Equal Error Rate (EER) and the minimum Decision Cost Function (minDCF) with C_miss = 1, C_fa = 1, and P_target = 0.01 or 0.05 in different cases. Table 1 shows the performance of the various poolings, the inter-topK loss, and their combination on the five test sets. For convenience, we take the performance on VoxSRC21-val as our benchmark. First, our proposed MQMHA pooling outperformed all the other pooling systems, showing 5.83% and 6.09% relative improvement over the baseline in terms of EER and minDCF respectively. Second, introducing the inter-topK penalty into the AM-Softmax loss reduced the EER and minDCF on all test sets, especially on VoxSRC21-val, where there are more similar utterances; the EER and minDCF decreased by 7.03% and 5.39% respectively in comparison with the baseline. Finally, although MHA pooling and MQMHA pooling are close in performance when applied alone, when combined with the inter-topK loss MQMHA pooling achieved a better result than MHA pooling, outperforming the baseline by 13.94% in EER and 10.98% in minDCF.
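
The relative improvements quoted above can be reproduced from the VoxSRC21-val column of Table 1:

```python
def rel_improvement(baseline, system):
    return 100 * (baseline - system) / baseline

# MQMHA pooling vs. baseline
print(rel_improvement(2.9167, 2.7467), rel_improvement(0.1576, 0.1480))  # ~5.83, ~6.09
# Inter-TopK vs. baseline
print(rel_improvement(2.9167, 2.7117), rel_improvement(0.1576, 0.1491))  # ~7.03, ~5.39
# MQMHA & Inter-TopK vs. baseline
print(rel_improvement(2.9167, 2.5100), rel_improvement(0.1576, 0.1403))  # ~13.94, ~10.98
```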

4.3 Ablation Study of MQMHA Pooling

To evaluate the effect of heads and queries in MQMHA, we conducted an ablation study on VoxSRC21-val. As shown in Table 2, a general attentive pooling (q=1, h=1, n=1, d_w=1) barely improves the performance compared to the baseline. When we increase the number of heads, there is no obvious improvement until h=8, and the best result is obtained when h=16. As the number of heads continues to increase, the performance begins to decay, which means that the features cannot be divided into too many parts. For multi-query, the results are unstable when increasing the number of queries while keeping the number of heads at 1. However, the improvements from multi-query are significant when the features are split into several parts. As for the function used to calculate the attention weights, we do not observe better results when using two linear layers.

Configuration | EER(%) | minDCF
no attention (baseline) | 2.9167 | 0.1576
q=1, h=1, n=1, d_w=1 | 2.8850 | 0.1569
q=1, h=2, n=1, d_w=1 | 2.9983 | 0.1717
q=1, h=4, n=1, d_w=1 | 2.9217 | 0.1633
q=1, h=8, n=1, d_w=1 | 2.8200 | 0.1573
q=1, h=16, n=1, d_w=1 | 2.7850 | 0.1503
q=1, h=32, n=1, d_w=1 | 2.9167 | 0.1585
q=2, h=1, n=1, d_w=1 | 2.8717 | 0.1575
q=4, h=1, n=1, d_w=1 | 2.9233 | 0.1645
q=8, h=1, n=1, d_w=1 | 2.8983 | 0.1642
q=2, h=16, n=1, d_w=1 | 2.8367 | 0.1581
q=4, h=16, n=1, d_w=1 | 2.7467 | 0.1480
q=8, h=16, n=1, d_w=1 | 2.7767 | 0.1557
q=4, h=16, n=2, d_w=1 | 2.7800 | 0.1532
q=4, h=16, n=2, d_w=D/h | 2.8867 | 0.1551
Table 2: Results of MQMHA on VoxSRC21-val. q: queries per head; h: heads; n: linear layers in the attention module; d_w: per-frame weight dimension (1 for the shared case, D/h for the unique case).

4.4 Ablation Study of Inter-topk Penalty

For the inter-topK method, both the extra inter-class margin m' and the number of top nearest negative speakers k have an important effect on the performance. As shown in Table 3, our proposed inter-topK outperforms the baseline by 7.03% in EER and 5.39% in minDCF when m'=0.06 and k=5. Firstly, the extra penalty m' applied on the top-k negatives plays a more important role than the original margin m. As described in Equation (8), the loss falls back to the general AM-Softmax case when k equals 0 or C-1, and simply increasing the margin m from 0.20 to 0.26 does not improve the speaker verification performance. On the other hand, only adding the extra penalty term on the top-k negative classes significantly improves the system performance, with the best result obtained when k equals 5. We also observe that k should not be too large, e.g. k=10, because some negative classes may be overly punished. Similarly, the extra penalty m' should not be too large either.

Configuration | EER(%) | minDCF
m=0.20, m'=0.00 (baseline) | 2.9167 | 0.1576
m=0.22, m'=0.00 | 2.9133 | 0.1656
m=0.24, m'=0.00 | 2.9200 | 0.1616
m=0.26, m'=0.00 | 3.0000 | 0.1719
m=0.20, m'=0.02, k=5 | 2.7550 | 0.1605
m=0.20, m'=0.04, k=5 | 2.7450 | 0.1506
m=0.20, m'=0.06, k=5 | 2.7117 | 0.1491
m=0.20, m'=0.08, k=5 | 2.7233 | 0.1501
m=0.20, m'=0.06, k=1 | 2.8833 | 0.1605
m=0.20, m'=0.06, k=2 | 2.7633 | 0.1543
m=0.20, m'=0.06, k=5 | 2.7117 | 0.1491
m=0.20, m'=0.06, k=10 | 2.7617 | 0.1556
Table 3: Results of Inter-TopK on VoxSRC21-val.

5 Conclusion

In this paper, we proposed two methods, MQMHA pooling and an inter-topK penalty on top of the AM-Softmax loss function, to further improve the performance of speaker verification. MQMHA calculates frame weights not only by splitting the features into several parts along the channel axis, but also by assigning more than one query to each part. The inter-topK penalty further enhances inter-class discriminability by adding an extra penalty term on the top-k negative speakers. Both methods outperform our baseline model. With a combination of the two, our system achieves state-of-the-art performance; the EER on VoxCeleb1-H is 1.6020% and the corresponding minDCF is 0.1373.

References

  • [1] W. Cai, Z. Cai, X. Zhang, X. Wang, and M. Li (2018) A novel learnable dictionary encoding layer for end-to-end language identification. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5189–5193. Cited by: §1, §1.
  • [2] J. S. Chung, A. Nagrani, and A. Zisserman (2018) Voxceleb2: deep speaker recognition. arXiv preprint arXiv:1806.05622. Cited by: §4.1.
  • [3] J. Deng, J. Guo, T. Liu, M. Gong, and S. Zafeiriou (2020) Sub-center arcface: boosting face recognition by large-scale noisy web faces. In European Conference on Computer Vision, pp. 741–757. Cited by: §2.
  • [4] J. Deng, J. Guo, N. Xue, and S. Zafeiriou (2019) Arcface: additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4690–4699. Cited by: §1.
  • [5] B. Desplanques, J. Thienpondt, and K. Demuynck (2020) Ecapa-tdnn: emphasized channel attention, propagation and aggregation in tdnn based speaker verification. arXiv preprint arXiv:2005.07143. Cited by: §1, §1.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
  • [7] M. India, P. Safari, and J. Hernando (2019) Self multi-head attention for speaker recognition. arXiv preprint arXiv:1906.09890. Cited by: §1, Table 1.
  • [8] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur (2017) A study on data augmentation of reverberant speech for robust speech recognition. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5220–5224. Cited by: §4.1.
  • [9] K. A. Lee, Q. Wang, and T. Koshinaka (2021) Xi-vector embedding for speaker recognition. IEEE Signal Processing Letters. Cited by: §1.
  • [10] W. Liu, R. Lin, Z. Liu, L. Liu, Z. Yu, B. Dai, and L. Song (2018) Learning towards minimum hyperspherical energy. arXiv preprint arXiv:1805.09298. Cited by: §1.
  • [11] Y. Liu, L. He, and J. Liu (2019) Large margin softmax loss for speaker verification. arXiv preprint arXiv:1904.03479. Cited by: §1, §2.
  • [12] K. Okabe, T. Koshinaka, and K. Shinoda (2018) Attentive statistics pooling for deep speaker embedding. arXiv preprint arXiv:1803.10963. Cited by: §1, Table 1.
  • [13] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32, pp. 8026–8037. Cited by: §2.
  • [14] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al. (2011) The kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. Cited by: §2.
  • [15] D. Snyder, G. Chen, and D. Povey (2015) Musan: a music, speech, and noise corpus. arXiv preprint arXiv:1510.08484. Cited by: §4.1.
  • [16] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur (2017) Deep neural network embeddings for text-independent speaker verification.. In Interspeech, pp. 999–1003. Cited by: §1, §1.
  • [17] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur (2018) X-vectors: robust dnn embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333. Cited by: §1.
  • [18] T. Stafylakis, J. Rohdin, and L. Burget (2021) Speaker embeddings by modeling channel-wise correlations. arXiv preprint arXiv:2104.02571. Cited by: §1.
  • [19] J. Thienpondt, B. Desplanques, and K. Demuynck (2020) The idlab voxceleb speaker recognition challenge 2020 system description. arXiv preprint arXiv:2010.12468. Cited by: §1, §1.
  • [20] J. Thienpondt, B. Desplanques, and K. Demuynck (2021) Integrating frequency translational invariance in tdnns and frequency positional information in 2d resnets to enhance speaker verification. arXiv preprint arXiv:2104.02370. Cited by: §1.
  • [21] F. Wang, J. Cheng, W. Liu, and H. Liu (2018) Additive margin softmax for face verification. IEEE Signal Processing Letters 25 (7), pp. 926–930. Cited by: §1.
  • [22] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu (2018) Cosface: large margin cosine loss for deep face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5265–5274. Cited by: §1.
  • [23] S. Wang, Y. Yang, Y. Qian, and K. Yu (2021) Revisiting the statistics pooling layer in deep speaker embedding learning. In 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 1–5. Cited by: §1.
  • [24] W. Wang, D. Cai, X. Qin, and M. Li (2020) The dku-dukeece systems for voxceleb speaker recognition challenge 2020. arXiv preprint arXiv:2010.12731. Cited by: §4.1.
  • [25] Y. Wu, C. Guo, H. Gao, X. Hou, and J. Xu (2020) Vector-based attentive pooling for text-independent speaker verification.. In INTERSPEECH, pp. 936–940. Cited by: §1, Table 1.
  • [26] H. Yamamoto, K. A. Lee, K. Okabe, and T. Koshinaka (2019) Speaker augmentation and bandwidth extension for deep speaker embedding.. In INTERSPEECH, pp. 406–410. Cited by: §4.1.
  • [27] H. Zeinali, S. Wang, A. Silnova, P. Matějka, and O. Plchot (2019) But system description to voxceleb speaker recognition challenge 2019. arXiv preprint arXiv:1910.12592. Cited by: §1.
  • [28] M. Zhao, Y. Ma, M. Liu, and M. Xu (2021) The speakin system for voxceleb speaker recognition challange 2021. arXiv preprint arXiv:2109.01989. Cited by: §4.1.
  • [29] D. Zhou, L. Wang, K. A. Lee, Y. Wu, M. Liu, J. Dang, and J. Wei (2020) Dynamic margin softmax loss for speaker verification.. In INTERSPEECH, pp. 3800–3804. Cited by: §1.
  • [30] Y. Zhu, T. Ko, D. Snyder, B. Mak, and D. Povey (2018) Self-attentive speaker embeddings for text-independent speaker verification.. In Interspeech, Vol. 2018, pp. 3573–3577. Cited by: §1, Table 1.