Rep Works in Speaker Verification

by Yufeng Ma, et al.

Multi-branch convolutional neural network architectures have attracted considerable attention in speaker verification, since the aggregation of multiple parallel branches can significantly improve performance. However, this design is not efficient at inference time due to the increase in model parameters and extra operations. In this paper, we present RepSPKNet, a new multi-branch network architecture that uses a re-parameterization technique. With this technique, our backbone model has an efficient VGG-like inference-time state, while its training-time state is a complicated multi-branch structure. We first introduce the specific structure of RepVGG into speaker verification and propose several variants of this structure. The performance is evaluated on VoxCeleb-based test sets. We demonstrate that both branch diversity and branch capacity play important roles in designing RepSPKNet. Our RepSPKNet achieves state-of-the-art performance with a 1.5982% EER and a 0.1374 minDCF on VoxCeleb1-H.




1 Introduction

Speaker verification aims to verify a speaker's identity given an audio segment. In recent years, deep neural networks (DNNs) have improved the performance of speaker verification systems, which now outperform the traditional i-vector system [2]. Most DNN-based systems, such as x-vector [14], r-vector [24], and the recently proposed ECAPA-TDNN [3, 18], consist of three parts: (1) a network backbone to extract frame-level speaker representations, (2) a pooling layer to aggregate the frame-level information, and (3) a loss function. This paper focuses on the backbone architecture, which is the core part of DNN models.

The backbone architecture can be a 1-dimensional convolutional neural network (TDNN) [14], a 2-dimensional convolutional neural network (CNN) [24, 26], a recurrent neural network (RNN), or even a hybrid architecture that combines TDNN, CNN, RNN, and Transformer-like structures [27]. Several modifications of the backbone architecture have been made to improve performance, including adding channel attention [8], transforming the custom convolution into a multi-scale convolution [21, 7], and aggregating multi-layer or multi-stage features [3, 9]. However, all the methods above focus only on improving single-branch structures and neglect the multi-branch way of designing neural networks. Adding parallel branches [16, 17, 15] can significantly enlarge the model capacity and enrich the feature space, which results in better model performance. Yu et al. [23] proposed a multi-branch version of the densely connected TDNN structure with a selective kernel (D-TDNN-SS), and this model achieved competitive performance in speaker verification.

Though the complicated multi-branch structure has proved its power, more parameters and connections usually lead to slow inference. Recent studies [4, 6, 5] proposed a technique called re-parameterization to address the increasing inference cost. The main idea of this technique is to design a training-time multi-branch structure that can be transformed into a single path with only one custom convolution at inference time. This technique decouples the training-time and inference-time architectures while ensuring that the output remains the same. Inspired by re-parameterization, we [25] first introduced the original RepVGG model into speaker verification and obtained first place in both Track 1 and Track 2 of VoxSRC2021.

The spectrogram, whose shape is $(C, F, T)$, differs from the image data that commonly serve as input in computer vision tasks. Here $C$ denotes the feature maps (channels), $F$ the frequency features, and $T$ the time axis. We argue that the design of the multi-branch structure can be heuristic due to this difference in input data: the structure is task-specific, and different re-parameterizable branches achieve different performances. To fully investigate how re-parameterization works in speaker verification, we followed the work in [6, 25] and proposed several variants of the original RepVGG block. We evaluated the performance of these systems on VoxCeleb1-O, VoxCeleb1-E, and VoxCeleb1-H [11, 1]. Based on these results, we propose for the first time a new re-parameterizable structure named RepSPKNet and demonstrate the importance of branch diversity and branch capacity in designing multi-branch structures. RepSPKNet can be transformed into a stack of simple convolutions and ReLU layers at inference time, which results in fast inference and competitive performance. The proposed RepSPKNet model achieved a 1.5982% EER and a 0.1374 minDCF on VoxCeleb1-H.

The paper is organized as follows: Section 2 reviews prior work related to re-parameterization. Section 3 presents our baseline system with a RepVGG backbone and other variants. In Section 4, we discuss the experimental details and analyze the results; the analysis finally yields our carefully designed RepSPKNet. Section 5 concludes the paper.

2 Re-parameterization

Structural re-parameterization avoids extra parallel-branch parameters and slow inference by converting a multi-branch structure into a single path. ACNet [4] proposed a $1\times3$ convolution with batch normalization ($1\times3$ CONV-BN) and a $3\times1$ CONV-BN to strengthen the original $3\times3$ CONV-BN. RepVGG added a $1\times1$ CONV-BN and an identity batch normalization layer (ID-BN) in parallel. Furthermore, DBB [5] proposed a diverse branch block and gave more general transformations. It is worth mentioning that combinations of these transformations also satisfy the requirements of re-parameterization. We list the general transformations as follows:

Figure 1: The baseline system. Here $a$ and $b$ denote the layer width parameters. Initial stride means the stride of the first block of each stage. The input format is $(C, F, T)$.

  • CONV-BN fusion: Batch normalization can be fused into its preceding convolution. Given a kernel weight $W$, the fused parameters can be formulated as:

    $$W'_i = \frac{\gamma_i}{\sigma_i} W_i, \qquad b'_i = \beta_i - \frac{\gamma_i \mu_i}{\sigma_i}$$

    where $i$ denotes the $i$-th channel, and $\gamma_i$, $\mu_i$, $\sigma_i$, $\beta_i$ denote the scaling factor, mean, standard deviation, and bias of the BN layer.

  • Parallel conv addition: Convolutions with different kernel sizes in different branches can be fused into one convolution by zero-padding the smaller kernels and applying a simple element-wise addition.

  • Sequential convolutions fusion: A sequence of a $1\times1$ CONV-BN and a $K\times K$ CONV-BN can be fused into a single $K\times K$ CONV whose parameters can be formulated as:

    $$W' = W_2 \circledast \mathrm{TRANS}(W_1), \qquad b'_j = \sum_{c}\sum_{u,v} b_{1,c}\, W_{2,j,c,u,v} + b_{2,j}$$

    where $W_1$, $W_2$ and $b_1$, $b_2$ denote the weights and biases of the two convolutions, and $\mathrm{TRANS}$ denotes the transpose operation (swapping the input- and output-channel dimensions of $W_1$).

  • Average pooling transformation: A $K\times K$ average pooling can be transformed into a $K\times K$ convolution. The kernel is

    $$W_{c',c,:,:} = \begin{cases} \frac{1}{K^2}\,\mathbf{1}_{K\times K}, & c' = c \\ 0, & \text{otherwise} \end{cases}$$

    i.e., an identity mapping across channels with uniform $\frac{1}{K^2}$ weights within each channel.
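As a sanity check, the CONV-BN fusion above can be verified numerically. The sketch below is a minimal pure-Python illustration for a single channel of a $1\times1$ convolution (chosen so that the convolution reduces to a scalar multiply); all numbers are hypothetical.

```python
import math

def bn(x, gamma, beta, mean, var, eps=1e-5):
    # Batch-norm inference: scale and shift using running statistics.
    return gamma * (x - mean) / math.sqrt(var + eps) + beta

def fuse(w, gamma, beta, mean, var, eps=1e-5):
    # Fold BN into the preceding convolution:
    #   W' = (gamma / sigma) * W,  b' = beta - gamma * mean / sigma,
    # with sigma = sqrt(var + eps).
    sigma = math.sqrt(var + eps)
    return w * gamma / sigma, beta - gamma * mean / sigma

# One channel of a 1x1 convolution (weight w, no bias) followed by BN.
w, gamma, beta, mean, var = 0.8, 1.3, -0.2, 0.5, 2.0
w_fused, b_fused = fuse(w, gamma, beta, mean, var)

for x in (-1.0, 0.0, 2.5):
    before = bn(w * x, gamma, beta, mean, var)   # training-time path
    after = w_fused * x + b_fused                # fused inference path
    assert abs(before - after) < 1e-9
```

The same per-channel arithmetic applies unchanged to each output channel of a real $K\times K$ convolution, since BN acts channel-wise.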

3 Our Proposed RepSPKNet system

The RepVGG-based speaker verification system has already shown competitive performance [25]. This section describes our baseline system and our proposed variants.

3.1 Baseline system

Figure 2: Architecture of the RepVGG block. (a) is the training-time state, (b) demonstrates the process of CONV-BN fusion, and (c) is the inference-time state. $\oplus$ denotes element-wise addition. A ReLU is added after the branch addition.

Here we present our baseline system, which consists of a RepVGG-A backbone, a statistical pooling layer [13], a 512-dimensional embedding layer, and an additive margin softmax (AM-Softmax) loss function [20, 19]. The detailed topology is shown in Fig. 1.

As Fig. 2 shows, the basic RepVGG block consists of three parallel branches: (1) a $3\times3$ CONV-BN, (2) a $1\times1$ CONV-BN, and (3) an ID-BN. The ID-BN branch exists only when the input channel equals the output channel. According to the transformations in Section 2, it is easy to verify that these three branches can be merged into one $3\times3$ convolution at inference time. The RepVGG-A backbone consists of a stem layer and four stages. These stages contain 2, 4, 14, and 1 RepVGG blocks respectively, and the stem layer is also a RepVGG block. The complexity of our backbone depends on the layer width parameters $(a, b)$. For RepVGG-A0, we set $a = 0.75$ and $b = 2.5$; for RepVGG-A1, $a = 1$ and $b = 2.5$; for RepVGG-A2, $a = 1.5$ and $b = 2.75$. We slightly change the original stride setting [6] to make this backbone fit the speaker verification task: both the first stage and the stem layer have a stride of 1, while the other stages have a stride of 2. The input format is $(C, F, T)$. The output of the backbone is reshaped by flattening the channel and frequency dimensions before pooling. The whole backbone can be transformed into a stack of $3\times3$ convolutions and ReLU layers at inference time.
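The merging of the three branches can be illustrated with a minimal sketch. The code below uses a 1-D, single-channel analogue of the block (a 3-tap kernel, a 1-tap kernel, and an identity branch, with BN assumed already fused into the weights) rather than the real 2-D multi-channel case; the kernel values are hypothetical.

```python
def conv1d(x, k):
    # 'Same' convolution: zero-pad so the output length equals len(x).
    pad = len(k) // 2
    xp = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(k[j] * xp[i + j] for j in range(len(k)))
            for i in range(len(x))]

# Training-time branches of one (single-channel, 1-D) RepVGG block:
k3 = [0.2, -0.5, 0.1]          # 3-tap main branch
k1 = [0.7]                     # 1-tap (pointwise) branch
x = [1.0, -2.0, 0.5, 3.0]

multi_branch = [a + b + c for a, b, c in
                zip(conv1d(x, k3), conv1d(x, [0.0, k1[0], 0.0]), x)]

# Re-parameterized single kernel: zero-pad the 1-tap kernel to 3 taps
# and add the identity branch as a centered unit kernel.
k_merged = [k3[0], k3[1] + k1[0] + 1.0, k3[2]]
single_path = conv1d(x, k_merged)

assert all(abs(a - b) < 1e-12 for a, b in zip(multi_branch, single_path))
```

The equivalence holds because convolution is linear in its kernel: summing branch outputs equals convolving with the sum of (padded) kernels.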

A statistical pooling layer is applied to aggregate the speaker information. We calculate the mean and the standard deviation of the backbone output along the time axis. The mean and standard deviation are concatenated and then compressed to a 512-dimensional vector, which serves as the speaker embedding. The AM-Softmax loss is used to classify speakers and can be formulated as:


$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\cos\theta_{y_i} - m)}}{e^{s(\cos\theta_{y_i} - m)} + \sum_{j=1, j\neq y_i}^{C} e^{s\cos\theta_j}}$$

where $N$ denotes the number of samples, $C$ the number of speakers, $s$ the scaling factor, $m$ the margin penalty, and $\theta_j$ the angle between the $j$-th weight vector and the $i$-th sample.
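As a rough sketch of this loss, the pure-Python function below computes AM-Softmax from precomputed cosine scores; the scores and class count are hypothetical, and a real implementation would operate on batched tensors.

```python
import math

def am_softmax_loss(cosines, labels, s=36.0, m=0.2):
    # cosines[i][j] = cos(theta_j) for sample i against class weight j.
    total = 0.0
    for cos_row, y in zip(cosines, labels):
        # Subtract the margin m from the target-class cosine, then scale by s.
        logits = [s * (c - m) if j == y else s * c
                  for j, c in enumerate(cos_row)]
        zmax = max(logits)                      # log-sum-exp stabilization
        lse = zmax + math.log(sum(math.exp(z - zmax) for z in logits))
        total += lse - logits[y]                # -log softmax(target)
    return total / len(labels)

# Two samples, three speaker classes (hypothetical cosine scores).
loss = am_softmax_loss([[0.9, 0.1, -0.3], [0.2, 0.8, 0.0]], [0, 1])
assert loss > 0.0
```

Increasing the margin $m$ raises the loss for a fixed set of scores, which is what forces tighter intra-class cosine similarity during training.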

3.2 Variants of the baseline system

Figure 3: Variants of the original RepVGG basic block. For convenience, the $1\times1$ CONV-BN and the ID-BN are not depicted.

As introduced in Section 1, the original architecture of the RepVGG block was designed for computer vision tasks. Though this architecture has proved its superiority, it is task-specific and may not be the optimal structure for speaker verification. The $3\times3$ CONV-BN is the main branch of this architecture, while the ID-BN serves as a residual connection to avoid the gradient vanishing problem. To investigate re-parameterization in speaker verification, we fixed the $1\times1$ CONV-BN and ID-BN branches and proposed several variants to replace the $3\times3$ CONV-BN. As Fig. 3 demonstrates, structure (a) is a duplicate $3\times3$ CONV-BN. Structures (b) and (c) are borrowed from ACNet. Structures (d) and (e) are borrowed from DBB. Structure (f) consists of a $3\times3$ convolution with a dilation of 2 and a batch normalization layer. The original RepVGG blocks are replaced with these variants to form new speaker verification models. According to the transformations in Section 2, all these variants (except (f)) can be transformed into a $3\times3$ convolution, which means the inference bodies of these new models remain the same as the original RepVGG inference-time state.

4 Experiments and results

4.1 Dataset and features

All our models adopted the VoxCeleb2 development set [1] as the training set. This dataset contains 1,092,009 utterances from 5,994 speakers in total. Our data augmentation consisted of two parts: (1) a 3-fold speed augmentation [25, 22] was first applied, based on the SoX speed function, to generate two extra copies of each speaker; (2) we followed the data augmentation method provided by the Kaldi VoxCeleb recipe, using the RIRs [10] and MUSAN [12] datasets.

After the augmentation process, 16,380,135 utterances from 17,982 speakers were obtained. We extracted 81-dimensional log-Mel filter bank energies with Kaldi, without voice activity detection (VAD). The window size is 25 ms and the frame shift is 10 ms. All features were cepstral-mean normalized.

4.2 Experiment setup

200 frames of each sample in a batch were randomly selected. We used the SGD optimizer with a momentum of 0.9 and a weight decay of 1e-3, 8 GPUs with a mini-batch size of 1,024, and an initial learning rate of 0.08. We adopted the ReduceLROnPlateau scheduler with a minimum learning rate of 1e-6. The margin of the AM-Softmax loss is set to 0.2 and the scale to 36. All systems were evaluated on VoxCeleb1-O, VoxCeleb1-E, and VoxCeleb1-H. Trials were scored by the cosine similarity of the 512-dimensional embeddings, and no score normalization was applied. The evaluation criteria are the equal error rate (EER) and the minimum detection cost function (minDCF) with $P_{target} = 0.01$, $C_{miss} = 1$, and $C_{fa} = 1$.
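For reference, the EER can be computed from trial scores with a simple threshold sweep. The sketch below is a naive pure-Python version with hypothetical scores; production toolkits interpolate the ROC curve rather than picking the closest observed threshold.

```python
def eer(target_scores, nontarget_scores):
    # Sweep thresholds over all observed scores; the EER is where the
    # false-acceptance and false-rejection rates cross.
    best = (2.0, None)
    for t in sorted(target_scores + nontarget_scores):
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]

# Hypothetical cosine scores for target and non-target trials.
tgt = [0.82, 0.75, 0.91, 0.60, 0.70]
non = [0.20, 0.35, 0.65, 0.10, 0.42]
print(f"EER: {100 * eer(tgt, non):.2f}%")   # 20.00% for these toy scores
```

The minDCF additionally weights misses and false alarms by $C_{miss} P_{target}$ and $C_{fa}(1 - P_{target})$ and takes the minimum of the normalized cost over thresholds.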

4.3 Results and analysis

4.3.1 Ablation study of base model

Models     VoxCeleb1-O          VoxCeleb1-E          VoxCeleb1-H
           EER / minDCF         EER / minDCF         EER / minDCF
A0         1.4310 / 0.1219      1.2900 / 0.1207      2.1700 / 0.1864
Var a      1.3940 / 0.1185      1.3100 / 0.1256      2.2100 / 0.1913
Var b      1.3890 / 0.1140      1.2980 / 0.1236      2.1802 / 0.1889
Var c      1.3091 / 0.1169      1.3110 / 0.1265      2.2213 / 0.1937
Var d      1.2671 / 0.1039      1.2602 / 0.1181      2.1126 / 0.1850
Var e      1.4263 / 0.1320      1.3164 / 0.1266      2.2201 / 0.1940
Var f      1.0821 / 0.1006      1.1204 / 0.1067      1.9342 / 0.1665
Table 1: Ablation study on our baseline model RepVGG-A0. For convenience, we omit the % sign of the EER and use "Var" to denote the variants built from the structures proposed above. Var f cannot be transformed to a custom convolution.

As mentioned in Section 3.2, the RepVGG block is task-specific. To compare our proposed variants with the original RepVGG structure, we selected RepVGG-A0 as the base model, since it trains much faster than RepVGG-A1 and RepVGG-A2. To find the most suitable re-parameterizable structure for speaker verification, we trained all the variants mentioned above; the results are presented in Table 1. As for Var a, it is rather intriguing that adding a duplicate $3\times3$ CONV-BN brings performance decay on the more challenging test sets, VoxCeleb1-E and VoxCeleb1-H, even though the training-time state has more parameters (larger branch capacity). We believe this extra CONV-BN tends to learn a representation similar to that of the main $3\times3$ CONV-BN, and the resulting lack of feature diversity causes the decay. Furthermore, Var d performed best among all models (Var f not included) on all test sets; this structure chains a $1\times1$ CONV-BN with the original $3\times3$ CONV-BN. On the contrary, Var e, which consists of a CONV-BN and an average pooling, performed the worst. Both structures have operators that control the balance between branch diversity and branch capacity.

Figure 4: Branch similarity of each model. Layer numbers run from the initial block to the final one.
Models     VoxCeleb1-O          VoxCeleb1-E          VoxCeleb1-H
           EER / minDCF         EER / minDCF         EER / minDCF
ECAPA      0.8600 / 0.0960      1.0800 / 0.1223      2.0100 / 0.2004
ResNet34   1.0498 / 0.1045      1.0587 / 0.1008      1.8456 / 0.1619
A0         1.4310 / 0.1219      1.2900 / 0.1207      2.1700 / 0.1864
RSBA-A0    1.2671 / 0.1039      1.2602 / 0.1181      2.1126 / 0.1850
RSBB-A0    1.0821 / 0.1006      1.1204 / 0.1067      1.9342 / 0.1665
  - +      1.1771 / 0.0982      1.1041 / 0.1081      1.8960 / 0.1688
A1         1.2141 / 0.0913      1.1593 / 0.1054      1.9347 / 0.1655
RSBA-A1    1.1503 / 0.0872      1.1295 / 0.1013      1.8902 / 0.1605
RSBB-A1    0.9650 / 0.0795      1.0359 / 0.0936      1.7602 / 0.1512
A2         0.9546 / 0.0831      1.0143 / 0.0926      1.7149 / 0.1465
RSBA-A2    0.9122 / 0.0801      0.9939 / 0.0916      1.6809 / 0.1431
RSBB-A2    0.8430 / 0.0775      0.9637 / 0.0907      1.5982 / 0.1374
Table 2: Performance of the RepSPKNet models. RSBA denotes the RepSPK-A block and RSBB the RepSPK-B block. The row "- +" denotes replacing the main CONV-BN branch of RSBB-A0 with another CONV-BN. ECAPA denotes ECAPA-TDNN (C=2048), with results quoted from [18]. ResNet34 denotes our implementation of a SOTA ResNet system. No score normalization is adopted, except that the ECAPA system reports performance using adaptive s-norm.

To verify that a multi-branch structure's performance depends on the trade-off between branch diversity and branch capacity, we designed branch Var f as shown in Fig. 3. This structure is a $3\times3$ CONV-BN with a dilation of 2. The dilated convolution ensures diversity by having a different receptive field from the custom convolution of the main branch, and it also ensures capacity by increasing the parameters. We used the cosine similarity between the outputs of the main branch and the proposed branch to represent branch similarity; the branch similarity of each layer is presented in Fig. 4. Var a, as we speculated, had the highest branch similarities (around 0.9), while Var e had the lowest (around 0.2); both showed performance decay. The other two variants had similarities around 0.5 and outperformed the base model. Moreover, Var f, with similarities closest to 0.5, achieved a relative 10.9% EER and a relative 10.5% minDCF improvement over the base RepVGG-A0 model.
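The branch-similarity measure used here is plain cosine similarity between flattened branch outputs. A minimal sketch with hypothetical activation values:

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two flattened activation vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Flattened outputs of the main branch and a parallel branch
# (hypothetical values; in the paper these are layer activations).
main_branch = [0.5, -1.2, 0.3, 2.0]
parallel_branch = [0.6, -1.0, 0.1, 1.8]
sim = cosine_similarity(main_branch, parallel_branch)
assert 0.9 < sim <= 1.0   # near-duplicate branches score close to 1
```

A similarity near 1 indicates redundant branches (low diversity), near 0 indicates branches whose features barely align (low shared capacity); the paper's best variant sits near 0.5.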

4.3.2 RepSPKNet architecture

Figure 5: RepSPK block (RSB). (a) denotes the RSBA block and (b) the RSBB block. The ID-BN exists when the input channel equals the output channel.

It is easy to prove that a $3\times3$ convolution with a dilation of 2 can be transformed into a $5\times5$ convolution. According to the transformations in Section 2, the Var f structure can therefore also be re-parameterized into a single custom convolution. Based on these results, we propose the final architecture of RepSPKNet. As depicted in Fig. 5, two blocks are presented. The RSBA block is composed of a $3\times3$ CONV-BN, a $1\times1$ CONV-BN chained with a $3\times3$ CONV-BN, and an ID-BN. The RSBB block is composed of a $3\times3$ CONV-BN, a $3\times3$ CONV-BN with a dilation of 2, and an ID-BN. To verify the stability and transferability of the proposed architecture, we compared RepSPKNet with the original RepVGG-A1 and RepVGG-A2, replacing only the RepVGG block with the RepSPK block (RSB). The model consisting of RSBA blocks is called RepSPKNet-A, and the one consisting of RSBB blocks is called RepSPKNet-B. We also conducted an ablation study of the RSBB structure by replacing the main CONV-BN branch with another CONV-BN. The results are presented in Table 2. RepSPKNet-A and RepSPKNet-B both outperformed their corresponding base models. Compared to RepVGG-A2 on VoxCeleb1-H, RSBA-A2 achieved relative improvements of 2.0% in EER and 2.3% in minDCF, and RSBB-A2 achieved relative improvements of 6.8% in EER and 6.2% in minDCF. These results demonstrate that RepSPKNet achieves SOTA performance in speaker verification.
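The opening claim can be checked directly: a kernel with dilation 2 is equivalent to a dense kernel with zeros inserted between its taps. The sketch below verifies this on a 1-D, single-channel analogue with hypothetical kernel values (a dilated 3-tap kernel versus a dense 5-tap kernel).

```python
def conv1d(x, k, dilation=1):
    # 'Same' convolution with zero padding sized for the given dilation.
    span = (len(k) - 1) * dilation          # receptive field minus one
    pad = span // 2
    xp = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(k[j] * xp[i + j * dilation] for j in range(len(k)))
            for i in range(len(x))]

k3 = [0.4, -0.2, 0.9]                       # 3-tap kernel, used with dilation 2
k5 = [0.4, 0.0, -0.2, 0.0, 0.9]             # equivalent dense 5-tap kernel

x = [1.0, 2.0, -0.5, 0.0, 3.0, 1.5]
assert conv1d(x, k3, dilation=2) == conv1d(x, k5)
```

In 2-D the same construction turns a dilated $3\times3$ kernel into a sparse $5\times5$ kernel, which is why the RSBB block still merges into a single convolution at inference time.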

5 Conclusion

In this paper, we proposed two blocks, as Fig. 5 demonstrates. RepSPKNet-A is composed of the RSBA block, while RepSPKNet-B is composed of the RSBB block. With the structural re-parameterization method, RepSPKNet-A can be transformed into a stack of $3\times3$ convolutions and ReLU layers, and RepSPKNet-B into a stack of $5\times5$ convolutions and ReLU layers. Ablation studies on various variants indicated that the performance of a multi-branch structure depends on both branch diversity and branch capacity, which serves as a heuristic principle for designing multi-branch models. The proposed RepSPKNet outperforms the original RepVGG and achieves SOTA performance in speaker verification.


  • [1] J. S. Chung, A. Nagrani, and A. Zisserman (2018) Voxceleb2: deep speaker recognition. arXiv preprint arXiv:1806.05622. Cited by: §1, §4.1.
  • [2] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet (2011) Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing 19 (4), pp. 788–798. External Links: Document Cited by: §1.
  • [3] B. Desplanques, J. Thienpondt, and K. Demuynck (2020-10) ECAPA-tdnn: emphasized channel attention, propagation and aggregation in tdnn based speaker verification. Interspeech 2020. External Links: Link, Document Cited by: §1, §1.
  • [4] X. Ding, Y. Guo, G. Ding, and J. Han (2019-10) ACNet: strengthening the kernel skeletons for powerful cnn via asymmetric convolution blocks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §1, §2.
  • [5] X. Ding, X. Zhang, J. Han, and G. Ding (2021) Diverse branch block: building a convolution as an inception-like unit. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10886–10895. Cited by: §1, §2.
  • [6] X. Ding, X. Zhang, N. Ma, J. Han, G. Ding, and J. Sun (2021) Repvgg: making vgg-style convnets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13733–13742. Cited by: §1, §1, §3.1.
  • [7] S. Gao, M. Cheng, K. Zhao, X. Zhang, M. Yang, and P. Torr (2021-02) Res2Net: a new multi-scale backbone architecture. IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (2), pp. 652–662. External Links: ISSN 1939-3539, Link, Document Cited by: §1.
  • [8] J. Hu, L. Shen, and G. Sun (2018-06) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [9] Y. Jung, S. M. Kye, Y. Choi, M. Jung, and H. Kim (2020-10) Improving multi-scale aggregation using feature pyramid module for robust speaker verification of variable-duration utterances. Interspeech 2020. External Links: Link, Document Cited by: §1.
  • [10] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur (2017) A study on data augmentation of reverberant speech for robust speech recognition. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5220–5224. Cited by: §4.1.
  • [11] A. Nagrani, J. S. Chung, and A. Zisserman (2017) Voxceleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612. Cited by: §1.
  • [12] D. Snyder, G. Chen, and D. Povey (2015) Musan: a music, speech, and noise corpus. arXiv preprint arXiv:1510.08484. Cited by: §4.1.
  • [13] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur (2017) Deep neural network embeddings for text-independent speaker verification.. In Interspeech, pp. 999–1003. Cited by: §3.1.
  • [14] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur (2018) X-vectors: robust dnn embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333. Cited by: §1, §1.
  • [15] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, Cited by: §1.
  • [16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2014) Going deeper with convolutions. External Links: 1409.4842 Cited by: §1.
  • [17] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2015) Rethinking the inception architecture for computer vision. External Links: 1512.00567 Cited by: §1.
  • [18] J. Thienpondt, B. Desplanques, and K. Demuynck (2020) The idlab voxceleb speaker recognition challenge 2020 system description. arXiv preprint arXiv:2010.12468. Cited by: §1, Table 2.
  • [19] F. Wang, J. Cheng, W. Liu, and H. Liu (2018) Additive margin softmax for face verification. IEEE Signal Processing Letters 25 (7), pp. 926–930. Cited by: §3.1.
  • [20] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu (2018) Cosface: large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5265–5274. Cited by: §3.1.
  • [21] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He (2017-07) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [22] H. Yamamoto, K. A. Lee, K. Okabe, and T. Koshinaka (2019) Speaker augmentation and bandwidth extension for deep speaker embedding.. In INTERSPEECH, pp. 406–410. Cited by: §4.1.
  • [23] Y. Yu and W. Li (2020) Densely connected time delay neural network for speaker verification. In Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 921–925. Cited by: §1.
  • [24] H. Zeinali, S. Wang, A. Silnova, P. Matějka, and O. Plchot (2019) But system description to voxceleb speaker recognition challenge 2019. arXiv preprint arXiv:1910.12592. Cited by: §1, §1.
  • [25] M. Zhao, Y. Ma, M. Liu, and M. Xu (2021) The speakin system for voxceleb speaker recognition challange 2021. arXiv preprint arXiv:2109.01989. Cited by: §1, §1, §3, §4.1.
  • [26] T. Zhou, Y. Zhao, and J. Wu (2020) ResNeXt and res2net structures for speaker verification. External Links: 2007.02480 Cited by: §1.
  • [27] H. Zhu, K. A. Lee, and H. Li (2021) Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding. In Proc. Interspeech 2021, pp. 106–110. External Links: Document Cited by: §1.