1 Introduction
Speaker verification aims to verify a speaker's identity given an audio segment. In recent years, deep neural networks (DNNs) have improved the performance of speaker verification systems, which now outperform the traditional i-vector system
[2]. Most DNN-based systems, such as x-vector [14], r-vector [24], and the recently proposed ECAPA-TDNN [3, 18], consist of three parts: (1) a network backbone to extract frame-level speaker representations, (2) a pooling layer to aggregate the frame-level information, and (3) a loss function. This paper focuses on the backbone architecture, which is the core part of the DNN models.
The backbone architecture can be a 1-dimensional convolutional neural network (TDNN) [14], a 2-dimensional convolutional neural network (CNN) [24, 26], a recurrent neural network (RNN), or even a hybrid architecture that combines TDNN, CNN, RNN, and Transformer-like structures
[27]. Several modifications of the backbone architecture have been made to improve performance. These include adding channel attention [8], transforming the plain convolution into a multi-scale convolution [21, 7], and aggregating multi-layer or multi-stage features [3, 9]. However, all these methods focus only on improving single-branch structures and neglect the multi-branch way of designing neural networks. Adding parallel branches [16, 17, 15] can significantly enlarge the model capacity and enrich the feature space, which results in better model performance. Yu et al. [23] proposed a multi-branch version of the densely connected TDNN with a selective kernel (D-TDNN-SS), and this model achieved competitive performance in speaker verification. Though the complicated multi-branch structure has proved its power, more parameters and connections usually lead to slow inference. Recent research [4, 6, 5] proposed a technique called structural reparameterization to address the increased inference cost. The main idea of this technique is to design a training-time multi-branch structure that can be transformed into a single path with only one plain convolution at inference time. This technique decouples the training-time and inference-time architectures while ensuring that the output remains the same. Inspired by this reparameterization, we [25] first introduced the original RepVGG model into speaker verification and obtained first place in both Track 1 and Track 2 of VoxSRC2021.
The spectrogram, whose shape is (C, F, T), differs from the image data that commonly serve as input in computer vision tasks. Here C denotes the feature maps (channels), F denotes the frequency bins, and T denotes the time axis. We argue that the design of the multi-branch structure can be heuristic due to this difference in input data. The structure is task-specific, and various reparameterizable branches achieve different performances. To fully investigate how this reparameterization works in speaker verification, we followed the work in
[6, 25] and proposed several variants of the original RepVGG block. We evaluated the performance of these systems on VoxCeleb1-O, VoxCeleb1-E, and VoxCeleb1-H [11, 1]. Based on these results, we propose, for the first time, a new reparameterizable structure named RepSPKNet and demonstrate the importance of branch diversity and branch capacity in designing multi-branch structures. RepSPKNet can be transformed into a stack of plain convolutions and ReLU layers at inference time, which results in a fast inference speed and competitive performance. The proposed RepSPKNet model achieved a 1.5982% EER and a 0.1374 minDCF on VoxCeleb1-H.
The paper is organized as follows: Section 2 reviews prior work related to reparameterization. Section 3 presents our baseline system with a RepVGG backbone and other variants. In Section 4, we discuss the experimental details and analyze the results. The analysis finally leads to our carefully designed RepSPKNet. Section 5 concludes the paper.
2 Reparameterization
Structural reparameterization is used to avoid the extra parameters of parallel branches and the resulting slow inference by converting a multi-branch structure into a single path. ACNet [4] proposed a 1x3 kernel convolution with batch normalization (1x3 CONV-BN) and a 3x1 CONV-BN to strengthen the original 3x3 CONV-BN. RepVGG added a 1x1 CONV-BN and an identity batch normalization layer (ID-BN) in parallel with the 3x3 CONV-BN. Furthermore, DBB [5] proposed a diverse branch block and gave more general transformations. It is worth mentioning that combinations of these transformations also satisfy the requirements of reparameterization. Here we list the general transformations as follows:
CONV-BN fusion: Batch normalization can be fused into its preceding convolution. Given a kernel weight W, the fused parameters can be formulated as:

W'_i = (\gamma_i / \sigma_i) W_i, \quad b'_i = \beta_i - \gamma_i \mu_i / \sigma_i   (1)

where i denotes the i-th channel, and \gamma_i, \mu_i, \sigma_i, \beta_i denote the scaling factor, mean, standard deviation, and bias of the BN layer.
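As a concrete illustration, the fusion in Eq. (1) can be sketched in PyTorch (a minimal sketch with our own naming; `fuse_conv_bn` is not a function from the paper, and the BN is assumed to be in eval mode):

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold an eval-mode BatchNorm2d into its preceding Conv2d (Eq. (1))."""
    std = torch.sqrt(bn.running_var + bn.eps)          # sigma_i
    scale = bn.weight.detach() / std                   # gamma_i / sigma_i
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, stride=conv.stride,
                      padding=conv.padding, dilation=conv.dilation,
                      groups=conv.groups, bias=True)
    # W'_i = (gamma_i / sigma_i) W_i
    fused.weight.data = conv.weight.detach() * scale.reshape(-1, 1, 1, 1)
    conv_bias = (conv.bias.detach() if conv.bias is not None
                 else torch.zeros_like(bn.running_mean))
    # b'_i = beta_i + (b_i - mu_i) * gamma_i / sigma_i
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias.detach()
    return fused
```

The fused convolution produces exactly the same output as the original CONV-BN pair, so the transformation is lossless.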

Parallel conv addition:
Convolutions with different kernel sizes in different branches can be fused into one convolution by zero-padding the smaller kernels and applying a simple element-wise addition.
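The zero-padding trick can be sketched as follows (an illustrative PyTorch helper of our own, assuming a 3x3 main branch and a 1x1 parallel branch whose BNs have already been fused):

```python
import torch
import torch.nn.functional as F

def merge_parallel(w3, b3, w1, b1):
    """Merge a 1x1 branch (w1, b1) into a 3x3 branch (w3, b3) by
    zero-padding the 1x1 kernel to 3x3 and adding element-wise."""
    w1_padded = F.pad(w1, [1, 1, 1, 1])  # pad last two dims: 1x1 -> 3x3
    return w3 + w1_padded, b3 + b1
```

A 1x1 convolution is equivalent to a 3x3 convolution whose kernel is zero everywhere except the center tap, so the padded addition preserves the sum of the two branch outputs.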

Sequential convolutions fusion: A sequence of a 1x1 CONV-BN and a KxK CONV-BN can be fused into a single KxK CONV whose parameters can be formulated as:

W' = W^{(2)} \circledast \mathrm{TRANS}(W^{(1)})   (2)

b'_j = \sum_{d,u,v} W^{(2)}_{j,d,u,v} b^{(1)}_d + b^{(2)}_j   (3)

where W^{(1)}, W^{(2)} and b^{(1)}, b^{(2)} denote the weights and biases of the two convolutions (after BN fusion), and \mathrm{TRANS} denotes transposing the first two axes.
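A sketch of this sequential fusion for the 1x1 -> KxK case, following the DBB formulation (helper name and shapes are our own; it assumes BNs are already fused into the weights and that the KxK convolution uses no padding, as DBB requires for exact equivalence):

```python
import torch
import torch.nn.functional as F

def fuse_sequential(w1, b1, w2, b2):
    """Fuse a 1x1 conv (D,C,1,1) followed by a KxK conv (E,D,K,K)
    into one KxK conv (E,C,K,K), per Eqs. (2)-(3)."""
    # Eq. (2): convolve the KxK kernel with the transposed 1x1 kernel.
    w = F.conv2d(w2, w1.permute(1, 0, 2, 3))            # (E, C, K, K)
    # Eq. (3): push the first bias through the second kernel.
    b = (w2 * b1.reshape(1, -1, 1, 1)).sum(dim=(1, 2, 3)) + b2
    return w, b
```

The trick in Eq. (2) is that treating the KxK kernel itself as a batch of "images" and convolving it with the transposed 1x1 kernel performs the required channel mixing.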

Average pooling transformation: A KxK average pooling can be transformed into a KxK convolution. The kernel is

W = \frac{1}{K^2} I   (4)

where I is an identity matrix over the channel dimension, i.e. each output channel attends only to its own input channel.
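This transformation can be sketched as (an illustrative helper of our own naming):

```python
import torch
import torch.nn.functional as F

def avgpool_to_conv(channels: int, k: int) -> torch.Tensor:
    """Build a KxK conv kernel equivalent to KxK average pooling (Eq. (4)):
    1/K^2 on each channel's own slice, zero across channels."""
    w = torch.zeros(channels, channels, k, k)
    for c in range(channels):
        w[c, c] = 1.0 / (k * k)
    return w
```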
3 Our Proposed RepSPKNet system
The RepVGG-based speaker verification system has already shown competitive performance [25]. This section describes our baseline system and our proposed variants.
3.1 Baseline system
Here we present our baseline system, which consists of a RepVGG-A backbone, a statistical pooling layer [13], a 512-dimensional embedding layer, and an additive margin softmax (AM-Softmax) loss function [20, 19]. The detailed topology is shown in Fig. 1.
As Fig. 2 shows, the basic RepVGG block consists of three parallel branches: (1) a 3x3 CONV-BN, (2) a 1x1 CONV-BN, and (3) an ID-BN. The ID-BN branch exists only when the input channel count equals the output channel count. According to the transformations in Section 2, it is easy to verify that these three branches can be merged into one 3x3 convolution at inference time. The RepVGG-A backbone consists of a stem layer and four stages. These stages contain 2, 4, 14, and 1 RepVGG blocks respectively, and the stem layer is also a RepVGG block. The complexity of our backbone depends on the layer width multipliers (a, b). For RepVGG-A0, we set a = 0.75 and b = 2.5. For RepVGG-A1, a = 1 and b = 2.5. For RepVGG-A2, a = 1.5 and b = 2.75. We slightly change the original stride setting [6] to make this backbone fit the speaker verification task: both the first stage and the stem layer have a stride of 1, and the other stages have a stride of 2. The input format is (1, F, T). The output of the backbone has shape (C, F', T') and is reshaped to (C x F', T') before pooling. The whole backbone can be transformed into a stack of 3x3 convolutions and ReLU layers at inference time.
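The training-time block described above can be sketched as follows (a minimal PyTorch sketch of our own; the real model's widths, strides, and deploy-time merging follow the paper and the transformations of Section 2):

```python
import torch
import torch.nn as nn

class RepVGGBlock(nn.Module):
    """Training-time RepVGG block: 3x3 CONV-BN + 1x1 CONV-BN + identity BN,
    summed and passed through ReLU. The identity branch exists only when
    input and output shapes match."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.conv3 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch))
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
            nn.BatchNorm2d(out_ch))
        self.identity = (nn.BatchNorm2d(out_ch)
                         if in_ch == out_ch and stride == 1 else None)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv3(x) + self.conv1(x)
        if self.identity is not None:
            out = out + self.identity(x)
        return self.relu(out)
```

At deploy time, each branch reduces to a single 3x3 convolution via the CONV-BN fusion and parallel-addition transformations of Section 2.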
A statistical pooling layer is applied to aggregate the speaker information. We calculate the mean and the standard deviation of the backbone output along the time axis; they are concatenated and then compressed to a 512-dimensional vector, which serves as the speaker embedding. The AM-Softmax loss is used to classify speakers and can be formulated as:
L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\cos\theta_{y_i}-m)}}{e^{s(\cos\theta_{y_i}-m)}+\sum_{j=1, j\neq y_i}^{C}e^{s\cos\theta_j}}   (5)

where N denotes the number of samples, C the number of speakers, s the scaling factor, m the margin penalty, and \theta_j the angle between the j-th class weight vector and the i-th sample's embedding.
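Equation (5) can be sketched directly in PyTorch (an illustrative implementation of our own; `weight` holds one row per speaker class):

```python
import torch
import torch.nn.functional as F

def am_softmax_loss(embeddings, weight, labels, s=36.0, m=0.2):
    """AM-Softmax (Eq. (5)): cosine logits with an additive margin m
    subtracted on the target class, then scaled by s."""
    # Cosine similarity between L2-normalized embeddings and class weights.
    cos = F.linear(F.normalize(embeddings), F.normalize(weight))  # (N, C)
    margin = torch.zeros_like(cos)
    margin.scatter_(1, labels.unsqueeze(1), m)   # m only on the target class
    return F.cross_entropy(s * (cos - margin), labels)
```

Cross-entropy over the scaled, margin-adjusted cosine logits reproduces Eq. (5) exactly, since softmax plus negative log-likelihood is the same ratio of exponentials.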
3.2 Variants of the baseline system
As we introduced in Section 1, the original architecture of the RepVGG block was designed for computer vision tasks. Though this architecture has proved its superiority, it is task-specific and may not be the optimal structure for speaker verification. The 3x3 CONV-BN is the main branch of this architecture, while the ID-BN serves as a residual connection that alleviates the vanishing-gradient problem. To investigate reparameterization in speaker verification, we fixed the 3x3 CONV-BN and ID-BN branches and proposed several variants to replace the 1x1 CONV-BN. As Fig. 3 demonstrates, structure (a) is a duplicate 3x3 CONV-BN. Structures (b) and (c) are borrowed from ACNet. Structures (d) and (e) are borrowed from DBB. Structure (f) consists of a 3x3 convolution with a dilation of 2 and a batch normalization layer. The original RepVGG blocks are replaced with these variants to form new speaker verification models. According to the transformations in Section 2, all these variants except (f) can be transformed into a 3x3 convolution, which means the inference bodies of these new models remain the same as the original RepVGG inference-time state.

4 Experiments and results
4.1 Dataset and features
All our models adopted the VoxCeleb2 development set [1] as the training set. This dataset contains 1,092,009 utterances from 5,994 speakers in total. Our data augmentation consisted of two parts: (1) a 3-fold speed augmentation [25, 22] was applied first, using the SoX speed function to generate two extra sets of speakers; (2) we followed the data augmentation method of the Kaldi VoxCeleb recipe, using the RIRs [10] and MUSAN [12] datasets.
After augmentation, 16,380,135 utterances from 17,982 speakers were available. We extracted 81-dimensional log Mel filter-bank energies with Kaldi, without voice activity detection (VAD). The window size is 25 ms and the frame shift is 10 ms. All features were cepstral-mean normalized.
4.2 Experiment setup
For each sample in a batch, 200 frames were randomly selected. We used the SGD optimizer with a momentum of 0.9 and a weight decay of 1e-3, training on 8 GPUs with a mini-batch size of 1,024 and an initial learning rate of 0.08. We adopted the ReduceLROnPlateau scheduler with a minimum learning rate of 1e-6. The margin of the AM-Softmax loss was set to 0.2 and the scale to 36. All our systems were evaluated on VoxCeleb1-O, VoxCeleb1-E, and VoxCeleb1-H. Trials were scored by the cosine similarity of the 512-dimensional embeddings, and no score normalization was applied. The evaluation criteria are the equal error rate (EER) and the minimum detection cost function (minDCF).

4.3 Results and analysis
4.3.1 Ablation study of the base model
Table 1: Results of the base model and the proposed variants on the VoxCeleb1 test sets.

Models  | VoxCeleb1-O     | VoxCeleb1-E     | VoxCeleb1-H
        | EER(%)  minDCF  | EER(%)  minDCF  | EER(%)  minDCF
A0      | 1.4310  0.1219  | 1.2900  0.1207  | 2.1700  0.1864
Var a   | 1.3940  0.1185  | 1.3100  0.1256  | 2.2100  0.1913
Var b   | 1.3890  0.1140  | 1.2980  0.1236  | 2.1802  0.1889
Var c   | 1.3091  0.1169  | 1.3110  0.1265  | 2.2213  0.1937
Var d   | 1.2671  0.1039  | 1.2602  0.1181  | 2.1126  0.1850
Var e   | 1.4263  0.1320  | 1.3164  0.1266  | 2.2201  0.1940
Var f   | 1.0821  0.1006  | 1.1204  0.1067  | 1.9342  0.1665
As we mentioned in Section 3.2, the RepVGG block is task-specific. To compare our proposed variants with the original RepVGG structure, we selected RepVGG-A0 as the base model, since it trains much faster than RepVGG-A1 and RepVGG-A2. To find the most suitable reparameterizable structure for speaker verification, we trained all the variants mentioned above. The results are presented in Table 1. As for Var a, it is rather intriguing that replacing the original 1x1 CONV-BN with an extra 3x3 CONV-BN brought a performance decay on the larger test sets VoxCeleb1-E and VoxCeleb1-H, even though the training-time structure has more parameters (larger branch capacity). We believe this extra 3x3 CONV-BN tended to learn a representation similar to that of the main 3x3 CONV-BN, and the lack of feature diversity caused the performance decay. Furthermore, Var d performed best among all models (Var f not included) on all test sets. This structure adds a 3x3 CONV-BN after a 1x1 CONV-BN. On the contrary, Var e, which consists of a 1x1 CONV-BN followed by an average pooling, performed the worst. Both structures contain operators that control the balance between branch diversity and branch capacity.
Table 2: Comparison of RepSPKNet with ECAPA-TDNN, ResNet34, and the original RepVGG models.

Models    | VoxCeleb1-O     | VoxCeleb1-E     | VoxCeleb1-H
          | EER(%)  minDCF  | EER(%)  minDCF  | EER(%)  minDCF
ECAPA     | 0.8600  0.0960  | 1.0800  0.1223  | 2.0100  0.2004
ResNet34  | 1.0498  0.1045  | 1.0587  0.1008  | 1.8456  0.1619
A0        | 1.4310  0.1219  | 1.2900  0.1207  | 2.1700  0.1864
RSB-A-A0  | 1.2671  0.1039  | 1.2602  0.1181  | 2.1126  0.1850
RSB-B-A0  | 1.0821  0.1006  | 1.1204  0.1067  | 1.9342  0.1665
 +        | 1.1771  0.0982  | 1.1041  0.1081  | 1.8960  0.1688
A1        | 1.2141  0.0913  | 1.1593  0.1054  | 1.9347  0.1655
RSB-A-A1  | 1.1503  0.0872  | 1.1295  0.1013  | 1.8902  0.1605
RSB-B-A1  | 0.9650  0.0795  | 1.0359  0.0936  | 1.7602  0.1512
A2        | 0.9546  0.0831  | 1.0143  0.0926  | 1.7149  0.1465
RSB-A-A2  | 0.9122  0.0801  | 0.9939  0.0916  | 1.6809  0.1431
RSB-B-A2  | 0.8430  0.0775  | 0.9637  0.0907  | 1.5982  0.1374
To verify that a multi-branch structure's performance depends on the trade-off between branch diversity and branch capacity, we designed branch Var f as shown in Fig. 3. This structure is a 3x3 CONV-BN with a dilation of 2. The dilated convolution ensures diversity by imposing a receptive field different from that of the main-branch convolution, and it also ensures capacity by increasing the parameters. We used the cosine similarity between the outputs of the main branch and the proposed branch to represent the branch similarity. The branch similarity of each layer is presented in Fig. 4. Var a, as we speculated, had the highest branch similarities (around 0.9). Var e, on the contrary, had the lowest similarities (around 0.2). Both structures showed performance decay. The other two variants had similarities around 0.5 and outperformed the base model. Moreover, Var f, with similarities closest to 0.5, achieved a relative 10.9% EER improvement and a relative 10.5% minDCF improvement over the base RepVGG-A0 model.
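The branch-similarity measurement can be sketched as follows (an illustrative helper of our own; the paper does not specify the exact flattening or averaging scheme, so we simply flatten both branch outputs):

```python
import torch
import torch.nn.functional as F

def branch_similarity(out_main: torch.Tensor, out_branch: torch.Tensor) -> float:
    """Cosine similarity between the flattened outputs of two parallel
    branches, used as a proxy for feature redundancy between them."""
    a = out_main.flatten()
    b = out_branch.flatten()
    return F.cosine_similarity(a, b, dim=0).item()
```

A similarity near 1 means the branch is redundant with the main branch (low diversity), while a similarity near 0 means it contributes little aligned information; the analysis above suggests a middle ground around 0.5 works best.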
4.3.2 RepSPKNet architecture
It is easy to prove that a 3x3 convolution with a dilation of 2 can be transformed into a 5x5 convolution. According to the transformations in Section 2, the Var f structure can therefore also be reparameterized into a single plain convolution. Based on these results, we propose the final architecture of RepSPKNet. As depicted in Fig. 5, two blocks are introduced. The RSB-A block is composed of a 3x3 CONV-BN, a 1x1 CONV-BN followed by a 3x3 CONV-BN, and an ID-BN. The RSB-B block is composed of a 3x3 CONV-BN, a 3x3 CONV-BN with a dilation of 2, and an ID-BN. To verify the stability and transferability of the proposed architecture, we compared RepSPKNet with the original RepVGG-A1 and RepVGG-A2, replacing only the RepVGG block with the RepSPK block (RSB). The model consisting of RSB-A blocks is called RepSPKNet-A, and the one consisting of RSB-B blocks is called RepSPKNet-B. We also conducted an ablation study of the RSB-B structure by replacing its main-branch CONV-BN with another CONV-BN (the "+" row). The results are presented in Table 2. RepSPKNet-A and RepSPKNet-B both outperformed their corresponding base models. Compared to RepVGG-A2 on VoxCeleb1-H, RSB-A-A2 achieved relative improvements of 2.0% in EER and 2.3% in minDCF. Moreover, RSB-B-A2 achieved relative improvements of 6.8% in EER and 6.2% in minDCF. These results demonstrate that our RepSPKNets can achieve SOTA performance in speaker verification.
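The dilation-to-dense transformation underlying RSB-B can be sketched as follows (helper name is ours): spreading the nine taps of a dilation-2 kernel onto a 5x5 grid with zeros in between yields an ordinary convolution with identical output.

```python
import torch
import torch.nn.functional as F

def dilated3x3_to_5x5(w: torch.Tensor) -> torch.Tensor:
    """Convert a dilation-2 3x3 kernel (O,I,3,3) into an equivalent
    plain 5x5 kernel by inserting zeros between the taps."""
    out_ch, in_ch = w.shape[0], w.shape[1]
    w5 = torch.zeros(out_ch, in_ch, 5, 5)
    w5[:, :, ::2, ::2] = w   # taps land at rows/cols 0, 2, 4
    return w5
```

After this step, the 5x5 branch merges with the zero-padded 3x3 and 1x1 branches via the parallel-addition transformation of Section 2.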
5 Conclusion
In this paper, we proposed two blocks, as Fig. 5 demonstrates. RepSPKNet-A is composed of RSB-A blocks, while RepSPKNet-B is composed of RSB-B blocks. With the structural reparameterization method, RepSPKNet-A can be transformed into a stack of 3x3 convolutions and ReLU layers, and RepSPKNet-B into a stack of 5x5 convolutions and ReLU layers. Ablation studies on various variants indicated that the performance of a multi-branch structure depends on branch diversity and branch capacity, which serves as a heuristic principle for designing multi-branch models. Our proposed RepSPKNet outperformed the original RepVGG and achieved SOTA performance in speaker verification.
References
 [1] (2018) VoxCeleb2: deep speaker recognition. arXiv preprint arXiv:1806.05622.
 [2] (2011) Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing 19 (4), pp. 788–798.
 [3] (2020) ECAPA-TDNN: emphasized channel attention, propagation and aggregation in TDNN based speaker verification. Interspeech 2020.
 [4] (2019) ACNet: strengthening the kernel skeletons for powerful CNN via asymmetric convolution blocks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
 [5] (2021) Diverse branch block: building a convolution as an inception-like unit. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10886–10895.
 [6] (2021) RepVGG: making VGG-style ConvNets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13733–13742.
 [7] (2021) Res2Net: a new multi-scale backbone architecture. IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (2), pp. 652–662.
 [8] (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 [9] (2020) Improving multi-scale aggregation using feature pyramid module for robust speaker verification of variable-duration utterances. Interspeech 2020.
 [10] (2017) A study on data augmentation of reverberant speech for robust speech recognition. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5220–5224.
 [11] (2017) VoxCeleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612.
 [12] (2015) MUSAN: a music, speech, and noise corpus. arXiv preprint arXiv:1510.08484.
 [13] (2017) Deep neural network embeddings for text-independent speaker verification. In Interspeech, pp. 999–1003.
 [14] (2018) X-vectors: robust DNN embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333.
 [15] (2017) Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI.
 [16] (2014) Going deeper with convolutions. arXiv preprint arXiv:1409.4842.
 [17] (2015) Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567.
 [18] (2020) The IDLab VoxCeleb speaker recognition challenge 2020 system description. arXiv preprint arXiv:2010.12468.
 [19] (2018) Additive margin softmax for face verification. IEEE Signal Processing Letters 25 (7), pp. 926–930.
 [20] (2018) CosFace: large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5265–5274.
 [21] (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 [22] (2019) Speaker augmentation and bandwidth extension for deep speaker embedding. In Interspeech, pp. 406–410.
 [23] (2020) Densely connected time delay neural network for speaker verification. In Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 921–925.
 [24] (2019) BUT system description to VoxCeleb speaker recognition challenge 2019. arXiv preprint arXiv:1910.12592.
 [25] (2021) The SpeakIn system for VoxCeleb speaker recognition challenge 2021. arXiv preprint arXiv:2109.01989.
 [26] (2020) ResNeXt and Res2Net structures for speaker verification. arXiv preprint arXiv:2007.02480.
 [27] (2021) Serialized multi-layer multi-head attention for neural speaker embedding. In Proc. Interspeech 2021, pp. 106–110.