Speaker recognition refers to the task of verifying the identity claimed by a speaker from that person's voice. It has been shown useful, for example, for speaker diarization, forensics, or voice dubbing.
In recent years, Deep Neural Networks (DNNs) have enabled original voice representations that outperform the state-of-the-art i-vector framework. One such DNN approach extracts an embedding representation of a speaker directly from acoustic excerpts. This high-level speaker representation is called the x-vector. In the x-vector framework, the DNN uses a stack of convolutional layers followed by a temporal pooling layer that computes the mean and standard deviation of the input sequence.
Convolutional layers extract hierarchical information from the speech signal in the form of feature maps. Lower layers capture trivial, low-level speech information, while upper layers capture more complex speech information. Unfortunately, whether in high or low layers, the extraction of feature maps by convolutional layers is carried out independently and locally (except for the last high-level layer).
To tackle this problem, the authors of the squeeze-and-excitation approach created a new architectural unit that re-weights each feature map in convolutional layers by explicitly modelling inter-dependencies between feature maps. This mechanism performs feature recalibration, using global information to select useful features and suppress useless ones. The method is called squeeze-and-excitation (SE).
The use of SE in Convolutional Neural Networks (CNNs) for speaker verification has recently been introduced in [6, 2, 10], but its influence has never been studied in detail for this task. In this paper, we propose to study and improve the squeeze-and-excitation mechanism. In detail, our contributions are as follows:
studying the influence of SE at different stages of the ResNet-34 architecture. We observe that the best system is the one where SE is integrated in the first two stages of ResNet;
evaluating different configurations of SE in the context of speaker verification (pooling layer, integration strategy, etc.). We show that the SE parameters and architectures that obtain the best performance in image processing are not the same as in speaker verification;
proposing a new variant of SE in which the global information is generated by concatenating two poolings: mean and standard deviation.
Our experiments on the Speakers in the Wild (SITW) and VoxCeleb1-E corpora without SE obtain 1.39% and 1.26% Equal Error Rate (EER) respectively, whereas the best system using SE obtains 1.29% and 1.13% EER respectively, a relative gain of 9% in terms of EER.
The paper is organized as follows: Section 2 summarizes the x-vector approach. Section 3 presents the squeeze-and-excitation (SE) approach. In Section 4, we analyze the results of SE on the speaker verification task and the contribution of our proposals. A qualitative study of the role of squeeze-and-excitation is proposed in Section 5. A conclusion is finally provided in Section 6.
2 x-vector system
The x-vector is a high-level speaker feature extracted from a DNN model. The DNN model is trained on a speaker identification task, i.e. by classifying speech segments into one of the speaker identities of the training set. In that context, the different layers of the DNN are trained to extract information that discriminates between speakers. The idea is to use one of the hidden layers as the speaker representation (the x-vector). One of the main advantages is that x-vectors produced by the DNN generalize well to speakers beyond those present in the training set. The benefits of x-vectors in terms of speaker detection accuracy have been demonstrated during recent evaluation campaigns on NIST SRE [20, 11, 13], VoxCeleb 2020 [18, 2, 19], and SdSVC [21, 15, 8].
|Input||–||60 × 400 × 1|
|Conv2D||3 × 3, Stride 1||60 × 400 × 128|
|ResNetBlock-1||, Stride 1|
The output dimension of the final layer is the number of speakers. Batch-norm and ReLU layers are not shown. The dimensions are (Frequency × Channels × Time). The input comprises 60 filter banks extracted from speech segments; a fixed segment length of 400 frames is used during training.
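Once extracted, the x-vectors of an enrollment and a test utterance are compared by a scoring backend. As a minimal illustration (the paper does not prescribe a particular backend; cosine similarity is assumed here), scoring two embeddings can look like this:

```python
import numpy as np

def cosine_score(xvec_enroll, xvec_test):
    """Verification score between two speaker embeddings (higher means more likely the same speaker)."""
    a = np.asarray(xvec_enroll, dtype=float)
    b = np.asarray(xvec_test, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

A trial is accepted as a same-speaker pair when this score exceeds a decision threshold tuned on development data.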
3 Squeeze-and-excitation

As mentioned in Section 1, the problem with convolutional layers is that feature maps are extracted independently and locally. To tackle this problem, a new architectural unit called the squeeze-and-excitation (SE) block was proposed. The SE block models inter-dependencies between feature maps, so that the network can increase its sensitivity to the most informative features.
The structure of the SE block is depicted in Figure 1. First, a pooling layer produces global information for each channel by aggregating each feature map across its spatial dimensions into a single value. A vector of size C is thus obtained, where C is the number of feature maps. This vector is then fed into a two-layer neural network, which outputs a C-dimensional vector. These values are used as weights on the original feature maps, scaling each channel according to its importance. The SE process is thus performed in two steps: 1) produce global information (squeeze step); and 2) re-weight each feature map (excitation step).
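The two steps can be sketched as follows. This is a minimal NumPy illustration (the function and weight names are ours, not from the paper), using mean pooling for the squeeze and a sigmoid-gated two-layer excitation:

```python
import numpy as np

def se_block(feature_maps, W1, W2):
    """Squeeze-and-excitation over feature maps of shape (C, F, T).

    W1: (C // r, C) reduction layer; W2: (C, C // r) expansion layer.
    """
    # Squeeze: aggregate each feature map to a single value (mean pooling) -> (C,)
    z = feature_maps.mean(axis=(1, 2))
    # Excitation: two-layer network, ReLU then sigmoid -> per-channel weights in (0, 1)
    h = np.maximum(0.0, W1 @ z)
    s = 1.0 / (1.0 + np.exp(-(W2 @ h)))
    # Re-weight: scale each channel by its learned importance
    return feature_maps * s[:, None, None]
```

With all-zero weights the sigmoid outputs 0.5 for every channel, so each feature map is simply halved; training moves these gates towards 0 or 1 per channel.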
This global information is obtained with a pooling layer, which plays a central role in the SE strategy. While mean-pooling obtains the best performance in image processing, it is unclear which pooling strategy performs best for speaker verification. We therefore propose to evaluate different pooling strategies for speaker verification: max-pooling, standard-deviation-pooling, and the concatenation of mean- and standard-deviation-pooling.
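The candidate squeeze operations can be compared directly; a sketch (mode names are ours, matching the systems evaluated later):

```python
import numpy as np

def squeeze(feature_maps, mode="mean"):
    """Candidate squeeze operations over feature maps of shape (C, F, T)."""
    flat = feature_maps.reshape(feature_maps.shape[0], -1)  # (C, F*T)
    if mode == "mean":
        return flat.mean(axis=1)
    if mode == "max":
        return flat.max(axis=1)
    if mode == "std":
        return flat.std(axis=1)
    if mode == "mean+std":  # doubles the squeeze vector to size 2C
        return np.concatenate([flat.mean(axis=1), flat.std(axis=1)])
    raise ValueError(f"unknown mode: {mode}")
```

Note that with mean+std concatenation the input dimension of the first excitation layer grows from C to 2C, slightly increasing the block's parameter count.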
Also, to limit model complexity, the hidden layers in SE blocks can act as a bottleneck: the input space is reduced to a smaller space and then expanded back to the original dimensionality. This reduction, as well as the number of hidden layers, is discussed in the experiments.
The SE block can simply be integrated into a CNN by inserting it after the non-linearity following each convolution. In the case of ResNet, the classical integration strategy is to insert the SE block after the final convolutional layer of the residual branch and before the summation with the skip connection. Integrating the SE block before the skip connection avoids adding noise to the skip connection and facilitates the learning of the identity mapping.
4 Experiments and protocols
This section describes the experimental setup in terms of dataset and experimental protocols.
4.1 Training and Evaluation datasets
The x-vector extractors are trained on the VoxCeleb2 dataset, development partition only, which contains speech excerpts from 5,994 speakers at a 16 kHz sampling rate. The trained extractors are assessed on the Speakers in the Wild (SITW) core-core task, and on the VoxCeleb1-E cleaned and VoxCeleb1-H cleaned datasets, also at a 16 kHz sampling rate. Note that the development set of VoxCeleb2 is completely disjoint from the VoxCeleb1 dataset (i.e. no common speakers).
We report results in terms of Equal Error Rate (EER) and the minimum of the normalized detection cost function (minDCF) at PTarget = .
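For reference, EER can be computed from target (same-speaker) and non-target trial scores by sweeping a threshold until the false-acceptance and false-rejection rates meet; a simple sketch:

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """EER: operating point where false-acceptance rate equals false-rejection rate."""
    tar = np.asarray(target_scores, dtype=float)
    non = np.asarray(nontarget_scores, dtype=float)
    best_gap, eer = np.inf, None
    for t in np.sort(np.concatenate([tar, non])):
        far = np.mean(non >= t)   # false acceptances among non-target trials
        frr = np.mean(tar < t)    # false rejections among target trials
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

Production toolkits interpolate the ROC curve rather than scanning raw scores, but the operating point found is the same in principle.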
4.2 Implementation details
The x-vector extractor used in this paper is a variant based on ResNet-34. The training data is cut into 4-second chunks and augmented with noise, following the augmentation procedure available as part of the Kaldi recipe. As input, we use 60-dimensional filter-banks. The x-vectors are 256-dimensional and the loss is the additive angular margin with a scale of 30 and a margin of 0.4. The feature map sizes are 128, 128, 256 and 256 for the 4 ResNet blocks. We use stochastic gradient descent with a momentum of 0.9, a weight decay of 2.10 and an initial learning rate of 0.2. The batch size was set to 128, training on 4 GPUs in parallel. The implementation is based on PyTorch and model training takes about 2 days. To remove silence and low-energy speech segments, a simple energy-based VAD is applied, based on the C0 component of the acoustic features.
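The energy-based VAD mentioned above amounts to thresholding C0 (the log-energy coefficient) per utterance; the threshold heuristic below is an illustrative assumption, not the paper's exact rule:

```python
import numpy as np

def energy_vad(c0, offset=0.5):
    """Keep frames whose C0 (log-energy) exceeds a data-driven threshold.

    Threshold = mean(C0) - offset * std(C0); the offset is an assumed heuristic.
    """
    c0 = np.asarray(c0, dtype=float)
    threshold = c0.mean() - offset * c0.std()
    return c0 > threshold  # boolean mask of retained (speech) frames
```

Frames where the mask is False are dropped before the chunks are fed to the extractor.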
Let us note that, for a fair comparison, the mini-batches used during training and the weight initialisation of the neural networks are identical across all experiments.
4.3 SE blocks at different stages
Table 2 explores the influence of integrating SE blocks at different stages of the ResNet-34 (one stage at a time). The system called Baseline is the system without SE blocks; it obtains 1.26% and 1.39% EER on VoxCeleb1-E cleaned and SITW respectively. The system that achieves the best performance integrates SE blocks at Stages 1 and 2 (called Stage=1,2); it obtains 1.14% and 1.99% EER on VoxCeleb1-E cleaned and SITW respectively. We observe that integrating SE blocks at all stages, as is done in image processing, obtains the worst results (1.30% and 2.21% EER on VoxCeleb1-E cleaned and SITW respectively).
|VoxCeleb1-E cleaned||VoxCeleb1-H cleaned||SITW core-core|
4.4 Reduction factor
The SE block is composed of two fully-connected hidden layers. These hidden layers can act as a bottleneck, where the input space is reduced to a smaller space defined by the reduction factor r and then expanded back to the original dimensionality. Table 3 investigates the trade-off between performance and model complexity by varying this reduction factor. We observe that performance is robust across a range of reduction factors. In image processing, the reduction factor is classically set to 16. We observe that the system without any reduction obtains the best performance, but a moderate reduction factor achieves a good balance between accuracy and complexity.
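To see the complexity side of this trade-off, the parameters in one SE block's two fully-connected layers can be counted as a function of the reduction factor (a sketch, biases included):

```python
def se_param_count(channels, reduction):
    """Parameters of the two FC layers C -> C/r -> C in one SE block (with biases)."""
    hidden = channels // reduction
    reduce_layer = channels * hidden + hidden   # weights + biases, C -> C/r
    expand_layer = hidden * channels + channels # weights + biases, C/r -> C
    return reduce_layer + expand_layer
```

For C = 128 feature maps, no reduction (r = 1) costs 33,024 parameters per block, while r = 16 costs 2,184, roughly fifteen times fewer.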
|VoxCeleb1-E cleaned||VoxCeleb1-H cleaned||SITW core-core|
4.5 Integration strategy
Table 4 studies the influence of the location of the SE block when integrated into ResNet-34. In addition to the standard integration, three variants are proposed, similar to those studied for image recognition and depicted in Figure 2:
SE-PRE block in which the SE block is moved before the residual unit.
SE-POST block in which the SE unit is moved after the summation with the identity branch (after ReLU).
SE-Identity block in which the SE unit is placed on the identity connection in parallel to the residual unit.
We observe that the SE-Standard design obtains the best performance in terms of both EER and minDCF.
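The four placements can be written compactly; a sketch where residual_fn stands for the block's convolutional branch and se_fn for the SE re-weighting (the function names are ours):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def se_standard(x, residual_fn, se_fn):
    # SE on the residual branch, before the summation with the identity
    return relu(x + se_fn(residual_fn(x)))

def se_pre(x, residual_fn, se_fn):
    # SE on the block input, before the residual unit
    return relu(x + residual_fn(se_fn(x)))

def se_post(x, residual_fn, se_fn):
    # SE after the summation with the identity branch (after ReLU)
    return se_fn(relu(x + residual_fn(x)))

def se_identity(x, residual_fn, se_fn):
    # SE on the identity connection, in parallel to the residual unit
    return relu(se_fn(x) + residual_fn(x))
```

Only SE-Post rescales the identity path itself, which is consistent with the intuition above that gating the skip connection hampers the learning of the identity mapping.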
|VoxCeleb1-E cleaned||VoxCeleb1-H cleaned||SITW core-core|
4.6 Different hidden layers
Traditionally, the SE block is composed of two hidden layers. Table 5 shows results when varying the number of hidden layers. The motivation is to ensure that the global information given by the pooling layer is well decorrelated by the hidden layers. We observe that the SE block containing two hidden layers obtains the best results.
|VoxCeleb1-E cleaned||VoxCeleb1-H cleaned||SITW core-core|
4.7 Pooling layer
Table 6 investigates performance with different pooling layers in SE blocks. We evaluate two traditional poolings: 1) mean pooling (system called Mean) and 2) maximum pooling (system called Max). In addition, we propose to evaluate: 1) standard-deviation pooling (system called Std) and 2) the concatenation of mean and standard-deviation poolings (system called Mean+Std). We observe that the Mean+Std system obtains the best performance.
|VoxCeleb1-E cleaned||VoxCeleb1-H cleaned||SITW core-core|
5 Role of Squeeze-and-Excitation
In this section, we study the role of squeeze-and-excitation in the context of speaker verification and, in particular, seek to understand why SE is so efficient when integrated only in Stages 1 and 2 of the ResNet-34 architecture. We propose to study the activations of the different SE blocks and their distribution at various stages of the network.
First, we study the distribution of excitations across speakers. Seven speakers are randomly picked from the VoxCeleb1 corpus. Then, for the last SE block of each stage, we compute, for each speaker, the mean activations over all segments. Figure 3 depicts this distribution. The activation distribution is substantially the same at Stage 1 whatever the speaker (the lines of different speakers overlap), whereas it varies significantly from one speaker to another at Stage 4. We presume that SE blocks in low layers excite informative features in a class-agnostic manner, strengthening the shared low-level representations (Stages 1 and 2), while in top layers SE blocks become increasingly specialised and respond to different inputs in a highly class-specific manner (Stages 3 and 4).
Next, we study the within-speaker excitation distribution. Similarly to the previous experiment, we randomly pick one speaker from the VoxCeleb1 corpus. Then, for the last SE block of each stage, we compute the mean and standard deviation of the activations over all segments of this speaker. Figure 4 depicts this distribution. We observe that the standard deviation is rather small at Stages 1 and 2, while it becomes increasingly large at Stages 3 and 4. This reinforces our idea that low layers extract information independent of the speaker class while high layers extract speaker-specific information.
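The per-stage statistics behind Figures 3 and 4 amount to averaging excitation vectors over a speaker's segments; a sketch of the computation (the array layout is our assumption):

```python
import numpy as np

def excitation_stats(excitations):
    """Mean and standard deviation over segments of SE excitations.

    excitations: array of shape (num_segments, C), one excitation vector
    per segment, for one speaker at one SE block.
    """
    e = np.asarray(excitations, dtype=float)
    return e.mean(axis=0), e.std(axis=0)
```

Comparing the mean vectors across speakers gives the between-speaker view of Figure 3; the standard deviation for a single speaker gives the within-speaker view of Figure 4.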
6 Conclusion

In recent years, the introduction of the squeeze-and-excitation (SE) method has made it possible to overcome some weaknesses of CNN architectures in image recognition. Once introduced in speaker verification, this method required adaptation to the specificities of this research field.
In this paper, different architectures and configurations of SE are presented and evaluated in order to build a robust x-vector extractor for speaker verification. The results of our experiments show that SE blocks used in low layers excite informative features in a class-agnostic manner, whereas in top layers SE blocks become increasingly specialised in a class-specific manner.
Experiments performed on the SITW, VoxCeleb1-E cleaned and VoxCeleb1-H datasets showed significant gains from using SE blocks at Stages 1 and 2 and from a pooling layer combining mean and standard-deviation statistics (leading to a relative gain of 9% in terms of equal error rate). These experiments confirm the need to properly adapt the architecture and configuration of SE to the speaker verification task.
This research was supported by the ANR agency (Agence Nationale de la Recherche), RoboVox project (ANR-18-CE33-0014).
- (2004) A tutorial on text-independent speaker verification. EURASIP Journal on Advances in Signal Processing 2004 (4), pp. 101962.
- BUT+Omilia system description: VoxCeleb Speaker Recognition Challenge 2020.
- (2009) Forensic speaker recognition. IEEE Signal Processing Magazine 26 (2), pp. 95–103.
- (2018) VoxCeleb2: deep speaker recognition.
- (2010) Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing (TASLP) 19 (4), pp. 788–798.
- (2020) ECAPA-TDNN: emphasized channel attention, propagation and aggregation in TDNN based speaker verification. arXiv preprint arXiv:2005.07143.
- (2017) Acoustic pairing of original and dubbed voices in the context of video game localization. In Interspeech.
- (2019) Veridas solution for SdSV challenge: technical report.
- (2018) Squeeze-and-excitation networks. In CVPR, pp. 7132–7141.
- (2020) NEC-TT speaker verification system for SRE'19 CTS challenge. Proc. Interspeech 2020, pp. 2227–2231.
- (2019) The NEC-TT 2018 speaker verification system. In Interspeech, pp. 4355–4359.
- (2016) The Speakers in the Wild (SITW) speaker recognition database. In Interspeech, pp. 818–822.
- (2019) The LIA system description for NIST SRE 2019.
- (2017) VoxCeleb: a large-scale speaker identification dataset. Interspeech, pp. 2616–2620.
- (2019) The LIA system description for SdSV challenge task 2.
- (2012) A global optimization framework for speaker diarization. In IEEE Odyssey - The Speaker and Language Recognition Workshop.
- (2018) X-vectors: robust DNN embeddings for speaker recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333.
- (2020) The IDLab VoxCeleb Speaker Recognition Challenge 2020 system description.
- (2020) ID R&D system description to VoxCeleb Speaker Recognition Challenge 2020.
- (2018) The JHU-MIT system description for NIST SRE18.
- (2019) The JHU system description for SdSV2020 challenge.
- (2019) BUT system description to VoxCeleb Speaker Recognition Challenge 2019.