
Studying squeeze-and-excitation used in CNN for speaker verification

09/13/2021
by Mickael Rouvier, et al.

In speaker verification, the extraction of voice representations is mainly based on the Residual Neural Network (ResNet) architecture. ResNet is built upon convolution layers, which learn filters to capture local spatial patterns across the whole input and generate feature maps that jointly encode the spatial and channel information. Unfortunately, all feature maps in a convolution layer are learnt independently (the convolution layer does not exploit the dependencies between feature maps) and locally. This problem was first tackled in image processing: a channel attention mechanism called squeeze-and-excitation (SE) was proposed for convolution layers and has recently been applied to speaker verification. This mechanism re-weights the information extracted across feature maps. In this paper, we first propose an original qualitative study of the influence and role of the SE mechanism applied to the speaker verification task at different stages of the ResNet, and then evaluate several SE architectures. We finally propose to improve the SE approach with a new pooling variant based on the concatenation of mean- and standard-deviation-pooling. Results show that applying SE only on the first stages of the ResNet better captures speaker information for the verification task, and that significant discrimination gains on the VoxCeleb1-E, VoxCeleb1-H and SITW evaluation tasks are obtained with the proposed pooling variant.


1 Introduction

Speaker recognition refers to the task of verifying the identity claimed by a speaker from that person’s voice [1]. For example, it has been shown useful for speaker diarization [16], forensics [3] or voice dubbing [7].

In recent years, Deep Neural Networks (DNN) have enabled original voice representations that outperform the state-of-the-art i-vector framework [5]. One such DNN approach seeks to extract an embedding representation of a speaker directly from acoustic excerpts. This high-level speaker representation is called an x-vector [17]. In the x-vector framework, the DNN uses a stack of convolution layers followed by a temporal pooling layer that computes the mean and standard deviation of the input sequence.

Convolutional layers extract hierarchical information from the speech signal into feature maps. Lower layers capture simple, low-level speech information, while upper layers capture more complex speech information. Unfortunately, whether in high or low layers, the extraction of feature maps by convolutional layers is carried out independently and locally (except for the last high-level layer).

In [9], the authors propose to tackle this problem by creating a new architectural unit that re-weights each feature map in convolutional layers by explicitly modelling inter-dependencies between feature maps. This mechanism performs feature recalibration, using global information to select useful information and suppress useless information. This method is called squeeze-and-excitation (SE).

The use of SE in Convolutional Neural Networks (CNN) for speaker verification has recently been introduced in [6, 2, 10], but its influence has never been studied in detail for this task. In this paper, we propose to study and improve the squeeze-and-excitation mechanism. In detail, our contributions are as follows:

  • studying the influence of SE at different stages of the ResNet-34 architecture. We observe that the best system is the one where SE is integrated into the first two stages of the ResNet;

  • evaluating different configurations of SE in the context of speaker verification (pooling layer, integration strategy, etc.). We show that the SE parameters and architectures that obtain the best performance in image processing are not the same as in speaker verification;

  • proposing a new variant of SE in which the global information is generated by concatenating two poolings: mean- and standard-deviation-pooling.

Our experiments on the Speakers in the Wild (SITW) and VoxCeleb1-E corpora yield 1.39% and 1.26% Equal Error Rate (EER) respectively without SE, whereas the best system using SE obtains 1.29% and 1.13% EER respectively. A relative gain of 9% is thus observed in terms of EER.

The paper is organized as follows: Section 2 summarizes the x-vector approach. Section 3 presents the squeeze-and-excitation (SE) approach. In Section 4, we analyze the results of SE on the speaker verification task and the contribution of our proposals. A qualitative study of the role of squeeze-and-excitation is proposed in Section 5. A conclusion is finally provided in Section 6.

2 x-vector system

An x-vector is a high-level speaker feature extracted from a DNN model. The DNN model is trained through a speaker identification task, i.e. by classifying speech segments into one of N speaker identities. In that context, the different layers of the DNN are trained to extract information that discriminates between speakers. The idea is to use one of the hidden layers as the speaker representation (the x-vector). One of the main advantages is that x-vectors produced by the DNN generalize well to speakers beyond those present in the training set. The benefits of x-vectors in terms of speaker detection accuracy have been demonstrated during recent evaluation campaigns: NIST SRE [20, 11, 13], VoxCeleb 2020 [18, 2, 19] and SdSVC [21, 15, 8].

The x-vector extractor proposed in this paper is a variant based on ResNet [22]. The detailed topology is shown in Table 1.

Layer name        Structure                      Output
Input             —                              60 × 400 × 1
Conv2D-1          3 × 3, Stride 1                60 × 400 × 128
ResNetBlock-1     [3 × 3, 128] × 3, Stride 1     60 × 400 × 128
ResNetBlock-2     [3 × 3, 128] × 4, Stride 2     30 × 200 × 128
ResNetBlock-3     [3 × 3, 256] × 6, Stride 2     15 × 100 × 256
ResNetBlock-4     [3 × 3, 256] × 3, Stride 2     8 × 50 × 256
Pooling           mean and std                   —
Flatten           —                              —
Dense1            —                              256
Dense2 (Softmax)  —                              N
Table 1: The proposed ResNet-34 architecture (block counts and output sizes follow the standard ResNet-34 layout). In the last row, N is the number of speakers. Batch-norm and ReLU layers are not shown. The dimensions are (Frequency × Time × Channels). The input is composed of 60 filter banks from speech segments; a fixed segment length of 400 frames is used during training.
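
To make this topology concrete, below is a minimal PyTorch sketch of such an extractor. The block counts [3, 4, 6, 3] and the 1×1 skip projections follow the standard ResNet-34 layout, and the statistics-pooling and flatten details are our assumptions; only the channel widths, the input shape and the 256-dimensional embedding come from the paper.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Plain residual block (two 3x3 convolutions); downsamples when stride > 1."""
    def __init__(self, c_in: int, c_out: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(c_out)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(c_out)
        self.relu = nn.ReLU(inplace=True)
        # 1x1 projection on the skip branch when the shape changes
        self.skip = (nn.Identity() if stride == 1 and c_in == c_out
                     else nn.Conv2d(c_in, c_out, 1, stride=stride, bias=False))

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.skip(x))

def stage(c_in: int, c_out: int, n_blocks: int, stride: int) -> nn.Sequential:
    blocks = [BasicBlock(c_in, c_out, stride)]
    blocks += [BasicBlock(c_out, c_out) for _ in range(n_blocks - 1)]
    return nn.Sequential(*blocks)

class XVectorResNet34(nn.Module):
    """Sketch of the Table 1 extractor: Conv2D-1, four stages, stats pooling, dense layers."""
    def __init__(self, n_speakers: int, emb_dim: int = 256):
        super().__init__()
        self.front = nn.Conv2d(1, 128, 3, stride=1, padding=1)  # Conv2D-1
        self.stages = nn.Sequential(
            stage(128, 128, 3, stride=1),   # ResNetBlock-1
            stage(128, 128, 4, stride=2),   # ResNetBlock-2
            stage(128, 256, 6, stride=2),   # ResNetBlock-3
            stage(256, 256, 3, stride=2),   # ResNetBlock-4
        )
        self.dense1 = nn.LazyLinear(emb_dim)          # the x-vector layer
        self.dense2 = nn.Linear(emb_dim, n_speakers)  # softmax classifier

    def forward(self, x):                  # x: (batch, 1, 60, T) filter-bank maps
        h = self.stages(self.front(x))     # (batch, 256, F', T')
        mean, std = h.mean(dim=-1), h.std(dim=-1)          # pool over time
        stats = torch.cat([mean, std], dim=1).flatten(1)   # flatten (channel, freq)
        emb = self.dense1(stats)           # 256-dimensional x-vector
        return self.dense2(emb)
```

At verification time, Dense2 is discarded and the output of Dense1 is used as the speaker embedding.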

3 Squeeze-and-excitation

As mentioned in Section 1, the problem with convolution layers is that feature maps are extracted independently and locally. In [9], the authors proposed to tackle this problem by creating a new architectural unit called the squeeze-and-excitation (SE) block. The SE block models inter-dependencies between feature maps, so that the network is able to increase its sensitivity to the most informative features.

The structure of the SE block is depicted in Figure 1. First, a pooling layer produces a global descriptor of each channel by aggregating each feature map across its spatial dimensions into a single numeric value. A vector of size C is thus obtained, where C is the number of feature maps. This vector is then fed into a two-layer neural network, which outputs a C-dimensional vector. These values are used as weights on the original feature maps, scaling each channel based on its importance. The SE process is thus performed in two steps: 1) produce the global information (squeeze step); and 2) re-weight each feature map (excitation step).
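
A minimal PyTorch sketch of such an SE block follows (the module and parameter names are ours, not the paper's; the reduction factor defaults to the value that Section 4.4 later finds to be a good trade-off):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: squeeze each feature map to one value,
    then predict one re-weighting coefficient per channel."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # bottleneck (see Section 4.4)
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                # weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape                 # x: (batch, C, freq, time)
        s = x.mean(dim=(2, 3))               # squeeze: (batch, C)
        w = self.fc(s).view(b, c, 1, 1)      # excitation: per-channel weights
        return x * w                         # re-weight each feature map
```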

The global information is obtained with a pooling layer, which plays a central role in the SE strategy. While mean-pooling obtains the best performance in image processing, it is unclear which pooling strategy performs best on the speaker verification task. We therefore propose to evaluate different pooling strategies for speaker verification: max-pooling, standard-deviation-pooling and the concatenation of mean- and standard-deviation-pooling.

Also, in order to limit model complexity, the hidden layers in SE blocks can be used as a reduction block, where the input space is reduced to a smaller space and then expanded back to the original input dimensionality. This reduction, as well as the number of hidden layers, is discussed in the experiments.

The SE block can be integrated into a CNN simply by inserting it after the non-linearity following each convolution. In the case of ResNet, the classical integration strategy is to insert the SE block after the final convolutional layer of the residual branch and before the summation with the skip connection. The idea of integrating the SE block before the skip connection is to avoid adding noise to the skip branch and to facilitate the learning of the identity mapping.
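
A sketch of this classical placement inside a residual block, reusing the SEBlock module above (layer sizes are illustrative):

```python
class SEBasicBlock(nn.Module):
    """Residual block with SE applied to the residual-branch output,
    before the summation with the skip connection (SE-Standard)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.se = SEBlock(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.se(out)          # recalibrate the residual branch only
        return self.relu(out + x)   # the skip branch stays untouched
```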

Figure 1: Structure of the SE Block used in ResNet architecture.

4 Experiments and protocols

This section describes the experimental setup: datasets, implementation details and evaluation protocols.

4.1 Training and Evaluation datasets

The x-vector extractors are trained on the VoxCeleb2 dataset [4], only on the development partition, which contains speech excerpts from 5,994 speakers at a 16 kHz sampling rate. The trained x-vectors are assessed on the Speakers in the Wild (SITW) core-core task [12] and the VoxCeleb1-E Cleaned and VoxCeleb1-H Cleaned [14] datasets, also at 16 kHz. Note that the development set of VoxCeleb2 is completely disjoint from the VoxCeleb1 dataset (i.e. no common speakers).

We report results in terms of Equal Error Rate (EER) and the minimum of the normalized detection cost function (minDCF) at PTarget = 0.01.
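
For reference, the minDCF reported here is the usual normalized detection cost, minimized over the decision threshold θ (the unit costs C_miss = C_fa = 1 are our assumption):

```latex
\mathrm{minDCF} = \min_{\theta}\;
  \frac{C_{\mathrm{miss}}\, P_{\mathrm{miss}}(\theta)\, P_{\mathrm{target}}
      + C_{\mathrm{fa}}\, P_{\mathrm{fa}}(\theta)\, \big(1 - P_{\mathrm{target}}\big)}
       {\min\!\big(C_{\mathrm{miss}}\, P_{\mathrm{target}},\;
                   C_{\mathrm{fa}}\, \big(1 - P_{\mathrm{target}}\big)\big)}
```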

4.2 Implementation details

The x-vector extractor used in this paper is a variant based on ResNet-34. The training data are cut into 4-second chunks and augmented with noise, as described in [17] and available as part of the Kaldi recipe. As input, we use 60-dimensional filter-banks. The x-vectors are 256-dimensional and the loss is the additive angular margin with a scale of 30 and a margin of 0.4. The numbers of feature maps are 128, 128, 256 and 256 for the 4 ResNet blocks. We use stochastic gradient descent with a momentum of 0.9, a weight decay of 2·10⁻⁵ and an initial learning rate of 0.2. The batch size was set to 128, with training run on 4 GPUs in parallel. The implementation is based on PyTorch and model training takes about 2 days. In order to remove silence and low-energy speech segments, a simple energy-based VAD is used, based on the C0 component of the acoustic features.
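
A hedged sketch of the corresponding loss and optimizer setup; the AAM-softmax module below is a generic implementation of ours (only the hyper-parameter values come from the text, and the weight-decay exponent is our reading of the garbled value):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    """Additive angular margin softmax (scale s = 30, margin m = 0.4)."""
    def __init__(self, emb_dim: int, n_speakers: int,
                 scale: float = 30.0, margin: float = 0.4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_speakers, emb_dim))
        self.scale, self.margin = scale, margin

    def forward(self, emb, labels):
        # cosine similarity between normalized embeddings and class centers
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        cos_margin = torch.cos(theta + self.margin)   # penalize the true class
        onehot = F.one_hot(labels, cos.size(1)).bool()
        logits = self.scale * torch.where(onehot, cos_margin, cos)
        return F.cross_entropy(logits, labels)

model = XVectorResNet34(n_speakers=5994)        # extractor sketch from Section 2
criterion = AAMSoftmax(emb_dim=256, n_speakers=5994)
optimizer = torch.optim.SGD(
    list(model.parameters()) + list(criterion.parameters()),
    lr=0.2, momentum=0.9, weight_decay=2e-5)    # hyper-parameters from Section 4.2
```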

Let us note that, for a fair comparison, the mini-batches used during neural network training and the weight initialisation of the networks are the same for all experiments.

4.3 SE blocks at different stages

Table 2 explores the influence of integrating SE blocks at different stages of the ResNet-34 (adding stages incrementally). The system called Baseline is the system without SE blocks; it obtains 1.26% and 1.39% EER on VoxCeleb1-E Cleaned and SITW respectively. The best-performing system is the one that integrates SE blocks at Stages 1 and 2 (system Stage=1,2), which obtains 1.14% and 1.31% EER on VoxCeleb1-E Cleaned and SITW respectively. We observe that integrating SE blocks at all stages, as is done in image processing, obtains the worst results (1.30% and 1.39% EER on VoxCeleb1-E Cleaned and SITW respectively).

System VoxCeleb1 VoxCeleb1 SITW
-E cleaned -H cleaned core-core
EER DCF EER DCF EER DCF
Baseline 1.26 0.131 2.12 0.200 1.39 0.127
Stage=1 1.20 0.129 2.04 0.191 1.34 0.115
Stage=1,2 1.14 0.127 1.99 0.187 1.31 0.111
Stage=1,2,3 1.22 0.133 2.05 0.195 1.26 0.116
Stage=1,2,3,4 1.30 0.134 2.21 0.217 1.39 0.130
Table 2: Results obtained by integrating SE Blocks at different stages.

4.4 Reduction factor

The SE block is composed of two fully-connected hidden layers. These hidden layers can be used as a reduction block, where the input space is reduced to a smaller space defined by the reduction factor (r) and then expanded back to the original input dimensionality. Table 3 investigates the trade-off between performance and model complexity by varying this reduction factor. We observe that performance is robust for a reduction factor between 2 and 4. In image processing, the reduction factor is classically set to 16. We observe that the system without any reduction (the baseline) obtains the best performance, but a reduction factor of 4 achieves a good balance between accuracy and complexity.

System VoxCeleb1 VoxCeleb1 SITW
-E cleaned -H cleaned core-core
EER DCF EER DCF EER DCF
Baseline 1.14 0.127 1.99 0.187 1.31 0.111
r=2 1.17 0.127 2.00 0.188 1.26 0.115
r=4 1.15 0.122 1.95 0.181 1.34 0.109
r=8 1.22 0.136 2.09 0.193 1.56 0.118
Table 3: Results obtained by using different reduction and expansion rates.

4.5 Integration strategy

Table 4 studies the influence of the location of the SE block when integrated into ResNet-34. In addition to the standard integration, three variants are proposed, similar to the ones in [9] and depicted in Figure 2:

Figure 2: Schema of different integration strategies of the SE blocks in ResNet architecture.
  • SE-PRE block in which the SE block is moved before the residual unit.

  • SE-POST block in which the SE unit is moved after the summation with the identity branch (after ReLU).

  • SE-Identity block in which the SE unit is placed on the identity connection in parallel to the residual unit.

We observe that the SE-Standard design obtains the best overall performance in terms of EER and minDCF.

System VoxCeleb1 VoxCeleb1 SITW
-E cleaned -H cleaned core-core
EER DCF EER DCF EER DCF
SE-Standard 1.14 0.127 1.99 0.187 1.31 0.111
SE-PRE 1.22 0.132 2.09 0.195 1.48 0.117
SE-POST 1.20 0.134 2.04 0.189 1.37 0.121
SE-Identity 1.18 0.125 2.03 0.193 1.37 0.113
Table 4: Results obtained by using different integration strategies.
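
To make the four placements of Table 4 concrete, here is a compact sketch reusing the SEBlock module above (the `variant` flag is ours):

```python
class SEPlacementBlock(nn.Module):
    """Residual block whose SE unit is placed according to Figure 2."""
    def __init__(self, channels: int, variant: str = "standard"):
        super().__init__()
        self.residual = nn.Sequential(          # two 3x3 conv + BN, ReLU between
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.se = SEBlock(channels)
        self.relu = nn.ReLU(inplace=True)
        self.variant = variant

    def forward(self, x):
        if self.variant == "pre":        # SE-PRE: before the residual unit
            return self.relu(self.residual(self.se(x)) + x)
        if self.variant == "post":       # SE-POST: after summation and ReLU
            return self.se(self.relu(self.residual(x) + x))
        if self.variant == "identity":   # SE-Identity: on the skip branch
            return self.relu(self.residual(x) + self.se(x))
        # SE-Standard: on the residual branch, before the summation
        return self.relu(self.se(self.residual(x)) + x)
```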

4.6 Different hidden layers

Traditionally, the SE block is composed of two hidden layers. Table 5 shows results when varying this number of hidden layers. The motivation is to ensure that the global information given by the pooling layer is well decorrelated across the hidden layers. We observe that the SE block containing two hidden layers obtains the best overall results.

System VoxCeleb1 VoxCeleb1 SITW
-E cleaned -H cleaned core-core
EER DCF EER DCF EER DCF
h=1 1.18 0.129 2.04 0.191 1.23 0.110
h=2 1.14 0.127 1.99 0.187 1.31 0.111
h=3 1.18 0.127 2.06 0.194 1.31 0.113
h=4 1.18 0.130 2.01 0.186 1.37 0.113
Table 5: Results obtained by varying the number of hidden layers in SE blocks.

4.7 Pooling layer

Table 6 investigates performance using different pooling layers in SE blocks. We evaluate two traditional poolings: mean-pooling (system Mean) and max-pooling (system Max). In addition, we propose to evaluate standard-deviation-pooling (system Std) and the concatenation of mean- and standard-deviation-pooling (system Mean+Std). We observe that the Mean+Std system obtains the best performance.

System VoxCeleb1 VoxCeleb1 SITW
-E cleaned -H cleaned core-core
EER DCF EER DCF EER DCF
Max 1.21 0.131 2.06 0.193 1.31 0.114
Mean 1.14 0.127 1.99 0.187 1.31 0.111
Std 1.18 0.129 1.99 0.190 1.31 0.113
Mean+Std 1.13 0.127 1.97 0.192 1.29 0.113
Table 6: Results by using different pooling layers in SE Blocks.
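
The four squeeze variants of Table 6 can be written as a drop-in replacement for the mean squeeze in the SEBlock sketch above (the function name is ours); note that with Mean+Std the descriptor has size 2C, so the first bottleneck layer must accept a 2C-dimensional input:

```python
import torch

def squeeze(x: torch.Tensor, mode: str = "mean+std") -> torch.Tensor:
    """Aggregate each feature map of x (batch, C, freq, time) into a global descriptor."""
    flat = x.flatten(2)                 # (batch, C, freq*time)
    if mode == "mean":                  # system Mean
        return flat.mean(dim=2)
    if mode == "max":                   # system Max
        return flat.amax(dim=2)
    if mode == "std":                   # system Std
        return flat.std(dim=2)
    # system Mean+Std: concatenation -> (batch, 2C)
    return torch.cat([flat.mean(dim=2), flat.std(dim=2)], dim=1)
```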

Figure 3: Distribution of excitation across speakers given by the last SE block at each stage of the ResNet (panels (a)–(d): Stages 1–4).

Figure 4: Within-speaker distribution of excitation given by the last SE block at each stage of the ResNet (panels (a)–(d): Stages 1–4).

5 Role of Squeeze-and-Excitation

In this section, we study the role of squeeze-and-excitation in the context of speaker verification and, in particular, seek to understand why SE is most effective when integrated only in Stages 1 and 2 of the ResNet-34 architecture. We study the activations of the different SE blocks and their distributions at various stages of the network.

First, we study the distribution of excitation across speakers. Seven speakers are randomly picked from the VoxCeleb1 corpus. Then, for the last SE block of each stage, we compute, for each speaker, the mean activations over all segments. Figure 3 depicts these distributions. It can be observed that the activation distribution is substantially the same at Stage 1, whatever the speaker (the curves of the different speakers overlap). However, the distribution varies significantly from one speaker to another at Stage 4. We presume that the SE blocks used in low layers (Stages 1 and 2) excite informative features in a class-agnostic manner, strengthening the shared low-level representations, whereas in top layers (Stages 3 and 4) the SE blocks become increasingly specialised and respond to different inputs in a highly class-specific manner.

Next, we study the within-speaker excitation distribution. Similarly to the previous experiment, we randomly pick one speaker from the VoxCeleb1 corpus. Then, for the last SE block of each stage, we compute the mean and standard deviation of the activations over all the segments of this speaker. Figure 4 depicts this distribution. We observe that the standard deviation is rather small at Stages 1 and 2, while it becomes larger and larger at Stages 3 and 4. This reinforces our idea that low layers extract information independent of the speaker class while high layers extract speaker-specific information.
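
Distributions such as those of Figures 3 and 4 can be collected with PyTorch forward hooks on the sigmoid that ends each SE bottleneck; a minimal sketch, assuming the SEBlock-based model above and a `loader` (assumed) that yields segments of the chosen speaker(s):

```python
import collections
import torch

excitations = collections.defaultdict(list)

def make_hook(name):
    def hook(module, inputs, output):
        excitations[name].append(output.detach().cpu())  # (batch, C) SE weights
    return hook

# hook the Sigmoid ending each SE bottleneck ("se" as a path component is assumed)
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Sigmoid) and "se" in name.split("."):
        module.register_forward_hook(make_hook(name))

model.eval()
with torch.no_grad():
    for segments in loader:       # (batch, 1, 60, T) tensors for one speaker
        model(segments)

# per-block mean and standard deviation of the excitation across segments
stats = {n: (torch.cat(v).mean(0), torch.cat(v).std(0))
         for n, v in excitations.items()}
```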

6 Conclusions

In recent years, the introduction of the squeeze-and-excitation (SE) method has made it possible to overcome some weaknesses of CNN architectures in the field of image recognition. Since its introduction in speaker verification, this method has required adaptation to the specificities of this research field.

In this paper, different architectures and configurations of SE are presented and evaluated in order to build a robust x-vector extractor for speaker verification. The results of our experiments show that SE blocks used in low layers excite informative features in a class-agnostic manner, whereas, when used in top layers, the SE blocks become increasingly specialised and respond in a class-specific manner.

Experiments performed on the SITW, VoxCeleb1-E Cleaned and VoxCeleb1-H datasets show significant gains from using SE blocks at Stages 1 and 2 and from using a pooling layer combining mean and standard-deviation statistics (leading to a relative gain of 9% in terms of equal error rate). These experiments confirm the need to properly adapt the architecture and configuration of SE to the speaker verification task.

7 Acknowledgement

This research was supported by the ANR agency (Agence Nationale de la Recherche), RoboVox project (ANR-18-CE33-0014).

References

  • [1] F. Bimbot, J. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-García, D. Petrovska-Delacrétaz, and D. A. Reynolds (2004) A tutorial on text-independent speaker verification. EURASIP Journal on Advances in Signal Processing 2004 (4), pp. 101962. Cited by: §1.
  • [2] N. Brummer, L. Burget, O. Glembek, P. Matejka, L. Mošner, O. Novotnỳ, O. Plchot, J. Rohdin, A. Silnova, T. Stafylakis, et al. BUT + Omilia system description, VoxCeleb Speaker Recognition Challenge 2020. Cited by: §1, §2.
  • [3] J. P. Campbell, W. Shen, W. M. Campbell, R. Schwartz, J. Bonastre, and D. Matrouf (2009) Forensic speaker recognition. IEEE Signal Processing Magazine 26 (2), pp. 95–103. Cited by: §1.
  • [4] J. S. Chung, A. Nagrani, and A. Zisserman (2018) VoxCeleb2: deep speaker recognition. Cited by: §4.1.
  • [5] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet (2010) Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing (TASLP) 19 (4), pp. 788–798. Cited by: §1.
  • [6] B. Desplanques, J. Thienpondt, and K. Demuynck (2020) ECAPA-TDNN: emphasized channel attention, propagation and aggregation in TDNN based speaker verification. arXiv preprint arXiv:2005.07143. Cited by: §1.
  • [7] A. Gresse, M. Rouvier, R. Dufour, V. Labatut, and J. Bonastre (2017) Acoustic pairing of original and dubbed voices in the context of video game localization. In Interspeech, Cited by: §1.
  • [8] S. P. Guillermo Barbadillo (2019) Veridas solution for SdSV challenge technical report. Cited by: §2.
  • [9] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141. Cited by: §1, §3, §4.5.
  • [10] K. A. Lee, K. Okabe, H. Yamamoto, Q. Wang, L. Guo, T. Koshinaka, J. Zhang, K. Ishikawa, and K. Shinoda (2020) NEC-TT speaker verification system for SRE'19 CTS challenge. Proc. Interspeech 2020, pp. 2227–2231. Cited by: §1.
  • [11] K. A. Lee, H. Yamamoto, K. Okabe, Q. Wang, L. Guo, T. Koshinaka, J. Zhang, and K. Shinoda (2019) The NEC-TT 2018 speaker verification system. In Interspeech, pp. 4355–4359. Cited by: §2.
  • [12] M. McLaren, L. Ferrer, D. Castan, and A. Lawson (2016) The Speakers in the Wild (SITW) speaker recognition database. In Interspeech, pp. 818–822. Cited by: §4.1.
  • [13] M. Rouvier and P.-M. Bousquet (2019) The LIA system description for NIST SRE 2019. Cited by: §2.
  • [14] A. Nagrani, J. S. Chung, and A. Zisserman (2017) VoxCeleb: a large-scale speaker identification dataset. Interspeech, pp. 2616–2620. Cited by: §4.1.
  • [15] P.-M. Bousquet and M. Rouvier (2019) The LIA system description for SdSV challenge task 2. Cited by: §2.
  • [16] M. Rouvier and S. Meignier (2012) A global optimization framework for speaker diarization. In IEEE Odyssey - The Speaker and Language Recognition Workshop, Cited by: §1.
  • [17] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur (2018) X-vectors: robust dnn embeddings for speaker recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333. Cited by: §1, §4.2.
  • [18] J. Thienpondt, B. Desplanques, and K. Demuynck (2020) The IDLab VoxCeleb Speaker Recognition Challenge 2020 system description. Cited by: §2.
  • [19] N. Torgashov (2020) ID R&D system description to VoxCeleb Speaker Recognition Challenge 2020. Cited by: §2.
  • [20] J. Villalba, N. Chen, D. Snyder, D. Garcia-Romero, A. McCree, G. Sell, J. Borgstrom, F. Richardson, S. Shon, F. Grondin, et al. (2018) The JHU-MIT system description for NIST SRE18. Cited by: §2.
  • [21] J. Villalba and N. Dehak (2019) The JHU system description for SdSV 2020 challenge. Cited by: §2.
  • [22] H. Zeinali, S. Wang, A. Silnova, P. Matějka, and O. Plchot (2019) BUT system description to VoxCeleb Speaker Recognition Challenge 2019. Cited by: §2.