In recent years, x-vectors [x_vectors] and their subsequent improvements have consistently provided state-of-the-art results on the task of speaker verification. Improving upon the original time delay neural network (TDNN) architecture is an active area of research. Usually, the neural networks are trained on the speaker identification task. After convergence, low dimensional speaker embeddings can be extracted from the penultimate layer of the trained network to characterize the speaker in the input recording. Speaker verification can be accomplished by comparing the two embeddings corresponding with an enrollment and a test recording to accept or reject the hypothesis that both recordings contain the same speaker. A simple cosine distance measurement can be used for this comparison. In addition, a more complicated scoring backend can be trained such as Probabilistic Linear Discriminant Analysis (PLDA) [plda].
The rising popularity of the x-vector system has resulted in significant architectural improvements and optimized training procedures [arcface] over the original approach. The topology of the system was improved by incorporating elements of the popular ResNet [resnet] architecture. Adding residual connections between the frame-level layers has been shown to enhance the embeddings [jhu_voxsrc, but_voxsrc]. Additionally, residual connections enable the back-propagation algorithm to converge faster and help avoid the vanishing gradient problem [resnet].
The statistics pooling layer in the x-vector system projects the variable-length input to a fixed-length representation by gathering simple statistics of hidden node activations across time. The authors in [att_stat, self_att] introduce a temporal attention system to this pooling layer which allows the network to only focus on frames it deems important. It can also be interpreted as a Voice Activity Detection (VAD) pre-processing step to detect the irrelevant non-speech frames.
In this work we propose further architectural enhancements to the TDNN architecture and statistics pooling layer. We introduce additional skip connections to propagate and aggregate channels throughout the system. Channel attention that uses a global context is incorporated in the frame layers and statistics pooling layer to improve the results even further.
The paper is organized as follows: Section 2 will describe the current state-of-the-art speaker recognition systems which will be used as baseline. Section 3 will explain and motivate the novel components of our proposed architecture. Section 4 will explain our experimental setup to test the impact of the individual components in our architecture on the popular VoxCeleb datasets [vox1, vox2, voxsrc]. In addition, a comparison between popular state-of-the-art baseline systems will be provided. We discuss the results of these experiments in Section 5. Section 6 will conclude with a brief overview of our findings.
2 DNN Speaker Recognition Systems
Two types of DNN-based speaker recognition architectures will be used as strong baselines for measuring the impact of our proposed architecture: an x-vector based system and a ResNet-based system, both of which currently provide state-of-the-art performance on speaker verification tasks such as VoxSRC [voxsrc].
2.1 Extended-TDNN x-vector
The first baseline system is the Extended TDNN x-vector architecture [x_vector_wide, jhu_voxsrc, but_voxsrc], which improves upon the original x-vector system introduced in [x_vectors]. The initial frame layers consist of 1-dimensional dilated convolutional layers interleaved with dense layers. Every filter has access to all the features of the previous layer or input layer. The task of the dilated convolutional layers is to gradually build up the temporal context. Residual connections are introduced in all frame-level layers. The frame layers are followed by an attentive statistics pooling layer that calculates the mean and standard deviations of the final frame-level features. The attention system [att_stat] allows the model to select the frames it deems relevant. After pooling, two fully-connected layers are introduced, with the first one acting as a bottleneck layer to generate the low-dimensional speaker characterizing embedding.
2.2 ResNet-based r-vector
The second baseline system is the r-vector system proposed in [but_voxsrc]. It is based on the ResNet18 and ResNet34 implementations of the successful ResNet architecture [resnet]. The convolutional frame layers of this network process the features as a 2-dimensional signal before collecting the mean and standard deviation statistics in the pooling layer. See [but_voxsrc] for more details about the topology.
3 Proposed ECAPA-TDNN Architecture
In this section, we examine some of the limitations of the x-vector architecture and incorporate potential solutions in our ECAPA-TDNN architecture. The following subsections will focus on frame-level and pooling-level enhancements. An overview of the complete architecture is given by Figure 2.
3.1 Channel- and context-dependent statistics pooling
In recent x-vector architectures, soft self-attention is used for calculating weighted statistics in the temporal pooling layer [att_stat]. Success with multi-headed attention has shown that certain speaker properties can be extracted on different sets of frames [self_att]. Due to these results, we argue that it might be beneficial to extend this attention mechanism even further to the channel dimension. This enables the network to focus more on speaker characteristics that do not activate on identical or similar time instances, e.g. vowels vs. consonants.
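We implement the attention mechanism as described in [att_stat] and adapt it to be channel-dependent:
$$e_{t,c} = \boldsymbol{v}_c^{T} f(\boldsymbol{W}\boldsymbol{h}_t + \boldsymbol{b}) + k_c, \qquad (1)$$
where $\boldsymbol{h}_t$ are the activations of the last frame layer at time step $t$. The parameters $\boldsymbol{W} \in \mathbb{R}^{R \times C}$ and $\boldsymbol{b} \in \mathbb{R}^{R \times 1}$ project the information for self-attention to a smaller $R$-dimensional representation that is shared across all $C$ channels to reduce the parameter count and over-fitting. After a non-linearity $f(\cdot)$ this information is transformed to a channel-dependent self-attention score through a linear layer with weights $\boldsymbol{v}_c \in \mathbb{R}^{R \times 1}$ and bias $k_c$. This scalar score $e_{t,c}$ is then normalized over all frames by applying the softmax function channel-wise across time:
$$\alpha_{t,c} = \frac{\exp(e_{t,c})}{\sum_{\tau=1}^{T} \exp(e_{\tau,c})}. \qquad (2)$$
The self-attention score $\alpha_{t,c}$ represents the importance of each frame given the channel and is used to calculate the weighted statistics of channel $c$. For each utterance the channel component $\tilde{\mu}_c$ of the weighted mean vector $\tilde{\boldsymbol{\mu}}$ is estimated as:
$$\tilde{\mu}_c = \sum_{t=1}^{T} \alpha_{t,c} h_{t,c}. \qquad (3)$$
The channel component $\tilde{\sigma}_c$ of the weighted standard deviation vector $\tilde{\boldsymbol{\sigma}}$ is constructed as follows:
$$\tilde{\sigma}_c = \sqrt{\sum_{t=1}^{T} \alpha_{t,c} h_{t,c}^{2} - \tilde{\mu}_c^{2}}. \qquad (4)$$
The final output of the pooling layer is given by concatenating the vectors of the weighted mean $\tilde{\boldsymbol{\mu}}$ and weighted standard deviation $\tilde{\boldsymbol{\sigma}}$.
Furthermore, we expand the temporal context of the pooling layer by allowing the self-attention to look at global properties of the utterance. We concatenate the local input $\boldsymbol{h}_t$ in (1) with the global non-weighted mean and standard deviation of $\boldsymbol{h}_t$ across the time domain. This context vector should allow the attention mechanism to adapt itself to global properties of the utterance such as noise or recording conditions.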
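
The pooling described in this subsection can be sketched in a few lines of dependency-free Python. This is only an illustrative reading of the text, not the authors' implementation: the function and argument names are hypothetical, `tanh` is assumed for the non-linearity, and the projection `W` is assumed to take the concatenation of the frame activations with the global mean and standard deviation (so it has 3C columns).

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def attentive_stats_pooling(h, W, b, V, k, eps=1e-12):
    """h holds T frames of C activations; returns [weighted mean ; weighted std]."""
    T, C = len(h), len(h[0])
    # Global context: unweighted mean and std of every channel over time.
    g_mean = [sum(h[t][c] for t in range(T)) / T for c in range(C)]
    g_std = [math.sqrt(sum((h[t][c] - g_mean[c]) ** 2 for t in range(T)) / T)
             for c in range(C)]
    # Channel-dependent scores e[t][c], computed on [h_t ; g_mean ; g_std].
    e = []
    for t in range(T):
        x = h[t] + g_mean + g_std
        hidden = [math.tanh(a + bias) for a, bias in zip(matvec(W, x), b)]  # tanh is an assumption
        e.append([s + kc for s, kc in zip(matvec(V, hidden), k)])
    mean, std = [], []
    for c in range(C):
        # Softmax over time, computed separately for each channel.
        m = max(e[t][c] for t in range(T))
        w = [math.exp(e[t][c] - m) for t in range(T)]
        z = sum(w)
        alpha = [wi / z for wi in w]
        # Weighted mean and weighted standard deviation of channel c.
        mu = sum(a * h[t][c] for t, a in enumerate(alpha))
        var = sum(a * h[t][c] ** 2 for t, a in enumerate(alpha)) - mu ** 2
        mean.append(mu)
        std.append(math.sqrt(max(var, eps)))
    return mean + std

# Toy usage with random parameters (T frames, C channels, bottleneck R).
random.seed(0)
T, C, R = 20, 4, 3
h = [[random.gauss(0, 1) for _ in range(C)] for _ in range(T)]
W = [[random.gauss(0, 0.1) for _ in range(3 * C)] for _ in range(R)]
b = [0.0] * R
V = [[random.gauss(0, 0.1) for _ in range(R)] for _ in range(C)]  # row c is v_c
k = [0.0] * C
pooled = attentive_stats_pooling(h, W, b, V, k)
```

When all rows of `V` are zero, every frame receives the same weight and the output reduces to ordinary statistics pooling, which is a convenient sanity check.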
3.2 1-Dimensional Squeeze-excitation Res(2)Blocks
The temporal context of frame layers in the original x-vector system is limited to 15 frames. As the network apparently benefits from a wider temporal context [x_vector_wide, but_voxsrc, jhu_voxsrc], we argue it could be beneficial to rescale the frame-level features given global properties of the recording, similar to the global context in the attention module described above. For this purpose we introduce 1-dimensional Squeeze-Excitation (SE) blocks, as this computer vision approach to modeling global channel interdependencies has proven successful [se_block].
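
The 15-frame figure can be checked with a short receptive-field calculation. The sketch below assumes the commonly described frame-layer layout of the original x-vector TDNN (kernel sizes 5, 3, 3, 1, 1 with dilations 1, 2, 3, 1, 1); that layout is not stated in this paper and the function name is hypothetical.

```python
def temporal_context(layers):
    """Receptive field in frames of stacked 1-D convolutions with stride 1.

    layers is an iterable of (kernel_size, dilation) pairs.
    """
    context = 1
    for kernel_size, dilation in layers:
        # Each layer widens the context by (kernel_size - 1) * dilation frames.
        context += (kernel_size - 1) * dilation
    return context

# Assumed layout of the original x-vector frame layers.
xvector_layers = [(5, 1), (3, 2), (3, 3), (1, 1), (1, 1)]
print(temporal_context(xvector_layers))  # 15
```

Under this assumption the context grows as 5, 9 and 15 frames over the first three layers, and the two context-1 layers leave it unchanged.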
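The first component of an SE-block is the squeeze operation which generates a descriptor for each channel. The squeeze operation simply consists of calculating the mean vector $\boldsymbol{z}$ of the frame-level features across the time domain:
$$\boldsymbol{z} = \frac{1}{T} \sum_{t=1}^{T} \boldsymbol{h}_t. \qquad (5)$$
The descriptors in $\boldsymbol{z}$ are then used in the excitation operation to calculate a weight for each channel. We define the subsequent excitation operation as:
$$\boldsymbol{s} = \sigma(\boldsymbol{W}_2 f(\boldsymbol{W}_1 \boldsymbol{z} + \boldsymbol{b}_1) + \boldsymbol{b}_2), \qquad (6)$$
with $\sigma(\cdot)$ denoting the sigmoid function, $f(\cdot)$ a non-linearity, $\boldsymbol{W}_1 \in \mathbb{R}^{R \times C}$ and $\boldsymbol{W}_2 \in \mathbb{R}^{C \times R}$. This operation acts as a bottleneck layer with $C$ and $R$ referring to the number of input channels and reduced dimensionality respectively. The resulting vector $\boldsymbol{s}$ contains weights $s_c$ between zero and one, which are applied to the original input through channel-wise multiplication:
$$\tilde{\boldsymbol{h}}_c = s_c \boldsymbol{h}_c. \qquad (7)$$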
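
A minimal sketch of the squeeze and excitation steps on a single utterance is given below. It uses plain Python lists, assumes ReLU for the bottleneck non-linearity, and all names are hypothetical; it illustrates the operations described above rather than reproducing the authors' code.

```python
import math

def se_block(h, W1, b1, W2, b2):
    """Rescale each channel of h (C channels, each a list of T frames)."""
    # Squeeze: one descriptor per channel, the mean over time.
    z = [sum(channel) / len(channel) for channel in h]
    # Excitation, bottleneck part: R hidden units (ReLU is an assumption).
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z)) + bias)
              for row, bias in zip(W1, b1)]
    # Excitation, gating part: a sigmoid weight in (0, 1) for every channel.
    s = [1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(row, hidden)) + bias)))
         for row, bias in zip(W2, b2)]
    # Channel-wise multiplication of the input with its weight.
    return [[s_c * x for x in channel] for s_c, channel in zip(s, h)]

# Toy usage: C = 2 channels, T = 2 frames, bottleneck R = 1.
h = [[1.0, 3.0], [2.0, 2.0]]
W1, b1 = [[0.5, 0.5]], [0.0]
W2, b2 = [[1.0], [-1.0]], [0.0, 0.0]
scaled = se_block(h, W1, b1, W2, b2)
```

Because the weights depend only on time-averaged descriptors, every frame of a channel is scaled by the same factor, which is how the block injects utterance-level information into frame-level features.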
The 1-dimensional SE-block can be integrated in the x-vector architecture in various ways, the most straightforward being to place one after each dilated convolution. However, we want to combine them with the benefits of residual connections [resnet]. At the same time, we do not want to increase the total number of parameters too much compared to the baseline systems.
The SE-Res2Block shown in Figure 1 meets the requirements mentioned above. We enclose the dilated convolutions between a preceding and a succeeding dense layer, each with a context of 1 frame. The first dense layer can be used to reduce the feature dimension, while the second dense layer restores the number of features to the original dimension. This is followed by an SE-block to scale each channel. The whole unit is covered by a skip connection. The use of these traditional ResBlocks makes it easy to incorporate advancements concerning this popular computer vision architecture. The recent Res2Net module [res2net], for example, enhances the central convolutional layer so that it can process multi-scale features by constructing hierarchical residual-like connections within. The integration of this module improved performance, while significantly reducing the number of model parameters.
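
To give a feel for why the Res2Net module saves parameters, the following back-of-the-envelope count compares a standard central convolution with a multi-scale one. It assumes a kernel size of 3, ignores biases, and follows the Res2Net convention that the channels are split into `scale` groups of which all but the first pass through their own small convolution. The helper names are hypothetical, and the numbers cover only the central convolution, not the whole model.

```python
def conv1d_weights(c_in, c_out, kernel_size):
    """Number of weights in a 1-D convolution (biases ignored)."""
    return c_in * c_out * kernel_size

def res2_conv_weights(channels, kernel_size, scale):
    """Weights of the central Res2Net-style convolution.

    The channels are split into `scale` groups; every group except the
    first is filtered by its own (channels / scale)-wide convolution.
    """
    width = channels // scale
    return (scale - 1) * conv1d_weights(width, width, kernel_size)

standard = conv1d_weights(512, 512, 3)       # 786432
multi_scale = res2_conv_weights(512, 3, 8)   # 86016
print(standard, multi_scale)
```

With 512 channels and a scale of 8, the central convolution shrinks by roughly a factor of nine under these assumptions.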
3.3 Multi-layer feature aggregation and summation
The original x-vector system only uses the feature map of the last frame-layer for calculating the pooled statistics. Given the hierarchical nature of a CNN, these deeper level features are the most complex ones and should be strongly correlated with the speaker identities. However, due to evidence in [mfa_tag, multi_stage_aggregation] we argue that more shallow feature maps can also contribute towards more robust speaker embeddings. For each frame, our proposed system concatenates the output feature maps of all the SE-Res2Blocks. After this Multi-layer Feature Aggregation (MFA), a dense layer processes the concatenated information to generate the features for the attentive statistics pooling.
Another, complementary way to exploit multi-layer information is to use the output of all preceding SE-Res2Blocks and the initial convolutional layer as input for each frame layer block [mfa_tag, multi_stage_summation]. We implement this by defining the residual connection in each SE-Res2Block as the sum of the outputs of all the previous blocks. We opt for a summation of the feature maps instead of concatenation to restrain the model parameter count. The final architecture without the summed residual connections is shown in Figure 2.
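
The data flow of both ideas can be sketched as follows. This is one possible reading of the description, with hypothetical names: each block receives the sum of all earlier outputs (including the initial convolution) and adds that same sum back as its skip connection, and MFA concatenates the block outputs along the channel axis. The `blocks` argument stands in for the SE-Res2Block transformations without their skip connection.

```python
def add_maps(a, b):
    """Element-wise sum of two feature maps (C channels x T frames)."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def frame_layers_with_mfa(x0, blocks):
    """x0 is the output of the initial convolutional layer."""
    outputs = [x0]
    for block in blocks:
        # Summed residual: the block input is the sum of all previous outputs.
        block_input = outputs[0]
        for previous in outputs[1:]:
            block_input = add_maps(block_input, previous)
        # The skip connection adds the block input to the transformed features.
        outputs.append(add_maps(block(block_input), block_input))
    # MFA: concatenate the outputs of all blocks along the channel axis.
    return [row for feature_map in outputs[1:] for row in feature_map]

# Toy usage: one channel, two frames, three blocks whose transformation
# outputs zeros, so only the skip paths carry signal.
zeros = lambda fmap: [[0.0 for _ in row] for row in fmap]
mfa = frame_layers_with_mfa([[1.0, 2.0]], [zeros, zeros, zeros])
print(mfa)  # [[1.0, 2.0], [2.0, 4.0], [4.0, 8.0]]
```

Summation keeps the channel count of every block input fixed, whereas concatenating the same maps would enlarge the input of each successive block.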
4 Experimental Setup
4.1 Training the speaker embedding extractors
We apply the fixed-condition VoxSRC 2019 training restrictions [voxsrc] and only use the development part of the VoxCeleb2 dataset [vox2] with 5994 speakers as training data. A small subset of about 2% of the data is reserved as a validation set for hyperparameter optimization. It is a well known fact that neural networks benefit from augmenting the data to generate extra training samples. We generate a total of 6 extra samples for each utterance. The first set of augmentations follows the KALDI recipe [x_vector_wide] in combination with the publicly available MUSAN dataset (babble, noise) [musan] and the RIR dataset (reverb) provided in [rirs]. The remaining three augmentations are generated with the open-source SoX (tempo up, tempo down) and FFmpeg (alternating opus or aac compression) libraries.
The input features are 80-dimensional MFCCs from a 25 ms window with a 10 ms frame shift. The MFCCs are normalized through cepstral mean subtraction and no voice activity detection is applied. We use a random crop of 2 seconds on the normalized features. As a final augmentation step, we apply SpecAugment [specaugment] on the log mel spectrogram of the samples. The algorithm randomly masks 0 to 5 frames in the time domain and 0 to 10 channels in the frequency domain.
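
The masking step can be illustrated with a small sketch. It assumes a single contiguous mask per axis, a mask value of zero and no time warping; none of these details are specified in the text, and the function name is hypothetical.

```python
import random

def spec_augment(features, max_time_mask=5, max_freq_mask=10, rng=random):
    """Return a masked copy of features (T frames x F coefficients)."""
    T, F = len(features), len(features[0])
    out = [row[:] for row in features]
    # Draw the mask widths (0 up to the maximum) and their start positions.
    t_width = rng.randint(0, min(max_time_mask, T))
    t_start = rng.randint(0, T - t_width)
    f_width = rng.randint(0, min(max_freq_mask, F))
    f_start = rng.randint(0, F - f_width)
    for t in range(T):
        in_time_mask = t_start <= t < t_start + t_width
        for f in range(F):
            if in_time_mask or f_start <= f < f_start + f_width:
                out[t][f] = 0.0  # zero is an assumed mask value
    return out

# Toy usage: a 2-second crop at a 10 ms shift gives about 200 frames.
rng = random.Random(0)
crop = [[1.0] * 80 for _ in range(200)]
masked = spec_augment(crop, rng=rng)
```

Passing a seeded `random.Random` instance keeps the augmentation reproducible during debugging.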
All models are trained with a cyclical learning rate varying between 1e-8 and 1e-3 using the triangular2 policy as described in [clr] in conjunction with the Adam optimizer [adam]. The step_size parameter used is 65000. All systems are trained for 4 cycles using AAM softmax [arcface, aam_embeddings] with a margin of 0.2 and softmax prescaling of 30. To prevent overfitting, we apply a weight decay of 2e-5 on all weights in the model, except for the AAM softmax weights, which use 2e-4. The mini-batch size for training is 128.
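
For reference, the triangular2 policy from [clr] can be written as a small function of the iteration count. The sketch below plugs in the learning-rate bounds and step size quoted above; the function name is hypothetical and the code is not taken from the authors' training setup.

```python
import math

def triangular2_lr(iteration, base_lr=1e-8, max_lr=1e-3, step_size=65000):
    """Learning rate at a given iteration under the triangular2 policy."""
    # One cycle spans 2 * step_size iterations: a linear ramp up, then down.
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    # triangular2 halves the amplitude after every completed cycle.
    amplitude = (max_lr - base_lr) / (2 ** (cycle - 1))
    return base_lr + amplitude * max(0.0, 1.0 - x)

for it in (0, 65000, 130000, 195000):
    print(it, triangular2_lr(it))
```

The rate starts at the lower bound, peaks at 1e-3 after 65000 iterations, returns to the lower bound at 130000, and the second peak at 195000 reaches only about half of the first.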
We study two setups of the proposed ECAPA-TDNN architecture with either 512 or 1024 channels in the convolutional frame layers. The dimension of the bottleneck in the SE-Block and the attention module is set to 128. The scale dimension in the Res2Block [res2net] is set to 8. The number of nodes in the final FC layer is 192. The performance of this system will be compared to the baselines described in Section 2.1 and 2.2.
4.2 Speaker verification
Speaker embeddings are extracted from the final fully-connected layer for all systems. Trial scores are produced using the cosine distance between embeddings for all systems. Subsequently, all scores are normalized using adaptive s-norm [s_norm]. The imposter cohort consists of the speaker-wise averages of the length-normalized embeddings of all training utterances. The size of the imposter cohort was set to 1000 for the VoxCeleb test sets and to a more robust value of 50 for the cross-dataset VoxSRC 2019 evaluation.
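
The scoring pipeline can be sketched as below. The normalization follows the usual formulation of adaptive s-norm, in which the raw score is standardized with the mean and standard deviation of the highest-scoring cohort entries for the enrollment and test side and the two results are averaged. Treating the quoted cohort size as this top-N selection is an assumption, and all names are hypothetical.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_cohort_stats(embedding, cohort, top_n):
    """Mean and std of the top_n highest cohort scores (top_n >= 2)."""
    scores = sorted((cosine(embedding, c) for c in cohort), reverse=True)[:top_n]
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    return mean, std

def adaptive_s_norm(enroll, test, cohort, top_n):
    """Symmetric score normalization with an adaptively selected cohort."""
    raw = cosine(enroll, test)
    mean_e, std_e = top_cohort_stats(enroll, cohort, top_n)
    mean_t, std_t = top_cohort_stats(test, cohort, top_n)
    return 0.5 * ((raw - mean_e) / std_e + (raw - mean_t) / std_t)

# Toy usage with random 8-dimensional embeddings and a 20-entry cohort.
random.seed(1)
cohort = [[random.gauss(0, 1) for _ in range(8)] for _ in range(20)]
enroll = [random.gauss(0, 1) for _ in range(8)]
test = [random.gauss(0, 1) for _ in range(8)]
score = adaptive_s_norm(enroll, test, cohort, top_n=5)
```

In the setup described above, the cohort entries would be the speaker-wise averages of the length-normalized training embeddings.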
4.3 Evaluation protocol
The system is evaluated on the popular VoxCeleb1 test sets [vox1] and VoxSRC 2019 evaluation set [voxsrc]. Performance will be measured by providing the EERs and the minimum normalized detection costs MinDCF with $P_{target} = 10^{-2}$ and $C_{FA} = C_{Miss} = 1$. A concise ablation study is used to gain a deeper understanding of how each of the proposed improvements affects the performance.
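
As a reminder of what the EER measures, the sketch below sweeps a decision threshold over a set of trial scores and reports the operating point where the false acceptance and false rejection rates are closest. It is a simple quadratic-time illustration with hypothetical names, not the scoring tool used for the reported numbers.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: the error rate where false accepts equal false rejects."""
    best_gap, best_eer = None, None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # A trial is accepted when its score reaches the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer

print(equal_error_rate([0.9, 0.8], [0.1, 0.2]))  # 0.0
print(equal_error_rate([0.9, 0.4], [0.6, 0.1]))  # 0.5
```

With few trials the two error rates rarely cross exactly, which is why the sketch averages them at the closest point.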
5 Results
A performance overview of the baseline systems described in Section 2 and our proposed ECAPA-TDNN system is given in Table 1. We implement two setups with the number of filters in the convolutional layers either set to 512 or 1024. Our proposed architecture significantly outperforms all baselines while using fewer model parameters. The larger ECAPA-TDNN system gives an average relative improvement of 18.7% for EER and 12.5% for MinDCF over the best scoring baseline for each test set. We note that the performance of our baselines surpasses the numbers reported in [jhu_voxsrc, but_voxsrc] in most cases. We continue with an ablation study of the individual components introduced in Section 3. An overview of these results is given in Table 2.
| ID | System | EER (%) | MinDCF |
|---|---|---|---|
| A.1 | Attentive Statistics [att_stat] | 1.12 | 0.1316 |
| A.2 | Channel Att. w/o Context | 1.03 | 0.1288 |
| C.2 | No Res. Connections | 1.08 | 0.1310 |
| C.3 | No Sum Res. Connections | 1.08 | 0.1217 |
To measure the impact of our proposed attention module, we run an experiment A.1 that uses the attention module from [att_stat]. We also run a separate experiment A.2 that does not supply the context vector to the proposed attention. The channel- and context-dependent statistics pooling system improves the EER and minDCF metric by 9.8% and 3.2%, respectively. This confirms the benefits of applying different temporal attention to each channel. Addition of the context vector results in very small performance gains, with the system improving by about 1.9% relative on EER and 1.1% on minDCF. Nonetheless, this strengthens our belief that a TDNN-based architecture should try to exploit global context information.
This intuition is confirmed by experiment B.1, which clearly shows the importance of the SE-blocks described in Section 3.2. Incorporating the SE-modules in the Res2Blocks results in relative improvements of 20.5% in EER and 11.9% on the minDCF metric. This indicates that the limited temporal context of the frame-level features is insufficient and should be complemented with global utterance-based information. In experiment B.2 we replaced the multi-scale features of the Res2Blocks with the standard central dilated 1D convolution of the ResNet counterpart. Aside from a substantial 30% relative reduction in model parameters, the multi-scale Res2Net approach also leads to a relative improvement of 5.6% in EER and 3.2% on the minDCF value.
In experiment C.1, we only use the output of the final SE-Res2Block instead of aggregating the information of all SE-Res2Blocks. Aggregation of the outputs leads to relative improvements of 8.2% on EER and 2.8% on the minDCF value. Removing all residual connections (experiment C.2) shows a similar rate of degradation. We observe a slight performance improvement when incorporating a residual connection from the previous SE-Res2Block. Replacing a default ResNet skip connection in the SE-Res2Blocks (experiment C.3) with the sum of the outputs of all previous SE-Res2Blocks improves the EER by 6.5%, while slightly degrading the minDCF score. However, further experience on challenging datasets such as [sdsv_plan] convinced us to incorporate summed residuals in the final ECAPA-TDNN architecture.
6 Conclusion
In this paper we presented ECAPA-TDNN, a novel TDNN-based speaker embedding extractor for speaker verification. We built further upon the original x-vector architecture and put more Emphasis on Channel Attention, Propagation and Aggregation. The incorporation of Squeeze-Excitation blocks, multi-scale Res2Net features, extra skip connections and channel-dependent attentive statistics pooling led to significant relative improvements of 19% in EER on average over strong baseline systems on the VoxCeleb and VoxSRC 2019 evaluation sets.