Improving Multi-Scale Aggregation Using Feature Pyramid Module for Robust Speaker Verification of Variable-Duration Utterances

04/07/2020 ∙ by Youngmoon Jung, et al. ∙ KAIST Department of Mathematical Sciences

Currently, the most widely used approach for speaker verification is deep speaker embedding learning. In this approach, we obtain a speaker embedding vector by pooling single-scale features that are extracted from the last layer of a speaker feature extractor. Multi-scale aggregation (MSA), which utilizes multi-scale features from different layers of the feature extractor, has recently been introduced and shows superior performance for variable-duration utterances. To increase robustness in dealing with utterances of arbitrary duration, this paper improves the MSA by using a feature pyramid module. The module enhances speaker-discriminative information of features from multiple layers via a top-down pathway and lateral connections. We extract speaker embeddings using the enhanced features that contain rich speaker information with different time scales. Experiments on the VoxCeleb dataset show that the proposed module improves previous MSA methods with a smaller number of parameters. It also achieves better performance than state-of-the-art approaches for both short and long utterances.




1 Introduction

Speaker verification (SV) is the task of verifying a speaker’s claimed identity based on his or her speech. Depending on the constraint of the transcripts used for enrollment and verification, SV systems fall into two categories: text-dependent SV (TD-SV) and text-independent SV (TI-SV). TD-SV requires the content of input speech to be fixed, while TI-SV operates on unconstrained speech. Before the deep learning era, the combination of i-vector [Dehak2011] and probabilistic linear discriminant analysis (PLDA) [Ioffe2006] was the dominant approach for SV [Kenny2010, Garcia2011]. Although this approach performs well with long utterances, it suffers from performance degradation with short utterances.

Currently, the most widely used SV approach is deep speaker embedding learning, which extracts speaker embeddings directly from a deep-learning-based speaker-discriminative network [Li2017, Zhang2018, Nagrani2017, Chung2018, Cai2018, Snyder2018, Okabe2018, Tang2019, Gao2019, Seo2019, Hajavi2019, Jung2019, Jung2019asru, Kye2020]. This approach outperforms the i-vector/PLDA approach, especially on short utterances. In deep speaker embedding learning, convolutional neural networks (CNNs) such as the time-delay neural network (TDNN) [Snyder2018, Okabe2018, Tang2019] or ResNet [Li2017, Chung2018, Cai2018, Jung2019, Gao2019, Seo2019, Jung2019asru, Hajavi2019, Kye2020] are mostly used as the speaker-discriminative network. Specifically, the network is trained to classify training speakers [Cai2018, Snyder2018, Okabe2018, Jung2019, Gao2019, Seo2019, Jung2019asru, Tang2019, Hajavi2019, Kye2020] or to separate same-speaker and different-speaker utterance pairs [Li2017, Zhang2018]. After training, an utterance-level speaker embedding called deep speaker embedding is obtained by aggregating speaker features extracted from the network. Most of these approaches use a pooling layer for feature aggregation, mapping variable-length speaker features to a fixed-dimensional embedding. There are several pooling methods, such as global average pooling [Li2017, Nagrani2017, Seo2019], statistics pooling [Snyder2018, Gao2019], attentive statistics pooling [Okabe2018], learnable dictionary encoding [Cai2018], and spatial pyramid encoding [Jung2019].
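As a minimal sketch of how a pooling layer maps variable-length frame-level features to a fixed-dimensional vector, the plain-Python functions below implement global average pooling and statistics pooling (mean concatenated with standard deviation). The function names, the use of the population standard deviation, and the toy inputs are illustrative assumptions rather than the cited implementations.

```python
import math

def global_average_pooling(frames):
    """Average frame-level features over time: T x D -> D."""
    num_frames = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / num_frames for d in range(dim)]

def statistics_pooling(frames):
    """Concatenate per-dimension mean and standard deviation: T x D -> 2D."""
    mean = global_average_pooling(frames)
    num_frames = len(frames)
    # Population standard deviation (an assumption made for this sketch).
    std = [
        math.sqrt(sum((f[d] - mean[d]) ** 2 for f in frames) / num_frames)
        for d in range(len(mean))
    ]
    return mean + std

# Two toy "utterances" with different numbers of frames (T = 2 and T = 4).
short_utt = [[1.0, 2.0], [3.0, 4.0]]
long_utt = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

short_vec = global_average_pooling(short_utt)
long_vec = global_average_pooling(long_utt)
```

Both utterances yield a vector of the same length regardless of duration, which is the property the subsequent embedding layers rely on.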

Meanwhile, all these pooling layers use only single-scale features from the last layer of the feature extractor. To aggregate speaker information from different time scales, multi-scale aggregation (MSA) methods have been proposed recently [Gao2019, Seo2019, Tang2019, Hajavi2019]. The MSA aggregates multi-scale features from different layers of a feature extractor to generate a speaker embedding. In [Gao2019, Seo2019, Tang2019, Hajavi2019], the authors show the effectiveness of the MSA in dealing with short or long utterances.

In this work, we propose a new MSA method using a feature pyramid module (FPM). A top-down architecture with lateral connections is used to generate feature maps with rich speaker information at all selected layers. Then, we exploit the rich multi-scale features of a ResNet-based feature extractor to extract speaker embeddings. In addition, we present a novel interpretation of the MSA using the theory of [Veit2016]. We evaluate our method using various pooling layers for TI-SV on the VoxCeleb dataset [Nagrani2017, Chung2018]. Experimental results show that the performance of MSA is further improved by the FPM with a smaller number of parameters. Besides, the effectiveness of our method is verified on variable-duration test utterances.

2 Relation to prior works

Gao et al. [Gao2019] proposed multi-stage aggregation, where ResNet is used as a feature extractor. The output feature maps of stages 2, 3, and 4 (see Table 1) are concatenated along the channel axis. To make feature maps match in time-frequency resolution, the feature map from stage 2 is downsampled by convolution with stride 2, and the feature map from stage 4 is upsampled by bilinear interpolation or transposed convolution. After concatenation, speaker embeddings are generated by statistics pooling.

Seo et al. [Seo2019] also utilize feature maps from different stages of ResNet to fuse information at different resolutions. Unlike the method of Gao et al., global average pooling (GAP) is applied to the feature maps, and the pooled feature vectors are concatenated into a long vector. The concatenated vector is fed into fully-connected layers to generate the speaker embedding. Hajavi et al. [Hajavi2019] proposed an approach similar to that of Seo et al. Their proposed model, UtterIdNet, shows significant improvement in speaker recognition with short utterances.

Figure 1: Two types of multi-scale aggregation (MSA). (a) Multi-scale feature aggregation (MSFA). (b) Multi-scale embedding aggregation (MSEA). In this paper, acoustic features of consecutive frames are indicated by grey rectangles and feature maps are indicated by blue outlines. “2up (or down)” is an upsampling (or downsampling) by a factor of 2.

To sum up, the method of Gao et al. fuses feature maps from different stages to form a single feature map, and applies pooling to the fused feature map to generate the speaker embedding. On the other hand, the method of Seo et al. applies pooling to each feature map separately, and aggregates the resulting vectors to obtain the speaker embedding. In this paper, we take these two methods as baselines. For clarity, we denote the first method as multi-scale feature aggregation (MSFA) and the second one as multi-scale embedding aggregation (MSEA). These approaches are illustrated in Fig. 1. We will show that the proposed feature pyramid module improves both baselines.
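The difference between the two baselines can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not either author's implementation: 2×2 averaging stands in for the stride-2 convolution, nearest-neighbour repetition stands in for bilinear or transposed-convolution upsampling, GAP is used for both variants, and the channel counts, map sizes, and random FC weights are hypothetical.

```python
import numpy as np

def downsample2(x):
    """Halve frequency and time by 2x2 averaging (stand-in for a stride-2 convolution)."""
    c, f, t = x.shape
    return x.reshape(c, f // 2, 2, t // 2, 2).mean(axis=(2, 4))

def upsample2(x):
    """Double frequency and time by nearest-neighbour repetition (stand-in for bilinear upsampling)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def gap(x):
    """Global average pooling over frequency and time: (C, F, T) -> (C,)."""
    return x.mean(axis=(1, 2))

def msfa(stage2, stage3, stage4):
    """MSFA: match resolutions, concatenate along channels, then pool once."""
    fused = np.concatenate([downsample2(stage2), stage3, upsample2(stage4)], axis=0)
    return gap(fused)

def msea(stage2, stage3, stage4, fc_weight):
    """MSEA: pool each stage separately, concatenate the vectors, then apply an FC layer."""
    pooled = np.concatenate([gap(stage2), gap(stage3), gap(stage4)])
    return fc_weight @ pooled

# Hypothetical feature maps of shape (channels, frequency, time).
rng = np.random.default_rng(0)
stage2 = rng.standard_normal((4, 8, 12))
stage3 = rng.standard_normal((8, 4, 6))
stage4 = rng.standard_normal((16, 2, 3))
fc_weight = rng.standard_normal((5, 4 + 8 + 16))

msfa_vector = msfa(stage2, stage3, stage4)
msea_embedding = msea(stage2, stage3, stage4, fc_weight)
```

MSFA needs resampling so that the maps can be stacked, whereas MSEA never has to match resolutions because each map is reduced to a vector first.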

Meanwhile, these studies show the effectiveness of MSA on only one type of pooling operation, i.e., statistics pooling and GAP, respectively. Unlike these studies, we evaluate our approach using three popular pooling methods: GAP, self-attentive pooling (SAP) [Cai2018], and learnable dictionary encoding (LDE) [Cai2018]. Experimental results show that the proposed method achieves good performance for the three pooling layers.

3 Proposed approach

In this section, we introduce the proposed multi-scale aggregation using the feature pyramid module motivated by [Lin2017]. In this work, we use ResNet as our feature extractor just as in the baselines [Gao2019, Seo2019]. The architecture is described in Table 1.

3.1 Multi-scale aggregation

Deep CNNs such as ResNet are usually bottom-up, feed-forward architectures, which use repeated convolutional and subsampling layers to learn sophisticated feature representations. Deep CNNs compute a feature hierarchy layer by layer, and with subsampling layers, the feature hierarchy is inherently multi-scale and pyramidal in shape. This in-network feature hierarchy produces feature maps of different time-frequency scales and resolutions, but introduces large semantic gaps caused by different depths. In SV, as the network is trained to discriminate speakers, features of higher layers contain more speaker-discriminative information (higher-level speaker information) but have lower resolutions due to the repeated subsampling.

Figure 2: How to use feature maps for deep speaker embedding. (a) Using only single-scale feature maps. (b) Using multi-scale feature maps without feature pyramid module (FPM). (c) Using multi-scale feature maps with FPM. In this paper, thicker outlines denote feature maps with more speaker-discriminative information. Concatenation and element-wise addition are each marked by a dedicated symbol in the figure.
Layer name | Stage
conv1 | -
conv2_x | 1
conv3_x | 2
conv4_x | 3
conv5_x | 4
Table 1: The architecture of the feature extractor based on ResNet-34 [He2016], showing the stage to which each layer group belongs.

Deep CNNs are robust to variance in scale and thus facilitate extraction of speaker embeddings from feature maps computed on a single input scale (Fig. 2(a)). But even with this robustness, using multi-scale features from multiple layers (Fig. 2(b)) improves the performance, as we discussed in Section 2.

According to previous works, the advantage of MSA is that it extracts speaker embeddings from multiple temporal scales, improving speaker recognition performance [Gao2019, Tang2019]. It is also useful for short-segment speaker recognition through an efficiently increased use of information in short utterances [Hajavi2019]. Besides, it passes error signals back to earlier layers, which helps alleviate the vanishing gradient problem [Tang2019, Seo2019].

Two types of MSA are described in Fig. 1. For MSFA (Fig. 1(a)), we use the same method as in [Gao2019]. For MSEA (Fig. 1(b)), convolutions are added before pooling. The embeddings of different stages are concatenated and the output of the following fully-connected (FC) layer is used as the speaker embedding.

3.2 Feature pyramid module

In deep CNNs, feature maps of lower layers have less speaker-discriminative information compared to those of higher layers. Intuitively, if we can enhance the speaker discriminability of the low-layer feature maps, the performance of MSA will be improved. Motivated by this intuition, we aim to create multi-scale features that have high-level speaker information at all layers. The feature pyramid module (FPM) is used to achieve this goal. The MSA with FPM is illustrated in Fig. 2(c). The dotted box indicates the building block of FPM that consists of the lateral connection and the top-down pathway, merged by addition.

The detailed architecture is shown in Fig. 3, involving a bottom-up pathway, a top-down pathway, and lateral connections. The bottom-up pathway is the feed-forward computation of the backbone ResNet. It computes a feature hierarchy consisting of feature maps at multiple scales with a scaling step of 2. In each ResNet stage, there are many layers producing feature maps of the same spatial size (see Table 1). We choose the output of the last layer of each stage as the output of that stage, since the deepest layer contains the strongest features. We denote the output of conv i_x as C_i for i = 2, 3, 4, 5.

Figure 3: Proposed MSA with FPM. The black dotted box indicates the FPM. Only MSEA is illustrated here, but the FPM can be applied to both MSFA and MSEA.

The black dotted box in Fig. 3 indicates the FPM. The procedure of the proposed approach is as follows:
(1) Using bilinear interpolation or transposed convolution, we upsample the lower-resolution feature maps from higher stages by a factor of 2. That is, the top-down pathway hallucinates higher-resolution features by upsampling feature maps from higher stages, which have lower resolution but more speaker-discriminative information.
(2) The upsampled feature maps are then enhanced with features from the bottom-up pathway via lateral connections. Concretely, the top-down feature map is merged with the corresponding bottom-up feature map by element-wise addition. Before merging, a convolution in the lateral connection reduces the channel dimension of the bottom-up feature map to 32, which is the channel dimension of the lowest stage. These lateral connections play the same role as the skip connections in U-Net [Ronneberger2015]: they directly transfer the high-resolution information from the bottom-up pathway to the top-down pathway.
(3) This process is repeated from the top stage to the bottom stage. At the beginning, a convolutional layer reduces the channel dimension of the top-stage feature map C_5 to 32.
(4) Finally, an additional convolutional layer is appended to each merged feature map to reduce the aliasing effect of upsampling. This final set of feature maps is called {P_2, P_3, P_4, P_5}, corresponding to the bottom-up feature maps {C_2, C_3, C_4, C_5}, which are respectively of the same time-frequency resolution.
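The top-down pathway and lateral connections in steps (1) to (3) can be sketched in NumPy as follows. This is an illustrative sketch under several assumptions, not the authors' implementation: the lateral convolution is taken to be pointwise (1×1), nearest-neighbour repetition stands in for bilinear or transposed-convolution upsampling, the final anti-aliasing convolution of step (4) is omitted, the lateral convolution is also applied at the lowest stage, and the channel widths other than 32 as well as the map sizes are hypothetical.

```python
import numpy as np

def conv1x1(x, weight):
    """Pointwise convolution: mix channels at every time-frequency bin.

    x: (C_in, F, T), weight: (C_out, C_in) -> (C_out, F, T)
    """
    return np.einsum("oc,cft->oft", weight, x)

def upsample2(x):
    """Nearest-neighbour upsampling by a factor of 2 along frequency and time."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def feature_pyramid(bottom_up, lateral_weights):
    """Merge bottom-up maps through a top-down pathway with lateral connections.

    bottom_up: feature maps ordered from lowest to highest stage, each with
               half the resolution of the previous one.
    lateral_weights: one (out_channels, in_channels) matrix per stage.
    Returns the merged maps in the same low-to-high order.
    """
    # Top stage: only the channel reduction, there is nothing to merge yet.
    merged = [conv1x1(bottom_up[-1], lateral_weights[-1])]
    # Walk down the pyramid: upsample the map above, add the lateral map.
    for feat, weight in zip(reversed(bottom_up[:-1]), reversed(lateral_weights[:-1])):
        top_down = upsample2(merged[0])
        lateral = conv1x1(feat, weight)
        merged.insert(0, top_down + lateral)
    return merged

rng = np.random.default_rng(0)
channels = [32, 64, 128, 256]                  # assumed channel widths per stage
sizes = [(16, 24), (8, 12), (4, 6), (2, 3)]    # hypothetical (frequency, time) sizes
bottom_up = [rng.standard_normal((c, f, t)) for c, (f, t) in zip(channels, sizes)]
lateral_weights = [0.01 * rng.standard_normal((32, c)) for c in channels]

pyramid = feature_pyramid(bottom_up, lateral_weights)
```

Every output map keeps the resolution of its bottom-up counterpart but has 32 channels, so any of the pooling layers discussed earlier can be applied to each level in the same way.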

The FPM combines lower-resolution features with higher-level speaker information and higher-resolution features with lower-level speaker information. The result is a feature pyramid that has rich speaker information at all stages. In other words, the FPM plays the role of feature enhancement for MSA. Furthermore, the FPM reduces the total number of parameters in the network because the number of channels at stages 2, 3, and 4 is reduced by the convolution in the lateral connections.

A recent study shows that the collection of paths having different lengths in ResNet exhibits ensemble-like behavior [Veit2016]. Similarly, we can interpret the MSA method as using an ensemble of multi-scale features from different paths. As the variable-length feature maps are used to extract speaker embeddings, we expect that speaker verification performance will be improved for variable-duration test utterances, especially with the FPM. In our experiments, we show that the MSA with FPM provides improved performance for both short and long utterances.

4 Experimental setup

4.1 Datasets

We use the VoxCeleb1 [Nagrani2017] and VoxCeleb2 [Chung2018] datasets. Both are large-scale text-independent speaker recognition datasets, containing 1,251 and 5,994 speakers, respectively. The utterances are extracted from YouTube videos where the speech segments are corrupted with real-world noise. Both datasets are split into development (dev) and test sets. The dev set of VoxCeleb2 is used for training and the test set of VoxCeleb1 is used for testing. There are no overlapping speakers between them.

When evaluating the performance on short utterances, our test recordings are cut into four different durations: 1 s, 2 s, 3 s, and 5 s, where the duration is determined using energy-based voice activity detection (VAD). If the length of the utterance is less than the given duration, the entire utterance is used.
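The truncation rule can be sketched as below. This is a minimal illustration that assumes the input already contains only the frames kept by VAD, that the cut is taken from the start of the utterance, and that the frame shift is 10 ms; none of these details is stated in the text, and the function name is hypothetical.

```python
def truncate_utterance(speech_frames, duration_s, frame_shift_s=0.01):
    """Keep the first `duration_s` seconds of speech frames.

    Utterances shorter than the requested duration are returned whole,
    because slicing past the end of a list simply returns the full list.
    """
    max_frames = round(duration_s / frame_shift_s)
    return speech_frames[:max_frames]

# A toy utterance with 250 speech frames (2.5 s at the assumed 10 ms shift).
utterance = list(range(250))
lengths = {d: len(truncate_utterance(utterance, d)) for d in (1, 2, 3, 5)}
```

For this 2.5 s example the 1 s and 2 s conditions are truncated, while the 3 s and 5 s conditions fall back to the entire utterance.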

4.2 Implementation details

The input features are 64-dimensional log Mel-filterbank features with a frame length of 25 ms, which are mean-normalized over a sliding window of up to 3 s. Neither VAD nor data augmentation is used for training. In training, the ResNet input covers 3 s of speech (see the input dimensions in Fig. 3). In testing, the entire utterance is evaluated at once. The 128-dimensional speaker embeddings are extracted from the network. We report the equal error rate (EER) in % and the minimum detection cost function (minDCF) with the same settings as in [Chung2018]. Verification trials are scored using cosine distance.
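To make the evaluation protocol concrete, the following plain-Python sketch scores a trial by cosine similarity and estimates the EER by sweeping a threshold over the observed scores. The function names and toy scores are illustrative assumptions; a production scorer would typically interpolate the error curves rather than pick the closest threshold.

```python
import math

def cosine_score(a, b):
    """Cosine similarity between an enrollment and a test embedding."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def equal_error_rate(target_scores, nontarget_scores):
    """Return the EER as a fraction, using the threshold where FAR and FRR are closest."""
    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection: same-speaker trials scored below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance: different-speaker trials scored at or above it.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy trial scores: one same-speaker trial scores lower than one impostor trial.
target_scores = [0.9, 0.8, 0.7, 0.3]
nontarget_scores = [0.6, 0.4, 0.2, 0.1]
eer = equal_error_rate(target_scores, nontarget_scores)
```

Multiplying the returned fraction by 100 gives the percentage form used in the result tables.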

The models are implemented using PyTorch and optimized by stochastic gradient descent with momentum 0.9. The mini-batch size is 64, and the weight decay parameter is 0.0001. We use the same learning rate (LR) schedule as in [Cai2018], with the initial LR of 0.1. For LDE, we use the same method as in [Jung2019]. Before the LDE layer, a convolution is applied to change the number of channels to 64. After the LDE layer, L2-normalization and an FC layer are added to reduce the dimension to 128. The number of codewords is 64. When applying MSA, both the LDE and FC layers are shared by all stages. On the other hand, the parameters of the SAP layers are not shared.
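For reference, one update of SGD with momentum under the stated hyperparameters can be sketched in plain Python. The sketch assumes the common formulation in which weight decay is added to the gradient before the momentum buffer is updated; the function name and scalar toy values are illustrative, and the LR schedule is not modeled.

```python
def sgd_momentum_step(params, grads, velocity, lr=0.1, momentum=0.9, weight_decay=1e-4):
    """Apply one SGD-with-momentum update to a list of scalar parameters."""
    new_params, new_velocity = [], []
    for p, g, v in zip(params, grads, velocity):
        g = g + weight_decay * p      # L2 weight decay folded into the gradient
        v = momentum * v + g          # update the momentum buffer
        new_params.append(p - lr * v) # take the step
        new_velocity.append(v)
    return new_params, new_velocity

# One step from a single parameter with a zero-initialized momentum buffer.
params, velocity = sgd_momentum_step([1.0], [0.5], [0.0])
```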

Systems | minDCF | EER (%) | # Parameters
Single | 0.423 | 4.55 | 5.77M
MSFA w/o FPM | 0.437 | 4.30 | 6.20M
MSFA w/ FPM-B | 0.398 | 4.22 | 5.82M
MSFA w/ FPM-TC | 0.408 | 4.01 | 5.85M
MSEA w/o FPM | 0.416 | 4.22 | 5.90M
MSEA w/ FPM-B | 0.403 | 4.20 | 5.83M
MSEA w/ FPM-TC | 0.411 | 4.01 | 5.85M
Table 2: Comparison of Single, MSFA, and MSEA. The softmax loss and GAP are used for all systems. In this paper, Single denotes using only single-scale features (Fig. 2(a)), FPM-B and FPM-TC are the FPMs with bilinear upsampling and transposed convolution upsampling, respectively. The results of the proposed approaches are shown in bold.

5 Results

5.1 Performance in different MSA methods

In Tables 2 and 3, the models are trained on the VoxCeleb1 dev set. Table 2 compares systems with (w/) and without (w/o) the FPM for the two MSA methods. MSFA w/o FPM and MSEA w/o FPM correspond to the approaches in Gao et al. [Gao2019] and Seo et al. [Seo2019], respectively. In both MSA approaches, we aggregate feature maps from 3 different stages: the bottom-up outputs C_3, C_4, and C_5 for w/o FPM and the FPM outputs P_3, P_4, and P_5 for w/ FPM. Note that the output of stage i is C_{i+1} for i = 1, 2, 3, 4 because stage 1 corresponds to conv2, as shown in Table 1. We can see that both MSAs yield better performance than Single, but with more parameters.

As we discussed in Section 2, the MSFA uses upsampling so that the three feature maps have the same spatial size. In this work, bilinear upsampling is applied to the stage 4 feature map, since transposed convolution adds parameters without improving the performance of MSFA. On the other hand, for the FPM, both bilinear and transposed convolution upsampling are used in the top-down pathway, which correspond to FPM-B and FPM-TC, respectively. Among the three MSFA systems, w/o FPM has the worst performance with the largest number of parameters. By adding the FPM, we obtain better performance with fewer parameters. The number of parameters is reduced because the channel dimension of feature maps at selected stages is reduced, as we discussed in Section 3.2. FPM-TC has slightly more parameters than FPM-B because the transposed convolutional layer has learnable parameters. FPM-B achieves a relative improvement of 8.92% in minDCF over w/o FPM. FPM-TC shows a relative improvement of 6.74% in EER over w/o FPM.
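The two relative improvements quoted for MSFA follow directly from the MSFA rows of Table 2 (minimum detection cost 0.437 to 0.398, EER 4.30% to 4.01%). A short check, with a hypothetical helper name:

```python
def relative_improvement(baseline, proposed):
    """Relative reduction of an error metric, in percent (lower metric is better)."""
    return 100.0 * (baseline - proposed) / baseline

min_dcf_gain = relative_improvement(0.437, 0.398)  # FPM-B vs. MSFA w/o FPM
eer_gain = relative_improvement(4.30, 4.01)        # FPM-TC vs. MSFA w/o FPM

print(f"{min_dcf_gain:.2f}% {eer_gain:.2f}%")  # prints "8.92% 6.74%"
```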

For MSEA, FPM-B provides the best minDCF (0.403), and FPM-TC obtains the best EER (4.01%). The proposed FPM improves the performance of both MSFA and MSEA, achieving similar performance with a similar number of parameters. Therefore, we only use MSEA in the following experiments. Moreover, the MSEA is more flexible in the number of stages it can use, unlike the MSFA, which is developed to use only 3 stages.

Systems | GAP minDCF | GAP EER (%) | SAP minDCF | SAP EER (%) | LDE minDCF | LDE EER (%)
Single | 0.423 | 4.55 | 0.410 | 4.38 | 0.421 | 4.44
w/o FPM | 0.416 | 4.22 | 0.416 | 4.24 | 0.435 | 4.09
w/ FPM-B | 0.403 | 4.20 | 0.393 | 4.13 | 0.402 | 3.84
w/ FPM-TC | 0.411 | 4.01 | 0.408 | 4.09 | 0.368 | 3.63
Table 3: Performance comparison with and without the FPM for three pooling strategies: GAP, SAP [Cai2018], and LDE [Cai2018]. MSEA with softmax loss is applied for all the systems.

Systems | Train set | minDCF | EER (%)
i-vectors+PLDA [Nagrani2017] | Vox1 | 0.73 | 8.8
VGG-M (C+GAP) [Nagrani2017] | Vox1 | 0.71 | 7.8
ResNet-34 (ASM+SAP) [Cai2018] | Vox1 | 0.622 | 4.40
ResNet-34 (ASM+SPE) [Jung2019] | Vox1 | 0.402 | 4.03
TDNN (SM+ASP) [Okabe2018] | Vox1* | 0.406 | 3.85
ResNet-34 (ASM+GAP) w/ FPM-TC | Vox1 | 0.393 | 3.52
ResNet-34 (ASM+LDE) w/ FPM-TC | Vox1 | 0.350 | 3.22
Thin ResNet-34 (SM+GhostVLAD) [Xie2019] | Vox2 | NR | 3.22
ResNet-50 (EAMS+GAP) [Yu2019] | Vox2 | 0.278 | 2.94
ResNet-34 (ASM+SPE) [Jung2019] | Vox2 | 0.245 | 2.61
DDB+Gate (SM+SP) [Jiang2019] | Vox1&2 | 0.268 | 2.31
ResNet-34 (ASM+GAP) w/ FPM-TC | Vox2 | 0.228 | 2.17
ResNet-34 (ASM+LDE) w/ FPM-TC | Vox2 | 0.205 | 1.98
Table 4: Comparison with state-of-the-art systems. In the parentheses, the first and second terms are the loss function and the pooling layer used. For the loss function, C, ASM, SM, and EAMS denote contrastive, A-softmax [Liu2017], softmax, and EAM softmax [Yu2019] loss, respectively. For the pooling layer, SPE, ASP, and SP denote spatial pyramid encoding [Jung2019], attentive statistics pooling [Okabe2018], and statistics pooling [Snyder2018], respectively. * means that data augmentation is used and NR is “not reported”.

Systems | 1 s | 2 s | 3 s | 5 s | full
TDNN (ASM+SP) | 10.93 | 6.12 | 4.50 | 3.83 | 3.35
TDNN (ASM+ASP) | 10.21 | 5.62 | 4.47 | 3.62 | 3.29
ResNet-34 (ASM+SPE) [Jung2019] | 12.13 | 5.60 | 4.10 | 3.45 | 3.07
MSEA w/o FPM (ASM+GAP) | 6.65 | 4.05 | 3.23 | 2.74 | 2.56
Proposed 1 | 6.25 | 3.95 | 2.99 | 2.54 | 2.35
Proposed 2 | 6.54 | 3.76 | 2.80 | 2.47 | 2.23
Table 5: EER (%) of systems on the 5 s enrollment set.

Systems | 1 s | 2 s | 3 s | 5 s | full
TDNN (ASM+SP) | 9.92 | 5.50 | 4.06 | 3.47 | 3.04
TDNN (ASM+ASP) | 9.63 | 5.02 | 3.87 | 3.36 | 2.92
ResNet-34 (ASM+SPE) [Jung2019] | 11.12 | 4.93 | 3.55 | 2.98 | 2.61
MSEA w/o FPM (ASM+GAP) | 6.13 | 3.68 | 2.84 | 2.50 | 2.31
Proposed 1 | 5.85 | 3.64 | 2.83 | 2.41 | 2.17
Proposed 2 | 5.92 | 3.38 | 2.54 | 2.17 | 1.98
Table 6: EER (%) of systems on the full-length enrollment set.

5.2 Performance in different pooling methods

In Table 3, we compare the performance among the three pooling strategies. For the SAP layer, FPM-B provides the best minDCF (0.393), and FPM-TC obtains the best EER (4.09%). For the LDE layer, we only aggregate features from stages 3 and 4 because using features from stage 2 does not improve the performance. FPM-TC achieves the best performance on both minDCF (0.368) and EER (3.63%). For all three pooling strategies, the FPM improves the performance of the MSA method. We can see that the FPM is most effective for the LDE layer.

5.3 Comparison with recent methods

In Table 4, we compare the two proposed systems with recently reported SV systems. Both proposed systems apply the MSEA with FPM-TC, and the combination of A-softmax and ring loss with the same settings as in [Jung2019]. When we use the VoxCeleb2 dataset for training, we extract 256-dimensional speaker embeddings. In the first system (Proposed 1), we aggregate features from all the stages using GAP. In the second system (Proposed 2), we aggregate features from stages 2, 3, and 4 using LDE. The proposed systems outperform other baseline systems in terms of both minDCF and EER. Proposed 2 achieves the best performance among all the systems. Using the VoxCeleb1 dataset for training, Proposed 2 obtains a minDCF of 0.350 and an EER of 3.22%. Using the VoxCeleb2 dataset for training, Proposed 2 achieves a minDCF of 0.205 and an EER of 1.98%.

In Tables 5 and 6, we evaluate the performance of several systems in 5 s and full-length enrollment conditions, respectively. For each condition, we evaluate the performance on five different test durations: 1 s, 2 s, 3 s, 5 s, and original full-length (full). The average duration of ‘full’ is 6.3 s. All the results are obtained by our own implementation, and the VoxCeleb2 dataset is used for training. For TDNN, we follow the same architecture as in [Okabe2018]. For all the baseline systems, we use the same acoustic features and optimization as our proposed systems.

Among baselines, the ResNet-based system using SPE (the advanced version of LDE) outperforms TDNN-based systems except for the 1 s test condition. Similarly, Proposed 2 achieves better results than Proposed 1 for utterances longer than 1 s. From these, we can see that LDE-based pooling shows a greater performance degradation on very short utterances than the other pooling strategies. MSEA w/o FPM is the approach in the 5th row of Table 2. We find that applying the FPM improves the performance of MSA for variable-duration test utterances. When we compare proposed systems with other state-of-the-art baseline systems including TDNN (x-vector) and ResNet-based systems, we also observe that the MSA with FPM achieves higher performance for both short and long utterances.

6 Conclusions

In this study, we proposed a novel MSA method for TI-SV using an FPM. We applied the FPM to two types of MSA methods, MSFA and MSEA. It enhances speaker-discriminative information on multi-scale features at multiple layers of a speaker feature extractor. On the VoxCeleb dataset, experimental results showed that the FPM improves both MSA methods with fewer parameters, and works well with three popular pooling layers. The proposed systems obtained better results for both short and long utterances than the state-of-the-art baseline systems.

7 Acknowledgements

This material is based upon work supported by the Ministry of Trade, Industry and Energy (MOTIE, Korea) under Industrial Technology Innovation Program (No.10063424, Development of distant speech recognition and multi-task dialog processing technologies for in-door conversational robots).