Author's repository for reproducing RawNet2 with PyTorch and RawNet with PyTorch and Keras.
Recent advances in deep learning have facilitated the design of speaker verification systems that directly input raw waveforms. For example, RawNet extracts speaker embeddings from raw waveforms, which simplifies the process pipeline and demonstrates competitive performance. In this study, we improve RawNet by scaling feature maps using various methods. The proposed mechanism utilizes a scale vector that adopts a sigmoid non-linear function. It refers to a vector with dimensionality equal to the number of filters in a given feature map. Using a scale vector, we propose to scale the feature map multiplicatively, additively, or both. In addition, we investigate replacing the first convolution layer with the sinc-convolution layer of SincNet. Experiments performed on the VoxCeleb1 evaluation dataset demonstrate the effectiveness of the proposed methods, and the best performing system reduces the equal error rate by half compared to the original RawNet. Expanded evaluation results obtained using the VoxCeleb1-E and VoxCeleb1-H protocols marginally outperform existing state-of-the-art systems.
With the recent advances in deep learning, many speaker verification studies have replaced the acoustic feature extraction process with deep neural networks (DNNs) [1, 2, 3, 4]. In the preliminary stage of utilizing DNNs for speaker verification, acoustic features, e.g., Mel-frequency cepstral coefficients and Mel-filterbank energy features, were utilized as input to DNNs [5, 6, 7, 8]. In contrast, many recent studies have also used less processed features, e.g., spectrograms and raw waveforms [9, 10, 11, 12], hypothesizing that the usage of such less processed features as input allows data-driven approaches with DNNs to yield better discriminative representations compared to using knowledge-based acoustic features. Following this trend, many recent systems, e.g., RawNet, have demonstrated competitive results using raw waveforms as input for speaker verification.
The attention mechanism was initially designed to emphasize the more important elements in sequence-to-sequence processing [13, 14, 15, 16, 17] and has since been adopted in several tasks, including speaker verification. Among various methods, self-attentive pooling has been applied to speaker verification to aggregate frame-level representations into a single utterance-level representation. Here, the term "self" refers to the property that the attention mechanism uses no external data, e.g., phoneme labels. Compared with conventional global average pooling, self-attentive pooling assigns a weight to each frame and conducts a weighted summation. Recent attention mechanisms, e.g., multi-head self-attentive pooling, have also been investigated and have demonstrated further performance improvements. However, work on attention for speaker verification has focused more on attentive pooling than on applying such mechanisms to intermediate feature maps, as has been done in image domain tasks [7, 16, 17].
In this study, we propose to scale the filter axis of feature maps using a sigmoid-based mechanism, which we refer to as feature map scaling (FMS). The FMS uses a scale vector whose dimension is identical to the number of filters, with each value between 0 and 1, similar to the attention map of an attention mechanism, except that a sigmoid activation function is employed rather than a softmax function. The underlying hypothesis of using sigmoid activation functions to independently perform scaling is that, unlike in some other tasks, an attention mechanism that exclusively selects only a few filters may remove an excessive amount of discriminative information. In addition, in light of the recent successes of attentive pooling mechanisms in speaker verification, we investigate replacing the gated recurrent unit (GRU) layer of RawNet, which aggregates frame-level representations, with self-attentive pooling and self-multi-head-attentive pooling mechanisms.
Specifically, we propose to apply the FMS to feature maps by multiplying, adding, or performing both sequentially, as shown in Figure 1. By multiplicatively scaling a feature map, we expect to emphasize each filter of the feature map independently. By applying the FMS additively, we expect to provide small perturbations that increase discriminative power; this is inspired by a previous study showing that small alterations in a high-dimensional space can drastically change discriminative power. Hypothesizing that these two approaches function in a complementary manner, we also propose to apply both in sequence. In our experiments, the proposed methods were applied to the output feature maps of each residual block, following the literature [20, 21]. In addition, we investigated replacing RawNet's first convolutional layer with a sinc-convolution layer, which has been reported to better capture aggregated frequency responses than a conventional convolutional layer.
The remainder of this paper is organized as follows. Section 2 describes the RawNet system, which we use as a baseline with several modifications. In Section 3, we introduce the proposed FMS. Section 4 discusses experimentation and presents an analysis of the experimental results. Finally, conclusions are presented in Section 5.
RawNet is a neural speaker embedding extractor that directly inputs raw waveforms, without preprocessing techniques, and outputs speaker embeddings designed for speaker verification. The underlying assumption is that speaker embeddings extracted directly from raw waveforms, by replacing acoustic feature extraction with additional hidden layers, will yield more discriminative representations as the amount of available data increases. RawNet adopts a convolutional neural network-gated recurrent unit (CNN-GRU) architecture, in which the first CNN layer has a stride size identical to its filter length. The front CNN layers comprise residual blocks followed by a max-pooling layer and extract frame-level representations. A GRU layer then aggregates the frame-level features into an utterance-level representation, taken as the final timestep of the GRU's output. The GRU layer is followed by a fully-connected layer whose output is used as the speaker embedding. Finally, the output layer receives the speaker embedding and performs speaker identification during the training phase.
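The CNN-GRU pipeline described above can be sketched in PyTorch as follows; the layer sizes, kernel lengths, and block count here are illustrative placeholders, not RawNet's exact configuration (that is given in Table 1):

```python
import torch
import torch.nn as nn

class CNNGRUEmbedder(nn.Module):
    """Illustrative CNN-GRU speaker embedding extractor (hypothetical sizes)."""
    def __init__(self, n_filters=128, gru_hidden=1024, emb_dim=256):
        super().__init__()
        # First conv layer: stride equal to the filter length, as in RawNet.
        self.first_conv = nn.Conv1d(1, n_filters, kernel_size=3, stride=3)
        # Stand-in for the residual blocks + max-pooling front-end.
        self.frame_encoder = nn.Sequential(
            nn.Conv1d(n_filters, n_filters, kernel_size=3, padding=1),
            nn.BatchNorm1d(n_filters),
            nn.LeakyReLU(),
            nn.MaxPool1d(3),
        )
        self.gru = nn.GRU(n_filters, gru_hidden, batch_first=True)
        self.fc = nn.Linear(gru_hidden, emb_dim)

    def forward(self, x):            # x: (batch, 1, samples)
        x = self.first_conv(x)
        x = self.frame_encoder(x)    # (batch, filters, frames)
        x = x.transpose(1, 2)        # GRU expects (batch, frames, filters)
        out, _ = self.gru(x)
        return self.fc(out[:, -1])   # last timestep -> speaker embedding

emb = CNNGRUEmbedder()(torch.randn(2, 1, 59049))
```

Taking the last GRU timestep, rather than pooling over all timesteps, is what makes the GRU itself act as the aggregation step.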
| Layer     | Input: 59049 samples | Output shape |
|-----------|----------------------|--------------|
| Res block | ×2                   | (2187, 128)  |
| Res block | ×4                   | (27, 256)    |
To construct the baseline used in this study, we implemented several modifications to the original RawNet. First, we modified the structure of the residual blocks to a pre-activation structure. Second, we simplified the loss functions from categorical cross-entropy (CCE), center, and speaker basis losses to CCE loss only. Third, we omitted the CNN pretraining scheme. Fourth, we changed the training dataset from VoxCeleb1 to VoxCeleb2 to utilize the recently expanded evaluation protocols that cover the entire VoxCeleb1 dataset. Finally, we applied a test time augmentation (TTA) method in the evaluation phase to extract multiple speaker embeddings from a single utterance by cropping overlapping segments whose duration is identical to that used in training; the average of these speaker embeddings is used as the final speaker embedding. Through these modifications, we achieve a relative error reduction (RER) of 37.50 %.
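The TTA step can be sketched as follows; the number of crops and the padding strategy for short utterances are illustrative assumptions, not the paper's exact setting:

```python
import torch

def tta_embedding(waveform, embed_fn, win=59049, n_crops=5):
    """Test-time augmentation: crop overlapping windows of the training
    duration, embed each crop, and average the embeddings.
    `n_crops` and the tiling of short utterances are assumptions."""
    total = waveform.shape[-1]
    if total <= win:                      # pad short utterances by tiling
        reps = win // total + 1
        waveform = waveform.repeat(reps)[:win]
        total = win
    # evenly spaced start points produce overlapping crops
    starts = torch.linspace(0, total - win, n_crops).long()
    embs = torch.stack([embed_fn(waveform[s:s + win]) for s in starts])
    return embs.mean(dim=0)

# dummy embedding function standing in for the trained extractor
emb = tta_embedding(torch.randn(96000),
                    lambda x: torch.stack([x.mean(), x.std()]))
```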
We propose to independently scale each filter of a feature map using a filter-wise feature map scaling (FMS) technique. The FMS uses a scale vector whose dimension is identical to the number of filters, with values between 0 and 1 derived using sigmoid activation. Its purpose is to independently modify the scale of a given feature map, i.e., the output of a residual block, to derive more discriminative representations. We propose several ways of applying the FMS to a given feature map, i.e., multiplication, addition, or both. Note that the proposed approaches require no additional hyperparameters.
Here, let $\mathbf{c} = [c_1, \dots, c_F]$ be a feature map of a residual block, i.e., $\mathbf{c} \in \mathbb{R}^{T \times F}$, where $T$ is the sequence length in time and $F$ is the number of filters. We derive a scale vector for the FMS by first performing global average pooling over the time axis and then feed-forwarding through a fully-connected layer followed by sigmoid activation. Expressing the scale vector as $\mathbf{s} = [s_1, \dots, s_F]$, i.e., $\mathbf{s} \in \mathbb{R}^{F}$, we first propose to derive a scaled feature map $\mathbf{c}'$, i.e., $\mathbf{c}' \in \mathbb{R}^{T \times F}$, by scaling the feature map additively:

$$\mathbf{c}' = \mathbf{c} + \mathbf{s}, \tag{1}$$

where $\mathbf{s}$ is broadcast, i.e., copied along the time axis, to perform the element-wise calculation. We also propose to scale the feature map multiplicatively:

$$\mathbf{c}' = \mathbf{c} \cdot \mathbf{s}. \tag{2}$$

These two methods can be applied sequentially, where either method can be performed first:

$$\mathbf{c}' = (\mathbf{c} + \mathbf{s}) \cdot \mathbf{s}, \tag{3}$$

$$\mathbf{c}' = (\mathbf{c} \cdot \mathbf{s}) + \mathbf{s}. \tag{4}$$

We also propose to use two individual scale vectors, one for addition and the other for multiplication, for (4), because (4) can be interpreted as $\mathbf{c}' = \mathbf{s} \cdot (\mathbf{c} + 1)$. Figure 1 shows the proposed methods using the FMS to scale feature maps. We applied the proposed FMS methods to the outputs of the residual blocks in the baseline system, following the literature [20, 21].
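The FMS variants above can be sketched as a small PyTorch module. The global-average-pool → fully-connected → sigmoid path follows the description in the text; the class and mode names are illustrative:

```python
import torch
import torch.nn as nn

class FMS(nn.Module):
    """Feature map scaling: derive a per-filter scale vector s in (0, 1)
    via global average pooling -> FC -> sigmoid, then scale the feature
    map additively, multiplicatively, or both sequentially."""
    def __init__(self, n_filters, mode="mul_add"):
        super().__init__()
        self.fc = nn.Linear(n_filters, n_filters)
        self.mode = mode

    def forward(self, c):                           # c: (batch, filters, time)
        s = torch.sigmoid(self.fc(c.mean(dim=-1)))  # (batch, filters)
        s = s.unsqueeze(-1)                         # broadcast over time axis
        if self.mode == "add":                      # c' = c + s
            return c + s
        if self.mode == "mul":                      # c' = c * s
            return c * s
        if self.mode == "add_mul":                  # c' = (c + s) * s
            return (c + s) * s
        return c * s + s                            # c' = c*s + s = s*(c + 1)

out = FMS(128)(torch.randn(4, 128, 27))
```

Note that no new hyperparameters are introduced: the only learnable parameters are those of the single fully-connected layer, whose size is fixed by the filter count.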
The proposed multiplicative FMS has commonality with the widely used attention mechanism [13, 14, 15] applied in the filter domain, which exclusively emphasizes parts of a given feature map using a softmax activation. It can also be interpreted as the recently proposed multi-head attention mechanism applied in the filter domain, where the number of heads equals the number of filters. We perform scaling with a sigmoid function rather than exclusively with a softmax function because information might be removed excessively when a conventional softmax-based attention mechanism is used. In translation or image classification tasks, exclusive concentration is reasonable; however, we hypothesize that different filters yield complementary features for speaker verification, making independent scaling more adequate.
The proposed additive FMS adds a value between 0 and 1 to each filter of a given feature map. Its purpose is to apply a relatively small, data-driven perturbation to the feature map, under the assumption that this may increase its discriminative power. This concept is inspired by a phenomenon demonstrated by Zhang et al., where the discriminative power of a DNN's high-dimensional intermediate representation can differ significantly under small perturbations. In addition, we assume that combining additive FMS with multiplicative FMS will lead to further improvements.
We also investigated replacing RawNet's first convolutional layer with a sinc-convolution (sinc-conv) layer, which was first proposed for processing raw waveforms by performing time-domain convolutions [4, 12]. It is a set of bandpass filters whose cut-off frequencies are parameters optimized jointly with the other DNN parameters. Because it requires fewer parameters, the sinc-conv layer is frequently employed in DNNs that directly input raw waveforms. Table 1 details the overall architecture of the proposed system.
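The idea behind a sinc-conv kernel can be sketched as the difference of two windowed low-pass sinc filters; in SincNet only the two cut-off frequencies are learned per filter. The filter length, sample rate, and Hamming windowing below are common choices used here as assumptions:

```python
import torch

def sinc_bandpass(f_low, f_high, filt_len=251, fs=16000):
    """Bandpass kernel as the difference of two low-pass sinc filters,
    the construction underlying a sinc-conv layer. In SincNet only
    f_low/f_high would be learnable; the window is a design choice."""
    n = torch.arange(-(filt_len // 2), filt_len // 2 + 1, dtype=torch.float32)
    t = n / fs                                        # time axis in seconds

    def lowpass(fc):
        # ideal low-pass: 2*fc/fs * sinc(2*fc*t); torch.sinc is normalized
        return 2 * fc / fs * torch.sinc(2 * fc * t)

    h = lowpass(f_high) - lowpass(f_low)              # band = high - low
    window = torch.hamming_window(filt_len, periodic=False)
    return h * window

kernel = sinc_bandpass(300.0, 3000.0)
```

Each such kernel would be used as one row of the first convolution's weight tensor, so the whole layer costs only two parameters per filter.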
VoxCeleb1-E protocol:

| System         | Input Feature | Front-end      | Aggregation | Loss                | Dims | EER (%) |
|----------------|---------------|----------------|-------------|---------------------|------|---------|
| Chung et al.   | Spectrogram   | ResNet-50      | TAP         | Softmax+Contrastive | 512  | 4.42    |
| Xie et al.     | Spectrogram   | Thin ResNet-34 | GhostVLAD   | Softmax             | 512  | 3.13    |
| Nagrani et al. | Spectrogram   | Thin ResNet-34 | GhostVLAD   | Softmax             | 512  | 2.95    |

VoxCeleb1-H protocol:

| System         | Input Feature | Front-end      | Aggregation | Loss                | Dims | EER (%) |
|----------------|---------------|----------------|-------------|---------------------|------|---------|
| Chung et al.   | Spectrogram   | ResNet-50      | TAP         | Softmax+Contrastive | 512  | 7.33    |
| Xie et al.     | Spectrogram   | Thin ResNet-34 | GhostVLAD   | Softmax             | 512  | 5.06    |
| Nagrani et al. | Spectrogram   | Thin ResNet-34 | GhostVLAD   | Softmax             | 512  | 4.93    |
All experiments reported in this paper were conducted using PyTorch, and the code is available at https://github.com/Jungjee/RawNet.
We used the VoxCeleb2 dataset for training and the VoxCeleb1 dataset for evaluation under various protocols. The VoxCeleb2 dataset contains over one million utterances from 6112 speakers, and the VoxCeleb1 dataset contains approximately 330 hours of recordings from 1251 speakers, both for text-independent scenarios. Both datasets were collected automatically from YouTube. Note that the VoxCeleb2 dataset is an extended version of the VoxCeleb1 dataset.
We used raw waveforms with pre-emphasis applied as input to the DNNs [3, 11, 1]. For the experiments in which the first convolutional layer was replaced with a sinc-conv layer, we followed the literature, which does not apply pre-emphasis but performs layer normalization on the raw waveform. We fixed the duration of the input waveforms to 59049 samples (≈3.69 s) in the training phase to facilitate mini-batch construction. In the testing phase, we applied TTA with overlapping crops.
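Pre-emphasis is a first-order high-pass filter, y[t] = x[t] − α·x[t−1]; a minimal sketch follows, where the coefficient 0.97 is a conventional value assumed here rather than one stated in the text:

```python
import torch

def pre_emphasis(waveform, coeff=0.97):
    """First-order pre-emphasis: y[t] = x[t] - coeff * x[t-1].
    The first sample is passed through unchanged; coeff=0.97 is a
    common convention, used here as an assumption."""
    return torch.cat([waveform[:1], waveform[1:] - coeff * waveform[:-1]])

y = pre_emphasis(torch.tensor([1.0, 1.0, 1.0]))
```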
We used Leaky ReLU activation functions, following existing implementations. The AMSGrad optimizer was used, with weight decay applied. We used CCE loss as the objective function; the speaker embedding dimensionality and the other parameters related to the system architecture are described in Table 1 and the literature.
Table 2 shows performance according to the modifications made to the RawNet system (Section 2) using the original VoxCeleb1 evaluation set. The top three rows describe existing systems evaluated on the same dataset for comparison; the results indicate that the original RawNet demonstrates competitive performance. For x-vectors, we show the results of an improved version reported in the literature. System #1 describes the performance of RawNet, and System #2 shows the result of changing the DNN architecture and expanding the training set to the VoxCeleb2 dataset. System #3 shows the result of applying TTA to System #2. The results demonstrate that the applied changes were effective, yielding an RER of 37.5 % compared with the original RawNet. Note that we used System #3 as the baseline in all subsequent experiments.
Systems #4 through #7 of Table 3 show the results obtained by applying various related methods. Systems #4 and #5 use the attention and multi-head attention mechanisms with a softmax-based exclusive attention map on the filter domain. Attention demonstrated marginal improvement, while multi-head attention reduced performance, matching the hypothesis discussed in Section 3. Systems #6 and #7 describe the results of applying squeeze-excitation and the convolutional block attention module to the baseline. In addition, replacing the GRU layer with self-attentive pooling or self-multi-head-attentive pooling reduced performance, demonstrating that, in the case of RawNet, the GRU better aggregates frame-level representations into an utterance-level representation. Among the various related methods, System #6 demonstrated the best result.
Systems #8 through #12 of Table 3 show the results of the proposed FMS method in different configurations, where the "Mechanism" column shows how the FMS was applied. Systems #8 and #9 apply the two proposed methods individually, and Systems #10 and #11 apply both methods sequentially in the two possible orders. The results show that the individual methods yielded improvements with RERs of 6.00 % and 11.33 %; applying both methods further improved performance, with System #11 demonstrating an EER of 2.56 %. System #12 shows the result obtained using separate scale vectors for the additive and multiplicative FMS; no additional improvement was observed.
Table 4 shows the results obtained by replacing RawNet's first convolutional layer with the sinc-conv layer of SincNet, using System #11 for these comparative experiments. The replacement provides a 3.12 % additional improvement in System #15. However, performance was sensitive to the filter length of the sinc-conv layer: an overly long filter length reduced performance. Hereafter, we refer to the best performing System #15 as 'RawNet2' for brevity. RawNet2 demonstrates an RER of 48.33 % compared with the original RawNet (System #1), nearly halving the EER.
Finally, Table 5 compares the results obtained in various recent studies under the expanded evaluation protocols, i.e., VoxCeleb1-E and VoxCeleb1-H, which utilize more than 1000 speakers and 500000 trials, compared with 40 speakers and 38000 trials in the original evaluation protocol. (We report all performance values using the cleaned protocols.) The results show that the proposed RawNet2 marginally outperforms the state of the art, with an EER of 2.87 % under the VoxCeleb1-E protocol and 4.89 % under the VoxCeleb1-H protocol. From the experimental results in Tables 2 to 5, we conclude that the proposed RawNet2 using the FMS demonstrates competitive performance despite its simple pipeline of inputting raw waveforms to a DNN and measuring cosine similarity between the output speaker embeddings.
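The scoring step in that pipeline amounts to a single cosine similarity between two embeddings; a minimal sketch follows, where the decision threshold is illustrative (in practice it would be set on a development set, e.g., at the EER operating point):

```python
import torch
import torch.nn.functional as F

def verify(emb_enroll, emb_test, threshold=0.5):
    """Score a trial by cosine similarity between an enrollment and a
    test speaker embedding. The threshold is a hypothetical value."""
    score = F.cosine_similarity(emb_enroll, emb_test, dim=-1)
    return score, score > threshold

score, accept = verify(torch.tensor([1.0, 0.0]), torch.tensor([1.0, 0.0]))
```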
In this paper, we have proposed various FMS-based methods to improve the existing RawNet system, a neural speaker embedding extractor in which speaker embeddings are extracted directly from raw waveforms. The FMS performs scaling with a scale vector whose dimension is identical to the number of filters. The FMS-based methods scale the filters of a feature map, through addition, multiplication, or both, to construct improved feature maps that emphasize the more important features of the frame-level representation. We applied the FMS-based methods to the output of each residual block. In addition, we achieved further improvement by replacing the first convolution layer with a sinc-conv layer. The evaluation performed using the original VoxCeleb1 protocol demonstrates an EER of 2.46 %, whereas the original RawNet reported an EER of 4.80 %. Under the recently expanded evaluation protocols, the proposed method marginally outperformed the current state-of-the-art methods.