
Improved RawNet with Filter-wise Rescaling for Text-independent Speaker Verification using Raw Waveforms

Recent advances in deep learning have facilitated the design of speaker verification systems that directly input raw waveforms. For example, RawNet extracts speaker embeddings from raw waveforms, which simplifies the process pipeline and demonstrates competitive performance. In this study, we improve RawNet by rescaling feature maps using various methods. The proposed mechanism utilizes a filter-wise rescale map that adopts a sigmoid non-linear function. It refers to a vector with dimensionality equal to the number of filters in a given feature map. Using a filter-wise rescale map, we propose to rescale the feature map multiplicatively, additively, or both. In addition, we investigate replacing the first convolution layer with the sinc-convolution layer of SincNet. Experiments performed on the VoxCeleb1 evaluation dataset demonstrate that the proposed methods are effective, and the best performing system reduces the equal error rate by half compared to the original RawNet. Expanded evaluation results obtained using the VoxCeleb1-E and VoxCeleb1-H protocols marginally outperform existing state-of-the-art systems.





1 Introduction

With the recent advances in deep learning, many speaker verification studies have replaced the acoustic feature extraction process with deep neural networks (DNNs) [jung2018complete, muckenhirn2018towards, ravanelli2018speaker]. In the preliminary stage of utilizing DNNs for speaker verification, acoustic features, e.g., Mel-frequency cepstral coefficients and Mel-filterbank energy features, were used as input to DNNs [d-vector, snyder2018x, okabe2018attentive, jung2019spatial]. In contrast, many recent studies have used less processed features, e.g., spectrograms and raw waveforms [hajibabaei2018unified, nagrani2020voxceleb, jung2018avoiding22, ravanelli2019sincnet], hypothesizing that such less processed input allows data-driven approaches with DNNs to yield more discriminative representations than knowledge-based acoustic features. Following this trend, many recent systems, e.g., RawNet [jung2019rawnet], have demonstrated competitive results using a raw waveform as input for speaker verification.

The attention mechanism was initially designed to emphasize the more important elements in sequence-to-sequence processing [bahdanau2014neural, vaswani2017attention, chan2016listen, zhu2018self, safari2019self] and has been investigated for several tasks, including speaker verification. Among various methods, self-attentive pooling has been applied to speaker verification to aggregate frame-level representations into a single utterance-level representation [zhu2018self]. Here, the term ‘self’ refers to the property of the attention mechanism that no external data, e.g., phoneme labels [zhou2019cnn], are used. Compared to conventional global average pooling, self-attentive pooling assigns a weight to each frame-level representation and conducts a weighted summation. More recent attention mechanisms, e.g., multi-head self-attentive pooling, have also been investigated [safari2019self] and have demonstrated further performance improvements. However, to the best of our knowledge, work on attention mechanisms in speaker verification has focused more on attentive pooling than on applying such mechanisms to intermediate feature maps, as is common in image domain tasks.
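As background, the self-attentive pooling described above can be sketched in PyTorch. This is a minimal single-head version for illustration only; the layer sizes and variable names are our assumptions, not the implementation used in any of the cited works.

```python
import torch
import torch.nn as nn

class SelfAttentivePooling(nn.Module):
    """Aggregate frame-level features (batch, frames, dim) into one utterance-level vector."""
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Linear(dim, 1)  # scores one scalar weight per frame

    def forward(self, h):
        # h: (batch, frames, dim)
        a = torch.softmax(self.w(h), dim=1)  # attention weights sum to 1 over frames
        return (a * h).sum(dim=1)            # weighted sum -> (batch, dim)

pool = SelfAttentivePooling(64)
frames = torch.randn(2, 50, 64)  # 2 utterances, 50 frames, 64-dim features
utt = pool(frames)               # (2, 64) utterance-level representations
```

Unlike global average pooling, which weights every frame equally, the learned weights let the network down-weight uninformative frames.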

Therefore, in this study, we propose to rescale the filter domain of a feature map in a filter-wise manner using a sigmoid-based mechanism, which we refer to as a filter-wise rescale map (FRM). An FRM is a vector whose length is identical to the number of filters, where each value lies between 0 and 1. It is similar to an attention map used in an attention mechanism, with the exception that a sigmoid activation function is employed rather than a softmax function. The underlying hypothesis of using sigmoid activation functions to independently perform rescaling is that, unlike in some other tasks, an attention mechanism that exclusively selects only a few filters may remove an excessive amount of discriminative information. In addition, in light of the recent successes of attentive pooling mechanisms in speaker verification tasks, we investigate replacing the gated recurrent unit (GRU) layer of RawNet, which aggregates frame-level representations, with self-attentive pooling and self-multi-head-attentive pooling mechanisms.

Specifically, we propose to apply an FRM to a feature map by multiplying, adding, or performing both sequentially, as shown in Figure 1. By multiplicatively rescaling a feature map, we expect to emphasize each filter of the feature map independently. By applying an FRM additively, we expect to provide small perturbations that lead to increased discriminative power; this is inspired by a previous study [zhang2018vector] showing that small alterations in a high-dimensional space can drastically change discriminative power. Hypothesizing that these two approaches function in a complementary manner, we also propose to apply both in sequence. In the experiments, the proposed methods were applied to the output feature maps of each residual block, following the literature [woo2018cbam, hu2018squeeze]. In addition, we investigated replacing RawNet’s first convolutional layer with a sinc-convolution layer [ravanelli2018speaker], which has been reported to better capture aggregated frequency responses than a conventional convolutional layer.

The remainder of this paper is organized as follows. Section 2 describes the RawNet system, which we use as a baseline with several modifications. In Section 3, we introduce the proposed filter-wise rescaling scheme. Section 4 discusses experimentation and presents an analysis of the experimental results. Finally, conclusions are presented in Section 5.

2 Baseline Composition: RawNet

RawNet is a neural speaker embedding extractor that directly inputs a raw waveform without preprocessing techniques and outputs a speaker embedding designed for speaker verification [jung2019rawnet]. The underlying assumption is that speaker embeddings extracted directly from raw waveforms, by replacing acoustic feature extraction with additional hidden layers, are expected to yield more discriminative representations as the amount of available data increases. RawNet adopts a convolutional neural network-gated recurrent unit (CNN-GRU) architecture, in which the first CNN layer has a stride size identical to the filter length. The following CNN component comprises residual blocks followed by a max-pooling layer and extracts frame-level representations. A GRU layer then aggregates the frame-level features into an utterance-level representation, which is the final timestep of the GRU’s output. This representation is connected to a fully-connected layer whose output is used as the speaker embedding. Finally, the output layer receives the speaker embedding and performs speaker identification in the training phase.
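A minimal PyTorch sketch of this CNN-GRU flow follows. It is illustrative only: the residual blocks are collapsed into the single strided front-end convolution, and all layer sizes and names are our assumptions, not the architecture of Table 1.

```python
import torch
import torch.nn as nn

class TinyRawNet(nn.Module):
    """Minimal sketch of the RawNet CNN-GRU flow (layer sizes are illustrative)."""
    def __init__(self, n_filters=8, emb_dim=16, n_speakers=10):
        super().__init__()
        # first conv layer: stride equal to the filter length, as in RawNet
        self.frontend = nn.Conv1d(1, n_filters, kernel_size=3, stride=3)
        self.gru = nn.GRU(n_filters, emb_dim, batch_first=True)
        self.fc = nn.Linear(emb_dim, emb_dim)      # speaker embedding layer
        self.out = nn.Linear(emb_dim, n_speakers)  # identification head (training only)

    def forward(self, wav):                  # wav: (batch, samples)
        x = self.frontend(wav.unsqueeze(1))  # (batch, filters, frames)
        x, _ = self.gru(x.transpose(1, 2))   # GRU over the frame axis
        emb = self.fc(x[:, -1, :])           # final GRU timestep -> speaker embedding
        return emb, self.out(emb)            # embedding + identification logits

net = TinyRawNet()
emb, logits = net(torch.randn(2, 300))
```

At test time only the embedding output is used; the identification head is discarded after training.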

Layer | Input: 59,049 samples | Output shape
Sinc-conv | Sinc(251,1,128), MaxPool(3), BN | (19683, 128)
Res block ×2 | | (2187, 128)
Res block ×4 | | (27, 256)
GRU | GRU(1024) | (1024,)
Speaker embedding | FC(1024) | (1024,)
Table 1: DNN architecture of the proposed system (referred to as RawNet2 for brevity). The output layer conducts speaker identification in the training phase and is removed after training. The BN and LeakyReLU at the beginning of the first block are omitted, following [he2016identity]. Numbers denoted in Conv and Sinc-conv refer to the filter length, stride, and number of filters, respectively. This architecture can also represent the baseline (# 3-ours of Table 2) by removing FRM operations and using a convolutional layer instead of a sinc-conv layer.

To construct the baseline used in this study, we implemented several modifications to the original RawNet. First, we modified the structure of the residual blocks to a pre-activation structure [he2016identity]. Second, we simplified the loss function from the combination of categorical cross-entropy (CCE), center [wen2016discriminative], and speaker basis loss [heo2019end] to CCE loss only. Third, we omitted the CNN pretraining scheme. Fourth, we changed the training dataset from VoxCeleb1 to VoxCeleb2 to utilize the recently expanded evaluation protocols that consider the entire VoxCeleb1 dataset. Finally, we applied a test time augmentation (TTA) method in the evaluation phase [Voxceleb2] to extract multiple speaker embeddings from a single utterance by cropping with overlaps, where the crop duration is identical to that in the training phase; the average of these speaker embeddings is used as the final speaker embedding. Through these modifications, we achieve a relative error reduction (RER) of 37.50 %.
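The TTA step can be sketched as follows. `extract` stands in for any embedding extractor mapping a batch of fixed-length waveforms to embeddings, and the crop count is an illustrative assumption; the paper fixes only the crop duration.

```python
import torch

def tta_embedding(wav, extract, win=59049, n_crops=5):
    """Test-time augmentation sketch: average the embeddings of overlapping,
    equal-duration crops of one utterance. `extract`: (batch, win) -> (batch, dim)."""
    if wav.numel() <= win:                    # utterance no longer than one crop
        return extract(wav.view(1, -1))[0]
    starts = torch.linspace(0, wav.numel() - win, n_crops).long()
    crops = torch.stack([wav[s:s + win] for s in starts])  # overlapping crops
    return extract(crops).mean(dim=0)         # average of the crop embeddings
```

For example, with a dummy extractor `lambda x: x[:, :4]` and `win=50`, a 100-sample utterance yields three crops starting at samples 0, 25, and 50, whose embeddings are averaged.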

3 Filter-wise re-scaling

Figure 1: Illustration of the four methods using the proposed filter-wise re-scale map (FRM). Here, the FRM is broadcast along the time axis to perform element-wise calculations with the feature map.

We propose to independently rescale each filter of a feature map using a filter-wise rescale map (FRM). The FRM is a vector whose length is identical to the number of filters, with values between 0 and 1 derived using sigmoid activation. Its purpose is to independently modify the scale of a given feature map, i.e., the output of a residual block, to derive more discriminative representations. We also propose various methods to utilize the FRM to rescale a given feature map, i.e., multiplication, addition, and applying both. Note that the proposed approaches do not require additional hyperparameters.

Here, let c = [c_1, ..., c_F] denote the output feature map of a residual block, where each c_f is a vector over the time axis of length T (i.e., the sequence length) and F is the number of filters. We derive an FRM by first performing global average pooling on the time axis and then feed-forwarding the result through a fully-connected layer followed by sigmoid activation. Expressing the FRM as s = [s_1, ..., s_F] with s_f ∈ (0, 1), we first propose to derive a rescaled feature map c' = [c'_1, ..., c'_F] by rescaling the feature map additively:

    c'_f = c_f + s_f,    f = 1, ..., F,    (1)

where s_f is broadcast, i.e., copied along the time axis, to perform element-wise calculation. We also propose to rescale the feature map multiplicatively:

    c'_f = c_f · s_f.    (2)

These two methods can be applied sequentially, where either method can be performed first:

    c'_f = (c_f + s_f) · s_f,    (3)

    c'_f = c_f · s_f + s_f.    (4)

For (4), we also propose to use two individual FRMs, i.e., one for addition and the other for multiplication, because (4) with a single FRM can be interpreted as multiplying the FRM with the feature map incremented by one. Figure 1 shows the proposed methods using an FRM to rescale a feature map. Here, we applied the proposed methods using the FRM to the outputs of residual blocks in the baseline system following the literature [woo2018cbam, hu2018squeeze].
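A minimal PyTorch sketch of the FRM and its three application modes might look as follows. This is our illustrative reading of the description above (module name, mode strings, and layer shapes are assumptions), not the authors' code.

```python
import torch
import torch.nn as nn

class FRM(nn.Module):
    """Filter-wise rescale map sketch: one sigmoid gate per filter, applied to the
    feature map by addition, multiplication, or both sequentially (mul-add)."""
    def __init__(self, n_filters, mode="mul-add"):
        super().__init__()
        self.fc = nn.Linear(n_filters, n_filters)
        self.mode = mode

    def forward(self, c):  # c: (batch, n_filters, time), a residual block's output
        s = torch.sigmoid(self.fc(c.mean(dim=2)))  # global average pool over time
        s = s.unsqueeze(-1)                        # broadcast along the time axis
        if self.mode == "add":
            return c + s
        if self.mode == "mul":
            return c * s
        return c * s + s                           # mul-add with a shared FRM

frm = FRM(8, mode="mul")
c = torch.randn(2, 8, 20)
rescaled = frm(c)  # same shape as c
```

In the multiplicative mode, every gate value lies in (0, 1), so each filter's activations can only be attenuated, never amplified; the additive mode instead shifts them by a small data-driven offset.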


The proposed method using a multiplicative FRM for rescaling has commonality with the widely used attention mechanism [bahdanau2014neural, vaswani2017attention, chan2016listen] applied in the filter domain, which exclusively emphasizes a given feature map using a softmax activation. This can be interpreted as using the recently proposed multi-head attention mechanism [vaswani2017attention] in the filter domain, where the number of heads is equal to the number of filters. We apply rescaling using a sigmoid function rather than exclusively rescaling with a softmax function because information might be removed excessively when a conventional softmax-based attention mechanism is applied. In translation or image classification tasks, performing exclusive concentration is reasonable; however, we hypothesize that different filters yield complementary features for speaker verification, thereby making independent rescaling more adequate.

The proposed method with an additive FRM for filter-wise rescaling adds a value between 0 and 1 to a given feature map. The purpose is to apply a data-driven perturbation of relatively small value to the feature map, under the assumption that this may increase its discriminative power. This concept is inspired by a phenomenon demonstrated by Zhang et al. [zhang2018vector], where the discriminative power of a DNN’s high-dimensional intermediate representation can differ significantly with small perturbations. In addition, we assume that applying an additive FRM combined with a multiplicative FRM will lead to further improvements.

We also investigated replacing RawNet’s first convolutional layer with a sinc-convolution (sinc-conv) layer, which was first proposed to process raw waveforms by performing time-domain convolutions [ravanelli2018speaker, ravanelli2019sincnet]. It is a type of bandpass filter in which the cut-off frequencies are parameters optimized jointly with the other DNN parameters. Because it requires fewer parameters (two cut-off frequencies per filter rather than one weight per tap), the sinc-conv layer is frequently employed in DNNs that directly input a raw waveform. Table 1 details the overall architecture of the proposed system.
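As a rough sketch of the idea (not the SincNet implementation itself), a single sinc-conv kernel can be built as the difference of two windowed low-pass sinc filters; in SincNet only the two cut-off frequencies would be learned, whereas this sketch builds a fixed kernel with assumed cut-offs.

```python
import torch

def sinc_bandpass(f1, f2, length=251, fs=16000):
    """Illustrative SincNet-style kernel: the difference of two truncated low-pass
    sinc filters with cut-offs f1 < f2 (Hz), smoothed with a Hamming window."""
    t = torch.arange(-(length // 2), length // 2 + 1) / fs  # centered time axis (s)
    def lowpass(fc):
        return 2 * fc / fs * torch.sinc(2 * fc * t)         # ideal low-pass, truncated
    h = lowpass(f2) - lowpass(f1)                           # band-pass response
    return h * torch.hamming_window(length, periodic=False) # smooth the truncation

kernel = sinc_bandpass(torch.tensor(300.0), torch.tensor(3000.0))  # (251,) taps
```

The 251-tap length matches the sinc-conv filter length of Table 1; convolving such kernels with the waveform yields the band-limited outputs that the rest of the network consumes.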

System | Trained on | TTA | EER | RER
i-vector [shon2018frame] | Vox1 | - | 5.40 | -
specCNN* [hajibabaei2018unified] | Vox1 | - | 4.3 | -
x-vector* [snyder2018x] | Vox2 | - | 3.10 | -
# 1-RawNet [jung2019rawnet] | Vox1 | - | 4.80 | -
# 2-Ours | Vox2 | - | 3.52 | 26.67
# 3-Ours | Vox2 | ✓ | 3.00 | 37.50
Table 2: Performance comparison according to modifications of the baseline construction (*: data augmentation). Equal error rate (EER, %) is reported using the original VoxCeleb1 evaluation dataset. Ours shows the results of applying identity mapping [he2016identity], modifying the dimensionality of the code representation, and increasing the training set.
System | Mechanism | EER | RER
Baseline | - | 3.00 | -
# 4-att | softmax attention | 2.89 | 3.67
# 5-multi-att | multi-head softmax attention | 3.42 | -
# 6-add | c + s | 2.82 | 6.00
# 7-mul | c · s | 2.66 | 11.33
# 8-add-mul | (c + s) · s | 2.60 | 13.33
# 9-mul-add | c · s + s | 2.56 | 14.67
# 10-mul-add-sep | c · s + s' | 2.57 | 14.33
Table 3: Various applications of the proposed FRM. Baseline refers to the modified RawNet (Table 2). Mechanism addresses variations of applying the proposed method, where c is the output feature map and s is the FRM derived from c. ‘sep’ indicates using separate FRMs for additive and multiplicative rescaling. Performance is reported in terms of EER (%) and RER (%).
System Sinc-conv length EER RER
# 9-mul-add - 2.56 -
# 11 125 2.53 1.17
# 12 195 2.54 0.78
# 13-RawNet2 251 2.48 3.12
# 14 313 2.70 -
# 15 375 2.75 -
Table 4: Experimental results of replacing the first strided convolution layer with varying lengths of the sinc-conv layer proposed in SincNet [ravanelli2018speaker, ravanelli2019sincnet]. Applied to System # 9-mul-add of Table 3.
VoxCeleb1-E protocol:
System | Input Feature | Front-end | Aggregation | Loss | Dims | EER (%)
Chung et al. [Voxceleb2] | Spectrogram | ResNet-50 | TAP | Softmax+Contrastive | 512 | 4.42
Xie et al. [xie2019aggregation] | Spectrogram | Thin ResNet-34 | GhostVLAD | Softmax | 512 | 3.13
Nagrani et al. [nagrani2020voxceleb] | Spectrogram | Thin ResNet-34 | GhostVLAD | Softmax | 512 | 2.95
Ours | Raw waveform | RawNet2 | GRU | Softmax | 1024 | 2.57

VoxCeleb1-H protocol:
System | Input Feature | Front-end | Aggregation | Loss | Dims | EER (%)
Chung et al. [Voxceleb2] | Spectrogram | ResNet-50 | TAP | Softmax+Contrastive | 512 | 7.33
Xie et al. [xie2019aggregation] | Spectrogram | Thin ResNet-34 | GhostVLAD | Softmax | 512 | 5.06
Nagrani et al. [nagrani2020voxceleb] | Spectrogram | Thin ResNet-34 | GhostVLAD | Softmax | 512 | 4.93
Ours | Raw waveform | RawNet2 | GRU | Softmax | 1024 | 4.89
Table 5: Results of comparison with state-of-the-art systems on the expanded VoxCeleb1-E and VoxCeleb1-H evaluation protocols.

4 Experiments and Result Analysis

All experiments reported in this paper were conducted using PyTorch [paszke2019PyTorch], and the code is available online.

4.1 Dataset

We used the VoxCeleb2 dataset [Voxceleb2] for training, and we used the VoxCeleb1 dataset [Voxceleb] to perform evaluations using various protocols. The VoxCeleb2 dataset contains over one million utterances from 6112 speakers, and the VoxCeleb1 dataset contains approximately 330 hours of recordings from 1251 speakers for text-independent scenarios, where all recordings are encoded at a sampling rate of 16 kHz with 16-bit resolution. Both datasets were obtained automatically from YouTube. Note that the VoxCeleb2 dataset is an extended version of the VoxCeleb1 dataset.

4.2 Experimental configurations

We used raw waveforms with pre-emphasis applied as input to the DNNs [jung2018complete, jung2018avoiding22, jung2019rawnet]. For the experiments in which the first convolutional layer was replaced with a sinc-conv layer, we followed the literature [ravanelli2018speaker]; that study did not apply pre-emphasis but performed layer normalization [ba2016layer] on the raw waveform. Here, we fixed the duration of the input raw waveforms to 59,049 samples (approximately 3.69 s) in the training phase to facilitate mini-batch construction. In the testing phase, we applied TTA with overlapping crops.

We used Leaky ReLU activation functions [leaky] with a negative slope of 0.3, following the implementation of [keras]. The speaker embedding had a dimensionality of 1,024. The AMSGrad optimizer [reddi2019convergence] was used, and weight decay was applied. We used CCE loss as the sole objective function. The other parameters related to the system architecture are described in Table 1 and the literature [jung2019rawnet].
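The optimizer and objective described above can be sketched in PyTorch as follows. The learning rate and weight decay values shown here are illustrative assumptions (the exact values are not specified above), and the linear layer stands in for the full speaker identification network.

```python
import torch
import torch.nn as nn

# Sketch of the training setup: AMSGrad (Adam with amsgrad=True), weight decay,
# and categorical cross-entropy over speaker identities.
model = nn.Linear(1024, 6112)  # speaker embedding dim -> number of speakers
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             weight_decay=1e-4, amsgrad=True)  # AMSGrad variant
criterion = nn.CrossEntropyLoss()  # CCE as the sole objective

emb = torch.randn(8, 1024)             # a dummy batch of speaker embeddings
labels = torch.randint(0, 6112, (8,))  # dummy speaker labels
loss = criterion(model(emb), labels)
loss.backward()
optimizer.step()
```

In PyTorch, AMSGrad is exposed as the `amsgrad=True` flag of the Adam optimizer rather than as a separate class.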

4.3 Results analysis

Table 2 shows the results according to the modifications made to the RawNet system (Section 2) using the original VoxCeleb1 evaluation set. Here, the top three rows describe existing systems evaluated on an identical dataset for comparison. The results indicate that the original RawNet demonstrates competitive performance. For x-vectors, we show the results of an improved version reported in the literature [nagrani2020voxceleb]. System # 1 describes the performance of the original RawNet [jung2019rawnet], and System # 2 shows the result of changing the DNN architecture and expanding the training set from VoxCeleb1 to VoxCeleb2. System # 3 shows the results obtained by applying TTA to System # 2. As can be seen, the applied changes were effective, yielding an RER of 37.5 % compared to the original RawNet. Note that we used System # 3 as the baseline in all subsequent experiments.

Table 3 shows the results obtained by applying the proposed FRM method with various configurations. Here, the ‘Mechanism’ column shows how the proposed FRM method was applied. Systems # 4 and # 5 show the results obtained using the attention and multi-head attention mechanisms, i.e., softmax-based exclusive attention maps on the filter domain. Attention demonstrated marginal improvement, and multi-head attention reduced performance, matching the hypothesis discussed in Section 3. Systems # 6 and # 7 applied the two proposed methods individually, and Systems # 8 and # 9 applied both methods sequentially in different orders. The results show that the individual methods yielded improvements with RERs of 6.00 % and 11.33 %. Applying both methods sequentially further improved performance, and System # 9 demonstrated an EER of 2.56 %. System # 10 shows the result obtained using separate FRMs for additive and multiplicative rescaling; additional improvements were not observed. In addition, replacing the GRU layer with self-attentive pooling or self-multi-head-attentive pooling reduced performance. These results demonstrate that, in the case of RawNet, the GRU better aggregates frame-level representations into an utterance-level representation.

Table 4 shows the results obtained by replacing RawNet’s first convolutional layer with the sinc-conv layer of SincNet. Here, we used System # 9 to perform these comparative experiments. The results demonstrate that the sinc-conv layer provides a 3.12 % additional improvement in System # 13. However, performance was sensitive to the length of the sinc-conv filters, i.e., setting an overly long filter length reduced performance. In the following, we refer to System # 13, which demonstrates the best performance, as ‘RawNet2’ for brevity. RawNet2 demonstrates an RER of 48.33 % compared to the original RawNet (System # 1), thereby nearly halving the EER.

Finally, Table 5 compares the results obtained in various recent studies using the expanded evaluation protocols, i.e., VoxCeleb1-E and VoxCeleb1-H, which utilize more than 1,000 speakers and 500,000 trials compared to 40 speakers and 38,000 trials in the original evaluation protocol. (We report all performance values using the cleaned protocols, as recommended by the dataset providers.) The results show that the proposed RawNet2 marginally outperformed the state-of-the-art performance, with an EER of 2.57 % for the VoxCeleb1-E protocol and 4.89 % for the VoxCeleb1-H protocol. From the various experimental results given in Tables 2 to 5, we conclude that the proposed RawNet2 using an FRM for filter-wise rescaling demonstrates competitive performance despite its simple process pipeline of inputting raw waveforms to a DNN and measuring cosine similarity between the output speaker embeddings.
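The scoring step of this pipeline, cosine similarity between enrollment and test embeddings, can be sketched as follows; the decision threshold is a placeholder that would be tuned on development data, not a value from this work.

```python
import torch
import torch.nn.functional as F

def verify(emb_enroll, emb_test, threshold=0.5):
    """Cosine-similarity scoring between two speaker embeddings.
    Returns the score and the accept/reject decision at the given threshold."""
    score = F.cosine_similarity(emb_enroll, emb_test, dim=-1)
    return score, score > threshold

a = torch.randn(1024)
score, accepted = verify(a, a)  # identical embeddings -> score 1.0, accepted
```

Because cosine similarity depends only on the angle between embeddings, the magnitude of an embedding does not affect the verification decision.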

5 Conclusion

In this paper, we have proposed various FRM-based methods to improve the existing RawNet system, a neural speaker embedding extractor in which speaker embeddings are extracted directly from a raw waveform. The FRM refers to the values used to perform rescaling, where the length of an FRM is identical to the number of filters. The FRM-based methods rescale a feature map in a filter-wise manner to construct an improved feature map that focuses on the more important features in the frame-level feature map through addition, multiplication, or both. We applied the various FRM-based methods to the output of each residual block. In addition, by replacing the first convolution layer with a sinc-conv layer, we achieved further improvements. The results of an evaluation performed using the original VoxCeleb1 protocol demonstrate an EER of 2.48 %, whereas the original RawNet reported an EER of 4.80 %. In evaluations using the recently expanded evaluation protocols, the proposed method marginally outperformed current state-of-the-art methods.

6 Acknowledgements

This research was supported by Projects for Research and Development of Police Science and Technology under the Center for Research and Development of Police Science and Technology and the Korean National Police Agency, funded by the Ministry of Science, ICT and Future Planning (Grant No. PA-J000001-2017-101).