
A study of the robustness of raw waveform based speaker embeddings under mismatched conditions

In this paper, we conduct a cross-dataset study on parametric and non-parametric raw-waveform based speaker embeddings through speaker verification experiments. In general, we observe a more significant performance degradation of these raw-waveform systems compared to spectral based systems. We then propose two strategies to improve the performance of raw-waveform based systems on cross-dataset tests. The first strategy is to change the real-valued filters into analytic filters to ensure shift-invariance. The second strategy is to apply variational dropout to non-parametric filters to prevent them from overfitting to irrelevant nuisance features.


1 Introduction

The design and analysis of hand-crafted features inspired by human auditory perception, such as mel frequency cepstral coefficients (MFCCs), has long been an active area of research in audio processing. In recent years, increasing attention has been directed toward replacing such features with data-driven raw-waveform models. Earlier research on sample-level deep neural networks (DNNs) has demonstrated the ability to learn suitable feature embeddings directly from the raw waveform for phone classification [22], music classification [8] and speaker recognition [14]. The performance of these systems is comparable to, and in some cases even surpasses, traditional spectral based methods. On feature interpretability, Tüske et al. [22] showed that DNNs are able to learn bandpass filters purely from the raw waveform without any prior knowledge, and that the first layer can be interpreted as performing a “quasi time-frequency” analysis on audio.

Inspired by these findings, contemporary raw-waveform models typically comprise a modular structure [7, 9, 6, 26]: First, a waveform encoder is used to learn a meaningful representation for audio waveforms and to reduce the dimensionality of feature maps, also referred to as ‘wavegrams’ [9]. Then, an additional backbone network further processes the wavegram into embeddings. Under this framework, the trainable front-end filterbanks are the key components of raw-waveform based models. Ideally, the filters should only model task-relevant information, while ignoring other nuisance aspects [11]. However, directly learning from densely sampled audio inputs using DNNs without any prior knowledge can lead to over-fitting [23].

There are two main strategies for effectively training a group of meaningful filters from scratch to achieve comparable results to spectral features: parametric filterbanks, and non-parametric filterbanks combined with some initialization or regularization strategy. Both filterbank variations are trained together with a respective network architecture. Learnable parametric filterbanks constrain the filters by optimizing only a few parameters, e.g., center frequency and bandwidth [23, 20], of pre-defined parametric functions. With such strong constraints, the learned filters generally follow expected shapes and are easier to interpret. In contrast, non-parametric learnable filterbanks have little to no regularization. To mitigate this lack of structure, various techniques borrowed from signal processing, such as Gabor initialization [24], multi-scale analysis [26], learnable compression functions [23] and complex convolution [18], are usually applied to the first few layers to avoid overfitting and to speed up convergence.
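As a rough illustration of what "optimizing only a few parameters" means, the sketch below builds a band-pass filter from just a center frequency and a bandwidth, in the spirit of sinc-based parametric filters. The function names, the Hamming window, and the odd filter length of 401 (chosen here only to make the filter symmetric) are assumptions made for this example and are not taken from the systems studied in this paper.

```python
import numpy as np

def sinc_bandpass(center_hz, band_hz, length=401, sample_rate=16000):
    """Band-pass FIR filter fully determined by center frequency and bandwidth."""
    # Normalized cut-off frequencies in cycles per sample.
    f1 = (center_hz - band_hz / 2) / sample_rate
    f2 = (center_hz + band_hz / 2) / sample_rate
    n = np.arange(length) - (length - 1) / 2   # time axis centered on zero
    # Difference of two ideal low-pass filters; np.sinc(x) = sin(pi x) / (pi x).
    ideal = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    # Hypothetical choice: a Hamming window to smooth the truncation.
    return ideal * np.hamming(length)

def gain_at(filt, freq_hz, sample_rate=16000):
    """Magnitude of the filter's frequency response at a single frequency."""
    n = np.arange(len(filt))
    return abs(np.sum(filt * np.exp(-2j * np.pi * freq_hz / sample_rate * n)))

h = sinc_bandpass(center_hz=1500, band_hz=1000)
print(gain_at(h, 1500), gain_at(h, 4000))  # pass-band gain vs. stop-band gain
```

In a trainable front-end, only `center_hz` and `band_hz` would be updated by gradient descent, whereas a non-parametric filter of the same length would expose every one of its coefficients to the optimizer.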

Raw-waveform based models are known to be susceptible to domain mismatch: their performance on cross-domain speech recognition [11, 1] and source separation [17] tasks falls below that on in-domain datasets. In this paper, we compare the efficacy of raw-waveform speaker embeddings to that of traditional mel-spectrum based methods under different acoustic conditions. We propose several strategies to improve the performance of raw-waveform embeddings on cross-domain tasks, including making use of filter analyticity and variational dropout to learn sparse filter coefficients. Finally, we visualize and analyze the learned filter responses. The complete code for training and inference will also be made available.

2 Cross-dataset Studies

In this section, we present an empirical study comparing several raw-waveform based speaker embeddings with mel-spectrum based models under both matched and mismatched conditions across several speaker verification tasks.

2.1 Datasets

VoxCeleb [15, 2] is a large-scale dataset containing speech spanning a wide range of speakers under uncontrolled acoustic conditions. We use the VoxCeleb2 development partition for training. We also add 100k augmented noisy utterances by adding reverberation, noise, music, and babble to the original speech files following the Kaldi [19] recipe. We use the full VoxCeleb1 dataset, including Vox1-O, Vox1-E and Vox1-H, to perform matched condition tests.

VOiCEs [21], i.e., the Voices Obscured In Complex Environmental Settings corpus, was released with the aim to simulate realistic data under complicated acoustic conditions. It was created by playing Librispeech [16] recordings inside multiple room configurations and re-recording with 12 different microphones placed at various locations. In addition, pre-recorded background noise plus reverberation or echo were played along with the foreground speech. For evaluating the robustness under mismatched conditions, we used the evaluation partition of this corpus, which consists of 3.6 million trial pairs derived from 11,392 utterances.

2.2 Experimental setup

We select one parametric waveform encoder, SincNet [20], and two non-parametric encoders, multi-scale filters [26] and TDFbank [24], to compare against the mel-spectrum based system. All of the speaker embedding systems employ 30 filters of length 400 samples (25 ms at a 16 kHz sampling rate) with a stride of 5 to extract speech features. We then feed the output of these three trainable filterbanks to the same backbone network. We model the common backbone with sample-level CNN architectures [8, 9, 26]. Specifically, the waveform embeddings output from the learnable filters are first fed into five down-sampling blocks, each with a decimation rate of 2. Hence, the sequence length of the feature maps is reduced by a factor of 160 in total (5 × 2^5), equating to a hop size of 10 ms. In the down-sampling block, we replace the original dense convolution in [26] with simple depth-wise separable convolutions, inspired by [6, 12]. In this way, the number of parameters is largely reduced. Finally, speaker embeddings are extracted with time delay neural networks (TDNN).
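The frame-rate arithmetic in this setup can be checked with a few lines of code. This is only a sketch of the bookkeeping: the helper names are hypothetical, and the frame count ignores padding and edge effects.

```python
def total_decimation(frontend_stride=5, num_blocks=5, block_rate=2):
    """Overall reduction in sequence length from waveform to backbone input."""
    return frontend_stride * block_rate ** num_blocks

def duration_ms(num_samples, sample_rate=16000):
    """Convert a length in samples to milliseconds."""
    return 1000 * num_samples / sample_rate

def approx_num_frames(num_samples, decimation):
    """Approximate frame count, ignoring padding at the edges."""
    return num_samples // decimation

factor = total_decimation()                # 5 * 2**5 = 160
print(factor, duration_ms(factor))         # hop size in samples and in ms
print(duration_ms(400))                    # filter length in ms
print(approx_num_frames(16000, factor))    # frames per second of audio
```

With these settings, one second of 16 kHz audio maps to roughly 100 frames, the same frame rate as a conventional spectral front-end with a 10 ms hop.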

For the spectral baseline, we use a fixed mel-scaled filterbank and the above-mentioned backbone network, named ‘x-conv-vector’, for a fair comparison. As a sanity check of the TDNN model’s capability, we also train a vanilla MFCC based model, ‘x-vector (Kaldi)’, and a mel-fbank based model, ‘x-vector’, in PyTorch for reference. In order to eliminate the influence of back-end scoring systems on the final verification results, we simply use cosine similarity for scoring. We report the equal error rate (EER) to compare different systems.
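For readers unfamiliar with this evaluation, the following is a minimal sketch of cosine scoring and a threshold-sweep EER estimate. It is not the evaluation code used in the experiments; the function names are hypothetical, and the EER is approximated at the threshold where the false-acceptance and false-rejection rates are closest.

```python
import numpy as np

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings."""
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    return float(emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping every observed score as a threshold."""
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # False acceptance: non-target trials scored at or above the threshold.
    far = np.array([(nontarget >= t).mean() for t in thresholds])
    # False rejection: target trials scored below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2)

print(cosine_score([1.0, 0.0], [1.0, 1.0]))
print(equal_error_rate([0.9, 0.4], [0.6, 0.1]))
```

Because cosine scoring has no trainable parameters, differences in EER under this back-end can be attributed to the embeddings themselves.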

2.3 Results and analysis

Figure 1: EER (%) comparison of mel-spectrum based models and raw-waveform based models on different test sets with cosine similarity scoring. Error bars show a 95% confidence interval.

In Fig. 1, we demonstrate EER degradation across datasets for raw-waveform based speaker embeddings. In matched test conditions on VoxCeleb datasets, raw-waveform based speaker embeddings perform on par with the three mel-spectrum based systems. However, in the VOiCEs evaluation dataset, both parametric and non-parametric waveform models lead to degradation compared to spectral based models. It is noted that among all of the six methods, only ‘x-vector (Kaldi)’ performs voice activity detection, making it a less fair baseline. This may also be an important reason for the performance mismatches among mel-spectrum baselines.

We also visualize learned filter responses after training on the noise augmented VoxCeleb dataset, shown in Fig. 2. We can see that multi-scale filters and TDFbank are much noisier compared to SincNet, and the frequency resolution in the higher frequencies of multi-scale filters is worse.

Figure 2: Learned filter responses (normalized by the maximum value for better visualization): (a) multi-scale filters, (b) sinc filters, (c) the real part of TDFbank.

3 Robust improvement strategies

In this section, we propose two strategies and discuss their effect on the robustness of raw-waveform speaker embeddings under mismatched conditions. Neither strategy introduces additional parameters or computation. The experimental settings and training details are the same as in Section 2.2, with one exception: we also integrate PLDA scoring for the final comparisons on complete speaker verification systems. We adopt the Gaussian PLDA from Kaldi, which was trained on the augmented VoxCeleb2 training dataset and evaluated on both the VoxCeleb1 and VOiCEs test datasets. Before PLDA training, the extracted speaker embeddings were projected onto a 200-dimensional vector with LDA, followed by whitening and length normalization.

3.1 Proposed method

Analytic filterbanks. In the original TDFbank architecture, real filters and imaginary filters are initialized into analytic pairs with Gabor wavelets to approximate the mel-filterbanks. Then, a magnitude response is computed using L2 pooling across the output of the real and imaginary pairs. Under the original setting, the weights of the real and imaginary filter components are independently trained without any constraints. As a result, although the initial mel-scale of frequency is mostly preserved after training, the analyticity of the initialization is not preserved. Analytic filters [4] are shift-invariant with respect to time, a desirable property for time-frequency representations. Downsampled convolutions or pooling layers in waveform encoders are not shift-invariant, which compromises their performance on robust classification tasks [25]. A natural way to constrain the analyticity of learned complex filterbanks is to learn only the real component of a filter, and to infer the imaginary component directly using the Hilbert transform [17, 3]. In this way, the magnitude of the filter response is shift-invariant and the number of filter parameters is essentially halved. Therefore, in this work, we apply the Hilbert transform to obtain the imaginary filters corresponding to the real filters. We do this for both the non-parametric and parametric sinc filters.
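The idea of inferring the imaginary filter from the real one can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the training code: `analytic_filter` builds the analytic signal with the standard FFT-based discrete Hilbert transform, and `magnitude_response` is a hypothetical helper that applies L2 pooling over the real and imaginary outputs.

```python
import numpy as np

def analytic_filter(real_filter):
    """Return real_filter + j * Hilbert(real_filter) via the FFT method."""
    n = real_filter.shape[-1]
    spectrum = np.fft.fft(real_filter, axis=-1)
    # Keep DC (and Nyquist), double positive frequencies, zero negative ones.
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights, axis=-1)

def magnitude_response(signal, real_filter):
    """Filter with the real/imaginary pair and pool with an L2 norm."""
    pair = analytic_filter(real_filter)
    real_out = np.convolve(signal, pair.real, mode="valid")
    imag_out = np.convolve(signal, pair.imag, mode="valid")
    return np.sqrt(real_out ** 2 + imag_out ** 2)

# A 400-tap cosine filter and a phase-shifted tone at the same frequency.
taps = np.arange(400)
real = np.cos(2 * np.pi * 10 * taps / 400)
tone = np.cos(2 * np.pi * 10 * np.arange(1200) / 400 + 0.7)
envelope = magnitude_response(tone, real)
print(envelope.min(), envelope.max())  # the envelope stays flat across shifts
```

Only `real` would carry trainable parameters here; the imaginary part is recomputed from it, which is why the parameter count is roughly halved and why the pooled magnitude does not oscillate as the input is shifted in time.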

Sparse variational dropout. Observing the noisy filter responses in Fig. 2, we believe that the non-parametric filters tend to overfit the noisy training data, learning nuisance aspects of the recordings. One way to ease this problem is to regularize the network by dropping irrelevant weights with sparse variational dropout (VD) [13]. VD was originally proposed as a model compression technique to sparsify DNN weights. In this work, we follow our previous work [3] to sparsify filters by applying VD in the first layer of the raw-waveform models.

Dropout can be seen as injecting fixed Bernoulli noise or Gaussian noise into weights during training. Instead of setting a fixed variance as in Gaussian dropout (GD), VD injects an individual multiplicative Gaussian noise $\xi_{ij} \sim \mathcal{N}(1, \alpha_{ij})$ to every weight, with the variance $\alpha_{ij}$ consisting of model parameters learned with an approximated KL-divergence measure. By learning an individual variance for every weight, VD is able to induce sparsity across learned weights when $\alpha_{ij} \to \infty$ (equivalent to $p \to 1$ in Bernoulli dropout). In such cases, the weights can be ignored or removed from neural networks during inference time.
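The mechanics of sparse VD can be sketched in a few functions. This follows the formulation of the cited work [13] as we understand it, not the exact code used here: each weight θ has a learned log-variance log σ², its dropout rate is α = σ²/θ², and the constants in the KL approximation and the pruning threshold log α > 3 are taken from [13] rather than from this paper. All function names are hypothetical.

```python
import numpy as np

# Constants of the KL approximation reported in [13] (assumed, not from this paper).
K1, K2, K3 = 0.63576, 1.87320, 1.48695

def log_alpha(theta, log_sigma2, eps=1e-8):
    """Per-weight log dropout rate, log(sigma^2 / theta^2)."""
    return log_sigma2 - np.log(theta ** 2 + eps)

def neg_kl_approx(log_a):
    """Approximate negative KL term; close to 0 for large alpha."""
    sigmoid = 1.0 / (1.0 + np.exp(-(K2 + K3 * log_a)))
    return K1 * sigmoid - 0.5 * np.log1p(np.exp(-log_a)) - K1

def noisy_weights(theta, log_sigma2, rng):
    """Training-time weights: theta * (1 + sqrt(alpha) * eps), eps ~ N(0, 1)."""
    sigma = np.exp(0.5 * log_sigma2)
    return theta + sigma * rng.standard_normal(theta.shape)

def sparsify(theta, log_sigma2, threshold=3.0):
    """Inference-time weights with high-alpha entries set to zero."""
    keep = log_alpha(theta, log_sigma2) < threshold
    return theta * keep

theta = np.array([1.0, 0.01])
log_sigma2 = np.array([-10.0, 0.0])
print(log_alpha(theta, log_sigma2))   # small alpha vs. large alpha
print(sparsify(theta, log_sigma2))    # the second weight is pruned
```

During training, the regularizer would add the negated sum of `neg_kl_approx` over the filter weights to the task loss, pushing uninformative filter taps toward large α so that they can be removed at inference time.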

System | Feature | Vox1-O | Vox1-E | Vox1-H | VOiCEs
x-vector (Kaldi) | MFCC | 2.26 0.256 | 2.37 0.279 | 4.14 0.408 | 6.79 0.553
x-vector | Mel-fbank | 2.37 0.264 | 2.42 0.280 | 4.18 0.406 | 8.14 0.658
x-conv-vector | Mel-fbank | 2.04 0.241 | 2.17 0.252 | 3.79 0.379 | 7.10 0.581
Multi-scale | Waveform | 2.28 0.273 | 2.38 0.285 | 4.17 0.408 | 8.54 0.705
Sinc | Waveform | 2.37 0.287 | 2.32 0.278 | 4.02 0.400 | 8.55 0.682
Sinc+ | Waveform | 2.15 0.270 | 2.28 0.271 | 3.91 0.396 | 8.90 0.669
TDF | Waveform | 1.98 0.230 | 2.19 0.249 | 3.85 0.383 | 8.38 0.663
TDF+ | Waveform | 2.01 0.261 | 2.27 0.263 | 3.98 0.396 | 7.46 0.621
TDF+VD | Waveform | 1.98 0.235 | 2.30 0.264 | 4.05 0.385 | 7.68 0.626
TDF++VD | Waveform | 1.99 0.266 | 2.26 0.253 | 3.93 0.385 | 7.40 0.633
Table 1: EER (%) comparison on different test sets. All models are trained on the noise augmented VoxCeleb2 training set and scored with the PLDA backend. A statistical significance test is performed using a bootstrap procedure [5]: an absolute EER difference of 0.05 for Vox1-E and Vox1-H is outside the 95% confidence interval for all methods, while for Vox1-O and VOiCEs the EER difference has to be larger than 0.15 and 0.13, respectively.

3.2 Results

Comparison. In this experiment, we evaluate the proposed strategies in the same experimental setup as in Section 2. The ‘Multi-scale’, ‘Sinc’ and ‘TDF’ baselines in Table 1 show more degradation on the VOiCEs test set than the spectral baselines, which is consistent with the conclusion in Section 2. Comparing ‘TDF’ and ‘Sinc’ with their corresponding analytic versions, we find that ‘Sinc+’ achieves only a marginal improvement over the ‘Sinc’ baseline on VoxCeleb and a slight degradation on VOiCEs, whereas ‘TDF+’ significantly outperforms the ‘TDF’ baseline on VOiCEs and yields comparable results on VoxCeleb. This shows that the analyticity constraint helps non-parametric filters learn robust representations, but that this is not the case for parametric filters. This may be because the benefit of filter analyticity lies mainly in learning transient components, which cannot be well modeled by the sinc filters anyway. Comparing ‘TDF’ and ‘TDF+VD’, we also observe that VD brings a significant improvement on VOiCEs without compromising the performance on VoxCeleb. Among all of the raw-waveform based systems, ‘TDF++VD’ achieves the best results on the out-of-domain test set, with both VD and analytic filters helping to boost the performance. Compared with the three spectral based models, it achieves comparable results to ‘x-conv-vector’ with a similar model size and training strategy. Note that ‘x-vector (Kaldi)’ achieves similar performance to that reported in [10] on the VOiCEs dataset; this further validates our TDNN backbone implementation.

System | Vox1-O | Vox1-E | Vox1-H | VOiCEs
x-vector (Kaldi) | 3.12 | 2.9 | 4.99 | 8.41
x-vector | 3.12 | 2.94 | 5.07 | 10.78
x-conv-vector | 2.93 | 2.7 | 4.67 | 10.45
TDF | 2.79 | 2.69 | 4.67 | 12.74
TDF+VD | 3.01 | 2.79 | 4.81 | 11.10
TDF+ | 2.72 | 2.81 | 4.86 | 10.72
TDF++BD | 3.06 | 2.77 | 4.83 | 11.69
TDF++GD | 2.98 | 2.73 | 4.83 | 11.29
TDF++VD | 2.72 | 2.72 | 4.72 | 10.32
Table 2: EER (%) comparison on different test sets. All models are trained on the augmented VoxCeleb2 training set and scored with cosine similarity.

Ablation study. In order to better demonstrate the effectiveness of each component without the influence of the scoring backend, we conducted several ablation studies using cosine similarity, as shown in Table 2. The improvement from applying analytic filters is consistent with the PLDA backend results in Table 1. In contrast to the results in Table 1, we find that ‘TDF++VD’ outperforms ‘TDF+’ when cosine similarity is used. Similarly, ‘TDF++VD’ slightly outperforms ‘x-conv-vector’ on VOiCEs. These differences suggest that by dropping filter weights through VD, the final learned speaker embeddings tend to become less Gaussian and hence yield worse results with the PLDA backend. We also experimented with different dropout techniques, shown in the last four rows of Table 2. Bernoulli dropout (BD) and Gaussian dropout (GD) do not improve robustness compared to the ‘TDF+’ baseline, while VD achieves better verification results on both in-domain and out-of-domain tasks.

Figure 3: Examples of learned filters with their maximum response frequency labeled. Top row: ‘TDF+’ filters trained on clean VoxCeleb. Middle row: ‘TDF+’ filters trained on noise augmented VoxCeleb. Bottom row: ‘TDF++VD’ filters trained on noise augmented VoxCeleb.

Filter visualization. In Fig. 3, we visualize several learned non-parametric filters at different frequency bands under different training settings for TDF based methods. When trained on the noisy dataset, the learned filters are less regular and much noisier than filters trained on the clean dataset. With the help of VD, the learned filter at 345 Hz is similar to the one trained without noise, and only the center weights of the filters at 2258 Hz and 7937 Hz are retained. The ‘jitters’ picked up from the noise are not present in the filters. Although there is no significant improvement in EER over the baseline with VD, this supports the view that, during training, raw-waveform models tend to capture nuisance information from noisy data, and indicates that dropping out the corresponding weights does not affect the final performance.

4 Conclusion

In this paper, we performed a systematic empirical study of multiple parametric and non-parametric raw-waveform based speaker embeddings. In comparison to several mel-spectrum baselines, these raw-waveform based methods yield similar results on in-domain tests, but show a more significant degradation on cross-domain tests. In order to bridge this performance gap, we proposed to apply filter analyticity to promote shift-invariance of the learned filters and variational dropout on non-parametric filters to discard task irrelevant information during training. Finally, we observed a significant improvement for non-parametric raw-waveform based embeddings with respect to cosine similarity and PLDA backends, achieving similar performance to the mel-spectrum baselines.

References


  • [1] P. Agrawal and S. Ganapathy (2020) Interpretable representation learning for speech and audio signals based on relevance weighting. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, pp. 2823–2836. Cited by: §1.
  • [2] J. S. Chung, A. Nagrani, and A. Zisserman (2018) VoxCeleb2: deep speaker recognition. In Proc. Interspeech, Cited by: §2.1.
  • [3] F. Cwitkowitz, M. Heydari, and Z. Duan (2021) Learning sparse analytic filters for piano transcription. arXiv preprint arXiv:2108.10382. Cited by: §3.1, §3.1.
  • [4] J. Flanagan (1980) Parametric coding of speech spectra. The Journal of the Acoustical Society of America 68 (2), pp. 412–419. Cited by: §3.1.
  • [5] E. Haasnoot, A. Khodabakhsh, C. Zeinstra, L. Spreeuwers, and R. Veldhuis (2018) FEERCI: a package for fast non-parametric confidence intervals for equal error rates in amortized o(m log n). In 2018 International Conference of the Biometrics Special Interest Group, pp. 1–5. Cited by: Table 1.
  • [6] S. Han, J. Byun, and J. W. Shin (2021) Time-domain speaker verification using temporal convolutional networks. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6688–6692. Cited by: §1, §2.2.
  • [7] J. Jung, S. Kim, H. Shim, J. Kim, and H. Yu (2020) Improved rawnet with feature map scaling for text-independent speaker verification using raw waveforms. arXiv preprint arXiv:2004.00526. Cited by: §1.
  • [8] T. Kim, J. Lee, and J. Nam (2018) Sample-level cnn architectures for music auto-tagging using raw waveforms. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 366–370. Cited by: §1, §2.2.
  • [9] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley (2020) PANNs: large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, pp. 2880–2894. Cited by: §1, §2.2.
  • [10] W. Lin, M. Mak, and L. Yi (2020) Learning mixture representation for deep speaker embedding using attention. In Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, pp. 210–214. Cited by: §3.2.
  • [11] E. Loweimi, P. Bell, and S. Renals (2020) On the robustness and training dynamics of raw waveform models.. In INTERSPEECH, pp. 1001–1005. Cited by: §1, §1.
  • [12] Y. Luo and N. Mesgarani (2019) Conv-tasnet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM transactions on audio, speech, and language processing 27 (8), pp. 1256–1266. Cited by: §2.2.
  • [13] D. Molchanov, A. Ashukha, and D. Vetrov (2017) Variational dropout sparsifies deep neural networks. In International Conference on Machine Learning, pp. 2498–2507. Cited by: §3.1.
  • [14] H. Muckenhirn, M. M. Doss, and S. Marcel (2018) Towards directly modeling raw speech signal for speaker verification using cnns. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4884–4888. Cited by: §1.
  • [15] A. Nagrani, J. S. Chung, and A. Zisserman (2017) VoxCeleb: a large-scale speaker identification dataset. In Proc. Interspeech, Cited by: §2.1.
  • [16] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. , pp. 5206–5210. Cited by: §2.1.
  • [17] M. Pariente, S. Cornell, A. Deleforge, and E. Vincent (2020) Filterbank design for end-to-end speech separation. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6364–6368. Cited by: §1, §3.1.
  • [18] J. Peng, X. Qu, J. Wang, R. Gu, J. Xiao, L. Burget, and J. Černockỳ (2021) ICSpk: interpretable complex speaker embedding extractor from raw waveform. Proc. Interspeech 2021, pp. 511–515. Cited by: §1.
  • [19] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely (2011-12) The Kaldi speech recognition toolkit. In IEEE Workshop on Automatic Speech Recognition and Understanding, Cited by: §2.1.
  • [20] M. Ravanelli and Y. Bengio (2018) Speaker recognition from raw waveform with sincnet. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 1021–1028. Cited by: §1, §2.2.
  • [21] C. Richey, M. A. Barrios, Z. Armstrong, C. Bartels, H. Franco, M. Graciarena, A. Lawson, M. K. Nandwana, A. Stauffer, J. van Hout, et al. (2018) Voices obscured in complex environmental settings (voices) corpus. arXiv preprint arXiv:1804.05053. Cited by: §2.1.
  • [22] Z. Tüske, P. Golik, R. Schlüter, and H. Ney (2014) Acoustic modeling with deep neural networks using raw time signal for lvcsr. In Fifteenth annual conference of the international speech communication association, Cited by: §1.
  • [23] N. Zeghidour, O. Teboul, F. de Chaumont Quitry, and M. Tagliasacchi (2021) LEAF: a learnable frontend for audio classification. ICLR. Cited by: §1, §1.
  • [24] N. Zeghidour, N. Usunier, I. Kokkinos, T. Schatz, G. Synnaeve, and E. Dupoux (2018) Learning filterbanks from raw speech for phone recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5509–5513. Cited by: §1, §2.2.
  • [25] R. Zhang (2019) Making convolutional networks shift-invariant again. In International conference on machine learning, pp. 7324–7334. Cited by: §3.1.
  • [26] G. Zhu, F. Jiang, and Z. Duan (2021) Y-vector: multiscale waveform encoder for speaker embedding. In Interspeech, Cited by: §1, §1, §2.2.