Analysis of Deep Feature Loss based Enhancement for Speaker Verification

by Saurabh Kataria, et al.
Johns Hopkins University

Data augmentation is conventionally used to inject robustness in Speaker Verification systems. Several recently organized challenges focus on handling novel acoustic environments. Deep learning based speech enhancement is a modern solution for this. Recently, a study proposed to optimize the enhancement network in the activation space of a pre-trained auxiliary network. This methodology, called deep feature loss, greatly improved over the state-of-the-art conventional x-vector based system on a children's speech dataset called BabyTrain. This work analyzes various facets of that approach and asks a few novel questions in that context. We first search for the optimal number of auxiliary network activations, training data, and enhancement feature dimension. Experiments reveal the importance of Signal-to-Noise Ratio filtering, which we employ to create a large, clean, and naturalistic corpus for enhancement network training. To counter the "mismatch" problem in enhancement, we find enhancing front-end (x-vector network) data helpful, while enhancing back-end (Probabilistic Linear Discriminant Analysis (PLDA)) data is harmful. Importantly, we find that enhanced signals contain complementary information to the original signals. Established by combining them in the front-end, this gives ~40% relative improvement over the baseline. We also do an ablation study to remove a noise class from x-vector data augmentation and, for such systems, we establish the utility of enhancement regardless of whether it has seen that noise class itself during training. Finally, we design several dereverberation schemes and conclude that the deep feature loss enhancement scheme is ineffective for this task.


1 Introduction

Supervised deep learning based speech enhancement has made significant progress in the last decade. Notable works include masking [1] and mapping [2] based approaches, Speech Enhancement Generative Adversarial Network (SEGAN) [3], Deep Feature Loss (DFL) [4], end-to-end metric optimization [5], and Transformer based approaches [6, 7]. Meanwhile, the robustness of Speaker Verification (SV) systems is an active area of research [8, 9, 10, 11]. Another reason for interest in speech enhancement is that it is considered a modern solution to improve noise robustness in SV systems [10, 12, 13]. Such studies demonstrate that explicit speech enhancement processing is beneficial to the state-of-the-art (SOTA) conventional x-vector and Probabilistic Linear Discriminant Analysis (PLDA) based SV system [14]. We refer to this methodology as task-specific enhancement. Prior work revealed its benefit for other tasks like Speaker Diarization [15], Language Recognition [16], and Automatic Speech Recognition (ASR) [17].

Building on perceptual loss [18], [4] proposed to learn speech enhancement using a pre-trained auxiliary network to obtain (deep feature) loss (Section 2). Authors observed that the usual supervised training with time-domain loss gives poor enhancement performance on low Signal-to-Noise Ratio (SNR) test signals, as confirmed with speech enhancement metrics like Perceptual Evaluation of Speech Quality (PESQ) and Signal-to-Distortion Ratio (SDR). Therefore, they suggested to instead minimize the deviation of auxiliary network activations of enhanced and (reference) clean signals. Here, enhanced signals refer to the output of enhancement network (Figure 1).

Recently, [10] proposed a test-time feature denoising approach based on [4] and reported large gains over the SOTA data augmented x-vector based SV system. Since the conventional x-vector system can tackle clean signals such as in the Speakers In The Wild (SITW) dataset [14, 19], the authors chose the DFL technique for its potential to handle low SNR signals. Due to their primary focus on final SV performance, they chose a speaker classification/embedding network as the auxiliary network. Such enhancement preserves speaker information. They reported results on a single-channel wide-band (16 kHz) dataset called BabyTrain, which consists of daylong recordings of children's speech in noisy and reverberant environments [20]. The main contribution of this study is to explore in depth various facets of DFL, ask some novel analysis-oriented questions, and present evaluation on real data (BabyTrain). We now describe the significance of all experiment sections.

Section 5.1 reproduces the gains observed with the DFL based enhancement, as done in [10]. Furthermore, it judges the utility of activations from the deeper and, especially, the last layer (i.e. speaker embedding layer) of the auxiliary network. Motivation for this comes from the common knowledge that a convolutional network contains high level information such as speaker identity primarily in the initial layers [21]. [4] used only first few layers and our preliminary experiments on their setup revealed degradation by incorporating deeper layer activations. However, their data setting was small (on VCTK corpus [22]) and a much larger data setting such as ours is better suited to investigate this.

Section 5.2 investigates the choice of training data for enhancement and auxiliary network. For training enhancement network, it is imperative to have a clean, large, and naturalistic corpus. For this, [10] chose a (high) SNR-filtered version of VoxCeleb [23, 24]. In DFL training, activations of noisy signals come from auxiliary network (Equation 1). Hence, it remains an open question if a stronger auxiliary network i.e. one trained with (noisy) data augmentations is superior. Training data choice is important to us because we focus on BabyTrain and large “in the wild” public data releases such as SITW [19], VoxCeleb [23], and CN-Celeb [9] do not explicitly account for children speech.

Section 5.3 asks whether it is beneficial to use higher dimensional features in the enhancement network. For uniformity, we start with same features (40-dimensional log Mel-filterbank (LMFB)) for the enhancement, auxiliary, and x-vector network. Then, we quantify the effect of increasing feature dimension for the former network while keeping it fixed for the others. This idea of using different features for different networks is promising because most feature-domain enhancement studies work with spectrogram features. They have higher dimension than the standard 40-D LMFB features [14] and we experiment with them too.

Section 5.4 explores whether enhancement of PLDA and x-vector network data brings improvement on top of simple test set enhancement scheme. Enhancement of data other than test sets can, potentially, counter the distortion introduced by enhancement and reduce the mismatch among test, PLDA, and x-vector network data. This is a notable problem in speech enhancement [17, 25, 26]. Note that enhancing x-vector data means training x-vector network with enhanced features.

Section 5.5 considers a different viewpoint to Section 5.4 and asks whether enhanced signals contain useful/complementary information to original signals. We investigate this by including both enhanced and original signals in PLDA and x-vector data. Such analysis should provide insight into the nature of enhanced signals. It is worthwhile to do as our enhancement setup is in (filterbank) feature domain and it is implausible to calculate time-domain metrics like SDR and PESQ for analysis.

Section 5.6 tests the effectiveness of enhancement when a noise class is missing from data augmentation of the x-vector network. While designing a generic x-vector based SV system, it is a common practice to mix clean data with several noise classes such as music, babble, and general environmental noises. We use this particular notion of data augmentation in this study. This may not be optimal for the deployed environment and may even cause performance degradation. Thus, enhancement as a solution to robustness of SV is attractive, provided the enhancer generalizes well. This section quantifies this generalization. In this “leave-one-out” analysis, we separately consider the cases when enhancement has or has not seen the missing class. This analysis is akin to finding harmful and/or superfluous noise classes during data augmentation and, thereby, similar in motivation to ablation and pruning work in deep learning [27, 28].

Section 5.7 addresses an important extension to [10]: effectiveness of DFL enhancement for dereverberation for SV. Weighted Prediction Error (WPE) [29] is widely regarded as the SOTA dereverberation technique. Recently, a Generative Adversarial Network (GAN) based domain-adaptation work outperformed it in a large scale setting [30, 31]. We design several dereverberation schemes based on DFL. Several of these schemes also include denoising, since dereverberation alone may be ineffective for final performance gains.

Figure 1: A schematic of deep feature loss scheme

2 Deep Feature Loss

Perceptual loss or Deep Feature Loss [18, 4] refers to the extraction of loss from a pre-trained auxiliary network by comparing its activations for the enhanced and the reference clean signal. To obtain this, we manually pre-select a few hidden layers of the auxiliary network. The main idea is to enhance while retaining high-level properties of the signal. These properties depend on the choice of the auxiliary task. With a speaker embedding/classification network (in our case), enhancement preserves speaker information. Mathematically, DFL using $L$ hidden layers of the auxiliary network is:

$$\mathcal{L}_{\mathrm{DFL},L}(\mathbf{F}_n, \mathbf{F}_c) = \sum_{i=1}^{L} \left\lVert a_i(\mathbf{F}_c) - a_i\big(e(\mathbf{F}_n)\big) \right\rVert_{1,1} \tag{1}$$

Here, $\mathbf{F}_n$ and $\mathbf{F}_c$ refer to the noisy and clean feature matrices of size $D \times T$, $D$ is the feature dimension, $T$ is the number of frames, $L$ is the number of hidden layers considered for DFL computation, $i$ is the index for such layers, $a$ is the auxiliary network (with $a_i$ its activations at the $i$-th selected layer), and $e$ is the enhancement network. A corresponding visual description is in Figure 1. The maximum value of $L$ is 5, which refers to the 5 equidistant hidden layers preselected in our auxiliary network. We handle final layer activations exclusively by the loss denoted by $\mathcal{L}_{\mathrm{emb}}$. $\mathcal{L}_{\mathrm{FL}}$ refers to the usual feature loss, i.e. without using the auxiliary network. Importantly, we do not use the x-vector network itself for extracting DFL because it may not be optimal, as noted in Section 5.2.
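As a minimal sketch of the loss described above, the following pure-Python example accumulates an element-wise absolute difference between auxiliary-network activations of the clean and the enhanced features over a chosen number of layers. The toy layer functions, the helper names, and the use of an L1 distance are illustrative assumptions and not the authors' implementation.

```python
from typing import Callable, List

Matrix = List[List[float]]  # D x T feature matrix (rows: feature bins, columns: frames)


def l1_distance(a: Matrix, b: Matrix) -> float:
    """Sum of absolute element-wise differences of two equally sized matrices."""
    return sum(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def deep_feature_loss(noisy: Matrix, clean: Matrix,
                      enhance: Callable[[Matrix], Matrix],
                      aux_layers: List[Callable[[Matrix], Matrix]],
                      num_layers: int) -> float:
    """Compare auxiliary activations of clean vs. enhanced features.

    Each entry of aux_layers stands for the (frozen) part of the auxiliary
    network between two consecutive pre-selected layers.
    """
    act_clean, act_enh = clean, enhance(noisy)
    loss = 0.0
    for layer in aux_layers[:num_layers]:
        act_clean = layer(act_clean)
        act_enh = layer(act_enh)
        loss += l1_distance(act_clean, act_enh)
    return loss


# Toy stand-ins (hypothetical) for the auxiliary layers and the enhancer.
def double(m: Matrix) -> Matrix:
    return [[2.0 * x for x in row] for row in m]


def relu(m: Matrix) -> Matrix:
    return [[max(0.0, x) for x in row] for row in m]


def identity(m: Matrix) -> Matrix:
    return m


noisy = [[1.0, -2.0]]
clean = [[0.0, -1.0]]
print(deep_feature_loss(noisy, clean, identity, [double, relu], num_layers=2))  # 6.0
```

In an actual training loop only the enhancement network would receive gradient updates, while the auxiliary network stays fixed.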

3 Neural Networks Architectures

3.1 Enhancement network

We choose the Convolutional Neural Network (CNN) based Context Aggregation Network (CAN) from [10], except with a higher number of channels (90). It is inspired by the CAN in [4]. Its main features are linearly increasing dilations (1 to 8), eight convolution layers, Adaptive Batch Normalization (BN), LeakyReLU activations, and three Temporal Squeeze Excitation (TSE) [10] connections along with residual connections.

The final layer linearly maps the output to the input dimension and a subsequent logarithm operation predicts the Time-Frequency (TF) mask [1]. To mimic Signal Approximation (SA) loss [1], we add this log-domain mask to the original input (multiplication in linear domain) to predict the final enhanced features. We found this global skip connection significantly helpful in our preliminary experiments. The network has a context length of 73 frames and 10.2M parameters. Since the main feature of CAN is high context, we tried increasing its receptive field but observed degradation in our preliminary experiments.
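The stated context length can be sanity-checked with the usual receptive-field formula for stacked dilated convolutions. The sketch below assumes a kernel size of 3 along time and stride 1, neither of which is stated here; under that assumption, eight layers with dilations 1 to 8 cover 73 frames. It also illustrates that adding a log-domain mask to log features is a multiplication in the linear domain. The function names are hypothetical.

```python
import math


def receptive_field(dilations, kernel_size=3):
    """Receptive field (in frames) of stacked stride-1 dilated convolutions."""
    field = 1
    for dilation in dilations:
        field += (kernel_size - 1) * dilation
    return field


def apply_log_mask(log_features, log_mask):
    """Global skip connection: add a log-domain mask to log-domain features."""
    return [x + m for x, m in zip(log_features, log_mask)]


print(receptive_field(range(1, 9)))  # 73

# Adding log(0.5) to log(4.0) corresponds to 4.0 * 0.5 in the linear domain.
enhanced = apply_log_mask([math.log(4.0)], [math.log(0.5)])
print(math.isclose(math.exp(enhanced[0]), 2.0))  # True
```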

3.2 Auxiliary network

The auxiliary network used in this work is the 16 kHz version of the ResNet-34 network described in [32, 33, 14]. We select this network due to its good performance on SV [32]. It is a 2D CNN based ResNet-34 residual network [34] with Learnable Dictionary Encoding (LDE) pooling [35] and Angular Softmax loss function [36, 37]. The dictionary size of LDE is 64 and the network has 5.9M parameters.

3.3 x-vector network

We choose the Extended TDNN (E-TDNN) architecture introduced in [38]. E-TDNN greatly improves upon the Time-Delay Neural Network (TDNN) by interleaving dense layers with convolution layers and employing a (slightly) wider temporal context. It has 10M trainable parameters. A summary of its exact specification is in [14]. [10] prefers a larger Factorized TDNN (F-TDNN) network due to its superior performance over E-TDNN. Since several of our experiments require re-training of the x-vector network, we choose E-TDNN to facilitate faster experimentation. Note that E-TDNN gives competitive performance [14] and, therefore, is suitable for our analysis-oriented work.

4 Experimental Setup

4.1 Dataset details

We combine VoxCeleb1 and VoxCeleb2 [39, 23, 40] to create voxceleb. We then concatenate utterances from the same video to create voxcelebcat (or vc). This gives us 2710 hrs of relatively clean audio with 7185 speakers. voxcelebcat_div2 (or vc_div2) refers to a random 50% subset of voxcelebcat. We use an SNR estimation algorithm called Waveform Amplitude Distribution Analysis (WADA-SNR) [24] to retain the top 50% cleanest samples from voxcelebcat to create voxcelebcat_wadasnr (or vc.w). This is 1665 hrs of audio with 7104 speakers.
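The filtering step can be sketched as ranking utterances by an estimated SNR and keeping the top half. The SNR values below are made-up placeholders standing in for the output of an estimator such as WADA-SNR, which is not implemented here; the function and variable names are hypothetical.

```python
def keep_cleanest(utt_to_snr, fraction=0.5):
    """Return the utterance ids with the highest estimated SNR.

    utt_to_snr maps an utterance id to its estimated SNR in dB.
    """
    ranked = sorted(utt_to_snr, key=utt_to_snr.get, reverse=True)
    n_keep = int(len(ranked) * fraction)
    return ranked[:n_keep]


# Placeholder SNR estimates (dB); real values would come from an SNR estimator.
estimated_snr_db = {"utt1": 25.0, "utt2": 3.5, "utt3": 17.2, "utt4": 9.8}
print(keep_cleanest(estimated_snr_db))  # ['utt1', 'utt3']
```

A random 50% subset (as in vc_div2) would instead sample ids without looking at the SNR, which is the comparison made later in Section 5.2.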

To create the noisy counterpart, we use noise utterances from the MUSAN [41] and DEMAND [42] corpora. We make the reverberant counterpart using impulse responses of small and medium size rooms from the Aachen Impulse Response (AIR) database. A 90-10 split gives us the training and validation lists for the enhancement system. Lastly, we use MUSAN and DEMAND to corrupt voxcelebcat to form voxcelebcat_combined (or vcc). Its size is three times that of voxcelebcat. “libri” refers to the LibriSpeech dataset [43]. Unless specified otherwise, we train the auxiliary network and x-vector network with voxcelebcat_wadasnr and voxcelebcat_combined respectively.
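Creating a noisy counterpart typically amounts to scaling a noise segment so that the mixture reaches a target SNR and adding it to the clean signal. The sketch below shows that arithmetic on plain Python lists. The target SNR value and the function names are illustrative assumptions, since the SNR ranges used for corruption are not listed here.

```python
import math


def power(signal):
    """Average power of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean speech so that the result has the requested SNR (dB).

    Both inputs are assumed to have the same length.
    """
    gain = math.sqrt(power(clean) / (power(noise) * 10 ** (snr_db / 10)))
    return [c + gain * n for c, n in zip(clean, noise)]


clean = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
noisy = mix_at_snr(clean, noise, snr_db=10.0)

# Recover the achieved SNR from the added component as a check.
added = [m - c for m, c in zip(noisy, clean)]
achieved_snr_db = 10 * math.log10(power(clean) / power(added))
```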

For evaluation on real data, we choose the BabyTrain corpus, which is based on the Homebank repository [20]. It consists of day-long children's speech in uncontrolled noisy and reverberant environments. Recordings are made in the presence of a varying (dynamic) number of background speakers. Training data for diarization and detection (adaptation data) has a duration of 130 and 120 hrs respectively. Enrollment and test data are 95 and 30 hrs respectively. The classification of enrollment and test utterances is as follows. test>=t and enroll=n refer to test and enrollment utterances containing at least t seconds and exactly n seconds of speech from the speaker of interest, respectively; the test conditions reported in our tables use t ∈ {0, 5, 15, 30}. For enrollment utterances, time marks of the target speaker are present but not for the test utterances. There may be multiple speakers present in the test utterances. Scripts for the data preparation described above are available online.

4.2 Training details

We train CAN with a batch size of 32, a learning rate of 0.001 (exponentially decreasing), 6 epochs, the Adam optimizer [44], and sequences of 500 frames (5s audio). Code for CAN is available online as “DFL_TSEResCAN2d_SmallContext_LogSigMask_BNIn”. Unless otherwise stated, input features are un-normalized 40-D LMFB features. We train the auxiliary network with a batch size of 128, 50 epochs, the Adam optimizer [44], a learning rate of 0.0075 (exponentially decreasing) with warmup [6], and sequences of 800 frames (8s audio). Since this network is a CNN, we use mean-normalized LMFB features, which have spatial information, contrary to Mel-Frequency Cepstrum Coefficient (MFCC) features. To account for this normalization mismatch with the enhancement network, we insert an online mean normalization between them during DFL training. For E-TDNN training, we use Kaldi [45] scripts with 40-D MFCC features which have silence removed and are mean-normalized.
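The mean normalization inserted between the enhancement and auxiliary networks can be sketched as follows. This assumes normalization over the time axis of each training sequence, separately for each filterbank bin; whether the original system uses a whole-sequence or a sliding-window mean is not stated here, and the function name is hypothetical.

```python
def mean_normalize(features):
    """Subtract the per-bin mean over time from a D x T feature matrix."""
    normalized = []
    for row in features:  # one row per filterbank bin
        mean = sum(row) / len(row)
        normalized.append([x - mean for x in row])
    return normalized


# Two filterbank bins, three frames.
enhanced_lmfb = [[1.0, 2.0, 3.0],
                 [4.0, 4.0, 4.0]]
print(mean_normalize(enhanced_lmfb))  # [[-1.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
```

Because the operation is a fixed linear map, gradients of the deep feature loss can still flow through it to the enhancement network.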

4.3 Evaluation details

The PLDA-based back-end consists of a 200-D Linear Discriminant Analysis (LDA) with generative Gaussian SPLDA [32]. Additionally, we use a diarization system since BabyTrain contains babble noise (background speakers). For this, we followed the Kaldi x-vector Callhome diarization recipe [46]. Details are in the JHU-CLSP diarization system as described in [32]. Note that, in general, “enhancement of test set” refers to enhancing test, enroll, and adaptation data. For the final evaluation, we use standard metrics like Equal Error Rate (EER) and minimum Detection Cost Function (minDCF) at a target prior of 0.05 (NIST SRE18 VAST operating point [47]). Except for the Kaldi based x-vector training, we develop the entire framework using the Hyperion library and PyTorch.
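For readers unfamiliar with the two metrics, the sketch below computes them from lists of target and non-target trial scores by sweeping a decision threshold. It is a simplified illustration, not the scoring tool used in this work: unit miss and false-alarm costs and normalization by the smaller prior are assumptions, and the scores and the prior value in the example are made up.

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """Miss and false-alarm rates when accepting scores >= threshold."""
    miss = sum(s < threshold for s in target_scores) / len(target_scores)
    false_alarm = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return miss, false_alarm


def equal_error_rate(target_scores, nontarget_scores):
    """EER (as a fraction): operating point where miss and false-alarm rates meet."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        miss, fa = error_rates(target_scores, nontarget_scores, t)
        gap = abs(miss - fa)
        if gap < best_gap:
            best_gap, eer = gap, (miss + fa) / 2
    return eer


def min_dcf(target_scores, nontarget_scores, p_target):
    """Minimum normalized detection cost with unit miss/false-alarm costs."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores)) + [float("inf")]
    costs = []
    for t in thresholds:
        miss, fa = error_rates(target_scores, nontarget_scores, t)
        costs.append(p_target * miss + (1 - p_target) * fa)
    return min(costs) / min(p_target, 1 - p_target)


targets = [2.0, 1.5, 0.3, 1.0]       # made-up scores of same-speaker trials
nontargets = [0.1, 0.5, -1.0, 1.2]   # made-up scores of different-speaker trials
print(equal_error_rate(targets, nontargets))  # 0.25
print(min_dcf(targets, nontargets, p_target=0.05))
```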


EER test>=30s test>=15s test>=5s test>=0s mean
no-enh 5.78 8.78 12.34 12.71 9.90
(*) 5.14 7.17 11.02 11.41 8.68
6.28 8.90 12.35 12.71 10.06
5.66 8.11 11.40 11.79 9.24
5.38 7.84 11.07 11.47 8.94
5.63 7.96 11.26 11.62 9.12
5.32 7.75 10.83 11.18 8.77
5.93 8.36 11.79 12.16 9.56
5.73 8.38 11.84 12.19 9.54
minDCF test>=30s test>=15s test>=5s test>=0s mean
no-enh 0.255 0.386 0.492 0.499 0.408
(*) 0.204 0.333 0.441 0.448 0.357
0.239 0.370 0.478 0.485 0.393
0.218 0.343 0.452 0.459 0.368
0.210 0.331 0.439 0.447 0.357
0.213 0.342 0.452 0.459 0.367
0.215 0.334 0.441 0.449 0.360
0.218 0.338 0.446 0.453 0.364
0.215 0.334 0.441 0.448 0.360
Table 1: Baseline results

5 Experiments

5.1 Baseline results

In Table 1, we reproduce the claims of [10]. The last column refers to the mean metric value per row. We organize results for EER and minDCF separately. Boldface results signify the best value achieved per column per metric. Note that the x-vector network is trained with augmentation in all cases and enhancement is applied on adaptation data, enrollment, and test utterances. That is, we use the default test-time enhancement scheme as mentioned in Section 4.3.

“no-enh” refers to the case when enhancement is not used in the SV pipeline. The row marked (*) refers to the results obtained with DFL using all intermediate hidden layers of the auxiliary network. We note relative improvements of 12.3% and 12.5% in mean EER and minDCF respectively w.r.t. “no-enh”. Feature loss alone leads to smaller gains, in contrast to the degradation reported in [10]. This variation is perhaps due to the use of a different x-vector network in this work. Combining it with DFL gives better results. We note that adding the auxiliary network speaker embedding layer loss does not lead to improvement. This suggests that not all hidden activations from the auxiliary network are necessarily useful for final performance. Using fewer layers in DFL does not lead to a consistent trend. Nevertheless, the (*) system gives the best performance for both metrics and it serves as the baseline for this work. These baseline results are present in all results tables under different names but are all denoted by (*).
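The relative improvements quoted above follow directly from the “mean” column of Table 1. A short worked check, using only values from the table (the function name is hypothetical):

```python
def relative_improvement(baseline, system):
    """Relative reduction (in percent) of an error metric w.r.t. a baseline."""
    return 100.0 * (baseline - system) / baseline


# Mean EER: no-enh 9.90 vs. (*) 8.68
print(round(relative_improvement(9.90, 8.68), 1))    # 12.3
# Mean minDCF: no-enh 0.408 vs. (*) 0.357
print(round(relative_improvement(0.408, 0.357), 1))  # 12.5
```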

Importantly, note that results under “test>=0s” represent final average performance on BabyTrain. “mean” refers to the weighted mean performance with higher weight for longer test trials. In practice, it is uncommon to have very small test utterances. Therefore, for this practical significance, we consider “mean” for final model comparisons in this work. For simplicity in reading all tables, reader may focus on “mean” performance.

EER test>=30s test>=15s test>=5s test>=0s mean
no-enh 5.78 8.78 12.34 12.71 9.90
vc.w-vc.w (*) 5.14 7.17 11.02 11.41 8.68
vc.w-vc 5.63 8.12 11.37 11.74 9.22
vc.w-vcc 5.19 7.81 11.02 11.39 8.85
vc-vc.w 5.33 7.87 11.17 11.57 8.99
vc-vc 5.62 8.25 11.63 12.00 9.38
vc-vcc 5.43 8.16 11.44 11.80 9.21
vc_div2-vc.w 5.29 8.10 11.51 11.89 9.20
libri-vc.w 6.00 9.08 12.68 13.06 10.21
minDCF test>=30s test>=15s test>=5s test>=0s mean
no-enh 0.255 0.386 0.492 0.499 0.408
vc.w-vc.w (*) 0.204 0.333 0.441 0.448 0.357
vc.w-vc 0.215 0.335 0.444 0.451 0.361
vc.w-vcc 0.210 0.330 0.440 0.447 0.357
vc-vc.w 0.220 0.344 0.450 0.457 0.368
vc-vc 0.226 0.345 0.453 0.460 0.371
vc-vcc 0.222 0.349 0.456 0.463 0.373
vc_div2-vc.w 0.204 0.335 0.444 0.451 0.359
libri-vc.w 0.232 0.357 0.464 0.471 0.381
Table 2: Choice of training data for enhancement and auxiliary network. “vc” is VoxCeleb, “vc.w” is 50% WADASNR-filtered VoxCeleb, “vc_div2” is 50% random subsampled VoxCeleb, “vcc” is VoxCeleb with 3x augmentations, “libri” is LibriSpeech.

5.2 Choice of training data for enhancement and auxiliary network

Table 2 presents the results obtained with different choices of training data for the enhancement and auxiliary networks. Here, training data for the enhancement network refers to the clean data counterpart required for creating training pairs for supervised learning. A preliminary WADA-SNR analysis of VoxCeleb (“vc”) revealed the presence of several low SNR signals. For this reason, we use SNR estimation to retain the top 50% cleanest utterances from “vc” to form “vc.w”. The first column of each result row in Table 2 specifies the training data for the enhancement and auxiliary network respectively (separated by “-”).

We make a few prominent observations. First, by comparing enhancers trained with “vc” and “vc.w” as enhancement network training data, we find using full VoxCeleb (“vc”) harmful for both metrics. This suggests “vc” may not be clean enough for training the enhancer and some filtering may be necessary. Second, using “vc_div2” in place of “vc.w” degrades EER, which suggests that SNR-based filtering is better than random subsampling. Third, we test the hypothesis that cleaner data (LibriSpeech) helps further and find that it gives the worst performance. This establishes the superiority of VoxCeleb, perhaps due to its diverse and spontaneous conversational nature, in contrast to the read speech of LibriSpeech. Fourth, in our DFL formulation, we obtain activations of noisy samples from the auxiliary network (Equation 1). We do not observe gains by using a stronger auxiliary network (trained with “vc” or “vcc”). This is contrary to the popular notion that even clean test files benefit from data augmentation [49]. This indicates that using the x-vector network for deep feature loss extraction may not be optimal, as hinted in Section 2. To sum up, we obtain the best results with SNR-filtered VoxCeleb for both networks (“vc.w-vc.w”).

EER test>=30s test>=15s test>=5s test>=0s mean
no-enh 5.78 8.78 12.34 12.71 9.90
LMFB-40D (*) 5.14 7.17 11.02 11.41 8.68
LMFB-80D 6.46 10.14 13.83 14.22 11.16
LMFB-100D 6.43 9.76 13.40 13.79 10.85
LMFB-120D 6.84 10.14 13.77 14.17 11.23
spectrogram-256D 5.72 8.91 12.49 12.84 9.99
minDCF test>=30s test>=15s test>=5s test>=0s mean
no-enh 0.255 0.386 0.492 0.499 0.408
LMFB-40D (*) 0.204 0.333 0.441 0.448 0.357
LMFB-80D 0.276 0.444 0.546 0.553 0.455
LMFB-100D 0.284 0.436 0.539 0.545 0.451
LMFB-120D 0.288 0.446 0.546 0.552 0.458
spectrogram-256D 0.242 0.390 0.492 0.498 0.406
Table 3: Enhancement with mismatch between enhancement and x-vector/aux. network acoustic features. First column indicates enhanced features, x-vec/aux. networks always use 40D LMFB.

5.3 Enhancement with mismatch between enhancement and x-vector/aux. network acoustic features

Table 3 presents the results obtained by varying the feature used in the enhancement network. Result rows specify the feature dimension against the name of the feature. Features (40-D LMFB) for the auxiliary and x-vector networks remain unchanged. A trainable linear layer bridges the enhancement and auxiliary networks to handle the mismatch of the feature dimensions for these networks. We note that all higher dimensional feature models result in a similar level of degradation, except for the spectrogram, which leads to less degradation. As additional evidence, we observed higher variance in the training and validation losses for these networks. This degradation is perhaps because learning with higher dimensional features requires more data. A fair comparison study should, correspondingly, vary the training data amounts, but we do not investigate that. Another option to avoid degradation could be to use the same higher-dimensional features for all three networks. However, that leads to increased training complexity and, possibly, worse performance, as suggested by the popularity of low-dimensional features like 40-D LMFB in SOTA SV systems [14].

EER test>=30s test>=15s test>=5s test>=0s mean
no-enh 5.78 8.78 12.34 12.71 9.90
test (*) 5.14 7.17 11.02 11.41 8.68
PLDA,test 4.93 7.58 10.93 11.34 8.70
train,test 5.36 8.01 11.25 11.63 9.06
train,PLDA,test 6.74 10.23 14.27 14.71 11.49
minDCF test>=30s test>=15s test>=5s test>=0s mean
no-enh 0.255 0.386 0.492 0.499 0.408
test (*) 0.204 0.333 0.441 0.448 0.357
PLDA,test 0.211 0.340 0.449 0.456 0.364
train,test 0.199 0.315 0.425 0.432 0.343
train,PLDA,test 0.295 0.443 0.551 0.558 0.462
Table 4: Effect of enhancing PLDA and/or x-vector data on top of test set enhancement

5.4 Effect of enhancing PLDA and/or x-vector data on top of test set enhancement

Table 4 presents the results for systems with enhancement of PLDA and/or x-vector train data (train) on top of test set (test, enroll, and adaptation data) enhancement (test). The first column lists the datasets that undergo enhancement processing. We find enhancing PLDA data (slightly) harmful. Enhancing x-vector data gives the best minDCF, while enhancing both x-vector and PLDA data gives the worst performance, even worse than the case of no enhancement. This suggests that PLDA is susceptible to enhancement processing. This finding is contrary to the notion that enhancement of all datasets solves the mismatch problem [26].

5.5 Augmentation with enhanced features

Table 5 presents results for this experiment. Here, “test (*)” and “PLDA,test” (from Table 4) represent enhancement of test set and test set along with PLDA data respectively. To gain insight into the nature of enhanced signals, we investigate if they contain complementary information to original signals. “aug-in-PLDA” refers to including enhanced signals with original (non-enhanced) in PLDA data. In Section 5.4, we noted that training PLDA with enhanced data gives worse performance compared to training with original data. Here, combining them causes further degradation.

The next experiment is “aug-in-train”, which refers to training the x-vector network with original as well as enhanced data. This doubles the training data and time but, nevertheless, counts as a fair investigation since we train all x-vector networks till convergence and don’t introduce any new data here. Note that we assign the same speaker label to the enhanced signal as to the original. Doing this brings huge (relative) improvements of ~40% in both metrics. This strongly supports our hypothesis that enhanced signals contain useful complementary information. This is a novel finding, albeit a computationally expensive one. “aug-in-train,PLDA” is an extension of “aug-in-train”. It refers to the inclusion of enhanced and original signals in x-vector as well as PLDA data. This leads to some degradation with respect to “aug-in-train”. Thus, it is our consistent observation that PLDA is susceptible to enhancement processing and is best trained with unenhanced data. It is useful to reiterate that in our enhancement schemes, the test set is always enhanced.
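The “aug-in-train” data preparation can be sketched as duplicating every training utterance with its enhanced version while reusing the speaker label. The tuple layout, the “-enh” id suffix, and the toy enhancer below are hypothetical choices for illustration only.

```python
def augment_with_enhanced(train_list, enhance):
    """Return original plus enhanced copies of every (utt_id, speaker, features) item.

    The enhanced copy keeps the speaker label of the original utterance.
    """
    augmented = []
    for utt_id, speaker, features in train_list:
        augmented.append((utt_id, speaker, features))
        augmented.append((utt_id + "-enh", speaker, enhance(features)))
    return augmented


def toy_enhance(features):
    """Placeholder for the trained enhancement network."""
    return [0.5 * x for x in features]


train_list = [("utt1", "spkA", [2.0, 4.0]), ("utt2", "spkB", [6.0])]
augmented = augment_with_enhanced(train_list, toy_enhance)
print(len(augmented))  # 4
print(augmented[1])    # ('utt1-enh', 'spkA', [1.0, 2.0])
```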

EER test>=30s test>=15s test>=5s test>=0s mean
no-enh 5.78 8.78 12.34 12.71 9.90
test (*) 5.14 7.17 11.02 11.41 8.68
PLDA,test 4.93 7.58 10.93 11.34 8.70
aug-in-PLDA 5.31 8.06 11.48 11.87 9.18
aug-in-train 3.34 4.99 7.53 7.92 5.95
aug-in-train,PLDA 3.38 5.13 7.78 8.19 6.12
minDCF test>=30s test>=15s test>=5s test>=0s mean
no-enh 0.255 0.386 0.492 0.499 0.408
test (*) 0.204 0.333 0.441 0.448 0.357
PLDA,test 0.211 0.340 0.449 0.456 0.364
aug-in-PLDA 0.219 0.350 0.459 0.466 0.374
aug-in-train 0.128 0.209 0.300 0.309 0.237
aug-in-train,PLDA 0.132 0.215 0.307 0.315 0.242
Table 5: Augmentation with enhanced features
EER test>=30s test>=15s test>=5s test>=0s mean
no-enh 5.78 8.78 12.34 12.71 9.90
test-enh (*) 5.14 7.17 11.02 11.41 8.68
noise 7.36 10.90 15.02 15.44 12.18
enh-unseen 5.88 9.59 13.51 13.93 10.73
enh-seen 6.30 9.87 13.97 14.38 11.13
music 4.99 7.01 9.93 10.28 8.05
enh-unseen 4.42 6.52 9.54 9.96 7.61
enh-seen 4.35 6.38 9.34 9.74 7.45
babble 4.98 7.59 11.04 11.46 8.77
enh-unseen 4.13 6.56 9.61 10.03 7.58
enh-seen 4.07 6.64 9.82 10.26 7.70
chime3bg 5.49 7.66 10.69 11.04 8.72
enh-unseen 4.83 7.48 10.51 10.88 8.43
enh-seen 4.97 7.66 10.70 11.05 8.59
minDCF test>=30s test>=15s test>=5s test>=0s mean
no-enh 0.255 0.386 0.492 0.499 0.408
test-enh (*) 0.204 0.333 0.441 0.448 0.357
noise 0.414 0.525 0.618 0.624 0.545
enh-unseen 0.334 0.474 0.572 0.578 0.489
enh-seen 0.333 0.484 0.586 0.592 0.499
music 0.255 0.355 0.454 0.461 0.381
enh-unseen 0.217 0.327 0.424 0.432 0.350
enh-seen 0.213 0.326 0.425 0.433 0.349
babble 0.247 0.357 0.458 0.465 0.382
enh-unseen 0.213 0.324 0.423 0.431 0.348
enh-seen 0.206 0.320 0.419 0.426 0.343
chime3bg 0.302 0.420 0.523 0.530 0.444
enh-unseen 0.264 0.402 0.509 0.515 0.423
enh-seen 0.257 0.392 0.499 0.506 0.414
Table 6: Leave-one-out noise class in x-vector data. Each block leaves one noise type out of x-vector training. The first row in each block is without enhancement, “enh-unseen” trains enh. without the leave-out noise, “enh-seen” trains enh. with all noises.

5.6 Leave-one-out noise class in x-vector data

Table 6 summarizes the findings for this experiment. As before, “no-enh” and “test-enh (*)” serve as reference results. In our case, we have four noise classes, namely, noise, music, babble, and chime3bg. In simulated data settings, the introduction of new noise classes in x-vector data usually leads to performance gains. However, these augmentations can be harmful for real data, as established by the result rows which contain a noise class name in the first column. They represent four SV systems with x-vector data missing one noise class. These results don’t include enhancement and, thus, are comparable with the “no-enh” system, which has seen all noise classes. We find that omitting the music class in x-vector data gives the best performance on BabyTrain. Similarly, omitting babble and chime3bg leads to performance better than “no-enh”. It is impossible to know a priori which noise class will hurt final performance. Therefore, speech enhancement is an appealing solution for improving robustness.

For all four SV systems, we report the benefit of using our enhancement scheme. “enh-seen” and “enh-unseen” refer to cases when enhancement network training has or has not seen the noise class respectively. Underlined numbers refer to the best performance per SV system. Enhancement helped all four systems individually. As expected, the enhancement system which has seen the missing noise class achieves the best performance (except for noise). Importantly, this shows that enhancement helps even when a noise class is missing from x-vector training, regardless of whether it has seen that noise class itself or not. However, “test-enh (*)” is worse than the best performance achieved in this ablation experiment, which reveals that the current enhancement scheme is not strong enough to counter the degradation caused by harmful data augmentations. This also highlights the scope for improvement of the enhancement scheme. Lastly, we note that omitting noise (general environmental noises) brings degradation, suggesting the importance of complex environmental noises in training. Thus, incorporating noise files from Voices2019 [50], DCASE Challenge, and AudioSet [51] can be useful in our framework.
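The experimental grid of this leave-one-out study can be enumerated as below. The dictionary keys are hypothetical names; the noise class names and the meaning of “enh-unseen” and “enh-seen” follow the description of Table 6.

```python
NOISE_CLASSES = ["noise", "music", "babble", "chime3bg"]


def leave_one_out_configs(noise_classes):
    """One configuration per omitted noise class.

    xvector_aug:     noise classes used to augment x-vector training data
    enh_unseen_aug:  enhancer trained without the omitted class
    enh_seen_aug:    enhancer trained with all classes
    """
    configs = []
    for left_out in noise_classes:
        remaining = [c for c in noise_classes if c != left_out]
        configs.append({
            "left_out": left_out,
            "xvector_aug": remaining,
            "enh_unseen_aug": list(remaining),
            "enh_seen_aug": list(noise_classes),
        })
    return configs


configs = leave_one_out_configs(NOISE_CLASSES)
print(configs[1]["left_out"])     # music
print(configs[1]["xvector_aug"])  # ['noise', 'babble', 'chime3bg']
```

Each configuration yields three result rows: the x-vector system without enhancement, and the same system with the “enh-unseen” and “enh-seen” enhancers applied at test time.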

5.7 Handling reverberations

It is unclear if the DFL based supervised enhancement scheme can work for the dereverberation task. It is also unclear how much scope for dereverberation is in BabyTrain. In Table 7, we present results for several dereverberation schemes, some combined with denoising. “WPE” refers to Weighted-Prediction Error algorithm based pre-processing. It gives minor improvement over “no-enh”. This suggests that dereverberation is either very challenging or has less scope in BabyTrain in the first place. “dereverb” refers to DFL system trained for only dereverberation, which gives worse performance than “WPE” suggesting DFL scheme doesn’t work for dereverberation out-of-the-box. “WPE->denoise” is the denoising system but with WPE pre-processing. It is minimally better than “denoise”. However, it is largely better than “dereverb->denoise”, which refers to use of two DFL systems trained (separately) for the two tasks respectively. “denoise->dereverb” (flipped version of “dereverb->denoise”) does not lead to significant difference.

We now describe the joint training schemes. “joint1stage” refers to a DFL system trained jointly for denoising and dereverberation in one go. Its training pairs contain examples for denoising, dereverberation, and both. Note that it is worse than “WPE->denoise”, suggesting that doing these two tasks in one go is hard. “joint2stage” is an assisted modification of “joint1stage”. In addition to accepting the reverberant and noisy input signal, it accepts another reverberant signal in the middle of the network and tries to minimise the deep feature loss against it as well. This forces the network to do denoising first, mimicking the standard signal model in signal processing. This assisted scheme did not work, further supporting our presumption that combining the tasks of denoising and dereverberation is very challenging. Since our denoising network has seen a few reverberant samples (from chime3bg), we tried a double (disjoint) denoising scheme (“denoise->denoise”) and found that it brings minimal improvement. Finally, the results in this section suggest that the current DFL scheme does not work for dereverberation, and we suspect this problem is better solved through domain-adaptation methodology, as shown recently in [30, 31].
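To make the “joint2stage” idea concrete, the sketch below computes a two-term deep feature loss: one term compares an intermediate estimate against a reverberant target, and one compares the final estimate against the clean target. Everything here is an assumption for illustration: the auxiliary network is a toy stack of random ReLU layers, the per-layer distance is a mean absolute difference, and `mid_weight` is a hypothetical weighting. None of these choices is taken from the system described in this section.

```python
import numpy as np


def make_toy_aux_layers(dims, seed=0):
    """Toy stand-in for a pre-trained auxiliary network: fixed random ReLU layers."""
    rng = np.random.default_rng(seed)
    weights = [
        rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)
        for d_in, d_out in zip(dims[:-1], dims[1:])
    ]
    return [lambda x, w=w: np.maximum(x @ w, 0.0) for w in weights]


def deep_feature_loss(estimate, target, aux_layers):
    """Sum over auxiliary layers of the mean absolute activation difference."""
    loss = 0.0
    a_est, a_tgt = estimate, target
    for layer in aux_layers:
        a_est, a_tgt = layer(a_est), layer(a_tgt)
        loss += float(np.mean(np.abs(a_est - a_tgt)))
    return loss


def joint2stage_loss(mid_estimate, final_estimate, reverberant_target,
                     clean_target, aux_layers, mid_weight=1.0):
    """Final-output loss plus an auxiliary loss on the mid-network estimate."""
    final_term = deep_feature_loss(final_estimate, clean_target, aux_layers)
    mid_term = deep_feature_loss(mid_estimate, reverberant_target, aux_layers)
    return final_term + mid_weight * mid_term


# Synthetic (frames x 40) features; the "reverberant" target is a noisy stand-in.
rng = np.random.default_rng(1)
clean = rng.standard_normal((50, 40))
reverberant = clean + 0.3 * rng.standard_normal((50, 40))
mid_est = reverberant + 0.1 * rng.standard_normal((50, 40))
final_est = clean + 0.1 * rng.standard_normal((50, 40))

aux = make_toy_aux_layers([40, 64, 64])
loss = joint2stage_loss(mid_est, final_est, reverberant, clean, aux)
perfect = joint2stage_loss(reverberant, clean, reverberant, clean, aux)
```

In this formulation the intermediate term can only reach zero if the first part of the network removes the noise while leaving reverberation intact, which is the sense in which the scheme asks for denoising before dereverberation.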

EER (%)            test>=30s  test>=15s  test>=5s  test>=0s  mean
no-enh             5.78       8.78       12.34     12.71     9.90
denoise (*)        5.14       7.17       11.02     11.41     8.68
WPE                5.94       8.65       12.20     12.58     9.84
WPE->denoise       5.31       7.76       10.97     11.35     8.85
dereverb           6.31       9.75       13.39     13.76     10.80
joint1stage        5.63       8.33       11.55     11.93     9.36
joint2stage        5.74       9.14       12.82     13.21     10.23
dereverb->denoise  6.35       9.60       12.95     13.35     10.57
denoise->dereverb  6.13       9.11       12.39     12.78     10.10
denoise->denoise   5.26       7.59       10.73     11.11     8.67

minDCF             test>=30s  test>=15s  test>=5s  test>=0s  mean
no-enh             0.255      0.386      0.492     0.499     0.408
denoise (*)        0.204      0.333      0.441     0.448     0.357
WPE                0.247      0.373      0.480     0.487     0.397
WPE->denoise       0.206      0.330      0.438     0.445     0.355
dereverb           0.242      0.391      0.498     0.504     0.409
joint1stage        0.221      0.344      0.452     0.458     0.369
joint2stage        0.249      0.393      0.501     0.508     0.413
dereverb->denoise  0.249      0.394      0.499     0.506     0.412
denoise->dereverb  0.244      0.386      0.492     0.499     0.405
denoise->denoise   0.205      0.325      0.433     0.440     0.351
Table 7: Handling reverberations
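As a reading aid for Table 7, the sketch below computes the relative change in mean EER of each scheme with respect to “no-enh” and ranks the schemes. The numbers are copied from the mean column of the table; the helper name `relative_improvement` is our own.

```python
# Mean EER (%) per scheme, copied from the "mean" column of Table 7.
mean_eer = {
    "no-enh": 9.90,
    "denoise (*)": 8.68,
    "WPE": 9.84,
    "WPE->denoise": 8.85,
    "dereverb": 10.80,
    "joint1stage": 9.36,
    "joint2stage": 10.23,
    "dereverb->denoise": 10.57,
    "denoise->dereverb": 10.10,
    "denoise->denoise": 8.67,
}


def relative_improvement(baseline: float, value: float) -> float:
    """Relative reduction in percent; negative values mean degradation."""
    return 100.0 * (baseline - value) / baseline


baseline = mean_eer["no-enh"]
rel = {name: relative_improvement(baseline, eer) for name, eer in mean_eer.items()}

# Schemes ordered from lowest (best) to highest (worst) mean EER.
ranked = sorted(mean_eer, key=mean_eer.get)

for name in ranked:
    print(f"{name:20s} mean EER {mean_eer[name]:5.2f}  relative change {rel[name]:+6.2f}%")
```

Schemes with a negative relative change (“dereverb”, “joint2stage”, and both DFL cascades) are worse than applying no enhancement at all.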

6 Conclusion

Incorporating robustness in Speaker Verification is a challenging problem. Data augmentation is a common practice for handling it. BabyTrain is an appropriate dataset for this study due to its uncontrolled nature and emphasis on children’s speech verification. Since large data releases do not explicitly account for children’s speech, generalization of SV systems to lower age groups is an open question. This work evaluates that, albeit implicitly. Deep feature loss is a promising methodology which, in its current form, works along with data augmentation in the x-vector network. It is shown to bring substantial improvements. Our experiments reveal that this test-time feature denoising approach is optimal when it utilizes all hidden activations of the auxiliary network excluding the final layer activations. A search for the best training data combination for the enhancement and auxiliary networks reveals that it is optimal to use the top 50% of VoxCeleb utterances according to their SNR. This satisfies the ideal requirements of clean, large, and naturalistic data for training enhancement. Experiments using different features for the enhancement network show that it is best to use the same 40-D LMFB features as in the auxiliary and x-vector networks.
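The SNR-based selection summarized above can be sketched as follows. This is an illustration under stated assumptions: per-utterance SNR estimates are taken as already computed by some estimator, and the utterance IDs, SNR values, and function name `top_fraction_by_snr` are hypothetical.

```python
from typing import Dict, List


def top_fraction_by_snr(snr_by_utt: Dict[str, float], fraction: float = 0.5) -> List[str]:
    """Return the IDs of the highest-SNR utterances, keeping the given fraction."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(snr_by_utt, key=snr_by_utt.get, reverse=True)
    keep = int(len(ranked) * fraction)
    return ranked[:keep]


# Hypothetical utterance IDs with estimated SNR in dB.
snr_by_utt = {"utt_a": 12.5, "utt_b": 3.1, "utt_c": 20.4, "utt_d": 7.8}
selected = top_fraction_by_snr(snr_by_utt, fraction=0.5)
```

The fraction trades off cleanliness against corpus size: a smaller value keeps cleaner utterances but leaves less data for training the enhancement network.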

An important inquiry into enhancing data other than the test set reveals that it is beneficial for the front-end (x-vector network) while harmful for the back-end (PLDA). To further investigate the nature of enhanced signals, we hypothesize, and subsequently confirm, that they contain information complementary to the original signals. We combine both signals in the front-end and/or back-end to establish this. The newly trained x-vector network with combined data turns out to be quite powerful, as demonstrated by ~40% relative improvements over the baseline. We also make a consistent observation that PLDA is susceptible to enhancement processing. Hence, it is best to train it using original features. The leave-one-out ablation analysis solidifies the notion that DFL enhancement is effective even when a noise class is missing from the training data of the enhancement and/or x-vector network. Importantly, we show the limitation of data augmentation by demonstrating the degradation caused by including certain noise classes. Surprisingly, they turn out to be common noise classes used in practice. Finally, we design several dereverberation schemes combining WPE, denoising, and dereverberation in either joint or disjoint fashion. Extensive evaluation suggests that DFL enhancement is ineffective for dereverberation, and we speculate that domain adaptation is a superior methodology.

We also speculate that the findings of this work can vary with the choice of the x-vector network and the evaluation database. Nevertheless, the analysis reported here provides further insight into deep feature loss based enhancement for Speaker Verification and addresses its advantages, weaknesses, and extensions.