Feature Enhancement with Deep Feature Losses for Speaker Verification

by Saurabh Kataria, et al.

Speaker Verification still suffers from the challenge of generalization to novel adverse environments. We leverage recent advances in deep learning based speech enhancement and propose a feature-domain supervised denoising solution. We propose to use Deep Feature Loss, which optimizes the enhancement network in the hidden activation space of a pre-trained auxiliary speaker embedding network. We experimentally verify the approach on simulated and real data. A simulated testing setup is created using various noise types at different SNR levels. For evaluation on real data, we choose the BabyTrain corpus, which consists of children's recordings in uncontrolled environments. We observe consistent gains in every condition over the state-of-the-art augmented Factorized-TDNN x-vector system. On the BabyTrain corpus, we observe relative gains of 10.38% in minDCF and 12.40% in EER.



1 Introduction

Various phenomena degrade speech, such as noise, reverberation, speaker movement, device orientation, and room characteristics [1]. This makes the deployment of Speaker Verification (SV) systems challenging. To address this, several challenges were organized recently, such as NIST Speaker Recognition Evaluation (SRE) 2019, VOiCES from a Distance Challenge [2], and VoxCeleb Speaker Recognition Challenge (VoxSRC) 2019. We consider acoustic feature enhancement as a solution to this problem. In the past decade, deep learning based enhancement has made great progress. Notable approaches include mask estimation, feature mapping, Generative Adversarial Network (GAN) [3], and Deep Feature Loss (DFL) [4]. Usually, such works report enhancement metrics like Perceptual Evaluation of Speech Quality (PESQ) and Signal-to-Distortion Ratio (SDR) on small datasets like VCTK. Some works tackle joint denoising-dereverberation, unsupervised enhancement, and source separation. However, we focus on supervised denoising. Specifically, we are interested in enhancement for improving the robustness of other speech tasks. We refer to this methodology as task-specific enhancement.

Task-specific enhancement has been proposed for Automatic Speech Recognition (ASR), Language Recognition, and SV. We focus on single-channel wide-band SV, for which an augmented x-vector network with a Probabilistic Linear Discriminant Analysis (PLDA) back-end is the state-of-the-art (SOTA) [5]. For SV, [6] and [7] have reported improvements on simulated data. We note that x-vector systems still face significant challenges in adverse scenarios, as demonstrated in a recent children's speech diarization study [8]. This motivates us to investigate whether task-specific enhancement can complement SOTA x-vector based SV systems.

We argue that the training of a task-specific enhancement system should depend on the task. Therefore, we build on the ideas of Perceptual Loss [9] and propose a solution based on the speech denoising work in [4]. In [4], the authors train a speech denoising network by deriving the loss from a pre-trained speech classification network. Our work differs from [4] in several ways. First, we choose the auxiliary task to be the same as the x-vector network task, i.e., speaker classification. This follows from the motivation to use task-specific enhancement to improve upon the SOTA x-vector system for SV. Second, we enhance in the feature domain (log Mel-filterbank), which makes it conducive for use with a Mel-Frequency Cepstrum Coefficient (MFCC) based auxiliary network. Lastly, we demonstrate the proof-of-concept using datasets of much larger scale. An added advantage of our proposed approach is that we do enhancement only during inference, thus avoiding the need to re-train the x-vector network.

2 Deep Feature Loss

Perceptual loss, or deep feature loss, refers to the use of a pre-trained auxiliary network for the training loss. The auxiliary network is trained for a different task and returns the loss in the form of hidden layer activations from multiple layers. In [4], the authors train an enhancement system with an audio classification auxiliary network. The loss is the deviation between the activations of the clean and enhanced signals. We refer to this as deep feature loss (DFL), while feature loss (FL) refers to the independent naïve training of the enhancement system without an auxiliary network. For a batch size of 1, the loss functions for DFL, FL, and DFL+FL (combination) are given below.


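$$\mathcal{L}_{\mathrm{DFL}}(F_n, F_c) = \sum_{i=1}^{L} \mathcal{L}_{\mathrm{DFL},i}(F_n, F_c) = \sum_{i=1}^{L} \lVert a_i(F_c) - a_i(e(F_n)) \rVert_{1,1}$$

$$\mathcal{L}_{\mathrm{FL}}(F_n, F_c) = \lVert F_c - e(F_n) \rVert_{1,1}$$

$$\mathcal{L}_{\mathrm{DFL+FL}}(F_n, F_c) = \mathcal{L}_{\mathrm{DFL}}(F_n, F_c) + \mathcal{L}_{\mathrm{FL}}(F_n, F_c)$$

Here, $F_n$ and $F_c$ are $F \times T$ matrices containing features for the current pair of noisy and clean samples respectively. $F$ is the number of frequency bins, $T$ is the number of frames, $e(\cdot)$ is the enhancement network, $a(\cdot)$ is the auxiliary network, $a_i(\cdot)$ is the output of the $i$-th layer of $a(\cdot)$ considered for computing DFL, and $L$ is the number of layers of $a(\cdot)$ whose outputs are used for computing DFL. We fix the coefficients of $\mathcal{L}_{\mathrm{DFL},i}$ and $\mathcal{L}_{\mathrm{FL}}$ equal to 1. We tried the coefficient re-weighting scheme of [4] but found it unhelpful. $L$ depends on the architecture of $a(\cdot)$. We fix it to 6, as suggested by our preliminary experiments.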
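The three losses can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the deviation is taken as an element-wise L1 distance, the "networks" are toy functions (`toy_enhancer`, `toy_aux_layers` are hypothetical stand-ins, not the CAN or ResNet-34-LDE models), and every toy auxiliary layer contributes to the loss, whereas the paper uses the outputs of 6 selected layers.

```python
from typing import Callable, List, Sequence

Matrix = List[List[float]]  # F x T feature matrix (frequency bins x frames)
Network = Callable[[Matrix], Matrix]


def l1_distance(a: Matrix, b: Matrix) -> float:
    """Sum of absolute element-wise differences of two equally shaped matrices."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def feature_loss(enhance: Network, noisy: Matrix, clean: Matrix) -> float:
    """FL: compare enhanced and clean features directly (no auxiliary network)."""
    return l1_distance(clean, enhance(noisy))


def deep_feature_loss(enhance: Network, aux_layers: Sequence[Network],
                      noisy: Matrix, clean: Matrix) -> float:
    """DFL: sum, over the auxiliary layers, of the activation deviation."""
    act_clean, act_enh = clean, enhance(noisy)
    total = 0.0
    for layer in aux_layers:  # every layer coefficient is fixed to 1
        act_clean, act_enh = layer(act_clean), layer(act_enh)
        total += l1_distance(act_clean, act_enh)
    return total


# Toy stand-ins (assumptions for illustration only).
def toy_enhancer(x: Matrix) -> Matrix:
    return [[v - 0.5 for v in row] for row in x]


toy_aux_layers = [
    lambda x: [[2.0 * v for v in row] for row in x],            # linear layer
    lambda x: [[max(0.0, v - 1.0) for v in row] for row in x],  # shifted ReLU
]

noisy = [[1.0, 2.0], [3.0, 4.0]]
clean = [[0.0, 1.0], [2.0, 3.0]]

fl = feature_loss(toy_enhancer, noisy, clean)
dfl = deep_feature_loss(toy_enhancer, toy_aux_layers, noisy, clean)
combined = dfl + fl  # DFL+FL with both coefficients equal to 1
print(fl, dfl, combined)
```

In actual training only the enhancement network would receive gradient updates; the auxiliary network stays frozen and serves purely as a loss function.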

3 Neural Network Architectures

3.1 Enhancement Networks

Here, we describe the two fully-convolutional architectures we designed as candidates for the enhancement network.

3.1.1 Context Aggregation Network

A deep CNN with dilated convolutions increases the receptive field of the network monotonically, resulting in a large temporal context. In [4], the authors design such a network for time-domain signals using 1-D convolutions. The first layer of our Context Aggregation Network (CAN) is a 2-D Batch Normalization (BN) layer. It has eight 2-D convolution layers with a kernel size of 3x3, a channel dimension of 45, and dilation linearly increasing from 1 to 8. Between each pair of such layers is an Adaptive BN layer followed by a LeakyReLU activation of slope 0.2. We introduced several modifications to the architecture in [4]. First, we include three uniformly separated Temporal Squeeze Excitation (TSE) connections along with residual connections.
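As a quick sanity check on the temporal context of this layer stack, the standard receptive-field formula for stacked dilated convolutions can be evaluated directly. The sketch assumes stride 1 and that only the eight convolution layers widen the context; `receptive_field` is a hypothetical helper name.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in frames) of stacked stride-1 dilated convolutions."""
    frames = 1
    for dilation in dilations:
        # Each layer adds (kernel_size - 1) * dilation frames of context.
        frames += (kernel_size - 1) * dilation
    return frames


can_dilations = list(range(1, 9))  # dilation grows linearly from 1 to 8
can_context = receptive_field(3, can_dilations)
print(can_context)
```

With kernel size 3 and dilations 1 through 8 this gives 1 + 2 * 36 = 73 frames.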

TSE is a variant of Squeeze Excitation [10], where instead of obtaining a global representation common to all Time-Frequency (TF) bins (by average pooling in both dimensions), we obtain a representation per frequency bin (pooling just in the time dimension). Then, we compute excitation weights for every TF bin. Finally, a linear layer is used to map to the original input dimension. The network output is assumed to represent a mask that would be multiplied by the noisy features to obtain the clean features in the linear domain. Since we use acoustic features in the log domain, we apply the logarithm to the network output and add it to the input to obtain the enhanced features in the log domain. The network has a context length of 73 frames and 2.6M parameters.
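The equivalence between multiplying by a mask in the linear domain and adding its logarithm in the log domain can be checked with a small sketch. It assumes a strictly positive mask (how the network guarantees this is not stated in the text), and `apply_mask_log_domain` is a hypothetical helper name.

```python
import math


def apply_mask_log_domain(log_noisy, mask):
    """Enhanced log features = noisy log features + log(mask), element-wise."""
    return [[x + math.log(m) for x, m in zip(row_x, row_m)]
            for row_x, row_m in zip(log_noisy, mask)]


linear_noisy = [[4.0, 2.0], [1.0, 8.0]]   # toy filterbank energies
mask = [[0.5, 0.25], [1.0, 0.5]]          # toy positive mask

log_noisy = [[math.log(v) for v in row] for row in linear_noisy]
log_enhanced = apply_mask_log_domain(log_noisy, mask)

# Back in the linear domain this equals mask * noisy.
linear_enhanced = [[math.exp(v) for v in row] for row in log_enhanced]
print(linear_enhanced)
```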

3.1.2 Encoder-Decoder Network

We modify the Encoder-Decoder Network (EDN) architecture of the generator of Cycle-GAN in the domain adaptation work of [11]. EDN has several residual blocks after the encoder and a skip connection. Details can be found in [12]. We make three modifications. First, the number of channels is set to a high value of 90. Second, the Swish activation function is used instead of ReLU. Lastly, the training details are different, particularly in the context of optimization (refer to Section 4.2). The network has a context length of 55 frames and 22.5M parameters.

3.2 Speaker Embedding Networks

3.2.1 Residual Network

The auxiliary network in our DFL formulation is the ResNet-34-LDE network described in [14, 15, 5]. It is a ResNet-34 residual network with Learnable Dictionary Encoding (LDE) pooling and Angular Softmax loss function. The dictionary size of LDE is 64 and the network has 5.9M parameters.

3.2.2 x-vector Network

We experiment with two x-vector networks, Extended TDNN (ETDNN) and Factorized TDNN (FTDNN). ETDNN improves upon the previously proposed TDNN system by interleaving dense layers in between the convolution layers. The FTDNN network forces the weight matrix between convolution layers to be a product of two low rank matrices. Total parameters for ETDNN and FTDNN are 10M and 17M respectively. A summary of those networks can be found in [5].
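To make the factorization concrete, the sketch below compares the parameter count of one dense weight matrix with that of a product of two low-rank factors. The dimensions (1024 x 1024, rank 128) are illustrative assumptions, not the sizes used in FTDNN, and the helper names are hypothetical.

```python
def dense_params(rows, cols):
    """Parameters in a single rows x cols weight matrix."""
    return rows * cols


def factorized_params(rows, cols, rank):
    """Parameters when the matrix is a product of (rows x rank) and (rank x cols)."""
    return rows * rank + rank * cols


full = dense_params(1024, 1024)
low_rank = factorized_params(1024, 1024, 128)
print(full, low_rank, low_rank / full)
```

The saving applies per factorized matrix; the total model size still depends on how many layers the network has and how wide they are.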

4 Experimental Setup

4.1 Dataset Description

We combine VoxCeleb1 and VoxCeleb2 [16] to create voxceleb. Then, we concatenate utterances extracted from the same video to create voxcelebcat. This results in 2710 hrs of audio with 7185 speakers. A random 50% subset of voxcelebcat forms voxcelebcat_div2. To ensure sampling of clean utterances (required for training enhancement), an SNR estimation algorithm (Waveform Amplitude Distribution Analysis (WADASNR) [17]) is used to sample the top 50% clean samples from voxcelebcat to create voxcelebcat_wadasnr. This results in 1665 hrs of audio with 7104 speakers. To create the noisy counterpart, MUSAN [18] and DEMAND [19] are used. A 90-10 split gives us a parallel copy of training and validation data for the enhancement system. The auxiliary network is trained with voxcelebcat_wadasnr. Lastly, voxcelebcat_combined is formed by data augmentation with MUSAN to create a dataset of size three times voxcelebcat.
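The selection and split steps can be sketched as follows. The SNR values here are made up, the estimator itself (WADASNR) is not implemented, and the split is done per utterance with a fixed seed; whether the actual split was by utterance or by speaker is not stated, so treat that as an assumption. All names are hypothetical.

```python
import random


def select_cleanest(snr_by_utt, keep_fraction=0.5):
    """Keep the utterances with the highest estimated SNR."""
    ranked = sorted(snr_by_utt, key=snr_by_utt.get, reverse=True)
    return ranked[: int(len(ranked) * keep_fraction)]


def train_val_split(items, train_fraction=0.9, seed=0):
    """Shuffle deterministically, then cut into training and validation parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy corpus: 20 utterances with estimated SNRs of 0..19 dB.
snr_by_utt = {f"utt{i:02d}": float(i) for i in range(20)}

clean_pool = select_cleanest(snr_by_utt)   # top 50% by estimated SNR
train, val = train_val_split(clean_pool)   # 90-10 split
print(len(clean_pool), len(train), len(val))
```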

We design a simulated testing setup called Simulated Speakers In The Wild (SSITW). Several noisy test sets are formed by corrupting Speakers In The Wild (SITW) [20] core-core condition with MUSAN and “background noises” from CHiME-3 challenge (referred to as chime3bg). This results in five test SNRs (-5dB, 0dB, 5dB, 10dB, 15dB) and four noise types (noise, music, babble, chime3bg). Here, noise refers to “noise category” in MUSAN, consisting of common environmental acoustic events. It is ensured that the test noise files are disjoint from the training ones.
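Corrupting a clean signal at a target SNR is usually done by scaling the noise before adding it. The sketch below shows that standard procedure on toy signals; it is not the authors' mixing tool, which the text does not specify, and the function names are hypothetical.

```python
import math
import random


def mean_power(x):
    """Average power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that 10*log10(P_speech / P_noise) equals snr_db, then add."""
    gain = math.sqrt(mean_power(speech) / (mean_power(noise) * 10 ** (snr_db / 10)))
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


rng = random.Random(0)
speech = [math.sin(0.05 * t) for t in range(1000)]      # toy "speech"
noise = [rng.uniform(-1.0, 1.0) for _ in range(1000)]   # toy "noise"

achieved = {}
for snr_db in (-5, 0, 5, 10, 15):                       # the five test SNRs
    _, scaled = mix_at_snr(speech, noise, snr_db)
    achieved[snr_db] = 10 * math.log10(mean_power(speech) / mean_power(scaled))
print(achieved)
```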

We choose the BabyTrain corpus for evaluation on real data. It is based on the Homebank repository [21] and consists of daylong recordings of children's speech around other speakers in uncontrolled environments. Training data for diarization and detection (adaptation data) are around 130 and 120 hrs respectively, while enrollment and test data are around 95 and 30 hrs respectively. This data was split into enrollment and test utterances, which were classified by duration. In our terminology, test>=n sec and enroll=m sec refer to test utterances with at least n seconds, and enrollment utterances with exactly m seconds, of speech from the speaker of interest. The test conditions reported here use n ∈ {0, 5, 15, 30}. For enrollment, time marks of the target speaker were given, but not for test, where multiple speakers may be present.

We now describe the training data for our three x-vector based baseline systems. For the first (and simplest) baseline, we use ETDNN. The training data for ETDNN as well as its PLDA back-end is voxcelebcat_div2. Since no data augmentation is done, we refer to this system as clean x-vector system or ETDNN_div2. For the second and third baseline, we choose FTDNN, which is trained with voxcelebcat_combined and several SRE datasets. Its details can be found in [14]. These two baselines are referred to as augmented x-vector systems. The difference between the second (FTDNN_div2) and the third baseline (FTDNN_comb) is that they use voxcelebcat_div2 and voxcelebcat_combined as PLDA training data respectively. There is an additional PLDA in the diarization step for BabyTrain, for which voxceleb is used.

4.2 Training details


CAN is trained with a batch size of 60, a learning rate of 0.001 (exponentially decreasing), 6 epochs, the Adam optimizer, and sequences of 500 frames (5s audio). EDN differs in batch size (32) and optimizer (Rectified Adam (RAdam)). The differences arise from the independent tuning of the two networks. However, both are trained with unnormalized 40-D log Mel-filterbank features. The auxiliary network is trained with a batch size of 128, 50 epochs, the Adam optimizer, a learning rate of 0.0075 (exponentially decreasing) with warmup, and sequences of 800 frames (8s audio). It is trained with mean-normalized log Mel-filterbank features. To account for this normalization mismatch, we do online mean normalization between the enhancement and auxiliary network. ETDNN and FTDNN are trained with Kaldi scripts using Mean-Variance Normalized (MVN) 40-D MFCC features.
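The normalization step between the two networks can be illustrated as a per-frequency-bin mean subtraction over time. This sketch assumes the mean is taken over the whole training chunk; the text does not say whether a sliding window is used instead. `mean_normalize` is a hypothetical helper name.

```python
def mean_normalize(features):
    """Subtract each frequency bin's mean over time from an F x T matrix."""
    normalized = []
    for row in features:               # one row per frequency bin
        mean = sum(row) / len(row)     # mean over the time frames
        normalized.append([v - mean for v in row])
    return normalized


# Toy enhanced features (2 frequency bins x 3 frames), unnormalized.
enhanced = [[1.0, 2.0, 3.0], [10.0, 10.0, 13.0]]
aux_input = mean_normalize(enhanced)   # what the auxiliary network would see
print(aux_input)
```

When computing the deep feature loss, the same normalization would be applied to both the enhanced and the clean features before they enter the auxiliary network.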

4.3 Evaluation details

The PLDA-based backend for SSITW consists of a 200-D LDA with generative Gaussian SPLDA [14]. For evaluation on BabyTrain, a diarization system is additionally used to account for the multiple speakers in test utterances. We followed the Kaldi x-vector callhome diarization recipe. Details are in the JHU-CLSP diarization system described in [14]. Note that only test, enroll, and adaptation data utterances were enhanced. For the final evaluation, we use standard metrics: Equal Error Rate (EER) and Minimum Decision Cost Function (minDCF) at the target prior of the NIST SRE18 VAST operating point. The code for this work is available online (https://github.com/jsalt2019-diadet) and a parent paper is submitted in parallel [22].
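For readers unfamiliar with the two metrics, a minimal threshold-sweep sketch is given below. It is not the scoring tool used in the paper; the scores are toy values, and `p_target=0.05` in the example is an illustrative assumption rather than a value taken from the text. Unit costs for misses and false alarms are also assumed.

```python
def error_rates(target_scores, nontarget_scores):
    """Return (threshold, miss rate, false-alarm rate) per candidate threshold.

    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    thresholds.append(thresholds[-1] + 1.0)  # point that rejects every trial
    points = []
    for t in thresholds:
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        points.append((t, p_miss, p_fa))
    return points


def equal_error_rate(target_scores, nontarget_scores):
    """Error rate at the threshold where miss and false-alarm rates are closest."""
    _, p_miss, p_fa = min(error_rates(target_scores, nontarget_scores),
                          key=lambda p: abs(p[1] - p[2]))
    return (p_miss + p_fa) / 2


def min_dcf(target_scores, nontarget_scores, p_target, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost over all thresholds."""
    best = min(c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa
               for _, p_miss, p_fa in error_rates(target_scores, nontarget_scores))
    return best / min(c_miss * p_target, c_fa * (1 - p_target))


targets = [2.0, 1.5, 1.0, 0.2]        # toy same-speaker trial scores
nontargets = [0.5, 0.1, -0.3, -1.0]   # toy different-speaker trial scores

eer = equal_error_rate(targets, nontargets)
dcf = min_dcf(targets, nontargets, p_target=0.05)
print(eer, dcf)
```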

5 Results

5.1 Baseline results

In Table 1, we present the baseline (averaged) results on simulated and real data. As expected, the clean x-vector system performs worst. Between SSITW and BabyTrain, we observe different trends for the augmented x-vector systems: FTDNN_div2 performs better on BabyTrain, while FTDNN_comb performs better on SSITW. Because our focus is on real data, we drop the third baseline (FTDNN_comb) from further analysis.

              SSITW               BabyTrain
System        EER (%)   minDCF    EER (%)   minDCF
ETDNN_div2    10.75     0.608     13.90     0.783
FTDNN_div2     5.70     0.357      7.66     0.366
FTDNN_comb     3.70     0.222      9.72     0.409
Table 1: Baseline results using three verification systems

5.2 Comparison of Context Aggregation Network and Encoder-Decoder Network

Table 2 presents enhancement results using the two candidate enhancement networks. CAN and EDN show different performance trends. On SSITW, EDN works better, while on BabyTrain, CAN gives better performance. Again, because our focus is on real data, CAN is chosen for further analysis. Results can be compared with Table 1, and the benefit of enhancement can be noted for both baseline systems. Underlined numbers represent the overall best performance attained in this study for each dataset.

                            SSITW               BabyTrain
Enhancement   System        EER (%)   minDCF    EER (%)   minDCF
CAN           ETDNN_div2     7.61     0.450     10.33     0.510
CAN           FTDNN_div2     5.37     0.333      6.71     0.328
EDN           ETDNN_div2     6.51     0.398     11.76     0.561
EDN           FTDNN_div2     4.18     0.273      7.35     0.334
Table 2: Comparison of enhancement by CAN and EDN

5.3 Comparison of feature and Deep Feature Loss

Table 3 presents results for the three loss functions using the stronger baseline (FTDNN_div2). The loss function in our proposed solution (DFL) gives the best performance. It is important to note that the naïve enhancement (FL), which does not use the auxiliary network, gives worse results than the baseline. The combination loss (DFL+FL) gives a slightly better EER on BabyTrain than the baseline but degrades all other metrics. The last row represents the performance difference between the naïve and the proposed scheme. In the next sections, we present detailed results on both datasets using DFL.

                          SSITW               BabyTrain
Loss                      EER (%)   minDCF    EER (%)   minDCF
FTDNN_div2 (baseline)      5.70     0.357      7.66     0.366
FL                         8.51     0.516      7.90     0.485
DFL                        5.37     0.333      6.71     0.328
DFL+FL                     6.27     0.381      7.30     0.383
FL - DFL (difference)      3.14     0.183      1.19     0.157
Table 3: Comparison of three losses on FTDNN_div2

5.4 Results on Simulated Speakers In The Wild

In Table 4, we present results on SSITW per noise condition. The upper half of the table shows results with and without enhancement using the clean x-vector system. The performance gain is consistent in every condition. We note here that the babble condition is the most challenging. The lower half of the table shows results using the augmented x-vector system. The performance gain is smaller here, albeit consistent. Δ (in %) represents the relative change in the metric after enhancement. An asterisk (*) denotes the metric value after enhancement.

             noise      music      babble     chime3bg
Clean x-vector (ETDNN_div2)
EER (%)       8.52       9.17      13.36      11.94
EER* (%)      5.98       6.31      10.60       8.19
Δ           -29.81%    -31.19%    -20.66%    -31.41%
minDCF        0.546      0.552      0.661      0.672
minDCF*       0.381      0.391      0.544      0.484
Δ           -30.22%    -29.17%    -17.70%    -27.98%
Augmented x-vector (FTDNN_div2)
EER (%)       3.80       4.42       8.75       6.49
EER* (%)      3.69       3.83       8.06       5.88
Δ            -2.90%    -13.35%     -7.89%     -9.40%
minDCF        0.264      0.301      0.461      0.402
minDCF*       0.253      0.269      0.435      0.375
Δ            -4.17%    -10.63%     -5.64%     -6.72%
Table 4: Results with and without DFL enhancement on SSITW using two baseline systems
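The relative-change rows can be reproduced from the rows above them. The sketch below recomputes two entries of the noise column for the clean x-vector system; `relative_change_percent` is a hypothetical helper name.

```python
def relative_change_percent(before, after):
    """Relative change of a metric after enhancement, in percent."""
    return 100.0 * (after - before) / before


# Clean x-vector system, "noise" condition (values from Table 4).
eer_change = round(relative_change_percent(8.52, 5.98), 2)
mindcf_change = round(relative_change_percent(0.546, 0.381), 2)
print(eer_change, mindcf_change)
```

Negative values mean the error metric decreased, i.e. enhancement helped.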

5.5 Results on BabyTrain

In Table 5, we present results on BabyTrain per test duration condition (averaged over all enrollment durations). Similar to the previous section, we observe high gains using the clean x-vector system. The lower half of the table also shows consistent, significant improvement in every condition. It is important to note that even with a strong FTDNN-based augmented x-vector baseline, enhancement helps significantly. Also, the easier the test condition, the higher the improvement.

             test>=30s  test>=15s  test>=5s   test>=0s
Clean x-vector (ETDNN_div2)
EER (%)       9.83      12.94      16.26      16.57
EER* (%)      6.80       9.35      12.40      12.78
Δ           -30.82%    -27.74%    -23.74%    -22.87%
minDCF        0.673      0.782      0.837      0.840
minDCF*       0.378      0.517      0.581      0.587
Δ           -43.83%    -33.89%    -30.59%    -30.12%
Augmented x-vector (FTDNN_div2)
EER (%)       4.67       6.50       9.54       9.92
EER* (%)      3.97       5.67       8.41       8.78
Δ           -14.99%    -12.77%    -11.84%    -11.49%
minDCF        0.242      0.335      0.440      0.447
minDCF*       0.204      0.298      0.400      0.409
Δ           -15.70%    -11.04%     -9.09%     -8.50%
Table 5: Results with and without DFL enhancement on BabyTrain using two baseline systems

6 Conclusion

We propose feature-domain enhancement at the front-end of the x-vector based Speaker Verification system and show that it improves robustness. To establish the proof-of-concept, we experiment with two enhancement networks, three loss functions, three baselines, and two testing setups. We create simulated data using noises of different types at a broad range of SNRs. For evaluation on real data, we choose BabyTrain, which consists of day-long children's recordings in uncontrolled environments. Using deep feature loss based enhancement, we observe consistent gains in every condition of simulated and real data. On BabyTrain, we observe a relative gain of 10.38% in minDCF and 12.40% in EER over the FTDNN_div2 baseline. In the future, we will explore our idea with more real noisy datasets.