Unsupervised Feature Enhancement for speaker verification

by Phani Sankar Nidadavolu, et al.

The task of making speaker verification systems robust to adverse scenarios remains a challenging and active area of research. We developed an unsupervised feature enhancement approach in the log-filter bank domain with the end goal of improving speaker verification performance. We experimented with using both real speech recorded in adverse environments and degraded speech obtained by simulation to train the enhancement systems. The effectiveness of the approach was shown by testing on several real, simulated noisy, and reverberant test sets. The approach yielded significant improvements on both real and simulated sets when data augmentation was not used in the speaker verification pipeline or when augmentation was used only during x-vector training. When data augmentation was used for both x-vector and PLDA training, our enhancement approach yielded slight improvements.





1 Introduction

Speech gets contaminated by various background noises, reverberation and other unwanted variabilities present during its acquisition. An ideal Speaker Verification (SV) system should be robust to any background noises and reverberation effects present. Recently, developing robust SV systems has become a very active research area. Several challenges were organized recently such as NIST Speaker Recognition Evaluation (SRE) 2019, VOiCES from a Distance Challenge [1], and VoxCeleb Speaker Recognition Challenge 2019.

One approach to improve the robustness of SV systems is to train them on data created by artificially adding noise to the original training data or simulating the reverberant speech. This method, known as data augmentation, has proven to be effective in improving the performance of SV systems yielding state-of-the-art (SOTA) results on various tasks [2, 3]. However, such simulation strategies do not take into account the amount and type of degradation the test utterances can have. A recent study on Speaker Diarization on children’s speech [4] demonstrates various challenges that x-vector systems face in adverse scenarios.

In this work, we experimented with an unsupervised single channel wide-band feature enhancement approach to improve the quality of speech features with the end goal of improving the performance of SV systems – a task-specific approach. The motivation behind taking an unsupervised approach was to incorporate the knowledge of the target (adverse) domain in the enhancement procedure with the help of some training data from that domain. The Unsupervised Enhancement Network (UEN) we experimented with was a cycle consistent generative adversarial network (CycleGAN) [5] trained on log Mel-filter bank (log mel-FB) features.

Previously, task-specific enhancement techniques have been proposed for Automatic Speech Recognition (ASR) and SV. A denoising approach using CycleGAN was proposed in [6] to improve the performance of ASR, with results reported on several simulated test conditions. For SV, [7] and [8] have reported improvements on simulated data.

The main contributions of this paper are as follows: 1) to develop a unified UEN that serves a dual purpose, simultaneous dereverberation and denoising; 2) to test the generalization ability of this network to unseen test conditions; 3) to use features extracted from real degraded speech to train the UEN; and 4) to investigate whether the UEN approach complements the SOTA x-vector system trained with data augmentation.

Our experimental approach was as follows. We first developed an enhancement-based SV pipeline, referred to as the UEN-SV system, in which we enhance the test features using the UEN before extracting the x-vectors. When data augmentation was used to train the x-vector networks, the x-vectors for training the PLDA were also extracted from enhanced training data. To be consistent with the notation of CycleGAN, we used the terms clean/source and reverberant/target interchangeably in this paper.

2 Unsupervised Enhancement System

2.1 CycleGAN Training

The UEN in this work is a CycleGAN system which consists of two generators and two discriminators. The generators map features from one domain to the other. They were trained using a multi-task objective which consists of two loss components: an adversarial loss and a cycle consistency loss. The adversarial loss was responsible for making the generator produce features that appear to be drawn from the opposite domain. The cycle consistency loss additionally constrains the generator to reconstruct the original features of a domain from the generated features in the opposite domain (achieved by minimizing the distance between original and reconstructed features). The adversarial loss of each generator relies on a binary classifier, termed the discriminator, coupled to that generator. The task of the discriminator is to classify between original and generated features of a particular domain, achieved by minimizing a least-squares objective [9]. The adversarial loss then becomes a non-saturating loss as shown in [10]. During evaluation, features of degraded speech are enhanced by mapping them to the clean domain using the corresponding generator. More details on the objectives used for training CycleGAN can be found in our previous work on domain adaptation [11, 12].
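The objective described above can be sketched in a few lines. This is a minimal illustration rather than the training code used in this work: the function names are hypothetical, the discriminator outputs are plain lists of scalars, and the cycle distance is assumed to be a mean absolute (L1) distance, which the text does not specify. The default weights are the cycle and adversarial loss weights reported in Section 3.2.

```python
def discriminator_loss(d_real, d_fake):
    """Least-squares discriminator objective: push D(real) toward 1 and D(fake) toward 0."""
    real_term = sum((d - 1.0) ** 2 for d in d_real) / len(d_real)
    fake_term = sum(d ** 2 for d in d_fake) / len(d_fake)
    return 0.5 * (real_term + fake_term)


def generator_adversarial_loss(d_fake):
    """The generator is rewarded when the discriminator scores generated features near 1."""
    return sum((d - 1.0) ** 2 for d in d_fake) / len(d_fake)


def cycle_loss(original, reconstructed):
    """Mean absolute distance between original and reconstructed features (L1 is an assumption)."""
    return sum(abs(o - r) for o, r in zip(original, reconstructed)) / len(original)


def generator_objective(d_fake, original, reconstructed,
                        cycle_weight=2.5, adv_weight=1.0):
    """Weighted sum of the adversarial and cycle consistency terms for one generator."""
    return (adv_weight * generator_adversarial_loss(d_fake)
            + cycle_weight * cycle_loss(original, reconstructed))
```

In a full CycleGAN, this objective is evaluated for both generators (clean to degraded and degraded to clean), each paired with its own discriminator.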

2.2 CycleGAN Architecture


The generator was a fully convolutional residual network with an encoder-decoder architecture. The encoder consisted of three convolutional layers followed by nine residual blocks. The number of filters in the first three convolutional layers was set to 32, 64 and 128 with strides of 1, 2 and 2 respectively. Each residual block consisted of two convolutional layers with 128 filters. The decoder network consisted of two deconvolutional layers with stride 2 and 64 and 32 filters respectively, followed by a final convolutional layer with stride 1. Instance normalization was used in each layer except the first and last. ReLU activation was used in all layers except the last. The kernel size in all layers was set to 3x3. We used a shortcut connection from the input of the network to the output (the input was added to the output of the last layer, and the sum becomes the generator's final output). We trained the generators on log mel-FB features. Since reverberation is a convolution operation in the time domain, it becomes additive in the log-spectral domain. Hence, the shortcut connection disentangles the reverberation effect (which was estimated by the model) from the input. The discriminator had 5 convolutional layers, each with a kernel size of 4. The strides of the first three and last two layers were set to 2 and 1 respectively. The number of filters in the layers was set to 64, 128, 256, 512 and 1. LeakyReLU with slope 0.2 was used as the activation in all layers except the last. More details on the architecture can be found in
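The reasoning behind the shortcut connection can be illustrated with a toy example. The sketch below is a simplification under stated assumptions: it treats reverberation as a per-bin multiplication of magnitude spectra (ignoring windowing effects and the mel filter bank), and all names and numbers are made up for illustration. It shows that if the last layer predicts the negative log of the room response, adding it to the input recovers the clean log spectrum.

```python
import math


def reverberate_magnitude(clean_mag, room_mag):
    """Time-domain convolution is approximated as per-bin multiplication of magnitudes."""
    return [c * h for c, h in zip(clean_mag, room_mag)]


def log_spectrum(mag):
    """Element-wise natural logarithm of a magnitude spectrum."""
    return [math.log(m) for m in mag]


def apply_shortcut(log_input, predicted_residual):
    """Generator output = network input + output of the last layer."""
    return [x + r for x, r in zip(log_input, predicted_residual)]


# Toy magnitudes for three frequency bins (illustrative values only).
clean = [2.0, 4.0, 8.0]
room = [0.5, 0.25, 2.0]

log_reverb = log_spectrum(reverberate_magnitude(clean, room))

# An ideal last layer would output -log(room); the shortcut then yields log(clean).
ideal_residual = [-math.log(h) for h in room]
enhanced = apply_shortcut(log_reverb, ideal_residual)
```

With this formulation the network only has to estimate the additive degradation term rather than regenerate the full feature map.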


2.3 x-vector Architectures

For the x-vector networks in our SV pipeline, we experimented with two different architectures: Extended TDNN (ETDNN) and Factorized TDNN (FTDNN) [3]. ETDNN improves upon TDNN [2] by interleaving dense layers in between the convolution layers. The FTDNN network forces the weight matrix between convolution layers to be a product of two low-rank matrices and introduces skip connections. The total numbers of parameters for ETDNN and FTDNN are 10M and 17M respectively. More details on the networks and the pipeline can be found in [3, 13].

3 Experimental Details

3.1 Dataset Details

The training of the UEN requires access to non-parallel features from the clean and reverberant domains, which were obtained as follows. The files from the same YouTube video of VoxCeleb1 [14] and VoxCeleb2 [15] were concatenated to obtain longer audio sequences; the result is denoted voxcelebcat. Since voxcelebcat was collected in wild conditions and contained unwanted background noise, additional filtering of files was done based on their Signal-to-Noise Ratio (SNR), similar to the recent LibriTTS [16] work. We retained only the top 50% of files sorted by their SNR value, estimated using the Waveform Amplitude Distribution Analysis (WADASNR) algorithm [17]. Thus, we obtained speech from 7104 speakers with a duration of around 1665 hours. This clean corpus, termed voxcelebcat_wadasnr, was used as the source domain for training the UEN.
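The SNR-based filtering step amounts to ranking files by estimated SNR and keeping the upper half. A minimal sketch follows, assuming the SNR estimates are already available as a dictionary; the function name and file names are hypothetical, and the WADASNR estimator itself is not reproduced here.

```python
def keep_top_fraction_by_snr(snr_by_file, fraction=0.5):
    """Return the file names with the highest estimated SNR, best first."""
    ranked = sorted(snr_by_file, key=snr_by_file.get, reverse=True)
    n_keep = int(len(ranked) * fraction)
    return ranked[:n_keep]


# Hypothetical SNR estimates in dB for four concatenated recordings.
estimated_snr = {"rec_a": 20.0, "rec_b": 5.0, "rec_c": 12.0, "rec_d": 30.0}
retained = keep_top_fraction_by_snr(estimated_snr)
```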

Degraded speech from the target domain for training the UEN was obtained either by simulation or from real recordings collected in adverse conditions. The simulated degraded speech was obtained by first convolving voxcelebcat_wadasnr with simulated Room Impulse Responses (RIRs; all RIRs are available for public use at http://www.openslr.org/26) with RT60 values in the range 0.0-1.0 seconds. Then noise from the Music, Speech and Noise (MUSAN) corpus was artificially added (at SNR levels of 15, 10, 5 and 0 dB) to the simulated reverberant speech (the speech and music portions of MUSAN were not used in the simulation). This corpus was termed voxcelebcat_reverb_noise, and its features were used as the target domain for training the UEN.
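Adding noise at a prescribed SNR reduces to scaling the noise so that the speech-to-noise power ratio matches the target before summing. The sketch below is illustrative only: it assumes equal-length speech and noise segments given as lists of samples, and the function names are hypothetical rather than taken from the actual simulation scripts.

```python
import math


def power(signal):
    """Average power of a signal given as a list of samples."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that 10*log10(P_speech / P_noise) equals snr_db, then add it to speech.

    Returns the noisy mixture and the scaled noise.
    """
    scale = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    scaled_noise = [n * scale for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


# Toy signals (illustrative values only).
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixture_0db, scaled_0db = mix_at_snr(speech, noise, 0)
```

Lower SNR values give a larger scale factor, so the 0 dB condition is the most heavily degraded of the four training levels.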

The target domain data for UENs trained with degraded speech from real recordings was sampled from the training sets of the AMI Meeting Corpus (AMI) [18] and Chime5 [19]. AMI was recorded in a setting of 3 different meeting rooms, with 180 speakers x 3.5 sessions per speaker. Out of these 180 speakers, 135 were used for training the UEN and 45 for testing. The Chime5 corpus was recorded in an uncontrolled indoor setting (kitchen, dining room, living room) with 80 speakers. Similar to the simulated setup, we added noise from MUSAN to the recordings from AMI and Chime5. The addition of noise to reverberant speech follows our earlier work on domain adaptation, where it was shown that noise addition improves the performance of CycleGAN by making the distributions of the two domains distinct while also improving the speed of convergence. The clean data, voxcelebcat_wadasnr, remains the same for both the simulated and real target domain UEN setups. The real target domain has far fewer speakers (135 from AMI) than the simulated setup (7104).

To test our UEN-SV pipeline, we used three different corpora: Speakers In The Wild (SITW) [20], AMI and SRI [21] (the SRI data was recorded by SRI International and was submitted to LDC for publication). The SRI data was recorded in a controlled indoor setting of small and large rooms with controlled backgrounds, and comprises 30 speakers, 2 sessions and 40 hours. Since the SRI data does not have a training portion, we used the training corpus from Chime5 (as explained earlier) as the target domain for training the UEN on real data. To test the effectiveness of the enhancement system, we also tested our UEN-SV system on reverberant and noisy test sets obtained from SITW using simulation. We treated SITW as the clean corpus. The reverberant copy of SITW, known as SITW reverb, was created in the same way as the training data, except that the maximum RT60 value of the RIRs used was set to 4.0 seconds (instead of 1.0). We ensured that the RIRs for training and testing simulations were disjoint. We also designed a simulated additive noise testing setup, called SITW noisy, by adding different types of noise from the MUSAN corpus and "background noises" from the CHiME-3 challenge (referred to as chime3bg) at different SNRs. This resulted in five test SNRs (-5dB, 0dB, 5dB, 10dB, 15dB) and four noise types (noise, music, babble, chime3bg). We ensured that the test noise files were disjoint from the ones used for training.

The testing data for AMI and SRI data was split into enrollment and test utterances which were classified as per their duration. test>= sec and enroll= sec refers to test and enrollment utterances of minimum and equal to seconds from the speaker of interest respectively with and . The results from all conditions were averaged and reported in this work.

For x-vector system training in this work, the ETDNN and FTDNN systems were trained without and with data augmentation respectively. The training data for ETDNN was sampled from voxcelebcat (data preparation and training scripts can be found at https://github.com/jsalt2019-diadet/jsalt2019-diadet). The FTDNN system was trained using augmentation applied to data from voxcelebcat and several SRE datasets (details in [3]).

3.2 Training Details

The CycleGAN system was trained on 40-dimensional log mel-FB features. Short-time mean centering and energy-based Voice Activity Detection (VAD) were applied to the features. Two batches of features were sampled during each training step, one from clean and one from degraded speech. Since the training process was unsupervised, both mini-batches were drawn in a completely random fashion with no correspondence between them. The batch size was set to 32 and the sequence length to 127. The model was trained for 50 epochs. An epoch was considered complete when one random sample from each utterance of the clean training corpus had appeared once in that epoch. The Adam optimizer with momentum was used. The learning rates for the generators and discriminators were set to 0.0003 and 0.0001 respectively. The learning rates were kept constant for the first 15 epochs and then decreased linearly until they reached the minimum learning rate (1e-6). The cycle and adversarial loss weights were set to 2.5 and 1.0 respectively. We trained ETDNN and FTDNN using Kaldi for 3 epochs with the Natural Gradient Descent optimizer and a multi-GPU periodic model averaging scheme. These x-vector networks were trained with 40-dimensional MFCC features. During evaluation, the output log mel-FB features of the UEN were converted to MFCCs by applying the Discrete Cosine Transform (DCT) before forward passing them through the x-vector network.
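The learning rate schedule described above can be written as a small function. This is a sketch under one stated assumption: the linear decay is taken to reach the minimum learning rate exactly at the final (50th) epoch, which the text does not state explicitly. The function name and the 0-indexed epoch convention are illustrative choices.

```python
def learning_rate(epoch, base_lr, total_epochs=50, constant_epochs=15, min_lr=1e-6):
    """Learning rate for a 0-indexed epoch: constant first, then linear decay to min_lr."""
    if epoch < constant_epochs:
        return base_lr
    decay_epochs = total_epochs - constant_epochs
    # Fraction of the decay completed; capped at 1 so later epochs stay at min_lr.
    progress = min((epoch - constant_epochs + 1) / decay_epochs, 1.0)
    return base_lr + progress * (min_lr - base_lr)


# Generator and discriminator schedules with the base rates reported above.
generator_schedule = [learning_rate(e, 0.0003) for e in range(50)]
discriminator_schedule = [learning_rate(e, 0.0001) for e in range(50)]
```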

4 Results

In this section, we present the results of the UEN-SV system with and without augmentation applied to the SV systems. All results are reported using two metrics: Minimum Decision Cost Function (minDCF) and Equal Error Rate (EER).
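For readers unfamiliar with EER, the following sketch shows one simple way to approximate it from trial scores: sweep a decision threshold over the observed scores and report the operating point where the miss rate and false alarm rate are closest. It is an illustration with made-up scores, not the scoring tool used to produce the results in this section, and the function name is hypothetical.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: average of miss and false alarm rates where they are closest."""
    best_gap, eer = None, None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # A trial is accepted when its score is at or above the threshold.
        miss = sum(s < threshold for s in target_scores) / len(target_scores)
        false_alarm = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - false_alarm)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer


# Made-up scores: one target trial scores below two non-target trials.
targets = [0.9, 0.8, 0.7, 0.3]
nontargets = [0.6, 0.4, 0.2, 0.1]
eer = equal_error_rate(targets, nontargets)
```

The tables below report EER as a percentage, so a value returned by this function would be multiplied by 100.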

                     SITW              SITW reverb
ETDNN w/o aug        EER     minDCF    EER     minDCF
Baseline SV          5.23    0.340     6.78    0.460
SV with WPE enh      5.69    0.370     6.48    0.466
sim UEN-SV           5.68    0.323     6.09    0.363
Table 1: Enhancement results on SITW and SITW reverb

ETDNN w/o aug                    SNR (dB)
Noise type      System           10     5      0      -5
MUSAN noise     Baseline SV      .42    .50    .63    .80
                sim UEN-SV       .36    .39    .46    .57
MUSAN music     Baseline SV      .39    .48    .66    .87
                sim UEN-SV       .34    .38    .47    .64
MUSAN speech    Baseline SV      .43    .61    .89    1.0
                sim UEN-SV       .37    .49    .77    .99
chime3bg        Baseline SV      .45    .62    .92    .99
                sim UEN-SV       .35    .40    .51    .71
Table 2: Enhancement results on SITW noisy at various SNRs (in dB) (Only DCF values are shown to be concise)

4.1 UEN-SV Results on SITW and Simulated SITW

Table 1 presents the results for the UEN-SV system with ETDNN trained without data augmentation on the core-core condition of the SITW and SITW reverb test sets. The UEN was trained on the simulated voxcelebcat_reverb_noise data as the target domain (details in 3.1); this system was termed sim UEN-SV. We compared these results with a baseline SV system where the test features were not enhanced and with an SV system where the features were enhanced using the SOTA Weighted Prediction Error (WPE) [22, 23] dereverberation algorithm. On SITW reverb, we obtained 21% and 22% relative improvements in minDCF over the baseline SV and the SV with WPE enhancement respectively.
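The relative improvements quoted above follow directly from the minDCF values in Table 1. A small worked check, using a hypothetical helper name:

```python
def relative_improvement(baseline, enhanced):
    """Relative reduction of a cost metric, in percent (positive means enhanced is better)."""
    return 100.0 * (baseline - enhanced) / baseline


# minDCF on SITW reverb from Table 1.
over_baseline = relative_improvement(0.460, 0.363)  # baseline SV vs sim UEN-SV
over_wpe = relative_improvement(0.466, 0.363)       # SV with WPE enh vs sim UEN-SV
```

Rounded to the nearest integer, these evaluate to 21 and 22 respectively.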

We then tested the sim UEN-SV system on SITW noisy (details in 3.1). Out of the four testing conditions, only MUSAN noise was added to the training data of the UEN. The remaining three conditions (MUSAN speech, MUSAN music and chime3bg) were not used during the training of the UEN. The results are presented in Table 2. sim UEN-SV yielded consistent improvements on all four noise conditions at all SNRs. More pronounced improvements were observed at 0dB and -5dB SNRs. The results showed that the UEN we devised exhibited good dereverberation and denoising capabilities and also good generalization to unseen noise conditions (music, speech and chime3bg).

4.2 UEN-SV Results on AMI and SRI

Encouraged by the results on SITW reverb and SITW noisy, we tested the ETDNN-based UEN-SV system on more challenging evaluation corpora from AMI and SRI. Results are presented in Table 3. In addition to sim UEN-SV, we also present results for a UEN system trained using real data as the target domain, termed real UEN-SV, and for SV systems with PLDA adapted to the target domain as explained in [24]. The UEN system for AMI was trained on the training corpus of AMI. However, the UEN system for SRI was trained on Chime5 as target domain data because no training set is available for the SRI corpus (details in 3.1). As shown in Table 3, both the real and sim UEN-SV systems improved over the baseline SV system on both test sets. For AMI, real UEN-SV performed better than the sim UEN-SV system even though it was trained on a smaller amount of target domain data than the sim UEN. However, the advantage of using real data over simulated data dropped when PLDA was adapted to the target domain. For SRI, unlike AMI, sim UEN-SV performed better than real UEN-SV. The difference in domains between SRI (test set) and Chime5 (training set) might have resulted in the slightly poorer performance of real UEN-SV compared to its simulated counterpart. From these experiments we observed that when training and evaluation conditions matched closely in the target domain (as in AMI), the use of real data over simulated data offered an advantage, which justifies our approach to unsupervised enhancement.

                                       AMI               SRI
                    System             EER     minDCF    EER     minDCF
ETDNN w/o aug       Baseline SV        26.51   0.940     21.11   0.767
                    sim UEN-SV         20.22   0.766     18.63   0.714
                    real UEN-SV        19.66   0.726     19.92   0.732
ETDNN w/o aug       Baseline SV        22.61   0.847     19.10   0.774
and PLDA adapt      sim UEN-SV         18.57   0.680     17.26   0.738
                    real UEN-SV        18.21   0.691     19.41   0.767
Table 3: UEN-SV results on AMI and SRI

                       System                           EER     minDCF
FTDNN with xvec aug    Baseline SV                      18.00   0.721     0.832
& w/o PLDA aug         enhance test data                17.20   0.675     0.720
FTDNN with xvec aug    Baseline SV                      13.87   0.523     0.541
& with PLDA aug        UEN-SV with test enh             14.33   0.557     0.572
                       UEN-SV with test and train enh   14.10   0.518     0.540
Table 4: UEN-SV results on AMI with xvector augmentation

4.3 UEN-SV Results on AMI with Data Augmentation

The results of enhancement on an FTDNN x-vector system trained with data augmentation are presented in Table 4. We considered two cases: 1) PLDA trained without augmentation and 2) PLDA trained with augmentation. Enhancement improved the SV system whose PLDA was trained without augmentation (6.4% relative improvement in minDCF). For the system with PLDA augmentation, enhancing only the test/enroll data degraded the performance. We then enhanced the PLDA training data, extracted the corresponding x-vectors, and retrained the PLDA. With this setup we observed slight improvements over the baseline SV model. We did not retrain the x-vector network on enhanced features. However, encouraged by this trend, we intend in the future to train the x-vector network on enhanced features, which would make the entire pipeline homogeneous (train and test on enhanced features).

5 Summary and Future Work

We devised an unsupervised feature enhancement network with the end goal of improving the performance of x-vector based speaker verification systems. Validation on several simulated noisy, reverberant and real test sets showed the effectiveness of this approach when no data augmentation was used for the SV system or when data augmentation was used only for x-vector training. However, complementing an SV system whose x-vector network and PLDA are both trained with data augmentation remains a challenging task for an enhancement system. Encouraged by the observations in this work, we plan to develop a homogeneous UEN-SV system where both the x-vector network and the PLDA are trained on enhanced features and the testing data is enhanced during evaluation. We also plan to explore learning domain-specific augmentation features using CycleGAN, by transforming clean features to the real target domain and using them to train the PLDA and x-vector systems.