Speech is often contaminated by background noise, reverberation and other unwanted variability introduced during its acquisition. An ideal Speaker Verification (SV) system should be robust to any background noise and reverberation present. Developing robust SV systems has recently become a very active research area, and several challenges have been organized, such as the NIST Speaker Recognition Evaluation (SRE) 2019, the VOiCES from a Distance Challenge, and the VoxCeleb Speaker Recognition Challenge 2019.
One approach to improve the robustness of SV systems is to train them on data created by artificially adding noise to the original training data or simulating the reverberant speech. This method, known as data augmentation, has proven to be effective in improving the performance of SV systems yielding state-of-the-art (SOTA) results on various tasks [2, 3]. However, such simulation strategies do not take into account the amount and type of degradation the test utterances can have. A recent study on Speaker Diarization on children’s speech  demonstrates various challenges that x-vector systems face in adverse scenarios.
In this work, we experimented with an unsupervised single channel wide-band feature enhancement approach to improve the quality of speech features with the end goal of improving the performance of SV systems – a task-specific approach. The motivation behind taking an unsupervised approach was to incorporate the knowledge of the target (adverse) domain in the enhancement procedure with the help of some training data from that domain. The Unsupervised Enhancement Network (UEN) we experimented with was a cycle consistent generative adversarial network (CycleGAN)  trained on log Mel-filter bank (log mel-FB) features.
Previously, task-specific enhancement techniques have been proposed for Automatic Speech Recognition (ASR) and SV. Denoising approach using CycleGAN was proposed by  to improve the performance of ASR with results reported on several simulated test conditions. For SV,  and  have reported improvements on simulated data.
The main contributions of this paper are as follows: 1) to develop a unified UEN that serves a dual purpose, simultaneous dereverberation and denoising, 2) to test the generalization ability of this network to unseen test conditions, 3) to use features extracted from real degraded speech to train the UEN, and 4) to investigate whether the UEN approach complements the SOTA x-vector system trained with data augmentation.
Our experimental approach was as follows. We first developed an enhancement-based SV pipeline, referred to as the UEN-SV system, in which the test features are enhanced by the UEN before the x-vectors are extracted. When data augmentation was used to train the x-vector networks, the x-vectors for training the PLDA were also extracted from enhanced training data. To be consistent with the CycleGAN notation, we use the terms clean/source and reverberant/target interchangeably in this paper.
2 Unsupervised Enhancement System
2.1 CycleGAN Training
The UEN in this work is a CycleGAN, which consists of two generators and two discriminators. The generators map features from one domain to the other. They are trained using a multi-task objective with two loss components: an adversarial loss and a cycle-consistency loss. The adversarial loss makes the generator produce features that appear to be drawn from the opposite domain. The cycle-consistency loss additionally constrains the generator to reconstruct the original features of a domain from the features generated in the opposite domain (achieved by minimizing the distance between original and reconstructed features). The adversarial loss of each generator relies on a binary classifier, termed the discriminator, coupled to that generator. The task of the discriminator is to distinguish between original and generated features of a particular domain, which is achieved by minimizing a least-squares objective. The adversarial loss then becomes a non-saturating loss as shown in . During evaluation, features of degraded speech are enhanced by mapping them to the clean domain using the corresponding generator. More details on the objectives used for training the CycleGAN can be found in our previous work on domain adaptation [11, 12].
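As a minimal sketch of the two loss components described above, the snippet below computes a least-squares discriminator loss, the matching generator adversarial loss, and a cycle-consistency loss on toy values. The function names, the target labels of 1 and 0, the factor 0.5, and the choice of an L1 distance for the cycle term are assumptions made for illustration and are not taken from this paper; the default weights 1.0 and 2.5 are the adversarial and cycle loss weights quoted in Section 3.2.

```python
def mean(values):
    return sum(values) / len(values)


def discriminator_loss(scores_original, scores_generated):
    """Least-squares objective: push scores of original features towards 1
    and scores of generated features towards 0 (assumed label convention)."""
    original_term = mean([(s - 1.0) ** 2 for s in scores_original])
    generated_term = mean([s ** 2 for s in scores_generated])
    return 0.5 * (original_term + generated_term)


def generator_adversarial_loss(scores_generated):
    """The generator is rewarded when the discriminator scores its output near 1."""
    return 0.5 * mean([(s - 1.0) ** 2 for s in scores_generated])


def cycle_consistency_loss(original, reconstructed):
    """Mean absolute (L1) distance between original and reconstructed features.
    The L1 choice is an assumption; the text only says 'distance'."""
    return mean([abs(o - r) for o, r in zip(original, reconstructed)])


def generator_loss(scores_generated, original, reconstructed,
                   adversarial_weight=1.0, cycle_weight=2.5):
    """Weighted sum of the adversarial and cycle-consistency terms."""
    return (adversarial_weight * generator_adversarial_loss(scores_generated)
            + cycle_weight * cycle_consistency_loss(original, reconstructed))


# Toy example: the discriminator rejects the generated features (score 0.0)
# and the round trip reconstructs [1.0, 2.0] as [1.0, 4.0].
example_loss = generator_loss([0.0], [1.0, 2.0], [1.0, 4.0])
```

In a full CycleGAN this generator loss would be evaluated once per mapping direction, each with its own discriminator.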
2.2 CycleGAN Architecture
The generator was a fully convolutional residual network with an encoder-decoder architecture. The encoder consisted of three convolutional layers followed by nine residual blocks. The numbers of filters in the first three convolutional layers were set to 32, 64 and 128, with strides of 1, 2 and 2 respectively. Each residual block consisted of two convolutional layers with 128 filters. The decoder consisted of two deconvolutional layers with stride 2 and with 64 and 32 filters respectively, followed by a final convolutional layer with stride 1. Instance normalization was used in every layer except the first and last. ReLU activation was used in all layers except the last. The kernel size in all layers was set to 3x3. We used a shortcut connection from the input of the network to its output: the input was added to the output of the last layer to form the generator's final output. We trained the generators on log mel-FB features. Since reverberation is a convolution operation, it becomes additive in the log-spectral domain. Hence, the shortcut connection disentangles the reverberation effect (which is estimated by the model) from the input. The discriminator had 5 convolutional layers, each with a kernel size of 4. The strides of the first three and last two layers were set to 2 and 1 respectively. The numbers of filters in the layers were set to 64, 128, 256, 512 and 1. LeakyReLU with slope 0.2 was used as the activation in all layers except the last. More details on the architecture can be found in.
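The argument for the shortcut connection rests on convolution turning into addition in the log-spectral domain. The sketch below illustrates this on a toy signal. It assumes circular convolution and a plain DFT magnitude spectrum so that the identity holds exactly; with short analysis windows and mel filter banks, as used for real log mel-FB features, the relation is only approximate. The signal values and function names are hypothetical.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform (adequate for a short toy signal)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def circular_convolve(x, h):
    """Circular convolution of two equal-length sequences."""
    n = len(x)
    return [sum(x[m] * h[(t - m) % n] for m in range(n)) for t in range(n)]


def log_magnitude_spectrum(signal):
    return [math.log(abs(value)) for value in dft(signal)]


# Hypothetical clean signal and room impulse response (zero-padded to 8 taps).
clean = [1.0, 0.5, -0.25, 0.75, 0.1, -0.6, 0.3, 0.2]
room = [1.0, 0.6, 0.3, 0.1, 0.0, 0.0, 0.0, 0.0]
reverberant = circular_convolve(clean, room)

log_clean = log_magnitude_spectrum(clean)
log_room = log_magnitude_spectrum(room)
log_reverberant = log_magnitude_spectrum(reverberant)

# Largest deviation from log|Y| = log|X| + log|H| over all frequency bins.
max_error = max(
    abs(r - (c + h)) for r, c, h in zip(log_reverberant, log_clean, log_room)
)
```

Under these assumptions `max_error` is at the level of floating-point rounding, which is why a network that adds its output to the input only needs to estimate the (negated) room term.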
2.3 x-vector Architectures
For the x-vector networks in our SV pipeline, we experimented with two different architectures: Extended TDNN (ETDNN) and Factorized TDNN (FTDNN). ETDNN improves upon TDNN by interleaving dense layers between the convolution layers. The FTDNN network forces the weight matrix between convolution layers to be a product of two low-rank matrices and introduces skip connections. The total numbers of parameters for ETDNN and FTDNN are 10M and 17M respectively. More details on the networks and the pipeline can be found in [3, 13].
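To make the low-rank constraint concrete, the following sketch compares the parameter count of a dense weight matrix with that of a product of two low-rank factors, and multiplies two tiny factors to show the resulting matrix. The layer sizes and the rank are hypothetical and are not taken from the FTDNN configuration used in this work; the total parameter counts quoted above depend on the whole network, not on a single layer.

```python
def dense_parameters(rows, cols):
    """Number of weights in an unconstrained rows x cols matrix."""
    return rows * cols


def factorized_parameters(rows, cols, rank):
    """Number of weights when the matrix is a (rows x rank) times (rank x cols) product."""
    return rows * rank + rank * cols


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Hypothetical layer sizes chosen only for illustration.
ROWS, COLS, RANK = 1024, 1024, 128
full_count = dense_parameters(ROWS, COLS)
low_rank_count = factorized_parameters(ROWS, COLS, RANK)

# A 3x1 factor times a 1x3 factor yields a 3x3 matrix of rank 1.
left = [[1.0], [2.0], [3.0]]
right = [[4.0, 5.0, 6.0]]
product = matmul(left, right)
```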
3 Experimental Details
3.1 Dataset Details
Training the UEN requires access to non-parallel features from the clean and reverberant domains, which were obtained as follows. The files from the same YouTube video of VoxCeleb1 and VoxCeleb2 were concatenated to obtain longer audio sequences; the result is denoted voxcelebcat. Since voxcelebcat was collected in wild conditions and contained unwanted background noise, the files were additionally filtered based on their Signal-to-Noise Ratio (SNR), similar to the recent LibriTTS work. We retained only the top 50% of the files, sorted by their SNR as estimated with the Waveform Amplitude Distribution Analysis (WADASNR) algorithm. Thus, we obtained speech from 7104 speakers with a total duration of around 1665 hours. This clean corpus, termed voxcelebcat_wadasnr, was used as the source domain for training the UEN.
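The filtering step above amounts to ranking files by estimated SNR and keeping the better half. A minimal sketch follows, assuming the per-file SNR estimates have already been computed (for example with a WADA-SNR implementation, which is not shown here). The file names and SNR values are hypothetical.

```python
def keep_top_snr_fraction(snr_by_file, fraction=0.5):
    """Return the file names with the highest estimated SNR.

    snr_by_file maps a file name to its estimated SNR in dB.
    fraction is the share of files to retain (0.5 keeps the top half).
    """
    ranked = sorted(snr_by_file, key=snr_by_file.get, reverse=True)
    n_keep = int(len(ranked) * fraction)
    return ranked[:n_keep]


# Hypothetical SNR estimates in dB.
estimated_snr = {
    "spk1_video1.wav": 12.0,
    "spk1_video2.wav": 25.5,
    "spk2_video1.wav": 3.1,
    "spk2_video2.wav": 18.4,
}
retained = keep_top_snr_fraction(estimated_snr)
```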
Degraded speech from the target domain for training the UEN was obtained either by simulation or from real recordings collected in adverse conditions. The simulated degraded speech was obtained by first convolving voxcelebcat_wadasnr with simulated Room Impulse Responses (RIRs) with RT60 values in the range 0.0-1.0 seconds (all RIRs are available for public use at http://www.openslr.org/26). Then noise from the Music, Speech and Noise (MUSAN) corpus was artificially added (at SNR levels of 15, 10, 5 and 0 dB) to the simulated reverberant speech (the speech and music portions of MUSAN were not used in the simulation). This corpus was termed voxcelebcat_reverb_noise, and its features were used as the target domain for training the UEN.
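Adding noise at a prescribed SNR, as in the simulation above, comes down to scaling the noise so that the speech-to-noise power ratio matches the target. The sketch below shows one common way to do this on toy sample lists; it is an illustration under that assumption and not the simulation code used for this work. The function names and sample values are hypothetical.

```python
import math


def power(samples):
    """Average power of a sequence of samples."""
    return sum(v * v for v in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech power / noise power equals the target SNR, then add."""
    target_ratio = 10 ** (snr_db / 10)
    gain = math.sqrt(power(speech) / (power(noise) * target_ratio))
    return [s + gain * n for s, n in zip(speech, noise)]


# Hypothetical reverberant speech and noise segments of equal length.
speech = [0.5, -0.3, 0.8, -0.1]
noise = [0.05, 0.02, -0.04, 0.01]

# One noisy copy per SNR level used in the simulation (15, 10, 5 and 0 dB).
noisy_copies = {snr_db: mix_at_snr(speech, noise, snr_db) for snr_db in (15, 10, 5, 0)}
```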
The target domain data for UENs trained with degraded speech from real recordings was sampled from the training sets of the AMI Meeting Corpus (AMI) and Chime5. AMI was recorded in 3 different meeting rooms, with 180 speakers x 3.5 sessions per speaker. Out of these 180 speakers, 135 were used for training the UEN and 45 for testing. The Chime5 corpus was recorded in an uncontrolled indoor setting (kitchen, dining room and living room) with 80 speakers. Similar to the simulated setup, we added noise from MUSAN to the recordings from AMI and Chime5. Adding noise to reverberant speech follows from our earlier work on domain adaptation, where it was shown that noise addition improves the performance of CycleGAN by making the distributions of the two domains distinct while also improving the speed of convergence. The clean data, voxcelebcat_wadasnr, remains the same for both the simulated and real target domain UEN setups. The real target domain has far fewer speakers (135 from AMI) than the simulated setup (7104).
To test our UEN-SV pipeline, we used three different corpora: Speakers In The Wild (SITW), AMI and SRI (this data was recorded by SRI International and was submitted to LDC for publication). The SRI data was recorded in a controlled indoor setting of small/large rooms with controlled backgrounds, and comprises 30 speakers, 2 sessions and 40 hours. Since the SRI data does not have a training portion, we used the training corpus from Chime5 (as explained earlier) as the target domain for training the UEN on real data. To test the effectiveness of the enhancement system, we also tested our UEN-SV system on reverberant and noisy test sets obtained from SITW by simulation. We treated SITW as the clean corpus. The reverberant copy of SITW, known as SITW reverb, was created in the same way as the training data, except that the maximum RT60 of the RIRs was set to 4.0 seconds (instead of 1.0). We ensured that the RIRs used for the training and testing simulations were disjoint. We also designed a simulated additive noise test setup, called SITW noisy, by adding different types of noise from the MUSAN corpus and “background noises” from the CHiME-3 challenge (referred to as chime3bg) at different SNRs. This resulted in five test SNRs (-5dB, 0dB, 5dB, 10dB, 15dB) and four noise types (noise, music, babble, chime3bg). We ensured that the test noise files were disjoint from the ones used for training.
The testing data for AMI and SRI was split into enrollment and test utterances, which were classified according to their duration. test>= sec and enroll= sec refer to test and enrollment utterances of minimum and equal to seconds from the speaker of interest respectively with and . The results from all conditions were averaged and reported in this work.
For x-vector system training in this work, the ETDNN and FTDNN systems were trained without and with data augmentation respectively. The training data for ETDNN was sampled from voxcelebcat (data preparation and training scripts can be found at https://github.com/jsalt2019-diadet/jsalt2019-diadet). The FTDNN system was trained using augmentation applied to data from voxcelebcat and several SRE datasets (details in ).
3.2 Training Details
The CycleGAN system was trained on 40-dimensional log mel-FB features. Short-time mean centering and energy-based Voice Activity Detection (VAD) were applied to the features. Two batches of features, one from clean and one from degraded speech, were sampled during each training step. Since the training process was unsupervised, both mini-batches were drawn in a completely random fashion, with no correspondence between the two batches. The batch size was set to 32 and the sequence length to 127. The model was trained for 50 epochs. An epoch was considered complete when one random sample from each utterance of the clean training corpus had appeared once in that epoch. Adam Optimizer was used with momentum. The learning rates for the generators and discriminators were set to 0.0003 and 0.0001 respectively. The learning rates were kept constant for the first 15 epochs and then linearly decreased until they reached the minimum learning rate (1e-6). The cycle and adversarial loss weights were set to 2.5 and 1.0 respectively. We trained ETDNN and FTDNN using Kaldi for 3 epochs with the Natural Gradient Descent optimizer and a multi-GPU periodic model averaging scheme. These x-vector networks were trained with 40-dimensional MFCC features. During evaluation, the output log mel-FB features of the UEN were converted to MFCCs by applying the Discrete Cosine Transform (DCT) before the forward pass through the x-vector network.
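Two steps in the paragraph above lend themselves to a short illustration: the learning-rate schedule and the conversion of enhanced log mel-FB features to MFCCs. The sketch below is a hedged reading of both. For the schedule, it assumes that epochs are counted from 0 and that the linear decay reaches the minimum exactly at the final epoch, which the text does not state. For the conversion, it assumes an orthonormal DCT-II with no liftering or other post-processing; the exact DCT variant applied in the actual pipeline may differ. All function names are hypothetical.

```python
import math


def learning_rate(epoch, base_lr, min_lr=1e-6, constant_epochs=15, total_epochs=50):
    """Constant for the first `constant_epochs`, then linear decay to `min_lr`.

    Assumption: the minimum is reached at `total_epochs` (epochs counted from 0).
    """
    if epoch < constant_epochs:
        return base_lr
    progress = (epoch - constant_epochs) / (total_epochs - constant_epochs)
    progress = min(progress, 1.0)
    return base_lr + progress * (min_lr - base_lr)


def dct_ii(frame):
    """Orthonormal DCT-II of one feature frame (assumed variant)."""
    n = len(frame)
    coefficients = []
    for k in range(n):
        total = sum(frame[i] * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coefficients.append(scale * total)
    return coefficients


def log_mel_to_mfcc(log_mel_frames):
    """Apply the DCT frame by frame to a sequence of log mel-FB frames."""
    return [dct_ii(frame) for frame in log_mel_frames]


# Generator and discriminator learning rates at a few epochs.
generator_lrs = [learning_rate(e, 0.0003) for e in (0, 14, 15, 32, 50)]
discriminator_lrs = [learning_rate(e, 0.0001) for e in (0, 14, 15, 32, 50)]

# A flat 40-dimensional log mel-FB frame maps to a single non-zero coefficient.
flat_frame = [2.0] * 40
mfcc = log_mel_to_mfcc([flat_frame])[0]
```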
In this section, we present the results of the UEN-SV system with and without augmentation applied to the SV systems. All results are reported in terms of Minimum Decision Cost Function (minDCF) and Equal Error Rate (EER).
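For readers less familiar with these two metrics, the sketch below computes both from lists of target and non-target trial scores by sweeping a decision threshold over the observed scores. It is a simplified illustration, not the scoring tool used for the reported results. The trial scores are hypothetical, and the target prior and error costs passed to the minDCF function are placeholders, since the operating point is not specified in this section.

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """False acceptance and false rejection rates when accepting scores >= threshold."""
    false_rejections = sum(s < threshold for s in target_scores)
    false_acceptances = sum(s >= threshold for s in nontarget_scores)
    return (false_acceptances / len(nontarget_scores),
            false_rejections / len(target_scores))


def equal_error_rate(target_scores, nontarget_scores):
    """Error rate at the threshold where the two error rates are closest."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    rates = [error_rates(target_scores, nontarget_scores, t) for t in thresholds]
    far, frr = min(rates, key=lambda pair: abs(pair[0] - pair[1]))
    return (far + frr) / 2


def min_dcf(target_scores, nontarget_scores, p_target, cost_miss=1.0, cost_fa=1.0):
    """Minimum normalized detection cost over all thresholds.

    p_target, cost_miss and cost_fa are placeholders chosen by the caller.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores)) + [float("inf")]
    best = min(
        cost_miss * p_target * frr + cost_fa * (1 - p_target) * far
        for far, frr in (error_rates(target_scores, nontarget_scores, t)
                         for t in thresholds)
    )
    return best / min(cost_miss * p_target, cost_fa * (1 - p_target))


# Hypothetical trial scores.
targets = [2.0, 1.5, 0.2, 3.0]
nontargets = [-1.0, 0.5, -0.3, 1.0]
eer = equal_error_rate(targets, nontargets)
dcf = min_dcf(targets, nontargets, p_target=0.5)
```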
| ETDNN w/o aug | EER | minDCF | EER | minDCF |
| SV with WPE enh | 5.69 | 0.370 | 6.48 | 0.466 |

| | MUSAN noise | MUSAN music | MUSAN speech | chime3bg |
| ETDNN w/o aug | 10, 5, 0, -5 | 10, 5, 0, -5 | 10, 5, 0, -5 | 10, 5, 0, -5 |
4.1 UEN-SV Results on SITW and Simulated SITW
Table 1 presents the results for the UEN-SV system with ETDNN trained without data augmentation on the core-core condition of the SITW and SITW reverb test sets. The UEN was trained on simulated voxcelebcat_reverb_noise data as the target domain (details in 3.1); this system is termed sim UEN-SV. We compared these results with a baseline SV system where the test features were not enhanced and an SV system where the features were enhanced using the SOTA Weighted Prediction Error (WPE) [22, 23] dereverberation algorithm. We obtained 21% and 22% relative improvements in minDCF on SITW reverb over the baseline SV and the SV with WPE enhancement, respectively.
We then tested the sim UEN-SV system on SITW noisy (details in 3.1). Out of the four testing conditions, only MUSAN noise was added to the training data of the UEN. The remaining three conditions (MUSAN speech, MUSAN music and chime3bg) were not used during the training of the UEN. The results are presented in Table 2. sim UEN-SV yielded consistent improvements on all four noise conditions at all SNRs. More pronounced improvements were observed at 0dB and -5dB SNRs. The results showed that the UEN we devised exhibited good dereverberation and denoising capabilities, as well as good generalization to unseen noise conditions (music, speech and chime3bg).
4.2 UEN-SV Results on AMI and SRI
Encouraged by the results on SITW reverb and SITW noisy, we tested the ETDNN-based UEN-SV system on more challenging evaluation corpora from AMI and SRI. Results are presented in Table 3. In addition to sim UEN-SV, we also present results for a UEN system trained using real data as the target domain, termed real UEN-SV, and for an SV system with PLDA adapted to the target domain as explained in . The UEN system for AMI was trained on the training corpus of AMI. However, the UEN system for SRI was trained on Chime5 as target domain data, since no training set is available for the SRI corpus (details in 3.1). As shown in Table 3, both the real and sim UEN-SV systems improved over the baseline SV system on both test sets. For AMI, real UEN-SV performed better than sim UEN-SV even though it was trained on a smaller amount of target domain data. However, the advantage of using real data over simulated data dropped when the PLDA was adapted to the target domain. For SRI, unlike AMI, sim UEN-SV performed better than real UEN-SV. The difference in domains between SRI (test set) and Chime5 (training set) might have resulted in the slightly poorer performance of real UEN-SV compared to its simulated counterpart. From these experiments we observed that when training and evaluation conditions matched closely in the target domain (as in AMI), using real data instead of simulated data offered an advantage, which justifies our approach to unsupervised enhancement.
| ETDNN w/o aug |
| ETDNN w/o aug and PLDA adapt |

| FTDNN with xvec aug & w/o PLDA aug |
| enhance test data | 17.20 | 0.675 | 0.720 |
| FTDNN with xvec aug & with PLDA aug |
| UEN-SV with test enh | 14.33 | 0.557 | 0.572 |
| UEN-SV with test and train enh | 14.10 | 0.518 | 0.540 |
4.3 UEN-SV Results on AMI with Data Augmentation
The results of enhancement on an FTDNN x-vector system trained with data augmentation are presented in Table 4. We considered two cases: 1) PLDA trained without augmentation and 2) PLDA trained with augmentation. Enhancement improved the SV system whose PLDA was trained without augmentation (6.4% relative improvement in minDCF). For the system with PLDA augmentation, enhancing only the test/enroll data degraded the performance. We then enhanced the PLDA training data, extracted the corresponding x-vectors, and retrained the PLDA. With this setup we observed slight improvements over the baseline SV model. We did not retrain the x-vector network on enhanced features. However, encouraged by this trend, we intend in future to train the x-vector network on enhanced features, which would make the entire pipeline homogeneous (train and test on enhanced features).
5 Summary and Future Work
We devised an unsupervised feature enhancement network with the end goal of improving the performance of x-vector based speaker verification systems. Validation on several simulated noisy, simulated reverberant and real test sets showed the effectiveness of this approach when no data augmentation was used for the SV system, or when data augmentation was used only for x-vector training. However, complementing an SV system in which both the x-vector network and the PLDA are trained with data augmentation remains challenging. Encouraged by the observations in this work, we plan to develop a homogeneous UEN-SV system where both the x-vector network and the PLDA are trained on enhanced features and the test data is enhanced during evaluation. We are also considering learning domain-specific augmentation features using CycleGAN, by transforming clean features to the real target domain and using them to train the PLDA and x-vector systems.
[1] Mahesh Kumar Nandwana, Julien Van Hout, et al., “The voices from a distance challenge 2019 evaluation plan,” arXiv preprint arXiv:1902.10828, 2019.
[2] David Snyder et al., “X-vectors: Robust dnn embeddings for speaker recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5329–5333.
[3] Jesús Villalba, Nanxin Chen, et al., “State-of-the-art speaker recognition with neural network embeddings in nist sre18 and speakers in the wild evaluations,” Computer Speech & Language, p. 101026, 2019.
[4] Jiamin Xie, Leibny Paola García-Perera, et al., “Multi-plda diarization on children’s speech,” Proc. Interspeech 2019, pp. 376–380, 2019.
[5] Jun-Yan Zhu, Park, et al., “Unpaired image-to-image translation using cycle-consistent adversarial networks,” arXiv preprint, 2017.
[6] Zhong Meng, Jinyu Li, Yifan Gong, et al., “Cycle-consistent speech enhancement,” arXiv preprint arXiv:1809.02253, 2018.
[7] Suwon Shon, Hao Tang, and James Glass, “Voiceid loss: Speech enhancement for speaker verification,” arXiv preprint arXiv:1904.03601, 2019.
[8] Daniel Michelsanti and Zheng-Hua Tan, “Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification,” arXiv preprint arXiv:1709.01703, 2017.
[9] Xudong Mao, Li, et al., “Least squares generative adversarial networks,” in Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE, 2017, pp. 2813–2821.
[10] Ian Goodfellow, Pouget-Abadie, et al., “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
[11] Phani Sankar Nidadavolu, Saurabh Kataria, Jesús Villalba, and Najim Dehak, “Low-Resource Domain Adaptation for Speaker Recognition Using Cycle-GANs,” in Accepted at IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019, Sentosa, Singapore, 2019, IEEE.
[12] Phani Sankar Nidadavolu, Jesús Villalba, and Najim Dehak, “Cycle-gans for domain adaptation of acoustic features for speaker recognition,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6206–6210.
[13] Paola Garcia et al., “Speaker detection in the wild: lessons learned from jsalt 2019,” in ICASSP 2020 (submitted).
[14] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, “Voxceleb: a large-scale speaker identification dataset,” arXiv preprint arXiv:1706.08612, 2017.
[15] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman, “Voxceleb2: Deep speaker recognition,” arXiv preprint arXiv:1806.05622, 2018.
[16] Heiga Zen et al., “Libritts: A corpus derived from librispeech for text-to-speech,” arXiv preprint arXiv:1904.02882, 2019.
[17] Chanwoo Kim and Richard M Stern, “Robust signal-to-noise ratio estimation based on waveform amplitude distribution analysis,” in Ninth Annual Conference of the International Speech Communication Association, 2008.
[18] Iain McCowan, Jean Carletta, et al., “The ami meeting corpus,” in Proceedings of the 5th International Conference on Methods and Techniques in Behavioral Research, 2005, vol. 88, p. 100.
[19] Jon Barker et al., “The third chime speech separation and recognition challenge: Dataset, task and baselines,” in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015, pp. 504–511.
[20] Mitchell McLaren, Luciana Ferrer, Diego Castan, and Aaron Lawson, “The speakers in the wild (sitw) speaker recognition database,” in Interspeech, 2016, pp. 818–822.
[21] Diego Castán et al., “Ldc2019e60, distant microphone conversational speech in noisy environments,” Private communication in support of the 2019 JHU/CLSP Summer Workshop, 2019.
[22] Tomohiro Nakatani et al., “Speech dereverberation based on variance-normalized delayed linear prediction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1717–1731, 2010.
[23] Takuya Yoshioka and Tomohiro Nakatani, “Generalization of multi-channel linear prediction methods for blind mimo impulse response shortening,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 10, pp. 2707–2720, 2012.
[24] Jesús Villalba, Nanxin Chen, et al., “State-of-the-art speaker recognition for telephone and video speech: the jhu-mit submission for nist sre18,” in Interspeech, 2019, p. accepted for publication.