One of the standard speech enhancement approaches is to learn a mapping function from acoustic features of degraded speech to those of clean speech using a Deep Neural Network (DNN) [xu2014regression]. This feature mapping (FM) approach solves the problem by minimizing a distance metric, such as $L_1$ or $L_2$, between the output and the reference clean features. It is a supervised approach, since the DNN is trained on paired clean-degraded speech, usually obtained by simulation. The objective trains a regressor that outputs the mean of all plausible outputs, which is known to produce smooth and/or distorted features [bishop2006pattern, ledig2017photo]. This issue is well noted in the enhancement community [wang2019bridging]. In this work, we explored the use of an adversarial (adv) loss [goodfellow2014generative] to overcome the distortions introduced by the FM approach. We focused on dereverberation aimed at improving Speaker Verification System (SVS) performance.
Recently, task-specific enhancement approaches have been gaining attention in speech research. The use of cycle-consistency (cyc) loss [zhou2016learning, zhu2017unpaired] and adv loss [goodfellow2014generative], along with an FM loss, to train denoising networks for improving Automatic Speech Recognition (ASR) is explored in [meng2018cycle] and [meng2018adversarial] respectively, with results reported on simulated test conditions. For feature denoising in speaker verification (SV), Deep Feature Loss (DFL) [germain2018speech] in lieu of FM is proposed in [kataria2020feature, kataria2020analysis]. Speech enhancement was one of the main approaches considered for developing SVSs robust to adverse environments during the JSALT 2019 workshop [garcia2019speaker, nidadavolu2020unsupervised, kataria2020feature, sun2020progressive]. A relevant non-task-specific speech enhancement network using adv loss is proposed in [pascual2017segan].
In this work, we explore single-channel far-field (reverberant) microphone feature enhancement, in log mel filter-bank (melFB) space, for improving SV. Previously, we proposed an Unsupervised Enhancement Network (UEN) [nidadavolu2020unsupervised, nidadavolu2019lr], trained on unpaired reverb-clean data, that transforms features from the reverberant to the clean domain. UEN has shown good dereverberation and denoising capabilities, improving performance on simulated reverberant, simulated noisy, and real datasets collected in wild/uncontrolled environments. UEN also obtained better verification performance than the widespread Weighted Prediction Error (WPE)-based speech dereverberation approach [nakatani2010speech, yoshioka2012generalization]. Since UEN does not require paired data, it can be trained on real data from the reverberant (target) and clean (source) domains (we use the terms clean/source and reverberant/target interchangeably in this paper), thus avoiding the need for simulated data.
The main contributions of this work are as follows. First, we propose a Supervised Enhancement Network (SEN), trained on paired far-field (reverberant)-clean data, using a multi-task objective: a combination of FM and adv losses. Second, we demonstrate the importance of training the SEN with the adv loss by performing an ablation study on the loss functions, and by testing it on an SVS trained without data augmentation. Third, we propose a Domain Adaptation Network (DAN) that maps features from the reverberant domain to some chosen domain (not necessarily clean). DAN, like UEN, is trained on unpaired data. Fourth, we test the effectiveness of SEN and DAN in improving the performance of an SVS trained with data augmentation, under three training schemes.
Our experimental approach was as follows: we developed an SVS pipeline in which the features of the evaluation data (enrollment and test utterances) were mapped to the clean domain via SEN or to a chosen domain via DAN. We compared them with the previously proposed homogeneous UEN-SVS pipeline [nidadavolu2020unsupervised], in which the SVS is trained and evaluated on features enhanced using UEN. To make a fair comparison between SEN, DAN, and UEN, we trained all the networks on a common list of audio files, unless specified otherwise.
2 Enhancement Networks
2.1 Supervised Enhancement Network (SEN)
The SEN is trained on paired reverberant-clean speech to minimize a combination of FM and adv losses. Let $x$ denote the reverberant input features, $y$ the corresponding clean features, and $G$ the SEN. We chose the $L_1$ metric for the FM objective:

$$\mathcal{L}_{\mathrm{FM}}(G) = \mathbb{E}_{(x,y)}\big[\lVert G(x) - y \rVert_1\big]$$
This objective usually distorts the output by making it smooth [bishop2006pattern, ledig2017photo]. The added adv loss [goodfellow2014generative] avoids this. The adv loss requires a discriminator $D$: a binary classifier that discriminates between the enhanced and original clean features. The SEN is then trained to trick the discriminator into believing that its output features are sampled from the original clean feature distribution instead of the enhanced feature distribution. At the end of training, the enhanced and original clean features become indistinguishable to the discriminator, making the enhanced features more realistic and thus avoiding distortion. We used the least-squares objective [mao2017least] to train the discriminator as

$$\mathcal{L}_{\mathrm{dis}}(D) = \tfrac{1}{2}\,\mathbb{E}_{y}\big[(D(y)-1)^2\big] + \tfrac{1}{2}\,\mathbb{E}_{x}\big[D(G(x))^2\big]$$
The adv objective for the SEN is

$$\mathcal{L}_{\mathrm{adv}}(G) = \tfrac{1}{2}\,\mathbb{E}_{x}\big[(D(G(x))-1)^2\big]$$
The final multi-task objective for training the SEN is given by

$$\mathcal{L}_{\mathrm{SEN}} = \lambda_{\mathrm{FM}}\,\mathcal{L}_{\mathrm{FM}}(G) + \lambda_{\mathrm{adv}}\,\mathcal{L}_{\mathrm{adv}}(G)$$
where $\lambda_{\mathrm{FM}}$ and $\lambda_{\mathrm{adv}}$ represent the weights assigned to the FM and adv objectives respectively.
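The SEN objective can be sketched in a few lines. The NumPy functions below are illustrative stand-ins only (the actual SEN and discriminator are convolutional networks; the function names are ours, and an $L_1$ FM metric is assumed):

```python
import numpy as np

def fm_loss(enhanced, clean):
    """L1 feature-mapping loss between enhanced and reference clean features."""
    return np.abs(enhanced - clean).mean()

def lsgan_disc_loss(d_clean, d_enhanced):
    """Least-squares discriminator loss: push clean scores to 1, enhanced to 0."""
    return 0.5 * ((d_clean - 1.0) ** 2).mean() + 0.5 * (d_enhanced ** 2).mean()

def lsgan_adv_loss(d_enhanced):
    """Adversarial loss for the SEN: make enhanced features score as real (1)."""
    return 0.5 * ((d_enhanced - 1.0) ** 2).mean()

def sen_loss(enhanced, clean, d_enhanced, w_fm=1.0, w_adv=0.1):
    """Multi-task SEN objective; the default weights match Sec. 3.3."""
    return w_fm * fm_loss(enhanced, clean) + w_adv * lsgan_adv_loss(d_enhanced)
```

In training, the discriminator minimizes `lsgan_disc_loss` while the SEN minimizes `sen_loss`, alternating between the two.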
2.2 Unsupervised Enhancement Network (UEN)
We compare SEN with the previous UEN work [nidadavolu2020unsupervised]. UEN is trained on unpaired data as follows. We train a CycleGAN [zhu2017unpaired], which consists of two generators and two discriminators: one generator maps features from the clean to the reverberant domain, while the second maps features from the reverberant to the clean domain. The generators are trained using a multi-task objective consisting of an adv loss and a cyc loss. Similar to SEN, the adv loss is responsible for making each generator produce features that appear to be drawn from the real distribution of the domain it maps to; the adv loss of each generator is obtained using its respective discriminator, trained with the least-squares objective (LS-GAN) [mao2017least]. The cyc loss additionally constrains the generators to reconstruct the original features of each domain from the generated features in the opposite domain (achieved by minimizing the distance between original and reconstructed features). The cyc loss ensures that no information is lost during the mapping.
During evaluation, the features of reverberant speech are enhanced by the reverberant-to-clean generator, termed UEN. More details on the training procedure and objectives used for training the CycleGAN can be found in related previous works [nidadavolu2020unsupervised, nidadavolu2019lr, nidadavolu2019cycle].
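The cycle-consistency term can be sketched as follows. This is an illustrative NumPy version: `g_rev2clean` and `g_clean2rev` are hypothetical stand-ins for the two convolutional generators, and an $L_1$ reconstruction distance is assumed:

```python
import numpy as np

def cycle_loss(real_rev, real_clean, g_rev2clean, g_clean2rev):
    """L1 cycle-consistency: features of each domain must be recoverable
    after a round trip through both generators."""
    rec_rev = g_clean2rev(g_rev2clean(real_rev))      # rev -> clean -> rev
    rec_clean = g_rev2clean(g_clean2rev(real_clean))  # clean -> rev -> clean
    return np.abs(rec_rev - real_rev).mean() + np.abs(rec_clean - real_clean).mean()
```

A pair of generators that exactly invert each other drives this loss to zero, which is what prevents the adv loss alone from collapsing the mapping.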
2.3 Domain Adaptation Network (DAN)
UEN, described above, transforms reverberant features to the clean domain; hence, we call the transformation reverberant feature enhancement, or feature dereverberation. In contrast, DAN transforms features from the reverberant domain to any chosen domain (details in Section 3.1). This mapping is also attained by training a CycleGAN, similar to the one trained for UEN, except that the source domain does not need to be clean. The CycleGAN for DAN is trained on unpaired data from a selected source domain and the reverberant domain. During evaluation, the reverberant features are transformed to the source domain using the corresponding generator of the CycleGAN, termed DAN. Except for the difference in training data, the procedure for training DAN is identical to that of UEN.
3 Experimental Procedure
3.1 Dataset Details
The training of SEN required paired data from the clean and reverberant domains, which was obtained as follows. The audio files from the same YouTube video of VoxCeleb1 [nagrani2017voxceleb] and VoxCeleb2 [chung2018voxceleb2] were concatenated, denoted VoxCeleb concatenated (VC), to obtain longer audio sequences. Since VC was collected in wild conditions and contained unwanted background noise, we filtered the files based on their Signal-to-Noise Ratio (SNR), estimated with the Waveform Amplitude Distribution Analysis (WADA-SNR) algorithm [kim2008robust, nidadavolu2020unsupervised, zen2019libritts, nidadavolu2019lr]. We retained the VC files whose estimated SNR, in decibels (dB), exceeded a threshold. The high-SNR signals thus obtained, termed VC clean, consisted of 1665 hours of speech from 7104 speakers. The far-field data was obtained via simulation by first convolving VC clean with simulated Room Impulse Responses (RIRs, available at http://www.openslr.org/26) with RT60 values in the range 0.0-1.0 seconds. Then, assorted noise files from the Music, Speech and Noise (MUSAN) [snyder2015musan] corpus were artificially added as foreground noise (at SNR levels of 15, 10, 5, and 0 dB) to the simulated reverberant speech (the speech and music portions of MUSAN were not used in this simulation). The resulting simulated reverb-clean parallel corpus, termed VC reverb_noise-VC clean, was used as training data for the SEN.
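The far-field simulation recipe above (convolve with an RIR, then add noise at a target SNR) can be sketched as follows. This is a minimal NumPy version under our own naming; the actual pipeline uses the openslr RIR and MUSAN corpora:

```python
import numpy as np

def simulate_far_field(clean, rir, noise, snr_db):
    """Convolve clean speech with a room impulse response, then add
    foreground noise scaled to a target SNR (in dB).
    Inputs are 1-D float waveform arrays."""
    reverb = np.convolve(clean, rir)[: len(clean)]   # truncate convolution tail
    noise = np.resize(noise, len(reverb))            # loop/trim noise to length
    sig_pow = np.mean(reverb ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12          # avoid division by zero
    scale = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10.0)))
    return reverb + scale * noise
```

Scaling the noise rather than the speech keeps the reverberant signal level fixed, so the paired clean target is unaffected by the SNR choice.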
Table 1: Training details of the three networks.

| Network | Output | Training Data | Data Type | Approach | Objectives |
|---|---|---|---|---|---|
| SEN | clean | VC reverb_noise & VC clean | paired | Regression | FM & adv |
| UEN | clean | VC reverb_noise & VC clean | unpaired | CycleGAN | cyc & adv |
| DAN | noise | VC reverb_noise & VC noise | unpaired | CycleGAN | cyc & adv |
To make a fair comparison between SEN and UEN, the latter was also trained on VC reverb_noise-VC clean; however, the reverb-clean pairs for UEN were drawn randomly, without any correspondence between them. The source domain data for training DAN was obtained by adding assorted noise files from MUSAN to VC clean, termed VC noise, while the target domain data (VC reverb_noise) remained the same as for UEN. We added assorted noise to the source domain to increase data variability and obtain better generalization. The choice of the type of noise was based on the following experiment: we obtained three copies of VC by adding music, speech, and noise files from MUSAN, trained three individual SVSs on these conditions, and observed that the system trained on VC noise gave the best performance on all the evaluation sets considered in this work, compared to the systems trained on the music and speech copies and on VC clean. Table 1 summarizes the training details of the three networks.
Once trained, the SEN and DAN networks were used to enhance/adapt the features of the evaluation corpora, which were finally tested with an x-vector [snyder2018x] based SVS.
We experimented with SVSs trained without and with data augmentation. The SVS without data augmentation was trained on VC clean. For the SVS trained with augmentation (similar to [snyder2018x]), we used the simulated VC reverb_noise (described above) and VC additive as the far-field and additive noise corpora to augment VC clean. The VC additive corpora consisted of VC noise, VC babble, and VC music, obtained by adding noise, speech, and music from MUSAN to VC clean at SNRs randomly chosen from 15, 10, 5, and 0 dB. A randomly chosen subset of VC reverb_noise and VC additive, twice the size of VC clean, was used as the augmentation data for training the x-vector network.
We tested the SVSs on both simulated and real datasets. The simulated reverberant test set was obtained from the Speakers In The Wild (SITW) corpus and labelled SITW reverb; we treated SITW itself as a clean corpus. SITW reverb was created similarly to VC reverb_noise, except that the maximum RT60 of the RIRs used was set to 4.0 seconds (instead of 1.0). We ensured the RIRs used for the training and testing simulations were disjoint.
For the real testing conditions, we used three different corpora [garcia2019speaker] collected in different scenarios:
Meeting (AMI Meeting Corpus (AMI) [mccowan2005ami]): recorded in 3 different meeting rooms with 4 individual headset microphones and 8 multiple distant microphones forming a microphone array; 180 speakers x 3.5 sessions per speaker (sps). Since we are exploring enhancement with a single microphone, we focused only on the mix-headset recordings.
Indoor controlled (Stanford Research Institute (SRI) data [SRI-Real-Voices]): 23 different microphones placed throughout 4 different rooms; controlled backgrounds; 30 speakers x 2 sessions, 40 hours of live speech along with background noises (TV, radio).
Wild (BabyTrain): an uncontrolled setting with 450 recurrent speakers, up to 40 sps (longitudinal), 225 hours; suitable for diarization and detection.
The enrollments for verification were generated by accumulating non-overlapping speech (of 5, 15, and 30 s duration) of every target speaker across one or multiple utterances. For the test side, we cut the audio into 60-second chunks. We took the Cartesian product of the enrollments and the test segments to generate all possible trials, and then filtered out some trials based on certain criteria; for example, the same session and same microphone were not allowed to produce a target-trial pair. Table 2 shows a summary of all datasets used in this work.
Table 2: Summary of the datasets used in this work: VC clean, VC additive (comprising VC noise, VC babble, and VC music), VC reverb_noise, AMI, SRI, BabyTrain, SITW, and SITW reverb.
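The trial-generation step can be sketched as follows. The dictionary fields and the exact filtering rule are our own simplification of the criteria described above (here, any enrollment-test pair sharing both session and microphone is dropped):

```python
from itertools import product

def make_trials(enrollments, tests):
    """Cartesian product of enrollment and test segments, dropping pairs
    that share both session and microphone (illustrative criterion)."""
    trials = []
    for enr, tst in product(enrollments, tests):
        if enr["session"] == tst["session"] and enr["mic"] == tst["mic"]:
            continue  # same session + same mic cannot form a trial
        label = "target" if enr["spk"] == tst["spk"] else "nontarget"
        trials.append((enr["id"], tst["id"], label))
    return trials
```

Labeling each surviving pair as target or nontarget by speaker identity yields the trial list scored by the SVS.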
3.2 Network Architectures
SEN was a fully convolutional (conv) residual network with an encoder-decoder architecture. The encoder consisted of three conv layers followed by nine residual blocks. The (filters, stride) settings of the first three conv layers were (32, 1), (64, 2), and (128, 2) respectively. Each residual block consisted of two conv layers with 128 filters. The decoder consisted of two de-conv layers with stride 2 and 64 and 32 filters respectively, followed by a final conv layer with stride 1. Instance normalization was used in each layer except the first and last. ReLU activation was used in all layers except the last, and the same kernel size was used in all layers. We used a shortcut connection from the SEN's input to its output (the input was added to the output of the last layer, which became the SEN's final output). The generators used in the CycleGANs have the same architecture as the SEN. The discriminator had 5 conv layers, each with a kernel size of 4. The strides of the first three and last two layers were set to 2 and 1 respectively. The numbers of filters in the layers were set to 64, 128, 256, 512, and 1. LeakyReLU with slope 0.2 was used as the activation in all layers except the last. More details on the architecture can be found in [nidadavolu2019lr].
3.3 Training Details
Both the Enhancement Networks (ENs) and DAN were trained on 40-D log melFBs extracted from their respective training corpora (details in Sec. 3.1). The SEN was trained to optimize the multi-task objective combining FM and adv losses (details in Sec. 2.1), with paired data drawn from the training corpora. UEN and DAN were trained to optimize a combination of cyc and adv losses with unpaired data drawn from their respective training corpora (details in Sec. 2.2 and Sec. 3.1). The rest of the training details were the same for all networks and are given below. Short-time mean centering and energy-based Voice Activity Detection (VAD) were applied to the features. The batch size and sequence length were set to 32 and 127 respectively. The models were trained for 50 epochs; an epoch was considered complete when one random sample from each of the utterances of VC clean had appeared once in that epoch. The Adam optimizer was used. The learning rates for the ENs and discriminators were set to 0.0003 and 0.0001 respectively; they were kept constant for the first 15 epochs and then linearly decreased until reaching the minimum learning rate (1e-6). For SEN, the FM and adv loss weights were set to 1.0 and 0.1 respectively. For UEN and DAN, the cyc and adv loss weights were set to 2.5 and 1.0 respectively.
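The learning-rate schedule described above (constant, then linear decay to a floor) can be written as a small function; the defaults match the EN settings, and the function name is ours:

```python
def learning_rate(epoch, base_lr=0.0003, hold=15, total=50, min_lr=1e-6):
    """Constant for the first `hold` epochs, then linear decay so the
    final epoch (total - 1) reaches `min_lr`."""
    if epoch < hold:
        return base_lr
    frac = (epoch - hold) / max(total - 1 - hold, 1)  # 0 at epoch `hold`, 1 at the end
    return base_lr + frac * (min_lr - base_lr)
```

The discriminators would use the same schedule with `base_lr=0.0001`.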
We used an Extended TDNN (ETDNN) [villalbajhu, nidadavolu2020unsupervised] based x-vector network in the SVS. More details on the ETDNN and the pipeline can be found in [villalba2019state, garcia2019speaker]. The ETDNN was trained on 40-D MFCC features using Kaldi (data preparation and training scripts are available at https://github.com/jsalt2019-diadet/jsalt2019-diadet). During evaluation, the output log melFB features of the ENs were converted to MFCCs by applying a Discrete Cosine Transform (DCT) before the forward pass through the x-vector network.
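The log melFB-to-MFCC conversion is a single orthonormal type-II DCT along the mel axis. A minimal NumPy sketch (the actual pipeline uses Kaldi's feature code; function names here are ours):

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal type-II DCT matrix; row k holds the k-th DCT basis vector."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * k * (2 * i + 1) / (2 * n)) * np.sqrt(2.0 / n)
    m[0] *= 1.0 / np.sqrt(2.0)  # orthonormal scaling of the DC row
    return m

def fbank_to_mfcc(log_melfb, num_ceps=40):
    """Convert log mel filter-bank frames (time x mels) to MFCCs via DCT."""
    n = log_melfb.shape[-1]
    return log_melfb @ dct_matrix(n).T[:, :num_ceps]
```

With `num_ceps` equal to the melFB dimension the transform is lossless, so no information from the enhanced features is discarded before the x-vector network.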
The baseline (BL) SVS (evaluated on original features with no enhancement) was termed BL-SVS. The SVSs evaluated on features mapped using SEN, UEN, and DAN were termed SEN-SVS, UEN-SVS, and DAN-SVS respectively. In Sec. 4.1, we present results for an SVS trained without data augmentation on the simulated test set, which can be treated as a system trained on clean and tested on reverberant speech. In this system, enhancement was done only during the evaluation stage, and it was used to tune the SEN and compare it with the previously published UEN. DAN was not used here, since we tailored it to work with an augmented x-vector system by training it to map features to the noise domain.
In Sec. 4.2, we compare our enhancement and DAN approaches in a more practical scenario: a BL-SVS trained with data augmentation and tested on real datasets acquired in various conditions. In this case, we experimented with training the SVS pipeline on enhanced features while also enhancing the evaluation corpora, as suggested in [nidadavolu2020unsupervised].
4 Results

4.1 Supervised vs. Unsupervised Enhancement
The SEN was trained using a multi-task objective: a weighted sum of FM and adv losses (details in Sec. 2.1). We first performed an ablation study on the losses by training two SENs on the individual losses. The SENs trained with only the FM loss and only the adv loss were termed SEN1 and SEN2 respectively; results are in Table 3. SEN2, trained with only the adv loss, effectively learns its own loss function. The SVSs using either SEN1 or SEN2 performed no better than BL-SVS and UEN-SVS. However, SEN2 yielded a better minimum Detection Cost Function (minDCF) (0.535) than SEN1 (0.626) on SITW reverb, which justified the use of the adv loss. SEN3, trained using a combination of both losses, performed better than BL-SVS and UEN-SVS. These results suggest that the FM approach alone is not enough for dereverberation in the feature domain. We further experimented with tuning the adv loss weight, trying values of 1.0, 0.1, and 0.01. Setting the adv loss weight to 0.01 made it insignificant compared to the FM loss (SEN5 had only slight improvements over SEN1). We obtained better results with a loss weight of 0.1, a system we termed SEN4 in Table 3. SEN4 yielded 20.5% and 33.5% relative improvements in minDCF on SITW and SITW reverb compared to BL-SVS (UEN yielded 9.1% and 23% relative improvements). For the rest of this work, we use SEN4 as the SEN.
Table 3: Ablation study on the SEN loss weights (columns: SVS, loss weights, SITW, SITW reverb).
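For reference, the relative improvements quoted throughout are computed from minDCF values as follows (lower minDCF is better; the helper name and the numbers in the usage note are illustrative):

```python
def rel_improvement(baseline, system):
    """Relative improvement (%) of `system` over `baseline` for a
    lower-is-better metric such as minDCF."""
    return 100.0 * (baseline - system) / baseline
```

For example, a system reducing minDCF from 0.8 to 0.6 achieves a 25% relative improvement.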
4.2 Enhancement for SVSs Trained with Data Augmentation
The BL-SVS with data augmentation was trained on the combination of VC clean, VC additive, and VC reverb_noise (details in Sec. 3.1 and Table 2). We experimented with three different training schemes, each modifying the training data of the x-vector network in a different way. In the first scheme, the entire training data was enhanced using SEN, similar to the homogeneous UEN-SVS pipeline in [nidadavolu2020unsupervised]. In the second scheme, the training data consisted of dereverberated/enhanced VC reverb_noise along with unmodified VC clean and VC additive. In the third scheme, we trained on the original training data of BL-SVS along with its enhanced version. Since the x-vector network in the third case had double the training data of the first two scenarios, it was trained for 1.5 epochs instead of 3, making the systems comparable. In all schemes, the Probabilistic Linear Discriminant Analysis (PLDA) backend was trained on x-vectors extracted from enhanced features only (making the PLDAs comparable too). The three schemes were termed SVS1, SVS2, and SVS3 respectively. In all three cases, the evaluation data was enhanced. All three training schemes were repeated for UEN, DAN, and SEN separately.
Results are presented in Table 4. BabyTrain benefited from all three feature mapping approaches and all three SVS training schemes. SEN yielded better results than UEN on SRI and BabyTrain but deteriorated performance on AMI. However, DAN-SVS3 yielded improvements on all three datasets: relative improvements in minDCF of 2.2%, 6%, and 31.6% on AMI, SRI, and BabyTrain respectively. SEN-SVS3 yielded 8.6% and 26.5% relative improvements in minDCF on SRI and BabyTrain respectively, but deteriorated performance on AMI. The best relative improvements on the three datasets were 2.2% (DAN-SVS3) on AMI, 10.7% (SEN-SVS2) on SRI, and 31.6% (DAN-SVS3) on BabyTrain. We observed that the x-vector network in the homogeneous system SVS1 over-fitted compared to SVS2 and SVS3 (a large gap between training and validation accuracy was observed), because the enhancement removed the noise from the training data. That explains the superior performance of SVS3 (or SVS2 in some cases) over SVS1.
The aim of this study was to make SVSs robust to far-field data using the proposed Supervised Enhancement Network (SEN) and Domain Adaptation Network (DAN). SEN maps far-field evaluation features to the clean domain; it was trained on paired data using a supervised objective combined with a generative adversarial objective. DAN maps the features to a noise domain, which is similar to the augmented data used to train the SVS x-vector network; it is trained on unpaired data using a CycleGAN scheme. We observed that training the SVSs on both the original/augmented features and their enhanced versions produced by the proposed networks yielded significant improvements compared to training on augmented data alone or on enhanced augmented data alone, with relative improvements ranging from 2% to 31% in minDCF on several simulated and real datasets. Though the enhancement procedure in this work was targeted at improving SV performance, the networks were trained with task-independent objectives; a future direction is to test these techniques on an ASR task.