Automatic speaker verification (ASV) refers to automatically accepting or rejecting a claimed speaker identity by analyzing the given speech from that speaker. In the past few years, the performance of ASV systems has improved significantly with the successful application of deep neural networks (DNN) to speaker embedding modeling [1, 2]. However, performance remains unsatisfactory under noisy environments, which are commonly encountered in smartphones and smart speakers running ASV applications. Additive noises contaminate the low-energy regions of the spectrogram of clean speech and blur the acoustic details. These noises result in the loss of speech intelligibility and quality, imposing great challenges on speaker recognition systems.
To compensate for these adverse impacts, various approaches have been proposed at different stages of the ASV pipeline. At the signal level, DNN-based speech or feature enhancement [4, 5, 6, 7] has been investigated for ASV under complex environments. At the feature level, feature normalization techniques and noise-robust features such as power-normalized cepstral coefficients (PNCC) have also been applied to ASV systems. At the model level, robust back-end modeling methods such as multi-condition training of probabilistic linear discriminant analysis (PLDA) models and the mixture of PLDA were employed in the i-vector framework. Score normalization could also be used to improve the robustness of ASV systems under noisy scenarios.
More recently, researchers have been working on training deep speaker networks to cope with the distortions caused by noise. Within this framework, there are two main methods. The first regards the noisy data as a different domain from the clean data and applies adversarial training to handle the domain mismatch and obtain a noise-invariant speaker embedding [14, 15]. The second employs a DNN speech enhancement network for the ASV task. Shon et al. train the speech enhancement network with feedback from the speaker network to find the time-frequency bins of noisy speech that are beneficial to ASV. Zhao et al. use the intermediate result of the speech enhancement network as an auxiliary input to the speaker embedding network and jointly optimize the two networks.
In this work, our network learns enhancement directly at the embedding level for speaker recognition under noisy environments. We train the deep speaker embedding network by combining the original speaker identification loss with an auxiliary within-sample loss. The speaker identification loss learns the speaker representation using the speaker label, while the within-sample loss pushes the embedding of a noisy utterance to be as similar as possible to that of its clean version. In this way, the deep speaker embedding network is trained to avoid encoding the additive noises into the speaker representation and to learn a "clean" embedding for the noisy speech utterance. We call this loss, which helps the speaker network learn variability-invariant embeddings, the within-sample variability-invariant loss.
Furthermore, to fully exploit the modeling ability of the within-sample variability-invariant loss, we dynamically generate the clean-noisy utterance pairs when preparing data for training. Different noisy copies of the same clean utterance are generated at different training steps, helping the speaker embedding network generalize better under noisy environments.
2 Revisit: Deep speaker embedding
In this section, we describe the deep speaker embedding framework, which consists of a frame-level local pattern extractor, an utterance-level encoding layer, and several fully-connected layers for speaker embedding extraction and speaker classification.
Given a variable-length input feature sequence, the local pattern extractor, which is typically a convolutional neural network (CNN) or a time-delay neural network (TDNN), learns the frame-level representations. An encoding layer is then applied on top of it to obtain the utterance-level representation. The most common encoding method is the average pooling layer, which aggregates the statistics (i.e., mean, or mean and standard deviation) [1, 2]. The self-attentive pooling layer, the learnable dictionary encoding layer, and the dictionary-based NetVLAD layer [20, 21] are other commonly used encoding layers. Once the utterance-level representation is extracted, a fully connected layer and a speaker classifier are employed to further abstract the speaker representation and classify the training speakers. After training, the deep speaker embedding is extracted from the penultimate layer of the network for a given variable-length utterance.
In this work, the local pattern extractor is a residual convolutional neural network (ResNet), and the encoding layer is a global statistics pooling (GSP) layer. For a frame-level representation $\mathbf{F} \in \mathbb{R}^{C \times H \times W}$, the output of GSP is an utterance-level representation $[\boldsymbol{\mu}; \boldsymbol{\sigma}] \in \mathbb{R}^{2C}$, where $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ are the channel-wise mean and standard deviation of the feature map:

$$\mu_c = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} F_{c,h,w}, \qquad \sigma_c = \sqrt{\frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} \left( F_{c,h,w} - \mu_c \right)^2}$$

$C$, $H$ and $W$ denote the number of channels, height and width of the feature map, respectively.
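As a concrete sketch, the GSP operation amounts to a few lines of array code. The following is a minimal NumPy illustration (the function name is ours, not from the paper's implementation):

```python
import numpy as np

def global_statistics_pooling(feature_map):
    """Pool a (C, H, W) frame-level feature map into a 2C-dimensional
    utterance-level vector of channel-wise means and standard deviations."""
    c = feature_map.shape[0]
    flat = feature_map.reshape(c, -1)     # flatten height and width: (C, H*W)
    mu = flat.mean(axis=1)                # channel-wise mean
    sigma = flat.std(axis=1)              # channel-wise standard deviation
    return np.concatenate([mu, sigma])    # utterance-level representation, shape (2C,)
```

Because the statistics are computed over the spatial axes, the output dimension is fixed at 2C regardless of the input utterance length.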
3 Proposed framework

In this section, we describe the proposed framework with the within-sample variability-invariant loss and online noisy data generation.
3.1 Within-sample variability-invariant loss
A clean speech and its noisy copies contain the same acoustic contents for recognizing speakers. Ideally, the speaker embeddings of the noisy utterance should be the same as its clean version. But in reality, the deep speaker embedding network usually encodes the noises as parts of the speaker representation for the noisy speech.
In this work, we train the local pattern extractor to learn the enhancement at the embedding level. Formally, for a clean utterance $x$ and its noisy copy $x^{n} = x + n$ with additive noise $n$, the speaker embeddings extracted by the network $f(\cdot)$ are

$$e = f(x), \qquad e^{n} = f(x^{n}).$$
In this way, the speaker embedding network is trained to ignore the additive noises and learn noise-invariant embeddings. We refer to this loss function as the within-sample variability-invariant loss. Two different loss functions are investigated in this work: the mean square error (MSE) regression loss and the cosine embedding loss.
The MSE regression loss calculates the mean of the squared L2 norm between the clean embedding $e$ and its noisy version $e^{n}$:

$$L_{\mathrm{MSE}} = \frac{1}{D} \left\| e - e^{n} \right\|_{2}^{2},$$

where $\|\cdot\|_{2}$ denotes the L2 norm and $D$ is the dimension of the speaker embeddings.
The cosine embedding loss calculates the cosine distance between the clean embedding $e$ and its noisy version $e^{n}$:

$$L_{\cos} = 1 - \frac{e \cdot e^{n}}{\|e\|_{2}\,\|e^{n}\|_{2}}.$$
The within-sample variability-invariant loss works together with the original speaker identification loss to train the speaker embedding network. The speaker identification loss is typically a cross-entropy loss. In our implementation, the parameters of the network are updated twice at each training step: the first update comes from the speaker identification loss, and the second from the within-sample variability-invariant loss. Figure 1 shows the flowchart of our proposed framework.
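For concreteness, the two within-sample losses can be sketched as follows. This is a minimal NumPy illustration on embedding vectors (the function names are ours; the actual system computes these inside the training graph on GPU):

```python
import numpy as np

def mse_loss(e_clean, e_noisy):
    """Mean of the squared L2 distance between clean and noisy embeddings."""
    diff = e_clean - e_noisy
    return np.mean(diff ** 2)

def cosine_loss(e_clean, e_noisy):
    """One minus the cosine similarity between clean and noisy embeddings."""
    cos_sim = np.dot(e_clean, e_noisy) / (
        np.linalg.norm(e_clean) * np.linalg.norm(e_noisy))
    return 1.0 - cos_sim
```

Both losses are zero when the noisy embedding matches its clean version exactly, so minimizing either one pulls the two embeddings of the same utterance together.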
3.2 Online data augmentation
In this work, we implement an online data augmentation strategy. The noise type, noise clip and signal-to-noise ratio (SNR) are randomly selected to generate each clean-noisy utterance pair during training. Different permutations of these random parameters produce different noisy segments for the same utterance at different training steps, so the network never "sees" the same noisy segment from the same clean speech twice.
During training, the SNR is a continuous random variable uniformly distributed between 0 and 20 dB, and there are four types of noise: music, ambient noise, television, and babble. The television noise is generated by mixing one music file and one speech file. The babble noise is constructed by mixing three to six speech files into one, which produces voices overlapping with the foreground speech.
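The online generation step can be sketched as follows. This is a NumPy illustration with hypothetical helper names; in the actual system the noise clips come from MUSAN recordings rather than arbitrary arrays:

```python
import numpy as np

rng = np.random.default_rng()

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio equals snr_db, then add."""
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]   # match the speech length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

def online_noisy_copy(speech, noise_clips):
    """One augmentation draw per training step: random clip, SNR ~ U(0, 20) dB."""
    noise = noise_clips[rng.integers(len(noise_clips))]
    return mix_at_snr(speech, noise, rng.uniform(0.0, 20.0))
```

Because the clip index and SNR are redrawn at every step, each epoch presents the network with a fresh noisy copy of every clean utterance.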
|Encoding||Global Statistics Pooling|
|Embedding||Fully Connected Layer|
|Classifier||Fully Connected Layer|

In table 2, (kernel size, stride) specifies each convolutional layer, including the shortcut convolutional layers within the residual blocks.
4 Experiments

4.1 Dataset

The experiments are conducted on the Voxceleb 1 dataset. The training data contain 148642 utterances from 1211 speakers. In the test data, 4874 utterances from 40 speakers construct 37720 test trials. Although the Voxceleb dataset, collected from online videos, is not strictly recorded in clean conditions, we treat the original data as the clean dataset and generate noisy data from it.
The MUSAN dataset is used as the noise source. We split MUSAN into two non-overlapping subsets for generating the training and testing noisy data, respectively.
4.2 Experimental setup
Speech signals are first converted to 64-dimensional log Mel-filterbank energies and then fed into the speaker embedding network. The detailed network architecture is shown in table 2. The front-end local pattern extractor is based on the well-known ResNet-34 architecture.
For the speaker identification loss, a standard softmax-based cross-entropy loss or the angular softmax (A-softmax) loss is used. When training with the softmax loss, dropout is added to the penultimate fully connected layer to prevent overfitting.
Three training data settings are investigated: (1) original Voxceleb 1 dataset (clean); (2) original training dataset and offline generated noisy data, i.e., the noisy data are generated in advance (offline AUG); (3) original training data with online data augmentation (online AUG).
At the testing stage, cosine similarity is used for scoring. We use the equal error rate (EER) and the detection cost function (DCF) as performance metrics. The reported DCF is the average of the two minimum DCFs when $P_{\mathrm{target}}$ is 0.01 and 0.001.
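As an illustration of the EER metric, it can be computed from target and non-target trial scores roughly as follows. This is a simplified sketch with a hypothetical function name; evaluation toolkits interpolate the DET curve more carefully:

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Equal error rate: the operating point where the false-rejection rate
    equals the false-acceptance rate as the decision threshold is swept."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    order = np.argsort(scores)
    labels = labels[order]
    # Threshold just above scores[i]: targets up to index i are rejected,
    # non-targets after index i are accepted.
    frr = np.cumsum(labels) / labels.sum()
    far = 1.0 - np.cumsum(1.0 - labels) / (1.0 - labels).sum()
    i = np.argmin(np.abs(frr - far))
    return 0.5 * (frr[i] + far[i])
```

With perfectly separated scores the function returns 0; the worse the overlap between the two score distributions, the higher the EER.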
4.3 Experimental results
Eight deep speaker embedding networks are trained based on the three training conditions and different loss functions. Table 1 shows the DCF and EER for three noise types (babble, ambient noise and music) at five SNR settings (0, 5, 10, 15, 20 dB). In addition, all 15 noisy testing trial lists are combined to form the "all noises" trial.
Several observations from the results are discussed in the following. 1) The experimental results confirm that the data augmentation strategy can greatly improve the performance of the deep speaker embedding system under noisy conditions. 2) Compared with the offline data augmentation strategy, the performance improvement achieved by online data augmentation is more obvious in the low SNR conditions. 3) Training the deep speaker embedding system with the within-sample variability-invariant loss improves system performance in both the clean and all-noise conditions. 4) Compared with the network trained with offline data augmentation, the proposed framework using the within-sample variability-invariant loss with online data augmentation achieves 13.0% and 6.5% reductions in EER and DCF, respectively. 5) When the speaker embedding network is trained discriminatively using the A-softmax loss with an angular margin, the proposed within-sample loss can still improve system performance by constraining the distance between the clean utterance and its noisy copies.
The detection error tradeoff (DET) curves in figure 3 provide comparisons among four selected systems, two of which are trained with our proposed framework. The DET curve uses testing trials from all the noisy conditions.
We also visualize the speaker embeddings using the t-distributed stochastic neighbor embedding (t-SNE) algorithm. The two-dimensional projections of the speaker embeddings are shown in figure 4. Four speakers, each with six clean utterances, are selected from the training dataset for visualization. Each clean utterance also has three 5 dB noisy copies with music, babble and ambient noises. Compared with the clean training condition, data augmentation helps the clean and noisy embeddings from the same utterance cluster together. Furthermore, after training the deep speaker embedding network with the within-sample variability-invariant loss, the clean and noisy embeddings of the same utterance are even closer to each other.
The loss values of each training epoch are shown in figure 3 for the network trained with the speaker softmax and within-sample MSE losses. For reference, the MSE loss between embeddings from the clean and noisy data of a converged network trained with only the softmax loss is also given. We can observe that the MSE loss is maintained at a low level during training, which helps the network extract noisy embeddings similar to their clean versions.
This paper has proposed the within-sample variability-invariant loss for deep speaker embedding networks under noisy conditions. By setting constraints on the embeddings extracted from a clean utterance and its noisy copies, the proposed loss works with the original speaker identification loss to learn robust embeddings for noisy speech. We also employ a data preparation strategy that generates the clean and noisy utterance pairs on-the-fly to help the speaker embedding network generalize better under noisy environments. The proposed framework is flexible and can be extended to other similar applications where multiple views of the same training speech sample are available.
This research is funded in part by the National Natural Science Foundation of China (61773413) and Duke Kunshan University.
-  D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “x-vectors: Robust DNN Embeddings for Speaker Recognition,” in ICASSP, 2018, pp. 5329–5333.
-  W. Cai, J. Chen, and M. Li, “Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System,” in Speaker Odyssey, 2018, pp. 74–81.
-  M. Wolfel and J. McDonough, Distant Speech Recognition, John Wiley & Sons, Incorporated, 2009.
-  X. Zhao, Y. Wang, and D. Wang, “Robust Speaker Identification in Noisy and Reverberant Conditions,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 836–845, 2014.
-  M. Kolboek, Z. Tan, and J. Jensen, in SLT, 2016, pp. 305–311.
-  Z. Oo, Y. Kawakami, L. Wang, S. Nakagawa, X. Xiao, and M. Iwahashi, “DNN-Based Amplitude and Phase Feature Enhancement for Noise Robust Speaker Identification,” in Interspeech, 2016, pp. 2204–2208.
-  O. Plchot, L. Burget, H. Aronowitz, and P. Matejka, “Audio Enhancing with DNN Autoencoder for Speaker Recognition,” in ICASSP, 2016, pp. 5090–5094.
-  J. Pelecanos and S. Sridharan, “Feature Warping for Robust Speaker Verification,” in Speaker Odyssey, 2001, pp. 213–218.
-  C. Kim and R. M. Stern, “Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 7, pp. 1315–1329, 2016.
-  D. Garcia-Romero, X. Zhou, and C. Y. Espy-Wilson, “Multi-Condition Training of Gaussian PLDA Models in i-vector Space for Noise and Reverberation Robust Speaker Recognition,” in ICASSP, 2012, pp. 4257–4260.
-  M. Mak, X. Pang, and J. Chien, “Mixture of PLDA for Noise Robust i-Vector Speaker Verification,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 1, pp. 130–142, 2016.
-  N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-End Factor Analysis for Speaker Verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
-  I. Peer, B. Rafaely, and Y. Zigel, “Reverberation Matching for Speaker Recognition,” in ICASSP, 2008, pp. 4829–4832.
-  J. Zhou, T. Jiang, L. Li, Q. Hong, Z. Wang, and B. Xia, “Training Multi-Task Adversarial Network for Extracting Noise-Robust Speaker Embedding,” in ICASSP, 2019, pp. 6196–6200.
-  Z. Meng, Y. Zhao, J. Li, and Y. Gong, “Adversarial Speaker Verification,” in ICASSP, 2019, pp. 6216–6220.
-  S. Shon, H. Tang, and J. Glass, “VoiceID Loss: Speech Enhancement for Speaker Verification,” in Interspeech, 2019, pp. 2888–2892.
-  F. Zhao, H. Li, and X. Zhang, “A Robust Text-independent Speaker Verification Method Based on Speech Separation and Deep Speaker,” in ICASSP, 2019, pp. 6101–6105.
-  G. Bhattacharya, J. Alam, and P. Kenny, “Deep Speaker Embeddings for Short-Duration Speaker Verification,” in Interspeech, 2017, pp. 1517–1521.
-  W. Cai, Z. Cai, X. Zhang, X. Wang, and M. Li, “A Novel Learnable Dictionary Encoding Layer for End-to-End Language Identification,” in ICASSP, 2018, pp. 5189–5193.
-  J. Chen, W. Cai, D. Cai, Z. Cai, H. Zhong, and M. Li, “End-to-end Language Identification using NetFV and NetVLAD,” in ISCSLP, 2018.
-  W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, “Utterance-level Aggregation For Speaker Recognition In The Wild,” in ICASSP, 2019, pp. 5791–5795.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in CVPR, 2016, pp. 770–778.
-  A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: A Large-Scale Speaker Identification Dataset,” in Interspeech, 2017, pp. 2616–2620.
-  D. Snyder, G. Chen, and D. Povey, “MUSAN: A Music, Speech, and Noise Corpus,” arXiv:1510.08484 [cs], 2015.
-  W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “SphereFace: Deep Hypersphere Embedding for Face Recognition,” in CVPR, 2017, pp. 212–220.
-  L. Maaten and G. Hinton, “Visualizing Data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 11, pp. 2579–2605, 2008.