Channel adversarial training for speaker verification and diarization

10/25/2019 ∙ by Chau Luu, et al.

Previous work has encouraged domain-invariance in deep speaker embedding by adversarially classifying the dataset or labelled environment to which the generated features belong. We propose a training strategy which aims to produce features that are invariant at the granularity of the recording or channel, a finer grained objective than dataset- or environment-invariance. By training an adversary to predict whether pairs of same-speaker embeddings belong to the same recording in a Siamese fashion, learned features are discouraged from utilizing channel information that may be speaker discriminative during training. Experiments for verification on VoxCeleb and diarization and verification on CALLHOME show promising improvements over a strong baseline in addition to outperforming a dataset-adversarial model. The VoxCeleb model in particular performs well, achieving a 4% relative improvement in EER over a Kaldi baseline, while using a similar architecture and less training data.




1 Introduction

Learning speaker discriminative features is an important approach to tasks such as Speaker Verification (SV) and Speaker Diarization (SD). In recent years, using deep learning to extract speaker embeddings has become the state-of-the-art method for both tasks [1, 2, 3, 4], outperforming the well established i-vector technique.


Although such embeddings have shown excellent performance for speaker verification, Probabilistic Linear Discriminant Analysis (PLDA) is often used to score the similarity between embeddings, rather than more direct measures such as cosine similarity or Euclidean distance. For i-vectors, the use of PLDA is motivated by the observation that unwanted sources of information, such as channel information, are often present in these embeddings [5, 6, 7, 8]. Training a separate model to disentangle these sources and extract only the speaker-specific information has therefore shown great benefit. The performance increase from PLDA when used with x-vectors [1, 2] suggests that this unwanted variability is also present in deep embeddings.

This raises the question of whether channel information can be disentangled within the deep feature extractor itself, either to remove the need for PLDA or to increase its effectiveness. Generating channel-invariant features is closely related to producing domain-invariant features, for which adversarial training has emerged as a powerful approach [9, 10, 11].

Previous work in adversarial learning of speaker representation has encouraged domain invariance by having an adversary classify the dataset or labelled environment to which the generated features belong [4, 12]. However, this is a coarse modelling of the domains over which generated features are encouraged to be invariant. In the case of dataset adversarial training [12], for instance, intra-dataset variation is not penalized, instead relying on the differences between datasets being enough to encourage meaningful invariance.

We aim to encourage invariance at the channel or recording level, without the need for labelled recordings, by training an adversary to predict whether pairs of same-speaker embeddings belong to the same recording. Since this recording-level adversarial penalty affects channel-related information, the approach encourages channel-invariant embeddings.

Figure 1: The proposed architecture. The classifier is trained in the same way as in the ordinary x-vector architecture, and the discriminator is trained on concatenated pairs of same-speaker embeddings. The blue arrows represent forward propagation, and the red arrows represent the backward propagation of gradients.

Several researchers have performed related work using adversarial training to learn speaker embeddings. Meng et al [4] used two adversarial losses on an embedding generator for speaker verification: classification of the environment from a finite set of training environments, and prediction of the signal-to-noise ratio of the input utterance. Tu et al [12] aimed to make deep embeddings more amenable to PLDA by enforcing variational regularization to ensure a Gaussian distribution of representations. That work also incorporated an adversary to encourage domain invariance by having a discriminator perform a multi-class classification task on generated features, based on the dataset from which each sample originated. Bhattacharya et al [13] estimated deep speaker embeddings that were encouraged to be robust between single source and target domains using adversarial techniques.

Hsu et al [14] proposed SiGAN, a Siamese architecture for upscaling faces using a generative network. By generating a pair of faces and ensuring that the performance of face verification was maintained, the generated faces were encouraged to be identity preserving. SiGAN is related to our work since it also focuses on pairwise properties of generated features, which we use to encourage channel-invariance.

2 Learning speaker embeddings

We learn speaker embeddings by mapping a variable-length set of input frames $X$ to a fixed-dimensional set of hidden features $h$ that represent the identity of a speaker, using a neural network parameterized by $\theta_G$. The x-vector neural network architecture [1] has been a particularly successful approach for this.

X-vectors are extracted from an intermediate layer of a network trained to classify the speakers in a training set. A series of Time Delay Neural Network (TDNN) layers (1-D convolutions in time) is applied sequentially to the input acoustic features $X$ of variable sequence length, with each subsequent layer incorporating a larger temporal context. The output of the TDNN layers is pooled into a fixed dimension by taking the mean and variance of each unit of the frame-level output. This may then be projected into a smaller number of dimensions to extract the final speaker embedding. The network up to this point can be referred to as the embedding extractor, or generator, $G$, parameterized by $\theta_G$.
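The pooling step described above can be illustrated with a minimal sketch. The function name `stats_pooling` and the list-of-lists representation of the frame-level output are assumptions made for illustration only; a real system would operate on tensors produced by the TDNN layers.

```python
def stats_pooling(frames):
    """Pool a variable number of frame-level vectors into one fixed-size vector.

    frames: list of T frame-level output vectors, each a list of D floats
            (hypothetical stand-in for the TDNN output).
    Returns a list of length 2 * D: per-unit means followed by per-unit variances.
    """
    num_frames = len(frames)
    dim = len(frames[0])
    # Mean of each unit over time.
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    # (Population) variance of each unit over time.
    variances = [
        sum((f[d] - means[d]) ** 2 for f in frames) / num_frames
        for d in range(dim)
    ]
    return means + variances


# Inputs of different lengths map to the same output dimension (2 * D).
short = stats_pooling([[1.0, 2.0], [3.0, 6.0]])
long = stats_pooling([[0.5, 0.1]] * 50)
print(len(short), len(long))
```

Whatever the number of input frames, the pooled vector has twice the width of the frame-level output, which is what allows a fixed-size embedding to be projected from it.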

Taking the speaker embeddings $h$ as input, a classifier network $C$ learns to predict the input class. This network is trained with a loss function for multi-class classification, such as cross entropy loss, $\mathcal{L}_C$, and is parameterized by $\theta_C$.

3 Channel Adversarial Training

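By training a discriminator $D$, parameterized by $\theta_D$, to ascertain the domain of the generated features, an adversarial penalty is added to the overall loss function of a domain adversarial neural network (DANN) [9, 10]:

$$\mathcal{L}(\theta_G, \theta_C, \theta_D) = \mathcal{L}_C(\theta_G, \theta_C) - \lambda \mathcal{L}_D(\theta_G, \theta_D)$$

where $\mathcal{L}_C$ is the classification loss, $\mathcal{L}_D$ is the discriminator loss, $\theta_G$ and $\theta_C$ are the generator and classifier parameters, and $\lambda$ is a controllable parameter that determines the weighting of the adversarial term. Allowing the adversary to act against the classifier is implemented via a gradient reversal layer between the generator and the discriminator.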
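To make the role of the gradient reversal layer concrete, the following is a toy sketch with scalar parameters and hand-written gradients. The squared-error discriminator loss, the scalar "networks" and all function names are assumptions chosen to keep the example dependency-free; they are not the losses or architectures used in this work.

```python
def grl_forward(h):
    """Gradient reversal layer, forward pass: the identity."""
    return h


def grl_backward(grad_output, lam):
    """Gradient reversal layer, backward pass: flip the sign and scale by lam."""
    return -lam * grad_output


def discriminator_step_grads(w_g, w_d, x, y, lam):
    """Return (loss, grad wrt discriminator weight, grad wrt generator weight).

    Toy model (hypothetical): generator h = w_g * x, discriminator
    prediction = w_d * h, discriminator loss = (prediction - y) ** 2.
    """
    h = w_g * x                      # generator output (the "embedding")
    z = grl_forward(h)               # unchanged in the forward direction
    err = w_d * z - y
    loss = err ** 2
    grad_w_d = 2.0 * err * z         # discriminator sees the ordinary gradient
    grad_z = 2.0 * err * w_d         # gradient arriving at the reversal layer
    grad_h = grl_backward(grad_z, lam)
    grad_w_g = grad_h * x            # generator receives the reversed gradient
    return loss, grad_w_d, grad_w_g


loss, g_d, g_g = discriminator_step_grads(w_g=1.0, w_d=0.5, x=2.0, y=0.0, lam=1.0)
print(loss, g_d, g_g)
```

A gradient-descent step on `w_d` lowers the discriminator loss, while the same descent step on `w_g` raises it, because the generator's gradient has been negated and scaled by the weighting parameter.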

In this work, the adversary classifies pairs of embeddings as being within-recording or not, thus penalizing the inclusion of channel information in the embeddings. This is implemented by attaching a discriminator with a gradient reversal layer that takes concatenated pairs of embeddings as input. This discriminator outputs a binary prediction.

This presents the question of which embeddings should be paired together to train the discriminator. Naively, one could select pairs randomly with a 50% within-recording distribution. However, selecting pairs which do not have the same speaker may lead to the discriminator ascertaining the recording information based on the identities of the speakers, which is the opposite of the main training objective. As such, it is important that the discriminator only receives pairs of embeddings which belong to the same speaker.

This is achieved by selecting an anchor speaker $s$ and a random anchor utterance belonging to that speaker, $x_a$: a segment belonging to a recording $r$. Within the same batch, another utterance from speaker $s$, $x_p$, is chosen from a separate segment of the same recording $r$. If such a segment does not exist, then $x_a$ and $x_p$ can be chosen by taking separated subsegments of a single utterance, with overlap minimized if possible.

A second pair is constructed by choosing utterance $x_n$, which can be a random utterance belonging to speaker $s$ that is not from recording $r$. If no out-of-recording utterances exist, or recording information is unknown, two subsegments of a single utterance can be taken to be the within-recording pair.

At the embedding stage, two pairs are concatenated: a within-recording pair $(h_a, h_p)$ and an out-of-recording pair $(h_a, h_n)$.

A batch for training is populated by selecting $N$ anchor speakers and selecting the three segments for each speaker, $x_a$, $x_p$, and $x_n$. This results in an overall batch size of $3N$ for both the generator and the classifier, and an input batch size of $2N$ for the discriminator. The overall system is shown in Figure 1, with colors to indicate the pattern of concatenation.
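The batch construction above can be sketched as follows. This is an illustrative outline only: the corpus layout (speaker to recording to list of segment identifiers), the function names, and the decision to skip speakers that need the subsegment fallback are all assumptions, not details taken from this work.

```python
import random


def sample_triplet(recordings, rng):
    """Pick (anchor, within-recording, out-of-recording) segments for one speaker.

    recordings: dict mapping recording id -> non-empty list of segment ids.
    Returns None when the main case does not apply; the subsegment fallback
    described in the text is not sketched here.
    """
    # The anchor recording must offer a second segment for the within-recording pair.
    anchor_recs = [r for r, segs in recordings.items() if len(segs) >= 2]
    if not anchor_recs or len(recordings) < 2:
        return None
    rec = rng.choice(anchor_recs)
    anchor, within = rng.sample(recordings[rec], 2)
    other_rec = rng.choice([r for r in recordings if r != rec])
    outside = rng.choice(recordings[other_rec])
    return anchor, within, outside


def build_batch(corpus, n_speakers, rng):
    """Return (segments, pairs): three segments and two labelled pairs per speaker."""
    segments, pairs = [], []
    speakers = list(corpus)
    rng.shuffle(speakers)
    for spk in speakers:
        if len(pairs) == 2 * n_speakers:
            break
        triplet = sample_triplet(corpus[spk], rng)
        if triplet is None:
            continue
        anchor, within, outside = triplet
        segments.extend([anchor, within, outside])
        pairs.append(((anchor, within), 1))   # within-recording pair
        pairs.append(((anchor, outside), 0))  # out-of-recording pair
    return segments, pairs


# Hypothetical toy corpus: segment ids are (speaker, recording, index) tuples.
corpus = {
    spk: {rec: [(spk, rec, i) for i in range(2)] for rec in ("recA", "recB")}
    for spk in ("spk1", "spk2", "spk3")
}
segments, pairs = build_batch(corpus, n_speakers=2, rng=random.Random(0))
print(len(segments), len(pairs))
```

Every pair shares a speaker, so the only cue left for the discriminator is whether the two segments come from the same recording.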

4 Experimental setup

We used the VoxCeleb 1 [15] evaluation set and the CALLHOME corpus for our experiments. CALLHOME is typically used for speaker diarization, so to evaluate verification, segments without speaker overlap and with a minimum duration of 0.5 s were extracted according to the ground truth diarization, selecting only pairs of segments occurring within the same recording.

The training data used was the same as in the Kaldi recipes for VoxCeleb and CALLHOME. For training the VoxCeleb system, the VoxCeleb 2 [16] corpus was augmented using background noises and room impulse responses as in the Kaldi recipe (although the Kaldi recipe also uses the training portion of VoxCeleb 1, which this work omits).

For CALLHOME, the training data used was a combination of the NIST SRE 2004-2008 corpora, along with Switchboard 1, 2 and Cellular, all augmented in a similar fashion. Augmented versions were considered as different recordings.

4.1 Baselines

The network architecture for the generator closely follows Snyder et al [1], utilizing the same widths of temporal context at each layer, along with the choices for the number of hidden units at each layer. Leaky ReLU and Batch Normalization were applied at each layer.

Instead of using the stats pooling that the original architecture used, attentive stats pooling [17] was used, with hidden units in the single attention head for the VoxCeleb system, and for the CALLHOME system. After pooling, the VoxCeleb system was projected to an embedding of size , and CALLHOME to a -dimension embedding.

The classifier network was a single-hidden-layer feed-forward network with hidden units for all models, projecting to the number of classes for each dataset. The classifier was trained using an additive margin softmax loss [18] with the recommended hyperparameters. All layers had a dropout schedule applied that started at , rose to in the middle and dropped off to thereafter, similar to the Kaldi recipe.

Networks were trained on batches of utterances between s in duration with batch size , ensuring one example per speaker. Speakers were cycled in each batch to ensure a uniform distribution of speakers across training. The VoxCeleb system was trained for batches and the CALLHOME for . SGD was used with learning rate and momentum , with the learning rate halving at of the way through training, and halving for every thereafter.

For both VoxCeleb and CALLHOME there exist pretrained models in Kaldi, which were also used for benchmarking. Note that the Kaldi VoxCeleb model is trained using the VoxCeleb 1 training portion in addition to VoxCeleb 2.

4.2 Acoustic features

For all experiments, -dimensional MFCCs were extracted, with the standard ms window and ms step. Cepstral mean and variance normalization was applied to each utterance before training and only voiced frames were selected, judged by a simple energy based VAD system.

4.3 Similarity scoring

For both verification and diarization, either a cosine similarity or PLDA backend was used, utilizing length normalization for both. The PLDA model was trained on only the training data for that task, meaning either VoxCeleb 2 or the SRE-Switchboard combination. This differs from some works on CALLHOME, which train on some folds of the CALLHOME data and use the unseen folds for evaluation [19, 20]. At no point have the models in this work been trained on any CALLHOME data.
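As an illustration of the cosine backend with length normalization, the sketch below scores a pair of embeddings. The function names and the plain-list vectors are assumptions for illustration, and the code assumes non-zero embeddings; the PLDA backend is not sketched.

```python
import math


def length_normalize(vec):
    """Scale an embedding to unit Euclidean length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def cosine_score(emb_a, emb_b):
    """Cosine similarity, computed as the dot product of length-normalized embeddings."""
    a = length_normalize(emb_a)
    b = length_normalize(emb_b)
    return sum(x * y for x, y in zip(a, b))


# Scaling an embedding does not change its score against another embedding.
print(cosine_score([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

A verification decision would then compare this score against a threshold, with higher scores indicating that the two embeddings are more likely to share a speaker.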

4.4 Diarization

The diarization pipeline was as follows. From oracle speech activity marks, s subsegments were extracted with a s overlap. Speaker embeddings were extracted from each subsegment and normalized, and agglomerative hierarchical clustering was performed on the cosine similarity matrix. Cluster label overlaps were resolved by taking the mid-point of the overlap. Final diarization error rate was computed using md-eval.pl with a forgiveness collar of s.
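The subsegmentation step of the pipeline can be sketched as a sliding window over each speech region. The function name, the truncation of the final window at the region boundary, and the example values are assumptions for illustration; the window and overlap are left as parameters and the numbers in the usage line are arbitrary.

```python
def subsegments(start, end, window, overlap):
    """Split the speech region [start, end) into overlapping windows.

    window and overlap share the units of start and end. The last window is
    truncated at the region end (an assumption of this sketch).
    """
    step = window - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than window")
    out = []
    t = start
    while t < end:
        out.append((t, min(t + window, end)))
        if t + window >= end:
            break  # the region is fully covered
        t += step
    return out


# Arbitrary illustrative values: a 5-unit region, 2-unit windows, 1-unit overlap.
print(subsegments(0, 5, window=2, overlap=1))
```

Because neighbouring windows overlap, adjacent subsegments can receive different cluster labels over the shared span, which is why the pipeline resolves label overlaps at the mid-point.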

4.5 Adversarial Experiments

To establish a baseline for other domain adversarial techniques, the CALLHOME model was also trained with a dataset-predicting adversary. The training data was split into three domain labels according to the dataset: SRE, Switchboard Cellular, or Switchboard. This adversarial discriminator was trained on the 3-class classification task on all embeddings in a batch using a cross entropy loss. This baseline was not possible with VoxCeleb due to the lack of domain label candidates.

The discriminator in all experiments was a simple feedforward network which had one hidden layer with units, outputting a single value for the within-recording prediction. For the channel-adversarial model, the size of the input was twice that of an embedding, so for the VoxCeleb system and for the CALLHOME system. The gradient reversal layer value was set to .

5 Results and Discussions

                       Cosine    PLDA
Baseline (Kaldi)       9.77%     3.10%
Baseline (ours)        5.94%     3.87%
Data-Tuned             5.83%     3.92%
Channel-Adversarial    4.21%     2.98%

Table 1: EER values for the VoxCeleb 1 test set using cosine similarity or PLDA backend.

               All pairs            Within-rec
               Cosine    PLDA       Cosine    PLDA
BL (Kaldi)     29.29%    19.06%     30.05%    23.16%
BL (ours)      19.09%    16.19%     28.51%    20.47%
Data-Tuned     20.32%    17.75%     29.55%    22.43%
Dataset-Adv    19.45%    16.30%     26.71%    20.55%
Channel-Adv    21.11%    15.65%     26.30%    19.01%

Table 2: EER values for utterances from the CALLHOME dataset using cosine similarity or PLDA backend.

                    DER
Baseline (Kaldi)    11.69%
Baseline (ours)     11.21%
Dataset-Adv         10.97%
Channel-Adv         10.01%

Table 3: Diarization error rate on CALLHOME using a cosine similarity back-end.

Table 1 shows speaker verification results on VoxCeleb for each model. When all components were trained from a random initialization, the channel-adversarial model did not converge. However, when the discriminator was added to an already converged baseline, the technique showed a marked improvement in performance, listed as ‘Channel-Adversarial’ in the table. The ‘Data-Tuned’ model is the control: it was trained from the same starting point as the channel-adversarial model but without an adversary, and it shows no consistent improvement over the baseline (5.83% versus 5.94% with cosine scoring, but 3.92% versus 3.87% with PLDA). The improvement of our baseline over the Kaldi baseline for cosine similarity is likely due to the use of attentive statistics pooling and the additive margin softmax loss. The most comparable network architecture in the literature is that of Okabe et al [17], which achieves an EER of on VoxCeleb. In the recent VoxSRC competition, much lower values for EER on VoxCeleb 1 were achieved (), generally using much deeper models and higher-dimensional inputs. However, our results outperform others using small variations on the original x-vector architecture, in addition to outperforming some deeper models with more parameters [21, 22].
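The relative improvement quoted in the abstract can be checked against the PLDA column of Table 1. The helper name below is an assumption; the EER values are taken directly from the table.

```python
def relative_improvement(baseline_eer, system_eer):
    """Relative EER reduction of a system with respect to a baseline."""
    return (baseline_eer - system_eer) / baseline_eer


# PLDA column of Table 1: Kaldi baseline 3.10%, channel-adversarial 2.98%.
plda_gain = relative_improvement(3.10, 2.98)
# Cosine column of Table 1: our baseline 5.94%, channel-adversarial 4.21%.
cosine_gain = relative_improvement(5.94, 4.21)
print(f"PLDA vs Kaldi: {100 * plda_gain:.1f}% relative")
print(f"Cosine vs our baseline: {100 * cosine_gain:.1f}% relative")
```

The first figure rounds to the roughly 4% relative improvement over the Kaldi baseline reported in the abstract.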

Table 2 shows the verification performance on utterances from CALLHOME, for both all pairs and within-recording pairs. Here, the channel-adversarial model with a PLDA backend produces the best EER in both scenarios. The adversarial models appear to perform better in general for within-recording pairs, with the channel-adversarial model performing the best once again, outperforming the dataset-adversarial model. Interestingly, the cosine-scored EER of the channel-adversarial model degrades in the ‘all pairs’ scenario (21.11% versus 19.09% for our baseline).

Across all models, PLDA improves performance on verification, but the size of this improvement varies considerably between models and conditions.

Table 3 displays the diarization performance on CALLHOME using a cosine similarity backend, with the channel-adversarial model once again performing the best.

6 Conclusions

We have proposed a recording-level adversarial training strategy to reduce domain mismatch when estimating deep speaker embeddings. An adversary is trained to classify whether pairs of embeddings belong to the same recording, thus penalising embeddings that contain channel information. Experimental results on VoxCeleb and CALLHOME show that this method improves performance over a standard baseline and, on CALLHOME, also over a dataset-adversarial baseline whose adversary predicts the dataset from which each training sample originated.


  • [1] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, “X-vectors: robust DNN embeddings for speaker recognition,” in IEEE ICASSP, 2018, pp. 5329–5333.
  • [2] Gregory Sell, David Snyder, Alan Mccree, Daniel Garcia-Romero, Jesús Villalba, Matthew Maciejewski, Vimal Manohar, Najim Dehak, Daniel Povey, Shinji Watanabe, and Sanjeev Khudanpur, “Diarization is Hard: some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge,” in Interspeech, 2018, pp. 2808–2812.
  • [3] Mireia Diez, Lukáš Burget, Shuai Wang, and Johan Rohdin, “Bayesian HMM based x-vector clustering for speaker diarization,” in Interspeech, 2019, pp. 346–350.
  • [4] Zhong Meng, Yong Zhao, Jinyu Li, and Yifan Gong, “Adversarial speaker verification,” in ICASSP, 2019, vol. 2019-May, pp. 6216–6220.
  • [5] Najim Dehak, Patrick J. Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, “Front end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
  • [6] Simon J.D. Prince and James H Elder, “Probabilistic linear discriminant analysis for inferences about identity,” in IEEE ICCV, 2007.
  • [7] Gregory Sell and Daniel Garcia-Romero, “Speaker diarization with PLDA i-vector scoring and unsupervised calibration,” IEEE SLT, pp. 413–417, 2014.
  • [8] Daniel Garcia-Romero and Carol Espy-Wilson, “Analysis of i-vector length normalization in speaker recognition systems,” in Interspeech, 01 2011, pp. 249–252.
  • [9] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, Victor Lempitsky, Urun Dogan, Marius Kloft, Francesco Orabona, and Tatiana Tommasi, “Domain-adversarial training of neural networks,” in Advances in Computer Vision and Pattern Recognition, vol. 17, pp. 189–209, 2015.
  • [10] Yusuke Shinohara, “Adversarial multi-task learning of deep neural networks for robust speech recognition,” in Interspeech, 2016, pp. 2369–2372.
  • [11] Jian Shen, Yanru Qu, Weinan Zhang, and Yong Yu, “Wasserstein distance guided representation learning for domain adaptation,” in AAAI, 2018, pp. 4058–4065.
  • [12] Youzhi Tu, Man-Wai Mak, and Jen-Tzung Chien, “Variational domain adversarial learning for speaker verification,” in Interspeech, 2019, pp. 4315–4319.
  • [13] Gautam Bhattacharya, Joao Monteiro, Jahangir Alam, and Patrick Kenny, “Generative adversarial speaker embedding networks for domain robust end-to-end speaker verification,” in ICASSP, 2019, vol. 2019-May, pp. 6226–6230.
  • [14] Chih-Chung Hsu, Chia-Wen Lin, Weng-Tai Su, and Gene Cheung, “SiGAN: Siamese generative adversarial network for identity-preserving face hallucination,” IEEE Trans. Image Process., vol. 28, no. 12, pp. 6225–6236, 2019.
  • [15] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, “VoxCeleb: A large-scale speaker identification dataset,” Interspeech, vol. 2017-August, pp. 2616–2620, 2017.
  • [16] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman, “Voxceleb2: Deep speaker recognition,” Interspeech, vol. 2018-Septe, no. ii, pp. 1086–1090, 2018.
  • [17] Koji Okabe, Takafumi Koshinaka, and Koichi Shinoda, “Attentive statistics pooling for deep speaker embedding,” in Interspeech, 2018, vol. 2018-September, pp. 2252–2256.
  • [18] Feng Wang, Jian Cheng, Weiyang Liu, and Haijun Liu, “Additive margin softmax for face verification,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926–930, 2018.
  • [19] Qingjian Lin, Ruiqing Yin, Ming Li, Hervé Bredin, and Claude Barras, “LSTM based similarity measurement with spectral clustering for speaker diarization,” in Interspeech, 2019, pp. 366–370.
  • [20] A. Zhang, Q. Wang, Z. Zhu, J. Paisley, and C. Wang, “Fully supervised speaker diarization,” in IEEE ICASSP, May 2019, pp. 6301–6305.
  • [21] W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, “Utterance-level aggregation for speaker recognition in the wild,” in IEEE ICASSP, May 2019, pp. 5791–5795.
  • [22] Jee weon Jung, Hee-Soo Heo, Ju ho Kim, Hye jin Shim, and Ha-Jin Yu, “Rawnet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification,” in Interspeech, 2019, pp. 1268–1272.