Deep speaker embeddings have become the state-of-the-art technique for learning speaker representations [Snyder, Snyder2018], outperforming the historically successful i-vector technique [Dehak2011]. These speaker representations are crucial for many tasks related to speaker recognition, such as speaker verification, identification and diarization [Sell2018, Diez2019, Villalba2020].
The networks used to generate speaker embeddings, such as the popular x-vector architecture [Snyder2018], are typically trained on a speaker classification task, taking as input the acoustic features of an utterance and predicting which training set speaker produced the input utterance. By taking one of the upper layers of this network as an embedding, a fixed dimensional vector can be extracted for any given input utterance. This vector, due to the training objective that the network was given, is speaker-discriminative. Crucially, it has been found that these embeddings can be used to discriminate between speakers that were not present in the training set.
Although achieving speaker-discriminative embeddings by training via classification is common [Xie2019, Tang2019], there exist other means of achieving this. For example, several approaches are variants of the triplet loss [Hoffer2015, Heigold2016, Wan2017], which explicitly optimizes embeddings to move closer to same-class examples whilst moving further away from out-of-class examples. Another approach is the family of angular penalty loss functions [Wanga, Deng2018, Liu, Zhang2019], which are similar to the standard softmax loss but enforce a stricter condition on the decision boundary between classes by adding angular penalty terms for the correct class, thus encouraging larger inter-class distances and more compact intra-class distances.
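As one concrete member of this family, an additive cosine-margin (CosFace-style) loss for a single example of class $y$ is commonly written as below. The symbols are notation assumed here for illustration: $s$ is a scale factor, $m$ is the margin, and $\theta_j$ is the angle between the length-normalized embedding and the length-normalized weight vector of class $j$.

```latex
\mathcal{L} = -\log
\frac{e^{\,s(\cos\theta_{y} - m)}}
     {e^{\,s(\cos\theta_{y} - m)} + \sum_{j \neq y} e^{\,s\cos\theta_{j}}}
```

Because the margin $m$ is subtracted only from the target-class cosine, an example must match its own class by a clear margin before the loss becomes small, which tightens each class while separating different classes.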
We propose and make available code (https://github.com/cvqluu/dropclass_speaker) for two methods aimed at achieving speaker-discriminative embeddings, both focused around the notion of dropping classes during training. The first technique, referred to as DropClass, continually changes the training objective for deep speaker embedding systems by periodically dropping a random subset of classes during training, such that the network is continually trained on many different classification tasks. This is conceptually similar to applying Dropout [Srivastava2014] on the output layer of a classification network while also disallowing training examples from the dropped classes. We argue that speaker recognition tasks have strong parallels with few-shot learning tasks and thus may benefit from a meta-learning style approach, which is what DropClass provides.
The second method proposed in this work, referred to as DropAdapt, can be applied to adapt a fully trained model to a set of enrolment speakers in an unsupervised manner. This is achieved by dropping the training classes which are predicted by the model to be unlikely in the full set of enrolment utterances, rather than dropping randomly selected subsets of classes. We also show that the predicted distribution of speakers in the training set and test set can be heavily mismatched, which we argue negatively impacts performance. Our experiments show that DropAdapt can mitigate this distribution mismatch and that this correlates with improved speaker verification performance.
2 Dropping Training Classes
This work proposes two functionally similar techniques which revolve around the process of dropping classes during training. Despite this functional similarity, they have distinct applications, justifications, and links to literature which will be detailed in the following sections.
A typical speaker embedding network, such as the successful x-vector architecture [Snyder2018], has the following structure. The network is trained as a whole but can be split into two components, an embedding extractor and a classifier, which are detailed below.
From an input of acoustic features such as MFCCs, the first part of the network, the embedding extractor, produces a fixed-dimensional embedding.
This is typically trained on a classification task, meaning that the classification part of the overall network acts on the embedding to produce a prediction of which class the input belongs to.
Both components are trained jointly, usually via the standard cross-entropy loss against a one-hot target vector indicating which of the training classes the input belongs to.
For simplicity, the classification network may be as rudimentary as an affine transform that projects the input embedding to as many dimensions as there are training classes. In this simplified case, the entirety of the classifier is a weight matrix with one row per training class and one column per embedding dimension. Without a bias term, this reduces Equation 2 to a single product between the weight matrix and the embedding (Equation 3), whose output contains the logits of the class prediction.
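The relations described above can be written compactly as follows. The symbols are assumed here for illustration and may differ from the authors' own notation: $x$ is the input feature sequence, $G$ the embedding extractor with parameters $\theta_G$, $C$ the classifier with parameters $\theta_C$, $e$ the $d$-dimensional embedding, $\hat{y}$ the class prediction, and $W$ the weight matrix for $N$ training classes.

```latex
\begin{align}
e &= G_{\theta_G}(x) \tag{1} \\
\hat{y} &= C_{\theta_C}(e) \tag{2} \\
\hat{y} &= W e, \qquad W \in \mathbb{R}^{N \times d} \tag{3}
\end{align}
```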
The proposed technique, referred to as DropClass, is detailed in Algorithm 1. When training with DropClass, a random subset of the training classes is chosen at a fixed interval of iterations; its size is the total number of training classes minus the number of classes to be dropped. The interval and the number of classes to drop are configurable hyperparameters. The chosen subset defines the classes permitted for the next interval. The rows of the weight matrix corresponding to the permitted classes are selected to form a new, smaller matrix with one row per permitted class, so that the output of the resulting modification of Equation 3 (Equation 4) has one logit per permitted class.
At the end of each interval, the process is repeated and a new proper subset is randomly selected; this continues until training is completed (Figure 1).
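The procedure above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the released implementation: the function names, the tiny dimensions, and the values of `num_drop` and `period` are all hypothetical, and the actual gradient update is omitted.

```python
import random

def sample_permitted_classes(num_classes, num_drop, rng):
    """Randomly pick the classes that stay active for the next period."""
    return sorted(rng.sample(range(num_classes), num_classes - num_drop))

def select_rows(weight, permitted):
    """Build the reduced weight matrix from the rows of the permitted classes."""
    return [weight[c] for c in permitted]

def filter_examples(examples, permitted):
    """Drop examples of non-permitted classes; remap labels to 0..len(permitted)-1."""
    remap = {c: i for i, c in enumerate(permitted)}
    return [(x, remap[y]) for x, y in examples if y in remap]

rng = random.Random(0)
num_classes, dim = 6, 3
num_drop, period = 2, 5  # hypothetical hyperparameter values
weight = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_classes)]
examples = [([rng.gauss(0.0, 1.0) for _ in range(dim)], c)
            for c in range(num_classes) for _ in range(2)]

history = []
for iteration in range(10):
    if iteration % period == 0:
        # Start of a new period: re-sample the permitted classes.
        permitted = sample_permitted_classes(num_classes, num_drop, rng)
        sub_weight = select_rows(weight, permitted)
        batch = filter_examples(examples, permitted)
        history.append(permitted)
    # A real training step would compute the loss on `batch` using
    # `sub_weight` and update the network; omitted in this sketch.
```

Because `select_rows` returns references to the original row lists, an in-place update of a selected row would also update the full matrix, which mirrors the idea that dropped rows are merely inactive rather than deleted.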
This proposed method can be compared with a number of existing techniques in the literature, in particular Dropout [Srivastava2014]. DropClass essentially drops units in the output classification layer and synchronizes this with the data provided to the model, ensuring that no examples from dropped classes are provided while the corresponding classification units are dropped.
The effectiveness of Dropout has been justified by the technique performing a continuous sampling of an exponential number of thinned networks throughout training and then taking an average of these at test time [Baldi2014, Warde-Farley2014]. As a result of this model averaging, Dropout has been shown to reduce overfitting and generally improve performance [Srivastava2014], and has seen widespread adoption in many different applications of neural networks [Dahl2013, Variani2014, Wang2017a]. Similar in its justification, DropClass is continuously sampling from a large number of different classification tasks on which the embedding generator must perform well, in theory making it agnostic to any one specific task.
This technique also has some similarity to some techniques in the field of meta-learning for few-shot learning, specifically Model-Agnostic Meta Learning (MAML) [Finn2017] and the related technique Almost No Inner Loop (ANIL) [Raghu]. MAML is a method for tackling few-shot learning problems by utilizing two nested optimization loops. The outer loop finds an initialization for a network which can adapt to new tasks quickly, whilst the inner loop uses the initialization from the outer loop and learns from a small number of examples from each desired task (referred to as the ‘support set’), performing a few gradient updates.
Raghu et al [Raghu] found the strength of MAML lay in the quality of the initialization found by the outer loop, with each task-specific adaptation in the inner loop mostly reusing features already learned in the outer loop step. They proposed ANIL, which reduces the inner task-specific optimization loop to only optimize the classification layer, or ‘head’, of a MAML-trained network. Similar to DropClass, ANIL makes a distinction between the part of the overall classification network which generates discriminative features (referred to as the ‘body’), and the classification head, which is more task specific. Raghu et al also proposed the No Inner Loop (NIL) method, which uses the cosine similarity between the generated features of an unseen example and the generated features of a small number of known examples to weight the classification prediction. This use of cosine similarity to compare embeddings is extremely commonplace in speaker recognition [Hansen2015] and, in practice, the inference step of the NIL technique is identical to a speaker identification setup, if one considers the utterances from the enrolment speakers to be the small number of labeled examples, the ‘support set’.
This similarity between few-shot learning and speaker recognition has influenced the proposal of DropClass, which, like ANIL, aims to produce a ‘body’ that generates features applicable to a distribution of tasks (subset classification) rather than to a single task. However, DropClass does not perform the outer and inner loops found in MAML/ANIL, which explicitly optimize the network to be robust to additional gradient steps per sub-task. Instead, DropClass continually randomizes the training objective, implicitly encouraging the generated features to perform well across sub-tasks. Nevertheless, exploring ANIL and MAML for speaker representation learning would be a natural extension to this work. This extension would be particularly interesting considering the experiments on the NIL method (cosine similarity scoring) from Raghu et al [Raghu], specifically Table 5. They found that MAML- and ANIL-trained models significantly outperformed ‘multiclass training’ models, in which all possible classes were trained simultaneously. Considering that the ‘multiclass training’ paradigm is the most common approach to training deep speaker embedding extractors, there could well be gains to be found in adopting a meta-learning approach.
Deep speaker representations are optimized to distinguish between the training set speakers, which is hoped to generalise to any given set of new unseen speakers. Generally however, the distribution of speakers in a desired held out evaluation set does not exactly match the distribution seen during training; that is, the expected distribution along the manifold of known speakers is often not replicated in the evaluation set. A clearer explanation for this can be seen if we examine what classes are predicted by the whole network when we give as input the utterances in the test set.
Starting from a trained model, for a given dataset of examples, the average probability assigned to each class can be calculated by applying the softmax to the class logits of each utterance and averaging the result over all utterances in the dataset,
where the logits for each utterance are obtained from the embedding extracted from that utterance and the final affine weight matrix. The resulting vector, with one entry per training class, represents the mean probability that the model predicts for the presence of each speaker across the utterances. Provided a uniform distribution of speakers was used to train the model, it would be expected that the model predicts a near-uniform average class probability for an input of training examples with uniform class distribution. This, however, may not be the case if the model is provided with a selection of evaluation utterances with uniform speaker distribution.
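A minimal sketch of this average class probability is given below. It assumes plain softmax over bias-free affine logits; the function names and the toy weight matrix are hypothetical, and a model trained with an angular penalty loss may compute its logits differently.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def average_class_probability(weight, embeddings):
    """Mean softmax probability per training class over a set of embeddings."""
    mean = [0.0] * len(weight)
    for e in embeddings:
        logits = [sum(w * x for w, x in zip(row, e)) for row in weight]
        for c, p in enumerate(softmax(logits)):
            mean[c] += p / len(embeddings)
    return mean

# Toy example: two classes, two utterances that each favour one class.
toy_weight = [[1.0, 0.0], [0.0, 1.0]]
toy_embeddings = [[1.0, 0.0], [0.0, 1.0]]
avg_prob = average_class_probability(toy_weight, toy_embeddings)
```

In this symmetric toy case the two utterances favour different classes equally, so the average comes out uniform; a test set that favours some training speakers would yield a skewed vector instead.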
This effect can be seen in Figure 2, where a trained model predicts close to the uniform distribution of classes when provided with a uniform class distribution of training examples, but predicts a much more skewed distribution on the test set, with some training classes predicted to be much more likely than others. This is perhaps not a surprising result: in the hypothetical situation of a test set containing only one gender, a skewed distribution is expected. However, this skew is often not as clearly explainable as the hypothetical one-gender test set, and may have multiple contributing factors. For context, the VoxCeleb 1 test set has a ‘good balance of male and female speakers’ [Nagraniy2017], and the speakers in it were chosen because their names began with the letter ‘E’.
This observation can be interpreted in a number of ways. For example, it is well known that class imbalance is a significant impediment to performance in classification tasks [Buda2017], especially when training and inference have significantly different distributions. It follows naturally that an embedding extracted from a classification network would degrade in the same manner, as has been seen in the work of Huang et al and Khan et al [Huang, Khan2018], who found that cost-sensitive training and oversampling methods increased the performance of learned representations.
The second closely related interpretation and hypothesis is that the ‘low probability’ classes predicted by the model are in some way less important to the performance of the embeddings on the test set. These ‘low probability’ classes are suggested by the model’s predictions to be less likely to be present in the test set. This might imply that distinguishing between these specific classes is not crucial to the end task.
Following from these interpretations, this technique, referred to as DropAdapt, works via dropping these low probability classes permanently to fine-tune a fully trained model, adapting the model to a test set and hopefully increasing performance. This is described in Algorithm 2. This method in theory should be applied only to a fully or near fully trained model, as an accurate estimation of the training class occupation must be obtained first.
To ensure an accurate probability estimation of the test set throughout the fine-tuning, this ranking (and dropping) of the least probable classes can be performed periodically, meaning this technique is functionally similar to the DropClass method above, except that classes are removed permanently, and the dropping of classes is determined by the probability criterion instead of randomly.
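The ranking and permanent dropping step can be illustrated as follows. This is a sketch under assumptions: the function names and the example probabilities are hypothetical, and how many classes are dropped at each step is left as a free parameter.

```python
def lowest_probability_classes(avg_prob, active, num_drop):
    """Return the `num_drop` active classes with the smallest average probability."""
    ranked = sorted(active, key=lambda c: avg_prob[c])
    return ranked[:num_drop]

def drop_permanently(active, to_drop):
    """Remove the selected classes from the list of active classes."""
    dropped = set(to_drop)
    return [c for c in active if c not in dropped]

# Hypothetical average class probabilities estimated on the adaptation data.
avg_prob = [0.30, 0.05, 0.25, 0.02, 0.38]
active = list(range(len(avg_prob)))
to_drop = lowest_probability_classes(avg_prob, active, num_drop=2)
active = drop_permanently(active, to_drop)
```

Unlike DropClass, the list of active classes only ever shrinks here, and re-estimating `avg_prob` before each call would correspond to the periodic re-ranking described above.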
A slight variation of this method explored in this work is referred to as DropAdapt-Combine, in which, instead of permanently removing these classes, all the low probability classes are combined into a single new class such that the examples belonging to the removed classes are not completely discarded.
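The relabelling implied by DropAdapt-Combine can be sketched as below. The function name and the choice to place the merged class at the last index are assumptions made for illustration; how the output row of the merged class is initialised is not specified in this section and is therefore left out.

```python
def combine_labels(labels, low_prob_classes, num_classes):
    """Map retained classes to 0..K-1 and every low-probability class to index K."""
    low = set(low_prob_classes)
    retained = [c for c in range(num_classes) if c not in low]
    remap = {c: i for i, c in enumerate(retained)}
    combined_index = len(retained)
    new_labels = [remap.get(y, combined_index) for y in labels]
    return new_labels, combined_index

labels = [0, 1, 2, 3, 4, 1]
new_labels, combined_index = combine_labels(labels, low_prob_classes=[1, 3], num_classes=5)
```

Every example is kept, but examples of the low probability classes now share a single target.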
This method can be compared to techniques in the fields of active learning and learning from small amounts of data, such as the Facility-Location and Disparity-Min models [Kaushal2019], which put heavy emphasis on selecting the right subset of examples in order to learn efficiently. These methods are typically used to capture the whole distribution of the desired dataset in as few examples as possible, encouraging a diverse and representative subset of examples. However, it is implied by Figure 2 that in this speaker embedding task, even if the whole training dataset were used, this may not be representative of the distribution found at test time. DropAdapt can be seen as a means of correcting this mismatch through subset selection for fine-tuning.
Buda et al [Buda2017] and Huang et al. [Huang] found oversampling minority classes to be an effective strategy in improving performance for neural networks on imbalanced datasets. Viewing this problem as a dataset imbalance problem, DropAdapt could also be interpreted as a corrective oversampling strategy, training additionally on those classes which are retained to better match the target distribution.
This train-test distribution mismatch is also closely linked to the field of domain adaptation and the domain-shift problem [Patel2015]. However, DropAdapt is primarily proposed as a means of adapting to a class distribution mismatch, as the average class probability estimate is likely to be less informative the greater the domain mismatch. Combining domain adaptation techniques with DropAdapt could be an interesting extension to this work.
The following section details the experimental setup and the experiments performed utilizing the proposed methods.
3.1 Experimental Setup
The primary task on which these experiments attempted to improve performance was speaker verification, specifically on VoxCeleb 1 [Nagraniy2017] and the Speakers In The Wild (SITW) core-core task [Mclaren2016]. Although several metrics exist for evaluating verification performance, typically chosen depending on the desired behaviour of a system, the primary metric explored here was the equal error rate (EER), as that is the primary metric for evaluation on VoxCeleb 1.
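For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below approximates it by sweeping the decision threshold over the observed scores; it is a simple illustration with a hypothetical function name and toy trial list, not the scoring tool used in the experiments.

```python
def equal_error_rate(scores, labels):
    """Approximate EER; labels are 1 for same-speaker trials, 0 otherwise."""
    targets = [s for s, l in zip(scores, labels) if l == 1]
    nontargets = [s for s, l in zip(scores, labels) if l == 0]
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(scores)):
        # Accept a trial when its score is at or above the threshold.
        far = sum(s >= threshold for s in nontargets) / len(nontargets)
        frr = sum(s < threshold for s in targets) / len(targets)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

In the toy trial list, one of three target trials scores below one of three non-target trials, so both error rates meet at one third.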
The training data used for all experiments was the VoxCeleb 2 development set [Chung2018], which features 5994 unique speakers. This was augmented in the standard Kaldi (https://kaldi-asr.org/) fashion with noise, music, babble and reverberation. The original x-vector architecture was used with very little modification, using Leaky ReLU instead of ReLU, with MFCC features as inputs and 512-dimensional embeddings. The main difference between this implementation and that of Snyder et al [Snyder2018] was the use of the CosFace [Wang2018] angular penalty loss function instead of a traditional cross-entropy loss. The classification transform was also applied directly to the embedding layer, unlike the original, which has an additional hidden layer between the embedding layer and the classification layer. This means that the simplified notation for the classifier following from Equation 3 is an accurate representation of our model. All pairs of embeddings were normalized and scored using cosine distance.
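The scoring step can be illustrated with a short sketch. The function names are hypothetical; the score returned is the cosine similarity, where a higher value indicates a more likely same-speaker pair, and the corresponding cosine distance is one minus this value.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def cosine_score(embedding_a, embedding_b):
    """Cosine similarity between two embeddings after length normalization."""
    a, b = l2_normalize(embedding_a), l2_normalize(embedding_b)
    return sum(x * y for x, y in zip(a, b))

same_direction = cosine_score([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_score([1.0, 0.0], [0.0, 3.0])
```

Normalization makes the score depend only on the direction of each embedding, not on its magnitude.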
A batch size of 500 was used, with each example having 350 frames. Each batch had the same number of unique speakers as examples. Models were trained for 120,000 iterations, using SGD with a learning rate of 0.2 and momentum 0.5. The learning rate was halved at 60,000, 80,000, 90,000, and 110,000 steps. For DropAdapt fine-tuning, the learning rate was chosen to be the same as it was at the end of training the original model, and all the enrolment utterances were used to calculate the average class probabilities.
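The step schedule described above can be expressed as a small function. This is a sketch with a hypothetical function name, and it assumes the halving takes effect at the listed step itself rather than on the following step.

```python
def learning_rate(step, base_lr=0.2, milestones=(60_000, 80_000, 90_000, 110_000)):
    """Halve the base learning rate once for every milestone already reached."""
    halvings = sum(step >= m for m in milestones)
    return base_lr * 0.5 ** halvings

initial_lr = learning_rate(0)
final_lr = learning_rate(119_999)
```

Four halvings of 0.2 give a final rate of 0.0125, which under this reading is also the rate used for DropAdapt fine-tuning.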
3.2 DropClass Experiments
Our initial experiments investigated favourable settings of the two DropClass hyperparameters, the dropping interval and the number of classes to drop. The results are shown in Figure 3, where the number of classes to drop was held fixed and the interval was varied between 50 and 4,000 iterations, the latter being slightly over one epoch’s worth of data with the chosen batch size. Improvements over the baseline are found more reliably at shorter intervals, with consistent improvements at the shortest intervals for both VoxCeleb and SITW (dev). This is perhaps unsurprising: a motivating factor for this technique was to train the network on many different permutations of classes for robustness across a variety of tasks, and with a training budget of 120,000 iterations per model, a long interval yields relatively few permutations over the course of training. As the interval increases, so may the risk of catastrophic forgetting [French1999, Goodfellow2013], an issue in which networks trained on a new task begin to degrade on the task on which they were previously trained.
Taking the best-performing interval from the previous experiment for each dataset, the number of classes to drop was then varied from 1,000 to 5,000, as shown in Figure 4. For nearly all settings of the number of dropped classes, DropClass improved performance over the baseline on all datasets. Dropping approximately half the classes, 3,000, appeared to produce the best performance, although a more thorough exploration with different training data is likely required to ascertain whether any heuristic exists for the selection of this value. However, the previous experiments show that for a suitably short interval, DropClass can yield improvements over the baseline.
It is however important to note that a crucial component of this method is the use of the CosFace [Wang2018] angular penalty loss, with Table 1 showing a comparison of the effect that changing the loss function had on the improvement that DropClass produced on VoxCeleb. A more in-depth analysis on how each loss function changes with the permutations of each subset of classes is required.
3.3 DropAdapt Experiments
Table 2 displays the relative improvement in EER from utilizing the DropAdapt and DropAdapt-Combine methods, using either the enrolment speakers from VoxCeleb 1 or SITW (dev) to choose which speakers to drop. The starting point was a standard classification-trained baseline. Models were trained on a budget of 30,000 iterations, and one configuration of the dropping interval and the number of classes to drop was tested. Also compared were the following control experiments: the baseline trained additionally for the same number of iterations as DropAdapt; Drop-Random, which drops random classes permanently, ignoring the average class probability score; and Drop Only Data, which removes the low probability classes from the training data but does not remove the corresponding rows of the final weight matrix, bypassing the use of the reduced weight matrix in Equation 4.
Compared to the baseline and control experiments, both DropAdapt and DropAdapt-Combine show strong performance gains on VoxCeleb. The resulting EER on VoxCeleb is particularly notable when compared to other works which use similar or larger network architectures and more training data [Okabe2018, Xie2019]. The improvements over the baseline on SITW, however, are more modest, with DropClass-trained models and ‘Drop Only Data’ outperforming the DropAdapt models.
An interesting observation is that dropping only the data improved performance on VoxCeleb, but not as much as the DropAdapt methods. As discussed in section 2.2, DropAdapt can be viewed as a form of corrective oversampling of targeted classes, with oversampling techniques having been shown to improve performance in imbalanced data scenarios [Huang, Khan2018]. From this, we can see that for the within-domain data, some of the benefit of DropAdapt is gained from fine-tuning via oversampling alone, but this benefit is increased further by also dropping the classes from the output layer. Conversely, for the out-of-domain SITW dataset, dropping only the classes from the data performed the best. We hypothesize that the reduced effectiveness of DropAdapt in this case may be due to the technique having to adapt not only to a new speaker distribution, but also to a new domain. Further exploration combining DropAdapt with traditional domain adaptation techniques is left for future work.
In addition, more configurations of the dropping interval and the number of classes to drop could be explored; it may be possible, for example, that the iterative dropping of classes is not necessary and that the initial probability estimate is sufficient. Furthermore, the most obvious extension left for future work is to use DropClass and DropAdapt in conjunction, as each has been shown to provide performance increases independently.
Following up on the hypothesis presented in section 2.2 that an imbalanced distribution of average class probabilities on the test set may indicate train-test mismatch and thus incur a performance loss, Figure 5 shows the EER and the KL divergence between the average class probabilities on the VoxCeleb test set and the uniform distribution as the DropAdapt-Combine model is trained. As the figure shows, while the EER decreases, the distribution of average class probabilities also gets closer to uniform. Whilst there appears to be a correlation, the two observations are likely not strongly linked, in that the relationship can easily be broken by freezing the embedding extractor and training only the final affine matrix to produce more favourable class weightings. However, in the case of DropAdapt, the decreasing KL divergence may indicate that a favourable change in the extracted representations is occurring. This could be useful as a stopping criterion for cases in which the adaptation data has no labels at all.
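The divergence tracked here can be computed as in the sketch below. The direction of the divergence is an assumption: the code measures KL(p || u) from the average class probability vector p to the uniform distribution u, and the function name is hypothetical.

```python
import math

def kl_to_uniform(probabilities):
    """KL(p || u) between a probability vector p and the uniform distribution u."""
    n = len(probabilities)
    # Each term is p_i * log(p_i / (1/n)); zero-probability entries contribute 0.
    return sum(p * math.log(p * n) for p in probabilities if p > 0)

uniform_case = kl_to_uniform([0.25, 0.25, 0.25, 0.25])
peaked_case = kl_to_uniform([1.0, 0.0, 0.0, 0.0])
```

The value is zero for a perfectly uniform vector and grows to the logarithm of the number of classes when all probability mass sits on a single class, so a falling value indicates a less skewed prediction.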
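Table 2: EER and relative improvement in EER over the baseline, grouped by the dataset whose enrolment speakers were used to choose which classes to drop.

|Adaptation set|Method|EER|Relative improvement|
|---|---|---|---|
|VoxCeleb 1|Baseline (More iterations)|3.06%|-0.7%|
|VoxCeleb 1|Drop-Random (=500, =5000)|3.08%|-1.3%|
|VoxCeleb 1|Drop Only Data (=500, =5000)|2.86%|5.9%|
|VoxCeleb 1|DropAdapt (=500, =5000)|2.68%|11.8%|
|VoxCeleb 1|DropAdapt-C (=500, =5000)|2.64%|13.2%|
|SITW (dev)|Baseline (More iterations)|3.61%|-1.7%|
|SITW (dev)|Drop-Random (=500, =5000)|3.73%|-5.1%|
|SITW (dev)|Drop Only Data (=500, =5000)|3.31%|6.7%|
|SITW (dev)|DropAdapt (=500, =5000)|3.47%|2.3%|
|SITW (dev)|DropAdapt-C (=500, =5000)|3.39%|4.5%|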
In this work we presented the DropClass and DropAdapt methods for training and fine-tuning deep speaker embeddings. Both methods are based around the notion of dropping classes from the final classification output layer while also withholding examples belonging to those same classes. Drawing inspiration from Dropout and meta-learning, DropClass is a method that drops classes randomly and periodically throughout training such that a model is trained on a large number of different classification objectives for subsets of the training classes as opposed to classifying on the full set of classes. We show that in conjunction with the CosFace [Wang2018] loss function, DropClass can improve verification performance on the VoxCeleb and SITW core-core tasks.
We present the mismatch in class imbalance between train and test as a potential reason for reduced performance in verification, and propose DropAdapt as a means of alleviating this. DropAdapt is a method which can adapt a trained model to a target dataset with unknown speakers in an unsupervised manner. This is achieved by calculating the average predicted probability of each training class with the adaptation data as input. From these predictions, the model is fine-tuned by dropping the low probability classes and training for more iterations, focusing on the classes which the model has predicted to be present in the adaptation dataset. This is not unlike traditional oversampling techniques. Applying DropAdapt to VoxCeleb leads to a large improvement over the baseline, with DropAdapt also outperforming simply oversampling the same classes, suggesting it may be an effective strategy for adapting to a class distribution different from that seen during training. We also show empirically that as the class distribution mismatch is corrected during DropAdapt, the verification performance also increases.