Research on speaker recognition has a long history and has received increasing attention in recent years. Large-scale datasets for speaker recognition such as VoxCeleb [Nagrani17, Chung18a] and Speakers in the Wild [McLaren16] have become freely available, facilitating fast progress in the field.
Speaker recognition can be categorised into closed-set and open-set settings. In the closed-set setting, all testing identities are predefined in the training set, so the task can be addressed as a classification problem. In the open-set setting, the testing identities are not seen during training, which is closer to practice. This is a metric learning problem in which voices must be mapped to a discriminative embedding space. The focus of this work, like that of most others, is on the latter problem.
Pioneering work on speaker recognition using deep neural networks has learnt speaker embeddings via the classification loss [Nagrani17, snyder2017deep, snyder2018x]. Since then, the prevailing method has been to use softmax classifiers to train the embeddings [ravanelli2018speaker, okabe2018attentive, snyder2019speaker]. While the softmax loss can learn separable embeddings, these are not discriminative enough, since the loss is not explicitly designed to optimise embedding similarity. Therefore, softmax-trained models have often been combined with PLDA [Ioffe06] back-ends to generate scoring functions [snyder2018x, ramoji2020pairwise].
This weakness has been addressed by [liu2017sphereface], who proposed angular softmax (A-Softmax), in which cosine similarity is used as the logit input to the softmax layer, and a number of works have demonstrated its superiority over vanilla softmax in speaker recognition [ravanelli2018speaker, okabe2018attentive, snyder2019speaker, villalba2019state, snyder2019jhu]. Additive margin variants, AM-Softmax [wang2018additive, wang2018cosface] and AAM-Softmax [deng2019arcface], have been proposed to increase inter-class variance by introducing a cosine margin penalty to the target logit, and these have been very popular due to their ease of implementation and good performance [Xie19a, hajibabaei2018unified, liu2019large, garcia2019x, zeinali2019but, luu2019channel, luu2020dropclass, xiang2019margin]. However, training with AM-Softmax and AAM-Softmax has proven to be challenging, since they are sensitive to the values of scale and margin in the loss function.
Metric learning objectives present strong alternatives to the prevailing classification-based methods by learning embeddings directly. Since open-set speaker recognition is essentially a metric learning problem, the key is to learn features that have small intra-class and large inter-class distances. Contrastive loss [Chopra05] and triplet loss [Schroff15] have demonstrated promising performance on speaker recognition [zhang2018text, rahman2018attention] by optimising the distance metrics directly, but these methods require careful pair or triplet selection, which can be time-consuming and to which performance is sensitive.
Of closest relevance to our work are prototypical networks [snell2017prototypical], which learn a metric space in which open-set classification can be performed by computing distances to prototype representations of each class, with a training procedure that mimics the test scenario. The use of multiple negatives helps to stabilise learning, since loss functions can enforce that an embedding is far from all negatives in a batch, rather than from one particular negative as in the case of triplet loss. [wang2019centroid, anand2019few] have adopted the prototypical framework for speaker recognition. Generalised end-to-end loss [wan2018generalized], originally proposed for speaker recognition, is also closely related to this setup.
Comparing different loss functions from prior works can be challenging and unreliable, since speaker recognition systems can vary widely in their design. Popular trunk architectures include TDNN-based systems such as x-vector [snyder2018x] and its deeper counterparts [snyder2019speaker], as well as network architectures from the computer vision community such as the ResNet [He16]. A range of encoders have been proposed to aggregate frame-level information into utterance-level embeddings, from simple averaging [Nagrani17] to statistical pooling [snyder2017deep, okabe2018attentive] and dictionary-based encodings [Xie19a, cai2018exploring]. [snyder2018x] have shown that data augmentation can significantly boost speaker recognition performance, but the augmentation methods can range from adding noise [snyder2015musan] to room impulse response (RIR) simulation [allen1979image].
Therefore, in order to directly compare a range of loss functions, we conduct over 10,000 GPU-hours of careful experiments while keeping other training details constant. Contrary to popular belief, we demonstrate that networks trained with the vanilla triplet loss show competitive performance compared to most AM-Softmax and AAM-Softmax trained networks, and that those trained with our proposed angular objective outperform all comparable methods.
The experiments in this paper can be reproduced with the PyTorch trainer that is released with this paper.
2 Training functions
Table 1: Equal Error Rates (EER, %) on the VoxCeleb1 test set. We report the mean and standard deviation of the repeated experiments. CHNM: Curriculum Hard Negative Mining.
This section describes the loss functions used in our experiments and proposes a new angular variant of the prototypical loss.
2.1 Classification objectives
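The VoxCeleb2 development set contains $C = 5{,}994$ speakers or classes. During training, each mini-batch contains $N$ utterances, each from a different speaker, whose embeddings are $x_i$ and the corresponding speaker labels are $y_i$, where $1 \le i \le N$ and $1 \le y_i \le C$.
Softmax. The softmax loss consists of a softmax function followed by a multi-class cross-entropy loss. It is formulated as:
$$L_S = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{\top}x_i + b_{y_i}}}{\sum_{j=1}^{C} e^{W_j^{\top}x_i + b_j}} \tag{1}$$
where $W$ and $b$ are the weights and bias of the last layer of the trunk architecture, respectively. This loss function only penalises classification error, and does not explicitly enforce intra-class compactness and inter-class separation.
AM-Softmax (CosFace). By normalising the weights and the input vectors, the softmax loss can be reformulated such that the posterior probability relies only on the cosine of the angle between the weights and the input vectors. This loss function, termed by the authors Normalised Softmax Loss (NSL), is formulated as:
$$L_N = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\cos(\theta_{y_i,i})}}{\sum_{j=1}^{C} e^{\cos(\theta_{j,i})}} \tag{2}$$
where $\cos(\theta_{j,i})$ is the dot product of the normalised vectors $W_j$ and $x_i$.
However, embeddings learned by the NSL are not sufficiently discriminative because the NSL only penalises classification error. In order to mitigate this problem, a cosine margin $m$ is incorporated into the equation:
$$L_C = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\cos(\theta_{y_i,i}) - m)}}{e^{s(\cos(\theta_{y_i,i}) - m)} + \sum_{j \neq y_i} e^{s\cos(\theta_{j,i})}} \tag{3}$$
where $s$ is a fixed scale factor that prevents the gradient from becoming too small during training.
AAM-Softmax (ArcFace). This is equivalent to CosFace except that there is an additive angular margin penalty $m$ between $x_i$ and $W_{y_i}$. The additive angular margin penalty is equal to the geodesic distance margin penalty in the normalised hypersphere.
$$L_A = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i,i} + m)}}{e^{s\cos(\theta_{y_i,i} + m)} + \sum_{j \neq y_i} e^{s\cos(\theta_{j,i})}} \tag{4}$$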
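The two margin-based objectives differ only in how the target logit is modified, which a small dependency-free Python sketch can make concrete for a single embedding. The function name `margin_softmax_loss` and the default values of `s` and `m` are illustrative assumptions, not the settings or the code used in our experiments; a practical implementation would operate on batches and would also need to handle the case where the angle plus the margin exceeds $\pi$.

```python
import math

def normalise(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def cosines(x, weights):
    """Cosine between embedding x and every class weight vector."""
    xn = normalise(x)
    return [sum(a * b for a, b in zip(xn, normalise(w))) for w in weights]

def margin_softmax_loss(x, y, weights, s=30.0, m=0.2, angular=False):
    """Cross-entropy on scaled cosine logits with a margin on the target class.

    angular=False: AM-Softmax, target logit is s * (cos(theta) - m)
    angular=True:  AAM-Softmax, target logit is s * cos(theta + m)
    The defaults for s and m are placeholders chosen for illustration.
    """
    logits = []
    for j, c in enumerate(cosines(x, weights)):
        if j == y:
            if angular:
                theta = math.acos(max(-1.0, min(1.0, c)))  # clamp for safety
                c = math.cos(theta + m)
            else:
                c = c - m
        logits.append(s * c)
    top = max(logits)  # subtract the maximum for numerical stability
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[y]
```

With `m=0.0` both variants reduce to the scaled normalised softmax loss, and any positive margin makes the loss for the same embedding strictly larger, which is the pressure that pushes classes apart.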
2.2 Metric learning objectives
For metric learning objectives, each mini-batch contains $M$ utterances from each of $N$ different speakers, whose embeddings are $x_{j,i}$ where $1 \le j \le N$ and $1 \le i \le M$.
Triplet. Triplet loss minimises the distance between an anchor and a positive (same identity), and maximises the distance between an anchor and a negative (different identity).
$$L_T = \frac{1}{N}\sum_{j=1}^{N}\max\left(0,\ \|x_{j,1} - x_{j,2}\|_2^2 - \|x_{j,1} - x_{k,2}\|_2^2 + m\right), \quad k \neq j \tag{5}$$
where $m$ is the margin. For our implementation, the negative utterances $x_{k,2}$ are sampled from different speakers within the mini-batch and the sample is selected by the hard negative mining function. This requires $M = 2$ utterances from each speaker.
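As a rough illustration of the triplet objective with in-batch negatives, the sketch below uses two utterances per speaker and, for each anchor, simply takes the closest utterance of any other speaker as the negative. This "always pick the hardest" rule and the margin value are simplifying assumptions for the example, and the function names are hypothetical; they are not the mining function used in our experiments.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def triplet_loss(batch, margin=0.1):
    """batch[j] = (anchor, positive) embeddings for speaker j.

    For each anchor, the negative is the closest second-slot utterance
    belonging to a different speaker in the same batch.
    """
    total = 0.0
    for j, (anchor, positive) in enumerate(batch):
        negatives = [u for k, (_, u) in enumerate(batch) if k != j]
        hardest = min(sq_dist(anchor, n) for n in negatives)
        total += max(0.0, sq_dist(anchor, positive) - hardest + margin)
    return total / len(batch)
```

When speakers are already well separated, every hinge term is zero and the batch contributes no gradient, which is why the choice of negatives matters so much for this loss.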
Prototypical. Each mini-batch of $N$ speakers with $M$ utterances each contains a support set $\mathcal{S}$ and a query set $\mathcal{Q}$. For simplicity, we will assume that the query is the $M$-th utterance from every speaker. Then the prototype (or centroid) is:
$$c_j = \frac{1}{M-1}\sum_{m=1}^{M-1} x_{j,m} \tag{6}$$
Squared Euclidean distance is used as the distance metric, as proposed by the original paper:
$$S_{j,k} = \|x_{j,M} - c_k\|_2^2 \tag{7}$$
During training, each query example is classified against $N$ speakers based on a softmax over distances to each speaker prototype:
$$L_P = -\frac{1}{N}\sum_{j=1}^{N}\log\frac{e^{-S_{j,j}}}{\sum_{k=1}^{N} e^{-S_{j,k}}} \tag{8}$$
Here, $S_{j,j}$ is the squared Euclidean distance between the query and the prototype of the same speaker from the support set. The softmax function effectively serves the purpose of hard negative mining, since the hardest negative would most affect the gradients. The value of $M$ is typically chosen to match the expected situation at test time, e.g. $M = 5 + 1$ for 5-shot learning, so that the prototype is composed of five different utterances. In this way, the task in training exactly matches the task in the test scenario.
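The prototypical objective described above can be sketched in a few lines of plain Python. The convention that the last utterance of each speaker is the query follows the text; the function names and the list-of-lists batch layout are assumptions made for this illustration only.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def mean_vec(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def prototypical_loss(batch):
    """batch[j] is a list of M embeddings for speaker j; the last is the query."""
    prototypes = [mean_vec(utts[:-1]) for utts in batch]  # support set only
    total = 0.0
    for j, utts in enumerate(batch):
        query = utts[-1]
        # A smaller distance must give a larger logit, hence the minus sign.
        logits = [-sq_dist(query, c) for c in prototypes]
        top = max(logits)
        log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_sum - logits[j]
    return total / len(batch)
```

Because every prototype in the batch enters the softmax denominator, the closest wrong prototype dominates the loss for a given query without any explicit mining step.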
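Generalised end-to-end (GE2E). In GE2E training, every utterance $x_{j,i}$ in the batch except the query itself is used to form centroids. As a result, the centroid that is of the same class as the query is computed from one fewer utterance than the centroids of other classes. They are defined as:
$$c_j = \frac{1}{M}\sum_{m=1}^{M} x_{j,m}, \qquad c_j^{(-i)} = \frac{1}{M-1}\sum_{\substack{m=1 \\ m \neq i}}^{M} x_{j,m} \tag{9}$$
The similarity matrix is defined as the scaled cosine similarity between the embeddings and all centroids:
$$S_{j,i,k} = \begin{cases} w \cdot \cos(x_{j,i}, c_j^{(-i)}) + b & \text{if } k = j \\ w \cdot \cos(x_{j,i}, c_k) + b & \text{otherwise} \end{cases} \tag{10}$$
where $w$ and $b$ are learnable scale and bias. The final GE2E loss is defined as:
$$L_G = -\frac{1}{N}\sum_{j,i}\log\frac{e^{S_{j,i,j}}}{\sum_{k=1}^{N} e^{S_{j,i,k}}} \tag{11}$$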
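The leave-one-out centroid is the part of GE2E that is easiest to get wrong, so a minimal Python sketch may help. Here the scale `w` and bias `b` are fixed to arbitrary illustrative values, whereas in training they are learnable; the normalisation by the number of speakers and all function names are assumptions of this sketch.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

def mean_vec(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def ge2e_loss(batch, w=10.0, b=-5.0):
    """batch[j] is a list of M embeddings for speaker j (M >= 2)."""
    n = len(batch)
    centroids = [mean_vec(utts) for utts in batch]
    total = 0.0
    for j, utts in enumerate(batch):
        for i, x in enumerate(utts):
            # The own-class centroid excludes the query utterance itself.
            own = mean_vec(utts[:i] + utts[i + 1:])
            logits = [w * cosine(x, own if k == j else centroids[k]) + b
                      for k in range(n)]
            top = max(logits)
            log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
            total += log_sum - logits[j]
    return total / n
```

Note that the own-class centroid averages M - 1 utterances while every other centroid averages M, which is the asymmetry discussed in the text.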
Angular Prototypical. The angular prototypical loss uses the same batch formation as the original prototypical loss, reserving one utterance from every class as the query. This has advantages over the GE2E-like formation, since every centroid is made from the same number of utterances in the support set, and it is therefore possible to exactly mimic the test scenario during training.
We use a cosine-based similarity metric with learnable scale $w$ and bias $b$, as in the GE2E loss.
$$S_{j,k} = w \cdot \cos(x_{j,M}, c_k) + b \tag{12}$$
Using the angular loss function introduces scale invariance, improving the robustness of the objective against feature variance and demonstrating more stable convergence [wang2017deep].
The resultant objective has the same form as the original prototypical loss, Equation 8, with the cosine-based similarity taking the place of the negative squared Euclidean distance.
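To illustrate the scale invariance claimed for the angular variant, the following sketch computes the loss with cosine similarity to support-set prototypes. As before, `w` and `b` are fixed to arbitrary values here although they are learnable in training, and the function names and batch layout are assumptions of the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

def mean_vec(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def angular_prototypical_loss(batch, w=10.0, b=-5.0):
    """batch[j] is a list of M embeddings for speaker j; the last is the query."""
    prototypes = [mean_vec(utts[:-1]) for utts in batch]  # same size for all
    total = 0.0
    for j, utts in enumerate(batch):
        logits = [w * cosine(utts[-1], c) + b for c in prototypes]
        top = max(logits)
        log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_sum - logits[j]
    return total / len(batch)
```

Multiplying every embedding in the batch by the same positive constant leaves this loss unchanged, whereas the squared Euclidean distances of the original prototypical loss would grow with the square of that constant.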
3 Experiments
In this section we describe the experimental setup, which is identical across all objectives described in Section 2.
3.1 Input representations
During training, we use a fixed-length 2-second temporal segment, extracted randomly from each utterance. Spectrograms are extracted with a Hamming window of width 25 ms and step 10 ms. For the ResNet, the 257-dimensional raw spectrograms are used as the input to the network. For the VGG network, 40-dimensional Mel filterbanks are used as the input. Mean and variance normalisation (MVN) is performed by applying instance normalisation [ulyanov2016instance] to the network input. Since the VoxCeleb dataset consists mostly of continuous speech, voice activity detection (VAD) is not used in training.
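The input dimensions quoted above follow from simple framing arithmetic, sketched below. The 16 kHz sampling rate and the 512-point FFT are assumptions that are not stated in this section (the latter is merely consistent with 257 frequency bins), and the frame count assumes no padding at the segment edges.

```python
def num_frames(duration_s, sample_rate=16000, win_ms=25, hop_ms=10):
    """Number of full analysis windows that fit in the segment (no padding)."""
    n_samples = int(duration_s * sample_rate)
    win = sample_rate * win_ms // 1000   # window length in samples
    hop = sample_rate * hop_ms // 1000   # step in samples
    return 1 + (n_samples - win) // hop

def num_bins(n_fft):
    """Frequency bins of a one-sided spectrum for an n_fft-point FFT."""
    return n_fft // 2 + 1

print(num_frames(2.0))   # 198 frames for a 2-second segment
print(num_bins(512))     # 257
```

Under these assumptions a 2-second training segment yields a spectrogram of roughly 257 by 198 for the ResNet input; libraries that pad the signal will report a slightly different number of frames.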
3.2 Trunk architecture
Experiments are performed on two different trunk architectures described below. These are identical to the two models used and described in [chung2019delving].
VGG-M-40. The VGG-M model has been proposed for image classification [Chatfield14] and adapted for speaker recognition by [Nagrani17]. The network is known for high efficiency and good classification performance. VGG-M-40 is a modification of the network proposed by [Nagrani17] to take 40-dimensional filterbanks as inputs instead of the 513-dimensional spectrogram. The temporal average pooling (TAP) layer takes the mean of the features along the time domain in order to produce an utterance-level representation.
Thin ResNet-34. Residual networks [He16] are widely used in image recognition and have recently been applied to speaker recognition [cai2018exploring, Chung18a, Xie19a]. Thin ResNet-34 is the same as the original ResNet with 34 layers, except that it uses only one-quarter of the channels in each residual block in order to reduce computational cost. The model has only 1.4 million parameters, compared to 22 million for the standard ResNet-34. Self-attentive pooling (SAP) [cai2018exploring] is used to aggregate frame-level features into an utterance-level representation while paying attention to the frames that are more informative for utterance-level speaker recognition.
3.3 Implementation details
Datasets. The network is trained on the development set of VoxCeleb2 [Chung18a] and evaluated on the test set of VoxCeleb1 [Nagrani17]. Note that the development set of VoxCeleb2 is completely disjoint from the VoxCeleb1 dataset (i.e. there are no speakers in common).
Training. Our implementation is based on the PyTorch framework [paszke2019pytorch] and trained on the NAVER Smart Machine Learning (NSML) platform [sung2017nsml]. The models are trained using an NVIDIA V100 GPU with 32GB memory for epochs. For each epoch, we sample a maximum of 100 utterances from each of the 5,994 identities. We use the Adam optimizer with an initial learning rate of decreasing by every 10 epochs. For metric learning objectives, we use the largest batch size that fits on a GPU. For classification objectives, we use a fixed batch size of 200. The training takes approximately one day for the VGG-M-40 model and five days for the Thin ResNet-34 model.
All experiments were repeated independently three times in order to minimise the effect of random initialisation, and we report mean and standard deviation of the experiments.
Data augmentation. No data augmentation is performed during training, apart from the random sampling.
Curriculum learning. The AAM-Softmax loss function demonstrates unstable convergence from random initialisation with larger values of such as . Therefore, we start training the model with and increase it to after 100 epochs. This strategy is labelled Curriculum in Table 1.
Similarly, the triplet loss can cause models to diverge if the triplets are too difficult early in the training. We only enable hard negative mining after 100 epochs, at which point the network only sees the most difficult 1% of the negatives.
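One possible reading of this curriculum is sketched below: before the switch-over epoch a negative is drawn uniformly from all candidates, and afterwards only from the closest 1% of them. The function names, the zero-based epoch convention and the uniform draw within the retained candidates are assumptions of the sketch rather than details given in the text.

```python
import random

def mining_enabled(epoch, start_epoch=100):
    """Hard negative mining is switched on once start_epoch is reached."""
    return epoch >= start_epoch

def pick_negative(distances, epoch, top_fraction=0.01, rng=random):
    """Return the index of the negative to use for one anchor.

    distances[i] is the distance from the anchor to candidate negative i;
    a smaller distance means a harder negative.
    """
    candidates = list(range(len(distances)))
    if mining_enabled(epoch):
        keep = max(1, int(len(candidates) * top_fraction))
        candidates = sorted(candidates, key=lambda i: distances[i])[:keep]
    return rng.choice(candidates)
```

With 200 candidate negatives, for example, only the two closest remain eligible once mining is enabled.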
Evaluation protocol. The trained networks are evaluated on the VoxCeleb1 test set. We sample ten 4-second temporal crops at regular intervals from each test segment, and compute the distances between all possible combinations ($10 \times 10 = 100$) from every pair of segments. The mean of the 100 distances is used as the score. This protocol is in line with that used by [Chung18a, chung2019delving].
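The scoring step of this protocol can be sketched as follows. The use of Euclidean distance, the rounding of crop positions and the function names are assumptions of the example; the sketch also assumes that each test segment is at least as long as one crop.

```python
import math

def crop_starts(n_frames, crop_len, n_crops=10):
    """Start indices of n_crops crops spaced evenly across the segment."""
    last = n_frames - crop_len  # latest possible start index
    return [round(i * last / (n_crops - 1)) for i in range(n_crops)]

def pair_score(emb_a, emb_b):
    """Mean distance over all crop pairs from two segments.

    emb_a and emb_b each hold one embedding per crop, so ten crops per
    segment give 10 x 10 = 100 distances.
    """
    dists = [math.dist(a, b) for a in emb_a for b in emb_b]
    return sum(dists) / len(dists)
```

A lower score indicates that the two segments are more likely to come from the same speaker; thresholding this score at different operating points yields the error rates from which the EER is computed.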
Results. The results are given in Table 1. It can be seen that the performance of networks trained with the AM-Softmax and AAM-Softmax loss functions can be very sensitive to the values of margin and scale set during training. We iterate over many combinations of $m$ and $s$ to find the optimal value. The model trained with the most common setting (AM-Softmax with and ) is outperformed by the vanilla triplet loss.
Generalised end-to-end and prototypical losses show improvements over the triplet loss by using multiple negatives in training. The prototypical networks perform best when the value of $M$ matches the test scenario, removing the necessity for hyperparameter optimisation. The performance of the model trained with the proposed angular objective exceeds that of all classification-based and metric learning methods.
There are a substantial number of recent works on the VoxCeleb2 dataset, but we do not compare to these in the table, since the goal of this work is to compare the performance of different loss functions under identical conditions. However, we are unaware of any work that outperforms our method with a similar number of network parameters.
Batch size. The effect of batch size on various loss functions is shown in Table 2. We observe that a bigger batch size has a positive effect on performance for metric learning methods, which can be explained by the ability to sample harder negatives within the batch. We make no such observation for the network trained with classification loss.
In this paper, we have presented a case for metric learning in speaker recognition. Our extensive experiments indicate that the GE2E and prototypical networks show superior performance to the popular classification-based methods. We also propose an angular variant of the prototypical networks that outperforms all existing training functions. Finally, we release a flexible PyTorch trainer for large-scale speaker recognition that can be used to facilitate further research in the field.