Learning useful representations from high-dimensional data is one of the main goals of modern machine learning. However, doing so is generally a side effect of solving a pre-defined task, e.g., while learning the decision surface in a classification problem, inner layers of artificial neural networks are shown to make salient the cues of the input data that are discriminable. Moreover, in unsupervised settings, bottleneck layers of autoencoders as well as approximate posteriors from variational autoencoders have been shown to embed relevant properties of input data which can be leveraged in downstream tasks. Rather than employing a neural network to solve some task and hoping the learned features are useful, approaches such as Siamese networks (bromley1994signature), which belong to a set of approaches commonly referred to as Metric Learning, have been introduced with the goal of explicitly inducing features holding desirable properties such as class separability. In this setting, an encoder is trained so as to minimize or maximize a distance measured across pairs of encoded examples, depending on whether the examples within each pair belong to the same class or not, provided that class labels are available. Follow-up work leveraged this idea for several applications (hadsell2006dimensionality; hoffer2015deep), which include, for instance, the verification problem in biometrics, as is the case of FaceNet (schroff2015facenet) and Deep-Speaker (li2017deep), which are used for face and speaker recognition, respectively. However, as pointed out in recent work (schroff2015facenet; shi2016embedding; wu2017sampling; li2017deep; zhang2018text), careful selection of training pairs is crucial to ensure a reasonable sample complexity during training, given that most triplets of examples quickly reach the condition in which distances measured between pairs from the same class are smaller than those of pairs from different classes. As such, developing efficient strategies for harvesting negative pairs with small distances throughout training becomes essential.
In this contribution, we are concerned with the metric learning setting briefly described above and, more specifically, with its application to the verification problem, i.e., that of comparing data pairs and determining whether they belong to the same class. The verification problem arises in applications where comparisons of two small samples are required, such as face, fingerprint, and voice verification (reynolds2002overview) and related applications (zhu2016deep; wu2017sampling). At test time, inference is often performed to answer two types of questions: (i) Do two given examples belong to the same class? and (ii) Does a test example belong to a specific claimed class? In both cases, test examples might belong to classes never presented to the model during training. Current verification approaches usually comprise several components trained in a greedy manner (kenny2013plda; snyder2018x), and an end-to-end approach is still lacking.
Euclidean spaces will not, in general, be suitable for representing every type of structure expressed in the data (e.g., asymmetry (pitis2020an) or hierarchy (nickel2017poincare)). To avoid the need to select an adequate distance for every new problem, as well as to deal with the training difficulties mentioned previously, we propose to augment the metric learning framework and jointly train an encoder (which embeds raw data into a lower dimensional space) and a (pseudo) distance model tailored to the problem of interest. An end-to-end approach for verification is then defined by employing such pseudo-distance to compute similarity scores. Both models together, parametrized by neural networks, define a (pseudo) metric space in which inference can be performed efficiently since semantic properties of the data (e.g., discrepancies across classes) are now encoded by scores. While doing so, we found that several interpretations arise for the learned pseudo-distance: it can be interpreted as a likelihood ratio in a Neyman-Pearson hypothesis test, as well as an approximate divergence measure between the joint distributions of positive (same classes) and negative (different classes) pairs of examples. Moreover, even though we do not enforce models to satisfy the properties of an actual metric (symmetry, identity of indiscernibles, and the triangle inequality), we empirically observe such properties to appear.
Our contributions can be summarized as follows:
We propose an augmented metric learning framework where an encoder and a (pseudo) distance are trained jointly and define a (pseudo) metric space where inference can be done efficiently for verification.
We show that the optimal distance model for any fixed encoder yields the likelihood ratio for a Neyman-Pearson hypothesis test, and it further induces a high Jensen-Shannon divergence between the joint distributions of positive and negative pairs.
The introduced setting is trained in an end-to-end fashion, and inference can be performed with a single forward pass, greatly simplifying current verification pipelines which involve several sub-components.
Evaluation on large scale verification tasks provides empirical evidence of the effectiveness in directly using outputs of the learned pseudo-distance for inference, outperforming commonly used downstream classifiers.
The remainder of this paper is organized as follows: metric learning and the verification problem are discussed in Section 2. The proposed method is presented in Section 3 along with our main guarantees, while empirical evaluation is presented in Section 4. Discussion and final remarks as well as future directions are presented in Section 5.
2 Background and related work
2.1 Metric Learning and Distance Metric Learning
Being able to efficiently assess similarity across samples from data under analysis is a long-standing problem within machine learning. Algorithms such as K-means, nearest-neighbors classifiers, and kernel methods generally rely on the selection of some similarity or distance measure able to encode semantic relationships present in high-dimensional data into real scores. Under this view, approaches commonly referred to as Distance Metric Learning, introduced originally by xing2003distance, try to learn a so-called Mahalanobis distance, which, given $x, y \in \mathbb{R}^D$, will have the form $\sqrt{(x-y)^\top W (x-y)}$, where $W \in \mathbb{R}^{D \times D}$ is positive semidefinite. Several extensions of that setting were then introduced (globerson2006metric; weinberger2009distance; ying2012distance).
shalev2004online, for instance, proposed an online version of the algorithm in (xing2003distance), while an approach based on support vector machines was introduced in (schultz2004learning) for learning $W$. davis2007information provided an information-theoretic approach to solve for $W$ by minimizing the divergence between Gaussian distributions associated with the learned and the Euclidean distances, further showing such an approach to be equivalent to low-rank kernel learning (kulis2006learning). Similar distances have also been used in other settings, such as similarity scoring for contrastive learning (oord2018representation; tian2019contrastive). Besides the Mahalanobis distance, other forms of distance/similarity have been considered in recent work. In (lanckriet2004learning), for example, a kernel matrix is directly learned, implicitly defining a similarity function. In (pitis2020an), classes of neural networks are proposed to define pseudo-distances which satisfy the triangle inequality while not being necessarily symmetric.
For the particular case of Mahalanobis distance metric learning, one can show that $\sqrt{(x-y)^\top W (x-y)} = \|Lx - Ly\|_2$ for some matrix $L$ with $W = L^\top L$ (shalev2004online), which means that there exists a linear projection of the data after which the Euclidean distance will correspond to the Mahalanobis distance on the original space. chopra2005learning substituted the linear projection by a learned non-linear encoder $\mathcal{E}$ so that $\|\mathcal{E}(x) - \mathcal{E}(y)\|_2$ yields a (non-Mahalanobis) distance measure between raw data points with useful properties. Follow-up work has extended this idea to several applications (schroff2015facenet; shi2016embedding; li2017deep; zhang2018text). One extra variation of $\|Lx - Ly\|_2$, besides the introduction of $\mathcal{E}$, is to replace the Euclidean distance with an alternative better suited for the task of interest. That is the case in (norouzi2012hamming), where the Hamming distance is used over data encoded to a binary space. In (courty2018learning), in turn, the encoder is trained so that Euclidean distances in the encoded space approximate Wasserstein divergences, while nickel2018learning employs a hyperbolic distance which is argued to be suitable for their particular use case.
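The equivalence between a Mahalanobis distance and a Euclidean distance after a linear projection can be checked numerically. The following is a minimal sketch in plain Python, assuming a hypothetical projection matrix `L` and the positive semidefinite matrix `W = L^T L` it induces; the helper names and the numerical values are ours and are not taken from the cited works.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def mahalanobis(x, y, W):
    """sqrt((x - y)^T W (x - y)) for a positive semidefinite W."""
    diff = [a - b for a, b in zip(x, y)]
    Wd = matvec(W, diff)
    return math.sqrt(sum(d * w for d, w in zip(diff, Wd)))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical projection L (2 x 3) and the PSD matrix it induces, W = L^T L.
L = [[1.0, 0.5, 0.0],
     [0.0, 2.0, -1.0]]
W = matmul(transpose(L), L)

x = [1.0, 2.0, 3.0]
y = [0.0, -1.0, 4.0]

d_mahalanobis = mahalanobis(x, y, W)                  # distance in the original space
d_projected = euclidean(matvec(L, x), matvec(L, y))   # Euclidean distance after projecting
```

Both quantities agree up to floating point error, which is the property that motivates replacing the linear map by a learned non-linear encoder.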
Based on the covered literature, one can conclude that there are two different directions aimed at achieving a similar goal: learn to represent the data in a metric space where distances yield efficient inference mechanisms for various tasks. While one corresponds to learning a meaningful distance or similarity from raw data, the other corresponds to, given a fixed distance metric, finding an encoding process yielding the desired metric space. Here, we propose an alternative to perform both these tasks simultaneously, i.e., jointly learn both the encoder and distance. Close to such an approach is the method discussed by garcia2019learning where, similarly to our setting, both encoder and distance are trained, with the main difference being that our method is fully end-to-end, while in their case training happens separately (what the authors refer to as end-to-end requires pre-training an encoder in the metric learning setting with a standard distance). Moreover, training of the distance model in that case is done by imitation learning of cosine similarities.
2.2 The Verification Problem
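Given data instances $x \in \mathcal{X}$ such that each $x$ can be associated to a class label $y \in \mathcal{Y}$ through a labeling function $f: \mathcal{X} \to \mathcal{Y}$, we define a trial $\mathcal{T}$ as a pair of sets of examples $\{X_i, X_j\}$, provided that $f(x) = f(x')$ for all $x, x' \in X_i$ and for all $x, x' \in X_j$, so that we can assign class labels to such sets, defining $f(X_i)$ and $f(X_j)$. The verification problem can thus be viewed as, given a trial $\mathcal{T} = \{X_i, X_j\}$, deciding whether $f(X_i) = f(X_j)$, in which case we refer to $\mathcal{T}$ as a target trial, or $f(X_i) \neq f(X_j)$, in which case the trial is called non-target.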
The verification problem is illustrated in Figure 1. We categorize trials into two types in accordance with practical instances of the verification problem: type I trials are those in which $X_i$ is referred to as the enrollment sample, i.e., a set of data points representing a given class, such as a gallery of face pictures from a given user in an access control application, while $X_j$ corresponds to a single example to be verified against the enrollment gallery. For the type II case, $X_i$ is simply a claim corresponding to the class against which $X_j$ will be verified. Classes corresponding to examples within test trials might have never been presented to the model, and the sets $X_i$ and $X_j$ are typically small.
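To make the notion of a trial concrete, the sketch below enumerates single-example trials from a small labeled set and builds a type I trial from an enrollment set. It is only an illustration: the example identifiers, the `labels` dictionary, and the function names are hypothetical.

```python
from itertools import combinations

def make_pairwise_trials(labels):
    """Enumerate all single-example trials {x_i, x_j} and mark each as target or not.

    `labels` maps a (hypothetical) example id to its class label.
    """
    trials = []
    for i, j in combinations(sorted(labels), 2):
        is_target = labels[i] == labels[j]
        trials.append((i, j, is_target))
    return trials

def make_type1_trial(enrollment_ids, test_id, labels):
    """Type I trial: an enrollment set from one class against a single test example."""
    enrolled_classes = {labels[i] for i in enrollment_ids}
    if len(enrolled_classes) != 1:
        raise ValueError("all enrollment examples must share one class label")
    (enrolled_class,) = enrolled_classes
    return enrollment_ids, test_id, labels[test_id] == enrolled_class

labels = {"a1": "alice", "a2": "alice", "b1": "bob", "c1": "carol"}
trials = make_pairwise_trials(labels)
n_target = sum(1 for _, _, is_target in trials if is_target)
```

With four examples from three classes, most trials are non-target, which mirrors the imbalance typically found in verification test lists.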
Under the Neyman-Pearson approach (neyman1933ix), verification is seen as a hypothesis test, where $H_0$ and $H_1$ correspond to the hypotheses that $\mathcal{T}$ is target or non-target, respectively (jiang2001bayesian). The test is thus performed through the following likelihood ratio (LR):
$$\mathrm{LR} = \frac{p(\mathcal{T} \mid H_0)}{p(\mathcal{T} \mid H_1)}, \tag{1}$$
where $p(\mathcal{T} \mid H_0)$ and $p(\mathcal{T} \mid H_1)$ correspond to models of target and non-target (or impostor) trials. The decision is made by comparing $\mathrm{LR}$ with a threshold $\tau$.
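One can then explicitly approximate $\mathrm{LR}$ through generative approaches (deng2018speech), which is commonly done using Gaussian mixture models. In that case, the denominator is usually defined as a universal background model (GMM-UBM, reynolds2000speaker), meaning that it is trained on data from all available classes, while the numerator is a model fine-tuned on enrollment data so that, for a trial $\mathcal{T} = \{X_i, X_j\}$, $\mathrm{LR}$ will be:
$$\mathrm{LR} = \frac{p_{X_i}(X_j)}{p_{\mathrm{UBM}}(X_j)}, \tag{2}$$
where $p_{X_i}$ denotes the model fine-tuned on the enrollment sample $X_i$ and $p_{\mathrm{UBM}}$ the universal background model.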
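The generative likelihood-ratio test can be sketched as follows. For brevity, single 1-D Gaussians stand in for the Gaussian mixture models, and the model parameters, test points, and threshold are hypothetical values chosen for illustration only.

```python
import math

def gaussian_logpdf(x, mean, std):
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def log_likelihood_ratio(test_points, target_model, background_model):
    """Sum over test points of log p_target(x) - log p_background(x)."""
    return sum(
        gaussian_logpdf(x, *target_model) - gaussian_logpdf(x, *background_model)
        for x in test_points
    )

def accept(test_points, target_model, background_model, log_threshold=0.0):
    """Accept the claimed class when the log-LR exceeds the log-threshold."""
    return log_likelihood_ratio(test_points, target_model, background_model) > log_threshold

# Hypothetical 1-D models given as (mean, std): a narrow class-specific model
# (playing the role of the enrollment-adapted model) and a broad background model.
target_model = (2.0, 0.5)
background_model = (0.0, 3.0)

genuine = [1.9, 2.2, 2.1]
impostor = [-1.5, -0.7, -2.4]
```

Working with log-likelihoods keeps the ratio numerically stable when several test points are scored jointly; the threshold is then applied in the log domain.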
Alternatively, cumani2013pairwise showed that discriminative settings, i.e., binary classifiers trained on top of data pairs to determine whether they belong to the same class, yielded likelihood ratios useful for verification. In their case, a binary SVM was trained on pairs of i-vectors (dehak2010front) for automatic speaker verification. We build upon such discriminative setting, but with the difference that we learn an encoding process along with the discriminator (here represented as a distance model), and show it to yield likelihood ratios required for verification through contrastive estimation results. This is more general than the result in (cumani2013pairwise), which shows that there exists a generative classifier associated to each discriminator whose likelihood ratio matches the discriminator’s output, requiring such classifier’s assumptions to hold.
We remark that current verification approaches are composed of complex pipelines containing several components (dehak2010front; kenny2013plda; snyder2018x), including a pre-trained data encoder, followed by a downstream classifier, such as probabilistic linear discriminant analysis (PLDA) (ioffe2006probabilistic; prince2007probabilistic), and score normalization (auckenthaler2000score), each contributing practical issues (e.g., cohort selection) to the overall system. This renders both training and testing of such systems difficult. The approach proposed herein is a step towards end-to-end verification, i.e., from data to scores via a single forward pass, thus simplifying inference.
3 Learning pseudo metric spaces
We consider the setting where both an encoding mechanism and some type of similarity or distance across data points are to be learned. Assume $\mathcal{E}$ and $\mathcal{D}$ are deterministic mappings, referred to as encoder and distance model, respectively, both parametrized by neural networks. Together, such entities resemble a metric space, and we will thus refer to the pair as a pseudo metric space. We empirically observed that introducing distance properties in $\mathcal{D}$, i.e., constraining it to be symmetric and enforcing it to satisfy the triangle inequality, did not result in improved performance, yet rendered training unstable. However, since trained models are found to approximately behave as an actual distance, we make use of the analogy, and further provide alternative interpretations of $\mathcal{D}$’s outputs.
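Data samples are such that $x \in \mathbb{R}^D$, and $z = \mathcal{E}(x)$ represents embedded data in $\mathbb{R}^d$. It will usually be the case that $D \gg d$. Once more, each data example can be further assigned to one of $C$ class labels through a labeling function $f$. Moreover, we define positive and negative pairs of examples, denoted by $+$ or $-$ superscripts, such that $x^+ = \{x_i, x_j\} \Rightarrow f(x_i) = f(x_j)$, as well as $x^- = \{x_i, x_j\} \Rightarrow f(x_i) \neq f(x_j)$. The same notation is employed in the embedding space so that $z^+ = \{\mathcal{E}(x_i), \mathcal{E}(x_j)\}$ for $\{x_i, x_j\} = x^+$, and $z^- = \{\mathcal{E}(x_i), \mathcal{E}(x_j)\}$ for $\{x_i, x_j\} = x^-$. We will denote the sets of all possible positive and negative pairs by $\mathcal{X}^+$ and $\mathcal{X}^-$, respectively, and further define a probability distribution $p$ over $\mathbb{R}^D$ which, along with $f$, will yield $p^+$ and $p^-$ over $\mathcal{X}^+$ and $\mathcal{X}^-$. Similarly to the setting in (hjelm2018learning), which introduces a discriminator over pairs of samples, we are interested in $\mathcal{E}$ and $\mathcal{D}$ such that:
$$\mathcal{E}^*, \mathcal{D}^* \in \arg\min_{\mathcal{E}, \mathcal{D}} \mathcal{L}, \quad \mathcal{L} = -\mathbb{E}_{x^+ \sim p^+} \log \mathcal{D} \circ \mathcal{E}(x^+) - \mathbb{E}_{x^- \sim p^-} \log\left(1 - \mathcal{D} \circ \mathcal{E}(x^-)\right), \tag{3}$$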
and $\circ$ indicates composition so that $\mathcal{D} \circ \mathcal{E}(x^+) = \mathcal{D}(z^+)$. Such a problem is separable in the parameters of $\mathcal{E}$ and $\mathcal{D}$, and iterative solution strategies might include either alternate or simultaneous updates. We found the latter to converge faster in terms of wall-clock time, with both approaches reaching similar performance. We thus perform simultaneous updates while training.
The problem stated in (3) corresponds to finding $\mathcal{E}$ and $\mathcal{D}$ which will ensure that semantically close or distant samples, as defined through $f$, will preserve such properties in terms of distance in the new space, while doing so in lower dimension. We stress the fact that class labels define which samples should be close together or far apart, which means that the same underlying data can yield different pseudo-metric spaces if different semantic properties are used to define class labels. For example, if one considers that, for a given set of speech recordings, class labels are equivalent to speaker identities, recordings from the same speaker are expected to be clustered together in the embedding space, while different results can be achieved if class labels are assigned corresponding to spoken language, acoustic conditions, and so on.
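A Monte Carlo estimate of the pairwise objective can be sketched as below, assuming the distance model outputs scores in (0, 1) and that scores for positive and negative pairs have already been computed. The function name, the clipping constant `eps`, and the numerical values are ours.

```python
import math

def pair_loss(pos_scores, neg_scores, eps=1e-12):
    """Estimate -E[log D(z+)] - E[log(1 - D(z-))] from sampled pair scores.

    `pos_scores` / `neg_scores` hold the distance model's outputs in (0, 1)
    for same-class and different-class pairs, respectively.
    """
    pos_term = -sum(math.log(max(s, eps)) for s in pos_scores) / len(pos_scores)
    neg_term = -sum(math.log(max(1.0 - s, eps)) for s in neg_scores) / len(neg_scores)
    return pos_term + neg_term

# Hypothetical scores: a model that separates the two kinds of pairs well,
# and one that outputs 0.5 regardless of the pair.
confident = pair_loss([0.95, 0.90], [0.05, 0.10])
uninformative = pair_loss([0.5, 0.5], [0.5, 0.5])
```

A model that always outputs 0.5 attains a loss of log 4, while a model that separates positive from negative pairs drives the loss towards 0.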
3.1 Different interpretations for $\mathcal{D}$
Besides the view of $\mathcal{D}$ as a distance-like object defining, together with $\mathcal{E}$, a metric-like space, here we discuss some other possible interpretations of its outputs. We start by justifying the choice of the training objective defined in Eq. 3 by showing it to yield the likelihood ratio of particular trials of type I corresponding to a single enrollment example against a single test example, i.e., $\mathcal{T} = \{x_i, x_j\}$. In both of the next two propositions, proofs directly reuse results from the contrastive estimation and generative adversarial networks literature (gutmann2010noise; goodfellow2014generative) to show that $\mathcal{D} \circ \mathcal{E}$ can be used for verification.
Proposition 1. The optimal $\mathcal{D}$ for any fixed $\mathcal{E}$ yields a simple transformation of the likelihood ratio stated in Eq. 1 for trials of the type $\mathcal{T} = \{x_i, x_j\}$.
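Proof. We first define $p_z^+$ and $p_z^-$, which correspond to the counterparts of $p^+$ and $p^-$ induced by $\mathcal{E}$ in the embedding space. Now consider the loss defined in Eq. 3:
$$\mathcal{L} = -\int \left[ p_z^+(z') \log \mathcal{D}(z') + p_z^-(z') \log\left(1 - \mathcal{D}(z')\right) \right] dz',$$
where $z'$ corresponds to $\{z_i, z_j\}$ or, equivalently, $\{\mathcal{E}(x_i), \mathcal{E}(x_j)\}$. Since $\mathcal{D}(z') \in (0, 1)$, the above integrand $p_z^+(z') \log \mathcal{D}(z') + p_z^-(z') \log(1 - \mathcal{D}(z'))$, provided that the set from which we pick candidate solutions is rich enough, has its maximum at:
$$\mathcal{D}^*(z') = \frac{p_z^+(z')}{p_z^+(z') + p_z^-(z')} = \frac{p_z^+(z') / p_z^-(z')}{p_z^+(z') / p_z^-(z') + 1}.$$
The last step above is of course only valid for $p_z^-(z') > 0$. Nevertheless, $\mathcal{D}^*$ is in any case meaningful for verification. In fact, as will be discussed in Proposition 2, the optimal encoder is the one that induces $p_z^+$ and $p_z^-$ with disjoint supports. Considering the trial $\mathcal{T} = \{x_i, x_j\}$ with $z' = \{\mathcal{E}(x_i), \mathcal{E}(x_j)\}$, we can write the ratio as:
$$\mathrm{LR} = \frac{p_z^+(z')}{p_z^-(z')} = \frac{\mathcal{D}^*(z')}{1 - \mathcal{D}^*(z')}.$$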
Proposition 1 indicates that the discussed setting can be used in an end-to-end fashion to yield verification decision rules against a threshold for trials of a specific type.
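The transformation between the optimal score and the likelihood ratio can be illustrated numerically. The sketch below assumes known density values of the positive and negative pair distributions at a few embedded pairs; these values, the function names, and the default threshold of 1 are hypothetical.

```python
def optimal_score(p_pos, p_neg):
    """D*(z') = p+(z') / (p+(z') + p-(z')) for one embedded pair z'."""
    return p_pos / (p_pos + p_neg)

def likelihood_ratio_from_score(score):
    """Invert D* = LR / (LR + 1), giving LR = D* / (1 - D*)."""
    if score >= 1.0:
        return float("inf")
    return score / (1.0 - score)

def decide_target(score, threshold=1.0):
    """Declare a target trial when the recovered likelihood ratio exceeds the threshold."""
    return likelihood_ratio_from_score(score) > threshold

# Hypothetical density values (p+, p-) evaluated at three embedded pairs.
densities = [(0.30, 0.10), (0.05, 0.20), (0.12, 0.12)]
recovered = [likelihood_ratio_from_score(optimal_score(p, n)) for p, n in densities]
expected = [p / n for p, n in densities]
```

Since the map from the score to the likelihood ratio is monotonically increasing, thresholding the score directly is equivalent to thresholding the ratio, with the threshold transformed accordingly.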
The following lemma will be necessary for the next result:
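Lemma 1. If $p_z^+$ and $p_z^-$ have disjoint supports, any positive threshold yields optimal decision rules for trials $\mathcal{T} = \{x_i, x_j\}$.
Proof. We prove the lemma by inspecting the decision rule under the considered assumptions in the two possible test cases: if $\mathcal{T}$ is non-target, $p_z^+(z') = 0$ and $\mathrm{LR} = 0$, which lies below any positive threshold. If $\mathcal{T}$ is target, $p_z^-(z') = 0$ and $\mathrm{LR} \to \infty$, which exceeds any threshold, completing the proof.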
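We now proceed and plug the optimal discriminator $\mathcal{D}^*$ into $\mathcal{L}$, which yields the following result for the optimal encoder:
Proposition 2. Minimizing $\mathcal{L}$ with $\mathcal{D} = \mathcal{D}^*$ yields optimal decision rules for any positive threshold.
Proof. We plug $\mathcal{D}^*$ into $\mathcal{L}$ so that for any $\mathcal{E}$ we obtain:
$$\mathcal{L} = -\mathbb{E}_{z' \sim p_z^+} \log \frac{p_z^+(z')}{p_z^+(z') + p_z^-(z')} - \mathbb{E}_{z' \sim p_z^-} \log \frac{p_z^-(z')}{p_z^+(z') + p_z^-(z')} = \log 4 - 2\,\mathrm{JSD}(p_z^+ \,\|\, p_z^-).$$
$\mathcal{L}$ is therefore minimized ($\mathcal{L}^* = 0$) iff $\mathcal{E}$ yields $p_z^+$ and $p_z^-$ with disjoint supports, in which case the Jensen-Shannon divergence attains its maximum of $\log 2$. This results in optimal decision rules for any positive threshold, invoking Lemma 1, and assuming such encoders are available in the set one searches over.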
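The relation between the loss at the optimal discriminator and the Jensen-Shannon divergence can be verified on discrete distributions. The sketch below uses natural logarithms and two hypothetical pairs of distributions over four outcomes, one overlapping and one with disjoint supports; all names and values are ours.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence between discrete distributions p and q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence, computed against the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def loss_at_optimal_discriminator(p_pos, p_neg):
    """-E_{p+}[log D*] - E_{p-}[log(1 - D*)] with D* = p+ / (p+ + p-)."""
    total = 0.0
    for pp, pn in zip(p_pos, p_neg):
        if pp > 0:
            total -= pp * math.log(pp / (pp + pn))
        if pn > 0:
            total -= pn * math.log(pn / (pp + pn))
    return total

# Hypothetical distributions of positive and negative pairs over four outcomes.
overlapping = ([0.5, 0.3, 0.2, 0.0], [0.1, 0.2, 0.3, 0.4])
disjoint = ([0.6, 0.4, 0.0, 0.0], [0.0, 0.0, 0.7, 0.3])
```

For both pairs the loss equals log 4 minus twice the divergence, and it vanishes exactly when the supports are disjoint.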
We thus showed the proposed training scheme to be convenient for 2-sample tests under small sample regimes, such as in the case of verification, given that: (i) the distance model is also a discriminator which approximates the likelihood ratio of the joint distributions over positive and negative pairs (the joint distribution over negative pairs is simply the product of marginals), and (ii) the encoder will be such that it induces a high divergence across such distributions, rendering their ratio amenable to decision making even in cases where verified samples are as small as single enrollment and test examples.
On a speculative note, we provide yet another view of $\mathcal{D} \circ \mathcal{E}$ by defining the kernel function $K(x_i, x_j) = \mathcal{D}(\mathcal{E}(x_i), \mathcal{E}(x_j))$. If we assume $K$ to satisfy Mercer’s condition (which will likely not be the case within our setting since $K$ will be neither symmetric nor positive semidefinite), we can invoke Mercer’s theorem and state that there is a feature map to a Hilbert space where verification can be performed through inner products. Training in the described setting could then be viewed such that minimizing $\mathcal{L}$ becomes equivalent to building such a Hilbert space where classes can be distinguished by directly scoring data points one against the other. We hypothesize that constraining $K$ to sets where Mercer’s condition does hold might yield an effective approach for the problems we consider herein, which we intend to investigate in future work.
We now describe the procedure we adopt to minimize $\mathcal{L}$ as well as some practical design decisions made based on empirical results. Both $\mathcal{E}$ and $\mathcal{D}$ are implemented as neural networks. In our experiments, $\mathcal{E}$ will be convolutional (2-d for images and 1-d for audio) while $\mathcal{D}$ is a stack of fully-connected layers which take as input concatenated embeddings of pairs of examples. Training is carried out with standard minibatch stochastic gradient descent with Polyak’s acceleration. We perform simultaneous update steps for $\mathcal{E}$ and $\mathcal{D}$ since we observed that to be faster than alternate updates, while yielding the same performance. Standard regularization strategies such as weight decay and label smoothing (szegedy2016rethinking) are also employed. We empirically found that employing an auxiliary multi-class classification loss significantly accelerates training. Since our approach requires labels to determine which pairs of examples are positive or negative, we make further use of the labels to compute such auxiliary loss, which will be indicated by $\mathcal{L}_{CE}$. To allow for computation of $\mathcal{L}_{CE}$, we project $z$ onto the simplex $\Delta^{C-1}$ over the $C$ classes using a fully-connected layer. Minimization is then performed on the sum of the two losses, i.e., we solve $\mathcal{E}^*, \mathcal{D}^* \in \arg\min \mathcal{L} + \mathcal{L}_{CE}$, where the subscript in $\mathcal{L}_{CE}$ indicates the multi-class cross-entropy loss.
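The combination of random pair selection, the pairwise loss, and the label-smoothed auxiliary loss can be sketched in plain Python as follows. This is an illustration under simplifying assumptions rather than the training code: scores and logits are passed in as plain lists instead of being produced by neural networks, the pairwise term is averaged over all sampled pairs jointly, and the smoothing value of 0.1 as well as all function names are hypothetical.

```python
import math
import random

def random_pairs(labels, n_pairs, rng):
    """Draw index pairs uniformly at random and tag each as positive or negative."""
    pairs = []
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(labels)), 2)
        pairs.append((i, j, labels[i] == labels[j]))
    return pairs

def pair_bce(scores, is_positive, eps=1e-12):
    """Binary cross-entropy of distance-model outputs against pair labels."""
    losses = [
        -math.log(max(s, eps)) if pos else -math.log(max(1.0 - s, eps))
        for s, pos in zip(scores, is_positive)
    ]
    return sum(losses) / len(losses)

def smoothed_cross_entropy(logits, label, smoothing=0.1):
    """Multi-class cross-entropy with label smoothing for a single example."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    log_probs = [l - log_z for l in logits]
    k = len(logits)
    target = [smoothing / k + (1.0 - smoothing) * (c == label) for c in range(k)]
    return -sum(t * lp for t, lp in zip(target, log_probs))

def total_loss(scores, is_positive, all_logits, labels, smoothing=0.1):
    """Sum of the pairwise loss and the mean auxiliary classification loss."""
    aux = sum(smoothed_cross_entropy(l, y, smoothing) for l, y in zip(all_logits, labels))
    return pair_bce(scores, is_positive) + aux / len(labels)
```

In an actual training loop, both terms would be differentiated with respect to the parameters of the encoder and the distance model and updated in the same step.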
All hyperparameters are selected with a random search over a pre-defined grid. For the particular case of the auxiliary loss, besides the standard cross-entropy, we also ran experiments considering one of its so-called large margin variations. We particularly evaluated models trained with the additive margin softmax approach (wang2018additive). The choice between the two types of auxiliary losses (standard or large-margin) was a further hyperparameter and the decision was based on the random search over the two options. The grid used for hyperparameters selection along with the values chosen for each evaluation are presented in the appendix. A pseudocode describing our training procedure is presented in Algorithm 1.
| Cosine + E2E | 3.42% | 0.80% |
| Cosine + E2E | 28.49% | 20.90% |
| Cosine + E2E | 29.32% | 22.24% |
We proceed to evaluation of the described framework and do so with three sets of experiments. In the first part of our evaluation, we run proof-of-concept experiments and make use of standard image datasets to simulate verification settings. We report results on all trials created for the test sets of Cifar-10 and Mini-ImageNet. In the former, the same 10 classes of examples appear for both train and test partitions, in what we refer to as closed-set verification. For the case of Mini-ImageNet, since that dataset was designed for few-shot learning applications, we have an open-set evaluation for verification since there are 64, 16, and 20 disjoint classes of training, validation, and test examples.
We then move on to a large-scale realistic evaluation. To this end, we make use of the recently introduced VoxCeleb corpus (nagrani2017voxceleb; chung2018voxceleb2), corresponding to audio recordings of interviews taken from YouTube videos, which means there is no control over the acoustic conditions present in the data. Moreover, while most of the corpus corresponds to speech in English, other languages are also present, so that test recordings are from different speakers relative to the train data, and potentially also from different languages and acoustic environments. We specifically employ the second release of the corpus so that training data is composed of recordings from 5994 speakers while three test sets are available: (i) VoxCeleb1 Test set, which is made up of utterances from 40 speakers, (ii) VoxCeleb1-E, i.e., the complete first release of the data containing 1251 speakers, and (iii) VoxCeleb1-H, corresponding to a sub-set of the trials in VoxCeleb1-E so that non-target trials are designed to be hard to discriminate by using the meta-data to match factors such as nationality and gender of the speakers. We finally report experiments performed to observe whether $\mathcal{D}$’s outputs present properties of actual distances, which we observe to be the case.
Our main baselines for proof-of-concept experiments correspond to the same encoders as in the evaluation of our proposed approach, while $\mathcal{D}$ is dropped and replaced by the Euclidean distance. In those cases, however, in order to get the most challenging baselines, we perform online selection of hard negatives. Our baselines closely follow the setting described in (monteiro2019combining). All such baselines are referred to as triplet in the tables of results, as a reference to the training loss in those cases. All models, baseline or otherwise, are trained from scratch, and the same computation budget is used for training and hyperparameter search for all models we trained.
Performance is assessed in terms of the difference to 1 of the area under the receiver operating characteristic curve, indicated by 1-AUC in the tables, and also in terms of equal error rate (EER). EER indicates the operating point (i.e., threshold selection) at which the miss and false alarm rates are equal. Both metrics are better when closer to 0. We consider different strategies to score test trials. Both cosine similarity and PLDA are considered in some cases, and when the output of $\mathcal{D}$ is directly used as a score we indicate it by E2E, in a reference to end-to-end (scoring trials with cosine similarity can also be seen as end-to-end). We further remark that cosine similarity can also be used to score trials in our proposed setting, and we observed some performance gains when applying simple sum fusion of the two available scores. Additional implementation details are included in the appendix.
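Both metrics can be computed from a list of trial scores and target indicators, as in the sketch below. It is a simple quadratic-time illustration with hypothetical scores: the EER is approximated by the operating point where miss and false alarm rates are closest, and the AUC is computed as the probability that a target trial outscores a non-target trial.

```python
def roc_points(scores, is_target):
    """(false alarm rate, miss rate) at every threshold, accepting when score >= threshold."""
    n_target = sum(is_target)
    n_nontarget = len(is_target) - n_target
    points = []
    for threshold in sorted(set(scores)) + [float("inf")]:
        misses = sum(1 for s, t in zip(scores, is_target) if t and s < threshold)
        false_alarms = sum(1 for s, t in zip(scores, is_target) if not t and s >= threshold)
        points.append((false_alarms / n_nontarget, misses / n_target))
    return points

def equal_error_rate(scores, is_target):
    """Approximate the EER by the operating point where the two error rates are closest."""
    fa, miss = min(roc_points(scores, is_target), key=lambda p: abs(p[0] - p[1]))
    return (fa + miss) / 2

def one_minus_auc(scores, is_target):
    """1 - AUC, with ties between a target and a non-target counted as half a win."""
    targets = [s for s, t in zip(scores, is_target) if t]
    nontargets = [s for s, t in zip(scores, is_target) if not t]
    wins = sum((a > b) + 0.5 * (a == b) for a in targets for b in nontargets)
    return 1.0 - wins / (len(targets) * len(nontargets))

# Hypothetical scores for three target and three non-target trials.
scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]
is_target = [True, True, True, False, False, False]
```

Sum fusion of two scoring strategies would amount to adding the two scores of each trial before calling these functions.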
4.1 Cifar-10 and Mini-ImageNet
The encoder for evaluation on both Cifar-10 and Mini-ImageNet was implemented as a ResNet-18 (he2016deep). Results are reported in Table 1.
Results indicate the proposed scheme indeed yields effective inference strategies under the verification setting compared to traditional metric learning approaches, while using a simpler training scheme since: (i) no approach for harvesting hard negative pairs (e.g., (schroff2015facenet; wu2017sampling)) is needed in our case, and those are usually expensive, (ii) the method does not require large batch sizes, and (iii) we employ a simple loss with no hyperparameters that have to be tuned, as opposed to margin-based triplet or contrastive losses. We further highlight that trials can additionally be scored with cosine similarities over the outputs of encoders trained with the proposed approach, which yields a performance improvement in some cases when combined with $\mathcal{D}$’s output.
4.2 Large-scale verification with VoxCeleb
We now proceed and evaluate the proposed scheme in a more challenging scenario corresponding to realistic audio data for speaker verification. To do so, we implement $\mathcal{E}$ as the well-known time-delay architecture (waibel1989phoneme) employed within the x-vector setting, shown to be effective in summarizing speech into speaker- and spoken language-dependent representations (snyder2018x; snyder2018spoken). The model consists of a sequence of dilated 1-dimensional convolutions across the temporal dimension, followed by a time pooling layer, which simply concatenates element-wise first- and second-order statistics over time. Statistics are finally projected into an output vector through fully-connected layers. Speech is represented as 30 mel-frequency cepstral coefficients obtained with a short-time Fourier transform using a 25ms Hamming window with 60% overlap. All the data is downsampled to 16kHz beforehand. An energy-based voice activity detector is employed to filter out non-speech frames. We augment the data by creating noisy versions of training recordings using exactly the same approach as in (snyder2018x). Model architecture and feature extraction details are included in the appendix.
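Two of the steps above lend themselves to a short illustration: the frame sizes implied by a 25ms window with 60% overlap at 16kHz, and the time pooling layer. The sketch below assumes that the first- and second-order statistics are the per-dimension mean and standard deviation, as is common in x-vector systems; function names and the toy input are ours.

```python
import math

def frame_parameters(sample_rate_hz, window_ms, overlap):
    """Window and hop sizes in samples for a given window length and fractional overlap."""
    window = round(sample_rate_hz * window_ms / 1000)
    hop = round(window * (1.0 - overlap))
    return window, hop

def statistics_pooling(frames):
    """Concatenate per-dimension mean and standard deviation over the time axis.

    `frames` is a list of T feature vectors, each of length F; the output has length 2F.
    """
    n_frames = len(frames)
    means = [sum(col) / n_frames for col in zip(*frames)]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / n_frames)
        for col, m in zip(zip(*frames), means)
    ]
    return means + stds

# 25 ms window with 60% overlap at 16 kHz: 400-sample window, 160-sample (10 ms) hop.
window, hop = frame_parameters(16000, 25, 0.60)

# Toy sequence of three 2-dimensional frames.
pooled = statistics_pooling([[1.0, 0.0], [3.0, 0.0], [5.0, 0.0]])
```

Pooling over time maps recordings of any duration to a fixed-size vector, which is what allows fully-connected layers to follow the convolutional stack.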
We compared our models with a set of published results as well as the results provided by the popular Kaldi recipe (https://github.com/kaldi-asr/kaldi/tree/master/egs/voxceleb), considering scoring using cosine similarity or PLDA. For the Kaldi baseline, we found the same model as ours to yield relatively weak performance. As such, we decided to search over possible architectures in order to make it a stronger baseline. We thus report the best model we could find which has the same structure as ours, i.e., it is made up of convolutions over time followed by temporal pooling and fully-connected layers, while the convolutional stack is deeper, which makes the comparison unfair in its favor.
We further evaluated our models using PLDA by running just the part of the same Kaldi recipe corresponding to the training of that downstream classifier on top of representations obtained from our encoder. Results are reported in Table 2 and support our claims that the proposed framework can be directly used in an end-to-end fashion. It is further observed that it outperformed standard downstream classifiers, such as PLDA, by a significant difference while not requiring any complex training procedure, as common metric learning approaches usually do. We employ simple random selection of training pairs. Ablation results are also reported, in which case we dropped the auxiliary loss and trained the same $\mathcal{E}$ and $\mathcal{D}$ using the same budget in terms of number of iterations, showing that having the auxiliary loss improves performance in the considered evaluation.
| VoxCeleb1 Test set |
| Kaldi recipe | PLDA | VoxCeleb2 | 2.51% |
| Proposed | Cosine + E2E | VoxCeleb2 | 2.51% |
| Kaldi recipe | PLDA | VoxCeleb2 | 2.60% |
| Proposed | Cosine + E2E | VoxCeleb2 | 2.53% |
| Kaldi recipe | PLDA | VoxCeleb2 | 4.62% |
| Proposed | Cosine + E2E | VoxCeleb2 | 4.69% |
4.3 Checking for distance properties in $\mathcal{D}$
We now empirically evaluate how $\mathcal{D}$ behaves in terms of properties of distances or metrics, such as symmetry. We start by plotting embeddings from $\mathcal{E}$ and do so by training an encoder on MNIST under the proposed setting (without the auxiliary loss in this case) so that its outputs are given by $z \in \mathbb{R}^2$. We then plot the embeddings of the complete MNIST test set in Fig. 2, where the raw embeddings in $\mathbb{R}^2$ are directly displayed in the plot. Interestingly, classes are reasonably clustered in the Euclidean space even though such behavior was never enforced during training. We proceed and directly check for distance properties in $\mathcal{D}$. For the test set of Cifar-10 as well as for the VoxCeleb1 Test set, we plot histograms of (i) the distance to itself for all test examples, (ii) a symmetry measure given by the absolute difference of the outputs of $\mathcal{D}$ measured in the two directions for all possible test pairs, and (iii) a measure of how much $\mathcal{D}$ satisfies the triangle inequality, which we do by measuring $\max(\mathcal{D}(a, c) - (\mathcal{D}(a, b) + \mathcal{D}(b, c)), 0)$ for a random sample taken from all possible triplets of examples $\{a, b, c\}$. Proper metrics should have all such quantities equal to 0. In Figures 9-a to 9-f, it can be seen that once more, even though no particular behavior is enforced over $\mathcal{D}$ during its training phase, resulting models approximately behave as proper metrics. We thus hypothesize that the relatively easier training observed in our setting, in the sense that it works without complicated schemes for selection of negative pairs, is due to the not so constrained distances induced by $\mathcal{D}$.
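The three checks can be sketched for any pairwise scoring function. The code below is an illustration on points of the real line with two hand-written functions standing in for a learned distance model; the function names and the toy data are hypothetical.

```python
import itertools

def self_distances(points, dist):
    """dist(p, p) for every point; all zeros for a proper metric."""
    return [dist(p, p) for p in points]

def symmetry_gaps(points, dist):
    """|dist(a, b) - dist(b, a)| over all unordered pairs."""
    return [abs(dist(a, b) - dist(b, a)) for a, b in itertools.combinations(points, 2)]

def triangle_violations(points, dist):
    """max(dist(a, c) - (dist(a, b) + dist(b, c)), 0) over all ordered triplets."""
    return [
        max(dist(a, c) - (dist(a, b) + dist(b, c)), 0.0)
        for a, b, c in itertools.permutations(points, 3)
    ]

def absolute_difference(a, b):
    """A proper metric on the real line."""
    return abs(a - b)

def squared_difference(a, b):
    """Symmetric and zero on the diagonal, but violates the triangle inequality."""
    return (a - b) ** 2

points = [0.0, 1.0, 2.5, 4.0]
```

For large test sets, the triplets would be subsampled at random rather than enumerated, as done in the evaluation described above.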
We introduced an end-to-end setting particularly tailored to perform small sample 2-sample tests and compare data pairs to determine whether they belong to the same class. Several interpretations of such framework are provided, including joint metric and distance metric learning, as well as contrastive estimation over data pairs. We used contrastive estimation results to show the solutions of the posed problem yield optimal decision rules under verification settings, resulting in correct decisions for any choice of positive threshold. In terms of practical contributions, the proposed method simplifies both the training under the metric learning framework, as it does not require any scheme to select negative pairs of examples, and also simplifies verification pipelines, which are usually made up of several individual components, each one contributing specific challenges at training and testing phases. Our models can be used in an end-to-end fashion by using $\mathcal{D}$’s outputs to score test trials, yielding strong performance even in large scale and realistic open-set conditions where test classes are different from those seen at train time. The proposed approach can be extended to any setting relying on distances to do inference, such as image retrieval, prototypical networks (snell2017prototypical), and clustering. Similarly to extensions of GANs (nowozin2016f; arjovsky2017wasserstein), variations of our approach where $\mathcal{E}$ maximizes other types of divergences instead of Jensen-Shannon’s might also be a relevant future research direction, requiring corresponding decision rules to be defined.
Appendix A Extra experiment: large scale speaker verification under domain shift
In this experiment, we evaluate the performance of the proposed setting when test data significantly differs from training examples. To do so, we employ the data introduced for one of the tasks of the 2018 edition of the NIST Speaker Recognition Evaluation (SRE) (https://www.nist.gov/system/files/documents/2018/08/17/sre18_eval_plan_2018-05-31_v6.pdf). We specifically consider the CTS task so that test data corresponds to spontaneous conversational telephone speech spoken in Tunisian Arabic, while the bulk of the train data is spoken in English. Besides the language mismatch, variations due to different codecs are further observed (PSTN vs. PSTN and VOIP).
The main training dataset (English) is built by combining the data from NIST SREs from 2004 to 2010, Mixer 6, as well as Switchboard-2, phases 1, 2, and 3, and the first release of VoxCeleb, yielding a total of approximately 14000 speakers. Audio representations correspond to 23 MFCCs obtained using a short-time Fourier transform with a 25ms Hamming window and 60% overlap. The audio data is downsampled to 8kHz. Further pre-processing steps are the same as those performed for experiments with VoxCeleb as reported in Section 4, i.e. an energy-based voice activity detector is followed by data augmentation performed via distorting original samples adding reverberation and background noise.
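As a worked example of the framing described above, the window and hop sizes implied by a 25ms window with 60% overlap at 8kHz can be computed as below. The helper names are hypothetical, and the frame count assumes no padding at the signal edges, which may differ from the actual feature extraction configuration.

```python
def frame_params(sample_rate_hz, window_ms, overlap_fraction):
    """Return (window, hop) in samples for a given window length and overlap."""
    window = round(sample_rate_hz * window_ms / 1000)
    hop = round(window * (1.0 - overlap_fraction))
    return window, hop


def num_frames(num_samples, window, hop):
    """Number of full frames that fit in the signal (no edge padding assumed)."""
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop


window, hop = frame_params(8000, 25, 0.6)          # 200 samples, 80 samples (10 ms)
frames_per_second = num_frames(8000, window, hop)  # 98 full frames in 1 s of audio
```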
Baseline: For performance reference, we trained the well-known x-vector setting (snyder2018x) using its Kaldi recipe (https://github.com/kaldi-asr/kaldi/tree/master/egs/sre16/v1/local/nnet3/xvector). In that case, PLDA is employed for scoring test trials, and the same training data used to train our systems is employed. The recipe performs the following steps: (i) training of a TDNN (same architecture as in our case) as a multi-class classifier over the set of training speakers; (ii) preparation of PLDA’s training data, in which the SRE partition of the training set is encoded using the second-to-last layer of the TDNN, and embeddings are length-normalized, mean-centered using the average of an unlabelled sample from the target domain, and finally have their dimensionality reduced using Linear Discriminant Analysis; (iii) training of PLDA; (iv) scoring of test trials. In addition, in order to cope with the described domain shift, the model adaptation scheme introduced in (garcia2014unsupervised) is also utilized for PLDA, so that a second PLDA model is trained on top of target data. The final downstream classifier is then obtained by averaging the parameters of the original and target domain models. Results obtained both with and without the described scheme are reported in Table 3.
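Two of the baseline steps above, embedding normalization and parameter averaging, can be illustrated with a short sketch. It is not the Kaldi recipe: all function names are hypothetical, the ordering of centering and length normalization is an assumption, LDA is omitted, and PLDA parameters are represented simply as a dictionary of flat lists.

```python
import math


def mean_vector(vectors):
    """Per-dimension average, e.g. over an unlabelled target-domain sample."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def center(vectors, mean):
    """Subtract a fixed mean vector from every embedding."""
    return [[x - m for x, m in zip(v, mean)] for v in vectors]


def length_normalize(vectors):
    """Scale every embedding to unit Euclidean norm (zero vectors are kept)."""
    out = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        out.append([x / norm for x in v])
    return out


def average_parameters(params_a, params_b):
    """Element-wise average of two models with identical parameter layouts."""
    return {k: [(a + b) / 2 for a, b in zip(params_a[k], params_b[k])]
            for k in params_a}


target_mean = mean_vector([[1.0, 2.0], [3.0, 4.0]])
prepared = length_normalize(center([[5.0, 3.0]], target_mean))
adapted = average_parameters({"mean": [0.0, 2.0]}, {"mean": [2.0, 4.0]})
```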
For the case of the proposed approach, training is carried out using the training data described above, corresponding to speech spoken in English. We reuse the setting found to work well in the experiments reported in Section 4 with the VoxCeleb corpus, including all hyperparameters, architecture, data sampling and minibatch construction strategies, and computational budget. We additionally build a multi-language training set including data corresponding to the target domain so that we can fine-tune our model. The complementary training data corresponds to the data introduced for the 2012 (English) and 2016 (Cantonese+Tagalog) editions of NIST SRE, as well as the development partition of NIST SRE 2018, which corresponds to the target domain of the evaluation data (Arabic). This is done so as to increase the amount of data within the complementary partition and avoid overfitting to the small amount of target data. The combination of such data sources yields approximately 800 speakers. We thus train our models on the large out-of-domain dataset and fine-tune the resulting model on the multi-language complementary data.
Results in terms of equal error rate are presented in Table 3. While our model appears to be more domain dependent when compared to PLDA as indicated by results where only out-of-domain data is employed, it significantly improves once a relatively small amount of target domain data is provided. We stress the fact that the proposed setting dramatically simplifies verification pipelines and completely removes practical issues such as those related to processing steps prior to training of the downstream classifier.
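For reference, the equal error rate can be approximated from a set of trial scores as sketched below. This is a simple threshold sweep under the assumption that higher scores indicate same-class trials; `equal_error_rate` is a hypothetical helper and not the scoring tool used to produce Table 3.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over all observed scores."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for thr in thresholds:
        # False rejection: same-class trials scored below the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        # False acceptance: different-class trials scored at or above it.
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


separable = equal_error_rate([0.9, 0.8], [0.1, 0.2])
overlapping = equal_error_rate([0.9, 0.4], [0.6, 0.1])
```

The EER is the operating point at which the false acceptance and false rejection rates coincide, so it summarizes performance without committing to a particular threshold.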
Appendix B Implementation details
The distance model is implemented as a stack of fully-connected layers with LeakyReLU activations. Dropout is further used between the last hidden layer and the output layer. The number and size of hidden layers as well as the dropout probability were tuned for each experiment.
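A stack of fully-connected layers with LeakyReLU activations and dropout before the output layer can be sketched in plain Python as follows. Everything beyond that description is an assumption made for illustration: the function names, the concatenation of the two embeddings at the input, the uniform initialization, the LeakyReLU slope, and the single scalar output.

```python
import random


def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x


def linear(vec, weights, bias):
    """Affine map; `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def build_model(input_size, hidden_sizes, seed=0):
    """Create [(weights, bias), ...] for the hidden layers plus one output unit."""
    rng = random.Random(seed)
    sizes = [input_size] + list(hidden_sizes) + [1]
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers, emb_a, emb_b, dropout_p=0.0, rng=None):
    """Score a pair of embeddings; dropout is applied only if `rng` is given."""
    h = list(emb_a) + list(emb_b)  # assumed input: concatenated pair
    for weights, bias in layers[:-1]:
        h = [leaky_relu(v) for v in linear(h, weights, bias)]
    if dropout_p > 0.0 and rng is not None:
        # Inverted dropout between the last hidden layer and the output layer.
        h = [0.0 if rng.random() < dropout_p else v / (1.0 - dropout_p)
             for v in h]
    weights, bias = layers[-1]
    return linear(h, weights, bias)[0]


model = build_model(input_size=4, hidden_sizes=[8, 8])
score = forward(model, [0.1, 0.2], [0.3, 0.4])
```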
b.2 Cifar-10 and Mini-ImageNet
The grid used in the hyperparameter search is presented next. A budget of 100 runs was considered and each model was trained for 600 epochs. Hyperparameters yielding the best EER on the validation data for our proposed approach and for the triplet baseline are each marked in the grid. In all experiments, the minibatch size was set to 64 and 128 for Cifar-10 and Mini-ImageNet, respectively. A reduce-on-plateau schedule was employed for the learning rate, and its patience was a further hyperparameter included in the search.
Number of hidden layers:
Size of hidden layers:
Type of auxiliary loss: Standard cross-entropy, Additive margin
Number of hidden layers:
Size of hidden layers:
Type of auxiliary loss: Standard cross-entropy, Additive margin
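The reduce-on-plateau learning rate schedule mentioned for these experiments can be sketched as below. The class name, the reduction factor, and the choice of a lower-is-better validation metric are assumptions; only the patience is described in the text as part of the search.

```python
class ReduceOnPlateau:
    """Shrink the learning rate when a validation metric stops improving."""

    def __init__(self, lr, factor=0.1, patience=10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_metric):
        """Update with the latest validation metric (lower is better, e.g. EER)."""
        if val_metric < self.best:
            self.best = val_metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            # Reduce only after more than `patience` epochs without improvement.
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


scheduler = ReduceOnPlateau(lr=1.0, factor=0.5, patience=2)
history = [scheduler.step(m) for m in [0.5, 0.6, 0.6, 0.6]]
```

A larger patience tolerates noisier validation curves at the cost of reacting more slowly to a genuine plateau, which is one reason to include it in the search.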
b.3.1 Encoder architecture
We implement the encoder as the well-known TDNN architecture employed within the x-vector setting (snyder2018x), which consists of a sequence of dilated 1-dimensional convolutions across the temporal dimension, followed by a time pooling layer, which simply concatenates element-wise first- and second-order statistics over time. Concatenated statistics are finally projected into an output vector through two fully-connected layers. Pre-activation batch normalization is performed after each convolution and fully-connected layer. A summary of the employed architecture is shown in Table 4.
| Layer | Input dimension | Output dimension |
|---|---|---|
| | 30 × T | 512 × T |
| Conv1d+ReLU | 512 × T | 512 × T |
| Conv1d+ReLU | 512 × T | 512 × T |
| Conv1d+ReLU | 512 × T | 512 × T |
| Conv1d+ReLU | 512 × T | 1500 × T |
| Statistical Pooling | 1500 × T | 3000 |
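The statistical pooling row of the table maps a sequence of 1500-dimensional frames to a single 3000-dimensional vector. A minimal sketch of that operation is given below, assuming the first- and second-order statistics are the per-channel mean and (population) standard deviation; `statistical_pooling` is a hypothetical name.

```python
import math


def statistical_pooling(features):
    """Map C channels of T frames each to a vector of C means and C stds."""
    means, stds = [], []
    for channel in features:
        t = len(channel)
        mean = sum(channel) / t
        var = sum((x - mean) ** 2 for x in channel) / t
        means.append(mean)
        stds.append(math.sqrt(var))
    # The output length is 2 * C regardless of the number of frames T.
    return means + stds


pooled = statistical_pooling([[1.0, 3.0], [2.0, 2.0]])
```

Because the output size does not depend on T, recordings of different durations yield embeddings of the same dimension.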
b.3.2 Data augmentation and feature extraction
We augment the training data by simulating diverse acoustic conditions using supplementary noisy speech, as done in (snyder2018x). More specifically, we corrupt the original samples by adding reverberation (reverberation time varies from 0.25s to 0.75s) and background noise, such as music (signal-to-noise ratio, SNR, within 5-15dB) and babble (SNR varies from 10 to 20dB). Noise signals were selected from the MUSAN corpus (musan2015), and the room impulse response samples from (ko2017study) were used to simulate reverberation. All the audio pre-processing steps, including feature extraction, degradation with noise, and removal of silence frames, were performed with the Kaldi toolkit (povey2011kaldi) and are openly available as the first step of the recipe in https://github.com/kaldi-asr/kaldi/tree/master/egs/voxceleb. The corpora used for augmentation are also openly available at https://www.openslr.org/.
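Additive noise at a prescribed SNR, as used for the music and babble conditions above, can be sketched as follows. This is an illustration and not the Kaldi implementation: `mix_at_snr` and `power` are hypothetical helpers, the noise is tiled and cropped to the speech length, and reverberation is not covered.

```python
import math


def power(signal):
    """Average power of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` after scaling it to reach the target SNR in dB."""
    # Tile and crop the noise so that it covers the whole speech signal.
    reps = -(-len(speech) // len(noise))
    noise = (noise * reps)[:len(speech)]
    # Choose the scale so that 10 * log10(P_speech / P_scaled_noise) == snr_db.
    scale = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


speech = [1.0, -1.0] * 50
noise = [0.5, -0.5, 0.5]
mixed = mix_at_snr(speech, noise, snr_db=10.0)
residual = [m - s for m, s in zip(mixed, speech)]
achieved_snr = 10 * math.log10(power(speech) / power(residual))
```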
In order to deal with recordings of varying duration within a minibatch, we pad all recordings to a maximum duration set in advance. We do so by repeating the signal until it reaches the maximum duration, or by taking a random continuous chunk with the maximum duration for long utterances.
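The duration handling just described can be sketched as below, assuming a recording is a non-empty list of frames; `fix_duration` is a hypothetical name.

```python
import random


def fix_duration(frames, max_frames, rng=random):
    """Return exactly `max_frames` frames by repeating or randomly cropping."""
    n = len(frames)
    if n >= max_frames:
        # Long recording: take a random contiguous chunk.
        start = rng.randint(0, n - max_frames)
        return frames[start:start + max_frames]
    # Short recording: repeat the signal until the target length is reached.
    reps = -(-max_frames // n)
    return (frames * reps)[:max_frames]


padded = fix_duration([1, 2, 3], 7)
cropped = fix_duration(list(range(10)), 4, random.Random(0))
```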
b.3.3 Minibatch construction
Given the large number of classes in the VoxCeleb case (corresponding to the number of speakers, i.e., 5994), we need to ensure several examples belonging to the same speaker exist in a minibatch to allow for positive pairs to exist. We thus create a list of sets of five recordings belonging to the same speaker, and such sets are randomly selected at training time. Minibatches are constructed by sequentially picking examples from the list, and the list is recreated once all elements are sampled. Such an approach provides minibatches whose size is the product of the number of speakers per minibatch and the number of recordings per speaker. With 5 recordings per speaker and 24 speakers per minibatch, the effective minibatch size is 120.
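The sampling strategy above can be sketched as follows. The function names are hypothetical, and two details are assumptions: recordings that do not fill a complete set of five are dropped, and nothing prevents two sets from the same speaker from landing in the same minibatch, so 24 is an upper bound on the number of distinct speakers per minibatch.

```python
import random


def build_groups(recordings_by_speaker, per_speaker=5, rng=random):
    """Create a shuffled list of (speaker, [recordings]) sets of fixed size."""
    groups = []
    for speaker, recs in recordings_by_speaker.items():
        recs = list(recs)
        rng.shuffle(recs)
        # Leftover recordings that do not fill a full set are dropped.
        for i in range(0, len(recs) - per_speaker + 1, per_speaker):
            groups.append((speaker, recs[i:i + per_speaker]))
    rng.shuffle(groups)
    return groups


def minibatches(recordings_by_speaker, groups_per_batch=24, per_speaker=5,
                rng=random):
    """Yield minibatches of (speaker, recording) pairs for one pass over the list."""
    groups = build_groups(recordings_by_speaker, per_speaker, rng)
    for i in range(0, len(groups) - groups_per_batch + 1, groups_per_batch):
        batch = []
        for speaker, recs in groups[i:i + groups_per_batch]:
            batch.extend((speaker, r) for r in recs)
        yield batch


data = {f"spk{i}": [f"spk{i}_utt{j}" for j in range(10)] for i in range(30)}
batches = list(minibatches(data, rng=random.Random(0)))
```

Calling `minibatches` again recreates the list, which mirrors the recreation step once all sets have been sampled.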
Training was carried out with a linear learning rate warm-up during the first iterations, followed by the same exponential decay as in (vaswani2017attention). A budget of 40 runs was considered for the hyperparameter search, and each model was trained for 600k iterations. The best set of hyperparameters, as assessed in terms of EER measured over a random set of trials created from VoxCeleb1-E, was then used to train a model from scratch for a total of 2M iterations. We report the results obtained by the best model within the 2M iterations in terms of the same metric used during the hyperparameter search. Selected values are marked in the grid.
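A warm-up followed by decay in the style of the cited work can be sketched as below. The schedule in (vaswani2017attention) increases the learning rate linearly during warm-up and then decays it proportionally to the inverse square root of the step; this sketch follows that form, rescaled so that the peak equals the base learning rate, which is an assumption. The function name and the number of warm-up steps in the example are hypothetical.

```python
def warmup_then_decay(step, base_lr, warmup_steps):
    """Linear warm-up to `base_lr`, then inverse-square-root decay."""
    step = max(step, 1)
    return base_lr * min(step / warmup_steps, (warmup_steps / step) ** 0.5)


# Hypothetical setting: 4000 warm-up steps and a base learning rate of 1.0.
lr_mid_warmup = warmup_then_decay(2000, 1.0, 4000)
lr_peak = warmup_then_decay(4000, 1.0, 4000)
lr_later = warmup_then_decay(16000, 1.0, 4000)
```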
The grid used for the hyperparameter search is presented next. In all experiments, the minibatch size was set to 24, which, given the sampling strategy employed in this case, yields an effective batch size of 120. We further employed gradient clipping and searched over possible clipping thresholds.
Base learning rate:
Embedding size :
Maximum duration (in number of frames):
Gradient clipping threshold:
Number of hidden layers:
Size of hidden layers:
Type of auxiliary loss: Standard cross-entropy, Additive margin
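The gradient clipping threshold listed in the grid above can be read as a bound on the gradient norm. A minimal sketch of clipping by global norm is given below; whether clipping was applied to the global norm or per value is not stated, so this choice and the name `clip_by_global_norm` are assumptions.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients so that their joint L2 norm is at most `max_norm`."""
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm or total == 0.0:
        return grads, total
    scale = max_norm / total
    return [[g * scale for g in vec] for vec in grads], total


clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
untouched, _ = clip_by_global_norm([[0.3], [0.4]], max_norm=1.0)
```

Scaling all gradients by the same factor preserves the update direction and only limits its magnitude.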