1 Introduction
Learning useful representations from high-dimensional data is one of the main goals of modern machine learning. Doing so, however, is generally a side effect of solving a predefined task: while learning the decision surface in a classification problem, for instance, inner layers of artificial neural networks have been shown to extract salient, discriminative cues from input data. Likewise, in unsupervised settings, bottleneck layers of autoencoders as well as approximate posteriors from variational autoencoders have been shown to embed relevant properties of input data which can be leveraged in downstream tasks. Rather than employing a neural network to solve some task and hoping the learned features are useful, approaches such as
siamese networks (bromley1994signature), which belong to a family of approaches commonly referred to as metric learning, have been introduced with the goal of explicitly inducing features with desirable properties such as class separability. In this setting, provided that class labels are available, an encoder is trained to minimize or maximize a distance measured across pairs of encoded examples, depending on whether the examples within each pair belong to the same class or not. Follow-up work leveraged this idea in several applications (hadsell2006dimensionality; hoffer2015deep), including the verification problem in biometrics, as is the case of FaceNet (schroff2015facenet) and Deep Speaker (li2017deep), used for face and speaker recognition, respectively. However, as pointed out in recent work (schroff2015facenet; shi2016embedding; wu2017sampling; li2017deep; zhang2018text), careful selection of training pairs is crucial to ensure a reasonable sample complexity during training, since most triplets of examples quickly satisfy the condition that distances between same-class pairs are smaller than those between different-class pairs, after which they stop providing a useful learning signal. Developing efficient strategies for harvesting negative pairs with small distances throughout training thus becomes essential.

In this contribution, we are concerned with the metric learning setting briefly described above and, more specifically, with its application to the verification problem, i.e., that of comparing data pairs and determining whether they belong to the same class. The verification problem arises in applications where comparison of two small samples is required, such as face/fingerprint/voice verification (reynolds2002overview)
(zhu2016deep; wu2017sampling), and so on. At test time, inference is often performed to answer two types of questions: (i) do two given examples belong to the same class? and (ii) does a test example belong to a specific claimed class? In both cases, test examples might belong to classes never presented to the model during training. Current verification approaches are usually comprised of several components trained in a greedy manner (kenny2013plda; snyder2018x), and an end-to-end approach is still lacking.

Euclidean spaces will not, in general, be suitable for representing every type of structure expressed in the data (e.g., asymmetry (pitis2020an) or hierarchy (nickel2017poincare)). To avoid having to select an adequate distance for every new problem we are faced with, as well as to deal with the training difficulties mentioned previously, we propose to augment the metric learning framework and jointly train an encoder (which embeds raw data into a lower-dimensional space) and a (pseudo) distance model tailored to the problem of interest. An end-to-end approach for verification is then defined by employing such a pseudo-distance to compute similarity scores. Both models together, parametrized by neural networks, define a (pseudo) metric space in which inference can be performed efficiently, since semantic properties of the data (e.g., discrepancies across classes) are now encoded by scores. Several interpretations of the learned pseudo-distance emerge: it can be viewed as a likelihood ratio in a Neyman-Pearson hypothesis test, as well as an approximate divergence measure between the joint distributions of positive (same-class) and negative (different-class) pairs of examples. Moreover, even though we do not enforce the model to satisfy the properties of an actual metric (symmetry, identity of indiscernibles, and the triangle inequality), we empirically observe such properties to emerge.
Our contributions can be summarized as follows:

We propose an augmented metric learning framework where an encoder and a (pseudo) distance are trained jointly and define a (pseudo) metric space where inference can be done efficiently for verification.

We show that the optimal distance model for any fixed encoder yields the likelihood ratio for a Neyman-Pearson hypothesis test, and that it further induces a high Jensen-Shannon divergence between the joint distributions of positive and negative pairs.

The introduced setting is trained in an end-to-end fashion, and inference can be performed with a single forward pass, greatly simplifying current verification pipelines, which involve several subcomponents.

Evaluation on large-scale verification tasks provides empirical evidence of the effectiveness of directly using the outputs of the learned pseudo-distance for inference, outperforming commonly used downstream classifiers.
The remainder of this paper is organized as follows: metric learning and the verification problem are discussed in Section 2. The proposed method is presented in Section 3 along with our main guarantees, while empirical evaluation is presented in Section 4. Discussion and final remarks as well as future directions are presented in Section 5.
2 Background and related work
2.1 Metric Learning and Distance Metric Learning
Being able to efficiently assess similarity across samples of the data under analysis is a long-standing problem within machine learning. Algorithms such as K-means, nearest-neighbor classifiers, and kernel methods generally rely on the selection of some similarity or distance measure able to encode semantic relationships present in high-dimensional data into real-valued scores. Under this view, approaches commonly referred to as
Distance Metric Learning, introduced originally by xing2003distance, try to learn a so-called Mahalanobis distance which, given a pair $(x, x')$, has the form $d_M(x, x') = \sqrt{(x - x')^\top M (x - x')}$, where $M$ is positive semidefinite. Several extensions of that setting were then introduced (globerson2006metric; weinberger2009distance; ying2012distance). shalev2004online, for instance, proposed an online version of the algorithm in (xing2003distance), while an approach based on support vector machines was introduced in (schultz2004learning) for learning $M$. davis2007information provided an information-theoretic approach to solve for $M$ by minimizing the divergence between Gaussian distributions associated with the learned and the Euclidean distances, further showing such an approach to be equivalent to low-rank kernel learning
(kulis2006learning). Similar distances have also been used in other settings, such as similarity scoring for contrastive learning (oord2018representation; tian2019contrastive). Besides the Mahalanobis distance, other forms of distance/similarity have been considered in recent work. In (lanckriet2004learning), for example, a kernel matrix is directly learned, implicitly defining a similarity function. In (pitis2020an), classes of neural networks are proposed to define pseudo-distances which satisfy the triangle inequality while not being necessarily symmetric.

For the particular case of Mahalanobis distance metric learning, one can show that $d_M(x, x') = \|Lx - Lx'\|_2$ with $M = L^\top L$ (shalev2004online), which means that there exists a linear projection of the data after which the Euclidean distance corresponds to the Mahalanobis distance on the original space. chopra2005learning substituted the linear projection with a learned non-linear encoder $f$ so that $\|f(x) - f(x')\|_2$ yields a (non-Mahalanobis) distance measure between raw data points with useful properties. Follow-up work has extended this idea to several applications (schroff2015facenet; shi2016embedding; li2017deep; zhang2018text). A further variation, besides the introduction of a non-linear encoder, is to switch the Euclidean distance for an alternative better suited to the task of interest. That is the case in (norouzi2012hamming), where the Hamming distance is used over data encoded into a binary space. In (courty2018learning), in turn, the encoder is trained so that Euclidean distances in the encoded space approximate Wasserstein divergences, while nickel2018learning employs a hyperbolic distance which is argued to be suitable for their particular use case.
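The factorization underlying the Mahalanobis equivalence above can be checked numerically. Below is a minimal numpy sketch (the projection matrix is arbitrary, chosen only for illustration) verifying that the Mahalanobis distance induced by $M = L^\top L$ coincides with the Euclidean distance between linearly projected points:

```python
import numpy as np

rng = np.random.default_rng(0)

# Any PSD matrix factors as M = L^T L, so the Mahalanobis distance
# sqrt((x - x')^T M (x - x')) equals the Euclidean distance between
# the projected points L @ x and L @ x'.
L = rng.normal(size=(3, 5))          # arbitrary projection, for illustration
M = L.T @ L                          # induced positive semidefinite matrix

x, x_prime = rng.normal(size=5), rng.normal(size=5)

diff = x - x_prime
d_mahalanobis = np.sqrt(diff @ M @ diff)
d_projected = np.linalg.norm(L @ x - L @ x_prime)

assert np.isclose(d_mahalanobis, d_projected)
```

This is exactly the observation that motivates replacing the linear projection with a learned encoder.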
Based on the covered literature, one can identify two different directions aimed at a similar goal: learning to represent the data in a metric space where distances yield efficient inference mechanisms for various tasks. One corresponds to learning a meaningful distance or similarity from raw data; the other corresponds to, given a fixed distance metric, finding an encoding process yielding the desired metric space. Here, we propose an alternative that performs both tasks simultaneously, i.e., jointly learning both the encoder and the distance. Closest to such an approach is the method discussed by garcia2019learning where, similarly to our setting, both encoder and distance are trained. The main differences lie in the facts that our method is fully end-to-end (what the authors refer to as end-to-end requires pre-training an encoder in the metric learning setting with a standard distance) while in their case training happens separately, and that training of their distance model is done by imitation learning of cosine similarities.
2.2 The Verification Problem
Given data instances $x \in \mathcal{X}$ such that each $x$ can be associated with a class label through a labeling function $\ell$, we define a trial $\mathcal{T}$ as a pair of sets of examples $(S_1, S_2)$, provided that all examples within $S_1$ share the same label and likewise for $S_2$, so that we can assign class labels to such sets, defining $\ell(S_1)$ and $\ell(S_2)$. The verification problem can thus be viewed as, given a trial $\mathcal{T}$, deciding whether $\ell(S_1) = \ell(S_2)$, in which case we refer to $\mathcal{T}$ as a target trial, or $\ell(S_1) \neq \ell(S_2)$, in which case the trial is called non-target.
The verification problem is illustrated in Figure 1. We categorize trials into two types in accordance with practical instances of the verification problem: type I trials are those in which $S_1$ is referred to as the enrollment sample, i.e., a set of data points representing a given class, such as a gallery of face pictures from a given user in an access control application, while $S_2$ corresponds to a single example to be verified against the enrollment gallery. In the type II case, $S_1$ is simply a claim corresponding to the class against which $S_2$ will be verified. Classes corresponding to examples within test trials might have never been presented to the model, and the sets $S_1$ and $S_2$ are typically small.
Under the Neyman-Pearson approach (neyman1933ix), verification is seen as a hypothesis test, where $H_0$ and $H_1$ correspond to the hypotheses that $\mathcal{T}$ is target or non-target, respectively (jiang2001bayesian). The test is thus performed through the following likelihood ratio (LR):

$LR(\mathcal{T}) = \frac{p(\mathcal{T} \mid H_0)}{p(\mathcal{T} \mid H_1)},$ (1)

where $p(\mathcal{T} \mid H_0)$ and $p(\mathcal{T} \mid H_1)$ correspond to models of target and non-target (or impostor) trials, respectively. The decision is made by comparing $LR(\mathcal{T})$ with a threshold $\tau$.
One can then explicitly approximate $p(\mathcal{T} \mid H)$ through generative approaches (deng2018speech), which is commonly done using Gaussian mixture models. In that case, the denominator is usually defined by a universal background model (GMM-UBM, reynolds2000speaker), meaning that it is trained on data from all available classes, while the numerator is a model fine-tuned on enrollment data, so that for a type I trial $\mathcal{T} = (S_1, x)$ the score will be:

$LR(\mathcal{T}) = \frac{p(x \mid \lambda_{S_1})}{p(x \mid \lambda_{UBM})},$ (2)

where $\lambda_{S_1}$ and $\lambda_{UBM}$ denote the enrollment and background models, respectively.
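For illustration, the GMM-UBM scoring rule can be sketched with single Gaussians standing in for the mixtures; the data, means, and the `llr` helper below are toy stand-ins chosen for this sketch, not the actual models used in the cited systems:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_gauss(x, mu, var):
    # log-density of a univariate Gaussian
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var)

background = rng.normal(0.0, 2.0, size=1000)   # pooled data from all classes
enrollment = rng.normal(3.0, 1.0, size=50)     # sample from one enrolled class

# Fit the "background" and "target" models (single Gaussians here).
mu_b, var_b = background.mean(), background.var()
mu_t, var_t = enrollment.mean(), enrollment.var()

def llr(x):
    # log-likelihood ratio: log p(x | target) - log p(x | background)
    return log_gauss(x, mu_t, var_t) - log_gauss(x, mu_b, var_b)

target_score = llr(3.1)      # test sample close to the enrolled class
nontarget_score = llr(-4.0)  # test sample far from it
```

A test sample drawn from the enrolled class scores above one drawn from the background population, so thresholding the score implements the decision rule.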
Alternatively, cumani2013pairwise showed that discriminative settings, i.e., binary classifiers trained on data pairs to determine whether they belong to the same class, yield likelihood ratios useful for verification. In their case, a binary SVM was trained on pairs of i-vectors (dehak2010front) for automatic speaker verification. We build upon such a discriminative setting, with the difference that we learn an encoding process along with the discriminator (here represented as a distance model), and show through contrastive estimation results that it yields the likelihood ratios required for verification. This is more general than the result in (cumani2013pairwise), which shows that there exists a generative classifier associated with each discriminator whose likelihood ratio matches the discriminator’s output, requiring such a classifier’s assumptions to hold.
We remark that current verification approaches are composed of complex pipelines containing several components (dehak2010front; kenny2013plda; snyder2018x), including a pre-trained data encoder followed by a downstream classifier, such as probabilistic linear discriminant analysis (PLDA) (ioffe2006probabilistic; prince2007probabilistic), and score normalization (auckenthaler2000score), each contributing practical issues (e.g., cohort selection) to the overall system. This renders both training and testing of such systems difficult. The approach proposed herein is a step towards end-to-end verification, i.e., from data to scores via a single forward pass, thus simplifying inference.
3 Learning pseudo metric spaces
We consider the setting where both an encoding mechanism and some type of similarity or distance across data points are to be learned. Assume $f: \mathcal{X} \to \mathcal{Z}$ and $g: \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}_{\geq 0}$ are deterministic mappings, referred to as encoder and distance model, respectively, both parametrized by neural networks. Such a pair resembles a metric space, thus we will refer to $(\mathcal{Z}, g)$ as a pseudo metric space. We empirically observed that enforcing distance properties in $g$, i.e., constraining it to be symmetric and to satisfy the triangle inequality, did not result in improved performance, yet rendered training unstable. However, since trained models are found to approximately behave as an actual distance, we make use of the analogy, but further provide alternative interpretations of $g$'s outputs.
Data samples are such that $x \in \mathcal{X} \subseteq \mathbb{R}^D$, and $z = f(x)$ represents embedded data in $\mathcal{Z} \subseteq \mathbb{R}^d$. It will usually be the case that $d \ll D$. Once more, each data example can be assigned to one of $L$ class labels through a labeling function $\ell$. Moreover, we define positive and negative pairs of examples, denoted by $+$ or $-$ superscripts, such that $\ell(x_1^+) = \ell(x_2^+)$, as well as $\ell(x_1^-) \neq \ell(x_2^-)$. The same notation is employed in the embedding space, so that $z^+ = f(x^+)$ and $z^- = f(x^-)$. We will denote the sets of all possible positive and negative pairs by $\mathcal{P}^+$ and $\mathcal{P}^-$, respectively, and further define a probability distribution
$p(x)$ over $\mathcal{X}$ which, along with $\ell$, will yield distributions $p^+$ and $p^-$ over $\mathcal{P}^+$ and $\mathcal{P}^-$. Similarly to the setting in (hjelm2018learning), which introduces a discriminator over pairs of samples, we are interested in $f$ and $g$ such that:

$\min_{f, g} \mathcal{L}(f, g) = -\mathbb{E}_{(x_1, x_2) \sim p^-}\left[\log D(x_1, x_2)\right] - \mathbb{E}_{(x_1, x_2) \sim p^+}\left[\log\left(1 - D(x_1, x_2)\right)\right],$ (3)

where $\circ$ indicates composition so that $D = g \circ f$, i.e., $D(x_1, x_2) = g(f(x_1), f(x_2))$ with outputs rescaled to $(0, 1)$. The problem is separable in the parameters of $f$ and $g$, and iterative solution strategies might include either alternate or simultaneous updates. We found the latter to converge faster in terms of wall-clock time, with both approaches reaching similar performance. We thus perform simultaneous updates while training.
The problem stated in (3) corresponds to finding $f$ and $g$ which ensure that semantically close or distant samples, as defined through $\ell$, preserve such properties in terms of distance in the new space, while doing so in a lower-dimensional representation. We stress that class labels define which samples should be close together or far apart, which means that the same underlying data can yield different pseudo metric spaces if different semantic properties are used to define class labels. For example, if one considers that, for a given set of speech recordings, class labels correspond to speaker identities, recordings from the same speaker are expected to be clustered together in the embedding space, while different results will be achieved if class labels are assigned according to spoken language, acoustic conditions, and so on.
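As a concrete (and heavily simplified) sketch of the objective above, the snippet below treats the composition of a linear stand-in encoder and distance model as a discriminator over pairs, trained with a binary cross-entropy that pushes scores up for negative pairs and down for positive ones. All weights and function names are illustrative assumptions, not the architectures used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Stand-in encoder f: R^8 -> R^4 (linear, purely for illustration).
W_enc = 0.1 * rng.normal(size=(4, 8))

def f(x):
    return W_enc @ x

# Stand-in distance model g over a concatenated pair of embeddings;
# softplus keeps outputs non-negative, as expected of a distance.
w_dist = 0.1 * rng.normal(size=8)

def g(z1, z2):
    return np.logaddexp(0.0, w_dist @ np.concatenate([z1, z2]))

def pair_loss(x1, x2, is_negative):
    # Binary cross-entropy over D = sigmoid(g(f(x1), f(x2))):
    # negative pairs are pushed towards large distances, positive
    # pairs towards small ones.
    p = sigmoid(g(f(x1), f(x2)))
    return -np.log(p) if is_negative else -np.log(1.0 - p)

x_a, x_b = rng.normal(size=8), rng.normal(size=8)
loss_neg = pair_loss(x_a, x_b, is_negative=True)
loss_pos = pair_loss(x_a, x_b, is_negative=False)
```

In practice both models are deep networks and the expectations in the objective are estimated over mini-batches of sampled pairs.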
3.1 Different interpretations for $g$
Besides the view of $g$ as a distance-like object defining a metric-like space $(\mathcal{Z}, g)$, here we discuss other possible interpretations of its outputs. We start by justifying the choice of the training objective defined in Eq. 3 by showing that it yields the likelihood ratio of particular trials of type I corresponding to a single enrollment example against a single test example, i.e., $\mathcal{T} = (x_1, x_2)$. In both of the next two propositions, proofs directly reuse results from the contrastive estimation and generative adversarial networks literature (gutmann2010noise; goodfellow2014generative) to show that $g$'s outputs can be used for verification.
Proposition 1. The optimal $g$ for any fixed $f$ yields a simple transformation of the likelihood ratio stated in Eq. 1 for trials of the type $\mathcal{T} = (x_1, x_2)$.
Proof. We first define $p_f^+$ and $p_f^-$, which correspond to the counterparts of $p^+$ and $p^-$ induced by $f$ in the embedding space. Now consider the loss defined in Eq. 3:
$\mathcal{L}(f, g) = -\int_{\mathcal{Z} \times \mathcal{Z}} \left[ p_f^-(z_1, z_2) \log D(z_1, z_2) + p_f^+(z_1, z_2) \log\left(1 - D(z_1, z_2)\right) \right] \mathrm{d}z_1\,\mathrm{d}z_2,$ (4)

where $D$ corresponds to $g$ evaluated on embedded pairs, or equivalently $g \circ f$ on raw pairs. Since $D(z_1, z_2) \in (0, 1)$, the integrand above, provided that the set from which we pick candidate solutions is rich enough, attains its optimum at:

$D^*(z_1, z_2) = \frac{p_f^-(z_1, z_2)}{p_f^+(z_1, z_2) + p_f^-(z_1, z_2)}.$ (5)
The last step above is of course only valid for $p_f^+(z_1, z_2) + p_f^-(z_1, z_2) > 0$. Nevertheless, $D^*$ is in any case meaningful for verification. In fact, as will be discussed in Proposition 2, the optimal encoder is the one that induces disjoint supports for $p_f^+$ and $p_f^-$. Considering a trial $\mathcal{T} = (x_1, x_2)$ with embeddings $(z_1, z_2)$, we can write the ratio in Eq. 1 as:

$LR(\mathcal{T}) = \frac{p_f^+(z_1, z_2)}{p_f^-(z_1, z_2)} = \frac{1 - D^*(z_1, z_2)}{D^*(z_1, z_2)} = \frac{1}{D^*(z_1, z_2)} - 1.$ (6)
Proposition 1 indicates that the discussed setting can be used in an end-to-end fashion to yield verification decision rules against a threshold for trials of a specific type.
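The transformation in Proposition 1 can be checked numerically: for any known pair densities, the Bayes-optimal discriminator recovers the likelihood ratio. The Gaussian densities below are purely illustrative:

```python
import numpy as np

def gauss(y, mu, var):
    return np.exp(-0.5 * (y - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

y = np.linspace(-3.0, 3.0, 101)
p_pos = gauss(y, 1.0, 1.0)    # illustrative "positive pair" density
p_neg = gauss(y, -1.0, 1.0)   # illustrative "negative pair" density

# Bayes-optimal discriminator, under the convention that it outputs
# high values for negative pairs (matching a distance-like score).
d_star = p_neg / (p_pos + p_neg)
# The likelihood ratio is a monotone transform of the discriminator.
lr_from_d = (1.0 - d_star) / d_star

assert np.allclose(lr_from_d, p_pos / p_neg)
```

Thresholding either quantity therefore yields the same family of decision rules.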
The following lemma will be necessary for the next result:
Lemma 1. If $p_f^+$ and $p_f^-$ have disjoint supports, any positive threshold $\tau$ yields optimal decision rules for trials $\mathcal{T} = (x_1, x_2)$.
Proof. We prove the lemma by inspecting the decision rule under the considered assumptions in the two possible test cases: if $\mathcal{T}$ is non-target, then $p_f^+(z_1, z_2) = 0$ and $LR(\mathcal{T}) = 0 < \tau$, so $H_1$ is correctly selected. If $\mathcal{T}$ is target, then $p_f^-(z_1, z_2) = 0$ and $LR(\mathcal{T}) \to \infty > \tau$, so $H_0$ is correctly selected, completing the proof.
We now proceed and plug the optimal discriminator $D^*$ into $\mathcal{L}$, which yields the following result for the optimal encoder:
Proposition 2. Minimizing $\mathcal{L}(f, D^*)$ over $f$ yields optimal decision rules for any positive threshold.
Proof. We plug $D^*$ into $\mathcal{L}$ so that for any $f$ we obtain:

$\mathcal{L}(f, D^*) = \log 4 - 2\,\mathrm{JSD}\left(p_f^+ \,\|\, p_f^-\right).$ (7)
$\mathcal{L}(f, D^*)$ is therefore minimized ($\mathcal{L} = 0$) iff $f$ yields $\mathrm{JSD}(p_f^+ \| p_f^-) = \log 2$, i.e., disjoint supports for $p_f^+$ and $p_f^-$, which results in optimal decision rules for any positive threshold by invoking Lemma 1, and assuming such encoders are available in the set one searches over.
We thus showed the proposed training scheme to be convenient for two-sample tests under small-sample regimes, such as verification, given that: (i) the distance model is also a discriminator which approximates the likelihood ratio of the joint distributions over positive and negative pairs (the joint distribution over negative pairs being simply the product of marginals over differing classes), and (ii) the encoder will be such that it induces a high divergence across such distributions, rendering their ratio amenable to decision making even in cases where the verified samples are as small as single enrollment and test examples.
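The identity used in the proof of Proposition 2 can likewise be verified numerically; the sketch below integrates the loss at the optimal discriminator for two illustrative Gaussian pair densities and checks it against $\log 4 - 2\,\mathrm{JSD}$:

```python
import numpy as np

def gauss(y, mu, var):
    return np.exp(-0.5 * (y - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Illustrative 1-d pair densities induced by some encoder.
y = np.linspace(-10.0, 10.0, 20001)
dy = y[1] - y[0]
p_pos = gauss(y, 2.0, 1.0)
p_neg = gauss(y, -2.0, 1.0)

# Jensen-Shannon divergence via numerical integration.
m = 0.5 * (p_pos + p_neg)
kl = lambda p, q: np.sum(p * np.log(p / q)) * dy
jsd = 0.5 * kl(p_pos, m) + 0.5 * kl(p_neg, m)

# Loss with the optimal discriminator D* = p_neg / (p_pos + p_neg)
# plugged in; logs are computed in a numerically safe way.
log_d_star = np.log(p_neg) - np.log(p_pos + p_neg)
log_one_minus_d_star = np.log(p_pos) - np.log(p_pos + p_neg)
loss = -np.sum(p_neg * log_d_star + p_pos * log_one_minus_d_star) * dy

assert np.isclose(loss, np.log(4.0) - 2.0 * jsd, atol=1e-3)
```

As the two densities are pushed apart, the JSD approaches $\log 2$ and the loss approaches its minimum of 0.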
On a speculative note, we provide yet another view of $g$ by defining a kernel function from its outputs. If we assume such a kernel to satisfy Mercer’s condition (which will likely not be the case within our setting, since $g$ is neither symmetric nor positive semidefinite), we can invoke Mercer’s theorem and state that there is a feature map to a Hilbert space where verification can be performed through inner products. Training in the described setting could then be viewed such that minimizing $\mathcal{L}$ becomes equivalent to building a Hilbert space where classes can be distinguished by directly scoring data points one against the other. We hypothesize that constraining $g$ to sets where Mercer’s condition does hold might yield an effective approach for the problems we consider herein, which we intend to investigate in future work.
3.2 Training
We now describe the procedure we adopt to minimize as well as some practical design decisions made based on empirical results. Both and are implemented as neural networks. In our experiments, will be convolutional (2d for images and 1d for audio) while
is a stack of fullyconnected layers which take as input concatenated embeddings of pairs of examples. Training is carried out with standard minibatch stochastic gradient descent with Polyak’s acceleration. We perform simultaneous update steps for
and since we observed that to be faster than alternate updates, while yielding the same performance. Standard regularization strategies such as weight decay and label smoothing (szegedy2016rethinking) are also employed. We empirically found that employing an auxiliary multiclass classification loss significantly accelerates training. Since our approach requires labels to determine which pairs of examples are positive or negative, we make further use of the labels to compute such auxiliary loss, which will be indicated by . To allow for computation of , we project onto the simplex using a fullyconnected layer. Minimization is then performed on the sum of the two losses, i.e., we solve , where the subscript in indicates the multiclass crossentropy loss.All hyperparameters are selected with a random search over a predefined grid. For the particular case of the auxiliary loss
, besides the standard crossentropy, we also ran experiments considering one of its socalled large margin variations. We particularly evaluated models trained with the additive margin softmax approach (wang2018additive). The choice between the two types of auxiliary losses (standard or largemargin) was a further hyperparameter and the decision was based on the random search over the two options. The grid used for hyperparameters selection along with the values chosen for each evaluation are presented in the appendix. A pseudocode describing our training procedure is presented in Algorithm 1.4 Evaluation
Table 1. Verification results on image data. E2E indicates direct scoring with the learned distance model's outputs.

Dataset   Loss      Scoring        EER      1-AUC
Cifar10   Triplet   Cosine          3.80%    0.98%
          Proposed  E2E             3.43%    0.60%
          Proposed  Cosine          3.56%    1.03%
          Proposed  Cosine + E2E    3.42%    0.80%

          Triplet   Cosine         28.91%   21.58%
          Proposed  E2E            28.64%   21.01%
          Proposed  Cosine         30.66%   23.70%
          Proposed  Cosine + E2E   28.49%   20.90%

          Triplet   Cosine         29.68%   22.56%
          Proposed  E2E            29.26%   22.04%
          Proposed  Cosine         32.97%   27.34%
          Proposed  Cosine + E2E   29.32%   22.24%
We proceed to the evaluation of the described framework through three sets of experiments. In the first part, we run proof-of-concept experiments and make use of standard image datasets to simulate verification settings. We report results on all trials created from the test sets of Cifar10 and MiniImageNet. In the former, the same 10 classes appear in both train and test partitions, in what we refer to as closed-set verification. For MiniImageNet, since that dataset was designed for few-shot learning applications, we have an open-set evaluation for verification, as there are 64, 16, and 20 disjoint classes of training, validation, and test examples, respectively.
We then move on to a large-scale realistic evaluation. To this end, we make use of the recently introduced VoxCeleb corpus (nagrani2017voxceleb; chung2018voxceleb2), corresponding to audio recordings of interviews taken from YouTube videos, which means there is no control over the acoustic conditions present in the data. Moreover, while most of the corpus corresponds to speech in English, other languages are also present, so that test recordings are from different speakers relative to the train data, and potentially also from different languages and acoustic environments. We specifically employ the second release of the corpus, so that training data is composed of recordings from 5994 speakers, while three test sets are available: (i) the VoxCeleb1 test set, made up of utterances from 40 speakers; (ii) VoxCeleb1-E, i.e., the complete first release of the data, containing 1251 speakers; and (iii) VoxCeleb1-H, a subset of the trials in VoxCeleb1-E in which non-target trials are designed to be hard to discriminate, obtained by using the metadata to match factors such as nationality and gender of the speakers. We finally report experiments performed to observe whether $g$'s outputs present properties of actual distances, which we observe to be the case.
Our main baselines for proof-of-concept experiments use the same encoders as in the evaluation of our proposed approach, while $g$ is dropped and replaced by the Euclidean distance. In those cases, however, in order to obtain the most challenging baselines, we perform online selection of hard negatives. Our baselines closely follow the setting described in (monteiro2019combining), and are referred to as triplet in the results tables, in reference to the training loss used in those cases. All models, baseline or otherwise, are trained from scratch, and the same computation budget is used for training and hyperparameter search for all models we trained.
Performance is assessed in terms of the difference from 1 of the area under the ROC curve, indicated by 1-AUC in the tables, and also in terms of equal error rate (EER). EER indicates the operating point (i.e., threshold) at which the miss and false-alarm rates are equal. Both metrics are better when closer to 0. We consider different strategies to score test trials: both cosine similarity and PLDA are considered in some cases, and when the output of $g$ is directly used as a score we indicate it by E2E, in reference to end-to-end (scoring trials with cosine similarity can also be seen as end-to-end). We further remark that cosine similarity can also be used to score trials in our proposed setting, and we observed some performance gains when applying a simple sum fusion of the two available scores. Additional implementation details are included in the appendix.
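For reference, both reported metrics can be computed from raw trial scores with a few lines of numpy; the function below is a simplified sketch operating on synthetic scores, not the exact evaluation code used for the tables:

```python
import numpy as np

def eer_and_one_minus_auc(target_scores, nontarget_scores):
    """EER and 1 - AUC from raw verification scores (higher = more target-like)."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    miss = np.array([(target_scores < t).mean() for t in thresholds])
    false_alarm = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    # EER: operating point where the two error rates cross.
    i = np.argmin(np.abs(miss - false_alarm))
    eer = 0.5 * (miss[i] + false_alarm[i])
    # AUC as the probability that a target trial outscores a non-target one.
    auc = (target_scores[:, None] > nontarget_scores[None, :]).mean()
    return eer, 1.0 - auc

rng = np.random.default_rng(3)
tgt = rng.normal(2.0, 1.0, size=500)   # synthetic target-trial scores
non = rng.normal(0.0, 1.0, size=500)   # synthetic non-target scores
eer, one_minus_auc = eer_and_one_minus_auc(tgt, non)
```

Both quantities approach 0 as the two score distributions separate.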
4.1 Cifar10 and MiniImageNet
The encoder for the evaluation on both Cifar10 and MiniImageNet was implemented as a ResNet-18 (he2016deep). Results are reported in Table 1.
Results indicate the proposed scheme indeed yields effective inference strategies in the verification setting compared to traditional metric learning approaches, while using a simpler training scheme since: (i) no scheme for harvesting hard negative pairs (e.g., (schroff2015facenet; wu2017sampling)) is needed in our case, and those are usually expensive; (ii) the method does not require large batch sizes; and (iii) we employ a simple loss with no hyperparameters to be tuned, as opposed to margin-based triplet or contrastive losses. We further highlight that, with encoders trained under the proposed approach, trials can additionally be scored with cosine similarity, which yields a performance improvement in some cases when combined with $g$'s output.
4.2 Large-scale verification with VoxCeleb
We now evaluate the proposed scheme in a more challenging scenario corresponding to realistic audio data for speaker verification. To do so, we implement $f$ as the well-known time-delay architecture (waibel1989phoneme) employed within the x-vector setting, shown to be effective in summarizing speech into speaker- and spoken-language-dependent representations (snyder2018x; snyder2018spoken). The model consists of a sequence of dilated 1-dimensional convolutions across the temporal dimension, followed by a time-pooling layer, which simply concatenates element-wise first- and second-order statistics over time. Statistics are finally projected into an output vector through fully-connected layers. Speech is represented as 30 mel-frequency cepstral coefficients obtained with a short-time Fourier transform using a 25 ms Hamming window with 60% overlap. All the data is downsampled to 16 kHz beforehand. An energy-based voice activity detector is employed to filter out non-speech frames. We augment the data by creating noisy versions of training recordings using exactly the same approach as in (snyder2018x). Model architecture and feature extraction details are included in the appendix.
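The framing arithmetic implied by the feature extraction description (25 ms Hamming windows with 60% overlap at 16 kHz, i.e., a 10 ms hop) can be sketched as follows; the zero signal is a placeholder for actual audio:

```python
import numpy as np

sr = 16000                       # sampling rate (Hz)
win = int(0.025 * sr)            # 25 ms window -> 400 samples
hop = int(win * (1.0 - 0.60))    # 60% overlap -> 10 ms hop -> 160 samples

signal = np.zeros(sr)            # one second of placeholder audio
n_frames = 1 + (len(signal) - win) // hop
frames = np.stack([signal[i * hop : i * hop + win] * np.hamming(win)
                   for i in range(n_frames)])
# Each row of `frames` would then go through the STFT / mel filterbank /
# DCT pipeline to produce the 30 cepstral coefficients per frame.
```

One second of audio thus yields roughly one hundred feature frames fed to the convolutional stack.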
We compare our models with a set of published results, as well as with the results provided by the popular Kaldi recipe (https://github.com/kaldi-asr/kaldi/tree/master/egs/voxceleb), considering scoring with cosine similarity or PLDA. For the Kaldi baseline, we found the same architecture as ours to yield relatively weak performance, so we searched over possible architectures in order to make it a stronger baseline. We thus report the best model we could find with the same structure as ours, i.e., convolutions over time followed by temporal pooling and fully-connected layers, but with a deeper convolutional stack, which biases the comparison in its favor.
We further evaluated our models using PLDA by running only the part of the same Kaldi recipe corresponding to the training of that downstream classifier on top of representations obtained from our encoder. Results are reported in Table 2 and support our claim that the proposed framework can be directly used in an end-to-end fashion. It is further observed that it outperformed standard downstream classifiers, such as PLDA, by a significant margin while not requiring any complex training procedure, as common metric learning approaches usually do; we employ simple random selection of training pairs. Ablation results are also reported, in which case we dropped the auxiliary loss and trained the same $f$ and $g$ using the same budget in terms of number of iterations, showing that the auxiliary loss improves performance in the considered evaluation.
Table 2. Speaker verification results on VoxCeleb.

Model                  Scoring        Training set  EER
VoxCeleb1 test set
nagrani2017voxceleb    PLDA           VoxCeleb1     8.80%
Cai2018                Cosine         VoxCeleb1     4.40%
okabe2018attentive     Cosine         VoxCeleb1     3.85%
hajibabaei2018unified  Cosine         VoxCeleb1     4.30%
ravanelli2019learning  Cosine         VoxCeleb1     5.80%
chung2018voxceleb2     Cosine         VoxCeleb2     3.95%
xie2019utterance       Cosine         VoxCeleb2     3.22%
hajavi2019deep         Cosine         VoxCeleb2     4.26%
xiang2019margin        Cosine         VoxCeleb2     2.69%
Kaldi recipe           PLDA           VoxCeleb2     2.51%
Proposed               Cosine         VoxCeleb2     4.97%
Proposed               E2E            VoxCeleb2     2.51%
Proposed               Cosine + E2E   VoxCeleb2     2.51%
Proposed               PLDA           VoxCeleb2     3.75%
Ablation               E2E            VoxCeleb2     3.44%
VoxCeleb1-E
chung2018voxceleb2     Cosine         VoxCeleb2     4.42%
xie2019utterance       Cosine         VoxCeleb2     3.13%
xiang2019margin        Cosine         VoxCeleb2     2.76%
Kaldi recipe           PLDA           VoxCeleb2     2.60%
Proposed               Cosine         VoxCeleb2     4.77%
Proposed               E2E            VoxCeleb2     2.57%
Proposed               Cosine + E2E   VoxCeleb2     2.53%
Proposed               PLDA           VoxCeleb2     3.61%
Ablation               E2E            VoxCeleb2     3.70%
VoxCeleb1-H
chung2018voxceleb2     Cosine         VoxCeleb2     7.33%
xie2019utterance       Cosine         VoxCeleb2     5.06%
xiang2019margin        Cosine         VoxCeleb2     4.73%
Kaldi recipe           PLDA           VoxCeleb2     4.62%
Proposed               Cosine         VoxCeleb2     8.61%
Proposed               E2E            VoxCeleb2     4.73%
Proposed               Cosine + E2E   VoxCeleb2     4.69%
Proposed               PLDA           VoxCeleb2     5.98%
Ablation               E2E            VoxCeleb2     7.76%
4.3 Checking for distance properties in $g$
We now empirically evaluate how $g$ behaves in terms of properties of distances or metrics, such as symmetry. We start by plotting embeddings from $f$, training an encoder on MNIST under the proposed setting (without the auxiliary loss in this case) so that its outputs lie in $\mathbb{R}^2$. We then plot the embeddings of the complete MNIST test set in Fig. 2, where the raw 2-dimensional embeddings are directly displayed. Interestingly, classes are reasonably clustered in the Euclidean space even though such behavior was never enforced during training. We proceed and directly check for distance properties in $g$. For the test set of Cifar10, as well as for the VoxCeleb1 test set, we plot histograms of: (i) the distance of each test example to itself, $g(z, z)$; (ii) a symmetry measure, given by the absolute difference $|g(z_1, z_2) - g(z_2, z_1)|$ over all possible test pairs; and (iii) a measure of how much $g$ violates the triangle inequality, given by $\max(0, g(z_1, z_3) - g(z_1, z_2) - g(z_2, z_3))$ over a random sample taken from all possible triplets of examples. Proper metrics should have all such quantities equal to 0. As seen in Figures 9a to 9f, once more, even though no particular behavior is enforced on $g$ during its training phase, the resulting models approximately behave as proper metrics. We thus hypothesize that the relatively easier training observed in our setting, in the sense that it works without complicated schemes for selection of negative pairs, is due to the less constrained distances induced by $g$.
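The three diagnostics above can be sketched in a few lines of numpy; here a random bilinear form stands in for the learned distance model, used only to show how the quantities are computed:

```python
import numpy as np

rng = np.random.default_rng(4)

A = rng.normal(size=(4, 4))      # random bilinear form: stand-in for g

def g(z1, z2):
    return np.abs(z1 @ A @ z2)   # illustrative non-negative pairwise score

Z = rng.normal(size=(50, 4))     # stand-in embeddings

# (i) self-distance: zero for a proper metric.
self_dist = np.array([g(z, z) for z in Z])
# (ii) symmetry gap: zero for a proper metric.
symmetry_gap = np.array([abs(g(Z[i], Z[j]) - g(Z[j], Z[i]))
                         for i in range(20) for j in range(20) if i != j])
# (iii) clipped triangle-inequality violation: zero for a proper metric.
triangle_violation = np.array([max(0.0, g(Z[i], Z[k]) - g(Z[i], Z[j]) - g(Z[j], Z[k]))
                               for i in range(10) for j in range(10)
                               for k in range(10)])
```

Histograms of these three arrays concentrating near zero are precisely the evidence reported in Figures 9a to 9f.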
5 Conclusion
We introduced an end-to-end setting particularly tailored to performing small-sample two-sample tests, i.e., comparing data pairs to determine whether they belong to the same class. Several interpretations of such a framework are provided, including joint metric and distance metric learning, as well as contrastive estimation over data pairs. We used contrastive estimation results to show that the solutions of the posed problem yield optimal decision rules under verification settings, resulting in correct decisions for any choice of threshold. In terms of practical contributions, the proposed method simplifies training under the metric learning framework, as it does not require any scheme for selecting negative pairs of examples, and also simplifies verification pipelines, which are usually made up of several individual components, each contributing specific challenges at training and testing time. Our models can be used in an end-to-end fashion by scoring test trials directly with the model's outputs, yielding strong performance even in large-scale and realistic open-set conditions where test classes differ from those seen at train time. The proposed approach can be extended to any setting that relies on distances for inference, such as image retrieval, prototypical networks (snell2017prototypical), and clustering. Similarly to extensions of GANs (nowozin2016f; arjovsky2017wasserstein), variations of our approach maximizing other types of divergences instead of the Jensen-Shannon divergence might also be a relevant future research direction, requiring corresponding decision rules to be defined.
References
Appendix A Extra experiment: large scale speaker verification under domain shift
In this experiment, we evaluate the performance of the proposed setting when test data differs significantly from the training examples. To do so, we employ the data introduced for one of the tasks of the 2018 edition of the NIST Speaker Recognition Evaluation (SRE) (evaluation plan: https://www.nist.gov/system/files/documents/2018/08/17/sre18_eval_plan_20180531_v6.pdf). We specifically consider the CTS task, so that test data corresponds to spontaneous conversational telephone speech spoken in Tunisian Arabic, while the bulk of the training data is spoken in English. Besides the language mismatch, variations due to different codecs are further observed (PSTN vs. PSTN and VOIP).
The main training dataset (English) is built by combining data from the NIST SREs from 2004 to 2010, Mixer 6, Switchboard 2 Phases 1, 2, and 3, and the first release of VoxCeleb, yielding a total of approximately 14,000 speakers. Audio representations correspond to 23 MFCCs obtained using a short-time Fourier transform with a 25ms Hamming window and 60% overlap. The audio data is downsampled to 8kHz. Further preprocessing steps are the same as those performed for the experiments with VoxCeleb reported in Section 4, i.e., an energy-based voice activity detector followed by data augmentation performed by distorting original samples with added reverberation and background noise.
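For reference, the frame geometry implied by a 25ms window with 60% overlap at 8kHz works out as below (a sketch; exact frame counts depend on the toolkit's padding conventions):

```python
sr = 8000                      # sample rate (Hz)
win = int(0.025 * sr)          # 25 ms window -> 200 samples
hop = int(win * (1 - 0.60))    # 60% overlap -> 10 ms hop, 80 samples

def num_frames(n_samples):
    # Frames that fit entirely inside the signal (no padding).
    return 1 + (n_samples - win) // hop if n_samples >= win else 0

# e.g. a 3-second utterance yields num_frames(3 * sr) == 298 frames.
```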
Baseline: For performance reference, we trained the well-known x-vector setting (snyder2018x) using its Kaldi recipe (https://github.com/kaldi-asr/kaldi/tree/master/egs/sre16/v1/local/nnet3/xvector). In that case, PLDA is employed for scoring test trials, and the same training data used for our systems is employed as well. The recipe performs the following steps: (i) training of a TDNN (same architecture as in our case) as a multi-class classifier over the set of training speakers, using the same training data utilized for our proposed model; (ii) preparation of PLDA's training data, in which case the SRE partition of the training set is encoded using the second-to-last layer of the TDNN, and embeddings are length-normalized, mean-centered using the average of an unlabelled sample from the target domain, and finally have their dimensionality reduced using Linear Discriminant Analysis; (iii) training of PLDA; (iv) scoring of test trials. In addition, in order to cope with the described domain shift, the model adaptation scheme introduced in (garcia2014unsupervised) is also utilized for PLDA, so that a second PLDA model is trained on top of target data. The final downstream classifier is then obtained by averaging the parameters of the original and target-domain models. Results obtained both with and without the described scheme are reported in Table 3.
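The parameter-averaging step of the adaptation scheme can be sketched as follows; the parameter names (`mean`, `within_var`, `between_var`) are hypothetical placeholders, since the exact PLDA parameterization is not shown here:

```python
import numpy as np

def average_plda(src, tgt, alpha=0.5):
    # Interpolate each parameter of two PLDA models; alpha = 0.5
    # corresponds to the plain averaging described in the text.
    return {k: alpha * src[k] + (1 - alpha) * tgt[k] for k in src}

d = 4  # toy embedding dimensionality
src = {"mean": np.zeros(d), "within_var": np.eye(d), "between_var": np.eye(d)}
tgt = {"mean": np.ones(d), "within_var": 3 * np.eye(d), "between_var": np.eye(d)}
adapted = average_plda(src, tgt)
# adapted["mean"] is the midpoint [0.5, 0.5, 0.5, 0.5]
```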
Method | Training domain | Scoring | EER

x-vector (snyder2018x) | English | PLDA | 11.30%
x-vector (snyder2018x) | English+Arabic | Adapted PLDA | 9.44%
Proposed | English | E2E | 13.61%
Proposed | Multi-language | E2E | 8.43%
For the proposed approach, training is carried out using the training data described above, corresponding to speech spoken in English. We reuse the setting found to work well in the experiments with the VoxCeleb corpus reported in Section 4, including all hyperparameters, architecture, data sampling and minibatch construction strategies, and computational budget. We additionally build a multi-language training set including data corresponding to the target domain so that we can fine-tune our model. The complementary training data corresponds to the data introduced for the 2012 (English) and 2016 (Cantonese+Tagalog) editions of NIST SRE, as well as the development partition of NIST SRE 2018, which corresponds to the target domain of the evaluation data (Arabic). This is done so as to increase the amount of data within the complementary partition and avoid overfitting to the small amount of target data. The combination of such data sources yields approximately 800 speakers. We thus train our models on the large out-of-domain dataset and fine-tune the resulting model on the multi-language complementary data.
Results in terms of equal error rate are presented in Table 3. While our model appears to be more domain-dependent than PLDA, as indicated by the results where only out-of-domain data is employed, it improves significantly once a relatively small amount of target-domain data is provided. We stress that the proposed setting dramatically simplifies verification pipelines and completely removes practical issues such as those related to processing steps prior to the training of the downstream classifier.
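As a reference for the reported metric, the equal error rate can be estimated from target and non-target trial scores as the operating point where the false rejection and false acceptance rates cross; a minimal sketch:

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    # Sweep all observed scores as thresholds and return the point where
    # false rejection and false acceptance rates are closest.
    thr = np.sort(np.concatenate([target_scores, nontarget_scores]))
    frr = np.array([np.mean(target_scores < t) for t in thr])
    far = np.array([np.mean(nontarget_scores >= t) for t in thr])
    i = np.argmin(np.abs(frr - far))
    return (frr[i] + far[i]) / 2

# Perfectly separated target/non-target scores give an EER of 0.
```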
Appendix B Implementation details
B.1 Architecture
The model is implemented as a stack of fully-connected layers with LeakyReLU activations. Dropout is further used between the last hidden layer and the output layer. The number and size of hidden layers, as well as the dropout probability, were tuned for each experiment.
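A minimal numpy sketch of such a stack; the layer sizes are hypothetical (the actual number and width of layers were tuned per experiment), and dropout is applied only before the output layer, as described:

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def mlp_forward(x, weights, biases, train=False, p_drop=0.2):
    # Stack of fully-connected layers with LeakyReLU activations;
    # dropout only between the last hidden layer and the output layer.
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = leaky_relu(h @ W + b)
    if train:
        # Inverted dropout: scale at train time so inference needs no change.
        h = h * (rng.random(h.shape) > p_drop) / (1 - p_drop)
    return h @ weights[-1] + biases[-1]  # linear output layer

# Hypothetical sizes: two hidden layers of width 64 over 32-dim inputs.
dims = [32, 64, 64, 1]
Ws = [rng.normal(scale=0.1, size=(i, o)) for i, o in zip(dims[:-1], dims[1:])]
bs = [np.zeros(o) for o in dims[1:]]
out = mlp_forward(rng.normal(size=(8, 32)), Ws, bs)  # shape (8, 1)
```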
B.2 Cifar10 and MiniImageNet
B.2.1 Hyperparameters
The grid used in the hyperparameter search is presented next. A budget of 100 runs was considered, and each model was trained for 600 epochs. Hyperparameters yielding the best EER on the validation data for our proposed approach and for the triplet baseline are marked accordingly. In all experiments, the minibatch size was set to 64 for Cifar10 and 128 for MiniImageNet. A reduce-on-plateau schedule for the learning rate was employed, and its patience was a further hyperparameter included in the search.

Cifar10:

Learning rate:

Weight decay:

Momentum:

Label smoothing:

Patience:

Number of hidden layers:

Size of hidden layers:

Dropout probability:

Type of auxiliary loss: Standard crossentropy, Additive margin
MiniImageNet:

Learning rate:

Weight decay:

Momentum:

Label smoothing:

Patience:

Number of hidden layers:

Size of hidden layers:

Dropout probability:

Type of auxiliary loss: Standard crossentropy, Additive margin
B.3 VoxCeleb
B.3.1 Encoder architecture
We implement the encoder as the well-known TDNN architecture employed within the x-vector setting (snyder2018x), which consists of a sequence of dilated 1-dimensional convolutions across the temporal dimension, followed by a time pooling layer, which concatenates element-wise first- and second-order statistics over time. Concatenated statistics are finally projected into an output vector through two fully-connected layers. Pre-activation batch normalization is performed after each convolutional and fully-connected layer. A summary of the employed architecture is shown in Table 4.

Layer | Input dimension | Output dimension

Conv1d + ReLU | 30 × T | 512 × T
Conv1d + ReLU | 512 × T | 512 × T
Conv1d + ReLU | 512 × T | 512 × T
Conv1d + ReLU | 512 × T | 512 × T
Conv1d + ReLU | 512 × T | 1500 × T
Statistical Pooling | 1500 × T | 3000
Linear + ReLU | 3000 | 512
Linear + ReLU | 512 | (embedding size)
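The statistical pooling step above (1500 × T → 3000) concatenates per-channel first- and second-order statistics over time, making the output independent of the utterance length T; a minimal numpy sketch:

```python
import numpy as np

def statistical_pooling(h):
    # h: (channels, T) frame-level activations, e.g. (1500, T).
    mean = h.mean(axis=1)   # first-order statistics per channel
    std = h.std(axis=1)     # second-order statistics per channel
    return np.concatenate([mean, std])  # (2 * channels,)

rng = np.random.default_rng(0)
h = rng.normal(size=(1500, 250))   # 250 frames of 1500-dim activations
pooled = statistical_pooling(h)    # shape (3000,), regardless of T
```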
B.3.2 Data augmentation and feature extraction
We augment the training data by simulating diverse acoustic conditions using supplementary noisy speech, as done in (snyder2018x). More specifically, we corrupt the original samples by adding reverberation (reverberation time varying from 0.25s to 0.75s) and background noise, such as music (signal-to-noise ratio, SNR, within 5-15dB) and babble (SNR varying from 10 to 20dB). Noise signals were selected from the MUSAN corpus (musan2015), and the room impulse response samples from (ko2017study) were used to simulate reverberation. All audio preprocessing steps, including feature extraction, degradation with noise, and removal of silence frames, were performed with the Kaldi toolkit (povey2011kaldi) and are openly available as the first step of the recipe at https://github.com/kaldi-asr/kaldi/tree/master/egs/voxceleb. The corpora used for augmentation are also openly available at https://www.openslr.org/.

In order to deal with recordings of varying duration within a minibatch, we pad all recordings to a maximum duration set in advance. We do so by repeating the signal until it reaches the maximum duration, or by taking a random contiguous chunk of the maximum duration for long utterances.
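The repeat-or-crop scheme described above can be sketched as follows (the maximum duration and signal lengths are illustrative):

```python
import numpy as np

def fix_duration(x, max_len, rng=np.random.default_rng(0)):
    # Short recordings: tile the signal until it reaches max_len.
    # Long recordings: take a random contiguous chunk of length max_len.
    if len(x) < max_len:
        reps = int(np.ceil(max_len / len(x)))
        return np.tile(x, reps)[:max_len]
    start = rng.integers(0, len(x) - max_len + 1)
    return x[start:start + max_len]

# Both a short and a long signal come out with the target length.
short = fix_duration(np.arange(3), 200)
long_ = fix_duration(np.arange(500), 200)
```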
B.3.3 Minibatch construction
Given the large number of classes in the VoxCeleb case (corresponding to the number of speakers, i.e., 5994), we need to ensure that several examples belonging to the same speaker exist in a minibatch, in order to allow for positive pairs. We thus create a list of sets of five recordings belonging to the same speaker, and such sets are randomly selected at training time. Minibatches are constructed by sequentially picking sets from the list, and the list is recreated once all elements have been sampled. This approach yields minibatches whose size is the product of the number of speakers per minibatch and the number of recordings per speaker. With 5 recordings per speaker and 24 speakers per minibatch, this gives an effective minibatch size of 120.
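A sketch of this minibatch construction, with hypothetical speaker and utterance identifiers:

```python
import random

def build_batches(utts_by_speaker, per_speaker=5, speakers_per_batch=24, seed=0):
    # Build a shuffled list of per-speaker sets of recordings, then pack
    # batches by sequentially picking sets, which guarantees every batch
    # contains positive pairs (several recordings per speaker).
    rng = random.Random(seed)
    sets = []
    for spk, utts in utts_by_speaker.items():
        sets.append([(spk, u) for u in rng.sample(utts, per_speaker)])
    rng.shuffle(sets)
    return [sum(sets[i:i + speakers_per_batch], [])
            for i in range(0, len(sets), speakers_per_batch)]

# 48 hypothetical speakers with 10 recordings each -> batches of 24 * 5 = 120.
data = {f"spk{i}": [f"utt{i}_{j}" for j in range(10)] for i in range(48)}
batches = build_batches(data)
```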
B.3.4 Hyperparameters
Training was carried out with a linear learning rate warmup employed during the first iterations, with the same decay schedule as in (vaswani2017attention) employed after that. A budget of 40 runs was considered, and each model was trained for 600k iterations. The best set of hyperparameters, as assessed in terms of EER measured over a random set of trials created from VoxCeleb1-E, was then used to train a model from scratch for a total of 2M iterations. We report the results obtained by the best model within the 2M iterations in terms of the same metric used during the hyperparameter search. Selected values are marked accordingly.
The grid used for the hyperparameter search is presented next. In all experiments, the minibatch size was set to 24 speakers, which, given the sampling strategy employed in this case (5 recordings per speaker), yields an effective batch size of 120. We further employed gradient clipping and searched over possible clipping thresholds.
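The warmup-then-decay schedule of (vaswani2017attention) takes the form lr(step) ∝ min(step^(-1/2), step · warmup^(-3/2)); a sketch scaled so the peak equals a base learning rate (the warmup length used here is illustrative, since the actual value is not shown):

```python
def lr_schedule(step, base_lr=1.0, warmup=4000):
    # Linear warmup for `warmup` steps, then inverse square-root decay,
    # following the schedule of (vaswani2017attention). The peak value
    # base_lr is reached exactly at step == warmup.
    step = max(step, 1)
    return base_lr * warmup ** 0.5 * min(step ** -0.5, step * warmup ** -1.5)

# lr rises linearly to base_lr at step == warmup, then decays as 1/sqrt(step).
```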

Base learning rate:

Weight decay:

Momentum:

Label smoothing:

Embedding size:

Maximum duration (in number of frames):

Gradient clipping threshold:

Number of hidden layers:

Size of hidden layers:

Dropout probability:

Type of auxiliary loss: Standard crossentropy, Additive margin