A wide variety of formulations have been proposed. Traditionally, these formulations encode a notion of similar and dissimilar data points. One example is the contrastive loss [2, 3], which is defined for a pair of either similar or dissimilar data points. Another commonly used family is the triplet loss, which is defined over a triplet of data points: an anchor point, a similar point, and a dissimilar point. The goal of a triplet loss is to learn a distance in which the anchor point is closer to the similar point than to the dissimilar one.
The above losses, which depend on pairs or triplets of data points, empirically suffer from sampling issues: selecting informative pairs or triplets is crucial for successfully optimizing them and improving convergence rates. In this work we address this aspect and propose to re-define triplet-based losses over a different space of points, which we call proxies. This space approximates the training set of data points (for each data point in the original space there is a proxy point close to it), and it is small enough that we do not need to sample triplets but can explicitly write the loss over all (or most) of the triplets involving proxies. As a result, the re-defined loss is easier to optimize, and it trains faster (see Figure 1). Note that the proxies are learned as part of the model parameters.
In addition, we show that the proxy-based loss is an upper bound on the triplet loss and that, empirically, the bound's tightness improves as training converges, which justifies using the proxy-based loss to optimize the original loss.
Further, we demonstrate that the resulting distance metric learning problem has several desirable properties. First and foremost, the obtained metric performs well in the zero-shot scenario, improving on the state of the art, as demonstrated on three datasets widely used for this problem (CUB200, Cars196, and Stanford Products). Second, the learning problem formulated over proxies exhibits empirically faster convergence than other metric learning approaches.
2 Related Work
There is a large body of work on metric learning; here we focus on its use in computer vision with deep methods.
An early use of deep methods for metric learning was the introduction of Siamese networks with a contrastive loss [2, 3]. Pairs of data points are fed into a network, and the difference between the produced embeddings is used to pull together points from the same class and push apart points from different classes. A shortcoming of this approach is that it cannot directly take into account relative distances between classes. Since then, most methods have used a notion of triplets to provide supervision.
In [18], a large-margin nearest neighbor approach is designed to enable k-NN classification. For each image, it strives to ensure that a predefined set of target neighbors from the same class is closer to it than images from other classes, by a large separation margin. The set of target neighbors is defined using a fixed metric on the input space. The loss function is defined over triplets of points, which are sampled during training. This sampling becomes prohibitive when the number of classes and training instances becomes large; see Section 3.2 for more details.
To address some of the issues in this and similar work, a semi-hard negative mining approach was introduced in [12]. In this approach, hard triplets are formed by sampling positive/negative instances within a mini-batch, with the goal of finding negative examples that are within the margin but not too confusing, as the latter might come from labeling errors in the data. This improved training stability but required large mini-batches (1800 images in the case of [12]), and training was still slow. Large batches also require non-trivial engineering work, e.g. synchronized training across multiple GPUs.
This idea of incorporating information beyond a single triplet has influenced many approaches. Song et al. [8] proposed Lifted Structured Embedding, where each positive pair compares its distance against all negative pairs in the batch, weighted by the margin violation. This provides a smooth loss which incorporates the negative-mining functionality. In [14], the N-Pair Loss was proposed, which applies a softmax cross-entropy loss to pairwise similarity values within the batch, using the inner product as the similarity measure between images; the similarity between examples from the same class is encouraged to be higher than the similarity to other images in the batch. A cluster ranking loss was proposed in [15]: the network first computes the embedding vectors for all images in the batch, and then ranks the clustering score of the ground-truth clustering assignment higher, by a margin, than the clustering score of any other assignment of the batch.
Magnet Loss [9] was designed to compare distributions of classes instead of individual instances. Each class is represented by a set of K cluster centers, constructed by k-means. In each training iteration, a cluster is sampled, and the M nearest impostor clusters (clusters from different classes) are retrieved. A set of images is then selected from each impostor cluster, and an NCA loss compares the examples. Note that, in order to update the cluster assignments, training is paused periodically and k-means is reapplied.
Our proxy-based approach also compares sets of examples, but both the embeddings and the proxies are trained end-to-end (indeed, the proxies are part of the network architecture), without interrupting training to re-compute cluster centers or class indices.
3 Metric Learning using Proxies
3.1 Problem Formulation
We address the problem of learning a distance d(x, y; θ) between two data points x and y. For example, it can be defined as the Euclidean distance between embeddings of the data obtained via a deep neural network e(·; θ): d(x, y; θ) = ‖e(x; θ) − e(y; θ)‖, where θ are the parameters of the network. To simplify notation, in the following we drop θ and use x interchangeably for a data point and its embedding e(x; θ).
Often, such distances are learned using similarity-style supervision, e.g. triplets of similar and dissimilar points (or groups of points) D = {(x, y, z)}, where in each triplet there is an anchor point x, and the second point y (the positive) is more similar to x than the third point z (the negative). Note that both the positive and, more commonly, the negative can be sets of points. We use the notation Y and Z whenever sets of points are used.
The DML task is to learn a distance respecting the similarity relationships encoded in D:

d(x, y; θ) ≤ d(x, z; θ)  for all (x, y, z) ∈ D.  (1)

An ideal loss, precisely encoding Eq. (1), reads:

L_Ranking(x, y, z) = H(d(x, y) − d(x, z)),  (2)

where H is the Heaviside step function. Unfortunately, this loss is not directly amenable to optimization using stochastic gradient descent, as its gradient is zero everywhere. As a result, one resorts to surrogate losses such as the Neighborhood Component Analysis (NCA) loss [10] or the margin-based triplet loss [18, 12]. For example, the triplet loss uses a hinge function to create a fixed margin between the anchor-positive distance and the anchor-negative distance:
L_triplet(x, y, z) = [d(x, y) − d(x, z) + M]₊,  (3)

where M is the margin and [·]₊ = max(0, ·) is the hinge function.
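As a concrete illustration, the margin-based triplet loss above can be sketched in a few lines of NumPy. This is a non-authoritative sketch: the function name and the choice of squared Euclidean distance are ours.

```python
import numpy as np

def triplet_loss(x, y, z, margin=1.0):
    """Hinge-based triplet loss for one (anchor, positive, negative) triplet.

    Encourages d(x, y) + margin <= d(x, z), with d the squared Euclidean
    distance; the [.]_+ hinge clips the violation at zero.
    """
    d_pos = np.sum((x - y) ** 2)  # anchor-positive distance
    d_neg = np.sum((x - z) ** 2)  # anchor-negative distance
    return max(d_pos - d_neg + margin, 0.0)
```

When the negative is already further away than the positive by more than the margin, the loss and its gradient vanish, which is exactly why sampling informative triplets matters for this family of losses.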
Similarly, the NCA loss [10] tries to make x closer to y than to any element of a set Z, using exponential weighting:

L_NCA(x, y, Z) = −log( exp(−d(x, y)) / Σ_{z ∈ Z} exp(−d(x, z)) ).  (4)
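The NCA surrogate can be sketched similarly (again a hedged sketch; squared Euclidean distance and the function name are our assumptions):

```python
import numpy as np

def nca_loss(x, y, Z):
    """NCA loss: -log( exp(-d(x, y)) / sum_{z in Z} exp(-d(x, z)) ).

    x is the anchor, y the positive, Z an iterable of negative points;
    d is the squared Euclidean distance.
    """
    d_pos = np.sum((x - y) ** 2)
    d_negs = np.array([np.sum((x - z) ** 2) for z in Z])
    # -log of the ratio splits into the positive distance plus a log-sum-exp
    return d_pos + np.log(np.sum(np.exp(-d_negs)))
```

Unlike the hinge loss, there is no hard margin: pushing any negative further away keeps reducing the loss, albeit with exponentially decaying weight.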
3.2 Sampling and Convergence
Neural networks are trained using a form of stochastic gradient descent, where at each optimization step a stochastic loss is formulated by sampling a subset of the training set T, called a batch B. The size b of a batch is small, e.g. a few tens of images in many modern computer vision architectures. While for classification or regression the loss depends on a single data point from B, the distance-learning losses above depend on at least three data points, i.e. the total number of possible samples is in O(n³) for a training set of n points.
To see this, consider that a common source of triplet supervision is a classification-style labeled dataset: a triplet (x, y, z) is selected such that x and y have the same label while x and z do not. For illustration, consider a case where the n points are distributed evenly between L classes. The number of all possible triplets is then n · (n/L − 1) · (n − n/L) = O(n³).
As a result, in metric learning each batch samples only a very small subset of all possible triplets, on the order of O(b³). Thus, in order to see all triplets during training, one would have to take O(n³/b³) steps, while in the case of classification or regression the needed number of steps is O(n/b). Note that n is on the order of hundreds of thousands, while b is between a few tens and about a hundred, which leads to n/b being in the tens of thousands.
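The counting argument above can be checked with a short script. The values of n, L, and b below are illustrative, not the paper's exact configuration:

```python
def num_triplets(n, L):
    """All (anchor, positive, negative) triplets when n points are split
    evenly into L classes: n anchors, (n/L - 1) positives, (n - n/L) negatives."""
    per_class = n // L
    return n * (per_class - 1) * (n - per_class)

n, L, b = 100_000, 100, 32
steps_to_cover_triplets = num_triplets(n, L) // b ** 3   # O(n^3 / b^3)
steps_to_cover_points = n // b                            # O(n / b)
```

For these values the triplet count is roughly 10^13, and covering all triplets takes about five orders of magnitude more steps than covering each point once.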
Empirically, the convergence rate of the optimization procedure is highly dependent on seeing useful triplets, e.g. triplets which give a large loss value, as motivated in [12]. The authors propose to sample triplets among the data points present in the current batch; this, however, does not address the problem of sampling from the whole set of O(n³) triplets, which is particularly challenging because the number of triplets is so overwhelmingly large.
3.3 Proxy Ranking Loss
To address the above sampling problem, we propose to learn a small set of proxy points P with |P| ≪ n. Intuitively, we would like P to approximate the set of all data points, i.e. for each x there is an element of P which is close to x w.r.t. the distance metric d. We call such an element the proxy for x:

p(x) = argmin_{p ∈ P} d(x, p),  (5)

and denote the proxy approximation error by the worst approximation among all data points:

ε = max_x d(x, p(x)).
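Eq. (5) and the approximation error can be sketched as follows (a minimal sketch, assuming Euclidean distance; the function name is ours):

```python
import numpy as np

def assign_proxies(X, P):
    """Nearest-proxy assignment p(x) = argmin_p d(x, p), plus the worst-case
    approximation error eps = max_x d(x, p(x)).

    X: (n, D) data embeddings; P: (|P|, D) proxies, with |P| << n.
    """
    # pairwise Euclidean distances between every point and every proxy
    d = np.linalg.norm(X[:, None, :] - P[None, :, :], axis=2)
    nearest = d.argmin(axis=1)   # index of the proxy assigned to each point
    eps = d.min(axis=1).max()    # proxy approximation error
    return nearest, eps
```

The (n, |P|) distance matrix stays cheap because |P| is small, which is what makes writing the loss over all proxy triplets feasible in the first place.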
We propose to use these proxies to express the ranking loss, and because the proxy set is smaller than the original training data, the number of triplets would be significantly reduced (see Figure 2). Additionally, since the proxies represent our original data, the reformulation of the loss would implicitly encourage the desired distance relationship in the original training data.
To see this, consider a triplet (x, y, z) for which we are to enforce Eq. (1). By the triangle inequality,

|d(x, y) − d(x, p(y))| ≤ ε  and  |d(x, z) − d(x, p(z))| ≤ ε.

As long as |d(x, y) − d(x, z)| > 2ε, the ordinal relationship between the distances d(x, y) and d(x, z) is not changed when y, z are replaced by the proxies p(y), p(z). Thus, we can bound the expectation of the ranking loss over the training data:

E[L_Ranking(x, y, z)] ≤ E[L_Ranking(x, p(y), p(z))] + Pr(|d(x, y) − d(x, z)| ≤ 2ε).  (6)
Under the assumption that all proxies have norm N_P = ‖p‖ and all embeddings have the same norm N_x = ‖x‖, the bound can be tightened. Note that in this case we have, for any scalar α > 0:

L_Ranking(αx, αy, αz) = L_Ranking(x, y, z),

i.e. the ranking loss is invariant to re-scaling all points by α. However, such re-scaling does affect the distances between the embeddings and the proxies, so we can judiciously choose α to obtain a better bound. A good value is one that makes the embeddings and proxies lie on the same sphere, i.e. α = N_P / N_x. These assumptions prove easy to satisfy; see Section 4.
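The scale-invariance claim above is easy to verify numerically. A small sketch (the helper name is ours):

```python
import numpy as np

def closer(x, y, z, alpha=1.0):
    """True iff alpha*x is closer to alpha*y than to alpha*z (Euclidean)."""
    return np.linalg.norm(alpha * (x - y)) < np.linalg.norm(alpha * (x - z))
```

Since ‖α(x − y)‖ = α‖x − y‖, every comparison of distances, and hence the ranking loss, is unchanged by any α > 0; but α does change the distances between points and fixed proxies, and that is the degree of freedom used above to tighten the bound.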
The ranking loss is difficult to optimize, particularly with gradient-based methods. We argue that many losses, such as the NCA loss [10], the hinge triplet loss [12], and the N-pairs loss [14], are merely surrogates for the ranking loss. In the next section, we show how the proxy approximation can be used to bound the popular NCA loss for distance metric learning.
4 Training
In this section we explain how to use the introduced proxies to train the distance, based on the NCA formulation. We would like to minimize the total loss, defined as a sum over triplets (see Eq. (1)). Instead, we minimize its upper bound, defined as a sum over triplets of an anchor and two proxies (see Eq. (7)).
We perform this optimization by gradient descent, as outlined in Algorithm 1. At each step, we sample a triplet of a data point and two proxies (x, p(y), p(z)), which is defined by a triplet (x, y, z) in the original training data. However, each triplet defined over proxies upper-bounds all triplets whose positive and negative data points have the same proxies as y and z, respectively. This provides a convergence speed-up. The proxies can all be held in memory, and sampling from them is simple. In practice, when an anchor point x is encountered in the batch, one can use its positive proxy as p(y) and all negative proxies as the p(z)'s, forming triplets that cover all points in the data. We back-propagate through both points and proxies, and never need to pause training to re-calculate the proxies.
We train our model with the property that all proxies have the same norm N_P and all embeddings have the same norm N_x. Empirically, such a model performs at least as well as one without this constraint, and it makes the tighter bounds discussed in Section 3.3 applicable. While in the future we plan to incorporate the equal-norm property into the model during training, for the experiments here we simply trained a model with the desired loss and re-scaled all proxies and embeddings to the unit sphere (note that the transformed proxies are only useful for analyzing the effectiveness of the bounds, and are not used during inference).
4.1 Proxy Assignment and Triplet Selection
In the above algorithm we need to assign the proxies for the positive and negative data points. We experiment with two assignment procedures.
When triplets are defined by the semantic labels of data points (the positive data point has the same semantic label as the anchor; the negative a different label), we can associate one proxy with each semantic label: P = {p₁, …, p_L}. Let c(x) be the label of x. We assign to a data point the proxy corresponding to its label: p(x) = p_{c(x)}. We call this static proxy assignment, as it is defined by the semantic label and does not change during the execution of the algorithm. Critically, in this case we no longer need to sample triplets at all. Instead, one just needs to sample an anchor point x and use the anchor's proxy as the positive and all remaining proxies as negatives.
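With static assignment the per-anchor loss requires no sampling at all. A minimal sketch, assuming one proxy per label and squared Euclidean distance (function and variable names are ours):

```python
import numpy as np

def proxy_nca_loss(x, label, proxies):
    """Proxy-NCA loss for one anchor x under static proxy assignment.

    proxies: (L, D) array with one proxy per semantic label. The positive is
    the anchor's own proxy; every other proxy serves as a negative, so the
    loss touches all classes without sampling any triplet.
    """
    d = np.sum((proxies - x) ** 2, axis=1)   # distance to every proxy
    d_neg = np.delete(d, label)              # all negative proxies
    # -log( exp(-d_pos) / sum_q exp(-d_q) ) over the negative proxies
    return d[label] + np.log(np.sum(np.exp(-d_neg)))
```

In training, the same form is applied to every anchor in the batch, and gradients flow into both the embedding x and the proxy matrix.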
In the more general case, however, we might not have semantic labels. We then assign to a point the closest proxy, as defined in Eq. (5). We call this dynamic proxy assignment, and note that it is aligned with the original definition of the proxy. See Section 6 for an evaluation of the two proxy assignment methods.
4.2 Proxy-based Loss Bound
In addition to the motivation for proxies in Sec. 3.3, we also show in the following that the proxy based surrogate losses upper bound versions of the same losses defined over the original training data. In this way, the optimization of a single triplet of a data point and two proxies bounds a large number of triplets of the original loss.
More precisely, if a surrogate loss L over a triplet (x, y, z) can be bounded by the same loss over the proxy triplet,

L(x, y, z) ≤ α L(x, p(y), p(z)) + δ

for constants α, δ > 0, then the following bound holds for the total loss:

Σ_{(x,y,z) ∈ D} L(x, y, z) ≤ α Σ_{x, p, q} n(x, p, q) L(x, p, q) + |D| δ,  (7)

where n(x, p, q) denotes the number of triplets in the training data with anchor x and with proxies p and q for the positive and negative data points, respectively.
The quality of the above bound depends on the constants α and δ, which depend on the loss and, as we will see, on the proxy approximation error ε. We will show for concrete losses that the bound gets tighter as the proxy approximation error becomes small.
The proxy approximation error depends to a degree on the number of proxies, |P|. In the extreme case, the number of proxies equals the number of data points, and the approximation error is zero. Naturally, the fewer the proxies, the higher the approximation error. However, the number of terms in the bound of Eq. (7) is in O(n|P|²); if |P| is comparable to n, the number of samples needed will again be on the order of the original O(n³). We would like to keep the number of terms as small as possible, as motivated in the previous section, while also keeping the approximation error small. Thus, we seek a balance between a small approximation error and a small number of terms in the loss. In our experiments, the number of proxies varies from a few hundred to a few thousand, while the number of data points is in the tens or hundreds of thousands.
Proxy loss bounds
For the following, we assume that the norms of proxies and data points are constant, ‖p‖ = N_P and ‖x‖ = N_x, and for a vector v we denote its unit-normalized version by v̂ = v/‖v‖. Then the original losses are bounded by their proxy versions as follows.
Proposition 4.1. The NCA loss (see Eq. (4)) is proxy bounded:

L_NCA(x, y, Z) ≤ α L̂_NCA(x, p(y), p(Z)) + δ,

where L̂_NCA is defined as L_NCA computed over the unit-normalized data points and proxies, |Z| is the number of negative points used in the triplet, and the constants α and δ depend on the proxy approximation error ε and on |Z|, with the bound tightening as ε → 0.
The margin-based triplet loss (see Eq. (3)) is proxy bounded:

L_triplet(x, y, z) ≤ α L̂_triplet(x, p(y), p(z)) + δ,

where L̂_triplet is defined as L_triplet computed over the unit-normalized data points and proxies.
See Appendix for proofs.
5 Implementation Details
We use the Inception architecture [16] with batch normalization [5]. All methods are first pretrained on the ILSVRC 2012-CLS data [11], and then fine-tuned on the tested datasets. The size of the learned embeddings is set to 64. The inputs are first resized and then randomly cropped to the network's input size. The numbers reported in some prior work are obtained using multiple random crops at test time; for a fair comparison with the other methods, our implementation uses only a center crop at test time. We use the RMSprop optimizer, with the margin multiplier constant decayed at a fixed rate. The only difference we take from the setup described in [15] is that for our proposed method we use a batch size of 32 images (the other methods use larger batches). We do this to illustrate one of the benefits of the proposed method: it does not require large batches. We have experimentally confirmed that the results are stable when larger batch sizes are used for our method.
6 Evaluation
Based on the experimental protocol detailed in [15, 14], we evaluate retrieval at k and clustering quality on data from unseen classes of 3 datasets: CUB200-2011 [17], Cars196 [6], and Stanford Online Products [8].
Clustering quality is evaluated using the Normalized Mutual Information measure (NMI), defined as the ratio of the mutual information of the clustering and the ground truth to the mean of their entropies: NMI(Ω, C) = 2 · I(Ω; C) / (H(Ω) + H(C)). Let Ω = {ω₁, …, ω_K} be the cluster assignments, for example the result of k-means clustering; that is, ω_i contains the instances assigned to the i-th cluster. Let C = {c₁, …, c_K} be the ground-truth classes, where c_j contains the instances from class j.
Note that NMI is invariant to label permutation, which is a desirable property for our evaluation. For more information on clustering quality measures, see [7].
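For completeness, the NMI measure and its invariance to label permutation can be sketched as follows (a non-authoritative sketch using natural-log entropies):

```python
import numpy as np

def nmi(assignments, labels):
    """Normalized mutual information: 2 * I(Omega; C) / (H(Omega) + H(C))."""
    a, c = np.asarray(assignments), np.asarray(labels)
    n = len(c)

    def entropy(x):
        p = np.unique(x, return_counts=True)[1] / n
        return -np.sum(p * np.log(p))

    # mutual information from joint and marginal frequencies
    mi = 0.0
    for ai in np.unique(a):
        for cj in np.unique(c):
            p_joint = np.mean((a == ai) & (c == cj))
            if p_joint > 0:
                mi += p_joint * np.log(p_joint / (np.mean(a == ai) * np.mean(c == cj)))
    return 2.0 * mi / (entropy(a) + entropy(c))
```

Swapping cluster indices leaves every joint count unchanged, so the score is indeed invariant to label permutation.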
We compare our proxy-based method with 4 state-of-the-art deep metric learning approaches: triplet learning with semi-hard negative mining [12], Lifted Structured Embedding [8], the N-Pairs deep metric loss [14], and Learnable Structured Clustering [15]. In all our experiments we use the same data splits as [15].
6.1 Cars196 dataset
The Cars196 dataset [6] is a fine-grained car category dataset containing 16,185 images of 196 car models. Classes are at the level of make-model-year, for example, Mazda-3-2011. In our experiments we split the dataset such that 50% of the classes are used for training and 50% for evaluation. Table 1 shows recall-at-k and NMI scores for all methods on the Cars196 dataset. Proxy-NCA achieves a 15 percentage point (26% relative) improvement in recall@1 over the previous state of the art, and a 6 percentage point gain in NMI. Figure 3 shows example retrieval results on the test set of the Cars196 dataset.
| Method | R@1 | R@2 | R@4 | R@8 | NMI |
|---|---|---|---|---|---|
| Triplet Semihard [12] | 51.54 | 63.78 | 73.52 | 81.41 | 53.35 |
| Lifted Struct [8] | 52.98 | 66.70 | 76.01 | 84.27 | 56.88 |
| Struct Clust [15] | 58.11 | 70.64 | 80.27 | 87.81 | 59.04 |
6.2 Stanford Online Products dataset
The Stanford Online Products dataset contains 120,053 images of 22,634 products downloaded from eBay.com. For training, 59,551 images (11,318 classes) are used, and 11,316 classes (60,502 images) are held out for testing. This dataset is more challenging, as each product has only about 5 images; at first glance it seems well suited for tuple-sampling approaches, and less so for our proxy formulation. Note, however, that holding 11,318 float proxies of dimension 64 in memory takes less than 3 MB. Figure 4 shows recall-at-1 results on this dataset. Proxy-NCA has over a 6% gap from the previous state of the art. Proxy-NCA compares favorably on clustering as well, with an NMI score of 90.6, compared with 89.48 for the top prior method, described in [15]. The difference is statistically significant.
Figure 5 shows example retrieval results on images from the Stanford Product dataset. Interestingly, the embeddings show a high degree of rotation invariance.
6.3 CUB200 dataset
The Caltech-UCSD Birds-200-2011 dataset [17] contains 11,788 images of 200 fine-grained bird species. We use the first 100 classes as training data for the metric learning methods, and the remaining 100 classes for evaluation. Table 2 compares Proxy-NCA with the baseline methods. Birds are notoriously hard to classify, as the intra-class variation is quite large compared with the inter-class variation, and this is apparent in the table: all methods perform less well than on the other datasets. Proxy-NCA improves on the state of the art in recall at 1 and 2, and on the clustering metric.
| Method | R@1 | R@2 | R@4 | R@8 | NMI |
|---|---|---|---|---|---|
| Triplet Semihard [12] | 42.59 | 55.03 | 66.44 | 77.23 | 55.38 |
| Lifted Struct [8] | 43.57 | 56.55 | 68.59 | 79.63 | 56.50 |
| Struct Clust [15] | 48.18 | 61.44 | 71.83 | 81.92 | 59.23 |
6.4 Convergence Rate
The tuple-sampling problem that affects most metric learning methods makes them slow to train. By keeping all proxies in memory, we eliminate the need for sampling tuples and mining hard negatives to form them. Furthermore, the proxies act as a memory that persists between batches. This greatly speeds up learning. Figure 1 compares the training speed of all methods on the Cars196 dataset. Proxy-NCA trains much faster than the other metric learning methods, converging about three times as fast.
6.5 Fractional Proxy Assignment
Metric learning at times requires learning from a very large set of semantic labels; Section 6.2 shows an example of such a label set. Even though Proxy-NCA works well in that instance, and the memory footprint of the proxies is small, here we examine the case where one's computational budget does not allow a one-to-one assignment of proxies to semantic labels. Figure 6 shows the results of an experiment in which we vary the ratio of labels to proxies on the Cars196 dataset. We modify our static proxy assignment method to randomly pre-assign semantic labels to proxies: if the number of proxies is smaller than the number of labels, multiple labels are assigned to the same proxy, so in effect each semantic label has influence over only a fraction of a proxy. Note that even with fewer proxies than labels, Proxy-NCA has better performance than previous methods.
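The random pre-assignment described above amounts to one line of code. A sketch under our own naming and the assumption of a uniform random choice:

```python
import numpy as np

def label_to_proxy(num_labels, num_proxies, seed=0):
    """Randomly pre-assign each semantic label to one of num_proxies proxies.

    When num_proxies < num_labels, several labels share a proxy, so each
    label effectively controls only a fraction of a proxy.
    """
    rng = np.random.default_rng(seed)
    return rng.integers(0, num_proxies, size=num_labels)
```

For example, with 196 labels and a budget of 98 proxies, each proxy is shared by two labels on average.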
6.6 Dynamic Proxy Assignment
In many cases, the assignment of triplets, i.e. the selection of a positive and a negative example to use with the anchor instance, is based on a semantic concept: two images of a dog need to be more similar than an image of a dog and an image of a cat. These cases are easily handled by our static proxy assignment, which was covered in the experiments above. In some cases, however, there are no semantic concepts to rely on, and a dynamic proxy assignment is needed. In this section we show results using this assignment scheme. Figure 7 shows recall scores for the Cars196 dataset using the dynamic assignment. The optimization becomes harder to solve, specifically due to the non-differentiable argmin term in Eq. (5). It is nevertheless interesting to note that, first, a modest budget of proxies per semantic concept is again enough to improve on the state of the art, and second, one does see some benefit from expanding the proxy budget beyond the number of semantic concepts.
In this paper we have demonstrated the effectiveness of using proxies for the task of deep metric learning. The proxies are kept in memory and trained with back-propagation; with them, training time is reduced and the resulting models achieve a new state of the art. We presented two proxy assignment schemes: a static one, which can be used when semantic label information is available, and a dynamic one, which is used when the only supervision comes in the form of similar and dissimilar triplets. Furthermore, we showed that a loss defined over proxies upper-bounds the original, instance-based loss, and that if the proxies and instances have constant norms, a well-optimized proxy-based model does not change the ordinal relationship between pairs of instances.
Our formulation of the Proxy-NCA loss is very similar to the standard cross-entropy loss used in classification. However, we arrive at it from a different direction: we are not interested in the actual classifier, and indeed discard the proxies once the model has been trained. Instead, the proxies are auxiliary variables that enable more effective optimization of the embedding model parameters. As such, our formulation not only allows us to surpass the state of the art in zero-shot learning, but also offers an explanation of the effectiveness of the standard trick of training a classifier and using the output of its penultimate layer as an embedding.
The authors would like to thank Hossein Mobahi, Zhen Li, Hyun Oh Song, Vincent Vanhoucke, and Florian Schroff for helpful discussions.
Proof of Proposition 4.1: In the following, for a vector v we denote its unit-norm version by v̂ = v/‖v‖.

First, we can upper bound the dot product of the unit-normalized data points x̂ and ŷ by the dot product of the unit-normalized point x̂ and proxy p̂(y), using the Cauchy-Schwarz inequality:

x̂ᵀŷ = x̂ᵀp̂(y) + x̂ᵀ(ŷ − p̂(y)) ≤ x̂ᵀp̂(y) + ‖ŷ − p̂(y)‖.

Similarly, one can obtain an upper bound for the negative dot product:

−x̂ᵀẑ ≤ −x̂ᵀp̂(z) + ‖ẑ − p̂(z)‖.

Using the above two bounds, we can upper bound the original NCA loss over unit-normalized points by its proxy version.
Further, we can upper bound the above loss over unit-normalized vectors by a loss over the unnormalized vectors. For this, we make the assumption, which we have found to hold empirically, that ‖x‖ ≥ 1 for all data points x; in practice these norms are much larger than 1.
Lastly, under the assumptions above, we can apply a version of the Hölder inequality, defined for positive real numbers, to upper bound the sum of exponential terms.
Hence, the loss over unit-normalized points is bounded by its proxy version. Finally, under the assumption that the data points and the proxies have constant norms, we can convert the above dot products of unit-normalized vectors into dot products of the unnormalized points, which completes the proof.
-  M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
-  S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, 2005.
-  R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In Computer vision and pattern recognition, 2006 IEEE computer society conference on, volume 2, pages 1735–1742. IEEE, 2006.
-  J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe. Deep clustering: Discriminative embeddings for segmentation and separation. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 31–35. IEEE, 2016.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv: 1502.03167, 2015.
-  J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 554–561, 2013.
-  C. D. Manning, P. Raghavan, H. Schütze, et al. Introduction to information retrieval. Cambridge university press Cambridge, 2008.
-  H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  O. Rippel, M. Paluri, P. Dollar, and L. Bourdev. Metric learning with adaptive density discrimination. arXiv preprint arXiv:1511.05939, 2015.
-  S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood component analysis. Adv. Neural Inf. Process. Syst.(NIPS), 2004.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
-  F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
-  M. Schultz and T. Joachims. Learning a distance metric from relative comparisons. In NIPS, volume 1, page 2, 2003.
-  K. Sohn. Improved deep metric learning with multi-class n-pair loss objective. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 1857–1865. Curran Associates, Inc., 2016.
-  H. O. Song, S. Jegelka, V. Rathod, and K. Murphy. Learnable structured clustering framework for deep metric learning. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv: 1409.4842, 2014.
-  C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.
-  K. Q. Weinberger, J. Blitzer, and L. Saul. Distance metric learning for large margin nearest neighbor classification. Advances in neural information processing systems, 18:1473, 2006.
-  S. Zheng, Y. Song, T. Leung, and I. Goodfellow. Improving the robustness of deep neural networks via stability training. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.