Introduction
Research in deep metric learning studies techniques for training deep neural networks to learn similarities and dissimilarities between data samples, typically by learning a distance metric via feature embeddings in $\mathbb{R}^n$. Most extensively, deep metric learning is used in face recognition Schroff et al. (2015); Liu et al. (2017); Hermans et al. (2017) and other computer vision tasks Tack et al. (2020); Chen et al. (2020a) where there is an abundance of label values.
Common deep metric learning techniques include contrastive loss Hadsell et al. (2006) and triplet loss Schroff et al. (2015). Moreover, each of these methods has variants to address specific applications. SimCLR Chen et al. (2020a, b), for example, is a recent contrastive loss variant designed to address unsupervised deep metric learning, with state-of-the-art performance on ImageNet Russakovsky et al. (2015). Ladder Loss Zhou et al. (2019), a generalized variant of triplet loss, improved upon existing methods for coherent visual-semantic embedding and has important applications in multiple visual and language understanding tasks Karpathy et al. (2014); Ma et al. (2015); Vinyals et al. (2014). Given the success of metric learning in a wide range of applications, we see value in investigating its underlying theories. In this paper, we present a theoretical framework which explains observed but previously unexplained behaviors of the Triplet Loss.
Literature Review
We choose to analyze the Triplet Loss's underlying theory due to its strong dependence on the triplet selection strategy. This makes the Triplet Loss fickle to work with, as empirical results have shown that randomly sampling these triplets yields unsatisfactory results. On the other hand, successful triplet selection strategies like hard negative mining can face issues like network collapse, a phenomenon where the network projects all data points onto a single point Schroff et al. (2015), while more stable triplet selection strategies do not perform as well in practice Hermans et al. (2017).
In the original FaceNet paper, Schroff et al. find that with large batch sizes (on the order of thousands), hard negative mining led to collapsed solutions. To address this, they instead used a strategy they called semi-hard mining Schroff et al. (2015). On the other hand, Hermans et al. find that with smaller batch sizes, the hardest mining strategy significantly outperforms other mining strategies and does not suffer from collapsed solutions Hermans et al. (2017). These seemingly contradictory results showcase the need for a theoretical framework to explain the theory of hard negative mining and the root cause of collapsed solutions.
There has been some prior literature investigating the phenomenon of network collapse. Xuan et al. show that hard negative mining leads to collapsed solutions by analyzing the gradients of a simplified neural network model Xuan et al. (2020). However, they do not account for the many cases where hard negative mining does work. Levi et al. prove that, under a label randomization assumption, the globally optimal solution to the triplet loss necessarily exhibits network collapse Levi et al. (2021). Rather than investigating functional hard mining strategies, Levi et al. instead suggest using the less effective easy positive mining to avoid network collapse.
In the literature, there are plenty of claims that hard negative mining succeeds Hermans et al. (2017); Faghri et al. (2017), and numerous examples where it fails Schroff et al. (2015); Ge et al. (2019); Oh Song et al. (2016). Our work explains why network collapse happens by using the theory of isometric approximation to better characterize the behavior of the Triplet Loss.
Background and Definitions
Establishing the notation used in the paper, let $\mathcal{M}$ be the data manifold and let $\mathcal{C} = \{1, \dots, c\}$ be the classes, with $c$ being the number of classes. Let $h : \mathcal{M} \to \mathcal{C}$ be the true hypothesis function, or true labels of the data. Then the dataset consists of pairs $(x_i, y_i)$ with $x_i \in \mathcal{M}$ and $y_i = h(x_i)$. We define the learned neural network as a function $f : \mathcal{M} \to \mathbb{R}^n$ which maps similar points in the data manifold to similar points in $\mathbb{R}^n$.
As our paper focuses on metric learning, we define the similarity between embeddings $u, v \in \mathbb{R}^n$ to be the Euclidean distance $d(u, v) = \| u - v \|_2$. Further, we define the shorthand $d_f(x, y) = d\big(f(x), f(y)\big)$ where $x, y \in \mathcal{M}$.
Triplet Loss and Hard Negative Mining
In this section, we discuss the Triplet Loss, one of the more successful approaches to supervised metric learning, introduced by Schroff et al. Schroff et al. (2015). The Triplet Loss considers samples as triplets of data, composed of the anchor $x_a$, positive $x_p$, and negative $x_n$ samples, described in (1). The similarity relation (1a) requires that the anchor and positive samples be of the same class, while the dissimilarity relation (1b) requires that the anchor and negative be of different classes.
(1a) $h(x_a) = h(x_p)$
(1b) $h(x_a) \neq h(x_n)$
Restating the objective of supervised metric learning, the embedding of the anchor sample must be closer to the positive than to the negative for every triplet. An example of a satisfactory triplet is shown in Figure 1. Formally, we express this relation via (2), where $\alpha$ is the margin term.
(2) $d_f(x_a, x_p) + \alpha \le d_f(x_a, x_n)$
This leads to the definition of the Triplet Loss in (3).
(3) $L_{\text{triplet}}(f) = \sum_{(x_a, x_p, x_n)} \max\big( d_f(x_a, x_p) - d_f(x_a, x_n) + \alpha,\; 0 \big)$
The $\max(\,\cdot\,, 0)$ function zeroes negative values in order to ignore all the triplets that already satisfy the desired relation. In addition, as the margin $\alpha$ adds only a constant value to the loss function, its effect is negligible for small $\alpha$. Therefore, we will assume a zero value for the margin ($\alpha = 0$) for the remainder of this paper.
Definition 1.
Triplet-Separated. We refer to nonempty subsets $S_1, \dots, S_c \subset \mathbb{R}^n$ as Triplet-Separated if for every $u, v \in S_i$ and $w \in S_j$ with $i \neq j$ we have
(4) $\| u - v \| \le \| u - w \|$
This property can be extended to a function $f$ by checking whether the embedding subsets (5) are Triplet-Separated.
(5) $S_i = f\big( h^{-1}(i) \big), \quad i \in \mathcal{C}$
It is worth noting that $L_{\text{triplet}}(f) = 0$ (with $\alpha = 0$) if and only if $f$ is Triplet-Separated. An example of two Triplet-Separated sets is shown in Figure 1.
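For finite embedded sets, Definition 1 can be checked directly by brute force; a small NumPy sketch (the function name is ours):

```python
import numpy as np

def is_triplet_separated(points, labels):
    """Check Definition 1 on a finite embedding: every same-class pair must
    be at least as close as any cross-class pair sharing the same anchor."""
    pts = np.asarray(points, float)
    labels = np.asarray(labels)
    N = len(pts)
    for a in range(N):
        for p in range(N):
            if a == p or labels[a] != labels[p]:
                continue  # (a, p) must be a same-class pair
            d_ap = np.linalg.norm(pts[a] - pts[p])
            for q in range(N):
                if labels[q] == labels[a]:
                    continue  # (a, q) must be a cross-class pair
                if d_ap > np.linalg.norm(pts[a] - pts[q]):
                    return False
    return True
```

On Triplet-Separated inputs the Triplet Loss with zero margin vanishes, matching the remark above.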
As mentioned in the Literature Review section, the Triplet Loss relies heavily on its triplet mining strategy to achieve its performance for two popularly accepted reasons: First, enumerating all triplets of data every iteration would be too computationally intensive to guarantee fast training. Second, improper sampling of triplets risks network collapse Xuan et al. (2020). Our work substantiates the use of hard negative mining, a successful triplet mining strategy, by characterizing conditions that lead to network collapse.
Isometric Approximation
We will present a novel application of the isometric approximation theorem on subsets of Euclidean space in order to mathematically justify hard negative mining. The isometric approximation theorem primarily characterizes the behavior of near-isometries, functions that are close to isometries, as given by Definition 2.
Definition 2.
$\varepsilon$-near-isometry. Let $E$ and $F$ be real normed spaces. A function $f : A \to F$ where $A \subseteq E$ is called an $\varepsilon$-near-isometry ($\varepsilon \ge 0$) if
(6) $\big|\, \| f(x) - f(y) \| - \| x - y \| \,\big| \le \varepsilon \quad \text{for all } x, y \in A$
In other words, an $\varepsilon$-near-isometry is a function that preserves the distance metric within $\varepsilon$. The isometric approximation theorem seeks to determine how close $f$ is to an isometry, say $T$, as given by (7). Note $c_A(\varepsilon)$ is a function of $\varepsilon$ that is fixed for a given $A$ and is thus independent of $f$. Consequently, inequality (7) holds for all $x \in A$ and all $\varepsilon$-near-isometries $f$.
(7) $\sup_{x \in A} \| f(x) - T(x) \| \le c_A(\varepsilon)$
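On a finite point set, the smallest $\varepsilon$ for which a map is an $\varepsilon$-near-isometry can be computed directly from Definition 2; a NumPy sketch (names ours):

```python
import numpy as np

def nearisometry_eps(X, FX):
    """Smallest eps for which the map X[i] -> FX[i] is an eps-near-isometry
    on the finite set X (Definition 2): the largest pairwise-distance change."""
    X, FX = np.asarray(X, float), np.asarray(FX, float)
    dX = np.linalg.norm(X[:, None] - X[None, :], axis=-1)   # domain distances
    dF = np.linalg.norm(FX[:, None] - FX[None, :], axis=-1) # image distances
    return float(np.abs(dF - dX).max())
```

Translations and rotations give $\varepsilon = 0$, while any distortion of pairwise distances shows up as a positive $\varepsilon$.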
Now consider the case where $E$ and $F$ are $n$-dimensional Euclidean metric spaces, making $A \subseteq \mathbb{R}^n$. Then the following theorems and definitions Väisälä (2002); Vaisala (2002); Alestalo et al. (2001) prove that $c_A(\varepsilon)$ is linear in $\varepsilon$ given a thickness condition on the set $A$.
Definition 3.
Thickness. For each unit vector $e \in \mathbb{R}^n$, define the projection $\pi_e : \mathbb{R}^n \to \mathbb{R}$ by the dot product
(8a) $\pi_e(x) = x \cdot e$
The thickness of a bounded set $A \subset \mathbb{R}^n$ is the number
(8b) $\theta(A) = \inf_{\| e \| = 1} \operatorname{diam}\big( \pi_e(A) \big)$
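Definition 3 involves an infimum over all unit directions; for a finite point set it can be approximated by sampling directions. A Monte-Carlo sketch in NumPy (the sampling approach and parameter names are ours):

```python
import numpy as np

def thickness(points, num_dirs=2000, seed=0):
    """Monte-Carlo estimate of the thickness in Definition 3: the smallest
    diameter of the set's one-dimensional projections, minimized over
    sampled unit directions."""
    pts = np.asarray(points, float)
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(num_dirs, pts.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # unit vectors e
    proj = pts @ dirs.T                                  # dot products x . e
    return float((proj.max(axis=0) - proj.min(axis=0)).min())
```

For collinear points the estimate is near zero, since some projection flattens the set, while the corners of a unit square have thickness 1; because the infimum is sampled, the estimate can only overestimate the true thickness.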
Theorem 1 (From Theorem 3.3 Alestalo et al. (2001)).
Suppose that $c > 0$ and $A \subset \mathbb{R}^n$ is a compact set with $\theta(A) \ge c \cdot \operatorname{diam}(A)$. Let $f : A \to \mathbb{R}^n$ be an $\varepsilon$-near-isometry. Then there is an isometry $T : \mathbb{R}^n \to \mathbb{R}^n$ such that
(9) $\sup_{x \in A} \| f(x) - T(x) \| \le C_n\, \varepsilon$
with $C_n$ depending only on dimension.
As this property depends entirely on the set $A$, we refer to the conclusion of Theorem 1 as the $c$-Isometric Approximation Property ($c$-IAP) on a set $A$ with $\operatorname{diam}(A) = 1$.
Theory and Proofs
Overview and Problem Setup
From the background and definitions, the goal of the Triplet Loss is to learn a function $f$ such that the induced distance metric $d_f$ satisfies the property in (2). However, rather than starting with the Triplet Loss and then introducing hard negative mining as a method to correct the weaknesses of the Triplet Loss, we instead derive hard negative mining and the Triplet Loss together, using the Hausdorff-like distance as a starting point.
Hausdorff-Like Distance
Reiterating the training objective from the problem setup, we aim to learn a function $f$ that is Triplet-Separated (Definition 1). We restate this problem as a distance-minimization problem, and prove that it is equivalent to hard negative mining with the Triplet Loss.
First we construct the set of all functions that are Triplet-Separated and denote it with $\mathcal{T}$. We next construct the Hausdorff-like distance metric (denoted by $\rho$) between these functions, which compares the embedding subsets via the Hausdorff distance metric $d_H$.
(10) $d_H(X, Y) = \max\Big\{ \sup_{u \in X} \inf_{v \in Y} \| u - v \|,\; \sup_{v \in Y} \inf_{u \in X} \| u - v \| \Big\}$
(11) $\rho(f, g) = \max_{i \in \mathcal{C}} d_H\big( f(h^{-1}(i)),\, g(h^{-1}(i)) \big)$
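For finite embeddings, both $d_H$ and the Hausdorff-like distance can be computed directly; a NumPy sketch, assuming a class-wise reading of (11) (the function names are ours):

```python
import numpy as np

def hausdorff(X, Y):
    """Hausdorff distance d_H between two finite point sets, as in (10)."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    D = np.linalg.norm(X[:, None] - Y[None, :], axis=-1)
    # max over X of nearest-Y distance, and symmetrically
    return float(max(D.min(axis=1).max(), D.min(axis=0).max()))

def hausdorff_like(f_emb, g_emb, labels):
    """Hausdorff-like distance rho, read here as the largest class-wise
    Hausdorff distance between two embeddings of the same labeled points."""
    f_emb, g_emb = np.asarray(f_emb, float), np.asarray(g_emb, float)
    labels = np.asarray(labels)
    return max(hausdorff(f_emb[labels == c], g_emb[labels == c])
               for c in np.unique(labels))
```

Under this reading, $\rho$ is driven by the worst-matched class, which is what the minimization in (12) penalizes.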
One way to solve metric learning is to find the function $f$ that is closest to any function in $\mathcal{T}$, as indicated by (12).
(12) $\rho(f, \mathcal{T}) = \inf_{g \in \mathcal{T}} \rho(f, g)$
We claim that the Triplet Loss with hard negative mining is equivalent to minimizing $\rho(f, \mathcal{T})$ within a constant factor (see Corollary 1).
Isometric Approximation Applied to $\rho$
In this section, we present Theorem 2 to show that minimizing the Hausdorff-like distance $\rho(f, \mathcal{T})$ is equivalent to minimizing a discrepancy in distance metrics, referred to as the isometric error (Definition 4).
Definition 4.
Isometric error. For two functions $f, g : \mathcal{M} \to \mathbb{R}^n$, we define the isometric error $\varepsilon(f, g)$ to be the maximum discrepancy between their distance metrics.
(13) $\varepsilon(f, g) = \sup_{x, y \in \mathcal{M}} \big|\, d_f(x, y) - d_g(x, y) \,\big|$
Similar to (12), we extend the definition of isometric error to $\mathcal{T}$ as follows:
(14) $\varepsilon(f, \mathcal{T}) = \inf_{g \in \mathcal{T}} \varepsilon(f, g)$
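On a finite sample, the isometric error of Definition 4 is the largest entry-wise gap between two pairwise-distance matrices; a NumPy sketch (names ours):

```python
import numpy as np

def isometric_error(f_emb, g_emb):
    """Isometric error (13) between two embeddings of the same finite
    sample: the maximum discrepancy between their distance metrics."""
    F = np.asarray(f_emb, float)
    G = np.asarray(g_emb, float)
    dF = np.linalg.norm(F[:, None] - F[None, :], axis=-1)
    dG = np.linalg.norm(G[:, None] - G[None, :], axis=-1)
    return float(np.abs(dF - dG).max())
```

The error is zero exactly when one embedding is an isometric copy of the other on the sample, which is why minimizing it over $\mathcal{T}$ connects to the isometric approximation theorem.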
Lemma 1.
If $g \in \mathcal{T}$ and $\varepsilon(f, g) = \varepsilon$, then there is a function $g'$ isometric to $g$ such that:
(15) $\rho(f, g') \le C_n\, \varepsilon$
Proof.
If $g$ is invertible, then $f \circ g^{-1}$ is a function $g(\mathcal{M}) \to \mathbb{R}^n$. It is an $\varepsilon$-near-isometry because $\varepsilon(f, g) = \varepsilon$. Then if $\theta\big(g(\mathcal{M})\big) \ge c \cdot \operatorname{diam}\big(g(\mathcal{M})\big)$, the conditions for Theorem 1 are satisfied, so there exists an isometry $T$ such that
(16) $\sup_{z \in g(\mathcal{M})} \big\| f(g^{-1}(z)) - T(z) \big\| \le C_n\, \varepsilon$
Then
(17) $\sup_{x \in \mathcal{M}} \big\| f(x) - (T \circ g)(x) \big\| \le C_n\, \varepsilon$
Therefore if $g$ is invertible, (15) holds with $g' = T \circ g$.
If $g$ is not invertible, then there exist $x_1 \neq x_2$ such that $g(x_1) = g(x_2)$. We divide the elements of $\mathcal{M}$ into subsets $\mathcal{M}_1$ and $\mathcal{M}_2$ such that $g$ is invertible on $\mathcal{M}_1$, $g(\mathcal{M}_1) = g(\mathcal{M})$, and the argument above is unchanged on $\mathcal{M}_1$. Consequently, (17) holds on $\mathcal{M}_1$.
Moving our attention to $\mathcal{M}_2$, for every $x_2 \in \mathcal{M}_2$ there exists $x_1 \in \mathcal{M}_1$ such that $g(x_2) = g(x_1)$. Then, because $d_g(x_1, x_2) = 0$ and $\varepsilon(f, g) = \varepsilon$, we have $\| f(x_1) - f(x_2) \| \le \varepsilon$, so the bound in (17) degrades by at most $\varepsilon$ on $\mathcal{M}_2$. Therefore (15) holds for $g' = T \circ g$ on both $\mathcal{M}_1$ and $\mathcal{M}_2$. ∎
Theorem 2.
$\rho(f, \mathcal{T})$ and $\varepsilon(f, \mathcal{T})$ upper bound each other within a linear factor for all $f$ whose embedding sets satisfy the minimum thickness condition of Theorem 1.
Proof.
We first prove that $\varepsilon(f, \mathcal{T})$ upper bounds $\rho(f, \mathcal{T})$. To this end, fix the minimizing function $g$ in the following expression:
(18) $g = \arg\min_{\tilde{g} \in \mathcal{T}} \varepsilon(f, \tilde{g})$
From Lemma 1 we have that:
(19) $\rho(f, g') \le C_n\, \varepsilon(f, g)$
with $g'$ isometric to $g$, so that $g' \in \mathcal{T}$. From the definition (12) we have (20), combining (19) and (20) gives (21), and since $g$ attains $\varepsilon(f, \mathcal{T})$ in (18) we conclude (22):
(20) $\rho(f, \mathcal{T}) \le \rho(f, g')$
(21) $\rho(f, \mathcal{T}) \le C_n\, \varepsilon(f, g)$
(22) $\rho(f, \mathcal{T}) \le C_n\, \varepsilon(f, \mathcal{T})$
For the converse claim that $\rho(f, \mathcal{T})$ upper bounds $\varepsilon(f, \mathcal{T})$, we once again fix the $g$ that minimizes the following expression:
(23) $g = \arg\min_{\tilde{g} \in \mathcal{T}} \rho(f, \tilde{g})$
It is worth noting that, by the triangle inequality on the embeddings, (25) holds for all $x, y \in \mathcal{M}$. Furthermore, we can swap $f$ and $g$ in (25) and use (23) to get (26) and thus (27).
(25) $d_f(x, y) \le d_g(x, y) + 2\rho(f, g)$
(26) $\varepsilon(f, g) \le 2\rho(f, g) = 2\rho(f, \mathcal{T})$
(27) $\varepsilon(f, \mathcal{T}) \le 2\rho(f, \mathcal{T})$
(27) proves that $\rho(f, \mathcal{T})$ upper bounds $\varepsilon(f, \mathcal{T})$ within a constant factor of $2$. ∎
Theorem 2 shows that $\rho(f, \mathcal{T})$ and $\varepsilon(f, \mathcal{T})$ are exchangeable as minimization objectives because they upper bound each other within linear factors. Now that we have rewritten the minimization objective as a difference of two distance functions, we can derive the Triplet Loss.
Recovering the Triplet Loss
In this section, we will prove that the isometric error $\varepsilon(f, \mathcal{T})$ (Definition 4) is equivalent to the Triplet Loss sampled by hard negative mining.
Theorem 3.
The Triplet Loss sampled by hard negative mining and the isometric error $\varepsilon(f, \mathcal{T})$ upper bound each other within a linear factor.
We present the proof for Theorem 3 in Appendix A.
Corollary 1.
The optimal solution to the Triplet Loss sampled by hard negative mining is equivalent to the optimal solution to $\rho(f, \mathcal{T})$ within a constant factor.
Proof.
The proof follows from Theorems 2 and 3, where we show that $\rho(f, \mathcal{T})$, $\varepsilon(f, \mathcal{T})$, and the Triplet Loss sampled by hard negative mining upper and lower bound each other by constant factors. Consequently, the optimal solution to the Triplet Loss sampled by hard negative mining and the optimal solution to $\rho(f, \mathcal{T})$ are equivalent within a constant factor. ∎
Illustrative Examples
In this section, we illustrate the key ideas of the previous section's theorems by using a toy example with a small set of labeled points in a low-dimensional embedding space. As we will illustrate the equivalence between the Triplet Loss, the Hausdorff-like distance, and the isometric error, we can visualize the embedding points without any underlying data or neural network. See Figure 2 for the toy example setup.
First, we will visualize $\rho(f, \mathcal{T})$ in Figure 3. The numerical value of $\rho$ is determined by the maximum length of the arrows (see the caption of Figure 3), which is marked in the figure with black outlines. Here, we compute the ideal $g \in \mathcal{T}$ by optimizing the embedding points.
Figure 4 illustrates $\varepsilon(f, \mathcal{T})$, which measures the discrepancy in distance metric. Note that the $g$ that minimizes $\varepsilon(f, g)$ is not necessarily the same as the $g$ that minimizes $\rho(f, g)$. Revisiting the second part of the proof of Theorem 2, $\varepsilon(f, \mathcal{T})$ is lower bounded by $\rho(f, \mathcal{T}) / C_n$ and upper bounded by $2\rho(f, \mathcal{T})$. For this specific toy example, the computed bounds are nearly tight, making $\varepsilon(f, \mathcal{T})$ an almost linear factor of $\rho(f, \mathcal{T})$. This essentially illustrates Theorem 2.
Next, we show the Triplet Loss sampled by hard negative mining in Figure 5. The equivalence proved by Theorem 3 is shown by comparing Figure 5 against Figure 4, as the triplet selected by hard negative mining corresponds to the same three points with the largest discrepancies in distance metric. Through Figures 3, 4, and 5, we have a visualization of the statement and proof of Corollary 1.
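The correspondence in this section can also be checked numerically. The sketch below builds a small one-dimensional example (our own numbers, not the paper's figures) and verifies one direction of Theorem 3: the hard-negative-mined Triplet Loss is bounded by twice the isometric error to any Triplet-Separated embedding $g$, and hence to $\mathcal{T}$:

```python
import numpy as np

def pdist(P):
    """Pairwise Euclidean distance matrix for a finite point set."""
    P = np.asarray(P, float)
    return np.linalg.norm(P[:, None] - P[None, :], axis=-1)

def hard_negative_triplet_loss(emb, labels):
    """Largest triplet loss under hard negative mining (zero margin)."""
    D, labels = pdist(emb), np.asarray(labels)
    worst = 0.0
    for a in range(len(labels)):
        pos = (labels == labels[a]) & (np.arange(len(labels)) != a)
        neg = labels != labels[a]
        if pos.any() and neg.any():
            worst = max(worst, D[a][pos].max() - D[a][neg].min())
    return max(worst, 0.0)

def isometric_error(f_emb, g_emb):
    """Isometric error of Definition 4 on a finite sample."""
    return float(np.abs(pdist(f_emb) - pdist(g_emb)).max())

# f: a non-separated embedding; g: a hand-picked Triplet-Separated one.
f = [[0.0], [2.0], [1.0], [3.0]]
g = [[0.0], [0.5], [3.0], [3.5]]
labels = [0, 0, 1, 1]
loss = hard_negative_triplet_loss(f, labels)     # evaluates to 1.0 here
bound = 2.0 * isometric_error(f, g)              # 2 * eps(f, g) >= 2 * eps(f, T)
```

Any Triplet-Separated $g$ gives a valid bound because $\varepsilon(f, \mathcal{T})$ is an infimum over $\mathcal{T}$.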
Discussion
Novel Insights on Network Collapse
As mentioned in the Literature Review section, current literature observes that hard negative mining results in network collapse inconsistently. We propose a theory that explains hard negative mining’s intermittent behavior.
We hypothesize that network collapse happens when the $g \in \mathcal{T}$ that minimizes $\rho(f, g)$ is a collapsed function, that is, when the "nearest" Triplet-Separated function maps all the data points onto a much smaller subset. An example of this effect can be seen in Figure 6, where we have 20 random data points with a collapsed $g$.
We observe that when the number of samples $N$ is much greater than the embedding dimension $n$, the ideal counterpart $g$ is much more likely to collapse. On the contrary, when $N$ is comparable to $n$, the network is much less likely to collapse. Therefore, when training a network with a batch-hard negative mining strategy, where the network learns on the hardest triplet in a single batch, we expect batch-hard mining to fail when $N \gg n$, and not to fail otherwise.
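This hypothesis can be probed with a simple simulation. The sketch below performs anchor-only subgradient steps of the batch-hard Triplet Loss (a simplification of the full gradient, and entirely our own construction) and tracks the embedding spread, which approaches zero under collapse:

```python
import numpy as np

def batch_hard_step(emb, labels, lr=0.05):
    """One anchor-only subgradient step of the batch-hard Triplet Loss
    (zero margin): each anchor moves toward its hardest (farthest) positive
    and away from its hardest (closest) negative when the triplet is active."""
    emb = np.asarray(emb, float).copy()
    labels = np.asarray(labels)
    N = len(labels)
    D = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)
    grad = np.zeros_like(emb)
    for a in range(N):
        pos_mask, neg_mask = same[a], labels != labels[a]
        if not pos_mask.any() or not neg_mask.any():
            continue
        p = int(np.argmax(np.where(pos_mask, D[a], -np.inf)))
        q = int(np.argmin(np.where(neg_mask, D[a], np.inf)))
        if D[a, p] - D[a, q] > 0:            # active hard triplet
            if D[a, p] > 0:
                grad[a] += (emb[a] - emb[p]) / D[a, p]
            if D[a, q] > 0:
                grad[a] -= (emb[a] - emb[q]) / D[a, q]
    return emb - lr * grad

def spread(emb):
    """Largest pairwise distance; values near zero indicate collapse."""
    emb = np.asarray(emb, float)
    return float(np.linalg.norm(emb[:, None] - emb[None, :], axis=-1).max())
```

Iterating batch_hard_step on random points while tracking spread lets one compare the $N \gg n$ and $N \approx n$ regimes discussed above; whether collapse actually occurs will depend on $N$, $n$, the initialization, and the step size.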
We further support this hypothesis by examining prior publications that use hard negative mining. Hermans et al. Hermans et al. (2017) find that hard negative mining works when the batch size is small relative to the embedding dimension. On the other hand, Schroff et al. Schroff et al. (2015) find that hard negative mining fails when they use thousands of samples per batch.
Limitations and Future Work
In (2) of the Background and Definitions section, we assumed the margin term $\alpha$ to be zero. This assumption may not be valid when $\alpha$ is large enough to affect the optimal solution to the Triplet Loss sampled by hard negative mining. In the future, we plan to further study the effect of the margin on the Triplet Loss.
Additionally, we intend to study methods to avoid network collapse. As illustrated in Figure 6, network collapse occurs because the $g \in \mathcal{T}$ that minimizes $\rho(f, g)$ is a collapsed function. To avoid network collapse, we would restrict the set $\mathcal{T}$ to disallow collapsed functions. This would necessitate deriving a new set of equations that correctly utilize the restricted set.
Conclusion
In this paper, we apply the isometric approximation theorem to prove that the Triplet Loss sampled by hard negative mining is equivalent to minimizing a Hausdorfflike distance. This mathematical foundation produces new insights into hard negative mining. In particular, it explains network collapse, a phenomenon that prior theories were unable to fully explain.
With these insights, we provide the groundwork for future forms of hard negative mining that avoid network collapse. Further, as the theory of isometric approximation is independent of the Triplet Loss, it can be applied to any system utilizing the Euclidean metric or cosine similarity on a sphere. Therefore, this theory can be extended to analyze other metric learning methods like Ladder Loss or Contrastive Learning.
Through this and future work, we intend to leverage the power of mathematics to explain the fundamental principles of 'black-box' machine learning approaches. Characterizing this previously undefined behavior creates new opportunities to strengthen modern machine learning and artificial intelligence research.
References
Alestalo, P., D. A. Trotsenko, and J. Väisälä (2001). Isometric approximation. Israel Journal of Mathematics 125 (1), pp. 61–82.
Chen, T., S. Kornblith, M. Norouzi, and G. Hinton (2020a). A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607.
Chen, T., S. Kornblith, K. Swersky, M. Norouzi, and G. Hinton (2020b). Big self-supervised models are strong semi-supervised learners. Advances in Neural Information Processing Systems 33, pp. 22243–22255.
Faghri, F., D. J. Fleet, J. R. Kiros, and S. Fidler (2017). VSE++: improving visual-semantic embeddings with hard negatives. arXiv preprint arXiv:1707.05612.
Ge et al. (2019). Visual-textual association with hardest and semi-hard negative pairs mining for person search. arXiv preprint arXiv:1912.03083.
Hadsell, R., S. Chopra, and Y. LeCun (2006). Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), Vol. 2, pp. 1735–1742.
Hermans, A., L. Beyer, and B. Leibe (2017). In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.
Karpathy, A., A. Joulin, and L. Fei-Fei (2014). Deep fragment embeddings for bidirectional image sentence mapping. arXiv preprint arXiv:1406.5679.
Levi et al. (2021). Rethinking preventing class-collapsing in metric learning with margin-based losses. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10316–10325.
Liu, W., Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song (2017). SphereFace: deep hypersphere embedding for face recognition. arXiv preprint arXiv:1704.08063.
Ma, L., Z. Lu, and H. Li (2015). Learning to answer questions from image using convolutional neural network. arXiv preprint arXiv:1506.00333.
Oh Song, H., Y. Xiang, S. Jegelka, and S. Savarese (2016). Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4004–4012.
Russakovsky, O., J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252.
Schroff, F., D. Kalenichenko, and J. Philbin (2015). FaceNet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823.
Tack, J., S. Mo, J. Jeong, and J. Shin (2020). CSI: novelty detection via contrastive learning on distributionally shifted instances. arXiv preprint arXiv:2007.08176.
Väisälä, J. (2002). A survey of near-isometries. arXiv preprint math/0201098.
Väisälä, J. (2002). Isometric approximation property in Euclidean spaces. Israel Journal of Mathematics 128 (1), pp. 1–27.
Vinyals, O., A. Toshev, S. Bengio, and D. Erhan (2014). Show and tell: a neural image caption generator. arXiv preprint arXiv:1411.4555.
Xuan, H., A. Stylianou, X. Liu, and R. Pless (2020). Hard negative examples are hard, but useful. In European Conference on Computer Vision, pp. 126–142.
Zhou et al. (2019). Ladder loss for coherent visual-semantic embedding. arXiv preprint arXiv:1911.07528.
Appendix A. Proof of Theorem 3
Proof.
From the definition of $\varepsilon(f, \mathcal{T})$ in (28), we introduce the anchor, positive, and negative triplet in (29) by relabelling $x$ as $x_a$. Recognizing that $y$ must have either the same or a different label from $x$, we relabel $y$ as $x_p$ or $x_n$ respectively, and pick the max of these distances for any given triplet.
(28) $\varepsilon(f, \mathcal{T}) = \inf_{g \in \mathcal{T}} \sup_{x, y \in \mathcal{M}} \big|\, d_f(x, y) - d_g(x, y) \,\big|$
(29) $\varepsilon(f, \mathcal{T}) = \inf_{g \in \mathcal{T}} \sup_{(x_a, x_p, x_n)} \max\Big\{ \big| d_f(x_a, x_p) - d_g(x_a, x_p) \big|,\; \big| d_f(x_a, x_n) - d_g(x_a, x_n) \big| \Big\}$
Inequality (30) follows from $\max\{|a|, |b|\} \ge \tfrac{1}{2}(a - b)$ for real $a, b$.
(30) $\varepsilon(f, \mathcal{T}) \ge \inf_{g \in \mathcal{T}} \sup_{(x_a, x_p, x_n)} \frac{1}{2} \Big( \big( d_f(x_a, x_p) - d_g(x_a, x_p) \big) - \big( d_f(x_a, x_n) - d_g(x_a, x_n) \big) \Big)$
Now fix the $g \in \mathcal{T}$ that minimizes (30). We next prove via contradiction that, at the maximizing triplet, the first term (31a) is positive and the second term (31b) is negative.
(31a) $d_f(x_a, x_p) - d_g(x_a, x_p)$
(31b) $d_f(x_a, x_n) - d_g(x_a, x_n)$
There are four cases we must consider here, as we treat the zero case as either positive or negative. Case 1: (31a) is positive, (31b) is positive. Denoting this as $(+, +)$, our four cases are $(+, +)$, $(-, -)$, $(-, +)$, $(+, -)$. Now we prove by contradiction that Case 4, $(+, -)$, is the only valid one.
Case 1, $(+, +)$. Consider the function $g' = (1 + \delta) g$, where $\delta > 0$ is a small constant. Then $\varepsilon(f, g') < \varepsilon(f, g)$, contradicting the statement that $g$ minimizes (30).
Case 2, $(-, -)$. Consider the function $g' = (1 - \delta) g$, where $\delta > 0$ is a small constant. Then $\varepsilon(f, g') < \varepsilon(f, g)$, contradicting the statement that $g$ minimizes (30).
Case 3, $(-, +)$. We can algebraically rearrange (30) to get:
(32) $d_f(x_a, x_p) - d_f(x_a, x_n) < d_g(x_a, x_p) - d_g(x_a, x_n)$
(33) $d_g(x_a, x_p) - d_g(x_a, x_n) \le 0$
(34) $d_f(x_a, x_p) - d_f(x_a, x_n) < 0$
(33) comes from the definition of $g$ as a Triplet-Separated function; then (34) comes from combining (32) and (33). However, this means that the triplet that maximizes the expression has negative Triplet Loss, therefore there must be some other $g$ with a smaller value. This contradicts the statement that $g$ minimizes (30).
With Cases 1, 2, and 3 eliminated, we only have Case 4 and all the zero cases ($(0, +)$, $(0, -)$, $(+, 0)$, $(-, 0)$, $(0, 0)$). We note that the cases $(0, +)$ and $(-, 0)$ can be disproven using the same logic as Case 3. This leaves the four following valid cases ($(+, -)$, $(+, 0)$, $(0, -)$, $(0, 0)$), where we can connect back to (30) and write:
(35) $\varepsilon(f, \mathcal{T}) \ge \sup_{(x_a, x_p, x_n)} \frac{1}{2} \Big( d_f(x_a, x_p) - d_f(x_a, x_n) + d_g(x_a, x_n) - d_g(x_a, x_p) \Big)$
(36) $\varepsilon(f, \mathcal{T}) \ge \frac{1}{2} \sup_{(x_a, x_p, x_n)} \big( d_f(x_a, x_p) - d_f(x_a, x_n) \big)$
(36) follows since $d_g(x_a, x_n) \ge d_g(x_a, x_p)$ for the Triplet-Separated $g$. Note that (36) resembles the Triplet Loss. The Triplet Loss for the zero cases cannot dominate the maximum triplet loss for Case 4, otherwise it would contradict the statement that $g$ minimizes the isometric error, giving us:
(37) $\varepsilon(f, \mathcal{T}) \ge \frac{1}{2} \sup_{(x_a, x_p, x_n)} \max\big( d_f(x_a, x_p) - d_f(x_a, x_n),\; 0 \big)$
(38) $\sup_{(x_a, x_p, x_n)} \max\big( d_f(x_a, x_p) - d_f(x_a, x_n),\; 0 \big) = \sup_{x_a} \max\Big( \max_{x_p} d_f(x_a, x_p) - \min_{x_n} d_f(x_a, x_n),\; 0 \Big)$
Note that the right-hand side of (38) is exactly the expression for the Triplet Loss sampled by Hard Negative Mining. Therefore the isometric error upper bounds the Triplet Loss sampled by Hard Negative Mining within a constant factor of 2.
Additionally, we can prove that the Triplet Loss sampled by Hard Negative Mining upper bounds the isometric error. Starting from the definition of isometric error in (39), inequality (40) follows from evaluating the infimum at any particular $g \in \mathcal{T}$.
(39) $\varepsilon(f, \mathcal{T}) = \inf_{g \in \mathcal{T}} \sup_{x, y \in \mathcal{M}} \big|\, d_f(x, y) - d_g(x, y) \,\big|$
(40) $\varepsilon(f, \mathcal{T}) \le \sup_{x, y \in \mathcal{M}} \big|\, d_f(x, y) - d_g(x, y) \,\big|$
Once again fixing $g$, we have equality (41) by the same logic as the previous part. Inequality (42) follows from the fact that $d_g(x_a, x_p) \le d_g(x_a, x_n)$ by the definition of $g$ as Triplet-Separated.
(41) $\varepsilon(f, \mathcal{T}) \le \sup_{(x_a, x_p, x_n)} \max\Big\{ \big| d_f(x_a, x_p) - d_g(x_a, x_p) \big|,\; \big| d_f(x_a, x_n) - d_g(x_a, x_n) \big| \Big\}$
(42) $\varepsilon(f, \mathcal{T}) \le 2 \sup_{(x_a, x_p, x_n)} \max\big( d_f(x_a, x_p) - d_f(x_a, x_n),\; 0 \big)$
Therefore the Triplet Loss sampled by Hard Negative Mining upper bounds the isometric error by a constant factor of 2. ∎