
Mathematical Justification of Hard Negative Mining via Isometric Approximation Theorem

10/20/2022
by   Albert Xu, et al.

In deep metric learning, the Triplet Loss has emerged as a popular method for learning many computer vision and natural language processing tasks such as facial recognition, object detection, and visual-semantic embeddings. One issue that plagues the Triplet Loss is network collapse, an undesirable phenomenon where the network projects the embeddings of all data onto a single point. Researchers predominantly solve this problem by using triplet mining strategies. While hard negative mining is the most effective of these strategies, existing formulations lack strong theoretical justification for their empirical success. In this paper, we utilize the mathematical theory of isometric approximation to show an equivalence between the Triplet Loss sampled by hard negative mining and an optimization problem that minimizes a Hausdorff-like distance between the neural network and its ideal counterpart function. This provides a theoretical justification for hard negative mining's empirical efficacy. In addition, our novel application of the isometric approximation theorem provides the groundwork for future forms of hard negative mining that avoid network collapse. Our theory can also be extended to analyze other Euclidean space-based metric learning methods like Ladder Loss or Contrastive Learning.


Introduction

Research in deep metric learning studies techniques for training deep neural networks to learn similarities and dissimilarities between data samples, typically by learning a distance metric via feature embeddings in $\mathbb{R}^n$. Most extensively, deep metric learning is used in face recognition Schroff et al. (2015); Liu et al. (2017); Hermans et al. (2017) and other computer vision tasks Tack et al. (2020); Chen et al. (2020a) where there is an abundance of label values.

Common deep metric learning techniques include contrastive loss Hadsell et al. (2006) and triplet loss Schroff et al. (2015). Moreover, each of these methods has variants to address specific applications. SimCLR Chen et al. (2020a, b), for example, is a recent contrastive loss variant designed to address unsupervised deep metric learning with state-of-the-art performance on ImageNet Russakovsky et al. (2015). Ladder Loss Zhou et al. (2019), a generalized variant of triplet loss, improved upon existing methods for coherent visual-semantic embedding and has important applications in multiple visual and language understanding tasks Karpathy et al. (2014); Ma et al. (2015); Vinyals et al. (2014). Given the success of metric learning in a wide range of applications, we see value in investigating its underlying theories. In this paper, we present a theoretical framework which explains observed but previously unexplained behaviors of the Triplet Loss.

Literature Review

We choose to analyze the Triplet Loss’s underlying theory due to its strong dependence on the triplet selection strategy. This makes the Triplet Loss fickle to work with, as empirical results have shown that randomly sampling these triplets yields unsatisfactory results. On the other hand, successful triplet selection strategies like hard negative mining can face issues like network collapse, a phenomenon where the network projects all data points onto a single point Schroff et al. (2015), while more stable triplet selection strategies do not perform as well in practice Hermans et al. (2017).

In the original FaceNet paper, Schroff et al. find that with large batch sizes (thousands of samples), hard negative mining leads to collapsed solutions. To address this, they instead used a strategy they called semi-hard mining Schroff et al. (2015). On the other hand, Hermans et al. find that with smaller batch sizes, the hardest mining strategy significantly outperforms other mining strategies and does not suffer from collapsed solutions Hermans et al. (2017). These seemingly contradictory results showcase the need for a theoretical framework to explain the theory of hard negative mining and the root cause of collapsed solutions.

There has been some prior literature investigating the phenomenon of network collapse. Xuan et al. show that hard negative mining leads to collapsed solutions by analyzing the gradients of a simplified neural network model Xuan et al. (2020). However, they do not account for the many cases where hard negative mining does work. Levi et al. prove that, under a label randomization assumption, the globally optimal solution to the triplet loss necessarily exhibits network collapse Levi et al. (2021). Rather than investigating functional hard mining strategies, Levi et al. instead suggest using the less effective easy positive mining to avoid network collapse.

In literature, there are plenty of claims that hard negative mining succeeds Hermans et al. (2017); Faghri et al. (2017), and numerous examples where it fails Schroff et al. (2015); Ge et al. (2019); Oh Song et al. (2016). Our work explains why network collapse happens by using the theory of isometric approximation to better characterize the behavior of the Triplet Loss.

Background and Definitions

Establishing the notation used in the paper, let $\mathcal{X}$ be the data manifold and let $\mathcal{Y} = \{1, \dots, C\}$ be the classes, with $C$ being the number of classes. Let $h : \mathcal{X} \to \mathcal{Y}$ be the true hypothesis function, or true labels of the data. Then the dataset consists of pairs $(x_i, y_i)$ with $x_i \in \mathcal{X}$ and $y_i = h(x_i) \in \mathcal{Y}$. We define the learned neural network as a function $f : \mathcal{X} \to \mathbb{R}^n$ which maps similar points in the data manifold to similar points in $\mathbb{R}^n$.

As our paper focuses on metric learning, we define the similarity between embeddings to be the Euclidean distance $d(u, v) = \|u - v\|_2$ where $u, v \in \mathbb{R}^n$. Further, we define the shorthand $d_f(x, y) := \|f(x) - f(y)\|_2$ where $x, y \in \mathcal{X}$.

Triplet Loss and Hard Negative Mining

In this section, we discuss the Triplet Loss, one of the more successful approaches to supervised metric learning, introduced by Schroff et al. (2015). The Triplet Loss considers samples as triplets of data, composed of the anchor $x_a$, positive $x_p$, and negative $x_n$ samples, described in (1). The similarity relation (1a) requires that the anchor and positive samples must be of the same class, while the dissimilarity relation (1b) requires the anchor and negative must be of different classes.

$h(x_a) = h(x_p) \qquad (1a)$
$h(x_a) \neq h(x_n) \qquad (1b)$

Restating the objective of supervised metric learning, the embedding of the anchor sample must be closer to the positive than the negative for every triplet. An example of a satisfactory triplet is shown in Figure 1. Formally, we express this relation via (2), where $\alpha$ is the margin term.

$d_f(x_a, x_p) + \alpha \le d_f(x_a, x_n) \qquad (2)$

This leads to the definition of the Triplet Loss in (3).

$\mathcal{L}_{\text{triplet}}(f) = \sum_{(x_a, x_p, x_n)} \big[\, d_f(x_a, x_p) - d_f(x_a, x_n) + \alpha \,\big]_+ \qquad (3)$

The function $[\,\cdot\,]_+ = \max(\,\cdot\,, 0)$ zeroes negative values in order to ignore all the triplets that already satisfy the desired relation. In addition, as the margin $\alpha$ adds only a constant value to the loss function, its effect is negligible for small $\alpha$. Therefore, we will assume a zero value for the margin ($\alpha = 0$) for the remainder of this paper.
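To make (3) concrete, here is a minimal sketch (our own illustration, not the authors' code) of the Triplet Loss on precomputed embeddings, with the margin defaulting to zero as assumed for the rest of the paper.

```python
import numpy as np

def triplet_loss(emb, triplets, alpha=0.0):
    """Sum of hinge terms [ d_f(x_a, x_p) - d_f(x_a, x_n) + alpha ]_+ over the given triplets."""
    total = 0.0
    for a, p, n in triplets:                      # indices of anchor, positive, negative samples
        d_ap = np.linalg.norm(emb[a] - emb[p])    # d_f(x_a, x_p)
        d_an = np.linalg.norm(emb[a] - emb[n])    # d_f(x_a, x_n)
        total += max(d_ap - d_an + alpha, 0.0)    # [.]_+ zeroes triplets that already satisfy (2)
    return total

emb = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0]])   # toy embeddings f(x_0), f(x_1), f(x_2)
print(triplet_loss(emb, [(0, 1, 2)]))                   # 0.0: this triplet already satisfies (2)
```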

Definition 1.

Triplet-Separated. We refer to non-empty subsets $S_1, \dots, S_C \subset \mathbb{R}^n$ as Triplet-Separated if for every $u, v \in S_i$ and $w \in S_j$ with $i \neq j$ we have

$\|u - v\| \le \|u - w\|. \qquad (4)$

This property can be extended to a function $f$ by checking whether the embedding subsets in (5) are Triplet-Separated.

$S_i = \{\, f(x) : x \in \mathcal{X},\; h(x) = i \,\}, \quad i \in \mathcal{Y} \qquad (5)$
Figure 1: An example Anchor, Positive, and Negative triplet. The blue dotted contour is the Triplet-Separated boundary for Class A. It is computed by considering inequality (4) for all points in Class A. Because Class B is outside the Triplet-Separated boundary for Class A, the Triplet Loss for this example is zero.

It is worth noting that $\mathcal{L}_{\text{triplet}}(f) = 0$ (with $\alpha = 0$) if and only if $f$ is Triplet-Separated. An example of two Triplet-Separated sets is shown in Figure 1.
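As a concrete check of Definition 1 on finite embedding sets, the following sketch (a hypothetical helper of our own, assuming the reconstruction of inequality (4) above) tests whether every within-class distance is bounded by the corresponding cross-class distances.

```python
import numpy as np

def is_triplet_separated(class_sets):
    """class_sets: list of (m_i, n) arrays, one array of embedding points per class."""
    for i, S_i in enumerate(class_sets):
        for j, S_j in enumerate(class_sets):
            if i == j:
                continue
            for u in S_i:
                for v in S_i:                      # u, v belong to the same class
                    for w in S_j:                  # w belongs to a different class
                        if np.linalg.norm(u - v) > np.linalg.norm(u - w):
                            return False           # inequality (4) is violated
    return True

A = np.array([[0.0, 0.0], [0.5, 0.0]])             # class A embeddings
B = np.array([[3.0, 0.0], [3.5, 0.5]])             # class B embeddings
print(is_triplet_separated([A, B]))                # True, so the Triplet Loss (alpha = 0) is zero
```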

As mentioned in the Literature Review section, the Triplet Loss relies heavily on its triplet mining strategy to achieve its performance for two popularly accepted reasons: First, enumerating all triplets of data every iteration would be too computationally intensive to guarantee fast training. Second, improper sampling of triplets risks network collapse Xuan et al. (2020). Our work substantiates the use of hard negative mining, a successful triplet mining strategy, by characterizing conditions that lead to network collapse.
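For illustration, the sketch below shows one common formulation of batch-hard mining (per-anchor hardest positive and hardest negative; the exact variant differs across the cited papers), which avoids enumerating every triplet in a batch.

```python
import numpy as np

def batch_hard_triplets(emb, labels):
    """Return one (anchor, positive, negative) index triplet per anchor in the batch."""
    d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)   # pairwise d_f
    triplets = []
    for a in range(len(emb)):
        same = labels == labels[a]
        pos_candidates = np.where(same)[0]
        neg_candidates = np.where(~same)[0]
        if len(pos_candidates) < 2 or len(neg_candidates) == 0:
            continue                                                 # need a distinct positive and a negative
        p = pos_candidates[np.argmax(d[a, pos_candidates])]          # hardest (farthest) positive
        n = neg_candidates[np.argmin(d[a, neg_candidates])]          # hardest (closest) negative
        triplets.append((a, p, n))
    return triplets

labels = np.array([0, 0, 1, 1])
emb = np.array([[0.0, 0.0], [1.0, 0.0], [1.2, 0.0], [3.0, 0.0]])
print(batch_hard_triplets(emb, labels))   # [(0, 1, 2), (1, 0, 2), (2, 3, 1), (3, 2, 1)]
```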

Isometric Approximation

We will present a novel application of the isometric approximation theorem in Euclidean subsets in order to mathematically justify hard negative mining. The isometric approximation theorem primarily defines the behavior of near-isometries, or functions that are close to isometries, as given by Definition 2.

Definition 2.

$\epsilon$-nearisometry. Let $E$ and $F$ be real normed spaces. A function $g : A \to F$, where $A \subset E$, is called an $\epsilon$-nearisometry ($\epsilon \ge 0$) if

$\big|\, \|g(x) - g(y)\| - \|x - y\| \,\big| \le \epsilon \quad \text{for all } x, y \in A. \qquad (6)$

In other words, an $\epsilon$-nearisometry is a function that preserves the distance metric within $\epsilon$. The isometric approximation theorem seeks to determine how close $g$ is to an isometry, say $T$, as given by (7). Note $c$ is a function of $A$ that is fixed for a given $A$ and is thus independent of $g$. Consequently, inequality (7) holds for all $\epsilon \ge 0$ and all $\epsilon$-nearisometries $g$.

$\sup_{x \in A} \|g(x) - T(x)\| \le c\,\epsilon \qquad (7)$

Now consider the case where $E$ and $F$ are $n$-dimensional Euclidean metric spaces, making $A \subset \mathbb{R}^n$. Then the following theorems and definitions Väisälä (2002); Vaisala (2002); Alestalo et al. (2001) prove that the bound in (7) is linear in $\epsilon$, given a thickness condition on the set $A$.
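On a finite point set, the smallest $\epsilon$ for which a map is an $\epsilon$-nearisometry can be computed directly; the sketch below (our own illustration, with names that are not from the cited works) does this for Definition 2.

```python
import numpy as np

def nearisometry_eps(X, GX):
    """X: (m, n) source points; GX: (m, n) their images under g. Returns the smallest
    epsilon for which g is an epsilon-nearisometry on X, as in (6)."""
    dX = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dG = np.linalg.norm(GX[:, None, :] - GX[None, :, :], axis=-1)
    return np.max(np.abs(dG - dX))                          # largest distortion of any pairwise distance

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
theta = 0.05                                                # a small rotation is an exact isometry
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
print(nearisometry_eps(X, X @ R.T))                         # ~0.0: rotations preserve all distances
print(nearisometry_eps(X, 1.1 * X))                         # ~0.141 = 0.1 * sqrt(2): scaling distorts them
```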

Definition 3.

Thickness. For each unit vector $e \in \mathbb{R}^n$, define the projection $\pi_e : \mathbb{R}^n \to \mathbb{R}$ by the dot product $\pi_e(x) = \langle x, e \rangle$. The thickness of a bounded set $A \subset \mathbb{R}^n$ is the number

$w_e(A) = \sup_{x, y \in A} \big( \pi_e(x) - \pi_e(y) \big) \qquad (8a)$
$\theta(A) = \inf_{\|e\| = 1} w_e(A) \qquad (8b)$
Theorem 1 (From Theorem 3.3 Alestalo et al. (2001)).

Suppose that $q > 0$ and $A \subset \mathbb{R}^n$ is a compact set with $\theta(A) \ge q \operatorname{diam}(A)$. Let $g : A \to \mathbb{R}^n$ be an $\epsilon$-nearisometry. Then there is an isometry $T : \mathbb{R}^n \to \mathbb{R}^n$ such that

$\sup_{x \in A} \|g(x) - T(x)\| \le \frac{c\,\epsilon}{q} \qquad (9)$

with $c$ depending only on the dimension $n$.

As this property depends entirely on the set $A$, we call Theorem 1 the $c$-Isometric Approximation Property (c-IAP) on a set $A$ with $\theta(A) \ge q \operatorname{diam}(A)$.
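To make the thickness condition of Theorem 1 tangible, the sketch below estimates the thickness of a finite point set by sampling unit directions (an approximation of the infimum in (8b), under our reconstruction of that definition).

```python
import numpy as np

def thickness(A, num_directions=100_000, seed=0):
    """A: (m, n) bounded point set. Approximates the infimum over unit vectors e
    of the width of the projection of A onto e."""
    rng = np.random.default_rng(seed)
    e = rng.normal(size=(num_directions, A.shape[1]))
    e /= np.linalg.norm(e, axis=1, keepdims=True)           # random unit vectors
    proj = A @ e.T                                          # shape (m, num_directions)
    widths = proj.max(axis=0) - proj.min(axis=0)            # projected width per direction
    return widths.min()

square = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
segment = np.array([[0.0, 0.0], [1.0, 0.0]])
print(thickness(square))    # close to 1.0: the unit square is thick in every direction
print(thickness(segment))   # close to 0.0: a flat set, so Theorem 1 gives no useful control
```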

Theory and Proofs

Overview and Problem Setup

From the background and definitions, the goal of the Triplet Loss is to learn a function $f$ such that the induced distance metric $d_f$ satisfies the property in (2). However, rather than starting with the Triplet Loss and then introducing hard negative mining as a method to correct the weaknesses of the Triplet Loss, we instead derive hard negative mining and the Triplet Loss together, using the Hausdorff-like distance as a starting point.

Hausdorff-Like Distance

Reiterating the training objective from the problem setup, we aim to learn a function $f$ that is Triplet-Separated (Definition 1). We restate this problem as a distance minimization problem, and prove that it is equivalent to hard negative mining with the Triplet Loss.

First, we construct the set of all functions that are Triplet-Separated and denote it with $\mathcal{G}$. We next construct the Hausdorff-like distance metric (denoted by $\tilde d_H$) between these functions, which compares the embedding subsets via the Hausdorff distance metric $d_H$.

$d_H(S, S') = \max\Big\{ \sup_{u \in S} \inf_{v \in S'} \|u - v\|, \; \sup_{v \in S'} \inf_{u \in S} \|u - v\| \Big\} \qquad (10)$
$\tilde d_H(f, g) = \max_{i \in \mathcal{Y}} \; d_H\big( f(\mathcal{X}_i), \, g(\mathcal{X}_i) \big), \quad \text{where } \mathcal{X}_i = h^{-1}(i) \qquad (11)$

One way to solve metric learning is to find the $f$ closest to any function in $\mathcal{G}$, as indicated by (12).

$\min_f \; \tilde d_H(f, \mathcal{G}), \qquad \tilde d_H(f, \mathcal{G}) := \inf_{g \in \mathcal{G}} \tilde d_H(f, g) \qquad (12)$

We claim that the Triplet Loss with hard negative mining is equivalent to minimizing $\tilde d_H(f, \mathcal{G})$ within a constant factor (see Corollary 1).
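On finite embeddings, both distances can be evaluated directly. The sketch below computes the Hausdorff distance of (10) and a class-wise worst case in the spirit of (11); since (10)-(11) are our reconstructions of the lost equations, treat the exact form as an assumption.

```python
import numpy as np

def hausdorff(S, T):
    """Symmetric Hausdorff distance between two finite point sets S and T."""
    d = np.linalg.norm(S[:, None, :] - T[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def hausdorff_like(f_emb, g_emb, labels):
    """Worst class-wise Hausdorff distance between the embedding subsets of f and g."""
    return max(hausdorff(f_emb[labels == c], g_emb[labels == c])
               for c in np.unique(labels))

labels = np.array([0, 0, 1])
f_emb = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
g_emb = np.array([[0.0, 0.5], [1.0, 0.0], [5.0, 1.0]])
print(hausdorff_like(f_emb, g_emb, labels))   # 1.0: dominated by the displaced class-1 point
```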

Isometric Approximation Applied to $\tilde d_H$

In this section, we present Theorem 2 to show that minimizing the Hausdorff-like distance is equivalent to minimizing a discrepancy in distance metrics, referred to as the isometric error (Definition 4).

Definition 4.

Isometric error. For two functions $f, g : \mathcal{X} \to \mathbb{R}^n$, we define the isometric error $\epsilon(f, g)$ to be the maximum discrepancy between their distance metrics.

$\epsilon(f, g) = \sup_{x, y \in \mathcal{X}} \big| d_f(x, y) - d_g(x, y) \big| \qquad (13)$

Similar to (12), we extend the definition of isometric error to the set $\mathcal{G}$ as follows:

$\epsilon(f, \mathcal{G}) = \inf_{g \in \mathcal{G}} \epsilon(f, g) \qquad (14)$
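On a finite dataset the supremum in (13) becomes a maximum over all pairs, which the following short sketch (our own helper) computes from two embeddings of the same points.

```python
import numpy as np

def isometric_error(f_emb, g_emb):
    """f_emb, g_emb: (N, n) embeddings of the same N data points under f and g."""
    d_f = np.linalg.norm(f_emb[:, None, :] - f_emb[None, :, :], axis=-1)
    d_g = np.linalg.norm(g_emb[:, None, :] - g_emb[None, :, :], axis=-1)
    return np.max(np.abs(d_f - d_g))                        # largest discrepancy between the metrics

f_emb = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
g_emb = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])      # only the last point is moved
print(isometric_error(f_emb, g_emb))                        # 1.0: d(x_0, x_2) changes from 2 to 3
```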
Lemma 1.

If $\epsilon(f, g) \le \epsilon$ and the embedding set $g(\mathcal{X})$ satisfies the thickness condition of Theorem 1, then there is a function $g'$ isometric to $g$ such that:

$\sup_{x \in \mathcal{X}} \|f(x) - g'(x)\| \le c\,\epsilon \qquad (15)$

Proof.

If $g$ is invertible, then $f \circ g^{-1}$ is a function $g(\mathcal{X}) \to \mathbb{R}^n$. It is an $\epsilon$-nearisometry because, for $u = g(x)$ and $v = g(y)$, $\big| \|f(g^{-1}(u)) - f(g^{-1}(v))\| - \|u - v\| \big| = |d_f(x, y) - d_g(x, y)| \le \epsilon$. Then if $g(\mathcal{X})$ satisfies the thickness condition, the conditions for Theorem 1 are satisfied, so there exists an isometry $T$ such that

$\|f(g^{-1}(u)) - T(u)\| \le c\,\epsilon \quad \text{for all } u \in g(\mathcal{X}). \qquad (16)$

Then, substituting $u = g(x)$,

$\|f(x) - T(g(x))\| \le c\,\epsilon \quad \text{for all } x \in \mathcal{X}. \qquad (17)$

Therefore if $g$ is invertible, (15) holds with $g' = T \circ g$.

If $g$ is not invertible, then there exist $x_1 \neq x_2$ such that $g(x_1) = g(x_2)$. We divide the elements of $\mathcal{X}$ into subsets $\mathcal{X}_1$ and $\mathcal{X}_2$ such that $g$ is invertible on $\mathcal{X}_1$, $g(\mathcal{X}_1) = g(\mathcal{X})$, and $g$ is unchanged on $\mathcal{X}_1$. Consequently, (17) holds on $\mathcal{X}_1$.

Moving our attention to $\mathcal{X}_2$, for all $x_2 \in \mathcal{X}_2$ there exists $x_1 \in \mathcal{X}_1$ such that $g(x_2) = g(x_1)$. Then because $g$ is unchanged on $\mathcal{X}_1$, $g'(x_2) = T(g(x_2)) = T(g(x_1)) = g'(x_1)$, and $\|f(x_2) - g'(x_2)\| \le \|f(x_2) - f(x_1)\| + \|f(x_1) - g'(x_1)\| \le (c + 1)\,\epsilon$, since $d_g(x_1, x_2) = 0$ implies $d_f(x_1, x_2) \le \epsilon$. Therefore (15) holds for $\mathcal{X}_1$ and $\mathcal{X}_2$, and hence on all of $\mathcal{X}$, with the constant enlarged from $c$ to $c + 1$. ∎

Theorem 2.

$\tilde d_H(f, \mathcal{G})$ and $\epsilon(f, \mathcal{G})$ upper bound each other within a linear factor for all $f$ whose embedding sets satisfy some minimum thickness condition $\theta \ge q \operatorname{diam}$.

Proof.

We first prove that $\epsilon(f, \mathcal{G})$ upper bounds $\tilde d_H(f, \mathcal{G})$. To this end, fix the minimizing function $g$ in the following expression:

$g = \operatorname*{arg\,min}_{g'' \in \mathcal{G}} \epsilon(f, g'') \qquad (18)$

From Lemma 1 we have that:

$\sup_{x \in \mathcal{X}} \|f(x) - g'(x)\| \le c\,\epsilon(f, g) \qquad (19)$

with $g'$ isometric to $g$, and hence $g' \in \mathcal{G}$ because isometries preserve the Triplet-Separated property. From the definition of the Hausdorff-like distance (11) we have (20), and from (12) we have (21):

$\tilde d_H(f, g') \le \sup_{x \in \mathcal{X}} \|f(x) - g'(x)\| \qquad (20)$
$\tilde d_H(f, \mathcal{G}) \le \tilde d_H(f, g') \qquad (21)$
$\tilde d_H(f, \mathcal{G}) \le c\,\epsilon(f, \mathcal{G}) \qquad (22)$

(22) follows from (19)-(21), proving that $\epsilon(f, \mathcal{G})$ upper bounds $\tilde d_H(f, \mathcal{G})$ within a constant factor of $c$.

For the converse claim that $\tilde d_H(f, \mathcal{G})$ upper bounds $\epsilon(f, \mathcal{G})$, we once again fix the $g$ that minimizes the following expression:

$g = \operatorname*{arg\,min}_{g' \in \mathcal{G}} \tilde d_H(f, g') \qquad (23)$

Next, for the four points $f(x)$, $f(y)$, $g(x)$, and $g(y)$, apply the triangle inequality via (24) to get (25).

$\|f(x) - f(y)\| \le \|f(x) - g(x)\| + \|g(x) - g(y)\| + \|g(y) - f(y)\| \qquad (24)$
$d_f(x, y) - d_g(x, y) \le \|f(x) - g(x)\| + \|f(y) - g(y)\| \qquad (25)$

It is worth noting that (25) holds for all $x, y \in \mathcal{X}$. Furthermore, we can swap $f$ and $g$ in (25) and use (23) to get (26) and thus (27).

$\big| d_f(x, y) - d_g(x, y) \big| \le 2\,\tilde d_H(f, \mathcal{G}) \qquad (26)$
$\epsilon(f, \mathcal{G}) \le \epsilon(f, g) \le 2\,\tilde d_H(f, \mathcal{G}) \qquad (27)$

(27) proves that $\tilde d_H(f, \mathcal{G})$ upper bounds $\epsilon(f, \mathcal{G})$ within a constant factor of 2. ∎

Theorem 2 shows that $\tilde d_H(f, \mathcal{G})$ and $\epsilon(f, \mathcal{G})$ are exchangeable as minimization objectives because they upper bound each other within linear factors. Now that we have rewritten the minimization objective as a difference of two distance functions, we can derive the Triplet Loss.

Recovering the Triplet Loss

In this section, we will prove that $\epsilon(f, \mathcal{G})$ (Definition 4) is equivalent to the Triplet Loss sampled by hard negative mining.

Theorem 3.

The Triplet Loss sampled by hard negative mining and the isometric error upper bound each other within a linear factor.

We present the proof for Theorem 3 in Appendix A.

From Theorems 2 and 3 we have Corollary 1.

Corollary 1.

The optimal solution to the Triplet Loss sampled by hard negative mining is equivalent to the optimal solution to (12) within a constant factor.

Proof.

The proof follows from Theorems 2 and 3, where we show that $\tilde d_H(f, \mathcal{G})$, $\epsilon(f, \mathcal{G})$, and the Triplet Loss sampled by hard negative mining upper and lower bound each other by constant factors. Consequently, the optimal solution to the Triplet Loss sampled by hard negative mining, and to $\tilde d_H(f, \mathcal{G})$, are equivalent within a constant factor. ∎

Illustrative Examples

In this section, we illustrate the key ideas of the previous section’s theorems by using a toy example with $N = 5$ points and embedding dimension $n = 2$. As we aim to illustrate the equivalence of the Triplet Loss with the Hausdorff-like distance and the isometric error, we can visualize the embedding points without any underlying data or neural network. See Figure 2 for the toy example setup.

Figure 2: The setup for our toy example is a dataset of five arbitrary points in $\mathbb{R}^2$, divided into two classes (red and blue).
Figure 3: Illustration of $\tilde d_H(f, \mathcal{G})$. Using the toy example shown in Figure 2, we compute a $g \in \mathcal{G}$ that minimizes $\tilde d_H(f, g)$. Arrows represent the function $g$ and stars indicate the embedding points for each class. The red and blue star sets are Triplet-Separated because they lie outside the other’s Triplet-Separated boundary, indicated by the dashed colored border. The three most important contributors to $\tilde d_H(f, g)$ are marked with black arrows.

First, we will visualize $\tilde d_H(f, \mathcal{G})$ in Figure 3. The numerical value of $\tilde d_H(f, g)$ is determined by the maximum length of the arrows (see caption to Figure 3), which is marked in the figure with black outlines. Here, we compute the ideal $g$ by optimizing the embedding points.

Figure 4: Illustration of $\epsilon(f, \mathcal{G})$. Using the toy example shown in Figure 2, we compute a $g \in \mathcal{G}$ that minimizes $\epsilon(f, g)$. Here we compare the distance metrics, subtracting the distance between two points under $f$ (circles) and the distance under $g$ (stars), to compute $\epsilon(f, g)$ as shown by the black vertical bar on the right.

Figure 4 illustrates $\epsilon(f, \mathcal{G})$, which measures the discrepancy in distance metric. Note that the $g$ that minimizes $\epsilon(f, g)$ is not necessarily the same as the $g$ that minimizes $\tilde d_H(f, g)$. Revisiting the second part of the proof of Theorem 2, $\epsilon(f, \mathcal{G})$ is lower and upper bounded by constant multiples of $\tilde d_H(f, \mathcal{G})$. For this specific toy example, we calculate the constant-factor error and find that $\epsilon(f, \mathcal{G})$ is an almost linear factor of $\tilde d_H(f, \mathcal{G})$. This essentially illustrates Theorem 2.

Figure 5: Illustration of the Triplet Loss sampled by hard negative mining. Using the toy example shown in Figure 2, we take the triplet (anchor, positive, negative) that maximizes the Triplet Loss.

Next, we show the Triplet Loss sampled by hard negative mining in Figure 5. The equivalence proved by Theorem 3 is shown by comparing Figure 5 against Figure 4, as the triplet selected by hard negative mining corresponds with the same three points with the largest discrepancies in distance metric. Through Figures 3, 4, and 5, we have a visualization of the statement and proof of Corollary 1.
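The exact coordinates behind Figures 2-5 are not reproduced here, so the sketch below uses five points of our own choosing in the same spirit: it selects the hardest triplet for a given $f$ and computes the isometric error against one hand-chosen Triplet-Separated $g$ (an upper bound on $\epsilon(f, \mathcal{G})$, which Theorem 3 relates to the hard-negative-mined loss within a constant factor).

```python
import numpy as np

labels = np.array([0, 0, 0, 1, 1])                       # five points, two classes
f_emb = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0],    # class 0 under f
                  [1.2, 0.2], [3.0, 0.0]])               # class 1 under f (first point is a hard negative)
g_emb = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0],    # a hand-chosen Triplet-Separated g:
                  [4.0, 0.2], [4.5, 0.0]])               # class 1 is pushed far away from class 0

d_f = np.linalg.norm(f_emb[:, None, :] - f_emb[None, :, :], axis=-1)
d_g = np.linalg.norm(g_emb[:, None, :] - g_emb[None, :, :], axis=-1)

# Hard negative mining on f: the triplet with the largest d_f(x_a, x_p) - d_f(x_a, x_n).
best, best_val = None, -np.inf
for a in range(5):
    for p in range(5):
        for n in range(5):
            if p != a and labels[p] == labels[a] and labels[n] != labels[a]:
                val = d_f[a, p] - d_f[a, n]
                if val > best_val:
                    best, best_val = (a, p, n), val
print("hardest triplet:", best, "  triplet loss:", max(best_val, 0.0))

# Isometric error between f and this particular g, and the pair of points that attains it.
disc = np.abs(d_f - d_g)
print("largest discrepancy pair:", np.unravel_index(disc.argmax(), disc.shape),
      "  epsilon(f, g):", disc.max())
```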

Discussion

Novel Insights on Network Collapse

As mentioned in the Literature Review section, current literature observes that hard negative mining results in network collapse inconsistently. We propose a theory that explains hard negative mining’s intermittent behavior.

We hypothesize that network collapse happens when the $g \in \mathcal{G}$ that minimizes $\tilde d_H(f, g)$ is a collapsed function, or when the “nearest” function maps all the data points onto a much smaller subset. An example of this effect can be seen in Figure 6, where we have 20 random data points with a collapsed $g$.

Figure 6: Illustration of how hard negative mining leads to collapsed solutions, given $N = 20$ samples and embedding dimension $n = 2$. We find the $g \in \mathcal{G}$ that minimizes $\tilde d_H(f, g)$ and observe that the embedding points (stars) collapse into a much smaller subset, marked by the ellipse in green.

We observe that when the number of samples $N$ is much greater than the embedding dimension $n$, the ideal counterpart $g$ is much more likely to collapse. On the contrary, when $N$ is comparable to or smaller than $n$, the network is much less likely to collapse. Therefore, when training a network with a batch-hard negative mining strategy, where the network learns on the hardest triplet in a single batch, we expect batch-hard mining to fail when $N \gg n$ and not to fail otherwise.
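One way to probe this hypothesis empirically is sketched below: a hypothetical experiment (not from the paper) that trains a small linear embedding on synthetic data using the loss of the single hardest triplet per batch, once with the batch size far above the embedding dimension and once below it, and reports the spread of the learned class embeddings as a crude collapse indicator. Whether collapse actually appears will depend on the data and hyperparameters.

```python
import torch

def hardest_triplet_loss(emb, labels, margin=0.0):
    """Triplet Loss of the single hardest triplet in the batch (zero margin by default)."""
    diff = emb[:, None, :] - emb[None, :, :]
    d = torch.sqrt((diff ** 2).sum(dim=-1) + 1e-12)                  # pairwise distances (NaN-safe)
    same = labels[:, None] == labels[None, :]
    hardest_pos = d.masked_fill(~same, float('-inf')).amax(dim=1)    # farthest same-class point
    hardest_neg = d.masked_fill(same, float('inf')).amin(dim=1)      # closest other-class point
    return torch.clamp(hardest_pos - hardest_neg + margin, min=0.0).max()

def embedding_spread(batch_size, emb_dim, steps=1000, num_classes=8, in_dim=32, seed=0):
    torch.manual_seed(seed)
    net = torch.nn.Linear(in_dim, emb_dim)                           # a deliberately tiny "network"
    opt = torch.optim.SGD(net.parameters(), lr=0.05)
    centers = torch.randn(num_classes, in_dim)                       # synthetic class centers
    for _ in range(steps):
        labels = torch.randint(num_classes, (batch_size,))
        x = centers[labels] + 0.1 * torch.randn(batch_size, in_dim)
        loss = hardest_triplet_loss(net(x), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return net(centers).std().item()                             # small spread suggests collapse

print("spread with N >> n:", embedding_spread(batch_size=256, emb_dim=2))
print("spread with N <= n:", embedding_spread(batch_size=16, emb_dim=32))
```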

We further support this hypothesis by examining prior publications that use hard negative mining. Hermans et al. (2017) find that hard negative mining works with a batch size smaller than the embedding dimension. On the other hand, Schroff et al. (2015) find that hard negative mining fails when they use thousands of samples per batch, far more than the embedding dimension.

Limitations and Future Work

In (2) of the Background and Definitions section, we assumed the margin term $\alpha$ to be zero. This assumption may not be valid when $\alpha$ is large enough to affect the optimal solution to the Triplet Loss sampled by hard negative mining. In the future, we plan to further study the effect of the margin on the Triplet Loss.

Additionally, we intend to study methods to avoid network collapse. As illustrated in Figure 6, network collapse occurs because the $g \in \mathcal{G}$ that minimizes $\tilde d_H(f, g)$ is a collapsed function. To avoid network collapse, we would restrict the set $\mathcal{G}$ to disallow collapsed functions. This would necessitate deriving a new set of equations that correctly utilize the restricted set.

Conclusion

In this paper, we apply the isometric approximation theorem to prove that the Triplet Loss sampled by hard negative mining is equivalent to minimizing a Hausdorff-like distance. This mathematical foundation produces new insights into hard negative mining. In particular, it explains network collapse, a phenomenon that prior theories were unable to fully explain.

With these insights, we provide the groundwork for future forms of hard negative mining that avoid network collapse. Further, as the theory of isometric approximation is independent of the Triplet Loss, it can be applied to any system utilizing the Euclidean metric or cosine similarity on a sphere. Therefore, this theory can be extended to analyze other metric learning methods like Ladder Loss or Contrastive Learning.

Through this and future work, we intend to leverage the power of mathematics to explain the fundamental principles of ‘black-box’ machine learning approaches. Characterizing this previously unexplained behavior creates new opportunities to strengthen modern machine learning and artificial intelligence research.

References

  • P. Alestalo, D. Trotsenko, and J. Väisälä (2001) Isometric approximation. Israel Journal of Mathematics 125 (1), pp. 61–82.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020a) A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607.
  • T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. E. Hinton (2020b) Big self-supervised models are strong semi-supervised learners. Advances in Neural Information Processing Systems 33, pp. 22243–22255.
  • F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler (2017) VSE++: improving visual-semantic embeddings with hard negatives. arXiv preprint arXiv:1707.05612.
  • J. Ge, G. Gao, and Z. Liu (2019) Visual-textual association with hardest and semi-hard negative pairs mining for person search. arXiv preprint arXiv:1912.03083.
  • R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), Vol. 2, pp. 1735–1742.
  • A. Hermans, L. Beyer, and B. Leibe (2017) In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.
  • A. Karpathy, A. Joulin, and L. Fei-Fei (2014) Deep fragment embeddings for bidirectional image sentence mapping. arXiv preprint arXiv:1406.5679.
  • E. Levi, T. Xiao, X. Wang, and T. Darrell (2021) Rethinking preventing class-collapsing in metric learning with margin-based losses. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10316–10325.
  • W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song (2017) SphereFace: deep hypersphere embedding for face recognition. arXiv preprint arXiv:1704.08063.
  • L. Ma, Z. Lu, and H. Li (2015) Learning to answer questions from image using convolutional neural network. arXiv preprint arXiv:1506.00333.
  • H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese (2016) Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4004–4012.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252.
  • F. Schroff, D. Kalenichenko, and J. Philbin (2015) FaceNet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823.
  • J. Tack, S. Mo, J. Jeong, and J. Shin (2020) CSI: novelty detection via contrastive learning on distributionally shifted instances. arXiv preprint arXiv:2007.08176.
  • J. Väisälä (2002) A survey of nearisometries. arXiv preprint math/0201098.
  • J. Väisälä (2002) Isometric approximation property in Euclidean spaces. Israel Journal of Mathematics 128 (1), pp. 1–27.
  • O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2014) Show and tell: a neural image caption generator. arXiv preprint arXiv:1411.4555.
  • H. Xuan, A. Stylianou, X. Liu, and R. Pless (2020) Hard negative examples are hard, but useful. In European Conference on Computer Vision, pp. 126–142.
  • M. Zhou, Z. Niu, L. Wang, Z. Gao, Q. Zhang, and G. Hua (2019) Ladder loss for coherent visual-semantic embedding. arXiv preprint arXiv:1911.07528.

Appendix A. Proof of Theorem 3

Proof.

Here, we present a detailed proof for the theorem using equations (28)–(42).

From the definition of $\epsilon(f, \mathcal{G})$ in (28), we introduce the anchor, positive, and negative triplet in (29) by re-labelling $x \to x_a$. Recognizing that $y$ must either have the same or a different label from $x_a$, we re-label $y \to x_p$ or $y \to x_n$, and pick the max of these distances for any given triplet.

$\epsilon(f, \mathcal{G}) = \inf_{g \in \mathcal{G}} \; \sup_{x, y \in \mathcal{X}} \big| d_f(x, y) - d_g(x, y) \big| \qquad (28)$
$\epsilon(f, \mathcal{G}) = \inf_{g \in \mathcal{G}} \; \sup_{(x_a, x_p, x_n)} \max\Big\{ \big| d_f(x_a, x_p) - d_g(x_a, x_p) \big|, \; \big| d_f(x_a, x_n) - d_g(x_a, x_n) \big| \Big\} \qquad (29)$

Inequality (30) follows from $\max\{a, b\} \le a + b$ for positive $a$ and $b$.

$\epsilon(f, \mathcal{G}) \le \inf_{g \in \mathcal{G}} \; \sup_{(x_a, x_p, x_n)} \Big( \big| d_f(x_a, x_p) - d_g(x_a, x_p) \big| + \big| d_f(x_a, x_n) - d_g(x_a, x_n) \big| \Big) \qquad (30)$

Now fix the $g$ that minimizes (30) and the triplet $(x_a, x_p, x_n)$ that attains the supremum. We next prove via contradiction that the first term (31a) is positive and the second term (31b) is negative.

$d_f(x_a, x_p) - d_g(x_a, x_p) \qquad (31a)$
$d_f(x_a, x_n) - d_g(x_a, x_n) \qquad (31b)$

There are four cases we must consider here, as we treat the zero case as either positive or negative. Case 1: (31a) is positive, (31b) is positive. Denoting this as $(+,+)$, our four cases are $(+,+)$, $(-,-)$, $(-,+)$, $(+,-)$. Now we prove by contradiction that case 4, $(+,-)$, is the only valid one.

Case 1, $(+,+)$. Consider the function $g' = (1 + \delta)\,g$, where $\delta > 0$ is a small constant; scaling preserves the Triplet-Separated property, so $g' \in \mathcal{G}$. Then the value of (30) under $g'$ is smaller than under $g$, contradicting the statement that $g$ minimizes (30).

Case 2, $(-,-)$. Consider the function $g' = (1 - \delta)\,g$, where $\delta > 0$ is a small constant. Then, as in Case 1, the value of (30) under $g'$ is smaller, contradicting the statement that $g$ minimizes (30).

Case 3, $(-,+)$. We can algebraically rearrange (30) to get:

$\big| d_f(x_a, x_p) - d_g(x_a, x_p) \big| + \big| d_f(x_a, x_n) - d_g(x_a, x_n) \big| = \big( d_f(x_a, x_n) - d_f(x_a, x_p) \big) + \big( d_g(x_a, x_p) - d_g(x_a, x_n) \big) \qquad (32)$
$d_g(x_a, x_p) - d_g(x_a, x_n) \le 0 \qquad (33)$
$\big| d_f(x_a, x_p) - d_g(x_a, x_p) \big| + \big| d_f(x_a, x_n) - d_g(x_a, x_n) \big| \le d_f(x_a, x_n) - d_f(x_a, x_p) \qquad (34)$

(33) comes from the definition of $g$ as a Triplet-Separated function; then (34) comes from combining (32) and (33). However, this means that the triplet that maximizes the expression has negative Triplet Loss, therefore there must be some other $g' \in \mathcal{G}$ with a smaller value of (30). This contradicts the statement that $g$ minimizes (30).

With Cases 1, 2, and 3 eliminated, we only have Case 4 and all the zero cases ($(0,+)$, $(+,0)$, $(0,-)$, $(-,0)$, $(0,0)$). We note that the cases $(0,+)$ and $(-,0)$ can be disproven using the same logic as Case 3. This leaves the four following valid cases ($(+,-)$, $(+,0)$, $(0,-)$, $(0,0)$), where we can connect back to (30) and write:

$\big| d_f(x_a, x_p) - d_g(x_a, x_p) \big| + \big| d_f(x_a, x_n) - d_g(x_a, x_n) \big| = \big( d_f(x_a, x_p) - d_g(x_a, x_p) \big) - \big( d_f(x_a, x_n) - d_g(x_a, x_n) \big) \qquad (35)$
$= \big( d_f(x_a, x_p) - d_f(x_a, x_n) \big) + \big( d_g(x_a, x_n) - d_g(x_a, x_p) \big) \qquad (36)$

Note that (36) resembles the Triplet Loss. The Triplet Loss for $g$ cannot dominate the maximum triplet loss for $f$, otherwise it would contradict the statement that $g$ minimizes the isometric error, giving us:

$d_g(x_a, x_n) - d_g(x_a, x_p) \le \sup_{(x_a, x_p, x_n)} \big( d_f(x_a, x_p) - d_f(x_a, x_n) \big) \qquad (37)$

Using (37), we have the following relation with respect to (36).

$\epsilon(f, \mathcal{G}) \le 2 \sup_{(x_a, x_p, x_n)} \big( d_f(x_a, x_p) - d_f(x_a, x_n) \big) \qquad (38)$

Note that (38) is identical to twice the expression for the Triplet Loss sampled by Hard Negative Mining. Therefore the Triplet Loss sampled by Hard Negative Mining upper bounds the isometric error by a constant factor of 2.

Additionally, we can prove that the isometric error upper bounds the Triplet Loss sampled by Hard Negative Mining. Starting from the definition of isometric error in (39), inequality (40) follows from $\max\{|a|, |b|\} \ge \tfrac{1}{2}\big( |a| + |b| \big)$.

$\epsilon(f, \mathcal{G}) = \inf_{g \in \mathcal{G}} \; \sup_{(x_a, x_p, x_n)} \max\Big\{ \big| d_f(x_a, x_p) - d_g(x_a, x_p) \big|, \; \big| d_f(x_a, x_n) - d_g(x_a, x_n) \big| \Big\} \qquad (39)$
$\epsilon(f, \mathcal{G}) \ge \frac{1}{2} \inf_{g \in \mathcal{G}} \; \sup_{(x_a, x_p, x_n)} \Big( \big| d_f(x_a, x_p) - d_g(x_a, x_p) \big| + \big| d_f(x_a, x_n) - d_g(x_a, x_n) \big| \Big) \qquad (40)$

Once again fixing the minimizing $g$ and the maximizing triplet, we have equality (41) by the same logic as the previous part. Inequality (42) follows from the fact that $d_g(x_a, x_p) \le d_g(x_a, x_n)$ by the definition of $g$ as Triplet-Separated.

$\big| d_f(x_a, x_p) - d_g(x_a, x_p) \big| + \big| d_f(x_a, x_n) - d_g(x_a, x_n) \big| = \big( d_f(x_a, x_p) - d_f(x_a, x_n) \big) + \big( d_g(x_a, x_n) - d_g(x_a, x_p) \big) \qquad (41)$
$\ge d_f(x_a, x_p) - d_f(x_a, x_n) \qquad (42)$

Therefore the isometric error upper bounds the Triplet Loss sampled by Hard Negative Mining by a constant factor of 2. ∎