# Robustness and Generalization for Metric Learning

Metric learning has attracted a lot of interest over the last decade, but the generalization ability of such methods has not been thoroughly studied. In this paper, we introduce an adaptation of the notion of algorithmic robustness (previously introduced by Xu and Mannor) that can be used to derive generalization bounds for metric learning. We further show that a weak notion of robustness is in fact a necessary and sufficient condition for a metric learning algorithm to generalize. To illustrate the applicability of the proposed framework, we derive generalization results for a large family of existing metric learning algorithms, including some sparse formulations that are not covered by previous results.

## 1 Introduction

The past ten years have seen a growing interest in supervised metric learning. The relevance of a distance or a similarity to a given task is crucial to the effectiveness of many classification and clustering methods. For this reason, a lot of research has been devoted to automatically learning distances or similarities from supervised data. Existing approaches rely on the reasonable principle that, according to a good metric, pairs of examples with the same (resp. different) labels must be close to each other (resp. far away). Learning thus generally consists in finding the best parameters of the metric function given a set of labeled pairs. (These pairs are sometimes replaced by triplets $(x, y, z)$ such that example $x$ must be closer to example $y$ than to example $z$, where $x$ and $y$ share the same label and $z$ does not.) The most classic and commonly used approach in the literature focuses on Mahalanobis distance learning, where the objective is to learn a positive semi-definite (PSD) matrix Schultz2003 ; Shalev-Shwartz2004 ; Rosales2006 ; Davis2007 ; Jain2008 ; Weinberger2009 ; Qi2009 ; Ying2009 inducing a linear projection of the data in which the Euclidean distance performs well. Other approaches have considered arbitrary similarity functions with no PSD constraint Chechik2009 ; Qamar2010 ; ShalitWC10 . The learned distance or similarity is then typically used to improve the performance of nearest-neighbor methods.

From a theoretical standpoint, many papers have studied the convergence rate of the optimization problem used to learn the parameters of the metric. Somewhat surprisingly, however, few studies have addressed the generalization ability of learned metrics on unseen data. This situation can be explained by the fact that one cannot assume that the learning pairs provided to a metric learning algorithm are independent and identically distributed (IID). Indeed, these pairs are generally given by an expert and/or extracted from a sample of individual instances. For example, common procedures for building such learning pairs rely on the nearest or farthest neighbors of each example, on some criterion of diversity Kar2011 , on taking all possible pairs, or on drawing pairs randomly from a learning sample. Online methods Shalev-Shwartz2004 ; Jain2008 ; Chechik2009 do offer guarantees, but only in the form of regret bounds assessing the deviation between the cumulative loss suffered by the online algorithm and the loss induced by the best hypothesis that can be chosen in hindsight. Apart from these results, as far as we know, very few papers have proposed a theoretical study of the generalization ability of supervised metric learning methods. The approach of Bian and Tao Bian2011 uses a statistical analysis to give generalization guarantees for loss minimization methods, but their results rely on assumptions about the distribution of the examples and do not take into account any regularization on the metric. The most general contribution has been proposed by Jin et al. Jin09 , who adapted the framework of uniform stability Bousquet02 to regularized metric learning. However, their approach is based on a Frobenius norm regularizer and cannot be applied to arbitrary regularizers, in particular sparsity-inducing norms Xu2012 .

In this paper, we propose to address this lack of theoretical framework by studying the generalization ability of metric learning algorithms according to a notion of algorithmic robustness. Algorithmic robustness, introduced by Xu et al. XUrobustness ; XUrobustness-ML , allows one to derive generalization bounds whenever, given two "close" training and testing examples, the variation between their associated losses is bounded. This notion of closeness of examples relies on a partition of the input space into different regions such that two examples in the same region are said to be close. This framework has been successfully used in the classic supervised learning setting to derive generalization bounds for SVM, Lasso and more. We propose here to adapt this notion of algorithmic robustness to metric learning, in a way that works for both similarity and distance learning. We show that, in this context, the problem of non-IIDness of the learning pairs can be worked around by simply assuming that the pairs are built from an IID sample of labeled examples. Moreover, following the work of Xu et al. XUrobustness-ML , we provide a notion of weak robustness that is necessary and sufficient for metric learning algorithms to generalize well, highlighting that robustness is a fundamental property. We illustrate the applicability of our framework by deriving generalization bounds, using very few approach-specific arguments, for a larger class of problems than Jin et al.; this class can accommodate a vast choice of regularizers and requires no assumption on the distribution of the examples.

The rest of the paper is organized as follows. We introduce some preliminaries and notations in Section 2. Our notion of algorithmic robustness for metric learning is presented in Section 3. The necessity and sufficiency of weak robustness are shown in Section 4. Section 5 is devoted to applying our framework to actual metric learning algorithms. Finally, we conclude in Section 6.

## 2 Preliminaries

### 2.1 Notations

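Let $\mathcal{X}$ be the instance space, $\mathcal{Y}$ be a finite label set and let $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. In the following, $z = (x, y) \in \mathcal{Z}$ means $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. Let $\mu$ be an unknown probability distribution over $\mathcal{Z}$. We assume that $\mathcal{X}$ is a compact convex metric space w.r.t. a norm $\|\cdot\|$ such that $\mathcal{X} \subset \mathbb{R}^d$, thus there exists a constant $R$ such that $\forall x \in \mathcal{X}$, $\|x\| \le R$. A similarity or distance function is a pairwise function $f: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. In the following, we use the generic term metric to refer to either a similarity or a distance function.

We denote by $s$ a labeled training sample consisting of $n$ training instances $(s_1, \dots, s_n)$ drawn IID from $\mu$. The sample of all possible pairs built from $s$ is denoted by $p_s$, such that $p_s = \{(s_1, s_1), \dots, (s_1, s_n), \dots, (s_n, s_n)\}$. A metric learning algorithm $A$ takes as input a finite set of pairs from $\mathcal{Z} \times \mathcal{Z}$ and outputs a metric. We denote by $A_{p_s}$ the metric learned by an algorithm $A$ from a sample $p_s$ of pairs.

For any pair of labeled examples $(z, z')$ and any metric $f$, we associate a loss function $l(f, z, z')$ which depends on the examples and their labels. This loss is assumed to be nonnegative and uniformly bounded by a constant $B$. We define the true generalization loss over $\mu$ by $L(f) = \mathbb{E}_{z, z' \sim \mu}\, l(f, z, z')$. We denote the empirical loss over the sample $p_s$ by $l_{emp}(f) = \frac{1}{n^2} \sum_{(s_i, s_j) \in p_s} l(f, s_i, s_j)$.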
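The pair sample and the empirical loss can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the squared Euclidean distance stands in for a learned metric, and `clipped_hinge` is a hypothetical bounded loss chosen for the example (neither is prescribed by the paper).

```python
from itertools import product


def all_pairs(sample):
    """Return p_s: every ordered pair (s_i, s_j) of the sample, including i == j."""
    return list(product(sample, repeat=2))


def empirical_loss(metric, loss, sample):
    """Average of loss(metric, s_i, s_j) over the n^2 pairs of p_s."""
    pairs = all_pairs(sample)
    return sum(loss(metric, a, b) for a, b in pairs) / len(pairs)


def squared_euclidean(x, xp):
    """Stand-in metric (assumption): squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, xp))


def clipped_hinge(metric, z, zp, bound=4.0):
    """Hypothetical loss: hinge on y_pair * (1 - f), clipped so it is bounded by `bound`."""
    (x, y), (xp, yp) = z, zp
    y_pair = 1.0 if y == yp else -1.0
    return min(bound, max(0.0, 1.0 - y_pair * (1.0 - metric(x, xp))))


# Three labeled instances z = (x, y).
sample = [((0.0, 0.0), 0), ((1.0, 0.0), 0), ((0.0, 2.0), 1)]
l_emp = empirical_loss(squared_euclidean, clipped_hinge, sample)
```

Only the two ordered pairs built from the two same-label points at distance 1 incur a loss here, so `l_emp` is 2/9.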

### 2.2 Robustness for classical supervised learning

The notion of algorithmic robustness, introduced by Xu and Mannor XUrobustness ; XUrobustness-ML in the context of classic supervised learning, is based on the deviation between the losses associated with a training instance and a testing instance that are close. An algorithm is said to be $(K, \epsilon(s))$-robust if there exists a partition of the space $\mathcal{Z}$ into $K$ disjoint subsets such that, for every training instance and testing instance belonging to the same region of the partition, the deviation between their associated losses is bounded by a term $\epsilon(s)$. From this definition, the authors proved a bound on the difference between the empirical and true losses of the form $\epsilon(s) + B\sqrt{\frac{2K \ln 2 + 2 \ln(1/\delta)}{n}}$ (with probability $1 - \delta$). This bound depends on $K$ and $\epsilon(s)$, and $\epsilon(s)$ can be made as small as desired by refining the partition. When considering metric spaces, the partition of $\mathcal{Z}$ can be obtained through the notion of covering number Kolmogorov61 .

###### Definition 1

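For a metric space $(\mathcal{S}, \rho)$ and $\mathcal{T} \subset \mathcal{S}$, we say that $\hat{\mathcal{T}} \subset \mathcal{S}$ is a $\gamma$-cover of $\mathcal{T}$ if $\forall t \in \mathcal{T}$, $\exists \hat{t} \in \hat{\mathcal{T}}$ such that $\rho(t, \hat{t}) \le \gamma$. The $\gamma$-covering number of $\mathcal{T}$ is

$$\mathcal{N}(\gamma, \mathcal{T}, \rho) = \min\{|\hat{\mathcal{T}}| : \hat{\mathcal{T}} \text{ is a } \gamma\text{-cover of } \mathcal{T}\}.$$

For example, when $\mathcal{X}$ is a compact convex space, for any $\gamma > 0$, the quantity $\mathcal{N}(\gamma, \mathcal{X}, \rho)$ is finite, leading to a finite cover. If we consider the space $\mathcal{Z}$, we can note that the label set can be partitioned into $|\mathcal{Y}|$ sets. Thus, $\mathcal{Z}$ can be partitioned into $|\mathcal{Y}|\,\mathcal{N}(\gamma, \mathcal{X}, \rho)$ subsets such that if two instances $z_1 = (x_1, y_1)$, $z_2 = (x_2, y_2)$ belong to the same subset, then $y_1 = y_2$ and $\rho(x_1, x_2) \le \gamma$.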
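As an illustration of how such a partition can be built in practice, the sketch below constructs a $\gamma$-cover of a finite point set greedily and then groups labeled points by (nearest center, label). The greedy procedure and the function names are assumptions made for this example; the greedy cover is valid but not necessarily minimal, so its size only upper-bounds the covering number of the point set.

```python
import math


def greedy_cover(points, gamma):
    """Return centers such that every point lies within gamma of some center.

    A point becomes a new center only if it is farther than gamma from all
    existing centers, so the result is a gamma-cover (not necessarily minimal).
    """
    centers = []
    for p in points:
        if all(math.dist(p, c) > gamma for c in centers):
            centers.append(p)
    return centers


def partition_cells(labeled_points, centers):
    """Group labeled points z = (x, y) by (index of nearest center, label)."""
    cells = {}
    for x, y in labeled_points:
        nearest = min(range(len(centers)), key=lambda i: math.dist(x, centers[i]))
        cells.setdefault((nearest, y), []).append((x, y))
    return cells


# A small grid of points in the plane with alternating labels.
points = [(0.5 * i, 0.5 * j) for i in range(5) for j in range(5)]
labeled = [(p, (k % 2)) for k, p in enumerate(points)]
gamma = 0.6
centers = greedy_cover(points, gamma)
cells = partition_cells(labeled, centers)
```

Two points assigned to the same cell share a label and are both within `gamma` of the same center, hence within `2 * gamma` of each other; this is why a cover at scale $\gamma/2$ is the natural choice when cells of diameter at most $\gamma$ are needed.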

## 3 Robustness and Generalization for Metric Learning

We present here our adaptation of robustness to metric learning. The idea is to use the partition of $\mathcal{Z}$ at the pair level: if a new test pair of examples is close to a learning pair, then the losses of the two pairs must be close. Two pairs are close when each instance of the first pair falls into the same subset of the partition of $\mathcal{Z}$ as the corresponding instance of the other pair, as shown in Figure 1. A metric learning algorithm with this property is said to be robust. This notion is formalized as follows.

###### Definition 2

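An algorithm $A$ is $(K, \epsilon(\cdot))$ robust for $K \in \mathbb{N}$ and $\epsilon(\cdot): (\mathcal{Z} \times \mathcal{Z})^n \to \mathbb{R}$ if $\mathcal{Z}$ can be partitioned into $K$ disjoint sets, denoted by $\{C_i\}_{i=1}^K$, such that for every sample $s \in \mathcal{Z}^n$ and the pair set $p_s$ associated with this sample, the following holds:
$\forall (s_1, s_2) \in p_s$, $\forall z_1, z_2 \in \mathcal{Z}$, $\forall i, j = 1, \dots, K$: if $s_1, z_1 \in C_i$ and $s_2, z_2 \in C_j$ then

$$|l(A_{p_s}, s_1, s_2) - l(A_{p_s}, z_1, z_2)| \le \epsilon(p_s). \qquad (1)$$

$K$ and $\epsilon(\cdot)$ quantify the robustness of the algorithm, which depends on the learning sample. The property of robustness is required for every training pair of the sample; we will see later that this property can be relaxed.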

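Note that this definition of robustness can be easily extended to triplet-based metric learning algorithms. Instead of considering all the pairs $p_s$ from an IID sample $s$, we take the admissible triplet set $trip_s$ of $s$ such that $(s_1, s_2, s_3) \in trip_s$ means $s_1$ and $s_2$ share the same label while $s_1$ and $s_3$ have different ones, with the interpretation that $s_1$ must be more similar to $s_2$ than to $s_3$. The robustness property can then be expressed as: $\forall (s_1, s_2, s_3) \in trip_s$, $\forall z_1, z_2, z_3 \in \mathcal{Z}$, $\forall i, j, k = 1, \dots, K$: if $s_1, z_1 \in C_i$, $s_2, z_2 \in C_j$ and $s_3, z_3 \in C_k$ then

$$|l(A_{trip_s}, s_1, s_2, s_3) - l(A_{trip_s}, z_1, z_2, z_3)| \le \epsilon(trip_s). \qquad (2)$$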

### 3.1 Generalization of robust algorithms

We now give a PAC generalization bound for metric learning algorithms fulfilling the property of robustness (Definition 2). We begin with a concentration inequality that will help us derive the bound.

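###### Proposition 1 (VDV-empricalprocess )

Let $(|N_1|, \dots, |N_K|)$ be an IID multinomial random variable with parameters $n$ and $(\mu(C_1), \dots, \mu(C_K))$. By the Bretagnolle-Huber-Carol inequality we have: $\Pr\left\{\sum_{i=1}^K \left|\frac{|N_i|}{n} - \mu(C_i)\right| \ge \lambda\right\} \le 2^K \exp\left(\frac{-n\lambda^2}{2}\right)$, hence with probability at least $1 - \delta$,

$$\sum_{i=1}^K \left|\frac{|N_i|}{n} - \mu(C_i)\right| \le \sqrt{\frac{2K \ln 2 + 2 \ln(1/\delta)}{n}}. \qquad (3)$$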
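The radius in Equation 3 comes from setting $2^K \exp(-n\lambda^2/2) = \delta$ and solving for $\lambda$. The following Monte Carlo sketch, written for illustration only, draws multinomial counts and measures how often the $\ell_1$ deviation exceeds that radius. The cell probabilities, sample size, and number of trials are arbitrary choices for the example, not values from the paper.

```python
import math
import random


def bhc_radius(K, n, delta):
    """Right-hand side of Equation 3: sqrt((2 K ln 2 + 2 ln(1/delta)) / n)."""
    return math.sqrt((2 * K * math.log(2) + 2 * math.log(1 / delta)) / n)


def l1_deviation(probs, n, rng):
    """Draw n IID cell indices and return sum_i | |N_i|/n - mu(C_i) |."""
    counts = [0] * len(probs)
    for idx in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[idx] += 1
    return sum(abs(c / n - p) for c, p in zip(counts, probs))


def violation_rate(probs, n, delta, trials=2000, seed=0):
    """Fraction of trials in which the deviation exceeds the radius."""
    rng = random.Random(seed)
    radius = bhc_radius(len(probs), n, delta)
    hits = sum(l1_deviation(probs, n, rng) > radius for _ in range(trials))
    return hits / trials


rate = violation_rate([0.5, 0.3, 0.2], n=200, delta=0.05)
```

The proposition guarantees that the true violation probability is at most `delta`; in simulations like this one the observed rate is typically far below it, which reflects how conservative the inequality is for small $K$.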

We now give our first result on the generalization of metric learning algorithms.

###### Theorem 1

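If a learning algorithm $A$ is $(K, \epsilon(\cdot))$-robust and the training sample is made of the pairs $p_s$ obtained from a sample $s$ generated by $n$ IID draws from $\mu$, then for any $\delta > 0$, with probability at least $1 - \delta$ we have:

$$|L(A_{p_s}) - l_{emp}(A_{p_s})| \le \epsilon(p_s) + 2B\sqrt{\frac{2K \ln 2 + 2 \ln(1/\delta)}{n}}.$$

Proof. Let $N_i$ be the set of indices of the points of $s$ that fall into $C_i$. Then $(|N_1|, \dots, |N_K|)$ is an IID multinomial random variable with parameters $n$ and $(\mu(C_1), \dots, \mu(C_K))$. To lighten the notation, write $E_{ij} = \mathbb{E}_{z_1, z_2 \sim \mu}\big(l(A_{p_s}, z_1, z_2) \mid z_1 \in C_i, z_2 \in C_j\big)$. We have:

$$
\begin{aligned}
|L(A_{p_s}) - l_{emp}(A_{p_s})|
&= \left| \sum_{i=1}^K \sum_{j=1}^K E_{ij}\, \mu(C_i)\mu(C_j) - \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n l(A_{p_s}, s_i, s_j) \right| \\
&\overset{(a)}{\le} \left| \sum_{i=1}^K \sum_{j=1}^K E_{ij}\, \mu(C_i)\mu(C_j) - \sum_{i=1}^K \sum_{j=1}^K E_{ij}\, \mu(C_i)\frac{|N_j|}{n} \right|
 + \left| \sum_{i=1}^K \sum_{j=1}^K E_{ij}\, \mu(C_i)\frac{|N_j|}{n} - \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n l(A_{p_s}, s_i, s_j) \right| \\
&\overset{(b)}{\le} \left| \sum_{i=1}^K \sum_{j=1}^K E_{ij}\, \mu(C_i)\left(\mu(C_j) - \frac{|N_j|}{n}\right) \right|
 + \left| \sum_{i=1}^K \sum_{j=1}^K E_{ij}\, \mu(C_i)\frac{|N_j|}{n} - \sum_{i=1}^K \sum_{j=1}^K E_{ij}\, \frac{|N_i|}{n}\frac{|N_j|}{n} \right| \\
&\qquad + \left| \sum_{i=1}^K \sum_{j=1}^K E_{ij}\, \frac{|N_i|}{n}\frac{|N_j|}{n} - \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n l(A_{p_s}, s_i, s_j) \right| \\
&\overset{(c)}{\le} B\left( \sum_{j=1}^K \left| \mu(C_j) - \frac{|N_j|}{n} \right| + \sum_{i=1}^K \left| \mu(C_i) - \frac{|N_i|}{n} \right| \right)
 + \left| \frac{1}{n^2} \sum_{i=1}^K \sum_{j=1}^K \sum_{s_o \in N_i} \sum_{s_l \in N_j} \max_{z \in C_i} \max_{z' \in C_j} |l(A_{p_s}, z, z') - l(A_{p_s}, s_o, s_l)| \right| \\
&\overset{(d)}{\le} \epsilon(p_s) + 2B \sum_{i=1}^K \left| \frac{|N_i|}{n} - \mu(C_i) \right|
 \overset{(e)}{\le} \epsilon(p_s) + 2B\sqrt{\frac{2K \ln 2 + 2 \ln(1/\delta)}{n}}.
\end{aligned}
$$

The first equality follows from the law of total expectation over the cells $C_i \times C_j$. Inequalities $(a)$ and $(b)$ are due to the triangle inequality; $(c)$ uses the fact that $l$ is bounded by $B$, that $\sum_{i=1}^K \mu(C_i) = 1$ by definition of a multinomial random variable, and that $\sum_{j=1}^K \frac{|N_j|}{n} = 1$ by definition of the $N_j$. Lastly, $(d)$ is due to the hypothesis of robustness (Equation 1) and $(e)$ to the application of Proposition 1.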

The previous bound depends on $K$, which is given by the cover chosen for $\mathcal{Z}$. If for any $K$ the associated $\epsilon$ is a constant (i.e. $\epsilon_K(p_s) = \epsilon_K$) for any $s$, we can prove a bound holding uniformly for all $K$: $|L(A_{p_s}) - l_{emp}(A_{p_s})| \le \inf_{K \ge 1}\left[\epsilon_K + 2B\sqrt{\frac{2K \ln 2 + 2 \ln(1/\delta)}{n}}\right]$. This also gives an insight into the objective of any robust algorithm: according to a partition of the labeled input space, given two regions, minimize the maximum loss over pairs of examples belonging to each region.
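To get a feel for the trade-off between the partition size and the robustness term, the bound of Theorem 1 can be evaluated numerically. The sketch below is illustrative only: `eps_of_K` is a hypothetical function describing how the robustness term shrinks as the partition is refined, which in practice depends on the algorithm and the loss.

```python
import math


def robustness_bound(eps, B, K, n, delta):
    """Right-hand side of Theorem 1: eps + 2B sqrt((2K ln 2 + 2 ln(1/delta)) / n)."""
    return eps + 2 * B * math.sqrt((2 * K * math.log(2) + 2 * math.log(1 / delta)) / n)


def best_partition_size(eps_of_K, B, n, delta, candidates):
    """Pick the candidate K that minimizes the bound, given eps as a function of K."""
    return min(candidates, key=lambda K: robustness_bound(eps_of_K(K), B, K, n, delta))


# Hypothetical decay of the robustness term with the partition size.
eps_of_K = lambda K: 1.0 / K
candidates = [1, 2, 5, 10, 50, 100, 1000]

small_n = best_partition_size(eps_of_K, B=1.0, n=10**3, delta=0.05, candidates=candidates)
large_n = best_partition_size(eps_of_K, B=1.0, n=10**7, delta=0.05, candidates=candidates)
```

A finer partition lowers the robustness term but inflates the concentration term, so under this assumed decay a larger sample supports a finer partition.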

For triplet-based metric learning algorithms, by following the definition of robustness given by Equation 2 and straightforwardly adapting the losses to triplets such that they output zero for non-admissible ones, Theorem 1 can be easily extended to obtain the following generalization bound:

$$|L(A_{trip_s}) - l_{emp}(A_{trip_s})| \le \epsilon(trip_s) + 3B\sqrt{\frac{2K \ln 2 + 2 \ln(1/\delta)}{n}}. \qquad (4)$$

### 3.2 Pseudo-robustness

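The previous study requires the robustness property to hold for every learning pair. We show, with the following definition, that it is possible to relax robustness so that it holds for only a subset of the sample while still deriving generalization guarantees.

###### Definition 3

An algorithm $A$ is $(K, \epsilon(\cdot), \hat{p}_n(\cdot))$ pseudo-robust for $K \in \mathbb{N}$, $\epsilon(\cdot): (\mathcal{Z} \times \mathcal{Z})^n \to \mathbb{R}$ and $\hat{p}_n(\cdot): (\mathcal{Z} \times \mathcal{Z})^n \to \{1, \dots, n^2\}$, if $\mathcal{Z}$ can be partitioned into $K$ disjoint sets, denoted by $\{C_i\}_{i=1}^K$, such that for every $s \in \mathcal{Z}^n$ drawn IID from $\mu$, there exists a subset of training pairs $\hat{p}_s \subseteq p_s$, with $|\hat{p}_s| = \hat{p}_n(p_s)$, such that the following holds:
$\forall (s_1, s_2) \in \hat{p}_s$, $\forall z_1, z_2 \in \mathcal{Z}$, $\forall i, j = 1, \dots, K$: if $s_1, z_1 \in C_i$ and $s_2, z_2 \in C_j$ then

$$|l(A_{p_s}, s_1, s_2) - l(A_{p_s}, z_1, z_2)| \le \epsilon(p_s). \qquad (5)$$

We can easily observe that $(K, \epsilon(\cdot))$-robust is equivalent to $(K, \epsilon(\cdot), n^2)$ pseudo-robust. The following theorem gives the generalization guarantees associated with the pseudo-robustness property.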

###### Theorem 2

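If a learning algorithm $A$ is $(K, \epsilon(\cdot), \hat{p}_n(\cdot))$ pseudo-robust and the training pairs $p_s$ come from a sample generated by $n$ IID draws from $\mu$, then for any $\delta > 0$, with probability at least $1 - \delta$ we have:

$$|L(A_{p_s}) - l_{emp}(A_{p_s})| \le \frac{\hat{p}_n(p_s)}{n^2}\,\epsilon(p_s) + B\left(\frac{n^2 - \hat{p}_n(p_s)}{n^2} + 2\sqrt{\frac{2K \ln 2 + 2 \ln(1/\delta)}{n}}\right).$$

The proof is similar to that of Theorem 1 and is given in Appendix A.1.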
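A short numerical sketch makes the role of the robust fraction $\hat{p}_n(p_s)/n^2$ concrete. The values below are arbitrary illustrations; the functions simply evaluate the right-hand sides of Theorems 1 and 2.

```python
import math


def concentration_term(K, n, delta):
    """sqrt((2K ln 2 + 2 ln(1/delta)) / n), shared by Theorems 1 and 2."""
    return math.sqrt((2 * K * math.log(2) + 2 * math.log(1 / delta)) / n)


def robust_bound(eps, B, K, n, delta):
    """Right-hand side of Theorem 1."""
    return eps + 2 * B * concentration_term(K, n, delta)


def pseudo_robust_bound(eps, B, K, n, delta, n_robust_pairs):
    """Right-hand side of Theorem 2; n_robust_pairs plays the role of p_hat_n(p_s)."""
    frac = n_robust_pairs / n**2
    return frac * eps + B * ((1 - frac) + 2 * concentration_term(K, n, delta))


n, K, B, eps, delta = 1000, 20, 1.0, 0.05, 0.05
full = pseudo_robust_bound(eps, B, K, n, delta, n_robust_pairs=n**2)
partial = pseudo_robust_bound(eps, B, K, n, delta, n_robust_pairs=int(0.9 * n**2))
```

When every pair satisfies the robustness property the two bounds coincide; each pair outside the robust subset trades a contribution of `eps / n**2` for the worst-case `B / n**2`, so the bound degrades linearly in the non-robust fraction whenever `eps < B`.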

The notion of pseudo-robustness characterizes a situation that often occurs in metric learning: it is sometimes difficult to optimize the metric over all the possible pairs. This theorem shows that it suffices to have a property of robustness over only a subset of the possible pairs to obtain generalization guarantees. Moreover, it also gives an insight into the behavior of metric learning approaches that aim at learning a distance to be plugged into a $k$-nearest neighbor classifier, such as LMNN Weinberger2009 . These methods do not optimize the distance according to all possible pairs, but only according to the nearest neighbors of the same class and some pairs of different classes. According to the previous theorem, this principle is well founded provided that the robustness property is fulfilled for some of the pairs used to optimize the metric. Finally, note that this notion of pseudo-robustness can also be easily adapted to triplet-based metric learning.

## 4 Necessity of Robustness

We prove here that a notion of weak robustness is actually necessary and sufficient to generalize in a metric learning setup. This result is based on an asymptotic analysis following the work of Xu and Mannor XUrobustness-ML . We consider pairs of instances coming from an increasing sample of training instances $s = (s_1, s_2, \dots)$ and from a sample of test instances $t = (t_1, t_2, \dots)$ such that both samples are assumed to be drawn IID from a distribution $\mu$. We use $s(n)$ and $t(n)$ to denote the first $n$ examples of the two samples respectively, while $s^*$ denotes a fixed sequence of training examples.

We use $L(f, p_{t(n)}) = \frac{1}{n^2} \sum_{(t_i, t_j) \in p_{t(n)}} l(f, t_i, t_j)$ to refer to the average loss given a set of pairs for any learned metric $f$, and $L(f) = \mathbb{E}_{z, z' \sim \mu}\, l(f, z, z')$ for the expected loss.

We first define a notion of generalizability for metric learning.

###### Definition 4
1. Given a training pair set $p_{s^*}$ coming from a sequence of examples $s^*$, a metric learning method $A$ generalizes w.r.t. $p_{s^*}$ if $\lim_n \left| L(A_{p_{s^*(n)}}) - L(A_{p_{s^*(n)}}, p_{s^*(n)}) \right| = 0$.

2. A learning method $A$ generalizes with probability 1 if it generalizes with respect to the pairs $p_s$ of almost all samples $s$ drawn IID from $\mu$.

Note that this notion of generalizability implies convergence in mean. We then introduce the notion of weak robustness for metric learning.

###### Definition 5
1. Given a set of training pairs $p_{s^*}$ coming from a sequence of examples $s^*$, a metric learning method $A$ is weakly robust with respect to $p_{s^*}$ if there exists a sequence of $\{D_n \subseteq \mathcal{Z}^n\}$ such that $\Pr(t(n) \in D_n) \to 1$ and

$$\lim_n \left\{ \max_{\hat{s}(n) \in D_n} \left| L(A_{p_{s^*(n)}}, p_{\hat{s}(n)}) - L(A_{p_{s^*(n)}}, p_{s^*(n)}) \right| \right\} = 0.$$

2. A learning method $A$ is almost surely weakly robust if it is weakly robust w.r.t. almost all $s$.

The definition of robustness requires the labeled sample space to be partitioned into disjoint subsets such that if the instances of a training pair and of a testing pair belong to the same subsets of the partition, then the two pairs have similar losses. Weak robustness is a generalization of this notion where we consider the average loss of testing and training pairs: if for a large (in the probabilistic sense) subset of data, the testing loss is close to the training loss, then the algorithm is weakly robust. From Proposition 1, we can see that if for any fixed $\epsilon > 0$ there exists $K$ such that an algorithm $A$ is $(K, \epsilon)$ robust, then $A$ is weakly robust. We now give the main result of this section about the necessity of robustness.

###### Theorem 3

Given a fixed sequence of training examples $s^*$, a metric learning method $A$ generalizes w.r.t. $p_{s^*}$ if and only if it is weakly robust w.r.t. $p_{s^*}$.

Proof. Following XUrobustness-ML , the sufficiency is obtained from the fact that the testing pairs are built from a sample of IID instances. We give the proof in Appendix A.2.

For the necessity, we need the following lemma which is a direct adaptation of a result introduced in XUrobustness-ML (Lemma 2). We provide the proof in Appendix A.3 for the sake of completeness.

###### Lemma 1

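Given $s^*$, if a learning method is not weakly robust w.r.t. $p_{s^*}$, there exist $\epsilon^*, \delta^* > 0$ such that the following holds for infinitely many $n$:

$$\Pr\left( \left| L(A_{p_{s^*(n)}}, p_{t(n)}) - L(A_{p_{s^*(n)}}, p_{s^*(n)}) \right| \ge \epsilon^* \right) \ge \delta^*. \qquad (6)$$

Now, recall that $l$ is nonnegative and uniformly bounded by $B$, thus by the McDiarmid inequality (recalled in Appendix A.4) we have that for any $\epsilon, \delta > 0$ there exists an index $n^*$ such that for any $n > n^*$, with probability at least $1 - \delta$, we have $\left| L(A_{p_{s^*(n)}}, p_{t(n)}) - L(A_{p_{s^*(n)}}) \right| \le \epsilon$. This implies the convergence in probability of $L(A_{p_{s^*(n)}}, p_{t(n)}) - L(A_{p_{s^*(n)}})$ to $0$, and thus from a given index (with probability greater than $1 - \delta^*$):

$$\left| L(A_{p_{s^*(n)}}, p_{t(n)}) - L(A_{p_{s^*(n)}}) \right| \le \frac{\epsilon^*}{2}. \qquad (7)$$

Now, by contradiction, suppose algorithm $A$ is not weakly robust. Lemma 1 implies that Equation 6 holds for infinitely many $n$. Combined with Equation 7 and the triangle inequality, this implies that for infinitely many $n$:

$$\left| L(A_{p_{s^*(n)}}) - L(A_{p_{s^*(n)}}, p_{s^*(n)}) \right| \ge \frac{\epsilon^*}{2},$$

which means $A$ does not generalize, thus the necessity of weak robustness is established.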

The following corollary follows immediately from Theorem 3.

###### Corollary 1

A metric learning method generalizes with probability 1 if and only if it is almost surely weakly robust.

## 5 Examples of Robust Metric Learning Algorithms

We first restrict our attention to Mahalanobis distance learning algorithms of the following form:

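$$\min_{M \succeq 0} \; c\|M\| + \frac{1}{n^2} \sum_{(s_i, s_j) \in p_s} g\big(y_{ij}[1 - f(M, x_i, x_j)]\big), \qquad (8)$$

where $s_i = (x_i, y_i)$, $s_j = (x_j, y_j)$, $y_{ij} = 1$ if $y_i = y_j$ and $-1$ otherwise, $f(M, x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)$ is the Mahalanobis distance parameterized by the $d \times d$ PSD matrix $M$, $\|\cdot\|$ some matrix norm and $c$ a regularization parameter. The loss function $g$ outputs a small value when its input is large positive and a large value when it is large negative. We assume $g$ to be nonnegative and Lipschitz continuous with Lipschitz constant $U$. Lastly, $g_0 = \sup_{s_i, s_j} g\big(y_{ij}[1 - f(0, x_i, x_j)]\big)$ is the largest loss when $M$ is $0$.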
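For concreteness, the objective of Equation 8 can be evaluated as in the sketch below. The Frobenius norm and the hinge function $g(t) = \max(0, 1 - t)$ (nonnegative, Lipschitz with constant 1) are choices made for this illustration; the formulation itself allows other norms and losses, and the code only evaluates the objective rather than solving the PSD-constrained minimization.

```python
import math


def mahalanobis(M, x, xp):
    """f(M, x, x') = (x - x')^T M (x - x')."""
    diff = [a - b for a, b in zip(x, xp)]
    d = len(diff)
    return sum(diff[i] * M[i][j] * diff[j] for i in range(d) for j in range(d))


def frobenius_norm(M):
    """Square root of the sum of squared entries of M."""
    return math.sqrt(sum(v * v for row in M for v in row))


def hinge(t):
    """Illustrative choice of g: small for large positive t, large for negative t."""
    return max(0.0, 1.0 - t)


def objective(M, sample, c, g=hinge, norm=frobenius_norm):
    """c * ||M|| + (1/n^2) * sum over all pairs of g(y_ij * [1 - f(M, x_i, x_j)])."""
    n = len(sample)
    total = 0.0
    for x_i, y_i in sample:
        for x_j, y_j in sample:
            y_ij = 1.0 if y_i == y_j else -1.0
            total += g(y_ij * (1.0 - mahalanobis(M, x_i, x_j)))
    return c * norm(M) + total / n**2


sample = [((0.0, 0.0), 0), ((1.0, 0.0), 0), ((0.0, 2.0), 1)]
zero = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
value_at_zero = objective(zero, sample, c=0.1)
value_at_identity = objective(identity, sample, c=0.1)
```

With this choice of $g$, the value at $M = 0$ is twice the fraction of differently labeled pairs, which is at most $g_0 = 2$; this is the quantity that later bounds the norm of the optimal solution.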

To prove the robustness of (8), we will need the following theorem, which essentially says that if a metric learning algorithm achieves approximately the same testing loss for testing pairs that are close to each other, then it is robust.

###### Theorem 4

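Fix $\gamma > 0$ and a metric $\rho$ of $\mathcal{Z}$. Suppose $A$ satisfies

$$|l(A_{p_s}, z_1, z_2) - l(A_{p_s}, z'_1, z'_2)| \le \epsilon(p_s), \quad \forall z_1, z_2, z'_1, z'_2 : z_1, z_2 \in s,\ \rho(z_1, z'_1) \le \gamma,\ \rho(z_2, z'_2) \le \gamma$$

and $\mathcal{N}(\gamma/2, \mathcal{Z}, \rho) < \infty$. Then $A$ is $(\mathcal{N}(\gamma/2, \mathcal{Z}, \rho), \epsilon(p_s))$-robust.

Proof. By definition of the covering number, we can partition $\mathcal{X}$ into $\mathcal{N}(\gamma/2, \mathcal{X}, \rho)$ subsets such that each subset has a diameter less than or equal to $\gamma$. Furthermore, since $\mathcal{Y}$ is a finite set, we can partition $\mathcal{Z}$ into $|\mathcal{Y}|\,\mathcal{N}(\gamma/2, \mathcal{X}, \rho)$ subsets $\{C_i\}$ such that $z_1, z'_1 \in C_i \Rightarrow \rho(z_1, z'_1) \le \gamma$. Therefore,

$$|l(A_{p_s}, z_1, z_2) - l(A_{p_s}, z'_1, z'_2)| \le \epsilon(p_s), \quad \forall z_1, z_2, z'_1, z'_2 : z_1, z_2 \in s,\ \rho(z_1, z'_1) \le \gamma,\ \rho(z_2, z'_2) \le \gamma$$

implies $z_1, z_2 \in s$, $z_1, z'_1 \in C_i$, $z_2, z'_2 \in C_j \Rightarrow |l(A_{p_s}, z_1, z_2) - l(A_{p_s}, z'_1, z'_2)| \le \epsilon(p_s)$, which establishes the theorem.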

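We now prove the robustness of (8) when $\|\cdot\|$ is the Frobenius norm.

###### Example 1 (Frobenius norm)

Algorithm (8) with the Frobenius norm $\|M\| = \|M\|_F$ is $\left(|\mathcal{Y}|\,\mathcal{N}(\gamma/2, \mathcal{X}, \|\cdot\|_2), \frac{8UR\gamma g_0}{c}\right)$-robust.

Proof. Let $M^*$ be the solution given training data $p_s$. Due to the optimality of $M^*$, we have

$$c\|M^*\|_F + \frac{1}{n^2} \sum_{(s_i, s_j) \in p_s} g\big(y_{ij}[1 - f(M^*, x_i, x_j)]\big) \le c\|0\|_F + \frac{1}{n^2} \sum_{(s_i, s_j) \in p_s} g\big(y_{ij}[1 - f(0, x_i, x_j)]\big) \le g_0$$

and thus, since $g$ is nonnegative, $\|M^*\|_F \le g_0 / c$. We can partition $\mathcal{Z}$ into $|\mathcal{Y}|\,\mathcal{N}(\gamma/2, \mathcal{X}, \|\cdot\|_2)$ sets, such that if $z$ and $z'$ belong to the same set, then $y = y'$ and $\|x - x'\|_2 \le \gamma$. Now, for $z_1, z_2, z'_1, z'_2 \in \mathcal{Z}$, if $y_1 = y'_1$, $\|x_1 - x'_1\|_2 \le \gamma$, $y_2 = y'_2$ and $\|x_2 - x'_2\|_2 \le \gamma$, then $y_{12} = y'_{12}$ and:

$$
\begin{aligned}
&\big|g\big(y_{12}[1 - f(M^*, x_1, x_2)]\big) - g\big(y'_{12}[1 - f(M^*, x'_1, x'_2)]\big)\big| \\
&\quad\le U\,\big|(x_1 - x_2)^T M^* (x_1 - x_2) - (x'_1 - x'_2)^T M^* (x'_1 - x'_2)\big| \\
&\quad= U\,\big|(x_1 - x_2)^T M^* (x_1 - x_2) - (x_1 - x_2)^T M^* (x'_1 - x'_2) + (x_1 - x_2)^T M^* (x'_1 - x'_2) - (x'_1 - x'_2)^T M^* (x'_1 - x'_2)\big| \\
&\quad= U\,\big|(x_1 - x_2)^T M^* \big(x_1 - x_2 - (x'_1 - x'_2)\big) + \big(x_1 - x_2 - (x'_1 - x'_2)\big)^T M^* (x'_1 - x'_2)\big| \\
&\quad\le U\,\Big(\big|(x_1 - x_2)^T M^* (x_1 - x'_1)\big| + \big|(x_1 - x_2)^T M^* (x'_2 - x_2)\big| + \big|(x_1 - x'_1)^T M^* (x'_1 - x'_2)\big| + \big|(x'_2 - x_2)^T M^* (x'_1 - x'_2)\big|\Big) \\
&\quad\le U\,\Big(\|x_1 - x_2\|_2 \|M^*\|_F \|x_1 - x'_1\|_2 + \|x_1 - x_2\|_2 \|M^*\|_F \|x'_2 - x_2\|_2 \\
&\quad\qquad + \|x_1 - x'_1\|_2 \|M^*\|_F \|x'_1 - x'_2\|_2 + \|x'_2 - x_2\|_2 \|M^*\|_F \|x'_1 - x'_2\|_2\Big) \\
&\quad\le \frac{8UR\gamma g_0}{c}.
\end{aligned}
$$

The last step uses $\|x_1 - x_2\|_2 \le 2R$, $\|x'_1 - x'_2\|_2 \le 2R$ and $\|M^*\|_F \le g_0/c$, so that each of the four terms is at most $2R\gamma g_0 / c$. Hence, the example holds by Theorem 4.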
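The core of this argument is the perturbation inequality $|f(M, x_1, x_2) - f(M, x'_1, x'_2)| \le 8R\gamma\|M\|_F$ for points of norm at most $R$ moved by at most $\gamma$. The sketch below checks it on random instances; the dimension, radius, and number of trials are arbitrary, and the helper names are invented for this illustration.

```python
import math
import random


def frobenius_norm(M):
    return math.sqrt(sum(v * v for row in M for v in row))


def mahalanobis(M, x, xp):
    """f(M, x, x') = (x - x')^T M (x - x')."""
    diff = [a - b for a, b in zip(x, xp)]
    d = len(diff)
    return sum(diff[i] * M[i][j] * diff[j] for i in range(d) for j in range(d))


def random_psd(rng, d):
    """Random PSD matrix built as A^T A."""
    A = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]
    return [[sum(A[k][i] * A[k][j] for k in range(d)) for j in range(d)] for i in range(d)]


def random_in_ball(rng, d, radius):
    """Random vector of Euclidean norm at most `radius`."""
    v = [rng.gauss(0, 1) for _ in range(d)]
    norm = math.sqrt(sum(a * a for a in v))
    scale = radius * rng.random() / norm
    return [a * scale for a in v]


def perturb_within_ball(rng, x, gamma, radius):
    """Move x by at most gamma, then project back onto the ball of radius `radius`.

    Projection onto a convex set is non-expansive, so the result stays within
    gamma of x whenever x itself lies in the ball.
    """
    moved = [a + b for a, b in zip(x, random_in_ball(rng, len(x), gamma))]
    norm = math.sqrt(sum(a * a for a in moved))
    return moved if norm <= radius else [a * radius / norm for a in moved]


def worst_ratio(d=3, radius=1.0, gamma=0.1, trials=500, seed=0):
    """Largest observed |f - f'| / (8 R gamma ||M||_F) over random instances."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        M = random_psd(rng, d)
        x1, x2 = random_in_ball(rng, d, radius), random_in_ball(rng, d, radius)
        x1p = perturb_within_ball(rng, x1, gamma, radius)
        x2p = perturb_within_ball(rng, x2, gamma, radius)
        gap = abs(mahalanobis(M, x1, x2) - mahalanobis(M, x1p, x2p))
        worst = max(worst, gap / (8 * radius * gamma * frobenius_norm(M)))
    return worst


ratio = worst_ratio()
```

The ratio never exceeds 1, as the derivation guarantees; multiplying by the Lipschitz constant $U$ and the norm bound $g_0/c$ then gives the robustness term of the example.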

Note that for the special case of Example 1, a generalization bound (with the same order of convergence rate) based on uniform stability was derived in Jin09 . However, it is known that sparse algorithms are not stable Xu2012 , and thus stability-based analysis fails to assess the generalization ability of recent sparse metric learning approaches Rosales2006 ; Qi2009 ; Ying2009 . The key advantage of robustness over stability is that it can accommodate arbitrary $p$-norms (or even any regularizer which is bounded below by some $p$-norm), thanks to the equivalence of norms. To illustrate this, we show the robustness when $\|\cdot\|$ is either the $\ell_1$ norm (used in Rosales2006 ; Qi2009 ), which promotes sparsity at the component level, or the $\ell_{2,1}$ norm (used in Ying2009 ), which is particularly interesting in the context of Mahalanobis distance learning since it induces group sparsity at the column/row level. (In this case, the linear projection space of the data induced by the learned Mahalanobis distance is of lower dimension than the original space, allowing more efficient computations and smaller storage size.) The proofs are reminiscent of that of Example 1 and can be found in Appendices A.5 and A.6.

###### Example 2 (ℓ1 norm)

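Algorithm (8) with $\|M\| = \|M\|_1$ is robust in the sense of Definition 2; the explicit robustness parameters are established in the appendix proof.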

###### Example 3 (ℓ2,1 norm)

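Consider Algorithm (8) with $\|M\| = \|M\|_{2,1} = \sum_i \|m^i\|_2$, where $m^i$ is the $i$-th column of $M$. This algorithm is robust in the sense of Definition 2; the explicit robustness parameters are established in the appendix proof.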

Some metric learning algorithms have kernelized versions, for instance Schultz2003 ; Davis2007 . In the following example we show robustness for a kernelized formulation.

###### Example 4 (Kernelization)

Consider the kernelized version of Algorithm (8):

$$\min_{M \succeq 0} \; c\|M\|_{\mathbb{H}} + \frac{1}{n^2} \sum_{(s_i, s_j) \in p_s} g\big(y_{ij}[1 - f(M, \phi(x_i), \phi(x_j))]\big), \qquad (9)$$

where is a feature mapping to a kernel space , the norm function of and the kernel function. Consider a cover of by ( being compact) and let and . If the kernel function is continuous, and are finite for any and thus Algorithm 9 is