On Lipschitz Continuity and Smoothness of Loss Functions in Learning to Rank

05/03/2014 ∙ by Ambuj Tewari, et al. ∙ University of Michigan

In binary classification and regression problems, it is well understood that Lipschitz continuity and smoothness of the loss function play key roles in governing generalization error bounds for empirical risk minimization algorithms. In this paper, we show how these two properties affect generalization error bounds in the learning to rank problem. The learning to rank problem involves vector valued predictions and therefore the choice of the norm with respect to which Lipschitz continuity and smoothness are defined becomes crucial. Choosing the ℓ_∞ norm in our definition of Lipschitz continuity allows us to improve existing bounds. Furthermore, under smoothness assumptions, our choice enables us to prove rates that interpolate between 1/√(n) and 1/n rates. Application of our results to ListNet, a popular learning to rank method, gives state-of-the-art performance guarantees.


1 Introduction

In the setting of binary classification or regression, it is well known that Lipschitz continuity of the loss function impacts the generalization error of algorithms that minimize the loss on training examples. A key result that controls this impact is the Lipschitz contraction property of Rademacher (or Gaussian) complexity that, in turn, follows from the celebrated Ledoux-Talagrand contraction principle. It is also well known that Lipschitz continuity of the derivative of the loss, sometimes referred to as “smoothness”, also impacts generalization error bounds. For instance, under smoothness, one can derive rates that interpolate between an “optimistic” rate and a “pessimistic” rate depending on whether or not the expected loss of the best predictor is close to zero.

In this paper, we investigate the impact of Lipschitz continuity and smoothness of the loss function in the learning to rank problem. In learning to rank, the loss function takes a vector of predictions (or scores) as an argument. This leads to an interesting question that does not arise in binary classification or regression: which norm do we use to define Lipschitz continuity or smoothness of the loss function? Previous work has considered the use of the "default" Euclidean (or ℓ_2) norm for this purpose. We show that this choice can lead to suboptimal bounds and that better bounds can be obtained by using the ℓ_∞ norm in defining Lipschitz continuity and smoothness.

Using online regret bounds as a guide, we first show why one should expect better bounds under Lipschitz continuity with respect to the ℓ_∞ norm. However, online regret bounds require convexity of the loss function and, even under convexity, they do not establish uniform convergence of empirical loss averages to their expectations (and therefore do not lead to generalization error bounds for empirical risk minimization (ERM)). Our first key result (Theorem 4) establishes a generalization error bound, via uniform convergence, for ERM under Lipschitz continuity of the loss. We consider linear scoring functions, a popular choice in theory as well as in practice, and cover both ℓ_2-norm and ℓ_1-norm bounded linear predictors. Our result in the latter case appears to be the first of its kind for learning to rank and can be useful if the dimensionality of the feature space is high and there is a need for feature selection.

Next we consider smoothness of the loss function, again with respect to the ℓ_∞ norm, and show why it is natural to expect that it is the right notion for deriving rates that interpolate between the optimistic and pessimistic cases. Our second key result (Theorem 9) is a generalization bound for ERM under smoothness, proved via a uniform convergence analysis using local Rademacher complexities. Not only was such a result unknown for general, possibly non-convex, loss functions; we are not even aware of such a result for any specific loss function used in learning to rank.

As an illustration, we apply our key results to ListNet, a loss that is very popular in the learning to rank literature (the original ListNet paper has been cited close to 500 times already). We discover that neither its Lipschitz constant nor its smoothness constant increases with the number of documents being ranked per query. Our results therefore also provide novel theoretical insights into a popular learning to rank method.

2 Preliminaries

The increasing use of machine learning for web ranking and information retrieval tasks has led to a lot of recent research activity on the learning to rank problem (sometimes also called "subset ranking" to distinguish it from other related problems, for example, bipartite ranking). A training example in the learning to rank setting consists of a search query q together with its associated documents, which have varying degrees of relevance to the query. Human labelers provide the relevance vector R, whose entries contain the relevance labels for the individual documents. Typically, R has integer-valued entries drawn from a small set of relevance grades. For our theoretical analysis, we abstract away some of these details by assuming that a feature map exists that maps a query-document pair to a feature vector in R^d. As a result, the training example gets converted into a pair (X, R), where X is an m × d matrix whose rows are the m query-document feature vectors and m is the number of documents associated with the query. With this abstraction, the input space consists of m × d feature matrices and the label space consists of relevance vectors in R^m.

A training set consists of n iid examples drawn from some underlying distribution. To rank order the documents in a new instance X, a score vector s ∈ R^m is typically computed. A ranking of the documents can then be obtained from s, for instance by sorting its entries in decreasing order. A common choice is to make the scoring function linear in the input X. Accordingly, we consider two classes of linear scoring functions in this paper: predictors of the form X ↦ Xw with an ℓ_2-norm bound on the weight vector w, and predictors of the same form with an ℓ_1-norm bound on w.
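To make this abstraction concrete, the following is a minimal numpy sketch of how a linear scoring function produces a ranking; the dimensions, feature matrix, relevance labels, and weight vector are hypothetical stand-ins, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 5, 3                         # hypothetical: 5 documents for one query, 3 features each
X = rng.standard_normal((m, d))     # feature matrix: one row per query-document pair
R = np.array([2., 0., 1., 0., 3.])  # graded relevance labels from human judges

w = rng.standard_normal(d)          # a linear predictor; the classes above bound its l2 or l1 norm
scores = X @ w                      # score vector s in R^m
ranking = np.argsort(-scores)       # rank documents by sorting scores in decreasing order
print(scores, ranking)
```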

In the input space, it is natural to constrain the rows of X to have a bound on the appropriate dual norm (dual to the norm used to bound w). Accordingly, whenever we use the ℓ_2-bounded class, the input space is restricted to matrices whose rows have bounded ℓ_2 norm; similarly, when we use the ℓ_1-bounded class, we restrict the rows of X to have bounded ℓ_∞ norm.

These are natural counterparts to the ℓ_2-norm and ℓ_1-norm bounded linear function classes studied in binary classification and regression.

A key ingredient in the basic setup of the learning to rank problem is a loss function ℓ that maps a score vector in R^m and a relevance label to a value in R_+, the set of non-negative real numbers. For vector valued scores, the Lipschitz constant of ℓ depends on the norm that we decide to use in the score space; if ℓ is differentiable, Lipschitz continuity with respect to a given norm is equivalent to a uniform bound on the dual norm of the gradient of ℓ with respect to the score vector. Similarly, the smoothness constant of ℓ (the Lipschitz constant of its gradient) depends on the norm used in the score space; if ℓ is twice differentiable, smoothness is equivalent to a uniform bound on the operator norm of the Hessian, where the operator norm is the one induced by the chosen norm and its dual. The expected loss of a scoring function under the underlying distribution and its empirical loss on the sample are defined in the usual way. We may occasionally refer to expectations w.r.t. the sample using Ê. To reduce notational clutter, we often refer to a training example (X, R) by a single symbol and to the space of such examples by a corresponding space symbol.
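For concreteness, with the ℓ_∞ norm on the score space, Lipschitz continuity and smoothness take the following standard form (we write G_∞ and H_∞ for the respective constants; the paper's displayed equations may use different symbols):

Lipschitz continuity: |ℓ(s, R) − ℓ(s′, R)| ≤ G_∞ ‖s − s′‖_∞ for all score vectors s, s′ ∈ R^m and all labels R; for differentiable ℓ this is equivalent to ‖∇_s ℓ(s, R)‖_1 ≤ G_∞, the ℓ_1 norm being dual to ℓ_∞.

Smoothness: ‖∇_s ℓ(s, R) − ∇_s ℓ(s′, R)‖_1 ≤ H_∞ ‖s − s′‖_∞; for twice differentiable ℓ this is equivalent to ‖∇²_s ℓ(s, R)‖_{∞→1} ≤ H_∞, where ‖·‖_{∞→1} is the operator norm induced by the pair (ℓ_∞, ℓ_1).

Expected and empirical loss: L(f) = E[ℓ(f(X), R)] and L̂(f) = (1/n) Σ_{i=1}^n ℓ(f(X^(i)), R^(i)).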

2.1 Related work

Our work is directly motivated by a very interesting generalization bound for learning to rank due to Chapelle and Wu (2010, Theorem 1). They considered a loss that is Lipschitz continuous with constant G_2 w.r.t. the ℓ_2 norm and proved a high-probability uniform bound on the gap between expected and empirical loss. The dominant term on the right hand side of their bound scales as G_2 √m / √n (up to the norm bounds on predictors and inputs). Using the informal Õ notation to hide logarithmic factors, our first key result (Theorem 4) will improve this to Õ(G_∞ / √n), where G_∞ is the Lipschitz constant of ℓ w.r.t. the ℓ_∞ norm. Since G_∞ ≤ √m · G_2, our bound can never be worse than their bound. However, as we show in Section 7, for the popular ListNet loss function, both G_2 and G_∞ are constants independent of m. In such cases, our bound offers an improvement by a factor of √m.
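The comparison between the two Lipschitz constants rests on the elementary relation between the ℓ_2 and ℓ_∞ norms on R^m; the following chain (a reconstruction of the argument, not a quote from the paper) makes it explicit:

‖s − s′‖_∞ ≤ ‖s − s′‖_2 ≤ √m · ‖s − s′‖_∞   for all s, s′ ∈ R^m.

Hence a loss that is G_∞-Lipschitz w.r.t. ℓ_∞ is also G_∞-Lipschitz w.r.t. ℓ_2, and a loss that is G_2-Lipschitz w.r.t. ℓ_2 is (√m · G_2)-Lipschitz w.r.t. ℓ_∞, so that

G_2 ≤ G_∞ ≤ √m · G_2.

A bound scaling with G_∞ can therefore never be worse than one scaling with √m · G_2, and it is a factor of √m better whenever G_∞ and G_2 are both m-independent constants, as is the case for ListNet.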

Our proof technique is very different from that of Chapelle and Wu (2010). In the absence of an obvious contraction principle that would allow one to get rid of the loss function and work directly with the complexity of the underlying linear function class, they resorted to first principles and invoked Slepian's lemma. However, that forces them to define the Lipschitz constant w.r.t. the ℓ_2 norm. We deal with the absence of a general contraction principle by using covering number arguments that work quite nicely when the Lipschitz constant is defined w.r.t. the ℓ_∞ norm.

To the best of our knowledge, our second key result (Theorem 9) has no direct predecessor in the learning to rank literature. But in terms of techniques, we do rely heavily on previous work by Bousquet (2002) and Srebro et al. (2010). A key lemma (Lemma 6) we prove here is based on a vector extension of an inequality that was shown to hold in the scalar predictions case by Srebro et al. (2010) when a smooth loss function is used.

3 Online regret bounds under Lipschitz continuity

In this section, we build some intuition as to why it is natural to use the ℓ_∞ norm in defining the Lipschitz constant of the loss ℓ. To this end, consider the following well known online gradient descent (OGD) regret guarantee. Recall that OGD refers to the simple online algorithm that, at time t, takes a step in the direction of the negative gradient of the current loss (followed by projection onto the feasible set). If we run OGD on convex losses to generate its iterates, then the average loss of the iterates exceeds the average loss of any fixed comparator by at most a term of order G/√n, where G is a bound on the maximum ℓ_2-norm of the gradients encountered. If the losses come from iid examples, then by setting the step size appropriately and using a standard online-to-batch conversion technique we can guarantee an excess risk bound of the same order, where G now has to upper bound the ℓ_2-norm of the gradient of w ↦ ℓ(Xw, R), which the chain rule expresses in terms of the gradient of ℓ with respect to the score vector. Finally, as recorded in (1), this gradient norm can be upper bounded in terms of the Lipschitz constant of ℓ w.r.t. ℓ_∞ (and the bound on the row norms of X) because of the following lemma.

Lemma 1.

For any p ∈ [1, ∞],

where q is the dual exponent of p (i.e., 1/p + 1/q = 1).

Proof.

The first equality is true because

The second is true because

Thus, we have shown that if ℓ has Lipschitz constant G_∞ w.r.t. ℓ_∞, then we can guarantee an excess risk bound of order G_∞/√n (up to the norm bounds on the predictor and the inputs). This is encouraging, but there are two deficiencies of this approach based on online regret bounds. First, there is no way to generalize the result to Lipschitz, but non-convex, loss functions. Second, the result applies to the output of a specific algorithm; that is, we do not get uniform convergence bounds or excess risk bounds for ERM. We now address these issues.
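As a concrete illustration of the argument above, here is a minimal numpy sketch of projected OGD with iterate averaging (a standard online-to-batch conversion). The listwise squared-error loss, the step size, the ball radius, and the data are hypothetical stand-ins chosen for simplicity, not the paper's choices.

```python
import numpy as np

def loss_grad(w, X, R):
    """Hypothetical convex listwise loss: mean squared error between scores and
    relevance labels. Returns the loss value and its gradient w.r.t. w."""
    s = X @ w                                   # score vector for the m documents
    g_s = 2.0 * (s - R) / len(R)                # gradient w.r.t. the score vector
    return np.mean((s - R) ** 2), X.T @ g_s     # chain rule: grad_w = X^T grad_s

def ogd(data, dim, radius=1.0, eta=0.1):
    """Projected online gradient descent over the l2 ball of the given radius,
    returning the averaged iterate (online-to-batch conversion)."""
    w = np.zeros(dim)
    iterates = []
    for X, R in data:
        _, g = loss_grad(w, X, R)
        w = w - eta * g
        norm = np.linalg.norm(w)
        if norm > radius:                       # project back onto the l2 ball
            w = w * (radius / norm)
        iterates.append(w.copy())
    return np.mean(iterates, axis=0)

rng = np.random.default_rng(0)
m, d, n = 10, 5, 200
data = [(rng.standard_normal((m, d)), rng.integers(0, 3, size=m).astype(float))
        for _ in range(n)]
w_bar = ogd(data, d)
print(w_bar)
```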

4 Generalization error bounds under Lipschitz continuity

The above discussion suggests that we have a possibility of deriving tighter, possibly m-independent, generalization error bounds by assuming that ℓ is Lipschitz continuous w.r.t. ℓ_∞. The standard approach in binary classification is to appeal to the Ledoux-Talagrand contraction principle for Rademacher complexity (Bartlett and Mendelson, 2003), getting rid of the Lipschitz loss function (which takes a scalar argument in the binary classification case) and incurring a factor equal to the Lipschitz constant of the loss in the Rademacher complexity bound. It is not immediately clear how such an approach would work when the loss takes vector valued arguments and is Lipschitz w.r.t. ℓ_∞, since we are not aware of an appropriate extension of the Ledoux-Talagrand contraction principle. Note that Lipschitz continuity w.r.t. the Euclidean norm does not pose a significant challenge since Slepian's lemma can be applied to get rid of the loss function. As we mentioned before, several authors have already exploited Slepian's lemma in this context (Bartlett and Mendelson, 2003; Chapelle and Wu, 2010).

In the absence of a general principle that would allow us to deal with an arbitrary loss function that is Lipschitz w.r.t. ℓ_∞, we take a route involving covering numbers. Define the data-dependent (pseudo-)metric given by the maximum, over the sample, of the absolute difference in losses of two scoring functions, and let the covering number at scale ε of the ℓ_2-bounded or ℓ_1-bounded class w.r.t. this metric be defined accordingly. Also define the corresponding worst case covering number over all samples of a given size.

With these definitions in place, we can state our first result on covering numbers.

Proposition 2.

Let the loss ℓ be Lipschitz in its first argument w.r.t. ℓ_∞ with constant G_∞. Then the following covering number bounds hold:

Proof.

Note that the Lipschitz property of ℓ lets us pass from closeness of score vectors (in ℓ_∞) to closeness of losses.

This immediately implies that if we have a cover of the class of ℓ_2-bounded (respectively ℓ_1-bounded) linear scoring functions at scale ε/G_∞ w.r.t. the score-level metric, then it is also a cover of the corresponding loss class at scale ε w.r.t. the metric above. From the point of view of the scalar valued linear function classes, the rows of the matrices X^(1), …, X^(n) constitute a data set of size nm. Therefore, we have

(2)

as well as

(3)

Now we appeal to the following bound due to Zhang (2002, Corollary 3 and Corollary 5):

Plugging these into (2) and (3) respectively proves the result. ∎

Recall that the ℓ_2 covering number uses the empirical ℓ_2 (pseudo-)metric over the sample. It is well known that a control on ℓ_2 covering numbers provides control on the empirical Rademacher complexity and that ℓ_2 covering numbers are smaller than ℓ_∞ ones. For us, it will be convenient to use a more refined version of this control due to Mendelson (2002) (we use a further refinement due to Srebro and Sridharan, available at http://ttic.uchicago.edu/~karthik/dudley.pdf). Let F be a class of functions uniformly bounded by B. Then, we have

(4)
(5)

Here R̂(F) is the empirical Rademacher complexity of the class F, defined as

R̂(F) = E_σ [ sup_{f ∈ F} (1/n) Σ_{i=1}^n σ_i f(Z^(i)) ],

where σ_1, …, σ_n are iid Rademacher (symmetric Bernoulli) random variables.
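The bounds (4)–(5) referred to above are the Dudley entropy integral and its refined version; in the Srebro–Sridharan note the refined form reads roughly as follows (the exact constants and the upper limit of integration should be taken from that note):

R̂(F) ≤ inf_{α > 0} { 4α + (12/√n) ∫_α^B √( log N_2(ε, F, n) ) dε },

where N_2(ε, F, n) denotes the worst case ℓ_2 covering number of F at scale ε over samples of size n.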

Corollary 3.

Let ℓ be Lipschitz w.r.t. ℓ_∞ with constant G_∞ and uniformly bounded by B over the class being used (a uniform bound on the loss follows easily under very reasonable boundedness assumptions). Then the empirical Rademacher complexities of the loss classes induced by the ℓ_2-bounded and ℓ_1-bounded predictors are bounded as

Proof.

These follow by simply plugging the estimates from Proposition 2 into (5) and choosing the truncation level in the entropy integral optimally. ∎

Control on the Rademacher complexity immediately leads to uniform convergence bounds and generalization error bounds for ERM. The informal Õ notation hides logarithmic factors; note that all hidden factors are small and computable from the results above.

Theorem 4.

Suppose ℓ is Lipschitz w.r.t. ℓ_∞ with constant G_∞ and is uniformly bounded by B over the function class being used. With probability at least 1 − δ,

and therefore, with probability at least 1 − δ,

where the second bound is for an empirical risk minimizer (i.e., a minimizer of the empirical loss) over the ℓ_2-bounded class. The same result holds for the ℓ_1-bounded class with the norm bound on the predictor replaced by the corresponding ℓ_1 bound.

Proof.

Follows from standard bounds using Rademacher complexity. See, for example, Bartlett and Mendelson (2003). ∎

As we said before, ignoring logarithmic factors, the bound for the ℓ_2-bounded class is an improvement over the bound of Chapelle and Wu (2010). The generalization bound for the ℓ_1-bounded class appears to be new and could be useful in learning to rank situations involving high dimensional features.

5 Online regret bounds under smoothness

Let us go back to the OGD guarantee, this time presented in a slightly more refined version. If we run OGD with learning rate η then, for all comparators w:

where ∇φ_t(w_t) denotes the gradient of the loss at time t (if φ_t is not differentiable at w_t then we can take an arbitrary subgradient of φ_t at w_t). Now assume that all the φ_t's are non-negative functions and are smooth w.r.t. ℓ_2 with constant H. Lemma 3.1 of Srebro et al. (2010) tells us that any non-negative smooth function enjoys an important self-bounding property for the gradient,

which bounds the magnitude of the gradient of such a function at a point in terms of the value of the function itself at that point. Plugging this into the OGD guarantee gives:

Again, instantiating the losses with iid examples and using the online-to-batch conversion technique, we can arrive at the following bound: for all w:

At this stage, we can fix w*, the optimal ℓ_2-norm bounded predictor, and optimize the right hand side over η by setting

(6)

After plugging this value of η into the bound above and some algebra (see Appendix A), we get the upper bound

(7)

Such a rate interpolates between a 1/√n rate in the "pessimistic" case (when the expected loss of the best predictor is bounded away from zero) and a 1/n rate in the "optimistic" case (when the expected loss of the best predictor is zero); this terminology is due to Panchenko (2002).
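For reference, the self-bounding property from Srebro et al. (2010, Lemma 3.1) and the shape of the resulting interpolating bound are, in standard form (a reconstruction up to constants, not a quote of the paper's equation (7)):

‖∇φ(w)‖_2 ≤ √( 4 H φ(w) )   for any non-negative H-smooth φ,

and plugging this into the OGD guarantee with the optimized step size yields an excess risk bound of the shape

L(w̄) − L(w*) ≲ √( H L(w*) / n ) + H / n   (up to the norm bounds on w* and the inputs),

which is of order 1/√n when L(w*) is bounded away from zero and of order 1/n when L(w*) = 0.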

We have not yet related the smoothness constant H to the smoothness of the underlying loss ℓ (viewed as a function of the score vector). We do this now. Assume ℓ is twice differentiable. Then we need to choose H such that

using the chain rule to express the Hessian with respect to w in terms of the Hessian of ℓ with respect to the score vector. Note that, for OGD, we need smoothness in w w.r.t. ℓ_2, which is why the matrix norm above is the operator norm corresponding to the pair (ℓ_2, ℓ_2). In fact, when we say "operator norm" without mentioning the pair of norms involved, it is this norm that is usually meant. It is well known that this norm is equal to the largest singular value of the matrix. But, just as before, we can bound it in terms of the smoothness constant of ℓ w.r.t. ℓ_∞:

(8)

where we used Lemma 1 once again.

This result based on online regret bounds is great for building intuition but suffers from the two defects mentioned at the end of Section 3. In the smoothness case, it additionally suffers from a more serious defect: the correct choice of the learning rate requires knowledge of the expected loss of the optimal predictor, which is seldom available.

6 Generalization error bounds under smoothness

Once again, to prove a general result for possibly non-convex smooth losses, we will adopt an approach based on covering numbers. To begin, we will need the following useful lemma from Srebro et al. (2010, Lemma A.1 in the Supplementary Material). Note that, for functions over the reals, we do not need to talk about the norm when dealing with smoothness since essentially the only norm available is the absolute value.

Lemma 5.

For any H-smooth non-negative function f : R → R and any t, u ∈ R we have
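For reference, the scalar inequality in Srebro et al. (2010, Lemma A.1) has the following form (reproduced here up to the exact constant):

( f(t) − f(u) )² ≤ 6 H ( f(t) + f(u) ) ( t − u )².

Lemma 6 below extends it by replacing the squared difference of the scalar arguments with the squared norm of the difference of the score vectors.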

We first provide an easy extension of this lemma to the vector case.

Lemma 6.

If ℓ is a non-negative function with smoothness constant H w.r.t. a norm ‖·‖ then for any two score vectors s, s′ we have

Proof.

See Appendix B. ∎

Using the basic idea behind local Rademacher complexity analysis, we define the following loss class:

Note that this is a random subclass of functions since its defining constraint depends on the sample and is therefore a random variable.

Proposition 7.

Let ℓ be smooth, in its first argument, w.r.t. ℓ_∞ with constant H_∞. The covering numbers of this random loss class in the metric defined above are bounded as follows:

Proof.

Let f and f′ be two functions in the class. Using Lemma 6,

where the last inequality follows from the constraint defining the class.

This immediately implies that if we have a cover of the underlying class of scoring functions at an appropriate scale w.r.t. the score-level metric, then it is also a cover of the loss class w.r.t. the metric above. Therefore, we have

(9)

Appealing once again to a result by Zhang (2002, Corollary 3), we get

which finishes the proof. ∎

Corollary 8.

Let ℓ be smooth w.r.t. ℓ_∞ with constant H_∞ and uniformly bounded by B over the class being used. Then the empirical Rademacher complexity of the random local class defined above is bounded as

where the quantity on the right depends on H_∞, B, the norm bounds, and the radius of the local class.

Proof.

See Appendix C. ∎

With the above corollary in place we can now prove our second key result.

Theorem 9.

Suppose ℓ is smooth w.r.t. ℓ_∞ with constant H_∞ and is uniformly bounded by B over the function class being used. With probability at least 1 − δ,

where the constants are inherited from Corollary 8. Moreover, with probability at least 1 − δ,

where the two functions appearing in the bound are the minimizers of the empirical loss and of the expected loss respectively (over the function class being used).

Proof.

We appeal to Theorem 6.1 of Bousquet (2002), which assumes that the empirical Rademacher complexity of the local class admits an upper bound ψ(r), where ψ is a non-negative, non-decreasing, non-zero function such that ψ(r)/√r is non-increasing. The upper bound in Corollary 8 above satisfies these conditions and therefore we set ψ to be that bound. From Bousquet's result, we know that, with probability at least 1 − δ,

where r* is the largest solution to the fixed point equation ψ(r) = r. In our case, r* can be computed explicitly. This proves the first inequality.

Now, using the above inequality with the empirical risk minimizer, and noting that its empirical loss is no larger than that of the expected-loss minimizer, we get

The second inequality now follows after some elementary calculations detailed in Appendix D. ∎

7 Application to ListNet

We now apply the results of this paper to the ListNet loss function (Lan et al., 2009). ListNet is a popular learning method with competitive performance on a variety of benchmark data sets. It is defined in the following way (the ListNet paper actually defines a family of losses based on probability models for top-k documents; we use the top-one version in our definition since that is the version implemented in their experimental results). Define m maps from R^m to R that send a vector to the softmax probability of each coordinate, i.e., P_j(v) = exp(v_j) / Σ_{i=1}^m exp(v_i) for 1 ≤ j ≤ m. Then, we have
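A minimal numpy sketch of the top-one ListNet loss, i.e., the cross entropy between the softmax distribution induced by the relevance labels and the one induced by the scores (this is the standard top-one formulation; the paper's exact notation may differ):

```python
import numpy as np

def softmax(v):
    """Top-one probability distribution P(j) = exp(v_j) / sum_i exp(v_i)."""
    e = np.exp(v - np.max(v))          # subtract max for numerical stability
    return e / e.sum()

def listnet_loss(scores, relevance):
    """Cross entropy between the label-induced and score-induced distributions."""
    p_y = softmax(relevance)
    p_s = softmax(scores)
    return -np.sum(p_y * np.log(p_s))

scores = np.array([1.2, -0.3, 0.7, 0.0])
relevance = np.array([2.0, 0.0, 1.0, 0.0])
print(listnet_loss(scores, relevance))
```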

Since our results need the Lipschitz and smoothness constants G_∞ and H_∞, we first compute them for the ListNet loss function.

Proposition 10.

The Lipschitz and smoothness constants of the ListNet loss w.r.t. ℓ_∞ are bounded by absolute constants that do not depend on the number of documents m.

Proof.

See Appendix E. ∎
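As a numerical sanity check of the m-independence claim (not a substitute for the proof in Appendix E): for the top-one formulation above, the gradient with respect to the scores is softmax(scores) − softmax(relevance), and its ℓ_1 norm, the dual norm controlling the ℓ_∞ Lipschitz constant, is at most 2 regardless of m, since it is a difference of two probability vectors. The hypothetical snippet below estimates it empirically for growing list lengths.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

rng = np.random.default_rng(1)
for m in [5, 50, 500, 5000]:
    worst = 0.0
    for _ in range(200):
        s = rng.standard_normal(m)                    # random score vector
        y = rng.integers(0, 5, size=m).astype(float)  # random relevance labels
        grad = softmax(s) - softmax(y)                # gradient of the top-one loss w.r.t. s
        worst = max(worst, np.abs(grad).sum())        # l1 norm bounds the l_inf Lipschitz constant
    print(m, worst)                                   # stays bounded (<= 2) as m grows
```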

Since the bounds above are independent of m, the generalization bounds resulting from their use in Theorem 4 and Theorem 9 will also be independent of m (up to logarithmic factors). We are not aware of prior generalization bounds for ListNet that do not scale with the number of documents. In particular, the results of Lan et al. (2009) have a dependence on m since they consider the top-k version of ListNet. However, even if only the top-one variant above is considered, it seems that their proof technique would result in at least a linear dependence on m and can never yield as tight a bound as we get from our general results. Moreover, generalization error bounds for ListNet that interpolate between the pessimistic and optimistic rates have not been provided before.

8 Conclusion

In this paper, we derived generalization error bounds for learning to rank under Lipschitz continuity and smoothness assumptions on the loss function. Under the latter assumption, our bounds interpolate between 1/√n and 1/n rates. We showed why it is natural to measure Lipschitz and smoothness constants of learning to rank losses with respect to the ℓ_∞ norm. Our bounds under Lipschitz continuity improve previous results, whereas our results under smoothness assumptions are, to the best of our knowledge, the first of their kind in the learning to rank setting.

A number of interesting avenues present themselves for further exploration. If the covering number approach can be bypassed via an argument directly at the level of Rademacher complexity, it might be possible to avoid some of the logarithmic factors that we incur in our bounds. Another thing to note is that our arguments do not rely much on the specifics of the learning to rank setting and might apply more generally to situations, such as multi-label learning, that involve losses taking a vector of predictions as an argument.

Acknowledgments

We gratefully acknowledge the support of NSF under grant IIS-1319810.

References

  • Bartlett and Mendelson [2003] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3:463–482, 2003.
  • Bousquet [2002] Olivier Bousquet. Concentration inequalities and empirical processes theory applied to the analysis of learning algorithms. PhD thesis, Ecole Polytechnique, 2002.
  • Chapelle and Wu [2010] O. Chapelle and M. Wu. Gradient descent optimization of smoothed information retrieval metrics. Information Retrieval, 13(3):216–235, 2010.
  • Lan et al. [2009] Yanyan Lan, Tie-Yan Liu, Zhiming Ma, and Hang Li. Generalization analysis of listwise learning-to-rank algorithms. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 577–584, 2009.
  • Mendelson [2002] Shahar Mendelson. Rademacher averages and phase transitions in Glivenko-Cantelli classes. IEEE Transactions on Information Theory, 48(1):251–263, 2002.
  • Panchenko [2002] Dmitriy Panchenko. Some extensions of an inequality of Vapnik and Chervonenkis. Electronic Communications in Probability, 7:55–65, 2002.
  • Srebro et al. [2010] Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise, and fast rates. In Advances in Neural Information Processing Systems 23, pages 2199–2207, 2010.
  • Zhang [2002] Tong Zhang. Covering number bounds of certain regularized linear function classes. The Journal of Machine Learning Research, 2:527–550, 2002.

Appendix A Calculations involved in deriving Equation (7)

Plugging the value of η from (6) into the expression