Surrogate Regret Bounds for Bipartite Ranking via Strongly Proper Losses

07/02/2012
by Shivani Agarwal
The problem of bipartite ranking, where instances are labeled positive or negative and the goal is to learn a scoring function that minimizes the probability of mis-ranking a pair of positive and negative instances (or equivalently, that maximizes the area under the ROC curve), has been widely studied in recent years. A dominant theoretical and algorithmic framework for the problem has been to reduce bipartite ranking to pairwise classification; in particular, it is well known that the bipartite ranking regret can be formulated as a pairwise classification regret, which in turn can be upper bounded using usual regret bounds for classification problems. Recently, Kotlowski et al. (2011) showed regret bounds for bipartite ranking in terms of the regret associated with balanced versions of the standard (non-pairwise) logistic and exponential losses. In this paper, we show that such (non-pairwise) surrogate regret bounds for bipartite ranking can be obtained in terms of a broad class of proper (composite) losses that we term strongly proper. Our proof technique is much simpler than that of Kotlowski et al. (2011), and relies on properties of proper (composite) losses as elucidated recently by Reid and Williamson (2010, 2011) and others. Our result yields explicit surrogate bounds (with no hidden balancing terms) in terms of a variety of strongly proper losses, including for example logistic, exponential, squared and squared hinge losses as special cases. We also obtain tighter surrogate bounds under certain low-noise conditions via a recent result of Clemencon and Robbiano (2011).


1 Introduction

Ranking problems arise in a variety of applications ranging from information retrieval to recommendation systems and from computational biology to drug discovery, and have been widely studied in machine learning and statistics in the last several years. Recently, there has been much interest in understanding statistical consistency and regret behavior of algorithms for a variety of ranking problems, including various forms of label/subset ranking as well as instance ranking problems [10, 8, 12, 3, 2, 30, 13, 21, 5, 9, 19, 29].

In this paper, we study regret bounds for the bipartite instance ranking problem, where instances are labeled positive or negative and the goal is to learn a scoring function that minimizes the probability of mis-ranking a pair of positive and negative instances, or equivalently, that maximizes the area under the ROC curve [14, 1]. A popular algorithmic and theoretical approach to bipartite ranking has been to treat the problem as analogous to pairwise classification [17, 18, 14, 20, 7, 8]. Indeed, this approach enjoys theoretical support since the bipartite ranking regret can be formulated as a pairwise classification regret, and therefore any algorithm minimizing the latter over a suitable class of functions will also minimize the ranking regret (this follows formally from results of [8]; see Section 3.1 for a summary). Nevertheless, it has often been observed that algorithms such as AdaBoost, logistic regression, and in some cases even SVMs, which minimize the exponential, logistic, and hinge losses respectively in the standard (non-pairwise) setting, also yield good bipartite ranking performance [11, 20, 24]. For losses such as the exponential or logistic losses, this is not surprising since algorithms minimizing these losses (but not the hinge loss) are known to effectively estimate conditional class probabilities [31]; since the class probability function provides the optimal ranking [8], it is intuitively clear (and follows formally from results in [8, 9]) that any algorithm providing a good approximation to the class probability function should also produce a good ranking. However, there has been very little work so far on quantifying the ranking regret of a scoring function in terms of the regret associated with such surrogate losses.

Recently, [19] showed that the bipartite ranking regret of a scoring function can be upper bounded in terms of the regret associated with balanced versions of the standard (non-pairwise) exponential and logistic losses. However, their proof technique builds on analyses involving the reduction of bipartite ranking to pairwise classification, and involves arguments specific to the exponential and logistic losses (see Section 3.2). More fundamentally, the balanced losses in their result depend on the underlying distribution and cannot be optimized directly by an algorithm; while it is possible to do so approximately, one then loses the quantitative nature of the bounds.

In this work we obtain quantitative regret bounds for bipartite ranking in terms of a broad class of proper (composite) loss functions that we term strongly proper. Our proof technique is considerably simpler than that of [19], and relies on properties of proper (composite) losses as elucidated recently for example in [22, 23, 15, 6]. Our result yields explicit surrogate bounds (with no hidden balancing terms) in terms of a variety of strongly proper (composite) losses, including for example logistic, exponential, squared and squared hinge losses as special cases. We also obtain tighter surrogate bounds under certain low-noise conditions via a recent result of [9].

The paper is organized as follows. In Section 2 we formally set up the bipartite instance ranking problem and definitions related to loss functions and regret, and provide background on proper (composite) losses. Section 3 summarizes related work that provides the background for our study, namely the reduction of bipartite ranking to pairwise binary classification and the result of [19]. In Section 4 we define and characterize strongly proper losses. Section 5 contains our main result, namely a bound on the bipartite ranking regret in terms of the regret associated with any strongly proper loss, together with several examples. Section 6 gives a tighter bound under certain low-noise conditions via a recent result of [9]. We conclude with a brief discussion and some open questions in Section 7.

2 Formal Setup, Preliminaries, and Background

This section provides background on the bipartite ranking problem, binary loss functions and regret, and proper (composite) losses.

2.1 Bipartite Ranking

As in binary classification, in bipartite ranking there is an instance space $\mathcal{X}$ and binary labels $\mathcal{Y} = \{-1, +1\}$, with an unknown distribution $D$ on $\mathcal{X} \times \{-1, +1\}$. For $(X, Y) \sim D$ and $x \in \mathcal{X}$, we denote $\eta(x) = \mathbf{P}(Y = +1 \mid X = x)$ and $p = \mathbf{P}(Y = +1)$. Given i.i.d. examples $(x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times \{-1, +1\}$, the goal is to learn a scoring function $f : \mathcal{X} \to \bar{\mathbb{R}}$ (where $\bar{\mathbb{R}} = [-\infty, \infty]$) that assigns higher scores to positive instances than to negative ones. (Footnote 1: Most algorithms learn real-valued functions; we also allow the values $-\infty$ and $+\infty$ for technical reasons.) Specifically, the goal is to learn a scoring function with low ranking error (or ranking risk), defined as (Footnote 2: We assume measurability conditions where necessary.)

(1)   $\mathrm{er}_D[f] \;=\; \mathbf{E}\Big[\,\mathbf{1}\big(f(X) < f(X')\big) + \tfrac{1}{2}\,\mathbf{1}\big(f(X) = f(X')\big) \;\Big|\; Y = +1,\, Y' = -1\Big]\,,$

where $(X, Y)$ and $(X', Y')$ are assumed to be drawn i.i.d. from $D$, and $\mathbf{1}(\cdot)$ is 1 if its argument is true and 0 otherwise; thus the ranking error of $f$ is simply the probability that a randomly drawn positive instance receives a lower score under $f$ than a randomly drawn negative instance, with ties broken uniformly at random. The optimal ranking error (or Bayes ranking error or Bayes ranking risk) can be seen to be

(2)   $\mathrm{er}^*_D \;=\; \inf_{f : \mathcal{X} \to \bar{\mathbb{R}}} \mathrm{er}_D[f]$
(3)   $\phantom{\mathrm{er}^*_D} \;=\; \frac{1}{2p(1-p)}\, \mathbf{E}_{X, X'}\Big[\min\big(\eta(X)(1 - \eta(X')),\; \eta(X')(1 - \eta(X))\big)\Big]\,,$ where on the right-hand side $X, X'$ are drawn i.i.d. from the marginal of $D$ on $\mathcal{X}$.

The ranking regret of a scoring function is then simply

(4)   $\mathrm{regret}^{\mathrm{rank}}_D[f] \;=\; \mathrm{er}_D[f] - \mathrm{er}^*_D\,.$
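On a finite sample, the ranking error in Eq. (1) is estimated by the fraction of mis-ranked positive-negative pairs, with ties counted as half an error (one minus the empirical AUC). The following Python sketch is our own illustration of this computation; the function name and interface are not from the paper.

import numpy as np

def empirical_ranking_error(scores, labels):
    # Fraction of (positive, negative) pairs that are mis-ranked, counting
    # ties as half an error; equals 1 minus the empirical AUC.
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == +1]
    neg = scores[labels == -1]
    diff = pos[:, None] - neg[None, :]   # compare every positive with every negative
    return float((diff < 0).mean() + 0.5 * (diff == 0).mean())

y = np.array([+1, +1, +1, -1, -1])
s = np.array([2.3, 1.1, 0.4, 0.5, -0.7])
print(empirical_ranking_error(s, y))     # one mis-ranked pair out of six: 1/6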

We will be interested in upper bounding the ranking regret of a scoring function in terms of its regret with respect to certain other (binary) loss functions. In particular, the loss functions we consider will belong to the class of proper (composite) loss functions. Below we briefly review some standard notions related to loss functions and regret, and then discuss some properties of proper (composite) losses.

2.2 Loss Functions, Regret, and Conditional Risks and Regret

Assume again a probability distribution $D$ on $\mathcal{X} \times \{-1, +1\}$ as above. Given a prediction space $\hat{\mathcal{Y}}$, a binary loss function $\ell : \{-1,+1\} \times \hat{\mathcal{Y}} \to [0,\infty]$ (where $\hat{\mathcal{Y}} \subseteq \bar{\mathbb{R}}$) assigns a penalty $\ell(y, \hat{y})$ for predicting $\hat{y} \in \hat{\mathcal{Y}}$ when the true label is $y \in \{-1,+1\}$. (Footnote 3: Most loss functions take values in $[0,\infty)$, but some loss functions (such as the logistic loss, described later) can assign a loss of $+\infty$ to certain label-prediction pairs.) For any such loss $\ell$, the $\ell$-error (or $\ell$-risk) of a function $f : \mathcal{X} \to \hat{\mathcal{Y}}$ is defined as

(5)   $\mathrm{er}^{\ell}_D[f] \;=\; \mathbf{E}_{(X,Y) \sim D}\big[\ell(Y, f(X))\big]\,,$

and the optimal $\ell$-error (or optimal $\ell$-risk or Bayes $\ell$-risk) is defined as

(6)   $\mathrm{er}^{\ell,*}_D \;=\; \inf_{f : \mathcal{X} \to \hat{\mathcal{Y}}} \mathrm{er}^{\ell}_D[f]\,.$

The $\ell$-regret of a function $f : \mathcal{X} \to \hat{\mathcal{Y}}$ is the difference of its $\ell$-error from the optimal $\ell$-error:

(7)   $\mathrm{regret}^{\ell}_D[f] \;=\; \mathrm{er}^{\ell}_D[f] - \mathrm{er}^{\ell,*}_D\,.$

The conditional $\ell$-risk $L_\ell : [0,1] \times \hat{\mathcal{Y}} \to [0,\infty]$ is defined as (Footnote 4: Note that we overload notation by using $\eta$ here to refer to a number in $[0,1]$; the usage should be clear from context.)

(8)   $L_\ell(\eta, \hat{y}) \;=\; \mathbf{E}_{Y \sim \eta}\big[\ell(Y, \hat{y})\big] \;=\; \eta\, \ell(+1, \hat{y}) + (1 - \eta)\, \ell(-1, \hat{y})\,,$

where $Y \sim \eta$ denotes a $\{-1,+1\}$-valued random variable taking value $+1$ with probability $\eta$. The conditional Bayes $\ell$-risk $H_\ell : [0,1] \to [0,\infty]$ is defined as

(9)   $H_\ell(\eta) \;=\; \inf_{\hat{y} \in \hat{\mathcal{Y}}} L_\ell(\eta, \hat{y})\,.$

The conditional $\ell$-regret $\mathrm{regret}_\ell : [0,1] \times \hat{\mathcal{Y}} \to [0,\infty]$ is then simply

(10)   $\mathrm{regret}_\ell(\eta, \hat{y}) \;=\; L_\ell(\eta, \hat{y}) - H_\ell(\eta)\,.$

Clearly, we have for $f : \mathcal{X} \to \hat{\mathcal{Y}}$,

(11)   $\mathrm{er}^{\ell}_D[f] \;=\; \mathbf{E}_X\big[L_\ell(\eta(X), f(X))\big]$

and

(12)   $\mathrm{regret}^{\ell}_D[f] \;=\; \mathbf{E}_X\big[\mathrm{regret}_\ell(\eta(X), f(X))\big]\,.$
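The conditional quantities in Eqs. (8)-(10) are one-dimensional and easy to evaluate numerically for a given loss. The sketch below is our own illustration, with a grid search standing in for the infimum in Eq. (9); it evaluates the conditional risk, conditional Bayes risk, and conditional regret of the logistic margin loss at a particular prediction.

import numpy as np

def conditional_risk(eta, yhat, loss_pos, loss_neg):
    # L_ell(eta, yhat) = eta * ell(+1, yhat) + (1 - eta) * ell(-1, yhat)
    return eta * loss_pos(yhat) + (1.0 - eta) * loss_neg(yhat)

def conditional_bayes_risk(eta, loss_pos, loss_neg, grid):
    # H_ell(eta): infimum of the conditional risk, approximated over a grid.
    return min(conditional_risk(eta, yhat, loss_pos, loss_neg) for yhat in grid)

def conditional_regret(eta, yhat, loss_pos, loss_neg, grid):
    # regret_ell(eta, yhat) = L_ell(eta, yhat) - H_ell(eta)
    return (conditional_risk(eta, yhat, loss_pos, loss_neg)
            - conditional_bayes_risk(eta, loss_pos, loss_neg, grid))

# Logistic margin loss ell(y, yhat) = log(1 + exp(-y * yhat)).
loss_pos = lambda v: np.log1p(np.exp(-v))
loss_neg = lambda v: np.log1p(np.exp(v))
grid = np.linspace(-10, 10, 2001)
print(conditional_regret(0.8, 0.0, loss_pos, loss_neg, grid))   # approx 0.193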

We note the following:

Lemma 1.

For any prediction space $\hat{\mathcal{Y}}$ and binary loss $\ell : \{-1,+1\} \times \hat{\mathcal{Y}} \to [0,\infty]$, the conditional Bayes $\ell$-risk $H_\ell$ is a concave function on $[0,1]$.

The proof follows simply by observing that $H_\ell$ is defined as the pointwise infimum of a family of linear (and therefore concave) functions of $\eta$, and therefore is itself concave.

2.3 Proper and Proper Composite Losses

In this section we review some background material related to proper and proper composite losses, as studied recently in [22, 23, 15, 6]. While the material is meant to be mostly a review, some of the exposition is simplified compared to previous presentations, and we include a new, simple proof of an important fact (Theorem 4).

Proper Losses. We start by considering binary class probability estimation (CPE) loss functions that operate on the prediction space $[0,1]$. A binary CPE loss function $\ell : \{-1,+1\} \times [0,1] \to [0,\infty]$ is said to be proper if for all $\eta \in [0,1]$,

(13)   $\eta \;\in\; \operatorname*{arg\,min}_{\hat{\eta} \in [0,1]} L_\ell(\eta, \hat{\eta})\,,$

and strictly proper if the minimizer above is unique for all $\eta \in [0,1]$. Equivalently, $\ell$ is proper if $H_\ell(\eta) = L_\ell(\eta, \eta)$ for all $\eta \in [0,1]$, and strictly proper if in addition $L_\ell(\eta, \hat{\eta}) > H_\ell(\eta)$ for all $\hat{\eta} \ne \eta$. We have the following basic result:

Lemma 2 ([15, 26]).

Let $\ell$ be a binary CPE loss. If $\ell$ is proper, then $\ell(+1, \hat{\eta})$ is a decreasing function of $\hat{\eta}$ on $[0,1]$ and $\ell(-1, \hat{\eta})$ is an increasing function. If $\ell$ is strictly proper, then $\ell(+1, \hat{\eta})$ is strictly decreasing on $[0,1]$ and $\ell(-1, \hat{\eta})$ is strictly increasing.

We will find it useful to consider regular proper losses. As in [15], we say a binary CPE loss $\ell$ is regular if $\ell(+1, \hat{\eta}) < \infty$ for all $\hat{\eta} \in (0,1]$ and $\ell(-1, \hat{\eta}) < \infty$ for all $\hat{\eta} \in [0,1)$, i.e. if $\ell(y, \hat{\eta})$ is finite for all $(y, \hat{\eta})$ except possibly for $\ell(+1, 0)$ and $\ell(-1, 1)$, which are allowed to be infinite. The following characterization of regular proper losses is well known (see also [15]):

Theorem 3 ([25]).

A regular binary CPE loss $\ell$ is proper if and only if for all $\hat{\eta} \in [0,1]$ there exists a superderivative $H'_\ell(\hat{\eta})$ of $H_\ell$ at $\hat{\eta}$ such that for all $\eta \in [0,1]$, (Footnote 5: Here $u$ is a superderivative of $H_\ell$ at $\hat{\eta}$ if for all $\eta \in [0,1]$, $H_\ell(\eta) \le H_\ell(\hat{\eta}) + u\,(\eta - \hat{\eta})$.)

$L_\ell(\eta, \hat{\eta}) \;=\; H_\ell(\hat{\eta}) + (\eta - \hat{\eta})\, H'_\ell(\hat{\eta})\,.$

The following is a characterization of strict properness of a proper loss $\ell$ in terms of its conditional Bayes risk $H_\ell$:

Theorem 4.

A proper loss $\ell$ is strictly proper if and only if $H_\ell$ is strictly concave.

This result can be proved in several ways. A proof in [15] is attributed to an argument in [16]. If $\ell$ is twice differentiable, an alternative proof follows from a result in [6, 26], which shows that a proper loss $\ell$ is strictly proper if and only if its ‘weight function’ $w = -H''_\ell$ satisfies $w(\eta) > 0$ for all except at most countably many points $\eta \in (0,1)$; by a very recent result of [27], this condition is equivalent to strict convexity of the function $-H_\ell$, or equivalently, strict concavity of $H_\ell$. Here we give a third, self-contained proof of the above result that is derived from first principles, and that will be helpful when we study strongly proper losses in Section 4.

Proof of Theorem 4.

Let $\ell$ be a proper loss. For the ‘if’ direction, assume $H_\ell$ is strictly concave. Let $\eta, \hat{\eta} \in [0,1]$ such that $\hat{\eta} \ne \eta$, and let $\eta_t = t\eta + (1-t)\hat{\eta}$ for some $t \in (0,1)$. Since $L_\ell(\cdot, \hat{\eta})$ is affine in its first argument and $\ell$ is proper, we have

$t\, L_\ell(\eta, \hat{\eta}) + (1-t)\, H_\ell(\hat{\eta}) \;=\; t\, L_\ell(\eta, \hat{\eta}) + (1-t)\, L_\ell(\hat{\eta}, \hat{\eta}) \;=\; L_\ell(\eta_t, \hat{\eta}) \;\ge\; H_\ell(\eta_t) \;>\; t\, H_\ell(\eta) + (1-t)\, H_\ell(\hat{\eta})\,,$

where the last inequality uses strict concavity of $H_\ell$. This gives $L_\ell(\eta, \hat{\eta}) > H_\ell(\eta)$. Thus $\ell$ is strictly proper.

Conversely, to prove the ‘only if’ direction, assume $\ell$ is strictly proper. Let $\eta_1, \eta_2 \in [0,1]$ such that $\eta_1 \ne \eta_2$, and let $t \in (0,1)$ and $\eta = t\eta_1 + (1-t)\eta_2$. Then we have

$H_\ell(\eta) \;=\; L_\ell(\eta, \eta) \;=\; t\, L_\ell(\eta_1, \eta) + (1-t)\, L_\ell(\eta_2, \eta) \;>\; t\, H_\ell(\eta_1) + (1-t)\, H_\ell(\eta_2)\,,$

where the strict inequality holds by strict properness since $\eta \ne \eta_1$ and $\eta \ne \eta_2$. Thus $H_\ell$ is strictly concave. ∎
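As a concrete numerical illustration of Theorem 4 (our own, not from the original text), consider the squared CPE loss $\ell(+1, \hat{\eta}) = (1 - \hat{\eta})^2$, $\ell(-1, \hat{\eta}) = \hat{\eta}^2$: its conditional Bayes risk is $H_\ell(\eta) = \eta(1-\eta)$, which is strictly concave, and correspondingly the conditional risk has a unique minimizer at $\hat{\eta} = \eta$.

import numpy as np

# Squared CPE loss on [0, 1]: ell(+1, q) = (1 - q)^2, ell(-1, q) = q^2.
loss_pos = lambda q: (1.0 - q) ** 2
loss_neg = lambda q: q ** 2

def cond_risk(eta, q):
    # Conditional risk L(eta, q) = eta * ell(+1, q) + (1 - eta) * ell(-1, q).
    return eta * loss_pos(q) + (1.0 - eta) * loss_neg(q)

qs = np.linspace(0.0, 1.0, 1001)
for eta in [0.1, 0.37, 0.9]:
    print(eta, qs[np.argmin(cond_risk(eta, qs))])   # unique minimizer at q = eta

# Conditional Bayes risk on a grid; all second differences are negative,
# i.e. H is strictly concave (here H(eta) = eta * (1 - eta)).
etas = np.linspace(0.0, 1.0, 101)
H = np.array([cond_risk(e, qs).min() for e in etas])
print(bool(np.all(np.diff(H, 2) < 0)))              # True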

Proper Composite Losses. The notion of properness can be extended to binary loss functions operating on prediction spaces other than $[0,1]$ via composition with a link function $\psi : [0,1] \to \hat{\mathcal{Y}}$. Specifically, for any $\hat{\mathcal{Y}} \subseteq \bar{\mathbb{R}}$, a loss function $\ell : \{-1,+1\} \times \hat{\mathcal{Y}} \to [0,\infty]$ is said to be proper composite if it can be written as

(14)   $\ell(y, \hat{y}) \;=\; c\big(y, \psi^{-1}(\hat{y})\big)$

for some proper loss $c : \{-1,+1\} \times [0,1] \to [0,\infty]$ and strictly increasing (and therefore invertible) link function $\psi : [0,1] \to \hat{\mathcal{Y}}$. Proper composite losses have been studied recently in [22, 23, 6], and include several widely used losses such as squared, squared hinge, logistic, and exponential losses.

It is worth noting that for a proper composite loss $\ell$ formed from a proper loss $c$ and link $\psi$, we have $L_\ell(\eta, \hat{y}) = L_c(\eta, \psi^{-1}(\hat{y}))$ and hence $H_\ell = H_c$. Moreover, any property associated with the underlying proper loss $c$ can also be used to describe the composite loss $\ell$; thus we will refer to a proper composite loss formed from a regular proper loss as regular proper composite, a composite loss formed from a strictly proper loss as strictly proper composite, etc. In Section 4, we will define and characterize strongly proper (composite) losses, which we will use to obtain regret bounds for bipartite ranking.
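To make Eq. (14) concrete, the sketch below (our own illustration) checks numerically that the logistic loss decomposes as the log (cross-entropy) CPE loss composed with the logit link, which is the standard decomposition for this loss.

import numpy as np

logistic = lambda y, yhat: np.log1p(np.exp(-y * yhat))               # composite loss on R
log_loss = lambda y, q: -np.log(q) if y == +1 else -np.log(1.0 - q)  # proper CPE loss
link     = lambda q: np.log(q / (1.0 - q))                            # psi (logit)
link_inv = lambda v: 1.0 / (1.0 + np.exp(-v))                         # psi^{-1} (sigmoid)

# Composite structure of Eq. (14): logistic(y, yhat) == log_loss(y, psi^{-1}(yhat)).
for y in (+1, -1):
    for yhat in (-2.0, 0.0, 1.5):
        assert np.isclose(logistic(y, yhat), log_loss(y, link_inv(yhat)))

# The link is strictly increasing and invertible: psi(psi^{-1}(v)) = v.
assert np.isclose(link(link_inv(0.7)), 0.7)
print("logistic loss = log loss composed with the logit link")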

3 Related Work

As noted above, a popular theoretical and algorithmic framework for bipartite ranking has been to reduce the problem to pairwise classification. Below we describe this reduction in the context of our setting and notation, and then review the result of [19] which builds on this pairwise reduction.

3.1 Reduction of Bipartite Ranking to Pairwise Binary Classification

For any distribution $D$ on $\mathcal{X} \times \{-1,+1\}$, consider the distribution $\tilde{D}$ on $(\mathcal{X} \times \mathcal{X}) \times \{-1,+1\}$ defined as follows:

  1. Sample $(X, Y)$ and $(X', Y')$ i.i.d. from $D$;

  2. If $Y = Y'$, then go to step 1; else set (Footnote 6: Throughout the paper, $\mathrm{sign}(u) = +1$ if $u > 0$ and $-1$ otherwise.)

    $Z = \mathrm{sign}(Y - Y')$

    and return $\big((X, X'), Z\big)$.

Then it is easy to see that, under $\tilde{D}$ (writing $\mu$ for the marginal of $D$ on $\mathcal{X}$),

(15)   $\mathbf{P}_{\tilde{D}}(Z = +1) \;=\; \mathbf{P}_{\tilde{D}}(Z = -1) \;=\; \tfrac{1}{2}\,;$
(16)   $\tilde{\eta}(x, x') \;\triangleq\; \mathbf{P}_{\tilde{D}}\big(Z = +1 \mid (X, X') = (x, x')\big) \;=\; \frac{\eta(x)\,(1 - \eta(x'))}{\eta(x)(1 - \eta(x')) + \eta(x')(1 - \eta(x))}\,;$
(17)   the marginal density of $(X, X')$ under $\tilde{D}$ is proportional to $\mu(x)\,\mu(x')\,\big(\eta(x)(1 - \eta(x')) + \eta(x')(1 - \eta(x))\big)\,.$

Moreover, for the 0-1 loss $\ell_{0\text{-}1}$ given by $\ell_{0\text{-}1}(z, \hat{z}) = \mathbf{1}(z\hat{z} < 0) + \tfrac{1}{2}\,\mathbf{1}(\hat{z} = 0)$, we have the following for any pairwise (binary) classifier $h : \mathcal{X} \times \mathcal{X} \to \bar{\mathbb{R}}$:

(18)
(19)
(20)

Now for any scoring function $f : \mathcal{X} \to \bar{\mathbb{R}}$, define the pairwise classifier $g_f : \mathcal{X} \times \mathcal{X} \to \bar{\mathbb{R}}$ as

(21)   $g_f(x, x') \;=\; f(x) - f(x')\,.$

Then it is easy to see that:

(22)   $\mathrm{er}^{0\text{-}1}_{\tilde{D}}[g_f] \;=\; \mathrm{er}_D[f]$
(23)   $\mathrm{er}^{0\text{-}1,*}_{\tilde{D}} \;=\; \mathrm{er}^*_D\,,$

where $\mathrm{er}^{0\text{-}1,*}_{\tilde{D}} = \inf_{h : \mathcal{X} \times \mathcal{X} \to \bar{\mathbb{R}}} \mathrm{er}^{0\text{-}1}_{\tilde{D}}[h]$. The equality in Eq. (23) follows from the fact that the classifier $g_\eta(x, x') = \eta(x) - \eta(x')$ achieves the Bayes 0-1 risk, i.e. $\mathrm{er}^{0\text{-}1}_{\tilde{D}}[g_\eta] = \mathrm{er}^{0\text{-}1,*}_{\tilde{D}}$ [8]. Thus

(24)   $\mathrm{regret}^{\mathrm{rank}}_D[f] \;=\; \mathrm{regret}^{0\text{-}1}_{\tilde{D}}[g_f]\,,$

and therefore the ranking regret of a scoring function $f$ can be analyzed via upper bounds on the 0-1 regret of the pairwise classifier $g_f$. (Footnote 7: Note that the setting here is somewhat different from that of [3] and [2], who consider a subset version of bipartite ranking where each instance consists of some finite subset of objects to be ranked; there also the problem is reduced to a (subset) pairwise classification problem, and it is shown that given any (subset) pairwise classifier, a subset ranking function can be constructed such that the resulting subset ranking regret is at most twice the subset pairwise classification regret of the given classifier [3], or in expectation at most equal to its pairwise classification regret [2].)
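The reduction is straightforward to exercise on a sample: given a scorer f, form the pairwise classifier of Eq. (21) and measure its 0-1 error over the positive-negative pairs in the sample; this coincides with the empirical ranking error of f. A minimal sketch (our own illustration; all names are ours):

import numpy as np

def pairwise_classifier(f):
    # The induced pairwise classifier predicts on (x, x') via the score
    # difference f(x) - f(x') (cf. Eq. (21)).
    return lambda x, x_prime: f(x) - f(x_prime)

def empirical_pairwise_01_error(f, X_pos, X_neg):
    # 0-1 error of the induced pairwise classifier over all (positive, negative)
    # pairs, counting a zero score difference as half an error; this equals the
    # empirical ranking error of f.
    g = pairwise_classifier(f)
    errs = [1.0 if g(x, xp) < 0 else (0.5 if g(x, xp) == 0 else 0.0)
            for x in X_pos for xp in X_neg]
    return float(np.mean(errs))

f = lambda x: 2.0 * x                    # a toy scorer on a 1-d instance space
print(empirical_pairwise_01_error(f, X_pos=[1.2, 0.3], X_neg=[0.5, -0.4]))   # 0.25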

In particular, as noted in [8], applying a result of [4], we can upper bound the pairwise 0-1 regret above in terms of the pairwise $\ell_\psi$-regret associated with any classification-calibrated margin loss $\ell_\psi$, i.e. any loss of the form $\ell_\psi(y, \hat{y}) = \psi(y\hat{y})$ for some function $\psi : \bar{\mathbb{R}} \to [0,\infty]$ satisfying, for all $\eta \ne \tfrac{1}{2}$, (Footnote 8: We abbreviate $\mathrm{er}^{\ell_\psi}_D$ as $\mathrm{er}^{\psi}_D$, $\mathrm{regret}^{\ell_\psi}_D$ as $\mathrm{regret}^{\psi}_D$, etc.)

(25)   $\inf_{\hat{y} \in \bar{\mathbb{R}}:\, \hat{y}(2\eta - 1) \le 0} L_{\ell_\psi}(\eta, \hat{y}) \;>\; \inf_{\hat{y} \in \bar{\mathbb{R}}} L_{\ell_\psi}(\eta, \hat{y})\,.$

We note in particular that for every proper composite margin loss, the associated link function satisfies $\psi(\tfrac{1}{2}) = 0$ [22], and therefore every strictly proper composite margin loss is classification-calibrated in the sense above. (Footnote 9: We note that in general, every strictly proper (composite) loss is classification-calibrated with respect to any cost-sensitive zero-one loss, using a more general definition of classification calibration with an appropriate threshold (e.g. see [22]).)

Theorem 5 ([4]; see also [8]).

Let $\psi : \bar{\mathbb{R}} \to [0,\infty]$ be such that the margin loss $\ell_\psi$ defined as $\ell_\psi(y, \hat{y}) = \psi(y\hat{y})$ is classification-calibrated as above. Then there exists a strictly increasing function $g_\psi : [0,\infty) \to [0,\infty)$ with $g_\psi(0) = 0$ such that for any $h : \mathcal{X} \times \mathcal{X} \to \bar{\mathbb{R}}$,

$\mathrm{regret}^{0\text{-}1}_{\tilde{D}}[h] \;\le\; g_\psi\big(\mathrm{regret}^{\psi}_{\tilde{D}}[h]\big)\,.$

[4] give a construction for $g_\psi$; in particular, for the exponential loss given by $\psi_{\exp}(u) = e^{-u}$ and the logistic loss given by $\psi_{\log}(u) = \ln(1 + e^{-u})$, both of which are strictly proper composite losses (see Section 5.2) and are therefore classification-calibrated, one has

(26)   $\mathrm{regret}^{0\text{-}1}_{\tilde{D}}[h] \;\le\; \sqrt{2}\, \sqrt{\mathrm{regret}^{\exp}_{\tilde{D}}[h]}$
(27)   $\mathrm{regret}^{0\text{-}1}_{\tilde{D}}[h] \;\le\; \sqrt{2}\, \sqrt{\mathrm{regret}^{\log}_{\tilde{D}}[h]}\,.$

As we describe below, [19] build on these observations to bound the ranking regret in terms of the regret associated with balanced versions of the exponential and logistic losses.

3.2 Result of Kotlowski et al. (2011)

For any binary loss $\ell$, consider defining a balanced loss $\ell^{\mathrm{bal}}$ as

(28)

Note that such a balanced loss depends on the underlying distribution $D$ via $p = \mathbf{P}(Y = +1)$. Then [19] show the following, via analyses specific to the exponential and logistic losses:

Theorem 6 ([19]).

For any ,

Combining this with the results of Eq. (24), Theorem 5, and Eqs. (26)-(27) then gives the following bounds on the ranking regret of any scoring function $f$ in terms of the (non-pairwise) balanced exponential and logistic regrets of $f$:

(29)
(30)

This suggests that an algorithm that produces a function with low balanced exponential or logistic regret will also have low ranking regret. Unfortunately, since the balanced losses depend on the unknown distribution $D$, they cannot be optimized by an algorithm directly. (Footnote 10: We note it is possible to optimize approximately balanced losses, e.g. by estimating $p$ from the data.) [19] provide some justification for why in certain situations, minimizing the usual exponential or logistic loss may also minimize the balanced versions of these losses; however, by doing so, one loses the quantitative nature of the above bounds. Below we obtain upper bounds on the ranking regret of a function directly in terms of its loss-based regret (with no balancing terms) for a wide range of proper (composite) loss functions that we term strongly proper, including the exponential and logistic losses as special cases.

4 Strongly Proper Losses

We define strongly proper losses as follows:

Definition 7.

Let $\ell : \{-1,+1\} \times [0,1] \to [0,\infty]$ be a binary CPE loss and let $\lambda > 0$. We say $\ell$ is $\lambda$-strongly proper if for all $\eta, \hat{\eta} \in [0,1]$,

$L_\ell(\eta, \hat{\eta}) - H_\ell(\eta) \;\ge\; \frac{\lambda}{2}\, (\eta - \hat{\eta})^2\,.$

We have the following necessary and sufficient conditions for strong properness:

Lemma 8.

Let $\lambda > 0$. If $\ell$ is $\lambda$-strongly proper, then $H_\ell$ is $\lambda$-strongly concave.

Proof.

The proof is similar to the ‘only if’ direction in the proof of Theorem 4. Let $\ell$ be $\lambda$-strongly proper. Let $\eta_1, \eta_2 \in [0,1]$ such that $\eta_1 \ne \eta_2$, and let $t \in (0,1)$ and $\eta = t\eta_1 + (1-t)\eta_2$. Then we have

$H_\ell(\eta) \;=\; L_\ell(\eta, \eta) \;=\; t\, L_\ell(\eta_1, \eta) + (1-t)\, L_\ell(\eta_2, \eta) \;\ge\; t\Big(H_\ell(\eta_1) + \tfrac{\lambda}{2}(\eta_1 - \eta)^2\Big) + (1-t)\Big(H_\ell(\eta_2) + \tfrac{\lambda}{2}(\eta_2 - \eta)^2\Big) \;=\; t\, H_\ell(\eta_1) + (1-t)\, H_\ell(\eta_2) + \tfrac{\lambda}{2}\, t(1-t)(\eta_1 - \eta_2)^2\,.$

Thus $H_\ell$ is $\lambda$-strongly concave. ∎

Lemma 9.

Let $\lambda > 0$ and let $\ell$ be a regular proper loss. If $H_\ell$ is $\lambda$-strongly concave, then $\ell$ is $\lambda$-strongly proper.

Proof.

Let $\eta, \hat{\eta} \in [0,1]$. By Theorem 3, there exists a superderivative $H'_\ell(\hat{\eta})$ of $H_\ell$ at $\hat{\eta}$ such that

$L_\ell(\eta, \hat{\eta}) \;=\; H_\ell(\hat{\eta}) + (\eta - \hat{\eta})\, H'_\ell(\hat{\eta})\,.$

This gives

$L_\ell(\eta, \hat{\eta}) - H_\ell(\eta) \;=\; H_\ell(\hat{\eta}) + (\eta - \hat{\eta})\, H'_\ell(\hat{\eta}) - H_\ell(\eta) \;\ge\; \frac{\lambda}{2}\, (\eta - \hat{\eta})^2\,,$

where the inequality holds since, by $\lambda$-strong concavity of $H_\ell$, we have $H_\ell(\eta) \le H_\ell(\hat{\eta}) + (\eta - \hat{\eta})\, H'_\ell(\hat{\eta}) - \frac{\lambda}{2}(\eta - \hat{\eta})^2$. Thus $\ell$ is $\lambda$-strongly proper. ∎

This gives us the following characterization of strong properness for regular proper losses:

Theorem 10.

Let $\lambda > 0$ and let $\ell$ be a regular proper loss. Then $\ell$ is $\lambda$-strongly proper if and only if $H_\ell$ is $\lambda$-strongly concave.

Several examples of strongly proper (composite) losses will be provided in Section 5.2 and Section 5.3. Theorem 10 will form our main tool in establishing strong properness of many of these loss functions.
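As a quick numerical sanity check of the characterization in Theorem 10 (our own illustration, not from the paper): for the log loss, the conditional Bayes risk is the binary entropy $H(\eta) = -\eta\ln\eta - (1-\eta)\ln(1-\eta)$, whose second derivative $-1/(\eta(1-\eta))$ is at most $-4$; since the log loss is regular and proper, it is therefore 4-strongly proper. The sketch below verifies the strong-concavity bound numerically.

import numpy as np

# Conditional Bayes risk of the log loss: binary entropy (in nats).
H = lambda eta: -(eta * np.log(eta) + (1 - eta) * np.log(1 - eta))

etas = np.linspace(0.01, 0.99, 981)      # interior grid, step 0.001
h = etas[1] - etas[0]
# Central-difference approximation of H'' on the interior of [0, 1].
H_second = (H(etas[2:]) - 2 * H(etas[1:-1]) + H(etas[:-2])) / h ** 2
print(H_second.max())                    # approx -4, attained near eta = 1/2
print(bool(np.all(H_second <= -4 + 1e-6)))   # True: H is 4-strongly concave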

5 Regret Bounds via Strongly Proper Losses

We start by recalling the following result of [8] (adapted to account for ties, and for the conditioning on the event $(Y, Y') = (+1, -1)$ in the definition of the ranking error):

Theorem 11 ([8]).

For any $f : \mathcal{X} \to \bar{\mathbb{R}}$,

$\mathrm{regret}^{\mathrm{rank}}_D[f] \;=\; \frac{1}{2p(1-p)}\, \mathbf{E}_{X, X'}\Big[\, \big|\eta(X) - \eta(X')\big| \cdot \Big( \mathbf{1}\big( (f(X) - f(X'))(\eta(X) - \eta(X')) < 0 \big) + \tfrac{1}{2}\, \mathbf{1}\big( f(X) = f(X'),\, \eta(X) \ne \eta(X') \big) \Big) \Big]\,.$

As noted by [9], this leads to the following corollary on the regret of any plug-in ranking function based on an estimate $\hat{\eta} : \mathcal{X} \to [0,1]$ of $\eta$:

Corollary 12.

For any $\hat{\eta} : \mathcal{X} \to [0,1]$,

$\mathrm{regret}^{\mathrm{rank}}_D[\hat{\eta}] \;\le\; \frac{1}{p(1-p)}\, \mathbf{E}_X\big[\,|\hat{\eta}(X) - \eta(X)|\,\big]\,.$

For completeness, a proof is given in Appendix A. We are now ready to prove our main result.

5.1 Main Result

Theorem 13.

Let $\hat{\mathcal{Y}} \subseteq \bar{\mathbb{R}}$ and let $\lambda > 0$. Let $\ell : \{-1,+1\} \times \hat{\mathcal{Y}} \to [0,\infty]$ be a $\lambda$-strongly proper composite loss. Then for any $f : \mathcal{X} \to \hat{\mathcal{Y}}$,

$\mathrm{regret}^{\mathrm{rank}}_D[f] \;\le\; \frac{1}{p(1-p)}\, \sqrt{\frac{2}{\lambda}}\; \sqrt{\mathrm{regret}^{\ell}_D[f]}\,.$

Proof.

Let $c$ be a $\lambda$-strongly proper loss and $\psi$ be a (strictly increasing) link function such that $\ell(y, \hat{y}) = c(y, \psi^{-1}(\hat{y}))$ for all $y, \hat{y}$. Let $\hat{\eta} = \psi^{-1} \circ f$. Since $\psi$ is strictly increasing, $\hat{\eta}$ induces the same ranking of instances as $f$, and therefore by Corollary 12, Jensen's inequality, and $\lambda$-strong properness of $c$, we have

$\mathrm{regret}^{\mathrm{rank}}_D[f] \;=\; \mathrm{regret}^{\mathrm{rank}}_D[\hat{\eta}] \;\le\; \frac{1}{p(1-p)}\, \mathbf{E}_X\big[|\hat{\eta}(X) - \eta(X)|\big] \;\le\; \frac{1}{p(1-p)}\, \sqrt{\mathbf{E}_X\big[(\hat{\eta}(X) - \eta(X))^2\big]} \;\le\; \frac{1}{p(1-p)}\, \sqrt{\frac{2}{\lambda}\, \mathbf{E}_X\big[\mathrm{regret}_\ell(\eta(X), f(X))\big]} \;=\; \frac{1}{p(1-p)}\, \sqrt{\frac{2}{\lambda}\, \mathrm{regret}^{\ell}_D[f]}\,,$

where the third inequality uses $\mathrm{regret}_\ell(\eta(X), f(X)) = L_c(\eta(X), \hat{\eta}(X)) - H_c(\eta(X)) \ge \frac{\lambda}{2}\big(\eta(X) - \hat{\eta}(X)\big)^2$. ∎

Theorem 13 shows that for any strongly proper composite loss $\ell$, a function with low $\ell$-regret will also have low ranking regret. Below we give several examples of such strongly proper (composite) loss functions; properties of some of these losses are summarized in Table 1.
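The bound of Theorem 13 can be checked numerically on a small discrete distribution. The sketch below is our own illustration: the distribution and the scorer are invented for the example, the logistic loss is used as the surrogate, and λ = 4 is the strong-properness constant for the logistic loss obtained from the entropy calculation in Section 4 (its conditional Bayes risk has second derivative at most −4). The code computes the exact ranking regret and logistic regret of an imperfect scorer and confirms that the former is bounded by the right-hand side of Theorem 13 as stated above.

import numpy as np

# A small discrete distribution: instances x_1..x_4 with marginal probabilities mu
# and conditional positive-class probabilities eta(x).
mu  = np.array([0.3, 0.2, 0.4, 0.1])
eta = np.array([0.1, 0.45, 0.6, 0.9])
p = float(np.dot(mu, eta))                      # P(Y = +1)

def ranking_error(scores):
    # Exact ranking error (Eq. (1)) of a scorer given by its values on x_1..x_4.
    err = 0.0
    for i in range(4):
        for j in range(4):
            w = mu[i] * mu[j] * eta[i] * (1 - eta[j])  # P(X=x_i, Y=+1, X'=x_j, Y'=-1)
            err += w * (1.0 if scores[i] < scores[j]
                        else (0.5 if scores[i] == scores[j] else 0.0))
    return err / (p * (1 - p))

def logistic_risk(scores):
    # ell-risk (Eq. (5)) for the logistic loss ell(y, yhat) = log(1 + exp(-y * yhat)).
    return float(np.sum(mu * (eta * np.log1p(np.exp(-scores))
                              + (1 - eta) * np.log1p(np.exp(scores)))))

f = np.array([0.0, 1.0, 0.5, 2.0])              # an imperfect scorer: swaps x_2 and x_3
ranking_regret = ranking_error(f) - ranking_error(eta)          # eta ranks optimally
logistic_regret = logistic_risk(f) - logistic_risk(np.log(eta / (1 - eta)))  # logit link
bound = (1.0 / (p * (1 - p))) * np.sqrt(2.0 / 4.0) * np.sqrt(logistic_regret)
print(ranking_regret, "<=", bound)              # approx 0.049 <= approx 1.09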

5.2 Examples

Example 1 (Exponential loss).

The exponential loss $\ell_{\exp} : \{-1,+1\} \times \bar{\mathbb{R}} \to [0,\infty]$ defined as

$\ell_{\exp}(y, \hat{y}) \;=\; e^{-y\hat{y}}$

is a proper composite loss with associated proper loss $c_{\exp}$ and link function $\psi_{\exp}$ given by

$c_{\exp}(+1, \hat{\eta}) = \sqrt{\frac{1 - \hat{\eta}}{\hat{\eta}}}\,, \qquad c_{\exp}(-1, \hat{\eta}) = \sqrt{\frac{\hat{\eta}}{1 - \hat{\eta}}}\,, \qquad \psi_{\exp}(\hat{\eta}) = \frac{1}{2} \ln\Big(\frac{\hat{\eta}}{1 - \hat{\eta}}\Big)\,.$

It is easily verified that $c_{\exp}$ is regular. Moreover, it can be seen that

$H_{c_{\exp}}(\eta) \;=\; 2\sqrt{\eta(1 - \eta)}\,,$

with

$H''_{c_{\exp}}(\eta) \;=\; -\frac{1}{2\,\big(\eta(1-\eta)\big)^{3/2}} \;\le\; -4 \qquad \forall\, \eta \in (0,1)\,.$

Thus $H_{c_{\exp}}$ is 4-strongly concave, and so by Theorem 10, $\ell_{\exp}$ is 4-strongly proper composite. Therefore, applying Theorem 13, we have for any $f : \mathcal{X} \to \bar{\mathbb{R}}$,

$\mathrm{regret}^{\mathrm{rank}}_D[f] \;\le\; \frac{1}{\sqrt{2}\; p(1-p)}\, \sqrt{\mathrm{regret}^{\exp}_D[f]}\,.$

Example 2 (Logistic loss).

The logistic loss $\ell_{\log} : \{-1,+1\} \times \bar{\mathbb{R}} \to [0,\infty]$ defined as

$\ell_{\log}(y, \hat{y}) \;=\; \ln\big(1 + e^{-y\hat{y}}\big)$

is a proper composite loss with associated proper loss and link function given by