Online Learning with Pairwise Loss Functions

01/22/2013 · by Yuyang Wang et al. · Tufts University, Akamai Technologies

Efficient online learning with pairwise loss functions is a crucial component in building large-scale learning systems that maximize the area under the Receiver Operating Characteristic (ROC) curve. In this paper we investigate the generalization performance of online learning algorithms with pairwise loss functions. We show that the existing proof techniques for generalization bounds of online algorithms with a univariate loss cannot be directly applied to pairwise losses. We derive the first result providing data-dependent bounds for the average risk of the sequence of hypotheses generated by an arbitrary online learner in terms of an easily computable statistic, and show how to extract a low-risk hypothesis from the sequence. We demonstrate the generality of our results by applying them to two important problems in machine learning. First, we analyze two online algorithms for bipartite ranking: one is a natural extension of the perceptron algorithm and the other uses online convex optimization. Second, we provide a risk bound for an online algorithm for supervised metric learning.

1 Introduction

The standard framework in learning theory considers learning from examples $z_1, \ldots, z_T$, drawn independently at random from an unknown probability distribution $\mathcal{D}$ on $Z = X \times Y$ (e.g., $Y = \mathbb{R}$ for regression and $Y = \{-1, +1\}$ for classification). Typically a univariate loss function $\ell(h, z)$ is adopted to measure the performance of a hypothesis $h$, for example $\ell(h, (x, y)) = (h(x) - y)^2$ for regression or $\ell(h, (x, y)) = \mathbb{1}[h(x) \neq y]$ for classification. The aim of learning is to find a hypothesis that generalizes well, i.e., has small expected risk $L(h) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(h, z)]$.

In this paper we study learning in the context of pairwise loss functions, which depend on pairs of examples and can be expressed as $\ell(h, (z, z'))$, where the hypothesis $h$ is applied to pairs of examples, i.e., $h : X \times X \to \mathbb{R}$. Pairwise loss functions capture ranking problems that are important for a wide range of applications. For example, in the supervised ranking problem one wishes to learn a ranking function that predicts the correct ordering of objects. The hypothesis $h$ is called a ranking rule, where $h(x, x') > 0$ if $x$ is ranked higher than $x'$ and vice versa. The misranking loss (Clemençon et al., 2008; Peel et al., 2010) is a pairwise loss given by

$\ell(h, (z, z')) = \mathbb{1}\big[(y - y')\, h(x, x') < 0\big],$

where $\mathbb{1}[\cdot]$ is the indicator function and the loss is 1 when the examples are ranked in the wrong order. The goal of learning is to find a hypothesis that minimizes the expected misranking risk

$L(h) = \mathbb{E}_{(z, z') \sim \mathcal{D} \times \mathcal{D}}\big[\ell(h, (z, z'))\big].    (1)

In many interesting cases, finding a ranking rule amounts to learning a good scoring function $f$ such that $h(x, x') = f(x) - f(x')$. Therefore, higher-ranked examples will have higher scores. Another application comes from distance metric learning, where the learner wishes to learn a distance metric such that examples that share the same label are close while examples with different labels are far from each other.
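
To make the misranking loss concrete, the following sketch computes the empirical misranking risk of a scoring function on a labeled sample; for bipartite labels this equals one minus the empirical AUC. The helper names are hypothetical and the snippet only illustrates the definitions above; it is not code from the paper.

    import numpy as np

    def empirical_misranking_risk(f, X, y):
        """Fraction of pairs (i, j) with y_i > y_j that the scoring
        function f orders incorrectly (ties count as half an error)."""
        scores = np.array([f(x) for x in X])
        errors, pairs = 0.0, 0
        for i in range(len(y)):
            for j in range(len(y)):
                if y[i] > y[j]:              # x_i should be ranked above x_j
                    pairs += 1
                    if scores[i] < scores[j]:
                        errors += 1.0
                    elif scores[i] == scores[j]:
                        errors += 0.5
        return errors / max(pairs, 1)

    # Toy usage with a linear scorer on 2-D bipartite data.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 2))
    y = np.sign(X[:, 0] + 0.1 * rng.normal(size=20))
    print(empirical_misranking_risk(lambda x: np.array([1.0, 0.0]) @ x, X, y))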

This problem, especially the bipartite ranking problem where $y \in \{-1, +1\}$, has been extensively studied over the past decade in the batch setting, i.e., where the entire sequence is presented to the learner in advance of learning. Freund et al. (2003) gave generalization bounds for the RankBoost algorithm, based on uniform convergence results for classification. Agarwal et al. (2005) derived uniform convergence bounds for the bipartite ranking loss, using a quantity called the rank-shatter coefficient, which generalizes ideas from the classification setting. Agarwal and Niyogi provided bounds for the bipartite ranking problem (Agarwal and Niyogi, 2005) and the general ranking problem (Agarwal and Niyogi, 2009) using ideas from algorithmic stability. Rudin et al. (2005) approached a closely related problem where the goal is to correctly rank only the top of the ranked list, and derived generalization bounds based on covering numbers. Recently, several authors investigated oracle inequalities for pairwise-based quantities via the formalization of $U$-statistics (Clemençon et al., 2008; Rejchel, 2012) using empirical processes. Peel et al. (2010) gave an empirical Bernstein inequality for higher-order $U$-statistics. Another thread comes from the perspective of reducing ranking problems to the more familiar classification problems (Kotłowski et al., 2011; Ertekin and Rudin, 2011; Agarwal, 2012).

In this paper we investigate the generalization performance of online learning algorithms, where examples are presented in sequence, in the context of pairwise loss functions. Specifically, on each round $t$, an online learner receives an instance $x_t$ and predicts a label according to the current hypothesis $h_{t-1}$. The true label $y_t$ is then revealed and the hypothesis is updated to $h_t$. The goal of the online learner is to minimize the expected risk w.r.t. a pairwise loss function $\ell$.

Over the past two decades, online learning algorithms have been studied extensively, and theoretical results provide relative loss bounds, where the online learner competes against the best hypothesis (with hindsight) on the same sequence. Conversions of online learning algorithms and their performance guarantees to provide generalization performance in the batch setting have also been investigated (e.g., Kearns et al., 1987; Littlestone, 1990; Freund and Schapire, 1999; Zhang, 2005). Cesa-Bianchi et al. (2004) provided a general online-to-batch conversion result that holds under some mild assumptions on the loss function. Given a univariate loss function $\ell$, a sample $z_1, \ldots, z_T$ and an ensemble of hypotheses $h_0, \ldots, h_{T-1}$ generated by an online learner, the cumulative loss of the learner is defined as

$M_T = \sum_{t=1}^{T} \ell(h_{t-1}, z_t).$

Cesa-Bianchi and Gentile (2008) showed (as a refined version of the bound in Cesa-Bianchi et al., 2004) that one can extract a hypothesis $\hat{h}$ from the ensemble such that, with probability at least $1 - \delta$,

$L(\hat{h}) \leq \frac{M_T}{T} + O\!\left(\sqrt{\frac{M_T}{T} \cdot \frac{\ln(T/\delta)}{T}} + \frac{\ln(T/\delta)}{T}\right).$

Therefore, if one can develop an online learning algorithm with bounded cumulative loss for every possible realization of the sequence, then its generalization performance is guaranteed. A sharper bound exists when the loss function is strongly convex (Kakade and Tewari, 2009). The key step of these derivations is to realize that $\ell(h_{t-1}, z_t) - L(h_{t-1})$, $t = 1, \ldots, T$, is a martingale difference sequence. Thus one can use martingale concentration inequalities (Azuma's inequality or Freedman's inequality) to bound the average risk of the ensemble in terms of the cumulative loss. Unfortunately, this property no longer holds for pairwise loss functions.
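
To spell out the contrast, here is a worked restatement of the standard argument (using the pairwise notation introduced in Section 2 below):

    % Univariate case: conditioned on z_1, ..., z_{t-1}, the hypothesis
    % h_{t-1} is fixed and z_t is a fresh i.i.d. draw, so
    \mathbb{E}\big[\ell(h_{t-1}, z_t) - L(h_{t-1}) \,\big|\, z_1, \ldots, z_{t-1}\big] = 0 .
    % Pairwise case: the average loss of h_t on z_{t+1} paired with z_1, ..., z_t
    % conditions on the very examples that appear inside the loss, so in general
    \mathbb{E}\Big[\frac{1}{t}\sum_{k=1}^{t} \ell\big(h_t, (z_{t+1}, z_k)\big) \,\Big|\, z_1, \ldots, z_t\Big]
        = \frac{1}{t}\sum_{k=1}^{t} \mathbb{E}_{z}\,\ell\big(h_t, (z, z_k)\big) \;\neq\; L(h_t) .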

Of course, as mentioned for example in the work of Peel et al. (2010, Sec. 4.2), one can slightly adapt an existing online classification algorithm (e.g., the perceptron), feeding it with a sequence of example pairs and modifying the update function accordingly. In this case, the previous analysis (Cesa-Bianchi and Gentile, 2008) does apply. However, this does not make full use of the examples in the training sequence. In addition, empirical results show that this naive algorithm, which corresponds to the algorithm for online maximization of the area under the ROC curve (AUC) with a buffer size of one in (Zhao et al., 2011), is inferior to algorithms that retain some form of the history of the sequence. Alternatively, it is tempting to consider feeding the online algorithm with pairs of examples on each round. However, in this case, existing results would again fail because the resulting pairs are not i.i.d. Hence, a natural question is whether we can prove data-dependent generalization bounds based on the online pairwise loss.

This paper provides a positive answer to this question for a large family of pairwise loss functions. On each round $t+1$, we measure $\hat{L}_t(h_t)$, the average loss of $h_t$ on example $z_{t+1}$ paired with the examples $z_1, \ldots, z_t$. Let $M_T$ denote the average of these quantities over the rounds $t \geq \alpha T$ on a training sequence of length $T$, where $\alpha$ is a small constant. The main result of this paper provides a model selection mechanism to select one of the hypotheses of an arbitrary online learner, and states that the probability that the risk of the chosen hypothesis $\hat{h}$ violates

$L(\hat{h}) \leq M_T + \epsilon$

is at most of the order of

$\mathcal{N}\!\left(\mathcal{F}, \frac{\epsilon}{c\,\mathrm{Lip}}\right) e^{-c'\, \alpha^2 T \epsilon^2}.$

Here $\mathcal{N}(\mathcal{F}, \cdot)$ is the covering number for the hypothesis class $\mathcal{F}$, and the scale of the cover is determined by the Lipschitz constant $\mathrm{Lip}$ of the loss function (definitions and details are provided in the following sections). Thus, our results provide an online-to-batch conversion for pairwise loss functions. We demonstrate our results with the following two applications:

  1. We analyze two online learning algorithms for the bipartite ranking problem. We first provide an analysis of a natural generalization of the perceptron algorithm to pairwise loss functions, which provides loss bounds in both the separable case and the inseparable case. As a byproduct, we also derive a new simple proof of the best known mistake bound for the perceptron algorithm in the inseparable case. Combined with our main results, this yields the first online algorithm with a corresponding risk bound for bipartite ranking. Second, we analyze another algorithm using online convex optimization techniques, with similar risk bounds.

  2. Several online metric learning algorithms have been proposed with corresponding regret analyses, but the generalization performance of these algorithms has been left open, possibly because no tools existed to provide online-to-batch conversion with pairwise loss functions. We provide risk bounds for an online algorithm for distance metric learning by combining our results with results for online convex optimization with matrix arguments.

The rest of this paper is organized as follows. Section 2 defines the problem and states our main technical theorem, and Section 3 provides a sketch of the proof. We provide model selection results and risk analysis for convex and general loss functions in Section 4. In Section 5, we describe our online algorithm for bipartite ranking and analyze it. The results in Sections 2-5 are given for a model and algorithms with an "infinite buffer", that is, where the update of the online learner at step $t$ depends on the entire history of the sequence $z_1, \ldots, z_t$. Section 6 shows that the results and algorithms can be adapted to a buffer of limited size. Interestingly, to guarantee convergence our results require that the buffer size grows logarithmically with the sequence size. Section 7 is devoted to the analysis of online metric learning. Finally, we conclude the paper and discuss possible future directions in Section 8.

2 Main Technical Result

Given a sample $S = (z_1, \ldots, z_T)$, where $z_t = (x_t, y_t)$, and a sequence of hypotheses $h_1, \ldots, h_{T-1}$ generated by an online learning algorithm, we define the sample statistic $M_T$ as

$M_T = \frac{1}{T - s - 1} \sum_{t=s}^{T-2} \hat{L}_t(h_t), \qquad \hat{L}_t(h_t) = \frac{1}{t} \sum_{k=1}^{t} \ell\big(h_t, (z_{t+1}, z_k)\big),    (2)

where $s = \lceil \alpha T \rceil$ and $\alpha$ is a small positive constant. $\hat{L}_t(h_t)$ measures the performance of the hypothesis $h_t$ on the next example $z_{t+1}$ when paired with all previous examples. Note that instead of considering all the generated hypotheses, we only consider the average over the hypotheses $h_s, \ldots, h_{T-2}$, for which the statistic is reliable; the last two hypotheses are discarded for technical reasons. In the following, to simplify the notation, $\mathbb{E}_t[\cdot]$ denotes the conditional expectation $\mathbb{E}[\cdot \mid z_1, \ldots, z_t]$ and $\sum_t$ denotes $\sum_{t=s}^{T-2}$. We define $n = T - s - 1$ and the average risk $\bar{L} = \frac{1}{n} \sum_{t=s}^{T-2} L(h_t)$.
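
As an illustration, the statistic in (2) can be computed directly from the recorded hypotheses and the sample. The sketch below uses hypothetical names and 0-indexed analogues of the quantities above (hypotheses[t] is the model produced after seeing the first t examples); it is not the authors' code.

    import math

    def pairwise_stat_M(loss, hypotheses, zs, alpha=0.1):
        """Eq. (2): for each kept round t, average the loss of h_t on the
        next example paired with all earlier examples, then average over
        the kept rounds t = s, ..., T-2 with s = ceil(alpha * T)."""
        T = len(zs)
        s = math.ceil(alpha * T)
        total, count = 0.0, 0
        for t in range(s, T - 1):        # early and final rounds are discarded
            h_t = hypotheses[t]          # trained on zs[0], ..., zs[t-1]
            z_next = zs[t]               # plays the role of z_{t+1}
            L_hat = sum(loss(h_t, z_next, zs[k]) for k in range(t)) / t
            total += L_hat
            count += 1
        return total / count if count else 0.0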

As in (Cesa-Bianchi et al., 2004), our goal is to bound the average risk $\bar{L}$ of the sequence of hypotheses in terms of $M_T$, which can be obtained using the following theorem.

Theorem 2. Assume the hypothesis space $\mathcal{F}$ is compact. Let $h_1, \ldots, h_{T-1}$ be the ensemble of hypotheses generated by an arbitrary online algorithm working with a pairwise loss function $\ell$ such that

$\ell(h, (z, z')) = \psi\big(\eta(y, y'),\, h(x, x')\big),$

where $\psi$ is a Lipschitz function w.r.t. its second variable with a finite Lipschitz constant $\mathrm{Lip}$. Then, $\forall\, \epsilon > 0$, we have for sufficiently large $T$

$\Pr\big[\bar{L} \geq M_T + \epsilon\big] \leq c_1\, \mathcal{N}\!\left(\mathcal{F}, \frac{\epsilon}{c_2\, \mathrm{Lip}}\right) e^{-c_3\, \alpha^2 T \epsilon^2},    (3)

where $c_1, c_2, c_3$ are absolute constants.

Here the covering number $\mathcal{N}(\mathcal{F}, \epsilon)$ is defined to be the minimal $m \in \mathbb{N}$ such that there exist $m$ disks in $\mathcal{F}$ with radius $\epsilon$ that cover $\mathcal{F}$. We make the following remarks. Let $V_t$ denote $\hat{L}_t(h_t) - L(h_t)$. It can be seen that $(V_t)_t$ is no longer a martingale difference sequence. Therefore, martingale concentration inequalities that are usually used in online-to-batch conversions do not directly yield the desired bound. We need the assumption that the hypothesis space is compact so that its covering number is finite. As an example, suppose $X \subseteq \mathbb{R}^d$ and the hypothesis space $\mathcal{F}$ is the class of linear functions whose weight vectors lie within a ball of radius $R$. It can be shown (see Cucker and Zhou, 2007, chap. 5) that the covering number is one if $\epsilon \geq R$ and otherwise

$\ln \mathcal{N}(\mathcal{F}, \epsilon) \leq d \ln\!\left(\frac{4R}{\epsilon}\right).    (4)

We say that $\psi$ is Lipschitz w.r.t. the second argument if $|\psi(a, b) - \psi(a, b')| \leq \mathrm{Lip}\, |b - b'|$ for all $a, b, b'$. This form of the pairwise loss function is not restrictive and is widely used. For example, in the supervised ranking problem, we can take the hinge loss as

$\ell(h, (z, z')) = \max\Big\{0,\; 1 - \frac{y - y'}{2}\, h(x, x')\Big\},$

which can be thought of as a surrogate function for the misranking loss. Since the hinge loss is not bounded, we define the bounded hinge loss using $\min\{1, \ell(h, (z, z'))\}$ if $y \neq y'$ and 0 otherwise. We next show that this loss is Lipschitz. This is trivial for $y = y'$. For $y \neq y'$, when the first argument $a = \frac{y - y'}{2}$ is bounded by a constant $B$, $\psi$ satisfies

$|\psi(a, b) - \psi(a, b')| \leq B\, |b - b'|.$

Alternatively, one can take the square loss, i.e., $\psi(a, b) = (a - b)^2$. If its support is bounded then $\psi$ is Lipschitz w.r.t. its second argument.
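
A minimal implementation of the bounded hinge loss of this remark, assuming the linear scoring form $h(x, x') = w \cdot (x - x')$ used later for bipartite ranking (the names are illustrative, not from the paper):

    import numpy as np

    def bounded_pairwise_hinge(w, z, z_prime):
        """Hinge loss on the score difference, clipped to [0, 1];
        zero on pairs that share the same label."""
        (x, y), (xp, yp) = z, z_prime
        a = (y - yp) / 2.0               # in {-1, 0, +1} for bipartite labels
        if a == 0:
            return 0.0
        return float(np.clip(1.0 - a * np.dot(w, x - xp), 0.0, 1.0))

Since $|a| \leq 1$ for bipartite labels, this loss is 1-Lipschitz in the score difference $w \cdot (x - x')$, matching the remark above.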

3 Proof of the Main Technical Result

The proof is inspired by the work of Cucker and Smale (2002), Agarwal et al. (2005), and Rudin (2009). It makes use of the Hoeffding-Azuma inequality, McDiarmid's inequality, symmetrization techniques, and covering numbers of compact spaces.

[Proof of Theorem 2] By the definition of $M_T$ (see (2)), we wish to bound

$\Pr\left[\frac{1}{n} \sum_t \big(L(h_t) - \hat{L}_t(h_t)\big) \geq \epsilon\right],    (5)

which can be rewritten using the decomposition

$L(h_t) - \hat{L}_t(h_t) = \big(L(h_t) - \mathbb{E}_t[\hat{L}_t(h_t)]\big) + \big(\mathbb{E}_t[\hat{L}_t(h_t)] - \hat{L}_t(h_t)\big).    (6)

Thus, we can bound the two terms separately. The proof consists of four parts, as follows.

Step 1: Bounding the Martingale difference

First consider the second term in (6). We have that $U_t = \mathbb{E}_t[\hat{L}_t(h_t)] - \hat{L}_t(h_t)$ is a martingale difference sequence, i.e., $\mathbb{E}[U_t \mid z_1, \ldots, z_t] = 0$. Since the loss function is bounded in $[0, 1]$, we have $|U_t| \leq 1$. Therefore, by the Hoeffding-Azuma inequality, this term can be bounded such that

$\Pr\left[\frac{1}{n} \sum_t U_t \geq \frac{\epsilon}{2}\right] \leq \exp\left(-\frac{n \epsilon^2}{8}\right).    (7)
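
For concreteness, the Azuma step can be written out as follows (a worked version of the reconstruction above, with $n = T - s - 1$ and increments bounded by $|U_t| \leq 1$):

    % Hoeffding-Azuma for a martingale difference sequence with |U_t| <= 1:
    \Pr\!\left[\sum_{t=s}^{T-2} U_t \geq \lambda\right] \leq \exp\!\left(-\frac{\lambda^2}{2n}\right) .
    % Setting \lambda = n \epsilon / 2 gives the averaged form used in (7):
    \Pr\!\left[\frac{1}{n}\sum_{t=s}^{T-2} U_t \geq \frac{\epsilon}{2}\right] \leq \exp\!\left(-\frac{n \epsilon^2}{8}\right) .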

Step 2: Symmetrization by a ghost sample

In this step we bound the first term in (6). Let us start by introducing a ghost sample $\tilde{S} = (\tilde{z}_1, \ldots, \tilde{z}_T)$, where each $\tilde{z}_t$ follows the same distribution as $z_t$. Recall the definition of $\hat{L}_t(h_t)$ and define $\tilde{L}_t(h_t)$ as

$\tilde{L}_t(h_t) = \frac{1}{t} \sum_{k=1}^{t} \ell\big(h_t, (z_{t+1}, \tilde{z}_k)\big).    (8)

The difference between $\hat{L}_t(h_t)$ and $\tilde{L}_t(h_t)$ is that $\hat{L}_t(h_t)$ is the average loss incurred by $h_t$ on the current instance paired with all the previous examples on which $h_t$ is trained, while $\tilde{L}_t(h_t)$ is the average loss incurred by the same hypothesis on the current instance paired with an independent set of examples $\tilde{z}_1, \ldots, \tilde{z}_t$.

Claim 1

The following inequality holds:

$\Pr\left[\frac{1}{n} \sum_t \big(L(h_t) - \mathbb{E}_t[\hat{L}_t(h_t)]\big) \geq \frac{\epsilon}{2}\right] \leq 2\, \Pr\left[\frac{1}{n} \sum_t \big(\tilde{L}_t(h_t) - \mathbb{E}_t[\hat{L}_t(h_t)]\big) \geq \frac{\epsilon}{4}\right]    (9)

whenever $T \epsilon^2 \geq 8/\alpha^2$.

Notice that the probability measure on the right-hand side of (9) is on the product sample $(S, \tilde{S})$.

[Sketch of the proof of Claim 1] It can be seen that the RHS (without the factor of 2) of (9) is at least

$\Pr\left[\frac{1}{n} \sum_t \big(L(h_t) - \mathbb{E}_t[\hat{L}_t(h_t)]\big) \geq \frac{\epsilon}{2} \;\text{ and }\; \frac{1}{n} \sum_t \big(\tilde{L}_t(h_t) - L(h_t)\big) \geq -\frac{\epsilon}{4}\right].$

Since the ghost sample is independent of $S$, by Chebyshev's inequality, conditioned on $S$,

$\Pr\left[\frac{1}{n} \sum_t \big(\tilde{L}_t(h_t) - L(h_t)\big) \geq -\frac{\epsilon}{4} \,\Big|\, S\right] \geq 1 - \frac{16\, \mathrm{Var}\big[\frac{1}{n} \sum_t \tilde{L}_t(h_t) \,\big|\, S\big]}{\epsilon^2}.    (10)

To bound the variance, we first investigate the largest variation when changing one random variable $\tilde{z}_k$

with the others fixed. From (8), it can easily be seen that changing any of the $\tilde{z}_k$ varies each $\tilde{L}_t(h_t)$ with $t \geq k$ by at most $1/t$. Recall that we are only concerned with $\tilde{L}_t(h_t)$ for $t \geq s = \lceil \alpha T \rceil$. Therefore, we can see that the variation of $\frac{1}{n} \sum_t \tilde{L}_t(h_t)$ with respect to the $k$-th ghost example is bounded by

$\frac{1}{n} \sum_{t=\max(k, s)}^{T-2} \frac{1}{t} \leq \frac{1}{\alpha T}.    (11)

Thus, by Theorem 9.3 in (Devroye et al., 1996), an Efron-Stein type inequality, we have

$\mathrm{Var}\left[\frac{1}{n} \sum_t \tilde{L}_t(h_t) \,\Big|\, S\right] \leq \frac{T}{4} \cdot \frac{1}{\alpha^2 T^2} = \frac{1}{4 \alpha^2 T}.    (12)

Thus, whenever $T \epsilon^2 \geq 8/\alpha^2$, the LHS of (10) is greater than or equal to $1/2$. This completes the proof of Claim 1.

Step 3: Uniform Convergence

In this step, we show how one can bound the RHS of (9) using uniform convergence techniques, McDiarmid's inequality, and covering numbers. Our task reduces to bounding the following quantity:

$\Pr\left[\frac{1}{n} \sum_t \big(\tilde{L}_t(h_t) - \mathbb{E}_t[\hat{L}_t(h_t)]\big) \geq \frac{\epsilon}{4}\right].    (13)

Here we want to bound the probability of a large deviation between the empirical performance of the ensemble of hypotheses on the sequence on which they were learned and on an independent sequence. Since $h_t$ relies on $S$ and is independent of $\tilde{S}$, we resort to uniform convergence techniques to bound this probability. Define $D(f) = \frac{1}{n} \sum_t \big(\tilde{L}_t(f) - \mathbb{E}_t[\hat{L}_t(f)]\big)$ for $f \in \mathcal{F}$. Thus we have

$(13) \leq \Pr\left[\sup_{f \in \mathcal{F}} D(f) \geq \frac{\epsilon}{4}\right].    (14)

To bound the RHS of (14), we start with the following lemma. Given any function $f \in \mathcal{F}$ and any $\epsilon > 0$,

$\Pr\big[D(f) \geq \epsilon\big] \leq \exp\big(-c\, \alpha^2 T \epsilon^2\big)    (15)

for a constant $c$.

The proof, which is given in the appendix, shows that $D(f)$ has a bounded variation of order $1/(\alpha T)$ when changing each of its variables, and applies McDiarmid's inequality. Finally, our task is to bound $\Pr[\sup_{f \in \mathcal{F}} D(f) \geq \epsilon/4]$. Consider the simple case where the hypothesis space is finite; then, using the union bound, we immediately get the desired bound. Although $\mathcal{F}$ is not finite, a similar analysis goes through based on the assumption that $\mathcal{F}$ is compact. We will follow Cucker and Smale (2002) and show how this can be bounded. The next two lemmas (proved in the appendix) are used to derive Lemma 3. First, for any two functions $f_1, f_2 \in \mathcal{F}$,

$|D(f_1) - D(f_2)| \leq 2\, \mathrm{Lip}\, \|f_1 - f_2\|_\infty.$

Second, if $f_1, \ldots, f_m$ are the centers of $m$ disks of radius $\frac{\epsilon}{8\,\mathrm{Lip}}$ covering $\mathcal{F}$, then

$\Pr\left[\sup_{f \in \mathcal{F}} D(f) \geq \epsilon\right] \leq \Pr\left[\max_{1 \leq j \leq m} D(f_j) \geq \frac{\epsilon}{2}\right].$

Lemma 3. For every $\epsilon > 0$, we have

$\Pr\left[\sup_{f \in \mathcal{F}} D(f) \geq \epsilon\right] \leq \mathcal{N}\!\left(\mathcal{F}, \frac{\epsilon}{8\,\mathrm{Lip}}\right) \exp\left(-\frac{c\, \alpha^2 T \epsilon^2}{4}\right).    (16)

[Proof of Lemma 3] Let $m = \mathcal{N}(\mathcal{F}, \frac{\epsilon}{8\,\mathrm{Lip}})$ and consider $f_1, \ldots, f_m \in \mathcal{F}$ such that the disks centered at the $f_j$ with radius $\frac{\epsilon}{8\,\mathrm{Lip}}$ cover $\mathcal{F}$. By the first lemma above, for any $f$ in the disk centered at $f_j$ we have $|D(f) - D(f_j)| \leq \epsilon/4$.

Thus, we get

$\Pr\left[\sup_{f \in \mathcal{F}} D(f) \geq \epsilon\right] \leq \sum_{j=1}^{m} \Pr\left[D(f_j) \geq \frac{\epsilon}{2}\right].$

Combining this with (15), replacing $\epsilon$ by $\epsilon/2$, we have (16). Combining (16) and (14), we have

$\Pr\left[\frac{1}{n} \sum_t \big(\tilde{L}_t(h_t) - \mathbb{E}_t[\hat{L}_t(h_t)]\big) \geq \frac{\epsilon}{4}\right] \leq \mathcal{N}\!\left(\mathcal{F}, \frac{\epsilon}{32\,\mathrm{Lip}}\right) \exp\left(-\frac{c\, \alpha^2 T \epsilon^2}{64}\right).    (17)

This shows why we need to discard the first $s = \lceil \alpha T \rceil$ hypotheses in the ensemble. If we included $h_1$, for example, the corresponding term would have a variation of constant order; as $T$ grows, this heavy term remains in the sum, and the desired bound cannot be obtained.

Step 4: Putting it all together

From (9) and (17), we have

$\Pr\left[\frac{1}{n} \sum_t \big(L(h_t) - \mathbb{E}_t[\hat{L}_t(h_t)]\big) \geq \frac{\epsilon}{2}\right] \leq 2\, \mathcal{N}\!\left(\mathcal{F}, \frac{\epsilon}{32\,\mathrm{Lip}}\right) \exp\left(-\frac{c\, \alpha^2 T \epsilon^2}{64}\right).    (18)

From (18) and (7), together with the fact that (7) decays at least as fast as (18), we complete the proof of Theorem 2.

4 Model Selection

Following Cesa-Bianchi et al. (2004), our main tool for finding a good hypothesis from the ensemble of hypotheses generated by the online learner is to choose the one that has a small empirical risk. We measure the risk for each $h_t$ on the remaining examples $z_{t+1}, \ldots, z_T$, and penalize each $h_t$ based on the number of examples on which it is evaluated, so that the resulting upper bound on the risk is reliable. Our construction and proofs (in the appendix) closely follow the ones in (Cesa-Bianchi et al., 2004), using large-deviation results for $U$-statistics (see Clemençon et al., 2008, Appendix) instead of the Chernoff bound.

4.1 Risk Analysis for Convex Losses

If the loss function is convex in its second argument and $\mathcal{F}$ is convex, then we can use the average hypothesis $\bar{h} = \frac{1}{n} \sum_{t=s}^{T-2} h_t$. It is easy to show that $\bar{h}$ achieves the desired bound (the proof is in the appendix), i.e., by Jensen's inequality,

$L(\bar{h}) \leq \frac{1}{n} \sum_{t=s}^{T-2} L(h_t) = \bar{L},    (19)

so Theorem 2 bounds $L(\bar{h})$ by $M_T + \epsilon$ with the stated probability.
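
For linear hypotheses, the averaged hypothesis is simply the mean of the stored weight vectors; a minimal sketch with hypothetical names, using the same bookkeeping as in Section 2:

    import numpy as np

    def average_hypothesis(weight_history, alpha=0.1):
        """Average w_s, ..., w_{T-2}, discarding the first ceil(alpha*T)
        and the last weight vectors, as in the statistic (2)."""
        T = len(weight_history)
        s = int(np.ceil(alpha * T))
        return np.mean(weight_history[s:T - 1], axis=0)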

4.2 Risk Analysis for General Losses

Define the empirical risk of hypothesis $h_t$ on the remaining examples $z_{t+1}, \ldots, z_T$ as

$\hat{R}(h_t) = \frac{2}{(T - t)(T - t - 1)} \sum_{t < i < j \leq T} \ell\big(h_t, (z_i, z_j)\big).$

The hypothesis $\hat{h}$ is chosen to minimize the following penalized empirical risk,

$\hat{h} = \operatorname*{arg\,min}_{s \leq t \leq T-2} \Big(\hat{R}(h_t) + c_\delta(T - t)\Big),    (20)

where $c_\delta(\cdot)$ is a confidence penalty of order $\sqrt{\ln(T/\delta)/(T - t)}$, whose exact form follows from the $U$-statistic tail bounds (see the appendix).

Notice that we discard the last two hypotheses so that $\hat{R}(h_t)$ is well defined for every considered $t$. The following theorem, which is the main result of this paper, shows that the risk of $\hat{h}$ is bounded relative to $M_T$. The proof of Theorem 4.2 is in Appendix E.

Theorem 4.2. Let $h_1, \ldots, h_{T-1}$ be the ensemble of hypotheses generated by an arbitrary online algorithm working with a pairwise loss which satisfies the conditions given in Theorem 2. Then, $\forall\, \delta > 0$, if the hypothesis $\hat{h}$ is chosen via (20) with the confidence penalty $c_\delta$ chosen as above, then, when $T$ is sufficiently large, we have, with probability at least $1 - \delta$,

$L(\hat{h}) \leq M_T + O\!\left(\sqrt{\frac{\ln(T/\delta)}{\alpha T}}\right).$
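
A sketch of the selection rule (20) is below. The penalty shown is a stand-in of the stated order (the exact constants come from the $U$-statistic tail bounds in the appendix), and all names are hypothetical.

    import math

    def select_hypothesis(loss, hypotheses, zs, delta, alpha=0.1):
        """Pick the h_t minimizing its empirical pairwise risk on the
        examples it has not seen, plus a confidence penalty that shrinks
        with the number of remaining examples (stand-in for Eq. (20))."""
        T = len(zs)
        s = math.ceil(alpha * T)
        best, best_score = None, float("inf")
        for t in range(s, T - 1):
            rest = zs[t:]                         # examples unseen by h_t
            n = len(rest)
            if n < 2:
                continue
            risk = sum(loss(hypotheses[t], rest[i], rest[j])
                       for i in range(n) for j in range(i + 1, n))
            risk /= n * (n - 1) / 2               # empirical U-statistic risk
            penalty = math.sqrt(math.log(T / delta) / n)  # stand-in penalty
            if risk + penalty < best_score:
                best, best_score = hypotheses[t], risk + penalty
        return best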

5 Application: Online Algorithms for Bipartite Ranking

In the bipartite ranking problem we are given a sequence of labeled examples $z_t = (x_t, y_t)$ with $y_t \in \{-1, +1\}$. Minimizing the misranking loss in this setting is equivalent to maximizing the AUC, which measures the probability that the learned scoring function ranks a randomly drawn positive example higher than a randomly drawn negative example. This problem has been studied extensively in the batch setting, but the corresponding online problem was not investigated until recently. In this section, we investigate two online algorithms, analyze their relative loss bounds, and combine them with the main result to derive risk bounds for them.

5.1 Online AUC Maximization with Infinite Buffer (OAM-I)

Recently, Zhao et al. (2011) proposed an online algorithm with linear hypotheses for this problem based on reservoir sampling, and derived bounds on the expectation of the regret of this algorithm. Zhao et al. (2011) use the hinge loss (which bounds the 0-1 loss) to derive the regret bound. The hinge loss is Lipschitz, but it is not bounded and therefore not suitable for our risk bounds. Therefore, in the following we use a modified loss function where we bound the hinge loss in $[0, 1]$ such that

$\ell(w, (z, z')) = \min\Big\{1,\; \max\Big\{0,\; 1 - \frac{y - y'}{2}\, w \cdot (x - x')\Big\}\Big\}$ if $y \neq y'$, and 0 otherwise,

i.e., the bounded hinge loss defined in Remark 2. Using this loss function together with Theorem 4.2, all we need is an online algorithm that minimizes $M_T$ (or an upper bound on $M_T$), and this guarantees the generalization ability of the corresponding online learning algorithm. To this end, we propose the following perceptron-like algorithm, shown in Algorithm 1, and provide loss bounds for this algorithm. Notice that the algorithm does not treat each pair of examples separately; instead, for each new example $z_t$ it makes a large combined update using its loss relative to all previous examples. Our algorithm corresponds to the algorithm of Zhao et al. (2011) with an infinite buffer, but it uses a different learning rate and a different loss function, which are important in our proofs.

Initialize: $w_1 = 0$;
repeat
       At the $t$-th iteration, receive a training instance $z_t = (x_t, y_t)$. for $k = 1$ to $t - 1$ do
             Calculate the instantaneous loss $\ell(w_t, (z_t, z_k))$.
       end for

   Update the weight vector such that

       $w_{t+1} = w_t - \eta_t \sum_{k=1}^{t-1} \nabla_w \ell(w_t, (z_t, z_k))$

until the last instance;
Algorithm 1 Online AUC Maximization (OAM) with Infinite Buffer.
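
The following Python sketch mirrors the structure of Algorithm 1: one combined, perceptron-style update per round against all stored examples. The step size and the exact subgradient form below are illustrative stand-ins (the exact update rule and learning rate of the paper are not recoverable from this text), so treat it as a sketch rather than the authors' algorithm.

    import numpy as np

    def oam_infinite_buffer(stream, dim, eta=0.1):
        """Online AUC maximization with an infinite buffer: at round t,
        update w once using the pairwise hinge losses of the new example
        against every stored example."""
        w = np.zeros(dim)
        buf = []                              # all previously seen examples
        history = [w.copy()]                  # hypothesis sequence, for Section 4
        for (x, y) in stream:
            grad = np.zeros(dim)
            for (xk, yk) in buf:
                a = (y - yk) / 2.0
                if a != 0 and a * np.dot(w, x - xk) < 1:   # active (violating) pair
                    grad += a * (x - xk)
            if buf:
                w = w + (eta / len(buf)) * grad            # combined averaged update
            buf.append((x, y))
            history.append(w.copy())
        return w, history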

Theorem 5.1. Suppose we are given an arbitrary sequence of examples $z_1, \ldots, z_T$, and let $u$ be any unit vector. Assume $\|x_t\| \leq R$ for all $t$, and define

$L_\gamma(u) = \sum_{t=2}^{T} \frac{1}{t-1} \sum_{k=1}^{t-1} \max\Big\{0,\; 1 - \frac{y_t - y_k}{2\gamma}\, u \cdot (x_t - x_k)\Big\}.$

That is, $L_\gamma(u)$ is the cumulative average hinge loss that $u$ suffers on the sequence with margin $\gamma$. Then, after running Algorithm 1 on the sequence, the cumulative loss of the algorithm is bounded by a quantity of the form

$L_\gamma(u) + O\!\left(\frac{R^2}{\gamma^2} + \frac{R}{\gamma}\sqrt{L_\gamma(u)}\right).$

When the data is linearly separable with margin $\gamma$ (i.e., there exists a unit vector $u$ such that $\frac{y_t - y_k}{2}\, u \cdot (x_t - x_k) \geq \gamma$ for all pairs with $y_t \neq y_k$), we have $L_\gamma(u) = 0$ and the bound is constant.

[Proof of Theorem 5.1] The argument follows the classical perceptron analysis, adapted to the combined pairwise update. First notice that the update is driven only by pairs with positive instantaneous loss, and for such pairs the hinge loss of $u$ with margin $\gamma$ compensates for the margin violation; this implies that each update increases the inner product $u \cdot w_t$ in proportion to the loss suffered by the algorithm, which gives (21). On the other hand, when the instantaneous loss is zero, the weight vector is unchanged. Thus, summing over rounds, we can lower bound $u \cdot w_{T+1}$ in terms of the cumulative loss of the algorithm minus $L_\gamma(u)$, which gives (22). On the other hand, since $\|x_t - x_k\| \leq 2R$, each round increases $\|w_t\|^2$ by a bounded amount, which upper bounds $\|w_{T+1}\|^2$ and gives (23). Combining (22) and (23) through the Cauchy-Schwarz inequality $u \cdot w_{T+1} \leq \|w_{T+1}\|$ and solving the resulting quadratic inequality yields the stated bound.

We therefore get the risk bound for the proposed algorithm as follows. Let $w_1, \ldots, w_{T-1}$ be the ensemble of hypotheses generated by Algorithm 1. Then, $\forall\, \delta > 0$, if the hypothesis $\hat{h}$ is chosen via (20) with the confidence penalty chosen as in Theorem 4.2,

then the probability that