# Online Boosting for Multilabel Ranking with Top-k Feedback

We present online boosting algorithms for multilabel ranking with top-k feedback, where the learner only receives information about the top k items from the ranking it provides. We propose a novel surrogate loss function and an unbiased estimator, allowing weak learners to update themselves with limited information. Using these techniques, we adapt the full information multilabel ranking algorithms of Jung and Tewari (2018) to the top-k feedback setting and provide theoretical performance bounds which closely match the bounds of their full information counterparts, at the cost of increased sample complexity. The experimental results also verify these claims.


## 1 Introduction

The classical theory of boosting is an impressive algorithmic and theoretical achievement (see Schapire and Freund (2012) for an authoritative treatment). However, for the most part it assumes that learning occurs with a batch of data that is already collected and that ground truth labels are fully observed by the learning algorithm. Modern “big data” applications require us to go beyond these assumptions in a number of ways.

First, large volumes of available data mean that online algorithms (Shalev-Shwartz, 2012; Hazan, 2016) are needed to process them effectively. Second, in many applications such as text categorization, multimedia (e.g., images and videos) annotation, bioinformatics, and cheminformatics, the ground truth may not be just a single label but a set of labels (Zhang and Zhou, 2013; Gibaja and Ventura, 2015). Third, in a multilabel setting, a common design decision (Tsoumakas et al., 2011) is to have the learner output a ranking of the labels. Fourth, human annotators may not have the patience to go down the full ranking to give us the ground truth label set. Therefore, the learner may have to deal with partial feedback. A very natural partial feedback is top-k feedback (Chaudhuri and Tewari, 2017), where the annotator only provides ground truth for the top-k ranked labels. Theory and algorithms for online, multilabel boosting with top-k feedback have thus far been missing. Our goal in this paper is to fill this gap in the literature.

Existing literature has dealt with some of the challenges mentioned above. For example, recent work has developed the theory of online boosting for single label problems such as binary classification (Beygelzimer et al., 2015) and multiclass classification (Jung et al., 2017). This was followed by an extension to the complex label setting of multilabel ranking (Jung and Tewari, 2018). All of these works were in the full information setting where ground truth labels are fully revealed to the learner. Zhang et al. (2018) recently extended the theory of multiclass boosting to the bandit setting where the learner only observes whether the (single) label it predicted is correct or not. However, none of the available extensions of classical boosting has all three of the following desired attributes at once: online updates, multilabel rankings, and the ability to learn from only top-k feedback.

Note that top-k feedback is not bandit feedback. Unlike the bandit multiclass setting, the learner does not even get to compute its own loss! Thus, a key challenge in our setting is to use the structure of the loss to design estimators that can produce unbiased estimates of the loss from only top-k feedback. This intricate interplay between loss functions and partial feedback does not occur in previous work on online boosting.

Specifically, we extend the full information algorithms of Jung and Tewari (2018) to the top-k feedback setting for multilabel ranking problems. Our algorithms randomize their predictions and construct novel unbiased loss estimates. In this way, we can still let our weak learners update themselves even with partial feedback. Interestingly, the performance bounds of the proposed algorithms match the bounds of their full information counterparts with increased sample complexities. That is, even with top-k feedback, one can eventually obtain the same level of accuracy provided sufficient data. We also run our algorithms on the same data sets that are investigated by Jung and Tewari (2018), and obtain results supporting our theoretical findings. Our empirical results also verify that a larger $k$ (i.e., more information to the learner) does decrease the sample complexity of the algorithms.

## 2 Preliminaries

The space of all possible labels is denoted by $[m] = \{1, \dots, m\}$ for some positive integer $m$, which we assume is known to the learner. For a technical reason, we assume $m \ge 4$. We denote the indicator function by $I(\cdot)$, the $l$th standard basis vector by $e_l$, and the zero vector by $\mathbf{0}$. Let $\Delta_m$ denote the set of probability distributions over $[m]$. We denote a ranking as an ordered tuple; for example, the ranking $(3, 1, 2)$ ranks label $3$ the highest and label $2$ the lowest. Given a ranking $r$, we let $T_k(r)$ return the unordered set of the top $k$ ranked elements. For example, $T_2\big((3, 1, 2)\big) = \{3, 1\}$.

We will frequently use a score vector $s \in \mathbb{R}^m$ to denote a ranking. We convert it to a ranking using the function $\sigma$, which orders the members of $[m]$ in decreasing order of their scores in $s$. We break ties by preferring smaller labels; for example, label $1$ is preferred over label $2$ if the votes are even. This makes the mapping injective. For example, $\sigma\big((0.5, 1, 0.5)\big) = (2, 1, 3)$. When it is clear from the context, we will use a score vector and the corresponding ranking interchangeably.
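As a concrete illustration, here is a minimal Python sketch of this conversion and of $T_k$ (the helper names `sigma` and `top_k` are ours, not from the paper):

```python
def sigma(s):
    """Convert a score vector into a ranking (highest score first).

    Ties are broken by preferring smaller labels, which makes the
    mapping injective. Labels are 1-indexed as in the text.
    """
    m = len(s)
    return tuple(sorted(range(1, m + 1), key=lambda l: (-s[l - 1], l)))

def top_k(ranking, k):
    """Return the unordered set of the top-k ranked labels."""
    return set(ranking[:k])
```

For instance, `sigma([0.5, 1.0, 0.5])` ranks label 2 first and, on the tie, prefers label 1 over label 3.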

### 2.1 Problem Setting

We first describe the multilabel ranking (MLR) problem with top-k feedback. At each timestep $t$, the relevant labels are $R_t \subseteq [m]$, and the irrelevant labels are $[m] \setminus R_t$. An adversary chooses a labelled example $(x_t, R_t)$ (where $x_t$ lies in some domain $\mathcal{X}$) and sends $x_t$ to the learner. As we are interested in the MLR setting, the learner then produces an $m$-dimensional score vector $s_t$ and sends this result to the adversary. In the full information setting, the learner then observes $R_t$ and suffers a loss $L(s_t, R_t)$, which will be defined later. In the top-k feedback setting, however, it only observes whether $l$ is in $R_t$ for each label $l \in T_k(\sigma(s_t))$. That is to say, if $k = m$, then it becomes the full information problem, and a smaller $k$ implies less information. For a technical reason, we assume $2 \le k \le m - 2$. This feedback occurs naturally in applications such as ads placement and information retrieval, where the end user of the system has limited feedback capabilities. In such scenarios, $R_t$ may be the set of ads or documents which the user finds relevant, and $[m]$ may index the total set of documents. The user only gives feedback (e.g., by clicking relevant ads) for a few documents placed on top by the algorithm. The learner's end goal is still to minimize the loss $\sum_t L(s_t, R_t)$. It might not be able to compute the exact value of the loss because $R_t$ is unknown. We want to emphasize that even the size $|R_t|$ of the set of relevant labels is not revealed.

To tackle this problem we use the online multilabel boosting setup of Jung and Tewari (2018). In this setting, the learner is constructed from $N$ online weak learners, plus a booster which manages the weak learners. Each weak learner predicts a probability distribution across all possible labels, which we write as $h \in \Delta_m$. Previous work has shown that this weak learner restriction encompasses a variety of prediction formats including binary predictions, multiclass single-label predictions, and multiclass multilabel predictions (Jung and Tewari, 2018).

In the MLR version of boosting, each round starts when the booster receives $x_t$. It shares this with all the weak learners and then aggregates their predictions into the final score vector $s_t$. Once the booster receives its feedback, it constructs a cost vector $c^i_t$ for each weak learner $WL^i$, so that the weak learners incur loss $c^i_t \cdot h^i_t$, where $h^i_t$ is the $i$th weak learner's prediction at time $t$. Each weak learner's goal is to adjust itself over time to produce predictions that minimize its loss. The goal of the booster is to generate cost vectors which encourage the weak learners to cooperate in creating better score vectors $s_t$. It should be noted that despite the top-k feedback, our weak learners get full-information feedback, meaning an entire cost vector is revealed to each of them. Constructing a complete cost vector with partial feedback is one of the main challenges in this problem.

### 2.2 Estimating a Loss Function

Because of top-k feedback, we require methods to estimate loss functions that depend on labels outside of the top-k labels from our score vector $s_t$. One common way of dealing with partial feedback is to introduce randomized predictions and construct an unbiased estimator of the loss using the known distribution of the prediction. This way, we can obtain a randomized loss function for our learner to use. Thus, we propose a novel unbiased estimator based on randomizing the ranking $\sigma(s_t)$ of an arbitrary score vector. This estimator requires some structure within the loss function it is estimating.

We require the loss $L$ to be writable as a sum of functions, each of which takes as input only the scores and relevances of two particular labels, one relevant and one irrelevant. In particular, our loss must have the form

$$L(s, R) = \sum_{a \in R} \sum_{b \notin R} f\big(s[a], s[b]\big) =: \sum_{a,b \in [m]} f_{a,b}(s), \quad \text{where } f_{a,b}(s) = I(a \in R)\, I(b \notin R)\, f\big(s[a], s[b]\big).$$

Here $s$ is an arbitrary score vector in $\mathbb{R}^m$, and $f$ is a given function. We call this property pairwise decomposability. This decomposability allows us to individually estimate each $f_{a,b}$ and thus $L$.

In fact, various valid MLR loss functions are pairwise decomposable. An example is the unweighted rank loss

$$L^{rnk}(s, R_t) = \sum_{a \in R_t} \sum_{b \notin R_t} I\big(s[a] \le s[b]\big),$$

which has various surrogates, including the following unweighted hinge rank loss

$$L^{hinge}(s, R_t) = \sum_{a \in R_t} \sum_{b \notin R_t} \max\big\{0,\; s[b] - s[a] + 1\big\}.$$

It should be noted that the weighted rank loss

$$L^{wrnk}(s, R_t) = \frac{1}{|R_t|\,\big(m - |R_t|\big)}\, L^{rnk}(s, R_t)$$

cannot be computed using this strategy because its normalization weight is non-linearly dependent on $|R_t|$. In such cases, it is possible to upper bound the target loss function with a surrogate loss that is pairwise decomposable. For example, the unweighted rank loss is an obvious upper bound of the weighted one.
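To make the decomposition concrete, the following Python sketch (0-indexed labels; the function names are ours) computes the unweighted rank loss and its hinge surrogate:

```python
def rank_loss(s, R):
    """Unweighted rank loss: number of relevant/irrelevant pairs out of order."""
    others = [b for b in range(len(s)) if b not in R]
    return sum(1 for a in R for b in others if s[a] <= s[b])

def hinge_rank_loss(s, R):
    """Unweighted hinge rank loss, a convex upper bound on rank_loss."""
    others = [b for b in range(len(s)) if b not in R]
    return sum(max(0.0, s[b] - s[a] + 1.0) for a in R for b in others)
```

Each summand depends only on the scores and relevances of one pair $(a, b)$, so both losses are pairwise decomposable, and the hinge loss dominates the rank loss termwise.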

Returning to our estimator, we first elaborate on a method of randomized prediction given $s_t$ that will allow us to construct our unbiased estimator. This randomized prediction is parameterized by the exploration rate $\rho$. After computing $s_t$, with probability $1 - \rho$ we use $\sigma(s_t)$ as our final ranking. Otherwise, with probability $\rho$, we choose two elements from $T_k(\sigma(s_t))$ and two elements from the set of labels whose rank is lower than $k$ (it is this part of our construction that requires $k \ge 2$ and $m - k \ge 2$). Then, we take the higher ranked labels from each pair and swap them, and do the same for the lower ranked labels, producing our final ranking $\tilde r_t$. This process is more complicated than simply using a random ranking with probability $\rho$, but with our method, $\tilde r_t$ stays closer to $\sigma(s_t)$, which is favorable provided $s_t$ has a small loss. Figure 1 presents an example of this exploration step. In case the loss is a function of a score vector instead of a ranking, we can get a random score out of $s_t$ in a similar manner.
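The swap-based exploration step above can be sketched as follows (a hypothetical implementation; we sample positions, not labels):

```python
import random

def explore(ranking, k, rho, rng=random):
    """With probability 1 - rho keep the ranking as-is; otherwise swap two
    random top-k positions with two random lower positions, matching the
    higher-ranked with the higher-ranked and the lower with the lower.
    Requires k >= 2 and m - k >= 2."""
    r = list(ranking)
    if rng.random() >= rho:
        return tuple(r)
    top_pos = sorted(rng.sample(range(k), 2))          # two positions in the top-k
    low_pos = sorted(rng.sample(range(k, len(r)), 2))  # two positions below rank k
    r[top_pos[0]], r[low_pos[0]] = r[low_pos[0]], r[top_pos[0]]
    r[top_pos[1]], r[low_pos[1]] = r[low_pos[1]], r[top_pos[1]]
    return tuple(r)
```

With `rho = 0` the ranking is returned unchanged; with `rho = 1` exactly two top-k labels are replaced, so the perturbed ranking stays close to the original.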

We now present our unbiased estimator. Let $\tilde r_t$ be the random ranking from the previously described process, and let $s$ be an arbitrary score vector in $\mathbb{R}^m$. We note that given any two distinct labels $a$ and $b$, $\Pr[a, b \in T_k(\tilde r_t)] > 0$. Since being in the top-k provides the learner with full information regarding the relevance and scores of those labels, we have the following unbiased estimator using importance sampling:

$$\hat L(s, R_t) = \sum_{a,b \in [m]} \frac{I\big(a, b \in T_k(\tilde r_t)\big)}{\Pr\big[a, b \in T_k(\tilde r_t)\big]}\, f_{a,b}(s). \tag{1}$$

We prove that this is an unbiased estimator in Lemma 5 in the appendix. Our algorithms will use this unbiased estimator to estimate certain surrogate functions which we construct to be pairwise decomposable.
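In code, the importance-sampling estimator in Eq. 1 might look like the following sketch, where `pair_prob` is a hypothetical helper returning the known inclusion probability $\Pr[a, b \in T_k(\tilde r_t)]$ under the exploration distribution:

```python
def estimate_loss(s, observed, relevant, pair_prob, f):
    """Importance-sampling estimate of a pairwise-decomposable loss.

    observed:  set of labels seen in the top-k of the random ranking;
               their relevance is fully known to the learner.
    relevant:  subset of `observed` marked relevant by the feedback.
    pair_prob: pair_prob(a, b) = Pr[a and b both land in the observed top-k].
    f:         pairwise loss f(s[a], s[b]) for a relevant, b irrelevant.
    """
    total = 0.0
    for a in observed & relevant:
        for b in observed - relevant:
            # Reweight each observed pair by its inclusion probability.
            total += f(s[a], s[b]) / pair_prob(a, b)
    return total
```

When $k = m$, every pair is observed with probability one and the estimate coincides with the true loss.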

One useful quality of this estimator is what we call $b$-boundedness. We say a random vector $X$ is $b$-bounded if $\|X\|_\infty \le b$ almost surely. This definition also applies to scalar random variables, in which case the infinity norm becomes the absolute value. In Lemma 6 in the appendix, we prove that if the pairwise functions are bounded by some $z$, then any such unbiased estimator as in Eq. 1 is $b$-bounded with a constant $b$ depending on $m$, $k$, $z$, and $\rho$.

Now suppose that the cost vector $c^i_t$ (to be fed to the $i$th weak learner at time $t$) requires full knowledge of $R_t$ to compute. If each of its entries is a function that is pairwise decomposable, we can use the same unbiased estimation strategy to obtain random cost vectors $\hat c^i_t$ that are in expectation equal to $c^i_t$.

## 3 Algorithms

We introduce two different online boosting algorithms along with their performance bounds. Our bounds rely on the number and quality of the weak learners, so we define the edge of a weak learner. Our first algorithm assumes every weak learner has a positive edge $\gamma$, while our second algorithm uses an edge measured adaptively. These edges have a close relationship to each other and are also closely related to the full information edges defined by Jung and Tewari (2018), allowing us to show that our theoretical error bounds closely match theirs. Our first algorithm uses this edge information to achieve an exponentially decreasing error bound with respect to the number and quality of weak learners. For our second algorithm we use empirical edges to bound the loss, and allow adaptive weighting of weak learners. This makes it more practical while sacrificing the exponentially decreasing bound.

### 3.1 Algorithm Template

We describe the template which our two boosting algorithms share. It does not specify certain steps, which will be filled in by the two boosting algorithms. Also, in our template we do not restrict weak learners in any way except that each predicts a distribution over $[m]$, receives a full cost vector $c^i_t$, and suffers the loss $c^i_t \cdot h^i_t$ according to its prediction.

The booster keeps updating the learner weights $\alpha^i_t$ and constructs $N$ experts, where the $i$th expert's score vector $s^i_t$ is the weighted cumulative vote of the first $i$ weak learners. The booster chooses an expert index $i_t$ at each round to use. The first algorithm fixes $i_t$ to be $N$, while the second one draws it randomly using an adaptive distribution. The booster then uses $s^{i_t}_t$ to compute its final random prediction. After obtaining feedback, the booster computes random cost vectors for each weak learner and lets them update parameters.
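The experts' aggregation step can be sketched as follows (hypothetical names; each weak prediction is a length-$m$ distribution):

```python
def expert_score(weak_preds, alphas, i):
    """Score vector of the i-th expert: the weighted cumulative vote of
    the first i weak learners' predicted distributions."""
    m = len(weak_preds[0])
    s = [0.0] * m
    for j in range(i):
        for l in range(m):
            s[l] += alphas[j] * weak_preds[j][l]
    return s
```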

### 3.2 An Optimal Algorithm

Our first algorithm, TopkBBM (Boost-By-Majority for ranking with top-k feedback), assumes the ranking weak learning condition and is optimal, meaning it matches the asymptotic loss bounds of an optimal full information boosting algorithm in the number of weak learners used, up to a constant factor.

#### 3.2.1 Ranking Weak Learning Condition

The ranking weak learning condition states that within the cost vector framework, weak learners can minimize their loss better than a randomly guessing competitor, so long as the cost vectors satisfy certain conditions, and with the weak learners only observing versions of the cost vectors tainted by some noise.

We define the randomly guessing competitor at time $t$ as $u^\gamma_{R_t}$, an almost uniform distribution placing $\gamma$ more weight on each label in $R_t$. In particular, for any label $l$ we define it as

$$u^\gamma_{R_t}[l] = \begin{cases} \dfrac{1 - |R_t|\,\gamma}{m} + \gamma & \text{if } l \in R_t, \\[6pt] \dfrac{1 - |R_t|\,\gamma}{m} & \text{if } l \notin R_t. \end{cases}$$

The intuition is that if a weak learner predicts a label by drawing from $u^\gamma_{R_t}$ at each round, then its accuracy is better than uniform random guessing by at least an edge of $\gamma$.
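Concretely, the competitor distribution can be written as the following sketch (0-indexed labels, our own helper name):

```python
def competitor(m, R, gamma):
    """Almost-uniform distribution u^gamma_R: places gamma extra weight on
    each relevant label. Assumes gamma is small enough that the base
    weight (1 - |R| * gamma) / m stays non-negative."""
    base = (1.0 - len(R) * gamma) / m
    return [base + gamma if l in R else base for l in range(m)]
```

The weights sum to one, and each relevant label receives exactly `gamma` more mass than each irrelevant one.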

Given $R_t$, we specify the set of possible cost vectors as

$$C(R_t) = \Big\{ c \in [0,1]^m \;\Big|\; \min_l c[l] = 0,\; \max_l c[l] = 1,\; \max_{i \in R_t} c[i] \le \min_{j \notin R_t} c[j] \Big\}.$$

We also allow a sample weight $w_t$ to be multiplied by this cost vector. This feasible set of cost vectors is equivalent to the one used in the full information setting studied by Jung and Tewari (2018).

As in the full information setting, at each round we allow the adversary to choose an arbitrary cost vector $c_t$ from $C(R_t)$ and its weight $w_t$ for the learner. In our top-k feedback setting, we further permit the adversary to generate random cost vectors and weights, so long as in expectation each random cost vector is in $C(R_t)$.

We now introduce our top-k feedback weak learning condition, presented beside the full information online weak learning condition from Jung and Tewari (2018) to show their similarity.

###### Definition 1 (OnlineWLC).

For parameters $\gamma$, $\delta$, and $S$, a pair of a learner and an adversary satisfies OnlineWLC$(\gamma, \delta, S)$ if for any $T$, with probability at least $1 - \delta$, the learner can generate predictions $h_t$ that satisfy

$$\sum_{t=1}^T w_t\, c_t \cdot h_t \le \sum_{t=1}^T w_t\, c_t \cdot u^\gamma_{R_t} + S.$$
###### Definition 2 (Top-kWLC).

For parameters $\gamma$, $\delta$, and $S$, a pair of a learner and an adversary satisfies Top-kWLC$(\gamma, \delta, S)$ if for any $T$, with probability at least $1 - \delta$, the learner can generate predictions $h_t$ that satisfy

$$\sum_{t=1}^T w_t\, c_t \cdot h_t \le \sum_{t=1}^T w_t\, c_t \cdot u^\gamma_{R_t} + S,$$

while only observing random cost vectors $\hat c_t$ and weights $\hat w_t$, where all $\hat c_t$ are $b$-bounded and equal in expectation to $w_t\, c_t$.

In these definitions, $S$ is called the excess loss. The two weak learning conditions differ only by the introduction of random noise. In Top-kWLC, if the variance of each random cost vector $\hat c_t$ is zero, then $\hat c_t = c_t$, and we recover the full information weak learning condition. For a positive exploration rate $\rho$, the definition of $b$-boundedness guarantees that the noise is almost surely bounded, which lets us relate the excess loss in the top-k setting to its full information counterpart.

#### 3.2.2 TopkBBM Details

We require that our loss function be pairwise decomposable, and that each of its pairwise functions $f_{a,b}$ has three properties, which we now describe.

**Properness.** $f_{a,b}(s)$ is non-increasing in $s[a]$ for $a \in R$ and non-decreasing in $s[b]$ for $b \notin R$. This is a generic feature of ranking losses, since putting a higher score on a relevant label should decrease the loss.

**Uncrossability.** If a weak learner is unsure which of two labels to prefer, it cannot place even weight on both labels and cheat its way to a lower cost.

**Convexity in Scores.** For any $a$ and $b$, $f_{a,b}$ is convex with respect to the score vector $s$.

We prove these properties for the loss functions we use in Appendix A.6.

We now briefly discuss notation used to describe potential functions, whose relation to boosting has been thoroughly discussed by Mukherjee and Schapire (2013). Recall $[m]$, and let $u$ be a distribution over this set. Given a starting vector $s \in \mathbb{R}^m$, a function $L$, and a non-negative integer $N$, we define $\varphi^N_u(s, L) \coloneqq \mathbb{E}\, L(s + X^N)$, where $X^N$ is the summation of $N$ random standard basis vectors $e_l$ with $l$ drawn from $u$.

Moving on to potential functions for boosting, in the full information setting where $R_t$ is revealed, we would use the ground truth potential function

$$\Upsilon^N_t(s) \coloneqq \varphi^N_{u^\gamma_{R_t}}\big(s,\, L(\cdot, R_t)\big)$$

to create cost vectors. It takes the current cumulative votes $s$ as an input and estimates the booster's loss when the relevant labels are $R_t$ and the weak learners guess from the distribution $u^\gamma_{R_t}$. However, because $R_t$ is not known in our setting, we provide a surrogate potential function, using the assumptions listed previously in this section.

To compute our surrogate potential function, we first rewrite the ground truth potential by moving the expectation inside the pairwise summations:

$$\Upsilon^N_t(s) = \sum_{a,b \in [m]} \varphi^N_{u^\gamma_{R_t}}(s, f_{a,b}), \tag{2}$$

where we slightly abuse notation by letting $\varphi^N_u$ take $f_{a,b}$ as an input. Then we propose the following surrogate potential function:

$$\Phi^N_t(s) = \sum_{a \in R_t} \sum_{b \notin R_t} \Lambda^{a,b,N}_t(s), \quad \text{where } \Lambda^{a,b,N}_t(s) = \varphi^N_{u^\gamma_a}(s, f_{a,b}) \text{ with } u^\gamma_a = u^\gamma_{\{a\}}.$$

We record important qualities of this surrogate potential as a proposition, with the proof in Appendix A.2.

###### Proposition 1.

$\Phi^N_t$ is proper and convex, and for any $s$, $R_t$, $\gamma$, and $N$, we have $\Upsilon^N_t(s) \le \Phi^N_t(s)$.

We also stress that $\Phi^N_t$ is pairwise decomposable into its smaller potential functions $\Lambda^{a,b,N}_t$.

Returning to the algorithm, we assume that the weak learners satisfy Top-kWLC$(\gamma, \delta, S)$. Our goal is to set $c^i_t[l] = \Phi^{N-i}_t(s^{i-1}_t + e_l)$. Because $\Phi^{N-i}_t$ is pairwise decomposable, we can create an unbiased estimator of it using the technique in Section 2.2 as

$$\hat\Phi^N_t(s) = \sum_{a \in R_t} \sum_{b \notin R_t} \frac{I\big(a, b \in T_k(\hat r_t)\big)}{\Pr\big[a, b \in T_k(\hat r_t)\big]}\, \Lambda^{a,b,N}_t(s).$$

Because $\Lambda^{a,b,N}_t$ is simply a potential function built from $f_{a,b}$, any upper bound on $f_{a,b}$ also upper bounds $\Lambda^{a,b,N}_t$. Then we can use Lemma 6 to claim $\hat\Phi^N_t(s)$ is $b$-bounded.

Thus, we can create unbiased estimates of the cost vectors as

$$\hat c^i_t[l] = \hat\Phi^{N-i}_t\big(s^{i-1}_t + e_l\big). \tag{3}$$

The rest of the algorithm is straightforward. We set $\alpha^i_t = 1$ for all $i$ and $t$, and select the expert index to be $i_t = N$. This means that we always take an equal-weighted vote from all the weak learners. Intuitively, the booster wants to use all weak learners because they are all guaranteed to do better than random guessing in the long run, and weighs them equally because all weak learners have the same edge $\gamma$. Lastly, given the last expert's score vector $s^N_t$, our algorithm predicts using the same random process described in Section 2.2, so that we can use the unbiased estimator.

#### 3.2.3 TopkBBM Loss Bound

We can theoretically guarantee the performance of TopkBBM on any proper and pairwise decomposable loss function. In our theorem, we bound $\sum_t L(s^N_t, R_t)$ instead of the true, randomized loss because, without extra guarantees, we cannot say anything about how random predictions will affect $L$. We later provide corollaries for specific losses for which we bound the randomized loss. The following theorem holds for any pairwise decomposable loss function whose pairwise losses satisfy the three qualities listed earlier. The proof appears in Appendix A.3.

###### Theorem 2 (TopkBBM, General Loss Bound).

For any weak learners satisfying Top-kWLC$(\gamma, \delta, S)$, the total loss incurred by TopkBBM satisfies the following inequality with probability at least $1 - \delta$:

$$\sum_{t=1}^T L\big(s^N_t, R_t\big) \le \Phi^N_t(\mathbf{0})\, T + \tilde O\!\left(\frac{2m^2 - k^2}{\rho}\, z N\right),$$

where $z$ is the maximum possible value that any $f_{a,b}$ can output, and $\tilde O$ suppresses logarithmic dependence on $T$ and $1/\delta$.

Note that there is no single canonical loss in the MLR setting, unlike the classification setting where the 0-1 loss is quite standard. Still, the weighted rank loss comes close to being canonical since it is often used in practice and is implemented in standard MLR libraries (Tsoumakas et al., 2011). It has also been analyzed theoretically in previous work on ranking (e.g., see Cheng et al. (2010) and Gao and Zhou (2011)).

We note that this loss is not convex and not pairwise decomposable. Thus we use the unweighted hinge rank loss as a surrogate. Since the unweighted hinge rank loss upper bounds the rank loss, Theorem 2 can be used to bound it. This allows us to present the following corollary, whose proof can be found in Appendix A.4.

###### Corollary 3 (TopkBBM, Rank Loss Bound).

For any $T$ and weak learners satisfying Top-kWLC$(\gamma, \delta, S)$, TopkBBM's randomized predictions $\tilde y_t$ satisfy the following bound on the rank loss with probability at least $1 - \delta$:

$$\sum_{t=1}^T L^{rnk}_t(\tilde y_t) \le \frac{m^2}{4}(N+1)\exp\!\left(-\frac{\gamma^2 N}{2}\right) T + 2\rho m T + \tilde O\!\left(\frac{2m^2 - k^2}{\rho}\, N^2 \sqrt{T}\right).$$

We can optimize $N$ so that the first term in the bound becomes the asymptotic average loss bound. We can compare it to the asymptotic error bounds in Jung and Tewari (2018) by multiplying the full information algorithm loss bounds by $\frac{m^2}{4}$, which is the maximum value of the rank loss normalization constant. Let $s'_t$ be the score vectors produced by the full information algorithm. Then we have that

$$\sum_{t=1}^T L^{rnk}_t(s'_t) \le \frac{m^2}{4}(N+1)\exp\!\left(-\frac{\gamma^2 N}{2}\right) T + \frac{m^2}{2}\, N S.$$

We see that the asymptotic losses, after optimizing $N$, are identical, so that the cost of top-k feedback appears only in the excess loss. Furthermore, since the optimal algorithm in Jung and Tewari (2018) is optimal in the number of weak learners it requires to achieve some asymptotic loss, TopkBBM is also optimal in this regard, since the problem it faces is only harder because of partial information.

While TopkBBM is theoretically sound, it has a number of drawbacks in real world applications. Firstly, it is difficult to actually measure the edge $\gamma$ for a particular weak learner, and usually the weak learners will not all have the same edge. Secondly, potential functions often do not have closed form definitions, and thus require expensive random walks to compute. To address these issues, we propose an adaptive algorithm, TopkAdaptive, modifying Ada.OLMR from Jung and Tewari (2018) so that it can use top-k feedback.

### 3.3 An Adaptive Algorithm

#### 3.3.1 Logistic Loss and Empirical Edges

Like other adaptive boosting algorithms, we require a surrogate loss. We take the logistic loss for multilabel ranking from Ada.OLMR, but ignore its normalization (as that would require knowledge of $|R_t|$):

$$L^{log}(s, R_t) \coloneqq \sum_{a \in R_t} \sum_{b \notin R_t} \log\Big(1 + \exp\big(s[b] - s[a]\big)\Big).$$

This loss is proper and convex. As in Ada.OLMR, the booster's prediction is still graded using the (unweighted) rank loss; the surrogate loss only plays a role in optimizing parameters.

Similarly to $\hat\Phi^N_t$, we create an unbiased estimator $\hat L^{log}_t$ of the logistic loss as

$$\hat L^{log}_t(s) = \sum_{a \in R_t} \sum_{b \notin R_t} \frac{I\big(a, b \in T_k(\hat r_t)\big)}{\Pr\big[a, b \in T_k(\hat r_t)\big]} \log\Big(1 + \exp\big(s[b] - s[a]\big)\Big).$$

Our goal is to set $c^i_t = \nabla L^{log}_t(s^{i-1}_t)$. However, because we cannot always evaluate the logistic loss, we use $\hat L^{log}_t$ instead to make random cost vectors which in expectation are the desired cost vectors:

$$\hat c^i_t = \nabla \hat L^{log}_t\big(s^{i-1}_t\big). \tag{4}$$
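As an illustration of the cost vector in Eq. 4, the following sketch (0-indexed labels, our own helper name) computes the gradient of the unnormalized logistic loss:

```python
import math

def logistic_grad(s, R):
    """Gradient of sum_{a in R, b not in R} log(1 + exp(s[b] - s[a]))
    with respect to the score vector s."""
    m = len(s)
    g = [0.0] * m
    for a in R:
        for b in range(m):
            if b in R:
                continue
            # d/ds[b] log(1 + exp(s[b] - s[a])) = 1 / (1 + exp(s[a] - s[b]))
            p = 1.0 / (1.0 + math.exp(s[a] - s[b]))
            g[a] -= p
            g[b] += p
    return g
```

Note that the total positive gradient mass over a round equals the pairwise sum $\sum_{a,b} 1/(1 + \exp(s[a] - s[b]))$ appearing as the weight in Eq. 5 below.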

Even though the algorithm is adaptive, we still need an empirical measure of the weak learners' predictive power for the performance bounds. As in Ada.OLMR, we use the following empirical edge of $WL^i$:

$$\gamma_i = -\frac{\sum_{t=1}^T c^i_t \cdot h^i_t}{\|w^i\|_1}, \quad \text{where } w^i[t] = \sum_{a \in R_t} \sum_{b \notin R_t} \frac{1}{1 + \exp\big(s^{i-1}_t[a] - s^{i-1}_t[b]\big)} \tag{5}$$

is the definition of the weight of a cost vector taken from Jung and Tewari (2018). A useful remark is that if a weak learner satisfies Top-kWLC with edge $\gamma$, then for large $T$ it should have an empirical edge of at least $\gamma$ with high probability.

Having a similar edge as Ada.OLMR allows us to precisely evaluate the cost of top-k feedback. Note that the empirical edge is not visible to the learner, since it requires the expected cost vectors to compute. This is fine because this value is only used in proving the loss bound, not by the algorithm.

We now go into the details of TopkAdaptive. The choice of cost vectors was discussed in the previous section. As this is an adaptive algorithm, we want to choose the weak learners' weights $\alpha^i_t$ at each round. We would like to choose them to minimize the cumulative logistic loss

$$\sum_t g^i_t(\alpha^i_t), \quad \text{where } g^i_t(\alpha^i_t) = L^{log}_t\big(s^{i-1}_t + \alpha^i_t\, e_{l^i_t}\big),$$

with only the unbiased estimate

$$\sum_t \hat g^i_t(\alpha^i_t), \quad \text{where } \hat g^i_t(\alpha^i_t) = \hat L^{log}_t\big(s^{i-1}_t + \alpha^i_t\, e_{l^i_t}\big),$$

available to the booster at each time step.

Since the logistic loss is convex, we can use our partial feedback to run stochastic gradient descent (SGD). To apply SGD, besides convexity we require that the feasible space be compact, so we restrict $\alpha^i_t$ to a bounded interval. To stay in the feasible space, we use a projection $\Pi$ in the update rule $\alpha^i_{t+1} = \Pi\big(\alpha^i_t - \eta_t \nabla \hat g^i_t(\alpha^i_t)\big)$, where $\eta_t$ is the learning rate. We bound the loss from SGD and show it incurs sublinear regret. The details are in the proof in Appendix A.5.
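One projected SGD update for a weak learner's weight might look like this sketch (the interval `[-2, 2]` is an assumed feasible set for illustration, not taken from the paper):

```python
def sgd_step(alpha, grad, lr, lo=-2.0, hi=2.0):
    """alpha <- Pi(alpha - lr * grad), projecting onto [lo, hi]."""
    return min(hi, max(lo, alpha - lr * grad))
```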

We cannot prove that the last expert is the best because our weak learners do not adhere to any weak learning condition. Instead, we prove that at least one expert is reliable. Our algorithm uses the Hedge algorithm from Freund and Schapire (1997) to select the best expert from the ones available, taking as input for the $i$th expert its estimated unweighted rank loss, which we define as

$$\hat L^{rnk}\big(s^i_t, R_t\big) = \sum_{a,b \in [m]} \frac{I\big(a, b \in T_k(\tilde r_t)\big)}{\Pr\big[a, b \in T_k(\tilde r_t)\big]}\, I(a \in R_t)\, I(b \notin R_t)\, I\big(s^i_t[a] \le s^i_t[b]\big).$$

Because the exploration rate $\rho$ controls the variance of the loss estimate, we can combine the analysis of the Hedge algorithm with a concentration inequality to obtain a similar result.
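The expert-selection step uses standard multiplicative weights; a minimal Hedge update over the $N$ experts looks like:

```python
import math

def hedge_update(weights, losses, eta):
    """Multiplicative-weights (Hedge) update: downweight experts in
    proportion to their (estimated) rank loss, then renormalize."""
    new = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    z = sum(new)
    return [w / z for w in new]
```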

We now bound the cumulative rank loss of TopkAdaptive using the weak learners' empirical edges. The proof appears in Appendix A.5.

###### Theorem 4 (TopkAdaptive, Rank Loss Bound).

For any $T$ and $\delta \in (0, 1)$, the cumulative rank loss of TopkAdaptive, $\sum_t L^{rnk}_t(\tilde y_t)$, satisfies the following bound with probability at least $1 - \delta$:

$$\sum_t L^{rnk}_t(\tilde y_t) \le \frac{2m^2}{\sum_i |\gamma_i|}\, T + 2\rho m T + \tilde O\!\left(\frac{(2m^2 - k^2)\, N \sqrt{T}}{\rho \sum_i |\gamma_i|}\right),$$

where $\tilde O$ suppresses logarithmic dependence on $T$ and $1/\delta$.

By optimizing $\rho$, we get the first term of the bound as the asymptotic average loss bound. To compare it with the adaptive algorithm in Jung and Tewari (2018), we again multiply their bound by $\frac{m^2}{4}$ to account for the normalization constant. Let $s'_t$ be the scores of the full information adaptive algorithm at time $t$. Then we have

$$\sum_t L^{rnk}_t(s'_t) \le \frac{2m^2}{\sum_i |\gamma_i|}\, T + \tilde O\!\left(\frac{N^2 m^2}{\sum_i |\gamma_i|}\right),$$

which matches the asymptotic loss of TopkAdaptive, after optimizing for $\rho$. Thus, the cost of top-k feedback is again only present in the excess loss.

## 4 Experiments

We compare various boosting algorithms on benchmark data sets using publicly available code. The models we use are our own TopkBBM (TopOpt) and TopkAdaptive (TopAda), along with OnlineBMR (FullOpt) and Ada.OLMR (FullAda) by Jung and Tewari (2018), the full information algorithms to which we compared our theoretical results. All of these boosters use multilabel weak learners.

We examine several data sets from the UCI data repository collected by Tsoumakas et al. (2011) that have been used to evaluate the full information algorithms. We follow the preprocessing steps from Jung and Tewari (2018) to ensure consistent comparisons, and the data set details and statistics appear in Table 3 in the appendix. However, because the top-k feedback algorithms require more data to converge, we loop over the training set a number of times before evaluating on the test set. We consider the number of loops a hyper-parameter, kept small in all experiments. The other hyper-parameters we optimize are the number of weak learners $N$ and the edge $\gamma$ for TopOpt. Experiment details can be found in Appendix B.

### 4.1 Asymptotic Performance

Since, for the rank loss, the theoretical asymptotic error bounds of the proposed algorithms match their full information counterparts, we first compare the models' empirical asymptotic performance. For the full information algorithms, we looped over the training set once and then ran the test set, while for the top-k algorithms, we looped as described in the previous subsection. In these tests, we fix the value of $k$. The selected hyper-parameters, including the number of loops, appear in Table 2 in the appendix. Each table entry is the result of multiple runs averaged together.

In Table 1, we see that in each data set, the full information algorithms outperform their top-k feedback counterparts, but that the gap is quite small. The largest gap is between TopAda and FullAda on M-reduced. In part, this is due to differences in weighting between the unweighted and weighted rank loss. When the number of relevant labels per example is constant, the weighted and unweighted rank losses are exactly proportional because the rank loss weight is constant, but when the number of labels varies, this rank loss weight will change. This leads to a discrepancy between the goals the full information algorithms and our algorithms are boosting towards. Overall, however, especially after accounting for exploration, the smallness of this gap shows our algorithms learn nearly as effectively as their full information counterparts.

Another factor is the number of loops run. For the data sets with smaller label spaces, the number of loops multiplied by $k$ implies that our algorithms could have observed each label multiple times during training; theoretically, within two loops of the training set, our algorithms could have observed the labels in their entirety. However, on the Mediamill and M-reduced data sets, our algorithms manage comparable asymptotic performance while at best they could only have possibly observed about 60% of the training labels. This shows they are capable of making inferences even with partial information.

### 4.2 Effects of Varying Observability

To show the empirical effects of top-k feedback on model convergence and asymptotic loss, we repeat our experiments with the Yeast data set, keeping the same hyper-parameters but increasing $k$. Figure 2 plots the weighted rank loss averaged over every 100 consecutive rounds for TopkBBM models with various $k$. Each line in the figure is itself averaged over several runs, with the same hyper-parameters as in Table 2. Clearly, as $k$ increases, the number of rounds TopOpt requires to converge decreases. Despite this, by the final rounds of feedback, the three lines in Figure 2 have closed tightly on each other, supporting our theory that the cost of changing $k$ is only borne by the excess loss.

## References

• A. Beygelzimer, S. Kale, and H. Luo (2015) Optimal and adaptive algorithms for online boosting. In International Conference on Machine Learning, pp. 2323–2331. Cited by: §1.
• N. Cesa-Bianchi and G. Lugosi (2006) Prediction, learning, and games. Cambridge University Press. Cited by: §A.5.
• S. Chaudhuri and A. Tewari (2017) Online learning to rank with top-k feedback. The Journal of Machine Learning Research 18 (1), pp. 3599–3648. Cited by: §1.
• W. Cheng, E. Hüllermeier, and K. J. Dembczynski (2010) Bayes optimal multilabel classification via probabilistic classifier chains. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 279–286. Cited by: §3.2.3.
• Y. Freund and R. E. Schapire (1997) A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55 (1), pp. 119–139. Cited by: §3.3.2.
• W. Gao and Z. Zhou (2011) On the consistency of multi-label learning. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 341–358. Cited by: §3.2.3.
• E. Gibaja and S. Ventura (2015) A tutorial on multilabel learning. ACM Computing Surveys (CSUR) 47 (3), pp. 52. Cited by: §1.
• E. Hazan (2016) Introduction to online convex optimization. Foundations and Trends in Optimization 2 (3-4), pp. 157–325. Cited by: §1.
• Y. H. Jung, J. Goetz, and A. Tewari (2017) Online multiclass boosting. In Advances in Neural Information Processing Systems 30, pp. 920–929. Cited by: §1.
• Y. H. Jung and A. Tewari (2018) Online boosting algorithms for multi-label ranking. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 84, pp. 279–287. Cited by: §A.4, Document, §1, §1, §2.1, §3.2.1, §3.2.1, §3.2.3, §3.3.1, §3.3.3, §3.3, §3, §4, §4.
• I. Mukherjee and R. E. Schapire (2013) A theory of multiclass boosting. Journal of Machine Learning Research 14 (Feb), pp. 437–497. Cited by: §3.2.2.
• R. E. Schapire and Y. Freund (2012) Boosting: foundations and algorithms. MIT Press. Cited by: §1.
• S. Shalev-Shwartz (2012) Online learning and online convex optimization. Foundations and Trends in Machine Learning 4 (2), pp. 107–194. Cited by: §1.
• G. Tsoumakas, E. Spyromitros-Xioufis, J. Vilcek, and I. Vlahavas (2011) Mulan: a java library for multi-label learning. Journal of Machine Learning Research 12 (Jul), pp. 2411–2414. Cited by: §1, §3.2.3, §4.
• D. Zhang, Y. H. Jung, and A. Tewari (2018) Online multiclass boosting with bandit feedback. arXiv preprint arXiv:1810.05290. Cited by: §1.
• M. Zhang and Z. Zhou (2013) A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering 26 (8), pp. 1819–1837. Cited by: §1.
• M. Zinkevich (2003) Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 928–936. Cited by: §A.5.


## Appendix A Detailed Proofs

We provide the proofs that were omitted from the main body.

### A.1 Proofs for the Unbiased Estimator

###### Lemma 5.

Suppose that we have a score vector $s$, and a random ranking $\hat{r}$ such that for any $a, b \in [m]$, $\Pr[a, b \in T_k(\hat{r})] > 0$. Let $g(s) = \sum_{a \in R_t} \sum_{b \notin R_t} f_{a,b}(s)$. Then the expectation of Eq. 1 becomes $\mathbb{E}_{\hat{r}}[\hat{g}(s, R_t)] = g(s)$.

###### Proof.

We first write out the unbiased estimator that Eq. 1 provides of $g(s)$:

$$\hat{g}(s) = \sum_{a \in R_t} \sum_{b \notin R_t} \frac{\mathbb{I}(a, b \in T_k(\hat{r}))}{\Pr[a, b \in T_k(\hat{r})]} f_{a,b}(s).$$

Then we rewrite the expectation of $\hat{g}(s, R_t)$ by moving it inside the sum:

$$\mathbb{E}_{\hat{r}}[\hat{g}(s, R_t)] = \sum_{a \in R_t} \sum_{b \notin R_t} \frac{\Pr[a, b \in T_k(\hat{r})]}{\Pr[a, b \in T_k(\hat{r})]} f_{a,b}(s) = g(s),$$

where the middle equality holds because the indicator inside the summation is zero unless $a, b \in T_k(\hat{r})$, so its expectation is exactly $\Pr[a, b \in T_k(\hat{r})]$. ∎
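As a sanity check on Lemma 5, the following sketch verifies the inverse-propensity construction numerically under a simplified sampling scheme: a uniformly random k-subset stands in for the top-k of the ranking distribution, and the pairwise surrogate f and the scores are illustrative assumptions.

```python
# Monte Carlo sanity check of the inverse-propensity estimator in Lemma 5,
# under a simplified scheme: the observed top-k set is a uniformly random
# k-subset of [m].  Unbiasedness only requires Pr[a, b in T_k] to be known
# and positive; the pairwise surrogate f and the scores are illustrative.
import random

random.seed(0)
m, k = 6, 3
R = {0, 1}                                     # relevant labels (assumed)
s = [0.3, 0.9, 0.5, 0.2, 0.7, 0.1]             # score vector (assumed)
f = lambda sa, sb: max(0.0, 1.0 - (sa - sb))   # illustrative pairwise loss

pairs = [(a, b) for a in R for b in range(m) if b not in R]
g = sum(f(s[a], s[b]) for a, b in pairs)       # full-information loss g(s)

# For a uniform k-subset, Pr[a, b in T_k] = k(k-1) / (m(m-1)).
p = k * (k - 1) / (m * (m - 1))

trials = 200_000
est = 0.0
for _ in range(trials):
    Tk = set(random.sample(range(m), k))
    est += sum(f(s[a], s[b]) / p for a, b in pairs if a in Tk and b in Tk)
est /= trials

print(g, est)  # the average of the estimator approaches g(s)
```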

###### Lemma 6.

Suppose our loss function is pairwise decomposable, $L(s, R_t) = \sum_{a \in R_t} \sum_{b \notin R_t} f_{a,b}(s)$, and thus has an unbiased estimator like in Eq. 1. If there exists $z$ such that $f(s[a], s[b]) \le z$ for all feasible $s$ available to the booster, then we have

$$|\hat{L}(s, R_t) - L(s, R_t)| = O\!\left(z \, \frac{m^2 - k^2}{\rho}\right) \quad \text{almost surely.}$$
###### Proof.

We record again the definition of $\hat{L}(s, R_t)$ from Eq. 1:

$$\hat{L}(s, R_t) = \sum_{a, b \in [m]} \frac{\mathbb{I}(a, b \in T_k(\tilde{r}_t))}{\Pr[a, b \in T_k(\tilde{r}_t)]} f_{a,b}(s) = \sum_{a \in R_t} \sum_{b \notin R_t} \frac{\mathbb{I}(a, b \in T_k(\tilde{r}_t))}{\Pr[a, b \in T_k(\tilde{r}_t)]} f(s[a], s[b]).$$

We first bound the case where our estimator underestimates the true loss. Here the worst scenario is that all the pairwise loss functions which we activate evaluate to $0$, while all other functions evaluate to $z$. Since there are at most $m^2/4$ pairs $(a, b)$ with $a \in R_t$ and $b \notin R_t$, in this case we are bounded by $z m^2 / 4$.

We now bound the cases where our estimator overestimates the true loss. We proceed by bounding the difference between our estimate and its expectation on a case-by-case basis, and then counting the number of each case that can arise in a worst-case scenario. We recall that $\tilde{r}_t$ has an implicit exploration parameter $\rho$ that is used to generate $\tilde{V}$, as described in Section 2.2. Let $V = T_k(\hat{r}_t)$ and $\tilde{V} = T_k(\tilde{r}_t)$. If we decide not to explore, and $a, b \in V$, then our worst case scenario where we overestimate is when all the functions which we don't activate are $0$, and all the functions which we do activate are $z$. In this case, each activated pair carries weight at most $1/(1-\rho)$, so we are bounded by $z k^2 \rho / (1-\rho)$. To see that this is at most $\frac{1}{3} z k^2$, we note that $\rho \le 1/4$ implies $\rho/(1-\rho) \le 1/3$.

Now we consider the case where we decide to explore. In this case, because we overestimate all of our activated pairwise functions, our worst case scenario is when all the pairwise functions we activate evaluate to $z$, and all other pairwise functions evaluate to 0. Firstly, suppose $a, b \in V$. Then we have $\Pr[a, b \in \tilde{V}] \ge 1 - \rho$, which implies an importance weight of at most $1/(1-\rho)$. In expectation this weight is $1$, so with $\rho \le 1/4$ we can upper bound the deviation of each pair by $1/3$, and we know there must be fewer than $k^2$ of these pairs.

Secondly, suppose exactly one of $a$ and $b$ is in $V$. Without loss of generality, let us assume $a \in V$ and $b \notin V$. In this case, for both labels to be in $\tilde{V}$, the algorithm must decide to explore, with probability $\rho$. Then it must select $b$ to be among the two labels taken from outside $V$, and it must not remove $a$ from $V$. From this, we obtain

$$\Pr[a, b \in \tilde{V}] = \rho \cdot \frac{2}{m-k} \cdot \frac{k-2}{k}.$$

Thus in this case, our bound on the difference between estimator and expectation is $z \cdot \frac{k(m-k)}{2\rho(k-2)}$. We again count that the maximum number of such pairs that could appear is $2(k-2)$, because there are $k-2$ such pairs for each of the two labels transplanted from outside of the top-$k$.

Lastly, suppose $a, b \notin V$. They must both be chosen to be moved up when the algorithm decides to explore, with probability $\rho$. Therefore, we have

$$\Pr[a, b \in \tilde{V}] = \rho \cdot \frac{1}{\binom{m-k}{2}} = \rho \cdot \frac{2}{(m-k)(m-k-1)}.$$

Thus in this case, the deviation is bounded by $z \cdot \frac{(m-k)(m-k-1)}{2\rho} \le z \cdot \frac{(m-k)^2}{2\rho}$. There must be only one such pair present, because we only ever choose two labels from outside of the top-$k$ to swap in. Now, to produce our final bound, we multiply the weight produced by each case by the number of times it can occur, to obtain the sum

$$\begin{aligned} G &= z\left[\frac{1}{3}k^2 + 2(k-2) \cdot \frac{k(m-k)}{2(k-2)\rho} + \frac{(m-k)^2}{2\rho}\right] \\ &= z\left[\frac{1}{3}k^2 + \frac{k(m-k)}{\rho} + \frac{(m-k)^2}{2\rho}\right] \\ &= z\left[\frac{1}{3}k^2 + \frac{1}{2} \cdot \frac{m^2-k^2}{\rho}\right] \\ &\le z \cdot \frac{m^2 - k^2 + \frac{1}{6}k^2}{2\rho} \le z \cdot \frac{2(m^2-k^2)}{\rho} = O\!\left(z\,\frac{m^2-k^2}{\rho}\right), \end{aligned}$$

where the first inequality results from $\rho \le 1/4$. ∎
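The case probabilities in this proof can be checked numerically. The sketch below simulates one reading of the exploration scheme of Section 2.2 (with probability ρ, two uniformly chosen top-k labels are swapped with two uniformly chosen labels from outside) and compares the empirical frequencies against the closed-form expressions derived in the proof; the scheme and parameter values are assumptions for illustration.

```python
# Monte Carlo check of the case probabilities Pr[a, b in V~] used above.
# The exploration scheme is our reading of Section 2.2 (with probability
# rho, two uniformly chosen top-k labels are swapped with two uniformly
# chosen labels from outside); m, k, and rho are illustrative.
import random

random.seed(1)
m, k, rho = 10, 4, 0.2
V = set(range(k))                        # current top-k
outside = [l for l in range(m) if l not in V]

def explore_once():
    if random.random() >= rho:
        return set(V)                    # no exploration: top-k unchanged
    drop = random.sample(sorted(V), 2)   # two labels leave the top-k
    add = random.sample(outside, 2)      # two labels move up from outside
    return (V - set(drop)) | set(add)

trials = 400_000
hits_one_in = hits_both_out = 0
a_in, b_out = 0, k                       # case: one label in V, one outside
c_out, d_out = k, k + 1                  # case: both labels outside V
for _ in range(trials):
    Vt = explore_once()
    hits_one_in += (a_in in Vt and b_out in Vt)
    hits_both_out += (c_out in Vt and d_out in Vt)

# Closed forms from the proof of Lemma 6:
p_one_in = rho * (2 / (m - k)) * ((k - 2) / k)
p_both_out = rho * 2 / ((m - k) * (m - k - 1))
print(hits_one_in / trials, p_one_in)        # empirical vs. exact
print(hits_both_out / trials, p_both_out)
```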

### A.2 Proofs for the Optimal Algorithm

For our optimal algorithm proofs, we require an important lemma comparing biased uniform distributions and our surrogate potential function distributions.

###### Lemma 7.

Let $a, b \in [m]$, and let $f_{a,b}$ be a pairwise function which satisfies the three properties stated in Section 3.2.2. Then for any set of relevant labels $R$, we have

$$\mathbb{E}_{e_l \sim u^R_\gamma}[f_{a,b}(s + e_l)] \le \mathbb{E}_{e_{l'} \sim u^a_\gamma}[f_{a,b}(s + e_{l'})],$$

where $u^R_\gamma$ and $u^a_\gamma$ are the biased uniform distributions, placing more weight on members of $R$ and on the label $a$, respectively.

###### Proof.

Recall that $f_{a,b}(s) = \mathbb{I}(a \in R, b \notin R)\, f(s[a], s[b])$. Hence, if $a \notin R$ or $b \in R$, then $f_{a,b}$ becomes a zero function, and the inequality trivially holds.

Suppose $a \in R$ and $b \notin R$. By the definitions of $u^R_\gamma$ and $u^a_\gamma$, we have

$$u^a_\gamma[a] - u^a_\gamma[b] = u^R_\gamma[a] - u^R_\gamma[b] = \gamma,$$

from which we can deduce

$$u^a_\gamma[a] - u^R_\gamma[a] = u^a_\gamma[b] - u^R_\gamma[b] =: \Delta > 0 \quad \text{and} \quad \sum_{l \in [m] \setminus \{a, b\}} \left(u^a_\gamma[l] - u^R_\gamma[l]\right) = -2\Delta.$$

Furthermore, observe that if $l \notin \{a, b\}$, then $f_{a,b}(s + e_l) = f_{a,b}(s)$, since adding $e_l$ changes neither $s[a]$ nor $s[b]$. From this, we can infer