Multi-label learning has important practical applications (e.g., Schapire and Singer), and its theoretical properties continue to be studied (e.g., Koyejo et al.). In contrast to standard multi-class classification, multi-label learning problems allow multiple correct answers. In other words, we have a fixed set of basic labels, and the actual label is a subset of the basic labels. Since the number of subsets increases exponentially as the number of basic labels grows, thinking of each subset as a different class leads to intractability.
It is quite common in applications for the multi-label learner to output a ranking of the labels on a new test instance. For example, the popular MULAN library designed by Tsoumakas et al. allows the output of multi-label learning to be a multi-label ranker. In this paper, we focus on the multi-label ranking (MLR) setting. That is to say, the learner produces a score vector such that a label with a higher score will be ranked above a label with a lower score. We are particularly interested in online MLR settings where the data arrive sequentially. The online framework is designed to handle a large volume of data that accumulates rapidly. In contrast to classical batch learners, which observe the entire training set, online learners do not require the storage of a large amount of data in memory and can also adapt to non-stationarity in the data by updating the internal state as new instances arrive.
Boosting, first proposed by Freund and Schapire , aggregates mildly powerful learners into a strong learner. It has been used to produce state-of-the-art results in a wide range of fields (e.g., Korytkowski et al.  and Zhang and Wang ). Boosting algorithms take weighted majority votes among weak learners’ predictions, and the cumulative votes can be interpreted as a score vector. This feature makes boosting very well suited to MLR problems.
The theory of boosting has emerged in batch binary settings and became arguably complete (cf. Schapire and Freund ), but its extension to an online setting is relatively new. To our knowledge, Chen et al.  first introduced an online boosting algorithm with theoretical justifications, and Beygelzimer et al.  pushed the state-of-the-art in online binary settings further by proposing two online algorithms and proving optimality of one. Recent work by Jung et al.  has extended the theory to multi-class settings, but their scope remained limited to single-label problems.
In this paper, we present the first online MLR boosting algorithms along with their theoretical justifications. Our work is mainly inspired by the online single-label work (Jung et al. ). The main contribution is to allow general forms of weak predictions whereas the previous online boosting algorithms only considered homogeneous prediction formats. By introducing a general way to encode weak predictions, our algorithms can combine binary, single-label, and MLR predictions.
After introducing the problem setting, we define an edge of an online learner over a random learner (Definition 1). Under the assumption that every weak learner has a known positive edge, we design an optimal way to combine their predictions (Section 3.1). In order to deal with practical settings where such an assumption is untenable, we present an adaptive algorithm that can aggregate learners with arbitrary edges (Section 3.2). In Section 4, we test our two algorithms on real data sets, and find that their performance is often comparable with, and sometimes better than, that of existing batch boosting algorithms for MLR.
The number of candidate labels is fixed to be , which is known to the learner. Without loss of generality, we may write the labels using integers in . We allow multiple correct answers, so the label is a subset of . The labels in are called relevant, and those in , irrelevant. At time , an adversary sequentially chooses a labeled example , where is some domain. Only the instance is shown to the learner, and the label is revealed once the learner makes a prediction . As we are interested in MLR settings, is a dimensional score vector. The learner suffers a loss
where the loss function will be specified later in Section 3.1.
In our boosting framework, we assume that the learner consists of a booster and weak learners, where is fixed before the training starts. This resembles a manager-worker framework in that the booster distributes tasks by specifying losses, and each learner makes a prediction to minimize the loss. The booster makes the final decision by aggregating weak predictions. Once the true label is revealed, the booster shares this information so that weak learners can update their parameters for the next example.
2.1 Online Weak Learners and Cost Vector
We keep the form of weak predictions general in that we only assume it is a distribution over . This can in fact represent various types of predictions. For example, a single-label prediction, , can be encoded as a standard basis vector , or a multi-label prediction by . Due to this general format, our boosting algorithm can even combine weak predictions of different formats. This implies that if a researcher has a strong family of binary learners, she can simply boost them without transforming them into multi-class learners through well known techniques such as one-vs-all or one-vs-one [Allwein et al., 2000].
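The encoding described above can be illustrated with a minimal sketch. It assumes 0-indexed labels and that a multi-label prediction is encoded as the uniform distribution over the predicted labels; the function names are hypothetical and are not taken from the paper or from any library.

```python
import numpy as np


def encode_single_label(label: int, k: int) -> np.ndarray:
    """Encode a single-label prediction as a standard basis vector in R^k."""
    h = np.zeros(k)
    h[label] = 1.0
    return h


def encode_multi_label(labels, k: int) -> np.ndarray:
    """Encode a predicted label set as a distribution over the k labels.

    Assumption: probability mass is spread uniformly over the predicted labels.
    """
    labels = sorted(set(labels))
    if not labels:
        raise ValueError("need at least one predicted label")
    h = np.zeros(k)
    h[labels] = 1.0 / len(labels)
    return h


# Both encodings live in the same space, so a booster can mix them freely.
h_single = encode_single_label(2, k=4)      # all mass on label 2
h_multi = encode_multi_label([0, 3], k=4)   # half the mass on each of labels 0 and 3
```

Because every weak prediction becomes a point in the probability simplex, the booster never needs to know which kind of learner produced it.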
We extend the cost matrix framework, first proposed by Mukherjee and Schapire and then adopted in online settings by Jung et al., as a means of communication between the booster and the weak learners. At round , the booster computes a cost vector for the weak learner , whose prediction suffers the cost . The cost vector is unknown to until it produces , which is usual in online settings. Otherwise, can trivially minimize the cost.
A binary weak learning condition states that a learner can attain over 50% accuracy however the sample weights are assigned. In our setting, cost vectors play the role of sample weights, and we will define the edge of a learner in a similar manner.
Finally, we assume that weak learners can take an importance weight as an input, which is possible for many online algorithms.
2.2 General Online Boosting Schema
We introduce a general algorithm schema shared by our algorithms. We denote the weight of at iteration by . We keep track of weighted cumulative votes through . That is to say, we can give more credit to well-performing learners by setting larger weights. Furthermore, by allowing negative weights, we can discount the predictions of poor learners. We call a prediction made by expert . In the end, the booster makes the final decision by following one of these experts.
The schema is summarized in Algorithm 1. We want to emphasize that the true label is only available once the final prediction is made. Computation of weights and cost vectors requires the knowledge of , and thus it happens after the final decision is made. To keep our theory general, the schema does not specify which weak learners to use (line 4 and 12). The specific ways to calculate other variables such as , , and depend on algorithms, which will be introduced in the next section.
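The prediction step of the schema can be sketched as follows. This is an illustrative reading of the text rather than a transcription of Algorithm 1: it assumes that expert j is the weighted cumulative vote of the first j weak learners, and the names `boosting_predict` and `choose_expert` are hypothetical.

```python
import numpy as np


def boosting_predict(weak_predictions, weights, choose_expert):
    """Build every expert's cumulative vote and follow one of the experts.

    weak_predictions: list of length-k arrays (distributions over labels).
    weights: one weight per weak learner; negative weights are allowed.
    choose_expert: callable mapping the number of experts to an index.
    """
    votes = np.zeros_like(weak_predictions[0], dtype=float)
    experts = []
    for h, alpha in zip(weak_predictions, weights):
        votes = votes + alpha * np.asarray(h, dtype=float)
        experts.append(votes.copy())  # expert j uses learners 1..j only
    chosen = choose_expert(len(experts))
    return experts[chosen], experts


# Example: the third learner gets a negative weight, cancelling the second vote.
preds = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]), np.array([0.0, 1.0, 0.0])]
final, experts = boosting_predict(preds, [1.0, 1.0, -1.0], lambda n: n - 1)
```

Weight and cost-vector updates are deliberately left out, since they need the true label and therefore happen only after the final prediction is made.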
3 Algorithms With Theoretical Loss Bounds
An essential factor in the performance of boosting algorithms is the predictive power of the individual weak learners. For example, if weak learners make completely random predictions, they cannot produce meaningful outcomes according to the booster’s intention. We deal with this matter in two different ways. One way is to define an edge of a learner over a completely random learner and assume all weak learners have positive edges. Another way is to measure each learner’s empirical edge and manipulate the weight to maximize the accuracy of the final prediction. Even a learner that is worse than random guessing can contribute positively if we allow negative weights. The first method leads to OnlineBMR (Section 3.1), and the second to Ada.OLMR (Section 3.2).
3.1 Optimal Algorithm
We first define the edge of a learner. Recall that weak learners suffer losses determined by cost vectors. Given the true label , the booster chooses a cost vector from
where the name is used by Jung et al.  and “eor” stands for edge-over-random. Since the booster wants weak learners to put higher scores at the relevant labels, costs at the relevant labels should be less than those at the irrelevant ones. Restriction to makes sure that the learner’s cost is bounded. Along with cost vectors, the booster passes the importance weights so that the learner’s cost becomes .
We also construct a baseline learner that has edge . Its prediction is also a distribution over that puts more probability on the relevant labels. That is to say, we can write
where the value of depends on the number of relevant labels, .
Now we state our online weak learning condition.
(OnlineWLC) For parameters , and , a pair of an online learner and an adversary is said to satisfy OnlineWLC if for any , with probability at least , the learner can generate predictions that satisfy
is called an edge, and an excess loss.
This extends the condition made by Jung et al. [2017, Definition 1]. The probabilistic statement is needed as many online learners produce randomized predictions. The excess loss can be interpreted as a warm-up period. Throughout this section, we assume our learners satisfy OnlineWLC with a fixed adversary.
The optimal design of a cost vector depends on the choice of loss. We will use to denote the loss without specifying it where s is the predicted score vector. The only constraint that we impose on our loss is that it is proper, which implies that it is decreasing in , and increasing in (readers should note that “proper loss” has at least one other meaning in the literature).
Then we introduce the potential function , a well-known concept in game theory that was first introduced to boosting by Schapire :
aims to estimate the booster’s final loss when more weak learners are left until the final prediction and s is the current state. It can be easily shown by induction that many attributes of are inherited by potentials. Being proper or convex are good examples.
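To make the idea of a potential concrete, the sketch below computes it by exhaustive recursion. It assumes the boost-by-majority style recursion in which the potential with zero learners left equals the loss, and each remaining learner adds one unit vote drawn from the baseline distribution. The recursion, the integer vote state, and the names `make_potential`, `baseline`, and `toy_loss` are assumptions for illustration, not the paper's exact definition.

```python
from functools import lru_cache

import numpy as np


def make_potential(loss, baseline):
    """Return phi(n, s): expected final loss when n more votes are still to come.

    loss: callable taking a length-k score array.
    baseline: length-k distribution from which each remaining vote is drawn.
    """
    k = len(baseline)

    @lru_cache(maxsize=None)
    def phi(n, s):
        if n == 0:
            return float(loss(np.array(s, dtype=float)))
        total = 0.0
        for label in range(k):
            if baseline[label] == 0.0:
                continue
            nxt = list(s)
            nxt[label] += 1  # one more unit vote for this label
            total += baseline[label] * phi(n - 1, tuple(nxt))
        return total

    return phi


# Toy example with k = 2 where label 0 is relevant: loss is 1 unless label 0 leads.
toy_loss = lambda s: float(s[0] <= s[1])
phi = make_potential(toy_loss, baseline=(0.75, 0.25))
one_left = phi(1, (0, 0))
two_left = phi(2, (0, 0))
```

The number of reachable states grows quickly with the number of labels and remaining learners, which is in line with the later remark that computing potentials is a bottleneck.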
Essentially, we want to set
where is the prediction of expert . The proper property inherited by potentials ensures the relevant labels have less costs than the irrelevant. To satisfy the boundedness condition of , we normalize (2) to get
where . Since Definition 1 assumes that , we have to further normalize . This requires the knowledge of . This is unavailable until we observe all the instances, which is fine because we only need this value in proving the loss bound.
The algorithm is named OnlineBMR (Online Boost-by-majority for Multi-label Ranking) as its potential-function-based design has roots in the classical boost-by-majority algorithm (Schapire ). In OnlineBMR, we simply set , or in other words, the booster takes simple cumulative votes. Cost vectors are computed using (2), and the booster always follows the last expert , or . These details are summarized in Algorithm 2.
The following theorem holds either if weak learners are single-label learners or if the loss is convex.
(BMR, General Loss Bound) For any and , the final loss suffered by OnlineBMR satisfies the following inequality with probability :
where the last inequality is in fact equality if weak learners are single-label learners, or holds by Jensen’s inequality if the loss is convex (which implies the convexity of potentials). Also note that . Since both and have norm , we can subtract common numbers from every entry of without changing the value of . This implies we can plug in at the place of . Then we have
By summing this over , we have
OnlineWLC provides, with probability ,
Plugging this in (5), we get
Now summing this over , we get with probability (due to union bound),
which completes the proof. ∎
Now we evaluate the efficiency of OnlineBMR by fixing a loss. Unfortunately, there is no canonical loss in MLR settings, but the following rank loss is a strong candidate (cf. Cheng et al. and Gao and Zhou ):
where is a normalization constant that ensures the loss lies in . Note that this loss is not convex. In case weak learners are in fact single-label learners, we can simply use rank loss to compute potentials, but in more general case, we may use the following hinge loss to compute potentials:
where . It is convex and always greater than rank loss, and thus Theorem 2 can be used to bound rank loss. In Appendix A, we bound two terms in the RHS of (4) when the potentials are built upon rank and hinge losses. Here we record the results.
For the case that we use rank loss, we can check
Combining these results with Theorem 2, we get the following corollary.
(BMR, Rank Loss Bound) For any and , OnlineBMR satisfies the following rank loss bounds with probability .
With single-label learners, we have
and with general learners, we have
When we divide both sides by , we find the average loss is asymptotically bounded by the first term. The second term determines the sample complexity. In both cases, the first term decreases exponentially as grows, which means the algorithm does not require too many learners to achieve a desired loss bound.
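For reference, the rank loss and the hinge loss used in this section can be sketched in code. The sketch assumes the usual pairwise form: every (irrelevant, relevant) pair contributes 1 if the irrelevant label is scored higher and 1/2 on a tie, the hinge version replaces this indicator by max(0, 1 + score gap), and both are averaged over all such pairs so that the rank loss lies in [0, 1]. The tie convention and the function names are assumptions.

```python
import numpy as np


def pairwise_gaps(scores, relevant):
    """Return the matrix of s[l] - s[r] over irrelevant l (rows) and relevant r (columns)."""
    s = np.asarray(scores, dtype=float)
    mask = np.zeros(len(s), dtype=bool)
    mask[list(relevant)] = True
    if mask.all() or not mask.any():
        raise ValueError("need at least one relevant and one irrelevant label")
    return s[~mask][:, None] - s[mask][None, :]


def rank_loss(scores, relevant):
    """Fraction of mis-ordered (irrelevant, relevant) pairs; ties count one half."""
    gaps = pairwise_gaps(scores, relevant)
    return float(np.mean((gaps > 0) + 0.5 * (gaps == 0)))


def hinge_loss(scores, relevant):
    """Convex upper bound on rank_loss: mean of max(0, 1 + gap) over the same pairs."""
    gaps = pairwise_gaps(scores, relevant)
    return float(np.mean(np.maximum(0.0, 1.0 + gaps)))


good = rank_loss([3.0, 1.0, 2.0], relevant=[0])   # relevant label ranked first
bad = rank_loss([1.0, 3.0, 2.0], relevant=[0])    # relevant label ranked last
```

Since max(0, 1 + g) is at least 1 when g > 0 and exactly 1 when g = 0, the hinge value dominates the rank loss pair by pair, which is the property the text relies on.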
Matching Lower Bounds
From (6), we can deduce that to attain average loss less than , OnlineBMR needs learners and samples. A natural question is whether these numbers are optimal. In fact the following theorem constructs a circumstance that matches these bounds up to logarithmic factors. Throughout the proof, we consider as a fixed constant.
For any , , and , there exists an adversary with a family of learners satisfying OnlineWLC such that to achieve error rate less than , any boosting algorithm requires at least learners and samples.
We introduce a sketch here and postpone the complete discussion to Appendix B. We assume that an adversary draws a label uniformly at random from , and the weak learners generate single-label prediction w.r.t. . We manipulate such that weak learners satisfy OnlineWLC but the best possible performance is close to (6).
Boundedness conditions in and the Azuma-Hoeffding inequality provide that with probability ,
For the optimality of the number of learners, we let for all . The above inequality guarantees OnlineWLC is met. Then an argument similar to that of Schapire and Freund [2012, Section 13.2.6] can show that the optimal choice of weights over the learners is . Finally, adopting the argument in the proof of Jung et al. [2017, Theorem 4], we can show
Setting this value equal to , we have , considering as a fixed constant. This proves the first part of the theorem.
For the second part, let and define for and for . Then OnlineWLC can be shown to be met in a similar fashion. Observing that weak learners do not provide meaningful information for , we can claim any online boosting algorithm suffers a loss at least . Therefore to obtain the certain accuracy , the number of instances should be at least , which completes the second part of the proof. ∎
3.2 Adaptive Algorithm
Despite the optimal loss bound, OnlineBMR has a few drawbacks when it is applied in practice. Firstly, potentials do not have a closed form, and their computation becomes a major bottleneck (cf. Table 3). Furthermore, the edge becomes an extra tuning parameter, which increases the runtime even more. Finally, it is possible that learners have different edges, and assuming a constant edge can lead to inefficiency. To overcome these drawbacks, rather than assuming positive edges for weak learners, our second algorithm chooses the weight adaptively to handle variable edges.
Like other adaptive boosting algorithms (e.g., Beygelzimer et al.  and Freund et al. ), our algorithm needs a surrogate loss. The choice of loss is broadly discussed by Jung et al. , and logistic loss seems to be a valid choice in online settings as its gradient is uniformly bounded. In this regard, we will use the following logistic loss:
It is proper and convex. We emphasize that booster’s prediction suffers the rank loss, and this surrogate only plays an intermediate role in optimizing parameters.
The algorithm is inspired by Jung et al. [2017, Adaboost.OLM], and we call it Ada.OLMR (Online, Logistic, Multi-label, and Ranking). Since it internally aims to minimize the logistic loss, we set the cost vector to be the gradient of the surrogate:
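A minimal sketch of such a gradient-based cost vector is given here. It assumes the logistic surrogate has the pairwise form log(1 + exp(s[l] - s[r])) averaged over (irrelevant l, relevant r) pairs; this form and the function names are assumptions for illustration.

```python
import numpy as np


def _split(scores, relevant):
    """Return scores as floats, a boolean relevance mask, and the gap matrix s[l] - s[r]."""
    s = np.asarray(scores, dtype=float)
    mask = np.zeros(len(s), dtype=bool)
    mask[list(relevant)] = True
    gaps = s[~mask][:, None] - s[mask][None, :]
    return s, mask, gaps


def logistic_loss(scores, relevant):
    """Average of log(1 + exp(s[l] - s[r])) over (irrelevant, relevant) pairs."""
    _, _, gaps = _split(scores, relevant)
    return float(np.mean(np.logaddexp(0.0, gaps)))


def logistic_cost_vector(scores, relevant):
    """Gradient of logistic_loss with respect to the score vector."""
    s, mask, gaps = _split(scores, relevant)
    sig = 1.0 / (1.0 + np.exp(-gaps))       # derivative of log(1 + e^g) in g
    w = 1.0 / gaps.size                     # normalization over all pairs
    cost = np.zeros(len(s))
    cost[~mask] = w * sig.sum(axis=1)       # irrelevant labels: positive cost
    cost[mask] = -w * sig.sum(axis=0)       # relevant labels: negative cost
    return cost


cost = logistic_cost_vector([0.3, -1.2, 0.7, 0.1], relevant=[0, 2])
```

Under this form the entries of the cost vector sum to zero, with negative entries on relevant labels and positive entries on irrelevant ones.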
Next we present how to set the weights . Essentially, Ada.OLMR wants to choose to minimize the cumulative logistic loss:
After initializing to , we use the online gradient descent method, proposed by Zinkevich , to compute the next weights. If we write , we want to satisfy
where is some feasible set, and is a sublinear regret. To apply the result by Zinkevich [2003, Theorem 1], needs to be convex, and should be compact. The former condition is met by our choice of logistic loss, and we will use for the feasible set. Since the booster’s loss is invariant under the scaling of weights, we can shrink the weights to fit in .
Taking derivative, we can check . Now let denote a projection onto : . By setting
we get . Considering that , we can also write .
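The projected update can be sketched for a single scalar weight. The symmetric interval with `radius=2.0`, the step-size schedule proportional to one over the square root of the round number, and the constant `c` are assumptions for illustration; they stand in for the specific feasible set and learning rate chosen in the text.

```python
import math


def project(x: float, radius: float = 2.0) -> float:
    """Project x onto the interval [-radius, radius]."""
    return max(-radius, min(radius, x))


def ogd_step(alpha: float, grad: float, step_size: float, radius: float = 2.0) -> float:
    """One projected online gradient descent step on a scalar weight."""
    return project(alpha - step_size * grad, radius)


def run_ogd(gradients, c: float = 1.0, radius: float = 2.0):
    """Apply ogd_step along a gradient sequence with step size c / sqrt(t)."""
    alpha, history = 0.0, []
    for t, g in enumerate(gradients, start=1):
        alpha = ogd_step(alpha, g, c / math.sqrt(t), radius)
        history.append(alpha)
    return history


# Persistently negative gradients push the weight up until it hits the boundary.
trajectory = run_ogd([-1.0] * 10)
```

The projection keeps every iterate inside a compact set, which is the condition needed to invoke a regret bound of the Zinkevich type.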
Finally, it remains to address how to choose . In contrast to OnlineBMR, we cannot show that the last expert is reliably sophisticated. Instead, what can be shown is that at least one of the experts is good enough. Thus we use the classical Hedge algorithm (cf. Freund and Schapire and Littlestone and Warmuth ) to randomly choose an expert at each iteration with an adaptive probability distribution depending on each expert’s prediction history. In particular, we introduce new variables, which are initialized as . At each iteration, is randomly drawn such that
and then is updated based on the expert’s rank loss:
The details are summarized in Algorithm 3.
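The expert selection step can be sketched as below. It assumes the standard Hedge form: weights start at 1, an expert is drawn with probability proportional to its weight, and each weight is multiplied by exp(-loss) after the round. The unit learning rate and the class name `HedgeSelector` are assumptions, not details confirmed by the text.

```python
import numpy as np


class HedgeSelector:
    """Randomly follow one of several experts, favouring those with low past loss."""

    def __init__(self, n_experts: int, seed: int = 0):
        self.v = np.ones(n_experts)              # one weight per expert
        self.rng = np.random.default_rng(seed)

    def probabilities(self) -> np.ndarray:
        """Selection probabilities, proportional to the current weights."""
        return self.v / self.v.sum()

    def draw(self) -> int:
        """Sample the index of the expert to follow in this round."""
        return int(self.rng.choice(len(self.v), p=self.probabilities()))

    def update(self, losses) -> None:
        """Shrink each weight by exp(-loss), with losses assumed to lie in [0, 1]."""
        self.v = self.v * np.exp(-np.asarray(losses, dtype=float))


selector = HedgeSelector(n_experts=2)
selector.update([0.0, 1.0])      # expert 0 ranked perfectly, expert 1 failed
probs = selector.probabilities()
chosen = selector.draw()
```

In the setting of the text, the per-expert losses fed to `update` would be the rank losses of the experts' score vectors once the true label is revealed.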
As we are not imposing OnlineWLC, we need another measure of the learner’s predictive power to prove the loss bound. From (8), it can be observed that the relevant labels have negative costs and the irrelevant ones have positive costs. Furthermore, the summation of entries of is exactly . This observation suggests a new definition of weight:
This does not directly correspond to the weight used in (3), but plays a similar role. Then we define the empirical edge:
The baseline learner has this value exactly , which suggests that it is a good proxy for the edge defined in Definition 1.
Now we present the loss bound of Ada.OLMR.
(Ada.OLMR, Rank loss bound) For any and , with probability , the rank loss suffered by Ada.OLMR is bounded as follows:
where notation suppresses dependence on .
We start the proof by defining the rank loss suffered by expert as below:
According to the formula, there is no harm in defining since . As the booster chooses an expert through the Hedge algorithm, a standard analysis (cf. [Cesa-Bianchi and Lugosi, 2006, Corollary 2.3]) along with the Azuma-Hoeffding inequality provides with probability ,
where notation suppresses dependence on .
It is not hard to check that , from which we can infer
where is defined in (9). Note that this relation holds for the case as well.
Now let denote the difference of the cumulative logistic loss between two consecutive experts:
Then the online gradient descent algorithm provides
Here we record a univariate inequality:
We expand the difference to get
For ease of notation, let denote an index that moves through all tuples of , and let and denote the following terms.
Then from (16), we have and . Now we express in terms of and as below.
where the inequality holds by Jensen’s inequality. From this, we can deduce that
where the last inequality can be checked by investigating and observing the convexity of the exponential function. This proves our claim that
Summing over , we get by telescoping rule
Note that and . Therefore we have
Plugging this in (12), we get with probability ,
where the last inequality holds by the AM-GM inequality: . This completes our proof. ∎
Comparison with OnlineBMR
We finish this section by comparing our two algorithms. For a fair comparison, assume that all learners have edge . Since the baseline learner has empirical edge , for sufficiently large , we can deduce that with high probability. Using this relation, (11) can be written as
Comparing this to either (6) or (7), we can see that OnlineBMR indeed has better asymptotic loss bound and sample complexity. Despite this sub-optimality (in upper bounds), Ada.OLMR shows comparable results in real data sets due to its adaptive nature.
We performed an experiment on benchmark data sets taken from MULAN (Tsoumakas et al., http://mulan.sourceforge.net/datasets.html). We chose these four particular data sets because Dembczynski and Hüllermeier already provided performances of batch setting boosting algorithms, giving us a benchmark to compare with. The authors in fact used five data sets, but the image data set is no longer available from the source. Table 2 summarizes the basic statistics of the data sets, including training and test set sizes, number of features and labels, and three statistics of the sizes of relevant sets. The data set m-reduced is a reduced version of mediamill obtained by random sampling without replacement. We keep the original split for training and test sets to provide more relevant comparisons.
VFDT algorithms presented by Domingos and Hulten  were used as weak learners. Every algorithm used trees whose parameters were randomly chosen. VFDT is trained using single-label data, and we fed individual relevant labels along with importance weights that were computed as . Instead of using all covariates, the booster fed to trees randomly chosen covariates to make weak predictions less correlated.
All computations were carried out on Nehalem architecture 10-core 2.27 GHz Intel Xeon E7-4860 processors with 25 GB RAM per core. Each algorithm was trained at least ten times with different random seeds (OnlineBMR for m-reduced was tested 10 times due to long runtimes, and the others were tested 20 times), and the results were aggregated by taking the mean. Predictions were evaluated by rank loss. The algorithm’s loss was only recorded for test sets, but it kept updating its parameters while exploring test sets as well.
Since VFDT outputs a conditional distribution, which is not of a single-label format, we used hinge loss to compute potentials. Furthermore, OnlineBMR has an additional parameter of edge . We tried four different values444 for small and