A Multiclass Classification Approach to Label Ranking

1 Introduction

In the standard formulation of the multiclass classification problem, $(X,Y)$ is a random pair defined on a probability space $(\Omega,\mathcal{F},\mathbb{P})$ with unknown joint probability distribution $P$, where $Y$ is a label valued in $\mathcal{Y}=\{1,\ldots,K\}$ with $K\geq 3$ and the r.v. $X$ takes its values in a possibly high-dimensional Euclidean space, say $\mathbb{R}^d$ with $d\geq 1$, and models some input information that is expected to be useful to predict the output variable $Y$. The objective pursued is to build from training data $(X_1,Y_1),\ldots,(X_n,Y_n)$, supposed to be independent copies of the generic pair $(X,Y)$, a (measurable) classifier $g$, mapping the input space to $\mathcal{Y}$, that nearly minimizes the risk of misclassification

 $$L(g)=\mathbb{P}\{Y\neq g(X)\}. \qquad (1)$$

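Let $\eta(x)=(\eta_1(x),\ldots,\eta_K(x))$ be the vector of posterior probabilities: $\eta_k(x)=\mathbb{P}\{Y=k\mid X=x\}$, for $x$ in the input space and $k\in\{1,\ldots,K\}$. For simplicity, we assume here that the distribution of the r.v. $\eta(X)$ is continuous, so that the $\eta_k(X)$'s are pairwise distinct with probability one. It is well-known that the minimum risk is attained by the Bayes classifier

 $$g^*(x)=\mathop{\rm arg\,max}_{k\in\{1,\ldots,K\}}\eta_k(x),$$

and is equal to

 $$L^*=L(g^*)=1-\mathbb{E}\Big[\max_{1\leq k\leq K}\eta_k(X)\Big].$$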
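As an illustration of the two formulas above, the following minimal sketch computes the Bayes classifier and a Monte Carlo approximation of the minimum risk from a table of posterior probabilities. The function names and the toy array `eta` are assumptions introduced for the example; in practice the posterior probabilities are unknown.

```python
import numpy as np

def bayes_classifier(eta):
    """Return the 1-based label maximizing each row of eta, shape (n, K)."""
    eta = np.asarray(eta, dtype=float)
    return np.argmax(eta, axis=1) + 1

def bayes_risk(eta):
    """Average of 1 - max_k eta_k over the rows: approximates L* = 1 - E[max_k eta_k(X)]."""
    eta = np.asarray(eta, dtype=float)
    return 1.0 - float(np.mean(np.max(eta, axis=1)))

# Hypothetical posterior vectors eta(x) at two input points (K = 3 labels).
eta = np.array([[0.6, 0.3, 0.1],
                [0.2, 0.5, 0.3]])
labels = bayes_classifier(eta)
risk = bayes_risk(eta)
```

Here the average is taken over the two toy points only; with draws of $X$ from its marginal distribution it would approximate the expectation in the expression of $L^*$.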

As the distribution $P$ is unknown, a classifier must be built from the training dataset and, from the perspective of statistical learning theory, the Empirical Risk Minimization (ERM) paradigm encourages us to replace the risk (1) by a statistical estimate, typically the empirical version $\widehat{L}_n(g)=(1/n)\sum_{i=1}^n\mathbb{I}\{Y_i\neq g(X_i)\}$, denoting by $\mathbb{I}\{\mathcal{E}\}$ the indicator function of any event $\mathcal{E}$, and consider solutions of the optimization problem

 $$\min_{g\in\mathcal{G}}\widehat{L}_n(g), \qquad (2)$$

where the infimum is taken over a class $\mathcal{G}$ of classifier candidates, with controlled complexity (e.g. of finite VC dimension), though supposed rich enough to yield a small bias error $\inf_{g\in\mathcal{G}}L(g)-L^*$, i.e. to include a reasonable approximation of the Bayes classifier $g^*$. Theoretical results assessing the statistical performance of empirical risk minimizers are very well documented in the literature, see e.g. Devroye et al. (1996), and a wide collection of algorithmic approaches has been designed in order to solve possibly smoothed/convexified and/or penalized versions of the minimization problem (2). Denoting by $\mathfrak{S}_K$ the symmetric group of order $K$ (i.e. the group of permutations of $\{1,\ldots,K\}$), another natural statistical learning goal in this setup, halfway between multiclass classification and estimation of the posterior probability function $\eta$ and referred to as label ranking throughout the article, is to learn, from the training data, a ranking rule $s$, i.e. a measurable mapping from the input space to $\mathfrak{S}_K$, such that the permutation $s(X)$ sorts, with 'high probability', all possible label values $k$ in $\mathcal{Y}$ by decreasing order of the posterior probability $\eta_k(X)$, that is to say in the same order as the permutation $\sigma^*_X$ defined by: for all $x$ in the input space,

 $$\eta_{\sigma^{*-1}_x(1)}(x)>\eta_{\sigma^{*-1}_x(2)}(x)>\ldots>\eta_{\sigma^{*-1}_x(K)}(x). \qquad (3)$$
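To make the definition (3) concrete, here is a small sketch that builds the optimal permutation from a known posterior vector. The helper name `optimal_ranking` and the toy vector are assumptions for illustration; the convention used is that the returned array stores the rank of each label, so that rank 1 goes to the likeliest label.

```python
import numpy as np

def optimal_ranking(eta_x):
    """Return sigma with sigma[k-1] = rank of label k (rank 1 = largest posterior)."""
    eta_x = np.asarray(eta_x, dtype=float)
    K = len(eta_x)
    # Labels (0-based) sorted by decreasing posterior probability.
    order = np.argsort(-eta_x, kind="stable")
    sigma = np.empty(K, dtype=int)
    sigma[order] = np.arange(1, K + 1)
    return sigma

eta_x = [0.2, 0.5, 0.3]              # hypothetical posterior vector, K = 3
sigma_star = optimal_ranking(eta_x)  # ranks of labels 1, 2, 3
top_label = int(np.where(sigma_star == 1)[0][0]) + 1  # inverse permutation at 1
```

The label ranked first by this permutation is the one predicted by the Bayes classifier, which is why label ranking refines multiclass classification.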

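Equipped with this notation, observe that $g^*(x)=\sigma^{*-1}_x(1)$ for all $x$ in the input space. Given a loss function $\ell:\mathfrak{S}_K\times\mathfrak{S}_K\to\mathbb{R}_+$ (i.e. a symmetric measurable mapping s.t. $\ell(\sigma,\sigma)=0$ for all $\sigma\in\mathfrak{S}_K$), one may formulate label ranking as the problem of finding a ranking rule $s$ which minimizes the ranking risk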

 $$\mathcal{R}_\ell(s)=\mathbb{E}\big[\ell\big(s(X),\sigma^*_X\big)\big]. \qquad (4)$$

Except when and in the case when the loss function considered only measures the capacity of the ranking rule to recover the label that is ranked first, that is to say when (in this case, ), the nature of the label ranking problem significantly differs from that of multiclass classification. There is no natural empirical counterpart of the risk (4) based on the observations , which makes the ERM strategy inapplicable in a straightforward fashion. It is the goal of the present paper to show that the label ranking problem can be solved, under appropriate noise conditions, by means of the One-Versus-One (OVO) approach to multiclass classification. The learning strategy proposed is directly inspired from recent advances in consensus ranking and ranking median regression (RMR), see Korba et al. (2017) and Clémençon et al. (2018). In the RMR setup, assigned to the input random vector , one considers an output r.v. that takes its values in the group (in recommending systems, may represent the preferences over a set of items indexed by of a given user, whose profile is described by the features ). The goal is to find a ranking rule that minimizes , that is to say, for any , a consensus/median ranking related to the conditional distribution of given w.r.t. the metric . In this paper, by means of a coupling technique we show that the label ranking problem stated above can be viewed as a variant of RMR where the output ranking is very partially observed in the training stage, through the label ranked first solely. Based on this analogy, the main result of the article shows that the OVO method permits to recover the optimal label ranking with high probability, provided that noise conditions are fulfilled for all binary classification subproblems. Incidentally, the analysis carried out provides statistical guarantees in the form of (possibly fast) learning rate bounds for the OVO approach to multiclass classification under the hypotheses stipulated. 
The theoretical results established in this article are also empirically confirmed by various numerical experiments.

The paper is organized as follows. In section 2, the OVO methodology for multiclass classification is recalled at length, together with recent results in RMR. The main results of the article are stated in section 3: principally, a coupling result connecting label ranking to RMR and statistical guarantees for the OVO approach to label ranking in the form of nonasymptotic probability bounds. Numerical experiments are displayed in section 4, while some concluding remarks are collected in section 5. The proofs are deferred to the Appendix section.

2 Preliminaries

As a first go, we recall the OVO approach for defining a multiclass classifier from binary classifiers. Basic hypotheses and results related to Ranking Median Regression (RMR) are next briefly described.

2.1 From Binary to Multiclass Classification

A classifier is entirely characterized by the collection of subsets of the feature space : , where for . Observe that the ’s are pairwise disjoint and their union is equal to . Hence, they form a partition of , except that it may happen that a certain subset is empty, i.e. a certain label is never predicted by .

The OVO approach. Partitioning the feature space into more than two subsets may lead to practical difficulties and certain learning algorithms such as Support Vector Machines (SVMs) are originally tailored to the binary situation (i.e. to the case $K=2$). In this case, a natural way of extending such algorithms, usually referred to as the 'One-Versus-One' approach to multi-class classification, is to run it $K(K-1)/2$ times, once for each binary subproblem, see e.g. Hastie and Tibshirani (1998), Moreira and Mayoraz (1998), Allwein et al. (2000), Fürnkranz (2002) or Wu et al. (2004): for any $k<l$, based on the fraction of the training data with labels in $\{k,l\}$ only,

 Dk,l={(Xi,Yi):Yi∈{k,l},i=1,…,n},

the algorithm outputs a classification rule with risk

 Lk,l(gk,l)def=P{Yk,l≠gk,l(X)∣Y∈{k,l}},

where , as small as possible and combine, for any possible input value , the binary predictions so as to produce a multi-class classifier with minimum risk . A possible fashion of combining the results of the ’duels’ is to take as predicted label which has won the largest number of duels (and stipulate a rule for breaking possible ties). The rationale behind this OVO approach lies in the fact that

 g∗(x)=argmaxk∈{1,…,K}N∗k(x), (5)

where, for all , denotes the number of duels won by label with optimal/Bayes classifiers for all binary subproblems, namely

 N∗k(x)=∑l

where is the minimizer of the risk for . The proof is straightforward. Indeed, it suffices to observe that, for all ,

Remark 1

(One-Versus-All) An alternative to the OVO approach in order to reduce multiclass classification to binary subproblems and apply the SVM methodology consists in comparing each class to all of the others in $K$ two-class duels. A test point is classified as follows: the signed distances from each of the $K$ separating hyperplanes are computed, the winner being simply the class corresponding to the largest signed distance. However, other rules have been proposed in Vapnik (1998) and in Weston and Watkins (1999).
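The majority-vote combination of the duels described above can be sketched as follows. This is only an illustration: `pairwise_winner` is a hypothetical callable standing for the binary classifiers, and breaking ties in favour of the smallest label is an assumption, since the text only requires that some tie-breaking rule be stipulated.

```python
from itertools import combinations

def ovo_vote(pairwise_winner, K):
    """Combine the K(K-1)/2 duels by majority vote.

    pairwise_winner(k, l) must return k or l for 1 <= k < l <= K.
    Ties are broken in favour of the smallest label (an assumption).
    """
    wins = {k: 0 for k in range(1, K + 1)}
    for k, l in combinations(range(1, K + 1), 2):
        wins[pairwise_winner(k, l)] += 1
    best = max(wins, key=lambda k: (wins[k], -k))
    return best, wins

# Duels decided by hypothetical posterior probabilities, i.e. optimal binary rules.
eta_x = [0.2, 0.5, 0.3]
winner = lambda k, l: k if eta_x[k - 1] > eta_x[l - 1] else l
best, wins = ovo_vote(winner, K=3)
```

With optimal binary rules the label that wins the most duels is the one with the largest posterior probability, which is the content of Eq. (5).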

Label Ranking. As underlined in the Introduction section, rather than learning to predict the likeliest label given , it may also be desirable to rank all possible labels according to their conditional likelihood. The goal is then to recover the permutation defined through (3). Practically, this boils down to build a predictive rule from the training data that maps to and minimizes the ranking risk (4), where is an appropriate loss function defined on . For instance, one may consider or the Hamming distance to measure the dissimilarity between two permutations and in . Classic metrics on (see Deza and Huang (1998)) also provide natural choices for the loss function, including

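• the Kendall distance: $\forall(\sigma,\sigma')\in\mathfrak{S}_K^2$,

 $$d_\tau(\sigma,\sigma')=\sum_{i<j}\mathbb{I}\{(\sigma(i)-\sigma(j))(\sigma'(i)-\sigma'(j))<0\};$$
• the Spearman footrule: $\forall(\sigma,\sigma')\in\mathfrak{S}_K^2$,

 $$d_1(\sigma,\sigma')=\sum_{i=1}^K|\sigma(i)-\sigma'(i)|;$$
• the Spearman distance: $\forall(\sigma,\sigma')\in\mathfrak{S}_K^2$,

 $$d_2(\sigma,\sigma')=\sum_{i=1}^K(\sigma(i)-\sigma'(i))^2.$$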
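The three metrics listed above are straightforward to compute. The sketch below is a direct, unoptimized transcription (the Kendall distance is computed in quadratic time); permutations are represented as sequences of ranks, a representation chosen here for the example.

```python
from itertools import combinations

def kendall_distance(s, t):
    """Number of pairs (i, j), i < j, ordered differently by s and t."""
    return sum((s[i] - s[j]) * (t[i] - t[j]) < 0
               for i, j in combinations(range(len(s)), 2))

def spearman_footrule(s, t):
    """Sum of absolute rank differences."""
    return sum(abs(a - b) for a, b in zip(s, t))

def spearman_distance(s, t):
    """Sum of squared rank differences."""
    return sum((a - b) ** 2 for a, b in zip(s, t))

sigma, sigma_prime = [1, 2, 3], [3, 1, 2]
d_tau = kendall_distance(sigma, sigma_prime)
d_1 = spearman_footrule(sigma, sigma_prime)
d_2 = spearman_distance(sigma, sigma_prime)
```

On this toy pair, two of the three pairs of items are discordant, so the Kendall distance is 2, while the footrule and Spearman distances are 4 and 6.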

As shall be explained below, the label ranking problem can be viewed as a variant of the standard ranking median regression problem.

2.2 Ranking Median Regression

This problem of minimizing (4) shares some similarity with that referred to as ranking median regression in Clémençon et al. (2018), also called label ranking sometimes, see e.g. Tsoumakas et al. (2009) and Vembu and Gärtner (2010). In this supervised learning problem, the output associated with the input variable $X$ is a random vector $\Sigma$ taking its values in $\mathfrak{S}_K$ (expressing the preferences on a set of items indexed by $k\in\{1,\ldots,K\}$ of a user with a profile characterized by $X$ drawn at random in a certain statistical population) and the goal pursued is to learn from independent copies $(X_1,\Sigma_1),\ldots,(X_n,\Sigma_n)$ of the pair $(X,\Sigma)$ a (measurable) ranking rule $s$ that nearly minimizes

 R(s)=E[d(Σ,s(X))]. (6)

The name ranking median regression arises from the fact that any rule mapping $X$ to a median of $\Sigma$'s conditional distribution given $X$ w.r.t. the metric/loss $d$ (refer to Korba et al. (2017) for a statistical learning formulation of the consensus/median ranking problem) is a minimizer of (6), see Proposition 5 in Clémençon et al. (2018). In certain situations, the minimizer of (6) is unique and a closed analytic form can be given for the latter, based on the pairwise probabilities: $p_{i,j}(x)=\mathbb{P}\{\Sigma(i)<\Sigma(j)\mid X=x\}$ for $i\neq j$ in $\{1,\ldots,K\}$ and $x$ in the input space.

Assumption 2

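For all $x$ in the input space, we have: for all pairwise distinct $i,j,k$ in $\{1,\ldots,K\}$, $p_{i,j}(x)\neq 1/2$ and

 $$p_{i,j}(x)>1/2\ \text{ and }\ p_{j,k}(x)>1/2\ \Rightarrow\ p_{i,k}(x)>1/2. \qquad (7)$$

Indeed, when choosing the Kendall distance $d_\tau$ as loss function, it has been shown that, under Assumption 2, referred to as strict stochastic transitivity, the minimizer of (6) is almost-surely unique and given by: $\forall k\in\{1,\ldots,K\}$, with probability one:

 $$s^*_X(k)=1+\sum_{l\neq k}\mathbb{I}\{p_{k,l}(X)<1/2\}. \qquad (8)$$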
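The closed form (8) only involves pairwise comparisons and can be sketched in a few lines. The matrix `p` below is a hypothetical table of pairwise probabilities at a given input point, with entry `[k, l]` standing for the probability that item k+1 is ranked ahead of item l+1; the function name is an assumption.

```python
import numpy as np

def ranking_from_pairwise(p):
    """Rank of item k = 1 + number of items l with p[k, l] < 1/2, as in Eq. (8)."""
    p = np.asarray(p, dtype=float)
    K = p.shape[0]
    ranks = np.ones(K, dtype=int)
    for k in range(K):
        for l in range(K):
            if l != k and p[k, l] < 0.5:
                ranks[k] += 1
    return ranks

# Hypothetical pairwise probabilities for K = 3 items (diagonal unused).
p = np.array([[0.50, 0.25, 1 / 3],
              [0.75, 0.50, 0.60],
              [2 / 3, 0.40, 0.50]])
ranks = ranking_from_pairwise(p)
```

Under strict stochastic transitivity the output is a permutation; without it, several items may receive the same value.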
Remark 3

(Conditional BTLP model) A Bradley-Terry-Luce-Plackett model for ’s conditional distribution given , , assumes the existence of a hidden preference vector , where is interpreted as a preference score for item of a user with profile , see e.g. Bradley and Terry (1952), Luce (1959) or Plackett (1975). The conditional distribution of given can be defined sequentially as follows: is distributed according to a multinomial distribution of size with support and parameters and, for , is distributed according to a multinomial distribution of size with support with parameters , . The conditional pairwise probabilities are given by and one may easily check that Assumption 2 is fulfilled as soon as the ’s are pairwise distinct with probability one. In this case, is the permutation that sorts the ’s in decreasing order.
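The sequential construction of Remark 3 can be simulated as follows, for a fixed input point. This is a sketch under assumptions: the preference vector `w` is hypothetical, and the function names are introduced for the example. Items are drawn one at a time, proportionally to their preference scores among the items not yet ranked.

```python
import numpy as np

def sample_btlp_ranking(w, rng):
    """Draw a permutation from a BTLP model with preference scores w.

    Returns sigma with sigma[k-1] = rank of item k.
    """
    w = np.asarray(w, dtype=float)
    remaining = list(range(len(w)))
    order = []
    while remaining:
        probs = w[remaining] / w[remaining].sum()
        pick = int(rng.choice(len(remaining), p=probs))
        order.append(remaining.pop(pick))
    sigma = np.empty(len(w), dtype=int)
    sigma[order] = np.arange(1, len(w) + 1)
    return sigma

def btlp_pairwise(w):
    """Pairwise probabilities w_k / (w_k + w_l) that item k precedes item l."""
    w = np.asarray(w, dtype=float)
    return w[:, None] / (w[:, None] + w[None, :])

rng = np.random.default_rng(0)
w = [1.0, 3.0, 2.0]   # hypothetical preference scores
sigma = sample_btlp_ranking(w, rng)
p = btlp_pairwise(w)
# Empirical frequency with which item 2 is ranked first (theoretical value 3/6).
freq = np.mean([sample_btlp_ranking(w, rng)[1] == 1 for _ in range(4000)])
```

Since the scores are pairwise distinct, all off-diagonal pairwise probabilities differ from 1/2, in line with the remark on Assumption 2.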

In Clémençon et al. (2018), certain situations where empirical risk minimizers over classes of ranking rules fulfilling appropriate complexity assumptions can be proved to achieve fast learning rates (i.e. faster than ) have been investigated. More precisely, denoting by the essential infimum of any real valued r.v. , the following ’noise condition’ related to conditional pairwise probabilities was considered.

Assumption 4

We have:

 $$H=\mathop{\rm ess\,inf}\ \min_{i<j}\left|p_{i,j}(X)-\frac{1}{2}\right|>0. \qquad (9)$$

Precisely, it is shown in Clémençon et al. (2018) (see Proposition 7 therein) that, under Assumptions 2-4, minimizers of the empirical version of (6) over a VC major class of ranking rules with the Kendall distance as loss function achieves a learning rate bound of order (without the impact of model bias). Since (cf Eq. in Clémençon et al. (2018)), a bound for the probability that the empirical risk minimizer differs from the optimal ranking rule at a random point can be immediately derived.

3 Label Ranking

We now describe at length the connection between label ranking and RMR and state the main results of the article.

3.1 Label Ranking as RMR

The major difference with label ranking in the multi-class classification context lies in the fact that only the partial information is observable in presence of noise, under the form of the random label assigned to ( being the mode of ’s conditional distribution given ), in order to mimic the optimal rule .

Lemma 5

Let $(X,Y)$ be a random pair on the probability space $(\Omega,\mathcal{F},\mathbb{P})$. One may extend the sample space so as to build a random variable $\Sigma$ that takes its values in $\mathfrak{S}_K$ and whose conditional distribution given $X$ is a BTLP model with preference vector $\eta(X)$ such that

 $$Y=\Sigma^{-1}(1)\ \text{ with probability one.} \qquad (10)$$

See the Appendix section for the technical proof. The noteworthy fact that the probabilities related to the optimal pairwise comparisons are given by a BTLP model has been pointed out in Hastie and Tibshirani (1998). With the notations introduced in Lemma 5, we have in addition

 $$\mathbb{P}\{\Sigma(k)<\Sigma(l)\mid X\}=\frac{\eta_k(X)}{\eta_k(X)+\eta_l(X)}:=\eta_{k,l}(X).$$

Eq. (10) can be interpreted as follows: the label ranking problem as defined in subsection 2.1 can be viewed as a specific RMR problem under strict stochastic transitivity (i.e. Assumption 2 is always fulfilled) with incomplete observations

 $$(X_1,\Sigma_1^{-1}(1)),\ \ldots,\ (X_n,\Sigma_n^{-1}(1)).$$

Due to the incomplete character of the training data, one cannot recover the optimal ranking rule by minimizing a statistical version of (6) of course. As an alternative, one may attempt to build directly an empirical version of based on the explicit form (8), which only involves pairwise comparisons, in a similar manner as in Korba et al. (2017) for consensus ranking. Indeed, in the specific RMR problem under study, Eq. (8) becomes

 σ∗X(k)=1+∑l≠kI{g∗k,l(X)=−1}, (11)

for all $k\in\{1,\ldots,K\}$. The OVO procedure precisely makes it possible to construct such an empirical version. As shall be shown by the subsequent analysis, in spite of the very partial nature of the statistical information at disposal, the OVO approach makes it possible to recover the optimal RMR rule with high probability provided that (a possibly weakened version of) Assumption 4 is fulfilled, combined with classic complexity conditions. The methods of Korba et al. (2018) or Brinker and Hüllermeier (2019) can also tackle RMR with partial information, but they lack theoretical guarantees.

Remark 6

(On the noise condition) Attention should be paid to the fact that, when applied to the random pair defined in Lemma 5, Assumption 4 simply means that the classic Massart’s noise condition is fulfilled for every binary classification subproblem, see Massart and Nédélec (2006).

3.2 The OVO Approach to Label Ranking

Let be a class of decision rules . As stated in subsection 2.1, the OVO approach to multiclass classification is implemented as follows. For all , compute a minimizer of the empirical risk

 ˆLk,l(g)=1nk+nl∑i:Yi∈{k,l}I{g(Xi)≠Yk,l,i} (12)

over class , with for and the convention that . We set for by convention. Equipped with these classifiers, for any test (i.e. input and unlabeled) observation , the ’s define a complete directed graph with the labels as vertices: , if and otherwise. The analysis carried out in the next subsection shows that under appropriate noise conditions, with large probability, the random graph is acyclic, meaning that the complete binary relation is transitive (i.e. and ), in other words that the scoring function

 $$\widehat{s}(X)(k)=1+\sum_{l\neq k}\mathbb{I}\{\widehat{g}_{k,l}(X)=-1\}=1+\sum_{l\neq k}\mathbb{I}\{k\to_X l\},\ \text{ for } k\in\{1,\ldots,K\} \qquad (13)$$

defines a permutation, which, in addition, coincides with , cf Eq. (11). The equivalence between the transitivity of , the acyclicity of and the membership of in is straightforward, details are left to the reader (see e.g. the argument of Theorem 5’s proof in Clémençon et al. (2018)). The quantity (13) can be related to the Copeland score, see Copeland (1951): the score of label being equal to plus the number of duels it has lost, while its Copeland score is its number of victories minus its number of defeats, so that

 ˆs(X)(.)=(K+1−CX(.))/2. (14)

When is not transitive, or equivalently when , one may build a ranking from the scoring function (13) by breaking ties in an arbitrary fashion, as proposed below for simplicity. Alternatives could be considered of course. The issue of building a ranking/permutation of the labels in from (13) can be connected with the feedback set problem for directed graphs, see e.g. Di Battista et al. (1999): for a directed graph, a minimal feedback arcset is a set of edges of smallest cardinality such that a directed acyclic graph is obtained when reversing the edges in it. Refer to e.g. Festa et al. (1999) for algorithms.
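The scoring function (13), its link (14) with the Copeland score and the arbitrary tie-breaking step can be sketched as follows. The encoding is an assumption made for the example: `beats[l, k]` is True when label l+1 wins its duel against label k+1, whatever sign convention the binary classifiers use, and ties are broken by label index.

```python
import numpy as np

def ovo_scores(beats):
    """Score of each label: 1 + number of duels it has lost."""
    beats = np.asarray(beats, dtype=bool)
    return 1 + beats.sum(axis=0)   # column k counts the labels that beat k

def copeland_scores(beats):
    """Number of victories minus number of defeats."""
    beats = np.asarray(beats, dtype=bool)
    return beats.sum(axis=1) - beats.sum(axis=0)

def is_permutation(scores):
    return sorted(int(v) for v in scores) == list(range(1, len(scores) + 1))

def break_ties(scores):
    """Turn scores into a permutation, breaking ties by label index (an assumption)."""
    order = np.argsort(scores, kind="stable")
    sigma = np.empty(len(scores), dtype=int)
    sigma[order] = np.arange(1, len(scores) + 1)
    return sigma

# Transitive duels: label 2 beats labels 1 and 3, label 3 beats label 1.
transitive = np.zeros((3, 3), dtype=bool)
transitive[1, 0] = transitive[1, 2] = transitive[2, 0] = True
# Cyclic duels: 1 beats 2, 2 beats 3, 3 beats 1.
cyclic = np.zeros((3, 3), dtype=bool)
cyclic[0, 1] = cyclic[1, 2] = cyclic[2, 0] = True

s_trans = ovo_scores(transitive)
s_cyc = ovo_scores(cyclic)
```

In the transitive case the scores already form a permutation and satisfy relation (14); in the cyclic case all labels receive the same score and a ranking is obtained only after tie-breaking.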

3.3 Statistical Guarantees for Label Ranking

It is the purpose of the subsequent analysis to show that, provided that the conditions listed below are fulfilled, the ranking rule can be fully recovered through the OVO approach previously described with high probability. We denote by the marginal distribution of the input variable , by the conditional distribution of given and set for .

Assumption 7

There exist $\alpha\in[0,1]$ and $B>0$ such that: for all $k<l$ and $t\geq 0$,

 $$\mathbb{P}\{|2\eta_{k,l}(X)-1|<t\}\leq Bt^{\frac{\alpha}{1-\alpha}}.$$
Assumption 8

The class is of finite VC dimension .

Assumption 9

There exists a constant , s.t. for all in and , .

Assumption 7 means that Assumption 4 is satisfied by the random pair defined in Lemma 5 in the case (notice incidentally that it is void when ) and reduces to the classic Mammen-Tsybakov noise condition in the binary case , see Mammen and Tsybakov (1999). The following result provides nonasymptotic bounds for the ranking risk of the OVO ranking rule in the case where the loss function is , i.e. for the probability of error. Extension to any other loss function is straightforward, insofar as we obviously have with probability one.

Theorem 10

Suppose that Assumptions 7-9 are fulfilled. Then, for all , we have with probability (w.p.) at least : ,

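 $$\mathbb{P}\{\widehat{\sigma}_X\neq\sigma^*_X\mid\mathcal{D}\}\leq\frac{\beta}{\varepsilon}\left\{\binom{K}{2}r_n^\alpha\left(\frac{\delta}{\binom{K}{2}}\right)+\sum_{k<l}2\left(\inf_{g\in\mathcal{G}}L_{k,l}(g)-L^*_{k,l}\right)^\alpha\right\},$$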

where denotes a r.v. drawn from , independent from the training data , , and with ,

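 $$r_n(\delta)=2\left(\frac{1}{nh}\right)^{\frac{1}{2-\alpha}}\times\left[(64C^2V\log n)^{\frac{1}{2-\alpha}}+(32\log(2/\delta))^{\frac{1}{2-\alpha}}\right].$$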

Refer to the Appendix section for the technical proof.

Remark 11

(On the noise condition (bis)) We point out that the results of this paper can be straightforwardly extended to the situation where the noise exponent may vary depending on the binary subproblem considered. For the sake of simplicity only, here we restrict the analysis to the homogeneous setup described by Assumption 7.

Hence, for the RMR problem related to the partially observed BTLP model detailed in subsection 3.1, the rate bound achieved by the OVO ranking rule in Theorem 10 is of order $n^{-\alpha/(2-\alpha)}$, ignoring the bias term and the logarithmic factors. In the case $\alpha=1$, it is exactly the same rate as that attained by minimizers of the ranking risk in the standard RMR setup, as stated in Proposition 7 in Korba et al. (2017). Whereas situations where the OVO multi-class classification may possibly lead to 'inconsistencies' (i.e. where the binary relation defined by the duels is not transitive) have been exhibited many times in the literature, no probability bound for the excess of classification risk of the general OVO classifier, built from ERM applied to all binary subproblems, is documented to the best of our knowledge. Hence, attention should be paid to the fact that, as a by-product of the argument of Theorem 10's proof, generalization bounds for the OVO classifier

 ¯g(X)def=ˆσ−1X(1).

can be established, as stated in Corollary 13 below. More generally, the statistical performance of the label ranking rule produced by the method described in subsection 3.2 can be assessed for other risks. For instance, rather than just comparing the true label assigned to to the label ranked first, as in OVO classification approach, one could consider , with for all , equal to when does not appear in the top list and to otherwise, where is fixed in . For any ranking rule , the corresponding risk is then

 Wk(s)=E[ℓk(Y,s(X))]. (15)
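Unlike the ranking risk (4), the top-$k$ risk (15) compares the prediction with the observed label and thus has an obvious empirical counterpart. The sketch below assumes the 0-1 form of the loss (1 when the label is outside the top-$k$ list, 0 otherwise), which is consistent with the conditional expectation (20) given in the Appendix; function names and toy data are introduced for the example.

```python
import numpy as np

def topk_loss(y, sigma, k):
    """1 if label y (1-based) is not among the k top-ranked labels of sigma, else 0.

    sigma[j-1] is the rank given to label j (rank 1 = predicted likeliest).
    """
    sigma = np.asarray(sigma)
    return int(sigma[y - 1] > k)

def empirical_topk_risk(ys, sigmas, k):
    """Average top-k loss over observed labels ys and predicted rankings sigmas."""
    return float(np.mean([topk_loss(y, s, k) for y, s in zip(ys, sigmas)]))

sigma = [3, 1, 2]   # label 2 ranked first, label 3 second, label 1 last
losses = [topk_loss(y, sigma, 2) for y in (1, 2, 3)]
risk = empirical_topk_risk([1, 2, 3, 2], [sigma] * 4, k=1)
```

For $k=1$ the loss reduces to the misclassification error of the classifier that predicts the label ranked first.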

Set , where the minimum is taken over the set of all possible ranking rules . As shown in the Appendix section, the argument leading to Theorem 10 can be adapted to prove a rate bound for the risk excess of the OVO ranking rule .

Proposition 12

Let be fixed. Then:

 W∗k=Wk(σ∗.).

Suppose in addition that Assumptions 7-9 are fulfilled. Then, for all , we have w.p. : ,

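 $$W_k(\widehat{\sigma}_\cdot)-W^*_k\leq\frac{\beta}{\varepsilon}\binom{K}{k}k(K-k)\times\left(r_n^\alpha\left(\frac{\delta}{\binom{K}{2}}\right)+2\cdot\max_{m\neq l}\left(\inf_{g\in\mathcal{G}}L_{l,m}(g)-L^*_{l,m}\right)^\alpha\right).$$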

Since we have for any label ranking rule , in the case the result above provides a generalization bound for the excess of misclassification risk of the OVO classifier .

Corollary 13

Suppose that Assumptions 7-9 are fulfilled. Then, for all , we have w.p. : ,

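 $$L(\bar{g})-L^*\leq\frac{\beta}{\varepsilon}K(K-1)\times\left(r_n^\alpha\left(\frac{\delta}{\binom{K}{2}}\right)+2\cdot\max_{k\neq l}\left(\inf_{g\in\mathcal{G}}L_{k,l}(g)-L^*_{k,l}\right)^\alpha\right).$$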

4 Experimental Results

This section first illustrates the results of Theorem 10 using simulated datasets, whose distributions satisfy Assumption 7 for certain values of the noise parameter, highlighting the impact/relevance of this condition. In the experiments based on real data next displayed, the OVO approach to top-$k$ classification, cf. Eq. (15), is shown to surpass rankings relying on the scores output by multiclass classification algorithms. Due to space limitations, details and comments are postponed to the Supplementary Material.

Synthetic data. In this toy illustrative example, we consider , and learn a simple decision stump, i.e. a function of the form where are unknown parameters. A representation of the ’s for all as well as the expected Kendall distance of OVO label ranking models for different values of are given in Fig. 2. For each value of , the boxplot is computed using independent trials, representing different learning rates, for and namely.

Real data. Regarding top-$k$ performance, for two popular datasets, MNIST and fashion-MNIST, the OVO label ranking approach is benchmarked against the rankings based on the probability estimates related to a multiclass logistic regression in Table 1.

5 Conclusion

In this paper, a statistical problem halfway between multiclass classification and posterior probability estimation, referred to as label ranking here, is considered. The goal is to design a method to rank, for any test observation , all the labels that can be possibly assigned to it by decreasing order of magnitude of the (unknown) posterior probability . Formulated as a specific ranking median regression problem with incomplete observations, this problem is shown to have a solution that takes the form of a Copeland score, involving pairwise comparisons only. Based on this crucial observation, it is proved that the OVO procedure for multiclass classification permits to build, from training classification/labelled data, the optimal ranking with high probability, under appropriate hypotheses. This is also empirically supported by numerical experiments. Remarkably, the analysis carried out here incidentally provides a rate bound for the OVO classifier.

Appendix - Technical Details

Proof of Lemma 5

As a first go, define as . Next, given and , draw as a BTLP model on the set with preference parameters , . For all , set and invert the permutation to get a random permutation with the desired properties.

Proof of Theorem 10

Fix and let . Assumption 7 implies that the Mammen-Tsybakov noise condition is fulfilled for the binary classification problem related to the pair given that . When Assumptions 8-9 are also satisfied, a possibly fast rate bound for the risk excess of the empirical risk minimizer can be established, as stated in the following lemma.

Lemma 14

Suppose that Assumptions 7-9 are fulfilled. Let . Then, for all , we have w.p. : ,

 Lk,l(ˆgk,l)−L∗k,l≤2(infg∈GLk,l(g)−L∗k,l)+rn(δ), (16)

where, for all and ,

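 $$r_n(\delta)=2\left(\frac{1}{nh}\right)^{\frac{1}{2-\alpha}}\times\left[(64C^2V\log n)^{\frac{1}{2-\alpha}}+(32\log(2/\delta))^{\frac{1}{2-\alpha}}\right].$$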

proof.  The result is a slight variant of that proved in P. Bartlett and Mendelson (2005) (see therein), the sole difference lying in the fact that the empirical risk (and, consequently, its minimizer as well) is built from a random number of training observations (i.e. those with labels in ). Note that . Details are given in the Supplementary Material.

Observe that the probabilities appearing in this proof are conditional probabilities given the training sample and, as a consequence, must be considered as random variables. However, to simplify notations, we omit to write the conditioning w.r.t. explicitly. Notice first that Assumption 7 implies that, with ,

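 $$\mathbb{P}_X\{\widehat{g}_{k,l}(X)\neq g^*_{k,l}(X)\mid Y\in\{k,l\}\}\leq\beta\left(L_{k,l}(\widehat{g}_{k,l})-L^*_{k,l}\right)^\alpha, \qquad (17)$$

with probability one. Observe in addition that

 $$\mathbb{P}_X\{\widehat{g}_{k,l}(X)\neq g^*_{k,l}(X)\mid Y\in\{k,l\}\}=\mathbb{E}_X\left[\frac{d\mu_{k,l}}{d\mu}(X)\times\mathbb{I}\{\widehat{g}_{k,l}(X)\neq g^*_{k,l}(X)\}\right], \qquad (18)$$

denoting by $\mu_{k,l}$ the conditional distribution of $X$ given that $Y\in\{k,l\}$. Under Assumption 9, we almost-surely have:

 $$\frac{d\mu_{k,l}}{d\mu}(X)\geq\frac{\varepsilon}{p_k+p_l}\geq\varepsilon.$$

Hence, from (17) and Lemma 14, we get that

 $$\frac{\varepsilon}{\beta}\mathbb{P}_X\{\widehat{g}_{k,l}(X)\neq g^*_{k,l}(X)\}\leq\left(L_{k,l}(\widehat{g}_{k,l})-L^*_{k,l}\right)^\alpha\leq 2\left(\inf_{g\in\mathcal{G}}L_{k,l}(g)-L^*_{k,l}\right)^\alpha+r_n^\alpha(\delta), \qquad (19)$$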

using Minkowski’s inequality. Since

 $$\bigcap_{k<l}\{\widehat{g}_{k,l}(X)=g^*_{k,l}(X)\}\subset\{\widehat{\sigma}_X=\sigma^*_X\}$$

with probability one, combining the bound above with the union bound, for all , w.p. :

Proof of Proposition 12

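Let us first show that $W^*_k=W_k(\sigma^*_\cdot)$. For any ranking rule $s$ and all $x$ in the input space, we define

 $$\mathrm{Top}_k(s(x))=\{s(x)^{-1}(1),\ldots,s(x)^{-1}(k)\},$$

and also set $\mathrm{Top}^*_k(x)=\mathrm{Top}_k(\sigma^*_x)$. Indeed, for any ranking rule $s$, we can write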

 Wk(s)=E[E[ℓk(Y,s(X))∣X]],

and we almost-surely have

 E[ℓk(Y,s(X))∣X]=K∑l=1ηl(X)I{l∉Topk(s(X))}. (20)

As is defined through (3), one easily sees that the quantity (20) is minimum for any ranking rule s.t.

 Topk(s(X))=Top∗k(X). (21)

Hence, the collection of optimal ranking rules regarding the risk (15) coincides with the set of ranking rules such that (21) holds true with probability one. Observe that, with probability one,

 I{Y∉Topk(s(X))}−I{Y∉Top∗k(X)}≤I{Top∗k(X)≠Topk(s(X))},

for any ranking rule , so that

 Wk(s)−W∗k≤PX{Topk(s(X))≠Top∗k(X)}.

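 $$\begin{aligned}W_k(\widehat{\sigma}_X)-W^*_k&\leq\mathbb{P}_X\{\mathrm{Top}^*_k(X)\neq\mathrm{Top}_k(\widehat{\sigma}_X)\}\\&=\sum_{\mathcal{L}\subset\mathcal{Y}:\,\#\mathcal{L}=k}\mathbb{P}_X\{\mathrm{Top}^*_k(X)=\mathcal{L},\ \mathrm{Top}^*_k(X)\neq\mathrm{Top}_k(\widehat{\sigma}_X)\}\\&\leq\sum_{\mathcal{L}\subset\mathcal{Y}:\,\#\mathcal{L}=k}\ \sum_{l\in\mathcal{L},\,m\notin\mathcal{L}}\mathbb{P}_X\{\widehat{g}_{l,m}(X)\neq g^*_{l,m}(X)\}\\&\leq\frac{\beta}{\varepsilon}\binom{K}{k}k(K-k)\times\left(r_n^\alpha\left(\frac{\delta}{\binom{K}{2}}\right)+2\cdot\max_{m\neq l}\left(\inf_{g\in\mathcal{G}}L_{l,m}(g)-L^*_{l,m}\right)^\alpha\right),\end{aligned}$$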

using (19).

References

• Allwein et al. (2000) E. Allwein, R. Schapire, and Y. Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113–141, 2000.
• Boucheron et al. (2005) S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification : a survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
• Bousquet et al. (2004) O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. In Advanced Lectures on Machine Learning, pages 169–207. 2004.
• Bradley and Terry (1952) R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
• Brinker and Hüllermeier (2019) K. Brinker and E. Hüllermeier. A reduction of label ranking to multiclass classification. In ECML PKDD. 2019.
• Clémençon et al. (2018) S. Clémençon, A. Korba, and E. Sibony. Ranking median regression: Learning to order through local consensus. In Proceedings of the conference Algorithmic Learning Theory, 2018.
• Copeland (1951) A. H. Copeland. A reasonable social welfare function. In Seminar on applications of mathematics to social sciences, University of Michigan, 1951.
• Devroye et al. (1996) L. Devroye, L. Györfi, and G. Lugosi. A probabilistic theory of pattern recognition. Springer, 1996.
• Deza and Huang (1998) M. Deza and T. Huang. Metrics on permutations, a survey. 1998.
• Di Battista et al. (1999) G. Di Battista, P. Eades, R. Tamassia, and I. Tollis. Graph Drawing. Prentice Hall, 1999.
• Festa et al. (1999) P. Festa, P. Pardalos, and M. C. Resende. Feedback Set Problems, pages 209–258. Springer US, Boston, MA, 1999.
• Fürnkranz (2002) J. Fürnkranz. Round robin classification. Journal of Machine Learning Research, 2:721–747, 2002.
• Hastie and Tibshirani (1998) T. Hastie and R. Tibshirani. Classification by pairwise coupling. In Proceedings of NIPS, 1998.
• Korba et al. (2017) A. Korba, S. Clémençon, and E. Sibony. A learning theory of ranking aggregation. In Proceeding of AISTATS 2017, 2017.
• Korba et al. (2018) A. Korba, A. Garcia, and F. d’Alché Buc. A structured prediction approach for label ranking. In NeurIPS. 2018.
• Luce (1959) R. D. Luce. Individual Choice Behavior. Wiley, 1959.
• Mammen and Tsybakov (1999) E. Mammen and A. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.
• Massart and Nédélec (2006) P. Massart and E. Nédélec. Risk bounds for statistical learning. Annals of Statistics, 34(5), 2006.
• Moreira and Mayoraz (1998) M. Moreira and E. Mayoraz. Improved pairwise coupling classification with correcting classifiers. In Proceedings of ECML, 1998.
• P. Bartlett and Mendelson (2005) P. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.
• Plackett (1975) R. L. Plackett. The analysis of permutations. Applied Statistics, 2(24):193–202, 1975.
• Tsoumakas et al. (2009) G. Tsoumakas, I. Katakis, and I. Vlahavas. Mining multi-label data. In Data mining and knowledge discovery handbook, pages 667–685. Springer, 2009.
• van der Vaart and Wellner (1996) A. van der Vaart and J. A. Wellner. Weak convergence and empirical processes. 1996. ISBN 0-387-94640-3.
• Vapnik (1998) V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
• Vembu and Gärtner (2010) S. Vembu and T. Gärtner. Label ranking algorithms: A survey. In Preference learning, pages 45–64. Springer, 2010.
• Weston and Watkins (1999) J. Weston and C. Watkins. Multiclass support vector machines. In Proceedings of ESANN99, D. Facto Press, Brussels., 1999.
• Wu et al. (2004) T. Wu, C. Lin, and R. Weng. Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research, 5:975–1005, 2004.
• Xiao et al. (2017) H. Xiao, K. Rasul, and R. Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017. URL http://arxiv.org/abs/1708.07747.
• Zou et al. (2009) B. Zou, H. Zhang, and Z. Xu. Learning from uniformly ergodic Markov chains, volume 25. 2009.
