The predictive learning problem considered in this paper can be easily stated in an informal fashion, as follows. Given a collection of objects of arbitrary cardinality, $N \geq 1$ say, respectively described by characteristics $x_1, \ldots, x_N$ in a feature space $\mathcal{X}$, the goal is to learn how to order them by increasing order of magnitude of a certain unknown continuous variable $y$. To fix ideas, the attribute $y$ can represent the ’size’ of the object and be difficult to measure, as for the physical measurement of microscopic bodies in chemistry and biology or the cash flow of companies in quantitative finance, and the features $x$ may then correspond to indirect measurements. The most convenient way to define a preorder on a feature space $\mathcal{X}$ is to transport the natural order on the real line onto it by means of a (measurable) scoring function $s: \mathcal{X} \to \mathbb{R}$: an object with characteristics $x$ is then said to be ’larger’ (’strictly larger’, respectively) than an object described by $x'$ according to the scoring rule $s$ when $s(x') \leq s(x)$ (when $s(x') < s(x)$). Statistical learning boils down here to building a scoring function $s(x)$, based on a training data set of $n$ objects for which the values of all variables (direct and indirect measurements) have been jointly observed, such that $s(X)$ and $Y$ tend to increase or decrease together with highest probability or, in other words, such that the ordering of new objects induced by $s(x)$ matches that defined by their true measures as well as possible. This problem, which shall be referred to as continuous ranking throughout the article, can be viewed as an extension of bipartite ranking, where the output variable $Y$ is assumed to be binary and the objective can be naturally formulated as a functional $M$-estimation problem by means of the concept of ROC curve, see . Refer also to , ,  for approaches based on the optimization of summary performance measures such as the AUC criterion in the binary context. Generalization to the situation where the random label is ordinal and may take a finite number of values is referred to as multipartite ranking and has been recently investigated in  (see also e.g. ), where distributional conditions guaranteeing that the ROC surface and the VUS criterion can be used to determine optimal scoring functions are exhibited in particular.
It is the major purpose of this paper to formulate the continuous ranking problem in a quantitative manner and explore the connection between the latter and bi/multi-partite ranking. Intuitively, optimal scoring rules would also be optimal for any bipartite subproblem defined by thresholding the continuous variable $Y$ with cut-off $t > 0$, separating the observations $X$ such that $Y < t$ from those such that $Y > t$. Viewing continuous ranking this way, as a continuum of nested bipartite ranking problems, we provide here sufficient conditions for the existence of such (optimal) scoring rules and we introduce a concept of integrated ROC curve (IROC curve in abbreviated form) that may serve as a natural performance measure for continuous ranking, as well as the related notion of integrated AUC criterion, a summary scalar criterion, akin to Kendall tau. Generalization properties of empirical Kendall tau maximizers are discussed in the Supplementary Material. The paper also introduces a novel recursive algorithm that solves a discretized version of the empirical integrated ROC curve optimization problem, producing a scoring function that can be computed by means of a hierarchical combination of binary classification rules. Numerical experiments providing strong empirical evidence of the relevance of the approach promoted in this paper are also presented.
The paper is structured as follows. The probabilistic framework we consider is described and key concepts of bi/multi-partite ranking are briefly recalled in section 2. Conditions under which optimal solutions of the problem of ranking data with continuous labels exist are next investigated in section 3, while section 4 introduces a dedicated quantitative (functional) performance measure, the IROC curve. The algorithmic approach we propose in order to learn scoring functions with nearly optimal IROC curves is presented at length in section 5. Numerical results are displayed in section 6. Some technical proofs are deferred to the Supplementary Material.
2 Notation and Preliminaries
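Throughout the paper, the indicator function of any event $\mathcal{E}$ is denoted by $\mathbb{I}\{\mathcal{E}\}$. The pseudo-inverse of any cdf $F(t)$ on $\mathbb{R}$ is denoted by $F^{-1}(u) = \inf\{s \in \mathbb{R}: F(s) \geq u\}$, while $\mathcal{U}([0,1])$ denotes the uniform distribution on the unit interval.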
2.1 The probabilistic framework
Given a continuous real valued r.v. $Y$ representing an attribute of an object, its ’size’ say, and a random vector $X$ taking its values in a (typically high dimensional Euclidean) feature space $\mathcal{X}$ modelling other observable characteristics of the object (e.g. ’indirect measurements’ of the size of the object), hopefully useful for predicting $Y$, the statistical learning problem considered here is to learn from $n \geq 1$ independent training observations $\mathcal{D}_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, drawn as the pair $(X, Y)$, a measurable mapping $s: \mathcal{X} \to \mathbb{R}$, that shall be referred to as a scoring function throughout the paper, so that the variables $s(X)$ and $Y$ tend to increase or decrease together: ideally, the larger the score $s(X)$, the higher the size $Y$. For simplicity, we assume throughout the article that $\mathcal{X} = \mathbb{R}^d$ with $d \geq 1$ and that the support of $Y$’s distribution is compact, equal to $[0,1]$ say. For any $q \geq 1$, we denote by $\lambda_q$ the Lebesgue measure on $\mathbb{R}^q$ equipped with its Borelian $\sigma$-algebra and suppose that the joint distribution $F_{X,Y}(dx\,dy)$ of the pair $(X,Y)$ has a density $f_{X,Y}(x,y)$ w.r.t. the tensor product measure $\lambda_d \otimes \lambda_1$. We also introduce the marginal distributions $F_Y(dy) = f_Y(y)\lambda_1(dy)$ and $F_X(dx) = f_X(x)\lambda_d(dx)$, where $f_Y(y) = \int_{x \in \mathcal{X}} f_{X,Y}(x,y)\lambda_d(dx)$ and $f_X(x) = \int_{y \in [0,1]} f_{X,Y}(x,y)\lambda_1(dy)$, as well as the conditional densities $f_{X|Y=y}(x) = f_{X,Y}(x,y)/f_Y(y)$ and $f_{Y|X=x}(y) = f_{X,Y}(x,y)/f_X(x)$. Observe incidentally that the probabilistic framework of the continuous ranking problem is quite similar to that of distribution-free regression. However, as shall be seen in the subsequent analysis, even if the regression function $m(x) = \mathbb{E}[Y \mid X = x]$ can be optimal under appropriate conditions, just like for regression, measuring ranking performance involves criteria that are of a different nature than the expected least squares error, and plug-in rules may not be relevant for the goal pursued here, as depicted by Fig. 2 in the Supplementary Material.
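Scoring functions. The set of all scoring functions is denoted by $\mathcal{S}$ here. Any scoring function $s \in \mathcal{S}$ defines a total preorder $\preceq_s$ on the space $\mathcal{X}$: $\forall (x, x') \in \mathcal{X}^2$, $x \preceq_s x' \Leftrightarrow s(x) \leq s(x')$. We also set $x \prec_s x'$ when $s(x) < s(x')$ and $x =_s x'$ when $s(x) = s(x')$ for $(x, x') \in \mathcal{X}^2$.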
2.2 Bi/multi-partite ranking
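Suppose that $Z$ is a binary label, taking its values in $\{-1, +1\}$ say, assigned to the r.v. $X$. In bipartite ranking, the goal is to pick $s$ in $\mathcal{S}$ so that the larger $s(X)$, the greater the probability that $Z$ is equal to $+1$ ideally. In other words, the objective is to learn $s(x)$ such that the r.v. $s(X)$ given $Z = +1$ is as stochastically larger as possible than the r.v. $s(X)$ given $Z = -1$ (given two real-valued r.v.’s $U$ and $U'$, recall that $U$ is said to be stochastically larger than $U'$ when $\mathbb{P}\{U \geq t\} \geq \mathbb{P}\{U' \geq t\}$ for all $t \in \mathbb{R}$): the difference between $\bar{G}_s(t) = \mathbb{P}\{s(X) \geq t \mid Z = +1\}$ and $\bar{H}_s(t) = \mathbb{P}\{s(X) \geq t \mid Z = -1\}$ should thus be maximal for all $t \in \mathbb{R}$. This can be naturally quantified by means of the notion of ROC curve of a candidate $s \in \mathcal{S}$, i.e. the parametrized curve $t \in \mathbb{R} \mapsto (\bar{H}_s(t), \bar{G}_s(t))$, which can be viewed as the graph of a mapping $\mathrm{ROC}_s: \alpha \in (0,1) \mapsto \mathrm{ROC}_s(\alpha)$, connecting possible discontinuity points by linear segments (so that $\mathrm{ROC}_s(\alpha) = \bar{G}_s \circ \bar{H}_s^{-1}(\alpha)$ when $\bar{H}_s$ has no flat part in $\bar{H}_s^{-1}(\alpha)$, where $\bar{H}_s^{-1}(\alpha) = \inf\{t \in \mathbb{R}: \bar{H}_s(t) \leq \alpha\}$). A basic Neyman-Pearson theory argument shows that the optimal elements $s^*(x)$ related to this natural (functional) bipartite ranking criterion (i.e. scoring functions whose ROC curve dominates any other ROC curve everywhere on $(0,1)$) are transforms $(T \circ \eta)(x)$ of the posterior probability $\eta(x) = \mathbb{P}\{Z = +1 \mid X = x\}$, where $T$ is any strictly increasing borelian mapping. Optimization of the ROC curve in sup norm has been considered in  or in  for instance. However, given its functional nature, in practice the ROC curve of any $s(x)$ is often summarized by the area under it, a performance measure that can be interpreted in a probabilistic manner, as the theoretical rate of concording pairs

$$\mathrm{AUC}(s) = \mathbb{P}\{s(X) < s(X') \mid Z = -1, Z' = +1\} + \frac{1}{2}\,\mathbb{P}\{s(X) = s(X') \mid Z = -1, Z' = +1\}, \qquad (1)$$

where $(X', Z')$ denotes an independent copy of $(X, Z)$. A variety of algorithms aiming at maximizing the AUC criterion or surrogate pairwise criteria have been proposed and studied in the literature, among which ,  or , whereas generalization properties of empirical AUC maximizers have been studied in ,  and . An analysis of the relationship between the AUC and the error rate is given in .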
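The probabilistic reading of the area under the ROC curve as a rate of concording pairs can be illustrated numerically. The following is a minimal sketch, not taken from the paper: the function name `empirical_auc` and the convention of counting ties for one half are assumptions made for illustration.

```python
from itertools import product


def empirical_auc(scores, labels):
    """Empirical rate of concording (negative, positive) pairs.

    `scores` holds the values s(X_i) and `labels` the binary labels in {-1, +1}.
    Ties between a negative and a positive score count for 1/2 (assumed convention).
    """
    neg = [s for s, z in zip(scores, labels) if z == -1]
    pos = [s for s, z in zip(scores, labels) if z == +1]
    if not neg or not pos:
        raise ValueError("need at least one observation in each class")
    total = 0.0
    for s_neg, s_pos in product(neg, pos):
        if s_neg < s_pos:
            total += 1.0      # concording pair: the positive instance is scored higher
        elif s_neg == s_pos:
            total += 0.5      # tie
    return total / (len(neg) * len(pos))


# Three of the four (negative, positive) pairs are correctly ordered.
print(empirical_auc([0.1, 0.4, 0.35, 0.8], [-1, -1, 1, 1]))
```

The quadratic loop over pairs is kept for readability; a sort-based computation would be preferable on large samples.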
Extension to the situation where the label $Y$ takes at least three ordinal values (i.e. multipartite ranking) has also been investigated, see e.g.  or . In , it is shown that, in contrast to the bipartite setup, the existence of optimal solutions cannot be guaranteed in general, and conditions on $(X,Y)$’s distribution are exhibited, ensuring that optimal solutions do exist and that extensions of bipartite ranking criteria such as the ROC manifold and the volume under it can be used for learning optimal scoring rules. An analogous analysis in the context of continuous ranking is carried out in the next section.
3 Optimal elements in ranking data with continuous labels
In this section, a natural definition of the set of optimal elements for continuous ranking is first proposed. Existence and characterization of such optimal scoring functions are next discussed.
3.1 Optimal scoring rules for continuous ranking
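Considering a threshold value $y \in [0,1]$, a considerably weakened (and discretized) version of the problem stated informally above would consist in finding $s$ so that the r.v. $s(X)$ given $Y > y$ is as stochastically larger as possible than $s(X)$ given $Y < y$. This subproblem coincides with the bipartite ranking problem related to the pair $(X, Z_y)$, where $Z_y = 2\mathbb{I}\{Y > y\} - 1$. As briefly recalled in subsection 2.2, the optimal set $\mathcal{S}^*_y$ is composed of the scoring functions $s(x)$ that induce the same ordering as

$$\eta_y(X) = \mathbb{P}\{Y > y \mid X\} = 1 - \frac{1 - p_y}{1 - p_y + p_y \Phi_y(X)},$$

where $\Phi_y(X) = \frac{dF_{X|Y>y}}{dF_{X|Y<y}}(X)$ and $p_y = 1 - F_Y(y) = \mathbb{P}\{Y > y\}$.
A continuum of bipartite ranking problems. The rationale behind the definition of the set $\mathcal{S}^*$ of optimal scoring rules for continuous ranking is that any element $s^*$ should score observations $x$ in the same order as $\eta_y$ (or equivalently as $\Phi_y$), whatever the threshold $y$.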
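Definition 1. (Optimal scoring rule) An optimal scoring rule for the continuous ranking problem related to the random pair $(X,Y)$ is any element $s^*$ that fulfills: $\forall y \in (0,1)$,

$$\forall (x, x') \in \mathcal{X}^2, \quad \eta_y(x) < \eta_y(x') \Rightarrow s^*(x) < s^*(x'). \qquad (2)$$

In other words, the set of optimal rules is defined as $\mathcal{S}^* = \bigcap_{y \in (0,1)} \mathcal{S}^*_y$.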
It is noteworthy that, although the definition above is natural, the set $\mathcal{S}^*$ can be empty in the absence of any distributional assumption, as shown by the following example.
As a counter-example, consider the distributions such that and . Observe that , so that for all and there exists s.t. is not constant. Hence, there exists no in such that (2) holds true for all .
Remark 1. (Invariance) We point out that the class $\mathcal{S}^*$ of optimal elements for continuous ranking thus defined is invariant under strictly increasing transforms of the ’size’ variable $Y$ (in particular, a change of unit has no impact on the definition of $\mathcal{S}^*$): for any borelian and strictly increasing mapping $H: (0,1) \to (0,1)$, any scoring function $s^*(x)$ that is optimal for the continuous ranking problem related to the pair $(X, Y)$ is still optimal for that related to $(X, H(Y))$ (since, under these hypotheses, for any $y \in (0,1)$: $Y > y \Leftrightarrow H(Y) > H(y)$).
3.2 Existence and characterization of optimal scoring rules
We now investigate conditions guaranteeing the existence of optimal scoring functions for the continuous ranking problem.
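Proposition 1. The following assertions are equivalent.
1. For all $0 < y < y' < 1$, for all $(x, x') \in \mathcal{X}^2$: $\Phi_y(x) < \Phi_y(x') \Rightarrow \Phi_{y'}(x) \leq \Phi_{y'}(x')$.
2. There exists an optimal scoring rule $s^*$ (i.e. $\mathcal{S}^* \neq \emptyset$).
3. The regression function $m(x) = \mathbb{E}[Y \mid X = x]$ is an optimal scoring rule.
4. The collection of probability distributions $\{F_{X|Y=y}(dx) = f_{X|Y=y}(x)\lambda_d(dx): y \in (0,1)\}$ satisfies the monotone likelihood ratio property: there exist $s^* \in \mathcal{S}$ and, for all $0 < y < y' < 1$, an increasing function $\varphi_{y,y'}: \mathbb{R} \to \mathbb{R}_+$ such that: $\forall x \in \mathbb{R}^d$, $f_{X|Y=y'}(x) / f_{X|Y=y}(x) = \varphi_{y,y'}(s^*(x))$.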
Refer to the Appendix section for the technical proof. Truth be told, checking that these (equivalent) assertions hold is a very challenging statistical task. However, through important examples, we now describe (not uncommon) situations where the conditions stated in Proposition 1 are fulfilled.
We give a few important examples of probabilistic models fulfilling the properties listed in Proposition 1.
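Regression model. Suppose that $Y = m(X) + \epsilon$, where $m: \mathcal{X} \to \mathbb{R}$ is a borelian function and $\epsilon$ is a centered r.v. independent from $X$. One may easily check that $m \in \mathcal{S}^*$.
Exponential families. Suppose that $f_{X|Y=y}(x) = \exp(\kappa(y)T(x) - \psi(y))f(x)$ for all $x \in \mathbb{R}^d$, where $f: \mathbb{R}^d \to \mathbb{R}_+$ is borelian, $\kappa: [0,1] \to \mathbb{R}$ is a borelian strictly increasing function and $T: \mathbb{R}^d \to \mathbb{R}$ is a borelian mapping such that $\psi(y) = \log \int_{x \in \mathbb{R}^d} \exp(\kappa(y)T(x))f(x)\,dx < +\infty$.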
We point out that, although the regression function $m(x)$ is an optimal scoring function when $\mathcal{S}^* \neq \emptyset$, the continuous ranking problem does not coincide with distribution-free regression (notice incidentally that, in this case, any strictly increasing transform of $m(x)$ belongs to $\mathcal{S}^*$ as well). As depicted by Fig. 2, the least-squares criterion is not relevant to evaluate continuous ranking performance and naive plug-in strategies should be avoided, see Remark 3 below. Dedicated performance criteria are proposed in the next section.
4 Performance measures for continuous ranking
We now investigate quantitative criteria for assessing performance in the continuous ranking problem, on which practical machine-learning algorithms may rely. We place ourselves in the situation where the set $\mathcal{S}^*$ is not empty, see Proposition 1 above.
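A functional performance measure. It follows from the view developed in the previous section that, for any $(s, s^*) \in \mathcal{S} \times \mathcal{S}^*$ and for all $y \in (0,1)$, we have:

$$\forall \alpha \in (0,1), \quad \mathrm{ROC}_{s,y}(\alpha) \leq \mathrm{ROC}_{s^*,y}(\alpha) = \mathrm{ROC}^*_y(\alpha), \qquad (3)$$

denoting by $\mathrm{ROC}_{s,y}$ the ROC curve of any $s \in \mathcal{S}$ related to the bipartite ranking subproblem $(X, Z_y)$ and by $\mathrm{ROC}^*_y$ the corresponding optimal curve, i.e. the ROC curve of strictly increasing transforms of $\eta_y(x)$. Based on this observation, it is natural to design a dedicated performance measure by aggregating these ’sub-criteria’. Integrating over $y$ w.r.t. a $\sigma$-finite measure $\mu$ with support equal to $[0,1]$, this leads to the following definition: $\mathrm{IROC}_{\mu,s}(\alpha) = \int \mathrm{ROC}_{s,y}(\alpha)\,\mu(dy)$. The functional criterion thus defined inherits properties from the $\mathrm{ROC}_{s,y}$’s (e.g. monotonicity, concavity). In addition, the curve $\mathrm{IROC}_{\mu,s^*}$ with $s^* \in \mathcal{S}^*$ dominates everywhere on $(0,1)$ any other curve $\mathrm{IROC}_{\mu,s}$ for $s \in \mathcal{S}$. However, except in pathological situations (e.g. when $s(x)$ is constant), the curve $\mathrm{IROC}_{\mu,s}$ is not invariant when replacing $Y$’s distribution by that of a strictly increasing transform $H(Y)$. In order to guarantee that this desirable property is fulfilled (see Remark 1), one should integrate w.r.t. $Y$’s distribution (which boils down to replacing $Y$ by the uniformly distributed r.v. $F_Y(Y)$).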
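Definition 2. (Integrated criteria) The integrated ROC curve of any scoring rule $s \in \mathcal{S}$ is defined as: $\forall \alpha \in (0,1)$,

$$\mathrm{IROC}_s(\alpha) = \int_{y=0}^{1} \mathrm{ROC}_{s,y}(\alpha)\,F_Y(dy) = \mathbb{E}[\mathrm{ROC}_{s,Y}(\alpha)]. \qquad (4)$$

The integrated AUC criterion is defined as the area under the integrated ROC curve: $\forall s \in \mathcal{S}$,

$$\mathrm{IAUC}(s) = \int_{\alpha=0}^{1} \mathrm{IROC}_s(\alpha)\,d\alpha. \qquad (5)$$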
The following result reveals the relevance of the functional/summary criteria defined above for the continuous ranking problem. Additional properties of IROC curves are listed in the Supplementary Material.
Let . The following assertions are equivalent.
For all , .
We have , where for all .
If , then we have: ,
In addition, for any borelian and strictly increasing mapping , replacing by leaves the curves , , unchanged.
Equipped with the notion defined above, a scoring rule $s_1$ is said to be more accurate than another one $s_2$ if $\mathrm{IROC}_{s_2}(\alpha) \leq \mathrm{IROC}_{s_1}(\alpha)$ for all $\alpha \in (0,1)$. The IROC curve criterion thus provides a partial preorder on $\mathcal{S}$. Observe also that, by virtue of Fubini’s theorem, we have $\mathrm{IAUC}(s) = \int \mathrm{AUC}_y(s)\,F_Y(dy)$ for all $s \in \mathcal{S}$, denoting by $\mathrm{AUC}_y(s)$ the AUC of $s$ related to the bipartite ranking subproblem $(X, Z_y)$. Just like the AUC for bipartite ranking, the scalar IAUC criterion defines a full preorder on $\mathcal{S}$ for continuous ranking. Based on a training dataset $\mathcal{D}_n$ of independent copies of $(X, Y)$, statistical versions of the IROC/IAUC criteria can be straightforwardly computed by replacing the distributions $F_Y$, $F_{X|Y>y}$ and $F_{X|Y<y}$ by their empirical counterparts in (3)-(5), see the Supplementary Material for further details. The lemma below provides a probabilistic interpretation of the IAUC criterion.
Let be a copy of the random pair and a copy of the r.v. . Suppose that , and are defined on the same probability space and are independent. For all , we have:
This result shows in particular that a natural statistical estimate of $\mathrm{IAUC}(s)$ based on $\mathcal{D}_n$ involves $U$-statistics of degree $3$. Its proof is given in the Supplementary Material for completeness.
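The Fubini identity recalled above, which expresses the integrated AUC as the average over thresholds of the AUC of the bipartite subproblems, suggests a simple plug-in computation. The sketch below is only one possible reading, under stated assumptions: the threshold is drawn among the observed values of the output variable, observations with output exactly equal to the threshold are left out of both classes, and thresholds producing an empty class are skipped. The names `auc_at_threshold` and `empirical_iauc` are hypothetical.

```python
def auc_at_threshold(scores, ys, y):
    """Empirical AUC of the bipartite subproblem obtained by thresholding at y.

    Negatives are the observations with output below y, positives those above y.
    Returns None when one of the two classes is empty.
    """
    neg = [s for s, v in zip(scores, ys) if v < y]
    pos = [s for s, v in zip(scores, ys) if v > y]
    if not neg or not pos:
        return None
    concording = sum(
        1.0 if a < b else 0.5 if a == b else 0.0 for a in neg for b in pos
    )
    return concording / (len(neg) * len(pos))


def empirical_iauc(scores, ys):
    """Average of the threshold-wise AUCs over the observed outputs (assumed plug-in rule)."""
    values = [auc_at_threshold(scores, ys, y) for y in ys]
    values = [v for v in values if v is not None]   # drop degenerate thresholds
    if not values:
        raise ValueError("no threshold separates the sample into two non-empty classes")
    return sum(values) / len(values)


# A scoring rule that orders the sample exactly like the outputs gets the maximal value.
print(empirical_iauc([1, 2, 3, 4], [0.1, 0.2, 0.3, 0.4]))
```

Skipping degenerate thresholds slightly changes the weights with respect to the empirical distribution of the outputs; this choice is made here for simplicity only.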
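The Kendall $\tau$ statistic. The quantity (6) is akin to another popular way to measure, in a summary fashion, the tendency of $s(X)$ and $Y$ to define the same ordering on the statistical population:

$$d_\tau(s) = \mathbb{P}\{(s(X) - s(X')) \cdot (Y - Y') > 0\} + \frac{1}{2}\,\mathbb{P}\{s(X) = s(X')\} = \mathbb{P}\{s(X) < s(X') \mid Y < Y'\} + \frac{1}{2}\,\mathbb{P}\{X =_s X'\}, \qquad (7)$$

where $(X', Y')$ denotes an independent copy of $(X, Y)$, observing that $\mathbb{P}\{Y < Y'\} = 1/2$. The empirical counterpart of (7) based on the sample $\mathcal{D}_n$, given by

$$\widehat{d}_n(s) = \frac{2}{n(n-1)} \sum_{1 \leq i < j \leq n} \left( \mathbb{I}\{(s(X_i) - s(X_j)) \cdot (Y_i - Y_j) > 0\} + \frac{1}{2}\,\mathbb{I}\{s(X_i) = s(X_j)\} \right), \qquad (8)$$

is known as the Kendall $\tau$ statistic and is widely used in the context of statistical hypothesis testing. The quantity (7) shall thus be referred to as the (theoretical or true) Kendall $\tau$. Notice that $d_\tau(s)$ is invariant under strictly increasing transformations of $s(x)$ and thus describes properties of the order it defines. The following result reveals that the class $\mathcal{S}^*$, when non empty, is the set of maximizers of the theoretical Kendall $\tau$. Refer to the Supplementary Material for the technical proof.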
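Proposition 2. Suppose that $\mathcal{S}^* \neq \emptyset$. For any $(s, s^*) \in \mathcal{S} \times \mathcal{S}^*$, we have: $d_\tau(s) \leq d_\tau(s^*)$, where $d_\tau$ denotes the theoretical Kendall $\tau$.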
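The empirical Kendall statistic discussed above is a pairwise average and can be sketched in a few lines. The version below is an illustration under an assumed convention (a pair counts for one when scores and outputs are ordered the same way, and for one half when the two scores are tied); the function name `empirical_kendall` is hypothetical.

```python
from itertools import combinations


def empirical_kendall(scores, ys):
    """Pairwise concordance rate between the scores s(X_i) and the outputs Y_i.

    Assumed kernel: 1 if (s_i - s_j) * (y_i - y_j) > 0, 1/2 if s_i == s_j, else 0.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two observations")
    total = 0.0
    for (s_i, y_i), (s_j, y_j) in combinations(list(zip(scores, ys)), 2):
        if (s_i - s_j) * (y_i - y_j) > 0:
            total += 1.0      # concording pair
        elif s_i == s_j:
            total += 0.5      # tied scores
    return 2.0 * total / (n * (n - 1))


ys = [0.1, 0.5, 0.9]
scores = [1, 3, 2]
# The statistic only depends on the two orderings: a strictly increasing
# transform of the outputs (here a cube) leaves it unchanged.
print(empirical_kendall(scores, ys) == empirical_kendall(scores, [y ** 3 for y in ys]))
```

Because only the signs of the pairwise differences enter the computation, the same value is obtained after any strictly increasing transform of the scores or of the outputs.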
Equipped with these criteria, the objective expressed above in an informal manner can now be formulated in a quantitative manner as a (possibly functional) $M$-estimation problem. In practice, the goal pursued is to find a reasonable approximation of a solution to the problem of maximizing the Kendall $\tau$ (respectively the IAUC), where the supremum is taken over the set of all scoring functions $s: \mathcal{X} \to \mathbb{R}$. Of course, these criteria are unknown in general, just like $(X,Y)$’s probability distribution, and the empirical risk minimization (ERM in abbreviated form) paradigm (see ) suggests maximizing the statistical version (8) over a class $\mathcal{S}_0 \subset \mathcal{S}$ of controlled complexity when considering the Kendall $\tau$ criterion for instance. The generalization capacity of empirical maximizers of the Kendall $\tau$ can be straightforwardly established using results in . More details are given in the Supplementary Material.
Before describing a practical algorithm for recursive maximization of the IROC curve, a few remarks are in order.
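Remark 2. (On Kendall $\tau$ and AUC) We point out that, in the bipartite ranking problem (i.e. when the output variable $Z$ takes its values in $\{-1, +1\}$, see subsection 2.2) as well, the AUC criterion can be expressed as a function of the Kendall $\tau$ related to the pair $(s(X), Z)$ when the r.v. $s(X)$ is continuous. Indeed, we have in this case $2p(1-p)\,\mathrm{AUC}(s) = d_\tau(s)$, where $p = \mathbb{P}\{Z = +1\}$ and $d_\tau(s) = \mathbb{P}\{(s(X) - s(X')) \cdot (Z - Z') > 0\}$, denoting by $(X', Z')$ an independent copy of $(X, Z)$.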
Remark 3. (Connection to distribution-free regression) Consider the nonparametric regression model $Y = m(X) + \epsilon$, where $\epsilon$ is a centered r.v. independent from $X$. In this case, it is well-known that the regression function $m(X) = \mathbb{E}[Y \mid X]$ is the (unique) solution of the expected least squares minimization. However, although $m \in \mathcal{S}^*$, the least squares criterion is far from appropriate to evaluate ranking performance, as depicted by Fig. 2. Observe additionally that, in contrast to the criteria introduced above, increasing transformation of the output variable $Y$ may have a strong impact on the least squares minimizer: except for linear transforms, $\mathbb{E}[H(Y) \mid X]$ is in general not an increasing transform of $m(X)$.
Remark 4. (On discretization) Bi/multi-partite algorithms are not directly applicable to the continuous ranking problem. Indeed, a discretization of the interval $[0, 1]$ would first be required, but this would raise a difficult question outside our scope: how to choose this discretization based on the training data? We believe that this approach is less efficient than ours, which reveals problem-specific criteria, namely IROC and IAUC.
5 Continuous Ranking through Oriented Recursive Partitioning
It is the purpose of this section to introduce the algorithm CRank, a specific tree-structured learning algorithm for continuous ranking.
5.1 Ranking trees and Oriented Recursive Partitions
Decision trees undeniably figure among the most popular techniques, in supervised and unsupervised settings, refer to  or  for instance. This is essentially due to the visual model summary they provide, in the form of a binary tree graphic that describes predictions by means of a hierarchical combination of elementary rules of the type ”$X^{(m)} \leq s$” or ”$X^{(m)} > s$”, comparing the value taken by a (quantitative) component of the input vector $X$ (the split variable) to a certain threshold (the split value). In contrast to local learning problems such as classification or regression, predictive rules for a global problem such as ranking cannot be described by a (tree-structured) partition of the feature space alone: cells (corresponding to the terminal leaves of the binary decision tree) must be ordered so as to define a scoring function. This leads to the definition of ranking trees as binary trees equipped with a ”left-to-right” orientation, defining a tree-structured collection of scoring functions, as depicted by Fig. 1. Binary ranking trees have been introduced in the context of bipartite ranking in  or in  and in  in the context of multipartite ranking. The root node of a ranking tree $T_J$ of depth $J \geq 0$ represents the whole feature space $\mathcal{X}$: $\mathcal{C}_{0,0} = \mathcal{X}$, while each internal node $(j,k)$ with $j < J$ and $k \in \{0, \ldots, 2^j - 1\}$ corresponds to a subset $\mathcal{C}_{j,k} \subset \mathcal{X}$, whose left and right children respectively correspond to disjoint subsets $\mathcal{C}_{j+1,2k}$ and $\mathcal{C}_{j+1,2k+1}$ such that $\mathcal{C}_{j,k} = \mathcal{C}_{j+1,2k} \cup \mathcal{C}_{j+1,2k+1}$. Equipped with the left-to-right orientation, any subtree $T \subset T_J$ defines a preorder on $\mathcal{X}$: elements lying in the same terminal cell of $T$ being equally ranked. The scoring function related to the oriented tree $T$ can be written as:

$$\forall x \in \mathcal{X}, \quad s_T(x) = \sum_{\mathcal{C}_{j,k}:\ \text{terminal leaf of } T} 2^J \left(1 - \frac{k}{2^j}\right) \cdot \mathbb{I}\{x \in \mathcal{C}_{j,k}\}. \qquad (9)$$
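To make the role of the left-to-right orientation concrete, the sketch below scores a point from a list of terminal cells. It assumes that a terminal cell at depth j and horizontal position k, in a tree of maximal depth J, receives the value 2^J (1 - k / 2^j), so that cells further to the left get larger scores; any strictly decreasing function of the left-to-right position would induce the same preorder. The names `leaf_score` and `tree_score`, as well as the representation of a cell by a membership predicate, are hypothetical.

```python
def leaf_score(j, k, depth):
    """Value attached to the terminal cell (j, k) of a tree of maximal depth `depth`."""
    return 2 ** depth * (1 - k / 2 ** j)


def tree_score(x, leaves, depth):
    """Score x with an oriented tree given as a list of (j, k, contains) triples.

    `contains` is a predicate telling whether x lies in the terminal cell (j, k).
    The cells are assumed to form a partition of the feature space.
    """
    for j, k, contains in leaves:
        if contains(x):
            return leaf_score(j, k, depth)
    raise ValueError("x falls in no terminal cell")


# Toy oriented partition of the real line with terminal cells at depths 2, 2 and 1.
leaves = [
    (2, 0, lambda x: x >= 0.75),         # leftmost cell: highest score
    (2, 1, lambda x: 0.5 <= x < 0.75),
    (1, 1, lambda x: x < 0.5),           # rightmost cell: lowest score
]
print([tree_score(x, leaves, depth=2) for x in (0.9, 0.6, 0.1)])
```

Points falling in the same terminal cell receive the same value, which reflects the fact that they are equally ranked by the tree.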
5.2 The CRank algorithm
Based on Proposition 2, as mentioned in the Supplementary Material, one can try to build from the training dataset $\mathcal{D}_n$ a ranking tree by recursive empirical Kendall $\tau$ maximization. We propose below an alternative tree-structured recursive algorithm, relying on a (dyadic) discretization of the ’size’ variable $Y$. At each iteration, the local sample (i.e. the data lying in the cell described by the current node) is split into two halves (the highest/smallest halves, depending on $Y$) and the algorithm calls a binary classification algorithm $\mathcal{A}$ to learn how to divide the node into right/left children. The theoretical analysis of this algorithm and its connection with the approximation of the optimal IROC curve are difficult questions that will be addressed in future work. Indeed, we found that the IROC curve cannot be represented as a parametric curve, contrary to the ROC curve, which renders proofs much more difficult than in the bipartite case.
The CRank Algorithm
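Input. Training data $\mathcal{D}_n$, depth $J \geq 1$, binary classification algorithm $\mathcal{A}$.
Initialization. Set $\mathcal{C}_{0,0} = \mathcal{X}$.
Iterations. For $j = 0, \ldots, J-1$ and $k = 0, \ldots, 2^j - 1$,
1. Compute a median $y_{j,k}$ of the dataset $\{Y_i: X_i \in \mathcal{C}_{j,k}\}$ and assign the binary label $Z_i = 2\mathbb{I}\{Y_i > y_{j,k}\} - 1$ to any data point $i$ lying in $\mathcal{C}_{j,k}$, i.e. such that $X_i \in \mathcal{C}_{j,k}$.
2. Solve the binary classification problem related to the input space $\mathcal{C}_{j,k}$ and the training set $\{(X_i, Z_i): 1 \leq i \leq n,\ X_i \in \mathcal{C}_{j,k}\}$, producing a classifier $g_{j,k}: \mathcal{C}_{j,k} \to \{-1, +1\}$.
3. Set $\mathcal{C}_{j+1,2k} = \{x \in \mathcal{C}_{j,k}: g_{j,k}(x) = +1\}$ and $\mathcal{C}_{j+1,2k+1} = \mathcal{C}_{j,k} \setminus \mathcal{C}_{j+1,2k}$.
Output. Ranking tree $T_J = \{\mathcal{C}_{j,k}: 0 \leq j \leq J,\ 0 \leq k < 2^j\}$.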
Of course, the depth $J$ should be chosen such that $2^J \leq n$. One may also consider continuing to split the nodes until the number of data points within a cell has reached a minimum specified in advance. In addition, it is well known that recursive partitioning methods fragment the data and that the instability of splits increases with the depth. For this reason, a ranking subtree must be selected. The growing procedure above should classically be followed by a pruning stage, where children of the same parent are progressively merged until the root is reached, and a subtree among the resulting sequence with nearly maximal IAUC should be chosen using cross-validation. Issues related to the implementation of the CRank algorithm and variants (e.g. exploiting randomization/aggregation) will be investigated in a forthcoming paper.
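The growing stage described above (median split of the outputs, binary classification, left/right assignment) can be sketched as follows. This is a simplified illustration and not the authors' implementation: the binary classification algorithm is replaced by a hypothetical axis-aligned threshold rule (`fit_stump`), the cell predicted positive is assumed to become the left child, leaf values follow the assumed convention 2^J (1 - k / 2^j), a node is left unsplit when its local labels are all identical, and no pruning is performed.

```python
from statistics import median


def fit_stump(points, labels):
    """Hypothetical stand-in for the classifier: best single-coordinate threshold rule."""
    best = None
    for m in range(len(points[0])):
        for t in sorted({p[m] for p in points}):
            for sign in (+1, -1):
                errors = sum(
                    1 for p, z in zip(points, labels)
                    if (sign if p[m] > t else -sign) != z
                )
                if best is None or errors < best[0]:
                    best = (errors, m, t, sign)
    _, m, t, sign = best
    return lambda p: sign if p[m] > t else -sign


def crank(points, ys, depth, fit=fit_stump):
    """Grow an oriented tree of maximal depth `depth` and return its scoring function."""

    def grow(idx, j, k):
        leaf_value = 2 ** depth * (1 - k / 2 ** j)
        if j == depth or len(idx) < 2:
            return lambda x: leaf_value
        y_med = median(ys[i] for i in idx)                   # local median of the outputs
        labels = [1 if ys[i] > y_med else -1 for i in idx]   # binary relabelling
        if len(set(labels)) < 2:
            return lambda x: leaf_value                      # nothing to separate
        g = fit([points[i] for i in idx], labels)
        left = grow([i for i in idx if g(points[i]) == 1], j + 1, 2 * k)
        right = grow([i for i in idx if g(points[i]) != 1], j + 1, 2 * k + 1)
        return lambda x: left(x) if g(x) == 1 else right(x)

    return grow(list(range(len(points))), 0, 0)


# Toy check on a one-dimensional sample whose output increases with the feature.
points = [(i / 10,) for i in range(8)]
ys = [p[0] ** 2 for p in points]
score = crank(points, ys, depth=2)
print([score(p) for p in points])
```

Any classifier exposing the same interface (a function fitted on points and labels in {-1, +1} that returns a prediction rule) could be passed through the `fit` argument in place of the threshold rule used here.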
6 Numerical Experiments
In order to illustrate the idea conveyed by Fig. 2 that the least squares criterion is not appropriate for the continuous ranking problem, we compared CRank with CART on a toy example. Recall that the latter is a regression decision tree algorithm which minimizes the MSE (Mean Squared Error). We also ran an alternative version of CRank which maximizes the empirical Kendall $\tau$ instead of the empirical IAUC: this method is referred to as Kendall from now on. The experimental setting is composed of a unidimensional feature space (for visualization reasons) and a simple regression model without any noise: $Y = m(X)$. Intuitively, a least squares strategy can miss slight oscillations of the regression function, which are critical in ranking when they occur in high probability regions as they affect the order induced on the feature space. The results are presented in Table 1. See Supplementary Material for further details.
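The gap between least squares accuracy and ranking accuracy can be seen on a hand-made example, unrelated to the experiment reported in Table 1. In the sketch below (all values and names are made up for illustration), a predictor with a large constant offset has a poor MSE but orders the sample perfectly, whereas a predictor with a small MSE swaps neighbouring points and ranks worse.

```python
from itertools import combinations


def mse(pred, ys):
    """Mean squared error between predictions and outputs."""
    return sum((p - y) ** 2 for p, y in zip(pred, ys)) / len(ys)


def kendall(pred, ys):
    """Rate of pairs ordered the same way by predictions and outputs (ties count 1/2)."""
    pairs = list(combinations(zip(pred, ys), 2))
    total = 0.0
    for (p_i, y_i), (p_j, y_j) in pairs:
        if (p_i - p_j) * (y_i - y_j) > 0:
            total += 1.0
        elif p_i == p_j:
            total += 0.5
    return total / len(pairs)


ys = [0.0, 0.1, 0.2, 0.3]
pred_a = [y + 1.0 for y in ys]      # large offset, same ordering as the outputs
pred_b = [0.1, 0.0, 0.3, 0.2]       # close to the outputs, but neighbours are swapped

for name, pred in (("offset", pred_a), ("swapped", pred_b)):
    print(name, "MSE:", mse(pred, ys), "Kendall:", kendall(pred, ys))
```

The comparison only involves four points and is meant as an intuition pump, not as evidence about the behaviour of any of the algorithms compared in this section.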
This paper considers the problem of learning how to order objects by increasing ’size’, modeled as a continuous r.v. $Y$, based on indirect measurements $X$. We provided a rigorous mathematical formulation of this problem, which finds many applications (e.g. quality control, chemistry) and is referred to as continuous ranking. In particular, necessary and sufficient conditions on $(X,Y)$’s distribution for the existence of optimal solutions are exhibited and appropriate criteria have been proposed for evaluating the performance of scoring rules in these situations. In contrast to distribution-free regression, where the goal is to recover the local values taken by the regression function, continuous ranking aims at reproducing the preorder it defines on the feature space as accurately as possible. The numerical results obtained via the algorithmic approaches we proposed for optimizing the aforementioned criteria highlight the difference in nature between these two statistical learning tasks.
This work was supported by the industrial chair Machine Learning for Big Data from Télécom ParisTech and by a public grant (Investissement d’avenir project, reference ANR-11-LABX-0056-LMH, LabEx LMH).
-  S. Agarwal, T. Graepel, R. Herbrich, S. Har-Peled, and D. Roth. Generalization bounds for the area under the ROC curve. J. Mach. Learn. Res., 6:393–425, 2005.
-  L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth and Brooks, 1984.
-  S. Clémençon, M. Depecker, and N. Vayatis. Ranking Forests. J. Mach. Learn. Res., 14:39–73, 2013.
-  S. Clémençon, G. Lugosi, and N. Vayatis. Ranking and scoring using empirical risk minimization. In Proceedings of COLT 2005, volume 3559, pages 1–15. Springer, 2005.
-  S. Clémençon, G. Lugosi, and N. Vayatis. Ranking and empirical risk minimization of u-statistics. The Annals of Statistics, 36:844–874, 2008.
-  S. Clémençon and S. Robbiano. The TreeRank Tournament algorithm for multipartite ranking. Journal of Nonparametric Statistics, 25(1):107–126, 2014.
-  S. Clémençon and N. Vayatis. Tree-based ranking methods. IEEE Transactions on Information Theory, 55(9):4316–4336, 2009.
-  S. Clémençon and N. Vayatis. The RankOver algorithm: overlaid classification rules for optimal ranking. Constructive Approximation, 32:619–648, 2010.
-  C. Cortes and M. Mohri. AUC optimization vs. error rate minimization. In Advances in Neural Information Processing Systems, pages 313–320, 2004.
-  L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.
-  Y. Freund, R. D. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933–969, 2003.
-  Aditya Krishna Menon and Robert C Williamson. Bipartite ranking: a risk-theoretic perspective. Journal of Machine Learning Research, 17(195):1–102, 2016.
-  J.R. Quinlan. Induction of Decision Trees. Machine Learning, 1(1):1–81, 1986.
-  S. Rajaram and S. Agarwal. Generalization bounds for k-partite ranking. In NIPS 2005 Workshop on Learn to rank, 2005.
-  A. Rakotomamonjy. Optimizing Area Under ROC Curve with SVMs. In Proceedings of the First Workshop on ROC Analysis in AI, 2004.
-  S. Clémençon, S. Robbiano, and N. Vayatis. Ranking data with ordinal labels: optimality and pairwise aggregation. Machine Learning, 91(1):67–104, 2013.
Appendix - Technical Proofs
Proof of Proposition 1
Observe first that and are obvious.
: Let us assume that assertion is true. Let and such that . Then, from assumption , . For , if , it leads to the following contradiction: . Hence .
: Let us assume that assertion is true. Let and such that . Observe that is continuous. It follows from assumption that for , with strict inequality on a nonempty interval by continuity of . Integrating the latter inequality against the uniform distribution over leads to .
Proof of Theorem 1
The implications and are obvious.
: Let us assume that assertion is true. Assume ad absurdum that is false. Then there exists s.t. . Notice that and, for any scoring function , are continuous. By integration w.r.t. we obtain , which contradicts assertion . Hence is true.
Proof of Lemma 1
Recall that, for any and all , we have:
Integrating the terms in the equation above w.r.t. leads to the desired formula. Then, a natural empirical version of is:
The asymptotic and nonasymptotic study of the deviation of will be the subject of future work.
Proof of Proposition 2
We assume that is a continuous r.v. for simplicity, the slight modifications needed to extend the argument to the general framework being left to the reader. As a first go, observe that
Notice next that, for any , is nothing else than the criterion of related to the distribution of given (negative distribution) and (positive distribution). Since we assumed , the collection is of increasing likelihood ratio and according to Theorem 1, any
is a Neyman Pearson test statistic and thus defines uniformly most powerful tests (among unbiased tests) ofagainst . Hence, for any , . Integrating over w.r.t. yields the desired result.
On Empirical Kendall Maximization
Here we state a result describing the performance of scoring rules obtained through maximization of the empirical Kendall over a class of controlled complexity. An empirical maximizer over is any scoring function s.t.
Suppose that and set for . Assume that is a VC major class of functions with VC dimension . Let . With probability at least , we have:
The argument is based on the simple bound
combined with the use of concentration results for the -process . The proof is finished by mimicking that of Corollary 3 in . ∎
From a computational perspective, maximizing the empirical Kendall $\tau$ is a challenge, the optimization problem being NP-hard due to the absence of convexity/smoothness of the pairwise loss function. Whereas replacing this loss by a surrogate loss, better suited to continuous optimization, is a possible strategy, using greedy algorithms in the spirit of the popular CART method can also be considered for this purpose. A slight modification of CART based on recursive maximization of the empirical Kendall $\tau$ criterion (rather than the Gini index or the least squares criterion) makes it possible to build an oriented ranking tree in a top-down manner, see subsection 5.1. Just like for classification/regression, the procedure can be followed by a pruning stage (model selection), based here on estimates (e.g. cross-validation based) of the Kendall $\tau$.
Appendix - Additional Remarks
Properties of curves
For any scoring function and , we define the conditional cdfs of as follows:
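Now we give some properties of the IROC curve, which are easily derived from ROC curve properties by integration over the bipartite ranking subproblems.
For any scoring function $s \in \mathcal{S}$, the following properties hold:
Limit values. We have $\mathrm{IROC}_s(0) = 0$ and $\mathrm{IROC}_s(1) = 1$.
Invariance. For any strictly increasing function $T$, we have for all $\alpha \in (0,1)$, $\mathrm{IROC}_{T \circ s}(\alpha) = \mathrm{IROC}_s(\alpha)$.
Concavity. If for all $y \in (0,1)$ the likelihood ratio between the conditional distributions of $s(X)$ given $Y > y$ and given $Y < y$ is a monotone function, then the IROC curve is concave.
Use Proposition 24 in  for each bipartite ranking subproblem at level $y \in (0,1)$. Then integrate over $y$ w.r.t. $F_Y$. ∎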
Distribution-free regression vs continuous ranking
Numerical Experiments (Figures)
We considered a polynomial regression function over and valued in , namely: