A new approach in model selection for ordinal target variables

by   Elena Ballante, et al.
University of Pavia

This paper introduces a novel approach to assessing the performance of predictive models characterized by an ordinal target variable, in order to address the lack of suitable tools in this framework. Our methodological proposal is a new index for model assessment which satisfies a set of mathematical properties and can be easily computed. In order to show how our performance indicator works, empirical evidence obtained on toy examples and simulated data is provided. On the basis of the results at hand, we underline that our approach discriminates better for model selection than the performance indexes proposed in the literature.




1 Introduction

Evaluation measures are widely used in predictive modelling to compare different algorithms and thus to select the best model for the data at hand.
Performance indicators can be used to assess the performance of a model in terms of accuracy, discriminatory power and stability of the results. The choice of the indicators used for model selection is a fundamental point, and many approaches have been proposed over the years (see e.g. Adams ; Bradley ; Hand2009 ).
Restricting to binary target variables, distinct criteria for comparing the performance of classification models are available (see Hand1997 ; Hand2000 ; review ; AccuracyFROC ).
Multi-class classification models are generally evaluated by averaging binary classification indicators (see AUCmulticlass ; review ; perfMeas ), and the literature does not clearly distinguish between multi-class nominal and ordinal targets (e.g. simpleapproach ; Gaudette ; Pang ).
While different approaches to the model definition stage for ordinal target variables exist in the literature (see agresti ; ordinal1 ; ordinal2 ; ordinal3 ), adequate tools for model selection are lacking (performance ).
In our opinion, performance indicators should take into account the nature of the target variable, especially when the dependent variable is ordinal. This leads us to propose a new class of measures to select the best model in predictive contexts characterized by a multi-class ordinal target variable, using the misclassification errors coupled with a measure of uncertainty on the prediction.
The paper is structured as follows: Section 2 reviews the metrics most used in the literature; Section 3 presents our methodological proposal and proves some mathematical properties; Section 4 explains how our proposal works in two toy examples; Section 5 reports the empirical evidence obtained on simulated data. Conclusions and further ideas for research are summarized in Section 6.

2 Review of the literature for ordinal dependent variable

The most popular performance measures for ordinal predictive classification models are based on the AUC (Area Under the ROC Curve), the accuracy (expressed in terms of correct classification) and the MSE (Mean Square Error) (see Gaudette and contr, among others). The accuracy (percentage of correct predictions over total instances) is the most widely used evaluation metric for binary and multi-class classification problems (AccuracyFROC ), under the assumption that the costs of the different misclassifications are equal.
The AUC for multi-class classification is defined in AUCmulticlass as a generalization of the binary AUC (based on its probabilistic definition); it suffers from several weaknesses even in binary classification problems (ROCint ) and it is cost-independent, an assumption that can be viewed as a weakness when the target is ordinal.
The mean square error (MSE) measures the difference between predicted and observed values in regression problems using a Euclidean distance. The MSE can be used in ordinal predictive models by converting the classes of the ordinal target variable into integers and computing the difference between them, but it does not take into account the ordering in a predictive model characterized by ordinal classes in the response variable.

Furthermore, it is well known that in imbalanced data characterized by under-fitting or over-fitting the mean square error could provide trivial results (see review ).

3 A new index for model performances evaluation and comparison for ordinal target

Let $y = (y_1, \dots, y_n)$ be a test set for the ordinal target variable $Y$, where $y_i \in \{1, \dots, K\}$ (with $K$ the number of ordered classes of the target variable), and let $X$ be the $n \times p$ data matrix, where $n$ is the number of observations and $p$ the number of covariates.
The output of a predictive model is a matrix $P = (p_{ik})$, where $i = 1, \dots, n$ and $k = 1, \dots, K$, which contains the probability that observation $i$ belongs to the class $k$, estimated by the model under evaluation.

Standard multi-class classification rules assign the observation $i$ to the class $\hat{y}_i = \arg\max_k p_{ik}$.
In order to introduce our proposal, the definitions of classification function and error interval are required.

Definition 3.1 (Classification function).

Let the $n$ observations be grouped by the estimated classes $\hat{y}_i$. For each class, sort the observations in non-increasing order with respect to the estimated probability of that class, $p_{i\hat{y}_i}$. The resulting vector of indexes of the observations is a permutation of the original vector, according to the ordering defined above. For a given model, the classification function $f$ is a piecewise constant function on $[0, 1)$ such that $f(x) = y_{(j)}$ for $x \in [(j-1)/n,\, j/n)$, $j = 1, \dots, n$, where $y_{(j)}$ is the real class of the observation in position $j$ of the ordering.

As a special case, the perfect classification function $f^*$ is a piecewise constant function such that each estimated class corresponds to the real class identified by $y$.
Note that the function $f$ is unique except for permutations of the observations in the same estimated class.

The error interval in each class can be derived as the interval between the first misclassified observation and the end of the observations in that estimated class.

Definition 3.2 (Error Interval).

Suppose that the range corresponding to the estimated class $k$ is $[a_k, b_k)$, and let $x_k$ be the position of the first misclassified observation in that range. The error interval is then defined as $[x_k, b_k)$ and its length is $e_k = b_k - x_k$.
If no misclassification occurs in $[a_k, b_k)$, the error interval is defined as the empty set and its length is $e_k = 0$.

Consider, for example, $n = 10$ observations and a three-level target variable ($K = 3$). Suppose that a predictive model returns the predictions in Table 1. For each observation, the real class is also reported.

Observation   Probabilities                 Estimated Class   Real Class
              Class 1   Class 2   Class 3
1             0.288     0.174     0.538     3                 1
2             0.325     0.478     0.197     2                 2
3             0.828     0.013     0.159     1                 1
4             0.310     0.106     0.584     3                 3
5             0.120     0.262     0.618     3                 3
6             0.426     0.167     0.407     1                 3
7             0.849     0.126     0.025     1                 2
8             0.520     0.401     0.079     1                 1
9             0.147     0.670     0.183     2                 2
10            0.142     0.593     0.265     2                 3

Table 1: Example

The classification function is derived by grouping the observations by estimated class: {3,6,7,8} in Class 1, {2,9,10} in Class 2 and {1,4,5} in Class 3. In each group the observations are sorted in non-increasing order with respect to the probability of the estimated class. For group 1 the probabilities are 0.828, 0.426, 0.849 and 0.520 respectively, so the ordered group is {7,3,8,6}. Following the same rule, group 2 becomes {9,10,2} and group 3 becomes {5,4,1}.
The final sequence of observations can be written as in Table 2.

Observation i     7    3    8    6    9    10   2    5    4    1
Position          1    2    3    4    5    6    7    8    9    10
x                 0    0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9
y (real class)    2    1    1    3    2    3    2    3    3    1
Estimated class   1    1    1    1    2    2    2    3    3    3
Table 2: Index construction
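
The construction of Table 2 can be sketched in a few lines of Python. This is a minimal illustration, assuming the arg-max assignment rule and the within-class sorting described above; the function name `classification_order` is hypothetical and the data are copied from Table 1.

```python
# Probabilities and real classes copied from Table 1 (observations 1..10).
probs = [
    (0.288, 0.174, 0.538),
    (0.325, 0.478, 0.197),
    (0.828, 0.013, 0.159),
    (0.310, 0.106, 0.584),
    (0.120, 0.262, 0.618),
    (0.426, 0.167, 0.407),
    (0.849, 0.126, 0.025),
    (0.520, 0.401, 0.079),
    (0.147, 0.670, 0.183),
    (0.142, 0.593, 0.265),
]
real = [1, 2, 1, 3, 3, 3, 2, 1, 2, 3]


def classification_order(probs):
    """Sort observations by estimated class, then by non-increasing
    probability of the estimated class. Returns (ids, estimated classes)."""
    rows = []
    for obs_id, p in enumerate(probs, start=1):
        est = max(range(len(p)), key=lambda k: p[k]) + 1  # arg-max rule, classes 1..K
        rows.append((est, -p[est - 1], obs_id))
    rows.sort()
    return [r[2] for r in rows], [r[0] for r in rows]


order, est_sorted = classification_order(probs)
real_sorted = [real[i - 1] for i in order]  # real class in the sorted positions

print(order)        # [7, 3, 8, 6, 9, 10, 2, 5, 4, 1]
print(real_sorted)  # [2, 1, 1, 3, 2, 3, 2, 3, 3, 1]
print(est_sorted)   # [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
```

The three printed lists correspond to the observation, real-class and estimated-class rows of Table 2.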

The classification function and the corresponding perfect classification function are depicted in Figure 1 and Figure 2 respectively.

Figure 1: Classification function
Figure 2: Perfect classification function

In order to define the three error intervals, as a preliminary step we identify the intervals of observations related to each estimated class: $[0, 0.4)$ for Class 1, $[0.4, 0.7)$ for Class 2 and $[0.7, 1)$ for Class 3. From Table 2, in the estimated Class 1 the first error corresponds to the first observation, so the error interval is $[0, 0.4)$; in the estimated Class 2 the first error corresponds to the observation in position 6, so the error interval is $[0.5, 0.7)$; and in the estimated Class 3 the first error corresponds to the observation in position 10, so the error interval is $[0.9, 1)$.

Starting from Definition 3.1 and Definition 3.2, Definition 3.3 introduces a new index for model performance evaluation in predictive models characterized by an ordinal target variable.

Definition 3.3 (Index).

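$$I = \sum_{k=1}^{K} \frac{e_k}{d_k} \int_{D_k} \left| f(x) - f^*(x) \right| \, dx$$

where $D_k$ is the range of the estimated class $k$, $d_k$ is the length of the class $k$ in the domain, and $e_k$ is the length of the error interval of class $k$ (Definition 3.2).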

On the basis of the previous example, we can compute the value of the index introduced in Definition 3.3: the three integrals are (0.3, 0.1, 0.2) and the corresponding weights are (1, 0.67, 0.33), thus $I = 0.3 \cdot 1 + 0.1 \cdot 0.67 + 0.2 \cdot 0.33 \approx 0.433$.
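
The arithmetic of this example can be reproduced with a short sketch. It assumes that the index is the sum, over estimated classes, of the ratio between error-interval length and class length multiplied by the integral of the absolute difference between real and estimated class, which is the reading that matches the integrals and weights quoted above. The function name `ordinal_index` is hypothetical, and the two input vectors are the real-class and estimated-class rows of Table 2.

```python
def ordinal_index(real_sorted, est_sorted):
    """Weighted sum of per-class integrals of |real - estimated|.

    Both inputs are listed in the order of Table 2: grouped by estimated
    class and sorted by non-increasing probability within each class.
    """
    n = len(real_sorted)
    total = 0.0
    for k in sorted(set(est_sorted)):
        positions = [j for j, e in enumerate(est_sorted) if e == k]
        d_k = len(positions) / n                      # length of class k
        errors = [j for j in positions if real_sorted[j] != k]
        if not errors:
            continue                                  # empty error interval
        e_k = (positions[-1] + 1 - errors[0]) / n     # first error to end of class
        integral = sum(abs(real_sorted[j] - k) for j in positions) / n
        total += (e_k / d_k) * integral
    return total


real_sorted = [2, 1, 1, 3, 2, 3, 2, 3, 3, 1]
est_sorted = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
index = ordinal_index(real_sorted, est_sorted)
print(round(index, 3))  # 0.433
```

A perfectly classified sequence returns 0, since every error interval is empty.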
The index satisfies the following properties.

Property 1.

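$I \geq 0$, and $I = 0$ if and only if $f = f^*$.

Proof. For every class $k$,

  • $e_k / d_k \geq 0$ and $\int_{D_k} |f(x) - f^*(x)| \, dx \geq 0$,

by definition, so we can conclude that $I \geq 0$.
We also prove that $I = 0$ if and only if $f = f^*$.

If $I = 0$, then for every class $k$ either $e_k = 0$ or $\int_{D_k} |f(x) - f^*(x)| \, dx = 0$ in $D_k$.

  • $e_k = 0$, i.e. there are no classification errors, so $f = f^*$ in class $k$.

  • $\int_{D_k} |f(x) - f^*(x)| \, dx = 0$ implies $f = f^*$ in the class $k$.

So we can conclude that $f = f^*$.
The other implication is trivial. ∎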

Property 2.

$I$ has a sharp upper bound: $I \leq K - 1$.
The upper bound is reached if and only if $K = 2$ (binary classification).

Proof. If $K = 2$ we obtain $|f(x) - f^*(x)| \leq 1$, so that $I \leq 1 = K - 1$. If $K > 2$, for at least one class (by construction) the inequality is strict. ∎

Proposition 3.4.

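$I \leq \max(I)$, where $\max(I)$ is defined as

$$\max(I) = \sum_{k=1}^{K} \frac{n_k}{n} \max(k - 1,\, K - k),$$

with $n_k$ the number of observations whose real class is $k$.

Proof. The maximum value is reached when the worst classification is obtained, i.e. when all observations are assigned to the farthest class. If this happens, the error interval is as long as the class domain, so $e_k / d_k = 1$, and each integral in the sum is a rectangle whose base is the class domain and whose height is the maximum height reachable.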

Definition 3.5 (Normalized index).

$$I_{norm} = \frac{I}{\max(I)}$$

where $\max(I)$ is the maximum defined in Proposition 3.4.
So $I_{norm} \in [0, 1]$.

In the previous example, and the corresponding value of the defined normalized index is .
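
As a hedged illustration of the normalization, the sketch below assumes that the maximum is attained when every observation is assigned to the class farthest from its real one, so that an observation with real class y contributes max(y - 1, K - y) / n. The helper `max_index` is a hypothetical name and this closed form is an assumption of the sketch. The index value is recomputed from the integrals and weights of the example above.

```python
def max_index(real, n_classes):
    """Assumed worst case: each observation sits in the class farthest
    from its real class, and every error interval covers its whole class."""
    n = len(real)
    return sum(max(y - 1, n_classes - y) for y in real) / n


# Real classes of the ten observations in Table 1.
real = [1, 2, 1, 3, 3, 3, 2, 1, 2, 3]

# Index of the example: integrals (0.3, 0.1, 0.2), weights (0.4/0.4, 0.2/0.3, 0.1/0.3).
index = 0.3 * 1 + 0.1 * (0.2 / 0.3) + 0.2 * (0.1 / 0.3)

i_max = max_index(real, n_classes=3)
normalized = index / i_max
print(i_max, round(normalized, 3))
```

Under this assumption the normalized index always lies between 0 and 1, because no observation can be farther from its real class than max(y - 1, K - y).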

Proposition 3.6.

The accuracy is a special case of the index introduced in Definition 3.3.


The accuracy is i.e. the proportion of misclassified observations.
Setting , from the Proposition 3.4, .
, each error weights if and . ∎

Property 3 (Monotonicity).

Consider a classification $C$ with $m$ misclassifications and $n$ observations. Transforming $C$ into a classification $C'$ in which a correctly classified observation becomes a misclassification, the index $I$ becomes higher.

Proof. In the classification $C'$, $m' = m + 1$ observations are misclassified: the observations misclassified in $C$ plus a new misclassification. Suppose that the new misclassification is the observation $i$, which is classified in the class $k$ instead of the real class $y_i$.
All the components in the sum of the index remain unchanged except for the $k$-th one, so it is enough to compare the product $\frac{e_k}{d_k} \int_{D_k} |f(x) - f^*(x)| \, dx$ under $C$ and under $C'$.

Looking at each of the two elements in the product:

  • Two different cases are possible: if the probability associated with the observation is less than or equal to the probability of the first error, the error interval is unchanged; otherwise, the error interval becomes larger. In both cases $e'_k \geq e_k$.

  • In $C'$ there is one more misclassification than in $C$, so the distance between $f$ and $f^*$ increases.

We can conclude that $I(C') > I(C)$. ∎

We remark that in Property 3 the converse does not hold, i.e. if $I(C') > I(C)$ we cannot draw any conclusion about the number of misclassified observations in the two classifications.

4 Toy examples

In order to show how our index behaves with respect to the indexes proposed in the literature, this section reports toy examples, with the main aim of discussing its behaviour in terms of model selection compared with AUC, accuracy and MSE.
$Y$ is a target variable characterized by $K = 3$ levels, and model 1 and model 2 are two competing models under comparison.

4.1 First toy example

In the first toy example we take into account the ordinal structure of the target variable $Y$. Table 3 and Table 4 are the corresponding confusion matrices for model 1 and model 2. It is clear that model 2 makes a better classification than model 1: its only error falls in an adjacent class, whereas model 1 confuses the first class with the third.

     1   2   3
1    5   0   1
2    0   7   0
3    0   0   7
Table 3: Confusion matrix model 1

     1   2   3
1    5   1   0
2    0   6   0
3    0   0   8
Table 4: Confusion matrix model 2

Model   Proposed Index   Normalized Index   AUC     Accuracy   MSE
1       0.083            0.051              0.956   0.950      0.200
2       0.042            0.025              0.956   0.950      0.050
Table 5: Results

For the sake of comparison, the AUC, the accuracy, the MSE and our index are computed for each model and summarized in Table 5.
We remark that in Table 5 the values of the AUC and of the accuracy for model 1 and model 2 are exactly equal; thus, in terms of model choice, these two measures are indifferent between model 1 and model 2. Our index highlights a difference in performance between the two models under comparison and selects model 2 as the best one.
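
The accuracy and MSE columns of Table 5 can be checked directly from the two confusion matrices. The sketch below is a minimal illustration that assumes the classes are coded as the integers 1, 2, 3 and that the MSE is the mean squared difference between the row and column class of each observation; the function name `accuracy_and_mse` is hypothetical.

```python
def accuracy_and_mse(confusion):
    """Accuracy and MSE from a square confusion matrix of counts,
    with classes coded as consecutive integers."""
    n = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    squared = sum(
        count * (r - c) ** 2
        for r, row in enumerate(confusion)
        for c, count in enumerate(row)
    )
    return correct / n, squared / n


model_1 = [[5, 0, 1], [0, 7, 0], [0, 0, 7]]  # Table 3
model_2 = [[5, 1, 0], [0, 6, 0], [0, 0, 8]]  # Table 4

acc_1, mse_1 = accuracy_and_mse(model_1)
acc_2, mse_2 = accuracy_and_mse(model_2)
print(acc_1, mse_1)  # 0.95 0.2
print(acc_2, mse_2)  # 0.95 0.05
```

Both models misclassify one observation out of twenty, so the accuracy cannot separate them; the MSE differs only because the error of model 1 spans two classes.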

4.2 Second toy example

The second toy example considers the probability assigned to each observation. In practical applications where we also need to evaluate how much uncertainty is associated with a prediction, the starting point is the probability that the new observation belongs to the estimated class.
From Table 6, model 1 and model 2 make the same single misclassification, between the first and the third class. The first model assigns a higher probability to the misclassified observation than the second. We can therefore conclude that model 2 is better than model 1 for the data at hand.

     1   2   3
1    5   0   0
2    0   7   0
3    1   0   7

Table 6: Confusion matrix

From Table 7, both models are equivalent in terms of MSE and accuracy; thus, on the basis of these classical measures, model 1 and model 2 are indifferent. Our index reports different values for the models under comparison and selects model 2 as the best one.

Model   Proposed Index   Normalized Index   AUC     Accuracy   MSE
1       0.083            0.051              0.956   0.950      0.200
2       0.017            0.010              0.983   0.950      0.200
Table 7: Results

5 Empirical evaluation on simulated data

In order to show how our proposal works in model selection, this section reports the empirical results obtained on a simulated dataset.
The simulated dataset is composed of three covariates obtained by a Monte Carlo simulation and an ordinal target variable $Y$ with $K = 5$ classes, as reported in Table 8. The sample size is .

y     1          2        3          4          5
x1    N(2,1.5)   N(3,1)   N(4,1.5)   N(5,1)     N(6,1)
x2    N(1,2.5)   N(5,2)   N(7,2.5)   N(8.5,2)   N(9.5,2)
x3    U(0,3) for every class
Table 8: Simulated data structure.
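
A data-generating process with the structure of Table 8 could be sketched as follows. Two points are assumptions of this sketch: the second parameter of each normal distribution is read as a standard deviation, and the classes are balanced with a hypothetical `n_per_class` observations each. The function name `simulate` and the seed are illustrative.

```python
import random

# (mean, second parameter) per class, copied from Table 8.
X1 = {1: (2, 1.5), 2: (3, 1), 3: (4, 1.5), 4: (5, 1), 5: (6, 1)}
X2 = {1: (1, 2.5), 2: (5, 2), 3: (7, 2.5), 4: (8.5, 2), 5: (9.5, 2)}


def simulate(n_per_class, seed=0):
    """Draw (x1, x2, x3, y) rows; the second normal parameter is
    treated as a standard deviation (an assumption of this sketch)."""
    rng = random.Random(seed)
    rows = []
    for y in range(1, 6):
        for _ in range(n_per_class):
            x1 = rng.gauss(*X1[y])
            x2 = rng.gauss(*X2[y])
            x3 = rng.uniform(0, 3)  # same distribution for every class
            rows.append((x1, x2, x3, y))
    rng.shuffle(rows)
    return rows


data = simulate(n_per_class=100)
print(len(data), data[0])
```

In this design x1 and x2 carry the class signal through their shifting means, while x3 is pure noise.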

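Five different models are under comparison: ordinal logistic regression (Ord Log), classification tree (Tree), support vector machine (SVM), random forest (RFor) and k-nearest neighbours (kNN).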

For each model AUC, accuracy, MSE and our index are computed.

Table 9 reports the out-of-sample values of the metrics under comparison, obtained for each model using 10-fold cross-validation.

Model     Proposed Index   Normalized Index   AUC     Accuracy   MSE
Ord Log   0.450            0.141              0.864   0.577      0.571
Tree      0.487            0.146              0.835   0.585      0.654
SVM       0.439            0.135              0.871   0.589      0.564
RFor      0.493            0.151              0.855   0.569      0.672
kNN       0.003            0.001              0.999   0.977      0.024
Table 9: Model comparison

For the sake of clarity, Table 10 shows the resulting ranks of the models, using the results obtained for the four metrics under comparison.

Model     Proposed Index/Normalized   AUC   Accuracy   MSE
Ord Log   3                           3     4          3
Tree      4                           5     3          4
SVM       2                           2     2          2
RFor      5                           4     5          5
kNN       1                           1     1          1
Table 10: Results in terms of ranking.
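
The ranks in Table 10 follow mechanically from Table 9 once the direction of each metric is fixed. The sketch below assumes that lower is better for the proposed index and the MSE, and higher is better for the AUC and the accuracy; the variable and function names are illustrative.

```python
# Metric values copied from Table 9: (proposed index, AUC, accuracy, MSE).
table9 = {
    "Ord Log": (0.450, 0.864, 0.577, 0.571),
    "Tree": (0.487, 0.835, 0.585, 0.654),
    "SVM": (0.439, 0.871, 0.589, 0.564),
    "RFor": (0.493, 0.855, 0.569, 0.672),
    "kNN": (0.003, 0.999, 0.977, 0.024),
}
metrics = ["index", "auc", "accuracy", "mse"]
higher_is_better = {"index": False, "auc": True, "accuracy": True, "mse": False}


def ranks(column):
    """Rank the models (1 = best) on the metric stored in the given column."""
    name = metrics[column]
    ordered = sorted(
        table9, key=lambda m: table9[m][column], reverse=higher_is_better[name]
    )
    return {model: position for position, model in enumerate(ordered, start=1)}


ranking = {name: ranks(c) for c, name in enumerate(metrics)}
for model in table9:
    print(model, [ranking[name][model] for name in metrics])
```

Each printed row lists the ranks in the column order of Table 10.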

We can see that k-nearest neighbours is ranked as the best model according to all the indexes employed for model choice. Furthermore, Table 9 shows that it clearly outperforms the other models on every metric. The support vector machine is the second-best model with respect to all performance indicators. The remaining models are ranked differently depending on the evaluation metric adopted.

6 Conclusions

A new performance indicator is proposed to compare predictive classification models characterized by an ordinal target variable.
Our index is based on the definitions of a classification function and of an error interval. A normalized version of the index is derived. The empirical evidence at hand shows that our index discriminates better among different models than the classical measures available in the literature.
Our index can be used together with other performance metrics for model selection.
From a computational point of view, a further line of research will consider the implementation of our index in a new R package. In terms of applications, we think that our index could be directly incorporated in the assessment process for predictive analytics.


  • (1) Adams, N.M., Hand, D.J. (2000). Improving the Practice of Classifier Performance Assessment. Neural Computation, Vol. 12, pp. 305-311.
  • (2) Agresti, A. (2010). Analysis of ordinal categorical data. Vol. 656, John Wiley & Sons.
  • (3) Ahmad, A., Brown, G. (2015). Random ordinality ensembles: ensembles methods for multi-valued categorical data. Information Sciences, Vol. 296, pp. 75-94.
  • (4) Bradley, A.P. (1997). The use of the area under the ROC curve in evaluation of machine learning algorithms. Pattern Recognition, Vol. 30, pp. 1145-1159.
  • (5) Cardoso, J., Sousa, R. (2011). Measuring the performance of ordinal classification. International Journal of Pattern Recognition and Artificial Intelligence, Vol. 25, No. 8, pp. 1173-1195.

  • (6) Frank, E., Hall, M. (2001). A simple approach to ordinal classification. Technical Report 01/05, Department of Computer Science, University of Waikato.
  • (7) Gaudette, L., Japkowicz, N. (2009). Evaluation Methods for Ordinal Classification. In: Gao Y., Japkowicz N. (eds) Advances in Artificial Intelligence, pp. 207-210.
  • (8) Gigliarano, C., Figini, S., Muliere, P. (2014). Making classifier performance comparisons when ROC curves intersect. Computational Statistics and Data Analysis, Vol. 77, pp. 300-312.
  • (9) Hand, D.J. (1997). Construction and Assessment of Classification Rules. Chichester: Wiley.
  • (10) Hand, D.J. (2000). Measuring diagnostic accuracy of statistical prediction rules. Statistica Neerlandica, Vol. 53, pp. 1-14.
  • (11) Hand, D.J., Till, R.J. (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, Vol. 45, pp. 171-186.
  • (12) Hand, D.J. (2009). Measuring classifier performance: a coherent alternative to the area under the ROC curve. Machine Learning, Vol. 77, pp. 103-123.
  • (13) Hanley, J.A., McNeil, B.J. (1982). The Meaning and Use of Area under a Receiver Operating Characteristic (ROC) Curve. Radiology, Vol. 143, No. 1, pp. 29-36.
  • (14) Hossin, M., Sulaiman, M.N. (2015). A review on evaluation metrics for data classification evaluations. International Journal of Data Mining & Knowledge Management Process, Vol. 5, No. 2, pp. 171-186.
  • (15) Huang, J., Ling, C.X. (2005). Using AUC and Accuracy in Evaluating Learning Algorithms. IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 3, pp. 299-310.
  • (16) Huang, J., Ling, C.X. (2007). Constructing New and Better Evaluation Measures for Machine Learning. Proc. 20th International Conference on Artificial Intelligence (IJCAI2007), pp. 859-864.
  • (17) Kotlowski, W., Dembczynski, K., Greco, S., Slowinski, R. (2008). Stochastic dominance-based rough set model for ordinal classification. Information Sciences, Vol. 178, No. 21, pp. 4019-4037.
  • (18) Ling, C.X., Huang, J., Zhang, H. (2003). AUC: A Statistically Consistent and More Discriminating Measure than Accuracy. Proc. 18th International Conference on Artificial Intelligence (IJCAI2003), pp. 329-341.
  • (19) Liu, Y., Chen, W., Arendt, P., Huang, H.Z. (2011). Toward a Better Understanding of Model Validation Metrics. Journal of Mechanical Design, Vol. 133.
  • (20) Pang, B., Lee, L. (2005). Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (ACL2005).
  • (21) Salzberg, S.L. (1999). On Comparing Classifiers: A Critique of Current Research and Methods. Data Mining and Knowledge Discovery, Vol. 1, pp. 1-12.
  • (22) Sokolova, M., Japkowicz, N., Szpakowicz, S. (2006). Beyond Accuracy, F-score and ROC: a Family of Discriminant Measures for Performance Evaluation. AI 2006: Advances in Artificial Intelligence. Lecture Notes in Computer Science, Vol. 4304.
  • (23) Sokolova, M., Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing and Management, Vol. 45, pp. 427-437.
  • (24) Torra, V., Domingo-Ferrer, J., Mateo-Sanz, J., Ng, M. (2006). Regression for ordinal variables without underlying continuous variables. Information Sciences, Vol. 176, No. 4, pp. 465-474.
  • (25) Waegeman, W., Baets, B.D., Boullard, L. (2008). ROC analysis in ordinal regression learning. Pattern Recognition Letters, Vol. 29, pp. 1-9.