1 Introduction
Evaluation measures are widely used in predictive modelling to compare different algorithms, thus enabling the selection of the best model for the data at hand.
Performance indicators can be used to assess a model in terms of accuracy, discriminatory power and stability of the results. The choice of the indicators used to drive model selection is a fundamental point, and many approaches have been proposed over the years (see e.g. Adams ; Bradley ; Hand2009 ).
Restricting attention to binary target variables, several distinct criteria for comparing the performance of classification models are available (see Hand1997 ; Hand2000 ; review ; AccuracyFROC ).
Multiclass classification models are generally evaluated by averaging binary classification indicators (see AUCmulticlass ; review ; perfMeas ), and the literature draws no clear distinction among these measures with respect to multiclass nominal and ordinal targets (e.g. simpleapproach ; Gaudette ; Pang ).
While several approaches are available in the literature for the model definition stage with an ordinal target variable (see agresti ; ordinal1 ; ordinal2 ; ordinal3 ), for model selection there is a lack of adequate tools (performance ).
In our opinion, performance indicators should take into account the nature of the target variable, especially when the dependent variable is ordinal. This leads us to propose a new class of measures for selecting the best model in predictive contexts characterized by a multiclass ordinal target variable, coupling the misclassification errors with a measure of uncertainty on the prediction.
The paper is structured as follows: Section 2 reviews the metrics most used in the literature; Section 3 presents our methodological proposal and proves some mathematical properties; Section 4 explains how our proposal works in two toy examples; Section 5 reports the empirical evidence obtained on simulated data. Conclusions and further ideas for research are summarized in Section 6.
2 Review of the literature for ordinal dependent variable
The most popular measures of performance in ordinal predictive classification models are based on the AUC (Area Under the ROC curve), the accuracy (expressed in terms of correct classifications) and the MSE (Mean Square Error) (see Gaudette and contr among others). The accuracy (percentage of correct predictions over total instances) is the most used evaluation metric for binary and multiclass classification problems ( AccuracyFROC ), under the assumption that the costs of the different misclassifications are equal. The AUC for multiclass classification is defined in AUCmulticlass as a generalization of the AUC (based on the probabilistic definition of AUC); it suffers from several weaknesses already in the binary classification problem (ROCint ) and it is cost-independent, an assumption that can be viewed as a weakness when the target is ordinal.
The mean square error (MSE) measures the difference between predicted and observed values in regression problems using a Euclidean distance. MSE can be applied to ordinal predictive models by converting the classes of the ordinal target variable into integers and computing the differences between them; however, it does not properly take into account the ordering of the classes in the response variable of a predictive model.
Furthermore, it is well known that in imbalanced data characterized by underfitting or overfitting the mean square error can provide trivial results (see review ).
3 A new index for model performance evaluation and comparison for ordinal targets
Let {(x_i, y_i), i = 1, …, n} be a test set for the ordinal target variable y, where y_i ∈ {1, …, k} (with k the number of ordered classes of the target variable), and let X be the data matrix, where n is the number of observations and p the number of covariates.
The output of a predictive model is an n × k matrix P = (p_ij), with p_ij ∈ [0, 1], which contains the probability that observation i
belongs to the j-th class, estimated by the model under evaluation.
Standard multiclass classification rules assign observation i to the class ŷ_i = argmax_j p_ij.
In order to introduce our proposal, the definitions of classification function and error interval are required.
Definition 3.1 (Classification function).
Let the n observations be grouped by the estimated classes ŷ_i. Within each class, sort the observations in nonincreasing order with respect to
the estimated probability of the class, p_{i,ŷ_i}. The resulting vector of indexes
of the observations is a permutation σ of the original vector, according to the ordering defined above. Partition the domain [0, 1) into n consecutive subintervals of length 1/n, one for each position of the sorted sequence. For a given model, the classification function f is the piecewise constant function such that f(x) = y_{σ(t)} for x in the t-th subinterval, i.e. f takes the real class of the observation placed there. As a special case, the perfect classification function f* is the piecewise constant function in which each estimated class corresponds to the real class, i.e. f*(x) = ŷ_{σ(t)} on the t-th subinterval.
Note that the function f is unique up to permutations of the observations within the same estimated class (e.g. in case of ties in the estimated probabilities).
The error interval in each class can be derived as the interval between the first misclassified observation and the end of the observations in that estimated class.
Definition 3.2 (Error Interval).
Suppose that the range of x corresponding to the estimated class j is [a_j, b_j), and let x_e be the left endpoint of the subinterval of the first misclassified observation in that class. Then the error interval is defined as E_j = [x_e, b_j) and its length is d_j = b_j − x_e.
If no misclassification occurs in class j, the error interval E_j is defined as an empty set and its length is d_j = 0.
Consider, for example, n = 10 observations and a three-level target variable (k = 3). Suppose that a predictive model returns the predictions reported in Table 1. For each observation, the real class is also reported.
Observation  Probabilities  Estimated Class  Real Class  
Class 1  Class 2  Class 3  
1  0.288  0.174  0.538  3  1 
2  0.325  0.478  0.197  2  2 
3  0.828  0.013  0.159  1  1 
4  0.310  0.106  0.584  3  3 
5  0.120  0.262  0.618  3  3 
6  0.426  0.167  0.407  1  3 
7  0.849  0.126  0.025  1  2 
8  0.520  0.401  0.079  1  1 
9  0.147  0.670  0.183  2  2 
10  0.142  0.593  0.265  2  3 

The classification function is derived by grouping the observations by estimated class: {3,6,7,8} in Class 1, {2,9,10} in Class 2 and {1,4,5} in Class 3.
In each group the observations are sorted with respect to the probability of the estimated class. For group 1 the probabilities are 0.828, 0.426, 0.849, 0.520 respectively; hence the ordered group is {7,3,8,6}. Following the same rule, group 2 becomes {9,10,2} and group 3 {5,4,1}.
The final sequence of observations can be written as in Table 2.
t (position)     1    2    3    4    5    6    7    8    9    10
i (observation)  7    3    8    6    9    10   2    5    4    1
x                0    0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9
y (real)         2    1    1    3    2    3    2    3    3    1
ŷ (estimated)    1    1    1    1    2    2    2    3    3    3
The classification function f and the corresponding perfect classification function f* are depicted in Figure 1 and Figure 2 respectively.
In order to define the three error intervals, as a preliminary step we identify the intervals of x related to each estimated class: [0, 0.4) for Class 1, [0.4, 0.7) for Class 2, [0.7, 1) for Class 3. From Table 2, in the estimated Class 1 the first error corresponds to the first position (observation 7), so the error interval is E_1 = [0, 0.4); in the estimated Class 2 the first error corresponds to position 6 (observation 10), so the error interval is E_2 = [0.5, 0.7); in the estimated Class 3 the first error corresponds to position 10 (observation 1), so the error interval is E_3 = [0.9, 1).
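Definitions 3.1 and 3.2 can be checked on this example with a short, self-contained Python sketch; the probabilities and real classes are those of Table 1, while the variable names are ours:

```python
# Probabilities (Table 1) and real classes for the 10 observations.
probs = {
    1: (0.288, 0.174, 0.538), 2: (0.325, 0.478, 0.197), 3: (0.828, 0.013, 0.159),
    4: (0.310, 0.106, 0.584), 5: (0.120, 0.262, 0.618), 6: (0.426, 0.167, 0.407),
    7: (0.849, 0.126, 0.025), 8: (0.520, 0.401, 0.079), 9: (0.147, 0.670, 0.183),
    10: (0.142, 0.593, 0.265),
}
real = {1: 1, 2: 2, 3: 1, 4: 3, 5: 3, 6: 3, 7: 2, 8: 1, 9: 2, 10: 3}
k, n = 3, len(probs)

# Estimated class of each observation: argmax of its probability row (1-based).
est = {i: max(range(k), key=lambda j: p[j]) + 1 for i, p in probs.items()}

# Group by estimated class; sort each group by the estimated-class probability
# in nonincreasing order; concatenate the groups into the final sequence.
groups = []
for c in range(1, k + 1):
    group = sorted((i for i in probs if est[i] == c),
                   key=lambda i: probs[i][c - 1], reverse=True)
    groups.append(group)
sequence = [i for g in groups for i in g]

# Error-interval length per class: from the first misclassified observation to
# the end of the class block; each observation occupies a subinterval of 1/n.
err_lengths = []
for c, group in enumerate(groups, start=1):
    first = next((t for t, i in enumerate(group) if real[i] != c), None)
    err_lengths.append(0.0 if first is None else (len(group) - first) / n)
```

Running this reproduces the sorted sequence {7,3,8,6,9,10,2,5,4,1} of Table 2 and the error-interval lengths 0.4, 0.2 and 0.1 of the three classes.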
Starting from Definition 3.1 and Definition 3.2, Definition 3.3 introduces a new index for model performance evaluation in predictive models characterized by an ordinal target variable.
Definition 3.3 (Index).
I = Σ_{j=1}^{k} (d_j / l_j) ∫_{E_j} |f(x) − f*(x)| dx,
where l_j is the length of the j-th class in the domain, d_j is the length of the error interval E_j, and f and f* are the classification function and the perfect classification function of Definition 3.1.
On the basis of the previous example, we can compute the value of the index introduced in Definition 3.3: the three integrals are (0.3, 0.1, 0.2) and the corresponding weights d_j/l_j are (1, 0.67, 0.33), thus I = 1·0.3 + 0.67·0.1 + 0.33·0.2 ≈ 0.433.
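A minimal sketch of the index computation on this example, taking f and f* directly from Table 2 and weighting each class integral by the ratio d_j/l_j between the error-interval length and the class length, as in the weights (1, 0.67, 0.33) above:

```python
# f (real class) and f* (estimated class) along the sorted sequence of Table 2;
# each of the n positions is a subinterval of width 1/n = 0.1.
f     = [2, 1, 1, 3, 2, 3, 2, 3, 3, 1]
fstar = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
n, k = len(f), 3
w = 1.0 / n

index = 0.0
for c in range(1, k + 1):
    block = [t for t in range(n) if fstar[t] == c]   # positions of class c
    errs = [t for t in block if f[t] != c]           # misclassified positions
    if not errs:
        continue                                     # empty error interval
    d = (block[-1] - errs[0] + 1) * w                # error-interval length d_j
    l = len(block) * w                               # class length l_j
    # integral of |f - f*| over the error interval (sum of rectangles)
    integral = sum(abs(f[t] - fstar[t]) * w
                   for t in range(errs[0], block[-1] + 1))
    index += (d / l) * integral
```

The script returns the same contributions as above (0.3, 0.2/0.3 times 0.1, 0.1/0.3 times 0.2), giving I ≈ 0.433.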
The index satisfies the following properties.
Property 1.
I ≥ 0.
Moreover, I = 0 if and only if the classification is perfect, i.e. f = f*.
Proof.
For each class j, d_j ≥ 0 and l_j > 0, and |f(x) − f*(x)| ≥ 0 for every x;
hence each term
(d_j/l_j) ∫_{E_j} |f(x) − f*(x)| dx
is nonnegative
by definition, so we can conclude that I ≥ 0.
We prove also that I = 0 if and only if f = f*.
If I = 0, then for each class j either d_j = 0 or ∫_{E_j} |f(x) − f*(x)| dx = 0 in E_j.
If d_j = 0,
there are no classification errors, so f = f* in class j.
If the integral is null, since f and f* are piecewise constant on subintervals of positive length, f = f*
in the class j.
So we can conclude that f = f* on the whole domain, i.e. the classification is perfect.
The other implication is trivial.
∎
Property 2.
I has a sharp upper bound: I ≤ k − 1.
The upper bound is reached if and only if k = 2 (binary classification).
Proof.
For every x, |f(x) − f*(x)| ≤ k − 1 and, for every class, d_j ≤ l_j; hence each term of the sum is at most l_j (k − 1), so that I ≤ (k − 1) Σ_j l_j = k − 1. If k = 2 we obtain max(j − 1, k − j) = 1 = k − 1 for both classes, so that the bound can be attained. If k > 2, for at least one (intermediate) class (by construction) the inequality max(j − 1, k − j) < k − 1 is strict, and the bound cannot be reached. ∎
Proposition 3.4.
I ≤ I_max,
where I_max is defined as I_max = Σ_{j=1}^{k} l_j · max(j − 1, k − j).
Proof.
The maximum value is reached when the worst classification is obtained, i.e. when all the observations assigned to each class come from the farthest class. If this happens, the error interval is as long as the class domain, so d_j = l_j, and each integral in the sum is a rectangle with basis the class domain l_j and height the maximum height reachable, max(j − 1, k − j).
∎
Definition 3.5 (Normalized index).
I_N = I / I_max,
where I_max is the maximum defined in Proposition 3.4.
So I_N ∈ [0, 1].
In the previous example, I_max = 0.4·2 + 0.3·1 + 0.3·2 = 1.7 and the corresponding value of the defined normalized index is I_N = 0.433/1.7 ≈ 0.255.
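The normalization can be checked numerically, assuming the worst-case maximum of Proposition 3.4 takes the form of a rectangle of base l_j and height max(j − 1, k − j) for each class:

```python
# Class lengths in the domain for the running example (4, 3 and 3 observations
# out of n = 10) and number of classes.
l = [0.4, 0.3, 0.3]
k = 3

# Worst case: every class integral is a rectangle with base l_j and height
# equal to the largest class distance reachable from class j, max(j-1, k-j).
I_max = sum(l[j - 1] * max(j - 1, k - j) for j in range(1, k + 1))

I = 0.433            # index value computed above for the example
I_norm = I / I_max   # normalized index, in [0, 1]
```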
Proposition 3.6.
The accuracy is a special case of the index introduced in Definition 3.3.
Proof.
The complement of the accuracy is the misclassification rate, i.e. the proportion of misclassified observations.
Setting the weight d_j/l_j equal to 1 for every class and replacing the height |f(x) − f*(x)| with the indicator of an error, 1(f(x) ≠ f*(x)), each error weighs 1/n; the index then reduces to the misclassification rate, i.e. to 1 − accuracy.
∎
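A quick numerical check on the running example of one natural reading of Proposition 3.6: with unit class weights and a 0/1 error height, the index collapses to the proportion of misclassified observations, the complement of the accuracy.

```python
# Real (f) and estimated (f*) classes along the sorted sequence of Table 2.
f     = [2, 1, 1, 3, 2, 3, 2, 3, 3, 1]
fstar = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
n = len(f)

# 0/1 version of the index: unit class weights, and every error contributes a
# rectangle of base 1/n and height 1.
index_01 = sum(1 for a, b in zip(f, fstar) if a != b) / n
accuracy = 1 - index_01
```

On this example there are four misclassified observations, so the 0/1 index is 0.4 and the accuracy is 0.6.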
Property 3 (Monotonicity).
Consider a classification C with m misclassifications over n observations. Operating a transformation of C into a classification C′
in which an observation correctly classified in C is changed into a misclassification, the index
becomes higher.
Proof.
In the classification C′, the misclassified observations are m′ = m + 1: the observations misclassified in C plus a new misclassification. Suppose that the new misclassification is an observation classified in the class j instead of its real class r ≠ j.
All the components in the sum of the index remain unchanged except for the j-th, thus obtaining
I(C′) − I(C) = (d′_j / l_j) ∫_{E′_j} |f′(x) − f*(x)| dx − (d_j / l_j) ∫_{E_j} |f(x) − f*(x)| dx.
Looking at each of the two elements in the product:
Two different cases are possible for the error interval: if the probability associated to the observation is less than or equal to the probability of the first error in class j, the error interval does not change, E′_j = E_j; on the other hand, the error interval becomes larger; thus in both cases d′_j ≥ d_j.
In C′ there is one misclassification more than in C, so the distance between f′ and f* over the error interval increases.
We can conclude that I(C′) > I(C). ∎
We remark that in Property 3 the vice versa does not hold, i.e. if I(C′) > I(C) we cannot draw conclusions on the number of misclassified observations in the two classifications.
4 Toy examples
In order to show how our index works with respect to the indexes proposed in the literature, toy examples are reported in this section, with the main aim of discussing the behaviour, in terms of model selection, of our index with respect to AUC, accuracy and MSE.
In both examples, y is a target variable characterized by k = 3 levels over n = 20 observations, and model 1 and model 2 are two competing models under comparison.
4.1 First toy example
In the first toy example we take into account the ordinal structure of the target variable y. Table 3 and Table 4 report the corresponding confusion matrices for model 1 and model 2. It is clear that model 2 makes a better classification than model 1: its single error confuses adjacent classes (a real class 2 predicted as class 1), while model 1 confuses the two extreme classes (a real class 3 predicted as class 1).
Actual  

1  2  3  
Predict 
1  5  0  1 
2  0  7  0  
3  0  0  7 
Actual  

1  2  3  
Predict 
1  5  1  0 
2  0  6  0  
3  0  0  8 
Model  Proposed Index  Normalized Index  AUC  Accuracy  MSE 

1  0.083  0.051  0.956  0.950  0.200 
2  0.042  0.025  0.956  0.950  0.050 
For the sake of comparison, for each model the AUC, the accuracy, the MSE and our index are computed, as summarized in Table 5.
We remark that, looking at Table 5, the values obtained for the AUC and the accuracy for model 1 and model 2 are exactly equal; thus, in terms of model choice, model 1 and model 2 are indistinguishable according to these indexes. Our index highlights a difference in performance between the two models under comparison and selects model 2 as the best one.
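The accuracy and MSE columns of Table 5 can be reproduced from the two confusion matrices, converting the class labels into the integers 1, 2, 3 for the MSE:

```python
# Confusion matrices of the first toy example: rows = predicted, columns = actual.
model1 = [[5, 0, 1],
          [0, 7, 0],
          [0, 0, 7]]
model2 = [[5, 1, 0],
          [0, 6, 0],
          [0, 0, 8]]

def accuracy_mse(cm):
    """Accuracy and MSE from a confusion matrix, scoring the ordinal classes
    as the integers 1..k (so a class-1/class-3 confusion costs (1-3)^2 = 4)."""
    n = sum(sum(row) for row in cm)
    correct = sum(cm[j][j] for j in range(len(cm)))
    sq_err = sum(cm[p][a] * (p - a) ** 2
                 for p in range(len(cm)) for a in range(len(cm)))
    return correct / n, sq_err / n
```

Both models reach accuracy 19/20 = 0.95, while the MSE (0.200 for model 1, 0.050 for model 2) does separate them, as reported in Table 5.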
4.2 Second toy example
The second toy example considers the probability assigned to each observation.
In practical applications where we also need to evaluate how much uncertainty is associated with a prediction, the starting point is the probability that the new observation belongs to the estimated class.
From Table 6, model 1 and model 2 both assign an observation of the first class to the third one. The first classification assigns a higher probability to the misclassified observation than the second. We can then conclude that model 2 is better than model 1 for the data at hand.
Actual  

1  2  3  
Predict 
1  5  0  0 
2  0  7  0  
3  1  0  7  

From Table 7, both models are equivalent in terms of MSE and accuracy; thus, on the basis of the classical measures, model 1 and model 2 are indistinguishable. Our index reports different values for the models under comparison and selects model 2 as the best one.
Model  Proposed Index  Normalized Index  AUC  Accuracy  MSE 

1  0.083  0.051  0.956  0.950  0.200 
2  0.017  0.010  0.983  0.950  0.200 
5 Empirical evaluation on simulated data
In order to show how our proposal works in model selection, this section reports the empirical results achieved on a simulated dataset.
The simulated dataset is composed of three covariates obtained by a Monte Carlo simulation and an ordinal target variable y with k = 5 levels, as reported in Table 8. The sample size is .
y  1  2  3  4  5 

x1  N(2,1.5)  N(3,1)  N(4,1.5)  N(5,1)  N(6,1) 
x2  N(1,2.5)  N(5,2)  N(7,2.5)  N(8.5,2)  N(9.5,2) 
x3  U(0,3) 
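The simulation scheme of Table 8 can be sketched as follows; the per-class sample size `n_per_class` and the seed are illustrative assumptions, since the overall sample size is not fixed here:

```python
import random

random.seed(42)

# Per-class (mean, sd) pairs for x1 and x2, as in Table 8; x3 ~ U(0, 3) for
# all classes, so it carries no information about y.
params = {
    1: ((2, 1.5), (1, 2.5)),
    2: ((3, 1.0), (5, 2.0)),
    3: ((4, 1.5), (7, 2.5)),
    4: ((5, 1.0), (8.5, 2.0)),
    5: ((6, 1.0), (9.5, 2.0)),
}
n_per_class = 100  # hypothetical choice, not the paper's sample size

rows, labels = [], []
for y, ((m1, s1), (m2, s2)) in params.items():
    for _ in range(n_per_class):
        rows.append((random.gauss(m1, s1),    # x1 | y
                     random.gauss(m2, s2),    # x2 | y
                     random.uniform(0, 3)))   # x3, uninformative
        labels.append(y)
```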
Five different models are under comparison:
Ordinal logistic regression (Ord Log),
Classification tree (Tree),
Support vector machine (SVM),
Random forest (RFor),
k Nearest Neighbour (kNN).
For each model the AUC, the accuracy, the MSE and our index are computed.
Table 9 reports the out-of-sample values of the metrics under comparison obtained for each model using a 10-fold cross validation.
Model  Proposed Index  Normalized index  AUC  Accuracy  MSE 
Ord Log  0.450  0.141  0.864  0.577  0.571 
Tree  0.487  0.146  0.835  0.585  0.654 
SVM  0.439  0.135  0.871  0.589  0.564 
RFor  0.493  0.151  0.855  0.569  0.672 
kNN  0.003  0.001  0.999  0.977  0.024 
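The 10-fold protocol above can be sketched as a plain fold construction in Python (model fitting and the computation of the metrics are omitted; the helper name and seed are illustrative):

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle the n observation indices and split them into k folds of
    (nearly) equal size; each fold serves once as the held-out test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[t::k] for t in range(k)]

folds = k_fold_indices(100, k=10)
# For each fold: fit the model on the other 9 folds, compute AUC, accuracy,
# MSE and the proposed index on the held-out fold, then average over folds.
```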
For the sake of clarity, Table 10 shows the resulting ranks of the models, using the results obtained for the metrics under comparison.
Model  Proposed Index/Normalized  AUC  Accuracy  MSE 

Ord Log  3  3  4  3 
Tree  4  5  3  4 
SVM  2  2  2  2 
RFor  5  4  5  5 
kNN  1  1  1  1 
We can see that the k-nearest neighbour is ranked as the best model according to all the indexes employed for model choice. Furthermore, from Table 9 the k-nearest neighbour clearly outperforms the other models. The support vector machine is considered the second-best model with respect to all performance indicators. The remaining models are ranked differently depending on the evaluation metric adopted.
6 Conclusions
A new performance indicator has been proposed to compare predictive classification models characterized by an ordinal target variable.
Our index is based on the definition of a classification function and of an error interval. A normalized version of the index is also derived. The empirical evidence at hand underlines that our index discriminates among different models better than the classical measures available in the literature.
Our index can be used coupled with other performance metrics for model selection.
From a computational point of view, a further line of research will consider the implementation of our index in a new R package. In terms of application, we think that our index could be directly incorporated in the assessment process for predictive analytics.
References
(1) Adams, N.M., Hand, D.J. (2000). Improving the Practice of Classifier Performance Assessment. Neural Computation, Vol. 12, pp. 305-311.
(2) Agresti, A. (2010). Analysis of Ordinal Categorical Data. Vol. 656, John Wiley & Sons.
(3) Ahmad, A., Brown, G. (2015). Random ordinality ensembles: ensemble methods for multi-valued categorical data. Information Sciences, Vol. 296, pp. 75-94.
(4) Bradley, A.P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, Vol. 30, pp. 1145-1159.
(5) Cardoso, J., Sousa, R. (2011). Measuring the performance of ordinal classification. International Journal of Pattern Recognition and Artificial Intelligence, Vol. 25, No. 8, pp. 1173-1195.
(6) Frank, E., Hall, M. (2001). A simple approach to ordinal classification. Technical Report 01/05, Department of Computer Science, University of Waikato.
(7) Gaudette, L., Japkowicz, N. (2009). Evaluation Methods for Ordinal Classification. In: Gao, Y., Japkowicz, N. (eds) Advances in Artificial Intelligence, pp. 207-210.
(8) Gigliarano, C., Figini, S., Muliere, P. (2014). Making classifier performance comparisons when ROC curves intersect. Computational Statistics and Data Analysis, Vol. 77, pp. 300-312.
(9) Hand, D.J. (1997). Construction and Assessment of Classification Rules. Chichester: Wiley.
(10) Hand, D.J. (2000). Measuring diagnostic accuracy of statistical prediction rules. Statistica Neerlandica, Vol. 53, pp. 1-14.
(11) Hand, D.J., Till, R.J. (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, Vol. 45, pp. 171-186.
(12) Hand, D.J. (2009). Measuring classifier performance: a coherent alternative to the area under the ROC curve. Machine Learning, Vol. 77, pp. 103-123.
(13) Hanley, J.A., McNeil, B.J. (1982). The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve. Radiology, Vol. 143, No. 1, pp. 29-36.
(14) Hossin, M., Sulaiman, M.N. (2015). A review on evaluation metrics for data classification evaluations. International Journal of Data Mining & Knowledge Management Process, Vol. 5, No. 2, pp. 171-186.
(15) Huang, J., Ling, C.X. (2005). Using AUC and Accuracy in Evaluating Learning Algorithms. IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 3, pp. 299-310.
(16) Huang, J., Ling, C.X. (2007). Constructing New and Better Evaluation Measures for Machine Learning. Proc. 20th International Joint Conference on Artificial Intelligence (IJCAI-2007), pp. 859-864.
(17) Kotlowski, W., Dembczynski, K., Greco, S., Slowinski, R. (2008). Stochastic dominance-based rough set model for ordinal classification. Information Sciences, Vol. 178, No. 21, pp. 4019-4037.
(18) Ling, C.X., Huang, J., Zhang, H. (2003). AUC: A Statistically Consistent and More Discriminating Measure than Accuracy. Proc. 18th International Joint Conference on Artificial Intelligence (IJCAI-2003), pp. 329-341.
(19) Liu, Y., Chen, W., Arendt, P., Huang, H.Z. (2011). Toward a Better Understanding of Model Validation Metrics. Journal of Mechanical Design, Vol. 133.
(20) Pang, B., Lee, L. (2005). Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-2005).
(21) Salzberg, S.L. (1999). On Comparing Classifiers: A Critique of Current Research and Methods. Data Mining and Knowledge Discovery, Vol. 1, pp. 1-12.
(22) Sokolova, M., Japkowicz, N., Szpakowicz, S. (2006). Beyond Accuracy, F-score and ROC: a Family of Discriminant Measures for Performance Evaluation. AI 2006: Advances in Artificial Intelligence. Lecture Notes in Computer Science, Vol. 4304.
(23) Sokolova, M., Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing and Management, Vol. 45, pp. 427-437.
(24) Torra, V., Domingo-Ferrer, J., Mateo-Sanz, J., Ng, M. (2006). Regression for ordinal variables without underlying continuous variables. Information Sciences, Vol. 176, No. 4, pp. 465-474.
(25) Waegeman, W., Baets, B.D., Boullard, L. (2008). ROC analysis in ordinal regression learning. Pattern Recognition Letters, Vol. 29, pp. 1-9.