# Noncrossing Ordinal Classification

Ordinal data are often seen in real applications. Regular multicategory classification methods are not designed for this data type and a more proper treatment is needed. We consider a framework of ordinal classification which pools the results from binary classifiers together. An inherent difficulty of this framework is that the class prediction can be ambiguous due to boundary crossing. To fix this issue, we propose a noncrossing ordinal classification method which materializes the framework by imposing noncrossing constraints. An asymptotic study of the proposed method is conducted. We show by simulated and data examples that the proposed method can improve the classification performance for ordinal data without the ambiguity caused by boundary crossings.

08/07/2020

## 1 Introduction

Data with ordinal class labels are common in practice and arise in many scientific areas and social practices, such as disease diagnosis and prognosis, national security threat detection, and quality control. For example, the development of a tumor can be classified as Stage I, Stage II, Stage III, etc.; the U.S. homeland security advisory system has five categories, Green, Blue, Yellow, Orange and Red, ordered from the least to the most severe threat; the quality of a randomly sampled product can be categorized as excellent, good, fair or bad. The goal of ordinal classification is to classify a data point to one of these ordinal categories, , based on the covariates . Here we consider the case . The actual labels are of no importance, so long as the order can be recognized.

Note that ordinal data are a special case of the more general multicategory data. Ignoring the order information, one may classify ordinal data in the same way as multicategory data, by applying a multicategory classification method. There is a large body of literature on the latter. This includes methods which combine multiple binary classifiers, such as the One-Versus-One and One-Versus-Rest paradigms (see for example Duda et al., 2001), and those which estimate multiple classification boundaries simultaneously, such as Weston and Watkins (1999), Crammer and Singer (2002), Lee et al. (2004), and Huang et al. (2013). While using a multicategory classification method for ordinal data sometimes works, such treatment can be suboptimal, because the classes are treated equally, without their connections and relative superiority being considered. Moreover, a counterexample in Section 2 reveals that it is desirable to use an approach which fully utilizes the ordinal information available.

Suppose there are $K$ classes in total. A simple but very useful strategy for ordinal classification is to sequentially conduct binary classifications between the combined meta-class $\{1,\ldots,k\}$ and meta-class $\{k+1,\ldots,K\}$, for $k=1,\ldots,K-1$, and then pool the classification results from these steps to reach a final prediction (see Frank and Hall, 2001). In binary classification, usually a discriminant function $f$ is estimated, and data point $x$ is classified to the positive class if $f(x)>0$, or to the negative class otherwise. The classification boundary is defined by $\{x: f(x)=0\}$. As there are $K-1$ binary classifiers in this strategy, there are $K-1$ classification boundaries. This approach assumes that each class is sandwiched by two adjacent classification boundaries.

An inherent difficulty of this approach is that since these boundaries are trained separately, it is possible that they may cross with each other. Consequently, how to make a final conclusion becomes ambiguous for some data points.

In this article, we propose a flexible margin-based classification method for ordinal data. The direction we pursue is to construct the boundaries simultaneously. Our method is equipped with extra noncrossing constraints to fix the crossing issue, hence is named Noncrossing Ordinal Classification (NORDIC). Similar noncrossing constraints were studied and used in the quantile regression context (for example, Bondell et al., 2010, Liu and Wu, 2011). Compared to the vanilla idea of training binary classifiers separately, simultaneous learning can borrow the strength from different classes, which leads to better classification accuracy and improved robustness to mislabeled data. Moreover, compared to many existing methods, our method is more flexible, since it does not assume that the boundaries are parallel.

Among the existing related work in classifying ordinal data, Herbrich et al. (2000) tried to find the classification boundaries by maximizing the margin in the space of pairs of data vectors; Frank and Hall (2001) were among the first to consider the idea of pooling binary classifiers; Shashua and Levin (2003) generalized the support vector formulation for ordinal regression and proposed to optimize multiple thresholds to define parallel separating hyperplanes; Chu and Keerthi (2005) improved the work of Shashua and Levin (2003) and guaranteed that the thresholds were properly ordered; Chu and Ghahramani (2005) used a probabilistic kernel approach based on Gaussian processes; Cardoso and da Costa (2007) replicated the data and cast the ordinal classification problem as a single binary classification problem. Many of these approaches, although ensuring noncrossing, impose a fairly strong assumption that the boundaries are parallel to each other (either in the original sample space or in the kernel feature space), which may lack flexibility and be unrealistic in many cases.

The rest of the article is organized as follows. In Section 2, we compare the multicategory classification with the ordinal one, and review a simple framework for the ordinal classification. We introduce the main idea of the NORDIC method and the computation algorithm in Section 3. A more precise version of NORDIC, which makes use of a less popular optimization algorithm, is introduced in Section 4. The theoretical properties are studied in Section 5. Several simulated examples are used to compare NORDIC with other methods in Section 6. A real data example is studied in Section 7. Concluding remarks are made in Section 8.

## 2 Ordinal Classification

In this section, we first demonstrate, using a real example that, in some cases, it is better not to ignore the ordinal information by treating ordinal data as regular multicategory data. We then introduce a framework of ordinal classification via binary classifiers. Lastly we compare the principles of multicategory and ordinal classifications.

### 2.1 An Example in U.S. Presidential Election

In a multicategory classifier with classes, usually discriminant functions , , are estimated and the class prediction for is . Let denote the conditional probability for the th class, . In this case, any multicategory classifier would aim to mimic the Bayes classification rule, , which has the smallest conditional classification risk, , among all possible rules.

For the ordinal data, one can opt to ignore the ordinal information and classify them using a multicategory classifier. However, a counterexample suggests that this may not always be a wise strategy. Consider the presidential election in the United States. Any voter can be viewed as being from a red state (a state which is most conservative and predominantly votes for the Republican Party), a blue state (a state which is least conservative and predominantly votes for the Democratic Party) or a purple state (also known as a swing state, where both parties receive strong support). In 2012, the states of North Carolina, Florida, Ohio, and Virginia were the swing states. There are many more blue and red states in the U.S. than swing states (and a much larger population in the former two types of states than in the latter). Suppose each voter is associated with a covariate vector and the color of her home state is the class label. The statistical task here is to classify her to one of the three types of states, .

Recall that the Bayes rule in multicategory classification classifies to the class with the greatest . It is more likely for a multicategory classifier to classify a voter to a blue state or a red state, since both tend to have larger . To see this, note that , where is the density of the covariate given that she is from the th class and is the unconditional class probability for the th class. Clearly, both and are much greater than , so their ’s tend to be larger as well. The bottom line is that it seems unfair for the chance that a voter from a purple state is correctly identified to be compromised simply because there is a smaller population in purple states. Ironically, in a U.S. presidential election, the swing states are the most important battleground, because it is the swing states that break the tie in a presidential campaign.
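The effect of unequal class priors can be checked numerically. The sketch below is a toy illustration and is not taken from any election data: it assumes a single covariate, unit-variance Gaussian class densities centered at -1 (blue), 0 (purple) and 1 (red), and hypothetical priors 0.45, 0.10 and 0.45. Under these assumptions the purple class never attains the largest conditional probability on the grid, so a rule that picks the most probable class never predicts it.

```python
import math

# Hypothetical setup (not from the source): one covariate, unit-variance
# Gaussian class densities, and priors that favor the blue and red classes.
CLASSES = ("blue", "purple", "red")
MEANS = (-1.0, 0.0, 1.0)
PRIORS = (0.45, 0.10, 0.45)


def normal_pdf(x, mean):
    """Density of N(mean, 1) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def posteriors(x):
    """Conditional class probabilities, proportional to prior times density."""
    weights = [p * normal_pdf(x, m) for p, m in zip(PRIORS, MEANS)]
    total = sum(weights)
    return [w / total for w in weights]


def bayes_multicategory(x):
    """Class with the largest conditional probability (ties go to the first)."""
    probs = posteriors(x)
    return CLASSES[max(range(len(probs)), key=probs.__getitem__)]


# Scan a grid of covariate values and collect every class that gets predicted.
grid = [i / 10.0 for i in range(-30, 31)]
predicted = {bayes_multicategory(x) for x in grid}
print(sorted(predicted))
```

With these assumed numbers, the purple posterior at its own center is still below the blue and red posteriors, which is the imbalance described above.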

In this example, the imbalanced class prior probabilities appear to be the proximate cause that leads to the aforementioned issue. The underlying root cause, however, is that the ordinal nature of the data has been ignored. A classification method which makes use of the ordinal information is more appropriate in this case. We describe a simple strategy for this example here which leads to the more formal methodology in the next subsection: for a randomly selected voter, we first consider classifying her to a blue state, versus a purple or red state. If she is classified to the latter, then she tends to be relatively more conservative (than blue state voters). We then classify her to a blue or purple state, against a red state. If she is classified to the former, then she is relatively less conservative (than red state voters). The results of the two comparisons can lead to the final conclusion that she is classified to a purple state.

### 2.2 Ordinal Classification via Binary Classifiers

In general, consider an ordinal classification problem with classes. Furthermore, consider binary classifiers, where the th classification boundary separates the combined set from the combined set where and . For the th binary classification, we code the former the negative class and the latter the positive class by constructing a dummy class label if and if . The th binary classifier is associated with a discriminant function so that the classification rule is . Let denote the prediction set of observation with respect to the th subproblem, defined as if , or otherwise. The final prediction for , aggregating all the results from the binary classifiers above, will be the intersection of , i.e., .

In a four-class toy example, Table 1 tabulates the prediction of the three binary classifiers for some observation . The first binary classifier compares Class 1 and the meta-class . The prediction is that the observation is from . Similarly, the second binary classifier compares and and the prediction is . Lastly, the third binary classifier classifies the observation to . Clearly, Class 2 is favored by all three binary classifiers and it is the final prediction for . This framework for reaching an ordinal classification prediction by pooling binary classifiers was first noted by Frank and Hall (2001).
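The pooling step can be sketched in a few lines. The code below is an illustration under an assumed coding, in which a positive sign from the $k$th binary classifier votes for the upper meta-class $\{k+1,\ldots,K\}$ and a nonpositive sign votes for the lower meta-class $\{1,\ldots,k\}$; the function names are hypothetical.

```python
def prediction_set(k, sign, num_classes):
    """Prediction set of the k-th binary classifier (k = 1, ..., K-1).

    Assumed coding: a positive sign votes for the upper meta-class
    {k+1, ..., K}; otherwise the lower meta-class {1, ..., k} is chosen.
    """
    if sign > 0:
        return set(range(k + 1, num_classes + 1))
    return set(range(1, k + 1))


def pool_predictions(signs, num_classes):
    """Intersect the K-1 prediction sets; an empty result signals ambiguity."""
    pooled = set(range(1, num_classes + 1))
    for k, sign in enumerate(signs, start=1):
        pooled &= prediction_set(k, sign, num_classes)
    return pooled


# Four-class toy example: classifier 1 votes {2,3,4}, classifier 2 votes {1,2},
# and classifier 3 votes {1,2,3}, so only Class 2 survives the intersection.
toy = pool_predictions([+1, -1, -1], num_classes=4)
print(toy)  # {2}
```

When the signs are not consistent with any single class, for instance `[+1, -1, +1]`, the intersection is empty and no prediction can be made.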

### 2.3 Principle of Ordinal Classification

We are now ready to compare the principles of multicategory classification and ordinal classification. A cartoon in Figure 1 can tellingly demonstrate the distinction between these principles. In a data set with , there are two example data points (shown in the top and the bottom rows respectively). For each data point, the length of each block denotes the conditional class probability . The sum of all four conditional probabilities is 1. The principle in multicategory classification chooses Class 1 in the top example and Class 4 in the bottom example, as they correspond to the greatest in both cases. In contrast, in ordinal classification, the desired prediction would be Class 2 and Class 3 respectively. For example, for the top example, the data point is more likely from Class than from Class , and more likely from Class than from Class . Hence Class 2 is the most plausible choice for this data point. Similarly, the data point in the bottom is most likely from Class 3. In particular, they both correspond to Class such that and for each . In the cartoon, a vertical line corresponding to 0.5 cuts the blocks for the desired predictions.

A useful notion here is that the principle of multicategory classification is to select the ‘mode’ of the class labels, based on , while that of the ordinal classification is to select the ‘median’.
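The contrast between the two principles can be made concrete with a small sketch. The conditional probabilities below are hypothetical values chosen to mimic the two rows of the cartoon, and the 'median' is taken here to be the smallest class whose cumulative probability reaches 0.5, which is one reasonable convention at ties.

```python
from itertools import accumulate


def mode_class(probs):
    """Multicategory principle: the class with the largest probability."""
    return max(range(len(probs)), key=lambda j: probs[j]) + 1


def median_class(probs):
    """Ordinal principle: the first class whose cumulative probability >= 0.5."""
    for k, cumulative in enumerate(accumulate(probs), start=1):
        if cumulative >= 0.5:
            return k
    return len(probs)  # guards against rounding when probs sum to just below 1


# Hypothetical conditional class probabilities for two data points (K = 4).
top = [0.40, 0.25, 0.20, 0.15]
bottom = [0.15, 0.20, 0.25, 0.40]

print(mode_class(top), median_class(top))        # 1 2
print(mode_class(bottom), median_class(bottom))  # 4 3
```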

## 3 Noncrossing Ordinal Classification

Conducting ordinal classification via binary classifiers is very easy to implement as long as one has access to an efficient binary classifier. There are many options, such as Support Vector Machine (SVM; Cortes and Vapnik, 1995, Vapnik, 1998, Cristianini and Shawe-Taylor, 2000), Distance Weighted Discrimination (DWD; Marron et al., 2007, Qiao et al., 2010), hybrids of the two (Qiao and Zhang, 2015a, b), $\psi$-learner (Shen et al., 2003), Large-Margin Unified Machines (Liu et al., 2011) and so on.

However, because the classification boundaries are trained separately, it is possible that they cross each other. Figure 2 is a cartoon which shows the possible crossing between classification boundaries. Here there are four classes (annotated as 1, 2, 3 and 4) and three estimated classification boundaries (I, II and III). The second and the third estimated boundaries cross each other. Consequently, the red star point cannot be classified properly. In particular, it will be classified by classifier I to , by classifier II to and by classifier III to . The intersection of all three prediction sets is empty. Although one may argue that this point might be Class 2 or Class 4, no definite answer can be given, and there is an ambiguity as to how to classify this red star point.

Hence, it is desired that the estimated classification boundaries do not cross each other. Let be the discriminant function for the th binary classification. Recall that its boundary is defined by . For these boundaries to be noncrossing, mathematically, it is equivalent that for all not on any boundary, where is a subset of , there exists , such that for all and for all . Let . Then the condition above is the same as requiring that is a monotonically decreasing function with respect to for any fixed ,

$$S(x,1)\ge S(x,2)\ge\cdots\ge S(x,K-1). \tag{1}$$
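Condition (1) is easy to check at a single point once the values of the discriminant functions are available. The sketch below assumes that $S(x,k)$ is the sign of the $k$th discriminant function at $x$; the function names are hypothetical. When the signs are monotone, the predicted class is one plus the number of positive signs, which agrees with intersecting the prediction sets.

```python
def sign(value):
    """Return 1, 0 or -1 according to the sign of value."""
    return (value > 0) - (value < 0)


def satisfies_noncrossing(discriminants):
    """Check that the signs of f_1(x), ..., f_{K-1}(x) never increase in k."""
    signs = [sign(v) for v in discriminants]
    return all(left >= right for left, right in zip(signs, signs[1:]))


def predicted_class(discriminants):
    """Class implied by monotone signs, or None if the boundaries cross at x."""
    if not satisfies_noncrossing(discriminants):
        return None
    return 1 + sum(v > 0 for v in discriminants)


print(predicted_class([1.2, 0.3, -0.7]))  # 3
print(predicted_class([0.8, -0.4, 0.5]))  # None
```

The second call mimics a point where the second and third boundaries cross: the sign pattern (+, -, +) is not monotone, so no class is returned.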

### 3.1 Direct NORDIC

The noncrossing condition (1) can be fairly difficult to implement. We consider a sufficient condition first in this subsection. In this article, we use SVM as the basic binary classifier. For a Mercer kernel function , the Representer Theorem (Kimeldorf and Wahba, 1971) allows the th classification function to be represented by .

Note that if we add the constraints that

$$\omega_{k,i}\ge\omega_{k+1,i}\quad\text{and}\quad b_k\ge b_{k+1}\quad\text{for } k=1,\ldots,K-2,$$

then as long as the kernel function is always nonnegative, with $K(x,x')\ge 0$ (which is true for many kernel functions such as the Gaussian radial basis function kernel), we will have $f_k(x)\ge f_{k+1}(x)$, and hence $S(x,k)\ge S(x,k+1)$ for any $x$.
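The sufficiency argument can be checked numerically. The sketch below assumes the kernel expansion $f_k(x)=\sum_i \omega_{k,i}K(x,x_i)+b_k$ with a Gaussian kernel of unit bandwidth; the training points and coefficients are random placeholders, constructed only so that the ordering constraints hold. It then verifies that the discriminant values are ordered at points that were not used in the construction.

```python
import math
import random


def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian radial basis function kernel; always strictly positive."""
    squared = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared / (2.0 * sigma ** 2))


def discriminant(x, train, omega, bias):
    """Assumed form: f(x) = sum_i omega_i * kernel(x, x_i) + bias."""
    return sum(w * gaussian_kernel(x, xi) for w, xi in zip(omega, train)) + bias


rng = random.Random(0)
n, num_classes = 20, 4
train = [(rng.uniform(-2, 2), rng.uniform(-2, 2)) for _ in range(n)]

# Arbitrary coefficients for classifier 1; each later classifier subtracts a
# nonnegative amount, so omega[k][i] >= omega[k+1][i] and bias[k] >= bias[k+1].
omega = [[rng.gauss(0, 1) for _ in range(n)]]
bias = [rng.gauss(0, 1)]
for _ in range(num_classes - 2):
    omega.append([w - rng.random() for w in omega[-1]])
    bias.append(bias[-1] - rng.random())

# Evaluate all K-1 discriminant functions at fresh points.
points = [(rng.uniform(-3, 3), rng.uniform(-3, 3)) for _ in range(200)]
values = [
    [discriminant(x, train, omega[k], bias[k]) for k in range(num_classes - 1)]
    for x in points
]

# A small tolerance absorbs floating-point rounding in the two separate sums.
ordered = all(
    row[k] >= row[k + 1] - 1e-12 for row in values for k in range(num_classes - 2)
)
print(ordered)  # True
```

The check would fail for a kernel that can take negative values, which is why nonnegativity is part of the condition.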

Hence we consider a direct approach to NORDIC, called NORDIC-0, by solving the following joint optimization problem with the extra noncrossing constraints (3)–(4):

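$$\min_{\omega_{k,j},\,b_k}\ \sum_{k=1}^{K-1}\Big[\sum_{i=1}^{n}\big(1-y_i^{(k)}f_k(x_i)\big)_+ +\frac{\lambda}{2}\boldsymbol{\omega}_{k\cdot}^T\mathbf{K}\boldsymbol{\omega}_{k\cdot}\Big], \tag{2}$$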

where , the coefficient vector for the th function is , and is an by matrix whose th entry is , subject to

$$b_k\ge b_{k+1},\quad\text{for } k=1,\ldots,K-2, \tag{3}$$

$$\omega_{k,i}\ge\omega_{k+1,i},\quad\text{for } i=1,\ldots,n,\ k=1,\ldots,K-2. \tag{4}$$

Here is the regularization term for the th discriminant function.

The term inside the square bracket of (2) is the objective function of kernel SVM corresponding to the th classifier. We try to minimize the sum of these objective functions with the extra noncrossing constraints (3)–(4).

### 3.2 Indirect NORDIC

The constraints (3)–(4) for NORDIC-0 are sufficient conditions for noncrossing boundaries. However, such conditions may be too strong. A weaker, but almost sufficient, set of conditions would be inequality (3) along with the inequality that $\mathbf{K}\boldsymbol{\omega}_{k\cdot}\ge\mathbf{K}\boldsymbol{\omega}_{(k+1)\cdot}$ for $k=1,\ldots,K-2$. Note that they ensure that $f_k(x_i)\ge f_{k+1}(x_i)$ for all the data in the training data set. Thus when the training data are rich enough to cover the base of , they are almost sufficient conditions for noncrossing. This is an indirect approach to noncrossing through the training data points, and is called NORDIC-1 in this article. A bonus of this set of constraints compared to (3)–(4) is that one does not need to take the inverse of the kernel matrix later in the implementation, which we will explain in the next subsection.

Let be the dummy class label vector of the observations for the th classifier, and . For neatness, we let denote the diagonal matrix with as its diagonal elements, i.e., . By replacing the Hinge loss in (2) by a slack variable , and incorporating the new constraints, we can write the optimization problem for NORDIC-1 as,

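$$\min_{\boldsymbol{\omega}_{k\cdot},\,b_k,\,\boldsymbol{\xi}_{k\cdot}}\ \sum_{k=1}^{K-1}\Big(\frac{1}{2}\boldsymbol{\omega}_{k\cdot}^T\mathbf{K}\boldsymbol{\omega}_{k\cdot}+C\mathbf{e}^T\boldsymbol{\xi}_{k\cdot}\Big), \tag{5}$$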

subject to

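$$\mathbf{e}-\mathbf{Y}_{k\cdot}(\mathbf{K}\boldsymbol{\omega}_{k\cdot}+b_k\mathbf{e})\le\boldsymbol{\xi}_{k\cdot},\quad\text{for } k=1,\ldots,K-1, \tag{6}$$

$$\boldsymbol{\xi}_{k\cdot}\ge\mathbf{0},\quad\text{for } k=1,\ldots,K-1, \tag{7}$$

$$b_k\ge b_{k+1},\quad\text{for } k=1,\ldots,K-2, \tag{8}$$

$$\mathbf{K}\boldsymbol{\omega}_{k\cdot}\ge\mathbf{K}\boldsymbol{\omega}_{(k+1)\cdot},\quad\text{for } k=1,\ldots,K-2. \tag{9}$$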

### 3.3 Implementations of NORDIC

We start off by deriving the Wolfe dual of the optimization problem for NORDIC-1. The implementation of NORDIC-0 will become clearer later as a variant of that of NORDIC-1. We introduce nonnegative Lagrange multipliers $\boldsymbol{\alpha}_{k\cdot}$, $\boldsymbol{\zeta}_{k\cdot}$, $\gamma_k$ and $\boldsymbol{\varphi}_{k\cdot}$ for the constraints (6), (7), (8) and (9) respectively. The Lagrangian for the primal problem (5)–(9) is,

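$$
\begin{aligned}
L=\sum_{k=1}^{K-1}\Big[&\Big(\frac{1}{2}\boldsymbol{\omega}_{k\cdot}^T\mathbf{K}\boldsymbol{\omega}_{k\cdot}+C\mathbf{e}^T\boldsymbol{\xi}_{k\cdot}\Big)\\
&+\boldsymbol{\alpha}_{k\cdot}^T\{\mathbf{e}-\mathbf{Y}_{k\cdot}(\mathbf{K}\boldsymbol{\omega}_{k\cdot}+b_k\mathbf{e})-\boldsymbol{\xi}_{k\cdot}\}\\
&-\boldsymbol{\zeta}_{k\cdot}^T\boldsymbol{\xi}_{k\cdot}-\gamma_k(b_k-b_{k+1})1\{k\ne K-1\}\\
&-\boldsymbol{\varphi}_{k\cdot}^T(\mathbf{K}\boldsymbol{\omega}_{k\cdot}-\mathbf{K}\boldsymbol{\omega}_{(k+1)\cdot})1\{k\ne K-1\}\Big].
\end{aligned}
$$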

It can be rearranged, so that in the square bracket, the subscripts for the primal variables are with the same index , as follows,

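$$
\begin{aligned}
L=\sum_{k=1}^{K-1}\Big[&\Big(\frac{1}{2}\boldsymbol{\omega}_{k\cdot}^T\mathbf{K}\boldsymbol{\omega}_{k\cdot}+C\mathbf{e}^T\boldsymbol{\xi}_{k\cdot}\Big)\\
&+\boldsymbol{\alpha}_{k\cdot}^T\{\mathbf{e}-\mathbf{Y}_{k\cdot}(\mathbf{K}\boldsymbol{\omega}_{k\cdot}+b_k\mathbf{e})-\boldsymbol{\xi}_{k\cdot}\}\\
&-\boldsymbol{\zeta}_{k\cdot}^T\boldsymbol{\xi}_{k\cdot}-b_k(\gamma_k1\{k\ne K-1\}-\gamma_{k-1}1\{k\ne 1\})\\
&-(\boldsymbol{\varphi}_{k\cdot}1\{k\ne K-1\}-\boldsymbol{\varphi}_{(k-1)\cdot}1\{k\ne 1\})^T\mathbf{K}\boldsymbol{\omega}_{k\cdot}\Big].
\end{aligned}
\tag{10}
$$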
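The rearrangement amounts to an index shift in the two terms that couple classifier $k$ with classifier $k+1$. As a sketch of that step, using the same symbols as the Lagrangian and the convention that a term multiplied by a zero indicator is dropped, the bias terms regroup as

$$
\sum_{k=1}^{K-1}\gamma_k(b_k-b_{k+1})1\{k\ne K-1\}
=\sum_{k=1}^{K-1}\gamma_kb_k1\{k\ne K-1\}-\sum_{k=2}^{K-1}\gamma_{k-1}b_k
=\sum_{k=1}^{K-1}b_k\big(\gamma_k1\{k\ne K-1\}-\gamma_{k-1}1\{k\ne 1\}\big),
$$

and the same shift applied to the kernel terms gives

$$
\sum_{k=1}^{K-1}\boldsymbol{\varphi}_{k\cdot}^T(\mathbf{K}\boldsymbol{\omega}_{k\cdot}-\mathbf{K}\boldsymbol{\omega}_{(k+1)\cdot})1\{k\ne K-1\}
=\sum_{k=1}^{K-1}\big(\boldsymbol{\varphi}_{k\cdot}1\{k\ne K-1\}-\boldsymbol{\varphi}_{(k-1)\cdot}1\{k\ne 1\}\big)^T\mathbf{K}\boldsymbol{\omega}_{k\cdot}.
$$

After the shift, every primal variable inside the square bracket carries the same index $k$, so the derivatives with respect to $\boldsymbol{\omega}_{k\cdot}$ and $b_k$ can be read off one summand at a time.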

The Karush-Kuhn-Tucker (KKT) conditions for the primal problem require the following:

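$$0=\frac{\partial L}{\partial\boldsymbol{\omega}_{k\cdot}}=\mathbf{K}\boldsymbol{\omega}_{k\cdot}-\mathbf{K}\mathbf{Y}_{k\cdot}\boldsymbol{\alpha}_{k\cdot}-\mathbf{K}(\boldsymbol{\varphi}_{k\cdot}1\{k\ne K-1\}-\boldsymbol{\varphi}_{(k-1)\cdot}1\{k\ne 1\}), \tag{11}$$

$$0=\frac{\partial L}{\partial b_k}=-\mathbf{y}_{k\cdot}^T\boldsymbol{\alpha}_{k\cdot}-(\gamma_k1\{k\ne K-1\}-\gamma_{k-1}1\{k\ne 1\}), \tag{12}$$

$$0=\frac{\partial L}{\partial\boldsymbol{\xi}_{k\cdot}}=C\mathbf{e}-\boldsymbol{\alpha}_{k\cdot}-\boldsymbol{\zeta}_{k\cdot}. \tag{13}$$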

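Once the KKT conditions (12) and (13) are inserted into (10), the terms that are associated with $b_k$ and $\boldsymbol{\xi}_{k\cdot}$ are eliminated. Moreover, from (11), we have

$$\boldsymbol{\omega}_{k\cdot}=\mathbf{Y}_{k\cdot}\boldsymbol{\alpha}_{k\cdot}+(\boldsymbol{\varphi}_{k\cdot}1\{k\ne K-1\}-\boldsymbol{\varphi}_{(k-1)\cdot}1\{k\ne 1\})$$

when $\mathbf{K}$ is full rank. Let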

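$$\mathbf{R}=\Big[\operatorname{diag}\{\mathbf{Y}_{k\cdot}\}_{1\le k\le K-1}\ \Big|\ \mathbf{I}^{(n)}_{n(K-1)}-\mathbf{I}^{(-n)}_{n(K-1)}\ \Big|\ \mathbf{0}_{n(K-1)\times(K-2)}\Big]$$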

and , where for a matrix , denotes a matrix whose upper rows are and the bottom rows are all 0, and denotes a matrix whose bottom rows are and the top rows are all 0. Summarizing all these conditions, we can see that the optimality of the primal problem is given by the dual problem,

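$$\max_{\boldsymbol{\theta}_k\equiv(\boldsymbol{\alpha}_{k\cdot};\boldsymbol{\varphi}_{k\cdot};\gamma_k)}\ -\frac{1}{2}\boldsymbol{\theta}^T\{\mathbf{R}^T(\mathbf{I}_{K-1}\otimes\mathbf{K})\mathbf{R}\}\boldsymbol{\theta}+\mathbf{e}^T\boldsymbol{\alpha},\quad\text{subject to}\quad \mathbf{0}\le\boldsymbol{\alpha}_{k\cdot}\le C\mathbf{e},\quad \boldsymbol{\varphi}_{k\cdot}\ge\mathbf{0},\quad \gamma_k\ge 0,$$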

where $\otimes$ is the Kronecker product.

The dual problem above is a quadratic programming (QP) problem in $\boldsymbol{\theta}$ with equality and bound-inequality constraints, which can be solved by many off-the-shelf QP subroutines. More efficient implementations, such as Platt’s SMO (Platt, 1999), are possible, but are not explored here as they are beyond the scope of this paper.

The optimal primal variables are calculated from the optimal dual variables using the relation . By the KKT complementary conditions, the bias term for the th classifier can be found from any in the training data with , due to the relations that . Alternatively, one can fix the ’s in the primal (5) as known and minimize (5)–(9) with respect to and . This would lead to a linear programming problem.

The implementation for NORDIC-0 is similar, except that the Lagrangian is

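$$
\begin{aligned}
L_0=\sum_{k=1}^{K-1}\Big[&\Big(\frac{1}{2}\boldsymbol{\omega}_{k\cdot}^T\mathbf{K}\boldsymbol{\omega}_{k\cdot}+C\mathbf{e}^T\boldsymbol{\xi}_{k\cdot}\Big)\\
&+\boldsymbol{\alpha}_{k\cdot}^T\{\mathbf{e}-\mathbf{Y}_{k\cdot}(\mathbf{K}\boldsymbol{\omega}_{k\cdot}+b_k\mathbf{e})-\boldsymbol{\xi}_{k\cdot}\}\\
&-\boldsymbol{\zeta}_{k\cdot}^T\boldsymbol{\xi}_{k\cdot}-\gamma_k(b_k-b_{k+1})1\{k\ne K-1\}\\
&-\boldsymbol{\varphi}_{k\cdot}^T\underline{(\boldsymbol{\omega}_{k\cdot}-\boldsymbol{\omega}_{(k+1)\cdot})}1\{k\ne K-1\}\Big].
\end{aligned}
$$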

The only difference of the Lagrangian of NORDIC-0 from that of NORDIC-1 is underlined. Consequently, the KKT conditions are almost the same, except that,

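$$0=\frac{\partial L_0}{\partial\boldsymbol{\omega}_{k\cdot}}=\mathbf{K}\boldsymbol{\omega}_{k\cdot}-\mathbf{K}\mathbf{Y}_{k\cdot}\boldsymbol{\alpha}_{k\cdot}-(\boldsymbol{\varphi}_{k\cdot}1\{k\ne K-1\}-\boldsymbol{\varphi}_{(k-1)\cdot}1\{k\ne 1\}).$$

This leads to $\boldsymbol{\omega}_{k\cdot}$ at optimality being

$$\boldsymbol{\omega}_{k\cdot}=\mathbf{Y}_{k\cdot}\boldsymbol{\alpha}_{k\cdot}+\mathbf{K}^{-1}(\boldsymbol{\varphi}_{k\cdot}1\{k\ne K-1\}-\boldsymbol{\varphi}_{(k-1)\cdot}1\{k\ne 1\}),$$

assuming that $\mathbf{K}$ is invertible. The rest of the implementation is identical to that of NORDIC-1, except that we let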

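$$\mathbf{R}=\Big[\operatorname{diag}\{\mathbf{Y}_{k\cdot}\}_{1\le k\le K-1}\ \Big|\ \{\mathbf{I}_{K-1}\otimes\mathbf{K}^{-1}\}^{(n)}\ \cdots$$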

and .

## 4 Exact NORDIC via Integer Programming

Recall that the necessary and sufficient condition for noncrossing (1) is that the sign of , , is a monotonically decreasing function with respect to for any fixed . The constraints for NORDIC-0 and NORDIC-1 that we have discussed in the last section are sufficient to ensure that , which ultimately ensures noncrossing. However, they are not the weakest sufficient conditions we can impose. As a matter of fact, the discriminant functions themselves need not be monotonically decreasing with respect to in order for the boundaries to be noncrossing. In this section, we explore an idea which aims for exact noncrossing by imposing conditions on the sign of the discriminant functions.

For each , there are one out of two alternative situations with regard to the prediction result from a discriminant function : either or . According to the noncrossing condition (1), the former implies that (recall that the sign is monotonically decreasing in ). Thus, the noncrossing condition (1) is logically equivalent to the condition that at least one of the following two constraints is satisfied,

$$\text{(i) } f_k(x)\ge 0,\quad\text{and}\quad\text{(ii) } f_{k+1}(x)\le 0;$$

i.e., (i) and (ii) cannot be both false. Specifically, if (i) is not true, i.e., if , then (ii) is true. This leads to the noncrossing condition.

Such a logical implication can be modeled by the following logical constraints, which involve binary integer variables $z_{1k}$ and $z_{2k}$,

$$-f_k(x)-M_1z_{1k}\le 0,\qquad f_{k+1}(x)-M_2z_{2k}\le 0,\qquad z_{1k}+z_{2k}\le 1,$$

where $M_1$ and $M_2$ are two large constants needed for technical reasons. In particular, $z_{1k}+z_{2k}\le 1$ implies that at least one of $z_{1k}$ and $z_{2k}$ has to be zero, hence (considering the first two constraints) either $f_k(x)\ge 0$ or $f_{k+1}(x)\le 0$, or both, are true; this is the noncrossing condition discussed above. Note that if both $z_{1k}$ and $z_{2k}$ were 1, then the first two constraints would become $f_k(x)\ge -M_1$ and $f_{k+1}(x)\le M_2$, which would essentially impose no constraint on $f_k(x)$ and $f_{k+1}(x)$, so that the undesired case that $f_k(x)<0$ and $f_{k+1}(x)>0$ may occur. See Bradley et al. (1977) for an introduction to integer programming. We can use this technique to model the noncrossing constraints. In particular, we seek to

$$\min_{\omega_{k,j},\,b_k}\ \sum_{k=1}^{K-1}\Big[\sum_{i=1}^{n}\big(1-y_i^{(k)}f_k(x_i)\big)_+ +\lambda\|\boldsymbol{\omega}_{k\cdot}\|_1\Big], \tag{14}$$

subject to

$$-f_k(x_i)-M_1z_{1ik}\le 0, \tag{15}$$

$$f_{k+1}(x_i)-M_2z_{2ik}\le 0, \tag{16}$$

$$z_{1ik}+z_{2ik}\le 1, \tag{17}$$

$$z_{1ik},\ z_{2ik}\in\{0,1\}, \tag{18}$$

for $i=1,\ldots,n$ and $k=1,\ldots,K-2$. This method is referred to as NORDIC-2 in this article. Here the constraints (15)–(18) are almost sufficient and (exactly) necessary conditions for noncrossing. They are again not exactly sufficient because we impose the constraints on all the training data vectors, instead of on all $x$, similar to the case of NORDIC-1. However, again, if the data vectors in the training data are rich enough, noncrossing across the board can be expected. These conditions are weaker than those in NORDIC-0 and NORDIC-1 because they ensure the monotonicity of the sign of $f_k(x)$, rather than of the value of $f_k(x)$ itself.
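The logic behind constraints (15)–(18) can be verified by brute force for a single pair of discriminant values. The sketch below is only a check of the encoding, not a solver: it enumerates the binary variables and confirms that a feasible assignment exists exactly when at least one of the two conditions holds. The constant `big_m` is an assumed value and must exceed the magnitude of the discriminant values for the equivalence to hold.

```python
from itertools import product


def big_m_feasible(f_k, f_k1, big_m=1e3):
    """True if some binary (z1, z2) with z1 + z2 <= 1 satisfies both bounds."""
    for z1, z2 in product((0, 1), repeat=2):
        if z1 + z2 <= 1 and -f_k - big_m * z1 <= 0 and f_k1 - big_m * z2 <= 0:
            return True
    return False


def logical_condition(f_k, f_k1):
    """Noncrossing at one point: f_k >= 0 or f_{k+1} <= 0 (or both)."""
    return f_k >= 0 or f_k1 <= 0


# Compare the two formulations on a small grid of discriminant values.
grid = [-2.0, -0.5, 0.0, 0.5, 2.0]
agree = all(
    big_m_feasible(a, b) == logical_condition(a, b) for a in grid for b in grid
)
print(agree)  # True
```

The only infeasible combinations are those with a negative value for classifier $k$ and a positive value for classifier $k+1$, which is the crossing pattern the constraints are meant to exclude.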

Note that the objective function of NORDIC-2 is a little different from those of NORDIC-0 and NORDIC-1, especially in the use of the $L_1$ norm penalty. We choose not to use the more common $L_2$ penalty, which leads to a quadratic objective function in SVM, because it is rather difficult to solve a mixed integer programming problem with a quadratic objective function. In fact, we are not aware of any efficient off-the-shelf free software which solves such a problem. In order to show the usefulness of the new noncrossing constraints, which is the main point of this article, we choose to use the $L_1$ penalty for computational simplicity.

It is worth noting that so long as there is an efficient mixed integer programming package which is capable of dealing with quadratic objective functions, an extension will be very natural and readily available.

Indeed, integer programming can solve such nonstandard problems, which traditional optimization methods such as QP or linear programming cannot. However, integer programming has been overlooked by statisticians for a long time (probably due to its high computational cost and the small number of statistical problems to which it applies). To the author’s best knowledge, this article is one of only a few works in the statistical literature which employ the integer programming technique. See Liu and Wu (2006) for another instance which uses mixed integer programming to solve a statistical problem.

## 5 Theoretical Properties

In this section, we study two aspects of the theoretical properties of NORDIC. The first subsection is about the Bayes rule and Fisher consistency of the loss function in ordinal classification. The second one pertains to the asymptotic normality of the NORDIC solution.

### 5.1 Bayes rules and Fisher Consistency

For binary classification, a classifier with loss is Fisher consistent if the minimizer of has the same sign as . The latter is the Bayes rule for binary classification. Intuitively, Fisher consistency requires that the classifier yields the Bayes decision rule asymptotically. See Lin (2004) for Fisher consistency of binary large margin classifiers.

In multicategory classification, a classifier with loss function , where is the discriminant functions, is Fisher consistent if the minimizer of , , satisfies that . Here, is the Bayes classification rule for multicategory classification. See, for example, Liu (2007) for some discussions on Fisher consistency for multicategory SVM classifiers.

Below we formally define the Bayes rule and Fisher consistency for ordinal classification. The Bayes rule for ordinal classification is where is such that and . This rule guarantees that each component binary classification in ordinal classification yields the Bayes rule in the binary sense.

###### Definition 1.

(Generalized Fisher consistency for ordinal classification) An ordinal classification method with loss function is Generalized Fisher consistent if for any , the minimizer of

$$E\Big[\sum_{k=1}^{K-1}V_3\big(Y^{(k)},f_k(X)\big)\Big]$$

satisfies that for . Here is the dummy class label for Class in the th binary classification subproblem.

Generalized Fisher consistency means that the discriminant functions jointly trained under the loss function are essentially the same as the Bayes rule , as . Note that has the smallest risk with respect to the aggregated 0-1 loss for the binary subproblems. Hence it is also the one which has the smallest risk under the so-called distance loss, defined as (see Qiao, 2015).

Because of the use of the Hinge loss function for SVM (which is Fisher consistent in the binary sense), our NORDIC method is Generalized Fisher consistent for ordinal classification. The proof is omitted.

### 5.2 Asymptotic Normality of Linear NORDIC

When the kernel function , that is, the linear kernel, we can have the following linear NORDIC classifier, with the objective function,

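$$\sum_{k=1}^{K-1}\Big[\sum_{i=1}^{n}\big(1-y_i^{(k)}(x_i^T\boldsymbol{\omega}_{k\cdot}+b_k)\big)_+ +\frac{\lambda}{2}\boldsymbol{\omega}_{k\cdot}^T\boldsymbol{\omega}_{k\cdot}\Big], \tag{19}$$

and one of the two following sets of constraints that correspond to NORDIC-1 and NORDIC-2 respectively,

$$x_i^T\boldsymbol{\omega}_{k\cdot}+b_k\ge x_i^T\boldsymbol{\omega}_{(k+1)\cdot}+b_{k+1},$$

and

$$-(x_i^T\boldsymbol{\omega}_{k\cdot}+b_k)-M_1z_{1k}\le 0,\qquad (x_i^T\boldsymbol{\omega}_{(k+1)\cdot}+b_{k+1})-M_2z_{2k}\le 0,\qquad z_{1k}+z_{2k}\le 1,\qquad z_{1k},\ z_{2k}\in\{0,1\},$$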

for .

Because the linear kernel can be negative, the NORDIC-0 method cannot be directly extended to the linear kernel case. One can use the technique in Liu and Wu (2011) to create a new kernel that satisfies the nonnegativity assumption essential for NORDIC-0. In this subsection, we prove the asymptotic normality of linear NORDIC.

Koo et al. (2008) provided a Bahadur representation of the linear SVM and proved its asymptotic normality under some conditions. In particular, they have shown that , where are the solution to the SVM classifier and are the minimizer of the expected loss function.

Theorem 1 below shows that the constrained NORDIC solution has the same limiting distribution as the unconstrained binary SVM classifiers. To prove this result, we need all the regularity conditions in Koo et al. (2008).

###### Theorem 1.

For , let and be the constrained and unconstrained solutions, respectively, to the th binary linear SVM problem in (19). Assume that the regularity conditions in Koo et al. (2008) are satisfied for . Then for any ,

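$$\Big|P\big[n^{1/2}\{(\hat{\boldsymbol{\omega}}_{k\cdot},\hat{b}_k)^T-(\boldsymbol{\omega}^0_{k\cdot},b^0_k)^T\}\le u\big]-P\big[n^{1/2}\{(\tilde{\boldsymbol{\omega}}_{k\cdot},\tilde{b}_k)^T-(\boldsymbol{\omega}^0_{k\cdot},b^0_k)^T\}\le u\big]\Big|\to 0,$$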

so that the constrained solution has the same limiting distribution as the classical unconstrained solution.

Based on Theorem 1, inference for the constrained NORDIC can be obtained by applying the known asymptotic results for binary linear SVM, through the unconstrained NORDIC solutions. For example, we can show the asymptotic normality of the coefficients to the SVM components in linear NORDIC in the same way as those in Koo et al. (2008).

## 6 Numerical Results

We compare NORDIC-0, NORDIC-1, NORDIC-2, the vanilla ordinal classification method of Frank and Hall (2001) that uses separately trained binary SVM classifiers (BSVM), the data replication method of Cardoso and da Costa (2007) (DR) and the parallel discriminant hyperplane method of Chu and Keerthi (2005) (CK). We use our own experimental code in the R environment to implement these methods. The Gaussian radial basis function kernel is used for all classifiers. The kernel parameter is tuned among the 10%, 50% and 90% quantiles of the pairwise distances between training vectors. The tuning parameters are tuned over a grid of possible values ranging from .
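The candidate kernel parameters can be computed directly from the training vectors. The sketch below is an illustration in Python rather than the R code used for the experiments: it assumes Euclidean distances and a nearest-rank quantile, and both choices, as well as the function names, are assumptions. Other quantile conventions would give slightly different candidates on small samples.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Sorted Euclidean distances between all pairs of training vectors."""
    return sorted(math.dist(p, q) for p, q in combinations(points, 2))


def nearest_rank_quantile(sorted_values, prob):
    """Nearest-rank empirical quantile of an already sorted list."""
    rank = max(1, math.ceil(prob * len(sorted_values)))
    return sorted_values[rank - 1]


def bandwidth_candidates(points, probs=(0.10, 0.50, 0.90)):
    """Candidate kernel parameters: quantiles of the pairwise distances."""
    distances = pairwise_distances(points)
    return [nearest_rank_quantile(distances, p) for p in probs]


# Three collinear points give pairwise distances 5, 5 and 10.
example = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
print(bandwidth_candidates(example))  # [5.0, 5.0, 10.0]
```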

### 6.1 Nonlinear Three-class Examples

We consider a data setting with three classes and variables: , where

• and ,

• and ,

• and .

Here, and truly determine the class labels (see below) but only their contaminated counterparts and are observed. In particular, let

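$$
\begin{aligned}
f_1&=-2\tilde{X}_1+0.2\tilde{X}_1^2-0.1\tilde{X}_2^2+0.2,\\
f_2&=-0.4\tilde{X}_1^2+0.2\tilde{X}_2^2-0.4,\\
f_3&=2\tilde{X}_1+0.2\tilde{X}_1^2-0.1\tilde{X}_2^2+0.2.
\end{aligned}
$$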

We assign each observation to class with probability proportional to for . We generate 100 data points in the training set, 100 in the tuning set and 10000 in the test set. The standard deviation of the measurement error, , takes the values 0.5, 1 and 1.5. When and (no perturbation), this is the same example as the nonlinear example in Zhang et al. (2008). However, we perturb the data and increase the dimension () to make the problem more challenging.

Note that this example was initially designed by Zhang et al. (2008) as a regular multicategory classification problem rather than an ordinal one. Figure 3 shows a sample realization of the data without perturbation in the first two dimensions. In a general sense, Class 2 can be viewed as lying between Class 1 and Class 3. We therefore treat the class labels as ordinal and compare different ordinal classification methods.

Figure 4 summarizes the results over 100 simulations. NORDIC-0 and NORDIC-1 are the better classifiers in terms of classification performance when the dimension is small. For higher dimensions, NORDIC-2 is better than the other methods. The DR method is the most computationally costly and the CK method is the most efficient. NORDIC probably works well here because of the perturbation added to this data set: with the help of the noncrossing constraints, a NORDIC method can borrow strength from different classes and become more robust to perturbation.

### 6.2 Donut Examples

We now consider a more challenging setting, which is tailored toward ordinal data. We first generate data points uniformly from a 2-dimensional plate with radius 4 and label them as Class 1, except for those within a circle centered at with radius , which are labeled as Class 2, and those within a circle centered at with radius , which are labeled as Class 3. The observations for the additional dimensions are all 0. We then perturb all the data points by adding an independent -dimensional Gaussian random vector from . We let , and and let range from 10 to 75. Figure 5 shows one realization of the data on the first two dimensions without the perturbation, together with the natural boundaries between the classes. This generalizes the classic donut examples in nonlinear classification.

Note that Class 2 is sandwiched by Class 1 and Class 3 from both outside and inside, and the high density region for Class 2 is very thin due to the construction. Hence, it is conceivable that Class 2 observations are very difficult to classify correctly. The noncrossing constraints may be of some help here because the boundary between Classes 1 and 2 may improve the estimation of the boundary between Classes 2 and 3, and vice versa.

The simulation results are reported in Figure 6. The first row shows the test error over 100 simulations. It appears that many times the DR method is the best. However, recall that in this data set the three classes are highly imbalanced in terms of their sample size. On average, there are only 6.25% Class 2 points and 18.75% Class 3 points. A more reasonable measure to look into here is some weighted error rate that incorporates the different costs of misclassification. Here we report the weighted error with the configuration that:

• each misclassified point from Class 1 costs ;

• each misclassified point from Class 2 to either Class 1 or Class 3 costs ;

• each misclassified point from Class 3 to Class 2 costs , and from Class 3 to Class 1 costs .

This assignment of costs reflects the protection for Class 2 and the additional penalty for misclassifying across two boundaries: the cost for misclassifying from Class 3 to Class 1 is the sum of the costs for misclassifying from 3 to 2 and from 2 to 1.
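The cost structure above can be written as a lookup table indexed by (true class, predicted class). The following Python sketch is illustrative only: the numeric costs are hypothetical placeholders rather than the values used in the experiments, and dividing the total cost by the number of test points is an assumed normalization.

```python
def weighted_error(true_labels, predicted_labels, cost):
    """Average misclassification cost; cost[(truth, prediction)] prices each error."""
    total = 0.0
    for truth, prediction in zip(true_labels, predicted_labels):
        if truth != prediction:
            total += cost[(truth, prediction)]
    return total / len(true_labels)


# Hypothetical placeholder costs, NOT the values used in the paper.
c1 = 1.0    # any misclassified Class 1 point
c2 = 4.0    # Class 2 misclassified as Class 1 or Class 3
c32 = 2.0   # Class 3 misclassified as Class 2
cost = {
    (1, 2): c1, (1, 3): c1,
    (2, 1): c2, (2, 3): c2,
    (3, 2): c32,
    (3, 1): c32 + c2,  # two boundaries crossed: sum of the 3->2 and 2->1 costs
}

# Toy labels, only to show the call.
truth = [1, 1, 2, 3, 3]
predicted = [1, 2, 1, 2, 1]
error = weighted_error(truth, predicted, cost)
```

Keeping the costs in a single table makes it easy to rerun the comparison under a different cost configuration.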

The second row of Figure 6 reports the weighted error rate. Except for NORDIC-2, which is the best in this case, all methods perform more or less the same in terms of the weighted error. Interestingly, the NORDIC-0 and NORDIC-1 methods do not perform as well as their sibling NORDIC-2. They perform comparably to the other methods (they may have a very small advantage over the CK and DR methods when the perturbation is small, for example, when and ). Recall that NORDIC-0 and NORDIC-1 impose stronger constraints which aim for the monotonicity of the discriminant function itself, as opposed to its sign. In contrast, the constraint from NORDIC-2 is much lighter, which may have left enough “degrees of freedom” to optimize the generalization performance.

One may argue that the choice of the costs in the weighted error is arbitrary. In this case, it may be helpful to look into the confusion matrix to see the cause of the different performance. Figure 7 depicts the confusion matrices for the case with contamination for different methods and different dimensions. For the th plot in the array, the reported value is the proportion of observations from Class that are classified to Class (). Note that the values in the three plots in the same column sum to 1. A good classifier is expected to have high rates in the diagonal plots and low rates in the off-diagonal plots. There are, on average, 7496.2 observations from Class 1, and almost all the methods classify them correctly. Class 2 (with only 626.5 observations) is clearly a very difficult class. Even our NORDIC-2 has a poor classification accuracy of . That said, NORDIC-2 shows more advantages in higher dimensional cases. For Class 3, NORDIC-2 shows improved accuracy, especially with far fewer misclassifications into Class 1 (shown in the upper-left plot).
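As a rough illustration of how such proportions can be computed, the Python sketch below normalizes the counts within each true class, so that the entries sharing a true class sum to 1. The layout (predicted class as the outer key, true class as the inner key), the function name and the toy labels are assumptions made for the example.

```python
def confusion_proportions(true_labels, predicted_labels, classes=(1, 2, 3)):
    """Return table[predicted][truth]: share of each true class sent to each predicted class."""
    counts = {p: {t: 0 for t in classes} for p in classes}
    for truth, prediction in zip(true_labels, predicted_labels):
        counts[prediction][truth] += 1
    # Number of observations in each true class, used as the normalizer.
    totals = {t: sum(counts[p][t] for p in classes) for t in classes}
    return {
        p: {t: (counts[p][t] / totals[t] if totals[t] else 0.0) for t in classes}
        for p in classes
    }


# Toy labels, only to show the call.
truth = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3]
predicted = [1, 1, 1, 2, 1, 2, 3, 3, 2, 1]
table = confusion_proportions(truth, predicted)
```

Normalizing within each true class keeps the small classes visible: a method that ignores a rare class shows a low diagonal entry for that class even if its overall error is small.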

The computational time results are similar to what we have seen for the last example and are not reported here.

## 7 Real Application

We use the balance scale data set from the UCI Machine Learning Repository (Lichman, 2013) to test the usefulness of the NORDIC method. This data set, studied in Siegler (1976), was generated to model psychological experimental results. Each example is classified as having the balance scale tip to the right, tip to the left, or be balanced. The four attributes are the left weight, the left distance, the right weight, and the right distance. The class is determined by comparing (left-distance * left-weight) with (right-distance * right-weight): the scale tips toward the side with the greater product. If they are equal, it is balanced. There are 625 instances in the data, with 288 tip to the left (), 49 balanced (), and 288 tip to the right ().
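The labeling rule and the class counts can be checked with a few lines of Python. The sketch assumes that each of the four attributes takes the integer values 1 to 5, which comes from the UCI description of the data set rather than from the text above; the labels "L", "B" and "R" and the function name are likewise chosen for the example.

```python
from collections import Counter
from itertools import product


def balance_class(left_weight, left_distance, right_weight, right_distance):
    """Label an example by comparing the left and right weight-distance products."""
    left = left_weight * left_distance
    right = right_weight * right_distance
    if left > right:
        return "L"  # tips to the left
    if left < right:
        return "R"  # tips to the right
    return "B"      # balanced


# Assumption (UCI data description, not the text above): every attribute is in 1..5.
levels = range(1, 6)
counts = Counter(balance_class(*example) for example in product(levels, repeat=4))
```

Under this assumption the enumeration yields 625 attribute combinations, which matches the number of instances, and the counts for the three labels agree with the 288, 49 and 288 reported above.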

There is a clear order between the three classes (, and ), and hence ordinal classification methods are appropriate. We randomly select points from the data set for training and for tuning, and use the remaining points for testing. The proportions of the three classes are preserved when the partitioning is conducted. The random experiment is repeated 100 times. We consider four cases, where and respectively.

A naive coding of 1, 2 and 3 for these three classes followed by a regression method will prove to be suboptimal. To illustrate this, in addition to the ordinal classification methods, we also compare with support vector regression (SVR; Smola and Schölkopf, 2004; implemented by svm() in the R package e1071) with the Gaussian radial basis function kernel. SVR is applied to the data with {1,2,3} coding, and the predicted class is obtained by applying the cut-off values and to the predicted numerical outcome.
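The step from a numerical prediction to a class label can be sketched as below. This Python fragment only illustrates the thresholding idea; the default cut-offs 1.5 and 2.5 are hypothetical midpoints between the codes and are not claimed to be the values used in the experiment, and the handling of a prediction that falls exactly on a cut-off is also an assumption.

```python
def classify_by_cutoffs(prediction, lower=1.5, upper=2.5):
    """Map a numeric regression output to class 1, 2 or 3 using two cut-offs.

    The default cut-offs are hypothetical placeholders for illustration.
    """
    if prediction < lower:
        return 1
    if prediction < upper:
        return 2
    return 3


# Made-up regression outputs, only to show the call.
svr_outputs = [0.7, 1.49, 1.5, 2.2, 2.5, 3.4]
predicted_classes = [classify_by_cutoffs(value) for value in svr_outputs]
```

Because the cut-offs are free parameters, shifting them changes how many points land in the middle class, which is why they would normally need their own round of tuning.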

Figure 8 shows the weighted error rates of different methods over 100 random splits of the data set and 4 different sample sizes. Here a point misclassified from Class 3 to Class 1, or from Class 1 to Class 3, bears a cost of ; other types of misclassification cost only . All three NORDIC methods are among the best, with NORDIC-2 having a significant advantage. The other two NORDIC methods are comparable to the DR method, especially for small sample cases. SVR is the worst classifier in this experiment.

Figure 9 shows the confusion matrices. The poor performance of the SVR method is probably because it classifies many more instances to Class , and this may be due to the arbitrary choice of the cut-off values and . However, one may have no better way to choose the cut-offs except through another layer of tuning parameter selection. On the other hand, NORDIC-2 stands out as the best classifier because it performs best on Class among all methods except SVR. Note that for Classes and , all methods (except for SVR) perform more or less the same.

## 8 Concluding Remarks

In this article, three versions of NORDIC classifiers are proposed to make use of the order information in classifying ordinal data. All three classifiers train binary SVM classifiers simultaneously with extra constraints to ensure noncrossing among classification boundaries. The NORDIC-0 and NORDIC-1 methods focus on a sufficient condition for noncrossing and are solved by QP. The NORDIC-2 method aims for the exact condition for noncrossing but has to be solved by an integer programming algorithm.

Let us turn our attention back to the formulation for NORDIC-0, (2)–(4). Without the additional constraints (3) and (4), the NORDIC-0 method is the combination of independently trained SVM classifiers (with the common tuning parameter). It is known that for a single binary SVM classifier, the discriminant function is given by . The coefficients are calculated by maximizing the following dual problem of SVM,

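$$
L_{\mathrm{SVM}} = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j), \quad \text{subject to } \sum_i \alpha_i y_i = 0 \text{ and } 0 \le \alpha_i \le C. \tag{20}
$$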

See, for example, Burges (1998) for a tutorial. The maximization problem above is the dual problem of SVM, whereas our NORDIC-0 method is based on the primal problem of SVM.
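To make the dual problem (20) concrete, the short Python sketch below evaluates the dual objective and checks its constraints for a given coefficient vector. It does not solve the optimization; the function names, the toy data and the linear kernel are assumptions made for the example.

```python
def dual_objective(alpha, labels, gram):
    """Evaluate sum_i alpha_i - 0.5 * sum_{i,j} alpha_i alpha_j y_i y_j K(x_i, x_j)."""
    n = len(alpha)
    quadratic = sum(
        alpha[i] * alpha[j] * labels[i] * labels[j] * gram[i][j]
        for i in range(n)
        for j in range(n)
    )
    return sum(alpha) - 0.5 * quadratic


def is_feasible(alpha, labels, C, tol=1e-9):
    """Check the constraints sum_i alpha_i y_i = 0 and 0 <= alpha_i <= C."""
    balanced = abs(sum(a * y for a, y in zip(alpha, labels))) <= tol
    boxed = all(-tol <= a <= C + tol for a in alpha)
    return balanced and boxed


# Toy inputs: two points on the real line with a linear kernel K(x, z) = x * z.
xs = [1.0, -1.0]
labels = [1, -1]
gram = [[xi * xj for xj in xs] for xi in xs]
alpha = [0.5, 0.5]
value = dual_objective(alpha, labels, gram)
```

In this toy problem the equality constraint forces the two coefficients to be equal, so the objective reduces to a one-dimensional concave quadratic, 2a - 2a^2, and the choice a = 0.5 used above is its maximizer whenever C is at least 0.5.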

One may wonder if a dual-based NORDIC is possible. Indeed, a variant of NORDIC can be viewed as maximizing the sum of objective functions of the form (20), with extra noncrossing constraints that , that is . Note that the constraints are the same as in NORDIC-0 but the objective function is based on the dual objective function. However, one can show that this formulation ultimately reduces to the method proposed by Chu and Keerthi (2005); namely, all the classifiers share the same vector. Hence the CK method can be viewed as a special case in the NORDIC family. In our NORDIC-0 proposal, by contrast, we focus on the primal formulation. As a consequence, the resulting boundaries are not parallel to each other, leading to more flexibility.

The usefulness and efficiency of the proposed methods are supported by the comparison with the competitors. Promising results are obtained from simulated and data examples. Fisher consistency of the NORDIC method and asymptotic normality of the linear NORDIC method further validate the proposed methods.

There is a natural connection between ordinal classification and ordered logistic regression. Both methods fully utilize the ordinal class information. Their difference can be viewed as analogous to the difference between binary SVM and (binary) logistic regression, or that between multicategory SVM and multinomial logistic regression. It is interesting to explore the benefit of using machine learning techniques including NORDIC, over the modeling approaches such as ordered logistic regression. See Lee and Wang (2015) for such a comparison in the binary case.

We have provided three distinct formulations. They may perform differently on different types of data sets, both in terms of the generalization error and the computational time; the derivation of these optimization problems may give insights into which kernels can more easily admit truly non-crossing boundaries. It is an interesting future research direction to identify specific kernels for which we can provide truly non-crossing boundaries.

## Appendix

### Proof to Theorem 1

Let and denote and