Deep learning, introduced by Lecun et al. 
, combines multiple machine learning techniques and allows computational models that are composed of numerous processing layers to learn representations of data with various levels of abstraction. These methods have dramatically improved the state-of-the-art in many domains, such as image classification[2, 3, 4], speech recognition , control problems , object detection [7, 8], privacy and security protection [9, 10, 11], recovery of human pose , semantic segmentation 
and image retrieval[14, 15]. Convolutional Neural Networks (CNN) are one of the types of deep networks that are designed to process data that come in the form of multiple arrays. CNNs are appropriate for images, video, speech and audio processing, and they have been used extensively in the last years for automatic classification tasks [16, 17, 18]
. For images, each colour channel is represented by a 2D array, and convolutional layers extract the main features from the pixels, and, after that, a fully connected layer classify every sample based on its extracted features. At the output of the CNN, a softmax function provides the probabilities of the set of classes predefined in the model for classification tasks. However, the softmax may not be the best option depending on the problem considered.
Ordinal classification problems are those classification tasks where labels are ordered, and there are different inter-classes importances for each pair of categories. This kind of problem can be treated as nominal classification, but this discards the ordinal information. A better approach is to use specific methods that take the ordinality into account to improve the performance of the classification model. The Proportional Odds Model (POM)
is an ordinal alternative to the binary logistic regression. It belongs to a wider family of models called Cumulative Link Models (CLMs). CLMs are inspired in the concept of a latent variable that is projected into a 1-dimensional space and a set of thresholds that divides the projection into the different ordinal levels. A link function needs to be specified, which can be of different types, although the most common option is the logit, which is used in POM. In this paper, we explore different existing alternatives, as explained in depth in Section III-A.
In this paper, we propose the use of CLMs for deriving deep learning ordinal classifiers111The source code is available at https://github.com/ayrna/deep-ordinal-clm.
. In the case of CNNs, the model projection used by the threshold model can be obtained from the last layer of the network. Given that we work with a 1-dimensional space, the last layer would have only one neuron (projection of the pattern), and its value could be used to classify the sample into the corresponding class according to the thresholds. Some previous works have used thelogit in shallow neural networks , but this strategy has not been considered for deep learning, and alternative link functions have not been evaluated. To further improve the results, we train these models by minimising an ordinal loss function based on the Weighted Kappa index , instead of using the standard cross-entropy.
An experimental study evaluating the three most common link functions is performed. Also, other parameters that can affect the training process and the model performance are studied, such as the learning rate of the optimization algorithm, the batch size, and their interaction. The nominal version of this model is used as a baseline for comparison. We contrast the results obtained with a statistical analysis to provide more robust conclusions. An ANOVA III test , followed by a posthoc Tukey’s test , is performed over runs of the experiments, because of the demands of computational time required to run a higher number of executions. The experiments are run using two different ordinal datasets: Diabetic Retinopathy , which contains high-resolution fundus images related with diabetes disease, and Adience , which includes human faces images associated with an age range.
The paper is organized as follows: in Section II, we analyse previous related works. Section III presents a formal description of the proposal in this paper, which combines a CLM with an ordinal loss function. In Section IV, we describe the experiments and the datasets used, while, in Section V, we present the results obtained and the statistical analysis. Finally, Section VI exposes the conclusions of this work.
Ii Related works
There are many works related to the application and development of CNN models 
, but few works focus on ordinal classification problems. The existing deep ordinal approaches are mainly based on simply using an ordinal evaluation metric, on solving the ordinal problem as multiple binary sub-problems, on using an ordinal loss functions or on constraining the probability distribution of the output layer. These works are described in the following subsections.
Ii-a Simply using an ordinal evaluation metric.
ALALI et al.  proposed a complex CNN architecture for solving Twitter Sentiment Classification as an ordinal problem. They checked that using average pooling preserves significant features that provide more expressiveness to ordinal scale. They didn’t propose any method to include the ordinal information into the classifier, but they tried to find the best CNN model architecture based on an ordinal metric.
Ii-B Solving the ordinal problem as multiple binary sub-problems
Niu et al.  proposed a learning approach to address ordinal regression problems using CNNs. They divided the problem into a series of binary classification sub-problems and proposed a multiple output CNN optimization algorithm to collectively solve these classification sub-problems, taking into account the correlation between them.
Li et al.  applied deep learning techniques for solving the ordinal problem of Alzheimer’s diagnosis and detecting the different levels of the disease as multiple binary sub-problems.
Liu et al.  proposed a new approach which transforms the ordinal regression problem to binary classification sub-problems and use triplets with instances from different categories to train deep neural networks. In this way, high-level features describing the ordinal relationships are extracted automatically. Given that triplets must be generated, this approach is only recommended for small datasets.
Chen et al.  proposed a deep learning method termed Ranking-CNN. This method combines multiple binary CNNs that are trained with ordinal age labels. The binary outputs are aggregated for the final age prediction. They achieved a tighter error bound for ranking-based age estimation.
In general, all these approaches increment the number of parameters to adjust, as several binary classifiers are simultaneously learnt.
Ii-C Using an ordinal loss function
de la Torre et al.  proposed the use of a continuous version of the quadratic weighted kappa (QWK) metric as loss function for the optimization algorithm. They compared this cost function against the traditional log-loss function using three different datasets, including the Diabetic Retinopathy database as the most complex one. They proved that their function could improve the results as it reduces overfitting and training time. Also, they checked the importance of hyper-parameter tuning.
Rios et al.  presented a CNN model designed to handle ordinal regression tasks on psychiatric notes. They combined an ordinal loss function, a CNN model and conventional feature engineering. Also, they applied a technique called Locally Interpretable Model-agnostic Explanation (LIME) to make the non-linear model more interpretable.
Fu et al.  applied deep learning techniques to Monocular Depth Estimation. They introduced a spacing-increasing discretization strategy to treat the problem as an ordinal regression problem. They improved the performance when training the network with an ordinal regression loss. Also, they used a multi-scale network structure that avoids unnecessary spatial pooling.
Pal et al.  defined a loss function for CNN that is based on the Earth Mover’s Distance and takes into account the ordinal class relationships.
Liu et al.  proposed a constrained optimization formulation for the ordinal regression problem which minimizes the negative loglikelihood for a multi-class problem constrained by the order relationship between instances.
Although the use of these losses introduces the ordinality in model learning, the nature of the models remain nominal.
Ii-D Unimodal probability distributions
Beckham and Pal  proposed a straightforward technique to constrain discrete ordinal probability distributions to be unimodal, via the use of the Poisson and binomial probability distributions. The parameters of these distributions were learnt by using a deep neural network. They evaluated this approach on two large ordinal image datasets, including the Adience dataset used in this paper, obtaining promising results. They also included a simple squared-error reformulation  that was sensitive to class ordering.
This approach is the one most related to the CLMs considered in this paper. However, CLMs indirectly model a latent space together with the set of threshold separating the ordered classes, which provides a more flexible and interpretable approach to deep ordinal classification.
Iii Model proposal
Based on the previous analysis of the state-of-the-art, our proposal is to combine a flexible threshold model in the output layer (different forms of a CLM) with an ordinal loss function, in order to better introduce ordinal constraints during learning.
Iii-a Cumulative Link Model (CLM)
An ordinal classification problem consists in predicting the label
of an input vector, where and , i.e. is in a -dimensional input space, and is in a label space of different labels. The objective in an ordinal problem is to find a function to predict the labels or categories of new patterns, given a training set of samples, . Labels have a natural ordering in ordinal problems: . The order between labels gives us the possibility to compare two different elements of by using the relation . This is not possible under the nominal classification setting. In regression (where ), real values in can be ordered by the standard operator, but labels in ordinal regression () do not carry metric information, i.e. the category serves as a qualitative indication of the pattern rather than a quantitative one.
The Proportional Odds Model (POM) arises from a statistical background and is one of the first models designed explicitly for ordinal regression . It dated back to 1980 and is a member of a wider family of models lately recognised as Cumulative Link Models (CLMs) . CLMs predict probabilities of groups of contiguous categories, taking the ordinal scale into account. In this way, cumulative probabilities are estimated, which can be directly related to standard probabilities:
with , and considering that and .
The model is inspired by the notion of a latent variable, where represents a one-dimensional mapping. The decision rule is not fitted directly, but stochastic ordering of space is satisfied by the following general model form :
where is a monotonic function often termed as the inverse link function, and is the threshold defined for class . Consider the latent variable , where is the random component of the error. The most common choice for the probability distribution of is the logistic function (which is the default function for POM). Label is predicted if and only if , where the function and are to be determined from the data. It is assumed that and , so the real line defined by , is divided into consecutive intervals. Each interval corresponds to a category. The constraints ensure that increases with .
In this work, we consider different link functions previously proposed in CLMs for the probability distribution of , including logit, probit and complementary log-log (clog-log). These three types of links are explained below and represented in Figure 1. They all follow the same form .
The logit link function is the function used for the POM and is defined as:
or the equivalent expression:
link function is the inverse of the standard normal cumulative distribution function (cdf). Its expression is:
which can also be expressed as:
The clog-log takes a response that is restricted to the interval and converts it into a value in the interval (like logit and probit transformations). The clog-log expression is:
with that is:
logit and probit links are symmetric:
which means that the response curve for is symmetric around the point , i.e. has the same rate when approaching 0 than when approaching 1. This symmetry property can be demonstrated as follows:
Let . For the logit function, we have:
For the probit:
which leads to:
Unlike logit and probit, the clog-log link is asymmetrical. In this way, when the given data is not symmetric in the interval and increase slowly at small to moderate value but increases sharply near 1, the logit and probit models are inappropriate, while clog-log can lead to better results.
In this paper, the probabilistic structure of CLMs is proposed as a link function for deep convolutional neural networks. This can be achieved by defining a new type of output layer alternative to the standard softmax layer. In this way, the proposed output layer will transform the one-dimensional projection, previously denoted as, into a set of probabilities. is estimated from a nonlinear transformation of the set of features learnt by the previous layers, , where is the pattern being evaluated and is a latent representation of the pattern given by the output of a single neuron. In order to apply unconstrained optimizers while ensuring , we can redefine the thresholds. All of them can be derived from the first one in the following form:
where and are the learnable parameters, and is the number of classes.
Iii-B Continuous Quadratic Weighted Kappa (QWK) loss function
In order to increase the performance of the deep ordinal model, the CLM structure in the output layer is combined with the continuous version of the QWK loss  function. The Kappa index is a well-known metric that measures the agreement between two different raters. The Weighted Kappa (WK)  is based on the Kappa index and adds different weights to the different types of disagreements based on a weight matrix. It is useful to evaluate the performance in ordinal problems, as it gives a higher weight to the errors that are further from the correct class. This metric is defined as follows:
where is the number of samples rated, is the penalization matrix (in this case, quadratic weights are considered, , ),
is the confusion matrix,, is the sum of the row and is the sum of the column.
The WK defined above cannot be used as a loss function for the optimization algorithm as it is not continuous. However, it has been previously redefined  in terms of probabilities of the predictions:
where , and are the input data and the real class of the -th sample, is the number of classes, is the number of samples, is the number of samples of the -th class, is the probability that the -th sample belongs to class (estimated using the CLM structure), and are the elements of the penalization matrix (). This loss function can be minimized using a gradient descent based algorithm.
In order to evaluate the different models, we make use of two ordinal datasets:
Iv-A1 Diabetic Retinopathy (DR)
DR222https://www.kaggle.com/c/diabetic-retinopathy-detection/data is a dataset consisting of extremely high-resolution fundus image data. The training set consists of pairs of images (where a pair includes a left and right eye image corresponding to a patient). In this dataset, we try to predict the correct category from five levels of diabetic retinopathy: no DR ( images), mild DR ( images), moderate DR ( images), severe DR ( images), or proliferative DR ( images). The test set contains pairs of images. These images are taken in variable conditions: by different cameras, conditions of illumination and resolutions. They come from the EyePACS dataset that was used in the DR detection competition hosted on the Kaggle platform. Also, this dataset has been used in different works [22, 38], where the cost function was considered in  to achieve better performance. A validation set is set aside, consisting of of the patients in the training set. The images are resized to 128 by 128 pixels and rescaled to range. Data augmentation techniques, described in Section IV-C, are applied to achieve a higher number of samples. A few test images of this dataset are shown in Figure 2.
Adience333http://www.openu.ac.il/home/hassner/Adience/data.html dataset consists of faces belonging to subjects. We use the form of the dataset where faces have been pre-cropped and aligned. The dataset was preprocessed, using the methods described in a previous work , so that the images are 256 pixels in width and height, and pixels values follow a normal distribution. The original dataset was split into five cross-validation folds. The training set consists of merging the first four folds which comprise a total of images. From this, of the images are held out as part of a validation set. The last fold is used as test set. Some images of this dataset are shown in Figure 3. Adience dataset has been used in other works for human age estimation but most of them solved the problem as a multi-class problem instead of using the ordinal relation between classes. Eidinger et al. 
presented an approach using support vector machines and neural networks. Chen et al. proposed a coarse-to-fine strategy for deep CNNs. Levi and Hassner  presented another convolutional network model for age estimation. As previously discussed, Beckham and Pal  proposed a straightforward method to constrain discrete probability distributions to be unimodal.
CNNs have been used for both datasets. The different architectures of CNNs used in these experiments are presented in Table I. The architecture for DR is the same that was used in  and the network for Adience is a small Residual Network (ResNet)  that was used in 
. The most important parameters for convolutional layers are the number of filters that are used to make the convolution operation, the size of these filters and the stride, which is the number of pixels that the filter is moved in every operation. Pooling layers have similar parameters: pool size (number of pixels that will be involved in the operation) and stride. For convolutional layers, ConvWxH@FsS stands for filters of size WxH and stride S. For pooling layers, PoolWxHsS corresponds to a pool size of WxH and stride S.
The Exponential Linear Unit (ELU) 43]
function, as it mitigates the effects of the vanishing gradient problem[44, 45] via the identity for positive values. Also, ELUs lead to faster training and better generalization performance than ReLU and Leaky ReLU (LReLU)  functions on networks with more than five layers.
After every ELU activation function of the convolutional layers, Batch Normalization is applied. This method reduces the internal covariate shift by normalizing layer outputs. It allows us to use higher learning rates and be less careful about weight initialization. It also eliminates the need for using regularization techniques like Dropout.
At the output of the network, the CLM is used (see Section III-A). Also, a learnable parameter has been used to rescale the projections used by the CLM to make it more stable and guarantee the convergence in most cases. The following expression describes the transformation applied to these projections:
where is optimized as a free parameter.
|2 Conv3x3@64s1||122x122x64||2 ResBlock3x3@64s1||55x55x32|
|2 Conv3x3@128s1||57x57x128||2 ResBlock3x3@128s1||28x28x64|
|2 Conv3x3@128s1||24x24x128||2 ResBlock3x3@256s1||14x14x128|
Iv-C Experimental design
Weights are adjusted using a batch based first-order optimization algorithm called Adam . We study different initial learning rates () in order to find the optimal one for each problem. We apply an exponential decay 
across training epochs to the initial learning rate () following the expression below:
Both datasets are artificially balanced using data augmentation techniques . However, different transformations are applied to each one. DR dataset augmentation is based on image cropping and zooming, horizontal and vertical flipping, brightness adjustment and random rotations. Horizontal flipping is the only transformation applied to the Adience dataset. These transformations are applied every time a new batch is loaded, and the parameters of each one are randomly chosen from a defined range ( for zooming, for brightness and degrees for rotation), providing a new set of transformed images for each batch. This technique reduces the overfitting risk and provides an important performance boost as we always work with different but similar images .
The epoch size is equal to the number of images in the training set. It could be a higher number as we are using data augmentation, but instead of increasing the epoch size, we rather run the training for more epochs. In this case, we set the maximum number of epochs to . However, we always save the best model, that is evaluated when the training finishes.
The models are mainly evaluated using the QWK metric defined in Eq. (1). Also, other evaluation metrics are used to ease the comparison with alternative works:
Minimum Sensitivity (MS)  is the lowest percentage of samples correctly predicted to belong to a class with respect to the number of samples of that class.
where is the confusion matrix and is the number of classes.
Mean Absolute Error (MAE)  is the average absolute deviation of the predicted category from the real one.
where is the number of samples, is the number of classes and O is the confusion matrix.
Accuracy-based metrics. Correct Classification Rate (CCR) or standard accuracy is the most common metric for classification tasks and shows the percentage of correctly classified samples. We also include Top-2 CCR and Top-3 CCR , which are similar to CCR, but they take a prediction as correct when the real class is between the two or three classes, respectively, with the highest probability.
QWK, MAE and 1-off accuracy are ordinal evaluation metrics, while MS, CCR, Top-2 CCR and Top-3 CCR do not take the order of categories into consideration.
In order to ease reproducibility, the source code is available in a public repository444https://github.com/ayrna/deep-ordinal-clm.
In our study, three different factors are considered:
Learning rate (LR, ). LR is one of the most critical hyper-parameters to tune for training deep neural networks. Optimal learning rate can vary depending on the dataset and the CNN architecture. Previous works have presented some techniques that adjust this parameter in order to achieve better performance [52, 53]. In this work, we consider three different values for the initial value of this parameter: , and .
Batch size (BS). Batch size is also an important parameter as it controls the number of weight updates that are made on every epoch. It can affect the training time and the model performance. In this paper, we try three different batch sizes for each dataset. For the DR dataset, we use , and , while, for Adience, , and images are used. We took the batch sizes that were used in  and  as a reference, and we expand the range on both sides.
Link function (LF). Different link functions are used for the CLM at the last layer output: logit, probit and complementary log-log (see Section III-A).
In this Section, we present the results of the experiments. First, in Sections V-A, V-B, and V-C, we perform the study for adjusting the value of the different parameters. Then, Section V-D compares the results against the state of the art.
For each dataset, we show a table with the detailed results of the experiments performed for training the model with each combination of parameters. Every parameter combination was run five times. These tables show the mean value and the standard deviation (SD) of each metric across these five executions for the test set.
V-a Diabetic Retinopathy
Detailed test results for the DR dataset are presented in Table II. The best result for each metric is marked in bold and the second best is in italic font.
The best mean QWK value was obtained with the clog-log link function using a BS of 10 and a LR of . However, the best CCR value was obtained with a BS of 15, the logit link and a LR of . The optimal configuration depends on the metric we are analysing. In this case, as we are working with an ordinal problem, the most reliable metric is the QWK. However, the rest of the metrics are also included to allow further comparisons with future works.
Test results for the experiments made with the Adience dataset are shown in Table III. The best result for each metric is marked in bold and the second best is in italic font.
The best mean QWK value was obtained with the logit link function using a BS of 64 and a LR of . Also, this configuration obtained the best score for Top-2, Top-3 and 1-off accuracy, and the second best for MS, MAE and CCR. In this case, this configuration can be selected as the optimal for this problem.
V-C Statistical analysis
In this subsection, a statistical analysis will be performed in order to obtain conclusions from the results. The significance and relative importance of the parameters concerning the results obtained, as well as the most suitable values, were obtained using an ANalysis Of the VAriance (ANOVA).
The ANOVA test  is one of the most widely used statistical techniques. ANOVA is essentially a method of analysing the variance to which a response is subject into its various components, corresponding to the sources of variation which can be identified. ANOVA, in this case, examines the effects of three quantitative variables (termed factors) on one quantitative response. Considered factors are the LF, the LR for the Adam optimization algorithm, and the BS. We assume that five executions are enough to do the statistical tests because of the computational time limitations.
The ANOVA test results show that there are significant differences in average QWK depending on the LF and also depending on the LR for (). Moreover, an interaction between the LF and the LR can be recognised ().
Given that there exist significant differences between the means, we analyse now these differences. A post-hoc multiple comparison test has been performed on the mean QWK obtained. An HSD Tukey’s test 
has been selected under the null hypothesis that the variance of the error of the dependent variable is the same between the groups. The results of this test over the test set are shown in TableIV. They show that the best LF is the clog-log but the probit link performance is close to it. Also, the best value for the LR parameter is . The BS is not relevant for this dataset with the values considered.
The results of the ANOVA III test for the Adience dataset, first, demonstrate that there exist significant differences in average QWK concerning the three factors (). Secondly, we found interactions between all the pairs of factors and between all the three factors together (p-values , , and , respectively).
As we did for the DR dataset, a post-hoc multiple comparison test has been performed on the average QWK obtained for Adience. Under the null hypothesis that the variance of the error of the dependent variable is the same between the groups, the HSD Tukey’s test has been applied. The results of this test over the test set are shown in Table IV.
|LF||LF||Mean diff.||Sig.||Mean diff.||Sig.|
|LR||LR||Mean diff.||Sig.||Mean diff.||Sig.|
|BS||BS||Mean diff.||Sig.||Mean diff.||Sig.|
The results over the test set show that the best LF is the logit, the best LR is and the best BS is 128. However, the interactions between these factors made the configuration that uses a logit link, and BS of 64, the best configuration. It obtained a mean QWK value of for validation and for test. The same parameters, but using the probit link, achieves the second best result (). The standard deviation is very low for both cases.
To sum up, the results showed that the best parameter configuration depends on the problem that is being solved. The clog-log function offers the best results in DR dataset while the logit link is the best option for the Adience dataset. However, the best LR for both datasets were . It is recommended to use this value for future datasets. The best BS for DR was 10, while the best value for Adience was 128 (intermediate values considered). Finally, there are more interactions between the three factors for the Adience dataset than for DR. These results highlight the importance of adjusting the hyper-parameters for each problem instead of trying to find an optimal configuration for all the datasets.
V-D Comparison with nominal method and previous works
Once the factor parameters have been studied and selected, experiments are run with the standard cross-entropy loss and the softmax function too in order to prove the performance improvement of considering the ordinality of the problem (QWK loss and the CLM). The evaluation metrics remains the same in order to be able to compare. When considering the nominal version, for the DR dataset, the best mean value of QWK was and was obtained when using a BS of 10 and a LR of . In the case of Adience dataset, the highest QWK was and was achieved with a BS of 64 and a LR of . There are some parameter configurations where the training process gets stagnated and a very low QWK is obtained. As we saw in Sections V-A and V-B, this problem is not found when using the ordinal method.
These results are included in Table V, together with the comparison against previous works of the state-of-the-art. All the results are given for the test set, except those from  (DR dataset), because the authors only provided validation results for images (however, validation results are usually better than test results). The results for  were obtained by reproducing their experiments because they did not provide the results for test set. As can be checked, the proposed ordinal model outperforms all the other alternatives in terms of QWK.
The performance gain of CLM over the nominal version reaches for DR and for Adience dataset. The improvement of the ordinal method for DR is higher than that for Adience. It seems that the method proposed in this work offers a more significant improvement as the given problem complexity increases. Moreover, when compared against alternative ordinal methods (many of them with a deep structure), CLMs are very competitive, possibly because of the flexibility provided by the threshold model structure (where the threshold of each class is independently adjusted).
|CLM network (proposal)|
|de la Torre et al. ||-||-|
|Nebot et al. ||-||-|
|CLM network (proposal)|
|Beckham and Pal |
|Eidinger et al. ||-|
|Chen et al. ||-|
|Levi and Hassner ||-|
This paper introduces a new deep ordinal network based on combining CLM models with a continuous QWK loss function. The proposed model is able to improve the performance of the deep network compared to the equivalent nominal version and other models proposed in previous works. Also, it is able to reduce the chance that the model gets stuck when training with some parameter configurations. We conclude that the optimal values for the different parameters considered are problem-dependant. The results highlig