Consistent Rank Logits for Ordinal Regression with Convolutional Neural Networks

by Wenzhi Cao, et al.

While extraordinary progress has been made towards developing neural network architectures for classification tasks, commonly used loss functions such as the multi-category cross entropy loss are inadequate for ranking and ordinal regression problems. To address this issue, approaches have been developed that transform ordinal target variables into a series of binary classification tasks, resulting in robust ranking algorithms with good generalization performance. However, to model ordinal information appropriately, a rank-monotonic prediction function is ideally required such that confidence scores are ordered and consistent. We propose a new framework (Consistent Rank Logits, CORAL) with theoretical guarantees for rank-monotonicity and consistent confidence scores. Through parameter sharing, our framework benefits from low training complexity and can easily be implemented to extend common convolutional neural network classifiers for ordinal regression tasks. Furthermore, our empirical results support the proposed theory and show a substantial improvement compared to the current state-of-the-art ordinal regression method for age prediction from face images.




1 Introduction

Ordinal regression, sometimes also referred to as ordinal classification, describes the task of predicting object labels on an ordinal scale. Here, a ranking rule or classifier $h$ maps each object $x_i \in \mathcal{X}$ into an ordered set, $h: \mathcal{X} \to \mathcal{Y}$, where $\mathcal{Y} = \{r_1 \prec \dots \prec r_K\}$. In contrast to classification, the ranks include ordering information. In comparison with metric regression, which assumes that $\mathcal{Y}$ is a continuous random variable, ordinal regression regards $\mathcal{Y}$ as a finite sequence where the metric distance between ranks is not defined.

Along with age estimation (Niu et al., 2016), popular applications for ordinal regression include predicting the progression of various diseases, such as Alzheimer’s disease (Doyle et al., 2014), Crohn’s disease (Weersma et al., 2009), artery disease (Streifler et al., 1995), and kidney disease (Sigrist et al., 2007). Also, ordinal regression models are common choices for text message advertising (Rettie et al., 2005) and various recommender systems (Parra et al., 2011).

While the field of machine learning has developed many powerful algorithms for predictive modeling, most algorithms were designed for classification tasks. About ten years ago, Li and Lin proposed a general framework for ordinal regression via extended binary classification (Li & Lin, 2007), which has become the standard choice for extending state-of-the-art machine learning algorithms for ordinal regression tasks. However, implementations of extended binary classification for ordinal regression commonly suffer from classifier inconsistencies among the binary rankings (Niu et al., 2016). We address this problem with a new method and theorem for guaranteed classifier consistency that can easily be implemented in various machine learning algorithms. Furthermore, we present an empirical study of our approach on challenging real-world datasets for predicting the age of individuals from face images using our method with convolutional neural networks (CNN).

The main contributions of our paper are as follows:

  1. the Consistent Rank Logits (CORAL) framework for ordinal regression with theoretical guarantees for classifier consistency and well-defined generalization bounds with and without dataset- and task-specific importance weighting;

  2. CNN architectures with CORAL formulation for ordinal regression tasks that come with the added side benefit of reducing the number of parameters to be trained compared to CNNs for classification;

  3. experimental validation showing that the guaranteed classifier consistency leads to a substantial improvement over the state-of-the-art CNN for ordinal regression applied to age estimation from face images.

2 Related Work

2.1 Ordinal Regression and Ranking

Several multivariate extensions of generalized linear models have been developed in the past for ordinal regression, including the popular proportional odds and the proportional hazards models (McCullagh, 1980). Moreover, ordinal regression has become a popular topic of study in the field of machine learning to extend classification algorithms by reformulating the problem to utilize multiple binary classification tasks. Early work in this regard includes the use of perceptrons (Crammer & Singer, 2002; Shen & Joshi, 2005) and Support Vector Machines (Herbrich et al., 1999; Shashua & Levin, 2003; Rajaram et al., 2003; Chu & Keerthi, 2005). A general reduction framework that unified the view of a number of these existing algorithms for ordinal regression was later proposed in (Li & Lin, 2007).

While earlier works on using CNNs for ordinal targets have employed conventional classification approaches (Levi & Hassner, 2015; Rothe et al., 2015), the general reduction framework from ordinal regression to binary classification by (Li & Lin, 2007) was recently adopted by (Niu et al., 2016). In (Niu et al., 2016), an ordinal regression problem with $K$ ranks was transformed into $K-1$ binary classification problems, with the $k$-th task predicting whether the age label of a face image exceeds rank $r_k$, $k = 1, \dots, K-1$. Here, all $K-1$ tasks share the same intermediate layers but are assigned distinct weight parameters in the output layer. One issue with this architecture is that for some input images the outputs of the $K-1$ tasks do not agree with each other. Hence, the model does not guarantee that the predictions are consistent. For example, in an age estimation setting, it would be contradictory if the $k$-th binary task predicted that the age of a person was larger than 30, but a previous task predicted it was not larger than 20. Such contradictions are suboptimal when the $K-1$ task predictions are combined to obtain the estimated age.

While the ordinal regression CNN yielded state-of-the-art results on an ordinal regression problem such as age estimation, the authors acknowledged the classifier inconsistency as not being ideal but also noted that ensuring that the binary classifiers are consistent would increase the training complexity substantially (Niu et al., 2016). Our proposed method addresses both of these issues with a theoretical guarantee for classifier consistency as well as a reduction of the training complexity.

2.2 CNN Architectures for Age Estimation

Due to its broad utility in social networking, video surveillance, and biometric verification, age estimation from human faces is an area of active research. Various techniques have been developed for extracting facial features as inputs to classification or metric regression algorithms (O’Toole et al., 1999; Ramanathan et al., 2009b; Turaga et al., 2010; Kohail, 2012; Wu et al., 2012; Geng et al., 2013).

In recent years, CNN research has rapidly advanced, and CNNs now surpass most traditional methods on image analysis tasks while not requiring feature extraction beyond standard image preprocessing steps (Krizhevsky et al., 2012; Parkhim & Zisserman, 2015; Canziani et al., 2016). Hence, most state-of-the-art age estimation methods are now utilizing CNN architectures (Rothe et al., 2015; Chen et al., 2016; Niu et al., 2016; Ranjan et al., 2017; Chen et al., 2017).

Related to the idea of training binary classifiers separately and combining the independent predictions for ranking (Frank & Hall, 2001), a modification of the ordinal regression CNN (Niu et al., 2016) was recently proposed for age estimation, called Ranking-CNN, that trains an ensemble of CNNs for binary classifications and aggregates the predictions to predict the age label of a given face image (Chen et al., 2017). The researchers showed that training a series of CNNs improves the predictive performance over a single CNN with multiple binary outputs. However, ensembles of CNNs come with a substantial increase in training complexity and do not guarantee classifier consistency, which means that the individual binary classifiers used for ranking can produce contradictory results. Another approach for utilizing binary classifiers for ordinal regression is the siamese CNN architecture by (Polania et al., 2018). Since this siamese CNN has only a single output neuron, comparisons between the input image and multiple, carefully selected anchor images are required to compute the rank.

Age distribution learning (Pan et al., 2018) has made other notable progress in age estimation; here, the researchers defined a new loss function to penalize the difference between estimated age distributions and the ground truth age labels. Recent research has also shown that training a multi-task CNN for various face analysis tasks, including face detection, gender prediction, age estimation, etc., can improve the overall performance across different tasks compared to a single-task CNN (Ranjan et al., 2017) by sharing lower-layer parameters. In (Chen et al., 2016), a cascaded convolutional neural network was designed to classify face images into age groups followed by regression modules for more accurate age estimation. In both studies, the authors used metric regression for the age estimation subtasks. While our paper focuses on the comparison of different ordinal regression approaches, we hypothesize that such all-in-one and cascaded CNNs can be further improved by our method, since, as shown in (Niu et al., 2016), ordinal regression CNNs outperform metric regression CNNs in age estimation tasks.

3 Proposed Method

This section describes the proposed CORAL framework that addresses the problem of classifier inconsistency in ordinal regression CNNs based on multiple binary classification tasks for ranking.

3.1 Preliminaries

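Let $D = \{x_i, y_i\}_{i=1}^{N}$ be the training dataset consisting of $N$ examples. Here, $x_i \in \mathcal{X}$ denotes the $i$-th image and $y_i$ denotes the corresponding rank, where $y_i \in \mathcal{Y} = \{r_1, r_2, \dots, r_K\}$ with ordered rank $r_K \succ r_{K-1} \succ \dots \succ r_1$. The symbol $\succ$ denotes the ordering between the ranks. The ordinal regression task is to find a ranking rule $h: \mathcal{X} \to \mathcal{Y}$ such that some loss function $L(h)$ is minimized.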

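Let $C$ be a $K \times K$ cost matrix (Li & Lin, 2007), where $C_{y, r_k}$ is the cost of predicting an example $(x, y)$ as rank $r_k$. Typically, $C_{y, y} = 0$ and $C_{y, r_k} > 0$ for $y \neq r_k$. In ordinal regression, we generally prefer each row of the cost matrix to be V-shaped. That is, $C_{y, r_{k-1}} \geq C_{y, r_k}$ if $r_k \leq y$ and $C_{y, r_k} \leq C_{y, r_{k+1}}$ if $r_k \geq y$. The classification cost matrix has entries $C_{y, r_k} = 1\{y \neq r_k\}$, which does not consider ordering information. In ordinal regression, where the ranks are treated as numerical values, the absolute cost matrix is commonly defined by $C_{y, r_k} = |y - r_k|$.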

In (Li & Lin, 2007), the researchers proposed a general reduction framework for extending an ordinal regression problem into several binary classification problems. This framework requires the use of a cost matrix that is convex in each row ($C_{y, r_{k+1}} - C_{y, r_k} \geq C_{y, r_k} - C_{y, r_{k-1}}$ for each $y$) to obtain a rank-monotonic threshold model. Since the cost-related weighting of each binary task is specific for each training example, this approach was described as infeasible in practice due to its high training complexity (Niu et al., 2016). Our proposed CORAL framework requires neither a cost matrix with convex-row conditions nor explicit weighting terms that depend on each training example to obtain a rank-monotonic threshold model and to produce consistent predictions for each binary task. Moreover, CORAL allows for an optional task importance weighting, e.g., to adjust for label and class imbalances, which makes it more applicable in practice.
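The cost-matrix properties discussed in this section can be checked mechanically. The following is a minimal sketch, assuming ranks are coded as the integers 0, ..., K-1; the helper names are hypothetical and not part of the paper. It builds the classification and absolute cost matrices and tests whether every row is V-shaped and convex.

```python
def classification_cost(K):
    """Cost 1 for any wrong rank, 0 for the correct one (ignores ordering)."""
    return [[0 if y == k else 1 for k in range(K)] for y in range(K)]


def absolute_cost(K):
    """Cost |y - r_k| when ranks are treated as the integers 0..K-1 (assumption)."""
    return [[abs(y - k) for k in range(K)] for y in range(K)]


def is_v_shaped(C):
    """Each row is non-increasing up to the true rank and non-decreasing after it."""
    for y, row in enumerate(C):
        left_ok = all(row[k - 1] >= row[k] for k in range(1, y + 1))
        right_ok = all(row[k] <= row[k + 1] for k in range(y, len(row) - 1))
        if not (left_ok and right_ok):
            return False
    return True


def is_row_convex(C):
    """Each row has non-decreasing successive differences."""
    for row in C:
        for k in range(1, len(row) - 1):
            if row[k + 1] - row[k] < row[k] - row[k - 1]:
                return False
    return True


K = 5
print(is_v_shaped(classification_cost(K)), is_row_convex(classification_cost(K)))
print(is_v_shaped(absolute_cost(K)), is_row_convex(absolute_cost(K)))
```

In this toy setting the classification cost matrix is V-shaped but not convex in each row, whereas the absolute cost matrix satisfies both conditions.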

3.2 Ordinal Regression with a Consistent Rank Logits model

We propose the Consistent Rank Logits (CORAL) model for multi-label CNNs with ordinal responses. Within this framework, the binary tasks produce consistently ranked predictions.

Label Extension and Rank Prediction.

Given the training dataset $D = \{x_i, y_i\}_{i=1}^{N}$, we first extend a rank label $y_i$ into $K-1$ binary labels $y_i^{(1)}, \dots, y_i^{(K-1)}$ such that $y_i^{(k)} \in \{0, 1\}$ indicates whether $y_i$ exceeds rank $r_k$, i.e., $y_i^{(k)} = 1\{y_i > r_k\}$. The indicator function $1\{\cdot\}$ is $1$ if the inner condition is true, and $0$ otherwise. Providing the extended binary labels as model inputs, we train a single CNN with $K-1$ binary classifiers in the output layer. Here, the $K-1$ binary tasks share the same weight parameter but have independent bias units, which solves the inconsistency problem among the predicted binary responses and reduces the model complexity.

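Based on the binary task responses, the predicted rank for an input $x_i$ is then obtained via

$$h(x_i) = r_q, \qquad q = 1 + \sum_{k=1}^{K-1} f_k(x_i), \tag{1}$$

where $f_k(x_i) \in \{0, 1\}$ is the prediction of the $k$-th binary classifier in the output layer. We require that $\{f_k\}_{k=1}^{K-1}$ reflect the ordinal information and are rank-monotonic,

$$f_1(x_i) \geq f_2(x_i) \geq \dots \geq f_{K-1}(x_i), \tag{2}$$

which guarantees that the predictions are consistent.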
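The label extension and the rank prediction rule can be sketched in a few lines. This is an illustrative example under the assumption that ranks are addressed by their 1-based index q in {1, ..., K}; the function names are hypothetical.

```python
def extend_label(rank_index, num_ranks):
    """Turn a rank index q in {1, ..., K} into K-1 binary labels.

    Label k (1-based) is 1 if the rank exceeds r_k, i.e. if q > k.
    """
    return [1 if rank_index > k else 0 for k in range(1, num_ranks)]


def predict_rank_index(binary_predictions):
    """Recover the rank index q = 1 + sum of the binary task predictions."""
    return 1 + sum(binary_predictions)


def is_rank_monotonic(binary_predictions):
    """True if the predictions never switch back from 0 to 1."""
    return all(a >= b for a, b in zip(binary_predictions, binary_predictions[1:]))


labels = extend_label(rank_index=3, num_ranks=5)
print(labels)                           # [1, 1, 0, 0]
print(predict_rank_index(labels))       # 3
print(is_rank_monotonic([1, 0, 1, 0]))  # False
```

A vector such as `[1, 0, 1, 0]` is the kind of inconsistent output that the rank-monotonicity requirement rules out.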

Loss Function.

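Let $W$ denote the weight parameters of the neural network excluding the bias units of the final layer. The penultimate layer, whose output is denoted as $g(x_i, W)$, shares a single weight with all nodes in the final output layer. Then, $K-1$ independent bias units are added to $g(x_i, W)$ such that $\{g(x_i, W) + b_k\}_{k=1}^{K-1}$ are the inputs to the corresponding binary classifiers in the final layer. Let $s(z) = 1/(1 + \exp(-z))$ be the logistic sigmoid function. The predicted empirical probability for task $k$ is defined as

$$\hat{P}(y_i^{(k)} = 1) = s(g(x_i, W) + b_k). \tag{3}$$

For model training, we minimize the loss function

$$L(W, b) = -\sum_{i=1}^{N} \sum_{k=1}^{K-1} \lambda^{(k)} \left[ \log\big(s(g(x_i, W) + b_k)\big)\, y_i^{(k)} + \log\big(1 - s(g(x_i, W) + b_k)\big) \big(1 - y_i^{(k)}\big) \right], \tag{4}$$

which is the weighted cross-entropy of $K-1$ binary classifiers. For rank prediction (Eq. 1), the binary labels are obtained via

$$f_k(x_i) = 1\{\hat{P}(y_i^{(k)} = 1) > 0.5\}. \tag{5}$$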
In Eq. (4), $\lambda^{(k)}$ denotes the weight of the loss associated with the $k$-th classifier (assuming $\lambda^{(k)} > 0$). In the remainder of the paper, we refer to $\lambda^{(k)}$ as the importance parameter for task $k$. Some tasks may be less robust or harder to optimize, which can be taken into consideration by choosing a non-uniform task weighting scheme. Also, in many real-world applications, features between certain adjacent ranks may have more subtle distinctions. For example, facial aging is commonly regarded as a non-stationary process (Ramanathan et al., 2009a) such that face feature transformations could be more detectable during certain age intervals. Moreover, the relative predictive performance of the binary tasks may also be affected by the degree of binary data imbalance for a given task that occurs as a side-effect of extending a rank label into $K-1$ binary labels. Hence, we hypothesize that choosing non-uniform task weighting schemes improves the predictive performance of the overall model. The choice of task importance parameters is covered in more detail in Section 3.5. Next, we provide a theoretical guarantee for classifier consistency under uniform and non-uniform task importance weighting given that the task importance weights are positive numbers.
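To make the loss in Eq. (4) and the thresholding in Eq. (5) concrete, here is a minimal pure-Python sketch. It assumes the shared penultimate output g(x_i, W) has already been computed as a single number per example; the function names and toy values are hypothetical, and this is not the authors' PyTorch implementation.

```python
import math


def sigmoid(z):
    """Logistic sigmoid s(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def coral_loss(g_values, biases, binary_labels, task_weights=None):
    """Weighted binary cross-entropy summed over examples and tasks.

    g_values:      one shared penultimate output g(x_i, W) per example
    biases:        the K-1 bias units b_k
    binary_labels: per example, the K-1 extended labels y_i^(k)
    task_weights:  K-1 positive importance weights (uniform if None)
    """
    if task_weights is None:
        task_weights = [1.0] * len(biases)
    loss = 0.0
    for g, labels in zip(g_values, binary_labels):
        for b, lam, y in zip(biases, task_weights, labels):
            p = sigmoid(g + b)
            loss -= lam * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return loss


def predict_binary(g, biases):
    """Threshold each task probability at 0.5."""
    return [1 if sigmoid(g + b) > 0.5 else 0 for b in biases]


g_values = [0.3, -1.2]                  # made-up penultimate outputs
biases = [1.0, 0.0, -1.0]               # non-increasing biases
binary_labels = [[1, 1, 0], [0, 0, 0]]
print(coral_loss(g_values, biases, binary_labels))
print(predict_binary(0.3, biases))      # [1, 1, 0]
```

Because all tasks share the same g and differ only in their bias, non-increasing biases directly yield non-increasing task probabilities for every input.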

3.3 Theoretical Guarantees for Classifier Consistency

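In the following theorem, we show that by minimizing the loss $L$ (Eq. 4), the learned bias units of the output layer are non-increasing such that $b_1 \geq b_2 \geq \dots \geq b_{K-1}$. Consequently, the predicted confidence scores or probability estimates of the $K-1$ tasks are non-increasing, i.e., $\hat{P}(y_i^{(1)} = 1) \geq \hat{P}(y_i^{(2)} = 1) \geq \dots \geq \hat{P}(y_i^{(K-1)} = 1)$ for all $i$, ensuring classifier consistency. As a result, the binary predictions $\{f_k\}_{k=1}^{K-1}$ given by Eq. 5 are also rank-monotonic.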

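Theorem 1 (ordered biases).

By minimizing the loss function defined in Eq. (4), the optimal solution $(W^*, b^*)$ satisfies $b_1^* \geq b_2^* \geq \dots \geq b_{K-1}^*$.

Proof. Suppose $(W, b)$ is an optimal solution and $b_k < b_{k+1}$ for some $k$. Claim: by either replacing $b_k$ with $b_{k+1}$ or replacing $b_{k+1}$ with $b_k$, we can decrease the objective value $L$. Let

$$A_1 = \{n : y_n^{(k)} = y_n^{(k+1)} = 1\}, \quad A_2 = \{n : y_n^{(k)} = y_n^{(k+1)} = 0\}, \quad A_3 = \{n : y_n^{(k)} = 1,\ y_n^{(k+1)} = 0\}.$$

By the ordering relationship we have $A_1 \cup A_2 \cup A_3 = \{1, 2, \dots, N\}$. Denote $p_n(b_k) = s(g(x_n, W) + b_k)$ and

$$\delta_n = \log(p_n(b_{k+1})) - \log(p_n(b_k)), \qquad \delta'_n = \log(1 - p_n(b_k)) - \log(1 - p_n(b_{k+1})).$$

Since $p_n(b_k)$ is increasing in $b_k$, we have $\delta_n > 0$ and $\delta'_n > 0$.

If we replace $b_k$ with $b_{k+1}$, the loss terms related to the $k$-th task are updated. The change of loss $L$ (Eq. 4) is given as

$$\Delta_1 L = \lambda^{(k)} \Big[ -\sum_{n \in A_1} \delta_n + \sum_{n \in A_2} \delta'_n - \sum_{n \in A_3} \delta_n \Big].$$

Accordingly, if we replace $b_{k+1}$ with $b_k$, the change of $L$ is given as

$$\Delta_2 L = \lambda^{(k+1)} \Big[ \sum_{n \in A_1} \delta_n - \sum_{n \in A_2} \delta'_n - \sum_{n \in A_3} \delta'_n \Big].$$

By adding $\frac{1}{\lambda^{(k)}} \Delta_1 L$ and $\frac{1}{\lambda^{(k+1)}} \Delta_2 L$, we have

$$\frac{1}{\lambda^{(k)}} \Delta_1 L + \frac{1}{\lambda^{(k+1)}} \Delta_2 L = -\sum_{n \in A_3} (\delta_n + \delta'_n) < 0,$$

and know that either $\Delta_1 L < 0$ or $\Delta_2 L < 0$. Thus, our claim is justified, and we conclude that any optimal solution $(W^*, b^*)$ that minimizes $L$ satisfies $b_1^* \geq b_2^* \geq \dots \geq b_{K-1}^*$. ∎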
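The swapping argument in the proof can be checked numerically on toy data. The sketch below is only an illustration with made-up values and two binary tasks: it starts from biases that violate the ordering, replaces one bias by the other in both directions, and confirms that the weight-scaled sum of the two loss changes is negative, so at least one replacement lowers the loss.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(g_values, biases, binary_labels, weights):
    """Weighted cross-entropy over all examples and binary tasks."""
    total = 0.0
    for g, labels in zip(g_values, binary_labels):
        for b, lam, y in zip(biases, weights, labels):
            p = sigmoid(g + b)
            total -= lam * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total


# Toy data (made up): three ranks, hence two binary tasks.
g_values = [-0.5, 0.2, 1.1, 0.4]
binary_labels = [[0, 0], [1, 0], [1, 1], [1, 0]]  # rank-consistent labels
weights = [0.7, 1.3]                              # positive task weights
b1, b2 = -0.4, 0.6                                # violates b1 >= b2

base = loss(g_values, [b1, b2], binary_labels, weights)
delta_1 = loss(g_values, [b2, b2], binary_labels, weights) - base  # b1 <- b2
delta_2 = loss(g_values, [b1, b1], binary_labels, weights) - base  # b2 <- b1

scaled_sum = delta_1 / weights[0] + delta_2 / weights[1]
print(scaled_sum < 0, min(delta_1, delta_2) < 0)
```

The toy labels contain examples with labels `[1, 0]`, which is the case that contributes to the negative sum in the proof.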

Note that the theorem for rank-monotonicity in (Li & Lin, 2007), in contrast to Theorem 1, requires the use of a cost matrix $C$ with each row being convex. Under this convexity condition, let $\lambda_{y_i}^{(k)} = |C_{y_i, r_k} - C_{y_i, r_{k+1}}|$ be the weight of the loss of the $k$-th task on the $i$-th example, which depends on the label $y_i$. In (Li & Lin, 2007), the researchers proved that by using example-specific task weights $\lambda_{y_i}^{(k)}$, the optimal thresholds are ordered. This assumption requires that $\lambda_{y_i}^{(k)} \geq \lambda_{y_i}^{(k+1)}$ when $r_{k+1} < y_i$, and $\lambda_{y_i}^{(k)} \leq \lambda_{y_i}^{(k+1)}$ when $r_{k+1} > y_i$. Theorem 1 is free from this requirement and allows us to choose a fixed weight for each task that does not depend on the individual training examples, which greatly reduces the training complexity. Moreover, Theorem 1 allows for choosing either a simple uniform task weighting or taking dataset imbalances into account (Section 3.5) while still guaranteeing that the predicted probabilities are non-increasing and the task predictions are consistent.

3.4 Generalization Bounds

Based on well-known generalization bounds for binary classification, we can derive new generalization bounds for our ordinal regression approach that apply to a wide range of practical scenarios as we only require $C_{y, y} = 0$ and $C_{y, r_k} > 0$ for $y \neq r_k$. Moreover, Theorem 2 shows that if each binary classification task in our model generalizes well in terms of the standard 0/1-loss, the final rank prediction via $h$ (Eq. 1) also generalizes well.

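Theorem 2 (reduction of generalization error).

Suppose $C$ is the cost matrix of the original ordinal label prediction problem, with $C_{y, y} = 0$ and $C_{y, r_k} > 0$ for $y \neq r_k$. $P$ is the underlying distribution of $(x, y)$, i.e., $(x, y) \sim P$. If $\{f_k\}_{k=1}^{K-1}$ are rank-monotonic, then

$$\mathbb{E}_{(x, y) \sim P}\, C_{y, h(x)} \leq \sum_{k=1}^{K-1} \mathbb{E}_{(x, y) \sim P} \Big[ |C_{y, r_k} - C_{y, r_{k+1}}| \, 1\{f_k(x) \neq y^{(k)}\} \Big]. \tag{6}$$

Proof. For any $x \in \mathcal{X}$, we have $f_1(x) \geq f_2(x) \geq \dots \geq f_{K-1}(x)$.

If $h(x) = y$, then $C_{y, h(x)} = 0$.
If $h(x) = r_q \prec y = r_s$, then $q < s$. We have $f_1(x) = f_2(x) = \dots = f_{q-1}(x) = 1$ and $f_q(x) = f_{q+1}(x) = \dots = f_{K-1}(x) = 0$. Also, $y^{(1)} = y^{(2)} = \dots = y^{(s-1)} = 1$ and $y^{(s)} = y^{(s+1)} = \dots = y^{(K-1)} = 0$. Thus, $1\{f_k(x) \neq y^{(k)}\} = 1$ if and only if $q \leq k \leq s-1$. Since $C_{y, y} = 0$,

$$C_{y, h(x)} = \sum_{k=q}^{s-1} (C_{y, r_k} - C_{y, r_{k+1}}) \, 1\{f_k(x) \neq y^{(k)}\} \leq \sum_{k=1}^{K-1} |C_{y, r_k} - C_{y, r_{k+1}}| \, 1\{f_k(x) \neq y^{(k)}\}.$$

Similarly, if $h(x) = r_q \succ y = r_s$, then $q > s$ and

$$C_{y, h(x)} = \sum_{k=s}^{q-1} (C_{y, r_{k+1}} - C_{y, r_k}) \, 1\{f_k(x) \neq y^{(k)}\} \leq \sum_{k=1}^{K-1} |C_{y, r_{k+1}} - C_{y, r_k}| \, 1\{f_k(x) \neq y^{(k)}\}.$$

In any case, we have

$$C_{y, h(x)} \leq \sum_{k=1}^{K-1} |C_{y, r_k} - C_{y, r_{k+1}}| \, 1\{f_k(x) \neq y^{(k)}\}.$$

By taking the expectation on both sides with $(x, y) \sim P$, we arrive at Eq. (6). ∎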
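The pointwise inequality behind Theorem 2 can be verified exhaustively for a small number of ranks. The following sketch assumes ranks are coded by their index 1, ..., K and enumerates every true label together with every rank-monotonic prediction vector; the function names and the second, irregular cost function are hypothetical choices for illustration.

```python
def absolute_cost(y, r):
    """Absolute cost |y - r| for rank indices 1..K."""
    return abs(y - r)


def bound_holds(K, cost):
    """Check C[y, h(x)] <= sum_k |C[y, r_k] - C[y, r_{k+1}]| * 1{f_k != y^(k)}
    for every label y and every rank-monotonic prediction vector."""
    for y in range(1, K + 1):
        labels = [1 if y > k else 0 for k in range(1, K)]
        for q in range(1, K + 1):
            # Rank-monotonic predictions that sum to the predicted index q.
            preds = [1 if q > k else 0 for k in range(1, K)]
            lhs = cost(y, q)
            rhs = sum(
                abs(cost(y, k) - cost(y, k + 1)) * (preds[k - 1] != labels[k - 1])
                for k in range(1, K)
            )
            if lhs > rhs:
                return False
    return True


print(bound_holds(6, absolute_cost))
```

For the absolute cost the two sides coincide, since each mismatching task contributes exactly one unit of cost.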

In (Li & Lin, 2007), by assuming the cost matrix to have V-shaped rows, the researchers define generalization bounds by constructing a discrete distribution on $\{1, 2, \dots, K-1\}$ conditional on each $y$, given that the binary classifications are rank-monotonic or every row of $C$ is convex. However, the only case they provided for the existence of rank-monotonic binary classifiers was the ordered threshold model, which requires a cost matrix with convex rows and example-specific task weights. Our result does not rely on cost matrices with V-shaped or convex rows and can be applied to a broader variety of real-world use cases.

3.5 Task Importance Weighting

According to Theorem 1, minimizing the loss of the CORAL model guarantees that the bias units are non-increasing and thus the binary classifiers are consistent as long as the task importance parameters are positive ($\lambda^{(k)} > 0$ for all $k$).

We first experimented with a weighting scheme proposed in (Niu et al., 2016) that aims to address the class imbalance in the face image datasets. However, compared to using a uniform scheme ($\lambda^{(k)} = 1$ for all $k$), we found that it had a negative effect on the predictive performance for all models evaluated in this study.

Hence, we propose a weighting scheme that takes the rank distribution of the training examples into account but also considers the label imbalance for each classification task after extending the original ranks into binary labels. Specifically, our task weighting scheme (under which CORAL still guarantees classifier consistency) is defined as follows. Let $S_k = \sum_{i=1}^{N} 1\{y_i^{(k)} = 1\}$ be the number of examples whose ranks exceed $r_k$. By the rank ordering we have $S_1 \geq S_2 \geq \dots \geq S_{K-1}$. Let $M_k = \max(S_k, N - S_k)$ be the number of examples carrying the majority binary label for each task. We define the importance of the $k$-th task as the scaled $\sqrt{M_k}$:

$$\lambda^{(k)} = \frac{\sqrt{M_k}}{\max_{1 \leq i \leq K-1} \sqrt{M_i}}. \tag{7}$$

Under this weighting scheme, the general class imbalance of a dataset is taken into account. Moreover, in our examples, classification tasks corresponding to the edges of the distribution of unique rank labels receive a higher weight than the classification tasks that see more balanced rank label vectors during training (Figure 1), which may help improve the predictive performance of the model. The lowest weight may not always be assigned to the center rank: if $S_{K-1} > N - S_{K-1}$, the last task has the lowest weight, and if $S_1 < N - S_1$, the first task has the lowest weight. Note that the task importance weighting is only used for model parameter optimization; when computing the predicted rank by adding the binary results (Eq. 1), each task has the same influence on the final rank prediction. Since $M_k \geq N/2$ for every task, this scheme prevents tasks from having negligible weights as in (Niu et al., 2016) when a dataset contains only a small number of examples for certain ranks. We provide an empirical comparison between a uniform task weighting and task weighting according to Eq. (7) in Section 5.2.
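The weighting scheme of Eq. (7) can be computed directly from the training labels. The sketch below is a hedged illustration: it assumes ranks are given as 1-based indices, defines the majority count as max(S_k, N - S_k), and uses a made-up label distribution; the function name is hypothetical.

```python
import math


def task_importance(rank_indices, num_ranks):
    """Weights lambda^(k) = sqrt(M_k) / max_j sqrt(M_j) for k = 1..K-1.

    rank_indices: rank index in {1, ..., K} for each training example
    """
    n = len(rank_indices)
    majority_counts = []
    for k in range(1, num_ranks):
        s_k = sum(1 for q in rank_indices if q > k)  # examples exceeding r_k
        majority_counts.append(max(s_k, n - s_k))    # majority label count M_k
    roots = [math.sqrt(m) for m in majority_counts]
    largest = max(roots)
    return [r / largest for r in roots]


ranks = [1, 1, 2, 2, 2, 3, 3, 4]  # made-up label distribution, K = 4
print(task_importance(ranks, num_ranks=4))
```

In this toy example the most imbalanced task (the last one, with a 7-to-1 split) receives weight 1, and the most balanced task receives the smallest weight.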

Figure 1: Example of the task importance weighting according to Eq. (7) shown for the AFAD dataset (Section 4.1).

Figure 2: Illustration of the Consistent Rank Logits CNN (CORAL-CNN) used for age prediction. From the estimated probability values, the binary labels are obtained via Eq. (5) and converted to the age label via Eq. (1).

4 Experiments

4.1 Datasets and Preprocessing


MORPH-2 and CACD.

The MORPH-2 dataset (Ricanek & Tesafaye, 2006) (55,608 face images) was preprocessed by locating the average eye position in the respective dataset using facial landmark detection (Sagonas et al., 2016) via MLxtend (Raschka, 2018) and then aligning each image in the dataset to the average eye position. The faces were then re-aligned such that the tip of the nose was located in the center of each image. The age labels used in this study ranged between 16-70 years. The CACD database (Chen et al., 2014) was preprocessed similarly to MORPH-2 such that the faces spanned the whole image with the nose tip being in the center. The total number of images is 159,449 in the age range 14-62 years.

AFAD and UTKFace.

Since the faces were already centered in the Asian Face Database (AFAD; 165,501 faces with age labels between 15-40) (Niu et al., 2016), no further alignment was applied. The UTKFace database (Zhang & Qi, 2017) was also available in a preprocessed form such that no additional steps were required. In this study, we considered face images with age labels between 21-60 years (16,434 images).

Each image database was randomly divided into 80% training data and 20% test data. All images were resized to 128x128x3 pixels and then randomly cropped to 120x120x3 pixels to augment the model training. During model evaluation, the 128x128x3 face images were center-cropped to a model input size of 120x120x3.

4.2 Convolutional Neural Network Architectures

To evaluate the performance of CORAL for age estimation from face images, we chose the ResNet-34 architecture (He et al., 2016), which is a modern CNN architecture that is known for achieving good performance on a variety of image classification tasks. For the remainder of this paper, we refer to the original ResNet-34 CNN with cross entropy loss as CE-CNN. To implement CORAL, we replaced the last output layer with the corresponding binary tasks (Figure 2) and refer to this CNN as CORAL-CNN. Similar to CORAL-CNN, we replaced the cross-entropy layer of the ResNet-34 with the binary tasks for ordinal regression described in (Niu et al., 2016) and refer to this architecture as OR-CNN.
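The effect of the shared-weight output layer on the parameter count can be illustrated with simple arithmetic. The sketch below compares a standard fully connected classification head with a CORAL head that uses one shared weight vector plus K-1 biases. The feature width of 512 (the usual input size of the final ResNet-34 layer) and the 26 ranks (ages 15 to 40) are assumptions chosen for illustration, and the function names are hypothetical.

```python
def classification_head_params(num_features, num_classes):
    """Fully connected softmax head: one weight vector and one bias per class."""
    return num_features * num_classes + num_classes


def coral_head_params(num_features, num_ranks):
    """CORAL head: one shared weight vector plus K-1 independent biases."""
    return num_features + (num_ranks - 1)


def coral_logits(features, shared_weights, biases):
    """All K-1 logits share g = <w, features> and differ only by their bias."""
    g = sum(w * f for w, f in zip(shared_weights, features))
    return [g + b for b in biases]


num_features, num_ranks = 512, 26  # assumed values, see lead-in
print(classification_head_params(num_features, num_ranks))
print(coral_head_params(num_features, num_ranks))
print(coral_logits([1.0, 2.0], [0.5, -0.25], [1.0, 0.0]))
```

Only the output layer differs between the two variants in this comparison; the backbone parameters are unaffected.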

4.3 Training and Evaluation

For model evaluation and comparison, we computed the mean absolute error (MAE) and root mean squared error (RMSE), which are standard metrics used for crowd-counting and age prediction:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - h(x_i)|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - h(x_i))^2},$$

where $y_i$ is the ground truth rank of the $i$-th test example and $h(x_i)$ is the predicted rank. The MAE and RMSE values reported in this study were computed on the test set after the last training epoch. The training was repeated three times with different random seeds for model weight initialization while the random seeds were consistent between the different methods to allow for fair comparisons. All CNNs were trained for 200 epochs with stochastic gradient descent via adaptive moment estimation (Kingma & Ba, 2015) using exponential decay rates $\beta_1 = 0.9$ and $\beta_2 = 0.999$ (PyTorch default) and a learning rate $\alpha$.

In addition, we computed the Cumulative Score (CS) as the proportion of images for which the absolute differences between the predicted rank labels and the ground truth do not exceed a threshold $T$:

$$\mathrm{CS}(T) = \frac{1}{N} \sum_{i=1}^{N} 1\{|y_i - h(x_i)| \leq T\}.$$

By varying the threshold $T$, CS curves were plotted to compare the predictive performances of the different age prediction models (the larger the area under the curve, the better).
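The three evaluation metrics are straightforward to compute. The following sketch uses plain Python and made-up age values; the inclusive comparison in the Cumulative Score (error at most the threshold) is an assumption of this example, and the function names are hypothetical.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ranks."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted ranks."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def cumulative_score(y_true, y_pred, threshold):
    """Fraction of examples with absolute error at most `threshold`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= threshold)
    return hits / len(y_true)


y_true = [25, 31, 40, 18]  # made-up ground truth ages
y_pred = [27, 30, 35, 18]  # made-up predictions
print(mae(y_true, y_pred))                  # 2.0
print(rmse(y_true, y_pred))
print(cumulative_score(y_true, y_pred, 2))  # 0.75
```

Evaluating `cumulative_score` over a range of thresholds gives the points of a CS curve.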

4.4 Hardware and Software

All loss functions and neural network models were implemented in PyTorch 1.0 (Paszke et al., 2017) and trained on NVIDIA GeForce 1080Ti and Titan V graphics cards. The source code is available at https://github.com/Raschka-research-group/coral-cnn.

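| Method | Random Seed | MORPH-2 MAE | MORPH-2 RMSE | AFAD MAE | AFAD RMSE | UTKFace MAE | UTKFace RMSE | CACD MAE | CACD RMSE |
|---|---|---|---|---|---|---|---|---|---|
| CE-CNN | 0 | 3.40 | 4.88 | 3.98 | 5.55 | 6.57 | 9.16 | 6.18 | 8.86 |
| | 1 | 3.39 | 4.87 | 4.00 | 5.57 | 6.24 | 8.69 | 6.10 | 8.79 |
| | 2 | 3.37 | 4.87 | 3.96 | 5.50 | 6.29 | 8.78 | 6.13 | 8.87 |
| | AVG ± SD | 3.39 ± 0.02 | 4.89 ± 0.01 | 3.98 ± 0.02 | 5.54 ± 0.04 | 6.37 ± 0.18 | 8.88 ± 0.25 | 6.14 ± 0.04 | 8.84 ± 0.04 |
| OR-CNN (Niu et al., 2016) | 0 | 2.98 | 4.26 | 3.66 | 5.10 | 5.71 | 8.11 | 5.53 | 7.91 |
| | 1 | 2.98 | 4.26 | 3.69 | 5.13 | 5.80 | 8.12 | 5.53 | 7.98 |
| | 2 | 2.96 | 4.20 | 3.68 | 5.14 | 5.71 | 8.11 | 5.49 | 7.89 |
| | AVG ± SD | 2.97 ± 0.01 | 4.24 ± 0.03 | 3.68 ± 0.02 | 5.13 ± 0.02 | 5.74 ± 0.05 | 8.08 ± 0.06 | 5.52 ± 0.02 | 7.93 ± 0.05 |
| CORAL-CNN (ours) | 0 | 2.68 | 3.75 | 3.49 | 4.82 | 5.46 | 7.61 | 5.56 | 7.80 |
| | 1 | 2.63 | 3.66 | 3.46 | 4.83 | 5.46 | 7.63 | 5.37 | 7.64 |
| | 2 | 2.61 | 3.64 | 3.52 | 4.91 | 5.48 | 7.63 | 5.25 | 7.53 |
| | AVG ± SD | 2.64 ± 0.04 | 3.68 ± 0.06 | 3.49 ± 0.03 | 4.85 ± 0.05 | 5.47 ± 0.01 | 7.62 ± 0.01 | 5.39 ± 0.16 | 7.66 ± 0.14 |

Table 1: Age prediction errors (MAE and RMSE) on the test sets without task importance weighting.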

Figure 3: Comparison of age prediction models without task importance weighting. CS curves are shown as averages over three independent runs with standard deviation.

5 Results and Discussion

We conducted a series of experiments on four independent face image datasets for age estimation (Section 4.1) to compare our CORAL approach (CORAL-CNN) with the ordinal regression approach described in (Niu et al., 2016), denoted as OR-CNN. All implementations were based on the ResNet-34 architecture as described in Section 4.2, including the standard ResNet-34 with cross-entropy loss (CE-CNN) as performance baseline.

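| Method | Importance Weight | MORPH-2 | AFAD | UTKFace | CACD |
|---|---|---|---|---|---|
| OR-CNN (Niu et al., 2016) | NO | 2.97 ± 0.01 | 3.68 ± 0.02 | 5.74 ± 0.05 | 5.52 ± 0.02 |
| OR-CNN (Niu et al., 2016) | YES | 2.91 ± 0.02 | 3.65 ± 0.03 | 5.76 ± 0.19 | 5.49 ± 0.02 |
| CORAL-CNN (ours) | NO | 2.64 ± 0.04 | 3.49 ± 0.03 | 5.47 ± 0.01 | 5.39 ± 0.16 |
| CORAL-CNN (ours) | YES | 2.59 ± 0.03 | 3.48 ± 0.03 | 5.39 ± 0.07 | 5.35 ± 0.09 |

Table 2: Performance comparison after training with and without task importance weighting (Eq. 7). The performance values are reported as average MAE ± SD from 3 independent runs each.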

5.1 Estimating the Apparent Age from Face Images

First, we note that for all methods, the overall predictive performance on the different datasets appears in the following order: MORPH-2 > AFAD > CACD > UTKFace (Table 1 and Figure 3). Possible reasons why all approaches perform best on MORPH-2 are that MORPH-2 has the best overall image quality and relatively consistent lighting conditions and viewing angles. For instance, we found that AFAD includes some images of particularly low resolution (e.g., 20x20). While UTKFace and CACD also contain some lower-quality images, a possible reason why the methods perform worse on UTKFace compared to AFAD is that UTKFace is about ten times smaller than AFAD. While CACD has approximately the same size as AFAD, the lower performance can be explained by the wider age range that needs to be considered (14-62 in CACD compared to 15-40 in AFAD).

Across all datasets (Table 1 and Figure 3), we found that both OR-CNN and CORAL-CNN outperform the standard cross-entropy loss (CE-CNN) on these ordinal regression tasks, as expected. Similarly, as summarized in Table 1 and Figure 3, our CORAL method shows a substantial improvement over the current state-of-the-art ordinal regression method (OR-CNN) by (Niu et al., 2016), which does not guarantee classifier consistency. Moreover, we repeated each experiment three times using different random seeds for model weight initialization and dataset shuffling, to ensure that the observed performance improvement of CORAL-CNN over OR-CNN is reproducible and not coincidental. Furthermore, along with providing the theoretical proof for classifier consistency in CORAL-CNN (Theorem 1), we also empirically verified that the bias units of the CORAL-CNN output layers were indeed ordered after model training, in contrast to OR-CNN. From these results, we can conclude that guaranteed classifier consistency via CORAL has a substantial, positive effect on the predictive performance of an ordinal regression CNN.

5.2 Task Importance Weighting

While all results described in the previous section are based on experiments without task importance weighting (i.e., $\lambda^{(k)} = 1$ for all $k$), we repeated all experiments using our weighting scheme proposed in Section 3.5, which takes label imbalances into account. Note that according to Theorem 1, CORAL still guarantees classifier consistency under any chosen task weighting scheme as long as weights are assigned positive values. From the results provided in Table 2, we find that by using a task weighting scheme that also takes label imbalances into account, we can further improve the performance of CORAL-CNNs across all four datasets.

6 Conclusions

In this paper, we developed the CORAL framework for ordinal regression via extended binary classification with theoretical guarantees for classifier consistency. Moreover, we proved classifier consistency without requiring rank- or training label-dependent weighting schemes, which permits straightforward implementations and efficient model training. Furthermore, the theoretical generalization bounds assure that if the binary tasks generalize well, then the final rank prediction also generalizes well. We also showed that CORAL could be readily implemented to extend CNNs for ordinal regression tasks and evaluated it empirically on four large image databases for predicting the apparent age from face images. The results unequivocally showed that the guaranteed classifier consistency via CORAL substantially improved the predictive performance of CNNs for age estimation. While we evaluated the CORAL framework in an end-to-end learning approach using CNNs for age estimation, our method can be readily generalized to other ordinal regression problems and different types of neural network architectures, including multilayer perceptrons and recurrent neural networks.

7 Acknowledgements

Support for this research was provided by the Office of the Vice Chancellor for Research and Graduate Education at the University of Wisconsin-Madison with funding from the Wisconsin Alumni Research Foundation. Also, we thank the NVIDIA Corporation for a generous donation via an NVIDIA GPU grant to support this study.


References

  • Canziani et al. (2016) Canziani, A., Paszke, A., and Culurciello, E. An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678, 2016.
  • Chen et al. (2014) Chen, B.-C., Chen, C.-S., and Hsu, W. H. Cross-age reference coding for age-invariant face recognition and retrieval. In Proceedings of the European Conference on Computer Vision, pp. 768–783. Springer, 2014.
  • Chen et al. (2016) Chen, J.-C., Kumar, A., Ranjan, R., Patel, V. M., Alavi, A., and Chellappa, R. A cascaded convolutional neural network for age estimation of unconstrained faces. In Proceedings of the IEEE Conference on Biometrics Theory, Applications and Systems, pp. 1–8, 2016.
  • Chen et al. (2017) Chen, S., Zhang, C., Dong, M., Le, J., and Rao, M. Using Ranking-CNN for age estimation. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    , pp. 5183–5192, 2017.
  • Chu & Keerthi (2005) Chu, W. and Keerthi, S. S. New approaches to support vector ordinal regression. In Proceedings of the International Conference on Machine Learning, pp. 145–152. ACM, 2005.
  • Crammer & Singer (2002) Crammer, K. and Singer, Y. Pranking with ranking. In Advances in Neural Information Processing Systems, pp. 641–647, 2002.
  • Doyle et al. (2014) Doyle, O. M., Westman, E., Marquand, A. F., Mecocci, P., Vellas, B., Tsolaki, M., Kłoszewska, I., Soininen, H., Lovestone, S., Williams, S. C., et al. Predicting progression of Alzheimer’s disease using ordinal regression. PloS one, 9(8):e105542, 2014.
  • Frank & Hall (2001) Frank, E. and Hall, M. A simple approach to ordinal classification. In Proceedings of the European Conference on Machine Learning, pp. 145–156. Springer, 2001.
  • Geng et al. (2013) Geng, X., Yin, C., and Zhou, Z.-H. Facial age estimation by learning from label distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(10):2401–2412, 2013.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
  • Herbrich et al. (1999) Herbrich, R., Graepel, T., and Obermayer, K. Support vector learning for ordinal regression. In Proceedings of the IET Conference on Artificial Neural Networks, volume 1, pp. 97–102, 1999.
  • Kingma & Ba (2015) Kingma, D. P. and Ba, J. L. Adam: A method for stochastic optimization. In Proceedings of the Conference on Learning Representations, 2015.
  • Kohail (2012) Kohail, S. N. Using artificial neural network for human age estimation based on facial images. In Proceedings of the IEEE Conference on Innovations in Information Technology, pp. 215–219, 2012.
  • Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
  • Levi & Hassner (2015) Levi, G. and Hassner, T. Age and gender classification using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 34–42, 2015.
  • Li & Lin (2007) Li, L. and Lin, H.-T. Ordinal regression by extended binary classification. In Advances in Neural Information Processing Systems, pp. 865–872, 2007.
  • McCullagh (1980) McCullagh, P. Regression models for ordinal data. Journal of the Royal Statistical Society. Series B (Methodological), pp. 109–142, 1980.
  • Niu et al. (2016) Niu, Z., Zhou, M., Wang, L., Gao, X., and Hua, G. Ordinal regression with multiple output cnn for age estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4920–4928, 2016.
  • O’Toole et al. (1999) O’Toole, A. J., Price, T., Vetter, T., Bartlett, J. C., and Blanz, V. 3D shape and 2D surface textures of human faces: The role of ”averages” in attractiveness and age. Image and Vision Computing, 18(1):9–19, 1999.
  • Pan et al. (2018) Pan, H., Han, H., Shan, S., and Chen, X.

    Mean-variance loss for deep age estimation from a face.

    In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5285–5294, 2018.
  • Parkhim & Zisserman (2015) Parkhim, Omkar M, A. V. and Zisserman, A. Deep face recognition. In Proceedings of the British Machine Vision Conference, volume 3, 2015.
  • Parra et al. (2011) Parra, D., Karatzoglou, A., Amatriain, X., and Yavuz, I.

    Implicit feedback recommendation via implicit-to-explicit ordinal logistic regression mapping.

    Proceedings of the CARS Workshop of the Conference of Recommender Systems, pp.  5, 2011.
  • Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. In Neural Information Processing Systems Autodiff Workshop, 2017.
  • Polania et al. (2018) Polania, L., Wang, D., and Fung, G. Ordinal regression using noisy pairwise comparisons for body mass index range estimation. arXiv preprint arXiv:1811.03268, 2018.
  • Rajaram et al. (2003) Rajaram, S., Garg, A., Zhou, X. S., and Huang, T. S. Classification approach towards ranking and sorting problems. In Proceedings of the European Conference on Machine Learning, pp. 301–312. Springer, 2003.
  • Ramanathan et al. (2009a) Ramanathan, N., Chellappa, R., and Biswas, S. Age progression in human faces: A survey. Journal of Visual Languages and Computing, 15:3349–3361, 2009a.
  • Ramanathan et al. (2009b) Ramanathan, N., Chellappa, R., and Biswas, S. Computational methods for modeling facial aging: A survey. Journal of Visual Languages & Computing, 20(3):131–144, 2009b.
  • Ranjan et al. (2017) Ranjan, R., Sankaranarayanan, S., Castillo, C. D., and Chellappa, R. An all-in-one convolutional neural network for face analysis. In Proceedings of the IEEE Conference on Automatic Face & Gesture Recognition, pp. 17–24, 2017.
  • Raschka (2018) Raschka, S.

    MLxtend: Providing machine learning and data science utilities and extensions to Python’s scientific computing stack.

    The Journal of Open Source Software, 3(24), 2018.
  • Rettie et al. (2005) Rettie, R., Grandcolas, U., and Deakins, B. Text message advertising: Response rates and branding effects. Journal of Targeting, Measurement and Analysis for Marketing, 13(4):304–312, 2005.
  • Ricanek & Tesafaye (2006) Ricanek, K. and Tesafaye, T. Morph: A longitudinal image database of normal adult age-progression. In Proceedings of the IEEE Conference on Automatic Face and Gesture Recognition, pp. 341–345, 2006.
  • Rothe et al. (2015) Rothe, R., Timofte, R., and Van Gool, L. DEX: Deep expectation of apparent age from a single image. In Proceedings of the IEEE Conference on Computer Vision Workshops, pp. 10–15, 2015.
  • Sagonas et al. (2016) Sagonas, C., Antonakos, E., Tzimiropoulos, G., Zafeiriou, S., and Pantic, M. 300 faces in-the-wild challenge: database and results. Image and Vision Computing, 47:3–18, 2016.
  • Shashua & Levin (2003) Shashua, A. and Levin, A. Ranking with large margin principle: Two approaches. In Advances in Neural Information Processing Systems, pp. 961–968, 2003.
  • Shen & Joshi (2005) Shen, L. and Joshi, A. K. Ranking and reranking with perceptron. Machine Learning, 60(1-3):73–96, 2005.
  • Sigrist et al. (2007) Sigrist, M. K., Taal, M. W., Bungay, P., and McIntyre, C. W. Progressive vascular calcification over 2 years is associated with arterial stiffening and increased mortality in patients with stages 4 and 5 chronic kidney disease. Clinical Journal of the American Society of Nephrology, 2(6):1241–1248, 2007.
  • Streifler et al. (1995) Streifler, J. Y., Eliasziw, M., Benavente, O. R., Hachinski, V. C., Fox, A. J., and Barnett, H. Lack of relationship between leukoaraiosis and carotid artery disease. Archives of neurology, 52(1):21–24, 1995.
  • Turaga et al. (2010) Turaga, P., Biswas, S., and Chellappa, R. The role of geometry in age estimation. In Proceedings of the IEEE Conference on Acoustics Speech and Signal Processing, pp. 946–949, 2010.
  • Weersma et al. (2009) Weersma, R. K., Stokkers, P. C., van Bodegraven, A. A., van Hogezand, R. A., Verspaget, H. W., de Jong, D. J., Van Der Woude, C., Oldenburg, B., Linskens, R., and Festen, E. Molecular prediction of disease risk and severity in a large dutch crohn’s disease cohort. Gut, 58(3):388–395, 2009.
  • Wu et al. (2012) Wu, T., Turaga, P., and Chellappa, R. Age estimation and face verification across aging using landmarks. Proceedings of the IEEE Conference on Transactions on Information Forensics and Security, 7:1780–1788, 2012.
  • Zhang & Qi (2017) Zhang, Zhifei, S. Y. and Qi, H.

    Age progression/regression by conditional adversarial autoencoder.

    In Proceddings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.