Consistent Rank Logits for Ordinal Regression with Convolutional Neural Networks
While extraordinary progress has been made towards developing neural network architectures for classification tasks, commonly used loss functions such as the multi-category cross entropy loss are inadequate for ranking and ordinal regression problems. To address this issue, approaches have been developed that transform ordinal target variables into a series of binary classification tasks, resulting in robust ranking algorithms with good generalization performance. However, to model ordinal information appropriately, ideally, a rank-monotonic prediction function is required such that confidence scores are ordered and consistent. We propose a new framework (Consistent Rank Logits, CORAL) with theoretical guarantees for rank-monotonicity and consistent confidence scores. Through parameter sharing, our framework benefits from low training complexity and can easily be implemented to extend common convolutional neural network classifiers for ordinal regression tasks. Furthermore, our empirical results support the proposed theory and show a substantial improvement compared to the current state-of-the-art ordinal regression method for age prediction from face images.
Ordinal regression, sometimes also referred to as ordinal classification, describes the task of predicting object labels on an ordinal scale. Here, a ranking rule or classifier $h : \mathcal{X} \rightarrow \mathcal{Y}$ maps each object $\mathbf{x}_i \in \mathcal{X}$ into an ordered set $\mathcal{Y} = \{r_1, r_2, \ldots, r_K\}$, where $r_K \succ r_{K-1} \succ \cdots \succ r_1$. In contrast to classification, the ranks include ordering information. In comparison with metric regression, which assumes that $y$ is a continuous random variable, ordinal regression regards $y$ as a finite sequence where the metric distance between ranks is not defined.
Along with age estimation (Niu et al., 2016), popular applications for ordinal regression include predicting the progression of various diseases, such as Alzheimer’s disease (Doyle et al., 2014), Crohn’s disease (Weersma et al., 2009), artery disease (Streifler et al., 1995), and kidney disease (Sigrist et al., 2007). Also, ordinal regression models are common choices for text message advertising (Rettie et al., 2005) and various recommender systems (Parra et al., 2011).
While the field of machine learning has developed many powerful algorithms for predictive modeling, most algorithms were designed for classification tasks. About ten years ago, Li and Lin proposed a general framework for ordinal regression via extended binary classification (Li & Lin, 2007), which has become the standard choice for extending state-of-the-art machine learning algorithms for ordinal regression tasks. However, implementations of extended binary classification for ordinal regression commonly suffer from classifier inconsistencies among the binary rankings (Niu et al., 2016), which we address in this paper with a new method and theorem for guaranteed classifier consistency that can easily be implemented in various machine learning algorithms. Furthermore, we present an empirical study of our approach on challenging real-world datasets for predicting the age of individuals from face images using our method with convolutional neural networks (CNNs).
The main contributions of our paper are as follows:
the Consistent Rank Logits (CORAL) framework for ordinal regression with theoretical guarantees for classifier consistency and well-defined generalization bounds with and without dataset- and task-specific importance weighting;
CNN architectures with CORAL formulation for ordinal regression tasks that come with the added side benefit of reducing the number of parameters to be trained compared to CNNs for classification;
experimental validation showing that the guaranteed classifier consistency leads to a substantial improvement over the state-of-the-art CNN for ordinal regression applied to age estimation from face images.
Several multivariate extensions of generalized linear models have been developed in the past for ordinal regression, including the popular proportional odds and proportional hazards models (McCullagh, 1980). Moreover, ordinal regression has become a popular topic of study in the field of machine learning to extend classification algorithms by reformulating the problem to utilize multiple binary classification tasks. Early work in this regard includes the use of perceptrons (Crammer & Singer, 2002; Shen & Joshi, 2005) and support vector machines (Herbrich et al., 1999; Shashua & Levin, 2003; Rajaram et al., 2003; Chu & Keerthi, 2005). A general reduction framework that unified the view of a number of these existing algorithms for ordinal regression was later proposed in (Li & Lin, 2007).
While earlier works on using CNNs for ordinal targets have employed conventional classification approaches (Levi & Hassner, 2015; Rothe et al., 2015), the general reduction framework from ordinal regression to binary classification by (Li & Lin, 2007) was recently adopted by (Niu et al., 2016). In (Niu et al., 2016), an ordinal regression problem with $K$ ranks was transformed into $K-1$ binary classification problems, with the $k$th task predicting whether the age label of a face image exceeds rank $r_k$, $k = 1, \ldots, K-1$. Here, all tasks share the same intermediate layers but are assigned distinct weight parameters in the output layer. One issue with this architecture is that for some input images the outputs of the $K-1$ tasks do not agree with each other. Hence, the model does not guarantee that the predictions are consistent. For example, in an age estimation setting, it would be contradictory if the $k$th binary task predicted that the age of a person was larger than 30, but a previous task predicted it was not larger than 20, which is suboptimal when the task predictions are combined to obtain the estimated age.
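The consistency problem can be illustrated with a small, self-contained sketch (our own toy example, not code from the paper): when the rank is obtained by summing independent binary task outputs, a non-monotonic output pattern still yields *a* rank, but the underlying task claims contradict each other.

```python
def rank_from_binary_tasks(task_outputs):
    """Predict a rank index (1-based) as 1 + the number of positive tasks."""
    return 1 + sum(task_outputs)

# Rank-monotonic pattern: every 1 precedes every 0 ("exceeds r_1, r_2, r_3").
consistent = [1, 1, 1, 0, 0]
# Inconsistent pattern: task 2 claims "not > r_2" while task 4 claims "> r_4".
inconsistent = [1, 0, 1, 1, 0]

# Both patterns sum to the same rank index, but only in the first case do
# the individual task claims agree with each other.
print(rank_from_binary_tasks(consistent))    # 4
print(rank_from_binary_tasks(inconsistent))  # 4
```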
While the ordinal regression CNN yielded state-of-the-art results on an ordinal regression problem such as age estimation, the authors acknowledged the classifier inconsistency as not being ideal but also noted that ensuring that the binary classifiers are consistent would increase the training complexity substantially (Niu et al., 2016). Our proposed method addresses both of these issues with a theoretical guarantee for classifier consistency as well as a reduction of the training complexity.
Due to its broad utility in social networking, video surveillance, and biometric verification, age estimation from human faces is an area of active research. Various techniques have been developed for extracting facial features as inputs to classification or metric regression algorithms (O’Toole et al., 1999; Ramanathan et al., 2009b; Turaga et al., 2010; Kohail, 2012; Wu et al., 2012; Geng et al., 2013).
In recent years, CNN research has rapidly advanced, and CNNs now surpass most traditional methods on image-analysis tasks while not requiring feature extraction beyond standard image preprocessing steps (Krizhevsky et al., 2012; Parkhi & Zisserman, 2015; Canziani et al., 2016). Hence, most state-of-the-art age estimation methods are now utilizing CNN architectures (Rothe et al., 2015; Chen et al., 2016; Niu et al., 2016; Ranjan et al., 2017; Chen et al., 2017).
Related to the idea of training binary classifiers separately and combining the independent predictions for ranking (Frank & Hall, 2001), a modification of the ordinal regression CNN (Niu et al., 2016) was recently proposed for age estimation, called Ranking-CNN, that trains an ensemble of CNNs for binary classifications and aggregates the predictions to predict the age label of a given face image (Chen et al., 2017). The researchers showed that training a series of CNNs improves the predictive performance over a single CNN with multiple binary outputs. However, ensembles of CNNs come with a substantial increase in training complexity and do not guarantee classifier consistency, which means that the individual binary classifiers used for ranking can produce contradictory results. Another approach for utilizing binary classifiers for ordinal regression is the siamese CNN architecture by (Polania et al., 2018). Since this siamese CNN has only a single output neuron, comparisons between the input image and multiple, carefully selected anchor images are required to compute the rank.
Other notable progress in age estimation includes age distribution learning (Pan et al., 2018); here, the researchers defined a new loss function to penalize differences between estimated age distributions and the ground truth age labels. Recent research has also shown that training a multi-task CNN for various face analysis tasks, including face detection, gender prediction, age estimation, etc., can improve the overall performance across different tasks compared to a single-task CNN (Ranjan et al., 2017) by sharing lower-layer parameters. In (Chen et al., 2016), a cascaded convolutional neural network was designed to classify face images into age groups, followed by regression modules for more accurate age estimation. In both studies, the authors used metric regression for the age estimation subtasks. While our paper focuses on the comparison of different ordinal regression approaches, we hypothesize that such all-in-one and cascaded CNNs can be further improved by our method, since, as shown in (Niu et al., 2016), ordinal regression CNNs outperform metric regression CNNs in age estimation tasks.
This section describes the proposed CORAL framework that addresses the problem of classifier inconsistency in ordinal regression CNNs based on multiple binary classification tasks for ranking.
Let $D = \{\mathbf{x}^{[i]}, y^{[i]}\}_{i=1}^{N}$ be the training dataset consisting of $N$ examples. Here, $\mathbf{x}^{[i]}$ denotes the $i$-th image and $y^{[i]}$ denotes the corresponding rank, where $y^{[i]} \in \mathcal{Y} = \{r_1, r_2, \ldots, r_K\}$ with ordered ranks $r_K \succ r_{K-1} \succ \cdots \succ r_1$. The symbol $\succ$ denotes the ordering between the ranks. The ordinal regression task is to find a ranking rule $h : \mathcal{X} \rightarrow \mathcal{Y}$ such that some loss function $L(h)$ is minimized.
Let $\mathbf{C}$ be a $K \times K$ cost matrix (Li & Lin, 2007), where $C_{y, r_k}$ is the cost of predicting an example $(\mathbf{x}, y)$ as rank $r_k$. Typically, $C_{y, y} = 0$ and $C_{y, r_k} > 0$ for $y \neq r_k$. In ordinal regression, we generally prefer each row of the cost matrix to be V-shaped. That is, $C_{y, r_{k-1}} \geq C_{y, r_k}$ if $r_k \preceq y$ and $C_{y, r_k} \leq C_{y, r_{k+1}}$ if $r_k \succeq y$. The classification cost matrix has entries $C_{y, r_k} = \mathbb{1}\{y \neq r_k\}$, which does not consider ordering information. In ordinal regression, where the ranks are treated as numerical values, the absolute cost matrix is commonly defined by $C_{y, r_k} = |y - r_k|$.
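In code, the two cost matrices and the V-shape property can be sketched as follows (a minimal illustration in our notation, using 0-based rank indices as numerical values):

```python
def classification_cost(K):
    """C[y][k] = 1 if the predicted rank index k differs from the true index y."""
    return [[0 if y == k else 1 for k in range(K)] for y in range(K)]

def absolute_cost(K):
    """C[y][k] = |y - k|, treating ranks as numerical values."""
    return [[abs(y - k) for k in range(K)] for y in range(K)]

def row_is_v_shaped(row, y):
    """Non-increasing up to the true rank y, non-decreasing afterwards."""
    left = all(row[k - 1] >= row[k] for k in range(1, y + 1))
    right = all(row[k] <= row[k + 1] for k in range(y, len(row) - 1))
    return left and right

K = 5
C_abs = absolute_cost(K)
# Every row of the absolute cost matrix is V-shaped around the true rank:
assert all(row_is_v_shaped(C_abs[y], y) for y in range(K))
print(C_abs[2])  # [2, 1, 0, 1, 2]
```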
In (Li & Lin, 2007), the researchers proposed a general reduction framework for extending an ordinal regression problem into several binary classification problems. This framework requires the use of a cost matrix that is convex in each row to obtain a rank-monotonic threshold model. Since the cost-related weighting of each binary task is specific to each training example, this approach was described as infeasible in practice due to its high training complexity (Niu et al., 2016). Our proposed CORAL framework requires neither a cost matrix with convex-row conditions nor explicit weighting terms that depend on each training example to obtain a rank-monotonic threshold model and to produce consistent predictions for each binary task. Moreover, CORAL allows for an optional task importance weighting, e.g., to adjust for label and class imbalances, which makes it more applicable in practice.
We propose the Consistent Rank Logits (CORAL) model for multi-label CNNs with ordinal responses. Within this framework, the binary tasks produce consistently ranked predictions.
Given the training dataset $D = \{\mathbf{x}^{[i]}, y^{[i]}\}_{i=1}^{N}$, we first extend a rank label $y^{[i]}$ into $K-1$ binary labels $y^{[i]}_1, \ldots, y^{[i]}_{K-1}$ such that $y^{[i]}_k \in \{0, 1\}$ indicates whether $y^{[i]}$ exceeds rank $r_k$, i.e., $y^{[i]}_k = \mathbb{1}\{y^{[i]} \succ r_k\}$. The indicator function $\mathbb{1}\{\cdot\}$ is $1$ if the inner condition is true and $0$ otherwise. Providing the extended binary labels as model inputs, we train a single CNN with $K-1$ binary classifiers in the output layer. Here, the binary tasks share the same weight parameter but have independent bias units, which solves the inconsistency problem among the predicted binary responses and reduces the model complexity.
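The label extension step can be written in a few lines (a sketch in our notation, representing ranks by their 1-based indices):

```python
def extend_labels(y, K):
    """Extend a rank index y (in 1..K) into K-1 binary labels, where the
    k-th label indicates whether y exceeds rank r_k."""
    return [1 if y > k else 0 for k in range(1, K)]

# A rank label of 3 out of K = 5 ranks exceeds r_1 and r_2 but not r_3, r_4:
print(extend_labels(3, 5))  # [1, 1, 0, 0]
```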
Based on the binary task responses, the predicted rank for an input $\mathbf{x}$ is then obtained via $h(\mathbf{x}) = r_q$, where

$q = 1 + \sum_{k=1}^{K-1} f_k(\mathbf{x})$ (1)

and $f_k(\mathbf{x}) \in \{0, 1\}$ is the prediction of the $k$th binary classifier in the output layer. We require that $\{f_k\}_{k=1}^{K-1}$ reflect the ordinal information and are rank-monotonic,

$f_1(\mathbf{x}) \geq f_2(\mathbf{x}) \geq \cdots \geq f_{K-1}(\mathbf{x})$, (2)

which guarantees that the predictions are consistent.
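In code, the rank prediction of Eq. (1) reduces to thresholding and summing the task outputs (a minimal sketch; the 0.5 probability threshold mirrors Eq. (5) below):

```python
def predict_rank_index(task_probas, threshold=0.5):
    """Rank index q = 1 + the sum of the binary task decisions."""
    return 1 + sum(1 for p in task_probas if p > threshold)

# Rank-monotonic (non-increasing) task probabilities, as CORAL guarantees:
probas = [0.9, 0.8, 0.6, 0.3]
print(predict_rank_index(probas))  # 4
```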
Let $W$ denote the weight parameters of the neural network excluding the bias units of the final layer. The penultimate layer, whose output is denoted as $g(\mathbf{x}^{[i]}, W)$, shares a single weight with all nodes in the final output layer; $K-1$ independent bias units are then added to $g(\mathbf{x}^{[i]}, W)$ such that $\{g(\mathbf{x}^{[i]}, W) + b_k\}_{k=1}^{K-1}$ are the inputs to the corresponding binary classifiers in the final layer. Let $\sigma(z) = 1 / (1 + \exp(-z))$ denote the logistic sigmoid function. The predicted empirical probability for task $k$ is defined as

$\widehat{P}(y^{[i]}_k = 1) = \sigma\big(g(\mathbf{x}^{[i]}, W) + b_k\big)$. (3)
For model training, we minimize the loss function

$L(W, \mathbf{b}) = -\sum_{i=1}^{N} \sum_{k=1}^{K-1} \lambda^{(k)} \Big[ \log\big(\sigma(g(\mathbf{x}^{[i]}, W) + b_k)\big)\, y^{[i]}_k + \log\big(1 - \sigma(g(\mathbf{x}^{[i]}, W) + b_k)\big)\, \big(1 - y^{[i]}_k\big) \Big]$, (4)

which is the weighted cross-entropy of the $K-1$ binary classifiers. For rank prediction (Eq. 1), the binary labels are obtained via

$f_k(\mathbf{x}^{[i]}) = \mathbb{1}\big\{\widehat{P}(y^{[i]}_k = 1) > 0.5\big\}$. (5)
In Eq. (4), $\lambda^{(k)}$ denotes the weight of the loss associated with the $k$th classifier (assuming $\lambda^{(k)} > 0$). In the remainder of the paper, we refer to $\lambda^{(k)}$ as the importance parameter for task $k$. Some tasks may be less robust or harder to optimize, which can be taken into consideration by choosing a non-uniform task weighting scheme. Also, in many real-world applications, features between certain adjacent ranks may have more subtle distinctions. For example, facial aging is commonly regarded as a non-stationary process (Ramanathan et al., 2009a) such that face feature transformations could be more detectable during certain age intervals. Moreover, the relative predictive performance of the binary tasks may also be affected by the degree of binary data imbalance for a given task that occurs as a side-effect of extending a rank label into $K-1$ binary labels. Hence, we hypothesize that choosing non-uniform task weighting schemes improves the predictive performance of the overall model. The choice of task importance parameters is covered in more detail in Section 3.5. Next, we provide a theoretical guarantee for classifier consistency under uniform and non-uniform task importance weighting given that the task importance weights are positive numbers.
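A framework-agnostic sketch of the forward computation (Eq. 3) and the importance-weighted cross-entropy loss (Eq. 4), written in plain Python for clarity rather than as the paper's PyTorch implementation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def coral_probas(g, biases):
    """Task probabilities sigma(g(x) + b_k): one shared logit g(x) per
    image plus K-1 independent bias units."""
    return [sigmoid(g + b) for b in biases]

def coral_loss(gs, ext_labels, biases, lambdas):
    """Importance-weighted cross-entropy over the K-1 binary tasks (Eq. 4).

    gs: shared logits g(x) per example; ext_labels: extended binary labels
    per example; lambdas: positive task importance weights."""
    total = 0.0
    for g, labels in zip(gs, ext_labels):
        for lam, b, yk in zip(lambdas, biases, labels):
            p = sigmoid(g + b)
            total -= lam * (yk * math.log(p) + (1 - yk) * math.log(1.0 - p))
    return total

# Non-increasing biases yield non-increasing task probabilities:
probas = coral_probas(0.2, [1.0, 0.0, -1.0])
assert probas == sorted(probas, reverse=True)
```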
In the following theorem, we show that by minimizing the loss $L$ (Eq. 4), the learned bias units of the output layer are non-increasing such that $b_1 \geq b_2 \geq \cdots \geq b_{K-1}$. Consequently, the predicted confidence scores or probability estimates of the $K-1$ tasks are decreasing, i.e., $\widehat{P}(y^{[i]}_1 = 1) \geq \widehat{P}(y^{[i]}_2 = 1) \geq \cdots \geq \widehat{P}(y^{[i]}_{K-1} = 1)$ for all $i$, ensuring classifier consistency. The task predictions $\{f_k\}_{k=1}^{K-1}$ given by Eq. 5 are also rank-monotonic.
Theorem 1 (ordered bias units). By minimizing the loss function $L$ defined in Eq. (4), the optimal solution $(W^*, \mathbf{b}^*)$ satisfies $b^*_1 \geq b^*_2 \geq \cdots \geq b^*_{K-1}$.
Proof. Suppose $(W, \mathbf{b})$ is an optimal solution and $b_k < b_{k+1}$ for some $k$. Claim: by either replacing $b_k$ with $b_{k+1}$ or replacing $b_{k+1}$ with $b_k$, we can decrease the objective value $L$. Let

$A_1 = \{n : y^{[n]}_k = y^{[n]}_{k+1} = 1\}$, $A_2 = \{n : y^{[n]}_k = y^{[n]}_{k+1} = 0\}$, $A_3 = \{n : y^{[n]}_k = 1,\; y^{[n]}_{k+1} = 0\}$.

By the ordering relationship $y^{[n]}_k \geq y^{[n]}_{k+1}$, we have $A_1 \cup A_2 \cup A_3 = \{1, 2, \ldots, N\}$. Denote $p_n(b) = \sigma\big(g(\mathbf{x}^{[n]}, W) + b\big)$ and

$\delta_n = \log\big(p_n(b_{k+1})\big) - \log\big(p_n(b_k)\big)$, $\quad \delta'_n = \log\big(1 - p_n(b_k)\big) - \log\big(1 - p_n(b_{k+1})\big)$.

Since $p_n(b)$ is increasing in $b$, we have $\delta_n > 0$ and $\delta'_n > 0$.

If we replace $b_k$ with $b_{k+1}$, the loss terms related to the $k$th task are updated. The change of loss (Eq. 4) is given as

$\Delta_1 L = \lambda^{(k)} \Big( -\sum_{n \in A_1} \delta_n - \sum_{n \in A_3} \delta_n + \sum_{n \in A_2} \delta'_n \Big)$.

Accordingly, if we replace $b_{k+1}$ with $b_k$, the change of $L$ is given as

$\Delta_2 L = \lambda^{(k+1)} \Big( \sum_{n \in A_1} \delta_n - \sum_{n \in A_2} \delta'_n - \sum_{n \in A_3} \delta'_n \Big)$.

By adding $\frac{1}{\lambda^{(k)}} \Delta_1 L$ and $\frac{1}{\lambda^{(k+1)}} \Delta_2 L$, we have

$\frac{1}{\lambda^{(k)}} \Delta_1 L + \frac{1}{\lambda^{(k+1)}} \Delta_2 L = -\sum_{n \in A_3} \big(\delta_n + \delta'_n\big) < 0$,

and know that either $\Delta_1 L < 0$ or $\Delta_2 L < 0$. Thus, our claim is justified, and we conclude that any optimal solution that minimizes $L$ satisfies $b_1 \geq b_2 \geq \cdots \geq b_{K-1}$. ∎
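The swapping argument in the proof can be checked numerically (a toy verification we added; the dataset, biases, and weights are arbitrary, chosen so that the set $A_3$ is non-empty):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(gs, ranks, biases, lambdas, K):
    """Weighted cross-entropy (Eq. 4) with ranks given as 1-based indices."""
    total = 0.0
    for g, y in zip(gs, ranks):
        for k, (lam, b) in enumerate(zip(lambdas, biases), start=1):
            yk = 1 if y > k else 0  # extended binary label for task k
            p = sigmoid(g + b)
            total -= lam * (yk * math.log(p) + (1 - yk) * math.log(1.0 - p))
    return total

K = 4
gs = [0.3, -0.2, 0.5, 0.1]
ranks = [1, 2, 2, 4]   # rank-2 examples make A_3 non-empty for k = 1
lambdas = [1.0, 1.0, 1.0]
b = [0.0, 0.7, -0.3]   # out of order: b_1 < b_2

L0 = loss(gs, ranks, b, lambdas, K)
L_up = loss(gs, ranks, [0.7, 0.7, -0.3], lambdas, K)    # replace b_1 by b_2
L_down = loss(gs, ranks, [0.0, 0.0, -0.3], lambdas, K)  # replace b_2 by b_1

# As the proof asserts, at least one replacement strictly decreases the loss:
assert min(L_up, L_down) < L0
```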
Note that the theorem for rank-monotonicity in (Li & Lin, 2007), in contrast to Theorem 1, requires the use of a cost matrix $\mathbf{C}$ with each row being convex. Under this convexity condition, let $\lambda^{(k)}_i = \big|C_{y^{[i]}, r_k} - C_{y^{[i]}, r_{k+1}}\big|$ be the weight of the loss of the $k$th task on the $i$th example, which depends on the label $y^{[i]}$. In (Li & Lin, 2007), the researchers proved that by using example-specific task weights $\lambda^{(k)}_i$, the optimal thresholds are ordered. This assumption requires that $\lambda^{(k)}_i \geq \lambda^{(k+1)}_i$ when $r_{k+1} \preceq y^{[i]}$, and $\lambda^{(k)}_i \leq \lambda^{(k+1)}_i$ when $r_k \succeq y^{[i]}$. Theorem 1 is free from this requirement and allows us to choose a fixed weight for each task that does not depend on the individual training examples, which greatly reduces the training complexity. Moreover, Theorem 1 allows for choosing either a simple uniform task weighting or taking dataset imbalances into account (Section 3.5) while still guaranteeing that the predicted probabilities are non-increasing and the task predictions are consistent.
Based on well-known generalization bounds for binary classification, we can derive new generalization bounds for our ordinal regression approach that apply to a wide range of practical scenarios, as we only require $C_{y, y} = 0$ and $C_{y, r_k} > 0$ for $r_k \neq y$. Moreover, Theorem 2 shows that if each binary classification task in our model generalizes well in terms of the standard 0/1-loss, the final rank prediction via $h$ (Eq. 1) also generalizes well.
Theorem 2 (reduction of generalization error). Suppose $\mathbf{C}$ is the cost matrix of the original ordinal label prediction problem, with $C_{y, y} = 0$ and $C_{y, r_k} > 0$ for $r_k \neq y$. $P$ is the underlying distribution of $(\mathbf{x}, y)$, i.e., $(\mathbf{x}, y) \sim P$. If $\{f_k\}_{k=1}^{K-1}$ are rank-monotonic, then

$\mathbb{E}_{(\mathbf{x}, y) \sim P}\, C_{y, h(\mathbf{x})} \leq \sum_{k=1}^{K-1} \mathbb{E}_{(\mathbf{x}, y) \sim P} \Big[ \big| C_{y, r_k} - C_{y, r_{k+1}} \big| \, \mathbb{1}\{f_k(\mathbf{x}) \neq y_k\} \Big]$. (6)
Proof. For any $(\mathbf{x}, y)$ with $y = r_s$, we have $h(\mathbf{x}) = r_q$, where $q = 1 + \sum_{k=1}^{K-1} f_k(\mathbf{x})$. It suffices to show the pointwise inequality $C_{y, h(\mathbf{x})} \leq \sum_{k=1}^{K-1} |C_{y, r_k} - C_{y, r_{k+1}}| \, \mathbb{1}\{f_k(\mathbf{x}) \neq y_k\}$.

If $q = s$, then $C_{y, h(\mathbf{x})} = C_{y, y} = 0$ and the inequality holds trivially.

If $q < s$, then, since $\{f_k\}$ are rank-monotonic, $f_k(\mathbf{x}) = 1$ for $k \leq q - 1$ and $f_k(\mathbf{x}) = 0$ for $k \geq q$. Also, $y_k = 1$ for $k \leq s - 1$ and $y_k = 0$ for $k \geq s$. Thus, $\mathbb{1}\{f_k(\mathbf{x}) \neq y_k\} = 1$ if and only if $q \leq k \leq s - 1$. Since $C_{y, r_s} = 0$,

$C_{y, h(\mathbf{x})} = C_{y, r_q} = \big| C_{y, r_q} - C_{y, r_s} \big| = \Big| \sum_{k=q}^{s-1} \big( C_{y, r_k} - C_{y, r_{k+1}} \big) \Big| \leq \sum_{k=q}^{s-1} \big| C_{y, r_k} - C_{y, r_{k+1}} \big| = \sum_{k=1}^{K-1} \big| C_{y, r_k} - C_{y, r_{k+1}} \big| \, \mathbb{1}\{f_k(\mathbf{x}) \neq y_k\}$.

Similarly, if $q > s$, then $\mathbb{1}\{f_k(\mathbf{x}) \neq y_k\} = 1$ if and only if $s \leq k \leq q - 1$, and

$C_{y, h(\mathbf{x})} = \big| C_{y, r_q} - C_{y, r_s} \big| \leq \sum_{k=s}^{q-1} \big| C_{y, r_k} - C_{y, r_{k+1}} \big| = \sum_{k=1}^{K-1} \big| C_{y, r_k} - C_{y, r_{k+1}} \big| \, \mathbb{1}\{f_k(\mathbf{x}) \neq y_k\}$.

In any case, we have

$C_{y, h(\mathbf{x})} \leq \sum_{k=1}^{K-1} \big| C_{y, r_k} - C_{y, r_{k+1}} \big| \, \mathbb{1}\{f_k(\mathbf{x}) \neq y_k\}$.

By taking the expectation on both sides with $(\mathbf{x}, y) \sim P$, we arrive at Eq. (6). ∎
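As a sanity check (our own addition, not part of the paper), the pointwise inequality behind Eq. (6) can be verified exhaustively for the absolute cost matrix and all rank-monotonic output patterns:

```python
def pointwise_bound_holds(s, f, K):
    """True rank index s (1-based) and rank-monotonic binary outputs f."""
    q = 1 + sum(f)                                # predicted rank index (Eq. 1)
    C = [abs(s - k) for k in range(1, K + 1)]     # row s of the absolute cost matrix
    y = [1 if s > k else 0 for k in range(1, K)]  # extended binary labels
    rhs = sum(abs(C[k - 1] - C[k]) for k in range(1, K) if f[k - 1] != y[k - 1])
    return C[q - 1] <= rhs

K = 6
for s in range(1, K + 1):
    for m in range(K):  # every rank-monotonic pattern: m ones, then zeros
        f = [1] * m + [0] * (K - 1 - m)
        assert pointwise_bound_holds(s, f, K)
print("bound holds for all rank-monotonic patterns")
```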
In (Li & Lin, 2007), by assuming the cost matrix to have V-shaped rows, the researchers derived generalization bounds by constructing a discrete distribution on $\mathcal{Y}$ conditional on each $\mathbf{x}$, given that the binary classifications are rank-monotonic or every row of $\mathbf{C}$ is convex. However, the only case they provided for the existence of rank-monotonic binary classifiers was the ordered threshold model, which requires a cost matrix with convex rows and example-specific task weights. Our result does not rely on cost matrices with V-shaped or convex rows and can be applied to a broader variety of real-world use cases.
According to Theorem 1, minimizing the loss of the CORAL model guarantees that the bias units are non-increasing and thus the binary classifiers are consistent as long as the task importance parameters are positive ($\lambda^{(k)} > 0$ for $k = 1, \ldots, K-1$).
We first experimented with a weighting scheme proposed in (Niu et al., 2016) that aims to address the class imbalance in the face image datasets. However, compared to using a uniform scheme ($\lambda^{(k)} = 1$), we found that it had a negative effect on the predictive performance for all models evaluated in this study.
Hence, we propose a weighting scheme that takes the rank distribution of the training examples into account but also considers the label imbalance for each classification task after extending the original ranks into binary labels. Specifically, our task weighting scheme (under which CORAL still guarantees classifier consistency) is defined as follows. Let $N_k = \sum_{i=1}^{N} \mathbb{1}\{y^{[i]} \succ r_k\}$ be the number of examples whose ranks exceed $r_k$. By the rank ordering, we have $N_1 \geq N_2 \geq \cdots \geq N_{K-1}$. Let $M_k = \max(N_k, N - N_k)$ be the number of examples with the majority binary label for each task. We define the importance of the $k$th task as the scaled $\sqrt{M_k}$:

$\lambda^{(k)} = \frac{\sqrt{M_k}}{\sum_{j=1}^{K-1} \sqrt{M_j}}$. (7)
Under this weighting scheme, the general class imbalance of a dataset is taken into account. Moreover, in our examples, classification tasks corresponding to the edges of the distribution of unique rank labels receive a higher weight than the classification tasks that see more balanced rank label vectors during training (Figure 1), which may help improve the predictive performance of the model. The lowest weight may not always be assigned to the center rank: if $N_{K-1} > N/2$, the last task has the lowest weight, and if $N_1 < N/2$, the first task has the lowest weight. It shall be noted that the task importance weighting is only used for model parameter optimization; when computing the predicted rank by adding the binary results (Eq. 1), each task has the same influence on the final rank prediction. Since $M_k \geq N/2$, it prevents tasks from having negligible weights as in (Niu et al., 2016) when a dataset contains only a small number of examples for certain ranks. We provide an empirical comparison between a uniform task weighting and task weighting according to Eq. (7) in Section 5.2.
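A sketch of a label-imbalance-aware weighting in this spirit (the square-root scaling and sum-normalization here are our illustrative assumptions; what matters for Theorem 1 is only that all weights are positive):

```python
import math

def task_importance_weights(ranks, K):
    """Weights proportional to sqrt(M_k), where M_k is the size of the
    majority binary label of task k (an illustrative normalization)."""
    N = len(ranks)
    weights = []
    for k in range(1, K):
        N_k = sum(1 for y in ranks if y > k)  # examples whose rank exceeds r_k
        M_k = max(N_k, N - N_k)               # majority label count, >= N/2
        weights.append(math.sqrt(M_k))
    total = sum(weights)
    return [w / total for w in weights]

# Imbalanced toy rank distribution over K = 5 ranks (1-based indices):
ranks = [1] * 2 + [2] * 10 + [3] * 10 + [4] * 2 + [5] * 1
lams = task_importance_weights(ranks, 5)
assert all(l > 0 for l in lams)   # positivity, as required by Theorem 1
assert lams[3] > lams[1]          # edge task weighted higher than a middle task
```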
The MORPH-2 dataset (Ricanek & Tesafaye, 2006) (55,608 face images) was preprocessed by locating the average eye position in the respective dataset using facial landmark detection (Sagonas et al., 2016) via MLxtend (Raschka, 2018) and then aligning each image in the dataset to the average eye position. The faces were then re-aligned such that the tip of the nose was located in the center of each image. The age labels used in this study ranged between 16 and 70 years. The CACD database (Chen et al., 2014) was preprocessed similarly to MORPH-2 such that the faces spanned the whole image with the nose tip being in the center. The total number of images is 159,449 in the age range of 14-62 years.
Since the faces were already centered in the Asian Face Database (AFAD; 165,501 faces with age labels between 15-40 years) (Niu et al., 2016), no further alignment was applied. The UTKFace database (Zhang & Qi, 2017) was also available in a preprocessed form such that no additional steps were required. In this study, we considered face images with age labels between 21 and 60 years (16,434 images).
Each image database was randomly divided into 80% training data and 20% test data. All images were resized to 128x128x3 pixels and then randomly cropped to 120x120x3 pixels to augment the model training. During model evaluation, the 128x128x3 face images were center-cropped to a model input size of 120x120x3.
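The cropping logic can be sketched independently of any image library (our own minimal helper, not the paper's code; crop boxes are (left, top, right, bottom) pixel coordinates):

```python
import random

def random_crop_box(img_size, crop_size, rng=random):
    """Random crop window for training-time augmentation."""
    left = rng.randrange(img_size - crop_size + 1)
    top = rng.randrange(img_size - crop_size + 1)
    return (left, top, left + crop_size, top + crop_size)

def center_crop_box(img_size, crop_size):
    """Deterministic center crop used during model evaluation."""
    offset = (img_size - crop_size) // 2
    return (offset, offset, offset + crop_size, offset + crop_size)

# 128x128 source images cropped to the 120x120 model input size:
assert center_crop_box(128, 120) == (4, 4, 124, 124)
left, top, right, bottom = random_crop_box(128, 120)
assert 0 <= left <= 8 and right - left == 120
```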
To evaluate the performance of CORAL for age estimation from face images, we chose the ResNet-34 architecture (He et al., 2016), which is a modern CNN architecture that is known for achieving good performance on a variety of image classification tasks. For the remainder of this paper, we refer to the original ResNet-34 CNN with cross entropy loss as CE-CNN. To implement CORAL, we replaced the last output layer with the corresponding binary tasks (Figure 2) and refer to this CNN as CORAL-CNN. Similar to CORAL-CNN, we replaced the cross-entropy layer of the ResNet-34 with the binary tasks for ordinal regression described in (Niu et al., 2016) and refer to this architecture as OR-CNN.
For model evaluation and comparison, we computed the mean absolute error (MAE) and root mean squared error (RMSE), which are standard metrics used for crowd-counting and age prediction:

$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \big| y^{[i]} - h(\mathbf{x}^{[i]}) \big|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \big( y^{[i]} - h(\mathbf{x}^{[i]}) \big)^2},$

where $y^{[i]}$ is the ground truth rank of the $i$th test example and $h(\mathbf{x}^{[i]})$ is the predicted rank, respectively. The MAE and RMSE values reported in this study were computed on the test set after the last training epoch. The training was repeated three times with different random seeds for model weight initialization, while the random seeds were consistent between the different methods to allow for fair comparisons. All CNNs were trained for 200 epochs with stochastic gradient descent via adaptive moment estimation (Adam) (Kingma & Ba, 2015) using the exponential decay rates $\beta_1 = 0.9$ and $\beta_2 = 0.999$ (PyTorch defaults) and a fixed learning rate.
In addition, we computed the cumulative score (CS) as the proportion of images for which the absolute difference between the predicted rank label and the ground truth does not exceed a threshold $\alpha$:

$\mathrm{CS}(\alpha) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\big\{ \big| y^{[i]} - h(\mathbf{x}^{[i]}) \big| \leq \alpha \big\}.$

By varying the threshold $\alpha$, CS curves were plotted to compare the predictive performances of the different age prediction models (the larger the area under the curve, the better).
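The three evaluation metrics can be implemented directly from their definitions (a plain-Python sketch):

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted rank labels."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted rank labels."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def cumulative_score(y_true, y_pred, threshold):
    """Proportion of predictions within `threshold` years of the ground truth."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= threshold)
    return hits / len(y_true)

y_true = [25, 30, 41, 52]
y_pred = [27, 30, 38, 50]
print(mae(y_true, y_pred))                  # 1.75
print(cumulative_score(y_true, y_pred, 2))  # 0.75
```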
All loss functions and neural network models were implemented in PyTorch 1.0 (Paszke et al., 2017) and trained on NVIDIA GeForce 1080Ti and Titan V graphics cards. The source code is available at https://github.com/Raschka-research-group/coral-cnn.
Table 1: Age prediction errors on the test sets (MAE / RMSE; AVG ± SD over three random seeds).

| Method | MORPH-2 | AFAD | UTKFace | CACD |
|---|---|---|---|---|
| CE-CNN (AVG ± SD) | 3.39 ± 0.02 / 4.89 ± 0.01 | 3.98 ± 0.02 / 5.54 ± 0.04 | 6.37 ± 0.18 / 8.88 ± 0.25 | 6.14 ± 0.04 / 8.84 ± 0.04 |
| OR-CNN (Niu et al., 2016), seed 0 | 2.98 / 4.26 | 3.66 / 5.10 | 5.71 / 8.11 | 5.53 / 7.91 |
| OR-CNN (AVG ± SD) | 2.97 ± 0.01 / 4.24 ± 0.03 | 3.68 ± 0.02 / 5.13 ± 0.02 | 5.74 ± 0.05 / 8.08 ± 0.06 | 5.52 ± 0.02 / 7.93 ± 0.05 |
| CORAL-CNN (AVG ± SD) | 2.64 ± 0.04 / 3.68 ± 0.06 | 3.49 ± 0.03 / 4.85 ± 0.05 | 5.47 ± 0.01 / 7.62 ± 0.01 | 5.39 ± 0.16 / 7.66 ± 0.14 |
We conducted a series of experiments on four independent face image datasets for age estimation (Section 4.1) to compare our CORAL approach (CORAL-CNN) with the ordinal regression approach described in (Niu et al., 2016), denoted as OR-CNN. All implementations were based on the ResNet-34 architecture as described in Section 4.2, including the standard ResNet-34 with cross-entropy loss (CE-CNN) as performance baseline.
Table 2: Test set MAE (AVG ± SD over three random seeds) with and without task importance weighting (Eq. 7).

| Method | Importance weighting | MORPH-2 | AFAD | UTKFace | CACD |
|---|---|---|---|---|---|
| OR-CNN | no | 2.97 ± 0.01 | 3.68 ± 0.02 | 5.74 ± 0.05 | 5.52 ± 0.02 |
| OR-CNN | yes | 2.91 ± 0.02 | 3.65 ± 0.03 | 5.76 ± 0.19 | 5.49 ± 0.02 |
| CORAL-CNN | no | 2.64 ± 0.04 | 3.49 ± 0.03 | 5.47 ± 0.01 | 5.39 ± 0.16 |
| CORAL-CNN | yes | 2.59 ± 0.03 | 3.48 ± 0.03 | 5.39 ± 0.07 | 5.35 ± 0.09 |
First, we note that for all methods, the overall predictive performance on the different datasets appears in the following order: MORPH-2 > AFAD > CACD > UTKFace (Table 1 and Figure 3). Possible reasons why all approaches perform best on MORPH-2 are that MORPH-2 has the best overall image quality and relatively consistent lighting conditions and viewing angles. For instance, we found that AFAD includes some images of particularly low resolution (e.g., 20x20 pixels). While UTKFace and CACD also contain some lower-quality images, a possible reason why the methods perform worse on UTKFace compared to AFAD is that UTKFace is about ten times smaller than AFAD. While CACD is approximately the same size as AFAD, the lower performance can be explained by the wider age range that needs to be considered (14-62 years in CACD compared to 15-40 years in AFAD).
Across all datasets (Table 1 and Figure 3), we found that both OR-CNN and CORAL-CNN outperform the standard cross-entropy classification CNN (CE-CNN) on these ordinal regression tasks, as expected. Similarly, as summarized in Table 1 and Figure 3, our CORAL method shows a substantial improvement over the current state-of-the-art ordinal regression method (OR-CNN) by (Niu et al., 2016), which does not guarantee classifier consistency. Moreover, we repeated each experiment three times using different random seeds for model weight initialization and dataset shuffling to ensure that the observed performance improvement of CORAL-CNN over OR-CNN is reproducible and not coincidental. Furthermore, along with providing the theoretical proof for classifier consistency in CORAL-CNN (Theorem 1), we also empirically verified that the bias units of the CORAL-CNN output layers were indeed ordered after model training, in contrast to OR-CNN. From these results, we can conclude that guaranteed classifier consistency via CORAL has a substantial, positive effect on the predictive performance of an ordinal regression CNN.
While all results described in the previous section are based on experiments without task importance weighting (i.e., $\lambda^{(k)} = 1$), we repeated all experiments using our weighting scheme proposed in Section 3.5, which takes label imbalances into account. Note that, according to Theorem 1, CORAL still guarantees classifier consistency under any chosen task weighting scheme as long as the weights are assigned positive values. From the results provided in Table 2, we find that using a task weighting scheme that takes label imbalances into account further improves the performance of CORAL-CNN across all four datasets.
In this paper, we developed the CORAL framework for ordinal regression via extended binary classification with theoretical guarantees for classifier consistency. Moreover, we proved classifier consistency without requiring rank- or training label-dependent weighting schemes, which permits straightforward implementations and efficient model training. Furthermore, the theoretical generalization bounds assure that if the binary tasks generalize well, then the final rank prediction also generalizes well. We also showed that CORAL could be readily implemented to extend CNNs for ordinal regression tasks and evaluated it empirically on four large image databases for predicting the apparent age from face images. The results unequivocally showed that the guaranteed classifier consistency via CORAL substantially improved the predictive performance of CNNs for age estimation. While we evaluated the CORAL framework in an end-to-end learning approach using CNNs for age estimation, our method can be readily generalized to other ordinal regression problems and different types of neural network architectures, including multilayer perceptrons and recurrent neural networks.
Support for this research was provided by the Office of the Vice Chancellor for Research and Graduate Education at the University of Wisconsin-Madison with funding from the Wisconsin Alumni Research Foundation. Also, we thank the NVIDIA Corporation for a generous donation via an NVIDIA GPU grant to support this study.
Chen, B.-C., Chen, C.-S., and Hsu, W. H. Cross-age reference coding for age-invariant face recognition and retrieval. In Proceedings of the European Conference on Computer Vision, pp. 768–783. Springer, 2014.
Chen, S., Zhang, C., Dong, M., Le, J., and Rao, M. Using Ranking-CNN for age estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5183–5192, 2017.
Pan, H., Han, H., Shan, S., and Chen, X. Mean-variance loss for deep age estimation from a face. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5285–5294, 2018.
Parra, D., Karatzoglou, A., Amatriain, X., and Yavuz, I. Implicit feedback recommendation via implicit-to-explicit ordinal logistic regression mapping. In Proceedings of the CARS Workshop of the Conference on Recommender Systems, pp. 5, 2011.
Raschka, S. MLxtend: Providing machine learning and data science utilities and extensions to Python’s scientific computing stack. The Journal of Open Source Software, 3(24), 2018.
Zhang, Z., Song, Y., and Qi, H. Age progression/regression by conditional adversarial autoencoder. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.