Deep Ordinal Regression with Label Diversity

by   Axel Berg, et al.

Regression via classification (RvC) is a common method used for regression problems in deep learning, where the target variable belongs to a set of continuous values. By discretizing the target into a set of non-overlapping classes, it has been shown that training a classifier can improve neural network accuracy compared to using a standard regression approach. However, it is not clear how the set of discrete classes should be chosen and how it affects the overall solution. In this work, we propose that using several discrete data representations simultaneously can improve neural network learning compared to a single representation. Our approach is end-to-end differentiable and can be added as a simple extension to conventional learning methods, such as deep neural networks. We test our method on three challenging tasks and show that our method reduces the prediction error compared to a baseline RvC approach while maintaining a similar model complexity.



There are no comments yet.


page 1

page 2

page 3

page 4


Learning a binary search with a recurrent neural network. A novel approach to ordinal regression analysis

Deep neural networks are a family of computational models that are natur...

Improving the Performance of Neural Networks in Regression Tasks Using Drawering

The method presented extends a given regression neural network to make i...

Algorithmic Probability-guided Supervised Machine Learning on Non-differentiable Spaces

We show how complexity theory can be introduced in machine learning to h...

Structured (De)composable Representations Trained with Neural Networks

The paper proposes a novel technique for representing templates and inst...

Backdoor Attack and Defense for Deep Regression

We demonstrate a backdoor attack on a deep neural network used for regre...

Deep Neural Networks for Rank-Consistent Ordinal Regression Based On Conditional Probabilities

In recent times, deep neural networks achieved outstanding predictive pe...

More layers! End-to-end regression and uncertainty on tabular data with deep learning

This paper attempts to analyze the effectiveness of deep learning for ta...

Code Repositories


Deep Ordinal Regression with Label Diversity

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Choosing the right objective function is a crucial part of successfully training an accurate and generalized regression model, for example a deep neural network. Among the standard objective functions are the mean squared error (MSE, or loss), the mean absolute error (MAE, or -loss) and hybrid variants such as the Huber loss. Much attention has also been given to deriving problem-specific objective functions that incorporate certain aspects of the target variable, such as modularity and norm constraints for geometric regression. It is also possible to treat a continuous dependent variable as belonging to a finite number of discrete classes, although this necessarily comes at the expense of introducing a discretization error. Such approaches are known as regression via classification (RvC) and they are frequently used in tasks where a regression loss would at first seem more natural [1, 2, 3, 4, 5, 6].

Ordinal regression techniques can be applied to classification problems where the dependent variable exhibits a relative ordering. Techniques for ordinal regression can be applied to RvC problems in order to preserve the ordinal structure of the labels, and recent work has shown that this method can be used to improve the accuracy in several regression problems, such as age estimation

[7] [8], image ranking and depth estimation [9].

One problem with the RvC approach is the ambiguity in how the discrete classes should be created from the distribution of the dependent variable. The standard approach is to create bins of equal width covering the target output range. For skewed distributions one can also apply the method of equal frequency, where the bins are created from the cumulative distribution function of the target, such that each bin contains the same number of training examples. Regardless of the method, the number of bins must be selected and optimized for the given task, which raises the question of what the optimal number of bins is for a given problem. If the bin-width is too small, this can result in few training examples in each class, but if it is too large, the discretization error can become a limiting factor.

The ambiguity in how to bin a continuous variable leads to a diverse range of possibilities in how to represent the target values — a fact that can be exploited. From previous research it is well known that a diverse ensemble of individual predictors can be combined in order to reduce the overall prediction error. Such approaches often involves training multiple regressors where the diversity is ensured by either data augmentation or model selection. This leads to increased overhead at both training and inference time [10]. However, with the use of label diversity and deep neural networks, one can create a multi-output predictor that enforces diversity without extra computational complexity.


In this paper, we show that a collection of different binning variants of the target values can be used to improve prediction accuracy without increasing the computational complexity compared to a standard classifier. We do so by training a deep multi-output convolutional neural network (CNN) to classify training examples in multiple overlapping bins simultaneously. By doing so, we can effectively take into account the ordinal structure of the regression problem, while also making use of the diversity of the different possible representations of the target variable. We demonstrate our method on a number of different tasks and show competitive results compared to current state-of-the-art methods.

Fig. 1: An illustration of the neural network architecture with multiple output heads. At inference time the expected values of the different distributions are combined using an ensemble average.

Ii Related Work

Ii-a Methods

Ii-A1 Ordinal Regression

Ordinal regression, or ranking learning, is used for problems where the target variable exhibits a relative ordering on an arbitrary scale, e.g. categories such as ”bad”, ”good” and ”very good”. When performing vanilla RvC, the ordinal information contained in the target values is lost, but this can be resolved by using ordinal regression techniques. A common variant is to use extended binary classification, where a single multi-class classification problem is reduced to a set of several binary classification problems [11]. With the advance of deep learning in recent years, ordinal regression has been used successfully for several tasks, including monocular depth estimation [12], age estimation [7]

, head pose estimation

[13], medical diagnosis [14] and historical image dating [15].

Ii-A2 Ensemble Learning

The fact that a set of individual regressors or classifiers can be combined into an ensemble in order to reduce the overall prediction accuracy of a model is frequently exploited in machine learning research


. A key notion in ensemble learning is the bias-variance-covariance decomposition, which says that in order for an ensemble to reduce the prediction error, there has to be some variance in the predictions of the ensemble members. Therefore, the aim should be to create ensembles that consist of accurate but diverse predictors.

Diversity can be created in several ways, most commonly through methods like bagging [16] and boosting [17], which rely on different forms of data diversity. However, using such method comes at the cost of extra model complexity and training time. Recently, Zhang et al. [18] proposed a framework for training an ensemble of networks using negative correlation learning, where the high level features are learned via parameter sharing. This reduces the overhead, while keeping the benefits of a diverse ensemble. In general, a multi-output neural network can be used to form an ensemble, which we exploit in our proposed method.

Ii-B Relevant Applications

Ii-B1 Age Estimation

Deep learning methods have been used successfully for age estimation, where the task consists of predicting the age of a person given a single RGB image of the person’s face. Depending on the implementation, a person’s age can be considered either as a continuous variable on the positive real numbers, or as a class belonging to a set of discrete positive integers. Rothe et al. [6] first highlighted the use of end-to-end training of CNNs for age estimation from a single image. The authors noted that classification yielded better results than direct regression and since then, several new methods using ordinal regression techniques have been published.

Agustsson et al. [19]

performed end-to-end piecewise linear regression by assigning each regressor to an anchor point. Others have observed that it is easier for a human to distinguish differences in age between two persons, rather than their absolute age and used this as a design principle for ordinal regression

[20, 7]

. Alternative methods have focused on various soft encodings of the age over classes, where the elements of the probability vector are proportional to the distance from the true class

[21, 22, 9]. In this way both the ordinal and metric information can be effectively encoded in the labels. Furthermore, it has also been shown that forcing the output of the classifier to be rank consistent over ages can improve the overall accuracy [8, 23].

Ii-B2 Head Pose Estimation

Head pose estimation is the task of predicting the pose of a human head with three degrees of freedom, given an image and possibly depth information

[24]. There are several ways to represent the pose, including three rotation angles (pitch, yaw and roll) with respect to a set of principal axes, a

rotation matrix or a single quaternion. During the past years, several CNN-based head pose estimators trained on specific loss functions have been proposed. Chang et al.

[25] combined direct head pose regression with facial landmark detection and used it for facial alignment. Ruiz et al. [26] used a multi-loss CNN and showed that using a balanced hybrid variant of regression and classification yielded improvements over previous methods. Ordinal methods have also been used, such as in [13], where a combined ranking and MSE regression loss is used in conjunction with a quaternion representation of the head pose. The results indicated that training the CNN to regress the three angles while simultaneously solving several binary ranking problems improved the prediction accuracy compared to a standard regression or classification baseline.

Ii-B3 Historical Image Dating

Palermo et al. [27]

first introduced the task of automatically estimating the historical time in which a color photograph was taken using machine learning techniques. The authors noted that there are several features of the color imaging process that are typical to the era in which the images were taken, such as hue, saturation and color histogram. They then used a support vector machine to classify the images into different decades and showed that this method vastly outperformed untrained humans in terms of classification accuracy. Ginosar et al.

[28] used American high school yearbooks to train a deep neural network for the same task, but in this case the extracted features were also dependent on the image content, e.g. facial attributes and hairstyle. Recently, ordinal regression techniques have also been applied successfully to the task of image dating [15, 29].

Iii Proposed Method

Fig. 2: Examples of how the sets of class intervals can be constructed using the different methods.

Iii-a Label Diversity by Overlapping Bins

Let denote the input (independent variable) and let be the corresponding target value (dependent value) for . Now let be a set of of non-overlapping intervals on the real line , as shown in Figure 2a, such that covers the samples for all . The standard RvC approach would now be to map each target value to a unique class

and train a classifier to predict the posterior probability over classes


In order to create label diversity, we instead introduce new sets of class intervals for , such that , . By doing so, we have created several new discretizations that all cover the support of the target value, but in different ways. Here we do not provide an answer to exactly how these discretizations ought to be chosen, since this in itself is an optimization problem where the solution most likely depends on the problem domain, but instead we focus on the two following possibilities, where we assume that , , such that each discretization contains equally many classes:

  1. Assuming that we do not want to increase the complexity of the training algorithm compared to the standard RvC approach, we fix the total number of classes such that . Then we create discretizations, each containing equally wide bins, such that the overlap between and is fixed. An illustration of this approach is shown in Figure 2b. We refer to this method as ”equal width”.

  2. In order to maximize diversity between different discretizations, for each , we randomly sample classes (with replacement) from and let these classes be the centers of the new bins in , such that target values that do not belong to any of the chosen classes are assigned to the nearest neighbor in the sample. An illustration of this approach is shown in Figure 2c We refer to this method as ”randomized bins”.

If the target is multi-dimensional, the same methods can be applied by creating classes for each dimension individually.

Iii-B Backpropagation

The proposed methods can be used together with a neural network by replacing the last fully connected layer and activation with fully connected layers of size in parallel, where each layer has its own softmax activation. The network is trained by minimizing the sum of the negative cross-entropies between the individual classifier and the targets over each mini-batch of size . The loss function then becomes



is a one-hot encoding of the binned target value, such that the only non-zero element is the one where the bin overlaps the true target:


The predictions are computed using the individual softmax heads of size , and the loss is then back-propagated through the network by differentiating with respect to the predictions as shown in Figure 3.

In order to see how the loss function incorporates the ordinal relationship between the targets, we can again consider Figure 2. In order to make a correct prediction, each softmax head must output a high probability for the correct class in each discretization. If the output probabilities are correct for only a subset of the different discretizations (which implies that the prediction is slightly off) then this will be penalized by an increased loss.

Fig. 3:

During training, the loss is backpropagated through the different softmax heads to the previous layers.

Iii-C Inference

At inference time, the output posterior should be evaluated and converted to a point estimate of the target value by a hard decision. For standard RvC methods, this is typically done by either taking the expected value of the output distribution over classes or using the maximum-a-posteriori estimator. Therefore, we propose two similar methods to perform inference.

For problems where the target value belongs to the real line, we do this by computing the expected value over each estimated posterior distribution as


where denotes the mean value of the bin . This gives us different estimates of the target, which are then combined using a weighted average. Here we only consider uniform weighting of the individual estimates, i.e.


This is a form of ensemble average, and we note that we can decompose an individual error made by the ensemble using the ambiguity decomposition [30] as


which shows that the ensemble error is less than the average error of the individual estimates if there is enough variance within the ensemble. In this case, the non-zero variance is guaranteed by the different weights associated with each expected value in equation (3).

On the other hand, if the target belongs to a set of ordinal classes where we cannot define a useful distance metric, it is more suitable to use the MAP estimate. We do this by marginalizing over in order to estimate the average posterior over the original classes as


where we assume that the conditional probability is independent of the input , i.e.


Then we compute the MAP estimate as


This shows that our method can be used both for true regression problems where we can measure distances between target values and for classification problems where an ordering of targets exists, but without a well-defined distance. In section V we show experiments for tasks of both the first and second kind.

Iv An Illustrative Example

In order to demonstrate the ensemble-like effect of our method, we train a shallow CNN on the task of predicting the rotation angle of digits from the MNIST dataset of handwritten digits [31]. The dataset consists of 5,000 training images and 5,000 test images, where a digit is rotated by an integer drawn uniformly in the interval . We implemented label diversity using the randomized bins method for all combinations of and . We then trained a five-layer CNN, with softmax heads, each containing output units, to predict the one-hot encoded labels of the rotation angles. At inference time, we used equation (4) for prediction and evaluated the MAE on the test set. For comparison, we also trained a regression baseline by using the same shallow CNN with the MSE loss, but where the softmax heads were replaced by a single output unit with a linear activation. The training process was repeated for 10 random initializations of the network and the MAE was averaged over the 10 different trials.

The results of the experiment is presented in Figure 4. Here we clearly see the ensemble-like effect of our method, where the error decreases as the number of softmax heads is increased. This is expected from the error decomposition in equation (5). Additionally, we note that increasing the number of output units , which leads to a decreased discretization error caused by the binning of the target, does not necessarily imply a decrease in prediction error. In this experiment, yielded the smallest MAE for all values of . This agrees with previous findings, namely that too few bins leads to a large discretization error, but too many can lead to poor convergence [9, 12]. In general, the optimal number of bins depends on the specific task and finding it therefore requires an extensive parameter search.

Fig. 4: The MAE on the rotated MNIST datasets using the randomized bins method for different combinations of and

V Experiments

We compare our method with direct regression and classification baselines, as well as current state of the art methods on three challenging datasets. In order to make fair comparisons of the methods, we use the same baseline architecture for all experiments. More specifically, we use a pre-trained ResNet50 [32]

and replace the final fully connected layer with one fully connected layer of size 2048, a ReLU activation, and then a method-specific layer. For the direct regression approach (referred to as ”Direct”), we use a fully connected layer of size 1 with a linear activation and train it by minimizing the MSE. For the RvC approach, we use one fully connected layer of size

with a single softmax activation and train it using the cross-entropy loss function. For our own method we use fully connected layers of size

, with an individual softmax activation function on each layer, and train it by minimizing the sum of the individual cross-entropies for each discretization as in equation (


For all datasets, we train the network for 30 epochs using the ADAM optimizer

[33] with a mini-batch size of 32, a learning rate of 0.0005 that is decreased by a factor of 0.1 every 10th epoch, and an

-regularization factor of 0.001 on the weights. For data augmentation, we use random horizontal flipping of the images and apply a uniformly distributed random translation and scaling between [-20, 20] pixels and [0.7, 1.4] respectively. All results are averaged over 10 trials with different random initializations of the last fully connected layers. The experiments were implemented in Matlab and the network training was done using an NVIDIA Titan V graphics card. The source code for our experiments is available at

V-a Age Estimation

Fig. 5: A subset of the images in UTKFace dataset [34] with ground truth (GT) labels and predictions using the equal width method.

For age estimation, we test our method on the UTKFace dataset [34], which provides a large collection of images with human subjects labeled with ground truth ages. More specifically, we use the same train-test split of the data as in [23] and [35], where the ages are in the set of integers . The subset contains 13,144 training images and 3,287 test images. Furthermore, the images have been cropped such that only the faces of the subjects are visible, as is shown in the examples in Figure 5.

For the RvC baseline, we simply use one class per age group in years, such that . We compare the baseline with two implementations of the proposed method: equal-width overlapping bins and randomized bins. For the equal width approach we let and , such that , which means that the network complexity similar to the complexity of the RvC baseline. In practice this means that the first output head will classify images into age categories of {21—25, 26—30, …, 56—60}, while the second output head will have {21, 22—26, 27—31, …, 57—60} and so on.

For the randomized bins approach, we let and and draw a new sample of randomized bins for each run. This approach leads to a increased number of output heads compared to the RvC baseline, but the increase in model complexity is still small in relation to the size of the CNN backbone.

The results are shown and compared to current state-of-the art methods in Table 1, where we have evaluated the mean average error (MAE) of the different methods on the test set. The results are averaged over the trials with different random seeds and presented with the corresponding standard deviation. Of the two baselines, the direct method performs best. Since this is perhaps the most natural design choice, it should not come as a surprise, although other papers have reported contrary results on age estimation tasks

[6]. Both the equal width and randomized bins approach improve over the RvC baseline significantly, and they yield a small improvement over the direct method, which show that for this task, our proposed methods are more effective than the baseline methods. Furthermore, we reduce the average error compared to current state of the art by 2%. We hypothesize that this error reduction is due to the diverse representation of the target values, which also incorporates the ordinal relationship between the age categories.

Method CORAL [23] DCTD [35] Direct (Ours) RvC (Ours) Equal Width (Ours) Randomized Bins (Ours)
MAE 4.55 0.04
TABLE I: Mean average error in years for the different methods on the UTKFace [34] test set.

V-B Head Pose Estimation

The BIWI dataset [36] consists of 24 video sequences of 20 subjects recorded in a controlled environment and each frame is labeled with the corresponding head pose of the subject. We use the train-test split defined as protocol 2 in [37], where 16 videos are used for training and 8 for testing. In total, the training set consists of 10,063 images and the test set of 5,065 images. The pose is represented using the yaw, pitch and roll angles of the head, where each angle is approximately distributed in the range .

Following [35], we use the same approach as for head pose estimation, but with small modifications needed to get a three dimensional output from the network. For the direct regression approach, we simply replace the last fully connected layer of size 1 by three fully connected layers of the same size. For the RvC, we use three chains of fully connected layers at the end, one for each angle, with a corresponding softmax head. We discretize each angle using 1 degree per bin, such that .

For the equal width approach, we let and , such that . Again, this gives a similar network complexity as the RvC baseline. Hence each softmax head has 3 degrees per bin, i.e. {(-75) — (-73), …, 72 — 75}, {-75, (-74) — (-71), …, 73 — 75 } and similarly for . For the randomized bins approach, we let and and make a new sample of randomized bins for each run.

The results for the BIWI dataset are shown in Table II, where we have evaluated the MAE for the three different angles for each method. On average, direct regression performs best, while the randomized bins method is best at predicting the yaw angle. However, both equal width and randomized bins outperform the standard RvC approach. Additionally, our direct method reduces the current state of the art average error by 16 %, which shows that a carefully tuned regression baseline can outperform more sophisticated methods on this problem.

Fig. 6: A subset of the images in BIWI dataset [36] with ground truth (GT) labels and predictions using the equal width method.
Method Yang et al. [37] DCTD [35] Direct (Ours) RvC (Ours) Equal Width (Ours) Randomized Bins (Ours)
Yaw 2.89 2.64 0.16 2.85 0.12 2.62 0.11 2.52 0.06
Pitch 4.29 2.75 0.05 3.22 0.09 3.12 0.08 3.01 0.10
Roll 3.60 2.24 0.07 2.48 0.08 2.33 0.09 2.34 0.10
Average 2.54 0.05 2.85 0.08 2.69 0.04 2.63 0.04
TABLE II: Mean average error in degrees for the different methods on the Biwi [36] test set.

V-C Historical Image Dating

In order to test our method on a small dataset with a small number of ordinal classes, we ran experiments on the Historical Color Images (HCI) dataset [27]. The dataset consists of 1,375 color images from five decades, spanning from the 1930s to the 1970s. For evaluation, we use Monte Carlo random sampling with an 80/20 train-test split for each decade, drawn randomly at each iteration.

For this dataset, the target value can then be considered as belonging to one of five ordinal classes , where each class corresponds to one of the five decades. Likewise, the number of possibilities for creating our new sets of bins is limited, so the methods of overlapping and randomized bins are not suitable. We instead create 5 new sets as shown in Figure 7 and refer to this method simply as ”Label Diversity”. Although it is possible to define a distance metric between classes as the distance in decades, this is unsuitable, since the year 1939 is closer to 1940 than it is to 1949, but the distances in decades are the same. We therefore treat this as an ordinal classification problem and use the MAP estimate in equation (8) for inference.

We evaluate the methods in terms of classification accuracy and MAE. A sample of correct and incorrect predictions are shown in Figure 8. The results are shown in Table III and we conclude that using label diversity improves both accuracy and MAE over the regression baseline. Label diversity also decreases MAE compared to the RvC baseline, which we claim is due to the exploitation of the ranking between classes. Furthermore, our method improves the accuracy by one percentage point compared to the current state of the art method [29].

Fig. 7: The sets of overlapping classes used for label diversity on the historical image dating task.
Method CNNOr [15] PN [29] ELB [29] Direct (Ours) RvC (Ours) Label Diversity (Ours)
Acc (%) 41.56 56.67 2.30 57.1 1.9 57.7 2.0
MAE 1.04 0.63 0.04
TABLE III: Accuracy and mean average error in decades for the different methods on the HCI dataset [27].

Vi Conclusion

In this work, we have shown that employing a series of different discrete representations of the target values, it is possible to improve the predictive performance of a deep neural network, compared to when using a single such representation. For some problems, it can also outperform direct regression. We note that our methods yield the strongest improvement compared to the regression baseline on historical image dating, where the target belongs to a small set of ordinal classes. For head pose estimation, where the target is continuous, the improvement was either small or negligible, but there is a significant improvement over the RvC baseline. For age estimation, where the target belongs to a large set of ordered classes, there is an improvement over both the regression and RvC method. Nevertheless, our method consistently improves over the RvC baseline for all methods. Hence our conclusion is that, if it is suitable to approach a regression problem via classification, then using several diverse representations can improve performance.

This opens up a wide range of options when it comes to selecting the representations, since there are many ways to create different discrete binnings of a continuous target value. As we have also shown, the number of discretizations and the number of bins for each discretization will have an impact on the prediction performance, but since these choices also affect the training convergence, it is difficult to select them for a given problem without extensive parameter search. In future research we will continue to investigate these questions and how diversity in label representations can be exploited in other ways.

Fig. 8: A subset of the images in HCI dataset [27] with ground truth (GT) labels and predictions using the label diversity method.


This work was supported by the Wallenberg Artificial Intelligence, Autonomous Systems and Software Program (WASP), funded by Knut and Alice Wallenberg Foundation.