Ordinal regression (sometimes called ordinal classification) is applied to data in which the features x_i of the i-th example correspond to a label y_i from a set of K elements that have a well-defined ranking or ordering r_1 ≺ r_2 ≺ ⋯ ≺ r_K. However, unlike traditional metric regression, the ranks cannot be assumed to have quantitative differences or distances amongst themselves. For example, while a syntactic statement such as “terrible” ≺ “great” ≺ “best” may be intuitive, it conveys nothing about the quantitative distances between the ranks, i.e., whether the distance between “terrible” and “great” is equal to that between “great” and “best”. The aim in this setting is to build a reliable rule or regression function from the domain of the features to the range of the ordinal labels.
In the published literature for applied problems, it remains commonplace to ignore the ordering of the labels and apply categorical algorithms to such data (Levi & Hassncer, 2015; Rothe et al., 2015), which in neural networks often results in application of the categorical cross-entropy (CCE) loss. Problematically, a categorical loss assumes every mislabeling is equally wrong, whereas it is clear that predicting “great” when the true label is “best” would be preferable to a prediction of “terrible”. Although the problematic nature of this practice has been recognized for more than 35 years (Forrest & Andersen, 1986), it remains common to make the implicit or explicit assumption that ordinal data or labels exist on an interval or ratio scale.
Most contemporary algorithms in the ordinal regression literature (McCullagh, 1980; Obermayer, 1999; Crammer & Singer, 2002; Shashua & Levin, 2002; Rajaram et al., 2003; Shen & Joshi, 2005; Chu & Keerthi, 2005; Li & Lin, 2007; Baccianella et al., 2009; Niu et al., 2016; Cao et al., 2020) can be viewed through the lens of a general framework proposed by Li & Lin (2007), wherein the K labels are encoded as binary vectors by an invertible encoder and the regression function becomes a collection of K−1 binary classifiers along with the decoder. However, many of these existing algorithms share one major limitation: rank inconsistency among predictions. Briefly, the binary tasks are not independent, and without special care, training the classifiers will produce conflicting predictions on the binary tasks, impeding both performance and interpretation of the results. Most recently, Cao et al. (2020) attempted to address this problem in deep neural network (DNN) architectures by proposing a final layer that shares weights among the binary outputs (differing only in their bias terms).
Herein, we identify the theoretical and practical limitations of the CORAL approach by Cao et al. (2020), and address these concerns with a new algorithm, ‘Conditionals for Ordinal Regression’ (CONDOR). We prove that CONDOR is universally rank consistent and that it is sufficiently expressive to reach any rank consistent solution. In theory, the method is compatible with any combination of binary classification algorithms, but herein we focus on DNN architectures trained by backpropagation. Using our open-source PyTorch and TensorFlow packages, CONDOR can be implemented with minor modifications to any categorical DNN model in these frameworks, allowing for increased adoption of ordinal regression in the applied literature.
We begin by introducing the CONDOR framework and its notations, and then provide proofs of the universal rank consistency and the full expressiveness of the framework.
2.1 CONDOR framework
The Li & Lin (2007) encoder converts the ordinal regression label y_i into K−1 binary classification labels using indicator variables y_i^(1), …, y_i^(K−1), where the k-th indicator variable is defined as

y_i^(k) = 1 if y_i ≻ r_k, and y_i^(k) = 0 otherwise, for k = 1, …, K−1.
The ordinal classification problem then becomes a matter of producing K−1 binary classifier subtasks f_1, …, f_{K−1}, which we assume come from thresholding predicted binary class probabilities as

f_k(x) = 1{ P(y ≻ r_k | x) > 0.5 }.
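As a concrete illustration, the encoding and thresholding steps above can be sketched in plain Python (the function names and example values are our own, not part of the Li & Lin (2007) framework):

```python
# Extended binary encoding of an ordinal label (Li & Lin, 2007).
# A rank index q in {1, ..., K} is encoded as K-1 indicator labels,
# where the k-th label asks "is the true rank above rank k?".

def encode_ordinal(q, K):
    """Return the K-1 binary labels [1{q > 1}, ..., 1{q > K-1}]."""
    return [1 if q > k else 0 for k in range(1, K)]

def threshold_probs(probs):
    """Binary subtask decisions f_k by thresholding P(y > r_k | x) at 0.5."""
    return [1 if p > 0.5 else 0 for p in probs]

# Example with K = 5 ranks: the label "rank 3" becomes [1, 1, 0, 0].
assert encode_ordinal(3, 5) == [1, 1, 0, 0]
```

Note that the encoding is invertible: the rank index can be recovered by counting the ones, which is exactly the decoder discussed next.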
For convenience, we often deal with the rank index q associated with the rank r_q. From the binary classifier subtasks, the rank index for input feature vector x can be estimated as

q̂(x) = 1 + Σ_{k=1}^{K−1} f_k(x),   (1)

although multiple methods are possible to produce point estimates for q from the probabilities P(y ≻ r_k | x). As shown in Figure 1, the aforementioned binary encoding approach requires that the probabilities be rank-monotonic,

P(y ≻ r_1 | x) ≥ P(y ≻ r_2 | x) ≥ ⋯ ≥ P(y ≻ r_{K−1} | x),

for all x to guarantee consistent predictions.
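A minimal sketch of the point estimate in Equation (1), using hypothetical subtask probabilities:

```python
def predicted_rank_index(probs):
    """Decode a rank index as 1 plus the number of binary subtasks
    predicting y > r_k (the point estimate of Equation (1))."""
    return 1 + sum(1 for p in probs if p > 0.5)

# Rank-monotonic probabilities for K = 5 decode cleanly to rank index 3:
# subtasks 1 and 2 fire, subtasks 3 and 4 do not.
assert predicted_rank_index([0.9, 0.8, 0.3, 0.1]) == 3
```

If the probabilities are not rank-monotonic, this count still produces a number, but the binary subtasks contradict one another, which is exactly the inconsistency problem discussed above.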
Rather than directly estimating the marginal probabilities P(y ≻ r_k | x), k = 1, …, K−1, as in existing approaches based on Li & Lin (2007), we estimate the conditionals

P(y ≻ r_k | x, y ≻ r_{k−1}), for k = 1, …, K−1,

where we set the boundary condition P(y ≻ r_0 | x) = 1 by convention. Because y ≻ r_k implies y ≻ r_{k−1}, which by definition of the binary labels means y_i^(k) = 1 implies y_i^(j) = 1 for j < k, the marginal probability is equivalent to the joint probability

P(y ≻ r_k | x) = P(y ≻ r_1, y ≻ r_2, …, y ≻ r_k | x),   (2)

and by the product rule we produce a heterogeneous Markov chain representation of our marginal probabilities

P(y ≻ r_k | x) = ∏_{j=1}^{k} P(y ≻ r_j | x, y ≻ r_{j−1}).   (3)
Equations (2) & (3) are exact (i.e., not approximations) and thus fully general. They can in theory be applied with any classifier estimating binary probabilities, but we focus on the application to DNNs. Namely, we select the final layer of the neural network to have K−1 nodes with sigmoid activations representing the conditionals P(y ≻ r_k | x, y ≻ r_{k−1}). For training, the loss is the weighted sum of the binary cross-entropies of all the subtasks,

L = Σ_{k=1}^{K−1} λ_k · BCE_k,   (4)

where BCE_k is the binary cross-entropy of the k-th conditional subtask and λ_k is the importance parameter for task k, which we set to one in the sequel. We call this approach Conditionals for Ordinal Regression (CONDOR).
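The chain rule of Equation (3) can be sketched in a few lines (a framework-free sketch with hypothetical logit values; the production implementations live in our PyTorch and TensorFlow packages):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def conditionals_to_marginals(logits):
    """Apply a sigmoid to each conditional logit, then take the running
    product (Equation (3)) to recover the marginals P(y > r_k | x)."""
    marginals, running = [], 1.0
    for z in logits:
        running *= sigmoid(z)   # each factor lies strictly in (0, 1) ...
        marginals.append(running)
    return marginals            # ... so the marginals are non-increasing

# Any output logits whatsoever yield rank-monotonic marginals.
m = conditionals_to_marginals([2.0, -1.0, 3.0, 0.5])
assert all(a >= b for a, b in zip(m, m[1:]))
```

Because every sigmoid factor is in (0, 1), the running product can only shrink, which is precisely the rank consistency proven in the next subsection.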
2.2 Rank consistency
Here we substantiate CONDOR’s guarantee for preserving rank consistency and its ability to represent any rank consistent solution.
Lemma 2.1. CONDOR provides universal rank consistency (i.e., rank consistent estimates for all input data and any parameterization of the DNN).
Proof. In neural networks, we can enforce 0 < P(y ≻ r_k | x, y ≻ r_{k−1}) < 1 for all k and any weight parameterization of the DNN by having K−1 output nodes with sigmoid activations representing the conditionals. Because P(y ≻ r_k | x, y ≻ r_{k−1}) ≤ 1 for all k, we have by Equation (3)

P(y ≻ r_k | x) = P(y ≻ r_k | x, y ≻ r_{k−1}) · P(y ≻ r_{k−1} | x) ≤ P(y ≻ r_{k−1} | x)

for all k and x. Thus we have rank consistency,

P(y ≻ r_1 | x) ≥ P(y ≻ r_2 | x) ≥ ⋯ ≥ P(y ≻ r_{K−1} | x),

for all x and any weight parameterization of the DNN. ∎
Theorem 2.2. Assuming that a neural network can universally approximate any continuous function g with error |ĝ(x) − g(x)| < ε for all x and some ε > 0, then adding a CONDOR output layer to said network can approximate any rank consistent continuous ordinal regressor with error O(ε).
Proof. By rank consistency,

P(y ≻ r_k | x) ≤ P(y ≻ r_{k−1} | x)

for any k, and we have defined the boundary condition P(y ≻ r_0 | x) = 1. Then we define for k = 1, …, K−1

c_k(x) = P(y ≻ r_k | x) / P(y ≻ r_{k−1} | x),   (5)

and note that for each k and x the function c_k is continuous and satisfies 0 ≤ c_k(x) ≤ 1. Then define the continuous functions

g_k(x) = σ^{−1}(c_k(x)) = log( c_k(x) / (1 − c_k(x)) ).

Because the upstream neural network can approximate any continuous function with error ε, we can have the upstream neural network produce functions ĝ_k satisfying |ĝ_k(x) − g_k(x)| < ε for all k and x. Then, after the CONDOR sigmoid activations, the neural network produces σ(ĝ_k(x)) for all k at its output nodes, which by Taylor series about g_k(x) produces

σ(ĝ_k(x)) = σ(g_k(x)) + O(ε) = c_k(x) + O(ε).

Then the CONDOR approach yields

P̂(y ≻ r_k | x) = ∏_{j=1}^{k} σ(ĝ_j(x)) = ∏_{j=1}^{k} ( c_j(x) + O(ε) ).

By iteration it follows that

∏_{j=1}^{k} ( c_j(x) + O(ε) ) = ∏_{j=1}^{k} c_j(x) + O(ε).

Using the definition of c_k in Equation (5), the product telescopes and we get

P̂(y ≻ r_k | x) = P(y ≻ r_k | x) + O(ε),

where the Taylor expansion above comes from the power series σ(g + δ) = σ(g) + σ′(g)δ + O(δ²). ∎
3 Numerical Experiments
Here we discuss the challenges of measuring the performance of ordinal regression methods, as well as the potential strengths and weaknesses of various candidate measures. In the subsequent sections, we then demonstrate the superior performance of CONDOR relative to the state of the art on several benchmark and real-world data sets.
The binary cross-entropy (BCE) loss in Equation (4) is one of the few common evaluation metrics fully sensitive to the ordinal nature of the outcomes, since most other metrics either ignore crucial information about ordering (e.g., accuracy, which is categorical and thus lacks any notion of “incorrect but close”) or artificially impose an interval or ratio scale on the data (e.g., mean absolute error or earth mover's distance, which by necessity require metric distances to be defined between the ranks). Despite its suitability, the cross-entropy remains less intuitive than its alternatives, and so we benchmark these additional performance measures while acknowledging their shortcomings in the ordinal setting. Specifically, we profile the earth mover's distance (EMD) on the rank indices with unit distance between ranks, and the mean absolute error (MAE) on the rank indices with unit distance between ranks, using Equation (1) for the point estimate.
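For concreteness, the two interval-scale metrics can be sketched as follows (a simplified sketch under the unit-spacing assumption stated above; in one dimension the earth mover's distance between a predicted rank distribution and a one-hot true label reduces to a sum of absolute CDF differences):

```python
def mae(pred_idx, true_idx):
    """Mean absolute error between predicted and true rank indices."""
    return sum(abs(p - t) for p, t in zip(pred_idx, true_idx)) / len(true_idx)

def emd_1d(pred_dist, true_idx):
    """1-D earth mover's distance with unit spacing between ranks:
    the sum of |CDF differences| between the predicted rank distribution
    and the one-hot distribution of the true rank index."""
    K = len(pred_dist)
    true_dist = [1.0 if k == true_idx - 1 else 0.0 for k in range(K)]
    cdf_p = cdf_t = total = 0.0
    for k in range(K):
        cdf_p += pred_dist[k]
        cdf_t += true_dist[k]
        total += abs(cdf_p - cdf_t)
    return total

# All mass on the true rank costs nothing; mass one rank away costs 1.
assert emd_1d([0.0, 1.0, 0.0], 2) == 0.0
```

Note how both metrics require the unit-distance assumption between adjacent ranks, which is exactly the interval-scale structure the ordinal setting does not formally provide.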
In the subsequent sections, the only difference between the three methods is the choice of loss function and final layer of the neural network; all other details of the DNN architecture, the optimization algorithms, hyperparameters and random number seeds are kept equal throughout each experiment. Namely, CORAL and CONDOR both have the sum of the BCE for the K−1 subtasks as a loss, whereas the categorical method uses the CCE. Likewise, CORAL uses a custom final layer with weight sharing (Cao et al., 2020) among its K−1 output nodes, CONDOR uses a final dense layer with K−1 output nodes which after sigmoid activation represent the conditionals P(y ≻ r_k | x, y ≻ r_{k−1}), and the categorical algorithm uses a dense layer with K nodes and a softmax activation. All results were gathered with three random number seeds and reported as the mean plus or minus the standard deviation across these seeds.
3.1 Synthetic Quadrants Dataset
We consider the simple task of ordinal classification wherein the labels are the four quadrants of the plane, ordered counterclockwise, and the features are generated from a 2D standard normal distribution. We draw samples and perform a train/test split of the dataset. For the upstream network architecture we select two dense layers of ten neurons each with ReLU activations, trained with the Adam optimizer and early stopping on a validation split. As can be seen in Table 1, CONDOR has the best performance in BCE, EMD and MAE.
Table 1: Quadrants test-set performance (mean ± standard deviation over three seeds).

Method        BCE              MAE              EMD
CONDOR        0.0768 ± 0.0100  0.0167 ± 0.0153  0.0799 ± 0.0526
CORAL         0.5074 ± 0.0754  0.0733 ± 0.0416  0.4080 ± 0.0377
Categorical   1.3438 ± 0.0458  0.0200 ± 0.0173  1.0318 ± 0.0267
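The synthetic setup above can be sketched as a minimal data generator (the exact sample count and split used in the experiment are not reproduced here; the function and seed are illustrative):

```python
import random

def sample_quadrant_example(rng):
    """Draw (x1, x2) from a 2-D standard normal and label it with its
    quadrant index 1..4, ordered counterclockwise from (+, +)."""
    x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
    if x1 >= 0 and x2 >= 0:
        label = 1
    elif x1 < 0 and x2 >= 0:
        label = 2
    elif x1 < 0 and x2 < 0:
        label = 3
    else:                      # x1 >= 0 and x2 < 0
        label = 4
    return (x1, x2), label

rng = random.Random(0)         # illustrative seed
data = [sample_quadrant_example(rng) for _ in range(1000)]
assert {label for _, label in data} == {1, 2, 3, 4}
```

The quadrant labels form a cyclic ordering, which makes this a deliberately simple but non-trivial test of whether an ordinal method respects adjacency between ranks.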
3.2 MNIST as an ordinal dataset
Table 2: MNIST-as-ordinal test-set performance (mean ± standard deviation over three seeds).

Method        BCE              MAE              EMD
CONDOR        0.1784 ± 0.0043  0.0596 ± 0.0027  0.0818 ± 0.0065
CORAL         1.2724 ± 0.0139  0.4583 ± 0.0028  0.7501 ± 0.0214
Categorical   5.5424 ± 0.0013  0.0592 ± 0.0042  3.0638 ± 0.0013
Depending on the application, MNIST can be considered a categorical problem or an interval regression problem. If the digits are used for license-plate recognition, then the problem is categorical since there is no notion of “close” errors. By contrast, if the digits are used for GPS coordinates or postal codes, then the ordering and distance between numerals become relevant and categorical classification is no longer the most appropriate framing of the task. It is valid to treat interval regression as an ordinal problem, since the latter assumes less structure on the dataset, although it is generally recommended to exploit the interval scale when one exists. Here we treat MNIST as ordinal data for the purpose of benchmarking our ordinal algorithms, while acknowledging that it should likely be treated as either categorical classification or interval regression as dictated by the specific real-world application setting. The MNIST data are split into training, validation and test sets of 55K, 5K and 10K images, respectively. We utilize a convolutional neural network with two convolutional layers of 64 and 32 filters, respectively, and a kernel size of 3, before flattening and passing to the appropriate output layer and loss function for each of our three models (CONDOR, CORAL, Categorical). Training is performed with the Adam optimizer and early stopping. The results in Table 2 indicate that CONDOR demonstrates superior performance in terms of both BCE and EMD, while the Categorical approach performs marginally better in MAE.
3.3 NLP on Amazon reviews dataset
Here we consider a natural language processing (NLP) dataset consisting of the (non-duplicate and non-empty) Amazon Pantry text reviews with their corresponding one- to five-star ratings (Ni et al., 2019). We split the data to obtain a held-out test set. For the neural network architecture, we use the fixed, pre-trained Google universal sentence encoder (Cer et al., 2018) and append a dense layer with 64 ReLU-activated neurons and dropout, followed by the appropriate output layer and loss function for each model. Training is performed with the Adam optimizer and early stopping with a patience of 10 on a validation split. The results in Table 3 demonstrate that CONDOR provides the strongest performance in this benchmark across all three performance metrics.
Table 3: Amazon reviews test-set performance (mean ± standard deviation over three seeds).

Method        BCE              MAE              EMD
CONDOR        0.7807 ± 0.0113  0.3180 ± 0.0047  0.4582 ± 0.0050
CORAL         0.8095 ± 0.0098  0.3263 ± 0.0052  0.4726 ± 0.0041
Categorical   2.4663 ± 0.4678  0.4195 ± 0.0074  1.6906 ± 0.2422
3.4 GRU-D for COVID-19 prognostication
This study adheres to a research protocol approved by the Mayo Clinic Institutional Review Board. Here we extend the results of Sankaranarayanan et al. (2021), progressing from their binary classification predicting mortality to ordinal regression predicting severity of outcomes. Namely, this clinical dataset includes two binary severity outcomes: an indicator variable for mechanical ventilation or extracorporeal membrane oxygenation (ECMO), as well as an indicator variable of whether patient death occurred. From these, a clear three-point ordinal scale can be constructed whereby a patient is scored a zero when they have no severe outcome, a one when they experienced the severe outcome of ventilation or ECMO, and a two corresponding to death (with or without prior ventilation or ECMO). Sankaranarayanan et al. (2021) identified the GRU-D recurrent neural network architecture as the best-performing model for binary mortality prediction, and we extend that approach to the ordinal problem. We do this for the CONDOR, CORAL and categorical algorithms using their corresponding final layers and loss functions. The GRU-D architecture (Che et al., 2018) deals explicitly with multivariate time series that are missing not at random (MNAR) due to the manner in which clinical data are ordered and recorded in an electronic health record (EHR). The default hyperparameters (dropout, l2 regularization, hidden and recurrent layer sizes, batch size, Adam learning rate, no batch norm, no bidirectional RNN, maximum time steps, and early-stopping patience) have previously demonstrated strong performance (Sankaranarayanan et al., 2021) and so are retained here.
Table 4: COVID-19 severity prospective test-set performance (mean ± standard deviation over three seeds).

Method        BCE              MAE              EMD
CONDOR        0.6526 ± 0.0004  0.2986 ± 0.0044  0.4151 ± 0.0063
CORAL         0.6711 ± 0.0000  0.3021 ± 0.0015  0.4261 ± 0.0023
Categorical   1.1075 ± 0.0019  0.3076 ± 0.0021  0.8185 ± 0.0010
The dataset is split into a training/validation set of patients who tested positive for COVID-19 by PCR test prior to December 15, 2020, and a prospective testing set of patients who tested positive on or after that date. For training we use a training/validation split identical to the original study, which facilitates early stopping. The results in Table 4 demonstrate that CONDOR is superior in all evaluation metrics. Furthermore, the CONDOR-based GRU-D model achieves a prospective test-set AUROC for the mortality-prediction subtask that exceeds the 0.901 reported in Sankaranarayanan et al. (2021), wherein the authors trained the algorithm as a binary classifier specifically for mortality prediction. This demonstrates that there is no loss of mortality-prediction performance when building a DNN to address the more challenging task of prognostication.
We have demonstrated the ability of the CONDOR approach to overcome limitations present in popular alternative methods and to produce rank consistent results in the classification of data with ordinal labels. Rank consistency is important not only for theoretical soundness, but also in application settings where explainability matters and a rank-inconsistent prediction would be unacceptably contradictory and fundamentally unexplainable. Regardless of the loss function being optimized or the parameterization of the neural network, CONDOR provides universal guarantees of rank consistency by Lemma 2.1, which is to say the CONDOR approach is “sufficient” for rank consistency. Our next result leverages the fact that there is a wide variety of universal approximation theorems for neural networks, each with its own technical conditions (e.g., see Chong (2020) for discussion of various universal approximation proofs and technical conditions). Namely, Theorem 2.2 states that any upstream neural network satisfying the conditions for universal approximation can be given a CONDOR output layer, which creates a universally rank consistent network that can approximate any rank consistent solution. This theorem can be interpreted as CONDOR being “necessary” for rank consistency, insofar as any rank consistent solution can be represented by a CONDOR neural network.
In contrast, note that Theorem 1 of Cao et al. (2020) only provides rank consistency at the global minimum of an optimization problem with the specified loss function. Since neural network training is not guaranteed, or even expected, to reach a globally optimal parameterization, the Cao et al. (2020) approach may in practice produce rank-inconsistent solutions and requires post hoc checks of the estimated bias terms to verify rank consistency. Furthermore, Cao et al. (2020) restricts the expressiveness of the binary classifier outputs, which must have “parallel slopes” (i.e., differ only by a bias parameter whose impact is completely independent of the feature vector). In Appendix A.1, we formalize these comments with two proofs demonstrating that the CORAL framework (Cao et al., 2020) lacks the theoretical guarantees of CONDOR.
Beyond our mathematical justifications, ultimately it is critical that the method perform well within a wide variety of neural network architectures and ordinal problem settings. Our benchmarking of dense networks, CNNs, attention networks and exotic RNNs all shows the practical benefit of using the CONDOR algorithm in a diverse set of applications. In all benchmarks, CONDOR provided the best performance according to the ordinal metric of BCE. Furthermore, CONDOR often provided the best performance according to the interval-scale metrics of EMD and MAE. We emphasize, however, that both of these metrics assume an interval scale on the ranks (i.e., uniform spacing between ranks) that is not formally justified from a mathematical perspective in an ordinal setting. Less formally, if we recall the ordinal example “terrible” ≺ “great” ≺ “best”, then uniform spacing would not be compatible with most readers’ intuition about the problem. Any attempt to arbitrarily embed these three ranks into a metric space would be unlikely to achieve universal agreement amongst practitioners. Thus, ordinal regression is best exemplified in use cases where a well-defined ordering is clear but distances between the ranks are undefined. In that sense the BCE is the only metric that directly evaluates ordinal performance. Nonetheless, CONDOR not only remains competitive in the categorical performance measure of accuracy, but in fact provides improved classification in true ordinal problems when compared to categorically optimized neural networks, in theory due to the ability of the network to exploit “clues” encountered during ordinal training (Appendix A.2).
In addition to the theoretical strengths and performance improvements of the CONDOR method, we note that most applied papers simply use categorical classification in their problem settings rather than consider current state-of-the-art methods for ordinal regression. Part of this may be educational, as most beginners are only taught binary/categorical classification and continuous regression, and discussions of the ordinal setting are largely only found in the advanced literature. However, the authors also believe part of the barrier is programmatic ease-of-use. We provide production-ready and user-friendly software packages in both PyTorch and TensorFlow, in order to minimize the effort required to convert existing categorical code-bases into CONDOR ordinal code-bases. In Appendix A.3, we demonstrate the modest code changes required to implement our methodology in an existing categorical code base.
The key requirements for successful supervised learning are algorithms that respect the structure of the problem and access to sufficient amounts of labeled data. Since CONDOR satisfies the first requirement by providing a robust algorithm for ordinal regression, we conclude with the latter by emphasizing the prevalence of available ordinal outcome measurements, using medical applications as a prototypical applied problem domain. Survey research, for instance, frequently utilizes ordinal responses such as the psychometric Likert scale (Likert, 1932), providing a large corpus of existing data with ordinal labels. Furthermore, while labeling outcomes from the electronic health record (EHR) is one of the most time-consuming and expensive aspects of applied machine learning in the medical space, the proliferation of ordinal scales in modern medical practice (see Appendix A.4) means the EHR already contains physician-provided ordinal outcomes from a large variety of settings. The ubiquity of ordinal outcome measurements throughout survey research and medical settings represents a rich untapped reserve of training data that has yet to be explored by ordinal regression algorithms, and it is our hope that CONDOR’s demonstrated capabilities and its ease of use will encourage its adoption and enable broad exploration of underutilized data across these and other domains.
The funding for this research was provided by the Mayo Clinic Center for Individualized Medicine. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The authors thank Saranya Sankaranarayanan and Jagadheshwar Balan for sharing their preprocessed versions of the COVID-19 data set and their code for GRU-D mortality prediction.
- Baccianella et al. (2009) Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. Evaluation measures for ordinal regression. In 2009 Ninth International Conference on Intelligent Systems Design and Applications, pp. 283–287, 2009. doi: 10.1109/ISDA.2009.230.
- Cao et al. (2020) Wenzhi Cao, Vahid Mirjalili, and Sebastian Raschka. Rank consistent ordinal regression for neural networks with application to age estimation. Pattern Recognition Letters, 140:325–331, dec 2020. doi: 10.1016/j.patrec.2020.11.008. URL https://doi.org/10.1016%2Fj.patrec.2020.11.008.
- Cer et al. (2018) Daniel Cer, Yinfei Yang, Sheng yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Céspedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. Universal sentence encoder. arXiv, pp. 1803.11175, 2018. URL https://arxiv.org/abs/1803.11175.
- Che et al. (2018) Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent neural networks for multivariate time series with missing values. Scientific reports, 8(1):6085, 2018.
- Chong (2020) Kai Fong Ernest Chong. A closer look at the approximation capabilities of neural networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkevSgrtPr.
- Chu & Keerthi (2005) Wei Chu and S. Sathiya Keerthi. New approaches to support vector ordinal regression. In Proceedings of the 22nd International Conference on Machine Learning, ICML ’05, pp. 145–152, New York, NY, USA, 2005. Association for Computing Machinery. ISBN 1595931805. doi: 10.1145/1102351.1102370. URL https://doi.org/10.1145/1102351.1102370.
- Crammer & Singer (2002) K. Crammer and Y. Singer. Pranking with ranking. Advances in Neural Information Processing Systems, 14:641–647, 2002. cited By 207.
- Forrest & Andersen (1986) M. Forrest and B. Andersen. Ordinal scale and statistics in medical research. Br Med J (Clin Res Ed), 292(6519):537–538, Feb 1986.
- Godoy et al. (2018) M. C. B. Godoy, E. G. L. C. Odisio, J. J. Erasmus, R. C. Chate, R. S. Dos Santos, and M. T. Truong. Understanding Lung-RADS 1.0: A Case-Based Review. Semin Ultrasound CT MR, 39(3):260–272, Jun 2018.
- Levi & Hassncer (2015) Gil Levi and Tal Hassncer. Age and gender classification using convolutional neural networks. In 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 34–42, 2015. doi: 10.1109/CVPRW.2015.7301352.
- Li & Lin (2007) L. Li and H.-T. Lin. Ordinal regression by extended binary classification. Advances in Neural Information Processing Systems, pp. 865–872, 2007.
- Likert (1932) Rensis Likert. A technique for the measurement of attitudes. Archives of Psychology, 22(140):55, 1932.
- McCullagh (1980) P. McCullagh. Regression models for ordinal data (with discussion). Journal of the Royal Statistical Society, Series B, 42:109–142, 1980. cited By 12.
- Mercado (2014) C. L. Mercado. BI-RADS update. Radiol Clin North Am, 52(3):481–487, May 2014.
- Moore & Moore (2010) E. E. Moore and F. A. Moore. American Association for the Surgery of Trauma Organ Injury Scaling: 50th anniversary review article of the Journal of Trauma. J Trauma, 69(6):1600–1601, Dec 2010.
- Ni et al. (2019) Jianmo Ni, Jiacheng Li, and Julian McAuley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pp. 188–197, 2019.
- Niu et al. (2016) Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, and Gang Hua. Ordinal regression with multiple output cnn for age estimation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4920–4928, 2016. doi: 10.1109/CVPR.2016.532.
- Obermayer (1999) K. Obermayer. Support vector learning for ordinal regression. IET Conference Proceedings, pp. 97–102(5), 1999. URL https://digital-library.theiet.org/content/conferences/10.1049/cp_19991091.
- Rajaram et al. (2003) Shyamsundar Rajaram, Ashutosh Garg, Xiang Sean Zhou, and Thomas S. Huang. Classification approach towards ranking and sorting problems. In Nada Lavrač, Dragan Gamberger, Hendrik Blockeel, and Ljupčo Todorovski (eds.), Machine Learning: ECML 2003, pp. 301–312, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg. ISBN 978-3-540-39857-8.
- Reith et al. (2017) F. C. M. Reith, H. F. Lingsma, B. J. Gabbe, F. E. Lecky, I. Roberts, and A. I. R. Maas. Differential effects of the Glasgow Coma Scale Score and its Components: An analysis of 54,069 patients with traumatic brain injury. Injury, 48(9):1932–1943, Sep 2017.
- Richards et al. (2015) S. Richards, N. Aziz, S. Bale, D. Bick, S. Das, J. Gastier-Foster, W. W. Grody, M. Hegde, E. Lyon, E. Spector, K. Voelkerding, and H. L. Rehm. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med, 17(5):405–424, May 2015.
- Rothe et al. (2015) Rasmus Rothe, Radu Timofte, and Luc Van Gool. Dex: Deep expectation of apparent age from a single image. In 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), pp. 252–257, 2015. doi: 10.1109/ICCVW.2015.41.
- Sankaranarayanan et al. (2021) Saranya Sankaranarayanan, Jagadheshwar Balan, Jesse R Walsh, Yanhong Wu, Sara Minnich, Amy Piazza, Collin Osborne, Gavin R Oliver, Jessica Lesko, Kathy L Bates, Kia Khezeli, Darci R Block, Margaret DiGuardo, Justin Kreuter, John C O’Horo, John Kalantari, Eric W Klee, Mohamed E Salama, Benjamin Kipp, William G Morice, and Garrett Jenkinson. Covid-19 mortality prediction from deep learning in a large multistate electronic health record and laboratory information system data set: Algorithm development and validation. J Med Internet Res, 23(9):e30157, Sep 2021. ISSN 1438-8871. doi: 10.2196/30157. URL https://www.jmir.org/2021/9/e30157.
- Shashua & Levin (2002) Amnon Shashua and Anat Levin. Ranking with large margin principle: Two approaches. In Proceedings of the 15th International Conference on Neural Information Processing Systems, NIPS’02, pp. 961–968, Cambridge, MA, USA, 2002. MIT Press.
- Shen & Joshi (2005) Libin Shen and Aravind K. Joshi. Ranking and reranking with perceptron. Machine Learning, 60(1-3):73–96, Jun 2005. doi: 10.1007/s10994-005-0918-9. URL https://doi.org/10.1007%2Fs10994-005-0918-9.
- Silverman et al. (2019) S. G. Silverman, I. Pedrosa, J. H. Ellis, N. M. Hindman, N. Schieda, A. D. Smith, E. M. Remer, A. B. Shinagare, N. E. Curci, S. S. Raman, S. A. Wells, S. D. Kaffenberger, Z. J. Wang, H. Chandarana, and M. S. Davenport. Bosniak Classification of Cystic Renal Masses, Version 2019: An Update Proposal and Needs Assessment. Radiology, 292(2):475–488, 08 2019.
- Williamson & Hoggart (2005) A. Williamson and B. Hoggart. Pain: a review of three commonly used pain rating scales. J Clin Nurs, 14(7):798–804, Aug 2005.
- Zachariah et al. (2015) S. Zachariah, W. Wykes, and D. Yorston. Grading diabetic retinopathy (DR) using the Scottish grading protocol. Community Eye Health, 28(92):72–73, 2015.
Appendix A Appendices
A.1 CORAL proofs
In this appendix, we demonstrate formally that the CORAL framework of Cao et al. (2020) does not have the universal rank consistency nor the expressiveness of CONDOR.
CORAL is not universally rank consistent.
Proof. For CORAL the last layer shares weights and only has differing bias terms, meaning it represents the output probabilities as (Cao et al., 2020)

P(y ≻ r_k | x) = σ( g(x) + b_k )   (9)

for some shared network output g(x) and bias terms b_1, …, b_{K−1}. This means in the notation of CONDOR that

P(y ≻ r_k | x, y ≻ r_{k−1}) = σ( g(x) + b_k ) / σ( g(x) + b_{k−1} )

for k = 2, …, K−1 and all x. Note that if the neural network parameters are chosen such that b_k > b_{k−1} for some k, then σ( g(x) + b_k ) > σ( g(x) + b_{k−1} ) and CORAL is rank inconsistent. ∎
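A numeric illustration of this failure mode (the parameter values are hypothetical): with a shared logit g(x) and unordered biases, the CORAL-style marginals are not monotone, whereas a CONDOR-style running product over the same logits is monotone for any parameterization.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# CORAL-style marginals: shared data-dependent logit g, per-task biases b_k.
g = 0.3                       # hypothetical shared logit for some input x
biases = [1.0, 2.0, -0.5]     # deliberately NOT sorted in decreasing order
coral = [sigmoid(g + b) for b in biases]
# b_2 > b_1 forces P(y > r_2 | x) > P(y > r_1 | x): rank inconsistency.
assert not all(a >= b for a, b in zip(coral, coral[1:]))

# CONDOR-style marginals built from the same logits via the running product
# of Equation (3) remain rank-monotonic no matter what the logits are.
condor, running = [], 1.0
for b in biases:
    running *= sigmoid(g + b)
    condor.append(running)
assert all(a >= b for a, b in zip(condor, condor[1:]))
```

At CORAL's global optimum the biases come out ordered, but nothing in the training procedure forces this ordering at intermediate or suboptimal parameterizations, which is exactly the lemma's point.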
It is quite clear from Equation (9) that the functional form of CORAL is far more restrictive than that of CONDOR, which allows arbitrary conditional probability functions. For completeness, we prove formally in the next lemma that it is less expressive.
CORAL cannot approximate every rank consistent solution with O(ε) error.
Proof. For simplicity, consider a univariate input x ∈ ℝ and K = 3. Then we have for CORAL

P(y ≻ r_1 | x) = σ( g(x) + b_1 ) and P(y ≻ r_2 | x) = σ( g(x) + b_2 ).

Consider an extremely simple CONDOR network with no hidden layers, bias parameters fixed to zero and only two positive weights w_1 and w_2, producing

P(y ≻ r_1 | x) = σ( w_1 x ) and P(y ≻ r_2 | x) = σ( w_1 x ) σ( w_2 x ),

which is rank consistent by Lemma 2.1. Suppose by way of contradiction that there exist CORAL parameters g, b_1 and b_2 matching these probabilities with error O(ε) for all x. Applying σ^{−1} to both subtask probabilities and subtracting yields

σ^{−1}( σ( w_1 x ) σ( w_2 x ) ) − w_1 x = b_2 − b_1 + O(ε),

which is a contradiction, since the left hand side has an infinite range depending on x whereas the right hand side is a constant up to an error of order ε. ∎
A.2 Categorical accuracy results
Accuracy is a categorical performance measure wherein there is no increasing penalty for being further from the correct label, and therefore no relative “credit” given for being close to the correct label. One might therefore expect that training a neural network with a categorical loss (i.e., CCE) would result in higher categorical accuracy than if the network were trained with an ordinal method.
Table 5: Categorical accuracy (mean ± standard deviation over three seeds) on the four benchmarks.

Method        Quadrants        MNIST            Amazon           COVID-19
CONDOR        0.9900 ± 0.0100  0.9805 ± 0.0004  0.7498 ± 0.0032  0.7250 ± 0.0033
CORAL         0.9333 ± 0.0379  0.6381 ± 0.0037  0.7343 ± 0.0036  0.7175 ± 0.0015
Categorical   0.9933 ± 0.0058  0.9845 ± 0.0008  0.7299 ± 0.0037  0.7244 ± 0.0016
Surprisingly, in Table 5 we find that in some benchmarks the ordinal CONDOR method provided higher categorical accuracy than the networks trained using categorical cross-entropy to specifically optimize categorical performance. We attribute this remarkable finding to the “clues” provided to the network when the ordinal nature of the problem is exploited during training. For instance, if the network incorrectly guesses rank index 7 when the true rank index is 8, the categorical loss treats this equivalently to a guess of rank index 1; backpropagation does not send any signal indicating that the guess of rank 7 was “close” to the true rank of 8. By contrast, Equation (4) captures the fact that most of the binary subtasks are correctly predicted when rank 7 is estimated for a ground-truth rank of 8. In problems like MNIST, where the features are not necessarily trending with increasing rank, we can understand how a categorical loss produces stronger categorical accuracy. But in problems like Amazon star ratings, where the language and sentiment of 4- and 5-star reviews are likely closer in the NLP embedding space than those of 1-star reviews, one can also understand how training with an ordinal loss could provide higher categorical accuracy than a categorical loss that only has the capacity to indicate “correct” versus “incorrect” and never “incorrect but close”.
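The “incorrect but close” intuition can be made concrete with the extended binary encoding (a schematic sketch of subtask agreement, not the exact training loss): with a ground-truth rank index of 8 out of 10, a guess of 7 agrees with 8 of the 9 binary subtasks, while a guess of 1 agrees with only 2 of them.

```python
def encode(q, K):
    """Extended binary encoding: the k-th label is 1{q > k}, k = 1..K-1."""
    return [1 if q > k else 0 for k in range(1, K)]

K, truth = 10, 8
target = encode(truth, K)
close_guess, far_guess = encode(7, K), encode(1, K)

def agree(a, b):
    """Count binary subtasks on which two encodings agree."""
    return sum(x == y for x, y in zip(a, b))

# A near miss keeps most binary subtasks correct; a far miss does not,
# so the per-subtask BCE penalizes the far miss much more heavily.
assert agree(close_guess, target) == 8
assert agree(far_guess, target) == 2
```

A categorical loss, by contrast, scores both guesses as a single equally wrong class, discarding the graded signal that the subtask decomposition preserves.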
A.3 Minimal code changes
CONDOR is open-sourced as both TensorFlow (https://github.com/GarrettJenkinson/condor_tensorflow) and PyTorch (https://github.com/GarrettJenkinson/condor_pytorch) repositories that make it simple to modify existing categorical code bases to use CONDOR. See Figure 2 for a hypothetical example in TensorFlow. Both the TensorFlow and PyTorch versions of the GitHub repositories have full mkdocs documentation, Docker files, ipynb tutorials, continuous integration testing and pip packaging. The authors hope this reduces the barrier to using proper, cutting-edge ordinal regression in applied problems.
A.4 Ordinal outcomes in medical practice
Modern medical practice emphasizes the importance of a standardized and reproducible communication of findings and outcomes within the global medical community. Consistent diagnosis, prognostication, treatment, and decision making all require evidence- and consensus-based labeling of patient disease states. Frequently, these categorizations are made ordinal to align with the expected prognosis or severity of disease.
As a result, across nearly every sub-specialty of medicine, one can find a plethora of ordinal outcome scales. Well-known to the general public is the use of tumor staging in oncology to characterize neoplasms. However, we provide a non-comprehensive sampling of other specialties that are perhaps less well-known. For instance, the American Association for the Surgery of Trauma provides 32 ordinal scales (Moore & Moore, 2010) for assessing the severity of trauma to 32 organs on a scale of 1 (minimal) to 6 (lethal). Molecular testing results, such as DNA variant sequencing, are often graded on an ordinal scale, such as the American College of Medical Genetics and Genomics’ scoring of variants from benign, likely benign, variant of unknown significance, likely pathogenic, to pathogenic (Richards et al., 2015). Subjective outcomes are often measured on ordinal scales, such as the 11-level Pain Rating Scale (Williamson & Hoggart, 2005) from 0 (no pain) to 10 (worst possible pain). Traumatic brain injury is assessed on the 5-point Glasgow Outcome Scale (Reith et al., 2017). Radiologists make frequent use of ordinal scales, including but not limited to Lung-RADS screening of lesions from 0 to 4 (Godoy et al., 2018), BI-RADS for breast cancer screening from 0 to 6 (Mercado, 2014), and the Bosniak classification of renal lesions from 1 to 4 (Silverman et al., 2019). Retinopathy has the Scottish grading protocol from 0 to 4 (Zachariah et al., 2015). While these examples are not intended to be a comprehensive review, they hopefully provide some insight into just how prevalent ordinal scales are in modern medicine.