1 Introduction
Ordinal regression (sometimes called ordinal classification) is applied to data in which the features $x_i$ of the $i$th example correspond to a label $y_i$ from a set of $K$ elements that have a well-defined ranking or ordering $r_1 \prec r_2 \prec \cdots \prec r_K$. However, unlike traditional metric regression, the ranks cannot be assumed to have quantitative differences or distances amongst themselves. For example, while a syntactic statement such as “terrible” $\prec$ “great” $\prec$ “best” may be intuitive, it conveys nothing about the quantitative distance between the ranks nor qualitative differences, i.e., whether the distance between “terrible” and “great” is equal to that between “great” and “best”. The aim in this setting is to build a reliable rule or regression function from the domain of the features to the range of the ordinal labels.
In the published literature for applied problems, it remains commonplace to ignore the ordering of the labels and apply categorical algorithms to such data (Levi & Hassncer, 2015; Rothe et al., 2015), which in neural networks often results in application of the categorical cross-entropy (CCE) loss. Problematically, a categorical loss assumes every mislabeling by the regression rule is equally wrong, whereas it is clear that predicting “great” when the true label is “best” would be preferable to a prediction of “terrible”. Although the problematic nature of this practice has been recognized for more than 35 years (Forrest & Andersen, 1986), it still remains common to make the implicit or explicit assumption that ordinal data or labels exist on an interval or ratio scale.
Most contemporary algorithms in the ordinal regression literature (McCullagh, 1980; Obermayer, 1999; Crammer & Singer, 2002; Shashua & Levin, 2002; Rajaram et al., 2003; Shen & Joshi, 2005; Chu & Keerthi, 2005; Li & Lin, 2007; Baccianella et al., 2009; Niu et al., 2016; Cao et al., 2020) can be viewed through the lens of a general framework proposed by Li & Lin (2007), wherein the $K$ labels are encoded as binary vectors by an invertible encoder, and the regression function becomes a collection of $K-1$ binary classifiers along with the corresponding decoder. However, many of these existing algorithms share one major limitation: rank inconsistency among predictions. Briefly, the binary tasks are not independent, and without special care, training the $K-1$ classifiers will produce conflicting predictions on the binary tasks, impeding both performance and interpretation of the results. Most recently, Cao et al. (2020) attempted to address this problem in deep neural network (DNN) architectures by proposing a final layer that shares weights among the binary outputs (differing only in their bias terms). Herein, we identify the theoretical and practical limitations of the CORAL approach by Cao et al. (2020), and address these concerns with a new algorithm, ‘Conditionals for Ordinal Regression’ (CONDOR). We prove that CONDOR is universally rank consistent and that it is sufficiently expressive to reach any rank consistent solution. In theory, the method is compatible with any combination of binary classification algorithms, but herein we focus on DNN architectures trained by backpropagation. Using our open-source PyTorch and TensorFlow packages, CONDOR can be implemented with minor modifications to any categorical DNN model in these frameworks, allowing for increased adoption of ordinal regression in the applied literature.
2 Methods
We begin by introducing the CONDOR framework and its notations, and then provide proofs of the universal rank consistency and the full expressiveness of the framework.
2.1 CONDOR framework
The Li & Lin (2007) encoder converts the ordinal regression label $y_i$ into $K-1$ binary classification labels $y_i^{(1)}, \dots, y_i^{(K-1)}$ using indicator variables, where the $k$th indicator variable is defined as
$$y_i^{(k)} = \mathbb{1}\{y_i \succ r_k\}.$$
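The extended binary encoding can be sketched in a few lines; this is an illustrative implementation (with a hypothetical function name), assuming labels are supplied as rank indices $1, \dots, K$:

```python
import numpy as np

def encode_ordinal_labels(rank_indices, num_classes):
    """Extended binary encoding of Li & Lin (2007): a label with rank index q
    maps to K-1 indicators, the k-th being 1 exactly when q exceeds rank k."""
    rank_indices = np.asarray(rank_indices)      # shape (n,), values in 1..K
    thresholds = np.arange(1, num_classes)       # k = 1, ..., K-1
    return (rank_indices[:, None] > thresholds[None, :]).astype(int)

# Example with K = 4 ranks: rank 1 -> [0,0,0], rank 3 -> [1,1,0]
labels = encode_ordinal_labels([1, 3, 4], num_classes=4)
```

Note the encoding is invertible: the rank index is recovered as one plus the number of ones in the binary vector.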
The ordinal classification problem then becomes a matter of producing $K-1$ binary classifier subtasks $\hat y_i^{(k)}$, which we assume come from thresholding predicted binary class probabilities as
$$\hat y_i^{(k)} = \mathbb{1}\big\{\hat P(y_i^{(k)}=1 \mid x_i) > 0.5\big\}.$$
For convenience, we often deal with the rank index $q \in \{1, \dots, K\}$ associated with the rank $r_q$. From the binary classifier subtasks, the rank index for input feature vector $x_i$ can be estimated as
$$\hat q_i = 1 + \sum_{k=1}^{K-1} \hat y_i^{(k)}, \qquad (1)$$
although multiple methods are possible to produce point estimates for $\hat q_i$ from the probabilities $\hat P(y_i^{(k)}=1 \mid x_i)$. As shown in Figure 1, the aforementioned binary encoding approach requires that the probabilities be rank-monotonic,
$$\hat P(y_i^{(1)}=1 \mid x_i) \ge \hat P(y_i^{(2)}=1 \mid x_i) \ge \cdots \ge \hat P(y_i^{(K-1)}=1 \mid x_i),$$
for all $x_i$ to guarantee consistent predictions.
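The point estimate of Equation (1) and the rank-monotonicity check can be sketched as follows (a minimal illustration with hypothetical helper names, assuming a matrix of $K-1$ marginal probabilities per example):

```python
import numpy as np

def decode_rank_index(marginal_probs):
    """Equation (1): one plus the number of binary subtasks whose predicted
    probability exceeds the 0.5 threshold."""
    marginal_probs = np.asarray(marginal_probs)   # shape (n, K-1)
    return 1 + (marginal_probs > 0.5).sum(axis=1)

def is_rank_monotonic(marginal_probs):
    """Rank-monotonicity: P(y^(1)=1) >= P(y^(2)=1) >= ... >= P(y^(K-1)=1)."""
    p = np.asarray(marginal_probs)
    return np.all(np.diff(p, axis=1) <= 0, axis=1)

probs = np.array([[0.9, 0.7, 0.2],    # monotone: yields rank index 3
                  [0.4, 0.8, 0.1]])   # non-monotone: an inconsistent prediction
```

The second row illustrates the failure mode discussed above: the subtask probabilities conflict, so any point estimate derived from them is hard to interpret.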
Rather than directly estimating the marginal probabilities $P(y_i^{(k)}=1 \mid x_i)$, $k = 1, \dots, K-1$, as in existing approaches based on Li & Lin (2007), we estimate the conditionals
$$P\big(y_i^{(k)}=1 \,\big|\, y_i^{(k-1)}=1, x_i\big) = P\big(y_i^{(k)}=1 \,\big|\, y_i^{(k-1)}=1, \dots, y_i^{(1)}=1, x_i\big) \qquad (2)$$
for $k = 1, \dots, K-1$, where we set the boundary condition $y_i^{(0)} = 1$ with unit probability by convention. Equality (2) follows by construction of the binary labels, since $y_i^{(k-1)}=1$ implies $y_i \succ r_{k-1}$, which by definition means $y_i^{(j)}=1$ for $j \le k-1$. By the same reasoning, the marginal probability is equivalent to the joint probability
$$P\big(y_i^{(k)}=1 \,\big|\, x_i\big) = P\big(y_i^{(k)}=1, y_i^{(k-1)}=1, \dots, y_i^{(1)}=1 \,\big|\, x_i\big),$$
and by the product rule we produce a heterogeneous Markov chain representation of our marginal probabilities
$$P\big(y_i^{(k)}=1 \,\big|\, x_i\big) = \prod_{j=1}^{k} P\big(y_i^{(j)}=1 \,\big|\, y_i^{(j-1)}=1, x_i\big). \qquad (3)$$
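The product in Equation (3) is a cumulative product along the subtask axis, which can be sketched as:

```python
import numpy as np

def marginals_from_conditionals(cond_probs):
    """Equation (3): P(y^(k)=1) = prod_{j<=k} P(y^(j)=1 | y^(j-1)=1),
    using the convention P(y^(0)=1) = 1."""
    return np.cumprod(np.asarray(cond_probs), axis=1)

cond = np.array([[0.9, 0.8, 0.5]])   # conditionals for K = 4 ranks
marg = marginals_from_conditionals(cond)
# Because every factor lies in [0, 1], the marginals are automatically
# non-increasing in k, regardless of the conditional values.
```

This cumulative-product structure is exactly what delivers the rank consistency guarantee of the next subsection.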
The identities in Equations (2) and (3) are exact (i.e., not approximations) and thus fully general. They can in theory be applied with any classifier estimating binary probabilities, but we focus on the application to DNNs. Namely, we select the final layer of the neural network to have $K-1$ nodes with sigmoid activations representing the conditionals $\hat P(y_i^{(k)}=1 \mid y_i^{(k-1)}=1, x_i)$. For training, the loss is the weighted sum of the binary cross-entropies of all the subtasks,
$$\mathcal{L} = \sum_{k=1}^{K-1} \lambda^{(k)}\, \mathrm{BCE}^{(k)}, \qquad (4)$$
where $\mathrm{BCE}^{(k)}$ denotes the binary cross-entropy of the $k$th subtask and $\lambda^{(k)}$ is the importance parameter for task $k$, which we set to one in the subsequent experiments. We call this approach Conditionals for Ordinal Regression (CONDOR).
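Equation (4) leaves the per-subtask cross-entropy generic; the sketch below is one plausible instantiation (evaluating the BCE on the marginal probabilities obtained via Equation (3)), not necessarily the exact computation used by the released CONDOR packages:

```python
import numpy as np

def condor_style_loss(cond_probs, binary_labels, task_weights=None, eps=1e-9):
    """Hedged sketch of Equation (4): a weighted sum over the K-1 subtasks of
    binary cross-entropies, here computed on the marginals from Equation (3).
    The released packages may organize this computation differently."""
    marg = np.cumprod(np.asarray(cond_probs), axis=1)   # Equation (3)
    y = np.asarray(binary_labels)
    bce_per_task = -(y * np.log(marg + eps)
                     + (1 - y) * np.log(1 - marg + eps)).mean(axis=0)
    if task_weights is None:                            # lambda^(k) = 1
        task_weights = np.ones(bce_per_task.shape[0])
    return float(np.dot(task_weights, bce_per_task))
```

Setting all $\lambda^{(k)} = 1$ recovers the unweighted case used in the experiments below.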
2.2 Rank consistency
Here we substantiate CONDOR’s guarantee for preserving rank consistency and its ability to represent any rank consistent solution.
Lemma 2.1.
CONDOR provides universal rank consistency (i.e., rank consistent estimates for all input data and any parameterization of the DNN).
Proof.
In neural networks, we can enforce $\hat P(y_i^{(k)}=1 \mid y_i^{(k-1)}=1, x_i) \in (0, 1)$ for all $k$ for any weight parameterization of the DNN by having $K-1$ output nodes with sigmoid activations representing the conditionals. Because each factor lies in $(0, 1)$ for all $k$, we have by Equation (3)
$$\hat P\big(y_i^{(k)}=1 \,\big|\, x_i\big) = \hat P\big(y_i^{(k-1)}=1 \,\big|\, x_i\big)\, \hat P\big(y_i^{(k)}=1 \,\big|\, y_i^{(k-1)}=1, x_i\big) \le \hat P\big(y_i^{(k-1)}=1 \,\big|\, x_i\big)$$
for all $k$ and $x_i$. Thus we have rank consistency
$$\hat P\big(y_i^{(1)}=1 \,\big|\, x_i\big) \ge \hat P\big(y_i^{(2)}=1 \,\big|\, x_i\big) \ge \cdots \ge \hat P\big(y_i^{(K-1)}=1 \,\big|\, x_i\big)$$
for all $x_i$ and any weight parameterization of the DNN. ∎
Theorem 2.2.
Assuming that a neural network can universally approximate any continuous function $f$, i.e., produce $\hat f$ with $|\hat f(x) - f(x)| < \delta$ for all $x$ and some $\delta > 0$, then adding a CONDOR output layer to said network can approximate any rank consistent continuous ordinal regressor with error $O(\delta)$.
Proof.
By rank consistency,
$$P\big(y^{(k)}=1 \,\big|\, x\big) \le P\big(y^{(k-1)}=1 \,\big|\, x\big)$$
for any $k$ and $x$, and we have defined the boundary condition $P(y^{(0)}=1 \mid x) = 1$. Then we define for $k = 1, \dots, K-1$
$$g_k(x) = \frac{P\big(y^{(k)}=1 \,\big|\, x\big)}{P\big(y^{(k-1)}=1 \,\big|\, x\big)}, \qquad (5)$$
and note that for each $k$ and $x$ the function $g_k$ is continuous and satisfies $0 \le g_k(x) \le 1$. Then define the continuous functions
$$f_k(x) = \mathrm{logit}\big(g_k(x)\big) = \log \frac{g_k(x)}{1 - g_k(x)}.$$
Because the upstream neural network can approximate any continuous function with error $\delta$, we can have the upstream neural network produce functions $\hat f_k$ with $\hat f_k(x) = f_k(x) + O(\delta)$ for all $k$ and $x$. Then after the CONDOR sigmoid activations, the neural network would produce $\sigma\big(\hat f_k(x)\big)$ for all $k$ at its output nodes (note that only the activation at the last layer is fixed as the sigmoid function; the activation of the hidden layers could be other functions, e.g., ReLU), which by Taylor series about $\delta = 0$ produces
$$\sigma\big(\hat f_k(x)\big) = g_k(x) + O(\delta).$$
Then the CONDOR approach yields
$$\hat P\big(y^{(k)}=1 \,\big|\, x\big) = \prod_{j=1}^{k} \sigma\big(\hat f_j(x)\big) = \prod_{j=1}^{k} \big(g_j(x) + O(\delta)\big).$$
By iteration it follows that
$$\hat P\big(y^{(k)}=1 \,\big|\, x\big) = \prod_{j=1}^{k} g_j(x) + O(\delta).$$
Using the definition of $g_j$ in Equation (5), we get
$$\hat P\big(y^{(k)}=1 \,\big|\, x\big) = \prod_{j=1}^{k} \frac{P\big(y^{(j)}=1 \,\big|\, x\big)}{P\big(y^{(j-1)}=1 \,\big|\, x\big)} + O(\delta) \qquad (6)$$
$$= P\big(y^{(k)}=1 \,\big|\, x\big) + O(\delta) \qquad (7)$$
for all $k$ and $x$. The last Equality (7) comes from considering separately the cases $P(y^{(k-1)}=1 \mid x) > 0$ and $P(y^{(k-1)}=1 \mid x) = 0$. In the former, the product in Equation (6) telescopes and Equation (6) reduces to $P(y^{(k)}=1 \mid x) + O(\delta)$. In the latter, rank consistency forces $P(y^{(k)}=1 \mid x) = 0$ as well, and we find from Equation (6) that $\hat P(y^{(k)}=1 \mid x) = O(\delta) = P(y^{(k)}=1 \mid x) + O(\delta)$. ∎
3 Numerical Experiments
Here we discuss the challenges of measuring the performance of ordinal regression methods, as well as the potential strengths and weaknesses of various candidate measures. In the subsequent sections, we then demonstrate the superiority of the performance of CONDOR compared to the state-of-the-art on several benchmark and real-world data sets.
The binary cross-entropy (BCE) loss in Equation (4) is one of the few common evaluation metrics fully sensitive to the ordinal nature of the outcomes, since most other metrics either ignore crucial information about ordering (e.g., accuracy, which is categorical and thus lacks any notion of “incorrect but close”) or artificially impose an interval or ratio scale on the data (e.g., mean absolute error or earth mover's distance, which by necessity require metric distances to be defined between each rank). Despite its suitability, the cross-entropy remains a less intuitive metric than its alternatives, and so we benchmark these additional performance measures while acknowledging their shortcomings in the ordinal setting. Specifically, we profile the earth mover's distance (EMD) on the rank indices with unit distance between ranks, and the mean absolute error (MAE) on the rank indices with unit distance between ranks, using Equation (1) for the point estimate.
In the subsequent sections, the only difference between the three methods is the choice of loss function and final layer of the neural network; all other details of the DNN architecture, the optimization algorithms, hyperparameters and random number seeds are kept equal throughout each experiment. Namely, CORAL and CONDOR both have the sum of the BCE for the $K-1$ subtasks as a loss, whereas the categorical method uses the CCE. Likewise, CORAL uses a custom final layer with weight sharing (Cao et al., 2020) among output nodes, CONDOR uses a final dense layer with $K-1$ output nodes which after sigmoid activation represent the conditional probabilities, and the categorical algorithm uses a dense layer with $K$ nodes and a softmax activation. All results were gathered with three random number seeds and reported as the mean plus or minus the standard deviation across these seeds.
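The interval-scale metrics above can be sketched as follows; this is an illustrative implementation with hypothetical function names, and how the predicted distribution over ranks is formed for the EMD (e.g., from the $K-1$ marginals) is an implementation choice not pinned down here:

```python
import numpy as np

def mae_on_rank_indices(pred_rank_idx, true_rank_idx):
    """Mean absolute error on rank indices, which imposes unit spacing."""
    return float(np.mean(np.abs(np.asarray(pred_rank_idx)
                                - np.asarray(true_rank_idx))))

def emd_unit_distance(pred_dist, true_dist):
    """1-D earth mover's distance with unit distance between adjacent ranks,
    computed per example as the L1 distance between the two CDFs."""
    cdf_diff = np.cumsum(np.asarray(pred_dist) - np.asarray(true_dist), axis=-1)
    return float(np.mean(np.sum(np.abs(cdf_diff), axis=-1)))
```

With unit spacing, a prediction one rank away from a one-hot truth contributes exactly 1 to the EMD, mirroring the MAE on rank indices for point predictions.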
3.1 Synthetic Quadrants Dataset
We consider the simple task of ordinal classification wherein the $K = 4$ labels are the quadrants of the plane going counterclockwise and the features are generated from a 2D standard normal distribution. We draw random samples and perform a train/test split of the dataset. We select as the upstream network architecture two dense layers with ten neurons and ReLU activations, and train with the Adam optimizer and early stopping on a validation split. As can be seen in Table 1, CONDOR has the best performance in BCE, EMD and MAE.

Table 1: Synthetic quadrants results (mean ± standard deviation over three seeds).
ALGORITHM    BCE              MAE              EMD
CONDOR       0.0768 ± 0.0100  0.0167 ± 0.0153  0.0799 ± 0.0526
CORAL        0.5074 ± 0.0754  0.0733 ± 0.0416  0.4080 ± 0.0377
CATEGORICAL  1.3438 ± 0.0458  0.0200 ± 0.0173  1.0318 ± 0.0267
3.2 MNIST as an ordinal dataset
Table 2: MNIST ordinal results (mean ± standard deviation over three seeds).
ALGORITHM    BCE              MAE              EMD
CONDOR       0.1784 ± 0.0043  0.0596 ± 0.0027  0.0818 ± 0.0065
CORAL        1.2724 ± 0.0139  0.4583 ± 0.0028  0.7501 ± 0.0214
CATEGORICAL  5.5424 ± 0.0013  0.0592 ± 0.0042  3.0638 ± 0.0013
Depending on the application, MNIST can be considered a categorical problem or an interval regression problem. If the digits are used for license-plate recognition, then the problem is categorical since there is no notion of “close” errors. By contrast, if the digits are used for GPS coordinates or postal codes, then the ordering and distance between numerals becomes relevant and categorical classification is no longer the most appropriate framing of the task. It is valid to treat interval regression as an ordinal problem since the latter assumes less structure on the dataset, although it is generally recommended to exploit the interval scale when it exists. Here we treat MNIST as ordinal data for the purpose of benchmarking our ordinal algorithms, while acknowledging that it should likely be treated as either categorical classification or interval regression as dictated by the specific real-world application setting. The MNIST data are split into training, validation and test sets of 55K, 5K and 10K images, respectively. We utilize a convolutional neural network with two convolutional layers of 64 and 32 filters respectively and a kernel size of 3, before flattening and passing to the appropriate output layer and loss function for our three models (CONDOR, CORAL, Categorical). Training is performed with the Adam optimizer and early stopping. The results in Table 2 indicate that CONDOR demonstrates superior performance in terms of both BCE and EMD, while the Categorical approach performs marginally better in MAE.
3.3 NLP on Amazon reviews dataset
Here we consider a natural language processing (NLP) dataset consisting of (non-duplicate and non-empty) Amazon Pantry text reviews with their corresponding one- to five-star ratings (Ni et al., 2019). We split the data to hold out a test set. For the neural network architecture, we use the fixed and pretrained Google universal sentence encoder (Cer et al., 2018) and append a dense layer with 64 ReLU-activated neurons and dropout, followed by the appropriate output layer and loss function for each model. Training is performed with the Adam optimizer and an early-stopping patience of 10 on a validation split. The results in Table 3 demonstrate that CONDOR provides the strongest performance in this benchmark across all three performance metrics.

Table 3: Amazon reviews results (mean ± standard deviation over three seeds).
ALGORITHM    BCE              MAE              EMD
CONDOR       0.7807 ± 0.0113  0.3180 ± 0.0047  0.4582 ± 0.0050
CORAL        0.8095 ± 0.0098  0.3263 ± 0.0052  0.4726 ± 0.0041
CATEGORICAL  2.4663 ± 0.4678  0.4195 ± 0.0074  1.6906 ± 0.2422
3.4 GRU-D for COVID-19 prognostication
This study adheres to a research protocol approved by the Mayo Clinic Institutional Review Board. Here we extend the results from Sankaranarayanan et al. (2021) to progress from their binary classification predicting mortality to ordinal regression predicting severity of outcomes. Namely, this clinical dataset includes two binary severity outcomes: an indicator variable for mechanical ventilation or extracorporeal membrane oxygenation (ECMO), and an indicator variable of whether patient death occurred. From these, a clear three-point ordinal scale can be constructed whereby a patient is scored a zero when they have no severe outcome, a one when they experienced the severe outcome of ventilation or ECMO, and a two corresponding to death (with or without prior ventilation or ECMO). Sankaranarayanan et al. (2021) identified the GRU-D recurrent neural network architecture as the best-performing model for binary mortality prediction, and we extend that approach to the ordinal problem. We do this for the CONDOR, CORAL, and categorical algorithms using their corresponding final layers and loss functions. The GRU-D architecture (Che et al., 2018) deals explicitly with multivariate time series that are missing not at random (MNAR) due to the manner in which clinical data are ordered and recorded in an electronic health record (EHR). The default hyperparameters (dropout, L2 regularization, hidden and recurrent layer sizes, batch size, Adam learning rate, no batch norm, no bidirectional RNN, maximum time steps, and early stopping patience) have previously demonstrated strong performance (Sankaranarayanan et al., 2021) and so are retained here.
Table 4: COVID-19 prognostication results (mean ± standard deviation over three seeds).
ALGORITHM    BCE              MAE              EMD
CONDOR       0.6526 ± 0.0004  0.2986 ± 0.0044  0.4151 ± 0.0063
CORAL        0.6711 ± 0.0000  0.3021 ± 0.0015  0.4261 ± 0.0023
CATEGORICAL  1.1075 ± 0.0019  0.3076 ± 0.0021  0.8185 ± 0.0010
The dataset is split into a training/validation set of patients who tested positive for COVID-19 by PCR test prior to December 15, 2020, and a prospective testing set of patients who tested positive on or after that date. For training we use an identical training/validation split, which facilitates early stopping. The results in Table 4 demonstrate that CONDOR is superior in all evaluation metrics. Furthermore, the CONDOR-based GRU-D model has a prospective test set AUROC for the mortality prediction subtask greater than the 0.901 reported in Sankaranarayanan et al. (2021), wherein the authors trained the algorithm as a binary classifier specifically for mortality prediction. This demonstrates that there is no loss of mortality prediction performance when building a DNN to address the more challenging task of prognostication.
4 Discussion
We have demonstrated the ability of the CONDOR approach to overcome limitations present in popular alternative methods and to produce rank consistent results in the classification of data with ordinal labels. Rank consistency is important not only for theoretical soundness, but also in application settings where explainability is important and a rank inconsistent prediction would be unacceptably contradictory and fundamentally unexplainable. Regardless of the loss function being optimized or the parameterization of the neural network, CONDOR provides universal guarantees of rank consistency by Lemma 2.1, which is to say the CONDOR approach is “sufficient” for rank consistency. Our next result leverages the fact that there is a wide variety of universal approximation theorems for neural networks, each with its own technical conditions (e.g., see Chong (2020) for discussion of various universal approximation proofs and technical conditions). Namely, Theorem 2.2 states that any upstream neural network satisfying the conditions for universal approximation can be provided a CONDOR output layer, which will create a universally rank consistent network that can approximate any rank consistent solution. This theorem can be interpreted as CONDOR being “necessary” for rank consistency, insofar as any rank consistent solution can be represented by a CONDOR neural network.
In contrast, note that Theorem 1 of Cao et al. (2020) only provides rank consistency at the global minimum of an optimization problem with the specified loss function. Since neural network training is not guaranteed or expected to reach a globally optimal parameterization, the Cao et al. (2020) approach in practice may produce rank inconsistent solutions and requires post hoc checks of the estimated bias terms to verify rank consistency. Furthermore, Cao et al. (2020) restricts the expressiveness of the binary classifier outputs, which must have “parallel slopes” (i.e., differ only by a bias parameter whose impact is completely independent of the feature vector). In Appendix A.1, we formalize these comments with two proofs demonstrating that the CORAL framework (Cao et al., 2020) lacks the theoretical guarantees of CONDOR.
Beyond our mathematical justifications, ultimately it is critical that the method perform well within a wide variety of neural network architectures and ordinal problem settings. Our benchmarking of dense networks, CNNs, attention networks, and exotic RNNs all shows practical benefit of using the CONDOR algorithm in a diverse set of applications. In all benchmarks, CONDOR provided the best performance according to the ordinal metric of BCE. Furthermore, CONDOR often provided the best performance according to the interval-scale metrics of EMD and MAE. We emphasize, however, that both of these metrics assume an interval scale on the ranks (i.e., uniform spacing between ranks) that is not formally justified from a mathematical perspective in an ordinal setting. Less formally, if we recall the ordinal example “terrible” $\prec$ “great” $\prec$ “best”, then uniform spacing would not be compatible with most readers' intuition about the problem. Any attempt to arbitrarily embed these three ranks into a metric space would be unlikely to achieve universal agreement amongst practitioners. Thus, ordinal regression is best exemplified in use cases where a well-defined ordering is clear but distances between the ranks are undefined. In that sense the BCE is the only metric that directly evaluates ordinal performance. Nonetheless, CONDOR not only remains competitive in the categorical performance measure of accuracy, but in fact provides improved classification in true ordinal problems when compared to categorically optimized neural networks, in theory due to the ability of the network to exploit “clues” encountered during ordinal training (Appendix A.2).
In addition to the theoretical strengths and performance improvements of the CONDOR method, we note that most applied papers simply use categorical classification in their problem settings rather than consider current state-of-the-art methods for ordinal regression. Part of this may be educational, as most beginners are only taught binary/categorical classification and continuous regression, and discussions of the ordinal setting are largely only found in the advanced literature. However, the authors also believe part of the barrier is programmatic ease of use. We provide production-ready and user-friendly software packages in both PyTorch and TensorFlow, in order to minimize the effort required to convert existing categorical codebases into CONDOR ordinal codebases. In Appendix A.3, we demonstrate the modest code changes required to implement our methodology in an existing categorical code base.
The key requirements for successful supervised learning are algorithms that respect the structure of the problem and access to sufficient amounts of labeled data. Since CONDOR satisfies the first requirement by providing a robust algorithm for ordinal regression, we conclude with the latter by emphasizing the prevalence of available ordinal outcome measurements, using medical applications as a prototypical applied problem domain. Survey research, for instance, frequently utilizes ordinal responses such as the psychometric Likert scale (Likert, 1932), providing a large corpus of existing data with ordinal labels. Furthermore, while labeling outcomes from the Electronic Health Record (EHR) is one of the most time-consuming and expensive aspects of applied machine learning in the medical space, the proliferation of ordinal scales in modern medical practice (see Appendix A.4) means the EHR already contains physician-provided ordinal outcomes from a large variety of settings. The ubiquity of ordinal outcome measurements throughout survey research and medical settings represents a rich untapped reserve of training data that has yet to be explored by ordinal regression algorithms, and it is our hope that CONDOR's demonstrated capabilities and ease of use will encourage its adoption and enable broad exploration of underutilized data across these and other domains.
Acknowledgments
The funding for this research has been provided by the Mayo Clinic Center for Individualized Medicine. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The authors thank Saranya Sankaranarayanan and Jagadheshwar Balan for sharing their preprocessed versions of the COVID-19 data set and their code for GRU-D mortality prediction.
References
 Baccianella et al. (2009) Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. Evaluation measures for ordinal regression. In 2009 Ninth International Conference on Intelligent Systems Design and Applications, pp. 283–287, 2009. doi: 10.1109/ISDA.2009.230.
 Cao et al. (2020) Wenzhi Cao, Vahid Mirjalili, and Sebastian Raschka. Rank consistent ordinal regression for neural networks with application to age estimation. Pattern Recognition Letters, 140:325–331, Dec 2020. doi: 10.1016/j.patrec.2020.11.008. URL https://doi.org/10.1016%2Fj.patrec.2020.11.008.
 Cer et al. (2018) Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Céspedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. Universal sentence encoder. arXiv, pp. 1803.11175, 2018. URL https://arxiv.org/abs/1803.11175.
 Che et al. (2018) Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent neural networks for multivariate time series with missing values. Scientific reports, 8(1):6085, 2018.
 Chong (2020) Kai Fong Ernest Chong. A closer look at the approximation capabilities of neural networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkevSgrtPr.
 Chu & Keerthi (2005) Wei Chu and S. Sathiya Keerthi. New approaches to support vector ordinal regression. In Proceedings of the 22nd International Conference on Machine Learning, ICML ’05, pp. 145–152, New York, NY, USA, 2005. Association for Computing Machinery. ISBN 1595931805. doi: 10.1145/1102351.1102370. URL https://doi.org/10.1145/1102351.1102370.
 Crammer & Singer (2002) K. Crammer and Y. Singer. Pranking with ranking. Advances in Neural Information Processing Systems, 14:641–647, 2002.
 Forrest & Andersen (1986) M. Forrest and B. Andersen. Ordinal scale and statistics in medical research. Br Med J (Clin Res Ed), 292(6519):537–538, Feb 1986.
 Godoy et al. (2018) M. C. B. Godoy, E. G. L. C. Odisio, J. J. Erasmus, R. C. Chate, R. S. Dos Santos, and M. T. Truong. Understanding Lung-RADS 1.0: A Case-Based Review. Semin Ultrasound CT MR, 39(3):260–272, Jun 2018.

 Levi & Hassncer (2015) Gil Levi and Tal Hassncer. Age and gender classification using convolutional neural networks. In 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 34–42, 2015. doi: 10.1109/CVPRW.2015.7301352.
 Li & Lin (2007) L. Li and H.-T. Lin. Ordinal regression by extended binary classification. Advances in Neural Information Processing Systems, pp. 865–872, 2007.
 Likert (1932) Rensis Likert. A technique for the measurement of attitudes. Archives of Psychology, 22(140):55, 1932.
 McCullagh (1980) P. McCullagh. Regression models for ordinal data (with discussion). Journal of the Royal Statistical Society, Series B, 42:109–142, 1980.
 Mercado (2014) C. L. Mercado. BI-RADS update. Radiol Clin North Am, 52(3):481–487, May 2014.
 Moore & Moore (2010) E. E. Moore and F. A. Moore. American Association for the Surgery of Trauma Organ Injury Scaling: 50th anniversary review article of the Journal of Trauma. J Trauma, 69(6):1600–1601, Dec 2010.
 Ni et al. (2019) Jianmo Ni, Jiacheng Li, and Julian McAuley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pp. 188–197, 2019.
 Niu et al. (2016) Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, and Gang Hua. Ordinal regression with multiple output cnn for age estimation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4920–4928, 2016. doi: 10.1109/CVPR.2016.532.
 Obermayer (1999) K. Obermayer. Support vector learning for ordinal regression. IET Conference Proceedings, pp. 97–102(5), 1999. URL https://digitallibrary.theiet.org/content/conferences/10.1049/cp_19991091.
 Rajaram et al. (2003) Shyamsundar Rajaram, Ashutosh Garg, Xiang Sean Zhou, and Thomas S. Huang. Classification approach towards ranking and sorting problems. In Nada Lavrač, Dragan Gamberger, Hendrik Blockeel, and Ljupčo Todorovski (eds.), Machine Learning: ECML 2003, pp. 301–312, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg. ISBN 9783540398578.
 Reith et al. (2017) F. C. M. Reith, H. F. Lingsma, B. J. Gabbe, F. E. Lecky, I. Roberts, and A. I. R. Maas. Differential effects of the Glasgow Coma Scale Score and its Components: An analysis of 54,069 patients with traumatic brain injury. Injury, 48(9):1932–1943, Sep 2017.
 Richards et al. (2015) S. Richards, N. Aziz, S. Bale, D. Bick, S. Das, J. GastierFoster, W. W. Grody, M. Hegde, E. Lyon, E. Spector, K. Voelkerding, and H. L. Rehm. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med, 17(5):405–424, May 2015.
 Rothe et al. (2015) Rasmus Rothe, Radu Timofte, and Luc Van Gool. Dex: Deep expectation of apparent age from a single image. In 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), pp. 252–257, 2015. doi: 10.1109/ICCVW.2015.41.
 Sankaranarayanan et al. (2021) Saranya Sankaranarayanan, Jagadheshwar Balan, Jesse R Walsh, Yanhong Wu, Sara Minnich, Amy Piazza, Collin Osborne, Gavin R Oliver, Jessica Lesko, Kathy L Bates, Kia Khezeli, Darci R Block, Margaret DiGuardo, Justin Kreuter, John C O’Horo, John Kalantari, Eric W Klee, Mohamed E Salama, Benjamin Kipp, William G Morice, and Garrett Jenkinson. COVID-19 mortality prediction from deep learning in a large multi-state electronic health record and laboratory information system data set: Algorithm development and validation. J Med Internet Res, 23(9):e30157, Sep 2021. ISSN 1438-8871. doi: 10.2196/30157. URL https://www.jmir.org/2021/9/e30157.
 Shashua & Levin (2002) Amnon Shashua and Anat Levin. Ranking with large margin principle: Two approaches. In Proceedings of the 15th International Conference on Neural Information Processing Systems, NIPS’02, pp. 961–968, Cambridge, MA, USA, 2002. MIT Press.

Shen & Joshi (2005) Libin Shen and Aravind K. Joshi. Ranking and reranking with perceptron. Machine Learning, 60(1–3):73–96, Jun 2005. doi: 10.1007/s10994-005-0918-9. URL https://doi.org/10.1007/s10994-005-0918-9.
 Silverman et al. (2019) S. G. Silverman, I. Pedrosa, J. H. Ellis, N. M. Hindman, N. Schieda, A. D. Smith, E. M. Remer, A. B. Shinagare, N. E. Curci, S. S. Raman, S. A. Wells, S. D. Kaffenberger, Z. J. Wang, H. Chandarana, and M. S. Davenport. Bosniak Classification of Cystic Renal Masses, Version 2019: An Update Proposal and Needs Assessment. Radiology, 292(2):475–488, Aug 2019.
 Williamson & Hoggart (2005) A. Williamson and B. Hoggart. Pain: a review of three commonly used pain rating scales. J Clin Nurs, 14(7):798–804, Aug 2005.
 Zachariah et al. (2015) S. Zachariah, W. Wykes, and D. Yorston. Grading diabetic retinopathy (DR) using the Scottish grading protocol. Community Eye Health, 28(92):72–73, 2015.
Appendix A Appendices
A.1 CORAL proofs
In this appendix, we demonstrate formally that the CORAL framework of Cao et al. (2020) has neither the universal rank consistency nor the expressiveness of CONDOR.
Lemma A.1.
CORAL is not universally rank consistent.
Proof.
For CORAL, the last layer shares weights and only has different bias terms, meaning it represents the output probabilities as (Cao et al., 2020)
$$\hat P\big(y_i^{(k)}=1 \,\big|\, x_i\big) = \sigma\big(g(x_i) + b_k\big)$$
for some shared function $g$ and bias terms $b_1, \dots, b_{K-1}$. This means in the notation of CONDOR that
$$\hat P\big(y_i^{(k)}=1 \,\big|\, y_i^{(k-1)}=1, x_i\big) = \frac{\sigma\big(g(x_i) + b_k\big)}{\sigma\big(g(x_i) + b_{k-1}\big)} \qquad (9)$$
for $k = 2, \dots, K-1$ and all $x_i$. Note that if the neural network parameters are chosen such that $b_k > b_{k-1}$ for some $k$, then $\hat P(y_i^{(k)}=1 \mid x_i) > \hat P(y_i^{(k-1)}=1 \mid x_i)$ and CORAL is rank inconsistent. ∎
It is quite clear from Equation (9) that the functional form of CORAL is far more restrictive than CONDOR, which allows arbitrary functions. For completeness, we prove formally in the next lemma that it is less expressive.
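The contrast can be illustrated numerically; the values below are arbitrary toy numbers (not from the paper), chosen only to show that mis-ordered CORAL biases break monotonicity while the CONDOR cumulative product cannot:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# CORAL-style marginals: a shared score g(x) plus per-task biases b_k.
# If the learned biases are not non-increasing, neither are the marginals.
g_x = 0.3
biases = np.array([-1.0, 0.5, -0.2])          # deliberately mis-ordered
coral_marginals = sigmoid(g_x + biases)
coral_consistent = bool(np.all(np.diff(coral_marginals) <= 0))

# CONDOR-style marginals: cumulative product of sigmoid outputs, which is
# non-increasing for ANY real-valued network outputs f_k(x).
f_x = np.array([2.0, -1.0, 3.0])              # arbitrary network outputs
condor_marginals = np.cumprod(sigmoid(f_x))
condor_consistent = bool(np.all(np.diff(condor_marginals) <= 0))
```

No parameter choice can make the CONDOR marginals increase, whereas CORAL's consistency depends entirely on the ordering the optimizer happens to learn for its biases.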
Lemma A.2.
CORAL cannot approximate every rank consistent solution with $O(\delta)$ error.
Proof.
For simplicity, consider a univariate input $x \in \mathbb{R}$ and $K = 3$. Then we have for CORAL
$$\hat P\big(y^{(1)}=1 \,\big|\, x\big) = \sigma\big(g(x) + b_1\big) \quad \text{and} \quad \hat P\big(y^{(2)}=1 \,\big|\, x\big) = \sigma\big(g(x) + b_2\big).$$
Consider an extremely simple CONDOR network with no hidden layers, bias parameters fixed to zero, and only two nonzero weights $w_1, w_2$ producing
$$\hat P\big(y^{(1)}=1 \,\big|\, x\big) = \sigma(w_1 x) \quad \text{and} \quad \hat P\big(y^{(2)}=1 \,\big|\, x\big) = \sigma(w_1 x)\,\sigma(w_2 x),$$
which is rank consistent by Lemma 2.1. Suppose by way of contradiction that there exist CORAL $g$, $b_1$ and $b_2$ such that the CORAL outputs match these probabilities with error $O(\delta)$ for all $x$. Thus
$$\sigma\big(g(x) + b_1\big) = \sigma(w_1 x) + O(\delta) \qquad (10)$$
$$\sigma\big(g(x) + b_2\big) = \sigma(w_1 x)\,\sigma(w_2 x) + O(\delta) \qquad (11)$$
and plugging Equation (10) into Equation (11) we find after rearrangement
$$\mathrm{logit}\big(\sigma(w_1 x)\,\sigma(w_2 x) + O(\delta)\big) - \mathrm{logit}\big(\sigma(w_1 x) + O(\delta)\big) = b_2 - b_1,$$
which is a contradiction since the left-hand side has an infinite range depending on $x$ (for $w_1, w_2 > 0$ it behaves like $w_2 x$ as $x \to -\infty$), whereas the right-hand side is a constant up to an error of order $O(\delta)$. ∎
A.2 Categorical accuracy results
Accuracy is a categorical performance measure wherein there is no increasing penalty for being further from the correct label, and therefore no relative “credit” given for being close to the correct label. One might therefore expect that training a neural network with a categorical loss (i.e., CCE) would result in higher categorical accuracy than if the network were trained with an ordinal method.
Table 5: Test accuracy (mean ± standard deviation over three seeds).
ALGORITHM    Quadrants        MNIST            Amazon           COVID-19
CONDOR       0.9900 ± 0.0100  0.9805 ± 0.0004  0.7498 ± 0.0032  0.7250 ± 0.0033
CORAL        0.9333 ± 0.0379  0.6381 ± 0.0037  0.7343 ± 0.0036  0.7175 ± 0.0015
CATEGORICAL  0.9933 ± 0.0058  0.9845 ± 0.0008  0.7299 ± 0.0037  0.7244 ± 0.0016
Surprisingly, in Table 5 we find that in some benchmarks the ordinal CONDOR method provided higher categorical accuracy than the networks trained using categorical cross-entropy to specifically optimize categorical performance. We attribute this remarkable finding to the “clues” provided to the network when the ordinal nature of the problem is exploited during training. For instance, if the network incorrectly guesses rank index 7 when the true rank index is 8, the categorical loss treats this equivalently to a guess of rank index 1; the backpropagation does not send any signals indicating that the guess of rank 7 was “close” to the true rank of 8. By contrast, Equation (4) would capture the fact that most of the binary subtasks are correctly predicted when a rank 7 is estimated for a ground-truth rank of 8. In problems like MNIST, where the features are not necessarily trending with increasing rank, we could understand how a categorical loss produces a stronger categorical accuracy. But in problems like Amazon star ratings, where the language and sentiment of 4- and 5-star reviews are likely closer in the NLP embedding space than the language and sentiment of 1-star reviews, one can also understand how training with an ordinal loss could provide higher categorical accuracy than a categorical loss that only has the capacity to indicate “correct” versus “incorrect” and never “incorrect but close”.
A.3 Minimal code changes
CONDOR is open sourced as both TensorFlow (https://github.com/GarrettJenkinson/condor_tensorflow) and PyTorch (https://github.com/GarrettJenkinson/condor_pytorch) repositories that make it simple to modify existing categorical code bases to use CONDOR. See Figure 2 for a hypothetical example in TensorFlow. Both the TensorFlow and PyTorch versions of the GitHub repositories have full mkdocs documentation, Docker files, ipynb tutorials, continuous integration testing and pip packaging. The authors hope this reduces the barrier to using proper and cutting-edge ordinal regression in applied problems.
A.4 Ordinal outcomes in medical practice
Modern medical practice emphasizes the importance of a standardized and reproducible communication of findings and outcomes within the global medical community. Consistent diagnosis, prognostication, treatment, and decision making all require evidence- and consensus-based labeling of patient disease states. Frequently, these categorizations are made ordinal to align with the expected prognosis or severity of disease.
As a result, across nearly every subspecialty of medicine, one can find a plethora of ordinal outcome scales. Well known to the general public is the use of tumor staging in oncology to characterize neoplasms. However, we provide a non-comprehensive sampling of other specialties that are perhaps less well known. For instance, the American Association for the Surgery of Trauma provides 32 ordinal scales (Moore & Moore, 2010) for assessing the severity of trauma to 32 organs on a scale of 1 (minimal) to 6 (lethal). Molecular testing results, such as DNA variant sequencing, are often graded on an ordinal scale such as the American College of Medical Genetics and Genomics' variant scoring from benign, likely benign, variant of unknown significance, likely pathogenic, to pathogenic (Richards et al., 2015). Subjective outcomes are often measured on ordinal scales, such as the 11-level Pain Rating Scale (Williamson & Hoggart, 2005) from 0 (no pain) to 10 (worst possible pain). Traumatic brain injury is assessed on the 5-point Glasgow outcome scale (Reith et al., 2017). Radiologists make frequent use of ordinal scales, including but not limited to Lung-RADS screening lesions from 0 to 4 (Godoy et al., 2018), BI-RADS for breast cancer screening from 0 to 6 (Mercado, 2014), and Bosniak classification for renal lesions from 1 to 4 (Silverman et al., 2019). Retinopathy has the Scottish Grading protocol from 0 to 4 (Zachariah et al., 2015). While these examples are not intended to be a comprehensive review, they hopefully provide some insight into just how prevalent ordinal scales are in modern medicine.