1 Introduction
The effective monitoring of degenerative patient conditions represents a significant challenge in many clinical decision-making problems and has given rise to numerous mathematical and computational models brownell1999dopamine ; gratwicke2017early ; llano2017multivariate ; chen2014credit . Developing a knowledge-driven contemporaneous health index (CHI) that precisely reflects the underlying patient condition across the course of the condition's progression holds unique value: it facilitates a range of clinical decision-making opportunities spring2013healthy ; rivera2012optimized ; deshpande2014control , enhances the continuity of care, and supports communication between clinicians, healthcare providers, and patients. It will also be a crucial enabler for many envisioned AI systems that implement adaptive interventions for better healthcare management, as such systems require a representation of the dynamic evolution of the patient's condition.
Thus, to ensure continuity of care, we should be explicit about our level of confidence in model outputs. Ideally, decision-makers should be provided with recommendations that are robust in the face of substantial uncertainty about future outcomes. However, computational models are abstractions of clinical observations; as such, they are usually built on analytically tractable assumptions that simplify the real-world problem. Moreover, most of these models are estimated from imperfect data, subjecting them to various statistical errors. An approach that yields only a single prediction does not adequately reflect uncertainty, neither in the empirical data nor in the estimated parameters allmaras2013estimating . As a result, the outcomes of such mathematical models may not be consistent with clinical observations. Uncertainty is an unavoidable feature that affects prediction capabilities in real-world domains such as healthcare hoffman1994propagation ; meghdadi2017brain , manufacturing montomoli2015uncertainty ; nannapaneni2014uncertainty , and signal processing reynders2016uncertainty ; nobari2015uncertainty . A certain amount of uncertainty is always involved in decision-making systems when the experimental data are insufficient to calibrate the model. In such cases, there is always a chance that the model parameters cannot be determined unambiguously, even with sophisticated mathematical models. In clinical prediction, it is necessary to handle such uncertainty effectively, because if the model parameters are not well constrained, the resulting predictions may carry an unacceptable degree of posterior uncertainty. Moreover, while most existing patient-monitoring models generate a single prediction without conveying a confidence level, uncertainty quantification can tell us on which samples we may not be ready to act based on the model. Therefore, to develop a reliable model for clinically relevant prediction, uncertainty quantification is a much-needed capacity collis2017bayesian ; biglino2017computational ; bozzi2017uncertainty .

A number of patient monitoring index approaches have been developed in the literature. A standard formulation of these health indices is to use weighted sum models (e.g., regression models) that combine multiple static clinical measurements to predict the disease condition.
For example, many risk score models predict Alzheimer's disease (AD) by using multi-modality data integration methods liu2013data ; yuan2012multi ; zhang2011multimodal to combine neuroimaging data weiner2013alzheimer ; weiner20152014 , genomics data biffi2010genetic , clinical data reitz2010summary , etc. A few approaches have formulated the decline of AD-related scores over time as a multi-task learning model zhou2013modeling ; zhou2012modeling . These existing efforts have been limited to combining static data rather than longitudinal data. Besides, these data are usually sampled at irregular time points, which adds another layer of complexity to the modeling efforts. Our objective is fundamentally different from that of the existing risk score models; we focus on developing the contemporaneous health index (CHI), which fuses irregular multivariate longitudinal time series data to quantify the severity of degenerative disease conditions while fitting the monotonic degradation process of the disease. For example, in our previous work samareh2018dl , to address patient heterogeneity, we developed a dictionary-learning-based contemporaneous health index for degenerative disease monitoring, called DLCHI, which leveraged knowledge of the monotonic disease progression process by integrating CHI with dictionary learning. The basic idea of DLCHI was to learn individual models via the CHI formulation, and then to rebuild the model parameters of each patient's model through supervised dictionary learning. However, both the CHI and DLCHI frameworks generate only a single prediction value for a sample and ignore sampling uncertainty (it is common in healthcare that label information is obtained by subjective methods that are subject to uncertainty).
Therefore, if we could enable CHI to conduct uncertainty quantification and incorporate label uncertainty in its modeling, we could widen its applicability in real-world contexts. The main objective of this paper is to build on the contemporaneous health index (CHI) developed in huang2017chi and equip it with uncertainty quantification capacity.
In this paper, we develop the uncertainty-quantification-based contemporaneous longitudinal index, named UQCHI, with a particular focus on continuous patient monitoring of degenerative conditions. Our method combines convex optimization and Bayesian learning using the maximum entropy learning (MEL) framework, integrating uncertainty on labels as well. The basic idea of MEL is to identify the distribution of the parameters of a statistical model that bears the maximum uncertainty, a principle that is conservative and robust mackay2003information ; izenman2008modern ; phillips2006maximum . MEL has also been investigated in a few machine learning models jaakkola2000maximum ; sun2013multi ; chao2019semi ; zhu2018semi . For example, in jaakkola2000maximum , MEL was used to learn a distribution of the parameters of the support vector machine model rather than a single parameter vector. This distribution of the parameters helps evaluate the uncertainty of the learned support vector machine model and translates into uncertainty of its predictions.
To adapt the MEL formulation and develop UQCHI, a few challenges should be addressed. The objective function of MEL, as its distinct feature, bears the full spirit of maximum entropy: no matter what model we are studying, the learning objective of MEL is to learn the distribution of the model's parameters that has the maximum entropy. If there is a prior distribution of the parameters, the Kullback–Leibler divergence can be used to extend this idea. In our case, the design of the prior distribution should be studied to account for label uncertainties. Besides the objective function, MEL encodes information from the data into constraints; e.g., if the model is for classification, then for each sample there is a constraint that the expectation of the prediction over the distribution of the parameters should match the observed outcome on that sample. In our case, we derive the constraints from the CHI model and integrate them with the MEL framework. In detail, our method consists of two steps, i.e., training and prediction. In the training step, we place a prior uncertainty over the labels to handle uncertain or incomplete labels. We then derive a solution to the optimization problem by using a specific prior formulation. In the second step, we develop a prediction method, with a rejection option, for new samples with the obtained uncertainty quantification capacity. A distinct feature of our model is that it provides a closed-form solution for predicting the label of a new example. The whole pipeline of the UQCHI model is shown in Figure 1.

The remainder of this paper is organized as follows: in Section 2, we review related literature on modeling the contemporaneous health index for degenerative conditions and on the MEL framework. In Section 3, the UQCHI framework is presented. In Section 4, we implement and evaluate UQCHI using a simulated dataset. We then continue the numerical analysis with a real-world application to an Alzheimer's disease dataset in Section 5. We conclude the study in Section 6. Note that, in this paper, we use lowercase letters, e.g., x, to represent scalars, boldface lowercase letters, e.g., v, to represent vectors, and boldface uppercase letters, e.g., W, to represent matrices.
2 Related works
In this section, we first briefly present the basic formulation of the contemporaneous health index (CHI) model and its extension, the dictionary-learning-based contemporaneous health index (DLCHI); we then present the MEL formulation that underlies the proposed UQCHI model.
2.1 The CHI model
The CHI model was developed in huang2017chi ; it exploits the monotonic pattern of disease over the course of progression to further improve the fusion of multivariate clinical measurements taken at irregular time points. The CHI framework was inspired by a common characteristic of degenerative conditions (e.g., AD): they often cause irreversible degradation. For example, in AD, a number of biomarkers have been developed to measure the degradation of the neural systems, including neuroimaging modalities such as PET and MRI scans mueller2005alzheimer ; petrella2003neuroimaging . MRI scans show a decline in brain volume over time along with the disease progression, and the same phenomenon can be observed on PET scans as a persistent shrinkage of metabolic activity. Such monotonic patterns indicate that once disease progression has started, it tends to deteriorate increasingly over time. The task of CHI is to translate multivariate, longitudinal, and irregular clinical measurements into a contemporaneous health index that captures the patient's changing condition over the course of progression. Note that the clinical measurements for each patient can span different lengths of time and be taken at different time points. Targeting degenerative conditions, CHI is designed to be monotonic over time, with a higher index representing a more severe condition. CHI is a latent structure; hence, the clinical variables associated with it should be measured over time to provide data for learning the index.
Let denote a training set of patients. Each measurement is the value of the th variable for the th subject at a given time , where is the time index. Our goal is, given a training set, to convert each measurement into a health index , which requires a mathematical model of . For simplicity, a multivariable form of the hypothesis function was studied in huang2017chi , i.e., , where is a vector of weight coefficients that combines the variables. The total numbers of positive and negative samples are denoted by and , respectively, i.e., and . The formulation of the CHI learning framework is shown below:
(1a)  
(1b)  
(1c)  
(1d)  
(1e)  
(1f) 
Items in (1) can be explained as follows:

The first term (1a) and the second term (1b) are derived from a general formulation of the support vector machine (SVM). These two terms enhance the discriminatory power of CHI by utilizing the label information. Here, is the label of the th sample, indicating whether the th subject has the disease.

To accommodate the monotonic pattern of disease progression and enforce the monotonicity of the learned health index, the term (1c) is introduced, i.e., if . Here, is the difference of two successive data vectors .

To encourage sparsity of the features, a norm penalty is used, as shown in the last term (1f).
The CHI formulation can be solved using the block coordinate descent algorithm illustrated in huang2017chi . Note that the CHI formulation generalizes many existing models, such as the SVM, sparse SVM, and LASSO.
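To make the structure of the formulation concrete, the following is a minimal numpy sketch of a CHI-style objective: a hinge loss in the spirit of terms (1a)-(1b), a monotonicity penalty in the spirit of term (1c), and an l1 penalty in the spirit of term (1f). The data, labels, increment vectors, and step sizes are all hypothetical, and plain numerical subgradient descent is used here instead of the block coordinate descent algorithm of huang2017chi .

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 40 samples, 5 clinical variables, labels from a linear rule
X = rng.normal(size=(40, 5))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)
# Stand-ins for successive-visit differences x_{t+1} - x_t along trajectories
D = rng.normal(loc=0.2, size=(10, 5))

def chi_objective(w, X, y, D, lam_mono=1.0, lam_l1=0.1):
    hinge = np.maximum(0.0, 1.0 - y * (X @ w)).mean()          # SVM-style loss, cf. (1a)-(1b)
    mono = np.maximum(0.0, -(D @ w)).mean()                    # penalize non-monotone index, cf. (1c)
    return hinge + lam_mono * mono + lam_l1 * np.abs(w).sum()  # l1 sparsity, cf. (1f)

# Plain numerical subgradient descent (the paper uses block coordinate descent)
w = np.zeros(5)
eps = 1e-6
for _ in range(500):
    grad = np.array([
        (chi_objective(w + eps * e, X, y, D) - chi_objective(w, X, y, D)) / eps
        for e in np.eye(5)
    ])
    w -= 0.05 * grad

health_index = X @ w  # the learned contemporaneous index h(x) = w^T x
```

The monotonicity term only penalizes decrements of the index along a trajectory, which is the distinguishing feature relative to a plain sparse SVM.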
2.2 The DLCHI model
The CHI formulation is designed to learn a model for the average of a population and thus ignores patient heterogeneity. Patients who suffer from AD have very heterogeneous progression patterns cummings2000cognitive ; folstein1989heterogeneity ; friedland1988alzheimer . Building a personalized model on an individual basis could account for this heterogeneity; however, such models require a significant amount of labeled training samples, which is not feasible in such clinical settings. Toward this goal, the DLCHI approach was developed in samareh2018dl by integrating CHI with dictionary learning olshausen1996emergence ; cummings2000cognitive . Dictionary learning algorithms reconstruct the input signals as approximated signals via a sparse linear combination of a few dictionary elements or basis vectors wright2009robust (each column of the dictionary represents a basis vector). Dictionary learning algorithms can reveal hidden structures in the data (in a similar spirit to principal component analysis) by spanning the space of personalized models and capturing patient heterogeneity. They also play a role in regularizing model learning, in that each dictionary basis vector can be viewed as a numerical representation of patient heterogeneity; thus, dictionary learning algorithms can improve classification performance. Translating this wisdom into DLCHI, the basic idea is first to learn individual models through the CHI formulation and then to reconstruct the model parameters of the individually learned models via supervised dictionary learning. As such, each model is represented as a sparse linear combination of the basis vectors. Numerous experiments on both simulated and real-world data have shown the effectiveness of DLCHI in creating personalized CHI models.
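As an illustration of the sparse-reconstruction idea behind DLCHI, the sketch below codes a hypothetical "personalized model" vector as a sparse combination of dictionary atoms via greedy matching pursuit. This is a stand-in for the supervised dictionary learning actually used in samareh2018dl ; the dictionary and coefficients are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dictionary: columns are basis vectors capturing patient heterogeneity
D = rng.normal(size=(8, 6))
D /= np.linalg.norm(D, axis=0)  # unit-norm atoms

# A "personalized model" vector that truly uses only two atoms
true_code = np.zeros(6)
true_code[[1, 4]] = [1.5, -2.0]
w = D @ true_code

def matching_pursuit(D, w, n_atoms=2):
    """Greedy sparse coding: repeatedly pick the atom most correlated with the residual."""
    r, code = w.copy(), np.zeros(D.shape[1])
    for _ in range(n_atoms):
        j = np.argmax(np.abs(D.T @ r))  # best-matching atom
        c = D[:, j] @ r                 # projection coefficient
        code[j] += c
        r -= c * D[:, j]                # remove explained component
    return code

code = matching_pursuit(D, w)
w_hat = D @ code  # sparse reconstruction of the personalized model
```

Each greedy step strictly reduces the residual, so even this simple scheme recovers a compact representation of the model in terms of a few shared basis vectors.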
Despite accounting for patient heterogeneity, DLCHI ignores sampling uncertainty, which limits its applicability in real-world applications. This motivates us to enable CHI to conduct uncertainty quantification.
2.3 The MEL formulation
As mentioned in Section 1, the MEL formulation has a distinct objective function that aims to learn the distribution of a model's parameters that encodes maximum uncertainty (as evaluated by the entropy concept). It also has constraints that encode information from the data; e.g., if the model is for classification, then for each sample there is a constraint that the expectation of the prediction over the distribution of the parameters should match the observed outcome on that sample. To illustrate further, one typical application of MEL is the maximum entropy discrimination (MED) method, which focuses on the application of MEL to classification models.
Let us consider a binary classification problem, where the response variable takes values from . Let be an input feature vector and be a discriminant function parameterized by , e.g., . The training set is defined by , and the hinge loss is defined as . The classification margin is defined as ; it is large and positive when the label agrees with the prediction. Traditional learning machines such as max-margin methods learn the optimal parameter setting by minimizing the empirical loss plus a regularization penalty, as shown below:

(2)

where is the loss function, a non-increasing and convex function of the margin, and is the regularization penalty. However, MED considers the more general problem of finding a distribution over the parameters and the classification margins . This can be done by minimizing its relative entropy with respect to some prior target distribution under certain margin constraints. Specifically, suppose that a prior distribution, denoted as , is available; then MED learns a distribution by solving a regularized risk minimization problem. When the prior distribution is not uniform, this can be generalized as minimizing the relative entropy (or Kullback–Leibler divergence) plus the regularization penalty as follows (penalizing larger distances from the prior):
(3) 
Here, is a constant and is the hinge loss that captures the large-margin principle underlying the MED prediction rule:
(4) 
The KL-divergence is defined as follows:
(5) 
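For concreteness, the discrete analogue of the divergence in Eq. (5) can be sketched as follows; the two distributions here are hypothetical, and the sketch just illustrates the non-negativity and asymmetry of the KL-divergence.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL(p || q) = sum_i p_i * log(p_i / q_i), the discrete analogue of Eq. (5)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

# Two hypothetical discrete distributions over three outcomes
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
```

KL(p || q) is zero exactly when the two distributions coincide, which is why minimizing it keeps the learned distribution as close as possible to the prior.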
Here in (3), the classification margin quantities are included as slack variables in the optimization, representing the minimum margin that must satisfy. MED considers an expectation form of the traditional approaches and casts Eq. (2) as an integration. The classification constraints are also applied in expected form. As a result, MED no longer finds a fixed set of parameters but a distribution over them, and it uses a convex combination of discriminant functions rather than a single discriminant function, performing model averaging for decisions. In particular, the MED formulation finds distributions that are as close as possible to the prior distribution over all parameters in terms of KL-divergence, subject to various moment constraints. This analogy extends to cases where the distributions are also over unlabeled samples, missing values, or other probabilistic entities introduced when designing the discriminant function. Correspondingly, MED is an effective approach to learning a discriminative classifier while accounting for uncertainty over model parameters, combining generative and discriminative learning sun2018multi ; zhu2018semi . This generalization facilitates a number of extensions of the basic approach, including the uncertainty quantification described in this paper. The present work contributes a novel generalization of the CHI formulation by integrating MED to perform uncertainty quantification.

3 The proposed work: the UQCHI model
The overall goal of UQCHI is to learn a distribution over the parameters of the CHI model. An additional goal is that this can be done even if only partial labels are given, and the labels may carry uncertainty. Therefore, the first step in constructing UQCHI is to create the constraint structure. To design UQCHI, we incorporate features from the original CHI formulation in Eq. (1) as follows. First, we utilize the label information by defining the discriminant function , which corresponds to (1b). We then incorporate the distinct feature of the CHI formulation, the monotonicity regularization function , which corresponds to Eq. (1c). Note that we do not incorporate the additional terms in Eq. (1d) and Eq. (1e), as they demand full knowledge of the sample labels. In addition, we do not include the sparsity regularization term (1f), since our focus is to learn a distribution rather than the parameter vector . Our model can still induce sparsity, e.g., by imposing a Laplace prior distribution on the parameters, as is done in the Bayesian Lasso model park2008bayesian .
In the following subsections, we will introduce how we design the prior distributions, the constraints, and how to derive computational algorithms and closedform solutions for training and prediction.
3.1 Design of constraints and prior distributions
As mentioned above, there are two types of constraints that we extract from the CHI formulation for the development of UQCHI. One corresponds to the discriminant function used in CHI to generate predictions on samples, while the other corresponds to the monotonicity regularization function . Based on the CHI formulation, a perfect model would satisfy and ; as such a model may not exist, a set of margin variables is introduced. We consider an expectation form of the previous approach and cast Eq. (1) as an integration; hence, the classification constraints are applied in an expected sense. This leads to the following formulation of the constraints:
(6a)  
(6b) 
Here, the term (6a) involves the discriminant function and the term (6b) involves the monotonicity regularization function. is the distribution of , and is the distribution of . With this distribution, we can derive the prediction rule: .
Now we move on to the design of the prior distribution . It is natural to decompose the joint prior distribution as a product of three distributions:
(7) 
In what follows, we discuss each of the three prior distributions. Specifically, it is reasonable to assign a level of uncertainty to each example when defining . A simple solution is to set whenever is observed and otherwise. To define , we choose to be a Gaussian distribution with mean vector and an identity covariance matrix . To define the prior over the margin variables, we assume that it factorizes as . Further, following the idea proposed in jaakkola2000maximum , we can set and . Here, is the mean of the prior distribution of , so the idea of this distribution is to incur a penalty only for margins smaller than , while margins larger than this quantity are not penalized. More details about the design of the prior distributions are given in Section 3.4.

3.2 The computational algorithm for UQCHI
The full formulation of the proposed UQCHI model is shown below:
(8a)  
(8b)  
(8c) 
Essentially, solving the optimization formulation in Eq. (8) amounts to finding the relative entropy projection from the overall prior distribution onto the admissible set of distributions consistent with the constraints. In what follows, we develop the computational algorithm to solve Eq. (8) and further derive the method for prediction on new samples.
3.2.1 Step 1: Training the model
In the training step, we consider a joint distribution over and the margin vector while fixing . We first present the solution to the MED optimization problem subject to the terms in (3).

Lemma 3.1.
Let the loss function be a non-increasing and convex function of the margin, let the Lagrangian of the optimization problem be defined as , and let be a set of non-negative Lagrange multipliers. Given the prior distribution , the model distribution , and the discriminant function , in order to minimize the relative entropy in terms of the KL-divergence ( ) subject to the set of defined constraints, the MED optimization problem (3) can be written as:
(9)  
Here, is the normalization constant defined as:
(10) 
The proof of Lemma 3.1 can be found in Appendix A. The model training problem is thus revealed to be another optimization problem: learning the optimal by solving the dual objective function under a positivity constraint. Based on the results of Lemma 3.1, after adding dual variables for the constraints in Eq. (8), the Lagrangian of the optimization problem can be written as:
(11)  
In order to find a solution, we require:
(12)  
This results in the following theorem.
Theorem 3.2.
The solution to the UQCHI problem has the following general form:
(13)  
Thus, finding the solution to (8) depends on being able to evaluate the normalization constant .
Lemma 3.3.
The proof of Lemma 3.3 can be found in Appendix B. Given the reformulated normalization constant in (14), the maximum of the jointly concave objective function shown in Eq. (9) can be found through constrained nonlinear optimization. As a result, by substituting Eq. (14) into Eq. (9), we get:
(15) 
Here, . Thus, we have the following dual optimization problem:
(16)  
The Lagrange multipliers are recovered by solving the convex optimization problem in Eq. (16). Note that since the prior factorizes across , the UQCHI solution factorizes as well, i.e., .
Corollary 3.4.
From the results in Theorem 3.2, the marginal distribution can be found as follows:
(17)  
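To give a feel for how a dual of this kind can be solved, the sketch below runs projected gradient ascent on the MED dual that arises in a simplified special case: a Gaussian N(0, I) prior on the parameters and the exponential margin prior, on toy separable data, without the monotonicity constraints. This is an assumed stand-in following jaakkola2000maximum , not the full UQCHI dual of Eq. (16).

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy separable data standing in for the training samples (hypothetical)
X = rng.normal(size=(20, 3))
y = np.where(X @ np.array([1.0, -1.0, 0.5]) > 0, 1.0, -1.0)
c = 5.0  # margin-prior parameter; it upper-bounds the multipliers

def dual_objective(lam):
    """Simplified MED dual: sum_i [lam_i + log(1 - lam_i/c)] - 0.5 ||sum_i lam_i y_i x_i||^2."""
    mu = (lam * y) @ X  # posterior mean of w under the learned distribution
    return np.sum(lam + np.log(1.0 - lam / c)) - 0.5 * mu @ mu

# Projected gradient ascent keeps each lambda_i inside [0, c)
lam = np.full(len(y), 0.1)
for _ in range(2000):
    mu = (lam * y) @ X
    grad = 1.0 - (1.0 / c) / (1.0 - lam / c) - y * (X @ mu)
    lam = np.clip(lam + 0.01 * grad, 1e-6, c - 1e-6)

mu_w = (lam * y) @ X  # E[w], used for model-averaged prediction
train_acc = float(np.mean(np.sign(X @ mu_w) == y))
```

The log-barrier term keeps every multiplier strictly below c, which mirrors the boundedness discussed in Section 3.4.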
3.2.2 Step 2: Prediction
After obtaining the marginal distribution in (17), the following lemma is used to predict the label of a new example . Referring to the solution of the UQCHI problem in (13), we can easily adapt the regularization approach to predict a new label from a new input sample . In what follows, we generate the predictive label for new samples.
Lemma 3.5.
3.2.3 Summary of the algorithms
A full description of the training and prediction of UQCHI model is given in Algorithm 1.
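The closed-form flavor of the prediction step can be sketched as follows, assuming the learned distribution over the CHI weights is summarized by a Gaussian with mean mu_w and covariance Sigma_w; the specific numeric values here are hypothetical.

```python
import numpy as np

# Hypothetical posterior over the CHI weights after training
mu_w = np.array([0.8, -0.3, 0.1])
Sigma_w = np.eye(3) * 0.2

def predict(x, mu_w):
    """Closed-form prediction: sign of the expected discriminant E_p[w]^T x."""
    return int(np.sign(mu_w @ x))

def prediction_uncertainty(x, mu_w, Sigma_w):
    """Standard deviation of the index w^T x under the posterior; larger means less reliable."""
    return float(np.sqrt(x @ Sigma_w @ x))

x_new = np.array([1.0, 0.2, -0.5])
```

Because the expectation of a linear discriminant under a Gaussian is just the mean discriminant, prediction reduces to a single inner product, while the posterior covariance supplies the per-sample uncertainty used by the rejection option below.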
3.3 UQCHI with rejection option
Typically, the performance of a prediction model is evaluated based on its accuracy under a scheme that classifies all samples, regardless of the degree of confidence associated with each classification. However, accuracy is not the only measure by which to judge a model's performance. In many healthcare applications, it is safer to make predictions only when the confidence assigned to the classification is relatively high, rather than classifying all samples even when confidence is low. In this case, a sample can be rejected if it does not fit into any of the classes. In pattern recognition, this problem is typically solved by estimating the class conditional probabilities and rejecting the samples with the lowest class posterior probabilities, i.e., the most unreliable samples. As UQCHI enables uncertainty quantification, we create a rejection option in prediction to show the utility of uncertainty quantification in practice. The basic idea of the rejection option is that the prediction model declines to generate a prediction if the uncertainty is higher than a given threshold. In other words, a sample that is most likely to be misclassified is rejected, as described below:
(19) 
Here, T is the rejection rate; samples whose maximum posterior probability falls below the threshold are rejected. A sample is accepted when:
(20) 
Thus, we define classification with rejection as , where if a sample is rejected, ( denotes rejection); otherwise , where corresponds to the classification of the th sample defined in Eq. (18).
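The rejection rule above can be sketched directly; the posterior probabilities and the threshold below are hypothetical.

```python
def classify_with_rejection(posterior_pos, threshold=0.8):
    """Return +1 or -1 when the max class posterior clears the threshold, else 0 (reject)."""
    p_max = max(posterior_pos, 1.0 - posterior_pos)
    if p_max < threshold:
        return 0  # rejected: too uncertain to act on
    return 1 if posterior_pos >= 0.5 else -1

# Hypothetical posteriors P(y = +1 | x) for a handful of test samples
posteriors = [0.95, 0.55, 0.10, 0.70]
decisions = [classify_with_rejection(p) for p in posteriors]
```

Raising the threshold rejects more borderline samples, which is exactly why the accuracy on the accepted samples increases with the rejection rate in the experiments.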
| Label ratio | Training ratio | UQCHI (rejection rate Low = 20) | UQCHI (Medium = 40) | UQCHI (High = 60) | CHI |
|---|---|---|---|---|---|
| Low = 10 | 30 | 0.69 | 0.74 | 0.81 | 0.61 |
| Low = 10 | 50 | 0.73 | 0.76 | 0.83 | 0.62 |
| Low = 10 | 70 | 0.75 | 0.77 | 0.85 | 0.65 |
| Medium = 20 | 30 | 0.66 | 0.72 | 0.73 | 0.55 |
| Medium = 20 | 50 | 0.69 | 0.73 | 0.74 | 0.60 |
| Medium = 20 | 70 | 0.71 | 0.75 | 0.78 | 0.64 |
| High = 50 | 30 | 0.64 | 0.69 | 0.72 | 0.53 |
| High = 50 | 50 | 0.67 | 0.71 | 0.73 | 0.56 |
| High = 50 | 70 | 0.70 | 0.73 | 0.75 | 0.60 |
3.4 Tractability of UQCHI related to design of prior distribution
Recall that by applying MED to our optimization problem, we no longer learn a point estimate of the model parameters; instead, we specify probability distributions. These distributions give rise to penalty functions for the model and the margins via the KL-divergence. In detail, the model distribution gives rise to a divergence term , and the margin distribution gives rise to a divergence term , corresponding to the regularization penalty and the loss function, respectively. The trade-off between classification loss and regularization is now on a common probabilistic scale, since both terms are based on probability distributions and KL-divergence. Hence, there is a relationship between defining a prior distribution over margins and parameters and defining the objective function and penalty term in the original formulation. Recall that are the classification margins, included as slack variables in the optimization, which represent the minimum margin that must satisfy. Hence, the choice of the margin distribution corresponds to the use of slack variables in the UQCHI formulation. For example, in our case we set and . If we mathematically expand the normalization function in (10), we get the two terms and as shown in (14), and given the choice of margin priors in Section 3.1, we get:

(21)

From (21), we can see that a penalty occurs when the margins are smaller than , while margins larger than this are not penalized. The margin distribution becomes peaked when , which is equivalent to having fixed margins. If the margin values are held fixed, the discriminant function might not be able to separate the training examples with such pre-specified margin values; for non-separable datasets, this generates an empty convex hull for the solution space. Thus, we need to revisit the setting of the margin values and the loss function defined upon them. The parameter plays an almost identical role to the regularization parameter, upper-bounding the Lagrange multipliers. Note that if the objective function grows without bound, it may generate a search space for the parameters that is no longer a convex hull, compromising the uniqueness and solvability of the problem. Therefore, the prior should be selected so that the objective is a concave function with a unique optimum in the Lagrange multiplier space.
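A small numeric illustration of these points, under the exponential margin prior p(gamma) = c·exp(-c·(1 - gamma)) for gamma ≤ 1 (assumed form, following jaakkola2000maximum ): the log-prior penalizes only margins below 1, the prior mean is 1 - 1/c (so large c approaches fixed margins), and the per-sample dual term stays finite only while the Lagrange multiplier is below c.

```python
import numpy as np

c = 5.0  # assumed value of the margin-prior parameter

def margin_log_prior(gamma, c):
    """log p(gamma) = log c - c*(1 - gamma) for gamma <= 1, zero density otherwise."""
    return np.where(gamma <= 1.0, np.log(c) - c * (1.0 - gamma), -np.inf)

def margin_penalty_term(lam, c):
    """Per-sample term lambda + log(1 - lambda/c) from integrating out the margin
    prior (cf. Eq. (21)); it is finite only while lambda < c."""
    return lam + np.log(1.0 - lam / c) if lam < c else float("-inf")

# Penalty only below the margin 1: the log-prior increases monotonically up to 1
gammas = np.array([0.0, 0.5, 1.0])
logp = margin_log_prior(gammas, c)

# Prior mean 1 - 1/c: as c grows, the prior concentrates at the fixed margin 1
prior_mean = 1.0 - 1.0 / c
```

This makes the trade-off explicit: small c keeps a soft, distribution-valued margin, while c → ∞ collapses to fixed margins and risks an empty feasible set on non-separable data.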
4 Numerical studies
In this section, we design our simulation studies to evaluate the efficacy of UQCHI in terms of prediction and uncertainty quantification, in comparison with the CHI model under a variety of practical scenarios.
4.1 Simulated dataset
We simulate data following the procedure described below. The synthetic dataset is generated with two classes and partial labels. We conduct several experiments with the simulated data to investigate the performance of our method across different settings. Without loss of generality, we assume that there are two groups, normal vs. diseased, with a proportion of of the normal class and of complete labels. For all experiments, we set the number of features to . For each class, we simulate subjects, where we assume that for .
4.2 Incomplete labels and length of longitudinal data
UQCHI can handle partial labels well, i.e., by assigning a prior distribution over the labels and obtaining posterior distributions after model training. In our experiments, we consider low, medium, and high levels of label availability, i.e., , , and of unlabeled examples. We also evaluate our methodology's robustness under down-sampling of the training data, i.e., using only a percentage of the data (ranging over , , and ) to train both the UQCHI and CHI models. A model that can predict well with less longitudinal data holds great value in clinical applications.
4.3 Uncertainty quantification with rejection option
As mentioned in Section 3.3, UQCHI has the unique capacity of a rejection option: the algorithm declines to predict on a sample if the prediction cannot be made reliably. The key parameter is the threshold used in the rejection option. In our experiments, we use several levels of the threshold to create a range of rejection options from loose to strict, and we calculate the resulting accuracies of the predictions on the accepted samples. Specifically, we vary the size of the rejection region over , , and .
4.4 Parameter tuning and validation
In our experiments, we randomly split the data into two parts, one for training and one for testing. On the training dataset, we use 10-fold cross-validation to tune the parameters. The average accuracies over the testing splits are reported in the results section. In Section 3.4, we specified under which conditions the computation remains tractable; based on the choice of margin distribution described there, is bounded by the parameter . Recall that is a parameter in the prior for the margins, so it plays an important role. Hence, we conduct experiments with the parameter chosen from to see the impact of various choices of on the testing accuracy.
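The tuning loop can be sketched as follows. The classifier inside the folds is a placeholder ridge model on synthetic data, standing in for the UQCHI training step, with c playing the role of the margin-prior parameter; the data, grid, and fold count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
y = np.where(X @ np.array([1.0, -0.5, 0.0, 0.2]) > 0, 1, -1)

def cv_accuracy(X, y, c, n_folds=10):
    """10-fold cross-validated accuracy for a placeholder ridge-style classifier."""
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    accs = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Ridge regression on the +/-1 labels as a stand-in for model training;
        # 1/c acts as the regularization strength here
        A = X[train].T @ X[train] + (1.0 / c) * np.eye(X.shape[1])
        w = np.linalg.solve(A, X[train].T @ y[train])
        accs.append(np.mean(np.sign(X[test] @ w) == y[test]))
    return float(np.mean(accs))

scores = {c: cv_accuracy(X, y, c) for c in (1.5, 3, 5, 10, 20, 100)}
best_c = max(scores, key=scores.get)
```

In the actual experiments, each candidate c would be scored the same way on the training split, and the best-scoring value carried to the held-out test set.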
4.5 Discussion
In the following, we discuss the tractability of the model given the simulated data for various choices of the parameter in Table 2. We simulated different selections of the parameter to check its impact on the testing accuracy. If increasing this parameter has no effect on performance, we can ignore the higher values, for the reasons discussed in Section 3.4. The results show that for larger values of the parameter, the accuracy decreases. As shown in Table 2, further increases of the parameter carry little effect, as the margin distribution may have become peaked ( ), which is equivalent to having fixed margins. Note that to test the impact of the parameter, we simulated the data with a proportion of of the normal class and complete labels. We observe that after increasing the parameter beyond , the performance of the model does not change significantly, which indicates that the margin distribution may have become peaked and is hence equivalent to a fixed value; higher values of this parameter generate relatively similar performance. Consequently, lower values of preserve the flexibility to estimate a distribution over the parameters instead of using fixed margins.
Next, we examine in Table 3 how incomplete label information affects the performance of UQ-CHI in terms of testing accuracy under different sampling ratios. A model that can be trained with less training data is particularly valuable in healthcare applications, where data collection is costlier than in many other real-world domains. The results in Table 3 show that even at the highest incomplete-label setting (50%), UQ-CHI achieves testing accuracies of 0.74 to 0.78. This confirms that the model is capable of performing well in the face of missing label information.
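The incomplete-label settings above can be simulated by masking a fraction of the ground-truth labels. The helper below is illustrative (not from the paper); it marks unobserved labels with -1:

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_labels(y, label_ratio):
    """Keep only a fraction `label_ratio` of the labels observed;
    the rest are marked unknown (-1) to emulate incomplete labels."""
    y = np.asarray(y).copy()
    n = len(y)
    keep = rng.choice(n, size=int(round(label_ratio * n)), replace=False)
    masked = np.full(n, -1)
    masked[keep] = y[keep]
    return masked

y_full = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0])
y_observed = mask_labels(y_full, 0.5)  # half the labels remain observed
```

Sweeping `label_ratio` over the low/medium/high settings of Table 3 reproduces the experimental design.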
Incorporating a rejection option into the model improves the prediction accuracy of the classifier. There is a general relationship between the testing accuracy and the rejection rate: the testing accuracy increases monotonically with the rejection rate. The testing accuracies for different rejection options are reported in Table 1. Comparing rejection rates for UQ-CHI confirms that, at a high rejection rate, the testing accuracy rises markedly relative to lower rejection rates, which is a promising result. In Table 1, we also compare our methodology with the CHI framework. Recall that CHI is not strictly a supervised learning problem: in huang2017chi, both simulation studies and real-world applications demonstrated that, without label information, the CHI method can still be trained and used for prediction. However, we show that UQ-CHI achieves better performance than CHI by incorporating the rejection option, obtaining higher testing accuracies across the reported rejection rates and labeling ratios.

Table 2: Testing accuracy (%) for different values of the parameter c on the simulated dataset.

Parameter c   Testing accuracy (%)
1.5           81.2
3             80.2
5             79.8
10            77.2
20            77.3
100           76.1
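The rejection mechanism discussed above can be sketched as follows. The threshold rule and the names `predict_with_reject` and `accuracy_on_accepted` are illustrative assumptions; the paper's actual rejection rule comes from its posterior over margins, which is not reproduced here:

```python
import numpy as np

def predict_with_reject(proba, threshold=0.6):
    """Reject samples whose top class probability falls below `threshold`.

    `proba` is an (n_samples, n_classes) array of class probabilities.
    Returns predicted labels, with -1 marking rejected samples."""
    proba = np.asarray(proba)
    pred = proba.argmax(axis=1)
    pred[proba.max(axis=1) < threshold] = -1
    return pred

def accuracy_on_accepted(pred, y_true):
    """Testing accuracy computed only over non-rejected samples,
    together with the realized rejection rate."""
    pred = np.asarray(pred)
    accepted = pred != -1
    if not accepted.any():
        return float("nan"), 1.0
    acc = float((pred[accepted] == np.asarray(y_true)[accepted]).mean())
    rejection_rate = 1.0 - float(accepted.mean())
    return acc, rejection_rate
```

Raising `threshold` rejects more uncertain samples, which is why accuracy on the accepted set tends to increase with the rejection rate.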
Sample ratio (%)   Label ratio: Low = 10%   Medium = 20%   High = 50%
30                 0.85 ± 0.033             0.80 ± 0.032   0.74 ± 0.033
50                 0.86 ± 0.060             0.83 ± 0.053   0.76 ± 0.027
70                 0.88 ± 0.074             0.85 ± 0.041   0.78 ± 0.037

Table 3: The average classification accuracies and standard deviations for the simulated dataset.
5 Real-world application on Alzheimer's disease
We further test UQ-CHI on an Alzheimer's disease dataset that exhibits monotonic disease progression. We use the FDG-PET images of 162 patients (Alzheimer's disease: 74; normal aging: 88) downloaded from ADNI (www.loni.usc.edu/ADNI). The data are sampled at irregular time points, with each patient having at least three and at most seven time points. The data are preprocessed, and Automated Anatomical Labeling (AAL) is used to segment each image into 116 anatomical volumes of interest (AVOIs). For this study, the 90 AVOIs located in the cerebral cortex are selected (each AVOI becomes a variable). According to the mechanism of FDG-PET, the measurement for each region is the local average FDG binding count, which represents the degree of glucose metabolism. Glucose metabolism declines as a function of aging, and the progression of many neurodegenerative diseases such as AD further accelerates this decline. The ADNI dataset therefore provides an ideal application for testing the proposed method. While the ADNI dataset consists of fully labeled examples, we exploit its settings to introduce a variety of uncertainties into the label information.
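Extracting the 90 regional variables from a segmented volume can be sketched as follows. The function `regional_means` and the toy arrays are illustrative; the real pipeline operates on 3-D FDG-PET volumes aligned with the AAL atlas:

```python
import numpy as np

def regional_means(image, atlas, region_ids):
    """Average voxel intensity per anatomical region.

    `image`: FDG-PET volume; `atlas`: AAL label volume of the same shape;
    `region_ids`: the labels of the selected cortical AVOIs.
    Returns one average FDG binding count per region."""
    image = np.asarray(image, dtype=float)
    atlas = np.asarray(atlas)
    return np.array([image[atlas == r].mean() for r in region_ids])

# Toy 2-D example: two regions, each covering two voxels.
img = np.array([[1.0, 2.0], [3.0, 4.0]])
atlas = np.array([[1, 1], [2, 2]])
features = regional_means(img, atlas, [1, 2])  # one variable per AVOI
```

Each patient visit thus yields a 90-dimensional feature vector, and the sequence of visits forms the longitudinal input to UQ-CHI.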
The results of tuning the parameter c on the ADNI dataset are reported in Table 4. As with the simulated data, the accuracy decreases for larger values of c. Table 5 shows the performance of UQ-CHI across different uncertainty levels as well as different sampling ratios. The proposed method shows a strong capability to quantify the uncertainties in this real-world dataset: as shown in Table 5, UQ-CHI copes even with the highest incomplete-label setting (50%), attaining accuracies in the range of 0.70 to 0.76 on the ADNI dataset.
On the other hand, we show that by using a proportion of the training samples as small as 30% of the data, we can still maintain reasonable performance, in the range of 0.70 to 0.82 (Table 5), which indicates that UQ-CHI can be trained with little training data. The rejection options against the testing accuracy, together with these values across training ratios, are shown in Table 6. Incorporating a rejection option into the model improves the prediction accuracy of the classifier: comparing different rejection rates for UQ-CHI confirms that at the high rejection rate of 60%, the testing accuracy can reach up to 0.87, which, compared with lower rejection rates, is a promising result.
Table 4: Testing accuracy (%) for different values of the parameter c on the ADNI dataset.

Parameter c   Testing accuracy (%)
1.5           78.8
3             77.9
5             77.3
10            75.3
20            72.0
100           68.9
Table 5: Average classification accuracies and standard deviations on the ADNI dataset for different sample and label ratios.

Sample ratio (%)   Label ratio: Low = 10%   Medium = 20%   High = 50%
30                 0.82 ± 0.022             0.79 ± 0.052   0.70 ± 0.032
50                 0.84 ± 0.014             0.82 ± 0.005   0.74 ± 0.049
70                 0.87 ± 0.040             0.83 ± 0.032   0.76 ± 0.043
Table 6: Testing accuracies of UQ-CHI at different rejection rates versus CHI, for varying label and training ratios, on the ADNI dataset.

Label ratio (%)   Training ratio (%)   UQ-CHI, rejection rate: Low = 20%   Medium = 40%   High = 60%   CHI
Low = 10          30                   0.71                                0.76           0.83         0.64
                  50                   0.75                                0.78           0.84         0.66
                  70                   0.77                                0.79           0.87         0.70
Medium = 20       30                   0.67                                0.71           0.72         0.58
                  50                   0.70                                0.72           0.75         0.62
                  70                   0.71                                0.75           0.76         0.63
High = 50         30                   0.66                                0.70           0.71         0.55
                  50                   0.69                                0.71           0.73         0.58
                  70                   0.71                                0.72           0.74         0.62
6 Conclusion
In this paper, we develop the UQ-CHI method to enable uncertainty quantification for continuous patient monitoring. This probabilistic generalization facilitates several extensions of the basic CHI model for decision-making purposes. For example, in many degenerative disease conditions such as AD, it is essential to triage patients to determine the priority of resource allocation and patient care; the UQ-CHI framework supports such decisions under imperfect and continuously arriving knowledge. In the future, we would like to extend this method to other diseases that may show different degradation characteristics in the context of degenerative conditions. Another extension of this methodology is to apply it to a nonlinear index and to further explore the feasibility of different discriminant functions.
Appendix A Proof of Lemma 3.1
Proof.
By adding a set of dual variables, one for each constraint, the Lagrangian of the optimization problem in (3) can be written as:
(22) 
To find the solution to Eq. (3), given the definition of the KL-divergence in (5), we require
(23) 
The solution to the MED optimization problem has the following general form:
(24) 
Here, the normalization constant is as defined in (10); the general exponential form of the solution then becomes:
(25) 
Hence, the dual of the MED problem is as given in (9).
∎
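Since the displayed equations (22)–(25) did not survive extraction here, it may help to recall the generic form of the MED solution that the proof follows. This is a standard result; the symbols Θ, g_t, λ_t, and Z(λ) below are placeholders for the paper's own notation in (3), (8), and (10):

```latex
% MED: minimize KL(p || p_0) over p(\Theta) subject to
% expectation constraints E_p[g_t(\Theta)] \ge 0, with multipliers \lambda_t.
\begin{aligned}
J(p) &= \mathrm{KL}\big(p(\Theta)\,\|\,p_0(\Theta)\big)
        - \sum_{t} \lambda_t \int p(\Theta)\, g_t(\Theta)\, d\Theta , \\
\frac{\delta J}{\delta p} = 0
  \;\;\Longrightarrow\;\;
p(\Theta) &= \frac{1}{Z(\lambda)}\, p_0(\Theta)\,
             \exp\!\Big(\sum_{t} \lambda_t\, g_t(\Theta)\Big), \\
Z(\lambda) &= \int p_0(\Theta)\,
             \exp\!\Big(\sum_{t} \lambda_t\, g_t(\Theta)\Big)\, d\Theta .
\end{aligned}
```

Setting the functional derivative of the Lagrangian to zero yields the exponential-family form of the posterior, and substituting it back gives the dual in terms of the log-partition function Z(λ).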
Appendix B Proof of Lemma 3.3
Proof.
Let the normalization constant be as defined in Eq. (10). Given the constraints in (8), it can be reformulated as follows:
(26a)  
(26b)  
(26c)  
(26d)  
(26e)  
(26f) 
Given the priors in (7), each term in Eq. (26) can be reformulated as follows. For the terms in (26d) and (26e) we have:
(27)  
And for the last term (26f) we have the following:
(28)  