1 Introduction & Literature Review
Doctors often rely on their past experience in order to diagnose patients, often considering similar cases from the past to diagnose a new case. A radiologist might consider how similar a medical image is to an exemplar from a medical reference book, or an image from one of their own past cases. An infectious disease specialist might consider how similar someone’s symptoms are from past patients in order to diagnose an illness. For a doctor with enough experience, almost every patient would have similarities to key cases seen in the past, and each new patient could be viewed as a mixture (a patchwork) of these key past cases. Because doctors often tend to reason this way, a computationally aided diagnostic tool that thinks in the same way might be helpful in locating key past cases of interest that could assist with diagnosis.
More generally, beyond medical decisions, it is well known that various types of exemplarbased reasoning and prototyping are fundamental to the way humans make tactical decisions (Carroll, 1980; Cohen et al., 1996; Klein, 1989; Newell and Simon, 1972). Humans also tend to think of categories as being represented by a specific member of that category, which is the center of the subfield of cognitive science called prototype theory (Rosch, 1973). It has been shown that often, user confidence in a solution is increased when an example case is displayed rather than the logic of the decision rule alone (Cunningham et al., 2003). The fact that medical students and MBAs are taught through casebased teaching methods is not a coincidence, it is because this type of thinking is how our minds work. In recognitionprimed decision making (Klein, 1989) for firefighters, students are taught to consider past cases in order to make informed decisions. By using previous experience in form of cases to understand and solve new problems, casebased reasoning has been reported as a suitable and successful technique for medical knowledgebased systems and medical decision making (see for example Schmidt et al. (2001); Dussart et al. (2008) and the discussion about the medical domain).
CaseBased Reasoning (CBR) (see Aamodt and Plaza (1994); Biswas et al. (2014); Slade (1991); Irissappane and Zhang (2015)
) is a subfield of artificial intelligence dedicated to automated exemplarbased decisions. CBR has been used in realworld applications for many years (e.g.,
Anaissi et al. (2015); Bichindaritz and Marling (2006); Janssen et al. (2015); Li and Sun (2008)), including many medical applications Begum et al. (2011, 2009). Researchers have used casebased reasoning for treatments of patients with anxiety disorders (Janssen et al., 2015), while others have used it for elder care and health assessments (Hu et al., 2014), analyzing gene expression data to study acute lymphoblastic leukemia (ALL) (Anaissi et al., 2015), and to study patient registration on renal transplant waiting lists (CampilloGimenez et al., 2013). Many casebased reasoning systems use some form of nearest neighbor method (Cover and Hart, 1967) (also see Anaissi et al. (2015); Cover and Hart (1967); Song et al. (2013)). Several works provide indepth discussions of the issues and challenges of applying casebased reasoning in medical domains (Begum et al., 2009, 2011; Bichindaritz and Marling, 2006).Our approach is different than typical casebased reasoning approaches in that each new case is modeled as a mixture of different parts of past cases, where the past cases vote on the features and label of the new case. In essence, each patient is modeled as if it were a patchwork of parts of past cases. This approach avoids fixing a single similarity measure as in nearest neighbor approaches; we want to compare patients only on relevant subsets of features. For instance, hipinjury patients should be compared only on the similarity of their hip injuries (and age and related characteristics), and heart patients should be compared only on aspects related to heart health. The definition of a patient’s neighbors should depend on how they are being compared, and different comparisons should be made based on different subsets of features. Our method performs joint inference on the selection of past cases that are influential to each new case, the subset of features of the past cases that are relevant to the new case, and how strong the influences of the past cases are on the new case. Because we use Bayesian hierarchical modeling, the hierarchy allows us to borrow information across cases about which features are important for new cases. The hierarchical Bayesian setting helps to quantify uncertainty, and yields reasonable estimates of the (meaningful) parameter posteriors, and avoids overfitting.
We hope that by providing doctors with the logic they would have considered anyway, had they been able to do lengthy querying of the database for each case, doctors will be more likely to trust the model and make personalized decisions that are both evidencebased and experiencebased. This method can help train doctors as well, by showing them how the key elements of past cases are relevant. It should be pointed out that our model can be applied in other application settings where interpretability of the model is of some importance. Examples are criminal justice, and energy reliability. We should point out that for policybased decisions that rely on subpopulations, our model may not be the right approach.
In Section 2, the structure of the main model and the developed hierarchical framework are presented, along with a useful extension. In Section 3
we show the result of our models on two healthcare datasets which are publicly available. We also compare our model with some other important machine learning algorithms in Section
3. Several appendices include details of the inference procedure.2 Structure of Bayesian Patchworks
Each patient (“case”) is represented by a vector of features, including history of treatments, genetic makeup, environmental factors, and lifestyle. In this model, each patient is comprised of important pieces of related individuals. These special individuals, called
neighbors, belong to a large historical set of past cases called the parent set. Stated in terms of the generative structure of the model, each patient is generated from its neighbors, where the neighbors give the new patient their influential features. The neighbors also vote to determine the label for the new case. Figure 1 illustrates this, where each feature of a new patient’s medical state is considered with respect to other patients (parents) that are similar in that feature.Thinking in reverse (starting from the patient, rather than its generative process), this form of mathematical modeling naturally leads to answers to the following questions: For a new patient with unique characteristics/features, what are important similar past cases? How close are these past cases to the current patient and what are the special factors that make this past case resemble the current case? How can this similarity help with inference – can we infer the patient’s health condition and possible treatments using the influential neighbors? The answers to these questions are learned (inferred) from the data with our hierarchical model. We introduce the main important variables below. After that we will introduce the fullblown generative model.
Each individual has features (e.g., age, race, gender, treatment or diagnosisrelated features, drugs), and we denote the feature of the observation as . The th feature can take any of its possible discrete outcomes. The set of feature values associated with the th individual are denoted by vector . Each individual is associated with one of the possible health classes/outcomes, which we denote for the th individual by . The whole dataset is then denoted by (,). The main parameters of the model are as follows:
: This is an indicator variable denoting whether patient from the parent set is in the neighbor set of individual (we say “ is a neighbor of ”). A patient and its neighbors share at least some important common properties. The neighbors for several cases can be the same. We have and and is the number of current cases, is the number of parents (past cases).
: This binary parameter (where ) shows whether feature of neighbor is important to individual , and thus shows whether two cases (a new case and a case from the pool of past cases in the parent set) are connected to each other.
: This determines the probability that feature
of patient takes value . Specifically, it determines how likely outcome will be copied from the neighbors. The vector for the th outcome of feature () is calculated as follows:(1) 
Based on this definition, we have a boost in the score for the th outcome of feature for any neighbor who has outcome for feature . Here is the baseline rate of seeing outcome , and determines how much influence each neighbor has.
: This determines how much the neighbors of individual influence it to choose the th class label (). This variable is defined as
(2) 
Here, and
are the hyperparameters that together reflect how neighbors can affect the generation of a label. In this paper we assumed that
. If is very large, then neighbors are not important in determining the class of each individual. We will discuss extensions of this model later in the paper.Consider the generation process for individual . After the values and are generated for and , we know ’s neighbors and their influences. From there, generates vector and then finally generates the value of the th feature of individual (i.e., ). Also from , we know that generates the outcome for individual (i.e., ).
As an example, we consider a patient . All of the past patients with common properties are neighbors, so that . The first neighbor has a circulatory system condition that is similar to patient . The second neighbor has a hip injury similar to patient , and so on. Each of these medical conditions of the neighbors provide some information about patient ’s features and labels. For the information from neighbor that definitely informs feature for patient , we would have , and the probability that patient would have the same value of the feature as the neighbor would be large, meaning would be large for value . If neighbor influences ’s class label, then would be large for neighbor ’s class label. This way, we can consider a patient as a patchwork of past patients that are similar to the present patient. The neighbors’ important characteristics (the hip injury, the circulatory condition, etc.) all contribute to the features and outcomes for the present patient. An understanding of these similar past patients serves as a point of reference for how to care for the present patient.
2.1 Full Hierarchical Bayesian Patchworks Model (Model I)
The variables of the hierarchical model (herein denoted BPatch) are as follows:
: Here (
) follows a Bernoulli distribution with parameters
, where hyperparameter denotes how many neighbors each individual tends to have on average. This parameter controls how many parents generate each individual. Note that if is chosen to be large, then the model may overfit by choosing many cases from the parent set as neighbors.: This parameter determines the general importance of feature ranging from 0 to 1,
The hyperparameters and control the number of features used for generating the new observations and their class labels.
: This parameter determines the importance of feature to parent
. Its values follow a Beta distribution centered around mean
, that is, . (To see that the mean is note that the mean of the Beta distribution is .) Here, is another hyperparameter that governs the sparsity of the distribution. The intuition behind this parameter is that if it is set to a value that makes the variance of the
too large, then some neighbors may be very influential for feature and some neighbors will not be influential. If it is set to a value that makes the variance too small, then neighbors may all be approximately equally influential for feature .: As explained earlier, this binary parameter shows whether feature of neighbor is important to individual . This is determined via:
This allows for individual level effects on neighbor ’s influence for feature .
: The distribution of values for feature for individual comes directly from the ’s (as given in Eq. (1)), that is
which means and . So, is the distribution of feature values for the th feature of the th individual.
: The th feature of the th individual, , is generated from a categorical distribution with parameter vector . Therefore, the probability that takes its th outcome becomes .
Now that we have generated the observations, we need only to generate the class labels.
: The distribution of class outcomes for individual will be determined by defined in Eq. (2) through
This way, for all and for all .
: The distribution of class outcomes for individual is generated from a categorical distribution with parameter vector . To use our model for classification, the class with the highest probability can be used as the predicted class.
According to this model, neighbors of an individual determine both its feature values and its outcome . The hyperparameters of our model are . The graphical model is illustrated in Fig. 2.
The full model given all hyperparameters is summarized below:
To summarize, the determine who the relevant neighbors are for , the next three lines determine how the features are generated from the values of the neighbors, and the two lines after that determine how the labels are generated. Inference for the model is discussed in Appendix A.
2.1.1 Continuous Features
Model I can be edited to handle continuous features in several ways. One option is to use a bandwidth or a binary distance function that determines whether two feature values in the continuous domain are close:
where is a binary similarity measure/kernel for the th feature. A simple example of such measure is a binary kernel with a bandwidth as follows:
From there, and
values could be generated from a normal distribution or another continuous distribution.
2.1.2 More Complicated Outcome Relationships
Once the features
are generated, if desired one could replace the outcome model with any other standard generative modeling procedure (e.g., linear regression model or logistic regression). We chose a simple voting model for the outcome, where each neighbor casts one vote for their own outcome.
2.1.3 Handling Missing Points
Handling missing data is not the focus of the paper, one may handle it in the usual way, for instance, by performing any method for imputation. Other than the above method, our model can be updates as follows to deal with missing points: for each dimension of features, we can add an artificial output representing missing value. For example for feature
with possible discrete outcomes, we add an artificial outcome and replace all missing values in that direction with it.2.2 Comparison with closely related work
Now that we have introduced the full model, we can give a more technical description of how the models presented above differ from ordinary nearest neighbor techniques (knearest neighbor – KNN) as well as the Bayesian Case Model (BCM)
(Kim et al., 2014), and Latent Dirichlet Allocation (LDA).Nearest Neighbors: In KNN, adaptive nearest neighbors, or nearest prototype models, there is a single fixed distance metric used for comparing a current observation to past observations. Here, the choice of distance metric depends on the observation. This allows, for instance, a patient with a heart condition and sports injury to be neighbors of patients with a similar heart condition (regardless of sports injury), and patients with a sports injury (regardless of heart condition). Our method can be viewed as a version of nearest neighbors that is much more flexible, in terms of how many neighbors are used, the distance metric between neighbors, and the completeness of the characterization of the new patient by the neighbors.
Bayesian Case Model: First, BCM is unsupervised, while BPatch is supervised. More importantly, BPatch includes individual level effects for influences of parents, whereas BCM uses prototypes instead. BCM limits new cases to be compared with only certain learned past cases (the prototypes). An easy way to see this is that BPatch has variables for each unit, controlling whether feature of neighbor is important for observation . In BCM, the analogous parameter only determines whether feature is important for cluster , and there is no individual level effect. As a result, the whole model for BPatch is different: in order to determine the distribution of values for any feature (which is ), BPatch needs to consider how many parents are going to generate it (controlled by ). In contrast, BCM’s values depend only on the cluster . For BCM, takes on values that are single cluster indices (“Which cluster will generate feature for individual ?”), whereas for BPatch, determines which parents generate individual (“Which parents will be allowed to generate individual ’s features?”). As a result, BPatch’s generation of features is a result of a vote among parents, whereas BCM’s generation of features is a result of a single prototype sharing its feature with the new observation. By considering individual level effects, generated for each individual as a patchwork of its parents, BPatch is conceptually different than BCM, which is clusterbased. BPatch is more similar to nearest neighbors than prototype or clusterbased modeling like BCM.
Note that there are papers dealing with supervised prototype classification other than BCM (e.g., Bien and Tibshirani (2011)), though not using the important parts of each case like BCM and BPatch; this is problematic, and causes these algorithms to choose a huge number of prototypes in practice. It also limits us to compare with a single nearest prototype even if our current observation is clearly a mixture of different past situations.
Latent Dirichlet Allocation: LDA has a strong (and false) assumption that each feature value is generated independently. BPatch and BCM do not have this assumption, and the generation of features is correlated (either through clusters or parents). BCM and BPatch are more complicated and more realistic models, leading to better performance. This was discussed via experiments by Kim et al. (Kim et al., 2014).
2.3 Model II
When generating feature values or outcomes, the previous BPatch model (Model I) counts each vote from a neighbor equally, whereas in Model II, the votes are weighted by the degree of influence of the neighbor. For instance, if one neighbor is influential for several features of the current patient, then it should potentially have a stronger vote than other neighbors in determining the feature values and outcomes. The degree of influence of a neighbor depends on the influences of its individual features. We denote , as the degree of influence between case and parent , which is
The higher is for parent , the stronger its influence is on individual . This is different from Model I where the influence of parent on individual was determined through a binary indicator . In Model II, the amount that each influential neighbor can potentially affect feature of individual is defined as
(3) 
and similarly the vote for the outcome can be weighted as
(4) 
Based on this model, not only how many neighbors vote for a feature matters but also how strongly they vote matters too. The rest of the parameters are the same as the model given in Section 2. In this model, the degree of influence directly depends on how similar each pair of individuals are. Based on this definition, if , then , and if then equals . The directed acyclic graph of this model is given in Fig. 3.
The variable in this model is not a parameter to be estimated as it can be calculated from and . However the structure of the posterior and conditionals are different from the original model in the sense that now depends also on all .
The posterior for Model II is derived in Appendix C.
We also propose a possible third model (Model III) in Appendix D, where the relationship between parent and individual can have a degree/weight between 0 and 1 following a Beta distribution. This is different from both Models I and II where
was a binary variable. Model III did not tend to perform well, and we chose to omit it in our experiments. It is possible that improvements to the model might lead to better performance, though we leave that for future work.
2.4 Health Outcome/Label Prediction
Let us assume that a new patient with a feature set is available, for which we may or may not have the health label . We can use our model to make three types of predictions for this new individual as follows:

For a known and unknown , what are the set of influential neighbors , for , and their influences , for , ?

For the same setting as (I) above, what is the estimated probability of being at each health outcome, and what is the predicted outcome ?

For a set of known , what are the set of influential neighbors and their influences , for ?
The answers to prediction problems (I) and (II) can be obtained in a completely unsupervised way. That is, they can be answered having removed the right part of the model in Fig. 2 that generates labels Y. In other words, we do not need the outcomes of the parent set for inference in the unsupervised case.
3 Numerical Experiments: Applications to Healthcare Analytics
We applied our models to two healthcare problems, heart disease and breast cancer predictions. Then, in Section 3.5, we applied our models to a larger data set with more features.
Heart Disease Diagnosis: The Cleveland database (Cleveland Clinic Foundation), obtained via (Bache and Lichman, 2013), reports the presence of heart disease in 303 patients, 274 of which had no missing information, each labeled with whether they had heart disease, and features are described in Table I.
#  Feature Name  Number of Possible Outcomes  Description 

1  Age  3  (0,45),[45,55),[55,100) 
2  Gender  2  Male, Female 
3  Chest Pain Type  2  1 or 2, 3, 4 
4  Trestbps (blood pressure)  3  (0,120], (120,140],(140, 300) 
5  Serum Cholestrol (Chol)  3  (0,240], (240,260], (260,400) 
6  Fasting Blood Sugar  2  0, 0 
7  Resting Electrographic Results (Restecg)  2  0, 0 
8  Maximum Heart Rate Achieved (Thalach)  3  (0,50],(50,65],(65,202) 
9  Exercise induced Angina (Exang)  2  0, 0 
10  Oldpeak  2  (0,0.5], (.5,6.2) 
11  Slope of the peak  2  1, 2 or 3 
12  Number of major vessels colored by fluoroscopy (Ca)  2  0, 1 or 2 or 3 
13  Thal  2  3, 3 
Label  Heart Disease  2  Presence, Absence 
Breast Cancer Recurrence Prediction: We used a dataset provided by the Oncology Institute, University Medical Centre, Ljubljana, Yugoslavia (Bache and Lichman, 2013). The attributes are described in Table II, and each observation was labeled as to whether it led to recurrence of breast cancer. After removing patients with missing values, we randomly selected a balanced set of 162 patients (81 from each class) for our analysis.
#  Feature Name  Number of Outcomes  Description 

1  Age  3  (0,50),[50,59),[60,100) 
2  Menopause  3  40, 40, premeno 
3  Tumorsize  4  [0,19], [20,29],[30,39], [40,59] 
4  Invnodes (blood pressure)  2  (0,2], 2 
5  NodeCaps  2  yes, no 
6  Degmalig  3  1,2,3 
7  Breast  2  left, right 
8  BreastQuad  4  leftup, leftlow, rightup, or rightlow, central 
9  Irradiant  2  yes, no 
Label  class distribution  2  norecurrenceevents, recurrenceevents 
3.1 Prediction Results
Both supervised and unsupervised versions of BPatch perform about equally well in predicting patients with cancer and heart disease. This indicates that learning a good set of neighbors is useful not only for characterizing the current patient’s features, but also for predicting their label. Let us show this.
The hyperparameters were chosen as follows: , , , , , , , . The number of parents was initially chosen to be 80 () for both datasets. Parents were randomly selected from the training sets in each fold. In Section 3.3, we discuss the sensitivity of the prediction results with respect to all hyperparameters and discussed why these parameters are interpretable in that they can control both the structure of the model and its accuracy/prediction power. Prediction evaluation was performed in two ways, unsupervised and supervised. For the unsupervised analysis, the model structure was trained without using the outcome of any of the patients in the training set and only the vector of patient attributes was used to infer model parameters. In other words, the neighbors and their influences were found solely from the feature values (see Section 2.4). Note that while the training process itself is unsupervised, we did use the trained parameters along with the labels of training points in order to make predictions for the cases in the test set. The training set determines how we should combine the neighbors’ labels to produce a label for the current case. In the supervised case, we used the information on the actual health outcomes of the patients in the training set both to train the model and to make predictions for the new cases. For both cases, we performed 5fold crossvalidation; here we divided our data into five folds, each with the same number of parents, trained our models on four folds and tested on the fifth.^{1}^{1}1The individuals in the parent list were chosen uniformly at random in each fold. We repeated our experiments with fixed sets of parents over all folds. The results were similar to models trained with different parents in each fold (omitted here).
We performed all experiments for both data sets using Model I and Model II. After training, we used the posterior predictive distribution of
to estimate the probability of having heart disease and breast cancer (as a recurrent event) for all patients in the test set. Using the estimated probability for patient as a classification input (where 0.5 is the threshold), we then calculated the classification accuracy, sensitivity, specificity, precision, recall, and Fmeasure (see Tables III and IV). We reported these measures for each fold and followed by the mean and standard devision over folds.Supervised  

Measure  Accuracy  Sensitivity  Specificity  Precision  Recall  
Fold  Model I  Model II  Model I  Model II  Model I  Model II  Model I  Model II  Model I  Model II 
1  74.55  81.82  73.33  73.33  76.00  92.00  78.57  91.67  73.33  73.33% 
2  78.18  76.36  80.00  80.00  76.00  72.00  80.00  77.42  80.00  80.00% 
3  78.18  78.18  69.23  73.08  86.21  82.76  81.82  79.17  69.23  73.08% 
4  92.73  92.73  92.31  92.31  93.10  93.10  92.31  92.31  92.31  92.31% 
5  90.74  88.89  96.00  100.00  86.21  79.31  85.71  80.65  96.00  100.00% 
Mean  82.88  83.60  82.17  83.74  83.50  83.83  83.68  84.24  82.17  83.74% 
SD  8.25  7.01  11.66  11.98  7.41  8.86  5.52  7.17  11.66  11.98% 
Unsupervised  
Measure  Accuracy  Sensitivity  Specificity  Precision  Recall  
Fold  Model I  Model II  Model I  Model II  Model I  Model II  Model I  Model II  Model I  Model II 
1  72.73  72.73  73.33  73.33%  72.00  72.00  75.86  75.86  73.33  73.33% 
2  76.36  74.55  76.67  76.67%  76.00  72.00  79.31  76.67  76.67  76.67% 
3  80.00  80.00  73.08  76.92%  86.21  82.76  82.61  80.00  73.08  76.92% 
4  89.09  94.55  84.62  100.00  93.10  89.66  91.67  89.66  84.62  100.00% 
5  87.04  87.04  92.00  92.00%  82.76  82.76  82.14  82.14  92.00  92.00% 
Mean  81.04  81.77  79.94  83.78  82.01  79.83  82.32  80.87  79.94  83.78% 
SD  6.94  9.06  8.20  11.59  8.33  7.69  5.88  5.53  8.20  11.59% 
Supervised  

Measure  Accuracy  Sensitivity  Specificity  Precision  Recall  
Fold  Model I  Model II  Model I  Model II  Model I  Model II  Model I  Model II  Model I  Model II 
1  68.75  71.88  92.86  92.86  50.00  55.56  59.09  61.90  92.86  92.86% 
2  90.91  87.88  100.00  100.00  80.00  73.33  85.71  81.82  100.00  100.00% 
3  62.50  71.88  57.14  71.43  66.67  72.22  57.14  66.67  57.14  71.43% 
4  84.38  81.25  87.50  81.25  81.25  81.25  82.35  81.25  87.50  81.25% 
5  84.85  81.82  94.74  89.47  71.43  71.43  81.82  80.95  94.74  89.47% 
Mean  78.28  78.94  86.45  87.00  69.87  70.76  73.22  74.52  86.45  87.00% 
SD  12.04  6.95  16.98  11.01  12.65  9.36  13.89  9.50  16.98  11.01% 
Unsupervised  
Measure  Accuracy  Sensitivity  Specificity  Precision  Recall  
Fold  Model I  Model II  Model I  Model II  Model I  Model II  Model I  Model II  Model I  Model II 
1  76.36  71.88  76.67  92.86  76.00  55.56  79.31  61.90  76.67  92.86% 
2  76.36  87.88  76.67  100.00  76.00  73.33  79.31  81.82  76.67  100.00% 
3  59.38  65.63  57.14  71.43  61.11  61.11  53.33  58.82  57.14  71.43% 
4  89.09  84.38  84.62  87.50  93.10  81.25  91.67  82.35  84.62  87.50% 
5  78.79  84.85  89.47  94.74  64.29  71.43  77.27  81.82  89.47  94.74% 
Mean  76.00  78.92  76.91  89.30  74.10  68.54  76.18  73.34  76.91  89.30% 
SD  10.67  9.64  12.33  10.95  12.58  10.21  13.99  11.90  12.33  10.95% 
We can observe from the results in Tables IIIIV
that both supervised and unsupervised version of Model I and Model II perform well in terms of classifying patients with heart disease and breast cancer conditions. The supervised version performs slightly better in both data sets, which is not surprising because it uses more information (the labels
of past cases) for estimation of parameters. In the unsupervised case, recall that the labels of past cases were not used to train the model parameters, however, they were used to make predictions from the trained parameters. That is, without using the health outcomes of past cases, our models inferred which neighbors to consider for predicting health outcomes simply by looking at neighbors with similar patterns in feature values.Model II tended to have better performance than Model I in both data sets as it gives more weights to neighbors with more similar features. As discussed before, Model II is computationally more expensive than Model I and thus takes longer to converge. See Section 3.5 for the discussion on CPU analysis of two models.
BPatch produces probabilistic estimates for each class, unless other nearest neighbor models (and unlike BCM which is unsupervised). We plotted the estimated probability of having heart disease and a recurrent breast cancer event for the patients in the test sets in Figure 4. This figure shows that the probability of the health condition of interest tends to be higher for patients who actually have that condition.
3.2 Discussion on Interpretability
Interpretability is known to be a major concern in machine learning for highstakes decisions, but it is contextdependent, subjective, and hard to quantify. In what follows, we discuss three aspects of interpretability: the view of each patient through the lens of the neighbors, discussed in Section 3.2.1, and feature importance in Section 3.2.2. We also further provide an interpretability comparison in Section 3.4
to BCM, KNN, and decision trees.
3.2.1 Insight About Cases from Bayesian Patchworks
BPatch can potentially pinpoint which past cases are similar to a new case, why certain past patients are similar to a new case, how similar they are, and how this similarity is quantified. Let us show by few examples. In Figures 58, the top neighbors/parents with the highest are shown for four sample patients with and without breast cancer and heart disease. For each sample, the feature values of the parent associated with important dimensions are shown in bold (i.e., when ). It can also be seen that each parent has a unique combination of features that are important for that parent. The parents that are selected as neighbors tend to be similar to the patients on the important features for each parent. Consider Sample Patient 1 in Fig. 5 for instance. It can be seen that its Parents 14 have some features identical to those of Sample Patient 1. Patients with cancer are connected more often to parents with cancer. For example, three out of the selected four parents have the same health label as Patient 1. Doctors now can look at all selected parents and see why these parents are selected as the influential neighbors and use the predicted probability of the health condition from the neighbors to make their medical decision or diagnosis. Doctor can also use these results to investigate similar past cases in terms of possible treatments in order to make better treatment decisions.
3.2.2 Feature Importance
Bayesian Patchworks has parameters that directly mirror feature importance. These parameters can be used to compare the importance of each feature for distinguishing individuals from different classes.
In Fig. 9, the posterior predictive distribution of for the features used in both data sets are shown for the unsupervised version of Model II. We can observe from this figure that features such as age and blood pressure seem to be more important than features like Serum Cholesterol or Resting Electrographic Results for finding the influential neighbors of each individual. We carried out the same experiments for the breast cancer dataset and plotted the posterior predictive distribution for . Since we could not find any strong scientific resource to directly evaluate or approve these findings, we indirectly evaluate them, by comparing to another feature importance approach: first we used only the top 3 features to train the model; next, we used the next 3 features (feature ranked 4th, 5th, and 6th); finally the next 3 features (feature ranked 7th, 8th, and 9th) and evaluated the prediction power of the model and compared with the case of when all features are employed. Results shown in Fig. 10 indicate that for both datasets, the performance of the top three features alone is similar to when all features are used. When the less important features are employed, the prediction power of the model degrades. This result shows that the vector obtained from is actually indicative of feature importance. In Subsection 3.4.2, we used estimated feature importance vectors as the feature weight vectors for KNN and saw slightly improved prediction performance. We repeated this using Model I and other extensions and the results were similar.
3.3 Sensitivity analysis for various hyperparameters
Sensitivity analysis in a Bayesian setting can help find out how sensitive a model’s performance is to minor changes in its hyperparameters. We report below sensitivity analysis to hyperparameters , , , , and .
3.3.1 Sensitivity analysis for the number of parents ()
The total number of influential neighbors for each new case is a critical factor in the performance of the model. While increasing the number of parents will improve the coverage of parents among the distribution of patients, it can also make the computational time to train the model longer, and potentially lead to overfitting. In Figure 11, we show the accuracy of both Model I and Model II under different values of . As can be seen for both data sets, the accuracy improves as we increase the number of cases in the parent set for our data.
3.3.2 Sensitivity analysis for
As explained earlier, parameter controls how many parents are an influential neighbor of each individual. The average number of influential neighbors is . It is expected that large can improve the performance of the model, however choosing a very large may result in selecting all parents as neighbors and possibly overfitting. Conversely, if is chosen to be too small, then no parent is selected as the influential neighbor of each case and the model performs as random guessing. In Figure 12, the accuracy of Model I and Model II under five different values of is given for the two data sets. Under the two extreme cases of and , the model acts as a random guess with accuracy close to 50%. The best choices for are at nonextreme values; ideally would be crossvalidated if computation time permits.
3.3.3 Sensitivity analysis for
As explained earlier, determines the importance of feature (ranging between 01) and follows a beta distribution with parameters and . These hyperparameters will change the shape the distribution for and control the number of features and their importance. To explore different shape types of the beta distribution, we consider three symmetric cases: (i) a Ushaped distribution with
=(.5,.5), (ii) a uniform distribution with
=(1,1), (iii) a symmetric unimodal distribution with =(2,2), and two unimodal cases (iv) =(2,5) and (v)=(5,2) with positive and negative skews, respectively. Figure
13 shows the models’ sensitivity to . In both datasets, the symmetric case of(the Jeffreys prior for the Bernoulli and binomial distributions) performs best, but not by very much.
We performed sensitivity analysis for other hyperparameters , , , , and and found out that the model is not very sensitive within the range of reasonable choices.
3.4 Comparison with Existing Models
In this section, we first provide the results of a general comparison between our model and some existing models. We then more closely compare our model with the most similar models including BCM, KNN, and decision trees. Our purpose is not to show that our model outperforms these methods in terms of prediction; all methods perform comparably for our data. We aim instead to show that our model performs well enough, where interpretability serves as an excellent tiebreaker.
3.4.1 Prediction Accuracy and Comparison with Benchmark Models
The methods we compare with are logistic regression, SVM, BCM, AdaBoost, KNN, random forests, and decision trees. For each method and each dataset, we performed 5fold crossvalidation and reported the mean over 5 folds for accuracy, sensitivity, specificity, precision, and recall. The results are summarized in Table
V, indicating that our methods (Model I and Model II) perform comparably with other machine learning algorithms.Model  Heart Disease Dataset  Breast Cancer Dataset  

Learning Type  Accuracy  Sensitivity  Specificity  Precision  Recall  Accuracy  Sensitivity  Specificity  Precision  Recall 
Proposed Model I  0.83  0.82  0.83  0.84  0.82  0.78  0.86  0.7  0.73  0.86 
Proposed Model II  0.84  0.84  0.84  0.84  0.84  0.79  0.87  0.71  0.75  0.87 
BCM ()  0.80  0.86  0.74  0.78  0.86  0.76  0.93  0.60  0.69  0.93 
BCM ()  0.82  0.85  0.80  0.81  0.85  0.75  0.82  0.68  0.72  0.82 
BCM ()  0.63  0.53  0.75  0.64  0.53  0.68  0.97  0.40  0.62  0.97 
BCM ()  0.78  0.80  0.77  0.78  0.80  0.77  0.84  0.71  0.75  0.84 
logistic regression  0.82  0.84  0.80  0.81  0.84  0.77  0.89  0.66  0.72  0.89 
KNN (30)  0.83  0.85  0.82  0.83  0.85  0.76  0.99  0.54  0.68  0.99 
KNN (20)  0.82  0.86  0.79  0.80  0.86  0.76  0.96  0.56  0.69  0.96 
KNN (10)  0.81  0.82  0.80  0.81  0.82  0.75  0.99  0.53  0.68  0.99 
KNN (5)  0.81  0.82  0.81  0.81  0.82  0.73  0.90  0.57  0.67  0.90 
KNN (1)  0.76  0.74  0.78  0.77  0.74  0.68  0.70  0.65  0.66  0.70 
Random Forest (5)  0.78  0.79  0.77  0.78  0.79  0.78  0.85  0.72  0.75  0.85 
Random Forest (10)  0.80  0.81  0.79  0.80  0.81  0.74  0.80  0.68  0.71  0.80 
Random Forest (100)  0.83  0.84  0.82  0.83  0.84  0.78  0.87  0.70  0.74  0.87 
Decision Tree  0.75  0.77  0.72  0.76  0.77  0.73  0.70  0.75  0.73  0.70 
SVM  0.81  0.85  0.76  0.79  0.85  0.79  1.00  0.59  0.71  1.00 
boosted DT (100, I)  0.82  0.80  0.84  0.83  0.80  0.77  0.90  0.65  0.72  0.90 
boosted DT (50, I)  0.82  0.81  0.84  0.84  0.81  0.79  0.93  0.65  0.73  0.93 
boosted DT (10, I)  0.83  0.81  0.84  0.84  0.81  0.79  0.99  0.60  0.71  0.99 
boosted DT (5, I)  0.80  0.81  0.78  0.80  0.81  0.76  0.91  0.62  0.70  0.91 
boosted DT (100, II)  0.81  0.85  0.77  0.79  0.85  0.77  0.90  0.65  0.72  0.90 
boosted DT (50,II)  0.81  0.85  0.77  0.79  0.85  0.77  0.90  0.65  0.72  0.90 
boosted DT (10, II)  0.81  0.85  0.77  0.79  0.85  0.78  0.91  0.66  0.73  0.91 
boosted DT (5, II)  0.81  0.85  0.77  0.79  0.85  0.78  0.91  0.65  0.72  0.91 
The numbers in the parentheses are the number of parents. The distance measure for KNN is Hamming. The number in the parentheses for random forests is the number of trees. The numbers in the parentheses are the number of the weak classifiers and the weak learning algorithm type for AdaBoost. The learning for Type 1 is Tree and for Type II is Discriminant. We used the fitensemble function in Matlab with AdaBoostM1 as the method.
3.4.2 Comparison with KNN and its extensions
One of the main differences between Bayesian Patchworks and the regular KNN is that unlike KNN that requires a distance measure that is the same for all instances of data, our model uses an adaptive metric, where it finds neighbors from parents based on their important features. Also, unlike KNN, BPatch does not force a fix number of neighbors/parents for each case. In our experiments, the neighbors chosen by KNN were not always similar to those chosen by BPatch.
There are some extensions of KNN that have addressed some of the limitations of KNN (see the survey for such techniques (Bhatia and Vandana, 2010)). As explained in (Bhatia and Vandana, 2010), most of these extensions are improvements over basic kNN to gain speed efficiency as well as space efficiency and may hold good in particular field under particular circumstances. Since we cannot compare our model with all of the extensions of KNN, we chose two of the most relevant extensions of KNN and compare our results with them. The first extension is the KNN with distance weighted votes, where the training points are assigned weights according to their distances from sample data point (we used the inverse of the distance). The second extension is KNN with feature weights. We chose in KNN and in our models to be the same, so that, on average, the same number of features are selected for each neighbor. The results of this comparison are shown in Fig. 14. The performance of our model was still better than KNN and its two extensions. In KNN and its extensions, whether or not distance weights, feature weights, or bandwidths are considered, the comparison between a new case and a case in the training set is always based on a fixed formula with a fixed number of neighbors. Our distance metric is more flexible.
3.4.3 Comparison with BCM
To adapt the Bayesian Case Model Kim et al. (2014) for supervised classification (and health outcome prediction), the mixture vector from BCM can be used as a feature of length for input to other classification methods, such as SVM and logistic regression. In Fig. 15, we compared our models (Model I and Model II) with four possible values for and compared it with 6 cases of BCM (combinations of , and and ). Note that no larger was considered for BCM simply due to the low performance of the trained models. The performance of BPatch was generally better.
Interpreting BPatch’s results may also be easier than interpreting those of BCM. The posterior estimate of the mixture vector over the prototypes is not directly interpretable since the prototypes are not fixed, but we can make choices that allow for sparser or larger mixtures. Let us consider BCM under two conditions for the hyperparameter used in Kim et al. (2014), which controls the sparsity of the prototypes chosen for a new case.
Sparse BCM (e.g., ): Only one prototype generates each observation. Table VI shows this prototype, which shares some of the feature values with the new observation.
Nonsparse BCM (e.g., ): BCM chooses multiple prototypes, however, only one prototype generates each feature of the new patient. In contrast, BPatch’s features are generated from multiple parents. Table VI illustrates this.
Model  Feature  F1  F2  F3  F4  F5  F6  F7  F8  F9  F10  F11  F12  F13 



Sample Patient  2  1  3  2  3  1  1  3  2  2  2  1  2  2  

1st Prototype  3  2  3  3  1  2  2  3  1  2  2  2  2  1  2  

1st Prototype  2  1  2  2  3  2  1  3  1  1  1  1  1  0.15  1  
2nd Prototype  3  2  3  3  3  1  1  3  2  2  2  2  2  0.54  2  
3rd Prototype  1  2  3  2  1  1  1  3  1  1  1  1  1  0.00  1  
4th Prototype  1  2  3  1  1  1  1  3  2  2  2  1  2  0.31  2  
Model II  1st Parent  2  1  3  2  3  1  1  3  2  2  2  1  2  1  2  
2nd Parent  3  1  3  3  3  1  2  3  2  2  2  1  2  0.11  2  
3rd Parent  3  1  3  3  3  1  2  3  2  2  2  1  1  0.08  2  
4th Parent  2  1  2  3  1  1  2  3  2  2  2  1  1  0.05  1 
3.4.4 Comparison with Decision Trees
Decision trees are a completely different form of interpretable model that is not casebased. In the datasets we are considering, there are many good predictive models (a large “Rashomon Set”) meaning that many models that are different from each other that all perform well. We used CART, whose performance (shown in Table V) was not as strong as some of the other methods. Interestingly, the features that CART identifies were not the same as those identified by BPatch. The feature ranking from both methods is shown in Fig. 16. More result is given in Appendix E.
3.5 Application to Larger Dataset and its Computational Challenges
Computation is the major challenge for BPatch. Here, we show that BPatch’s performance does not deteriorate on a slightly larger dataset with more samples and more features, despite more challenging computation.
This dataset is provided by the Center for Clinical and Translational Research, Virginia Commonwealth University and is also available at the UCI website (Strack et al., 2014). This data has been prepared to analyze factors related to readmission as well as other outcomes pertaining to patients with diabetes. The dataset represents 10 years (19992008) of clinical care at 130 US hospitals and integrated delivery networks. Similar to Strack et al. (2014), we focused on early readmission, and defined the readmission label (i.e., health outcome) as having two values: “readmitted,” if the patient was readmitted within 30 days of discharge or “otherwise,” which covers both readmission after 30 days and no readmission at all. The list of the features and their possible outcomes are given in Table VII, where features age and timein hospital were discretized. We used data from two sets of 500 patients each, performed 5fold crossvalidation, and reported the average accuracy for BPatch and similar methods in Table VIII. The hyperparameters are the same as previous experiments and the number of parents was set to 60. Again, most methods give comparable results, with BPatch’s accuracy somewhere in the middle. For interpretability, we also checked (not shown here) that selected neighbors were reasonably close to their respective patients.
feature Name  Description  feature Name  Description  

Race 

Final diagnosis 


Gender  male, female  Metformin  Up, Down  
Age  [10(i1)–10i], {1..9}  Glimepiride  Up, Down  
discharge_disposition_id  Discharged to home, otherwise  Glipizide  Up, Down  
Admission_source 

Glyburide  Up, Down  
Time_in_hospital  <2,27, >7  Pioglitazone  Up, Down  
Medical_specialty 

Rosiglitazone  Up, Down  
A1Cresult  >7, >8, none, other  Insulin  Up, Down  
DiabetesMed  Yes, No  Admission_type  Emergency, Other  
Changes in diabetes  Yes, no 
Learning Type  Accuracy  Sensitivity  Specificity  Precision  Recall  Fmeasure 

Model (I)  (=60)  74.50  84.97  64.05  70.58  84.97  79.26 
Our Model (II) (=60)  75.4  82.7  68.0  72.6  82.7  79.1 
BCM ()  76.80  85.99  67.73  72.73  85.99  78.74 
BCM ()  77.30  87.40  67.30  72.79  87.40  79.36 
Logistic Regression  75.9  78.5  73.2  74.5  78.5  76.4 
KNN (30)  62.3  96.1  28.7  57.5  96.1  71.8 
KNN (20)  63.1  96.5  29.8  58.0  96.5  72.4 
KNN (10)  64.1  95.4  32.9  59.0  95.4  72.7 
KNN (5)  67.3  92.7  42.3  61.8  92.7  73.9 
KNN (1)  66.9  88.0  45.9  62.0  88.0  72.6 
Random Forest (5)  78.8  83.2  74.4  76.7  83.2  79.6 
Random Forest (10)  80.6  84.6  76.5  78.5  84.6  81.4 
Random Forest (100)  83.1  85.7  80.5  81.7  85.7  83.6 
Decision Tree  77.9  78.9  76.9  77.5  78.9  78.1 
SVM  67.0  88.2  45.6  65.6  88.2  73.6 
boosted DT (100, I)  79.3  82.4  75.9  77.9  82.4  80.0 
boosted DT (50, I)  78.7  83.0  74.2  76.7  83.0  79.6 
boosted DT (10, I)  75.0  82.6  67.5  72.3  82.6  76.7 
boosted DT (5, I)  72.4  84.7  60.0  68.5  84.7  75.3 
boosted DT ( 100, II)  74.3  79.7  68.7  72.1  79.7  75.5 
boosted DT (50, II)  74.3  79.7  68.7  72.1  79.7  75.5 
boosted DT (10, II)  74.3  79.7  68.7  72.1  79.7  75.5 
boosted DT (5, II)  74.3  79.7  68.7  72.1  79.7  75.5 
As shown in Fig. 17 for all three data sets, as the number of parents increases, it takes more time to learn the structure of the model. It is also shown that training Model I is faster than Model II. Also, given the same number of parents and prototypes, each step of BPatch and BCM takes around the same amount of time, but BPatch uses more memory, mainly due to the existence of 3dimensional variable . Thus, we can conclude that without sacrificing accuracy, our model is slightly slower than BCM and much slower than the rest of the models presented in Table V. On the other hand, there is something to be gained from that extra computation, and the tradeoff between computation and the need for interpretability should depend on the application. BPatch is currently suited for smaller datasets (e.g., clinical trials, data from rare diseases, preliminary studies with highquality handcurated samples, data from one physician’s case files, or data from one department in a hospital) where one requires a high level of interpretability in order to reason about cases.
4 Conclusion and Future Work
Doctors reason through cases already. In a sense, we are proposing a datadriven formalization of techniques that are used in an ad hoc way throughout much of current practice. In some sense, one could view Bayesian Patchworks as a form of adaptive nearest neighbors. Each case is generated by several neighbors, each using a different distance measure to the present case. In the generative sense, each patient is built the way a patchwork quilt is built; the larger design is built from pieces gathered from other sources. It is easy to argue that considering neighbors through different lenses (as Bayesian Patchworks does) makes more sense than fixing one distance measure, and having only one way to search for nearest neighbors.
Bayesian Patchworks operates similarly to real estate “comps,” which are similar properties previously sold that one might use to price a current house on the market. There are many possible future extensions of Bayesian Patchworks, for instance, one could modify the prior to favor using fewer total parents. Or, one could generalize the method to handle problems other than classification. Regression would be a trivial extension, but more complex outputs could be more interesting.
Acknowledgments
The authors would like to thank Siemens and Wistron for supporting this work.
Appendix A Appendix: Inference
The model hyperparameters are denoted as . Given a set of known and hyperparameters , the full posterior, which is the product of the likelihood and priors, can be computed as
(5)  
We can use conjugate priors to simplify the above expression and also integrate out the latent factors
. This simplifies to:(6)  
The rest of the calculations will be for the three integrations inside Eq. (6) (lines 24). After performing these integrations, we will be left only with , , and . The details of those calculations are given in Appendix B. From the results given in Appendix B, we have the following three important relationships, which can be used to compute , , and in Eq. (6), respectively:
(7)  
(8) 
(9) 
where and is the Beta function with parameters a and .
a.1 MetropoliswithinGibbs (MWG)
An iteration of MCMC cycles over the components of the parameter vectors , , and , keeping everything else fixed. We first show that the conditional posterior distributions of , , and admit closed forms. Denoting as the set of all parameters in the model except for (including the ’s and ’s, and implicitly including the variables that have been integrated out already), we have
(10)  
where is obtained from Eq. (7). The step above uses conditional independence with respect to the ’s, and other conditional independences specified in the model. It should be noted that given a node’s parents in a directed acyclic graph, that node is conditionally independent of its grandparents and any other ancestors (see Proposition 5.2 in Jackman (2009) on how to find conditional distribution using directed acyclic graph). Now we have a simple form for the conditionals that drive the MCMC sampler. We use a Metropolitan Hastings algorithm to sample from the above conditional distribution.
The rest of the parameters ( and ) in the model can be sampled directly from the conditional distributions using Gibbs sampling. In particular, using from Eq. (7) and from Eq. (8), we get the following:
(11)  
where includes all variables except . This means that we can sample directly from the above conditional distribution as takes only two possible values of 0 and 1.
Finally, using that the ’s and ’s are the children of , we get
(12) 
which can be simplified using Eqs. (8) and (9) to
(13) 
Now, we can implement a Metropolis withinGibbs sampling method to generate samples of the posterior distribution. After we draw samples for (Eq. 13), (Eq. 10), and (Eq. 11), we then calculate the posterior predictive distribution for all measures of interest. At iteration , we sample
1) from using direct sampling,
2) from using randomwalk MetropolisHastings, and
3) from using direct sampling.
For better identifiably, we can add a constraint for so that if , then all become zero for all . However, since this changes the structure of the directed acyclic graph and adds more steps to calculate the conditional distributions for and , we chose not to do that in this paper.
a.2 Inference for outcomes
Let us assume that is the set of parameters that can be learned from dataset (), which includes past cases which do not include patient . Similarly, let denote the parameters associated only with the th individual (i.e., ). We can write the posterior predictive distribution as:
(14) 
Then according to the model, is the posterior distribution given , and (which can be calculated from Eq. (6)). The term can be calculated given that follows a Categorical distribution with parameter vector , which can be directly calculated from (i.e. ).
Now, if is the output of the MCMC at the th iteration, (see Section A.1), the posterior predictive distribution can be estimated using
(15)  
For Type III predictions, we simply input in to the model in the learning process and then use the posterior predictive distribution in Eq. (15) to carry out inference for the parameters of interest.
Appendix B Simplification of the Posterior Distribution
After performing the integrations discussed above, we will be left only with , , and . For the first term, we have
(16)  
As stated earlier, has a Bernoulli distribution with parameter and is a Beta distribution with parameters (
). Since the posterior predictive distribution of a Binomial random variable with a Beta distribution prior on the success probability is a BetaBinomial distribution, the inner integral in Eq. (
16) becomes
Comments
There are no comments yet.