1 Introduction
According to the Centers for Medicare and Medicaid Services (CMS), U.S. health care expenditures grew percent in 2014, reaching trillion or per person. As a share of the nation’s Gross Domestic Product, health spending accounted for percent. Rising health care costs have become a major concern for hospital chains [1, 2, 3, 4], which increasingly have to deliver the best care possible within a given budget, forcing them to make better medical decisions.
Common practice is to assign patients to medical professionals (general practitioners, specialists, nurse practitioners) on a firstavailable basis. This ignores special expertise with particular medical conditions, as well as the past performance of the physician or facility. At the same time, physicians may face choices in terms of how to treat a condition, which tends to be guided in part by the past experiences of each physician. Thus, we have choices of physician (or type of physician), care facility, and specific treatments. The best choices depend on a combination of the characteristics of the patient, the physician, the facility, and the treatment plan. We address the problem of how to make the best decisions in the presence of imperfect (and sometimes highly imperfect) understanding of the relationship between patient attributes, medical decisions and medical outcomes.
We are particularly interested in total knee replacement [5, 6], a common operation for people with osteoarthritis of the knee which affects more than 27 million people in the U.S. according to the Arthritis Foundation. More than 600,000 knee replacements are performed each year in the U.S., which also leads to extensive postoperative rehabilitation which varies widely from one patient to the next. In order to promote a health care system that provides better care and spends health care dollars more wisely, in 2016, the United States Department of Health and Human Services (HHS) proposed the Comprehensive Care for Joint Replacement (CJR) where the hospital may be required to repay Medicare for a portion of the cost of a knee replacement episode if the cost and quality fall outside of specified ranges. (HHS 2015) [7].
In this paper, we consider a binary feedback (success/failure) model where if the postoperative cost is below some threshold, the episode of care (spanning initial diagnosis and testing, inpatient treatment and outpatient care) is said to be successful; otherwise it is treated as failure. The aim is to decide the most appropriate physicians and caregivers for each individual patient and maximize the success rate over time. This is an example of the broader area of personalized medicine, which formalizes clinical decision making as a function that maps individual patient information (including measures of disease stage severity, medical history, clinical diagnosis, genomic information and environmental information) to a recommended treatment. Our work is part of a growing trend toward personalized care where medical decisions are tuned to the characteristics of each patient. This approach, however, introduces considerable uncertainty in the identification of the best treatments since there is very little data describing patients with the same (or even similar) attributes. As a result, there is considerable uncertainty in models of the relationship between medical decisions (and patient attributes) with treatment outcomes. This motivates our work that addresses the problem of balancing between making what appear to be the best decisions, and learning to make better decisions in the future.
There is a variety of new methods to aid in the search for the optimal treatment regime, where a single decision or a series of sequential decisions may be involved, including sequential multiple assignment randomized trials (SMART) [8, 9]
, doubly robust estimators
[10, 11, 12], Qlearning [13, 14], adaptive strategies [15] and other dynamic treatment regimes studied at length by Robins and colleagues [16, 17, 18, 19]. Much of the work is trained on past observational data and the treatment regime is not updated with new patient responses. Yet when work with historical data, the result is less objective and can be biased substantially due to the differences, for example, in patient populations and in medical facilities or the evolving of the diseases. This is known as offline settings where we are not punished for errors incurred during training and only concern with the final treatment regime after the offline training phases. In this paper, we focus on the case of a single decision and take an online view, continuously using accumulating data to modify aspects of the treatment regime as new patients coming in. We adopt a Bayesian approach in which the results from pilot study can be used to construct a reasonable prior treatment regime and which can efficiently identify any trend, or optimal clinical benefit as information accures, with the potential to allow smaller yet more informative trials and for patients to receive better treatment.In this regard, it bears similarity with adaptive designs [20, 21, 22, 23, 24, 25]. Decisions are made adaptively throughout the running of the trial. Make an observation, update your knowledge, decide what information to collect next is commonly regarded to be the Scientific Method and the practical guidelines [26]
. Both our setting and adaptive designs face the same exploration/exploitation dilemma: (1) treat current patients as effectively as possible and (2) have a high probability of correctly identifying the better treatment. Making what we think is currently the best decision may not be the best given the uncertainty in our model, forcing us to recognize that we have to learn to make better decisions in the future.
In this paper, we formalize personalized medicine as a Bayesian contextual bandit problem. We encounter two challenges. First, there are very few patients with the same characteristics, which means that it is unlikely that an individual physician sees a sufficient number of eligible patients to produce statistically reliable performance measurements on medical outcomes. We overcome this situation using a parametric belief model that allows us to learn relationships across a wide range of patients and health providers, which is different from the earlier multiarmed bandit formulations in clinical trials [27, 28] which ignore the attributes of individual patients. Second, due to ethical reasons, testing a treatment decision is expensive. This puts us in the setting of optimal learning where we need to learn the best treatment as fast as possible. This represents a distinctly different learning environment than what has traditionally been considered using popular policies such as upper confidence bounding (UCB) which have proven effective in settings with high sampling rates such as learning adclicks or the doubly robust estimation which is trained on historical data [29, 11]. We therefore adopt a Bayesian approach and the knowledge gradient policy which takes advantage of domain knowledge to produce rapid learning, and which maximizes success rates for both the current and future patients.
The rest of this paper is organized as follows. In Section 2
, we establish the contextual bandit formularisation for the sequential decision making problems in personalized medicine. Due to the sequential nature of our problem, we adopt an online Bayesian logistic regression algorithm in Section
3 to handle recursive updates with each patient response. In Section 4, we introduce the concept of postobservation states based on which we develop a knowledge gradient type policy for maximizing the number of successful treatments. In Section 5, we describe a study investigating the performance of the knowledge gradient policy in the design of health care decisions for knee replacement patients which demonstrates the value of an optimal learning policy to reduce health care costs. We conduct feature section through clustering and LASSO to deal with the intrinsic sparsity in the knee replacement data,2 Problem Definition
In personalized medicine, a patient arrives characterized by a set of unique characteristics such as measures of disease stage severity, medical history, clinical diagnosis, genomic information, and environmental information, with a health complaint that requires medical intervention. We use this information as a basis for decisions about medical treatment, including choice of physician, tests, drugs, surgery, and rehabilitation/follow up. At the end of a treatment episode, we observe a dichotomous health outcome (e.g. whether cost is below a Medicarespecified threshold, whether the treatment is effective), which is then used to update our understanding of relationships between medical decisions and health outcomes for a patient with a specific set of characteristics.
To design a personalized treatment strategy, the learner is presented with a context vector
(the characteristics of the th patient ) and a set of actions (doctors, treatments, rehabilitation). Each action is associated with a feature vector . Discrete treatments are handled with indicator variables (such as if an attribute refers to administering a particular drug). After choosing an action , a response of patient {failure, success} for the action is revealed, but the rewards of other actions are not observable. The “success” or “failure” may depend stochastically on and .A policy or a treatment regime is a function mapping from any context information to an action . We denote the ‘patient horizon’ as which is the number of present and future patients who will be treated with one of the treatment in . It is worth noting that need not to be known beforehand and may depend on the pattern of a emerging disease, the performance of current treatments, the emergence of new treatments, and may be infinite for recurring conditions such as knee replacements.
We adopt probabilistic modeling for the unknown probability of success. Under general assumptions, the posterior probability of class
can be written as a link function acting on a linear function of the feature vectorwith the link function often chosen as the logistic function or probit function
The main difference between the two sigmoid functions is that the logistic function has slightly heavier tails than the normal CDF. We can make this more compact by introducing the basis function
. Now let be a column vector of basis functions, and be a column vector of the coefficients. The probability of class can be rewritten asWe adopt a Bayesian view and start with a multivariate prior distribution for the unknown parameter vector . We use to denote the state of the system at time which includes the “state of knowledge” that captures our belief about the parameters and the context information . Each of the past observations are made of triplets , assuming labels are generated independently. Let denote the previous measured data set for any . Note that the notation here is slightly different from the (passive) PAC learning model where the data are i.i.d. and are denoted as . Yet in our (adaptive) sequential decision setting, measurement depends on the state , while
is a random variable that has not been observed at time
. This notation with superscript indexing time stamp is standard, for example, in control theory, stochastic optimization and optimal learning. A history of the process can be represented usingWe use Bayes’ theorem to form a sequence of posterior predictive distributions
for from the prior and the previous measurements.The goal is to find a policy that selects actions such that the cumulative reward is as large as possible over time, or equivalently, treatment on patients is as effectively as possible:
(1) 
where denotes the random variable of the patient response, denotes the treatment recommended by the dynamic treatment regime .
3 Online Bayesian logistic regression based on Laplace approximation
A Bayesian approach to linear classification models requires a prior distribution for the weight parameters , and the ability to compute the conditional posterior given the observation. Specifically, suppose we begin with an arbitrary prior and apply Bayes’ theorem to calculate the posterior:
where the normalization constant is the unknown evidence. An
regularized logistic regression can be interpreted as a Bayesian model with a Gaussian prior on the weights with standard deviation
.Unfortunately, exact Bayesian inference for linear classifiers are intractable since the evaluation of the posterior distribution comprises a product of sigmoid functions; in addition, the integral in the normalization constant is intractable as well. We can either use analytic approximations to the posterior, or solutions based on Monte Carlo sampling, foregoing a closedform expression for the posterior. Observations come one by one due to the sequential nature of our problem setting. After each new observation, retraining the Bayesian classifier using all the previous data is computationally inefficient in terms of both time and space complexity. To this end, we use the online Bayesian linear classification algorithm proposed by Wang
et al. [30] handle recursive updates with each patient response.Due to its equivalence to regularization, independent normal priors (with , where
is the identity matrix) are considered. Laplace approximation to the posterior is used to make the computation tractable. In our model, the posterior is approximated by a Gaussian approximation with diagonal covariance matrices. It can be obtained by finding the mode of the posterior distribution and then fitting a Gaussian distribution centered at that mode. Specifically, define the logarithm of the unnormalized posterior distribution
The Laplace approximation is based on a secondorder Taylor expansion to around its MAP (maximum a posteriori) solution :
(2) 
where is the Hessian of the negative log posterior evaluated at :
By exponentiating both sides of Eq. (2), we can see that the Laplace approximation results in a normal approximation to the posterior
In our setting, the use of this convenient approximation of the posterior is twofold. It first serves as a prior on the weights to update the model when a new patient response becomes available. Second, it defines the belief states in the Bayesian policies, for example, the knowledge gradient policy and Thompson sampling that we introduce later. Starting from Gaussian priors
over with meanand variance
, after the first patient responses, the Laplace approximated posterior distribution is . At the th time step, we find the MAP solution (2) to the posterior after the new information by the onedimensional bisection method [30]:(3) 
where is a compact notation for . The inverse variance of each weight is given by the curvature at the mode as:
(4) 
where .
4 Knowledge gradient policy
The knowledge gradient (KG) policy, first proposed for offline ranking and selection problems, maximizes the value of information from a decision. In ranking and section problems, the performance of each alternative is represented by a (nonparametric) lookup table model. Although originally developed for offline learning (where we do not pay attention to successes while we are learning), it is easily adapted to online learning where we seek to maximize the cumulative number of successes. After its first appearance, KG has been extended to various belief models (e.g. [31, 32, 33, 34, 30]). Experimental studies have shown good performance of the knowledge gradient policy in learning settings with expensive experiments, especially in early iterations [34, 30, 35, 33]. This is particularly well suited to personalized medicine where we want to learn as fast as possible from each patient response so as to provide better treatment on the upcoming patients. In comparison, other policies like upper confidence bounding (UCB) is known to explore more than necessary, leading to unnecessarily poor treatment performance on current patient.
4.1 Markov decision process formulation
We define the knowledge state in our setting as . In a dynamic program, the value function is defined as the value of the optimal policy given a particular state at time , and may also be determined recursively through Bellman’s equation. At time , we should simply choose the alternative that looks the best given everything we have learned, because there are no longer any future decisions that might benefit from learning. Since the goal of personalized medicine is to maximize the cumulative success of treatment, the terminal value function is given by
The value function at any other time is given recursively by Bellman’s equation for dynamic programming:
(5) 
We write Bellman’s equation in its standard form (Eq. (5)) by writing the sequence of states, actions and information using:
When we include contextual information, this sequence would be written
Here, the knowledge state might be called the postobservation state, while is the predecision state, representing the state after a patient has arrived. Instead of relating to as is classically done in Bellman’s equation, we can break the recursion into two steps: from to , and then from to , giving us the following twostep version of Bellman’s equation.
(6)  
(7) 
Bellman’s equation works well for problems with small state and action spaces, and where the transition matrix can be easily computed. But in personalized health care, the context information can only be observed (the distribution is unknown). The context information can be an arbitrary sequence which is fixed beforehand or stored in historical data, or it can be nonstochastically chosen by an adversary. The attractiveness of the postobservation state that the maximum and the expectation over context are interchanged, giving us computational advantages to use simulation based approach without probabilistic modeling of the contextual information and by treating the contextual information as arbitrarily given by the oracle.
4.2 Knowledge gradient policy with contextual information
In order to approximately solve the Bellman’s equations in the previous section, we develop the knowledge gradient policy around the values in postobservation states . The knowledge gradient of measuring an action in state is defined as the singlestep expected improvement in value if action is taken.
(8) 
where is the next stochastic state of knowledge if we choose treatment right now, allowing us to observe the stochastic patient response . This allows us to update and based on Eq. (3) and Eq. (4), transitioning to the next state of knowledge . The knowledge gradient policy then balances the treatment that appears to be the current best and the one that learns the most by choosing an action at time as [30]:
where reflects a planning horizon that captures the value of the information we have gained on future patients.
Given a knowledge state , or equivalently
, the predictive Bernoulli distribution of patient response
for a context and a treatment can be found by marginalization over ,(9) 
For notational simplicity, we drop ’s dependence on and . Denoting and the Dirac delta function, we have . Hence we have
where Since the delta function imposes a linear constraint on and is Gaussian, the marginal distribution is also Gaussian. We can evaluate by calculating the mean and variance of this distribution. We have
Thus Since the convolution of a Gaussian with a logistic function cannot be evaluated analytically, we apply the approximation with . Denoting , we have
We are now ready to compute the knowledge gradient value in Eq. (8). First, is deterministic at time . Since the patient response is not known at the time of selection, the expectation is computed over the Bernoulli distribution and the current belief model specified by . Specifically, the expectation can be obtained by averaging the two possible responses as follows:
where denotes the next belief state given outcome Y according to Eq. (3) and Eq. (4).
It can be seen that the Laplace approximation and the recursive update make the computation of the knowledge gradient tractable by analytically approximating the value in the next state and offering computational simplicity with Gaussian distributions.
5 A case study on the knee replacement
In the knee replacement patient datasets, we have 212 episodes involving 26735 records of patient information with either icd9 diagnosis, icd9 procedure or hcpcs/cpt procedure record type, where icd9 is the ninth revision of the International Classification of Diseases and the Healthcare Common Procedure Coding System (hcpcs) is a set of health care procedure codes based on the American Medical Association’s Current Procedural Terminology (cpt). To make a fair statement of the costs, all the 212 episodes are obtained from the same health care provider. The episode ID, beneficiary ID and claim ID have been replaced with randomly assigned numbers for anonymization. There are procedures that occur before the knee replacement, and procedures after the replacement that constitute rehabilitation. We see a variety of different postoperative costs of 212 episodes, ranging from to , as depicted in Fig. 1
. We adopt our success/failure model that if the cost is smaller than the 2/3 quantile of the costs (
), we think it as acceptable, otherwise we treat it as failure. The challenge is how the physicians and caregivers affect the postoperative costs.A patient can have a number of attributes, spanning personal characteristics (age, weight, gender, ethnicity, body type), their medical history, and diagnostic information. We first translate the data file into a flat file. Specifically, we create a set of columns consisting of every diagnosis (icd9 diagnosis code) and/or preoperative procedure (icd9 procedure or hcpcs/cpt procedure code) that has appeared in the dataset. For each patient, mark the value of that column as 1 if the patient has a diagnosis or uses the procedure, otherwise mark as 0. Together with the demographic characteristics (sex and age), these constitute the feature vector of a patient and thus the context to the contextual bandit problems.
Not surprisingly, there are a huge number of features associated with each patient, e.g. 979 columns of possible diagnoses and 772 columns of preoperative procedures. These features are typically sparse – compared to nearly 2000 features, the number of 1’s for each patient ranges from 8 to 110, with an average of 31. For this reason, we use regularization in our Bayesian logistic regression to handle the sparsity of the model. One way of learning the sequential physicians/caregivers assignment is by separating the 212 episodes into two sets. We can use one set to generate a prior distribution on the weight vector and use the other set for online learning of physicians/caregivers assignments. Yet if we directly use these features, the sparsity and the relative small number of patients makes learning more difficult and is computationally expensive. Besides, simplification of models can make them easier to interpret by researchers and enhance generalization by reducing overfitting. We instead find the lower dimension feature representation as explained in the next section.
5.1 Feature selection
There are several ways to look for more compact representations, such as principal component analysis (PCA), singular value decomposition (SVD) or auto encoder to perform the dimension reduction. However, the features from the nonlinear dimension reduction lose the original meanings of the health terminologies. Yet in health care analytics, interpretability of the resulting feature subspace is desired. For example, it is interesting to know whether age or malignant essential hypertension affect the cost or quality of total knee replacement.
Due to the nature of health analytics, many features tend to happen at the same time. For example, in terms of diagnoses, obesity and hypertension are linked, with obese patients having higher rates of hypertension than normalweight individuals [36]. In addition, certain tests are often run together. In order to capture this characteristic, we first cluster the diagnoses into groups based on their occurrences. We construct an undirected network to represent the relationship of different diagnoses as follows. We treat each icd9 diagnosis code as a node in the network. We measure the occurrence similarity of any diagnosis pair by their intersection angle. Specifically, each diagnosis is represented by a 212 dimensional binary vector indicating whether each patient has that diagnosis. We then set a threshold of the cosine of the intersection angle . For a diagnosis pair , if the cosine of the intersection is larger than the threshold, we draw an edge between them. It is worth noting that if we set the threshold to be 1, it is equivalent to saying that and always happen at the same time across all the patients if there is an edge between them. When we set the threshold to , the presence of an edge means that and are recorded together 80 percent of the time.
After the construction of the network, we find the clusters/groups by detecting the weakly connected components in the network. In Fig. 2, each node is labeled with its icd9 diagnosis code with the size of the node corresponding to its degree. Different colors represent different groups/cliques. The nodes with degree less than 3 are filtered. After clustering, 979 diagnoses have been grouped into 608 components. For example, the red group on the upper right consists of icd9 diagnosis code V160 (Family history of malignant neoplasm of gastrointestinal tract), V161 (Family history of malignant neoplasm of trachea, bronchus, and lung), 99527 (Other drug allergy), 6273 (Postmenopausal atrophic vaginitis), E9300 (Penicillins causing adverse effects in therapeutic use), V7262 (Laboratory examination ordered as part of a routine general medical examination).
Instead of using individual diagnosis codes as the features for the patients, we first use the clusters as the features. We further conduct feature selection by selecting a subset of relevant features for use in model construction. Specifically, we use the
regularized (LASSO) logistic regression to yield the sparse solutions of selecting a subset of relevant features using 25 regularization parameter (Lambda) values and 10fold cross validation on the patient datasets. Fig. 3 identifies the minimumdeviance point with a green circle and dashed line as a function of the regularization parameter. The blue circled point has minimum deviance plus no more than one standard deviation. We use the 31 selected features at the blue line as the set of relevant patient attributes to proceed our contextual bandit learning in the next section. Many other interesting statistical questions can be asked regarding this dataset, e.g. feature importance, best prediction model or statistical significance. Yet they are not the main focus of this work which addressed the optimal learning challenge with stochastic binary feedback.5.2 Personalized physicians and caregivers assignment
We discovered from data that not surprisingly there is a number of caregivers (with national provider identifier NPIs) performing the rehabilitation. The caregivers should be divided naturally into communities or modules. Different people doing rehab on the same patient will belong to the same facility. Since some facilities keep patients longer than other ones, in this case, what is important is the facility, not the individual caregiver. If we can detect and characterize this community structure, this should give us a set of ”facilities” (in the form of these clusters). We use the same idea as in the previous section to construct a network that represents the relationships between different caregivers. Each node is one caregiver. Since two caregivers from the same facility do not necessarily always come to the same patients, we lower the threshold of the cosine of their intersection angle to 0.5 to capture this reality.
In the previous section, when the threshold is 1, the edge represents an equivalence relation (reflexive, symmetric and transitive), making each weakly connected components a clique. Yet when lowering the threshold, it is less meaningful to treat weakly connected components as a group. We instead detect the modularity of the total 768 caregivers using spectral community detection [37]. Fig. 4 depicts the modular classes (58 in total) and different classes are drawn with different colors. Network visualization provides strong hints of connectional relationships. Nodes within each modular class have dense connections with each other while intermodularclass connections are sparse.
We now study the problems of how physicians and facilities (rather than each caregiver) affect the cost of each patient. We represent each physician/facility by an indicator variable and the Bayesian linear classification belief models can be thus presented as with and denote the chosen physician and facility, respectively,
(10) 
For each patient, one and only one (or ) can be assigned such that exactly one is 1.
We sort the patient data chronologically. For each patient visit, based on the patient attributes , we assigned a physician for the surgery and/or a facility for the rehab. We then receive a payoff of whether it is success or failure. The goal is to maximize the number of successes across all 211 patients. There is a fundamental exploration vs. exploitation tradeoff: in order to learn the success rate of each physician/facility, it needs to be tried for long time benefit, leading to a potential drop in the shortterm performance.
Evaluating an exploration/exploitation policy is difficult since we do not know the outcome for physicians and facilities that were not chosen for a particular patient in the record data. Based on the real world context and patient features in the knee replacement dataset, we instead simulate the true outcomes using a weight vector . We use our ability to simulate the true to compute the true probability of a successful outcome, and compare this to our simulated success rate based on decisions from our policy using the estimated model. Although we have modeled the choice of physician and rehabilitation facilities, these are handled in an identical way as in Eq. (10), and as a result we focused just on the choice of physician.
We set the number of available physicians as . The experimental results are reported on 500 repetitions of each algorithm. The only difference is the way each policy selects the actions; all the rest, including the model updates, is identical as described in Section 3.
Binary feedbacks (such as success/failure outcomes) are inherently noisy, which creates a problem for valueofinformation policies such as the knowledge gradient. The problem is that we learn very little from a single outcome. Now consider what happens if we decide to test, say, patients. The value of information from patients may be quite low, but the value can grow nonlinearly (specifically, in the shape of an Scurve), producing much greater value if we are willing to consider the combined information learned from, say, patients [38]. As a result of this nonconcavity in the value of information, we propose to consider the impact of posterior reshaping
for the KG policy. In particular, for normally distributed posteriors on
, decreasing the variance would have the effect of increasing exploitation over exploration. In our simulations, we have tried to reshape the covariance matrix to . This only affects the calculation of the knowledge gradient and does not change the model updates.Posterior reshaping has an effect similar to the KG policy. The KG policy mimics the effect of doing a single batch of samples, which is the same as replacing an experiment with precision with one where the precision is (higher precision means lower uncertainty). The idea behind the KG policy is to find that produces the highest average value of information, which effectively finds the tangent point of the value of information of measuring a single alternative for many times [38]
. Yet finding the tangent point is a heuristic and the KG
policy implicitly assumes that our budget is large enough to sample alternative roughly times. In comparison, posterior reshaping can also be understood as (hypothetically) repeated experiments for each alternative and the tunable parameter offers great computational efficiency and flexibility (e.g. it is not restricted to the tangent point).We compare our policy with pure exploitation (which assigns the physician that seems to be the best), pure exploration (which randomly assign a physician, as would happen if you assigned the first available physician) and Thompson sampling [39]. Thompson sampling has been successfully applied to twotreatment adaptive designs [40, 41] and other applications [42, 43]. In our personalized health settings, at each time step , given the patient information , it first draws a sample according to the posterior distribution . It then selects a treatment (that is, the physician) that maximizes the probability of success under the sample parameter value .
We report the number of cumulative success divided by the number of treated patients after each of the 212 patient visits in Fig. 5(a). We also report the distribution of the number of successes produced by each policy after the learning budget is exhausted in Fig. 5(b)
. On each box, the central red line is the median, the edges of the box are the 25th and 75th percentiles, and outliers are plotted individually.
We can see from the figure that the KG policy with
yields the best performance. Pure exploitation also does well. This seems at first a bit odd given that the system has no prior knowledge about the true parameter value
. A possible explanation is that the change in context induces some level of exploration. In terms of posterior reshaping, value of smaller in general yields larger number of successes since it is in favor of exploitation over exploration. But the price to pay is slightly higher variance as seen in Fig. 5(b). Random assignment does not perform well since it is not learning from its past experiences while other policies use their past observations to guide the next assignment. We conclude that even though the data is sparse (no two patients are alike), through careful selection of physicians we can improve success rates by around 25 percent.The real value of the knowledge gradient is its rapid learning, which is especially important in a health setting since if we can learn faster, we can benefit more patients. The knowledge gradient correctly captures the full value of information, properly balancing exploitation (doing well now) and exploration (learning to do well in the future). Thompson sampling captures the explorationexploitation tradeoff only approximately. The real appeal of Thompson sampling is the ease of computation which is useful in highfrequency internet applications. Other optimizing policies such as pure exploitation (a greedy policy based on the prior) or Bayes greedy (a greedy policy based on the posterior, similar to Thompson sampling) do not accurately capture the value of information which requires capturing the value of reducing the uncertainty in the belief.
6 Conclusion
In this paper, we consider the problem of personalized medicine which formalizes clinical decision making as a function that maps individual patient information to a recommended treatment. The learner is rewarded by “successes” and “failures” which can be predicted through an unknown relationship that depends on the patient information and the selected treatment. Each experiment is expensive, forcing us to learn the most from each experiment. The goal is to treat current patients as effectively as possible and correctly identify the better treatment as quickly as possible. We adopt a Bayesian approach both to incorporate possible prior information and to update our treatment regime continuously as information accrues, with the potential to allow smaller yet more informative trials and for patients to receive better treatment. By formulating the problem as contextual bandits, we introduce a knowledge gradient policy using Bayesian logistic regression, for which an approximation method is used to overcome computational challenges in finding the knowledge gradient. We provide a detailed study on the problem of how sequentially assignment of physicians/facilities to individual patients can reduce the health care cost. We use clustering and LASSO to deal with the intrinsic sparsity in health datasets. We show experimentally that even though the problem is sparse, through careful selection of physicians (versus picking them at random), we can significantly improve the success rates.
References
 [1] Bodenheimer T. High and rising health care costs. part 1: seeking an explanation. Annals of internal medicine 2005; 142(10):847–854.
 [2] Ginsburg PB. High and rising health care costs: Demystifying us health care spending. Princeton, NJ 2008; .
 [3] Wennberg JE, Fisher ES, Goodman DC, Skinner JS. Tracking the care of patients with severe chronic illnessthe dartmouth atlas of health care 2008 2008; .
 [4] Lehnert T, Heider D, Leicht H, Heinrich S, Corrieri S, Luppa M, RiedelHeller S, König HH. Review: health care utilization and costs of elderly persons with multiple chronic conditions. Medical Care Research and Review 2011; 68(4):387–420.
 [5] Callahan CM, Drake BG, Heck DA, Dittus RS. Patient outcomes following tricompartmental total knee replacement: a metaanalysis. Jama 1994; 271(17):1349–1357.
 [6] Buckwalter JA, Lohmander S. Operative treatment of osteoarthrosis. current practice and future development. J Bone Joint Surg Am 1994; 76(9):1405–1418.
 [7] HHS. https://innovation.cms.gov/initiatives/ccjr/index.html 2015.
 [8] Almirall D, Compton SN, GunlicksStoessel M, Duan N, Murphy SA. Designing a pilot sequential multiple assignment randomized trial for developing an adaptive treatment strategy. Statistics in medicine 2012; 31(17):1887–1902.
 [9] Murphy SA. An experimental design for the development of adaptive treatment strategies. Statistics in medicine 2005; 24(10):1455–1481.
 [10] Murphy SA, Van Der Laan M, Robins JM. Marginal mean models for dynamic regimes. Journal of the American Statistical Association 2001; 96(456):1410–1423.
 [11] Zhang B, Tsiatis AA, Laber EB, Davidian M. A robust method for estimating optimal treatment regimes. Biometrics 2012; 68(4):1010–1018.
 [12] Brinkley J. A doubly robust estimator for the attributable benefit of a treatment regime. Statistics in medicine 2014; 33(29):5057–5073.
 [13] Laber EB, Zhao YQ, Regh T, Davidian M, Tsiatis A, Stanford JB, Zeng D, Song R, Kosorok MR. Using pilot data to size a twoarm randomized trial to find a nearly optimal personalized treatment strategy. Statistics in medicine 2015; .
 [14] Qian M, Murphy SA. Performance guarantees for individualized treatment rules. Annals of statistics 2011; 39(2):1180.
 [15] Lavori PW, Dawson R. A design for testing clinical strategies: biased adaptive withinsubject randomization. Journal of the Royal Statistical Society: Series A (Statistics in Society) 2000; 163(1):29–38.
 [16] Murphy SA. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2003; 65(2):331–355.
 [17] Murphy SA, Collins LM. Customizing treatment to the patient: Adaptive treatment strategies. Drug and alcohol dependence 2007; 88(Suppl 2):S1.
 [18] Robins JM. Optimal structural nested models for optimal sequential decisions. Proceedings of the second seattle Symposium in Biostatistics, Springer, 2004; 189–326.

[19]
Robins JM. Information recovery and bias adjustment in proportional hazards regression analysis of randomized trials using surrogate markers.
Proceedings of the Biopharmaceutical Section, American Statistical Association, vol. 24, American Statistical Association, 1993; 3. 
[20]
Ashby D. Bayesian statistics in medicine: a 25 year review.
Statistics in medicine 2006; 25(21):3589–3631.  [21] Berry SM, Carlin BP, Lee JJ, Muller P. Bayesian adaptive methods for clinical trials. CRC press, 2010.
 [22] Jack Lee J, Chu CT. Bayesian clinical trials in action. Statistics in medicine 2012; 31(25):2955–2972.
 [23] Chow SC. Adaptive clinical trial design. Annual review of medicine 2014; 65:405–415.
 [24] Bather J. On the allocation of treatments in sequential medical trials. International Statistical Review/Revue Internationale de Statistique 1985; :1–13.
 [25] Yin G, Chen N, Jack Lee J. Phase ii trial design with bayesian adaptive randomization and predictive probability. Journal of the Royal Statistical Society: Series C (Applied Statistics) 2012; 61(2):219–235.
 [26] Thall PF, Simon R. Practical bayesian guidelines for phase iib clinical trials. Biometrics 1994; :337–349.
 [27] Lai TL. Adaptive treatment allocation and the multiarmed bandit problem. The Annals of Statistics 1987; :1091–1114.
 [28] Lai TL, Liao OYW. Efficient adaptive randomization and stopping rules in multiarm clinical trials for testing a new treatment. Sequential analysis 2012; 31(4):441–457.
 [29] Dudík M, Erhan D, Langford J, Li L, et al.. Doubly robust policy evaluation and optimization. Statistical Science 2014; 29(4):485–511.

[30]
Wang Y, Wang C, Powell W. The knowledge gradient for sequential decision making
with stochastic binary feedbacks.
Proceedings of The 33rd International Conference on Machine Learning
, 2016; 1138–1147.  [31] Mes MR, Powell WB, Frazier PI. Hierarchical knowledge gradient for sequential sampling. The Journal of Machine Learning Research 2011; 12:2931–2974.
 [32] Negoescu DM, Frazier PI, Powell WB. The knowledgegradient algorithm for sequencing experiments in drug discovery. INFORMS Journal on Computing 2011; 23(3):346–363.
 [33] Ryzhov IO, Powell WB, Frazier PI. The knowledge gradient algorithm for a general class of online learning problems. Operations Research 2012; 60(1):180–195.
 [34] Wang Y, Reyes KG, Brown KA, Mirkin CA, Powell WB. Nestedbatchmode learning and stochastic optimization with an application to sequential multistage testing in materials science. SIAM Journal on Scientific Computing 2015; 37(3):B361–B381.
 [35] Russo D, Van Roy B. Learning to optimize via posterior sampling. Mathematics of Operations Research 2014; .
 [36] Chiang BN, PERLMAN LV, EPSTEIN FH. Overweight and hypertension a review. Circulation 1969; 39(3):403–421.
 [37] Newman ME. Modularity and community structure in networks. Proceedings of the national academy of sciences 2006; 103(23):8577–8582.
 [38] Frazier PI, Powell WB. Paradoxes in learning and the marginal value of information. Decision Analysis 2010; 7(4):378–403.
 [39] Thompson WR. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 1933; :285–294.
 [40] Berry DA, Eick SG. Adaptive assignment versus balanced randomization in clinical trials: a decision analysis. Statistics in medicine 1995; 14(3):231–246.
 [41] Hu F, Rosenberger WF. The theory of responseadaptive randomization in clinical trials, vol. 525. John Wiley & Sons, 2006.
 [42] Graepel T, Candela JQ, Borchert T, Herbrich R. Webscale bayesian clickthrough rate prediction for sponsored search advertising in microsoft’s bing search engine. Proceedings of the 27th International Conference on Machine Learning (ICML10), 2010; 13–20.
 [43] Agrawal S, Goyal N. Thompson sampling for contextual bandits with linear payoffs. arXiv preprint arXiv:1209.3352 2012; .
Comments
There are no comments yet.