1. Introduction
Treatment recommendation has been studied for a long history. Specially, medication recommendation systems have been verified to support doctors in making better clinical decisions. Early treatment recommendation systems match diseases with medications via classification based on expert systems (Zhuo et al., 2016; Gunlicksstoessel et al., 2017; Almirall et al., 2012). But it heavily relies on knowledge from doctors, and is difficult to achieve personalized medicine. With the availability of electronic health records (EHRs) in recent years, there are enormous interests to exploit personalized healthcare data to optimize clinical decision making. Thus the research on treatment recommendation shifts from knowledgedriven into datadriven.
The data-driven research on treatment recommendation involves two main branches: supervised learning (SL) and reinforcement learning (RL) for prescription. SL-based prescription tries to minimize the difference between the recommended prescriptions and the indicator signal, which denotes doctors' prescriptions. Several pattern-based methods generate recommendations by utilizing the similarity of patients (Zhang et al., 2014; Sun et al., 2016; Hu et al., 2016), but they struggle to directly learn the relation between patients and medications. Recently, some deep models have achieved significant improvements by learning a nonlinear mapping from multiple diseases to multiple drug categories (Bajor and Lasko, 2017; Zhang et al., 2017; Wang et al., 2018). Unfortunately, a key concern for these SL-based models remains unresolved, i.e., the ground truth of a "good" treatment strategy is unclear in the medical literature (Marik, 2015). More importantly, the original goal of clinical decision making also considers the outcome of patients instead of only matching the indicator signal.
The above issues can be addressed by reinforcement learning for dynamic treatment regime (DTR) (Robins, 1986; Murphy, 2003). A DTR is a sequence of tailored treatments according to the dynamic states of patients, which conforms to clinical practice. As a real example shown in Figure 1, treatments for the patient vary dynamically over time with the accruing observations. The optimal DTR is determined by maximizing the evaluation signal, which indicates the long-term outcome of patients, due to the delayed effect of the current treatment and the influence of future treatment choices (Chakraborty and Moodie, 2013). With the desired properties of dealing with delayed reward and inferring an optimal policy based on non-optimal prescription behaviors, a set of reinforcement learning methods have been adapted to generate optimal DTRs for life-threatening diseases, such as schizophrenia, non-small cell lung cancer, and sepsis (Shortreed and Moodie, 2012; Zhao et al., 2011; Nemati et al., 2016). Recently, some studies employ deep RL to solve the DTR problem based on large-scale EHRs (Weng et al., 2017; Prasad et al., 2017; Raghu et al., 2017). Nevertheless, these methods may recommend treatments that are obviously different from doctors' prescriptions due to the lack of supervision from doctors, which may cause high risk (Mihatsch and Neuneier, 2002) in clinical practice. In addition, the existing methods struggle to analyze multiple diseases and the complex medication space.
In fact, the evaluation signal and indicator signal play complementary roles (Barto, 2002; Clouse and Utgoff, 1992), where the indicator signal provides basic effectiveness and the evaluation signal helps optimize the policy. Imitation learning (Abbeel and Ng, 2004; Ratliff et al., 2006; Ziebart et al., 2008; Levine et al., 2011; Finn et al., 2016) utilizes the indicator signal to estimate a reward function for training robots by assuming the indicator signal is optimal, which is not in line with the clinical reality. Supervised actor-critic (Clouse and Utgoff, 1992; Benbrahim and Franklin, 1997; Barto, 2004) uses the indicator signal to pre-train a "guardian" and then combines the "actor" output and "guardian" output to send low-risk actions to robots. However, the two types of signals are trained separately and cannot learn from each other. Inspired by these studies, we propose a novel deep architecture to generate recommendations for more general DTR involving multiple diseases and medications, called Supervised Reinforcement Learning with Recurrent Neural Network (SRL-RNN). The main novelty of SRL-RNN is to combine the evaluation signal and indicator signal at the same time to learn an integrated policy. More specifically, SRL-RNN consists of an off-policy actor-critic framework to learn complex relations among medications, diseases, and individual characteristics. The "actor" in the framework is not only influenced by the evaluation signal like traditional RL but also adjusted by the doctors' behaviors to ensure safe actions. RNN is further adopted to capture the dependence of the longitudinal and temporal records of patients for the POMDP problem. Note that treatment and prescription are used interchangeably in this paper. Our contributions can be summarized as follows:

We propose a new deep architecture SRL-RNN for handling a more general DTR setting involving multiple diseases and medications. It learns the prescription policy by combining both the indicator signal and evaluation signal to avoid unacceptable risks and infer the optimal dynamic treatment.

SRL-RNN applies an off-policy actor-critic framework to handle complex relations among multiple medications, diseases, and individual characteristics. The "actor" is adjusted by both the indicator signal and evaluation signal, and RNN is further utilized to solve POMDP (see Section 4.4).

Quantitative experiments and qualitative case studies on MIMIC-3 demonstrate that our method can not only reduce the estimated in-hospital mortality (see Section 5.2) by 4.4%, but also provide better medication recommendations.
The rest of this paper is organized as follows. We summarize the related work in Section 2 and provide the necessary background knowledge in Section 3. Our model is specified in Section 4. Experimental results are presented in Section 5. Finally, we conclude the paper in Section 6.
2. Related Work
Early treatment recommendation systems heuristically map diseases to medications based on expert systems (Zhuo et al., 2016; Gunlicks-Stoessel et al., 2017; Almirall et al., 2012). Due to the difficulty of knowledge acquisition, research has moved to data-driven approaches with two branches: supervised learning and reinforcement learning. In this section, we overview the related studies on data-driven treatment recommendation, and the methodologies of combining supervised learning and reinforcement learning.

Supervised learning for prescription focuses on minimizing the difference between recommended prescriptions and doctors' prescriptions. Both Cheerla et al. (Cheerla and Gevaert, 2017) and Rosen-Zvi et al. (Rosen-Zvi et al., 2008) proposed to utilize genomic information for recommending suitable treatments for patients with different diseases. However, genomic information is neither widely available nor easy to acquire. In order to leverage massive EHRs to improve treatment recommendation, several pattern-based methods generate treatments by the similarity among patients (Zhang et al., 2014; Sun et al., 2016; Hu et al., 2016). Nevertheless, these methods struggle to directly learn the relationship between patients and medications. Furthermore, it is challenging to calculate the similarities between patients' complex longitudinal records. Recently, two deep models have been proposed to learn a nonlinear mapping from diseases to drug categories based on EHRs, achieving significant improvements. Bajor et al. (Bajor and Lasko, 2017) adopted a GRU model to predict the total medication categories given the historical diagnosis records of patients. Zhang et al. (Zhang et al., 2017) proposed a deep model to not only learn the relations between multiple diseases and multiple medication categories, but also capture the dependence among medication categories. In addition, Wang et al. (Wang et al., 2018) utilized a trilinear model to integrate multi-source patient-specific information for personalized medicine.
A major concern for these SL-based prescriptions is that the behaviors of doctors are prone to be imperfect. Due to the knowledge gaps and limited experience of doctors, the ground truth of a "good" treatment strategy is unclear in the medical literature (Marik, 2015). To address this issue, we prefer RL, which is well-suited to infer optimal policies based on non-optimal prescriptions.
Reinforcement learning for prescription gives treatments by maximizing the cumulative reward, where the reward can be assessment scores of disease or survival rates of patients. Shortreed et al. (Susan M. Shortreed, 2011) employed tabular Q-learning to recommend medications for schizophrenia patients on real clinical data. Zhao et al. (Zhao et al., 2011) applied fitted Q-learning to discover optimal individualized medications for non-small cell lung cancer (NSCLC) based on simulation data, where a support vector regression (SVR) model is used to estimate the Q-function. Nemati et al. (Nemati et al., 2016) leveraged a combination of Hidden Markov Models and deep Q-networks to predict optimal heparin dosing for patients in the ICU under a POMDP environment. Applying these approaches to clinical practice is challenging because they use relatively small amounts of data. Most recently, based on large-scale available EHRs, Weng et al. (Weng et al., 2017) combined a sparse autoencoder and policy iteration to predict personalized optimal glycemic trajectories for severely ill septic patients, reducing mortality by 6.3%. Prasad et al. (Prasad et al., 2017) used Q-learning to predict personalized sedation dosage and ventilator support. Raghu et al. (Raghu et al., 2017) employed dueling double deep Q-learning with continuous state spaces to recommend optimal dosages of intravenous fluid and maximum vasopressor for sepsis. However, without knowledgeable supervisors, such a system may recommend treatments that are significantly different from doctors', which may cause unacceptable risks (Mihatsch and Neuneier, 2002). Besides, these value-based methods are hard to apply to multiple diseases and a complex medication space.
Methods combining SL and RL utilize expert behaviors to accelerate reinforcement learning and avoid risky actions. Common examples are imitation learning (Abbeel and Ng, 2004; Ratliff et al., 2006; Ziebart et al., 2008; Levine et al., 2011; Finn et al., 2016) and supervised actor-critic (Clouse and Utgoff, 1992; Benbrahim and Franklin, 1997; Barto, 2004). Given samples of expert trajectories, imitation learning requires the skill of estimating a reward function under which the expert trajectories enjoy the highest rewards. Recently, imitation learning has been combined with deep learning to produce successful applications in robotics (Finn et al., 2016; Levine et al., 2016). However, these methods assume expert behaviors are optimal, which is not in line with clinical reality. Besides, most of them learn the policy based on the estimated reward function instead of directly telling learners how to act. Supervised actor-critic uses the indicator signal to pre-train a "guardian", and sends a low-risk action to robots via the weighted sum of the "actor" output and "guardian" output, but the two signals cannot learn from each other during training. Since this type of model requires many more expert behaviors, it has limited applications. In this paper, we focus on combining RL and SL by utilizing the large amount of doctors' prescription behaviors. Our proposed model SRL-RNN is novel in that: (1) we train the "actor" with the indicator signal and evaluation signal jointly instead of learning the "guardian" and "actor" separately; (2) SRL-RNN combines an off-policy RL model and a classification-based SL model, while supervised actor-critic combines an on-policy RL model and a regression-based SL model; and (3) SRL-RNN captures the dependence of the longitudinal and temporal records of patients to solve the problem of POMDP.

3. Background
In this section, we give a definition of the Dynamic Treatment Regime (DTR) problem and an overview of preliminaries for our model. Some important notations mentioned in this paper are summarized in Table 1.
3.1. Problem Formulation
In this paper, DTR is modeled as a Markov decision process (MDP) with finite time steps and a deterministic policy, consisting of an action space A, a state space S, and a reward function r: S × A → R. At each time step t, a doctor observes the current state s_t of a patient, chooses the medication a_t from the candidate set based on an unknown policy π_d, and receives a reward r_t. Given the current observations of diseases, demographics, lab values, vital signs, and output events of the patient, which indicate the state s_t, our goal is to learn a policy μ_θ to select an action (medication) a_t by maximizing the sum of discounted rewards (return) from time step t, defined as R_t = Σ_{t'=t}^{T} γ^{t'−t} r_{t'}, while simultaneously minimizing the difference from the clinician decisions.
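As a concrete illustration of the return defined above, it can be computed directly from a finite reward sequence (the function name and the reward values below are illustrative, not from the paper):

```python
def discounted_return(rewards, t, gamma=0.99):
    """Return R_t = sum_{t'>=t} gamma^(t'-t) * r_{t'} over a finite
    trajectory of rewards, starting from time step t (0-indexed)."""
    return sum(gamma ** (i - t) * r for i, r in enumerate(rewards) if i >= t)
```

For example, with rewards [1, 0, 2] and gamma = 0.5, the return from step 0 is 1 + 0 + 0.25 * 2 = 1.5.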
There are two types of methods to learn the policy: value-based RL, which learns a greedy policy μ, and policy gradient RL, which maintains a parameterized stochastic policy π_θ or a deterministic policy μ_θ (see Section 3.2 for details), where θ is the parameter of the policy.
Notation  Description
K  the number of medications or medication categories
s_t, a_t, r_t  state, action (medication), and reward at time step t
γ  discount factor of the reward
μ_θ  deterministic policy learned from policy gradient
π_θ  stochastic policy learned from policy gradient
μ  a greedy policy learned from Q-learning
π_d  unknown policy of a doctor
o_t, c_t  observation and summarized historical observations at time step t
ε  weight trading off the RL and SL tasks
Q_w  estimated Q function
3.2. Model Preliminaries
Q-learning (Watkins and Dayan, 1992) is an off-policy learning scheme that finds a greedy policy μ(s) = argmax_a Q(s, a), where Q(s, a) denotes the action-value (Q value) and is used in a small discrete action space. For a deterministic policy, the Q value can be calculated with dynamic programming as follows:

(1)  Q(s_t, a_t) = E_{s_{t+1}~E} [ r_t + γ Q(s_{t+1}, μ(s_{t+1})) ]

where E indicates the environment. Deep Q network (DQN) (Mnih et al., 2015) utilizes deep learning to estimate a nonlinear Q function Q(s, a; θ^q) parameterized by θ^q. A replay buffer is adopted to obtain independent and identically distributed samples for training. Moreover, DQN asynchronously updates a target network Q(s, a; θ^{q'}) to minimize the least-squares loss as follows:

(2)  L(θ^q) = E [ ( r_t + γ max_{a'} Q(s_{t+1}, a'; θ^{q'}) − Q(s_t, a_t; θ^q) )² ]
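As an illustrative sketch of the DQN loss in Equation 2, with hypothetical callables q and q_target standing in for the online and target networks (terminal-state handling omitted for brevity):

```python
def dqn_loss(batch, q, q_target, n_actions, gamma=0.99):
    """Mean squared TD error over a batch of (s, a, r, s_next) tuples.
    q(s, a) and q_target(s, a) are callables returning scalar Q values;
    actions are indexed 0..n_actions-1."""
    loss = 0.0
    for s, a, r, s_next in batch:
        # Bootstrapped target from the (periodically updated) target network.
        y = r + gamma * max(q_target(s_next, a2) for a2 in range(n_actions))
        loss += (y - q(s, a)) ** 2
    return loss / len(batch)
```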
Policy gradient is employed to handle continuous or high-dimensional actions. To estimate the parameter θ of π_θ, we maximize the expected return from the start states, reformulated as J(θ) = E_{s~ρ^π} [ V^π(s) ], where V^π(s) is the state value of the start state and ρ^π(s') = ∫_S Σ_{t=1}^{∞} γ^{t−1} p_1(s) p(s → s', t, π) ds is the discounted state distribution, with p_1(s) the initial state distribution and p(s → s', t, π) the probability of being at state s' after transitioning for t time steps from state s. Policy gradient learns the parameter θ by the gradient ∇_θ J(θ), which is calculated using the policy gradient theorem (Sutton et al., 2000):

(3)  ∇_θ J(θ) = E_{s~ρ^π, a~π_θ} [ ∇_θ log π_θ(a|s) Q^π(s, a) ]

where the instantaneous reward r_t is replaced by the long-term value Q^π(s, a).
Actor-critic (Konda and Tsitsiklis, 2000) combines the advantages of Q-learning and policy gradient to achieve accelerated and stable learning. It consists of two components: (1) an actor to optimize the policy π_θ in the direction of the gradient using Equation 3, and (2) a critic to estimate an action-value function Q_w(s, a) with parameter w through Equation 2. Finally, we obtain the policy gradient ∇_θ J(θ) = E_{s~ρ^π, a~π_θ} [ ∇_θ log π_θ(a|s) Q_w(s, a) ].
In an off-policy setting, actor-critic estimates the value function of π_θ by averaging over the state distribution of a behavior policy β (Degris et al., 2012). Instead of considering the stochastic policy π_θ, the deterministic policy gradient (DPG) theorem (Silver et al., 2014) proves that the policy gradient framework can be extended to find a deterministic off-policy μ_θ, whose gradient is given as follows:

(4)  ∇_θ J(θ) = E_{s~ρ^β} [ ∇_θ μ_θ(s) ∇_a Q_w(s, a)|_{a=μ_θ(s)} ]
Deep deterministic policy gradient (DDPG) (Lillicrap et al., 2015) adopts deep learning to learn the actor and critic in mini-batches drawn from a replay buffer of tuples (s_t, a_t, r_t, s_{t+1}). To ensure the stability of Q values, DDPG uses an idea similar to the target network of DQN, copying the actor and critic networks as μ'(s; θ^{μ'}) and Q'(s, a; θ^{q'}). Instead of directly copying the weights, DDPG uses a "soft" target update:

(5)  θ' ← τ θ + (1 − τ) θ',  with τ ≪ 1
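The soft target update in Equation 5 can be sketched as follows, with flat parameter lists standing in for network weights (names are illustrative):

```python
def soft_update(target_params, source_params, tau=0.001):
    """Polyak averaging: theta' <- tau * theta + (1 - tau) * theta',
    applied elementwise to the target network's parameters."""
    return [tau * s + (1 - tau) * t
            for t, s in zip(target_params, source_params)]
```

With a small tau, the target network tracks the learned network slowly, which is the stabilizing design choice DDPG relies on.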
4. SRL-RNN Architecture
This section begins with a brief overview of our approach. After that, we introduce the components of SRL-RNN and the learning algorithm in detail.
4.1. Overview
The goal of our task is to learn a policy to recommend tailored treatments given the dynamic states of patients. SL learns the policy by matching the indicator signals, which guarantees a standard and safe performance. But the "good" treatment strategy is unclear, and the original goal of clinical decisions also tries to optimize the outcome of patients. RL learns the policy by maximizing evaluation signals in a sequential dynamic system, which reflects the clinical facts and can infer the optimal policy based on non-optimal strategies. But without supervision, RL may produce unacceptable medications that carry high risk. Intuitively, the indicator signal and evaluation signal play complementary roles (demonstrated in Section 5.4). Thus, SRL-RNN is proposed to combine these two signals, where the cumulative reward is the evaluation signal and the prescription of doctors from the unknown policy π_d is the indicator signal.
Figure 2 shows our proposed supervised reinforcement learning architecture (SRL-RNN), which consists of three core networks: Actor, Critic, and LSTM. The actor network recommends time-varying medications according to the dynamic states of patients, where a supervisor of doctors' decisions provides the indicator signal to ensure safe actions and leverages the knowledge of doctors to accelerate the learning process. The critic network estimates the action value associated with the actor network to encourage or discourage the recommended treatments. Since fully observed states are unavailable in the real world, LSTM is used to extend SRL-RNN to handle POMDP by summarizing the entire history of observations into a more complete observation.
4.2. Actor Network Update
The actor network learns a policy μ_θ parameterized by θ to predict the time-varying treatments for patients, where the input is the state s_t and the output is the prescription a_t recommended by μ_θ. We employ reinforcement learning and supervised learning to optimize the parameter θ jointly. By combining the two learning tasks, we maximize the following objective function:

(6)  J(θ) = (1 − ε) J_RL(θ) − ε J_SL(θ)

where J_RL(θ) is the objective function of the RL task (Equation 8), which tries to maximize the expected return, J_SL(θ) is the objective function of the SL task (Equation 10), which tries to minimize the difference from doctors' prescriptions, and ε is a weight parameter trading off the reinforcement learning and supervised learning tasks. Intuitively, our objective function aims to predict medications that give both high expected returns and low errors. Mathematically, the parameter θ of the learned policy is updated by gradient ascent as follows:

(7)  θ ← θ + η_θ ( (1 − ε) ∇_θ J_RL(θ) − ε ∇_θ J_SL(θ) )

where η_θ is a positive learning rate, and ∇_θ J_RL(θ) and ∇_θ J_SL(θ) are acquired from the RL and SL tasks, respectively.
For the reinforcement learning task, we seek to learn the policy μ_θ by maximizing the state value of μ_θ averaged over the state distribution of the behaviors of doctors:

(8)  J_RL(θ) = E_{s~ρ^{π_d}} [ Q_w(s, μ_θ(s)) ]

Let the parameter θ in the RL task be the weights of a neural network. θ is updated by gradient ascent, θ ← θ + η_θ ∇_θ J_RL(θ), where ∇_θ J_RL(θ) is calculated by the deterministic off-policy gradient using Equation 4:

(9)  ∇_θ J_RL(θ) = E_{s~ρ^{π_d}} [ ∇_θ μ_θ(s) ∇_a Q_w(s, a)|_{a=μ_θ(s)} ]
Let a_t = μ_θ(s_t) be the treatment recommended by μ_θ. The gradient ∇_a Q_w(s, a)|_{a=μ_θ(s)}, obtained by the chain rule, tells whether the medications predicted by the actor are "good" or "bad". When this gradient is positive, the policy will be pushed closer to a_t; otherwise, the policy will be pushed away from a_t. ∇_θ μ_θ(s) is a Jacobian matrix where each column is the gradient of the k-th medication of μ_θ(s) with respect to θ.

For the supervised learning task, we try to learn the policy μ_θ by minimizing the difference between the treatments predicted by μ_θ and the prescriptions given by the doctor's policy π_d, using the cross-entropy loss:
(10)  J_SL(θ) = −(1/T) Σ_{t=1}^{T} Σ_{k=1}^{K} ( y_t^k log a_t^k + (1 − y_t^k) log(1 − a_t^k) )

where K indicates the number of medications or medication categories, y_t^k denotes whether the doctor chooses the k-th medication at time step t, and a_t^k is the probability of the k-th medication predicted by μ_θ. The parameter θ in the SL task is updated by gradient descent, θ ← θ − η_θ ∇_θ J_SL(θ), where ∇_θ J_SL(θ) is derived by the chain rule as follows:

(11)  ∇_θ J_SL(θ) = (1/T) Σ_{t=1}^{T} ∇_θ μ_θ(s_t) ∇_a l_t(a)|_{a=μ_θ(s_t)}

where l_t(a) = −Σ_{k=1}^{K} ( y_t^k log a^k + (1 − y_t^k) log(1 − a^k) ) is the per-step loss. Substituting Equation 9 and Equation 11 into Equation 7 gives the final actor update:

(12)  θ ← θ + η_θ ( (1 − ε) E_{s~ρ^{π_d}} [ ∇_θ μ_θ(s) ∇_a Q_w(s, a)|_{a=μ_θ(s)} ] − ε ∇_θ J_SL(θ) )

The above formulas show that we can use ε to trade off the RL and SL tasks, where ε is a hyper-parameter in this paper; the effect of ε is shown in Figure 6.
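A minimal sketch of the combined actor update of Equation 12, assuming the RL gradient and the SL loss gradient have already been computed elementwise; all names are illustrative and gradients are represented as flat lists:

```python
def combined_update(theta, grad_rl, grad_sl_loss, eta=0.01, eps=0.5):
    """One gradient-ascent step on J = (1 - eps) * J_RL - eps * J_SL:
    ascend the RL objective, descend the SL (cross-entropy) loss.
    eps trades off the two tasks."""
    return [w + eta * ((1 - eps) * g_rl - eps * g_sl)
            for w, g_rl, g_sl in zip(theta, grad_rl, grad_sl_loss)]
```

With eps = 0 the update reduces to pure off-policy RL; with eps = 1 it reduces to pure supervised imitation of the doctors' prescriptions.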
4.3. Critic Network Update
The critic network is jointly learned with the actor network, where the inputs are the states of patients, doctor prescriptions, actor outputs, and rewards. We can do this because the critic network is only needed for guiding the actor during training, while only the actor network is required at test time. The critic network uses a neural network to learn the action-value function Q_w(s, a), which is used to update the parameters of the actor in the direction of performance improvement ∇_a Q_w(s, a). The output of the critic network is the Q value Q_w(s_t, a_t) of state s_t after taking action a_t.
4.4. Recurrent SRL
In the previous section, we assumed the state of a patient is fully observed. In fact, we are often unable to obtain the full state of the patient. Here we reformulate the environment of SRL-RNN as a POMDP, described as a 4-tuple (S, A, O, Z), where O is a set of observations and Z an observation function: we obtain the observation o_t directly, which conditions on Z(o_t | s_t), with the not fully observable state s_t. LSTM has been verified to improve performance on POMDP by summarizing entire historical observations when using policy gradient (Wierstra et al., 2007) and Q-learning (Hausknecht and Stone, 2015). In LSTM, the hidden units c_t encapsulate the entire observation history c_t = f(o_1, o_2, …, o_t). In order to capture the dependence of the longitudinal and temporal records of the patient, we employ an LSTM with SRL to represent the historical observations for a more complete observation.
The updates of the parameters of the actor network and critic network are modified by replacing the state s_t with the summarized history c_t as follows:

(15)  θ ← θ + η_θ ( (1 − ε) ∇_θ μ_θ(c_t) ∇_a Q_w(c_t, a)|_{a=μ_θ(c_t)} − ε ∇_θ J_SL(θ) )

(16)  w ← w − η_w ∇_w ( r_t + γ Q_{w'}(c_{t+1}, μ_{θ'}(c_{t+1})) − Q_w(c_t, a_t) )²
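The role of the recurrent summarizer can be illustrated with a toy scalar recurrence standing in for the LSTM; this is a simplification for intuition, not the paper's network:

```python
import math

def summarize_history(observations, w_h=0.5, w_o=1.0):
    """Toy recurrent summarizer standing in for the LSTM: the hidden
    state c_t aggregates the entire observation history o_1..o_t via
    c_t = tanh(w_h * c_{t-1} + w_o * o_t). Weights are illustrative."""
    c = 0.0
    for o in observations:
        c = math.tanh(w_h * c + w_o * o)
    return c
```

The actor and critic then condition on the summary c_t rather than on the raw, partially observed o_t.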
4.5. Algorithm
Putting all the aforementioned components together, the learning algorithm of SRL-RNN is provided below.
5. Experiments
In this section, we present the quantitative and qualitative experimental results on MIMIC-3.
5.1. Dataset and Cohort
The experiments are conducted on a large and publicly available dataset, namely the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC-3 v1.4) database (Johnson et al., 2016). MIMIC-3 encompasses a very large population of patients compared to other public EHRs. It contains hospital admissions of 43K patients in critical care units between 2001 and 2012, involving 6,695 distinct diseases and 4,127 drugs. To ensure statistical significance, we extract the top 1,000 medications and top 2,000 diseases (represented by ICD-9 codes), which cover 85.4% of all medication records and 95.3% of all diagnosis records, respectively. In order to experiment on different granularities of medications, we map the 1,000 medications into the third level of ATC (medication codes, http://www.whocc.no/atc/structure_and_principles/) using a public tool (https://www.nlm.nih.gov/research/umls/rxnorm/), resulting in 180 distinct ATC codes. Therefore, the action space size of the experiments is 1,000 exact medications or 180 drug categories.
For each patient, we extract relevant physiological parameters with the suggestions of clinicians, including static variables and time-series variables. The static variables cover eight kinds of demographics: gender, age, weight, height, religion, language, marital status, and ethnicity. The time-series variables contain lab values, vital signs, and output events, such as diastolic blood pressure, fraction of inspired O2, Glasgow coma scale, blood glucose, systolic blood pressure, heart rate, pH, respiratory rate, blood oxygen saturation, body temperature, and urine output. These features correspond to the state in MDP or the observation in POMDP. We impute missing variables with k-nearest neighbors and remove admissions with more than 10 missing variables. Each hospital admission of a patient is regarded as a treatment plan. Time-series data in each treatment plan are divided into units of 24 hours, since this is the median of the prescription frequency in MIMIC-3. If several data points fall in one unit, we use their average values instead.
Following (Weng et al., 2017), we remove patients younger than 18 years old because of the special conditions of minors. Finally, we obtain 22,865 hospital admissions, and randomly divide the dataset into training, validation, and testing sets with the proportion 80/10/10.

5.2. Evaluation Metrics
Evaluation methodology in treatment recommendation is still a challenge. Thus we apply all the evaluation metrics used in state-of-the-art methods to judge our model.
Following (Weng et al., 2017; Raghu et al., 2017), we use the estimated in-hospital mortality rates to measure whether policies would eventually reduce patient mortality. Specifically, we discretize the learned Q-values of each test example into different units with small intervals, shown on the x-axis of Figure 3. Given an example denoting an admission of a patient, if the patient died in hospital, all the Q-values belonging to this admission are associated with a value of 1 for mortality, and the corresponding units add up these values. After scanning all test examples, the average estimated mortality rate for each unit is calculated, shown on the y-axis of Figure 3. Based on these results, the mortality rates corresponding to the expected Q-values of different policies are used as the measurement of the estimated in-hospital mortality (see details in (Weng et al., 2017; Raghu et al., 2017)). Although the estimated mortality does not equal the mortality in a real clinical setting, it is currently a universal metric for computational testing.
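A minimal sketch of this estimate, assuming per-step Q-values and the corresponding per-admission death indicators have been collected (all names are illustrative):

```python
def estimated_mortality_by_q(q_values, died_flags, interval=1.0):
    """Discretize per-step Q-values into units of width `interval` and
    average the in-hospital death indicator (1 = died) within each unit."""
    units = {}
    for q, died in zip(q_values, died_flags):
        units.setdefault(int(q // interval), []).append(died)
    # Average estimated mortality rate per Q-value unit.
    return {k: sum(v) / len(v) for k, v in sorted(units.items())}
```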
Inspired by (Zhang et al., 2017), we further utilize the mean Jaccard coefficient to measure the degree of consistency between prescriptions generated by different methods and those from doctors. For patient i in the t-th day of the ICU, let D_t^i be the medication set given by doctors and M_t^i be the medication set recommended by the learned policies. The mean Jaccard is defined as (1/N) Σ_{i=1}^{N} (1/T_i) Σ_{t=1}^{T_i} |D_t^i ∩ M_t^i| / |D_t^i ∪ M_t^i|, where N is the number of patients and T_i is the number of ICU days for patient i.
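The per-patient part of this metric can be computed as follows (a sketch; the set inputs are illustrative):

```python
def mean_jaccard(doctor_sets, model_sets):
    """Mean Jaccard over the per-day medication sets of one patient:
    (1/T) * sum_t |D_t & M_t| / |D_t | M_t|."""
    scores = [len(d & m) / len(d | m)
              for d, m in zip(doctor_sets, model_sets)]
    return sum(scores) / len(scores)
```

Averaging this quantity over all N patients gives the mean Jaccard reported in the tables.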
5.3. Comparison Methods
All the baseline models adopted in the experiments are as follows, where BL, RL, and SD3Q are alternatives to the SRL-RNN we propose.


Popularity-20 (POP-20): POP-20 is a pattern-based method, which chooses the top-K most co-occurring medications with the target diseases as prescriptions. We set K = 20 for its best performance on the validation dataset.

Basic-LSTM (BL): BL uses LSTM to recommend sequential medications based on the longitudinal and temporal records of patients. Inspired by Doctor AI (Choi et al., 2016), BL fuses multiple sources of patient-specific information and considers each admission of a patient as a sequential treatment to satisfy the DTR setting. BL consists of a 1-layer MLP (M1) to model diseases, a 1-layer MLP (M2) to model static variables, and a 1-layer LSTM sequential model (L1) to capture the time-series variables. These outputs are finally concatenated to predict prescriptions at each time step.

Reward-LSTM (RL): RL has the same framework as BL, except that it considers another signal, i.e., the feedback of mortality, to learn a nontrivial policy. The model involves three steps: (1) clustering the continuous states into discrete states, (2) learning the Q-values using tabular Q-learning, and (3) training the model by simultaneously mimicking the medications generated by doctors and maximizing the cumulative reward of the policy.

Dueling Double-Deep Q-learning (D3Q) (Raghu et al., 2017): D3Q is a reinforcement learning method which combines dueling Q, double Q, and deep Q together. D3Q regards a treatment plan as DTR.

Supervised Dueling Double-Deep Q (SD3Q): Instead of separately learning the Q-values and policy as RL does, SD3Q learns them jointly. SD3Q involves a D3Q architecture, where supervised learning is additionally adopted to revise the value function.

Supervised Actor-Critic (SAC) (Clouse and Utgoff, 1992): SAC uses the indicator signal to pre-train a "guardian" and then combines the "actor" output and "guardian" output to send low-risk actions to robots. We transform it into a deep model for a fair comparison.

LEAP (Zhang et al., 2017): LEAP leverages an MLP framework to train a multi-label model with consideration of the dependence among medications. LEAP takes multiple diseases as input and multiple medications as output. Instead of considering each admission as a sequential treatment process, LEAP regards each admission as a static treatment setting. We aggregate the multiple prescriptions recommended by SRL-RNN as SRL-RNN (agg) for a fair comparison.

LG (Bajor and Lasko, 2017): LG takes diseases as input and adopts a 3-layer GRU model to predict multiple medications.
Estimated Mortality  Jaccard  
Granularity  l3 ATC  Medications  l3 ATC  Medications 
Dynamic treatment setting  
LG  0.226  0.235  0.436  0.356 
BL  0.217  0.221  0.512  0.376 
RL  0.209  0.213  0.524  0.378 
D3Q  0.203  0.212  0.109  0.064 
SD3Q  0.198  0.201  0.206  0.143 
SAC  0.196  0.202  0.517  0.363 
SRL-RNN  0.153  0.157  0.563  0.409 
Static treatment setting  
POP-20  0.233  0.247  0.382  0.263 
LEAP  0.224  0.229  0.495  0.364 
SRL-RNN (agg)  0.153  0.157  0.579  0.426 
5.4. Result Analysis
Model comparisons. Table 2 depicts the mortality rates and Jaccard scores for all the adopted models on MIMIC-3. Comparing the results of LG, BL, and RL first, we find that RL outperforms both BL and LG, showing that incorporating the evaluation signal into supervised learning can indeed improve the results. We then compare SD3Q with its simplified version D3Q. The better performance of SD3Q indicates that knowledgeable supervision guarantees a standard performance of the learned policy. In the static treatment setting, LEAP improves on the basic method POP-20 by a large margin, which demonstrates that capturing the relations between multiple diseases and multiple medications is beneficial for better prescriptions.
Finally, our proposed model SRL-RNN performs significantly better than all the adopted baselines, in both the dynamic and the static treatment settings. The reasons are: (1) SRL-RNN regards treatment recommendation as a sequential decision process, reflecting clinical practice (compared with LEAP and POP-20), and utilizes the evaluation signal to infer an optimal treatment (compared with LG and BL); (2) SRL-RNN considers the prescriptions of doctors as supervision information to learn a robust policy (compared with D3Q), and applies the off-policy actor-critic framework to handle the complex relations of medications, diseases, and individual characteristics (compared with SD3Q); (3) SRL-RNN integrates supervised learning and reinforcement learning in an end-to-end fashion to share information between the evaluation signal and the indicator signal (compared with RL and SAC); and (4) SRL-RNN adopts RNN to solve POMDP by obtaining representations of the entire historical observations.
Ablation study. The contributions of the three types of features are reported in this part. Specifically, we progressively add the patient-specific information, i.e., time-series variables, diseases, and demographics, into the selected models. As shown in Table 3, the Jaccard scores of the three methods monotonically increase. In addition, the estimated mortality rates of SRL-RNN monotonically decrease. However, the estimated mortality variations of BL and RL differ from those of SRL-RNN. This might be due to the fact that, with constant Q-values learned by tabular Q-learning, learning a robust policy is somewhat hard.
Table 3: Ablation results with different feature sets (demo = demographics).

Method   | Feature              | Estimated Mortality | Jaccard
BL       | all − demo − disease | 0.242               | 0.323
BL       | all − demo           | 0.212               | 0.360
BL       | all                  | 0.221               | 0.376
RL       | all − demo − disease | 0.184               | 0.332
RL       | all − demo           | 0.203               | 0.371
RL       | all                  | 0.213               | 0.378
SRL-RNN  | all − demo − disease | 0.173               | 0.362
SRL-RNN  | all − demo           | 0.162               | 0.403
SRL-RNN  | all                  | 0.157               | 0.409
Effectiveness and stability of policy. The relations between expected returns and mortality rates are shown in Figure 3. We observe that SRL-RNN exhibits a clearer negative correlation between expected returns and mortality rates than BL and RL. The reason might be that BL ignores the evaluation signal, while RL discretizes the continuous states, incurring information loss.
Figure 4 shows how the observed mortality changes with the difference between the learned policies (by RL and SRL-RNN) and doctors' prescriptions. For each patient on each ICU day, we compute the treatment difference as the disagreement between the recommended and the prescribed medications, normalized by the number of candidate classes of medications. When the difference is at its minimum, we obtain the lowest mortality rates of 0.021 and 0.016 for RL and SRL-RNN, respectively. This phenomenon shows that both SRL-RNN and RL can learn good policies, while SRL-RNN slightly outperforms RL with its lower mortality rate.
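The per-day difference above can be sketched as follows. This is one plausible reading of the normalized disagreement (the paper's exact formula is not reproduced here): both policies are represented as binary vectors over the K candidate medication classes, and the difference is the fraction of classes on which they disagree.

```python
import numpy as np

def treatment_difference(policy_actions: np.ndarray,
                         doctor_actions: np.ndarray) -> float:
    """Fraction of the K candidate medication classes on which the
    learned policy and the doctor disagree for one patient-day.
    Inputs are binary vectors of length K (1 = prescribed)."""
    assert policy_actions.shape == doctor_actions.shape
    k = policy_actions.shape[0]
    return float(np.abs(policy_actions - doctor_actions).sum()) / k

doctor = np.array([1, 0, 1, 1, 0])
policy = np.array([1, 1, 1, 0, 0])
print(treatment_difference(policy, doctor))  # disagree on 2 of 5 classes -> 0.4
```

Figure 4 then bins patient-days by this quantity and plots the observed mortality per bin.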
Figure 5 presents the expected return and Jaccard scores obtained in each learning epoch of SRL-RNN with different features. Notably, SRL-RNN is able to combine reinforcement learning and supervised learning to reach the optimal policy in a stable manner. We can see that the Jaccard score converges faster than the Q-values, indicating that learning accurate Q-values may require rich trajectories.
Case studies. Table 4 shows the prescriptions generated by different models for two patients on different ICU days. For the first patient, who eventually dies, LG recommends many more medications, which appear non-informative. LEAP generates the same medications for the patient on different ICU days, without considering the change of patient states. An interesting observation is that the prescriptions recommended by SRL-RNN differ considerably from the doctors', especially for the stronger tranquilizers such as Acetaminophen and Morphine Sulfate. In fact, for an ICU patient with a severe trauma, it is important to give full sedation in early treatment. For the second, surviving patient, the similarity between the prescriptions of SRL-RNN and the doctor is higher, indicating the rationality of SRL-RNN. In particular, all the models except SRL-RNN recommend redundant medications across different days, while the prescriptions of doctors differ substantially between days.
Effect of the weight parameter. Figure 6 shows the effect of the weight parameter in Equation 12, which balances RL and SL. SRL-RNN achieves the highest Jaccard scores and the lowest mortality rates when the parameter takes the values of 0.5 and 0.6, which verifies that SRL-RNN can not only significantly reduce the estimated mortality but also recommend better medications.
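The balancing idea can be illustrated with a small sketch of a joint objective of the form L = w · L_RL + (1 − w) · L_SL, where L_SL is a cross-entropy against the doctor's prescription (the indicator signal) and L_RL is a policy-gradient surrogate weighted by a critic's advantage estimate (the evaluation signal). The names and exact form here are assumptions for illustration, not the paper's Equation 12:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def joint_loss(logits, doctor_labels, advantage, weight=0.5):
    """Illustrative joint objective blending RL and SL signals:
        L = weight * L_RL + (1 - weight) * L_SL
    logits: actor outputs, one per candidate medication class
    doctor_labels: binary doctor prescription (indicator signal)
    advantage: critic's advantage estimate (evaluation signal)"""
    p = sigmoid(logits)  # per-medication probabilities
    eps = 1e-8
    log_lik = (doctor_labels * np.log(p + eps)
               + (1 - doctor_labels) * np.log(1 - p + eps))
    l_sl = -log_lik.mean()               # supervised (indicator) loss
    l_rl = -(advantage * log_lik).mean() # advantage-weighted RL surrogate
    return weight * l_rl + (1 - weight) * l_sl
```

Sweeping `weight` over [0, 1], as in Figure 6, trades matching the doctor (weight near 0) against maximizing the long-term outcome (weight near 1); the reported optimum sits in the middle.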
6. Conclusion
In this paper, we propose the novel Supervised Reinforcement Learning with Recurrent Neural Network (SRL-RNN) model for DTR, which combines the indicator signal and the evaluation signal through joint supervised and reinforcement learning. SRL-RNN incorporates the off-policy actor-critic architecture to discover optimal dynamic treatments and further adopts an RNN to address the POMDP problem. Comprehensive experiments on a real-world EHR dataset demonstrate that SRL-RNN can reduce the estimated in-hospital mortality by up to 4.4% and provide better medication recommendations as well.
Table 4: Example prescriptions generated by the Doctor, LG, LEAP, and SRL-RNN on ICU days 1 and 2 for two patients: one diagnosed with traumatic brain hemorrhage, spleen injury, intrathoracic injury, contusion of lung, motor vehicle traffic collision, and acute posthemorrhagic anemia; the other with coronary atherosclerosis, pleural effusion, percutaneous coronary angioplasty, hypopotassemia, pericardium, and personal history of malignant neoplasm.
Acknowledgements
We thank Aniruddh Raghu and Matthieu Komorowski for helping us preprocess the data. This work was partially supported by the National Key Research and Development Program of China under Grant No. 2016YFB1000904, NSFC (61702190), and NSFC-Zhejiang (U1609220).
References
 Abbeel and Ng (2004) Pieter Abbeel and Andrew Y Ng. 2004. Apprenticeship learning via inverse reinforcement learning. In ICML. ACM, 1.
 Almirall et al. (2012) D Almirall, S. N. Compton, M Gunlicks-Stoessel, N. Duan, and S. A. Murphy. 2012. Designing a pilot sequential multiple assignment randomized trial for developing an adaptive treatment strategy. Statistics in Medicine (2012), 1887–1902.
 Bajor and Lasko (2017) Jacek M Bajor and Thomas A Lasko. 2017. Predicting Medications from Diagnostic Codes with Recurrent Neural Networks. ICLR (2017).
 Barto (2002) Andrew G Barto. 2002. Reinforcement Learning in Motor Control. In The handbook of brain theory and neural networks.
 Rosenstein and Barto (2004) Michael T. Rosenstein and Andrew G. Barto. 2004. Supervised actor-critic reinforcement learning. Handbook of learning and approximate dynamic programming (2004), 359.
 Benbrahim and Franklin (1997) Hamid Benbrahim and Judy A Franklin. 1997. Biped dynamic walking using reinforcement learning. Robotics and Autonomous Systems (1997), 283–302.
 Chakraborty and Moodie (2013) Bibhas Chakraborty and EE Moodie. 2013. Statistical methods for dynamic treatment regimes. Springer.
 Cheerla and Gevaert (2017) N Cheerla and O Gevaert. 2017. MicroRNA-based Pan-Cancer Diagnosis and Treatment Recommendation. BMC Bioinformatics (2017), 32.

 Choi et al. (2016) Edward Choi, Mohammad Taha Bahadori, Andy Schuetz, Walter F. Stewart, and Jimeng Sun. 2016. Doctor AI: Predicting Clinical Events via Recurrent Neural Networks. In Proceedings of the 1st Machine Learning for Healthcare Conference. PMLR, 301–318.
 Clouse and Utgoff (1992) Jeffery A. Clouse and Paul E. Utgoff. 1992. A Teaching Method for Reinforcement Learning. In International Workshop on Machine Learning. 92–110.
 Degris et al. (2012) Thomas Degris, Patrick M Pilarski, and Richard S Sutton. 2012. Model-free reinforcement learning with continuous action in practice. In American Control Conference (ACC). IEEE, 2177–2182.
 Finn et al. (2016) Chelsea Finn, Sergey Levine, and Pieter Abbeel. 2016. Guided cost learning: Deep inverse optimal control via policy optimization. In ICML. 49–58.
 Gunlicks-Stoessel et al. (2017) M Gunlicks-Stoessel, L Mufson, A Westervelt, D Almirall, and S Murphy. 2017. A Pilot SMART for Developing an Adaptive Treatment Strategy for Adolescent Depression. J Clin Child Adolesc Psychol (2017), 1–15.
 Hausknecht and Stone (2015) Matthew Hausknecht and Peter Stone. 2015. Deep recurrent Q-learning for partially observable MDPs. CoRR, abs/1507.06527 (2015).
 Hu et al. (2016) Jianying Hu, Adam Perer, and Fei Wang. 2016. Data driven analytics for personalized healthcare. In Healthcare Information Management Systems. 529–554.
 Johnson et al. (2016) Alistair EW Johnson, Tom J Pollard, Lu Shen, et al. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data (2016).
 Konda and Tsitsiklis (2000) Vijay R Konda and John N Tsitsiklis. 2000. Actor-critic algorithms. In NIPS. 1008–1014.
 Levine et al. (2016) Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. 2016. End-to-end training of deep visuomotor policies. JMLR (2016), 1–40.
 Levine et al. (2011) Sergey Levine, Zoran Popovic, and Vladlen Koltun. 2011. Nonlinear inverse reinforcement learning with gaussian processes. In NIPS. 19–27.
 Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).
 Marik (2015) P. E. Marik. 2015. The demise of early goal-directed therapy for severe sepsis and septic shock. Acta Anaesthesiologica Scandinavica (2015), 561.
 Mihatsch and Neuneier (2002) Oliver Mihatsch and Ralph Neuneier. 2002. Risk-sensitive reinforcement learning. Machine Learning (2002), 267–290.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al. 2015. Human-level control through deep reinforcement learning. Nature (2015), 529–533.
 Murphy (2003) Susan A Murphy. 2003. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65, 2 (2003), 331–355.
 Nemati et al. (2016) Shamim Nemati, Mohammad M. Ghassemi, and Gari D. Clifford. 2016. Optimal medication dosing from suboptimal clinical examples: A deep reinforcement learning approach. In Engineering in Medicine and Biology Society. 2978.
 Prasad et al. (2017) Niranjani Prasad, Li Fang Cheng, Corey Chivers, Michael Draugelis, and Barbara E Engelhardt. 2017. A Reinforcement Learning Approach to Weaning of Mechanical Ventilation in Intensive Care Units. (2017).
 Raghu et al. (2017) Aniruddh Raghu, Matthieu Komorowski, Imran Ahmed, Leo Celi, Peter Szolovits, and Marzyeh Ghassemi. 2017. Deep Reinforcement Learning for Sepsis Treatment. arXiv preprint arXiv:1711.09602 (2017).
 Ratliff et al. (2006) Nathan D Ratliff, J Andrew Bagnell, and Martin A Zinkevich. 2006. Maximum margin planning. In ICML. ACM, 729–736.
 Robins (1986) James Robins. 1986. A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Mathematical modelling (1986), 1393–1512.
 Rosen-Zvi et al. (2008) M Rosen-Zvi, A Altmann, M Prosperi, E Aharoni, et al. 2008. Selecting anti-HIV therapies based on a variety of genomic and clinical factors. Bioinformatics (2008), 399–406.
 Shortreed and Moodie (2012) Susan M. Shortreed and Erica E. M. Moodie. 2012. Estimating the optimal dynamic antipsychotic treatment regime: evidence from the sequential multipleassignment randomized Clinical Antipsychotic Trials of Intervention and Effectiveness schizophrenia study. Journal of the Royal Statistical Society (2012), 577–599.
 Silver et al. (2014) David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic policy gradient algorithms. In ICML. 387–395.
 Sun et al. (2016) Leilei Sun, Chuanren Liu, Chonghui Guo, Hui Xiong, and Yanming Xie. 2016. Datadriven Automatic Treatment Regimen Development and Recommendation.. In KDD. 1865–1874.
 Shortreed et al. (2011) Susan M. Shortreed, Eric Laber, Daniel J. Lizotte, T. Scott Stroup, Joelle Pineau, and Susan A. Murphy. 2011. Informing sequential clinical decision-making through reinforcement learning: an empirical study. Machine Learning (2011), 109–136.
 Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In NIPS. 1057–1063.
 Wang et al. (2018) Lu Wang, Wei Zhang, Xiaofeng He, and Hongyuan Zha. 2018. Personalized Prescription for Comorbidity. In International Conference on Database Systems for Advanced Applications. 3–19.
 Watkins and Dayan (1992) Christopher JCH Watkins and Peter Dayan. 1992. Q-learning. Machine Learning (1992), 279–292.
 Weng et al. (2017) WeiHung Weng, Mingwu Gao, Ze He, Susu Yan, and Peter Szolovits. 2017. Representation and Reinforcement Learning for Personalized Glycemic Control in Septic Patients. arXiv preprint arXiv:1712.00654 (2017).
 Wierstra et al. (2007) Daan Wierstra, Alexander Foerster, Jan Peters, and Juergen Schmidhuber. 2007. Solving deep memory POMDPs with recurrent policy gradients. In International Conference on Artificial Neural Networks. Springer, 697–706.
 Zhang et al. (2014) Ping Zhang, Fei Wang, Jianying Hu, and Robert Sorrentino. 2014. Towards personalized medicine: leveraging patient similarity and drug similarity analytics. AMIA Summits on Translational Science Proceedings (2014), 132.
 Zhang et al. (2017) Yutao Zhang, Robert Chen, Jie Tang, Walter F. Stewart, and Jimeng Sun. 2017. LEAP: Learning to Prescribe Effective and Safe Treatment Combinations for Multimorbidity. In KDD. 1315–1324.
 Zhao et al. (2011) Yufan Zhao, Donglin Zeng, Mark A Socinski, and Michael R Kosorok. 2011. Reinforcement Learning Strategies for Clinical Trials in Non-small Cell Lung Cancer. Biometrics (2011), 1422–1433.

 Zhuo et al. (2016) Zhuo Chen, Kyle Marple, Elmer Salazar, Gopal Gupta, and Lakshman Tamil. 2016. A Physician Advisory System for Chronic Heart Failure management based on knowledge patterns. Theory and Practice of Logic Programming (2016), 604–618.
 Ziebart et al. (2008) Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. 2008. Maximum Entropy Inverse Reinforcement Learning. In AAAI. 1433–1438.