1 Introduction
Precise, targeted patient monitoring is central to improving treatment in an ICU, allowing clinicians to detect changes in patient state and to intervene promptly and only when necessary. While basic physiological parameters that can be monitored bedside (e.g., heart rate) are recorded continually, those that require invasive or expensive laboratory tests (e.g., white blood cell counts) are more intermittently sampled. These lab tests are estimated to influence up to percent of diagnoses or treatment decisions, and are often cited as the motivation for more costly downstream care [badrick2013evidence, zhi2013landscape].
Recent medical reviews raise several concerns about the over-ordering of lab tests in the ICU [loftsgard2016clinicians]. Redundant testing can occur when labs are ordered by multiple clinicians treating the same patient or when recurring orders are placed without reassessment of clinical necessity. Many of these orders occur at time intervals that are unlikely to include a clinically relevant change or when large panel testing is repeated to detect a change in a small subset of analyses [konger2016reduction]. This inflates the cost of care, raises the likelihood of false positives in diagnostics, and causes unnecessary discomfort to the patient. Moreover, excessive phlebotomies (blood tests) can contribute to the risk of hospital-acquired anaemia; around of patients in the ICU have below normal haemoglobin levels by day 3 of admission and are in need of blood transfusions. It has been shown that phlebotomy accounts for almost half the variation in the amount of blood transfused [icumedical2015].
With the disproportionate rise in lab costs relative to medical activity in recent years, there is a pressing need for a sustainable approach to test ordering. A variety of approaches have been considered to this end, including restrictions on the minimum time interval between tests or the total number of tests ordered per week. More data-driven approaches include an information-theoretic framework to analyze the amount of novel information in each ICU lab test by computing conditional entropy and quantifying the decrease in novel information of a test over the first three days of an admission [lee2015using].
In a similar vein, a binary classifier was trained using fuzzy modeling to determine whether or not a given lab test contributes to information gain in the clinical management of patients with gastrointestinal bleeding [cismondi2013reducing]. An “informative” lab test is one in which there is significant change in the value of the tested parameter, or where values were beyond certain clinically defined thresholds; the results suggest a reduction in lab tests compared with observed behaviour. More recent work looked at predicting the results of ferritin testing for iron deficiency from information in other labs performed concurrently [luosol2016]. The predictability of the measurement is inversely proportional to the novel information in the test. These past approaches underscore the high levels of redundancy that arise from current practice. However, there are many key clinical factors that have not been previously accounted for, such as the low-cost predictive information available from vital signs, the causal connection of clinical interventions with test results, and the relative costs associated with ordering tests.
In this work, we introduce a reinforcement learning (RL) based method to tackle the problem of developing a policy to perform actionable lab testing in ICU patients. Our approach is twofold: first, we build an interpretable model to forecast future patient states based on past observations, including uncertainty quantification. We adapt multi-output Gaussian processes (MOGPs) [ghassemi2015multivariate, Cheng2017arXiv] to learn the patient state transition dynamics from a patient cohort including sparse and irregularly sampled medical time series data, and to predict future states of a given patient trajectory. Second, we model patient trajectories as a Markov decision process (MDP). This framework has been applied to the recommendation of treatment strategies for critical care patients in a variety of different settings, from recommending drug dosages to efficiently weaning patients from mechanical ventilation [nemati2016optimal, raghu2017continuous, prasad2017reinforcement]. We design the state and reward functions of the MDP to incorporate relevant clinical information, such as the expected information gain, administered interventions, and costs of actions (here, ordering a lab test). A major challenge is designing a reward function that can trade off multiple, often opposing, objectives. There has been initial work on extending the MDP framework to composite reward functions. For example, fitted Q-iteration (FQI) has been used to learn policies for multi-objective MDPs with vector-valued rewards, for the sequence of interventions in two-stage clinical antipsychotic trials [lizotte2016multi]. A variation of Pareto domination was then used to generate a partial ordering of policies and extract all policies that are optimal for some scalarization function, leaving the choice of parameters of the scalarization function to decision makers.
Here, we look to translate these principles to the problem of lab test ordering. Specifically, we focus on blood tests relevant in the diagnosis of sepsis or acute renal failure, two common conditions associated with high mortality risk in the ICU: white blood cell count (WBC), blood lactate level, serum creatinine, and blood urea nitrogen (BUN). We present our methods within a flexible framework that can in principle be adapted to a patient cohort with different diagnoses or treatment objectives, influenced by a distinct set of lab results. Our proposed framework integrates prior work on off-policy RL and Pareto learning with practical clinical constraints to yield policies that are close to intuition demonstrated in historical data. We apply our framework to a publicly available database of ICU admissions, evaluating the estimated policy against the policy followed by clinicians using both importance-sampling-based estimators for off-policy policy evaluation and by comparing against multiple clinically inspired objectives, including onset of clinical treatment that was motivated by the lab results.
2 Methods
2.1 Cohort selection and preprocessing
We extract our cohort of interest from the MIMIC-III database [johnson2016mimic], which includes de-identified critical care data from over 58,000 hospital admissions. From this database, we first select adult patients with at least one recorded measure for each of 20 vital signs and lab tests commonly ordered and reviewed by clinicians (for instance, the results reported in a complete blood count or basic metabolic panel). We further filter patients by their length of stay, keeping only those that were in the ICU for more than a day but less than twenty days, to obtain a final set of 6,060 patients (Table 2.1).
Included in the 20 physiological traits we filter for are eight that are particularly predictive of the onset of severe sepsis, septic shock, or acute kidney failure. These traits are included in the SIRS (Systemic Inflammatory Response Syndrome) and SOFA (Sequential Organ Failure Assessment) criteria. The average number of daily measurements or lab test orders across the chosen cohort for these eight traits is highly variable (Figure 1). Of these eight traits, the first four are vitals measured using bedside monitoring systems for which approximately hourly measurements are recorded; the latter four are labs requiring phlebotomy and are typically measured just 2–3 times each day. We find the frequency of orders also varies across different labs, possibly due in part to differences in cost; for example, WBC (which is relatively inexpensive to test) is on average sampled slightly more often than lactate. In order to apply our proposed RL algorithm to this sparse, irregularly sampled dataset, we adapt the multi-output Gaussian process (MOGP) framework [Cheng2017arXiv] to obtain hourly predictions of patient state with uncertainty quantified, on 17 of the 20 clinical traits. For three of the vitals, namely the components of the Glasgow Coma Scale, we impute with the last recorded measurement.
2.2 MDP formulation
Each patient admission is modelled as an MDP with:

a state space $\mathcal{S}$, such that the patient physiological state at time $t$ is given by $s_t \in \mathcal{S}$;

an action space $\mathcal{A}$ from which the clinician’s action $a_t$ is chosen;

an unknown transition function that determines the patient dynamics; and

a reward function that constitutes the observed clinical feedback for this action.
The objective of the RL agent is to learn an optimal policy that maximizes the expected discounted accumulated reward over the course of an admission.
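A common way to write this objective is sketched below. The notation is assumed rather than taken from the text: $\pi$ is a policy, $\gamma$ the discount factor, $r_t$ the reward at time $t$, and $T$ the final time step of the admission.

```latex
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}\left[ \sum_{t=0}^{T} \gamma^{t} \, r_{t} \;\middle|\; \pi \right],
\qquad 0 \le \gamma \le 1
```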
We start by describing the state space of our MDP for ordering lab tests. We first resample the raw time series using a multi-output Gaussian process with a sampling period of one hour. The patient state at time $t$ is defined by:
(1) 
Here, the state includes the predictive means and standard deviations of the vitals and lab tests. For the predictive SOFA score, we compute the value using its clinical definition, from the predictive means on five traits (including mean BP, bilirubin, platelet count, and creatinine), along with GCS and related medication history (e.g., dopamine). Vitals include any time-varying physiological traits that we consider when determining whether to order a lab test. Here, we look at four key physiological traits (heart rate, respiratory rate, temperature, and mean blood pressure) and four lab tests (creatinine, BUN, WBC, and lactate). The state also contains the last known measurements of each of the four labs and the elapsed time since each was last ordered. This formulation results in a 21-dimensional state space. Depending on the labs that we wish to learn recommendations for testing, the action space is a set of binary vectors whose elements indicate whether or not to place an order for a specific lab. These actions can be written as $a_t \in \{0, 1\}^L$, where $L$ is the number of labs.
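The state and action spaces described above can be sketched in a few lines of Python. The layout is an assumption chosen only because it adds up to 21 dimensions (one SOFA score, four vital means, four lab means, four lab standard deviations, four last measurements, four elapsed times); the function and field names are hypothetical.

```python
from itertools import product

VITALS = ["heart_rate", "resp_rate", "temperature", "mean_bp"]
LABS = ["creatinine", "bun", "wbc", "lactate"]

def build_state(sofa, vital_means, lab_means, lab_stds, last_values, hours_since):
    """Flatten the state components into one vector (assumed ordering)."""
    parts = [
        [sofa],                               # predictive SOFA score
        [vital_means[v] for v in VITALS],     # predictive means of vitals
        [lab_means[l] for l in LABS],         # predictive means of labs
        [lab_stds[l] for l in LABS],          # predictive std devs of labs
        [last_values[l] for l in LABS],       # last known lab measurements
        [hours_since[l] for l in LABS],       # time since each lab was ordered
    ]
    return [x for part in parts for x in part]

def action_space(n_labs):
    """Enumerate all binary order vectors for n_labs labs."""
    return [tuple(bits) for bits in product((0, 1), repeat=n_labs)]
```

With a separate policy per lab, `action_space(1)` reduces to the two actions "skip" and "order".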
In order for our RL agent to learn a meaningful policy, we need to design a reward function that provides positive feedback for the ordering of tests where necessary, while penalizing the over- or under-ordering of any given lab test. In particular, the agent should be encouraged to order labs when the physiological state of the patient is abnormal with high probability, based on estimates from the MOGP, or when a lab is predicted to be informative (in that the forecasted value is significantly different from the last known measurement) due to a sudden change in disease state. In addition, the agent should incur some penalty whenever a lab test is taken, decaying with elapsed time since the last measurement, to reflect the effective cost (both economic and in terms of discomfort to the patient) of the test. We formulate these ideas into a vector-valued reward function of the state and action at time $t$, as follows:
(2)
Patient state:
The first element uses the recently introduced SOFA score for sepsis [singer2016third], which assesses severity of organ dysfunction in a potentially septic patient. Our use of SOFA is motivated by the fact that, in practice, sepsis is more often recognized from the associated organ failure than from direct detection of the infection itself [vincent2016qsofa]. The raw SOFA score ranges from 0 to 24, with a maximum of four points assigned to each of six components: the respiratory system, nervous system, cardiovascular system, liver, kidneys, and blood coagulation. A change in SOFA score is considered a critical index for sepsis [singer2016third]. We use this rule of thumb to design the first reward term as follows:
(3) 
The raw score at each time step is evaluated using current patient labs and vitals [vincent2016qsofa].
Treatment onset:
The second term is an indicator variable capturing whether or not some treatment or intervention is initiated at the next time step:
(4) 
Here, the indicator is evaluated over the set of disease-specific categories of interventions of interest. Again, the reward term is positive if a lab is ordered; this is based on the rationale that, if a lab test is ordered and immediately followed by an intervention, the test is likely to have provided actionable information. Possible interventions in the following state include administration of antibiotics or vasopressors, and initiation of dialysis or mechanical ventilation.
Lab redundancy:
The third term denotes the feedback from taking one or more lab tests with novel information. We quantify this by using the mean squared difference between the last observed value and the predictive mean from the MOGP as a proxy for the information available:
(5) 
Here, each lab has its own normalization coefficient, and a threshold parameter determines the minimum prediction error necessary to trigger a reward; in our experiments, this threshold is set to the median prediction error for labs ordered in the training data. The larger the deviation from current forecasts, the higher the potential information gain and, in turn, the reward if the lab is taken.
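As a rough illustration of this term, the sketch below rewards each ordered lab by the amount its normalized squared forecast deviation exceeds the threshold. The exact functional form is an assumption, and `info_reward` and its parameter names are hypothetical.

```python
def info_reward(action, predicted, last_observed, norm, threshold):
    """Information reward for one time step (assumed form).

    action: dict lab -> 0/1 order flag
    predicted: dict lab -> MOGP predictive mean
    last_observed: dict lab -> last measured value
    norm: dict lab -> normalization coefficient for that lab
    threshold: minimum normalized error needed to earn any reward
    """
    total = 0.0
    for lab, ordered in action.items():
        if not ordered:
            continue  # no reward for labs that are not taken
        err = (predicted[lab] - last_observed[lab]) ** 2 / norm[lab]
        total += max(0.0, err - threshold)
    return total
```

Labs that are not ordered contribute nothing, so the term only ever rewards tests that were actually taken.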
Lab cost:
The last term in the reward function adds a penalty whenever any test is ordered, to reflect the effective “cost” of taking the lab at time $t$.
(6) 
where a decay factor controls how fast the cost decays with the time elapsed since the last measurement. In our experiments, we set .
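One way to realize a penalty that shrinks as more time passes since the last measurement is an exponential decay, sketched below. The exponential form, the name `cost_reward`, and the `decay` parameter (in hours) are assumptions made for illustration.

```python
import math

def cost_reward(action, hours_since, decay):
    """Negative reward for each ordered lab, decaying with elapsed time.

    action: dict lab -> 0/1 order flag
    hours_since: dict lab -> hours since that lab was last measured
    decay: assumed decay constant in hours (larger means slower decay)
    """
    penalty = sum(
        math.exp(-hours_since[lab] / decay)
        for lab, ordered in action.items()
        if ordered
    )
    return -penalty
```

Under this form, repeating a lab immediately costs the full penalty of 1, while repeating it after many hours costs close to nothing.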
2.3 Learning optimal policies
Once we extract sequences of states, actions, and rewards from the ICU data, we can generate a dataset of one-step transition tuples, each consisting of a state, the action taken, the next state, and the reward received. These tuples can then be used to learn an estimate of the Q-function, whose output has the same dimensionality as the reward function, to map a given state-action pair to a vector of expected cumulative rewards. Each element in the Q-vector represents the estimated value of that state-action pair according to a different objective. We learn this Q-function using a variant of Fitted Q-iteration (FQI) with extremely randomized trees [ernst2005tree, prasad2017reinforcement]. FQI is a batch off-policy reinforcement learning algorithm that is well-suited to clinical applications where we have limited data and challenging state dynamics. The algorithm adapted here to handle vector-valued rewards is based on Pareto-optimal Fitted-Q [lizotte2016multi].
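The core loop of FQI can be sketched as follows. This is a simplified illustration, not the variant used here: it handles a single scalar reward instead of a reward vector, and it swaps the extremely randomized trees for a toy lookup-table regressor so that the example has no external dependencies. All names are hypothetical.

```python
class TableRegressor:
    """Stand-in regressor: averages targets per (state, action) key."""
    def fit(self, inputs, targets):
        sums = {}
        for x, y in zip(inputs, targets):
            total, n = sums.get(x, (0.0, 0))
            sums[x] = (total + y, n + 1)
        self.table = {x: total / n for x, (total, n) in sums.items()}

    def predict(self, x):
        return self.table.get(x, 0.0)

def fitted_q_iteration(transitions, actions, make_regressor, gamma, n_iters):
    """transitions: list of (state, action, reward, next_state, done)."""
    q = None
    for _ in range(n_iters):
        inputs, targets = [], []
        for s, a, r, s_next, done in transitions:
            future = 0.0
            if q is not None and not done:
                # Bootstrap from the previous iteration's Q estimate.
                future = max(q.predict((s_next, b)) for b in actions)
            inputs.append((s, a))
            targets.append(r + gamma * future)
        q = make_regressor()  # refit from scratch on the new targets
        q.fit(inputs, targets)
    return q
```

Because the regressor is refit on a fixed batch at every iteration, the procedure needs no further interaction with patients, which is what makes it a batch off-policy method.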
In order to scale from the two-stage decision problem originally tackled to the much longer admission sequences here ( time steps), we define a stricter pruning of actions: at each iteration we eliminate any dominated actions for a given state (those actions that are outperformed by alternatives for all elements of the Q-function) and retain only the set of non-dominated actions for each state. Actions are further filtered for consistency: we might consider feature consistency to be defined as rewards being linear in each feature space [lizotte2016multi]. Here, we relax this idea to filter out only those actions from policies that cannot be expressed by our chosen non-linear tree-based classifier. The resulting Q-function will still yield a non-deterministic policy (NDP) as, in most cases, there will not be a strictly optimal action that achieves the highest Q-value for all objectives. In the following section, we suggest one possible approach for reducing the NDP to give a single best action for any given state based on practical considerations for this setting.
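The pruning step can be illustrated with a small sketch that, for one state, drops every action whose Q-vector is beaten on all objectives by some alternative. The function names are hypothetical, and the consistency filter is not included.

```python
def strictly_dominates(q_a, q_b):
    """True if q_a is better than q_b on every objective."""
    return all(x > y for x, y in zip(q_a, q_b))

def non_dominated_actions(q_values):
    """Return the actions that no alternative beats on all objectives.

    q_values: dict action -> tuple of per-objective Q estimates for one state.
    """
    kept = []
    for a, q_a in q_values.items():
        beaten = any(
            strictly_dominates(q_b, q_a)
            for b, q_b in q_values.items()
            if b != a
        )
        if not beaten:
            kept.append(a)
    return kept
```

When objectives conflict, for example `{0: (1.0, 0.0), 1: (2.0, -1.0)}`, both actions survive, which is why the result is a non-deterministic policy.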
3 Results
Following the extraction of our 6,060 admissions and resampling in hourly intervals using the forecasting MOGP, we partitioned the cohort into training and test sets of 3,636 and 2,424 admissions respectively. This gave approximately 500,000 one-step transition tuples in the training set, and over 350,000 in the test set. We then ran batched FQI with these samples for iterations with discount factor . Each iteration took 100,000 transitions, sampled from the training set, with probability inversely proportional to the frequency of the action in the tuple. The vector-valued outputs of the estimated Q-function were then used to obtain a non-deterministic policy for each lab considered (Section 2.3). We chose to collapse this set to a practical deterministic policy as follows:
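The action-balanced sampling used in each iteration can be sketched as below. Sampling with replacement and the helper names are assumptions; the weights simply follow the stated rule that a transition is drawn with probability inversely proportional to how often its action occurs.

```python
import random
from collections import Counter

def inverse_frequency_weights(actions):
    """Weight each transition by 1 / (count of its action in the data)."""
    counts = Counter(actions)
    return [1.0 / counts[a] for a in actions]

def sample_batch(transitions, actions, k, seed=0):
    """Draw k transitions with replacement using inverse-frequency weights.

    transitions: list of transition records
    actions: list with the action taken in each transition (same order)
    """
    rng = random.Random(seed)
    weights = inverse_frequency_weights(actions)
    return rng.choices(transitions, weights=weights, k=k)
```

Since lab orders are rare relative to hours without an order, this keeps "order" transitions from being swamped in each batch.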
(7) 
In particular, a lab should be taken only if the action is strictly optimal, that is, estimated to outperform the alternative for all objectives in the Q-function. This strong condition for ordering a lab is motivated by the fact that one of our primary objectives here is to minimize unnecessary ordering; the slack variable allows us to relax this for certain objectives if desired. For example, if cost is a softer constraint in our case, setting a nonzero slack for the cost objective is an intuitive way to specify this preference in the policy. In our experiments, we tuned the slack such that the total number of recommended orders of each lab approximates the number of actual orders in the training set.
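A minimal sketch of this decision rule for a single lab is given below. The sign convention for the slack (a non-negative value that loosens the comparison for one objective) and the function name are assumptions.

```python
def order_lab(q_order, q_skip, slack):
    """Return 1 if ordering beats skipping on every objective, else 0.

    q_order, q_skip: per-objective Q estimates for ordering / not ordering
    slack: per-objective non-negative tolerance (assumed convention);
           a larger value makes that objective a softer constraint
    """
    ok = all(qo > qs - e for qo, qs, e in zip(q_order, q_skip, slack))
    return int(ok)
```

For instance, with Q-vectors `(1.0, -0.5)` for ordering and `(0.5, -0.2)` for skipping, the lab is not ordered with zero slack, but is ordered once the second (cost-like) objective is given a slack of 0.5.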
With a deterministic set of optimal actions, we could train our final policy function; again, we used extremely randomized trees. The estimated feature importances of the policies learnt show that in the case of lactate the most important features are the mean and measured lactate, the time since last lactate measurement, and the SOFA score (Figure 2). These relative importance scores are expected: a change in SOFA score may indicate the onset of sepsis, and in turn warrant a lactate test to confirm a source of infection, fitting typical clinical protocol. For the other three policies (WBC, creatinine, and BUN), again the time since last measurement of the respective lab tends to be the prominent feature in the policy, along with the terms for the other two labs. This emphasizes the overlap in information conveyed by these three tests: for example, abnormally high white blood cell count is a key criterion for sepsis, and severe sepsis often cascades into renal failure, which is typically diagnosed by elevated BUN and creatinine levels [clarkson2010pocket].
Once we have trained our policy functions, an additional component is added to our final recommendations: we introduce a budget that suggests taking a lab at the end of every 24-hour period for which our policy recommends no orders. This allows us to handle regions of very sparse recommendations by the policy function, and reflects clinical protocols that require minimum daily monitoring of key labs. In the policy for lactate orders in a typical patient admission, looking at the timing of the actual clinician orders, recommendations from our policy, and suggested orders from the budget framework, the actions are concentrated where lactate values are increasingly abnormal, or at sharp rises in SOFA score (Figure 3).
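The budget rule can be sketched as a post-processing pass over the hourly recommendations. The use of fixed, non-overlapping 24-hour windows counted from admission is an assumption (a rolling window since the last order would be another reasonable reading), and `apply_budget` is a hypothetical name.

```python
def apply_budget(recommended, window=24):
    """Add an order at the end of each full window with no recommendations.

    recommended: list of 0/1 hourly policy recommendations for one lab
    window: window length in hours (assumed fixed and non-overlapping)
    """
    final = list(recommended)
    for start in range(0, len(final) - window + 1, window):
        if not any(recommended[start:start + window]):
            final[start + window - 1] = 1  # budget order at end of window
    return final
```

A trailing partial window is left untouched, so an admission of 30 hours with no recommendations receives a single budget order at hour 24.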
3.1 Off-Policy Evaluation
We evaluated the quality of our final policy recommendations in a number of ways. First, we implemented the per-step weighted importance sampling (PSWIS) estimator to calculate the value of the policy to be evaluated, given data collected from the behaviour policy [precup2000eligibility]. The behaviour policy was found by training a regressor on real state-action pairs observed in the dataset. The discount factor was set to $\gamma = 1$, so all time steps contribute equally to the value of a trajectory.
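For a single scalar reward, a per-step weighted importance sampling estimate can be sketched as follows. The input format is an assumption: each trajectory is a list of per-step tuples holding the evaluation-policy probability of the logged action, the behaviour-policy probability of that action, and the reward. Normalizing only over trajectories still active at each step is also an assumption, made to handle admissions of different lengths.

```python
def pswis(trajectories, gamma=1.0):
    """Per-step weighted importance sampling value estimate.

    trajectories: list of lists of (pi_e_prob, pi_b_prob, reward) per step
    gamma: discount factor
    """
    # Cumulative importance ratio for each trajectory at each step.
    rhos = []
    for traj in trajectories:
        rho, row = 1.0, []
        for pi_e, pi_b, _ in traj:
            rho *= pi_e / pi_b
            row.append(rho)
        rhos.append(row)

    horizon = max(len(traj) for traj in trajectories)
    value = 0.0
    for t in range(horizon):
        active = [i for i, traj in enumerate(trajectories) if len(traj) > t]
        norm = sum(rhos[i][t] for i in active)
        if norm == 0.0:
            continue  # no trajectory supports the evaluated policy here
        weighted = sum(rhos[i][t] * trajectories[i][t][2] for i in active)
        value += (gamma ** t) * weighted / norm
    return value
```

When the two policies agree everywhere, every ratio is 1 and the estimate reduces to the sum over time steps of the average observed reward. Applying the function once per reward component gives one value per objective.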
We then compared estimates for our policy (MO-FQI) against the behaviour policy and a set of randomized policies as baselines. These randomized policies were designed to generate random decisions to order a lab, with probabilities defined relative to the empirical probability of an order in the behaviour policy. For each such probability, we evaluated ten randomly generated policies and averaged performance over these. We observed that MO-FQI outperforms the behaviour policy across all reward components, for all four labs (Figure 4). Our policy also consistently approximately matches or outperforms other policies in terms of cost (note that lower cost is better), even with the inclusion of the slack variable and the budget framework. Across the remaining objectives, MO-FQI outperforms the random policy in at least two of three components for all but lactate. This may be due in part to the relatively sparse orders for lactate resulting in higher variance value estimates.
In addition to evaluating using the per-step WIS estimator, we looked for more intuitive measures of how the final policy influences clinical practice. We computed three metrics here: (i) estimated reduction in total number of orders, (ii) mean information gain of orders taken, and (iii) time intervals between labs and subsequent treatment onsets.
In evaluating the total number of recommended orders, we first filter a sequence of recommended orders to just the first (onset) of recommendations if there are no clinician orders between them. We argue that this is a fair comparison as subsequent recommendations are made without counterfactual state estimation, i.e., without assuming that the first recommendation was followed by the clinician. Empirically, we find that the total number of recommendations is considerably reduced. For instance, in the case of recommending WBC orders, our final policy reports 12,358 orders in the test set, achieving a reduction of 44% from the number of true orders (22,172). In the case of lactate, for which clinicians’ orders are the least frequent (14,558), we still achieved a reduction of 27%.
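The onset filter and the reduction figure can be sketched as follows. The handling of an hour in which a recommendation and a clinician order coincide is an assumption, and both function names are hypothetical.

```python
def collapse_recommendations(recommended, clinician):
    """Keep only the first recommendation since the last clinician order.

    recommended, clinician: equal-length lists of 0/1 hourly flags
    """
    kept, pending = [], False
    for rec, actual in zip(recommended, clinician):
        kept.append(int(bool(rec) and not pending))
        if rec:
            pending = True   # later recommendations repeat this one
        if actual:
            pending = False  # a real order resets the sequence
    return kept

def reduction(recommended_total, actual_total):
    """Fractional reduction in orders relative to clinician practice."""
    return 1.0 - recommended_total / actual_total
```

Using the WBC counts reported above, `reduction(12358, 22172)` is roughly 0.44, matching the stated 44%.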
We also compared the approximate information gain of the actions taken by the estimated policy, in comparison with the policy used in the collected data. To do this, we defined the information gain at a given time by looking at the difference between the approximated true value of the target lab, which we impute using the MOGP model given all the observed values, and the forecasted value, computed using only the values observed before the current time. The distribution of aggregate information gain for orders recommended by our policy and actual clinician’s orders in the test set shows higher mean information gain with MOFQI (Figure 5).
Lastly, we considered the time to onset of critical interventions, which we define to include initiation of vasopressors, antibiotics, mechanical ventilation or dialysis. We first obtained a sequence of treatment onset times for each test patient; for each of these time points, we traced back to the earliest observed or recommended order taking place within the past 48 hours, and computed the time between these. The distribution of time-to-treatment for labs taken by the clinician in the true trajectory against that for recommendations from our policy, for all four labs, shows that the recommended orders tend to happen earlier than the actual time of an order by the clinician: on average over an hour in advance for lactate, and more than four hours in advance for WBC, creatinine, and BUN (Figure 6).
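The trace-back computation can be sketched as below, with times expressed in hours since admission. The function name and the choice to skip onsets with no order in the window are assumptions.

```python
def time_to_treatment(order_times, treatment_onsets, lookback=48):
    """Hours from the earliest order in the lookback window to each onset.

    order_times: times (hours) of observed or recommended orders
    treatment_onsets: times (hours) at which a treatment was initiated
    lookback: window length in hours before each onset
    """
    gaps = []
    for onset in treatment_onsets:
        window = [t for t in order_times if onset - lookback <= t <= onset]
        if window:  # onsets with no order in the window are skipped
            gaps.append(onset - min(window))
    return gaps
```

Running this once with clinician order times and once with recommended order times gives the two time-to-treatment distributions being compared.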
4 Conclusion
In this work, we propose a reinforcement learning framework for decision support in the ICU that learns a compositional optimal treatment policy for the ordering of lab tests from sub-optimal histories. We do this by designing a multi-objective reward function that reflects clinical considerations when ordering labs, and adapting methods for multi-objective batch RL to learning extended sequences of Pareto-optimal actions. Our final policies are evaluated using importance-sampling-based estimators for off-policy evaluation, metrics for improvements in cost, and reducing redundancy of orders. Our results suggest that there is considerable room for improvement on current ordering practices, and the framework introduced here can help recommend best practices and be used to evaluate deviations from these across care providers, driving us towards more efficient health care. Furthermore, the low risk of these types of interventions in patient health care reduces the barrier of testing and deploying clinician-in-the-loop, machine-learning-assisted patient care in ICU settings.