Even though model predictive control (MPC) is currently the main algorithm for insulin control in the artificial pancreas (AP), it usually requires complex online optimizations, which are infeasible for resource-constrained medical devices. MPC also typically relies on state estimation, an error-prone process. In this paper, we introduce a novel approach to AP control that uses Imitation Learning to synthesize neural-network insulin policies from MPC-computed demonstrations. Such policies are computationally efficient and, by instrumenting MPC at training time with full state information, they can directly map measurements into optimal therapy decisions, thus bypassing state estimation. We apply Bayesian inference via Monte Carlo Dropout to learn policies, which allows us to quantify prediction uncertainty and thereby derive safer therapy decisions. We show that our control policies trained under a specific patient model readily generalize (in terms of model parameters and disturbance distributions) to patient cohorts, consistently outperforming traditional MPC with state estimation.
The artificial pancreas (AP) is a system for the automated delivery of insulin therapy for Type 1 diabetes (T1D), a disease in which patients produce little or no insulin to regulate their blood glucose (BG) levels and maintain adequate glucose uptake in muscle and adipose tissue. The AP consists of an insulin infusion pump, and a subcutaneous Continuous Glucose Monitor (CGM) for sensing glucose levels. CGM readings are transmitted to a control algorithm that computes the appropriate insulin dose. Such control should maintain a fine balance. Lack of insulin leads to hyperglycemia (i.e., high BG), which, if untreated, can cause complications such as stroke, kidney failure, and blindness. Excessive insulin can lead to hypoglycemia (low BG), a critical state which can result in unconsciousness and death.
Driven by advances in the mathematical modeling of T1D physiology (Hovorka et al., 2004; Man et al., 2014), Model Predictive Control (MPC) (Camacho and Alba, 2013) has become the gold-standard AP algorithm. MPC works by determining the insulin therapy that optimizes the future BG profile, predicted via physiological models. MPC, however, has two important limitations. First, it requires complex (often nonlinear and nonconvex) online optimization, which is infeasible when the algorithm is deployed in resource-constrained medical devices. This is why commercial AP systems either use simplistic linear models within MPC (the algorithm of TypeZero® (Forlenza et al., 2019), for instance, uses a linearized version of the nonlinear model of Man et al. (2014)) or do not employ MPC at all in favor of simpler control algorithms (e.g., PID control in the Medtronic™ MiniMed 670G (Steil, 2013), or fuzzy logic in the Glucositter by DreaMed Diabetes (Atlas et al., 2010)).
Second, and most crucial, MPC requires state estimation (SE) to recover the most recent patient state from CGM measurements (Gondhalekar et al., 2014; Paoletti et al., 2019). Besides its computational cost, SE is error-prone, as it relies strictly on CGM readings, which are an approximate, delayed, and noisy proxy of the target control variable, the BG. Incorrect state estimates might compromise the safety of the insulin therapy.
We present a novel method to derive end-to-end insulin control policies, i.e., policies that subsume state estimation and control, directly mapping CGM measurements into patient-optimal insulin dosages. To capture the complex logic of MPC and SE, we consider policies represented as Deep Neural Networks, LSTMs in particular (Hochreiter and Schmidhuber, 1997). Such an approach addresses the main issues surrounding the use of MPC, as it bypasses explicit SE and avoids the cost of MPC’s online optimizations.
Our approach is centered around the use of Imitation Learning (IL) (Ross et al., 2011), where the control policy is trained on examples provided by an MPC-guided teacher policy. Building on the PLATO framework (Kahn et al., 2017), at training time, we instrument the MPC teacher to access the full state of the patient model. As such, the learner policy, using only CGM measurements, will learn to mimic MPC-based decisions, which are based on the true model state and thus highly accurate. In this way, the learned policy incorporates both SE and control.
We employ IL as it alleviates the covariate shift (Storkey, 2009) caused by the discrepancy between the training-time state distribution (induced by the teacher policy) and the test-time one (induced by the learner). Without IL, such a distribution shift can lead to unpredictable behaviour of the learner, and hence jeopardize the patient’s safety.
We learn our insulin policies via approximate Bayesian inference, using Monte Carlo dropout (Gal and Ghahramani, 2016). The resulting stochastic policy can capture model (epistemic) uncertainty, as well as uncertainty in the data, due to, e.g., noisy measurements (Gal, 2016). Uncertainty quantification is key in safety-critical domains like ours, and is not addressed in the PLATO framework. Variations in the patient’s behavior and in physiological dynamics are indeed the norm, and can greatly affect the performance of a deterministic policy, with health consequences for the patient. In contrast, in a Bayesian framework, such variations likely yield high predictive uncertainty, information that we actively use to make robust therapy decisions.
In summary, the main contribution of this paper is an Imitation Learning-based method for deriving Bayesian neural network policies for AP control, a method that overcomes the two main shortcomings of established MPC-based approaches, namely, SE errors and computational cost. We empirically show that: 1) our IL-based approach outperforms Supervised Learning, and with less supervision data; 2) the learned stochastic policies outperform MPC with SE and deterministic policies; 3) our stochastic policies generalize to never-before-seen disturbance distributions and patient parameters that arise in a virtual patient cohort, while in the same setting, MPC with SE shows consistent performance degradation. Overall, our best stochastic policy keeps BG in the safe range 8.4%–11.75% longer than MPC with SE and 2.94%–9.07% longer than the deterministic policy.
We consider an in silico AP system, where the T1D patient is represented by a differential-equation model of BG and insulin metabolism, including absorption, excretion, and transport dynamics between different body compartments. In particular, we chose the well-established model of Hovorka et al. (2004), a nonlinear ODE model with 14 state variables. A diagram of the MPC-based AP system is shown in Fig. 3 (a). The state-space description is given in equations (1–3) below. The notation x_{t1:t2} stands for the indexed sequence x_{t1}, ..., x_{t2}.
Equation (1) describes the T1D patient model, where x_t is the patient state at time t, u_t is the insulin input, θ_t are the patient parameters with distribution p_θ, and w_t is the meal disturbance (amount of ingested carbohydrate) with distribution p_w. Note that both parameters and disturbances are random and time-dependent. Meal disturbances indeed depend on the patient's eating behaviour, which follows daily and weekly patterns. Similarly, parameters are typically subject to intra-patient variations such as daily oscillations. The parameter distribution can be alternatively defined to capture a population of patients (as we do in the experiment of Section 6).
Variable y_t is the observed output at time t, i.e., the CGM measurement, which is subject to Gaussian sensor noise ε_t, as done in previous work (Soru et al., 2012; Patek et al., 2007). With π_MPC we denote the MPC-based control policy that, given the state estimate x̂_t and future meal disturbances w_{t:t+H_p} as inputs, with H_p being the MPC prediction horizon, computes the optimal insulin therapy by solving, at each time t, the following online optimization problem (Paoletti et al., 2017, 2019) (at each time step, the solution is actually a sequence of insulin values, of which we retain only the first one):
where H_c is the control horizon and U is the set of admissible insulin values; (7) states that the insulin input is fixed to the basal insulin rate outside the control horizon. (8) and (9) describe how the predicted state evolves from the estimated state at time t following the plant model (1). The objective function contains two terms, the former designed to penalize deviations between predicted and target BG, the latter to avoid abrupt changes of the insulin infusion rate. The weight of the second term is a hyper-parameter. See (Paoletti et al., 2017, 2019) for more details. Note that, in order to predict future BG profiles, MPC requires some information about the future disturbances. In this work, we assume that the true future disturbance values are known. Alternative MPC formulations with weaker assumptions (Hughes et al., 2011; Paoletti et al., 2019) could instead be adopted.
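To make the receding-horizon scheme above concrete, here is a minimal Python sketch. The one-state surrogate dynamics, the candidate grid, the horizon of 6 steps, and all gains are illustrative assumptions, not the paper's 14-state Hovorka model or its exact cost; only the structure (predict over a horizon, minimize a tracking-plus-input cost, apply the first action, re-optimize) mirrors MPC.

```python
import numpy as np

def mpc_step(bg, target_bg, meals, horizon=6,
             candidates=np.linspace(0.0, 5.0, 51)):
    """One receding-horizon MPC step over a toy 1-state surrogate model."""
    def predict(b, u):
        # Hypothetical surrogate dynamics: carbs raise BG, insulin lowers it.
        traj = []
        for k in range(horizon):
            b = b + 0.5 * meals[k] - 2.0 * u
            traj.append(b)
        return np.array(traj)

    def cost(u):
        # BG tracking error plus a penalty on the insulin rate itself.
        return np.sum((predict(bg, u) - target_bg) ** 2) + 0.1 * u ** 2

    return min(candidates, key=cost)  # only the first action is ever applied

# Closed loop: re-optimize at every step from the newly observed state.
bg, doses = 160.0, []
for t in range(12):
    u = mpc_step(bg, target_bg=110.0, meals=np.zeros(6))
    doses.append(float(u))
    bg = bg - 2.0 * u  # plant update (no meal here) with the applied input
```

Even in this toy setting, the closed loop drives BG from 160 mg/dL toward the 110 mg/dL target; the grid search stands in for the nonlinear solver the paper uses.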
Finally, (4) describes state estimation, which follows the so-called Moving Horizon Estimation (MHE) method (Rawlings, 2013). MHE is the dual of MPC in that MHE also uses model predictions, but this time to estimate the most likely state given a sequence of past measurements, control inputs, and disturbances. The estimate x̂_t is the state that minimizes the discrepancy between observed CGM outputs and corresponding model predictions. As such, the quality of the estimate directly depends on the accuracy of the predictive model and the quality of the measurements. Details of the MHE optimization problem for the AP can be found in (Chen et al., 2019). Alternative SE methods include Kalman filters and particle filters (Rawlings, 2013), but these are subject to the same kinds of estimation errors as MHE.
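The MHE idea (fit the state that best explains a window of past measurements, then roll it forward) can be sketched as follows. The scalar dynamics, noise level, and grid search are illustrative assumptions standing in for the paper's 14-state model and its nonlinear MHE solver.

```python
import numpy as np

rng = np.random.default_rng(0)

def mhe_estimate(y_window, u_window,
                 candidates=np.linspace(50.0, 250.0, 2001)):
    """Moving Horizon Estimation sketch: search for the initial state whose
    simulated outputs best match a window of past CGM readings, then roll
    it forward to obtain the current estimate."""
    def residual(x0):
        x, err = x0, 0.0
        for y, u in zip(y_window, u_window):
            err += (x - y) ** 2   # CGM modeled as a direct noisy BG reading
            x = x - 2.0 * u       # same surrogate dynamics as the predictor
        return err

    x0 = min(candidates, key=residual)
    for u in u_window:            # propagate to the current time
        x0 = x0 - 2.0 * u
    return x0

# Generate a noisy 6-sample window from a known trajectory, then recover the state.
x, us, ys = 140.0, [1.0] * 6, []
for u in us:
    ys.append(x + rng.normal(0.0, 2.0))  # noisy CGM sample
    x = x - 2.0 * u                      # true (hidden) state update
est = mhe_estimate(ys, us)               # should land near the true x = 128
```

The residual directly exposes the failure mode discussed above: with biased or noisy measurements (or a mismatched model), the minimizer drifts away from the true state.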
The main goal of our work is to design end-to-end insulin control policies for the AP, i.e., policies that directly map noisy system outputs (CGM measurements) into an optimal insulin therapy, without requiring knowledge of the system state. The control loop of such a policy is illustrated in Fig. 3 (b). To this purpose, we follow an imitation learning approach where the learner policy is trained from examples provided by an MPC-based expert, called the teacher policy. Following the PLATO algorithm (Kahn et al., 2017), at training time the teacher is instrumented with full state information, so that its demonstrations are not affected by SE errors.
Supervised learning (SL) is not a viable solution to our problem because SL assumes i.i.d. training and test data, while our case is subject to covariate shift: the training data follow the distribution of trajectories visited by the teacher, while at test time we have a different distribution, induced by the learner policy. If the learner cannot perfectly imitate the teacher, in the long run the learner's actions will bring the system into a state far from the training distribution, where the behaviour of the learner becomes unpredictable, with clear safety implications for our AP system. In IL, the teacher should provide demonstrations on trajectories that the learner would explore, but without knowing the learner in advance. To do so, a common solution is to reduce IL to a sequence of SL problems, where at each iteration the learner is trained on the distribution induced by the previous learners or by a "mixture" of learner and teacher policies (Ross and Bagnell, 2010; Ross et al., 2011).
Our IL method builds on the PLATO algorithm for adaptive trajectory optimization (Kahn et al., 2017); see Section 7 for a summary of our extensions to PLATO. PLATO also reduces IL to a sequence of SL problems, where at each iteration the teacher's actions gradually adapt to those of the current learner policy. This adaptive teacher is an MPC-based policy whose objective function is extended to match the learner's behaviour. As such, it is non-optimal, and is only used to generate trajectories that approach the distribution induced by the learner, thereby alleviating the covariate shift problem. The training data are obtained by labelling the adaptive teacher's trajectories with the original (optimal) MPC policy, which we call the supervision policy. The training process is summarized in Fig. 4. In our approach, the learner policy is stochastic, represented as a Bayesian neural network, as explained in Section 5.
The supervision policy π* is the MPC policy for the AP system of (5–9). Its control action at time t, u*_t, is obtained by running the MPC algorithm given the true system state x_t (instead of the estimated state) and future meal disturbances: u*_t = π*(x_t, w_{t:t+H_p}). The learner policy is represented by an LSTM network with parameters φ. The choice of a recurrent architecture is natural for our application, because our policies have to subsume both control and SE and, as discussed in Section 2, SE for nonlinear models can be seen as a sequence prediction problem, see (4).
At time t, the learner policy takes an input sequence, which includes past observations, past and future disturbances, and past control actions, and predicts the control action u_t by iterating the following equation:

where h_t is the hidden state of the network at time t, updated and read out through two activation functions; the learner results from their composition. Since we use a stochastic policy (see Section 5), we will denote the learner by the conditional distribution π_φ(u_t | ·).
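The recurrent policy just described can be sketched as a minimal LSTM in NumPy. The architecture here (one layer, 8 hidden units, random untrained weights, crude input normalization, a clamped linear readout) is a deliberately small stand-in for the trained three-layer, 200-unit network of Section 6; only the gate recurrences and the sequence-to-action mapping reflect the text.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class LSTMPolicy:
    """Minimal LSTM insulin policy: maps a sequence of
    [CGM, carbs, previous insulin] feature vectors to one insulin rate."""
    def __init__(self, n_in=3, n_hidden=8):
        self.W = rng.normal(0, 0.1, (4 * n_hidden, n_in + n_hidden))  # i, f, o, g gates
        self.b = np.zeros(4 * n_hidden)
        self.w_out = rng.normal(0, 0.1, n_hidden)
        self.n_hidden = n_hidden

    def __call__(self, seq):
        h = np.zeros(self.n_hidden)
        c = np.zeros(self.n_hidden)
        for x in seq:
            z = self.W @ np.concatenate([x, h]) + self.b
            i, f, o, g = np.split(z, 4)
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # cell-state update
            h = sigmoid(o) * np.tanh(c)                    # hidden-state update
        return max(0.0, float(self.w_out @ h))             # non-negative insulin rate

policy = LSTMPolicy()
seq = [np.array([120.0, 0.0, 1.0]) / 100.0 for _ in range(12)]  # crude normalization
u = policy(seq)
```

The hidden state h_t carried across the loop is what lets the policy internalize state estimation: it summarizes the whole measurement history rather than reacting to the last CGM value alone.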
The adaptive teacher policy π̂ extends the supervision policy with a term penalizing the mismatch between the supervision policy, i.e., the optimal MPC policy, and the learner. The output of π̂ is the first control action in the solution of the following MPC problem.
The first term of (13) is the cost function of the supervision policy (5). The second term penalizes the distance between the optimal (supervision) control action and the action that the learner policy would perform. The factor λ determines the relative importance of matching the behavior of the learner policy against minimizing the MPC cost. In our experiments, λ increases with the iteration i of our IL algorithm (see Algorithm 1), so that the relative importance of matching grows as the learner improves (i.e., as i increases). By gradually matching the behaviour of the learner, the control actions taken by π̂ will lead to exploring a state space similar to the one induced by the learner policy.
The distance term is defined as the p-th Wasserstein distance W_p between the output distribution of the learner policy and the optimal control action u*_t, or more precisely, the Dirac distribution δ_{u*_t} centered at u*_t. We choose the Wasserstein distance because the two distributions are likely to have disjoint supports, and so other measures like the KL divergence (used in PLATO) are undefined or infinite. As we will see, an analytical form for the learner's output distribution is not available, but we can sample from it and obtain the empirical approximation {u^(1), ..., u^(M)}, where the u^(j) are the sampled values. Thus, we have that

W_p(π_φ, δ_{u*_t})^p = (1/M) Σ_{j=1}^{M} d(u^(j), u*_t)^p,

where d is a suitable distance. Note that the above equality holds as we are comparing a discrete distribution with a Dirac distribution. In our experiments we set p = 1, with d the absolute difference.
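Because one of the two distributions is a Dirac, the Wasserstein distance collapses to a plain moment of the pointwise distances, as a few lines of Python confirm (the absolute difference as d is our illustrative choice):

```python
import numpy as np

def wasserstein_to_dirac(samples, u_star, p=1):
    """p-Wasserstein distance between an empirical distribution and the
    Dirac at u_star: every sample must be transported to u_star, so the
    distance reduces to the p-th root of the mean p-th power of |u_j - u_star|."""
    d = np.abs(np.asarray(samples, dtype=float) - u_star)
    return float(np.mean(d ** p) ** (1.0 / p))

samples = [1.0, 2.0, 3.0]
w1 = wasserstein_to_dirac(samples, 2.0)        # mean absolute deviation
w2 = wasserstein_to_dirac(samples, 2.0, p=2)   # root mean squared deviation
```

No optimal-transport solver is needed here: with a Dirac target there is only one transport plan, which is what makes this penalty cheap to evaluate inside the teacher's MPC objective.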
Algorithm 1 outlines our IL scheme, which consists of a sequence of SL iterations. The learner policy is initialized at random. At each iteration, we start from a random initial state of the plant and generate a system trajectory of fixed length. To do so, we first sample a sequence of random meal disturbances, i.e., of ingested carbohydrates (lines 5–7). Then, for each time point of the trajectory, the adaptive teacher computes an insulin infusion rate by solving (13), that is, by optimizing the MPC objective while matching the current learner policy (line 11). The MPC supervision policy is used to compute the optimal control action by solving (5) (line 12). The optimal action is used to label the corresponding training example, which is added to the training set (line 13), while the sub-optimal action by the adaptive teacher is used to evolve the system state (line 14). At the end of each iteration, the learner is trained using the current set of examples. As the teacher gradually matches the learner, the training-time distribution of the system state gradually approximates that at test time.
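The structure of this loop can be sketched as follows. Every component here is a hypothetical stub (linear "MPC" rules, a one-state plant, a nearest-neighbor "fit"), chosen only so the skeleton runs end to end; the point is the data flow of Algorithm 1: the adaptive teacher drives the trajectory, the supervision policy labels it, and the learner is refit each iteration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-ins for the paper's components (all hypothetical placeholders):
def supervision_mpc(x, meals):                 # optimal MPC with full state (Eq. 5)
    return max(0.0, 0.1 * (x - 110.0))

def adaptive_teacher(x, meals, learner, lam):  # MPC plus learner-matching term (Eq. 13)
    u_star = supervision_mpc(x, meals)
    return (u_star + lam * learner(x)) / (1.0 + lam)

def plant_step(x, u, meal):                    # patient model (Eq. 1), toy version
    return x + 0.5 * meal - 2.0 * u

def train(dataset):                            # SL fit; here a nearest-neighbor lookup
    xs = np.array([x for x, _ in dataset])
    us = np.array([u for _, u in dataset])
    return lambda x: float(us[np.argmin(np.abs(xs - x))])

dataset, learner = [], lambda x: 0.0
for i in range(1, 6):                          # IL iterations
    x = rng.uniform(100.0, 200.0)              # random initial plant state
    meals = rng.poisson(1.0, 20).astype(float) # sampled meal disturbances (lines 5-7)
    lam = i / 5.0                              # matching weight grows with the iteration
    for t in range(20):
        u_teacher = adaptive_teacher(x, meals, learner, lam)  # explore (line 11)
        u_star = supervision_mpc(x, meals)                    # optimal label (line 12)
        dataset.append((x, u_star))                           # record example (line 13)
        x = plant_step(x, u_teacher, meals[t])                # evolve state (line 14)
    learner = train(dataset)                   # refit on all examples so far
```

Note that the state is always advanced with the teacher's (sub-optimal) action while the label is always the supervision action; this is precisely the separation that lets the training distribution track the learner's.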
The time complexity of Algorithm 1 is dominated by the two MPC instances (lines 11 and 12), which are solved at every time step of every trajectory, and by the training of the learner, repeated once per iteration.
We take a Bayesian approach to learn our control policy, which results in a stochastic policy represented by a neural network with fixed architecture and random parameters φ. This provides us with a distribution of policy actions, the predictive distribution, from which we can derive uncertainty measures to inform the final therapy decision. Such uncertainty should capture both data uncertainty, due to, e.g., noisy measurements, and epistemic uncertainty, i.e., the lack of confidence of the model about a given input (Gal, 2016).
Performing Bayesian inference corresponds, given training data D, to computing the posterior p(φ | D) from some prior p(φ) by applying Bayes' rule. The distribution of policies is induced by the random parameters φ. In particular, the predictive distribution p(u | x, D), with x being the policy inputs, is obtained from the posterior as:

p(u | x, D) = ∫ p(u | x, φ) p(φ | D) dφ.
Owing to the nonlinearity of the neural network, exact inference is infeasible, and one needs to resort to approximate methods (Neal et al., 2011; Blundell et al., 2015). We use Monte Carlo Dropout (MCD) (Gal and Ghahramani, 2016). Technically, this corresponds to multiplying the weights by a dropout mask, i.e., a vector of Bernoulli variables. At test time, standard dropout derives a deterministic network by scaling back the weights by the retain probability. In MCD, on the other hand, the random dropout mask is applied at test time too, resulting in a distribution of network parameters. Gal and Ghahramani (2016) show that applying dropout to each weight layer is equivalent to performing approximate Bayesian inference of the neural network. This property is very appealing, as it reduces posterior inference to drawing Bernoulli samples, which is very efficient.
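The MCD mechanism can be demonstrated in a few lines. The tiny one-hidden-layer regressor and its random "trained" weights are illustrative; the essential move is keeping the Bernoulli mask active at test time and treating repeated forward passes as samples from the predictive distribution.

```python
import numpy as np

rng = np.random.default_rng(3)

# A tiny one-hidden-layer regressor with fixed ("trained") weights.
w1 = rng.normal(0.0, 0.5, 16)
w2 = rng.normal(0.0, 0.5, 16)
P_DROP = 0.2

def forward(x, stochastic):
    """With stochastic=True, a fresh Bernoulli dropout mask is drawn --
    which is exactly what MC dropout also does at *test* time."""
    h = np.tanh(w1 * x)
    if stochastic:
        mask = rng.random(16) >= P_DROP   # keep each unit w.p. 1 - P_DROP
        h = h * mask / (1.0 - P_DROP)     # inverted-dropout rescaling
    return float(w2 @ h)

# Predictive distribution from T stochastic forward passes on one input:
T = 500
samples = np.array([forward(1.3, stochastic=True) for _ in range(T)])
mean, spread = samples.mean(), samples.std()  # action estimate + uncertainty
```

The spread of `samples` is the uncertainty signal the decision rule of the next section consumes; a deterministic policy (`stochastic=False`) discards it.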
The output of our policy is the distribution of insulin actions (15). Hence, we need a decision rule that produces an individual value out of this distribution, and we want such a rule to account for the predictive uncertainty of the policy. Let u^(1) ≤ ... ≤ u^(M) be the ordered empirical sample of (15), and let g be the last measured glucose value. Our rule selects a particular order statistic, i.e., one of the sampled values, depending on the relative distance of g w.r.t. the safe BG upper bound BG_max and lower bound BG_min. We call this an adaptive rule because the selected order statistic is adapted to g: the index of the selected statistic moves from the smallest to the largest sample as g moves from BG_min to BG_max (Eq. 16).
In this way, if the patient is approaching hypoglycemia (g close to BG_min), we select a conservative insulin value, and we select instead an aggressive therapy if g is close to hyperglycemia (BG_max). Importantly, as the policy uncertainty increases (and so does the spread of the sample), the selected action gets more conservative when g is in the lower half of the safe BG range, i.e., we take safer decisions when the policy is less trustworthy (protecting against hypoglycemia is the primary concern). By the same principle, when g is in the upper half of the range, higher uncertainty yields a more aggressive therapy, but this poses no safety threats because g is well away from the hypoglycemia threshold BG_min.
In our evaluation, we compare our adaptive rule with the commonly used rule that sets the action to the sample mean of (15).
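A sketch of the adaptive rule follows. The linear interpolation between the smallest and largest order statistic is our reading of (16), an assumption; what the text pins down is the qualitative behavior it reproduces: a low (conservative) dose near BG_min, a high (aggressive) one near BG_max, with the spread of the sample widening both extremes under high uncertainty.

```python
import numpy as np

BG_MIN, BG_MAX = 70.0, 180.0   # safe euglycemic range (mg/dL)

def adaptive_rule(insulin_samples, g):
    """Pick an order statistic of the sampled insulin actions according to
    where the last CGM value g sits in the safe range (assumed linear map)."""
    u = np.sort(np.asarray(insulin_samples, dtype=float))
    alpha = np.clip((g - BG_MIN) / (BG_MAX - BG_MIN), 0.0, 1.0)
    k = int(round(alpha * (len(u) - 1)))   # order-statistic index
    return float(u[k])

samples = [1.0, 2.0, 3.0, 4.0, 5.0]
low = adaptive_rule(samples, 75.0)    # near hypoglycemia  -> conservative dose
mid = adaptive_rule(samples, 125.0)   # mid-range          -> median dose
high = adaptive_rule(samples, 175.0)  # near hyperglycemia -> aggressive dose
```

Contrast this with the mean rule: on the same sample it always returns 3.0 regardless of g, ignoring both the glucose context and the sample spread.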
We conducted computational experiments to validate the following claims:
Our IL-based approach outperforms SL, and with less supervision data.
The learned stochastic policies outperform both MPC with SE and deterministic policies.
The stochastic policies generalize well to unseen patient physiological parameters and meal disturbances.
For a stochastic policy, the adaptive decision rule (16) outperforms the mean prediction rule.
The predictive uncertainty of the policy increases with out-of-distribution test inputs.
On our workstation (Intel i7-8750H CPU, 16 GB DDR4 SDRAM), the stochastic policy executes in approximately 20 milliseconds, which is well within the CGM measurement period of 5 minutes and consistently more efficient than MPC-based optimization (on average, 150+ times faster in our experiments; optimization for MPC and SE is solved using MATLAB's implementation of the interior-point algorithm of Byrd et al. (2000)). Thanks to platforms such as TensorFlow Lite, we believe that a similar runtime can be obtained after deploying the policy on embedded hardware.
We represent the learner policy as an LSTM network with three layers of 200 hidden units each, with tanh state activation functions and sigmoid gate activation functions. The network ends with a fully connected layer, a ReLU activation layer, and a regression layer. We added a dropout layer before each weight layer, with the dropout probability that empirically gave us the best accuracy among the values we tried. We also experimented with feedforward architectures, but these performed poorly, confirming the need for a recurrent architecture to adequately represent SE and control.
To evaluate and compare the insulin policies, we consider three typical performance measures in the AP domain: the average percentage of time the virtual patient's BG is in hypoglycemia (BG < 70 mg/dL), in hyperglycemia (BG > 180 mg/dL), and in the safe range, or euglycemia (70 mg/dL ≤ BG ≤ 180 mg/dL), respectively. Given the goal of diabetes treatment and BG control, the target is to maximize the time in euglycemia while minimizing the time in hypoglycemia (hypoglycemia leads to more serious consequences than hyperglycemia).
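These three measures are straightforward to compute from a BG trace; a minimal sketch (using the thresholds just stated, one sample per CGM period):

```python
import numpy as np

def glycemic_stats(bg_trace):
    """Percentage of time in hypoglycemia (<70), euglycemia (70-180),
    and hyperglycemia (>180 mg/dL) over a BG trace."""
    bg = np.asarray(bg_trace, dtype=float)
    hypo = float(np.mean(bg < 70.0) * 100.0)
    hyper = float(np.mean(bg > 180.0) * 100.0)
    eu = 100.0 - hypo - hyper
    return hypo, eu, hyper

trace = [65.0, 100.0, 150.0, 190.0]   # one BG sample per 5-minute CGM period
stats = glycemic_stats(trace)
```

By construction the three percentages sum to 100, so the comparison tables need only report any two of them unambiguously.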
During training, we consider a fixed (and time-constant) parameterization of the T1D virtual patient model, and the disturbance distribution of Table 1 (which includes three large meals and three small snacks). The length of each training trajectory is fixed across iterations. We performed 20 iterations of the IL algorithm, with every time point of every trajectory contributing one example to the training set. At test time, we consider the following three configurations of the model parameters:
Fixed Patient. The test-time virtual patient has the same parameters as during training.
Varying Patient. We introduce intra-patient temporal variations to the parameters.
Patient Cohort. We introduce both inter-patient and intra-patient variations. The distribution of model parameters represents a cohort of virtual patients, where each patient has time-varying parameters. It may include severe cases, when the parameters are sampled from the tails of the distribution.
The above distributions for modelling intra- and inter-patient variability are taken from (Wilinska et al., 2010), where the authors derived them from clinical data.
For each configuration and policy, we simulate 90 one-day trajectories. For each trajectory, we draw a fresh observation of the random disturbance and patient parameters but, to ensure a fair comparison, these are kept the same across all control algorithms.
In another set of experiments, we consider (in combination with the above configurations) a meal disturbance distribution different from the one used at training time. This distribution reflects a late eating habit and a higher probability of larger snacks, which leads to meal profiles with a higher carbohydrate intake. Further details are reported in the supplement.
For each experiment, we evaluate the performance of the following policies:
MPC+SI: the supervision policy, i.e., the MPC algorithm with full information on the current state and model parameters. We stress that full observability is an unrealistic assumption, yet useful to establish an ideal performance level. The control and prediction horizons of MPC are fixed (in minutes) throughout the experiments.
MPC+SE: the MPC algorithm with state estimation, as described in (5–9). This is one of the most common control schemes for the AP. The parameters of the T1D model used within MPC and SE are set to the training-time values. We will show that SE takes a toll on the overall performance, unlike our end-to-end policies.
DLP: the deterministic learner policy, where dropout is used at training time but not at test time.
SLP-M: the stochastic learner policy with the rule that selects the sample mean of the predictive distribution.
SLP-A: the stochastic learner policy with the adaptive rule of (16). As explained above, this rule actively uses uncertainty information by selecting an order statistic of the predictive distribution commensurate with the sensed glucose. We will show that this policy outperforms all the others (except MPC+SI).
We compared the performance of SLP-A, which is trained via the IL algorithm of Section 4, against the corresponding stochastic policy trained using an MPC-guided SL approach, that is, using only the supervision policy, without the adaptive teacher that guides the exploration of training trajectories.
Table 1: meal disturbance distribution over the daily schedule (breakfast, snack 1, lunch, snack 2, dinner, snack 3).
We found that the IL-based policy has much better control performance than the SL-based one, given the same number of algorithm iterations and training examples. This suggests that the IL-based policy captures more useful state-action information and that the IL algorithm leads the learner policy to explore the state space efficiently. In particular, after 20 iterations of Algorithm 1, SLP-A achieves a considerably higher average time in range than its SL-based counterpart, and IL consistently outperforms SL in all other performance metrics too (see Appendix B for more detailed results).
Results in Table 2 show that the stochastic policies (SLP-A and SLP-M) outperform the MPC policy with state estimation in essentially all performance measures and configurations, and in a statistically significant manner. In particular, we observe that, on average, the SLP-A policy stays in range 8.4%–11.75% longer than MPC+SE, and SLP-M does so 5.97%–8.4% longer. This is also visible in the BG profiles of Figure 8. The deterministic policy also performs better than MPC+SE, but lags behind the stochastic ones, with SLP-A achieving a time in range 2.94%–9.07% longer than DLP.
These results suggest that MPC with SE introduces estimation errors that have a detrimental effect on BG control, as confirmed by the fact that the same control algorithm with full state information (MPC+SI) performs far better. In realistic settings where the true patient state is not accessible, our analysis shows that an end-to-end policy is preferable to explicit SE.
From Table 2 and Figure 8, we observe that our stochastic policies are robust to patient parameters unseen during training, with a time in range consistently above 85% for the varying patient and patient cohort configurations. The superiority of the stochastic policies over the deterministic one under intra-patient and inter-patient variability evidences the importance of accounting for prediction uncertainty when the policy is deployed in environments that deviate from the training one.
Table 2 (excerpt; % of time in hypoglycemia / euglycemia / hyperglycemia, mean ± std): MPC+SI — Fixed Patient: 0.00±0.00 / 99.84±0.60 / 0.16±0.60; Varying Patient: 0.00±0.00 / 99.80±0.83 / 0.20±0.83; Patient Cohort: 0.00±0.00 / 99.15±1.81 / 0.85±1.81.
We further evaluate the robustness of the SLP-A policy under an unseen meal disturbance distribution characterized by a higher carbohydrate intake. The results, reported in Table 3, show that our stochastic policy generalizes well in this case too. A hypothesis test, reported in Appendix D, shows that the performance statistics SLP-A achieves under intra-patient and inter-patient variability with the two different meal distributions exhibit no significant difference.
The SLP-A policy obtains a time in range approximately 5% longer than SLP-M in the fixed patient and varying patient configurations (see Table 2). There is an improvement also in the patient cohort experiment, albeit a smaller one. With the adaptive rule, the policy adopts a more conservative or aggressive therapy depending on the measured glucose and the predictive uncertainty, which can lead to more stable BG trajectories and explains the observed difference.
We verify that inference via Monte Carlo dropout adequately captures epistemic uncertainty. We indeed observe an increased variability of the predictive distribution in the patient cohort configuration, that is, when the policy is applied to out-of-distribution test data resulting from introducing inter- and intra-patient variability. However, uncertainty does not significantly increase from the first to the second configuration, that is, when introducing only intra-patient variability. These results were confirmed via pair-wise two-sample Kolmogorov-Smirnov tests over the coefficients of variation (CoVs), i.e., the mean-normalized standard deviations (differences significant at the chosen significance level). See Appendix C for details on the Kolmogorov-Smirnov tests and a plot of the CoV distributions.
Table 3: performance of SLP-A under the unseen meal distribution, for the Fixed Patient, Varying Patient, and Patient Cohort configurations.
Traditional methods for insulin control in the AP mostly rely on MPC (Hovorka et al., 2004; Magni et al., 2007; Lee et al., 2009; Cameron et al., 2011; Paoletti et al., 2019), PID control (Steil et al., 2004; Huyett et al., 2015), and fuzzy rules based on the reasoning of diabetes caregivers (Atlas et al., 2010; Nimri et al., 2012). Reinforcement learning approaches have been proposed as well, including policy iteration (De Paula et al., 2015; Weng et al., 2017), actor-critic methods (Daskalaki et al., 2016), and deep Q-networks for dual-hormone therapy (Zhu et al., 2019); dual-hormone therapy allows infusion of both insulin and its antagonist hormone, glucagon, which protects against hypoglycemia, but it is not as established as insulin-only therapy and still has safety concerns (Taleb et al., 2017). Neural networks have been explored in the AP space not just to represent insulin policies (de Canete et al., 2012; Zhu et al., 2019), but also to predict BG concentration based on inputs such as insulin dosage, nutritional intake, and daily activities (Pappada et al., 2008; Pérez-Gandía et al., 2010; Dutta et al., 2018; Li et al., 2019). Our work differs from the above in that: 1) it takes an imitation learning approach to learn the insulin policy, mitigating potential (and dangerous) test-time distribution drifts; 2) it incorporates in the policy uncertainty information obtained via Bayesian inference; and 3) it produces end-to-end policies that do not require learning a separate BG prediction model.
Several imitation learning methods have been proposed, such as (Daumé et al., 2009; Ross and Bagnell, 2010; Ross et al., 2010; Ho and Ermon, 2016). Some of these, like our approach, are tailored to work with MPC teachers (Zhang et al., 2016; Kahn et al., 2017; Amos et al., 2018; Lowrey et al., 2019). Our method is also akin to (Cronrath et al., 2018; Menda et al., 2017; Lee et al., 2019) where Bayesian extensions of IL are presented to quantify the learner’s predictive uncertainty and better guide the queries to the teacher policy. Other Bayesian approaches in policy learning include (Deisenroth and Rasmussen, 2011; Gal et al., 2016; Polymenakos et al., 2019). “Frequentist” alternatives to Bayesian uncertainty quantification also exist (Vovk et al., 2005; Wang and Qiao, 2018; Papernot and McDaniel, 2018; Bortolussi et al., 2019; Park et al., 2020).
Our work builds on the PLATO IL algorithm (Kahn et al., 2017) and extends it in three main directions: 1) we consider recurrent architectures, which are more suitable than the feedforward ones used in PLATO to represent nonlinear state estimation and control; 2) PLATO also derives stochastic policies, but, unlike our work, considers no uncertainty-aware decision-making strategies; 3) PLATO policies do not support systems with external disturbances beyond noise, whereas random meal disturbances are central in our policies.
We introduced a method based on MPC-guided Imitation Learning and Bayesian inference to derive stochastic Neural Network policies for insulin control in an artificial pancreas. Our policies are end-to-end in that they directly map CGM values into insulin control actions. By using Bayesian neural networks, we can crucially quantify prediction uncertainty, information that we incorporate in the insulin therapy decisions. We empirically demonstrated that our stochastic insulin policies outperform traditional MPC with explicit state estimation and are more robust than their deterministic counterparts, as they generalize well to unseen T1D patient parameters and meal disturbance distributions.
Y. Gal and Z. Ghahramani (2016). Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059. Cited by: §1, §5.
Handbook of Markov Chain Monte Carlo, 2 (11), pp. 2. Cited by: §5.
S. Ross and D. Bagnell (2010). Efficient reductions for imitation learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 661–668. Cited by: §3, §7.
W. Wang and X. Qiao (2018). Learning confidence sets using support vector machines. In Advances in Neural Information Processing Systems, pp. 4929–4938. Cited by: §7.
Detailed statistics for all control policies and all configurations are available in Table 4. The table reports extra performance measures: the averages of the maximum and minimum BG values (in mg/dL), and the average insulin infusion rate (in mU/min). We performed significance tests on the collected data to validate our hypotheses regarding the performance of the learned stochastic policies. Details are given below.
| MPC+SI | Fixed Patient | 0.00 ± 0.00 | 99.84 ± 0.60 | 0.16 ± 0.60 | 181.28 ± 13.64 | 123.40 ± 3.85 | 28.11 ± 0.95 |
We define our null and alternative hypotheses below for each of the performance measures.
Null hypothesis H0: the median of the distribution of the differences of a specific performance metric between two sets of experiments regulated by two control policies is zero. Note that a zero median of the paired differences is not equivalent to equality of the medians of the two compared samples.
Alternative hypothesis H1: the median of the distribution of the differences of a specific performance metric between two sets of experiments regulated by two control policies is larger (smaller) than zero.
This formulation is for a paired significance test, which is suitable when comparing, as we do, experimental results for the same configuration (and hence the same sampled parameters and disturbances) under different policies.
For each pair of compared policies, we use sign tests to test the hypothesis that the median of the distribution of the differences is larger than zero for the measures where higher values are better, and smaller than zero for the measures where lower values are better. In the main paper, we set the significance level to 0.05 (i.e., 95% confidence).
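The paired one-sided sign test described above can be sketched as follows. This is a minimal stdlib-only implementation; the paired differences are illustrative values, not the paper's data.

```python
# Paired one-sided sign test (sketch; data below are hypothetical).
from math import comb

def sign_test_p(diffs):
    """One-sided sign test: H1 says the median of the paired
    differences is > 0. Ties (zero differences) are discarded."""
    pos = sum(1 for d in diffs if d > 0)
    n = sum(1 for d in diffs if d != 0)
    # Under H0 the number of positive signs is Binomial(n, 0.5);
    # the p-value is P(X >= pos).
    return sum(comb(n, k) for k in range(pos, n + 1)) / 2 ** n

# Hypothetical paired differences, e.g. time-in-euglycemia
# under SLP-A minus time-in-euglycemia under MPC+SE.
diffs = [1.2, 0.8, 2.1, -0.3, 1.5, 0.9, 1.1, 0.4, 2.0, -0.1]
p = sign_test_p(diffs)
print(f"p-value = {p:.4f}, reject H0 at 5% level: {p < 0.05}")
```

The test is distribution-free: it uses only the signs of the paired differences, which matches the appendix's choice of comparing medians rather than means.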
| Configuration | MPC+SE vs SLP-A | DLP vs SLP-A | SLP-M vs SLP-A | MPC+SE vs DLP | MPC+SE vs SLP-M | DLP vs SLP-M |
We can see that the tests reject the null hypothesis when comparing SLP-A against all other policies, indicating that the corresponding measures are statistically significantly larger (smaller) when a fixed or varying patient is regulated by SLP-A than by the other policies. The p-values of the sign tests also support that SLP-A significantly reduces hyperglycemia for a fixed or varying patient compared to all other policies.
In a patient cohort, the null hypothesis cannot be rejected for two of the metrics when comparing SLP-A vs SLP-M, indicating no significant difference between these two stochastic policies. However, when comparing either of these two policies with any other policy, the null hypothesis is rejected for both metrics. We can conclude that SLP-A and SLP-M significantly reduce hyperglycemia while maintaining short time in hypoglycemia.
Figure 11 (a) shows the proportion of time blood glucose spends in each state (eu-, hyper-, and hypoglycemia) at each training iteration when using supervised learning; these proportions improve much more slowly than those in Figure 11 (b).
We can see that the learner policy trained by IL outperforms the one trained by SL in reducing hypo- and hyperglycemia and keeping patients' blood glucose in the healthy range (euglycemia). When SLP-A is trained by IL, the mean of the maximum blood glucose decreases faster, while the mean of the minimum blood glucose increases, preventing dangerous hypoglycemia.
Figure 12 shows the root mean squared error (RMSE) of both learning methods at each training iteration. As the curves show, the training RMSE of both SL and IL decreases as the number of training episodes increases. The RMSE of IL starts larger but decreases dramatically after the first few episodes. Although its RMSE is smaller, the stochastic learner policy trained by SL covers a limited portion of the state space during training, and out-of-distribution disturbances lead to a slow learning process from supervision (i.e., the MPC controller with perfect state information).
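For reference, the RMSE reported in Figure 12 is computed in the usual way over the teacher-learner action pairs. A minimal sketch, with illustrative (not the paper's) insulin values in mU/min:

```python
# RMSE between learner-predicted insulin doses and the MPC teacher's
# demonstrated doses (values are illustrative).
import math

def rmse(pred, target):
    """Root mean squared error over paired samples."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))

teacher = [20.0, 25.0, 30.0, 28.0]   # hypothetical MPC doses
learner = [22.0, 24.0, 29.0, 30.0]   # hypothetical policy outputs
print(rmse(learner, teacher))
```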
In our machine learning model for the stochastic learner policy, we quantify the uncertainty of the predicted action. Quantifying uncertainty is critical in real-world settings such as the artificial pancreas (AP), which often involve out-of-distribution disturbances and patient variations. We analyze the uncertainty-quantification ability of our Monte Carlo Dropout Bayesian method to investigate the effects of covariate shift and virtual patient model variation.
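The mechanics of Monte Carlo Dropout can be sketched as follows: dropout is kept active at test time, the policy is run for many stochastic forward passes, and the spread of the outputs measures predictive uncertainty. The tiny network, its weights, and the dropout rate below are illustrative assumptions, not the paper's architecture.

```python
# Sketch of Monte Carlo Dropout uncertainty estimation (numpy only;
# network sizes, weights, and dropout rate are hypothetical).
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(4, 1))

def policy(x, p_drop=0.2):
    """One stochastic forward pass: dropout stays ACTIVE at test time."""
    h = np.maximum(x @ W1, 0.0)               # ReLU hidden layer
    mask = rng.random(h.shape) >= p_drop      # Bernoulli dropout mask
    h = h * mask / (1.0 - p_drop)             # inverted-dropout scaling
    return (h @ W2).item()

x = rng.normal(size=8)                        # e.g. a window of CGM readings
samples = np.array([policy(x) for _ in range(100)])   # T = 100 stochastic passes
mean, std = samples.mean(), samples.std()
cv = std / abs(mean)                          # coefficient of variation
print(f"predicted action: {mean:.3f}, uncertainty (CV): {cv:.3f}")
```

The sample mean serves as the insulin action and the dispersion of the samples (here summarized by the CV) as the uncertainty signal that informs the therapy decision.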
The coefficient of variation (CV), also known as relative standard deviation, is defined as the ratio of the standard deviation to the mean.
The figure shows the cumulative distribution functions (CDFs) of the CV of the stochastic learner policy outputs when the learner policy controls the AP system under the three patient configurations described in Section 6 of our paper. The curve from the patient cohort experiments is the slowest to reach a probability of 1; the curve from the fixed-patient experiments is the fastest, with the varying-patient curve close behind. We remark that, in this case, a faster-growing CDF implies a higher probability of smaller CV values.
In statistics, the Kolmogorov-Smirnov test, also known as the K-S test, tests the null hypothesis that two probability distributions are equal, or that a sample distribution equals a reference probability distribution.
We use the K-S test to quantify the distance between the empirical CDFs of the CV of SLP-A outputs under the three patient configurations, with significance level 0.05. The null hypotheses are rejected at the 5% level when comparing the empirical CDF of the CV from the patient cohort experiments against the others. Although the CDF curve of the fixed-patient experiments grows faster, the K-S test between it and the CDF of the varying-patient experiments accepts the null hypothesis with p-value 0.2453; thus the two samples likely come from the same distribution. The p-values can be considered accurate since the sample sizes of the compared CDFs are large enough, i.e., n1 n2 / (n1 + n2) is sufficiently large, where n1 and n2 are the sample sizes of the compared CDFs.
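The two-sample K-S statistic used above is the maximum distance between the two empirical CDFs. A numpy-only sketch (the synthetic gamma-distributed CV samples stand in for the experimental data, which is not reproduced here):

```python
# Two-sample Kolmogorov-Smirnov statistic over CV samples
# (numpy-only sketch; the CV samples are synthetic, not the paper's).
import numpy as np

def ks_statistic(a, b):
    """Maximum absolute distance between the two empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

rng = np.random.default_rng(1)
cv_fixed = rng.gamma(2.0, 0.05, size=500)    # hypothetical CVs, fixed patient
cv_cohort = rng.gamma(2.0, 0.08, size=500)   # hypothetical CVs, patient cohort

d = ks_statistic(cv_fixed, cv_cohort)
n1, n2 = len(cv_fixed), len(cv_cohort)
# Asymptotic p-values are trustworthy when the effective sample size
# n1*n2/(n1+n2) is large, as claimed for the paper's experiments.
eff_n = n1 * n2 / (n1 + n2)
print(f"KS statistic D = {d:.3f}, effective sample size = {eff_n:.0f}")
```

In practice one would obtain the p-value from the Kolmogorov distribution of the statistic (e.g. via `scipy.stats.ks_2samp`); the sketch only computes the distance itself.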
The stochastic learner policy can generalize its control ability not only to different virtual patients but also to different meal distributions.
The attributes of the distinct meal distribution we used to evaluate the generalization ability of our stochastic learner policy are summarized in Table 6. The new distribution has a higher probability of snack occurrence and larger snacks with greater carbohydrate amounts. This constitutes a larger system disturbance to the virtual patient models and is more challenging for our stochastic learner policy when regulating the patients' blood glucose levels.
| breakfast | snack 1 | lunch | snack 2 | dinner | snack 3 |
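A meal distribution of the kind in Table 6 can be sketched as a daily sampler: each meal has an occurrence probability and a carbohydrate distribution. The probabilities and gram amounts below are hypothetical placeholders, not the values of Table 6.

```python
# Sampling one day of meal disturbances (sketch; occurrence
# probabilities and carbohydrate amounts are hypothetical).
import random

# meal -> (occurrence probability, mean carbs [g], std of carbs [g])
MEALS = {
    "breakfast": (1.0, 45, 10), "snack 1": (0.5, 15, 5),
    "lunch":     (1.0, 70, 15), "snack 2": (0.5, 15, 5),
    "dinner":    (1.0, 80, 15), "snack 3": (0.3, 10, 5),
}

def sample_day(rng):
    """Draw one day's meals: Bernoulli occurrence, Gaussian carb amount."""
    day = {}
    for meal, (p_occur, mu, sigma) in MEALS.items():
        if rng.random() < p_occur:                       # does the meal occur?
            day[meal] = max(0.0, rng.gauss(mu, sigma))   # carbs, clipped at 0
    return day

rng = random.Random(42)
print(sample_day(rng))
```

A shifted distribution like Table 6's "new" one would simply raise the snack probabilities and carb means, yielding larger disturbances for the policy to reject.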
We use the above-described sign test to evaluate, for each configuration, whether there are significant performance differences between using training-time and never-before-seen disturbances. The two sets of observations are those from Table 7 and those from SLP-A in Table 4. The p-values of the statistical tests are listed in Table 8.
We can observe that all p-values in the patient cohort experiments are very high, indicating that, as expected, no significant performance difference is observed in the patient cohort: SLP-A is able to generalize its control ability to a meal distribution unseen during training. In the other two patient configurations, the null hypotheses are also accepted, except in the second column.