1 Introduction
With the recent evolution of mobile health technologies, health scientists are increasingly interested in delivering interventions via notifications on mobile devices at the moments when they can most readily help the user prevent negative health outcomes and promote the adoption and maintenance of healthy behaviors. The type and timing of mobile health interventions should ideally adapt to the user’s context as collected in real time, e.g., the time of day, location, current activity, and stress level. This gives rise to the concept of a just-in-time adaptive intervention (JITAI) [nahum2016just]. Operationally, a JITAI includes a sequence of decision rules (i.e., a treatment policy) that takes the user’s current context as input and specifies whether and what type of intervention should be provided at the moment. In practice, behavioral theory, along with expert opinion and analyses of existing data, is often used to design the decision rules. However, these theories are often insufficiently mature to specify precisely which intervention should be delivered, and when, in order to ensure that the interventions have the intended effects and to optimize their long-term efficacy. As a result, there is much interest in how best to use data to inform the design of JITAIs [ghosh2017misspecified, tewari2017ads, bekiroglu2016control, rivera2018intensively, martin2018development, yom2017, paredes2014poptherapy, forman2018can, rabbi2015mybehavior, zhou2018personalizing].
This paper develops a Reinforcement Learning (RL) algorithm to continuously learn (i.e., online) and optimize the treatment policy in the JITAI as the user experiences the intervention. This work is motivated by our collaboration on the design of the HeartSteps V2 clinical trial for individuals who have stage 1 hypertension. In this clinical trial, the HeartSteps V2 RL algorithm learns whether to deliver a context-tailored physical activity suggestion as the trial progresses.
The remainder of the paper is organized as follows. We first describe HeartSteps, including HeartSteps V1 and the current, in-progress clinical trial, HeartSteps V2. We then briefly review RL and identify key challenges in applying RL to optimize JITAI treatment policies in mobile health. Existing mobile health studies that used RL are reviewed, as well as related RL algorithms. We then describe the proposed HeartSteps V2 RL algorithm, its implementation, and an evaluation of the algorithm using a generative model built on HeartSteps V1 data. We discuss the performance of the proposed algorithm based on the initial pilot data from HeartSteps V2. We close with a discussion of future work.
2 HeartSteps V1 and V2: Physical Activity Mobile Health Study
HeartSteps V2 is an ongoing 90-day physical activity clinical trial for improving the physical activity of individuals with blood pressure in the stage 1 hypertension range (120-130 systolic). In this trial, participants are provided a Fitbit tracker and a mobile phone application designed to help them improve their physical activity. The participant first wears the Fitbit tracker for one week and then installs the mobile app in the second week. One of the interventions is a contextually tailored physical activity suggestion that may be delivered at any of five user-specified times during each day. These five times are separated by roughly 2.5 hours, corresponding to the user’s morning commute, mid-day, mid-afternoon, evening commute, and post-dinner times. The content of the suggestion is designed to encourage activity in the current context, and thus the suggestions are intended to impact near-time physical activity. The RL algorithm developed in this paper is used both to decide at each time whether to send the activity suggestion and to optimize these decisions. HeartSteps V2 is currently being deployed in the field. We provide an initial assessment of the proposed algorithm in Section 7.
In order to design HeartSteps V2, our team conducted HeartSteps V1, a 42-day physical activity study involving 37 healthy sedentary adults [Liaoetal2015, KlasnjaAnnals, klasnja2015microrandomized, Dempsey_Significance]. In HeartSteps V1, whether to provide a tailored activity suggestion was randomized at each of the 5 times per day with a constant probability of 0.30. The data collected from HeartSteps V1 are used in this paper (1) to inform the design of the RL algorithm for HeartSteps V2 (e.g., selecting the variables that are predictive of future step counts and of the efficacy of the activity suggestion, and forming a prior distribution) and (2) to create a simulation environment (i.e., the generative model) in order to evaluate the RL algorithm. See sections 5.5 and 6.
3 Challenges to Applying RL in mHealth
Reinforcement Learning (RL) is an area of Machine Learning in which an algorithm learns how to act optimally by continuously interacting with an unknown environment [sutton2018reinforcement]. The algorithm takes the current state as input, selects the next action, and receives the reward, with the goal of learning the best sequence of actions (i.e., the policy) to maximize the total reward. For example, in the case of HeartSteps, the state is a set of features of the user’s current and past context, the action is whether to deliver an activity suggestion, and the reward is a function of near-time physical activity. A fundamental challenge in RL is the tradeoff between exploitation (i.e., selecting the action that appears best given the data observed so far) and exploration (i.e., gathering information to learn the best action). RL has seen rapid development in recent years and has shown remarkable success across many fields, e.g., video games, chess playing, and robotic control. However, many challenges remain that need to be carefully addressed before RL can be usefully deployed to adapt and optimize mobile health interventions. Below we discuss some of these challenges.
(C1) The RL algorithm must adjust for the longer-term effects of current actions. In mobile health, interventions often tend to have a positive effect on the immediate reward but are likely to have a negative impact on future rewards due to user habituation and/or burden [Klasnja2008, dimitrijevic1972habituation]. As such, the optimal treatment can only be identified by taking into account the impact of the current action on future rewards. This is akin to using a large discount rate (i.e., a long planning horizon) in RL.

(C2) The RL algorithm should learn quickly and accommodate noisy data. Most online RL algorithms require the agent to interact many times with the environment before performing well. This is impractical in mobile health applications, as users can lose interest and disengage quickly. Furthermore, because mobile health interventions are provided in uncontrolled, in situ, complex environments, both context information and rewards can be very noisy. For example, step count data collected from a wrist band are noisy due to a variety of confounds, including incidental hand movements. Additionally, the sensors do not detect the entire context of the user; non-sensed aspects of the current context act as sources of variance. Such high-noise settings typically require even more interactions with the environment to select the optimal action. Additionally, consideration of challenge 1 motivates a long planning horizon. However, it has been shown, in both practice and theory, that a discount rate close to 1 often leads to high variance and slow learning rates [lehnert2018value, jiang2015dependence, arumugam2018mitigating, franccois2015discount]. This results in the need to carefully trade off bias and variance when designing the RL algorithm.

(C3) The RL algorithm should accommodate some model misspecification and nonstationarity. Due to the complexity of the context space and unobserved aspects of the current context (e.g., engagement or burden), the mapping from context to reward is likely to exhibit nonstationarity over longer periods of time. Indeed, in the analysis of HeartSteps V1, there is strong evidence that the treatment effect of sending an activity suggestion on subsequent activity decreases with the time the user is in the study [KlasnjaAnnals], thus providing evidence of nonstationarity.

(C4) The RL algorithm should select actions so that secondary data analyses are feasible after the study is over. This is particularly the case for experimental trials involving clinical populations. In these settings, an interdisciplinary team is required to design the intervention and to conduct the clinical trial. As a result, multiple stakeholders will want to analyze the resulting data in a large variety of ways. Thus, for example, off-policy learning [thomas2016data, jiang2015doubly] and causal inference [boruvka2018assessing], as well as other more standard statistical analyses, must be feasible after the study ends.
4 Existing RL-based Mobile Health Studies
There are few existing mobile health studies in which RL methods are applied to adapt the individual’s intervention in real time. Here we only focus on the setting where the treatment policy is not prespecified, but instead continuously learned and improved as more data is collected.
In [yom2017], an RL system was deployed to choose among different types of daily suggestions to encourage physical activity in patients with diabetes in a 26-week study. The authors use a contextual bandit learning algorithm combined with a Softmax approach to select the actions (daily suggestions), with the goal of maximizing increased minutes of activity. Paredes et al. [paredes2014poptherapy] employed a contextual bandit learning algorithm combined with an Upper Confidence Bound approach to select among 10 types of stress management strategies when the participant requests an intervention in the mobile app, with the goal of maximizing stress reduction. A recent weight loss study is reported in [forman2018can], in which one of three types of interventions is chosen twice a week over a 12-week period. Their RL system features an explicit separation between exploration and exploitation: 10 decision times are predetermined for exploration (i.e., randomly selecting the intervention at each decision time) and the remaining 14 decision times are used for exploitation (i.e., choosing the intervention that maximizes the designed reward based on the history). MyBehavior [rabbi2015mybehavior], a smartphone app that delivered personalized interventions for promoting physical activity and dietary health, used EXP3, a multi-armed bandit algorithm (i.e., context-free), to select the interventions. While the RL methods in the aforementioned studies aim to select actions so as to optimize the immediate reward, in a recent physical activity study reported in [zhou2018personalizing], the RL system at the end of every week uses the participant’s historical daily step count data to estimate a dynamical system for the daily step count and uses it to infer the optimal daily step goals for the next 7 days, with the goal of maximizing the minimal number of steps taken in the next week.
We argue that these RL algorithms are insufficient to address the challenges listed in Section 3 and thus need to be generalized in several directions. First, the above-mentioned studies use only a pure data collection phase to initialize the RL algorithms; however, there are often additional data from other participants, such as data from a pilot study, as well as prior expert knowledge. Consideration of challenge 2 implies that it is critical to incorporate this prior knowledge to speed up learning in the early phase of the study. Second, the RL algorithms in these studies require knowledge of the correct model for the reward function, which is unlikely to be available given the dimension and complexity of the context space and the potential nonstationarity in challenge 3. It has been shown empirically in the RL literature that the performance of standard RL algorithms is quite sensitive to the model for the reward function [ghosh2017misspecified, dimakopoulou2017estimation, mintz2017non]. Third, among the above-mentioned studies, only the algorithm used in [zhou2018personalizing] attempts to optimize rewards over a time period longer than the immediate time step. There is a bias-variance tradeoff in deciding how far into the future the RL algorithm should attempt to optimize rewards. That is, focusing only on maximizing the immediate reward speeds up learning (due to lower estimation variance) compared with a full RL algorithm that attempts to maximize over a longer time horizon. However, an RL algorithm focused on optimizing the immediate reward might end up sending too many treatments due to challenge 1 (i.e., the treatment tends to have a positive effect on the immediate reward and negative effects on future rewards), leading to poorer overall performance (akin to bias) than an algorithm that attempts to optimize over a longer time horizon to account for treatment burden and disengagement.
Lastly, both [paredes2014poptherapy] and [zhou2018personalizing] use algorithms that select the action deterministically based on the history, and [forman2018can] incorporates a pure exploitation phase. It is known that action selection probabilities close to 0 or 1 cause instability (i.e., high variance) in batch data analyses that use importance weights, e.g., in off-policy evaluation [thomas2016data, jiang2015doubly]. This complicates challenge 4.
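To illustrate why selection probabilities near 0 or 1 are a problem for importance-weighted analyses, the following minimal sketch computes the exact variance of a single-sample inverse-probability-weighted estimate for a binary action. The function name `ipw_variance` and the toy setup (a fixed reward under the action of interest) are illustrative assumptions and are not taken from any of the cited studies.

```python
def ipw_variance(p: float, reward: float = 1.0) -> float:
    """Variance of the single-sample estimate 1{A=1} * reward / p,
    where the logged action A ~ Bernoulli(p). Toy setup (assumption)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # E[estimate] = reward and E[estimate^2] = p * (reward / p)^2 = reward^2 / p
    return reward ** 2 / p - reward ** 2


if __name__ == "__main__":
    # The variance grows without bound as the logging probability shrinks.
    for p in (0.5, 0.1, 0.01):
        print(f"p={p}: variance={ipw_variance(p):.1f}")
```

The variance scales like 1/p, which is why bounding the randomization probabilities away from 0 and 1 keeps such post-study analyses stable.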
5 Reinforcement Learning Algorithm in HeartSteps V2
In this section, we discuss the design of the RL algorithm in HeartSteps V2; this algorithm determines whether to send the activity suggestion at each decision time. We first give an overview of how the proposed algorithm addresses the challenges introduced in section 3. We then specify each component in our setting, i.e., the decision times, actions, states, and rewards, and formally introduce the proposed RL algorithm.
5.1 Addressing the Challenges
To address challenge 1, we introduce a “dosage” variable based on the history of past treatments. This is motivated by analyses of HeartSteps V1 in which moving to contexts with larger recent dosage appears to result in a smaller immediate effect of treatment and lower future rewards. A similar “dosage” variable was explored in a recent unpublished manuscript [mintz2017non], which developed a bandit algorithm called ROGUE (Reducing or Gaining Unknown Efficacy) Bandits. There, the “dosage” idea is used to accommodate settings in which an (unknown) dosage variable causes nonstationarity in the reward function. Our use of dosage, on the other hand, is to form a proxy of the future rewards, in order to mimic a full RL setting (as opposed to the bandit setting) while managing variance in consideration of challenge 2. We construct a proxy of the future rewards (proxy value) under a low-dimensional proxy MDP model. Model-based RL is well studied in the RL literature [osband2013more, fonteneau2013optimistic, ouyang2017learning]. In these papers, the algorithm uses a model for the transition function from the current state and action to the next state. In contrast, the algorithm proposed in this paper uses the MDP model only to provide a low-variance proxy that adjusts for the longer-term impact of actions on future rewards.
To further meet challenge 2, a low-dimensional linear model is used to model differences in the reward function under alternate actions, together with Thompson Sampling (TS). The low-dimensional model trades off bias and variance to accelerate learning. TS is a general algorithmic idea that uses a Bayesian paradigm to trade off exploration and exploitation [russo2018tutorial, russo2014learning]. The use of TS allows us to incorporate prior knowledge in the algorithm through a prior distribution on the parameters. We propose using an informative prior distribution to speed up learning in the early phase of the study, as well as to reduce the variance and diminish the impact of noisy observations. Note that TS-based algorithms have been shown to enjoy not only strong theoretical performance guarantees but also strong empirical performance in many problems when compared to other state-of-the-art methods, such as Upper Confidence Bound [kaufmann2012thompson, chapelle2011empirical, osband2016posterior, osband2017optimistic].
To deal with challenge 3, we use the idea of action centering in modelling the reward. The motivation is to protect the RL algorithm from a misspecified model for the “baseline” reward function (e.g., in the HeartSteps example with binary actions, the baseline reward function is the expected subsequent 30-minute step count given the current state and no activity suggestion). The idea of action centering in RL was first explored in [greenewald2017action] and recently improved in [krishnamurthy2018semiparametric]. In both works, the RL algorithm is theoretically guaranteed to learn the optimal action under no assumption about the baseline reward generating process (e.g., the baseline reward function can be nonstationary). However, neither of these methods attempts to reduce the noise in the reward. We generalize action centering for use in higher-variance, nonstationary reward settings.
Lastly, in consideration of challenge 4, the actions in our proposed RL algorithm are selected stochastically via TS and furthermore we bound the TS probabilities away from 0 and 1 to ensure the ability to conduct secondary analyses when the study is over.
5.2 Reinforcement Learning Framework
Let the participant’s longitudinal data recorded via mobile device be the sequence
$$\{S_1, A_1, R_1, \ldots, S_t, A_t, R_t, \ldots\}.$$
Here $t$ indexes decision time. In HeartSteps V1, as in the planned HeartSteps V2, there are five decision times each day. We also use $(l, d)$ to refer to the $l$-th decision time on study day $d$. For example, $(5, 3)$ refers to the 5th time on day 3, which corresponds to time $t = 15$. $A_t$ is the action or treatment at time $t$. The treatment is binary (i.e., the action space is $\mathcal{A} = \{0, 1\}$): $A_t = 1$ if an activity suggestion is delivered and $A_t = 0$ otherwise. $R_t$ is the immediate reward collected after action $A_t$. In HeartSteps, the reward is the log transformation of the step count collected in the 30 minutes after the decision time.
$S_t$ is the state vector at decision time $t$. We decompose the state vector as $S_t = (I_t, Z_t, X_t)$. $I_t$ is used to indicate times at which only $A_t = 0$ is feasible and/or ethical. For example, if sensors indicate that the participant might be driving a car, then the suggestion should not be sent; that is, the participant is unavailable for treatment ($I_t = 0$). $Z_t$ denotes the features used to represent the current context at time $t$. In HeartSteps, these features include current location, the prior 30-minute step count, yesterday’s daily step count, the current temperature, and measures of how active the participant has been around the current decision time over the last week. Lastly, $X_t$ is the “dosage” variable that captures our proxy for the treatment burden, defined based on the participant’s treatment history. In contrast to HeartSteps V1, in HeartSteps V2 an additional intervention component, an anti-sedentary suggestion, will sometimes be delivered when the participant is sedentary. As the anti-sedentary suggestion, in addition to the activity suggestions, can cause burden, it is included in defining the dosage variable. Specifically, denote by $E_t$ the event that an activity suggestion is sent at decision time $t-1$ (i.e., $A_{t-1} = 1$) or any anti-sedentary suggestion is sent between times $t-1$ and $t$. The dosage is constructed by first multiplying the previous dosage by a decay factor $\lambda$ and then incrementing it by 1 if any suggestion was sent to the user since the last decision time. Specifically, starting with the initial value $X_1$, the dosage at time $t+1$ is defined as $X_{t+1} = \lambda X_t + 1_{E_{t+1}}$. The value of $\lambda$ is chosen based on the data analysis results from HeartSteps V1; see section 5.5 for how this value is selected.
At each decision time, the RL algorithm selects the action based on each participant’s current history (i.e., the past states, actions, and rewards), with the goal of optimizing the total reward during the process. The proposed algorithm is stochastic; that is, the algorithm outputs a probability of selecting an action. Denote the history up to the end of day $d$ by $H_d$.
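The dosage recursion described above (decay the previous dosage, then add 1 if any suggestion was sent since the last decision time) can be sketched as follows. This is a minimal illustration only: the function names, the decay value used in the example, and the initial dosage of 0 are assumptions, not values taken from HeartSteps.

```python
from typing import Iterable, List


def update_dosage(prev_dosage: float, suggestion_sent: bool, decay: float) -> float:
    """One step of the recursion: new = decay * prev + 1{any suggestion was sent}."""
    if not 0.0 <= decay < 1.0:
        raise ValueError("decay must be in [0, 1)")
    return decay * prev_dosage + (1.0 if suggestion_sent else 0.0)


def dosage_path(events: Iterable[bool], decay: float, initial: float = 0.0) -> List[float]:
    """Dosage after each decision time, given whether any suggestion
    (activity or anti-sedentary) was sent since the previous one.
    The default initial dosage of 0 is an assumption."""
    path, x = [], initial
    for sent in events:
        x = update_dosage(x, sent, decay)
        path.append(x)
    return path


if __name__ == "__main__":
    # Hypothetical decay factor, for illustration only.
    print(dosage_path([True, False, True, True, False], decay=0.5))
```

Because the increments are at most 1 and the decay factor is below 1, a dosage that starts at 0 never exceeds 1 / (1 - decay), so the variable stays on a bounded scale.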
The RL algorithm consists of two components: (1) the nightly update, i.e., $H_d \mapsto (\mu_d, \Sigma_d, \eta_d)$, where $(\mu_d, \Sigma_d)$ denote the parameters of the posterior distribution for the reward and $\eta_d$ proxies the delayed effect on future rewards, both calculated at the end of the previous day (see below for more details), and (2) the probability $\pi_{l,d}$ of selecting the action (i.e., $A_{l,d}$ is sampled from a Bernoulli distribution with probability $\pi_{l,d}$). Note that at the beginning of the study (i.e., $d = 1$), both the distribution $N(\mu_0, \Sigma_0)$ and the proxy of the delayed effect $\eta_0$ are set based on HeartSteps V1; see details in section 5.5. Throughout, without loss of generality, we implicitly assume that the probability $\pi_t$ is part of the state $S_t$. The pseudocode of the proposed HeartSteps V2 RL algorithm is provided in Figure 1.
5.3 Action Selection
The reward function is given by $r(s, a) = E[R_t \mid S_t = s, A_t = a]$. The action selection developed here is based on a low-dimensional linear model (challenge 2) for the treatment effect:
$$r(s, 1) - r(s, 0) = f(s)^\top \beta, \qquad (1)$$
where the feature vector $f(s)$ is selected based on the domain science as well as on data analyses using HeartSteps V1; see section 5.5 for a discussion of how the features are selected. At the $l$-th decision time on day $d$, availability is ascertained (i.e., $I_{l,d} = 1$). Then for $S_{l,d} = s$ with the dosage variable $X_{l,d} = x$, the action $A_{l,d}$ is selected based on
$$\Pr\{ f(s)^\top \tilde{\beta} > \eta_{d-1}(x) \},$$
where the random variable $\tilde{\beta}$ follows a Normal distribution $N(\mu_{d-1}, \Sigma_{d-1})$, i.e., the posterior distribution of the parameters $\beta$ obtained at the end of the previous day. The term $\eta_{d-1}(x)$ proxies the long-term, negative effect of delivering the activity suggestion at the moment given the current dosage level $x$ (see the detailed formulation of $\eta_d$ in section 5.4.2). Note that when $\eta_{d-1} = 0$, we recover the bandit formulation, i.e., the action is selected to maximize the immediate reward, ignoring any impact on future rewards. The probability of sending an activity suggestion, $\pi_{l,d}$ (for $I_{l,d} = 1$, $S_{l,d} = s$, $X_{l,d} = x$), is a clipped version:
$$\pi_{l,d} = \epsilon\big(\Pr\{ f(s)^\top \tilde{\beta} > \eta_{d-1}(x) \}\big). \qquad (2)$$
The clipping function is $\epsilon(p) = \min(1 - \epsilon_0, \max(p, \epsilon_1))$. This restricts the randomization probabilities of sending nothing and of sending an activity suggestion to be at least $\epsilon_0$ and $\epsilon_1$, respectively. The probability clipping enables off-policy data analyses after the study is over (challenge 4) and, furthermore, ensures that the RL algorithm will continue to explore and learn instead of locking itself into a particular policy (challenge 3). In HeartSteps V2, $\epsilon_0$ and $\epsilon_1$ are fixed in advance by the scientific team; see section 5.5.
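The clipped selection probability described above can be sketched in a few lines. Under a Normal posterior on the treatment-effect parameters, the linear combination of features is itself Normal, so the probability that it exceeds the delayed-effect proxy has a closed form. All names below (`suggestion_probability`, `eps0`, `eps1`, and so on) and the numbers in the example are illustrative assumptions rather than the trial's actual implementation or settings.

```python
import math
from typing import Sequence


def normal_cdf(z: float) -> float:
    """Standard Normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def clip_probability(p: float, eps0: float, eps1: float) -> float:
    """Clip p so that P(send) >= eps1 and P(do not send) >= eps0."""
    return min(1.0 - eps0, max(p, eps1))


def suggestion_probability(f: Sequence[float], mu: Sequence[float],
                           cov: Sequence[Sequence[float]], eta: float,
                           eps0: float, eps1: float, available: bool = True) -> float:
    """Clipped P(f' beta > eta) for beta ~ N(mu, cov); 0 when unavailable."""
    if not available:
        return 0.0
    k = len(f)
    mean = sum(f[i] * mu[i] for i in range(k))                            # f' mu
    var = sum(f[i] * cov[i][j] * f[j] for i in range(k) for j in range(k))  # f' cov f
    if var <= 0.0:
        raw = 1.0 if mean > eta else 0.0  # degenerate posterior
    else:
        raw = normal_cdf((mean - eta) / math.sqrt(var))
    return clip_probability(raw, eps0, eps1)


if __name__ == "__main__":
    # Hypothetical features, posterior, proxy, and clipping constants.
    p = suggestion_probability(f=[1.0, 0.3], mu=[0.2, -0.1],
                               cov=[[0.04, 0.0], [0.0, 0.09]],
                               eta=0.05, eps0=0.2, eps1=0.1)
    print(round(p, 3))
```

Setting the proxy `eta` to zero gives the bandit version of the rule, in which only the immediate treatment effect matters.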
5.4 Nightly Updates
The posterior distribution of $\beta$ for the immediate treatment effect and the proxy for the delayed effect are updated at the end of each day. Operationally, the nightly update is a mapping $H_d \mapsto (\mu_d, \Sigma_d, \eta_d)$ that takes the current history up to day $d$ as input and outputs the posterior distribution and the proxy of the delayed effect, which are used in the action selection on the following day (i.e., during day $d+1$). We discuss each of them in turn.
5.4.1 Posterior Update of Immediate Treatment Effect
We use the following linear Bayesian regression “working model” for the reward to derive the posterior distribution for the treatment effect:
$$R_t = g(S_t)^\top \alpha_0 + \pi_t f(S_t)^\top \alpha_1 + (A_t - \pi_t) f(S_t)^\top \beta + \epsilon_t, \qquad (3)$$
where $\epsilon_t$ is Gaussian noise with variance $\sigma^2$, so that the working model for the reward function is $r(s, a) = g(s)^\top \alpha_0 + \pi f(s)^\top \alpha_1 + (a - \pi) f(s)^\top \beta$. The baseline feature vector $g(s)$ is used to approximate the baseline reward function:
$$r(s, 0) \approx g(s)^\top \alpha_0. \qquad (4)$$
The baseline feature vector $g(s)$ is selected based on the domain science and data analyses using HeartSteps V1; see section 5.5 for a discussion. The use of $\pi_t$ in (3) is unusual but provides a number of advantages. First, consider the action-centered term, $(A_t - \pi_t) f(S_t)^\top \beta$, in the working model (3). As long as the treatment effect model (1) is correctly specified, the estimator of $\beta$ based on model (3) is guaranteed to be unbiased even when the baseline reward model (4) is incorrect [boruvka2018assessing], for example, due to nonlinearity in $g(s)$ or nonstationarity ($\alpha_0$ changes over time). That is, through the use of action centering, we achieve robustness against misspecification of the approximate baseline model (4), addressing challenge 3. The rationale for including the term $\pi_t f(S_t)^\top \alpha_1$ in the Bayesian regression working model (3) is to capture the time-varying aspect of the main effect due to the action-centered term (i.e., $\pi_t$ is continuously updated during the study). Omitting this term would reduce the number of parameters in the model, but we have found in experiments that including $\pi_t f(S_t)^\top \alpha_1$ reduces the variance of the treatment effect estimates and thus speeds up learning. Second, in the case where the treatment effect model (1) is incorrect, for example, when the treatment effect is nonlinear in $f(s)$ or is nonstationary (with time-varying $\beta$), it can be shown [boruvka2018assessing] that the Bayesian regression provides a linear approximation to the treatment effect. When the action is not centered, the treatment effect estimates may not converge to any useful approximation at all, which could lead to poor performance in selecting the action.
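The Bayesian model requires prior distributions on $(\alpha_0, \alpha_1)$ and $\beta$. Here the priors are independent and Gaussian, with $\alpha_0 \sim N(\mu_{\alpha_0}, \Sigma_{\alpha_0})$, $\alpha_1 \sim N(\mu_{\alpha_1}, \Sigma_{\alpha_1})$, and $\beta \sim N(\mu_\beta, \Sigma_\beta)$; see section 5.5 for a discussion of how informative priors (challenge 2) are constructed using HeartSteps V1 data. Because the priors are Gaussian and the error in (3) is Gaussian, the posterior distribution of $\beta$ given the current history $H_d$ is also Gaussian, denoted by $N(\mu_d, \Sigma_d)$. Below we provide the details of the calculation of $(\mu_d, \Sigma_d)$. We first calculate the posterior distribution of all parameters, $\theta = (\alpha_0, \alpha_1, \beta)$, from which the posterior distribution of $\beta$ can then be identified. The posterior distribution of $\theta$, denoted by $N(\mu_d^{\theta}, \Sigma_d^{\theta})$, given the current history $H_d$ can be found by
$$\Sigma_d^{\theta} = \Big( \frac{1}{\sigma^2} \sum_{t} \phi_t \phi_t^\top + \Sigma_\theta^{-1} \Big)^{-1}, \qquad (5)$$
$$\mu_d^{\theta} = \Sigma_d^{\theta} \Big( \frac{1}{\sigma^2} \sum_{t} \phi_t R_t + \Sigma_\theta^{-1} \mu_\theta \Big), \qquad (6)$$
where the sums run over the available decision times in $H_d$, $\phi_t = \big(g(S_t)^\top, \pi_t f(S_t)^\top, (A_t - \pi_t) f(S_t)^\top\big)^\top$ denotes the joint feature vector, and $(\mu_\theta, \Sigma_\theta)$ are the prior mean and variance of $\theta$, i.e., $\mu_\theta = (\mu_{\alpha_0}, \mu_{\alpha_1}, \mu_\beta)$ and $\Sigma_\theta = \mathrm{diag}(\Sigma_{\alpha_0}, \Sigma_{\alpha_1}, \Sigma_\beta)$. Suppose the size of $f(s)$ is $p$. Then the posterior mean of $\beta$, $\mu_d$, is the last $p$ elements of $\mu_d^{\theta}$, and the posterior variance of $\beta$, $\Sigma_d$, is the bottom-right $p \times p$ block of $\Sigma_d^{\theta}$.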
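The nightly posterior update is a standard conjugate Bayesian linear regression applied to the action-centered joint features. The sketch below, which assumes NumPy is available, shows one way to compute it and then extract the block for the treatment-effect parameters. The function names and the toy numbers are assumptions for illustration, and the noise variance is treated as known, as in the text.

```python
import numpy as np


def joint_features(g, f, pi, a):
    """Stack [g, pi * f, (a - pi) * f] for one available decision time."""
    g, f = np.asarray(g, dtype=float), np.asarray(f, dtype=float)
    return np.concatenate([g, pi * f, (a - pi) * f])


def posterior_update(Phi, R, prior_mean, prior_cov, noise_var):
    """Gaussian posterior (mean, covariance) of all regression parameters.

    Phi: (n, k) matrix of joint feature vectors; R: (n,) rewards.
    """
    Phi = np.asarray(Phi, dtype=float).reshape(-1, len(prior_mean))
    R = np.asarray(R, dtype=float)
    prior_prec = np.linalg.inv(np.asarray(prior_cov, dtype=float))
    post_cov = np.linalg.inv(Phi.T @ Phi / noise_var + prior_prec)
    post_mean = post_cov @ (Phi.T @ R / noise_var + prior_prec @ np.asarray(prior_mean, dtype=float))
    return post_mean, post_cov


def treatment_effect_posterior(post_mean, post_cov, p):
    """Last p entries / bottom-right p x p block: posterior of the treatment effect."""
    return post_mean[-p:], post_cov[-p:, -p:]


if __name__ == "__main__":
    # Toy data: 1 baseline feature and 1 treatment feature, so k = 3 parameters.
    Phi = [joint_features([1.0], [1.0], pi=0.5, a=a) for a in (1, 0, 1)]
    mean, cov = posterior_update(Phi, R=[2.0, 1.0, 2.5],
                                 prior_mean=np.zeros(3), prior_cov=np.eye(3), noise_var=1.0)
    print(treatment_effect_posterior(mean, cov, p=1))
```

With no data the update returns the prior unchanged, which matches how the algorithm behaves on the first study day.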
5.4.2 Proxy Delayed Effect on Future Rewards
The proxy is formed based on a simple Markov Decision Process (MDP) for the states $S_t = (I_t, Z_t, X_t)$, in which we make the following working assumptions:

(S1) the context $Z_t$ is i.i.d. with distribution $F$,

(S2) the availability $I_t$ is i.i.d. with probability $p_{\text{avail}}$,

(S3) the dosage variable $X_t$ makes transitions according to a transition model $\tau(x' \mid x, a)$,

(S4) the mean reward given $S_t = s$ and $A_t = a$ is $r(s, a)$.
We use this simple MDP to capture the delayed effect of sending the treatment on future rewards. Note that in this model the action impacts future rewards only through the dosage, since the context is assumed to be independent of the actions; this allows us to form an estimate of the delayed effect of treatment based on the current dosage. We assume that the context and availability are both i.i.d. across time. The i.i.d. assumption reduces the variance of the estimator of the delayed effect, because we do not also have to learn a transition model for the context and availability.
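We first discuss how each component of the simple MDP is constructed. Given the history up to the end of day $d$, $H_d$, we set (1) the availability probability to the average availability observed so far, $\hat{p}_{\text{avail}} = \frac{1}{5d} \sum_{t=1}^{5d} I_t$, (2) the distribution of the context to the empirical distribution $\hat{F} = \frac{1}{5d} \sum_{t=1}^{5d} \delta_{Z_t}$, where $\delta_z$ is the Dirac measure, and (3) the reward function at available decision times to $\hat{r}(s, a) = g(s)^\top \hat{\alpha}_0 + a\, f(s)^\top \hat{\beta}$, where $\hat{\alpha}_0$ and $\hat{\beta}$ are the posterior means based on model (3). The mean reward at unavailable decision times has the same form, but with posterior means from a similar linear Bayesian regression using the unavailable time points in $H_d$. To complete the description of the MDP, we need to specify the transition model $\tau$ for the dosage variable $X_t$. Recall that the dosage variable is defined at the beginning of section 5.2. Let $p_{\text{sed}}$ be the probability of delivering any anti-sedentary suggestion between two decision times given that no activity suggestion was sent at the previous decision time. We set $p_{\text{sed}}$ based on the planned scheduling of anti-sedentary suggestions (an average of 1 anti-sedentary suggestion uniformly distributed in a 12-hour time window during the day). Then $\tau$ is given by $\tau(\lambda x + 1 \mid x, 1) = 1$, $\tau(\lambda x + 1 \mid x, 0) = p_{\text{sed}}$, and $\tau(\lambda x \mid x, 0) = 1 - p_{\text{sed}}$. Recall from section 5.2 that $\lambda$ is the decay factor in the dosage update.
We formulate the proxy of the delayed effect based on the MDP constructed above as follows. Consider an arbitrary policy $\pi$ that chooses the action $\pi(s)$ at state $s$ if available (i.e., $i = 1$) and chooses action $0$ otherwise. Recall the state-action value function for policy $\pi$ under discount rate $\gamma$:
$$Q^{\pi}(s, a) = E_{\pi}\Big[ \sum_{k \ge 0} \gamma^{k} R_{t+k} \,\Big|\, S_t = s, A_t = a \Big],$$
where the subscript $\pi$ means that the actions are selected according to the policy $\pi$. Also recall the state value function $V^{\pi}(s) = Q^{\pi}(s, \pi(s))$. The value function is divided into two parts, $Q^{\pi}(s, a) = r(s, a) + H^{\pi}(x, a)$, where $r(s, a)$ is the expected reward in (S4) and
$$H^{\pi}(x, a) = E_{\pi}\Big[ \sum_{k \ge 1} \gamma^{k} R_{t+k} \,\Big|\, X_t = x, A_t = a \Big]$$
is the sum of future discounted rewards (future value for short). $H^{\pi}(x, a)$ excludes the first, immediate reward ($k = 0$) and is only a function of $(x, a)$ under the working assumptions (S1) and (S2). Note that the difference $H^{\pi}(x, 0) - H^{\pi}(x, 1)$ measures the impact of sending treatment at dosage $x$ on future rewards in the setting in which future actions are selected by policy $\pi$. We select the policy $\pi$ to maximize the future value under the constraint that $\pi$ depends only on the dosage and availability. Specifically, let $H^{*}(x, a) = \max_{\pi} H^{\pi}(x, a)$. It can be shown that $H^{*}$ is given by $H^{*}(x, a) = \gamma \sum_{x'} \tau(x' \mid x, a) \big[ p_{\text{avail}} V^{*}(x', 1) + (1 - p_{\text{avail}}) V^{*}(x', 0) \big]$, where the bivariate function $V^{*}(x, i)$ solves the following equations:
$$V^{*}(x, i) = \max_{a \in \mathcal{A}(i)} \Big\{ \bar{r}(x, i, a) + \gamma \sum_{x'} \tau(x' \mid x, a) \big[ p_{\text{avail}} V^{*}(x', 1) + (1 - p_{\text{avail}}) V^{*}(x', 0) \big] \Big\}$$
for all $x$ and $i \in \{0, 1\}$, where $\mathcal{A}(i)$ is the constrained action space based on availability, i.e., $\mathcal{A}(1) = \{0, 1\}$ and $\mathcal{A}(0) = \{0\}$, and $\bar{r}$ is the marginal reward function (marginal in the sense that it depends on the state only through the dosage variable and availability) given by $\bar{r}(x, i, a) = \int r((i, z, x), a) \, dF(z)$. Finally, the proxy for the delayed effect is calculated by
$$\eta_d(x) = \bar{H}_d(x, 0) - \bar{H}_d(x, 1), \qquad (7)$$
where $\bar{H}_d = (1 - w) H_0 + w \hat{H}_d$ is the weighted average between the estimate $\hat{H}_d$ (i.e., $H^{*}$ computed from the MDP constructed from $H_d$) and the initial function $H_0$ calculated based only on data from HeartSteps V1. The selection of the discount rate $\gamma$ and the weight $w$ will be discussed in section 5.5. This delayed effect is the mean difference in the discounted future rewards between sending nothing and sending an activity suggestion. From here we see that in (2) the action at each decision time is essentially selected to maximize the sum of discounted rewards, i.e., $r(s, a) + \bar{H}_{d-1}(x, a)$.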
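As a rough illustration of how a delayed-effect proxy of this kind might be computed, the sketch below runs value iteration on a dosage-only MDP and returns the difference in discounted future value between not sending and sending a suggestion. It makes several assumptions that are not in the text: the dosage is discretized onto a fixed grid by nearest-point projection, the marginal reward is supplied as a plain function, and every numeric setting in the example is hypothetical. The blending with an initial estimate via a weight is omitted.

```python
def nearest_index(grid, x):
    """Index of the grid point closest to x (grid discretization is an assumption)."""
    return min(range(len(grid)), key=lambda j: abs(grid[j] - x))


def proxy_delayed_effect(grid, reward, decay, p_avail, p_sed, gamma, n_iter=500):
    """Return eta(x) = H(x, 0) - H(x, 1) for each dosage x in grid.

    reward(x, i, a): marginal mean reward at dosage x, availability i, action a.
    """
    n = len(grid)
    up = [nearest_index(grid, decay * x + 1.0) for x in grid]  # some suggestion was sent
    down = [nearest_index(grid, decay * x) for x in grid]      # nothing was sent
    V = [[0.0, 0.0] for _ in range(n)]                         # V[j][i], i = availability

    def future(j, a, V):
        # Discounted expected next-state value, averaging over next availability.
        ev_up, ev_down = (p_avail * V[k][1] + (1.0 - p_avail) * V[k][0]
                          for k in (up[j], down[j]))
        p_up = 1.0 if a == 1 else p_sed  # anti-sedentary message may still arrive
        return gamma * (p_up * ev_up + (1.0 - p_up) * ev_down)

    for _ in range(n_iter):  # synchronous value iteration
        V = [[reward(grid[j], 0, 0) + future(j, 0, V),
              max(reward(grid[j], 1, a) + future(j, a, V) for a in (0, 1))]
             for j in range(n)]
    return [future(j, 0, V) - future(j, 1, V) for j in range(n)]


if __name__ == "__main__":
    grid = [0.5 * j for j in range(11)]  # dosage values 0.0, 0.5, ..., 5.0

    def burden_reward(x, i, a):
        # Hypothetical reward: treatment helps less as dosage accumulates.
        return i * a * (1.0 - 0.2 * x)

    eta = proxy_delayed_effect(grid, burden_reward, decay=0.8,
                               p_avail=0.8, p_sed=0.2, gamma=0.9)
    print([round(e, 3) for e in eta])
```

When the reward does not depend on dosage the proxy is zero everywhere, which corresponds to the bandit special case of the selection rule.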
5.5 Choosing Inputs
We review the inputs required by the HeartSteps V2 RL algorithm and discuss how each is selected based on the data collected from HeartSteps V1. The list of required inputs can be found in Figure 1.
First, the scientific team decided on the values of $\epsilon_0$ and $\epsilon_1$ in the probability clipping to ensure enough exploration, i.e., to force the RL algorithm to explore continuously without locking into a deterministic policy. As mentioned in section 5.2, we define the dosage in the form $X_{t+1} = \lambda X_t + 1_{E_{t+1}}$ (recall that this variable is used to form the proxy for the delayed effect in (2)). A Generalized Estimating Equations (GEE) analysis [Liang1986] was conducted using HeartSteps V1 data for a variety of values of $\lambda$. When $\lambda$ is relatively large, the dosage significantly impacts the effect of the activity suggestions on the subsequent 30-minute step count. The scientific team selected the value of $\lambda$ based on this analysis.
In the nightly posterior updates of the treatment effect estimates, the working model (3) requires the feature vectors $f(s)$ and $g(s)$ (standardized to be within [0, 1]) in (1) and (4), the variance of the noise $\sigma^2$, and the prior distributions of $(\alpha_0, \alpha_1)$ and $\beta$. We discuss below how each is chosen using HeartSteps V1 data.
First, the feature vectors are chosen based on the GEE results using HeartSteps V1 data. In particular, each feature is included in a marginal GEE model with the prior 30-minute step count in the main effect model (to reduce the variance). The feature is included in both the main effect and the treatment effect model. The procedure is done for each feature separately and the p-value is obtained. The feature is then selected into $f(s)$ and $g(s)$ at the significance level of 0.05. For example, we found that although the 30-minute step count prior to the decision time is highly predictive of the reward (i.e., the 30-minute step count after the decision time), it is not significant in terms of predicting the treatment effect. Therefore, the prior 30-minute step count is included in the baseline features $g(s)$, but not in the feature vector $f(s)$ for the treatment effect. A measure of how the participant engages with the mobile app (e.g., the daily number of screens that the participant encounters) is planned to be included in both $f(s)$ and $g(s)$. This variable was not collected in HeartSteps V1. The scientific team believes this variable likely interacts with the treatment and thus decided to include it in the features. The features in the feature vector $f(s)$ in (1) are dosage, app engagement, location, and the variation level of the step count in the 60 minutes around the current time slot over the past 7 days. These features, along with the prior 30-minute step count, yesterday’s total step count, and the current temperature, are included in the baseline feature vector $g(s)$.
Second, consider the variance of the noise. Although the noise variance can be learned on the fly (e.g., as the residual variance obtained by fitting the model to the data collected from the participant), the step count can be highly noisy. To ensure the stability of the algorithm, we therefore set the variance parameter using the data from HeartSteps V1; that is, the noise variance is not updated during the study.
Third, the prior is constructed based on the analysis results from HeartSteps V1. Specifically, we first conduct GEE regression analyses [Liang1986] using all participants’ data in HeartSteps V1 and assess the significance of each feature. To form the prior variance, we fit a separate GEE linear regression model for each participant and calculate the standard deviations of the point estimates across the 37 participant models. We form the prior mean and prior standard deviation as follows: (1) For the features that are significant in the GEE analysis using all participants’ data, we set the prior mean to the point estimate from this analysis and the prior standard deviation to the standard deviation across the participant-specific GEE analyses. (2) For the features that are not significant, we set the prior mean to zero and shrink the standard deviation by half. (3) For the app engagement variable, we set the prior mean to zero and the standard deviation to the average prior standard deviation of the other features. The prior covariance matrices are diagonal matrices with the above prior variances on the diagonals. The same procedure is applied to form the prior mean and variance for the reward model at the unavailable times, which is used in the proxy value updates. The rationale for setting the mean to zero and shrinking the standard deviation for the non-significant features is to ensure the stability of the algorithm: unless strong evidence or signal is detected from the participant during the HeartSteps V2 study, these features have only minimal impact on the selection of actions. In Section 6.1, we also apply the above procedure to construct the prior in the simulation.
Both the initial proxy delayed effect and the subsequent estimates of the proxy delayed effect require initial proxy value estimates. To calculate these, we use the same procedure as described in Section 5.4.2, except that the empirical probability of being available, the empirical distribution of contexts and the reward function are constructed using only HeartSteps V1 data.
Two remaining parameters in the HeartSteps V2 RL algorithm need to be specified: the discount rate and the updating weight parameter (both part of the proxy MDP in Section 5.4.2). For simplicity, we call them “tuning parameters” in the rest of the paper. These tuning parameters are difficult to specify directly, as the optimal choice likely depends on the noise level of the rewards, how the context varies over time and the length of the study. We propose to choose the tuning parameters with a simulation-based procedure. Specifically, we first build a simulation environment (i.e., the data generating model) using HeartSteps V1 data. We then apply the algorithm shown in Figure 1 with each candidate pair of tuning parameters. Finally, we choose the pair of tuning parameters that maximizes the total simulated reward. In Section 6.1, we discuss in detail how we form this generative model using HeartSteps V1 data.
6 Simulation Study
In this section, we use HeartSteps V1 data to conduct a simulation study that demonstrates (i) the validity of the procedure described in Section 5.5 for choosing the inputs, including the tuning parameters, (ii) the validity of using the proxy value in the proposed algorithm to address challenge 1, the negative delayed effect of treatments, and (iii) the validity of using action-centering to protect against model misspecification (challenge 3). The use of a previous dataset to build a simulation environment for evaluating an online algorithm is similar to [liao2018just]. In Section 7, we also provide an assessment of the proposed algorithm using pilot data from HeartSteps V2.
We consider a three-fold cross validation procedure. We partition the HeartSteps V1 dataset into three folds. In each of the three iterations, two folds are marked as the training batch and the third fold is marked as the testing batch. The training batch is used to (1) construct the prior distribution, (2) form an estimate of the noise variance, and (3) select the tuning parameters. We call this process the “training phase”. Note that the training batch serves the same purpose as HeartSteps V1. Next, the testing batch is used to construct a simulation environment in which the algorithm is tested with the estimated noise variance, prior and tuning parameters. The use of the testing batch is akin to applying the RL algorithm in HeartSteps V2. In Sections 6.1 and 6.2 below, we describe in greater detail how the training and testing batches are used in each iteration of the cross validation. The same procedure is applied three times, once per iteration.
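As an illustration of the partitioning step, the following minimal sketch splits participant IDs into three folds and yields the training and testing batch for each iteration. The function name `three_fold_splits`, the random shuffling and the seed are assumptions made for illustration; the text does not state how participants were assigned to folds.

```python
import random


def three_fold_splits(participant_ids, n_folds=3, seed=0):
    """Yield (training_batch, testing_batch) lists of participant IDs.

    Assumption: participants are assigned to folds at random; the paper
    does not specify the assignment rule.
    """
    ids = list(participant_ids)
    random.Random(seed).shuffle(ids)
    # Fold k takes every n_folds-th participant starting at position k.
    folds = [ids[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        testing = folds[k]
        training = [pid for j, fold in enumerate(folds) if j != k for pid in fold]
        yield training, testing


# HeartSteps V1 has 37 participants; the IDs 1..37 are placeholders.
splits = list(three_fold_splits(range(1, 38)))
```

Each participant appears in exactly one testing batch, which matches the statement in Section 6.2 that every participant is assigned to a testing batch once.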
We compare the performance with the Thompson Sampling (TS) Bandit algorithm, in a version similar to [agrawal2013thompson]. TS Bandit is a widely used RL algorithm that shows good performance in many real-world settings [chapelle2011empirical]. At each decision time, it selects the action probabilistically according to the posterior distribution of the reward, with the goal of maximizing the immediate reward. We choose TS Bandit as the comparator over other standard contextual bandit algorithms (e.g., LinUCB in [li2010contextual]) because TS Bandit is a stochastic algorithm, which better suits our setting due to challenge 4. Below we provide the details of TS Bandit. In TS Bandit, the expected reward is modeled as a parametric function of the context and the action. At each decision time, given the current context and availability, the action is selected with a probability determined by the posterior distribution of the parameters given the current history, under a Bayesian model of rewards with a Gaussian prior and Gaussian error. The main difference from our proposed algorithm is that TS Bandit chooses the action that maximizes only the immediate reward, while our proposed algorithm takes into account the longer-term impact of the current action (challenge 1). In addition, the TS Bandit algorithm requires correct modeling of each arm, while our method uses action-centering (see (3)) to protect against misspecifying the baseline reward (challenge 3) and requires correct modeling only of the difference between the two arms, i.e., the treatment effect model in (1).
In the implementation of TS Bandit, we parametrize the reward model with the same baseline and treatment effect feature vectors as in our proposed algorithm. Furthermore, to allow for a fair comparison, the prior distribution of the parameters and the variance of the error term are both constructed from the training batch using the same procedure that is discussed in Section 6.1, and we clip the probability of selecting each arm with the same constraints.
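To make the comparator concrete, here is a minimal sketch of a clipped selection probability for a two-arm Thompson Sampling bandit. It assumes a linear reward model in which sending a suggestion adds a treatment effect that is linear in a feature vector `f`, and a Gaussian posterior on the treatment effect coefficients, so that the posterior probability of a positive effect has a closed form. The function names, the handling of unavailable times and the clipping limits `p_min` and `p_max` are illustrative assumptions, not values or code from the study.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def clipped_send_probability(f, post_mean, post_cov, p_min, p_max, available=True):
    """Posterior probability that the treatment effect f'beta is positive,
    clipped to [p_min, p_max].

    Assumptions: beta ~ N(post_mean, post_cov) a posteriori, and no
    suggestion can be sent when the participant is unavailable.
    """
    if not available:
        return 0.0
    n = len(f)
    mean = sum(f[i] * post_mean[i] for i in range(n))
    var = sum(f[i] * post_cov[i][j] * f[j] for i in range(n) for j in range(n))
    if var <= 0.0:
        prob_positive = 1.0 if mean > 0.0 else 0.0
    else:
        prob_positive = normal_cdf(mean / math.sqrt(var))
    # Clipping keeps the policy stochastic, so exploration never stops.
    return min(max(prob_positive, p_min), p_max)


# Hypothetical inputs: two features, identity posterior covariance.
identity = [[1.0, 0.0], [0.0, 1.0]]
p_neutral = clipped_send_probability([1.0, 0.5], [0.0, 0.0], identity, 0.1, 0.9)
p_strong = clipped_send_probability([1.0, 0.5], [10.0, 10.0], identity, 0.1, 0.9)
```

With a zero posterior mean the two arms are equally likely to be better, so the unclipped probability is one half; with a strongly positive posterior mean the probability is capped at the upper limit.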
6.1 Training Phase
Prior distribution
The algorithm requires three prior distributions: the prior of the parameters in the main effect when available, the prior of the parameters in the treatment effect, and the prior of the parameters in the mean reward when not available. The last one is used in calculating the proxy value. The prior distributions are calculated from the training batch according to Section 5.5, as follows.

Fit the GEE using all participants’ data in the training batch (population GEE).

For the parameters that are significant in the population GEE, set the prior mean to the point estimate from the population GEE. Otherwise, set the prior mean to zero.

For the parameters that are significant in the population GEE, set the prior standard deviation to the standard deviation of the person-specific estimates of the participants in the training batch. Otherwise, set the prior standard deviation to half of the standard deviation of the person-specific estimates of the participants in the training batch.

Set the prior variance matrix of the parameters to the diagonal matrix with the resulting prior variances on the diagonal.
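The steps above can be sketched as follows. This is a minimal illustration that takes the GEE outputs as given: the function name `build_prior`, the argument names, the use of the sample standard deviation and the significance threshold `alpha = 0.05` are assumptions, and the numbers in the example are made up rather than taken from HeartSteps V1.

```python
from statistics import stdev


def build_prior(pop_estimates, pop_pvalues, person_estimates, alpha=0.05):
    """Return (prior_mean, prior_sd, prior_cov) for one set of parameters.

    pop_estimates / pop_pvalues: population GEE point estimates and p-values.
    person_estimates: one list of person-specific estimates per participant.
    """
    n_params = len(pop_estimates)
    prior_mean, prior_sd = [], []
    for j in range(n_params):
        # Spread of the j-th coefficient across person-specific fits.
        sd_j = stdev([person[j] for person in person_estimates])
        if pop_pvalues[j] < alpha:
            prior_mean.append(pop_estimates[j])
            prior_sd.append(sd_j)
        else:
            # Non-significant: centre at zero and halve the spread.
            prior_mean.append(0.0)
            prior_sd.append(0.5 * sd_j)
    prior_cov = [
        [prior_sd[i] ** 2 if i == j else 0.0 for j in range(n_params)]
        for i in range(n_params)
    ]
    return prior_mean, prior_sd, prior_cov


# Made-up example: the first parameter is significant, the second is not.
prior_mean, prior_sd, prior_cov = build_prior(
    pop_estimates=[2.0, 0.3],
    pop_pvalues=[0.01, 0.40],
    person_estimates=[[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]],
)
```

In the example both parameters have a person-specific standard deviation of 1, so the non-significant parameter ends up with prior mean 0 and prior standard deviation 0.5.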
Noise variance
Set the noise variance to be the variance of residuals obtained from the above population GEE.
Initial proxy value function
Recall that the proxy value function requires the specification of (1) the context distribution, (2) the availability probability, (3) the transition model of the dosage and (4) the reward function (for available and unavailable times), as well as the discount factor; see Section 5.4.2. We form the initial proxy from the training batch by using (1) the empirical distribution of contexts in the training batch, (2) the empirical availability probability in the training batch, (3) the dosage transition defined in Section 5.2, and (4) the reward estimates from the population GEE.
Tuning parameters
Recall that the two tuning parameters are the discount rate used in defining the proxy value and the updating weight used in forming the estimated proxy value. The tuning parameters are chosen to optimize the total simulated reward using the generative model of the participants in the training batch. Below we describe how we form the generative model. For each participant, we first construct a 90-day sequence of contexts, availabilities and residuals as follows.

Create the 42-day sequence of contexts, availabilities and residuals, where the residuals are obtained from the person-specific regression model fit.

Extend it to a 90-day sequence by concatenating (90 - 42) days’ data, randomly selected from the 42 days’ data. Specifically, randomly choose a day from the 42 days, append all data from that day onto the 42-day sequence, and repeat until the sequence covers 90 days. The sampling is done only once and the sequence is fixed throughout the simulation.
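The extension step can be sketched as below. The function name `extend_sequence`, the per-day record layout and the fixed seed are assumptions for illustration; days are drawn with replacement here, which is one reading of the description since 48 days must be drawn from 42.

```python
import random


def extend_sequence(days, target_length=90, seed=0):
    """Extend a list of per-day records to target_length days by appending
    whole days drawn at random from the original days.

    The seed is fixed so that the sampling is done once and the resulting
    sequence stays the same across simulation reruns.
    """
    rng = random.Random(seed)
    extended = list(days)
    while len(extended) < target_length:
        extended.append(rng.choice(days))
    return extended


# Placeholder 42-day sequence; a real record would hold the day's
# contexts, availabilities and residuals at each decision time.
days = [{"day": d} for d in range(1, 43)]
extended = extend_sequence(days)
```

Appending whole days keeps the within-day pattern of contexts, availabilities and residuals intact.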
The generative model for each participant is given as follows. At each decision time,

Randomly generate a binary variable that equals 1 with probability 0.2 (on average 1 per day). This variable is the indicator of whether any anti-sedentary suggestion is sent between two consecutive decision times.
Obtain the current dosage , where , the event .

Set

Select the action according to (2)

Receive the reward defined as
(8) where the coefficients are set based on population GEE using all participants’ data in the training batch.
For a given candidate value of the tuning parameters, together with the noise variance and prior constructed above, the algorithm is run repeatedly under each training participant’s generative model. The average total reward (over training participants and reruns) is calculated and we select the tuning parameters that maximize the average total reward. We use a grid search over candidate values of the two tuning parameters. Recall that the training is done three times, each time using two folds as the training batch. Each of the three iterations in CV therefore yields its own selected pair of tuning parameters.
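The selection rule can be sketched as a plain grid search. Everything named here is an assumption for illustration: `run_algorithm` stands in for one simulated run of the RL algorithm under a participant’s generative model and returns its total reward, and the grids, the number of reruns and the toy simulator are placeholders, not the values used in the study.

```python
from itertools import product
from statistics import mean


def select_tuning_parameters(run_algorithm, participants, discount_grid,
                             weight_grid, n_reruns):
    """Return the (discount, weight) pair with the highest average total
    reward over training participants and reruns, plus that average."""
    best_pair, best_score = None, float("-inf")
    for discount, weight in product(discount_grid, weight_grid):
        totals = [
            run_algorithm(participant, discount, weight, rerun)
            for participant in participants
            for rerun in range(n_reruns)
        ]
        score = mean(totals)
        if score > best_score:
            best_pair, best_score = (discount, weight), score
    return best_pair, best_score


def toy_simulator(participant, discount, weight, rerun):
    """Toy stand-in whose total reward peaks at discount=0.9, weight=0.5."""
    return -((discount - 0.9) ** 2) - (weight - 0.5) ** 2


best_pair, best_score = select_tuning_parameters(
    toy_simulator, participants=[1, 2, 3],
    discount_grid=[0.5, 0.9], weight_grid=[0.25, 0.5], n_reruns=4,
)
```

Averaging over both participants and reruns before comparing candidates mirrors the description above and reduces the influence of any single noisy run.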
6.2 Testing Phase
We build the generative model from the testing batch using the same procedure described in Section 6.1, with the only difference that in the testing phase the coefficients used to generate the reward in (8) are replaced by the regression estimates obtained from the testing batch. We run the algorithm under each test participant’s generative model with the noise variance estimate, prior distribution and tuning parameters selected from the training batch. The algorithm is run 96 times per testing participant. The average total reward (over reruns) is calculated for each test participant.
Recall that we conduct three-fold cross validation, so every participant in the HeartSteps V1 data is assigned to a testing batch exactly once. The performance of the proposed algorithm and of the comparator, the TS Bandit algorithm, for each participant when assigned to the testing batch is provided in Figure 2. We see that for the majority of participants (29 out of 37), the total reward is higher than under the TS Bandit algorithm. The average improvement in total reward over TS Bandit is 29.753. Recall that the TS Bandit algorithm is sensitive to model misspecification and non-stationarity and greedily maximizes the immediate reward. The simulation results suggest that the use of action-centering together with the proxy delayed effect is effective in addressing challenges 1 and 3.
7 Pilot Data From HeartSteps V2
HeartSteps V2 has been deployed in the field since June 2019. Currently the study is still in the pilot phase for testing the software and multiple intervention components. The RL algorithm developed above is being used to decide whether to trigger the context-tailored activity suggestion at each of the five decision times per day. The inputs to the algorithm, for example the choice of feature vectors, the prior distribution and the tuning parameters, were determined according to Section 5.5. That is, we apply the procedure described in Section 6.1 using all HeartSteps V1 data. Below we provide an initial assessment of the algorithm and discuss the lessons learned from the pilot participants’ data.
7.1 Initial Assessment
Recall that each participant in HeartSteps V2 wears the Fitbit tracker for one week before starting to use the mobile app; no activity suggestions are delivered during this initial week. Currently there are eight participants in the field who have been in the study for over one week and are experiencing the RL algorithm. For each participant, we calculate the average 30-min step count after each user-specified decision time during the first week and compare it with the average 30-min step count in the subsequent weeks, during which activity suggestions are delivered. The results are provided in Table 1. All participants except one (ID = 4) show an increase in step count. On average, a participant takes 125 more steps in the 30-min window following a decision time in the subsequent weeks than in the first week.
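Table 1: Average 30-min step count after each decision time in the first week (no suggestions) and in the subsequent weeks (with suggestions).
ID  Days  First week  Subsequent weeks  Difference
5   32    318.13      561.43            243.29
7   56    343.79      574.53            230.75
1   36    252.12      424.31            172.19
3   32    163.24      295.45            132.21
8   18    281.65      387.86            106.21
6   43    215.45      314.17            98.71
2   22    361.26      418.60            57.35
4   75    368.50      330.03            -38.47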
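As a quick arithmetic check of the summary figure quoted above, the short sketch below recomputes the per-participant differences and their average from the two step-count columns of Table 1. The variable names are illustrative; the numbers are copied from the table.

```python
# Average 30-min step counts from Table 1, keyed by participant ID.
first_week = {5: 318.13, 7: 343.79, 1: 252.12, 3: 163.24,
              8: 281.65, 6: 215.45, 2: 361.26, 4: 368.50}
subsequent_weeks = {5: 561.43, 7: 574.53, 1: 424.31, 3: 295.45,
                    8: 387.86, 6: 314.17, 2: 418.60, 4: 330.03}

# Per-participant change after suggestions start being delivered.
differences = {pid: subsequent_weeks[pid] - first_week[pid] for pid in first_week}

# Unweighted average over the eight pilot participants.
mean_difference = sum(differences.values()) / len(differences)
n_increase = sum(1 for d in differences.values() if d > 0)
```

The average comes out at roughly 125 steps, with seven of the eight participants showing an increase and participant 4 showing a decrease.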
7.2 Lessons
In this section, we discuss two lessons learned from examining the pilot participants’ data, using data from participants ID = 4 and ID = 7. First, consider participant ID = 4, who is not responsive to the activity suggestions (i.e., sending a suggestion does not significantly improve the step count). As seen in Table 1, the step counts of participant ID = 4 decrease after the first week. Figure 3 shows the randomization probability and the posterior mean estimates for participant ID = 4. We see that the posterior mean estimates for this participant start at a positive value and drop below 0, i.e., there is no sign that the suggestions are effective; however, the randomization probability still ranges between 0.2 and 0.4. Given that HeartSteps is intended for long-term use (recall that HeartSteps V2 is a 3-month study) and that there are other intervention components (e.g., weekly reflection and planning, and the anti-sedentary suggestion sent when the participant is currently sedentary), randomizing with this probability likely sends too many suggestions.
Second, consider participant ID = 7, who appears highly responsive to the activity suggestions; see Table 1 and the right graph in Figure 4, which shows the posterior mean of the treatment effect. From the right graph in Figure 4 we see that this participant’s responsivity begins to decrease around time 07-10. This is not that surprising, as this participant is receiving many suggestions. However, from the left graph in the same figure we see that the randomization probabilities from our RL algorithm do not really start to decrease until 07-16. Ideally the proxy value should respond quickly to the excessive dose and signal that the probability should decrease, so the proxy value needs improvement. The proxy does reduce the probability of sending the walking suggestion when the delayed effect is present; see the left graph in Figure 4 and compare the black points, which correspond to the actual randomization probability, with the red points, which correspond to the randomization probability without the proxy adjustment. Ideally we would like to see a bigger gap between the black and red points in the period from 07-16 to 07-15. We are currently revising the algorithm in response to these two lessons; the revisions are discussed in Section 8.
8 Conclusion and Future Work
In this paper, we developed a Reinforcement Learning algorithm for use in HeartSteps V2. Preliminary validation shows that the algorithm performs better than the Thompson Sampling Bandit algorithm in synthetic experiments constructed from a previous study, HeartSteps V1. We also assessed the performance of the algorithm using the pilot data from HeartSteps V2. After HeartSteps V2 is completed, the data will be used to further assess the performance and utility of the algorithm.
We foresee some opportunities for future work. First, our proposed algorithm learns the treatment policy separately for each participant (i.e., it is fully personalized). If the participants in the study are similar enough, pooling information from other participants (either still in the study or having already finished it) can speed up learning and achieve better performance, especially for those entering the study later. Second, the current algorithm takes into account the delayed effect of treatment by using a predefined “dosage variable” that captures the burden. It would be interesting to develop a version in which more sophisticated measures of burden and engagement are used to approximate the delayed effect and to respond more quickly to prevent disengagement. Finally, in consideration of the user’s engagement and burden, it makes sense to reduce the chance of intervention when the algorithm does not have enough evidence of the effectiveness of the intervention.