With billions of smart-device users worldwide (smartphones and wearable devices), mobile health (mHealth) interventions (MHI) are increasingly popular in the behavioral health, clinical, computer science and statistics communities [1, 2, 3, 4]. MHI aims to make full use of smart technologies to collect, transport and analyze raw data (weather, location, social activity, stress, urges to smoke, etc.) in order to deliver effective treatments that target behavior regularization. For example, MHI seeks to prevent unhealthy behaviors, such as alcohol abuse and eating disorders, and to promote healthy behaviors. Just-in-time adaptive interventions (JITAIs) are especially interesting and practical because of two appealing properties: (1) JITAIs can make adaptive and efficacious interventions according to the user's ongoing status and changing needs; (2) JITAIs allow for the real-time delivery of interventions, which is portable, affordable and flexible. JITAIs are therefore used in a wide range of mHealth applications, such as physical activity, eating disorders, alcohol use, mental illness, obesity/weight management and other chronic disorders, that aim to guide people to lead healthy lives [4, 2, 6, 3, 7].
Normally, a JITAI is formulated as an online sequential decision making (SDM) problem whose aim is to construct the optimal decision rules that decide when, where and how to deliver effective treatments [4, 2, 5]. This is a new topic that lacks methodological guidance. In 2014, Lei made a first attempt to formulate the mHealth intervention as an online actor-critic contextual bandit problem. Lei's method is well suited to the small-data-set problem in the early stage of an mHealth study. However, it ignores the important delayed effects of SDM: the current action may affect not only the immediate reward but also the next states and, through them, all subsequent rewards. To account for the delayed effects, it is reasonable to employ reinforcement learning (RL) in the discounted reward setting. RL is much more complex than the contextual bandit and requires much more data to acquire good and stable decision rules. However, at the beginning of online learning there is too little data to learn effectively. A simple and widely used remedy is to collect a trajectory of fixed length via micro-randomized trials, accumulating a few samples, and only then to start the online updating. This procedure is called the random warm start (RWS). RWS has two main drawbacks: (1) it considerably delays the online RL learning before good decision rules are achieved; (2) it is likely to frustrate users, because the random decision rules used at the beginning of online learning are not personalized to the users' needs. As a result, users may easily abandon the mHealth app.
To alleviate the above problems, we propose a new online RL methodology that emphasizes effective warm starts. It aims to improve the performance of online learning at the early stage and, through that, the final decision rule. The motivation is to make full use of the data and the decision rules obtained in a previous study that is similar to the current study (cf. Sec. III). Specifically, we use the decision rules obtained previously to initialize the parameters of the online RL learning for new users. The data accumulated in the former study is also fully used. As a result, the data size is greatly enriched at the beginning of our online learning algorithm. As online learning goes on, the data gained from the new user receives more and more weight and increasingly dominates the objective function, so the decision rule is still personalized to the new user. Experiment results on the simulated HeartSteps application show that the proposed method improves on the random warm start.
II Markov Decision Process (MDP) and Actor-Critic Reinforcement Learning
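The dynamic system (i.e., the environment) that RL interacts with is generally modeled as a Markov Decision Process (MDP). An MDP is a tuple $\{\mathcal{S}, \mathcal{A}, P, R, \gamma\}$ [9, 10, 11], where $\mathcal{S}$ is the (finite) state space and $\mathcal{A}$ is the (finite) action space. The state transition probability from state $s$ to the next state $s'$ when taking action $a$ is given by $P(s, a, s')$. Let $S_t$, $A_t$ and $R_{t+1}$ be the random variables at time $t$ representing the state, action and immediate reward, respectively. The expected immediate reward $R(s, a) = \mathbb{E}\left(R_{t+1} \mid S_t = s, A_t = a\right)$ is assumed to be bounded over the state and action spaces. $\gamma \in [0, 1)$ is a discount factor that reduces the influence of future rewards.

The stochastic policy $\pi(\cdot \mid s)$ decides the action to take in a given state $s$. The goal of RL is to interact with the environment to learn the optimal policy that maximizes the total accumulated reward. Usually, RL uses the value function $Q^{\pi}(s, a)$ to quantify the quality of a policy $\pi$, which is the expected discounted cumulative reward, starting from state $s$, first choosing action $a$ and then following the policy $\pi$:

$$Q^{\pi}(s, a) = \mathbb{E}\left\{\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\middle|\, s_0 = s,\ a_0 = a,\ \pi\right\}.$$

The value $Q^{\pi}(s, a)$ satisfies the following linear Bellman equation

$$Q^{\pi}(s, a) = \mathbb{E}_{s', a' \mid s, a, \pi}\left\{R(s, a) + \gamma\, Q^{\pi}(s', a')\right\}. \tag{1}$$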
Parameterized functions are generally employed to approximate the value and policy functions, since the system usually has too many states and actions for the value and the policy to be estimated accurately and directly. Instead, they have to be estimated iteratively. Owing to their quick convergence, actor-critic RL algorithms are widely used to estimate the parameterized value $Q_{w}(s, a) = w^{\top} x(s, a)$ and the stochastic policy $\pi_{\theta}(\cdot \mid s)$, where $x(s, a)$ is a feature function for the $Q$-value that merges the information in state $s$ and action $a$. To learn the unknown parameters $\{w, \theta\}$, we need a two-step alternating updating rule that is repeated until convergence: (1) the critic update (i.e., policy evaluation) for $w$, which estimates the Q-value function for the current policy, and (2) the actor update (i.e., policy improvement) for $\theta$, which searches for a better policy based on the newly estimated Q-value [10, 4].
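As a small numerical illustration of the linear Bellman equation for the Q-value, the sketch below solves it exactly for a tabular MDP and a fixed stochastic policy. The two-state, two-action MDP, its transition probabilities and its rewards are hypothetical values chosen only for illustration; they are not taken from the study.

```python
import numpy as np

def q_values(P, R, pi, gamma):
    """Solve Q = R + gamma * M Q exactly, with M[(s,a),(s',a')] = P(s,a,s') * pi(a'|s').

    P:  (S, A, S) transition probabilities P[s, a, s']
    R:  (S, A) expected immediate rewards
    pi: (S, A) stochastic policy, each row sums to 1
    """
    S, A = R.shape
    # Joint transition matrix over (state, action) pairs under the policy.
    M = (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
    q = np.linalg.solve(np.eye(S * A) - gamma * M, R.reshape(S * A))
    return q.reshape(S, A)

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.full((2, 2), 0.5)   # uniformly random policy
gamma = 0.9
Q = q_values(P, R, pi, gamma)
```

With many states and actions this exact solve becomes impractical, which is the motivation for the parameterized approximation described above.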
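Suppose the online learning for a new user is at decision point $t$, resulting in $t$ tuples drawn from the MDP system, i.e., $\mathcal{D} = \{(s_i, a_i, r_i, s_i')\}_{i=1}^{t}$. Each tuple consists of four elements: the current state, action, reward and the next state. Using the data in $\mathcal{D}$, the Least-Squares Temporal Difference for Q-value (LSTDQ) [11, 8] is used for the critic update to estimate $\hat{w}_t$ at time point $t$:

$$\hat{w}_t = \left[\zeta_c I + \frac{1}{t}\sum_{i=1}^{t} x_i \left(x_i - \gamma\, y_{i+1}\right)^{\top}\right]^{-1} \left(\frac{1}{t}\sum_{i=1}^{t} x_i r_i\right), \tag{2}$$

where $x_i = x(s_i, a_i)$ is the feature at decision point $i$ for the value function; $y_{i+1} = \sum_{a \in \mathcal{A}} x(s_i', a)\, \pi_{\theta}(a \mid s_i')$ is the feature at the next time point; $r_i$ is the immediate reward at the $i$-th time point. By maximizing the average reward, i.e., a widely accepted criterion, we have the objective function for the actor update (i.e., policy improvement)

$$\hat{\theta}_t = \arg\max_{\theta}\ \frac{1}{t}\sum_{i=1}^{t}\sum_{a \in \mathcal{A}} Q(s_i, a; \hat{w}_t)\, \pi_{\theta}(a \mid s_i) - \frac{\zeta_a}{2}\|\theta\|_2^2, \tag{3}$$

where $Q(s_i, a; \hat{w}_t) = x(s_i, a)^{\top}\hat{w}_t$ is the newly estimated value; $\zeta_c$ and $\zeta_a$ are the balancing parameters for the $\ell_2$ constraint that avoid singular failures in the critic and actor updates, respectively. Note that after each actor update, the feature at the next time point $y_{i+1}$ has to be re-calculated based on the newly estimated policy parameter $\hat{\theta}_t$. When the discount factor $\gamma = 0$, the RL algorithm in (2), (3) is equivalent to the state-of-the-art contextual bandit method in mHealth.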
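The critic and actor updates described above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the logistic form of the policy, the feature map `x_feat`, the regularization values and the random toy tuples are all hypothetical choices made for illustration.

```python
import numpy as np

def phi(s):
    """Policy feature: constant 1 stacked with the state vector."""
    return np.concatenate(([1.0], s))

def prob_treat(theta, s):
    """P(a = 1 | s) under an assumed logistic policy."""
    return 1.0 / (1.0 + np.exp(-(theta @ phi(s))))

def x_feat(s, a):
    """Hypothetical Q-value feature that merges state and action."""
    return np.concatenate((phi(s), a * phi(s)))

def expected_feat(theta, s):
    """Policy-averaged feature: sum over a of x(s, a) * pi_theta(a | s)."""
    p1 = prob_treat(theta, s)
    return (1.0 - p1) * x_feat(s, 0) + p1 * x_feat(s, 1)

def critic_lstdq(tuples, theta, gamma, zeta_c):
    """Ridge-regularized LSTDQ estimate of the value parameter w."""
    d = x_feat(tuples[0][0], 0).size
    A = np.zeros((d, d))
    b = np.zeros(d)
    for s, a, r, s_next in tuples:
        x = x_feat(s, a)
        A += np.outer(x, x - gamma * expected_feat(theta, s_next))
        b += r * x
    t = len(tuples)
    return np.linalg.solve(zeta_c * np.eye(d) + A / t, b / t)

def actor_objective(theta, tuples, w, zeta_a):
    """Average policy-weighted Q-value minus an L2 penalty on theta."""
    values = [w @ expected_feat(theta, s) for s, _, _, _ in tuples]
    return np.mean(values) - 0.5 * zeta_a * (theta @ theta)

# Toy data: 50 random (state, action, reward, next state) tuples.
rng = np.random.default_rng(0)
tuples = [(rng.normal(size=3), int(rng.integers(2)), float(rng.normal()),
           rng.normal(size=3)) for _ in range(50)]
theta = np.zeros(4)
w_hat = critic_lstdq(tuples, theta, gamma=0.9, zeta_c=0.1)
score = actor_objective(theta, tuples, w_hat, zeta_a=0.1)
```

Setting `gamma=0.0` in `critic_lstdq` reduces the critic to a ridge regression of rewards on features, which matches the remark that the method then coincides with a contextual bandit. A full actor step would maximize `actor_objective` over `theta` with any gradient-based optimizer.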
III Our method
The actor-critic RL algorithm in (2), (3) works well when the sample size (i.e., $t$) is large. However, at the beginning of online learning, e.g., $t = 1$, there is only one tuple, and it is impossible to do the actor-critic updating with so few samples. A popular and widely accepted method is to accumulate a small number of tuples via micro-randomized trials (called RWS). RWS draws a trajectory of fixed length by applying a random policy that provides an intervention with probability 0.5 in every state. RWS works to some extent, but it is far from optimal. One direct drawback of RWS is that it is very expensive in time to wait for the micro-randomized trials to collect data from human users, which implies that we may still have only a small number of samples with which to start the actor-critic updating. This problem badly affects the actor-critic updating not only at the beginning of online learning but also throughout the whole learning process. The reason is that the actor-critic objective functions are non-convex; any bad solution early in online learning biases the optimization direction, which easily leads to a sub-optimal solution. In addition, the random policy in micro-randomized trials and the decision rules achieved early in online learning give a poor user experience, which may cause users to become inactive or even to abandon the mHealth intervention.
To deal with the above problems, we propose a new online actor-critic RL methodology that emphasizes effective warm starts for the online learning algorithm. The goal is to improve the decision rules achieved at the early online learning stage and, through that, to guide the optimization in a better direction, leading to a good final policy that is well suited to the new user. Specifically, we make full use of the data accumulated and the decision rules learned in the previous study. Note that mHealth intervention design usually involves several rounds of study; each round builds on, and differs slightly from, the former one. By using the data and policy gained in the former study, the RL learning in the current study can quickly achieve good decision rules for new users, reducing the total study time and improving the user experience at the beginning of online learning.
Suppose that the former mHealth study was carried out in an off-policy, batch learning setting with $\bar{N}$ (40, say) individuals. Each individual has a trajectory of $\bar{T}$ tuples of states, actions and rewards. Thus in total there are $\bar{N}\bar{T}$ tuples, i.e., $\bar{\mathcal{D}} = \{(\bar{s}_j, \bar{a}_j, \bar{r}_j, \bar{s}_j')\}_{j=1}^{\bar{N}\bar{T}}$. Besides the data in $\bar{\mathcal{D}}$, we employ the decision rule achieved in the former study to initialize the parameters in the current online learning. Note that we add a bar above the notation to distinguish the data obtained in the previous study from that of the current study.
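At the $t$-th decision point, we have both the data $\bar{\mathcal{D}}$ collected in the former study and the $t$ new tuples drawn from the new user in $\mathcal{D}$ to update the online actor-critic learning. The update has two parts: (1) the critic update for $\hat{w}_t$ via

$$\hat{w}_t = \left[\zeta_c I + \frac{1}{t+1}\left(\frac{1}{\bar{N}\bar{T}}\sum_{j=1}^{\bar{N}\bar{T}} \bar{x}_j\left(\bar{x}_j - \gamma\, \bar{y}_{j+1}\right)^{\top} + \sum_{i=1}^{t} x_i\left(x_i - \gamma\, y_{i+1}\right)^{\top}\right)\right]^{-1} \left[\frac{1}{t+1}\left(\frac{1}{\bar{N}\bar{T}}\sum_{j=1}^{\bar{N}\bar{T}} \bar{x}_j \bar{r}_j + \sum_{i=1}^{t} x_i r_i\right)\right] \tag{4}$$

and (2) the actor update via

$$\hat{\theta}_t = \arg\max_{\theta}\ \frac{1}{t+1}\left(\frac{1}{\bar{N}\bar{T}}\sum_{j=1}^{\bar{N}\bar{T}}\sum_{a \in \mathcal{A}} Q(\bar{s}_j, a; \hat{w}_t)\, \pi_{\theta}(a \mid \bar{s}_j) + \sum_{i=1}^{t}\sum_{a \in \mathcal{A}} Q(s_i, a; \hat{w}_t)\, \pi_{\theta}(a \mid s_i)\right) - \frac{\zeta_a}{2}\|\theta\|_2^2, \tag{5}$$

where $\bar{\mathcal{D}}$ is the data from the previous study; $\mathcal{D}$ is the data collected from the new user; $x_i = x(s_i, a_i)$ is the feature vector at decision point $i$ for the value function; $y_{i+1} = \sum_{a \in \mathcal{A}} x(s_i', a)\, \pi_{\theta}(a \mid s_i')$ is the feature at the next time point; $r_i$ is the immediate reward at the $i$-th point; $Q(s_i, a; \hat{w}_t) = x(s_i, a)^{\top}\hat{w}_t$ is the newly updated value; $\zeta_c$ and $\zeta_a$ are the balancing parameters of the critic and actor updates and $\gamma$ is the discount factor. The barred quantities $\bar{x}_j$, $\bar{y}_{j+1}$ and $\bar{r}_j$ are defined in the same way on $\bar{\mathcal{D}}$.

In (4) and (5), the barred terms come from the previous data, which enters with the normalized weight $1/(\bar{N}\bar{T})$. In this setting, all the data obtained in the former study is treated as one sample for the current online learning. As the current online learning goes on (i.e., $t$ increases), the data collected from the new user gradually dominates the objective functions. Thus, we are still able to achieve personalized JITAIs that are successfully adapted to each new user.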
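One way to read the weighting described above is that the former study is averaged into a single pseudo-sample, which is then pooled with the $t$ tuples of the new user. The sketch below implements a critic update under that reading; the exact $1/(t+1)$ normalization, the function name `warm_start_critic` and the random toy features are assumptions made for illustration, not a transcription of the authors' code.

```python
import numpy as np

def warm_start_critic(Xb, Yb, rb, X, Y, r, gamma, zeta_c):
    """Critic update that mixes previous-study data with the new user's data.

    Xb, Yb, rb: features, next-step features and rewards from the former study
    X, Y, r:    the same quantities for the t tuples of the new user
    """
    n_prev, d = Xb.shape
    t = X.shape[0]
    # The former study is averaged, so it counts as one pseudo-sample.
    A_prev = Xb.T @ (Xb - gamma * Yb) / n_prev
    b_prev = Xb.T @ rb / n_prev
    # Each tuple of the new user counts as one full sample.
    A_new = X.T @ (X - gamma * Y)
    b_new = X.T @ r
    A = zeta_c * np.eye(d) + (A_prev + A_new) / (t + 1)
    b = (b_prev + b_new) / (t + 1)
    return np.linalg.solve(A, b)

# Toy features standing in for a former study (200 tuples) and a new user (5 tuples).
rng = np.random.default_rng(1)
d = 4
Xb, Yb, rb = rng.normal(size=(200, d)), rng.normal(size=(200, d)), rng.normal(size=200)
X, Y, r = rng.normal(size=(5, d)), rng.normal(size=(5, d)), rng.normal(size=5)

# Before any new tuple arrives, the estimate rests on the former study alone.
w_start = warm_start_critic(Xb, Yb, rb, X[:0], Y[:0], r[:0], 0.9, 0.1)
# After five tuples, the new user already carries 5/6 of the weight.
w_t5 = warm_start_critic(Xb, Yb, rb, X, Y, r, 0.9, 0.1)
```

Because the previous data keeps a fixed weight of one sample, its share shrinks as $1/(t+1)$, which is how the new user's data comes to dominate the estimate.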
To verify the performance, we compare our method (i.e., NWS-RL) with the conventional RL method with the random warm start (RWS-RL) on the HeartSteps application. HeartSteps is a 42-day mHealth intervention that encourages users to increase the steps they take each day by providing positive interventions, such as suggesting a walk after sedentary behavior. The actions are binary, $a \in \{0, 1\}$, where $a = 1$ means providing an active treatment, e.g., sending an intervention to the user's smart device, while $a = 0$ means no treatment.
IV-A Simulated Experiments
In the experiments, we draw a trajectory of tuples from each user, where each observation is a column vector of state elements. The initial states and actions are drawn at random. For the subsequent decision points, the states and immediate rewards are generated by the state generation model (6) and the immediate reward model (7), respectively. These models include a term for treatment fatigue [4, 2] and two noise terms: one in the state transition (6) and one in the immediate reward model (7). To generate different users, we need different MDPs, each specified by the values of the model coefficients in (6) and (7). The coefficients are generated in the following two steps: (a) set a basic coefficient vector; (b) to obtain different coefficient vectors (i.e., users or MDPs), perturb the basic vector with zero-mean noise whose covariance is a scaled identity matrix, where the scale controls how different the users are. To generate the MDP for a future user, we use the same method to generate new coefficient vectors.
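The two-step user generation can be sketched as follows. This is only an illustration under assumptions: the noise is taken to be Gaussian, `spread` is treated as its standard deviation, and the six-element `basic` vector holds placeholder values rather than the coefficients used in the study.

```python
import numpy as np

def generate_users(basic, spread, n_users, rng):
    """Draw one coefficient vector per user around a shared basic vector.

    Assumes isotropic Gaussian noise; `spread` is its standard deviation and
    controls how different the simulated users are.
    """
    noise = rng.normal(loc=0.0, scale=spread, size=(n_users, basic.size))
    return basic + noise

rng = np.random.default_rng(0)
basic = np.linspace(0.1, 1.0, 6)   # placeholder values, not the study's
users = generate_users(basic, spread=0.05, n_users=50, rng=rng)
```

Each row of `users` then defines one simulated MDP; a spread of zero would make all users identical to the basic one.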
IV-B Experiment Settings
The expectation of the long-run average reward (ElrAR) is used to evaluate the quality of an estimated policy on a set of 50 individuals. Intuitively, the ElrAR measures how much average reward we obtain in the long run by using the learned policy for a number of users. In the HeartSteps application, the ElrAR measures the average number of steps that users take each day over a long period of time; a larger ElrAR corresponds to better performance. For each user, the average reward is calculated by averaging the rewards over the last elements of a long trajectory of tuples generated under the policy. The ElrAR is then obtained by averaging these per-user average rewards.
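The ElrAR computation described above amounts to two nested averages. The sketch below assumes the rewards of all users are stored in one array with a row per user; the function name `elrar`, the argument `n_last` and the toy reward values are hypothetical.

```python
import numpy as np

def elrar(reward_trajectories, n_last):
    """Average the last `n_last` rewards of each user, then average over users."""
    rewards = np.asarray(reward_trajectories, dtype=float)  # shape (users, T)
    per_user = rewards[:, -n_last:].mean(axis=1)
    return per_user.mean()

# Two toy users with four rewards each; only the last two rewards are used.
toy_rewards = [[0.0, 0.0, 1.0, 3.0],
               [0.0, 0.0, 2.0, 2.0]]
value = elrar(toy_rewards, n_last=2)
```

Discarding the early part of each trajectory keeps the estimate from being dominated by the transient behavior that depends on the initial state.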
In the experiment, we assume a parameterized policy $\pi_{\theta}(a \mid s)$, where $\theta$ is the unknown parameter vector and the feature function for policies, $\phi(s) = [1, s^{\top}]^{\top}$, stacks the constant 1 with the state vector $s$. Each individual in the former study has a trajectory of 42 time points. For the current study (cf. Table I), RWS has to accumulate tuples up to one of two warm-start lengths before it can start online learning. Our method (i.e., NWS) is able to start the online RL algorithm as soon as the first tuple is available. Since the comparison of early online learning is our focus, we consider two total trajectory lengths for online learning. The noise levels and the remaining variances are set to fixed values, and the same feature processing is used for value estimation in all the compared methods. Table I summarizes the results of three methods: RWS-RL with each of the two warm-start lengths, and NWS-RL. It includes two sub-tables: the left one shows the results of early online learning, and the right one shows the results for the longer trajectory. The proposed warm start method (NWS-RL) has a clear advantage over the conventional RWS-RL in both settings, improving on average over the best RWS-RL policy.
V Conclusion and Discussion
In this paper, we propose a new online actor-critic reinforcement learning methodology for mHealth applications. The main idea is to provide an effective warm start for online RL. The state-of-the-art RL method for mHealth lacks samples with which to start online learning. To solve this problem, we make full use of the data accumulated and the decision rules achieved in the former study. As a result, the data size is greatly enriched even at the beginning of online learning, and our method is able to start the online updating as soon as the first tuple is available. Experiment results verify that our method achieves clear gains over the state-of-the-art method. In the future, we may explore robust learning [12, 13] and graph learning [14, 15] for the online actor-critic RL algorithm.
The authors would like to thank the editor and the reviewers for their valuable suggestions. Besides, this work is supported by R01 AA023187, P50 DA039838, U54EB020404, R01 HL125440.
- [1] H. Lei, A. Tewari, and S. Murphy, “An actor-critic contextual bandit algorithm for personalized interventions using mobile devices,” in NIPS 2014 Workshop: Personalization: Methods and Applications, pp. 1–9, 2014.
- [2] S. A. Murphy, Y. Deng, E. B. Laber, H. R. Maei, R. S. Sutton, and K. Witkiewitz, “A batch, off-policy, actor-critic algorithm for optimizing the average reward,” CoRR, vol. abs/1607.05047, 2016.
- [3] W. Dempsey, P. Liao, P. Klasnja, I. Nahum-Shani, and S. A. Murphy, “Randomised trials for the fitbit generation,” Significance, vol. 12, pp. 20–23, Dec 2016.
- [4] P. Liao, A. Tewari, and S. Murphy, “Constructing just-in-time adaptive interventions,” Phd Section Proposal, pp. 1–49, 2015.
- [5] F. Zhu, P. Liao, X. Zhu, Y. Yao, and J. Huang, “Cohesion-based online actor-critic reinforcement learning for mhealth intervention,” arXiv:1703.10039, 2017.
- [6] H. Lei, An Online Actor Critic Algorithm and a Statistical Decision Procedure for Personalizing Intervention. PhD thesis, University of Michigan, 2016.
- [7] D. Gustafson, F. McTavish, M. Chih, A. Atwood, R. Johnson, M. B. …, and D. Shah, “A smartphone application to support recovery from alcoholism: a randomized clinical trial,” JAMA Psychiatry, vol. 71, no. 5, 2014.
- [8] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2nd ed., 2012.
- [9] M. Geist and O. Pietquin, “Algorithmic survey of parametric value function approximation,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 6, pp. 845–867, 2013.
- [10] I. Grondman, L. Busoniu, G. A. D. Lopes, and R. Babuska, “A survey of actor-critic reinforcement learning: Standard and natural policy gradients,” IEEE Trans. Systems, Man, and Cybernetics, vol. 42, no. 6, pp. 1291–1307, 2012.
- [11] M. G. Lagoudakis and R. Parr, “Least-squares policy iteration,” J. of Machine Learning Research (JMLR), vol. 4, pp. 1107–1149, 2003.
- [12] F. Zhu, B. Fan, X. Zhu, Y. Wang, S. Xiang, and C. Pan, “10,000+ times accelerated robust subset selection (ARSS),” in Proc. Assoc. Adv. Artif. Intell. (AAAI), pp. 3217–3224, 2015.
- [13] Y. Wang, C. Pan, S. Xiang, and F. Zhu, “Robust hyperspectral unmixing with correntropy-based metric,” IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 4027–4040, 2015.
- [14] F. Zhu, Y. Wang, S. Xiang, B. Fan, and C. Pan, “Structured sparse method for hyperspectral unmixing,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 88, pp. 101–118, 2014.
- [15] H. Li, Y. Wang, S. Xiang, J. Duan, F. Zhu, and C. Pan, “A label propagation method using spatial-spectral consistency for hyperspectral image classification,” International Journal of Remote Sensing, vol. 37, no. 1, pp. 191–211, 2016.