Personalization of Health Interventions using Cluster-Based Reinforcement Learning

by   Ali el Hassouni, et al.

Research has shown that personalization of health interventions can improve their effectiveness. Reinforcement learning algorithms can perform such tailoring using data collected about users. Learning is, however, very fragile for health interventions, as only limited time is available to learn from the user before disengagement takes place or before the opportunity to intervene passes. In this paper, we present a cluster-based reinforcement learning approach which learns across groups of users. Such an approach can speed up the learning process while still providing a level of personalization. The clustering algorithm uses a distance metric over traces of states and rewards. We apply both online and batch learning to learn policies over the clusters and introduce a publicly available simulator which we have developed to evaluate the approach. The results show that batch learning clearly outperforms online learning. Furthermore, clustering can be beneficial provided that a proper clustering is found.




1 Introduction

Within the domain of health, an ever increasing amount of data is being collected about the health state and health behavior of people. This data can originate from a variety of sources, including medical devices and medical doctors, but also smartphones and other sensor devices we carry with us. Smart devices not only allow for the collection of data, but can also be used to provide interventions to users directly. Determining which intervention works best in what situation is an important problem in this context. One-size-fits-all solutions, where each user is provided with the same intervention, have been shown to be less effective compared to more personalized approaches where interventions are tailored towards (groups of) users (see e.g. (Kranzler & McKay, 2012; Schmidt et al., 2006; Simon et al., 2000; Curry et al., 1995)). The data collected from the users can help to establish this personalization.

Personalization of interventions poses several challenges. Firstly, the success of interventions is not immediately clear, and an emphasis should be placed on interventions that lead to a sustained improvement in the health state rather than quick wins. Secondly, interventions are typically composed of sequences of actions (e.g. multiple support messages or exercises) that should act in harmony. To address these challenges, reinforcement learning (see e.g. (Wiering & van Otterlo, 2012)) arises as a very natural choice (cf. (Hoogendoorn & Funk, 2017)).

While the reinforcement learning paradigm fits this setting very well, certain properties of reinforcement learning do not. The algorithms typically require a substantial learning period before a suitable policy (specifying which intervention action to select in what situation) is found. In our setting, we do not have a sufficiently long learning period per user, and trying a lot of unsuitable actions can disengage users. Hence, there is a need to substantially shorten the learning period. To establish this we can either: (1) start with an existing model (transfer learning, see e.g. (Taylor & Stone, 2009)) or (2) pool data from multiple users that require similar policies (cf. (Zhu et al., 2017)). While both are viable options, the latter has not yet been explored for more complex and realistic health settings, merely for very simple personalization settings.

In this paper, we present a cluster-based reinforcement learning algorithm which builds on top of the work done by (Zhu et al., 2017) and test it for a more complex health setting using a dedicated simulator we have built. We use k-medoids clustering to find suitable clusters, thereby automatically selecting a value for k using the silhouette score. We learn policies over the clusters using both an online RL algorithm (Q-learning, cf. (Watkins & Dayan, 1992)) and a batch algorithm (LSPI, cf. (Lagoudakis & Parr, 2003)). We compare the cluster-based approach to learning a single policy across all users and to learning completely individualized policies. The aforementioned simulation environment generates realistic user data for a health setting in which the aim is to coach users towards a more active lifestyle. The simulator is made publicly available to allow for benchmarking and to make it easier for others to evaluate novel reinforcement learning approaches for this setting.

This paper is organized as follows. We present our cluster-based reinforcement learning algorithm in Section 2 and we discuss related work in Section 3. We continue with a description of the simulator we have developed in Section 4. We then explain our experimental setup and our results in Sections 5 and 6 respectively. We end with a discussion.

2 Approach

Generally, we want to learn an intervention strategy for many types of users, without knowing beforehand which types of users exist, how they differ in terms of behavior, and how they react differently to interventions of the system. In our approach, we utilize existing model-free reinforcement learning algorithms to experiment with different intervention strategies to improve users' health states.

User Models and Interventions. Let U be the set of users. We see each user u ∈ U as a control problem modeled as a Markov decision process (Wiering & van Otterlo, 2012) ⟨S, A_u, T_u, R_u⟩, where S is a finite set of states the user can be in, A_u is the set of possible interventions (actions) for u, T_u : S × A_u × S → [0, 1] is a probabilistic transition function over the states of u, and R_u is a reward function that assigns a reward R_u(s, a) to each state s and action a.

The user’s state set S consists of the observable features of the user state: in general we cannot observe all relevant features of the true underlying user state, and S is therefore restricted to all measurable aspects, modeled through a set of basis functions φ_1, ..., φ_n over a state s. That is, we use the feature vector representation φ(s) = (φ_1(s), ..., φ_n(s)) of the state s of user u as representation. If there is no confusion we will use s instead of φ(s). The reward function R_u determines the goal of optimization. Finally, the transition function T_u, which determines how a user moves from state s to s' due to action a, is not accessible from the viewpoint of the reinforcement learner, which is a natural assumption when dealing with real human users. In Section 4 we do show how we have implemented it for the artificial users in our simulator. The granularity of modeling can be set based on the case at hand, ranging from seconds to hours.

At every time point a user u is in some state s ∈ S, the system chooses an intervention a ∈ A_u, upon which the user enters a new state s' and a reward r is obtained. Note that for both the transition function and the reward function it is unknown whether they can be considered Markov, and thus whether the user can be controlled as an MDP. Nevertheless, we assume it is close enough such that we can employ standard RL algorithms. Note also that all users share the same state representation S, but can differ in T_u and R_u. An alternative strategy would be to learn the dynamics of T_u and R_u from experience as in model-based RL (e.g. see (Sutton & Barto, 2017)), but here we focus on learning them implicitly by clustering users who are similar in their behavior (and thus in T_u and R_u).

Evaluating and Learning Interventions. The goal is to learn intervention strategies, or policies, for all users. For any user u, the policy π_u specifies the intervention π_u(s) for user u in state s. The intervention will cause user u to transition to a new state s' and a reward r is obtained, resulting in the experience ⟨s, a, r, s'⟩. A sequence of experiences for user u can be compactly represented as ⟨s_0, a_0, r_1, s_1, a_1, r_2, ...⟩ and is called a trace for user u. From here on we drop the user subscript when there is no confusion. To compare policies, we look at the expected reward they receive in the long run. The value of doing intervention a in state s under policy π, where a = π(s), is:

Q^π(s, a) = E_π [ Σ_{t=0}^∞ γ^t r_t | s_0 = s, a_0 = a ]        (1)

where γ ∈ [0, 1) is a discount factor weighing rewards in the future, and s_t and a_t are states and actions occurring at some future time t. From this Q-function it is easy to derive a policy by taking the best action in each state s, i.e.

π(s) = argmax_{a ∈ A} Q(s, a)        (2)
We are looking for the best policy π*, for which Q^{π*}(s, a) ≥ Q^π(s, a) for all s ∈ S, a ∈ A, and all policies π.

We employ two off-policy techniques to learn Q-functions: online, table-based Q-learning (Watkins & Dayan, 1992) and batch, feature-based least squares policy iteration (LSPI) (Lagoudakis & Parr, 2003). Let U be our set of users. For Q-learning we store each Q-value Q(s, a), for s ∈ S and a ∈ A, separately, and after each experience ⟨s, a, r, s'⟩ for a user we update the Q-function:

Q(s, a) ← Q(s, a) + α ( r + γ max_{a'} Q(s', a') − Q(s, a) )

where α ∈ [0, 1] is the learning rate. Note that one Q-function is learned for all users together. In addition, we use variants of experience replay (Lin, 1992), which amounts to performing additional updates by "replaying" experienced traces backwards to propagate rewards more quickly.
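A minimal sketch of this tabular update combined with backward replay; the state labels, action encoding, and parameter values are illustrative, not those used in the paper:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update for experience (s, a, r, s')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def replay_trace(Q, trace, actions, alpha=0.1, gamma=0.9):
    """Experience replay: replay a trace backwards so rewards propagate quicker."""
    for (s, a, r, s_next) in reversed(trace):
        q_update(Q, s, a, r, s_next, actions, alpha, gamma)

Q = defaultdict(float)
actions = [0, 1]  # 1 = send an intervention, 0 = do nothing
trace = [("s0", 1, 0.0, "s1"), ("s1", 1, 1.0, "s2")]  # toy two-step trace
replay_trace(Q, trace, actions)
```

Because the trace is replayed backwards, the reward obtained in the last step already influences the value of the first state after a single pass.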

In our second method, LSPI, we employ the basis function representation φ(s) of a state and compute a linear function approximation of the Q-function, Q̂(s, a) = φ(s, a)ᵀ w, from a batch of experiences. Here, w consists of tunable weights. LSPI implements an approximate version of standard policy iteration (cf. (Sutton & Barto, 2017)) by alternating a policy evaluation step (Eq. 1) and a policy improvement step (Eq. 2). However, due to the linear approximation, the evaluation step can be computed by representing the batch of experiences in matrix form and using it to find an optimal weight vector w.
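A sketch of the closed-form evaluation step at the heart of LSPI (LSTDQ); the one-hot feature map and the tiny batch are toy stand-ins for the paper's basis functions and traces:

```python
import numpy as np

def lstdq(batch, phi, policy, n_features, gamma=0.9):
    """Solve A w = b with A = sum phi(s,a)(phi(s,a) - gamma*phi(s',pi(s')))^T
    and b = sum phi(s,a) * r, over a batch of experiences (s, a, r, s')."""
    A = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    for (s, a, r, s_next) in batch:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += f * r
    # small ridge term for numerical stability on tiny batches
    return np.linalg.solve(A + 1e-6 * np.eye(n_features), b)

# Toy problem: 2 states x 2 actions, one-hot features over (state, action) pairs.
def phi(s, a):
    f = np.zeros(4)
    f[2 * s + a] = 1.0
    return f

batch = [(0, 1, 1.0, 1), (1, 0, 0.0, 0), (0, 0, 0.0, 1), (1, 1, 0.5, 0)]
w = lstdq(batch, phi, policy=lambda s: 0, n_features=4)
q_hat = lambda s, a: phi(s, a) @ w  # the resulting approximate Q-function
```

The improvement step then simply takes the greedy action under q_hat, after which the evaluation is repeated under the new policy.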

Two Learning Phases. For any given set of users we define two phases in learning an intervention strategy. In the first phase (warm-up) we employ a default policy (see the experimental section for details) to generate traces for each user, and use all experiences of all users to compute a Q-function. By maximization (Eq. 2) we obtain a better policy that is used at the start of the second phase (learning). During this phase we iteratively apply the policy to obtain experiences and update our Q-function (and policy) using either Q-learning or LSPI. In this phase some exploration is used, reducing the amount of exploration over time.

Cluster-Based Policy Improvement. So far, we have assumed all users belong to one group. Our main hypothesis is that since users have different (but unknown) transition and reward functions, learning one general policy for all users will not be optimal. To remedy this, we add a clustering step after the warm-up phase. Let U be the set of users targeted in the warm-up phase, and let H be the set of all traces generated. Let the number of resulting clusters be k, let U_1, ..., U_k be the partitioning of U, and let H_1, ..., H_k be the corresponding partitioning of H. Instead of utilizing all experiences in H for one Q-function, we now induce a separate Q-function (and corresponding policy π_i) for each user set U_i based on the traces in H_i, and continue with learning and performance phases for each subgroup individually. Note that these steps are done in addition to our previous setup, which allows for a comparison between a single policy for U and k subgroup policies.
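The clustering step can be sketched as follows. This is a simplified k-medoids over fixed-length reward-only traces with a Euclidean distance; the paper's actual traces also include states, and the number of clusters k is selected via the silhouette score (not shown here):

```python
import numpy as np

def trace_distance(t1, t2):
    """Euclidean distance between two equal-length traces."""
    return float(np.linalg.norm(np.asarray(t1, dtype=float) - np.asarray(t2, dtype=float)))

def k_medoids(traces, k, n_iter=10, seed=0):
    """Minimal k-medoids sketch: assign each trace to its nearest medoid, then
    move each medoid to the member minimizing total in-cluster distance."""
    rng = np.random.default_rng(seed)
    n = len(traces)
    D = np.array([[trace_distance(a, b) for b in traces] for a in traces])
    medoids = list(rng.choice(n, size=k, replace=False))
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members):
                medoids[j] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
    return np.argmin(D[:, medoids], axis=1)

# Toy reward traces from two distinct behavior types
traces = [[0, 0, 1], [0, 0.1, 1.1], [5, 5, 6], [5.2, 4.9, 6.1]]
labels = k_medoids(traces, k=2)
```

A separate Q-function would then be trained on the traces of each resulting cluster.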

3 Related Work

We model the intervention system as a reinforcement learning agent which can act by sending interventions to users. This use of reinforcement learning for intervention strategies in health, coaching, and fitness applications is a relatively new development, although much other work has considered various nudging approaches to stimulate human users towards particular behaviors. For example, adaptive persuasive systems (Kaptein & van Halteren, 2013) have been tested in field trials, for instance to increase the effectiveness of email reminders.

Reinforcement learning techniques (Wiering & van Otterlo, 2012; Sutton & Barto, 2017) are ideally suited for sequential decision making problems in health interventions, dynamic treatment regimes (Chakraborty & Murphy, 2014), or motivational strategies in citizen science (Segal et al., 2018). Work in this area has just begun to explore computational approaches. Several problems in (mobile) healthcare generate new challenges for reinforcement learning, such as missing data, privacy, and especially the difficulty of interactive simulations with real human data. For that reason we implemented a realistic simulator as an alternative data gathering option. A challenge remains, however, to stay as close as possible to actual human data.

Hochberg et al. (2016) compare reinforcement learning – in particular contextual bandits – with static reminder policies to encourage diabetes patients through SMS interventions. Raghu et al. (2017) combine continuous state space models and deep neural networks for the treatment of sepsis, and Rudary et al. (2004) combine reinforcement learning with constraints for reminder support. The latter also shows several forms of personalization that result from learning from patients with different (scheduling) habits. The work by Zhu et al. (2017) is related to ours, in that they too focus on clustering the set of users for personalization purposes and use a form of linear function approximation based batch learning as part of their approach. In addition to algorithmic differences in learning and in clustering, a major difference is that we base our experiments on extensive runs with our novel simulator. Some other work exists (cf. the mentioned papers) but so far, most is limited to a few datasets and relatively simple methods. The work by Raghu et al. (2017) is already a step towards employing more advanced methods based on deep learning, but many other recent techniques in reinforcement learning could be utilized for m-health applications (cf. (Li, 2017)).

Our work is also related to multi-task reinforcement learning, where the goal is to learn policies for multiple problems simultaneously. Some work models an explicit distribution over problems (Wilson et al., 2007) or distills a general policy which can be made more specific (Teh et al., 2017). In contrast, we focus on clustering groups of users that are alike and learning separate, more specialized policies. Our work is also related to transfer learning (Taylor & Stone, 2009), where learned policies can be transferred to other tasks, in our case from the group level to the subgroup level.

4 Simulator

For the health setting we focus on in this paper, it is difficult to experiment with different reinforcement learning strategies on real users, as this would require involving a substantial number of users in a large-scale study and gathering many interaction samples per user. We have therefore decided to build a simulator to experiment with algorithmic settings first. The simulator is created for a setting where users have daily schedules of activities and should be encouraged to conduct certain types of (healthy) activities. Below, we discuss the details of the schedules, followed by the interventions and the possibility to define rewards.

4.1 Schedules

We assume that we have n users in our simulator, u_1, ..., u_n, originating from the set U as defined before. Each of these users can conduct one of m activities at each time point t. Time points in our simulator have a discrete step size. Let act_1, ..., act_m denote the possible values of the activity. Example activities are working, sleeping, working out, and eating breakfast. Each user conducts a unique activity at each time point t. Note that this activity can also be none. For each user, a prototype schedule can be specified, which expresses for each activity:

i) an early and late start time
ii) a minimum and maximum duration of the activity
iii) a standard deviation of the duration of the activity
iv) a probability per day of performing the activity
v) priorities of other activities over this activity

Using these prototype schedules, a complete schedule is derived which assigns one unique activity to each time point, on a per-day basis, following the planning algorithm described below.


The algorithm basically uses the ranges for start times and durations of activities to generate actual start times and durations. It then starts to run a schedule and builds up a queue of activities that are relevant for the current time point (i.e. for which the current time is after the start of the activity and before the end of it). In case of multiple activities, the one already being performed is continued, or in case of a priority activity the user switches to that activity. If the queue is empty, the user is not active (or idle) and selects the none activity.
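A simplified sketch of this schedule derivation; the parameter names and the priority handling are our own illustration, not the simulator's exact code:

```python
import random

def generate_day_schedule(profile, hours=24, seed=0):
    """Derive actual start times and durations from an activity profile, then
    assign one activity (or 'none') to each hourly time step of one day."""
    rng = random.Random(seed)
    planned = []  # (activity, start, end, priority)
    for act, p in profile.items():
        if rng.random() > p.get("prob", 1.0):
            continue  # the activity is skipped today
        start = rng.uniform(p["early_start"], p["late_start"])
        duration = rng.uniform(p["min_dur"], p["max_dur"])
        planned.append((act, start, start + duration, p.get("priority", 0)))
    schedule, current = [], "none"
    for t in range(hours):
        queue = [a for a in planned if a[1] <= t < a[2]]  # activities active at t
        if not queue:
            current = "none"
        elif current in [a[0] for a in queue]:
            # continue the current activity unless a higher-priority one appears
            cur_prio = next(a[3] for a in queue if a[0] == current)
            best = max(queue, key=lambda a: a[3])
            if best[3] > cur_prio:
                current = best[0]
        else:
            current = max(queue, key=lambda a: a[3])[0]
        schedule.append(current)
    return schedule

profile = {"sleep": {"early_start": 0, "late_start": 0.5, "min_dur": 7, "max_dur": 8},
           "work": {"early_start": 8, "late_start": 9, "min_dur": 8, "max_dur": 9}}
day = generate_day_schedule(profile)
```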

4.2 Interventions and Rewards

Besides performing activities during a day, interventions can also be sent to users. In our system, the set of interventions contains a binary action, representing at each decision moment whether the system sends an intervention or not. An intervention is a message that tells the user to perform a desired activity (we assume there is only one single desired activity for now). To decide upon acceptance of a message, users have a profile that expresses between which time points they are willing to accept an intervention (e.g. a working person might not accept an intervention when at work). If a message is sent at the right time (and when the activity has not been performed yet on that day), and a gap in the schedule occurs within the user's planning horizon from the time the message is sent, the activity will be performed. Rewards can be defined based on acceptance of the message (i.e. the activity is considered as part of the queue and will be performed) and on how long the activity has been performed (e.g. there might be some optimal amount of time spent on the activity). More details for the setting we use for the specific case in this paper are given in Section 5.
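A sketch of the acceptance logic; the field names and the planning-horizon notation (h_min, h_max) are our own, and the profile values are placeholders:

```python
def accepts_intervention(user, hour, day_schedule, already_done_today):
    """Decide whether a user accepts an intervention message sent at `hour`.
    `user` holds an acceptance window and a planning horizon (h_min, h_max);
    acceptance requires a free ('none') gap within that horizon."""
    if already_done_today:
        return False  # the desired activity was already performed today
    if not (user["accept_from"] <= hour < user["accept_until"]):
        return False  # outside the hours the profile allows interventions
    lo = int(hour + user["h_min"])
    hi = int(min(hour + user["h_max"], len(day_schedule) - 1))
    # the desired activity can only be planned in a free gap in the schedule
    return any(day_schedule[t] == "none" for t in range(lo, hi + 1))

retiree = {"accept_from": 8, "accept_until": 22, "h_min": 0, "h_max": 1}
schedule = ["sleep"] * 8 + ["none"] * 4 + ["lunch"] + ["none"] * 11  # 24 hourly slots
```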

5 Experimental Setup

As said, we focus on a health setting where learning a policy as fast as possible (i.e. based on limited experiences) is essential. Within this paper, we aim to answer the following questions:

RQ1: What are the differences between batch and online learning for our simulator setting, and how can generalization over state spaces be used to speed up learning?

RQ2: Can a cluster-based RL algorithm learn faster compared to (1) learning per individual user or (2) learning across all users at once?

RQ3: Can we cluster users in a proper way based on traces of their states and rewards?

5.1 Simulator Setup

In our simulator setup, we aim to improve the amount of physical activity of users. We include different types of users. More specifically, we employ three prototypical users, referred to as the workaholic, Arnold (a vivid athlete), and the retiree. The simulator itself runs on a fine-grained time scale (a step size of 1 second), while we model decisions at a coarser granularity (a step size of one hour).

5.1.1 Activities

We include the following activities: sleep, breakfast, lunch, dinner, work, and work out. The specification of the daily schedule for each of our prototypical users is given in Table 1. We generate an equal number of agents for each of the three types.

activity param. work-aholic Arnold retiree
sleep early start 23 22 22
late start 23 23 23.5
min duration 6 8 8
max duration 7 9 10
priorities work work work
probs (day) 1,1,1,1,1,1,1 1,1,1,1,1,1,1 1,1,1,1,1,1,1
breakfast early start 7 8 7
late start 7.5 9 10
min duration 0.25 0.25 0.5
max duration 0.25 0.25 0.75
priorities work work work
probs (day) 1,1,1,1,1,1,1 1,1,1,1,1,1,1 1,1,1,1,1,1,1
lunch early start 12 12 12
late start 12 13.5 14
min duration 0.25 0.25 0.5
max duration 0.25 0.5 0.75
priorities None None None
probs (day) 1,1,1,1,1,1,1 1,1,1,1,1,1,1 1,1,1,1,1,1,1
dinner early start 18 19 18
late start 20 20.5 20
min duration 0.5 0.5 0.5
max duration 1 1 1
priorities None None None
probs (day) 1,1,1,1,1,1,1 1,1,1,1,1,1,1 1,1,1,1,1,1,1
work early start 8 9 8
late start 9 9.5 9
min duration 10 8 8
max duration 11 8 8
priorities None None None
probs (day) 1,1,1,1,1,0.8,0 1,1,0,1,0,0,0 0,0,0,0,0,0,0
work out early start 19.5 16 19
late start 20.5 21 21.5
min duration 0.5 1 0.5
max duration 1 1 1
priorities None None None
probs (day) 0,0,0,0,0,0,0 0,0,0,0,0,0,0 0,0,0,0,0,0,0
Table 1: Parameters of profiles

5.1.2 Interventions and Responses

The goal of the scenario is to make sure the total work out time meets the guideline for the amount of daily physical activity (30 minutes per day). Messages can be sent to the user to start working out. As explained before, acceptance of the message depends on the planning horizon of the user and on whether the activity fits into the schedule. The workaholic is a chronic planner, the retiree is a spontaneous planner, and Arnold is a mixed planner. The planning horizons of the three types are defined as follows: (1) chronic planner (h_min = 3, h_max = 21, σ = 0.1), (2) spontaneous planner (h_min = 0, h_max = 1, σ = 0.1), and (3) mixed planner (h_min = 0, h_max = 24, σ = 0.1). Here, the standard deviation expresses the variation among the agents spawned for a profile. On top of that, the workaholic can only accept interventions when having lunch or being idle, the retiree only accepts when idle, and Arnold always accepts. Normally, only one work out per day is performed (and messages can be rejected based on this). However, each of the three types has a probability of working out a second time in one day: Arnold has a probability of 50%, the workaholic 10%, and the retiree 0%.

How long the work out activity will be performed is defined in the profile of the user in Table 1. Fatigue plays a role here. Fatigue can build up when working out across multiple days. The value of fatigue is the number of times a user worked out in total during a consecutive number of days in which at least one workout per day occurred. When the user skips working out for a day, fatigue resets to zero. The maximum value of fatigue is 7. Agents start feeling fatigue after a threshold is reached. This threshold depends on the user: for the retiree fatigue starts after value 1, for Arnold after 4, and for the workaholic after 2. Furthermore, once the threshold is exceeded, higher fatigue reduces the time spent on working out.
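The fatigue dynamics can be sketched as follows. The per-profile thresholds come from the text, but the exact duration-scaling formula is not recoverable here, so the linear reduction below is purely an illustrative assumption:

```python
def update_fatigue(fatigue, worked_out_today):
    """Fatigue counts workouts over consecutive workout days; it resets to zero
    when a day is skipped and is capped at 7."""
    if not worked_out_today:
        return 0
    return min(fatigue + 1, 7)

def workout_duration(base_duration, fatigue, threshold):
    """ASSUMPTION: above the profile's threshold, each extra fatigue point
    shaves a fixed fraction off the workout; the simulator's real formula
    is not reproduced in this text."""
    excess = max(0, fatigue - threshold)
    return base_duration * max(0.0, 1.0 - 0.15 * excess)

THRESHOLDS = {"retiree": 1, "workaholic": 2, "arnold": 4}

f = 0
for day_workout in [True, True, False, True]:
    f = update_fatigue(f, day_workout)
# f is now 1: the skipped day reset the counter before the final workout
```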
5.2 Algorithm Setup

In our simulator environment we instantiate several aspects of our general algorithmic setup from Section 2.

5.2.1 State

As features (i.e. the basis functions φ) we use: i) the current time (hours), ii) the current week day (1-7), iii) whether the user has already worked out today (binary), iv) the fatigue level (numerical), and v) which activities were performed in the last hour (six binary features). All these features are realistically observable through sensor information, or inferable.
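A sketch of this state featurization (the encoding details are our own illustration):

```python
ACTIVITIES = ["sleep", "breakfast", "lunch", "dinner", "work", "work out"]

def featurize(hour, weekday, worked_out_today, fatigue, last_hour_activities):
    """Build the feature vector phi(s): time, weekday, workout flag, fatigue,
    and six binary indicators for activities performed in the last hour."""
    activity_bits = [1.0 if act in last_hour_activities else 0.0 for act in ACTIVITIES]
    return [float(hour), float(weekday), 1.0 if worked_out_today else 0.0,
            float(fatigue)] + activity_bits

phi = featurize(hour=13, weekday=2, worked_out_today=False, fatigue=1,
                last_hour_activities={"lunch"})
```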

5.2.2 Reward

The reward function consists of three components. If an intervention is sent, the immediate reward depends on whether the user accepts it. A second reward component is obtained when the user finishes exercising, where the exact reward value is scaled relative to the length of the exercise. A third component is related to the fatigue level of the agent: higher levels result in a small negative reward, which shapes the intervention strategy such that it does not overstimulate the user with exercises.
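A sketch of the three-component reward; the constants are illustrative placeholders, since the paper's actual values are not recoverable from this text:

```python
def reward(sent, accepted, finished_workout, workout_length, fatigue,
           r_accept=1.0, r_reject=-0.1, length_scale=1.0, fatigue_penalty=0.05):
    """Three components: (i) an immediate reward on sending, depending on
    acceptance, (ii) a reward on finishing a workout, scaled by its length,
    and (iii) a small negative reward growing with fatigue."""
    r = 0.0
    if sent:
        r += r_accept if accepted else r_reject
    if finished_workout:
        r += length_scale * workout_length
    r -= fatigue_penalty * fatigue
    return r
```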

5.2.3 Default policy

The first part of a simulation run is a warm-up phase of seven days during which interventions are driven by a default policy that sends one intervention per day to each user, at a random time within a fixed daily window. This allows us to perform exploration and to generate traces for clustering.
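The default policy can be sketched as follows; the actual daily sending window is not recoverable from this text, so the 10:00-20:00 window is a placeholder:

```python
import random

def warmup_intervention_hours(users, days=7, window=(10, 20), seed=0):
    """Default policy: during each warm-up day, send each user exactly one
    intervention at a random hour within a fixed window (placeholder bounds)."""
    rng = random.Random(seed)
    plan = {}  # (user, day) -> hour of the single intervention that day
    for day in range(days):
        for user in users:
            plan[(user, day)] = rng.randint(window[0], window[1])
    return plan

plan = warmup_intervention_hours(["u1", "u2"], days=7)
```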

5.2.4 Q-learning and LSPI

The second part of a simulation run is the learning phase, which lasts for 100 days. Immediately after the start of this phase we update the Q-table using the traces generated during the warm-up phase. In an initial experimentation phase we tuned several parameters, which we will now discuss. During the learning phase we perform updates to the Q-table once every hour. For Q-learning we use a fixed discount factor and ε-greedy exploration, and the learning rate decreases daily from its initial value. We initialize the Q-values with small random values, using a different range depending on the action of the state-action pair, to encourage exploration. To speed up learning we use experience replay: we store the most recent experiences and use these to update the Q-values. All of these choices have been made based on preliminary runs using our simulator.

For runs with LSPI we learn policies on the traces generated during the warm-up phase immediately after this phase. The policies are updated at the end of each day by training a new policy on the traces from the start of the simulation until that day. For LSPI we use a fixed discount factor, ε-greedy exploration, a bounded maximum number of iterations with a convergence threshold, and a first-wins tie-breaking strategy. Again, these parameters have been set based on initial experiments.

5.3 Setup of Runs

We started this section with a number of research questions. To answer them, we run simulations with various configurations. First of all, we vary the type of RL algorithm: online (Q-learning) and batch learning (LSPI); this enables us to answer RQ1. For each type of algorithm, we perform runs in which we learn a single policy across all users (pooled approach), a policy per cluster, and a completely individualized policy for each user (separate approach). This variation reflects RQ2. For each algorithm we perform two simulation runs for the cluster-based approach: one using k-medoids clustering with the Euclidean distance (clustering approach) and a second using three homogeneous clusters, one for each type of agent (grouped benchmark approach). The latter provides us with a (gold standard) benchmark to evaluate the cluster quality (i.e. RQ3). Hence, in total we perform eight runs.

6 Results

Figure 1: Average rewards over all different setups

Batch versus Online Learning: Figure 1 reports the results from our simulation runs. Our results demonstrate that LSPI significantly outperforms Q-learning when we compare the average daily rewards over the days of the learning phase, and it does so for all four cases (i.e. separate, pooled, cluster, and grouped benchmark). Significance has been tested using a Wilcoxon signed-rank test. LSPI learned policies that result in average daily rewards of 0.18 and above, while Q-learning learned policies with clearly lower average daily rewards. The Q-learning experiments show that online (table-based) learning without generalizing over states is not capable of learning reasonable policies within the learning period (although the learning curves show progress, and given excessive amounts of extra time, optimal performance would be reached). LSPI, on the other hand, generalizes over states and utilizes the relatively short amount of interaction much better. This is not a surprise, but it does confirm that generalization – over the experiences of multiple agents, but also over states – is needed to obtain reasonable policies in "human-scale" interaction time (and thus answers RQ1).

Different learning approaches: The grouped benchmark approach with LSPI provided us with a policy that outperformed all other policies in this setting. This is of course the result of having perfect information about the profiles of the users, which allowed us to create perfect clusters. The separate approach was the second best performing approach and ended very close to the performance of the grouped benchmark approach after learning for 100 days. The separate approach has the ability to match the performance of the grouped benchmark approach given enough time to learn. At the same time, the grouped approach clearly outperformed the pooled approach, which indicates that clustering helps us learn better policies in a shorter amount of time, by generalizing over the right agents. We can attribute the difference in performance between the clustering approach and the grouped benchmark approach to the fact that the clustering method we used did not find clusters of the same quality as those of the grouped benchmark approach. Both the grouped benchmark approach and the separate approach rely on circumstances that are less realistic in the real world: having 100 days to learn is unrealistic, and having complete knowledge of the profiles of the users is not realistic either. With the clustering-based approach we are able to speed up the learning time in comparison with the pooled approach, to potentially reach better policies.

The policies produced by Q-learning show little variation in performance across the different learning approaches. In contrast, LSPI produces policies that are significantly different from each other across these approaches (Wilcoxon signed-rank test). As we can see from Fig. 1, the policy learned with LSPI using the grouped benchmark approach resulted in the highest average daily reward. In this case three clusters were formed, each containing precisely the agents of one type. An average daily reward was observed that exceeds twice that of the clustering approach and three times that of the pooled approach. Furthermore, this approach also outperformed the policies learned with the separate approach. Although Q-learning shows little difference across the setups, an interesting observation is that clustering using knowledge about the profiles of the users performs slightly worse in terms of average daily reward than the remaining approaches when using Q-learning.

A different way of measuring performance, by the cumulative average daily reward, is reported in Figs. 2 and 3. These two graphs show the cumulative average daily reward across the different learning setups. For policies learned with LSPI, the grouped benchmark approach provided the highest cumulative reward throughout the simulation in comparison with all other approaches; a small decay was noticeable later in the simulation. The separate approach resulted in a higher cumulative reward throughout the simulation compared to the approaches that learn one policy over all users or rely on clustering to learn a policy per cluster. The pooled approach outperformed the clustering approach during the first 55 days, after which the pooled approach started decaying and was overtaken by the clustering approach.

Figure 2: Cumulative reward for LSPI

For the Q-learning case, similar behavior was noticeable for the clustering and the pooled approaches: the latter is eventually overtaken by the clustering-based approach. The grouped benchmark approach provided the lowest cumulative reward throughout the simulation in comparison with all other approaches. The separate approach is in between these two extremes.

Figure 3: Cumulative reward for Q-learning

Clustering: Figure 4 shows the clustering with the k-medoids algorithm and the Euclidean distance metric for the LSPI run. We can clearly see that the clustering assigned users with the workaholic profile to the same cluster. The assignment of the other profiles is less consistent. For the Q-learning case similar patterns were observed.

Figure 4: Profiles in the various clusters for the LSPI runs
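The clustering step can be sketched as plain k-medoids over fixed-length trace vectors with a Euclidean distance. The toy traces below are illustrative placeholders, not the simulator's actual state-reward traces.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length trace vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_medoids(traces, k, iters=100, seed=0):
    """Basic k-medoids: alternate assignment and medoid update until stable."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(traces)), k)
    for _ in range(iters):
        # assign each trace to its nearest medoid
        clusters = {m: [] for m in medoids}
        for i, t in enumerate(traces):
            nearest = min(medoids, key=lambda m: euclidean(t, traces[m]))
            clusters[nearest].append(i)
        # move each medoid to the member minimizing total intra-cluster distance
        new_medoids = [
            min(ms, key=lambda c: sum(euclidean(traces[c], traces[j]) for j in ms))
            for ms in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return clusters

# toy traces: two well-separated groups of reward sequences
traces = [[0, 0, 1], [0, 1, 1], [9, 9, 8], [8, 9, 9]]
clusters = k_medoids(traces, k=2)
```

With real traces the separation is far less clean, which is consistent with only the workaholic profile being clustered reliably.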

In Depth Profile Policy Analysis: Figures 5-7 report the performance of the different policies on each type of user. We see that Q-learning learns slowly but consistently across all types of users. LSPI, however, shows great diversity between the different profiles, also in which learning setup is most appropriate. For Arnold it performs consistently well (this is an "easy" profile, as many policies result in a positive reward), while for the other profiles some setups work well and others work very badly.

Figure 5: Cumulative reward for agents with profile Workaholic
Figure 6: Cumulative reward for agents with profile Retiree
Figure 7: Cumulative reward for agents with profile Arnold

Overall, we see that there are three different ways to speed up learning such that learning is feasible in human-scale time: i) generalization over states through basis functions (LSPI) outperforms table-based learning (Q-learning), ii) generalization over traces of several agents (group-based policies) outperforms learning for agents individually (separate learning), and iii) generalization over the right agents (cluster-based approaches) outperforms generalization over all agents (pooled). All three are needed for interventions in realistic, human domains.
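As a minimal illustration of points ii) and iii), the difference between separate, pooled, and cluster-based learning amounts to how user traces are grouped before a policy is fitted per group. The function and data below are hypothetical placeholders for that grouping step.

```python
def group_traces(traces_by_user, setup, cluster_of=None):
    """Return {policy_id: list_of_traces} for a given learning setup."""
    if setup == "separate":      # one policy per individual user
        return {user: [trace] for user, trace in traces_by_user.items()}
    if setup == "pooled":        # a single policy over all users
        return {"all": list(traces_by_user.values())}
    if setup == "clustered":     # one policy per cluster of similar users
        groups = {}
        for user, trace in traces_by_user.items():
            groups.setdefault(cluster_of[user], []).append(trace)
        return groups
    raise ValueError(setup)

# toy reward traces and a hypothetical cluster assignment
traces = {"u1": [1, 0], "u2": [1, 1], "u3": [0, 0]}
clusters = {"u1": "A", "u2": "A", "u3": "B"}
```

The clustered setup trades off the data volume of pooling against the homogeneity of separate learning, which is why it only pays off when the clustering itself is good.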

7 Discussion

In this paper, we have introduced steps towards a cluster-based reinforcement learning approach for personalization of health interventions. Such a setting is characterized by limited opportunity to collect experiences from users and an outcome focused on optimizing long-term health behavior. The presented approach allows for the identification of clusters of users that behave in a similar way and require a similar policy. We have posed various research questions to evaluate the suitability of the approach. Based on the results generated using our novel simulator, for our setting we can say that: RQ1: RL with batch learning and function approximation significantly outperforms table-based RL using online learning, disqualifying the latter when interaction time is short. RQ2: Cluster-based RL can learn a significantly better policy within days compared to learning per user and learning across all users, provided that a suitable clustering is found. RQ3: Learning suitable clusters using a Euclidean distance function and k-medoids clustering based on traces of states and rewards over a number of days proves difficult, suggesting the warm-up phase should be made longer.

While our simulator exhibits realistic behavior, we plan to move increasingly toward a setting where the actual user is in the loop. A logical next step is to use data collected from actual users to drive the behavior of the agent. We envision doing this by applying machine learning to the data per user and using the resulting model as a behavioral model for that specific user. We already have access to data obtained from a mobile treatment app used by a number of depressed patients (reference omitted for double-blind reviewing). In the data, responses to interventions of individual agents are stored, as well as socio-demographic and intake questionnaire data and daily ratings of their mental state. Clustering could even be based on the data collected at the start of the intervention. Integrating such behavioral models in our simulator is merely a small step. Furthermore, from a methodological side, we aim to experiment with more powerful reinforcement learning techniques, and we want to explore different clustering algorithms and more distance metrics to improve the clustering itself.

