Personalizing Intervention Probabilities By Pooling

12/02/2018 ∙ by Sabina Tomkins, et al. ∙ University of Michigan

In many mobile health interventions, treatments should only be delivered in a particular context, for example when a user is currently stressed, walking, or sedentary. Even in an optimal context, concerns about user burden can restrict which treatments are sent. To spread treatment delivery over the times when a user is in a desired context, it is critical to predict the future number of times the context will occur. The focus of this paper is on whether personalization can improve predictions in these settings. Though the variance between individuals' behavioral patterns suggests that personalization should be useful, the small amount of individual-level data limits its capabilities. Thus, we investigate several methods which pool data across users to overcome these deficiencies, and find that pooling lowers the overall error rate relative to both personalized and batch approaches.


1 Introduction

Mobile health (mHealth) interventions can deliver effective treatments in real-time. For example, to help users increase their physical activity, an mHealth application might send a suggestion to walk at a time when a user is motivated and able to pursue the suggestion. Users struggling to quit smoking might be prompted to do a mindfulness exercise when the system detects they are becoming stressed, helping reduce negative affect and the desire to smoke. The promise of mHealth interventions hinges on their ability to provide support at times when users need the support and are receptive to it [5].

Many interventions included in an mHealth system (e.g. reminders, coping strategies) are designed to be delivered in a particular context. These treatments are often delivered via a wearable or a push notification on a smartphone. An approach to reducing user burden is to budget the number of treatments delivered in a certain period of time (e.g., a day). Given an intervention budget, our goal is to spread this budget across the instances in which a user is in a targeted context (e.g., currently sedentary). By doing so we aim to improve user experience.

Consider an mHealth application designed to reduce sedentary behavior. The application can send users an activity suggestion when they have been sitting for 40 minutes or longer, but the number of suggestions sent each day is constrained by a budget. The subject of this paper is predicting the number of future sedentary episodes remaining in the day. This prediction is used to decide whether to send a suggestion. For example, it would be more important to send a suggestion now if the system predicted only a single additional sedentary episode than if many were likely. As sedentary behavior can vary greatly from person to person, we argue for a personalized approach.
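This paper focuses on the prediction itself; purely as an illustration of how a predicted count could drive the send decision, the sketch below spreads the remaining daily budget uniformly over the predicted number of remaining episodes. This is a hypothetical rule, not the study's actual randomization:

```python
def send_probability(remaining_budget: int, predicted_remaining: float) -> float:
    """Spread the remaining daily budget uniformly over the predicted
    number of remaining sedentary episodes (hypothetical rule)."""
    if remaining_budget <= 0:
        return 0.0          # budget exhausted: never send
    if predicted_remaining <= 1.0:
        return 1.0          # likely the last opportunity today: send now
    return min(1.0, remaining_budget / predicted_remaining)
```

Under such a rule, under-predicting the number of remaining episodes wastes opportunities early in the evening, while over-predicting risks leaving budget unspent, which is why prediction quality matters.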

Given a budget of daily activity reminders, our goal is to spread this budget across all remaining sedentary periods. Doing so is also instrumental in learning the contexts under which users are receptive to reminders: as we collect data on the interaction between context and receptivity, we can better adapt interventions to be meaningful to each user.


Personalization has been a recent focus in healthcare prediction tasks [9, 7]. Lopez-Martinez and Picard [4] proposed a multi-task neural network for pain detection, and Saeed and Trajanovski [8] proposed a multi-task neural network approach to detect stress. Here, we consider two forms of pooling. We demonstrate that these approaches make appropriate personalized predictions more quickly than a fully personalized approach, which requires more days of data. Furthermore, the pooling approaches outperform a batch model.

2 Models

We predict the number of sedentary periods remaining in a day for a person. Here, we propose two approaches: a multi-task Gaussian Process (GP) and a population-informed task-specific model we refer to as weighted regressors. These are compared to task-specific and batch models.

We treat each person as a task, such that we have $N$ tasks for a population of size $N$. Across all models we consider a dataset of inputs $X$ and outputs $Y$, where each input belongs to one of the $N$ tasks, and some outputs are given while others are unknown. Given a set of indices $T$ which defines a test set, we are interested in inferring $y_t$ for each $t \in T$.

2.1 Gaussian Process Variants

A natural question in this setting is how well a task-specific model would perform. Thus we introduce a task-specific GP, which learns a separate set of parameters for each person. We compare this to a batch GP, which treats all inputs as belonging to a single task and learns parameters for this population-sized task. Both assume the standard GP form, that is, a prediction at a test point $x_*$ is made as

$$\bar{f}(x_*) = k_*^\top \left(K^x + \sigma^2 I\right)^{-1} y,$$

where for the task-specific GP the training data are those of a single task, and for the batch GP they are the training data of all tasks. Additionally, we introduce a multi-task GP, which learns some individualized parameter values as well as some which are shared across the population: each person's parameters comprise a subset unique to the individual and a subset which contains population information, and we explicitly learn an inter-task similarity matrix $K^f$. In the multi-task case, a prediction for person (task) $l$ at a test point $x_*$ is made according to

$$\bar{f}_l(x_*) = \left(k^f_l \otimes k^x_*\right)^\top \Sigma^{-1} y, \qquad \Sigma = K^f \otimes K^x + D \otimes I,$$

where $\otimes$ denotes the Kronecker product, $k^f_l$ selects the $l$th column of $K^f$, $k^x_*$ is the vector of covariances between the test point $x_*$ and the training points, $K^x$ is the matrix of covariances between all pairs of training points, $D$ is an $N \times N$ diagonal matrix in which the $(l,l)$th element is $\sigma_l^2$, and $K^f$ is an $N \times N$ matrix [1].

For each model we use a Radial Basis Function (RBF) kernel for the covariance, and a constant mean function. We place a Gaussian prior on the mean function, where this prior has constant mean and its scale is a tunable hyper-parameter.
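To make the multi-task prediction concrete, the following is a minimal NumPy sketch of the predictive mean above. It is an illustration under simplifying assumptions (all tasks share the same training inputs, a fixed RBF lengthscale, zero mean), not the paper's GPyTorch implementation:

```python
import numpy as np

def rbf(A: np.ndarray, B: np.ndarray, lengthscale: float = 1.0) -> np.ndarray:
    """RBF kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def multitask_gp_mean(X: np.ndarray, y: np.ndarray, Kf: np.ndarray,
                      noise: np.ndarray, x_star: np.ndarray, l: int) -> float:
    """Predictive mean for task l at x_star, following Bonilla et al. [1].
    X: (n, d) training inputs shared by all N tasks; y: (N*n,) outputs
    stacked task-by-task; Kf: (N, N) inter-task similarity; noise: (N,)
    per-task noise variances."""
    n = X.shape[0]
    Kx = rbf(X, X)                              # covariances between training points
    Sigma = np.kron(Kf, Kx) + np.kron(np.diag(noise), np.eye(n))
    kx_star = rbf(X, x_star[None, :]).ravel()   # covariances to the test point
    kf_l = Kf[:, l]                             # l-th column of K^f
    return float(np.kron(kf_l, kx_star) @ np.linalg.solve(Sigma, y))
```

In practice one would learn the lengthscale, noise, and $K^f$ by maximizing the marginal likelihood, which is what the GPyTorch models do.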

2.2 Weighted Regressors

A different approach is to learn the importance of a batch model relative to a task-specific model. To do so we introduce a simple approach which we refer to as weighted regressors. Here, we learn one population-level regressor and one individual regressor. The final prediction is a weighted average of the predictions of these two models, where we adjust the population and task-specific weights, $w_p$ and $w_i$ respectively, in order to shift weight towards the task-specific model as we gain additional information. This is shown in Algorithm 1.

For a training vector $x$, we input $x$ into each regressor and obtain predictions $\hat{y}_p$ and $\hat{y}_i$ from the population and individual models respectively. These outputs form a new input, and together with the true label $y$ we train a linear regressor which learns how to weight each model. The final output is then $\hat{y} = w_p \hat{y}_p + w_i \hat{y}_i$. We show this procedure in Algorithm 1, where we adopt a sliding window approach: we train on past data up to the current time and predict a few days out.

1:    INPUT: dataset D, test set T, window size k, initial training days d
2:    OUTPUT: predictions for all test points
3:    choose a regressor class R               ▷ R refers to some class of regressors
4:    for each person u do
5:        t ← d
6:        for each window until person u's data are exhausted do
7:            fit a population regressor R_pop on all data up to day t
8:            fit an individual regressor R_u on person u's data up to day t
9:            obtain predictions ŷ_p and ŷ_i from R_pop and R_u    ▷ we predict a window length into the future
10:           train a linear regressor on (ŷ_p, ŷ_i) and the truth y to learn weights w_p and w_i
11:           output ŷ ← w_p·ŷ_p + w_i·ŷ_i for days t+1, ..., t+k; t ← t + k
12:       end for
13:   end for
Algorithm 1 (Weighted Regressors)
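As a concrete sketch of the combination step in scikit-learn (the helper name and structure are ours; it assumes the population and individual regressors are already fit on the current training window):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def weighted_prediction(pop_model, ind_model, X_train, y_train, X_test):
    """Learn weights w_p and w_i from the two models' predictions on the
    training window, then combine their predictions on the test window."""
    stacked_train = np.column_stack([pop_model.predict(X_train),
                                     ind_model.predict(X_train)])
    combiner = LinearRegression().fit(stacked_train, y_train)  # learns w_p, w_i
    stacked_test = np.column_stack([pop_model.predict(X_test),
                                    ind_model.predict(X_test)])
    return combiner.predict(stacked_test)
```

Because the combiner is refit at every window, its coefficients can shift weight toward the individual model as that model accumulates enough data to be reliable.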


3 Empirical Evaluation

It is critical that we obtain high-quality predictions in a short amount of time. Thus, to evaluate our methods we consider the setting where we obtain training data on a regular basis and we would like to forecast several days into the future. We use data obtained from a real-life mobile health study [3]. The goal of the study is to positively impact participants' long-term health by increasing their physical activity. Participants receive messages which remind them to be active, and step counts are collected as a measure of both overall activity and responsivity to messages. We can incorporate knowledge of predicted sedentary periods in determining whether to send a message at a certain time. We label an interval of 40 minutes as sedentary if the total step count in this period is less than 140.

For simplicity, we predict the number of sedentary periods between 15:00 and 21:00 on each day. In order to use daily context, we focus only on those days with some step-count data between 9:00 and 15:00, as well as between 15:00 and 21:00. Additionally, we restrict our attention to those users with at least a week of data. This results in a dataset of 36 participants with approximately 30 days of usable data per participant. For each model and participant we make sliding-window predictions, where we train with all available data and predict some number of days into the future.
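As an illustration of this labeling rule, here is a minimal pandas sketch (our own helper, assuming minute-level step data; the study's exact episode boundaries may be defined differently):

```python
import pandas as pd

def count_sedentary_periods(steps: pd.Series, start: str = "15:00",
                            end: str = "21:00", threshold: int = 140) -> int:
    """Count 40-minute intervals between `start` and `end` whose total
    step count falls below `threshold` (i.e., labeled sedentary).
    `steps` is a datetime-indexed series of step counts for one day."""
    window = steps.between_time(start, end)
    totals = window.resample("40min").sum()
    return int((totals < threshold).sum())
```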

To capture context we model users' past activity levels and external context. For example, the input for day $t$ contains the step count of the morning of day $t$, the overall step count on day $t-1$, and the number of sedentary periods on day $t-1$. Additionally, each input contains a single weather description. These are obtained from a historical hourly weather dataset (https://www.kaggle.com/selfishgene/historical-hourly-weather-data), where weather is described with a short text description, such as "partly cloudy". In total there are 21 weather descriptions, and the description for day $t$ is a categorical variable.
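A sketch of assembling one such input vector (field names, ordering, and encoding are hypothetical; the paper specifies only the features themselves):

```python
import numpy as np

# Three of the 21 short weather descriptions; stand-ins for the full list.
WEATHER = ["partly cloudy", "sky is clear", "light rain"]

def build_input(morning_steps: float, prev_total_steps: float,
                prev_sedentary_periods: int, weather: str) -> np.ndarray:
    """Concatenate activity context with a one-hot weather description."""
    onehot = np.zeros(len(WEATHER))
    onehot[WEATHER.index(weather)] = 1.0
    return np.concatenate([[morning_steps, prev_total_steps,
                            prev_sedentary_periods], onehot])
```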

3.1 Sources of variance

We expect that the extent to which pooling will be advantageous depends on the sources of variance in this data. For example, if the individuals’ behavior across time varies greatly it will be difficult to train a personalized model, and we expect some pooling to provide better performance. However, if behavior varies too greatly from person to person we might see the shared models perform poorly. Thus, before turning to the results we briefly inspect these two sources of variance.

Individuals' behavior across time. To inspect within-participant variance, we perform a log-likelihood difference test. For each participant, we set a window length which intuitively corresponds to the number of days over which a person might have stable behavior. For example, if there were no overall temporal patterns, such that each day could be treated as an independent observation, we might train a model on each day's data. However, if people behave relatively stably over a period of two days, we could train a new model every two days.

For a given window length we have $m$ models, where $m$ equals the number of days the participant was in the study divided by the window length. We then train $m$ Ordinary Least Squares (OLS) regressors. For each participant, we create a dataset of roughly 30 days, $\{(x_t, y_t)\}$, where each $x_t$ is a vector with context from day $t$ and day $t-1$, and each $y_t$ is the number of sedentary periods from 15:00-21:00 on day $t$. To perform the log-likelihood difference test we form a null model which learns one set of parameters over the entirety of a participant's history. This is compared to the aggregate of the window-length models. Thus, for each participant we determine that it would be better to train $m$ separate models if

$$\sum_{j=1}^{m} \log L(M_j) > \log L(M_0),$$

where $M_0$ is the baseline model which uses all of the historical data, and each $M_j$ is the model for the $j$th window. In Figure 1(a) we see that when considering only a single day of data at a time, most participants change from day to day. However, at windows of length five we see almost no loss in stationarity relative to training one model of window length 30. This might direct the choice to pool or not: pooling might offer more advantages for smaller window lengths.
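A minimal statsmodels sketch of this comparison (our reconstruction of the test as described above; it assumes each window contains more observations than features so every per-window OLS fit is well defined):

```python
import numpy as np
import statsmodels.api as sm

def prefer_window_models(X: np.ndarray, y: np.ndarray, window_len: int) -> bool:
    """Return True if per-window OLS fits have a higher summed log-likelihood
    than a single OLS fit over the participant's full history."""
    null_ll = sm.OLS(y, sm.add_constant(X)).fit().llf
    window_ll = 0.0
    for start in range(0, len(y) - window_len + 1, window_len):
        Xw = X[start:start + window_len]
        yw = y[start:start + window_len]
        window_ll += sm.OLS(yw, sm.add_constant(Xw)).fit().llf
    return window_ll > null_ll
```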

Similarity of behavior between participants. To assess between-participant similarities we perform dynamic time warping (DTW) between all pairs of participants. Figure 1(b) shows the resulting similarities between users. We see both large swaths of dissimilarity (light yellow regions) and pockets of high similarity. This provides some optimism for pooling, while also highlighting the need for personalization.
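For reference, a textbook dynamic-programming implementation of the DTW distance between two one-dimensional behavioral series (not the paper's code; similarity is then taken as the inverse of this score, as in Figure 1(b)):

```python
import numpy as np

def dtw_distance(a, b) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping distance between
    two 1-D series, e.g., daily sedentary-period counts."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Best alignment ends in a match, an insertion, or a deletion.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```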

(a) Fitting a model every five days is roughly equivalent to fitting one model for all days: considering periods of five days, most participants have low time variance.
(b) Between-participant similarity as inverse DTW score.
Figure 1: Within-person variance and potential for pooling.

3.2 Experimental Results

We assess the overall Mean Squared Error (MSE) (Figure 2), and evaluate how this changes with increasing amounts of training data (Figure 3). All results are obtained from five-fold cross validation. We split folds by participant, as this mirrors the situation where a new participant joins a study and data from previous participants is utilized for their predictions. The GP models are implemented in GPyTorch [2], while the remaining models are implemented in scikit-learn [6]. We adopt a simple non-personalized baseline which we refer to as Mean: for each person, on each day, we predict the training-data population average number of remaining sedentary periods.
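A sketch of this participant-level cross-validation split in scikit-learn (shapes are synthetic stand-ins for the study data):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(36 * 30, 4))                # 36 participants, ~30 days each
y = rng.poisson(4.0, size=len(X)).astype(float)  # remaining-sedentary-period counts
participants = np.repeat(np.arange(36), 30)

# GroupKFold keeps every day from one participant in a single fold, so each
# test fold simulates new participants unseen at training time.
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=participants):
    unseen = np.unique(participants[test_idx])   # participants absent from training
```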

Figure 2: Average error on predicting sedentary periods.
Figure 3: Error over day in study.

4 Discussion

Figure 2 shows that the two pooling approaches, the multi-task GP and the weighted regressors, achieve the best overall error rates. Furthermore, all of the approaches beat the simple mean-prediction baseline. However, the extent to which pooling outperforms task-specific models depends on the window length. When longer windows are used in each training period, we see the error rate of the personalized approach decrease while it increases for pooling. This suggests that waiting longer before retraining can increase the efficacy of fully personalized models. Figure 3 shows the time-variance of this data.

As users' initial impressions of an intervention's usefulness are crucial, an especially vital question is how well the models predict in the first few days, when there is no training data for the new user. There, we see that pooling achieves the lowest error rates. In these first critical days, pooling achieves a 14% error reduction over the task-specific model in the three-day case, and a 9% reduction in the five-day case. Thus, if predictions must be made early in the study when limited data is available, some form of pooling might be advantageous.

5 Conclusion

Many mobile health intervention treatments are designed to be delivered in a particular context. Furthermore, the number of deliveries per day is often constrained due to concerns about user burden. Thus, spreading the treatments across the times at which a user is in the desired context is critical. Here, we propose to personalize the probability of receiving a treatment according to the number of times the context would occur in the remaining part of the day. We consider the context of being sedentary, and evaluate our ability to predict sedentary periods on a real-world dataset collected from a mobile health study. The amount of data on each individual is small, particularly at the beginning of the study. Utilizing some form of population-level information can contribute towards rapid personalization for each user, and reduces overall predictive error. We leave many open questions for future work, for example the optimal treatment of the non-stationarity of this data.


Acknowledgements

Research presented in this paper was supported by the National Heart, Lung and Blood Institute under award number R01HL125440; the National Institute on Alcohol Abuse and Alcoholism under award number R01AA023187; the National Institute on Drug Abuse under award number P50DA039838; and the National Institute of Biomedical Imaging and Bioengineering under award number U54EB020404.

References

  • [1] Edwin V. Bonilla, Kian M. Chai, and Christopher Williams. Multi-task Gaussian process prediction. In Advances in Neural Information Processing Systems, pages 153–160, 2008.
  • [2] Jacob R. Gardner, Geoff Pleiss, David Bindel, Kilian Q. Weinberger, and Andrew Gordon Wilson. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. In Advances in Neural Information Processing Systems, 2018.
  • [3] Predrag Klasnja, Shawna Smith, Nicholas J. Seewald, Andy Lee, Kelly Hall, Brook Luers, Eric B. Hekler, and Susan A. Murphy. Efficacy of contextually tailored suggestions for physical activity: A micro-randomized optimization trial of HeartSteps. Annals of Behavioral Medicine, 2018.
  • [4] Daniel Lopez-Martinez and Rosalind Picard. Multi-task neural networks for personalized pain recognition from physiological signals. In Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), 2017 Seventh International Conference on, pages 181–184. IEEE, 2017.
  • [5] Inbal Nahum-Shani, Shawna N Smith, Bonnie J Spring, Linda M Collins, Katie Witkiewitz, Ambuj Tewari, and Susan A Murphy. Just-in-time adaptive interventions (JITAIs) in mobile health: key components and design principles for ongoing health behavior support. Annals of Behavioral Medicine, 52(6):446–462, 2017.
  • [6] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • [7] Ognjen Rudovic, Jaeryoung Lee, Miles Dai, Bjorn Schuller, and Rosalind Picard. Personalized machine learning for robot perception of affect and engagement in autism therapy. arXiv preprint arXiv:1802.01186, 2018.
  • [8] Aaqib Saeed and Stojan Trajanovski. Personalized driver stress detection with multi-task neural networks using physiological signals. arXiv preprint arXiv:1711.06116, 2017.
  • [9] Ali Shoeb, Herman Edwards, Jack Connolly, Blaise Bourgeois, S Ted Treves, and John Guttag. Patient-specific seizure onset detection. Epilepsy & Behavior, 5(4):483–498, 2004.