Fast Physical Activity Suggestions: Efficient Hyperparameter Learning in Mobile Health

12/21/2020 ∙ by Marianne Menictas, et al. ∙ Harvard University ∙ Stanford University

Users can be supported to adopt healthy behaviors, such as regular physical activity, via relevant and timely suggestions on their mobile devices. Recently, reinforcement learning algorithms have been found to be effective for learning the contexts under which it is best to provide suggestions. However, these algorithms are not necessarily designed for the constraints posed by mobile health (mHealth) settings: that they be efficient, domain-informed, and computationally affordable. We propose an algorithm for providing physical activity suggestions in mHealth settings. Using domain science, we formulate a contextual bandit algorithm which makes use of a linear mixed effects model. We then introduce a procedure to efficiently perform hyperparameter updating, using far fewer computational resources than competing approaches. Not only is our approach computationally efficient, it is also easily implemented with closed-form matrix algebraic updates, and we show improvements over state-of-the-art approaches in both speed and accuracy of up to 99%.


1 Introduction

Physical activity can effectively reduce the risk of severe health problems, such as heart disease, yet many at-risk people fail to exercise regularly [berlin1990meta; billinger2014physical]. Mobile health (mHealth) applications that provide users with regular nudges on their mobile devices can offer convenient and steady support for adopting healthy physical activity habits, and have been shown to markedly increase physical activity [klasnja2019efficacy]. However, to realize their potential, algorithms that provide physical activity suggestions must be efficient, domain-specific, and computationally affordable.

We present an algorithm designed for an mHealth study in which an online Thompson Sampling (TS) contextual bandit algorithm is used to personalize the delivery of physical activity suggestions [tomkins2020intelligentpooling]. These suggestions are intended to increase near-term physical activity. The personalization occurs via: (i) the user's current context, which is used to decide whether to deliver a suggestion; and (ii) random effects, which are used to learn user- and time-specific parameters that encode the influence of the context. The user- and time-specific parameters are modeled in the reward function (the mean of the reward conditional on context and action). To learn these parameters, information is pooled across users and time dynamically, combining TS with a Bayesian random effects model.

The contributions of this paper are: (i) the development of a TS algorithm involving fast empirical Bayes (EB) fitting of a Bayesian random effects model for the reward; (ii) an EB fitting procedure that only requires storage and computation at each iteration on the order of $m$, where $m$ is the larger of the two grouping dimensions considered; for example, $m$ may represent the number of users and the other dimension the number of time points, or vice-versa; and (iii) a reduction in running time over state-of-the-art methods that, critically, does not require advanced hardware.

2 Methods: Hyperparameter learning for mHealth reward functions

2.1 Problem Setting

At each time, $t$, on each user, $i$, a vector of context variables, $S_{it}$, is observed. An action, $A_{it}$, is selected. We consider binary actions, $A_{it} \in \{0, 1\}$. Subsequently a real-valued reward, $Y_{it}$, is observed. This continues for $T$ time points on each of $N$ users. We assume that the reward at time $t$ is generated with a person- and time-specific mean,

$E(Y_{it} \mid S_{it}, A_{it}) = x_{it}^{\top}\beta + (z^{u}_{it})^{\top} u_i + (z^{v}_{it})^{\top} v_t,$

where $x_{it}$, $z^{u}_{it}$ and $z^{v}_{it}$ are known features of the context $S_{it}$ and action $A_{it}$. The $\beta$, $u_i$ and $v_t$ are unknown parameters; in particular, $u_i$ is the vector of the $i$th user's parameters and $v_t$ is the vector of time-$t$ parameters. Time $t$ corresponds to "time-since-under-treatment" for a user. User-specific parameters, $u_i$, capture unobserved user variables that influence the reward at all times $t$; in mHealth, unobserved user variables may include level of social support for activity, or pre-existing problems or preferences that make activity difficult. The time-specific parameters, $v_t$, capture unobserved "time-since-under-treatment" variables that influence the reward for all users; in mHealth, these might include treatment fatigue, decreasing motivation, etc.

We use the following Bayesian mixed effects model [laird1982random] for the reward, as in [tomkins2020intelligentpooling]:

$Y_{it} = x_{it}^{\top}\beta + (z^{u}_{it})^{\top} u_i + (z^{v}_{it})^{\top} v_t + \varepsilon_{it}, \qquad \varepsilon_{it} \stackrel{\mathrm{ind.}}{\sim} N(0, \sigma^2_{\varepsilon}).$    (1)

We let $t$ represent the index for the current time-since-under-treatment and $\tau$ be the index from the beginning of a study until the current time, i.e., the calendar-time index. The algorithm is designed with independent Gaussian priors on the unknown parameters:

$\beta \sim N(\mu_\beta, \Sigma_\beta), \qquad u_i \stackrel{\mathrm{ind.}}{\sim} N(0, \Sigma_u), \qquad v_t \stackrel{\mathrm{ind.}}{\sim} N(0, \Sigma_v).$    (2)

The $u_i$ and $v_t$ are called random effects in the statistical literature, and the model in (1) and (2) is often referred to as a linear mixed effects model [laird1982random], or a linear mixed model with crossed random effects (e.g., [baayen2008mixed; jeon2017variational]). At each time, $t$, the TS procedure in Algorithm 1 is used to select the action, $A_{it}$, based on the context $S_{it}$. That is, we compute the posterior distribution for $\theta_{it} = (\beta, u_i, v_t)$ given the data observed so far and, for context $S_{it}$, select treatment $A_{it} = 1$ with posterior probability

$\pi_{it} = \Pr\{\, r(S_{it}, 1; \tilde{\theta}) > r(S_{it}, 0; \tilde{\theta}) \,\}, \qquad \tilde{\theta} \sim N(m_{it}, V_{it}),$    (3)

where $r(s, a; \theta)$ is the reward mean in (1) and $(m_{it}, V_{it})$ are the posterior mean and covariance matrix given in the sub-blocks of (S.1).
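To make the randomization step concrete, the following sketch computes a probability of the form (3). Because the reward mean in (1) is linear in the parameters, the probability reduces to a Gaussian tail area under the posterior; no Monte Carlo sampling is needed. This is a minimal sketch: the function name and the feature-map arguments (`x`, `z_u`, `z_v`) are illustrative and not part of the paper's code.

```python
import numpy as np
from scipy.stats import norm

def ts_probability(m, V, s, x, z_u, z_v):
    """Randomization probability of the form (3): the chance that action 1
    yields a larger modelled reward than action 0 when the stacked parameters
    theta = (beta, u_i, v_t) are drawn from the posterior N(m, V)."""
    # Feature difference between action 1 and action 0, stacked like theta.
    d = np.concatenate([x(s, 1) - x(s, 0),
                        z_u(s, 1) - z_u(s, 0),
                        z_v(s, 1) - z_v(s, 0)])
    mean_diff = d @ m                 # posterior mean of r(s, 1) - r(s, 0)
    sd_diff = np.sqrt(d @ V @ d)      # posterior s.d. of that difference
    return norm.cdf(mean_diff / sd_diff)
```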

Further detail, including notation, on the Bayesian mixed effects model components is given in the Supplement in Section S.1. The form of the posterior updates is provided in Section S.2. The Naïve EB procedure is explained in Sections S.2.1 and S.2.2 and the Naïve EB algorithm is provided in Algorithm S.1 in Section S.3.

1: Initialize: prior means and covariances in (2) and variance components $\sigma^2_\varepsilon$, $\Sigma_u$, $\Sigma_v$
2: for each calendar time $\tau$ do
3:     for each user $i$ in the study at time $\tau$ do
4:         Receive context features $S_{it}$ for user $i$ and time $t$
5:         Obtain posterior using Result S.1
6:         Compute randomization probability $\pi_{it}$ in (3)
7:         Sample treatment $A_{it} \sim \mathrm{Bernoulli}(\pi_{it})$
8:         Observe reward $Y_{it}$
9:         if a hyperparameter update is scheduled then
10:            Update $\sigma^2_\varepsilon$, $\Sigma_u$, $\Sigma_v$ with Alg. 2
11:            Update posterior
12:        end if
13:    end for
14: end for
Algorithm 1 Physical activity suggestions.
1: Initialize: variance components $\sigma^2_\varepsilon$, $\Sigma_u$, $\Sigma_v$
2: repeat
3:     E-step: Compute the components of $E(\theta \mid Y)$ and the required sub-blocks of
4:         $\mathrm{Cov}(\theta \mid Y)$ via the STLSLS algorithm (Result S.1), which returns
5:         $x_1$, $A^{11}$, $x_{2,i}$, $A^{22,i}$ and $A^{12,i}$
6:     M-step: Compute the variance components $\sigma^2_\varepsilon$, $\Sigma_u$, $\Sigma_v$
7:         via equation (S.2)
8: until the expected complete-data log-likelihood converges
Algorithm 2 Streamlined EB for fitting.
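The structure of Algorithm 2 can be summarized as a generic EM driver. The sketch below assumes hypothetical `e_step`, `m_step` and `log_lik` callables; it illustrates the control flow only and is not the paper's implementation (which is in R with Fortran 77 wrappers).

```python
import numpy as np

def empirical_bayes_em(e_step, m_step, log_lik, sigma_init, tol=1e-3, max_iter=100):
    """Skeleton of the EB fit in Algorithm 2: alternate an E-step that returns
    the posterior moments needed by (S.2) with the closed-form M-step, until
    the log-likelihood used for the stopping rule stabilises.

    e_step(sigma)   -> posterior moments (means and covariance sub-blocks)
    m_step(moments) -> updated variance components sigma
    log_lik(sigma)  -> log-likelihood value used for the stopping rule
    """
    sigma, prev = sigma_init, -np.inf
    for _ in range(max_iter):
        moments = e_step(sigma)        # streamlined: via STLSLS (Result S.1)
        sigma = m_step(moments)        # closed-form updates in (S.2)
        cur = log_lik(sigma)
        if abs(cur - prev) < tol:      # same form of stopping rule as in Section 3.1
            break
        prev = cur
    return sigma
```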

2.1.1 Computational Challenges

At each iteration in Algorithm S.1, computation of $\mathrm{Cov}(\theta \mid Y)$ requires solving the sparse matrix linear system with coefficient matrix

$\sigma^{-2}_{\varepsilon} C^{\top} C + \Sigma_{\theta}^{-1},$

where $C$ is the stacked design matrix and $\Sigma_\theta$ the blocked prior covariance defined in Section S.1, and where the left-hand side has the sparse structure imposed by the random effects, as exemplified in Figure 1. This matrix has dimension $(p + N q^u + T q^v) \times (p + N q^u + T q^v)$. Often, the number of fixed effects parameters $p$, random effects parameters per user $q^u$ and random effects parameters per time $q^v$ are small. Consequently, naïve computation of $\mathrm{Cov}(\theta \mid Y)$ is $O\{(p + N q^u + T q^v)^3\}$, i.e., it has cubic dependence on the random effects group sizes $N$ and $T$. To address this computational problem, we take advantage of the block sparsity structure in Figure 1 and the fact that the closed-form updates of the variance components in (S.2) require computation of only certain sub-blocks of $\mathrm{Cov}(\theta \mid Y)$ and not the entire matrix.

Figure 1: Sparsity of the coefficient matrix above under the Bayesian mixed effects model in (1) and (2). Blue square: non-zero entries; yellow square: zero entries.
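The following sketch illustrates the size of this linear system by assembling the joint precision matrix of $\theta$ under the stacked notation of Section S.1. All names are illustrative, and the construction is the naïve one whose cubic solve cost motivates the streamlined approach.

```python
import numpy as np
from scipy import sparse
from scipy.linalg import block_diag

def joint_precision(C, Sigma_beta, Sigma_u, Sigma_v, sigma2_eps, N, T):
    """Precision matrix of the stacked parameters theta = (beta, u_1..u_N, v_1..v_T),
    assuming the columns of C are ordered the same way. Its dimension is
    p + N*q_u + T*q_v, so a dense solve costs O((p + N*q_u + T*q_v)^3)."""
    prior_prec = block_diag(
        np.linalg.inv(Sigma_beta),
        *([np.linalg.inv(Sigma_u)] * N),   # one block per user
        *([np.linalg.inv(Sigma_v)] * T),   # one block per time point
    )
    return sparse.csr_matrix(C.T @ C / sigma2_eps + prior_prec)
```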

2.2 Streamlined EB

Streamlined updating of $E(\theta \mid Y)$ and of each of the sub-blocks of $\mathrm{Cov}(\theta \mid Y)$ required for the E-step in Alg. S.1 can be embedded within the class of two-level sparse matrix problems defined in [nolan2019streamlined], and is encapsulated in Result S.1. Result S.1 is analogous and mathematically identical to Result 2 in [menictas2019streamlinedCrossed]; the difference is that the authors of [menictas2019streamlinedCrossed] do not apply their methodology to the mHealth setting, and they use full variational Bayesian inference for fitting as opposed to our use of EB.

The streamlined equivalent of Alg. S.1 is given in Alg. 2. Alg. 2 makes use of the STLSLS algorithm, which was first presented in [nolan2019streamlined] but is also provided in the appendix of this article. With streamlined updating, the computing time and storage for obtaining $E(\theta \mid Y)$ and each of the required sub-blocks of $\mathrm{Cov}(\theta \mid Y)$ in the E-step of Alg. 2 grow only linearly, rather than cubically, in the group sizes $N$ and $T$; for moderately sized $\min(N, T)$, this is on the order of $\max(N, T)$ per iteration.

3 Performance Assessment and Comparison

We compare the timing and accuracy of our streamlined method against state-of-the-art software, GPyTorch [gardner2018gpytorch], a highly efficient implementation of Gaussian process regression with GPU acceleration. Note that the Bayesian linear mixed effects model in (1) and (2) is equivalent to a Gaussian process regression model with a simple structured kernel. We use sEB to refer to the streamlined EB algorithm, GPyT-CPU for EB fitting using GPyTorch on CPU, and GPyT-GPU for EB fitting using GPyTorch on GPU. sEB and GPyT-CPU were run on an Intel Xeon CPU E5-2683. GPyT-GPU was run on an Nvidia Tesla V100-PCIE-32GB.
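One way to see this equivalence is through the covariance between two rewards implied by (1) and (2). The sketch below writes out this structured kernel with illustrative function and argument names; the residual variance $\sigma^2_\varepsilon$ would be added on the diagonal of the resulting kernel matrix.

```python
import numpy as np

def mixed_model_kernel(a, b, Sigma_beta, Sigma_u, Sigma_v):
    """Covariance of two rewards under (1)-(2), for observations
    a = (i, t, x, z_u, z_v) and b = (j, s, x', z_u', z_v').
    Shared fixed effects always contribute; user and time random effects
    contribute only when the user / time indices match."""
    i, t, x, zu, zv = a
    j, s, xp, zup, zvp = b
    k = x @ Sigma_beta @ xp
    if i == j:
        k += zu @ Sigma_u @ zup
    if t == s:
        k += zv @ Sigma_v @ zvp
    return k
```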

3.1 Batch Speed Assessment

We simulated batch data according to versions of the Bayesian mixed effects model in (1) and (2) with a continuous predictor generated from the Uniform distribution on the unit interval. The true parameter values and the number of time periods $T$ were held fixed, and the number of data points for each user and time period was set to 5. Four studies were run with differing values for the number of users $N$; the resulting total numbers of data points are listed in Table 1. We simulated 50 replications of the data for each study and recorded computational times for GPyT-CPU, GPyT-GPU and sEB. Alg. 2 was implemented in R with Fortran 77 wrappers. EM iterations stopped once the absolute difference between successive expected complete-data log-likelihood values fell below a fixed tolerance. The same stopping criterion was used for GPyTorch, with the maximum number of iterations set to 15.
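As a minimal illustration of this simulation design, crossed-random-effects data can be generated as follows. The parameter values here are made up for the sketch and are not the values used in the study; scalar random effects are used for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, m = 30, 10, 5                            # users, time periods, points per (user, time); illustrative
Sigma_u, Sigma_v, sigma2_eps = 0.5, 0.3, 1.0   # illustrative scalar variance components
beta = np.array([1.0, -0.5])                   # illustrative fixed intercept and slope

u = rng.normal(0.0, np.sqrt(Sigma_u), size=N)  # user random effects
v = rng.normal(0.0, np.sqrt(Sigma_v), size=T)  # time random effects

rows = []
for i in range(N):
    for t in range(T):
        x = rng.uniform(size=m)                # continuous predictor on the unit interval
        y = beta[0] + beta[1] * x + u[i] + v[t] + rng.normal(0.0, np.sqrt(sigma2_eps), size=m)
        rows.append((i, t, x, y))              # one batch of m observations per (user, time) cell
```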

Table 1 shows the mean (standard deviation) of elapsed computing times in seconds for estimation of the variance components using sEB, GPyT-CPU and GPyT-GPU. NA values indicate that the computational burden was too high to produce results. Figure 2 shows the absolute error values for each variance component estimated using sEB, GPyT-CPU and GPyT-GPU, summarized as boxplots.

Datapoints sEB GPyT-CPU GPyT-GPU
1,500 0.7 (0.10) 5.8 (0.14) 1.5 (0.16)
7,500 1.7 (0.15) 163.8 (1.81) 1.3 (0.04)
15,000 2.8 (0.21) 736.2 (38.36) 5.2 (0.03)
1,500,000 322.1 (24.82) NA (NA) NA (NA)
Table 1: Mean (std dev) of computing times in seconds for estimation of the variance components using sEB via Alg. 2, GPyT-CPU and GPyT-GPU for comparison.
Figure 2: Summary of the batch speed assessment. Absolute error values for each variance component estimated using sEB, GPyT-CPU and GPyT-GPU, summarized as boxplots.
Figure 3: Regret averaged across all users for each week in the simulated mHealth trial associated with approaches sEB, GPyT-CPU and GPyT-GPU.

3.2 Online TS Contextual Bandit mHealth Simulation Study

Next, we evaluate our approach in a simulated mHealth study designed to capture many of the real-world difficulties of mHealth clinical trials. Users in this simulated study are sent interventions multiple times each day according to Alg. 1. Each intervention represents a message promoting healthy behavior. In this setting there are 32 users and each user is in the study for 10 weeks. Users join the study in a staggered fashion, such that each week new users might be joining or leaving the study. Each day in the study users can receive up to 5 mHealth interventions.

In Figure 3 we show the ability of our streamlined algorithm to minimize regret when real data are used to inform the simulation. We also compare our approach to GPyT-CPU and GPyT-GPU. For all users we show the average performance for their $k$th week in the study; for example, we first show the average regret across all users in their first week of the study, although this will not be the same calendar week, as users join in a staggered manner. The average total time (standard deviation) for estimation of the variance components was 757.1 (76.48) using GPyT-CPU, 7.5 (0.27) using GPyT-GPU and 7.3 (0.16) using sEB.

4 Conclusion

Inspection of the computing times in Table 1 shows sEB achieving the lowest average time across all simulations and data set sizes, compared to GPyT-CPU and GPyT-GPU. In the starkest case this amounts to a 98.96% reduction in running time. Even in the more modest comparisons, sEB timing is similar to that of GPyT-GPU but does not require advanced hardware.

This work was supported in part by U.S. National Institutes of Health grants R01AA23187 (NIH/NIAAA), P50DA039838 (NIH/NIDA) and U01CA229437 (NIH/NCI).

References

  • [1] Sivaram Ambikasaran, Daniel Foreman-Mackey, Leslie Greengard, David W Hogg, and Michael O’Neil. Fast direct methods for gaussian processes. IEEE transactions on pattern analysis and machine intelligence, 38(2):252–265, 2015.
  • [2] R Harald Baayen, Douglas J Davidson, and Douglas M Bates. Mixed-effects modeling with crossed random effects for subjects and items. Journal of memory and language, 59(4):390–412, 2008.
  • [3] Jesse A Berlin and Graham A Colditz. A meta-analysis of physical activity in the prevention of coronary heart disease. American journal of epidemiology, 132(4):612–628, 1990.
  • [4] Sandra A Billinger, Ross Arena, Julie Bernhardt, Janice J Eng, Barry A Franklin, Cheryl Mortag Johnson, Marilyn MacKay-Lyons, Richard F Macko, Gillian E Mead, Elliot J Roth, et al. Physical activity and exercise recommendations for stroke survivors: a statement for healthcare professionals from the american heart association/american stroke association. Stroke, 45(8):2532–2553, 2014.
  • [5] Ilija Bogunovic, Jonathan Scarlett, and Volkan Cevher. Time-varying gaussian process bandit optimization. In Artificial Intelligence and Statistics, pages 314–323, 2016.
  • [6] Bob Carpenter, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of statistical software, 76(1), 2017.
  • [7] George Casella. An introduction to empirical bayes data analysis. The American Statistician, 39(2):83–87, 1985.
  • [8] Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 844–853. JMLR.org, 2017.
  • [9] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
  • [10] Thomas Desautels, Andreas Krause, and Joel W Burdick. Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization. The Journal of Machine Learning Research, 15(1):3873–3923, 2014.
  • [11] Josip Djolonga, Andreas Krause, and Volkan Cevher. High-dimensional gaussian process bandits. In Advances in Neural Information Processing Systems, pages 1025–1033, 2013.
  • [12] V Dorie. blme: Bayesian linear mixed-effects models. R package, URL: https://CRAN.R-project.org/package=blme, 2015.
  • [13] Jacob Gardner, Geoff Pleiss, Kilian Q Weinberger, David Bindel, and Andrew G Wilson. Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration. In Advances in Neural Information Processing Systems, 2018.
  • [14] Matthew D Hoffman, Eric Brochu, and Nando de Freitas. Portfolio allocation for Bayesian optimization. In UAI, pages 327–336. Citeseer, 2011.

  • [15] Minjeong Jeon, Frank Rijmen, and Sophia Rabe-Hesketh. A variational maximization–maximization algorithm for generalized linear mixed models with crossed random effects. psychometrika, 82(3):693–716, 2017.
  • [16] Predrag Klasnja, Shawna Smith, Nicholas J Seewald, Andy Lee, Kelly Hall, Brook Luers, Eric B Hekler, and Susan A Murphy. Efficacy of contextually tailored suggestions for physical activity: A micro-randomized optimization trial of heartsteps. Annals of Behavioral Medicine, 53(6):573–582, 2019.
  • [17] Andreas Krause and Cheng S Ong. Contextual gaussian process bandit optimization. In Advances in neural information processing systems, pages 2447–2455, 2011.
  • [18] Nan M Laird, James H Ware, et al. Random-effects models for longitudinal data. Biometrics, 38(4):963–974, 1982.
  • [19] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670. ACM, 2010.
  • [20] Haitao Liu, Yew-Soon Ong, Xiaobo Shen, and Jianfei Cai. When gaussian process meets big data: A review of scalable gps. arXiv preprint arXiv:1807.01065, 2018.
  • [21] Marianne Menictas, Gioia Di Credico, and Matt P Wand. Streamlined variational inference for linear mixed models with crossed random effects. arXiv preprint arXiv:1910.01799, 2019.
  • [22] Carl N Morris. Parametric empirical bayes inference: theory and applications. Journal of the American statistical Association, 78(381):47–55, 1983.
  • [23] Tui H Nolan, Marianne Menictas, and Matt P Wand. Streamlined computing for variational inference with higher level random effects. arXiv preprint arXiv:1903.06616, 2019.
  • [24] Tui H Nolan and Matt P Wand. Solutions to sparse multilevel matrix problems. arXiv preprint arXiv:1903.03089, 2019.
  • [25] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
  • [26] Sabina Tomkins, Peng Liao, Predrag Klasnja, and Susan Murphy. Intelligentpooling: Practical thompson sampling for mhealth. arXiv preprint arXiv:2008.01571, 2020.
  • [27] Zi Wang, Bolei Zhou, and Stefanie Jegelka. Optimization as estimation with gaussian processes in bandit settings. In Artificial Intelligence and Statistics, pages 1022–1031, 2016.

S.1 Bayesian Mixed Effects Model Components

We define the stacked response vector $Y$ and the design matrix $C$, formed by stacking the rows $\big(x_{it}^{\top},\, (e^N_i \otimes z^{u}_{it})^{\top},\, (e^T_t \otimes z^{v}_{it})^{\top}\big)$ across all observed user-time pairs, together with the parameter vectors $\beta$, $u = (u_1, \ldots, u_N)$ and $v = (v_1, \ldots, v_T)$, where, as before, $x_{it}$, $z^{u}_{it}$ and $z^{v}_{it}$ represent the known features of the context $S_{it}$ and action $A_{it}$, and $e^N_i$ denotes the $i$th canonical basis vector of length $N$. The dimensions are $\beta \in \mathbb{R}^{p}$, $u_i \in \mathbb{R}^{q^u}$ for $i = 1, \ldots, N$, and $v_t \in \mathbb{R}^{q^v}$ for $t = 1, \ldots, T$. We also write $\theta = (\beta, u, v)$, with prior mean $\mu_\theta = (\mu_\beta, 0, \ldots, 0)$ and prior covariance $\Sigma_\theta = \mathrm{blockdiag}(\Sigma_\beta,\, I_N \otimes \Sigma_u,\, I_T \otimes \Sigma_v)$.

S.2 Posterior Updates

The posterior distribution of the immediate treatment effect for user $i$ at time $t$ is updated and then used to assign treatment at the subsequent decision point. Here, we show the form of the full posterior for $\theta = (\beta, u, v)$. Define $C$, $\mu_\theta$ and $\Sigma_\theta$ as in Section S.1, and let $Y$ collect all rewards observed so far. The estimated posterior distribution of the fixed and random reward effects vector is

$\theta \mid Y \sim N\Big( V\big(\sigma^{-2}_{\varepsilon} C^{\top} Y + \Sigma_\theta^{-1}\mu_\theta\big),\; V \Big), \qquad V = \big(\sigma^{-2}_{\varepsilon} C^{\top} C + \Sigma_\theta^{-1}\big)^{-1}.$    (S.1)

The focus of this work is to enable fast incremental estimation of the variance components $\sigma^2_\varepsilon$, $\Sigma_u$ and $\Sigma_v$. We describe a natural, but computationally challenging, approach for estimating these variances in Section S.2.1 and our streamlined alternative in Section 2.2.
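A direct, non-streamlined evaluation of (S.1) is a single dense solve. The sketch below shows this reference computation under the stacked notation above; the names are illustrative, and this is exactly the step whose cost the streamlined algorithm avoids.

```python
import numpy as np

def naive_posterior(C, y, mu_theta, Sigma_theta, sigma2_eps):
    """Posterior mean and covariance in (S.1), computed by a dense solve."""
    prec = C.T @ C / sigma2_eps + np.linalg.inv(Sigma_theta)
    V = np.linalg.inv(prec)                                   # Cov(theta | Y)
    m = V @ (C.T @ y / sigma2_eps + np.linalg.solve(Sigma_theta, mu_theta))
    return m, V                                               # E(theta | Y), Cov(theta | Y)
```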

S.2.1 Parametric Empirical Bayes

At each update time, the EB [22, 7] procedure maximizes the marginal likelihood based on all user data up to and including the current time with respect to the variance components $\sigma^2_\varepsilon$, $\Sigma_u$ and $\Sigma_v$. The marginal likelihood of $Y$ is obtained by integrating $\theta$ out of the model in (1) and (2) and has the form of the multivariate Gaussian density

$p(Y; \sigma^2_\varepsilon, \Sigma_u, \Sigma_v) = N\big(Y;\; C\mu_\theta,\; C\Sigma_\theta C^{\top} + \sigma^2_\varepsilon I\big).$

The maximization is commonly done via the Expectation Maximization (EM) algorithm [9].
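For completeness, this marginal criterion can be evaluated directly in small problems. A sketch, again using the stacked notation and illustrative names:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_marginal_likelihood(C, y, mu_theta, Sigma_theta, sigma2_eps):
    """log p(Y; sigma^2_eps, Sigma_u, Sigma_v): integrating theta out of
    Y = C theta + eps gives the Gaussian below."""
    mean = C @ mu_theta
    cov = C @ Sigma_theta @ C.T + sigma2_eps * np.eye(len(y))
    return multivariate_normal.logpdf(y, mean=mean, cov=cov)
```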

S.2.2 Expectation Maximization (EM) Method

The expected complete-data log-likelihood is $Q(\Sigma) = E\{\log p(Y, \theta; \Sigma) \mid Y\}$, where $\Sigma = (\sigma^2_\varepsilon, \Sigma_u, \Sigma_v)$ and the expectation is over the posterior distribution of $\theta$, whose prior is given in (2). The M-step yields the following closed-form iteration estimates for the variance components in $\Sigma$:

$\Sigma_u \leftarrow \frac{1}{N}\sum_{i=1}^{N}\big\{ E(u_i \mid Y)E(u_i \mid Y)^{\top} + \mathrm{Cov}(u_i \mid Y) \big\}, \qquad \Sigma_v \leftarrow \frac{1}{T}\sum_{t=1}^{T}\big\{ E(v_t \mid Y)E(v_t \mid Y)^{\top} + \mathrm{Cov}(v_t \mid Y) \big\},$
$\sigma^2_\varepsilon \leftarrow \frac{1}{n}\sum_{i,t} E\big\{ \big(Y_{it} - x_{it}^{\top}\beta - (z^u_{it})^{\top}u_i - (z^v_{it})^{\top}v_t\big)^2 \,\big|\, Y \big\},$    (S.2)

where $n$ is the total number of observations, the posterior mean reward components for the fixed and random effects,

$E(\beta \mid Y), \quad E(u_i \mid Y), \quad E(v_t \mid Y),$    (S.3)

and the posterior variance-covariance reward components for the fixed and random effects,

$\mathrm{Cov}(\beta \mid Y), \quad \mathrm{Cov}(u_i \mid Y), \quad \mathrm{Cov}(v_t \mid Y), \text{ together with the cross-covariances between } \beta, u_i \text{ and } v_t,$    (S.4)

are computed in the E-step using equation (S.1). Note that the quantities in (S.3) are sub-vectors of the posterior mean and those in (S.4) are sub-block entries in the posterior variance-covariance matrix in (S.1). The naïve EM algorithm is given in Alg. S.1. The challenge in Alg. S.1 is computation of the posterior at each iteration.
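In code, the M-step is a moment-matching step against the posterior quantities in (S.3) and (S.4). The sketch below mirrors the form of the updates in (S.2); the argument names are illustrative, and the posterior expectation of the average squared residual is assumed to be computed upstream from the E-step output.

```python
import numpy as np

def m_step(post_u, post_v, resid_sq_mean):
    """Closed-form variance-component updates of the form in (S.2).

    post_u / post_v : lists of (mean, cov) pairs, one per user / time point,
                      taken from the sub-blocks in (S.3) and (S.4).
    resid_sq_mean   : posterior expectation of the average squared residual
                      Y_it - x'beta - z'u_i - z'v_t over all observations.
    """
    Sigma_u = np.mean([np.outer(mu, mu) + cov for mu, cov in post_u], axis=0)
    Sigma_v = np.mean([np.outer(mu, mu) + cov for mu, cov in post_v], axis=0)
    sigma2_eps = resid_sq_mean
    return sigma2_eps, Sigma_u, Sigma_v
```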

S.3 The Naïve Expectation Maximization Algorithm

Initialize: variance components $\sigma^2_\varepsilon$, $\Sigma_u$, $\Sigma_v$.
repeat
    E-step: Compute $E(\theta \mid Y)$ and $\mathrm{Cov}(\theta \mid Y)$ via equation (S.1) to obtain the mean and covariance components needed for the M-step.
    M-step: Compute the variance components in $\Sigma$ via equation (S.2).
until the log-likelihood converges
Algorithm S.1 Naïve EM Algorithm for EB estimates of the variance components in the Bayesian mixed effects model given in (1) and (2).

S.4 Main Result

Result S.1 (Analogous to Result 2 in [21]). The posterior updates under the Bayesian mixed effects model given in (1) and (2) for $E(\theta \mid Y)$ and each of the required sub-blocks of $\mathrm{Cov}(\theta \mid Y)$ are expressible as a two-level sparse matrix least squares problem of the form

$\text{minimise } \lVert b - B x \rVert^{2},$

where $b$ and the non-zero sub-blocks of $B$, according to the notation in the appendix, are assembled from the stacked rewards, the design matrices of Section S.1 and the prior covariance matrices in (2). The solutions deliver $E(\theta \mid Y)$ and each of the required sub-blocks of $\mathrm{Cov}(\theta \mid Y)$ through the $x_1$, $x_{2,i}$, $A^{11}$, $A^{22,i}$ and $A^{12,i}$ notation given in the appendix.

S.5 The STLSLS Algorithm

The STLSLS algorithm is listed in [24] and is based on Theorem 2 of [24]. Given its centrality to Algorithm 2, we list it again here. The algorithm solves a sparse version of the least squares problem

$\text{minimise } \lVert b - B x \rVert^{2} \text{ in } x,$

which has solution $x = A^{-1} B^{\top} b$ with $A = B^{\top} B$, where $b$ and $B$ have the following two-level sparse structure:

$b = [\,b_1; \ldots; b_m\,], \qquad B = \begin{bmatrix} \mathring{B}_1 & B_1 & & \\ \vdots & & \ddots & \\ \mathring{B}_m & & & B_m \end{bmatrix},$    (S.5)

with a dense first block-column and a block-diagonal remainder. The sub-vectors of $x$ and the sub-matrices of $A^{-1}$ corresponding to the non-zero blocks of $B$ are labelled as follows:

$x = [\,x_1; x_{2,1}; \ldots; x_{2,m}\,], \qquad A^{-1} = \begin{bmatrix} A^{11} & A^{12,1} & \cdots & A^{12,m} \\ A^{21,1} & A^{22,1} & \times & \times \\ \vdots & \times & \ddots & \times \\ A^{21,m} & \times & \times & A^{22,m} \end{bmatrix},$    (S.6)

with $\times$ denoting sub-blocks that are not of interest. The STLSLS algorithm is given in Algorithm S.2.

Inputs: $\{(b_i, B_i, \mathring{B}_i) : 1 \le i \le m\}$, where $\mathring{B}_i$ has $p$ columns and $B_i$ has $q$ columns
$\omega \leftarrow \text{NULL}$;  $\Omega \leftarrow \text{NULL}$
For $i = 1, \ldots, m$:
    Decompose $B_i = Q_i \begin{bmatrix} R_i \\ 0 \end{bmatrix}$ such that $Q_i^{-1} = Q_i^{\top}$ and $R_i$ is upper-triangular.
    $c_{0i} \leftarrow Q_i^{\top} b_i$;  $C_{0i} \leftarrow Q_i^{\top} \mathring{B}_i$
    $c_{1i} \leftarrow$ first $q$ rows of $c_{0i}$;  $c_{2i} \leftarrow$ remaining rows of $c_{0i}$
    $C_{1i} \leftarrow$ first $q$ rows of $C_{0i}$;  $C_{2i} \leftarrow$ remaining rows of $C_{0i}$
    $\omega \leftarrow [\,\omega; c_{2i}\,]$;  $\Omega \leftarrow [\,\Omega; C_{2i}\,]$
Decompose $\Omega = Q \begin{bmatrix} R \\ 0 \end{bmatrix}$ such that $Q^{-1} = Q^{\top}$ and $R$ is upper-triangular.
$c \leftarrow$ first $p$ rows of $Q^{\top}\omega$
$x_1 \leftarrow R^{-1} c$;  $A^{11} \leftarrow R^{-1} R^{-\top}$
For $i = 1, \ldots, m$:
    $x_{2,i} \leftarrow R_i^{-1}(c_{1i} - C_{1i} x_1)$;  $A^{12,i} \leftarrow -A^{11}(R_i^{-1} C_{1i})^{\top}$;  $A^{22,i} \leftarrow R_i^{-1}(R_i^{-\top} - C_{1i} A^{12,i})$
Output: $\big(x_1, A^{11}, \{(x_{2,i}, A^{22,i}, A^{12,i}) : 1 \le i \le m\}\big)$
Algorithm S.2 STLSLS for solving the two-level sparse matrix least squares problem: minimise $\lVert b - Bx \rVert^{2}$ in $x$ and in the sub-blocks of $A^{-1} = (B^{\top}B)^{-1}$ corresponding to the non-zero sub-blocks of $B$. The sub-block notation is given by (S.5) and (S.6).
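As a check on implementations of Algorithm S.2, a naïve reference solver can form $B$ densely from its blocks and read off the same outputs. This is only useful for verification at small sizes, since it ignores the sparsity that STLSLS exploits; all names are illustrative, and equal random-effects dimension across groups is assumed.

```python
import numpy as np

def naive_two_level_solve(b_blocks, Bdot_blocks, B_blocks):
    """Reference (non-streamlined) solution of the two-level sparse least squares
    problem: minimise ||b - B x||^2 with B structured as in (S.5). Returns x_1,
    A^{11} and, per group i, (x_{2,i}, A^{22,i}, A^{12,i}), i.e. the outputs of
    Algorithm S.2."""
    m = len(B_blocks)
    p = Bdot_blocks[0].shape[1]
    q = B_blocks[0].shape[1]
    b = np.concatenate(b_blocks)
    # Dense assembly of B: dense first block-column, block-diagonal remainder.
    top = np.vstack(Bdot_blocks)
    diag = np.zeros((top.shape[0], m * q))
    row = 0
    for i, Bi in enumerate(B_blocks):
        diag[row:row + Bi.shape[0], i * q:(i + 1) * q] = Bi
        row += Bi.shape[0]
    B = np.hstack([top, diag])
    A_inv = np.linalg.inv(B.T @ B)
    x = A_inv @ B.T @ b
    x1, A11 = x[:p], A_inv[:p, :p]
    per_group = []
    for i in range(m):
        sl = slice(p + i * q, p + (i + 1) * q)
        per_group.append((x[sl], A_inv[sl, sl], A_inv[:p, sl]))  # x_{2,i}, A^{22,i}, A^{12,i}
    return x1, A11, per_group
```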

S.6 Related Work

The fundamental streamlined EB Algorithm 2 makes use of linear system solutions and sub-block matrix inverses for two-level sparse matrix problems [24]. Our result for streamlined posterior computation in Section 2.2 is analogous and mathematically identical to Result 2 in [21], who instead focus on streamlined mean field variational Bayes approximate inference for linear mixed models with crossed random effects. In the present article, we make use of a similar result for EB posterior inference in the mHealth setting. Our EB algorithm allows streamlined estimation of the variance components within an online TS contextual bandit algorithm.

Other approaches include using mixed model software packages for high-performance statistical computation. For example: (i) BLME provides maximum a posteriori estimation for linear and generalized linear mixed effects models in a Bayesian setting [12]; and (ii) Stan [6] provides full Bayesian statistical inference with MCMC sampling. Even though BLME offers streamlined algorithms for obtaining predictions of the fixed and random effects in linear mixed models, the sub-blocks of the posterior covariance matrices required for the EM method in the streamlined empirical Bayes algorithm are not provided by such software. Stan, on the other hand, does offer support for computation of these sub-blocks, but is well known to suffer computationally in large data settings.

As we note in Section 3, the Gaussian linear mixed effects model used in the Thompson Sampling algorithm is equivalent to a Gaussian process regression model with a structured kernel matrix induced by the use of random effects. Gaussian process models have been used for multi-armed bandits [8, 14, 1, 25, 10, 27, 11, 5] and for contextual bandits [19, 17]. To address the challenges posed by mHealth, [26] illustrate the benefits of using mixed effects Gaussian process models in the context of reinforcement learning. Computational challenges in the Gaussian process regression setting are a known and common problem, which has led to contributions from the computer science and machine learning communities. For instance, to address the cubic-in-data-size complexity of Gaussian process regression, a variety of scalable GPs have been presented, including the approach we compare against earlier: GPyTorch [13]. A review of state-of-the-art scalable GPs, covering two main categories, global approximations which distill the entire data and local approximations which divide the data for subspace learning, can be found in [20]. The sparsity imposed by the use of random effects, however, affords us accurate inference in the cases considered in this article, and thus we do not suffer the potential loss of accuracy that could result from approximate methods such as those discussed in [20].