This work is motivated by a mobile health study in which an online Thompson Sampling contextual bandit algorithm is used to personalize the delivery of physical activity suggestions. These suggestions are intended to increase near-term physical activity. The personalization occurs via two routes: first, the user’s current context is used to decide whether to deliver a suggestion; second, random effects, as described in Section 2.1, are used to learn user- and time-specific parameters that encode the influence of each contextual variable on this decision. The user- and time-specific parameters are modeled in the reward function (the mean of the reward conditional on context and action). To learn these parameters, information is pooled across both users and time in a dynamic manner, combining Thompson sampling with a Bayesian random effects model for the reward function. In contrast to fully Bayesian methods, empirical Bayes estimates the values of the hyper-parameters as a function of the observed data.
The contributions of this paper are as follows.
We develop a Thompson sampling algorithm coupled with explicit streamlined empirical Bayes updates for fitting linear mixed effects models. To estimate hyper-parameters and compute the estimated posterior distribution, our algorithm computes closed form updates which are within the class of two-level sparse least squares problems introduced by .
This work provides an efficient empirical Bayes algorithm in which the amount of storage and computing at each iteration is , where  is the larger dimension of the two grouping mechanisms considered; for example,  may represent the number of users and  the number of time points, or vice versa.
Our approach reduces running time relative to other state-of-the-art methods and, critically, does not require advanced hardware.
These contributions make our approach practical for online mHealth settings, in which incremental learning algorithm updates are required (e.g., at nightly increments) and swift computation is necessary for subsequent online policy adaptation. This facilitates incremental, accurate tuning of the variance hyper-parameters in a Thompson Sampling contextual bandit algorithm.
Section 2.2 describes a natural parametric empirical Bayes approach to hyper-parameter tuning and Section 2.3 presents our streamlined alternative. A performance assessment and comparison is presented in Section 3.
2.1 Problem Setting
At each time, , on each user, , a vector of context variables, , is observed. An action, , is then selected. Here we consider  actions, where . Subsequently, a real-valued reward, , is observed. This continues for  times on each of  users. We assume that the reward at time  is generated with a person- and time-specific mean,
where ,  and  are known features of the context and action .  are unknown parameters; in particular,  is the vector of th user parameters and  is the vector of time  parameters. Time  corresponds to “time-since-under-treatment” for a user. User-specific parameters, , capture unobserved user variables that influence the reward at all times; in mobile health, unobserved user variables may include the level of social support for activity, or pre-existing problems or preferences that make activity difficult. The time-specific parameters, , capture unobserved “time-since-under-treatment” variables that influence the reward for all users; in mobile health, these might include treatment fatigue, decreasing motivation, etc.
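To make the structure of this reward model concrete, the following sketch simulates rewards whose coefficient vector is a population-level fixed effect shifted by crossed user- and time-specific random effects. All names (`beta`, `u`, `v`), dimensions, and variance values here are hypothetical, chosen only to mirror the model described above, not taken from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n, p = 4, 3, 2                       # users, time points, feature dimension (illustrative)
beta = np.array([1.0, -0.5])            # population-level (fixed) effects, hypothetical values
sigma_u, sigma_v, sigma_eps = 0.8, 0.3, 0.5

u = rng.normal(0.0, sigma_u, size=(m, p))   # user-specific random effects
v = rng.normal(0.0, sigma_v, size=(n, p))   # time-specific random effects

def reward(i, t, x):
    """Draw a reward for user i at time t with feature vector x:
    the coefficient vector is the fixed effect shifted by the crossed
    user and time random effects, plus Gaussian observation noise."""
    coef = beta + u[i] + v[t]
    return float(coef @ x + rng.normal(0.0, sigma_eps))

r = reward(0, 0, np.array([1.0, 0.6]))
```

Because `u[i]` is shared across all of user `i`'s time points and `v[t]` is shared across all users at time `t`, observations pool information in both directions, which is the crossed random effects structure the algorithm exploits.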
The Thompson Sampling algorithm in  uses the following Bayesian mixed effects model for the reward :
The algorithm is designed with independent Gaussian priors on the unknown parameters:
This corresponds to a linear mixed model with crossed random effects (e.g., [2, 13]). At each time, , Thompson Sampling is used to select the action, , based on the context . Specifically, we compute the posterior distribution for , where
and for context , select treatment
where  are the posterior mean and variance-covariance matrix given in the sub-blocks of (7).
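The selection rule above can be sketched as follows. The posterior mean and covariance below are placeholder values standing in for the relevant sub-blocks of (7), and the binary setting (deliver versus do not deliver a suggestion) is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical posterior for one user/time pair's treatment-effect
# parameters; in the algorithm these come from the mixed-model fit.
post_mean = np.array([0.2, 0.1])
post_cov = np.array([[0.05, 0.01],
                     [0.01, 0.04]])

def thompson_action(context):
    """Draw one parameter sample from the posterior and act greedily
    on the draw: deliver the suggestion (action 1) if the sampled
    treatment effect under this context is positive."""
    theta = rng.multivariate_normal(post_mean, post_cov)
    return int(theta @ context > 0.0)

action = thompson_action(np.array([1.0, 0.5]))
```

Sampling the parameters rather than using the posterior mean is what gives Thompson Sampling its exploration: actions are chosen with probability equal to the posterior probability that they are optimal.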
2.1.1 Bayesian Mixed Effects Model Components
We define the following data matrices
and the following parameter vectors
where, as before, and represent the known features of the context and action . The dimensions of matrices, for and , are
2.1.2 Posterior Updates
The posterior distribution for the immediate treatment effect for user at time is updated and then used to assign treatment in the subsequent time point, . Here, we show the form of the full posterior for . Define
The estimated posterior distribution for the fixed and random reward effects vector is
The focus of this work is to enable fast incremental estimation of the variance components . We describe a natural, but computationally challenging approach for estimating these variances in Section 2.2 and our streamlined alternative approach in Section 2.3.
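A minimal sketch of the conjugate Gaussian posterior computation underlying these updates, written for a generic linear model y = Cθ + noise with a Gaussian prior on θ. The function name and dimensions are illustrative and not the paper's notation; a streamlined implementation would avoid forming these inverses directly.

```python
import numpy as np

def gaussian_posterior(C, y, prior_cov, sigma2):
    """Conjugate update for y = C @ theta + N(0, sigma2) noise with
    theta ~ N(0, prior_cov). Returns the posterior mean and covariance."""
    precision = np.linalg.inv(prior_cov) + C.T @ C / sigma2
    cov = np.linalg.inv(precision)
    mean = cov @ (C.T @ y) / sigma2
    return mean, cov

rng = np.random.default_rng(2)
C = rng.normal(size=(20, 3))
theta_true = np.array([1.0, -1.0, 0.5])
y = C @ theta_true + rng.normal(0.0, 0.1, size=20)

mean, cov = gaussian_posterior(C, y, np.eye(3), 0.01)
```

With ample data relative to the noise level, the posterior mean concentrates near the generating parameters; the naive inverses here are exactly the cost that the streamlined updates of Section 2.3 avoid.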
2.2 Parametric Empirical Bayes
and has the following form
The maximization is commonly done via the Expectation-Maximization (EM) algorithm .
2.2.1 EM Method
The expected complete data log likelihood is given by
where the expectation is over the distribution of given in (2). The M-step yields the following closed form iteration estimates for the variance components in :
where the posterior mean reward components for the fixed and random effects
and the posterior variance-covariance reward components for the fixed and random effects
are computed in the E-step using equation (7). Note that (9) are the sub-vectors of the posterior mean  and (10) are the sub-blocks of the posterior variance-covariance matrix . The naïve EM algorithm is given in Algorithm 1.
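To make the E- and M-steps concrete, here is a toy version of the EM iteration for a single variance component in a random-intercept model (one grouping factor only, with the error variance treated as known). The M-step averages posterior second moments in closed form, in the spirit of (8); all names and numerical values are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 50, 10                       # groups and observations per group
sigma_u_true, sigma_e = 1.5, 1.0    # true random-effect and error SDs
u = rng.normal(0.0, sigma_u_true, size=m)
y = u[:, None] + rng.normal(0.0, sigma_e, size=(m, n))

sigma2_u = 1.0                      # initial guess for the variance component
for _ in range(100):
    # E-step: Gaussian posterior for each random intercept u_i
    post_var = 1.0 / (n / sigma_e**2 + 1.0 / sigma2_u)
    post_mean = post_var * y.sum(axis=1) / sigma_e**2
    # M-step: closed-form update, the average posterior second moment E[u_i^2]
    new = np.mean(post_mean**2 + post_var)
    if abs(new - sigma2_u) < 1e-10:
        sigma2_u = new
        break
    sigma2_u = new
```

The full algorithm has the same shape, except that the E-step requires the joint posterior over fixed and both sets of crossed random effects, which is where the computational burden discussed next arises.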
2.2.2 Computational Challenges
At each iteration in Algorithm 1, computation of requires solving the sparse matrix linear system
This matrix has dimension
It is often the case that the number of fixed effects parameters , the number of random effects parameters per user  and the number of random effects parameters per time  are of moderate size. Consequently, it is well known that naïve computation of  is , that is, has cubic dependence on the sizes of the two random effects groupings,  and .
To address this computational problem, we make use of the fact that in this setting the matrix requiring inversion is sufficiently block-diagonal that its sparsity can be exploited. In addition, the closed-form updates of the variance components in (8) require computation of only the sub-blocks of  that correspond to the non-zero sub-blocks of  (as illustrated in Figure 1), and not the entire matrix.
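The kind of saving this buys can be sketched on a generic "arrow" system: a small dense corner coupled to many mutually independent diagonal blocks. A single pass over the blocks via the Schur complement solves the system without assembling the full matrix; the dense solve below is included only as a correctness check. Block sizes and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
p, q, m = 2, 2, 6    # corner dim, per-block dim, number of blocks (illustrative)

corner = 5.0 * np.eye(p)                               # dense p x p corner
D = [(3.0 + i) * np.eye(q) for i in range(m)]          # independent diagonal blocks
C = [0.1 * rng.normal(size=(p, q)) for _ in range(m)]  # couplings to the corner

b0 = rng.normal(size=p)
bi = [rng.normal(size=q) for i in range(m)]

# Streamlined solve: accumulate the Schur complement of the corner in one
# pass over the m blocks, at O(m) block operations.
S, rhs = corner.copy(), b0.copy()
Dinv_bi, Dinv_CiT = [], []
for i in range(m):
    Dinv = np.linalg.inv(D[i])
    Dinv_bi.append(Dinv @ bi[i])
    Dinv_CiT.append(Dinv @ C[i].T)
    S -= C[i] @ Dinv_CiT[i]
    rhs -= C[i] @ Dinv_bi[i]
x0 = np.linalg.solve(S, rhs)
xi = [Dinv_bi[i] - Dinv_CiT[i] @ x0 for i in range(m)]
x_stream = np.concatenate([x0] + xi)

# Naive check: assemble the full (p + mq)-dimensional matrix and solve.
M = np.zeros((p + m * q, p + m * q))
M[:p, :p] = corner
for i in range(m):
    r = p + i * q
    M[:p, r:r+q], M[r:r+q, :p], M[r:r+q, r:r+q] = C[i], C[i].T, D[i]
x_dense = np.linalg.solve(M, np.concatenate([b0] + bi))
```

The streamlined pass touches each block once, whereas the naive solve scales cubically in the total dimension, which is the distinction drawn above between the streamlined and naïve EM algorithms.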
2.3 Streamlined Empirical Bayes
Streamlined updating of  and each of the sub-blocks of  required for the E-step in Algorithm 1 can be embedded within the class of two-level sparse matrix problems as defined in  and is encapsulated in Result 1. Result 1 is analogous and mathematically identical to Result 2 in . The difference is that the authors of  do not apply their methodology to the mobile health setting and use full variational Bayesian inference for model fitting, as opposed to our use of empirical Bayes.
Result 1 (Analogous and mathematically identical to Result 2 in ). The posterior updates under the Bayesian mixed effects model as given in (1) and (2) for and each of the sub-blocks of are expressible as a two-level sparse matrix least squares problem of the form where and the non-zero sub-blocks of B, according to the notation in the appendix, are, for ,
with each of these matrices having rows. The solutions are
, where the , , , and notation is given in the appendix.
The streamlined equivalent of Algorithm 1 is given in Algorithm 2. Algorithm 2 makes use of the SolveTwoLevelSparseLeastSquares algorithm, which was first presented in  but is also provided in the appendix of this article.
The computing time and storage for the streamlined updating of and each of the sub-blocks of required for the E-step in Algorithm 2 becomes . For moderate sized , this reduces to . If both and are large, one may resort to coupling the streamlining present in this article with an approximation of the posterior so as to further reduce computation time and storage. However, care needs to be taken with the choice of approximation so as to incur as little degradation in accuracy as possible. As explained in Section 3.1 of 
, mean field variational Bayes approximations tend to be very accurate for Gaussian response models. However, such high accuracy does not manifest in general: ignoring important posterior dependencies via mean field restrictions often leads to credible intervals that are too small (e.g., ).
3 Performance Assessment and Comparison
In order to evaluate the speed achieved by our streamlined empirical Bayes algorithm, we compare the timing and accuracy of our method against state-of-the-art software, GPyTorch , which is a highly efficient implementation of Gaussian Process regression modeling with GPU acceleration. Note that the Gaussian linear mixed effects model as given in (1) and (2) is equivalent to a Gaussian Process regression model with a structured kernel matrix induced by the use of random effects. For ease of notation, in what follows we use sEB to refer to the streamlined empirical Bayes algorithm, GPyT-CPU to refer to empirical Bayes fitting using GPyTorch on CPU, and GPyT-GPU to refer to empirical Bayes fitting using GPyTorch on GPU. The sEB and GPyT-CPU computations were conducted using an Intel Xeon CPU E5-2683. The GPyT-GPU computations were conducted using an NVIDIA Tesla V100-PCIE-32GB.
3.1 Batch Speed Assessment
) and for which both the fixed effects and random effects had dimension two, corresponding to random intercepts and slopes for a single continuous predictor which was generated from the Uniform distribution on the unit interval. The true parameter values were set to
and, during the studies, the values  were set to , and the number of data points specific to each user and time period, , was set to 5. Four separate studies were run with differing values for the number of users . The total number of data points is then , that is, . We then simulated 50 replications of the data for each  and recorded the computational times for variance estimation from GPyT-CPU, GPyT-GPU and sEB. Algorithm 2 was implemented in Fortran 77. The EM iterations were stopped once the absolute difference between successive expected complete-data log-likelihood values fell below . The same stopping criterion was used for GPyTorch, and additionally the maximum number of iterations was set to 15.
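The stopping rules just described (a tolerance on successive log-likelihood values plus a cap on iterations) can be sketched as a generic EM driver. The names `run_em`, `step` and `loglik` are hypothetical, not part of the Fortran or GPyTorch implementations.

```python
def run_em(step, loglik, init, tol=1e-6, max_iter=15):
    """Generic EM driver: repeatedly apply one EM update (`step`) and
    stop when successive objective values from `loglik` differ by less
    than `tol`, or after `max_iter` iterations, whichever comes first."""
    params = init
    prev = loglik(params)
    for _ in range(max_iter):
        params = step(params)
        cur = loglik(params)
        if abs(cur - prev) < tol:
            break
        prev = cur
    return params
```

Because EM increases the objective monotonically, the absolute difference between successive values is a standard, inexpensive convergence check; the iteration cap guards against slow tail convergence in timed comparisons.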
Table 1 shows the mean and standard deviation of elapsed computing times in seconds for estimation of the variance components using sEB, GPyT-CPU and GPyT-GPU. Figure 2 shows the absolute error values for each variance component estimated using sEB, GPyT-CPU and GPyT-GPU, summarized as boxplots.
|           | sEB          | GPyT-CPU      | GPyT-GPU   |
|-----------|--------------|---------------|------------|
| 1,500     | 0.7 (0.10)   | 5.8 (0.14)    | 1.5 (0.16) |
| 7,500     | 1.7 (0.15)   | 163.8 (1.81)  | 1.3 (0.04) |
| 15,000    | 2.8 (0.21)   | 736.2 (38.36) | 5.2 (0.03) |
| 1,500,000 | 322.1 (24.82)| NA (NA)       | NA (NA)    |
3.2 Online Thompson Sampling Contextual Bandit mHealth Simulation Study
Next, we evaluate our approach in a simulated mHealth study designed to capture many of the real-world difficulties of mHealth clinical trials. Users in this simulated study are sent interventions multiple times each day according to Algorithm 3. Each intervention represents a message promoting healthy behavior.
In this setting there are 32 users and each user is in the study for 10 weeks. Users join the study in a staggered fashion, such that each week new users might be joining or leaving the study. Each day in the study users can receive up to 5 mHealth interventions.
The Bayesian mixed effects model as represented in (1) and (2) offers several advantages in this setting. In mHealth, users can differ not only in the context that they experience, but also in their response to treatment under the same context. The user-level random effects allow learning of personalized policies for each user, overcoming the flaws of methods that treat all individuals as the same. Additionally, there can be non-stationarity in how users respond to treatment; for example, they might be more responsive at the beginning of a study than at the end. By modeling time-level random effects, each person’s policy can be sensitive to a dynamic environment and is informed by how other users’ responsivity has been influenced by time.
We evaluate our approach in a setting that demands personalization within a dynamic environment. Users are modeled to be heterogeneous in their response to treatment, and their responsivity declines with time.
In Figure 3 we show the ability of our streamlined algorithm to minimize regret when real data are used to inform the simulation. We also compare our approach to GPyT-CPU and GPyT-GPU. For all users, we show the average performance for their th week in the study. For example, we first show the average regret across all users in their first week of the study; note that this is not the same calendar week, as users join in a staggered manner. The average total time (standard deviation) in seconds for estimation of the variance components was 757.1 (76.48) using GPyT-CPU, 7.5 (0.27) using GPyT-GPU and 7.3 (0.16) using sEB.
4 Related Work
The fundamental streamlined empirical Bayes Algorithm 2 makes use of linear system solutions and sub-block matrix inverses for two-level sparse matrix problems (). Our result for streamlined posterior computation in Section 2.3 is analogous and mathematically identical to Result 2 in  who instead focus on streamlined mean field variational Bayes approximate inference for linear mixed models with crossed random effects. In the present article, we make use of a similar result for empirical Bayes posterior inference for use in the mobile health setting. Our empirical Bayes algorithm allows streamlined estimation of the variance components within an online Thompson sampling contextual bandit algorithm.
Other approaches include using mixed model software packages for high-performance statistical computation. For example: (i) BLME provides maximum a posteriori estimation for linear and generalized linear mixed effects models in a Bayesian setting ; and (ii) Stan  provides full Bayesian statistical inference with MCMC sampling. Even though BLME offers streamlined algorithms for obtaining predictions of the fixed and random effects in linear mixed models, the sub-blocks of the posterior covariance matrices required for the EM method in the streamlined empirical Bayes algorithm are not provided by such software. Stan, on the other hand, does offer support for computation of these sub-blocks, but is well known to suffer computationally in large data settings.
As we point to in Section 3, the Gaussian linear mixed effects model used in the Thompson Sampling algorithm is equivalent to a Gaussian Process regression model with a structured kernel matrix induced by the use of random effects. Gaussian process models have been used for multi-armed bandits ([6, 12, 1, 23, 8, 26, 9, 3]) and for contextual bandits ([16, 14]). To address the challenges posed by mHealth,  illustrate the benefits of using mixed effects Gaussian Process models in the context of reinforcement learning. Computational challenges in the Gaussian Process regression setting are a known and common problem, which has led to contributions from the computer science and machine learning communities. For instance, to address the cubic complexity in data size from which Gaussian Process regression suffers, a variety of scalable GPs have been presented, including the approach we compare to earlier: GPyTorch . A review of state-of-the-art scalable GPs, involving two main categories, global approximations, which distill the entire data, and local approximations, which divide the data for subspace learning, can be found in . The sparsity imposed by the use of random effects, however, affords us accurate inference in the cases considered in this article, and thus we do not suffer from the potential loss of accuracy that could result from approximate methods, such as those discussed in .
We compare three empirical Bayes approaches, sEB, GPyT-CPU and GPyT-GPU for use within a batch simulation setting and an online contextual bandit mHealth simulation study.
Within the batch simulation setting in Section 3.1, inspection of the computational running times in Table 1 shows that sEB achieves the lowest average running time across all simulations and data set sizes, compared to GPyT-CPU and GPyT-GPU. In the starkest case this amounts to a 98.96% reduction in running time. Even in the more modest comparisons, sEB timing is similar to that of GPyT-GPU but does not require the sophisticated hardware that GPyT-GPU does. The improvement between GPyT-CPU and GPyT-GPU is impressive, and GPyT-GPU is clearly designed to excel in a resource-rich environment. However, in a clinical trial it is unclear whether such an environment will be available. In contrast, sEB does not require advanced hardware to achieve state-of-the-art performance.
Our method computes only the necessary sub-blocks of the posterior variance-covariance matrix at each time step, as opposed to computing the entire matrix. We were unable to template an equivalent streamlined computation within GPyT-CPU and GPyT-GPU. Consequently, we were unable to run GPyT-CPU and GPyT-GPU on the largest dataset, as this involves constructing a matrix of dimension , even before the optimization procedure is called. Templating the variance-covariance matrix so that it did not require this matrix as input might have allowed us to run GPyT-GPU on the largest dataset. An advantage of our method is that it can efficiently exploit the structure inherent in the variance-covariance matrix to manage large datasets.
The median reduction in error from GPyT-CPU to sEB is 28.3% and from GPyT-GPU to sEB is 22.2%. From Figure 2, we see that sEB achieves lower absolute error on average than either GPyT-CPU or GPyT-GPU. The difference is more pronounced between sEB and GPyT-GPU. sEB achieves the lowest average absolute error, at the fastest rate, on the simplest machine.
-  (2015) Fast direct methods for Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2), pp. 252–265.
-  (2008) Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language 59 (4), pp. 390–412.
-  (2016) Time-varying Gaussian process bandit optimization. In Artificial Intelligence and Statistics, pp. 314–323.
-  (2017) Stan: a probabilistic programming language. Journal of Statistical Software 76 (1).
-  (1985) An introduction to empirical Bayes data analysis. The American Statistician 39 (2), pp. 83–87.
-  (2017) On kernelized multi-armed bandits. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 844–853.
-  (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39 (1), pp. 1–22.
-  (2014) Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. The Journal of Machine Learning Research 15 (1), pp. 3873–3923.
-  (2013) High-dimensional Gaussian process bandits. In Advances in Neural Information Processing Systems, pp. 1025–1033.
-  (2015) blme: Bayesian linear mixed-effects models. R package, URL: https://CRAN.R-project.org/package=blme.
-  (2018) GPyTorch: blackbox matrix-matrix Gaussian process inference with GPU acceleration. In Advances in Neural Information Processing Systems.
-  (2011) Portfolio allocation for Bayesian optimization. In UAI, pp. 327–336.
-  (2017) A variational maximization–maximization algorithm for generalized linear mixed models with crossed random effects. Psychometrika 82 (3), pp. 693–716.
-  (2011) Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pp. 2447–2455.
-  (1982) Random-effects models for longitudinal data. Biometrics 38 (4), pp. 963–974.
-  (2010) A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670.
-  (2018) When Gaussian process meets big data: a review of scalable GPs. arXiv preprint arXiv:1807.01065.
-  (2019) Streamlined variational inference for linear mixed models with crossed random effects. arXiv preprint arXiv:1910.01799.
-  Variational inference for heteroscedastic semiparametric regression. Australian & New Zealand Journal of Statistics 57 (1), pp. 119–138.
-  (1983) Parametric empirical Bayes inference: theory and applications. Journal of the American Statistical Association 78 (381), pp. 47–55.
-  (2019) Streamlined computing for variational inference with higher level random effects. arXiv preprint arXiv:1903.06616.
-  (2019) Solutions to sparse multilevel matrix problems. arXiv preprint arXiv:1903.03089.
-  (2009) Gaussian process optimization in the bandit setting: no regret and experimental design. arXiv preprint arXiv:0912.3995.
-  (2019) Intelligent pooling in Thompson sampling for rapid personalization in mobile health.
-  (2005) Inadequacy of interval estimates corresponding to variational Bayesian approximations. In AISTATS.
-  (2016) Optimization as estimation with Gaussian processes in bandit settings. In Artificial Intelligence and Statistics, pp. 1022–1031.
The SolveTwoLevelSparseLeastSquares Algorithm
The SolveTwoLevelSparseLeastSquares algorithm is listed in  and based on Theorem 2 of . Given its centrality to Algorithm 2, we list it again here. The algorithm solves a sparse version of the least squares problem:
which has solution where and where B and have the following structure:
The sub-vectors of x and the sub-matrices of A corresponding to its non-zero blocks are labelled as follows:
with denoting sub-blocks that are not of interest. The SolveTwoLevelSparseLeastSquares algorithm is given in Algorithm 4.
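The block structure of B can be illustrated with a small dense stand-in: a shared column block alongside a block-diagonal part with one block per group, solved naively via ordinary least squares. A streamlined solver would return the same solution while computing only the sub-blocks of (BᵀB)⁻¹ on this sparsity pattern; the dimensions below are arbitrary and the full inverse is formed here only to show what those sub-blocks are.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n_i, p, q = 5, 8, 2, 2   # groups, observations per group, column-block widths

# Two-level design matrix: shared columns (left) next to a block-diagonal
# part (right), mirroring the non-zero pattern of B described above.
shared = rng.normal(size=(m * n_i, p))
groupwise = np.zeros((m * n_i, m * q))
for i in range(m):
    groupwise[i * n_i:(i + 1) * n_i, i * q:(i + 1) * q] = rng.normal(size=(n_i, q))
B = np.hstack([shared, groupwise])
y = rng.normal(size=m * n_i)

# Naive least-squares solve; a streamlined solver exploits the pattern
# instead of treating B as a general dense matrix.
x_hat, *_ = np.linalg.lstsq(B, y, rcond=None)

# Full inverse, formed only to illustrate the sub-blocks a streamlined
# solver would deliver without ever materializing the whole matrix.
A_inv = np.linalg.inv(B.T @ B)
```

In the streamlined algorithm only the leading p×p block, the per-group q×q diagonal blocks, and their couplings to the leading block are ever computed, which is what keeps the per-iteration cost linear in the number of groups.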