1 Introduction
This work is motivated by a mobile health study in which an online Thompson Sampling contextual bandit algorithm is used to personalize the delivery of physical activity suggestions [24]. These suggestions are intended to increase near-time physical activity. The personalization occurs via two routes: first, the user's current context is used to decide whether to deliver a suggestion; second, random effects, as described in Section 2.1, are used to learn user- and time-specific parameters that encode the influence of each of the contextual variables on this decision. The user- and time-specific parameters are modeled in the reward function (the mean of the reward conditional on context and action). To learn these parameters, information is pooled across both users and time in a dynamic manner, combining Thompson sampling with a Bayesian random effects model for the reward function. In contrast to fully Bayesian methods, empirical Bayes estimates the hyperparameters as a function of the observed data.
The contributions of this paper are as follows:

We develop a Thompson sampling algorithm coupled with explicit, streamlined empirical Bayes updates for fitting linear mixed effects models. To estimate hyperparameters and compute the estimated posterior distribution, our algorithm uses closed-form updates that fall within the class of two-level sparse least squares problems introduced by [22].

This work provides an efficient empirical Bayes algorithm in which the amount of storage and computing at each iteration is linear in the larger of the two grouping dimensions considered; for example, the two dimensions may be the number of users and the number of time points, or vice versa.

Our approach reduces running time relative to other state-of-the-art methods and, critically, does not require advanced hardware.
These contributions make our approach practical for online mHealth settings, in which incremental algorithm updates are required (e.g., at nightly increments) and swift computation is necessary for subsequent online policy adaptation. This facilitates incremental, accurate tuning of the variance hyperparameters in a Thompson Sampling contextual bandit algorithm.
In Section 2.1 we describe the problem setting and review Thompson Sampling with the use of a Bayesian mixed effects model for the reward [24]. Section 2.2 describes a natural parametric empirical Bayes approach to hyperparameter tuning and Section 2.3 presents our streamlined alternative. A performance assessment and comparison is given in Section 3.
2 Methods
2.1 Problem Setting
At each time, on each user, a vector of context variables is observed and an action is then selected; here we consider a finite set of candidate actions. Subsequently a real-valued reward is observed. This continues across all time points and all users. We assume that the reward is generated with a person- and time-specific mean constructed from known features of the context and action together with unknown parameters: a vector of fixed effects, a vector of user-specific parameters for each user, and a vector of time-specific parameters for each time point. Time corresponds to "time-since-under-treatment" for a user. User-specific parameters capture unobserved user variables that influence the reward at all times; in mobile health, unobserved user variables may include level of social support for activity, or preexisting problems or preferences that make activity difficult. The time-specific parameters capture unobserved "time-since-under-treatment" variables that influence the reward for all users; in mobile health these might include treatment fatigue, decreasing motivation, etc.
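As a concrete illustration, the crossed random effects reward structure just described can be sketched in code. All dimensions, variance values, and the use of a common feature vector for the fixed and random effects are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed illustrative dimensions: p fixed effects, q user- and time-level
# random effects, m users, n time points (p == q here for simplicity).
p, q, m, n = 2, 2, 10, 50
sigma_e, sigma_u, sigma_v = 0.5, 1.0, 0.3

beta = rng.normal(size=p)                    # fixed effects
u = rng.normal(scale=sigma_u, size=(m, q))   # user-specific random effects
v = rng.normal(scale=sigma_v, size=(n, q))   # time-specific random effects

def reward(i, t, x):
    """Noisy reward for user i at time t with feature vector x: the mean
    combines fixed effects with the user- and time-specific effects, in the
    spirit of model (1)."""
    mean = x @ beta + x[:q] @ u[i] + x[:q] @ v[t]
    return mean + rng.normal(scale=sigma_e)
```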
The Thompson Sampling algorithm in [24] uses the following Bayesian mixed effects model for the reward:
(1) 
The algorithm is designed with independent Gaussian priors on the unknown parameters:
(2) 
The user- and time-specific parameters are called random effects in the statistical literature, and the model in (1) and (2) is often referred to as a linear mixed effects model [15] or a linear mixed model with crossed random effects (e.g., [2, 13]). At each time, Thompson Sampling is used to select the action based on the current context. That is, we compute the posterior distribution of the treatment effect parameters and, for the current context, select the treatment
(3) 
where the posterior mean and variance-covariance matrix are given in the sub-blocks of (7).
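The action selection step in (3) can be sketched as follows: draw one sample of the treatment-effect parameters from their Gaussian posterior and act greedily with respect to that draw. The function and variable names below are illustrative, not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(1)

def thompson_action(mu, Sigma, features):
    """Thompson Sampling step: sample parameters from the Gaussian posterior
    N(mu, Sigma) and choose the action whose feature vector maximizes the
    resulting predicted reward."""
    theta = rng.multivariate_normal(mu, Sigma)  # one posterior draw
    values = [f @ theta for f in features]      # predicted reward per action
    return int(np.argmax(values))
```

With a nearly degenerate posterior the draw concentrates at the posterior mean, so the selection becomes effectively greedy; a wider posterior yields more exploration.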
2.1.1 Bayesian Mixed Effects Model Components
We define the following data matrices
and the following parameter vectors
where, as before, the features of the context and action are known. The dimensions of these matrices are
(4) 
2.1.2 Posterior Updates
The posterior distribution for the immediate treatment effect for a given user at a given time is updated and then used to assign treatment at the subsequent time point. Here, we show the form of the full posterior. Define
The estimated posterior distribution for the fixed and random reward effects vector is
(5) 
where
(6) 
(7) 
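The posterior quantities in (5)-(7) are instances of the standard conjugate Gaussian update for a linear model with a Gaussian prior. A generic sketch is given below; it is not the paper's stacked fixed/random-effects parameterization, only the conjugate update that (5)-(7) specialize.

```python
import numpy as np

def gaussian_posterior(X, y, prior_cov, noise_var):
    """Posterior N(mu, Sigma) for theta in the model
    y = X theta + eps, eps ~ N(0, noise_var * I), theta ~ N(0, prior_cov).
    Sigma = (X'X / noise_var + prior_cov^{-1})^{-1},  mu = Sigma X'y / noise_var."""
    precision = X.T @ X / noise_var + np.linalg.inv(prior_cov)
    Sigma = np.linalg.inv(precision)
    mu = Sigma @ (X.T @ y) / noise_var
    return mu, Sigma
```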
The focus of this work is to enable fast incremental estimation of the variance components. We describe a natural but computationally challenging approach for estimating these variances in Section 2.2 and our streamlined alternative in Section 2.3.
2.2 Parametric Empirical Bayes
At each time, the empirical Bayes [20, 5] procedure maximizes, with respect to the variance components, the marginal likelihood based on all user data up to and including the current time. The marginal likelihood of Y is
and has the following form
The maximization is commonly done via the Expectation-Maximization (EM) algorithm [7].
2.2.1 EM Method
The expected complete data log likelihood is given by
where the expectation is over the posterior distribution of the parameters in (2). The M-step yields the following closed-form updates for the variance components:
(8) 
where the posterior mean reward components for the fixed and random effects
(9) 
and the posterior variance-covariance reward components for the fixed and random effects
(10) 
are computed in the E-step using equation (7). Note that the quantities in (9) are sub-vectors of the posterior mean and those in (10) are sub-block entries of the posterior variance-covariance matrix. The naïve EM algorithm is given in Algorithm 1.
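The M-step in (8) rests on the EM identity E‖u‖² = ‖E[u]‖² + tr Cov(u), applied per random effects group. A minimal sketch of one such variance component update, assuming a single scalar variance shared within a group (an illustrative simplification), is:

```python
import numpy as np

def em_variance_update(post_means, post_covs):
    """One M-step update of the form
        sigma^2_new = (1 / (K q)) * sum_k ( ||E[u_k]||^2 + tr Cov(u_k) ),
    where post_means[k] and post_covs[k] are the E-step posterior mean and
    covariance of the k-th group's random effects vector (length q)."""
    K = len(post_means)
    q = post_means[0].shape[0]
    total = sum(mu @ mu + np.trace(C) for mu, C in zip(post_means, post_covs))
    return total / (K * q)
```

Only the per-group posterior means and covariance sub-blocks enter this update, which is why the full posterior variance-covariance matrix is never needed.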
2.2.2 Computational Challenges
At each iteration in Algorithm 1, computing the posterior quantities requires solving the sparse matrix linear system
(11) 
where the left-hand side of (11) has the sparse structure imposed by the random effects, as exemplified in Figure 1. This matrix has dimension equal to the number of fixed effects parameters plus the total number of user-specific and time-specific random effects parameters. It is often the case that the number of fixed effects parameters and the numbers of random effects parameters per user and per time point are of moderate size. Consequently, naïve computation of the posterior variance-covariance matrix has cubic dependence on the random effects group sizes, that is, on the number of users and the number of time points.
To address this computational problem, we exploit the fact that in this setting the matrix requiring inversion is sufficiently block-diagonal that its sparsity can be leveraged. In addition, the closed-form updates of the variance components in (8) require computation of only the sub-blocks of the posterior variance-covariance matrix that correspond to its non-zero sub-blocks (as illustrated in Figure 1), and not the entire matrix.
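The following toy demonstration (dimensions are assumptions, not the paper's) shows why exploiting block structure helps: a block-diagonal matrix can be inverted one small block at a time, at O(K q³) cost, rather than via a dense O((K q)³) inverse.

```python
import numpy as np

rng = np.random.default_rng(2)

def block_diag(blocks):
    """Assemble a block-diagonal matrix from a list of square blocks."""
    sizes = [b.shape[0] for b in blocks]
    out = np.zeros((sum(sizes), sum(sizes)))
    pos = 0
    for b, s in zip(blocks, sizes):
        out[pos:pos + s, pos:pos + s] = b
        pos += s
    return out

K, q = 50, 2                                  # illustrative group count / size
blocks = [np.eye(q) + 0.1 * rng.normal(size=(q, q)) for _ in range(K)]
blocks = [b @ b.T + np.eye(q) for b in blocks]  # make each block SPD

M = block_diag(blocks)                          # (K q) x (K q) structured matrix
M_inv = block_diag([np.linalg.inv(b) for b in blocks])  # blockwise inverse
```

The true problem in (11) is block-sparse rather than exactly block-diagonal, which is why the two-level sparse least squares machinery of [22] is needed, but the cost intuition is the same.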
2.3 Streamlined Empirical Bayes
Streamlined updating of the posterior mean and each of the required sub-blocks of the posterior variance-covariance matrix for the E-step in Algorithm 1 can be embedded within the class of two-level sparse matrix problems as defined in [21] and is encapsulated in Result 1. Result 1 is analogous and mathematically identical to Result 2 in [18]; the difference is that the authors of [18] do not apply their methodology to the mobile health setting and use full variational Bayesian inference for fitting, as opposed to our use of empirical Bayes.
Result 1 (Analogous and mathematically identical to Result 2 in [18]). The posterior updates under the Bayesian mixed effects model in (1) and (2), for the posterior mean and each of the required sub-blocks of the posterior variance-covariance matrix, are expressible as a two-level sparse matrix least squares problem whose non-zero sub-blocks of B follow the notation in the appendix, with each of these matrices having the corresponding number of rows. The solutions, and the notation used to express them, are given in the appendix.
The streamlined equivalent of Algorithm 1 is given in Algorithm 2. Algorithm 2 makes use of the SolveTwoLevelSparseLeastSquares algorithm, which was first presented in [21] and is also provided in the appendix of this article.
The computing time and storage for the streamlined updating of the posterior mean and required sub-blocks in the E-step of Algorithm 2 grow only linearly in the random effects group sizes, rather than cubically. If both the number of users and the number of time points are large, one may resort to coupling the streamlining presented in this article with an approximation of the posterior so as to further reduce computation time and storage. However, care needs to be taken with the choice of approximation so as to incur as little degradation in accuracy as possible. As explained in Section 3.1 of [19], mean field variational Bayes approximations tend to be very accurate for Gaussian response models. However, such high accuracy does not manifest in general: ignoring important posterior dependencies via mean field restrictions often leads to credible intervals that are too small (e.g., [25]).
3 Performance Assessment and Comparison
To evaluate the speed achieved by our streamlined empirical Bayes algorithm, we compare the timing and accuracy of our method against state-of-the-art software, GPyTorch [11], a highly efficient implementation of Gaussian process regression modeling with GPU acceleration. Note that the Gaussian linear mixed effects model in (1) and (2) is equivalent to a Gaussian process regression model with a structured kernel matrix induced by the use of random effects. For ease of notation, in the following we use sEB to refer to the streamlined empirical Bayes algorithm, GPyT-CPU to refer to empirical Bayes fitting using GPyTorch on CPU, and GPyT-GPU to refer to empirical Bayes fitting using GPyTorch on GPU. The sEB and GPyT-CPU computations were conducted on an Intel Xeon CPU E5-2683. The GPyT-GPU computations were conducted on an NVIDIA Tesla V100-PCIE-32GB.
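The equivalence noted above can be made concrete: under (1) and (2), the covariance between two observations is a kernel that always includes a fixed effects term, adds a user-effect term only when the observations share a user, and adds a time-effect term only when they share a time point. The sketch below assumes scalar variance components for simplicity; that simplification is ours, not the paper's.

```python
import numpy as np

def crossed_effects_kernel(i, t, x, i2, t2, x2, sig_beta, sig_u, sig_v):
    """Kernel induced by a crossed random effects model: observations
    (user i, time t, features x) and (user i2, time t2, features x2) covary
    through the shared fixed effects always, through the user random effects
    only if i == i2, and through the time random effects only if t == t2."""
    k = sig_beta * (x @ x2)
    if i == i2:
        k += sig_u * (x @ x2)
    if t == t2:
        k += sig_v * (x @ x2)
    return k
```

Stacking this kernel over all observation pairs yields exactly the structured, block-sparse kernel matrix that a generic Gaussian process implementation such as GPyTorch treats as dense.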
3.1 Batch Speed Assessment
We obtained timing results for simulated batch data generated according to versions of the Bayesian mixed effects model in (1) and (2) in which both the fixed effects and the random effects had dimension two, corresponding to random intercepts and slopes for a single continuous predictor generated from the Uniform distribution on the unit interval. The true parameter values were fixed across the studies, and the number of data points specific to each user and time period was set to 5. Four separate studies were run with differing values for the number of users, with the total number of data points scaling accordingly. We simulated 50 replications of the data for each setting and recorded the computational times for variance estimation from GPyT-CPU, GPyT-GPU and sEB. Algorithm 2 was implemented in Fortran 77. The EM iterations were stopped once the absolute difference between successive expected complete data log-likelihood values fell below a fixed tolerance. The stopping criterion was the same for GPyTorch and, additionally, the maximum number of iterations was set to 15.
Table 1 shows the mean and standard deviation of elapsed computing times in seconds for estimation of the variance components using sEB, GPyT-CPU and GPyT-GPU. Figure 2 shows the absolute error values for each variance component estimated using sEB, GPyT-CPU and GPyT-GPU, summarized as boxplots.

Datapoints   sEB            GPyT-CPU        GPyT-GPU
1,500        0.7 (0.10)     5.8 (0.14)      1.5 (0.16)
7,500        1.7 (0.15)     163.8 (1.81)    1.3 (0.04)
15,000       2.8 (0.21)     736.2 (38.36)   5.2 (0.03)
1,500,000    322.1 (24.82)  NA (NA)         NA (NA)
3.2 Online Thompson Sampling Contextual Bandit mHealth Simulation Study
Next, we evaluate our approach in a simulated mHealth study designed to capture many of the real-world difficulties of mHealth clinical trials. Users in this simulated study are sent interventions multiple times each day according to Algorithm 3. Each intervention is a message promoting healthy behavior.
In this setting there are 32 users and each user is in the study for 10 weeks. Users join the study in a staggered fashion, such that each week new users might be joining or leaving the study. Each day in the study users can receive up to 5 mHealth interventions.
The Bayesian mixed effects model represented in (1) and (2) offers several advantages in this setting. In mHealth, users can differ not only in the contexts they experience but also in their responses to treatment under the same context. The user-level random effects allow learning of personalized policies for each user, overcoming the flaws of methods that treat all individuals as the same. Additionally, there can be non-stationarity in how users respond to treatment; for example, they might be more responsive at the beginning of a study than at the end. By modeling time-level random effects, each person's policy can be sensitive to a dynamic environment and is informed by how other users' responsivity has been influenced by time.
We evaluate our approach in a setting that demands personalization within a dynamic environment. Users are modeled to be heterogeneous in their response to treatment and their responsivity declines with time.
In Figure 3 we show the ability of our streamlined algorithm to minimize regret when real data is used to inform the simulation. We also compare our approach to GPyT-CPU and GPyT-GPU. For all users we show the average performance by week in the study; for example, we first show the average regret across all users in their first week of the study, although this will not be the same calendar week, as users join in a staggered manner. The average total time (standard deviation) for estimation of the variance components was 757.1 (76.48) using GPyT-CPU, 7.5 (0.27) using GPyT-GPU and 7.3 (0.16) using sEB.
4 Related Work
The fundamental streamlined empirical Bayes Algorithm 2 makes use of linear system solutions and sub-block matrix inverses for two-level sparse matrix problems [22]. Our result for streamlined posterior computation in Section 2.3 is analogous and mathematically identical to Result 2 in [18], which instead focuses on streamlined mean field variational Bayes approximate inference for linear mixed models with crossed random effects. In the present article, we make use of a similar result for empirical Bayes posterior inference in the mobile health setting. Our empirical Bayes algorithm allows streamlined estimation of the variance components within an online Thompson sampling contextual bandit algorithm.
Other approaches include using mixed model software packages for high-performance statistical computation. For example: (i) BLME provides maximum a posteriori estimation for linear and generalized linear mixed effects models in a Bayesian setting [10]; and (ii) Stan [4] provides full Bayesian statistical inference with MCMC sampling. Even though BLME offers streamlined algorithms for obtaining predictions of the fixed and random effects in linear mixed models, the sub-blocks of the posterior covariance matrix required for the EM method in the streamlined empirical Bayes algorithm are not provided by such software. On the other hand, Stan does offer support for computation of these sub-blocks, but is well known to suffer computationally in large data settings.
As we note in Section 3, the Gaussian linear mixed effects model used in the Thompson Sampling algorithm is equivalent to a Gaussian process regression model with a structured kernel matrix induced by the use of random effects. Gaussian process models have been used for multi-armed bandits ([6, 12, 1, 23, 8, 26, 9, 3]) and for contextual bandits ([16, 14]). To address the challenges posed by mHealth, [24] illustrate the benefits of using mixed effects Gaussian process models in the context of reinforcement learning. Computational difficulty in the Gaussian process regression setting is a known and common problem that has led to contributions from the computer science and machine learning communities. For instance, to address the cubic complexity of Gaussian process regression in the data size, a variety of scalable GPs have been presented, including the approach we compare to earlier: GPyTorch [11]. A review of state-of-the-art scalable GPs, covering two main categories, global approximations which distill the entire data and local approximations which divide the data for subspace learning, can be found in [17]. The sparsity imposed by the use of random effects, however, affords us accurate inference in the cases considered in this article, and thus we do not suffer from the potential loss of accuracy that could result from approximate methods such as those discussed in [17].
5 Discussion
We compare three empirical Bayes approaches, sEB, GPyT-CPU and GPyT-GPU, within a batch simulation setting and an online contextual bandit mHealth simulation study.
Within the batch simulation setting of Section 3.1, inspection of the computational running times in Table 1 shows that sEB achieves the lowest average running time across all simulations and data set sizes, compared to GPyT-CPU and GPyT-GPU. In the starkest case this results in a 98.96% reduction in running time. Even in the most modest comparison, sEB timing is similar to that of GPyT-GPU but does not require the sophisticated hardware that GPyT-GPU does. The improvement from GPyT-CPU to GPyT-GPU is impressive, and GPyT-GPU is clearly designed to excel in a resource-rich environment. However, in a clinical trial it is unclear whether such an environment will be available. In contrast, sEB does not require advanced hardware to achieve state-of-the-art performance.
Our method computes only the necessary sub-blocks of the posterior variance-covariance matrix at each time step, as opposed to computing the entire matrix. We were unable to template an equivalent streamlined computation within GPyT-CPU and GPyT-GPU. Consequently, we were unable to run GPyT-CPU and GPyT-GPU on the largest dataset, as this involves constructing a very large dense matrix even before the optimization procedure is called. Templating the variance-covariance matrix so that it did not require this matrix as input might have allowed us to run GPyT-GPU on the largest dataset. An advantage of our method is that it can efficiently exploit the structure inherent in the variance-covariance matrix to manage large datasets.
From Figure 2, we see that sEB achieves lower absolute error on average than either GPyT-CPU or GPyT-GPU; the median reduction in error from GPyT-CPU to sEB is 28.3% and from GPyT-GPU to sEB is 22.2%. The difference is more pronounced between sEB and GPyT-GPU. In short, sEB achieves the lowest average absolute error, at the fastest rate, on the simplest machine.
References
 [1] (2015) Fast direct methods for Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2), pp. 252–265. Cited by: §4.
 [2] (2008) Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language 59 (4), pp. 390–412. Cited by: §2.1.
 [3] (2016) Time-varying Gaussian process bandit optimization. In Artificial Intelligence and Statistics, pp. 314–323. Cited by: §4.
 [4] (2017) Stan: a probabilistic programming language. Journal of Statistical Software 76 (1). Cited by: §4.
 [5] (1985) An introduction to empirical Bayes data analysis. The American Statistician 39 (2), pp. 83–87. Cited by: §2.2.
 [6] (2017) On kernelized multi-armed bandits. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 844–853. Cited by: §4.
 [7] (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39 (1), pp. 1–22. Cited by: §2.2.
 [8] (2014) Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. The Journal of Machine Learning Research 15 (1), pp. 3873–3923. Cited by: §4.
 [9] (2013) High-dimensional Gaussian process bandits. In Advances in Neural Information Processing Systems, pp. 1025–1033. Cited by: §4.
 [10] (2015) blme: Bayesian linear mixed-effects models. URL: https://CRAN.R-project.org/package=blme. R package. Cited by: §4.
 [11] (2018) GPyTorch: blackbox matrix-matrix Gaussian process inference with GPU acceleration. In Advances in Neural Information Processing Systems. Cited by: §3, §4.
 [12] (2011) Portfolio allocation for Bayesian optimization. In UAI, pp. 327–336. Cited by: §4.
 [13] (2017) A variational maximization–maximization algorithm for generalized linear mixed models with crossed random effects. Psychometrika 82 (3), pp. 693–716. Cited by: §2.1.
 [14] (2011) Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pp. 2447–2455. Cited by: §4.
 [15] (1982) Random-effects models for longitudinal data. Biometrics 38 (4), pp. 963–974. Cited by: §2.1.
 [16] (2010) A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670. Cited by: §4.
 [17] (2018) When Gaussian process meets big data: a review of scalable GPs. arXiv preprint arXiv:1807.01065. Cited by: §4.
 [18] (2019) Streamlined variational inference for linear mixed models with crossed random effects. arXiv preprint arXiv:1910.01799. Cited by: §2.3, §2.3, §4.
 [19] (2015) Variational inference for heteroscedastic semiparametric regression. Australian & New Zealand Journal of Statistics 57 (1), pp. 119–138. Cited by: §2.3.
 [20] (1983) Parametric empirical Bayes inference: theory and applications. Journal of the American Statistical Association 78 (381), pp. 47–55. Cited by: §2.2.
 [21] (2019) Streamlined computing for variational inference with higher level random effects. arXiv preprint arXiv:1903.06616. Cited by: §2.3, §2.3.
 [22] (2019) Solutions to sparse multilevel matrix problems. arXiv preprint arXiv:1903.03089. Cited by: Appendix A, §1, §4.
 [23] (2009) Gaussian process optimization in the bandit setting: no regret and experimental design. arXiv preprint arXiv:0912.3995. Cited by: §4.
 [24] (2019) Intelligent pooling in Thompson sampling for rapid personalization in mobile health. Cited by: §1, §1, §2.1, §4.
 [25] (2005) Inadequacy of interval estimates corresponding to variational Bayesian approximations. In AISTATS. Cited by: §2.3.
 [26] (2016) Optimization as estimation with Gaussian processes in bandit settings. In Artificial Intelligence and Statistics, pp. 1022–1031. Cited by: §4.
Appendix A.
The SolveTwoLevelSparseLeastSquares Algorithm
The SolveTwoLevelSparseLeastSquares algorithm is listed in [22] and is based on Theorem 2 of [22]. Given its centrality to Algorithm 2, we list it again here. The algorithm solves a sparse version of the least squares problem:
which has a solution expressible in terms of B and the response vector, which have the following structure:
(12) 
The sub-vectors of x and the sub-matrices of A corresponding to its non-zero blocks are labelled as follows:
(13) 
and
(14) 
with the remaining entries denoting sub-blocks that are not of interest. The SolveTwoLevelSparseLeastSquares algorithm is given in Algorithm 4.