Data are often observed as streaming observations that arrive sequentially across time. Examples of streaming data include hospital patients’ vital signs, telemetry data, online purchases, and stock prices. To model streaming data, it is more efficient to update model parameters as new observations arrive than to refit the model from scratch with the new observations appended onto existing data. A typical prior distribution on the space of functions for time series analysis is the Gaussian process (Rasmussen and Williams, 2006)
. Gaussian processes (GPs) are a convenient distribution on real-valued functions because, when evaluated at a fixed set of inputs, they have a multivariate normal distribution and hence allow closed form posterior inference and prediction when used for regression.
Parameter estimation for GPs remains challenging because inference involves calculating the Gaussian likelihood. This means that the computational complexity of inference is dominated by the operation of inverting an matrix, where
is the number of observations. This complexity makes standard GP inference techniques unsuitable for scenarios with large numbers of observations or where fast inference is essential. Further, Bayesian approaches, such as Markov chain Monte Carlo (MCMC) or variational inference (VI), do not trivially allow online updating of model parameter estimates, although sequential Monte Carlo methods (SMC) may be adapted for this purpose.
From a statistical perspective, a typical GP regression model infers only stationary functions, meaning that properties of the function are constant across all input values. While there are covariance kernels that explicitly capture non-stationary effects in GP regression, they pose greater computational challenges than a stationary kernel as they often require calculating intractable integrals.
Mixture-of-experts GP models have been used to model non-stationary functions by fitting independent GPs to different segments of the input space (Rasmussen and Ghahramani, 2002; Gramacy and Lee, 2008; Yuan and Neubauer, 2009; Zhang and Williamson, 2017). In particular, the IS-MOE approach (Zhang and Williamson, 2017) fits mixtures of GP experts in a distributed way using importance sampling; however, IS-MOE is not an online algorithm. Conversely, sparse online GPs are a state-of-the-art method for online GP estimation, but the estimated functions are constrained to be stationary (Bui et al., 2017). Given the prevalence and complexity of streaming data, this gap necessitates a new approach for online learning of non-stationary functions.
We introduce a sequential Monte Carlo (SMC) algorithm to fit mixtures of GPs by integrating over the space of hyperparameters. SMC samplers can be adapted to allow real-time updates, and are extremely parallelizeable. We show a connection with multi-armed bandits for hyperparameter optimization and exploit this connection to dramatically accelerate mixing in a medical time series problem.
This paper proceeds as follows: In Section 2, we discuss the background for Gaussian processes, methods for fast inference, and importance sampling. We introduce our online GP inference algorithm in Section 3. We show empirical benefits of our framework on prediction tasks in both simulated data and in hospital patient data where online, scaleable inference is essential in Section 4. In Section 5, we conclude with a discussion of future work.
2 Problem Statement and Background
2.1 Notation and Problem Setup
We first define the notation and problem setup in this paper. Consider the problem of estimating an unknown, possibly non-stationary function from streaming data. In detail, observation batches of size , , arrive at times . We assume a functional relationship, , where and . In the mixture-of-GPs scenario, for GPs, each observation is associated with a mixture component, represented as an integer latent variable, . Each mixture component has its own set of kernel hyperparameters, , learned from the partitioning induced by , for instantiated mixtures . Furthermore, we model separate particles , each with their own mixture of GPs, to accelerate mixing in SMC.
2.2 Gaussian Processes
Gaussian processes (GPs) are a popular approach to modeling distributions over arbitrary real-valued functions , with applications in regression , classification (Williams and Barber, 1998), and optimization (Snoek et al., 2012), among others. Characterized by a mean function , and a covariance function , we can sample instantiations from a GP at a fixed set of locations , according to an -dimensional multivariate normal: where and are the mean and covariance functions evaluated at the data. In the presence of noisy observations, , from a GP distributed function, we observe the data generating process: .
2.3 Fast Gaussian Process Inference
Fitting GP-based models is dominated by covariance matrix inversion operations, meaning that, practically, these models are challenging to fit to large sample size data. To allow GP models to be computationally tractable for large data, numerous approaches have been developed for scalable GP inference. These approaches largely fall into two groups: sparse methods and local methods.
Sparse methods approximate the GP posterior distribution with an number of inducing points, reducing the computational complexity to (Snelson and Ghahramani, 2005; Titsias, 2009). However, sparse approaches are unsuitable for learning non-stationary GPs unless we explicitly define a non-stationary kernel.
Local methods, on the other hand, exploit structure among samples to represent the covariance matrix using a low-rank approximation (Deisenroth and Ng, 2015; Ng and Deisenroth, 2014). To do this, they partition the data into sets, and approximate the covariance matrix by setting the inter-partition covariances to zero. They invert this approximate covariance matrix by inverting dense matrices of size , resulting in a complexity of . An additional benefit of such an approach is that we can estimate GP hyperparameters within each block; when samples are partitioned based on input location, this implicitly captures non-stationary functions.
2.4 Online Gaussian Processes
Although local methods are simple to update in real-time, as we can send new data to clusters and only update clusters receiving new data, sparse and mixture-of-experts GP inference methods discussed above do not trivially allow model updates as new data arrive. Previous work in online GP inference required non-trivial adaptations to allow real-time GP inference. Csató and Opper (2002); Bui et al. (2017) propose an inducing-point online method by approximating the posterior using variational inference and expectation propagation techniques. Nguyen-Tuong et al. (2009) propose an online product-of-experts variant of GP regression where the data are assigned to a single partition–leading to a block diagonal structure in the covariance matrix. Contrast this to the mixture of experts as seen in Gramacy and Lee (2008); Rasmussen and Ghahramani (2002); Zhang and Williamson (2017) and our proposed method, where the partition is integrated out, leading to a more expressive covariance structure but not inherently amenable to online updating. However, as noted in Low et al. (2015), sparse methods cannot easily model fast-moving functions without increasing the number of inducing points; similarly, local methods may not easily account for long-range dependencies without expanding the batch size.
2.5 Modeling Non-Stationary Functions with Gaussian Processes
Modeling non-stationary functions using GPs is often computationally challenging because a non-stationary kernel generally includes far more parameters than stationary kernels, and many non-stationary kernels require calculating intractable integrals (Higdon et al., 1999). For example, the non-stationary kernel in Paciorek and Schervish (2004) itself is modeled by GP functions therefore incur costs to learn the posterior which generates the kernel in addition to the cost of learning hyperparameters. Nguyen-Tuong et al. (2009) provide a product-of-experts approach to fitting online GPs that can account for non-stationary behavior, their inference depends on a single partitioning of the input data. Hard partitioning may create undesirable edge effects due to its implicit assumptions that partitions have zero correlation. The Bayesian treed GP (BTGP) (Gramacy and Lee, 2008) addresses the problem of hard partitioning by integrating over the space of partitions via reversible jump MCMC. While method flexibly captures non-stationary functions, reversible jump is slow to marginalize over the space of treed GP partitions and is not parallelizable due the Markov dependencies in MCMC.
2.6 Importance Sampling and Sequential Monte Carlo
A central issue in Bayesian inference is the computation of the often intractable integral. For this task, importance sampling is a popular tool. To perform importance sampling, particles are drawn from a proposal distribution that is simple to sample from and approximates well. The integral is estimated using the weighted empirical average of these samples with weights to approximate the integral For problems where arises in a sequence, , we can rewrite as a sequential update of the importance weight:
In a Bayesian context, one can view this sequential importance sampler as sequentially updating a prior distribution at to the posterior of all observed data at time . Moving from to in this sampler may lead to a situation where the weight of one particle
dominates the others, thereby increasing the variance of our Monte Carlo estimate and propogating suboptimal particles to future time steps. To avoid this, we can resample the particles according to their weightsfor particles–leading to the sequential Monte Carlo (SMC) sampler (more detail available in Doucet et al., 2001; Doucet and Johansen, 2009).
SMC methods have recently been used for online GP learning. Osborne et al. (2008) perform online updating of multi-output GPs and marginalize the hyperparameters directly using Bayesian quadature methods (Rasmussen and Ghahramani, 2003). Gramacy and Polson (2011) propose an SMC sampler for online updates for GPs with stationary behavior but, as noted in Svensson et al. (2015), their approach makes strong conjugacy assumptions on the model likelihood and priors. Svensson et al. (2015) introduce an SMC sampler that marginalizes the GP hyperparameters by sampling hyperparameter states according to Metropolis-Hastings on each particle to calculate the marginalization from Eqn. 1. This SMC sampler allows for online updating of a GP model while avoiding some of the conjugacy restrictions introduced by Gramacy and Polson (2011).
Finally, Zhang and Williamson (2017) propose a fast method (IS-MOE) that unifies non-stationary function learning and parallel GP inference. To do this, they integrate over the space of partitions using importance sampling, which allows distributed computation. This method uses “minibatched” stochastic approximations, where the model is fit with only a subset of the training set and the likelihood is upweighted to approximate the full data likelihood. However, IS-MOE cannot update GP models as new data arrive. To achieve an online GP inference method for non-stationary functions, we develop a new approach to allow a general mixture-of-experts GP to be computationally tractable for streaming data.
2.7 Gaussian Processes and Hyperparameter Optimization: A Bandit Perspective
Bayesian optimization techniques seek to find hyperparameters
that best model a conditional probability. Many approaches optimize hyperparameter configurations adaptively (Srinivas et al., 2009; Hutter et al., 2011; Bergstra et al., 2011; Snoek et al., 2012; Li et al., 2016), with bandit formulations being particularly successful in GP contexts (Wang et al., 2013; Srinivas et al., 2009; Li et al., 2016). The bandit framework considers the problem of optimizing a function , sampled from a GP with kernel hyperparameters , by sequentially selecting from a set of arms corresponding to inputs , where noisy values are observed.
The goal of a bandit algorithm is to sequentially select arms that maximize long term rewards; to do this, one needs to approximate the expected rewards for each arm (exploration) and then select the arms that maximize rewards (exploitation). There are two common algorithmic strategies for choosing arms: optimization based algorithms (Srinivas et al., 2009) and sampling based approaches (Russo and Van Roy, 2014; Russo et al., 2017). When considering hyperparameter optimization, the algorithm’s goal is reduced to pure exploration–searching for the best hyperparameter configurations is akin to selecting the best functional approximation. Taking advantage of the bandit formalism, we show a connection between hyperparameter optimization in our model and a pure exploration bandit strategy. We show how this formulation leads to substantial improvements in mixing rates in our online GP in the context of prediction in hospital patients.
3 Sequential Gaussian Processes for Online Learning
We describe the GP mixture-of-experts model, and then our sequential Monte Carlo approach for online inference.
3.1 Gaussian Process Mixture of Experts (GP-MOE)
We assume our data is generated from a Gaussian process mixture, similar to previous mixture-of-expert models for Gaussian processes (Rasmussen and Ghahramani, 2002; Gramacy and Lee, 2008; Zhang and Williamson, 2017). This hierarchical model allows for greater flexibility in modeling functions, at the cost of more difficult inference for which we propose a distributable solution in the next section.
Our approach adopts the following generative model: We assume that the inputs are distributed according to a Dirichlet process mixture of normal-inverse Wishart distributions (Antoniak, 1974), and the outputs are then assumed to be generated by independent GPs.
where represent the inputs associated with the latent cluster . We assign the th input sequentially to clusters according to the Chinese restaurant process (CRP) (Aldous, 1985) and the marginalized likelihood for :
where is the number of observations assigned to cluster and are the parameters of the multivariate- likelihood for observation ’s assignment to cluster 111For notational simplicity, we drop the index here from .
3.2 SMC for Online GP-MOE
In an SMC setting with particles, our proposal distribution at the initial time, , is , the posterior distribution of the cluster assignments given only the inputs. The important sampling weight corresponding to a particle from this distribution requires integrating over the covariance parameters:
We calculate the MAP estimate of the hyperparameters by optimizing with respect to the marginal likelihood, in order to quickly approximate the integral in Eqn. 4. For additional memory and computational savings, we can fit the independent GP models on each particle by sampling observations uniformly without replacement from the full data set and upweighting the likelihood by a power of .
To fit a batch of data at time , we follow Alg. 1. First, we assign input data to mixture components according to the marginal likelihood from Eqn. 3. After the mixture assignments, we update the hyperparameters for any cluster with new samples. The number of clusters at time for which we update hyperparameters is at most . Thus, the number of computationally demanding hyperparameter updates is fairly small. Typically, SMC methods update a model one sample at a time, but here we update the model using sequential batches. This leads to faster and more efficient posterior updates when samples in a sequential batch share a cluster, and hyperparameter updates can be performed once for that cluster.
Next, we calculate the particle weight update from to (Eqn. 5) (Svensson et al., 2015). At , we just use the importance weight from Eqn. 4. If falls below a certain threshold, then we resample the particles. It is empirically more efficient to resample when the particles all have small weights than it is to resample particles at every time (Liu, 2008). To perform prediction given test inputs (Alg. 2), we average the predictive value of test outputs on each mixture within particle and then average the predictions across particles weighted by .
Our method allows for distributed computation, as we require inter-processor communication twice per batch of observations, when (1) resampling and normalizing the particle weights and (2) averaging the predictions. Assuming each batch is on average of size , the computational complexity of fitting our model is for particles. Since computation can be fully distributed across the particles, we have a complexity per thread of , comparable with other online GP algorithms (Tab. 1).
|Full GP||Sparse||Local Online GP||Online GP-MOE|
3.3 Online GP-MOE in a Bandit Setting
Online GP-MOE can be cast in a bandit framework, where the function is modeled as a sample from a mixture of GPs, as opposed to a single GP, and where the arms of the bandit are indexed by , corresponding to the kernel-specific hyperparameters of the mixture components. At time , we receive new samples , associated with arms from . Formally, selecting arm , corresponds to assigning the th observation of the th batch to the mixture component that produces the highest reward (in this case, the marginal likelihood ). In our medical setting, data encode the times when a vital sign is recorded with the corresponding denoting the recorded value of the vital sign. In this case, the assignments correspond to the patient’s latent physiological state (i.e., stable or septic shock), with each state governed by a different set of hyperparameters.
As batches of data are observed across time, the goal is to design an online strategy to efficiently select arms that lead to the best approximation of
. This is achieved using a Thompson sampling approach, by sampling from the conditional posterior(Chapelle and Li, 2011). In the spirit of importance sampling bandits (Urteaga and Wiggins, 2018), each arm is drawn from the predictive posterior distribution (Eqn. 3). Following the mixture component assignment, hyperparameter values for the expanded cluster and particle weights are updated according to Alg. 1, allowing the computation of the prediction (Alg. 2). Our formulation allows for a growing number of arms (Eqn. 4) (Whittle, 1988), corresponding to the case in which none of the existing hyperparameters configurations is suitable for explaining the current function behavior.
3.4 Warm-Start Hyperparameter Optimization via Transferred Initial Points
Having proper hyperparameter settings is vital for the kernel hyperparameters in a GP model. However, exploring the hyperparameter space is computationally challenging. When provided with new data , a cold start from an initial batch can be avoided by considering Online GP-MOE in a bandit setting. In this case, the initial pool of arms and associated hyperparameter configurations corresponds to the arms queried when running Online GP-MOE on the previous dataset . Thus, we can create a warm-start hyperparameter setting by adapting the previous bandits to the new collection of arms. We will empirically show that this approach to leads to substantial speedup in a medical application.
4 Empirical Analyses for Online GP-MOE
To demonstrate the ability of our algorithm to fit non-stationary GPs online and in real-time, we apply Online GP-MOE to a synthetic non-stationary regression function and to time series data from hospital patients. We compare our results against two alternative approaches: a local GP implementation that can be viewed as a special case of our algorithm with only one particle, and a variational sparse GP method with the number of inducing points chosen to be of computational complexity comparable to our distributed approach.
4.1 Non-Stationary Function Learning
First, we simulate data from a non-stationary function by joining piecewise periodic functions with fast and slow moving behaviors and adding Gaussian noise. We divide the data into 1,000 training observations and 100 test observations. We use 64 particles that we distribute to four compute nodes with 16 cores per node. We divide the test observations into five partitions on which we predict and update the model sequentially. The concentration parameter is fixed to . Each GP mixture is fit with an RBF kernel. Parallelization of the code is carried out through mpi4py and the basis of the GP code is implemented through GPy .
On this simulated prediction task, we find that our approach can learn heterogeneous functional behavior when we use our SMC sampler to integrate over the partitioned GPs and hyperparameters (Fig. 1; Tab. 2). Our GP-MOE model captures multimodal structure in the posterior, as can other SMC methods (Svensson et al., 2015; Osborne et al., 2008). In contrast, the local GP method relies on only one particle and hence may fall into sub-optimal local modes. The sparse GP method with a stationary kernel cannot account for any non-stationarity, resulting in poor performance on this predictive task.
4.2 Online Heart Rate Prediction for Hospital Patients
Next, we applied our method to data from a hospital ICU patient. Specifically, we consider the heart rate of a single patient in the MIMIC-III data (Johnson et al., 2016), with heart rate normalized to have zero-mean and unit variance. The number of heart rate observations is approximately 39,000, of which we randomly reserve 10% for testing. We use a minibatch size of 1,000 for the training data. Copying the settings in the previous experiment, we use 64 particles in a distributed setting, sequentially predicting and updating the test data partitioned into five blocks. On this example, our approach learns a more accurate representation of the predicted mean function than the local or sparse GP (Fig. 2).
4.3 Transfer Learning of Electronic Health Records
Especially in health care records data, having proper hyperparameter settings can be vital and can result in earlier, timely treatments. In the situation of modeling patients’ heart rates from the MIMIC-III data set, we can rely on previous models fit on other patients’ heart rates to inform the hyperparameter settings on a new patient’s heart rate. Concretely, we first fit our mixture-of-experts approach on one patient’s complete heart-rate observations, with a minibatch of size 1,000 and 64 particles, as in the previous experiment. After fitting this model, we collect the hyperparameters of each mixture in the particle with the highest weight. Next, we fit our approach to the heart rate data set from Section 4.2 but instead of optimizing the hyperparameters from a “cold start”, as we did the previous experiment, we sample mixture assignments learned a priori, corresponding to the hyperparameter settings for a particular partition of the data that returns the highest log marginal likelihood, representing a bandit formulation of our method.
The bandit approach for warm starts in hyperparameter estimation is both fast and effective (Tab. 2). The CPU wall clock time of the bandits is 7x faster than fitting from a cold start. The bandit approach also achieved better predictive log likelihood results, although worse predictive MSE, than the cold start case. In a hospital ICU setting, where heart rate might be measured once per minute, this speedup is essential for real time estimation of patient state.
|Data Set||Method||Pred. LL||Pred. MSE||Wall Time (s.)|
5 Conclusion and Future Directions
In this paper, we introduced a fast online inference algorithm for fitting mixtures of Gaussian processes that can perform online estimation of non-stationary functions. For further speed-up, we exploit multi-armed bandits within our inference setup to facilitate faster and better hyperparameter optimization to learn a non-stationary GP modeled as a mixture of experts. The capabilities of the resulting framework—fast non-stationary function inference for streaming data—are essential in many real world problems; we validated our approach on data from the health care setting, where we observe a massive number of streaming observations and highly non-stationary functions.
Gaussian processes have enjoyed notable success in the health-care domain, for example, in predicting sepsis in hospital patients (Futoma et al., 2017a, b), and in jointly modeling patients’ vital signals through multi-output GPs (Cheng et al., 2017). In future research, we will to extend our approach to multi-output GP models, and implement kernel functions customized for health care scenarios. By combining fast inference with flexible modeling, these approaches will have a profound impact in real-time monitoring and decision-making in patient health.
- Aldous (1985) Aldous, D. J. (1985). Exchangeability and related topics. In École d’Été de Probabilités de Saint-Flour XIII—1983, pages 1–198. Springer.
- Antoniak (1974) Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, pages 1152–1174.
- Bergstra et al. (2011) Bergstra, J. S., Bardenet, R., Bengio, Y., and Kégl, B. (2011). Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pages 2546–2554.
- Bui et al. (2017) Bui, T. D., Nguyen, C., and Turner, R. E. (2017). Streaming sparse Gaussian process approximations. In Advances in Neural Information Processing Systems, pages 3299–3307.
- Chapelle and Li (2011) Chapelle, O. and Li, L. (2011). An empirical evaluation of Thompson Sampling. In Advances in neural information processing systems, pages 2249–2257.
- Cheng et al. (2017) Cheng, L.-F., Darnell, G., Dumitrascu, B., Chivers, C., Draugelis, M. E., Li, K., and Engelhardt, B. E. (2017). Sparse multi-output Gaussian processes for medical time series prediction. arXiv preprint arXiv:1703.09112.
- Csató and Opper (2002) Csató, L. and Opper, M. (2002). Sparse on-line Gaussian processes. Neural Computation, 14(3):641–668.
- Dalcín et al. (2005) Dalcín, L., Paz, R., and Storti, M. (2005). MPI for Python. Journal of Parallel and Distributed Computing, 65(9):1108 – 1115.
- Deisenroth and Ng (2015) Deisenroth, M. P. and Ng, J. W. (2015). Distributed Gaussian processes. In International Conference on Machine Learning, pages 1481–1490.
- Doucet et al. (2001) Doucet, A., De Freitas, N., and Gordon, N. (2001). An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo Methods in Practice, pages 3–14. Springer.
- Doucet and Johansen (2009) Doucet, A. and Johansen, A. M. (2009). A tutorial on particle filtering and smoothing: Fifteen years later.
Futoma et al. (2017a)
Futoma, J., Hariharan, S., and Heller, K. (2017a).
Learning to detect sepsis with a multitask Gaussian process RNN classifier.In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1174–1182. JMLR. org.
- Futoma et al. (2017b) Futoma, J., Hariharan, S., Sendak, M., Brajer, N., Clement, M., Bedoya, A., O’Brien, C., and Heller, K. (2017b). An improved multi-output Gaussian process RNN with real-time validation for early sepsis detection. arXiv preprint arXiv:1708.05894.
- GPy (2012) GPy (2012). GPy: A Gaussian process framework in Python. http://github.com/SheffieldML/GPy.
- Gramacy and Lee (2008) Gramacy, R. B. and Lee, H. K. H. (2008). Bayesian treed Gaussian process models with an application to computer modeling. Journal of the American Statistical Association, 103(483):1119–1130.
- Gramacy and Polson (2011) Gramacy, R. B. and Polson, N. G. (2011). Particle learning of Gaussian process models for sequential design and optimization. Journal of Computational and Graphical Statistics, 20(1):102–118.
- Higdon et al. (1999) Higdon, D., Swall, J., and Kern, J. (1999). Non-stationary spatial modeling. Bayesian Statistics, 6(1):761–768.
- Hutter et al. (2011) Hutter, F., Hoos, H. H., and Leyton-Brown, K. (2011). Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, pages 507–523. Springer.
- Johnson et al. (2016) Johnson, A. E., Pollard, T. J., Shen, L., Li-wei, H. L., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., and Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035.
- Li et al. (2016) Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. (2016). Hyperband: A novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560.
- Liu (2008) Liu, J. S. (2008). Monte Carlo strategies in scientific computing. Springer Science & Business Media.
Low et al. (2015)
Low, K. H., Yu, J., Chen, J., and Jaillet, P. (2015).
Parallel Gaussian process regression for big data: Low-rank
representation meets Markov approximation.
AAAI Conference on Artificial Intelligence.
- Ng and Deisenroth (2014) Ng, J. W. and Deisenroth, M. P. (2014). Hierarchical mixture-of-experts model for large-scale Gaussian process regression. arXiv preprint arXiv:1412.3078.
- Nguyen-Tuong et al. (2009) Nguyen-Tuong, D., Peters, J. R., and Seeger, M. (2009). Local Gaussian process regression for real time online model learning. In Advances in Neural Information Processing Systems, pages 1193–1200.
- Osborne et al. (2008) Osborne, M. A., Roberts, S. J., Rogers, A., Ramchurn, S. D., and Jennings, N. R. (2008). Towards real-time information processing of sensor network data using computationally efficient multi-output Gaussian processes. In International Conference on Information Processing in Sensor Networks, pages 109–120. IEEE.
- Paciorek and Schervish (2004) Paciorek, C. J. and Schervish, M. J. (2004). Nonstationary covariance functions for Gaussian process regression. In Advances in Neural Information Processing Systems, pages 273–280.
- Rasmussen and Ghahramani (2002) Rasmussen, C. E. and Ghahramani, Z. (2002). Infinite mixtures of Gaussian process experts. In Neural Information Processing Systems, pages 881–888.
- Rasmussen and Ghahramani (2003) Rasmussen, C. E. and Ghahramani, Z. (2003). Bayesian Monte Carlo. Advances in Neural Information Processing Systems, pages 505–512.
- Rasmussen and Williams (2006) Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian processes for machine learning. Gaussian Processes for Machine Learning.
- Russo and Van Roy (2014) Russo, D. and Van Roy, B. (2014). Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243.
- Russo et al. (2017) Russo, D., Van Roy, B., Kazerouni, A., and Osband, I. (2017). A Tutorial on Thompson Sampling. arXiv preprint arXiv:1707.02038.
- Snelson and Ghahramani (2005) Snelson, E. and Ghahramani, Z. (2005). Sparse Gaussian processes using pseudo-inputs. In Neural Information Processing Systems, pages 1257–1264.
- Snoek et al. (2012) Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Neural Information Processing Systems, pages 2951–2959.
- Srinivas et al. (2009) Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. (2009). Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995.
- Svensson et al. (2015) Svensson, A., Dahlin, J., and Schön, T. B. (2015). Marginalizing Gaussian process hyperparameters using sequential Monte Carlo. In 2015 IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 477–480. IEEE.
- Titsias (2009) Titsias, M. K. (2009). Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics, volume 5, pages 567–574.
- Urteaga and Wiggins (2018) Urteaga, I. and Wiggins, C. H. (2018). (Sequential) importance sampling bandits. arXiv preprint arXiv:1808.02933.
- Wang et al. (2013) Wang, Z., Zoghi, M., Hutter, F., Matheson, D., and De Freitas, N. (2013). Bayesian optimization in high dimensions via random embeddings. In Twenty-Third International Joint Conference on Artificial Intelligence.
- Whittle (1988) Whittle, P. (1988). Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 25(A):287–298.
- Williams and Barber (1998) Williams, C. K. and Barber, D. (1998). Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351.
- Yuan and Neubauer (2009) Yuan, C. and Neubauer, C. (2009). Variational mixture of Gaussian process experts. In Advances in Neural Information Processing Systems, pages 1897–1904.
- Zhang and Williamson (2017) Zhang, M. M. and Williamson, S. A. (2017). Embarrassingly parallel inference for Gaussian processes. arXiv:1702.08420.