# Scaling up the Automatic Statistician: Scalable Structure Discovery using Gaussian Processes

Automating statistical modelling is a challenging problem that has far-reaching implications for artificial intelligence. The Automatic Statistician employs a kernel search algorithm to provide a first step in this direction for regression problems. However this does not scale due to its O(N^3) running time for the model selection. This is undesirable not only because the average size of data sets is growing fast, but also because there is potentially more information in bigger data, implying a greater need for more expressive models that can discover finer structure. We propose Scalable Kernel Composition (SKC), a scalable kernel search algorithm, to encompass big data within the boundaries of automated statistical modelling.



## 1 Introduction

Automated statistical modelling is an area of research in its early stages, yet it is becoming an increasingly important problem [14]. As a growing number of disciplines use statistical analyses and models to help achieve their goals, the demand for statisticians, machine learning researchers and data scientists is at an all-time high. Automated systems for statistical modelling serve to assist these experts, if not to substitute for them where there is a shortage.

An example of a fruitful attempt at automated statistical modelling in nonparametric regression is Compositional Kernel Search (CKS) [10], an algorithm that fits a Gaussian process (GP) to the data and automatically chooses a suitable parametric form of the kernel. This leads to high predictive performance that matches kernels hand-selected by GP experts [27]. There also exist other approaches that tackle this model selection problem by using a more flexible kernel [2, 24, 30, 41, 43]. However, the distinctive feature of Duvenaud et al. [10] is that the resulting models are interpretable: the kernels are constructed in such a way that we can use them to describe patterns in the data, and thus they can be used for automated exploratory data analysis. Lloyd et al. [19] exploit this to generate natural-language analyses from these kernels, a procedure that they name Automatic Bayesian Covariance Discovery (ABCD). The Automatic Statistician (see http://www.automaticstatistician.com/index/ for example analyses) implements this to output a 10-15 page report when given a data set as input.

However, a limitation of ABCD is that it does not scale; due to the O(N³) time for inference in GPs, the analysis is constrained to small data sets, specialising in one-dimensional time series data. This is undesirable not only because the average size of data sets is growing fast, but also because there is potentially more information in bigger data, implying a greater need for more expressive models that can discover finer structure. This paper proposes Scalable Kernel Composition (SKC), a scalable extension of CKS, to push the boundaries of automated interpretable statistical modelling to bigger data. In summary, our work makes the following contributions:

• We propose the first scalable version of the Automatic Statistician, which scales up to medium-sized data sets by reducing the algorithmic complexity from O(N³) to O(N²) and by enhancing parallelisability.

• We derive a novel cheap upper bound on the GP marginal likelihood, which is used in SKC together with the variational lower bound [37] to sandwich the GP marginal likelihood.

• We show that our upper bound is significantly tighter than the lower bound, and that it plays an important role in model selection.

## 2 ABCD and CKS

The Compositional Kernel Search (CKS) algorithm [10] builds on the fact that the sum and product of two positive definite kernels are also positive definite. Starting off with a set of base kernels defined on ℝ, the algorithm searches through the space of zero-mean GPs with kernels that can be expressed in terms of sums and products of these base kernels. The set {SE, LIN, PER} is used, corresponding to the squared exponential, linear and periodic kernels respectively (see Appendix C for the exact form of these base kernels). Thus candidate kernels form an open-ended space of GP models, allowing for an expressive model. Such approaches for structure discovery have also appeared in [12, 13]. A greedy search is employed to explore this space, with each kernel scored by the Bayesian Information Criterion (BIC) [31] (defined here as the log marginal likelihood with a model complexity penalty, in the convention where a higher BIC means a better model fit; see Appendix A) after optimising the kernel hyperparameters by type II maximum likelihood (ML-II). See Appendix B for the algorithm in detail.
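As a concrete illustration of the scoring step, the "higher is better" BIC convention above can be sketched as follows; the log marginal likelihood values and hyperparameter counts here are hypothetical, chosen only to show how the complexity penalty trades off against fit:

```python
import numpy as np

def bic_score(log_marglik, num_hyperparams, num_data):
    """BIC in the 'higher is better' convention used here:
    log marginal likelihood minus a complexity penalty of
    (p / 2) * log(N) for p hyperparameters and N data points."""
    return log_marglik - 0.5 * num_hyperparams * np.log(num_data)

# A kernel that fits slightly better but uses more hyperparameters can
# lose to a simpler kernel once the complexity penalty is applied:
simple_kernel = bic_score(-120.0, 2, 500)    # e.g. a lone SE kernel
complex_kernel = bic_score(-118.0, 6, 500)   # e.g. SE * PER + LIN
```

This penalty is what keeps the greedy search from always preferring the more elaborate composite kernel at each expansion.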

The resulting kernel can be simplified and expressed as a sum of products of base kernels, which has the notable benefit of interpretability. In particular, f₁ + f₂ ∼ GP(0, k₁ + k₂) for independent f₁ ∼ GP(0, k₁) and f₂ ∼ GP(0, k₂), so a GP whose kernel is a sum of products of base kernels can be interpreted as a sum of independent GPs, each with structure given by a product of base kernels. Each base kernel in a product then modifies the model in a consistent way. For example, multiplication by SE converts global structure into local structure, since SE(x, x′) decreases exponentially with |x − x′|, and multiplication by PER is equivalent to multiplication of the modelled function by a periodic function (see Lloyd et al. [19] for detailed interpretations of different combinations). This observation is used in Automatic Bayesian Covariance Discovery (ABCD) [19] to give a natural-language description of the function modelled by the composite kernel. In summary, ABCD consists of two algorithms: the compositional kernel search CKS, and the natural-language translation of the chosen kernel into a piece of exploratory data analysis.
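The closure of positive definite kernels under sums and products can be made concrete with a small sketch. The Gram-matrix implementations and hyperparameter values below are illustrative (not the paper's fitted values), but the SE × PER product is exactly the "locally periodic" composition described above:

```python
import numpy as np

# Illustrative Gram-matrix implementations of the three base kernels on
# 1-D inputs; hyperparameter values are arbitrary, not fitted.
def se(x, xp, lengthscale=1.0, variance=1.0):
    d = x[:, None] - xp[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def lin(x, xp, variance=1.0):
    return variance * x[:, None] * xp[None, :]

def per(x, xp, period=1.0, lengthscale=1.0, variance=1.0):
    d = np.abs(x[:, None] - xp[None, :])
    return variance * np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / lengthscale ** 2)

x = np.linspace(0.0, 4.0, 50)
# SE * PER: a locally periodic kernel. The elementwise product of two
# positive definite Gram matrices is again positive definite, which is
# what licenses the compositional search.
K = se(x, x, lengthscale=2.0) * per(x, x, period=1.0)
```

The composite `K` stays symmetric and positive semi-definite, so any kernel built this way is a valid GP covariance.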

## 3 Scaling up ABCD

ABCD provides a framework for a natural extension to big data settings, in that we only need to scale up CKS; the natural-language description of models can then be applied directly. The difficulty of this extension lies in the O(N³) time for evaluating the GP marginal likelihood and its gradients with respect to the kernel hyperparameters.

A naïve approach is to subsample the data to reduce N, but then we may fail to capture global structure such as periodicities with long periods, or omit a set of points displaying a certain local structure. We show failure cases of random subsampling in Section 4.3. As for more strategic subsampling, designing a generic subsampling algorithm for GPs that captures the aforementioned properties of the data is a challenging research problem in itself.

Alternatively it is tempting to use either an approximate marginal likelihood or the exact marginal likelihood of an approximate model as a proxy for the exact likelihood [25, 32, 34, 37], especially with modern GP approximations scaling to large data sets with millions of data points [15, 42]. However such scalable GP methods are limited in interpretability, as they often behave very differently to the full GP, lacking guarantees that the chosen kernel faithfully reflects the actual structure in the data. In other words, the real challenge is to scale up the GP while preserving interpretability, and this is a difficult problem due to the trade-off between scalability and accuracy of GP approximations. Our work pushes the frontier of interpretable GPs to medium-sized data sets by reducing the computational complexity of ABCD from O(N³) to O(N²). Extending this to large data sets is a difficult open problem.

Our approach is as follows: we provide cheap lower and upper bounds that sandwich the exact marginal likelihood, and we use this interval for model selection. We first give a brief overview of the relevant work on low-rank kernel approximations used for scaling up GPs, and later outline how they can be applied to obtain cheap lower and upper bounds.

### 3.1 Nyström Methods and Sparse GPs

The Nyström method [9, 40] selects a set of m inducing points in the input space that attempt to explain all the covariance in the Gram matrix of the kernel; the kernel is evaluated for each pair of inducing points and also between the inducing points and the data, giving matrices K_mm and K_Nm. These are used to form the Nyström approximation K̂ = K_Nm K_mm⁺ K_Nmᵀ of K, where K_mm⁺ is the pseudo-inverse of K_mm. Applying a Cholesky decomposition to K_mm, we see that the approximation admits a rank-m form K̂ = ΦΦᵀ, which allows efficient computation of determinants and inverses in O(m²N) time (see Appendix D). We later use the Nyström approximation to give an upper bound on the exact log marginal likelihood.
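The low-rank trick can be sketched numerically. The SE kernel, the random inducing subset and the jitter below are illustrative assumptions; the point is that with K̂ = ΦΦᵀ, the log-determinant of K̂ + σ²I reduces (via the matrix determinant lemma) to an m × m computation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, sigma2 = 300, 30, 0.1
x = np.sort(rng.uniform(0.0, 10.0, N))
Z = x[rng.choice(N, size=m, replace=False)]   # inducing points: random subset

def se_kernel(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

K_mm = se_kernel(Z, Z)
K_Nm = se_kernel(x, Z)

# Low-rank form of the Nystrom approximation: with K_mm = L L^T
# (Cholesky, after a small jitter), Phi = K_Nm L^{-T} gives
# K_hat = Phi Phi^T, so log det(K_hat + sigma2 I) follows from an
# m x m matrix in O(N m^2) rather than O(N^3).
L = np.linalg.cholesky(K_mm + 1e-6 * np.eye(m))
Phi = np.linalg.solve(L, K_Nm.T).T

inner = sigma2 * np.eye(m) + Phi.T @ Phi
logdet_fast = (N - m) * np.log(sigma2) + np.linalg.slogdet(inner)[1]

# O(N^3) reference computation for comparison:
logdet_direct = np.linalg.slogdet(Phi @ Phi.T + sigma2 * np.eye(N))[1]
```

The same Woodbury-style structure gives the O(m²N) inverses used throughout the paper.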

The Nyström approximation arises naturally in the sparse GP literature, where certain distributions are approximated by simpler ones involving u, the GP evaluated at the inducing points: the Deterministic Training Conditional (DTC) approximation [32] defines a model that gives the marginal likelihood N(y | 0, K̂ + σ²I) (where y is the vector of observations), whereas the Fully Independent Conditional (FIC) approximation [34] gives N(y | 0, K̂ + diag(K − K̂) + σ²I), correcting the Nyström approximation along the diagonal. The Partially Independent Conditional (PIC) approximation [25] further improves this by correcting the Nyström approximation on block diagonals, with blocks typically of size m. Note that the approximation is no longer low rank for FIC and PIC, but matrix inversion can still be computed in O(m²N) time by Woodbury's lemma (see Appendix D).

The variational inducing points method (VAR) [37] is rather different to DTC/FIC/PIC in that it gives the following variational lower bound on the exact log marginal likelihood:

$$\log\mathcal{N}\!\left(y \,\middle|\, 0,\hat{K}+\sigma^2 I\right) \;-\; \frac{1}{2\sigma^2}\operatorname{Tr}\!\left(K-\hat{K}\right) \qquad (1)$$

This lower bound is optimised with respect to the inducing points and the kernel hyperparameters, which is shown in the paper to successfully yield tight lower bounds in O(m²N) time for reasonable values of m. Another useful property of VAR is that the lower bound can only increase as the set of inducing points grows [21, 37]. It is also known that VAR always improves with extra computation, and that it successfully recovers the true posterior GP in most cases, contrary to other sparse GP methods [4]. Hence this is what we use in SKC to obtain a lower bound on the marginal likelihood and to optimise the hyperparameters. We use 10 random initialisations of the hyperparameters and choose the one with the highest lower bound after optimisation.
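Equation (1) is cheap to state in code. The toy data, SE kernel and evenly spaced inducing subset below are assumptions for illustration (the actual method also optimises hyperparameters against this bound); the sketch just evaluates the bound and checks it against the exact log marginal likelihood:

```python
import numpy as np

rng = np.random.default_rng(1)
N, m, sigma2 = 200, 20, 0.1
x = np.sort(rng.uniform(0.0, 10.0, N))
y = np.sin(x) + np.sqrt(sigma2) * rng.standard_normal(N)
Z = x[:: N // m][:m]                       # inducing points: even subset

k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)
K, K_mm, K_Nm = k(x, x), k(Z, Z), k(x, Z)
K_hat = K_Nm @ np.linalg.solve(K_mm + 1e-6 * np.eye(m), K_Nm.T)

def log_mvn(y, C):
    """log N(y | 0, C) for a positive definite covariance C."""
    sign, ld = np.linalg.slogdet(C)
    return -0.5 * (ld + y @ np.linalg.solve(C, y) + len(y) * np.log(2 * np.pi))

exact = log_mvn(y, K + sigma2 * np.eye(N))
# Eq. (1): the marginal likelihood of the Nystrom model minus a trace
# term penalising the approximation error K - K_hat.
lower = log_mvn(y, K_hat + sigma2 * np.eye(N)) - np.trace(K - K_hat) / (2.0 * sigma2)
```

For any choice of inducing points, `lower` never exceeds `exact`; the trace term is what makes the bound valid rather than merely approximate.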

### 3.2 A cheap and tight upper bound on the log marginal likelihood

Fixing the hyperparameters to be those tuned by VAR, we seek a cheap upper bound on the exact marginal likelihood. Upper bounds and lower bounds are qualitatively different, and in general an upper bound is more difficult to obtain than a lower bound, for the following reason: the marginal likelihood is the integral of the likelihood with respect to the prior density over parameters. Hence to obtain a lower bound it suffices to exhibit regions in the parameter space giving high likelihood. To obtain an upper bound, however, one must demonstrate the absence of likelihood mass outside a certain region, an arguably more difficult task. There has been some work on the subject [5, 17], but to the best of our knowledge there has not been any work on cheap upper bounds to the marginal likelihood in large-N settings. So finding an upper bound from the perspective of the marginal likelihood can be difficult. Instead, we exploit the fact that the GP marginal likelihood has an analytic form, and treat it as a function of the kernel matrix K. The GP log marginal likelihood is composed of two terms and a constant:

$$\log p(y) \;=\; \log\mathcal{N}\!\left(y \,\middle|\, 0, K+\sigma^2 I\right) \;=\; -\tfrac{1}{2}\log\det\!\left(K+\sigma^2 I\right) \;-\; \tfrac{1}{2}y^\top\!\left(K+\sigma^2 I\right)^{-1}y \;-\; \tfrac{N}{2}\log(2\pi) \qquad (2)$$
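Equation (2) is the quantity everything else bounds, so it is worth having a stable reference implementation. The Cholesky-based routine below is a standard computation (not specific to this paper); the comments mark the two terms that the following subsections bound separately:

```python
import numpy as np

def gp_log_marginal(y, K, sigma2):
    """Exact GP log marginal likelihood of Eq. (2), computed via a
    Cholesky factorisation in O(N^3): NLD term + NIP term + constant."""
    N = len(y)
    L = np.linalg.cholesky(K + sigma2 * np.eye(N))
    nld = -np.sum(np.log(np.diag(L)))      # -1/2 log det(K + sigma2 I)
    v = np.linalg.solve(L, y)
    nip = -0.5 * v @ v                     # -1/2 y^T (K + sigma2 I)^{-1} y
    return nld + nip - 0.5 * N * np.log(2.0 * np.pi)
```

This O(N³) cost per evaluation (and per gradient step during ML-II) is exactly what makes CKS infeasible beyond small N.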

We give separate upper bounds on the negative log determinant (NLD) term and the negative inner product (NIP) term. For NLD, it has been proven that

$$-\tfrac{1}{2}\log\det\!\left(K+\sigma^2 I\right) \;\le\; -\tfrac{1}{2}\log\det\!\left(\hat{K}+\sigma^2 I\right) \qquad (3)$$

a consequence of K − K̂ being a Schur complement of the positive semi-definite joint Gram matrix over the inducing points and the data, and hence itself positive semi-definite (e.g. [3]). So the Nyström approximation plugged into the NLD term serves as an upper bound that can be computed in O(m²N) time (see Appendix D).
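The inequality in Eq. (3) can be checked numerically on a toy problem; the SE kernel, random inducing subset and jitter below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
N, m, sigma2 = 150, 15, 0.05
x = np.sort(rng.uniform(0.0, 5.0, N))
Z = x[rng.choice(N, size=m, replace=False)]

k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)
K = k(x, x)
K_hat = k(x, Z) @ np.linalg.solve(k(Z, Z) + 1e-6 * np.eye(m), k(x, Z).T)

# Since K - K_hat is positive semi-definite, replacing K by K_hat can
# only decrease the log determinant, so the negated quantity below is
# an upper bound on the NLD term, as in Eq. (3).
nld_exact = -0.5 * np.linalg.slogdet(K + sigma2 * np.eye(N))[1]
nld_upper = -0.5 * np.linalg.slogdet(K_hat + sigma2 * np.eye(N))[1]
```

Because K̂ + σ²I is low rank plus diagonal, `nld_upper` is the term that admits the O(m²N) evaluation discussed above.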

As for NIP, we point out that $-\tfrac{1}{2}y^\top (K+\sigma^2 I)^{-1}y$ is the optimal value of the objective function in kernel ridge regression, where the function is optimised over the Reproducing Kernel Hilbert Space associated with the kernel (e.g. [23]). The dual problem, whose objective function has the same optimal value, is $\min_{\alpha} \tfrac{1}{2}\alpha^\top (K+\sigma^2 I)\alpha - \alpha^\top y$. So we have the following upper bound for any $\alpha$:

$$-\tfrac{1}{2}y^\top\!\left(K+\sigma^2 I\right)^{-1}y \;\le\; \tfrac{1}{2}\alpha^\top\!\left(K+\sigma^2 I\right)\alpha \;-\; \alpha^\top y \qquad (4)$$

Note that the right-hand side is also in the form of an objective for conjugate gradients (CG) [33], hence equality is obtained at the optimal value α* = (K + σ²I)⁻¹y. We can approach the optimum for a tighter bound by running CG or preconditioned CG (PCG) for a fixed number of iterations to obtain a reasonable approximation to α*. Each iteration of CG, as well as the evaluation of the upper bound itself, takes O(N²) time, but PCG is very fast even for large data sets, and using FIC/PIC as the preconditioner gives the fastest convergence in general [8].
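A minimal sketch of this bound, using plain (unpreconditioned) CG on toy data; the SE kernel, data and iteration budget are illustrative assumptions, whereas the paper itself prefers PCG with an FIC/PIC preconditioner:

```python
import numpy as np

rng = np.random.default_rng(3)
N, sigma2 = 200, 0.1
x = np.sort(rng.uniform(0.0, 10.0, N))
y = np.sin(x) + 0.3 * rng.standard_normal(N)
A = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2) + sigma2 * np.eye(N)

def nip_upper(alpha):
    # Right-hand side of Eq. (4): a valid upper bound on NIP for ANY
    # alpha, tight at the optimum alpha* = (K + sigma2 I)^{-1} y.
    return 0.5 * alpha @ A @ alpha - alpha @ y

# Plain conjugate gradients on A alpha = y; each iteration costs one
# matrix-vector product, i.e. O(N^2).
alpha = np.zeros(N)
r = y.copy()
p = r.copy()
rs = r @ r
for _ in range(25):
    if rs < 1e-12:          # residual negligible: converged
        break
    Ap = A @ p
    step = rs / (p @ Ap)
    alpha += step * p
    r -= step * Ap
    rs_new = r @ r
    p = r + (rs_new / rs) * p
    rs = rs_new

nip_exact = -0.5 * y @ np.linalg.solve(A, y)
```

Since CG monotonically decreases this quadratic objective from its value at α = 0, stopping after any fixed number of iterations still leaves a valid (if looser) upper bound.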

Recall that although the lower bound takes O(m²N) time to compute, we need m = O(N^c) for accurate approximations, where c depends on the data distribution and the kernel [28]. In practice c is usually close to 0.5, hence the lower bound is also effectively O(N²). The upper bound evaluation appears slightly more expensive than a single lower bound evaluation, but we only need to compute the upper bound once, whereas we must evaluate the lower bound and its gradients many times during the hyperparameter optimisation. We later confirm in Section 4.1 that the upper bound is fast to compute relative to the lower bound optimisation. We also show empirically that the upper bound is tighter than the lower bound in Section 4.1, and give the following sufficient condition for this to be true:

###### Proposition 1.

Suppose $\lambda_1 \ge \dots \ge \lambda_N$ are the eigenvalues of $K$ in descending order. Then if (P)CG for the NIP term converges, and the noise variance $\sigma^2$ dominates the tail eigenvalue sum $\sum_{i=m+1}^{N}\lambda_i$, the upper bound is tighter than the lower bound.

Since the tail sum $\sum_{i=m+1}^{N}\lambda_i$ shrinks to zero as $m$ grows, the assumption is feasible. The proof is in Appendix E.

We later show in Section 4 that the upper bound is not only tighter than the lower bound, but also much less sensitive to the choice of inducing points. Hence we use the upper bound to choose between kernels whose BIC intervals overlap.


### 3.3 SKC: Scalable Kernel Composition using the lower and upper bound

We base our algorithm on two intuitive claims. First, the lower and upper bounds converge to the true log marginal likelihood for fixed hyperparameters as the number of inducing points m increases. Second, the hyperparameters optimised by VAR converge to those obtained by optimising the exact log marginal likelihood as m increases. The former is confirmed in Figure 2 as well as in other works (e.g. [4] for the lower bound), and the latter is confirmed in Appendix F.

The algorithm proceeds as follows: for each base kernel and a fixed value of m, we compute the lower and upper bounds to obtain an interval for the GP marginal likelihood, and hence for the BIC of the kernel, with its hyperparameters optimised by VAR. We rank these kernels by their intervals, using the upper bound as a tie-breaker for kernels with overlapping intervals. We then perform a semi-greedy kernel search, expanding the search tree on all (or some, controlled by the buffer size B) kernels whose intervals overlap with that of the top kernel at the current depth. We recurse to the next depth by computing intervals for these child kernels (parent kernel +/× base kernel), ranking them and further expanding the search tree. This is summarised in Algorithm 1, and Figure 12 in Appendix Q is a visualisation of the search tree for different values of B. Details on the optimisation and initialisation are given in Appendices H and I.
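The control flow of this semi-greedy search can be sketched structurally. Everything below is a simplified sketch: kernels are represented as strings, and `score_interval` is a stand-in for the real scoring step (VAR hyperparameter optimisation followed by the lower/upper bound computation), which is far too heavy to inline here:

```python
BASE_KERNELS = ["SE", "LIN", "PER"]

def expand(kernel):
    """Child kernels of the search: parent + base and parent * base."""
    return [f"({kernel}+{b})" for b in BASE_KERNELS] + \
           [f"({kernel}*{b})" for b in BASE_KERNELS]

def skc_search(score_interval, max_depth, buffer_size):
    """Semi-greedy search skeleton. `score_interval(k)` stands in for
    the real scoring: a (lower, upper) interval on the kernel's BIC."""
    frontier = list(BASE_KERNELS)
    best = None
    for _ in range(max_depth):
        # rank by the (tighter, more reliable) upper bound
        ranked = sorted(frontier, key=lambda k: score_interval(k)[1], reverse=True)
        top_lower = score_interval(ranked[0])[0]
        # expand every kernel whose interval overlaps the top kernel's,
        # up to the buffer size
        keep = [k for k in ranked if score_interval(k)[1] >= top_lower][:buffer_size]
        best = keep[0]
        frontier = [child for k in keep for child in expand(k)]
    return best
```

Setting `buffer_size = 1` recovers a purely greedy search; larger buffers trade extra (parallelisable) bound computations for robustness to loose intervals.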

The hyperparameters found by VAR may not be the global maximisers of the exact GP marginal likelihood, but as in ABCD we can optimise with multiple random initialisations and choose the best local optimum. One may still question whether the hyperparameter values found by VAR agree with the structure in the data, such as period values and slopes of linear trends. We show in Sections 4.2 and 4.3 that a small m suffices for this to be the case.

**Choice of m** We can guarantee that the lower bound increases with larger m, but cannot guarantee that the upper bound decreases, since we tend to obtain better hyperparameters for higher m, which boost the marginal likelihood and hence the upper bound. We verify this in Section 4.1. Hence throughout SKC we fix m to the largest value that one can afford, so that the marginal likelihood with hyperparameters optimised by VAR is as close as possible to the marginal likelihood with optimal hyperparameters. It is natural to wonder whether an adaptive choice of m is possible, using a higher m for more promising kernels to tighten their bounds. However a fair comparison of different kernels via these lower and upper bounds requires that they share the same value of m, since a higher m is guaranteed to give a higher lower bound.

**Parallelisability** Note that SKC is extremely parallelisable across different random initialisations and different kernels at each depth, as is CKS. In fact the presence of intervals in SKC, and hence buffers of kernels, allows further parallelisation over kernels of different depths in certain cases (see Appendix G).

## 4 Experiments

### 4.1 Investigating the behaviour of the lower bound (LB) and upper bound (UB)

We present results for experiments showing the bounds we obtain for two small time series and a multidimensional regression data set, for which CKS is feasible. The first is the annual solar irradiance data from 1610 to 2011, with 402 observations [18]. The second is the Mauna Loa CO2 time series [36] with 689 observations. See Appendix K for plots of the time series. The multidimensional data set is the concrete compressive strength data set with 1030 observations and 8 covariates [44]. The functional forms of the kernels used for each of these data sets were found by CKS (see Figure 3). All observations and covariates have been normalised to have mean 0 and variance 1.

From the left-hand plots for each data set (the latter two can be found in Appendix Q), we see that VAR gives a LB on the optimal log marginal likelihood that improves with increasing m. The best LB is tight relative to the exact log marginal likelihood at the hyperparameters optimised by VAR. We also see that the UB is even tighter than the LB, and increases with m as hypothesised. From the middle plots, we observe that the Nyström approximation gives a very tight UB on the NLD term. We also tried using random Fourier features (RFF) to obtain a stochastic UB (see Appendix P), but this is not as tight, especially for larger values of m. From the right-hand plots, we see that PCG with any of the three preconditioners (Nyström, FIC, PIC) gives a very tight UB on the NIP term, whereas CG may require more iterations to become tight (see the examples in Appendix Q).

Figure 2 shows further experimental evidence for a wider range of kernels reinforcing the claim that the UB is tighter than the LB, as well as being much less sensitive to the choice of inducing points. This is why the UB is a much more reliable metric for the model selection than the LB. Hence we use the UB for a one-off comparison of kernels when their BIC intervals overlap.

Comparing the (a) subfigures against the (b) subfigures, learning the inducing points does not lead to a vast improvement in the VAR LB. In fact the differences are not very significant, and sometimes learning the inducing points can get the LB stuck in a bad local optimum, as indicated by the high variance of the LB in the latter figures. Moreover the difference in computation time is significant, as we can see in Table 2 of Appendix J. Hence the computation-accuracy trade-off is best when fixing the inducing points to a random subset of the training data.

Table 2 also compares times for the different computations after fixing the inducing points. The gain from using the variational LB instead of the full GP is clear, especially for the larger data sets, and we also confirm that the optimisation of the LB is indeed the bottleneck in terms of computational cost. We also see that the NIP UB computation times are similarly fast for all m, thus convergence of PCG with the PIC preconditioner occurs in only a few iterations.

### 4.2 SKC on small data sets

We compare the kernels chosen by CKS and by SKC for the three data sets. The results are summarised in Figure 3. For solar, SKC successfully finds SE × PER, which is the second-highest kernel for CKS, with BIC very close to the top kernel. For mauna, SKC selects (SE + PER) × SE + LIN, which is third-highest for CKS, again with BIC very close to the top kernel. Looking at the values of the hyperparameters of the PER and LIN kernels found by SKC, 40 inducing points are sufficient to successfully find the correct periods and slopes in the data, reinforcing the claim that we only need a small m to find good hyperparameters (see Appendix K for details). For concrete, a more challenging eight-dimensional data set, the kernels selected by SKC do not match those selected by CKS, but SKC still manages to find similar additive structure such as SE1 + SE8 and SE4. Of course, the BIC intervals for kernels found by SKC are for hyperparameters found by VAR with small m, and hence do not necessarily contain the optimal BIC of kernels in CKS. However, the above results show that our method is still capable of selecting appropriate kernels even for low values of m, without having to home in on the optimal BIC using high values of m. The savings in computation time are significant even for these small data sets, as shown in Table 2.

### 4.3 SKC on medium-sized data sets & Why the lower bound is not enough

The UB has marginal benefits over the LB for small data sets, where the LB is already a fairly good estimate of the true BIC. However, this gap becomes significant as N grows and as kernels become more complex; the UB is much tighter and more stable with respect to the choice of inducing points (as shown in Figure 2), and plays a crucial role in the model selection.

**Power Plant** We first show this for SKC on the Power Plant data with 9568 observations and 4 covariates [38]. We see from Figure 4 that again the UB is much tighter than the LB, especially for the more complex kernels further down the search tree. Comparing the two plots, we also see that the use of the UB in SKC has a significant positive impact on model selection: the kernel found by SKC at depth 5 has exact BIC 1469.6, whereas the corresponding kernel found by just using the LB (SKC-LB) has exact BIC 745.5. The pathological behaviour of SKC-LB occurs at depth 3, where the (SE1+SE2)*PER3 kernel found by SKC-LB has a higher LB than the corresponding SE1+SE2+SE2 kernel found by SKC. SKC-LB chooses the former kernel since it has a higher LB, a failure mode since the latter kernel has a significantly higher exact BIC. Due to the tightness of the UB, SKC correctly chooses the latter kernel over the former, escaping the failure mode.

**CKS vs SKC runtime** We run CKS and SKC on the Power Plant data on the same machine with the same hyperparameter settings. The runtimes up to depths 1/2 are: 28.7h/94.7h for CKS, against 0.6h/2.6h, 1.1h/4.1h and 4.2h/15.0h for SKC with increasing values of m. Again, the reductions in computation time for SKC are substantial.

Time Series We also implement SKC on medium-sized time series data to explore whether it can successfully find known structure at different scales. Such data sets occur frequently for hourly time series over several years or daily time series over centuries.

**GEFCOM** First we use the energy load data set from the 2012 Global Energy Forecasting Competition [16], an hourly time series of energy load from 2004/01/01 to 2008/06/30 with 8 weeks of missing data, giving N in the tens of thousands. Notice that this data size is far beyond the scope of full GP optimisation in CKS.

From the plots, we can see that there is a noisy 6-month periodicity in the time series, as well as a clear daily periodicity with peaks in the morning and evening. Despite the periodicity, there are some noticeable irregularities in the daily pattern.

The SE1 × PER1 + SE2 × (PER2 + LIN) kernel found by SKC with m = 160 and its hyperparameters are summarised in the first two columns of Table 1. Note that with only 160 inducing points, SKC has successfully found the daily periodicity (PER1) and the 6-month periodicity (PER2). The SE1 kernel in the first additive term suggests a local periodicity, exhibiting the irregular observation pattern that is repeated daily. Also note that the hyperparameter corresponding to the longer periodicity is close to, but not exactly, half a year, owing to the noisiness of the periodicity that is apparent in the data. Moreover the magnitude of the second additive term SE2 × PER2 that contains the longer periodicity is 0.18 × 0.06, which is much smaller than the magnitude 0.44 × 1.10 of the first additive term SE1 × PER1. This explicitly shows that the second term has less effect than the first on the behaviour of the time series, hence the weaker long-term periodicity.

Running the kernel search using just the LB with the same hyperparameter settings as SKC, the algorithm is only able to detect the daily periodicity and misses the 6-month periodicity. This is consistent with the observation that the LB is loose compared to the UB, especially for bigger data sets, and hence fails to evaluate the kernels correctly. We also tried selecting random subsets of the data of various sizes and running CKS. Even for large subsets, the algorithm could not find the daily periodicity, and none of the four subset sizes were able to make it past depth 4 in the kernel search, struggling to find sophisticated structure.

**Tidal** We also run SKC on tidal data, namely the sea level measurements in Dover from 2013/01/03 to 2016/08/31 [1]. This is an hourly time series, with each observation taken to be the mean of four readings taken every 15 minutes, again giving N in the tens of thousands.

Looking at the bottom of Figure 6, we can find clear 12-hour periodicities, with amplitudes that follow slightly noisy bi-weekly periodicities. The shorter periods, called semi-diurnal tides, are on average 12 hours 25 minutes (about 0.518 days) long, and the bi-weekly periods arise due to gravitational effects of the moon [22].

The SE × PER1 × LIN × PER2 × PER3 kernel found by SKC and its hyperparameters are summarised in the right two columns of Table 1. The PER1 × PER3 component precisely corresponds to the doubly periodic structure in the data, whereby we have semi-daily periodicities with bi-weekly periodic amplitudes. The SE kernel has a length scale of 82.5 days, giving a local periodicity. This represents the irregularity of the amplitudes, as can be seen on the bottom plot of Figure 6 between days 20 and 40. The LIN kernel has a large magnitude, but its effect is negligible when multiplied in, since the computed slope is negligibly small (see Appendix K for the slope calculation formula); it essentially has the role of raising the magnitude of the resulting kernel and hence representing noise in the data. The PER2 kernel is also negligible due to its high length scale and small magnitude, indicating that the amplitude of the periodicity due to this term is very small. Hence the term has minimal effect on the kernel.

The kernel search using just the LB fails to proceed past depth 3, since it is unable to find a kernel with a higher LB on the BIC than at the previous depth (the extra penalty incurred by increasing model complexity outweighs the increase in the LB of the log marginal likelihood). CKS on a random subset of the data similarly fails, with the kernel search halting at depth 2 even for large random subsets.

In both cases, SKC is able to detect the structure in the data and provide accurate numerical estimates of its features, all with far fewer inducing points than data points, whereas the kernel search using just the LB and CKS on a random subset of the data both fail.

## 5 Conclusion and Discussion

We have introduced SKC, a scalable kernel discovery algorithm that extends CKS, and hence ABCD, to bigger data sets. We have also derived a novel cheap upper bound on the GP marginal likelihood that, together with the variational lower bound [37], sandwiches the marginal likelihood, and we use this interval in SKC for selecting between different kernels. The reasons for using an upper bound instead of just the lower bound for model selection are as follows: the upper bound enables a semi-greedy approach, allowing us to explore a wider range of kernels and compensating for the suboptimality arising from local optima in the hyperparameter optimisation; and should we wish to restrict the range of kernels explored for computational efficiency, the upper bound is tighter and more stable than the lower bound, and hence serves as a reliable tie-breaker for kernels with overlapping intervals. Equipped with this upper bound, our method can pinpoint global/local periodicities and linear trends in time series with tens of thousands of data points, well beyond the reach of its predecessor CKS.

For future work we wish to make the algorithm even more scalable: for large data sets where quadratic runtime is infeasible, we can apply stochastic variational inference for GPs [15] to optimise the lower bound, the bottleneck of SKC, using mini-batches of data. Finding an upper bound that is cheap yet tight enough for model selection in this regime remains a challenge. One drawback of the upper bound that could perhaps be resolved is that hyperparameter tuning by optimising the upper bound is difficult (see Appendix L). Another minor avenue for future work is using more accurate estimates of the model evidence than BIC. A related paper uses the Laplace approximation instead of BIC for kernel evaluation in a similar kernel search context [20]; however, the Laplace approximation adds an expensive Hessian term, for which it is unclear how one can obtain lower and upper bounds.

## Acknowledgements

HK and YWT’s research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013) ERC grant agreement no. 617071. We would also like to thank Michalis Titsias for helpful discussions.

## References

• [1] British Oceanographic Data Centre, UK Tide Gauge Network.
• [2] Francis R. Bach, Gert R. G. Lanckriet, and Michael I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML, 2004.
• [3] Rémy Bardenet and Michalis Titsias. Inference for determinantal point processes without spectral knowledge. In NIPS, 2015.
• [4] Matthias Stephan Bauer, Mark van der Wilk, and Carl Edward Rasmussen. Understanding probabilistic sparse Gaussian process approximations. In NIPS, 2016.
• [5] Matthew James Beal. Variational algorithms for approximate Bayesian inference. PhD thesis, University College London, 2003.
• [6] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.
• [7] Corinna Cortes, Mehryar Mohri, and Ameet Talwalkar. On the impact of kernel approximation on learning accuracy. In AISTATS, 2010.
• [8] Kurt Cutajar, Michael A. Osborne, John P. Cunningham, and Maurizio Filippone. Preconditioning kernel matrices. ICML, 2016.
• [9] Petros Drineas and Michael Mahoney. On the nyström method for approximating a gram matrix for improved kernel-based learning. JMLR, 6:2153–2175, 2005.
• [10] David Duvenaud, James Lloyd, Roger Grosse, Joshua Tenenbaum, and Ghahramani Zoubin. Structure discovery in nonparametric regression through compositional kernel search. In ICML, 2013.
• [11] Chris Fraley and Adrian E Raftery. Bayesian regularization for normal mixture estimation and model-based clustering. Journal of classification, 24(2):155–181, 2007.
• [12] Jacob Gardner, Chuan Guo, Kilian Weinberger, Roman Garnett, and Roger Grosse. Discovering and exploiting additive structure for bayesian optimization. In AISTATS, 2017.
• [13] Roger B. Grosse, Ruslan Salakhutdinov, William T. Freeman, and Joshua B. Tenenbaum. Exploiting compositionality to explore a large space of model structures. In UAI, 2012.
• [14] Isabelle Guyon, Imad Chaabane, Hugo Jair Escalante, Sergio Escalera, Damir Jajetic, James Robert Lloyd, Núria Macià, Bisakha Ray, Lukasz Romaszko, Michèle Sebag, Alexander Statnikov, Sébastien Treguer, and Evelyne Viegas. A brief review of the chalearn automl challenge: Any-time any-dataset learning without human intervention. In Proceedings of the Workshop on Automatic Machine Learning, volume 64, pages 21–30, 2016.
• [15] James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian processes for big data. UAI, 2013.
• [16] Tao Hong, Pierre Pinson, and Shu Fan. Global energy forecasting competition 2012, 2014.
• [17] Chunlin Ji, Haige Shen, and Mike West. Bounded approximations for marginal likelihoods. 2010.
• [18] Judith Lean, Juerg Beer, and Raymond S Bradley. Reconstruction of solar irradiance since 1610: Implications for climate change. Geophysical Research Letters, 22(23), 1995.
• [19] James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua Tenenbaum, and Zoubin Ghahramani. Automatic construction and natural-language description of nonparametric regression models. In AAAI, 2014.
• [20] Gustavo Malkomes, Charles Schaff, and Roman Garnett. Bayesian optimization for automated model selection. In NIPS, 2016.
• [21] Alexander G de G Matthews, James Hensman, Richard E Turner, and Zoubin Ghahramani. On sparse variational methods and the kullback-leibler divergence between stochastic processes. AISTATS, 2016.
• [22] John Morrissey, James L Sumich, and Deanna R. Pinkard-Meier. Introduction To The Biology Of Marine Life. Jones & Bartlett Learning, 1996.
• [23] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
• [24] Junier B Oliva, Avinava Dubey, Andrew G Wilson, Barnabás Póczos, Jeff Schneider, and Eric P Xing. Bayesian nonparametric kernel-learning. In AISTATS, 2016.
• [25] Joaquin Quinonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate gaussian process regression. JMLR, 6:1939–1959, 2005.
• [26] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In NIPS, 2007.
• [27] Carl Edward Rasmussen and Chris Williams. Gaussian processes for machine learning. MIT Press, 2006.
• [28] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. In NIPS, 2015.
• [29] Walter Rudin. Fourier analysis on groups. AMS, 1964.
• [30] Yves-Laurent Kom Samo and Stephen Roberts. Generalized spectral kernels. arXiv preprint arXiv:1506.02236, 2015.
• [31] Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
• [32] Matthias Seeger, Christopher Williams, and Neil Lawrence. Fast forward selection to speed up sparse gaussian process regression. In AISTATS, 2003.
• [33] Jonathan Richard Shewchuk. An introduction to the conjugate gradient method without the agonizing pain, 1994.
• [34] Edward Snelson and Zoubin Ghahramani. Sparse gaussian processes using pseudo-inputs. In NIPS, 2005.
• [35] Arno Solin and Simo Särkkä. Explicit link between periodic covariance functions and state space models. JMLR, 2014.
• [36] Pieter Tans and Ralph Keeling. NOAA/ESRL, Scripps Institution of Oceanography.
• [37] Michalis K Titsias. Variational learning of inducing variables in sparse gaussian processes. In AISTATS, 2009.
• [38] Pınar Tüfekci. Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. International Journal of Electrical Power & Energy Systems, 60:126–140, 2014.
• [39] Jarno Vanhatalo, Jaakko Riihimäki, Jouni Hartikainen, Pasi Jylänki, Ville Tolvanen, and Aki Vehtari. Gpstuff: Bayesian modeling with gaussian processes. JMLR, 14(Apr):1175–1179, 2013.
• [40] Christopher Williams and Matthias Seeger. Using the nyström method to speed up kernel machines. In NIPS, 2001.
• [41] Andrew Gordon Wilson and Ryan Prescott Adams. Gaussian process kernels for pattern discovery and extrapolation. arXiv preprint arXiv:1302.4245, 2013.
• [42] Andrew Gordon Wilson, Christoph Dann, and Hannes Nickisch. Thoughts on massively scalable Gaussian processes. arXiv preprint arXiv:1511.01870, 2015. http://arxiv.org/abs/1511.01870.
• [43] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In AISTATS, 2016.
• [44] I-C Yeh. Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research, 28(12):1797–1808, 1998.

## Appendix A Bayesian Information Criterion (BIC)

The BIC is a model selection criterion consisting of the log marginal likelihood minus a model complexity penalty:

 BIC = log p(y | θ̂) − (1/2) p log N

for observations y, number of observations N, maximum likelihood estimate (MLE) θ̂ of the model hyperparameters, and number of hyperparameters p. It is derived as an approximation to the log model evidence log p(y).
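As a one-line sketch of the criterion as defined above (the function name is illustrative; `log_ml` stands for log p(y | θ̂)):

```python
import math

def bic(log_ml, p, N):
    """BIC = log p(y | theta_hat) - (1/2) * p * log N.

    Under this sign convention, higher is better: a penalised
    log marginal likelihood.
    """
    return log_ml - 0.5 * p * math.log(N)

# A slightly better-fitting but more complex kernel can lose after the penalty:
simple = bic(-120.0, 3, 1000)
complex_ = bic(-119.0, 8, 1000)
assert simple > complex_
```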


## Appendix C Base Kernels

 LIN(x, x′) = σ²(x − l)(x′ − l)
 SE(x, x′) = σ² exp(−(x − x′)²/(2l²))
 PER(x, x′) = σ² exp(−2 sin²(π(x − x′)/p)/l²)
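A minimal NumPy sketch of these base kernels in one dimension (hyperparameter names follow the definitions above):

```python
import numpy as np

def lin(x, xp, sigma2=1.0, l=0.0):
    """LIN(x, x') = sigma^2 (x - l)(x' - l); l is the offset."""
    return sigma2 * (x - l) * (xp - l)

def se(x, xp, sigma2=1.0, l=1.0):
    """SE(x, x') = sigma^2 exp(-(x - x')^2 / (2 l^2))."""
    return sigma2 * np.exp(-((x - xp) ** 2) / (2 * l ** 2))

def per(x, xp, sigma2=1.0, l=1.0, p=1.0):
    """PER(x, x') = sigma^2 exp(-2 sin^2(pi (x - x') / p) / l^2)."""
    return sigma2 * np.exp(-2 * np.sin(np.pi * (x - xp) / p) ** 2 / l ** 2)

# The kernel search composes these by sums and products,
# e.g. a locally periodic kernel:
locally_periodic = lambda x, xp: se(x, xp, l=10.0) * per(x, xp)
```

Sums model superposed structure, while products let one kernel modulate another.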

## Appendix D Matrix Identities

###### Lemma 1 (Woodbury’s Matrix Inversion Lemma).

 (A + UCV)⁻¹ = A⁻¹ − A⁻¹U(C⁻¹ + VA⁻¹U)⁻¹VA⁻¹

So setting U = Φ⊤, C = I, V = Φ, and A = σ²I (Nyström) or σ²I plus a diagonal correction (FIC) or a block-diagonal correction (PIC), we get:

 (A + Φ⊤Φ)⁻¹ = A⁻¹ − A⁻¹Φ⊤(I + ΦA⁻¹Φ⊤)⁻¹ΦA⁻¹

so inverting the N × N matrix only requires inverting an m × m matrix.
###### Lemma 2 (Sylvester’s Determinant Theorem).

 det(I + AB) = det(I + BA)

for any matrices A, B of conformable sizes. Hence:

 det(σ²I + Φ⊤Φ) = (σ²)^n det(I + σ⁻²Φ⊤Φ)
               = (σ²)^n det(I + σ⁻²ΦΦ⊤)
               = (σ²)^(n−m) det(σ²I + ΦΦ⊤)
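Both identities are easy to check numerically; a minimal NumPy sketch with A = σ²I and a rank-m feature matrix Φ:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma2 = 50, 5, 0.1
Phi = rng.standard_normal((m, n))  # m x n, so Phi.T @ Phi is n x n with rank m

# Woodbury: invert the n x n matrix via an m x m inverse.
lhs = np.linalg.inv(sigma2 * np.eye(n) + Phi.T @ Phi)
Ainv = np.eye(n) / sigma2
rhs = Ainv - Ainv @ Phi.T @ np.linalg.inv(np.eye(m) + Phi @ Ainv @ Phi.T) @ Phi @ Ainv
assert np.allclose(lhs, rhs)

# Sylvester: det(s I_n + Phi^T Phi) = s^(n-m) det(s I_m + Phi Phi^T), s = sigma^2;
# compare log-determinants for numerical stability.
logdet_n = np.linalg.slogdet(sigma2 * np.eye(n) + Phi.T @ Phi)[1]
logdet_m = np.linalg.slogdet(sigma2 * np.eye(m) + Phi @ Phi.T)[1]
assert np.isclose(logdet_n, (n - m) * np.log(sigma2) + logdet_m)
```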

## Appendix E Proof of Proposition 1

###### Proof.

If PCG converges, the upper bound for NIP is exact; we showed in Section 4.1 that convergence occurs in only a few iterations. Moreover, Cortes et al. [7] show that the lower bound for NIP can be rather loose in general.

So it suffices to prove that the upper bound for NLD is tighter than the lower bound for NLD. Let λ₁ ≥ … ≥ λ_N and λ̂₁ ≥ … ≥ λ̂_N be the ordered eigenvalues of K + σ²I and K̂ + σ²I respectively. Since K − K̂ is positive semi-definite (e.g. [3]), we have λᵢ ≥ λ̂ᵢ, and λ̂ᵢ ≥ 2σ² (using the assumption in the proposition). Now the slack in the upper bound is:

 −(1/2) log det(K̂ + σ²I) − (−(1/2) log det(K + σ²I)) = (1/2) Σ_{i=1}^{N} (log λᵢ − log λ̂ᵢ)

Hence the slack in the lower bound is:

 −(1/2) log det(K + σ²I) − [−(1/2) log det(K̂ + σ²I) − (1/(2σ²)) Tr(K − K̂)]
 = −(1/2) Σ_{i=1}^{N} (log λᵢ − log λ̂ᵢ) + (1/(2σ²)) Σ_{i=1}^{N} (λᵢ − λ̂ᵢ)

Now by concavity and monotonicity of log, and since λᵢ ≥ λ̂ᵢ ≥ 2σ², we have:

 (log λᵢ − log λ̂ᵢ)/(λᵢ − λ̂ᵢ) ≤ 1/λ̂ᵢ ≤ 1/(2σ²)
 ⇒ Σ_{i=1}^{N} (log λᵢ − log λ̂ᵢ) ≤ (1/(2σ²)) Σ_{i=1}^{N} (λᵢ − λ̂ᵢ)
 ⇒ (1/2) Σ_{i=1}^{N} (log λᵢ − log λ̂ᵢ) ≤ (1/(2σ²)) Σ_{i=1}^{N} (λᵢ − λ̂ᵢ) − (1/2) Σ_{i=1}^{N} (log λᵢ − log λ̂ᵢ)

i.e. the slack in the upper bound is at most the slack in the lower bound. ∎

## Appendix F Convergence of hyperparameters from optimising lower bound to optimal hyperparameters

Note from Figure 7 that the hyperparameters found by optimising the lower bound converge to the hyperparameters found by the exact GP when optimising the exact marginal likelihood, giving empirical evidence for the second claim in Section 3.3.

## Appendix G Parallelising SKC

Note that SKC can be parallelised across the random hyperparameter initialisations, and also across the kernels at each depth when computing the BIC intervals. In fact, SKC is even more parallelisable with the kernel buffer: suppose that at a certain depth, two kernels remain to be optimised and evaluated before we can move on to the next depth. If the buffer size is, say, 5, then we can in fact move on to the next depth and grow the kernel search tree on the top 3 kernels of the buffer, without having to wait for the 2 remaining kernel evaluations to complete. This saves the computation time that would otherwise be wasted by idle cores waiting for all kernel evaluations to finish before moving on to the next depth of the kernel search tree.

## Appendix H Optimisation

Since we wish to use the learned kernels for interpretation, it is important that the hyperparameters lie in a sensible region after optimisation. In other words, we wish to regularise the hyperparameters during optimisation. For example, we want the SE kernel to learn a globally smooth function with local variation. When naïvely optimising the lower bound, the length scale and the signal variance sometimes become very small, so the SE kernel explains all the variation in the signal and ends up connecting the dots. We wish to avoid this type of behaviour. This can be achieved by giving priors to the hyperparameters and optimising the energy (the log prior added to the log marginal likelihood) instead, as well as by using sensible initialisations. Looking at Figure 8, we see that using a strong prior with a sensible random initialisation (see Appendix I for details) gives a sensible smoothly varying function, whereas in all three other cases the length scale and signal variance shrink to small values, causing the GP to overfit the data. Note that the weak prior is the default prior used in the GPstuff software [39].

Careful initialisation of the hyperparameters and inducing points is also very important, and can have a strong influence on the resulting optima. It is sensible to have the optimised hyperparameters of the parent kernel in the search tree inherited and used to initialise the hyperparameters of the child. The new hyperparameters of the child are initialised with random restarts, where the variance is small enough to ensure that they lie in a sensible region, but large enough to explore a good portion of this region. As for the inducing points, we want to spread them out to capture both local and global structure. Trying both K-means and a random subset of the training data, we found that they give similar results, and so resort to a random subset. We also have the option of learning the inducing points; however, this is considerably more costly and shows little improvement over fixing them, as we show in Section 4. Hence we do not learn the inducing points, but fix them to a given randomly chosen set.
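A minimal sketch of the two inducing-point initialisation strategies compared above (pure NumPy; function names are illustrative, not from the paper's code):

```python
import numpy as np

def inducing_points_random(X, m, seed=0):
    """Random subset of the training inputs (the option SKC settles on)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)
    return X[idx]

def inducing_points_kmeans(X, m, n_iter=20, seed=0):
    """Plain Lloyd's k-means on the inputs, as the alternative initialisation."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=m, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each point to its nearest centre, then recompute centres.
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        for k in range(m):
            if np.any(labels == k):
                centres[k] = X[labels == k].mean(axis=0)
    return centres

X = np.random.default_rng(42).standard_normal((100, 2))
Z_rand = inducing_points_random(X, 10)
Z_kmeans = inducing_points_kmeans(X, 10)
```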

Hence for SKC we use maximum a posteriori (MAP) estimates instead of the MLE of the hyperparameters to calculate the BIC, since the priors have a noticeable effect on the optimisation. This is justified [23] and has been used, for example, in [11], where the authors argue that using the MLE to estimate the BIC for Gaussian mixture models can fail due to singularities and degeneracies.

## Appendix I Hyperparameter initialisation and priors

Each prior is a Gaussian with mean 0 and a fixed variance, truncated to a sensible interval and then renormalised, except where noted; some hyperparameters instead have log Gaussian or Student's t priors. For the PER kernel, the period is constrained so that the shortest possible period is 10 time steps and the longest possible period is a fifth of the range of the data set. [The table of per-hyperparameter (signal noise, LIN, SE, PER) initialisation distributions and prior parameters is omitted.]

## Appendix J Computation times

Table 2 lists the computation times. For all three data sets, the exact GP optimisation time is much greater than the sum of the variational GP optimisation time and the upper bound (NLD + NIP) evaluation time. Hence the savings in computation time for SKC are significant even for these small data sets.

Note that we show the lower bound optimisation time against the upper bound evaluation time, rather than the evaluation times of both, since this is what happens in SKC: the lower bound has to be optimised for each kernel, whereas the upper bound only has to be evaluated once.

## Appendix K Mauna and Solar plots and hyperparameter values found by SKC

#### Solar

The solar data has 26 cycles over 285 years, giving a periodicity of around 10.9615 years. Using SKC with m = 40, we find the kernel: SE PER SE SE PER. The value of the period hyperparameter in PER is 10.9569 years, hence SKC finds the periodicity to 3 s.f. with only 40 inducing points. The SE term converts the global periodicity into a local periodicity, with the extent of the locality governed by the length scale parameter in SE, equal to 45. This is fairly large, but smaller than the range of the domain (1610-2011), indicating that the periodicity spans a long time but isn't quite global. This is most likely due to the static solar irradiance between the years 1650-1700, which adds a little noise to the periodicities.

#### Mauna

The annual periodicity in the data and the linear trend with positive slope are clear. Linear regression gives a slope of 1.5149. SKC gives the kernel: SE + PER + LIN. The period hyperparameter in PER takes the value 1, hence SKC successfully finds the right periodicity. The offset and magnitude parameters of LIN allow us to calculate the slope via a formula involving the noise variance of the learned likelihood; this formula is obtained from the posterior mean of the GP, which is linear in the inputs for the linear kernel. The resulting value is 1.5150, hence the slope found by SKC is accurate to 3 s.f.

## Appendix L Optimising the upper bound

If the upper bound is tighter and more robust with respect to the choice of inducing points, why not optimise the upper bound to find hyperparameters? If this were possible, we could maximise it to obtain an upper bound on the exact marginal likelihood with optimal hyperparameters. In fact this holds for any analytic upper bound whose value and gradients can be evaluated cheaply. Hence for any m, we could find an interval that contains the true optimised marginal likelihood. So if this interval is dominated by the interval of another kernel, we can discard the kernel, and there is no need to evaluate the bounds for bigger values of m. We would then wish to use values of m such that we can choose the right kernel (or kernels) at each depth of the search tree with minimal computation. This gives rise to an exploitation-exploration trade-off: we balance between raising m to obtain tight intervals that allow us to discard unsuitable kernels whose intervals fall strictly below those of other kernels, and quickly moving on to the next depth in the search tree to search for finer structure in the data. The search algorithm is highly parallelisable, so we may raise m simultaneously for all candidate kernels. At deeper levels of the search tree, there may be too many candidates for simultaneous computation, in which case we may select the ones with the highest upper bounds to obtain tighter intervals. Such attempts are listed below.
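The interval-domination rule described above can be sketched as follows (the dictionary interface and numbers are illustrative; `lower`/`upper` stand for the optimised lower bound and evaluated upper bound on the model selection criterion):

```python
def prune_dominated(intervals):
    """Discard kernels whose upper bound falls strictly below the best
    lower bound among candidates: their interval is then strictly
    dominated, so they can never be the best kernel.

    intervals: dict kernel_name -> (lower, upper) with lower <= upper.
    """
    best_lower = max(lo for lo, _ in intervals.values())
    return {k: (lo, up) for k, (lo, up) in intervals.items() if up >= best_lower}

surviving = prune_dominated({
    "SE":       (-150.0, -130.0),  # overlaps the best interval: kept
    "SE + PER": (-135.0, -120.0),  # best lower bound: -135
    "LIN":      (-170.0, -160.0),  # upper bound below -135: discarded
})
assert set(surviving) == {"SE", "SE + PER"}
```

For kernels that survive with overlapping intervals, one would raise m to tighten the bounds before deciding.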

Note the two inequalities for the NLD and NIP terms:

 −(1/2) log det(K̂ + σ²I) − (1/(2σ²)) Tr(K − K̂) ≤ −(1/2) log det(K + σ²I) ≤ −(1/2) log det(K̂ + σ²I)   (5)

 −(1/2) y⊤(K̂ + σ²I)⁻¹y ≤ −(1/2) y⊤(K + σ²I)⁻¹y ≤ −(1/2) y⊤(K + (σ² + Tr(K − K̂))I)⁻¹y   (6)

where the first two inequalities come from [3], the third inequality is a direct consequence of K − K̂ being positive semi-definite, and the last inequality is from Michalis Titsias’ lecture slides.

Also from (4), we have that

 −(1/2) log det(K̂ + σ²I) + (1/2) α⊤(K + σ²I)α − α⊤y

is an upper bound for any α ∈ ℝᴺ. Thus one idea for obtaining a cheap upper bound on the optimised marginal likelihood is to solve the following maximin optimisation problem:

 max_θ min_{α ∈ ℝᴺ} −(1/2) log det(K̂ + σ²I) + (1/2) α⊤(K + σ²I)α − α⊤y

One way to solve this cheaply would be by coordinate descent, where one maximises with respect to θ fixing α, then minimises with respect to α fixing θ. However, θ tends to blow up in practice. This is because the expression is only a tight bound at the minimising α for each fixed θ, hence maximising with respect to θ with α fixed pushes it towards infinity.

An alternative is to sum the two upper bounds above to get the upper bound

 −(1/2) log det(K̂ + σ²I) − (1/2) y⊤(K + (σ² + Tr(K − K̂))I)⁻¹y

However, we found that maximising this bound gives quite a loose upper bound unless Tr(K − K̂) is small. Hence this upper bound is not very useful.

## Appendix M Random Fourier Features

Random Fourier Features (RFF), a.k.a. Random Kitchen Sinks, were introduced by [26] as a low-rank approximation to the kernel matrix. They use the following theorem

###### Theorem 3 (Bochner’s Theorem [29]).

A stationary kernel k(d) is positive definite if and only if k(d) is the Fourier transform of a non-negative measure.

to give an unbiased low-rank approximation K ≈ Φ⊤Φ to the Gram matrix, with Φ an m × N feature matrix. A bigger m lowers the variance of the estimate. Using this approximation, one can compute determinants and inverses in O(Nm² + m³) time. In the context of kernel composition in Section 2, RFFs have the nice property that samples from the spectral density of a sum or product of kernels can easily be obtained as mixtures or sums of samples for the individual kernels (see Appendix N). We use this later to give a memory-efficient upper bound on the exact log marginal likelihood in Appendix P.
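As a concrete illustration (a minimal sketch, not the paper's implementation), the following approximates the Gram matrix of an SE kernel, whose spectral density is a zero-mean Gaussian with covariance l⁻²I, and measures the error for a large number of features m:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, m = 200, 1, 5000
l = 1.0
X = rng.uniform(-3, 3, size=(N, D))

# Exact SE Gram matrix (unit signal variance).
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / (2 * l ** 2))

# RFF: v_k ~ N(0, l^{-2} I), b_k ~ Uniform[0, 2*pi],
# phi(x) = sqrt(2/m) * cos(v_k^T x + b_k); here Phi is stored N x m.
V = rng.standard_normal((m, D)) / l
b = rng.uniform(0, 2 * np.pi, size=m)
Phi = np.sqrt(2.0 / m) * np.cos(X @ V.T + b)
K_hat = Phi @ Phi.T  # unbiased low-rank approximation of K

max_err = np.max(np.abs(K - K_hat))  # small, and shrinks as m grows
```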

## Appendix N Random Features for Sums and Products of Kernels

For RFF, the kernel can be approximated by the inner product of random features given by samples from its spectral density, in a Monte Carlo approximation, as follows:

 k(x − y) = ∫_{ℝ^D} e^{iv⊤(x−y)} dP(v)
 ∝ ∫_{ℝ^D} p(v) e^{iv⊤(x−y)} dv
 = E_{p(v)}[e^{iv⊤x}(e^{iv⊤y})*]
 = E_{p(v)}[Re(e^{iv⊤x}(e^{iv⊤y})*)]
 ≈ (1/m) Σ_{k=1}^{m} Re(e^{iv_k⊤x}(e^{iv_k⊤y})*)
 = E_{b,v}[φ(x)⊤φ(y)]

where φ(x) = √(2/m) (cos(v₁⊤x + b₁), …, cos(v_m⊤x + b_m))⊤, with spectral frequencies v_k iid samples from p(v) and phases b_k iid samples from Uniform[0, 2π].
Let k1, k2 be two stationary kernels with respective spectral densities a1 p1(v), a2 p2(v), so that ki = ai p̂i, where p̂(d) = ∫ p(v) e^{iv⊤d} dv; we use this convention for the Fourier transform. Note ki(0) = ai.

 (k1 + k2)(d) = a1 ∫ p1(v) e^{iv⊤d} dv + a2 ∫ p2(v) e^{iv⊤d} dv = (a1 + a2) p̂+(d)

where p+ = (a1 p1 + a2 p2)/(a1 + a2), a mixture of p1 and p2. So to generate RFF for k1 + k2, generate v ~ p+ by sampling from p1 with probability a1/(a1 + a2) and from p2 otherwise.
Now for the product, note that

 (k1 ⋅ k2)(d) = a1 a2 p̂1(d) p̂2(d) = a1 a2 p̂∗(d)

Then p∗ is the inverse Fourier transform of p̂1 p̂2, which is the convolution p1 ∗ p2. So to generate RFF for k1 ⋅ k2, generate v ~ p∗ by generating v1 ~ p1, v2 ~ p2 and setting v = v1 + v2.
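These two sampling rules can be sketched for two SE kernels, whose spectral densities are zero-mean Gaussians with covariance lᵢ⁻²I (the lengthscales and signal variances below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 10_000           # number of spectral samples
a1, a2 = 2.0, 1.0    # signal variances, k_i(0) = a_i
l1, l2 = 1.0, 0.2    # lengthscales

def sample_se_freq(l, size, D=1):
    """Spectral frequencies of an SE kernel: v ~ N(0, l^{-2} I)."""
    return rng.standard_normal((size, D)) / l

# Sum k1 + k2: sample from the mixture p_+, choosing component i w.p. a_i/(a1+a2).
pick1 = rng.random(m) < a1 / (a1 + a2)
v_sum = np.where(pick1[:, None], sample_se_freq(l1, m), sample_se_freq(l2, m))

# Product k1 * k2: p_* is the convolution p1 * p2, so add independent samples.
v_prod = sample_se_freq(l1, m) + sample_se_freq(l2, m)

# Sanity: Var[v_prod] = l1^-2 + l2^-2 = 26; Var[v_sum] = (2/3)*1 + (1/3)*25 = 9.
```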
This is not applicable to non-stationary kernels, such as the linear kernel. We deal with this problem as follows. Suppose φ1, φ2 are random features such that k1(x, x′) = φ1(x)⊤φ1(x′) and k2(x, x′) = φ2(x)⊤φ2(x′).
It is straightforward to verify that

 (k1 + k2)(x, x′) = φ+(x)⊤φ+(x′)
 (k1 ⋅ k2)(x, x′) = φ∗(x)⊤φ∗(x′)

where φ+(x) is the concatenation of φ1(x) and φ2(x), and φ∗(x) is the vectorised outer product of φ1(x) and φ2(x). However we do not want the number of features to grow as we add and multiply kernels, since it will grow exponentially with the number of base kernels; we want to keep it at m features. So we subsample m entries from φ+(x) (or φ∗(x)) and scale by a constant factor (√2 for the sum, √m for the product), which will still give us unbiased estimates of the kernel, provided that each term of the inner product φ+(x)⊤φ+(x′) (or φ∗(x)⊤φ∗(x′)) is an unbiased estimate of (k1 + k2)(x, x′) (or (k1 ⋅ k2)(x, x′)) up to a constant scaling.
This is how we generate random features for linear kernels combined with other stationary kernels.

## Appendix O Spectral Density for PER

From [35], we have that the spectral density of the PER kernel is:

 Σ_{n=−∞}^{∞} I_n(l⁻²) exp(−l⁻²) δ(v − 2πn/p)

where I_n is the modified Bessel function of the first kind. Since Σ_n I_n(x) = e^x, these weights sum to one, so the density is properly normalised.
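As a sanity check on this density (a pure-Python sketch; `bessel_i` implements the standard power series for I_n, and the value of l is arbitrary), the weights of the spectral lines indeed form a probability distribution over n:

```python
import math

def bessel_i(n, x, terms=60):
    """Modified Bessel function of the first kind for integer order:
    I_n(x) = sum_k (x/2)^(2k+n) / (k! (k+n)!)."""
    n = abs(n)  # I_{-n}(x) = I_n(x) for integer n
    return sum((x / 2) ** (2 * k + n) / (math.factorial(k) * math.factorial(k + n))
               for k in range(terms))

l = 0.7
x = 1.0 / l ** 2
# Weight of the spectral line at v = 2*pi*n/p:
weights = [bessel_i(n, x) * math.exp(-x) for n in range(-20, 21)]
total = sum(weights)  # close to 1, since sum_n I_n(x) = e^x
```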

## Appendix P An upper bound to NLD using Random Fourier Features

Note that the function f(X) = −(1/2) log det(X) is convex on the set of positive definite matrices [6]. Hence by Jensen’s inequality we have, for an unbiased estimate Φ⊤Φ of K:

 −(1/2) log det(K + σ²I) = f(K + σ²I) = f(E[Φ⊤Φ + σ²I]) ≤ E[f(Φ⊤Φ + σ²I)]

Hence f(Φ⊤Φ + σ²I) = −(1/2) log det(Φ⊤Φ + σ²I) is a stochastic upper bound on NLD that can be calculated in O(Nm² + m³) time via Sylvester’s determinant theorem. An example of such an unbiased estimator is given by RFF.
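The convexity fact underlying this bound is easy to check numerically (a minimal sketch; the matrix size and noise level are arbitrary):

```python
import numpy as np

def f(X):
    """f(X) = -1/2 log det X, which is convex over positive definite matrices [6]."""
    return -0.5 * np.linalg.slogdet(X)[1]

rng = np.random.default_rng(0)
n, sigma2 = 30, 0.5

def random_pd():
    A = rng.standard_normal((n, n))
    return A @ A.T + sigma2 * np.eye(n)  # positive definite by construction

# Jensen's inequality f(E[X]) <= E[f(X)] follows from this convexity;
# here we check the two-point version f((A + B)/2) <= (f(A) + f(B))/2.
A, B = random_pd(), random_pd()
assert f((A + B) / 2) <= (f(A) + f(B)) / 2
```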