Instantiations of BO can be found in Active Learning (AL) (Settles, 2009; Tuia et al., 2011; Fu et al., 2013), the optimal design of experiments (Chaloner and Verdinelli, 1995; Foster et al., 2019; Zheng et al., 2020), and Optimal Learning (Powell and Ryzhov, 2012). Its applications range widely from the optimization of hyper-parameters of complex machine learning models (Snoek et al., 2012) to the sciences and engineering, as in Attia et al. (2020), who optimized charging protocols to maximize battery life. Li et al. (2018) reported that random search with only twice as many samples can outperform standard BO methods on a certain hyper-parameter optimization task. This led Ahmed et al. (2016) to advocate for first-order BO (FOBO) as a critical improvement, a call that recently received theoretical heft due to Shekhar and Javidi (2021), who proved that FOBO achieves an exponential improvement over the expected regret of standard BO for multi-armed bandit problems as a function of the number of observations and the dimensionality of the input.
At the same time, differentiable programming and automatic differentiation (AD), which enable the calculation of gradients through complex numerical programs, have become an integral part of machine learning research (Innes et al., 2017; Wang et al., 2018; Baydin et al., 2018)
and practice, perhaps best illustrated by PyTorch (Paszke et al., 2019)
and TensorFlow (Abadi et al., 2015), both of which include AD engines. Indeed, AD has powered an increasing pace of model development by automating the error-prone writing of derivative code, and is thus a natural complement to FOBO, if only to compute the gradients of the objective.
On a high level, most BO approaches build a surrogate model of an objective from a few potentially noisy observations and make informed choices about further queries based on the predictive values and uncertainties of the surrogate. In principle, any functional form could be employed as a surrogate, and indeed Wang and Shang (2014), Snoek et al. (2015), and Gal et al. (2017) use deep neural networks for AL and BO. However, Gaussian Processes (GP) are currently the most commonly used models for research and applications of BO because they work well with little data and permit closed-form posterior inference. Fortunately, under benign assumptions (see Section A), GPs are closed under differentiation and maintain their analytical properties when conditioned on gradient information (Solak et al., 2003).
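As a concrete illustration of conditioning on gradients, the following Python sketch (the paper's implementation is in Julia; the kernel choice and all names here are ours) assembles the joint covariance of values and gradients under the squared-exponential kernel, an n(d+1) × n(d+1) block matrix:

```python
import numpy as np

def k(x, y):
    # Squared-exponential kernel k(x, y) = exp(-||x - y||^2 / 2)
    r = x - y
    return np.exp(-(r @ r) / 2.0)

def grad_x_k(x, y):
    # ∇_x k(x, y) = -(x - y) k(x, y)
    return -(x - y) * k(x, y)

def grad_xy_k(x, y):
    # ∇_x ∇_y^T k(x, y) = (I - r r^T) k(x, y) with r = x - y
    r = x - y
    return (np.eye(len(x)) - np.outer(r, r)) * k(x, y)

def joint_covariance(X):
    """Covariance of the stacked vector [f(x_1..n); ∇f(x_1..n)]: n(d+1) x n(d+1)."""
    n, d = X.shape
    K = np.zeros((n * (1 + d), n * (1 + d)))
    for i in range(n):
        for j in range(n):
            K[i, j] = k(X[i], X[j])
            g = grad_x_k(X[i], X[j])             # cov(∇f(x_i), f(x_j))
            K[n + i * d:n + (i + 1) * d, j] = g
            K[j, n + i * d:n + (i + 1) * d] = g  # symmetric counterpart
            K[n + i * d:n + (i + 1) * d,
              n + j * d:n + (j + 1) * d] = grad_xy_k(X[i], X[j])
    return K

rng = np.random.default_rng(0)
K = joint_covariance(rng.standard_normal((4, 3)))
assert np.allclose(K, K.T)                   # a valid covariance: symmetric
assert np.linalg.eigvalsh(K).min() > -1e-8   # ... and positive semi-definite
```

The dense n(d+1) × n(d+1) construction above is exactly what the structured representations of Section 3 avoid forming explicitly.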
Nonetheless, naïvely incorporating gradients leads to kernel matrices of size n(d+1) × n(d+1) for n observations in d dimensions, which restricts the possible problem sizes and dimensions, a limitation that needs to be overcome to make FOBO applicable to a wide array of problems. Further, as the performance of GPs chiefly depends on their covariance kernel, it is important to give researchers and practitioners flexibility in this choice. Herein, our primary goal is to enable scalable inference for GPs in the context of FOBO, while maintaining modeling flexibility via matrix-structure-aware AD.
We 1) derive analytical block-data-sparse structures for a large class of gradient kernel matrices, allowing for an exact multiply in Section 3, 2) propose an AD framework that programmatically computes the data-sparse block structures for transformations and algebraic combinations of kernels, and make our implementation publicly available at github.com/SebastianAment/CovarianceFunctions.jl. In Section 3.3, we further 3) derive analogous structures for kernel matrices that arise from conditioning on Hessian information, reducing the complexity of a multiply from O(n²d⁴) for the naïve approach to O(n²d²), 4) provide numerical experiments that demonstrate the improved scaling and delineate the problem sizes for which the proposed methods are applicable in Section 4.1, 5) compare against existing techniques in Section 4.2, and 6) use the proposed methods for Bayesian Optimization in Section 4.3.
2 Related Work
Inference for GPs has traditionally been based on matrix factorizations, but recently, methods based on iterative solvers have been developed, which can scale up to a million data points without approximations (Wang et al., 2019) by leveraging the parallelism of modern hardware (Dong et al., 2017; Gardner et al., 2018a). Extending the approximate matrix-vector multiplication algorithms of Wilson and Nickisch (2015) and Gardner et al. (2018b), Eriksson et al. (2018) proposed an approximate method for GPs with derivative information which scales quasi-linearly in n and d for separable product kernels whose constituents are stationary. De Roos et al. (2021) proposed an elegant direct method for GPs with derivatives that scales linearly in the dimensionality but sextically – O(n⁶) – with the number of data points, and also derived an efficient multiply for dot-product and isotropic kernels whose inputs can be scaled by a diagonal matrix. Wu et al. (2017b) used GPs with gradients for BO and proposed keeping only a single directional derivative to reduce the computational cost. Padidar et al. (2021) proposed a similar strategy, retaining only relevant directional derivatives, to scale a variational inference scheme for GPs with derivatives. Notably, incorporating gradient information into GPs is not only useful for BO: Solak et al. (2003) put forward the integration of gradient information for GP models of dynamical systems, Riihimäki and Vehtari (2010) used “virtual” derivative observations to include monotonicity constraints into GPs, and Solin et al. (2018) employed the derivatives of a GP to model curl-free magnetic fields and their physical constraints.
To disambiguate several sometimes conflated terms, we quote Baydin et al. (2018), who defined AD as “a specific family of techniques that computes derivatives through accumulation of values during code execution to generate numerical derivative evaluations rather than derivative expressions”. It enables the computation of derivatives up to machine precision while maintaining the speed of numerical operations. Practical implementations of AD include forward-mode differentiation techniques based on operator overloading (Revels et al., 2016), the system of Innes (2018), which is able to generate compiled derivative code of differentiable components of the Julia language, as well as the reverse-mode differentiation technologies of PyTorch (Paszke et al., 2019) and Tensorflow (Abadi et al., 2015). Maclaurin et al. (2015) put forward an algorithm for computing gradients of models w.r.t. their hyper-parameters using reverse-mode auto-differentiation, enabling the use of FOBO to optimize a model’s generalization performance. Among others, Verma (1998) explored the exploitation of structure, primarily sparsity, in the automatic computation of Jacobian and Hessian matrices. However, the existing work is not directly applicable here, since it does not treat the more general data-sparse structures of Section 3. For a review of automatic differentiation (AD) techniques, see (Griewank and Walther, 2008).
Bayesian Optimization (BO) has been applied to a diverse set of problems, and of particular interest to the machine learning community is the optimization of hyper-parameters of complex models (Klein et al., 2017). Spurring much interest in BO, Snoek et al. (2012) demonstrated that BO is an effective tool for the optimization of hyper-parameters of deep neural networks. Hennig and Schuler (2012) proposed entropy search for global optimization, a technique that employs GPs to compute a distribution over the potential optimum of a function. Wang et al. (2013) proposed efficient BO with random embeddings which scales to very high-dimensional problems by exploiting lower-dimensional structures. Mutny and Krause (2018) assumed an additive structure to scale BO to high dimensions. Eriksson et al. (2018) used their fast approximate inference technique for FOBO in combination with an active subspaces method (Constantine et al., 2014) in order to reduce the dimensionality of the optimization problem and to speed up convergence. Martinez-Cantin et al. (2018)
enabled BO in the presence of outliers by employing a heavy-tailed likelihood distribution. Malkomes and Garnett (2018) used BO in model space to choose surrogate models for use in a primary BO loop. Wu and Frazier (2019) presented a two-step lookahead method for BO. Eriksson et al. (2019) put forth TuRBO
, leveraging a set of local models for the global optimization of high-dimensional functions. BO has also been applied to hierarchical reinforcement learning (Brochu et al., 2010; Prabuchandran et al., 2021). Existing BO libraries include Dragonfly (Kandasamy et al., 2020), BayesOpt (Martinez-Cantin, 2014), and BoTorch (Balandat et al., 2020). For a review of BO, see (Frazier, 2018).
We first provide definitions and set up notation and central quantities for the rest of the paper.
A random function f : ℝ^d → ℝ is a Gaussian process with mean function μ and covariance function (kernel) k if and only if all of its finite-dimensional marginal distributions are multivariate Gaussian. In particular, f is a Gaussian process if and only if, for any finite set of inputs X = {x_1, …, x_n},

f(X) ∼ N(μ(X), k(X, X)),

where [f(X)]_i = f(x_i), [μ(X)]_i = μ(x_i), and [k(X, X)]_{ij} = k(x_i, x_j). In this case, we write f ∼ GP(μ, k).
When defining kernel functions, x and y will denote the first and second inputs, r = x − y their difference, I_d the d-dimensional identity matrix, and 1_d the all-ones vector of length d. The gradient and Jacobian operators with respect to x will be denoted by ∇_x and J_x, respectively.
The focus of the present work is the matrix-valued operator ∇_x∇_y^T that acts on kernel functions and whose entries are [∇_x∇_y^T k]_{ij} = ∂_{x_i} ∂_{y_j} k(x, y). We will show that ∇_x∇_y^T k is highly structured and data-sparse for a vast space of kernel functions and present an automatic structure-aware algorithm for its computation. The kernel matrix ∇K that arises from the evaluation of ∇_x∇_y^T k on the data can be seen as a block matrix whose (i, j)-th block is ∇_x∇_y^T k(x_i, x_j). For isotropic and dot-product kernels, De Roos et al. (2021) discovered that ∇K has a global structure which allows a linear-in-d direct inversion, though the resulting O(n⁶) scaling in n means this only pays off in the low-data regime. Rather than deriving similar global structure, we focus on efficient structure for the individual blocks ∇_x∇_y^T k(x_i, x_j), which is more readily amenable to a fully lazy implementation with O(1) memory complexity and to the synthesis of several derivative orders; see Sec. D for details. Last, we stress that our goal here is to cover the subset of transformations that arise in most kernel functions, not to derive a fully general structured AD engine for the computation of the ∇_x∇_y^T operator.
3.2 Gradient Kernel Structure
In this section, we derive novel structured representations of ∇_x∇_y^T k for a large class of kernels k. The only similar previously known structures are for isotropic and dot-product kernels, derived by De Roos et al. (2021).
The majority of canonical covariance kernels can be written as

k(x, y) = f(u(x, y)),   with u(x, y) ∈ { ‖x − y‖², w^T (x − y), x^T y },

where f is a scalar-valued function and w ∈ ℝ^d. The first two input types make up most of the commonly used stationary covariance functions, while the last constitutes the basis of many popular non-stationary kernels. We call these choices of u isotropic, stationary linear functional, and dot product, respectively. First, we note that ∇_x∇_y^T u is simple for all three choices:

∇_x∇_y^T u = −2 I_d,  0,  and  I_d,  respectively.
Kernels with the first and third input type are ubiquitous and include the exponentiated quadratic, rational quadratic, Matérn, and polynomial kernels. An important example of the second type is the cosine kernel, which has been used to approximate stationary kernels (Rahimi et al., 2007; Lázaro-Gredilla et al., 2010; Gal and Turner, 2015) and is also a part of the spectral mixture kernel (Wilson and Adams, 2013). In the following, we systematically treat most of the kernels and transformations in (Rasmussen and Williams, 2005) to greatly expand the class of kernels for which structured representations are available.
A Chain Rule
Many kernels can be expressed as k(x, y) = f(u(x, y)), where f is scalar-valued. For these types of kernels, the chain rule yields

∇_x∇_y^T k = f''(u) (∇_x u)(∇_y u)^T + f'(u) ∇_x∇_y^T u.
That is, ∇_x∇_y^T k is a rank-one correction to f'(u) ∇_x∇_y^T u. If ∇_x∇_y^T u is structured with O(d) data, ∇_x∇_y^T k inherits this property. As an immediate consequence, ∇_x∇_y^T k permits a matrix-vector multiply in O(d) time for all isotropic, stationary, and dot-product kernels that fall under the categories outlined above. However, there are combinations and transformations of these base kernels that give rise to more complex kernels and enable more flexible, problem-dependent modeling.
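To make the rank-one structure concrete, here is a minimal Python sketch (our names, not the package's API) of an O(d) matvec with a single gradient-kernel block of an isotropic kernel k(x, y) = f(u), u = ‖x − y‖²:

```python
import numpy as np

# For isotropic u = ||x - y||^2: ∇_x u = 2r, ∇_y u = -2r, ∇_x ∇_y^T u = -2I,
# so the chain rule gives the block
#   G = f''(u) (∇_x u)(∇_y u)^T + f'(u) ∇_x ∇_y^T u
#     = -4 f''(u) r r^T - 2 f'(u) I,        r = x - y,
# a rank-one correction to a scaled identity: G @ v costs O(d).

def lazy_block_matvec(fp, fpp, x, y, v):
    """Multiply the gradient-kernel block by v without forming it.
    fp, fpp: f'(u) and f''(u) evaluated at u = ||x - y||^2."""
    r = x - y
    return -4.0 * fpp * (r @ v) * r - 2.0 * fp * v

# Squared-exponential example: f(u) = exp(-u/2), f' = -f/2, f'' = f/4.
rng = np.random.default_rng(0)
d = 5
x, y, v = rng.standard_normal((3, d))
u = np.sum((x - y) ** 2)
f = np.exp(-u / 2)
fast = lazy_block_matvec(-f / 2, f / 4, x, y, v)

# Dense reference: for this f, the block simplifies to f * (I - r r^T).
r = x - y
G = f * (np.eye(d) - np.outer(r, r))
assert np.allclose(fast, G @ v)
```

The lazy matvec touches only O(d) numbers per block, which is the per-block cost underlying the overall O(n²d) multiply.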
Sums and Products
First, covariance kernels are closed under addition and multiplication. If all summands or factors are of the same input type, the sum kernel has the same input type, since

∇_x∇_y^T Σ_i k_i = Σ_i ∇_x∇_y^T k_i,

and similarly for products, so that no special treatment is necessary besides the chain rule above. An interesting case occurs when we combine kernels of different input types or more complex composite kernels. For sums of kernels with different input types, the identity above still applies, so the complexity of multiplying with ∇_x∇_y^T Σ_i k_i is O(rd) for r constituent kernels of the types above. For product kernels k = h · g, we have

∇_x∇_y^T (hg) = h ∇_x∇_y^T g + g ∇_x∇_y^T h + (∇_x h)(∇_y g)^T + (∇_x g)(∇_y h)^T,
which is a rank-two correction to the sum of the scaled constituent gradient kernel elements – h ∇_x∇_y^T g and g ∇_x∇_y^T h – and therefore only adds O(d) operations to the multiplication with the constituent elements. In general, the application of ∇_x∇_y^T to a product of r kernels gives rise to a rank-r correction to the sum of the constituent gradient kernels:

∇_x∇_y^T Π_{i=1}^r k_i = Σ_i (Π_{j≠i} k_j) ∇_x∇_y^T k_i + U C V^T,   (1)
where U = [∇_x k_1, …, ∇_x k_r], V = [∇_y k_1, …, ∇_y k_r], and C has entries [C]_{ij} = Π_{l≠i,j} k_l for i ≠ j and zeros on the diagonal, whose formation would generally be O(r²). However, if k_i(x, y) ≠ 0 for all i, we have C = p (q q^T − D) and U C V^T = p (U q)(V q)^T − p (U D) V^T, where p = Π_{l=1}^r k_l, [q]_i = 1/k_i, and D is the diagonal matrix with 1/k_i² on the diagonal. A matrix-vector multiplication with (1) can thus be computed in O(rd). If some k_i(x, y) = 0, the expression is generally not data-sparse unless the Jacobians are, which is the case for the following special type of kernel product.
Direct Sums and Products
Given a set of kernels {k_i}, each of which acts on a different input dimension, we can define their direct product (resp. sum) as k(x, y) = Π_i k_i(x_i, y_i) (resp. Σ_i k_i(x_i, y_i)), where i corresponds to the dimension on which k_i acts. This separable structure gives rise to sparse differential operators ∇_x k_i and ∇_x∇_y^T k_i that are zero except for the entries involving the i-th dimension: [∇_x k_i]_i = ∂_{x_i} k_i and [∇_x∇_y^T k_i]_{ii} = ∂_{x_i} ∂_{y_i} k_i.
For direct sums, ∇_x∇_y^T k is then simply diagonal: [∇_x∇_y^T k]_{ii} = ∂_{x_i} ∂_{y_i} k_i(x_i, y_i). For direct products, substituting these sparse expressions into the general product rule (1) above yields a rank-one update to a diagonal matrix. Therefore, the computational complexity of multiplying a vector with ∇_x∇_y^T k for separable kernels is O(d). Notably, the above structure can be readily generalized to block-separable kernels, whose constituent kernels act on more than one dimension. The O(d) complexity is also attained as long as every constituent kernel applies to only a constant number of dimensions as d grows, or itself allows a multiply that is linear in the dimensionality of the space on which it acts.
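The diagonal structure for direct sums can be sketched as follows (Python; the 1-D squared-exponential constituents are an illustrative choice, and the entries are checked against finite differences):

```python
import numpy as np

# For a direct-sum kernel k(x, y) = sum_i k_i(x_i, y_i), the block
# ∇_x ∇_y^T k is diagonal with entries ∂_{x_i} ∂_{y_i} k_i(x_i, y_i),
# so a matvec costs O(d).

def k_direct_sum(x, y):
    # Each constituent is a 1-D squared-exponential kernel.
    return np.sum(np.exp(-((x - y) ** 2) / 2.0))

def grad_block_diag(x, y):
    # ∂_a ∂_b exp(-(a - b)^2 / 2) = (1 - (a - b)^2) exp(-(a - b)^2 / 2)
    t = x - y
    return (1.0 - t ** 2) * np.exp(-(t ** 2) / 2.0)

def block_matvec(x, y, v):
    return grad_block_diag(x, y) * v  # O(d): elementwise multiply

# Finite-difference sanity check of the first diagonal entry.
x = np.array([0.3, -1.2, 0.7])
y = np.array([1.0, 0.4, -0.2])
h = 1e-4
e0 = np.array([h, 0.0, 0.0])
num = (k_direct_sum(x + e0, y + e0) - k_direct_sum(x + e0, y - e0)
       - k_direct_sum(x - e0, y + e0) + k_direct_sum(x - e0, y - e0)) / (4 * h * h)
assert abs(num - grad_block_diag(x, y)[0]) < 1e-5
```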
If k̃(x, y) = g(x) k(x, y) g(y) for a scalar-valued function g, then

∇_x∇_y^T k̃ = g(x) g(y) ∇_x∇_y^T k + (∇g(x)) (g(y) ∇_y k + k(x, y) ∇g(y))^T + g(x) (∇_x k)(∇g(y))^T.

Again, ∇_x∇_y^T k̃ is a low-rank (rank-two) correction to g(x) g(y) ∇_x∇_y^T k.
The so-called “warping” of inputs to GPs is an important technique for the incorporation of non-trivial problem structure, especially of a non-stationary nature (Snelson et al., 2004; Lázaro-Gredilla, 2012; Marmin et al., 2018). In particular, given some potentially vector-valued warping function w : ℝ^d → ℝ^e, a warped kernel can be written as k_w(x, y) = k(w(x), w(y)), which leads to

∇_x∇_y^T k_w = J_w(x)^T [∇_x∇_y^T k](w(x), w(y)) J_w(y).
We can factor out the Jacobian factors as block-diagonal matrices from the gradient kernel matrix ∇K, leading to an efficient representation:

∇K = J^T G J,  with  J = diag(J_w(x_1), …, J_w(x_n)),

where G is the gradient kernel matrix of the base kernel k evaluated at the warped inputs.
Taking advantage of the above structure, the complexity of multiplication with the gradient kernel matrix can be reduced to O(n² e (d + e)), which is O(n² d) for e = O(1). Important examples of warping functions are energetic norms or inner products of the form ‖x − y‖_A or x^T A y for some positive semi-definite matrix A. In this case, we can factor A = L L^T in a pre-computation that is independent of n using a pivoted Cholesky decomposition with O(d r²) operations for a rank-r matrix A, and let w(x) = L^T x, so that x^T A y = w(x)^T w(y). This gives rise to a Kronecker product structure in the Jacobian scaling matrix, J = I_n ⊗ L^T, and enables subspace search techniques for BO, like the ones of Wang et al. (2013), Eriksson et al. (2018), and Kirschner et al. (2019), to take advantage of the structures proposed here. If A is diagonal, as for automatic relevance determination (ARD), one can simply use w(x) = A^{1/2} x, and the complexity of multiplying with ∇K is O(n² d). Notably, the matrix structure and its scaling also extend to complex warping functions w, like Wilson et al. (2016)'s deep kernel learning model.
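The Jacobian factorization for warped kernels can be sketched as follows (Python; the linear warp and squared-exponential base kernel are illustrative choices, verified against finite differences of the warped kernel):

```python
import numpy as np

def base_grad_block(u, v):
    # ∇_u ∇_v^T exp(-||u - v||^2 / 2) = (I - s s^T) exp(-||s||^2 / 2), s = u - v
    s = u - v
    return (np.eye(len(u)) - np.outer(s, s)) * np.exp(-(s @ s) / 2.0)

def warped_grad_block(W, x, y):
    # For w(x) = W x, J_w = W, so the block is W^T G W; a matvec only needs
    # W v, then G (W v), then W^T (.), i.e. O(e d + e^2) instead of O(d^2).
    return W.T @ base_grad_block(W @ x, W @ y) @ W

rng = np.random.default_rng(0)
d, e = 6, 2
W = rng.standard_normal((e, d))
x, y = 0.3 * rng.standard_normal((2, d))
kw = lambda a, b: np.exp(-np.sum((W @ (a - b)) ** 2) / 2.0)  # k_w = k(Wx, Wy)

# Check the factored block against central finite differences of k_w.
G, h = warped_grad_block(W, x, y), 1e-4
for a in range(d):
    for b in range(d):
        ea, eb = h * np.eye(d)[a], h * np.eye(d)[b]
        num = (kw(x+ea, y+eb) - kw(x+ea, y-eb)
               - kw(x-ea, y+eb) + kw(x-ea, y-eb)) / (4 * h * h)
        assert abs(num - G[a, b]) < 1e-4
```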
Systematic application of the rules and data-sparse representations of ∇_x∇_y^T for the transformations and compositions of kernels above gives rise to similar representations for many more complex kernels. Examples include the neural network kernel, the RBF-network kernel, the spectral mixture kernel of Wilson and Adams (2013), and the kernel corresponding to a linear regression with varying coefficients, in which the regression features carry a prior covariance over the weights and a secondary kernel controls the variability of the weights (Rasmussen and Williams, 2005). See Figure 1 for a depiction of these kernels' computational graphs, where each node represents a computation that we treated in this section. These examples highlight the generality of the proposed approach, since it applies to these kernels without specialization and is simultaneously the first to enable a linear-in-d multiply with their gradient kernel matrices.
3.3 Hessian Kernel Structure
Under appropriate differentiability assumptions (see Sec. A), we can condition a GP on Hessian information. However, incorporating second-order information into GPs has so far – except for one- and two-dimensional test problems by Wu et al. (2017a) – not been explored. This is likely due to the prohibitive O(n²d⁴) scaling for a matrix multiply with the associated covariance matrix and the O(n³d⁶) scaling for direct matrix inversion. In addition to the special structure for the gradient-Hessian cross-covariance already reported by De Roos et al. (2021), we derive a structured representation of the Hessian-Hessian covariance for isotropic kernels, enabling efficient computations with second-order information. The representation is expressed in terms of Kronecker products, the Kronecker sum ⊕, and the “shuffle” (commutation) matrix S that satisfies S vec(A) = vec(A^T). With it, it is possible to multiply with covariance matrices that arise from conditioning on second-order information in O(n²d²), which is linear in the d² entries of each Hessian and therefore optimal with respect to the dimensionality. This is an attractive complexity in moderate dimensionality, since Hessian observations are highly informative of a function's local behavior. For derivations of the second-order covariances for more kernel types and transformations, see Sec. C.
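While we defer the full Hessian-Hessian formula to the appendix, the key primitive can be sketched: multiplying with a Kronecker sum without ever forming the d² × d² matrix (Python; names are ours):

```python
import numpy as np

# Kronecker-sum matvec: A ⊕ B = A ⊗ I + I ⊗ B, and with column-major vec,
#   (A ⊕ B) vec(V) = vec(V A^T + B V),
# which costs O(d^3) instead of the O(d^4) of a dense d^2 x d^2 multiply.
# Structure like this keeps Hessian-kernel matvecs linear in the d^2
# entries of each Hessian observation.

def kron_sum_matvec(A, B, v):
    d = A.shape[0]
    V = v.reshape(d, d, order="F")           # un-vec (column-major)
    return (V @ A.T + B @ V).reshape(-1, order="F")

rng = np.random.default_rng(1)
d = 4
A, B = rng.standard_normal((2, d, d))
v = rng.standard_normal(d * d)
dense = np.kron(A, np.eye(d)) + np.kron(np.eye(d), B)
assert np.allclose(kron_sum_matvec(A, B, v), dense @ v)
```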
3.4 An Implementation: CovarianceFunctions.jl
To take advantage of the analytical observations above in an automatic fashion, several technical challenges need to be overcome. First, we need a representation of the computational graph of a kernel function that is built from the basic constituents and transformations that we outlined above, akin to Figure 1. Second, we need to build matrix-free representations of the gradient kernel matrices to maintain data-sparse structure. Here, we briefly describe how we designed CovarianceFunctions.jl, an implementation of the structured AD technique that is enabled by the analytical derivations above, and supporting libraries, all written in Julia (Bezanson et al., 2017).
CovarianceFunctions.jl represents kernels at the level of user-defined types. It is in principle possible to hook into the abstract syntax tree (AST) to recognize these types of structures more generally (Innes, 2018), but this would undoubtedly come at the cost of increased complexity, and it is unclear whether that generality would have applications outside the scope of this work. A user can readily extend the framework with a new kernel type if it cannot already be expressed as a combination and transformation of existing kernels. All that is necessary is the definition of its evaluation and of the short function input_trait, which returns the type of input the kernel depends on – isotropic, dot-product, or the stationary type – and automatically detects homogeneous products and sums of kernels with these input types. For example, for the rational quadratic kernel, which depends on its inputs only through ‖x − y‖², input_trait returns the isotropic trait.
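In Python pseudocode (the actual package is written in Julia and resolves traits statically by multiple dispatch; the class and lookup table below are illustrative, not the package's API), the idea looks like this:

```python
import numpy as np

# Trait-based dispatch sketch: input_trait reports whether a kernel depends
# on its inputs through ||x - y||^2, x - y, or x . y, and the gradient-kernel
# code then selects the matching structured representation.

class IsotropicInput: pass
class DotProductInput: pass

class RationalQuadratic:
    def __init__(self, alpha=1.0):
        self.alpha = alpha
    def __call__(self, x, y):
        u = np.sum((x - y) ** 2)  # isotropic: depends on inputs via ||x - y||^2
        return (1.0 + u / (2.0 * self.alpha)) ** (-self.alpha)

TRAITS = {RationalQuadratic: IsotropicInput}  # stand-in for multiple dispatch

def input_trait(kernel):
    return TRAITS[type(kernel)]

rq = RationalQuadratic(alpha=2.0)
assert input_trait(rq) is IsotropicInput
x = np.array([0.5, -0.5])
assert np.isclose(rq(x, x), 1.0)  # any kernel of this form equals 1 at x = y
```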
Our implementation uses ForwardDiff.jl (Revels et al., 2016) to compute the regular derivatives and gradients that arise in the structured expressions, both to achieve a high level of generality and to provide a robust fallback implementation of all the relevant operators in case no structure can be inferred for the input kernel. Even though the memory requirements of the data-sparse blocks are much reduced compared to the dense case, a machine can nevertheless run out of memory if the number of samples gets very large and all blocks are stored in memory. To scale the method up to very large n, our implementation employs lazy evaluation of the gradient kernel matrix to achieve a constant, O(1), memory complexity for a matrix-vector multiply.
The main benefit of this system is that researchers and practitioners of BO do not need to derive a special structured representation for each kernel they want to use for an accurate modeling of their problem. As an example, the structured AD rules of Section 3.2 obviate the special derivation for the neural network kernel in Section B. In our view, this has the potential to greatly increase the uptake of first-order BO techniques outside of the immediate field of specialists.
4.1 Scaling on Synthetic Data
First, we study the practical scaling of our implementation of the proposed methods with respect to both the dimension and the number of observations. See Figure 2 for experimental results using the non-separable rational quadratic kernel. Importantly, the scaling results are virtually indistinguishable across kernels; see Figure 3 for similar experiments using the exponentiated dot-product kernel and the more complex neural network kernel.
The exponentiated dot-product kernel was recently used by Karvonen et al. (2021) to derive a probabilistic Taylor-type expansion of multi-variate functions. The neural network kernel is derived as the limit of a neural network with one hidden layer as the number of hidden units goes to infinity (Rasmussen and Williams, 2005). The scaling plots on the left of Figures 2 and 3 were created with a single thread to minimize constants, while the heat maps on the right were run with 24 threads on 12 cores in parallel to highlight the applicability of the methods on a modern parallel architecture.
4.2 Comparison to Prior Work
While popular libraries like GPyTorch, GPFlow, and Scikit-Learn have efficient implementations for the generic GP inference problem, they do not offer efficient inference with gradient observations, see Table 1 (Gardner et al., 2018a; De G. Matthews et al., 2017; Pedregosa et al., 2011). Highlighting the novelty of our work, GPyTorch contains only two implementations for this case – RBFKernelGrad and PolynomialKernelGrad – both with the naïve matrix-vector multiplication complexity, hand-written work that is both obviated and outperformed by our structure-aware AD engine, see Figure 4. Thus, BoTorch, which depends on GPyTorch, does not yet support efficient FOBO. Neither GPFlow nor SKLearn contain any implementations of gradient kernels. Dragonfly and BayesOpt do not support gradient observations.
Eriksson et al. (2018)’s D-SKIP
D-SKIP is an approximate method and requires that the kernel can be expressed as a separable product and, further, that the resulting constituent kernel matrices have low rank. In contrast, our method is mathematically exact and applies to a large class of kernels without restriction. D-SKIP incurs a significant upfront factorization cost, followed by a matrix-vector multiplication (MVM) cost that is linear in both n and d for constituent kernel matrices of constant rank and a constant number of inducing points per dimension, while the method proposed herein scales quadratically in n. See Figure 4 for a comparison of D-SKIP's real-world performance, where D-SKIP's MVM scales linearly in n, but the required pre-processing scales quadratically in n and dominates the total runtime. Note that D-SKIP's implementation is restricted to higher dimensions, since D-SKI is faster in the low-dimensional regime. For small n, D-SKIP's pure MVM times are faster than our method's, whose runtime initially grows sublinearly because it takes advantage of vector registers and SIMD instructions. Notably, the linear extrapolation of D-SKIP's pure MVM times without pre-processing is within a small factor of the timings of our work, implying that if D-SKIP were applied to higher dimensions, the pure MVM times of both methods would be comparable for a moderately large number of observations. Figure 6 in Section E shows that D-SKIP is approximate and loses accuracy as the problem size increases, while our method is accurate to machine precision.
4.3 Bayesian Optimization
Shekhar and Javidi (2021) proved that gradient information can lead to an exponential improvement in regret for multi-armed bandit problems, compared to zeroth-order BO, by employing a two-stage procedure, the first stage of which hones in on a locally quadratic optimum of the objective. Inspired by this result and by the qualitative appearance of many test functions of Bingham and Surjanovic (2013), a promising model for these objectives is a sum of functions f = q + g, where q is a quadratic function and g is a potentially quickly varying, non-convex function. Since g is arbitrary, the model does not restrict the space of functions, but offers a useful inductive bias for problems with a globally quadratic structure “perturbed” by a non-convex function. Assuming the minimum of the quadratic function coincides with or is close to a global minimum of the objective, this structure can be exploited to accelerate convergence.
Herein, we model q with a GP with a quadratic kernel k_q, a distribution over quadratic functions whose stationary points are regularized, while g is assumed to be drawn from a GP with a Matérn-5/2 kernel k_m, to model quickly varying deviations from q. Then f = q + g is itself a GP with the kernel k_q + k_m. Notably, the resulting kernel is a composite kernel with isotropic and dot-product constituents and a quadratic transformation. Without exploiting this structure automatically as proposed above, one would have to derive a new fast multiply by hand, slowing down the application of this model to BO.
Notably, this model is similar to the one employed by Gal et al. (2017), who used a quadratic function as the prior mean, requiring a separate optimization or marginalization of the location and covariance of the mean function. In contrast, we model the quadratic component with a specialized kernel, whose treatment only requires linear operations.
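A minimal sketch of such a composite surrogate kernel follows (Python; the degree-2 polynomial kernel and its offset are our illustrative stand-ins for the paper's exact quadratic-kernel parameterization):

```python
import numpy as np

# Quadratic mixture sketch: a degree-2 polynomial kernel (a prior over
# quadratic functions) plus a Matérn-5/2 kernel for the quickly varying,
# non-convex component. The sum of two valid kernels is a valid kernel.

def k_quad(x, y, c=1.0):
    # (x . y + c)^2: spans constant, linear, and quadratic functions.
    return (x @ y + c) ** 2

def k_matern52(x, y):
    r = np.sqrt(5.0) * np.linalg.norm(x - y)
    return (1.0 + r + r ** 2 / 3.0) * np.exp(-r)

def k_mix(x, y):
    return k_quad(x, y) + k_matern52(x, y)

rng = np.random.default_rng(2)
X = rng.standard_normal((20, 3))
K = np.array([[k_mix(a, b) for b in X] for a in X])
# A valid covariance matrix: symmetric and positive semi-definite.
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() > -1e-8
```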
We benchmark both Bayesian and canonical optimization algorithms, with and without gradient information, on some of the test functions given by Bingham and Surjanovic (2013), namely, the Griewank, Ackley, and Rastrigin functions. See Section F for the definitions of the test functions. For all functions, we rescaled the input domains and outputs and shifted the functions' global optima. The top row of Figure 5 shows plots of the non-convex functions in two dimensions.
Figure 5 shows the average optimality gap over 128 independent experiments in four and sixteen dimensions for the following strategies: random sampling (black), L-BFGS (blue), L-BFGS with random restarts after local convergence is detected (L-BFGS-R, purple), BO (dotted orange), BO with the quadratic mixture kernel of Section 4.3 (BO-Q, solid orange), FOBO (dotted green), and FOBO with the quadratic mixture kernel (FOBO-Q, solid green). The FOBO variants incorporate both value and gradient observations; see Section D. All BO variants use the expected improvement acquisition function, which is numerically optimized with respect to the next observation point using L-BFGS. If the proposed next observation lies within a small tolerance of any previously observed point, we choose a random point instead (see Algorithm 1), similar to the L-BFGS-R strategy. This helps escape local minima in the acquisition function and improves the performance of all BO algorithms.
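The random-fallback rule can be sketched as follows (Python; the tolerance and domain bounds are illustrative, not the paper's exact values):

```python
import numpy as np

# If the acquisition optimizer proposes a point too close to an
# already-observed one, draw a uniform random point in the domain instead,
# to escape local minima of the acquisition surface.

def accept_or_randomize(candidate, observed, rng, tol=1e-2, low=-1.0, high=1.0):
    if observed.size and np.min(np.linalg.norm(observed - candidate, axis=1)) < tol:
        return rng.uniform(low, high, size=candidate.shape)  # random fallback
    return candidate

rng = np.random.default_rng(3)
observed = np.zeros((1, 2))
near = np.array([1e-3, 0.0])   # within tol of the observed origin
far = np.array([0.5, 0.5])
assert np.array_equal(accept_or_randomize(far, observed, rng), far)
assert not np.array_equal(accept_or_randomize(near, observed, rng), near)
```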
FOBO-Q outperforms all other methods on the 4D problems; L-BFGS converges most rapidly on the 16D Griewank function, because its many local minima vanish as d increases, so that the purely local search results in the fastest convergence; and FOBO achieves the best optimality gap on the 16D Ackley and Rastrigin functions. While the Rastrigin function contains a quadratic component analytically, its inference appears to become more difficult as the dimension increases, leading FOBO to outperform the Q variant. Surprisingly, however, the Q variants outperform their counterparts on the 4D Ackley function, even though it does not contain a quadratic component.
Incorporating gradient information improves the performance of BO, but the global optimization of non-convex functions remains NP-hard (see Section G) and cannot be expected to be solved in general. For example, the optimum of the 16D Ackley function is elusive for all methods, likely because its domain of attraction shrinks exponentially with d. While we derived structured representations for kernel matrices arising from Hessian observations, we primarily focused on first-order information. We demonstrated the improved computational scaling and feasibility of computing with Hessian observations in our experiments but did not use this for BO. We leave a more comprehensive comparison of first- and second-order BO to future work. Our main goal here was to enable such investigations in the first place, by providing the required theoretical advances and practical infrastructure through CovarianceFunctions.jl.
1) We are excited at the prospect of linking Maclaurin et al. (2015)'s algorithm for the computation of hyper-parameter gradients with the technology proposed here, enabling efficient FOBO of hyper-parameters. 2) While the methods proposed here are exact and enable a linear-in-d MVM complexity, the quadratic scaling in the number of observations n can still become expensive. We believe that analysis-based fast algorithms like Ryan et al. (2022)'s Fast Kernel Transform could be derived for gradient kernels and hold promise in low dimensions. Further, BO trajectories can yield redundant information, particularly when honing in on a minimum, which could be exploited using sparse linear solvers like the ones of Ament and Gomes (2021). 3) Hessian observations could be especially useful for Bayesian Quadrature, since the benefit of second-order information for integration is established: the Laplace approximation is used to estimate integrals in Bayesian statistics and relies on a single Hessian observation at the mode of the distribution.
Bayesian Optimization has proven promising in numerous applications and is an active area of research. Herein, we provided exact methods with an O(n²d) MVM complexity for kernel matrices arising from gradient observations in d dimensions and a large class of kernels, enabling first-order BO to scale to high dimensions. In addition, we derived structures that allow for an O(n²d²) MVM with Hessian kernel matrices, making future investigations into second-order BO and Bayesian Quadrature possible.
- Abadi et al. (2015) Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
- Adler (1981) Adler, R. J. (1981). The Geometry of Random Fields. Society for Industrial and Applied Mathematics.
- Ahmed et al. (2016) Ahmed, M. O., Shahriari, B., and Schmidt, M. (2016). Do we need “harmless” bayesian optimization and “first-order” bayesian optimization. NIPS BayesOpt.
- Ament and Gomes (2021) Ament, S. E. and Gomes, C. P. (2021). Sparse bayesian learning via stepwise regression. In International Conference on Machine Learning, pages 264–274. PMLR.
- Attia et al. (2020) Attia, P. M., Grover, A., Jin, N., Severson, K. A., Markov, T. M., Liao, Y.-H., Chen, M. H., Cheong, B., Perkins, N., Yang, Z., et al. (2020). Closed-loop optimization of fast-charging protocols for batteries with machine learning. Nature, 578(7795):397–402.
- Balandat et al. (2020) Balandat, M., Karrer, B., Jiang, D. R., Daulton, S., Letham, B., Wilson, A. G., and Bakshy, E. (2020). BoTorch: A Framework for Efficient Monte-Carlo Bayesian Optimization. In Advances in Neural Information Processing Systems 33.
- Baydin et al. (2018) Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. (2018). Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(153):1–43.
- Bezanson et al. (2017) Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B. (2017). Julia: A fresh approach to numerical computing. SIAM review, 59(1):65–98.
- Bingham and Surjanovic (2013) Bingham, D. and Surjanovic, S. (2013). Optimization test problems. http://www.sfu.ca/~ssurjano/optimization.html. Accessed: 2021-05-18.
- Brochu et al. (2010) Brochu, E., Cora, V. M., and De Freitas, N. (2010). A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599.
- Bull (2011) Bull, A. D. (2011). Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12(10).
- Chaloner and Verdinelli (1995) Chaloner, K. and Verdinelli, I. (1995). Bayesian experimental design: A review. Statistical Science, pages 273–304.
- Constantine et al. (2014) Constantine, P. G., Dow, E., and Wang, Q. (2014). Active subspace methods in theory and practice: applications to kriging surfaces. SIAM Journal on Scientific Computing, 36(4):A1500–A1524.
- De G. Matthews et al. (2017) De G. Matthews, A. G., Van Der Wilk, M., Nickson, T., Fujii, K., Boukouvalas, A., León-Villagrá, P., Ghahramani, Z., and Hensman, J. (2017). Gpflow: A gaussian process library using tensorflow. J. Mach. Learn. Res., 18(1):1299–1304.
- De Roos et al. (2021) De Roos, F., Gessner, A., and Hennig, P. (2021). High-dimensional gaussian process inference with derivatives. In Meila, M. and Zhang, T., editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 2535–2545. PMLR.
- Dong et al. (2017) Dong, K., Eriksson, D., Nickisch, H., Bindel, D., and Wilson, A. G. (2017). Scalable log determinants for gaussian process kernel learning. In Advances in Neural Information Processing Systems, pages 6327–6337.
- Eriksson et al. (2018) Eriksson, D., Dong, K., Lee, E., Bindel, D., and Wilson, A. G. (2018). Scaling gaussian process regression with derivatives. In Advances in Neural Information Processing Systems, pages 6867–6877.
- Eriksson et al. (2019) Eriksson, D., Pearce, M., Gardner, J., Turner, R. D., and Poloczek, M. (2019). Scalable global optimization via local bayesian optimization. In Wallach, H., Larochelle, H., Beygelzimer, A., d’ Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
- Foster et al. (2019) Foster, A., Jankowiak, M., Bingham, E., Horsfall, P., Teh, Y. W., Rainforth, T., and Goodman, N. (2019). Variational bayesian optimal experimental design. In Wallach, H., Larochelle, H., Beygelzimer, A., d Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
- Frazier (2018) Frazier, P. I. (2018). A tutorial on bayesian optimization. arXiv preprint arXiv:1807.02811.
- Fu et al. (2013) Fu, Y., Zhu, X., and Li, B. (2013). A survey on instance selection for active learning. Knowledge and information systems, 35(2):249–283.
- Gal et al. (2017) Gal, Y., Islam, R., and Ghahramani, Z. (2017). Deep bayesian active learning with image data. In International Conference on Machine Learning, pages 1183–1192. PMLR.
- Gal and Turner (2015) Gal, Y. and Turner, R. (2015). Improving the gaussian process sparse spectrum approximation by representing uncertainty in frequency inputs. In International Conference on Machine Learning, pages 655–664. PMLR.
- Gardner et al. (2018a) Gardner, J., Pleiss, G., Weinberger, K. Q., Bindel, D., and Wilson, A. G. (2018a). Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration. In Advances in Neural Information Processing Systems, pages 7576–7586.
- Gardner et al. (2018b) Gardner, J., Pleiss, G., Wu, R., Weinberger, K., and Wilson, A. (2018b). Product kernel interpolation for scalable gaussian processes. In International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 1407–1416. PMLR.
- Griewank and Walther (2008) Griewank, A. and Walther, A. (2008). Evaluating derivatives: principles and techniques of algorithmic differentiation. SIAM.
- Hennig and Schuler (2012) Hennig, P. and Schuler, C. J. (2012). Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13(6).
- Innes (2018) Innes, M. (2018). Don’t unroll adjoint: Differentiating ssa-form programs. CoRR, abs/1810.07951.
- Innes et al. (2017) Innes, M., Barber, D., Besard, T., Bradbury, J., Churavy, V., Danisch, S., Edelman, A., Karpinski, S., Malmaud, J., Revels, J., Shah, V., Stenetorp, P., and Yuret, D. (2017). On machine learning and programming languages. https://julialang.org/blog/2017/12/ml-pl/. Accessed: 2021-05-18.
- Kandasamy et al. (2020) Kandasamy, K., Vysyaraju, K. R., Neiswanger, W., Paria, B., Collins, C. R., Schneider, J., Poczos, B., and Xing, E. P. (2020). Tuning hyperparameters without grad students: Scalable and robust bayesian optimisation with dragonfly. Journal of Machine Learning Research, 21(81):1–27.
- Karvonen et al. (2021) Karvonen, T., Cockayne, J., Tronarp, F., and Särkkä, S. (2021). A probabilistic taylor expansion with applications in filtering and differential equations. CoRR, abs/2102.00877.
- Kirschner et al. (2019) Kirschner, J., Mutny, M., Hiller, N., Ischebeck, R., and Krause, A. (2019). Adaptive and safe bayesian optimization in high dimensions via one-dimensional subspaces. In International Conference on Machine Learning, pages 3429–3438. PMLR.
- Klein et al. (2017) Klein, A., Falkner, S., Bartels, S., Hennig, P., and Hutter, F. (2017). Fast bayesian optimization of machine learning hyperparameters on large datasets. In Artificial Intelligence and Statistics, pages 528–536. PMLR.
- Lázaro-Gredilla (2012) Lázaro-Gredilla, M. (2012). Bayesian warped gaussian processes. Advances in Neural Information Processing Systems, 25:1619–1627.
- Lázaro-Gredilla et al. (2010) Lázaro-Gredilla, M., Quinonero-Candela, J., Rasmussen, C. E., and Figueiras-Vidal, A. R. (2010). Sparse spectrum gaussian process regression. The Journal of Machine Learning Research, 11:1865–1881.
- Li et al. (2018) Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. (2018). Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185):1–52.
- Maclaurin et al. (2015) Maclaurin, D., Duvenaud, D., and Adams, R. (2015). Gradient-based hyperparameter optimization through reversible learning. In International conference on machine learning, pages 2113–2122. PMLR.
- Malkomes and Garnett (2018) Malkomes, G. and Garnett, R. (2018). Automating bayesian optimization with bayesian optimization. Advances in Neural Information Processing Systems, 31:5984–5994.
- Marmin et al. (2018) Marmin, S., Ginsbourger, D., Baccou, J., and Liandrat, J. (2018). Warped gaussian processes and derivative-based sequential designs for functions with heterogeneous variations. SIAM/ASA Journal on Uncertainty Quantification, 6(3):991–1018.
- Martinez-Cantin (2014) Martinez-Cantin, R. (2014). Bayesopt: A bayesian optimization library for nonlinear optimization, experimental design and bandits. Journal of Machine Learning Research, 15(115):3915–3919.
- Martinez-Cantin et al. (2018) Martinez-Cantin, R., Tee, K., and McCourt, M. (2018). Practical bayesian optimization in the presence of outliers. In International Conference on Artificial Intelligence and Statistics, pages 1722–1731. PMLR.
- Mutny and Krause (2018) Mutny, M. and Krause, A. (2018). Efficient high dimensional bayesian optimization with additivity and quadrature fourier features. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.
- Paciorek (2003) Paciorek, C. (2003). Nonstationary Gaussian Processes for Regression and Spatial Modelling. PhD thesis, Carnegie Mellon University, Pittsburgh, Pennsylvania.
- Padidar et al. (2021) Padidar, M., Zhu, X., Huang, L., Gardner, J. R., and Bindel, D. (2021). Scaling gaussian processes with derivative information using variational inference. arXiv preprint arXiv:2107.04061.
- Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. (2019). Pytorch: An imperative style, high-performance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., d Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.
- Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
- Powell and Ryzhov (2012) Powell, W. B. and Ryzhov, I. O. (2012). Optimal learning, volume 841. John Wiley & Sons.
- Prabuchandran et al. (2021) Prabuchandran, K. J., Santosh, P., Chandramouli, K., and Shalabh, B. (2021). Novel first order bayesian optimization with an application to reinforcement learning. Applied Intelligence, 51:1–15.
- Rahimi et al. (2007) Rahimi, A., Recht, B., et al. (2007). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, volume 3, page 5. Citeseer.
- Rasmussen and Williams (2005) Rasmussen, C. E. and Williams, C. K. I. (2005). Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.
- Revels et al. (2016) Revels, J., Lubin, M., and Papamarkou, T. (2016). Forward-mode automatic differentiation in Julia. arXiv:1607.07892 [cs.MS].
- Riihimäki and Vehtari (2010) Riihimäki, J. and Vehtari, A. (2010). Gaussian processes with monotonicity information. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 645–652. JMLR Workshop and Conference Proceedings.
- Ryan et al. (2022) Ryan, J. P., Ament, S. E., Gomes, C. P., and Damle, A. (2022). The fast kernel transform. In International Conference on Artificial Intelligence and Statistics, pages 11669–11690. PMLR.
- Schoenberg (1938) Schoenberg, I. J. (1938). Metric spaces and completely monotone functions. Annals of Mathematics, 39(4):pp. 811–841.
- Settles (2009) Settles, B. (2009). Active learning literature survey.
- Shahriari et al. (2016) Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and de Freitas, N. (2016). Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175.
- Shekhar and Javidi (2021) Shekhar, S. and Javidi, T. (2021). Significance of gradient information in bayesian optimization. In International Conference on Artificial Intelligence and Statistics, pages 2836–2844. PMLR.
- Snelson et al. (2004) Snelson, E., Rasmussen, C. E., and Ghahramani, Z. (2004). Warped gaussian processes. Advances in neural information processing systems, 16:337–344.
- Snoek et al. (2012) Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical bayesian optimization of machine learning algorithms. arXiv preprint arXiv:1206.2944.
- Snoek et al. (2015) Snoek, J., Rippel, O., Swersky, K., Kiros, R., Satish, N., Sundaram, N., Patwary, M., Prabhat, M., and Adams, R. (2015). Scalable bayesian optimization using deep neural networks. In International conference on machine learning, pages 2171–2180. PMLR.
- Solak et al. (2003) Solak, E., Murray-Smith, R., Leithead, W. E., Leith, D. J., and Rasmussen, C. E. (2003). Derivative observations in gaussian process models of dynamic systems.
- Solin et al. (2018) Solin, A., Kok, M., Wahlström, N., Schön, T. B., and Särkkä, S. (2018). Modeling and interpolation of the ambient magnetic field by gaussian processes. IEEE Transactions on Robotics, 34(4):1112–1127.
- Srinivas et al. (2012) Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. W. (2012). Information-theoretic regret bounds for gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250–3265.
- Törn and Zilinskas (1989) Törn, A. and Zilinskas, A. (1989). Global optimization.
- Tuia et al. (2011) Tuia, D., Volpi, M., Copa, L., Kanevski, M., and Munoz-Mari, J. (2011). A survey of active learning algorithms for supervised remote sensing image classification. IEEE Journal of Selected Topics in Signal Processing, 5(3):606–617.
- Verma (1998) Verma, A. (1998). Structured automatic differentiation. Technical report, Cornell University.
- Wang and Shang (2014) Wang, D. and Shang, Y. (2014). A new active labeling method for deep learning. In 2014 International joint conference on neural networks (IJCNN), pages 112–119. IEEE.
- Wang et al. (2018) Wang, F., Decker, J., Wu, X., Essertel, G., and Rompf, T. (2018). Backpropagation with callbacks: Foundations for efficient and expressive differentiable programming. Advances in Neural Information Processing Systems, 31:10180–10191.
- Wang et al. (2019) Wang, K., Pleiss, G., Gardner, J., Tyree, S., Weinberger, K. Q., and Wilson, A. G. (2019). Exact gaussian processes on a million data points. In Advances in Neural Information Processing Systems, pages 14648–14659.
- Wang et al. (2013) Wang, Z., Zoghi, M., Hutter, F., Matheson, D., De Freitas, N., et al. (2013). Bayesian optimization in high dimensions via random embeddings. In IJCAI, pages 1778–1784.
- Wilson and Adams (2013) Wilson, A. and Adams, R. (2013). Gaussian process kernels for pattern discovery and extrapolation. In Dasgupta, S. and McAllester, D., editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1067–1075, Atlanta, Georgia, USA. PMLR.
- Wilson and Nickisch (2015) Wilson, A. and Nickisch, H. (2015). Kernel interpolation for scalable structured gaussian processes (kiss-gp). In International Conference on Machine Learning, pages 1775–1784.
- Wilson et al. (2016) Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P. (2016). Deep kernel learning. In Artificial intelligence and statistics, pages 370–378. PMLR.
- Wu et al. (2017a) Wu, A., Aoi, M. C., and Pillow, J. W. (2017a). Exploiting gradients and hessians in bayesian optimization and bayesian quadrature. arXiv preprint arXiv:1704.00060.
- Wu and Frazier (2019) Wu, J. and Frazier, P. (2019). Practical two-step lookahead bayesian optimization. In Wallach, H., Larochelle, H., Beygelzimer, A., d Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
- Wu et al. (2017b) Wu, J., Poloczek, M., Wilson, A. G., and Frazier, P. (2017b). Bayesian optimization with gradients. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
- Zheng et al. (2020) Zheng, S., Hayden, D., Pacheco, J., and Fisher III, J. W. (2020). Sequential bayesian experimental design with variable cost structure. Advances in Neural Information Processing Systems, 33.
Appendix A Differentiability of Gaussian Processes
The following results are summarized in (Paciorek, 2003) and are originally due to Adler (1981).
Theorem A.1 (Mean-Square Differentiability, Adler (1981)).
A Gaussian process $f$ with covariance function $k$ has a mean-square partial derivative $\partial f(x) / \partial x_i$ at $x$ if and only if $\partial^2 k(x, x') / \partial x_i \partial x_i'$ exists at $x = x'$ and is finite.
See proof of Theorem 2.2.2 in (Adler, 1981). ∎
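As a concrete illustration of the criterion (our example, not part of Adler's statement), consider the one-dimensional squared-exponential kernel:

```latex
% For k(x, x') = \exp(-(x - x')^2 / (2 \ell^2)), the mixed derivative is
\frac{\partial^2 k}{\partial x \, \partial x'}
  = \left( \frac{1}{\ell^2} - \frac{(x - x')^2}{\ell^4} \right)
    \exp\left( -\frac{(x - x')^2}{2 \ell^2} \right),
% which equals 1 / \ell^2 at x = x', so the criterion is satisfied and the
% process is mean-square differentiable. For the exponential kernel
% \exp(-|x - x'| / \ell), the mixed derivative does not exist at x = x',
% and the criterion fails.
```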
Theorem A.2 (Sample-Path Continuity, Adler (1981)).
A stochastic process $f$ on a compact subset of $\mathbb{R}^d$ is sample-path continuous if, for some $0 < C < \infty$ and $\epsilon > 0$,
$\mathbb{E}\left[ |f(x) - f(y)|^2 \right] \leq C \, \big| \log \|x - y\| \big|^{-(3 + \epsilon)}$ for all $x, y$.
See Corollary of Theorem 3.2.5 in (Adler, 1981). ∎
Theorem A.3 (Sample-Path Continuity, Adler (1981)).
A Gaussian process $f$ on a compact subset of $\mathbb{R}^d$ is sample-path continuous if, for some $0 < C < \infty$ and $\epsilon > 0$,
$\mathbb{E}\left[ |f(x) - f(y)|^2 \right] \leq C \, \big| \log \|x - y\| \big|^{-(1 + \epsilon)}$ for all $x, y$.
See Corollary of Theorem 3.2.5 in (Adler, 1981). ∎
The following result states that every positive-definite isotropic kernel can be expressed as a scale-mixture of Gaussian kernels, which aids the derivation of their differentiability properties. In particular,
Theorem A.4 (Schoenberg (1938)).
Suppose an isotropic kernel $k(x, x') = f(\|x - x'\|^2)$ is positive-definite on a Hilbert space. Then there is a non-decreasing and bounded $\mu$ such that
$f(r^2) = \int_0^\infty e^{-r^2 s} \, \mathrm{d}\mu(s).$
We refer to $s$ as the scale parameter.
This is due to Theorem 2 of Schoenberg (1938). ∎
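To make the scale-mixture representation concrete, the following sketch numerically verifies it for the exponential kernel $f(r) = e^{-r}$, whose mixing density is $\mu'(s) = s^{-3/2} e^{-1/(4s)} / (2\sqrt{\pi})$ (a standard identity; the function name and integration grid below are our own):

```python
import numpy as np

# Illustrative numerical check of Theorem A.4 for the exponential kernel:
# exp(-r) = int_0^inf exp(-r^2 s) * s^(-3/2) exp(-1/(4s)) / (2 sqrt(pi)) ds.

def exponential_kernel_as_gaussian_mixture(r, num=200000):
    # Trapezoidal rule on a log-spaced grid; the integrand decays rapidly
    # for both small and large scale parameters s.
    s = np.logspace(-8, 4, num)
    density = s ** (-1.5) * np.exp(-1.0 / (4.0 * s)) / (2.0 * np.sqrt(np.pi))
    integrand = np.exp(-r ** 2 * s) * density
    return np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(s))

for r in [0.5, 1.0, 2.0]:
    approx = exponential_kernel_as_gaussian_mixture(r)
    print(f"r={r}: mixture={approx:.6f}, exp(-r)={np.exp(-r):.6f}")
```

The two columns agree to the accuracy of the quadrature, illustrating that even the non-differentiable exponential kernel is a scale mixture of (infinitely differentiable) Gaussian kernels; differentiability hinges on the behavior of the mixing measure, not on the Gaussian components.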
Sample function (or almost sure) differentiability is a stronger property that requires a more subtle analysis. Paciorek (2003) provided the following result guaranteeing path differentiability for isotropic kernels.
Theorem A.5 (Sample-Path Differentiability, Paciorek (2003)).
This is essentially Theorem 10 in (Paciorek, 2003). ∎
Paciorek (2003) further uses this result to prove that the exponentiated quadratic and rational quadratic kernels are infinitely sample-path differentiable, while Matérn kernels are finitely sample-path differentiable. A number of kernels, such as the exponential (Matérn-1/2) kernel, do not give rise to differentiable paths and thus cannot be used in conjunction with gradient information. Notably, even the Matérn-3/2 kernel, whose associated mean function is differentiable, does not give rise to differentiable sample paths.
Appendix B Explicit Derivation of Gradient Structure of Neural Network Kernel
The point of this section is to show an explicit derivation of the gradient structure of the neural network kernel, a manual derivation that is rendered unnecessary by the structure-deriving AD engine proposed in this work.
Another interesting class of kernels are those that arise from analyzing the limit of certain neural network architectures as the number of hidden units tends to infinity. It is known that a number of neural networks converge to a Gaussian process under such a limit. For example, if the error function is chosen as the non-linearity for a neural network with one hidden layer, the kernel of the limiting process has the following form:
$k(x, x') = \frac{2}{\pi} \sin^{-1}\left( \frac{2 \tilde{x}^\top \Sigma \tilde{x}'}{\sqrt{(1 + 2 \tilde{x}^\top \Sigma \tilde{x})(1 + 2 \tilde{x}'^\top \Sigma \tilde{x}')}} \right),$
where $\tilde{x} = (1, x^\top)^\top$ is the input augmented with a bias entry. Formally, this is similar but not equivalent to the inner-product kernels discussed above. Differentiating the kernel gives rise to a more complex gradient kernel structure: a rank-two correction to the identity, compared to the rank-one corrections for isotropic and dot-product kernels above.
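To make the computational benefit of such low-rank-corrected structure concrete, here is a small sketch (naming and setup are ours, not the paper's) that multiplies a $d \times d$ block of the form $\alpha I + \sum_i \beta_i u_i v_i^\top$ with a vector in $O(d)$ per rank-one term, and checks it against the dense block:

```python
import numpy as np

def structured_matvec(alpha, corrections, x):
    """Multiply (alpha * I + sum_i beta_i * u_i v_i^T) @ x lazily.

    Each rank-one term costs O(d) instead of the O(d^2) a dense block needs.
    """
    out = alpha * x
    for beta, u, v in corrections:
        out += beta * (v @ x) * u  # inner product, then scaled vector: O(d)
    return out

rng = np.random.default_rng(0)
d = 5
alpha = 2.0
corrections = [(0.7, rng.standard_normal(d), rng.standard_normal(d)),
               (-1.3, rng.standard_normal(d), rng.standard_normal(d))]
dense = alpha * np.eye(d) + sum(b * np.outer(u, v) for b, u, v in corrections)
x = rng.standard_normal(d)
assert np.allclose(structured_matvec(alpha, corrections, x), dense @ x)
```

With one rank-one term this models the isotropic and dot-product cases, and with two terms the rank-two correction of the neural network kernel above.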
Appendix C Hessian Structure
Note that for arbitrary vectors $u \in \mathbb{R}^m$ and $v \in \mathbb{R}^n$, not necessarily of the same length, $u v^\top = u \otimes v^\top = v^\top \otimes u$. This will come in handy to simplify certain expressions in the following.
First, note that
where $S$ is a "shuffle" (commutation) matrix such that $S \operatorname{vec}(A) = \operatorname{vec}(A^\top)$, and for square matrices $A$ and $B$, the Kronecker sum is defined as $A \oplus B = A \otimes I + I \otimes B$. Then for dot-product kernels, we have
Then for isotropic kernels with $k(x, x') = f(\|x - x'\|^2)$, we have
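The vectorization identities invoked above can be sanity-checked numerically. A minimal sketch (function names are ours), assuming the column-stacking convention for $\operatorname{vec}$:

```python
import numpy as np

# Checks: vec() is column-stacking, the "shuffle"/commutation matrix S
# satisfies S @ vec(A) = vec(A.T), and the Kronecker sum A (+) B acts on
# vec(X) as the Sylvester operator X -> X A^T + B X.

def vec(A):
    return A.flatten(order="F")  # column-stacking vectorization

def shuffle_matrix(m, n):
    # Permutation S with S @ vec(A) = vec(A.T) for any (m, n) matrix A.
    S = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            S[i * n + j, j * m + i] = 1.0
    return S

def kron_sum(A, B):
    return np.kron(A, np.eye(B.shape[0])) + np.kron(np.eye(A.shape[0]), B)

rng = np.random.default_rng(3)
u, v = rng.standard_normal(4), rng.standard_normal(6)
# u v^T equals the Kronecker product of u (column) with v^T (row):
assert np.allclose(np.outer(u, v), np.kron(u[:, None], v[None, :]))

M = rng.standard_normal((4, 6))
assert np.allclose(shuffle_matrix(4, 6) @ vec(M), vec(M.T))

A, B, X = (rng.standard_normal((5, 5)) for _ in range(3))
assert np.allclose(kron_sum(A, B) @ vec(X), vec(X @ A.T + B @ X))
```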
A Chain Rule
If the kernel is composed with a scalar-valued function, then
Again, we observe a structured representation of the Hessian-kernel elements, which permits a multiply in $O(n^2 d^2)$ operations.
We therefore see that the Hessian kernel matrix admits a representation in terms of a block-diagonal matrix of structured blocks. Note that for linearly warped kernels for which $k(x, x') = \kappa(W x, W x')$, where $W \in \mathbb{R}^{m \times d}$, the structure is inherited, so that we can still multiply with the kernel matrix efficiently. The efficiency is due to the following property of the Kronecker product:
$(A \otimes B) \operatorname{vec}(X) = \operatorname{vec}(B X A^\top),$
which can be computed without explicitly forming $A \otimes B$, for each of the Hessian observations.
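This property is easy to exercise directly; a small sketch (our own naming) that multiplies with $A \otimes B$ without ever materializing the $d^2 \times d^2$ matrix:

```python
import numpy as np

def kron_matvec(A, B, x):
    """Compute (A kron B) @ x via vec(B X A^T), avoiding the d^2 x d^2 matrix.

    For d x d factors this reduces an O(d^4) multiply to O(d^3).
    """
    d_b, d_a = B.shape[1], A.shape[1]
    X = x.reshape(d_b, d_a, order="F")   # un-vec using column-stacking
    return (B @ X @ A.T).flatten(order="F")

rng = np.random.default_rng(1)
d = 4
A, B = rng.standard_normal((d, d)), rng.standard_normal((d, d))
x = rng.standard_normal(d * d)
assert np.allclose(kron_matvec(A, B, x), np.kron(A, B) @ x)
```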
Appendix D Combining Derivative Orders
Combining observations of the function values and its first and second derivatives is straightforward via the following block-structured kernel:
If all constituent blocks permit a fast multiply - $O(n^2 d)$ for gradient-related and $O(n^2 d^2)$ for Hessian-related blocks - the entire structure permits an $O(n^2 d^2)$ multiply, even though the naïve cost is $O(n^2 d^4)$. If only value and gradient observations are required, only the top-left two-by-two block is necessary; the associated multiply can be carried out in $O(n^2 d)$ in the structured case, and we implemented it as the ValueGradientKernel.
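The blockwise multiply can be organized generically: given a fast matvec routine for each block, the full product is assembled block by block. A minimal dense-backed sketch (our own construction; small random matrices stand in for the actual value, gradient, and Hessian kernel blocks):

```python
import numpy as np

def block_matvec(blocks, xs):
    """y_i = sum_j K_ij @ x_j, where blocks[i][j] is any callable matvec."""
    return [sum(Kij(xj) for Kij, xj in zip(row, xs)) for row in blocks]

rng = np.random.default_rng(2)
n, d = 3, 2
sizes = [n, n * d, n * d * d]          # value, gradient, Hessian block sizes
mats = [[rng.standard_normal((si, sj)) for sj in sizes] for si in sizes]
# Wrap each block as a matvec closure; in the structured case these closures
# would exploit the low-rank / Kronecker structure instead of dense products.
blocks = [[(lambda M: (lambda v: M @ v))(M) for M in row] for row in mats]
xs = [rng.standard_normal(s) for s in sizes]

dense = np.block(mats)
y = np.concatenate(block_matvec(blocks, xs))
assert np.allclose(y, dense @ np.concatenate(xs))
```

Because the composition only touches the blocks through their matvec closures, swapping in structured fast multiplies per block immediately yields the fast multiply for the combined kernel matrix.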
Recall that the computational complexity of multiplying with the gradient and Hessian kernel matrices is $O(n^2 d)$ and $O(n^2 d^2)$, respectively. Thus, the gradient-based method can only make a factor of