Code for our AISTATS 2020 paper "Distributionally Robust Bayesian Quadrature Optimization"
Bayesian quadrature optimization (BQO) maximizes the expectation of an expensive black-box integrand taken over a known probability distribution. In this work, we study BQO under distributional uncertainty in which the underlying probability distribution is unknown except for a limited set of its i.i.d. samples. A standard BQO approach maximizes the Monte Carlo estimate of the true expected objective given the fixed sample set. Though Monte Carlo estimate is unbiased, it has high variance given a small set of samples; thus can result in a spurious objective function. We adopt the distributionally robust optimization perspective to this problem by maximizing the expected objective under the most adversarial distribution. In particular, we propose a novel posterior sampling based algorithm, namely distributionally robust BQO (DRBQO) for this purpose. We demonstrate the empirical effectiveness of our proposed framework in synthetic and real-world problems, and characterize its theoretical convergence via Bayesian regret.READ FULL TEXT VIEW PDF
Code for our AISTATS 2020 paper "Distributionally Robust Bayesian Quadrature Optimization"
Making robust decisions in the face of parameter uncertainty is critical to many real-world decision problems in machine learning, engineering and economics. Besides the uncertainty that is inherent in data, a further difficulty arises due to the uncertainty in the context. A common example is hyperparameter selection of machine learning algorithms where cross-validation is performed using a small to medium sized validation set. Due to limited size of validation set, the variance across different folds might be high. Ignoring this uncertainty results in sub-optimal and non-robust decisions. This problem in practice can be further exacerbated as the outcome measurements may be noisy and the black-box function itself is expensive to evaluate. Being risk-averse is critical in such settings.
One way to capture the uncertainty in context is through a probability distribution. In this work, we consider stochastic black-box optimization that is distributionally robust to the uncertainty in context. We formulate the problem as
where is an expensive black-box function and is a distribution over context . We assume distributional uncertainty in which the distribution is known only through a limited set of its i.i.d samples . This is equivalent to the scenario in which we are able to evaluate only on during optimization.
In the case that is known (e.g., is either available in an analytical form or easy to evaluate), a standard solution to the problem in Equation (1) is based on Bayesian quadrature [o1991bayes; nipsRasmussenG02; oates2016probabilistic; OatesS19]. The main idea in this approach is that we can build a Gaussian Process (GP) model of and use the known relationship in the integral to imply a second GP model of . This is possible because integration is a linear operator.
Given the distributional uncertainty in which is only known through a limited set of its samples, a naive approach to the problem in Equation (1) is to maximize its Monte Carlo estimate:
where and is the Dirac distribution. When is sufficiently large, approximates
reasonably well as guaranteed by the weak law of large numbers; thus, the optimal solution ofrepresents that of . In contrast, when is small, the optimal solution of might be sub-optimal to . Since we are considering distributional perturbation, we cannot guarantee the Monte Carlo estimate to be a good surrogate objective.
A more conservative approach from statistical learning is to maximize the variance-regularized objective:
where denotes the empirical variance and is a constant determining the trade-off between bias and variance. Thus, given the context of limited samples, it is logical to use instead of as a surrogate objective for maximizing . However, unlike , the variance term in breaks the linear relationship with respect to . As a result, though is a GP, need not be [o1991bayes].
Alternatively, we approach the distributional uncertainty problem above by formulating the distributionally robust Bayesian quadrature optimization. In the face of the uncertainty about , we seek to find a distributionally robust solution under the most adversarial distribution. Our approach is based on solving a surrogate distributionally robust optimization problem generated by posterior sampling at each time step. The surrogate optimization is solved efficiently via bisection search through any optimization. We demonstrate the efficiency of our algorithm in both synthetic and real-world problems. Our contributions are:
Demonstrating the efficiency of DRBQO in finding distributionally robust solutions in both synthetic and real-world problems (Section 6).
Our work falls in the area of Bayesian quadrature optimization whose goal is to perform black-box global optimization of an expected objective of the form . This type of problems is known with various names such as optimization of integrated response functions [Williams:2000:SDC:932015], multi-task Bayesian optimization [DBLP:conf/nips/SwerskySA13], and optimization with expensive integrands [Toscano_IntegralBO_18]. This direction approaches the problem by evaluating at one or several values of given . This ameliorates the need of evaluating at all the values of and can outperform methods that evaluate the full objective via numerical quadrature [FrazierBOtut18; Toscano_IntegralBO_18]. All the previous approaches assume the knowledge of the distribution in the expected function. The distinction of our formulation is that we are interested in the distributional uncertainty scenario in which the underlying distribution is unknown except its empirical estimate.
Our work also shares similarity with the distributionally robust optimization (DRO) literature [DRO_review]. This problem setup considers the parameter uncertainty in real-world decision making problems. The uncertainty may be due to limited data and noisy measurements. DRO takes into account this uncertainty and approaches the problem by taking the worst-case of the underlying distribution within an uncertainty set of distributions. DRO variants distinguish each other in design choices of the distributional uncertainty set and in problem contexts. Regarding the design of uncertainty sets, common designs specify the set of distributions with respect to the nominal distribution via distributional discrepancy such as divergence [NamkoongD16], Wasserstein distance [WasserstainDRO19], and Maximum Mean Discrepancy [MMD_DRO19]. Regarding studying DRO in different problem contexts, the following contexts have been investigated: robust optimization [DBLP:journals/mansci/Ben-TalHWMR13], robust risk minimization [NamkoongD16], sub-modular maximization [DBLP:conf/aistats/StaibWJ19], boosting algorithms [DROBoosting], graphical models [DBLP:conf/nips/FathonyRBZZ18], games [sun2018distributional; DBLP:conf/aistats/Zhu0WGY19], fairness in machine learning [DBLP:conf/icml/HashimotoSNL18]nipsXuM10]
, and reinforcement learning[DRRL]. The distinction of our work is in terms of the problem context where we study DRO in Bayesian quadrature optimization.
Model. Let be an element of a reproducing kernel Hilbert space (RKHS) where : is a positive-definite kernel, and and are, unless explicitly mentioned otherwise, compact domains in and for some dimensions and , respectively. We further assume that is continuous and bounded from above by , and that for some . Two commonly used kernels are Squared Exponential (SE) and Matérn [RasmussenW06] which are similarly defined on as follows:
where and are the length scales, defines the smoothness in the Matérn kernel, and define the Bessel function and the gamma function, respectively, and .
Let be a distribution on , and be a fixed set of samples drawn from . Though is defined on , we are interested in the distributional uncertainty scenario in which we can query only on during optimization. At time , we query at and observe a noisy reward , where . Our goal is to find a robust solution point such that remains high even under the most adversarial realization of the unknown distribution .
Given a sequence of noisy observations , the posterior distribution under a GP(0, ) prior is also also a GP with the following posterior mean and covariance:
where , , and is the kernel matrix.
We define the quadrature functional as
for any conditional distribution on for all , i.e., . As an extended result of Bayesian quadrature [o1991bayes], for any conditional distribution , also follows a GP with the following mean and variance:
Optimization goal. We seek to optimize the expected function under the most adversarial distribution over some distributional uncertainty set :
where is the empirical distribution, is the confidence radius around the empirical distribution with respect to a distribution divergence such as Wasserstein distance, maximum mean discrepancy, and -divergence. We can interpret as the set of perturbed distributions with respect to the empirical distribution within a confidence radius . We then seek a robust solution in the face of adversarial distributional perturbation within .
For any distribution divergence choice , we define a -robust point to be any such that
Our goal is to report after time a distributionally robust point in the sense that it has small -regret, which is defined as
While our framework in this work can be adopted to various distribution divergences, we focus on the specific case when is -divergence: . From here on, we refer as the ball with being -divergence. In particular, the distributionally robust optimization problem in Equation (7) is equivalent to the variance-regularized optimization in Equation (3) when the variance is sufficiently high, as justified by the following theorem:
Let be a random variable (e.g.,
be a random variable (e.g.,for any fixed ), , , , , and . Then Especially if , then with probability at least .
The intuition for this equivalence is that the ball and the variance penalty term in Equation (3) are both quadratic [DBLP:conf/aistats/StaibWJ19]. Figure 2 illustrates balls with various radii on the -dimensional simplex.
Failure of standard methods. Various methods have been developed for achieving small regret in maximizing for some distribution [Williams:2000:SDC:932015; DBLP:conf/nips/SwerskySA13; Toscano_IntegralBO_18]. These methods leverage the relationships in Equation (5) and (6) to infer the posterior mean and variance of the expected function from those of . The inferred posterior mean and variance for are then used in certain ways to acquire new points. While this is useful in the standard setting when we know , it is not useful when we only have the empirical distribution . Specifically, an optimal solution found by these methods in the problem associated with the empirical distribution may be sub-optimal to that associated with the true distribution .
An illustrative example is depicted in Figure 1 where the averaged trajectories of our proposed DRBQO (detailed in Section 4) and a standard BQO baseline (detailed in Section 6) are also shown. Due to a limited number of samples of , the Monte Carlo estimate results in a spurious expected objective in this case. By resorting to the empirical distribution constructed from the limited set of samples, the standard BQO baseline ignores the distributional uncertainty and converges to the optimum of the spurious expected objective. The same limitation applies to the standard BQO optimization methods, e.g., Williams:2000:SDC:932015; DBLP:conf/nips/SwerskySA13; Toscano_IntegralBO_18; DBLP:conf/wsc/PearceB17a whose goal is to find a global non-robust maximum.
Our main proposed algorithm is presented in Algorithm 1. In the standard Bayesian quadrature problem in Equation (2), we can easily adopt standard Bayesian optimization algorithms such as expected improvement (EI) [mockus1978application] and an upper confidence bound (UCB) (e.g., GP-UCB [DBLP:conf/icml/SrinivasKKS10]) using quadrature relationships in Equation (5) and (6) [DBLP:conf/nips/SwerskySA13]. However, like in Equation (3), does not follow a GP if follows a GP. This difficulty hinders the adoption of EI-like and UCB-like algorithms to our setting. We overcome this problem using posterior sampling [DBLP:journals/mor/RussoR14].
The main idea of our algorithm is to sample and solve a surrogate distributionally robust optimization problem at each step guided by posterior sampling (lines 1 and 1 in Algorithm 1). In practice, we follow DBLP:conf/nips/Hernandez-LobatoHG14 to perform posterior sampling (line 1 in Algorithm 1). Similar to the way posterior sampling is applied to standard Bayesian optimization problem [DBLP:conf/nips/Hernandez-LobatoHG14], a new point is selected according to the probability it is optimal in the sense of distributional robustness. One of the advantages of posterior sampling is that it avoids the need for confidence bound such as UCB. This is useful for our setting because the non-Gaussian nature of the distributionally robust objective makes it difficult to construct a deterministic confidence upper bound.
Due to the convexity of the expectation with respect to a distribution, we can efficiently compute the value (therefore the gradients) of the inner minimization in line 8 of Algorithm 1 in an analytical form via Lagrangian multipliers, as presented in Proposition 1.
The constrained minimization is a convex optimization problem which forms the Lagrangian: where , , and .
The system of linear equations in the proposition emerges from Karush-Kuhn-Tucker (KKT) conditions and simple rearrangements. Note that since the primal problem is convex, the duality gap is zero and the KKT conditions are the sufficient and necessary conditions for the primal problem.
Notice the first two equations that we can compute in terms of . These are then substituted into Equation (10) to solve for . In practice, we can use bisection search [NamkoongD16] to solve for satisfying Equation (10). The details of this algorithm and of Proposition 1 are presented in the supplementary material. ∎
For the sake of analysis, we adopt the definition of the -period regret and Bayesian regret from DBLP:journals/mor/RussoR14 to our setting. In particular, we define a policy as a mapping from the history to where .
The -period regret of a policy is defined by
where , , and for all .
The -period Bayesian regret of a policy is the expectation of the regret with respect to the prior over ,
For simplicity, we focus our analysis on the case where is finite and is a finite subset of the ball of radius . Similar to DBLP:conf/icml/SrinivasKKS10, the results can be extended to infinite sets and the entire ball using discretization trick of DBLP:conf/icml/SrinivasKKS10 as long as a smoothness condition (i.e., the partial derivatives of are bounded with high probability) is satisfied (see Theorem 2 in DBLP:conf/icml/SrinivasKKS10).
Assume is a finite subset of , and is a finite subset of the ball of radius . Let be the DRBQO policy presented in Algorithm 1, be the maximum information gain defined in DBLP:conf/icml/SrinivasKKS10, then for all ,
Note that can be bounded for three common kernels: linear, SE and Matérn kernels in DBLP:conf/icml/SrinivasKKS10. Using these bounds, Theorem 2 suggests that DRBQO has sublinear Bayesian regret for common kernels such as linear, SE and Matérn kernels.
We leverage two proof techniques from DBLP:journals/mor/RussoR14 to derive this bound including posterior sampling regret decomposition and the connection between posterior sampling and UCB. However, an extension from the Bayesian regret bound to our case is non-trivial. The main difficulty is that the -robust quadrature distributions are random variables and the resulting quadrature does not follow a GP. We overcome this difficulty by decomposing the range of
into a set of carefully designed disjoint subsets, using several concentration inequalities for Gaussian distributions, and leveraging the mild assumptions offrom the problem setup. The details are presented in the supplementary material. ∎
In this section, we empirically validate the performance of DRBQO by comparing against several baselines in synthetic and -fold cross-validation hyperparameter tuning experiments.
We focus on the BQO baselines that directly substitute the inferred posterior mean (in Equation (5)) and variance (in Equation (6)) of into any standard acquisition (e.g., EI and GP-UCB) to achieve small regret in maximizing . More advanced BQO baseline methods, e.g., [Toscano_IntegralBO_18] are expected to perform poorly in the distributional uncertainty setting because they are not set out to account for the robust solutions. There is a distinction between sampled points and report points by each baseline algorithm. A sampled point is a suggested point regarding where to sample next while a report point is chosen from all the sampled points (up to any iteration) based on the objective function that an algorithm aims at optimizing. In standard noiseless Bayesian optimization, sampled points and report points are identical. However, this is not necessarily the case in BQO where the objective function has expectation form and is not directly queried. In particular, we consider the following baselines:
MTBO: Multi-task Bayesian optimization [DBLP:conf/nips/SwerskySA13] is a typical BQO algorithm in which the inferred posterior mean and variance are plugged into the standard EI acquisition to select . In addition, each in this case represents a task and MTBO uses multi-task kernels to model the task covariance. Conditioned on , is selected such that the corresponding task yields the highest EI. We include MTBO only in the cross-validation hyperparameter tuning experiments.
BQO-EI: This algorithm is similar to MTBO except for two distinctions. First, is selected such that it yields the highest posterior variance on , similar to our algorithm (see line 1 in Algorithm 1). Second, this uses kernels defined on the Cartesian product space instead of the multi-task kernels as in MTBO. In addition, the report point at time is .
Maximin-BQO-EI: This method is the same as BQO-EI except that the report point is .
BQO-TS: This method is a non-robust version of our proposed DRBQO. The only distinctions between BQO-TS and DRBQO are in the way is selected (line 1 of Algorithm 1) and the way a report point is chosen. In BQO-TS, is selected with respect to the empirical distribution as follows: , and the report point at time is chosen as .
Maximin-BQO-TS: This is the same as BQO-TS except that the final report point is .
Emp-DRBQO: This is the same as DRBQO except that the report point is chosen as .
Synthetic function. The distributional uncertainty problem is more pronounced when is more significantly distinct across different values of , i.e., experiences high variance along the dimension of
. Inspired by the logistic regression and the experimental evaluation from the original variance-based regularization work[NamkoongD16], we use a logistic form for synthetic function: , where . The true distribution is the standard Gaussian . In this example, we use for better visualization. We sample values of from and fix this set for the empirical distribution . The true expected function and the empirical (Monte Carlo) estimate function are illustrated in Figure 1 (a) and (b), respectively. In this illustration, the Monte Carlo estimate function catastrophically shifts the true optimum to a spurious point due to the limited data in estimating .
We initialize the comparative algorithms by selecting uniformly random inputs , and we keep these initial points the same for all the algorithms. We use the squared exponential kernel defined on the Cartesian product space of and . We normalize the input and output values to the unit cube, and resort to marginal maximum likelihood to learn the GP hyperparameters [RasmussenW06] every time we acquire a new observation. The time horizon for all the algorithms is
. We report the results using two evaluation metrics: the-regret as defined in Equation (9) and the value of the empirical expected function evaluated at point reported by an algorithm at time . The former metric quantifies how close a certain point is to the distributionally robust solution while the latter measures the performance of each algorithm from a perspective of the empirical distribution. We repeat the experiment times and report the average mean and the confidence interval for each evaluation metric.
The first results are presented in Figure 3. We report over a range of values capturing the degree of conservativeness against the distributional uncertainty. Note that if , it represents the most conservative case as the ball covers the entire -dimensional simplex. We observe from Figure 3 that DRBQO significantly outperforms the baselines in this experiment. Also notice that when we increase the conservativeness requirement (i.e., increasing the values of ), the standard BQO baselines have higher -regret. This is because the standard BQO baselines are rigid and do not allow for any conservativeness in the optimization. Therefore, these algorithms converge to the optimum of the spurious Monte Carlo estimate function.
We highlight the comparative algorithms in the second metric in Figure 4 where we report the value of the empirical expected function at each point reported by each algorithm at time . Since the BQO baselines are set out to maximize the Monte Carlo estimate function, they achieve higher values in this metric than DRBQO. However, the non-robust solutions returned by the BQO baselines are sub-optimal with respect to the -regret in this case, as seen from the corresponding results in Figure 3.
In addition, we evaluate the effectiveness of the selection of at line 1 in Algorithm 1. Currently, is selected such that it yields the highest posterior mean given . This is to improve exploration in . We compare this selection strategy with the random strategy in which is uniformly selected from regardless of . The result is reported in Figure 5. In this figure, the post-fix RandW denotes the random selection of . We observe that random selection of can hurt the convergence of both the standard BQO baselines and DRBQO. Furthermore, the selection of for the highest posterior variance (line 1 of Algorithm 1) in DRBQO is also meaningful in proving Theorem 2. More empirical evaluations in other synthetic functions are presented in the supplementary material.
Cross-validation hyperparameter tuning. A typical real-world problem that possesses the quadrature structure of Equation (1) is -fold cross-validation hyperparameter tuning. The -fold cross-validation performance can be thought of as a Monte Carlo approximate of the true model performance. Given a fixed learning algorithm associated with a set of hyperparameter , let be an approximate model performance trained on and evaluated on the validation set where denotes the training data set, denotes a subset of training points sampled from , and denotes everything in but not in . Increasing the number of folds reduces the variance in the model performance estimate, but it is expensive to evaluate the cross-validation performance for a large value of . Therefore, a class of Bayesian quadrature optimization methods is beneficial in this case in which we actively select both the algorithm’s hyperparameters and a fold to evaluate without the need of training the model in all folds [DBLP:conf/nips/SwerskySA13].
However, the standard BQO methods assume the empirical distribution for each fold and are set out to maximize the average -fold values. In practice, the average -fold value can be a spurious measure for model performance when there is sufficient discrepancy of the model performance across different folds. This scenario fits well into our distributional uncertainty problem in Equation (1) where the fold distribution is unknown in practice. In addition, we use a one-hot
-dimensional vector to represent each of thefolds. This offers two main advantages: (i) it allows us to leverage the standard kernel such as on the product space ; (ii) it is able to model different covariance between different pairs of folds. For example, the covariance between fold 1 and fold 3 is not necessary the same as that between fold 8 and fold 10 though the fold indicator difference are the same ( in this example).
We evaluate this experiment on two common machine learning models using the MNIST dataset [LeCun98]
: ElasticNet and Convolutional Neural Network (CNN). For ElasticNet, we tune theand regularization hyperparamters, and use the SGDClassificer implementation from the scikit-learn package [pedregosa2012scikitlearn]. For CNN, we use the standard architecture with 2 convolutional layers. In CNN, we optimize over three following hyperparamters: the learning rate and the dropout rates in the first and second pooling layers. We used Adam optimizer [DBLP:journals/corr/KingmaB14] in epochs with the batch size of .
In addition to the previous baselines in the synthetic experiment, we also consider the multi-task Bayesian optimization (MTBO) [DBLP:conf/nips/SwerskySA13] baseline for this application. MTBO is a standard method for cross-validation hyperparameter tuning.
In this experiment, we also use kernel defined on the Cartesian product space of and for all the methods except for MTBO which uses task kernel on the domain of . We initialize (respectively ) initial points and keep these initial points the same for all the algorithms in ElasticNet (respectively CNN). Each of the algorithms are run for (respectively ) iterations in ElasticNet (respectively CNN). We repeat the experiment
times and report the average and standard deviation values of an evaluation metric. We split the training data intofolds and keep these folds the same for all algorithms. We compare DRBQO against the baselines via a practical metric: the classification error in the test set evaluated at the final set of hyperparameters reported by each algorithm at the final step . This metric is a simple but practical measure of the robustness of the hyperparameters over the unknown data distribution . The result is reported in Table 1. We observe that DRBQO outperforms the baselines for most of the considered values of , especially for large values of (i.e.,
in this case). More results for the case of Support Vector Machine (SVM) are presented in the supplementary material.
In this work, we have proposed a posterior sampling based algorithm, namely DRBQO, that efficiently seeks for the robust solutions under the distributional uncertainty in Bayesian quadrature optimization. Compared to the standard BQO algorithms, DRBQO provides a flexibility to control the conservativeness against distributional perturbation. We have demonstrated the empirical effectiveness and characterized the theoretical convergence of DRBQO in sublinear Bayesian regret.
This research was partially funded by the Australian Government through the Australian Research Council (ARC). Prof Venkatesh is the recipient of an ARC Australian Laureate Fellowship (FL170100006).
In this appendix, we provide a detailed proof of Theorem 2 in the main text about a sublinear bound on the Bayesian regret of the DRBQO algorithm. For simplicity, we focus on the case where the decision space and the distributional uncertainty set are finite. The results can be extended to infinite sets using the discretization trick as in DBLP:conf/icml/SrinivasKKS10.
Notations and conventions. Unless explicitly specified otherwise, we denote a conditional distribution by , i.e., . Recall the definition of the quadrature functional in the main text as
for any and . Let , and . Since is a stochastic process (a GP in our case), and are also random variables. The DRBQO algorithm maps at a time step the history to a new decision and conditional distribution as presented in line 2-3 of Algorithm 1 in the main text. The practical implementation of Algorithm 1 samples as follows: , and where is a function sample of at time from its posterior GP.
For any sequence of deterministic functions ,
for all .
Given , samples according to the probability they are optimal, i.e., . Thus, conditioned on , and are identically distributed. As a result, given a deterministic function , we have . Therefore,
For all , we have
If , then
For all , we have
where and denote the density function and cumulative distribution function of
denote the density function and cumulative distribution function of, respectively.
The results are simple properties of normal distributions. ∎
Given , let be the variance of . Then, for all , all and for , we have
It follows from a simple property of posterior covariance that
and , then
for all .
The trick is to concentrate on the non-negative terms of the expectation. These non-negative terms can be bounded due to the specific choice of upper confidence bound .
Note that for any deterministic conditional distribution , we have