Bayesian Optimization with Binary Auxiliary Information

06/17/2019 · Yehong Zhang et al. · National University of Singapore

This paper presents novel mixed-type Bayesian optimization (BO) algorithms to accelerate the optimization of a target objective function by exploiting correlated auxiliary information of binary type that can be more cheaply obtained, such as in policy search for reinforcement learning and hyperparameter tuning of machine learning models with early stopping. To achieve this, we first propose a mixed-type multi-output Gaussian process (MOGP) to jointly model the continuous target function and binary auxiliary functions. Then, we propose information-based acquisition functions such as mixed-type entropy search (MT-ES) and mixed-type predictive ES (MT-PES) for mixed-type BO based on the MOGP predictive belief of the target and auxiliary functions. The exact acquisition functions of MT-ES and MT-PES cannot be computed in closed form and need to be approximated. We derive an efficient approximation of MT-PES via a novel mixed-type random features approximation of the MOGP model whose cross-correlation structure between the target and auxiliary functions can be exploited for improving the belief of the global target maximizer using observations from evaluating these functions. We propose new practical constraints to relate the global target maximizer to the binary auxiliary functions. We empirically evaluate the performance of MT-ES and MT-PES with synthetic and real-world experiments.


1 Introduction

Bayesian optimization (BO) has recently been demonstrated with notable success to be highly effective in optimizing an unknown (possibly noisy, non-convex, and/or with no closed-form expression/derivative) target function using a finite budget of often expensive function evaluations (Shahriari et al., 2016). As an example, BO is used by Snoek et al. (2012) to determine the setting of input hyperparameters (e.g., learning rate, batch size of data) of a machine learning (ML) model that maximizes its validation accuracy (i.e., output of the unknown target function). Conventionally, a BO algorithm relies on some choice of acquisition function (e.g., improvement-based (Shahriari et al., 2016) such as probability of improvement or expected improvement (EI) over the currently found maximum, information-based (Villemonteix et al., 2009) such as entropy search (ES) (Hennig and Schuler, 2012) and predictive entropy search (PES) (Hernández-Lobato et al., 2014), or upper confidence bound (UCB) (Srinivas et al., 2010)) as a heuristic to guide its search for the global target maximizer. To do this, the BO algorithm exploits the chosen acquisition function to repeatedly select an input for evaluating the unknown target function that trades off between sampling at or near a likely target maximizer based on a Gaussian process (GP) belief of the unknown target function (exploitation) vs. improving the GP belief (exploration) until the budget is expended.

In practice, the expensive-to-evaluate target function often correlates well with some cheaper-to-evaluate binary auxiliary function(s) that delineate the input regions potentially containing the global target maximizer and can thus be exploited to boost the BO performance. For example, automatically tuning the hyperparameters of a sophisticated ML model (e.g., deep neural network) with BO is usually time-consuming as it may take several hours to days to evaluate the validation accuracy of the ML model under each selected hyperparameter setting when training with a massive dataset. To accelerate this process, consider an auxiliary function whose output is a binary decision of whether the validation accuracy of the ML model under the selected input hyperparameter setting will exceed a pre-specified threshold, which is recommended by some early/optimal stopping mechanism (Müller et al., 2007) after a small number of training epochs. Such auxiliary information of binary type is cheaper to obtain and can quickly delineate the input regions containing the best hyperparameter setting, hence incurring less time for exploration. Similarly, to find the best reinforcement learning policy for an AI agent in a game or a real robot in a task with binary outcomes (e.g., success or failure) (Tesch et al., 2013), maximizing the success rate (i.e., the unknown target function with a continuous output type) averaged over multiple random environments can be accelerated by deciding whether the selected setting of input policy parameters is promising in a single environment (i.e., the auxiliary function with a binary output type). To search for the optimal setting of a system via user interaction (Shahriari et al., 2016), gathering implicit/binary user feedback (e.g., click or not, like or dislike) is often easier than asking for an explicit rating/ranking of a shown example. The above practical examples motivate the need to design and develop a mixed-type BO algorithm that can naturally trade off between exploitation vs. exploration over the target function with a continuous output type and the cheaper-to-evaluate auxiliary function(s) with a binary output type for finding or improving the belief of the global target maximizer, which is the focus of our work here.

In this paper, we generalize information-based acquisition functions like ES and PES to mixed-type ES (MT-ES) and mixed-type PES (MT-PES) for mixed-type BO (Section 4). To the best of our knowledge, these are the first BO algorithms that exploit correlated binary auxiliary information for accelerating the optimization of a continuous target objective function. Different from continuous auxiliary functions, which have been exploited by a number of multi-fidelity BO algorithms (Huang et al., 2006; Swersky et al., 2013; Kandasamy et al., 2016, 2017; Poloczek et al., 2017; Sen et al., 2018), the binary auxiliary functions in our problem make the widely used Gaussian likelihood inappropriate and prevent a direct application of existing multi-fidelity BO algorithms (we discuss other related works in Appendix A).

To resolve this, we first propose a mixed-type multi-output GP (MOGP) to jointly model the unknown continuous target function and binary auxiliary functions. Although the exact acquisition function of MT-PES cannot be computed in closed form, the main contribution of our work here is to show that it is in fact possible to derive an efficient approximation of MT-PES via (a) a novel mixed-type random features (MT-RF) approximation of the MOGP model whose cross-correlation structure between the target and auxiliary functions can be exploited for improving the belief of the global target maximizer using the observations from evaluating these functions (Section 5.1), and (b) new practical constraints relating the global target maximizer to the binary auxiliary functions (Section 5.2). We empirically evaluate the performance of MT-ES and MT-PES with synthetic and real-world experiments (Section 6).

2 Problem Setup

In this work, we have access to an unknown target objective function and auxiliary functions defined over a bounded input domain such that each input is associated with a noisy output for . As mentioned in Section 1, a cost is incurred to evaluate function at each input and the target function is more costly to evaluate than the auxiliary functions, i.e., for . Then, the objective is to find the global target maximizer with a lower cost by exploiting the cheaper auxiliary function evaluations, as compared to evaluating only the target function. Our problem differs from that of the conventional multi-fidelity BO in that only the target function returns continuous outputs (i.e., ) while the auxiliary functions return binary outputs (i.e., for ).
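To make the setup concrete, here is a minimal sketch (in Python) of the mixed-type setting with one expensive continuous target and one cheap binary auxiliary function; the toy functions, threshold, and cost values are purely illustrative assumptions and not from the paper.

```python
import numpy as np

# A minimal sketch of the mixed-type setting described above (all names and
# numbers are illustrative, not from the paper's experiments).

def target_f0(x):
    """Expensive target: a noisy continuous output, e.g., a validation accuracy."""
    return np.sin(3 * x[0]) * np.cos(2 * x[1]) + 0.05 * np.random.randn()

def aux_f1(x, threshold=0.5):
    """Cheap binary auxiliary: +1 if a quick, noisier proxy of the target exceeds
    a threshold, -1 otherwise."""
    proxy = np.sin(3 * x[0]) * np.cos(2 * x[1]) + 0.2 * np.random.randn()
    return 1 if proxy > threshold else -1

# Evaluation costs: the target is assumed much costlier than the auxiliary.
costs = {0: 10.0, 1: 1.0}

# Objective: find argmax_x f_0(x) over a bounded domain (here [0, 1]^2) while
# spending as little total cost as possible, by mixing many cheap f_1 queries
# with few expensive f_0 queries.
```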

3 Mixed-Type Multi-Output GP

Various types of multi-output GP models (Cressie, 1993; Wackernagel, 1998; Webster and Oliver, 2007; Skolidis, 2012; Bonilla et al., 2007; Teh and Seeger, 2005; Álvarez and Lawrence, 2011) have been used to jointly model target and auxiliary functions with continuous outputs. However, none of them can be used straightforwardly in our problem to model the mixed output types due to the non-Gaussian likelihood of the auxiliary functions. To resolve this issue, we generalize the convolved multi-output Gaussian process (CMOGP) to model the correlated functions with mixed continuous and binary output types by approximating the non-Gaussian likelihood using expectation propagation (EP), as discussed later. The CMOGP model is chosen for generalization due to its convolutional structure which can be exploited for deriving an efficient approximation of our acquisition function, as described in Section 5.

Let the target and auxiliary functions be jointly modeled as a CMOGP which defines each function as a convolution between a smoothing kernel and a latent function with an additive bias (to ease exposition, we consider a single latent function; note, however, that multiple latent functions can be used to improve the modeling (Álvarez and Lawrence, 2011), and our proposed MT-RF approximation and MT-PES algorithm can be easily generalized to handle multiple latent functions, as shown in Appendix G):

f_i(x) \triangleq \int K_i(x - x')\, L(x')\, \mathrm{d}x' + b_i .    (1)

Let and . As shown by Álvarez and Lawrence (2011), if is a GP, then is also a GP, that is, every finite subset of follows a multivariate Gaussian distribution. Such a GP is fully specified by its prior mean and covariance for all , the latter of which characterizes both the correlation structure within each function (i.e., ) and the cross-correlation between different functions (i.e., ). Specifically, let be a GP with zero mean, prior covariance , and where is the signal variance controlling the intensity of the outputs of , and are diagonal precision matrices controlling, respectively, the degrees of correlation between outputs of latent function and cross-correlation between outputs of and . Then, and

(2)

In this work, we assume the Gaussian and probit likelihoods for the target and auxiliary functions, respectively:

(3)

for . Supposing a column vector of outputs is observed by evaluating each -th function at a set of input tuples where , the predictive belief/distribution of for any set of input tuples can be computed by

(4)

For conventional CMOGP with only continuous output types, (4) can be computed analytically since both and are Gaussians (Álvarez and Lawrence, 2011). Unfortunately, the non-Gaussian likelihood in (3) makes the integral in (4) intractable. To resolve this issue, the work of Pourmohamad and Lee (2016) has proposed a sampling strategy based on a sequential Monte Carlo algorithm which, however, is computationally inefficient and makes the approximation of our proposed acquisition function (Section 5) prohibitively expensive. In contrast, we approximate the non-Gaussian likelihood using EP to derive an analytical approximation of (4), as detailed later. EP will be further exploited in Section 5 for approximating our proposed acquisition function efficiently.
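As a concrete illustration of the EP idea used here, the sketch below moment-matches a single probit likelihood site against a Gaussian cavity belief, following the standard GP-classification recipe of Rasmussen and Williams (2006); the function and variable names are ours, and the full mixed-type EP additionally has to handle the MOGP cross-covariances.

```python
import numpy as np
from scipy.stats import norm

def probit_site_moments(y, mu_cavity, var_cavity):
    """Moment-match the tilted distribution N(f; mu_cavity, var_cavity) * Phi(y * f)
    with a Gaussian, returning its mean and variance (the standard EP update for a
    probit likelihood; y is a binary label in {-1, +1})."""
    denom = np.sqrt(1.0 + var_cavity)
    z = y * mu_cavity / denom
    ratio = norm.pdf(z) / norm.cdf(z)          # N(z) / Phi(z)
    mu_hat = mu_cavity + y * var_cavity * ratio / denom
    var_hat = var_cavity - var_cavity**2 * ratio * (z + ratio) / (1.0 + var_cavity)
    return mu_hat, var_hat

# Example: a cavity belief N(0.3, 0.8) combined with a positive label.
print(probit_site_moments(+1, 0.3, 0.8))
```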

3.1 Mixed-Type CMOGP Predictive Inference

Let be a set of input tuples of the auxiliary functions. The posterior distribution in (4) can be computed by

(5)

where can be approximated with a multivariate Gaussian using EP by approximating each non-Gaussian likelihood as a Gaussian. Let

(6)

for all . Following the EP procedure of Rasmussen and Williams (2006), the parameters and can be computed analytically and

(7)

where , is a diagonal matrix with diagonal components for , , and for any .

By combining (7), (5), and (3) with (4) (Appendix B), the predictive belief can be approximated by a multivariate Gaussian with the following posterior mean vector and covariance matrix:

(8)

where , , and is a diagonal matrix with diagonal components . Consequently, the approximated predictive belief of for any input tuple can be computed using . Due to (3) and (8),

(9)

for where for .
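For intuition, the (approximate) probability of a positive binary output under a Gaussian latent predictive belief and a probit likelihood takes the standard closed form sketched below; this is a minimal illustration in our own notation rather than the paper's exact expression (9).

```python
import numpy as np
from scipy.stats import norm

def binary_predictive_prob(mu_star, var_star):
    """Probability of a positive binary output given a Gaussian latent predictive
    belief N(mu_star, var_star) and a probit likelihood, i.e., E[Phi(f)] under the
    latent Gaussian (standard probit-GP result)."""
    return norm.cdf(mu_star / np.sqrt(1.0 + var_star))

print(binary_predictive_prob(0.8, 0.4))  # roughly 0.75
```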

4 BO with Binary Auxiliary Information

To achieve the objective described in Section 2, our BO algorithm repeatedly selects the next input tuple for evaluating the -th function at that maximizes a choice of acquisition function per unit cost given the past observations :

and updates until the budget is expended. Since the costs of evaluating the target vs. auxiliary functions differ, we use the above cost-sensitive acquisition function such that the cheaper auxiliary function evaluations can be exploited. We will focus on designing the acquisition function first; the estimation of the evaluation costs in real-world applications will be discussed later in Section 6.
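A minimal sketch of this cost-sensitive selection loop is given below; the acquisition function, candidate set, and bookkeeping are placeholders for illustration only, not the paper's implementation.

```python
def mixed_type_bo_loop(acquisition, functions, costs, budget, candidates):
    """A minimal sketch of the cost-sensitive BO loop described above.
    `acquisition(i, x, observations)` scores evaluating function i at input x;
    `functions[i](x)` returns a noisy output; `costs[i]` is the evaluation cost.
    All names are illustrative."""
    observations = []   # list of (i, x, y) tuples
    spent = 0.0
    while spent < budget:
        # Select the input tuple <i, x> maximizing the acquisition value per unit cost.
        i_next, x_next = max(
            ((i, x) for i in functions for x in candidates),
            key=lambda t: acquisition(t[0], t[1], observations) / costs[t[0]],
        )
        y = functions[i_next](x_next)
        observations.append((i_next, x_next, y))
        spent += costs[i_next]
    return observations
```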

Intuitively, should be designed to enable its BO algorithm to jointly and naturally optimize the non-trivial trade-off between exploitation vs. exploration over the target and auxiliary functions for finding or improving the belief of the global target maximizer by utilizing information from the mixed-type CMOGP predictive belief of these functions (8). To do this, one may be tempted to directly use the conventional EI (Mockus et al., 1978) and (Tesch et al., 2013) acquisition functions for selecting inputs to evaluate the target and auxiliary functions, respectively. is a variation of EI and, to the best of our knowledge, the only acquisition function designed for optimizing an unknown function with a binary output type. However, this does not satisfy our objective since aims to find the global maximizer of the auxiliary function which can differ from the global target maximizer if the target and auxiliary functions are not perfectly correlated. To resolve this issue, we propose to exploit information-based acquisition functions and generalize them to our mixed-type BO problem such that input tuples for evaluating the target and auxiliary functions are selected to directly maximize only the unknown target objective function, as detailed later.

4.1 Information-Based Acquisition Functions for Mixed-Type BO

Information-based acquisition functions like ES and PES have been designed to enable their BO algorithms to improve the belief of the global target maximizer. In mixed-type BO, we can similarly define a belief of the maximizer of each -th function as for . To achieve the objective of maximizing only the target function in mixed-type BO, ES can be used to measure the information gain of only the global target maximizer (i.e., ) from selecting the next input tuple for evaluating the -th (possibly binary auxiliary) function at given the past observations :

(10)

Similar to the multi-task ES algorithm (Swersky et al., 2013) which is designed for BO with continuous auxiliary information, we can use Monte Carlo sampling to approximate (10) by utilizing information from the mixed-type CMOGP predictive belief (i.e., (8) and (9)) of the target and auxiliary functions. To make the Monte Carlo approximation tractable and efficient, we need to discretize the input domain and assume that the search space for evaluating (10) is pruned to a small set of input candidates which, following the work of Swersky et al. (2013), can be selected by applying EI to only the target function. Such a form of approximation, however, faces two critical limitations: (a) Computing (10) incurs cubic time in the size of the discretized input domain and is thus expensive to evaluate with a large input domain (or risks being approximated poorly), and (b) the pruning of the search space artificially constrains the exploration of auxiliary functions and requires a parameter in EI (i.e., to control the exploration-exploitation trade-off) to be manually tuned to fit different real-world applications.

To circumvent the above-mentioned issues, we can exploit the symmetric property of conditional mutual information and rewrite (10) as

(11)

which we call mixed-type PES (MT-PES). Intuitively, the selection of an input tuple to maximize (11) has to trade off between exploration of every target and auxiliary function (hence inducing a large Gaussian predictive entropy ) vs. exploitation of the current belief of the global target maximizer to choose a nearby input of function (i.e., convolutional structures and maximizers of the target and auxiliary functions are similar or close (Section 3)) to be evaluated (hence inducing a small expected predictive entropy ) to yield a highly informative observation that in turn improves the belief of . Note that the entropy of continuous random variables (i.e., differential entropy) and that of discrete/binary random variables (i.e., Shannon entropy) are not comparable: for example, the Shannon entropy is always non-negative while the differential entropy can be negative (a detailed discussion of their difference and connection is available in Chapter 8 of Cover and Thomas (2006)). So, the differential entropy terms in (11) for are not comparable to the Shannon entropy terms in (11) for . Fortunately, the difference of the two entropy terms in (11) is exactly the information gain of the global target maximizer in (10) which is comparable between vs. regardless of whether the output is continuous or binary. Next, we will describe how to evaluate (11) efficiently.
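To illustrate why the two entropy types must not be compared directly, the short sketch below computes a Gaussian differential entropy (which can be negative) and a Bernoulli Shannon entropy (which cannot); only differences of entropies, i.e., information gains, are compared in (11).

```python
import numpy as np

def gaussian_entropy(var):
    """Differential entropy of N(mu, var) in nats; can be negative for small var."""
    return 0.5 * np.log(2.0 * np.pi * np.e * var)

def bernoulli_entropy(p):
    """Shannon entropy of a binary output with success probability p; always >= 0."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

# Only the *difference* between an entropy and its expected value conditioned on
# the target maximizer (an information gain) is comparable across output types.
print(gaussian_entropy(0.01))      # negative differential entropy
print(bernoulli_entropy(0.3))      # non-negative Shannon entropy
```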

5 Approximation of Mixed-Type Predictive Entropy Search

Due to (9), the first Gaussian predictive/posterior entropy term in (11) can be computed analytically:

(12)

for . Unfortunately, the second term in (11) cannot be evaluated in closed form. Although this second term appears to resemble that in PES (Hernández-Lobato et al., 2014), their approximation method cannot be applied straightforwardly here since it cannot account for either the binary auxiliary information or the complex cross-correlation structure between the target and auxiliary functions. To address this, we will first propose a novel mixed-type random features approximation of the CMOGP model whose cross-correlation structure between the target and auxiliary functions can be exploited for sampling the global target maximizer more accurately using the past observations from evaluating these functions (especially when the target function is sparsely evaluated due to its higher cost), which is in turn used to approximate the expectation in (11). Then, we will formalize some practical constraints relating the global target maximizer to the binary auxiliary functions, which are used to approximate the second entropy term within the expectation in (11).

5.1 Mixed-Type Random Features

To approximate the expectation in (11) efficiently by averaging over samples of the target maximizer from in a continuous input domain, we will derive an analytic sample of the unknown function given the past observations , which is differentiable and can be optimized by any existing gradient-based optimization method to search for its maximizer. Unlike the work of Hernández-Lobato et al. (2014) that achieves this in PES using the single-output random features (SRF) method for handling a single continuous output type (Lázaro-Gredilla et al., 2010; Rahimi and Recht, 2007), we have to additionally consider how the binary auxiliary functions and their complex cross-correlation structure with the target function can be exploited for sampling the target maximizer more accurately. To address this, we will now present a novel mixed-type random features (MT-RF) approximation of the CMOGP model by first deriving an analytic form of the latent function with SRF and then an analytic approximation of using the convolutional structure of the CMOGP model. The results of EP (6) can be reused here to approximate the non-Gaussian likelihood for .

Using SRF (Rahimi and Recht, 2007), the latent function modeled using GP can be approximated by a linear model where is a random vector of an -dimensional feature mapping of the input for and is an -dimensional vector of weights. Then, interestingly, by exploiting the convolutional structure of the CMOGP model in (1), can also be approximated analytically by a linear model: where the random vector can be interpreted as input features of , is a random matrix which is used to map in SRF, and function returns a diagonal matrix with the same diagonal components as . The exact definition of and the derivation of are in Appendix C.

Then, a sample of can be constructed using where and are vectors of features and weights sampled, respectively, from the random vector and the posterior distribution of weights given the past observations , the latter of which is approximated to be Gaussian by exploiting the conditional independence property of MT-RF and the results of EP (6) from the mixed-type CMOGP model:

where and , as detailed in Appendix C.2.
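The following is a minimal sketch of how such an analytic function sample can be drawn and maximized with a gradient-based optimizer once a feature mapping and a Gaussian posterior over the weights are available; the names and the multi-restart scheme are illustrative assumptions, and the actual MT-RF features are derived in Appendix C.

```python
import numpy as np
from scipy.optimize import minimize

def sample_target_maximizer(phi, weight_mean, weight_cov, bounds, n_restarts=5, rng=None):
    """Draw one analytic sample f(x) ~ phi(x)^T theta of the target function and
    maximize it with a gradient-based optimizer (L-BFGS-B with random restarts).
    `phi(x)` maps an input to an m-dimensional feature vector; `weight_mean` and
    `weight_cov` define the (approximate) Gaussian posterior over the weights."""
    rng = rng or np.random.default_rng()
    theta = rng.multivariate_normal(weight_mean, weight_cov)  # one weight sample

    def neg_sample(x):
        return -float(phi(x) @ theta)

    best_x, best_val = None, np.inf
    for _ in range(n_restarts):
        x0 = np.array([rng.uniform(lo, hi) for lo, hi in bounds])
        res = minimize(neg_sample, x0, bounds=bounds, method="L-BFGS-B")
        if res.fun < best_val:
            best_x, best_val = res.x, res.fun
    return best_x
```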

Consequently, the expectation in (11) can be approximated by averaging over samples of the target maximizer of to yield an approximation of MT-PES:

(13)

where and for . Drawing a sample of incurs time if and time if , which is more efficient than using Thompson sampling to sample over a discretized input domain that incurs cubic time in its size since a sufficiently fine discretization of the entire input domain is typically larger in size than the no. of observations.

5.2 Approximating the Predictive Entropy Conditioned on the Target Maximizer

We will now discuss how the second entropy term in (13) is approximated. Firstly, the posterior distribution of given the past observations and target maximizer is computed by

(14)

where is defined in (3) and will be approximated by EP, as detailed later. As shown in Section 3, the Gaussian predictive belief  (8) can be computed analytically. Then, can be considered as a constrained version of by further conditioning on the target maximizer . It is intuitive that the posterior distribution of is constrained by . However, since only the target maximizer is of interest, how should the value of be constrained by instead of if ? To resolve this, we introduce a slack variable to formalize the relationship between maximizers of the target and auxiliary functions:

(15)

where measures the gap between the expected maximum of and the expected output of evaluated at and can be approximated efficiently using our MT-RF method even though is unknown, as detailed later. Consequently, the following simplified constraints instead of (15) will be used to approximate :

  1. for a given where equals if , and otherwise.

  2. where and is the largest among the noisy outputs observed by evaluating the target function at .

  3. for . (Like the work of Swersky et al. (2013), we assume the cross-correlation between the target and auxiliary functions to be positive; an auxiliary function that is negatively correlated with the target function can be easily transformed to be positively correlated by negating all its outputs.)

The first constraint keeps the influence of on the next input tuple to be selected by MT-PES. Instead of constraining all unknown functions over the entire input domain, and relax (15) to be valid only for the outputs observed from evaluating these functions. When the target and auxiliary functions are highly correlated (i.e., small ), means that a positive label can be observed with high probability by evaluating an auxiliary function at the target maximizer . Using these constraints, which can be approximated analytically using EP. To achieve this, we will first derive a tractable approximation of the posterior distribution which does not depend on the next selected input . Note that such terms can be computed once and reused in the approximation of in (14) which depends on , as detailed later.

Approximating terms independent of . Let and . We can use the cdf of a standard Gaussian distribution and an indicator function to represent the probability of and , respectively. Then, the posterior distribution can be constrained with and by

(16)

Interestingly, by sampling the target and auxiliary maximizers and using our MT-RF method proposed in Section 5.1, the value of in (16) can be approximated by Monte Carlo sampling:

With the multiplicative form of (16) , can be approximated to be a multivariate Gaussian using EP by approximating each non-Gaussian factor (i.e., and ) in (16) to be a Gaussian, as detailed in Appendix D. Consequently, the posterior distribution can be approximated by a Gaussian where is the -th component of and is the -th diagonal component of .

Approximating terms that depend on . In and , is the only term that is related to . It follows that is conditionally independent of and given . Let . So, where and can be computed analytically using , , and (8), as detailed in Appendix E.

To involve , an indicator function is used to represent the probability that holds. Then, where

(17)

Since the posterior of has been updated according to and  (16), in (17) is updated likewise:

where is computed in (16) using a sampled . Similar to that in (Hernández-Lobato et al., 2014), a one-step EP can be used to approximate (17) as a multivariate Gaussian with the following posterior mean vector and covariance matrix:

(18)

where , , and . The derivation of (18) is in Appendix F. So, the posterior mean and variance of can be approximated, respectively, using the -th component of and -th component of denoted by and . As a result, the posterior entropy in (13) can be approximated using (12) by replacing and in (12) with, respectively, and where and are computed in (18) using a sampled .
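For intuition, a one-step EP update against an indicator constraint of the form f ≤ η amounts to matching the first two moments of the correspondingly truncated Gaussian; the univariate sketch below (our notation, not the multivariate expression in (18)) shows that moment matching.

```python
import numpy as np
from scipy.stats import norm

def truncated_gaussian_moments(mu, var, upper):
    """Mean and variance of N(mu, var) truncated to f <= upper; these are the
    moments matched by a one-step EP update against the indicator I(f <= upper)."""
    sigma = np.sqrt(var)
    beta = (upper - mu) / sigma
    ratio = norm.pdf(beta) / norm.cdf(beta)
    mean_trunc = mu - sigma * ratio
    var_trunc = var * (1.0 - beta * ratio - ratio**2)
    return mean_trunc, var_trunc

# Example: a belief N(0, 1) constrained not to exceed the current maximum 0.5.
print(truncated_gaussian_moments(0.0, 1.0, 0.5))
```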

6 Experiments and Discussion

This section empirically evaluates the performance of our MT-PES algorithm against that of (a) the state-of-the-art PES (Hernández-Lobato et al., 2014) without utilizing the binary auxiliary information and (b) MT-ES performing Monte Carlo approximation of (10). In all experiments, we use random features and samples of the target maximizer in MT-PES. The input candidates with top EI values are selected for evaluating MT-ES. The mixed-type MOGP (MT-MOGP) hyperparameters are learned via maximum likelihood estimation. The performance of the tested algorithms is evaluated using the immediate regret (IR) where is their recommended target maximizer. In each experiment, one observation of the target function is randomly selected as the initialization.
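For reference, the immediate regret is simply the gap between the target value at the true global maximizer and at the recommended one; a minimal sketch with a toy noise-free target is given below.

```python
import numpy as np

def immediate_regret(f_target, x_opt, x_recommended):
    """Immediate regret |f(x*) - f(x~*)| between the true global maximizer x*
    and the maximizer x~* recommended by a BO algorithm (noise-free target here)."""
    return abs(f_target(x_opt) - f_target(x_recommended))

f = lambda x: -np.sum((np.asarray(x) - 0.3) ** 2)   # toy target with maximizer at 0.3
print(immediate_regret(f, np.array([0.3, 0.3]), np.array([0.25, 0.35])))
```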

6.1 Synthetic Experiments

The performance of the tested algorithms is first evaluated using synthetic and benchmark functions.

Figure 1: (a-c) Example of the synthetic functions where ‘’ is the global target maximizer, (d) target function predicted by conventional GP model and the target maximizers (‘’ ) sampled by RF with observations from evaluating the target function, and (e) target function predicted by MT-MOGP model and the target maximizers (‘’ ) sampled by MT-RF with and observations from evaluating the target and aux1 functions, respectively.
Figure 2: Graphs of IR vs. cost incurred by tested algorithms for (a-b) synthetic functions and (c) Hartmann-6D function. The type and cost of functions used in each experiment are shown in the title and legend of each graph where ‘t’ denotes the target function and ‘a1’ and ‘a2’ denote the aux1 and aux2 functions, respectively. The error bars are computed in the form of standard error.

Synthetic functions. The synthetic functions are generated using and . To do this, the CMOGP hyperparameters with one latent function are first fixed as the values shown in Appendix H.1, which are also used in the tested algorithms as optimal hyperparameters. Then, a set of input tuples is uniformly sampled from and their corresponding outputs are sampled from the CMOGP prior. The target function is set to be the predictive mean of the CMOGP model. The outputs of the auxiliary function are set to be if , and otherwise. An example of the synthetic functions can be found in Figs. 1a to 1c. As can be seen in Figs. 1b and 1c, we can generate multiple auxiliary functions with different proportions of positive outputs from a target function (Fig. 1a) by varying the bias . All these auxiliary functions correlate well with the target function but delineate the input regions containing the target maximizer differently and thus result in different MT-PES performance, as will be shown later.
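A minimal sketch of this construction is given below: a smooth function is sampled from a GP prior and thresholded with an adjustable bias to produce binary auxiliary outputs with different proportions of positive labels; the kernel and its hyperparameters are illustrative and not the values in Appendix H.1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample a smooth "target-like" function on a grid from a GP prior with a
# squared-exponential kernel (illustrative hyperparameters, not the paper's).
X = np.linspace(0.0, 1.0, 200)[:, None]
sq_dists = (X - X.T) ** 2
K = 1.0 * np.exp(-0.5 * sq_dists / 0.05 ** 2)
f = rng.multivariate_normal(np.zeros(len(X)), K + 1e-8 * np.eye(len(X)))

# Binary auxiliary outputs: +1 where the (shifted) function is positive, -1 otherwise.
# Varying the bias changes the proportion of positive outputs, as in Figs. 1b-1c.
for bias in (0.0, 0.5):
    y_aux = np.where(f - bias > 0, 1, -1)
    print(f"bias={bias}: {np.mean(y_aux == 1):.0%} positive outputs")
```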

Empirical analysis of MT-MOGP and MT-RF. First, we verify that the MT-MOGP model and MT-RF can outperform the conventional GP model and single-output RF by exploiting the cross-correlation structure between the target and auxiliary function aux1 (i.e., Figs. 1a and 1b). Figs. 1d and 1e show the predictive mean and the sampled maximizers of the target function using randomly sampled observations. By comparing Figs. 1d and 1e with Fig. 1a, it can be observed that the MT-MOGP model and MT-RF can predict the target function and sample the target maximizer more accurately than the conventional GP model and single-output RF by using additional observations from evaluating aux1.

Empirical analysis of mixed-type BO. Next, the performance of the tested BO algorithms is evaluated using ten groups (i.e., one target function, two auxiliary functions aux1 and aux2 with different ) of synthetic functions generated using the above procedure. We adjust such that around of the auxiliary outputs are positive for each aux1 and set for each aux2. An averaged IR is obtained by optimizing the target function in each of them with different initializations for each tested algorithm.

Fig. 2 shows the results of all tested algorithms for synthetic functions with a cost budget of . From Fig. 2a, MT-PES can achieve a similar averaged IR with a much lower cost than PES, which implies that the BO performance can be accelerated by exploiting the binary auxiliary information of lower evaluation cost. MT-ES achieves lower averaged IR than PES with a cost less than but unfortunately performs less well in the remaining BO iterations. Even though the cheap auxiliary outputs provide additional information for finding the target maximizer at the beginning of BO, the multimodal nature of the synthetic function (see Fig. 1a) causes MT-ES to be trapped easily in some local maximum since its search space has been pruned using EI for time efficiency.

To investigate how the performance of MT-PES will be affected by the proportion of positive outputs in different auxiliary functions, we vary the number and bias of the auxiliary function(s) and show the results in Fig. 2b. It can be observed that MT-PES using aux2 as the auxiliary function does not converge as fast as MT-PES using aux1, which is expected since aux2 with a larger proportion of positive outputs is less informative in delineating the input regions containing the target maximizer than aux1. Also, Fig. 2b shows that MT-PES is able to exploit multiple auxiliary functions with different costs to achieve a lower averaged IR than PES with a much lower cost.

Remark. From the results in Fig. 2b, one may expect MT-PES to converge faster using an auxiliary function with a smaller proportion of positive outputs, which is not always the case. If the auxiliary function has sparse positive outputs, MT-PES will face difficulty finding a positive output when exploring the auxiliary function and start to evaluate the target function after only several negative outputs are observed from evaluating the cheap auxiliary function. These negative outputs may not be informative enough to guide the algorithm to directly evaluate the target function near to the likely target maximizer. To reduce the negative effect of such an unexpected behavior in real-world applications with an unknown auxiliary function, we can set MT-PES to evaluate only the auxiliary function using a small amount (e.g., ) of the budget at the beginning of BO so that positive auxiliary outputs are highly likely to be observed before MT-PES chooses to evaluate the expensive target function.

To provide more insight into the approximations of MT-PES, we follow the PES paper (Hernández-Lobato et al., 2014) and show the accuracy of the EP approximations (Section 5.2) compared to that of the ground truth constructed using the rejection sampling method. To verify how sensitive the performance of MT-PES is to different settings, we have also evaluated the performance of the tested algorithms using synthetic functions with varying costs , random features dimension , and sampling size . The results are reported in Appendix H.1.

Hartmann-6D function. The original Hartmann-6D function is used as the target function and to construct the binary auxiliary function, as detailed in Appendix H.2. Fig. 2c shows results of the tested algorithms with different initializations. It can be observed that MT-PES converges faster to a lower averaged IR than PES. However, MT-ES does not perform well for the Hartmann-6D function which is difficult to optimize due to its multimodal nature (i.e., global maximum and local maxima) and large input domain. The former causes MT-ES to be trapped easily in some local maximum while the latter prohibits MT-ES from finely discretizing the input domain to remain computationally tractable.

6.2 Real-World Experiments

The tested algorithms are next used for hyperparameter tuning of an ML model in an image classification task and policy search for reinforcement learning.

Convolutional neural network (CNN) with the CIFAR-10 dataset. The six CNN hyperparameters to be tuned in our experiments (we use the example code of Keras, i.e., cifar10_cnn.py, and switch the optimizer in their code to SGD) are the learning rate of SGD in the range of , three dropout rates in the range of , the batch size in the range of , and the number of learning epochs in the range of . We use training and validation data of size and , respectively. The unknown target function to be maximized is the validation accuracy evaluated by training the CNN with all the training data. The auxiliary function is the decision made using the Bayesian optimal stopping (BOS) mechanism in (Dai et al., 2019; Müller et al., 2007) by setting as a threshold of the validation accuracy. In particular, we train the same CNN model with a smaller fixed dataset of size randomly selected from the original training data and apply the BOS after training epochs. The BOS will early-stop the training and return if it predicts that a final validation accuracy of can be achieved with a high probability, and otherwise (a description of BOS is provided in Appendix H.3). The real training time is not known and varies with different settings of hyperparameters. To simplify the setting of the evaluation costs, we use and where is the number of learning epochs in each selected hyperparameter setting (we use of the training data for evaluating the auxiliary function and early-stop the training after around epochs). For this experiment, we additionally compare the tested algorithms with multi-fidelity GP-UCB (MF-GP-UCB) (Kandasamy et al., 2016) that can only exploit continuous auxiliary functions. The auxiliary function of MF-GP-UCB is the validation accuracy evaluated by training the same CNN with the same data used for the auxiliary function of MT-PES. (One may consider constructing the auxiliary function of MF-GP-UCB with an even smaller training dataset such that its cost is similar to that of the binary auxiliary function. However, for any smaller training dataset, we can always early-stop the training and achieve a much cheaper binary auxiliary function, as compared to the continuous auxiliary function of MF-GP-UCB constructed using the same dataset.) The actual wall-clock time shown in the results includes the time of both CNN training and BO. The validation accuracy is evaluated by training the CNN with for the tested algorithms.

Policy search for reinforcement learning (RL). We apply the tested algorithms to the CartPole task from OpenAI Gym and use a linear policy consisting of parameters in the range of . This task is defined to be a success (i.e., reward of ) if the episode length reaches , and a failure (reward of ) otherwise. The target function to be maximized is the success rate averaged over episodes with random starting states. The auxiliary function is the reward of one episode with a fixed starting state . and are used in the experiments. The success rate is evaluated by running the CartPole task with as the policy parameters over episodes for the tested algorithms.
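A minimal sketch of these two functions under the classic OpenAI Gym API is given below; the environment version, episode-length threshold, fixed seed, and numbers of episodes are illustrative assumptions rather than the paper's exact experimental settings.

```python
import gym
import numpy as np

def run_episode(policy, seed=None, max_steps=200):
    """Run one CartPole episode with a linear policy and return its length
    (classic Gym API assumed; newer Gym/Gymnasium versions return extra values
    from reset/step)."""
    env = gym.make("CartPole-v0")
    if seed is not None:
        env.seed(seed)              # fixed starting state for the auxiliary function
    obs = env.reset()
    for t in range(max_steps):
        action = int(np.dot(policy, obs) > 0)   # linear policy on the 4-d state
        obs, reward, done, _ = env.step(action)
        if done:
            return t + 1
    return max_steps

def aux_binary(policy, threshold=200):
    """Binary auxiliary: success (+1) if a single fixed-start episode reaches the
    length threshold, failure (-1) otherwise."""
    return 1 if run_episode(policy, seed=0) >= threshold else -1

def target_success_rate(policy, n_episodes=50):
    """Target: success rate averaged over episodes with random starting states."""
    return float(np.mean([run_episode(policy) >= 200 for _ in range(n_episodes)]))
```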

Figure 3: Graphs of (a) validation accuracy vs. wall-clock time incurred by tested algorithms for CNN and (b) success rate vs. no. of episodes incurred by tested algorithms for RL. The results for the first 50 episodes are zoomed in for a clearer comparison.

Fig. 3 shows results of the tested algorithms with different initializations for the CNN hyperparameter tuning and RL policy search tasks. It can be observed that both MT-ES and MT-PES converge faster to a smaller IR than other tested algorithms. MT-PES also converges faster than MT-ES in both experiments. MT-ES and MT-PES outperform MF-GP-UCB since evaluating the binary auxiliary function by early-stopping the CNN training incurs much less time than evaluating the true validation accuracy for MF-GP-UCB. Using only hour, MT-PES can improve the performance of CNN over that of the baseline achieved using the default hyperparameters in the existing code, which shows that MT-PES is promising in quickly finding more competitive hyperparameters of complex ML models.

7 Conclusion

This paper describes novel MT-ES and MT-PES algorithms for mixed-type BO that can exploit cheap binary auxiliary information for accelerating the optimization of a target objective function. A novel mixed-type CMOGP model and its MT-RF approximation are proposed for improving the belief of the unknown target function and the global target maximizer using observations from evaluating the target and binary auxiliary functions. New practical constraints are proposed to relate the global target maximizer to the binary auxiliary functions such that MT-PES can be approximated efficiently. Empirical evaluation on synthetic functions and real-world applications shows that MT-PES outperforms the state-of-the-art BO algorithms. For future work, our proposed mixed-type BO algorithms can be easily extended to handle both binary and continuous auxiliary information, hence generalizing multi-fidelity PES (Zhang et al., 2017); a closely related counterpart is multi-fidelity active learning (Zhang et al., 2016).

Acknowledgements. This research is supported by the Singapore Ministry of Education Academic Research Fund Tier , MOE-T--.

References

  • Álvarez and Lawrence (2011) Álvarez, M. A. and Lawrence, N. D. (2011). Computationally efficient convolved multiple output Gaussian processes. JMLR, 12, 1459–1500.
  • Bonilla et al. (2007) Bonilla, E. V., Chai, K. M. A., and Williams, C. K. I. (2007). Multi-task Gaussian process prediction. In Proc. NIPS, pages 153–160.
  • Cover and Thomas (2006) Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory. John Wiley & Sons.
  • Cressie (1993) Cressie, N. A. C. (1993). Statistics for Spatial Data. John Wiley & Sons, Inc., second edition.
  • Dai et al. (2019) Dai, Z., Yu, H., Low, K. H., and Jaillet, P. (2019). Bayesian optimization meets Bayesian optimal stopping. In Proc. ICML, pages 1496–1506.
  • Falkner et al. (2018) Falkner, S., Klein, A., and Hutter, F. (2018). BOHB: Robust and efficient hyperparameter optimization at scale. In Proc. ICML, pages 1436–1445.
  • González et al. (2017) González, J., Dai, Z., Damianou, A., and Lawrence, N. D. (2017). Preferential Bayesian optimization. In Proc. ICML, pages 1282–1291.
  • Hennig and Schuler (2012) Hennig, P. and Schuler, C. J. (2012). Entropy search for information-efficient global optimization. JMLR, 13, 1809–1837.
  • Hernández-Lobato et al. (2014) Hernández-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. (2014). Predictive entropy search for efficient global optimization of black-box functions. In Proc. NIPS, pages 918–926.
  • Hernández-Lobato et al. (2016) Hernández-Lobato, J. M., Gelbart, M. A., Adams, R. P., Hoffman, M. W., and Ghahramani, Z. (2016). A general framework for constrained Bayesian optimization using information-based search. JMLR, 17(1), 5549–5601.
  • Huang et al. (2006) Huang, D., Allen, T. T., Notz, W. I., and Miller, R. A. (2006). Sequential kriging optimization using multiple-fidelity evaluations. Struct. Multidisc. Optim., 32(5), 369–382.
  • Kandasamy et al. (2016) Kandasamy, K., Dasarathy, G., Oliva, J. B., Schneider, J., and Póczos, B. (2016). Gaussian process bandit optimisation with multi-fidelity evaluations. In Proc. NIPS, pages 992–1000.
  • Kandasamy et al. (2017) Kandasamy, K., Dasarathy, G., Schneider, J., and Póczos, B. (2017). Multi-fidelity Bayesian optimisation with continuous approximations. In Proc. ICML, pages 1799–1808.
  • Lázaro-Gredilla et al. (2010) Lázaro-Gredilla, M., Quiñonero-Candela, J., Rasmussen, C. E., and Figueiras-Vidal, A. R. (2010). Sparse spectrum Gaussian process regression. JMLR, 11, 1865–1881.
  • Li et al. (2018) Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. (2018). Hyperband: A novel bandit-based approach to hyperparameter optimization. JMLR, 18, 1–52.
  • Minka (2001) Minka, T. P. (2001). A family of algorithms for approximate Bayesian inference. Ph.D. thesis, Massachusetts Institute of Technology.
  • Mockus et al. (1978) Mockus, J., Tiešis, V., and Žilinskas, A. (1978). The application of Bayesian methods for seeking the extremum. In L. C. W. Dixon and G. P. Szegö, editors, Towards Global Optimization 2, pages 117–129. North-Holland Publishing Company.
  • Müller et al. (2007) Müller, P., Berry, D. A., Grieve, A. P., Smith, M., and Krams, M. (2007). Simulation-based sequential Bayesian design. J. Statistical Planning and Inference, 137(10), 3140–3150.
  • Poloczek et al. (2017) Poloczek, M., Wang, J., and Frazier, P. I. (2017). Multi-information source optimization. In Proc. NIPS, pages 4288–4298.
  • Pourmohamad and Lee (2016) Pourmohamad, T. and Lee, H. K. H. (2016). Multivariate stochastic process models for correlated responses of mixed type. Bayesian Anal., 11(3), 797–820.
  • Rahimi and Recht (2007) Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. In Proc. NIPS, pages 1177–1184.
  • Rasmussen and Williams (2006) Rasmussen, C. E. and Williams, C. K. (2006). Gaussian processes for machine learning. MIT Press.
  • Russo et al. (2018) Russo, D. J., van Roy, B., Kazerouni, A., Osband, I., and Wen, Z. (2018). A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning, 11(1), 1–96.
  • Schön and Lindsten (2011) Schön, T. B. and Lindsten, F. (2011). Manipulating the multivariate Gaussian density. Technical report, Division of Automatic Control, Linköping University, Sweden.
  • Sen et al. (2018) Sen, R., Kandasamy, K., and Shakkottai, S. (2018). Multi-fidelity black-box optimization with hierarchical partitions. In Proc. ICML, pages 4538–4547.
  • Shahriari et al. (2016) Shahriari, B., Swersky, K., Wang, Z., Adams, R., and de Freitas, N. (2016). Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1), 148–175.
  • Skolidis (2012) Skolidis, G. (2012). Transfer Learning with Gaussian Processes. Ph.D. thesis, University of Edinburgh.
  • Snoek et al. (2012) Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Proc. NIPS, pages 2951–2959.
  • Srinivas et al. (2010) Srinivas, N., Krause, A., Kakade, S., and Seeger, M. (2010). Gaussian process optimization in the bandit setting: No regret and experimental design. In Proc. ICML, pages 1015–1022.
  • Swersky et al. (2013) Swersky, K., Snoek, J., and Adams, R. P. (2013). Multi-task Bayesian optimization. In Proc. NIPS, pages 2004–2012.
  • Teh and Seeger (2005) Teh, Y. W. and Seeger, M. (2005). Semiparametric latent factor models. In Proc. AISTATS, pages 333–340.
  • Tesch et al. (2013) Tesch, M., Schneider, J., and Choset, H. (2013). Expensive function optimization with stochastic binary outcomes. In Proc. ICML, pages 1283–1291.
  • Villemonteix et al. (2009) Villemonteix, J., Vazquez, E., and Walter, E. (2009). An informational approach to the global optimization of expensive-to-evaluate functions. J. Glob. Optim., 44(4), 509–534.
  • Wackernagel (1998) Wackernagel, H. (1998). Multivariate Geostatistics: An Introduction with Applications. Springer, second edition.
  • Webster and Oliver (2007) Webster, R. and Oliver, M. (2007). Geostatistics for Environmental Scientists. John Wiley & Sons, Inc., second edition.
  • Zhang et al. (2016) Zhang, Y., Hoang, T. N., Low, K. H., and Kankanhalli, M. (2016). Near-optimal active learning of multi-output Gaussian processes. In Proc. AAAI, pages 2351–2357.
  • Zhang et al. (2017) Zhang, Y., Hoang, T. N., Low, K. H., and Kankanhalli, M. (2017). Information-based multi-fidelity Bayesian optimization. In Proc. NIPS Workshop on Bayesian Optimization.

Appendix A Related Work

Some existing BO works focus on optimizing a target function with a binary output type (González et al., 2017; Tesch et al., 2013) but have not considered utilizing the binary outputs for optimizing another correlated function which is more expensive to evaluate. The Bernoulli multi-armed bandit problem (Russo et al., 2018) assumes a binary reward for each action and aims to maximize the cumulative rewards. However, the correlations between the arms and the cross-correlation between the immediate binary reward and the averaged reward are ignored. Other than the multi-fidelity BO algorithms (Section 1), the constrained BO algorithms (Hernández-Lobato et al., 2016) also involve multiple functions (an unknown target function and constraints) when optimizing the target function. Different from our mixed-type BO algorithms that can exploit the cross-correlation structure between the target and binary auxiliary functions, the constrained BO algorithms only consider continuous output types for the unknown constraints and assume the target and constraint functions to be independent. Similar to our CNN experiment (Section 6.2), some hyperparameter optimization methods such as Hyperband (Li et al., 2018) and BOHB (Falkner et al., 2018) have considered speeding up their optimization process by early-stopping the training of underperforming models and continuing that of only the highly ranked ones. However, both methods require the outputs (e.g., validation accuracy) to be continuous for ranking and do not consider the binary auxiliary information. Given the above idea, one may be tempted to exploit the binary information in a similar way: the binary auxiliary function is evaluated for a batch of inputs, and the target function is only evaluated at those inputs in the batch that yield positive auxiliary outputs for finding the global maximum. To achieve this, some important issues need to be considered: (a) Which inputs should we select to evaluate the binary auxiliary function? (b) How many binary auxiliary outputs should we sample before evaluating the expensive target function? (c) If a large proportion of inputs in the batch yield positive auxiliary outputs, then evaluating the target function for all of them can also be very expensive. Which inputs should we select for evaluating the target function such that the global target maximizer can be found given a limited budget? Our proposed MT-ES and MT-PES have resolved all the above issues in a principled manner.

Appendix B Derivation of (8)

Since are jointly modeled as a CMOGP, we know that

(19)

for any (Álvarez and Lawrence, 2011). Then,

(20)

due to (7), (19), and equation c in (Schön and Lindsten, 2011). As a result, the posterior distribution can be approximated with a multivariate Gaussian distribution:

(21)

The first equality is due to (5). The last approximation is due to (20), equation f in (Schön and Lindsten, 2011), and where . Finally, the predictive belief in (8) can be obtained using (19), (21), and equation c in (Schön and Lindsten, 2011).

Appendix C Details of Mixed-Type Random Features (MT-RF)

Using some results of Rahimi and Recht (2007), the prior covariance of the GP modeling (Section 3) can be rewritten as

(22)

where , is the Fourier dual of , and . Let denote a random vector of an -dimensional feature mapping of the input :

(23)

where and with and sampled from and , respectively. From (22) and (23), the prior covariance can be approximated by and the latent function can be approximated by a linear model:

(24)
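For concreteness, the sketch below shows the standard single-output random features construction of Rahimi and Recht (2007) for a squared-exponential kernel, which is the kind of feature mapping underlying (23) and (24); the variable names and hyperparameters are ours, not the paper's.

```python
import numpy as np

def make_rff(input_dim, n_features, lengthscale=1.0, signal_var=1.0, rng=None):
    """Random Fourier features approximating a squared-exponential kernel:
    k(x, x') ~ phi(x)^T phi(x'), with frequencies drawn from the kernel's
    Fourier dual (a Gaussian) and uniform phase shifts (Rahimi and Recht, 2007)."""
    rng = rng or np.random.default_rng()
    W = rng.normal(0.0, 1.0 / lengthscale, size=(n_features, input_dim))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)

    def phi(x):
        return np.sqrt(2.0 * signal_var / n_features) * np.cos(W @ np.asarray(x) + b)
    return phi

phi = make_rff(input_dim=2, n_features=500, lengthscale=0.3, rng=np.random.default_rng(1))
x1, x2 = np.array([0.2, 0.5]), np.array([0.25, 0.55])
print(phi(x1) @ phi(x2))   # approximates the SE kernel value k(x1, x2)
```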

Next, we will show how to derive the following approximation of :