1 Introduction
Bayesian optimization (BO) has recently proven highly effective at optimizing an unknown (possibly noisy, nonconvex, and/or with no closed-form expression/derivative) target function using a finite budget of often expensive function evaluations (Shahriari et al., 2016). As an example, BO is used by Snoek et al. (2012) to determine the setting of input hyperparameters (e.g., learning rate, batch size of data) of a machine learning (ML) model that maximizes its validation accuracy (i.e., output of the unknown target function). Conventionally, a BO algorithm relies on some choice of acquisition function (e.g., improvement-based (Shahriari et al., 2016) such as probability of improvement or expected improvement (EI) over the currently found maximum, information-based (Villemonteix et al., 2009) such as entropy search (ES) (Hennig and Schuler, 2012) and predictive entropy search (PES) (Hernández-Lobato et al., 2014), or upper confidence bound (UCB) (Srinivas et al., 2010)) as a heuristic to guide its search for the global target maximizer. To do this, the BO algorithm exploits the chosen acquisition function to repeatedly select an input for evaluating the unknown target function that trades off between sampling at or near a likely target maximizer based on a Gaussian process (GP) belief of the unknown target function (exploitation) vs. improving the GP belief (exploration) until the budget is expended.
In practice, the expensive-to-evaluate target function often correlates well with some cheaper-to-evaluate binary
auxiliary function(s) that delineate the input regions potentially containing the global target maximizer and can thus be exploited to boost the BO performance. For example, automatically tuning the hyperparameters of a sophisticated ML model (e.g., deep neural network) with BO is usually time-consuming as it may take several hours to days to evaluate the validation accuracy of the ML model under each selected hyperparameter setting when training with a massive dataset. To accelerate this process, consider an auxiliary function whose output is a binary decision of whether the validation accuracy of the ML model under the selected input hyperparameter setting will exceed a prespecified threshold, which is recommended by some early/optimal stopping mechanism (Müller et al., 2007) after a small number of training epochs. Such auxiliary information of binary type is cheaper to obtain and can quickly delineate the input regions containing the best hyperparameter setting, hence incurring less time for exploration. Similarly, to find the best reinforcement learning policy for an AI agent in a game or a real robot in a task with binary outcomes (e.g., success or failure)
(Tesch et al., 2013), maximizing the success rate (i.e., the unknown target function with a continuous output type) averaged over multiple random environments can be accelerated by deciding whether the selected setting of input policy parameters is promising in a single environment (i.e., the auxiliary function with a binary output type). To search for the optimal setting of a system via user interaction (Shahriari et al., 2016), gathering implicit/binary user feedback (e.g., click or not, like or dislike) is often easier than asking for an explicit rating/ranking of a shown example. The above practical examples motivate the need to design and develop a mixed-type BO algorithm that can naturally trade off between exploitation vs. exploration over the target function with a continuous output type and the cheaper-to-evaluate auxiliary function(s) with a binary output type for finding or improving the belief of the global target maximizer, which is the focus of our work here.
In this paper, we generalize information-based acquisition functions like ES and PES to mixed-type ES (MTES) and mixed-type PES (MTPES) for mixed-type BO (Section 4). To the best of our knowledge, these are the first BO algorithms that exploit correlated binary auxiliary information for accelerating the optimization of a continuous target objective function. Different from continuous auxiliary functions which have been exploited by a number of multi-fidelity BO algorithms (Huang et al., 2006; Swersky et al., 2013; Kandasamy et al., 2016, 2017; Poloczek et al., 2017; Sen et al., 2018), the binary auxiliary functions in our problem make the widely used Gaussian likelihood inappropriate and prevent a direct application of existing multi-fidelity BO algorithms. (Footnote 1: We discuss other related works in Appendix A.)
To resolve this, we first propose a mixed-type multi-output GP to jointly model the unknown continuous target function and binary auxiliary functions. Although the exact acquisition function of MTPES cannot be computed in closed form, the main contribution of our work here is to show that it is in fact possible to derive an efficient approximation of MTPES via (a) a novel mixed-type random features (MTRF) approximation of the MOGP model whose cross-correlation structure between the target and auxiliary functions can be exploited for improving the belief of the global target maximizer using the observations from evaluating these functions (Section 5.1), and (b) new practical constraints relating the global target maximizer to the binary auxiliary functions (Section 5.2). We empirically evaluate the performance of MTES and MTPES with synthetic and real-world experiments (Section 6).
2 Problem Setup
In this work, we have access to an unknown target objective function and auxiliary functions defined over a bounded input domain such that each input is associated with a noisy output for . As mentioned in Section 1, a cost is incurred to evaluate function at each input and the target function is more costly to evaluate than the auxiliary functions, i.e., for . Then, the objective is to find the global target maximizer with a lower cost by exploiting the cheaper auxiliary function evaluations, as compared to evaluating only the target function. Our problem differs from that of the conventional multi-fidelity BO in that only the target function returns continuous outputs (i.e., ) while the auxiliary functions return binary outputs (i.e., for ).
3 Mixed-Type Multi-Output GP
Various types of multi-output GP models (Cressie, 1993; Wackernagel, 1998; Webster and Oliver, 2007; Skolidis, 2012; Bonilla et al., 2007; Teh and Seeger, 2005; Álvarez and Lawrence, 2011) have been used to jointly model target and auxiliary functions with continuous outputs. However, none of them can be used straightforwardly in our problem to model the mixed output types due to the non-Gaussian likelihood of the auxiliary functions. To resolve this issue, we generalize the convolved multi-output Gaussian process (CMOGP) to model the correlated functions with mixed continuous and binary output types by approximating the non-Gaussian likelihood using expectation propagation (EP), as discussed later. The CMOGP model is chosen for generalization due to its convolutional structure which can be exploited for deriving an efficient approximation of our acquisition function, as described in Section 5.
Let the target and auxiliary functions be jointly modeled as a CMOGP which defines each function as a convolution between a smoothing kernel and a latent function with an additive bias :
(1)
(Footnote 2: To ease exposition, we consider a single latent function. Note, however, that multiple latent functions can be used to improve the modeling (Álvarez and Lawrence, 2011). More importantly, our proposed MTRF approximation and MTPES algorithm can be easily generalized to handle multiple latent functions, as shown in Appendix G.)
Let and . As shown by Álvarez and Lawrence (2011), if is a GP, then is also a GP, that is, every finite subset of follows a multivariate Gaussian distribution. Such a GP is fully specified by its prior mean and covariance for all , the latter of which characterizes both the correlation structure within each function (i.e., ) and the cross-correlation between different functions (i.e., ). Specifically, let be a GP with zero mean, prior covariance , and where is the signal variance controlling the intensity of the outputs of , and are diagonal precision matrices controlling, respectively, the degrees of correlation between outputs of latent function and cross-correlation between outputs of and . Then, and
(2)
In this work, we assume the Gaussian and probit likelihoods for the target and auxiliary functions, respectively:
(3) 
for . Supposing a column vector of outputs is observed by evaluating each th function at a set of input tuples where , the predictive belief/distribution of for any set of input tuples can be computed by
(4)
For conventional CMOGP with only continuous output types, (4) can be computed analytically since both and are Gaussians (Álvarez and Lawrence, 2011). Unfortunately, the non-Gaussian likelihood in (3) makes the integral in (4) intractable. To resolve this issue, the work of Pourmohamad and Lee (2016) has proposed a sampling strategy based on a sequential Monte Carlo algorithm which, however, is computationally inefficient and makes the approximation of our proposed acquisition function (Section 5) prohibitively expensive. In contrast, we approximate the non-Gaussian likelihood using EP to derive an analytical approximation of (4), as detailed later. EP will be further exploited in Section 5 for approximating our proposed acquisition function efficiently.
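To illustrate the two likelihoods assumed in (3), the following minimal sketch evaluates a Gaussian likelihood for a continuous target output and a probit likelihood for a binary auxiliary output. The function names, the noise variance argument, and the label encoding in {-1, +1} are assumptions made for illustration only and are not taken from our implementation.

```python
import math

def std_normal_cdf(z: float) -> float:
    """CDF of the standard Gaussian, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gaussian_likelihood(y: float, f: float, noise_var: float) -> float:
    """Density of a continuous output y given the latent value f (target function)."""
    return math.exp(-0.5 * (y - f) ** 2 / noise_var) / math.sqrt(2.0 * math.pi * noise_var)

def probit_likelihood(y: int, f: float) -> float:
    """Probability of a binary label y given the latent value f (auxiliary function).

    Assumes labels are encoded as -1 or +1 (a hypothetical convention here).
    """
    return std_normal_cdf(y * f)
```

The probit term is not Gaussian in the latent value, which is why the integral in (4) has no closed form once binary observations are included.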
3.1 Mixed-Type CMOGP Predictive Inference
Let be a set of input tuples of the auxiliary functions. The posterior distribution in (4) can be computed by
(5) 
where can be approximated with a multivariate Gaussian using EP by approximating each non-Gaussian likelihood as a Gaussian. Let
(6) 
for all . Following the EP procedure in Section of Rasmussen and Williams (2006), the parameters and can be computed analytically and
(7) 
where , is a diagonal matrix with diagonal components for , , and for any .
By combining (7), (5), and (3) with (4) (Appendix B), the predictive belief can be approximated by a multivariate Gaussian with the following posterior mean vector and covariance matrix:
(8) 
where , , and is a diagonal matrix with diagonal components . Consequently, the approximated predictive belief of for any input tuple can be computed using . Due to (3) and (8),
(9) 
for where for .
4 BO with Binary Auxiliary Information
To achieve the objective described in Section 2, our BO algorithm repeatedly selects the next input tuple for evaluating the th function at that maximizes a choice of acquisition function per unit cost given the past observations :
and updates until the budget is expended. Since the costs of evaluating the target vs. auxiliary functions differ, we use the above cost-sensitive acquisition function such that the cheaper auxiliary function evaluations can be exploited. We will focus on designing the acquisition function first and the estimation of in real-world applications will be discussed later in Section 6.
Intuitively, should be designed to enable its BO algorithm to jointly and naturally optimize the nontrivial trade-off between exploitation vs. exploration over the target and auxiliary functions for finding or improving the belief of the global target maximizer by utilizing information from the mixed-type CMOGP predictive belief of these functions (8). To do this, one may be tempted to directly use the conventional EI (Mockus et al., 1978) and (Tesch et al., 2013) acquisition functions for selecting inputs to evaluate the target and auxiliary functions, respectively. is a variation of EI and, to the best of our knowledge, the only acquisition function designed for optimizing an unknown function with a binary output type. However, this does not satisfy our objective since aims to find the global maximizer of the auxiliary function which can differ from the global target maximizer if the target and auxiliary functions are not perfectly correlated. To resolve this issue, we propose to exploit information-based acquisition functions and generalize them to our mixed-type BO problem such that input tuples for evaluating the target and auxiliary functions are selected to directly maximize only the unknown target objective function, as detailed later.
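The cost-sensitive selection rule at the start of this section can be sketched as follows: every candidate pair of input and function index is scored by its acquisition value divided by its evaluation cost, and the best pair is returned. This is a minimal sketch under simplifying assumptions: the candidate set is finite, each function has a constant cost, and `acquisition` is a hypothetical stand-in for any acquisition function, not the MTPES approximation itself.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

def select_next(
    candidates: Sequence[float],
    acquisition: Callable[[float, int], float],
    costs: Dict[int, float],
) -> Tuple[float, int]:
    """Return the (input, function index) pair with the largest acquisition value per unit cost."""
    best_pair, best_score = None, -math.inf
    for index, cost in costs.items():
        for x in candidates:
            score = acquisition(x, index) / cost
            if score > best_score:
                best_pair, best_score = (x, index), score
    return best_pair

# Toy example: function 1 is the expensive target, function 2 a cheap auxiliary.
toy_acquisition = lambda x, i: (1.0 if i == 1 else 0.3) * (1.0 - abs(x - 0.5))
chosen = select_next([0.0, 0.5, 1.0], toy_acquisition, {1: 10.0, 2: 1.0})
```

In this toy setting the auxiliary function is chosen even though its raw acquisition value is smaller, because its value per unit cost is larger.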
4.1 Information-Based Acquisition Functions for Mixed-Type BO
Information-based acquisition functions like ES and PES have been designed to enable their BO algorithms to improve the belief of the global target maximizer. In mixed-type BO, we can similarly define a belief of the maximizer of each th function as for . To achieve the objective of maximizing only the target function in mixed-type BO, ES can be used to measure the information gain of only the global target maximizer (i.e., ) from selecting the next input tuple for evaluating the th (possibly binary auxiliary) function at given the past observations :
(10) 
Similar to the multi-task ES algorithm (Swersky et al., 2013) which is designed for BO with continuous auxiliary information, we can use Monte Carlo sampling to approximate (10) by utilizing information from the mixed-type CMOGP predictive belief (i.e., (8) and (9)) of the target and auxiliary functions. To make the Monte Carlo approximation tractable and efficient, we need to discretize the input domain and assume that the search space for evaluating (10) is pruned to a small set of input candidates which, following the work of Swersky et al. (2013), can be selected by applying EI to only the target function. Such a form of approximation, however, faces two critical limitations: (a) Computing (10) incurs cubic time in the size of the discretized input domain and is thus expensive to evaluate with a large input domain (or risks being approximated poorly), and (b) the pruning of the search space artificially constrains the exploration of auxiliary functions and requires a parameter in EI (i.e., to control the exploration-exploitation trade-off) to be manually tuned to fit different real-world applications.
To circumvent the abovementioned issues, we can exploit the symmetric property of conditional mutual information and rewrite (10) as
(11) 
which we call mixed-type PES (MTPES). Intuitively, the selection of an input tuple to maximize (11) has to trade off between exploration of every target and auxiliary function (hence inducing a large Gaussian predictive entropy ) vs. exploitation of the current belief of the global target maximizer to choose a nearby input of function (i.e., convolutional structures and maximizers of the target and auxiliary functions are similar or close (Section 3)) to be evaluated (hence inducing a small expected predictive entropy ) to yield a highly informative observation that in turn improves the belief of . Note that the entropy of continuous random variables (i.e., differential entropy) and that of discrete/binary random variables (i.e., Shannon entropy) are not comparable. (Footnote 3: For example, the Shannon entropy is always nonnegative while the differential entropy can be negative. A detailed discussion of their difference and connection is available in Chapter 8 of Cover and Thomas (2006).) So, the differential entropy terms in (11) for are not comparable to the Shannon entropy terms in (11) for . Fortunately, the difference of the two entropy terms in (11) is exactly the information gain of the global target maximizer in (10) which is comparable between vs. regardless of whether the output is continuous or binary. Next, we will describe how to evaluate (11) efficiently.
5 Approximation of Mixed-Type Predictive Entropy Search
Due to (9), the first Gaussian predictive/posterior entropy term in (11) can be computed analytically:
(12) 
for . Unfortunately, the second term in (11) cannot be evaluated in closed form. Although this second term appears to resemble that in PES (Hernández-Lobato et al., 2014), their approximation method, however, cannot be applied straightforwardly here since it cannot account for either the binary auxiliary information or the complex cross-correlation structure between the target and auxiliary functions. To achieve this, we will first propose a novel mixed-type random features approximation of the CMOGP model whose cross-correlation structure between the target and auxiliary functions can be exploited for sampling the global target maximizer more accurately using the past observations from evaluating these functions (especially when the target function is sparsely evaluated due to its higher cost), which is in turn used to approximate the expectation in (11). Then, we will formalize some practical constraints relating the global target maximizer to the binary auxiliary functions, which are used to approximate the second entropy term within the expectation in (11).
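As a small illustration of the two kinds of entropy discussed above, the sketch below computes the differential entropy of a Gaussian predictive belief and the Shannon entropy of a Bernoulli predictive belief. It relies on two standard identities: the entropy of a Gaussian with variance v is 0.5 log(2 pi e v), and averaging a probit likelihood over a Gaussian latent value with mean m and variance v gives the standard Gaussian cdf evaluated at m / sqrt(1 + v). The function names are hypothetical and the sketch is not claimed to reproduce (12) term by term.

```python
import math

def std_normal_cdf(z: float) -> float:
    """CDF of the standard Gaussian."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gaussian_entropy(var: float) -> float:
    """Differential entropy (in nats) of a Gaussian with variance var."""
    return 0.5 * math.log(2.0 * math.pi * math.e * var)

def probit_predictive(mean: float, var: float) -> float:
    """Probability of a positive label when the latent value is Gaussian(mean, var)."""
    return std_normal_cdf(mean / math.sqrt(1.0 + var))

def bernoulli_entropy(p: float) -> float:
    """Shannon entropy (in nats) of a binary output with positive-label probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)
```

For example, `gaussian_entropy(0.01)` is negative while `bernoulli_entropy(p)` is never negative, which is why only the difference of the two entropy terms in (11) is compared across output types.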
5.1 Mixed-Type Random Features
To approximate the expectation in (11) efficiently by averaging over samples of the target maximizer from in a continuous input domain, we will derive an analytic sample of the unknown function given the past observations , which is differentiable and can be optimized by any existing gradient-based optimization method to search for its maximizer. Unlike the work of Hernández-Lobato et al. (2014) that achieves this in PES using the single-output random features (SRF) method for handling a single continuous output type (Lázaro-Gredilla et al., 2010; Rahimi and Recht, 2007), we have to additionally consider how the binary auxiliary functions and their complex cross-correlation structure with the target function can be exploited for sampling the target maximizer more accurately. To address this, we will now present a novel mixed-type random features (MTRF) approximation of the CMOGP model by first deriving an analytic form of the latent function with SRF and then an analytic approximation of using the convolutional structure of the CMOGP model. The results of EP (6) can be reused here to approximate the non-Gaussian likelihood for .
Using SRF (Rahimi and Recht, 2007), the latent function modeled using GP can be approximated by a linear model where is a random vector of an dimensional feature mapping of the input for and is an dimensional vector of weights. Then, interestingly, by exploiting the convolutional structure of the CMOGP model in (1), can also be approximated analytically by a linear model: where the random vector can be interpreted as input features of , is a random matrix which is used to map in SRF, and function returns a diagonal matrix with the same diagonal components as . The exact definition of and the derivation of are in Appendix C.
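The single-output random features step above can be sketched for a latent function with a squared-exponential covariance of unit signal variance. This is a minimal sketch of the generic random Fourier features construction of Rahimi and Recht (2007), not of the MTRF mapping derived in Appendix C; the isotropic lengthscale, the helper names, and the prior weights drawn from a standard Gaussian are assumptions for illustration.

```python
import math
import random

def make_random_features(dim, num_features, lengthscale, rng):
    """Return phi(x), a random Fourier feature map for a squared-exponential kernel."""
    # Frequencies follow the kernel's spectral density; phases are uniform on [0, 2*pi).
    freqs = [[rng.gauss(0.0, 1.0 / lengthscale) for _ in range(dim)]
             for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def phi(x):
        return [scale * math.cos(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(freqs, phases)]
    return phi

def se_kernel(x, y, lengthscale):
    """Exact squared-exponential kernel with unit signal variance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-0.5 * sq_dist / lengthscale ** 2)

def sample_prior_function(phi, num_features, rng):
    """Draw weights from N(0, I) and return the linear model x -> phi(x)^T theta."""
    theta = [rng.gauss(0.0, 1.0) for _ in range(num_features)]
    return lambda x: sum(p * t for p, t in zip(phi(x), theta))

rng = random.Random(0)
phi = make_random_features(dim=2, num_features=4000, lengthscale=1.0, rng=rng)
x, y = [0.2, -0.1], [0.5, 0.4]
approx = sum(a * b for a, b in zip(phi(x), phi(y)))  # inner product of features
exact = se_kernel(x, y, 1.0)
```

The inner product of the feature vectors approximates the kernel value, and the sampled linear model is an analytic, differentiable function that can be handed to a gradient-based optimizer to locate its maximizer.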
Then, a sample of can be constructed using where and are vectors of features and weights sampled, respectively, from the random vector and the posterior distribution of weights given the past observations , the latter of which is approximated to be Gaussian by exploiting the conditional independence property of MTRF and the results of EP (6) from the mixed-type CMOGP model:
where and , as detailed in Appendix C.2.
Consequently, the expectation in (11) can be approximated by averaging over samples of the target maximizer of to yield an approximation of MTPES:
(13) 
where and for . Drawing a sample of incurs time if and time if , which is more efficient than using Thompson sampling to sample over a discretized input domain that incurs cubic time in its size since a sufficiently fine discretization of the entire input domain is typically larger in size than the number of observations.
5.2 Approximating the Predictive Entropy Conditioned on the Target Maximizer
We will now discuss how the second entropy term in (13) is approximated. Firstly, the posterior distribution of given the past observations and target maximizer is computed by
(14) 
where is defined in (3) and will be approximated by EP, as detailed later. As shown in Section 3, the Gaussian predictive belief (8) can be computed analytically. Then, can be considered as a constrained version of by further conditioning on the target maximizer . It is intuitive that the posterior distribution of is constrained by . However, since only the target maximizer is of interest, how should the value of be constrained by instead of if ? To resolve this, we introduce a slack variable to formalize the relationship between maximizers of the target and auxiliary functions:
(15) 
where measures the gap between the expected maximum of and the expected output of evaluated at and can be approximated efficiently using our MTRF method even though is unknown, as detailed later. Consequently, the following simplified constraints instead of (15) will be used to approximate :

for a given where equals to if , and otherwise.

where and is the largest among the noisy outputs observed by evaluating the target function at .

for . (Footnote 4: Like the work of Swersky et al. (2013) (Section ), we assume the cross-correlation between the target and auxiliary functions to be positive. An auxiliary function that is negatively correlated with the target function can be easily transformed to be positively correlated by negating all its outputs.)
The first constraint keeps the influence of to the next input tuple to be selected by MTPES. Instead of constraining all unknown functions over the entire input domain, and relax (15) to be valid only for the outputs observed from evaluating these functions. When the target and auxiliary functions are highly correlated (i.e., small ), means that a positive label can be observed with high probability by evaluating an auxiliary function at the target maximizer . Using these constraints, which can be approximated analytically using EP. To achieve this, we will first derive a tractable approximation of the posterior distribution which does not depend on the next selected input . Note that such terms can be computed once and reused in the approximation of in (14) which depends on , as detailed later.
Approximating terms independent of . Let and . We can use the cdf of a standard Gaussian distribution and an indicator function to represent the probability of and , respectively. Then, the posterior distribution can be constrained with and by
(16) 
Interestingly, by sampling the target and auxiliary maximizers and using our MTRF method proposed in Section 5.1, the value of in (16) can be approximated by Monte Carlo sampling (Footnote 5: When , is equal to since .):
With the multiplicative form of (16), can be approximated to be a multivariate Gaussian using EP by approximating each non-Gaussian factor (i.e., and ) in (16) to be a Gaussian, as detailed in Appendix D. Consequently, the posterior distribution can be approximated by a Gaussian where is the th component of and is the th diagonal component of .
Approximating terms that depend on . In and , is the only term that is related to . It follows that is conditionally independent of and given . Let . So, where and can be computed analytically using , , and (8), as detailed in Appendix E.
To involve , an indicator function is used to represent the probability that holds. Then, where
(17) 
Since the posterior of has been updated according to and (16), in (17) is updated likewise:
where is computed in (16) using a sampled . Similar to that in (Hernández-Lobato et al., 2014), a one-step EP can be used to approximate (17) as a multivariate Gaussian with the following posterior mean vector and covariance matrix:
(18) 
where , , and . The derivation of (18) is in Appendix F. So, the posterior mean and variance of can be approximated, respectively, using the th component of and th component of denoted by and . As a result, the posterior entropy in (13) can be approximated using (12) by replacing and in (12) with, respectively, and where and are computed in (18) using a sampled .
6 Experiments and Discussion
This section empirically evaluates the performance of our MTPES algorithm against that of (a) the state-of-the-art PES (Hernández-Lobato et al., 2014) without utilizing the binary auxiliary information and (b) MTES performing Monte Carlo approximation of (10). In all experiments, we use random features and samples of the target maximizer in MTPES. The input candidates with top EI values are selected for evaluating MTES. The mixed-type MOGP (MT-MOGP) hyperparameters are learned via maximum likelihood estimation. The performance of the tested algorithms is evaluated using immediate regret (IR) where is their recommended target maximizer. In each experiment, one observation of the target function is randomly selected as the initialization.
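The evaluation metric can be sketched as follows. The IR formula itself is elided in the text above, so this sketch assumes the definition commonly used in the PES literature, namely the absolute difference between the target value at the true maximizer and at the recommended maximizer. The helper for the mean and standard error across runs is likewise a hypothetical illustration of how averaged results with error bars could be computed.

```python
import math

def immediate_regret(f, true_maximizer, recommended):
    """Absolute gap between the optimal target value and the recommended one (assumed definition)."""
    return abs(f(true_maximizer) - f(recommended))

def mean_and_standard_error(values):
    """Sample mean and standard error of the mean over independent runs."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # unbiased sample variance
    return mean, math.sqrt(var / n)

# Toy target with maximizer at x = 1.0 (hypothetical example).
toy_target = lambda x: -(x - 1.0) ** 2
regret = immediate_regret(toy_target, 1.0, 0.5)
avg, stderr = mean_and_standard_error([0.1, 0.2, 0.3])
```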
6.1 Synthetic Experiments
The performance of the tested algorithms is first evaluated using synthetic and benchmark functions.
Figure 2: Averaged IR vs. cost incurred by tested algorithms for (a-b) synthetic functions and (c) Hartmann-6D function. The type and cost of functions used in each experiment are shown in the title and legend of each graph where ‘t’ denotes target function and ‘a1’ and ‘a2’ denote aux1 and aux2 functions, respectively. The error bars are computed in the form of standard error.
Synthetic functions. The synthetic functions are generated using and . To do this, the CMOGP hyperparameters with one latent function are firstly fixed as the values shown in Appendix H.1, which are also used in the tested algorithms as optimal hyperparameters. Then, a set of input tuples are uniformly sampled from and their corresponding outputs are sampled from the CMOGP prior. The target function is set to be the predictive mean of the CMOGP model. The outputs of the auxiliary function are set to be if , and otherwise. An example of the synthetic functions can be found in Figs. 1a to 1c. As can be seen in Figs. 1b and 1c, we can generate multiple auxiliary functions with different proportions of positive outputs from a target function (Fig. 1a) by varying the bias . All these auxiliary functions correlate well with the target function but delineate the input regions containing the target maximizer differently and thus result in different MTPES performance, as will be shown later.
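The effect of the bias on the proportion of positive auxiliary outputs can be sketched with a toy latent function. The thresholding rule (positive label when the latent value plus the bias exceeds zero), the label encoding in {-1, +1}, and the sinusoidal latent function are all assumptions for illustration; the exact rule and values used in our experiments are those elided above and given in Appendix H.1.

```python
import math

def binary_auxiliary(latent_values, bias):
    """Label each input +1 if its biased latent value is positive, and -1 otherwise."""
    return [1 if v + bias > 0.0 else -1 for v in latent_values]

def positive_fraction(labels):
    """Proportion of positive labels."""
    return sum(1 for y in labels if y == 1) / len(labels)

# Toy multimodal latent function on a grid over [0, 1] (hypothetical).
grid = [i / 200.0 for i in range(201)]
latent = [math.sin(6.0 * x) for x in grid]

sparse = binary_auxiliary(latent, bias=-0.5)  # fewer positive outputs
dense = binary_auxiliary(latent, bias=0.5)    # more positive outputs
```

A smaller bias yields a smaller positive region that delineates the neighbourhood of the maximizer more tightly, while a larger bias yields a larger and less informative positive region.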
Empirical analysis of MT-MOGP and MTRF. Firstly, we verify that the MT-MOGP model and MTRF can outperform the conventional GP model and single-output RF by exploiting the cross-correlation structure between the target and auxiliary function aux1 (i.e., Figs. 1a and 1b). Figs. 1d and 1e show the predictive mean and the sampled maximizers of the target function using randomly sampled observations. By comparing Figs. 1d and 1e with Fig. 1a, it can be observed that the MT-MOGP model and MTRF can predict the target function and sample the target maximizer more accurately than the conventional GP model and single-output RF using an additional observations from evaluating aux1.
Empirical analysis of mixed-type BO. Next, the performance of the tested BO algorithms is evaluated using ten groups (i.e., one target function, two auxiliary functions aux1 and aux2 with different ) of synthetic functions generated using the above procedure. We adjust such that around of auxiliary outputs are positive for each aux1 and set for each aux2. An averaged IR is obtained by optimizing the target function in each of them with different initializations for each tested algorithm.
Fig. 2 shows the results of all tested algorithms for synthetic functions with a cost budget of . From Fig. 2a, MTPES can achieve a similar averaged IR with a much lower cost than PES, which implies that the BO performance can be accelerated by exploiting the binary auxiliary information of lower evaluation cost. MTES achieves lower averaged IR than PES with a cost less than but unfortunately performs less well in the remaining BO iterations. Even though the cheap auxiliary outputs provide additional information for finding the target maximizer at the beginning of BO, the multimodal nature of the synthetic function (see Fig. 1a) causes MTES to be trapped easily in some local maximum since its search space has been pruned using EI for time efficiency.
To investigate how the performance of MTPES will be affected by the proportion of positive outputs in different auxiliary functions, we vary the number and bias of the auxiliary function(s) and show the results in Fig. 2b. It can be observed that MTPES using aux2 as the auxiliary function does not converge as fast as MTPES using aux1, which is expected since aux2 with a larger proportion of positive outputs is less informative in delineating the input regions containing the target maximizer than aux1. Also, Fig. 2b shows that MTPES is able to exploit multiple auxiliary functions with different costs to achieve a lower averaged IR than PES with a much lower cost.
Remark. From the results in Fig. 2b, one may expect MTPES to converge faster using an auxiliary function with a smaller proportion of positive outputs, which is not always the case. If the auxiliary function has sparse positive outputs, MTPES will face difficulty finding a positive output when exploring the auxiliary function and start to evaluate the target function after only several negative outputs are observed from evaluating the cheap auxiliary function. These negative outputs may not be informative enough to guide the algorithm to directly evaluate the target function near the likely target maximizer. To reduce the negative effect of such an unexpected behavior in real-world applications with an unknown auxiliary function, we can set MTPES to evaluate only the auxiliary function using a small amount (e.g., ) of the budget at the beginning of BO so that positive auxiliary outputs are highly likely to be observed before MTPES chooses to evaluate the expensive target function.
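The warm-up strategy in the remark above can be sketched as a simple gate on which functions may be evaluated at a given point in the run. The function name, the index convention (1 for the target, larger indices for auxiliary functions), and the warm-up fraction argument are hypothetical; the fraction actually used is the one elided in the remark.

```python
def allowed_functions(spent, budget, warmup_fraction, target=1, auxiliaries=(2,)):
    """Return the function indices that may be evaluated given the cost spent so far.

    During the warm-up phase only the cheap auxiliary functions are allowed;
    afterwards the target function becomes available as well.
    """
    if spent < warmup_fraction * budget:
        return list(auxiliaries)
    return [target, *auxiliaries]

early = allowed_functions(spent=0.0, budget=100.0, warmup_fraction=0.1)
later = allowed_functions(spent=20.0, budget=100.0, warmup_fraction=0.1)
```

The acquisition function is then maximized only over the returned indices, so the expensive target function is not queried until some auxiliary observations have been collected.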
To provide more insight into the approximations of MTPES, we follow the PES paper (Hernández-Lobato et al., 2014) and show the accuracy of the EP approximations (Section 5.2) compared to that of the ground truth constructed using the rejection sampling method. To verify how sensitive the performance of MTPES is to different settings, we have also evaluated the performance of the tested algorithms using synthetic functions with varying costs , random features dimension , and sampling size . The results are reported in Appendix H.1.
Hartmann-6D function. The original Hartmann-6D function is used as the target function and to construct the binary auxiliary function, as detailed in Appendix H.2. Fig. 2c shows results of the tested algorithms with different initializations. It can be observed that MTPES converges faster to a lower averaged IR than PES. However, MTES does not perform well for the Hartmann-6D function which is difficult to optimize due to its multimodal nature (i.e., global maximum and local maxima) and large input domain. The former causes MTES to be trapped easily in some local maximum while the latter prohibits MTES from finely discretizing the input domain to remain computationally tractable.
6.2 RealWorld Experiments
The tested algorithms are next used in hyperparameter tuning of an ML model in an image classification task and policy search for reinforcement learning.
Convolutional neural network (CNN) with CIFAR10 dataset. The six CNN^{6} hyperparameters to be tuned in our experiments are the learning rate of SGD in the range of , three dropout rates in the range of , batch size in the range of , and number of learning epochs in the range of . We use training and validation data of size and , respectively. The unknown target function to be maximized is the validation accuracy evaluated by training the CNN with all the training data. The auxiliary function is the decision made using the Bayesian optimal stopping (BOS) mechanism in (Dai et al., 2019; Müller et al., 2007) by setting as a threshold of the validation accuracy. In particular, we train the same CNN model with a smaller fixed dataset of size randomly selected from the original training data and apply the BOS after training epochs. The BOS will early-stop the training and return if it predicts that a final validation accuracy of can be achieved with a high probability, and otherwise.^{7} The real training time is not known and varies with different settings of hyperparameters. To simplify the setting of the evaluation costs, we use and where is the number of learning epochs in each selected hyperparameter setting.^{8} For this experiment, we additionally compare the tested algorithms with multi-fidelity GPUCB (MFGPUCB) (Kandasamy et al., 2016) that can only exploit continuous auxiliary functions. The auxiliary function of MFGPUCB is the validation accuracy evaluated by training the same CNN with the same data used for the auxiliary function of MTPES.^{9} The actual wall-clock time shown in the results includes the time of both CNN training and BO. The validation accuracy is evaluated by training the CNN with for the tested algorithms.
^{6} We use the example code of keras (i.e., cifar10_cnn.py) and switch the optimizer in their code to SGD.
^{7} A description of BOS is provided in Appendix H.3.
^{8} We use of the training data for evaluating the auxiliary function and early-stop the training after around epochs.
^{9} One may consider constructing the auxiliary function of MFGPUCB with an even smaller training dataset such that its cost is similar to that of the binary auxiliary function. However, for any smaller training dataset, we can always early-stop the training and achieve a much cheaper binary auxiliary function, as compared to the continuous auxiliary function of MFGPUCB constructed using the same dataset.
Policy search for reinforcement learning (RL). We apply the tested algorithms to the CartPole task from OpenAI Gym and use a linear policy consisting of parameters in the range of . This task is defined to be a success (i.e., reward of ) if the episode length reaches , and a failure (reward of ) otherwise. The target function to be maximized is the success rate averaged over episodes with random starting states. The auxiliary function is the reward of one episode with a fixed starting state . and are used in the experiments. The success rate is evaluated by running the CartPole task with as the policy parameters over episodes for the tested algorithms.
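The relation between the target and the binary auxiliary function in the policy search task can be illustrated with a minimal, self-contained sketch. It does not use OpenAI Gym: the `step` function re-implements the standard cart-pole equations of motion, and the episode-length threshold `MAX_STEPS`, the number of episodes, the failure bounds, the starting-state range, and all function names are assumptions chosen for illustration, not values taken from the experiments.

```python
import math
import random

MAX_STEPS = 200  # hypothetical episode length that counts as a success


def step(state, force):
    """One Euler step of the standard cart-pole dynamics."""
    x, x_dot, theta, theta_dot = state
    gravity, m_cart, m_pole, half_len, tau = 9.8, 1.0, 0.1, 0.5, 0.02
    total = m_cart + m_pole
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    temp = (force + m_pole * half_len * theta_dot ** 2 * sin_t) / total
    theta_acc = (gravity * sin_t - cos_t * temp) / (
        half_len * (4.0 / 3.0 - m_pole * cos_t ** 2 / total))
    x_acc = temp - m_pole * half_len * theta_acc * cos_t / total
    return (x + tau * x_dot, x_dot + tau * x_acc,
            theta + tau * theta_dot, theta_dot + tau * theta_acc)


def episode_reward(weights, start_state):
    """Binary reward: 1 if the pole stays up for MAX_STEPS steps, else 0."""
    state = start_state
    for _ in range(MAX_STEPS):
        # Linear policy: push right if the weighted state is positive, else left.
        score = sum(w * s for w, s in zip(weights, state))
        force = 10.0 if score > 0 else -10.0
        state = step(state, force)
        # Assumed failure bounds on cart position and pole angle.
        if abs(state[0]) > 2.4 or abs(state[2]) > 12 * math.pi / 180:
            return 0
    return 1


def auxiliary(weights, fixed_start=(0.0, 0.0, 0.0, 0.0)):
    """Cheap binary auxiliary function: one episode from a fixed start."""
    return episode_reward(weights, fixed_start)


def target(weights, n_episodes=20, seed=0):
    """Expensive target function: success rate over random starting states."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(n_episodes):
        start = tuple(rng.uniform(-0.05, 0.05) for _ in range(4))
        successes += episode_reward(weights, start)
    return successes / n_episodes
```

In this sketch, one call to `auxiliary` simulates a single episode while one call to `target` simulates `n_episodes` of them, which is what makes the binary output the cheaper of the two to query.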
Fig. 3 shows results of the tested algorithms with different initializations for the CNN hyperparameter tuning and RL policy search tasks. It can be observed that both MTES and MTPES converge faster to a smaller IR than the other tested algorithms. MTPES also converges faster than MTES in both experiments. MTES and MTPES outperform MFGPUCB since evaluating the binary auxiliary function by early-stopping the CNN training incurs much less time than evaluating the true validation accuracy for MFGPUCB. Using only hour, MTPES can improve the performance of CNN over that of the baseline achieved using the default hyperparameters in the existing code, which shows that MTPES is promising in quickly finding more competitive hyperparameters of complex ML models.
7 Conclusion
This paper describes novel MTES and MTPES algorithms for mixed-type BO that can exploit cheap binary auxiliary information for accelerating the optimization of a target objective function. A novel mixed-type CMOGP model and its MTRF approximation are proposed for improving the belief of the unknown target function and the global target maximizer using observations from evaluating the target and binary auxiliary functions. New practical constraints are proposed to relate the global target maximizer to the binary auxiliary functions such that MTPES can be approximated efficiently. Empirical evaluation on synthetic functions and real-world applications shows that MTPES outperforms the state-of-the-art BO algorithms. For future work, our proposed mixed-type BO algorithms can be easily extended to handle both binary and continuous auxiliary information, hence generalizing multi-fidelity PES (Zhang et al., 2017).^{10}
^{10} A closely related counterpart is multi-fidelity active learning (Zhang et al., 2016).
Acknowledgements. This research is supported by the Singapore Ministry of Education Academic Research Fund Tier , MOET.
References
 Álvarez and Lawrence (2011) Álvarez, M. A. and Lawrence, N. D. (2011). Computationally efficient convolved multiple output Gaussian processes. JMLR, 12, 1459–1500.
 Bonilla et al. (2007) Bonilla, E. V., Chai, K. M. A., and Williams, C. K. I. (2007). Multi-task Gaussian process prediction. In Proc. NIPS, pages 153–160.
 Cover and Thomas (2006) Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory. John Wiley & Sons.
 Cressie (1993) Cressie, N. A. C. (1993). Statistics for Spatial Data. John Wiley & Sons, Inc., second edition.
 Dai et al. (2019) Dai, Z., Yu, H., Low, K. H., and Jaillet, P. (2019). Bayesian optimization meets Bayesian optimal stopping. In Proc. ICML, pages 1496–1506.
 Falkner et al. (2018) Falkner, S., Klein, A., and Hutter, F. (2018). BOHB: Robust and efficient hyperparameter optimization at scale. In Proc. ICML, pages 1436–1445.
 González et al. (2017) González, J., Dai, Z., Damianou, A., and Lawrence, N. D. (2017). Preferential Bayesian optimization. In Proc. ICML, pages 1282–1291.
 Hennig and Schuler (2012) Hennig, P. and Schuler, C. J. (2012). Entropy search for information-efficient global optimization. JMLR, 13, 1809–1837.
 Hernández-Lobato et al. (2014) Hernández-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. (2014). Predictive entropy search for efficient global optimization of black-box functions. In Proc. NIPS, pages 918–926.
 Hernández-Lobato et al. (2016) Hernández-Lobato, J. M., Gelbart, M. A., Adams, R. P., Hoffman, M. W., and Ghahramani, Z. (2016). A general framework for constrained Bayesian optimization using information-based search. JMLR, 17(1), 5549–5601.
 Huang et al. (2006) Huang, D., Allen, T. T., Notz, W. I., and Miller, R. A. (2006). Sequential kriging optimization using multiple-fidelity evaluations. Struct. Multidisc. Optim., 32(5), 369–382.
 Kandasamy et al. (2016) Kandasamy, K., Dasarathy, G., Oliva, J. B., Schneider, J., and Póczos, B. (2016). Gaussian process bandit optimisation with multi-fidelity evaluations. In Proc. NIPS, pages 992–1000.
 Kandasamy et al. (2017) Kandasamy, K., Dasarathy, G., Schneider, J., and Póczos, B. (2017). Multi-fidelity Bayesian optimisation with continuous approximations. In Proc. ICML, pages 1799–1808.
 Lázaro-Gredilla et al. (2010) Lázaro-Gredilla, M., Quiñonero-Candela, J., Rasmussen, C. E., and Figueiras-Vidal, A. R. (2010). Sparse spectrum Gaussian process regression. JMLR, 11, 1865–1881.
 Li et al. (2018) Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. (2018). Hyperband: A novel bandit-based approach to hyperparameter optimization. JMLR, 18, 1–52.
 Minka (2001) Minka, T. P. (2001). A family of algorithms for approximate Bayesian inference. Ph.D. thesis, Massachusetts Institute of Technology.
 Mockus et al. (1978) Mockus, J., Tiešis, V., and Žilinskas, A. (1978). The application of Bayesian methods for seeking the extremum. In L. C. W. Dixon and G. P. Szegö, editors, Towards Global Optimization 2, pages 117–129. North-Holland Publishing Company.
 Müller et al. (2007) Müller, P., Berry, D. A., Grieve, A. P., Smith, M., and Krams, M. (2007). Simulation-based sequential Bayesian design. J. Statistical Planning and Inference, 137(10), 3140–3150.
 Poloczek et al. (2017) Poloczek, M., Wang, J., and Frazier, P. I. (2017). Multi-information source optimization. In Proc. NIPS, pages 4288–4298.
 Pourmohamad and Lee (2016) Pourmohamad, T. and Lee, H. K. H. (2016). Multivariate stochastic process models for correlated responses of mixed type. Bayesian Anal., 11(3), 797–820.
 Rahimi and Recht (2007) Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. In Proc. NIPS, pages 1177–1184.
 Rasmussen and Williams (2006) Rasmussen, C. E. and Williams, C. K. (2006). Gaussian processes for machine learning. MIT Press.
 Russo et al. (2018) Russo, D. J., van Roy, B., Kazerouni, A., Osband, I., and Wen, Z. (2018). A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning, 11(1), 1–96.
 Schön and Lindsten (2011) Schön, T. B. and Lindsten, F. (2011). Manipulating the multivariate Gaussian density. Technical report, Division of Automatic Control, Linköping University, Sweden.
 Sen et al. (2018) Sen, R., Kandasamy, K., and Shakkottai, S. (2018). Multi-fidelity black-box optimization with hierarchical partitions. In Proc. ICML, pages 4538–4547.
 Shahriari et al. (2016) Shahriari, B., Swersky, K., Wang, Z., Adams, R., and de Freitas, N. (2016). Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1), 148–175.
 Skolidis (2012) Skolidis, G. (2012). Transfer Learning with Gaussian Processes. Ph.D. thesis, University of Edinburgh.
 Snoek et al. (2012) Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Proc. NIPS, pages 2951–2959.
 Srinivas et al. (2010) Srinivas, N., Krause, A., Kakade, S., and Seeger, M. (2010). Gaussian process optimization in the bandit setting: No regret and experimental design. In Proc. ICML, pages 1015–1022.
 Swersky et al. (2013) Swersky, K., Snoek, J., and Adams, R. P. (2013). Multi-task Bayesian optimization. In Proc. NIPS, pages 2004–2012.
 Teh and Seeger (2005) Teh, Y. W. and Seeger, M. (2005). Semiparametric latent factor models. In Proc. AISTATS, pages 333–340.
 Tesch et al. (2013) Tesch, M., Schneider, J., and Choset, H. (2013). Expensive function optimization with stochastic binary outcomes. In Proc. ICML, pages 1283–1291.
 Villemonteix et al. (2009) Villemonteix, J., Vazquez, E., and Walter, E. (2009). An informational approach to the global optimization of expensive-to-evaluate functions. J. Glob. Optim., 44(4), 509–534.
 Wackernagel (1998) Wackernagel, H. (1998). Multivariate Geostatistics: An Introduction with Applications. Springer, second edition.
 Webster and Oliver (2007) Webster, R. and Oliver, M. (2007). Geostatistics for Environmental Scientists. John Wiley & Sons, Inc., second edition.
 Zhang et al. (2016) Zhang, Y., Hoang, T. N., Low, K. H., and Kankanhalli, M. (2016). Near-optimal active learning of multi-output Gaussian processes. In Proc. AAAI, pages 2351–2357.
 Zhang et al. (2017) Zhang, Y., Hoang, T. N., Low, K. H., and Kankanhalli, M. (2017). Information-based multi-fidelity Bayesian optimization. In Proc. NIPS Workshop on Bayesian Optimization.
Appendix A Related Work
Some existing BO works focus on optimizing a target function with a binary output type (González et al., 2017; Tesch et al., 2013) but have not considered utilizing the binary outputs for optimizing another correlated function that is more expensive to evaluate. The Bernoulli multi-armed bandit problem (Russo et al., 2018) assumes a binary reward for each action and aims to maximize the cumulative reward. However, the correlations between the arms and the cross-correlation between the immediate binary reward and the averaged reward are ignored.
Other than the multi-fidelity BO algorithms (Section 1), the constrained BO algorithms (Hernández-Lobato et al., 2016) also involve multiple functions (unknown target function and constraints) when optimizing the target function. Unlike our mixed-type BO algorithms, which can exploit the cross-correlation structure between the target and binary auxiliary functions, the constrained BO algorithms only consider continuous output types for the unknown constraints and assume the target and constraint functions to be independent.
Similar to our CNN experiment (Section 6.2), some hyperparameter optimization methods such as Hyperband (Li et al., 2018) and BOHB (Falkner et al., 2018) have considered speeding up their optimization process by early-stopping the training of underperforming models and continuing that of only the highly ranked ones. However, both methods require the outputs (e.g., validation accuracy) to be continuous for ranking and do not consider the binary auxiliary information.
Given the above idea, one may be tempted to exploit the binary information in a similar way: The binary auxiliary function is evaluated for a batch of inputs, and the target function is only evaluated at those inputs in the batch that yield positive auxiliary outputs for finding the global maximum. To achieve this, some important issues need to be considered: (a) Which inputs should we select to evaluate the binary auxiliary function? (b) How many binary auxiliary outputs should we sample before evaluating the expensive target function? (c) If a large proportion of inputs in the batch yield positive auxiliary outputs, then evaluating the target function for all of them can also be very expensive. Which inputs should we select for evaluating the target function such that the global target maximizer can be found given a limited budget? Our proposed MTES and MTPES resolve all the above issues in a principled manner.
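The naive batch-screening scheme discussed above can be sketched in a few lines to make issue (c) concrete. This is only an illustration of the heuristic, not of MTES or MTPES; the function name, the toy target and auxiliary functions, the threshold, and the unit costs are all hypothetical.

```python
def naive_screen_then_evaluate(batch, auxiliary, target, aux_cost, target_cost):
    """Evaluate the binary auxiliary function on a whole batch, then evaluate
    the target only at inputs with a positive auxiliary output."""
    positives = [x for x in batch if auxiliary(x) == 1]
    values = {x: target(x) for x in positives}
    total_cost = aux_cost * len(batch) + target_cost * len(positives)
    best = max(values, key=values.get) if values else None
    return best, total_cost


# Toy functions (assumptions for illustration only).
def toy_target(x):
    return -(x - 0.5) ** 2


def toy_auxiliary(x):
    # Binary decision: does the target exceed a fixed threshold?
    return 1 if toy_target(x) > -0.1 else 0


batch = [0.0, 0.25, 0.5, 0.75, 1.0]
best, total_cost = naive_screen_then_evaluate(
    batch, toy_auxiliary, toy_target, aux_cost=1, target_cost=10)
```

Here three of the five inputs pass the screen, so most of the total cost is still spent on target evaluations; the scheme also leaves open how the batch itself is chosen, which are the issues (a) to (c) raised above.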
Appendix B Derivation of (8)
Since are jointly modeled as a CMOGP, we know that
(19) 
for any (Álvarez and Lawrence, 2011). Then,
(20) 
due to (7), (19), and equation c in (Schön and Lindsten, 2011). As a result, the posterior distribution can be approximated with a multivariate Gaussian distribution:
(21) 
The first equality is due to (5). The last approximation is due to (20), equation f in (Schön and Lindsten, 2011), and where . Finally, the predictive belief in (8) can be obtained using (19), (21), and equation c in (Schön and Lindsten, 2011).
Appendix C Details of Mixed-Type Random Features (MTRF)
Using some results of Rahimi and Recht (2007), the prior covariance of the GP modeling (Section 3) can be rewritten as
(22) 
where , is the Fourier dual of , and . Let denote a random vector of an dimensional feature mapping of the input :
(23) 
where and with and sampled from and , respectively. From (22) and (23), the prior covariance can be approximated by and the latent function can be approximated by a linear model:
(24) 
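The random-features idea of Rahimi and Recht (2007) used above can be illustrated with a small sketch. As an assumption made for simplicity, it uses a plain squared-exponential kernel rather than the CMOGP covariance of this paper, and the names `se_kernel`, `sample_feature_map`, and `dot` are illustrative only. Frequencies are drawn from the Fourier dual of the kernel (a Gaussian) and phases from a uniform distribution on [0, 2π], so that the inner product of two feature vectors approximates the prior covariance and a linear model in the features approximates a draw of the latent function.

```python
import math
import random


def se_kernel(x, y, lengthscale=1.0):
    """Exact squared-exponential kernel k(x, y)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-0.5 * sq_dist / lengthscale ** 2)


def sample_feature_map(dim, m, lengthscale=1.0, seed=0):
    """Draw m random Fourier features; phi(x) . phi(y) approximates k(x, y)."""
    rng = random.Random(seed)
    # Frequencies follow the Fourier dual of the SE kernel: N(0, 1/lengthscale^2).
    freqs = [[rng.gauss(0.0, 1.0 / lengthscale) for _ in range(dim)]
             for _ in range(m)]
    # Phases are uniform on [0, 2*pi].
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(m)]
    scale = math.sqrt(2.0 / m)

    def phi(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(freqs, phases)]

    return phi


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


m = 5000
phi = sample_feature_map(dim=2, m=m)
x, y = (0.3, -0.2), (0.8, 0.1)
approx_cov = dot(phi(x), phi(y))  # random-feature approximation of k(x, y)
exact_cov = se_kernel(x, y)

# Linear model: a sample of the latent function is phi(x)^T theta, theta ~ N(0, I).
theta_rng = random.Random(1)
theta = [theta_rng.gauss(0.0, 1.0) for _ in range(m)]
f_x = dot(phi(x), theta)
```

The approximation error shrinks at a rate of roughly one over the square root of the number of features `m`, so a larger `m` trades computation for accuracy.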
Next, we will show how to derive the following approximation of :
(25) 