1 Introduction
As a well-established approach for sample-efficient global optimization of black-box functions that are expensive to evaluate, Bayesian optimization (BO) is used in many tasks such as hyperparameter tuning (Hutter2011Sequential; Bergstra2011Algorithm; Snoek2012Practical), neural architecture search (Kandasamy2018Neural), and chemical structure search (Bombarelli2018Automatic). BO provides a principled method for finding the global optimum of a black-box function: using a cheap probabilistic surrogate model of the black-box function as the input to the acquisition function, it repeatedly trades off exploitation against exploration to select promising points. The surrogate model is constructed from the evaluation values observed so far. A widely-used surrogate model is Gaussian Process regression, which provides uncertainty quantification of the function value by imposing a Gaussian Process prior.
While BO provides such an automated procedure, it still faces a huge challenge in high-dimensional scenarios. To ensure convergence when learning the function response values, the sample complexity depends exponentially on the number of dimensions (Shahriari2016Taking). In practice, BO is limited to optimization problems with around 20 dimensions when using Gaussian Process regression as the surrogate model (frazier2018tutorial).
To handle high-dimensional Bayesian optimization, many methods have been proposed. Based on the assumption that only a small number of dimensions influence the response value of the objective function, embedding methods perform BO in a low-dimensional space. The corresponding projection matrix can be constructed randomly (Wang2013Bayesian; nayebi19a) or learned actively (Djolonga2013High; zhang2019high). Some methods impose an additive structure on the objective function (gardner17Discovering; k2015high). Besides, many methods start from learning a low-dimensional embedding and find an effective subspace through nonlinear dimension reduction (lu2018Structured). However, there are two limitations in the existing methods. First, the projection matrix is immutable. Once the generated low-dimensional embedding fails to represent the intrinsic structure of the objective function, finding the global optimum through Bayesian optimization becomes very difficult. Second, the low-dimensional space is learned in a supervised way, where the label of each point is the response value of the black-box function. To learn an effective low-dimensional space, a large number of labeled points are required, which leads to a huge computation cost, especially when the evaluation of the objective function is expensive.
In this paper, we propose a novel framework called SILBO (short for Semi-supervised, Iterative, and Learning-based Bayesian Optimization; the code is available at https://github.com/cjfcsjt/SILBO) to mitigate these problems by learning the effective low-dimensional space iteratively through a semi-supervised dimension reduction method. After a low-dimensional space is identified, Bayesian optimization is performed in it, leading to a stable and reliable estimation of the global optimum. Specifically, the contributions of this paper are as follows:

We propose an iterative method in SILBO to update the projection matrix dynamically. During each iteration of BO, a semi-supervised low-dimensional embedding learning method constructs the projection matrix by utilizing both labeled and unlabeled points acquired from the acquisition function of BO.

To accelerate the semi-supervised dimension reduction, we further propose a randomized method to solve the high-dimensional generalized eigenvalue problem efficiently. We also analyze its time complexity in detail.

Furthermore, to map from the low-dimensional space to the original high-dimensional space, we propose two mapping strategies, SILBO-TD and SILBO-BU, chosen according to the evaluation overhead of the objective function.

Experimental results on both synthetic functions and neural network hyperparameter optimization tasks reveal that SILBO outperforms the existing state-of-the-art high-dimensional Bayesian optimization methods.
The rest of this paper is organized as follows. Section 2 gives an overview of related work. Section 3 states the problem and lists relevant background materials. The SILBO algorithm is proposed in Section 4. The experimental evaluation is presented in Section 5 and the conclusion is given in Section 6.
2 Related Work
Bayesian optimization has achieved great success in many low-dimensional applications (k2017multifidelity; klein2016fast; Swersky2013Multi; wu2019practical; Wu2017Bayesian; hernndezlobato2015predictive). However, extending BO to high dimensions is still a challenge. Recently, high-dimensional BO has received increasing attention, and a large body of literature has been devoted to addressing this issue.
Given the assumption that only a few dimensions play a decisive role, linear low-dimensional embedding methods achieve dimension reduction using a projection matrix. In REMBO (Wang2013Bayesian), the projection matrix is randomly generated from a Gaussian distribution. Promising points are searched in the low-dimensional space by performing Gaussian Process regression and then mapped back to the high-dimensional space for evaluation. It has been proven that REMBO has a high probability of finding the global optimum via convex projection, although this probability is no longer guaranteed when a box bound exists. Another problem of REMBO is over-exploration of the boundary. To address the over-exploration, a carefully-selected bound in the low-dimensional embedding space was proposed (binois2017choice), which finds the corresponding points in the high-dimensional space by solving a quadratic programming problem. The BOCK algorithm (oh2018bock) scales to high-dimensional spaces using a cylindrical transformation of the search space. HeSBO (nayebi19a) employs the count-sketch method to alleviate the over-exploration of the boundary and uses the hashing technique to improve computational efficiency. HeSBO also shows that the mean and variance functions of the Gaussian Process do not deviate greatly under certain conditions. However, the above-mentioned methods only use prior information to generate the projection matrix randomly; they do not employ the information of the actual initial points to learn a low-dimensional embedding actively.
Different from the previous methods, learning-based methods have been proposed. SIR-BO (zhang2019high) uses a supervised dimension reduction method to learn a low-dimensional embedding, while SI-BO (Djolonga2013High) employs low-rank matrix recovery to learn the embedding. However, the low-dimensional embedding learned by these methods is immutable: once the projection matrix is generated from the initial points, it is never updated. In some scenarios, because only a small number of initial points have been evaluated, the low-dimensional embedding space cannot accurately reflect the information of the objective function.
Another way to handle high-dimensional BO is to assume an additive structure of the objective function (gardner17Discovering). Typically, ADD-BO (k2015high) optimizes the objective function on disjoint subspaces decomposed from the high-dimensional space. Unfortunately, the additive assumption does not hold in most practical applications. Besides, nonlinear embedding methods are also attractive (eissman2018bayesian; lu2018Structured; moriconi2019highdimensional). These methods use the Variational Autoencoder (VAE) to learn a low-dimensional embedding space. The advantage of nonlinear learning methods is that points in the original space can be easily reconstructed through the nonlinear mapping. However, training a VAE requires a large number of points. When the evaluation cost of the objective function is expensive, the nonlinear embedding method is almost impractical.
In this paper, we focus on the linear low-dimensional embedding method and propose an iterative, semi-supervised method to learn the embedding.
3 Preliminary
In this section, we give the problem setup and introduce Bayesian optimization (BO), semi-supervised discriminant analysis (SDA), and sliced inverse regression (SIR).
3.1 Problem Setup
Consider a black-box function f: X → R defined on a high-dimensional and continuous domain X ⊆ R^D. f is computationally expensive and may be non-convex. Given x ∈ X, we can only access the noisy response value y = f(x) + ε with noise ε. Also, we assume that the objective function has d intrinsic dimensions (d ≪ D). In other words, there exist an embedding matrix B ∈ R^{d×D} with orthogonal rows and a function g: R^d → R such that f(x) = g(Bx) for all x ∈ X. Our goal is to find the global optimum:

x* = argmax_{x ∈ X} f(x). (1)
3.2 Bayesian Optimization
Bayesian optimization is an iterative framework that combines a surrogate model of the black-box function with a search strategy that tries to find possible points with large response values. Given observed points {x_1, …, x_n} and their corresponding evaluation values y = {y_1, …, y_n}, the surrogate model is usually Gaussian Process regression, which imposes a prior f ~ GP(μ, k) on the objective function, with mean function μ(x) at each x and covariance function (kernel) k(x, x') at each pair of points (x, x'). The kernel function describes the similarity between inputs; one of the widely-used kernel functions is the Matérn kernel. Then, given a new point x*, the prediction of the response value can be calibrated by the posterior probability distribution (noise-free):

f(x*) | y ~ N(μ(x*), σ^2(x*)), (2)
μ(x*) = k(x*)^T K^{-1} y, (3)
σ^2(x*) = k(x*, x*) − k(x*)^T K^{-1} k(x*), (4)

where K = [k(x_i, x_j)]_{i,j=1}^{n} is the kernel matrix and k(x*) = [k(x*, x_1), …, k(x*, x_n)]^T.
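To make Equations 3 and 4 concrete, here is a minimal NumPy sketch of the noise-free GP posterior. It is our own illustration, not the paper's implementation; the Matérn-5/2 kernel with unit length-scale is an assumption for the example.

```python
import numpy as np

def matern52(X1, X2, ls=1.0):
    """Matern-5/2 kernel between row-wise sample sets X1 and X2."""
    d = np.sqrt(((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)) / ls
    return (1 + np.sqrt(5) * d + 5 * d ** 2 / 3) * np.exp(-np.sqrt(5) * d)

def gp_posterior(X, y, Xstar, jitter=1e-10):
    """Noise-free GP posterior mean (Eq. 3) and variance (Eq. 4)."""
    K = matern52(X, X) + jitter * np.eye(len(X))   # kernel matrix K
    Ks = matern52(X, Xstar)                        # k(x*) for each test point
    mu = Ks.T @ np.linalg.solve(K, y)              # k(x*)^T K^{-1} y
    v = np.linalg.solve(K, Ks)                     # K^{-1} k(x*)
    var = 1.0 - np.sum(Ks * v, axis=0)             # k(x*, x*) = 1 for this kernel
    return mu, np.maximum(var, 0.0)
```

At an already-observed point the posterior mean reproduces the observation and the variance collapses to zero, as expected from Equations 3 and 4.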
At each iteration of Bayesian optimization, the predictive mean and variance are used as uncertainty quantification, supporting the subsequent acquisition function optimization. The acquisition function tries to balance exploration (high variance) and exploitation (high mean value). A commonly-used acquisition function for searching promising points is UCB (Upper Confidence Bound) (Srinivas2010Gaussian). UCB tries to select the next point with the largest plausible response value according to Equation 5:

x_{t+1} = argmax_{x ∈ X} μ_t(x) + √(β_t) σ_t(x), (5)

where β_t is a parameter used to achieve the trade-off between exploration and exploitation. In this work, we also experiment with EI (Expected Improvement) (Snoek2012Practical), which is another popular acquisition function.
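As a hedged illustration (our own sketch, not the paper's code), both acquisition functions can be computed directly from the posterior mean and standard deviation; for maximization, EI is taken with respect to the incumbent best value:

```python
import numpy as np
from scipy.stats import norm

def ucb(mu, sigma, beta=2.0):
    """Upper Confidence Bound (Equation 5): mu + sqrt(beta) * sigma."""
    return mu + np.sqrt(beta) * sigma

def expected_improvement(mu, sigma, best_y):
    """Expected Improvement over the incumbent best_y (maximization)."""
    sigma = np.maximum(sigma, 1e-12)        # guard against zero variance
    z = (mu - best_y) / sigma
    return (mu - best_y) * norm.cdf(z) + sigma * norm.pdf(z)

# The next candidate is the maximizer over a candidate set, e.g.:
# z_next = candidates[np.argmax(ucb(mu, sigma))]
```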
3.3 Semi-supervised Discriminant Analysis
Semi-supervised discriminant analysis (SDA) (Cai2007Semi) is a semi-supervised linear dimension reduction algorithm that leverages both labeled and unlabeled points. SDA aims to find a projection that reflects the discriminant structure inferred from the labeled points, as well as the intrinsic geometrical structure inferred from both labeled and unlabeled points. SDA is an extension of linear discriminant analysis (LDA). The original LDA aims to find a projection such that the ratio of the between-class scatter matrix to the total within-class scatter matrix is maximized. When the number of training samples is small, this can easily lead to overfitting. SDA solves this problem by introducing a regularizer combined with unlabeled information.
max_a (a^T S_b a) / (a^T S_t a + α J(a)), (6)

where α is a coefficient used to balance model complexity and empirical loss. The regularizer J(a) is constructed by considering a graph incorporating neighbor information of the labeled and unlabeled points, where the weight W_ij indicates whether x_i and x_j are neighbors. Motivated by spectral dimension reduction (Belkin2002Laplacian), the regularizer can be defined as J(a) = Σ_{i,j} (a^T x_i − a^T x_j)^2 W_ij for any two points x_i and x_j. Then, given the dataset X = [x_1, …, x_n], J(a) can be written as:

J(a) = 2 a^T X L X^T a, (7)

where D is a diagonal matrix with D_ii = Σ_j W_ij, and L = D − W is a Laplacian matrix (Chung1997Spectral). Finally, SDA can be reformulated as solving the following generalized eigenvalue problem:

S_b a = λ (S_t + α X L X^T) a. (8)
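For illustration (our own sketch, with a random positive-definite matrix standing in for the between-class scatter S_b), the Laplacian identity behind Equation 7 and the generalized eigenvalue problem of Equation 8 can be checked numerically with `scipy.linalg.eigh`:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n, D = 30, 5
X = rng.standard_normal((n, D))              # rows are samples here

# Symmetric kNN graph weights W
d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
W = np.zeros((n, n))
for i in range(n):
    for j in np.argsort(d2[i])[1:6]:         # 5 nearest neighbors of i
        W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(1)) - W                    # graph Laplacian L = D - W

# Identity behind Eq. 7 (row-wise data):
#   a^T X^T L X a = 1/2 * sum_ij W_ij (a^T x_i - a^T x_j)^2
a = rng.standard_normal(D)
p = X @ a
assert np.isclose(a @ (X.T @ L @ X) @ a,
                  0.5 * np.sum(W * (p[:, None] - p[None, :]) ** 2))

# Eq. 8: S_b a = lam (S_t + alpha X^T L X) a, solved with scipy's eigh
Sb = rng.standard_normal((D, D)); Sb = Sb @ Sb.T   # placeholder scatter matrix
St = X.T @ X
alpha = 0.1
lam, vecs = eigh(Sb, St + alpha * X.T @ L @ X)     # ascending eigenvalues
a_star = vecs[:, -1]                               # projection direction
```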
3.4 Sliced Inverse Regression
Sliced inverse regression (SIR) (Li1991Sliced) is a supervised dimension reduction method for continuous response values. SIR aims to find an effective low-dimensional space. The dimension reduction model is:

y = g(β_1^T x, …, β_d^T x, ε), (9)

where y is the response variable, x ∈ R^D is an input vector, and g is an unknown function with d + 1 arguments. β_1, …, β_d denote orthogonal projection vectors, d denotes the dimension of the e.d.r. (effective dimension reducing) space, and ε is noise. The core idea of SIR is to swap the positions of x and y: the algorithm cuts the response values into H slices and considers the D-dimensional inverse regression curve E[x | y] rather than regressing y on x directly. SIR assumes the existence of d e.d.r. directions, such that the inverse regression curve just falls into a d-dimensional space. SIR finds the directions by minimizing the total within-slice scatter S_w and maximizing the between-slice scatter S_b. Similar to LDA, the problem can be reformulated as solving the following generalized eigenvalue problem:

S_b β = λ S_w β. (10)
4 The SILBO Algorithm
In this section, we propose a framework that addresses the high-dimensional optimization challenge by learning a low-dimensional embedding space associated with a projection matrix B. To learn the intrinsic structure of the objective function effectively, we iteratively update B through semi-supervised dimension reduction. Moreover, we propose a randomized method to accelerate the computation of the embedding matrix in the high-dimensional scenario. By performing BO in the learned low-dimensional space, the algorithm can approach the point corresponding to the global optimum as closely as possible.
Given the labeled points X_L and unlabeled points X_U, where a label represents the evaluation value of the corresponding point, SILBO consists of three tightly-connected phases:

embedding learning, which tries to find an effective low-dimensional space of the objective function by utilizing both X_L and X_U;

performing Gaussian Process regression in the learned low-dimensional embedding space and selecting candidate points according to the acquisition function of BO;

evaluating points with the objective function f, then updating the Gaussian Process surrogate model and the projection matrix B.
Algorithm 1 summarizes the three steps. The first step is to construct the projection matrix B and find an effective low-dimensional space that keeps the information of the original space as much as possible, where X_U is the set of unlabeled points obtained from the acquisition function and X_L is the set of labeled points that have been evaluated.
The second step is to find the possible low-dimensional optimum z* that maximizes the acquisition function, and to select several promising unlabeled points that can be used to update B in the next iteration. Then, these low-dimensional points are mapped back to the high-dimensional space through a specific mapping, yielding the evaluation point x* and the unlabeled points X_U.
Finally, we compute y* by evaluating the objective function f at x* and update the GP model with the pair (z*, y*). The point x* and the points X_U are added to the labeled set and the unlabeled set, respectively, for updating the embedding in the next iteration.
The low-dimensional embedding is learned through SIR combined with the semi-supervised technique. The between-slice scatter matrix and total within-slice scatter matrix are constructed by utilizing the local information as well as the unlabeled points. Then, B is obtained by solving the resulting generalized eigenvalue problem. Based on the randomized SVD method, Algorithm 2 is proposed to speed up the solution of this problem. Moreover, we carefully analyze the mapping between the low-dimensional space and the high-dimensional space and further propose two strategies for scenarios with different evaluation costs.
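The overall loop can be sketched end-to-end on a toy problem. The sketch below is our own simplified illustration, not the SILBO implementation: the semi-supervised embedding step is replaced by an SVD of slice means of the labeled points only, the GP uses a squared-exponential kernel, the acquisition is UCB over random candidates, and the mapping back to the high-dimensional space is the pseudo-inverse B^T z.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, n_init, n_iter = 20, 2, 10, 15

def f(x):                                    # toy objective, 2 active dimensions
    return -((x[0] - 0.3) ** 2 + (x[1] + 0.2) ** 2)

def learn_embedding(XL, yL, d, n_slices=5):
    """Stand-in for semi-supervised SIR: SVD of per-slice means (labeled only)."""
    Xc = XL - XL.mean(0)
    slices = np.array_split(np.argsort(yL), n_slices)
    M = np.stack([Xc[s].mean(0) for s in slices])
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[:d]                            # d x D matrix with orthonormal rows

def k(A, B):                                 # squared-exponential kernel
    return np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

XL = rng.uniform(-1, 1, (n_init, D))         # initial design
yL = np.array([f(x) for x in XL])

for t in range(n_iter):
    B = learn_embedding(XL, yL, d)           # step 1: (re)learn the embedding
    ZL = XL @ B.T                            # project labeled points
    K = k(ZL, ZL) + 1e-8 * np.eye(len(ZL))
    cand = rng.uniform(ZL.min(), ZL.max(), (200, d))
    Ks = k(ZL, cand)
    mu = Ks.T @ np.linalg.solve(K, yL)       # step 2: GP posterior + UCB
    var = np.maximum(1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0), 0.0)
    z_star = cand[np.argmax(mu + 2.0 * np.sqrt(var))]
    x_star = np.clip(B.T @ z_star, -1, 1)    # map back to the original space
    XL = np.vstack([XL, x_star])             # step 3: evaluate and augment
    yL = np.append(yL, f(x_star))

best = yL.max()
```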
4.1 Semi-supervised Embedding Learning
The assumption is that there is a low-dimensional space that preserves the information of the objective function defined in the high-dimensional space. In other words, the dimensionality of x can be reduced without losing the essential information needed to predict the response value y. Meanwhile, if there are enough evaluated points for the initialization of Bayesian optimization, we may be able to explore the intrinsic structure of f through these points. However, for optimization problems with high evaluation cost, only a few evaluated points are available, so it is difficult to learn a proper embedding from them alone. In such a case, reasonable and effective use of the unevaluated points acquired from the acquisition function of BO is helpful for embedding learning.
Although SDA provides an effective strategy to incorporate the unlabeled data, it is only suitable for classification problems. Therefore, we need to extend it to scenarios where the response value is continuous. At the same time, SIR is suitable for supervised dimension reduction tasks in the continuous domain. SIR aims to find the e.d.r. directions that minimize the total within-slice scatter and maximize the between-slice scatter, a purpose very similar to that of LDA. In fact, SIR is equivalent to LDA (Wu2010Local). Due to the equivalence of these two methods, along with the fact that SDA is an extension of LDA, we can employ the discreteness brought by the slicing technique in SIR to apply SDA to the continuous domain. Next, we introduce how to construct semi-SIR to learn a low-dimensional subspace embedding.
We first assume that nearby points often have not only close response values but also similar low-dimensional representations. Furthermore, we sort the evaluated points according to their response values and cut them into H slices. This process is equivalent to assigning the labeled points to H different classes. Therefore, similar to SDA (Cai2007Semi), our problem can be reformulated as follows. Given labeled points X_L and unlabeled points X_U, where l and u denote the number of labeled and unlabeled points respectively, we aim to find the embedding matrix B by solving:
X̄^T W X̄ β = λ X̄^T W̃ X̄ β, (11)

where X̄ denotes the centered data matrix whose rows represent the samples in X_L ∪ X_U (n = l + u rows in total), and W is a weight matrix encoding the slice information of the labeled points. W̃ can be expressed as:

W̃ = I + α L, (12)

where I is an n × n identity matrix.
Note that the between-slice scatter matrix can be written in an elegant linear algebraic formulation:

S_b = X̄^T J J^T X̄ = X̄^T W X̄, (13)

where J denotes the rescaled slice membership matrix for the evaluated samples, with J_{ih} = 1/√(n_h) if the i-th sample of X_L is a member of the h-th slice. Otherwise, J_{ih} = 0. n_h denotes the number of samples in the h-th slice (the rows and columns of W corresponding to unlabeled points are zero).
For the labeled points in the same slice, their response values are very close, but this closeness may not be retained in the input space: there could be a small number of points that are far away from the others in each slice. Although these outliers may indicate that there exist other areas of the objective function with large response values, they are likely to interfere with the learned embedding. Thus, to reveal the main information of the objective function as much as possible, we employ a localization technique. By strengthening the local information of each slice, the degeneracy issue in the original SIR can be mitigated (Wu2010Local). Next, we illustrate how to construct W in Equation 13 with the local information.
We note that W in Equation 13 is a block matrix, and each block corresponds to a slice:

W = diag(W^{(1)}, W^{(2)}, …, W^{(H)}). (14)

For the original between-slice scatter S_b, W^{(h)} can only indicate whether a sample point belongs to a slice. Here, we strengthen the localization information by introducing a localized weight. For the evaluated samples, we let W^{(h)}_{ij} = 1/c_h if the j-th sample belongs to the k nearest neighbors of the i-th sample (j can equal i) in the h-th slice. Otherwise, W^{(h)}_{ij} = 0. H denotes the number of slices, k is a parameter for kNN, and c_h is the number of neighbor pairs in the h-th slice.
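A small sketch of the localized block weight matrix in Equation 14 (our own; `n_slices` and `k` are illustrative values), where each within-slice neighbor pair receives weight 1/c_h:

```python
import numpy as np

def localized_slice_weights(XL, yL, n_slices=4, k=3):
    """Block matrix W = diag(W^(1), ..., W^(H)); inside slice h, entry (i, j)
    is 1/c_h if x_j is among the k nearest neighbors of x_i (j may equal i)."""
    l = len(yL)
    W = np.zeros((l, l))
    for sl in np.array_split(np.argsort(yL), n_slices):   # sort by y, then slice
        d2 = ((XL[sl][:, None] - XL[sl][None, :]) ** 2).sum(-1)
        nbr = np.zeros((len(sl), len(sl)), dtype=bool)
        for i in range(len(sl)):
            nbr[i, np.argsort(d2[i])[:k]] = True          # k nearest, incl. self
        W[np.ix_(sl, sl)] = nbr / nbr.sum()               # c_h = number of pairs
    return W
```

Each diagonal block sums to one by construction, and all entries that cross slice boundaries are zero.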
On the other hand, if these outliers contain enough information, then through the iterative process of Bayesian optimization, we have a great probability of obtaining other points near the outliers. As a result, the number of such points may grow until they become the leading samples in their slice and guide the generation of the embedding. Therefore, it is necessary to update the projection matrix iteratively.
In summary, we aim to estimate the intrinsic structure of the objective function and find an effective subspace that reflects large response values, putting us in a better position to identify the possible global optimum. In combination with Bayesian optimization, whose acquisition function provides candidate points with potentially large response values, semi-supervised dimension reduction can alleviate the overfitting caused by the small number of labeled points. Thus, by utilizing both the labeled and unlabeled points, we can learn a d-dimensional space that better preserves the intrinsic structure of the objective function.
4.2 Randomized Semi-supervised Dimension Reduction
In the high-dimensional scenario, learning a low-dimensional embedding space still requires a lot of time. The main bottlenecks are the solution of the generalized eigenvalue problem in Equation 11 and the computation of the neighbor information in each slice. The major computation cost of a traditional method such as the two-stage SVD algorithm is the decomposition of the large-scale matrix. Fortunately, the randomized SVD (RSVD) technique (Halko2011Finding) sheds light on this problem.
The contribution of RSVD here is that when the effective rank k ≪ min(m, n), approximating an m × n matrix with randomized SVD requires only O(mn log k) operations, rather than the O(mnk) of classical deterministic methods. Moreover, empirical results show that when solving the generalized eigenvalue problem, the performance of the randomized algorithm is superior to that of the traditional deterministic method in the high-dimensional case (georgiev2012randomized). Thus, we first replace the two-stage SVD with RSVD to accelerate the solution of Equation 11.
We note that the decomposition overhead of the centered data matrix X̄ in Equation 11 is huge when D is large. Thus, we decompose X̄ ≈ U Σ V^T using the RSVD algorithm. Then, the between-slice scatter matrix can be expressed as:

S_b = X̄^T W X̄ ≈ V Σ U^T W U Σ V^T. (15)

Similarly, due to the symmetry of the matrix I + α L, the right-hand side of Equation 11 can be decomposed as:

X̄^T (I + α L) X̄ ≈ V Σ U^T P P^T U Σ V^T, (16)

where P can be constructed through the Cholesky decomposition I + α L = P P^T. Thus, the generalized eigenvalue problem in Equation 11 can be reformulated as:

N θ = λ θ, (17)

where N = C^{-1} (Σ U^T W U Σ) C^{-T} and C is the Cholesky factor of Σ U^T P P^T U Σ.
Next, if we let θ = C^T γ with β = V γ, then we can see that this is a traditional eigenvalue problem. N is a small matrix, so we can decompose it through the original SVD naturally: N = Ũ Σ̃ Ṽ^T. Finally, the embedding matrix can be computed as:

B = (V C^{-T} Ũ)^T. (18)
With a deterministic implementation, constructing W and L costs roughly O(l^2 D) and O(n^2 D) respectively (n = l + u), dominated by the exact kNN computation for each point within its slice. In addition, decomposing X̄ by deterministic SVD costs O(nD min(n, D)), and constructing and solving Equation 17 costs O(nDd). Therefore, the overall time complexity is O(n^2 D + nD min(n, D)), which is prohibitive in the high-dimensional case.
To alleviate this expensive overhead, we use fast approximate kNN (2003Database; Ailon2009the) to compute the neighbor information, so that constructing W and L requires only O(lD log l) and O(nD log n) respectively. Moreover, using RSVD to factorize X̄ requires only O(nD log d). Therefore, the overall time complexity reduces to O(nD log n).
Algorithm 2 summarizes the above steps.
4.3 Mapping From Low to High
After the semi-supervised dimension reduction, we perform Bayesian optimization with a GP in the low-dimensional space. But we need to further consider three issues. The first is how to select the search domain Z in the low-dimensional embedding space. The second is how to specify a mapping associated with B that maps a point from the embedding space to the high-dimensional space where the response value is evaluated. The third is how to maintain the consistency of the Gaussian Process regression model after updating B. Only by addressing these issues can the low-dimensional Bayesian optimization be updated by the data pair (z, y).
Before elaborating on the selection of Z in the first issue, we introduce the definition of a zonotope, which is a convex symmetric polytope. Zonotope (Le2013Zonotope): given a linear mapping A ∈ R^{d×D} and a vector c ∈ R^d, a D-zonotope is defined as

Z = {z ∈ R^d : z = A u + c, u ∈ [−1, 1]^D},

where c is the center vector of the zonotope. Without loss of generality, let c = 0. We note that the embedding matrix B computed by Equation 18 plays exactly the role of A. Thus, the zonotope associated with B is the low-dimensional space where Bayesian optimization is performed. Then, we introduce the smallest box containing the zonotope: given the zonotope Z, the smallest box containing it is the product of intervals [−b_i, b_i] with half-widths b_i = Σ_{j=1}^{D} |A_{ij}|.
Although the boundary of the zonotope itself is difficult to compute, we can use this box as an alternative (binois2017choice).
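The half-widths of the smallest box are easy to compute: since z_i = Σ_j B_ij u_j with u_j ∈ [−1, 1], the extreme value is attained at u_j = sign(B_ij). A short numerical check (our own illustration):

```python
import numpy as np

def zonotope_box(B):
    """Half-widths b_i of the smallest box containing {Bu : u in [-1,1]^D}."""
    return np.abs(B).sum(axis=1)

rng = np.random.default_rng(0)
B = rng.standard_normal((2, 50))             # d x D embedding matrix
b = zonotope_box(B)
Z = rng.uniform(-1, 1, (1000, 50)) @ B.T     # images of random box points
```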
Next, we focus on the second and third issues. As mentioned before, a mapping should connect the high-dimensional space and the low-dimensional space, so that for any point z in the low-dimensional space, a corresponding point x can be found in the high-dimensional space. In this way, the low-dimensional point z and the response value y of the corresponding high-dimensional point can be used as a training pair to build the Gaussian Process regression model. Figure 1 shows the relationship between them. We note that reproducing the correlation between z and y is the goal of the low-dimensional Gaussian Process regression model, and the two are closely connected through x. Thus, a reasonable choice of the mapping has a great influence on this middle layer. Meanwhile, due to the iterative nature of SILBO, the mapping changes after updating B, which directly makes the correlation between z and y inconsistent before and after the update. Therefore, we need to develop a strategy to maintain consistency. Next, we introduce two different strategies to address these two issues.
As shown in Algorithm 3, we can naturally get the response value y by evaluating the point x after finding the candidate point z through BO. When the projection matrix B changes, we fix z at the bottom of the hierarchy in Figure 1(a) and then update x and y at the top to maintain consistency (the bottom-up strategy, BU for short).
Since we use the mapping to connect points between the low-dimensional space Z and the high-dimensional space X, the actual evaluation space is only a part of X when B is fixed. While a great probability of containing the optimum in the embedding has been proved (Wang2013Bayesian), one of its preconditions is that there is no limitation on the high-dimensional space of the objective function. In practice, the restriction often exists (e.g., a box constraint), so the high probability is no longer guaranteed. In contrast, due to the iterative nature of the effective low-dimensional space learning mechanism in SILBO (Algorithm 1), we can alleviate this problem: more and more subspaces associated with successive B will be learned, which gradually reveal the location of the optimum.
However, the BU strategy may bring more evaluation overhead. When we have built the training set for BO through several iterations of Algorithm 1, we update B and obtain new candidates through the acquisition function. Due to the update of B, the mapping relationship between the high-dimensional and low-dimensional spaces has changed. As shown in Figure 1(a), we need to find the corresponding x again for each z in the training set and then re-evaluate it to maintain consistency. When the evaluation of the objective function is expensive, this cost is huge.
To eliminate the re-evaluation overhead, we propose a new strategy. Specifically, we first obtain z* in the low-dimensional space through the acquisition function, and then solve the following problem to find the corresponding x*:

x* = argmin_{x ∈ X} ‖B x − z*‖^2. (19)

When the projection matrix B changes, we fix z and y at the top of the hierarchy in Figure 1 and then update x at the bottom to maintain consistency (the top-down strategy, TD for short).
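When B has orthonormal rows, one valid solution of Equation 19 is the minimum-norm point x = B^T z*, and any vector in the null space of B can be added without changing Bx. The paper uses CMA-ES, which additionally respects the box constraint on x; the pseudo-inverse below is an unconstrained sketch of our own:

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((100, 2)))
B = Q.T                                      # 2 x 100, orthonormal rows (BB^T = I)

z_star = np.array([0.3, -0.7])               # candidate from the acquisition step
x_star = B.T @ z_star                        # minimum-norm solution of B x = z*

# Any null-space perturbation is also a solution:
v = rng.standard_normal(100)
v -= B.T @ (B @ v)                           # project v onto Null(B)
```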
Different from the BU strategy, the TD strategy shown in Figure 1(b) updates x directly, which enables us to reconstruct a new BO model efficiently by only replacing the x part of the training set after updating B, instead of relocating and re-evaluating the points. The TD strategy can be found in Algorithm 4. Next, we analyze the rationality of this strategy theoretically.
Theorem 4.3. Given a matrix B ∈ R^{d×D} with orthonormal rows and its corresponding zonotope Z = {Bx : x ∈ X}, for any z ∈ Z the solution set S_z = {x ∈ X : Bx = z} is non-empty. Proof sketch: let R^D = V ⊕ V^⊥ be the orthogonal decomposition with respect to V = Row(B), and let x_0 = B^T z. Since B B^T = I, we have B x_0 = z, so x_0 ∈ S_z. Moreover, for any v ∈ V^⊥ we have Bv = 0, hence B(x_0 + v) = z. Thus, S_z = (x_0 + V^⊥) ∩ X.
According to Theorem 4.3, if we assume that B has orthonormal rows, it is clear that given B and z, any point in S_z will be a solution to Equation 19. However, our purpose is not x itself but the pair (z, y), because (z, y) is the data pair used to update BO. Fortunately, if we use SILBO to learn an ideal effective low-dimensional subspace, then for any x_1, x_2 in the solution set S_z, f(x_1) = g(Bx_1) = g(Bx_2) = f(x_2) (i.e., y is unique). Thus, in each iteration, we can obtain a unique data pair (z, y) under the current B to update the low-dimensional Bayesian optimization model. Therefore, the diversity of solutions in S_z does not affect the construction of BO, and we can explore in the original space, which liberates us from the shackles of a single fixed embedding.
In summary, both strategies aim to generate a consistent training set for the subsequent GP construction when updating B. To maintain such consistency, the BU strategy acquires more observations from the objective function according to the z in the training set, while the TD strategy directly changes x according to z. Note that the more response values we obtain, the more information about the objective function we get. Thus, the BU strategy can obtain more clues about the global optimum than the TD strategy.
5 Numerical Results
In this section, we evaluated our proposed algorithm SILBO on a set of high-dimensional benchmarks. First, we compared SILBO with other state-of-the-art algorithms on synthetic functions (available at https://www.sfu.ca/~ssurjano/optimization.html) and neural network hyperparameter optimization tasks. Second, we evaluated and analyzed the scalability of SILBO. Finally, we demonstrated the effectiveness of semi-supervised embedding learning.
5.1 Experiment Setting
For comparison, we also evaluated the following state-of-the-art algorithms: REMBO (Wang2013Bayesian), which performs BO on a random linear embedding; a REMBO variant (nayebi19a) that computes the kernel using distances in the high-dimensional space; a REMBO variant (binois2014warped) that uses a warping kernel and finds the evaluation point through a sophisticated mapping; and HeSBO (nayebi19a), which uses the count-sketch technique to generate the subspace and compute the projection matrix. Additional baselines include ADD-BO (k2015high), which assumes that the objective function has an additive structure, and the recently-proposed SIR-BO (zhang2019high), which uses SIR to learn a low-dimensional space actively. SILBO-BU and SILBO-TD denote the proposed SILBO algorithm with the BU and TD mapping strategies respectively.
We employed the package GPy (https://github.com/SheffieldML/GPy) as the Gaussian Process framework and selected the Matérn kernel for the GP model. We adopted the package ristretto (https://github.com/erichson/ristretto) to perform the RSVD. The embedding matrix was set to be updated every 20 iterations. We employed CMA-ES (hansen16CMA) to solve Equation 19 for SILBO-TD. The number of unlabeled points is set to 50 in each iteration, and the number of neighbors for semi-supervised dimension reduction is fixed across experiments. To obtain error bars, we performed 100 independent runs for each algorithm on the synthetic functions and 5 runs for the neural network hyperparameter search, and we plot the mean value along with one standard deviation. For ADD-BO, we used the authors' MATLAB implementation, so we did not compare its scalability, since the other algorithms were implemented in Python. Also, we evaluated SILBO using two acquisition functions: UCB and EI.
5.2 Performance on Synthetic Functions
Similar to (nayebi19a), the algorithms were compared on the following widely-used test functions: (1) Branin, (2) Colville, and (3) Hartmann6. The active dimensionalities of Branin, Colville, and Hartmann6 are 2, 4, and 6 respectively. We studied cases with different input dimensions. The experimental results are shown in Figures 2-4. The proposed SILBO-BU outperforms all baselines, especially in high dimensions. SILBO-TD is also very competitive, surpassing most traditional algorithms. In addition, the experimental results under the different acquisition functions are similar. ADD-BO performs excellently on Hartmann6 due to the additive structure of the function itself; however, it performs poorly in higher dimensions and on functions without the additive property. HeSBO performs well on Colville and Hartmann6 but poorly on Branin; moreover, HeSBO does not converge in most cases. The performance of SIR-BO is similar to that of our proposed SILBO-TD, since it also tries to learn an embedding space actively. In contrast, the proposed SILBO-BU beats nearly all baselines due to its powerful iterative embedding learning.
5.3 Hyperparameter Optimization on Neural Network
Following (oh2018bock), we also evaluated SILBO in the hyperparameter optimization task for a neural network on the MNIST dataset. Specifically, the neural network has one hidden layer of size 50 and one output layer of size 10. These 500 initial weights are viewed as hyperparameters and optimized using Adam (kingma2014adam).
Figure 5 shows that SILBO-BU converges more quickly than all other baselines. The performance of HeSBO also improves quickly at first, but it converges slowly to the optimum. The performance of SILBO-TD and SIR-BO is close; both perform better than traditional baselines such as REMBO and show competitive performance. Specifically, we can see that the curve corresponding to SILBO-BU obtains a better response value every 20 iterations, which demonstrates the effectiveness of our iterative embedding learning.
5.4 Scalability Analysis
We analyzed the scalability by comparing the cumulative time of each algorithm under the same number of iterations. As shown in Figure 6, for cheap objective functions such as Branin, SILBO-BU is fast, while SILBO-TD is relatively slow because CMA-ES is used to solve Equation 19, which takes considerable time. For expensive objective functions such as neural networks, SILBO-TD takes approximately the same time as most algorithms, while SILBO-BU takes more time due to its re-evaluation process.
5.5 Effectiveness of Embedding Learning
We compared the embedding learning performance of SILBO and SIR on Branin and Camel. We first evaluated 50 high-dimensional points to get their response values and then projected these points into the low-dimensional space. The goal is to find the e.d.r. directions that preserve the information of the objective function as comprehensively as possible. Figures 7 and 8 summarize the observed performance.
We can see that the information extracted by the SIR method is stacked together: a specific low-dimensional point corresponds to many different response values. The consequence is that a lot of information is lost if we train a Gaussian Process regression model on these low-dimensional representations. In contrast, the information extracted by SILBO is complete, without loss of the intrinsic information of the objective function.
6 Conclusion and Future Work
In this paper, we proposed SILBO, a novel iterative embedding learning framework for high-dimensional Bayesian optimization based on semi-supervised dimension reduction. We also proposed a randomized fast algorithm for computing the embedding matrix efficiently. Moreover, two different mapping strategies are proposed according to the evaluation cost of the objective function. Experimental results on both synthetic functions and hyperparameter optimization tasks reveal that SILBO outperforms the existing state-of-the-art high-dimensional Bayesian optimization methods. In the future, we plan to combine our framework with the multi-fidelity method, and we also plan to apply SILBO to more AutoML (Automated Machine Learning) tasks.
This work was supported by the National Natural Science Foundation of China
(U1811461, 61702254), National Key R&D Program of China (2019YFC1711000), Jiangsu Province Science and Technology Program (BE2017155), National Natural Science Foundation of Jiangsu Province (BK20170651), and Collaborative Innovation Center of Novel Software Technology and Industrialization.