As a well-established approach for sample-efficient global optimization of black-box functions that are expensive to evaluate, Bayesian optimization (BO) is used in many tasks such as hyperparameter tuning (Hutter2011Sequential; Bergstra2011Algorithm; Snoek2012Practical), neural architecture search (Kandasamy2018Neural), and chemical structure search (Bombarelli2018Automatic). BO provides a principled method for finding the global optimum of a black-box function: using a cheap probabilistic surrogate model of the black-box function as the input to an acquisition function, it repeatedly trades off exploitation against exploration to select promising points. The surrogate model is constructed from the evaluation values observed so far. A widely-used surrogate model is Gaussian Process regression, which provides uncertainty quantification of the function value by imposing a Gaussian Process prior.
While BO provides such an automated procedure, it still faces a huge challenge in high-dimensional scenarios. To ensure convergence when learning the function response, the sample complexity depends exponentially on the number of dimensions (Shahriari2016Taking). In practice, BO is limited to problems with around 20 dimensions when using Gaussian Process regression as the surrogate model (frazier2018tutorial).
To handle high-dimensional Bayesian optimization, many methods have been proposed. Based on the assumption that only a small number of dimensions influence the response value of the objective function, embedding methods perform BO in a low-dimensional space. The corresponding projection matrix can be constructed randomly (Wang2013Bayesian; nayebi19a) or learned actively (Djolonga2013High; zhang2019high). Some methods impose an additive structure on the objective function (gardner17Discovering; k2015high). Other methods learn a low-dimensional embedding through nonlinear dimension reduction (lu2018Structured). However, the existing methods have two limitations. First, the projection matrix is immutable. Once the generated low-dimensional embedding fails to represent the intrinsic structure of the objective function, finding the global optimum through Bayesian optimization becomes very difficult. Second, the low-dimensional space is learned in a supervised way, where the label of each point is the response value of the black-box function. Learning an effective low-dimensional space therefore requires a large number of labeled points, which leads to huge computation costs, especially when evaluating the objective function is expensive.
In this paper, we propose a novel framework called SILBO (Semi-supervised, Iterative, and Learning-based Bayesian Optimization; the code is available at https://github.com/cjfcsjt/SILBO) to mitigate these problems by learning the effective low-dimensional space iteratively through a semi-supervised dimension reduction method. After a low-dimensional space is identified, Bayesian optimization is performed in it, leading to a stable and reliable estimation of the global optimum. Specifically, the contributions of this paper are as follows:
We propose an iterative method in SILBO to update the projection matrix dynamically. During each iteration of BO, a semi-supervised low-dimensional embedding learning method is proposed to construct the projection matrix by utilizing both labeled and unlabeled points acquired from the acquisition function of BO.
To accelerate the semi-supervised dimension reduction, we further propose a randomized method to compute the high-dimensional generalized eigenvalue problem efficiently. We also analyze its time complexity in detail.
Furthermore, to map from the low-dimensional space back to the high-dimensional original space, we propose two mapping strategies, SILBO-TD and SILBO-BU, chosen according to the evaluation overhead of the objective function.
Experimental results on both synthetic functions and neural network hyperparameter optimization tasks reveal that SILBO outperforms the existing state-of-the-art high-dimensional Bayesian optimization methods.
The rest of this paper is organized as follows. Section 2 gives an overview of related work. Section 3 states the problem and lists relevant background materials. The SILBO algorithm is proposed in Section 4. The experimental evaluation is presented in Section 5 and the conclusion is given in Section 6.
2 Related Work
Bayesian optimization has achieved great success in many low-dimensional applications (k2017multifidelity; klein2016fast; Swersky2013Multi; wu2019practical; Wu2017Bayesian; hernndezlobato2015predictive). However, extending BO to high dimensions remains a challenge. Recently, high-dimensional BO has received increasing attention and a large body of literature has been devoted to addressing this issue.
Given the assumption that only a few dimensions play a decisive role, linear low-dimensional embedding methods achieve dimension reduction using a projection matrix. In REMBO (Wang2013Bayesian), the projection matrix is randomly generated from a Gaussian distribution. Promising points are searched in the low-dimensional space by performing Gaussian Process regression and then mapped back to the high-dimensional space for evaluation. It has been proven that REMBO has a high probability of finding the global optimum via convex projection, although this guarantee no longer holds when a box bound exists. Another problem of REMBO is over-exploration of the boundary. To address this, a carefully-selected bound in the low-dimensional embedding space was proposed (binois2017choice), which finds the corresponding points in the high-dimensional space by solving a quadratic programming problem. The BOCK algorithm (oh2018bock) scales to high-dimensional spaces using a cylindrical transformation of the search space. HeSBO (nayebi19a) employs the count-sketch method to alleviate over-exploration of the boundary and uses hashing to improve computational efficiency. HeSBO also shows that the mean and variance functions of the Gaussian Process do not deviate greatly under certain conditions. However, the above-mentioned methods only use prior information to generate the projection matrix randomly and do not employ the information of the actual initial points to learn a low-dimensional embedding actively.
Different from these methods, learning-based methods have been proposed. SIRBO (zhang2019high) uses supervised dimension reduction to learn a low-dimensional embedding, while SI-BO (Djolonga2013High) employs low-rank matrix recovery to learn the embedding. However, the low-dimensional embedding learned by these methods is immutable: once the projection matrix is generated from the initial points, it is never updated. Because only a small number of initial points have been evaluated, the low-dimensional embedding space may not accurately reflect the structure of the objective function.
Another way to handle high-dimensional BO is to assume an additive structure (gardner17Discovering) of the objective function. Typically, ADD-BO (k2015high) optimizes the objective function on disjoint subspaces decomposed from the high-dimensional space. Unfortunately, the additive assumption does not hold in most practical applications. Besides, non-linear embedding methods are also attractive (eissman2018bayesian; lu2018Structured; moriconi2019highdimensional). These methods use a Variational Autoencoder (VAE) to learn a low-dimensional embedding space. Their advantage is that points in the original space can be easily reconstructed through the non-linear mapping. However, training a VAE requires a large number of points. When the evaluation cost of the objective function is expensive, the non-linear embedding method is almost impractical.
In this paper, we focus on the linear low-dimensional embedding method and propose an iterative, semi-supervised method to learn the embedding.
3 Background
In this section, we give the problem setup and introduce Bayesian optimization (BO), semi-supervised discriminant analysis (SDA), and sliced inverse regression (SIR).
3.1 Problem Setup
Consider a black-box function $f: \mathcal{X} \rightarrow \mathbb{R}$ defined on a high-dimensional and continuous domain $\mathcal{X} \subseteq \mathbb{R}^D$. $f$ is computationally expensive and may be non-convex. Given $\mathbf{x} \in \mathcal{X}$, we can only access the noisy response value $y = f(\mathbf{x}) + \epsilon$ with noise $\epsilon$. Also, we assume that the objective function has $d$ intrinsic dimensions with $d \ll D$. In other words, there exist an embedding matrix $B \in \mathbb{R}^{d \times D}$ with orthogonal rows and a function $g: \mathbb{R}^d \rightarrow \mathbb{R}$ such that $f(\mathbf{x}) = g(B\mathbf{x})$. Our goal is to find the global optimum.
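To make this assumption concrete, the following sketch (our own illustration, not the paper's setup; the values of $D$ and $d$, the choice of the Branin function, and all names are hypothetical) builds a 100-dimensional function whose response depends only on a 2-dimensional linear subspace:

```python
import numpy as np

# Hypothetical example: a D = 100 dimensional function whose response
# depends only on a d = 2 dimensional linear subspace, f(x) = g(Bx).
def branin(u):
    x1, x2 = u
    return ((x2 - 5.1 / (4 * np.pi**2) * x1**2 + 5 / np.pi * x1 - 6)**2
            + 10 * (1 - 1 / (8 * np.pi)) * np.cos(x1) + 10)

rng = np.random.default_rng(0)
D, d = 100, 2
Q, _ = np.linalg.qr(rng.standard_normal((D, d)))
B = Q.T                       # d x D embedding matrix with orthonormal rows

def f(x):
    return branin(B @ x)      # x matters only through its projection Bx
```

Perturbing a point within the null space of $B$ leaves $f$ unchanged, which is exactly what makes a $d$-dimensional search sufficient.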
3.2 Bayesian Optimization
Bayesian optimization is an iterative framework that combines a surrogate model of the black-box function with a search strategy that tries to find possible points with large response values. Given observation points $\{\mathbf{x}_i\}_{i=1}^n$ and their corresponding evaluation values $\{y_i\}_{i=1}^n$, the surrogate model is usually Gaussian Process regression, which imposes a prior $\mathcal{GP}(\mu(\cdot), k(\cdot, \cdot))$ on the objective function, with mean function $\mu(\mathbf{x})$ at each $\mathbf{x}$ and covariance (kernel) function $k(\mathbf{x}, \mathbf{x}')$ at each pair of points $(\mathbf{x}, \mathbf{x}')$. The kernel function describes the similarity between inputs; one widely-used kernel is the Matérn kernel. Then, given a new point $\mathbf{x}_*$, the prediction of the response value can be calibrated by the posterior probability distribution (noise-free).
At each iteration of Bayesian optimization, the predictive mean and variance serve as uncertainty quantification, supporting the subsequent acquisition function optimization. The acquisition function tries to balance exploration (high variance) against exploitation (high mean value). A commonly-used acquisition function for searching promising points is UCB (Upper Confidence Bound) (Srinivas2010Gaussian). UCB tries to select the next point with the largest plausible response value according to
$\mathbf{x}_{t+1} = \arg\max_{\mathbf{x}} \; \mu_t(\mathbf{x}) + \sqrt{\beta_t}\,\sigma_t(\mathbf{x}),$
where $\beta_t$ is a parameter used to achieve the trade-off between exploration and exploitation. In this work, we also experiment with EI (Expected Improvement) (Snoek2012Practical), another popular acquisition function.
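To illustrate how the GP posterior feeds the UCB acquisition, here is a minimal numpy-only sketch (our own illustration; the paper uses GPy, and the length-scale and $\beta$ values here are arbitrary):

```python
import numpy as np

def matern52(X1, X2, ls=1.0):
    # Matérn 5/2 kernel evaluated on pairwise Euclidean distances.
    d = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=-1) / ls
    return (1 + np.sqrt(5) * d + 5 * d**2 / 3) * np.exp(-np.sqrt(5) * d)

def gp_posterior(X, y, Xstar, noise=1e-6):
    # Standard GP regression: posterior mean and variance at Xstar.
    K = matern52(X, X) + noise * np.eye(len(X))
    Ks = matern52(X, Xstar)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = np.diag(matern52(Xstar, Xstar) - Ks.T @ np.linalg.solve(K, Ks))
    return mu, np.maximum(var, 0.0)

def ucb(X, y, Xcand, beta=2.0):
    # Upper confidence bound: mean plus scaled standard deviation.
    mu, var = gp_posterior(X, y, Xcand)
    return mu + np.sqrt(beta * var)
```

In a full BO loop, `ucb` would be maximized over candidate points and the maximizer sent to the objective for evaluation.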
3.3 Semi-supervised Discriminant Analysis
Semi-supervised discriminant analysis (SDA) (Cai2007Semi) is a semi-supervised linear dimension reduction algorithm that leverages both labeled and unlabeled points. SDA aims to find a projection that reflects the discriminant structure inferred from the labeled points, as well as the intrinsic geometrical structure inferred from both labeled and unlabeled points. SDA is an extension of linear discriminant analysis (LDA). The original LDA aims to find a projection such that the ratio of the between-class scatter to the within-class scatter is maximized; when the number of training samples is small, this easily leads to overfitting. SDA addresses the problem by introducing a regularizer that incorporates the unlabeled information.
where $\alpha$ is a coefficient used to balance between model complexity and empirical loss.
The regularizer $J(\mathbf{w})$ is constructed from a graph incorporating neighbor information of the labeled and unlabeled points, with weight $S_{ij}$ indicating whether $\mathbf{x}_i$ and $\mathbf{x}_j$ are neighbors. Motivated by spectral dimension reduction (Belkin2002Laplacian), the regularizer can be defined as $J(\mathbf{w}) = \sum_{i,j} (\mathbf{w}^\top \mathbf{x}_i - \mathbf{w}^\top \mathbf{x}_j)^2 S_{ij}$ for any two points $\mathbf{x}_i$ and $\mathbf{x}_j$. Then, given the centered data matrix $X$ whose rows are the samples, $J(\mathbf{w})$ can be written as $J(\mathbf{w}) = 2\,\mathbf{w}^\top X^\top L X\, \mathbf{w}$, where $L = D - S$ is a Laplacian matrix (Chung1997Spectral) and $D$ is the diagonal matrix with $D_{ii} = \sum_j S_{ij}$. Finally, SDA can be reformulated as solving a generalized eigenvalue problem in which the between-class scatter appears on the left-hand side and the regularized total scatter on the right-hand side.
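Numerically, a generalized eigenvalue problem of this form can be handed to a standard solver. A minimal sketch using `scipy.linalg.eigh`, with small synthetic matrices standing in for the SDA scatter matrices:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
Sb = M @ M.T                   # stand-in for the between-class scatter (symmetric PSD)
St = np.eye(6) + 0.1 * Sb      # stand-in for the regularized total scatter (positive definite)

vals, vecs = eigh(Sb, St)      # solves the generalized problem Sb v = lambda St v
proj = vecs[:, ::-1][:, :2]    # top-2 generalized eigenvectors form the projection
```

`eigh` returns eigenvalues in ascending order, so the projection takes the last columns.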
3.4 Sliced Inverse Regression
Sliced inverse regression (SIR) (Li1991Sliced) is a supervised dimension reduction method for continuous response values. SIR aims to find an effective low-dimensional space. The dimension reduction model is
$y = g(\beta_1^\top \mathbf{x}, \ldots, \beta_d^\top \mathbf{x}, \epsilon),$
where $y$ is the response variable, $\mathbf{x}$ is an input vector, $g$ is an unknown function with $d+1$ arguments, $\beta_1, \ldots, \beta_d$ denote orthogonal projection vectors, $d$ denotes the dimension of the e.d.r. (effective dimension reducing) space, and $\epsilon$ is noise. The core idea of SIR is to swap the positions of $\mathbf{x}$ and $y$: the algorithm cuts the response values into $H$ slices and considers the inverse regression curve $E[\mathbf{x} \mid y]$ rather than regressing $y$ on $\mathbf{x}$ directly. SIR assumes the existence of $d$ e.d.r. directions such that this curve just falls into a $d$-dimensional space. SIR finds the directions by minimizing the total within-slice scatter $S_w$ and maximizing the between-slice scatter $S_b$. Similar to LDA, the problem can be reformulated as solving the generalized eigenvalue problem $S_b\, \beta = \lambda\, S_w\, \beta$.
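The slicing idea can be sketched in a few lines of numpy (a simplified illustration, not the paper's implementation: it whitens the inputs, slices the sorted responses, and eigendecomposes the between-slice scatter of the slice means):

```python
import numpy as np

def sir_directions(X, y, n_slices=10, d=1):
    """Estimate e.d.r. directions by sliced inverse regression."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / n + 1e-10 * np.eye(p)
    w, U = np.linalg.eigh(cov)
    inv_sqrt = U @ np.diag(1.0 / np.sqrt(w)) @ U.T    # cov^{-1/2}
    Z = Xc @ inv_sqrt                                 # whitened inputs
    M = np.zeros((p, p))
    for s in np.array_split(np.argsort(y), n_slices):
        m = Z[s].mean(axis=0)
        M += (len(s) / n) * np.outer(m, m)            # between-slice scatter
    _, V = np.linalg.eigh(M)
    beta = inv_sqrt @ V[:, ::-1][:, :d]               # back to original coordinates
    return beta / np.linalg.norm(beta, axis=0)
```

On a monotone single-index model, the leading estimated direction aligns closely with the true projection vector.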
4 The SILBO Algorithm
In this section, we propose a framework that addresses the high-dimensional optimization challenge by learning a low-dimensional embedding space associated with a projection matrix $B$. To learn the intrinsic structure of the objective function effectively, we iteratively update $B$ through semi-supervised dimension reduction. Moreover, we propose a randomized method to accelerate the computation of the embedding matrix in the high-dimensional scenario. By performing BO on the learned low-dimensional space, the algorithm can approach the point corresponding to the optimum as closely as possible.
Given the labeled points and unlabeled points, where a label represents the evaluation value of the corresponding point, SILBO consists of three tightly-connected phases:
embedding learning, which tries to find an effective low-dimensional space of the objective function by utilizing both and ;
performing Gaussian Process regression on the learned low-dimensional embedding space and selecting candidate points according to the acquisition function of BO;
evaluating points by the objective function , then updating the Gaussian Process surrogate model and the projection matrix .
Algorithm 1 summarizes the three steps. The first step is to construct the projection matrix $B$ and find an effective low-dimensional space that keeps the information of the original space as much as possible, using the unlabeled points obtained from the acquisition function together with the labeled points that have already been evaluated.
The second step is to find the possible low-dimensional optimum that maximizes the acquisition function, and to select several promising unlabeled points that can be used to update $B$ in the next iteration. Then, these low-dimensional points are mapped back to the high-dimensional space through a specific mapping.
Finally, we compute the response value by evaluating the objective function at the mapped point and update the GP model. The newly labeled and unlabeled points are added to the labeled and unlabeled sets respectively for updating the embedding in the next iteration.
The low-dimensional embedding is learned through SIR combined with the semi-supervised technique. The between-slice scatter matrix and total within-slice scatter matrix are constructed by utilizing the local information as well as the unlabeled points. Then, is obtained through solving the generalized eigenvalue problem. Based on the randomized SVD method, Algorithm 2 is proposed to speed up the solution to this problem. Moreover, we carefully analyze the mapping between the low-dimensional space and the high-dimensional space and further propose two strategies for the scenarios with different evaluation costs.
4.1 Semi-supervised Embedding Learning
The assumption is that there is a low-dimensional space that preserves the information of the objective function defined in the high-dimensional space. In other words, the dimensionality of the input can be reduced without losing the information essential for predicting response values. Meanwhile, if enough evaluated points were available for initializing Bayesian optimization, we might be able to explore the intrinsic structure of the objective function through these points. However, for optimization problems with high evaluation cost, only a few evaluated points are available, so it is difficult to learn a proper embedding from them alone. In such a case, making reasonable and effective use of the unevaluated points acquired from the acquisition function of BO helps embedding learning.
Although SDA provides an effective strategy to incorporate unlabeled data, it is only suitable for classification problems; we therefore need to extend it to scenarios where the response value is continuous. Meanwhile, SIR is suitable for supervised dimension reduction in the continuous domain. SIR finds the e.d.r. directions by minimizing the total within-slice scatter and maximizing the between-slice scatter, a purpose very similar to that of LDA; in fact, SIR is equivalent to LDA (Wu2010Local). Due to this equivalence, along with the fact that SDA is an extension of LDA, we can employ the discreteness brought by the slicing technique in SIR to apply SDA to the continuous domain. Next, we introduce how to construct semi-SIR to learn a low-dimensional subspace embedding.
We first assume that nearby points tend to have not only close response values but also similar low-dimensional representations. Furthermore, we sort the evaluated points by their response values and cut them into slices; this is equivalent to assigning the labeled points to different classes. Therefore, similar to SDA (Cai2007Semi), our problem can be reformulated as follows. Given $l$ labeled points and $u$ unlabeled points, we aim to find the embedding matrix by solving the generalized eigenvalue problem
$X^\top W X\, \beta = \lambda\, X^\top (I + \alpha L) X\, \beta,$
where $X$ denotes the centered data matrix whose rows represent the $l + u$ samples, $W$ is a weight matrix encoding slice membership, $L$ is the graph Laplacian, $\alpha$ is the regularization coefficient, and $I$ is an identity matrix.
Note that the between-slice scatter matrix can be written in an elegant linear-algebraic formulation, $S_b = X^\top W X$, where $W$ denotes the rescaled slice-membership matrix for the evaluated samples, with $W_{ij} = 1/n_c$ if the $i$-th and $j$-th samples both belong to the $c$-th slice and $W_{ij} = 0$ otherwise; $n_c$ denotes the number of samples in the $c$-th slice.
For the labeled points in the same slice, the response values are very close, but this closeness may not be retained in the input space: a small number of points may lie far away from the others in each slice. Although these outliers may indicate other areas with large response values of the objective function, they are likely to interfere with the learned embedding. Thus, to reveal the main information of the objective function as much as possible, we employ a localization technique. By strengthening the local information of each slice, the degeneracy issue in the original SIR can be mitigated (Wu2010Local). Next, we illustrate how to construct the weight matrix $W$ in Equation 13 with local information.
We note that $W$ in Equation 13 is a block matrix, where each block corresponds to a slice. In the original between-slice scatter, $W$ can only indicate whether a sample belongs to a slice. Here, we strengthen the localization information by introducing a localized weight. For the evaluated samples, we let $W_{ij} = 1/(H e_c)$ if the $j$-th sample belongs to the $k$ nearest neighbors of the $i$-th sample ($j$ can equal $i$) within the $c$-th slice, and $W_{ij} = 0$ otherwise, where $H$ denotes the number of slices, $k$ is a parameter for $k$NN, and $e_c$ is the number of neighbor pairs in the $c$-th slice.
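The block structure of such a localized weight matrix can be sketched as follows (a hypothetical construction for illustration; in particular, normalizing each block by its number of neighbor pairs is our simplification):

```python
import numpy as np

def localized_slice_weights(y, X, n_slices=5, k=3):
    """Block-diagonal weight matrix over slices of the sorted responses.

    Within a slice, W[i, j] is nonzero only when j is among the k nearest
    neighbors of i (a point counts as its own neighbor); each block is
    normalized by its number of neighbor pairs."""
    n = len(y)
    W = np.zeros((n, n))
    for s in np.array_split(np.argsort(y), n_slices):
        # Pairwise distances inside the slice.
        D = np.linalg.norm(X[s][:, None] - X[s][None, :], axis=-1)
        for a, i in enumerate(s):
            for b in np.argsort(D[a])[:k]:
                W[i, s[b]] = 1.0
        block = W[np.ix_(s, s)]
        W[np.ix_(s, s)] = block / max(block.sum(), 1.0)
    return W
```

Points in different slices always receive zero weight, which is what makes the between-slice scatter block-structured.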
On the other hand, if these outliers contain enough information, then through the iterative process of Bayesian optimization, we have a high probability of obtaining other points near the outliers. As a result, the number of such points may grow until they become the leading samples in their slice and guide the generation of the embedding. Therefore, it is necessary to update the projection matrix iteratively.
In summary, we aim to estimate the intrinsic structure of the objective function and find an effective subspace that reflects large response values, putting us in a better position to identify the possible global optimum. In combination with Bayesian optimization, whose acquisition function provides candidate points with potentially large response values, semi-supervised dimension reduction can alleviate the overfitting caused by the small number of labeled points. Thus, by utilizing both the labeled and unlabeled points, we can learn a $d$-dimensional space that better preserves the intrinsic structure of the objective function.
4.2 Randomized Semi-supervised Dimension reduction
In the high-dimensional scenario, learning a low-dimensional embedding space still requires a lot of time. The main bottleneck lies in solving the generalized eigenvalue problem in Equation 11 and computing the neighbor information in each slice. The major cost of traditional methods such as the two-stage SVD algorithm is the decomposition of a large-scale matrix. Fortunately, the randomized SVD (R-SVD) technique (Halko2011Finding) sheds light on this problem.
The contribution of R-SVD here is that when the effective rank is much smaller than the ambient dimension, approximating a matrix requires far fewer operations than a full deterministic decomposition. Moreover, empirical results show that when solving the generalized eigenvalue problem, the randomized algorithm outperforms the traditional deterministic method in the high-dimensional case (georgiev2012randomized). Thus, we first replace the two-stage SVD with R-SVD to accelerate the solution of Equation 11.
We note that the decomposition overhead of the data matrix $X$ in Equation 11 is huge when the dimension $D$ is large. Thus, we decompose $X \approx U \Sigma V^\top$ using the R-SVD algorithm. Then, the between-slice scatter matrix can be expressed as
$X^\top W X \approx V \Sigma U^\top W U \Sigma V^\top.$
Similarly, due to the symmetry of the matrix $I + \alpha L$, the right-hand side of Equation 11 can be decomposed as
$X^\top (I + \alpha L) X \approx (G^\top U \Sigma V^\top)^\top (G^\top U \Sigma V^\top),$
where $G$ can be constructed through the Cholesky decomposition $I + \alpha L = G G^\top$. Thus, the generalized eigenvalue problem in Equation 11 can be reformulated in terms of these factors. Next, a change of variables turns it into a traditional eigenvalue problem for a small matrix whose size is governed by the number of samples rather than by $D$; this small matrix can be decomposed through an ordinary SVD. Finally, the embedding matrix is recovered by mapping the resulting eigenvectors back through the R-SVD factor $V$.
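The two ingredients of this acceleration, a randomized range finder in the style of Halko et al. and a Cholesky substitution that reduces the generalized problem to a symmetric one, can be sketched as follows (shapes, names, and the oversampling value are ours, not the paper's):

```python
import numpy as np

def randomized_svd(A, rank, n_oversample=10, seed=0):
    # Randomized range finder followed by an SVD of the small projected matrix.
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], rank + n_oversample))
    Q, _ = np.linalg.qr(A @ Omega)            # orthonormal basis for range(A)
    U, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U)[:, :rank], s[:rank], Vt[:rank]

def generalized_eig_via_cholesky(A, B, d):
    """Solve A v = lambda B v with B = R^T R via the symmetric
    eigenproblem for R^{-T} A R^{-1}, then map eigenvectors back."""
    R = np.linalg.cholesky(B).T               # B = R^T R
    Rinv = np.linalg.inv(R)
    vals, U = np.linalg.eigh(Rinv.T @ A @ Rinv)
    V = Rinv @ U[:, ::-1][:, :d]              # v = R^{-1} u
    return vals[::-1][:d], V
```

For an exactly low-rank matrix, the randomized factorization is exact up to floating-point error, which is the regime the effective-dimension assumption targets.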
The original time complexity is dominated by the construction of the weight matrices, including the computation of the exact $k$NN for each point in every slice, and by the decomposition of the data matrix and the solution of Equation 17; these steps carry large factors in both the sample size and the dimension $D$, which is prohibitive in the high-dimensional case.
To alleviate this overhead, we use fast approximate $k$NN (2003Database; Ailon2009the) to compute the neighbor information, which makes constructing the weight matrices substantially cheaper, and we use R-SVD to factorize the data matrix. As a result, the overall time complexity becomes linear in $D$ up to logarithmic factors.
Algorithm 2 summarizes the above steps.
4.3 Mapping From Low to High
After semi-supervised dimension reduction, we perform Bayesian optimization with a GP on the low-dimensional space. But we need to further consider three issues. The first is how to select the search domain in the low-dimensional embedding space. The second is how to specify a mapping associated with $B$ that maps a point from the embedding space to the high-dimensional space where the response value is evaluated. The third is how to maintain the consistency of the Gaussian Process regression model after updating $B$. Only by addressing these issues can the low-dimensional Bayesian optimization model be updated with the data pair $(\mathbf{z}, y)$.
Before elaborating on the selection of the low-dimensional search domain, we introduce the zonotope, a convex symmetric polytope. Zonotope (Le2013Zonotope): given a linear mapping $A \in \mathbb{R}^{d \times m}$ and a vector $\mathbf{c} \in \mathbb{R}^d$, a zonotope is defined as
$Z(A, \mathbf{c}) = \{ A\mathbf{u} + \mathbf{c} : \mathbf{u} \in [-1, 1]^m \},$
where $\mathbf{c}$ is the center vector of the zonotope. Without loss of generality, let $\mathbf{c} = \mathbf{0}$. The zonotope associated with the projection matrix $B$ is the low-dimensional space where Bayesian optimization is performed. Then, we introduce the smallest box containing the zonotope: its $i$-th coordinate ranges over $\left[ c_i - \sum_j |A_{ij}|,\; c_i + \sum_j |A_{ij}| \right]$. Although the boundary of the zonotope is difficult to compute exactly, we use this box as an alternative (binois2017choice).
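The smallest box containing a zonotope has a simple closed form: coordinate $i$ extends by $\sum_j |A_{ij}|$ on each side of the center, since $(A\mathbf{u})_i$ is maximized over the cube at $u_j = \mathrm{sign}(A_{ij})$. A sketch (the function name is ours):

```python
import numpy as np

def zonotope_box(A, c=None):
    """Smallest axis-aligned box containing Z = {c + A u : u in [-1, 1]^m}."""
    c = np.zeros(A.shape[0]) if c is None else np.asarray(c, float)
    half = np.abs(A).sum(axis=1)   # per-coordinate half-width of the box
    return c - half, c + half
```

Every point of the zonotope lies inside this box, and each face of the box is touched by some vertex of the cube's image, so the box is tight.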
Next, we focus on the second and third issues. As mentioned before, a mapping should connect the high-dimensional space and the low-dimensional space, so that for any point in the low-dimensional space, a corresponding point can be found in the high-dimensional space. In this way, a low-dimensional point $\mathbf{z}$ and the response value $y$ of the corresponding high-dimensional point $\mathbf{x}$ can be used as training data to build the Gaussian Process regression model. Figure 1 shows the relationship between them. Reproducing the correlation between $\mathbf{z}$ and $y$ is the goal of the low-dimensional Gaussian Process regression model, and the two are closely connected through $\mathbf{x}$; a reasonable choice of the mapping thus has great influence on this middle layer. Meanwhile, due to the iterative nature of SILBO, the mapping changes after updating $B$, which makes the correlation between $\mathbf{z}$ and $y$ inconsistent before and after the update. Therefore, we need a strategy to maintain consistency. Next, we introduce two different strategies to address these issues.
As shown in Algorithm 3, we can naturally get the response value $y$ by evaluating the point $\mathbf{x}$ after finding the candidate point through BO. When the projection matrix changes, we fix the low-dimensional points $\mathbf{z}$ at the bottom of the hierarchy in Figure 1(a) and then update $\mathbf{x}$ and $y$ at the top to maintain consistency (bottom-up, BU for short).
Since we use the mapping to connect points between the low-dimensional and high-dimensional spaces, the actual evaluation space is only a part of the original space when $B$ is fixed. While a high probability of containing the optimum in this subspace has been proved (Wang2013Bayesian), one of the preconditions is that there is no limitation on the high-dimensional space of the objective function. In practice, restrictions such as box bounds often exist, so the high probability is no longer guaranteed. In contrast, due to the iterative nature of the effective low-dimensional space learning mechanism in SILBO (Algorithm 1), we can alleviate this problem: more and more embedding spaces will be learned, which gradually reveal the location of the optimum.
However, the BU strategy may bring extra evaluation overhead. When we have accumulated the training set for BO over several iterations in Algorithm 1, we update the projection matrix and obtain new candidates through the acquisition function. Because the mapping between the high-dimensional and low-dimensional spaces has changed, as shown in Figure 1(a), we need to find the corresponding high-dimensional point again for each $\mathbf{z}$ in the training set and then re-evaluate it to maintain consistency. When the evaluation of the objective function is expensive, this cost is huge.
To eliminate the re-evaluation overhead, we propose a second strategy. Specifically, we first obtain a candidate $\mathbf{z}$ in the low-dimensional space through the acquisition function, and then solve Equation 19 to find a corresponding high-dimensional point $\mathbf{x}$ whose projection under $B$ equals $\mathbf{z}$.
When the projection matrix changes, we fix $\mathbf{x}$ and $y$ at the top of the hierarchy in Figure 1 and then update $\mathbf{z}$ at the bottom to maintain consistency (top-down, TD for short).
Different from the BU strategy, the TD strategy shown in Figure 1(b) updates $\mathbf{z}$ directly, which enables us to reconstruct a new BO model efficiently simply by recomputing the low-dimensional representations of the evaluated points after updating $B$, instead of relocating and re-evaluating them. The TD strategy can be found in Algorithm 4. Next, we analyze the rationality of this strategy theoretically.
Theorem. Given a matrix $B \in \mathbb{R}^{d \times D}$ with orthogonal rows and its corresponding zonotope $Z$, for any $\mathbf{z} \in Z$ and any $\mathbf{x}$ with $B\mathbf{x} = \mathbf{z}$, the orthogonal projection of $\mathbf{x}$ onto $\mathrm{Row}(B)$ equals $B^\top \mathbf{z}$.
Proof. Let $\mathbf{x} = \mathbf{x}_{\parallel} + \mathbf{x}_{\perp}$ be the orthogonal decomposition with respect to $\mathrm{Row}(B)$. Since $\mathbf{x}_{\perp}$ lies in the null space of $B$, we have $B\mathbf{x} = B\mathbf{x}_{\parallel} = \mathbf{z}$. Writing $\mathbf{x}_{\parallel} = B^\top \mathbf{u}$ and using $BB^\top = I$, we obtain $\mathbf{u} = \mathbf{z}$ and thus $\mathbf{x}_{\parallel} = B^\top \mathbf{z}$.
According to Theorem 4.3, if we assume that $B$ has orthogonal rows, then given $\mathbf{z}$, every point in the solution set of Equation 19 shares the same projection $B^\top \mathbf{z}$ onto $\mathrm{Row}(B)$. However, our purpose is not $\mathbf{x}$ itself but the data pair $(\mathbf{z}, y)$, which is used to update BO. Fortunately, if SILBO learns an ideal effective low-dimensional subspace, then every $\mathbf{x}$ in the solution set yields the same response value (i.e., $f(\mathbf{x}) = g(B\mathbf{x}) = g(\mathbf{z})$). Thus, in each iteration, we can obtain a unique data pair $(\mathbf{z}, y)$ to update the low-dimensional Bayesian optimization model. Therefore, the diversity of solutions in the set does not affect the construction of BO, and we can explore in the original space, which liberates us from the shackles of a fixed subspace.
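Under the row-orthogonality assumption, the solution set of Equation 19 can be written down explicitly, as in this sketch (the paper instead solves Equation 19 with CMA-ES under box constraints; the function name is ours):

```python
import numpy as np

def lift(B, z, v=None):
    """Solutions of B x = z for a row-orthogonal B (d x D): the
    minimum-norm point B^T z, optionally shifted by the null-space
    component of an arbitrary vector v."""
    x = B.T @ z
    if v is not None:
        x = x + v - B.T @ (B @ v)   # project v onto the null space of B
    return x
```

Every member of this family projects back to the same $\mathbf{z}$, which is why the diversity of solutions does not affect the $(\mathbf{z}, y)$ pair fed to the GP.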
In summary, both strategies aim to generate a consistent training set for the subsequent GP construction when updating $B$. To maintain such consistency, the BU strategy acquires more observations from the objective function according to the $\mathbf{z}$ values in the training set, while the TD strategy directly changes $\mathbf{z}$ according to the evaluated points. Note that the more response values we have, the more information about the objective function we can get. Thus, the BU strategy can obtain more clues about the global optimum than the TD strategy.
5 Numerical Results
In this section, we evaluated our proposed algorithm SILBO on a set of high-dimensional benchmarks. First, we compared SILBO with other state-of-the-art algorithms on synthetic functions (available at https://www.sfu.ca/~ssurjano/optimization.html) and on neural network hyperparameter optimization tasks. Second, we evaluated and analyzed the scalability of SILBO. Finally, we demonstrated the effectiveness of semi-supervised embedding learning.
5.1 Experiment Setting
For comparison, we also evaluated other state-of-the-art algorithms: REMBO (Wang2013Bayesian), which performs BO on a random linear embedding; a REMBO variant (nayebi19a), which computes the kernel using distances in the high-dimensional space; another REMBO variant (binois2014warped), which uses a warping kernel and finds the evaluation point through a sophisticated mapping; and HeSBO (nayebi19a), which uses the count-sketch technique to generate the subspace and compute the projection matrix. Additional baselines include ADD-BO (k2015high), which assumes that the objective function has an additive structure, and the recently-proposed SIR-BO (zhang2019high), which uses SIR to learn a low-dimensional space actively. SILBO-BU and SILBO-TD denote the proposed SILBO algorithm with the bottom-up and top-down mapping strategies respectively.
We employed the package GPy (https://github.com/SheffieldML/GPy) as the Gaussian Process framework and selected the Matérn kernel for the GP model. We adopted the package ristretto (https://github.com/erichson/ristretto) to perform the R-SVD. The embedding matrix was set to be updated every 20 iterations. We employed CMA-ES (hansen16CMA) to solve Equation 19 for SILBO-TD. The size of the unlabeled point set is 50 in each iteration, and the number of neighbors for semi-supervised dimension reduction is fixed across experiments. To obtain error bars, we performed 100 independent runs for each algorithm on the synthetic functions and 5 runs for the neural network hyperparameter search, plotting the mean value along with one standard deviation. For ADD-BO, we used the authors' implementation; we did not compare its scalability because the other algorithms were implemented in a different language. Also, we evaluated SILBO with two acquisition functions: UCB and EI.
5.2 Performance on Synthetic Functions
Similar to (nayebi19a), we compared the algorithms on the following widely-used test functions: (1) Branin, (2) Colville, and (3) Hartmann-6, whose active dimensionalities are 2, 4, and 6 respectively. We studied cases with different input dimensions $D$. The experimental results are shown in Figures 2-4. The proposed SILBO-BU outperforms all baselines, especially in high dimensions. SILBO-TD is also very competitive, surpassing most traditional algorithms. In addition, the experimental results under different acquisition functions are similar. ADD-BO performs excellently on Hartmann-6 due to the additive structure of the function itself, but poorly in higher dimensions and on functions without the additive property. HeSBO performs well on Colville and Hartmann-6 but poorly on Branin; moreover, it does not converge in most cases. The performance of SIRBO is similar to that of our proposed SILBO-TD, since it also tries to learn an embedding space actively. In contrast, SILBO-BU beats nearly all baselines due to its powerful iterative embedding learning.
5.3 Hyperparameter Optimization on Neural Network
Following (oh2018bock), we also evaluated SILBO in the hyperparameter optimization task for a neural network on the MNIST dataset. Specifically, the neural network has one hidden layer of size 50 and one output layer of size 10. These 500 initial weights are viewed as hyperparameters and optimized using Adam (kingma2014adam).
Figure 5 shows that SILBO-BU converges more quickly than all other baselines. The performance of HeSBO also improves quickly at first, but it converges slowly to the optimum. SILBO-TD and SIRBO perform similarly; both are better than traditional baselines such as REMBO and show competitive performance. Specifically, the curve of SILBO-BU obtains a better response value every 20 iterations, which demonstrates the effectiveness of our iterative embedding learning.
5.4 Scalability Analysis
We analyzed scalability by comparing the cumulative running time of each algorithm under the same number of iterations. As shown in Figure 6, for cheap objective functions such as Branin, SILBO-BU is fast, while SILBO-TD is relatively slow because CMA-ES is used to solve Equation 19, which takes much time. For expensive objective functions such as neural networks, SILBO-TD takes approximately the same time as most algorithms, while SILBO-BU takes more time due to its re-evaluation process.
5.5 Effectiveness of Embedding Learning
We compared the embedding learning performance of SILBO and SIR on Branin and Camel. We first evaluated 50 high-dimensional points to obtain their response values and then projected these points into the low-dimensional space. The goal is to find e.d.r. directions that preserve the information of the objective function as comprehensively as possible. Figures 7 and 8 summarize the observed performance.
We can see that the information extracted by SIR is stacked together: a single low-dimensional point corresponds to many different response values. Consequently, much information would be lost if we trained a Gaussian Process regression model on these low-dimensional representations. In contrast, the information extracted by SILBO is complete, without loss of the intrinsic information of the objective function.
6 Conclusion and Future Work
In this paper, we proposed SILBO, a novel iterative embedding learning framework for high-dimensional Bayesian optimization based on semi-supervised dimension reduction. We also proposed a randomized fast algorithm for computing the embedding matrix efficiently. Moreover, according to the evaluation cost of the objective function, two different mapping strategies were proposed. Experimental results on both synthetic functions and hyperparameter optimization tasks reveal that SILBO outperforms the existing state-of-the-art high-dimensional Bayesian optimization methods. In the future, we plan to combine our framework with multi-fidelity methods. Moreover, we also plan to apply SILBO to more AutoML (Automatic Machine Learning) tasks.
This work was supported by the National Natural Science Foundation of China
(U1811461, 61702254), National Key R&D Program of China (2019YFC1711000), Jiangsu Province Science and Technology Program (BE2017155), National Natural Science Foundation of Jiangsu Province (BK20170651), and Collaborative Innovation Center of Novel Software Technology and Industrialization.