Given a dataset, an algorithm, and a set of hyperparameters, the hyperparameter optimization (HPO) problem aims at finding a configuration of the hyperparameters that maximizes the performance of the algorithm on the dataset. HPO problems arise widely in practice, and many common tasks in computer science, such as neural architecture search Bergstra et al. (2013); Mendoza et al. (2016); Chen et al. (2019); Kandasamy et al. (2018); Zhou et al. (2019) and feature subset selection Zarshenas and Suzuki (2016); Kale and Sonavane (2017), can be cast as HPO problems.
Take neural architecture search as an example. The items that control the structure of the neural network, such as the type of each layer and the values of the arguments in different layers, are treated as the hyperparameters of the HPO problem, and the neural network algorithm is treated as the algorithm of the HPO problem. For each configuration of these hyperparameters, we first construct the neural network accordingly, then train the network weights on part of the data, and finally use the remaining data to estimate the prediction accuracy of the trained network, which is taken as the performance score in the HPO problem. The optimal hyperparameter configuration of this HPO problem corresponds to the neural network architecture that best suits the given dataset.
To solve these problems effectively, many HPO methods Li et al. (2017); Thornton et al. (2013); Hutter et al. (2007); Springenberg et al. (2016); Golovin et al. (2017); Goldberg (1989); Wistuba et al. (2016); Feurer et al. (2015); Francescomarino et al. (2018) have been proposed. Among them, Grid Search Montgomery (2017), Random Search Bergstra and Bengio (2012) and Bayesian Optimization Snoek et al. (2012b); Shahriari et al. (2016); Oh et al. (2018); Mutny and Krause (2018) are well known and commonly used. Without constraints, each of the existing HPO techniques can provide an excellent solution by traversing a large proportion of the hyperparameter configurations. In practice, however, such an approach is infeasible because the configuration space is complex and high-dimensional. Besides, in many cases, evaluating even a single hyperparameter configuration can be extremely expensive for large models, complex machine learning pipelines, or large datasets Zhou et al. (2019); Mantovani et al. (2015b). Users are often unable to afford the expense of a large number of configuration evaluations, yet they still need well-performing hyperparameter configurations. They therefore need intelligent methods that find a good hyperparameter configuration within limited funds. Motivated by this, in this paper we define a new problem, Constrained Hyperparameter Optimization (CHPO), as follows.
The CHPO problem aims at finding the best possible hyperparameter configuration, one that leads to high performance of the algorithm on the given dataset, using a finite number of configuration evaluations. It allows users to put an upper limit on the number of configuration evaluations according to their budget, which is more practical and user-friendly than the HPO problem. However, this advantage also brings a crucial technical challenge: the configuration space is typically huge, whereas the number of configuration evaluations is small. It is not trivial to select the configurations to be evaluated from such a huge space while making sure that well-performing configurations are among so few candidates.
In this paper, to address this challenge, we design two approaches, Human Experience and Parameter Analysis, which analyze experience and intelligently infer optimal configurations, and thus increase the possibility of finding well-performing configurations.
It is well known that exploring the internal rules of a problem in depth allows twice as much to be accomplished with half the effort. Based on this thought, we design Human Experience, a knowledge-driven approach, to find optimal configurations with the help of knowledge. Human Experience discovers the potential relation among a configuration, a configuration adjustment and the corresponding change of performance from the known experience. It then uses the discovered knowledge to infer optimal configurations. This method works well when most of the given hyperparameters are decisive for the performance. However, it may be less effective when most hyperparameters are redundant or unimportant to the performance, because the resulting noisy data may greatly degrade the quality of the obtained knowledge and mislead the method.
To overcome this disadvantage, we develop Parameter Analysis, which applies a pruning method. Parameter Analysis analyzes the importance of each hyperparameter to the performance, and reduces the configuration space by ignoring unimportant or redundant ones. It then searches for optimal configurations in the much smaller space. This method makes up for the limitation of Human Experience, because the space can be reduced significantly when most hyperparameters are redundant or unimportant, which makes optimal configurations much easier to find. Its shortcoming is that it may be less effective when most hyperparameters are decisive for the performance, because the reduced space is then very similar to the original one, and it remains very difficult to select optimal configurations from it. This shortcoming can in turn be overcome with Human Experience.
From the above discussion, these two methods complement each other. We combine them to exploit their respective advantages, and propose a well-performing CHPO algorithm called ExperienceThinking.
Major contributions of this paper are summarized as follows.
Firstly, we propose the CHPO problem, which is more practical and user-friendly than HPO. To the best of our knowledge, this is the first comprehensive definition of constrained HPO.
Secondly, we develop two novel methods, i.e., Human Experience and Parameter Analysis, to intelligently infer optimal configurations from different aspects.
Thirdly, we combine Human Experience and Parameter Analysis, and present ExperienceThinking to effectively deal with CHPO problems.
Fourthly, we conduct extensive experiments to test the performance of ExperienceThinking and classic HPO algorithms for CHPO problems. The experimental results demonstrate the superiority of our proposed algorithm.
The remainder of this paper is organized as follows. Section 2 introduces the existing HPO techniques. In Section 3, we define the CHPO problem and some related concepts used in this paper. Section 4 introduces the Human Experience and Parameter Analysis approaches that we designed to analyze experience and intelligently infer optimal configurations. Section 5 presents our proposed algorithm, ExperienceThinking. Section 6 compares and evaluates the ability of classic HPO techniques and ExperienceThinking to solve CHPO problems. Finally, we draw conclusions and discuss future work in Section 7.
2 Related Work
Many modern methods and algorithms, e.g., deep learning methods and machine learning algorithms, are very sensitive to hyperparameters: their performance depends more strongly than ever on the correct setting of many internal hyperparameters. To automatically find suitable hyperparameter configurations, and thus improve the efficiency and effectiveness of the target method or algorithm, a number of HPO techniques have been proposed Li et al. (2017); Snoek et al. (2012a); Mantovani et al. (2015a); Thornton et al. (2013); Hutter et al. (2007); Springenberg et al. (2016); Golovin et al. (2017). In this section, we introduce in detail three classic and commonly used HPO techniques, i.e., Grid Search Montgomery (2017), Random Search Bergstra and Bengio (2012) and Bayesian Optimization Snoek et al. (2012b); Shahriari et al. (2016), which are used in our experiments.
Grid Search (GS). GS is one of the most basic and widely used HPO methods in the literature. Each hyperparameter is discretized into a desired set of values to study; GS evaluates every configuration in the Cartesian product of these sets and chooses the best one as the optimal configuration. Although easy to implement, GS suffers from the curse of dimensionality and can become computationally infeasible, since the required number of configuration evaluations is the product of the numbers of discrete levels and thus grows exponentially with the number of hyperparameters. For example, 10 hyperparameters with 4 levels each would require 4^10 = 1,048,576 models to be trained. Even with a substantial cluster of compute resources, training so many models is prohibitive in most cases, especially with massive datasets and expensive computations.
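The growth of the grid can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `grid_size` and the small two-hyperparameter grid are hypothetical and do not come from the paper.

```python
from itertools import product
from math import prod


def grid_size(levels_per_hyperparameter):
    """Number of configurations in the full Cartesian grid."""
    return prod(levels_per_hyperparameter)


# 10 hyperparameters with 4 discrete levels each, as in the example above.
total = grid_size([4] * 10)  # 4 ** 10

# A tiny grid can still be enumerated explicitly with itertools.product.
small_grid = list(product([0.01, 0.1], [3, 5, 7]))  # 2 * 3 configurations
```

Adding one more 4-level hyperparameter multiplies `total` by 4, which is why a fixed evaluation budget is exhausted quickly.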
Random Search (RS). RS is a simple yet surprisingly effective alternative to GS. RS samples configurations at random until a given search budget is exhausted, and chooses the best one as the optimal configuration. It explores the entire configuration space, and works better than GS when some hyperparameters are much more important than others Bergstra and Bengio (2012); Koch et al. (2018). However, its effectiveness depends on the size and the uniformity of the sample. The candidate configurations may be concentrated in regions that completely miss the effective hyperparameter configurations, and RS is then likely to generate fewer improved configurations Koch et al. (2018).
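A minimal sketch of random search under a fixed evaluation budget is given below. The search space `toy_space` and the objective `toy_score` are hypothetical stand-ins chosen for illustration; in practice the objective would train and validate a model.

```python
import random


def random_search(space, evaluate, budget, seed=0):
    """Evaluate `budget` uniformly sampled configurations and return the best."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(budget):
        # Sample each hyperparameter independently from its discrete domain.
        config = {name: rng.choice(values) for name, values in space.items()}
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Hypothetical toy space and objective (not from the paper).
toy_space = {"depth": [2, 4, 6, 8], "rate": [0.01, 0.1, 0.3]}


def toy_score(config):
    return -abs(config["depth"] - 6) - abs(config["rate"] - 0.1)


best, score = random_search(toy_space, toy_score, budget=20)
```

Because the samples are independent, the result depends only on the budget and on how the sample happens to cover the space, which is the sensitivity noted above.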
Bayesian Optimization (BO). BO is a state-of-the-art method for the global optimization of expensive black box functions Hutter et al. (2019). BO works by fitting a probabilistic surrogate model to all observations of the target black box function made so far, and then using the predictive distribution of the probabilistic model to decide which point to evaluate next. Finally, the evaluated point with the highest score is taken as the solution to the given HPO problem. Unlike GS and RS, which ignore historical observations, BO makes full use of them to infer better configurations, and is thus capable of providing better solutions within a shorter time. Many works Bergstra et al. (2013); Mendoza et al. (2016); Kandasamy et al. (2018); Zhou et al. (2019); Ma et al. (2019) apply BO to optimize the hyperparameters of neural networks because of its effectiveness. However, BO is not perfect. Traditionally, the probabilistic model used in BO is assumed to follow a Gaussian distribution. This assumption does not hold in all HPO problems, and it may result in poor performance of BO in some cases.
These three techniques can deal with HPO problems effectively when no budget constraint exists. However, their ability to deal with HPO problems under a finite number of configuration evaluations has not been fully analyzed or systematically compared. In the experimental part, we make minor adjustments to these three techniques so that they are suitable for CHPO problems. We then analyze their performance under a given finite number of evaluations and compare it with that of our proposed ExperienceThinking algorithm, in order to identify an effective method for dealing with CHPO problems. Details are given in Section 6.
3 Problem Definition and Related Concepts
3.1 CHPO Problem Definition
Definition 1. (Constrained Hyperparameter Optimization Problem) Consider a dataset , an algorithm , hyperparameters =, and an integer . Let denote the domain of , = denote the overall hyperparameter configuration space, and represent the performance score of in under . The target of Constrained Hyperparameter Optimization (CHPO) problem is to find
from , which maximizes the performance of in , by evaluating configurations in .
3.2 Related Concepts of CHPO
The following concepts are used in Human Experience approach description.
Consider a CHPO problem =, where is a dataset, is an algorithm, are hyperparameters and is an integer. We represent the overall hyperparameter configuration space as
. A vector of hyperparameters is denoted by=, and the normalized version of is denoted by . We use to represent the ideal performance score of in (, ).
Definition 2. (Configuration Difference, CDiffer) Consider a CHPO problem =, and two configurations ,. The Configuration Difference (CDiffer) from to is defined as:
Definition 3. (Performance Difference, PDiffer) Consider a CHPO problem =, and two configurations ,. The Performance Difference (PDiffer) from to is defined as:
Definition 4. (Performance Promotion Space, PSpace) Consider a CHPO problem =, and a configuration . The Performance Promotion Space (PSpace) of is defined as:
The smaller is the more optimal is.
Definition 5. (Ideal Adjustment, IAdjust) Consider a CHPO problem =, and a configuration . The Ideal Adjustment (IAdjust) of is denoted as , and the relationship between and is as follows:
and the relationship between and is as follows:
4 Human Experience and Parameter Analysis
Human Experience and Parameter Analysis are the core of our proposed ExperienceThinking algorithm. The two methods tell ExperienceThinking which configurations tend to be optimal by carefully analyzing and summarizing the experience, and thus guide ExperienceThinking to approach the globally optimal configuration gradually. In this section, we introduce these two methods in detail and explain their internal operating mechanisms.
4.1 Human Experience Method
Motivation. Clearly, knowledge of the relation among a configuration, a configuration adjustment and the corresponding change of performance is helpful for solving the problem. Thus, we design a knowledge-driven approach to find optimal configurations efficiently. Such an approach brings two challenges. On the one hand, we need procedural knowledge to help us infer optimal configurations, while only factual knowledge (the performance of some configurations) is known. How to derive procedural knowledge from factual knowledge effectively is the first problem to solve. On the other hand, the optimal configurations predicted by a single model may not be completely trustworthy, because the bias of a single model is hard to avoid.
Design Idea. Facing these two challenges, we develop knowledge representation and acquirement mechanism. We learn the procedural knowledge from the historical configurations with corresponding performance, which is a set of configuration-performance pairs, denoted as . Consider two tuples , . If , then we can say that the performance of can promote = if =- adjustment is made to (i.e., changes to +); instead, we say that the performance of can decrease under adjustment. Any two tuples in can provide us with two ,, triples as above, and we can obtain a total of triples from . These triples are useful for our understanding of the relationship among , and . We train neural networks with these triples, and consider the trained neural networks as the procedural knowledge, which assists us to find more optimal configurations, e.g., setting to a high value.
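The construction of training triples from configuration-performance pairs can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: configurations are assumed to be numeric tuples, and the helper name `build_triples` and the sample data are hypothetical. Each ordered pair of tuples yields one triple, so n pairs give n(n-1) triples.

```python
from itertools import permutations


def build_triples(experience):
    """Turn configuration-performance pairs into
    (configuration, adjustment, performance change) triples."""
    triples = []
    # Every ordered pair (i, j) with i != j contributes one triple.
    for (x_i, f_i), (x_j, f_j) in permutations(experience, 2):
        adjustment = tuple(b - a for a, b in zip(x_i, x_j))
        triples.append((x_i, adjustment, f_j - f_i))
    return triples


# Hypothetical experience: (configuration, performance score).
experience = [((0.1, 3.0), 0.80), ((0.3, 5.0), 0.86), ((0.2, 4.0), 0.83)]
triples = build_triples(experience)
```

A positive third component means the adjustment improved the performance; a negative one means it decreased it. Both kinds are kept, since both describe how performance responds to an adjustment.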
To avoid the bias of single model, we design multiple models to predict or verify optimal configurations, and ask them to discuss and exchange views, and thus improve the reliability of the predicted optimal configurations. We combine these solutions as Human Experience method.
Detail Workflow. Algorithm 1 shows the pseudo code of the Human Experience method. Firstly, the HumanExperience algorithm builds and trains NNAdjust, whose structure is shown in Figure 1(a), to fit the relationship between and (Line 1-4). NNAdjust tells what adjustment can be performed to make the performance of a certain configuration achieve a certain increase, and thus helps us infer optimal configurations. Note that in order to make gradient descent easier and convergence faster, and are normalized, and are used instead. Then, NNVerify, whose structure is shown in Figure 1(b), is built and trained to fit the relationship between and (Line 5-6). NNVerify can tell how much the performance will increase if an adjustment is made to a certain configuration, and thus helps us verify the effect of that adjustment. After obtaining these two well-trained neural networks, HumanExperience uses them to find optimal configuration candidates (Line 7-13). The details are as follows.
Step 1. Use NNAdjust to predict (Line 7). Taking as input, NNAdjust outputs the predicted , which is denoted by . + is considered to be optimal by NNAdjust.
Step 2. Use NNVerify to verify the rationality of (Line 8). Taking as input, NNVerify outputs the predicted +, which is denoted by . reflects NNVerify’s view on the rationality of . More specifically, if is very similar to , then we can say that both NNAdjust and NNVerify judge + to be optimal; otherwise, NNVerify disagrees with NNAdjust on the performance of +.
Step 3. Select configurations that are considered to be optimal by both of two neural networks (Line 9-13). The smaller - is, the more confidence NNAdjust and NNVerify have in the superiority of +, and thus the more likely that + is optimal. Based on this thought, are sorted and new configurations that are considered to be better are selected and output.
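Steps 1 to 3 can be sketched as below. This is a hedged illustration only: the two trained networks are replaced by hypothetical stand-in functions (`toy_adjust`, `toy_verify`), the desired performance increase is passed in through an assumed `target_gain` function, and configurations are assumed to be numeric tuples.

```python
def select_candidates(configs, target_gain, nn_adjust, nn_verify, top_k):
    """Keep the adjusted configurations on which both models agree most."""
    scored = []
    for x in configs:
        gain = target_gain(x)                     # desired performance increase
        adjustment = nn_adjust(x, gain)           # Step 1: predicted adjustment
        verified_gain = nn_verify(x, adjustment)  # Step 2: predicted increase
        candidate = tuple(a + d for a, d in zip(x, adjustment))
        scored.append((abs(verified_gain - gain), candidate))
    # Step 3: the smaller the disagreement, the more trusted the candidate.
    scored.sort(key=lambda item: item[0])
    return [candidate for _, candidate in scored[:top_k]]


# Hypothetical stand-ins for the trained NNAdjust and NNVerify models.
def toy_adjust(x, gain):
    return tuple(gain for _ in x)


def toy_verify(x, adjustment):
    return sum(adjustment) / len(adjustment) * (1.0 - x[0])


configs = [(0.0, 0.5), (0.5, 0.5), (0.9, 0.1)]
chosen = select_candidates(configs, lambda x: 0.1, toy_adjust, toy_verify, top_k=2)
```

Ranking by the absolute gap between the requested and the verified increase is one direct reading of the agreement criterion in Step 3; other distance measures would fit the same structure.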
Summary. Human Experience extracts useful knowledge from known information, and utilizes obtained knowledge to infer optimal configurations intelligently. Two neural networks used in it are like two human brains with different thought patterns. They discuss with each other and exchange their views, and finally select the configuration candidates which are considered to be optimal by both of them. Human Experience brings forward a novel thought to infer optimal configurations.
4.2 Parameter Analysis Method
Motivation. Different hyperparameters may have different effects on algorithm performance. If we can figure out important hyperparameters utilizing , more optimal configurations are likely to be found. The reason is as follows. The opportunities to evaluate configurations are finite in CHPO problems, whereas, the configuration space is always huge. If we focus on important hyperparameters instead of unrelated or unimportant ones when deciding new configurations to test, then we can avoid wasting opportunities on useless configurations, and have more opportunities to reach more optimal and useful ones.
The key point in implementing the above idea is to judge the importance of hyperparameters reasonably. Random Forest Breiman (2001) is well known for its ability to measure the importance of features in a classification dataset. Thus, we can first convert the known configuration-performance pairs into a classification dataset, and then use Random Forest to judge the importance of the hyperparameters for the search. Based on this idea, we design the Parameter Analysis method.
Detail Workflow. Algorithm 2 shows the pseudo code of the Parameter Analysis method. Firstly, the ParameterAnalysis algorithm converts the known configuration-performance pairs into a classification dataset (Line 1-3). It ranks the known configurations according to their performance and classifies them into three categories: high-performance ones (labeled 3), mid-performance ones (labeled 2) and low-performance ones (labeled 1). In this way, each configuration has a category label related to its performance.
Secondly, ParameterAnalysis utilizes Random Forest to select key hyperparameters in , i.e., , which have profound effect on the performance (Line 4-11).
Finally, ParameterAnalysis finds and outputs optimal configuration candidates utilizing (Line 12-14). It generates new configurations of randomly, and sets the values of other less important hyperparameters according to the most optimal configuration in . In this way, optimal configuration candidates of are obtained.
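The first and last steps of this workflow can be sketched in plain Python. The sketch rests on assumptions that the text does not state: the three categories are taken to be equal-sized thirds, configurations are dictionaries, and the list of key hyperparameters (`key_names`) is supplied by the caller instead of being computed by a Random Forest. All names and data are hypothetical.

```python
import random


def label_by_rank(experience):
    """Label each pair 3 (high), 2 (mid) or 1 (low) by performance rank.

    Assumption: the three categories are equal-sized thirds.
    """
    n = len(experience)
    order = sorted(range(n), key=lambda i: experience[i][1], reverse=True)
    labels = [0] * n
    for rank, index in enumerate(order):
        labels[index] = 3 - (3 * rank) // n
    return labels


def propose(experience, space, key_names, count, seed=0):
    """Randomize the key hyperparameters; copy the rest from the best known configuration."""
    rng = random.Random(seed)
    best_config, _ = max(experience, key=lambda pair: pair[1])
    proposals = []
    for _ in range(count):
        config = dict(best_config)
        for name in key_names:
            config[name] = rng.choice(space[name])
        proposals.append(config)
    return proposals


space = {"depth": [2, 4, 6, 8], "rate": [0.01, 0.1, 0.3]}
experience = [
    ({"depth": 2, "rate": 0.01}, 0.70),
    ({"depth": 4, "rate": 0.10}, 0.78),
    ({"depth": 6, "rate": 0.10}, 0.85),
    ({"depth": 8, "rate": 0.30}, 0.81),
    ({"depth": 6, "rate": 0.30}, 0.83),
    ({"depth": 2, "rate": 0.30}, 0.72),
]
labels = label_by_rank(experience)
proposals = propose(experience, space, key_names=["depth"], count=3)
```

With `key_names=["depth"]`, only `depth` varies across the proposals, so the effective search space shrinks from 12 configurations to 4. The labeled configurations would serve as the classification dataset from which feature importances are estimated.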
Summary. Parameter Analysis applies a pruning method. It uses Random Forest's strong ability to evaluate feature importance to reduce the configuration space, and thus improves the chances of finding optimal configurations.
This method complements Human Experience. It works especially well when most of the given hyperparameters are redundant or unimportant, because the configuration space can then be reduced considerably. However, it may be less effective when most hyperparameters are important, because the reduced space is still very large, and it remains very difficult to select optimal configurations from it.
5 ExperienceThinking Algorithm
As discussed above, Human Experience and Parameter Analysis suit different situations and complement each other. However, we usually cannot tell in advance which one to choose. To further improve the performance across various scenarios, we combine them and propose ExperienceThinking. In the combined approach, the two approaches infer optimal configurations separately. All optimal candidates provided by them are evaluated, and the corresponding performance information is generated. The set of known configuration-performance pairs is augmented with these configurations and their performance. With the augmented set, the two approaches can constantly adjust themselves and enhance their credibility. Such adjustments are repeated until the evaluation budget set by the constraint is used up. Finally, the best configuration in the set is taken as the solution. Figure 2 shows the overall framework of ExperienceThinking, and Algorithm 3 gives its pseudocode.
|Default Task||Datasets||No. of features||No. of classes||No. of records|
Detail Workflow. For a CHPO problem, ExperienceThinking works as follows. Firstly, it initializes the set of known configuration-performance pairs by evaluating several randomly selected configurations, which uses a given percentage of the configuration evaluation opportunities (Line 1-2). Then, the iteration begins. ExperienceThinking divides the remaining configuration evaluation opportunities equally into a given number of parts. In each iteration, it invokes the Human Experience method and the Parameter Analysis method to analyze experience and infer optimal configurations, and then uses one part of the opportunities to evaluate several optimal candidates provided by the two methods and updates the set of known pairs (Line 4-6). The iterative process continues until the allowed number of evaluations is reached. Finally, the best of the evaluated configurations is output as the solution to the given CHPO problem (Line 8-9).
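The budget handling in this workflow can be sketched as follows. It is a simplified illustration under assumptions: the two methods are passed in as generic `suggesters`, each round's share of evaluations is split equally between them, and the toy objective and the stand-in suggesters (`near_best`, `uniform`) are hypothetical.

```python
import random


def experience_thinking(evaluate, sample_random, suggesters, budget, init_fraction, rounds):
    """Spend `budget` evaluations: random initialization, then `rounds` of suggestions."""
    n_init = max(1, int(budget * init_fraction))
    experience = []
    for _ in range(n_init):
        config = sample_random()
        experience.append((config, evaluate(config)))
    per_round = (budget - n_init) // rounds
    share = per_round // len(suggesters)  # assumption: equal split between methods
    for _ in range(rounds):
        for suggest in suggesters:
            # Each method sees all experience gathered so far.
            for config in suggest(experience, share)[:share]:
                experience.append((config, evaluate(config)))
    return max(experience, key=lambda pair: pair[1])


rng = random.Random(0)
calls = []


def toy_evaluate(c):
    calls.append(c)
    return -(c - 3.0) ** 2


def near_best(experience, k):  # hypothetical stand-in for Human Experience
    best = max(experience, key=lambda p: p[1])[0]
    return [best + rng.uniform(-1, 1) for _ in range(k)]


def uniform(experience, k):  # hypothetical stand-in for Parameter Analysis
    return [rng.uniform(0, 10) for _ in range(k)]


best_config, best_score = experience_thinking(
    toy_evaluate, lambda: rng.uniform(0, 10), [near_best, uniform],
    budget=32, init_fraction=0.25, rounds=3,
)
```

Integer division may leave a few evaluations unused, so the sketch never exceeds the budget; a fuller version could hand the remainder to the last round.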
Note that these two methods may be run many times in ExperienceThinking. As the number of invocations grows, the known configuration-performance pairs accumulate, the two methods become more reliable, and the configuration candidates they suggest are more likely to be optimal. ExperienceThinking constantly adjusts the two methods by enhancing their accuracy and thus gradually approaches the optimal configuration. This resembles human growth: with age, humans accumulate richer experience and a stronger ability to solve problems, and the solutions they provide improve. In this sense, ExperienceThinking acts like a growing human and solves CHPO problems intelligently.
6 Experiments
To show the benefits of the proposed approach, we conduct extensive experiments. We implement all the algorithms in Python, and run the experiments unrelated to CNN on a machine with an Intel 2.3GHz i5-7360U CPU and 16GB memory running Windows 10. The experiments related to CNN are run on a GTX 1080 Ti.
6.1 Experimental Setup
Datasets. We conduct experimental studies using 10 datasets, including 8 datasets used for data classification and 2 datasets used for image classification. Table 1 shows the statistical information of them, and the following is the brief introduction to them.
The first 8 datasets are available from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.php). These datasets are from various areas, including life, computer, physics, society, finance and business. The cifar10 dataset (https://www.cs.toronto.edu/~kriz/cifar.html) is a collection of color images that are commonly used to train machine learning and computer vision algorithms, consisting of a training set of 50,000 examples and a test set of 10,000 examples. It was collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The fashion mnist dataset (https://research.zalando.com/welcome/mission/research-projects/fashion-mnist/) is a dataset of Zalando’s article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image.
Algorithms for Comparison. We implement three state-of-the-art HPO techniques: Random Search (RS) Bergstra and Bengio (2012), Grid Search (GS) Montgomery (2017) and Bayesian Optimization (BO) Shahriari et al. (2016), which are introduced in Section 2. We make the following adjustments to these techniques so that they can deal with CHPO problems and be compared with ExperienceThinking.
Consider a CHPO problem =. RS randomly selects configurations from and considers the optimal one among them as the solution to . For each hyperparameter , GS randomly select or values from , and thus form approximately (no more than ) configurations. GS evaluates these configurations and considers the most optimal one among them as the solution to . BO randomly selects configurations from as the initial samples, and then selects the next sample by optimizing acquisition function iteratively. The iteration stops when - configuration evaluation opportunities are used up, and BO considers the most optimal evaluated configurations as the solution to .
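One way to choose the number of grid levels per hyperparameter so that the grid stays within the evaluation budget is sketched below. This is an assumed procedure for illustration, not necessarily the exact rule used in the experiments: every hyperparameter starts with the largest common number of levels that fits, and individual hyperparameters then receive one extra level while the product stays within the budget. The function name is hypothetical.

```python
from math import prod


def levels_within_budget(n_hyperparameters, budget):
    """Pick per-hyperparameter level counts whose product does not exceed `budget`."""
    # Largest common level count, found with integer arithmetic only.
    base = 1
    while (base + 1) ** n_hyperparameters <= budget:
        base += 1
    levels = [base] * n_hyperparameters
    # Give one extra level to as many hyperparameters as the budget allows.
    for i in range(n_hyperparameters):
        levels[i] += 1
        if prod(levels) > budget:
            levels[i] -= 1
            break
    return levels


seven_128 = levels_within_budget(7, 128)
seven_256 = levels_within_budget(7, 256)
```

For seven hyperparameters, a budget of 128 is met exactly by two levels each, while a budget of 256 allows a third level for only one hyperparameter.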
For ExperienceThinking algorithm, in the experiments, we set the parameter and by default. This setting will be demonstrated to be reasonable in the parameter sensitivity evaluation part.
Evaluation Metrics. In the experiments, if the hyperparameters in the given CHPO problem = have the default configuration , then we utilize to measure the effectiveness of the CHPO algorithm , if do not have , then we use instead to quantify the effectiveness of . For all efficiency experiments, we report the analysis time (all time cost except for time used for evaluating configurations) in minutes. The definition of and the explanation of are given as follows.
Definition 6. (Performance Increase Rate, PIRate) Consider a CHPO problem =, and a CHPO technique . Let represent the default hyperparameter configuration, and denote the optimal configuration of provided by . The Performance Increase Rate (PIRate) of the algorithm in under is defined as:
measures the difference between and . It can either be positive or negative. If outperforms , is larger than , and thus is positive; otherwise is negative. The higher value means the stronger ability of to solve . Note that, in the following experiments, we divide into groups equally and apply 3-fold cross-validation accuracy to calculate ().
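The sign behaviour described above can be illustrated with a small helper. The formula used here is an assumption: the rate is taken to be the improvement relative to the score of the default configuration, which is consistent with the percentages reported in the result tables but is not confirmed by the surviving text. The function name and the scores are hypothetical.

```python
def performance_increase_rate(found_score, default_score):
    """Relative improvement of the found configuration over the default one.

    Assumption: the rate is normalized by the default configuration's score.
    """
    return (found_score - default_score) / default_score


better = performance_increase_rate(0.88, 0.80)  # found configuration beats default
worse = performance_increase_rate(0.72, 0.80)   # found configuration is worse
```

Under this reading, a value above 100% simply means that the found configuration more than doubles the score of the default configuration.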
6.2 Performance Evaluation
We examine the performance of ExperienceThinking, RS, GS and BO using three different types of CHPO problems: CHPO problems related to a machine learning algorithm (Section 6.2.1), CHPO problems related to neural architecture search (Section 6.2.2, Section 6.2.3) and CHPO problems related to feature subset selection (Section 6.2.4). We analyze all experimental results in Section 6.2.5.
Note that for all CHPO problems, we run the CHPO algorithm 50 times by default, and report its average (or average ) and its average analysis time. Due to the fact that the analysis time of RS and GS is very little, we ignore them in the experiments.
6.2.1 XGBoost Hyperparameter Optimization
Table 2: Seven important hyperparameters in XGBoost.
XGBoost (eXtreme Gradient Boosting) Chen and Guestrin (2016) is a popular open-source implementation of the gradient boosted trees algorithm. From predicting ad click-through rates to classifying high energy physics events, XGBoost has proved its mettle in terms of performance and speed. It is very sensitive to hyperparameters: its performance depends strongly on the correct setting of many internal hyperparameters. In this part, we use CHPO techniques to automatically find suitable hyperparameter configurations and thus improve the effectiveness of XGBoost.
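|Dataset||RS (128 evaluations)||RS (256 evaluations)||GS (128 evaluations)||GS (256 evaluations)||BO (128 evaluations): PIRate, analysis time||BO (256 evaluations): PIRate, analysis time||ExperienceThinking (128 evaluations): PIRate, analysis time||ExperienceThinking (256 evaluations): PIRate, analysis time|
|zoo||0.66%||1.04%||-5.45%||-9.87%||1.07%, 38.76||1.12%, 124.78||1.23%, 6.83||1.73%, 24.02|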
|cbsonar||30.40%||31.88%||22.67%||28.13%||29.25%, 41.58||31.35%, 158.61||32.17%, 7.08||37.26%, 23.48|
|image||5.45%||6.25%||0.36%||1.59%||5.27%, 41.66||5.91%, 263.52||7.77%, 7.40||9.43%, 25.70|
|ecoli||4.46%||4.69%||3.69%||3.88%||4.55%, 31.40||4.77%, 219.91||4.80%, 5.63||5.23%, 23.45|
|breast cancer||0.55%||0.71%||-0.23%||0.06%||0.58%, 36.09||0.62%, 281.25||0.80%, 5.79||1.04%, 22.52|
|balance||7.03%||7.38%||6.36%||6.79%||7.58%, 51.60||7.65%, 143.47||7.68%, 6.41||7.85%, 23.92|
|creditapproval||2.43%||2.64%||1.55%||1.64%||2.57%, 24.34||2.57%, 201.34||2.68%, 6.05||2.75%, 23.93|
|banknote||0.14%||0.36%||-0.40%||-0.72%||0.46%, 35.81||0.46%, 210.51||0.56%, 7.05||0.60%, 24.43|
|Average Values||6.39%||6.87%||3.57%||3.92%||6.42%, 37.66||6.81%, 200.40||7.21%, 6.53||8.24%, 23.93|
We consider seven main hyperparameters of XGBoost (shown in Table 2) as the hyperparameters to optimize (https://xgboost.readthedocs.io/en/latest/ gives their default configuration), set the number of allowed configuration evaluations to 128 or 256, set the algorithm to XGBoost and set the dataset to one of the data classification datasets in Table 1, and thus construct CHPO problems related to XGBoost to compare the algorithms. Table 3 shows their performance.
Experimental Results. From Table 3, we find that the algorithms generally achieve a higher PIRate as the number of allowed evaluations increases. ExperienceThinking is the most effective among them for both budgets, GS performs the worst, and the effectiveness of BO is on average close to that of RS. We also find that BO and ExperienceThinking spend more time on analysis when the budget gets larger, and that the analysis time of ExperienceThinking grows much more slowly than that of BO. ExperienceThinking outperforms BO for both budgets.
|Name||Type||Set Ranges or Available Options|
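|Dataset||RS (128 evaluations)||RS (256 evaluations)||GS (128 evaluations)||GS (256 evaluations)||BO (128 evaluations): PIRate, analysis time||BO (256 evaluations): PIRate, analysis time||ExperienceThinking (128 evaluations): PIRate, analysis time||ExperienceThinking (256 evaluations): PIRate, analysis time|
|zoo||1.41%||1.59%||0.91%||1.71%||1.46%, 20.30||2.14%, 63.55||1.62%, 7.60||2.49%, 23.43|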
|cbsonar||129.22%||136.54%||87.37%||112.11%||105.75%, 23.85||119.68%, 133.40||136.18%, 6.93||138.64%, 25.07|
|image||116.29%||120.48%||101.90%||111.43%||120.57%, 43.85||120.95%, 195.55||120.57%, 7.65||122.86%, 22.01|
|ecoli||33.20%||35.00%||32.50%||33.19%||32.38%, 28.11||35.05%, 145.18||33.20%, 7.90||35.09%, 22.32|
|breast cancer||6.71%||7.06%||6.09%||6.65%||7.33%, 29.38||7.44%, 130.94||7.28%, 7.11||7.50%, 22.03|
|balance||10.96%||11.15%||9.57%||9.85%||10.98%, 20.81||10.96%, 87.95||11.40%, 7.47||11.73%, 23.92|
|creditapproval||13.36%||15.53%||7.20%||7.53%||16.15%, 22.93||16.64%, 161.05||16.75%, 7.69||17.10%, 23.15|
|banknote||0.00%||0.00%||-0.18%||-0.08%||0.00%, 14.94||0.00%, 120.64||0.00%, 6.36||0.00%, 21.76|
|Average Values||38.89%||40.92%||30.67%||35.30%||36.83%, 25.52||39.11%, 129.78||40.88%, 7.34||41.93%, 22.96|
6.2.2 MLP Architecture Search
Experimental Design. Neural networks are powerful and flexible models that work well for many difficult learning tasks. Despite their success, they are still hard to design. In this part, we try to automatically design suitable MLP, a feedforward artificial neural network model, for the given dataset utilizing CHPO techniques.
We consider six main hyperparameters of MLP (shown in Table 4) as the hyperparameters to optimize (https://scikit-learn.org/ gives their default configuration), set the number of allowed configuration evaluations to 128 or 256, set the algorithm to MLP and set the dataset to one of the data classification datasets in Table 1, and thus construct several CHPO problems related to MLP architecture search to examine the algorithms. Table 5 shows their performance.
Experimental Results. We obtain results similar to those of the experiment in Section 6.2.1.
|Name||Type||Set Ranges or Available Options||Meaning|
|SL1Type||list||[Conv2D, MaxPooling2D, AveragePooling2D, Dropout]||The type of the 1st layer|
|SL2Type||list||[Conv2D, MaxPooling2D, AveragePooling2D, Dropout, None]||The type of the 2nd layer|
|SL3Type||list||[Conv2D, MaxPooling2D, AveragePooling2D, Dropout, None]||The type of the 3rd layer|
|SL4Type||list||[Conv2D, MaxPooling2D, AveragePooling2D, Dropout, None]||The type of the 4th layer|
|SL5Type||list||[Conv2D, MaxPooling2D, AveragePooling2D, Dropout, None]||The type of the 5th layer|
|SLActivation||list||[relu, softsign, softplus, selu, elu, softmax, tanh, sigmoid, hardsigmoid, linear]||The activation function used by the Conv2D layers in CNN|
|SLDroupout||float||0.1–0.9||The dropout rate set in Dropout layers used in the first five layers|
|DenseLNum||int||0–3||The number of fully connected layers in CNN|
| || || ||The number of neurons in each fully connected layer|
|DenseLDroupout||float||0.1–0.9||The dropout rate set in the Dropout layer used after fully connected layers|
|DenseLActivation||list||[relu, softsign, softplus, selu, elu, softmax, tanh, sigmoid, hardsigmoid, linear]||The activation function used by fully connected layers in CNN|
|OutputLActivation||list||[relu, softsign, softplus, selu, elu, softmax, tanh, sigmoid, hardsigmoid, linear]||The activation function used by the output layer in CNN|
| || ||[SGD, RMSprop, Adagrad, Adadelta, Adam, Adamax, Nadam]||The optimizer used by CNN|
|batchsize||int||10–100||The batch size used when training CNN|
6.2.3 CNN Architecture Search
Experimental Design. In this part, we use CHPO techniques to automatically design a suitable CNN for the given image dataset. A CNN consists of one or more convolutional layers (often with a subsampling step) followed by one or more fully connected layers, as in a standard multilayer neural network.
We take the fourteen hyperparameters related to CNN design (shown in Table 6) as the hyperparameters to tune, set the limit on the number of configuration evaluations to 128, set the algorithm to CNN and set the dataset to one of the image classification datasets in Table 1. This yields several CHPO problems related to CNN architecture search, on which we examine four CHPO algorithms. Table 7 shows their performance.
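The search space of Table 6 can be written down as a small data structure, which makes it easier to see what a single configuration looks like. The sketch below is only an illustration. It reads the flattened ranges as 0.1 to 0.9, 0 to 3 and 10 to 100. Two rows of the table have no surviving name, so `DenseLSize` and `Optimizer` are hypothetical names, and the range 16 to 512 for the number of neurons is a placeholder, not a value from the paper.

```python
import random

LAYER_TYPES = ["Conv2D", "MaxPooling2D", "AveragePooling2D", "Dropout"]
ACTIVATIONS = ["relu", "softsign", "softplus", "selu", "elu",
               "softmax", "tanh", "sigmoid", "hardsigmoid", "linear"]
OPTIMIZERS = ["SGD", "RMSprop", "Adagrad", "Adadelta", "Adam", "Adamax", "Nadam"]

# name -> (type, domain); None means "no layer at this position".
SEARCH_SPACE = {
    "SL1Type": ("list", LAYER_TYPES),
    "SL2Type": ("list", LAYER_TYPES + [None]),
    "SL3Type": ("list", LAYER_TYPES + [None]),
    "SL4Type": ("list", LAYER_TYPES + [None]),
    "SL5Type": ("list", LAYER_TYPES + [None]),
    "SLActivation": ("list", ACTIVATIONS),
    "SLDroupout": ("float", (0.1, 0.9)),
    "DenseLNum": ("int", (0, 3)),
    "DenseLSize": ("int", (16, 512)),      # hypothetical name and placeholder range
    "DenseLDroupout": ("float", (0.1, 0.9)),
    "DenseLActivation": ("list", ACTIVATIONS),
    "OutputLActivation": ("list", ACTIVATIONS),
    "Optimizer": ("list", OPTIMIZERS),     # hypothetical name
    "batchsize": ("int", (10, 100)),
}


def sample_configuration(rng: random.Random) -> dict:
    """Draw one configuration uniformly at random from SEARCH_SPACE."""
    config = {}
    for name, (kind, domain) in SEARCH_SPACE.items():
        if kind == "list":
            config[name] = rng.choice(domain)
        elif kind == "int":
            config[name] = rng.randint(domain[0], domain[1])  # inclusive bounds
        else:
            config[name] = rng.uniform(domain[0], domain[1])
    return config


if __name__ == "__main__":
    print(sample_configuration(random.Random(0)))
```

Each sampled dictionary is one candidate architecture; building and training the corresponding network costs one configuration evaluation.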
Experimental Results. Since the hyperparameters in Table 6 have no default values, we measure the effectiveness of each CHPO algorithm by the performance score of the configuration it finds. Because CNN training is very time-consuming, we set the number of epochs to 10 when training a CNN, and we run each CHPO algorithm 10 times to obtain the average score and the average analysis time. From Table 7, we find that ExperienceThinking performs the best among the four algorithms, GS performs the worst, and RS performs better than BO. ExperienceThinking is also more efficient than BO.
|cifar10||0.660||0.383||0.613, 88.37||0.673, 34.29|
|fashion mnist||0.878||0.627||0.864, 63.99||0.883, 34.43|
|Average Values||0.769||0.505||0.739, 76.18||0.778, 34.36|
6.2.4 Feature Subset Selection
Experimental Design. Feature subset selection is an important step in machine learning. Its idea is to find the features that are best suited to the classification task. In this part, we use CHPO techniques to deal with feature subset selection problems.
We set the limit on the number of configuration evaluations to 128 or 256, set the algorithm to the K-Nearest Neighbor classification algorithm, set the dataset to one of the data classification datasets with more than 14 features in Table 1, and consider the features of the dataset as hyperparameters (footnote 7: every three features form one hyperparameter, in which each feature corresponds to a value of 0 or 1; a feature is preserved in the dataset iff its value is 1). This yields several CHPO problems related to feature subset selection. We use four CHPO techniques to deal with these problems and compare their performance. Table 8 shows the results. Note that, in this experiment, the default configuration is the one that preserves all features of the dataset.
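The encoding described in footnote 7 can be sketched as follows. This is an illustrative reading, not the authors' code: the function names are hypothetical, and letting the last group be smaller when the number of features is not a multiple of three is an assumption that the text does not specify.

```python
from itertools import product
from typing import List, Sequence, Tuple

Pattern = Tuple[int, ...]


def build_feature_hyperparameters(n_features: int, group_size: int = 3):
    """Group features; each group becomes one hyperparameter whose options
    are all 0/1 patterns over the features in that group."""
    groups = [list(range(start, min(start + group_size, n_features)))
              for start in range(0, n_features, group_size)]
    return [(group, list(product((0, 1), repeat=len(group)))) for group in groups]


def decode(config: Sequence[Pattern], hyperparameters) -> List[int]:
    """Return the indices of the features kept by a configuration
    (a feature is kept iff its bit is 1)."""
    kept = []
    for (group, _options), pattern in zip(hyperparameters, config):
        kept.extend(idx for idx, bit in zip(group, pattern) if bit == 1)
    return kept


def default_configuration(hyperparameters) -> List[Pattern]:
    """The default configuration preserves every feature."""
    return [tuple(1 for _ in group) for group, _options in hyperparameters]


if __name__ == "__main__":
    hps = build_feature_hyperparameters(16)
    print(len(hps), "hyperparameters,", len(hps[0][1]), "options in the first one")
    print(decode(default_configuration(hps), hps))
```

Under this encoding, a group of three features has eight options, so a dataset with 16 features is described by six hyperparameters instead of sixteen binary ones.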
Experimental Results. We obtain results similar to those of the experiment in Section 6.2.1. Note that, in this experiment, BO takes too much time (more than 5 days) on the cbsonar dataset to obtain the average performance and the average analysis time. Because of this time limit, we do not report the results of BO on cbsonar.
|zoo||6.05%||6.77%||4.58%||4.93%||6.15%, 36.38||6.48%, 221.08||6.15%, 6.55||6.89%, 23.93|
|cbsonar||61.13%||64.03%||32.65%||35.33%||, 150||, 300||78.46%, 7.35||85.88%, 23.74|
|image||41.91%||46.09%||23.48%||19.57%||43.48%, 24.66||47.83%, 235.88||45.91%, 6.69||50.65%, 21.19|
|breast cancer||2.13%||2.28%||-0.34%||-0.34%||2.13%, 36.89||2.22%, 226.38||2.16%, 6.35||2.29%, 23.67|
|creditapproval||39.10%||41.96%||27.34%||24.36%||39.80%, 23.35||42.38%, 154.05||41.53%, 7.82||43.78%, 27.92|
|Average Values||30.06%||32.23%||17.54%||16.77%||,||,||34.84%, 6.95||37.90%, 24.09|
6.2.5 Experimental Results Analysis
Effectiveness Analysis. The results of the above four experiments show that ExperienceThinking is the strongest of the four algorithms at dealing with CHPO problems, GS performs the worst, and BO is slightly more effective than RS. We now analyze the reasons for these differences in effectiveness.
For each hyperparameter, GS can test only very few values because the number of configuration evaluations in CHPO problems is limited. Moreover, since these few values are selected randomly rather than by domain experts, bad or ineffective values are very likely to be chosen, which results in the poor performance of GS. In RS, the tested values of each hyperparameter are also selected randomly, but more values can be tested. This makes RS more likely to find a better configuration and thus more effective than GS. However, GS and RS ignore historical observations and perform no analysis to infer better configurations. This shortcoming makes GS and RS perform worse than BO and ExperienceThinking, which add intelligent analysis.
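A short calculation shows how few values per hyperparameter a grid can afford. The sketch assumes a full factorial grid with the same number of values for every hyperparameter; the text does not state how the GS grid was actually built, so this is only an illustration of the budget argument. The hyperparameter counts (seven for XGBoost, six for MLP, fourteen for CNN) are taken from the experiments above.

```python
def grid_values_per_dimension(budget: int, n_hyperparameters: int) -> int:
    """Largest k such that a full grid with k values per hyperparameter
    needs at most `budget` configuration evaluations."""
    k = 1
    while (k + 1) ** n_hyperparameters <= budget:
        k += 1
    return k


if __name__ == "__main__":
    for name, dims in [("XGBoost", 7), ("MLP", 6), ("CNN", 14)]:
        k = grid_values_per_dimension(128, dims)
        print(f"{name}: {dims} hyperparameters -> {k} value(s) each, "
              f"{k ** dims} grid points")
    # XGBoost: 2 values each (2**7 = 128 grid points)
    # MLP:     2 values each (2**6 = 64 grid points)
    # CNN:     1 value each  (2**14 = 16384 already exceeds 128)
```

With the same 128 evaluations, random sampling can try up to 128 distinct values of every continuous hyperparameter, which is the advantage of RS over GS described above.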
Both BO and ExperienceThinking analyze historical experience intelligently to infer better configurations. However, ExperienceThinking has two different analysis modules that complement each other, whereas BO has only one, which is based on the assumption that samples follow a Gaussian distribution. Because the number of configuration evaluations in CHPO problems is limited, the accuracy of any single analysis module cannot be guaranteed. Inferring optimal configurations with the help of more, reasonably designed analysis modules makes ExperienceThinking more reliable and thus more effective than BO.
Efficiency Analysis. The results of the above four experiments show that the analysis time of GS and RS is the smallest (negligible), and that ExperienceThinking is far more efficient than BO. We now analyze the reasons for these differences in efficiency.
GS and RS do not analyze historical experience and are therefore more efficient than BO and ExperienceThinking; however, this also makes them less effective. The analysis modules of BO and ExperienceThinking work differently, which leads to different time performance. The analysis modules in ExperienceThinking provide many candidate configurations at each iteration, so ExperienceThinking needs to invoke them only a few times (e.g., 5 in the experiments) to get a good solution. In contrast, the analysis module used in BO provides only one candidate each time, so BO needs to invoke it many times. The two analysis modules used in ExperienceThinking are not time-consuming and are invoked very few times; therefore, ExperienceThinking is far more efficient than BO.
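The difference in the number of analysis calls can be made concrete with a small count. The sketch below is a simplification under stated assumptions: both methods are given the same hypothetical number of initial evaluations (64 out of 128), the sequential method proposes exactly one candidate per call, and the batch method spreads the remaining evaluations evenly over five calls. It counts calls only and does not model what each call computes.

```python
from typing import List


def sequential_invocations(budget: int, n_initial: int) -> int:
    """One candidate per analysis call: every evaluation after the
    initial ones requires its own call."""
    return max(budget - n_initial, 0)


def batch_schedule(budget: int, n_initial: int, n_invocations: int) -> List[int]:
    """Split the remaining evaluations as evenly as possible over a
    fixed number of analysis calls; entry i is the batch size of call i."""
    remaining = max(budget - n_initial, 0)
    base, extra = divmod(remaining, n_invocations)
    return [base + (1 if i < extra else 0) for i in range(n_invocations)]


if __name__ == "__main__":
    budget, n_initial = 128, 64               # hypothetical split of the budget
    print(sequential_invocations(budget, n_initial))   # 64 analysis calls
    print(batch_schedule(budget, n_initial, 5))        # [13, 13, 13, 13, 12]
```

Under these assumptions the sequential method performs 64 analysis calls while the batch method performs 5, so even if a single call cost the same in both methods, the total analysis time would differ by roughly an order of magnitude.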
Summary. Since configuration evaluations in CHPO problems are commonly very expensive and time-consuming, users do not want to end up with an inferior solution after evaluating configurations. If better solutions can be obtained at the cost of some analysis time, users would gladly accept it. Although ExperienceThinking is less efficient than GS and RS, its effectiveness is the highest and its efficiency is acceptable (better than BO). Therefore, ExperienceThinking is the best CHPO algorithm among the four algorithms that we analyzed.
6.3 Parameter Sensitivity Evaluation
We also investigate the effect of two parameters of ExperienceThinking on its performance, using the CHPO problems analyzed above: the first controls the share of initial configurations, and the second is the number of times the analysis modules are invoked. Table 9 is an example on the CHPO problem defined by the cbsonar dataset, the XGBoost algorithm, the seven hyperparameters in Table 2 and an evaluation limit of 128.
As we can see, the performance score first increases and then decreases as the first parameter increases, while the analysis time increases with it. The reasons are as follows. When the first parameter is very large, ExperienceThinking is very similar to RS, which ignores historical information, and is thus ineffective. When it is very small, the few initial configurations may be concentrated in regions that completely miss the effective hyperparameter configurations. They are then useless for inferring better configurations, which forms a vicious circle and makes ExperienceThinking ineffective. Besides, as the first parameter increases, more experience is considered in the two analysis modules, which makes the analysis time of ExperienceThinking longer. To get a better solution, we suggest that users set the first parameter to 0.5 when using ExperienceThinking.
As for the second parameter, both the performance score and the analysis time increase with it. The reasons are as follows. As the second parameter increases, more adjustments are made to improve the reliability of the two analysis modules in ExperienceThinking. This makes the configuration candidates suggested by the two modules more likely to be optimal and thus enhances the effectiveness of ExperienceThinking. However, more invocations mean much more analysis time, which makes ExperienceThinking less efficient. To get a better solution, we suggest that users set the second parameter as large as possible when using ExperienceThinking.
|Performance||Varying (=3)||Varying (=0.3)|
7 Conclusion and Future Works
In this paper, we present and formulate the CHPO problem, which aims at dealing with the HPO problem as effectively as possible under limited computing resources. Compared with the classic HPO problem, the CHPO problem is more practical and user-friendly. We simulate human thinking processes and combine the merits of existing techniques to propose an effective algorithm, ExperienceThinking, for solving the CHPO problem. We also design a series of experiments to examine the ability of three classic HPO techniques to deal with CHPO problems, and compare it with that of ExperienceThinking. The extensive experimental results show that our proposed algorithm provides better results. In future work, we will try to design more effective algorithms to deal with the CHPO problem, and apply the proposed CHPO techniques to more practical problems.
This paper was partially supported by NSFC grant U1509216, U1866602, 61602129, 61772157, CCF-Huawei Database System Innovation Research Plan DBIR2019005B and Microsoft Research Asia.