Automatic Hyper-Parameter Optimization Based on Mapping Discovery from Data to Hyper-Parameters

03/03/2020 ∙ by Bozhou Chen, et al. ∙ Harbin Institute of Technology

Machine learning algorithms have made remarkable achievements in the field of artificial intelligence. However, most machine learning algorithms are sensitive to their hyper-parameters. Manually optimizing the hyper-parameters is a common practice, but it is costly and depends heavily on experience. Automatic hyper-parameter optimization (autoHPO) is favored due to its effectiveness, yet current autoHPO methods are usually only effective for a certain type of problem and have a high time cost. In this paper, we propose an efficient automatic hyper-parameter optimization approach based on the mapping from data to the corresponding hyper-parameters. To describe such a mapping, we propose a sophisticated network structure. To obtain such a mapping, we develop effective network construction algorithms. We also design strategies to further optimize the result during the application of the mapping. Extensive experimental results demonstrate that the proposed approaches significantly outperform state-of-the-art approaches.

1 Introduction

Automatic machine learning (autoML) has gained wide attention and application in both industry and academia, and automatic hyper-parameter optimization is one of its most critical components. The effectiveness of many machine learning algorithms is extremely sensitive to the hyper-parameters [7]. Without a good set of hyper-parameters, a machine learning task cannot be solved well even with an optimal model.

Among hyper-parameter optimization approaches, data-driven methods draw attention since they can effectively predict hyper-parameters based on the historical experience implicit in the data.

However, data-driven automatic hyper-parameter optimization faces three severe challenges. Firstly, existing systems may involve thousands of machine learning tasks with many hyper-parameters [10], and recomputing hyper-parameters for each task incurs a large time cost; thus, the optimization process should be efficient. Secondly, the hyper-parameter optimization algorithm should have good transferability, since the optimal hyper-parameters differ across datasets; without transferability, the optimization algorithm has to be run many times even for the same machine learning algorithm. Thirdly, a hyper-parameter optimization algorithm should be able to handle many parameters, since some complex machine learning algorithms have thousands of hyper-parameters [22], all of which must be optimized to ensure effectiveness.

Even though some hyper-parameter optimization algorithms have been proposed, they cannot solve all of these problems. MI-SMBO [3] optimizes hyper-parameters based on the meta-features of historical data. It accelerates the optimization process and significantly improves performance. However, in this approach the meta-features are selected manually, which limits its transferability. Also, as a kind of SMBO method, MI-SMBO needs to run the machine learning algorithm iteratively many times, so an inefficient machine learning algorithm makes MI-SMBO itself inefficient. [Rijn2018hyper-parametersIA] selects the hyper-parameters with the most significant influence on performance and predicts priors for them from the best hyper-parameter groups on historical datasets. This method also needs to run the machine learning algorithm iteratively and can hardly optimize complex algorithms within a limited time.

This motivates us to solve these problems. Intuitively, the optimal hyper-parameters are determined by two factors, i.e., the machine learning algorithm and the data. Therefore, under the same algorithm, the hyper-parameters are completely determined by the data. Thus, we investigate the relationship between hyper-parameters and data. Considering that each dataset corresponds to at least one set of optimal hyper-parameters, we believe that there is a mapping from the data space to the hyper-parameter space, and we describe this mapping with a neural network. As a result, we can use this mapping to predict hyper-parameters directly.

Our contributions of this paper are summarized as follows.

  • We consider the mapping from data to the optimal hyper-parameters and apply this mapping to the selection of the optimal hyper-parameters. Across different tasks of the same algorithm, the model has strong transferability, which greatly reduces the time overhead. For this reason, the model can also optimize hyper-parameters in ultra-high dimensions.

  • Taking XGBOOST as an example, we design the neural network structure for the mapping as well as the training approaches, which can be applied to other machine learning tasks with slight modification.

  • Experimental results on real data demonstrate that the proposed approach significantly outperforms the state-of-the-art algorithms in both accuracy and efficiency.

In the remainder of this paper, Section 2 describes the proposed approach, extensive experiments are conducted in Section 3, related work is reviewed in Section 4, and Section 5 draws the conclusions.

2 Method

The basic idea of our approach is to build a mapping from datasets to the optimal hyper-parameters and use this mapping to determine hyper-parameters for a given dataset. Since the mapping is the core concept of our approach, we define it first and give an overview of the algorithm in Section 2.1, and then discuss the major components of our algorithm in Section 2.3 and Section 2.4, respectively.

2.1 Overview

For a machine learning algorithm, the optimal hyper-parameters are specific to the dataset. From this aspect, an optimal hyper-parameter generator for an algorithm can be considered as training a mapping from a dataset to an optimal hyper-parameter vector, which is defined as follows.

Definition 2.1.

Parameter Mapping: For a machine learning algorithm A, the mapping F that takes each training dataset D of A to its corresponding optimal hyper-parameter vector θ* is called a parameter mapping from the data space to the hyper-parameter space w.r.t. A, determined by θ* = F(D).

Since the mapping captures complex features of the data and may be very complex, we attempt to use a neural network to represent it, which we call a Core Network (CN). Thus, our algorithm is divided into two phases, CN construction and CN application, as shown in Figure 1; they are described in Section 2.3 and Section 2.4, respectively. Before that, we first introduce the structure of the CN.

Figure 1: The components of the proposed algorithm. The CN construction process is shown in the lower left, and the rest shows the CN application component for hyper-parameter prediction.

2.2 CN Structure

To build the mapping from the meta-features to the optimal hyper-parameters, we train a neural network called the CN for each algorithm. The input of the CN is the meta-features of the dataset, and the output is the generated hyper-parameters. Clearly, different machine learning tasks correspond to different hyper-parameters and thus require different CNs. In this section, we introduce the CN for XGBOOST [6]; the CNs for other tasks can be constructed in a similar way.

Figure 2: An example of core network.

The CN’s structure is shown in Figure 2. For such a CN, we model the dataset as a neural network and use that network’s trainable parameters as the input. More specifically, the parameters refer to the neural network’s weights and biases; the former are represented as 2-dimensional matrices, and the latter are floats. The CN has two parts, i.e., the Meta-Feature Process part and the Hyper-parameter Output part.

The Meta-Feature Process part analyzes the trainable parameters and reduces their size. In order to fully retain the structural information of the meta-features, we concatenate the biases of each layer to its weights and then use these 2-dimensional matrices as the CN input. Since these matrices are often large, to reduce the difficulty of training the CN, we use convolution layers (the ConvRelu components in Figure 2) to reduce the size of the inputs. According to the output format, the output of each convolution layer is flattened (Flatten component in Figure 2) and concatenated (Concat component in Figure 2).

The Hyper-parameter Output part combines the results of the Meta-Feature Process part and predicts the hyper-parameters. In this part, the fully connected layers (FC+Tanh and FC+ELU components in Figure 2) use activation functions such as Tanh and ELU [9] to keep the range of the output within that of the hyper-parameter. The choice of activation function depends on the range and distribution of the hyper-parameter [18].

To construct a CN for a machine learning algorithm, we first select a suitable neural network structure based on the data type of the algorithm's input. For example, for image classification the network could be a CNN [13], while for NLP the structure could be an LSTM [8]. After this, the inputs of the CN, i.e., the trainable parameters of this neural network, are determined by the dimensions and data types of those trainable parameters. Next, to construct the Meta-Feature Process part, we use convolution layers, as in the CN described above, to reduce the size of the input. Finally, we flatten the results and concatenate them into a 1-dimensional vector if there are multiple branches. The Hyper-parameter Output part is constructed from several fully-connected layers, and the last layer's output is adjusted to fit the number of hyper-parameters. We can also use Batch-Normalization [11] after the convolution layers to increase the effectiveness of learning if necessary.
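To make this structure concrete, the following is a minimal Keras sketch of a CN in the spirit of Figure 2. It is not the authors' exact architecture: the meta-feature shapes, the number of filters, and the three assumed XGBOOST hyper-parameters are illustrative placeholders.

```python
# Hedged sketch of a CN for XGBOOST, not the paper's exact architecture.
# Assumptions: two weight-plus-bias matrices (from the NPE encoder) as inputs,
# three hyper-parameters to predict (e.g. learning rate, max depth, subsample).
from tensorflow import keras
from tensorflow.keras import layers

def build_cn(meta_shapes=((65, 32), (33, 16)), n_hyper=3):
    inputs, branches = [], []
    for shape in meta_shapes:                          # Meta-Feature Process part
        x_in = keras.Input(shape=shape + (1,))         # treat each matrix as a 1-channel image
        x = layers.Conv2D(8, 3, strides=2, padding="same", activation="relu")(x_in)  # ConvRelu
        x = layers.BatchNormalization()(x)             # optional, as suggested in the text
        x = layers.Flatten()(x)                        # Flatten component
        inputs.append(x_in)
        branches.append(x)
    h = layers.Concatenate()(branches)                 # Concat component
    h = layers.Dense(64, activation="tanh")(h)         # Hyper-parameter Output part
    out = layers.Dense(n_hyper, activation="tanh")(h)  # range matched to the (scaled) labels
    return keras.Model(inputs, out)

cn = build_cn()
cn.compile(optimizer=keras.optimizers.Adam(1e-3), loss="mse")
```

In practice the input shapes must match the encoder produced by the NPE step described in Section 2.3.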

2.3 CN Construction

To construct the CN, we have two major jobs: the first is to prepare suitable data for CN training, and the second is to train the CN. We introduce them in turn in this section.

2.3.1 Data Preparation

The preparation of data has two goals, i.e., sufficiency and task-fitting. To achieve the former, we develop a sampling technique to generate sufficient training data from original datasets that suit this problem and contain enough data. For the latter, we propose an encoding approach to extract the meta-features of each generated dataset, which serve as the input of the CN. Additionally, we label each generated dataset by computing its corresponding optimal hyper-parameters.

Sampling To obtain a sufficient amount of training data, we sample it from a large dataset, which could be the union of training datasets. Note that to increase the generalization ability of the CN, the training datasets should be diverse. Since a large number of standard training datasets are published online, it is easy to obtain such a dataset. For example, for XGBOOST, we could easily obtain 98 classification datasets from the UCI repository (https://archive.ics.uci.edu/ml/datasets.php).

Clearly, to ensure the generalization ability of the trained CN, the sampled training datasets should be dissimilar. We measure the similarity of two datasets D_i and D_j with the Jaccard similarity J(D_i, D_j) = |D_i ∩ D_j| / |D_i ∪ D_j|. If J(D_i, D_j) exceeds a threshold α (0 < α < 1, where a small α means a strong independence constraint), then D_i and D_j are considered similar. Suppose we perform several samplings to obtain S = {D_1, ..., D_n}. If for any D_i and D_j in S the similarity does not exceed α, then S is independent w.r.t. α. Clearly, the independence of S ensures the dissimilarity among the sampled training sets. Fortunately, according to Theorem 1, a set S obtained by random sampling is independent.

Theorem 1.

For a dataset, given a required number of mutually independent subsets and a similarity threshold, there exist a subset size and a reasonable number of samplings such that a randomly sampled set of subsets contains the required number of subsets that are independent w.r.t. the threshold.

Proof.

(Sketch) We only need to ensure that the sample set is unbiased, which can be satisfied by taking out all the subsets that are independent of the other sampled subsets. As long as the corresponding instance has a solution, we obtain the desired result. The remaining claim is straightforward, since the quantity in question has a closed form with a maximum value; we only need to prove that this maximum value exceeds the required bound, which follows directly by differentiation. ∎

Now we take the MNIST dataset as an example. With reasonable choices of the subset size and the threshold, the corresponding probability is very close to 1. In this setting, sampling at least 1800 times ensures that at least 1000 subsets are independent of each other, so it is no problem to obtain enough independent subsets.
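A minimal sketch of the sampling-and-filtering step is shown below, assuming each sampled dataset is identified by its row indices; the subset size, threshold, and counts are placeholders, not the paper's exact values.

```python
# Hedged sketch: draw random subsets and keep only pairwise-dissimilar ones
# (Jaccard similarity at or below a threshold).
import numpy as np

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def sample_dissimilar_subsets(n_rows, subset_size, n_wanted,
                              threshold=0.3, max_tries=100_000, seed=0):
    rng = np.random.default_rng(seed)
    kept = []
    for _ in range(max_tries):
        if len(kept) == n_wanted:
            break
        idx = rng.choice(n_rows, size=subset_size, replace=False)
        # keep the subset only if it is dissimilar to every subset kept so far
        if all(jaccard(idx, other) <= threshold for other in kept):
            kept.append(idx)
    return kept  # each entry is a set of row indices defining one training dataset

subsets = sample_dissimilar_subsets(n_rows=60000, subset_size=1000, n_wanted=10)
```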

Figure 3: Normal-NPE
Figure 4: Image-NPE.

Encoding Even though the datasets may vary, the input of the CN should be made uniform by encoding. Two issues must be addressed here: the number of features may differ across datasets, and so may the number of samples.

We solve the first issue through zero-padding, i.e., adding features to a dataset with a small number of features, with all new features simply set to 0. Thus, the meaning of the new dataset is consistent with that of the original. Experimental results show that zero-padding does not affect key indicators such as classification accuracy.
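A tiny sketch of this padding step (the helper name and target width are ours):

```python
# Hedged sketch of zero-padding feature matrices to a common width.
import numpy as np

def pad_features(X, target_n_features):
    """Append all-zero columns so every dataset has target_n_features columns."""
    missing = target_n_features - X.shape[1]
    return np.pad(X, ((0, 0), (0, missing))) if missing > 0 else X

X_small = np.random.rand(100, 7)
X_uniform = pad_features(X_small, target_n_features=20)   # shape (100, 20)
```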

For the second issue, we design the Network Parameter Embedding (NPE) approach, which uses an auto-encoder to encode the dataset and returns the neural network parameters of the encoder as the features of the dataset. There are two differences between NPE and traditional auto-encoders. On the one hand, NPE encodes the attributes and the label at the same time, because only when they are jointly encoded can the result represent the features of the original dataset. On the other hand, as discussed above, we use the parameters of the corresponding neural network, which represent the dataset, as the input of the CN. Therefore, in the application phase, each dataset is encoded into such parameters.

Since different datasets may be represented by different neural networks with different parameters, we develop two types of NPEs, Normal-NPE and Image-NPE, to fit the typical types of datasets for our CN. In this paper, we focus on these two data types and will study NPEs for more data types in the future.

Normal-NPE is used to process tabular data, i.e., data in which each sample is a one-dimensional vector. The input and output of the encoder are one-dimensional. Considering that the number of features is not particularly large, fully connected layers can serve as the main structure of the network. Here we use a stacked auto-encoder [24], whose structure is shown in Figure 3; it can effectively encode structured datasets.
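As an illustration, here is a minimal sketch of a Normal-NPE-style stacked auto-encoder under our own assumptions (the layer sizes, code dimension, and the helper names build_normal_npe and encoder_meta_feature are ours): attributes and one-hot labels are concatenated and reconstructed, and the trained encoder's weight matrices, with biases appended as in Section 2.2, become the CN's meta-feature input.

```python
# Hedged sketch of a Normal-NPE-style stacked auto-encoder.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_normal_npe(n_features, n_classes, code_dim=16):
    inp = keras.Input(shape=(n_features + n_classes,))   # attributes + one-hot label
    h = layers.Dense(64, activation="relu")(inp)
    code = layers.Dense(code_dim, activation="relu")(h)
    h = layers.Dense(64, activation="relu")(code)
    out = layers.Dense(n_features + n_classes, activation="linear")(h)
    auto = keras.Model(inp, out)                          # full auto-encoder
    encoder = keras.Model(inp, code)                      # encoder half, whose weights we keep
    auto.compile(optimizer="adam", loss="mse")
    return auto, encoder

def encoder_meta_feature(encoder):
    # concatenate each layer's bias row to its weight matrix, as described above
    mats = []
    for layer in encoder.layers:
        if isinstance(layer, layers.Dense):
            w, b = layer.get_weights()
            mats.append(np.concatenate([w, b[None, :]], axis=0))
    return mats   # list of 2-D matrices, one per branch of the CN's input
```

Each matrix in the returned list corresponds to one branch of the CN's Meta-Feature Process part.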

The essence of Image-NPE is a convolutional auto-encoder designed to encode unstructured datasets such as images. A convolutional auto-encoder commonly used in image encoding is taken as the base, but the label of the dataset must be encoded together with the image so that the encoding result represents the original dataset. Therefore, at the output layer of the encoder, the encoding of the image is flattened (preserving the image structure information) and jointly encoded with the label by a fully connected layer; in the decoder, a fully connected layer separates the label and the image again. Finally, a reshape layer recovers the image using the saved structure information, and the convolutional decoder produces the output. The structure of Image-NPE is shown in Figure 4.

Labeling The goal of labeling is to generate the optimal hyper-parameters as the label of each sampled dataset, which together with its encoded features forms a training example for the CN. Our solution is to compute the hyper-parameters with existing approaches such as the work in [4] and pick the best ones according to the experiments.
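A minimal sketch of this labeling step follows, assuming the bayesian-optimization package listed in Table 1 is used as the "existing approach"; the tuned hyper-parameters and their search ranges are illustrative choices, not the paper's exact configuration.

```python
# Hedged sketch: find good XGBOOST hyper-parameters for one sampled dataset
# with Bayesian optimization; the result becomes that dataset's CN label.
from bayes_opt import BayesianOptimization
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def label_dataset(X, y):
    def objective(learning_rate, max_depth, subsample):
        model = XGBClassifier(learning_rate=learning_rate,
                              max_depth=int(max_depth),
                              subsample=subsample,
                              n_estimators=100)
        return cross_val_score(model, X, y, cv=3).mean()

    opt = BayesianOptimization(
        f=objective,
        pbounds={"learning_rate": (0.01, 0.3),
                 "max_depth": (2, 10),
                 "subsample": (0.5, 1.0)},
        random_state=0)
    opt.maximize(init_points=5, n_iter=25)
    return opt.max["params"]     # the (approximately) optimal hyper-parameter vector
```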

2.3.2 CN Training

Intuitively, the CN can be trained with normal neural network training approaches such as the work in [21]. The major challenge in CN training is that the loss can be too large and fail to converge; we identify three causes.

The first cause is that the label range of the training data is too wide to learn. To solve this problem, we first squash the labels with tanh and then set the activation function of the network's output layer to tanh, so that the output's range fits the labels' range.

The second cause is gradient explosion [12]. To handle this issue, gradient clipping is performed and the fully connected layers' activation function is changed to tanh. The third cause is that the CN's prediction for a label with a large value stays small and changes little during training, which is due to the saturation of tanh. To solve this problem, label components with large values are transformed by log10 before the tanh computation. After applying these strategies, the CN's training loss becomes small and the validation loss converges steadily.

Besides, in order to improve generalization, dropout [20] can be used.
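These tricks can be sketched as follows (a hedged illustration: which label components receive the log10 transform, the clipping norm, and the helper names are our assumptions).

```python
# Hedged sketch of the label scaling and gradient-clipping tricks above.
import numpy as np
from tensorflow import keras

def encode_labels(raw, log_cols=[1]):
    """raw: (n_samples, n_hyper) optimal hyper-parameters from the labeling step.
    Assumes the log-scaled components are positive (e.g. tree counts)."""
    y = raw.astype("float32")
    y[:, log_cols] = np.log10(y[:, log_cols])   # compress large-valued components
    return np.tanh(y)                           # match the tanh output layer's range

def decode_labels(pred, log_cols=[1]):
    y = np.arctanh(np.clip(pred, -0.999999, 0.999999))
    y[:, log_cols] = 10.0 ** y[:, log_cols]
    return y

# gradient clipping is plugged in when compiling the CN; dropout (e.g.
# keras.layers.Dropout(0.5) between Dense layers) goes inside the CN itself
optimizer = keras.optimizers.Adam(1e-3, clipnorm=1.0)
```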

2.4 CN Application

After the CN is trained, it can be applied to generate the optimal hyper-parameters. Before prediction, the dataset still needs to be encoded so that it can be fed into the CN; this process is exactly the same as in CN construction.

Note that the CN predictions may still contain some errors, which may cause a large loss in performance. Therefore, we need to further optimize the output of the CN. Since in most cases, even with these errors, the results generated by the CN lie near the optimal ones [23], we attempt to optimize the parameters within this local area, which we call local optimization.

Local Optimization Suppose P is the output of the CN. We divide P into two subsets, and then divide these subsets recursively until a subset contains just one or two parameters. If a subset has just one parameter, the function MC (mountain climbing method) is invoked; if it contains two parameters, the function DMC (dual mountain climbing method) is invoked.

The whole process is shown in Algorithm 1. LOPT's input is the output of the CN, and its output is the optimized parameters. In this algorithm, we put all the parameters to be optimized in a list P. Similar to Quick-sort, during recursion we optimize the parameters within a range of P.

The algorithm first initializes a segment tree and then invokes the function FUNC for local optimization. FUNC takes the parameter list P and the range [l, r] of parameters in P to optimize; in each of its loops, the parameters in the range are optimized recursively until convergence. To accelerate the convergence check, we maintain the sum of the absolute update values of the parameters in a segment tree [2], so that the sum does not need to be recomputed in each loop and the total cost of convergence checking is reduced, where the relevant quantity is the parameter number. MC performs the mountain climbing process for a single parameter and updates the segment tree with the updated value, while DMC optimizes a pair of parameters iteratively. A hedged Python sketch of this procedure is given after Algorithm 1 below.

Figure 5: The distribution of the algorithm’s accuracy when changing its two parameters. TestENV is the algorithm to be adjusted.
Algorithm 1 LOPT (pseudocode). The listing defines FUNC, which recursively splits the parameter range and iterates over its two halves until convergence; MC, which performs single-parameter mountain climbing and updates the segment tree with the absolute change; and DMC, which alternately optimizes a pair of parameters.
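The following is a hedged Python sketch of LOPT based on the description above, not a line-by-line transcription of Algorithm 1. The paper maintains the total absolute update in a segment tree [2]; a plain accumulator is used here for brevity. The callable `score` stands for the black-box validation accuracy of the tuned algorithm (ENV).

```python
def mc(params, i, score, step=0.1, shrink=0.5, tol=1e-3):
    """Hill climbing on parameter i; returns the total absolute change made."""
    moved = 0.0
    while step > tol:
        base = score(params)
        for direction in (+1.0, -1.0):
            cand = list(params)
            cand[i] += direction * step
            if score(cand) > base:
                params[i] = cand[i]
                moved += step
                break
        else:
            step *= shrink              # no improving move: refine the step size
    return moved

def dmc(params, i, j, score, rounds=2):
    """Dual hill climbing: alternately optimize parameters i and j."""
    return sum(mc(params, k, score) for _ in range(rounds) for k in (i, j))

def lopt(params, score, l=0, r=None, tol=1e-3):
    """Locally optimize the CN output `params` (a mutable list of floats)."""
    r = len(params) - 1 if r is None else r
    if l == r:
        return mc(params, l, score)
    if r - l == 1:
        return dmc(params, l, r, score)
    mid = (l + r) // 2
    total, last = 0.0, tol + 1.0
    while last > tol:                   # iterate both halves until the updates are small
        last = lopt(params, score, l, mid, tol) + lopt(params, score, mid + 1, r, tol)
        total += last
    return total

# usage sketch: params = list(cn.predict(meta_feature)[0]); lopt(params, validation_accuracy)
```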

Since the running time of the function MC has a constant upper bound, its time complexity is constant. We analyze the time complexity Algorithm 1 can reach in Theorem 2.

Theorem 2.

The time complexity of Algorithm 1 can reach O(n), where n is the parameter number.

Proof.

(Sketch) The recurrence for Algorithm 1 has the divide-and-conquer form T(n) = k·T(n/2) + O(1), where k is the number of recursive calls made by FUNC. Solving the recurrence gives T(n) = O(n^(log2 k)). When k = 2, the algorithm achieves its optimal efficiency of O(n).

Now we argue that this setting still ensures the correctness of the algorithm. Inside FUNC, the parameters are split into two halves and FUNC is called recursively. Transforming between these two parameter sets is similar to the coordinate rotation transform in the DMC algorithm; after the transformation, the parameters adjusted in each iteration are optimized along the optimal path, which ensures that DMC converges to the optimal solution after two iterations. The two parameter subsets divided by FUNC are regarded as two parameters, and similar transformations are performed on them, so FUNC can recursively call itself twice and then exit the loop directly while still guaranteeing that the final result is optimal. ∎
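The elided recurrence can plausibly be read as the standard divide-and-conquer form below; this reconstruction is our assumption, made consistent with the constant bound on MC/DMC and the two recursive calls of FUNC.

```latex
% Our reconstruction of the recurrence in the proof sketch: k recursive calls
% on half-sized ranges plus constant extra work per call.
T(n) \;=\; k\,T\!\left(\tfrac{n}{2}\right) + c
\;\;\Longrightarrow\;\;
T(n) \;=\; O\!\left(n^{\log_2 k}\right) \quad \text{(master theorem, case 1)},
\qquad
k = 2 \;\Rightarrow\; T(n) = O(n).
```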

3 Experiments

In this section, we study our approach experimentally with two typical machine learning algorithms, XGBOOST and CNN. For the experiments, we collected a series of relevant datasets suitable for each of the two algorithms.

Figure 6: Distribution of features corresponding to different categories.
Figure 7: The number of values for each feature.
Figure 8: Visualization of MNIST dataset by T-SNE

3.1 Experiment Data

XGBOOST We download 98 classification datasets from the UCI website (https://archive.ics.uci.edu/ml/datasets.php) and then sample from those datasets. Here we select two of the datasets for visualization, as shown in Figure 6 and Figure 7.

CNN We choose the MNIST and SVHN datasets and perform random sampling on them according to labels. Each MNIST subset has size 1000 (100 samples per class), 500 subsets in all; each SVHN subset has size 5000 (500 samples per class), 500 subsets in all. For labeling in this part, we use the state-of-the-art derivative-free optimization method SRACOS (https://github.com/eyounx/ZOOpt).

3.2 Experimental Settings

The software and hardware settings are shown in Table 1, and the data information is in Section 3.1.

XGBOOST CNN
CPU AMD Ryzen 3600 Intel Xeon Platinum 8163 2.5GHz x4 (96 core)
RAM 16x2G 3200MHz 251G
GPU GTX 1060 6G (2000MHz) GTX 2080Ti x8
OS Windows 10 1903 Linux version 3.10.0-1062.9.1.el7.x86_64
python 3.7 3.6
keras 2.3.1 2.3.1
tensorflow 1.13.1 1.13.1
tensorflow-gpu 1.13.1 1.13.1
numpy 1.17.4 1.18.0
pandas 0.23.4 0.25.3
scikit-learn 0.22 0.22
others XGBoost 0.90, Bayesian-optimization 1.0.1 ZOOpt 0.18.2
Table 1: Software and hardware settings

3.3 Experimental Results

We design three groups of experiments: a blank control group (BCG) without pre-training, a control group, and an experiment group. In the control group, we use Bayesian optimization or ZOOpt to optimize the hyper-parameters. In the experiment group, we use CN and CN + LOPT (or CN alone) to optimize the algorithm. Each group outputs a set of predicted hyper-parameters, and the resulting sets of hyper-parameters are then tested on the algorithm (ENV). The overall process is shown in Figure 9, and a detailed description follows.

We divide the datasets into two parts with ratio 9:1, and the larger part is used to train the CN. We then further divide the smaller part 9:1: its larger portion is run through the three models to obtain their outputs, and its smaller portion is used to test those outputs. One possible reading of this split is sketched below.
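```python
# One possible reading of the split above (our interpretation; variable names
# and the synthetic stand-in data are ours): 90% of the sampled datasets train
# the CN, and each held-out dataset is itself split 9:1 into a part given to
# the tuning methods and a part used to measure the resulting accuracy.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# stand-in for the sampled datasets: a list of (X, y) pairs
all_datasets = [(rng.random((100, 20)), rng.integers(0, 2, 100)) for _ in range(50)]

cn_train_sets, held_out_sets = train_test_split(all_datasets, test_size=0.1, random_state=0)

for X, y in held_out_sets:
    X_fit, X_test, y_fit, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
    # X_fit, y_fit  -> CN / CN+LOPT / BO or ZO produce hyper-parameters
    # X_test, y_test -> evaluate the algorithm trained with those hyper-parameters
```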

Figure 9: The overall process of our experiment. CN denotes control group. Test_ENV denotes the algorithm to be adjusted.

XGBOOST

We verify the effectiveness of our models (CN, CN + LOPT) on 280 classification datasets, using the Bayesian optimization algorithm (BO) as a control group and also setting up a blank control group (BCG). The optimal hyper-parameters output by the four models are tested, and the accuracy is shown in Figure 10. The horizontal axis represents the different test files, and the vertical axis represents accuracy.

To analyze the pros and cons of each model more clearly, we extract the median, mean, standard deviation, maximum, and quartiles of the accuracy over the 280 runs of each tuning algorithm. The statistics are shown in Table 2 and visualized in Figure 11. First, according to the performance of BCG, we observe that our datasets are able to discriminate between tuned and untuned models, which further illustrates that the CN and CN + LOPT models are effective. Second, from the comparison of CN and CN + LOPT with BO, our models outperform BO on various indicators. This shows that our model has strong generalization and transfer capabilities.

To gain a deeper understanding of the results on the 280 files, for each file we compute the increment of CN + LOPT accuracy relative to BO accuracy. We count the number of files in each increment interval and draw a pie chart, as shown in Figure 16. The results show that our model is better than the control group BO on 3/4 of the test data.

Additionally, we compare the time overhead of the three algorithms (CN, CN + LOPT, BO) for predicting the hyper-parameters of the 280 test sets, as shown in Figure 12. The horizontal axis again represents the different test files, and the vertical axis represents the running time in log scale. According to the results, our models CN and CN + LOPT outperform the control group BO. In particular, the model CN, which performs no local optimization, accomplishes the task of finding hyper-parameters in just a few seconds. This demonstrates the efficiency of our model and makes it possible to optimize algorithms with ultra-high-dimensional hyper-parameters.

Figure 10: Accuracy for XGBOOST
Figure 11: Statistics for XGBOOST
Figure 12: Time overhead comparison (XGBOOST).
Figure 13: Accuracy for CNN
Figure 14: Statistics for CNN
Figure 15: Run Time for CNN
Figure 16: Accuracy increment of CN + LOPT
CN CN+LOPT BO BCG
max 1 1 1 0.75
q3 0.95 0.95 0.945 0.58
median 0.73 0.74 0.73 0.46
mean 0.7279 0.7371 0.7097 0.4678
sd 0.1833 0.1791 0.2061 0.1510
q1 0.555 0.57 0.49 0.39
min 0.37 0.41 0.14 0
Table 2: Comparison of statistical characteristics (XGBOOST).
CN ZO BCG
max 0.975 0.9625 0.8625
q3 0.925 0.8563 0.4988
median 0.835 0.7763 0.3963
mean 0.8353 0.7680 0.4322
sd 0.0889 0.1136 0.1978
q1 0.76 0.6975 0.2775
min 0.64 0.4625 0.0875
Table 3: Comparison of statistical characteristics (CNN).

CNN We test the effectiveness of our model (CN) on 180 classification datasets, 90 from the MNIST subsets and 90 from the SVHN subsets. The ZOOpt algorithm (ZO) is used as a control group, and a blank control group (BCG) without pre-training is also used to show the effectiveness of the proposed approach. The optimal hyper-parameters generated by these three models are tested on the test set, and the accuracy is shown in Figure 13. The horizontal axis represents the test file, and the vertical axis represents the CNN classification accuracy.

As with XGBOOST, we analyze the statistics in Table 3 and visualize them in Figure 14. First, according to the performance of BCG, we observe that our datasets are able to discriminate between tuned and untuned models, which further illustrates that the CN model is effective for hyper-parameter prediction for CNN. Second, from the comparison of CN and ZO, our model outperforms ZO on various indicators, which shows that our model has strong generalization and transfer capabilities.

In addition to accuracy, we also compare the time overhead of CN and ZO, as shown in Figure 15. The horizontal axis again represents the different test files, and the vertical axis represents the time overhead in log scale. From the comparison, our model CN outperforms the control group ZO significantly, which coincides with the performance of CN on XGBOOST.

4 Related Work

The techniques of AutoML include model selection, automatic hyper-parameter optimization, and automatic construction of neural network structures. Here we focus on automatic hyper-parameter optimization. Bayesian models are used to optimize hyper-parameters [5], and Ref. [1] uses Bayesian algorithms to optimize the hyper-parameters of reinforcement learning. Even though these approaches are effective, their efficiency prevents their application to large datasets or to algorithms with many hyper-parameters. Although some works address the high-dimension problem [17, 19], the efficiency problem remains unsolved. Additionally, such approaches can hardly be transferred.

For the efficiency issue, some approaches have been proposed. An efficient automatic method [14] optimizes the parameters of the SVM kernel by vectorizing the kernel with the sine/cosine algorithm [16]. However, this approach is specific to SVM and does not generalize to other algorithms.

Bandit-based hyper-parameter optimization [15] accelerates random search through adaptive resource allocation. Ref. [3] extracts meta-knowledge from datasets to calculate hyper-parameters, but it neglects the relation between all the datasets run on the same algorithm and their corresponding optimal hyper-parameters. Ref. [9] determines the search direction through analysis of meta-knowledge. These approaches are orthogonal to ours and could be combined with our approach.

5 Conclusion and Future Work

In this paper, we study hyper-parameter optimization for machine learning algorithms. We model the mapping from a dataset to its corresponding optimal hyper-parameters with a neural network, obtaining the optimal hyper-parameters according to this relationship trained from generated datasets and their corresponding optimal hyper-parameters. To achieve a high-quality model, we design a sophisticated network structure together with effective training methods. With such a model, the hyper-parameters can be derived directly from the dataset. To optimize the hyper-parameters further, we also develop local search strategies. Extensive experiments on real datasets show that our approaches achieve high efficiency and effectiveness. In future work, we will design CN structures to optimize more algorithms. At the same time, we will design matching NPEs to adapt to video, audio, and text data, establishing a CN-NPE knowledge base.

References

  • [1] J. C. Barsce, J. A. Palombarini, and E. C. Martínez (2018) Towards autonomous reinforcement learning: automatic setting of hyper-parameters using bayesian optimization. CoRR abs/1805.04748. External Links: Link, 1805.04748 Cited by: §4.
  • [2] P. Fenwick (1996-12) A new data structure for cumulative frequency tables. Software - Practice and Experience 24, pp. . Cited by: §2.4.
  • [3] M. Feurer, J. T. Springenberg, and F. Hutter (2015) Initializing bayesian hyperparameter optimization via meta-learning. pp. 1128–1135. Cited by: §1, §4.
  • [4] P. I. Frazier (2018) A tutorial on bayesian optimization. ArXiv abs/1807.02811. Cited by: §2.3.1.
  • [5] P. Frazier (2018-10) Bayesian optimization. pp. 255–278. External Links: ISBN 978-0-9906153-2-3, Document Cited by: §4.
  • [6] X. Gao, S. Fan, X. Li, Z. Guo, H. Zhang, Y. Peng, and X. Diao (2017-11) An improved xgboost based on weighted column subsampling for object classification. In 2017 4th International Conference on Systems and Informatics (ICSAI), Vol. , pp. 1557–1562. External Links: Document, ISSN null Cited by: §2.2.
  • [7] R. Hamou, A. Amine, and A. Lokbani (2013-04) Study of sensitive parameters of pso application to clustering of texts. International Journal of Applied Evolutionary Computation 4, pp. 41–55. External Links: Document Cited by: §1.
  • [8] S. Hochreiter and J. Schmidhuber (1997-12) Long short-term memory. Neural computation 9, pp. 1735–80. External Links: Document Cited by: §2.2.
  • [9] Y. Hu, Y. Yu, and Z. Zhou (2018-07) Experienced optimization with reusable directional model for hyper-parameter search. pp. 2276–2282. External Links: Document Cited by: §2.2, §4.
  • [10] H. Ichihashi, K. Honda, and A. Notsu (2011-06) Comparison of scaling behavior between fuzzy c-means based classifier with many parameters and libsvm. In 2011 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2011), Vol. , pp. 386–393. External Links: Document, ISSN 1098-7584 Cited by: §1.
  • [11] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. CoRR abs/1502.03167. External Links: Link, 1502.03167 Cited by: §2.2.
  • [12] S. Kanai, Y. Fujiwara, and S. Iwamura (2017) Preventing gradient explosions in gated recurrent units. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 435–444. External Links: Link Cited by: §2.3.2.
  • [13] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel (1989-12) Backpropagation applied to handwritten zip code recognition. Neural Computation 1 (4), pp. 541–551. External Links: Document, ISSN 0899-7667 Cited by: §2.2.
  • [14] C. Li, H. Ho, Y. Liu, C. Lin, B. Kuo, and J. Taur (2012) An automatic method for selecting the parameter of the normalized kernel function to support vector machines. J. Inf. Sci. Eng. 28 (1), pp. 1–15. External Links: Link Cited by: §4.
  • [15] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar (2017-01) Hyperband: a novel bandit-based approach to hyperparameter optimization. J. Mach. Learn. Res. 18 (1), pp. 6765–6816. External Links: ISSN 1532-4435 Cited by: §4.
  • [16] S. Li, H. Fang, and X. Liu (2017-08) Parameter optimization of support vector regression based on sine cosine algorithm. Expert Systems with Applications 91, pp. . External Links: Document Cited by: §4.
  • [17] M. Mutny and A. Krause (2018) Efficient high dimensional bayesian optimization with additivity and quadrature fourier features.. Annual Conference on Neural Information Processing Systems, pp. 9019–9030. Cited by: §4.
  • [18] C. Nwankpa, W. Ijomah, A. Gachagan, and S. Marshall (2018) Activation functions: comparison of trends in practice and research for deep learning. CoRR abs/1811.03378. External Links: Link, 1811.03378 Cited by: §2.2.
  • [19] P. Rolland, J. Scarlett, I. Bogunovic, and V. Cevher (2018-02) High-dimensional bayesian optimization via additive models with overlapping groups. pp. . Cited by: §4.
  • [20] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014-06) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, pp. 1929–1958. Cited by: §2.3.2.
  • [21] H. Tang, M. Lei, Q. Gong, and J. Wang (2019) A bp neural network recommendation algorithm based on cloud model. IEEE Access 7 (), pp. 35898–35907. External Links: Document, ISSN 2169-3536 Cited by: §2.3.2.
  • [22] R. Trinchero, M. Larbi, H. M. Torun, F. G. Canavero, and M. Swaminathan (2019) Machine learning and uncertainty quantification for surrogate models of integrated devices with a large number of parameters. IEEE Access 7 (), pp. 4056–4066. External Links: Document, ISSN 2169-3536 Cited by: §1.
  • [23] T. Wang, S. Sui, and S. Tong (2017-03) Data-based adaptive neural network optimal output feedback control for nonlinear systems with actuator saturation. Neurocomputing 247, pp. . External Links: Document Cited by: §2.4.
  • [24] C. Xu, Y. Dai, R. Lin, and S. Wang (2019) Stacked autoencoder based weak supervision for social image understanding. IEEE Access 7 (), pp. 21777–21786. External Links: Document, ISSN 2169-3536 Cited by: §2.3.1.