Automatic Configuration of Deep Neural Networks with EGO

10/10/2018 ∙ by Bas van Stein, et al.

Designing the architecture for an artificial neural network is a cumbersome task because of the numerous parameters to configure, including activation functions, layer types, and hyper-parameters. With the large number of parameters for most networks nowadays, it is intractable to find a good configuration for a given task by hand. In this paper an Efficient Global Optimization (EGO) algorithm is adapted to automatically optimize and configure convolutional neural network architectures. A configurable neural network architecture based solely on convolutional layers is proposed for the optimization. Without using any knowledge of the target problem and without any data augmentation techniques, it is shown that on several image classification tasks this approach is able to find network architectures that are competitive, in terms of prediction accuracy, with the best hand-crafted ones in the literature. In addition, only a very small training budget (200 evaluations and 10 epochs in training) is spent on each optimized architecture, in contrast to the usual long training time of hand-crafted networks. Moreover, instead of the standard sequential evaluation in EGO, several candidate architectures are proposed and evaluated in parallel, which significantly reduces the execution overhead and leads to an efficient automation of deep neural network design.




1 Introduction

Deep Artificial Neural Networks, and in particular Convolutional Neural Networks (CNN), have demonstrated great performance on a wide range of difficult computer vision, classification and regression tasks. One of the most promising aspects of using deep neural networks is that feature extraction and feature engineering, which so far were mostly done by hand, are now taken care of by the networks themselves. Unfortunately, the design and configuration of artificial neural networks are still done by hand, either by an educated guess, by popularity (using an architecture from previous literature), or by trying a grid of different architectures and parameters and then choosing the best-performing network. Since the number of choices for a network architecture and its parameters can become quite large, an optimal deep neural network for a given problem is very unlikely to be obtained using this hand-crafted procedure.

The challenges in configuring CNNs are: 1) the search space is usually high dimensional and heterogeneous, resulting from a large number of structure choices (e.g., number of layers, layer type, etc.) and real parameters; 2) the computational time becomes the bottleneck when fitting a deep network structure to a relatively large data set. Those difficulties hinder the applicability of traditional nonlinear black-box optimizers, for instance Evolutionary Algorithms (Stanley & Miikkulainen, 2002). Instead, it is proposed here to adopt the so-called Efficient Global Optimization (EGO) algorithm (Močkus, 1975, 2012; Jones et al., 1998) as the network configurator. The standard EGO algorithm is a sequential strategy designed for the expensive evaluation scenario, where a single candidate configuration is provided in each iteration. It is proposed to adapt the EGO algorithm to yield several candidate configurations in each iteration, so that the resulting configurations can be evaluated in parallel.

This paper is organized as follows. In section 2, the related approaches on network configuration are discussed. In section 3, we introduce the All-CNN configuration framework, using only convolutional layers, and the EGO-based configurator is explained in section 4. The proposed method is validated and tested in sections 5 and 6, followed by the demonstration of an application on a real-world problem.

2 Related Research

The optimization of hyper-parameters is a well-known challenge and has been addressed in many works. For example, (Bergstra & Bengio, 2012) shows that randomly chosen trials are more efficient than grid search for hyper-parameter optimization. Obviously, both random and grid search are far from optimal, and more sophisticated methods are required to search the very large and complex search space for optimizing deep artificial neural networks. More recent work of the same author (Bergstra et al., 2013) shows that automatic hyper-parameter tuning can yield state-of-the-art results. In these papers, architectures are used that are known to work on a specific problem and are then fine-tuned by hyper-parameter optimization. Some other sophisticated algorithms to perform parameter tuning and automated machine learning configuration are Bayesian Optimization (Snoek et al., 2012; Jones et al., 1998), Evolutionary Algorithms (Loshchilov & Hutter, 2016) and SMAC (Hutter et al., 2011a), which try to quickly converge to practical, well-performing hyper-parameters for a given machine learning algorithm.

Unfortunately, even with these sophisticated algorithms, optimization of the deep neural network architecture itself, together with its hyper-parameters, is a very challenging task. This is caused by the time complexity and computational effort that is required to train these networks, in combination with the size of the search space of hyper-parameters for such networks. Automatically optimizing the structure of an artificial neural network is not an entirely new idea though: as early as 1989 (Miller et al., 1989), genetic algorithms were proposed to optimize the links between a predefined number of nodes. A bit later, an evolutionary program (GNARL) was proposed to evolve the structure of recurrent neural networks (Angeline et al., 1994). In another, more recent work (Ritchie et al., 2003), Genetic Programming (GP) is used for the automatic construction of neural network architectures.

One of the main bottlenecks with the already proposed methods though is that a single evaluation of an artificial neural network can take several hours, on a modern GPU system. This makes it infeasible to apply these algorithms with a large evaluation budget or on a large problem instance. Unfortunately, these algorithms usually require a large evaluation budget to find well performing network configurations for a specific problem. Another challenge is to define a bounded search space that still covers most of the possibilities in order to find the optimum. When dealing with neural network structures this is far from simple. The number of layers for example could be a problematic parameter to vary, since each layer comes with its own set of hyper-parameters.

To alleviate this problem, a generic configurable deep neural network architecture is proposed in this paper. This architecture is highly configurable with a large number of parameters and can represent very shallow to very deep convolutional neural networks. The configurable architecture has a fixed number of hyper-parameters and is therefore very suitable for optimization. To tackle this optimization task, the well-known Efficient Global Optimization algorithm (Močkus, 1975, 2012; Jones et al., 1998) is adopted with several important improvements, enabling the parallel training of different network candidates. The main advantages of the proposed approach are:

  1. Small optimization time: it requires far fewer real evaluations (training of candidate networks) than other approaches.

  2. Parallelism: several candidate networks are suggested in each iteration, facilitating parallel execution over multiple GPUs.

3 A Configurable All-Convolutional Neural Network

In order to optimize the structure and hyper-parameters of a deep neural network, a few modeling decisions are required to set the boundaries of the search space. The complexity of the search space is mostly due to a large number of different layer types, activation functions and regularization methods, each coming with their own set of hyper-parameters.

In order to reduce the complexity of the search space without making too many modeling assumptions, a generic configurable convolutional neural network designed for any image classification problem is proposed here.

According to (Springenberg et al., 2014), using only convolutional layers can give the same or better performance than the commonly used structure of convolutional layers followed by a pooling layer. Therefore, for our generic configurable network structure, we have chosen to use only convolutional layers, with the exception of the final layer.

The configurable network architecture is shown in Table 1, where each of the stacks has an architecture as shown in Figure 1. The network consists of several of these stacks, each of which consists of a number of convolutional layers, a convolutional layer with strides (Conv2D-Out) that takes over the role of pooling, and a dropout layer. The last part of the network optionally applies global average pooling and ends in a dense layer whose size equals the number of classes one wants to predict. Each stack has independent configurable parameters as well as shared parameters that can be optimized. The convolutional layers in a stack are parameterized by the number of filters, the kernel size, the regularization factor for the weights and, for the Conv2D-Out layer, the strides. The activation function is configurable, and every dropout layer has its own dropout probability. The last dense layer also has configurable parameters of its own. The size of each stack (its number of layers) is configurable as well, which allows for very shallow to very deep neural network architectures. All hyper-parameters that are not part of the configuration are set to the values recommended in the literature, and the padding of each convolutional layer is set to ‘same’ in order to avoid negative dimensions.

Figure 1: Schematic diagram of the stack structure.

Layer Type Parameters
GlobalPooling boolean
Table 1: Generic configurable All-CNN structure with stacks and configurable parameters per layer. Per-stack parameters are indexed by the current stack.
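As a rough illustration of the stack structure described above, the following sketch expands a configuration into an ordered list of layer descriptions. It is a simplified reading of the text, not the authors' implementation: the names `StackConfig` and `build_layer_specs` are hypothetical, the head is reduced to optional global pooling plus a dense layer, and both the `Flatten` fallback and the softmax output activation are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

LayerSpec = Tuple[str, Dict[str, Any]]


@dataclass
class StackConfig:
    n_layers: int         # number of plain convolutional layers in the stack
    filters: int          # filters of the plain convolutional layers
    kernel_size: int      # kernel size of the plain convolutional layers
    out_filters: int      # filters of the strided Conv2D-Out layer
    out_kernel_size: int  # kernel size of the strided Conv2D-Out layer
    stride: int           # stride of Conv2D-Out (takes over the role of pooling)
    dropout: float        # dropout probability at the end of the stack


def build_layer_specs(stacks: List[StackConfig], activation: str, l2: float,
                      global_pooling: bool, n_classes: int) -> List[LayerSpec]:
    """Expand a configuration into an ordered list of (layer type, parameters)."""
    layers: List[LayerSpec] = []
    for stack in stacks:
        for _ in range(stack.n_layers):
            layers.append(("Conv2D", {
                "filters": stack.filters, "kernel_size": stack.kernel_size,
                "strides": 1, "padding": "same",
                "activation": activation, "l2": l2}))
        # Strided convolution replaces the pooling layer.
        layers.append(("Conv2D-Out", {
            "filters": stack.out_filters, "kernel_size": stack.out_kernel_size,
            "strides": stack.stride, "padding": "same",
            "activation": activation, "l2": l2}))
        layers.append(("Dropout", {"rate": stack.dropout}))
    # Assumption: without global pooling the feature maps are flattened.
    layers.append(("GlobalAveragePooling2D", {}) if global_pooling else ("Flatten", {}))
    # Assumption: softmax output; the dense layer has one unit per class.
    layers.append(("Dense", {"units": n_classes, "activation": "softmax"}))
    return layers
```

Because every stack contributes a fixed set of parameters, the total number of hyper-parameters stays constant even though the depth of the resulting network varies with `n_layers`.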

Next to the parameters of the configurable network itself, there are the learning rate and the decay rate for the back-propagation optimizer. Depending on available resources and the classification task at hand, the ranges of the parameters can be determined by the user. For this paper, the ranges can be found in Table 2. The optimizer used for back-propagation is the well-known stochastic gradient descent (SGD), provided by the Keras (Chollet et al., 2015) Python library.

Parameter Range
Activation function {elu, relu, tanh, selu, sigmoid}
Table 2: Ranges of the search space dimensions

4 Efficient Global Optimization based Configurator

The search space of the All-CNN framework is heterogeneous and high dimensional. For the integer parameters, in the case of three stacks, there are seven for the number of filters, seven for the kernel size, three for the strides and three for the number of layers in the stack, and thus 20 in total. For the discrete parameters, there are two for the activation functions of the stack and the head, and for the real parameters, there are four for the dropout rate, one for the regularization and one for the learning rate. In addition, we have one boolean variable to control the global pooling. Therefore, this search space is the Cartesian product of 20 integer dimensions, two discrete dimensions over the set of activation functions, six real dimensions and one boolean dimension.

The convolutional neural network can be instantiated by drawing a sample from this search space. Given a data set, the problem is then to find the optimal configuration with respect to a pre-defined, real-valued performance metric of the neural network (for instance, a suitable error measure for regression tasks and precision for classification problems). In the following discussion, it is assumed without loss of generality that the performance metric is subject to minimization (a maximization problem can be easily converted). The challenge in optimizing the performance metric is the evaluation time of the metric itself, which will be extremely expensive when training a large network structure on a huge data set. Consequently, it is recommended to use efficient optimization algorithms that save as many evaluations as possible. The Efficient Global Optimization (EGO) algorithm (Močkus, 1975, 2012; Jones et al., 1998) is a suitable candidate algorithm for this task. It is a sequential optimization strategy that does not require the derivatives of the objective function and is designed to tackle expensive global optimization problems. Compared to alternative optimization algorithms (or other design of experiment methods), the distinctive feature of this method is the usage of a meta-model, which gives the predictive distribution over the (partially) unknown function.
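To make the structure of this mixed search space concrete, the sketch below draws one random configuration for three stacks. Only the activation set comes from Table 2; all numeric ranges, the head activation choices, the key names and the function `sample_configuration` are hypothetical placeholders, and the generalization of the filter and kernel counts to `2 * n_stacks + 1` is an assumption that merely reproduces the count of seven for three stacks.

```python
import random

ACTIVATIONS = ["elu", "relu", "tanh", "selu", "sigmoid"]  # from Table 2
HEAD_ACTIVATIONS = ["softmax", "sigmoid"]                 # hypothetical


def sample_configuration(rng: random.Random, n_stacks: int = 3) -> dict:
    """Draw one configuration; every numeric range below is a placeholder."""
    cfg = {}
    # Integer dimensions: filters and kernel sizes (two per stack plus the head),
    # strides and number of layers (one per stack).
    for i in range(2 * n_stacks + 1):
        cfg[f"filters_{i}"] = rng.randint(8, 512)
        cfg[f"kernel_{i}"] = rng.randint(1, 7)
    for i in range(n_stacks):
        cfg[f"stride_{i}"] = rng.randint(1, 3)
        cfg[f"layers_{i}"] = rng.randint(1, 4)
    # Discrete dimensions: activation of the stacks and of the head.
    cfg["activation"] = rng.choice(ACTIVATIONS)
    cfg["head_activation"] = rng.choice(HEAD_ACTIVATIONS)
    # Real dimensions: one dropout rate per stack plus the head,
    # the weight regularization factor and the learning rate.
    for i in range(n_stacks + 1):
        cfg[f"dropout_{i}"] = rng.uniform(0.0, 0.8)
    cfg["l2"] = rng.uniform(1e-5, 1e-2)
    cfg["learning_rate"] = rng.uniform(1e-3, 1e-1)
    # Boolean dimension: global pooling on or off.
    cfg["global_pooling"] = rng.choice([True, False])
    return cfg
```

With three stacks this yields 20 integer, 2 discrete, 6 real and 1 boolean value, matching the dimension counts given in the text.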

Briefly, this optimization method iteratively proposes new candidate configurations over the meta-model, taking both the prediction and model uncertainty into account. After the evaluation of the new candidate configurations, the meta-model will be re-trained.

4.1 Initial Design and Meta-modeling

To construct the meta-model, some initial samples in the configuration space are generated via Latin hypercube sampling (LHS) (McKay et al., 1979). The corresponding performance metric values are obtained by instantiating the network and validating its performance on the data set. Note that the evaluation of the initial designs can be easily parallelized. For the choice of meta-model, although Gaussian process regression (Sacks et al., 1989; Santner et al., 2003) (referred to as Kriging in geostatistics (Krige, 1951)) is frequently used in EGO, we adopt the random forest instead, because it is more suitable for a mixed-integer configuration domain (Hutter et al., 2011b). In addition to the prediction for a configuration, the empirical variance of the prediction is also calculated from the forest, which quantifies the prediction uncertainty.
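The initial design step can be sketched with a plain Latin hypercube sampler on the unit cube, followed by a mapping to integer ranges. This is a minimal illustration using only the standard library; the function names `latin_hypercube` and `scale_to_int_range` are hypothetical, and how the paper maps samples to discrete and boolean dimensions is not stated in the text.

```python
import random
from typing import List


def latin_hypercube(n_samples: int, n_dims: int, rng: random.Random) -> List[List[float]]:
    """Return n_samples points in [0, 1)^n_dims with one point per stratum per dimension."""
    columns = []
    for _ in range(n_dims):
        # One value inside each of the n_samples equal-width strata ...
        column = [(k + rng.random()) / n_samples for k in range(n_samples)]
        # ... assigned to the samples in random order.
        rng.shuffle(column)
        columns.append(column)
    return [[columns[d][i] for d in range(n_dims)] for i in range(n_samples)]


def scale_to_int_range(u: float, low: int, high: int) -> int:
    """Map u in [0, 1) to an integer in [low, high] (inclusive)."""
    return low + int(u * (high - low + 1))
```

Compared with plain uniform sampling, each one-dimensional projection of the design covers its range evenly, which is useful when only a handful of expensive evaluations are available for the initial meta-model.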

4.2 Infill-Criterion

To propose potentially good configurations in each iteration, a so-called infill-criterion is used to quantify the quality of candidate configurations. Informally, infill-criteria balance the values predicted by the meta-model against the prediction uncertainty. Over the last decades, considerable research effort has been put into exploring various infill-criteria, e.g., Expected Improvement (Močkus, 1975; Jones et al., 1998), Probability of Improvement (Jones, 2001; Žilinskas, 1992) and Upper Confidence Bound (Auer, 2002; Srinivas et al., 2010). In this contribution, we adopt the so-called Moment-Generating Function (MGF) based infill-criterion, as proposed in (Wang et al., 2017). This infill-criterion allows for explicitly balancing exploitation and exploration. It has a closed form (see Wang et al., 2017) that is expressed in terms of the prediction and its uncertainty, the current best performance over all the evaluated configurations, and the cumulative distribution function of the standard normal distribution.

The infill-criterion introduces an additional real parameter (“temperature”) to explicitly control the balance between exploration and exploitation. As explained in (Wang et al., 2017), when the temperature goes up, the criterion tends to reward configurations with high uncertainty. Conversely, when the temperature is decreased, the criterion puts more weight on the predicted performance value. It is then possible to set the temperature according to the budget of the configuration task: with a larger budget of function evaluations, the temperature can be set to a relatively high value, leading to a slow but global search process, and vice versa for a smaller budget.
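The effect of the temperature can be illustrated with a small sketch. It assumes a Gaussian prediction and computes the expectation of exp(t · improvement) restricted to improving outcomes, which follows from the moment-generating function of the normal distribution. This is an MGF-style criterion written down from that derivation; the exact form used in (Wang et al., 2017) may differ, for example by normalizing terms, and the names `normal_cdf` and `mgf_infill` are hypothetical.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def mgf_infill(mean: float, std: float, f_min: float, t: float) -> float:
    """E[exp(t * (f_min - Y)); Y < f_min] for Y ~ N(mean, std^2), under minimization.

    mean, std: prediction and its uncertainty from the meta-model
    f_min:     best performance value observed so far
    t:         temperature (t = 0 gives the probability of improvement)
    """
    if std <= 0.0:
        # Degenerate prediction without uncertainty.
        return math.exp(t * (f_min - mean)) if mean < f_min else 0.0
    shifted_mean = mean - std * std * t  # the temperature shifts the mean downwards
    return normal_cdf((f_min - shifted_mean) / std) * math.exp(
        (f_min - mean) * t + 0.5 * (std * t) ** 2)
```

For example, with `f_min = 0.12`, a confident candidate (`mean=0.10, std=0.01`) scores higher than an uncertain one (`mean=0.30, std=0.30`) at `t = 1`, while the ordering flips at `t = 10`, which mirrors the exploitation and exploration behavior described above.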

4.3 Parallel execution

Due to the typically large execution time of instantiated network structures, it is also proposed here to parallelize the execution. This requires generating more than one candidate configuration in each iteration. Many methods have been developed for this purpose, including multi-point Expected Improvement (Ginsbourger et al., 2010) and Niching techniques (Wang et al., 2018). Here, we adopt the approach in (Hutter et al., 2012), where several different temperatures are sampled from a log-normal distribution and a separate criterion is instantiated for each temperature. Consequently, one candidate configuration per temperature is obtained by maximizing these infill-criteria. On the one hand, as the log-normal is a long-tailed distribution, most of the sampled temperatures are relatively small and thus the model prediction is well exploited. On the other hand, a few samples will be relatively high and will therefore lead to very explorative search behavior.
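The temperature sampling can be sketched with the standard library. The parameters of the log-normal distribution are not recoverable from the text, so `mu=0.0` and `sigma=1.0` below are placeholder assumptions, as is the function name `sample_temperatures`.

```python
import math
import random
from typing import List


def sample_temperatures(q: int, rng: random.Random,
                        mu: float = 0.0, sigma: float = 1.0) -> List[float]:
    """Draw one temperature per parallel candidate from a log-normal distribution.

    mu and sigma are placeholder values, not the settings used in the paper.
    """
    return [rng.lognormvariate(mu, sigma) for _ in range(q)]


if __name__ == "__main__":
    rng = random.Random(0)
    temperatures = sample_temperatures(10_000, rng)
    mean_of_distribution = math.exp(0.5)  # mean of the log-normal with mu=0, sigma=1
    share_small = sum(t < mean_of_distribution for t in temperatures) / len(temperatures)
    # Most draws lie below the mean, while a few large draws form the long tail.
    print(f"share below the mean: {share_small:.2f}, largest draw: {max(temperatures):.1f}")
```

Each sampled temperature then parameterizes its own infill-criterion, so that a batch mixes mostly exploitative proposals with an occasional strongly explorative one.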

To maximize the infill-criterion on the mixed-integer search domain, we adopt the so-called Mixed-Integer Evolution Strategy (MIES) (Li et al., 2013). The proposed Bayesian configurator is summarized in Algorithm 1.
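The idea of searching a mixed-integer domain by mutation can be sketched as follows. This is deliberately not the full MIES of (Li et al., 2013), which uses self-adaptive step sizes and a population; it is a simplified (1+1) strategy with fixed mutation strengths, and the names `mutate` and `maximize` as well as the search space encoding are hypothetical.

```python
import random
from typing import Callable, Dict, Tuple


def mutate(config: Dict, space: Dict, rng: random.Random,
           sigma: float = 0.1, p_cat: float = 0.2) -> Dict:
    """Return a mutated copy of a mixed-integer configuration."""
    child = dict(config)
    for name, spec in space.items():
        kind = spec[0]
        if kind == "real":
            _, low, high = spec
            value = config[name] + rng.gauss(0.0, sigma * (high - low))
            child[name] = min(max(value, low), high)   # clip to the bounds
        elif kind == "int":
            _, low, high = spec
            step = rng.choice([-1, 0, 1])              # small integer step
            child[name] = min(max(config[name] + step, low), high)
        else:                                          # categorical
            _, choices = spec
            if rng.random() < p_cat:
                child[name] = rng.choice(choices)
    return child


def maximize(criterion: Callable[[Dict], float], space: Dict, start: Dict,
             rng: random.Random, n_steps: int = 200) -> Tuple[Dict, float]:
    """(1+1) strategy: keep the mutated candidate whenever it is not worse."""
    best, best_value = dict(start), criterion(start)
    for _ in range(n_steps):
        candidate = mutate(best, space, rng)
        value = criterion(candidate)
        if value >= best_value:
            best, best_value = candidate, value
    return best, best_value
```

In the configurator, `criterion` would be the infill-criterion evaluated on the meta-model, so each call is cheap compared with training a network.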

Generate the initial design using LHS.
Construct the initial random forest on the evaluated initial design.
while the stopping criterion is not fulfilled do
   for each of the parallel candidates do
      Sample a temperature from the log-normal distribution.
      Maximize the infill-criterion for this temperature using the Mixed-Integer Evolution Strategy.
   end for
   Train all new candidate configurations in parallel and assess their performance.
   Append the new configurations and their performance values to the data set.
   Re-train the random forest model on the augmented data set.
end while
Algorithm 1 EGO Configurator
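The overall loop of Algorithm 1 can be sketched on a toy problem. Everything here is a stand-in chosen to keep the example self-contained: the objective is a cheap quadratic instead of network training, the initial design is uniform random instead of LHS, the meta-model is a bootstrap ensemble of nearest-neighbour predictors instead of a random forest, the infill-criterion is an MGF-style expression assumed from the Gaussian moment-generating function, and the criterion is maximized by random candidate search instead of MIES. All names are hypothetical.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def objective(x):
    # Toy stand-in for "train the network and return its validation error".
    return (x[0] - 0.3) ** 2 + (x[1] - 0.7) ** 2


def fit_ensemble(X, y, rng, n_members=10):
    # Stand-in for the random forest: each member is a bootstrap resample.
    n = len(X)
    return [[(X[i], y[i]) for i in (rng.randrange(n) for _ in range(n))]
            for _ in range(n_members)]


def predict(ensemble, x):
    # Each member predicts the value of its nearest stored point;
    # mean and spread over the members give prediction and uncertainty.
    preds = [min(member, key=lambda p: math.dist(p[0], x))[1] for member in ensemble]
    mean = sum(preds) / len(preds)
    var = sum((p - mean) ** 2 for p in preds) / len(preds)
    return mean, math.sqrt(var)


def infill(mean, std, f_min, t):
    # MGF-style criterion (assumed form); larger t rewards uncertainty more.
    if std == 0.0:
        return math.exp(t * (f_min - mean)) if mean < f_min else 0.0
    cdf = 0.5 * (1.0 + math.erf((f_min - mean + std * std * t) / (std * math.sqrt(2.0))))
    return cdf * math.exp(min((f_min - mean) * t + 0.5 * (std * t) ** 2, 700.0))


def ego(n_init=8, q=4, n_iter=5, n_candidates=100, seed=0):
    rng = random.Random(seed)
    X = [[rng.random(), rng.random()] for _ in range(n_init)]
    with ThreadPoolExecutor(max_workers=q) as executor:
        y = list(executor.map(objective, X))               # parallel initial evaluation
        for _ in range(n_iter):
            ensemble = fit_ensemble(X, y, rng)             # (re-)train the meta-model
            f_min = min(y)
            batch = []
            for _ in range(q):
                t = rng.lognormvariate(0.0, 1.0)           # one temperature per candidate
                candidates = [[rng.random(), rng.random()] for _ in range(n_candidates)]
                batch.append(max(
                    candidates,
                    key=lambda x: infill(*predict(ensemble, x), f_min, t)))
            X.extend(batch)
            y.extend(executor.map(objective, batch))       # parallel evaluation of the batch
    best = min(range(len(y)), key=y.__getitem__)
    return X[best], y[best], len(y)
```

A call such as `ego()` performs `n_init + n_iter * q` evaluations in total; with real network training, each call to `objective` would run on its own GPU.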

5 Experiments

To test our algorithm, two very popular and common classification tasks have been performed using the proposed configurator and a configurable network with three stacks. These are the MNIST dataset (LeCun et al., 1998), containing 60,000 training samples and a test set of 10,000 examples, all 28x28 greyscale images, and the CIFAR-10 dataset (Krizhevsky & Hinton, 2009), containing 60,000 32x32 colour images in 10 classes, with 6,000 images per class. In this case, there are 50,000 training images and 10,000 test images.

In the optimization procedure of the neural network on the MNIST dataset, each evaluation is run for only a small number of epochs with a fixed batch size. For the CIFAR-10 dataset, the number of epochs is increased, which is still much less than the number of epochs in most recent literature. An early stopping criterion is used to stop the evaluation of a particular configuration after a fixed number of epochs without improvement. No data augmentation is used.
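The early stopping rule can be sketched as a simple patience counter over per-epoch scores. The patience value used in the paper is not recoverable from the text, so it is left as a parameter; the function name `run_with_early_stopping` is hypothetical, and the scores stand in for the accuracy measured after each training epoch.

```python
from typing import Iterable, Tuple


def run_with_early_stopping(epoch_scores: Iterable[float], patience: int) -> Tuple[float, int]:
    """Consume per-epoch scores (higher is better) until `patience` epochs pass
    without improvement. Returns (best score, number of epochs actually run)."""
    best = float("-inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for score in epoch_scores:
        epochs_run += 1
        if score > best:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop this evaluation early
    return best, epochs_run
```

For the scores `[0.90, 0.95, 0.94, 0.95, 0.93, 0.99]` and a patience of 3, the run stops after epoch 5 with a best score of 0.95; the later improvement to 0.99 is never seen, which is the price paid for the shorter evaluation time.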

The Bayesian mixed-integer configurator is set to evaluate several network configurations per step in parallel on NVIDIA K80 GPUs, where the first steps are used for the initial LHS design. The test set accuracy is returned after each evaluation as the fitness value for the optimizer.

6 MNIST and CIFAR-10 Results

In Figure 2 the results of the automatic configuration of the All-CNN networks are shown. In both cases, a well-performing network configuration is obtained within the evaluation budget. Both classification tasks used exactly the same initial configuration; the only difference is the number of epochs for each network evaluation.

(a) MNIST
(b) CIFAR-10
Figure 2: a) The left plot shows the optimization run on MNIST, plotting the test accuracy (black dots) of each evaluation. The moving average is depicted by the green line. b) The right plot shows similar results on the CIFAR-10 classification task, which uses more epochs per evaluation.
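The smoothed curve in such a plot can be computed with a trailing moving average over the sequence of test accuracies. The window size used for Figure 2 is not recoverable from the text, so it is a parameter here, and the function name `moving_average` is hypothetical.

```python
from typing import List, Sequence


def moving_average(values: Sequence[float], window: int) -> List[float]:
    """Trailing moving average; the first entries average over fewer values."""
    averages = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            running_sum -= values[i - window]  # drop the value leaving the window
        averages.append(running_sum / min(i + 1, window))
    return averages
```

For example, `moving_average([1, 2, 3, 4, 5], 2)` returns `[1.0, 1.5, 2.5, 3.5, 4.5]`. Applied to the per-evaluation accuracies, an upward trend of this curve indicates that the configurator proposes better configurations on average as the meta-model improves.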

The best-performing configurations compete with the state of the art, as shown in Table 3 and Table 4, and can possibly be improved when trained for more epochs. It should be noted that the number of epochs used to obtain these results is significantly lower than the number of epochs in state-of-the-art solutions from the literature. The advantage of such a small number of epochs is that it speeds up the entire optimization process. The idea behind this is that well-performing configurations can be tuned with a larger number of epochs in a second optimization step, most likely resulting in increased performance. Using the automatic configurator, we obtained neural network architectures that compete with state-of-the-art results using only a small number of epochs per evaluation, without any manual tuning, reconfiguring or upfront knowledge of the specific problem instances. Hand-crafted network configurations, in contrast, are not only trained using many more epochs for the final reported architecture, but also require a huge amount of time to be constructed by reconfiguring and fine-tuning the architecture. Therefore, the hand-crafted networks effectively use many more epochs until the final architecture is reached.

Test error Algorithm Epochs
(Ciregan et al., 2012)
(Graham, 2014)
Optimized All-CNN
(Yang et al., 2015) unknown
Table 3: MNIST performance from literature.

Accuracy Algorithm Epochs
(Graham, 2014)
(Springenberg et al., 2014)
Optimized All-CNN
(Zeiler & Fergus, 2013)
Table 4: CIFAR-10 performance from literature.

7 Real World Problem: Tata Steel

The proposed algorithm is applied on the real world problem of classifying defects during the hot rolling process of steel. This industrial process is very complex with many conditions and parameters that influence the final product. It is also a process that changes over the years and requires dealing with concept shift and concept drift. One of the main objectives for Tata Steel is to automatically classify and predict surface defects using material properties and machine parameters as input data. To achieve this objective, first, a deep neural network architecture is designed by hand to classify these defects.

The Tata Steel data set consists of various material measurements and machine parameters. Most of the measurements are taken over the complete length of each coil but not over the width of the coil (since the width is much smaller). However, the temperature measurements are taken over several tracks in the width of the coil as well. Due to this spatial difference, it was decided to design two concatenated network architectures. One part of the architecture is based purely on the temperature data, allowing for the application of convolution layers in the width and length direction of the coil. The second component is used for modeling the remaining measurements and machine parameters, where the convolution filters only work in the length direction of the coil. At the end of the architecture, these two parts are merged into one final fully-connected output layer.

The initial design process of these architectures was mainly based on trial and error and recommendations from the literature. The design process started with a small, relatively simple two-layer multilayer perceptron, to which additional dense and convolution layers were added in order to increase the final accuracy. Dropout was applied to prevent over-fitting, and the dropout rate that seemed to work best was selected after several manual iterations.

Next, we applied a slightly modified version of the proposed configurable all-CNN network (with a separate stack for the temperature data before concatenating it to the main model) and automatically optimized the configuration. The optimal configuration obtained by using our optimization procedure significantly improves the classification accuracy. It also allows for easy retraining and validation on future data, since almost zero knowledge of the actual dataset is required to train and optimize the network architecture.

(a) ROC of hand-constructed classifier.
(b) ROC of optimized All-CNN.
Figure 3: a) The left plot shows the ROC curve of the classification task for Tata Steel with a hand-crafted neural network architecture. b) The right plot shows the ROC curve for the same task, but now using the optimized neural network architecture, obtained with a limited number of evaluations for the optimization and only a few epochs per evaluation.

The test set performance of the hand-designed classifier and the optimized classifier for this real-world application is shown in Figure 3. It can be observed that the optimized classifier has a significantly improved accuracy on this specific defect type, with a high true positive rate at only a very small false positive rate. This shows that the optimization procedure and the configurable network architecture have great potential for industrial applications.
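The quantities on the axes of an ROC curve can be computed directly from classifier scores and binary labels. The sketch below is a plain-Python illustration with hypothetical function names (`roc_points`, `area_under_curve`); it assumes that both classes occur in the labels and it does not reproduce the Tata Steel data, which is not available in the text.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) pairs, obtained by
    thresholding at every distinct score from high to low. Labels are 0 or 1."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        true_pos = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
        false_pos = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        points.append((false_pos / negatives, true_pos / positives))
    return points


def area_under_curve(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under a curve given as points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A classifier that ranks all defects above all non-defects reaches a true positive rate of 1 at a false positive rate of 0, giving an area of 1.0; the closer a curve runs to that corner, the better the classifier separates the two classes.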

8 Conclusions and Outlook

A novel approach based on the Efficient Global Optimization algorithm is proposed to automatically configure neural network architectures. On some well-known image classification tasks, it is observed that the proposed optimization approach is capable of generating well-performing networks with a limited number of optimization iterations. In addition, the resulting optimized neural networks are also highly competitive with the state-of-the-art manually designed ones on the MNIST and CIFAR-10 classification tasks. Note that this performance of the optimized networks is achieved with a very small number of training epochs for both MNIST and CIFAR-10, without any knowledge of the classification task and without data augmentation techniques.

As for the real-world problem, we have applied the proposed approach to the challenge of steel surface defect detection. The outcome illustrates that the proposed configuration approach also works well in this setting. The accuracy of the optimized network that detects the surface defect for Tata Steel is significantly higher than the accuracy of the network designed by hand, which was obtained with manual fine-tuning.

For the next step, there are several possibilities to work on. First, the proposed approach will be applied and tested on various modeling tasks and real-world problems. Second, the actual training time of the candidate network will be taken into account explicitly. The trade-off between training time and accuracy can be controlled by optimizing the epochs and batch size. Additionally, it is also interesting to formulate this as a bi-criteria decision making problem, with one objective being the accuracy of the network and the other objective the training time required. Third, we will investigate how to extend the current configurable network that has a linear topology, to more general topological structures. In this case, it will be very challenging to search efficiently in the complex configuration space with multiple dependencies.
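The bi-criteria formulation mentioned above can be illustrated with a small non-dominated filter over (error, training time) pairs, where both objectives are minimized. This only sketches the notion of Pareto optimality; it is not part of the proposed method, and the function name `pareto_front` and the example values are hypothetical.

```python
from typing import List, Sequence, Tuple


def pareto_front(candidates: Sequence[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Keep the candidates that no other candidate dominates.

    Each candidate is (error, training_time); both are minimized. A candidate is
    dominated if another one is at least as good in both objectives and differs.
    """
    front = []
    for c in candidates:
        dominated = any(o[0] <= c[0] and o[1] <= c[1] and o != c for o in candidates)
        if not dominated:
            front.append(c)
    return front
```

For the made-up candidates `[(0.02, 100), (0.03, 50), (0.03, 120), (0.05, 40), (0.06, 60)]`, the front is `[(0.02, 100), (0.03, 50), (0.05, 40)]`: each remaining network is either more accurate or faster to train than every other one on the front.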


The authors acknowledge support by NWO (Netherlands Organization for Scientific Research) PROMIMOOC project (project number: 650.002.001).


  • Angeline et al. (1994) Angeline, Peter J, Saunders, Gregory M, and Pollack, Jordan B. An evolutionary algorithm that constructs recurrent neural networks. IEEE transactions on Neural Networks, 5(1):54–65, 1994.
  • Auer (2002) Auer, Peter. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
  • Bergstra & Bengio (2012) Bergstra, James and Bengio, Yoshua. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
  • Bergstra et al. (2013) Bergstra, James, Yamins, Daniel, and Cox, David. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International Conference on Machine Learning, pp. 115–123, 2013.
  • Chollet et al. (2015) Chollet, François et al. Keras, 2015.
  • Ciregan et al. (2012) Ciregan, Dan, Meier, Ueli, and Schmidhuber, Jürgen. Multi-column deep neural networks for image classification. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 3642–3649. IEEE, 2012.
  • Ginsbourger et al. (2010) Ginsbourger, David, Le Riche, Rodolphe, and Carraro, Laurent. Kriging Is Well-Suited to Parallelize Optimization, pp. 131–162. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010. ISBN 978-3-642-10701-6. doi: 10.1007/978-3-642-10701-6_6.
  • Graham (2014) Graham, Benjamin. Fractional max-pooling. arXiv preprint arXiv:1412.6071, 2014.
  • Hutter et al. (2011a) Hutter, Frank, Hoos, Holger H., and Leyton-Brown, Kevin. Sequential Model-Based Optimization for General Algorithm Configuration, pp. 507–523. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011a. ISBN 978-3-642-25566-3. doi: 10.1007/978-3-642-25566-3_40.
  • Hutter et al. (2011b) Hutter, Frank, Hoos, Holger H, and Leyton-Brown, Kevin. Sequential model-based optimization for general algorithm configuration. LION, 5:507–523, 2011b.
  • Hutter et al. (2012) Hutter, Frank, Hoos, Holger, and Leyton-Brown, Kevin. Parallel algorithm configuration. Learning and Intelligent Optimization, pp. 55–70, 2012.
  • Jones (2001) Jones, Donald R. A taxonomy of global optimization methods based on response surfaces. Journal of global optimization, 21(4):345–383, 2001.
  • Jones et al. (1998) Jones, Donald R, Schonlau, Matthias, and Welch, William J. Efficient global optimization of expensive black-box functions. Journal of Global optimization, 13(4):455–492, 1998.
  • Krige (1951) Krige, Daniel G. A Statistical Approach to Some Basic Mine Valuation Problems on the Witwatersrand. Journal of the Chemical, Metallurgical and Mining Society of South Africa, 52(6):119–139, December 1951.
  • Krizhevsky & Hinton (2009) Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images. 2009.
  • LeCun et al. (1998) LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Li et al. (2013) Li, Rui, Emmerich, Michael TM, Eggermont, Jeroen, Bäck, Thomas, Schütz, Martin, Dijkstra, Jouke, and Reiber, Johan HC. Mixed integer evolution strategies for parameter optimization. Evolutionary computation, 21(1):29–64, 2013.
  • Loshchilov & Hutter (2016) Loshchilov, Ilya and Hutter, Frank. CMA-ES for hyperparameter optimization of deep neural networks. arXiv preprint arXiv:1604.07269, 2016.
  • McKay et al. (1979) McKay, M. D., Beckman, R. J., and Conover, W. J. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics, 21(2):239–245, 1979. ISSN 00401706.
  • Miller et al. (1989) Miller, Geoffrey F, Todd, Peter M, and Hegde, Shailesh U. Designing neural networks using genetic algorithms. In ICGA, volume 89, pp. 379–384, 1989.
  • Močkus (1975) Močkus, J. On bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pp. 400–404. Springer, 1975.
  • Močkus (2012) Močkus, Jonas. Bayesian approach to global optimization: theory and applications, volume 37. Springer Science & Business Media, 2012.
  • Ritchie et al. (2003) Ritchie, Marylyn D, White, Bill C, Parker, Joel S, Hahn, Lance W, and Moore, Jason H. Optimization of neural network architecture using genetic programming improves detection and modeling of gene-gene interactions in studies of human diseases. BMC bioinformatics, 4(1):28, 2003.
  • Sacks et al. (1989) Sacks, Jerome, Welch, William J., Mitchell, Toby J., and Wynn, Henry P. Design and Analysis of Computer Experiments. Statistical Science, 4(4):409–423, 1989.
  • Santner et al. (2003) Santner, T.J., Williams, B.J., and Notz, W. The Design and Analysis of Computer Experiments. Springer Series in Statistics. Springer, 2003. ISBN 9780387954202.
  • Snoek et al. (2012) Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959, 2012.
  • Springenberg et al. (2014) Springenberg, Jost Tobias, Dosovitskiy, Alexey, Brox, Thomas, and Riedmiller, Martin. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
  • Srinivas et al. (2010) Srinivas, Niranjan, Krause, Andreas, Kakade, Sham M., and Seeger, Matthias. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pp. 1015–1022, 2010. ISSN 00189448. doi: 10.1109/TIT.2011.2182033.
  • Stanley & Miikkulainen (2002) Stanley, Kenneth O and Miikkulainen, Risto. Evolving neural networks through augmenting topologies. Evolutionary computation, 10(2):99–127, 2002.
  • Wang et al. (2017) Wang, H., van Stein, B., Emmerich, M., and Bäck, T. A new acquisition function for bayesian optimization based on the moment-generating function. In 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 507–512, Oct 2017. doi: 10.1109/SMC.2017.8122656.
  • Wang et al. (2018) Wang, Hao, Bäck, Thomas, and Emmerich, Michael T. M. Multi-point efficient global optimization using niching evolution strategy. In Tantar, Alexandru-Adrian, Tantar, Emilia, Emmerich, Michael, Legrand, Pierrick, Alboaie, Lenuta, and Luchian, Henri (eds.), EVOLVE - A Bridge between Probability, Set Oriented Numerics, and Evolutionary Computation VI, pp. 146–162, Cham, 2018. Springer International Publishing. ISBN 978-3-319-69710-9.
  • Yang et al. (2015) Yang, Zichao, Moczulski, Marcin, Denil, Misha, de Freitas, Nando, Smola, Alex, Song, Le, and Wang, Ziyu. Deep fried convnets. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1476–1483, 2015.
  • Zeiler & Fergus (2013) Zeiler, Matthew D and Fergus, Rob. Stochastic pooling for regularization of deep convolutional neural networks. arXiv preprint arXiv:1301.3557, 2013.
  • Žilinskas (1992) Žilinskas, Antanas. A review of statistical models for global optimization. Journal of Global Optimization, 2(2):145–153, 1992.