1 Introduction
Several practical optimization problems, such as process blackbox optimization for complex dynamical systems, pose a unique challenge owing to the restriction on the number of possible function evaluations. Such blackbox functions do not have a simple closed form but can be evaluated (queried) at any arbitrary query point in the domain. However, evaluation of realworld complex processes is expensive and time consuming; therefore, the optimization algorithm must optimize while employing as few realworld function evaluations as possible. Most practical optimization problems are constrained in nature, i.e. have one or more constraints on the values of input parameters. In this work, we focus on realparameter singleobjective blackbox optimization (BBO), where the goal is to obtain a value as close to the maximum value of the objective function as possible by adjusting the values of the realvalued continuous input parameters while ensuring that domain constraints are not violated. We further assume a limited budget, i.e. we assume that querying the blackbox function is expensive and thus only a small number of queries can be made.
Efficient global optimization of expensive blackbox functions [14] requires proposing the next query (input parameter values) to the blackbox function based on past queries and the corresponding responses (function evaluations). BBO can be mapped to the problem of proposing the next query given past queries and the corresponding responses such that the expected improvement in the function value is maximized, as in Bayesian Optimization approaches [4]. While most research in optimization has focused on engineering algorithms catering to specific classes of problems, recent metalearning [24] approaches, e.g. [2, 18, 5, 27, 7], cast the design of an optimization algorithm as a learning problem rather than the traditional handengineering approach, and then propose approaches to train neural networks that learn to optimize. In contrast to a traditional machine learning approach involving training of a neural network on a single task using training data samples so that it can generalize to unseen data samples from the same data distribution, here the neural network is trained on a distribution of similar tasks (in our case, optimization tasks) so as to learn a strategy that generalizes to related but unseen tasks from a similar task distribution. The metalearning approaches attempt to train a single network to optimize several functions at once such that the network can effectively generalize to optimize unseen functions.
Recently, [5] proposed a metalearning approach wherein a recurrent neural network (RNN with gated units such as Long Short Term Memory (LSTM) [9]) learns to optimize a large number of diverse synthetic nonconvex functions to yield a learned taskindependent optimizer. The RNN iteratively uses the sequence of past queries and corresponding responses to propose the next query in order to maximize the observed improvement (OI) in the response value. We refer to this approach as RNNOI in this work. Once the RNN is trained to optimize a diverse set of synthetic functions by using gradient descent, it is able to generalize well to solve unseen derivativefree blackbox optimization problems [5, 29]. Such learned optimizers are shown to be faster in terms of the time taken to propose the next query compared to Bayesian optimizers, as they do not require any matrix inversion or optimization of acquisition functions, and also have lower regret values within the training horizon, i.e. the number of steps of the optimization process for which the RNN is trained to generate queries.

Key contributions of this work and the challenges addressed can be summarized as follows:

Regretbased loss function: The regret of an optimizer is the difference between the optimal value (maximum of the blackbox function) and the best value realized by the optimizer. We hypothesize that training an RNN optimizer using a loss function that minimizes the regret observed for a given number of queries more closely resembles the performance measure of an optimizer, and is therefore better than a loss function based on OI such as the one used in [5, 29]. To this end, we propose a simple yet highly effective regretbased loss function that yields superior results compared to the existing OI loss for blackbox optimization.

Deal with lack of prior knowledge on the range of the blackbox function: In many practical optimization problems, it may be difficult to ascertain the possible range of values the function can take, and the range of values would vary across applications. On the other hand, neural networks are known to work well only on normalized inputs, and can be numerically unstable and difficult to train on very large or very small values, as typical nonlinear activation functions like the sigmoid tend to saturate for large inputs and then adjust slowly during training. RNNs are most easily trained when their inputs are well conditioned and have a scale similar to that of their latent state, and suitable scaling often accelerates training [27]. We therefore propose incremental normalization, which dynamically normalizes the output (response) from the blackbox function using the response values observed so far, before the value is passed as an input to the RNN, and observe significant improvements in terms of regret by doing so.
Incorporate domain constraints: Any practical optimization problem has a set of constraints on the input parameters. It is important that the RNN optimizer is penalized when it proposes query points outside the desired limits. We introduce a mechanism to achieve this by providing additional feedback to the RNN whenever it proposes a query that violates domain constraints. In addition to the regretbased loss, the RNN is also trained to simultaneously minimize domain constraint violations. We show that an RNN optimizer trained in this manner attains lower regret values in fewer steps when subjected to domain constraints compared to an RNN optimizer not explicitly trained to utilize such feedback.
We refer to the proposed approach as RNNOpt. As a result of the above considerations, RNNOpt can deal with an unknown range of function values and also incorporate domain constraints. We demonstrate that RNNOpt works well on optimizing unseen benchmark blackbox functions and outperforms RNNOI in terms of the optimal value attained under a limited budget for 2dimensional and 6dimensional input spaces. We also perform extensive ablation experiments demonstrating the importance of each of the abovestated features in RNNOpt.
2 Related Work
Our work falls under the category of realparameter blackbox global optimization [21]. Traditional approaches for blackbox optimization like covariance matrix adaptation evolution strategy (CMAES) [8], NelderMead [20], and Particle Swarm Optimization (PSO) [15] handdesign rules using heuristics (e.g. natureinspired genetic algorithms) to decide the next query point(s) given the observations made so far. Another category of approaches for global optimization of blackbox functions includes Bayesian optimization techniques [4, 26, 25]. These approaches use the observations (queries and responses) made thus far to approximate the blackbox function via a surrogate (meta) model, e.g. using a Gaussian Process [10], and then use this model to construct an acquisition function to decide the next query point. The acquisition function updates needed at each step are known to be costly [5].

Learned optimizers: There has been a recent interest in learning optimizers under the metalearning setting [24] by training RNN optimizers via gradient descent. For example, [2] casts the design of an optimization algorithm as a learning problem and uses an LSTM model to learn an optimizer for a particular class of optimization problems, e.g. quadratic functions, training neural networks, etc. Similarly, [18, 7] cast optimizer learning as learning a policy under a reinforcement learning setting.
[27] proposes a hierarchical RNN architecture to learn optimizers that scale well to optimize a large number of parameters (highdimensional input space). However, the above metalearning approaches for optimization assume the availability of gradient information to decide the next set of parameters, which is not available in the case of blackbox optimization. Our work builds upon the metalearning approach for learning blackbox optimizers proposed in [5]. This approach mimics the sequential modelbased Bayesian approaches in the sense that it proposes an RNN optimizer that stores sequential information about previous queries and responses, and accesses this memory to generate the next candidate query. RNNOI mimics the Bayesian optimization based sequential decisionmaking process [4] (refer [5] for details) while being significantly faster than standard BBO algorithms like SMAC [11] and Spearmint [26], as it does not involve any matrix inversion or optimization of acquisition functions. RNNOI was successfully tested on Gaussian process bandits, simple lowdimensional controllers, and hyperparameter tuning.

Handling domain constraints in neural networks: Recent work on physicsguided deep learning [13, 19] incorporates domain knowledge in the learning process via additional loss terms. Such approaches can be useful in our setting if the optimizer network is to be trained from scratch for a given application. However, building a generic optimizer that can be transferred to new applications requires incorporating domain constraints in a posterior manner at inference time, when the optimizer is suggesting query points. This is useful not only for adapting the same optimizer to a new application but also in another practical scenario: adapting to a new set of domain constraints for a given application. ThermalNet [6] uses a deep Qnetwork as an optimizer and uses an LSTM predictor for combustion optimization of a boiler in a power plant but does not handle domain constraints. Similar to our approach, ChemOpt [29] uses an RNNbased optimizer for chemical reaction optimization but does not address aspects related to handling an unknown range for the function being optimized or incorporating domain constraints.

Handling unknown range of function values: Suitable scaling of the input and output of hidden layers in neural networks has been shown to accelerate training [12, 23, 3, 17]. Dynamic input scaling has been used in a setting similar to ours [27] to ensure that the neural network based optimizer is invariant to parameter scale. However, there the scaling is applied to the average gradients. In our setting, we use a similar approach but apply dynamic scaling to the function evaluations being fed back as input to RNNOpt.
3 Problem Overview
We consider learning an optimizer that can optimize (e.g., maximize) a blackbox function $f: \mathcal{X} \rightarrow \mathbb{R}$, where $\mathcal{X} \subseteq \mathbb{R}^d$ is the domain of the input parameters. We assume that the function $f$ does not have a closedform representation, is costly to evaluate, and does not allow the computation of gradients. In other words, the optimizer can query the function $f$ at a point $\mathbf{x}$ to obtain a response $f(\mathbf{x})$, but it does not obtain any gradient information, and in particular it cannot make any assumptions on the analytical form of $f$. The goal is to find $\mathbf{x}^* = \arg\max_{\mathbf{x} \in \mathcal{X}} f(\mathbf{x})$ within a limited budget, i.e. within a limited number of queries that can be made to the blackbox.
We consider training an optimizer $g_{\theta}$ with parameters $\theta$ such that, given the queries $\mathbf{x}_1, \ldots, \mathbf{x}_t$ and the corresponding responses $y_1, \ldots, y_t$ from $f$, where $y_i = f(\mathbf{x}_i)$, $g_{\theta}$ proposes the next query point $\mathbf{x}_{t+1}$ under a budget constraint of $T$ queries, i.e. $t+1 \leq T$:
$\mathbf{x}_{t+1} = g_{\theta}\left(\{\mathbf{x}_i, y_i\}_{i=1}^{t}\right), \quad t+1 \leq T$  (1)
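This budget-constrained query protocol can be sketched as a simple loop; the names below are illustrative (not the paper's code), and `propose` stands in for the learned optimizer $g_{\theta}$, here instantiated as a naive random-search baseline:

```python
import random
from typing import Callable, List, Tuple

History = List[Tuple[List[float], float]]

def run_optimizer(f: Callable[[List[float]], float],
                  propose: Callable[[History], List[float]],
                  x0: List[float],
                  budget: int) -> float:
    """Sequentially query the blackbox f at most `budget` times,
    feeding the full (query, response) history to the optimizer,
    and return the best response observed."""
    history: History = [(x0, f(x0))]
    for _ in range(budget - 1):
        x_next = propose(history)  # g_theta({x_i, y_i}) -> x_{t+1}
        history.append((x_next, f(x_next)))
    return max(y for _, y in history)

# Example: random search on a 1-D concave function over [-1, 1].
if __name__ == "__main__":
    f = lambda x: -(x[0] - 0.3) ** 2
    rng = random.Random(0)
    best = run_optimizer(f, lambda h: [rng.uniform(-1.0, 1.0)], [0.0], budget=10)
    print(best)
```

Any learned or handdesigned strategy only needs to implement `propose`; the loop itself enforces the budget of queries to the blackbox.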
4 RNNOpt
We model $g_{\theta}$ using an LSTMbased RNN. (For implementation, we use a variant of LSTMs as described in [28].) Recurrent Neural Networks (RNNs) with gated units such as Long Short Term Memory (LSTM) [9] units are a popular choice for sequence modeling, i.e. for making predictions about future values given the past. They do so by maintaining a memory of all the relevant information from the sequence of inputs observed so far. In the metalearning or training phase, a diverse set of syntheticallygenerated differentiable nonconvex functions (refer Appendix 0.A) with known global optima are used to train the RNN (using gradient descent). The RNN is then used to predict the next query in order to intelligently explore the search space given the sequence of previous queries and the function responses. The RNN is expected to learn to retain any information about previous queries and responses that is relevant to proposing the next query so as to minimize the regret, as shown in Fig. 1.
4.1 RNNOpt without Domain Constraints
Given a trained RNNbased optimizer and a differentiable function $f$, inference in RNNOpt follows an iterative process for $t = 1, \ldots, T-1$: at each step $t$, the output $\mathbf{h}_t$ of the final recurrent hidden layer of the RNN is used to generate the output $(\boldsymbol{\mu}_{t+1}, \boldsymbol{\Sigma}_{t+1})$ via an affine transformation, to finally obtain the next query $\mathbf{x}_{t+1}$.
$\mathbf{h}_t = RNN_{\theta_1}(\mathbf{h}_{t-1}, \mathbf{x}_t, y_t)$  (2)
$\boldsymbol{\mu}_{t+1}, \boldsymbol{\Sigma}_{t+1} = \Phi(\mathbf{h}_t)$  (3)
$\mathbf{x}_{t+1} \sim \mathcal{N}(\boldsymbol{\mu}_{t+1}, \boldsymbol{\Sigma}_{t+1})$  (4)
$y_{t+1} = f(\mathbf{x}_{t+1})$  (5)
where $RNN_{\theta_1}$ represents the RNN with parameters $\theta_1$, $f$ is the function to be optimized, and $\Phi$ defines the affine transformation of the final output (hidden state) $\mathbf{h}_t$ of the RNN. The parameters $\theta_1$ and those of $\Phi$ together constitute $\theta$. Instead of directly training $g_{\theta}$ to propose the next query $\mathbf{x}_{t+1}$ as in [5], we use a stochastic RNN to estimate $\boldsymbol{\mu}_{t+1}$ and $\boldsymbol{\Sigma}_{t+1}$ as in Equation 3, then sample $\mathbf{x}_{t+1}$ from a multivariate Gaussian distribution $\mathcal{N}(\boldsymbol{\mu}_{t+1}, \boldsymbol{\Sigma}_{t+1})$. Introducing randomness in the query generation process leads to better exploration compared to a deterministic model [29]. The first query $\mathbf{x}_1$ is sampled from a uniform distribution over the domain of the function to be optimized. Once the network is trained, $f$ can be replaced by any blackbox function that takes a $d$-dimensional input.

For any synthetically generated function $f$, we assume that (an approximation of) $\mathbf{x}^*$ can be found, e.g. using gradient descent, since the closed form of the function is known. Hence, we assume that the optimal value $y^* = f(\mathbf{x}^*)$ of $f$ is known. Therefore, it is easy to determine the regret $y^* - \max_{i \leq t} f(\mathbf{x}_i)$ after $t$ iterations (queries) to the function $f$. We can then define a regretbased loss function as follows:
$\mathcal{L}_R = \sum_{t=1}^{T} \gamma^{T-t} \left( y^* - \max_{i \leq t} f(\mathbf{x}_i) \right)$  (6)
where $0 < \gamma \leq 1$ is a discount factor. Since the regret is expected to be high during initial iterations because of the random initialization of $\mathbf{x}_1$, but desired to be low close to $t = T$, we give exponentially increasing importance to later regret terms via the discount factor $\gamma$. In contrast to the regret loss, the OI loss used in RNNOI is given by [5, 29]:
$\mathcal{L}_{OI} = -\sum_{t=1}^{T} \gamma^{T-t} \max\left( f(\mathbf{x}_t) - \max_{i < t} f(\mathbf{x}_i),\, 0 \right)$  (7)
It is to be noted that using $\mathcal{L}_R$ as the loss function mimics a supervised scenario where the target $y^*$ for each optimization task is known and explicitly used to guide the learning process. On the other hand, $\mathcal{L}_{OI}$ mimics an unsupervised scenario where the target is unknown and the learning process relies solely on feedback about whether the optimizer is able to improve over iterations. It is important to note that, once trained, the model requires neither $y^*$ nor $\mathbf{x}^*$ during inference.
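The contrast between the two losses can be made concrete with a minimal sketch; the per-step forms below are our reading of the discounted regret and observed-improvement definitions above (batching over training functions and the exact conventions of [5, 29] are omitted):

```python
import numpy as np

def regret_loss(ys, y_star, gamma=0.98):
    """Discounted regret loss: at each step t, the regret is y* minus
    the best response observed up to t; the weight gamma**(T - t)
    grows as t approaches the horizon T."""
    ys = np.asarray(ys, dtype=float)
    T = len(ys)
    best_so_far = np.maximum.accumulate(ys)
    return float(sum(gamma ** (T - t) * (y_star - best_so_far[t - 1])
                     for t in range(1, T + 1)))

def oi_loss(ys, gamma=1.0):
    """Observed-improvement loss: negative discounted sum of the
    improvements over the best response so far; needs no y*."""
    ys = list(map(float, ys))
    T = len(ys)
    loss, best = 0.0, ys[0]
    for t in range(2, T + 1):
        loss -= gamma ** (T - t) * max(ys[t - 1] - best, 0.0)
        best = max(best, ys[t - 1])
    return loss

# A trajectory that stalls mid-way incurs regret at every step but
# costs the OI loss nothing during the stalled steps.
print(regret_loss([0.1, 0.5, 0.4, 0.8], y_star=1.0, gamma=1.0))
print(oi_loss([0.1, 0.5, 0.4, 0.8], gamma=1.0))
```

Note how the regret loss keeps penalizing the optimizer for the gap to $y^*$ even when no improvement happens, whereas the OI loss only rewards improvements, which is one intuition for why supervision via regret can be a stronger training signal.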
4.1.1 Incremental Normalization
We do not assume any constraint on the range of values the functions can take. Although this flexibility is critical for most practical applications, it poses a challenge for both training and inference with RNNs: neural networks are known to work well only on normalized inputs, and can be numerically unstable and difficult to train on very large or very small values, as typical nonlinear activation functions like the sigmoid tend to saturate for large inputs and then adjust slowly during training. RNNs are most easily trained when their inputs are well conditioned and have a scale similar to that of their latent state, and suitable scaling often accelerates training [12, 27]. This poses a challenge during both training and inference if we directly use $y_t$ as an input to the RNN.
Fig. 2 illustrates the saturation effect when suitable incremental normalization of function values is not used during inference. This behavior at inference time was noted in [5] (as per electronic correspondence with the authors); however, it was not considered while training RNNOI. In order to deal with any range of values that the function can take during training or inference, we use incremental normalization while training, such that $y_t$ in Eq. 2 is replaced by $\tilde{y}_t = \frac{y_t - \mu_t}{\sigma_t}$, where $\mu_t$ and $\sigma_t$ are the mean and standard deviation of the responses $y_1, \ldots, y_t$ observed so far.
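A minimal sketch of incremental normalization, assuming a running mean and standard deviation over the responses observed so far (the paper's exact estimator, e.g. any stabilizing constant, may differ):

```python
import numpy as np

def incremental_normalize(ys):
    """Normalize each response y_t using only the responses observed
    up to step t: y_tilde_t = (y_t - mu_t) / sigma_t, where mu_t and
    sigma_t are the mean and std of y_1..y_t."""
    ys = np.asarray(ys, dtype=float)
    out = np.empty_like(ys)
    for t in range(1, len(ys) + 1):
        seen = ys[:t]
        mu, sigma = seen.mean(), seen.std()
        # With a single observation (or zero spread) the scale is
        # undefined; emit 0 so the RNN input stays bounded.
        out[t - 1] = 0.0 if sigma == 0.0 else (ys[t - 1] - mu) / sigma
    return out

# Responses on the order of 1e6 become O(1) inputs for the RNN.
print(incremental_normalize([1e6, 2e6, 3e6]))
```

Because each $\tilde{y}_t$ uses only past observations, the same transformation is applicable at inference time on a blackbox whose range is unknown in advance.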
4.2 RNNOpt with Domain Constraints (RNNOptDC)
Consider a constrained optimization problem of finding $\arg\max_{\mathbf{x}} f(\mathbf{x})$ subject to constraints $c_j(\mathbf{x}) \leq 0$, $j = 1, \ldots, C$, where $C$ is the number of constraints. To ensure that the optimizer proposes queries that satisfy the domain constraints, or is at least able to receive feedback when it proposes a query that violates any domain constraint, we consider the following enhancements in RNNOpt, as depicted in Fig. 3:
1. Input an explicit feedback via a penalty function $p(\mathbf{x}) \geq 0$ to the RNN that captures the extent to which a proposed query violates any of the domain constraints. We consider the following instantiation of the penalty function: $p(\mathbf{x}) = \sum_{j=1}^{C} \max(0, c_j(\mathbf{x}))$, i.e. for any $c_j(\mathbf{x}) > 0$ a penalty equal to $c_j(\mathbf{x})$ is incurred, while any $c_j(\mathbf{x}) \leq 0$ contributes 0 to the penalty. The realvalued penalty thus also captures the cumulative extent of violation. Further, similar to normalizing $y_t$, we also normalize $p(\mathbf{x}_t)$ incrementally and use the normalized penalty $\tilde{p}_t$ as an additional input to the RNN, such that:
$\mathbf{h}_t = RNN_{\theta_1}(\mathbf{h}_{t-1}, \mathbf{x}_t, \tilde{y}_t, \tilde{p}_t)$  (8)
Further, whenever $p(\mathbf{x}_t) > 0$, i.e. when one or more of the domain constraints are violated for the proposed query, we set the response value $y_t$ internally rather than actually getting a response from the blackbox. This is useful in practice: for example, when trying to optimize a complex dynamical system, getting a response from the system for such a query may not be possible, as executing it can be catastrophic.
2. During training, an additional domain constraint loss $\mathcal{L}_{DC}$ is considered that penalizes the optimizer if it proposes a query that does not satisfy one or more of the domain constraints:
$\mathcal{L}_{DC} = \sum_{t=1}^{T} \gamma^{T-t}\, p(\mathbf{x}_t)$  (9)
The overall loss is then given by:
$\mathcal{L} = \mathcal{L}_R + \lambda\, \mathcal{L}_{DC}$  (10)
where $\lambda$ controls how strictly the constraints on the domain of the input parameters should be enforced; a higher $\lambda$ implies stricter adherence to constraints. It is worth noting that the above formulation of incorporating domain constraints puts no restriction on the number of constraints, nor on their nature, in the sense that the constraints can be linear or nonlinear. Further, complex nonlinear constraints based on domain knowledge can also be incorporated in a similar fashion during training, e.g. as in [13, 19]. Apart from optimizing (in our case, maximizing) $f$, the optimizer is thus simultaneously trained to minimize $p(\mathbf{x})$.
4.2.1 Example of penalty function.
Consider simple limit constraints on the input parameters such that the domain of the function is given by $[\mathbf{x}_{min}, \mathbf{x}_{max}]$; then we have:
$p(\mathbf{x}) = \sum_{j=1}^{d} \left( \max(0,\, x^{(j)} - x^{(j)}_{max}) + \max(0,\, x^{(j)}_{min} - x^{(j)}) \right)$  (11)
where $x^{(j)}$ denotes the $j$th dimension of $\mathbf{x}$, and $x^{(j)}_{min}$ and $x^{(j)}_{max}$ are the $j$th elements of $\mathbf{x}_{min}$ and $\mathbf{x}_{max}$, respectively.
5 Experimental Evaluation
We conduct experiments to evaluate the following: i. regret loss ($\mathcal{L}_R$) versus OI loss ($\mathcal{L}_{OI}$), ii. the effect of including incremental normalization during training, and iii. the ability of RNNOpt trained with domain constraints using $\mathcal{L}$ (Eq. 10) to generate more feasible queries and to leverage feedback to quickly adapt in case it proposes queries violating domain constraints.
For the unconstrained setting, we test RNNOpt on i) standard benchmark functions for $d = 2$ and $d = 6$, and ii) 1280 synthetically generated GMMDF functions (refer Appendix 0.A) not seen during training. We choose benchmark functions such as Goldstein, Rosenbrock, and Rastrigin (and the simple spherical function) that are known to be challenging for standard optimization methods. None of these functions were used for training any of the optimizers.
We use the regret $r_t = y^* - \max_{i \leq t} f(\mathbf{x}_i)$ to measure the performance of any optimizer after $t$ iterations, i.e. after proposing $t$ queries. Lower values of $r_t$ indicate superior optimizer performance. We test all the optimizers under a limited budget setting. For each test function, the first query is randomly sampled from the domain, and we report the average regret over 1280 random initializations. For the synthetically generated GMMDF functions, we report the average regret over 1280 functions with one random initialization for each.
All RNNbased optimizers (refer Table 1) were trained for 8000 iterations using the Adam optimizer [16] with an initial learning rate of 0.005. The network consists of two hidden layers, with the number of LSTM units in each layer chosen using a holdout validation set of GMMDF functions. Another set of 1280 randomly generated functions constitutes the GMMDF test set. An initial code base (https://github.com/lightingghost/chemopt) developed using Tensorflow [1] was adapted to implement our algorithm. We used a batch size of 128, i.e. 128 randomlysampled functions (refer Equation 12) are processed in one minibatch for updating the parameters of the LSTM.

Table 1: Variants of RNNbased optimizers considered.

Method       | Loss (discount)  | Inc. Norm. Training | Inc. Norm. Inference | DC Training | DC Inference
RNNOI        | OI (1.0)         | N | Y | N | N
RNNOptBasic  | Regret (0.98)    | N | Y | N | N
RNNOpt       | Regret (0.98)    | Y | Y | N | N
RNNOptP      | Regret (0.98)    | Y | Y | N | Y
RNNOptDC     | Regret (0.98)    | Y | Y | Y | Y
5.1 Observations
We make the following key observations for unconstrained optimization setting:
1. RNNOpt is able to optimize blackbox functions not seen during training, and hence, generalizes. We compare RNNOpt with RNNOI and two standard blackbox optimization algorithms: CMAES [8] and NelderMead [20]. RNNOI uses $\mathbf{x}_t$, $y_t$, and $\mathbf{h}_t$ to get the next hidden state $\mathbf{h}_{t+1}$, which is then used to get $\mathbf{x}_{t+1}$ (as in Eq. 4), and is trained with the OI loss given in Eq. 7. From Fig. 4 (a)-(i), we observe that RNNOpt outperforms all the baselines on most of the functions considered, while being at least as good as the baselines in the few remaining cases. Except for the simple convex spherical function, the RNNbased optimizers outperform CMAES and NelderMead under a limited budget. We observe that the trained optimizers outperform CMAES and NelderMead especially for the higherdimensional case ($d = 6$ here), as also observed in [5, 29].
2. Regretbased loss is better than the OI loss. We compare RNNOptBasic with RNNOI (refer Table 1), where RNNOptBasic differs from RNNOI only in the loss function (and the discount factor, as discussed in the next point). For a fair comparison with RNNOI, RNNOptBasic does not include incremental normalization during training. From Fig. 4 (j)-(k), we observe that RNNOptBasic (with $\gamma = 0.98$) performs better than RNNOI during the initial steps for $d = 2$ (while being comparable eventually) and across all steps for $d = 6$, proving the advantage of using the regret loss over the OI loss.
3. Significance of the discount factor when using regretbased loss versus OI loss. From Fig. 4 (j)-(k), we also observe that the results of RNNOpt and RNNOI are sensitive to the discount factor $\gamma$ (refer Eqs. 6 and 7): $\gamma = 0.98$ works better for RNNOpt, while $\gamma = 1.0$ (i.e. no discount) works better for RNNOI. This can be explained as follows: the queries proposed initially (small $t$) are expected to be far from $\mathbf{x}^*$ due to random initialization, and therefore have high initial regret; hence, the loss terms for smaller $t$ should be given lower weightage in the regretbased loss. On the other hand, during later steps (close to $T$), we would like the regret to be as low as possible, and hence a higher importance should be given to the corresponding terms in the regretbased loss. In contrast, RNNOI is trained to keep improving irrespective of $t$, and hence giving equal importance to the contribution of each step to the OI loss works best.
4. Incremental normalization during training and inference to optimize functions with a diverse range of values. We compare RNNOptBasic and RNNOpt, where RNNOpt uses incremental normalization of inputs during training as well as testing (as described in Section 4.1.1), while RNNOptBasic uses incremental normalization only during testing (refer Table 1). From Fig. 5, we observe that RNNOpt performs significantly better than RNNOptBasic, proving the advantage of incorporating incremental normalization during training. Note that since most of the functions considered have a large range of values, incremental normalization is enabled by default for all RNNbased optimizers during testing to obtain meaningful results, as illustrated earlier in Fig. 2, especially for functions with a large range, e.g. Rosenbrock.
5.2 RNNOpt with Domain Constraints
To train RNNOptDC, we generate synthetic functions with random limit constraints as explained in Section 4.2.1. The limits of the search space are set by sampling each component of $\mathbf{x}_{min}$ and $\mathbf{x}_{max}$ from fixed uniform distributions during training.
We use a fixed value of $\lambda$ for RNNOptDC. As a baseline, we use RNNOpt with a minor variation at inference time (with no change in the training procedure) where, instead of passing $\tilde{y}_t$ as input to the RNN, we pass $\tilde{y}_t - \tilde{p}_t$ so as to capture penalty feedback. We call this baseline approach RNNOptP (refer Table 1). While RNNOptDC is explicitly trained to minimize the penalty, RNNOptP captures the requirement of trying to maximize $f$ under a soft constraint of minimizing $p$ only at inference time.
We use the standard quadratic (disk) constraint used to evaluate constrained optimization approaches, i.e. $\|\mathbf{x}\|^2 \leq r$ for a fixed $r$, for the Rosenbrock function. For GMMDF, we generate random limit constraints on each dimension around the global optimum, s.t. the optimal solution is still the same as the one without constraints, while the feasible search space varies randomly across functions; the limits of the domain are obtained by sampling each component of $\mathbf{x}_{min}$ and $\mathbf{x}_{max}$ from uniform distributions around the global optimum. We also consider two instances of an (anonymized) nonlinear surrogate model for a realworld industrial process built by subjectmatter experts, with six controllable input parameters ($d = 6$), as blackbox functions, referred to as Industrial1 and Industrial2 in Fig. 6. This process imposes limit constraints on all six parameters guided by domain knowledge. The groundtruth optimal value for these functions was obtained by querying the surrogate model 200k times via grid search. The regret results are averaged over multiple runs assuming diverse environmental conditions.
RNNOptDC and RNNOptP are not guaranteed to propose feasible queries at all steps because of the soft constraints during training and/or inference. Therefore, despite training the optimizers for $T$ steps, we unroll the RNNs up to a maximum number of steps larger than $T$ and take the first $T$ proposed queries that are feasible, i.e. satisfy the domain constraints. For functions where an optimizer is not able to propose $T$ feasible queries, we replicate the regret corresponding to the best solution found for the remaining steps. As shown in Fig. 6, we observe that the variant of RNNOpt with domain constraints, namely RNNOptDC, is able to effectively use the explicit penalty feedback and is at least as good as RNNOptP in all cases. As expected, we also observe that the performance of both optimizers degrades as the feasible search space to be explored by the optimizer grows.
6 Conclusion and Future Work
Learning optimization algorithms under the metalearning paradigm is an area of active research. In this work, we have shown that using regret directly as a loss for training optimizers via recurrent neural networks is possible, and that it yields better optimizers than those obtained using the observedimprovement based loss. We have proposed extensions of practical importance to blackbox optimization algorithms that allow dealing with a diverse range of function values and handling domain constraints more effectively. One shortcoming of this approach is that a different optimizer needs to be trained for each number of input parameters. In the future, we plan to extend this work to train optimizers that can ingest inputs with a varying and high number of parameters, e.g. by first proposing a change in a latent space and then estimating changes in the actual input space, as in [22, 27]. Further, training optimizers for multiobjective optimization can be a useful extension.
Appendix 0.A Generating Diverse NonConvex Synthetic Functions
We generate synthetic nonconvex continuous functions defined over $\mathcal{X} \subseteq \mathbb{R}^d$ via a Gaussian Mixture Model density function (GMMDF, similar to [29]):
$f(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$  (12)
In this work, we used GMMDF instead of the Gaussian Processes used in [5] for ease of implementation and faster response time to queries. Functions obtained in this manner are often nonconvex and have multiple local minima/maxima. Sample plots for functions obtained over a 2D input space are shown in Fig. 7. The number of mixture components and the parameters of the component distributions are sampled randomly, with different settings for $d = 2$ and $d = 6$ (all covariance matrices are diagonal).
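Generating such a GMMDF can be sketched as follows; the sampling ranges for the mixture parameters below are illustrative and do not reproduce the paper's exact settings:

```python
import numpy as np

def make_gmm_df(d, n_components, seed):
    """Build a random GMM density function f(x) = sum_k pi_k N(x; mu_k, Sigma_k)
    with diagonal covariances, plus an estimate y*_hat of its maximum taken
    at the component mean with the highest density."""
    rng = np.random.default_rng(seed)
    pis = rng.dirichlet(np.ones(n_components))                  # mixture weights
    mus = rng.uniform(-1.0, 1.0, size=(n_components, d))        # component means
    variances = rng.uniform(0.05, 0.3, size=(n_components, d))  # diagonal Sigma_k

    def f(x):
        x = np.asarray(x, dtype=float)
        mahal = ((x - mus) ** 2 / variances).sum(axis=1)        # (K,) squared distances
        norm = np.prod(2.0 * np.pi * variances, axis=1) ** -0.5 # Gaussian normalizers
        return float(np.sum(pis * norm * np.exp(-0.5 * mahal)))

    # Assume the global maximum lies at one of the component means.
    y_star_hat = max(f(mu) for mu in mus)
    return f, y_star_hat
```

Each sampled function is cheap to evaluate and has a readily available (approximate) optimum, which is what makes large-scale supervised training with the regret loss practical.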
For any function $f$, we use an estimated optimal value $\hat{y}^* = \max_k f(\boldsymbol{\mu}_k)$ instead of $y^*$. This assumes that the global maximum of the function is at the mean of one of the Gaussian components. We validate this assumption by obtaining better estimates of the ground truth for $y^*$ via grid search over 0.2M randomly sampled query points over the domain of $f$. For 10k randomly sampled GMMDF functions, we obtained an average error of 0.03 with a standard deviation of 0.02 in estimating $y^*$, suggesting that the assumption is reasonable, and that in practice, approximate values of $y^*$ suffice to estimate the regret values for supervision. However, in general, $\hat{y}^*$ can also be obtained using gradient descent on $f$.

References
 [1] Abadi, M., Barham, P., et al.: Tensorflow: a system for largescale machine learning. In: OSDI. vol. 16, pp. 265–283 (2016)
 [2] Andrychowicz, M., Denil, M., et al.: Learning to learn by gradient descent by gradient descent. In: Advances in Neural Information Processing Systems. pp. 3981–3989 (2016)
 [3] Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
 [4] Brochu, E., Cora, V.M., De Freitas, N.: A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599 (2010)
 [5] Chen, Y., Hoffman, M.W., Colmenarejo, S.G., Denil, M., Lillicrap, T.P., Botvinick, M., de Freitas, N.: Learning to learn without gradient descent by gradient descent. In: Proceedings of the 34th International Conference on Machine Learning, Volume 70. pp. 748–756. JMLR.org (2017)

 [6] Cheng, Y., Huang, Y., Pang, B., Zhang, W.: Thermalnet: A deep reinforcement learningbased combustion optimization system for coalfired boiler. Engineering Applications of Artificial Intelligence 74, 303–311 (2018)
 [7] Faury, L., Vasile, F.: Rover descent: Learning to optimize by learning to navigate on prototypical loss surfaces. In: Learning and Intelligent Optimization. pp. 271–287 (2018)

 [8] Hansen, N., Ostermeier, A.: Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation. In: Proceedings of the IEEE International Conference on Evolutionary Computation. pp. 312–317. IEEE (1996)
 [9] Hochreiter, S., Schmidhuber, J.: Long shortterm memory. Neural computation 9(8), 1735–1780 (1997)
 [10] Huang, D., Allen, T.T., Notz, W.I., Zeng, N.: Global optimization of stochastic blackbox systems via sequential kriging metamodels. Journal of global optimization 34(3), 441–466 (2006)
 [11] Hutter, F., Hoos, H.H., LeytonBrown, K.: Sequential modelbased optimization for general algorithm configuration. In: International Conference on Learning and Intelligent Optimization. pp. 507–523. Springer (2011)
 [12] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
 [13] Jia, X., Willard, J., Karpatne, A., Read, J., Zward, J., Steinbach, M., Kumar, V.: Physics guided rnns for modeling dynamical systems: A case study in simulating lake temperature profiles. arXiv preprint arXiv:1810.13075 (2018)
 [14] Jones, D.R., Schonlau, M., Welch, W.J.: Efficient global optimization of expensive blackbox functions. Journal of Global optimization 13(4), 455–492 (1998)
 [15] Kennedy, J.: Particle swarm optimization. Encyclopedia of machine learning pp. 760–766 (2010)
 [16] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
 [17] Klambauer, G., Unterthiner, T., Mayr, A., Hochreiter, S.: Selfnormalizing neural networks. In: Advances in neural information processing systems. pp. 971–980 (2017)
 [18] Li, K., Malik, J.: Learning to optimize. arXiv preprint arXiv:1606.01885 (2016)
 [19] Muralidhar, N., Islam, M.R., Marwah, M., Karpatne, A., Ramakrishnan, N.: Incorporating prior domain knowledge into deep neural networks. In: 2018 IEEE International Conference on Big Data (Big Data). pp. 36–45. IEEE (2018)
 [20] Nelder, J.A., Mead, R.: A simplex method for function minimization. The computer journal 7(4), 308–313 (1965)
 [21] Rios, L.M., Sahinidis, N.V.: Derivativefree optimization: a review of algorithms and comparison of software implementations. Journal of Global Optimization 56(3), 1247–1293 (2013)
 [22] Rusu, A.A., Rao, D., Sygnowski, J., Vinyals, O., Pascanu, R., Osindero, S., Hadsell, R.: Metalearning with latent embedding optimization. arXiv preprint arXiv:1807.05960 (2018)
 [23] Salimans, T., Kingma, D.P.: Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In: Advances in Neural Information Processing Systems. pp. 901–909 (2016)
 [24] Schmidhuber, J.: Evolutionary principles in selfreferential learning, or on learning how to learn: the metameta… hook. Ph.D. thesis, Technische Universität München (1987)
 [25] Shahriari, B., Swersky, K., Wang, Z., Adams, R.P., De Freitas, N.: Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE 104(1), 148–175 (2016)
 [26] Snoek, J., Larochelle, H., Adams, R.P.: Practical bayesian optimization of machine learning algorithms. In: Advances in neural information processing systems. pp. 2951–2959 (2012)
 [27] Wichrowska, O., Maheswaranathan, N., Hoffman, M.W., Colmenarejo, S.G., Denil, M., de Freitas, N., SohlDickstein, J.: Learned optimizers that scale and generalize. In: Proceedings of the 34th International Conference on Machine Learning, Volume 70. pp. 3751–3760. JMLR.org (2017)
 [28] Zaremba, W., Sutskever, I., Vinyals, O.: Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 (2014)
 [29] Zhou, Z., Li, X., Zare, R.N.: Optimizing chemical reactions with deep reinforcement learning. ACS central science 3(12), 1337–1344 (2017)