1 Introduction
Optimization provides a mathematical foundation for solving quantitative problems in many fields, but it comes with numerical challenges. The no-free-lunch theorem indicates that no single optimization algorithm is universally best for all objectives. To manually design an effective optimization algorithm for a given problem, much effort has been spent on tuning and validating pipelines, architectures, and hyperparameters. For instance, in deep learning there is a gallery of gradient-based algorithms tailored to high-dimensional, non-convex objective functions, such as Stochastic Gradient Descent [22], RMSProp [25], and Adam [16]. Another example is ab initio protein docking, whose energy functions, serving as objectives, have extremely rugged landscapes and are expensive to evaluate. Gradient-free algorithms are thus popular there, including Markov chain Monte Carlo (MCMC) [12] and Particle Swarm Optimization (PSO) [19].
To overcome such laborious manual design, an emerging approach, meta-learning (learning to learn), takes advantage of knowledge learned from related tasks. In meta-learning, the goal is to learn a meta-learner that can solve a set of problems, where each sample in the training or test set is a particular problem. As in classical machine learning, the fundamental assumption of meta-learning is generalizability from solving the training problems to solving the test ones. For optimization problems, a key to meta-learning is how to efficiently utilize the information in the objective function and explore the space of optimization algorithms.
In this study, we introduce a novel meta-learning framework in which we train a meta-optimizer that learns in the space of both point-based and population-based optimization algorithms for continuous optimization. To that end, we design a novel architecture in which a population of RNNs (specifically, LSTMs) jointly learns iterative update formulae for a population of samples (or a swarm of particles). To balance exploration and exploitation in search, we directly estimate the posterior over the optimum and include in the meta-loss function the differential entropy of this posterior. Furthermore, we embed feature- and sample-level attentions in our meta-optimizer to interpret the learned optimization strategies. Our numerical experiments, including global optimization of non-convex test functions and an application to protein docking, endorse the superiority of the proposed meta-optimizer.
2 Related work
Meta-learning originated from the field of psychology [27, 14]. [4, 6, 5] optimized a learning rule in a parameterized learning-rule space. [30] used an RNN to automatically design neural network architectures. More recently, learning to learn has also been applied to sparse coding [13, 26, 9, 18], plug-and-play optimization [23], and so on.
In the field of learning to optimize, [1] proposed the first framework, in which gradients and function values were used as the features for an RNN. A coordinate-wise RNN structure relieved the burden of an enormous number of parameters, so that the same update formula was applied independently to each coordinate. [17] used the history of gradients and objective values as states, and step vectors as actions, in reinforcement learning. [10] also used an RNN to train a meta-learner to optimize black-box functions, including Gaussian process bandits, simple control objectives, and hyperparameter-tuning tasks. Lately, [28] introduced a hierarchical RNN architecture, augmented with additional architectural features that mirror the known structure of optimization tasks.
The target applications of previous methods have mainly been training deep neural networks, with the exception of [10], which focuses on optimizing black-box functions. These methods share three limitations. First, they learn in a limited algorithmic space, namely point-based optimization algorithms that may or may not use gradients (including SGD and Adam). So far there is no learning-to-learn method that reflects population-based algorithms (such as evolutionary and swarm algorithms), which have proven powerful in many optimization tasks. Second, their learning is guided by a limited meta-loss, often the cumulative regret over the sampling history, which primarily drives exploitation. One exception is the expected improvement (EI) used by [10] under Gaussian processes. Last but not least, these methods do not interpret the process of learning update formulae, despite the previous usage of attention mechanisms in [28].
To overcome the aforementioned limitations of current learning-to-optimize methods, we present a new meta-optimizer with the following contributions:

(Where to learn): We learn in an extended space of both point-based and population-based optimization algorithms;

(How to learn): We incorporate the posterior over the optimum into the meta-loss to guide the search in the algorithmic space and balance the exploration-exploitation tradeoff;

(What more to learn): We design a novel architecture in which a population of LSTMs jointly learns iterative update formulae for a population of samples, and we embed sample- and feature-level attentions to explain the learned formulae.
3 Method
3.1 Notations and background
We use the following convention for notation throughout the paper. Scalars, vectors (column vectors unless stated otherwise), and matrices are denoted in lowercase, bold lowercase, and bold uppercase, respectively. The superscript $\top$ indicates vector transpose.
Our goal is to solve the following optimization problem:
$$\min_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x}).$$
Iterative optimization algorithms, either point-based or population-based, share the same generic update formula:
$$\mathbf{x}^{t+1} = \mathbf{x}^t + \delta\mathbf{x}^t,$$
where $\mathbf{x}^t$ and $\delta\mathbf{x}^t$ are the sample (a single sample is called a "particle" in swarm algorithms) and the update (a.k.a. step vector) at iteration $t$, respectively. The update is often a function of historic sample values, objective values, and gradients. For instance, in point-based gradient descent,
$$\delta\mathbf{x}^t = -\alpha \nabla f(\mathbf{x}^t),$$
where $\alpha$ is the learning rate. In particle swarm optimization (PSO), assuming that there are $k$ samples (particles), the update for particle $i$ is determined by the entire population:
$$\delta\mathbf{x}_i^t = \lambda_1 (\mathbf{w}_i^t - \mathbf{x}_i^t) + \lambda_2 (\mathbf{w}^t - \mathbf{x}_i^t),$$
where $\mathbf{w}_i^t$ and $\mathbf{w}^t$ are the best positions (with the smallest objective value) during the first $t$ iterations of particle $i$ and among all particles, respectively; $\lambda_1$ and $\lambda_2$ are hyperparameters, often randomly sampled from a fixed distribution (e.g. the standard Gaussian distribution) at each iteration.
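As a minimal sketch of the two base update rules (in NumPy; the function names, default step size, and the sampling distribution of $\lambda_1, \lambda_2$ are illustrative rather than prescribed):

```python
import numpy as np

def gd_update(x, grad_f, alpha=0.01):
    """Point-based gradient descent: delta_x = -alpha * grad f(x)."""
    return -alpha * grad_f(x)

def pso_update(X, pbest, gbest, rng):
    """Population-based PSO step for all k particles at once.

    X     : (k, n) current particle positions
    pbest : (k, n) best position found by each particle so far
    gbest : (n,)   best position found by the whole swarm so far
    lam1, lam2 are resampled at every iteration, per the text.
    """
    k = X.shape[0]
    lam1 = rng.standard_normal((k, 1))
    lam2 = rng.standard_normal((k, 1))
    return lam1 * (pbest - X) + lam2 * (gbest - X)
```

Note that the PSO step for one particle depends on the whole population through `gbest`, while the gradient-descent step depends only on the particle itself.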
In most modern optimization algorithms, the update formula is analytically determined and fixed during the whole process. Unfortunately, similar to what the No Free Lunch Theorem suggests in machine learning, there is no single best algorithm for all kinds of optimization tasks. Every state-of-the-art algorithm has its own best-performing problem set or domain. Therefore, it makes sense to learn the optimal update formula from data in the specific problem domain, which is called "learning to optimize". For instance, in [1], the update function is parameterized by a recurrent neural network (RNN) whose inputs are the gradient and the hidden state from the last iteration: $\delta\mathbf{x}^t = \mathrm{RNN}(\nabla f(\mathbf{x}^t), \mathbf{h}^{t-1})$. In [10], the inputs of the RNN are the current sample, its objective value, and the hidden state from the last iteration: $\delta\mathbf{x}^t = \mathrm{RNN}(\mathbf{x}^t, f(\mathbf{x}^t), \mathbf{h}^{t-1})$.
3.2 Population-based learning to optimize with posterior estimation
We describe the details of our population-based meta-optimizer in this section. Compared to previous meta-optimizers, we employ a population of samples whose update formulae are learned from the population history and are individually customized, using attention mechanisms. Specifically, our update rule for particle $i$ can be written as:
$$\delta\mathbf{x}_i^{t}, \mathbf{h}_i^{t} = \mathrm{LSTM}_i\!\left(\mathrm{inter}_i\!\left(\{\mathrm{intra}_j(\mathbf{F}_j^t)\}_{j=1}^{k}\right), \mathbf{h}_i^{t-1}\right),$$
where $\mathbf{F}_i^t$ is a feature matrix for particle $i$ at iteration $t$, $\mathrm{intra}_i(\cdot)$ is the intra-particle attention function for particle $i$, and $\mathrm{inter}_i(\cdot)$ is the $i$-th output of the inter-particle attention function; $\mathbf{h}_i^t$ is the hidden state of the $i$-th LSTM at iteration $t$.
For typical population-based algorithms, the same update formula is adopted by all particles. We follow this convention and share the update rule across particles, which implies that the intra-particle attention modules, the inter-particle attention modules, and the LSTMs are identical for all particles.
We will first introduce the feature matrix and then describe the intra- and inter-particle attention modules.
3.2.1 Features from different types of algorithms
Considering the expressiveness and the searchability of the algorithmic space, we consider the update formulae of both point-based and population-based algorithms and choose the following four features for particle $i$ at iteration $t$:

gradient: $\nabla f(\mathbf{x}_i^t)$;

momentum: $\mathbf{m}_i^t$, an exponentially decaying weighted average of the historic gradients;

velocity: $\mathbf{w}_i^t - \mathbf{x}_i^t$;

attraction: the weighted average of $\mathbf{x}_j^t - \mathbf{x}_i^t$ over all $j$ such that $f(\mathbf{x}_j^t) < f(\mathbf{x}_i^t)$, with weight $\exp(-\gamma \lVert \mathbf{x}_j^t - \mathbf{x}_i^t \rVert^2)$ for particle $j$, where $\gamma$ is a hyperparameter.
These four features include two from point-based algorithms using gradients and two from population-based algorithms. Specifically, the first two are used in gradient descent and Adam. The third feature, velocity, comes from PSO, where $\mathbf{w}_i^t$ is the best position (with the lowest objective value) of particle $i$ in the first $t$ iterations. The last feature, attraction, is from the Firefly algorithm [29]. The attraction for particle $i$ is the weighted average of $\mathbf{x}_j^t - \mathbf{x}_i^t$ over all $j$ such that $f(\mathbf{x}_j^t) < f(\mathbf{x}_i^t)$, where the weight of particle $j$ is the Gaussian similarity between particles $i$ and $j$. For the particle with the smallest objective value, we simply set this feature vector to zero. In this paper, these hyperparameters are fixed without further optimization.
It is noteworthy that each feature vector is of dimension $n$, the dimension of the search space. Moreover, the update formula in each base algorithm is linear w.r.t. its corresponding feature. To learn a better update formula, we incorporate these features into our deep neural network model, which is described next.
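The four features can be assembled per particle as follows (a NumPy sketch; the array shapes, bandwidth parameter, and function names are illustrative, while the zero attraction for the best particle follows the text):

```python
import numpy as np

def features(X, grads, M, pbest, fvals, sigma=1.0):
    """Build the four n-dimensional features for every particle.

    X     : (k, n) positions          grads : (k, n) gradients at X
    M     : (k, n) running momentum   pbest : (k, n) per-particle best positions
    fvals : (k,)   objective values   sigma : Gaussian-similarity bandwidth
    Returns a (k, 4, n) tensor: gradient, momentum, velocity, attraction.
    """
    k, n = X.shape
    attraction = np.zeros((k, n))
    for i in range(k):
        better = fvals < fvals[i]                 # particles with lower f
        if better.any():
            diff = X[better] - X[i]               # x_j - x_i
            w = np.exp(-np.sum(diff**2, axis=1) / (2 * sigma**2))
            attraction[i] = (w[:, None] * diff).sum(0) / w.sum()
        # the best particle keeps a zero attraction vector, as in the text
    velocity = pbest - X
    return np.stack([grads, M, velocity, attraction], axis=1)
```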
3.2.2 Overall model architecture
Fig. 1a depicts the overall architecture of our proposed model. We use a population of LSTMs and design two attention modules: a feature-level ("intra-particle") attention and a sample-level ("inter-particle") attention. For particle $i$ at iteration $t$, the intra-particle attention module reweights each feature based on the context vector $\mathbf{h}_i^{t-1}$, which is the hidden state from the $i$-th LSTM in the last iteration. The reweighted features of all particles are fed into an inter-particle attention module, together with a distance-similarity matrix. The inter-particle attention module learns the information from the rest of the particles and lets it affect the update of particle $i$. The outputs of the inter-particle attention module are then sent into identical LSTMs for individual updates.
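Schematically, one iteration of the architecture can be sketched as below, with the intra-/inter-particle attention modules and the LSTMs passed in as callables (all names are placeholders; the module internals are described in Section 3.2.3):

```python
def meta_step(X, feats, hiddens, intra, inter, lstms):
    """One iteration of the meta-optimizer (schematic).

    feats[i]   : feature matrix F_i^t of particle i
    hiddens[i] : hidden state of the i-th LSTM from the previous iteration
    Returns the per-particle updates delta_x and the new hidden states.
    """
    # feature-level reweighting, conditioned on each particle's own context
    C = [intra(feats[i], hiddens[i]) for i in range(len(X))]
    # sample-level mixing across the whole population
    Ms = inter(C, X)
    deltas, new_hiddens = [], []
    for i, lstm in enumerate(lstms):
        d, h = lstm(Ms[i], hiddens[i])  # per-particle update and new state
        deltas.append(d)
        new_hiddens.append(h)
    return deltas, new_hiddens
```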
3.2.3 Attention mechanisms
For the intra-particle attention module, we use the idea from [2, 3, 8]. As shown in Fig. 1b, given that the $j$-th input feature of the $i$-th particle at iteration $t$ is $\mathbf{f}_{ij}^t$, we have:
$$e_{ij}^t = \mathbf{v}^\top \tanh\!\big(\mathbf{W}_f \mathbf{f}_{ij}^t + \mathbf{W}_h \mathbf{h}_i^{t-1}\big), \qquad a_{ij}^t = \mathrm{softmax}_j\big(e_{ij}^t\big),$$
where $\mathbf{W}_f$, $\mathbf{W}_h$, and $\mathbf{v}$ are learnable weights, $\mathbf{h}_i^{t-1}$ is the hidden state from the $i$-th LSTM in iteration $t-1$, $e_{ij}^t$ is the output of the fully-connected (FC) layer, and $a_{ij}^t$ is the output after the softmax layer. We then use $a_{ij}^t$ to reweight our input features:
$$\mathbf{c}_i^t = \sum_j a_{ij}^t \mathbf{f}_{ij}^t,$$
where $\mathbf{c}_i^t$ is the output of the intra-particle attention module for the $i$-th particle at iteration $t$.
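A sketch of this additive attention follows (the shapes of the hypothetical weight matrices are assumptions; the context is the previous hidden state):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def intra_attention(F, h, Wf, Wh, v):
    """Additive attention over one particle's features.

    F  : (4, n) feature matrix, one row per feature
    h  : (dh,)  previous LSTM hidden state (the context vector)
    Wf : (da, n), Wh : (da, dh), v : (da,) learnable weights
    Returns the attention weights and the weighted sum of the features.
    """
    scores = np.array([v @ np.tanh(Wf @ f + Wh @ h) for f in F])  # FC layer
    a = softmax(scores)                                           # softmax layer
    return a, a @ F
```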
For the inter-particle attention module, we model the update of each particle under the impact of the rest of the particles. Specific considerations are as follows.

The closer two particles are, the more they impact each other's update. Therefore, we construct a kernelized pairwise similarity matrix $\mathbf{Q}$ (column-normalized) as a weight matrix, whose element is $q_{ij} = \exp(-\lVert \mathbf{x}_i^t - \mathbf{x}_j^t \rVert^2)$.

The more similar two particles are in their intra-particle attention outputs ($\mathbf{c}_i^t$, i.e., local suggestions for updates), the more they impact each other's update. Therefore, we introduce another weight matrix $\mathbf{P}$ whose element is $p_{ij} = \exp\big((\mathbf{c}_i^t)^\top \mathbf{c}_j^t\big)$ (normalized after column-wise softmax).
As shown in Fig. 1b, the output of the inter-particle attention module for the $i$-th particle is:
$$\mathbf{m}_i^t = \mathbf{c}_i^t + \alpha \sum_{j \neq i} (\mathbf{Q} \circ \mathbf{P})_{ji}\, \mathbf{c}_j^t,$$
where $\alpha$ is a hyperparameter that controls the ratio of the contribution of the rest $k-1$ particles to the $i$-th particle. In this paper, $\alpha$ is set to 1 without further optimization.
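The two weight matrices and the mixing step can be sketched as follows (the exact kernel forms and normalization order are assumptions consistent with the description above):

```python
import numpy as np

def inter_attention(C, X, alpha=1.0):
    """Mix each particle's local suggestion with those of the other particles.

    C : (k, d) intra-particle attention outputs, one row per particle
    X : (k, n) particle positions
    alpha weights the contribution of the other k-1 particles (1 in the paper).
    """
    # distance-based kernel, column-normalized
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    Q = np.exp(-D2)
    Q /= Q.sum(axis=0, keepdims=True)
    # agreement between local suggestions, column-wise softmax normalization
    S = np.exp(C @ C.T)
    S /= S.sum(axis=0, keepdims=True)
    W = Q * S                     # combined weights, elementwise product
    np.fill_diagonal(W, 0.0)      # keep only the other k-1 particles
    return C + alpha * (W.T @ C)  # own suggestion plus the population's input
```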
3.2.4 Loss function, posterior estimation, and model training
Cumulative regret is a common meta-loss function: $\mathcal{L} = \sum_{t=1}^{T} f(\mathbf{x}^t)$. However, this loss function has two main drawbacks. First, it does not reflect any exploration. If the search algorithm used for training the optimizer does not employ exploration, it can easily be trapped in the vicinity of a local minimum. Second, for population-based methods, this loss function tends to drag all the particles to quickly converge to the same point.
To balance the exploration-exploitation tradeoff, we adopt the work from [7], which built a Bayesian posterior distribution over the global optimum $\mathbf{x}^*$ as $p(\mathbf{x}^* \mid \mathcal{D}^t)$, where $\mathcal{D}^t$ denotes the samples observed by iteration $t$: $\mathcal{D}^t = \{(\mathbf{x}^\tau, f(\mathbf{x}^\tau))\}_{\tau=1}^{t}$. We claim that, in order to reduce the uncertainty about the whereabouts of the global minimum, the next samples should be chosen to minimize the entropy of the posterior, $H\big(p(\mathbf{x}^* \mid \mathcal{D}^t)\big)$. Therefore, we propose a loss function for a single function $f$ as:
$$\ell(f) = \sum_{t=1}^{T} \Big[ \sum_{i=1}^{k} f(\mathbf{x}_i^t) + \lambda H\big(p(\mathbf{x}^* \mid \mathcal{D}^t)\big) \Big],$$
where $\lambda$ controls the balance between exploration and exploitation and $\boldsymbol{\theta}$ is the vector of model parameters.
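As an illustration of this loss, the posterior entropy can be approximated on a finite grid of candidate optima under a Boltzmann posterior (a discretized stand-in for the differential entropy; $\lambda$, the grid, and the function names are illustrative):

```python
import numpy as np

def boltzmann_entropy(f_hat, grid, rho=1.0):
    """Entropy of p(x*) ∝ exp(-rho * f_hat(x)), discretized on a grid."""
    logits = -rho * np.array([f_hat(x) for x in grid])
    logits -= logits.max()                   # numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return -(p * np.log(p + 1e-12)).sum()

def meta_loss_single(f, history, f_hat, grid, lam=1.0):
    """Cumulative objective values plus entropy regularization for one f.

    history : list over iterations; history[t] is a (k, n) array of samples.
    """
    loss = 0.0
    for X_t in history:
        loss += sum(f(x) for x in X_t)                 # exploitation term
        loss += lam * boltzmann_entropy(f_hat, grid)   # exploration term
    return loss
```

Minimizing the entropy term pushes the sampled population to shrink the uncertainty about the optimum's whereabouts rather than only chase low objective values.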
Following [7], the posterior over the global optimum is modeled as a Boltzmann distribution:
$$p(\mathbf{x}^* \mid \mathcal{D}^t) \propto \exp\big(-\rho \hat{f}(\mathbf{x})\big),$$
where $\hat{f}(\mathbf{x})$ is a function estimator and $\rho$ is the annealing constant. In the original work of [7], both $\hat{f}$ and $\rho$ are updated over iterations for active sampling. In our work, they are fixed, since the complete training sample paths are available at once.
Specifically, for a function estimator based on the samples in $\mathcal{D}^t$, we use a Kriging regressor [11], which is known to be the best linear unbiased estimator (BLUE):
$$\hat{f}(\mathbf{x}) = \mu + \boldsymbol{\kappa}^\top \big(\mathbf{K} + \sigma_n^2 \mathbf{I}\big)^{-1} (\mathbf{y} - \boldsymbol{\mu}),$$
where $\mu$ is the prior mean for $f(\mathbf{x})$ (fixed in this study); $\boldsymbol{\kappa}$ is the kernel vector with the $i$-th element being the kernel, a measure of similarity, between $\mathbf{x}$ and $\mathbf{x}^i$; $\mathbf{K}$ is the kernel matrix with the $(i,j)$-th element being the kernel between $\mathbf{x}^i$ and $\mathbf{x}^j$; $\mathbf{y}$ and $\boldsymbol{\mu}$ are the vectors consisting of $f(\mathbf{x}^i)$ and $\mu$, respectively; and $\sigma_n^2$ reflects the noise in the observations and is often estimated as the average training error (set at 2.1 in this study).
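A sketch of this estimator (simple kriging with a constant prior mean; the RBF kernel and its length scale are assumptions, while the noise variance 2.1 follows the text):

```python
import numpy as np

def kriging_predict(X, y, x_new, mu=0.0, length=1.0, noise=2.1):
    """Kriging estimate of f at x_new, given observed samples (X, y).

    mu is the constant prior mean, noise the observation-noise variance
    (2.1 in the paper); the RBF kernel and length scale are assumptions.
    """
    def kern(a, b):
        return np.exp(-np.sum((a - b) ** 2) / (2 * length ** 2))

    m = len(X)
    K = np.array([[kern(X[i], X[j]) for j in range(m)] for i in range(m)])
    K += noise * np.eye(m)                        # noisy observations
    kappa = np.array([kern(x, x_new) for x in X])
    return mu + kappa @ np.linalg.solve(K, y - mu)
```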
For $\rho$, we follow the annealing schedule in [7] with a one-step update, where $\rho^0$, the initial value of $\rho$, is set to 1 without further optimization; $h^0$ is the initial entropy of the posterior under $\rho^0$; and $n$ is the dimensionality of the search space.
In total, our meta-loss over $N$ training functions $f_1, \dots, f_N$ (analogous to $N$ training examples) with L2 regularization is
$$\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{N} \sum_{j=1}^{N} \ell(f_j) + \lambda' \lVert \boldsymbol{\theta} \rVert_2^2.$$
To train our model we use the optimizer Adam, which requires gradients. The first-order gradients are calculated numerically through TensorFlow, following [1]. We use a coordinate-wise LSTM to reduce the number of parameters. In our implementation the length of the LSTM is set to 20. For all experiments, the optimizer is trained for 10,000 epochs with 100 iterations in each epoch.
4 Experiments
We test our meta-optimizer on convex quadratic functions, non-convex test functions, and an optimization-based application with extremely noisy and rugged landscapes: protein docking.
4.1 Learn to optimize convex quadratic functions
In this case, we minimize a convex quadratic function:
$$f(\mathbf{x}) = \lVert \mathbf{A}\mathbf{x} - \mathbf{b} \rVert_2^2,$$
where $\mathbf{A}$ and $\mathbf{b}$ are parameters whose elements are sampled from i.i.d. normal distributions for the training set. We compare our algorithm with SGD, Adam, PSO, and DeepMind's LSTM (DM_LSTM) [1]. Since different algorithms have different population sizes, for a fair comparison we fix the total number of objective-function evaluations (sample updates) at 1,000 for all methods. The population size of our meta-optimizer and PSO is set to 4, 10, and 10 in the 2D, 10D, and 20D cases, respectively. During the testing stage, we sample another 128 pairs of $\mathbf{A}$ and $\mathbf{b}$ and evaluate the current best function value at each step, averaged over the 128 functions. We repeat the procedure 100 times to obtain statistically significant results. As seen in Fig. 2, our meta-optimizer performs better than DM_LSTM in the 2D, 10D, and 20D cases. Both meta-optimizers perform significantly better than the three baseline algorithms (except that PSO had similar convergence in 2D).
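The training instances can be generated as follows (assuming the standard quadratic form $f(\mathbf{x}) = \lVert \mathbf{A}\mathbf{x} - \mathbf{b} \rVert_2^2$; the seed is arbitrary):

```python
import numpy as np

def sample_quadratic(n, rng):
    """Draw one training instance f(x) = ||Ax - b||^2, with the elements of
    A and b sampled from i.i.d. standard normal distributions."""
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)
    return lambda x: float(np.sum((A @ x - b) ** 2))

rng = np.random.default_rng(0)          # arbitrary seed
f = sample_quadratic(10, rng)
value_at_origin = f(np.zeros(10))       # equals ||b||^2 for this instance
```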
Figure 2: The performance of different algorithms for quadratic functions in (a) 2D, (b) 10D, and (c) 20D. The mean and the standard deviation over 100 runs are evaluated every 50 function evaluations.
We also compare our meta-optimizer's performance with and without the guiding posterior in the meta-loss. As shown in the supplemental Fig. S1, including the posterior improves optimization performance, especially in higher dimensions. Meanwhile, posterior estimation in higher dimensions presents more challenges. The impact of posteriors will be further assessed in the ablation study.
4.2 Learn to optimize non-convex Rastrigin functions
We then test the performance on a non-convex test function, the Rastrigin function:
$$f(\mathbf{x}) = \sum_{i=1}^{n} \big[ x_i^2 - 10\cos(2\pi x_i) + 10 \big],$$
where $n$ is the dimension of the search space. We consider a broad family of similar functions as the training set:
$$f(\mathbf{x}) = \lVert \mathbf{A}\mathbf{x} - \mathbf{b} \rVert_2^2 + \alpha \sum_{i=1}^{n} \big[ c_i - \cos\big(2\pi (\mathbf{A}\mathbf{x} - \mathbf{b})_i \big) \big], \qquad (1)$$
where $\mathbf{A}$, $\mathbf{b}$, and $\mathbf{c}$ are parameters whose elements are sampled from i.i.d. normal distributions. It is obvious that Rastrigin is a special case in this family with $\mathbf{A} = \mathbf{I}$, $\mathbf{b} = \mathbf{0}$, $\mathbf{c} = \mathbf{1}$, and $\alpha = 10$.
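A sketch of this family (the exact parameterization of Eq. 1 is reconstructed and should be treated as an assumption; with $\mathbf{A}=\mathbf{I}$, $\mathbf{b}=\mathbf{0}$, $\mathbf{c}=\mathbf{1}$, and $\alpha=10$ it recovers the standard Rastrigin function):

```python
import numpy as np

def rastrigin_family(A, b, c, alpha=10.0):
    """f(x) = ||Ax - b||^2 + alpha * sum_i (c_i - cos(2*pi*(Ax - b)_i))."""
    def f(x):
        z = A @ x - b
        return float(z @ z + alpha * np.sum(c - np.cos(2 * np.pi * z)))
    return f

# A = I, b = 0, c = 1, alpha = 10 gives the standard 2D Rastrigin,
# whose global minimum is f(0) = 0; alpha = 0 gives a convex quadratic.
f = rastrigin_family(np.eye(2), np.zeros(2), np.ones(2))
```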
During the testing stage, 100 i.i.d. trajectories are generated in order to reach statistically significant conclusions. The population size of our meta-optimizer and PSO is set to 4, 10, and 10 for 2D, 10D, and 20D, respectively. The results are shown in Fig. 3. In the 2D case, our meta-optimizer and PSO perform about the same, while DM_LSTM performs much worse. In the 10D and 20D cases, our meta-optimizer outperforms all other algorithms. Interestingly, PSO is the second best among all algorithms, which indicates that population-based algorithms have unique advantages here.
4.3 Transferability: Learning to optimize non-convex functions from convex optimization
We also examine the transferability from convex to non-convex optimization. The hyperparameter $\alpha$ in the Rastrigin family (Eq. 1) controls the level of ruggedness of the training functions: $\alpha = 0$ corresponds to a convex quadratic function and $\alpha = 10$ to the rugged Rastrigin function. Therefore, we choose three different values of $\alpha$ (0, 5, and 10) to build training sets and test the three resulting models on the 10D Rastrigin function. From the results in the supplemental Fig. S2, our meta-optimizer's performance improves when it is trained with increasing $\alpha$. The meta-optimizer trained with $\alpha = 0$ made limited progress over iterations, which indicates the difficulty of learning from convex functions to optimize non-convex rugged functions. The one trained with $\alpha = 10$ saw significant improvement.
4.4 Interpretation of learned update formula
In an effort to rationalize the learned update formula, we choose the 2D Rastrigin test function for the interpretation analysis. We plot sample paths of our algorithm, PSO, and Gradient Descent (GD) in Fig. 4a. Our algorithm finally reaches the funnel (or valley) containing the global optimum ($\mathbf{x}^* = \mathbf{0}$), while PSO finally reaches a suboptimal funnel. At the beginning, samples of our meta-optimizer are more diverse due to the entropy control in the loss function. In contrast, GD is stuck in a local minimum close to its starting point after 80 samples.
To further show which factor contributes the most to each update, we plot the feature-weight distribution over the first 20 iterations. Since, for particle $i$ at iteration $t$, the output of its intra-particle attention module is a weighted sum of its four features, $\mathbf{c}_i^t = \sum_j a_{ij}^t \mathbf{f}_{ij}^t$, we sum $a_{ij}^t$ over all particles $i$ for each feature $j$. The final (normalized) weight distribution over the four features, reflecting the contribution of each feature at iteration $t$, is shown in Fig. 4b. In the first 6 iterations, the population-based features contribute the most to the updates. Point-based features start to play an important role later.
Finally, we examine, in the inter-particle attention module, the extent to which particles work collaboratively or independently. To show this, we plot the percentage contribution of the diagonal part of $\mathbf{Q} \circ \mathbf{P}$, i.e., $\mathrm{tr}(\mathbf{Q} \circ \mathbf{P}) / \sum_{i,j} (\mathbf{Q} \circ \mathbf{P})_{ij}$ ($\circ$ denotes the elementwise product), as shown in Fig. 4c. At the beginning, particles work more collaboratively. With more iterations, particles become more independent. However, we note that the trace (reflecting self-impacts) contributes 67%-69% over iterations and the off-diagonals (impacts from other particles) contribute above 30%, which demonstrates the importance of collaboration, a unique advantage of population-based algorithms.
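The plotted quantity is the share of total attention weight on the diagonal of $\mathbf{Q} \circ \mathbf{P}$, which can be computed as follows (the toy matrix is for illustration only):

```python
import numpy as np

def diagonal_share(Q, P):
    """Fraction of total attention weight on the diagonal of Q ∘ P
    (self-impact versus impact from the other particles)."""
    W = Q * P                       # elementwise (Hadamard) product
    return np.trace(W) / W.sum()

W_demo = np.array([[0.6, 0.2],
                   [0.4, 0.8]])
share = diagonal_share(W_demo, np.ones((2, 2)))   # 0.7
```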
4.5 Ablation study
Understanding how and why our algorithm outperforms DM_LSTM is important for unveiling its underlying mechanism. We therefore performed an ablation study to progressively show each part's contribution. Starting from the DM_LSTM baseline (B1), we incrementally crafted four algorithms: running DM_LSTM $k$ times under different initializations and choosing the best solution (B2); using $k$ independent particles, each with the two point-based features, the intra-particle attention module, and the hidden state (B3); adding the two population-based features and the inter-particle attention module to B3 so as to convert the independent particles into a swarm (B4); and eventually, adding the entropy term to the meta-loss of B4, resulting in our proposed model.
We tested the five algorithms (B1-B4 and the proposed one) on 10D and 20D Rastrigin functions with the same settings as in Section 4.2. We compare the minimum function values returned by these algorithms in the table below (reported are means ± standard deviations over 100 runs, each using 1,000 function evaluations).
Dimension | B1 | B2 | B3 | B4 | Proposed
10 | 55.4±13.5 | 48.4±10.5 | 40.1±9.4 | 20.4±6.6 | 12.3±5.4
20 | 140.4±10.2 | 137.4±12.7 | 108.4±13.4 | 48.5±7.1 | 43.0±9.2
Our key observations are as follows. i) B1 vs. B2: the performance gap is marginal, which shows that our performance gain is not simply due to having independent runs; ii) B2 vs. B3 and B3 vs. B4: whereas including intra-particle attention in B3 already notably improves the performance compared to B2, including the population-based features and inter-particle attention in B4 results in the largest performance boost. This confirms that our algorithm benefits mostly from the attention mechanisms; iii) Proposed vs. B4: adding the entropy of the posterior gains further, thanks to its balancing of exploration and exploitation during search.
4.6 Application to protein docking
We bring our meta-optimizer to a challenging real-world application. In computational biology, structural knowledge about how proteins interact with each other is critical but remains relatively scarce [20]. Protein docking helps close this gap by computationally predicting the 3D structures of protein-protein complexes given individual proteins' 3D structures or 1D sequences [24]. Ab initio protein docking represents a major challenge of optimizing a noisy and costly function in a high-dimensional conformational space [7].
Mathematically, the problem of ab initio protein docking can be formulated as optimizing an extremely rugged energy function: the Gibbs binding free energy for a given conformation. We calculate the energy function in a CHARMM 19 force field as in [19] and shift it so that the energy is zero at the origin of the search space. We parameterize the search space as in [7], so that the resulting objective is fully differentiable in it. For computational efficiency and batch training, we only consider 100 interface atoms. We choose a training set of 25 protein-protein complexes from the protein docking benchmark set 4.0 [15] (see Supp. Table S1 for the list), each of which has 5 starting points (the top-5 models from ZDOCK [21]). In total, our training set includes 125 instances. During testing, we choose 3 complexes (with 1 starting model each) of different levels of docking difficulty. For comparison, we also use the training set from Eq. 1 (the Rastrigin family). All methods, including PSO and both versions of our meta-optimizer, use the same number of particles and 40 iterations in the testing stage.
As seen in Fig. 5, both meta-optimizers achieve lower-energy predictions than PSO, and the performance gains increase as the docking difficulty level increases. The meta-optimizer trained on other protein-docking cases performs similarly to the one trained on the Rastrigin family in the easy case and outperforms the latter in the difficult case.
5 Conclusion
Designing a well-behaved optimization algorithm for a specific problem is a laborious task. In this paper, we extend point-based meta-optimizers into a population-based meta-optimizer, where update formulae for a sample population are jointly learned in the space of both point-based and population-based algorithms. To balance exploitation and exploration, we introduce the entropy of the posterior over the global optimum into the meta-loss, together with the cumulative regret, to guide the search of the meta-optimizer. We further embed intra- and inter-particle attention modules to interpret each update. We apply our meta-optimizer to quadratic functions, Rastrigin functions, and a real-world challenge: protein docking. The empirical results demonstrate that our meta-optimizer outperforms competing algorithms. The ablation study shows that the performance improvement is directly attributable to our algorithmic innovations, namely the population-based features, the intra- and inter-particle attentions, and the posterior-guided meta-loss.
Acknowledgments
This work is in part supported by the National Institutes of Health (R35GM124952 to YS). Part of the computing time is provided by Texas A&M High Performance Research Computing.
References
 [1] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, 2016.
 [2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
 [3] Nitin Bansal, Xiaohan Chen, and Zhangyang Wang. Can we gain more from orthogonality regularizations in training deep networks? In Advances in Neural Information Processing Systems, pages 4261–4271, 2018.
 [4] Samy Bengio, Yoshua Bengio, and Jocelyn Cloutier. On the search for new learning rules for ANNs. Neural Processing Letters, 2(4):26–30, 1995.
 [5] Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, pages 6–8. Univ. of Texas, 1992.
 [6] Y. Bengio, S. Bengio, and J. Cloutier. Learning a synaptic learning rule. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume ii, page 969, July 1991.
 [7] Yue Cao and Yang Shen. Bayesian active learning for optimization and uncertainty quantification in protein docking. arXiv preprint arXiv:1902.00067, 2019.
 [8] Tianlong Chen, Shaojin Ding, Jingyi Xie, Ye Yuan, Wuyang Chen, Yang Yang, Zhou Ren, and Zhangyang Wang. ABD-Net: Attentive but diverse person re-identification. ICCV, 2019.
 [9] Xiaohan Chen, Jialin Liu, Zhangyang Wang, and Wotao Yin. Theoretical linear convergence of unfolded ista and its practical weights and thresholds. In Advances in Neural Information Processing Systems, pages 9061–9071, 2018.
 [10] Yutian Chen, Matthew W Hoffman, Sergio Gómez Colmenarejo, Misha Denil, Timothy P Lillicrap, Matt Botvinick, and Nando de Freitas. Learning to learn without gradient descent by gradient descent. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 748–756. JMLR.org, 2017.
 [11] JeanPaul Chilès and Pierre Delfiner. Geostatistics: Modeling Spatial Uncertainty, 2nd Edition. 2012.
 [12] Jeffrey J. Gray, Stewart Moughon, Chu Wang, Ora Schueler-Furman, Brian Kuhlman, Carol A. Rohl, and David Baker. Protein–Protein Docking with Simultaneous Optimization of Rigid-body Displacement and Side-chain Conformations. Journal of Molecular Biology, 2003.
 [13] Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on International Conference on Machine Learning, pages 399–406. Omnipress, 2010.
 [14] Harry F Harlow. The formation of learning sets. Psychological review, 56(1):51, 1949.
 [15] Howook Hwang, Thom Vreven, Joël Janin, and Zhiping Weng. Protein–Protein Docking Benchmark Version 4.0. Proteins, 78(15):3111–3114, November 2010.
 [16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [17] Ke Li and Jitendra Malik. Learning to optimize. arXiv preprint arXiv:1606.01885, 2016.
 [18] Jialin Liu, Xiaohan Chen, Zhangyang Wang, and Wotao Yin. Alista: Analytic weights are as good as learned weights in lista. ICLR, 2019.
 [19] Iain H. Moal and Paul A. Bates. SwarmDock and the Use of Normal Modes in Protein–Protein Docking. International Journal of Molecular Sciences, 11(10):3623–3648, September 2010.
 [20] Roberto Mosca, Arnaud Céol, and Patrick Aloy. Interactome3d: adding structural details to protein networks. Nature methods, 10(1):47, 2013.
 [21] Brian G. Pierce, Kevin Wiehe, Howook Hwang, BongHyun Kim, Thom Vreven, and Zhiping Weng. ZDOCK server: interactive docking prediction of protein–protein complexes and symmetric multimers. Bioinformatics, 30(12):1771–1773, 02 2014.
 [22] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
 [23] Ernest K Ryu, Jialin Liu, Sicheng Wang, Xiaohan Chen, Zhangyang Wang, and Wotao Yin. Plugandplay methods provably converge with properly trained denoisers. ICML, 2019.
 [24] Graham R Smith and Michael JE Sternberg. Prediction of protein–protein interactions by docking methods. Current opinion in structural biology, 12(1):28–35, 2002.

 [25] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
 [26] Zhangyang Wang, Qing Ling, and Thomas S Huang. Learning deep ℓ0 encoders. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
 [27] Lewis B Ward. Reminiscence and rote learning. Psychological Monographs, 49(4):i, 1937.
 [28] Olga Wichrowska, Niru Maheswaranathan, Matthew W Hoffman, Sergio Gomez Colmenarejo, Misha Denil, Nando de Freitas, and Jascha Sohl-Dickstein. Learned optimizers that scale and generalize. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3751–3760. JMLR.org, 2017.
 [29] XinShe Yang. Firefly algorithms for multimodal optimization. In Proceedings of the 5th International Conference on Stochastic Algorithms: Foundations and Applications, SAGA’09, pages 169–178, Berlin, Heidelberg, 2009. SpringerVerlag.
 [30] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.