There is a growing interest in the machine learning community in meta-learning, i.e. learning to learn Thrun and Pratt (1998). Recently an influential model-agnostic meta-learning (MAML) algorithm was proposed for the fast adaptation of parameters in neural networks Finn et al. (2017). It uses gradient descent to learn the reference (initial) parameter values of a neural network from which new parameters can most rapidly be learned to solve a sample of tasks from a distribution of tasks. It requires a differentiable learning procedure so that the outer loop can backpropagate into the reference parameter values, and even then the number of gradient steps that can be taken in the inner learning loop is limited before the second-order gradient computations become intractable. Meta-learning in this way achieves state-of-the-art few-shot learning, for example by allowing a reinforcement learning algorithm to learn, within a few gradient updates, the optimal speed or direction of a simulated cheetah or four-legged robot based only on reward.
Over a hundred years ago, a similar effect was proposed by J. Mark Baldwin Baldwin (1896) to explain how evolution could deal with irreducibly complex adaptations without the need for Lamarckian information flow Jablonka and Lamb (2014). John Maynard Smith described the effect as follows: “If individuals vary genetically in their capacity to learn, or to adapt developmentally, then those most able to adapt will leave most descendants, and the genes responsible will increase in frequency. In a fixed environment, when the best thing to learn remains constant, this can lead to the genetic determination of a character that, in earlier generations, had to be acquired afresh in each generation” Maynard Smith (1987). In this formulation the Baldwin effect is really two effects, or a trade-off between two factors: initially genetically specified phenotypic plasticity (variance), followed by genetic accommodation of the induced trait (bias). Turney writes: “…the Baldwin effect has two aspects. First, lifetime learning in individuals can, in some situations, accelerate evolution. Second, learning is expensive. Therefore, in relatively stable environments, there is a selective pressure for the evolution of instinctive behaviors” Turney (2002).
Here we compare these two algorithms – MAML and the Baldwin effect – on the same tasks. Note that unlike in MAML, our evolutionary experiments involve no gradient-based meta-learning of the hyperparameters and initial parameters; these elements are subject only to standard Darwinian evolution. We show that the Baldwin effect is competitive with MAML, biasing a learning algorithm to fit the distribution of tasks encountered during evolution, without some of the restrictions of Finn et al. (2017) (e.g. requiring direct access to gradients).
In the framework of deep reinforcement learning (RL), evolution can be complementary to gradient descent by specifying and evolving the initial neural network parameters and hyperparameters of a learning algorithm Castillo et al. (2006); Jaderberg et al. (2017). Throughout the course of learning, phenotypic plasticity is expressed as gradient updates are made to the parameters and the model learns to perform the task. This has the effect of smoothing the fitness landscape Turney (2002). In the case of Baldwinian evolution, these updated weights are forgotten by the next generation, which instead inherits the initial weights and hyperparameters, with possible mutation, whereas in Lamarckian evolution, the final, learned weights are evolved and passed on (see Figure 1). We refer to Darwinian evolution as the case where there is no learning within a lifetime. Lamarckianism closely resembles Population Based Training (PBT) Jaderberg et al. (2017), a method for online hyperparameter evolution, with the exception that in PBT the learned parameters are not mutated but inherited unchanged; only the hyperparameters are mutated. This method, while highly successful on a number of supervised, unsupervised and reinforcement learning tasks, has no incentive to learn a representation that can easily be adapted to solve a number of different tasks in a meta-learning setup. Control experiments in Finn et al. (2017) suggest that sequentially training (fine-tuning) a model on different tasks does not lead to competitive performance in meta-learning.
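To make the three inheritance schemes concrete, the following is a minimal sketch under illustrative assumptions (the quadratic task, the parameter names, and the mutation scales are placeholders of ours, not the paper's actual setup):

```python
import copy
import random

class QuadraticTask:
    """Hypothetical stand-in for a lifetime task: minimise (w - target)^2."""
    def __init__(self, target):
        self.target = target
    def grad(self, params):
        return 2.0 * (params["w"] - self.target)

def lifetime(genome, task, steps=10):
    """Phenotypic plasticity: gradient learning starting from the inherited weights."""
    params = dict(genome["init"])
    for _ in range(steps):
        params["w"] -= genome["lr"] * task.grad(params)
    return params

def offspring(parent, learned, mode, sigma=0.02):
    child = copy.deepcopy(parent)
    if mode == "lamarckian":
        child["init"] = dict(learned)   # learned weights are inherited
    # Baldwinian (and Darwinian) offspring keep the parent's *initial*
    # weights; whatever was learned during the lifetime is forgotten.
    child["init"]["w"] += random.gauss(0.0, sigma)              # mutate weights
    child["lr"] = abs(child["lr"] + random.gauss(0.0, 0.001))  # ...and hyperparameters
    return child
```

In the Darwinian case `lifetime` is simply skipped; in the Baldwinian case it affects fitness (and hence selection) but never the genome.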
There is already evidence that the Baldwin effect has a role to play in machine learning because it is capable of evolving inductive bias in the form of the initial parameters and hyperparameters of a learning algorithm Castillo et al. (2006); Turney (2002). Here we test the more specific hypothesis that the Baldwin effect provides a way to evolve agents for few-shot data-efficient fast learning on a task distribution (previously examined mainly in classification settings Ravi and Larochelle (2016); Vinyals et al. (2016); Santoro et al. (2016), but see also Wang et al. (2017); Duan et al. (2016) for applications to RL), across a wide variety of learning domains. The Baldwin effect is thus used here as an algorithm for meta-learning, which results in a representation that is fit to a distribution of tasks, i.e. learning the structure of the various problems to be encountered rather than the specifics Thrun and Pratt (1998). The genome to be evolved is shaped by the task distribution, whereas the learning algorithm itself learns task specifics. The effect arises whenever there is a cost to learning imposed by the speed at which learning must occur. Such costs often arise in nature, for example in a co-evolutionary ecosystem where a newly born organism must rapidly learn to run so it can escape predators.
Our main experimental contributions are as follows. First, we show that the Baldwin effect and MAML are comparable on a supervised learning task. Secondly, we demonstrate that the Baldwin effect can be used where MAML cannot, e.g. where the genotype is non-differentiable, such as when we evolve the macro-actions used by a discrete-action RL algorithm, or the algorithm’s discrete hyperparameters themselves. Thirdly, we examine how genetic accommodation takes place in real deep neural networks undergoing the Baldwin effect. Fourthly, we examine two task distributions: one where Baldwinian learning is superior to Lamarckian learning, and one where the reverse holds.
2 Related Work
While deep learning systems trained with traditional supervised or reinforcement learning methods have achieved remarkable success in a variety of tasks, they perform poorly when only a small amount of data is available. Meta-learning aims to mitigate this limitation by broadening the learner’s scope from a single task to a distribution of related tasks Schmidhuber et al. (1996); Thrun and Pratt (1998). The goal of meta-learning is then to learn a learning strategy that generalizes to similar but unseen tasks from a given task distribution. Much of the interest in meta-learning comes from the problem of one-shot learning in image classification, which consists of learning a new class from a single labelled example Lake et al. (2011). Several approaches address this problem with specialized neural network architectures that learn an embedding space in which new examples can be compared effectively, for instance Siamese networks Koch et al. (2015) or recurrence with attention mechanisms Vinyals et al. (2016). These approaches achieve very good results in one-shot visual learning but cannot easily be employed in other settings, such as reinforcement learning. Another approach to meta-learning is to train a recurrent, memory-augmented learner to quickly adapt to new tasks from a given task distribution. Such networks have been applied to few-shot image recognition Santoro et al. (2016) and reinforcement learning Duan et al. (2016); Wang et al. (2017). More recent approaches propose to include the inductive bias of optimization-based learning in a meta-learner Husken and Goerick (2000); Ravi and Larochelle (2016); Finn et al. (2017). Particularly related to this work is the model-agnostic meta-learning (MAML) approach Finn et al. (2017), which aims to learn an initial set of parameters that can be rapidly adapted (via gradient descent) to solve a given task from the task distribution. We describe this method in detail in section 3.1.
Hinton and Nowlan showed in their 1987 paper that the Baldwin effect works in the toy example of a needle-in-a-haystack binary optimization problem of 20 alleles (bits) using random search, where the random search distribution is encoded by evolution Hinton and Nowlan (1987). The emphasis in their paper was to show how learning can smooth a (single-task) rugged fitness landscape. The generality of that work has recently been called into question by a paper which claims that the scope of the effect is severely limited Santos et al. (2015). Specifically, Santos et al. showed that under certain task parameter settings (i.e. certain initial ratios of correct and incorrect alleles) standard Darwinian evolution finds the needle in the haystack in the same number of generations as the Baldwin effect on average. However, a more recent paper by the same authors has shown that, when actually using Hinton and Nowlan’s original conditions, $p(1) = 0.25$, $p(0) = 0.25$, and $p(?) = 0.5$ (where ? refers to an allele that does random search), the Baldwin effect is indeed significantly faster than Darwinian evolution Fontanari and Santos (2017). In short, the Baldwin effect is known to be possible in this toy task only in a subset of the parameter conditions; here we wish to provide convincing empirical support for the necessity of the Baldwin effect in more substantial and contemporary learning tasks than bit-learning problems using random search.
The Baldwin effect in neural network learning has been investigated in several papers following the original work of Hinton and Nowlan Downing (2010). Notably, Keesing and Stork used a genetic algorithm to evolve the initial weights of a neural network for digit classification, and found that the extent of genetic accommodation by the Baldwin effect depended on the amount of learning: too much learning and evolution was slowed down because there was too little selection pressure on the initial weights (as learning can do well from any starting position); too little learning and evolution was slowed down because fitness landscapes were not sufficiently smoothed Keesing and Stork (1991). Interestingly, they found that randomly sampling the number of gradient update steps from a distribution, rather than using a fixed number, significantly increased the rate of accommodation, because that way the cost of learning was always felt by selection. Bullinaria evolved learning-rate schedules over a lifetime, showing the evolution of developmental critical periods tailored to specific problems Bullinaria (2002). Neural network learning is only one kind of phenotypic plasticity. Turney used a genetic algorithm to evolve a population of biases for a decision-tree induction algorithm for classification Turney (1995). Cecconi et al. evolved a hyperparameter determining how much learning, by imitation of a parent, an offspring will do in a co-evolutionary system Cecconi et al. (1995). Anderson modelled how the adaptive immune system could facilitate natural antibody production by the Baldwin effect Anderson (1996). Bull argued that the haploid-diploid cycle is a primitive example of learning and so subject to the Baldwin effect Bull (2016). This paper extends the existing literature by applying the Baldwin effect to deep reinforcement learning algorithms on task distributions.
3 Methods
In this section, we begin by describing the three algorithm families that we compare. We then describe the three tasks that we solve, before finally outlining the details of the models that we train and evolve.
Model-Agnostic Meta-Learning. Given a distribution over tasks $p(\mathcal{T})$ and a neural network with parameters collectively denoted $\theta$, MAML aims to learn a set of reference parameters $\theta$ such that one or a small number of gradient descent steps, computed using a small amount of data for a given task $\mathcal{T}_i$ drawn from the distribution, leads to effective generalization on that task.
The objective function for MAML is given by
$$\min_{\theta} \; \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})} \big[ \mathcal{L}_{\mathcal{T}_i}(\theta'_i) \big], \tag{1}$$
where the expectation is taken over the task distribution, $\mathcal{L}_{\mathcal{T}_i}$ represents the loss corresponding to task $\mathcal{T}_i$, and $\theta'_i$ are the parameters adapted to fit representative training examples for this task. The task-specific learning is obtained via gradient descent,
$$\theta'_i = \theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(\theta), \tag{2}$$
where $\alpha$ is the learning rate. We use a single gradient descent step for ease of notation, but multiple steps can be used instead. The outer loss in (1) evaluates the generalization of $\theta'_i$ on a small amount of validation data for the $i$th task. The reference set of parameters $\theta$ is found by minimizing (1) via stochastic gradient descent. The procedure is given in Algorithm 1. Note that higher-order gradients are required to compute the parameter update.
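As a deliberately tiny illustration of the role of that higher-order term (our toy example, not the paper's implementation), consider a scalar model in which task $\mathcal{T}_i$ has loss $\mathcal{L}_{\mathcal{T}_i}(\theta) = (\theta - c_i)^2$; the chain rule through the inner update then contributes an explicit $(1 - 2\alpha)$ factor:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.1, 0.05      # inner and outer (meta) learning rates
theta = 0.0                  # reference parameter

def sample_task():
    # Toy task distribution: task i is "predict the constant c_i",
    # with loss L_i(theta) = (theta - c_i)^2.
    return rng.uniform(1.0, 3.0)

for step in range(2000):
    meta_grad = 0.0
    for _ in range(5):                                # meta-batch of tasks
        c = sample_task()
        # Inner step: theta' = theta - alpha * dL/dtheta
        theta_prime = theta - alpha * 2.0 * (theta - c)
        # Outer gradient of L(theta') w.r.t. theta differentiates
        # *through* the inner update: dtheta'/dtheta = (1 - 2*alpha),
        # the higher-order term MAML needs.
        meta_grad += 2.0 * (theta_prime - c) * (1.0 - 2.0 * alpha)
    theta -= beta * meta_grad / 5.0

print(theta)  # converges near 2.0, the centre of the task distribution
```

Dropping the `(1 - 2*alpha)` factor gives the first-order approximation; here it changes only the effective step size, but for deeper models it changes the meta-gradient direction.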
Genetic algorithms (GAs) are general-purpose optimization algorithms inspired by the biological processes of mutation and selection. In our work, we use two flavors of genetic algorithm: a steady-state GA and a generational GA Goldberg and Deb (1991). In section 3.2 we introduce a sinusoid-fitting task and a physics task domain. The sinusoid-fitting experiments use a generational GA with population size 100 and rank-based selection. The physics-based RL experiments use an asynchronous parallel steady-state GA with population size 500 and tournament size 10. The Baldwinian evolution algorithm hybridizes the GA and gradient-based learning as shown in Algorithm 2. We compare the Baldwinian algorithm with two baselines: standard Darwinian evolution and Lamarckian evolution.
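A serial sketch of the steady-state GA with tournament selection follows (the paper's asynchronous parallelism is omitted, and the function names and default sizes are ours):

```python
import random

def steady_state_ga(init, mutate, fitness, pop_size=50,
                    tournament=10, steps=2000):
    """Steady-state GA sketch: repeatedly draw a random tournament and
    replace its worst member with a mutated copy of its best.
    For Baldwinian evolution, `fitness` runs a lifetime of gradient
    learning and returns the resulting score; the learned weights are
    evaluated but never written back into the genome."""
    pop = [mutate(init) for _ in range(pop_size)]
    fit = [fitness(g) for g in pop]
    for _ in range(steps):
        idx = random.sample(range(pop_size), tournament)
        best = max(idx, key=lambda i: fit[i])
        worst = min(idx, key=lambda i: fit[i])
        pop[worst] = mutate(pop[best])          # winner's genome, mutated
        fit[worst] = fitness(pop[worst])
    i = max(range(pop_size), key=lambda j: fit[j])
    return pop[i], fit[i]
```

For example, maximizing `-(g - 3)**2` over scalar genomes with Gaussian mutation drives the best genome towards 3.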
With some small modifications to Algorithm 2 we can obtain a learning process for a single task, or for a continual/multi-task learning setting, by allowing all gradient updates to act successively on the same set of parameters. In this setting we can also consider a further variant that uses Lamarckian inheritance (changing line 15 to update the population using the parameters from the end of the short run of gradient optimization).
Natural Evolution Strategies (NES) are a family of continuous black-box optimization algorithms that maintain and adapt a (Gaussian) search distribution in order to maximize the expected fitness under that distribution Wierstra et al. (2008, 2014); they update the distribution parameters in the direction of the (natural) policy gradient, as estimated from the fitnesses of a population of samples. Specifically, we employ the variant called separable NES (SNES) Schaul et al. (2011), which models only the element-wise variances $\sigma_i^2$ instead of the full covariance matrix $\Sigma$, because its linear complexity enables it to scale to the high-dimensional spaces required for our experiments.
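A compact sketch of the SNES update, with the standard rank-based fitness shaping (the hyperparameter defaults follow Schaul et al. (2011); the function signature is ours):

```python
import numpy as np

def snes(fitness, mu, sigma, iters=300, pop=20, eta_mu=1.0, eta_sigma=None):
    """Separable NES: a Gaussian search distribution with per-dimension
    variances, updated along the natural gradient estimated from ranked
    samples."""
    d = len(mu)
    if eta_sigma is None:
        eta_sigma = (3.0 + np.log(d)) / (5.0 * np.sqrt(d))
    # Rank-based utilities (fitness shaping), best rank first.
    ranks = np.arange(1, pop + 1)
    u = np.maximum(0.0, np.log(pop / 2 + 1) - np.log(ranks))
    u = u / u.sum() - 1.0 / pop
    for _ in range(iters):
        s = np.random.randn(pop, d)                   # standard-normal samples
        z = mu + sigma * s                            # candidate solutions
        order = np.argsort([-fitness(x) for x in z])  # sort best-first
        s = s[order]
        mu = mu + eta_mu * sigma * (u @ s)                    # natural gradient on mean
        sigma = sigma * np.exp(0.5 * eta_sigma * (u @ (s**2 - 1)))  # ...and variances
    return mu, sigma
```

On a 5-dimensional sphere objective `-sum(x**2)`, a few hundred iterations drive the mean close to the optimum while the per-dimension variances shrink.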
A supervised regression problem distribution and two physics-based reinforcement learning task distributions are used to compare the algorithms.
Sinusoid-fitting task: In this supervised task, in any one lifetime the agent must fit by regression a single sinusoid drawn from a distribution of phases and amplitudes. If evolution were to encode a non-plastic neural network, it could do no more than evolve the mean sinusoid of the distribution; with lifetime learning, the initial function at birth can be modified to fit the sampled sinusoid encountered in any particular lifetime. In this case, the Baldwin effect would be expected to take place. This leads to a different perspective on the Baldwin effect from that taken by Hinton and Nowlan and others: its role is not primarily to smooth fitness landscapes in otherwise unsolvable adaptive problems, but to meta-learn distributions of adaptive problems that would be entirely unsolvable by evolution alone without phenotypic plasticity, and then to encode these distributions genetically to produce faster learning.
We compare the performance of MAML, NES, and the generational GA on fitting sinusoids. In each generation, we sample a batch of different sine waves, and for each wave select separate sets of points for training and testing. The amplitudes and phases of the sine waves were sampled uniformly from fixed ranges. In one fitness evaluation, we perform a number of gradient descent steps for each sine wave using the training points, evaluate performance as the mean squared error (MSE) on the test data, and average the results over the different sine waves. For NES and the GA, the fitness is the final MSE for that task after the gradient updates, computed on a separate sample of data from that trained on. In both our models (the generational GA and NES) the different genotypes in a given generation are evaluated using the same data, so that the amount of data our models see after some fixed number of generations equals the data the MAML baseline sees after the same number of meta-updates.
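The per-genotype fitness evaluation can be sketched as follows. To keep the sketch self-contained we substitute a tiny sine/cosine feature regression for the paper's two-hidden-layer network; the model, step counts, and learning rate below are illustrative placeholders:

```python
import numpy as np

def predict(w, x):
    # Stand-in model (not the paper's MLP): a*sin(x) + b*cos(x) can
    # represent any amplitude/phase exactly.
    return w["a"] * np.sin(x) + w["b"] * np.cos(x)

def sgd_step(w, x, y, lr):
    err = predict(w, x) - y
    return {"a": w["a"] - lr * np.mean(2 * err * np.sin(x)),
            "b": w["b"] - lr * np.mean(2 * err * np.cos(x))}

def sine_fitness(genome, tasks, inner_steps=10, lr=0.5):
    """Baldwinian fitness of one genotype: negated mean test MSE after a
    short lifetime of gradient learning, with weights reset to the
    inherited values for every task."""
    total = 0.0
    for amp, phase, x_tr, x_te in tasks:
        w = dict(genome)                       # reset to inherited weights
        y_tr = amp * np.sin(x_tr + phase)
        for _ in range(inner_steps):
            w = sgd_step(w, x_tr, y_tr, lr)
        y_te = amp * np.sin(x_te + phase)
        total += np.mean((predict(w, x_te) - y_te) ** 2)
    return -total / len(tasks)
```

Evolution then selects on `sine_fitness`, which rewards genomes whose inherited weights support rapid within-lifetime fitting.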
Physics simulation reinforcement learning tasks. In reinforcement learning, the goal of few-shot meta-learning is to enable an agent to quickly acquire a policy for a new task based on training on the same distribution of tasks. For example, an agent might learn to quickly run at a certain desired target speed or direction. We constructed two sets of tasks based on those used in Finn et al Finn et al. (2017). One fitness evaluation consisted of 10 independent episodes with different task parameters. For the Baldwinian training condition, the parameters were reset to the inherited parameters at the start of each episode, and before inheritance. With Lamarckian training, the parameters were not reset between episodes and were inherited at the end of the final episode. In the Darwinian case there was no learning (gradient updates) in any of the 10 episodes. Fitness was defined as the sum of rewards obtained over all 10 episodes, providing an implicit selection pressure to learn quickly.
Two types of high-dimensional locomotion task were investigated using the MuJoCo simulator Todorov et al. (2012), both based on the planar cheetah, which is required to run either in a particular direction (goal direction) or at a particular velocity (goal velocity). In the goal-velocity experiments, the reward was the negative absolute difference between the agent's current velocity and a target velocity. The target velocity for each episode was chosen exhaustively in steps of 0.2, in the range 0.0 to 2.0, over the 10 episodes of one fitness evaluation. The fitness was the sum of the rewards over these 10 episodes. In the goal-direction experiments, the reward was the magnitude of the velocity in either the forward or backward direction, and the fitness was the summed reward over 10 episodes alternating between backwards and forwards movement. In both cases the length of one episode was 3000 time steps (30 seconds), with a rollout size of 40 simulation time steps per gradient step, unless otherwise noted.
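The two reward schemes reduce to a couple of lines; the target schedule below is one consistent reading of the text (ten episodes, steps of 0.2 up to 2.0):

```python
def goal_velocity_reward(vx, target_v):
    # Negative absolute difference between current and target velocity.
    return -abs(vx - target_v)

def goal_direction_reward(vx, forward):
    # Magnitude of velocity in the required direction.
    return vx if forward else -vx

# One target velocity per episode, swept in steps of 0.2;
# fitness is the sum of episode rewards.
targets = [round(0.2 * (k + 1), 1) for k in range(10)]  # 0.2 ... 2.0
```

Because each target gets only one short episode, reward is maximized only by agents that adapt quickly, which is the implicit selection pressure for fast learning.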
Sinusoid regression network: The model architecture we used for the sinusoid-fitting task (see Tasks) was the same as in Finn et al. (2017): a neural network with two hidden layers. We used zero-mean Gaussian noise to initialize the weights and biases of the network in the GA and NES. The mean squared error loss is used to train the parameters of the network via stochastic gradient descent.
Cheetah controller network: We use the advantage actor-critic (a2c) algorithm Mnih et al. (2016) to estimate the gradient for training the controller of the running cheetah agent. The controller consists of a shared torso, a feed-forward neural network with rectified-linear transfer functions and two hidden layers of size 100. A policy readout from the final layer outputs a softmax over 12 possible discrete actions. A value-function readout from the final layer outputs a single scalar which is used as a baseline for the policy gradient algorithm. On replication, Gaussian noise of mean zero and standard deviation 0.02 is added to each weight and bias of the neural network. Additional noise is added to the hyperparameters, which are the learning rate, entropy loss scale, and discount of the a2c algorithm.
For the physics-based RL tasks (see Tasks), macro-actions are evolved by the Baldwin effect, i.e. the 12×7 action-primitive matrix which determines the 7 motor torques produced for each of the 12 discrete actions the cheetah controller can execute at each time step. The use of second-order gradients to modify such hyperparameters is notoriously unstable, whereas the Baldwin effect allows meta-optimization of these hyperparameters as well as of the initial weights as in Finn et al. (2017). In the a2c experiments we additionally explored a genetically encoded binary vector of hyperparameters with the same length as the number of parameters in the model. This vector (which we call a mask) determines whether each parameter is learnable or not. Bit-flip mutation is used to evolve this mask hyperparameter vector. This is done to emulate the setup in the original Hinton and Nowlan paper.
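A sketch of the plasticity mask and its mutation (the array shapes, flip probability, and function names are illustrative placeholders):

```python
import numpy as np

def masked_update(params, grads, mask, lr):
    """Gradient step applied only where the genetically encoded binary
    mask marks a parameter as plastic (1 = learnable, 0 = fixed)."""
    return params - lr * grads * mask

def mutate_mask(mask, p_flip=0.01, rng=None):
    """Bit-flip mutation of the inherited plasticity mask."""
    rng = rng if rng is not None else np.random.default_rng()
    flips = rng.random(mask.shape) < p_flip
    return np.where(flips, 1 - mask, mask)
```

Parameters whose mask bit is 0 behave as in the Hinton and Nowlan setup's fixed alleles: they are shaped only by evolution, never by lifetime learning.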
4 Results
Firstly, MAML is compared with genetic algorithms and natural evolution strategies on a supervised learning task: fitting sinusoids drawn from a distribution. Secondly, genetic algorithms are used to evolve the hyperparameters and initial parameters of a policy gradient algorithm with an adaptive critic on two reinforcement learning problems.
4.1 Rapid fitting of sinusoids
The performance of MAML, NES and GA on fitting sinusoids is shown in Figure 1(a). As our methods are based on a population of genotypes, we plot both the median and the best fitness achieved in each generation.
Figure 3 shows the rate at which the neural network fits a particular sine wave presented during a lifetime. Similar to MAML, our methods' adaptation speed is superior to that of the baseline approach (pretrained), which was trained to predict sine waves using a standard supervised-learning approach. (Plots for MAML and the pretrained approach come from Finn et al. (2017).)
4.2 Reinforcement Learning in Physics Environments
Goal Velocity Task. The Baldwin effect evolves a model that can quickly adapt its velocity to the target velocity within a single episode lasting only 30 simulated seconds. Figure 3(a) shows that Lamarckian evolution outperforms Baldwinian evolution, which in turn outperforms Darwinian evolution. Figure 5 shows that Lamarckian evolution tracks the target velocity in each episode more closely than Baldwinian evolution. Both the Baldwin effect and Lamarckian learning are superior to pure Darwinian learning in this case.
Goal Direction Task. The Baldwin effect evolves a model that can quickly adapt its direction to the target direction within a single episode lasting only 30 simulated seconds. Figure 3(b) shows the best agent fitness recorded over five independent evolutionary runs; two that use Baldwinian evolution (green), two that use Lamarckian evolution (red) and one that uses Darwinian evolution (blue), on the goal direction task. Best performance is obtained by Baldwinian evolution without an explicit plasticity mask, and second best with Baldwinian evolution with an explicit plasticity mask, followed by Darwinian evolution, with Lamarckian evolution a very clear loser in this task. The horizontal velocity of the cheetah over the course of one fitness evaluation is shown in Figure 6.
The contrast between the goal-velocity and goal-direction tasks is interesting. The goal-direction task requires a radical change in policy for moving forwards or backwards in different episodes. Lamarckian evolution gets stuck in a local optimum of only being able to go backwards, whereas Baldwinian evolution is able to cope with these two diverse tasks. In the goal-velocity task, Lamarckian evolution is superior because the final velocity achieved in one episode is a suitable starting point for the target velocity required in the next (note that we increment the target velocity by 0.2 in each episode).
How do the hyperparameters of the a2c algorithm evolve during the goal-direction task? Figure 7 shows histograms of the distribution of hyperparameters at evenly spaced time-points during the runs. The main points to note are that in Baldwinian evolution learning rates evolve to high values, whereas in Lamarckian evolution they drop to the lowest values allowed. In Baldwinian evolution the entropy loss scale evolves to high values, but experiences little directed selection in Lamarckian evolution. In Baldwinian evolution the discounts become as small as we allow, i.e. 0.92, but in Lamarckian evolution they become as large as we allow. The Baldwin effect does not abolish learning in this task; instead, it increases the rate of learning while evolving strong learning biases. This is something that would not have been possible in the Hinton and Nowlan task.
5 Discussion and Conclusion
In conclusion, in the supervised learning task there was genetic accommodation of the initial function prior to learning, i.e. the regression network's prior became sinusoidal, while rapid learning continued to be selected for throughout evolution. In learning RL task distributions with the Baldwin effect, we observed that the learning hyperparameters also evolved, towards high learning rates and low discount factors, with the initial behaviour 'at birth' providing strong biases to a learning algorithm that continued to show rapid learning throughout evolution. Complete genetic accommodation does not occur because, by construction, a fixed (non-plastic) policy can never achieve high fitness across the task distribution; instead, it is the biases of the learning algorithm that are accommodated. The Baldwin effect is superior to Lamarckian learning when the distribution of tasks is broad or quickly changing, whereas Lamarckian learning is superior when the task distribution is narrow. The use of an explicit Hinton-and-Nowlan-style mask did not improve learning speed or final performance on task distributions.
We have demonstrated that the Baldwin effect, combined with deep learning, is capable of producing learning algorithms and models capable of few-shot learning in supervised and reinforcement learning tasks. Future work is to show that this principle can achieve state-of-the-art results in machine learning on more complex task distributions.
Remarkably, meta-learning through evolution enables the use of non-differentiable fitness functions, in contrast to popular meta-learning approaches. For example, the fitness function can be defined on different, potentially multi-modal data distributions, making it a prime candidate for multi-objective optimization, even when data from one or several objectives is not always available to the low level optimization process.
- Anderson  Russell Wayne Anderson. 1996. How adaptive antibodies facilitate the evolution of natural antibodies. Immunology and cell biology 74, 3 (1996), 286.
- Baldwin  J Mark Baldwin. 1896. A new factor in evolution. The American Naturalist 30, 354 (1896), 441–451.
- Bull  Larry Bull. 2016. The Evolution of Sex through the Baldwin Effect. CoRR abs/1607.00318 (2016). arXiv:1607.00318 http://arxiv.org/abs/1607.00318
- Bullinaria  John A Bullinaria. 2002. The evolution of variable learning rates. In Proceedings of the 4th Annual Conference on Genetic and Evolutionary Computation. Morgan Kaufmann Publishers Inc., 52–59.
- Castillo et al.  PA Castillo, MG Arenas, JG Castellano, JJ Merelo, A Prieto, V Rivas, and G Romero. 2006. Lamarckian evolution and the Baldwin effect in evolutionary neural networks. arXiv preprint arXiv:cs/0603004 (2006).
- Cecconi et al.  Federico Cecconi, Filippo Menczer, and Richard K Belew. 1995. Maturation and the evolution of imitative learning in artificial organisms. Adaptive Behavior 4, 1 (1995), 29–50.
- Downing  Keith L Downing. 2010. The Baldwin effect in developing neural networks. In Proceedings of the 12th annual conference on Genetic and evolutionary computation. ACM, 555–562.
- Duan et al.  Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, and Pieter Abbeel. 2016. RL2: Fast Reinforcement Learning via Slow Reinforcement Learning. arXiv preprint arXiv:1611.02779 (2016).
- Finn et al.  Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400 (2017).
- Fontanari and Santos  José F Fontanari and Mauro Santos. 2017. The revival of the Baldwin effect. The European Physical Journal B 90, 10 (2017), 186.
- Goldberg and Deb  David E Goldberg and Kalyanmoy Deb. 1991. A comparative analysis of selection schemes used in genetic algorithms. In Foundations of genetic algorithms. Vol. 1. Elsevier, 69–93.
- Hinton and Nowlan  Geoffrey E Hinton and Steven J Nowlan. 1987. How learning can guide evolution. Complex systems 1, 3 (1987), 495–502.
- Husken and Goerick  Michael Husken and Christian Goerick. 2000. Fast learning for problem classes using knowledge based network initialization. In Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on, Vol. 6. IEEE, 619–624.
- Jablonka and Lamb  Eva Jablonka and Marion J Lamb. 2014. Evolution in four dimensions, revised edition: Genetic, epigenetic, behavioral, and symbolic variation in the history of life. MIT press.
- Jaderberg et al.  Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. 2017. Population Based Training of Neural Networks. arXiv preprint arXiv:1711.09846 (2017).
- Keesing and Stork  Ron Keesing and David G Stork. 1991. Evolution and learning in neural networks: the number and distribution of learning trials affect the rate of evolution. In Advances in Neural Information Processing Systems. 804–810.
- Koch et al.  Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. 2015. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, Vol. 2.
- Lake et al.  Brenden Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua Tenenbaum. 2011. One shot learning of simple visual concepts. In Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 33.
- Maynard Smith  John Maynard Smith. 1987. Natural selection: when learning guides evolution. (01 1987), 455–457.
- Mnih et al.  Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning. 1928–1937.
- Ravi and Larochelle  Sachin Ravi and Hugo Larochelle. 2016. Optimization as a model for few-shot learning. (2016).
- Santoro et al.  Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. 2016. One-shot learning with memory-augmented neural networks. arXiv preprint arXiv:1605.06065 (2016).
- Santos et al.  Mauro Santos, Eörs Szathmáry, and José F Fontanari. 2015. Phenotypic plasticity, the baldwin effect, and the speeding up of evolution: The computational roots of an illusion. Journal of theoretical biology 371 (2015), 127–136.
- Schaul et al.  Tom Schaul, Tobias Glasmachers, and Jürgen Schmidhuber. 2011. High Dimensions and Heavy Tails for Natural Evolution Strategies. In Genetic and Evolutionary Computation Conference (GECCO).
- Schmidhuber et al.  Juergen Schmidhuber, Jieyu Zhao, and MA Wiering. 1996. Simple principles of metalearning. Technical report IDSIA 69 (1996), 1–23.
- Thrun and Pratt  Sebastian Thrun and Lorien Pratt. 1998. Learning to learn: Introduction and overview. In Learning to learn. Springer, 3–17.
- Todorov et al.  Emanuel Todorov, Tom Erez, and Yuval Tassa. 2012. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 5026–5033.
- Turney  Peter D Turney. 1995. Cost-sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm. Journal of Artificial Intelligence Research 2 (1995), 369–409.
- Turney  Peter D Turney. 2002. Myths and legends of the Baldwin effect. arXiv preprint cs/0212036 (2002).
- Vinyals et al.  Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems. 3630–3638.
- Wang et al.  Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. 2017. Learning to Reinforcement Learn. Cognitive Science, CogSci (2017). arXiv:1611.05763
- Wierstra et al.  Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and Jürgen Schmidhuber. 2014. Natural Evolution Strategies. Journal of Machine Learning Research 15 (2014), 949–980. http://jmlr.org/papers/v15/wierstra14a.html
- Wierstra et al.  Daan Wierstra, Tom Schaul, Jan Peters, and Jürgen Schmidhuber. 2008. Natural Evolution Strategies. In Proceedings of the Congress on Evolutionary Computation (CEC08), Hongkong. IEEE Press.