1 Introduction
When a reinforcement-learning agent is learning to behave, it is critical that it both explores its domain and exploits its rewards effectively. One way to think of this problem is in terms of curiosity or intrinsic motivation: constructing reward signals that augment or even replace the extrinsic reward from the domain, and that induce the RL agent to explore its domain in a way that results in effective longer-term learning and behavior (pathak2017curiosity; burda2018exploration; oudeyer2018computational). The primary difficulty with this approach is that researchers are hand-designing these strategies: it is difficult for humans to systematically consider the space of strategies or to tailor strategies for the distribution of environments an agent might be expected to face.
We take inspiration from the curious behavior observed in young humans and other animals and hypothesize that curiosity is a mechanism found by evolution that encourages meaningful exploration early in an agent’s life. This exploration exposes it to experiences that enable it to learn to obtain high rewards over the course of its lifetime. We propose to formulate the problem of generating curious behavior as one of meta-learning: an outer loop, operating at “evolutionary” scale, will search over a space of algorithms for generating curious behavior by dynamically adapting the agent’s reward signal, and an inner loop will perform standard reinforcement learning using the adapted reward signal. This process is illustrated in figure 1; note that the aggregate agent, outlined in gray, has the standard interface of an RL agent. The inner RL algorithm is continually adapting to its input stream of states and rewards, attempting to learn a policy that optimizes the discounted sum of proxy rewards. The outer “evolutionary” search is attempting to find a program for the curiosity module, so as to optimize the agent’s lifetime return, or another global objective like the mean performance on the last few trials.
In this meta-learning setting, our objective is to find a curiosity module that works well given a distribution of environments from which we can sample at meta-learning time. Meta-RL has been widely explored recently, in some cases with a focus on reducing the amount of experience needed by initializing the RL algorithm well (finn2017model; clavera2018learning) and, in others, for efficient exploration (duan2016rl; wang2017learning). The environment distributions in these cases have still been relatively low-diversity, mostly limited to variations of the same task, such as exploring different mazes or navigating terrains of different slopes. We would like to discover curiosity mechanisms that can generalize across a much broader distribution of environments, even those with different state and action spaces: from image-based games to joint-based robotic control tasks. To do that, we perform meta-learning in a rich, combinatorial, open-ended space of programs.
This paper makes three novel contributions.
We focus on a regime of meta-reinforcement-learning in which the possible environments the agent might face are dramatically disparate and in which the agent’s lifetime is very long.
This is a substantially different setting than has been addressed in previous work on meta-RL and it requires substantially different techniques for representation and search.
We propose to do meta-learning in a rich, combinatorial space of programs rather than transferring neural network weights.
The programs are represented in a domain-specific language (DSL) which provides sophisticated building blocks, including neural networks complete with gradient-descent mechanisms, learned objective functions, ensembles, buffers, and other regressors. This language is rich enough to represent many previously reported hand-designed exploration algorithms. We believe that by performing meta-RL in such a rich space of mechanisms, we will be able to discover highly general, fundamental curiosity-based exploration methods. This generality means that a relatively computationally expensive meta-learning process can be amortized over the lifetimes of many agents in a wide variety of environments.
We make the search over programs feasible with relatively modest amounts of computation.
It is a daunting search problem to find a good solution in a combinatorial space of programs, where evaluating a single potential solution requires running an RL algorithm for up to millions of time steps. We address this problem in multiple ways. By including environments of substantially different difficulty and character, we can evaluate candidate programs first on relatively simple and short-horizon domains: if they don’t perform well in those domains, they are pruned early, which saves a significant amount of computation time. In addition, we predict the performance of an algorithm from its structure and operations, thus trying the most promising algorithms early in our search. Finally, we also monitor the learning curve of agents and stop unpromising programs before they reach all environment steps.
We demonstrate the effectiveness of the approach empirically, finding curiosity strategies that perform on par with or better than those in published literature. Interestingly, the top 2 algorithms, to the best of our knowledge, had not been proposed before, despite making sense in hindsight. We conjecture the first one (shown in figure 3) is deceptively simple and that the complexity of the other one (figure 10 in the appendix) makes it relatively implausible for humans to discover.
2 Problem formulation
2.1 Meta-learning problem
Let us assume we have an agent equipped with an RL algorithm $\mathcal{A}$ (such as DQN or PPO, with all hyperparameters specified), which receives states and rewards from, and outputs actions to, an environment $\mathcal{E}$, generating a stream of experienced transitions $e_t = (s_t, a_t, r_t, s_{t+1})$. The agent continually learns a policy $\pi_t$ mapping states to actions, which will change in time as described by algorithm $\mathcal{A}$; so $\pi_t = \mathcal{A}(e_{1:t-1})$ and thus $a_t \sim \pi_t(s_t)$. Although this need not be the case, we can think of $\mathcal{A}$ as an algorithm that tries to maximize the discounted reward $\sum_i \gamma^i r_{t+i}$, with $\gamma < 1$, and that, at any timestep, always takes the greedy action that maximizes its estimated expected discounted reward.
To add exploration to this policy, we include a curiosity module $\mathcal{C}$ that has access to the stream of state transitions $e_t$ experienced by the agent and that, at every timestep $t$, outputs a proxy reward $\hat{r}_t$. We connect this module so that the original RL agent receives these modified rewards, thus observing $\hat{e}_t = (s_t, a_t, \hat{r}_t, s_{t+1})$, without having access to the original $r_t$. Now, even though the inner RL algorithm acts in a purely exploitative manner with respect to $\hat{r}$, it may efficiently explore in the outer environment.
Our overall goal is to design a curiosity module $\mathcal{C}$ that induces the agent to maximize $\sum_{t=0}^{T} r_t$, for some number of total timesteps $T$ or some other global goal, like final episode performance. In an episodic problem, $T$ will span many episodes. More formally, given a single environment $\mathcal{E}$, RL algorithm $\mathcal{A}$, and curiosity module $\mathcal{C}$, we can see the triplet (environment, curiosity module, agent) as a dynamical system that induces state transitions for the environment, and learning updates for the curiosity module and the agent. Our objective is to find $\mathcal{C}$ that maximizes the expected original reward obtained by the composite system in the environment. Note that the expectation is over two different distributions at different time scales: there is an “outer” expectation over environments $\mathcal{E}$, and an “inner” expectation over the rewards received by the composite system in that environment, so our final objective is:

$$\max_{\mathcal{C}} \; \mathbb{E}_{\mathcal{E}} \left[ \mathbb{E}_{r_{0:T} \mid \mathcal{A}, \mathcal{C}, \mathcal{E}} \left[ \sum_{t=0}^{T} r_t \right] \right]$$
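The composite system described in this section can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy chain environment, the tabular greedy Q-learner standing in for the inner RL algorithm, and the hypothetical `VisitPenaltyCuriosity` module. It only shows the wiring: the inner agent is updated with the proxy reward, while the outer objective sums the true reward.

```python
class VisitPenaltyCuriosity:
    """Hypothetical curiosity module: penalizes revisiting states, ignores r."""

    def __init__(self):
        self.counts = {}

    def proxy_reward(self, s, a, r, s_next):
        self.counts[s_next] = self.counts.get(s_next, 0) + 1
        return 1.0 / self.counts[s_next] - 1.0  # 0 on first visit, negative after


class GreedyQAgent:
    """Stand-in inner RL algorithm: tabular Q-learning that always acts greedily."""

    def __init__(self, n_actions=2, lr=0.5, gamma=0.9):
        self.q, self.n_actions, self.lr, self.gamma = {}, n_actions, lr, gamma

    def values(self, s):
        return [self.q.get((s, a), 0.0) for a in range(self.n_actions)]

    def act(self, s):
        v = self.values(s)
        return v.index(max(v))  # ties resolve to the lowest action index

    def update(self, s, a, r_hat, s_next):
        target = r_hat + self.gamma * max(self.values(s_next))
        old = self.q.get((s, a), 0.0)
        self.q[(s, a)] = old + self.lr * (target - old)


def chain_step(s, a, n=5):
    """Toy chain: action 1 moves right, action 0 moves left; reward at the right end."""
    s_next = min(n - 1, s + 1) if a == 1 else max(0, s - 1)
    return s_next, (1.0 if s_next == n - 1 else 0.0)


def lifetime_return(curiosity, T=50):
    """Run one agent lifetime and return the sum of TRUE rewards."""
    agent, s, total = GreedyQAgent(), 0, 0.0
    for _ in range(T):
        a = agent.act(s)
        s_next, r = chain_step(s, a)
        r_hat = curiosity.proxy_reward(s, a, r, s_next) if curiosity else r
        agent.update(s, a, r_hat, s_next)  # the agent only ever sees r_hat
        total += r                         # the outer objective uses the true r
        s = s_next
    return total
```

In this toy setting the purely greedy agent never leaves the start state without a curiosity module, whereas with the hypothetical module it reaches the rewarding end of the chain. An outer search would compare candidate modules by calling `lifetime_return` on environments sampled from the meta-training distribution.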
2.2 Programs for curiosity
In science and computing, mathematical language has been very successful in describing varied phenomena and powerful algorithms with short descriptions. As Valiant points out: “the power [of mathematics and algorithms] comes from the implied generality, that knowledge of one equation alone will allow one to make accurate predictions about a host of situations not even conceived when the equation was first written down” (valiant2013probably). Therefore, in order to obtain curiosity modules that can generalize over a very broad range of tasks and that are sophisticated enough to provide exploration guidance over very long horizons, we describe them in terms of general programs in a domain-specific language. Algorithms in this language will map a history of $(s_t, a_t, r_t, s_{t+1})$ tuples into a proxy reward $\hat{r}_t$.
Inspired by human-designed systems that compute and use intrinsic rewards, and to simplify the search, we decompose the curiosity module into two components: the first, $\mathcal{I}$, outputs an intrinsic reward value $i_t$ based on the current experienced transition $(s_t, a_t, s_{t+1})$ (and past transitions indirectly through its memory); the second, $\chi$, takes the current timestep $t$, the actual reward $r_t$, and the intrinsic reward $i_t$ (and, if it chooses to store them, their histories) and combines them to yield the proxy reward $\hat{r}_t$. To ease generalization across different timescales, in practice, before feeding $t$ into $\chi$ we normalize it by the total length of the agent’s lifetime, $T$.
Both programs consist of a directed acyclic graph (DAG) of modules with polymorphically typed inputs and outputs. As shown in figure 2, there are four classes of modules:

Input modules (shown in blue), drawn from the elements of the current transition for the intrinsic-reward component and from the quantities available to the reward-combiner component (such as the intrinsic and actual rewards). They have no inputs, and their outputs have the type corresponding to the types of states and actions in whatever domain they are applied to, or the real numbers for rewards.

Buffer and parameter modules (shown in gray) of two kinds: FIFO queues that provide as output a finite list of the most recent inputs, and neural network weights initialized at random at the start of the program and which may (pink border) or may not (black border) get updated via backpropagation depending on the computation graph.

Functional modules (shown in white), which compute output values given the inputs from their parent modules.

Update modules (shown in pink), which are either functional modules (such as k-nearest-neighbor) that add variables to buffers, or modules that add real-valued outputs to a global loss that will provide error signals for gradient descent.
A single node in the DAG is designated as the output node (shown in green): the output of this node is considered to be the output of the entire program, but it need not be a leaf node of the DAG.
On each call to a program (corresponding to one timestep of the system) the current input values and parameter values are propagated through the functional modules, and the output node’s output is given to the RL algorithm. Before the call terminates, the FIFO buffers are updated and the adjustable parameters are updated via gradient descent using the Adam optimizer (Kingma2014AdamAM). Most operations are differentiable and thus able to propagate gradients backwards. Some operations are not differentiable, including buffers (to avoid backpropagating through time) and “Detach”, whose purpose is to stop the gradient from flowing back. In practice, we have multiple copies of the same agent running at the same time, with both a shared policy and a shared curiosity module. Thus, we execute multiple reward predictions on a batch and then update on a batch.
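The forward pass of such a program can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the node names and the `run_program` helper are hypothetical, the modules are plain functions, and the buffer and gradient updates described above are omitted. It shows modules evaluated in topological order and an output node that is not a leaf of the graph.

```python
import math


def run_program(nodes, output, inputs):
    """Evaluate a DAG of modules.

    nodes: dict mapping node name -> (function, tuple of parent names),
           listed in topological order (dicts keep insertion order).
    output: name of the designated output node.
    inputs: dict mapping input-module names to their current values.
    """
    values = dict(inputs)
    for name, (fn, parents) in nodes.items():
        values[name] = fn(*(values[p] for p in parents))
    return values[output]


# Hypothetical three-module program: the output node "dist" feeds a further
# "loss" node, so the output node is not a leaf of the DAG.
toy_program = {
    "diff": (lambda s, s_next: [y - x for x, y in zip(s, s_next)], ("s", "s_next")),
    "dist": (lambda d: math.sqrt(sum(v * v for v in d)), ("diff",)),
    "loss": (lambda x: x * x, ("dist",)),
}

intrinsic = run_program(toy_program, "dist", {"s": [0.0, 0.0], "s_next": [3.0, 4.0]})
```

A fuller version would also run the update modules after the forward pass, appending to FIFO buffers and taking an optimizer step on the summed loss.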
Programs representing several published designs for curiosity modules that perform internal gradient descent, including inverse features (pathak2017curiosity), random network distillation (RND) (burda2018exploration), and ensemble predictive variance (pathak2019self), are shown in figure 2 (bigger versions can be found in appendix A.3). We can also represent algorithms similar to novelty search (lehman2008exploiting) and EX2 (fu2017ex2), which include buffers and nearest-neighbor regression modules. Details on the data types and module library are given in appendix A.

A crucial, and possibly somewhat counter-intuitive, aspect of these programs is their use of neural network weight updates via gradient descent as a form of memory. In the parameter update step, all adjustable parameters are decremented by the gradient of the sum of the outputs of the loss modules, with respect to the parameters. This type of update allows the program to, for example, learn to make some types of predictions, online, and use the quality of those predictions in a state to modulate the proxy reward for visiting that state (as is done, for example, in RND).
Key to our program search are polymorphic data types: the inputs and outputs to each module are typed, but the instantiation of some types, and thus of some operations, depends on the environment. We have four types: reals $\mathbb{R}$, state space of the given environment $\mathbb{S}$, action space of the given environment $\mathbb{A}$, and feature space $\mathbb{F}$, used for intermediate computations and always set to $\mathbb{R}^{32}$ in our current implementation. For example, a neural network module going from $\mathbb{S}$ to $\mathbb{F}$ will be instantiated as a convolutional neural network if $\mathbb{S}$ is an image and as a fully connected neural network of the appropriate dimension if $\mathbb{S}$ is a vector. Similarly, if we are measuring an error in action space $\mathbb{A}$ we use mean-squared error for continuous action spaces and negative log-likelihood for discrete action spaces. This facility means that the same curiosity program can be applied independent of whether states are represented as images or vectors, whether the actions are discrete or continuous, or the dimensionality of either.

This type of abstraction enables our meta-learning approach to discover curiosity modules that generalize radically, applying not just to new tasks, but to tasks with substantially different input and output spaces than the tasks they were trained on.
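The type-driven instantiation described above can be sketched as a simple dispatch on the shape of the state space and the kind of action space. The function names, the returned dictionaries, and the convention that a three-dimensional shape denotes an image are assumptions made for this illustration; they are not the paper's implementation.

```python
def instantiate_network(state_shape, feature_dim=32):
    """Choose a network family from the shape of the state type."""
    if len(state_shape) == 3:
        # Assumed convention: (channels, height, width) means an image.
        return {"kind": "cnn", "in_shape": tuple(state_shape), "out_dim": feature_dim}
    if len(state_shape) == 1:
        # A flat vector gets a fully connected network of matching input size.
        return {"kind": "mlp", "in_dim": state_shape[0], "out_dim": feature_dim}
    raise ValueError(f"unsupported state shape: {state_shape}")


def action_loss(action_space_is_discrete):
    """Choose the error measure used for predictions in action space."""
    if action_space_is_discrete:
        return "negative_log_likelihood"
    return "mean_squared_error"


# The same abstract module resolves differently in two environments.
grid_world_net = instantiate_network((3, 7, 7))
lunar_lander_net = instantiate_network((8,))
```

The program itself never mentions CNNs or MLPs; only the environment's types decide which concrete operation is built.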
To clarify the semantics of these programs, we walk through the operation of the RND program in figure 2. Its only input is a state, which might be an image or an input vector, and which is processed by two NNs, each with its own parameters. The structure of the NNs (and, hence, the dimensions of their parameters) depends on the type of the state: if it is an image, then they are CNNs, otherwise they are fully connected networks. Each NN outputs a 32-dimensional vector; the distance between these vectors is the output of the program on this iteration, and is also the input to a loss module. So, given an input state, the output intrinsic reward is large if the two NNs generate different outputs and small otherwise. After each forward pass, the weights of the trainable NN are updated to minimize the loss while those of the other NN remain constant, which causes the trainable NN to mimic the output of the randomly initialized NN. As the program’s ability to predict the output of the randomized NN on an input improves, the intrinsic reward for visiting that state decreases, driving the agent to visit new states.
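The RND walk-through above can be made concrete with a small pure-Python sketch. It departs from the described program in ways that are assumptions of this example: both networks are single linear layers, the distance is a squared Euclidean distance, and the update is plain gradient descent rather than Adam. The class name `ToyRND` is hypothetical.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Random weight matrix with out_dim rows and in_dim columns."""
    return [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


class ToyRND:
    def __init__(self, in_dim, out_dim=32, lr=0.05, seed=0):
        rng = random.Random(seed)
        self.target = make_linear(in_dim, out_dim, rng)     # fixed, never trained
        self.predictor = make_linear(in_dim, out_dim, rng)  # trained to mimic target
        self.lr = lr

    def step(self, s):
        """Return the intrinsic reward for state s, then update the predictor."""
        err = [p - t for p, t in zip(apply_linear(self.predictor, s),
                                     apply_linear(self.target, s))]
        reward = sum(e * e for e in err)
        # Gradient of sum_j err_j^2 w.r.t. predictor[j][k] is 2 * err_j * s_k.
        for j, e in enumerate(err):
            for k, sk in enumerate(s):
                self.predictor[j][k] -= self.lr * 2 * e * sk
        return reward


rnd = ToyRND(in_dim=2, out_dim=4)
rewards = [rnd.step([1.0, 0.5]) for _ in range(5)]
```

Revisiting the same state shrinks the prediction error on every step, so `rewards` is a decreasing sequence, which is the mechanism that pushes the agent toward states it has not seen.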
To limit the search space and prioritize short, meaningful programs, we limit the total number of modules of the computation graph to 7. Our language is expressive enough to describe many (but far from all) curiosity mechanisms in the existing literature, as well as many other potential alternatives, but the expressiveness leads to a very large search space. Additionally, removing or adding a single operation can drastically change the behavior of a program, making the objective function non-smooth and, therefore, the space hard to search. In the next section we explore strategies for speeding up the search over tens of thousands of programs.
3 Improving the efficiency of our search
We wish to find curiosity programs that work effectively in a wide range of environments, from simple to complex. However, evaluating tens of thousands of programs in the most expensive environments would consume decades of GPU computation. Therefore, we designed multiple strategies for quickly discarding less promising programs and focusing computation on a few promising programs. In doing so, we take inspiration from efforts in the AutoML community (automl_book).
We divide these pruning efforts into three categories: simple tests that are independent of running the program in any environment, “filtering” by ruling out some programs based on poor performance in simple environments, and “meta-meta-RL”: learning to predict which curiosity programs will produce good RL agents based on syntactic features.
3.1 Pruning invalid algorithms without running them
Many programs are obviously bad curiosity programs. We have developed two heuristics to immediately prune these programs without an expensive evaluation.

Checking that programs are not duplicates. Since our language is highly expressive, there are many non-obvious ways of getting equivalent programs. To find duplicates, we designed a randomized test where we identically seed two programs, feed them both identical fake environment data for tens of steps, and check whether their outputs are identical.

Checking that the loss functions cannot be minimized independently of the input data. Many programs optimize some loss depending on neural network regressors. If we treat inputs as uncontrollable variables and networks as having the ability to become any possible function, then for every variable, we can determine whether neural networks can be optimized to minimize it, independently of the input data. For example, if the loss is just the magnitude of a network’s output, the network can drive it to zero by disregarding its input and optimizing the weights to 0. We discard any program that has this property.
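The randomized duplicate test from the first heuristic can be sketched as follows. The program factories (`make_forward`, `make_swapped`, `make_other`), the Gaussian fake data, and the tolerance are hypothetical choices for this example; real programs would be full computation graphs rather than one-line functions.

```python
import random


def is_duplicate(make_a, make_b, n_steps=20, dim=4, seed=0, tol=1e-9):
    """Seed both programs identically, feed identical fake data, compare outputs."""
    prog_a = make_a(random.Random(seed))
    prog_b = make_b(random.Random(seed))
    data = random.Random(seed + 1)
    for _ in range(n_steps):
        s = [data.gauss(0, 1) for _ in range(dim)]
        s_next = [data.gauss(0, 1) for _ in range(dim)]
        if abs(prog_a(s, s_next) - prog_b(s, s_next)) > tol:
            return False
    return True


# Three hypothetical programs, each with one randomly initialized weight.
def make_forward(rng):
    w = rng.uniform(-1, 1)
    return lambda s, s_next: w * sum((y - x) ** 2 for x, y in zip(s, s_next))


def make_swapped(rng):  # same function with its arguments written in another order
    w = rng.uniform(-1, 1)
    return lambda s, s_next: w * sum((x - y) ** 2 for x, y in zip(s, s_next))


def make_other(rng):    # genuinely different: ignores the current state
    w = rng.uniform(-1, 1)
    return lambda s, s_next: w * sum(y * y for y in s_next)
```

Here `is_duplicate(make_forward, make_swapped)` holds because the two graphs compute the same function, while `make_other` is told apart after a single step of fake data.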
3.2 Pruning algorithms in cheap environments
Our ultimate goal is to find algorithms that perform well on many different environments, both simple and complex. We make two key observations. First, there may be only tens of reasonable programs that perform well on all environments but hundreds of thousands of programs that perform poorly. Second, there are some environments that are solvable in a few hundred steps while others require tens of millions. Therefore, a key idea in our search is to try many programs in cheap environments and only a few promising candidates in the most expensive environments. This was inspired by the effective use of sequential halving (karnin2013almost) in hyperparameter optimization (jamieson2016non).
By pruning programs aggressively, we may be losing multiple programs that perform well on complex environments. However, by definition, these programs will tend to be less general and robust than those that succeed in all environments. Moreover, we seek generalization not only for its own sake, but also to ease the search since, even if we only cared about the most expensive environment, performing the complete search only in this environment would be impractical.
3.3 Predicting algorithm performance
Perhaps surprisingly, we find that we can predict program performance directly from program structure. Our search process bootstraps an initial training set of (program structure, program performance) pairs, then uses this training set to select the most promising next programs to evaluate. We encode each program’s structure with features representing how many times each operation is used, thus having as many features as there are operations in our vocabulary. We use a $K$-nearest-neighbor regressor, with $K=10$. We then try the most promising programs and update the regressor with their results. Finally, we add an $\epsilon$-greedy exploration policy to make sure we explore all of the search space. Even though the correlation between predictions and actual values is only moderately high on a held-out test, this is enough to discover most of the top programs while searching only half of the program space, which is our ultimate goal. Results are shown in appendix C.
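A minimal sketch of this predictor-guided search step is given below. It assumes programs are represented simply as lists of operation names, uses squared Euclidean distance between count vectors, and averages the neighbors' scores; these details, and the helper names, are assumptions of the example rather than the paper's exact implementation.

```python
import random
from collections import Counter


def featurize(program_ops, vocabulary):
    """Count how many times each operation of the vocabulary appears."""
    counts = Counter(program_ops)
    return [counts[op] for op in vocabulary]


def knn_predict(x, train, k):
    """Average the scores of the k training programs closest to x."""
    nearest = sorted(
        train, key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x))
    )[:k]
    return sum(score for _, score in nearest) / len(nearest)


def choose_next(candidates, train, vocabulary, k, epsilon, rng):
    """Epsilon-greedy choice: usually the best predicted program, sometimes random."""
    if rng.random() < epsilon:
        return rng.choice(candidates)
    return max(
        candidates,
        key=lambda ops: knn_predict(featurize(ops, vocabulary), train, k),
    )


vocabulary = ["nn", "dist", "buffer"]
train = [([1, 1, 0], 10.0), ([2, 1, 0], 8.0), ([0, 0, 2], 1.0)]
candidates = [["buffer", "buffer"], ["nn", "dist"]]
best = choose_next(candidates, train, vocabulary, k=1, epsilon=0.0, rng=random.Random(0))
```

After a chosen program has been evaluated, its `(features, score)` pair would be appended to `train`, so the regressor improves as the search proceeds.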
We can also prune algorithms during the training process of the RL agent. In particular, at any point during the meta-search, we use the current top-performing programs as benchmarks for all timesteps. Then, during the training of a new candidate program, we compare its current performance at time $t$ with the performance at time $t$ of the top programs and stop the run if its performance is significantly lower. If the program is not pruned and reaches the final timestep with one of the top performances, it becomes part of the benchmark for future programs.
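The early-stopping rule can be sketched as a comparison at a single point of the learning curve. The specific criterion used here (stop when the candidate's mean plus a margin of its own spread falls below the benchmark mean minus the same margin of the benchmark spread) and the default margin of two standard deviations are assumptions made for illustration; the text only states that a run is stopped when its performance is significantly lower.

```python
import statistics


def should_stop(candidate_scores_at_t, benchmark_scores_at_t, n_std=2.0):
    """Decide whether to abandon a candidate at time t.

    candidate_scores_at_t: scores of the candidate's trials at time t.
    benchmark_scores_at_t: scores of the current top programs at the same t.
    """
    cand_mean = statistics.mean(candidate_scores_at_t)
    cand_std = statistics.pstdev(candidate_scores_at_t)
    bench_mean = statistics.mean(benchmark_scores_at_t)
    bench_std = statistics.pstdev(benchmark_scores_at_t)
    # Optimistic estimate of the candidate vs. pessimistic estimate of the benchmark.
    return cand_mean + n_std * cand_std < bench_mean - n_std * bench_std
```

Using both spreads makes the rule conservative: a candidate is dropped only when even an optimistic reading of its trials stays below a pessimistic reading of the benchmark.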
4 Experiments
Our RL agent uses PPO (schulman2017proximal) based on the implementation by pytorchrl in PyTorch (Paszke2017AutomaticDI). Our code (https://github.com/mfranzs/metalearningcuriosityalgorithms) can take in any OpenAI gym environment (brockman2016openai) with a specification of the desired exploration horizon.

We evaluate each curiosity algorithm for multiple trials, using a seed dependent on the trial but independent of the algorithm, which leads to the PPO weights and curiosity data structures being initialized identically on the same trials for all algorithms. As is common in PPO, we run multiple rollouts (5, except for MuJoCo which only has 1), with independent experiences but shared policy and curiosity modules. Curiosity predictions and updates are batched across these rollouts, but not across time. PPO policy updates are batched both across rollouts and multiple timesteps.
4.1 First search phase in simple environment
We start by searching for a good intrinsic curiosity program in a purely exploratory environment, designed by gym_minigrid, which is an image-based grid world where agents navigate in an image of a 2D room either by moving forward in the grid or rotating left or right. We optimize the total number of distinct cells visited across the agent’s lifetime. This allows us to evaluate intrinsic reward programs in a fast and simple environment, without worrying about combining them with external reward.
To bias towards simple, interpretable algorithms and keep the search space manageable, we search for programs with at most 7 operations. We first discard duplicate and invalid programs, as described in section 3.1, resulting in about 52,000 programs. We then randomly split the programs across 4 machines, each with 8 Nvidia Tesla K80 GPUs for 10 hours; thus a total of 13 GPU days.
Each machine finds the highest-scoring 625 programs in its section of the search space and prunes programs whose partial learning curve is statistically significantly lower than that of the current top 625 programs. The check is run after every episode of every trial, and it accounts for both inter-program variability among the top 625 programs and intra-program variability among multiple trials of the same program.
We use a 10-nearest-neighbor regressor to predict program performance and choose the next program to evaluate with an $\epsilon$-greedy strategy, choosing either the best predicted program or, with probability $\epsilon$, a random program. By doing this, we try the most promising programs early in our search. This is important for two reasons: first, we only try 26,000 programs, half of the whole search space, which we estimated from earlier results (shown in figure 8 in the appendix) would be enough to recover most of the top programs. Second, the earlier we run our best programs, the higher the bar for later programs, thus allowing us to prune them earlier, further saving computation time. Searching through this space took a total of 13 GPU days. As shown in figure 9 in the appendix, we find that most programs perform relatively poorly, with a long tail of programs that are statistically significantly better and that make up only a small part of the whole program space.
The highest-scoring program (a few other programs have lower average performance but are statistically equivalent) is surprisingly simple and meaningful, comprising only 5 operations, even though the limit was 7. This program, which we call FAST (Fast Action-Space Transition), is shown in figure 3; it trains a single neural network (a CNN or MLP depending on the type of state) to predict the action from a state of the transition, and then compares its predictions for the two consecutive states $s_t$ and $s_{t+1}$, generating high intrinsic reward when the difference is large. The action prediction loss module either computes a softmax followed by NLL loss or appends zeros to the action to match dimensions and applies MSE loss, depending on the type of the action space. Note that this is not the same as rewarding taking a different action in the previous timestep. The network predicting the action is learning to imitate the policy learned by the internal RL agent, because the curiosity module does not have direct access to the RL agent’s internal state.
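The idea behind FAST can be sketched for a discrete action space with a linear softmax predictor in pure Python. Several details are assumptions of this example and not taken from the paper: the predictor is a single linear layer, it is trained on the next state, the comparison is a squared distance between predicted action distributions, and the update is plain gradient descent on the NLL loss. The class name `ToyFAST` is hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


class ToyFAST:
    def __init__(self, state_dim, n_actions, lr=0.1):
        self.w = [[0.0] * state_dim for _ in range(n_actions)]
        self.lr = lr

    def predict(self, s):
        """Predicted action distribution for state s."""
        return softmax([sum(wi * si for wi, si in zip(row, s)) for row in self.w])

    def step(self, s, a, s_next):
        """Return the intrinsic reward for (s, a, s_next), then train the predictor."""
        p_now, p_next = self.predict(s), self.predict(s_next)
        reward = sum((x - y) ** 2 for x, y in zip(p_next, p_now))
        # NLL gradient for a softmax-linear model: (p - onehot(a)) * input.
        # Assumption: the predictor is trained on the next state.
        for j, row in enumerate(self.w):
            g = p_next[j] - (1.0 if j == a else 0.0)
            for k, sk in enumerate(s_next):
                row[k] -= self.lr * g * sk
        return reward


fast = ToyFAST(state_dim=2, n_actions=2)
first = fast.step([1.0, 0.0], 1, [0.0, 1.0])   # untrained predictor: no difference yet
second = fast.step([1.0, 0.0], 1, [0.0, 1.0])  # predictions for the two states now differ
```

The reward is zero whenever the predictor outputs the same action distribution for both states, and it grows as the transition moves the agent to a state where the imitated policy would behave differently.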
Of the top 16 programs, 13 are variants of FAST, including versions that differ in which state of the transition the action is predicted from. The other 3 are variants of a more complex program that is hard to understand at first glance, but which we finally determined to be using ideas similar to cycle-consistency in the GAN literature (zhu2017unpaired); we thus name it Cycle-consistency intrinsic motivation. The diagram and explanation are in figure 10 in the appendix. Interestingly, to the best of our knowledge neither algorithm had been proposed before: we conjecture the former was too simple for humans to believe it would be effective and the latter too hard for humans to design, as it was already very hard to understand in hindsight.
4.2 Transferring to new environments
Our reward combiner was developed in lunar lander (the simplest environment with meaningful extrinsic reward) based on the best program among a preliminary set of 16,000 programs (which resembled Random Network Distillation; its computation graph is shown in appendix E). Among a set of 2,500 candidates (with 5 or fewer operations), the best reward combiner discovered by our search is, in the usual case, approximately a downscaled version of the intrinsic reward plus a linear interpolation that ranges from all intrinsic reward at $t=0$ to all extrinsic reward at $t=T$. In future work, we hope to co-adapt the search for intrinsic reward programs and combiners as well as find multiple reward combiners.

Given the fixed reward combiner and the list of 2,000 selected programs found in the image-based grid world, we evaluate the programs on both lunar lander and acrobot, in their discrete action space versions. Notice that both environments have much longer horizons than the image-based grid world (37,500 and 50,000 vs 2,500) and they have vector-based, rather than image-based, inputs. The results in figure 4 show good correlation between performance on grid world and on each of the new environments. Especially interesting is that, for both environments, when intrinsic reward in grid world is above 400 (the lowest score that is statistically significantly good), performance on the other two environments is also good in most cases.
Finally, we evaluate on two MuJoCo environments (todorov2012mujoco): hopper and ant. These environments have more than an order of magnitude longer exploration horizon than acrobot and lunar lander, exploring for 500K timesteps, as well as continuous action spaces instead of discrete ones. We then compare the best 16 programs on grid world (most of which also did well on lunar lander and acrobot) to four weak baselines (constant 0, -1, 1 intrinsic reward and Gaussian noise reward) and three published algorithms expressible in our language (shown in figure 2). We run two trials for each algorithm and pool all results in each category to get a confidence interval for the mean of that category. All trials used the reward combiner found on lunar lander. For both environments we find that the performance of our top programs is statistically equivalent to published work and significantly better than the weak baselines, confirming that we meta-learned good curiosity programs.
Note that we meta-trained our intrinsic curiosity programs only on one environment (GridWorld) and showed they generalized well to other very different environments: they perform better than published works in this meta-train task and one meta-test task (Acrobot) and on par in the other 3 meta-test tasks. Adding more meta-training tasks would be as simple as standardising the performance within each task (to make results comparable) and then selecting the programs with best mean performance. We chose to only meta-train on a single, simple task because it (surprisingly!) already gave great results, highlighting the broad generalization of meta-learning program representations.
Class                    Ant               Hopper
Baseline algorithms      [-95.3, -39.9]    [318.5, 525.0]
Meta-learned algorithms  [+67.5, +80.0]    [589.2, 650.6]
Published algorithms     [+67.4, +98.8]    [627.7, 692.6]

The table shows the confidence interval (one standard deviation) for the mean performance (across trials, across algorithms) for each algorithm category. Performance is defined as mean episode reward for all episodes.
5 Related work
In some regards our work is similar to neural architecture search (NAS) (stanley2002evolving; zoph2016neural; elsken2018neural; pham2018efficient) or hyperparameter optimization for deep networks (mendoza2016towards), which aim at finding the best neural network architecture and hyperparameters for a particular task. However, in contrast to most (but not all, see zoph2018learning) NAS work, we want to generalize to many environments instead of just one. Moreover, we search over programs, which include non-neural operations and data structures, rather than just neural-network architectures, and decide what loss functions to use for training. Our work also resembles work in the AutoML community (automl_book) that searches in a space of programs, for example in the case of SAT solving (khudabukhsh2009satenstein) or auto-sklearn (NIPS2015_5872), and concurrent work on learning loss functions to replace cross-entropy for training a fixed architecture on MNIST and CIFAR (gonzalez2019improved; gonzalez2020evolving). Although we took inspiration from ideas in that community (jamieson2016non; li2016hyperband), our algorithms specify both how to compute their outputs and their own optimization objectives in order to work well in synchrony with an expensive deep RL algorithm.
There has been work on meta-learning with genetic programming (schmidhuber1987evolutionary), searching over mathematical operations within neural networks (ramachandran2017searching; gaier2019weight), searching over programs to solve games (wilson2018evolving; kelly2017multi; silver2019few) and to optimize neural networks (bengio1995search; bello2017neural), and neural networks that learn programs (reed2015neural; pierrot2019learning). Our work uses neural networks as basic operations within larger algorithms. Finally, modular meta-learning (alet2018modular; alet2019neural) trains the weights of small neural modules and transfers to new tasks by searching for a good composition of modules; as such, it can be seen as a (restricted) dual of our approach.

There has been much interesting work in designing intrinsic curiosity algorithms. We take inspiration from many of them to design our domain-specific language. In particular, we rely on the idea of using neural network training as an implicit memory, which scales well to millions of timesteps, as well as buffers and nearest-neighbour regressors. As we showed in figure 2, we can represent several prominent curiosity algorithms. We can also generate meaningful algorithms similar to novelty search (lehman2008exploiting) and EX2 (fu2017ex2), which include buffers and nearest neighbours. However, there are many exploration algorithm classes that we do not cover, such as those focusing on generating goals (srivastava2013first; kulkarni2016hierarchical; pmlrv80florensa18a), learning progress (oudeyer2007intrinsic; schmidhuber2008driven; azar2019world), generating diverse skills (eysenbach2018diversity), stochastic neural networks (florensa2017stochastic; fortunato2017noisy), count-based exploration (tang2017exploration) or object-based curiosity measures (forestier2016modular). Finally, part of our motivation stems from taiga2019benchmarking showing that some bonus-based curiosity algorithms have trouble generalising to new environments.
There have been research efforts on meta-learning exploration policies: duan2016rl; wang2017learning learn an LSTM that explores an environment for one episode, retains its hidden state and is spawned in a second episode in the same environment; by training the network to maximize the reward in the second episode alone it learns to explore efficiently in the first episode. stadie2018some improves their exploration and that of finn2017model by considering the importance of sampling in RL policies. gupta2018meta combine gradient-based meta-learning with a learned latent exploration space in which they add structured noise for meaningful exploration. Closer to our formulation, zheng2018learning parametrize an intrinsic reward function which influences policy-gradient updates in a differentiable manner, allowing them to backpropagate through a single step of the policy-gradient update to optimize the intrinsic reward function for a single task. In contrast to all three of these methods, we search over algorithms, which allows us to generalize more broadly and to consider the effect of exploration over far more timesteps than previous work. Finally, chiang2019learning; faust2019evolving have a setting similar to ours where they modify reward functions over the entire agent’s lifetime, but instead of searching over intrinsic curiosity algorithms they tune the parameters of a hand-designed reward function.
Related work on meta-learning (schmidhuber1987evolutionary; thrun2012learning; clune2019ai) and efforts to increase its generalization can be found in appendix B. Closest to our work, evolved policy gradients (EPG, houthooft2018evolved) use evolutionary strategies to meta-learn a neural network that acts as a loss function and is used to train a policy network. EPG generalizes by meta-training with target locations east of the start location and meta-testing with target locations to the west. In contrast, we showed that by meta-learning programs, we can generalize between radically different environments, not just goal variations of a single environment. Concurrently with our work, kirsch2019improving also show generalization capabilities between environments similar to ours (lunar lander, hopper and halfcheetah). Their approach transfers a parametric representation, for which it is unclear how to adapt the learned neural losses to an unseen environment with a different observation space. Their approach thus does not encode states into the loss function, which is critical for efficient exploration. In contrast, our algorithms can leverage polymorphic data types that adapt the neural networks to the environment they are running in, adapting both the size and the type of network (CNN vs MLP) running in each environment.
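To make the last point concrete, the following minimal Python sketch shows one way a polymorphic type could be resolved into a network description once the environment is known. The names `NetworkSpec` and `instantiate_encoder`, the rule that rank-3 observations are images, and the default feature size of 32 are hypothetical choices made for illustration; they are not taken from the implementation described in this paper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class NetworkSpec:
    """Hypothetical description of a network instantiated for one environment."""
    kind: str                      # "CNN" or "MLP"
    input_shape: Tuple[int, ...]   # copied from the environment's observation space
    output_size: int               # size of the shared feature space


def instantiate_encoder(observation_shape: Tuple[int, ...],
                        feature_size: int = 32) -> NetworkSpec:
    """Pick a network type and input size from the observation shape (assumed rule)."""
    if len(observation_shape) == 3:   # assumed: rank-3 observations are images
        return NetworkSpec("CNN", observation_shape, feature_size)
    if len(observation_shape) == 1:   # assumed: rank-1 observations are state vectors
        return NetworkSpec("MLP", observation_shape, feature_size)
    raise ValueError(f"unsupported observation shape: {observation_shape}")


# The same program node yields different networks in different environments.
image_encoder = instantiate_encoder((84, 84, 3))
vector_encoder = instantiate_encoder((8,))
```

The program itself never mentions a concrete shape, which is what would let a single program run unchanged in environments with different observation spaces.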
6 Conclusions
In this work, we proposed to meta-learn algorithms and showed that, by transferring programs, we can generalize between tasks much more varied than previously possible in meta-RL, even between those with different input or output spaces. In many settings, however, the input and output spaces remain the same as we change tasks. This opens the possibility of getting the best of both worlds by meta-learning weights along with structure, thus simultaneously transferring domain-specific knowledge in the weights and higher-level algorithmic knowledge in the architecture. In addition, we note that the approach of meta-learning programs instead of network weights may have further applications beyond finding curiosity algorithms, such as meta-learning optimization algorithms or even meta-learning meta-learning algorithms. Our relatively modest compute (2 GPU-weeks) and simple search method restricted us to a medium-sized search space, but we expect that future work could search over significantly bigger spaces. It may thus be possible to automatically search for new machine learning algorithms from more fundamental building blocks for a wide variety of problems.
Acknowledgments
We thank Kelsey Allen, Peter Karkus, Kevin Smith, Josh Tenenbaum and the rest of the HondaCMM MIT team for their insightful feedback. We thank Chris Lu for his idea on what the algorithm in figure 10 is computing. We also want to thank Bernadette Bucher, Chelsea Finn, Abhishek Gupta, Deepak Pathak, Lerrel Pinto, Oleh Rybkin, Karl Schmeckpeper and Joaquin Vanschoren for valuable conversations. Finally, we also want to thank Maria Bauza and Tej Chajed for their feedback on early drafts and Clement Gehring for his help setting up the experiments.
We gratefully acknowledge support from NSF grants 1523767 and 1723381, AFOSR grant FA95501710165, ONR grant N000141812847, the Honda Research Institute, SUTD Temasek Laboratories and the MIT Quest for Intelligence. Any opinions, findings, and conclusions or recommendations expressed in this material do not necessarily reflect the views of our sponsors.
References
Appendix A Details of our domain-specific language for curiosity algorithms
We have the following types. Note that the state-space and action-space types are defined differently for every environment.

$\mathbb{R}$: real numbers, such as the dot product between two vectors.

$\mathbb{R}^+$: numbers guaranteed to be positive, such as the distance between two vectors. The only difference between $\mathbb{R}$ and $\mathbb{R}^+$ in our program search is in pruning programs that can optimize objectives without looking at the data: for $\mathbb{R}^+$ we check whether they can optimize down to 0; for $\mathbb{R}$ we check whether they can optimize to arbitrarily negative values.

State space: the environment state, such as a matrix of pixels or a vector with robot joint values. The particular form of this type is adapted to each environment.

Action space: either a one-hot description of the action or the action itself. The particular form of this type is adapted to each environment.

Feature space: a space mostly useful for working with neural network embeddings. For simplicity, we only have a single feature space.

List: for each type we may also have a list of elements of that type. All operations that take a particular type as input can also be applied to lists of elements of that type by mapping the function to every element in the list. Lists also support extra operations such as average or variance.
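The list semantics described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `lift`, the convention that Python lists stand for the List type while tuples stand for individual vectors, and the broadcasting of non-list arguments are all hypothetical choices, not details confirmed by the text.

```python
import math


def lift(op):
    """Let an operation on single elements also accept lists of elements.

    Assumption of this sketch: Python lists stand for the List type and
    tuples stand for individual vectors, so the two are never confused.
    """
    def lifted(*args):
        lengths = [len(a) for a in args if isinstance(a, list)]
        if not lengths:
            return op(*args)                      # plain element-level call
        n = lengths[0]
        if any(length != n for length in lengths):
            raise ValueError("list arguments must have equal length")
        # Broadcast non-list arguments, then map the operation over the lists.
        columns = [a if isinstance(a, list) else [a] * n for a in args]
        return [op(*row) for row in zip(*columns)]
    return lifted


@lift
def l2_distance(u, v):
    """Euclidean distance between two vectors given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean(values):
    """Extra operation that is only available on lists."""
    return sum(values) / len(values)


single = l2_distance((0.0, 0.0), (3.0, 4.0))                # one pair of vectors
mapped = l2_distance([(0.0, 0.0), (6.0, 8.0)], (3.0, 4.0))  # mapped over a list
average = mean(mapped)
```

Here `l2_distance` is written once for single vectors and reused on a list without any change, which is the behaviour the List type is meant to provide.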
A.1 Curiosity operations
Operation  Input type(s)  State  Output type 

Add  ,  
RunningNorm  
VariableAsBuffer  
NearestNeighborRegressor  ,  
SubtractOneTenth  
NormalDistribution  
Subtract  ,  
Sqrt(Abs(x))  
NN  ,  
NN  ,  
NN  
NN  
(C)NN  
(C)NN, Detach  
(C)NNEnsemble  5x  
NN Ensemble  5x  
NN Ensemble  ,  5x  
NN Ensemble  ,  5x  
MinimizeValue  Adam  
L2Norm  
L2Distance  ,  
ActionSpaceLoss  ,  
DotProduct  ,  
Add  ,  
Detach  
Mean  
Variance  
Mean  
Mapped L2 Norm  
Average Distance  ,  
Minus  , 
Note that stands for the option of being or . NearestNeighborRegressor takes a query and a target, automatically creates a buffer of the target (thus keeps a list as a state) and answers based on the buffer. RunningNorm keeps track of the variance of the input and normalizes by that variance.
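As a rough illustration of the two stateful operations described above, the following Python sketch implements one possible reading of each. Several details are assumptions rather than facts from the text: `RunningNorm` here divides by the running standard deviation (with a small `eps` for stability), the regressor stores query-target pairs and answers with the target of the nearest stored query, and it returns zeros while its buffer is empty.

```python
import math


class RunningNorm:
    """Tracks the running variance of scalar inputs and rescales each input."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0    # running sum of squared deviations (Welford's method)
        self.eps = eps

    def __call__(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        variance = self.m2 / self.count
        # Assumption: normalize by the running standard deviation.
        return x / math.sqrt(variance + self.eps)


class NearestNeighborRegressor:
    """Keeps a growing buffer as state and answers from its nearest entry."""

    def __init__(self):
        self.buffer = []   # list of (query, target) pairs

    def __call__(self, query, target):
        if self.buffer:
            _, prediction = min(
                self.buffer,
                key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
            )
        else:
            prediction = [0.0] * len(target)   # assumed answer for an empty buffer
        # Answer first, then store the new pair for future queries.
        self.buffer.append((list(query), list(target)))
        return prediction


norm = RunningNorm()
normalized = [norm(x) for x in (1.0, 3.0)]

regressor = NearestNeighborRegressor()
predictions = [
    regressor([0.0, 0.0], [1.0]),
    regressor([10.0, 10.0], [5.0]),
    regressor([9.0, 9.0], [7.0]),
]
```

In this reading, the prediction error of the regressor is large for queries far from anything in the buffer, which is what makes it usable as a novelty signal.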
A.2 Reward combiner operations
Operation  Input type(s)  State  Output type 
Constant {0.01,0.1,0.5,1}  
NormalDistribution  
Add  ,  
Max  ,  
Min  ,  
WeightedNormalizedSum  , , ,  
RunningNorm  
VariableAsBuffer  
Subtract  ,  
Multiply  ,  
Sqrt(Abs(x))  
Mean  
Note that . RunningNorm keeps track of the variance of the input and normalizes by that variance.
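To illustrate how operations such as Constant, Add, Multiply and Max can be composed into a reward combiner, here is a minimal Python sketch. The input names `"extrinsic"` and `"intrinsic"`, the helper functions, and the particular program are hypothetical; the sketch only shows the compositional style and does not reproduce any program found by the search.

```python
def constant(value):
    """Leaf node that ignores its inputs and returns a fixed value."""
    return lambda inputs: value


def variable(name):
    """Leaf node that reads one named input (names are assumed here)."""
    return lambda inputs: inputs[name]


def add(left, right):
    return lambda inputs: left(inputs) + right(inputs)


def multiply(left, right):
    return lambda inputs: left(inputs) * right(inputs)


def maximum(left, right):
    return lambda inputs: max(left(inputs), right(inputs))


# Hypothetical program: proxy reward = extrinsic + 0.1 * intrinsic.
combiner = add(variable("extrinsic"),
               multiply(constant(0.1), variable("intrinsic")))
proxy_reward = combiner({"extrinsic": 1.0, "intrinsic": 2.0})

# A second hypothetical program that clips the intrinsic reward from below.
clipped = maximum(constant(0.5), variable("intrinsic"))
clipped_reward = clipped({"extrinsic": 0.0, "intrinsic": 2.0})
```

Each program is just a small tree of such nodes, so searching over reward combiners amounts to enumerating trees built from the operations in the table.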
A.3 Two other published algorithms covered by our DSL
Appendix B Related work on meta-RL and generalization
Most work on meta-RL has focused on learning transferable feature representations or parameter values for quickly adapting to new tasks (finn2017model; Finn:EECS2018105; clavera2018learning) or improving performance on a single task (xu2018meta; veeriah2019discovery). However, the range of variability between tasks is typically limited to variations of the same goal (such as moving at different speeds or to different locations) or generalizing to different environment variations (such as different mazes or different terrain slopes). There have been some attempts to broaden the spectrum of generalization, showing transfer between Atari games thanks to modularity (fernando2017pathnet; rusu2016progressive) or proper pretraining (parisotto2015actor). However, as noted by nichol2018gotta, Atari games are too different to get big gains with current feature-transfer methods; they instead suggest using different levels of the game Sonic to benchmark generalization. Moreover, yu2019 recently proposed a benchmark of many tasks. wang2019paired automatically generate different terrains for a bipedal walker and transfer policies between terrains, showing that this is more effective than learning a policy on hard terrains from scratch, which is similar to our suggestion in section 3.2. In contrast to these methods, we aim at generalization between completely different environments, even between environments that do not share the same state and action spaces.
Appendix C Predicting algorithm performance
Appendix D Performance on grid world
Appendix E Interesting programs found by our search
One can give meaning to the role of all 3 neural networks by considering how they contribute to minimizing the loss. To do so, let us name the networks: (as labeled in the figure) as (for random embedding), as (for backwards) and as (for forward and random embedding) and look at the algorithm in equation form:
random initialization  
(1) 
We can see that will indeed be a random embedding because the network is randomly initialized and is not trained. Then, we observe that the second term in the loss for , which does not involve and thus has to minimize alone, is . In this term, receives a transformation of and has to make it very similar to the same transformation applied to ; therefore, this term is similar to the cycle-consistency found in some other parts of machine learning (zhu2017unpaired) and must act like a backward model. Finally, looking at the minimization of receives the original and has to output a vector such that the backward model will bring it close to the random embedding of . Therefore must learn a forward model composed with the random embedding of . Finally, we see that the algorithm outputs , going forward and backward for both and and comparing the difference. In summary, this distance combines errors in the cycle-consistency of predictions (which will be higher in unvisited parts of the state space) with distance in the random embedding space between and , i.e. moving to a very different state.