I Introduction
How do we build agents that perform well over a wide range of tasks? Achieving this generality (or universality) is considered by many research agendas to be one of the core challenges in artificial intelligence (AI) (Solomonoff, 1964; Gottfredson, 1997; Hernandez-Orallo, 2000; Hutter, 2004). Generality is at the heart of IQ tests (Spearman, 1904; Raven, 1936; Urbina, 2011) and the measure of intelligence suggested by Legg and Hutter (2007).

Metalearning is a practical machine learning approach to building general AI systems such as classifiers, predictors, and agents. Broadly speaking, metalearning aims to produce flexible, data-efficient learning systems through the acquisition of inductive biases from data (Thrun and Pratt, 1998; Schmidhuber et al., 1996). In contrast to systems that build in such biases by design, in metalearning they are acquired by training on a distribution over tasks. For example, an agent trained to find rewards in one maze will simply learn the solution to that maze, but an agent trained on mazes drawn from a broad class will learn a general-purpose strategy for exploring new mazes (Wang et al., 2016; Duan et al., 2016). Systems trained in this way are able to absorb structure in the task distribution that allows them to adapt efficiently and generalize to new tasks, effectively leveraging past experience to speed up new learning. Therefore, a metalearner can be thought of as a primary learning system that progressively improves the learning of a secondary system (Thrun and Pratt, 1998; Hochreiter et al., 2001; Schmidhuber et al., 1996).

This method has been shown to be remarkably effective in practice, especially in combination with deep learning architectures. Recent years have brought a wealth of approaches centered on metalearning different aspects of the learning process, such as learning the optimizer (Andrychowicz et al., 2016; Li and Malik, 2016; Ravi and Larochelle, 2016; Wichrowska et al., 2017; Chen et al., 2018b), the metric space (Vinyals et al., 2016; Snell et al., 2017), the initial network parameters (Finn et al., 2017; Nichol and Schulman, 2018), the learning targets (Xu et al., 2018), conditional distributions (Wang et al., 2017; Garnelo et al., 2018; Gordon et al., 2018; Zintgraf et al., 2018; Chen et al., 2018a), or even the entire learning procedure using a memory-based architecture such as a recurrent neural network (Santoro et al., 2016; Wang et al., 2016; Duan et al., 2016; Denil et al., 2016; Mishra et al., 2018). Some approaches have also taken advantage of modularity as an inductive bias to learn modules to be reused in transfer tasks (Reed and De Freitas, 2015).

In this report we focus on this last class of memory-based metalearning methods, which aim to find sequential strategies that learn from experience. Specifically, we aim for a theoretical understanding of such metalearning methods by recasting them within a Bayesian framework. Our goal is to provide a basic algorithmic template to which various metalearning procedures conform, showing that learned strategies are capable of performing near-optimally. We hope that this deeper conceptual understanding will provide the foundations for new, scalable models.
The significance of memory-based metalearning methods rests in their ability to build, in a scalable and data-driven way, systems that behave as if they had a probabilistic model of the future. Agents with probabilistic models possess inductive biases that allow them to quickly draw structured inferences based on experience. However, building these agents is very challenging: typically, constructing them involves specifying both the probabilistic models and their inference procedures either by hand or through probabilistic programming.
In contrast, metalearning offers a simple alternative: to precondition a system with training samples in order to fix the right inductive biases necessary at test time.
The key insight is that the metatraining process generates training samples that are implicitly filtered according to Bayes' rule, i.e. the samples are drawn directly from the posterior predictive distribution. Combined with a suitably chosen cost function, metalearning can use these Bayes-filtered samples to regress an adaptive strategy that solves a task quickly by implicitly performing Bayesian updates “under the hood”, that is, without computing the (typically intractable) Bayesian updates explicitly. In this way, memory-based metalearning agents can behave as if they possess a probabilistic model (Orhan and Ma, 2017). Moreover, the agents track, in their memory dynamics, the Bayesian sufficient statistics necessary for estimating the uncertainties for solving the task. Note that this conceptualization is distinct from but extends ideas presented in Baxter (1998, 2000), which considered only the supervised learning case, and in Finn et al. (2017) and Grant et al. (2018), in which the inference step is built-in and constrained to only a few gradient-descent steps.

This report is structured as follows. Throughout, we focus on simple toy examples before moving on to discuss issues with scaling and practical applications. Section II reviews sequential predictions. This is the most basic application of metalearning of sequential strategies, as it only requires regressing the statistics of the training samples. The analysis of the sequential prediction case will also serve as a basis for studying other applications and for investigating the structure of the solutions found through metalearning. Section III reviews the sequential decision-making case. Here we show how to combine the basic metalearning scheme with a policy improvement method. We illustrate this with two minimal examples: one for building Thompson sampling agents, which is the natural extension of the prediction case, and another for building Bayes-optimal agents. Finally, Section IV discusses the connection between metalearning and Bayesian statistics, the spontaneous emergence of metalearned solutions in (single-task) online learning, and future challenges.
II Sequential Prediction
We start our analysis with the problem of sequential prediction, i.e. the task of forecasting the future based on past experience. We use this case because sequential prediction is the most basic application of metalearning, and it will lay down the basics for analyzing other applications such as sequential decision-making.
Consider the following sequence prediction problems:
 I) 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, ?
 II) 1, 4, 9, 16, 25, ?
 III) 1, 2, 3, 4, ?
What is the next number in the sequence? The answers are given by:
 I) 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1
 II) 1, 4, 9, 16, 25, 36
 III) 1, 2, 3, 4, 29
These are justified as follows. Sequence (I) is a unary encoding of the natural numbers. (II) is the sequence of quadratic numbers $n^2$. Finally, (III) is the sequence $n^4 - 10n^3 + 35n^2 - 49n + 24$. Intuitively however, 29 feels very unlikely, as 5 (i.e. the next natural number) seems a far more natural choice compared to the complex 4th-degree polynomial (Hutter, 2004).
In spite of the enormous number of possible explanations, we were remarkably good at predicting the next element in the sequence. Our prediction ability is at the same time general and data-efficient. We possess inductive biases that allowed us to quickly narrow down the space of possible sequences to just a few, even though there isn’t any obvious shared structure among the three examples; furthermore, these biases permitted us to judge the relative plausibilities of competing predictions (such as “29” versus “5” in the previous example). Our prediction strategy appears to follow two principles: we maintained every possible explanation/hypothesis we could think of given resource constraints (Epicurus’ principle), and we deemed simpler explanations as being more likely (Occam’s razor). These are also the core principles underlying Bayesian statistics, which is arguably the gold standard for computing predictions given the inductive biases (Jaynes, 2003).
But where do these inductive biases come from? How does a learner know what is most parsimonious? Here, metalearning provides a simple answer: a learner, through repeated exposure to various tasks, effectively captures the statistics of the data, which translate into the inductive biases necessary for future predictions.
In the following we will briefly review how to make sequential predictions using Bayesian ideas. Then, we will show how to numerically approximate this prediction strategy using metalearning.
II-A Problem setup
We now formalize the setup for sequential prediction. For convenience, we limit ourselves to finite observation spaces and discrete time. [Footnote 1: Note however that this assumption comes with a loss of generality. Extensions to e.g. continuous domains typically require additional (geometric) assumptions which are beyond the scope of this report.] Outputs are distributions over finite spaces unless stated otherwise.
Our goal is to set up a generative process over trajectories (i.e. finite sequences of observations), where the trajectories are drawn not from a single generator, but from a class of generators. Each generator will correspond to a possible “ground truth” that we want the system to consider as a hypothesis. By defining a loss function, we tell the system what to do (i.e. the task to perform) or what to predict under each situation. Training the system with this generative process then encourages the system to adopt the different generators as potential hypotheses.
II-A1 Sequences
Let $\mathcal{X}$ be a finite alphabet of observations. The set of (finite) strings over $\mathcal{X}$ is written as $\mathcal{X}^*$, which includes the empty string $\epsilon$. $\mathcal{X}^\infty$ denotes the set of one-way infinite sequences over $\mathcal{X}$. For concreteness, we assume that $\mathcal{X} = \{1, 2, \ldots, K\}$, where $K$ could be very large. For strings, subindices correspond to (time) indices as in $x_1 x_2 \ldots x_t$, and we use obvious shorthands for substrings such as $x_{<t} = x_1 \ldots x_{t-1}$ and $x_{\le t} = x_1 \ldots x_t$. If the length $T$ is implicit from the context, then we also write strings of length $T$ simply as $\tau = x_{\le T}$, from the word trajectory.
II-A2 Generators/Hypotheses
The domain of possible (stochastic) generators will be modeled using a set of generators, formalized as a class $\mathcal{P}$ of distributions over infinite sequences in $\mathcal{X}^\infty$. These will become, after training, the set of hypotheses of the system, and henceforth we will use the terms “generator” and “hypothesis” interchangeably. Specifically, we demand that for each distribution $P \in \mathcal{P}$ over strings, the probability $P(x_t \mid x_{<t})$ of any next symbol $x_t$ given any past $x_{<t}$ is specified. [Footnote 2: This requirement avoids the technical subtleties associated with conditioning on pasts having probability zero.] The probability of an observation string $x_{\le t}$ is then equal to the product $P(x_{\le t}) = \prod_{k=1}^{t} P(x_k \mid x_{<k})$. Defining the conditionals also uniquely determines the distribution over infinite sequences. [Footnote 3: More precisely, a collection of consistent distributions may be defined over each of the spaces $\mathcal{X}^t$ for $t \in \mathbb{N}$ via the conditionals specified above. Kolmogorov’s extension theorem then states that there exists a distribution over infinite sequences in $\mathcal{X}^\infty$ (with respect to the sigma-algebra generated by cylinder sets $\Gamma_{x_{\le t}}$, where $\Gamma_{x_{\le t}}$ denotes the set containing all one-way infinite sequences having a common prefix $x_{\le t}$) that is consistent with the distributions over finite sequences defined via conditionals.]

We will also index the distributions in $\mathcal{P}$ by a countable parameter set $\Theta$, so that each member is a distribution $P_\theta$, where $\theta \in \Theta$. [Footnote 4: This choice of the cardinality of the set of parameters is for simplifying our mathematical exposition. In practice, the extension to uncountable parameter sets is straightforward (see the Dirichlet example below).] To use a notation that neatly fits the Bayesian interpretation, we will write $P(\tau \mid \theta)$ rather than $P_\theta(\tau)$; that is, the parameter $\theta$ is interpreted as a conditional.

Finally we place prior probabilities $P(\theta)$ over the members in $\mathcal{P}$. This will play the role of our measure of simplicity (or inductive bias) of a hypothesis, where a simpler one possesses more prior mass. [Footnote 5: Measuring “simplicity” in this way is justified by the implied description length of the hypothesis under an optimal code, namely $-\log_2 P(\theta)$ bits.]

Example 1.
(Dice roll prediction) In a dice roll prediction problem, the set of hypotheses is given by a collection of $K$-sided “dice” that generate i.i.d. rolls according to the categorical distribution, that is (with a slight abuse of notation) $P(x_t = i \mid \theta) = \theta_i$, where $\theta_i$ is the $i$-th element of the probability vector $\theta$. If there are $M$ such dice, then a possible prior distribution is the uniform $P(\theta) = 1/M$. Indeed, this example can be generalized to the set of all probability vectors in the simplex $\Delta(\mathcal{X})$, with a uniform prior density. In this case, the whole process (i.e. first sampling the parameter $\theta$ and then sampling the observation sequence $x_1, x_2, \ldots$) is known as a Dirichlet-Categorical process.
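The two-stage generative process of the dice example (first draw a die from the prior, then roll it repeatedly) can be sketched as follows. This is only an illustrative sketch: the three dice, the uniform prior, and the names `DICE`, `PRIOR` and `sample_trajectory` are assumptions made for the example and do not come from the report.

```python
import random

# Hypothetical hypothesis class: three 3-sided dice with a uniform prior.
DICE = [
    (0.8, 0.1, 0.1),
    (0.1, 0.8, 0.1),
    (1 / 3, 1 / 3, 1 / 3),
]
PRIOR = [1 / 3, 1 / 3, 1 / 3]


def sample_trajectory(dice, prior, length, rng):
    """Draw a die index from the prior, then `length` i.i.d. rolls from that die."""
    idx = rng.choices(range(len(dice)), weights=prior, k=1)[0]
    theta = dice[idx]
    faces = list(range(1, len(theta) + 1))  # faces are numbered 1..K
    rolls = rng.choices(faces, weights=theta, k=length)
    return idx, rolls


rng = random.Random(0)
idx, rolls = sample_trajectory(DICE, PRIOR, length=10, rng=rng)
```

A learner trained on such samples only ever sees `rolls`; the index `idx` of the die that produced them stays hidden.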
Example 2.
(Optional: Algorithmic sequence prediction) In the introductory example we used sequences that are generated according to patterns. These patterns can be formalized as algorithms. [Footnote 6: In fact, Example 1 is a special case of a collection of algorithms.] Algorithmic information theory allows us to formalize a well-known, very large hypothesis class (Vitanyi and Li, 1997). Take a universal Turing machine $U$ that takes a binary-encoded program $\theta \in \{0,1\}^*$ and produces a binary sequence $U(\theta) \in \{0,1\}^\infty$ as a result. [Footnote 7: We assume that the universal Turing machine is prefix-free, and that it outputs trailing zeros after reaching a halting state.] Then we can generate a random binary sequence by providing the universal Turing machine with fair coin flips, and then running the machine on this input.

In this case, each hypothesis is a (degenerate) distribution over sequences, where $P(\tau \mid \theta) = 1$ if $\tau$ is a prefix of $U(\theta)$ and zero otherwise. The prior distribution over binary programs is $P(\theta) = 2^{-l(\theta)}$, where $l(\theta)$ is the length of the program $\theta$. [Footnote 8: Strictly speaking, this distribution might be unnormalized, and we refer the reader to Vitanyi and Li (1997) for a detailed technical discussion.] The resulting distribution over sequences is known as the algorithmic prior or Solomonoff’s prior.
II-A3 Strategies
The system uses a strategy to solve a task. In general, strategies can implement predictions (over observations) and/or policies (distributions over actions). In the prediction case, we formally define a strategy as a distribution $\pi$ over strings in $\mathcal{X}^\infty$. This is the probability distribution that characterizes the system’s outputs, and it should not be confused with the generators. Then, $\pi(x_t \mid x_{<t})$ will denote a prediction over the next symbol $x_t$ given the past $x_{<t}$. [Footnote 9: Throughout the paper, for any distribution $P$ over $\mathcal{X}^\infty$, we define $P(x_{\le t})$ as the marginal distribution over the first $t$ symbols, i.e. $P(x_{\le t}) = P(\Gamma_{x_{\le t}})$. Then, $P(x_t \mid x_{<t})$ is defined as $P(x_{\le t}) / P(x_{<t})$.] The set of candidate strategies available to the agent is denoted as $\Pi$.

II-A4 Losses
We consider tasks that can be formalized as the minimization of a loss function. A loss function is a function $\ell$ that maps a strategy $\pi$ and a trajectory $\tau$ into a real-valued cost $\ell(\pi; \tau) \in \mathbb{R}$. Intuitively, this captures the loss of using the strategy $\pi$ under the partial observability of the parameter $\theta$ when the trajectory is $\tau$. For instance, in sequential predictions, a typical choice for the loss function is the log-loss $\ell(\pi; \tau) = -\log \pi(\tau)$, also known as the compression loss; in sequential decision-making problems, one typically chooses a negative utility, such as the negative (discounted) cumulative sum of rewards.
II-A5 Goal
The aim of the system is to minimize the expected loss
$$\mathbb{E}[\ell(\pi; \tau)] = \sum_{\theta} P(\theta) \sum_{\tau} P(\tau \mid \theta)\, \ell(\pi; \tau) \tag{1}$$
with respect to the strategy $\pi$. That is, the objective is the expected loss of a trajectory $\tau$, but generated by a latent hypothesis $\theta$ (i.e. the ground truth) randomly chosen according to the desired inductive bias $P(\theta)$. This is the standard objective in Bayesian decision theory in which the system has to choose in the face of (known) uncertainty (Savage, 1972). Notice that this setup assumes the realizable case, that is, the case in which the true generative distribution is a member of the class of hypotheses pondered by the system.
II-B Universality: Bayesian answer
We start by showing how to solve the prediction problem from a purely Bayesian point of view. This brief digression is necessary in order to fully understand the statistical properties of the samples generated during metalearning. It is worth pointing out that the Bayesian solution is typically viewed as following directly from an interpretation of probabilities as degrees of belief rather than from an optimization problem (Jaynes, 2003). This shouldn’t be a distraction however, as we will later express the Bayesian solution in terms of the compression loss in order to establish the connection to metalearning.
The classical Bayesian approach consists in using a predictor given by the mixture distribution
$$P(\tau) = \sum_{\theta} P(\tau \mid \theta)\, P(\theta), \tag{2}$$
in which the probability of a given trajectory $\tau$ is obtained by combining the probabilities predicted by the individual hypotheses weighted by their prior probabilities. The prior probabilities are the initial inductive biases.
The mixture distribution automatically implements an adaptive prediction strategy, and not a static average predictor as a naive reading of its definition might suggest. This is seen as follows. Bayes' rule and (2) imply that given past data $x_{<t}$, we can predict the next observation $x_t$ using the posterior predictive distribution obtained by conditioning on the past:
$$P(x_t \mid x_{<t}) = \frac{P(x_{\le t})}{P(x_{<t})} = \sum_{\theta} P(x_t \mid \theta, x_{<t})\, P(\theta \mid x_{<t}). \tag{3}$$
As (3) shows, the prediction of $x_t$ is given by the average prediction made by the different hypotheses, but weighted by the posterior probabilities $P(\theta \mid x_{<t})$, i.e. the weights updated by the data.
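To make the adaptive nature of the mixture concrete, here is a minimal sketch of the posterior predictive for a finite class of dice. The hypothesis class, the prior and all function names are illustrative assumptions. The sketch checks numerically that the posterior-weighted average of the individual predictions equals the ratio of mixture probabilities of the extended and the original past.

```python
from math import prod

# Hypothetical hypothesis class: three 3-sided dice with a uniform prior.
DICE = [(0.8, 0.1, 0.1), (0.1, 0.8, 0.1), (1 / 3, 1 / 3, 1 / 3)]
PRIOR = [1 / 3, 1 / 3, 1 / 3]


def likelihood(theta, xs):
    """Probability of the rolls xs under die theta (i.i.d., faces are 1-based)."""
    return prod(theta[x - 1] for x in xs)


def mixture(dice, prior, xs):
    """Prior-weighted mixture probability of the rolls xs."""
    return sum(p * likelihood(th, xs) for th, p in zip(dice, prior))


def posterior(dice, prior, past):
    """Posterior weights over the dice after observing `past`."""
    joint = [p * likelihood(th, past) for th, p in zip(dice, prior)]
    z = sum(joint)
    return [j / z for j in joint]


def predictive(dice, prior, past, x):
    """Posterior-weighted average of the per-die predictions for the next roll x."""
    post = posterior(dice, prior, past)
    return sum(w * th[x - 1] for th, w in zip(dice, post))


past = [1, 1, 1]
before = predictive(DICE, PRIOR, [], 1)    # prediction with no data
after = predictive(DICE, PRIOR, past, 1)   # prediction after three 1s
ratio = mixture(DICE, PRIOR, past + [1]) / mixture(DICE, PRIOR, past)
```

After three rolls showing face 1, the prediction for face 1 rises above its prior value because the posterior concentrates on the die that favors that face.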
These predictions converge to the true distribution with more data; [Footnote 10: Notice however that in general the posterior probabilities do not converge unless we impose stricter conditions.] that is, with probability one,
$$P(x_t \mid x_{<t}) - P(x_t \mid \theta^*, x_{<t}) \to 0 \quad \text{as } t \to \infty, \tag{4}$$
where $\theta^*$ is the latent parameter of the true generator (Hutter, 2004, Theorem 3.19). (In contrast, notice that the posterior distribution does not converge in general.) From the standpoint of lossless compression, Bayesian prediction is the optimal strategy. The rate of convergence in (4), or more precisely, the convergence rate of the average excess compression loss (a.k.a. regret) can easily be established (see, e.g., Cesa-Bianchi and Lugosi (2006)): for any $\theta' \in \Theta$, and sequence $x_{\le t}$,
$$\frac{1}{t}\Bigl\{-\log P(x_{\le t}) + \log P(x_{\le t} \mid \theta')\Bigr\} \le -\frac{1}{t}\log P(\theta').$$
That is, depending on the task-prior $P(\theta')$, the average excess compression loss converges to zero at an $O(1/t)$ rate.
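The regret bound follows because the mixture probability of any sequence is at least the prior mass of a hypothesis times that hypothesis's likelihood. The sketch below checks this exhaustively on a small, assumed class of dice (the dice, prior and names are illustrative): for every sequence of length 4 and every die, the average excess log-loss stays below the negative log prior divided by the length.

```python
import itertools
import math

# Hypothetical hypothesis class: three 3-sided dice with a uniform prior.
DICE = [(0.8, 0.1, 0.1), (0.1, 0.8, 0.1), (1 / 3, 1 / 3, 1 / 3)]
PRIOR = [1 / 3, 1 / 3, 1 / 3]


def avg_regret(xs, k):
    """Average excess log-loss of the mixture relative to die k on the rolls xs."""
    mix = sum(p * math.prod(th[x - 1] for x in xs) for th, p in zip(DICE, PRIOR))
    lik = math.prod(DICE[k][x - 1] for x in xs)
    return (math.log(lik) - math.log(mix)) / len(xs)


T = 4
violations = [
    (xs, k)
    for xs in itertools.product((1, 2, 3), repeat=T)
    for k in range(len(DICE))
    if avg_regret(xs, k) > -math.log(PRIOR[k]) / T + 1e-12
]
```

The list `violations` collects every sequence and die for which the bound would fail, so it is expected to remain empty.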
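To prepare the ground for the next section on metalearning, we will characterize the construction of the mixture distribution in terms of a solution to an optimization problem. For this, we use a central result from information theory. As suggested above, we can use the compression loss to regress the statistics of the distribution over trajectories. Formally, we choose $\ell(\pi; \tau) := -\log \pi(\tau)$. Then, we optimize (1):
$$\min_{\pi} \mathbb{E}[\ell(\pi; \tau)] = \min_{\pi} \Bigl\{-\sum_{\theta} P(\theta) \sum_{\tau} P(\tau \mid \theta) \log \pi(\tau)\Bigr\} = \min_{\pi} \Bigl\{-\sum_{\tau} P(\tau) \log \pi(\tau)\Bigr\}, \tag{5}$$
where $P(\tau) = \sum_{\theta} P(\theta) P(\tau \mid \theta)$. The resulting expected prediction loss is the cross-entropy of $\pi$ from the marginal $P$, which implies the minimizer $\pi^* = P$; that is, precisely the desired Bayesian mixture distribution having the optimal prediction properties (Dawid et al., 1999).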
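The step from the cross-entropy to its minimizer can be spelled out with the standard decomposition into entropy and Kullback-Leibler divergence. This is a textbook identity, written here with notation assumed for illustration: $P(\tau)$ denotes the marginal (mixture) over trajectories and $\pi$ the strategy.

$$-\sum_{\tau} P(\tau) \log \pi(\tau) = \underbrace{-\sum_{\tau} P(\tau) \log P(\tau)}_{H(P)} + \underbrace{\sum_{\tau} P(\tau) \log \frac{P(\tau)}{\pi(\tau)}}_{D_{\mathrm{KL}}(P \,\|\, \pi) \,\ge\, 0}$$

The entropy term does not depend on $\pi$, and the divergence term is zero exactly when $\pi = P$, so the cross-entropy is minimized by the marginal itself.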
II-C Universality: metalearning answer
In a nutshell, metalearning consists of finding a minimizer for a Monte-Carlo approximation of the expected loss (1). More precisely, consider a learning architecture (say, a deep learning architecture trained by gradient descent) that implements a function class $\mathcal{F}$, where each member $\pi \in \mathcal{F}$ maps a trajectory $\tau$ into a probability $\pi(\tau)$ of the trajectory. Then, we can approximate the expected loss (1) as
$$\mathbb{E}[\ell(\pi; \tau)] \approx \frac{1}{N} \sum_{n=1}^{N} \ell(\pi; \tau^{(n)}), \tag{6}$$
where $\tau^{(1)}, \ldots, \tau^{(N)}$ are i.i.d. samples drawn from randomly chosen generators as
$$\theta^{(n)} \sim P(\theta), \qquad \tau^{(n)} \sim P(\tau \mid \theta^{(n)}).$$
The goal of metalearning is to find a function $\hat{\pi} \in \mathcal{F}$ that minimizes (6). Since the loss depends on the sampled trajectories $\tau^{(n)}$ but not on the parameters $\theta^{(n)}$, the Monte-Carlo objective implicitly marginalizes over the generators. [Footnote 11: The technique of sampling the parameters rather than marginalizing over them explicitly was also called root-sampling in Silver and Veness (2010).] The computation graph is shown in Figure 1.
We assume that optimizing the Monte-Carlo estimate (6) will give a minimizer $\hat{\pi}$ with the property
$$\hat{\pi}(\tau) \approx P(\tau)$$
for the most probable $\tau$. [Footnote 12: We will avoid discussions of the approximation quality here. It suffices to say that this depends on the learning model used; in particular, on whether the optimum is realizable, and on the smoothness properties of the model near the optimum.] While this gives us a method for modeling the probabilities of trajectories, for sequential predictions we need to incorporate some additional structure into the regression problem.
To do sequential predictions, i.e. implementing $\pi(x_t \mid x_{<t})$, the above optimization does not suffice; instead, we have to impose the correct functional interface constraints onto the regression problem in order to get a system that can map histories into predictions. This is done by setting up the target loss so that the solution implements the required function interface. Specifically, we seek a function that maps histories $x_{<t}$ into predictions of the next observation $x_t$, thereby also respecting the causal structure of the task. If we use a memory-based architecture, such as a recurrent neural network (Elman, 1990; Jordan, 1997; Hochreiter et al., 2001; Kolen and Kremer, 2001; Graves, 2012), then the function $f$ conforms to the interface
$$(\pi_t, m_t) = f(x_{t-1}, m_{t-1}), \tag{7}$$
where $\pi_t$ is the current prediction vector, $m_{t-1}$ and $m_t$ are the preceding and current memory states respectively, and $x_{t-1}$ is the preceding observation. Furthermore, we fix the initial state $m_0$. This is a sufficient condition for the system to make predictions that marginalize over the hypotheses. Obviously, since $m_t$ must remember all the necessary past information, the set of possible memory states $\mathcal{M}$ needs to be sufficiently large for it to possess the capacity for encoding the sufficient statistics of the pasts in $\mathcal{X}^*$. See the discussion below in Section II-D.
We need instantaneous predictions. The associated instantaneous loss function for this interface (7) is then
$$\ell_t := -\log \pi_t(x_t), \tag{8}$$
so that the Monte-Carlo approximation of the expected loss (6) becomes
$$\mathbb{E}[\ell(\pi; \tau)] \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T} \ell_t^{(n)}, \tag{9}$$
where $\ell_t^{(n)}$ are Monte-Carlo samples of the instantaneous log-loss. The computation graph is shown in Figure 2 and the pseudocode is listed in Algorithm 1. [Footnote 13: Note that one could separate learning and sampling in Algorithm 1 by first sampling models and corresponding observation sequences, and then feeding the corresponding observation in the "observe" step of the algorithm. In later sections, however, the prediction of the agent will affect how the observations evolve; we chose to present sampling and prediction/learning in this interleaved way to have a more unified presentation style with Algorithms 2 and 3, given in the next section.] Arguably the most important result in the realizable setting [Footnote 14: Realizable is when the minimizer of the Monte-Carlo loss is in $\mathcal{F}$.] is that, if $N$ is large, and the minimum of (9) is attained, then the minimizer implements a function
$$(\pi_t, m_t) = f(x_{t-1}, m_{t-1})$$
where, crucially, the instantaneous prediction is
$$\pi_t(x_t) \approx P(x_t \mid x_{<t}), \tag{10}$$
i.e. the optimal sequential prediction in which the Bayesian update is automatically amortized (or precomputed) (Ritchie et al., 2016). In practice, this means that we can use $f$ without updating the parameters of the model (say, changing the weights of the neural network) to predict the future from the past, as the prediction algorithm is automatically implemented by the recursive function $f$ with the help of the memory state $m_t$. The next section discusses how this is implemented in the solution.
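The claim that minimizing the Monte-Carlo log-loss yields the amortized Bayesian predictor can be illustrated on a toy problem. The sketch below is not the report's Algorithm 1: as a simplifying assumption it replaces the recurrent network with a lookup table whose memory state is the pair (number of ones, number of zeros) seen so far, and it uses biased coins with a uniform prior over the bias. For such a table the log-loss minimizer in each state is the empirical frequency of a 1 following that state, so no gradient descent is needed. The resulting predictions can be compared against the known posterior predictive for this prior, Laplace's rule $(\text{ones} + 1) / (\text{ones} + \text{zeros} + 2)$.

```python
import random
from collections import defaultdict


def metatrain(num_tasks, length, rng):
    """Fit a tabular predictor by minimizing the Monte-Carlo log-loss.

    The coin bias theta is sampled per task but never shown to the predictor,
    so the fit implicitly marginalizes over it.
    """
    ones_after = defaultdict(int)
    visits = defaultdict(int)
    for _ in range(num_tasks):
        theta = rng.random()          # theta drawn from a uniform prior on [0, 1)
        state = (0, 0)                # fixed initial memory state
        for _ in range(length):
            x = 1 if rng.random() < theta else 0  # next observation given theta
            visits[state] += 1
            ones_after[state] += x
            state = (state[0] + x, state[1] + 1 - x)  # memory update
    # Per-state minimizer of the log-loss: the empirical frequency of a 1.
    return {s: ones_after[s] / visits[s] for s in visits}


def laplace_rule(state):
    """Exact posterior predictive of a 1 under a uniform prior on the bias."""
    ones, zeros = state
    return (ones + 1) / (ones + zeros + 2)


predictor = metatrain(num_tasks=50_000, length=5, rng=random.Random(0))
max_gap = max(abs(p - laplace_rule(s)) for s, p in predictor.items())
```

The table is never told which coin generated a sequence, yet its entries approach the Bayesian posterior predictive, with a remaining gap that reflects Monte-Carlo noise.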
II-D Anatomy of metalearned agents
How does the function $f$ implement the sequential strategy with amortized prediction (10)? In which sense does this strategy rely on an internal model? First, let us review the conditions:

Choice of loss function: The choice of the loss function specifies what solving the task means, i.e. what we want the agent to do as a function of the data.

Functional interface: Since the agent is ultimately implemented as a function, the choice of the interface (e.g. mapping entire trajectories into probabilities versus mapping pasts into predictions) is crucial. Obviously this choice is implicit in the practice of any regression-based machine learning technique. Nevertheless, we point this out because it is especially important in sequential problems, as it determines how the agent is situated within the task, that is, what informational (and ultimately causal) constraints the agent is subject to for solving a task.

Monte-Carlo marginalization: Marginalizing analytically over the generators is in general intractable, except for special cases such as in some exponential families (Koopman, 1936; Abramovich and Ritov, 2013). Instead, metalearning performs a Monte-Carlo marginalization. Furthermore, in the sequential setting, the marginalization hides the identity of the generator, thereby forcing the model to find a function that uses the past experience to improve on the loss. This, in turn, leads to the numerical approximation of amortized Bayesian estimators, as long as the first state is fixed across all the samples.
As a result, we obtain a function $f$ that performs the following operations:

New prediction: $f$ takes the past input $x_{t-1}$ and memory state $m_{t-1}$ to produce a new prediction $\pi_t$ minimizing the instantaneous loss $\ell_t$.

New memory state: In order to perform the new prediction, $f$ combines the past input $x_{t-1}$ and memory state $m_{t-1}$ to produce a new memory state $m_t$ that acts as a sufficient statistic of the entire past $x_{<t}$. That is, there exists a sufficient statistic function $\phi$ extracting all the necessary information from the past to predict the future, i.e.
$$P(x_t \mid x_{<t}) = P(x_t \mid \phi(x_{<t})),$$
and this function is related to $f$ via the equation $\phi(x_{<t}) = m_t$.
That is, the recursive function $f$ implements a state machine (Sipser, 2006) (or transducer) in which the states correspond to memory states, the transitions are the changes of the memory state caused by an input, and the outputs are the predictions. We can represent this state machine as a labeled directed graph. Figure 3 shows a state machine predicted by theory and Figure 4 shows a metalearned state machine.
The state machine thus implements a dynamics driven by the input, where the state captures the sufficient statistics of the past (Kolen and Kremer, 2001). However, it is important to note that the objective (9) does not enforce the minimality of the state machine (at least not without postprocessing (Kolen, 1994)). Indeed, in practice this is often not the case, implying that the states do not correspond to the minimal sufficient statistics and that optimization runs with different initial conditions (e.g. random seeds) can produce different state machines. Furthermore, as with any machine learning method, the accuracy of the transitions of input-state pairs that never occurred during training depends on the generalization ability of the function approximator. This is especially the case for input sequences that are longer than the length of the trajectories seen during training.
State machines are important because they reflect symmetry relations due to their intimate relation to semigroups (Krohn et al., 1968). If a node in the graph has two or more incoming arrows, then there exist two past observation strings $x_{<t}$ and $x'_{<s}$, not necessarily of the same length, such that
$$\phi(x_{<t}) = \phi(x'_{<s}),$$
that is, they map onto the same sufficient statistics (Diaconis, 1988), and hence, all the trajectories that emanate from those states are jointly amortized. Thus, analyzing the graph structure can reveal the invariances of the data relative to the task. In particular, a minimal state machine captures all the invariances. For instance, an exchangeable stochastic sequence (in which the sequence is generated i.i.d. conditional on the hypothesis) leads to a state machine with lattice structure as in Figure 3.
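The lattice structure for exchangeable binary sequences can be made explicit with a small sketch. It assumes, purely for illustration, that the memory state is the pair (number of ones, number of zeros); the function names are hypothetical. Different orderings of the same observations then arrive at the same node, which is why the number of states grows much more slowly than the number of histories.

```python
def update(state, x):
    """Memory update for a binary input x: increment the matching counter."""
    ones, zeros = state
    return (ones + x, zeros + 1 - x)


def run(xs, state=(0, 0)):
    """Feed a whole string through the state machine and return the final state."""
    for x in xs:
        state = update(state, x)
    return state


def reachable_states(depth):
    """All states reachable with at most `depth` inputs from the initial state."""
    seen = {(0, 0)}
    frontier = {(0, 0)}
    for _ in range(depth):
        frontier = {update(s, x) for s in frontier for x in (0, 1)}
        seen |= frontier
    return seen


same_node = run([1, 0, 1]) == run([1, 1, 0])   # two histories, one state
num_states = len(reachable_states(3))          # lattice nodes up to depth 3
num_histories = sum(2 ** t for t in range(4))  # binary strings of length 0..3
```

In this example three inputs produce 15 distinct histories but only 10 states, and the gap widens with depth.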
III Sequential Decision-Making
We now describe two ways of constructing interactive agents, i.e. agents that exchange actions and observations with an external environment. Many of the lessons learned in the sequential prediction case carry over. The main additional difficulty is that, in the decision-making case, unlike in the prediction case, the optimal policy (which is needed in order to generate the trajectories using the right distribution for metalearning) is not available. Hence, we need to interleave two processes: a metalearning process that implicitly amortizes the marginalization over the generators; and a policy improvement process that anneals toward the optimal policy.
III-A Thompson sampling
We can leverage the ideas from the sequential prediction case to create an adaptive agent that acts according to probability matching, and more specifically Thompson sampling (Thompson, 1933), to address the exploration-exploitation problem (Sutton et al., 1998). For this, we need generators that not only produce observations, but also optimal actions provided by experts, which are then used as teaching signals.
In particular, provided we know the expert policies, Thompson sampling translates the reinforcement learning problem into an inference problem. Hence, metatraining a Thompson sampler is akin to metatraining a sequential predictor, with the crucial difference that we want our system to predict expert actions rather than observations. Due to this, Thompson samplers are optimal in the compression sense (i.e. using the log-loss), but not in the Bayes-optimal sense (see next subsection).
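Formally, this time we consider distributions over interaction sequences, that is, strings in $(\mathcal{A} \times \mathcal{O})^*$, where $\mathcal{A}$ and $\mathcal{O}$ are discrete sets of actions and observations respectively. We underline symbols to glue them together, so $\underline{ao}_{\le t} := a_1 o_1 a_2 o_2 \ldots a_t o_t$. Then, a generator is a member $P \in \mathcal{P}$ defining a distribution over strings
$$P(\underline{ao}_{\le T} \mid \theta) = \prod_{t=1}^{T} P(a_t \mid \theta, \underline{ao}_{<t})\, P(o_t \mid \theta, \underline{ao}_{<t} a_t),$$
where the conditional probabilities
$$P(o_t \mid \theta, \underline{ao}_{<t} a_t) \quad \text{and} \quad P(a_t \mid \theta, \underline{ao}_{<t})$$
are the probabilities of the next observation $o_t$ and of the next action $a_t$ given the past, respectively. One can interpret $P(a_t \mid \theta, \underline{ao}_{<t})$ as the desired (or optimal) policy provided by an expert when the observations follow the statistics $P(o_t \mid \theta, \underline{ao}_{<t} a_t)$. In addition, these probabilities must match the causal structure of the interactions for our following derivation to be correct. [Footnote 15: In practice, this is achieved by enforcing a particular factorization of the joint probability distribution over parameters and interactions into conditional probabilities that reflect the causal structure; see Pearl (2009).]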
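Thompson sampling can be characterized as sampling the actions directly from the posterior predictive (Ortega and Braun, 2010). As in Bayesian prediction, consider the mixture distribution
$$P(\underline{ao}_{\le T}) = \sum_{\theta} P(\theta)\, P(\underline{ao}_{\le T} \mid \theta).$$
Then we can generate actions by sampling them from the posterior predictive
$$P(a_t \mid \underline{\hat{a}o}_{<t}) = \sum_{\theta} P(a_t \mid \theta, \underline{ao}_{<t})\, P(\theta \mid \underline{\hat{a}o}_{<t}), \tag{11}$$
where the “hat” as in “$\hat{a}$” denotes a causal intervention and where $P(\theta \mid \underline{\hat{a}o}_{<t})$ is recursively given by
$$P(\theta \mid \underline{\hat{a}o}_{\le t}) = \frac{P(o_t \mid \theta, \underline{ao}_{<t} a_t)\, P(\theta \mid \underline{\hat{a}o}_{<t})}{\sum_{\theta'} P(o_t \mid \theta', \underline{ao}_{<t} a_t)\, P(\theta' \mid \underline{\hat{a}o}_{<t})}. \tag{12}$$
[Footnote 16: See Pearl (2009) for a thorough definition of causal interventions. Equations (11) and (12) are nontrivial and beyond the scope of this report; we refer the reader to Ortega and Braun (2010) for their derivation. In particular, note that $P(a_t \mid \theta, \underline{ao}_{<t})$ in (11) does not have interventions.]
In other words, we continuously condition on the past, treating actions as interventions and observations as normal (Bayesian) conditions. More precisely, unlike observations, past actions were generated by the agent without knowledge of the underlying parameter (hence the parameter $\theta$ and the action $a_t$ are independent conditional on the past experience), and the causal intervention mathematically accounts for this fact.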
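The distinction between intervening on actions and conditioning on observations can be sketched on a toy two-armed Bernoulli bandit. Everything specific here is an assumption for illustration: the two hypotheses, the deterministic expert that always pulls the better arm, and the function names. The posterior is updated using only the likelihood of the observed reward, never the probability of the chosen action, and the policy is the posterior-weighted average of the expert policies.

```python
def obs_likelihood(theta, action, reward):
    """Probability of a binary reward when pulling `action` under hypothesis theta."""
    p = theta[action]
    return p if reward == 1 else 1.0 - p


def expert_policy(theta):
    """Assumed expert: put all probability on the arm with the highest payoff."""
    best = max(range(len(theta)), key=lambda a: theta[a])
    return [1.0 if a == best else 0.0 for a in range(len(theta))]


def update_posterior(post, hyps, action, reward):
    """Condition on the observation only; the action is treated as an intervention."""
    joint = [w * obs_likelihood(th, action, reward) for th, w in zip(hyps, post)]
    z = sum(joint)
    return [j / z for j in joint]


def thompson_policy(post, hyps):
    """Posterior predictive over actions: average the expert policies."""
    num_actions = len(hyps[0])
    return [
        sum(w * expert_policy(th)[a] for th, w in zip(hyps, post))
        for a in range(num_actions)
    ]


# Hypothetical hypotheses: per-arm reward probabilities.
HYPS = [(0.8, 0.2), (0.2, 0.8)]
post = [0.5, 0.5]
policy_before = thompson_policy(post, HYPS)
post = update_posterior(post, HYPS, action=0, reward=1)
policy_after = thompson_policy(post, HYPS)
```

Sampling an action from `policy_after` is equivalent to sampling a hypothesis from the posterior and then following that hypothesis's expert, which is the usual description of Thompson sampling.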
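Metalearning a Thompson sampling agent follows a scheme analogous to sequential prediction. We seek a strategy $\pi$ that amortizes the posterior predictive over actions (11):
$$\pi_t(a_t) \approx P(a_t \mid \underline{\hat{a}o}_{<t}).$$
This strategy conforms to the functional interface
$$(\pi_t, m_t) = f(\underline{ao}_{t-1}, m_{t-1}), \tag{13}$$
where $\pi_t$ is the current policy vector, $m_{t-1}$ and $m_t$ are the preceding and current memory states respectively, and $\underline{ao}_{t-1} = a_{t-1} o_{t-1}$ is the preceding interaction.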
Next we derive the loss function. As in Bayesian prediction, Thompson sampling optimizes the compression of the interaction sequence, characterized by the expected log-loss. This is most easily written recursively in terms of the instantaneous expected log-loss
$$\mathbb{E}[\ell_t \mid \hat{a}o_{<t}] = -\sum_{a_t} P(a_t \mid \hat{a}o_{<t}) \log \pi_t(a_t) \tag{14}$$
for each action $a_t$ given its past $ao_{<t}$.^{17} For a Monte Carlo approximation, we sample trajectories as
$$\theta \sim P(\theta), \qquad a_t \sim \pi_t(a_t), \qquad o_t \sim P(o_t \mid \theta, ao_{<t}a_t), \qquad t = 1, \ldots, T. \tag{15}$$
^{17} The expected log-loss for observations is omitted, as we only need to regress the policy here.
In particular, note how actions are drawn from the agent’s policy, not from the generator. This ensures that the agent does not use any privileged information about the generator’s identity, thus covering the support over all the trajectories that the agent might explore. Then, we can choose the instantaneous loss function as the cross-entropy
$$\ell_t = -\sum_{a_t} P(a_t \mid \theta, ao_{<t}) \log \pi_t(a_t) \tag{16}$$
evaluated on the sampled trajectories. This is known as (a variant of) policy distillation (Rusu et al., 2015). Again, this only works if we have access to the ground-truth optimal policy for each environment. The computation graph is shown in Figure 5 and the pseudocode is listed in Algorithm 2.
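The sampling scheme and the distillation loss described above can be sketched in a few lines. This is only an illustration under stated assumptions: the bandit `REWARD_PROB`, the deterministic experts, and the placeholder `agent_step` (which always returns a uniform policy instead of a trained model) are hypothetical. The sketch shows where the two sources of randomness come from: the action is sampled from the agent's own policy, the observation from the hidden generator.

```python
import math
import random

# Hypothetical two-armed Bernoulli bandit: REWARD_PROB[theta][a] = P(o=1 | theta, a).
REWARD_PROB = [[0.9, 0.1], [0.1, 0.9]]


def expert_policy(theta):
    """Assumed ground-truth expert for generator theta: always pull its better arm."""
    return [1.0, 0.0] if theta == 0 else [0.0, 1.0]


def agent_step(memory, interaction):
    """Placeholder for the learned map (memory, last interaction) -> (policy, memory).

    It always returns the uniform policy; a trained model would adapt to the past.
    """
    return [0.5, 0.5], memory + [interaction]


def trajectory_loss(horizon, rng, eps=1e-12):
    """Sum of instantaneous cross-entropy losses along one sampled trajectory."""
    theta = rng.randrange(2)  # draw a generator from a uniform prior
    memory, interaction, total = [], None, 0.0
    for _ in range(horizon):
        policy, memory = agent_step(memory, interaction)
        # Cross-entropy between the expert's policy and the agent's policy.
        total -= sum(p * math.log(q + eps)
                     for p, q in zip(expert_policy(theta), policy))
        # The action comes from the agent, the observation from the generator.
        action = rng.choices([0, 1], weights=policy)[0]
        obs = 1 if rng.random() < REWARD_PROB[theta][action] else 0
        interaction = (action, obs)
    return total


rng = random.Random(0)
average_loss = sum(trajectory_loss(5, rng) for _ in range(100)) / 100
```

With the uniform placeholder policy every step costs $\log 2$ nats, so the average loss over trajectories of length 5 is $5 \log 2$; a trained agent would reduce this by concentrating probability on the expert's action as evidence accumulates.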
Finally, we note that the above metatraining algorithm is designed for finding agents that implicitly update their posterior after each time step. However, this can lead to unstable policies that change their behavior (i.e. the expert policy they follow) at each time step. Such inconsistencies can be addressed by updating the posterior only after experiencing longer interaction sequences, for instance only after an episode. We refer the reader to Russo et al. (2018), Osband and Van Roy (2016), and Ouyang et al. (2017) for a detailed discussion.
III-B Bayes-Optimality
Bayes-optimal sequential decision-making is the decision strategy that follows from the theory of subjective expected utility (Savage, 1972) and the method of dynamic programming (Bellman, 1954). Roughly, it consists in always picking an action that maximizes the value, that is, the expected sum of future rewards under the best future actions. Methodologically, it requires solving a stochastic partial difference equation modeling the value for given boundary conditions, where the latter typically constrain the value to zero everywhere along the planning horizon (Bertsekas, 2008). Because of this, learning a Bayes-optimal policy is far more challenging than learning a Thompson sampling strategy. Here we also depart from the log-loss and use a reward function instead to characterize the task goals.
In this case the generators are distributions over observation sequences conditioned on past interactions, where each member $\theta$ defines a conditional distribution
$$P(o_t \mid \theta, ao_{<t}a_t).$$
However, unlike the Thompson sampling case, here we seek a global optimal policy $\pi^*$, which is defined indirectly via a reward function as discussed later. This global policy will by construction solve the exploration-exploitation problem in a Bayes-optimal way, although it is tailored specifically to the given task distribution.
Because the optimal policy is unknown, in practice during training we only have access to trajectories $\tau$ drawn from a distribution $P_\pi$ that results from the interaction between the expected task and a custom policy $\pi$. $P_\pi$ is given by
$$P_\pi(\tau) = \sum_{\theta} P(\theta) \prod_{t=1}^{T} \pi(a_t \mid ao_{<t}) \, P(o_t \mid \theta, ao_{<t}a_t),$$
that is, a distribution where actions are drawn from the agent’s current strategy and where observations are drawn from a randomly chosen generator, as described in (15) for Thompson sampling.
As mentioned above, we specify the agent’s objective with a reward function. This is a global reward function $r$ that maps every interaction $a_to_t$ and past $ao_{<t}$ into a scalar value $r(ao_{\leq t}) \in \mathbb{R}$, indicating the interaction’s desirability. Furthermore, we define the action-value function for a policy $\pi$ as the expected sum of rewards given a past,^{18} that is,
$$Q^{\pi}(a_t \mid ao_{<t}) = \mathbb{E}_{\pi}\Big[ \sum_{k=t}^{T} r(ao_{\leq k}) \,\Big|\, ao_{<t}a_t \Big], \tag{17}$$
where $\mathbb{E}_{\pi}$ denotes an expectation w.r.t. the distribution $P_\pi$. Notice that this definition implicitly equates the rewards after the horizon $T$ with zero. An optimal policy is defined as any policy $\pi^*$ that maximizes (17) for any past $ao_{<t}$. Note that this is a recursive definition that can be solved using dynamic programming.
^{18} For simplicity, we assume that rewards are undiscounted.
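The recursive character of the action-value definition in (17) can be made concrete by solving a very small problem exactly with backward induction. The sketch below assumes a hypothetical two-armed Bernoulli bandit with two equally likely generators, a horizon of two steps, and a reward equal to the observation; all of these choices and names are illustrative assumptions. Such exhaustive recursion is only feasible for tiny problems, which is precisely why one would want to amortize it.

```python
from functools import lru_cache

# Hypothetical two-armed Bernoulli bandit: REWARD_PROB[theta][a] = P(o=1 | theta, a).
REWARD_PROB = ((0.9, 0.1), (0.1, 0.9))
PRIOR = (0.5, 0.5)
HORIZON = 2  # rewards after the horizon count as zero


def likelihood(theta, action, obs):
    p = REWARD_PROB[theta][action]
    return p if obs == 1 else 1.0 - p


def posterior(past):
    """Posterior over theta after a past given as a tuple of (action, obs) pairs."""
    weights = list(PRIOR)
    for action, obs in past:
        weights = [w * likelihood(th, action, obs) for th, w in enumerate(weights)]
    total = sum(weights)
    return [w / total for w in weights]


@lru_cache(maxsize=None)
def q_value(past, action):
    """Expected undiscounted sum of rewards for `action` after `past`,
    assuming the best actions are taken afterwards (backward induction)."""
    post = posterior(past)
    value = 0.0
    for obs in (0, 1):
        p_obs = sum(post[th] * likelihood(th, action, obs) for th in range(2))
        extended = past + ((action, obs),)
        future = 0.0 if len(extended) == HORIZON else max(
            q_value(extended, a) for a in range(2))
        value += p_obs * (obs + future)  # assumed reward: the observation itself
    return value


q_first = [q_value((), a) for a in range(2)]
```

In this symmetric toy problem both first actions have the same value (1.32): the first pull earns 0.5 in expectation and reveals which generator is likely active, so the second pull earns 0.82.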
Metalearning a Bayes-optimal policy can be done in numerous ways. For concreteness, here we settle on inferring the optimal policy by estimating the action-values, but other methods (e.g. policy gradients) work as well. We seek to amortize the action-values using a vector $q_t$:
$$q_t(a_t) \approx Q^{\pi^*}(a_t \mid ao_{<t}).$$
The functional interface conforms to
$$(q_t, m_t) = f_w(m_{t-1}, a_{t-1}o_{t-1}),$$
where $q_t$ is the current action-value vector used for constructing the policy; $m_{t-1}$ and $m_t$ are the preceding and current memory states respectively; and $a_{t-1}o_{t-1}$ is the preceding interaction, which implicitly also provides the reward.^{19}
^{19} If the reward is not a function of actions and observations, then it needs to be passed explicitly alongside the last interaction.
As the instantaneous loss function for regressing the action-values, we can for instance use the TD-error
$$\ell_t = \big( q_t(a_t) - y_t \big)^2, \qquad y_t = r(ao_{\leq t}) + q_{t+1}(a_{t+1}).$$
Crucially, this is only a function of the value of the current action. The target value $y_t$, given by the sum of the current reward and the value of the next action, is kept constant (no gradient is propagated through it), ensuring that the boundary conditions are propagated in the right direction, namely backwards in time.
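The asymmetry between the current value and the target can be sketched as follows. This is an illustrative, framework-free version with hypothetical names: the target is computed as a plain number, which plays the role of a stop-gradient in an automatic differentiation library, and the hand-written update touches only the value of the action that was actually taken.

```python
def td_loss_and_update(q_current, action, reward, q_next, next_action,
                       terminal, learning_rate=0.1):
    """Squared TD-error for the taken action, plus one manual gradient step.

    Returns (loss, target, updated copy of q_current). The target is a constant:
    q_next is read but never modified, so information flows backwards in time only.
    """
    target = reward + (0.0 if terminal else q_next[next_action])
    error = q_current[action] - target
    loss = error ** 2
    updated = list(q_current)
    updated[action] -= learning_rate * 2.0 * error  # d(loss)/d(q_current[action])
    return loss, target, updated


loss, target, updated = td_loss_and_update(
    q_current=[0.0, 0.5], action=1, reward=1.0,
    q_next=[0.2, 0.7], next_action=0, terminal=False)
```

At the horizon (`terminal=True`) the target reduces to the last reward, which is the boundary condition that subsequent updates propagate to earlier time steps.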
To regress the optimal policy, we use simulated annealing (Kirkpatrick et al., 1983). Specifically, we start from a random policy and then slowly crystallize an optimal policy. To do so, actions are drawn as $a_t \sim \pi_t(a_t)$ from a policy built from the action-values using e.g. the softmax function
$$\pi_t(a_t) = \frac{\exp\{\beta \, q_t(a_t)\}}{\sum_{a} \exp\{\beta \, q_t(a)\}},$$
where the inverse temperature $\beta \geq 0$ is a parameter controlling the stochasticity of the policy: $\beta \approx 0$ yields a nearly uniform policy and $\beta \gg 1$ a nearly deterministic one. During metatraining, the inverse temperature is annealed (cooled), starting from $\beta = 0$ and ending in a large value for $\beta$. This gives the model time to regress the action-values by sampling suboptimal branches before committing to a specific policy. Good cooling schedules are typically determined empirically (Mitra et al., 1986; Nourani and Andresen, 1998). The pseudocode is listed in Algorithm 3.
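A minimal sketch of the softmax policy and its annealing is given below. The linear schedule and its end value of 50 are arbitrary assumptions chosen for illustration; as noted above, good schedules are usually found empirically.

```python
import math


def softmax_policy(q_values, beta):
    """Softmax over action-values with inverse temperature beta (numerically stable)."""
    top = max(q_values)
    weights = [math.exp(beta * (q - top)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def inverse_temperature(step, total_steps, beta_start=0.0, beta_end=50.0):
    """Hypothetical linear cooling schedule from beta_start to beta_end."""
    fraction = step / max(1, total_steps - 1)
    return beta_start + fraction * (beta_end - beta_start)


q = [1.0, 1.2, 0.8]
early = softmax_policy(q, inverse_temperature(0, 100))   # beta = 0: uniform
late = softmax_policy(q, inverse_temperature(99, 100))   # beta = 50: nearly greedy
```

Early in metatraining every action is tried with equal probability, so all branches contribute to the regression of the action-values; late in training almost all probability mass sits on the action with the largest estimated value.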
IV Discussion
IV-A Metalearning and Bayesian statistics
Metalearning is intimately connected to Bayesian statistics regardless of the loss function, due to the statistics of the generated trajectories. When regressing a sequential strategy using a Monte Carlo estimation, we sample trajectories as^{20}
$$\theta \sim P(\theta), \qquad x_t \sim P(x_t \mid \theta, x_{<t}), \qquad t = 1, \ldots, T.$$
^{20} For sequential decision-making problems, we assume that the policy improvement steps have already converged. In this case, the actions are drawn from the target distribution for metalearning, and we can safely ignore the distinction between actions and observations.
However, from the point of view of the system that has already seen the past $x_{<t}$, the transition $x_t$ looks, on average, as if it were sampled with probability
$$P(x_t \mid x_{<t}) = \sum_{\theta} P(x_t \mid \theta, x_{<t}) \, P(\theta \mid x_{<t}), \tag{18}$$
that is, from the Bayesian posterior predictive distribution, which in turn induces the (implicit) update of the hypothesis
$$P(\theta \mid x_{\leq t}) = \frac{P(x_t \mid \theta, x_{<t}) \, P(\theta \mid x_{<t})}{\sum_{\theta'} P(x_t \mid \theta', x_{<t}) \, P(\theta' \mid x_{<t})}. \tag{19}$$
Hence, (18) and (19) together show that the samples are implicitly filtered according to Bayes’ rule. It is precisely this statistical property that is harvested through memory-based metalearning. As shown in Section II, metalearning a Bayesian sequence predictor corresponds to directly regressing the statistics of the samples; due to this, it can be considered the most basic form of metalearning. In contrast, the two sequential decision makers from Section III do not regress the statistics directly but rather use their correspondence to a Bayesian filtration to build adaptive policies.
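The implicit Bayesian filtering of the samples can be checked numerically. The following sketch assumes a hypothetical class of two coins with biases 0.2 and 0.8 under a uniform prior (these values are illustrative, not from the report). It never computes a posterior during sampling, yet the empirical frequency of the next symbol among trajectories that share a past matches the posterior predictive computed with Bayes' rule.

```python
import random

BIASES = [0.2, 0.8]  # hypothetical coin generators under a uniform prior


def sample_trajectory(rng, length=2):
    """Draw a generator, then draw the whole sequence from it."""
    theta = rng.choice(BIASES)
    return [1 if rng.random() < theta else 0 for _ in range(length)]


rng = random.Random(0)
samples = [sample_trajectory(rng) for _ in range(200_000)]

# Keep only trajectories whose past is x_1 = 1 and look at the next symbol.
matching = [x for x in samples if x[0] == 1]
empirical = sum(x[1] for x in matching) / len(matching)

# Posterior predictive P(x_2 = 1 | x_1 = 1) computed explicitly with Bayes' rule.
unnormalized = [0.5 * b for b in BIASES]
posterior = [u / sum(unnormalized) for u in unnormalized]
analytic = sum(p * b for p, b in zip(posterior, BIASES))
```

Here `analytic` equals 0.68, and `empirical` agrees with it up to sampling noise. A regressor trained to predict the next symbol from the past on such samples is therefore pushed towards the Bayesian predictor without ever being told about the prior or the likelihood.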
Conceptually, using a generative process to produce Bayes-filtered samples is an old idea. It is the rationale driving many practical implementations of Bayesian models (Bishop, 2016), and one of the key ideas in Monte Carlo methods for Bayes-optimal planning (Silver and Veness, 2010; Guez et al., 2012).
IV-B Sample complexity of strategies
Current model-free RL algorithms such as deep RL algorithms (e.g. DQN (Mnih et al., 2013) or A3C (Mnih et al., 2016)) are known to be sample inefficient. The sample complexity can potentially be improved by using a suitable model-based approach with strong inductive biases. Fully probabilistic model-based approaches (i.e. those that do not work with expectation models) can rely on handcrafted probabilistic models that either possess little structure (e.g. Dirichlet priors over state transitions) or have intractable (exact) posterior distributions. This can make such approaches unwieldy in practice.
More commonly, traditional approaches have used expectation models (Sutton et al., 2012; Schmidhuber, 1990), or deterministic abstract models (Watter et al., 2015; Silver et al., 2016). However, so far there has been limited success scaling probabilistic modeling to improve sample efficiency in the context of deep reinforcement learning.
Metalearning addresses this problem in a conceptually straightforward manner. It automates the synthesis of near-optimal algorithms by searching in algorithm space (or automata space) in order to find a new reinforcement learning algorithm that is tailored to a given class of tasks, exploiting the structure and using the desired inductive biases. The metalearned algorithms minimize the sample complexity at test time because they directly minimize the expected loss averaged over all generators. In the examples we have seen, metalearning finds sequential algorithms which, when deployed, perform near-optimally, minimizing the sample complexity. The flip side is that metalearning can be very expensive at metatraining time due to the slow convergence of the Monte Carlo approximation and the very large amount of data required by current popular neural architectures during the metatraining phase.
IV-C Spontaneous metalearning
Metalearning can also occur spontaneously in online regression when the capacity of the agent is bounded and the data is produced by a single generator. Unfortunately, the downside is that we cannot easily control what will be metalearned. In particular, spontaneous metalearning could lead to undesirable emergent properties, which is considered an open research problem in AI safety (Ortega et al., 2018). To see how metalearning happens, consider the sequential prediction case. All we need is to show how the conditions for metalearning occur naturally, that is, to identify the Monte Carlo samples and their latent parameters.
[Table I. Rows: Input; Agent State; Generator State; Trajectory; Derived Input; Derived Parameter.]
Let $\mathcal{M}$ and $\mathcal{Z}$ be the agent’s and the generator’s set of internal states respectively, and let $\mathcal{X}$ be the set of observations. Assume that the dynamics are deterministic (stochastic dynamics can be modeled using pseudorandom transitions) and implemented by functions $f$, $g$, $h$, so that the sequence of observations $x_1, x_2, \ldots$, agent states $m_1, m_2, \ldots$, and generator states $z_1, z_2, \ldots$ are given by
$$x_t = f(z_t) \quad \text{(observations)}$$
$$m_{t+1} = g(m_t, x_t) \quad \text{(agent states)}$$
$$z_{t+1} = h(z_t) \quad \text{(generator states)}$$
(with arbitrary initial values $m_1$, $z_1$) as illustrated in Figure 6. Furthermore, assume that $\mathcal{M}$ and $\mathcal{Z}$ are finite.^{21} Then, given an infinite stream of observations, there must exist an agent state $m^* \in \mathcal{M}$ that is visited infinitely often. We can use $m^*$ to segment the sequence of observations into (variable-length) trajectories, which we can identify as Monte Carlo samples from a class of generators, as illustrated by the example in Table I. The $n$-th trajectory is defined as the substring delimited by the $n$-th and $(n+1)$-th occurrence of $m^*$, and we write $x^n_t$ for the $t$-th observation of the $n$-th trajectory. Finally, the task parameter $\theta^n$ of the $n$-th trajectory is the state of the generator at the beginning of that trajectory.
^{21} This assumption can be relaxed to compact metric spaces using tools from dynamical systems theory that are beyond the scope of this report.
Given this identification, an online learning algorithm that updates based on a sufficiently large window of the past will effectively perform batch updates based on a set of trajectories sampled by different “generators”, thereby performing metalearning.
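The segmentation argument can be illustrated with a toy simulation. Everything in the sketch below is a hypothetical example: a generator that cycles through five states, emits a 1 in two of them, and an agent whose state is the parity of the ones seen so far. The stream is cut at every visit of the agent to a chosen anchor state, and the generator state at each cut is recorded as the derived task parameter.

```python
def simulate(steps, z0=0, m0=0):
    """Run hypothetical deterministic dynamics and record the aligned sequences."""
    observations, agent_states, generator_states = [], [], []
    z, m = z0, m0
    for _ in range(steps):
        x = 1 if z in (0, 3) else 0  # observation emitted by the generator
        observations.append(x)
        agent_states.append(m)       # agent state before seeing x
        generator_states.append(z)
        m = (m + x) % 2              # agent state update
        z = (z + 1) % 5              # generator state update
    return observations, agent_states, generator_states


def segment_stream(observations, agent_states, generator_states, anchor):
    """Cut the stream at every visit of the agent to `anchor`.

    Returns (task_parameter, trajectory) pairs, where the task parameter is the
    generator state at the start of the trajectory.
    """
    starts = [t for t, m in enumerate(agent_states) if m == anchor]
    return [(generator_states[begin], observations[begin:end])
            for begin, end in zip(starts, starts[1:])]


obs, agent, gen = simulate(10)
segments = segment_stream(obs, agent, gen, anchor=0)
```

In this run the stream decomposes into trajectories generated under two distinct derived parameters (generator states 0 and 4), even though there is only one underlying generator.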
IV-D Capacity limitations
In our analysis, we have assumed that the solution found is the minimizer of the metalearning objective. This is a very strong assumption. In practice, the difficulty of actually finding a near-optimal solution depends on many factors.
The first and most important factor is of course the model used for regression. Properties such as the inductive biases of the model implementing the function class, the smoothness of the loss landscape, the optimizer, and the memory capacity play fundamental roles in any machine learning method, and metalearning is no exception.
Another important factor is the task class, including both the space of hypotheses and the loss function. Together they shape the complexity of the strategy to regress via metalearning. In particular, the invariances in the state machine (see Section II-D) reduce the number of distinct mappings from past experiences to instantaneous strategies that the regressor has to learn. Conversely, in the worst case when there are no invariances, the number of distinct mappings grows exponentially in the maximum length of the trajectories.
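The effect of invariances on the number of distinct information states can be counted directly in a simple case. The sketch below assumes binary observations and, as an illustrative invariance, that the order of the observations does not matter (as it would for exchangeable coin flips); the function names are hypothetical.

```python
from itertools import product


def distinct_states(max_length, statistic):
    """Count distinct values of `statistic` over all binary pasts up to max_length."""
    seen = set()
    for length in range(max_length + 1):
        for past in product((0, 1), repeat=length):
            seen.add(statistic(past))
    return len(seen)


def identity(past):
    """No invariance: every past is its own information state."""
    return past


def counts(past):
    """Order invariance: only the numbers of zeros and ones matter."""
    return (len(past) - sum(past), sum(past))


without_invariance = distinct_states(10, identity)  # 2**11 - 1 pasts
with_invariance = distinct_states(10, counts)       # 11 * 12 / 2 count pairs
```

For trajectories of length up to 10 this gives 2047 distinct pasts but only 66 distinct count pairs, i.e. exponential versus quadratic growth in the trajectory length.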
IV-E Selected future challenges
Task structure
Metalearning crucially relies on the skillful design of the class of tasks. Previous work has shown that agents can metalearn to perform a variety of tasks if the task distributions have been designed accordingly, such as learning to identify a best option (Denil et al., 2016; Wang et al., 2016) or learning to reason causally (Dasgupta et al., 2019).
In each case, the practitioner must ask the question: if the metalearned strategy should have a capability X, what property Y must the class of tasks possess? For instance, how should we design the generators so that we can generalize out-of-distribution and beyond the length of the trajectories sampled during metalearning?
Addressing questions like these entails further questions regarding the structure of tasks; but to date, we are not aware of an adequate language or formalism to conceptualize this structure rigorously. In particular, from such a formalism we would expect to gain a better understanding of the dynamical structure of solutions, to predict the structure of the sufficient statistics that a class of tasks gives rise to, and to compare two task classes and determine whether they are equivalent (or similar) in some sense.
Beyond expected losses
In the basic metalearning scheme the strategies minimize the expected loss. Minimizing an expectation disregards the higher-order moments of the loss distribution (e.g. variance), leading to risk-insensitive strategies that are brittle under model uncertainty. Going beyond expected losses means that we have to change our certainty-equivalent, that is, the way we aggregate uncertain losses into a single value (Hansen and Sargent, 2008). Changing the certainty-equivalent changes the attitude towards risk. This has important applications ranging from safety and robustness (van den Broek et al., 2010; Ortega and Legg, 2018) to games and multi-agent tasks (McKelvey and Palfrey, 1995, 1998; Littman, 1994). For instance, if the agent is trained on an imperfect simulator of the real world, we would want the agent to explore the world cautiously.
Risk-sensitive strategies can be metalearned by tweaking the statistics of the generative process. For instance, by changing the prior over generators as a function of the strategy’s performance, we can metalearn strategies that are risk-sensitive. For games, it might be necessary to simultaneously metatrain multiple agents in order to find the equilibrium strategies.
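One common way to make this concrete, used here purely as an illustrative assumption and not as the scheme proposed in the text, is the exponential certainty-equivalent $\frac{1}{\beta} \log \mathbb{E}[\exp(\beta \ell)]$. Its gradient with respect to the strategy is an expected-loss gradient under a prior that has been reweighted according to each generator's current loss, which matches the idea of changing the prior as a function of performance. The parameter `beta` and the function names below are hypothetical.

```python
import math


def certainty_equivalent(prior, losses, beta):
    """Exponential certainty-equivalent of per-generator losses.

    beta = 0 recovers the expected loss; beta > 0 emphasizes bad outcomes.
    """
    if beta == 0.0:
        return sum(p * l for p, l in zip(prior, losses))
    top = max(beta * l for l in losses)  # shift for numerical stability
    log_mean = top + math.log(
        sum(p * math.exp(beta * l - top) for p, l in zip(prior, losses)))
    return log_mean / beta


def tilted_prior(prior, losses, beta):
    """Prior over generators reweighted by exp(beta * loss) and renormalized."""
    top = max(beta * l for l in losses)
    weights = [p * math.exp(beta * l - top) for p, l in zip(prior, losses)]
    total = sum(weights)
    return [w / total for w in weights]


prior = [0.5, 0.5]
losses = [1.0, 3.0]  # hypothetical current losses of the strategy on two generators
risk_neutral = certainty_equivalent(prior, losses, beta=0.0)
risk_averse = certainty_equivalent(prior, losses, beta=2.0)
reweighted = tilted_prior(prior, losses, beta=2.0)
```

With `beta > 0` the generator on which the strategy currently performs worst is sampled more often during metatraining, so the resulting strategy is pushed to hedge against its worst cases.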
Continual learning
The continual learning problem asks an agent to learn sequentially, incorporating skills and knowledge in a non-disruptive way. There are strong links between the continual learning problem and the metalearning problem.
On one hand, the traditional metalearning setting assumes a fixed distribution of tasks, which can be restrictive and unrealistic. Allowing the distribution to change over time would not only be crucial from a practical perspective, but could also be used as a tool to refine the task distribution in order to induce the right properties in the learned solution. Continual or incremental variants of the metalearning problem are, however, underexplored (although Nagabandi et al. (2019) touch on this topic).
On the other hand, metalearning can be seen as part of the solution for the continual learning problem. In principle, continual learning is ill-defined: remembering a potentially infinite set of skills is infeasible within a finite model. While continual learning usually focuses on 0-shot transfer (Goodfellow et al., 2014; Parisi et al., 2018; Kirkpatrick et al., 2017), that is, on how well the agent remembers a previously seen task without any adaptation, this might be the wrong measure. A relaxation of this approach, explored e.g. by Kaplanis et al. (2018), would be to measure how fast one recovers performance, which converts continual learning into a metalearning problem. The compression of all seen tasks becomes the metalearning algorithm that the agent needs to infer and that can be exploited to recover the solution of the task. However, the tasks are not necessarily seen in an i.i.d. fashion. So while the mechanism outlined in this work could describe such a solution to a continual learning problem, it is unclear how to approach the learning problem in practice.
IV-F Conclusions
Reinforcement learning algorithms based on probabilistic models promise to address many of the shortcomings of model-free reinforcement learning approaches, in particular their sample inefficiency. However, the implementation of such systems is very challenging and, more often than not, depends on domain expertise, i.e. human knowledge, handcrafted in the form of probabilistic models that are either tractable but too simple to be useful, or outright intractable. Such an approach does not scale.
Memory-based metalearning offers a conceptually simple alternative for the construction of agents implicitly based on probabilistic models that leverages data and large-scale computation. In essence, metalearning transforms the hard problem of probabilistic inference into a curve fitting problem. Here we have provided three metalearning templates: one for building predictors, one for Thompson samplers, and one for Bayes-optimal agents. In all of them, the key idea is to precondition a slow-learning system by exposing it to a distribution over trajectories drawn from a broad class of tasks, so that the metalearned system ends up implementing a general and fast sequential strategy, that is, a strategy with the right inductive biases.
We have also shown why this approach works and how the metalearned strategies are implemented. Basically, the resulting strategies are near-optimal by construction because metalearning directly trains on a Monte Carlo approximation of the expected cost over possible hypotheses. The sequential data drawn during this Monte Carlo approximation is implicitly Bayes-filtered, and the Bayesian updates are amortized by the metalearned strategy. Moreover, we have shown that the adaptation strategy is implemented as a state machine in the agent’s memory dynamics, which is driven by the data the agent experiences. A given memory state then represents the agent’s information state, and, more precisely, the sufficient statistics for predicting its future interactions. The structure of the transition graph encodes the symmetries in the task distribution: paths that have the same initial and final information state are indistinguishable, and thus equivalent for the purposes of modulating the future behavior of the agent.
Finally, we note that metalearning also converts complex probabilistic inference problems into regression problems in one-shot settings, when metalearning is applied without memory (e.g. Orhan and Ma, 2017). A key distinction of the sequential setting with memory is that metalearning also produces an update rule.
Our hope is that readers will find these insights useful, and that they will use the ideas presented here as a conceptual starting point for the development of more advanced algorithms and theoretical investigations. Many challenges remain for the future, such as understanding task structure, dealing with out-of-distribution generalization, and continual learning, to mention some. Some of these can be addressed by tweaking the basic metalearning training process and by designing hypothesis classes with special properties.
More generally though, memory-based metalearning illustrates a more powerful claim: a slow learning system, given enough data and computation, can learn not only a model of its environment but also an entire reasoning procedure. In a sense, this suggests that rationality principles are not necessarily pre-existent, but rather emerge over time as a consequence of the situatedness of the system and its interaction with the environment.
References
 Abramovich and Ritov [2013] Felix Abramovich and Ya’acov Ritov. Statistical theory: a concise introduction. CRC Press, 2013.
 Andrychowicz et al. [2016] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
 Baxter [1998] Jonathan Baxter. Theoretical models of learning to learn. In Learning to learn, pages 71–94. Springer, 1998.
 Baxter [2000] Jonathan Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.
 Bellman [1954] Richard Bellman. The theory of dynamic programming. Technical report, RAND Corp Santa Monica CA, 1954.
 Bertsekas [2008] Dimitri P Bertsekas. Neurodynamic programming. In Encyclopedia of optimization, pages 2555–2560. Springer, 2008.
 Bishop [2016] C.M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer New York, 2016.
 Cesa-Bianchi and Lugosi [2006] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
 Chen et al. [2018a] Yutian Chen, Yannis Assael, Brendan Shillingford, David Budden, Scott Reed, Heiga Zen, Quan Wang, Luis C Cobo, Andrew Trask, Ben Laurie, et al. Sample efficient adaptive text-to-speech. arXiv preprint arXiv:1809.10460, 2018a.
 Chen et al. [2018b] Yutian Chen, Matthew W Hoffman, Sergio Gomez Colmenarejo, Misha Denil, Timothy P Lillicrap, and Nando de Freitas. Learning to learn for global optimization of black box functions. 2018b.
 Dasgupta et al. [2019] Ishita Dasgupta, Jane Wang, Silvia Chiappa, Jovana Mitrovic, Pedro Ortega, David Raposo, Edward Hughes, Peter Battaglia, Matthew Botvinick, and Zeb Kurth-Nelson. Causal reasoning from meta-reinforcement learning. arXiv preprint arXiv:1901.08162, 2019.
 Dawid et al. [1999] A Philip Dawid, Vladimir G Vovk, et al. Prequential probability: Principles and properties. Bernoulli, 5(1):125–162, 1999.
 Denil et al. [2016] Misha Denil, Pulkit Agrawal, Tejas D Kulkarni, Tom Erez, Peter Battaglia, and Nando de Freitas. Learning to perform physics experiments via deep reinforcement learning. arXiv preprint arXiv:1611.01843, 2016.
 Diaconis [1988] Persi Diaconis. Sufficiency as statistical symmetry. In Proceedings of the AMS Centennial Symposium, pages 15–26, 1988.
 Duan et al. [2016] Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL^2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
 Elman [1990] Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
 Finn et al. [2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
 Garnelo et al. [2018] Marta Garnelo, Dan Rosenbaum, Christopher Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo Rezende, and SM Ali Eslami. Conditional neural processes. In International Conference on Machine Learning, pages 1690–1699, 2018.
 Goodfellow et al. [2014] Ian J. Goodfellow, Mehdi Mirza, Da Xiao, Aaron C. Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. ICLR, 2014.
 Gordon et al. [2018] Jonathan Gordon, John Bronskill, Matthias Bauer, Sebastian Nowozin, and Richard Turner. Meta-learning probabilistic inference for prediction. arXiv preprint arXiv:1805.09921, 2018.
 Gottfredson [1997] Linda S Gottfredson. Mainstream science on intelligence: An editorial with 52 signatories, history, and bibliography, 1997.
 Grant et al. [2018] Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. arXiv preprint arXiv:1801.08930, 2018.
 Graves [2012] Alex Graves. Supervised sequence labelling. In Supervised sequence labelling with recurrent neural networks, pages 5–13. Springer, 2012.
 Guez et al. [2012] Arthur Guez, David Silver, and Peter Dayan. Efficient Bayes-adaptive reinforcement learning using sample-based search. In Advances in Neural Information Processing Systems, pages 1025–1033, 2012.
 Hansen and Sargent [2008] Lars Peter Hansen and Thomas J Sargent. Robustness. Princeton university press, 2008.
 Hernandez-Orallo [2000] Jose Hernandez-Orallo. Beyond the Turing test. Journal of Logic, Language and Information, 9(4):447–466, 2000.
 Hochreiter et al. [2001] Sepp Hochreiter, A Steven Younger, and Peter R Conwell. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pages 87–94. Springer, 2001.
 Hutter [2004] Marcus Hutter. Universal artificial intelligence: Sequential decisions based on algorithmic probability. Springer Science & Business Media, 2004.
 Jaynes [2003] Edwin T Jaynes. Probability theory: The logic of science. Cambridge university press, 2003.
 Jordan [1997] Michael I Jordan. Serial order: A parallel distributed processing approach. In Advances in psychology, volume 121, pages 471–495. Elsevier, 1997.

 Kaplanis et al. [2018] Christos Kaplanis, Murray Shanahan, and Claudia Clopath. Continual reinforcement learning with complex synapses. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10–15, 2018, pages 2502–2511, 2018.
 Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kirkpatrick et al. [2017] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 2017.
 Kirkpatrick et al. [1983] Scott Kirkpatrick, C Daniel Gelatt, and Mario P Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.
 Kolen [1994] John F Kolen. Fool’s gold: Extracting finite state machines from recurrent network dynamics. In Advances in neural information processing systems, pages 501–508, 1994.
 Kolen and Kremer [2001] John F Kolen and Stefan C Kremer. A field guide to dynamical recurrent networks. John Wiley & Sons, 2001.
 Koopman [1936] Bernard Osgood Koopman. On distributions admitting a sufficient statistic. Transactions of the American Mathematical society, 39(3):399–409, 1936.
 Krohn et al. [1968] Kenneth Krohn, John L Rhodes, and Michael A Arbib. Algebraic theory of machines, languages, and semigroups. Technical report, Krohn-Rhodes Research Institute, 1968.
 Legg and Hutter [2007] Shane Legg and Marcus Hutter. Universal intelligence: A definition of machine intelligence. Minds and Machines, 17(4):391–444, 2007.