Meta-learning of Sequential Strategies

05/08/2019 ∙ by Pedro A. Ortega, et al. ∙ 16

In this report we review memory-based meta-learning as a tool for building sample-efficient strategies that learn from past experience to adapt to any task within a target class. Our goal is to equip the reader with the conceptual foundations of this tool for building new, scalable agents that operate on broad domains. To do so, we present basic algorithmic templates for building near-optimal predictors and reinforcement learners which behave as if they had a probabilistic model that allowed them to efficiently exploit task structure. Furthermore, we recast memory-based meta-learning within a Bayesian framework, showing that the meta-learned strategies are near-optimal because they amortize Bayes-filtered data, where the adaptation is implemented in the memory dynamics as a state-machine of sufficient statistics. Essentially, memory-based meta-learning translates the hard problem of probabilistic sequential inference into a regression problem.




I Introduction

How do we build agents that perform well over a wide range of tasks? Achieving this generality (or universality) is considered by many research agendas to be one of the core challenges in artificial intelligence (AI) (Solomonoff, 1964; Gottfredson, 1997; Hernandez-Orallo, 2000; Hutter, 2004). Generality is at the heart of IQ tests (Spearman, 1904; Raven, 1936; Urbina, 2011) and the measure of intelligence suggested by Legg and Hutter (2007).


Meta-learning is a practical machine learning approach to building general AI systems such as classifiers, predictors, and agents. Broadly speaking, meta-learning aims to produce flexible, data-efficient learning systems through the acquisition of inductive biases from data (Thrun and Pratt, 1998; Schmidhuber et al., 1996). In contrast to systems that build in such biases by design, in meta-learning they are acquired by training on a distribution over tasks. For example, an agent trained to find rewards in one maze will simply learn the solution to that maze, but an agent trained on mazes drawn from a broad class will learn a general-purpose strategy for exploring new mazes (Wang et al., 2016; Duan et al., 2016). Systems trained in this way are able to absorb structure in the task distribution that allows them to adapt efficiently and generalize to new tasks, effectively leveraging past experience to speed up new learning. Therefore, a meta-learner can be thought of as a primary learning system that progressively improves the learning of a secondary system (Thrun and Pratt, 1998; Hochreiter et al., 2001; Schmidhuber et al., 1996).

This method has been shown to be remarkably effective in practice, especially in combination with deep learning architectures. Recent years have brought a wealth of approaches centered on meta-learning different aspects of the learning process, such as learning the optimizer (Andrychowicz et al., 2016; Li and Malik, 2016; Ravi and Larochelle, 2016; Wichrowska et al., 2017; Chen et al., 2018b), the metric space (Vinyals et al., 2016; Snell et al., 2017), the initial network parameters (Finn et al., 2017; Nichol and Schulman, 2018), the learning targets (Xu et al., 2018), conditional distributions (Wang et al., 2017; Garnelo et al., 2018; Gordon et al., 2018; Zintgraf et al., 2018; Chen et al., 2018a), or even the entire learning procedure using a memory-based architecture such as a recurrent neural network (Santoro et al., 2016; Wang et al., 2016; Duan et al., 2016; Denil et al., 2016; Mishra et al., 2018). Some approaches have also taken advantage of modularity as an inductive bias to learn modules to be re-used in transfer tasks (Reed and De Freitas, 2015).

In this report we focus on this last class of memory-based meta-learning methods, which aim to find sequential strategies that learn from experience. Specifically, we aim for a theoretical understanding of such meta-learning methods by recasting them within a Bayesian framework. Our goal is to provide a basic algorithmic template to which various meta-learning procedures conform, showing that learned strategies are capable of performing near-optimally. We hope that this deeper conceptual understanding will provide the foundations for new, scalable models.

The significance of memory-based meta-learning methods rests in their ability to build—in a scalable and data-driven way—systems that behave as if they had a probabilistic model of the future. Agents with probabilistic models possess inductive biases that allow them to quickly draw structured inferences based on experience. However, building these agents is very challenging: typically, constructing them involves specifying both the probabilistic models and their inference procedures either by hand or through probabilistic programming.

In contrast, meta-learning offers a simple alternative: to precondition a system with training samples in order to fix the right inductive biases necessary at test time.

The key insight is that the meta-training process generates training samples that are implicitly filtered according to Bayes' rule, i.e. the samples are drawn directly from the posterior predictive distribution. Combined with a suitably chosen cost function, meta-learning can use these Bayes-filtered samples to regress an adaptive strategy that solves a task quickly by implicitly performing Bayesian updates “under the hood”, that is, without computing the (typically intractable) Bayesian updates explicitly. In this way, memory-based meta-learning agents can behave as if they possess a probabilistic model (Orhan and Ma, 2017). Moreover, the agents track, in their memory dynamics, the Bayesian sufficient statistics necessary for estimating the uncertainties for solving the task. Note that this conceptualization is distinct from but extends ideas presented in Baxter (1998, 2000), which considered only the supervised learning case, and in Finn et al. (2017) and Grant et al. (2018), in which the inference step is built-in and constrained to only a few gradient-descent steps.

This report is structured as follows. Throughout, we focus on simple toy examples before moving on to discuss issues with scaling and practical applications. Section II reviews sequential predictions. This is the most basic application of meta-learning of sequential strategies, as it only requires regressing the statistics of the training samples. The analysis of the sequential prediction case will also serve as a basis for studying other applications and for investigating the structure of the solutions found through meta-learning. Section III reviews the sequential decision-making case. Here we show how to combine the basic meta-learning scheme with a policy improvement method. We illustrate this with two minimal examples: one for building Thompson sampling agents, which is the natural extension of the prediction case, and another for building Bayes-optimal agents. Finally, Section IV discusses the connection between meta-learning and Bayesian statistics, the spontaneous emergence of meta-learned solutions in (single-task) online learning, and future challenges.

II Sequential Prediction

We start our analysis with the problem of sequential prediction, i.e. the task of forecasting the future based on past experience. We use this case because sequential prediction is the most basic application of meta-learning, and it will lay down the basics for analyzing other applications such as sequential decision-making.

Consider the following sequence prediction problems:

(I) 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, ?

(II) 1, 4, 9, 16, 25, ?

(III) 1, 2, 3, 4, ?

What is the next number in the sequence? The answers are given by:

(I) 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1

(II) 1, 4, 9, 16, 25, 36

(III) 1, 2, 3, 4, 29

These are justified as follows. Sequence (I) is a unary encoding of the natural numbers. (II) is the sequence of quadratic numbers $n^2$. Finally, (III) is the sequence $n^4 - 10n^3 + 35n^2 - 49n + 24$. Intuitively however, 29 feels very unlikely, as 5 (i.e. the next natural number) seems a far more natural choice compared to the complex 4th-degree polynomial (Hutter, 2004).
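The 4th-degree alternative for sequence (III) can be checked numerically. The sketch below assumes the polynomial $n^4 - 10n^3 + 35n^2 - 49n + 24$, which is $(n-1)(n-2)(n-3)(n-4) + n$ expanded; this is one polynomial that agrees with the natural numbers on the first four terms and then departs from them. The function name is illustrative only.

```python
def quartic(n: int) -> int:
    """Assumed 4th-degree polynomial: (n-1)(n-2)(n-3)(n-4) + n, expanded."""
    return n**4 - 10 * n**3 + 35 * n**2 - 49 * n + 24


# The first four values coincide with 1, 2, 3, 4; the fifth does not equal 5.
values = [quartic(n) for n in range(1, 6)]
print(values)
```

Both "5" and "29" are therefore consistent with the observed prefix; only the inductive bias separates them.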

In spite of the enormous number of possible explanations, we were remarkably good at predicting the next element in the sequence. Our prediction ability is at the same time general and data-efficient. We possess inductive biases that allowed us to quickly narrow down the space of possible sequences to just a few, even though there isn’t any obvious shared structure among the three examples; furthermore, these biases permitted us to judge the relative plausibilities of competing predictions (such as “29” versus “5” in the previous example). Our prediction strategy appears to follow two principles: we maintained every possible explanation/hypothesis we could think of given resource constraints (Epicurus’ principle), and we deemed simpler explanations as being more likely (Occam’s razor). These are also the core principles underlying Bayesian statistics, which is arguably the gold standard for computing predictions given the inductive biases (Jaynes, 2003).

But where do these inductive biases come from? How does a learner know what is most parsimonious? Here, meta-learning provides a simple answer: a learner, through repeated exposure to various tasks, effectively captures the statistics of the data, which translate into the inductive biases necessary for future predictions.

In the following we will briefly review how to make sequential predictions using Bayesian ideas. Then, we will show how to numerically approximate this prediction strategy using meta-learning.

II-A Problem setup

We now formalize the setup for sequential prediction. For convenience, we limit ourselves to finite observation spaces and discrete time. [Footnote 1: Note however that this assumption comes with a loss of generality. Extensions to e.g. continuous domains typically require additional (geometric) assumptions which are beyond the scope of this report.] Outputs are distributions over finite spaces unless stated otherwise.

Our goal is to set up a generative process over trajectories (i.e. finite sequences of observations), where the trajectories are drawn not from a single generator, but from a class of generators. Each generator will correspond to a possible “ground truth” that we want the system to consider as a hypothesis. By defining a loss function, we tell the system what to do (i.e. the task to perform) or what to predict under each situation. Training the system with this generative process then encourages the system to adopt the different generators as potential hypotheses.

II-A1 Sequences

Let $\mathcal{X}$ be a finite alphabet of observations. The set of (finite) strings over $\mathcal{X}$ is written as $\mathcal{X}^*$, which includes the empty string $\epsilon$. $\mathcal{X}^\infty$ denotes the set of one-way infinite sequences over $\mathcal{X}$. For concreteness, we assume that $\mathcal{X} = \{1, 2, \ldots, K\}$, where $K$ could be very large. For strings, subindices correspond to (time) indices as in $x_1 x_2 \ldots x_t$, and we use obvious shorthands for substrings such as $x_{\le t} = x_1 \ldots x_t$, and $x_{<t} = x_1 \ldots x_{t-1}$. If the length $T$ is implicit from the context, then we also write strings of length $T$ simply as $\tau$, from the word trajectory.

II-A2 Generators/Hypotheses

The domain of possible (stochastic) generators will be modeled using a set of generators, formalized as a class $\mathcal{P}$ of distributions over infinite sequences in $\mathcal{X}^\infty$. These will become, after training, the set of hypotheses of the system, and henceforth we will use the terms “generator” and “hypothesis” interchangeably. Specifically, we demand that for each distribution $P \in \mathcal{P}$ over strings, the probability $P(x_t | x_{<t})$ of any next symbol $x_t$ given any past $x_{<t}$ is specified. [Footnote 2: This requirement avoids the technical subtleties associated to conditioning on pasts having probability zero.] The probability of an observation string $x_{\le t}$ is then equal to the product $P(x_{\le t}) = \prod_{k=1}^{t} P(x_k | x_{<k})$. Defining the conditionals also uniquely determines the distribution over infinite sequences. [Footnote 3: More precisely, a collection of consistent distributions may be defined over each of the spaces $\mathcal{X}^t$ for $t = 1, 2, \ldots$ via the conditionals specified above. Kolmogorov’s extension theorem then states that there exists a distribution over infinite sequences in $\mathcal{X}^\infty$ (with respect to the sigma-algebra generated by cylinder sets $\Gamma_{x_{\le t}}$, where $\Gamma_{x_{\le t}}$ denotes the set containing all one-way infinite sequences having a common prefix $x_{\le t}$) that is consistent with the distributions over finite sequences defined via conditionals.]

We will also index the distributions in $\mathcal{P}$ by a countable parameter set $\Theta$, so that each member is a distribution $P_\theta \in \mathcal{P}$, where $\theta \in \Theta$. [Footnote 4: This choice of the cardinality of the set of parameters is for simplifying our mathematical exposition. In practice, the extension to uncountable parameter sets is straightforward (see the Dirichlet example below).] To use a notation that fits neatly the Bayesian interpretation, we will write $P(x_t | \theta, x_{<t})$ rather than $P_\theta(x_t | x_{<t})$; that is, the parameter $\theta$ is interpreted as a conditional.

Finally we place prior probabilities $P(\theta)$ over the members in $\Theta$. This will play the role of our measure of simplicity (or inductive bias) of a hypothesis, where a simpler one possesses more prior mass. [Footnote 5: Measuring “simplicity” in this way is justified by the implied description length of the hypothesis under an optimal code, namely $-\log_2 P(\theta)$ bits.]

Example 1.

(Dice roll prediction) In a dice roll prediction problem, the set of hypotheses is given by a collection of $|\mathcal{X}|$-sided “dice” that generate i.i.d. rolls according to the categorical distribution, that is (with a slight abuse of notation) $P(x_t = i | \theta) = \theta_i$, where $\theta_i$ is the $i$-th element of the probability vector $\theta$. If there are $|\Theta|$ such dice, then a possible prior distribution is the uniform $P(\theta) = 1/|\Theta|$.

Indeed, this example can be generalized to the set of all probability vectors in the simplex $\Delta(\mathcal{X})$, with a uniform prior density. In this case, the whole process (i.e. first sampling the parameter $\theta$ and then sampling the observation sequence $x_1, x_2, \ldots$) is known as a Dirichlet-Categorical process.
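A minimal sketch of the Dirichlet-Categorical process, using only the Python standard library. It assumes a uniform prior on the simplex, sampled by normalizing independent Gamma(1, 1) draws (a standard construction of the Dirichlet(1, ..., 1) distribution); the function name and arguments are hypothetical.

```python
import random


def sample_dirichlet_categorical(num_sides, length, rng):
    """Sample theta uniformly from the simplex, then i.i.d. rolls given theta."""
    # Normalizing independent Gamma(1, 1) draws yields a Dirichlet(1, ..., 1)
    # sample, i.e. a uniform draw from the probability simplex.
    gammas = [rng.gammavariate(1.0, 1.0) for _ in range(num_sides)]
    total = sum(gammas)
    theta = [g / total for g in gammas]
    # Conditional on theta, the rolls are i.i.d. categorical.
    rolls = rng.choices(range(1, num_sides + 1), weights=theta, k=length)
    return theta, rolls


rng = random.Random(0)
theta, rolls = sample_dirichlet_categorical(num_sides=6, length=10, rng=rng)
print(theta, rolls)
```

A learner only ever sees `rolls`; the parameter `theta` plays the role of the latent ground truth.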

Example 2.

(Optional: Algorithmic sequence prediction) In the introductory example we used sequences that are generated according to patterns. These patterns can be formalized as algorithms. [Footnote 6: In fact, Example 1 is a special case of a collection of algorithms.] Algorithmic information theory allows us to formalize a well-known, very large hypothesis class (Vitanyi and Li, 1997). Take a universal Turing machine $U$ that takes a binary-encoded program $\theta$ and produces a binary sequence $U(\theta)$ as a result. [Footnote 7: We assume that the universal Turing machine is prefix-free, and that it outputs trailing zeros after reaching a halting state.] Then we can generate a random binary sequence by providing the universal Turing machine with fair coin flips, and then running the machine on this input.

In this case, each hypothesis $\theta$ is a (degenerate) distribution over sequences, where $P(x_{\le t} | \theta) = 1$ if $x_{\le t}$ is a prefix of $U(\theta)$ and zero otherwise. The prior distribution over binary programs is $P(\theta) = 2^{-l(\theta)}$, where $l(\theta)$ is the length of the program $\theta$. [Footnote 8: Strictly speaking, this distribution might be un-normalized, and we refer the reader to Vitanyi and Li (1997) for a detailed technical discussion.] The resulting distribution over sequences is known as the algorithmic prior or Solomonoff’s prior.

II-A3 Strategies

The system uses a strategy to solve a task. In general, strategies can implement predictions (over observations) and/or policies (distributions over actions). In the prediction case, we formally define a strategy as a distribution $\pi$ over strings in $\mathcal{X}^\infty$. This is the probability distribution that characterizes the system’s outputs, and it should not be confused with the generators. Then, $\pi(x_t | x_{<t})$ will denote a prediction over the next symbol $x_t$ given the past $x_{<t}$. [Footnote 9: Throughout the paper, for any distribution $P$ over $\mathcal{X}^\infty$, we define $P(x_{\le t})$ as the marginal distribution over the first $t$ symbols, i.e. the probability that a sequence drawn from $P$ starts with $x_{\le t}$. Then, $P(x_t | x_{<t})$ is defined as $P(x_{\le t}) / P(x_{<t})$.] The set of candidate strategies available to the agent is denoted as $\Pi$.

II-A4 Losses

We consider tasks that can be formalized as the minimization of a loss function. A loss function is a function $\ell$ that maps a strategy $\pi \in \Pi$ and a trajectory $\tau$ into a real-valued cost $\ell(\pi; \tau)$. Intuitively, this captures the loss of using the strategy $\pi$ under the partial observability of the parameter $\theta$ when the trajectory is $\tau$. For instance, in sequential predictions, a typical choice for the loss function is the log-loss $\ell(\pi; \tau) = -\log \pi(\tau)$, also known as the compression loss; in sequential decision-making problems, one typically chooses a negative utility, such as the negative (discounted) cumulative sum of rewards.

II-A5 Goal

The aim of the system is to minimize the expected loss

$$\mathbb{E}[\ell(\pi; \tau)] = \sum_{\theta} P(\theta) \sum_{\tau} P(\tau | \theta) \, \ell(\pi; \tau) \qquad (1)$$

with respect to the strategy $\pi$. That is, the objective is the expected loss of a trajectory $\tau$, but generated by a latent hypothesis $\theta$ (i.e. the ground truth) randomly chosen according to the desired inductive bias $P(\theta)$. This is the standard objective in Bayesian decision-theory in which the system has to choose in the face of (known) uncertainty (Savage, 1972). Notice that this setup assumes the realizable case, that is, the case in which the true generative distribution is a member of the class of hypotheses pondered by the system.

II-B Universality: Bayesian answer

We start by showing how to solve the prediction problem from a purely Bayesian point of view. This brief digression is necessary in order to fully understand the statistical properties of the samples generated during meta-learning. It is worth pointing out that the Bayesian solution is typically viewed as following directly from an interpretation of probabilities as degrees of belief rather than from an optimization problem (Jaynes, 2003). This shouldn’t be a distraction however, as we will later express the Bayesian solution in terms of the compression loss in order to establish the connection to meta-learning.

The classical Bayesian approach consists in using a predictor given by the mixture distribution

$$P(\tau) = \sum_{\theta} P(\tau | \theta) \, P(\theta), \qquad (2)$$

in which the probability of a given trajectory $\tau$ is obtained by combining the probabilities predicted by the individual hypotheses weighted by their prior probabilities. The prior probabilities are the initial inductive biases.

The mixture distribution automatically implements an adaptive prediction strategy, and not a static average predictor as a naive reading of its definition might suggest. This is seen as follows. Bayes' rule and (2) easily imply that given past data $x_{<t}$, we can predict the next observation $x_t$ using the posterior predictive distribution obtained by conditioning on the past:

$$P(x_t | x_{<t}) = \sum_{\theta} P(x_t | \theta, x_{<t}) \, P(\theta | x_{<t}). \qquad (3)$$

As (3) shows, the prediction of $x_t$ is given by the average prediction made by the different hypotheses, but weighted by the posterior probabilities $P(\theta | x_{<t})$, i.e. the weights updated by the data.
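The adaptive behaviour of the mixture can be illustrated with a small sketch of the posterior predictive for a finite hypothesis class. It assumes i.i.d. generators (as in Example 1) and two hypothetical coins; the function and variable names are illustrative and not taken from the report.

```python
def posterior_predictive(prior, likelihoods, past):
    """Return (posterior over hypotheses, predictive over the next symbol).

    prior[k]          : prior probability of hypothesis k
    likelihoods[k][x] : probability of symbol x under hypothesis k
                        (i.i.d. generators are an assumption of this sketch)
    past              : list of symbols observed so far
    """
    # Unnormalized posterior: prior times the likelihood of the past.
    weights = list(prior)
    for x in past:
        weights = [w * lik[x] for w, lik in zip(weights, likelihoods)]
    evidence = sum(weights)
    posterior = [w / evidence for w in weights]
    # Average the per-hypothesis predictions, weighted by the posterior.
    num_symbols = len(likelihoods[0])
    predictive = [
        sum(post * lik[x] for post, lik in zip(posterior, likelihoods))
        for x in range(num_symbols)
    ]
    return posterior, predictive


# Two hypothetical coins; symbol 1 (Head) has probability 0.9 or 0.1.
prior = [0.5, 0.5]
likelihoods = [[0.1, 0.9], [0.9, 0.1]]
posterior, predictive = posterior_predictive(prior, likelihoods, past=[1, 1, 1])
print(posterior, predictive)
```

With an empty past the prediction is the static prior average; after three Heads the weight shifts almost entirely to the Head-biased coin, which is the sense in which the mixture adapts.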

These predictions converge to the true distribution with more data; [Footnote 10: Notice however that in general the posterior probabilities do not converge unless we impose stricter conditions.] that is, with probability one,

$$P(x_t | x_{<t}) - P(x_t | \theta^*, x_{<t}) \to 0 \quad \text{as } t \to \infty, \qquad (4)$$

where $\theta^*$ is the latent parameter of the true generator (Hutter, 2004, Theorem 3.19). (In contrast, notice that the posterior distribution does not converge in general.) From the standpoint of lossless compression, Bayesian prediction is the optimal strategy. The rate of convergence in (4), or more precisely, the convergence rate of the average excess compression loss (a.k.a. regret) can easily be established (see, e.g., Cesa-Bianchi and Lugosi (2006)): for any $\theta \in \Theta$, and sequence $\tau = x_{\le T}$,

$$\frac{1}{T} \Big[ -\log P(\tau) + \log P(\tau | \theta) \Big] \le -\frac{1}{T} \log P(\theta).$$

That is, depending on the task-prior $P(\theta)$, the average excess compression loss converges to zero at an $O(1/T)$ rate.
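The regret bound follows in one step from the definition of the mixture. The derivation below is a sketch in assumed notation: $P(\tau)$ for the mixture, $P(\tau | \theta)$ for a single hypothesis, $P(\theta)$ for its prior mass, and $T$ for the length of $\tau$.

```latex
\begin{align*}
P(\tau) &= \sum_{\theta'} P(\theta')\, P(\tau \mid \theta')
          \;\ge\; P(\theta)\, P(\tau \mid \theta)
          && \text{(drop all terms but one)} \\
-\log P(\tau) &\le -\log P(\theta) - \log P(\tau \mid \theta)
          && \text{(take } -\log \text{ of both sides)} \\
\frac{1}{T}\Big[ -\log P(\tau) + \log P(\tau \mid \theta) \Big]
          &\le -\frac{1}{T} \log P(\theta)
          && \text{(rearrange and divide by } T\text{)}
\end{align*}
```

The right-hand side is the description length of the hypothesis spread over $T$ steps, so hypotheses with more prior mass are matched sooner.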

To prepare the ground for the next section on meta-learning, we will characterize the construction of the mixture distribution in terms of a solution to an optimization problem. For this, we use a central result from information theory. As suggested above, we can use the compression loss to regress the statistics of the distribution over trajectories. Formally, we choose $\ell(\pi; \tau) = -\log \pi(\tau)$. Then, we optimize (1)

$$\min_{\pi} \mathbb{E}[\ell(\pi; \tau)] = \min_{\pi} \Big\{ -\sum_{\tau} P(\tau) \log \pi(\tau) \Big\}, \qquad (5)$$

where $P(\tau) = \sum_{\theta} P(\theta) P(\tau | \theta)$. The resulting expected prediction loss is the cross-entropy of $\pi(\tau)$ from the marginal $P(\tau)$, which implies the minimizer $\pi^*(\tau) = P(\tau)$; that is, precisely the desired Bayesian mixture distribution having the optimal prediction properties (Dawid et al., 1999).
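Why the cross-entropy is minimized by the marginal can be made explicit with the standard entropy/KL decomposition. The notation here is assumed: $\pi$ is the strategy, $P$ the marginal over trajectories, $H$ the Shannon entropy, and $D_{\mathrm{KL}}$ the Kullback-Leibler divergence.

```latex
\begin{align*}
-\sum_{\tau} P(\tau) \log \pi(\tau)
  &= -\sum_{\tau} P(\tau) \log P(\tau)
     + \sum_{\tau} P(\tau) \log \frac{P(\tau)}{\pi(\tau)} \\
  &= H(P) + D_{\mathrm{KL}}(P \,\|\, \pi) \;\ge\; H(P),
\end{align*}
```

with equality if and only if $\pi = P$, since the KL divergence is non-negative and vanishes only when the two distributions coincide. The term $H(P)$ does not depend on $\pi$, so minimizing the expected log-loss is the same as minimizing the divergence from the Bayesian mixture.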

II-C Universality: meta-learning answer

In a nutshell, meta-learning consists of finding a minimizer for a Monte-Carlo approximation of the expected loss (1). More precisely, consider a learning architecture (say, a deep learning architecture trained by gradient descent) that implements a function class $\mathcal{F}$, where each member $f \in \mathcal{F}$ maps a trajectory $\tau$ into a probability $f(\tau)$ of the trajectory. Then, we can approximate the expected loss (1) as

$$\mathbb{E}[\ell(f; \tau)] \approx \frac{1}{N} \sum_{n=1}^{N} \ell(f; \tau^{(n)}), \qquad (6)$$

where $\tau^{(1)}, \ldots, \tau^{(N)}$ are i.i.d. samples drawn from randomly chosen generators as

$$\theta^{(n)} \sim P(\theta), \qquad \tau^{(n)} \sim P(\tau | \theta^{(n)}).$$

The goal of meta-learning is to find a function $f^* \in \mathcal{F}$ that minimizes (6). Since the loss depends on the sampled trajectories $\tau^{(n)}$ but not on the parameters $\theta^{(n)}$, the Monte-Carlo objective implicitly marginalizes over the generators. [Footnote 11: The technique of sampling the parameters rather than marginalizing over them explicitly was also called root-sampling in Silver and Veness (2010).] The computation graph is shown in Figure 1.
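The Monte-Carlo approximation can be sketched for a toy case in which the expected loss is also computable exactly. The example assumes two hypothetical coin generators with a uniform prior, trajectories of length 3, and evaluates the log-loss of one fixed strategy (the Bayesian mixture); all constants and names are illustrative.

```python
import itertools
import math
import random

BIASES = [0.2, 0.8]  # hypothetical generators: probability of Head (symbol 1)
PRIOR = [0.5, 0.5]   # prior over the two generators
T = 3                # trajectory length


def traj_prob(traj, bias):
    """Probability of a trajectory under a single i.i.d. coin generator."""
    return math.prod(bias if x == 1 else 1.0 - bias for x in traj)


def mixture_prob(traj):
    """Probability of a trajectory under the prior-weighted mixture."""
    return sum(p * traj_prob(traj, b) for p, b in zip(PRIOR, BIASES))


def log_loss(strategy, traj):
    return -math.log(strategy(traj))


# Exact expected loss: enumerate every trajectory of length T.
exact = sum(
    mixture_prob(tr) * log_loss(mixture_prob, tr)
    for tr in itertools.product([0, 1], repeat=T)
)

# Monte-Carlo estimate: sample a generator, then a trajectory from it.
rng = random.Random(0)
N = 20000
total = 0.0
for _ in range(N):
    bias = rng.choices(BIASES, weights=PRIOR)[0]
    traj = tuple(int(rng.random() < bias) for _ in range(T))
    # The sampled bias is not passed to the loss: only the trajectory is.
    total += log_loss(mixture_prob, traj)
estimate = total / N
print(exact, estimate)
```

Because the sampled parameter never enters the loss, averaging over rollouts marginalizes over the generators without any explicit sum over hypotheses.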

We assume that optimizing the Monte-Carlo estimate (6) will give a minimizer $f^*$ with the property

$$f^*(\tau) \approx P(\tau)$$

for the most probable $\tau$. [Footnote 12: We will avoid discussions of the approximation quality here. It suffices to say that this depends on the learning model used; in particular, on whether the optimum is realizable, and on the smoothness properties of the model near the optimum.] While this gives us a method for modeling the probabilities of trajectories, for sequential predictions we need to incorporate some additional structure into the regression problem.

Figure 1: Basic computation graph for meta-learning a trajectory predictor. The loss function depends only on the trajectory $\tau$, not on the parameter $\theta$. Thus, the strategy $\pi$ must marginalize over the latent parameter $\theta$.

To do sequential predictions, i.e. implementing $\pi(x_t | x_{<t})$, the above optimization does not suffice; instead, we have to impose the correct functional interface constraints onto the regression problem in order to get a system that can map histories into predictions. This is done by setting up the target loss so that the solution implements the required function interface. Specifically, we seek a function $f$ that maps histories $x_{<t}$ into predictions $\pi_t$ of the next observation $x_t$, thereby also respecting the causal structure of the task. If we use a memory-based architecture, such as a recurrent neural network (Elman, 1990; Jordan, 1997; Hochreiter et al., 2001; Kolen and Kremer, 2001; Graves, 2012), then the function $f$ conforms to the interface

$$(\pi_t, m_t) = f(x_{t-1}, m_{t-1}), \qquad (7)$$

where $\pi_t$ is the current prediction vector, $m_{t-1}$ and $m_t$ are the preceding and current memory states respectively, and $x_{t-1}$ is the preceding observation. Furthermore, we fix the initial state $m_0$. This is a sufficient condition for the system to make predictions that marginalize over the hypotheses. Since $m_t$ must remember all the necessary past information, the set of possible memory states $\mathcal{M}$ needs to be sufficiently large for it to possess the capacity for encoding the sufficient statistics of the pasts in $\mathcal{X}^*$. See discussion below in Section II-D.

We need instantaneous predictions. The associated instantaneous loss function for this interface (7) is then

$$\ell_t = -\log \pi_t(x_t), \qquad (8)$$

so that the Monte-Carlo approximation of the expected loss (6) becomes

$$\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T} \ell_t^{(n)}, \qquad (9)$$

where $\ell_t^{(n)}$ are Monte-Carlo samples of the instantaneous log-loss. The computation graph is shown in Figure 2 and the pseudo-code is listed in Algorithm 1. [Footnote 13: Note that one could separate learning and sampling in Algorithm 1 by first sampling models and corresponding observation sequences, and then feeding the corresponding observation in the "observe" step of the algorithm. In later sections, however, the prediction of the agent will affect how the observations evolve; we chose to present sampling and prediction/learning in this interleaved way to have a more unified presentation style with Algorithms 2 and 3, given in the next section.] Arguably the most important result in the realizable setting [Footnote 14: Realizable is when the minimizer of the Monte-Carlo loss is in $\mathcal{F}$.] is that, if $N$ is large, and the minimum of (9) is attained, then the minimizer $f^*$ implements a function

$$(\pi_t, m_t) = f^*(x_{t-1}, m_{t-1})$$

where, crucially, the instantaneous prediction $\pi_t$ is

$$\pi_t(x_t) \approx P(x_t | x_{<t}) = \sum_{\theta} P(x_t | \theta, x_{<t}) \, P(\theta | x_{<t}), \qquad (10)$$

i.e. the optimal sequential prediction in which the Bayesian update is automatically amortized (or pre-computed) (Ritchie et al., 2016). In practice, this means that we can use $f^*$ without updating the parameters of the model (say, changing the weights of the neural network) to predict the future from the past, as the prediction algorithm is automatically implemented by the recursive function $f^*$ with the help of the memory state $m_t$. The next section discusses how this is implemented in the solution.

Figure 2: Computation graph for meta-learning a sequential prediction strategy. The agent function $f$ generates a prediction $\pi_t$ of the observation $x_t$ based on the past, captured in the last observation $x_{t-1}$ and the state $m_{t-1}$. The top diagram illustrates the computation graph for a whole sequence (of length $T$), while the lower diagram shows a detailed view of a single step computation.
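Data: Prior $P(\theta)$, generators $P(x_t | \theta, x_{<t})$, and initial predictor $f_0$, memory state $m_0$, and observation $x_0$.
Result: Meta-learned predictor $f^*$.
$f \leftarrow f_0$ (initialize function)
while $f$ not converged do
    $L \leftarrow 0$ (reset loss)
    for $n = 1, \ldots, N$ do (rollout batch)
        start from $m_0$ and $x_0$ (reset memory state)
        $\theta \sim P(\theta)$ (sample parameter)
        for $t = 1, \ldots, T$ do (perform rollout)
            $(\pi_t, m_t) \leftarrow f(x_{t-1}, m_{t-1})$ (predict)
            $x_t \sim P(x_t | \theta, x_{<t})$ (observe)
            $\ell_t \leftarrow -\log \pi_t(x_t)$ (instantaneous loss)
            $L \leftarrow L + \ell_t$ (accumulate total loss)
    update $f$ to decrease $L$, e.g. by a gradient step (do an update step)
return $f$
Algorithm 1 Meta-learning a prediction strategy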
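The following is a minimal sketch of the idea behind Algorithm 1 for coin tosses, not the neural implementation discussed in the text. It rests on several simplifying assumptions: the bias is drawn uniformly from [0, 1]; the memory state is hard-wired to be the pair of counts (heads, tails) instead of being learned; and the function class is a lookup table with one free prediction per memory state. For such a table the minimizer of the Monte-Carlo log-loss is simply the empirical frequency of Head following each state, so counting replaces gradient descent. All names are hypothetical.

```python
import random
from collections import defaultdict


def meta_train(num_rollouts, horizon, rng):
    """Fit a tabular predictor by minimizing the Monte-Carlo log-loss.

    With one free probability per memory state, the log-loss minimizer is
    the empirical frequency of Head observed from that state.
    """
    heads_seen = defaultdict(int)  # times Head followed a given memory state
    visits = defaultdict(int)      # times a given memory state was visited
    for _ in range(num_rollouts):
        theta = rng.random()       # sample parameter (hidden from predictor)
        state = (0, 0)             # reset memory state: (heads, tails)
        for _ in range(horizon):
            x = int(rng.random() < theta)  # observe: 1 = Head, 0 = Tail
            visits[state] += 1
            heads_seen[state] += x
            # Memory update: the counts summarize the whole past.
            state = (state[0] + x, state[1] + 1 - x)
    return {s: heads_seen[s] / visits[s] for s in visits}


rng = random.Random(0)
predictor = meta_train(num_rollouts=50000, horizon=5, rng=rng)

# Compare with the Bayesian posterior predictive for a uniform prior,
# (heads + 1) / (heads + tails + 2).
for state in [(0, 0), (2, 0), (1, 2)]:
    heads, tails = state
    print(state, round(predictor[state], 3), (heads + 1) / (heads + tails + 2))
```

The table never sees `theta`, yet its entries approach the posterior predictive: the regression on Bayes-filtered samples has pre-computed the Bayesian update for every reachable memory state.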

II-D Anatomy of meta-learned agents

How does the function $f^*$ implement the sequential strategy with amortized prediction (10)? In which sense does this strategy rely on an internal model? First, let us review the conditions:

  1. Choice of loss function: The choice of the loss function specifies what solving the task means, i.e. what we want the agent to do as a function of the data.

  2. Functional interface: Since the agent is ultimately implemented as a function, the choice of the interface (e.g. mapping entire trajectories into probabilities versus mapping pasts into predictions) is crucial. Obviously this choice is implicit in the practice of any regression-based machine learning technique. Nevertheless, we point this out because it is especially important in sequential problems, as it determines how the agent is situated within the task, that is, what informational (and ultimately causal) constraints the agent is subject to for solving a task.

  3. Monte-Carlo marginalization: Marginalizing analytically over the generators is in general intractable, except for special cases such as in some exponential families (Koopman, 1936; Abramovich and Ritov, 2013). Instead, meta-learning performs a Monte-Carlo marginalization. Furthermore, in the sequential setting, the marginalization hides the identity of the generator, thereby forcing the model to find a function that uses the past experience to improve on the loss. This, in turn, leads to the numerical approximation of amortized Bayesian estimators, as long as the first state is fixed across all the samples.

As a result, we obtain a function $f^*$ that performs the following operations:

  1. New prediction: $f^*$ takes the past input $x_{t-1}$ and memory state $m_{t-1}$ to produce a new prediction $\pi_t$ minimizing the instantaneous loss $\ell_t$.

  2. New memory state: In order to perform the new prediction, $f^*$ combines the past input $x_{t-1}$ and memory state $m_{t-1}$ to produce a new memory state $m_t$ that acts as a sufficient statistic of the entire past $x_{<t}$. That is, there exists a sufficient statistic function $\phi$ extracting all the necessary information from the past to predict the future, i.e. $P(x_t | x_{<t}) = P(x_t | \phi(x_{<t}))$, and this function is related to $f^*$ via the equation $\phi(x_{<t}) = m_t$.

That is, the recursive function $f^*$ implements a state machine (Sipser, 2006) (or transducer) in which the states correspond to memory states, the transitions are the changes of the memory state caused by an input, and where the outputs are the predictions. We can represent this state machine as a labeled directed graph. Figure 3 shows a state machine predicted by theory and Figure 4 shows a meta-learned state machine.

Figure 3: Minimal state machine for a predictor of coin tosses with a fixed, unknown bias. The hypothesis class can be modeled as a 2-sided coin (see Example 1). Dark and light state transitions correspond to observing the outcomes ‘Head’ and ‘Tail’ respectively, and the states are annotated with $(n_H, n_T)$, the number of times Head and Tail have been observed. The predictions made from each state are shown in the stacked bar charts: under the uniform prior of Example 1, the probability of Head is $(n_H + 1)/(n_H + n_T + 2)$ (which is how these predictions are implemented in a computer program). Note how different observation sequences can lead to the same state (e.g. HT and TH).
Figure 4: Meta-learned state machine for a predictor of coin tosses. The figure shows the memory dynamics of a standard memory-based predictor projected onto the first two eigenvectors. Notice the striking similarity with Figure 3. The predictor consists of 20 LSTM cells with softmax predictions, which was trained using Algorithm 1 on 1000 batches of 100 rollouts, where rollouts were of length 10. For training, we used the Adam optimization algorithm (Kingma and Ba, 2014).

The state machine thus implements a dynamics driven by the input, where the state captures the sufficient statistics of the past (Kolen and Kremer, 2001). However, it is important to note that the objective (9) does not enforce the minimality of the state machine (at least not without post-processing (Kolen, 1994)). Indeed, in practice the learned state machine is often not minimal, implying that the states do not correspond to the minimal sufficient statistics and that optimization runs with different initial conditions (e.g. random seeds) can produce different state machines. Furthermore, as with any machine learning method, the accuracy of the transitions of input-state pairs that never occurred during training depends on the generalization ability of the function approximator. This is especially the case for input sequences that are longer than the length of the trajectories seen during training.

State machines are important because they reflect symmetry relations due to their intimate relation to semigroups (Krohn et al., 1968). If a node in the graph has two or more incoming arrows, then there exist two past observation strings $x_{<t}$ and $x'_{<s}$, not necessarily of the same length, such that

$$\phi(x_{<t}) = \phi(x'_{<s}),$$

that is, they map onto the same sufficient statistics (Diaconis, 1988), and hence, all the trajectories that emanate from those states are jointly amortized. Thus, analyzing the graph structure can reveal the invariances of the data relative to the task. In particular, a minimal state machine captures all the invariances. For instance, an exchangeable stochastic sequence (in which the sequence is generated i.i.d. conditional on the hypothesis) leads to a state machine with lattice structure as in Figure 3.
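The lattice structure for exchangeable coin tosses can be sketched as a small hand-written state machine. The sketch assumes count-based states and a uniform prior over the bias, which gives Laplace's rule for the prediction; the function names are hypothetical.

```python
def step(state, observation):
    """Transition of the count-based state machine; observation is 'H' or 'T'."""
    heads, tails = state
    return (heads + 1, tails) if observation == "H" else (heads, tails + 1)


def predict_head(state):
    """Probability of Head under a uniform prior over the bias (assumption)."""
    heads, tails = state
    return (heads + 1) / (heads + tails + 2)


def run(sequence, state=(0, 0)):
    """Feed a string of observations through the machine; return the final state."""
    for observation in sequence:
        state = step(state, observation)
    return state


# Permutations of the same observations end in the same node of the lattice.
print(run("HT"), run("TH"), predict_head(run("HHT")))
```

Any two pasts with the same counts share a node, so every continuation from that node is amortized once for all of them; this is the invariance under permutation that exchangeability implies.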

III Sequential Decision-Making

We now describe two ways of constructing interactive agents, i.e. agents that exchange actions and observations with an external environment. Many of the lessons learned in the sequential prediction case carry over. The main additional difficulty is that, in the decision-making case, unlike in the prediction case, the optimal policy (which is needed in order to generate the trajectories using the right distribution for meta-learning) is not available. Hence, we need to interleave two processes: a meta-learning process that implicitly amortizes the marginalization over the generators; and a policy improvement process that anneals toward the optimal policy.

III-A Thompson sampling

We can leverage the ideas from the sequential prediction case to create an adaptive agent that acts according to probability matching—and more specifically, Thompson sampling (Thompson, 1933)—to address the exploration-exploitation problem (Sutton et al., 1998). For this, we need generators that not only produce observations, but also optimal actions provided by experts, which are then used as teaching signals.

In particular, provided we know the expert policies, Thompson sampling translates the reinforcement learning problem into an inference problem. Hence, meta-training a Thompson sampler is akin to meta-training a sequential predictor, with the crucial difference that we want our system to predict expert actions rather than observations. Due to this, Thompson samplers are optimal in the compression sense (i.e. using the log-loss), but not in the Bayes-optimal sense (see next subsection).

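Formally, this time we consider distributions over interaction sequences, that is, strings in $(\mathcal{A} \times \mathcal{O})^\ast$, where $\mathcal{A}$ and $\mathcal{O}$ are discrete sets of actions and observations respectively. We underline symbols to glue them together, so $\underline{ao}_{\le t} := a_1 o_1 a_2 o_2 \ldots a_t o_t$. Then, a generator is a member of a class indexed by a parameter $\theta$, defining a distribution over strings

$$P(\underline{ao}_{\le T} \mid \theta) = \prod_{t=1}^{T} P(a_t \mid \theta, \underline{ao}_{<t})\, P(o_t \mid \theta, \underline{ao}_{<t} a_t),$$

where the conditional probabilities

$$P(o_t \mid \theta, \underline{ao}_{<t} a_t) \quad \text{and} \quad P(a_t \mid \theta, \underline{ao}_{<t})$$

are the probabilities of the next observation $o_t$ and of the next action $a_t$ given the past, respectively. One can interpret $P(a_t \mid \theta, \underline{ao}_{<t})$ as the desired (or optimal) policy provided by an expert when the observations follow the statistics $P(o_t \mid \theta, \underline{ao}_{<t} a_t)$. In addition, these probabilities must match the causal structure of the interactions for our following derivation to be correct. [Footnote 15: In practice, this is achieved by enforcing a particular factorization of the joint probability distribution over parameters and interactions into conditional probabilities that reflect the causal structure; see Pearl (2009).]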

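Thompson sampling can be characterized as sampling the actions directly from the posterior predictive (Ortega and Braun, 2010). As in Bayesian prediction, consider the mixture distribution

$$P(\underline{ao}_{\le T}) = \int P(\theta)\, P(\underline{ao}_{\le T} \mid \theta)\, d\theta.$$

Then we can generate actions by sampling them from the posterior predictive

$$P(a_t \mid \underline{\hat{a}o}_{<t}) = \int P(a_t \mid \theta, \underline{ao}_{<t})\, P(\theta \mid \underline{\hat{a}o}_{<t})\, d\theta, \qquad (11)$$

where the “hat” as in “$\hat{a}$” denotes a causal intervention and where the posterior $P(\theta \mid \underline{\hat{a}o}_{<t})$ is recursively given by

$$P(\theta \mid \underline{\hat{a}o}_{\le t}) = \frac{P(o_t \mid \theta, \underline{ao}_{<t} a_t)\, P(\theta \mid \underline{\hat{a}o}_{<t})}{\int P(o_t \mid \theta', \underline{ao}_{<t} a_t)\, P(\theta' \mid \underline{\hat{a}o}_{<t})\, d\theta'}. \qquad (12)$$

[Footnote 16: See Pearl (2009) for a thorough definition of causal interventions. Equations (11) and (12) are non-trivial and beyond the scope of this report; we refer the reader to Ortega and Braun (2010) for their derivation. In particular, note that $P(a_t \mid \theta, \underline{ao}_{<t})$ in (11) does not have interventions.]

In other words, we continuously condition on the past, treating actions as interventions and observations as normal (Bayesian) conditions. More precisely, unlike observations, past actions were generated by the agent without knowledge of the underlying parameter (hence the action $a_t$ and the parameter $\theta$ are independent conditional on the past experience), and the causal intervention mathematically accounts for this fact.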
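As a minimal sketch of this scheme, the following Python code runs Thompson sampling on a two-armed Bernoulli bandit with a finite hypothesis set. The hypotheses `THETAS`, the deterministic experts, and all function names are assumptions made for illustration. The point to notice is in `update`: only the observation likelihood enters the posterior, because the action is treated as an intervention.

```python
import numpy as np

# Hypothetical setup: each row is one hypothesis theta, giving the
# reward probability of each of two arms.
THETAS = np.array([[0.9, 0.1],
                   [0.1, 0.9]])
PRIOR = np.array([0.5, 0.5])

def expert_policy(theta):
    """P(a | theta): the expert deterministically pulls the best arm."""
    policy = np.zeros(len(theta))
    policy[np.argmax(theta)] = 1.0
    return policy

def action_distribution(posterior):
    """Posterior predictive over actions: a mixture of expert policies."""
    experts = np.array([expert_policy(theta) for theta in THETAS])
    return posterior @ experts

def update(posterior, action, observation):
    """Bayes update that uses only the observation likelihood.

    The action is an intervention, so the probability that an expert
    would have chosen it does not enter the update.
    """
    p_reward = THETAS[:, action]
    likelihood = p_reward if observation == 1 else 1.0 - p_reward
    unnormalized = posterior * likelihood
    return unnormalized / unnormalized.sum()

rng = np.random.default_rng(0)
true_theta = THETAS[1]
posterior = PRIOR.copy()
for _ in range(20):
    action = rng.choice(2, p=action_distribution(posterior))
    observation = int(rng.random() < true_theta[action])
    posterior = update(posterior, action, observation)
print(posterior)
```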

Meta-learning a Thompson sampling agent follows a scheme analogous to sequential prediction. We seek a strategy $\pi_t$ that amortizes the posterior predictive over actions (11):

$$\pi_t(a_t) \approx P(a_t \mid \underline{\hat{a}o}_{<t}).$$

This strategy conforms to the functional interface

$$(\pi_t, m_t) = f_w(\underline{ao}_{t-1}, m_{t-1}),$$

where $\pi_t$ is the current policy vector, $m_{t-1}$ and $m_t$ are the preceding and current memory states respectively, and $\underline{ao}_{t-1}$ is the preceding interaction.

Next we derive the loss function. As in Bayesian prediction, Thompson sampling optimizes the compression of the interaction sequence characterized by the expected log-loss. This is most easily written recursively in terms of the instantaneous expected log-loss

$$-\sum_{a_t} P(a_t \mid \underline{\hat{a}o}_{<t}) \log \pi_t(a_t)$$

for each action $a_t$ given its past $\underline{ao}_{<t}$. [Footnote 17: The expected log-loss for observations is omitted, as we only need to regress the policy here.] For a Monte-Carlo approximation, we sample trajectories as

$$\theta \sim P(\theta), \qquad a_t \sim \pi_t(a_t), \qquad o_t \sim P(o_t \mid \theta, \underline{ao}_{<t} a_t). \qquad (15)$$

In particular, note how actions are drawn from the agent’s policy, not from the generator. This ensures that the agent does not use any privileged information about the generator’s identity, thus covering the support over all the trajectories that the agent might explore. Then, we can choose the instantaneous loss function as the cross-entropy

$$\ell_t = -\sum_{a_t} P(a_t \mid \theta, \underline{ao}_{<t}) \log \pi_t(a_t)$$

evaluated on the sampled trajectories. This is known as (a variant of) policy distillation (Rusu et al., 2015). Again, this only works if we have access to the ground-truth optimal policy for each environment. The computation graph is shown in Figure 5 and the pseudo-code is listed in Algorithm 2.

Figure 5: Detail of the computation graph for meta-learning a Thompson sampling agent. The actions are generated from the agent’s policy $\pi_t$. The expert policy $P(a_t \mid \theta, \underline{ao}_{<t})$ is only used for generating a loss signal $\ell_t$.
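Data: Prior $P(\theta)$, generators $P(o_t \mid \theta, \underline{ao}_{<t} a_t)$, and policies $P(a_t \mid \theta, \underline{ao}_{<t})$; and initial predictor $f_{w_0}$, memory state $m_0$, and interaction $\underline{ao}_0$.
Result: Meta-learned predictor $f_w$.
$w \leftarrow w_0$ (initialize function)
while $w$ not converged do
    $L \leftarrow 0$ (reset loss)
    for $n = 1, \ldots, N$ do (rollout batch)
        restore $m_0$ and $\underline{ao}_0$ (reset memory state)
        $\theta \sim P(\theta)$ (sample parameter)
        for $t = 1, \ldots, T$ do (perform rollout)
            $(\pi_t, m_t) \leftarrow f_w(\underline{ao}_{t-1}, m_{t-1})$
            $\ell_t \leftarrow -\sum_{a} P(a \mid \theta, \underline{ao}_{<t}) \log \pi_t(a)$ (inst. loss)
            $a_t \sim \pi_t$, $o_t \sim P(o_t \mid \theta, \underline{ao}_{<t} a_t)$
            $L \leftarrow L + \ell_t$ (accumulate total loss)
    update $w$ with a gradient step on $L$ (do an update step)
return $f_w$
Algorithm 2 Meta-learning a Thompson sampler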
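A runnable toy version of this training loop is sketched below in Python. It is an illustration under strong assumptions rather than the report's implementation: the recurrent network is replaced by a hypothetical lookup table of logits indexed by a count-based memory state, the task class is a two-hypothesis Bernoulli bandit, and the gradient of the cross-entropy with respect to the logits (`pi - expert`) is written out by hand. All constants are arbitrary.

```python
import numpy as np
from collections import defaultdict

THETAS = np.array([[0.9, 0.1], [0.1, 0.9]])  # hypothetical generators
PRIOR = np.array([0.5, 0.5])
T, BATCH, STEPS, LR = 5, 32, 300, 0.5        # illustrative settings

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def f(w, interaction, memory):
    """Hypothetical tabular f_w: update the memory, then read the policy."""
    if interaction is not None:
        a, o = interaction
        counts = list(memory)
        counts[2 * a + o] += 1               # count of each (action, obs) pair
        memory = tuple(counts)
    return softmax(w[memory]), memory

def meta_train(seed=0):
    rng = np.random.default_rng(seed)
    w = defaultdict(lambda: np.zeros(2))     # logits per memory state
    for _ in range(STEPS):
        grad = defaultdict(lambda: np.zeros(2))
        for _ in range(BATCH):
            theta = THETAS[rng.choice(len(PRIOR), p=PRIOR)]  # sample parameter
            expert = np.eye(2)[np.argmax(theta)]             # expert policy
            memory, interaction = (0, 0, 0, 0), None         # reset memory
            for _ in range(T):
                pi, memory = f(w, interaction, memory)
                # Gradient of the cross-entropy w.r.t. the logits.
                grad[memory] += pi - expert
                a = rng.choice(2, p=pi)                      # agent acts
                o = int(rng.random() < theta[a])             # generator responds
                interaction = (a, o)
        for m, g in grad.items():
            w[m] -= LR * g / BATCH                           # update step
    return w

w = meta_train()
pi, _ = f(w, (0, 1), (0, 0, 0, 0))
print(pi)  # policy after pulling arm 0 once and observing a reward
```

In this toy problem the exact posterior after pulling arm 0 and observing a reward puts probability 0.9 on the first hypothesis, so the printed policy should lie close to (0.9, 0.1) even though the expert is only used as a loss signal.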

Finally, we note that the above meta-training algorithm is designed for finding agents that implicitly update their posterior after each time step. However, this can lead to unstable policies that change their behavior (i.e. expert policy) in each time step. Such inconsistencies can be addressed by updating the posterior only after experiencing longer interaction sequences—for instance, only after an episode. We refer the reader to (Russo et al., 2018; Osband and Van Roy, 2016; Ouyang et al., 2017) for a detailed discussion.

III-B Bayes-Optimality

Bayes-optimal sequential decision-making is the decision strategy that follows from the theory of subjective expected utility (Savage, 1972) and the method of dynamic programming (Bellman, 1954). Roughly, it consists in always picking an action that maximizes the value, that is, the expected sum of future rewards under the best future actions. Methodologically, it requires solving a stochastic partial difference equation modeling the value for given boundary conditions, where the latter typically constrain the value to zero everywhere along the planning horizon (Bertsekas, 2008). Due to this, learning a Bayes-optimal policy is far more challenging than learning a Thompson sampling strategy. Here we also depart from the log-loss, and use a reward function instead to characterize the task goals.

In this case the generators are distributions over observation sequences conditioned on past interactions, where each member defines a conditional distribution

$$P(o_t \mid \theta, \underline{ao}_{<t} a_t).$$

However, unlike the Thompson sampling case, here we seek a global optimal policy which is indirectly defined via a reward function as discussed later. This global policy will by construction solve the exploration-exploitation problem in a Bayes-optimal way, although it is tailored specifically to the given task distribution.

Because the optimal policy is unknown, in practice during training we only have access to trajectories $\underline{ao}_{\le T}$ that are drawn from a distribution $P^\pi$ that results from the interactions between the expected task and a custom policy $\pi$. $P^\pi$ is given by

$$P^\pi(\underline{ao}_{\le T}) = \int P(\theta) \prod_{t=1}^{T} \pi_t(a_t)\, P(o_t \mid \theta, \underline{ao}_{<t} a_t)\, d\theta,$$

that is, a distribution where actions are drawn from the agent’s current strategy $\pi_t$ and where observations are drawn from a randomly chosen generator as described in (15) from Thompson sampling.

As mentioned above, we specify the agent’s objective with a reward function. This is a global reward function $r$ that maps every interaction $\underline{ao}_t$ and past $\underline{ao}_{<t}$ into a scalar value $r(\underline{ao}_{\le t}) \in \mathbb{R}$, indicating the interaction’s desirability. Furthermore, we define the action-value function for a policy $\pi$ as the expected sum of rewards given a past, that is,

$$Q^\pi(a_t \mid \underline{ao}_{<t}) = \mathbb{E}^\pi\left[\sum_{k=t}^{T} r(\underline{ao}_{\le k}) \,\middle|\, \underline{ao}_{<t} a_t\right], \qquad (17)$$

[Footnote 18: For simplicity, we assume that rewards are undiscounted.]

where the expectation $\mathbb{E}^\pi$ denotes an expectation w.r.t. the distribution $P^\pi$. Notice that this definition implicitly equates the rewards after the horizon $T$ with zero. An optimal policy is defined as any policy that maximizes (17) for any past $\underline{ao}_{<t}$. Note that this is a recursive definition that can be solved using dynamic programming.
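For a small task class the recursion can be solved exactly. The Python sketch below computes the Bayes-optimal expected return of a two-armed Bernoulli bandit by dynamic programming over the posterior counts. The uniform Beta(1, 1) priors over each arm, the 0/1 rewards, and the function name are assumptions chosen for illustration.

```python
from functools import lru_cache

def bayes_optimal_value(horizon):
    """Bayes-optimal expected return for a two-armed Bernoulli bandit.

    Assumes independent uniform priors over each arm's reward
    probability and undiscounted rewards.
    """
    @lru_cache(maxsize=None)
    def value(s0, f0, s1, f1):
        if s0 + f0 + s1 + f1 == horizon:
            return 0.0  # boundary condition: no reward after the horizon
        arms = ((s0, f0), (s1, f1))
        q_values = []
        for arm, (s, f) in enumerate(arms):
            p = (s + 1) / (s + f + 2)  # posterior predictive of a reward
            win = tuple((s + 1, f) if i == arm else c
                        for i, c in enumerate(arms))
            lose = tuple((s, f + 1) if i == arm else c
                         for i, c in enumerate(arms))
            q = (p * (1.0 + value(*win[0], *win[1]))
                 + (1.0 - p) * value(*lose[0], *lose[1]))
            q_values.append(q)
        return max(q_values)

    return value(0, 0, 0, 0)

print(bayes_optimal_value(1))  # 0.5
print(bayes_optimal_value(2))  # 13/12, about 1.0833
```

With a horizon of two, the second pull already depends on the first outcome (repeat the arm after a reward, switch after a failure), which is the exploration-exploitation trade-off in its simplest form.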

Meta-learning a Bayes-optimal policy can be done in numerous ways: for concreteness, here we settle on inferring the optimal policy via estimating the action-values, but other methods (e.g. using policy gradients) work as well. We seek to amortize the optimal action-values using a vector $q_t$:

$$q_t(a_t) \approx Q^\ast(a_t \mid \underline{ao}_{<t}).$$

The functional interface conforms to

$$(q_t, m_t) = f_w(\underline{ao}_{t-1}, m_{t-1}),$$

where $q_t$ is the current action-value vector used for constructing the policy; $m_{t-1}$ and $m_t$ are the preceding and current memory states respectively; and $\underline{ao}_{t-1}$ is the preceding interaction, which implicitly also provides the reward. [Footnote 19: If the reward is not a function of actions and observations, then it needs to be passed explicitly alongside the last interaction.]

As the instantaneous loss function $\ell_t$ for regressing the action-values, we can for instance use the TD-error

$$\ell_t = \left( q_t(a_t) - \left[ r(\underline{ao}_{\le t}) + q_{t+1}(a_{t+1}) \right] \right)^2.$$

Crucially, this is only a function of the value $q_t(a_t)$ of the current action. The target value $r(\underline{ao}_{\le t}) + q_{t+1}(a_{t+1})$, given by the sum of the current reward and the value $q_{t+1}(a_{t+1})$ of the next action, is kept constant, ensuring that the boundary conditions are propagated in the right direction, namely backwards in time.
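The following Python sketch evaluates such a loss along one trajectory. It assumes a squared error and a target built from the next chosen action, with a value of zero after the horizon; the function name and array layout are illustrative. NumPy has no automatic differentiation, so "kept constant" appears only as a comment here; in an autodiff framework the targets would be excluded from the gradient.

```python
import numpy as np

def td_losses(q_values, actions, rewards):
    """Squared TD-errors along one trajectory with constant targets.

    q_values: shape (T, num_actions), one action-value vector per step.
    actions:  chosen action indices, length T.
    rewards:  rewards received, length T.
    """
    q_values = np.asarray(q_values, dtype=float)
    steps = np.arange(len(actions))
    chosen = q_values[steps, actions]
    # Target: current reward plus the value of the next chosen action,
    # and zero beyond the horizon. Targets are treated as constants.
    next_values = np.append(chosen[1:], 0.0)
    targets = np.asarray(rewards, dtype=float) + next_values
    return (chosen - targets) ** 2, targets

losses, targets = td_losses([[1.0, 0.0], [0.0, 2.0]], [0, 1], [0.5, 1.0])
print(targets)  # [2.5 1. ]
print(losses)   # [2.25 1.  ]
```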

To regress the optimal policy, we use simulated annealing (Kirkpatrick et al., 1983). Specifically, we start from a random policy and then slowly crystallize an optimal policy. To do so, actions are drawn as $a_t \sim \pi_t(a_t)$ from a policy built from the action-values using e.g. the softmax function

$$\pi_t(a_t) = \frac{\exp(\beta\, q_t(a_t))}{\sum_{a} \exp(\beta\, q_t(a))},$$

where the inverse temperature $\beta$ is a parameter controlling the stochasticity of the policy: $\beta \approx 0$ yields a nearly uniform policy and $\beta \gg 1$ a nearly deterministic one. During meta-training, the inverse temperature is annealed (cooled), starting from $\beta = \beta_0 \approx 0$ and ending in a large value for $\beta$. This gives the model time to regress the action-values by sampling sub-optimal branches before committing to a specific policy. Good cooling schedules are typically determined empirically (Mitra et al., 1986; Nourani and Andresen, 1998). The pseudo-code is listed in Algorithm 3.
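The effect of the inverse temperature is easy to check numerically. The Python sketch below implements a softmax policy over action-values together with one possible cooling schedule; the geometric schedule and its endpoints are assumptions for illustration, since suitable schedules are found empirically.

```python
import numpy as np

def softmax_policy(q, beta):
    """Policy proportional to exp(beta * q), computed stably."""
    z = beta * np.asarray(q, dtype=float)
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

def geometric_schedule(beta_start, beta_end, num_steps):
    """A hypothetical cooling schedule: geometric growth of beta."""
    return np.geomspace(beta_start, beta_end, num_steps)

q = [1.0, 2.0, 3.0]
for beta in geometric_schedule(0.01, 100.0, 5):
    print(beta, softmax_policy(q, beta))
```

At the start of the schedule the three actions are almost equally likely; at the end nearly all probability mass sits on the action with the largest value.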

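Data: Prior $P(\theta)$, generators $P(o_t \mid \theta, \underline{ao}_{<t} a_t)$, and reward function $r$; and initial predictor $f_{w_0}$, inverse temperature $\beta_0$, memory state $m_0$, and interaction $\underline{ao}_0$.
Result: Meta-learned predictor $f_w$.
$w \leftarrow w_0$ (initialize function)
$\beta \leftarrow \beta_0$ (initialize inv. temp.)
while $w$ not converged do
    $L \leftarrow 0$ (reset loss)
    for $n = 1, \ldots, N$ do (rollout batch)
        restore $m_0$ and $\underline{ao}_0$ (reset memory state)
        $\theta \sim P(\theta)$ (sample parameter)
        for $t = 1, \ldots, T$ do (perform rollout)
            $(q_t, m_t) \leftarrow f_w(\underline{ao}_{t-1}, m_{t-1})$
            $a_t \sim \pi_t$ with $\pi_t \propto \exp(\beta q_t)$, $o_t \sim P(o_t \mid \theta, \underline{ao}_{<t} a_t)$
            if $t > 1$ then (compute previous loss)
                $\ell_{t-1} \leftarrow \left( q_{t-1}(a_{t-1}) - [r(\underline{ao}_{\le t-1}) + q_t(a_t)] \right)^2$
                $L \leftarrow L + \ell_{t-1}$ (accumulate total loss)
        target $\leftarrow r(\underline{ao}_{\le T})$ (last target)
        $\ell_T \leftarrow \left( q_T(a_T) - r(\underline{ao}_{\le T}) \right)^2$ (last inst. loss)
        $L \leftarrow L + \ell_T$ (accumulate total loss)
    update $w$ with a gradient step on $L$ (do an update step)
    increase $\beta$ according to the cooling schedule (update inv. temp.)
return $f_w$
Algorithm 3 Meta-learning a Bayes-optimal policy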

IV Discussion

IV-A Meta-learning and Bayesian statistics

Meta-learning is intimately connected to Bayesian statistics regardless of the loss function, due to the statistics of the generated trajectories. When regressing a sequential strategy using a Monte-Carlo estimation, we sample trajectories as

$$\theta \sim P(\theta), \qquad x_t \sim P(x_t \mid \theta, x_{<t}).$$

[Footnote 20: For sequential decision-making problems, we assume that the policy improvement steps have already converged. In this case, the actions are drawn from the target distribution for meta-learning, and we can safely ignore the distinction between actions and observations.]

However, from the point of view of the system that has already seen the past $x_{<t}$, the transition $x_t$ looks, on average, as if it were sampled with probability

$$P(x_t \mid x_{<t}) = \int P(x_t \mid \theta, x_{<t})\, P(\theta \mid x_{<t})\, d\theta, \qquad (18)$$

that is, from the Bayesian posterior predictive distribution, which in turn induces the (implicit) update of the hypothesis

$$P(\theta \mid x_{\le t}) = \frac{P(x_t \mid \theta, x_{<t})\, P(\theta \mid x_{<t})}{P(x_t \mid x_{<t})}. \qquad (19)$$

Hence, (18) and (19) together show that the samples are implicitly filtered according to Bayes’ rule. It is precisely this statistical property that is harvested through memory-based meta-learning. As shown in Section II, meta-learning a Bayesian sequence predictor corresponds to directly regressing the statistics of the samples; due to this, it can be considered the most basic form of meta-learning. In contrast, the two sequential decision makers from Section III do not regress the statistics directly but rather use their correspondence to a Bayesian filtration to build adaptive policies.
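This filtering property can be checked empirically. The Python sketch below, which assumes a hypothetical two-hypothesis coin model, samples many trajectories from the generative process, keeps those that begin with a given past, and compares the frequency of the next symbol against the posterior predictive computed with Bayes' rule.

```python
import numpy as np

THETAS = np.array([0.2, 0.8])   # hypothetical coin biases
PRIOR = np.array([0.5, 0.5])

def posterior_predictive(past):
    """P(x_t = 1 | past) by explicit Bayesian updating."""
    ones = sum(past)
    zeros = len(past) - ones
    posterior = PRIOR * THETAS**ones * (1 - THETAS)**zeros
    posterior /= posterior.sum()
    return float(posterior @ THETAS)

def empirical_predictive(past, num_samples=200_000, seed=0):
    """Frequency of x_t = 1 among sampled trajectories starting with `past`."""
    rng = np.random.default_rng(seed)
    theta = THETAS[rng.choice(len(PRIOR), size=num_samples, p=PRIOR)]
    xs = rng.random((num_samples, len(past) + 1)) < theta[:, None]
    matches = np.all(xs[:, :-1] == np.array(past, dtype=bool), axis=1)
    return float(xs[matches, -1].mean())

past = [1, 1]
print(posterior_predictive(past))   # 13/17, about 0.765
print(empirical_predictive(past))   # close to the value above
```

No Bayesian update is performed in `empirical_predictive`; the agreement comes purely from conditioning the samples on the observed past.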

Conceptually, using a generative process to produce Bayes-filtered samples is an old idea. It is the rationale driving many practical implementations of Bayesian models (Bishop, 2016), and one of the key ideas in Monte-Carlo methods for Bayes-optimal planning (Silver and Veness, 2010; Guez et al., 2012).

IV-B Sample complexity of strategies

Current model-free RL algorithms such as deep RL algorithms (e.g. DQN (Mnih et al., 2013) or A3C (Mnih et al., 2016)) are known to be sample inefficient. The sample complexity can potentially be improved by using a suitable model-based approach with strong inductive biases. Full probabilistic model-based approaches (i.e. those that do not work with expectation models) can rely on hand-crafted probabilistic models that possess little structure (e.g. Dirichlet priors over state transitions) or have intractable (exact) posterior distributions. This can make such approaches unwieldy in practice.

More commonly, traditional approaches have used expectation models (Sutton et al., 2012; Schmidhuber, 1990), or deterministic abstract models (Watter et al., 2015; Silver et al., 2016). However, so far there has been limited success scaling probabilistic modeling to improve sample efficiency in the context of deep reinforcement learning.

Meta-learning addresses this problem in a conceptually straightforward manner. It automates the synthesis of near-optimal algorithms, by searching in algorithm space (or automata space) in order to find a new reinforcement learning algorithm that is tailored to a given class of tasks, exploiting the structure and using the desired inductive biases. The meta-learned algorithms minimize the sample complexity at test time because they directly minimize the expected loss averaged over all generators. In the examples we have seen, meta-learning finds sequential algorithms, which when deployed perform near-optimally, minimizing the sample complexity. The flip side is that meta-learning can be very expensive at meta-training time due to the slow convergence of the Monte-Carlo approximation and the very large amount of data required by current popular neural architectures during the meta-training phase.

IV-C Spontaneous meta-learning

Meta-learning can also occur spontaneously in online regression when the capacity of the agent is bounded and the data is produced by a single generator. Unfortunately, the downside is that we cannot easily control what will be meta-learned. In particular, spontaneous meta-learning could lead to undesirable emergent properties, which is considered an open research problem in AI safety (Ortega et al., 2018). To see how meta-learning happens, consider the sequential prediction case. All we need is to show how the conditions for meta-learning occur naturally; that is, by identifying the Monte-Carlo samples and their latent parameters.

Agent State:
Generator State:
Trajectory: (na)
Derived Input: (na)
Derived Parameter: (na)
Table I: Example segmentation of an input sequence in spontaneous meta-learning.

Let $\mathcal{M}$ and $\mathcal{Z}$ be the agent’s and the generator’s sets of internal states respectively, and let $\mathcal{X}$ be the set of observations. Assume that the dynamics are deterministic (stochastic dynamics can be modeled using pseudo-random transitions) and implemented by functions $h$, $f$, and $g$, so that the sequence of observations $x_1, x_2, \ldots$, agent states $m_1, m_2, \ldots$, and generator states $z_1, z_2, \ldots$ are given by

$$x_t = h(z_t) \qquad \text{(observations)}$$
$$m_t = f(m_{t-1}, x_t) \qquad \text{(agent states)}$$
$$z_{t+1} = g(z_t) \qquad \text{(generator states)}$$

(with arbitrary initial values $m_0$, $z_1$) as illustrated in Figure 6. Furthermore, assume that $\mathcal{M}$ and $\mathcal{Z}$ are finite. [Footnote 21: This assumption can be relaxed to compact metric spaces using tools from dynamical systems theory that are beyond the scope of this report.] Then, given an infinite stream of observations, there must exist a state $m^\ast$ that is visited infinitely often. We can use $m^\ast$ to segment the sequence of observations into (variable-length) trajectories which we can identify as Monte-Carlo samples from a class of generators, as illustrated by the example in Table I. The $n$-th trajectory is defined as the substring delimited by the $n$-th and $(n+1)$-th occurrence of $m^\ast$, and we define $x^n_t$ as the $t$-th observation of the $n$-th trajectory. Finally, the task parameter $\theta_n$ of the $n$-th trajectory is the state of the generator at the beginning of the sequence.

Figure 6: Dynamics of spontaneous meta-learning. The (pseudorandom) generator with states in $\mathcal{Z}$ produces observations in $\mathcal{X}$, which drive the agent’s memory state-transitions in $\mathcal{M}$. A loop in the memory space with both endpoints equal to $m^\ast$ corresponds to a trajectory generated by a latent parameter $\theta$.

Given this identification, an online learning algorithm that updates the parameters $w$ based on a sufficiently large window of the past will effectively perform batch updates based on a set of trajectories sampled by different “generators”, thereby performing meta-learning.
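The segmentation step can be sketched in a few lines of Python. The parity agent, the observation stream, and the function names below are hypothetical; the only idea taken from the text is that a single stream is cut into trajectories at every return of the agent to a recurring memory state.

```python
def run_agent(observations, transition, initial_state):
    """Return the agent states visited before, between, and after the inputs."""
    states = [initial_state]
    for x in observations:
        states.append(transition(states[-1], x))
    return states

def segment_by_state(observations, states, anchor):
    """Cut the observations at every visit of the agent to `anchor`."""
    visits = [t for t, m in enumerate(states) if m == anchor]
    return [observations[i:j] for i, j in zip(visits, visits[1:])]

def parity(memory, x):
    """Toy agent dynamics: track the parity of the number of ones seen."""
    return (memory + x) % 2

observations = [1, 0, 1, 1, 1, 0, 0, 1]
states = run_agent(observations, parity, initial_state=0)
print(segment_by_state(observations, states, anchor=0))
# [[1, 0, 1], [1, 1], [0], [0]]
```

Each returned segment is a loop in memory space that starts and ends in the anchor state; the final observation is left out because its loop has not yet closed.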

IV-D Capacity limitations

In our analysis, we have assumed that the solution found is the minimizer of the meta-learning objective. This is a very strong assumption. In practice, the difficulty of actually finding a near-optimal solution depends on many factors.

The first and most important factor is of course the model used for regression. Properties such as the inductive biases of the model implementing the function class, the smoothness of the loss landscape, the optimizer, and the memory capacity play fundamental roles in any machine learning method, and meta-learning is no exception.

Another important factor is the task class, including both the space of hypotheses and the loss function. Together they shape the complexity of the strategy to regress via meta-learning. In particular, the invariances in the state machine (see Section II-D) reduce the number of distinct mappings from past experiences to instantaneous strategies that the regressor has to learn. Conversely, in the worst case when there are no invariances, the number of distinct mappings grows exponentially in the maximum length $T$ of the trajectories.

IV-E Selected future challenges

Task structure

Meta-learning crucially relies on the skillful design of the class of tasks. Previous work has shown that agents can meta-learn to perform a variety of tasks if the task distributions have been designed accordingly, such as learning to identify a best option (Denil et al., 2016; Wang et al., 2016) or learning to reason causally (Dasgupta et al., 2019).

In each case, the practitioner must ask the question: if the meta-learned strategy should have a capability X, what property Y must the class of tasks possess? For instance, how should we design the generators so that we can generalize out-of-distribution and beyond the length of the trajectories sampled during meta-learning?

Addressing questions like these entails further questions regarding the structure of tasks; but to date, we are not aware of an adequate language or formalism to conceptualize this structure rigorously. In particular, from such a formalism we would expect to gain a better understanding of the dynamical structure of solutions; to predict the structure of the sufficient statistics that a class of tasks gives rise to; and to compare two task classes and determine if they are equivalent (or similar) in some sense.

Beyond expected losses

In the basic meta-learning scheme the strategies minimize the expected loss. Minimizing an expectation disregards the higher-order moments of the loss distribution (e.g. variance), leading to risk-insensitive strategies that are brittle under model uncertainty. Going beyond expected losses means that we have to change our certainty-equivalent, that is, the way we aggregate uncertain losses into a single value (Hansen and Sargent, 2008).

Changing the certainty-equivalent changes the attitude towards risk. This has important applications ranging from safety & robustness (van den Broek et al., 2010; Ortega and Legg, 2018) to games & multi-agent tasks (McKelvey and Palfrey, 1995, 1998; Littman, 1994). For instance, if the agent is trained on an imperfect simulator of the real world, we would want the agent to explore the world cautiously.

Risk-sensitive strategies can be meta-learned by tweaking the statistics of the generative process. For instance, by changing the prior over generators as a function of the strategy’s performance, we can meta-learn strategies that are risk-sensitive. For games, it might be necessary to simultaneously meta-train multiple agents in order to find the equilibrium strategies.
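One standard choice of certainty-equivalent, used here purely as an illustration and not necessarily the one the authors have in mind, is the exponential form $\frac{1}{\beta} \log \mathbb{E}[\exp(\beta \ell)]$. The Python sketch below computes it for a finite set of generators and shows the corresponding reweighting of the prior, in which generators with higher loss are sampled more often. The function names and the example losses are hypothetical.

```python
import numpy as np

def exponential_certainty_equivalent(losses, prior, beta):
    """(1 / beta) * log E[exp(beta * loss)]; beta > 0 is risk-averse."""
    z = beta * np.asarray(losses, dtype=float)
    shift = z.max()  # subtract the maximum for numerical stability
    return (shift + np.log(np.sum(prior * np.exp(z - shift)))) / beta

def tilted_prior(losses, prior, beta):
    """Reweight generators in proportion to prior * exp(beta * loss)."""
    z = beta * np.asarray(losses, dtype=float)
    weights = prior * np.exp(z - z.max())
    return weights / weights.sum()

losses = [1.0, 3.0]               # per-generator loss of the current strategy
prior = np.array([0.5, 0.5])
print(exponential_certainty_equivalent(losses, prior, beta=1e-6))  # near the mean, 2
print(exponential_certainty_equivalent(losses, prior, beta=5.0))   # closer to the worst case, 3
print(tilted_prior(losses, prior, beta=5.0))
```

As $\beta$ approaches zero the value tends to the ordinary expectation, while larger $\beta$ moves it toward the worst-case loss.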

Continual learning

The continual learning problem asks an agent to learn sequentially, incorporating skills and knowledge in a non-disruptive way. There are strong links between the continual learning problem and the meta-learning problem.

On the one hand, the traditional meta-learning setting assumes a fixed distribution of tasks, which can be restrictive and unrealistic. Allowing the distribution to change over time would not only be crucial from a practical perspective, but could also be used as a tool to refine the task distribution in order to induce the right properties in the learned solution. Continual or incremental variants of the meta-learning problem are, however, an under-explored topic (although Nagabandi et al. (2019), for example, touch on it).

On the other hand, meta-learning can be seen as part of the solution for the continual learning problem. In principle, continual learning is ill-defined: remembering a potentially infinite set of skills is infeasible within a finite model. While continual learning usually focuses on 0-shot transfer (Goodfellow et al., 2014; Parisi et al., 2018; Kirkpatrick et al., 2017), that is, on how well the agent remembers a previously seen task without any adaptation, this might be the wrong measure. A relaxation of this approach, explored e.g. by Kaplanis et al. (2018), would be to measure how fast one recovers performance, which converts continual learning into a meta-learning problem. The compression of all seen tasks becomes the meta-learning algorithm that the agent needs to infer and that can be exploited to recover the solution of the task. However, the tasks are not necessarily seen in an i.i.d. fashion. So while the mechanism outlined in this work could describe such a solution to a continual learning problem, it is unclear how to approach the learning problem in practice.

IV-F Conclusions

Reinforcement learning algorithms based on probabilistic models promise to address many of the shortcomings—in particular, the sample-inefficiency—of model-free reinforcement learning approaches. However, the implementation of such systems is very challenging and, more often than not, depends on domain expertise, i.e. human knowledge, hand-crafted in the form of probabilistic models that are either tractable but too simple to be useful, or outright intractable. Such an approach does not scale.

Memory-based meta-learning offers a conceptually simple alternative for the construction of agents implicitly based on probabilistic models that leverages data and large-scale computation. In essence, meta-learning transforms the hard problem of probabilistic inference into a curve fitting problem. Here we have provided three meta-learning templates: one for building predictors, one for Thompson samplers, and one for Bayes-optimal agents. In all of them, the key idea is to precondition a slow-learning system by exposing it to a distribution over trajectories drawn from a broad class of tasks, so that the meta-learned system ends up implementing a general and fast sequential strategy, that is, a strategy with the right inductive biases.

We have also shown why this approach works and how the meta-learned strategies are implemented. Basically, the resulting strategies are near-optimal by construction because meta-learning directly trains on a Monte-Carlo approximation of the expected cost over possible hypotheses. The sequential data drawn during this Monte-Carlo approximation is implicitly Bayes-filtered, and the Bayesian updates are amortized by the meta-learned strategy. Moreover, we have shown that the adaptation strategy is implemented as a state machine in the agent’s memory dynamics, which is driven by the data the agent experiences. A given memory state then represents the agent’s information state, and, more precisely, the sufficient statistics for predicting its future interactions. The structure of the transition graph encodes the symmetries in the task distribution: paths that have the same initial and final information state are indistinguishable, and thus equivalent for the purposes of modulating the future behavior of the agent.

Finally, we note that meta-learning also converts complex probabilistic inference problems into regression problems in one-shot settings, when meta-learning is applied without memory (e.g. Orhan and Ma 2017). A key distinction of the sequential setting with memory is that meta-learning also produces an update rule.

Our hope is that readers will find these insights useful, and that they will use the ideas presented here as a conceptual starting point for the development of more advanced algorithms and theoretical investigations. There are many remaining challenges for the future, such as understanding task structure, dealing with out-of-distribution generalization, and continual learning (to mention some). Some of these can be addressed through tweaking the basic meta-learning training process and through designing hypothesis classes with special properties.

More generally though, memory-based meta-learning illustrates a more powerful claim: a slow learning system, given enough data and computation, can learn not only a model of its environment but also an entire reasoning procedure. In a sense, this suggests that rationality principles are not necessarily pre-existent, but rather emerge over time as a consequence of the situatedness of the system and its interaction with the environment.