Pre-training as Batch Meta Reinforcement Learning with tiMe

by Quan Vuong, et al.

Pre-training is transformative in supervised learning: a large network trained with large and existing datasets can be used as an initialization when learning a new task. Such initialization speeds up convergence and leads to higher performance. In this paper, we seek to understand how pre-training from only existing and observational data should be formalized in Reinforcement Learning (RL) and whether it is possible. We formulate the setting as Batch Meta Reinforcement Learning. We identify MDP mis-identification to be a central challenge and motivate it with theoretical analysis. Combining ideas from Batch RL and Meta RL, we propose tiMe, which learns distillation of multiple value functions and MDP embeddings from only existing data. In challenging control tasks and without fine-tuning on unseen MDPs, tiMe is competitive with state-of-the-art model-free RL methods trained with hundreds of thousands of environment interactions.




1 Introduction

Deep Reinforcement Learning algorithms still require millions of environment interactions to obtain reasonable performance, hindering their applications (Mnih et al., 2015; Lillicrap et al., 2016; Vuong et al., 2018; Fujimoto et al., 2018b; Jaderberg et al., 2018; Arulkumaran et al., 2019; Huang et al., 2019). This is due to the lack of good pre-training methods. In supervised learning, a pre-trained network significantly reduces sample complexity when learning new tasks (Zeiler and Fergus, 2013; Devlin et al., 2018; Yang et al., 2019). Even though Meta Reinforcement Learning (RL) has been proposed as a framework for pre-training in RL, such methods still require the collection of millions of interactions during meta-train (Wang et al., 2016b; Duan et al., 2016; Finn et al., 2017).

In supervised learning, a key reason why pre-training is incredibly successful is that the dataset used for pre-training can be collected from naturally occurring large-scale processes. This removes the need to manually collect data and allows for scalable data collection, resulting in massive datasets. For example, Mahajan et al. (2018) pre-train using existing images and their corresponding hashtags from Instagram to obtain state-of-the-art performance on ImageNet (Russakovsky et al., 2014).

In this paper, we seek to formalize pre-training in RL in a way that allows for scalable data collection. The data used for pre-training should be purely observational and the policies that are being optimized for should not need to interact with the environment during pre-training. To this end, we propose Batch Meta Reinforcement Learning (BMRL) as a formalization of pre-training in RL from only existing and observational data. During training, the learning algorithms only have access to a batch of existing data collected a priori from a family of Markov Decision Processes (MDPs). During testing, the trained policies should perform well on unseen MDPs sampled from the family.

A related setting is Batch RL (Antos et al., 2007; Lazaric et al., 2008; Lange et al., 2012), which we emphasize assumes the existing data comes from a single MDP. To enable scalable data collection, this assumption must be relaxed: the existing data should come from a family of related MDPs. Consider smart thermostats, whose goal is to maintain a specific temperature while minimizing electricity cost. Assuming Markovian dynamics, the interactions between a thermostat and its environment can be modelled as an MDP. Data generated by a thermostat operating in a single building can be used to train Batch RL algorithms. However, if we consider the data generated by the same thermostat operating in different buildings, much more data is available. While the interactions between the same thermostat model and different buildings correspond to different MDPs, these MDPs share regularities which can support generalization, such as the physics of heat diffusion. In Section 6, we further discuss the relations between BMRL and other existing formulations.

The first challenge in BMRL is the accurate inference of the unseen MDP identity. We show that existing algorithms which sample mini-batches from the existing data to perform Q-learning style updates converge to a degenerate value function, a phenomenon we term MDP mis-identification. The second challenge is the interpolation of knowledge about seen MDPs to perform well on unseen MDPs. While Meta RL algorithms can explicitly optimize for this objective thanks to the ability to interact with the environment, we must rely on the inherent generalizability of the trained networks. To mitigate these issues, we propose tiMe, which learns from existing data to distill multiple value functions and MDP embeddings. tiMe is a flexible and scalable pipeline with inductive biases to encourage accurate MDP identity inference and rich supervision to maximize generalization. The pipeline consists of two phases. In the first phase, a Batch RL algorithm is used to extract MDP-specific networks from MDP-specific data. The second phase distills the MDP-specific networks.

2 Preliminaries

2.1 Batch Reinforcement Learning

We model the environment as a Markov Decision Process (MDP), uniquely defined as a 5 element tuple with state space S, action space A, transition function T, reward function R and discount factor γ (Puterman, 1994; Sutton and Barto, 1998). At each discrete timestep, the agent is in a state, picks an action, arrives at the next state and receives a reward. The goal of the agent is to maximize the expected sum of discounted rewards along a trajectory generated by using its policy to interact with the MDP. We will consider a family of MDPs, defined formally in subsection 2.3. We thus assign an index to each MDP in this family.

In Batch RL, policies are trained from scratch to solve a single MDP using an existing batch of transition tuples without any further interaction with the MDP. At test time, we use the trained policies to interact with the MDP to obtain an empirical estimate of their performance. Batch RL optimizes for the same objective as standard RL algorithms. However, during training, the learning algorithm only has access to the batch and is not allowed to interact with the MDP.

2.2 Batch-Constrained Q-Learning

Fujimoto et al. (2018a) identify extrapolation error and value function divergence as the modes of failure when modern Q-learning algorithms are applied to the Batch RL setting. Concretely, deep Q-learning algorithms approximate the expected sum of discounted rewards starting from a state-action pair (s, a) with a value estimate Q(s, a). The estimate can be learned by sampling transition tuples (s, a, r, s') from the batch and applying the temporal difference update:
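A common form of this update, written in notation that is assumed here rather than taken from the source (learning rate $\alpha$, discount factor $\gamma$, and a policy $\pi$ that selects the action at the next state), is:

```latex
Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a)
  + \alpha \left( r + \gamma\, Q\big(s', \pi(s')\big) \right)
```

When $\pi$ is greedy with respect to $Q$, the bootstrapped term becomes $\max_{a'} Q(s', a')$, which is the familiar Q-learning target.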


The value function diverges if Q fails to accurately estimate the value of the next state-action pair used in the update target. Fujimoto et al. (2018a) introduce Batch-Constrained Q-Learning (BCQ), which constrains the policy to select actions that are similar to actions in the batch to prevent inaccurate value estimation. Concretely, given a state, a generator outputs multiple candidate actions. A perturbation model takes each state-candidate action pair as input and generates a small correction term for each candidate. The policy returns the corrected candidate action with the highest value as estimated by a learnt Q.
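The selection rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: `generator`, `perturb` and `q_value` are hypothetical callables standing in for the learnt networks, and none of the names come from the source or from a BCQ implementation.

```python
import numpy as np

def bcq_select_action(state, generator, perturb, q_value, num_candidates=10, rng=None):
    """Return the corrected candidate action with the highest estimated value.

    The three callables are hypothetical stand-ins for learnt networks:
      generator(state, rng)   -> candidate action (1-D array)
      perturb(state, action)  -> small correction term (same shape as action)
      q_value(state, action)  -> scalar value estimate
    """
    rng = np.random.default_rng() if rng is None else rng
    # Propose several actions that should resemble actions in the batch.
    candidates = [generator(state, rng) for _ in range(num_candidates)]
    # Apply the small learnt correction to every candidate.
    corrected = [a + perturb(state, a) for a in candidates]
    # Keep the corrected candidate that the value estimate ranks highest.
    values = [q_value(state, a) for a in corrected]
    return corrected[int(np.argmax(values))]

# Toy usage with hand-written stand-ins for the three components.
state = np.zeros(2)
toy_generator = lambda s, rng: rng.uniform(-1.0, 1.0, size=2)
toy_perturb = lambda s, a: 0.05 * np.tanh(-a)
toy_q = lambda s, a: -float(np.sum(a ** 2))
action = bcq_select_action(state, toy_generator, toy_perturb, toy_q,
                           num_candidates=20, rng=np.random.default_rng(0))
```

Because only generated and corrected candidates are ever scored, the value estimate is never queried on actions far from those in the batch, which is the point of the constraint.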

Estimation error has also been previously studied and mitigated in model-free RL algorithms (Hasselt, 2010; van Hasselt et al., 2015; Fujimoto et al., 2018b).

2.3 Meta Reinforcement Learning

Meta RL optimizes for average return on a family of MDPs and usually assumes that the MDPs in this family share the state space, the action space and the discount factor. Each MDP is then uniquely defined by its transition and reward functions, and a distribution over these functions defines a distribution over MDPs. During meta-train, we train a policy by sampling MDPs from this distribution and sampling trajectories from each sampled MDP, referred to as the meta-train MDPs. During meta-test, unseen MDPs are sampled from the same distribution, referred to as the meta-test MDPs. The trained policy is used to interact with the meta-test MDPs to obtain an estimate of its performance. The choice of whether to update parameters (Finn et al., 2017) or to keep them fixed during meta-test (Hochreiter et al., 2001) is left to the learning algorithms, both having demonstrated prior successes.

2.4 MDP Identity Inference with Set Neural Network

A Meta RL policy needs to infer the meta-test MDP identity to pick actions with high return. Rakelly et al. (2019) introduce PEARL, which uses a set neural network (Qi et al., 2016; Zaheer et al., 2017) as the MDP identity inference function. The inference function takes as input a context set and infers the identity of an MDP in the form of a distributed representation in continuous space. Its parameters are trained to minimize the error of the critic, which is defined using a learnt state value function. PEARL also adopts an amortized variational approach (Kingma and Welling, 2013; Rezende et al., 2014; Alemi et al., 2016; Kingma and Welling, 2019) to train a probabilistic inference function, which is interpreted as an approximation to the true posterior over the set of possible MDP identities given the context set.
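To make the idea of a set neural network concrete, the sketch below embeds every transition independently and then pools over the set, so the result does not depend on the order of the context. It is a minimal illustration in the spirit of Zaheer et al. (2017), assuming mean pooling and random weights; it does not reproduce PEARL's aggregation, and all names and shapes are hypothetical.

```python
import numpy as np

def encode_context(context, w_in, w_out):
    """Permutation-invariant embedding of a set of transitions.

    context: array of shape (num_transitions, transition_dim)
    w_in:    (transition_dim, hidden_dim) weights applied to each transition
    w_out:   (hidden_dim, embed_dim) weights applied after pooling
    """
    features = np.tanh(context @ w_in)  # embed every transition independently
    pooled = features.mean(axis=0)      # order-independent pooling over the set
    return pooled @ w_out               # distributed MDP identity representation

rng = np.random.default_rng(0)
context = rng.normal(size=(5, 7))       # 5 transitions, each flattened to 7 numbers
w_in = rng.normal(size=(7, 16))
w_out = rng.normal(size=(16, 4))

z = encode_context(context, w_in, w_out)
z_shuffled = encode_context(context[::-1], w_in, w_out)  # same set, different order
```

Both calls return the same embedding up to floating-point rounding, which is the property that lets the context be treated as a set rather than a sequence.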

3 Batch Meta Reinforcement Learning

Given the number of meta-train MDPs, the number of transition tuples available from each meta-train MDP, and the parameters of the policy, we can formulate Batch Meta Reinforcement Learning (BMRL) as an optimization problem: maximize the expected return of the policy on MDPs sampled from the family, where the learning algorithms only have access to the batch of existing transitions during meta-train.

We assume we know which MDP each transition in the batch was collected from. This assumption simplifies our setting and is used to devise the algorithms. To maintain the flexibility of the formalization, we do not impose restrictions on the controller that generates the batch. However, the performance of learning algorithms generally increases as the training data becomes more diverse.
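One way to write the formulation above is given below. The notation is an assumption made for illustration and is not taken from the source: $K$ meta-train MDPs, $N$ transition tuples per MDP, a distribution $p(\mathcal{M})$ over MDPs, and a policy $\pi_\theta$.

```latex
\max_{\theta} \; \mathbb{E}_{\mathcal{M} \sim p(\mathcal{M})}
  \left[ \mathbb{E}_{\tau \sim (\pi_\theta, \mathcal{M})}
    \left[ \sum_{t} \gamma^{t} r_t \right] \right],
\qquad
\mathcal{B} = \bigcup_{i=1}^{K} \mathcal{B}_i, \quad
\mathcal{B}_i = \left\{ (s_j, a_j, r_j, s'_j) \right\}_{j=1}^{N}
```

The objective on the left matches the usual Meta RL objective; the difference lies in the constraint that $\theta$ may only be fitted using $\mathcal{B}$, where each $\mathcal{B}_i$ is known to come from the $i$-th meta-train MDP.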

MDP identity inference challenge To obtain high return on the unseen meta-test MDPs, the trained policies need to accurately infer their identities (Ritter et al., 2018; Gupta et al., 2018; Humplik et al., 2019). In BMRL, previously proposed solutions based on Q-learning style updates, where mini-batches are sampled from the batch to minimize TD error, converge to a degenerate solution. Subsection 5.1 provides experimental results that demonstrate the phenomenon. In finite MDPs, this degenerate solution is the optimal value function of the MDP constructed by the relative frequencies of transitions contained in the batch. We can formalize this statement with the following proposition.

Proposition 1.

Let be the number of times the triple appears in (with any reward). Performing Q-learning on finite and with all the -values initialized to and update rule (1) where is sampled uniformly at random from at each step , will lead the -values to converge to the optimal -value of the MDP almost surely as long as , , , where

Thus, performing Q-learning style updates directly on data sampled from the batch fails to find a good policy because the value function converges to the optimal value function of the wrong MDP. We refer to this phenomenon as MDP mis-identification. The proof is shown in subsection A.1.
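The effect can be reproduced on a toy problem. The sketch below is an illustration constructed for this purpose, not an experiment from the source: two hypothetical MDPs share deterministic transitions (action a leads to state a) but reward opposite actions, and tabular Q-learning is run on tuples sampled uniformly from a batch. The step size 1/n and the discount factor 0.5 are assumptions chosen to keep the example short.

```python
import numpy as np

GAMMA = 0.5
N_STATES, N_ACTIONS = 2, 2

def make_batch(rewarded_action):
    """All transitions of a toy MDP: action a leads to state a; one action pays reward 1."""
    return [(s, a, float(a == rewarded_action), a)
            for s in range(N_STATES) for a in range(N_ACTIONS)]

def q_learning_on_batch(batch, steps=40_000, seed=0):
    """Tabular Q-learning on tuples sampled uniformly at random from a fixed batch."""
    rng = np.random.default_rng(seed)
    q = np.zeros((N_STATES, N_ACTIONS))
    visits = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(steps):
        s, a, r, s_next = batch[rng.integers(len(batch))]
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]  # decaying step size (an assumption of this sketch)
        q[s, a] += alpha * (r + GAMMA * q[s_next].max() - q[s, a])
    return q

batch_a = make_batch(rewarded_action=0)  # MDP A rewards action 0
batch_b = make_batch(rewarded_action=1)  # MDP B rewards action 1

q_single = q_learning_on_batch(batch_a)            # standard Batch RL on one MDP
q_mixed = q_learning_on_batch(batch_a + batch_b)   # transitions from both MDPs pooled
```

On the single-MDP batch the estimates approach the optimal values of MDP A (2 for action 0 and 1 for action 1). On the pooled batch every state-action pair is rewarded half of the time, so the estimates approach 1 for every entry: the optimal value function of an averaged MDP in which no action is preferred, which is optimal for neither MDP A nor MDP B.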

Interpolation of seen MDPs to unseen MDPs challenge The trained policies need to generalize from the meta-train MDPs to unseen meta-test MDPs. Meta RL tackles this challenge by formulating an optimization problem that explicitly optimizes for the average return of the meta-trained policy after additional gradient steps in unseen MDPs (Finn et al., 2017; Rothfuss et al., 2018; Nichol et al., 2018). This is possible thanks to the ability to interact with the environment during meta-train. However, in the meta-train phase of BMRL, the learning algorithms do not have access to the environment. We must rely on the inherent generalizability of the trained networks to perform well on the unseen meta-test MDPs. The key challenge is therefore finding the right inductive biases in the architecture and training procedure to encourage such generalization. The need to find the right inductive biases in RL was highlighted by Botvinick et al. (2019); Zambaldi et al. (2019); Hessel et al. (2019). We note that previous works phrase the need to find inductive biases as a means to forgo generality for efficient learning. In our setting, these two goals need not be mutually exclusive.

4 Learning Distillation of value functions and MDP Embeddings

4.1 Description of architecture and training procedure

Figure 1: The above consists of two separate figures. (left) The training pipeline of tiMe in the simplest setting. (right) Architecture for the second phase when BCQ is used in the first phase.

We propose a flexible and scalable pipeline for BMRL. Figure 1 (left) provides an overview of the pipeline in the simplest setting. Meta-train comprises two separate phases. The first phase consists of independently training a value function for each MDP-specific batch using Batch RL algorithms. In the second phase, we distill the set of batch-specific value functions into a super value function through supervised learning (Hinton et al., 2015). Compared to a normal value function, a super value function takes not only a state-action pair as input, but also an inferred MDP identity, and outputs different values depending on the inferred MDP identity.

The pipeline is flexible in that any Batch RL algorithm is applicable in the first phase. Figure 1 (right) illustrates the architecture for the second phase given that the Batch RL algorithm used in the first phase is Batch Constrained Q (BCQ) Learning. As described in subsection 2.2, BCQ maintains three separate components: a learnt value function, a candidate action generator and a perturbation model. Therefore, the output of the first phase consists of 3 sets of networks, one set per component. The second phase distills each set into a corresponding super function. The distillation of the generator and the perturbation model is necessary to pick actions that lead to high return because each learnt value function only provides reliable estimates for actions produced by its own generator and perturbation model, a consequence of the training procedure of BCQ.

In the second phase of the pipeline, in addition to , the architecture consists of 3 other networks. takes as input a context and outputs a distributed representation of the MDP identity in a fixed-dimension continuous space. The output of is passed as an input to . and predicts given a state-action pair . has low capacity while the other networks are relatively large. Pseudocode is provided in Algorithm 1. After meta-train, the super functions are used to pick actions in the meta-test MDPs, similar to how BCQ picks actions.

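Input: the MDP-specific batches and the networks extracted from them in the first phase; the second-phase networks, parameterized jointly
Randomly choose one of the meta-train MDPs
Sample a transition from the batch of the chosen MDP
Sample a context from the same batch
Infer the MDP identity from the context
Predict the auxiliary target
Predict the state-action value
Predict the candidate action
Obtain the ground truth candidate action
Predict the correction factor
Algorithm 1 tiMe training procedure when BCQ is used in the first phase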
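The value-function part of one training step can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: every callable is a hypothetical stand-in for a network, only the value regression is shown, and the auxiliary, generator and perturbation terms are omitted.

```python
import numpy as np

def distillation_loss(batches, teacher_qs, infer_identity, student_q, context_size, rng):
    """Squared error between the student and the teacher of one sampled MDP.

    batches[i]     : array (num_transitions, transition_dim) from meta-train MDP i
    teacher_qs[i]  : callable(transition) -> value, the first-phase network of MDP i
    infer_identity : callable(context) -> MDP embedding (hypothetical inference network)
    student_q      : callable(transition, embedding) -> value (hypothetical super value function)
    """
    i = rng.integers(len(batches))                   # choose a meta-train MDP
    batch = batches[i]
    transition = batch[rng.integers(len(batch))]     # sample a transition
    idx = rng.choice(len(batch), size=context_size, replace=False)
    context = batch[idx]                             # sample a context from the same batch
    embedding = infer_identity(context)              # infer the MDP identity
    prediction = student_q(transition, embedding)    # student prediction
    target = teacher_qs[i](transition)               # fixed regression target
    return (prediction - target) ** 2

# Toy data: the last column of each transition reveals which MDP produced it.
batches = [np.zeros((10, 3)), np.ones((10, 3))]
teachers = [lambda t: 0.0, lambda t: 1.0]
identity_from_context = lambda ctx: ctx[:, -1].mean()

loss_aware = distillation_loss(batches, teachers, identity_from_context,
                               lambda t, z: z, 4, np.random.default_rng(0))
loss_blind = distillation_loss(batches, teachers, identity_from_context,
                               lambda t, z: 0.5, 4, np.random.default_rng(0))
```

In this toy setup a student that reads the embedding matches the teacher exactly, while a student that ignores it is stuck at the average of the two teachers. The context is the only input that tells the student which teacher to imitate.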

The key idea behind the second phase is to jointly learn distillation of value functions and MDP embeddings. We therefore name the approach tiMe. MDP embeddings refer to the ability to infer the identity of an MDP in the form of a distributed representation in continuous space given a context.

4.2 Benefits of the proposed pipeline

Inductive biases to encourage accurate MDP identification The first inductive bias is the relationship between the MDP identity inference network and the super value function. They collectively receive as input a state-action pair and a context and regress to a target value. The target for each state-action pair can take on any of the values assigned to it by the MDP-specific value functions from the first phase. Similar state-action pairs can have very different regression targets if they correspond to different meta-train MDPs. The context is the only information in the input that correlates with which of these values the two networks should regress to. Thus, they must learn to interpret the context to predict the correct value for the state-action pair. The second inductive bias is the auxiliary prediction task. A key design choice is that the network which takes as input a state-action pair and the inferred MDP identity and performs this auxiliary prediction has low capacity. As such, the output of the inference network must contain meaningful semantic information such that a small network can use it to reconstruct the MDP. This is to prevent the degenerate scenario where the inference network learns to copy its input as its output. To summarize, these two explicit biases in the architecture and training procedure encourage the inference network to accurately infer the MDP identity given the context.

Richness and stability of supervision Previous approaches update the MDP identity inference function to minimize the critic loss (subsection 2.4). It is well-known that RL provides a sparse training signal. This signal can also cause instability since the target values in the critic loss change over time. In contrast, our pipeline provides a training signal for the inference network that is both rich and stable. It is rich because the network is trained to infer a representation of the MDP identity that can be used for multiple downstream tasks, such as the auxiliary prediction task. This encourages a general-purpose learnt representation and supports generalization. The training signal is also stable since the regression targets are fixed during the second phase of tiMe.

Scalability The pipeline is scalable in that an arbitrary amount of purely observational data can be used in the first phase so long as computational constraints permit. The extraction of the batch-specific networks, such as the batch-specific value functions, from the MDP-specific batches can be trivially parallelized and scales gracefully as the number of meta-train MDPs increases.

5 Experimental Results

Our experiments have two main goals: (1) Demonstration of the MDP mis-identification phenomenon and tiMe’s ability to effectively mitigate it. (2) Demonstration of the scalability of the tiMe pipeline to challenging continuous control tasks and generalization to unseen MDPs.

In all experiments, the MDP-specific batch is the replay buffer obtained when training Soft Actor Critic (SAC) (Haarnoja et al., 2018a, b) in the corresponding MDP for a fixed number of environment interactions. While our problem formulation BMRL and the pipeline tiMe allow for varying both the transition and reward functions within the family of MDPs, we consider the case of a changing reward function in the experiments and leave a changing transition function to future work.

5.1 Toy Experiments

Figure 2: All three figures are from the 3 meta-train MDPs scenario. (left) Performance of Batch SAC in one meta-test MDP and the learnt value estimates of initial state-action pairs. The estimates do not diverge but are significantly higher than the actual returns, demonstrating the MDP mis-identification phenomenon. In contrast, tiMe’s performance is close to optimal. (middle) The behavior of the Batch SAC agent. The dark circle, red circle and dark crosses indicate the agent’s starting location, final location and the 3 meta-train goal locations. The agent fails to find good actions during meta-test and navigates to the location closest to the 3 meta-train goals. (right) The behavior of an agent trained with tiMe. The crosses and circles indicate each meta-test MDP goal location and the agent’s corresponding final location after evaluation in the corresponding meta-test MDP. The agent trained with tiMe finds near-optimal actions in all 3 meta-test MDPs.

This section illustrates MDP mis-identification as the failure mode of existing Batch RL algorithms in BMRL. The toy setting allows for easy interpretability of the trained agents’ behavior. We also show that in the standard Batch RL setting, the Batch RL algorithm tested finds a near-optimal policy. This means the failure of existing Batch RL algorithms in BMRL is not caused by the previously identified extrapolation issue when learning from existing data (Fujimoto et al., 2018a).

Environment Description In this environment, the agent needs to navigate on a 2D plane to a goal location. The agent is a point mass whose starting location is at the origin. Each goal location is a point on a semi-circle centered at the origin with a radius of 10 units. At each discrete timestep, the agent receives as input its current location, takes an action indicating the change in its position, transitions to a new position and receives a reward. The reward is the negative distance between the agent’s current location and the goal location. The agent does not receive the goal location as input. Since the MDP transition function is fixed, each goal location uniquely defines an MDP. The distribution over MDPs is defined by the distribution over goal locations, which corresponds to a distribution over reward functions.
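A minimal sketch of such an environment is shown below. It is an assumption-laden illustration rather than the authors' code: the class name is hypothetical, the reward is computed at the position reached after the move, and action bounds and episode termination are left out.

```python
import numpy as np

class PointGoalEnv:
    """Point mass on a 2D plane; the goal lies on a semi-circle around the origin."""

    def __init__(self, goal_angle, radius=10.0):
        # The goal is hidden from the agent and only shapes the reward.
        self.goal = radius * np.array([np.cos(goal_angle), np.sin(goal_angle)])
        self.position = np.zeros(2)

    def reset(self):
        self.position = np.zeros(2)  # every episode starts at the origin
        return self.position.copy()

    def step(self, action):
        # The action is the change in position.
        self.position = self.position + np.asarray(action, dtype=float)
        # Negative distance to the goal (assumed to be measured after the move).
        reward = -float(np.linalg.norm(self.position - self.goal))
        return self.position.copy(), reward

env = PointGoalEnv(goal_angle=0.0)  # goal at (10, 0)
state = env.reset()
next_state, reward = env.step([1.0, 0.0])
```

Two environments built with different `goal_angle` values share the same transition function and differ only in their rewards, which is what makes the goal location act as the MDP identity.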

Batch SAC We modify SAC to learn from the batch by initializing the replay buffer with existing transitions. Otherwise, training stays the same. We test Batch SAC on a simple setting where there is only one meta-train MDP and one meta-test MDP which share the same goal location. This is the standard Batch RL setting and is a special case of BMRL. Batch SAC finds a near-optimal policy.

Three meta-train and meta-test MDPs This experiment has 3 meta-train MDPs with different goal locations. The goals divide the semi-circle into two segments of equal length. There are three meta-test MDPs whose goal locations coincide with the goal locations of the meta-train MDPs. This setting only tests the ability of the trained policies to correctly identify the meta-test MDPs and does not pose the challenge of generalization to unseen MDPs. Figure 2 (left, middle) illustrates that Batch SAC fails to learn a reasonable policy because of the MDP mis-identification phenomenon. We also tried using a probabilistic MDP identity inference function as described in subsection 2.4 in addition to Batch SAC, which fails to train a policy that performs well on all 3 meta-test MDPs.

Performance of tiMe Since Batch SAC can extract the optimal value function out of the batch in the single meta-train MDP case, we use it as the Batch RL algorithm in the first phase of the tiMe pipeline. The architecture in the second phase thus consists of and . To pick an action, we randomly sample multiple actions and choose the action with the highest value as estimated by . This method is termed random shooting (Chua et al., 2018). As illustrated in Figure 2 (left, right), tiMe can identify the identities of the three meta-test MDPs and pick near-optimal actions.
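Random shooting as described above amounts to sampling actions and keeping the best-scoring one. The sketch below assumes uniform sampling inside box bounds and uses a hand-written value function in place of a learnt one; the function and parameter names are hypothetical.

```python
import numpy as np

def random_shooting(q_value, state, action_low, action_high, num_samples, rng):
    """Sample actions uniformly at random and return the one with the highest value."""
    actions = rng.uniform(action_low, action_high,
                          size=(num_samples, len(action_low)))
    values = np.array([q_value(state, a) for a in actions])
    return actions[int(np.argmax(values))]

# Toy value function whose maximum is at a known action.
target = np.array([0.3, -0.2])
toy_q = lambda s, a: -float(np.sum((a - target) ** 2))

best = random_shooting(toy_q, np.zeros(2),
                       np.array([-1.0, -1.0]), np.array([1.0, 1.0]),
                       num_samples=1000, rng=np.random.default_rng(0))
```

The returned action is only as good as the densest sample near the maximum, so the number of samples trades computation for action quality. In the pipeline, the value function would additionally receive the inferred MDP identity.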

5.2 Mujoco experiments

Figure 3: The leftmost column illustrates hopper and halfcheetah. The remaining columns indicate the performance of SAC trained from scratch versus tiMe for different unseen MDPs during zero-shot meta-test. We emphasize the difficult nature of obtaining high performance in this setting. The second to fourth columns correspond to small, medium, and large target velocities respectively. The first and second rows indicate performance on hopper and halfcheetah respectively. The x-axis indicates SAC’s number of environment interactions. The y-axis indicates the average episode return. The final performance of SAC is close-to-optimal in all plots in the sense that running SAC for many more timesteps will not increase its performance significantly.

Environment Description This section illustrates the test of tiMe in challenging continuous control robotic locomotion tasks. Each task requires the application of control action to a simulated robot so that it moves with a particular velocity in the direction of its initial heading. Formally, the MDP within each MDP family share and only differs in where is defined to be:

and are positive constant. A one-to-one correspondence exists between a MDP within the family and a target velocity. Defining a family of MDP is equivalent to picking an interval of possible target velocity. This setting is instantiated on two types of simulated robots, hopper and halfcheetah, illustrated in Figure 3. Experiments are performed inside the Mujoco simulator (Todorov et al., 2012). The setting was first proposed by Finn et al. (2017).

Zero-shot meta-test

During testing, in contrast to prior works, we do not update the parameters of the trained networks or allow for an initial exploratory phase where episode returns do not count towards the final meta-test performance. This allows for testing the inherent generalizability of the trained networks without confounding factors. The meta-test MDPs are chosen such that they are unseen during meta-train, i.e. none of the transitions used during meta-train was sampled from any of the meta-test MDPs. At the beginning of each meta-test episode, the inferred MDP identity is initialized to a zero vector. Subsequent transitions collected during the episode are used as the context.
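The evaluation protocol can be sketched as a single episode loop. The sketch assumes that the identity is re-inferred after every transition, which the text does not specify, and all callables are hypothetical stand-ins for the environment and the trained networks.

```python
import numpy as np

def run_zero_shot_episode(reset, step, select_action, infer_identity, embed_dim, horizon):
    """Run one meta-test episode without updating any network parameters."""
    state = reset()
    identity = np.zeros(embed_dim)  # inferred identity starts as a zero vector
    context, total_return = [], 0.0
    for _ in range(horizon):
        action = select_action(state, identity)
        next_state, reward = step(action)
        # Transitions from this episode are the only source of context.
        context.append(np.concatenate([state, action, [reward], next_state]))
        identity = infer_identity(np.stack(context))
        total_return += reward      # every step counts; there is no free exploration phase
        state = next_state
    return total_return

# Toy 1-D environment: reward is the negative distance to a goal at 2.0.
pos = {"x": np.zeros(1)}

def reset():
    pos["x"] = np.zeros(1)
    return pos["x"].copy()

def step(action):
    pos["x"] = pos["x"] + action
    return pos["x"].copy(), -float(abs(pos["x"][0] - 2.0))

episode_return = run_zero_shot_episode(
    reset, step,
    select_action=lambda s, z: np.array([1.0]),   # fixed stand-in policy
    infer_identity=lambda ctx: ctx.mean(axis=0)[:1],
    embed_dim=1, horizon=3)
```

Because returns accumulate from the first step, any cost of acting under a poorly inferred identity early in the episode is charged to the reported performance.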

Meta-train conditions The target velocities of the meta-train MDPs divide the target velocity interval into equal segments. This removes the bias of sampling meta-train MDPs when evaluating performance. The target velocity intervals, episode length, and number of meta-train MDPs for hopper and halfcheetah are and , 1000 and 200, and respectively.

Performance analysis Figure 3 illustrates tiMe’s performance on unseen meta-test MDPs. tiMe is competitive with the state-of-the-art model-free RL methods trained from scratch for one million and sixty thousand environment interactions in hopper and halfcheetah respectively. We perform experiments on halfcheetah with an episode length of 200 because of computational constraints. Previous Meta RL works also use an episode length of 200 (Rakelly et al., 2019). The same network trained with tiMe also performs well at diverse points inside the support of the MDP distribution, demonstrating that it does not over-fit to one particular meta-train MDP. We compare with SAC to demonstrate that BMRL is a promising research direction. We do not include other Meta RL algorithms as baselines because they would require interacting with the environment during meta-train and thus do not solve the problem that BMRL poses. We tried removing the candidate action generator from the architecture in Figure 1 and picking actions with the Cross Entropy Method (Rubinstein and Kroese, 2004), but that led to poor performance because the value function over-estimates the values of actions not produced by the generator.

6 Related Works

Supervised Learning and Imitation Learning

The main differences between Batch (Meta) RL and supervised learning are: actions have long-term consequences and the actions in the batch are not assumed to be optimal. If they are optimal in the sense that they were collected from an expert, Batch RL reduces to Imitation Learning (Abbeel and Ng, 2004; Ho and Ermon, 2016). In fact, Fujimoto et al. (2018a) demonstrate that Batch RL generalizes Imitation Learning in discrete MDPs.

Meta RL Equation 3 is the same objective that existing Meta RL algorithms optimize for (Wang et al., 2016a; Finn et al., 2017). We could have formulated our experimental setting as a Partially Observable MDP, but we chose to formulate it as Batch Meta Reinforcement Learning to ensure consistency with the literature that inspires this paper. The main difference between Meta RL and our formulation is access to the environment during training. Meta RL algorithms sample transitions from the environment during meta-train. We only have access to existing data during meta-train.

Context Inference Zintgraf et al. (2019) and Rakelly et al. (2019) propose learning inference modules that infer the MDP identity. Their procedures sample transitions from the MDP during meta-train, which differs from our motivation of learning from only existing data. Killian et al. (2017) infer the MDP’s “hidden parameters”, input the parameters to a learnt transition function to generate synthetic data, and train a policy from the synthetic data. Such model-based approaches are still outperformed by the best model-free methods (Wang et al., 2019), which our method is based on.

Batch RL Fujimoto et al. (2018a) and Agarwal et al. (2019) demonstrate that good policies can be learnt entirely from existing data in modern RL benchmarks. Our work extends their approaches to train policies from data generated by a family of MDPs. Li et al. (2004) select transitions from the batch based on an importance measure. They assume that for state-action pairs in the batch, their value under the optimal value function can be easily computed. We do not make such an assumption.

Factored MDPs In discrete MDPs, the number of possible states increases exponentially in the number of dimensions. Kearns and Koller (2000) tackle this problem by assuming each dimension in the next state is conditionally dependent on only a subset of the dimensions in the current state. In contrast, our method makes no such assumption and applies to both discrete and continuous settings.

Joint MDP The family of MDPs can be seen as a joint MDP with additional information in the state which differentiates states between the different MDPs (Parisotto et al., 2015). Sampling an initial state from the joint MDP is equivalent to sampling an MDP from the family of MDPs. However, without prior knowledge, it is unclear how to set the value of the additional information to support generalization from the meta-train MDPs to the meta-test MDPs. In fact, the additional information in our approach is the transitions from the MDP and the network learns to infer the MDP identity.

7 Conclusion

We propose a new formalization of pre-training in RL as Batch Meta Reinforcement Learning (BMRL). BMRL differs from Batch RL in that the existing data comes from a family of related MDPs and thus enables scalable data collection. BMRL also differs from Meta RL in that no environment interaction happens during meta-train. We identified two main challenges in BMRL: MDP identity inference and generalization to unseen MDPs. To tackle these challenges, we propose tiMe, a flexible and scalable training pipeline which jointly learns distillation of value functions and MDP embeddings. Experimentally, we demonstrate that tiMe obtains performance competitive with that obtained by state-of-the-art model-free RL methods on unseen MDPs.


  • P. Abbeel and A. Y. Ng (2004) Apprenticeship learning via inverse reinforcement learning. In

    Proceedings of the Twenty-first International Conference on Machine Learning

    ICML ’04, New York, NY, USA, pp. 1–. External Links: ISBN 1-58113-838-5, Link, Document Cited by: §6.
  • R. Agarwal, D. Schuurmans, and M. Norouzi (2019) Striving for simplicity in off-policy deep reinforcement learning. External Links: Link Cited by: §6.
  • A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy (2016) Deep variational information bottleneck. CoRR abs/1612.00410. External Links: Link, 1612.00410 Cited by: §2.4.
  • A. Antos, R. Munos, and C. Szepesvári (2007) Fitted q-iteration in continuous action-space mdps. In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS’07, USA, pp. 9–16. External Links: ISBN 978-1-60560-352-0, Link Cited by: §1.
  • K. Arulkumaran, A. Cully, and J. Togelius (2019) AlphaStar: an evolutionary computation perspective. Note: cite arxiv:1902.01724 External Links: Link Cited by: §1.
  • M. Botvinick, S. Ritter, J. Wang, Z. Kurth-Nelson, C. Blundell, and D. Hassabis (2019) Reinforcement learning, fast and slow. Trends in Cognitive Sciences 23, pp. . External Links: Document Cited by: §3.
  • K. Chua, R. Calandra, R. McAllister, and S. Levine (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. CoRR abs/1805.12114. External Links: Link, 1805.12114 Cited by: §5.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805. External Links: Link, 1810.04805 Cited by: §1.
  • Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel (2016) RL$^2$: fast reinforcement learning via slow reinforcement learning. CoRR abs/1611.02779. External Links: Link, 1611.02779 Cited by: §1.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. CoRR abs/1703.03400. External Links: Link, 1703.03400 Cited by: §1, §2.3, §3, §5.2, §6.
  • S. Fujimoto, D. Meger, and D. Precup (2018a) Off-policy deep reinforcement learning without exploration. CoRR abs/1812.02900. External Links: Link, 1812.02900 Cited by: §2.2, §5.1, §6, §6.
  • S. Fujimoto, H. van Hoof, and D. Meger (2018b) Addressing function approximation error in actor-critic methods. CoRR abs/1802.09477. External Links: Link, 1802.09477 Cited by: §1, §2.2.
  • A. Gupta, R. Mendonca, Y. Liu, P. Abbeel, and S. Levine (2018) Meta-reinforcement learning of structured exploration strategies. CoRR abs/1802.07245. External Links: Link, 1802.07245 Cited by: §3.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018a) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR abs/1801.01290. External Links: Link, 1801.01290 Cited by: §5.
  • T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, and S. Levine (2018b) Soft actor-critic algorithms and applications. CoRR abs/1812.05905. External Links: Link, 1812.05905 Cited by: §5.
  • H. V. Hasselt (2010) Double q-learning. In Advances in Neural Information Processing Systems 23, J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta (Eds.), pp. 2613–2621. External Links: Link Cited by: §2.2.
  • M. Hessel, H. van Hasselt, J. Modayil, and D. Silver (2019) On inductive biases in deep reinforcement learning. External Links: Link Cited by: §3.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, External Links: Link Cited by: §4.1.
  • J. Ho and S. Ermon (2016) Generative adversarial imitation learning. CoRR abs/1606.03476. External Links: Link, 1606.03476 Cited by: §6.
  • S. Hochreiter, A. S. Younger, and P. R. Conwell (2001) Learning to learn using gradient descent. In Lecture Notes on Comp. Sci. 2130, Proc. Intl. Conf. on Artificial Neural Networks (ICANN-2001), pp. 87–94. Cited by: §2.3.
  • Z. Huang, F. Liu, and H. Su (2019) Cited by: §1.
  • J. Humplik, A. Galashov, L. Hasenclever, P. A. Ortega, Y. W. Teh, and N. Heess (2019) Meta reinforcement learning as task inference. CoRR abs/1905.06424. External Links: Link, 1905.06424 Cited by: §3.
  • M. Jaderberg, W. M. Czarnecki, I. Dunning, L. Marris, G. Lever, A. G. Castañeda, C. Beattie, N. C. Rabinowitz, A. S. Morcos, A. Ruderman, N. Sonnerat, T. Green, L. Deason, J. Z. Leibo, D. Silver, D. Hassabis, K. Kavukcuoglu, and T. Graepel (2018) Human-level performance in first-person multiplayer games with population-based deep reinforcement learning. CoRR abs/1807.01281. External Links: Link, 1807.01281 Cited by: §1.
  • M. Kearns and D. Koller (2000) Efficient reinforcement learning in factored mdps. pp. . Cited by: §6.
  • T. W. Killian, S. Daulton, G. Konidaris, and F. Doshi-Velez (2017) Robust and efficient transfer learning with hidden parameter markov decision processes. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 6250–6261. External Links: Link Cited by: §6.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. CoRR abs/1312.6114. Cited by: §2.4.
  • D. P. Kingma and M. Welling (2019) An introduction to variational autoencoders. CoRR abs/1906.02691. External Links: Link, 1906.02691 Cited by: §2.4.
  • S. Lange, T. Gabel, and M. Riedmiller (2012) Batch reinforcement learning. Reinforcement Learning: State of the Art, pp. . External Links: Document Cited by: §1.
  • A. Lazaric, M. Restelli, and A. Bonarini (2008) Transfer of samples in batch reinforcement learning. See conf/icml/2008, pp. 544–551. External Links: Link Cited by: §1.
  • L. Li, V. Bulitko, and R. Greiner (2004) Batch reinforcement learning with state importance. In Machine Learning: ECML 2004, J. Boulicaut, F. Esposito, F. Giannotti, and D. Pedreschi (Eds.), Berlin, Heidelberg, pp. 566–568. External Links: ISBN 978-3-540-30115-8 Cited by: §6.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016) Continuous control with deep reinforcement learning.. See conf/iclr/2016, External Links: Link Cited by: §1.
  • D. Mahajan, R. B. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten (2018) Exploring the limits of weakly supervised pretraining. CoRR abs/1805.00932. External Links: Link, 1805.00932 Cited by: §1.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. External Links: ISSN 00280836, Link Cited by: §1.
  • A. Nichol, J. Achiam, and J. Schulman (2018) On first-order meta-learning algorithms. CoRR abs/1803.02999. External Links: Link, 1803.02999 Cited by: §3.
  • E. Parisotto, J. Ba, and R. Salakhutdinov (2015) Actor-mimic: deep multitask and transfer reinforcement learning. CoRR abs/1511.06342. Cited by: §6.
  • M. L. Puterman (1994) Markov decision processes: discrete stochastic dynamic programming. 1st edition, John Wiley & Sons, Inc., New York, NY, USA. External Links: ISBN 0471619779 Cited by: §2.1.
  • C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2016) PointNet: deep learning on point sets for 3d classification and segmentation. CoRR abs/1612.00593. External Links: Link, 1612.00593 Cited by: §2.4.
  • K. Rakelly, A. Zhou, D. Quillen, C. Finn, and S. Levine (2019) Efficient off-policy meta-reinforcement learning via probabilistic context variables. CoRR abs/1903.08254. External Links: Link, 1903.08254 Cited by: §2.4, §5.2, §6.
  • D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, E. P. Xing and T. Jebara (Eds.), Proceedings of Machine Learning Research, Vol. 32, Beijing, China, pp. 1278–1286. External Links: Link Cited by: §2.4.
  • S. Ritter, J. Wang, Z. Kurth-Nelson, S. Jayakumar, C. Blundell, R. Pascanu, and M. Botvinick (2018) Cited by: §3.
  • J. Rothfuss, D. Lee, I. Clavera, T. Asfour, and P. Abbeel (2018) ProMP: proximal meta-policy search. CoRR abs/1810.06784. External Links: Link, 1810.06784 Cited by: §3.
  • R. Y. Rubinstein and D. P. Kroese (2004) The cross entropy method: a unified approach to combinatorial optimization, monte-carlo simulation (information science and statistics). Springer-Verlag, Berlin, Heidelberg. External Links: ISBN 038721240X Cited by: §5.2.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li (2014) ImageNet large scale visual recognition challenge. CoRR abs/1409.0575. External Links: Link, 1409.0575 Cited by: §1.
  • R. S. Sutton and A. G. Barto (1998) Introduction to reinforcement learning. 1st edition, MIT Press, Cambridge, MA, USA. External Links: ISBN 0262193981 Cited by: §2.1.
  • E. Todorov, T. Erez, and Y. Tassa (2012) MuJoCo: a physics engine for model-based control.. See conf/iros/2012, pp. 5026–5033. External Links: ISBN 978-1-4673-1737-5, Link Cited by: §5.2.
  • H. van Hasselt, A. Guez, and D. Silver (2015) Deep reinforcement learning with double q-learning. CoRR abs/1509.06461. External Links: Link, 1509.06461 Cited by: §2.2.
  • Q. H. Vuong, Y. Zhang, and K. W. Ross (2018) Supervised policy update. CoRR abs/1805.11706. External Links: Link, 1805.11706 Cited by: §1.
  • J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick (2016a) Learning to reinforcement learn. CoRR abs/1611.05763. External Links: Link, 1611.05763 Cited by: §6.
  • J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick (2016b) Learning to reinforcement learn. CoRR abs/1611.05763. External Links: Link, 1611.05763 Cited by: §1.
  • T. Wang, X. Bao, I. Clavera, J. Hoang, Y. Wen, E. Langlois, S. Zhang, G. Zhang, P. Abbeel, and J. Ba (2019) Benchmarking model-based reinforcement learning. CoRR abs/1907.02057. External Links: Link, 1907.02057 Cited by: §6.
  • C. J. Watkins and P. Dayan (1992) Q-learning. Machine learning 8 (3-4), pp. 279–292. Cited by: §A.1.
  • Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. CoRR abs/1906.08237. External Links: Link, 1906.08237 Cited by: §1.
  • M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Póczos, R. Salakhutdinov, and A. J. Smola (2017) Deep sets. CoRR abs/1703.06114. External Links: Link, 1703.06114 Cited by: §2.4.
  • V. Zambaldi, D. Raposo, A. Santoro, V. Bapst, Y. Li, I. Babuschkin, K. Tuyls, D. Reichert, T. Lillicrap, E. Lockhart, M. Shanahan, V. Langston, R. Pascanu, M. Botvinick, O. Vinyals, and P. Battaglia (2019) Deep reinforcement learning with relational inductive biases. In International Conference on Learning Representations, External Links: Link Cited by: §3.
  • M. D. Zeiler and R. Fergus (2013) Visualizing and understanding convolutional networks. CoRR abs/1311.2901. External Links: Link, 1311.2901 Cited by: §1.
  • L. Zintgraf, K. Shiarli, V. Kurin, K. Hofmann, and S. Whiteson (2019) Fast context adaptation via meta-learning. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 7693–7702. External Links: Link Cited by: §6.

Appendix A Appendix

A.1 MDP mis-identification convergence proof

Statement: Performing Q-learning on finite and with all the Q-values initialized to and update rule (1) where is sampled uniformly at random from at each step , will lead the Q-values to converge to the optimal Q-value of the MDP almost surely as long as , , , where


First note that for any such , the initial is already optimal and will never be updated; for all other and any , we have

and with probability


Then convergence follows from the same argument as for the convergence of Q-learning (Watkins and Dayan, 1992).
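For intuition about the cited argument, the following minimal sketch runs standard tabular Q-learning (Watkins and Dayan, 1992) on a small deterministic MDP, with the state-action pair sampled uniformly at random at each step. It does not reproduce update rule (1) or the conditions of the statement above, which are not recoverable here; the two-state MDP, the discount factor and the step size $n^{-0.6}$ (one choice whose sum diverges while its sum of squares is finite) are all assumptions made for illustration.

```python
import random

GAMMA = 0.5                          # discount factor (assumed)
NEXT_STATE = [[0, 1], [1, 0]]        # NEXT_STATE[s][a]: action 0 stays, action 1 switches
REWARD = [[0.0, 0.0], [1.0, 0.0]]    # REWARD[s][a]: only staying in state 1 is rewarded


def q_learning(num_steps, seed=0):
    """Tabular Q-learning with uniformly sampled (state, action) pairs."""
    rng = random.Random(seed)
    q = [[0.0, 0.0], [0.0, 0.0]]     # Q-values initialized to zero
    visits = [[0, 0], [0, 0]]        # per-pair update counts
    for _ in range(num_steps):
        s = rng.randrange(2)
        a = rng.randrange(2)
        visits[s][a] += 1
        alpha = visits[s][a] ** -0.6                 # decaying step size
        target = REWARD[s][a] + GAMMA * max(q[NEXT_STATE[s][a]])
        q[s][a] += alpha * (target - q[s][a])        # standard Q-learning update
    return q


q = q_learning(20000)
# Optimal values for this toy MDP, obtained by solving the Bellman
# optimality equations by hand: Q*(1,0) = 2, Q*(0,1) = 1, Q*(0,0) = Q*(1,1) = 0.5.
q_star = [[0.5, 1.0], [2.0, 0.5]]
```

Comparing `q` with `q_star` entry by entry shows the estimates approaching the optimal Q-values of the toy MDP.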

Appendix B Hyper-parameters

The small, medium and large target velocities in hopper correspond to . The small, medium and large target velocities in halfcheetah correspond to . The learning rate is and the Adam optimizer is used in all experiments. All neural networks used are feed-forward networks. All experiments are performed on machines with up to CPU cores and Nvidia GPU.

All experiments are performed in Python , with mujoco-py running on top of MuJoCo. All neural network operations are in PyTorch.


In the toy experiment, consists of hidden layers, each of size . The inferred MDP size is . The context size is . consists of hidden layers of size . consists of hidden layers of size and outputs values. Same goes for . Random shooting was performed with random actions at each iteration.
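A minimal sketch of random-shooting action selection is given below. It assumes a scalar action, uniform sampling within fixed bounds, and scoring of each candidate by a Q-function; the function name, the toy Q-function and the number of sampled actions are hypothetical and do not reflect the values used in the toy experiment.

```python
import random


def random_shooting(q_fn, state, num_actions, action_low, action_high, rng):
    """Return the best of `num_actions` uniformly sampled candidate actions.

    `q_fn(state, action)` is any callable that scores an action in a state.
    """
    candidates = [rng.uniform(action_low, action_high) for _ in range(num_actions)]
    return max(candidates, key=lambda action: q_fn(state, action))


def toy_q(state, action):
    # Toy Q-function (assumed): the best action equals the state.
    return -(action - state) ** 2


rng = random.Random(0)
best_action = random_shooting(toy_q, state=0.3, num_actions=1000,
                              action_low=-1.0, action_high=1.0, rng=rng)
```

Because the candidates are sampled independently, the quality of the selected action improves with the number of samples at a cost linear in that number.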

In hopper, the size of the inferred MDP is . The context size is . consists of hidden layers, each of size . consists of hidden layers, each of size , and outputs a vector of size . consists of hidden layers, each of size . consists of hidden layers, each of size . The training mini-batch size is . When BCQ is run to extract the value function out of the batch, the same hyper-parameters as found in the official implementation are used, except the learning rate is lowered from to . The and in the reward function definition are and . The is .

In halfcheetah, the size of the inferred MDP is . The context size is . consists of hidden layers, each of size . consists of hidden layers, each of size , and outputs a vector of size . consists of hidden layers, each of size . consists of hidden layers, each of size . The training mini-batch size is . consists of hidden layers, each of size . consists of hidden layers, each of size . When BCQ is run to extract the value function out of the batch, unless otherwise mentioned, the same hyper-parameters as found in the official implementation are used. The learning rate is lowered from to . The perturbation model has hidden layers, of size . The critic also has hidden layers, of size . The and in the reward function definition are and . The is .

In both hopper and halfcheetah, except for the super Q function loss, the terms in the loss in Algorithm 1 are scaled so that they have the same magnitude as the super Q function loss. Graphs for the mujoco experiments are generated by smoothing over the last evaluation datapoints.
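One way such a rescaling could be implemented is sketched below: each auxiliary loss term is multiplied by a coefficient that brings it to the magnitude of the reference loss, here standing in for the super Q function loss. How the coefficients are actually computed (fixed once or recomputed per batch) is not stated above, so this is an assumption, as are the function names and the loss names in the example.

```python
def scale_to_reference(reference_loss, other_losses, eps=1e-8):
    """Return one coefficient per loss so that coefficient * loss has
    approximately the magnitude of `reference_loss`."""
    return {
        name: abs(reference_loss) / (abs(value) + eps)
        for name, value in other_losses.items()
    }


def total_loss(reference_loss, other_losses):
    """Sum the reference loss and the rescaled auxiliary losses."""
    coeffs = scale_to_reference(reference_loss, other_losses)
    return reference_loss + sum(coeffs[name] * value
                                for name, value in other_losses.items())


# Hypothetical loss values for one mini-batch.
reference = 2.0
others = {"aux_small": 0.02, "aux_large": 400.0}
coeffs = scale_to_reference(reference, others)
combined = total_loss(reference, others)
```

In an automatic-differentiation framework the coefficients would typically be treated as constants, so that gradients flow only through the loss terms themselves.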

The performance on MuJoCo was averaged over seeds . The hyper-parameters for SAC are the same as those found in the PyTorch public implementation. The standard deviations are averaged over timesteps during evaluation. This corresponds to episodes in halfcheetah, because there is no terminal state termination in halfcheetah, and to a variable number of episodes in hopper, because there is terminal state termination.