1 Introduction
Deep reinforcement learning algorithms still require millions of environment interactions to obtain reasonable performance, hindering their application (Mnih et al., 2015; Lillicrap et al., 2016; Vuong et al., 2018; Fujimoto et al., 2018b; Jaderberg et al., 2018; Arulkumaran et al., 2019; Huang et al., 2019). This is largely due to the lack of good pretraining methods. In supervised learning, a pretrained network significantly reduces sample complexity when learning new tasks (Zeiler and Fergus, 2013; Devlin et al., 2018; Yang et al., 2019). Even though Meta Reinforcement Learning (Meta RL) has been proposed as a framework for pretraining in RL, such methods still require the collection of millions of interactions during meta-train (Wang et al., 2016b; Duan et al., 2016; Finn et al., 2017).
In supervised learning, a key reason why pretraining is incredibly successful is that the dataset used for pretraining can be collected from naturally occurring large-scale processes. This removes the need to manually collect data and allows for scalable data collection, resulting in massive datasets. For example, Mahajan et al. (2018) pretrain on existing images and their corresponding hashtags from Instagram to obtain state-of-the-art performance on ImageNet (Russakovsky et al., 2014). In this paper, we seek to formalize pretraining in RL in a way that allows for scalable data collection. The data used for pretraining should be purely observational
and the policies being optimized should not need to interact with the environment during pretraining. To this end, we propose Batch Meta Reinforcement Learning (BMRL) as a formalization of pretraining in RL from only existing, observational data. During training, the learning algorithms only have access to a batch of existing data collected a priori from a family of Markov Decision Processes (MDPs). During testing, the trained policies should perform well on unseen MDPs sampled from the family.
A related setting is Batch RL (Antos et al., 2007; Lazaric et al., 2008; Lange et al., 2012), which, we emphasize, assumes the existing data comes from a single MDP. To enable scalable data collection, this assumption must be relaxed: the existing data should come from a family of related MDPs. Consider smart thermostats, whose goal is to maintain a specific temperature while minimizing electricity cost. Assuming Markovian dynamics, the interactions between a thermostat and its environment can be modelled as an MDP. Data generated by a thermostat operating in a single building can be used to train Batch RL algorithms. However, if we consider the data generated by the same thermostat model operating in different buildings, much more data is available. While the interactions between the same thermostat model and different buildings correspond to different MDPs, these MDPs share regularities which can support generalization, such as the physics of heat diffusion. In Section 6, we further discuss the relations between BMRL and other existing formulations.
The first challenge in BMRL is the accurate inference of the unseen MDP's identity. We show that existing algorithms which sample mini-batches from the existing data to perform Q-learning-style updates converge to a degenerate value function, a phenomenon we term MDP misidentification. The second challenge is the interpolation of knowledge about seen MDPs to perform well on unseen MDPs. While Meta RL algorithms can explicitly optimize for this objective thanks to their ability to interact with the environment, we must rely on the inherent generalizability of the trained networks. To mitigate these issues, we propose tiMe, which learns from existing data to distill multiple value functions and MDP embeddings. tiMe is a flexible and scalable pipeline with inductive biases that encourage accurate MDP identity inference and rich supervision that maximizes generalization. The pipeline consists of two phases. In the first phase, a Batch RL algorithm is used to extract MDP-specific networks from MDP-specific data. The second phase distills the MDP-specific networks.

2 Preliminaries
2.1 Batch Reinforcement Learning
We model the environment as a Markov Decision Process (MDP), uniquely defined by a 5-element tuple $(\mathcal{S}, \mathcal{A}, T, R, \gamma)$ with state space $\mathcal{S}$, action space $\mathcal{A}$, transition function $T$, reward function $R$ and discount factor $\gamma$ (Puterman, 1994; Sutton and Barto, 1998). At each discrete timestep, the agent is in a state $s$, picks an action $a$, arrives at the next state $s'$ and receives a reward $r$. The goal of the agent is to maximize the expected sum of discounted rewards $\mathbb{E}_\tau \left[ \sum_t \gamma^t r_t \right]$, where $\tau$ is a trajectory generated by using the policy to interact with the MDP. We will consider a family of MDPs, defined formally in subsection 2.3. We thus index each MDP in this family with $i$.
In Batch RL, policies are trained from scratch to solve a single MDP $M$ using an existing batch $B$ of transition tuples $(s, a, r, s')$ without any further interaction with $M$. At test time, we use the trained policies to interact with $M$ to obtain an empirical estimate of their performance. Batch RL optimizes for the same objective as standard RL algorithms. However, during training, the learning algorithm only has access to $B$ and is not allowed to interact with $M$.

2.2 Batch-Constrained Q-Learning
Fujimoto et al. (2018a) identify extrapolation error and value function divergence as the modes of failure when modern Q-learning algorithms are applied to the Batch RL setting. Concretely, deep Q-learning algorithms approximate the expected sum of discounted rewards starting from a state-action pair $(s, a)$ with a value estimate $Q(s, a)$. The estimate can be learned by sampling transition tuples $(s, a, r, s')$ from the batch and applying the temporal-difference update:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \quad (1)$$
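Update (1) can be sketched in a few lines of tabular code; the two-state, two-action setup below is an illustrative stand-in, not an experiment from the paper.

```python
import numpy as np

def td_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Equation (1): move Q(s, a) toward the bootstrapped target.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

Q = np.zeros((2, 2))                 # two states, two actions
Q = td_update(Q, 0, 1, 1.0, 1)       # observe transition (s=0, a=1, r=1, s'=1)
```

With all values initialized to zero, a single update moves `Q[0, 1]` from 0 to `alpha * r = 0.1`.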
The value function diverges if $Q$ fails to accurately estimate the value of $s'$. Fujimoto et al. (2018a) introduce Batch-Constrained Q-Learning, constraining the policy to select actions that are similar to actions in the batch to prevent inaccurate value estimation. Concretely, given a state $s$, a generator $G$ outputs multiple candidate actions $a_k$. A perturbation model $\xi$ takes each state-candidate-action pair as input and generates a small correction term for each candidate. The selected action is the corrected candidate with the highest value as estimated by a learnt $Q$:

$$\pi(s) = \operatorname*{argmax}_{a_k + \xi(s, a_k)} Q\big(s, a_k + \xi(s, a_k)\big)$$
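The selection rule above can be sketched as follows; `candidates`, `perturb` and `q_value` are illustrative stand-ins for the outputs of the generator, the perturbation model and the learnt value network.

```python
import numpy as np

def bcq_select_action(state, candidates, perturb, q_value):
    """BCQ-style action choice: nudge each batch-like candidate action with a
    small learned correction, then take the argmax of the value estimate."""
    corrected = [a + perturb(state, a) for a in candidates]
    values = [q_value(state, a) for a in corrected]
    return corrected[int(np.argmax(values))]

# toy stand-ins: the value estimate peaks at a = 0.5, the perturbation is a
# constant nudge of +0.05
chosen = bcq_select_action(
    state=0.0,
    candidates=[0.0, 0.4, 1.0],
    perturb=lambda s, a: 0.05,
    q_value=lambda s, a: -(a - 0.5) ** 2,
)
```

Here the corrected candidates are 0.05, 0.45 and 1.05, and the one closest to the value peak, 0.45, is selected.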
Estimation error has also been previously studied and mitigated in model-free RL algorithms (Hasselt, 2010; van Hasselt et al., 2015; Fujimoto et al., 2018b).
2.3 Meta Reinforcement Learning
Meta RL optimizes for average return over a family of MDPs and usually assumes that the MDPs in this family share $\mathcal{S}$, $\mathcal{A}$ and $\gamma$. Each MDP $M_i$ is then uniquely defined by the tuple $(T_i, R_i)$. A distribution $p(M)$ over these tuples defines a distribution over MDPs. During meta-train, we train a policy by sampling MDPs from this distribution, referred to as the meta-train MDPs, and sampling trajectories from each sampled MDP. During meta-test, unseen MDPs are sampled from $p(M)$, referred to as the meta-test MDPs. The trained policy is used to interact with the meta-test MDPs to obtain an estimate of its performance. The choice of whether to update parameters (Finn et al., 2017) or to keep them fixed during meta-test (Hochreiter et al., 2001) is left to the learning algorithms; both have demonstrated prior successes.
2.4 MDP Identity Inference with Set Neural Network
A Meta RL policy needs to infer the meta-test MDP identity to pick actions with high return. Rakelly et al. (2019) introduce PEARL, which uses a set neural network $q_\phi$ (Qi et al., 2016; Zaheer et al., 2017) as the MDP identity inference function. $q_\phi$ takes as input a context set $c = \{(s_j, a_j, r_j, s'_j)\}_j$ and infers the identity of an MDP in the form of a distributed representation $z$ in continuous space. The parameters of $q_\phi$ are trained to minimize the error of the critic:

$$\mathbb{E}_{(s, a, r, s'), \; z \sim q_\phi(\cdot \mid c)} \Big[ Q(s, a, z) - \big( r + \gamma \bar{V}(s', z) \big) \Big]^2 \quad (2)$$

where $\bar{V}$ is a learnt state value function. PEARL also adopts an amortized variational approach (Kingma and Welling, 2013; Rezende et al., 2014; Alemi et al., 2016; Kingma and Welling, 2019) to train a probabilistic $q_\phi$, which is interpreted as an approximation to the true posterior over the set of possible MDP identities given the context set.
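The defining property of such a set network is permutation invariance over the context. A minimal sketch, with a fixed `weights` matrix standing in for the learnt per-transition embedding:

```python
import numpy as np

def encode_context(context, weights):
    """Permutation-invariant set encoder sketch: embed each transition in the
    context independently, then mean-pool across the set, so the inferred MDP
    identity does not depend on the order of the transitions."""
    embedded = np.tanh(context @ weights)   # (n_transitions, embed_dim)
    return embedded.mean(axis=0)            # pooled MDP identity

context = np.array([[1.0, 2.0], [3.0, 4.0], [0.0, 1.0]])  # toy transitions
weights = np.array([[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]])
z = encode_context(context, weights)
z_shuffled = encode_context(context[::-1], weights)       # same set, new order
```

Reordering the context leaves the inferred identity unchanged, which is exactly the inductive bias the set network provides.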
3 Batch Meta Reinforcement Learning
Let $N$ be the number of meta-train MDPs, $K$ be the number of transition tuples available from each meta-train MDP, and $\theta$ be the parameters of the policy $\pi_\theta$. We can formulate Batch Meta Reinforcement Learning (BMRL) as an optimization problem:

$$\max_\theta \; \mathbb{E}_{M_i \sim p(M)} \left[ \mathbb{E}_{\tau \sim (\pi_\theta, M_i)} \Big[ \sum_t \gamma^t r_t \Big] \right] \quad (3)$$

where the learning algorithms only have access to the batch $B$ during meta-train:

$$B = \bigcup_{i=1}^{N} \big\{ (s_k^i, a_k^i, r_k^i, s_k'^i) \big\}_{k=1}^{K}$$
We assume we know which MDP each transition in the batch was collected from. This assumption simplifies our setting and is used to devise the algorithms. To maintain the flexibility of the formalization, we do not impose restrictions on the controller that generates the batch. However, the performance of learning algorithms generally increases as the training data becomes more diverse.
MDP identity inference challenge To obtain high return on the unseen meta-test MDPs, the trained policies need to accurately infer their identities (Ritter et al., 2018; Gupta et al., 2018; Humplik et al., 2019). In BMRL, previously proposed solutions based on Q-learning-style updates, where mini-batches are sampled from the batch to minimize TD error, converge to a degenerate solution. Subsection 5.1 provides experimental results that demonstrate the phenomenon. In finite MDPs, this degenerate solution is the optimal value function of the MDP constructed from the relative frequencies of transitions contained in the batch. We can formalize this statement with the following proposition.
Proposition 1.
Let $N(s, a, s')$ be the number of times the triple $(s, a, s')$ appears in $B$ (with any reward). Performing Q-learning on finite $\mathcal{S}$ and $\mathcal{A}$ with all the values initialized to $0$ and update rule (1), where the transition tuple is sampled uniformly at random from $B$ at each step $t$, will lead the values to converge almost surely to the optimal value of the MDP $\hat{M}$ as long as $\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$ and $0 \le \alpha_t < 1$, where $\hat{M}$ has transition function $\hat{T}(s' \mid s, a) = N(s, a, s') / \sum_{s''} N(s, a, s'')$ and reward function equal to the empirical mean of the rewards observed in $B$ for each $(s, a, s')$.
Thus, performing Q-learning-style updates directly on data sampled from the batch fails to find a good policy because the value function converges to the optimal value function of the wrong MDP. We refer to this phenomenon as MDP misidentification. The proof is shown in subsection A.1.
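The phenomenon is easy to reproduce in the smallest possible case. The simulation below is an illustrative construction, not an experiment from the paper: two one-step MDPs share a single state-action pair but pay different rewards, and Q-learning on the pooled batch converges to the value of the frequency-mixture MDP, which is optimal for neither.

```python
import numpy as np

rng = np.random.default_rng(0)

# MDP A always pays reward 0 for the shared (s, a); MDP B always pays 1.
# Pool their batches and apply update (1) to uniformly sampled transitions
# (the next state is terminal, so the bootstrap term vanishes).
batch = [0.0] * 500 + [1.0] * 500   # pooled rewards for the shared (s, a)
q, alpha = 0.0, 0.01
for _ in range(50_000):
    r = batch[rng.integers(len(batch))]
    q += alpha * (r - q)

# q settles near 0.5: the optimal value of the MDP defined by the relative
# frequencies in the pooled batch -- correct for neither source MDP.
```

Conditioning the value function on an inferred MDP identity, as tiMe does, is what breaks this averaging behavior.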
Interpolation of seen MDPs to unseen MDPs challenge The trained policies need to generalize from the meta-train MDPs to unseen meta-test MDPs. Meta RL tackles this challenge by formulating an optimization problem that explicitly optimizes for the average return of the meta-trained policy after additional gradient steps in unseen MDPs (Finn et al., 2017; Rothfuss et al., 2018; Nichol et al., 2018). This is possible thanks to the ability to interact with the environment during meta-train. However, in the meta-train phase of BMRL, the learning algorithms do not have access to the environment. We must rely on the inherent generalizability of the trained networks to perform well on the unseen meta-test MDPs. The key challenge is therefore finding the right inductive biases in the architecture and training procedure to encourage such generalization. The need to find the right inductive biases in RL was highlighted by Botvinick et al. (2019); Zambaldi et al. (2019); Hessel et al. (2019). We note that previous works phrase the need to find inductive biases as a means to forgo generality for efficient learning. In our setting, these two goals need not be mutually exclusive.
4 Learning Distillation of Value Functions and MDP Embeddings
4.1 Description of architecture and training procedure
We propose a flexible and scalable pipeline for BMRL. Figure 1 (left) provides an overview of the pipeline in the simplest setting. Meta-train comprises two separate phases. The first phase consists of independently training a value function for each MDP-specific batch using Batch RL algorithms. In the second phase, we distill the set of batch-specific value functions into a super value function through supervised learning (Hinton et al., 2015). Compared to a normal value function, a super value function takes not only a state-action pair as input, but also an inferred MDP identity, and outputs different values depending on the inferred MDP identity.
The pipeline is flexible in that any Batch RL algorithm is applicable in the first phase. Figure 1 (right) illustrates the architecture for the second phase given that the Batch RL algorithm used in the first phase is Batch-Constrained Q-Learning (BCQ). As described in subsection 2.2, BCQ maintains three separate components: a learnt value function $Q_i$, a candidate action generator $G_i$ and a perturbation model $\xi_i$. Therefore, the output of the first phase consists of three sets, $\{Q_i\}$, $\{G_i\}$ and $\{\xi_i\}$. The second phase distills each set into a super network $Q$, $G$ and $\xi$, respectively. The distillation of $G$ and $\xi$ is necessary to pick actions that lead to high return because each learnt value function $Q_i$ only provides reliable estimates for actions generated by $G_i$ and $\xi_i$, a consequence of the training procedure of BCQ.
In the second phase of the pipeline, in addition to $Q$, $G$ and $\xi$, the architecture consists of three other networks, including a context encoder and a reward decoder. The context encoder takes as input a context and outputs a distributed representation of the MDP identity in a fixed-dimension continuous space. The output of the encoder is passed as an input to the super networks. The reward decoder predicts the reward given a state-action pair and the inferred MDP identity. The decoder has low capacity while the other networks are relatively large. Pseudocode is provided in Algorithm 1. After meta-train, the super networks are used to pick actions in the meta-test MDPs, similar to how BCQ picks actions.
The key idea behind the second phase is to jointly learn distillation of value functions and MDP embeddings. We therefore name the approach tiMe. MDP embeddings refer to the ability to infer the identity of an MDP, in the form of a distributed representation in continuous space, given a context.
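The second-phase objective can be sketched as a sum of regression losses. The names and the plain-MSE form below are assumptions for illustration; the paper distills BCQ's MDP-specific value, generator and perturbation networks and adds the auxiliary reward-prediction loss that supervises the MDP embedding.

```python
import numpy as np

def time_second_phase_loss(preds, teacher_targets, r_pred, r_true):
    """Illustrative tiMe second-phase loss: distill each MDP-specific teacher
    network into its super network via regression, plus the auxiliary
    reward-prediction term that forces the embedding to encode the MDP."""
    mse = lambda a, b: float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))
    distill = sum(mse(p, t) for p, t in zip(preds, teacher_targets))
    return distill + mse(r_pred, r_true)

loss = time_second_phase_loss(
    preds=([1.0], [0.0], [0.0]),            # super Q, G, xi outputs
    teacher_targets=([0.0], [0.0], [0.0]),  # MDP-specific teacher outputs
    r_pred=[0.5], r_true=[0.5],
)
```

Because the teacher outputs are fixed once the first phase finishes, every regression target in this loss is stationary, which is the stability property discussed in subsection 4.2.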
4.2 Benefits of the proposed pipeline
Inductive biases to encourage accurate MDP identification The first inductive bias is the relationship between the context encoder and the super value function $Q$. They collectively receive as input a state-action pair and a context and regress to a target value. The target for each state-action pair can take on any of the values within the set $\{Q_i(s, a)\}$. Similar state-action pairs can have very different regression targets if they correspond to different meta-train MDPs. The context is the only information in the input that correlates with which of these values the networks should regress to. Thus, the encoder and $Q$ must learn to interpret the context to predict the correct value for each state-action pair. The second inductive bias is the auxiliary task of predicting the reward. A key design choice is that the network which takes as input the inferred MDP identity and a state-action pair and predicts the reward has low capacity. As such, the output of the encoder must contain meaningful semantic information such that a small network can use it to reconstruct the MDP. This prevents the degenerate scenario where the encoder learns to copy its input to its output. To summarize, these two explicit biases in the architecture and training procedure encourage the encoder to accurately infer the MDP identity given the context.
Richness and stability of supervision Previous approaches update the MDP identity inference function to minimize the critic loss (subsection 2.4). It is well known that RL provides sparse training signal. This signal can also cause instability since the target values in the critic loss change over time. In contrast, our pipeline provides training signal for the inference function that is both rich and stable. It is rich because the inference function is trained to infer a representation of the MDP identity that can be used for multiple downstream tasks, such as predicting the reward. This encourages general-purpose learnt representations and supports generalization. The training signal is also stable since the regression targets are fixed during the second phase of tiMe.
Scalability The pipeline is scalable in that an arbitrary amount of purely observational data can be used in the first phase, so long as computational constraints permit. The extraction of the batch-specific networks, such as the batch-specific value functions, from the MDP-specific batches can be trivially parallelized and scales gracefully as the number of meta-train MDPs increases.
5 Experimental Results
Our experiments have two main goals: (1) demonstrating the MDP misidentification phenomenon and tiMe's ability to effectively mitigate it; (2) demonstrating the scalability of the tiMe pipeline to challenging continuous control tasks and its generalization to unseen MDPs.
In all experiments, the MDP-specific batch is the replay buffer accumulated when training Soft Actor-Critic (SAC) (Haarnoja et al., 2018a, b) in the corresponding MDP for a fixed number of environment interactions. While our problem formulation BMRL and the pipeline tiMe allow for varying both the transition and reward functions within the family of MDPs, we only vary the reward function in the experiments and leave varying the transition function to future work.
5.1 Toy Experiments
This section illustrates MDP misidentification as the failure mode of existing Batch RL algorithms in BMRL. The toy setting allows for easy interpretation of the trained agents' behavior. We also show that in the standard Batch RL setting, the Batch RL algorithm tested finds a near-optimal policy. This means the failure of existing Batch RL algorithms in BMRL is not caused by the previously identified extrapolation issue when learning from existing data (Fujimoto et al., 2018a).
Environment Description In this environment, the agent needs to navigate on a 2D plane to a goal location. The agent is a point mass whose starting location is at the origin. Each goal location is a point on a semicircle centered at the origin with a radius of 10 units. At each discrete timestep, the agent receives as input its current location, takes an action indicating the change in its position, transitions to a new position and receives a reward. The reward is the negative distance between the agent's current location and the goal location. The agent does not receive the goal location as input. Since the MDP transition function is fixed, each goal location uniquely defines an MDP. The distribution over MDPs is defined by the distribution over goal locations, which corresponds to a distribution over reward functions.
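The environment above is small enough to sketch in full; the per-step action bound is an assumed detail the text does not specify.

```python
import numpy as np

class PointGoalEnv:
    """Toy 2D point-mass navigation MDP: the goal lies on a radius-10
    semicircle centered at the origin and defines the reward function."""
    def __init__(self, goal_angle, max_step=1.0):
        self.goal = 10.0 * np.array([np.cos(goal_angle), np.sin(goal_angle)])
        self.max_step = max_step     # assumed per-dimension action bound
        self.pos = np.zeros(2)

    def reset(self):
        self.pos = np.zeros(2)
        return self.pos.copy()

    def step(self, action):
        action = np.clip(np.asarray(action, dtype=float),
                         -self.max_step, self.max_step)
        self.pos = self.pos + action
        # reward: negative distance to the (hidden) goal
        reward = -float(np.linalg.norm(self.pos - self.goal))
        return self.pos.copy(), reward

env = PointGoalEnv(goal_angle=0.0)   # goal at (10, 0)
env.reset()
state, reward = env.step([1.0, 0.0]) # move straight toward the goal
```

Each choice of `goal_angle` instantiates one MDP from the family; the agent never observes `self.goal` directly.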
Batch SAC We modify SAC to learn from the batch by initializing the replay buffer with existing transitions. Otherwise, training stays the same. We test Batch SAC on a simple setting where there is only one metatrain MDP and one metatest MDP which share the same goal location. This is the standard Batch RL setting and is a special case of BMRL. Batch SAC finds a nearoptimal policy.
Three meta-train and meta-test MDPs This experiment has 3 meta-train MDPs with different goal locations. The goals divide the semicircle into two segments of equal length. There are three meta-test MDPs whose goal locations coincide with the goal locations of the meta-train MDPs. This setting only tests the ability of the trained policies to correctly identify the meta-test MDPs and does not pose the challenge of generalization to unseen MDPs. Figure 2 (left, middle) illustrates that Batch SAC fails to learn a reasonable policy because of the MDP misidentification phenomenon. We also tried augmenting Batch SAC with a probabilistic MDP identity inference function as described in subsection 2.4, which also fails to train a policy that performs well on all 3 meta-test MDPs.
Performance of tiMe Since Batch SAC can extract the optimal value function out of the batch in the single meta-train MDP case, we use it as the Batch RL algorithm in the first phase of the tiMe pipeline. The architecture in the second phase thus consists of the super value function, the context encoder and the reward decoder. To pick an action, we randomly sample multiple actions and choose the action with the highest value as estimated by the super value function. This method is termed random shooting (Chua et al., 2018). As illustrated in Figure 2 (left, right), tiMe can infer the identities of the three meta-test MDPs and pick near-optimal actions.
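Random shooting needs only a handful of lines; the toy value estimate below is an illustrative stand-in for the learnt super value function.

```python
import numpy as np

def random_shooting(state, q_value, action_dim=2, n_samples=1000,
                    bound=1.0, seed=0):
    """Random-shooting action selection (Chua et al., 2018): sample actions
    uniformly within the action bound and keep the one with the highest
    estimated value."""
    rng = np.random.default_rng(seed)
    actions = rng.uniform(-bound, bound, size=(n_samples, action_dim))
    values = np.array([q_value(state, a) for a in actions])
    return actions[int(np.argmax(values))]

# toy value estimate peaking at action (0.5, 0.5)
best = random_shooting(None, lambda s, a: -np.sum((a - 0.5) ** 2))
```

With enough samples, the selected action lands close to the value maximizer without any gradient-based action optimization.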
5.2 Mujoco experiments
Environment Description This section tests tiMe on challenging continuous control robotic locomotion tasks. Each task requires applying control actions to a simulated robot so that it moves with a particular velocity in the direction of its initial heading. Formally, the MDPs within each family share the state space, action space, transition function and discount factor, and differ only in the reward function, which penalizes both the deviation of the robot's velocity from a target velocity and the magnitude of the control action, each weighted by a positive constant. A one-to-one correspondence exists between an MDP within the family and a target velocity. Defining a family of MDPs is equivalent to picking an interval of possible target velocities. This setting is instantiated on two types of simulated robots, hopper and half-cheetah, illustrated in Figure 3. Experiments are performed inside the Mujoco simulator (Todorov et al., 2012). The setting was first proposed by Finn et al. (2017).
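A hedged reconstruction of the reward just described: the exact functional form and the values of the two positive constants are assumptions, since the text only states that a velocity gap and a control cost are weighted by positive constants.

```python
import numpy as np

def velocity_reward(forward_vel, action, target_vel, c1=1.0, c2=0.05):
    """Assumed target-velocity reward: penalize the gap between the robot's
    forward velocity and the target, plus a quadratic control cost.
    c1 and c2 stand in for the unspecified positive constants."""
    gap = abs(forward_vel - target_vel)
    ctrl = float(np.sum(np.square(action)))
    return -c1 * gap - c2 * ctrl

r = velocity_reward(1.5, np.zeros(6), target_vel=1.5)  # on-target, zero control
```

Under this form, each `target_vel` induces a distinct reward function and hence a distinct MDP in the family.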
Zero-shot meta-test
During testing, in contrast to prior works, we do not update the parameters of the trained networks or allow for an initial exploratory phase whose episode returns do not count towards the final meta-test performance. This allows for testing the inherent generalizability of the trained networks without confounding factors. The meta-test MDPs are chosen such that they are unseen during meta-train, i.e. none of the transitions used during meta-train were sampled from any of the meta-test MDPs. At the beginning of each meta-test episode, the inferred MDP identity is initialized to a zero vector. Subsequent transitions collected during the episode are used as the context.
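The zero-shot protocol above amounts to the loop below; `env_step`, `policy` and `encoder` are illustrative stand-ins for the environment and the trained networks.

```python
import numpy as np

def zero_shot_episode(env_step, policy, encoder, horizon=5, embed_dim=4):
    """Zero-shot meta-test: the inferred MDP identity starts as a zero vector
    and is re-inferred from the transitions collected so far, with no
    parameter updates and no discarded exploratory episodes."""
    z = np.zeros(embed_dim)          # inferred identity starts at zero
    state = np.zeros(2)
    context, ret = [], 0.0
    for _ in range(horizon):
        action = policy(state, z)
        next_state, reward = env_step(state, action)
        context.append(np.concatenate([state, action, next_state, [reward]]))
        z = encoder(np.array(context))   # update the inferred identity
        state, ret = next_state, ret + reward
    return ret, z

ret, z = zero_shot_episode(
    env_step=lambda s, a: (s + a, -1.0),      # toy dynamics, constant reward
    policy=lambda s, z: np.full(2, 0.1),      # fixed small step
    encoder=lambda c: c.mean(axis=0)[:4],     # toy pooling encoder
)
```

Every step both acts and refines the identity estimate, so the return of the very first episode already counts toward meta-test performance.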
Meta-train conditions The target velocities of the meta-train MDPs divide the target velocity interval into equal segments. This removes the bias of sampling meta-train MDPs when evaluating performance. The episode lengths for hopper and half-cheetah are 1000 and 200, respectively.
Performance analysis Figure 3 illustrates tiMe's performance on unseen meta-test MDPs. tiMe is competitive with state-of-the-art model-free RL methods trained from scratch for one million and sixty thousand environment interactions in hopper and half-cheetah, respectively. We perform experiments on half-cheetah with an episode length of 200 because of computational constraints. Previous Meta RL works also use an episode length of 200 (Rakelly et al., 2019). The same network trained with tiMe also performs well at diverse points inside the support of the MDP distribution, demonstrating that it does not overfit to one particular meta-train MDP. We compare with SAC to demonstrate that BMRL is a promising research direction. We do not include other Meta RL algorithms as baselines because they would require interacting with the environment during meta-train and thus do not solve the problem that BMRL poses. We also tried removing the generator from the architecture in Figure 1 and picking actions with the Cross-Entropy Method (Rubinstein and Kroese, 2004), but that led to poor performance because the super value function overestimates the values of actions not generated by the generator.
6 Related Works
Supervised Learning and Imitation Learning
The main differences between Batch (Meta) RL and supervised learning are that actions have long-term consequences and that the actions in the batch are not assumed to be optimal. If they are optimal, in the sense that they were collected from an expert, Batch RL reduces to Imitation Learning (Abbeel and Ng, 2004; Ho and Ermon, 2016). In fact, Fujimoto et al. (2018a) demonstrate that Batch RL generalizes Imitation Learning in discrete MDPs.

Meta RL Equation 3 is the same objective that existing Meta RL algorithms optimize for (Wang et al., 2016a; Finn et al., 2017). We could have formulated our experimental setting as a Partially Observable MDP, but we chose to formulate it as Batch Meta Reinforcement Learning to ensure consistency with the literature that inspires this paper. The main difference between Meta RL and our formulation is access to the environment during training. Meta RL algorithms sample transitions from the environment during meta-train. We only have access to existing data during meta-train.
Context Inference Zintgraf et al. (2019) and Rakelly et al. (2019) propose learning inference modules that infer the MDP identity. Their procedures sample transitions from the MDP during meta-train, which differs from our motivation of learning from only existing data. Killian et al. (2017) infer the MDP's "hidden parameters", input the parameters to a learnt transition function to generate synthetic data, and train a policy from the synthetic data. Such model-based approaches are still outperformed by the best model-free methods (Wang et al., 2019), on which our method is based.
Batch RL Fujimoto et al. (2018a) and Agarwal et al. (2019) demonstrate that good policies can be learnt entirely from existing data in modern RL benchmarks. Our work extends their approaches to train policies from data generated by a family of MDPs. Li et al. (2004) select transitions from the batch based on an importance measure. They assume that for each state-action pair in the batch, its value under the optimal value function can be easily computed. We do not make such an assumption.
Factored MDPs In discrete MDPs, the number of possible states increases exponentially with the number of dimensions. Kearns and Koller (2000) tackle this problem by assuming each dimension of the next state is conditionally dependent on only a subset of the dimensions of the current state. In contrast, our method makes no such assumption and applies to both discrete and continuous settings.
Joint MDP The family of MDPs can be seen as a joint MDP with additional information in the state which differentiates states between the different MDPs (Parisotto et al., 2015). Sampling an initial state from the joint MDP is equivalent to sampling an MDP from the family of MDPs. However, without prior knowledge, it is unclear how to set the value of the additional information so that it supports generalization from the meta-train MDPs to the meta-test MDPs. In our approach, the additional information is the set of transitions collected from the MDP, and the network learns to infer the MDP identity from them.
7 Conclusion
We propose a new formalization of pretraining in RL as Batch Meta Reinforcement Learning (BMRL). BMRL differs from Batch RL in that the existing data comes from a family of related MDPs, which enables scalable data collection. BMRL also differs from Meta RL in that no environment interaction happens during meta-train. We identified two main challenges in BMRL: MDP identity inference and generalization to unseen MDPs. To tackle these challenges, we propose tiMe, a flexible and scalable training pipeline which jointly learns distillation of value functions and MDP embeddings. Experimentally, we demonstrate that tiMe obtains performance on unseen MDPs competitive with state-of-the-art model-free RL methods.
References

Abbeel, P. and Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-first International Conference on Machine Learning (ICML '04).
Agarwal, R., Schuurmans, D., and Norouzi, M. (2019). Striving for simplicity in off-policy deep reinforcement learning.
Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. (2016). Deep variational information bottleneck. CoRR abs/1612.00410.
Antos, A., Munos, R., and Szepesvári, C. (2007). Fitted Q-iteration in continuous action-space MDPs. In NIPS'07, pp. 9-16.
Arulkumaran, K., Cully, A., and Togelius, J. (2019). AlphaStar: an evolutionary computation perspective. arXiv:1902.01724.
Botvinick, M., Ritter, S., Wang, J. X., Kurth-Nelson, Z., Blundell, C., and Hassabis, D. (2019). Reinforcement learning, fast and slow. Trends in Cognitive Sciences 23.
Chua, K., Calandra, R., McAllister, R., and Levine, S. (2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. CoRR abs/1805.12114.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., and Abbeel, P. (2016). RL^2: fast reinforcement learning via slow reinforcement learning. CoRR abs/1611.02779.
Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. CoRR abs/1703.03400.
Fujimoto, S., Meger, D., and Precup, D. (2018a). Off-policy deep reinforcement learning without exploration. CoRR abs/1812.02900.
Fujimoto, S., van Hoof, H., and Meger, D. (2018b). Addressing function approximation error in actor-critic methods. CoRR abs/1802.09477.
Gupta, A., Mendonca, R., Liu, Y., Abbeel, P., and Levine, S. (2018). Meta-reinforcement learning of structured exploration strategies. CoRR abs/1802.07245.
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018a). Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR abs/1801.01290.
Haarnoja, T. et al. (2018b). Soft actor-critic algorithms and applications. CoRR abs/1812.05905.
Hasselt, H. van (2010). Double Q-learning. In Advances in Neural Information Processing Systems 23, pp. 2613-2621.
Hessel, M., van Hasselt, H., Modayil, J., and Silver, D. (2019). On inductive biases in deep reinforcement learning.
Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop.
Ho, J. and Ermon, S. (2016). Generative adversarial imitation learning. CoRR abs/1606.03476.
Hochreiter, S., Younger, A. S., and Conwell, P. R. (2001). Learning to learn using gradient descent. In Lecture Notes in Computer Science 2130, Proc. Intl. Conf. on Artificial Neural Networks (ICANN 2001), pp. 87-94.
Humplik, J., Galashov, A., Hasenclever, L., Ortega, P. A., Teh, Y. W., and Heess, N. (2019). Meta reinforcement learning as task inference. CoRR abs/1905.06424.
Jaderberg, M. et al. (2018). Human-level performance in first-person multiplayer games with population-based deep reinforcement learning. CoRR abs/1807.01281.
Kearns, M. and Koller, D. (2000). Efficient reinforcement learning in factored MDPs.
Killian, T. W., Daulton, S., Konidaris, G., and Doshi-Velez, F. (2017). Robust and efficient transfer learning with hidden parameter Markov decision processes. In Advances in Neural Information Processing Systems 30, pp. 6250-6261.
Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. CoRR abs/1312.6114.
Kingma, D. P. and Welling, M. (2019). An introduction to variational autoencoders. CoRR abs/1906.02691.
Lange, S., Gabel, T., and Riedmiller, M. (2012). Batch reinforcement learning. In Reinforcement Learning: State of the Art.
Lazaric, A., Restelli, M., and Bonarini, A. (2008). Transfer of samples in batch reinforcement learning. In ICML 2008, pp. 544-551.
Li, L., Bulitko, V., and Greiner, R. (2004). Batch reinforcement learning with state importance. In Machine Learning: ECML 2004, pp. 566-568.
Lillicrap, T. P. et al. (2016). Continuous control with deep reinforcement learning. In ICLR 2016.
Mahajan, D. et al. (2018). Exploring the limits of weakly supervised pretraining. CoRR abs/1805.00932.
Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529-533.
Nichol, A., Achiam, J., and Schulman, J. (2018). On first-order meta-learning algorithms. CoRR abs/1803.02999.
Parisotto, E., Ba, J., and Salakhutdinov, R. (2015). Actor-mimic: deep multitask and transfer reinforcement learning. CoRR abs/1511.06342.
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. 1st edition, John Wiley & Sons.
Qi, C. R., Su, H., Mo, K., and Guibas, L. J. (2016). PointNet: deep learning on point sets for 3D classification and segmentation. CoRR abs/1612.00593.
Rakelly, K., Zhou, A., Quillen, D., Finn, C., and Levine, S. (2019). Efficient off-policy meta-reinforcement learning via probabilistic context variables. CoRR abs/1903.08254.
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, pp. 1278-1286.
Rothfuss, J., Lee, D., Clavera, I., Asfour, T., and Abbeel, P. (2018). ProMP: proximal meta-policy search. CoRR abs/1810.06784.
Rubinstein, R. Y. and Kroese, D. P. (2004). The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation and Machine Learning. Springer-Verlag.
Russakovsky, O. et al. (2014). ImageNet large scale visual recognition challenge. CoRR abs/1409.0575.
Sutton, R. S. and Barto, A. G. (1998). Introduction to Reinforcement Learning. 1st edition, MIT Press.
Todorov, E., Erez, T., and Tassa, Y. (2012). MuJoCo: a physics engine for model-based control. In IROS 2012, pp. 5026-5033.
van Hasselt, H., Guez, A., and Silver, D. (2015). Deep reinforcement learning with double Q-learning. CoRR abs/1509.06461.
Vuong, Q., Zhang, Y., and Ross, K. W. (2018). Supervised policy update. CoRR abs/1805.11706.
Wang, J. X. et al. (2016a). Learning to reinforcement learn. CoRR abs/1611.05763.
Wang, J. X. et al. (2016b). Learning to reinforcement learn. CoRR abs/1611.05763.
Wang, T. et al. (2019). Benchmarking model-based reinforcement learning. CoRR abs/1907.02057.
Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning 8 (3-4), pp. 279-292.
Yang, Z. et al. (2019). XLNet: generalized autoregressive pretraining for language understanding. CoRR abs/1906.08237.
Zaheer, M. et al. (2017). Deep sets. CoRR abs/1703.06114.
Zambaldi, V. et al. (2019). Deep reinforcement learning with relational inductive biases. In International Conference on Learning Representations.
Zeiler, M. D. and Fergus, R. (2013). Visualizing and understanding convolutional networks. CoRR abs/1311.2901.
Zintgraf, L. et al. (2019). Fast context adaptation via meta-learning. In Proceedings of the 36th International Conference on Machine Learning, pp. 7693-7702.
Appendix A Appendix
A.1 MDP misidentification convergence proof
Statement: Performing Q-learning on finite $\mathcal{S}$ and $\mathcal{A}$, with all the $Q$ values initialized to $0$ and update rule (1), where $(s, a)$ is sampled uniformly at random from $\mathcal{S} \times \mathcal{A}$ at each step $t$, will lead the $Q$ values to converge to the optimal values of the MDP almost surely, as long as $0 \leq \alpha_t < 1$, $\sum_t \alpha_t = \infty$, and $\sum_t \alpha_t^2 < \infty$, where $\alpha_t$ denotes the learning rate at step $t$.
Proof.
First note that for any such $(s, a)$, the initial $Q$ value is already optimal and will never be updated; for all other $(s, a)$ and any step $t$, the pair is selected for an update with probability $\frac{1}{|\mathcal{S}||\mathcal{A}|} > 0$, so each such pair is updated infinitely often with probability $1$. Convergence then follows from the same argument as for the convergence of Q-learning (Watkins and Dayan, 1992). ∎
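As an illustration of the conditions in the statement, the following is a minimal sketch (not the paper's code) of tabular Q-learning in which the state-action pair to update is drawn uniformly at random and the step size decays as alpha = (1 + visits)^-0.6, which satisfies 0 <= alpha < 1, sum(alpha) = inf, and sum(alpha^2) < inf; the toy MDP and all names here are illustrative assumptions:

```python
import numpy as np

def uniform_q_learning(P, R, gamma, n_steps=200_000, seed=0):
    """Tabular Q-learning with uniformly sampled (s, a) updates.

    P[s, a] is a next-state distribution, R[s, a] a deterministic reward.
    Step sizes (1 + n)^-0.6 satisfy the Robbins-Monro conditions of the
    convergence statement above.
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))       # all Q values initialized to 0
    visits = np.zeros((n_states, n_actions))
    for _ in range(n_steps):
        s = rng.integers(n_states)            # (s, a) sampled uniformly
        a = rng.integers(n_actions)
        s_next = rng.choice(n_states, p=P[s, a])
        visits[s, a] += 1
        alpha = (1.0 + visits[s, a]) ** -0.6  # decaying learning rate
        target = R[s, a] + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

On a small MDP, the returned Q table can be checked against the fixed point computed by value iteration.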
Appendix B Hyperparameters
The small, medium and large target velocities in hopper correspond to . The small, medium and large velocities in halfcheetah correspond to . The learning rate is and the Adam optimizer is used in all experiments. All neural networks used are feedforward networks. All experiments are performed on machines with up to CPU cores and Nvidia GPU.
All experiments are performed in Python, with mujoco-py running on top of MuJoCo. All neural network operations are in PyTorch.

In the toy experiment, consists of hidden layers, each of size . The inferred MDP size is . The context size is . consists of hidden layers of size . consists of hidden layers of size and outputs values. Same goes for . Random shooting was performed with random actions at each iteration.
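Random shooting can be sketched as follows; this is a hedged illustration, not the paper's implementation, and the function names, the uniform action range, and the scoring function are assumptions:

```python
import numpy as np

def random_shooting(state, score_fn, n_candidates=1000, action_dim=1, rng=None):
    """Sample candidate actions uniformly, score each (e.g. with a value
    estimate under the inferred MDP), and return the highest-scoring one."""
    rng = rng if rng is not None else np.random.default_rng(0)
    actions = rng.uniform(-1.0, 1.0, size=(n_candidates, action_dim))
    scores = np.array([score_fn(state, a) for a in actions])
    return actions[np.argmax(scores)]
```

With enough candidates, the selected action is close to the maximizer of the scoring function over the sampled range.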
In hopper, the size of the inferred MDP is . The context size is . consists of hidden layers, each of size . consists of hidden layers, each of size , and outputs a vector of size . consists of hidden layers, each of size . consists of hidden layers, each of size . The training minibatch size is . When BCQ is run to extract the value function from the batch, the same hyperparameters as found in the official implementation (https://github.com/sfujim/BCQ) are used, except that the learning rate is lowered from to . The and in the reward function definition are and . The is .
In halfcheetah, the size of the inferred MDP is . The context size is . consists of hidden layers, each of size . consists of hidden layers, each of size , and outputs a vector of size . consists of hidden layers, each of size . consists of hidden layers, each of size . The training minibatch size is . consists of hidden layers, each of size . consists of hidden layers, each of size . When BCQ is run to extract the value function from the batch, unless otherwise mentioned, the same hyperparameters as found in the official implementation (https://github.com/sfujim/BCQ) are used. The learning rate is lowered from to . The perturbation model has hidden layers, of size . The critic also has hidden layers, of size . The and in the reward function definition are and . The is .
In both hopper and halfcheetah, except for the super Q function loss, the terms in the loss in Algorithm 1 are scaled so that they have the same magnitude as the super Q function loss. Graphs for the MuJoCo experiments are generated by smoothing over the last evaluation datapoints.
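The magnitude matching described above might be sketched as follows; this is a hypothetical helper, not the paper's code. In actual training the scale factors would be computed from detached (stop-gradient) loss values so that only the losses themselves, not the scales, contribute gradients:

```python
def scaled_total_loss(super_q_loss, aux_losses, eps=1e-8):
    """Rescale each auxiliary loss term so its magnitude matches that of
    the reference (super Q function) loss, then sum all terms."""
    ref = abs(super_q_loss)
    total = super_q_loss
    for loss in aux_losses:
        # scale factor chosen so |scale * loss| == |super_q_loss|
        total += (ref / (abs(loss) + eps)) * loss
    return total
```

After scaling, every term contributes roughly equally to the total, so no single auxiliary loss dominates the update direction.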
The performance on MuJoCo was averaged over seeds . The hyperparameters for SAC are the same as those found in the public PyTorch implementation (https://github.com/vitchyr/rlkit). The standard deviations are averaged over timesteps during evaluation. This corresponds to episodes in halfcheetah, because halfcheetah has no terminal-state termination, and to a variable number of episodes in hopper, because hopper does.