1 Introduction
The last decade has seen significant progress in reinforcement learning. The field has matured to a state where RL can solve challenging simulated motor control problems [22, 35] or games [40, 43] even from high-dimensional observations such as images [31, 24, 11]. Off-policy model-free algorithms have become workable for high-dimensional continuous action spaces [29, 23, 2], and have improved in robustness and data-efficiency, allowing experiments directly on robotics hardware [1, 19, 53, 25].
Model-free techniques directly learn a policy (and value function) from environment interactions. Their simplicity, generality and versatility have been a major factor behind their recent successes. Yet, these techniques do not entirely satisfy the intuition that a learning agent should be able to acquire approximate knowledge of the dynamics of its environment in a manner that is independent of any particular task, and that such knowledge should be more easily applicable to new tasks than task-specific objects such as policies. It is this intuition, as well as the desire to further improve the sample-efficiency and transferability of solutions, that has driven much of the recent research in model-based RL.
A growing body of literature is concerned with learning dynamics models for physical control problems [9, 10], including approaches that learn latent models from images [45, 20, 52]. Although some approaches excel in data efficiency [9, 28], in general, model-based methods have not yet achieved the robustness of model-free techniques, and they still struggle with model inaccuracies and long planning horizons. When learned dynamics models are combined with policy learning [41, 18, 48, 12, 47], the advantages over purely model-free techniques can be less clear.
In this work we develop an approach that uses model-based RL to learn a stochastic parametric policy in domains with high-dimensional observation spaces such as images. We build on Stochastic Value Gradients (SVG) [23], which allow us to compute low-variance policy gradients by directly differentiating a model-based value function. We extend this work in two ways: Firstly, we develop a latent state space model that allows us to predict expected future reward and value as a function of the current policy even when the true low-dimensional system state is not directly observed. Secondly, rather than using the model only for credit assignment along observed trajectories, we use it directly to produce “imagined” rollouts and show that stable policy learning can still be achieved.
We apply our algorithm to several challenging long-horizon vision-based manipulation tasks (e.g. lifting and stacking) in simulation and demonstrate the following: Our model-based approach (a) is as robust as competitive model-free baselines and achieves the same asymptotic performance, and (b) in several cases significantly improves data-efficiency. (c) It can effectively transfer the learned model to novel tasks with different reward distributions or visual distractors, leading to a dramatic gain in data-efficiency in such settings. (d) It is particularly effective in a multi-task setup, where the models learned on multiple tasks learn faster and generalize better to new tasks, successfully solving problems that cannot be learned in isolation.
2 Related Work
Model-free RL: Model-free RL has recently been applied to many challenging problems including robotics [8]. These successes were helped by advances in algorithms suitable for the use of powerful function approximators [31, 29, 39, 2, 40, 37], also enabling the use of RL in domains with high-dimensional observation spaces [31, 43, 37]. However, model-free methods can still struggle in the low-data regime, and their primary objects of interest, policies and value functions, are task-specific and can thus be difficult to transfer.
Model-based RL: On the other end of the spectrum, model-based methods learn a model of the environment and use a planner to generate actions. One of the successes in this area is work by Deisenroth et al. [9], who learn uncertainty-aware policies using Gaussian process models and show impressive data efficiency on multiple low-dimensional tasks. Using deep neural networks, model-based RL has been extended to vision-based problems such as Atari [24] and complex manipulation tasks [28, 10]. Other work has studied learned vision-based latent representations for RL [45, 44, 52, 46, 20]: Such low-dimensional representations can provide feature spaces in which learning and planning can proceed significantly faster compared to learning directly from images, although model inaccuracies, sparse rewards, and long horizons remain challenging.
Model-free + Model-based: Several papers have focused on combining the strengths of model-free and model-based RL. The work on Dyna [41] was among the first. It integrates an action model and model-based imagined rollouts with policy learning, interleaving planning, learning and execution in a tight loop. Recent work has further explored the use of imagined rollouts generated with learned models for accelerating learning of model-free policies [18, 48] – or indirectly speeding up learning via model-based value estimates [12, 5]. These approaches propose different means to handle model approximation errors, such as using short rollouts to avoid cascading model errors [12], uncertainty estimation through model ensembles [5] and using the rollouts as policy conditioning variables only [48]. Contrasting these methods, Stochastic Value Gradients (SVG) [23] re-evaluates rollouts with a learned model from off-policy data, accelerating learning through value gradients backpropagated through time via a learned model. In this work, we extend SVG to use imagined rollouts in latent spaces, accounting for model approximation errors via gradient averaging.
There has also been work on combining latent space models and model-free RL. DeepMDP [17], MERLIN [47], CRAR [15] and VPN [34] combine representation learning with policy optimization via auxiliary losses such as observation reconstruction [47], predicting the next latent state [17] or future values [34]. Along these lines, we learn a latent representation that can be used for predicting (expected) future latent states, observations, rewards and values conditioned on the observed actions.
Transfer: One argument for model-based RL is the potential of transfer to tasks in the same environment. Sutton et al. [41] discuss early examples of this type of transfer on simple problems. Recently, Francois et al. [15] showed encouraging results with a vision-based RL agent on 2D labyrinth tasks. Nagabandi et al. [32, 33] meta-learn predictive dynamics models to enable a model-based agent to rapidly adapt to changes in its environment. Other work on transfer in RL has focused on learning reusable skills, e.g. in the form of embedding spaces [21, 14, 42], successor representations [4], transferable priors [16] or meta-policies [13, 6]. As they learn “behaviors” rather than dynamics models, they are in a sense orthogonal and complementary to the ideas presented in this work.
3 Background
We are interested in solving motor control problems such as robotic manipulation tasks from vision. This setup can be formalized as a partially observed Markov decision process (a POMDP) with observations $o_t$, states $s_t$, actions $a_t$, transition probabilities $p(s_{t+1} \mid s_t, a_t)$, and reward function $r(s_t, a_t)$. Let $x_t = (o_1, a_1, \dots, o_{t-1}, a_{t-1}, o_t)$ be the sequence of observations and actions in a trajectory up to decision point $t$. The optimal policy and the value function at time step $t$ in this setting are a function of the posterior $p(s_t \mid x_t)$ of the system state given the interaction history. However, this posterior is usually intractable and many reinforcement learning approaches resort, instead, to directly optimizing a parametric policy that is a function of the history $x_t$, i.e. they consider policies of the form $\pi_\theta(a_t \mid x_t)$. Thus, they aim to maximize the sum of discounted rewards $\mathbb{E}_{p_\pi(\tau)}\big[\sum_{t} \gamma^{t-1} r(s_t, a_t)\big]$, with discount factor $\gamma \in [0, 1)$, where the trajectory distribution is assumed to decompose into transition and action probabilities as $p_\pi(\tau) = p(s_1) \prod_{t} \pi_\theta(a_t \mid x_t)\, p(s_{t+1} \mid s_t, a_t)$. This allows us to define the value of an observed trajectory prefix as
$$V^\pi(x_t) = \mathbb{E}_{p_\pi(\tau)}\Big[\sum_{k=t}^{T} \gamma^{k-t}\, r(s_k, a_k) \,\Big|\, x_t\Big] \tag{1}$$
where, below, we consider the infinite horizon case $T \to \infty$.
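To make the discounted sum of rewards concrete, the following minimal Python sketch computes the empirical return from every decision point of one recorded reward sequence. The function names are hypothetical and not taken from the paper, and a single trajectory only gives a one-sample Monte Carlo estimate of the expectation that defines the value.

```python
def prefix_returns(rewards, gamma):
    """Return G_t = sum_k gamma^(k-t) * rewards[k] for every step t.

    Accumulating backwards avoids recomputing the tail sum at each step.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def discounted_return(rewards, gamma):
    """Discounted return of the whole sequence (the first prefix return)."""
    return prefix_returns(rewards, gamma)[0] if rewards else 0.0


if __name__ == "__main__":
    print(prefix_returns([1.0, 1.0, 1.0], gamma=0.5))
```

Averaging such returns over many trajectories that share the same prefix would approximate the value of that prefix.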
4 An Action-Conditional Expectation Model of Observations and Rewards
Naively, a model-based evaluation of (1) would require an accurate action-conditional model of future observations and rewards given a partial trajectory $x_t$. Especially in high-dimensional observation spaces such a model can be difficult to learn. Instead we consider an approximate model suitable for weakly partially observed domains with limited stochasticity. We develop a latent state-space model whose latent state $h_t$ at time step $t$ is optimized to represent a sufficient statistic of the interaction history $x_t$. The model is trained to predict expectations of future observations and reward by approximately modeling the evolution of the summary statistic via a deterministic transition model. We express the policy as a function of the latent state, and the model allows us to construct a surrogate model that expresses the value as a recursive function of the policy, the deterministic transition function, and a learned approximation to the reward. This surrogate model can be used to compute approximate policy gradients.
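More specifically, we let $h_t = f_{\text{enc}}(o_{\le t}; \phi)$ be a deterministic mapping (with parameters $\phi$) extracting a summary statistic $h_t$ from a history of observations (we drop the dependency on actions here, assuming $h_t$ is retrievable from a history of observations only); and we let $h_{t+1} = f_{\text{trans}}(h_t, a_t; \phi)$ denote a latent transition function. The assumption that the latent dynamics are well described by a deterministic transition function is our primary simplification. We further define the approximate reward function based on latent states as $\hat{r}(h_t, a_t; \phi)$, which we calculate by chaining the transition model with a reward predictor. Finally, we denote with $\pi_\theta(a_t \mid h_t)$ a stochastic policy, parameterized by $\theta$. Using these definitions we construct an approximation of the expectation in Eq. (1) as $\tilde{V}^\pi(h_t) = \mathbb{E}_{a_k \sim \pi_\theta}\big[\sum_{k \ge t} \gamma^{k-t}\, \hat{r}(h_k, a_k)\big]$ with $h_{k+1} = f_{\text{trans}}(h_k, a_k)$, or, written recursively:
$$\tilde{V}^\pi(h_t) = \mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid h_t)}\Big[\hat{r}(h_t, a_t) + \gamma\, \tilde{V}^\pi\big(f_{\text{trans}}(h_t, a_t)\big)\Big] \tag{2}$$
where the initial latent state is given as $h_t = f_{\text{enc}}(o_{\le t}; \phi)$. In the following we will describe our approach in two steps: We first describe how to learn the action-conditional model for $f_{\text{enc}}$, $f_{\text{trans}}$, $\hat{r}$ and thus $\tilde{V}^\pi$ in Sec. 4.1. We then explain in Sec. 5 how the model can be used to optimize the policy.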
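The recursive form of the surrogate value can be illustrated with a small Python sketch. It assumes a scalar latent state and toy callables for the policy, transition and reward (all hypothetical stand-ins for the learned networks), and it truncates the infinite recursion at a fixed depth instead of bootstrapping with a learned value.

```python
import random


def surrogate_value(h, policy_sample, f_trans, r_hat, gamma, depth, n_samples, rng):
    """Sample estimate of V(h) = E_a[ r_hat(h, a) + gamma * V(f_trans(h, a)) ].

    The transition is deterministic, so the only randomness comes from the
    policy. The recursion is cut off after `depth` steps (value 0 beyond).
    """
    if depth == 0:
        return 0.0
    total = 0.0
    for _ in range(n_samples):
        a = policy_sample(h, rng)      # a ~ pi(. | h)
        h_next = f_trans(h, a)         # deterministic latent transition
        total += r_hat(h, a) + gamma * surrogate_value(
            h_next, policy_sample, f_trans, r_hat, gamma, depth - 1, n_samples, rng
        )
    return total / n_samples


if __name__ == "__main__":
    rng = random.Random(0)
    policy = lambda h, rng: -0.5 * h + 0.1 * rng.gauss(0.0, 1.0)  # toy Gaussian policy
    trans = lambda h, a: h + a                                    # toy dynamics
    reward = lambda h, a: -(h * h)                                # toy reward
    print(surrogate_value(1.0, policy, trans, reward, 0.9, depth=3, n_samples=4, rng=rng))
```

The cost of this naive estimator grows as `n_samples ** depth`; a single sampled action per step, averaged over many rollouts, is the cheaper alternative.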
4.1 Model Learning
We need to estimate all quantities comprising Equation (2); i.e., we need to estimate the following parametric functions (for brevity we use a single set of parameters for all model parameters):
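Encode: $h_t = f_{\text{enc}}(o_{\le t}; \phi)$
Transition: $h_{t+1} = f_{\text{trans}}(h_t, a_t; \phi)$
Decode: $\hat{o}_t = f_{\text{dec}}(h_t; \phi)$
Value: $\hat{V}^\pi(h_t; \phi)$
Reward: $\hat{r}(h_t, a_t; \phi)$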
The Encoder maps a history of observations to a summary statistic or latent state $h_t$ via a recurrent neural network. Recurrence allows us to handle partially observable settings (e.g. occlusion). The Transition function predicts the next latent state $h_{t+1}$ given the current state $h_t$ and action $a_t$, evolving dynamics in the low-dimensional latent space. The Decoder maps the latent $h_t$ back to an expected observation and is primarily used as self-supervision for training the encoder [47]. The Value function predicts the sum of expected rewards (the value) as a function of a latent $h_t$. Lastly, the Reward function predicts the immediate, expected reward for a given latent state-action pair. Please refer to Section E of the supplementary for additional details on the model architecture.
For the model-based value function in Eq. (2) to form a good approximation of the true value (and its gradient) we train the model on trajectories collected while interacting with the environment. The main approximation of our approach is the assumption that the evolution of the latent state is well modeled by a deterministic transition function. In partially observed and stochastic environments this is not guaranteed. For the relevant quantities to be well approximated despite this simplification we employ a number of losses that satisfy the following desiderata: i) we want to ensure that $h_t$ is a sufficient statistic of the history $x_t$ and the system thus Markov in $h_t$; ii) we want to minimize the discrepancy between the predicted and observed evolution of the latent state $h_t$; iii) given a latent state $h_t$ we want to accurately predict the expected reward and value (of policy $\pi_\theta$) at time step $t+k$ after executing some action sequence $a_t, \dots, a_{t+k-1}$.
Let denote a set of trajectory data collected while executing some behavior policy . Let us define the full model loss after an initial “burn-in” of steps (to ensure the encoder has sufficient information) as where we approximate the expectation wrt. with samples from ,
(3) 
with and where the per example loss is defined as . and are coefficients that determine the relative contribution of the loss components. The per example transition model loss is given as
(4) 
where the first term measures the error between the observations and reconstructions from the open-loop latent state predictions () and the second term enforces consistency between the latent state representation from the encoder and the predictions from the transition model ; this encourages the latent state to stay close to encodings of observed trajectories, thus addressing points i) and ii) above. Here is a coefficient that weights the two loss terms and the reward loss is
(5) 
where the value loss is given by the importance-weighted squared Bellman error
(6) 
where the next state value is calculated via a “target network”, whose parameters are periodically copied from , to stabilize training (see e.g. [31] for a discussion). In practice we use a V-trace target [11]; see discussion in the supplementary. Note that both loss terms are evaluated for predicted latent states with gradients flowing backwards through the transition model and eventually the encoder, which addresses point iii). We encourage the reader to consult Section C & Figure 1 in the supplementary material for additional details and a schematic.
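The structure of the per-example loss described above (open-loop latent prediction, reconstruction, encoder consistency, reward and value terms) can be sketched in plain Python. This is a rough illustration under stated assumptions, not the paper's exact objective: all terms are taken to be squared errors, the weights `beta`, `alpha_r` and `alpha_v` are hypothetical names, importance weighting and the V-trace target are omitted, and latents and observations are scalars.

```python
def model_loss(h0, actions, observations, rewards, enc_latents,
               f_trans, f_dec, r_hat, v_hat, v_target,
               gamma=0.99, beta=1.0, alpha_r=1.0, alpha_v=1.0):
    """Average per-step loss along an open-loop latent rollout starting at h0.

    observations[k], rewards[k] and enc_latents[k] are the observed data and
    the encoder output at step k; h is never reset to the encoder output.
    """
    h = h0
    total = 0.0
    for k, a in enumerate(actions):
        recon = (observations[k] - f_dec(h)) ** 2        # decode predicted latent
        consistency = (enc_latents[k] - h) ** 2          # stay close to the encoder
        reward = (rewards[k] - r_hat(h, a)) ** 2         # reward prediction error
        h_next = f_trans(h, a)                           # open-loop prediction
        td_target = rewards[k] + gamma * v_target(h_next)  # target-network bootstrap
        value = (td_target - v_hat(h)) ** 2              # squared Bellman error
        total += recon + beta * consistency + alpha_r * reward + alpha_v * value
        h = h_next
    return total / len(actions)


if __name__ == "__main__":
    zero = lambda *args: 0.0
    print(model_loss(0.0, [1.0, 1.0], [0.0, 1.0], [0.0, 0.0], [0.0, 1.0],
                     f_trans=lambda h, a: h + a, f_dec=lambda h: h,
                     r_hat=zero, v_hat=zero, v_target=zero))
```

Because the latent is propagated only through the transition function, errors at later steps penalize the transition model and, through the initial state, the encoder.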
5 Imagined Value Gradients in Latent Spaces
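Given a model, we optimize a parametric policy by maximizing the N-step surrogate value function, which is a recursive composition of the policy, the transition, reward and value function:
$$\tilde{V}_N(h_t) = \mathbb{E}_{a_k \sim \pi_\theta}\Big[\gamma^{N}\, \hat{V}^\pi(h_{t+N}; \phi) + \sum_{k=t}^{t+N-1} \gamma^{k-t}\, \hat{r}(h_k, a_k)\Big], \qquad h_{k+1} = f_{\text{trans}}(h_k, a_k). \tag{7}$$
This N-step value can be computed by performing an “imagined” rollout in the latent state-space using our model (see Fig. 1). It can be maximized by gradient ascent, exploiting the so-called “value gradient” [23], which can often be computed recursively, taking advantage of the reparameterization trick [27, 36] for sampling from $\pi_\theta$ and calculating analytic gradients via backpropagation through time. We can express the policy as a deterministic function $a_t = \pi_\theta(h_t, \epsilon)$ that transforms a sample $\epsilon$ from a canonical noise distribution into a sample from $\pi_\theta(\cdot \mid h_t)$. In the following we will consider Gaussian policy distributions, i.e. $\pi_\theta(\cdot \mid h_t) = \mathcal{N}\big(\mu_\theta(h_t), \sigma_\theta^2(h_t)\big)$, for which the reparameterization is given as $\pi_\theta(h_t, \epsilon) = \mu_\theta(h_t) + \sigma_\theta(h_t)\,\epsilon$, with $\epsilon \sim \mathcal{N}(0, I)$, where $I$ denotes the identity matrix. We can then calculate the gradient $\nabla_\theta \tilde{V}_N(h_t)$ for any state $h_t$ as
$$\nabla_\theta \tilde{V}_N(h_t) = \mathbb{E}_{\epsilon}\Big[\big(\nabla_a \hat{r} + \gamma\, \nabla_{h} \tilde{V}_{N-1}(h_{t+1})\, \nabla_a f_{\text{trans}}\big)\, \nabla_\theta \pi_\theta + \gamma\, \nabla_\theta \tilde{V}_{N-1}(h_{t+1})\Big] \tag{8}$$
where $a_t = \pi_\theta(h_t, \epsilon)$, $h_{t+1} = f_{\text{trans}}(h_t, a_t)$, and we dropped function arguments for brevity. The value gradient wrt. a state, $\nabla_h \tilde{V}_N(h)$, is defined recursively and provided in Eq. (3) in the supplementary material. The case $N = 0$ is established by assuming the policy is fixed in all steps after $t+N$; i.e. bootstrapping with $\tilde{V}_0(h) = \hat{V}^\pi(h; \phi)$. To calculate the gradient, only an initial state $h_t$ (encoded from $o_{\le t}$) is required (see Fig. 1). Eq. (8) computes policy gradient contributions for the encoded state $h_t$ as well as for the imagined states $h_{t+1}, \dots, h_{t+N-1}$. This ensures that the policy can be evaluated on either kind of latent state. Our derivation is analogous to SVG [23], but using imagined latent states and assuming a deterministic transition model.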
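The pathwise (reparameterized) value gradient of an imagined rollout can be checked on a toy problem. The sketch below is an illustration only: it assumes a scalar latent, a linear-Gaussian policy `a = theta * h + sigma * eps`, the toy transition `h' = h + a`, a quadratic reward, and a quadratic bootstrap value, none of which come from the paper. It accumulates derivatives forward along the rollout rather than by backpropagation; both compute the same derivative for fixed noise.

```python
import random


def rollout_value_and_grad(theta, h0, noise, sigma=0.1, gamma=0.9):
    """Value of one imagined rollout and its derivative wrt. theta.

    `noise` holds one eps per imagined step and is held fixed, so the value
    is a deterministic, differentiable function of theta.
    """
    h, dh = h0, 0.0                      # latent state and d h / d theta
    value, grad, discount = 0.0, 0.0, 1.0
    for eps in noise:
        a = theta * h + sigma * eps      # reparameterized action sample
        da = h + theta * dh              # d a / d theta via the chain rule
        value += discount * (-(h * h) - 0.1 * a * a)
        grad += discount * (-2.0 * h * dh - 0.2 * a * da)
        h, dh = h + a, dh + da           # toy transition h' = h + a
        discount *= gamma
    value += discount * (-(h * h))       # bootstrap with a toy value function
    grad += discount * (-2.0 * h * dh)
    return value, grad


def value_gradient(theta, h0, n_steps, n_samples, rng):
    """Monte Carlo average of the pathwise gradient over noise samples."""
    total = 0.0
    for _ in range(n_samples):
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_steps)]
        total += rollout_value_and_grad(theta, h0, noise)[1]
    return total / n_samples


if __name__ == "__main__":
    print(value_gradient(-0.4, 1.0, n_steps=5, n_samples=100, rng=random.Random(0)))
```

Comparing the returned gradient against a finite-difference estimate with the same noise is a cheap sanity check for any such implementation.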
5.1 Stable Regularized Policy Optimization
We make the following observations regarding the use of value gradients in practice: First, we can obtain different gradient estimators by varying $N$. For small $N$ we obtain a more biased value gradient – due to the reliance on bootstrapping – that changes slowly (i.e. it changes at the speed of convergence for $\hat{V}$). For large $N$, less bias and faster learning could be achieved if model and reward predictors are accurate, but the estimate can be affected by modelling errors. As a compromise, we found that averaging gradient estimates obtained with different-length rollouts worked well in practice. We refer to the supplementary for details.
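Averaging gradient estimates from rollouts of different lengths can be sketched as follows. The uniform weighting and the inclusion of every horizon from 0 up to the maximum are assumptions made for illustration; `grad_fn` is a hypothetical callable returning the gradient vector for a given horizon.

```python
def averaged_gradient(grad_fn, max_horizon):
    """Uniform average of grad_fn(n) over horizons n = 0, ..., max_horizon.

    grad_fn(n) must return a list of floats of fixed length (one entry per
    policy parameter).
    """
    grads = [grad_fn(n) for n in range(max_horizon + 1)]
    dim = len(grads[0])
    return [sum(g[i] for g in grads) / len(grads) for i in range(dim)]


if __name__ == "__main__":
    # Toy gradients: first component grows with the horizon, second is constant.
    print(averaged_gradient(lambda n: [float(n), 1.0], max_horizon=4))
```

Short horizons contribute low-variance but bootstrapped estimates, long horizons contribute estimates that rely more on the model; the average trades the two off.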
Second, even with this averaged gradient, optimization is prone to exploiting modelling errors (in the transition dynamics and reward/value estimates) – see e.g. [20] for a discussion. To counteract this, we employ relative-entropy (KL) regularization. Similar to existing policy optimization methods [39, 2, 19] we augment the estimated reward with a KL penalty, yielding where is the prior action probability (we use throughout) and is a cost multiplier. Replacing in Equation (8) with – noting that is differentiable wrt. – results in the regularized value gradient . Analogously, we obtain a compatible value by replacing with in Equation (6). The total derivative estimate we use then is
(9) 
where we use batches from the replay to optimize the policy on all visited states – using Eq. (9) in combination with Adam [26]. Please refer to Algorithm (1) in the supplementary for a description.
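A KL-penalized reward for a diagonal Gaussian policy can be written in closed form. The sketch below assumes a standard normal prior over actions and a multiplier named `lam`; both are illustrative choices rather than values confirmed by the text above.

```python
import math


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) between diagonal Gaussians given as per-dimension lists."""
    kl = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        kl += math.log(sp / sq) + (sq * sq + (mq - mp) ** 2) / (2.0 * sp * sp) - 0.5
    return kl


def regularized_reward(r_hat, mu, sigma, lam):
    """Predicted reward minus lam * KL(policy || N(0, I)) (assumed prior)."""
    prior_mu = [0.0] * len(mu)
    prior_sigma = [1.0] * len(mu)
    return r_hat - lam * kl_diag_gaussians(mu, sigma, prior_mu, prior_sigma)


if __name__ == "__main__":
    print(regularized_reward(1.0, mu=[1.0, 0.0], sigma=[1.0, 0.5], lam=0.1))
```

Since the penalty is a smooth function of the policy mean and standard deviation, it can be differentiated together with the reward in the value gradient.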
6 Experiments
We evaluate our approach on several challenging long-horizon manipulation tasks in simulation (see Sections D & F of the supplementary for details). Tasks involve the agent controlling a Sawyer manipulator equipped with a Robotiq gripper with a 5-dimensional control (= 5) to interact with a red and blue block on a tabletop. Observations are 64x64px RGB images from two cameras located on either side of the table, looking at the robot, and proprioceptive features. The latent representation is 128-dimensional (=128) and unless otherwise noted we use a history of observations and a rollout horizon of .
Task Setup: Fig. 2 presents visualizations of a subset of tasks from our experiments. We consider three main tasks: 1) the Lift task requires the robot to lift an object above a certain threshold; LiftR refers to lifting the red block and analogously for LiftB. 2) the Stack task requires the robot to stack one object on top of the other; StackR refers to stacking the red block on top of the blue block and vice versa for StackB. 3) Lastly, the Match Positions task involves moving both objects to a fixed target position. We also consider variants of these tasks with the addition of visual distractors and stochasticity. It is worth noting that all our tasks involve long-term dependencies and complicated contact dynamics, making them particularly challenging for model-based approaches. We use shaped rewards for all tasks except the Match Positions task, which has a mixed dense-sparse reward.
Baselines: We consider the following pixel-based baselines for our experiments: 1) SVG(0): the model-free version of SVG. As our approach (termed Imagined Value Gradients – IVG from here on out) builds on SVG, we expect to improve on SVG(0). 2) MPO: Maximum a Posteriori Policy Optimisation [2], a state-of-the-art model-free approach. To obtain an upper bound on performance we also include a version of MPO with access to the full system state (incl. objects). For the transfer experiments we also experiment with variants where we replace the value-gradient-based optimization. In particular we use CEM: the cross-entropy method [38] using the same model as IVG for transfer (latent rollouts). PG: replacing the value gradients with a likelihood ratio estimator (using 100 imagined rollouts), again using the same model as used for the IVG transfer. Additional details on the baselines are given in the appendix.
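For readers unfamiliar with the CEM baseline, the following Python sketch shows a generic cross-entropy method planner over action sequences evaluated with latent rollouts. It is not the configuration used in the experiments: the population size, elite count, iteration count, scalar actions and undiscounted return are all assumptions, and `f_trans` and `r_hat` are toy stand-ins for the learned model.

```python
import random


def rollout_return(h0, actions, f_trans, r_hat):
    """Undiscounted return of an action sequence under the latent model."""
    h, total = h0, 0.0
    for a in actions:
        total += r_hat(h, a)
        h = f_trans(h, a)
    return total


def cem_plan(h0, f_trans, r_hat, horizon, rng, iters=5, pop=64, n_elite=8):
    """Refit a diagonal Gaussian over action sequences to the elite samples."""
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iters):
        scored = []
        for _ in range(pop):
            seq = [rng.gauss(m, s) for m, s in zip(mean, std)]
            scored.append((rollout_return(h0, seq, f_trans, r_hat), seq))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        elite = [seq for _, seq in scored[:n_elite]]
        mean = [sum(s[i] for s in elite) / n_elite for i in range(horizon)]
        std = [max(1e-3, (sum((s[i] - mean[i]) ** 2 for s in elite) / n_elite) ** 0.5)
               for i in range(horizon)]
    return mean


if __name__ == "__main__":
    trans = lambda h, a: h + a
    reward = lambda h, a: -((h + a - 1.0) ** 2)   # reach latent value 1 and stay
    print(cem_plan(0.0, trans, reward, horizon=2, rng=random.Random(0)))
```

Unlike a learned policy, this planner optimizes directly against the model at every decision and has no regularization toward previously visited behavior.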
6.1 Learning from scratch
We first compare IVG and the baselines when learning the LiftR and StackR manipulation tasks from scratch: Model-based IVG(5) learns the simpler lift task stably and performs on par with the baselines (Fig. 3, left). On the harder stack task (Fig. 3, right), IVG(5) learns significantly (about 2x) faster than both MPO and SVG(0), and also outperforms SVG(0) in terms of final performance. Compared to the informed MPO (State) baseline, IVG learns more slowly, but this difference is significantly reduced for stacking. Even when learning from pixels, the structure inherent in IVG via the latent space rollouts allows it to learn complex tasks faster than strong model-free baselines.
We also tested the ability of IVG to handle slightly more stochastic and partially observed environments. Fig.3 (right) presents results on learning the StackR task in environments with: 1) delayed proprioception (2 timestep lag) and 2) noisy observations where one of the blocks switches colors randomly every 3 frames. IVG(5) successfully learns in both cases (albeit more slowly). MPO also solved these tasks but required approximately more episodes than IVG (color switch not shown).
6.2 Transferring learned models to related tasks
IVG learns a model of the environment which we may be able to transfer, and thus accelerate learning of related tasks. In the following, we evaluate this possibility. First, we train IVG from scratch on one or multiple source task(s). Second, we copy the weights of the trained model (encoder, transition model and decoder). The policy, value and reward models are not transferred and are learned from scratch. As a model-free baseline we include MPO where we initialize all weights of the policy & value function except the last layer from an agent trained on a source task.
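The transfer protocol described above amounts to a selective parameter copy. The sketch below assumes, purely for illustration, that an agent's parameters are stored in a dictionary keyed by component name; the key names and the dictionary layout are hypothetical.

```python
import copy

# Components whose weights are carried over from the source task.
TRANSFERRED = ("encoder", "transition", "decoder")


def init_target_agent(source_params, fresh_params):
    """Build target-task parameters: model parts copied, the rest re-initialized.

    `fresh_params` holds newly initialized weights for every component; the
    policy, value and reward entries are taken from it unchanged.
    """
    target = copy.deepcopy(fresh_params)
    for name in TRANSFERRED:
        target[name] = copy.deepcopy(source_params[name])
    return target


if __name__ == "__main__":
    source = {"encoder": [1.0], "transition": [2.0], "decoder": [3.0],
              "policy": [4.0], "value": [5.0], "reward": [6.0]}
    fresh = {name: [0.0] for name in source}
    print(init_target_agent(source, fresh))
```

Deep copies keep the source agent untouched, so the same pretrained model can seed several target tasks.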
Multiple source tasks: A model trained on multiple tasks should transfer better due to the diversity in transitions observed. To test this hypothesis, we propose a version of IVG that is trained on multiple source tasks – learning a task-agnostic model. We use the following source tasks: ReachB, MoveB, LiftB, StackB and ReachR (most tasks involve the blue object but ReachR gives the model some experience with the red block). Details on this setup are provided in the supplementary.
Transfer results: We present results on transferring IVG models to the following target tasks:
1) LiftR: Fig.4 (left) shows the transfer performance of IVG(5) on the LiftR task. With a model prelearned on StackB, IVG(5) learns 2x faster than from scratch and 4x faster than MPO (irrespective of MPO’s initialization). A model pretrained on StackR accelerates IVG(5) further since the model has already observed many relevant transitions, achieving speed comparable to MPO (state). Replacing the learned policy with direct optimization (CEM) or using a policy gradient yields suboptimal behavior. These results highlight the benefits of a model when transferring to related tasks, even in cases where the model has only observed a subset of relevant transitions.
2) StackR: Results with transferred models for StackR are similar to those for LiftR (Fig. 4, center). But here, CEM fails to perform the stack (possibly caused by overly exploiting the model due to missing policy regularization), and using PG instead of a value gradient takes significantly longer to converge (15k trajectories) and performs worse, likely due to the noise in the likelihood ratio calculation – even though we already used 100 forward rollouts for the likelihood ratio calculation (1 for IVG), a 100-fold increase in computation. In addition, we also compared the performance when transferring a model trained on multiple source tasks (Multitask). This multi-task variant significantly accelerates learning speed; it is 1.5x faster than transferring from a single task and about faster than learning from scratch. As we will see from subsequent results, models trained on multiple tasks greatly accelerate transfer.
3) Match Positions: We tested model generalization on the Match Positions task, which differs significantly from the source tasks (Fig. 4, right). All agents except multi-task IVG fail on this task; this includes transferring IVG from a single task (StackB) and the pixel-based MPO. This is likely due to the sparse structure of the reward (see supplementary). This shows how multi-task training enables the formation of robust and expressive latent spaces that transfer well to new tasks.
4) Visual distractors: Lastly, we analyzed the generalization of IVG when visual distractors in the form of a yellow cube or ball are added to the scene (Fig.5, (left & center)). Transfer from one or multiple source tasks remains effective. Transferring a multitask model significantly outperforms all other methods, learning 3x faster than from scratch and 4x faster than MPO. Thus, even though CNNs are known to be sensitive to changes in visual inputs, jointly training a predictive model with the policy can still lead to robust representations that transfer quickly.
6.3 Ablation experiments
To validate our algorithm and model design choices we performed two sets of ablations.
Rollout horizon: To evaluate the effect of the rollout horizon $N$ on the policy gradient (Eq. (8)) we tested IVG across multiple settings of $N = 0, 1, 5, 20$ for learning StackR from scratch (Fig. 5, right). IVG is robust to the choice of the rollout horizon and learns stably for all settings of $N$. However, the choice of $N$ has a marked effect on the speed of learning. Increasing $N$ speeds up learning up to a point after which no additional speedup is obtained (compare ). Importantly, these results suggest that the benefit of the model in IVG is not just one of representation learning in partially observed domains but that using the model for policy optimization is beneficial.
Averaging value gradients: To quantify the effect of averaging value gradients across multiple rollout horizons (cf. Eq. (9)), we ran experiments using a single imagined rollout of length $N$ (Fig. 5, right). For short horizons, e.g. $N = 5$, using a single rollout horizon leads to lower learning speed. On the other hand, learning completely fails for longer horizons such as $N = 20$ (unlike with averaging), potentially due to cascading model errors on imagined rollouts, validating our averaging approach.
7 Conclusions
We presented an approach for model-based RL where an action-conditional latent space model is trained jointly with policy, value and reward functions that operate on the learned latent space. To achieve efficient policy optimization we introduced Imagined Value Gradients (IVG), an extension of SVG (using imagined rollouts and N-step horizon averaging). We demonstrated that IVG can learn complex, long-horizon manipulation tasks like lifting and stacking. We further demonstrated in several transfer experiments on related tasks that transferring a model learned via IVG can significantly improve data efficiency compared to off-policy baselines. Crucially, transferring with models trained on multiple tasks further accelerates learning, even succeeding on tasks where single-task transfer fails. We feel that our approach is a promising first step towards designing RL methods that combine learning of closed-loop policies with the generalization capabilities that learned approximate models can provide – although extensions (such as handling egocentric cameras, increasing sample efficiency) are needed to make our approach a fully general-purpose solution for real-world robotics tasks.
The authors would like to thank the entire Control Team and many others at DeepMind for numerous discussions on this work. Special thanks go to Tuomas Haarnoja and Raia Hadsell for reviewing an early version of this work and to Hannah Kirkwood for help in organizing the internship.
References
 [1] (2018) Relative entropy regularized policy iteration. arXiv preprint arXiv:1812.02256. Cited by: §1.
 [2] (2018) Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920. Cited by: Appendix B, §F.4, §1, §2, §5.1, §6.
 [3] (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: Appendix E, §F.4.
 [4] (2017) Successor features for transfer in reinforcement learning. In NeurIPS 30, Cited by: §2.
 [5] (2018) Sampleefficient reinforcement learning with stochastic ensemble value expansion. In NeurIPS, Cited by: §2.
 [6] (2018) Modelbased reinforcement learning via metapolicy optimization. In CoRL, Cited by: §2.
 [7] (2015) Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289. Cited by: §F.2.
 [8] (2013) A survey on policy search for robotics. Foundations and Trends® in Robotics 2 (1–2), pp. 1–142. Cited by: §2.
 [9] (2011) PILCO: a modelbased and dataefficient approach to policy search. In ICML, Cited by: §1, §2.
 [10] (2018) Visual foresight: modelbased deep reinforcement learning for visionbased robotic control. CoRR abs/1812.00568. Cited by: §1, §2.
 [11] (2018) Impala: scalable distributed deeprl with importance weighted actorlearner architectures. In ICML, Cited by: Appendix C, §1, §4.1.
 [12] (2018) Modelbased value estimation for efficient modelfree reinforcement learning. CoRR abs/1803.00101. Cited by: §1, §2.
 [13] (2017) Modelagnostic metalearning for fast adaptation of deep networks. In ICML, Cited by: §2.
 [14] (2017) Stochastic neural networks for hierarchical reinforcement learning. In ICLR, Cited by: §2.
 [15] (2018) Combined reinforcement learning via abstract representations. arXiv preprint arXiv:1809.04506. Cited by: §2, §2.
 [16] (2019) Information asymmetry in KLregularized RL. In ICLR, Cited by: §2.
 [17] (2019) DeepMDP: learning continuous latent space models for representation learning. ICML. Cited by: §2.
 [18] (2016) Continuous deep qlearning with modelbased acceleration. In ICML, Cited by: §1, §2.
 [19] (2018) Soft actorcritic algorithms and applications. arXiv preprint arXiv:1812.05905. Cited by: Appendix B, §1, §5.1.
 [20] (2019) Learning latent dynamics for planning from pixels. In ICML, Cited by: Appendix B, §1, §2, §5.1.
 [21] (2018) Learning an embedding space for transferable robot skills. In ICLR, Cited by: §2.
 [22] (2017) Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286. Cited by: §1.
 [23] (2015) Learning continuous control policies by stochastic value gradients. In NeurIPS, Cited by: Appendix A, Appendix A, §1, §1, §2, §5.
 [24] (2019) Modelbased reinforcement learning for atari. CoRR abs/1903.00374. Cited by: §1, §2.
 [25] (2018) Qtopt: scalable deep reinforcement learning for visionbased robotic manipulation. In CoRL, Cited by: §1.
 [26] (2014) Adam: a method for stochastic optimization. In ICLR, Cited by: Appendix B, §F.2, §5.1.
 [27] (2013) Autoencoding variational bayes. In ICLR, Cited by: Appendix A, §5.

 [28] (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373. Cited by: §1, §2.
 [29] (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §1, §2.
 [30] (2016) Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning (ICML), Cited by: §F.5.
 [31] (2015) Humanlevel control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: Appendix C, §1, §2, §4.1.
 [32] (2018) Learning to adapt in dynamic, realworld environments through metareinforcement learning. arXiv preprint arXiv:1803.11347. Cited by: §2.
 [33] (2018) Deep online learning via metalearning: continual adaptation for modelbased RL. CoRR abs/1812.07671. Cited by: §2.
 [34] (2017) Value prediction network. In NeurIPS, Cited by: §2.
 [35] (2018) Learning dexterous inhand manipulation. CoRR abs/1808.00177. Cited by: §1.
 [36] (2014) Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082. Cited by: Appendix A, §5.
 [37] (2018) Learning by playingsolving sparse reward tasks from scratch. In ICML, Cited by: §D.1, §2.

 [38] (2004) The cross entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation (information science and statistics). Springer-Verlag, Berlin, Heidelberg. External Links: ISBN 038721240X Cited by: §F.5, §6.
 [39] (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: Appendix B, §2, §5.1.
 [40] (2018) A general reinforcement learning algorithm that masters chess, shogi, & go through selfplay. Science 362 (6419). Cited by: §1, §2.
 [41] (1990) Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, Cited by: §1, §2, §2.
 [42] (2019) Exploiting hierarchy for learning and transfer in klregularized RL. CoRR abs/1903.07438. Cited by: §2.
 [43] (2019) AlphaStar: Mastering the RealTime Strategy Game StarCraft II. Cited by: §1, §2.
 [44] (2015) From pixels to torques: policy learning with deep dynamical models. arXiv preprint arXiv:1502.02251. Cited by: §2.
 [45] (2015) Embed to control: a locally linear latent dynamics model for control from raw images. In NeurIPS, Cited by: §1, §2.
 [46] (2019) COBRA: dataefficient modelbased rl through unsupervised object discovery and curiositydriven exploration. CoRR abs/1905.09275. Cited by: §2.
 [47] (2018) Unsupervised predictive memory in a goaldirected agent. arXiv preprint arXiv:1803.10760. Cited by: §1, §2, §4.1.
 [48] (2017) Imaginationaugmented agents for deep reinforcement learning. CoRR abs/1707.06203. Cited by: §1, §2.
 [49] (1992) Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. Machine Learning. Cited by: §F.5.
 [50] (2017) The devil is in the decoder. arXiv preprint arXiv:1707.05847. Cited by: Appendix E.
 [51] (2016) Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: Appendix E.
 [52] (2018) SOLAR: deep structured latent representations for modelbased reinforcement learning. CoRR abs/1808.09105. Cited by: §1, §2.
 [53] (2019) Dexterous manipulation with deep reinforcement learning: efficient, general, and lowcost. In ICRA, Cited by: §1.
Appendix A Additional details on Value gradient derivation
Given a model, we optimize a parametric policy by maximizing the N-step surrogate value function, which is a recursive composition of the policy, transition, reward and value functions:
(10) 
This N-step value can be computed by performing an “imagined” rollout in the latent state-space using our model (see Figure (1) in the main text). It can be maximized by gradient ascent, exploiting the so-called “value gradient” [23], which can often be computed recursively. Taking advantage of the reparameterization trick [27, 36] for sampling from the policy, we can recursively define a sample estimate of this gradient. We start by defining the deterministic function that transforms a sample from a canonical noise distribution into a sample from the policy. In the following we will consider Gaussian policy distributions, i.e. $\pi_\theta(a \mid h) = \mathcal{N}(\mu_\theta(h), \sigma_\theta^2(h) I)$ for a latent state $h$, for which the reparameterization is given as $a = \mu_\theta(h) + \sigma_\theta(h) \epsilon$, with $\epsilon \sim \mathcal{N}(0, I)$, where $I$ denotes the identity matrix. Using these definitions we can define the gradient for any state as
(11) 
where and we dropped dependencies of all functions on for brevity. The partial value gradient wrt. a state is defined recursively as
(12) 
where the case is established by assuming that the policy does not change in any step after ; i.e. bootstrapping with . We note that, to calculate these gradients, only an initial state (encoded from a history of observations ) is required in addition to the learned model. Our derivation here is thus analogous to the N-step stochastic value gradient definition from [23], but with observed states replaced by imagined latent states and with a deterministic transition model assumed.
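The imagined rollout described in this appendix can be illustrated with a small sketch. It assumes a scalar latent state, a discount factor `gamma`, and hand-written toy functions for the policy, transition, reward and value; none of these names or functional forms come from the paper, and the actual model components are neural networks. The sketch only shows how a reparameterized Gaussian action and a deterministic transition compose into an N-step value estimate that is bootstrapped with a value function.

```python
import math
import random


def sample_action(mean, log_std, eps):
    """Reparameterized Gaussian sample: a = mean + exp(log_std) * eps."""
    return mean + math.exp(log_std) * eps


def n_step_value(h0, policy, transition, reward, value, n, gamma, noise):
    """Discounted N-step return of an imagined latent rollout, bootstrapped with value()."""
    h, total = h0, 0.0
    for k in range(n):
        mean, log_std = policy(h)
        a = sample_action(mean, log_std, noise[k])
        total += (gamma ** k) * reward(h, a)
        h = transition(h, a)  # deterministic latent transition
    return total + (gamma ** n) * value(h)


# Hypothetical scalar toy model, chosen only so the rollout can be checked by hand.
policy = lambda h: (-0.5 * h, math.log(0.1))
transition = lambda h, a: h + a
reward = lambda h, a: -(h * h)
value = lambda h: -2.0 * h * h

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(5)]
v5 = n_step_value(1.0, policy, transition, reward, value, n=5, gamma=0.99, noise=noise)
```

Because every step is a differentiable function of the policy outputs and the fixed noise, an automatic differentiation framework could differentiate such a rollout with respect to the policy parameters; the pure-Python version above only evaluates the forward pass.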
Appendix B Additional details on Regularized Policy Optimization
Given the definition of the value gradient in the previous section we can make a few interesting observations relevant to its use in practice.
First, we can realize that, in principle, the single-step gradient estimate (and hence a model trained by minimizing ) is sufficient for performing policy optimization. However, in this case we would obtain a biased value gradient after one gradient step in the direction of (since at that point ), and the gradient only becomes unbiased again once the dynamic programming updates from (16) have converged. To counteract this bias we could consider using the N-step gradient for large N. Such an estimate is not affected by the bias described above for the first N steps (since the equivalence of and is only assumed for steps after time ). As a result, it facilitates faster learning (as also demonstrated in our experiments). A downside of this approach is that it can be more heavily affected by modelling errors; i.e. a latent state-space model predicting rewards N steps into the future is harder to learn than a 1-step model. As a compromise, trading off bias against modelling errors, we found that using a simple average gradient estimate over horizons worked well in practice. This averaging linearly down-weights the contributions from states further along the trajectory: the first state appears N times in the sum, the second state N-1 times, and so on. Compared to a discount-based weighting, which decays slowly, this can drastically reduce the effect of model errors later in the sequence. We note that, in principle, we could also use weighting terms based on the variance of the different horizon estimates, but opted for an average for simplicity here.
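The linear down-weighting can be made concrete with a short counting sketch. It assumes the average is taken over horizons 1 to N and, for clarity, ignores the discount and bootstrap terms inside each estimate; the function names and the value `gamma=0.99` are illustrative choices, not taken from the paper.

```python
def horizon_average_weights(n_max):
    """Effective weight of the step-k term when averaging the 1..n_max-step estimates."""
    counts = [0] * n_max
    for n in range(1, n_max + 1):  # the n-step estimate covers steps 0..n-1
        for k in range(n):
            counts[k] += 1
    return [c / n_max for c in counts]


def discount_weights(n_max, gamma):
    """Weights of a purely discount-based scheme, for comparison."""
    return [gamma ** k for k in range(n_max)]


avg_w = horizon_average_weights(5)
disc_w = discount_weights(5, gamma=0.99)
for k, (a, d) in enumerate(zip(avg_w, disc_w)):
    print(f"step {k}: averaged weight {a:.2f}, discount weight {d:.3f}")
```

For N = 5 the averaged weights fall linearly from 1 to 1/5, whereas the discount weights with this `gamma` stay above 0.96, which is the contrast the paragraph above describes.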
Second, even with the averaged model gradient from above, gradient-based optimization is prone to exploiting modelling errors (in both the transition dynamics and reward/value estimates), yielding overly optimistic policies. This is a well-known problem in model-based RL; see e.g. [20] for a recent discussion. To counteract such effects it is hence desirable to further regularize the policy optimization step. Similar to many existing policy optimization methods [39, 2, 19] we adopt a relative-entropy (KL) regularization scheme. We augment the estimated reward with a sample-based likelihood ratio term (a sample-based estimate of the KL)
(13) 
where is the prior action probability (we use throughout) and is a multiplier trading off reward and regularization. Replacing in Equation (11) with , and noting that is differentiable with respect to the policy parameters, results in the regularized value gradient
(14)  
where is, analogously, given by inserting into Equation (12). To ensure that the bootstrap value for is compatible with this regularized reward, we additionally change the loss for the value function (the Bellman error) by, again, replacing with in Equation (16). The total derivative estimate we use in practice is then given as
(15) 
where we use batches of samples from the replay buffer to optimize the policy on all visited states. That is, we perform stochastic gradient ascent, combining the gradient from Equation (15) with any optimization method (we use Adam [26]). The full optimization procedure is also described in Algorithm 1.
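The reward augmentation used for regularization can be sketched as follows. The sketch assumes scalar Gaussian densities for both the policy and the prior, and a penalty of the form reward minus a multiplier times the sampled log ratio; the prior, the sign convention and all names (`alpha`, `regularized_reward`) are assumptions made for illustration, since the exact form of Equation (13) is not reproduced here.

```python
import math


def gaussian_log_prob(a, mean, std):
    """Log density of a scalar Gaussian N(mean, std^2) evaluated at a."""
    return -0.5 * ((a - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def regularized_reward(r, a, policy_mean, policy_std, prior_mean, prior_std, alpha):
    """Reward minus alpha times a one-sample estimate of KL(policy || prior)."""
    log_ratio = (gaussian_log_prob(a, policy_mean, policy_std)
                 - gaussian_log_prob(a, prior_mean, prior_std))
    return r - alpha * log_ratio


# Hypothetical numbers: the sampled action is more likely under the policy than the prior.
r_kl = regularized_reward(r=1.0, a=0.0, policy_mean=0.0, policy_std=1.0,
                          prior_mean=0.0, prior_std=2.0, alpha=0.5)
```

When the policy and the prior coincide the log ratio vanishes and the reward is unchanged, so the penalty only acts once the policy moves away from the prior.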
Appendix C Details for the Value Learning step
As described in the main paper, the value loss involves the calculation of a (squared) Bellman error, which is given by
(16) 
where the next state value is calculated via a “target network”, whose parameters are periodically copied from , to stabilize training (see e.g. [31] for a discussion). In practice we use V-trace [11] to calculate a better target value. That is, we set , where the V-trace target is given as:
(17) 
with being the temporal difference error multiplied by importance weight with denoting the behaviour policy and we set . We refer to Espeholt et al. [11] for additional details regarding V-trace.
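For readers unfamiliar with V-trace, the following sketch computes targets with the backward recursion given by Espeholt et al. [11]. The truncation levels `rho_bar` and `c_bar` default to 1.0 here as an assumption; the settings used in our experiments are not restated in this sketch, and the function and argument names are illustrative.

```python
def vtrace_targets(rewards, values, bootstrap_value, ratios, gamma,
                   rho_bar=1.0, c_bar=1.0):
    """V-trace targets for one trajectory.

    rewards[s], values[s] and ratios[s] (policy / behaviour probability ratio)
    refer to step s; bootstrap_value is the value estimate after the last step.
    """
    n = len(rewards)
    next_values = values[1:] + [bootstrap_value]
    targets = [0.0] * n
    acc = 0.0  # holds v_{s+1} - V(x_{s+1}) while iterating backwards
    for s in reversed(range(n)):
        rho = min(rho_bar, ratios[s])
        c = min(c_bar, ratios[s])
        delta = rho * (rewards[s] + gamma * next_values[s] - values[s])
        acc = delta + gamma * c * acc
        targets[s] = values[s] + acc
    return targets


# On-policy data (all ratios equal to 1) reduces to the n-step Bellman target.
on_policy = vtrace_targets([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 10.0,
                           [1.0, 1.0, 1.0], gamma=0.5)
```

With all ratios equal to one the recursion returns the ordinary n-step bootstrapped return, and with all ratios equal to zero it returns the current value estimates unchanged.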
Appendix D Details for the Experimental Setup
We used the MuJoCo simulator (see www.mujoco.org) for simulating the Sawyer robot setup. The robot is equipped with a two-finger Robotiq gripper. We ran the simulation with a numerical time step of 10 milliseconds, integrating 5 steps, to get a control interval of 50 milliseconds for the agent. In this way we can resolve all important properties of the robot arm and the object interactions in simulation. All the objects used were based on wooden toy blocks and balls. For the majority of our experiments we used two cubic blocks with side lengths of 5 cm, colored red and blue. For the distractor experiments, we used a yellow cubic block of size 6 cm and a yellow ball of diameter 4 cm. We used a table with sides of 60 cm x 30 cm in length as the workspace of the robot in all our experiments. Objects were spawned randomly on the table surface. The robot hand is initialized randomly above the tabletop with a height offset of up to 20 cm above the table (minimum 10 cm) and the fingers in an open configuration. All experiments use episodes of 200 steps (which gives a total simulated time of 10 seconds per episode).
The Sawyer robot is controlled via a 5D position controller based on an inverse kinematics model. That is, the agent outputs 5D velocity commands: three for the robot end effector position, one for the gripper and one for rotating the gripper to change its orientation. The action space for all our tasks is therefore 5-dimensional.
Table 1 (left): proprioceptive observations.
Entry                   Dimensions    Unit
arm joint pos           7             rad
arm joint vel           7             rad / s
finger joint pos        1             rad
finger joint vel        1             rad / s
finger grasp            1             Binary

Table 1 (right): visual observations.
Entry                   Dimensions    Unit
camera 1 (front-left)   3 x 64 x 64   RGB
camera 2 (front-right)  3 x 64 x 64   RGB

Table 2: object features.
Entry                   Dimensions    Unit
object i pose           7             m, au
object i velocity       6             m/s, dq/dt
object i relative pos   3             m
Table 1 (left) shows the list of proprioception observations we use for all our experiments. These observations are concatenated to produce a 17-dimensional vector which is used as input to our model. In addition to proprioception, we use RGB images from two cameras located to the left and right of the table (in front of the table, pointing towards the robot) as visual observations (see table 1, right). These two images (3x64x64 each) are concatenated along the channel dimension to generate a 6x64x64 input visual observation to our model. It is worth noting that the availability of two camera views helps disambiguate most occlusions. We will explore switching to a single central view in future experiments; this significantly increases occlusions and makes the tasks strongly partially observable. For the baseline state-based MPO experiment, we used features of the objects (see table 2) in addition to proprioceptive features as inputs to the policy. Fig.13 shows example images from a learned policy on the IVG task; both camera views used for training are shown.
d.1 Tasks and Rewards
We used shaped rewards for specifying all our tasks as we wanted to measure transfer efficiency rather than the capability to handle sparse reward settings. In principle, the multitask version of IVG can be extended to the sparse reward setting easily, in a manner similar to SAC [37]. Below, we discuss the reward setup for the three major tasks considered in the paper: the Lift, Stack and Match Positions tasks. The addition of a visual distractor does not change the reward setup for a given task.
Lift: In the lift task, the agent has to pick up either the red (LiftR) or the blue (LiftB) object on the table and lift it above a certain height. We introduce additional shaping to the task through auxiliary rewards that encourage reaching the target object, grasping it and lifting it once grasped. These are specified in turn as:

REACH(O): Minimize the distance of the TCP to the target cube.
GRASP: Activate the grasp sensor of the gripper (“inward grasp signal” of the Robotiq gripper).
HEIGHT(O, x): Increase the z coordinate of an object by more than x relative to the table.

where is the Euclidean distance between a pair of 3D points, and the tolerance and linear reward function terms are defined as:
(18) 
(19) 
The final reward is a weighted sum of all these subrewards:
which, overall, cannot exceed a value of two.
Stack: Similar to the lift task, there are two variants of the stack task: 1) StackR, where the agent has to stack the red block on the blue block and 2) StackB, where the agent does the opposite. We again introduce shaping by first encouraging the agent to lift the object – the lift reward is a part of the reward for the stack task. Additionally, once the object has been lifted we encourage the agent to move towards the target, align it with the target block and release the grasped object. The total reward is:
(20) 
where:
where is determined by the grasp sensor and:
Match Positions: In this task, the agent has to move both the red and blue blocks to a fixed target position (see Fig.7). As this task involves moving both objects it is a nice setting for testing the generalization of our learned models. Additionally, the reward is not shaped to encourage motion towards an object; there is no change in the reward unless one of the objects is moved. We specify the reward as:
(21) 
where denote the target 3D positions of the red and blue block respectively.
d.2 Partially observable environments
For the experiments using more partially observable environments we considered two settings. In the first instance we added a delay to the proprioceptive features in the StackR task (see IVG(5) (delayed proprio) in Figure 3 in the main paper). This results in an environment where the RNN has to perform integration to estimate the robot arm position (and its velocities).
In the second experiment we created a variant of the StackR task in which the red block (that needs to be lifted and placed on top of the blue block) changes color (switching from blue to red at random every 2 frames).
d.3 Multitask setup: Tasks and Rewards
In the multitask setup we introduce several auxiliary tasks that are solved in addition to a main extrinsic task. We consider the following tasks and rewards in all our multitask experiments:
where the move reward is given by where is the object’s velocity. To train in a multitask setup we use a task-conditioned policy, value and reward function; we refer to the section below for details. The learned model (i.e. ), on the other hand, is not conditioned on the task; it hence has to learn consistent dynamics across tasks. The actors generate data by selecting one task per episode at random.
We note that even though the multitask setup largely consists of the same rewards as the individual experiments, the data distribution is very different, as the model is trained on episodes from all tasks. As can be seen in the experiments in the main paper, this results in significant improvements in the transfer learning experiments.
Appendix E Details on the Model & Policy
We present some details on the architecture of the model components and policy network below:
The encoder uses a recurrent, deterministic, convolutional neural network (CNN) to encode the observations to a low-dimensional latent state representation (see Fig.8). Our observation is a pair of RGB images (3x64x64 each), concatenated along the first dimension, and a proprioception vector. The images are passed through a CNN with an initial convolutional layer followed by three residual blocks with strided convolutions [51] and average pooling to generate a vector of outputs. In parallel, the proprioception input is passed through a 2-layer multilayer perceptron (MLP) to generate a feature vector. These are concatenated and passed through a 3-layer MLP and an LSTM which outputs the latent state. As an initial preprocessing step, we normalized all our images to lie between 0 and 1 and proprioception to lie between -1 and 1.
The transition model is deterministic, taking a latent state and action to predict the next latent state (see Fig.9). Both inputs are first passed through 2-layer MLPs. The outputs of these MLPs are concatenated and passed through another 2-layer MLP which predicts the change in latent state. To ensure that the transition model outputs are well conditioned for long rollouts, we pass this change through a tanh layer to normalize the result to lie between -1 and 1. This is further scaled by a linear transform and added to the input state to generate the prediction. In practice, we saw a significant improvement in performance when predicting the change in state as opposed to directly predicting the next state.
The decoder predicts the (expected) input observation from the latent state (see Fig.10). We have two parts to the decoder: 1) To reconstruct the proprioception input, we pass the latent state through a 2-layer MLP. 2) For reconstructing the images, we first use a linear layer to transform the latent state to a 4096-dimensional vector which is reshaped into a 64x8x8 feature tensor. This feature tensor is passed through three upsampling layers, each using a bilinear additive upsampling layer [50] followed by a convolution; the output is at the same resolution as the input images. Finally, the output features are passed through a 1x1 convolution layer to get the correct number of channels and a sigmoid layer to ensure that the outputs are normalized. We also experimented with using a deconvolutional architecture for the image upsampling but found that it reduced the reconstruction quality.
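The residual form of the transition model described above can be sketched in a few lines. The sketch assumes that the final linear transform is a fixed per-dimension scale and takes the raw MLP output as given; both are simplifications made for illustration, and the names `transition_step`, `delta_raw` and `scale` are hypothetical.

```python
import math


def transition_step(h, delta_raw, scale):
    """Next latent state: input state plus a tanh-bounded, linearly scaled change."""
    return [h_i + s_i * math.tanh(d_i) for h_i, d_i, s_i in zip(h, delta_raw, scale)]


# Even with a very large raw prediction, each step moves the state by at most `scale`.
h = [0.0, 0.0]
scale = [0.1, 0.1]
for _ in range(50):
    h = transition_step(h, delta_raw=[100.0, -100.0], scale=scale)
```

Bounding the per-step change in this way is one reason a long imagined rollout cannot blow up within a single step, which matches the motivation given for the tanh layer.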
The value and reward modules predict the (expected) value and reward from a given state (see Fig.11). Both these modules are implemented as 3-layer MLPs, with a layer norm [3] after the output of the first layer.
Lastly, the policy network predicts a distribution over actions from the corresponding latent state. As mentioned earlier, we consider Gaussian policy distributions, i.e. $\pi_\theta(a \mid h) = \mathcal{N}(\mu_\theta(h), \sigma_\theta^2(h) I)$ for a latent state $h$, from which we can sample through the reparameterization trick as $a = \mu_\theta(h) + \sigma_\theta(h) \epsilon$, with $\epsilon \sim \mathcal{N}(0, I)$, where $I$ denotes the identity matrix. We implement the policy network as a 3-layer MLP similar to the value and reward modules. Unlike those, the policy outputs the mean and log standard deviation from which we can sample an action using the reparameterization shown above.
e.1 Multitask learning
We introduce a few additional changes to the network architecture of some model components and the policy in the multitask learning setup. In this setting, each task has a unique task ID associated with it; this is represented as a one-hot vector whose length is the number of tasks. This task ID is fed as an additional input to both the policy and value modules, thereby conditioning their predictions on the task that is currently being considered. The reward predictor, on the other hand, now predicts the rewards for all these tasks; its output now has one dimension per task, as opposed to a scalar before. This further encourages the latent state to capture features relevant to all the learned tasks, leading to better generalization performance as witnessed in our experiments. Lastly, the architectures of the encoder, decoder and transition model are unchanged; these components are task agnostic and can integrate and transfer knowledge across tasks.
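A minimal sketch of the task conditioning described above, assuming that the task ID is simply concatenated to the latent state before it enters the policy or value network, and that the reward head returns one entry per task. The concatenation and all function names are illustrative assumptions rather than a description of the exact implementation.

```python
def one_hot(task_index, num_tasks):
    """One-hot task ID of length num_tasks."""
    if not 0 <= task_index < num_tasks:
        raise ValueError("task_index out of range")
    return [1.0 if i == task_index else 0.0 for i in range(num_tasks)]


def conditioned_input(latent, task_index, num_tasks):
    """Input to a task-conditioned policy or value network (assumed concatenation)."""
    return list(latent) + one_hot(task_index, num_tasks)


def select_task_reward(all_task_rewards, task_index):
    """Pick the current task's entry from a reward head with one output per task."""
    return all_task_rewards[task_index]


x = conditioned_input([0.2, 0.4], task_index=2, num_tasks=3)
r = select_task_reward([0.1, 0.7, 0.3], task_index=1)
```

The encoder, transition model and decoder would receive only the latent state and never the task ID, which is what keeps them task agnostic.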
Appendix F Details on training and transfer setup
We implemented all our models in Python using the TensorFlow neural network package. Below, we present some details on the loss functions used for training and the hyperparameter settings.
f.1 Model Loss
We defined the per-example model loss as . Here and are coefficients that determine the relative contribution of the loss components. As explained in the main text, we use a squared error term for the reward loss and a squared error to a V-trace target for the value loss. The per-example transition model loss is given as
(22) 
where the first term measures the error between the observations and the reconstructions from the open-loop latent state predictions, the second term enforces consistency between the latent states predicted by the encoder and the transition model, and is a coefficient that determines the relative contribution of the two loss terms. The reconstruction loss is split into two parts (weighted equally), an image reconstruction loss and a proprioception reconstruction loss. We use a squared error term for the proprioception loss and a binary cross entropy loss term for image reconstruction; in practice we found this to result in better image reconstructions than a squared error term.
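The two-part reconstruction loss can be sketched as below. The sketch assumes flattened inputs, mean reduction over elements and a small clipping constant `eps` for numerical safety; these choices, and the function names, are assumptions made for illustration.

```python
import math


def binary_cross_entropy(targets, preds, eps=1e-7):
    """Mean binary cross entropy between pixel targets and sigmoid outputs in [0, 1]."""
    total = 0.0
    for t, p in zip(targets, preds):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(targets)


def squared_error(targets, preds):
    """Mean squared error, used here for the proprioception reconstruction."""
    return sum((t - p) ** 2 for t, p in zip(targets, preds)) / len(targets)


def reconstruction_loss(image, image_pred, proprio, proprio_pred):
    """Equally weighted sum of the image and proprioception reconstruction terms."""
    return binary_cross_entropy(image, image_pred) + squared_error(proprio, proprio_pred)


loss = reconstruction_loss([1.0, 0.0], [0.5, 0.5], [1.0, 2.0], [1.0, 4.0])
```

Binary cross entropy is well defined here because the images are normalized to lie between 0 and 1 and the decoder ends in a sigmoid layer.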
f.2 Hyperparameters
We used Adam [26] with default settings and a fixed learning rate of 5e-5 for all our experiments. We used the ELU [7] nonlinearity as the activation function in all our networks. We initialized the final layers of our policy to predict values close to zero at the start of training; we found that this improved stability, especially in the early stages of learning. We used a latent state dimension of for all our experiments (). We found this to be low-dimensional enough to be used for fast RL while still allowing room for expressivity. We found that setting gave good results and kept this setting throughout all experiments. For the policy optimization, we set the weight of the KL regularizer to based on a hyperparameter sweep.
We used a batch size of (two learners each with a batch size of ) to train our model and policy in all our experiments; we initially experimented with larger batch sizes of but found that lowering the batch size made learning more stable. We fixed the history length and experimented with different rollout lengths ; as shown in our experiments, performed best, so we use that as the default. We ran experiments for a fixed number of episodes (per actor).
f.3 Actor data generation
We used asynchronous actors for data generation in all our experiments. At the start of each episode, the actor retrieves the most recent model and policy parameters. It then executes this policy for a fixed time horizon (each episode lasts 10 seconds). The resulting trajectory is split up into smaller sub-trajectories of length , the length needed for learning, and added to a central replay buffer which collects experience from all actors. We used a buffer containing up to sequences (randomly deleting old sequences when full) for all our experiments. Both our learners sample from this replay buffer, compute the gradients for the model components and policy and perform synchronized updates to the parameters.
For the multitask experiments, at the start of each episode, the actor chooses a task to execute at random out of the available tasks. This task is executed for the full length of the episode (10 seconds). Random sampling of tasks can help generate diverse trajectories for training, facilitating learning of an expressive latent representation.
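The data flow from actors to learners can be sketched with a small in-memory buffer. The sketch assumes non-overlapping sub-trajectories (dropping a short remainder), uniform random eviction when the buffer is full, and uniform sampling with replacement; the class and function names, the sub-trajectory length of 10 and the capacity of 3 are hypothetical and chosen only to keep the example small.

```python
import random


def split_trajectory(trajectory, length):
    """Non-overlapping sub-trajectories of exactly `length` steps; a short remainder is dropped."""
    return [trajectory[i:i + length]
            for i in range(0, len(trajectory) - length + 1, length)]


class ReplayBuffer:
    """Bounded buffer that evicts a uniformly random stored sequence when full."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.rng = random.Random(seed)

    def add(self, sequence):
        if len(self.items) >= self.capacity:
            self.items.pop(self.rng.randrange(len(self.items)))
        self.items.append(sequence)

    def sample(self, batch_size):
        return [self.rng.choice(self.items) for _ in range(batch_size)]


episode = list(range(200))  # stand-in for one 200-step episode
chunks = split_trajectory(episode, length=10)
buffer = ReplayBuffer(capacity=3)
for chunk in chunks:
    buffer.add(chunk)
batch = buffer.sample(batch_size=4)
```

Random eviction keeps a mixture of older and newer experience in the buffer, in contrast to a first-in-first-out queue which would always discard the oldest sequences.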
f.4 Baseline parameters
For the baseline experiments that used pixel observations we constructed policy and value networks that are equivalent in architecture to applying after , similar to the networks in our IVG approach (to ensure a fair comparison).
For the state-based baselines we concatenated the true object positions, velocities and orientations (see table 2) to the proprioceptive robot features (see table 1, left). This is fed as input to 3-layer MLP policy networks (ELU activations, 200 hidden units each, layer normalization [3] after the first layer) and 3-layer MLP Q-value networks (ELU activations, 300 units each, layer normalization [3] after the first layer) which additionally take the actions (concatenated to the other features) as input. To train SVG(0) we used the same relative entropy regularization technique as for our method (using ). For MPO we used the hyperparameters from [2], which performed well across all our tests. We tuned the learning rate for both MPO and SVG(0) for performance; a rate of 1e-4 worked best.
To ensure a fair comparison between algorithms in an asynchronous setting, we ensured all algorithms ran at the same frequency of learning steps per second (which we set to 10).
f.5 Additional Baselines CEM and PG
CEM
To demonstrate the value of a parametric policy we ablate it and combine a model pretrained with our approach with an implementation of the cross-entropy method (CEM). We bootstrap with a learned value estimate as in IVG (learning this value function in the transfer learning setting as in IVG) and perform latent rollouts. In particular, during training on a transfer task we replace the parametric policy with an optimization-based approach using the cross-entropy method [38]. We use CEM both for computing actions in the value function learning step and when interacting with the system. In either case the length of the rollouts for CEM is set to 5 and we use 100 trajectories in each optimization step (repeating for a total of ten steps of optimization).
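A compact sketch of cross-entropy-method planning with the settings quoted above (rollout length 5, 100 trajectories, ten optimization steps) is given below. It assumes a scalar latent state and action, a diagonal Gaussian sampling distribution, an elite set of the 10 best trajectories, a discount of 0.99 and a hand-written toy model; these are all illustrative assumptions, not the configuration of the baseline.

```python
import random
import statistics


def rollout_return(h0, actions, transition, reward, value, gamma):
    """Discounted return of a latent rollout, bootstrapped with a value estimate."""
    h, total = h0, 0.0
    for k, a in enumerate(actions):
        total += (gamma ** k) * reward(h, a)
        h = transition(h, a)
    return total + (gamma ** len(actions)) * value(h)


def cem_plan(h0, transition, reward, value, horizon=5, num_samples=100,
             iterations=10, num_elite=10, gamma=0.99, seed=0):
    """Refit a Gaussian over action sequences to the best-scoring samples."""
    rng = random.Random(seed)
    mean, std = [0.0] * horizon, [1.0] * horizon
    for _ in range(iterations):
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(num_samples)]
        samples.sort(key=lambda acts: rollout_return(h0, acts, transition, reward, value, gamma),
                     reverse=True)
        elites = samples[:num_elite]
        for k in range(horizon):
            column = [e[k] for e in elites]
            mean[k] = statistics.mean(column)
            std[k] = statistics.pstdev(column) + 1e-6  # keep the distribution non-degenerate
    return mean


# Hypothetical toy model: drive the scalar state towards zero with a small action cost.
transition = lambda h, a: h + a
reward = lambda h, a: -(h * h) - 0.01 * a * a
value = lambda h: -(h * h)

plan = cem_plan(1.0, transition, reward, value)
first_action = plan[0]
```

When interacting with the system, only the first action of the plan would typically be executed before replanning from the next latent state.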
PG
For the likelihood ratio policy gradient baseline we use the exact same model and policy structure as for IVG. The only difference is in the calculation of the gradient of the state-action value (with respect to the policy parameters). In particular, we replace the value gradient from Equation (8) with a likelihood ratio calculation [49, 30] (using 100 rollouts for the gradient estimation).
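To illustrate the difference from the value gradient, the following sketch estimates the gradient of an expected return with respect to the mean of a scalar Gaussian policy using the likelihood ratio (score function) estimator. The mean-return baseline, the one-step toy return and the function names are assumptions made for illustration; only the default of 100 rollouts is taken from the text.

```python
import random


def likelihood_ratio_gradient(f, mean, std, num_rollouts=100, rng=None):
    """Score-function estimate of d/d(mean) E[f(a)] for a ~ N(mean, std^2)."""
    if rng is None:
        rng = random.Random(0)
    actions = [rng.gauss(mean, std) for _ in range(num_rollouts)]
    returns = [f(a) for a in actions]
    baseline = sum(returns) / num_rollouts  # simple variance-reduction baseline
    grad = 0.0
    for a, g in zip(actions, returns):
        score = (a - mean) / (std * std)  # d/d(mean) of log N(a; mean, std^2)
        grad += (g - baseline) * score
    return grad / num_rollouts


# Hypothetical one-step return with its maximum at a = 2.
toy_return = lambda a: -(a - 2.0) ** 2
grad_estimate = likelihood_ratio_gradient(toy_return, mean=0.0, std=1.0)
```

For this toy return the exact gradient at mean 0 is 4; the estimate from 100 rollouts is noisy, which is the usual motivation for preferring a reparameterized gradient when a differentiable model is available.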
f.6 Transfer experiment setup
We use the following three-step procedure for transferring our models to new tasks:

1. We first train the entire system from scratch (IVG) on a source task, or on a set of source tasks in the multitask setting. From the trained modules, we choose the following model components: the encoder , transition model , and decoder . Only these components are transferred to the target task.

2. We initialize the parameters of the encoder, transition model and decoder using the pretrained networks. The parameters of the policy, value and reward functions are initialized to their default values; we train these from scratch.

3. We train IVG in the usual fashion on the target task. An important point to note is that the encoder, transition and decoder networks are fine-tuned on the target task; they just have a significantly better initialization (that is generalizable to the target task). This is the primary contribution to our increase in learning speed in the transfer setting, particularly when using a model that has been trained on multiple source tasks.
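The parameter initialization in the transfer procedure above can be sketched with plain dictionaries. The module names used as keys and the helper `init_target_params` are hypothetical, and the parameter values are placeholders; in practice these would be network weights restored from a checkpoint.

```python
import copy

TRANSFERRED = ("encoder", "transition", "decoder")
REINITIALIZED = ("policy", "value", "reward")


def init_target_params(source_params, fresh_params):
    """Copy model components from the source run; start task-specific parts from scratch."""
    target = {}
    for name in TRANSFERRED:
        target[name] = copy.deepcopy(source_params[name])
    for name in REINITIALIZED:
        target[name] = copy.deepcopy(fresh_params[name])
    return target


# Placeholder parameters: 1.0 marks pretrained values, 0.0 marks fresh initial values.
source = {name: [1.0, 1.0] for name in TRANSFERRED + REINITIALIZED}
fresh = {name: [0.0, 0.0] for name in TRANSFERRED + REINITIALIZED}
target = init_target_params(source, fresh)
```

All six components remain trainable after this initialization, which reflects the fine-tuning of the encoder, transition model and decoder on the target task.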
Appendix G Additional Experimental results
We present a few additional experimental results in this section.
g.1 Reconstructions
Fig.13 shows a 45step open loop rollout from an IVG(5) model, trained on the Multitask setting, from scratch. Even when tested on significantly longer sequences than it was trained on, the model predictions remain consistent, capturing salient d