Dynamics-aware Embeddings

by William Whitney, et al.
NYU college

In this paper we consider self-supervised representation learning to improve sample efficiency in reinforcement learning (RL). We propose a forward prediction objective for simultaneously learning embeddings of states and actions. These embeddings capture the structure of the environment's dynamics, enabling efficient policy learning. We demonstrate that our action embeddings alone improve the sample efficiency and peak performance of model-free RL on control from low-dimensional states. By combining state and action embeddings, we achieve efficient learning of high-quality policies on goal-conditioned continuous control from pixel observations in only 1-2 million environment steps.



1 Introduction

In recent years, there has been a lot of excitement around end-to-end model-free reinforcement learning for control, both in simulation (Lillicrap et al., 2015; Andrychowicz et al., 2018; Haarnoja et al., 2018b; Fujimoto et al., 2018) and on real hardware (Kalashnikov et al., 2018; Haarnoja et al., 2018d). In this paradigm, we simultaneously learn intermediate representations and policies by maximizing rewards provided by the environment. End-to-end learning has one indisputable advantage: since every component of the system is optimized for the end objective, there are no sub-optimal modules that limit best-case performance by losing task-relevant information.

Learning only from the target task is, however, a double-edged sword. When the end objective provides only a weak learning signal, a policy with a poor representation may require many samples to learn a better one. By contrast, a policy with a good representation may be able to rapidly fit a simple function of that representation even with a weak signal.

Figure 1: A 1D environment. The agent (blue dot) can move continuously left and right to reach the goal (gold star).

Consider the environment shown in Figure 1, and two representations of its state: coordinates and pixels. As a function of the agent’s coordinate, the value function is simple and smooth. The coordinate representation has structure which is useful for learning about the task; namely, points which are close in distance have similar values. By contrast, a pixel representation of the agent’s state is practically a one-hot vector. Two states whose coordinates differ by one unit have pixel representations exactly as different as two states whose coordinates differ by 100 units. This illustrates the importance of good representations and the potential of representation learning to aid RL.
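To make the contrast concrete, the following minimal sketch compares Euclidean distances between states under the two representations. It assumes a hypothetical 1D track discretized into 200 positions and simplifies the pixel image to a one-hot vector; the track width and all function names are illustrative assumptions, not part of the paper.

```python
import math

WIDTH = 200  # assumed number of pixel positions on the 1D track


def coordinate_repr(x):
    """Represent the state by the agent's scalar coordinate."""
    return [float(x)]


def pixel_repr(x, width=WIDTH):
    """Simplified pixel image: a one-hot vector marking the agent's position."""
    return [1.0 if i == x else 0.0 for i in range(width)]


def l2(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# One pair of states 1 unit apart, one pair 100 units apart.
coord_near = l2(coordinate_repr(50), coordinate_repr(51))
coord_far = l2(coordinate_repr(50), coordinate_repr(150))
pixel_near = l2(pixel_repr(50), pixel_repr(51))
pixel_far = l2(pixel_repr(50), pixel_repr(150))
```

In coordinates the two pairs are at different distances, while in the one-hot representation every pair of distinct states is equally far apart, so distance carries no information about how similar the states' values are.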

In this work, we consider the problem of self-supervised representation learning for reinforcement learning. Our key insight is that the difference between two states or two actions should be measured by the difference in their effects on the environment. Our contributions are as follows:

  1. We describe a set of goals that representations in RL should aim to achieve, providing a framework for analyzing a representation’s strengths and weaknesses.

  2. We construct a representation learning objective that captures the structure of the dynamics. This objective, called Dynamics-aware Embedding or DynE, yields embeddings where nearby states and actions have similar outcomes.

  3. We show that this single objective greatly simplifies learning from pixels and enables faster exploration through temporally abstract actions.

We demonstrate the effectiveness of our representation learning objective by training the Twin Delayed Deep Deterministic policy gradient algorithm (TD3) (Fujimoto et al., 2018) with learned action and state spaces. With a learned representation of temporally abstract actions, our method exhibits improved sample efficiency compared to state-of-the-art RL methods on control tasks.

When additionally combined with our learned state representation, our method allows TD3 to scale to pixel observations. We demonstrate good performance on a simple family of goal-conditioned 2D control tasks within a few million environment steps and without adjusting any hyperparameters. This stands in contrast to end-to-end model-free RL from pixels, which requires extensive tuning (Lillicrap et al., 2015) and on the order of 100 million environment steps (Barth-Maron et al., 2018).¹

¹ Number of steps required to train D4PG taken from Hafner et al. (2018), as Barth-Maron et al. (2018) does not include this information.

2 Dynamics-aware embeddings

2.1 Notation

We consider the framework of reinforcement learning in Markov decision processes (MDPs).² We denote the state of an environment (e.g. joint angles of a robot or pixels) by , and we assume that the states given by the environment satisfy the Markov property. We refer to a sequence of actions using the shorthand . We use to refer to the environment’s (stochastic) transition function, and overload it to accept sequences of actions: .

² In the interest of space we omit the usual recap of Markov decision processes and reinforcement learning. We refer the reader to Section 2 of Silver et al. (2014) for notation and background on MDPs.

2.2 Model and learning objective

Figure 2: Computational architecture for training the DynE encoders and . The encoders are trained to minimize the information content of the learned embeddings while still allowing the predictor to make accurate predictions.

In most practical environments, everything an agent needs to know about a state or an action is captured by its outcome. This suggests that any good representation of a state and an action sequence should form the sufficient statistics of the distribution of outcomes .

Our method, which we call Dynamics-aware Embedding (DynE), learns encoders and which embed a state and action sequence into latent spaces and respectively. These encodings are optimized to form a maximally compressed representation of the sufficient statistics of such that . We approximate this objective by maximizing a variational lower bound on :


where and and is the distribution of transitions under a behavior policy .

A variational autoencoder (VAE)

(Kingma and Welling, 2013; Rezende et al., 2014) chooses the variational family to be . We instead use a factored latent space and independent posterior approximations given the previous state and the action:
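One way to write such a factorization, together with a β-weighted objective of the kind described above, is sketched below. The notation is an assumption made for illustration and may differ from the paper's own: s_t is the state, a_t^k a sequence of k actions starting at time t, z_s and z_a the state and action latents, q the approximate posteriors, p(z_s) and p(z_a) fixed priors, and β a weight on the compression term.

```latex
\begin{aligned}
q(z_s, z_a \mid s_t, a_t^k) &= q(z_s \mid s_t)\, q(z_a \mid a_t^k) \\[4pt]
\mathcal{L}(s_t, a_t^k, s_{t+k}) &=
  \underbrace{\mathbb{E}_{z_s \sim q(z_s \mid s_t),\; z_a \sim q(z_a \mid a_t^k)}
    \big[ \log p(s_{t+k} \mid z_s, z_a) \big]}_{\text{term 1: prediction}} \\
  &\quad - \beta\, \underbrace{\Big[
      D_{\mathrm{KL}}\big( q(z_s \mid s_t) \,\|\, p(z_s) \big)
    + D_{\mathrm{KL}}\big( q(z_a \mid a_t^k) \,\|\, p(z_a) \big)
    \Big]}_{\text{term 2: compression}}
\end{aligned}
```

In this sketch the KL term splits into a sum because both the assumed posterior and the assumed prior factorize over the state and action latents.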

In our experiments we use an isotropic Normal distribution for

such that term 1 reduces to where computes the mean. This can be interpreted as learning a generative model for : , with a fixed . We use diagonal-covariance Normal distributions for and such that , , , and .
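As a rough illustration of how such a loss could be evaluated for a single transition, the sketch below combines a squared prediction error with KL penalties on diagonal-covariance Normal posteriors. It assumes standard Normal priors on both latents and absorbs the fixed predictor variance into `beta`; these choices, and every function and parameter name, are assumptions for illustration rather than the official implementation.

```python
import math


def squared_error(pred, target):
    """Sum of squared differences; a fixed-variance Normal likelihood up to scale."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def kl_diag_normal_to_standard(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def embedding_loss(pred_next_state, next_state,
                   mu_s, log_var_s, mu_a, log_var_a, beta):
    """Prediction error plus beta-weighted KL terms for both posteriors."""
    recon = squared_error(pred_next_state, next_state)
    kl = (kl_diag_normal_to_standard(mu_s, log_var_s)
          + kl_diag_normal_to_standard(mu_a, log_var_a))
    return recon + beta * kl


# Toy numbers: 2D next state, 1D state latent, 1D action latent.
example_loss = embedding_loss(
    pred_next_state=[1.0, 2.0], next_state=[1.0, 0.0],
    mu_s=[1.0], log_var_s=[0.0],
    mu_a=[0.0], log_var_a=[0.0],
    beta=2.0,
)
```

A posterior equal to the assumed prior incurs zero KL cost, so only dimensions that help predict the outcome are worth paying for.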

2.3 Representation properties

We describe the properties of dynamics-aware embeddings which support efficient learning of high-quality policies.


Fidelity. A representation should be sufficiently rich to permit policies which achieve high reward. It achieves perfect fidelity if the optimal policy for the target task is attainable with this representation. This property applies to both state and action representations.

With and a sufficiently rich function class for , the learned representations and will perfectly preserve the sufficient statistics of . Such a representation captures all the information that and contain about , but may discard anything about and which does not affect (e.g. the states visited between and ).

If the reward function depends on these intermediate states, the learned representations may have lower fidelity with larger . However, for many tasks the reward earned in an episode depends only on the change in state over the entire episode, for example solving a maze or walking as far as possible, and for these tasks the DynE representation can achieve perfect fidelity. The tasks we use for evaluation, such as ReacherVertical, violate the conditions for perfect fidelity by using an action cost and rewards on every state. Empirically we find that the loss of fidelity (as measured by the peak performance of a TD3 agent using learned vs. low-level representations) is minimal to nonexistent.


Structure. The representation space should have regularities which improve the learning behavior of practical algorithms (in particular, RL with deep neural networks).

Structure, which applies to both state and action representations, is measured by the number of environment steps required for a policy to converge.

When we set , and

will be regularized, increasing the variances

and for each data point and shrinking the diameter of the posterior aggregated over the dataset. As a result, the embeddings become noisy; occasionally

will have a higher probability under

than under for some . This smooths the latent space, since will incur lower loss if the neighborhood of corresponds to states which have similar outcomes to as measured by . Under the mild assumption that states with similar outcomes have similar values, the value function will be smooth with respect to and

, leading to meaningful gradients and easier interpolation between states.


Reach. An action representation should allow the agent to get to any state in as few actions as possible. Reach is measured by the expected number of states reachable by taking a single action.

Our learned action representation has greater reach than the original action space if . If an action embedding is rich enough that , then for any that is reachable in actions, there is some that reaches that state. Therefore the reach of such an action representation is equivalent to the reach of producing a sequence of actions directly.
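The notion of reach can be illustrated with a toy example. The sketch below assumes a hypothetical deterministic environment on the integer line with three raw actions, and counts the states reachable by one raw action versus one sequence of four raw actions; the environment and names are illustrative assumptions only.

```python
from itertools import product

RAW_ACTIONS = (-1, 0, 1)  # assumed toy action set: step left, stay, step right


def step(state, action):
    """Deterministic toy dynamics on the integer line."""
    return state + action


def reachable_states(state, k):
    """States reachable from `state` by some sequence of k raw actions."""
    outcomes = set()
    for seq in product(RAW_ACTIONS, repeat=k):
        s = state
        for a in seq:
            s = step(s, a)
        outcomes.add(s)
    return outcomes


raw_reach = len(reachable_states(0, 1))        # one raw action
four_step_reach = len(reachable_states(0, 4))  # one 4-step action sequence
num_sequences = len(RAW_ACTIONS) ** 4          # distinct 4-step sequences
```

In this toy setting many distinct sequences lead to the same outcome, which is the redundancy an outcome-based action embedding can compress away while keeping the larger reach.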

3 Using learned embeddings for reinforcement learning

3.1 Decoding to raw actions

In order to be useful for RL, the abstract action space produced by the encoder must be decodable to raw actions in the environment. Since the mapping from action sequences to high-level actions is many-to-one, inverting it is nontrivial. We simplify this ill-posed problem by defining an objective with a single optimum.

Once the action encoder is fully trained, we hold it fixed and train an action decoder to minimize
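A loss with the two terms discussed in this section (reconstruction in embedding space plus a norm penalty on the decoded actions) could take the following form. This is a hedged sketch: the symbols d for the decoder, e_a for the fixed action encoder, z_a for an embedded action, and λ for the penalty weight are assumed notation, and the distribution over z_a is left unspecified.

```latex
\mathcal{L}(d) = \mathbb{E}_{z_a}\Big[
    \big\| e_a\big(d(z_a)\big) - z_a \big\|_2^2
    + \lambda\, \big\| d(z_a) \big\|_2^2
\Big]
```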

The first term of this objective ensures that the action decoder is a one-sided inverse of ; that is, but . The second term of the loss ensures that is in particular the minimum-norm one-sided inverse of and gives the objective for the output of a single minimum. The minimum-norm inverse, i.e. the inverse which produces the actions with the smallest norm, is desirable as it leads to actions which are smooth and consume less energy. We choose to be small (e.g. ) to ensure that the reconstruction criterion dominates the optimization.
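The idea of a minimum-norm one-sided inverse can be seen in a toy case. The sketch below assumes a hypothetical many-to-one encoder that maps a sequence of scalar actions to their sum, and compares two decoders that both invert it on one side; the encoder, decoders, and sequence length are illustrative assumptions, not the learned networks.

```python
K = 4  # assumed length of the action sequence


def encode(actions):
    """Toy many-to-one encoder: the embedding is the net displacement."""
    return sum(actions)


def decode_min_norm(z, k=K):
    """Spread the displacement evenly: the smallest-norm sequence encoding to z."""
    return [z / k] * k


def decode_front_loaded(z, k=K):
    """Another valid decoding with larger norm: act only on the first step."""
    return [z] + [0.0] * (k - 1)


def sq_norm(actions):
    """Squared Euclidean norm of an action sequence."""
    return sum(a * a for a in actions)


z = 2.0
smooth = decode_min_norm(z)
jerky = decode_front_loaded(z)
original = [1.0, 1.0, -1.0, 1.0]  # a different sequence that also encodes to 2.0
```

Both decoders satisfy encode(decode(z)) = z, but neither recovers `original` from its embedding; the norm penalty is what would single out the smooth decoding among the valid ones.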

This action decoder takes only an embedded action as its input, not a state. As a result, if there are multiple environments that share similar dynamics, we can use the same decoder even when the task or the state representations may be different. The dynamics must be similar in the sense that the same sets of actions map to similar outcomes across all the environments. A sequence of actions does not need to have the same outcome in environment A as it does in environment B, but if and are equivalent in environment A they should be equivalent in environment B. We show in Figure 4 that an action decoder trained on one environment generalizes extremely well to related environments.

3.2 Efficient RL with temporal abstraction

Once equipped with a decoder which maps from high-level actions to sequences of raw actions, we train a high-level policy that solves a task by selecting high-level actions. In this section we extend the deterministic policy gradient (Silver et al., 2014) family of algorithms to work with temporally-extended actions while maintaining off-policy updates and learning from every environment step. This allows our method to achieve superior sample efficiency when working with high-level actions. In particular, we extend twin delayed deep deterministic policy gradient (TD3) (Fujimoto et al., 2018) to work with the DynE representation of actions to form an algorithm we call DynE-TD3.

We first describe why DPG requires modifications to accommodate temporally-abstracted actions. One simple approach to combining DynE with DPG would be to incorporate the -step DynE action space into the environment to form a new MDP. This MDP allows the use of DPG without modification; however, it only emits observations once every timesteps. As a result, after steps in the original environment, the deterministic policy and critic function can only be trained on observations. This has a substantial impact on sample efficiency when measured in the original environment.

Instead we require an algorithm which can perform updates to the policy and critic for every environment step. To do this, we train both and in the abstract action space with minor changes to their updates. We distinguish these functions which use DynE actions from their raw equivalents by adding a superscript DynE, i.e. and . We augment the critic function with an additional input, , which represents the number of steps of the current embedded action that have already been executed. This forms the DynE-TD3 critic:
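A recursion consistent with this description can be sketched as follows. The notation is assumed for illustration: k is the length of an abstract action, i the number of its steps already executed, z_a the current embedded action, a_{t+j} the raw actions it decodes to, γ the discount factor, and μ^DynE the high-level policy.

```latex
Q^{\mathrm{DynE}}(s_{t+i}, z_a, i) =
  \mathbb{E}\left[
    \sum_{j=i}^{k-1} \gamma^{\,j-i}\, r(s_{t+j}, a_{t+j})
    + \gamma^{\,k-i}\, Q^{\mathrm{DynE}}\big(s_{t+k},\, \mu^{\mathrm{DynE}}(s_{t+k}),\, 0\big)
  \right]
```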

In plain language, the value of being on step of abstract action is the value of finishing the remaining steps of and then continuing on following the policy. This is similar to the idea of n-step returns (Sutton and Barto, 2018), but with a variable n which depends on the step along the current plan. The DynE critic is trained by minimizing the Bellman error implied by the equation above.
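The variable-length return described here can be sketched as a small function. It assumes the rewards for all raw steps of the current abstract action are available and that a bootstrap value is supplied by the critic at the state where the next abstract action begins; target networks, terminal-state handling, and the twin critics of TD3 are omitted, and all names are hypothetical.

```python
def variable_step_target(rewards, i, gamma, bootstrap_value):
    """Return target for being on step i of a k-step abstract action.

    rewards: rewards for all k raw steps of the abstract action
    i: number of steps already executed (0 <= i < k)
    gamma: discount factor
    bootstrap_value: critic estimate at the state reached after the action
        finishes, evaluated with a fresh abstract action (step input 0)
    """
    k = len(rewards)
    if not 0 <= i < k:
        raise ValueError("i must satisfy 0 <= i < k")
    remaining = rewards[i:]
    discounted = sum(gamma ** j * r for j, r in enumerate(remaining))
    return discounted + gamma ** len(remaining) * bootstrap_value


# Toy numbers: a 4-step abstract action with unit rewards.
rewards = [1.0, 1.0, 1.0, 1.0]
target_start = variable_step_target(rewards, 0, 0.5, 8.0)
target_last = variable_step_target(rewards, 3, 0.5, 8.0)
```

Because a target exists for every value of i, every environment step yields a training example for the critic, rather than one example per abstract action.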

To update the policy we follow the standard DPG technique of using the gradient of the critic. We modify the algorithm to take into account that at the time of issuing a new high-level action. The gradient of the return with respect to the policy parameters is then

given that data was collected according to a behavior policy .
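For orientation, a deterministic policy gradient of this kind, with the critic's step input fixed to zero because a new high-level action is being issued, could be written as below. The symbols are assumptions for illustration: θ are the policy parameters, μ_θ^DynE the high-level policy, and ρ^b the state distribution induced by the behavior policy.

```latex
\nabla_\theta J(\theta) \approx
  \mathbb{E}_{s \sim \rho^{b}}\Big[
    \nabla_{z_a} Q^{\mathrm{DynE}}(s, z_a, 0)\big|_{z_a = \mu_\theta^{\mathrm{DynE}}(s)}
    \; \nabla_\theta\, \mu_\theta^{\mathrm{DynE}}(s)
  \Big]
```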

4 Related work

Successor representations, an inspiration for this work, represent a state by the expected rate of future visits to other states (Dayan, 1993; Kulkarni et al., 2016b; Barreto et al., 2017). Successor representations have been demonstrated to be an effective model of animal and human learning (Momennejad et al., 2017; Stachenfeld et al., 2017). They are also one of the earliest realizations of the idea of representing each state by its future. Whereas successor representations learn future occupancy maps for a particular policy, we learn an embedding space where states are close together if they have similar outcomes for any policy.

Several papers have proposed using (variational) auto-encoders to learn embeddings for observations (Higgins et al., 2017; Caselles-Dupré et al., 2018); unlike our work, these models operate on a single observation at a time and do not depend on the environment dynamics. Forward prediction has also been used as an auxiliary task to speed RL training (Jaderberg et al., 2016). Ghosh et al. (2018) propose to learn state embeddings using the action distribution of a goal-conditioned policy; however, their technique depends on already having a successful policy. Other work has proposed to use mutual information maximization to learn embeddings which facilitate exploration via intrinsic motivation (Kim et al., 2018).

Another line of work couples the process of learning a model of the environment with training a policy on imagined rollouts (Sutton, 1991; Deisenroth and Rasmussen, 2011; Ha and Schmidhuber, 2018; Clavera et al., 2018; Kaiser et al., 2019; Henaff et al., 2019). These works are similar to ours in that they learn forward models of the environment and use them to speed the training of model-free policies. However, our work differs from theirs in that we use forward modeling only as a surrogate objective for representation learning.

Similarly to this work, hierarchical reinforcement learning seeks to learn temporal abstractions. These abstractions are variously defined as skills (Florensa et al., 2017; Hausman et al., 2018), options (Sutton et al., 1999; Bacon et al., 2017), or goal-directed sub-policies (Kulkarni et al., 2016a; Vezhnevets et al., 2017). Whereas these works train a low-level policy by maximizing the reward of the overall task or a heuristically-defined subtask, in this work we seek to learn a representation of the transition structure of the environment which can be used for any downstream task.

Most closely related are Co-Reyes et al. (2018) and Nachum et al. (2018a). SeCTAR (Co-Reyes et al., 2018) simultaneously learns a generative model of future states and a low-level policy which can reach those states. Unlike this work, their latent space represents a particular trajectory through the environment rather than an effect, making it state-dependent. This limits its transferability between environments. Furthermore, SeCTAR assumes the reward function is given ahead of time. HIRO (Nachum et al., 2018a) addresses the aim of off-policy training of hierarchical policies. However, their off-policy performance depends on an approximate re-labeling of action sequences to train the high-level policy, and their low-level policy must be trained on an observation space which matches the target task. A follow-up paper (Nachum et al., 2018b) learns a representation for goal states such that a high-level policy can induce any action in a low-level policy.

Also related are methods which attempt to learn embeddings of single actions to enable efficient learning in very large action spaces (Dulac-Arnold et al., 2015; Chandak et al., 2019). In particular, Chandak et al. (2019) learns a latent space of actions based on the effects of an action on the environment. However, their latent spaces are for a single action and they do not consider learned state representations. Another related direction is learning embeddings of one or more actions from demonstrations (Tennenholtz and Mannor, 2019); this embedded action space builds in prior knowledge from the demonstrator and can allow faster learning.

5 Experiments

In the following experiments, we evaluate the effectiveness of the DynE representations for deep RL. We particularly assess a lower bound on their fidelity, measured by peak performance, and structure, measured by number of environment steps required for convergence.

We separately analyze the contributions of the learned action and observation representations. First, we evaluate the DynE action space on a set of six tasks with low-dimensional state observations, including transferring the learned action space between environments with similar dynamics. Then, we test the DynE observation space on a set of three tasks with pixel observations. Finally, we combine DynE actions with DynE observations, verifying that the two learned representations are complementary.

Appendix B provides a full description of hyperparameters and model architectures, and all of the code for DynE is available on GitHub at https://github.com/willwhitney/dynamics-aware-embeddings.


We use six continuous control tasks from two families implemented in the MuJoCo simulator (Todorov et al., 2012) to evaluate our method. Within each family, the task and observation space change but the robot being controlled stays roughly the same, allowing us to test the transferability of the DynE action space between tasks. The “Reacher” family consists of three tasks which involve controlling a 2D, 2DoF arm to interact with various objects. The “7DoF” family of tasks is quite difficult, featuring three tasks in which a 3D, 7DoF arm must use different end effectors to move objects to their goal positions. Detailed descriptions of both families of tasks are available in Appendix A.

5.1 Reach and efficient exploration

Figure 3: The distribution of state distances reached by uniform random exploration using DynE actions () or raw actions in ReacherVertical. Left: Randomly selecting a 4-step DynE action reaches a state uniformly sampled from those reachable in 4 environment timesteps. Right: Over the length of an episode (100 steps), random exploration with DynE actions reaches faraway states far more often than exploration with raw actions.

We empirically validate the exploration benefits of the increased reach of DynE actions. Figure 3 shows that uniformly sampling a DynE action results in a nearly uniform distribution over the outcomes reachable in the same number of environment steps. When extended to an entire episode, the uniform random policy with DynE actions reaches faraway states more often than the same policy with raw actions. We also provide a visualization of the learned DynE action space in Appendix C, which shows that DynE actions have a one-to-one correspondence with outcome states. Appendix E shows the qualitative difference between random trajectories in the raw and DynE action spaces.

5.2 DynE-TD3 with low-dimensional states

We use both families of tasks to evaluate the performance of the DynE action space and its transferability between environments. Whereas directly transferring a policy to an environment with different objects and observations would be impossible, DynE actions transfer between any environments with the same actions.

For training the DynE action representation we use 100K steps with a uniformly random behavior policy in the simplest environment in each family, with no reward or other supervisory signal. As this DynE pretraining is unsupervised and only occurs once for each family of environments, the x-axis on these training curves refers only to the samples used to train the policy. We then transfer this action representation to all three environments in the family. When training DynE-TD3, we use all of the default hyperparameters from the TD3 implementation across all environments.
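Data collected by a random policy has to be arranged into multi-step transitions before it can be used for forward prediction. The sketch below shows one plausible way to slice a single recorded trajectory into (state, action sequence, outcome state) tuples with overlapping windows; the function name, the windowing scheme, and the toy trajectory are assumptions for illustration and are not taken from the official code.

```python
def k_step_transitions(states, actions, k):
    """Slice one trajectory into (s_t, [a_t, ..., a_{t+k-1}], s_{t+k}) tuples.

    `states` has one more element than `actions`: states[t + 1] is the state
    observed after taking actions[t] in states[t].
    """
    if len(states) != len(actions) + 1:
        raise ValueError("expected len(states) == len(actions) + 1")
    tuples = []
    for t in range(len(actions) - k + 1):
        tuples.append((states[t], actions[t:t + k], states[t + k]))
    return tuples


# Toy trajectory with 5 raw steps, sliced into 4-step transitions.
toy_states = [0, 1, 2, 3, 4, 5]
toy_actions = ["a", "b", "c", "d", "e"]
toy_tuples = k_step_transitions(toy_states, toy_actions, 4)
```

No reward appears in these tuples, which matches the unsupervised nature of the pretraining described above.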

We compare against three state-of-the-art model-free baseline methods: regular TD3 (Fujimoto et al., 2018), soft actor-critic (SAC) (Haarnoja et al., 2018c, e), and proximal policy optimization (PPO) (Schulman et al., 2017). We also compare against soft actor-critic with latent space policies (SAC-LSP) (Haarnoja et al., 2018a), an innovative hierarchical method based on SAC. In all cases we use the current version of the official implementations³,⁴,⁵ and the MuJoCo hyperparameters used by the authors. We also attempted to compare with the hierarchical method by Nachum et al. (2018b), but after several emails with the authors and a few hundred GPU-hours of hyperparameter sweeps we were unable to get it to converge on tasks other than those in their paper.

³ TD3: https://github.com/sfujim/TD3/
⁴ SAC and SAC-LSP: https://github.com/haarnoja/sac
⁵ PPO: https://github.com/openai/baselines/tree/master/baselines/ppo2

Figure 4 shows the results of these experiments. They show that (1) high-quality policies can be trained on the DynE action space; (2) TD3 shows a substantial efficiency gain from using the DynE action space; and (3) the first and second observations continue to hold even when transferring the DynE space between environments. It is especially worth noting that the gains from DynE-TD3 increase as the tasks become harder, maintaining convergence, stability, and low variance in the face of high-dimensional control with difficult exploration. Since SAC-LSP (Haarnoja et al., 2018a) performs similarly to, but worse than, SAC, we test it only on the simpler Reacher family of tasks; meanwhile, the PPO curves do not enter the frame on the Reacher family of tasks due to its poor sample efficiency.

Figure 4: Performance of DynE-TD3 and baselines on two families of environments with low-dimensional observations. Dark lines are mean reward over 8 seeds and shaded areas are bootstrapped 95% confidence intervals. Across all the environments, DynE-TD3 exhibits faster learning than baselines, and with one exception beats them in asymptotic performance. Within each family of environments, the DynE action space was trained only on the simplest task (left). This demonstrates that DynE action representations are highly transferable.
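For readers unfamiliar with bootstrapped intervals, the sketch below computes a percentile bootstrap confidence interval for the mean reward across seeds at a single point on a training curve. The caption does not say which bootstrap variant was used, so the percentile method, the number of resamples, and the reward values here are all assumptions for illustration.

```python
import random


def bootstrap_ci(values, num_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n
        for _ in range(num_resamples)
    )
    lower_index = round((1.0 - level) / 2.0 * num_resamples)
    upper_index = round((1.0 + level) / 2.0 * num_resamples) - 1
    return means[lower_index], means[upper_index]


# Made-up final rewards for 8 seeds, for illustration only.
rewards_per_seed = [-12.0, -10.5, -11.2, -9.8, -10.1, -11.7, -10.9, -10.3]
mean_reward = sum(rewards_per_seed) / len(rewards_per_seed)
ci_low, ci_high = bootstrap_ci(rewards_per_seed)
```

Repeating this at every evaluation step gives the shaded band around the mean curve.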

5.3 DynE-TD3 with pixel observations

We test whether DynE state representations allow TD3 to scale to pixel observations using the “Reacher” family of environments. To train the DynE observation space we use 100K steps from a uniformly random policy in each environment; since the DynE state representation must be trained on each environment, we include those 100K steps in the x-axis of our training curves. We train TD3 with the pretrained observation space using all of the default hyperparameters from the TD3 implementation. We call these results S-DynE-TD3, for “State DynE TD3”. We provide details of the representation training in Appendix B.

We compare against regular TD3 trained from pixels. As there are no experiments on pixels in the TD3 paper, we performed an extensive search over network architectures and hyperparameters. We included in our search the configurations used in the pixel experiments of DDPG (Lillicrap et al., 2015) as well as those used in successful discrete-action RL works from pixels (Schulman et al., 2017; Kostrikov, 2018; Espeholt et al., 2018). For the experiments shown here, we used a simple linear control problem to evaluate which combination of architecture and hyperparameters worked best, then used those settings throughout.

We also compare to a representation learning baseline, “VAE-TD3”, which consists of training a variational autoencoder on the pixel observations from each environment, then using that encoding as the state space for TD3. As this encoder operates on a single image at a time, we mirror the stacked-image input to the other models by concatenating the encoding of the current frame with the encodings of the three most recent frames.
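The concatenation of per-frame encodings can be sketched with a small buffer. The class below keeps the encodings of the four most recent frames and flattens them into one state vector; the class name, the oldest-first ordering, and the choice to pad with copies of the first encoding at episode start are assumptions for illustration.

```python
from collections import deque


class EncodingStack:
    """Keeps the encodings of the most recent frames and concatenates them."""

    def __init__(self, num_frames=4):
        self.num_frames = num_frames
        self.buffer = deque(maxlen=num_frames)

    def reset(self, first_encoding):
        # Assumption: at episode start, pad by repeating the first encoding.
        self.buffer.clear()
        for _ in range(self.num_frames):
            self.buffer.append(list(first_encoding))
        return self.state()

    def push(self, encoding):
        # Appending to a full deque drops the oldest encoding automatically.
        self.buffer.append(list(encoding))
        return self.state()

    def state(self):
        # Flatten with the oldest encoding first and the current frame last.
        return [x for enc in self.buffer for x in enc]


stack = EncodingStack()
initial_state = stack.reset([1.0, 2.0])
next_state = stack.push([3.0, 4.0])
```

The resulting vector has four times the dimension of a single encoding and would serve as the state input to the policy and critic.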

Finally, we evaluate whether DynE actions yield additional improvement when combined with DynE states. We use the state encoder and the action decoder from the same DynE model we use for the S-DynE-TD3 results. We call the policies trained with both state and action DynE representations SA-DynE-TD3.

Figure 5: Performance of DynE-TD3 and baselines with pixel observations. Learned representations for state make a dramatic difference. SA-DynE-TD3 converges stably and rapidly and achieves performance from pixels that nearly equals TD3’s performance from states. Dark lines are mean reward over 8 seeds and shaded areas are bootstrapped 95% confidence intervals.

Figure 5 shows the results of these experiments. We find that neither of our baselines is able to solve any of the three tasks from pixels; TD3 diverges in all cases, while VAE-TD3 learns gradually at best. If simply reducing the dimension of the states were sufficient to enable effective policy training, we would expect good performance from VAE-TD3. Instead we find that S-DynE-TD3 trains with many fewer samples and reaches higher performance than VAE-TD3, demonstrating that the particular structure learned by DynE plays a crucial role in learning. S-DynE-TD3 is able to achieve decent performance on the two simpler environments, establishing a lower bound on the fidelity of the DynE state representation. SA-DynE-TD3 learns rapidly and reliably learns behaviors which qualitatively solve all three tasks. In fact, training a policy from pixels using SA-DynE-TD3 has dramatically better sample complexity than training PPO from low-dimensional states across all three environments and equals SAC on ReacherTurn. These results show that the DynE action and state representations are effective at scaling model-free RL to environments with high-dimensional states and difficult exploration.

6 Discussion

In this work we proposed a method, Dynamics-aware Embedding (DynE), that jointly learns embedded representations of states and actions for reinforcement learning. We described how DynE embeddings exhibit the properties of fidelity, structure, and reach and how they affect policy learning. Our experiments reveal that DynE action embeddings lead to more efficient exploration of the state space, resulting in more sample-efficient learning on complex tasks, while DynE state embeddings allow unmodified model-free RL algorithms to scale to pixel observations. With the combination of state and action embeddings, the DynE-TD3 algorithm results in stable, sample-efficient learning of high-quality policies from pixels.


Acknowledgments

We thank many people for valuable discussions and for editing versions of this paper, including David Brandfonbrener, Martin Arjovsky, Denis Yarats, Aahlad Puli, Ilya Kostrikov, Cinjon Resnick, and Saurabh Gupta.


References

  • M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al. (2018) Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177.
  • P. Bacon, J. Harb, and D. Precup (2017) The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence.
  • A. Barreto, W. Dabney, R. Munos, J. J. Hunt, T. Schaul, H. P. van Hasselt, and D. Silver (2017) Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4055–4065.
  • G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, A. Muldal, N. Heess, and T. Lillicrap (2018) Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540.
  • H. Caselles-Dupré, M. Garcia-Ortiz, and D. Filliat (2018) Continual state representation learning for reinforcement learning using generative replay. arXiv preprint arXiv:1810.03880.
  • Y. Chandak, G. Theocharous, J. Kostas, S. Jordan, and P. S. Thomas (2019) Learning action representations for reinforcement learning. arXiv preprint arXiv:1902.00183.
  • I. Clavera, J. Rothfuss, J. Schulman, Y. Fujita, T. Asfour, and P. Abbeel (2018) Model-based reinforcement learning via meta-policy optimization. arXiv preprint arXiv:1809.05214.
  • J. D. Co-Reyes, Y. Liu, A. Gupta, B. Eysenbach, P. Abbeel, and S. Levine (2018) Self-consistent trajectory autoencoder: hierarchical reinforcement learning with trajectory embeddings. In ICML.
  • P. Dayan (1993) Improving generalization for temporal difference learning: the successor representation. Neural Computation 5 (4), pp. 613–624.
  • M. Deisenroth and C. E. Rasmussen (2011) PILCO: a model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 465–472.
  • G. Dulac-Arnold, R. Evans, H. van Hasselt, P. Sunehag, T. Lillicrap, J. Hunt, T. Mann, T. Weber, T. Degris, and B. Coppin (2015) Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679.
  • L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al. (2018) IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561.
  • C. Florensa, Y. Duan, and P. Abbeel (2017) Stochastic neural networks for hierarchical reinforcement learning. arXiv preprint arXiv:1704.03012.
  • S. Fujimoto, H. van Hoof, and D. Meger (2018) Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477.
  • D. Ghosh, A. Gupta, and S. Levine (2018) Learning actionable representations with goal-conditioned policies. arXiv preprint arXiv:1811.07819.
  • D. Ha and J. Schmidhuber (2018) World models. arXiv preprint arXiv:1803.10122.
  • T. Haarnoja, K. Hartikainen, P. Abbeel, and S. Levine (2018a) Latent space policies for hierarchical reinforcement learning. arXiv preprint arXiv:1804.02808.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018b) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018c) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
  • T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, et al. (2018d) Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905.
  • T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, et al. (2018e) Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905.
  • D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson (2018) Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551.
  • K. Hausman, J. T. Springenberg, Z. Wang, N. Heess, and M. Riedmiller (2018) Learning an embedding space for transferable robot skills. In International Conference on Learning Representations, Cited by: §4.
  • M. Henaff, A. Canziani, and Y. LeCun (2019) Model-predictive policy learning with uncertainty regularization for driving in dense traffic. arXiv preprint arXiv:1901.02705. Cited by: §4.
  • I. Higgins, A. Pal, A. Rusu, L. Matthey, C. Burgess, A. Pritzel, M. Botvinick, C. Blundell, and A. Lerchner (2017) Darla: improving zero-shot transfer in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1480–1490. Cited by: §4.
  • M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu (2016) Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397. Cited by: §4.
  • L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, et al. (2019) Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374. Cited by: §4.
  • D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al. (2018) Qt-opt: scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293. Cited by: §1.
  • H. Kim, J. Kim, Y. Jeong, S. Levine, and H. O. Song (2018) EMI: exploration with mutual information maximizing state and action embeddings. arXiv preprint arXiv:1810.01176. Cited by: §4.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix B.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §2.2.
  • I. Kostrikov (2018) PyTorch implementations of reinforcement learning algorithms. GitHub. Note: https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail Cited by: §5.3.
  • T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum (2016a) Hierarchical deep reinforcement learning: integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pp. 3675–3683. Cited by: §4.
  • T. D. Kulkarni, A. Saeedi, S. Gautam, and S. J. Gershman (2016b) Deep successor reinforcement learning. arXiv preprint arXiv:1606.02396. Cited by: §4.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §1, §1, §5.3.
  • X. Liu, J. Gao, A. Celikyilmaz, L. Carin, et al. (2019) Cyclical annealing schedule: a simple approach to mitigating kl vanishing. arXiv preprint arXiv:1903.10145. Cited by: Appendix B.
  • I. Momennejad, E. M. Russek, J. H. Cheong, M. M. Botvinick, N. D. Daw, and S. J. Gershman (2017) The successor representation in human reinforcement learning. Nature Human Behaviour 1 (9), pp. 680. Cited by: §4.
  • O. Nachum, S. Gu, H. Lee, and S. Levine (2018a) Data-efficient hierarchical reinforcement learning. In NeurIPS, Cited by: §4.
  • O. Nachum, S. Gu, H. Lee, and S. Levine (2018b) Near-optimal representation learning for hierarchical reinforcement learning. CoRR abs/1810.01257. Cited by: §4, §5.2.
  • A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: Appendix B.
  • D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082. Cited by: §2.2.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §5.2, §5.3.
  • D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller (2014) Deterministic policy gradient algorithms. In ICML, Cited by: §3.2, footnote 2.
  • K. L. Stachenfeld, M. M. Botvinick, and S. J. Gershman (2017) The hippocampus as a predictive map. Nature neuroscience 20 (11), pp. 1643. Cited by: §4.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §3.2.
  • R. S. Sutton, D. Precup, and S. Singh (1999) Between mdps and semi-mdps: a framework for temporal abstraction in reinforcement learning. Artificial intelligence 112 (1-2), pp. 181–211. Cited by: §4.
  • R. S. Sutton (1991) Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin 2 (4), pp. 160–163. Cited by: §4.
  • Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, et al. (2018) Deepmind control suite. arXiv preprint arXiv:1801.00690. Cited by: Appendix A.
  • G. Tennenholtz and S. Mannor (2019) The natural language of actions. arXiv preprint arXiv:1902.01119. Cited by: §4.
  • E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: Appendix A, §5.
  • A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu (2017) Feudal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3540–3549. Cited by: §4.

Appendix A Environment description

(a) ReacherVertical
(b) ReacherTurn
(c) ReacherPush
Figure 6: The Reacher family of environments. ReacherVertical requires the agent to move the tip of the arm to the red dot. ReacherTurn requires the agent to turn a rotating spinner (dark red) so that the tip of the spinner (gray) is close to the target point (red). ReacherPush requires the agent to push the brown box onto the red target point. The initial state of the simulator and the target point are randomized for each episode. In each environment the rewards are dense and there is a penalty on the norm of the actions. The robot’s kinematics are the same in each environment but the state spaces are different.

The first task family, pictured in Figure 6, is the “Reacher family”, based on the Reacher-v2 MuJoCo (Todorov et al., 2012) task from OpenAI Gym (Brockman et al., 2016). These tasks form a simple new benchmark for multitask robot learning. The first task, which we use as the “source” task for training the DynE space, is ReacherVertical, a standard reach-to-a-location task. The other two tasks are inspired by the DeepMind Control Suite’s Finger Turn and Stacker environments, respectively (Tassa et al., 2018). In ReacherTurn, the same 2-link Reacher robot must turn a spinner to the specified random location. In ReacherPush, the Reacher must push a block to the correct random location.

(a) Pusher-v2
(b) Striker-v2
(c) Thrower-v2
Figure 7: The 7DoF family of environments. Pusher-v2 requires the agent to use a C-shaped end effector to push a puck across the table onto a red circle. Striker-v2 requires the agent to use a flat end effector to hit a ball so that it rolls across the table and reaches the goal. Thrower-v2 requires the agent to throw a ball to a target using a small scoop. As with the Reacher family, the dynamics of the robot are the same within the 7DoF family of tasks. However, the morphology of the robot, as well as the object it interacts with, is different.

The second task family is the “7DoF family”, which comprises Pusher-v2, Striker-v2, and Thrower-v2 from OpenAI Gym (Brockman et al., 2016). We use Pusher-v2 as the source task. These tasks use similar (though not identical) robot models, making them a feasible family of tasks for transfer. They are shown in Figure 7.

A.1 Pixels

We use full-color images rendered at 256x256 and resized to 64x64 pixels. In order to allow the agents to perceive motion, we stack the current frame with the three most recent frames, resulting in an observation of dimension 12x64x64.
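The frame stacking described above can be sketched as follows. This is a minimal illustration, not the authors' code: the class name `FrameStack` is hypothetical, frames are assumed to arrive already resized to 3x64x64 in channel-first layout, and repeating the first frame to fill the history at the start of an episode is an assumption that the text does not specify.

```python
from collections import deque

import numpy as np


class FrameStack:
    """Keeps the most recent frames and concatenates them along the channel axis."""

    def __init__(self, num_frames=4):
        self.num_frames = num_frames
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        # Assumption: with no history yet, fill the stack with copies of the first frame.
        self.frames.clear()
        for _ in range(self.num_frames):
            self.frames.append(first_frame)
        return self.observation()

    def step(self, new_frame):
        # Appending to a full deque silently drops the oldest frame.
        self.frames.append(new_frame)
        return self.observation()

    def observation(self):
        # Four frames of shape (3, 64, 64) become one array of shape (12, 64, 64),
        # ordered from oldest to newest.
        return np.concatenate(list(self.frames), axis=0)


stack = FrameStack(num_frames=4)
obs = stack.reset(np.zeros((3, 64, 64), dtype=np.uint8))
obs = stack.step(np.ones((3, 64, 64), dtype=np.uint8))
```

After the single `step` call, the last three channels of `obs` hold the newest frame and the first nine hold the three older ones.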

Appendix B Hyperparameters and DynE training

For DynE-TD3 we use all of the default hyperparameters from the TD3 code (https://github.com/sfujim/TD3) across all tasks. For all experiments we choose the dimension of the DynE action space to be equal to the dimension of a single action in the environment. We set the number of actions in the DynE space to be for all experiments except Thrower-v2, for which we use . When computing log-likelihoods we divide by the number of dimensions in the state in an attempt to make the correct settings of and loss invariant; the same result could be achieved by multiplying the values of and that we report by the state dimension and changing the learning rate. We use the Adam optimizer (Kingma and Ba, 2014) with learning rate . All our experiments used recent-model NVIDIA GPUs.
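The equivalence claimed for the log-likelihood rescaling can be checked with a few lines of arithmetic. The sketch below is illustrative only: it uses a single hypothetical regularization weight `kl_weight` in place of the paper's weights, and all numeric values are made up.

```python
def per_dimension_loss(recon_nll, kl, kl_weight, state_dim):
    """Loss with the negative log-likelihood divided by the state dimension."""
    return recon_nll / state_dim + kl_weight * kl


def unscaled_loss(recon_nll, kl, kl_weight):
    """Loss that uses the raw negative log-likelihood."""
    return recon_nll + kl_weight * kl


state_dim = 20                                # hypothetical state dimension
recon_nll, kl, kl_weight = 5000.0, 3.0, 1e-4  # made-up values for illustration

scaled = per_dimension_loss(recon_nll, kl, kl_weight, state_dim)

# Multiplying the weight by the state dimension gives the same loss up to an
# overall factor of 1 / state_dim.
equivalent = unscaled_loss(recon_nll, kl, kl_weight * state_dim) / state_dim
```

Here `scaled` and `equivalent` agree up to floating-point error. The leftover factor of `1 / state_dim` scales every gradient uniformly, which is the part that a change of learning rate would absorb.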

Training on states

For all experiments we set our hyperparameters as we found performance to be relatively insensitive. We concatenate all the joint angles and velocities to use as the states during representation learning. We preprocess the pairs by first taking the difference and then whitening so that has zero mean and unit variance in each dimension. This preprocessing encourages the encoder to represent both position and velocity in the latent space; the scales of these two components are quite different.
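A minimal sketch of this preprocessing, assuming the state pairs are stored as two arrays of shape (N, D). The function name `whiten_deltas`, the `eps` term, and the synthetic data are illustrative and not taken from the authors' code. "Whitening" is implemented here as per-dimension standardization, matching the description of zero mean and unit variance in each dimension.

```python
import numpy as np


def whiten_deltas(states, next_states, eps=1e-8):
    """Standardize state differences per dimension; also return the statistics used."""
    deltas = next_states - states
    mean = deltas.mean(axis=0)
    std = deltas.std(axis=0)
    # eps guards against division by zero for dimensions that never change.
    whitened = (deltas - mean) / (std + eps)
    return whitened, mean, std


rng = np.random.default_rng(0)
# Synthetic data: two "position" dimensions with small changes and two
# "velocity" dimensions with much larger changes.
scales = np.array([0.01, 0.01, 5.0, 5.0])
states = rng.normal(size=(1000, 4))
next_states = states + rng.normal(size=(1000, 4)) * scales

whitened, mean, std = whiten_deltas(states, next_states)
```

After standardization all four dimensions contribute on the same scale, so a prediction loss is not dominated by the velocity components. The returned `mean` and `std` would be reused to transform new data in the same way.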

We use fully-connected networks for the action encoder and the conditional state predictor . Each function has two hidden layers of 400 units. Training this model should take 5-10 minutes on GPU.

Training on pixels

We train a DynE model for each environment, taking in a stack of frames and a sequence of actions and predicting future states. To speed training we predict only the two latest frames of the future state (i.e. the picture of the world at time and ) instead of all four. We then take the state encoder from this model and use it to preprocess all states from the environment.

We set the dimension of the state embedding to 100. We did not try other options, and given the sensitivity of RL to state dimension a smaller setting would very likely yield faster learning. We set . This number is silly due to our rescaling of the log-likelihood by the dimension; without that rescaling it would be . As the goal of this objective is representation learning, not generation, it is better to err on the side of setting and too small. This results in higher fidelity but lower structure, which is better than low-fidelity but smooth (or constant) latent spaces. We recommend ensuring that the predictions (not generations) from the model are correctly rendering all the task-relevant objects; if and are too high, the model may incur lower loss by ignoring details in the image. We use cyclic KL annealing (Liu et al., 2019) to improve convergence over a wide range of settings.
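For readers unfamiliar with cyclical KL annealing, one common form of the schedule is sketched below: the KL weight ramps linearly from zero over the first part of each cycle and is then held at its maximum until the cycle ends. The function name and the values of `num_cycles` and `ramp_fraction` are illustrative assumptions, not the settings used in our experiments.

```python
def cyclical_kl_weight(step, total_steps, num_cycles=4, ramp_fraction=0.5, max_weight=1.0):
    """KL weight at a training step under a cyclical linear annealing schedule."""
    cycle_length = total_steps / num_cycles
    # Position within the current cycle, in [0, 1).
    position = (step % cycle_length) / cycle_length
    # Linear ramp during the first `ramp_fraction` of the cycle, then constant.
    return max_weight * min(1.0, position / ramp_fraction)


# Sample the schedule every 50 steps over 1000 training steps.
schedule = [cyclical_kl_weight(step, total_steps=1000) for step in range(0, 1000, 50)]
```

Each restart returns the weight to zero, which periodically lets the model prioritize reconstruction before the KL penalty is reapplied.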

We use the DCGAN architecture (Radford et al., 2015) for the image encoder and the predictor . The action encoder is fully connected with two hidden layers of 400 units. Training this model takes 1-2 hours on GPU.

Appendix C Visualizing the DynE action space

(a) Effect space
(b) DynE action space
Figure 8: These plots represent the mapping between the effects of action sequences and points in DynE action space. In the left plot, each point represents the effect of a sequence of four randomly-selected actions , colored according to their location in the plot. That is, the e.g. coordinate of a point represents the between the initial state and the state reached at the end of the four actions. Each point in the right plot corresponds to the point in the left plot with the same color, and its coordinates in the right axes are given by . The 1:1 correspondence between the effects of a sequence of actions and the DynE space representation of that action sequence indicates that DynE is truly encoding the change in state induced by a sequence of actions.

To better understand the structure in the latent space of action sequences learned by DynE, we use a simple linear Point environment with a 2D state space and 2D action space. We render action sequences on a pair of plots in Figure 8. On the left plot we show the effect of the action sequence, measured as . Each action sequence is assigned a color based on its location in this plot. On the right plot, we show each action sequence’s location in DynE space, marking the point with that action sequence’s color. If the spaces are isomorphic, we should expect to see smooth color transitions in the right plot, indicating that the DynE representations of any pair of action sequences are ordered according to their effects. Indeed, for this simple problem, we see that the DynE space is an affine transformation of the effect space. This indicates that the DynE action space does not just have local structure (smoothness with respect to outcomes), but actually global structure: all pairs of action sequences and with similar outcomes are close together in the embedding space. The correspondence between the two spaces appears to remain strong for high-dimensional and nonlinear environments, but is much harder to render in two dimensions.
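To make the notion of "effect" concrete, the sketch below uses a hypothetical linear point environment in which each action is simply added to the state. The exact dynamics of the Point environment are not given in this appendix, so this is an assumption. Under these dynamics the effect of an action sequence is the sum of its actions, and distinct sequences can share the same effect.

```python
import numpy as np


def rollout_effect(state, actions):
    """Apply actions in an assumed linear point environment and return the change in state."""
    initial = state.copy()
    for action in actions:
        # Assumed dynamics: the next state is the current state plus the action.
        state = state + action
    return state - initial


rng = np.random.default_rng(0)
start = rng.uniform(-1.0, 1.0, size=2)
actions = rng.uniform(-1.0, 1.0, size=(4, 2))  # a sequence of four 2D actions

effect = rollout_effect(start, actions)
effect_reversed = rollout_effect(start, actions[::-1])
```

Because `effect` and `effect_reversed` coincide, an embedding that encodes only the outcome of an action sequence can map both sequences to the same point, which is the kind of structure the plots in Figure 8 are meant to reveal.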

Appendix D Extended results

Figure 9: These plots allow for direct comparison between the methods from pixels (Pixel-TD3, VAE-TD3, S-DynE-TD3, and SA-DynE-TD3) and our baselines from low-dimensional states (PPO and SAC). The DynE methods from pixels perform much better than PPO does from states. Surprisingly, SA-DynE-TD3 performs exactly the same as SAC from states on ReacherTurn.

Appendix E Exploration with raw and DynE action spaces

(a) Random exploration with raw actions
(b) Random exploration with DynE
Figure 10: These figures illustrate the way the DynE action space enables more efficient exploration. Each figure is generated by running a uniform random policy for ten episodes on a PointMass environment. Since the environment has only two position dimensions, we can plot the actual 2D position of the mass over the course of each episode. Left: A policy which selects actions at each environment timestep uniformly at random explores a very small region of the state space. Right: A policy which randomly selects DynE actions once every timesteps explores much more widely.
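The intuition behind Figure 10 can be reproduced with a simplified stand-in. The sketch below does not use a learned DynE decoder: it assumes a point whose position changes by the action at every step, and it approximates a temporally extended action by holding one random raw action for several steps. The hold length of 4, the horizon, and the episode count are illustrative choices.

```python
import numpy as np


def mean_squared_displacement(hold, episodes=500, horizon=100, seed=0):
    """Average squared distance from the origin after a uniform random policy."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(episodes):
        position = np.zeros(2)
        for t in range(horizon):
            # Resample the action only once every `hold` timesteps.
            if t % hold == 0:
                action = rng.uniform(-1.0, 1.0, size=2)
            position += action
        total += float(position @ position)
    return total / episodes


raw = mean_squared_displacement(hold=1)   # new random action at every timestep
held = mean_squared_displacement(hold=4)  # one random choice every four timesteps
```

Independent per-step actions largely cancel each other out, so `raw` stays small, while choices that commit to a consistent direction for several steps accumulate and `held` comes out larger.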