Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images

06/24/2015 ∙ by Manuel Watter, et al. ∙ Google University of Freiburg 0

We introduce Embed to Control (E2C), a method for model learning and control of non-linear dynamical systems from raw pixel images. E2C consists of a deep generative model, belonging to the family of variational autoencoders, that learns to generate image trajectories from a latent space in which the dynamics is constrained to be locally linear. Our model is derived directly from an optimal control formulation in latent space, supports long-term prediction of image sequences and exhibits strong performance on a variety of complex control problems.



There are no comments yet.


page 1

page 2

page 3

page 4

Code Repositories


Torch7 impementation of: Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images

view repo


E2C implementation in PyTorch

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Control of non-linear dynamical systems with continuous state and action spaces is one of the key problems in robotics and, in a broader context, in reinforcement learning for autonomous agents. A prominent class of algorithms that aim to solve this problem are model-based locally optimal (stochastic) control algorithms such as iLQG control

[1, 2], which approximate the general non-linear control problem via local linearization. When combined with receding horizon control [3]

, and machine learning methods for learning approximate system models, such algorithms are powerful tools for solving complicated control problems

[3, 4, 5]; however, they either rely on a known system model or require the design of relatively low-dimensional state representations. For real autonomous agents to succeed, we ultimately need algorithms that are capable of controlling complex dynamical systems from raw sensory input (e.g. images) only. In this paper we tackle this difficult problem.

If stochastic optimal control (SOC) methods were applied directly to control from raw image data, they would face two major obstacles. First, sensory data is usually high-dimensional – i.e. images with thousands of pixels – rendering a naive SOC solution computationally infeasible. Second, the image content is typically a highly non-linear function of the system dynamics underlying the observations; thus model identification and control of this dynamics are non-trivial.

While both problems could, in principle, be addressed by designing more advanced SOC algorithms we approach the “optimal control from raw images” problem differently: turning the problem of locally optimal control in high-dimensional non-linear systems into one of identifying a low-dimensional latent state space, in which locally optimal control can be performed robustly and easily. To learn such a latent space we propose a new deep generative model belonging to the class of variational autoencoders [6, 7] that is derived from an iLQG formulation in latent space. The resulting Embed to Control (E2C) system is a probabilistic generative model that holds a belief over viable trajectories in sensory space, allows for accurate long-term planning in latent space, and is trained fully unsupervised. We demonstrate the success of our approach on four challenging tasks for control from raw images and compare it to a range of methods for unsupervised representation learning. As an aside, we also validate that deep up-convolutional networks [8, 9] are powerful generative models for large images.

2 The Embed to Control (E2C) model

We briefly review the problem of SOC for dynamical systems, introduce approximate locally optimal control in latent space, and finish with the derivation of our model.

2.1 Problem Formulation

We consider the control of unknown dynamical systems of the form


where denotes the time steps, the system state, the applied control and the system noise. The function is an arbitrary, smooth, system dynamics. We equivalently refer to Equation (1) using the notation

, which we assume to be a multivariate normal distribution

. We further assume that we are only given access to visual depictions of state . This restriction requires solving a joint state identification and control problem. For simplicity we will in the following assume that is a fully observed depiction of , but relax this assumption later.

Our goal then is to infer a low-dimensional latent state space model in which optimal control can be performed. That is, we seek to learn a function , mapping from high-dimensional images

to low-dimensional vectors

with , such that the control problem can be solved using instead of :


where accounts for system noise; or equivalently

. Assuming for the moment that such a function can be learned (or approximated), we will first define SOC in a latent space and introduce our model thereafter.

2.2 Stochastic locally optimal control in latent spaces

Let be the inferred latent state from image of state and the transition dynamics in latent space, i.e., . Thus models the changes that occur in when control is applied to the underlying system as a latent space analogue to . Assuming is known, optimal controls for a trajectory of length in the dynamical system can be derived by minimizing the function which gives the expected future costs when following :


where are instantaneous costs, denotes terminal costs and and are state and action sequences respectively. If contains sufficient information about , i.e., can be inferred from alone, and is differentiable, the cost-minimizing controls can be computed from via SOC algorithms [10]. These optimal control algorithms approximate the global non-linear dynamics with locally linear dynamics at each time step . Locally optimal actions can then be found in closed form. Formally, given a reference trajectory

– the current estimate for the optimal trajectory – together with corresponding controls

the system is linearized as


where , are local Jacobians, and is an offset. To enable efficient computation of the local controls we assume the costs to be a quadratic function of the latent representation


where and are cost weighting matrices and is the inferred representation of the goal state. We also assume throughout this paper. In combination with Equation (4) this gives us a local linear-quadratic-Gaussian formulation at each time step which can be solved by SOC algorithms such as iterative linear-quadratic regulation (iLQR) [11] or approximate inference control (AICO) [12]. The result of this trajectory optimization step is a locally optimal trajectory with corresponding control sequence .

2.3 A locally linear latent state space model for dynamical systems

Starting from the SOC formulation, we now turn to the problem of learning an appropriate low-dimensional latent representation of . The representation has to fulfill three properties: (i) it must capture sufficient information about (enough to enable reconstruction); (ii) it must allow for accurate prediction of the next latent state and thus, implicitly, of the next observation ; (iii) the prediction of the next latent state must be locally linearizable for all valid control magnitudes . Given some representation , properties (ii) and (iii) in particular require us to capture possibly highly non-linear changes of the latent representation due to transformations of the observed scene induced by control commands. Crucially, these are particularly hard to model and subsequently linearize. We circumvent this problem by taking a more direct approach: instead of learning a latent space and transition model which are then linearized and combined with SOC algorithms, we directly impose desired transformation properties on the representation during learning. We will select these properties such that prediction in the latent space as well as locally linear inference of the next observation according to Equation (4) are easy.

The transformation properties that we desire from a latent representation can be formalized directly from the iLQG formulation given in Section 2.2 . Formally, following Equation (2), let the latent representation be Gaussian . To infer from we first require a method for sampling latent states. Ideally, we would generate samples directly from the unknown true posterior , which we, however, have no access to. Following the variational Bayes approach (see Jordan et al. [13] for an overview) we resort to sampling from an approximate posterior distribution with parameters .





Figure 1: The information flow in the E2C model. From left to right, we encode and decode an image with the networks and , where we use the latent code for the transition step. The network computes the local matrices with which we can predict from and . Similarity to the encoding is enforced by a KL divergence on their distributions and reconstruction is again performed by .

Inference model for .

In our work this is always a diagonal Gaussian distribution

, whose mean and covariance

are computed by an encoding neural network with outputs


where is the activation of the last hidden layer and where is given by the set of all learnable parameters of the encoding network, including the weight matrices , and biases ,

. Parameterizing the mean and variance of a Gaussian distribution based on a neural network gives us a natural and very expressive model for our latent space. It additionally comes with the benefit that we can use the

reparameterization trick [6, 7]

to backpropagate gradients of a loss function based on samples through the latent distribution.

Generative model for . Using the approximate posterior distribution we generate observed samples (images) and from latent samples and by enforcing a locally linear relationship in latent space according to Equation (4), yielding the following generative model


where is the next latent state posterior distribution, which exactly follows the linear form required for stochastic optimal control. With as an estimate of the system noise, can be decomposed as . Note that while the transition dynamics in our generative model operates on the inferred latent space, it takes untransformed controls into account. That is, we aim to learn a latent space such that the transition dynamics in linearizes the non-linear observed dynamics in and is locally linear in the applied controls . Reconstruction of an image from is performed by passing the sample through multiple hidden layers of a decoding neural network which computes the mean

of the generative Bernoulli distribution

111A Bernoulli distribution for is a common choice when modeling black-and-white images. as


where is the response of the last hidden layer in the decoding network. The set of parameters for the decoding network, including weight matrix and bias , then make up the learned generative parameters .

Transition model for . What remains is to specify how the linearization matrices , and offset are predicted. Following the same approach as for distribution means and covariance matrices, we predict all local transformation parameters from samples

based on the hidden representation

of a third neural network with parameters

– to which we refer as the transformation network. Specifically, we parametrize the transformation matrices and offset as


where vec denotes vectorization and therefore and . To circumvent estimating the full matrix of size

, we can choose it to be a perturbation of the identity matrix

which reduces the parameters to be estimated for to .

A sketch of the complete architecture is shown in Figure 1. It also visualizes an additional constraint that is essential for learning a representation for long-term predictions: we require samples from the state transition distribution to be similar to the encoding of through . While it might seem that just learning a perfect reconstruction of from is enough, we require multi-step predictions for planning in which must correspond to valid trajectories in the observed space . Without enforcing similarity between samples from and , following a transition in latent space from with action may lead to a point , from which reconstruction of is possible, but that is not a valid encoding (i.e. the model will never encode any image as ). Executing another action in

then does not result in a valid latent state – since the transition model is conditional on samples coming from the inference network – and thus long-term predictions fail. In a nutshell, such a divergence between encodings and the transition model results in a generative model that does not accurately model the Markov chain formed by the observations.

2.4 Learning via stochastic gradient variational Bayes

For training the model we use a data set containing observation tuples with corresponding controls obtained from interactions with the dynamical system. Using this data set, we learn the parameters of the inference, transition and generative model by minimizing a variational bound on the true data negative log-likelihood plus an additional constraint on the latent representation. The complete loss function222Note that this is the loss for the latent state space model and distinct from the SOC costs. is given as


The first part of this loss is the per-example variational bound on the log-likelihood


where , and are the parametric inference, generative and transition distributions from Section 2.3 and is a prior on the approximate posterior ; which we always chose to be an isotropic Gaussian distribution with mean zero and unit variance. The second KL divergence in Equation (11) is an additional contraction term with weight , that enforces agreement between the transition and inference models. This term is essential for establishing a Markov chain in latent space that corresponds to the real system dynamics (see Section 2.3 above for an in depth discussion). This KL divergence can also be seen as a prior on the latent transition model. Note that all KL terms can be computed analytically for our model (see supplementary for details).

During training we approximate the expectation in via sampling. Specifically, we take one sample for each input and transform that sample using Equation (10) to give a valid sample from . We then jointly learn all parameters of our model by minimizing using SGD.

3 Experimental Results

We evaluate our model on four visual tasks: an agent in a plane with obstacles, a visual version of the classic inverted pendulum swing-up task, balancing a cart-pole system, and control of a three-link arm with larger images. These are described in detail below.

3.1 Experimental Setup

Model training. We consider two different network types for our model: Standard fully connected neural networks with up to three layers, which work well for moderately sized images, are used for the planar and swing-up experiments; A deep convolutional network for the encoder in combination with an up-convolutional network as the decoder which, in accordance with recent findings from the literature [8, 9], we found to be an adequate model for larger images. Training was performed using Adam [14] throughout all experiments. The training data set for all tasks was generated by randomly sampling state observations and actions with corresponding successor states. For the plane we used samples, for the inverted pendulum and cart-pole system we used and for the arm

. A complete list of architecture parameters and hyperparameter choices as well as an in-depth explanation of the up-convolutional network are specified in the supplementary material. We will make our code and a video containing controlled trajectories for all systems available under

http://ml.informatik.uni-freiburg.de/research/e2c .

Model variants. In addition to the Embed to Control (E2C) dynamics model derived above, we also consider two variants: By removing the latent dynamics network , i.e. setting its output to one in Equation (10) – we obtain a variant in which , and are estimated as globally linear matrices (Global E2C). If we instead replace the transition model with a network estimating the dynamics as a non-linear function and only linearize during planning, estimating , , as Jacobians to as described in Section 2.2, we obtain a variant with nonlinear latent dynamics.

Baseline models. For a thorough comparison and to exhibit the complicated nature of the tasks, we also test a set of baseline models on the plane and the inverted pendulum task (using the same architecture as the E2C model): a standard variational autoencoder (VAE) and a deep autoencoder (AE) are trained on the autoencoding subtask for visual problems. That is, given a data set used for training our model, we remove all actions from the tuples in and disregard temporal context between images. After autoencoder training we learn a dynamics model in latent space, approximating from Section 2.2. We also consider a VAE variant with a slowness term on the latent representation – a full description of this variant is given in the supplementary material.

Optimal control algorithms. To perform optimal control in the latent space of different models, we employ two trajectory optimization algorithms: iterative linear quadratic regulation (iLQR) [11] (for the plane and inverted pendulum) and approximate inference control (AICO) [12] (all other experiments). For all VAEs both methods operate on the mean of distributions and . AICO additionally makes use of the local Gaussian covariances and . Except for the experiments on the planar system, control was performed in a model predictive control fashion using the receding horizon scheme introduced in [3]. To obtain closed loop control given an image , it is first passed through the encoder to obtain the latent state . A locally optimal trajectory is subsequently found by optimizing with fixed, small horizon (with unless noted otherwise). Controls are applied to the system and a transition to is observed (by encoding the next image ). Then a new control sequence – with horizon – starting in is found using the last estimated trajectory as a bootstrap. Note that planning is performed entirely in the latent state without access to any observations except for the depiction of the current state. To compute the cost function required for trajectory optimization in we assume knowledge of the observation of the goal state . This observation is then transformed into latent space and costs are computed according to Equation (5).

3.2 Control in a planar system

The agent in the planar system can move in a bounded two-dimensional plane by choosing a continuous offset in x- and y-direction. The high-dimensional representation of a state is a black-and-white image. Obstructed by six circular obstacles, the task is to move to the bottom right of the image, starting from a random x position at the top of the image. The encodings of obstacles are obtained prior to planning and an additional quadratic cost term is penalizing proximity to them.

Figure 2: The true state space of the planar system (left) with examples (obstacles encoded as circles) and the inferred spaces (right) of different models. The spaces are spanned by generating images for every valid position of the agent and embedding them with the respective encoders.

A depiction of the observations on which control is performed – together with their corresponding state values and embeddings into latent space – is shown in Figure 2. The figure also clearly shows a fundamental advantage the E2C model has over its competitors: While the separately trained autoencoders make for aesthetically pleasing pictures, the models failed to discover the underlying structure of the state space, complicating dynamics estimation and largely invalidating costs based on distances in said space. Including the latent dynamics constraints in these end-to-end models on the other hand, yields latent spaces approaching the optimal planar embedding.

We test the long-term accuracy by accumulating latent and real trajectory costs to quantify whether the imagined trajectory reflects reality. The results for all models when starting from random positions at the top and executing pre-computed actions are summarized in Table 1 – using a seperate test set for evaluating reconstructions. While all methods achieve a low reconstruction loss, the difference in accumulated real costs per trajectory show the superiority of the E2C model. Using the globally or locally linear E2C model, trajectories planned in latent space are as good as trajectories planned on the real state. All models besides E2C fail to give long-term predictions that result in good performance.

Algorithm State Loss Next State Loss Trajectory Cost Success
log log Latent Real percent
Planar System
AE 11.5 97.8 3538.9 1395.2 1325.6 81.2 273.3 16.4 0
VAE 3.6 18.9 652.1 930.6 43.1 20.8 91.3 16.4 0
VAE + slowness 10.5 22.8 104.3 235.8 47.1 20.5 89.1 16.4 0
Non-linear E2C 8.3 5.5 11.3 10.1 19.8 9.8 42.3 16.4 96.6
Global E2C 6.9 3.2 9.3 4.6 12.5 3.9 27.3 9.7 100
E2C 7.7 2.0 9.7 3.2 10.3 2.8 25.1 5.3 100

Inverted Pendulum Swing-Up
AE 8.9 100.3 13433.8 6238.8 1285.9 355.8 194.7 44.8 0
VAE 7.5 47.7 8791.2 17356.9 497.8 129.4 237.2 41.2 0
VAE + slowness 26.5 18.0 779.7 633.3 419.5 85.8 188.2 43.6 0
E2C no latent KL 64.4 32.8 87.7 64.2 489.1 87.5 213.2 84.3 0
Non-linear E2C 59.6 25.2 72.6 34.5 313.3 65.7 37.4 12.4 63.33
Global E2C 115.5 56.9 125.3 62.6 628.1 45.9 125.1 10.7 0
E2C 84.0 50.8 89.3 42.9 275.0 16.6 15.4 3.4 90
Table 1: Comparison between different approaches to model learning from raw pixels for the planar and pendulum system. We compare all models with respect to their prediction quality on a test set of sampled transitions and with respect to their performance when combined with SOC (trajectory cost for control from different start states). Note that trajectory costs in latent space are not necessarily comparable. The “real” trajectory cost was computed on the dynamics of the simulator while executing planned actions. For the true models for , real trajectory costs were for the planar system, and for the pendulum. Success was defined as reaching the goal state and staying -close to it for the rest of the trajectory (if non terminating). All statistics quantify over 5/30 (plane/pendulum) different starting positions. A marks separately trained dynamics networks.

3.3 Learning swing-up for an inverted pendulum

We next turn to the task of controlling the classical inverted pendulum system [15] from images. We create depictions of the state by rendering a fixed length line starting from the center of the image at an angle corresponding to the pendulum position. The goal in this task is to swing-up and balance an underactuated pendulum from a resting position (pendulum hanging down). Exemplary observations and reconstructions for this system are given in Figure 3(d). In the visual inverted pendulum task our algorithm faces two additional difficulties: the observed space is non-Markov, as the angular velocity cannot be inferred from a single image, and second, discretization errors due to rendering pendulum angles as small 48x48 pixel images make exact control difficult. To restore the Markov property, we stack two images (as input channels), thus observing a one-step history.




Figure 3: (a) The true state space of the inverted pendulum task overlaid with a successful trajectory taken by the E2C agent. (b) The learned latent space. (c) The trajectory from (a) traced out in the latent space. (d) Images and reconstructions showing current positions (right) and history (left).

Figure 3 shows the topology of the latent space for our model, as well as one sample trajectory in true state and latent space. The fact that the model can learn a meaningful embedding, separating velocities and positions, from this data is remarkable (no other model recovered this shape). Table 1 again compares the different models quantitatively. While the E2C model is not the best in terms of reconstruction performance, it is the only model resulting in stable swing-up and balance behavior. We explain the failure of the other models with the fact that the non-linear latent dynamics model cannot be guaranteed to be linearizable for all control magnitudes, resulting in undesired behavior around unstable fixpoints of the real system dynamics, and that for this task a globally linear dynamics model is inadequate.

3.4 Balancing a cart-pole and controlling a simulated robot arm

Finally, we consider control of two more complex dynamical systems from images using a six layer convolutional inference and six layer up-convolutional generative network, resulting in a 12-layer deep path from input to reconstruction. Specifically, we control a visual version of the classical cart-pole system [16] from a history of two pixel images as well as a three-link planar robot arm based on a history of two pixel images. The latent space was set to be 8-dimensional in both experiments. The real state dimensionality for the cart-pole is four and is controlled using one action, while for the arm the real state can be described in 6 dimensions (joint angles and velocities) and controlled using a three-dimensional action vector corresponding to motor torques.

As in previous experiments the E2C model seems to have no problem finding a locally linear embedding of images into latent space in which control can be performed. Figure 4 depicts exemplary images – for both problems – from a trajectory executed by our system. The costs for these trajectories ( for the cart-pole, for the arm) are only slightly worse than trajectories obtained by AICO operating on the real system dynamics starting from the same start-state ( and respectively). The supplementary material contains additional experiments using these domains.

Figure 4: Left: Trajectory from the cart-pole domain. Only the first image (green) is “real”, all other images are “dreamed up” by our model. Notice discretization artifacts present in the real image. Right: Exemplary observed (with history image omitted) and predicted images (including the history image) for a trajectory in the visual robot arm domain with the goal marked in red.

4 Comparison to recent work

In the context of representation learning for control (see Böhmer et al. [17] for a review), deep autoencoders (ignoring state transitions) similar to our baseline models have been applied previously, e.g. by Lange and Riedmiller [18]. A more direct route to control based on image streams is taken by recent work on (model free) deep end-to-end Q-learning for Atari games by Mnih et al. [19], as well as kernel based [20] and deep policy learning for robot control [21].

Close to our approach is a recent paper by Wahlström et al. [22], where autoencoders are used to extract a latent representation for control from images, on which a non-linear model of the forward dynamics is learned. Their model is trained jointly and is thus similar to the non-linear E2C variant in our comparison. In contrast to our model, their formulation requires PCA pre-processing and does neither ensure that long-term predictions in latent space do not diverge, nor that they are linearizable.

As stated above, our system belongs to the family of VAEs and is generally similar to recent work such as Kingma and Welling [6], Rezende et al. [7], Gregor et al. [23], Bayer and Osendorfer [24]. Two additional parallels between our work and recent advances for training deep neural networks can be observed. First, the idea of enforcing desired transformations in latent space during learning – such that the data becomes easy to model – has appeared several times already in the literature. This includes the development of transforming auto-encoders [25] and recent probabilistic models for images [26, 27]. Second, learning relations between pairs of images – although without control – has received considerable attention from the community during the last years [28, 29]

. In a broader context our model is related to work on state estimation in Markov decision processes (see

Langford et al. [30]

for a discussion) through, e.g., hidden Markov models and Kalman filters

[31, 32].

5 Conclusion

We presented Embed to Control (E2C), a system for stochastic optimal control on high-dimensional image streams. Key to the approach is the extraction of a latent dynamics model which is constrained to be locally linear in its state transitions. An evaluation on four challenging benchmarks revealed that E2C can find embeddings on which control can be performed with ease, reaching performance close to that achievable by optimal control on the real system model.


We thank A. Radford, L. Metz, and T. DeWolf for sharing code, as well as A. Dosovitskiy for useful discussions. This work was partly funded by a DFG grant within the priority program “Autonomous learning” (SPP1597) and the BrainLinks-BrainTools Cluster of Excellence (grant number EXC 1086). M. Watter is funded through the State Graduate Funding Program of Baden-Württemberg.


  • Jacobson and Mayne [1970] D. Jacobson and D. Mayne. Differential dynamic programming. American Elsevier, 1970.
  • Todorov and Li [2005] E. Todorov and W. Li. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In ACC. IEEE, 2005.
  • Tassa et al. [2008] Y. Tassa, T. Erez, and W. D. Smart. Receding horizon differential dynamic programming. In Proc. of NIPS, 2008.
  • Pan and Theodorou [2014] Y. Pan and E. Theodorou. Probabilistic differential dynamic programming. In Proc. of NIPS, 2014.
  • Levine and Koltun [2013] S. Levine and V. Koltun. Variational policy search via trajectory optimization. In Proc. of NIPS, 2013.
  • Kingma and Welling [2014] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In Proc. of ICLR, 2014.
  • Rezende et al. [2014] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proc. of ICML, 2014.
  • Zeiler et al. [2010] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus. Deconvolutional networks. In CVPR, 2010.
  • Dosovitskiy et al. [2015] A. Dosovitskiy, J. T. Springenberg, and T. Brox.

    Learning to generate chairs with convolutional neural networks.

    In Proc. of CVPR, 2015.
  • Stengel [1994] R. F. Stengel. Optimal Control and Estimation. Dover Publications, 1994.
  • Li and Todorov [2004] W. Li and E. Todorov. Iterative Linear Quadratic Regulator Design for Nonlinear Biological Movement Systems. In Proc. of ICINCO, 2004.
  • Toussaint [2009] M. Toussaint. Robot Trajectory Optimization using Approximate Inference. In Proc. of ICML, 2009.
  • Jordan et al. [1999] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. In Machine Learning, 1999.
  • Kingma and Ba [2015] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proc. of ICLR, 2015.
  • Wang et al. [1996] H. Wang, K. Tanaka, and M. Griffin. An approach to fuzzy control of nonlinear systems; stability and design issues. IEEE Trans. on Fuzzy Systems, 4(1), 1996.
  • Sutton and Barto [1998] R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998. ISBN 0262193981.
  • Böhmer et al. [2015] W. Böhmer, J. T. Springenberg, J. Boedecker, M. Riedmiller, and K. Obermayer. Autonomous learning of state representations for control. KI - Künstliche Intelligenz, 2015.
  • Lange and Riedmiller [2010] S. Lange and M. Riedmiller. Deep auto-encoder neural networks in reinforcement learning. In Proc. of IJCNN, 2010.
  • Mnih et al. [2015] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540), 02 2015.
  • van Hoof et al. [2015] H. van Hoof, J. Peters, and G. Neumann. Learning of non-parametric control policies with high-dimensional state features. In Proc. of AISTATS, 2015.
  • Levine et al. [2015] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. CoRR, abs/1504.00702, 2015. URL http://arxiv.org/abs/1504.00702.
  • Wahlström et al. [2015] N. Wahlström, T. B. Schön, and M. P. Deisenroth. From pixels to torques: Policy learning with deep dynamical models. CoRR, abs/1502.02251, 2015. URL http://arxiv.org/abs/1502.02251.
  • Gregor et al. [2015] K. Gregor, I. Danihelka, A. Graves, D. Rezende, and D. Wierstra.

    DRAW: A recurrent neural network for image generation.

    In Proc. of ICML, 2015.
  • Bayer and Osendorfer [2014] J. Bayer and C. Osendorfer. Learning stochastic recurrent networks. In NIPS 2014 Workshop on Advances in Variational Inference, 2014.
  • Hinton et al. [2011] G. Hinton, A. Krizhevsky, and S. Wang. Transforming auto-encoders. In Proc. of ICANN, 2011.
  • Dinh et al. [2015] L. Dinh, D. Krueger, and Y. Bengio. Nice: Non-linear independent components estimation. CoRR, abs/1410.8516, 2015. URL http://arxiv.org/abs/1410.8516.
  • Cohen and Welling [2015] T. Cohen and M. Welling. Transformation properties of learned visual representations. In ICLR, 2015.
  • Taylor et al. [2010] G. W. Taylor, L. Sigal, D. J. Fleet, and G. E. Hinton. Dynamical binary latent variable models for 3d human pose tracking. In Proc. of CVPR, 2010.
  • Memisevic [2013] R. Memisevic. Learning to relate images. IEEE Trans. on PAMI, 35(8):1829–1846, 2013.
  • Langford et al. [2009] J. Langford, R. Salakhutdinov, and T. Zhang. Learning nonlinear dynamic models. In ICML, 2009.
  • West and Harrison [1997] M. West and J. Harrison. Bayesian Forecasting and Dynamic Models (Springer Series in Statistics). Springer-Verlag, February 1997. ISBN 0387947256.
  • Matsubara et al. [2014] T. Matsubara, V. Gómez, and H. J. Kappen. Latent Kullback Leibler control for continuous-state systems using probabilistic graphical models. UAI, 2014.
  • Kulkarni et al. [2015] T. D. Kulkarni, W. Whitney, P. Kohli, and J. B. Tenenbaum. Deep convolutional inverse graphics network. CoRR, abs/1503.03167, 2015. URL http://arxiv.org/abs/1503.03167.
  • Osendorfer et al. [2014] C. Osendorfer, H. Soyer, and P. van der Smagt.

    Image super-resolution with fast approximate convolutional sparse coding.

    In Proc. of ICONIP, Lecture Notes in Computer Science. Springer International Publishing, 2014.
  • Jonschkowski and Brock [2014] R. Jonschkowski and O. Brock. State representation learning in robotics: Using prior knowledge about physical interaction. In Proc. of RSS, 2014.
  • Legenstein et al. [2010] R. Legenstein, N. Wilbert, and L. Wiskott. Reinforcement learning on slow features of high-dimensional input streams. PLoS Computational Biology, 2010.
  • Zou et al. [2011] W. Zou, A. Ng, and K. Yu. Unsupervised learning of visual invariance with temporal coherence. In

    NIPS*2011 Workshop on Deep Learning and Unsupervised Feature Learning

    , 2011.
  • Saxe et al. [2014] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In Proc. of ICLR, 2014.
  • Glorot et al. [2011] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In AISTATS. Journal of Machine Learning Research - Workshop and Conference Proceedings, 2011.

Appendix A Supplementary to the E2C description

a.1 State transition matrix factorization and KL Divergence

As alluded to in the main paper, estimation of the full local state transition matrix from Equation (8) requires the transition network to predict parameters. Using an arbitrary state transition matrix also – inconveniently – requires inversion of said matrix for computing the KL divergence penalty from Equation (11) (through which it is hard to backpropagate). We started our experiments using a full matrix (and only approximating all KL divergence terms), but quickly found that a rank one pertubation of the identity matrix could be used instead without loss of performance in any of our benchmarks. To the contrary, the resulting networks have fewer parameters and are thus easier to train. We here give the derivation of this process and how the KL divergence from Equation (11) can be computed. For the reformulation we represent as , therefore only and need to be estimated by the transition network, reducing the number of outputs for from to .

The KL divergence between two multivariate Gaussians is given by


For a simplified notation, such that , let us assume

The main point behind the derivation presented in the following, is to make partial derivatives of the above KL divergence efficiently computable. To this end, we cannot take the trace or the determinant via numerical algorithms, because we have to be able to take the gradients in symbolic form. Aside from that, we like to process a batch of samples, so the computation should have a convenient form and not require excessive amounts of tensor products in between. We start our simplification with the trace term which results in

The last equation is easy to implement and only requires summing over the non-batch dimension. The difference of means can be derived very quickly with the same summing scheme:

It remains the ratio of determinants, which we will simplify with the matrix determinant lemma giving

Matrix determinant lemma

Putting the above to formulas together finally yields


Appendix B Supplementary to the experimental setup

b.1 Up-convolution

We used convolutional inference networks for the cart-pole and three-link arm task. While these networks help us overcome the problem of large input dimensionalities (i.e. pixel images in the three-link arm task), we still have to generate full resolution images with the decoder network. For high-dimensional images generation fully connected neural networks are simply not an option. We thus decided to use up-convolutional networks, which were recently show to be powerful models for image generation [8, 9, 33].

To set-up these models we basically “mirror” the convolutional architecture used for the encoder. More specifically for each convolution followed by max-pooling step in the encoder network, we introduce a up-sampling and convolution step in the decoder network. The complete network architecture is given below. It is similar to the up-convolution networks used in Dosovitskiy et al. [9]. The upsampling strategy we use is simple “perforated” upsampling as described in [34].

b.2 Variational Autoencoder with slowness

Enforcing temporal slowness during learning has previously been found to be a good proxy for learning representations in reinforcement learning [35, 36] and representation learning from videos [37]. We also consider a VAE variant with a slowness term on the latent representation by enforcing similarity of the encodings of temporally close images. This can be achieved by augmenting the standard VAE objective with an additional KL divergence term on the latent posterior :


Indeed there seems to be a slightly better coherence of similar states in the latent spaces, as e.g. depicted in Figure 8 in the main paper. Yet, our experiments show that a slowness term alone does not suffice to structure the latent space, such that locally linear predictions and control become feasible.

b.3 Evaluation criteria

For comparing the performance of all variants of E2C and the baselines, the following criteria are of importance:

  • Autoencoding. Being able to reconstruct the given observations is the basic necessity for a model to work. The reconstruction cost drives a model to identify single states from its observations.

  • Decoding the next state. For any planning to be possible at all, the decoder must be able to generate the correct images from transitions the dynamics model performed. If this is not the case, we know that the latent states of the encoding and the transition model do not coincide, thus preventing any planning.

  • Optimizing latent trajectory costs. The action sequences for achieving a specified goal will be determined completely by locally linearized dynamics in the latent space. Therefore minimizing trajectory costs in latent space is, again, a necessity for successful control.

  • Optimizing real trajectory costs. While the action sequence has been determined for the latent dynamics, the deciding criterion is whether this reflects the true state trajectory costs. Therefore carrying out the ”dreamed” plans in reality is the optimality criterion for every model. To make the different models comparable, we use the same cost matrices for evaluation, which are not necessarily the same as for optimization.

We reflected these four criteria in the evaluation table in the paper. For the reconstruction of the current and next state we specified the mean log loss, which is in case of the Bernoulli distributions the cross entropy error function:


For the costs a model imagines and truly achieves, we sample from different starting states and accumulate the distances in latent and true state space according to the SOC method.

b.4 The three-link robot arm

The robot arm we used in the last experiment in the main paper was simulated using dynamics generated by the MapleSim http://www.maplesoft.com/products/maplesim/ simulator wrapped in Python and visualized for producing inputs to E2C using PyGame. We simulated a fairly standard robot arm with three links. The length of the links were set to , and (units were set to meters). The masses of the corresponding links were all set to .

b.5 Evaluating the true system model

To compare the efficacy of different models when combined with optimal control algorithms, we always reported the cost in latent space (as used by the optimal control algorithm) as well as the “real” trajectory cost. To compute this real cost, we evaluated the same cost function as in the latent space (quadratic costs on the deviation from a given goal state), but using the real system states during execution and different cost matrices for a fair comparison.

As an upper bound on the performance achievable for control by any of the models, we also computed the true system cost by applying iLQR/AICO to a model of the real system dynamics. We have this model available since all experiments were performed in simulation.

b.6 Neural Network training

b.6.1 Experimental Setup

All the datasets were created in advance as for the training, validation and test split. While the E2C models were trained on , the ones that do not incorporate any transition information (i.e. AE, VAE) were trained on images extracted from the original dataset . The slowness VAE was trained on the pairs of images subset and our E2C models on the full .

In order to learn dynamics predictions for the image-only autoencoders, we extracted the latent representations and combined them with the actions from into . On these low-dimensional representations we trained the dynamics MLPs, thus ensuring that all methods were trained on exactly the same data.

b.6.2 Implementation details

We used orthogonal weight initialization for every layer [38]. As described in the main paper, Adam [14]

was used as the learning rule for all networks. We found both these techniques to be fundamentally important for stabilizing training and achieving good reconstructions for all methods. Both methods also clearly helped to cut the hyperparameter search needed for all methods to a minimum. In the process of training, we could make out three phases: the unfolding of the latent space, the overcoming of the trivial solution (the average image of the dataset) and the minimization of the latent KL term. The architectures used for our experiments were as follows (where ReLU stands for rectified linear units

[39] and conv. for convolutions):

  • [noitemsep]

  • Input: image dimensions, action dimensions

  • Latent Space dimensionality:

  • Encoder: 150 ReLU - 150 ReLU - 150 ReLU - 4 Linear (2 for AE)

  • Decoder: 200 ReLU - 200 ReLU - 1600 Linear (Sigmoid for AE)

  • Dynamics: 100 ReLU - 100 ReLU + Output layer (except Global E2C)

    • [noitemsep]

    • AE, VAE, VAE with slowness, Non-linear E2C: 2 Linear

    • E2C: 8 Linear ( for , for , 2 for ),

  • Adam:

  • Evaluation costs: , ,

Pendulum swing-up
  • [noitemsep]

  • Input: image dimensions, action dimension

  • Latent Space dimensionality:

  • Encoder: 800 ReLU - 800 ReLU - 6 Linear (3 for AE)

  • Decoder: 800 ReLU - 800 ReLU - 4608 Linear (Sigmoid for AE)

  • Dynamics: 100 ReLU - 100 ReLU + Output layer (except Global E2C)

    • [noitemsep]

    • AE, VAE, VAE with slowness, Non-linear E2C: 3 Linear

    • E2C: 12 Linear ( for , for , 3 for ),

  • Adam:

  • Evaluation costs: ,

Cart-Pole balancing
  • [noitemsep]

  • Input: image dimensions, action dimension

  • Latent Space dimensionality:

  • Encoder: ReLU - ReLU - ReLU - 512 ReLU - 512 ReLU

  • Decoder: 512 ReLU - 512 ReLU - up-sampling - ReLU - up-sampling - ReLU - up-sampling - conv. ReLU

  • Dynamics: 200 ReLU - 200 ReLU + 32 Linear ( for , for , for ),

  • Adam:

  • Evaluation costs: ,

Three-link arm
  • [noitemsep]

  • Input: image dimensions, action dimensions

  • Latent Space dimensionality:

  • Encoder: conv. ReLU - max-pooling - conv. ReLU - max-pooling - conv. ReLU - max-pooling - 512 ReLU - 512 ReLU

  • Decoder: 512 ReLU - 512 ReLU - up-sampling - ReLU - up-sampling - ReLU - up-sampling - conv. ReLU

  • Dynamics: 200 ReLU - 200 ReLU + 48 Linear ( for , for , for ),

  • Adam:

  • Evaluation costs: ,

Figure 5: Generated “dreamed” trajectories of different models for the plane task (from left to right). The opacity of the obstacles has been lowered in this depiction for better visibility of the agent.

Appendix C Supplementary evaluations

c.1 Trajectories for plane and pendulum

To qualitatively measure the predictive accuracy, the starting state for a trajectory is encoded and the actions are applied on the latent representation. After each transition, the predicted latent position is decoded and visualized. In this manner, multi-step predictions can be generated for the planar system in Figure 5 and for the inverted pendulum in Figures 6 and 7.

Figure 6: Generated “dreamed” trajectories (from left to right) for passive dynamics: the pendulum starts with angle without velocity. The models have to predict the dynamics, while no force is applied.
Figure 7: Dreamed trajectories (from left to right) for controlled dynamics: the pendulum starts with angle without velocity. For 6 timesteps, full force is applied to the right, followed by 4 timesteps of full force to the left.

c.2 Inverted pendulum latent space

Encoding the pendulum depictions into a 3-dimensional latent space allows for a visual comparison in Figure 8 .

Figure 8: Latent spaces of all baseline models and E2C variants for the inverted pendulum.

c.3 Trajectories for cart-pole and three-link arm

Finally – similar to the images in Section C.1 – Figure 9 shows multi-step predictions for the cart-pole system. We depict important cases: (1) a long-term prediction with the cart-pole standing still (essentially the unstable fix-point of the underlying dynamics); (2) the cart-pole moving to the right, changing the direction of the poles angular velocity (middle column); (3) and the pole moving farthest to the right. The long-term predictions by the E2C model are all of high quality. Note that for the uncontrolled dynamics the predictions show a slight bias of the pole moving to the right (an effect that we consistently saw in trained models for the cart-pole). We attribute this problem to the fact that discretization errors in the image rendering process of the pole angle make it hard to predict small velocities accurately.

Figure 9: Dreamed trajectories (top to bottom) for uncontrolled (left column) and controlled (middle/right column) dynamics in the cart-pole system. The red image shows the initial configuration, which is encoded resulting in . The images in the right half of each column are then generated without additional input by following the dynamics in latent space. The left column depicts the uncontrolled case ( for all steps). The middle column shows a controlled trajectory with torque applied in each step and the right column a trajectory with torque applied in each step. Prediction of the history image is omitted in these depictions.

c.4 Exemplary trajectory taken for three-link arm task

Figure 10 shows a segment of a controlled trajectory for the three-link arm as executed by the E2C system. Note that, in contrast to other figures in this supplementary material, it does not show a long-term prediction but rather 10 steps of a trajectory (together with one-step-ahead predictions) that was taken by the E2C system when combined with model predictive control. For additional visualizations and controlled trajectories for all tasks we refer to the supplementary video.

c.5 Comparison of different models for cart-pole and robot arm

In Table 2 we compare our variety of models in terms of real trajectory cost and task success percentage on the cart-pole and the robot arm. All results are averaged over 30 different starting states with a fixed goal state.

The cart-pole always starts in the goal state (zero angle and zero velocity) with small additive Gaussian noise (). Success is defined as preventing the pole from falling below an angle of rad. The three-link arm system begins in a random configuration and the goal is to to unroll all joints (e.g. make all angles zero) and stay -close to that position.

The results show that only E2C and its non-linear variant can perform this task successfully, although there is still a large performance gap between the two. We conclude, that the error of linearizing non-linear dynamics after training the corresponding model grows to the point of no longer allowing accurate control for the system.

Algorithm True model VAE + slownes E2C no latent KL Non-linear E2C E2C
Cart-Pole balance
Traj. Cost 15.33 7.70 49.12 16.94 48.90 17.88 31.96 13.26 22.23 14.89
Success 100 0 0 63 93
Three-link arm
Traj. Cost 59.46 1275.53 864.66 1246.69 262.6 460.40 82.18 90.23 47.38
Success 100 0 0 40 90
Table 2: Comparison between trajectory costs of different approaches for the cart-pole and three-link task. The standard Autoencoder, Variational Autoencoder and Global E2C model are omitted from the table as they failed on this task (performance similar to VAE with slowness).

c.6 Comparison of trajectory optimizers for cart-pole and robot arm

To compare how well AICO deals with the covariance matrices estimated in latent space we performed an additional experiment on the cart-pole and three-link robot arm task comparing it to iLQR. We performed model predictive control using the locally linear E2C model starting in 10 different start states each. The remaining settings are as given in Section C.5.

As reported in Table 3, both methods performed about the same for these tasks, indicating that the covariance matrices estimated by our model do not “hurt” planning, but considering them does not improve performance either.

Method iLQR AICO
True model
Three-Link Robot Arm
True model
Table 3: Comparison between AICO and iLQR based on the “real” cost for controlling the cart-pole and three-link robot arm using convolutional networks.
Figure 10: Frames extracted from a trajectory (top to bottom) as executed by the Embed to Control system. The left column shows the real images corresponding to transitions taken in the MDP. Middle and right column show the prediction of history image and current image based on the previous two images.