1 Introduction
Model-based reinforcement learning (RL) methods use known or learned models in a variety of ways, such as planning through the model and generating synthetic experience (Sutton, 1990; Kober et al., 2013). On simple, low-dimensional tasks, model-based approaches have demonstrated remarkable data efficiency, learning policies for systems like cartpole swing-up with under 30 seconds of experience (Deisenroth et al., 2014; Moldovan et al., 2015). However, for more complex domains, one of the main difficulties in applying model-based methods is modeling bias: if control or policy learning is performed against an imperfect model, performance in the real world will typically degrade with model inaccuracy (Deisenroth et al., 2014). Many model-based methods rely on accurate forward prediction for planning (Nagabandi et al., 2018; Chua et al., 2018), and for image-based domains, this precludes the use of simple models, which will introduce significant modeling bias. However, complex, expressive models must typically be trained on very large datasets, corresponding to days to weeks of data collection, in order to generate accurate forward predictions of images (Finn & Levine, 2017; Pinto & Gupta, 2016; Agrawal et al., 2016).
How can we use model-based methods to learn from images with data efficiency similar to what we have seen in simpler domains? In our work, we focus on removing the need for accurate forward prediction, using what we term local model methods. These methods use simple models, typically linear models, to provide gradient directions for local policy improvement, rather than for forward prediction and planning (Todorov & Li, 2005; Levine & Abbeel, 2014). Thus, local model methods circumvent the need for accurate predictive models, but these methods cannot be directly applied to image-based tasks because image dynamics, even locally speaking, are highly nonlinear.
Our main contribution is a representation learning and model-based RL procedure, which we term stochastic optimal control with latent representations (SOLAR), that jointly optimizes a latent representation and model such that inference produces local models that provide good gradient directions for policy improvement. As shown in Figure 1, SOLAR is able to learn policies directly from high-dimensional image observations in several domains, including a real robotic arm stacking blocks and pushing objects with only one to two hours of data collection. To our knowledge, SOLAR is the most efficient RL method for solving real-world robotics tasks directly from raw images. We also demonstrate several additional advantages of our method, including the ability to transfer learned models in the multi-task RL setting and the ability to handle sparse reward settings with a set of goal images.
2 Preliminaries
We formalize our setting as a partially observed Markov decision process (POMDP) environment, which is given by the tuple . Most prior work in model-based RL assumes the fully observed RL setting where the observation space is the same as the state space and the observation density function provides the exact state, so we will first discuss this setting. In this setting, the state space , action space , and horizon are known, but the dynamics function , cost function , and initial state distribution are unknown. RL agents interact with the environment via a policy that chooses an action conditioned on the current state, and the environment responds with the next state, sampled from the dynamics function, and the cost, evaluated through the cost function. The goal of RL is to minimize, with respect to the agent’s policy, the expected sum of costs . Local model methods iteratively fit dynamics and cost models to data collected from the current policy in order to optimize . One particularly tractable and popular model is the linear-quadratic system (LQS), which models the dynamics as time-varying linear-Gaussian (TVLG) and the cost as quadratic, i.e.,
Any deterministic policy operating in an environment with smooth dynamics can be locally modeled with a time-varying LQS (Boyd & Vandenberghe, 2004), while low-entropy stochastic policies are modeled approximately. This makes the time-varying LQS a reasonable local model for many dynamical systems. Furthermore, the optimal maximum-entropy policy under the model is linear-Gaussian state feedback (Jacobson & Mayne, 1970), i.e.,
We describe how to compute the parameters , , and in Appendix A. Due to modeling bias, the policy computed through LQR likely will not perform well in the real environment. This is because the model will not be globally correct but rather only valid close to the distribution of the data-collecting policy. One approach to addressing this issue is to use LQR with fitted linear models (LQR-FLM; Levine & Abbeel, 2014), a method which imposes a KL-divergence constraint on the policy update such that the shift in the trajectory distributions before and after the update, which we denote as and , respectively, is bounded by a step size . This leads to the constrained optimization
(1) 
As shown in Levine & Abbeel (2014), this constrained optimization can be solved by augmenting the cost function to penalize the deviation from the previous policy , i.e., . Note that this augmented cost function is still quadratic, since the policy is linear-Gaussian, and thus we can still compute the optimal policy for this cost function in closed form using the LQR procedure. is a dual variable that trades off between optimizing the original cost and staying close in distribution to the previous policy, and the weight of this term can be determined through a dual gradient descent procedure.
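To make the closed-form claim concrete, the following is a minimal NumPy sketch of one such update, under assumptions that are ours rather than the paper's: the cost is written as 0.5 x^T C x + c^T x over the stacked vector x = [s; a], the previous policy is a ~ N(K s + k, S), and the augmented cost is taken to be the original cost divided by the dual variable minus the log-probability of the previous policy. The function and argument names (`augment_cost`, `lqr_backward`, `lam`, `ds`) are hypothetical, and the dual gradient descent that selects `lam` is omitted.

```python
import numpy as np


def augment_cost(C, c, K, k, S, lam):
    """Quadratic form of cost/lam - log pi_prev(a | s), constants dropped.

    Cost (assumed form): 0.5 * x^T C x + c^T x with x = [s; a].
    Previous policy (assumed form): a ~ N(K s + k, S).
    """
    da = K.shape[0]
    P = np.linalg.inv(S)                 # precision of the previous policy
    W = np.hstack([-K, np.eye(da)])      # W @ x = a - K s
    return C / lam + W.T @ P @ W, c / lam - W.T @ P @ k


def lqr_backward(F, f, C, c, ds):
    """Backward pass for s' = F[t] @ x + f[t], cost 0.5 x^T C[t] x + c[t]^T x.

    Returns per-step gains K[t], offsets k[t], and covariances inv(Q_aa),
    describing the linear-Gaussian policy a ~ N(K[t] s + k[t], inv(Q_aa)).
    """
    T = len(F)
    V, v = np.zeros((ds, ds)), np.zeros(ds)   # value function past the horizon
    Ks, ks, covs = [None] * T, [None] * T, [None] * T
    for t in reversed(range(T)):
        # Quadratic expansion of the Q-function at time t.
        Q = C[t] + F[t].T @ V @ F[t]
        q = c[t] + F[t].T @ (V @ f[t] + v)
        Qsa, Qas, Qaa = Q[:ds, ds:], Q[ds:, :ds], Q[ds:, ds:]
        Qaa_inv = np.linalg.inv(Qaa)
        K, k = -Qaa_inv @ Qas, -Qaa_inv @ q[ds:]
        # Value function after substituting the optimal action.
        V = Q[:ds, :ds] + Qsa @ K + K.T @ Qas + K.T @ Qaa @ K
        v = q[:ds] + Qsa @ k + K.T @ q[ds:] + K.T @ Qaa @ k
        Ks[t], ks[t], covs[t] = K, k, Qaa_inv
    return Ks, ks, covs
```

In this sketch, one policy update would call `augment_cost` at every time step with the previous policy's parameters and then pass the augmented cost terms to `lqr_backward`; because the augmentation only adds quadratic and linear terms, the backward pass itself is unchanged.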
Methods based on LQR have enjoyed considerable success in a number of control domains, including learning tasks on real robotic systems (Todorov & Li, 2005; Levine et al., 2016). However, most prior work in model-based RL assumes access to a low-dimensional state representation, and this precludes these methods from operating on complex observations such as images. There is some work on lifting this restriction: for example, Watter et al. (2015) and Banijamali et al. (2018) combine LQR-based control with a representation learning scheme based on the variational autoencoder (VAE; Kingma & Welling, 2014; Rezende et al., 2014) where images are encoded into a learned low-dimensional representation that is used for modeling and control. They demonstrate success on learning several continuous control domains directly from pixel observations. We discuss our method’s relationship to this work in Section 6.
3 Learning and Modeling the Latent Space
Representation learning is a promising approach for integrating local models with complex observation spaces like images. What are the desired properties for a learned representation to be useful for local model methods? A simple answer is that local model fitting in a latent space that is low-dimensional and regularized will be more accurate than fitting directly to image observations. Concretely, one approach that satisfies these properties is to embed observations using a standard VAE, where regularization comes in the form of a unit Gaussian prior. However, a VAE representation still may not be amenable to local model fitting since the latent state is not optimized for dynamics and cost modeling. Since we aim to infer local dynamics and cost models in the neighborhood of the observed data, the main property we require from the latent representation is to make this fitting process more accurate for the observed trajectories, thereby reducing modeling bias and enabling a local model method to better improve the policy.
As we discuss in subsection 3.1, in order to make the local model fitting more accurate, especially in the low-data regime, we learn global dynamics and cost models on all observed data jointly with the latent representation. Our formulation allows us to directly optimize the latent representation to be amenable for fitting linear dynamics and quadratic cost models, and subsection 3.2 details the learning procedure. Section 4 describes how, using our learned representation and global model as a starting point, we can infer local models that accurately explain the observed data. In this case, the local TVLG dynamics become latent variables in the model. As shown in Figure 2, updating the policy can then be done simply by rolling out a few trajectories, inferring the posterior over the latent TVLG dynamics, and using these dynamics and a local quadratic cost model to improve the policy. This procedure becomes the basis for the SOLAR algorithm which we present in Section 5.
3.1 The Deep Bayesian LQS Model
In our problem setting, we have access to trajectories of the form sampled from the system using our current policy. We assume this observed data is generated as follows: there is a latent state that evolves according to linear-Gaussian dynamics, where the dynamics parameters themselves are stochastic and distributed according to a global prior. At each time step , the latent state is used to generate an image observation , and the state and action generate the cost observation . The prior on the dynamics parameters increases the expressivity of the model by removing the assumption that the underlying dynamics are globally linear, since different trajectories may be explained by different samples from the prior. Furthermore, we approximate the observation function with a convolutional neural network, which makes the overall model nonlinear. We formalize this generative model as
(2)  
(3)  
(4)  
(5)  
(6) 
denotes the matrix normal inverse-Wishart (MNIW) distribution, which is the conjugate prior for linear-Gaussian dynamics models. Thus, conditioned on transitions from a particular time step, the posterior dynamics distribution is still MNIW, and we describe in Section 4 how we leverage this conjugacy to infer local linear models using an approximate posterior distribution over the dynamics as a global prior. We refer to as an observation model or decoder, which is parameterized by neural network weights and outputs a Bernoulli distribution over , which are RGB images.
There are a number of ways to parameterize the quadratic cost model , and we detail several options in Appendix B along with an alternate parameterization for sparse human feedback that we discuss in Section 5.
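The MNIW conjugacy mentioned above means that the posterior over linear dynamics given a batch of transitions is available in closed form. The sketch below illustrates this with the standard Bayesian linear regression update, under one common MNIW parameterization that we assume here (it may differ from the paper's): Sigma ~ InverseWishart(Psi0, nu0) and F | Sigma ~ MatrixNormal(M0, Sigma, inv(V0)). The function name `mniw_posterior` and all argument names are hypothetical.

```python
import numpy as np


def mniw_posterior(X, Y, M0, V0, Psi0, nu0):
    """Conjugate MNIW update for Y = F X + noise, with noise ~ N(0, Sigma).

    X: (d_in, N) inputs, e.g. columns of stacked [state; action].
    Y: (d_out, N) targets, e.g. columns of next states.
    M0: (d_out, d_in) prior mean of F; V0: (d_in, d_in) prior precision
    over the columns of F; Psi0, nu0: inverse-Wishart scale and degrees
    of freedom for Sigma.
    """
    N = X.shape[1]
    Vn = V0 + X @ X.T
    # Mn = (M0 V0 + Y X^T) Vn^{-1}, computed with a linear solve.
    Mn = np.linalg.solve(Vn.T, (M0 @ V0 + Y @ X.T).T).T
    Psin = Psi0 + Y @ Y.T + M0 @ V0 @ M0.T - Mn @ Vn @ Mn.T
    nun = nu0 + N
    return Mn, Vn, Psin, nun
```

With few transitions the posterior mean stays close to the prior mean `M0`, and with many transitions it approaches the least-squares fit, which is the behavior one would want from a global prior that regularizes local model fitting.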
3.2 Joint Model and Representation Learning
We are interested in inferring two distributions, both conditioned on the observations and actions (we do not condition on the cost observations, for simplicity and because the costs are scalars that contain relatively little information compared to image observations):
1. The posterior distribution over dynamics parameters , as this informs our policy update;
2. The posterior distribution over latent trajectories , since we require an estimate of the latent state as the input to our policy.
The subscript denotes an entire trajectory. Both of these distributions are intractable due to the neural network observation model. We instead turn to variational inference which optimizes, with respect to KL-divergence, a variational distribution in order to approximate a distribution of interest . Specifically, we introduce the variational factors
represents our posterior belief about the system dynamics after observing the collected data, and we also model this distribution as MNIW. We construct the full variational distribution over latent state trajectories as the normalized product of the state dynamics and, borrowing terminology from undirected graphical models, learned evidence potentials . We refer to as a recognition model or encoder, which is parameterized by neural network weights and outputs the mean and diagonal covariance of a distribution over .
To learn the variational parameters, we optimize the evidence lower bound (ELBO), which is given by
Johnson et al. (2016) derived an algorithm for optimizing hybrid models with both deep neural networks and probabilistic graphical model (PGM) structure. In fact, our model bears strong resemblance to the LDS SVAE model from their work, though our ultimate goal is to fit local models for model-based policy learning rather than focusing on global models as in their work. We explain the relevant details of the SVAE learning procedure, which we use to learn the neural network parameters and along with the global dynamics and cost models, in Appendix C.
Note that, because the dynamics and cost are learned with samples from the recognition model, we backpropagate the gradients from the cost likelihood and dynamics KL terms through the encoder in order to learn a representation that is better suited to linear dynamics and quadratic cost. Through this, we learn a latent representation that, in addition to being low-dimensional and regularized, is directly optimized for fitting an LQS model on the observed data.
In Figure 3, we depict our generative model using solid lines, and we depict the variational factors and recognition networks using dashed lines. Our method learns two variational distributions: first, a distribution over latent states which is used to provide inputs to the learned policy, and second, a global dynamics model that is used as a prior for inferring local linear dynamics models.
4 Inference and RL in the Latent Space
How can we utilize our learned representation and global models to enable local model methods? As shown in Figure 2, local model methods alternate between collecting batches of data from the current policy and using this data to fit local models and improve the policy. In order to improve the behavior of the local dynamics model fitting, especially in the low-data regime, we use our global dynamics model as a prior and fit local dynamics models via posterior inference conditioned on data from the current policy.
For policy improvement, we fit local linear dynamics models separately at every time step, so we augment the dynamics in our generative model from Equation 3 to instead be separate dynamics parameters at each time step . We model these parameters as independent samples from the global dynamics model , and this can be interpreted as an empirical Bayes method, where we use data to estimate the parameters of our priors. In this way, the global dynamics model acts as a prior on the local time-varying dynamics models. In order to then infer the parameters of these local models conditioned on the data from the current policy, we employ a variational expectation-maximization (EM) procedure. The E-step computes given the current local dynamics, which are initialized to the global prior. The M-step optimizes, for each , with respect to the dynamics parameters, where the expectation is over the latent state distribution from the E-step. We refer readers to Appendix D for complete details.
We additionally fit a local quadratic cost model to the latest batch of data, and this combined with the local linear dynamics models gives us a local latent LQS model. Thus, it is natural to use LQR-based control in order to learn a policy. However, as discussed in Section 2, using vanilla LQR typically leads to undesirable behavior due to modeling bias.
One way to understand the problem is through standard supervised learning analysis, which only guarantees that our local models will be accurate under the distribution of data from the current policy. This directly motivates updating our policy in such a way that the trajectory distribution induced by the new policy does not deviate heavily from the data distribution, and in fact, the update rule proposed by LQR-FLM exactly accomplishes this goal (Levine & Abbeel, 2014). Thus, our policy update method utilizes the same constrained optimization from Equation 1, and we solve this optimization using the same augmented cost function that penalizes deviation from the previous policy.
Note that rolling out our policy requires computing an estimate of the current latent state . In order to handle partially observable tasks, we estimate the latent state using the history of observations and actions, i.e., , where we condition on the local linear dynamics fit to the latest batch of data. This distribution can be computed using Kalman filtering in the latent space and allows us to handle partial observability by aggregating information that may not be estimable from a single observation, such as system velocity from images.
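As an illustration of the latent-space filtering described above, the sketch below performs one Kalman predict/update step. It rests on a simplifying assumption of ours: the encoder output for the current image, a mean `z` and covariance `R`, is treated as a noisy direct observation of the latent state. The function name `kalman_step` and its arguments are hypothetical, and the local dynamics are written as a matrix `F` acting on the stacked state and action.

```python
import numpy as np


def kalman_step(mu, Sigma, a, F, Q, z, R):
    """One predict/update step of a Kalman filter in the latent space.

    mu, Sigma: Gaussian belief over the previous latent state.
    a: previous action.
    F: (ds, ds + da) local linear dynamics over [state; action].
    Q: (ds, ds) dynamics noise covariance.
    z, R: encoder mean and covariance for the current observation,
          treated here as a pseudo-observation of the latent state.
    """
    ds = mu.shape[0]
    A, B = F[:, :ds], F[:, ds:]
    # Predict: push the belief through the local linear dynamics.
    mu_pred = A @ mu + B @ a
    Sigma_pred = A @ Sigma @ A.T + Q
    # Update: fuse the prediction with the encoder's Gaussian evidence.
    G = Sigma_pred @ np.linalg.inv(Sigma_pred + R)   # Kalman gain
    mu_new = mu_pred + G @ (z - mu_pred)
    Sigma_new = (np.eye(ds) - G) @ Sigma_pred
    return mu_new, Sigma_new
```

Running this step along a rollout, with the time step's own local dynamics at each call, yields a state estimate that depends on the whole history of observations and actions, which is how quantities such as velocity can be recovered even though no single image contains them.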
5 The SOLAR Algorithm
The SOLAR algorithm is presented in Algorithm 1. Lines 1-3 detail the pretraining phase, corresponding to the representation and global model learning described in Section 3, where we collect trajectories using a random policy to train the representation, dynamics, and cost model. In our experiments in Section 7, we typically set . In the RL phase, we alternate between inferring dynamics at each time step conditioned on data from the latest policy as described in Section 4 (line 5), performing the LQR-FLM update described in Section 2 given the inferred dynamics (line 6), collecting trajectories using the updated policy (line 7), and optionally fine-tuning the model on the new data (line 8). (In our experiments, we found that fine-tuning the model did not improve final performance, though this step may be more important for environments where exploration is more difficult.) The model hyperparameters include number of iterations, learning rates, and minibatch size, and the policy hyperparameters include the policy update KL constraint and the initial random variance.
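The control flow described in this paragraph can be summarized with the structural sketch below. It is only a skeleton: every callable passed in (`collect`, `pretrain`, `infer_dynamics`, `update_policy`, `finetune`) is a hypothetical placeholder for the corresponding component, the argument names are ours, and we assume that the first RL iteration infers local dynamics from the pretraining data, which the text does not specify.

```python
def solar_loop(collect, pretrain, infer_dynamics, update_policy, init_policy,
               num_pretrain, num_iters, num_rollouts, finetune=None):
    """Skeleton of a pretrain-then-iterate loop in the style described above.

    collect(policy, n): returns n trajectories; policy=None means random.
    pretrain(data): returns the representation and global models.
    infer_dynamics(model, batch): returns local per-time-step dynamics.
    update_policy(policy, local_dynamics, batch): returns the new policy.
    finetune(model, batch): optional model update on the newest data.
    """
    # Pretraining phase: random data, then representation and global models.
    data = collect(None, num_pretrain)
    model = pretrain(data)

    # RL phase: infer local models, update the policy, collect new data.
    policy, batch = init_policy, data   # assumption: reuse pretraining data first
    for _ in range(num_iters):
        local_dynamics = infer_dynamics(model, batch)
        policy = update_policy(policy, local_dynamics, batch)
        batch = collect(policy, num_rollouts)
        if finetune is not None:
            model = finetune(model, batch)
    return policy, model
```

Passing the components in as callables keeps the sketch independent of any particular model or controller implementation; the optional `finetune` argument mirrors the optional fine-tuning step.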
We evaluate SOLAR in Section 7 in several RL settings involving continuous control including manipulation tasks on a real Sawyer robot. Beyond our method’s performance on these tasks, however, we can derive several other significant advantages from our representation and PGM learning. As we detail in the rest of this section, these advantages include transfer in the multi-task RL setting and handling sparse reward settings using an augmented graphical model.
5.1 Transferring Representations and Models
In the scenario where the dynamics are unknown, LQR-based methods are typically used in a “trajectory-centric” fashion where the distributions over initial conditions and goal conditions are low variance (Levine & Abbeel, 2014; Chebotar et al., 2017). We similarly test our method in such settings in Section 7, e.g., learning Lego block stacking where the top block starts in a set position and the bottom block is fixed to the table. In the more general case where we may wish to handle several different conditions, we can learn a policy for each condition; however, this may require significant amounts of data if there are many conditions.
One significant advantage of representation and model learning over alternative approaches, such as model-free RL, is the potential for transferring knowledge across multiple tasks where the underlying system dynamics do not change (Lesort et al., 2018). Here, we consider each condition to be a separate task, and given a task distribution, we first sample various tasks and learn our model from Section 3 using random data from these tasks. We show in Section 7 that this “base model” can then be directly transferred to new tasks within the distribution, essentially removing the pretraining phase and dramatically speeding up learning for the Sawyer Lego block stacking domain.
5.2 Learning from Sparse Rewards
Reward functions can often be hard to specify for complex tasks in the real world, and in particular they may require highly instrumented setups such as motion capture when operating from image observations. In these settings, sparse feedback is often easier to specify as it can come directly from a human labeler. Because we incorporate PGM machinery in our learned latent representation, it is straightforward for SOLAR to handle alternate forms of supervision simply by augmenting our generative model to reflect how the new supervision is given. Specifically, we extend our cost model to the sparse reward setting by assuming that we observe a binary signal based on the policy performance, rather than costs , and then modeling as a Bernoulli random variable with probability given by
Concretely, in our experiments, is generated by a human who only provides when the task is solved. This setup is reminiscent of Fu et al. (2018), though our goal is not to classify expert data from policy data. Learning from observing amounts to logistic regression, and afterwards we can use as before in order to perform control and policy learning. Note that we can still backpropagate gradients through the encoder in order to learn a representation that is more amenable to predicting . In Section 7, we use this method to solve a pushing task for which providing rewards is difficult without motion capture, and instead we use sparse human feedback and a set of goal images to specify the desired outcome. We provide the implementation details for this experiment in Appendix E.
6 Related Work
Utilizing representation learning within model-based RL has been studied in a number of previous works (Lesort et al., 2018), including using embeddings for state aggregation (Singh et al., 1994), dimensionality reduction (Nouri & Littman, 2010) (Smith, 2002), value prediction (Oh et al., 2017), and deep autoencoders (Lange & Riedmiller, 2010; Higgins et al., 2017). Among these works, deep spatial autoencoders (DSAE; Finn et al., 2016) and embed to control (E2C; Watter et al., 2015; Banijamali et al., 2018) are the most closely related to our work, in that they consider local model methods combined with representation learning. The key difference in our work is that, rather than using a learning objective for reconstruction and forward prediction, our objective is more suited for local model methods by directly encouraging learning representations where fitting local models accurately explains the observed data. We also do not assume a known cost function, goal state, or access to the underlying system state as in DSAE and E2C, making SOLAR applicable even when the underlying states and cost function are unknown. (These methods may be extended to unknown underlying states and cost functions, though the authors do not experiment with this and it is unclear how well these approaches would generalize.)
Subsequent to our work, Hafner et al. (2018) formulate a representation and model learning method for image-based continuous control tasks that is used in conjunction with model-predictive control (MPC), which plans time steps ahead using the model, executes an action based on this plan, and then replans after receiving the next observation. We compare to a baseline that uses MPC in Section 7, and we empirically demonstrate the relative strengths of SOLAR and MPC, showing that SOLAR can overcome the short-horizon bias that afflicts MPC. We also compare to robust locally-linear controllable embedding (RCE; Banijamali et al., 2018), an improved version of E2C, and we find that our approach tends to produce better empirical results.
7 Experiments
We aim to answer the following through our experiments:
1. What benefits do we derive by utilizing model-based RL and representation learning in general?
2. How does SOLAR compare to similar methods in terms of solving image-based control tasks?
3. Can we utilize SOLAR to solve image-based control tasks on a real robotic system?
To answer 1, we compare SOLAR to PPO (Schulman et al., 2017), a state-of-the-art model-free RL method, and LQR-FLM with no representation learning. For the real-world tasks, we also compare to deep visual foresight (DVF; Ebert et al., 2018), a state-of-the-art model-based method for images which does not use representation learning.
To answer 2, we compare to RCE (Banijamali et al., 2018), which as discussed earlier is an improved version of E2C (Watter et al., 2015). We also set up a “VAE ablation” of SOLAR where we replace our representation learning scheme with a standard VAE. Finally, we consider an “MPC baseline” where we train neural network dynamics and cost models jointly with a latent representation and then use MPC with these models. Details regarding each of the comparisons are in Appendix F.
To answer 3, we evaluate SOLAR on a block stacking task and a pushing task on a Sawyer robot arm as shown in Figure 1. Videos of the learned policies are available at https://sites.google.com/view/icml19solar.
7.1 Experimental Tasks
We set up simulated image-based robotic domains as well as manipulation tasks on a real Sawyer robotic arm, as shown in Figure 4. Details regarding task setup and training hyperparameters are provided in Appendix E.
2D navigation. Our 2-dimensional navigation task is similar to Watter et al. (2015) and Banijamali et al. (2018) where an agent controls its velocity in a bounded planar system to reach a specified target. However, we make this task harder by randomizing the goal every episode rather than fixing it to the bottom right. Observations consist of two 32-by-32 images showing the positions of the agent and goal.
Nonholonomic car. The nonholonomic car starts in the bottom right of the 2-dimensional space and controls its acceleration and steering velocity in order to reach the target in the top left. We use 64-by-64 images as the observation.
Reacher. We experiment with the reacher environment from OpenAI Gym (Brockman et al., 2016), where a 2-DoF arm in a 2-dimensional plane has to reach a fixed target denoted by a red dot. For observations, we directly use 64-by-64-by-3 images of the rendered environment, which provides a top-down view of the reacher and target.
Sawyer Lego block stacking. To demonstrate a challenging domain in the real world, we use our method to learn Lego block stacking with a real 7-DoF Sawyer robotic arm. The observations are 64-by-64-by-3 images from a camera pointed at the robot, and the controller only receives images as the observation without joint angles or other information. As shown in Figure 4, we define different block stacking tasks as different initial positions of the Sawyer arm.
Sawyer pushing. We also experiment with the Sawyer arm learning to push a mug onto a white coaster, where we again use 64-by-64-by-3 images with no auxiliary information. Furthermore, we set up this task with only sparse binary rewards that indicate whether the mug is on top of the coaster, which are provided by a human labeler.
7.2 Comparisons to Prior Work
As shown in Figure 5, we compare to prior methods only on the simulated domains as these methods have not been shown to solve real-world image-based domains with reasonable data efficiency. On the 2D navigation task, our method, the VAE ablation, and the MPC baseline are able to learn very quickly, converging to high-performing policies in 200 episodes. However, these policies still exhibit some “jittery” behavior due to modeling bias, especially for the VAE ablation, whereas PPO learns an extremely accurate policy that continues to improve the longer we train. This gain in asymptotic performance is typical of model-free methods over model-based methods; however, achieving this performance requires two to three orders of magnitude more samples. We present log-scale plots that illustrate the full learning progress of PPO in Appendix G.
LQR-FLM from pixels fails to learn anything meaningful, and its performance does not improve over the initial policy. In fact, LQR-FLM does not make progress on any of the tasks, and for the sake of clarity in the plots, we omit these results. Similarly, despite extensive tuning and using code directly from the original authors, we were unable to get RCE to learn a good model for our 2D navigation task, and thus the learned policy also does not improve over the initial policy. RCE did not learn successful policies for any of the other tasks that we experiment with, though in Appendix G, we show that RCE can indeed learn the easier fixed-target 2D navigation task from prior work.
On the nonholonomic car, our method and the MPC baseline are able to learn with about 1500 episodes of experience, whereas the VAE ablation’s performance is less consistent. PPO eventually learns a successful policy for this task that performs better than our method; however, it requires over 25 times more data to reach this performance.
Our method is outperformed by the final PPO policy on the reacher task; however, PPO requires about 40 times more data to learn. The VAE ablation and MPC baseline also make progress toward the target, though the performance is noticeably worse than our method. MPC often has better initial behavior than LQR-FLM as it uses the pretrained models right away for planning, highlighting one benefit of planning-based methods; however, the MPC baseline barely improves past this behavior. Forward prediction with this learned model deteriorates quickly as the horizon increases, which makes long-horizon planning impossible. MPC is thus limited to short-horizon planning, and this limitation has been noted in prior work (Nagabandi et al., 2018; Feinberg et al., 2018). SOLAR does not suffer from this as we do not use our models for forward prediction.
Our open-source implementation of SOLAR is available at https://github.com/sharadmv/parasol.
7.3 Analysis of Real Robot Results
The real-world Lego block stacking results are shown in Figure 6. Our method is successful on all tasks, where we define success as achieving an average distance of m which generally corresponds to successful stacking, whereas the VAE ablation is only successful on the easiest task in the middle plot. The MPC baseline again starts off better and learns more quickly on the two easier tasks. However, MPC is again limited to short-horizon planning, which causes it to fail on the most difficult task in the right plot as it simply greedily reduces the distance between the two blocks rather than lifting the block off the table. We can solve each block stacking task using about two hours of robot interaction time, though the x-axes in the plots show that we further reduce the total data requirements by about a factor of two by pretraining and transferring a shared representation and global model as described in Section 5.
As a comparison to a state-of-the-art model-based method that has been successful in real-world image-based domains, we evaluate DVF (Ebert et al., 2018), which learns pixel space models and does not utilize representation learning. We find that this method can make progress but ultimately is not able to solve the two harder tasks even with more data than what we use for our method and even with a much smaller model. This highlights our method’s data efficiency, as we use about two hours of robot data compared to days or weeks of data as in this prior work.
[Table 1: results on the real-world pushing task by method, including SOLAR (ours); table entries not recoverable.]
Finally, on the real-world pushing task, despite the additional challenge of sparse rewards, our method learns a successful policy in about an hour of interaction time as detailed in Table 1 and visualized in Figure 7. DVF performs worse than our method with a comparable amount of data, again even when using a downsized model. Videos depicting the learning process for both of the real-world tasks, as well as full-size versions of the plots and learning curves, are available at https://sites.google.com/view/icml19solar.
8 Discussion
We presented SOLAR, a model-based RL algorithm that is capable of learning policies in a data-efficient manner directly from raw high-dimensional image observations. The key insights in SOLAR involve learning latent representations where simple models are more accurate and utilizing PGM structure to infer dynamics from data conditioned on observed trajectories. Our experimental results demonstrate that SOLAR is competitive in sample efficiency, while exhibiting superior final policy performance, compared to other model-based methods. SOLAR is also significantly more data-efficient compared to model-free RL methods, especially when transferring previously learned representations and models. We show that SOLAR can learn complex real-world robotic manipulation tasks with only image observations in one to two hours of interaction time.
Our model is designed for and tested on continuous action domains, and extending our model to discrete actions would necessitate some type of learned action representation. This is intriguing also as a potential mechanism for further reducing modeling bias. Certain systems such as dexterous hands and tensegrity robots not only exhibit complex state spaces but also complex action spaces (Zhu et al., 2018; Andrychowicz et al., 2018; Zhang et al., 2017), and learning simpler action representations that can potentially capture high-level behavior, such as manipulation or locomotion primitives, is an exciting line of future work.
Acknowledgments.
MZ is supported by an NDSEG fellowship. SV is supported by NSF grant CNS-1446912. This work was supported by the NSF, through IIS-1614653, and computational resource donations from Amazon.
References
 Agrawal et al. (2016) Agrawal, P., Nair, A., Abbeel, P., Malik, J., and Levine, S. Learning to poke by poking: Experiential learning of intuitive physics. In NIPS, 2016.
 Andrychowicz et al. (2018) Andrychowicz, M., Baker, B., Chociej, M., Józefowicz, R., McGrew, B., Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., Schneider, J., Sidor, S., Tobin, J., Welinder, P., Weng, L., and Zaremba, W. Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177, 2018.
 Banijamali et al. (2018) Banijamali, E., Shu, R., Ghavamzadeh, M., Bui, H., and Ghodsi, A. Robust locally-linear controllable embedding. In AISTATS, 2018.
 Boyd & Vandenberghe (2004) Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
 Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI gym. arXiv preprint arXiv:1606.01540, 2016.
 Camacho & Alba (2013) Camacho, E. and Alba, C. Model Predictive Control. Springer Science and Business Media, 2013.
 Chebotar et al. (2017) Chebotar, Y., Hausman, K., Zhang, M., Sukhatme, G., Schaal, S., and Levine, S. Combining model-based and model-free updates for trajectory-centric reinforcement learning. In ICML, 2017.
 Chua et al. (2018) Chua, K., Calandra, R., McAllister, R., and Levine, S. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In NIPS, 2018.
 Deisenroth et al. (2014) Deisenroth, M., Fox, D., and Rasmussen, C. Gaussian processes for data-efficient learning in robotics and control. PAMI, 2014.
 Ebert et al. (2018) Ebert, F., Finn, C., Dasari, S., Xie, A., Lee, A., and Levine, S. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568, 2018.
 Feinberg et al. (2018) Feinberg, V., Wan, A., Stoica, I., Jordan, M., Gonzalez, J., and Levine, S. Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101, 2018.
 Finn & Levine (2017) Finn, C. and Levine, S. Deep visual foresight for planning robot motion. In ICRA, 2017.

 Finn et al. (2016) Finn, C., Tan, X., Duan, Y., Darrell, T., Levine, S., and Abbeel, P. Deep spatial autoencoders for visuomotor learning. In ICRA, 2016.
 Fu et al. (2018) Fu, J., Singh, A., Ghosh, D., Yang, L., and Levine, S. Variational inverse control with events: A general framework for data-driven reward definition. In NIPS, 2018.
 Fujimoto et al. (2018) Fujimoto, S., van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In ICML, 2018.
 Haarnoja et al. (2018) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, 2018.
 Hafner et al. (2018) Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018.
 Higgins et al. (2017) Higgins, I., Pal, A., Rusu, A., Matthey, L., Burgess, C., Pritzel, A., Botvinick, M., Blundell, C., and Lerchner, A. DARLA: Improving zero-shot transfer in reinforcement learning. In ICML, 2017.
 Hoffman et al. (2013) Hoffman, M., Blei, D., Wang, C., and Paisley, J. Stochastic variational inference. JMLR, 2013.
 Jacobson & Mayne (1970) Jacobson, D. and Mayne, D. Differential Dynamic Programming. American Elsevier, 1970.
 Johnson et al. (2016) Johnson, M., Duvenaud, D., Wiltschko, A., Datta, S., and Adams, R. Composing graphical models with neural networks for structured representations and fast inference. In NIPS, 2016.
 Kingma & Ba (2015) Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015.
 Kingma & Welling (2014) Kingma, D. and Welling, M. Auto-encoding variational Bayes. In ICLR, 2014.
 Kober et al. (2013) Kober, J., Bagnell, J., and Peters, J. Reinforcement learning in robotics: A survey. IJRR, 2013.
 Lange & Riedmiller (2010) Lange, S. and Riedmiller, M. Deep autoencoder neural networks in reinforcement learning. In IJCNN, 2010.
 Lesort et al. (2018) Lesort, T., Díaz-Rodríguez, N., Goudou, J., and Filliat, D. State representation learning for control: An overview. Neural Networks, 2018.
 Levine & Abbeel (2014) Levine, S. and Abbeel, P. Learning neural network policies with guided policy search under unknown dynamics. In NIPS, 2014.
 Levine et al. (2016) Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. JMLR, 2016.
 Moldovan et al. (2015) Moldovan, T., Levine, S., Jordan, M., and Abbeel, P. Optimism-driven exploration for nonlinear systems. In ICRA, 2015.
 Nagabandi et al. (2018) Nagabandi, A., Kahn, G., Fearing, R., and Levine, S. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In ICRA, 2018.
 Nouri & Littman (2010) Nouri, A. and Littman, M. Dimension reduction and its application to model-based exploration in continuous spaces. Machine Learning, 2010.
 Oh et al. (2017) Oh, J., Singh, S., and Lee, H. Value prediction network. In NIPS, 2017.
 Pinto & Gupta (2016) Pinto, L. and Gupta, A. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In ICRA, 2016.
 Rezende et al. (2014) Rezende, D., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
 Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Singh et al. (1994) Singh, S., Jaakkola, T., and Jordan, M. Reinforcement learning with soft state aggregation. In NIPS, 1994.
 Smith (2002) Smith, A. Applications of the self-organizing map to reinforcement learning. Neural Networks, 2002.
 Sutton (1990) Sutton, R. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In ICML, 1990.
 Tassa et al. (2012) Tassa, Y., Erez, T., and Todorov, E. Synthesis and stabilization of complex behaviors. In IROS, 2012.
 Todorov & Li (2005) Todorov, E. and Li, W. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In ACC, 2005.
 Watter et al. (2015) Watter, M., Springenberg, J., Boedecker, J., and Riedmiller, M. Embed to control: A locally linear latent dynamics model for control from raw images. In NIPS, 2015.
 Winn & Bishop (2005) Winn, J. and Bishop, C. Variational message passing. JMLR, 2005.
 Zhang et al. (2017) Zhang, M., Geng, X., Bruce, J., Caluwaerts, K., Vespignani, M., SunSpiral, V., Abbeel, P., and Levine, S. Deep reinforcement learning for tensegrity robot locomotion. In ICRA, 2017.
 Zhu et al. (2018) Zhu, H., Gupta, A., Rajeswaran, A., Levine, S., and Kumar, V. Dexterous manipulation with deep reinforcement learning: Efficient, general, and low-cost. arXiv preprint arXiv:1810.06045, 2018.
Appendix A Policy Learning Details
Given a TVLG dynamics model and quadratic cost approximation, we can approximate our Q and value functions to second order with the following dynamic programming updates, which proceed from the last time step to the first step :
It can be shown (e.g., by Tassa et al. (2012)) that the action that minimizes the second-order approximation of the Q-function at every time step is given by
This action is a linear function of the state , thus we can construct an optimal linear policy by setting and . We can also show that the maximum-entropy policy that minimizes the approximate Q-function is given by
Furthermore, as in Levine & Abbeel (2014), we can impose a constraint on the total KL-divergence between the old and new trajectory distributions induced by the policies through an augmented cost function , where solving for via dual gradient descent can yield an exact solution to a KL-constrained LQR problem.
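To make the backward recursion concrete, the following is a minimal sketch of the standard LQR dynamic programming pass, restricted to a scalar state and action so that it needs no linear algebra library. The notation is an assumption made for illustration and is not taken from the equations of this appendix: dynamics s' = F_s s + F_a a + f, cost 0.5 c_ss s^2 + c_sa s a + 0.5 c_aa a^2 + c_s s + c_a a, and a value function 0.5 V_ss s^2 + V_s s. The function name `lqr_backward_scalar` is hypothetical. It returns the feedback gain, the offset, and the variance 1/Q_aa that a maximum-entropy linear-Gaussian policy would use.

```python
def lqr_backward_scalar(F_s, F_a, f, c_ss, c_sa, c_aa, c_s, c_a):
    """Backward pass of time-varying scalar LQR (assumed notation).

    Every argument is a list of length T holding per-time-step coefficients.
    Returns per-step gains K, offsets k (so that a_t = K[t] * s_t + k[t]),
    and the action variance 1 / Q_aa of the maximum-entropy policy.
    """
    T = len(F_s)
    V_ss, V_s = 0.0, 0.0  # value function is zero after the final step
    K, k, var = [0.0] * T, [0.0] * T, [0.0] * T
    for t in reversed(range(T)):
        # Linear value term pushed through the dynamics offset f[t].
        v_next = V_s + V_ss * f[t]
        # Second-order expansion of the Q-function.
        Q_ss = c_ss[t] + F_s[t] * V_ss * F_s[t]
        Q_sa = c_sa[t] + F_s[t] * V_ss * F_a[t]
        Q_aa = c_aa[t] + F_a[t] * V_ss * F_a[t]
        Q_s = c_s[t] + F_s[t] * v_next
        Q_a = c_a[t] + F_a[t] * v_next
        # Minimizing action is linear in the state.
        K[t] = -Q_sa / Q_aa
        k[t] = -Q_a / Q_aa
        var[t] = 1.0 / Q_aa
        # Plug the minimizer back in to get the value function at step t.
        V_ss = Q_ss + Q_sa * K[t]
        V_s = Q_s + Q_sa * k[t]
    return K, k, var
```

For a time-invariant system with unit dynamics and unit quadratic costs, the gain at the first step of a long horizon approaches the stationary Riccati solution, which gives a simple sanity check on the recursion.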
Appendix B Parameterizing the Cost Model
The simplest choice that we consider for parameterizing the cost model is as a full quadratic function of the state and action, i.e., where we assume that the action-dependent part of the cost, i.e., , is known, and we impose no restrictions on the learned parameters and . This is our default option due to its simplicity and the added benefit that fitting this model locally can be done in closed form through least-squares quadratic regression on the observed states. However, another option we consider is to choose . is a lower-triangular matrix with nonnegative diagonal entries, and thus by constructing our cost matrix as we guarantee that the learned cost matrix is positive semidefinite, which can improve the behavior of the policy update.
In this work, we consider quadratic parameterizations of the cost model since we wish to build an LQS model. However, it may be possible to use non-quadratic but twice-differentiable cost models, such as a neural network model, and compute local quadratic cost models using a second-order Taylor approximation as in Levine & Abbeel (2014). We also do not assume access to a goal observation, though if provided with such information we can construct a quadratic cost function that penalizes distance to this goal in the learned latent space, as in Finn et al. (2016) and Watter et al. (2015).
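The positive semidefinite construction can be illustrated with a short sketch. It assumes the cost matrix is formed as the product of a lower-triangular factor with its transpose, and it enforces the nonnegative diagonal by taking absolute values, which is one possible choice and not necessarily the one used in this work. The helper names `cost_matrix_from_factor` and `quadratic_form` are hypothetical.

```python
def cost_matrix_from_factor(raw):
    """Build C = L L^T from an unconstrained square parameter matrix `raw`.

    L keeps only the lower triangle of `raw`; the diagonal is made
    nonnegative with abs() (an illustrative assumption).
    """
    n = len(raw)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            L[i][j] = float(abs(raw[i][j])) if i == j else float(raw[i][j])
    # C[i][j] = sum_m L[i][m] * L[j][m], i.e. the (i, j) entry of L L^T.
    return [[sum(L[i][m] * L[j][m] for m in range(n)) for j in range(n)]
            for i in range(n)]


def quadratic_form(C, x):
    """Evaluate x^T C x for a square matrix C and vector x."""
    n = len(x)
    return sum(x[i] * C[i][j] * x[j] for i in range(n) for j in range(n))
```

Because x^T L L^T x equals the squared norm of L^T x, the quadratic form is nonnegative for every x regardless of the unconstrained parameters, which is the property the text relies on.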
Appendix C The SVAE Algorithm
Johnson et al. (2016) build on Hoffman et al. (2013) and Winn & Bishop (2005), who show that, for conjugate exponential models, the variational model parameters can be updated using natural gradients of the form
(7) 
where denotes the MNIW parameters of the variational factors on , is the number of minibatches in the dataset, is the parameter for the prior distribution , and is the sufficient statistic function for . Thus, we can use this equation to compute the natural gradient update for , whereas for , , and the parameters of the cost model, we use stochastic gradient updates on Monte Carlo estimates of the ELBO, specifically using the Adam optimizer (Kingma & Ba, 2015). This leads to two simultaneous optimizations, and their learning rates are treated as separate hyperparameters. We have found and to be good default settings for the natural gradient step size and stochastic gradient step size, respectively.
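As a rough illustration of the natural gradient step, the sketch below uses the generic stochastic variational inference form for conjugate exponential family models from Hoffman et al. (2013): the natural gradient is the prior natural parameter, plus the number of minibatches times the expected sufficient statistics of the current minibatch, minus the current variational parameter. Treating the parameters as flat lists of floats and the function name `natural_gradient_step` are simplifying assumptions; the MNIW-specific statistics are not modeled here.

```python
def natural_gradient_step(eta, eta_prior, expected_stats, num_minibatches, step_size):
    """One natural gradient ascent step on variational natural parameters.

    eta:             current variational natural parameters (list of floats)
    eta_prior:       natural parameters of the prior
    expected_stats:  expected sufficient statistics from one minibatch
    num_minibatches: rescales the minibatch statistics to the full dataset
    step_size:       natural gradient step size (a separate hyperparameter
                     from the Adam step size used for the other parameters)
    """
    nat_grad = [p + num_minibatches * s - e
                for e, p, s in zip(eta, eta_prior, expected_stats)]
    return [e + step_size * g for e, g in zip(eta, nat_grad)]
```

With a step size of one, the update jumps directly to the prior parameter plus the rescaled statistics; smaller step sizes average this target over minibatches.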
Appendix D Fitting the Local Dynamics Model
In the pretraining phase described in Section 3, we are learning the following sets of parameters from observed trajectories:

The parameters of the variational posterior over global dynamics ;

The weights of the encoder and decoder networks and ;

The parameters of the cost function .
In the RL phase described in Section 4, after learning the representation and global models, we fit local linear-Gaussian dynamics models to additional trajectories. The conjugacy of the Bayesian LQS model enables a computationally efficient expectation-maximization procedure to learn the local dynamics. We assume the same graphical model as in Equation 2 to Equation 6 except we modify Equation 3 and Equation 4 to be
The model assumes that the TVLG dynamics are independent samples from our global dynamics, followed by a deep Bayesian LDS to generate trajectories. This is similar to the globally trained model, with the exception that we explicitly assume time-varying dynamics.
Now suppose we have collected a set of trajectories of the form and aim to fit a local dynamics model. We use variational inference to approximate the posterior distributions by setting up the variational factors

, which approximates the posterior distribution ;

, which approximates the posterior distribution
The ELBO under these variational factors is:
We use variational EM to alternately optimize and . Using evidence potentials output by the recognition network , both of these optimizations can be done in closed form. Specifically, the optimal is computed via Kalman smoothing using evidence potentials from the recognition network, and the optimal can be computed via Bayesian linear regression using expected sufficient statistics from .
Appendix E Experiment Setup
2D navigation.
Our recognition model architecture for the 2D navigation domain consists of two convolution layers with 2-by-2 filters and 32 channels each, with no pooling layers and ReLU nonlinearities, followed by another convolution with 2-by-2 filters and 2 channels. The output of the last convolution layer is fed into a fully-connected layer which then outputs a Gaussian distribution with diagonal covariance. Our observation model consists of FC hidden layers with 256 ReLU activations, and the last layer outputs a categorical distribution over pixels. We initially collect 100 episodes which we use to train our model, and for every subsequent RL iteration we collect 10 episodes. The cost function we use is the sum of the norm squared of the distance to the target and the commanded action, with weights of 1 and 0.001, respectively.
As discussed in Section 7, we modify the 2D navigation task from Watter et al. (2015) and Banijamali et al. (2018) to randomize the location of the target every episode, and we set this location uniformly at random between and for both the x and y coordinates, as coordinates outside of are not visible in the image. We similarly randomize the initial position of the agent. In this setup, we use two 32-by-32 images as the observation, one with the location of the agent and the other with the location of the target, and in the fixed-target version of the task we only use one 32-by-32 image.
Nonholonomic car. The nonholonomic car domain consists of 64-by-64 image observations. Our recognition model is a convolutional neural network with four convolutional layers with 4-by-4 filters with 4 channels each, and the first two convolution layers are followed by a ReLU nonlinearity. The output of the last convolutional layer is fed into three FC ReLU layers of width 2048, 512, and 128, respectively. Our final layer outputs a Gaussian distribution with dimension 8. Our observation model consists of four FC ReLU layers of width 256, 512, 1024, and 2048, respectively, followed by a Bernoulli distribution layer that models the image. For this domain, we collect 100 episodes initially to train our model, and then for RL we collect 100 episodes per iteration. The cost function we use is the sum of the norm squared of the distance from the center of the car to the target and the commanded action, with weights of 1 and 0.001, respectively.
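The cost used for the 2D navigation and car domains can be written out as a short sketch. It assumes that the norm in question is the Euclidean norm, since the norm symbol is not legible in the text, and the function name `weighted_quadratic_cost` is hypothetical. The weights of 1 and 0.001 are the ones stated above.

```python
def weighted_quadratic_cost(position, target, action,
                            distance_weight=1.0, action_weight=0.001):
    """Squared distance to the target plus a small squared-action penalty.

    The Euclidean norm is an assumption made for this illustration.
    """
    distance_sq = sum((p - g) ** 2 for p, g in zip(position, target))
    action_sq = sum(a ** 2 for a in action)
    return distance_weight * distance_sq + action_weight * action_sq
```

With these weights the distance term dominates, so the action penalty mainly discourages unnecessarily large commands once the agent is near the target.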
Reacher. The reacher domain consists of 64-by-64-by-3 image observations. Our recognition model consists of three convolutional layers with 7-by-7, 5-by-5, and 3-by-3 filters with 64, 32, and 8 channels, respectively. The first convolutional layer is followed by a ReLU nonlinearity. The output of the last convolutional layer is fed into an FC ReLU layer of width 256, which outputs a Gaussian distribution with dimension 10. Our observation model consists of one FC ReLU layer of width 512, followed by three deconvolutional layers with the reverse order of filters and channels as the recognition model. This is followed by a Bernoulli distribution layer that models each image. We collect 200 episodes initially to train our model, and then for RL we collect 100 episodes per iteration. The cost function we use is the sum of the norm of the distance from the fingertip to the target and the norm squared of the commanded action, which is the negative of the reward function as defined in Gym.
Sawyer Lego block stacking. The image-based Sawyer block-stacking domain consists of 64-by-64-by-3 image observations. The policy outputs velocities on the end effector in order to control the robot. Our recognition model is a convolutional neural network with the following architecture: a 5-by-5 filter convolutional layer with 16 channels followed by two convolutional layers using 5-by-5 filters with 32 channels each. The convolutional layers are followed by ReLU activations leading to a 12-dimensional Gaussian distribution layer. Our observation model consists of an FC ReLU layer of width 128 feeding into three deconvolutional layers, the first with 5-by-5 filters with 16 channels and the last two with 6-by-6 filters with 8 channels each. These are followed by a final Bernoulli distribution layer.
For this domain, we collect 400 episodes initially to train our model and 10 per iteration thereafter. Note that this pretraining data is collected only once across solving all of the tasks that we test on. The cost function is the cube root of the norm of the displacement vector between the end effector and the target in 3D space.
Sawyer pushing. The image-based Sawyer pushing domain also operates on 64-by-64-by-3 image observations. Our recognition and observation models are the same as those used in the block-stacking domain. The dynamics model is learned by a network with two FC ReLU layers of width 128 followed by a 12-dimensional Gaussian distribution layer. The cost model is learned jointly with the representation and dynamics by optimizing the ELBO, which with regard to the cost corresponds to logistic regression on the observed sparse reward using a sampled latent state as the input. We collect 200 episodes to train our model and 20 per iteration for RL.
During the RL phase, the human supervisor uses keyboard input to provide the sparse reward signal to the learning algorithm, indicating whether or not the mug was successfully pushed onto the coaster. In practice, for simplicity, we label the last five images of the trajectory as either or depending on whether or not the keyboard was pressed at any time during the trajectory, as for this task a successful push is typically reflected in the end state. In order to overcome the exploration problem and provide a diverse dataset for pretraining the cost model, we manually collect “goal images” where the mug is on the coaster and the robot arm is in various locations.
Appendix F Implementation of Comparisons
PPO. We use the open source implementation of PPO (named “PPO2”) from the OpenAI Baselines project: https://github.com/openai/baselines. We write OpenAI gym wrappers for our simulated environments in order to test PPO on our simulated tasks.
LQR-FLM. We implement LQR-FLM based on the open-source implementation from the Guided Policy Search project: https://github.com/cbfinn/gps. The only modification to the LQR-FLM algorithm that we make is to handle unknown cost functions by fitting a quadratic cost model to data from the current policy.
DVF. We train a video prediction model using the open source Stochastic Adversarial Video Prediction project: https://github.com/alexleegk/video_prediction. To define the task, we specify the location of a pixel whose movement to a specified goal location indicates success. The cost function is then the predicted probability of successfully moving the selected pixel to the goal. We then use MPC, specifically the cross-entropy method (CEM) for offline planning: we sample sequences of actions from a Gaussian, predict the corresponding sequence of images using the video prediction model, evaluate the cost of the imagined trajectory with the cost model, and refit the parameters of the Gaussian to the best predicted action sequences. This iterative process eventually outputs an action sequence to perform in the real world in order to try to solve the task.
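The sample, evaluate, and refit loop of CEM can be sketched as follows. This is a toy illustration under stated assumptions and not the DVF implementation: the callable `cost_fn` stands in for the combination of video prediction model and cost model, actions are scalars, and the function names, population size, elite count, and iteration count are all hypothetical choices.

```python
import random


def cem_plan(cost_fn, horizon, iterations=10, population=200, num_elites=20, seed=0):
    """Cross-entropy method over action sequences of length `horizon`.

    cost_fn maps a list of scalar actions to a float (lower is better).
    Returns the mean of the final sampling distribution as the plan.
    """
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iterations):
        # Sample candidate action sequences from the current Gaussian.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(population)]
        # Keep the sequences with the lowest imagined cost.
        samples.sort(key=cost_fn)
        elites = samples[:num_elites]
        # Refit the Gaussian to the elites (with a floor on the std).
        mean = [sum(e[t] for e in elites) / num_elites for t in range(horizon)]
        std = [max(1e-3, (sum((e[t] - mean[t]) ** 2 for e in elites)
                          / num_elites) ** 0.5)
               for t in range(horizon)]
    return mean


def toy_cost(actions, goal=2.0):
    """Toy stand-in model: a point starts at 0 and each action moves it."""
    position = sum(actions)
    return (position - goal) ** 2 + 0.01 * sum(a * a for a in actions)
```

Calling `cem_plan(toy_cost, horizon=5)` returns an action sequence whose summed displacement lands close to the goal of the toy problem.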
RCE. We use model learning code directly from the authors of RCE (Banijamali et al., 2018), though this code is not publicly available and to our knowledge there are no open source implementations of RCE or E2C (Watter et al., 2015) that are able to reproduce the results from the respective papers. In addition to LQR-based control, we also experiment with MPC with neural network dynamics and cost models in the learned latent representation. In our experiments, we report the best results using either of these control methods.
VAE ablation. In the VAE ablation, we replace our representation and global models with a standard VAE (Kingma & Welling, 2014; Rezende et al., 2014), which imposes a unit Gaussian prior on the latent representation. Because we cannot infer local dynamics as described in Section 4, we instead use a GMM dynamics prior that is trained on all data as described by Levine et al. (2016). After fitting a local quadratic cost model, we again have a local LQS model that we can use in conjunction with an LQR-FLM policy update.
MPC baseline. Model predictive control (MPC) involves planning a fixed number of time steps ahead using a dynamics and cost model, executing an action based on this plan, and then replanning after receiving the next observation (Camacho & Alba, 2013). Recently, MPC has proven to be a successful control method when combined with neural network dynamics models, where many trajectories are sampled using the model and then the first action corresponding to the best imagined trajectory is executed (Nagabandi et al., 2018; Chua et al., 2018). Similar to LQR-FLM, we can extend MPC to handle image-based domains by learning dynamics and cost models within a learned latent representation. As MPC does not require an LQS model, we can instead utilize neural network dynamics and cost models, which are more expressive.
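The plan, execute, and replan loop can be sketched with random-shooting MPC on a toy problem. Everything here is an illustrative assumption: the scalar state, the uniform action range of [-1, 1], and the callables `dynamics_fn` and `cost_fn` stand in for learned latent dynamics and cost models, and for simplicity the same dynamics function also plays the role of the real environment.

```python
import random


def plan_first_action(state, dynamics_fn, cost_fn, horizon, num_candidates, rng):
    """Sample action sequences, roll them out in the model, and return
    the first action of the lowest-cost imagined trajectory."""
    best_cost, best_action = float("inf"), 0.0
    for _ in range(num_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, total = state, 0.0
        for a in actions:
            s = dynamics_fn(s, a)
            total += cost_fn(s, a)
        if total < best_cost:
            best_cost, best_action = total, actions[0]
    return best_action


def run_mpc(initial_state, dynamics_fn, cost_fn, steps,
            horizon=3, num_candidates=200, seed=0):
    """Receding-horizon control: replan from the new state at every step."""
    rng = random.Random(seed)
    state = initial_state
    for _ in range(steps):
        action = plan_first_action(state, dynamics_fn, cost_fn,
                                   horizon, num_candidates, rng)
        # Here the model doubles as the environment (toy assumption).
        state = dynamics_fn(state, action)
    return state
```

Because only the first action of each plan is executed, errors in later imagined steps are corrected by replanning, but the controller can only account for costs within the planning horizon.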
Appendix G Additional Experiments
G.1 RCE on Fixed-Target 2D Navigation
As mentioned in Section 7, RCE was unable to make progress for the 2D navigation task, though we were able to get more successful results by fixing the position of the goal to the bottom right as is done in the image-based 2D navigation task considered in E2C (Watter et al., 2015) and RCE (Banijamali et al., 2018). Figure 8 details this experiment, which we ran for three random seeds and report the mean and standard deviation of the average final distance to the goal as a function of the number of training episodes. This indicates that RCE can indeed solve some tasks from image observations, though we were unable to use RCE successfully on any of the tasks we consider.
G.2 Full Learning Progress of PPO
In Figure 9 we include the plots for the simulated tasks comparing SOLAR and PPO. Note that the x-axis is on a log scale: though our method is sometimes worse in final policy performance, we use one to three orders of magnitude fewer samples. This demonstrates our method's sample efficiency compared to PPO, while being able to solve complex image-based domains that are difficult for model-based methods.
PPO is an on-policy model-free RL method, and typically off-policy methods exhibit better sample efficiency (Fujimoto et al., 2018; Haarnoja et al., 2018). We use PPO in our comparisons because on-policy methods are typically easier to tune, at the cost of being less efficient, and the complexity of our image-based environments poses a major challenge for all RL methods. Specifically, we also compared to TD3 (Fujimoto et al., 2018), and we were unable to train successful policies despite extensive hyperparameter tuning. We also note that, to our knowledge, TD3 has never been tested on image-based domains.