A fundamental challenge in applying reinforcement learning (RL) to real robotics is the need to define a suitable state space representation. Designing perception systems manually makes it difficult to apply the same RL algorithm to a wide range of manipulation tasks in complex, unstructured environments. One could, in principle, apply model-free RL algorithms directly to raw, low-level observations; however, these tend to be slow, sensitive to hyperparameters, and sample inefficient. Model-based reinforcement learning, in which a predictive model of the environment is first learned and a controller is then derived from it, offers the potential for sample-efficient algorithms even from raw, low-level observations. However, model-based RL from high-dimensional observations poses a challenging problem: how do we learn a good latent space for planning?
Recent work in learning latent spaces for model-based RL can be categorized in three main classes: 1) hand-crafted latent spaces, 2) video prediction models, and 3) latent spaces learned with reconstruction objectives [4, 8, 13, 12, 14, 3]. While hand-crafted latent spaces provide a good inductive bias, such as forcing the latent to be feature points, they are too rigid for unstructured tasks and cannot incorporate semantic knowledge of the environment. Video prediction models and latent spaces with reconstruction objectives share the commonality that the latent space is learned using a loss on the raw pixel observations. As a result, the latent space model needs to incorporate all the information required to reconstruct every detail of the observations, which is redundant: in real-world environments the task is usually represented by a small fraction of the scene.
Instead, this work tackles the problem of representation learning from an information-theoretic point of view: learning a latent space that maximizes the mutual information (MI) between the latent and future observations. The learned latent space encodes the underlying shared information between the different future observations, discarding low-level information and local noise. When predicting further into the future, the amount of shared information becomes much lower and the model needs to infer the global structure. The MI is optimized using energy-based models and a discriminative procedure [11, 5]. Energy-based models do not impose any assumption on the distribution between the latent and the image, and the discriminative procedure results in a robust latent space.
The main contribution of our work is a representation learning approach, MIRO (Mutual Information Robust Representation), which jointly optimizes the latent representation and the model, and results in a latent space that is robust against disturbances and noisy observations while achieving performance comparable to state-of-the-art model-based algorithms. Our experimental evaluation illustrates the strengths of our framework on four standard DeepMind Control Suite environments.
This work tackles the problem of learning a latent space that is suitable for planning from high-dimensional observations in POMDPs. We maximize the mutual information (MI) between the latent variables and the future observations, and learn a predictive model of the environment.
Partially Observable Markov Decision Process.
A discrete-time finite partially observable Markov decision process (POMDP) is defined by the tuple $(\mathcal{S}, \mathcal{A}, p, r, \mathcal{O}, \gamma, \rho_0, H)$. Here, $\mathcal{S}$ is the set of states, $\mathcal{A}$ the action space, $p(s_{t+1} \mid s_t, a_t)$ the transition distribution, $r(r_t \mid s_t)$ the probability of obtaining the reward $r_t$ at the state $s_t$, $\mathcal{O}$ the observation space, $\gamma$ the discount factor, $\rho_0$ the initial state distribution, and $H$ the horizon of the process. We define the return as the sum of rewards $R(\tau) = \sum_{t=0}^{H} r_t$ along a trajectory $\tau := (s_0, a_0, \ldots, s_{H-1}, a_{H-1}, s_H)$. The goal of reinforcement learning is to find a controller $\pi$ that maximizes the expected return, i.e., $\pi^* = \arg\max_{\pi} \mathbb{E}_{\tau}[R(\tau)]$.
The mutual information between two random variables $X$ and $Y$, denoted by $I(X; Y)$, is the Kullback-Leibler divergence between the joint distribution $p(x, y)$ and the product of the marginals $p(x)$ and $p(y)$: $I(X; Y) = D_{\mathrm{KL}}\big(p(x, y) \,\|\, p(x)\,p(y)\big)$.
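Since this quantity is intractable in general, a standard sample-based lower bound from the contrastive predictive coding literature [11], with one positive pair and $N-1$ negative samples, is

```latex
I(X; Y) \;\ge\; \log N - \mathcal{L}_N,
\qquad
\mathcal{L}_N = -\,\mathbb{E}\!\left[\log \frac{f(x, y)}{\sum_{y' \in Y_N} f(x, y')}\right],
```

where $f$ is a learned positive score function (an unnormalized energy) and $Y_N$ is the set containing the positive sample and the $N-1$ negatives.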
Enabling complex real robotics tasks requires extending current model-based methods to low-level high-dimensional observations. However, in order to do so, we need to specify which space we should plan with. Real-world environments are unstructured, cluttered, and present distractors. Our approach, MIRO, is able to learn latent representations that capture the relevant aspects of the dynamics and the task by framing the representational learning problem in information theoretic terms: maximizing the mutual information between the latent space and the future observations. This objective advocates for representing just the aspects that matter in the dynamics, removing the burden of reconstructing the entire pixel observation.
3.1 Learning Latent Spaces for Control
Since the learned latent space will be used for planning, it has to be able to accurately predict the rewards and to incorporate new observations when available. To this end, we frame latent space learning as a single optimization procedure over four functions:
Encoder. The encoder is a function that maps a high-dimensional observation to a lower-dimensional manifold, $e: \mathcal{O} \to \mathcal{Z}$. We denote the encoded observation by $z_t = e(o_t)$. Contrary to prior work on learning latent spaces for planning, we do not make any assumption on the underlying distribution of $z_t$.
Dynamics model. The dynamics model gives the transition function on the latent space, $q(z_{t+1} \mid z_t, a_t)$; it generates the next latent state given the current state and action. The dynamics model is probabilistic, following a Gaussian distribution, i.e., $z_{t+1} \sim \mathcal{N}\big(\mu(z_t, a_t), \sigma(z_t, a_t)\big)$.
Filtering function. This function filters the belief of the latent state with the current encoded observation, combining the predicted latent $\hat{z}_t$ and the encoding $e(o_t)$ to obtain the filtered latent variable $z_t$.
Reward predictor. This function is a mapping between latent states and rewards, $r: \mathcal{Z} \to \mathbb{R}$. We assume that the rewards follow a Gaussian distribution with unit variance.
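To make the roles of the four functions concrete, the following sketch implements them with random linear maps in NumPy. All dimensions, parameter names, and the fixed filtering gain are illustrative stand-ins for the learned networks, not our actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, LATENT_DIM, ACTION_DIM = 64, 8, 2  # illustrative sizes

# Random linear maps standing in for learned neural networks.
W_enc = rng.normal(size=(LATENT_DIM, OBS_DIM)) / np.sqrt(OBS_DIM)
W_dyn = rng.normal(size=(LATENT_DIM, LATENT_DIM + ACTION_DIM)) / np.sqrt(LATENT_DIM)
W_rew = rng.normal(size=(LATENT_DIM,))

def encode(obs):
    """Encoder e: maps a high-dimensional observation to the latent manifold."""
    return W_enc @ obs

def dynamics(z, a, noise_rng):
    """Gaussian dynamics model: mean and a reparameterized sample of z_{t+1}."""
    mean = W_dyn @ np.concatenate([z, a])
    log_std = np.zeros(LATENT_DIM)  # fixed unit variance for this sketch
    sample = mean + np.exp(log_std) * noise_rng.normal(size=LATENT_DIM)
    return mean, sample

def filter_latent(z_pred, z_obs, gain=0.5):
    """Filtering function: corrects the predicted belief with the encoding."""
    return z_pred + gain * (z_obs - z_pred)

def predict_reward(z):
    """Reward predictor: Gaussian with unit variance; we plan with the mean."""
    return float(W_rew @ z)
```

In practice each map is a deep network and the filtering gain is itself learned; the sketch only fixes the interfaces the rest of the method relies on.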
The parameters of these functions are learned with a single optimization objective; specifically, we solve the following constrained optimization problem:
The mutual information term preserves in the latent space only the information relevant for the dynamics, while the constraints prevent the latent from degenerating by forcing it to retain the terms that matter for predicting the rewards and next latents. However, in practice this objective is intractable: we cannot evaluate the exact mutual information term, and we have a set of non-linear constraints. To optimize it, we formulate the Lagrangian of the objective and replace the mutual information term with the noise contrastive estimation (NCE) lower bound:
We use the reparametrization trick to sample from the Gaussian dynamics model, writing each sample as the predicted mean plus noise scaled by the predicted standard deviation. In our case, the noise contrastive estimator term results in:
Here, the open-loop prediction is obtained by unrolling the dynamics model from the filtered latent under the corresponding sequence of actions.
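As an illustration of the discriminative procedure, the following is a minimal NumPy sketch of an InfoNCE-style batch loss: each open-loop prediction is scored against every encoded future observation in the batch, the matching pair is the positive, and the remaining rows serve as negatives. The dot-product score is an assumption standing in for the learned energy function.

```python
import numpy as np

def info_nce_loss(z_pred, z_true):
    """InfoNCE-style contrastive loss over a batch.

    z_pred: (N, d) open-loop predicted latents; z_true: (N, d) encoded
    future observations. Row i of z_true is the positive for row i of
    z_pred; the other N-1 rows act as in-batch negatives.
    """
    scores = z_pred @ z_true.T                      # (N, N) similarity logits
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))           # positives on the diagonal
```

Minimizing this loss pushes each prediction toward its own future encoding and away from the futures of other trajectories, which is what lets the latent drop pixel detail that does not help discriminate futures.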
3.2 Latent Space Model-Based Reinforcement Learning
We present the overall method for obtaining a model-based controller when learning the latent space using MIRO.
Data Collection. As is typical of model-based methods, we alternate between model learning and data collection using the latest controller. This allows the model to learn just the parts of the state space that the agent visits, removing the burden of modeling the entire space, and overcomes the insufficient coverage of the initial data distribution.
Model Learning. We learn the encoder, model, filtering function, and reward predictor altogether using equation 1 and all the data collected so far.
Planning. As a controller, we use model-predictive control (MPC) with the cross-entropy method (CEM). The CEM component selects the action sequence that maximizes the expected return under the learned models. Specifically, it is a population-based procedure that iteratively refits a Gaussian distribution over the best sequences of actions. The MPC component yields more robust planning by re-running CEM at each step and executing only the first action.
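The planning step above can be sketched as follows, under an assumed deterministic latent-dynamics and reward interface (`dynamics(z, a)` returning the next latent, `reward(z)` returning a scalar); population size, elite count, and iteration counts are illustrative, not the values used in our experiments.

```python
import numpy as np

def cem_plan(z0, dynamics, reward, action_dim, horizon=12,
             pop=500, elites=50, iters=5, rng=None):
    """One MPC step: pick the first action of a CEM-optimized sequence."""
    if rng is None:
        rng = np.random.default_rng()
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        # Sample a population of candidate action sequences.
        actions = mean + std * rng.normal(size=(pop, horizon, action_dim))
        returns = np.empty(pop)
        for i in range(pop):
            z, total = z0, 0.0
            for t in range(horizon):
                z = dynamics(z, actions[i, t])
                total += reward(z)
            returns[i] = total
        # Refit the Gaussian to the elite (highest-return) sequences.
        elite = actions[np.argsort(returns)[-elites:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean[0]  # MPC executes only the first action, then re-plans
```

For example, with dynamics `z' = z + a` and reward `-||z||^2` starting from `z = 1`, the planner recovers an initial action near `-1`, which drives the latent to the origin.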
In this section, we empirically corroborate the claims of the previous sections. Specifically, the experiments are designed to address the following questions: 1) Is our approach able to maintain its performance in the presence of distractors in the scene? 2) How does our method compare with state-of-the-art reconstruction objectives?
To answer the posed questions, we evaluate our framework on four continuous control benchmark tasks in the MuJoCo simulator: cartpole balance, reacher, finger spin, and half cheetah [10, 9]. We choose PlaNet as the state-of-the-art reconstruction-objective baseline.
Robustness Against Distractors. To test the robustness of MIRO and PlaNet in visually noisy environments, we add distractors in the background of each of the four environments above, as shown in Fig. 1. As shown in Fig. 2, we observe that in all four environments the performance of MIRO is not undermined by the presence of distractors. Interestingly, in the Cheetah environment, the performance even improves with visual noise in the background. A possible explanation is that the presence of the distractors forces the embedding to focus on information relevant to the task and thus makes the embedding more suitable for planning. In comparison, PlaNet struggles in the face of distractors: the agent barely learns in the Cartpole Balance environment, and in the Cheetah environment it suffers from a slower takeoff.
Learning curves of MIRO and PlaNet on environments with and without distractors. All curves represent the mean, and the shaded area represents one standard deviation over 3 seeds.
Comparison Against Reconstruction Objectives. When no distractors are present in the scene, we see that MIRO is able to achieve the same or superior asymptotic performance compared to its reconstruction-based counterpart. Its learning speed is environment dependent. However, given the performance boost brought by distractors, further regularization of the learned latent space presents itself as a promising direction towards performance improvement.
In this paper, we present MIRO, a mutual-information-based representation learning approach for model-based RL. Compared to state-of-the-art reconstruction-based methods, MIRO is more robust to noisy visual backgrounds, which are prevalent in real-world applications. The initial results shed light on the importance of information-theoretic representation learning and open an enticing research direction. Our next steps include optimizing the sampling of negative samples in the NCE objective to encourage the latent space to disregard task-irrelevant information.
This work was supported in part by Berkeley Deep Drive (BDD) and ONR PECASE N000141612723.
[1] (2013) The cross-entropy method for optimization. In Handbook of Statistics, Vol. 31, pp. 35–59.
[2] (2016) Learning visual feature spaces for robotic manipulation with deep spatial autoencoders. In Proceedings of the IEEE International Conference on Robotics and Automation, pp. 512–519.
[3] (2018) World models. CoRR abs/1803.10122.
[4] (2018) Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551.
[5] (2019) Data-efficient image recognition with contrastive predictive coding. CoRR abs/1905.09272.
[6] (2018) Time-agnostic prediction: predicting predictable video frames. CoRR abs/1808.07784.
[7] (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[8] (2019) Stochastic latent actor-critic: deep reinforcement learning with a latent variable model. CoRR abs/1907.00953.
[9] (2018) DeepMind Control Suite. arXiv preprint arXiv:1801.00690.
[10] (2012) MuJoCo: a physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033.
[11] (2018) Representation learning with contrastive predictive coding. CoRR.
[12] (2015) From pixels to torques: policy learning with deep dynamical models.
[13] (2015) Embed to control: a locally linear latent dynamics model for control from raw images. CoRR abs/1506.07365.
[14] (2018) SOLAR: deep structured representations for model-based reinforcement learning. CoRR.