Contrastive Variational Model-Based Reinforcement Learning for Complex Observations

08/06/2020 ∙ by Xiao Ma, et al. ∙ National University of Singapore

Deep model-based reinforcement learning (MBRL) has achieved great sample-efficiency and generalization in decision making for sophisticated simulated tasks, such as Atari games. However, real-world robot decision making requires reasoning with complex natural visual observations. This paper presents Contrastive Variational Reinforcement Learning (CVRL), an MBRL framework for complex natural observations. In contrast to the commonly used generative world models, CVRL learns a contrastive variational world model by maximizing the mutual information between latent states and observations discriminatively by contrastive learning. Contrastive learning avoids modeling the complex observation space and is significantly more robust than the standard generative world models. For decision making, CVRL discovers long-horizon behavior by online search guided by an actor-critic. CVRL achieves comparable performance with the state-of-the-art (SOTA) generative MBRL approaches on a series of Mujoco tasks, and significantly outperforms SOTAs on Natural Mujoco tasks, a new, more challenging continuous control RL benchmark with complex observations introduced in this paper.




1 Introduction

Deep reinforcement learning (DRL) has achieved great success in game playing [1, 2], robot navigation [3, 4], among others. However, model-free RL methods are notorious for sample inefficiency and poor generalization to unseen environments. Model-based RL, in contrast, learns a world model [5, 6], which summarizes the agent's past experience in a parametric form and then makes predictions about the future, greatly improving sample efficiency and generalization.

Classic methods use hand-crafted world models and perform explicit reasoning for the policy [5, 7]. However, manually constructing accurate world models is difficult. Recent advances in deep representation learning allow learning a compact latent world model from high-dimensional visual inputs [8, 9, 10]. Specifically, generative models learn the correspondence between observation and latent state by maximizing the observation likelihood, i.e., pixel-level reconstruction of the observation from the agent state [10, 11, 12]. The learned model can be used to simulate diverse future trajectories, and achieves good performance in simulated tasks with relatively simple observations, e.g., Atari games.

Real-world observations, however, require reasoning with compact features embedded in variable high-dimensional complex observations. Consider, for example, a four-legged mini-cheetah robot [13] navigating a campus. To determine the traversable path, the robot must extract the relevant geometric features that coexist with irrelevant variable backgrounds, such as moving pedestrians, paintings on the wall, etc. Learning a generative model in this environment can be very difficult: the model has to capture the pixel-level distribution of all possible observations on the campus.

We introduce Contrastive Variational Reinforcement Learning (CVRL), a sample-efficient MBRL framework that plans long-horizon behavior with a robust world model learned from complex visual observations. CVRL maintains a contrastive variational world model that captures the stochastic dynamics of the environment, as well as the reward for state-action pairs, trained by contrastive learning. Contrastive learning avoids directly modeling the complex observations and is more robust than generative models. Specifically, unlike generative pixel-level reconstruction, contrastive learning maximizes the correspondence between state and observation by scoring the real pair against fake pairs using a simple non-negative function. For example, by contrasting observations from different places, the mini-cheetah can identify its current position by simply understanding which observations it is unlikely to receive. Mathematically, we derive a contrastive evidence lower bound (CELBO), a new lower bound on the observation likelihood derived from a mutual information perspective, which sidesteps the difficulty of learning a complex generative latent world model. CVRL solves the decision making problem by combining online model predictive control (MPC) [14] with learned heuristics, i.e., an efficiently and robustly trained actor-critic, for learning long-horizon behavior.

Simulating robots in natural environments with complex observations is difficult and computationally expensive. To trade off realistic observations against light-weight simulated robotics tasks, we introduce Natural Mujoco tasks. We replace the simple background of Mujoco tasks from the DeepMind Control Suite [15] with natural videos sampled from the ILSVRC dataset [16] to simulate the "realistic" execution environment of a robot. For example, in Fig. 1, we simulate a walker walking on the road and a quadruped running through the woods. We evaluate CVRL on 10 challenging tasks and show: on standard Mujoco tasks, CVRL is comparable with SOTA MBRL methods; on natural Mujoco tasks, CVRL outperforms SOTA MBRL methods by a large margin. Specifically, CVRL achieves similar performance with or without the natural background in most cases.

We summarize our contributions as follows: 1) we present the CVRL framework for MBRL with complex observations, which learns a world model without reconstructing the complex observations and significantly outperforms the SOTA MBRL method; 2) we introduce CELBO, a new variational lower bound using contrastive learning; 3) we introduce Natural Mujoco, a new, challenging continuous control RL benchmark with complex observations.

(a) Natural Walker (b) Natural Quadruped (c) Natural Cheetah
Figure 1:

CVRL addresses tasks with complex observations, sparse rewards, and many degrees of freedom, where SOTA MBRL methods often fail. We introduce natural Mujoco games where the backgrounds are replaced with natural videos to bridge the gap between the realistic robot execution environment and the simulator.

2 Background

2.1 Related Work

MBRL with World Models. Classic MBRL approaches have focused on planning in a predefined low-dimensional state space [17]. However, manually specifying a world model is difficult [18, 19]. Recently, several works demonstrated that world models can be learned from raw pixel inputs. The majority rely on sequential variational autoencoders, which aim to minimize the reconstruction loss of the observations, to capture the stochastic dynamics of the environment [10, 11, 12]. Some other works in robotics learn to predict videos directly for planning [20, 21]. However, real-world observations are complex and noisy; building an accurate generative model over the entire observation space is challenging and leads to accumulated compositional errors in the world model.

Contrastive Learning. Contrastive learning is widely used for learning word embeddings [22], image representations [23], and graph representations [24]. The main idea is to construct real and fake sample pairs and use a function to score them differently. Concurrent to our work, contrastive learning has been applied to learn latent world models [12, 25], motivated from different perspectives. Specifically, Hafner et al. [12] use contrastive learning as an alternative to image reconstruction, where the contrastively learned agent performs worse than the one learned by image reconstruction. On the contrary, we emphasize the strength of contrastive learning in handling complex visual observations. CVRL significantly outperforms the SOTA model [12] on tasks with complex observations.

Reinforcement Learning under Complex Observations. Given complex observations, discriminative training is generally used to improve the robustness of the agent. Recent works suggest that learning task-oriented observation functions by end-to-end training improves the robustness of observation models [26, 19, 27, 28]. In particular, Ma et al. [27] introduced DPFRL, which successfully addressed a challenging task with natural video in the background, as well as robot navigation in a simulator constructed from real-world data. However, DPFRL relies on only the RL signal and is sample inefficient compared to model-based approaches. Besides, the generalization ability of DPFRL is limited by its model-free policy, and it fails on certain games. CVRL addresses complex observations from a different perspective: we use contrastive learning to learn the latent world model, which avoids modeling the complex observations. CVRL thus enjoys both the sample efficiency of model-based approaches and the robustness of model-free approaches.

2.2 Variational Latent World Models

Variational latent world models are the sequential version of variational autoencoders (VAEs) [29]. For an observable variable x, VAEs learn a latent variable z that generates x by optimizing an Evidence Lower Bound (ELBO) of log p(x):

log p(x) ≥ E_{q(z|x)}[ log p(x|z) ] − KL( q(z|x) ‖ p(z) )     (1)

where p(z) is the prior distribution and q(z|x) is a proposal distribution that samples z from the region that is likely to generate x.

Since the visual observation reveals only part of the true state, we formulate the visual control problem as a partially observable Markov decision process (POMDP) with discrete time steps t, continuous actions a_t, complex visual observations o_t, and scalar rewards r_t. We observe sequences of observation-action-reward triplets {(o_t, a_t, r_t)}_{t=1}^T, and we infer the latent states s_t following a generative process. We first assume a generative latent dynamic model defined by a transition function p(s_t | s_{t−1}, a_{t−1}), an observation function p(o_t | s_t), and a reward function p(r_t | s_t). The transition function and reward function are parameterized as Gaussian distributions, where differentiable sampling is achieved by the reparameterization trick.
For training, we optimize an Evidence Lower Bound (ELBO) of log p(o_{1:T}, r_{1:T} | a_{1:T}):

log p(o_{1:T}, r_{1:T} | a_{1:T}) ≥ Σ_{t=1}^{T} E_q[ log p(o_t | s_t) + log p(r_t | s_t) ] − Σ_{t=1}^{T} E_q[ KL( q(s_t | o_{≤t}, a_{<t}) ‖ p(s_t | s_{t−1}, a_{t−1}) ) ]     (2)

where q and p denote the inference and generative models, respectively; the first part encourages accurate reconstruction of the observation likelihood p(o_t | s_t) and the reward likelihood p(r_t | s_t), and the second part encourages learning self-consistent dynamics by minimizing the KL divergence between the prior transition distribution p(s_t | s_{t−1}, a_{t−1}) and the posterior distribution q(s_t | o_{≤t}, a_{<t}) conditioned on the observation sequences.
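The KL term above has a closed form for the diagonal Gaussians used to parameterize the latent states. As a sanity check, here is a minimal NumPy sketch; the function name is ours, not from the paper's code:

```python
import numpy as np

def kl_diag_gaussians(mu_q, std_q, mu_p, std_p):
    """KL( N(mu_q, diag(std_q^2)) || N(mu_p, diag(std_p^2)) ): the per-step
    divergence between the posterior and prior latent state distributions."""
    var_q, var_p = std_q ** 2, std_p ** 2
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# KL between identical Gaussians is zero
mu, std = np.zeros(30), np.ones(30)
print(kl_diag_gaussians(mu, std, mu, std))  # → 0.0
```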

However, purely stochastic transitions may have difficulty remembering the history and learning stably. Introducing a sequence of additional deterministic states h_t tackles this issue [30, 11]. In this work, we use the recurrent state space model (RSSM) [11], which decomposes the original latent dynamic model into the following four components:

Deterministic state model: h_t = f(h_{t−1}, s_{t−1}, a_{t−1})    Stochastic state model: s_t ∼ p(s_t | h_t)
Observation model: o_t ∼ p(o_t | h_t, s_t)    Reward model: r_t ∼ p(r_t | h_t, s_t)

As a result, during training, RSSM approximates the posterior q(s_t | o_{≤t}, a_{<t}) by q(s_t | h_t, o_t).
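To make the decomposition concrete, one imagination step of an RSSM-style model can be sketched as follows. The random linear maps stand in for the learned GRU and MLPs and are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
H, S, A = 200, 30, 6  # deterministic, stochastic, action sizes

# Toy linear maps standing in for the learned GRU / MLPs (illustrative only).
W_h = rng.normal(0, 0.01, (H, H + S + A))
W_mu = rng.normal(0, 0.01, (S, H))
W_std = rng.normal(0, 0.01, (S, H))

def rssm_step(h, s, a):
    """One step: deterministic update, then sample the stochastic state."""
    h_next = np.tanh(W_h @ np.concatenate([h, s, a]))  # h_t = f(h_{t-1}, s_{t-1}, a_{t-1})
    mu = W_mu @ h_next                                 # mean of p(s_t | h_t)
    std = np.log1p(np.exp(W_std @ h_next)) + 1e-4      # softplus std
    s_next = mu + std * rng.standard_normal(S)         # reparameterized sample
    return h_next, s_next

h, s = rssm_step(np.zeros(H), np.zeros(S), np.zeros(A))
print(h.shape, s.shape)  # (200,) (30,)
```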

(a) Latent World Model (b) Contrastive Learning (c) Latent Guided MPC
Figure 2: (a) CVRL follows a standard generative latent world model; (b) instead of a generative observation function, CVRL maximizes the mutual information between the state and the real observation, while minimizing the mutual information with irrelevant observations, e.g., observations with different background videos; (c) CVRL chooses actions with a latent guided MPC using latent analytic gradients, which combines online planning with learned heuristics, i.e., an efficiently learned actor-critic.

3 Contrastive Variational Reinforcement Learning

We introduce CVRL, a model-based reinforcement learning framework for complex observations. Our key contribution is that, instead of generative modeling, we tackle the robust latent world model learning problem by contrastive learning. Contrastive learning avoids pixel-level reconstruction of the complex observations and gives a significantly more robust latent world model. CVRL has three main components: a variational world model (Fig. 2a), a contrastive representation module that learns a robust latent representation from the complex observations (Fig. 2b), and a latent guided MPC planner using an efficiently learned actor-critic (Fig. 2c).

3.1 Contrastive Evidence Lower Bound

One big issue of RSSM is that the pixel-level generative observation model has to model the entire observation space, which is problematic given complex observations, e.g., natural observations in autonomous driving or the natural Mujoco games. Given variable videos, pixel-level reconstruction becomes difficult, which leads to inaccuracy in the learned latent world model. We introduce the Contrastive Evidence Lower Bound (CELBO), a robust optimization objective that avoids reconstructing the observations and lower bounds the original ELBO (Eqn. 2).

Instead of maximizing the observation likelihood p(o_t | s_t), we motivate the solution from a mutual information perspective. The mutual information between two variables X and Y is defined as

I(X; Y) = E_{p(x, y)}[ log ( p(x, y) / ( p(x) p(y) ) ) ]     (3)

In Eqn. 2, the observation likelihood is computed for a specific trajectory. In practice, during optimization, we consider the observation likelihood over a distribution of trajectories. We can rewrite the observation likelihood in Eqn. 2 as

E_{p(s_t, o_t)}[ log p(o_t | s_t) ] = I(s_t; o_t) + E_{p(o_t)}[ log p(o_t) ]     (4)
where the second term is a constant that can be ignored during optimization. Eqn. 4 suggests that maximizing the observation likelihood is equivalent to maximizing the mutual information of the state-observation pairs. The benefit of such a formulation is that mutual information can be estimated without reconstructing the observations, e.g., using energy models [31] or a "compatibility function" [26, 27]. When the observations are complex, the mutual information formulation is more robust than the generative parameterization.

To efficiently optimize the mutual information, we use InfoNCE, a contrastive learning method that optimizes a lower bound of the mutual information [32] and has proven powerful in a set of self-supervised learning tasks [32, 33]. Using the result of InfoNCE, the mutual information can be lower bounded by

I(s_t; o_t) ≥ E[ log ( f(s_t, o_t) / Σ_{o′ ∈ O⁻ ∪ {o_t}} f(s_t, o′) ) ] + log |O⁻ ∪ {o_t}|     (5)
where f is a non-negative function that measures the compatibility between state s_t and observation o_t, and O⁻ is a set of irrelevant observations sampled from a replay buffer. The intuition of Eqn. 5 is that we want to maximize the compatibility between the state and the real observation (positive sample), while minimizing its compatibility with the set of irrelevant observations (negative samples). In our case, we follow the setup of the original InfoNCE loss and use a simple bi-linear model f(s_t, o_t) = exp(e_{o_t}^T W s_t), where e_{o_t} is an embedding vector for observation o_t and W is a learnable weight matrix.

Substituting Eqn. 4 and Eqn. 5 into Eqn. 2, we obtain the CELBO of log p(o_{1:T}, r_{1:T} | a_{1:T}):

CELBO = Σ_{t=1}^{T} E_q[ log ( f(s_t, o_t) / Σ_{o′} f(s_t, o′) ) + log p(r_t | s_t) ] − Σ_{t=1}^{T} E_q[ KL( q(s_t | o_{≤t}, a_{<t}) ‖ p(s_t | s_{t−1}, a_{t−1}) ) ]     (6)

The CELBO objective is similar to the Deep Variational Information Bottleneck [34] in the sense of mutual information maximization. The difference is that we take a mixed approach: we use contrastive learning to optimize the mutual information only for the state-observation pairs, while directly maximizing the reward likelihood p(r_t | s_t). The scalar reward is much easier to reconstruct than the complex observations, and reward reconstruction is easier to optimize than contrastive learning, whose quality depends highly on the choice of negative samples.

We adopt a simple strategy to generate negative samples. We sample a batch of B sequences of length L from a replay buffer. For each state-observation pair (s_t^i, o_t^i), we treat the other B × L − 1 observations in the same batch as negative samples. The intuition of this choice is: 1) by contrasting o_t^i with observations o_{t′}^j from other sequences (j ≠ i), CELBO learns to identify invariant features of the task given variable visual features; 2) by contrasting o_t^i with observations o_{t′}^i (t′ ≠ t) from the same sequence, CELBO learns to model the temporal dynamics of the task. We found this simple strategy works well in practice.
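This in-batch negative sampling, combined with a bilinear compatibility function, can be sketched as an InfoNCE loss in a few lines of NumPy. Names and shapes here are illustrative, not from the released code:

```python
import numpy as np

def infonce_loss(states, obs_emb, W):
    """InfoNCE with a bilinear score f(s, o) = exp(e_o^T W s).
    For each state, the matching observation in the batch is the positive;
    all other observations in the batch serve as negatives."""
    logits = obs_emb @ W @ states.T                       # [N, N]; logits[i, j] = e_i^T W s_j
    logits -= logits.max(axis=0, keepdims=True)           # numerical stability
    # log-softmax over observations for each state; the diagonal holds positives
    log_probs = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
N, D_s, D_o = 8, 30, 16
states = rng.normal(size=(N, D_s))
obs_emb = rng.normal(size=(N, D_o))
W = rng.normal(size=(D_o, D_s))
loss = infonce_loss(states, obs_emb, W)
print(loss > 0)  # True: the loss is a positive scalar
```

Minimizing this loss maximizes the lower bound of Eqn. 5 up to the constant log N.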

3.2 Hybrid Actor-Critic

CVRL trains an actor-critic using a hybrid approach, benefiting from the sample efficiency of model-based learning and the task-oriented feature learning of model-free RL.

Actor-Critic from Latent Imagination. First, CVRL uses latent imagination to train the actor-critic, i.e., reasoning over the latent world model, which reduces the number of interactions needed with the non-differentiable environment. In particular, since the predicted reward and latent dynamics are differentiable, analytic gradients can back-propagate through the dynamics. As a result, the actor-critic can potentially approximate long-horizon planning behaviors [12].

We adopt the same strategy as Dreamer [12]. We parameterize the actor model as a tanh-transformed Gaussian, i.e., a_t = tanh(μ(s_t) + σ(s_t) ε), where ε ∼ N(0, I). For the value model, we use a feed-forward network v(s_t) with a scalar output. To compute the analytic gradient, we first estimate the state values of the imagined trajectory {s_τ}_{τ=t}^{t+H}, where the actions are sampled from the actor network. We denote the value estimate of s_τ as V_λ(s_τ). Detailed descriptions of the value estimation and imagined trajectory generation are in the appendix. The Dreamer learning objective is thus given by

max_actor E[ Σ_{τ=t}^{t+H} V_λ(s_τ) ],    min_value E[ Σ_{τ=t}^{t+H} ( v(s_τ) − V_λ(s_τ) )² / 2 ]     (7)
Hybrid Actor-Critic. The performance of latent imagination relies heavily on the accuracy of the learned latent world model. Given complex observations, learning an accurate world model is difficult, even with CELBO. We introduce a simple yet effective hybrid training scheme to address this issue. CVRL combines the Dreamer objective with a secondary training signal from standard off-policy RL, using the ground truth trajectories. A discriminative RL objective can improve the robustness of the actor-critic, at the cost of sample efficiency [27]. Thus, CVRL benefits from both the sample efficiency of the latent analytic gradient and the robustness of the discriminative RL gradient.

In our experiments, we use Soft Actor-Critic (SAC) [35] to perform off-policy RL. During each optimization step, we apply the SAC objective to the ground truth trajectories and the Dreamer objective (Eqn. 7) to the imagined trajectories; the final objective is the sum of the two.
3.3 Latent Guided Model Predictive Control

Although the learned actor-critic maximizes the accumulated reward, a model-free policy, without explicit reasoning with world models, might get stuck in a local optimum [18, 36]. Model predictive control (MPC) is widely used to address continuous control problems, where multiple iterations of online optimization help the policy converge to the optimal solution [37].

We introduce latent guided model predictive control. Specifically, we use the shooting method from trajectory optimization to address the MPC task. For a state s_t, we perform a forward search using the latent world model guided by the learned actor-critic, and generate a latent imagination trajectory. We compute the value estimate V_λ for the sampled trajectory, compute the analytic gradient of V_λ with respect to the action sequence, and update the actions by gradient ascent. In practice, the combination of offline training with online planning gives better performance. A detailed description of the algorithm can be found in the appendix.
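The shooting procedure can be illustrated on a toy differentiable world model, where the gradient of the accumulated reward with respect to the action sequence is available in closed form. This hypothetical example uses linear dynamics and a quadratic reward; the actual CVRL planner backpropagates through the learned latent model instead:

```python
import numpy as np

def mpc_shooting(s0, goal, horizon=5, iters=100, lr=0.05):
    """Gradient-ascent shooting on a toy differentiable world model:
    s_{t+1} = s_t + a_t, reward r(s) = -||s - goal||^2.
    Stands in for backpropagating through the learned latent dynamics."""
    actions = np.zeros((horizon, s0.size))
    for _ in range(iters):
        # forward rollout of the action sequence
        states = [s0]
        for a in actions:
            states.append(states[-1] + a)
        # analytic gradient of the total reward w.r.t. each action
        grads = np.zeros_like(actions)
        for k in range(horizon):
            for t in range(k + 1, horizon + 1):
                grads[k] += -2.0 * (states[t] - goal)
        actions += lr * grads  # gradient ascent on the accumulated reward
    return actions

s0, goal = np.zeros(2), np.array([1.0, -1.0])
plan = mpc_shooting(s0, goal)
# the first action already moves toward the goal
print(np.sign(plan[0]))  # [ 1. -1.]
```

As in the paper's planner, only the first action of the optimized sequence would be executed before replanning.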

4 Experiments

We evaluate CVRL on 10 continuous control tasks in the DeepMind Control Suite [15]. These tasks pose various challenges to robotics, including sparse rewards, 3D scenes and contact dynamics. However, these tasks have a clean background, while real robots work in environments with variable and complex configurations. Thus, standard Mujoco environments do not necessarily reveal the performance of an algorithm on real robots.

We introduce a new benchmark for robot RL under complex, variable, natural observations: Natural Mujoco. Natural Atari games [27] replace the background of standard Atari games with videos to increase the complexity of the observation space. To bridge the gap between the simulated environments and real robots, we introduce Natural Mujoco tasks by replacing the simple background with videos chosen from the ILSVRC dataset [16]; as a result, the robot moves in a relatively more realistic environment, e.g., a walker walking on a road and a quadruped running through the woods (Fig. 1). Natural Mujoco tasks offer a tradeoff: they are as easy to use as the standard Mujoco tasks, while simulating the visual features of a robot executing in the real world.
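A Natural Mujoco-style frame can be produced by compositing the rendered robot over a video frame. The sketch below is a minimal illustration assuming a foreground mask is available; how the mask is obtained is simulator-specific, and the function name is ours:

```python
import numpy as np

def composite_natural_frame(frame, video_frame, fg_mask):
    """Replace the clean simulator background with a natural video frame.
    fg_mask is a boolean [H, W] foreground mask; extracting it (e.g., by
    rendering against a known background color) depends on the simulator."""
    out = video_frame.copy()
    out[fg_mask] = frame[fg_mask]  # keep the robot pixels, swap the background
    return out

H, W = 4, 4
frame = np.full((H, W, 3), 255, dtype=np.uint8)  # white "robot" pixels
video = np.zeros((H, W, 3), dtype=np.uint8)      # natural video frame
mask = np.zeros((H, W), dtype=bool)
mask[1:3, 1:3] = True                            # foreground region
out = composite_natural_frame(frame, video, mask)
print(out[1, 1, 0], out[0, 0, 0])  # 255 0
```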

We compare CVRL with the state-of-the-art generative MBRL method, Dreamer [12], and a model-free baseline, Soft Actor-Critic (SAC) [35]; we use the official implementation of Dreamer and the SAC implementation from OpenAI baselines. We also include the result of D4PG [38], trained for sufficient time on standard Mujoco tasks, as a baseline for Mujoco tasks. We show that: 1) CVRL significantly outperforms SAC in all cases, with much fewer training iterations; 2) CVRL significantly outperforms Dreamer on natural Mujoco tasks because of the robust contrastive learning, and achieves comparable performance on standard Mujoco; 3) the proposed hybrid actor-critic training scheme and guided model predictive control further improve the performance of CVRL on natural Mujoco tasks.

4.1 Experiment Setup

For all tasks, the agent observations are rendered images, and the action dimension ranges from 1 to 12. We limit the maximum length of each episode to 1000 steps and use an action repeat of 2 for all tasks, following the convention of Dreamer.

We use a similar network architecture to the official implementation of Dreamer [12] to make it a fair comparison. We use a batch size of 50 and a training sequence length of 50, which gives 2499 negative samples for each positive state-observation pair. All models are trained with Adam optimizers, where we use different learning rates for the actor-critic and the latent world model. We train CVRL, Dreamer and SAC for the same number of environment steps, while D4PG is trained for 20 times more steps, which we use as an indicator of the best model-free performance on Mujoco tasks. All reported results are averaged over 3 random seeds. More details are available in the appendix.

4.2 Results

Standard Natural
CVRL Dreamer [12] SAC D4PG [12] CVRL Dreamer SAC
walker-walk 980.3 961.7 355.7 968.3 941.5 206.6 44.1
walker-run 377.7 824.6 153.0 567.2 382.1 82.7 78.1
cheetah-run 528.1 894.5 181.8 523.8 248.7 100.7 35.8
finger-spin 989.2 498.8 309.5 985.7 850.4 13.6 23.8
cartpole-balance 997.1 979.6 355.5 992.8 911.9 163.7 206.0
cartpole-swingup 863.4 833.6 252.5 862.0 413.8 117.6 150.5
cup-catch 964.9 962.5 537.3 980.5 894.2 131.1 202.2
reacher-easy 968.2 935.1 458.6 967.4 909.1 133.7 137.7
quadruped-walk 950.3 931.6 382.7 - 878.7 153.2 204.3
pendulum-swingup 912.1 833.0 28.6 680.9 842.9 12.4 14.8
Table 1: CVRL achieves comparable performance with the SOTA method, Dreamer [12], on standard Mujoco tasks and significantly outperforms Dreamer on Natural Mujoco tasks. CVRL, Dreamer and SAC are trained for the same number of environment steps, while the best model-free baseline D4PG is trained for 20 times more steps, which we use as an indicator of the achievable performance on standard Mujoco tasks. D4PG results are taken directly from the Dreamer paper.
Figure 3: Generative models learn a latent world model by pixel level reconstruction, which is difficult when the observations are complex and variable. The first row shows the complex observations of natural Walker with varying video backgrounds, and the second row shows the reconstruction of generative models.

We present the results in Table 1. We analyze the quantitative results as follows.

Model-based CVRL outperforms the model-free baseline. We observe that both CVRL and Dreamer reach the best achievable performance, indicated by D4PG, the state-of-the-art model-free baseline trained for 20 times more steps. The learned latent world model successfully captures the real environment dynamics from pixel-level input, so that the trained actor-critic achieves comparable performance with D4PG, which is trained on ground truth trajectories. In contrast, given the same number of training steps, CVRL and Dreamer significantly outperform SAC on all tasks. This also suggests that the benefit of CVRL comes from the overall framework design, rather than from SAC alone.

CVRL is more robust to the natural observations. On Natural Mujoco tasks, where the observations are more complex and variable, CVRL significantly outperforms the generative Dreamer in all cases. Although Dreamer achieves SOTA performance on the standard Mujoco tasks with relatively simple observations, its performance drops dramatically on natural Mujoco given complex observations introduced by the video background (e.g., on walker-walk, 961.7 vs. 206.6). CVRL, however, achieves comparable performance on 8 out of 10 tasks with or without the video background (e.g., on walker-walk, 980.3 vs. 941.5). This suggests that contrastive learning, which avoids pixel-level reconstruction, helps to learn a more robust latent world model than the generative models. Even with the variable, complex video background, the learned latent world model still successfully captures the underlying dynamics and achieves performance comparable to the simple-observation setting. Besides, we visualize the reconstruction of generative models. We unroll the model for 40 steps and compare it with the ground truth images in Fig. 3. The reconstructions are blurry and lose information about the agent, which explains the failure of the generative Dreamer on Natural Mujoco tasks.

4.3 Ablation Studies

CVRL CVRL-generative CVRL-no-MPC CVRL-no-SAC CVRL-reward-only
walker-walk 941.5 297.7 904.8 915.2 197.9
walker-run 382.1 71.4 343.2 378.3 115.4
cheetah-run 248.7 113.3 430.1 301.0 284.8
finger-spin 850.4 13.9 753.3 668.8 68.7
cartpole-balance 911.9 188.4 996.3 962.3 431.6
cartpole-swingup 413.8 160.5 353.0 465.9 176.3
ball_in_cup-catch 894.2 254.8 881.4 930.4 368.7
reacher-easy 909.1 235.8 858.9 880.5 167.2
quadruped-walk 878.7 157.3 595.2 213.5 188.7
pendulum-swingup 842.9 19.7 831.5 813.3 20.8
Table 2: Ablation Studies on natural Mujoco tasks. CVRL generally outperforms all other variants.

We conduct a comprehensive ablation study on the Natural Mujoco tasks to better understand each proposed component. The results are presented in Table 2.

Contrastive variational latent world model is more robust to complex observations. CVRL-generative replaces the contrastive learning with a generative model that performs image-level reconstruction. Unlike Dreamer, CVRL-generative only differs from the CVRL in the parameterization of the representation learning method, and still has the rest of the proposed components. However, its performance degrades on all cases compared to CVRL. This aligns with our previous observation that contrastive learning is more robust given complex observations.

Latent guided MPC improves the ability of CVRL to reason about long-horizon behaviors. CVRL-no-MPC uses only the actor-critic for decision making. We observe that it performs poorly on some of the challenging tasks, e.g., cartpole-swingup and quadruped-walk, where multi-step reasoning is required. The proposed latent guided MPC improves the overall performance of CVRL.

The hybrid actor-critic is robust given complex observations. CVRL-no-SAC removes the SAC objective during actor-critic learning. Its performance drops in certain cases compared to CVRL (e.g., on finger-spin, 850.4 vs. 668.8, and quadruped-walk, 878.7 vs. 213.5). This is because when the useful features are highly coupled with a variable and complex background, learning an accurate latent world model becomes difficult, even for CELBO. With ground-truth trajectories, SAC can provide accurate training signals to compensate for the compositional error of the latent world model.

Reward signal alone is not enough for learning the latent world model. CVRL-reward-only uses only reward prediction for representation learning. Its performance drops in all cases. This suggests that the robustness of CVRL comes from the contrastive learning, rather than only the reward learning.

5 Conclusions

We introduce CVRL, a framework for robust MBRL under natural complex observations. CVRL learns a contrastive variational world model with the CELBO objective, a contrastive learning alternative to the ELBO, which avoids reconstructing the complex observations. CVRL learns a robust hybrid actor-critic and uses guided MPC for decision making. CVRL achieves comparable performance with the SOTA methods on 10 challenging Mujoco control tasks. Further, we present natural Mujoco tasks, a new challenging benchmark with complex natural observations, on which CVRL significantly outperforms the alternative SOTA methods.

However, CVRL does not perform as well as Dreamer on some standard Mujoco tasks (walker-run and cheetah-run), where the observation is simple. While contrastive learning is robust to complex observations, its quality depends highly on the sampling strategy for negative samples. Currently we use a very simple strategy. Future work may consider smarter sampling strategies, e.g., learning to sample using meta-learning.


Appendix A Algorithm Details

A.1 Latent Imagination

CVRL first generates the imagined trajectories using the learned world model. Specifically, given a state s_τ, we sample the next imagined state s_{τ+1} from the learned transition model, which further generates a reward r_{τ+1} and, through the actor network, the next action a_{τ+1}. We repeat this process until we have an imagined trajectory {(s_τ, a_τ, r_τ)}_{τ=t}^{t+H}.

A.2 Value Estimation of Dreamer

Dreamer estimates the value of imagined trajectories using the following equations:

V_N^k(s_τ) = E[ Σ_{n=τ}^{h−1} γ^{n−τ} r_n + γ^{h−τ} v(s_h) ],   h = min(τ + k, t + H)
V_λ(s_τ) = (1 − λ) Σ_{k=1}^{H−1} λ^{k−1} V_N^k(s_τ) + λ^{H−1} V_N^H(s_τ)

V_N^k estimates the value of s_τ using the rewards of k steps of rollouts and the value function estimate of the last state. Dreamer uses V_λ as the final value estimate, an exponentially-weighted average of different k-step rollouts that trades off bias and variance.
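The same λ-return admits a simple backward recursion, V_λ(s_τ) = r_τ + γ[(1 − λ) v(s_{τ+1}) + λ V_λ(s_{τ+1})], which is equivalent to the weighted average above. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def lambda_returns(rewards, values, bootstrap, gamma=0.99, lam=0.95):
    """Exponentially-weighted average of k-step returns (the V_lambda
    estimate), computed backwards over an imagined trajectory.
    rewards[t] / values[t]: predicted reward and value at step t;
    bootstrap: value estimate of the state after the last step."""
    H = len(rewards)
    returns = np.zeros(H)
    next_ret = bootstrap
    for t in reversed(range(H)):
        next_val = values[t + 1] if t + 1 < H else bootstrap
        next_ret = rewards[t] + gamma * ((1 - lam) * next_val + lam * next_ret)
        returns[t] = next_ret
    return returns

# with lam=1 this reduces to the discounted Monte-Carlo return
r, v = np.array([1.0, 1.0, 1.0]), np.zeros(3)
ret = lambda_returns(r, v, bootstrap=0.0, gamma=0.5, lam=1.0)
print(ret[0])  # 1 + 0.5 + 0.25 = 1.75
```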

A.3 Latent Guided MPC

Originally, for each state, the actor network generates the action that maximizes the long-horizon accumulated reward. However, this approximation depends heavily on the quality of the learned world model and might have difficulty approximating complex policies. Most importantly, it lacks the reasoning ability to adapt to variable environments.

We use the shooting method for MPC with a differentiable world model. Specifically, we use stochastic gradient ascent to optimize the action sequence for a high accumulated reward. During execution, for each observation o_t, previous state s_{t−1} and action a_{t−1}, we encode/propose the current state s_t with the posterior model. Next, we perform latent imagination, sample the imagined trajectories, and estimate the value V_λ. As V_λ is computed using predicted rewards and value estimates, which are conditioned on the action sequence, we can backpropagate the gradients from V_λ to the actions. We update each action by gradient ascent, a_τ ← a_τ + α ∂V_λ / ∂a_τ. We repeat this for all actions and return the first action after the update.

Our latent guided MPC is similar to the planning algorithm used in DPI-Net [39]. The difference is that DPI-Net requires a pre-defined observation of the goal to compute the loss, whereas CVRL directly maximizes the accumulated reward and alleviates this assumption.

Appendix B Implementation Details

Hardware and Software.

We train all models on a single NVIDIA RTX 2080 Ti GPU with an Intel Xeon Gold 5220 CPU @ 2.20GHz. We implement all models with Tensorflow 2.2.0 and Tensorflow Probability 0.10.0. Part of our code is developed based on the official Tensorflow implementation of Dreamer, but heavily modified. We use the official implementation of Dreamer as our baseline, and the SAC implementation from OpenAI baselines. All methods share certain structures, including the encoder, the RSSM model and the actor-critic networks, to make it a fair comparison.

Observation Encoder. We use an encoder of 4 convolutional layers for image observations, with a fixed kernel size of 4 and increasing channel numbers: 32, 64, 128, 256. We do not encode the actions; we directly concatenate them with the states.

RSSM. We use a stochastic state with size 30 and a deterministic state with size 200. The deterministic update function is parameterized using a GRU, and for the stochastic part, we learn the mean and variance of the state using two fully connected layers with sizes 200 and 30.

Contrastive Learning. In contrastive learning, we learn the compatibility between state and observation with a function f. In our implementation, we first encode both the state and the observation with two separate fully connected layers, then we compute f(s_t, o_t) = exp(e_o^T W e_s), where e_o and e_s are the embeddings of the observation and the state, and W is a learnable matrix.

Actor-Critic. For the actor network, we use 4 fully connected layers that take the concatenation of the deterministic and stochastic states as input, with an intermediate hidden dimension of 400, and output the corresponding action, with tanh as the activation function. Specifically, a transformed distribution is used to achieve differentiable sampling. For the value network, 3 fully connected layers are used, with a hidden dimension of 400 and an output dimension of 1. In addition, SAC needs additional Q-value networks during training. For models that need SAC, we use 2 Q-value networks with a similar structure, except that the input additionally includes the action.

Model Learning. We train CVRL with 4 separate optimizers for different parts of the network: a model optimizer, a value optimizer, an actor optimizer and a SAC optimizer. All are Adam optimizers with different learning rates. The model optimizer updates the contrastive variational world model by the representation learning objective defined in Eqn. 6; the value optimizer updates only the value network parameters; the actor optimizer updates the actor parameters; the SAC optimizer updates the actor parameters and the two Q-value networks.

Latent Guided MPC. In latent guided MPC, we unroll for 15 steps and update the actions by standard SGD with learning rate 0.003.

Appendix C Additional Visualizations

(a) Natural Walker Walk (b) Natural Walker Run (c) Natural Cheetah Run
(d) Natural Finger Spin (e) Natural Cartpole Balance (f) Natural Cartpole Swingup
(g) Natural Cup Catch (h) Natural Reacher Easy (i) Natural Quadruped Walk
(j) Natural Pendulum Swingup