
S2P: State-conditioned Image Synthesis for Data Augmentation in Offline Reinforcement Learning

Offline reinforcement learning (offline RL) suffers from an innate distributional shift because it cannot interact with the physical environment during training. To alleviate this limitation, state-based offline RL leverages a dynamics model learned from the logged experience and augments the predicted state transitions to extend the data distribution. To exploit this benefit in image-based RL as well, we propose a generative model, S2P (State2Pixel), which synthesizes raw pixel images of the agent from its corresponding state. It bridges the gap between the state and the image domains in RL algorithms and enables virtually exploring unseen image distributions via model-based transitions in the state space. Through experiments, we confirm that our S2P-based image synthesis not only improves image-based offline RL performance but also shows powerful generalization capability on unseen tasks.





1 Introduction

Figure 1: S2P generates dynamics-consistent image transition data by virtually exploring the state space to extend the distribution of offline datasets.

Deep learning algorithms have shown significant progress thanks to large pre-collected datasets, such as SQuAD rajpurkar2016squad in natural language processing (NLP) and ImageNet in computer vision. On the contrary, reinforcement learning (RL) requires online trial-and-error during training to collect data by interacting with the environment, which hinders its utilization in many real-world applications. Due to this intrinsic property of current online RL algorithms, some approaches try to deploy large and diverse pre-recorded datasets without online interaction with the environment, which is called offline RL.

However, recent studies have observed that current online RL algorithms haarnoja2018soft; DBLP:journals/corr/LillicrapHPHETS15 perform poorly in an offline setting. This is primarily attributed to the large extrapolation error when the Q-function is evaluated on out-of-distribution actions, which is called the distributional shift kumar2019stabilizing; DBLP:conf/nips/KidambiRNJ20; fujimoto2019off. That is, because the offline setting precludes online data collection, offline RL struggles to generalize beyond the given offline dataset. Even though some offline RL methods DBLP:conf/nips/KumarZTL20; kostrikov2021offline; wu2019behavior achieve reasonable performance in some settings, their training is still limited to behaviors within the given offline dataset distribution, and they detour around evaluation on out-of-distribution data rather than directly addressing the empty space of the offline dataset. Thus, there is a growing need for algorithms that directly address such out-of-distribution data by extending the support of the offline dataset distribution.

To alleviate the fixed-dataset-distribution problem, recent studies propose data augmentation strategies. In state-based RL, model-based algorithms deisenroth2011pilco; chua2018deep; kumar2016optimal; janner2019trust, which learn a dynamics model from the pre-recorded dataset and augment the dataset with generated state transitions, have emerged as a promising paradigm. As the model-based approach trains dynamics models in a supervised manner, it allows a stable training process and generates reliable state transition data for augmentation. Thus, it is a plausible choice for generalizing to unseen state-actions by performing dynamics-consistent planning over unseen state distributions.

When it comes to the image domain, however, there is still no augmentation strategy to mitigate the aforementioned distributional shift. Even though some model-based image RL methods DBLP:conf/iclr/HafnerLB020; hafner2019learning that learn latent dynamics using the reconstruction error from an ELBO objective jordan1999introduction; tishby2000information; kingma2013auto could be exploited to generate image transition data in a manner similar to the state-based methods, the quality of their output images is not satisfactory because 1) their focus is on learning a latent representation suitable for the RL network's inputs rather than generating high-quality, accurate images; 2) ELBO-based objectives cannot generate photo-realistic images compared to other generative models such as GANs NIPS2014_5ca3e9b1 or diffusion models ho2020denoising; nichol2021improved, and usually produce blurry outputs; and above all, 3) model-based image RL exploits only image inputs, which makes the generative model fail to capture the accurate dynamics and the details of the agent's posture or objects in the image nguyen2021temporal; okada2021dreaming; deng2021dreamerpro. These undesired properties discourage offline RL algorithms from adding the reconstruction outputs of model-based image RL to their training data as an augmentation strategy.

Therefore, we propose S2P (State2Pixel), which utilizes multi-modal input (the state and the previous image observation of the agent) to synthesize a realistic image from the corresponding state. The key element of S2P is a multimodal affine transformation (MAT) which effectively exploits cross-modality information from both state and image. Unlike previous learned affine transformations karras2019style; karras2020analyzing; karras2021alias; park2019semantic; li2020manigan, which leverage a single-domain input, MAT fuses the cross-modal representation from the state and the image to produce the scale and bias modulation parameters. This multi-modality makes it possible to generate dynamics-consistent images from reliable state transitions while preserving high-quality image generation capability.

To sum up, our work makes the following key contributions.

  • We propose a state-to-pixel generative model (S2P) which generates dynamics-consistent images, together with a multimodal affine transformation (MAT) module for aggregating cross-modal inputs.

  • To the best of the authors' knowledge, this work is the first to propose image augmentation for offline image RL, overcoming the innate fixed-distribution problem by implicitly leveraging reliable state transitions.

  • We evaluate S2P on the DMControl tunyasuvunakool2020dm_control benchmark environments in the offline setting, where augmenting with the generated synthetic image transition data yields higher offline RL performance.

  • Even with the state distributions of unseen tasks, S2P can generalize to unseen image distributions, and we show that the agent can be trained by offline RL with these generated images only.

2 Related Work

2.1 Image Synthesis

Generative Adversarial Network (GAN) NIPS2014_5ca3e9b1 based deep generative models enjoy huge success in synthesizing high-resolution photo-realistic images via style mapping functions and learned affine transformations karras2019style; karras2020analyzing; karras2021alias; park2019semantic; li2020manigan. StyleGAN karras2019style first utilizes a style vector and Adaptive Instance Normalization (AdaIN) dumoulin2016learned; huang2017arbitrary in the generative networks to disentangle the latent space and control scale-specific synthesis. Follow-up work karras2020analyzing pinpoints that the AdaIN operation, which causes information loss in the feature magnitudes, produces undesired droplet-like artifacts in the synthesized images, and proposes weight demodulation under an assumption on the variance of the input features. SPADE park2019semantic proposes an architecture to synthesize images using their corresponding semantic masks and a spatially learned affine transformation. ManiGAN li2020manigan suggests a Text-Image Affine Combination Module (ACM) which enables the network to manipulate images using text descriptions given by users. The difference between our proposed MAT and these previous studies is that we leverage cross-modal data, state and image, to estimate the modulation parameters for the learned affine transformation, whereas other studies use a single data type such as text or image.

2.2 Offline Reinforcement Learning & Data Augmentation

Offline RL ernst2005tree; lange2012batch; levine2020offline is the task of learning policies from a given static dataset, which is different from online RL that learns useful behaviors through trial-and-error in the environment. Prior offline RL algorithms are designed to constrain the policy to the behavior policy used for offline data collection via direct state or action constraints fujimoto2019off; DBLP:conf/nips/0009SAB20, maximum mean discrepancy kumar2019stabilizing, KL divergence wu2019behavior; DBLP:conf/corl/ZhouBH20; jaques2019way, or learning conservative critics DBLP:conf/nips/KumarZTL20; kostrikov2021offline. However, most of these methods are limited to exploiting the state-action distribution of the given static dataset, rather than exploring and extending the distribution. As the offline setting prohibits online interaction with the environment, we suggest the synthetic data generation method to enable the offline RL agent to virtually explore and extend the distribution by leaving the support of the dataset.

Recent works in model-based state RL that involve learning a policy with a dynamics model DBLP:conf/nips/KidambiRNJ20; deisenroth2011pilco; chua2018deep; DBLP:conf/iclr/KurutachCDTA18; DBLP:conf/nips/YuTYEZLFM20; yu2021combo suggest augmenting the data with generated transitions from the model to extend the data distribution. Augmentation strategies in the image domain laskin2020curl; DBLP:conf/iclr/SchwarzerAGHCB21; laskin2020reinforcement; yarats2021mastering; DBLP:conf/iclr/YaratsKF21 also emphasize the importance of image augmentation for sample efficiency and robust representation learning. However, these works focus on purely image-manipulation techniques applied to the given image, such as cropping, rather than generating image transitions. Some image-based methods rafailov2021offline; hafner2019learning; DBLP:conf/iclr/HafnerLB020 that use a variational model to train latent dynamics follow a concept similar to the state-based ones. However, the generated image in these methods is a byproduct of learning an effective image representation rather than the main purpose, and as these methods utilize only image inputs, they suffer from missing objects or inaccurate agent dynamics in the reconstructed images. Therefore, to bridge the gap between model-based state transition data augmentation and image augmentation, we propose a method that generates dynamics-consistent image transition data along timesteps with multi-modal inputs.

3 Method

Figure 2: An overview of the S2P architecture. The state $s_t$ and the previous image $I_{t-1}$ are used as input to generate the current-step image $I_t$. The spatial size of the features grows as they pass through multiple upsampling generators. G and MAT indicate the generator block and the Multimodal Affine Transformation, respectively.


3.1 S2P Generator

Architecture. The goal of the S2P (State2Pixel) generator is to synthesize the image $I_t$ that represents all the information of its corresponding state $s_t$. Unfortunately, conditioning on the state alone cannot determine a single deterministic rendered image because, in most cases, the state does not provide the agent's position in a global coordinate frame, but rather in an egocentric one. Also, image-based RL algorithms utilize sequential images to capture the agent's velocity from the change of the background, e.g. the ground checkerboard, between input images. This means that the image of the current step depends not only on the current state $s_t$, but also on the image of the previous step $I_{t-1}$. We therefore build the generator to synthesize the image from both the state and the previous image so that the generated image preserves dynamics-consistency with the physical environment.


At the first layer of S2P, the input image $I_{t-1}$ and the state $s_t$ are projected to the feature space via convolution and MLP encoders, respectively. Both features are then fed to the hierarchical generator blocks, each of which consists of several residual connections he2016deep and an upsampling layer. After passing through each generator block, the spatial size of the feature map is doubled while the channel dimension is halved. The image features are converted to RGB images at the last layer of the generator with a single convolution layer.
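As a concrete illustration of this schedule, the following sketch computes the feature-map shapes through the blocks; the initial shape and block count are assumptions for illustration, not the paper's exact configuration.

```python
# Sketch of the feature-map schedule described above: after each generator
# block the spatial size doubles while the channel dimension is halved.
# The initial 512x4x4 feature and four blocks are illustrative assumptions.
def feature_shapes(init_channels=512, init_size=4, num_blocks=4):
    c, s = init_channels, init_size
    shapes = [(c, s, s)]
    for _ in range(num_blocks):
        c //= 2   # channel dimension halved
        s *= 2    # spatial size doubled
        shapes.append((c, s, s))
    return shapes

print(feature_shapes())
# [(512, 4, 4), (256, 8, 8), (128, 16, 16), (64, 32, 32), (32, 64, 64)]
```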

We observe that the input signals, i.e. the state $s_t$ and the image $I_{t-1}$, become attenuated as they pass through deeper generating layers, and the network produces images of poor quality. So, similar to recent style-based image synthesis algorithms karras2019style; karras2020analyzing; karras2021alias; park2019semantic; li2020manigan, we adopt a learned affine transformation architecture to inject auxiliary signals into the generator. The difference between previous style-based generative models and S2P is that we propose a multimodal affine transformation (MAT) that produces the learnable modulation parameters $\gamma$ and $\beta$ from the cross-modality representation via a multimodal feature extractor and a state-to-latent mapping function. The overall architecture of our proposed S2P is depicted in Figure 2.

A non-linear latent mapping function, implemented as an 8-layer MLP, produces a latent code $w$ from the given state $s_t$ in the state space $\mathcal{S}$. The latent code $w$ is spatially expanded to the same size as the input feature $x$ of the MAT module, and the conditioned image $I_{t-1}$ is also linearly interpolated to match the size of $x$. The spatially expanded $w$ and the resized $I_{t-1}$ are channel-wise concatenated and fed to the multimodal feature extractor to fuse the state and image cross-modality representation. Each estimator then produces the learnable scale $\gamma$ and bias $\beta$ for effective cross-modal affine transformation.

Finally, our proposed MAT operation is defined as:

$$\mathrm{MAT}(x) = \gamma \odot \frac{x - \mu(x)}{\sigma(x)} + \beta,$$

where $\mu(x)$ and $\sigma(x)$ are the channel-wise mean and standard deviation of the input feature $x$ of MAT from the corresponding generator block, and $\odot$ denotes the Hadamard element-wise product. The design of MAT is illustrated in Figure 3.
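A minimal NumPy sketch of the MAT operation follows; the multimodal feature extractor that produces the modulation maps from the expanded latent and the resized previous image is omitted, so `gamma` and `beta` are taken as given, and all names are illustrative rather than the paper's implementation.

```python
import numpy as np

def mat(x, gamma, beta, eps=1e-5):
    """Apply a multimodal-affine-transformation-style modulation (sketch).

    x:            input feature map (C, H, W) from a generator block
    gamma, beta:  modulation maps (C, H, W), assumed to come from the fused
                  state latent w and resized previous image (extractor omitted)
    """
    mu = x.mean(axis=(1, 2), keepdims=True)     # channel-wise mean
    sigma = x.std(axis=(1, 2), keepdims=True)   # channel-wise std
    x_norm = (x - mu) / (sigma + eps)           # normalize each channel
    return gamma * x_norm + beta                # Hadamard modulation
```

With `gamma = 1` and `beta = 0` this reduces to plain channel-wise normalization, which makes the role of the learned modulation easy to see.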

In addition, it has been shown that neural networks are biased toward learning low-frequency mappings and have difficulty representing high-frequency information rahaman2019spectral; mildenhall2020nerf. To mitigate this undesired tendency, we do not use the naïve state vector as input, but employ a positional encoding with a high-frequency function defined as:

$$\gamma(p) = \left(\sin(2^{0}\pi p), \cos(2^{0}\pi p), \ldots, \sin(2^{L-1}\pi p), \cos(2^{L-1}\pi p)\right),$$

where $p$ indicates each component of the state vector $s_t$.
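An illustrative sketch of this NeRF-style encoding; the number of frequency bands `num_freqs` is an assumed hyperparameter, not the paper's value.

```python
import numpy as np

def positional_encoding(s, num_freqs=6):
    """Map each component p of the state vector to
    (sin(2^0 pi p), cos(2^0 pi p), ..., sin(2^(L-1) pi p), cos(2^(L-1) pi p)).
    num_freqs (L) is an illustrative choice."""
    feats = []
    for k in range(num_freqs):
        feats.append(np.sin(2.0 ** k * np.pi * s))
        feats.append(np.cos(2.0 ** k * np.pi * s))
    return np.concatenate(feats, axis=-1)
```

A state of dimension d is thus expanded to dimension 2·L·d before being fed to the latent mapping network.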

We utilize the multi-scale discriminators of wang2018high to increase the receptive field without deeper layers or larger convolution kernels, alleviating overfitting. Two discriminators with identical architectures are adopted during training, and the synthesized and real images, resized to several spatial sizes, are fed to each discriminator.

Loss Function. Our S2P generator is trained with a linear combination of multiple objectives. First, we leverage a pixel-wise reconstruction loss between the output of the generator $\hat{I}_t$ and the real image $I_t$:

$$\mathcal{L}_{pix} = \lVert I_t - \hat{I}_t \rVert_1.$$

Figure 3: Multimodal Affine Transformation (MAT) module.

In addition to the pixel-wise loss, we also utilize the ImageNet ILSVRC15 pre-trained VGG19 simonyan2014very network to calculate a perceptual similarity loss between the two images:

$$\mathcal{L}_{vgg} = \sum_{i} \lVert \phi_i(I_t) - \phi_i(\hat{I}_t) \rVert_1,$$

where $\phi_i$ denotes the $i$-th layer of VGG19.

We implement the adversarial objective for both the S2P generator and the multi-scale discriminators in the same way as pix2pixHD wang2018high. The difference is that we replace the least-squares loss with the hinge-based loss DBLP:journals/corr/LimY17 and condition the discriminator on the state so that the generator is induced to produce dynamics-consistent outputs.
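The hinge-based adversarial objective can be sketched in plain NumPy on discriminator logits; the state conditioning of the discriminator is omitted here, so this is a generic hinge GAN loss rather than the exact conditioned objective.

```python
import numpy as np

def d_hinge_loss(d_real, d_fake):
    # Discriminator: push real logits above +1 and fake logits below -1.
    return (np.mean(np.maximum(0.0, 1.0 - d_real))
            + np.mean(np.maximum(0.0, 1.0 + d_fake)))

def g_hinge_loss(d_fake):
    # Generator: raise the discriminator's score on generated images.
    return -np.mean(d_fake)
```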


The total loss function to optimize the S2P generator is defined as:

$$\mathcal{L}_{total} = \lambda_{adv}\mathcal{L}_{adv} + \lambda_{pix}\mathcal{L}_{pix} + \lambda_{vgg}\mathcal{L}_{vgg},$$

where $\lambda_{adv}$, $\lambda_{pix}$, and $\lambda_{vgg}$ are hyperparameters balancing the objectives.
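The weighted combination of the three objectives can be sketched as follows; the default lambda values are illustrative placeholders, not the paper's tuned hyperparameters.

```python
def total_loss(l_adv, l_pix, l_vgg, lam_adv=1.0, lam_pix=10.0, lam_vgg=10.0):
    # Linear combination of the adversarial, pixel-wise, and perceptual
    # objectives; the weights balance their relative scales (values assumed).
    return lam_adv * l_adv + lam_pix * l_pix + lam_vgg * l_vgg
```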

3.2 Offline reinforcement learning with synthetic data

We consider the Markov decision process (MDP) $\mathcal{M} = (\mathcal{O}, \mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma)$, where $\mathcal{O}$ denotes the image space, $\mathcal{S}$ the state space corresponding to $\mathcal{O}$, $\mathcal{A}$ the action space, $P$ the transition dynamics, $r$ the reward function, $\rho_0$ the initial distribution, and $\gamma$ the discount factor. We denote the discounted image visitation distribution of a policy $\pi$ by $d_{\mathcal{M}}^{\pi}(o) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}P(o_t = o \mid \pi)$, where $P(o_t = o \mid \pi)$ is the probability of reaching image observation $o$ at time $t$ by executing $\pi$ in $\mathcal{M}$. Similarly, we denote the image-action visitation distribution by $d_{\mathcal{M}}^{\pi}(o, a) = d_{\mathcal{M}}^{\pi}(o)\,\pi(a \mid o)$. The objective of RL is to optimize a policy that maximizes the expected discounted return $\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t} r_t\right]$.

In the offline RL setting, the algorithm has access to a static dataset $\mathcal{D}$ collected by an unknown behavior policy $\pi_{\beta}$. In other words, the dataset is obtained from $d^{\pi_{\beta}}(o, a)$, and the goal is to find the best possible policy using the static dataset without online interaction with the environment.

To utilize dynamics-consistent transition data for augmentation in offline RL, we take the model-based approach that trains an ensemble of dynamics and reward models $\{\hat{P}_i, \hat{r}_i\}_{i=1}^{N}$, which output the predicted next state and reward. Once a model has been learned, we can construct the learned MDP $\hat{\mathcal{M}}$, which has the same spaces as $\mathcal{M}$ but uses the learned dynamics and reward function. Naively optimizing the RL objective with $\hat{\mathcal{M}}$ is known to fail in the offline RL setting, both in theory and practice DBLP:conf/nips/KidambiRNJ20; DBLP:conf/nips/YuTYEZLFM20, due to distribution shift and model bias. To overcome these issues, we take an uncertainty estimation approach such as bootstrap ensembles osband2018randomized; DBLP:conf/iclr/LowreyRKTM19 and obtain $u(s, a)$, an estimate of the uncertainty in the dynamics. Then we utilize the uncertainty-penalized reward $\tilde{r}(s, a) = \hat{r}(s, a) - \lambda u(s, a)$, where $\lambda$ is a hyperparameter. We consider the uncertainty quantification method that uses the maximum learned variance over the ensemble, $u(s, a) = \max_{i} \lVert \Sigma_i(s, a) \rVert_F$ DBLP:conf/nips/YuTYEZLFM20; DBLP:conf/nips/KidambiRNJ20.
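A sketch of this ensemble-based penalty; each ensemble member's predicted covariance is taken as given (here as a flattened array, since the Frobenius norm of a matrix equals the 2-norm of its flattened entries), and all names are illustrative.

```python
import numpy as np

def uncertainty(ensemble_sigmas):
    # u(s, a) = max_i ||Sigma_i(s, a)||_F over the ensemble members;
    # each element of ensemble_sigmas is member i's flattened covariance.
    return max(np.linalg.norm(sig) for sig in ensemble_sigmas)

def penalized_reward(r_hat, ensemble_sigmas, lam=1.0):
    # r~(s, a) = r_hat(s, a) - lam * u(s, a)
    return r_hat - lam * uncertainty(ensemble_sigmas)
```

High disagreement (or high predicted variance) among the members thus directly lowers the reward the offline agent sees for that transition.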

As a final step, we train offline RL with a hybrid data distribution: a fraction $f \in [0, 1]$ of each mini-batch is drawn from the offline dataset $\mathcal{D}$, and the remainder from the synthetic dataset $\hat{\mathcal{D}}$ generated under the state rollout distribution with the trained dynamics ensemble model, where $\hat{\mathcal{M}}$ is the same as $\mathcal{M}$ except that the reward is $\tilde{r}$ instead of $r$. Samples for $\hat{\mathcal{D}}$ are obtained by rolling out in the state space and converting the obtained state transitions into image transitions using the trained S2P generator from Section 3.1. For implementation, we collect the synthetic image transition data in a separate replay buffer $\hat{\mathcal{D}}$ and train the agent with any offline RL algorithm using mini-batches sampled from $\mathcal{D}$ and $\hat{\mathcal{D}}$ in the ratio of $f$ to $1-f$. The overall algorithm and more training details are summarized in Appendix B.
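The hybrid mini-batch sampling can be sketched with plain Python lists standing in for the two replay buffers; the function name and buffer representation are illustrative.

```python
import random

def sample_hybrid_batch(offline_buffer, s2p_buffer, batch_size, f):
    """Draw a mini-batch with a fraction f from the offline dataset D and
    the remainder from the S2P-generated buffer D_hat (sketch)."""
    n_offline = int(round(f * batch_size))
    batch = random.sample(offline_buffer, n_offline)
    batch += random.sample(s2p_buffer, batch_size - n_offline)
    random.shuffle(batch)  # mix the two sources within the batch
    return batch
```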

4 Experiments

4.1 Environments & Data collection

We evaluate our method on a large subset of datasets from the DeepMind Control (DMControl) suite tunyasuvunakool2020dm_control. It includes 6 environments, which are typically used for online image-based RL benchmarks. However, to the best of our knowledge, none of these environments has been properly evaluated in an offline image-based RL setting. The datasets in these benchmarks are generated as follows:

  • random: rollout by a random policy that samples actions from a uniform distribution.

  • mixed: train a policy using state-based SAC haarnoja2018soft until 500k steps for finger and cheetah and 100k steps for the others, then randomly sample trajectories from the replay buffer; 500k and 100k steps are the minimum steps required to reach expert-level performance for each task.

  • expert: train a policy using state-based SAC; after convergence, we collect trajectories from the converged policy.

4.2 Offline Reinforcement Learning

To validate whether S2P can help improve offline RL performance, we evaluate our approach with recent offline RL algorithms: CQL DBLP:conf/nips/KumarZTL20, which utilizes conservative training of the critic; IQL kostrikov2021offline, which trains the critic by implicitly querying actions near the distribution of the dataset; and SLAC-off, i.e. SLAC lee2020stochastic, a state-of-the-art online image-based RL algorithm that we run in the offline setting. We also compare with policy-constraint-based offline RL algorithms such as BEAR kumar2019stabilizing and behavior cloning (BC); the results for these two algorithms are in Appendix B. To extend offline RL to the image-based setting, we follow the image encoder architecture from lee2020stochastic and train a variational model using the offline data. Then, we train each algorithm in the latent space of this model.

Environment Dataset IQL IQL+S2P CQL CQL+S2P SLAC-off SLAC-off+S2P
cheetah, run random 10.28 12.64(16.21) 4.89 11.77(7.52) 16.37 18.14(35.38)
walker, walk random -0.28 4.03(0.83) -0.43 10.44(1.99) 18.23 17.38(20.15)
ball in cup, catch random 74.77 82.28(80.39) 84.87 92.81(91.61) 70.04 85.57(52.81)
reacher, easy random 33.75 70.45(55.33) 52.32 75.01(81.48) 77.43 85.76(87.84)
finger, spin random -0.17 0.46(-0.11) -0.01 -0.11(0.07) 30.24 27.62(32.65)
cartpole, swingup random 24.52 38.59(29.1) 27.67 32.93(42.12) 35.03 31.01(52.22)
cheetah, run mixed 41.68 88.53(70.44) 92.63 93.16(93.48) 16.63 26.39(24.42)
walker, walk mixed 96.07 95.49(97.80) 97.18 97.84(98.70) 29.02 92.60(67.09)
ball in cup, catch mixed 41.94 37.79(40.65) 30.82 51.28(37.21) 28.54 40.41(32.88)
reacher, easy mixed 66.88 75.61(75.01) 70.37 75.53(77.54) 62.49 63.59(77.58)
finger, spin mixed 98.18 94.78(98.65) 98.54 87.17(80.07) 64.41 83.31(83.29)
cartpole, swingup mixed 14.49 14.04(51.25) 14.76 -4.66(36.94) 14.51 16.36(25.41)
cheetah, run expert 79.89 87.18(88.89) 94.20 96.28(95.54) 8.92 14.41(8.42)
walker, walk expert 94.34 94.97(94.35) 95.43 97.97(98.47) 11.71 70.95(19.66)
ball in cup, catch expert 28.57 28.60(28.48) 28.42 28.62(28.68) 28.56 38.87(28.69)
reacher, easy expert 52.13 58.19(57.51) 57.68 32.54(48.46) 26.61 42.85(34.49)
finger, spin expert 98.42 94.42(99.19) 73.07 97.25(99.51) 24.75 81.05(52.21)
cartpole, swingup expert 20.43 18.37(18.03) 19.35 18.54(30.22) 14.11 11.18(-3.80)
Table 1: Offline RL results for DMControl. The numbers are the averaged normalized scores proposed in fu2020d4rl, where 100 corresponds to expert and 0 corresponds to the random policy. The results with standard deviation are in Appendix B.

We report offline RL results on different types of environments and data in Table 1. The results on the originally given offline dataset (50k) are shown in the left column of each algorithm, and the results with the S2P-based augmented dataset in the right column. S2P augments the dataset by the same amount as the original offline dataset (50k); since the augmented dataset thus physically contains more data (100k) than the original (50k), for reference we also include the results on a 100k offline dataset in parentheses in Table 1. Overall, the S2P-based method achieves better performance than the 50k dataset, even exceeding the 100k dataset's score on some tasks.

As S2P answers how to generate image transition data, we also have to consider where to generate it, i.e. which rollout policy to use. Specifically, we use a random policy for the mixed and expert datasets, and a policy trained by state-based offline RL for the random dataset. These strategies follow from two assumptions. First, as the random dataset contains only random behavior, augmenting it with a random policy may not reach any meaningfully rewarded states, especially in the locomotion environments (e.g. cheetah and walker cannot leave the initial states under a random policy). As S2P's objective is to extend the distribution of the given dataset, a policy trained in an offline manner can help leave the support of the dataset better than the naive random policy. Second, as the non-random datasets have relatively biased state-action distributions that receive meaningful rewards, it is difficult to get out of their support with a trained policy, since most of the state-actions it induces fall within a similar distribution. In this case, the random policy can be effective as it brings exploration effects such as noise injection or increased entropy DBLP:journals/corr/LillicrapHPHETS15; haarnoja2018soft.

dataset method 50k dataset +S2P (random policy) +S2P (offRL policy)
cheetah run random IQL 10.28 -0.107 12.64
CQL 4.89 -0.69 11.77
SLAC-off 16.37 11.94 18.14
cheetah run mixed IQL 41.68 88.53 58.67
CQL 92.63 93.16 89.6
SLAC-off 16.63 26.39 26.53
cheetah run expert IQL 79.89 87.18 79.20
CQL 94.20 96.28 93.69
SLAC-off 8.92 14.41 7.79
Table 2: Experiments on each different rollout distribution in cheetah-run environment.

To verify these assumptions, we analyze the effect of different types of rollout distributions on offline RL performance (Table 2). We denote by 50k dataset the results on the given original offline dataset, by +S2P (random policy) the results with data augmented by rollouts with random actions, and by +S2P (offRL policy) the results with data augmented by rollouts with the state-based offline RL policy. As expected, the random policy is more effective on the expert dataset. We also find the opposite phenomenon on the random dataset, and a trade-off between the two strategies on the mixed dataset.

4.3 Comparison with model-based image RL

To show why multi-modal inputs are necessary for augmenting image transition data in offline image RL, we compare our S2P with a previous model-based image RL algorithm, Dreamer DBLP:conf/iclr/HafnerLB020, as it can also reconstruct images of the agent by training with the reconstruction error from the ELBO objective using only uni-modal inputs (previous images of the agent). For comparison, we trained Dreamer in an offline manner on the same dataset used for training S2P, and predicted future images with the episode context obtained from 5 consecutive ground-truth images. Even though Dreamer sees more previous steps' images than S2P (which uses only a single previous image), it still has difficulty generating accurate postures, which supports S2P's advantages (see the figure in Section 4.3). This is because the priority of model-based image RL algorithms is learning effective latent representations for RL tasks, and they do not utilize the broader source of supervision from the state inputs.

To show that state-inconsistent images from the model-based image RL method cannot improve offline RL performance to the level of S2P, we perform the same experiment as in Table 1, but replace the augmented images from S2P with images from Dreamer. We observe that image augmentation from Dreamer even degrades the original RL performance on several tasks, and the performance improvements with S2P exceed those with Dreamer by a large margin across all baselines (Section 4.3). Thus, the inaccurate posture and quality of images generated by the model-based method trained with uni-modal inputs (images) are not sufficient for augmenting image transition data in the offline setting. More experiments and details of the augmentation are in the Appendix.

Figure: Qualitative comparison of S2P and Dreamer.

Table: Quantitative comparison of S2P and Dreamer.
dataset method 50k dataset +S2P +Dreamer
cheetah run mixed IQL 41.68 88.53 2.09
CQL 92.63 93.16 58.93
SLAC-off 16.63 26.39 4.68
walker walk mixed IQL 96.07 95.49 1.28
CQL 97.18 97.84 95.88
SLAC-off 29.02 92.60 52.65
cheetah run expert IQL 79.89 87.18 73.23
CQL 94.20 96.28 53.69
SLAC-off 8.92 14.41 3.65
walker walk expert IQL 94.34 94.97 34.95
CQL 95.43 97.97 96.14
SLAC-off 11.71 70.95 52.03

4.4 Ablation

To observe how each component of S2P contributes to the quality of the synthesized images, we perform an ablation study on the model architecture and show qualitative results in Figure 4(a). A baseline architecture without any of our proposed components, i.e. positional encoding (PE) and the multimodal affine transformation (MAT), shows the worst image quality. We find that simply applying the high-frequency function to the input state (PE) results in a noticeable increase in image quality. We also address the necessity of the image input for estimating the modulation parameters $\gamma$ and $\beta$ of the learned affine transformation: we ablate the image input $I_{t-1}$ in MAT and utilize only the spatially expanded input state, which we call State Affine Transformation (SAT). The generator that replaces the MAT module with SAT has difficulty exploiting dynamics information from the previous image, and translation errors accumulate as the generator recurrently synthesizes a long-horizon trajectory. We observe such dynamics inconsistency especially in the locomotion tasks, e.g. cheetah and walker, which express velocity in images through the change of the checkerboard on the ground. Compared to SAT, our proposed MAT, which leverages both the state and the previous image, performs better in reconstructing not only the posture of the agent but also its dynamics-consistent background.

Figure 4: Qualitative results. We report the generator performance by synthesizing multiple steps with a single trajectory. (a) demonstrates the effectiveness of each component in the S2P generator where PE, SAT, and MAT indicate positional encoding, state affine transformation and multimodal affine transformation, respectively. SAT cannot perfectly estimate the correct location of the agent as there is misalignment at the ground checkerboard compared to MAT. (b) represents unseen task adaptation using the trajectories given from the state-level transition model. (c) shows that our model can recover the posture of the agent without any reference image. Additional qualitative results on several environments are provided in Appendix A.

4.5 Zero-shot Task Adaptation

To validate whether S2P can help the offline RL process in settings that require generalization to tasks different from the given dataset, we construct two environments, cheetah-jump and walker-run. In cheetah-jump, which is adapted from DBLP:conf/nips/YuTYEZLFM20, the agent should solve a task that differs from the original purpose of the behavior policy. Specifically, we relabel the rewards in the cheetah-run-mixed dataset to reward the cheetah for jumping as high as possible. Then, we generate image transition data from the states whose z-positions are larger than a threshold to validate the S2P-based augmentation's advantage in tasks that require generalization. By training with the relabeled reward, the agent with S2P achieves a higher return and learns to bounce back and forth to take a higher leap (Figure 4(b)), even though the batch data contain little jumping motion.

To investigate whether S2P can generate unseen image distributions from unseen state distributions, we collect a walker-run state dataset in the same way as the other mixed datasets in Section 4.1. Then, we generate images from these unseen states by recurrently applying the S2P generator, and we apply offline RL to the generated image transition data. Even though the state distribution is totally different from the walker-walk dataset, as the agent should run instead of walk, not only does S2P successfully generalize to the unseen image distributions (Figure 4(b)), but the agent can also be trained to run by offline RL using only these synthesized images, beyond the expertise of the dataset (Table 3). More details are in Appendix B.

Method Walker-Run Method Cheetah-Jump
Batch mean 572.94 Batch mean 2.05
IQL N/A IQL 36.6
IQL+S2P 659.25 IQL+S2P 54.4
CQL N/A CQL 40.8
CQL+S2P 652.19 CQL+S2P 47.2
Table 3: Average returns of the walker-run and cheetah-jump tasks. We include mean undiscounted return of the episodes in the batch data for comparison.

We attribute such satisfactory task generalization to the model architecture of S2P, which leverages both the state and the previous image for synthesizing the next image. The posture of the agent is deterministic given the state alone, while the background of the image, such as the ground checkerboard, depends on both the state and the previous image. Therefore, S2P exploits the state input to generate the posture of the agent and the image input to generate the background. This is well illustrated in Figure 4(c), where we intentionally replace all the image inputs with a zero matrix during the inference phase. We observe that S2P still perfectly reconstructs the posture of the agent from its corresponding state alone, while it fails to recover the background and the ground checkerboard, as expected.

5 Conclusion

We present the state-to-pixel (S2P) algorithm, which synthesizes raw pixels from the corresponding state and the previous image. As S2P's augmentation paradigm generates dynamics-consistent image transition data, we demonstrate that S2P not only improves image-based offline RL performance but also shows powerful generalization capability on unseen tasks. Even though S2P shows promising results in offline RL through data augmentation, the assumption that datasets consist of paired images and states remains a strong one. For future work, we therefore plan to extend the idea to more relaxed assumptions, such as unpaired datasets, or to extend other state-based applications to image-based RL algorithms.

6 Acknowledgement

This research was supported by the Future Challenge Program through the Agency for Defense Development funded by the Defense Acquisition Program Administration, and by an Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) [NO.2021-0-01343, Artificial Intelligence Graduate School Program (Seoul National University)].



  1. For all authors…

    1. Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? See Section 4

    2. Did you describe the limitations of your work? See Section 5

    3. Did you discuss any potential negative societal impacts of your work? This work inherits the potential negative societal impacts of reinforcement learning. We do not anticipate any additional negative impacts that are unique to this work.

    4. Have you read the ethics review guidelines and ensured that your paper conforms to them?

  2. If you are including theoretical results…

    1. Did you state the full set of assumptions of all theoretical results?

    2. Did you include complete proofs of all theoretical results?

  3. If you ran experiments…

    1. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? We included code in the supplemental materials.

    2. Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? See Section 3 and Appendix

    3. Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? See Section 4

    4. Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? See Appendix

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. If your work uses existing assets, did you cite the creators? See Section 4

    2. Did you mention the license of the assets? See Appendix.

    3. Did you include any new assets either in the supplemental material or as a URL? We include the code & model in the supplemental materials.

    4. Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

    5. Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?

  5. If you used crowdsourcing or conducted research with human subjects…

    1. Did you include the full text of instructions given to participants and screenshots, if applicable?

    2. Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

    3. Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

Appendix A Image generation

a.1 Architecture specification

The generator architecture is implemented with several MAT residual blocks followed by bilinear upsampling, as shown in Figure 5(b). The residual block architecture largely follows park2019semantic, where we replace the SPADE module with MAT in the SPADE residual block. We also apply Spectral Normalization miyato2018spectral to all convolution layers in the generator. The latent mapping network for generating the latent code w is implemented as an 8-layer MLP with channel-wise normalization at the first layer, the same as the style mapping function in StyleGAN karras2019style. The first MLP layer of the latent mapping function differs across physical environments, as the state dimensionality differs. The dimension of the input state is increased by a high-frequency positional encoding function; we set its hyperparameter to 10 and concatenate the input with the output of the encoding, which leads to a 21-fold increase in the channel dimension. The specification of the entire architecture is shown in Figure 6.
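The positional encoding can be sketched as follows. With 10 frequency bands and the raw input concatenated, the output dimension is (1 + 2·10) = 21 times the input dimension, matching the 21-fold increase above; the exact frequency schedule (here 2^k·π, a common choice) is an assumption:

```python
import math

def positional_encoding(state, num_freqs=10):
    """Map each state coordinate x to [x, sin(2^k pi x), cos(2^k pi x)] for k < num_freqs.

    Concatenating the raw input with its encodings multiplies the dimension
    by 1 + 2 * num_freqs (21 for num_freqs = 10).
    """
    out = list(state)  # keep the raw input, as in the paper
    for k in range(num_freqs):
        freq = (2.0 ** k) * math.pi  # assumed frequency schedule
        out.extend(math.sin(freq * x) for x in state)
        out.extend(math.cos(freq * x) for x in state)
    return out
```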

17 24 5 9 8 6
Table 4: State dimensionality of each DMControl environment.
(a) Latent mapping function
(b) MAT residual block
Figure 5: An architecture of the sub-network in the generator.
Figure 6: Specification of the generator.

a.2 Training details

We implement our proposed generator architecture with the public deep learning platform PyTorch and train it on a single NVIDIA RTX A6000 GPU. We train for 30 epochs on each task and the generated image size is set to . An Adam optimizer kingma2014adam with a learning rate of 0.0002 is used and the batch size is set to 16.

a.3 Additional results of image synthesis

Figure 7: Additional qualitative results on the DMControl environment. The first row of each environment is ground truth images and the second row is the synthesized images from S2P.
Environment Method FID (↓) LPIPS (↓) PSNR (↑) SSIM (↑)
Cheetah run Dreamer 63.46 0.042 27.62 0.90
S2P 47.70 0.028 33.40 0.94
Walker walk Dreamer 209.99 0.30 18.38 0.69
S2P 74.08 0.078 25.20 0.84
Cartpole swingup Dreamer 82.02 0.129 28.05 0.94
S2P 112.81 0.117 28.83 0.86
Ball-in-cup catch Dreamer 112.45 0.055 33.25 0.93
S2P 77.11 0.035 33.75 0.97
Reacher easy Dreamer 171.92 0.115 27.64 0.95
S2P 58.34 0.178 26.02 0.86
Finger spin Dreamer 86.63 0.112 21.93 0.90
S2P 18.86 0.025 39.54 0.99
Mean value Dreamer 121.13 0.126 28.19 0.89
S2P 64.82 0.077 31.13 0.91
Table 5: Quantitative results of generated images. On average, S2P outperforms Dreamer on all four image-quality metrics.

We report additional qualitative results on multiple DMControl environments in Figure 7, showing that our proposed S2P generates high-quality images regardless of the environment. We recurrently generate a single trajectory by exploiting the current state and the previous image, which is itself the generator's output from the previous state. We also quantitatively evaluate the quality of the generated images with metrics frequently adopted for image-quality evaluation (FID, LPIPS, PSNR, SSIM) in Table 5. S2P outperforms Dreamer across the quantitative results, which indicates that the images generated by S2P have better quality than those from Dreamer.
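Of the four metrics, PSNR is the simplest to state explicitly; a minimal reference implementation over flattened images, assuming pixel values in [0, 255]:

```python
import math

def psnr(img_a, img_b, max_val=255.0):
    """Peak signal-to-noise ratio between two equally-sized images (flat pixel lists)."""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0:
        return float('inf')  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)
```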

Appendix B Offline RL Experiments Details

b.1 Algorithm

  Input: offline dataset, state rollout distribution.
  Train the probabilistic dynamics models on the offline dataset.
  Train the image generator (S2P) on the offline dataset.
  for each rollout step do
     Randomly sample a state from the rollout distribution.
     Get the next state and reward using the dynamics models.
     Generate the corresponding image transition with the S2P generator.
     Save the transition in the augmented dataset.
  end for
  Apply any offline RL algorithm to data sampled from the original and augmented datasets with a fixed ratio.
Algorithm 1 Offline RL with the S2P

b.2 Ablation studies

b.2.1 Uncertainty types

We conduct experiments on how the choice of uncertainty estimate affects performance. We denote the maximum predictive variance over the ensemble as Max Var, the variance across the ensemble members as Ens Var, and the average of both uncertainty types as Average Both. In practice, we found that Max Var achieves better performance on the mixed and expert datasets, while Ens Var achieves better performance on the random dataset. We hypothesize that the dynamics model is quite uncertain in terms of Max Var on the random dataset due to the excessive randomness of the data. It could thus induce an excessive penalty on the predicted reward when Max Var is used, making the agent too conservative.
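The two uncertainty types can be computed from the ensemble's per-member Gaussian outputs roughly as follows; this is a one-dimensional sketch of the usual estimators, and the paper's exact formulas may differ:

```python
from statistics import pvariance

def max_var(ensemble_means, ensemble_vars):
    """Max Var: the largest predictive variance among the ensemble members."""
    return max(ensemble_vars)

def ens_var(ensemble_means, ensemble_vars):
    """Ens Var: disagreement between members, i.e. variance of their means."""
    return pvariance(ensemble_means)

def average_both(ensemble_means, ensemble_vars):
    """Average Both: mean of the two uncertainty estimates."""
    return 0.5 * (max_var(ensemble_means, ensemble_vars)
                  + ens_var(ensemble_means, ensemble_vars))
```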

Dataset Method Max Var Average Both Ens Var
cheetah run random IQL 12.64 16.61 19.27
CQL 11.77 8.59 19.83
SLAC-off 18.14 8.67 31.81
cheetah run mixed IQL 88.53 83.92 67.74
CQL 93.16 83.93 87.5
SLAC-off 26.39 26.39 24.41
cheetah run expert IQL 87.18 64.43 62.41
CQL 96.28 89.38 90.36
SLAC-off 14.41 9.85 11.92
Table 6: Different types of uncertainty quantification on cheetah-run environments.

b.2.2 Rollout horizons

To validate the effectiveness of different rollout horizons, we conduct experiments with horizon lengths 1 (+S2P (1step)) and 5 (+S2P (5step)), while following the same rollout strategies as in Section 4.2. For the 5-step case, as the proposed S2P generator is conditioned on the previous timestep's image to synthesize the next image, we recurrently generate the image transition data. That is, at the first timestep the ground-truth image conditions the generator, and thereafter the generated image conditions the generator for the following timestep. The results in Table 7 show that augmentation with a longer horizon also benefits offline RL, but a short horizon is more effective overall. This is due to the uncertainty accumulation effect shown in Figure 8: the average uncertainty grows with the rollout horizon due to model bias, which leads to larger penalties on the predicted rewards and can induce an overly conservative agent.

dataset method 50k dataset +S2P (1step) +S2P (5step)
cheetah run random IQL 10.28 12.64 12.08
CQL 4.89 11.77 11.46
SLAC-off 16.37 18.14 19.09
cheetah run mixed IQL 41.68 88.53 66.38
CQL 92.63 93.16 87.44
SLAC-off 16.63 26.39 27.07
cheetah run expert IQL 79.89 87.18 81.01
CQL 94.20 96.28 87.53
SLAC-off 8.92 14.41 17.17
Table 7: Effect of rollout horizons in cheetah-run environment.
Figure 8: Average uncertainty of each different rollout horizon in cheetah-run environment.

b.2.3 Results on policy constraint-based methods

We additionally test the proposed method on policy constraint-based methods such as BEAR kumar2019stabilizing and behavior cloning (BC) in the cheetah-run environment. As shown in Table 8, the performance of these methods with augmented data is worse than with non-augmented data. The poor performance is expected: BEAR matches the support of the action distribution via MMD, and BC is trained by maximizing the likelihood of the actions. As the augmented action distribution is totally different from that of the behavior policy which induced the offline dataset, these methods perform poorly because they try to clone, or match the support of, both the offline dataset's actions and the sampled actions, which can have different distributions or supports.
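For reference, the MMD term that BEAR matches can be sketched with a Gaussian kernel; this is a simplified scalar-action version of the standard biased estimator, not BEAR's exact implementation:

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    return math.exp(-(x - y) ** 2 / (2 * sigma ** 2))

def mmd_squared(xs, ys, sigma=1.0):
    """Biased squared MMD between two 1-D samples under a Gaussian kernel."""
    k_xx = sum(gaussian_kernel(a, b, sigma) for a in xs for b in xs) / len(xs) ** 2
    k_yy = sum(gaussian_kernel(a, b, sigma) for a in ys for b in ys) / len(ys) ** 2
    k_xy = sum(gaussian_kernel(a, b, sigma) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2 * k_xy
```

When the augmented actions fall far from the behavior policy's support, this term is large, which is exactly the mismatch that hurts BEAR above.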

DATASET METHOD 50k dataset +S2P (random policy) +S2P (offRL policy)
cheetah random BEAR -1.15 -1.39 1.14
BC -1.41 -1.3 2.54
cheetah mixed BEAR 10.64 -0.29 6.57
BC 51.81 0.06 5.17
cheetah expert BEAR 73.49 11.11 56.86
BC 77.42 30.06 79.40
Table 8: Experiments on policy constraint-based methods.

b.3 Additional experiments on Dreamer and conventional image augmentation technique

Dreamer with a larger dataset. To validate whether the performance degradation from Dreamer-based augmentation (Section 4.3) stems from the small dataset, as Dreamer requires more than 100k samples in online training, we collect an additional 250k dataset and augment 250k image transitions with Dreamer (500k samples in total), then perform the same experiment. Despite the larger dataset, performance does not increase overall (Table 9). We thus attribute the inaccurate posture and quality of the images generated by Dreamer to the limited supervision from uni-modal inputs rather than to the size of the dataset, and conclude that it is not well suited for data augmentation in the offline setting.

Comparison with a conventional image-augmentation technique. To examine why S2P is needed instead of conventional image augmentation, we additionally experiment with random crop and reflection padding, which are frequently used in image representation learning and online image-based RL. We perform the same experiment as in Table 1, but replace the augmented images from S2P with randomly cropped images. Its slightly better performance than Dreamer can be interpreted as a consequence of merely manipulating the given true images (while maintaining the accurate posture of the agent) rather than generating new images as Dreamer does (Table 9). However, it still has difficulty surpassing S2P's results, as it cannot deviate from the state-action distribution of the offline dataset.

dataset method 50k dataset +S2P +Reflect RandomCrop +Dreamer +Dreamer 500k
cheetah run mixed IQL 41.68 88.53 70.17 2.09 1.85
CQL 92.63 93.16 79.78 58.93 82.55
SLAC-off 16.63 26.39 3.10 4.68 0.14
walker walk mixed IQL 96.07 95.49 95.39 1.28 7.21
CQL 97.18 97.84 97.61 95.88 97.70
SLAC-off 29.02 92.60 17.13 52.65 0.48
cheetah run expert IQL 79.89 87.18 68.39 73.23 3.50
CQL 94.20 96.28 94.01 53.69 27.58
SLAC-off 8.92 14.41 8.15 3.65 3.24
walker walk expert IQL 94.34 94.97 92.46 34.95 37.92
CQL 95.43 97.97 97.11 96.14 82.51
SLAC-off 11.71 70.95 6.91 52.03 10.43
Table 9: Quantitative comparison of S2P and other data augmentation methods.

b.4 Visualization of the image distribution

To validate whether S2P really affects the distribution of the offline dataset, we visualize the cheetah-run-expert dataset and the S2P-augmented dataset generated under the random policy by applying t-SNE (Figure 9).

Figure 9: t-SNE visualization of the cheetah-run-expert dataset (red) and the dataset augmented by the random policy (green).

As shown in Figure 9, the augmented dataset not only occupies a similar area to the original dataset but also encloses it, even connecting the clusters of the original dataset. This can be interpreted as S2P connecting different modes of the image distribution by deploying virtual exploration in the state space.

b.5 Model learning

In the state space, we represent the dynamics and reward model as a probabilistic neural network that outputs a Gaussian distribution over the next state and reward given the current state and action. We train an ensemble of dynamics models, with each model trained independently via maximum likelihood estimation on the offline dataset. We set the number of dynamics models to 7, following yu2021combo. During model rollouts, we randomly pick one model. Each model in the ensemble is represented as a 3-layer MLP with 256 hidden units and ReLU activations.
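Per dimension, the maximum-likelihood objective above reduces to minimizing the Gaussian negative log-likelihood of the observed next state and reward; a sketch of that standard term (not the paper's exact code):

```python
import math

def gaussian_nll(target, mean, log_var):
    """Negative log-likelihood of `target` under N(mean, exp(log_var)).

    Each ensemble member predicts (mean, log_var) for the next state and
    reward; training minimizes the sum of this term over dimensions.
    """
    var = math.exp(log_var)
    return 0.5 * (math.log(2 * math.pi) + log_var + (target - mean) ** 2 / var)
```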

b.6 Representation learning

For the image-based offline RL process, we follow lee2020stochastic, which uses a variational sequence model; the full derivation of its components is given in lee2020stochastic. We train the representation model using the evidence lower bound, where the batch contains multiple sequences and z denotes the latent representation. The latent dimension is set to 288, the same as the original implementation. For the image encoder, we use 6 convolutional layers with kernel sizes [5, 3, 3, 3, 3, 4] and strides [2, 2, 2, 2, 2, 2], each followed by a leaky ReLU activation. The decoder is constructed symmetrically to the encoder. We pre-train the image encoder for 300k steps and use the trained encoder's weights as initial weights when offline RL training proceeds.

b.7 Training Detail

We use a batch size of 128 and the Adam optimizer for training the value function and policy, with learning rates of 0.0003 and 0.0001, respectively. The value function and policy are each represented as MLPs with hidden layer sizes (1024, 1024) and ReLU activations. We apply a tanh activation to the output of the policy with the reparameterization trick, the same as haarnoja2018soft. We use the same uncertainty penalty coefficient for all environments except finger-spin, which uses a different value, and the offline and augmented data are sampled with a fixed ratio. For the offline RL implementations, we referred to the original implementation of each work and follow its default parameters.

b.8 Zero-shot Task Adaptation Detail

For the cheetah-jump task, we relabel the reward in the given offline dataset to a sparse reward that indicates whether the cheetah is jumping. That is, if z − init_z > 0.3, the reward is relabeled as 1, and otherwise 0, where z denotes the z-position of the cheetah and init_z its initial z-position. For augmentation, we randomly select states whose z-position is greater than 0.2 and generate images from these states to encourage the jumping motion. Then, we augment the generated image transition data in the same way as in Table 1 and evaluate offline RL.
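The relabeling rule can be written directly:

```python
def relabel_jump_reward(z, init_z, threshold=0.3):
    """Sparse cheetah-jump reward: 1 if the torso rose more than `threshold`, else 0."""
    return 1.0 if z - init_z > threshold else 0.0
```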

For the walker-run task, we generate the dataset in the same manner as the mixed type in Section 4.1. That is, we train state-based SAC until convergence and randomly sample trajectories from the replay buffer. Then, we generate the first image from the first state and the initial image observed when the agent is reset, and recurrently generate the subsequent images with the S2P generator trained on the walker-walk-mixed dataset. After that, we apply offline RL to the generated image transition data.

b.9 Environment Detail

The DMControl’s environment details and the random and expert scores obtained by training the state-based SAC are shown in Table 10. The normalized score is computed by (return - random score)/(expert score - random score), which is proposed in fu2020d4rl.

Environment Expert Score Random Score Action Repeat Maximum steps per episode
Cheetah-run 900 12.6 4 250
Walker-walk 970 43.2 2 500
Ball in cup-catch 976 25.1 4 250
Cartpole-swingup 979 61.8 8 125
Reacher-easy 906 2.1 4 250
Finger-spin 882 125.2 2 500
Walker-run 790 25.2 2 500
Table 10: The environment details including the expert and random scores for computing normalized scores.

b.10 Results with standard deviation

We include the results from the manuscript with standard deviations (Table 11, Table 12, and Table 13).

Environment Dataset IQL IQL+S2P CQL CQL+S2P SLAC-off SLAC-off+S2P
cheetah, run random 10.28±1.0 12.64±1.5 (16.21±2.5) 4.89±1.2 11.77±6.4 (7.52±6.6) 16.37±6.7 18.14±3.1 (35.38±10.1)
walker, walk random -0.28±1.7 4.03±3.5 (0.83±1.7) -0.43±1.4 10.44±3.4 (1.99±4.6) 18.23±3.1 17.38±3.4 (20.15±4.1)
ball in cup, catch random 74.77±12.4 82.28±9.2 (80.39±14.7) 84.87±9.5 92.81±6.4 (91.61±9.1) 70.04±8.1 85.57±5.1 (52.81±7.6)
reacher, easy random 33.75±3.8 70.45±4.1 (55.33±4.5) 52.32±3.8 75.01±5.0 (81.48±3.1) 77.43±9.1 85.76±7.1 (87.84±8.8)
finger, spin random -0.17±0.5 0.46±1.1 (-0.11±0.3) -0.01±0.6 -0.11±0.4 (0.07±0.7) 30.24±3.1 27.62±3.2 (32.65±3.6)
cartpole, swingup random 24.52±3.7 38.59±5.0 (29.1±5.0) 27.67±5.2 32.93±4.9 (42.12±3.6) 35.03±8.7 31.01±7.1 (52.22±10.6)
cheetah, run mixed 41.68±13.2 88.53±6.1 (70.44±12.4) 92.63±3.0 93.16±2.8 (93.48±1.5) 16.63±5.1 26.39±4.3 (24.42±5.2)
walker, walk mixed 96.07±3.1 95.49±2.8 (97.80±2.2) 97.18±3.7 97.84±2.2 (98.70±3.7) 29.02±8.3 92.60±7.1 (67.09±7.4)
ball in cup, catch mixed 41.94±3.9 37.79±5.1 (40.65±5.0) 30.82±4.5 51.28±4.8 (37.21±4.9) 28.54±5.9 40.41±4.4 (32.88±5.8)
reacher, easy mixed 66.88±4.5 75.61±3.2 (75.01±3.5) 70.37±4.8 75.53±3.0 (77.54±3.9) 62.49±6.1 63.59±5.3 (77.58±6.5)
finger, spin mixed 98.18±1.1 94.78±3.6 (98.65±1.5) 98.54±2.1 87.17±6.2 (80.07±7.3) 64.41±9.3 83.31±8.8 (83.29±10.2)
cartpole, swingup mixed 14.49±6.6 14.04±6.3 (51.25±13.7) 14.76±5.7 -4.66±6.9 (36.94±13.3) 14.51±8.1 16.36±7.7 (25.41±10.3)
cheetah, run expert 79.89±7.5 87.18±5.6 (88.89±8.2) 94.20±4.3 96.28±3.4 (95.54±5.5) 8.92±5.1 14.41±5.0 (8.42±5.7)
walker, walk expert 94.34±6.4 94.97±4.0 (94.35±6.3) 95.43±4.5 97.97±2.9 (98.47±3.6) 11.71±9.8 70.95±6.9 (19.66±8.8)
ball in cup, catch expert 28.57±14.9 28.60±4.7 (28.48±10.5) 28.42±11.6 28.62±4.8 (28.68±14.7) 28.56±11.1 38.87±10.1 (28.69±10.8)
reacher, easy expert 52.13±4.6 58.19±4.4 (57.51±4.6) 57.68±6.9 32.54±4.1 (48.46±4.9) 26.61±7.1 42.85±6.1 (34.49±6.8)
finger, spin expert 98.42±1.5 94.42±1.5 (99.19±1.2) 73.07±3.2 97.25±2.9 (99.51±0.5) 24.75±5.1 81.05±5.2 (52.21±7.6)
cartpole, swingup expert 20.43±3.2 18.37±8.8 (18.03±10.2) 19.35±11.1 18.54±7.4 (30.22±12.4) 14.11±12.6 11.18±10.1 (-3.80±13.3)
Table 11: Offline RL results for DMControl. The numbers are averaged normalized scores as proposed in fu2020d4rl, where 100 corresponds to the expert and 0 corresponds to the random policy.
dataset method 50k dataset +S2P (random policy) +S2P (offRL policy)
cheetah run random IQL 10.28±1.0 -0.107±0.27 12.64±1.5
CQL 4.89±1.2 -0.69±0.4 11.77±6.4
SLAC-off 16.37±6.7 11.94±3.9 18.14±3.1
cheetah run mixed IQL 41.68±13.2 88.53±6.1 58.67±10.3
CQL 92.63±3.0 93.16±2.8 89.6±2.3
SLAC-off 16.63±5.1 26.39±4.3 26.53±5.9
cheetah run expert IQL 79.89±7.5 87.18±5.6 79.20±16.0
CQL 94.20±4.3 96.28±3.4 93.69±2.0
SLAC-off 8.92±5.1 14.41±5.0 7.79±6.6
Table 12: Experiments on each different rollout distribution in cheetah-run environment.
dataset method 50k +S2P +Dreamer
cheetah run mixed IQL 41.68±13.2 88.53±6.1 2.09±1.27
CQL 92.63±3.0 93.16±2.8 58.93±27.6
SLAC-off 16.63±5.1 26.39±4.3 4.68±1.3
walker walk mixed IQL 96.07±3.1 95.49±2.8 1.28±6.3
CQL 97.18±3.7 97.84±2.2 95.88±3.6
SLAC-off 29.02±8.3 92.60±7.1 52.65±8.6
cheetah run expert IQL 79.89±7.5 87.18±5.6 73.23±15.6
CQL 94.20±4.3 96.28±3.4 53.69±21.3
SLAC-off 8.92±5.1 14.41±5.0 3.65±2.8
walker walk expert IQL 94.34±6.4 94.97±4.0 34.95±24.7
CQL 95.43±4.5 97.97±2.9 96.14±3.4
SLAC-off 11.71±9.8 70.95±6.9 52.03±9.9

Table: Quantitative comparison of S2P and Dreamer.

Method Walker-Run Method Cheetah-Jump
Batch mean 572.94±159.5 Batch mean 2.05±18.7
IQL N/A IQL 36.6±15.2
IQL+S2P 659.25±54.3 IQL+S2P 54.4±10.1
CQL N/A CQL 40.8±12.2
CQL+S2P 652.19±58.2 CQL+S2P 47.2±10.4
Table 13: Average returns of the walker-run and cheetah-jump tasks. We include mean undiscounted return of the episodes in the batch data for comparison.