## 1 Introduction

Offline reinforcement learning (offline RL), also known as batch RL Lange2012BatchRL, aims at learning from a static dataset that was previously gathered by an unknown behavioral policy. Offline RL is deemed to be promising Fujimoto2019OffPolicyDR; Fu2020D4RLDF, as online learning requires the agent to continuously interact with the environment, which may be costly, time-consuming, or even dangerous. Progress in offline RL is expected to help scale RL methods to real-world applications, considering the impressive success achieved in computer vision and natural language processing by adopting large-scale offline datasets Deng2009ImageNetAL; Chelba2014OneBW.

Prior off-policy online RL methods Fujimoto2018AddressingFA; Haarnoja2018SoftAO; Munos2016SafeAE are known to fail on fixed offline datasets, even on expert demonstrations Fu2020D4RLDF, due to extrapolation errors Fujimoto2019OffPolicyDR. In the offline setting, the agent can overgeneralize from the static dataset, resulting in arbitrarily wrong estimates on out-of-distribution (OOD) state-action pairs and dangerous action execution. To address this issue, recent model-free offline RL algorithms compel the learned policy to stay close to the behavioral policy Fujimoto2019OffPolicyDR; Kumar2019StabilizingOQ; Wu2019BehaviorRO, or incorporate penalties into the critic Nachum2019AlgaeDICEPG; Kumar2020ConservativeQF; Kostrikov2021OfflineRL. However, such approaches often suffer from a loss of generalization capability Wu2021UncertaintyWA; Wang2021OfflineRL, since they purposely avoid OOD states or actions.

Model-based offline RL methods, instead, enrich the logged dataset by generating synthetic samples with the trained forward or reverse (backward) dynamics model Kidambi2020MOReLM; Yu2020MOPOMO; Wang2021OfflineRL; Yu2021COMBOCO. These methods benefit from better generalization thanks to richer transition samples. Intuitively, the performance of the agent is largely confined by the quality of the model-generated data, i.e., learning on bad states or actions will negatively affect the policy via backpropagation. Unfortunately, there is no guarantee that reliable transitions can be generated by the trained forward or backward dynamics model Asadi2018TowardsAS.

In this paper, we aim to generate reliable transitions for offline RL via a *double check mechanism*. The intuition is that humans often double check when they are uncertain and need to be cautious. Besides the forward model, we train a backward model to generate simulated rollouts backward, and use each model to check whether the synthetic samples generated by the other are credible. To be specific, we train bidirectional dynamics models along with bidirectional rollout policies. Instead of injecting pessimism into value estimation, we introduce *conservatism into transition*, i.e., only samples that the forward model and reverse model agree on are trusted.

We use Figure 1 to further illustrate our insight, taking forward transition generation as an example; the process for the reverse setting is symmetric. Starting from the current state, the forward model predicts the next state. However, it is hard for the agent to decide whether this predicted next state is trustworthy. One natural solution, which follows the human way of reasoning Holyoak1999BidirectionalRI, is to backtrack where the predicted state comes from, i.e., to look backward from the predicted next state to trace the previous state, and to check whether this imagined previous state differs from the true current state. We are confident in the predicted next state if the imagined previous state is similar to the true current state, and distrust it otherwise.

To this end, we propose Confidence-Aware Bidirectional Offline Model-Based Imagination (CABI), which is a simple yet effective data augmentation method. CABI generally guarantees the reliability of the generated samples via the double check mechanism, and can be combined with any off-the-shelf model-free offline RL methods, e.g., BCQ Fujimoto2019OffPolicyDR and TD3_BC Fujimoto2021AMA, to enjoy better generalization in a conservative manner. Extensive experimental results on the D4RL benchmarks Fu2020D4RLDF show that CABI significantly boosts the performance of the base model-free offline RL methods, and achieves competitive or even better scores against recent model-free and model-based offline RL methods.

## 2 Related Work

In this paper, we consider offline reinforcement learning Lange2012BatchRL; Levine2020OfflineRL, which defines the task of learning from a static dataset that was collected by an unknown behavior policy. Applications of offline RL include robotics Mandlekar2020IRISIR; Singh2020COGCN; Rafailov2021OfflineRL, healthcare Gottesman2018EvaluatingRL; Wang2018SupervisedRL, recommendation system Strehl2010LearningFL; Swaminathan2015BatchLF, etc.

Model-free offline RL. Since it is risky to execute out-of-support actions, existing offline RL algorithms are often designed to constrain the policy search within the support of the static offline dataset. They realize it via importance sampling Precup2001OffPolicyTD; Sutton2016AnEA; Liu2019OffPolicyPG; Nachum2019DualDICEBE; Gelada2019OffPolicyDR, explicit or implicit policy constraints Fujimoto2019OffPolicyDR; Kumar2019StabilizingOQ; Wu2019BehaviorRO; Laroche2019SafePI; Liu2020ProvablyGB; Zhou2020PLASLA, learning conservative critics Kumar2020ConservativeQF; Ma2021ConservativeOD; Kumar2021DR3VD; Ma2021OfflineRL; Kostrikov2021OfflineRL, and quantifying estimation uncertainty Wu2021UncertaintyWA; Zanette2021ProvableBO; Deng2021SCORESC. Recently, sequential modeling is also explored in the offline RL setting Chen2021DecisionTR; Janner2021ReinforcementLA; Meng2021OfflinePM. Despite these advances, model-free offline RL methods suffer from loss of generalization beyond the dataset Wang2021OfflineRL, and CABI is proposed to mitigate it.

Model-based offline RL. Model-based offline RL algorithms benefit from better generalization as the static dataset is extended by the synthetic samples generated from the trained forward Ross2012AgnosticSI; Finn2017DeepVF; Argenson2021ModelBasedOP or reverse dynamics model Wang2021OfflineRL. These methods heavily rely on uncertainty quantification Ovadia2019CanYT; Yu2020MOPOMO; Kidambi2020MOReLM; Diehl2021UMBRELLAUM, compelling the policy towards the behavior policy Swazinna2021OvercomingMB; matsushima2021deploymentefficient, representation learning lee2021representation; Rafailov2021OfflineRL, and penalizing Q-values Yu2021COMBOCO. However, it is hard to judge whether the transitions generated by the trained dynamics model are reliable, and poor imagined samples will negatively affect the performance of the agent. There are some studies that focus on trajectory pruning Kidambi2020MOReLM; Zhan2021ModelBasedOP, while they involve uncertainty measurement. CABI, instead, ensures reliable imaginations by conducting double check with the forward and backward models, which fully exploits the advantages of bidirectional modeling.

Model-based online RL. Model-based online RL methods achieve superior sample efficiency Sutton1990IntegratedAF; Kaelbling1996ReinforcementLA; Buckman2018SampleEfficientRL; Janner2019WhenTT by learning a dynamics model of the environment and planning with the model Sutton2005ReinforcementLA; Walsh2010IntegratingSP; Wang2020Exploring. Learning a backward dynamics model that produces traces towards the aimed state is also widely explored Edwards2018ForwardBackwardRL; pmlrv119lee20g; goyal2018recall; Lai2020BidirectionalMP. Among them, the most similar to our work is Lai2020BidirectionalMP, which leverages bidirectional model rollouts to reduce compounding error in the online setting. However, the main differences are: (1) CABI is proposed to augment the fixed dataset instead of performing policy optimization in a model-based way; (2) CABI incorporates a double check mechanism for reliable imaginations; (3) model predictive control (MPC) Camacho2013model is not involved in CABI.

## 3 Preliminaries

We study RL under a Markov Decision Process (MDP) specified by a tuple $\langle \mathcal{S}, \mathcal{A}, \rho_0, p, r, \gamma \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\rho_0$ denotes the initial state distribution, $p(s' \mid s, a)$ is the stochastic transition dynamics, $r(s, a)$ is the reward function, and $\gamma$ is the discount factor. The policy $\pi(\cdot \mid s)$ is a mapping from states to a probability distribution over actions. The goal of RL is to obtain a policy $\pi$ such that the expected discounted cumulative reward is maximized, i.e., $\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right]$. In online RL, the agent learns from the experience collected from its interactions with the environment. However, in the offline setting, both interaction and exploration are infeasible, and the agent can only access the logged static dataset $\mathcal{D}_{\rm env} = \{(s, a, r, s')\}$, which was gathered in advance by the unknown behavior policy. Since the fixed dataset typically covers only a subset of the full space $\mathcal{S} \times \mathcal{A}$, generalization beyond the raw dataset becomes challenging. Model-based RL mitigates this issue by learning a dynamics model $\hat{p}(s' \mid s, a)$ and a reward function $\hat{r}(s, a)$, and generating synthetic transitions to augment the dataset. However, there is no guarantee that the generated samples are reliable (see Section 4.1), and we focus on addressing this issue in this paper.

## 4 Confidence-Aware Bidirectional Offline Model-Based Imagination

In this section, we first use a toy example to illustrate the necessity of training bidirectional models with the double check mechanism. Then, we give the detailed framework of our method, Confidence-Aware Bidirectional Offline Model-Based Imagination (CABI).

### 4.1 You Need to Double Check Your State

Model-free offline RL algorithms suffer from poor generalization as they are trained on a fixed dataset with limited samples. Model-based methods extend the logged dataset by generating synthetic transitions from the trained dynamics model. Despite such an advantage, they lack a mechanism for checking the *reliability* of the generated samples. If the model is inaccurate, poor transition samples that lie in the out-of-support region of the dataset can be generated, which may downgrade the performance of offline RL algorithms.

Human beings tend to conduct *double check* when they are uncertain about the outcome, e.g., clinical medicine research Subramanyam2016InfusionME, autonomous driving Van2012ANA, etc. Inspired by this nature, we propose to train bidirectional dynamics models and admit the samples where the forward model and backward model have few disagreements instead of roughly trusting all generated samples. In this way, we introduce conservatism into the transition itself instead of the critic or the actor networks. We give the illustration of double check mechanism in Figure 2.

We argue that either forward dynamics model or backward dynamics model is unreliable, and bidirectional modeling in conjunction with the double check mechanism is critical for trustworthy sample generation. We verify this by designing a toy task, 2-dimensional environment with continuous state space and action space, namely RiskWorld, as shown in Figure 3. The central point of the square region in RiskWorld is , and the state space gives . Each episode, the agent randomly starts at the region and takes actions . There is a danger zone , and the done flag would turn into true if the agent steps into this region, along with a reward of . The agent will receive a reward of if it lies in , and if it locates at .

We run a random policy in RiskWorld for timesteps to collect an offline dataset. Figure 4(a) shows the state distribution (blue crosses) of the dataset, where there are no transitions in the danger zone (red circle area), as the episode terminates if the agent steps into it. To compare different ways of imagination, we train a forward model, a backward model, and a bidirectional model with the double check mechanism on this dataset. The number of training epochs is set to 100, and the rollout horizon is set to 3 for all of them. The detailed experimental setup is available in Appendix A.

We use the trained forward model, reverse model, and bidirectional model to generate transition samples, and plot the state distributions of their generated samples respectively. As shown in Figure 4(b), the forward model generates many samples in the danger zone (red circle area). Figure 4(c) reveals that the reverse model generates many illegal samples that lie outside the state space, as well as many transitions that lie in the danger zone. Together, these results show that neither the forward nor the backward dynamics model alone outputs reliable transitions. However, we observe in Figure 4(d) that bidirectional modeling with the double check mechanism produces reliable and conservative synthetic samples, i.e., out-of-support or dangerous samples are not included, because the forward model and backward model disagree strongly at those states. We hence argue that bidirectional modeling with double check is necessary for reliable data generation in offline RL.

### 4.2 Bidirectional Models Learning in CABI

Bidirectional dynamics models training.

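Our bidirectional modeling captures the transition probability and the reward function simultaneously, i.e., we learn the forward model $\hat{p}_\theta(s', r \mid s, a)$ and the reverse model $\hat{p}_\phi(s, r \mid s', a)$, parameterized by $\theta$ and $\phi$ respectively. The forward model represents the probability of the next state and corresponding reward given the current state and action, and the backward model outputs the probability of the current state and reward using the next state and action as input. We assume that the predicted reward only depends on the current state $s$ and action $a$, so the unified models can be decomposed as $\hat{p}_\theta(s', r \mid s, a) = \hat{p}_\theta(s' \mid s, a)\,\hat{p}_\theta(r \mid s, a)$ and $\hat{p}_\phi(s, r \mid s', a) = \hat{p}_\phi(s \mid s', a)\,\hat{p}_\phi(r \mid s, a)$. We denote the loss functions for the forward model and backward model as $\mathcal{L}_f$ and $\mathcal{L}_b$ respectively, and optimize them by maximizing the log-likelihood via (1) and (2), where $\mathcal{D}_{\rm env}$ is the raw static dataset.

$$\mathcal{L}_f = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}_{\rm env}}\left[-\log \hat{p}_\theta(s', r \mid s, a)\right] \tag{1}$$

$$\mathcal{L}_b = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}_{\rm env}}\left[-\log \hat{p}_\phi(s, r \mid s', a)\right] \tag{2}$$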

Following prior work Yu2020MOPOMO; Kidambi2020MOReLM, we train an ensemble of bootstrapped probabilistic dynamics models, which has been widely demonstrated to be effective in model-based RL Chua2018DeepRL; Janner2019WhenTT. Each model in the ensemble is parameterized by a multi-layer neural network, which outputs a Gaussian distribution over the predicted state and reward (see Appendix A for the exact form). Detailed hyperparameter setup for dynamics models training is deferred to Appendix C.
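As a rough illustration of the maximum-likelihood objective for a single probabilistic model, the sketch below computes the negative log-likelihood of a target vector (for example, a concatenated next state and reward) under a diagonal Gaussian. The diagonal covariance, the log-variance parameterization, and the function name `gaussian_nll` are assumptions made for this example; they are not taken from the paper.

```python
import math


def gaussian_nll(target, mean, log_var):
    """Negative log-likelihood of `target` under a diagonal Gaussian.

    All three arguments are equal-length sequences of floats.
    `log_var` holds per-dimension log-variances (an assumed parameterization).
    """
    if not (len(target) == len(mean) == len(log_var)):
        raise ValueError("target, mean and log_var must have the same length")
    nll = 0.0
    for x, mu, lv in zip(target, mean, log_var):
        # 0.5 * [log(2*pi) + log(sigma^2) + (x - mu)^2 / sigma^2]
        nll += 0.5 * (math.log(2.0 * math.pi) + lv + (x - mu) ** 2 / math.exp(lv))
    return nll


# A prediction centred on the target has lower loss than a shifted one.
good = gaussian_nll([1.0, 0.5], [1.0, 0.5], [0.0, 0.0])
bad = gaussian_nll([1.0, 0.5], [0.0, 0.0], [0.0, 0.0])
print(good < bad)
```

Minimizing this quantity averaged over dataset transitions corresponds to maximizing the log-likelihood; an ensemble would train several such models on bootstrapped subsets of the data.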

Bidirectional rollout policies training.

We additionally train bidirectional generative models, which serve as rollout policies and are used to generate actions to augment the static dataset. We model the rollout policy with a conditional variational autoencoder (CVAE) Kingma2014AutoEncodingVB; Sohn2015LearningSO; Fujimoto2019OffPolicyDR, which offers diverse actions while staying within the span of the dataset. The CVAE is made up of an encoder that outputs the latent variable $z$ under a Gaussian distribution, and a decoder that maps $z$ to the desired space. We denote the forward rollout policy as $G_\omega$, parameterized by $\omega = \{\omega_1, \omega_2\}$, where $\omega_1$ is the parameter of the encoder $E_{\omega_1}(s, a)$ and $\omega_2$ is the parameter of the decoder $D_{\omega_2}(s, z)$. The forward rollout policy is then trained by maximizing its variational lower bound, which is equivalent to minimizing the following loss:

$$\mathcal{L}_{\omega} = \mathbb{E}_{(s, a) \sim \mathcal{D}_{\rm env},\, z \sim E_{\omega_1}(s, a)}\left[\left(a - D_{\omega_2}(s, z)\right)^2 + D_{\rm KL}\left(E_{\omega_1}(s, a) \,\|\, \mathcal{N}(0, I)\right)\right] \tag{3}$$

where $D_{\rm KL}$ denotes the KL-divergence, and $I$ is an identity matrix. The first term on the RHS of (3) represents the reconstruction loss, where we want the decoded action to approximate the real action. For action generation, we first sample a latent vector $z$ from the multivariate Gaussian distribution $\mathcal{N}(0, I)$, and then pass it with the current state $s$ into the decoder to output the action.

Similarly, the backward rollout policy $G_\psi$, parameterized by $\psi = \{\psi_1, \psi_2\}$, contains an encoder $E_{\psi_1}(s', a)$ and a decoder $D_{\psi_2}(s', z)$. The loss function of the backward rollout policy gives:

$$\mathcal{L}_{\psi} = \mathbb{E}_{(s', a) \sim \mathcal{D}_{\rm env},\, z \sim E_{\psi_1}(s', a)}\left[\left(a - D_{\psi_2}(s', z)\right)^2 + D_{\rm KL}\left(E_{\psi_1}(s', a) \,\|\, \mathcal{N}(0, I)\right)\right] \tag{4}$$

We then draw $z$ from the Gaussian distribution $\mathcal{N}(0, I)$, and draw the action from the action decoder with the next state $s'$ and latent variable $z$ as input.
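To make the two terms of the CVAE objective concrete, the following minimal sketch evaluates a reconstruction error plus the closed-form KL divergence between a diagonal Gaussian and a standard normal, and draws a latent vector from the prior for action generation. The encoder and decoder networks are omitted; the mean-squared reconstruction term, the `kl_weight` parameter, and all function names are assumptions for illustration only.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def cvae_loss(action, decoded_action, mu, log_var, kl_weight=1.0):
    """Reconstruction error of the decoded action plus a weighted KL term.

    `kl_weight` is a hypothetical knob; the text does not specify a weighting.
    """
    recon = sum((a - d) ** 2 for a, d in zip(action, decoded_action)) / len(action)
    return recon + kl_weight * kl_to_standard_normal(mu, log_var)


def sample_latent(dim, seed=None):
    """Draw z ~ N(0, I), as done at action-generation time."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


# Perfect reconstruction with a posterior equal to the prior gives zero loss.
print(cvae_loss([0.2, -0.1], [0.2, -0.1], [0.0, 0.0], [0.0, 0.0]))
```

In a full implementation, `mu` and `log_var` would come from the encoder given a state-action pair, and `decoded_action` from the decoder given the state (or next state, for the backward policy) and a sampled latent vector.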

We present the detailed procedure for the model training part of CABI in Algorithm 1.

### 4.3 Conservative Data Augmentation with CABI

After the bidirectional dynamics models and bidirectional rollout policies are well trained, we utilize them to generate imaginary samples. Each time, we sample a state from the raw dataset to produce an imagined forward trajectory with the forward dynamics model and the forward rollout policy, and sample a next state from the raw dataset to generate an imagined reverse trajectory with the reverse dynamics model and the reverse rollout policy. At each step within the rollout horizon, we do a double check and reject badly imagined synthetic transitions.

To be specific, when performing forward imagination from a state $s$ and generating the synthetic next state $\hat{s}'$, we trace back from $\hat{s}'$ with the reverse model and get the backward state $\tilde{s}$. We evaluate the deviation of $\tilde{s}$ from $s$, and trust $\hat{s}'$ if the deviation is small. Similarly, starting from a state $s'$, we backtrack its previous state $\hat{s}$ with the backward dynamics model, and then look forward from $\hat{s}$ to get $\tilde{s}'$ with the forward dynamics model. We trust $\hat{s}$ if the deviation of $\tilde{s}'$ from $s'$ is small.
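The double check can be sketched as a round trip through both models. The example below is a simplified illustration, not the authors' implementation: the models are plain callables returning a single predicted state, the same action is reused for both directions, and the Euclidean distance is an assumed choice of deviation measure.

```python
import math


def state_deviation(state_a, state_b):
    """Euclidean distance between two states (assumed deviation measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(state_a, state_b)))


def forward_double_check(state, action, forward_model, backward_model):
    """Predict the next state, trace back, and measure the round-trip error."""
    next_state = forward_model(state, action)
    traced_back = backward_model(next_state, action)
    return next_state, state_deviation(traced_back, state)


def backward_double_check(next_state, action, forward_model, backward_model):
    """Predict the previous state, look forward, and measure the round-trip error."""
    prev_state = backward_model(next_state, action)
    looked_forward = forward_model(prev_state, action)
    return prev_state, state_deviation(looked_forward, next_state)


# Toy dynamics s' = s + a: a consistent backward model versus a biased one.
def toy_forward(s, a):
    return [x + u for x, u in zip(s, a)]


def consistent_backward(s_next, a):
    return [x - u for x, u in zip(s_next, a)]


def biased_backward(s_next, a):
    return [x - u + 0.5 for x, u in zip(s_next, a)]


_, dev_ok = forward_double_check([0.0, 1.0], [0.5, -0.5], toy_forward, consistent_backward)
_, dev_bad = forward_double_check([0.0, 1.0], [0.5, -0.5], toy_forward, biased_backward)
print(dev_ok < dev_bad)
```

A small deviation indicates that the two models agree on the transition; a large one flags the imagined state as untrustworthy.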

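We keep those trustworthy rollouts and gather them to get the model buffer $\mathcal{D}_{\rm model}$. We combine the synthetic model buffer with the raw dataset $\mathcal{D}_{\rm env}$ to obtain the final buffer $\mathcal{D}_{\rm total}$, i.e., $\mathcal{D}_{\rm total} = \mathcal{D}_{\rm env} \cup \mathcal{D}_{\rm model}$. We can then train *any* model-free offline RL algorithm on the composite dataset.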

One naïve way of implementing the double check mechanism is to set a threshold $\epsilon$, and admit a transition only if its state deviation is smaller than $\epsilon$, for both forward and backward imagination. However, such a method lacks flexibility: one may need to carefully tune $\epsilon$ per dataset based on strong prior knowledge about the dataset, which impedes the application of the double check mechanism. Instead, we sort the transitions in a mini-batch by state deviation from small to large and keep the top $k$% of them. We keep the 20% of transitions that have the smallest deviation for all of our experiments in Section 5 (an empirical study on $k$ is available in Appendix C).
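The rank-based selection can be sketched in a few lines. The function below is a hypothetical helper (the name `keep_most_consistent` and the rounding-down of the kept count are assumptions) that keeps the fraction of a mini-batch with the smallest state deviation.

```python
def keep_most_consistent(transitions, deviations, keep_percent=20.0):
    """Keep the `keep_percent`% of transitions with the smallest deviation.

    `transitions` and `deviations` are parallel sequences; the number of
    kept items is rounded down (an assumption of this sketch).
    """
    if len(transitions) != len(deviations):
        raise ValueError("transitions and deviations must have the same length")
    n_keep = int(len(transitions) * keep_percent / 100.0)
    # Sort indices by deviation, smallest first.
    order = sorted(range(len(deviations)), key=lambda i: deviations[i])
    return [transitions[i] for i in order[:n_keep]]


batch = ["t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9"]
devs = [0.9, 0.1, 0.5, 0.05, 0.7, 0.3, 0.8, 0.2, 0.6, 0.4]
print(keep_most_consistent(batch, devs))  # the two most consistent transitions
```

Because the cut-off is relative to each mini-batch rather than an absolute threshold, no per-dataset tuning of a deviation scale is needed.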

Our method is confidence-aware and conservative, as we only admit the transitions that the forward model and backward model agree on, thus excluding poor transitions from the model buffer $\mathcal{D}_{\rm model}$. The full procedure for the data generation part of CABI is available in Algorithm 2.

Table 1: Normalized average scores on the Adroit datasets over 5 random seeds. The standard deviation (±) is reported for CABI+BCQ only.

Task Name | CABI+BCQ | BCQ | UWAC | BEAR | BC | AWR | CQL | MOPO | COMBO |
---|---|---|---|---|---|---|---|---|---|
pen-cloned | 54.7±2.0 | 44.0 | 33.1 | 26.5 | 56.9 | 28.0 | 39.2 | -2.1 | -2.4 |
pen-human | 75.1±1.5 | 68.9 | 21.7 | -1.0 | 34.4 | 12.3 | 37.5 | 9.7 | 27.7 |
pen-expert | 127.6±2.0 | 114.9 | 111.9 | 105.9 | 85.1 | 111.0 | 107.0 | -0.6 | 11.5 |
door-cloned | 0.5±0.2 | 0.0 | 0.0 | -0.1 | -0.1 | 0.0 | 0.4 | -0.1 | 0.0 |
door-human | 1.7±0.1 | 0.0 | 2.1 | -0.3 | 0.5 | 0.4 | 9.9 | -0.2 | -0.3 |
door-expert | 105.3±0.5 | 99.0 | 104.1 | 103.4 | 34.9 | 102.9 | 101.5 | -0.2 | 4.9 |
relocate-cloned | -0.2±0.0 | -0.3 | -0.3 | -0.3 | -0.1 | -0.2 | -0.1 | -0.3 | -0.1 |
relocate-human | 0.1±0.1 | 0.5 | 0.5 | -0.3 | 0.0 | 0.0 | 0.2 | -0.3 | -0.3 |
relocate-expert | 105.9±1.0 | 41.6 | 105.6 | 98.6 | 101.3 | 91.5 | 95.0 | -0.2 | 17.2 |
hammer-cloned | 4.3±1.6 | 0.4 | 0.4 | 0.3 | 0.8 | 0.4 | 2.1 | 0.2 | 0.4 |
hammer-human | 3.1±2.2 | 0.5 | 1.1 | 0.3 | 1.5 | 1.2 | 4.4 | 0.2 | 0.2 |
hammer-expert | 128.9±0.9 | 107.2 | 110.6 | 127.3 | 125.6 | 39.0 | 86.7 | 0.3 | 0.3 |
Total Score | 607.0 | 476.7 | 490.8 | 460.3 | 440.8 | 386.5 | 483.8 | 6.4 | 59.1 |

## 5 Experiments

In this section, we combine CABI with off-the-shelf model-free offline RL algorithms and conduct extensive experiments on the D4RL benchmarks Fu2020D4RLDF. In Section 5.1, we combine CABI with BCQ Fujimoto2019OffPolicyDR, and evaluate it on the challenging Adroit dataset to show the effectiveness of conservative data augmentation with CABI. We present a detailed ablation study in Section 5.2, where we aim to answer the following questions: (1) Is the double check mechanism a critical component for CABI? (2) How does CABI compare with the forward/reverse imagination? (3) How does CABI compare against other augmentation methods, e.g., random selection? Furthermore, we incorporate CABI with another recent model-free offline RL method, TD3_BC Fujimoto2021AMA, and evaluate it on the MuJoCo datasets, to show the generality and advantages of CABI. We additionally combine CABI with IQL kostrikov2022offline and evaluate the performance of CABI+IQL on both Adroit tasks and MuJoCo tasks. Due to the space limit, the results are deferred to Appendix G.

### 5.1 Performance on Challenging Adroit Dataset

We demonstrate the benefits of CABI by combining it with BCQ and evaluating it on the challenging Adroit dataset Rajeswaran2018LearningCD. Adroit dataset involves controlling a 24-DoF simulated robotic hand that aims at hammering a nail, opening a door, twirling a pen, or picking/moving a ball. It contains three types of datasets for each task (*human*, *cloned*, and *expert*), yielding a total of 12 datasets. This domain is very challenging for prior methods to learn from because the dataset is made up of narrow human demonstrations on a sparse reward, high-dimensional robotic manipulation task.

We summarize the overall results in Table 1, where we compare CABI+BCQ against recent model-free offline RL methods, such as UWAC Wu2021UncertaintyWA, CQL Kumar2020ConservativeQF, and BCQ Fujimoto2019OffPolicyDR, and model-based offline RL methods, such as MOPO Yu2020MOPOMO and COMBO Yu2021COMBOCO. We run MOReL and COMBO on these datasets with our reproduced code. Results of MOPO and UWAC on the Adroit domain are acquired by running their official codebases, and the results of the remaining baselines are taken directly from Fu2020D4RLDF. All methods are run over 5 different random seeds and normalized average scores are reported in Table 1. We only report the standard deviation for CABI+BCQ, and the full table is deferred to Appendix H.

As shown, CABI significantly boosts the performance of vanilla BCQ on almost all datasets, achieving a total score of 607.0 vs. 476.7 of BCQ. CABI+BCQ also surpasses the baseline model-free and model-based offline RL methods on 7 out of 12 datasets and achieves the highest total score.

It is worth noting that model-based offline RL methods generally fail on the Adroit tasks, because (1) the dataset distribution is narrow and high-dimensional, making it challenging for the trained forward dynamics model to generate accurate and reliable transitions; (2) the actions in the synthetic transitions are generated by the actor during the training process, thus the error may accumulate if the actor is updated towards a wrong direction. CABI, instead, alleviates the underlying issues via adopting the CVAE for action generation and conducting double check on state prediction.

### 5.2 Ablation Study

Is the double check mechanism critical? To answer this question, we exclude the double check mechanism in CABI and admit all generated synthetic samples from bidirectional models, which gives rise to Bidirectional Offline Model-based Imagination (BOMI). We evaluate CABI+BCQ and BOMI+BCQ on the Adroit tasks with identical parameter configuration over 5 different random seeds and show the average normalized scores in Table 2. It can be seen that BOMI brings some performance improvement on most of the tasks via data augmentation with bidirectional models and rollout policies. However, the generated data may be unreliable (we observe a performance drop in *pen-cloned, pen-human*), which impedes the benefits of bidirectional data augmentation. Such concern can be alleviated with the aid of the double check mechanism. As illustrated in Table 2, CABI+BCQ outperforms BOMI+BCQ on most tasks and incurs a much better total score.

Table 2: Ablation on the Adroit tasks with BCQ as the base algorithm. Average normalized scores over 5 random seeds, with ± denoting the standard deviation.

Task name | BCQ | +Forward | +Backward | +BOMI | +CABI |
---|---|---|---|---|---|
pen-cloned | 44.0 | 41.2±1.1 | 36.8±6.6 | 43.4±6.1 | 54.7±2.0 |
pen-human | 68.9 | 57.8±9.3 | 60.9±5.6 | 49.6±1.4 | 75.1±1.5 |
pen-expert | 114.9 | 114.4±5.4 | 118.5±4.7 | 121.8±1.2 | 127.6±2.0 |
door-cloned | 0.0 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 0.5±0.2 |
door-human | 0.0 | -0.1±0.1 | 0.0±0.1 | 0.0±0.1 | 1.7±0.1 |
door-expert | 99.0 | 104.2±0.3 | 103.7±0.2 | 102.5±1.2 | 105.3±0.5 |
relocate-cloned | -0.3 | -0.3±0.0 | -0.3±0.0 | -0.2±0.0 | -0.2±0.0 |
relocate-human | 0.5 | 0.0±0.0 | 0.0±0.0 | 0.0±0.1 | 0.1±0.1 |
relocate-expert | 41.6 | 72.9±2.0 | 76.8±6.8 | 80.1±9.3 | 105.9±1.0 |
hammer-cloned | 0.4 | 1.7±0.1 | 0.4±0.1 | 3.1±3.8 | 4.3±1.6 |
hammer-human | 0.5 | 2.0±0.2 | 2.8±0.5 | 2.1±0.7 | 3.1±2.2 |
hammer-expert | 107.2 | 126.8±1.0 | 126.9±1.0 | 126.8±1.3 | 128.9±0.9 |
Total score | 476.7 | 520.6 | 526.5 | 529.2 | 607.0 |

Table 3: Normalized average scores on the MuJoCo datasets. The standard deviation (±) over 5 random seeds is reported for CABI+TD3_BC only.

Task Name | CABI+TD3_BC | TD3_BC | UWAC | MOPO | BCQ | BC | CQL | FisherBRC |
---|---|---|---|---|---|---|---|---|
halfcheetah-random | 15.1±0.4 | 10.2 | 2.3 | 35.4 | 2.2 | 2.0 | 21.7 | 32.2 |
hopper-random | 11.9±0.1 | 11.0 | 9.8 | 11.7 | 10.6 | 9.5 | 10.7 | 11.4 |
walker2d-random | 6.4±1.5 | 1.4 | 3.8 | 13.6 | 4.9 | 1.2 | 2.7 | 0.6 |
halfcheetah-medium-replay | 44.4±0.2 | 43.3 | 38.9 | 53.1 | 38.2 | 34.7 | 41.9 | 43.3 |
hopper-medium-replay | 31.3±0.7 | 31.4 | 18.0 | 67.5 | 33.1 | 19.7 | 28.6 | 35.6 |
walker2d-medium-replay | 29.4±1.3 | 25.2 | 8.4 | 39.0 | 15.0 | 8.3 | 15.8 | 42.6 |
halfcheetah-medium | 45.1±0.1 | 42.8 | 37.4 | 42.3 | 40.7 | 36.6 | 37.2 | 41.3 |
hopper-medium | 100.4±0.3 | 99.5 | 30.3 | 28.0 | 54.5 | 30.0 | 44.2 | 99.4 |
walker2d-medium | 82.0±0.4 | 79.7 | 17.4 | 17.8 | 53.1 | 11.4 | 57.5 | 79.5 |
halfcheetah-medium-expert | 105.0±0.2 | 97.9 | 40.6 | 63.3 | 64.7 | 67.6 | 27.1 | 96.1 |
hopper-medium-expert | 112.7±0.0 | 112.2 | 95.4 | 23.7 | 110.9 | 89.6 | 111.4 | 90.6 |
walker2d-medium-expert | 108.4±1.3 | 101.1 | 14.8 | 44.6 | 57.5 | 12.0 | 68.1 | 103.6 |
halfcheetah-expert | 107.6±0.9 | 105.7 | 104.0 | - | 89.9 | 105.2 | 82.4 | 106.8 |
hopper-expert | 112.4±0.1 | 112.2 | 109.1 | - | 107.0 | 111.5 | 111.2 | 112.3 |
walker2d-expert | 108.6±1.5 | 105.7 | 88.4 | - | 102.3 | 56.0 | 103.8 | 79.9 |
Total Score | 1020.7 | 979.3 | 618.6 | - | 784.6 | 595.3 | 764.3 | 974.6 |

CABI against forward/backward imagination. We incorporate BCQ with the pure forward model, backward model, and CABI, and conduct extensive experiments on the Adroit tasks over 5 different random seeds. The forward model and reverse model are trained with the same configuration as CABI. The results are summarized in Table 2. It can be seen that either the forward or reverse model results in limited improvement, which is consistent with the results of BOMI. As previously discussed, the forward model and reverse model may generate unreliable transitions. We see such evidence as the performance of BCQ falls on some of the tasks (e.g., *pen-cloned*) if trained on mere forward or reverse imagination. BCQ+CABI, instead, outperforms BCQ+Forward and BCQ+Backward on all tasks. Hence, we conclude that CABI guarantees trustworthy transitions for training, and brings improvement on almost all of the tasks.

CABI against other augmentation methods. We further compare CABI against three data augmentation methods: (1) CABI-random, where we replace the CVAE with a random policy as the rollout policy in CABI; (2) R-20, where we *randomly* select 20% of the synthetic transitions for bidirectional imagination; (3) EV-20, where we select the 20% of transitions with the smallest *ensemble variance* for bidirectional imagination. We use BCQ as the base algorithm and run experiments on four Adroit tasks for these augmentation methods with the same parameter setup as CABI (e.g., real data ratio). The results in Table 4 show that CABI performs consistently better than these methods. Since the data augmentation process of CABI is isolated from the policy optimization, we cannot leverage a random rollout policy, because the actions generated by a random policy may lie outside the span of the dataset, which can negatively affect the performance of the agent. Hence, the CVAE is critical to ensure safe and conservative data augmentation. Meanwhile, relying on the ensemble variance for data selection is not trustworthy, as the models in the ensemble are trained on identical data and may all produce wrong predictions with small variance.
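The weakness of variance-based selection can be seen in a small numerical sketch. The helper below computes the total variance of ensemble predictions across members; the function name and the use of population variance summed over state dimensions are assumptions for this example. Ensemble members that share the same bias produce zero variance even though every prediction is wrong.

```python
def ensemble_variance(predictions):
    """Sum over state dimensions of the variance across ensemble members.

    `predictions` is a list with one predicted state (list of floats) per model.
    """
    n_models = len(predictions)
    dim = len(predictions[0])
    total = 0.0
    for d in range(dim):
        mean_d = sum(p[d] for p in predictions) / n_models
        total += sum((p[d] - mean_d) ** 2 for p in predictions) / n_models
    return total


true_next_state = [1.0, 1.0]
# Every member predicts the same, wrong state: variance is zero.
agreeing_but_wrong = [[3.0, -2.0], [3.0, -2.0], [3.0, -2.0]]
# Members scatter around the true state: variance is non-zero.
scattered_but_close = [[0.9, 1.1], [1.1, 0.9], [1.0, 1.0]]

print(ensemble_variance(agreeing_but_wrong))
print(ensemble_variance(scattered_but_close) > 0.0)
```

A selection rule based only on this quantity would prefer the first set of predictions, which is the failure mode the double check is meant to avoid.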

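Table 4: Comparison of CABI against other data augmentation methods on four Adroit tasks with BCQ as the base algorithm (± denotes the standard deviation).

Task Name | BCQ | +CABI | +R-20 | +EV-20 | +CABI-random |
---|---|---|---|---|---|
pen-cloned | 44.0 | 54.7±2.0 | 41.2±3.0 | 40.4±2.0 | 37.6±7.8 |
pen-expert | 114.9 | 127.6±2.0 | 112.6±5.6 | 118.8±2.5 | 106.3±3.7 |
hammer-cloned | 0.4 | 4.3±1.6 | 0.9±0.6 | 0.4±0.1 | 0.3±0.0 |
hammer-expert | 107.2 | 128.9±0.9 | 104.2±24.6 | 125.5±5.5 | 103.8±1.5 |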

### 5.3 Broad Results on MuJoCo Dataset

To show the generality of our method, we integrate CABI with another recent model-free offline RL method, TD3_BC Fujimoto2021AMA, and conduct experiments on 15 MuJoCo datasets. We widely compare CABI+TD3_BC against other recent model-free offline RL methods, such as FisherBRC Kostrikov2021OfflineRL, UWAC Wu2021UncertaintyWA, CQL Kumar2020ConservativeQF, and model-based batch RL method, MOPO Yu2020MOPOMO. We run CABI+TD3_BC over 5 different random seeds. We also run UWAC using the official codebase on the MuJoCo datasets over 5 different random seeds. The results of TD3_BC, BC, CQL, FisherBRC are taken directly from Fujimoto2021AMA, and the results of other baseline methods are taken from Wu2021UncertaintyWA.

The experimental results in Table 3 reveal that our approach exceeds all baseline methods on 10 out of 15 datasets, and is the strongest in terms of the total score. On almost all of the tasks, we observe a performance improvement with CABI over the base TD3_BC algorithm. However, because TD3_BC contains a behavioral cloning term, the performance improvement upon TD3_BC is limited. Still, the experimental results in Tables 1 and 3 show that CABI is a powerful data augmentation method and can boost the performance of the base model-free offline RL methods.

## 6 Conclusion and Limitations

In this paper, we follow human nature and propose to do *double check* during synthetic transition generation to ensure that the imagined samples are conservative and accurate. We admit samples that the forward model and reverse model agree on. Our method, CABI, involves training bidirectional dynamics models and rollout policies and can be combined with *any* off-the-shelf model-free offline RL algorithms. Extensive experiments on the D4RL benchmarks show that our method significantly boosts the performance of the base model-free offline RL method, and can achieve competitive or better performance against recent baseline methods. For future work, it is interesting to evaluate CABI in the online setting and investigate whether it can benefit model-based online RL as well.

The major limitation of our proposed method lies in the computation cost as we train bidirectional dynamics models and rollout policies. However, since CABI is isolated from policy optimization, we can enhance the dataset beforehand.

## References

## Appendix A Experimental Setup of Toy RiskWorld Task

In this section, we give the detailed experimental setup of our toy RiskWorld task. RiskWorld is a 2-dimensional, continuous state space, continuous action space environment as shown in Figure 3. We suppose the central point of RiskWorld is , and the permitted range of RiskWorld gives , i.e., the length of RiskWorld is 3. The state information in RiskWorld is composed of the coordinates of the agent, i.e., . There is a dangerous area locating at the central point with radius 0.5, i.e., . The agent randomly starts at a point in , and can take actions . There is also a high reward zone locating at . The reward function is defined in (5).

(5) |

The agent will receive a large minus reward of if it steps into the dangerous zone , and the done flag will turn into true. The agent is not allowed to step out of the legal region . The episode length for RiskWorld is set to be 300. The RiskWorld is intrinsically a sparse reward environment. We run a random policy on RiskWorld for timesteps and log the transition data it collected during interactions to form a static dataset *RiskWorld-random*.

Model-based reinforcement learning (RL) learns either forward dynamics or reverse dynamics of the environment [Sutton2005ReinforcementLA, goyal2018recall], and can produce imaginary transitions for training, which has been widely demonstrated to be effective in improving the sample efficiency of RL in the online setting [Buckman2018SampleEfficientRL, Janner2019WhenTT]. The forward dynamics model predicts the next state and the corresponding reward function given the current state and action, and the reverse dynamics model outputs the previous state and reward signal given action and the next state. Bidirectional modeling combines both forward dynamics model and backward dynamics model.

To compare different ways of imagination, i.e., forward imagination, reverse imagination, and bidirectional imagination with the double check mechanism, we train a forward dynamics model, a backward dynamics model, and a bidirectional dynamics model on the RiskWorld-random dataset, respectively. We represent each of the forward and reverse dynamics models with a probabilistic neural network. The forward model parameterized by receives the current state and action as input, and outputs a multivariate Gaussian distribution that predicts the next state and reward as shown in (6).

(6) |

where and represent the mean and variance of the forward model , respectively.

Similarly, for the backward model parameterized by , it adopts the next state and action as input and outputs a multivariate normal distribution predicting reward signal and the previous state (see (7)).

(7) |

where and denote the mean and variance of the backward model , respectively.

The probabilistic neural network is a multi-layer neural network that consists of 4 feedforward layers with 400 hidden units. We adopt *swish* activation for each intermediate layer. Following prior works [Janner2019WhenTT, Yu2020MOPOMO, Kidambi2020MOReLM], we train an ensemble of seven such probabilistic neural networks for both the forward and backward model. We use a hold-out set of 1000 transitions to validate the trained dynamics models, and select the five models with the best performance. When performing forward or reverse imagination, we randomly pick one of the five best models at each step to generate synthetic trajectories. Considering the simplicity of the toy RiskWorld task, we train both the forward dynamics model and reverse dynamics model for 100 epochs, and the rollout length (horizon) is set to 3 for the forward model, reverse model, and bidirectional model. We use the trained dynamics models (forward, reverse, bidirectional) to generate imaginary transition samples, and log their respective model buffers. In Figure 4, we plot the model buffers of these dynamics models together with the raw static dataset obtained by running the random policy.
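The model selection and per-step model sampling described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: `select_elites`, `rollout`, and the callable interfaces of `models` and `policy` are hypothetical names, and each model is assumed to return the mean and variance of a diagonal Gaussian over the concatenated next state and reward.

```python
import numpy as np


def select_elites(holdout_errors, num_elites=5):
    """Return indices of the ensemble members with the lowest hold-out error."""
    return np.argsort(holdout_errors)[:num_elites]


def rollout(models, elites, state, policy, horizon, rng):
    """Generate one imagined trajectory, drawing a random elite model per step."""
    transitions = []
    for _ in range(horizon):
        action = policy(state, rng)
        # Pick one of the elite models uniformly at random for this step.
        model = models[rng.choice(elites)]
        # Assumed interface: Gaussian over (next_state, reward), reward last.
        mean, var = model(state, action)
        sample = mean + np.sqrt(var) * rng.standard_normal(mean.shape)
        next_state, reward = sample[:-1], sample[-1]
        transitions.append((state, action, reward, next_state))
        state = next_state
    return transitions
```

With seven trained models and their hold-out errors, `select_elites` yields the five indices passed to `rollout`; a reverse rollout would follow the same pattern with a model that maps the next state and action to the previous state and reward.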

## Appendix B Datasets and Evaluation Setting on the D4RL Benchmarks

In this section, we give a detailed description of the datasets we used in this paper, and also describe the evaluation setting that is adopted on the D4RL benchmarks [Fu2020D4RLDF]. D4RL is specially designed for evaluating offline RL (or batch RL) algorithms, which covers the dimensions that offline RL may encounter in practical applications, such as passively logged data, human demonstrations, etc.

### B.1 Adroit datasets and MuJoCo datasets

The Adroit dataset involves controlling a 24-DoF simulated Shadow Hand robot to perform tasks such as hammering a nail, opening a door, twirling a pen, and picking up and moving a ball, as shown in Figure 5. The Adroit domain is highly challenging even for online RL algorithms because (1) the dataset contains narrow human demonstrations, and (2) the domain involves sparse-reward, high-dimensional robotic manipulation tasks. There are four tasks in the dataset, and three types of datasets for each task: *cloned, human, expert*. human: a small number of demonstrations operated by a human (25 trajectories per task). expert: a large amount of expert data from a fine-tuned RL policy. cloned: a large amount of data generated by performing imitation learning on the human demonstrations, running the policy, and mixing the data at a 50-50 ratio with the demonstrations. Dataset mixing is involved for *cloned* as the cloned policies themselves fail on the tasks, making the dataset otherwise hard to learn from.

The MuJoCo dataset is collected during the interactions with the continuous action environments in Gym [Brockman2016OpenAIG] simulated by MuJoCo [Todorov2012MuJoCoAP]. We adopt three tasks in this dataset, *halfcheetah, hopper, walker2d* as illustrated in Figure 6. Each task in the MuJoCo dataset contains five types of datasets, *random, medium, medium-replay, medium-expert, expert*. random: a large amount of data from a random policy. medium: experiences collected from an early-stopped SAC policy for 1M steps. medium-replay: replay buffer of a policy trained up to the performance of the medium agent. expert: a large amount of data gathered by the SAC policy that is trained to completion. medium-expert: a large amount of data by mixing the medium data and expert data at a 50-50 ratio.

Note that the Adroit dataset is qualitatively different from the MuJoCo Gym dataset because (1) there do not exist human demonstrations in the MuJoCo dataset; (2) the reward is dense in MuJoCo, making it less challenging to learn from; (3) the dimension of transitions in MuJoCo is low compared with Adroit. It is hard for even online RL algorithms to learn useful policies on the Adroit tasks, while it is easy for online RL methods to achieve superior performance on the MuJoCo environments.

### B.2 Evaluation setting in D4RL

D4RL suggests using the normalized score metric to evaluate the performance of the offline RL algorithms [Fu2020D4RLDF]. Denote the expected return of a random policy on the dataset as (reference min score), and the expected return of an expert policy as (reference max score). Suppose that an offline RL algorithm achieves an expected return of after training on the given dataset. Then the normalized score is given by (8).

(8) |

The normalized score ranges roughly from 0 to 100, where 0 corresponds to the performance of a random policy and 100 corresponds to the performance of an expert policy. We give the detailed reference min score and reference max score in Table 5, where all of the tasks share the same reference min score and reference max score across different types of datasets.

Domain | Task Name | Reference min score | Reference max score |
---|---|---|---|
Adroit | pen | 96.26 | 3076.83 |
Adroit | door | -56.51 | 2880.57 |
Adroit | relocate | -6.43 | 4233.88 |
Adroit | hammer | -274.86 | 12794.13 |
MuJoCo | halfcheetah | -280.18 | 12135.0 |
MuJoCo | hopper | -20.27 | 3234.3 |
MuJoCo | walker2d | 1.63 | 4592.3 |
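As a worked example of the normalized score, the sketch below applies the standard D4RL formula, 100 × (return − reference min) / (reference max − reference min), to the pen reference scores from Table 5. The function name `normalized_score` is an illustrative choice.

```python
def normalized_score(expected_return, ref_min, ref_max):
    """D4RL-style normalized score: 0 for a random policy, 100 for an expert."""
    return 100.0 * (expected_return - ref_min) / (ref_max - ref_min)


# Reference scores for the Adroit pen task (Table 5).
PEN_MIN, PEN_MAX = 96.26, 3076.83

# A return halfway between the two reference scores maps to 50.
halfway = normalized_score((PEN_MIN + PEN_MAX) / 2, PEN_MIN, PEN_MAX)
```

Scores can fall below 0 or exceed 100 when a learned policy does worse than the random reference or better than the expert reference, which is why the range is only roughly 0 to 100.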

## Appendix C Implementation Details and Hyperparameters

### C.1 Implementation details

In this section, we give implementation details and hyperparameters for Confidence-Aware Bidirectional Offline Model-Based Imagination (CABI). We represent the approximated forward dynamics and reward model, and backward dynamics and reward model by training a probabilistic neural network. The configuration of the probabilistic neural network is identical to Appendix A. That is, the forward model and reverse model are modeled as a multivariate Gaussian distribution with mean and variance . For the forward model parameterized by , it accepts the current state and action as input and generates the next state and reward. The backward model parameterized by receives the next state and current action as input and outputs the former state and scalar reward. The probabilistic neural network is modeled by a multi-layer neural network that contains 4 feedforward layers, with 400 hidden units in each layer, and a *swish* activation in each intermediate layer. We train an ensemble of seven such probabilistic neural networks and select the best five models based on their performance on a hold-out set made up of 1000 transitions from the offline dataset.

As a data augmentation method, CABI does not actively generate actions during the training process, i.e., the policy optimization process is isolated from the data generation process. We use a conditional variational autoencoder (CVAE) to approximate the behavior policy in the static dataset. We give a brief introduction to the VAE in Appendix E. The CVAE contains an encoder and a decoder . Both the encoder and the decoder in the forward rollout policy and backward rollout policy contain two intermediate layers with 750 hidden units in each layer. We adopt *relu* activation for each intermediate layer. Specifically, we train a forward rollout policy with a CVAE , which contains an encoder and a decoder , , and a reverse rollout policy , where . Note that the forward rollout policy and reverse rollout policy sample actions using stochastic inference from an underlying latent space, so as to increase diversity in the generated actions.

Intrinsically, CABI can be combined with *any* off-the-shelf model-free offline RL algorithm. In this work, we combine CABI with BCQ [Fujimoto2019OffPolicyDR] and TD3_BC [Fujimoto2021AMA], and conduct extensive experiments on the Adroit dataset and MuJoCo dataset on the D4RL benchmarks, respectively. There are generally three steps when combining CABI with model-free offline RL methods: (1) Model training. We first train bidirectional dynamics models and bidirectional rollout policies using the raw static offline dataset ; (2) Data generation. After the bidirectional models and rollout policies are well trained, we use them to generate imaginary trajectories, while conducting the double check and admitting only high-confidence transitions. This yields the model-generated dataset ; (3) Policy optimization. We then merge the real dataset with the imagined dataset to form a composite dataset , i.e., . The mini-batch samples used for training the model-free offline RL algorithm come from and . To be specific, we define the ratio of data that comes from (real data) as . Suppose we use a mini-batch size of for training the algorithm. Then for each optimization step, we sample samples from , and sample samples from . We pick the optimal real data ratio among using grid search.
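The mini-batch composition in step (3) can be illustrated with a short sketch. It assumes both datasets are stored as NumPy arrays with one transition per row and that the real-data share of the batch is rounded to the nearest integer; `sample_mixed_batch` and its argument names are hypothetical.

```python
import numpy as np


def sample_mixed_batch(real_data, model_data, batch_size, real_ratio, rng):
    """Draw a mini-batch with a fixed share of real vs. imagined transitions."""
    n_real = int(round(real_ratio * batch_size))
    n_model = batch_size - n_real
    # Sample row indices with replacement from each dataset.
    real_idx = rng.integers(0, len(real_data), size=n_real)
    model_idx = rng.integers(0, len(model_data), size=n_model)
    return np.concatenate([real_data[real_idx], model_data[model_idx]], axis=0)
```

A real data ratio of 1 recovers training on the original dataset only, while smaller ratios give more weight to the imagined transitions.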

Influence of . For the double check mechanism, we keep the top 20% of samples in a mini-batch on which the forward model and backward model disagree the least. As explained in Section 4.3, we do not set a threshold and admit samples whose forward-backward deviation falls below it, because choosing such a threshold requires human knowledge and is task-specific, and in real-world problems we cannot always have full knowledge about the system. For better flexibility and generality, we choose to keep top % samples. Note that if we use , then the influence of data augmentation will be excluded. If we use , then CABI will degenerate into Bidirectional Model-based Offline Imagination (BOMI), where there is no double check procedure. Intuitively, if is small, few samples are left, which may weaken the benefit of data augmentation with a model. If is large, some poorly imagined transitions may be included in the model buffer. We simply set by default, and use it throughout our experiments on the Adroit dataset and MuJoCo dataset. We conduct experiments on two types of MuJoCo datasets, *random* and *medium*, with varied in over 5 different random seeds. The results are shown in Table 6, where we observe a performance drop for both large and small . We hence set for all of our experiments.

Task Name | | | | | |
---|---|---|---|---|---|
halfcheetah-random | 10.2 | 14.8 | 15.1 | 14.0 | 11.4 |
hopper-random | 11.0 | 10.3 | 11.9 | 10.6 | 9.3 |
walker2d-random | 1.4 | 4.9 | 6.4 | 5.8 | 4.3 |
halfcheetah-medium | 42.8 | 44.6 | 45.1 | 44.4 | 44.3 |
hopper-medium | 99.5 | 99.7 | 100.4 | 32.3 | 3.1 |
walker2d-medium | 79.7 | 78.7 | 82.0 | 79.8 | 78.6 |
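The top-percentage selection rule can be sketched as below. This is an illustration under assumptions rather than the paper's Algorithm 2: the deviation is taken to be the Euclidean distance between each real state and the state reconstructed by the opposite-direction model, and `double_check_filter` is a hypothetical helper name.

```python
import numpy as np


def double_check_filter(real_states, reconstructed_states, keep_fraction=0.2):
    """Keep the fraction of a mini-batch with the smallest state deviation."""
    # Assumed deviation measure: per-sample Euclidean distance.
    deviation = np.linalg.norm(real_states - reconstructed_states, axis=1)
    num_keep = int(round(keep_fraction * len(deviation)))
    keep_idx = np.argsort(deviation)[:num_keep]
    return keep_idx, deviation
```

Because the rule ranks samples within a mini-batch, it needs no task-specific threshold. A `keep_fraction` of 1.0 admits every imagined sample, which corresponds to bidirectional imagination without the double check, and a `keep_fraction` of 0.0 admits none.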

Computation time and compute infrastructure. The computation time for CABI ranges from 4 to 14 hours on the MuJoCo and Adroit tasks. The model training time differs across dataset types, e.g., it takes much less time to train our bidirectional models and rollout policies on *medium-replay* and *human* datasets (about 40 minutes), while it takes comparatively longer to train on other types of datasets (about 2-6 hours). TD3_BC takes about 3 hours to run on all MuJoCo datasets, and BCQ takes about 6-8 hours to train on the Adroit tasks. We run both CABI+BCQ and CABI+TD3_BC for timesteps. We additionally run IQL [kostrikov2022offline], and the results are presented in Appendix G. IQL takes about 3-7 hours to run on all tasks with timesteps. We give detailed compute infrastructure in Appendix F.

Discussion on CABI and ROMI [Wang2021OfflineRL]. A recent work, Reverse Offline Model-based Imagination (ROMI) [Wang2021OfflineRL], explores data augmentation in offline RL by training a reverse dynamics model. It is worth noting that we do not directly compare with ROMI+BCQ, as there are many secondary components in the codebase of ROMI (https://github.com/wenzhe-li/romi), e.g., prioritized experience replay, modified state information, and different rollout policies for different domains. CABI represents the rollout policy using only a conditional variational autoencoder (CVAE). Also, ROMI assumes that the termination functions are known, whereas CABI does not include any prior knowledge about termination conditions, even on simple MuJoCo tasks. That generally follows the key claims in the recent work of [Fujimoto2021AMA]. For a fair comparison, we disable the forward model as well as the double check mechanism in CABI to get our reverse model. As for the comparison with forward imagination, we disable the double check mechanism and the reverse dynamics part of CABI to get our pure forward model.

Other implementation details. On the Adroit tasks, we combine CABI with BCQ, and compare against recent state-of-the-art methods, including vanilla BCQ [Fujimoto2019OffPolicyDR], UWAC [Wu2021UncertaintyWA], CQL [Kumar2020ConservativeQF], MOPO [Yu2020MOPOMO], and COMBO [Yu2021COMBOCO]. On the MuJoCo domain, we combine CABI with TD3_BC [Fujimoto2021AMA], and compare with baseline methods such as FisherBRC [Kostrikov2021OfflineRL], UWAC, CQL, BEAR [Kumar2019StabilizingOQ], and MOPO. Note that we omit some baseline methods, such as AWAC [Nair2020AcceleratingOR] and BRAC [Wu2019BehaviorRO], in the MuJoCo domain, as they do not obtain good enough performance for comparison. We run UWAC with its official codebase (https://github.com/apple/ml-uwac), and likewise MOPO (https://github.com/tianheyu927/mopo). We run MOPO on the Adroit tasks ourselves as those results are not reported in the original paper, and we take the results of MOPO on the MuJoCo datasets from [Wu2021UncertaintyWA] directly. We re-run UWAC on the Adroit domain and MuJoCo domain because, unfortunately, we cannot reproduce the results reported in its original paper. All baseline methods are run for timesteps over 5 different random seeds.

### C.2 Hyperparameters

In this subsection, we give the detailed hyperparameter setup for our experiments in Table 7. We keep the top 20% of samples in a mini-batch for all tasks. For simplicity, we use an identical rollout length for the forward model and backward model. For all of the MuJoCo tasks and most of the Adroit tasks, the rollout length for both the forward model and backward model is set to 3, which yields a total horizon of 6. On datasets where the model disagreement is comparatively large for long horizons (e.g., *pen-expert*, see Table 9), we set the forward and backward horizon to 1, which leads to a total horizon of 2. We use a forward and backward horizon of 5 for *hammer-human*, as we experimentally find that it performs better. On other tasks, the forward and backward horizon is set to 3 by default. Note that the model disagreement on MuJoCo datasets is smaller than 0.1 for all horizons, and the trained forward and backward models fit the MuJoCo datasets well. We therefore adopt a forward and backward horizon of 3 for all of these tasks.

Domain | Dataset Type | Task Name | ForH | BackH | Real Data Ratio |
---|---|---|---|---|---|
Adroit | human | pen | 1 | 1 | 0.7 (BCQ), 0.5 (IQL) |
Adroit | human | door | 3 | 3 | 0.7 (BCQ), 0.5 (IQL) |
Adroit | human | relocate | 3 | 3 | 0.9 (BCQ), 0.5 (IQL) |
Adroit | human | hammer | 5 | 5 | 0.5 (BCQ), 0.7 (IQL) |
Adroit | cloned | pen | 1 | 1 | 0.5 (BCQ), 0.5 (IQL) |
Adroit | cloned | door | 3 | 3 | 0.5 (BCQ), 0.7 (IQL) |
Adroit | cloned | relocate | 3 | 3 | 0.3 (BCQ), 0.5 (IQL) |
Adroit | cloned | hammer | 3 | 3 | 0.5 (BCQ), 0.7 (IQL) |
Adroit | expert | pen | 1 | 1 | 0.7 (BCQ), 0.9 (IQL) |
Adroit | expert | door | 3 | 3 | 0.7 (BCQ), 0.9 (IQL) |
Adroit | expert | relocate | 3 | 3 | 0.9 (BCQ), 0.7 (IQL) |
Adroit | expert | hammer | 1 | 1 | 0.9 (BCQ), 0.5 (IQL) |
MuJoCo | random | halfcheetah | 3 | 3 | 0.7 (TD3_BC), 0.7 (IQL) |
MuJoCo | random | hopper | 3 | 3 | 0.1 (TD3_BC), 0.7 (IQL) |
MuJoCo | random | walker2d | 3 | 3 | 0.1 (TD3_BC), 0.7 (IQL) |
MuJoCo | medium | halfcheetah | 3 | 3 | 0.7 (TD3_BC), 0.7 (IQL) |
MuJoCo | medium | hopper | 3 | 3 | 0.9 (TD3_BC), 0.7 (IQL) |
MuJoCo | medium | walker2d | 3 | 3 | 0.7 (TD3_BC), 0.9 (IQL) |
MuJoCo | medium-replay | halfcheetah | 3 | 3 | 0.5 (TD3_BC), 0.7 (IQL) |
MuJoCo | medium-replay | hopper | 3 | 3 | 0.7 (TD3_BC), 0.7 (IQL) |
MuJoCo | medium-replay | walker2d | 3 | 3 | 0.5 (TD3_BC), 0.9 (IQL) |
MuJoCo | medium-expert | halfcheetah | 3 | 3 | 0.7 (TD3_BC), 0.9 (IQL) |
MuJoCo | medium-expert | hopper | 3 | 3 | 0.9 (TD3_BC), 0.9 (IQL) |
MuJoCo | medium-expert | walker2d | 3 | 3 | 0.7 (TD3_BC), 0.9 (IQL) |
MuJoCo | expert | halfcheetah | 3 | 3 | 0.7 (TD3_BC), 0.9 (IQL) |
MuJoCo | expert | hopper | 3 | 3 | 0.9 (TD3_BC), 0.9 (IQL) |
MuJoCo | expert | walker2d | 3 | 3 | 0.7 (TD3_BC), 0.9 (IQL) |

We search for the best over . We find that the real data ratio and are generally effective for CABI. The best ratio strongly depends on the dataset and may need to be tuned manually. For example, the *random* datasets in the MuJoCo domain and the *cloned* datasets in the Adroit domain are inherently poor for training, and a small is therefore needed. For the *expert* or *medium* datasets, in contrast, a comparatively large is better.

## Appendix D Model Prediction Error and Model Disagreement

In this section, we explore (1) whether CABI can generate more trustworthy transitions in complex environments, and (2) how the model disagreement between the forward and backward models in CABI behaves under different horizons, to check whether the models disagree with each other more as the horizon increases. To begin with, we define the one-step model prediction error to check whether CABI admits more accurate transitions.

###### Definition D.1 (Model Prediction Error).

Given the static offline dataset , we define one-step model prediction error for forward model and reverse model as:

and generally capture the accuracy of the trained dynamics models, i.e., smaller and indicate better forward and backward dynamics model fitting.
Intuitively, the one-step model prediction error of admitted samples in CABI should be smaller than that of the forward dynamics model or reverse dynamics model alone, as only transitions that the forward model and backward model are both confident about are admitted.
We verify this by comparing the one-step model error of the forward model, backward model, and CABI, where we keep the top 20% of imagined samples for CABI. The results are presented in Table 8, where we observe that CABI leads to a significant error drop for both forward and reverse models on all of the tasks. For example, the forward error in *door-cloned* drops from 24.7 to 0.05 and the backward error drops from 27.7 to 0.01, which reveals that CABI can select reliable and conservative imaginations that fit the dataset well for training.

Task Name | Unidirectional (forward error) | Unidirectional (backward error) | Bidirectional (CABI) (forward error) | Bidirectional (CABI) (backward error) |
---|---|---|---|---|
pen-cloned | 837.5 | 777.4 | 751.5 | 603.0 |
pen-human | 195.0 | 177.8 | 107.5 | 97.8 |
pen-expert | 169.01 | 179.8 | 143.58 | 149.8 |
door-cloned | 24.7 | 27.7 | 0.05 | 0.01 |
door-human | 18.2 | 20.2 | 4.4 | 6.0 |
door-expert | 4.3 | 10.5 | 1.8 | 6.3 |
relocate-cloned | 351.9 | 1271.4 | 0.0 | 0.9 |
relocate-human | 229.5 | 267.4 | 178.6 | 205.1 |
relocate-expert | 201.5 | 48.3 | 167.5 | 37.9 |
hammer-cloned | 1330.8 | 1984.3 | 72.3 | 1602.2 |
hammer-human | 577.9 | 596.4 | 480.9 | 477.8 |
hammer-expert | 601.1 | 561.4 | 557.2 | 503.4 |

We then define the model disagreement of forward model and backward model in the following.

###### Definition D.2 (Bidirectional Model Disagreement).

For a sampled current state and reward from a given static offline dataset , a series of forward states and reward signals can be generated by utilizing the forward model, . Denote the imagined backward state and reward based on as and , respectively, . Then the forward model disagreement is defined as:

(9) |

Similarly, for a sampled next state and reward from the offline dataset, an imaginary trajectory containing the backward states and rewards , , can be generated with the aid of the backward dynamics model. For each imagined state in , its previous state and reward are generated by the forward model, . Then the backward model disagreement is defined as:

(10) |

Remark: The above definition generally captures the disagreement between the forward model and the backward model. Note that the model disagreement is different from the model prediction error defined above, even if the rollout length is set to 1. The model prediction error measures how well the forward or backward model fits the transition data, while the model disagreement measures how much the forward model and backward model disagree on the transition. We take the forward setting as an example. The forward model prediction error is the deviation of the forward imagined state and reward from the real next state and reward signal, while the forward model disagreement is the deviation of the real *current state* and scalar reward from the backward imagined *current state* and reward based on the forward imagination.
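The distinction drawn in the remark can be made concrete with a toy one-step example. The sketch below is only illustrative: it considers states alone (rewards are omitted), uses a mean squared deviation as a stand-in for the measures in Definitions D.1 and D.2, and the function names and toy dynamics are assumptions.

```python
import numpy as np


def forward_prediction_error(forward_model, s, a, s_next):
    """Deviation of the forward-imagined next state from the real next state."""
    return np.mean(np.sum((forward_model(s, a) - s_next) ** 2, axis=1))


def forward_disagreement(forward_model, backward_model, s, a):
    """Deviation of the real current state from the state recovered by the
    backward model applied to the forward imagination."""
    s_next_hat = forward_model(s, a)
    s_hat = backward_model(s_next_hat, a)
    return np.mean(np.sum((s_hat - s) ** 2, axis=1))


rng = np.random.default_rng(0)
s = rng.uniform(-1.0, 1.0, size=(64, 2))
a = rng.uniform(-1.0, 1.0, size=(64, 2))
s_next = s + a  # toy ground-truth dynamics


def biased_forward(s, a):
    # Forward model with a constant bias in every coordinate.
    return s + a + 0.1


def matching_backward(s_next, a):
    # Backward model that exactly inverts the biased forward model.
    return s_next - a - 0.1


err = forward_prediction_error(biased_forward, s, a, s_next)
dis = forward_disagreement(biased_forward, matching_backward, s, a)
```

In this toy case the forward model has a nonzero prediction error, yet the two models agree perfectly because they share the same bias, so the two quantities measure different things.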

Table 9 compares the model disagreement of CABI against that of CABI without the double check mechanism, i.e., BOMI (bidirectional modeling without double check), under different horizons. We perform experiments on 12 Adroit tasks and the sampled mini-batch size is set to be . As demonstrated in the table, the model disagreement of CABI is significantly smaller than that of BOMI under different rollout steps. It is worth noting that the model disagreement for both CABI and BOMI depends on the rollout length: it is generally small when performing one-step model rollout, and increases if longer-horizon imaginations are generated (some datasets like *door-human* are exceptions). We observe that the model disagreement in CABI is much more controllable than in BOMI, e.g., on some expert datasets.

Task Name | CABI | BOMI | CABI | BOMI | CABI | BOMI | CABI | BOMI | CABI | BOMI | CABI | BOMI |
---|---|---|---|---|---|---|---|---|---|---|---|---|
pen-human | 0.23 | 1829.21 | 20.63 | 1608.66 | 21.24 | 1608.94 | 0.23 | 1857.14 | 19.93 | 1693.70 | 19.25 | 1686.21 |
door-human | 0.53 | 36.03 | 0.21 | 28.06 | 0.21 | 29.04 | 0.53 | 36.19 | 0.40 | 29.38 | 0.38 | 30.07 |
relocate-human | 0.00 | 268.37 | 5.32 | 237.03 | 5.45 | 236.64 | 0.00 | 265.34 | 4.68 | 193.56 | 4.72 | 195.08 |
hammer-human | 0.67 | 401.58 | 5.56 | 261.43 | 5.53 | 259.86 | 0.69 | 402.67 | 5.63 | 260.93 | 5.43 | 258.86 |
pen-cloned | 159.58 | 12048.06 | 373.10 | 22506.36 | 359.56 | 22440.26 | 136.89 | 11624.98 | 359.91 | 22350.11 | 331.18 | 21939.83 |
door-cloned | 0.00 | 46.35 | 0.00 | 63.75 | 0.00 | 62.89 | 0.00 | 30.68 | 0.00 | 61.03 | 0.00 | 62.96 |
relocate-cloned | 0.00 | 745.77 | 0.00 | 938.51 | 0.00 | 929.95 | 0.00 | 269.66 | 0.00 | 909.64 | 0.00 | 889.79 |
hammer-cloned | 0.0 | 3791.10 | 0.01 | 4509.45 | 0.01 | 4395.46 | 0.0 | 1021.85 | 0.02 | 4443.90 | 0.03 | 4447.14 |
pen-expert | 0.42 | 1965.76 | 21.92 | 9642.54 | 26.05 | 9691.13 | 0.41 | 1953.54 | 56.12 | 9569.12 | 52.69 | 9399.45 |
door-expert | 0.95 | 257.49 | 0.94 | 261.84 | 1.04 | 263.52 | 1.01 | 258.51 | 0.79 | 276.60 | 0.69 | 276.12 |
relocate-expert | 0.05 | 649.59 | 1.47 | 770.99 | 2.18 | 784.79 | 0.06 | 652.63 | 0.06 | 631.83 | 0.05 | 631.74 |
hammer-expert | 1.03 | 6135.13 | 58.07 | 82422.41 | 60.78 | 82633.47 | 1.02 | 6198.06 | 11.24 | 92882.87 | 10.70 | 92102.51 |

The double check mechanism we introduced in the main text selects trustworthy synthetic samples based on the deviation between states, i.e., transition samples with small state deviation are kept in the model buffer. Alternatively, we could select transition samples by model disagreement, i.e., keep transition samples with small model disagreement. We experimentally find that evaluating the deviation between states yields almost the same performance as evaluating the model disagreement under an identical hyperparameter setup. We therefore select transitions according to the deviation between states alone, as shown in Algorithm 2, to save both space and time during the data generation process of CABI.

## Appendix E Omitted Background for VAE

In this section, we provide a brief introduction to the variational autoencoder (VAE) [Kingma2014AutoEncodingVB]. Given a dataset , the VAE is trained to generate samples that come from the same distribution as the data points. That is to say, the goal of a VAE is to maximize , where is the parameter of the approximate maximum-likelihood (ML) or maximum a posteriori (MAP) estimation. To reach this goal, a latent variable sampled from its posterior distribution is introduced, and we model a decoder parameterized by . However, directly optimizing the marginal likelihood is intractable. Instead, VAE approximates the true posterior via training an encoder , and we resort to optimizing the evidence lower bound (ELBO) on the log-likelihood of the data as shown in (11).

(11) |

The first term in the right-hand-side of (11) denotes the reconstruction loss, where is sampled from . The second term represents the KL-divergence between the learned encoder of and its true prior. The encoder is usually set to be a multivariate Gaussian distribution with mean and variance . The prior of the latent variable is set to be a standard multivariate Gaussian distribution. Optimizing the lower bound in (11) enables the trained model to generate samples similar to the data distribution. After the VAE is well trained, we sample from the encoder and pass it through the decoder to obtain samples.

In this work, we use the conditional variational autoencoder (CVAE) [Fujimoto2019OffPolicyDR] to model the behavior policy in the dataset. CVAE is a variant of the vanilla VAE, which aims to model . Similar to the original ELBO of VAE, CVAE optimizes the conditional lower bound as shown in (12).

(12) |
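For reference, the standard forms of these two bounds are given below. The notation is the common VAE notation and is assumed here rather than taken from this paper: $x$ is a data point, $z$ the latent variable, $q_\phi$ the encoder, $p_\theta$ the decoder, and $p(z)$ the standard Gaussian prior. The conditional bound is written for a behavior policy with state $s$ and action $a$, and assumes a state-independent prior as used by BCQ.

$$
\log p_\theta(x) \;\ge\; \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
$$

$$
\log p_\theta(a \mid s) \;\ge\; \mathbb{E}_{z \sim q_\phi(z \mid s, a)}\big[\log p_\theta(a \mid s, z)\big] \;-\; D_{\mathrm{KL}}\big(q_\phi(z \mid s, a)\,\|\,p(z)\big)
$$

In both cases the first term is the reconstruction term and the second is the KL-divergence between the encoder and the prior, matching the description of the two terms given for (11).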

## Appendix F Compute Infrastructure

In Table 10, we list the compute infrastructure that we use to run all of the baseline algorithms and experiments.

CPU | GPU | Memory |
---|---|---|
AMD EPYC 7452 | RTX3090 ×8 | 288GB |

## Appendix G Experimental Results of CABI+IQL

In this section, we additionally incorporate CABI with a recently proposed offline RL method, IQL [kostrikov2022offline]. IQL learns without querying OOD samples. Such a learning paradigm ensures that the whole learning process is conducted under the support of the dataset, and a safe policy can be learned. However, as we explained in the main text, the datasets often cannot contain all possible transitions. Hence, the generalization capability of IQL is actually limited. With the aid of CABI, such concern can be mitigated to some extent. We conduct experiments on 12 Adroit datasets and 15 MuJoCo datasets over 5 different random seeds. For IQL, we use its official codebase (https://github.com/ikostrikov/implicit_q_learning) to run on all 27 datasets over 5 random seeds with the hyperparameters suggested by the authors. We incorporate CABI with IQL and run CABI+IQL over 5 different random seeds. The forward and backward horizons for CABI+IQL are identical to CABI+BCQ on Adroit tasks and CABI+TD3_BC on MuJoCo datasets. We summarize the results in Table 11 and Table 12.

As shown, CABI boosts the performance of IQL on all 27 datasets of Adroit and MuJoCo. CABI+IQL outperforms baseline methods on 10 out of 12 Adroit datasets. On MuJoCo datasets, however, CABI+IQL surpasses baseline methods on only 5 out of 15 datasets, because the base method IQL itself performs poorly on "-v0" datasets. Nevertheless, CABI+IQL has a total score of 604.1 on Adroit, surpassing the total score of 562.5 of vanilla IQL. CABI+IQL achieves a total score of 909.3 on MuJoCo datasets, while vanilla IQL only has a total score of 860.7. We want to emphasize that we do not aim to beat the most recent strong baseline methods in this paper. The key point we want to convey is that conservative data augmentation with CABI is effective and beneficial for improving the performance of the base offline RL algorithms. The empirical results serve as evidence to validate this claim.

Task Name | CABI+IQL | IQL | UWAC | BEAR | BC | AWR | CQL | MOPO | COMBO |
---|---|---|---|---|---|---|---|---|---|

pen-cloned | 42.2 |