Objective Mismatch in Model-based Reinforcement Learning

02/11/2020 ∙ by Nathan Lambert, et al. ∙ Facebook ∙ UC Berkeley

Model-based reinforcement learning (MBRL) has been shown to be a powerful framework for data-efficient learning of control in continuous tasks. Recent work in MBRL has mostly focused on using more advanced function approximators and planning schemes, with little development of the general framework. In this paper, we identify a fundamental issue of the standard MBRL framework – what we call the objective mismatch issue. Objective mismatch arises when one objective is optimized in the hope that a second, often uncorrelated, metric will also be optimized. In the context of MBRL, we characterize the objective mismatch between training the forward dynamics model w.r.t. the likelihood of the one-step-ahead prediction, and the overall goal of improving performance on a downstream control task. For example, this issue can emerge with the realization that dynamics models effective for a specific task do not necessarily need to be globally accurate, and, vice versa, globally accurate models might not be sufficiently accurate locally to obtain good control performance on a specific task. In our experiments, we study this objective mismatch issue and demonstrate that the likelihood of one-step-ahead predictions is not always correlated with control performance. This observation highlights a critical limitation in the MBRL framework which will require further research to be fully understood and addressed. We propose an initial method to mitigate the mismatch issue by re-weighting dynamics model training. Building on it, we conclude with a discussion about other potential directions of research for addressing this issue.


1 Introduction

Model-based reinforcement learning (MBRL) is a popular approach for learning to control nonlinear systems that cannot be expressed analytically (bertsekas1995dynamic; sutton2018reinforcement; deisenroth2011pilco; williams2017information). MBRL techniques achieve state-of-the-art performance for continuous-control problems with access to a limited number of trials (chua2018deep; wang2019exploring) and for controlling systems given only visual observations, without access to the underlying system state (hafner2018learning; zhang2018solar). MBRL approaches typically learn a forward dynamics model that predicts how the dynamical system will evolve when a set of control signals is applied. This model is classically fit by maximizing the likelihood of a set of trajectories collected on the real system, and is then used as part of a control algorithm executed on the system (e.g., model-predictive control).

In this paper, we highlight a fundamental problem in the MBRL learning scheme: the objective mismatch issue. The learning of the forward dynamics model is decoupled from the subsequent controller that it induces through the optimization of two different objective functions – the prediction accuracy (or loss) of the single- or multi-step look-ahead prediction for the dynamics model, and the task performance (i.e., reward) for the policy optimization. While the use of the negative log-likelihood (NLL) for system identification is a historically accepted objective, it results in optimizing an objective that does not necessarily correlate with control performance. The contributions of this paper are to: 1) identify and formalize the problem of objective mismatch in MBRL; 2) examine the signs and the effects of objective mismatch on simulated control tasks; 3) propose an initial mechanism to mitigate objective mismatch; 4) discuss the impact of objective mismatch and outline future directions to address this problem.

2 Model-based Reinforcement Learning

We now outline the MBRL formulation used in the paper. At time $t$, we denote the state $s_t$, the actions $a_t$, and the reward $r_t$. We say that the MBRL agent acts in an environment governed by a state transition distribution $p(s_{t+1} \mid s_t, a_t)$. We denote a parametric model approximating this distribution with $f_\theta(s_{t+1} \mid s_t, a_t)$. MBRL follows the approach of an agent acting in its environment, learning a model of said environment, and then leveraging the model to act. While iterating over parametric control policies, the agent collects measurements of state, action, next-state and forms a dataset $D = \{(s_n, a_n, s_{n+1})\}_{n=1}^{N}$. With the dynamics data $D$, the agent learns the environment in the form of a neural network forward dynamics model, learning an approximate dynamics $f_\theta$. This dynamics model is leveraged by a controller that takes in the current state $s_t$ and returns an action sequence $a_{t:t+\tau}$ maximizing the expected reward $\mathbb{E}\left[\sum_{i=t}^{t+\tau} r(s_i, a_i)\right]$, where $\tau$ is the predictive horizon and the state transitions $\{s_{t+1}, \dots, s_{t+\tau}\}$ are induced by the model $f_\theta$. In our paper, we primarily use probabilistic networks designed to minimize the NLL of the predicted parametric distribution (a Gaussian), denoted P, or ensembles of probabilistic networks, denoted PE, and compare to deterministic networks minimizing the mean squared error (MSE), denoted D or DE. Unless otherwise stated we use the models as in PETS (chua2018deep) with an expectation-based trajectory planner and a cross-entropy-method (CEM) optimizer.
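To make this loop concrete, below is a minimal sketch of the planning step under a learned one-step model, assuming a simplified expectation-based CEM planner over action sequences. The function names `model` and `reward_fn`, and the toy dynamics in the usage example, are illustrative stand-ins rather than the PETS implementation; the CEM hyper-parameters loosely mirror Table 1.

```python
import numpy as np

def cem_plan(model, reward_fn, s0, horizon=25, pop=400, elites=40, iters=5, act_dim=1):
    """Plan an action sequence with the cross-entropy method (CEM) under a learned model.
    `model(s, a) -> s_next` and `reward_fn(s, a) -> float` stand in for the learned
    dynamics and the task reward."""
    mean = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        candidates = mean + std * np.random.randn(pop, horizon, act_dim)
        returns = np.zeros(pop)
        for k in range(pop):
            s = s0
            for t in range(horizon):
                a = candidates[k, t]
                returns[k] += reward_fn(s, a)
                s = model(s, a)  # expectation-based propagation through the model
        # Re-fit the sampling distribution to the elite action sequences.
        elite = candidates[np.argsort(returns)[-elites:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean[0]  # MPC: execute the first action, then re-plan at the next step

# Toy usage with a stand-in linear "model" and a quadratic cost as negative reward.
toy_model = lambda s, a: 0.9 * s + 0.1 * np.concatenate([s[1:], a])
toy_reward = lambda s, a: -float(s @ s)
first_action = cem_plan(toy_model, toy_reward, s0=np.ones(4))
```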

3 Objective Mismatch and its Consequences

Figure 1: Objective mismatch in MBRL arises when the dynamics model is trained to maximize the likelihood of the data, but is then used by a policy that maximizes a reward signal not considered during training.

The Origin of Objective Mismatch: The Subtle Differences between MBRL and System Identification

Many ideas and concepts in model-based RL are rooted in the field of optimal control and system identification (bertsekas1995dynamic; zhou1996robust; kirk2012optimal; bryson2018applied; sutton2018reinforcement). In system identification (SI), the main idea is to use a two-step process where we first generate (optimal) elicitation trajectories to fit a dynamics model (typically analytical), and subsequently we apply this model to a specific task. This particular scheme has multiple assumptions:

Figure 2: Sketches of state-action spaces. (Left) In system identification, the elicitation trajectories are designed off-line to cover the entire state-action space. (Right) In MBRL instead, the data collected during learning is often concentrated in trajectories towards the goal, with other parts of the state-action space being completely unexplored (grey area).

1) the elicitation trajectories collected cover the entire state-action space; 2) the presence of a virtually infinite amount of data; 3) the global and generalizable nature of the model resulting from the SI process. With these assumptions, the theme of system identification is effectively to collect a large amount of data covering the whole state space to create a sufficiently accurate, global model that we can deploy on any desired task and still obtain good performance.

When adopting the idea of learning the dynamics model used in optimal control for MBRL, it is important to consider whether these assumptions still hold. The assumption of virtually infinite data is visibly in tension with the explicit goal of MBRL, which is to reduce the number of interactions with the environment by being "smart" about the sampling of new trajectories. In fact, in MBRL the offline data collection performed via elicitation trajectories is largely replaced by on-policy sampling in order to explicitly reduce the need to collect large amounts of data (chua2018deep). Moreover, in the MBRL setting the data will not usually cover the entire state-action space, since they are generated by optimizing one task. In conjunction with the use of non-parametric models, this results in learned models that are strongly biased towards capturing the distribution of the locally accurate, task-specific data. Nonetheless, this is not an immediate issue since the MBRL setting rarely tests for the generalization capabilities of the learned dynamics.

In practice, we can now see how the assumptions and goals of system identification are in contrast with those of MBRL. Understanding these differences and their downstream effects on the algorithmic approach is crucial to designing new families of MBRL algorithms.

Objective Mismatch

During the MBRL process of iteratively learning a controller, the reward signal from the environment is diluted by the training of a forward dynamics model with an independent metric, as shown in Fig. 1. In our experiments, we highlight that the minimization of the network training cost does not hold a strong correlation to the maximization of episode reward. As environments become increasingly complex and high-dimensional, the assumptions on the collected data distribution become weaker and over-fitting to different data poses an increased risk.

Formally, the problem of objective mismatch appears as two de-coupled optimization problems repeated over many cycles of learning, shown in Eq. (1a,b), which can come at the cost of the final reward. This loop becomes increasingly difficult to analyze as the dataset used for model training changes with each experimental trial – a step that is needed to include new data from previously unexplored states. In this paper we characterize the problems introduced by the interaction of these two optimization problems, but avoid considering the interactions added by the changes in the data distribution during the learning process, as this would significantly increase the complexity of the analysis. In addition, we discuss potential solutions, but do not make claims about the best way to resolve the mismatch, which is left for future work.

$$\theta^{*} = \arg\max_{\theta} \sum_{(s_n, a_n, s_{n+1}) \in D} \log f_{\theta}(s_{n+1} \mid s_n, a_n) \tag{1a}$$
$$a^{*}_{t:t+\tau} = \arg\max_{a_{t:t+\tau}} \mathbb{E}\left[ \sum_{i=t}^{t+\tau} r(s_i, a_i) \right], \quad s_{i+1} \sim f_{\theta^{*}}(\cdot \mid s_i, a_i) \tag{1b}$$
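As a small illustration of this decoupling, the sketch below shows the model-side objective of Eq. (1a): a Gaussian dynamics head trained purely on one-step NLL. The class name, layer sizes, and activation are assumptions standing in for a PETS-style probabilistic network; the point is that the task reward of Eq. (1b) never enters this loss.

```python
import torch
import torch.nn as nn

class GaussianDynamics(nn.Module):
    """One-step model f_theta(s' | s, a) predicting a diagonal Gaussian over next states.
    A minimal stand-in for a probabilistic ('P') dynamics network."""
    def __init__(self, s_dim, a_dim, hidden=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(s_dim + a_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, 2 * s_dim),  # mean and log-variance of the next state
        )

    def forward(self, s, a):
        mu, log_var = self.net(torch.cat([s, a], dim=-1)).chunk(2, dim=-1)
        return mu, log_var

def one_step_nll(model, s, a, s_next):
    # Objective (1a): maximize the likelihood of observed transitions; the task
    # reward of objective (1b) is never seen by this loss.
    mu, log_var = model(s, a)
    return 0.5 * ((s_next - mu).pow(2) / log_var.exp() + log_var).mean()
```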

4 Identifying Objective Mismatch

We now experimentally study the issue of objective mismatch in MBRL to answer the following questions: 1) Does the distribution of models obtained from running an MBRL algorithm show a strong correlation between NLL and reward? 2) Are there signs of sub-optimality in the dynamics model training process that could be limiting performance? 3) What model differences are reflected in reward but not in NLL?

Experimental Setting

In our experiments, we use two popular RL benchmark tasks: cartpole and half cheetah. For more details on these tasks, model parameters, and control properties, see chua2018deep. For our experiments, we aggregated cartpole models and half cheetah models from PETS runs. With a large set of on-policy dynamics models, we then used a set of 3 different datasets to evaluate how different assumptions made in MBRL affect performance. We start with expert datasets (cartpole: 2400 points, half cheetah: 3000 points; see Table 2) to test if on-policy performance is linked to having adequately explored the environment. As a baseline, we compare the expert data to datasets collected on-policy by the PETS algorithm or by sampling tuples representative of the entire state space. More details, additional experiments, and an expanded manuscript can be found on the website https://sites.google.com/view/mbrl-mismatch.


Figure 3: The distribution of dynamics models from our PETS experiments plotted in the LL-reward space on three datasets (correlation coefficients reported per panel). Each reward point is the mean over 10 trials. There is a trend of high reward to 'good' LL that breaks down as the datasets contain more of the state-space than only expert trajectories.

4.1 Exploration of Model Loss vs Episode Reward Space

In the MBRL framework, it is often assumed that there is a clear correlation between model accuracy and policy performance; we challenge this assumption even in simple domains and on specific on-policy data. The relationship between model accuracy and reward on data representing the full environment space shows no clear trend in Fig. 3c,f. The simplicity of the cartpole environment results in quick learning and a concentration of networks around peak reward. The distribution of rewards versus log-likelihood (LL) shown in Fig. 3a-c shows substantial variance and points of disagreement overshadowing a visual trend of increased reward with improved likelihood. The bimodal distribution on the half cheetah expert dataset, shown most clearly in Fig. 3d, relates to an unrecoverable-state failure mode in early half cheetah trials. The contrast between Fig. 3e and Fig. 3d,f shows a considerable difference in the transitions represented within the datasets.

These results confirm that objective mismatch is not so strong as to break all correlation between validation loss and episode reward. Rather, the mismatch likely acts as a ceiling on performance by reducing the correlation between model accuracy and evaluation reward, and it specifically conflicts with any guarantee of improvement that one would expect when training a 'better' model.
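The correlation analysis above can be reproduced with a short script, assuming arrays of per-model validation log-likelihoods and mean episode rewards (hypothetical inputs; both Pearson and Spearman coefficients are reported since the relationship need not be linear):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def ll_reward_correlation(log_likelihoods, rewards):
    """Correlate per-model validation LL with mean episode reward (one entry per model)."""
    log_likelihoods = np.asarray(log_likelihoods, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    r_pearson, _ = pearsonr(log_likelihoods, rewards)
    r_spearman, _ = spearmanr(log_likelihoods, rewards)
    return r_pearson, r_spearman
```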

4.2 Model Loss vs Episode Reward During Training

Figure 4: Reward versus training epoch when re-evaluating the controller with the dynamics model at each training epoch, for different types of dynamics models (panels: validation error and episode reward). Even for the simple cartpole environment, the deterministic models (D, DE) fail to achieve full performance, while the probabilistic models (P, PE) reach higher performance but eventually over-fit to the available data. The over-fitting of the model is further evaluated in Fig. 5.

Figure 5: The effect of the dataset choice on model training and on accuracy in different regions of the state-space (panels report validation error on the grid, policy, and expert data, plus episode reward). (Left) When training on the complete dataset, the model begins over-fitting to the on-policy data even before the performance of the controller drops. (Right) A model trained only on policy data does not accurately model the entire state-space.

This section explores how model training impacts performance at the per-epoch level. These experiments shed light on the impact of the strong dataset assumptions outlined in Sec. 3. As a dynamics model is trained, there are two key inflection points: the first is the training epoch where episode reward is maximized, and the second is when the error on the validation set is minimized. These experiments are focused on showing the disconnect between three practices in MBRL: a) the assumption that on-policy dynamics data can express large portions of the state-space, b) the idea that simple neural networks can satisfactorily capture complex dynamics, and c) the practice of treating model training as a simple optimization problem disconnected from reward.

For the grid cartpole dataset, Fig. 4 shows that the reward is maximized at a drastically different time than when the validation loss is minimized when re-evaluating the controller at each epoch of training. Fig. 5 highlights how well the trained models are able to represent datasets other than the ones they are trained on (in terms of additional validation errors). There is no indication that on-policy data will lead to a complete dynamics understanding because the grid validation data rapidly diverges in Fig. 5 (right). When training on grid data, the fact that the on-policy validation data diverges in Fig. 5 (left) before the reward decreases is encouraging, as objective mismatch may be preventable in simple tasks. Similar experiments on half cheetah are omitted because models for this environment are trained incrementally on aggregated data rather than fully on each dataset (chua2018deep).
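A minimal sketch of the per-epoch probing used in this section is shown below, assuming hypothetical callbacks for one epoch of training, the validation NLL, and a controller rollout; it records the two inflection points discussed above.

```python
def probe_training(model, train_one_epoch, validation_nll, run_controller, epochs=100):
    """Track the epoch of best episode reward and the epoch of best validation NLL,
    which Fig. 4 shows can be far apart."""
    nlls, rewards = [], []
    for _ in range(epochs):
        train_one_epoch(model)
        nlls.append(validation_nll(model))      # model-side objective
        rewards.append(run_controller(model))   # task-side objective (mean episode reward)
    best_reward_epoch = max(range(epochs), key=rewards.__getitem__)
    best_nll_epoch = min(range(epochs), key=nlls.__getitem__)
    return best_reward_epoch, best_nll_epoch
```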

4.3 Decoupling Model Loss from Controller Performance

In this section, we explore how differences in dynamics models – even if they have similar NLLs – are reflected in control policies, to show that an accurate dynamics model does not guarantee good control performance.

Figure 6: (left) Planned trajectories along the expert trajectory for the initial model and (right) for the adversarially generated model trained to lower the reward. The planned trajectories are qualitatively similar except for a peak where the control behavior deviates.
Figure 7: Convergence of the CMA-ES population’s best member.

Adversarial attack on model performance

We performed an adversarial attack (szegedy2013intriguing) on a neural network dynamics model so that it attains a good likelihood but poor reward. We start with a dynamics model that achieves high likelihood and high reward and tweak the parameters so that it continues achieving high likelihood but obtains a low reward. Specifically, we fine-tune the network's last layer with a zeroth-order optimizer, CMA-ES (the cumulative reward is not differentiable with respect to the model parameters), to lower the reward, with a large penalty if the model's validation likelihood drops. As a starting point for this experiment we sampled a dynamics model from the last trial of a PETS run on cartpole; this model achieves high reward and a good NLL on its on-policy training dataset. Using CMA-ES, we substantially reduced the on-policy reward of the model while slightly improving the NLL; the CMA-ES convergence over population iterations is shown in Fig. 7 and the difference between the two models is visualized in Fig. 6. Fine-tuning all model parameters would be even more likely to find sub-optimally performing controllers with low model loss, because the output layer contains only a small fraction of the total model parameters. This experiment shows that the model parameters that achieve a low model loss inhabit a broader space than the subset that also achieves high reward.
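A sketch of this attack is shown below, assuming the `cma` package and hypothetical callbacks that write the flattened last-layer weights into the model, roll out an episode, and compute the validation NLL; the penalty weight and NLL budget are illustrative, not the paper's exact values.

```python
import cma  # zeroth-order optimizer (pip install cma)

def attack_last_layer(x0, set_last_layer, episode_reward, validation_nll,
                      nll_budget, sigma0=0.05, iterations=50, penalty=1e4):
    """Search for last-layer weights that keep the validation NLL within `nll_budget`
    while driving the episode reward down."""
    def objective(x):
        set_last_layer(x)                       # write candidate weights into the model
        score = episode_reward()                # we *minimize* the controller's reward
        score += penalty * max(0.0, validation_nll() - nll_budget)  # keep likelihood good
        return score

    es = cma.CMAEvolutionStrategy(x0, sigma0)
    for _ in range(iterations):
        candidates = es.ask()
        es.tell(candidates, [objective(x) for x in candidates])
    return es.result.xbest
```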

5 Addressing Objective Mismatch During Model Training

Figure 8: Mean reward of PETS trials (N=100), with and without model re-weighting, on a log-grid of dynamics model training sets of varying sizes and sampling optimal-distance bounds. The re-weighting improves performance for smaller dataset sizes, but suffers from increased variance for larger set sizes. The performance of PETS declines when the dynamics model is trained on points too near to the optimal trajectory because the model lacks robustness when running online with the stochastic MPC.
Figure 9: We propose to re-weight the loss of the dynamics model w.r.t. the distance from the optimal trajectory.

Tweaking the dynamics model training can partially mitigate the problem of objective mismatch. Taking inspiration from imitation learning, we propose that the learning capacity of the model is most useful when accurately modeling the dynamics along trajectories that are relevant for the task at hand, while maintaining knowledge of nearby transitions for robustness under a stochastic controller. Intuitively, it is more important to accurately model the dynamics along the optimal trajectory than to model parts of the state-action space that might never be visited to solve the particular task. However, using the NLL as the model loss does not consider the density of the data, or its usefulness for the task. For this reason, we now propose a model loss aimed at alleviating this issue.

Given elements of the state-action space, we quantify the distance between any two (s, a) tuples. With this distance, we re-weight the loss of points further from the optimal policy to be lower, so that points on the optimal trajectory get a weight close to 1, and points at the edge of the grid dataset used in Sec. 4 get a weight close to 0. Using the expert dataset discussed in Sec. 4 as a distance baseline, we generated tuples of (s, a) by uniformly sampling across the state and action space of cartpole. We sorted this data by taking the minimum orthogonal distance from each of the points to the dataset from the optimal trajectory (i.e., an expert trajectory that achieved the maximum reward). To create different datasets that range from near-optimal to nearly uniform across the state space, we vary the distance bound and the number of points trained on. This simple form of re-weighting the neural network loss, with a sharp initial roll-off from the exponential, shown in Eq. (2a,b,c), demonstrated an improvement in sample efficiency for learning the cartpole task, as seen in Fig. 8. Unfortunately, this approach is impractical for many applications where the optimal trajectory is not known in advance. However, building on these encouraging results, future work could develop an iterative method to jointly estimate the optimal trajectory and re-weight samples in an online training method to address objective mismatch.

$$d_i = \min_{x^{*} \in \mathcal{T}^{*}} \lVert x_i - x^{*} \rVert_2 \tag{2a}$$
$$w_i = e^{-\beta d_i} \tag{2b}$$
$$\mathcal{L}_{w}(\theta) = \sum_i w_i \, \ell_i(\theta) \tag{2c}$$

where $x_i = (s_i, a_i)$, $\mathcal{T}^{*}$ is the set of tuples along the optimal trajectory, $\beta$ sets the sharpness of the exponential roll-off, and $\ell_i(\theta)$ is the per-point NLL.
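A minimal numpy sketch of this re-weighted loss follows the form of Eq. (2); the Euclidean distance and the sharpness value `beta` are assumptions made for illustration rather than the exact values used in our experiments.

```python
import numpy as np

def reweighted_nll(per_point_nll, tuples, expert_tuples, beta=5.0):
    """Down-weight the per-transition NLL by distance to the optimal (expert) trajectory.
    `tuples` and `expert_tuples` are arrays of concatenated (s, a) vectors."""
    # (2a) minimum distance from each training tuple to the expert trajectory
    d = np.linalg.norm(tuples[:, None, :] - expert_tuples[None, :, :], axis=-1).min(axis=1)
    # (2b) exponential weights: ~1 on the expert trajectory, ~0 far from it
    w = np.exp(-beta * d)
    # (2c) re-weighted model loss (optionally normalized by w.sum())
    return float(np.sum(w * per_point_nll))
```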

6 Discussion, Related Work, and Future Work

Objective mismatch impacts the performance of MBRL. Our experiments have examined this fragility in the context of state-of-the-art MBRL algorithms. Beyond the re-weighting of the NLL presented in Sec. 5, here we summarize and discuss relevant work in the community.

Learning the dynamics model to optimize the task performance

Most relevant are research directions on controllers that directly connect the reward signal back to the model learning. In theory, this exactly solves the objective mismatch problem we discuss in this paper, but in practice the current approaches have proven difficult to scale to complex systems. One way to do this is by designing systems that are fully differentiable so that the task reward can be backpropagated through the dynamics. This has been investigated with differentiable MPC (amos2018differentiable), Path Integral control (okada2017path), and stochastic optimization (donti2017task). Universal Planning Networks (srinivas2018universal) propose a differentiable planner that unrolls gradient descent steps over the action space of a planning network. Bansal2017 use a zeroth-order optimizer instead to maximize the controller's performance without having to compute gradients explicitly.

Add heuristics to the dynamics model structure or training process to make control easier

If it is infeasible or intractable to directly shape the dynamics model for the controller, adding heuristics to the training process of the dynamics model is reasonable and can yield improvements in many settings. One challenge with these heuristics is that they may be unstable and difficult to fix or improve when they do not work in new environments. These heuristics can manifest in the form of learning a latent space that is locally linear, e.g., in Embed to Control and related methods (watter2015embed), enforcing that the model makes long-horizon predictions (ke2019learning), ignoring uncontrollable parts of the state space (ghosh2018learning), detecting and correcting when a predictive model steps off the manifold of reasonable states (talvitie2017self), adding reward-signal prediction on top of the latent space (gelada2019deepmdp), or adding noise when training transitions (mankowitz2019robust).

Add inductive biases to the controller

Prior knowledge can be added to the controller in the form of hyper-parameters, such as the horizon length, or by penalizing unreasonable control sequences using, e.g., a slew rate penalty. These heuristics can significantly improve performance if done correctly but can be difficult to tune. jiang2015dependence use complexity theory to justify using a short planning horizon with an approximate model to reduce the class of induced policies.

Continuing Experiments

Our experiments represent an initial exploration into the challenges of objective mismatch in MBRL. Sec. 4.2 is limited to cartpole due to the computational challenges of training with large dynamics datasets, and Sec. 4.3 could be strengthened by defining quantitative comparisons of controller performance. Additionally, these effects should be quantified in other MBRL algorithms such as MBPO (janner2019trust) and POPLIN (wang2019exploring).

7 Conclusion

This paper identifies, formalizes, and analyzes the issue of objective mismatch in MBRL. This fundamental disconnect between the likelihood of the dynamics model and the overall task reward emerges from assumptions inherited from the origins of MBRL. Experimental results highlight the negative effects that objective mismatch has on the performance of a current state-of-the-art MBRL algorithm. In providing a first insight into the issue of objective mismatch in MBRL, we hope future work will examine this issue deeply and overcome it with a new generation of MBRL algorithms.

References

Appendix

Appendix A Effect of Dataset Distribution when Learning

Figure 10: Cartpole (Mujoco simulations) learning efficiency is suppressed when additional data not relevant to the task is added to the dynamics model training set (panels correspond to 200, 2000, 4000, and 20000 random transitions). This effect is related to the issue of objective mismatch because model training needs to account for potential off-task data.

Learning speed can be slowed by many factors in the dataset distribution, such as adding additional irrelevant transitions. When extra transitions from a specific area of the state space are included in the training set, the dynamics model will spend more of its capacity on these transitions. The NLL of the model will be biased down as it learns this data, but the learning speed will be reduced as new, more relevant transitions are added to the training set.

Running cartpole random data collection with a short horizon of 10 steps (while forcing the initial babbling state to always be 0) for 20, 200, 400, and 2000 babbling roll-outs (which sums to 200, 2000, 4000, and 20000 transitions in the dataset) shows some regression in the learning speed for the runs with more useless data from the motor babbling. This data highlights the importance of careful exploration-exploitation trade-offs, or of changing how models are trained to be selective with data.

Appendix B Task Generalization in Simple Environments

Figure 11: Learning curve for the standard cartpole task used in this paper (goal position at 0; panels: validation error and episode reward). The median reward from 10 trials is plotted with the mean NLL of the dynamics models at each iteration. The reward reaches its maximum (180) well before the NLL is at its minimum.

In this section, we compare the performance of a model trained on data for the standard cartpole task (x position goal at 0) to policies attempting to move the cart to different positions along the x-axis. Fig. 11 is a learning curve of PETS with a PE model using the CEM optimizer. Even though performance levels out, the NLL continues to decrease as the dynamics models accrue more data. With more complicated systems, such as half cheetah, the reward of different tasks versus the global likelihood of the model would likely be more interesting (especially with incremental model training) – we will investigate this in future work. Below, we show that the dynamics model generalizes well to goal positions close to zero (both positive and negative, Fig. 12), but performance drops off in areas the training set does not cover as well.

Below the learning curves, in Fig. 13 we include snapshots of the distributions of training data used for these models at different trials, showing how coverage relates to reward in cartpole. It is worth investigating how many points can be removed from the training set while maintaining peak performance on each task.

Figure 12: MPC control with different reward functions using the same dynamics models loaded from the trials shown in Fig. 11. The cartpole solves tasks further from zero with success proportional to the state-space coverage (goals further from zero cause reduced performance). The distribution of data encountered is shown in Fig. 13.


Figure 13: Distribution of positions encountered during the trials shown in Fig. 11. The distribution converges to a high concentration around x = 0, making it difficult for MPC to control outside of the area close to the origin.

Appendix C Validating models with trajectories rather than random tuples

The goal of the dynamics model for planning is to be able to predict stable long-term roll-outs conditioned on different actions. We propose that evaluating the test set when training a dynamics model could be more reliable (in terms of the relation between loss and reward) if the model is validated on batches consisting entirely of the same trajectory, rather than a random shuffle of points. When randomly shuffling points, the test loss can easily be dominated by an outlier in each batch.
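A small sketch of this validation scheme is given below, assuming a hypothetical `batch_nll` callback that returns the mean NLL of a batch of (s, a, s') tuples and a list of held-out trajectories.

```python
import numpy as np

def trajectory_validation(batch_nll, trajectories):
    """Evaluate the validation NLL per held-out trajectory instead of over randomly
    shuffled tuples, so one outlier transition cannot dominate a mixed batch."""
    per_trajectory = [batch_nll(traj) for traj in trajectories]
    return float(np.mean(per_trajectory)), float(np.std(per_trajectory))
```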

To test this, we re-ran the experiments from Sec. 4.1 with the NLL calculated on trajectories rather than random batches. The results are shown in Fig. 14.


Figure 14: There is a slight increase in the correlation between NLL and reward when the NLL is evaluated on cartpole trajectories rather than random samples. This could be one small step in the right direction for solving objective mismatch.


Figure 15: Validation of model LL versus reward with different types of validation for the half cheetah models. (left) A new method for validation, where each batch of the validation set is a complete subsection of a trajectory in the aggregated dataset. (center) We compare the trajectory loss to the regularization that would be provided by simply validating with larger batches, which reduces the variance from outliers. (right) Copied from Fig. 3e, where validation is done on small, randomly sampled batches.

Appendix D Ways model mismatch can harm the performance of a controller

Model mismatch between fitting the likelihood and optimizing the task's reward manifests itself in many ways. Here we highlight two of them, and in Sec. 6 we discuss how related work connects with these issues.

Long-horizon roll-outs of the model may be unstable and inaccurate.

Time-series or dynamics models that are unrolled for long periods of time easily diverge from the true prediction and can easily step into predicting future states that are not on the manifold of reasonable trajectories. Using these faulty dynamics models as part of a controller means optimizing some cost function under a poor approximation of the dynamics. Issues can especially manifest if, e.g., the approximate dynamics do not properly capture stationarity properties necessary for the optimality of the true physical system being modeled.
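A short sketch contrasting one-step error with accumulated open-loop roll-out error is shown below, assuming a hypothetical one-step predictor `model(s, a)` and a recorded trajectory of states and actions.

```python
import numpy as np

def rollout_divergence(model, states, actions):
    """Compare mean one-step prediction error with the error at the end of an
    open-loop roll-out, where predictions are fed back into the model."""
    one_step = [np.linalg.norm(model(states[t], actions[t]) - states[t + 1])
                for t in range(len(actions))]
    s = states[0]
    rollout_errors = []
    for t in range(len(actions)):
        s = model(s, actions[t])                   # compounding model error
        rollout_errors.append(np.linalg.norm(s - states[t + 1]))
    return float(np.mean(one_step)), float(rollout_errors[-1])
```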


Parameter                   Cartpole    Half-Cheetah

Experiment Parameters
  Trial Time-steps          200         1000

Random Sampling Parameters
  Horizon                   25          30
  Trajectories              2000        2500

CEM Parameters
  Horizon                   25          30
  Trajectories              400         500
  Elites                    40          50
  CEM Iterations            5           5

Network Parameters
  Width                     500         200
  Depth                     2           3
  E (ensemble size)         5           5

Training Parameters
  Training Type             Full        Incremental
  Full / Initial Epochs     100         20
  Incremental Epochs        -           10
  Optimizer                 Adam        Adam
  Batch Size                16          64
  Learning Rate             1E-4        1E-4
  Test/Train Split          0.9         0.9

Table 1: PETS hyper-parameters for the cartpole and half-cheetah experiments.

Non-convex and non-smooth models may make the control optimization problem challenging

The approximate dynamics might have bad properties that make the control optimization problem much more difficult than on the true system, even when the true optimal action sequence is optimal under the approximate model. This is especially true when using neural networks, as they introduce non-linearities and non-smoothness that make many classical control approaches difficult to apply.

Sampling models with similar NLLs, different rewards

To better understand objective mismatch, we also compared how a difference in model loss can impact a control policy. We sampled models with similar NLLs and extremely different rewards from Fig. 3d,e and visualized the chosen optimal action sequences along an expert trajectory. The control policies and dynamics models appear to be converging to different regions of the state space. In these visualizations, there is no emphatic reason why the models achieved different rewards, so further study is needed to quantify the impact of model differences. The interpretability of the differences between models and controllers will be important to solving the objective mismatch issue.

Appendix E Hyper-parameters and Simulation Environment

Here is a table including the default parameters for our cartpole and half-cheetah experiments. Both of these experiments were run with the same MuJoCo version, which we found to be significant for replicating various papers across the field of deep RL.

Experimental datasets

We include a table with the sizes of each dataset used in the experimental section of this paper. The expert datasets are generated either by a) running PETS with a true, environment-based dynamics model for prediction, or b) running soft actor-critic to convergence. The on-policy data is taken from the end of a trial that solved the given task (rather than sampling from all on-policy data). The grid dataset for cartpole is generated by slicing the state and action spaces evenly. Due to the high dimensionality of half cheetah, uniform slicing does not work, so the dataset is generated by uniformly sampling within the state and action spaces.

Type          Number of points

Cartpole Datasets
  Grid        16807
  On-policy   3780
  Expert      2400

Half Cheetah Datasets
  Sampled     200000
  On-policy   90900
  Expert      3000

Table 2: Experimental dataset sizes.