1 Introduction
Model-based reinforcement learning (MBRL) is a popular approach for learning to control nonlinear systems that cannot be expressed analytically (bertsekas1995dynamic; sutton2018reinforcement; deisenroth2011pilco; williams2017information). MBRL techniques achieve state-of-the-art performance for continuous-control problems with access to a limited number of trials (chua2018deep; wang2019exploring) and in controlling systems given only visual observations, with no observations of the original system's state (hafner2018learning; zhang2018solar). MBRL approaches typically learn a forward dynamics model that predicts how the dynamical system will evolve when a set of control signals is applied. This model is classically fit by maximum likelihood on a set of trajectories collected on the real system, and is then used as part of a control algorithm executed on the system (e.g., model-predictive control).
In this paper, we highlight a fundamental problem in the MBRL learning scheme: the objective mismatch issue. The learning of the forward dynamics model is decoupled from the subsequent controller that it induces through the optimization of two different objective functions: prediction accuracy or loss of the single- or multi-step lookahead prediction for the dynamics model, and task performance (i.e., reward) for the policy optimization. While the use of the negative log-likelihood (NLL) for system identification is a historically accepted objective, it optimizes a quantity that does not necessarily correlate with task performance. The contributions of this paper are to: 1) identify and formalize the problem of objective mismatch in MBRL; 2) examine the signs and effects of objective mismatch on simulated control tasks; 3) propose an initial mechanism to mitigate objective mismatch; 4) discuss the impact of objective mismatch and outline future directions to address this problem.
2 Model-based Reinforcement Learning
We now outline the MBRL formulation used in the paper. At time $t$, we denote the state $s_t$, the actions $a_t$, and the reward $r_t$. We say that the MBRL agent acts in an environment governed by a state transition distribution $s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$. We denote a parametric model approximating this distribution with $f_\theta(s_{t+1} \mid s_t, a_t)$. MBRL follows the approach of an agent acting in its environment, learning a model of said environment, and then leveraging the model to act. While iterating over parametric control policies, the agent collects measurements of state, action, and next state, and forms a dataset $\mathcal{D} = \{(s_t, a_t, s_{t+1})\}$. With the dynamics data, the agent learns the environment in the form of a neural network forward dynamics model, learning an approximate dynamics $f_\theta$.
This dynamics model is leveraged by a controller that takes in the current state $s_t$ and returns an action sequence $a_{t:t+\tau}$ maximizing the expected reward $\mathbb{E}\big[\sum_{i=t}^{t+\tau} r(s_i, a_i)\big]$, where $\tau$ is the predictive horizon and the state transitions are induced by the model $f_\theta$. In our paper, we primarily use probabilistic networks designed to minimize the NLL of the predicted parametric distribution, denoted P, or ensembles of probabilistic networks, denoted PE, and compare to deterministic networks minimizing the mean squared error (MSE), denoted D or DE. Unless otherwise stated, we use the models as in PETS (chua2018deep) with an expectation-based trajectory planner and a cross-entropy-method (CEM) optimizer.

3 Objective Mismatch and its Consequences
The Origin of Objective Mismatch: The Subtle Differences between MBRL and System Identification
Many ideas and concepts in model-based RL are rooted in the fields of optimal control and system identification (bertsekas1995dynamic; zhou1996robust; kirk2012optimal; bryson2018applied; sutton2018reinforcement). In system identification (SI), the main idea is a two-step process: first, generate (optimal) elicitation trajectories to fit a dynamics model (typically analytical); subsequently, apply this model to a specific task. This particular scheme makes multiple assumptions:
1) the elicitation trajectories cover the entire state-action space; 2) a virtually infinite amount of data is available; 3) the model resulting from the SI process is global and generalizable. With these assumptions, the theme of system identification is effectively to collect a large amount of data covering the whole state space in order to create a sufficiently accurate, global model that can be deployed on any desired task while still obtaining good performance.
When adopting the idea of learning the dynamics model used in optimal control for MBRL, it is important to consider whether these assumptions still hold. The assumption of virtually infinite data is visibly in tension with the explicit goal of MBRL, which is to reduce the number of interactions with the environment by being "smart" about the sampling of new trajectories. In fact, in MBRL the offline data collection performed via elicitation trajectories is largely replaced by on-policy sampling in order to explicitly reduce the need to collect large amounts of data (chua2018deep). Moreover, in the MBRL setting the data will not usually cover the entire state-action space, since they are generated by optimizing a single task. In conjunction with the use of non-parametric models, this results in learned models that are strongly biased towards capturing the distribution of the locally accurate, task-specific data. Nonetheless, this is not an immediate issue, since the MBRL setting rarely tests the generalization capabilities of the learned dynamics.
We can now see how the assumptions and goals of system identification contrast with those of MBRL. Understanding these differences and their downstream effects on algorithmic approaches is crucial to designing new families of MBRL algorithms.
Objective Mismatch
During the MBRL process of iteratively learning a controller, the reward signal from the environment is diluted by the training of a forward dynamics model with an independent metric, as shown in Fig. 1. In our experiments, we highlight that minimizing the network training cost does not correlate strongly with maximizing episode reward. As dynamic environments become increasingly complex in dimensionality, the assumptions on the collected data distributions become weaker and overfitting to particular data poses an increased risk.
Formally, the problem of objective mismatch appears as two decoupled optimization problems repeated over many cycles of learning, shown in Eq. (1a,b), which can come at the cost of the final reward. This loop becomes increasingly difficult to analyze as the dataset used for model training changes with each experimental trial, a step that is needed to include new data from previously unexplored states. In this paper we characterize the problems introduced by the interaction of these two optimization problems, but we do not consider the interactions added by the changes in data distribution during the learning process, as this would significantly increase the complexity of the analysis. In addition, we discuss potential solutions, but do not make claims about the best way to resolve the mismatch, which is left for future work.
$\theta^* = \arg\min_{\theta} \mathcal{L}(\mathcal{D}; \theta) \quad \text{(1a)} \qquad a^*_{t:t+\tau} = \arg\max_{a_{t:t+\tau}} \mathbb{E}_{f_\theta}\Big[\textstyle\sum_{i=t}^{t+\tau} r(s_i, a_i)\Big] \quad \text{(1b)}$
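To make the decoupling in Eq. (1a,b) concrete, the following sketch pairs the two optimizations in a toy setting. Everything here is hypothetical and for illustration only: a 1-D linear system, ordinary least squares standing in for the NLL minimization of (1a), and a minimal cross-entropy-method planner for (1b). Note that the reward enters only the planner; the model is fit to prediction error alone:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D system: s' = 0.9 s + 0.5 a + noise; task reward -s^2.
def step(s, a):
    return 0.9 * s + 0.5 * a + 0.01 * rng.normal()

# --- Eq. (1a): fit the model to prediction accuracy only ---
# (ordinary least squares as a stand-in for NLL minimization).
SA = rng.uniform(-1.0, 1.0, size=(256, 2))            # (s, a) pairs
Y = np.array([step(s, a) for s, a in SA])
theta, *_ = np.linalg.lstsq(SA, Y, rcond=None)        # learned [A, B]

# --- Eq. (1b): plan under the learned model to maximize reward ---
# A minimal CEM planner; the reward signal never reaches the
# model-fitting step above -- that is the objective mismatch.
def plan_cem(s0, horizon=5, pop=64, elites=8, iters=10):
    mu, std = np.zeros(horizon), np.ones(horizon)
    for _ in range(iters):
        A = mu + std * rng.normal(size=(pop, horizon))
        returns = np.empty(pop)
        for i, acts in enumerate(A):
            s, ret = s0, 0.0
            for a in acts:
                s = theta[0] * s + theta[1] * a       # model rollout
                ret += -s ** 2                        # task reward
            returns[i] = ret
        elite = A[np.argsort(returns)[-elites:]]
        mu, std = elite.mean(0), elite.std(0) + 1e-6
    return mu[0]                                      # MPC: first action

a0 = plan_cem(s0=1.0)   # should push the state toward 0, so a0 < 0
```

Any improvement in the planner's return here depends entirely on how well the least-squares fit happens to support the task, which is exactly the coupling that the two separate objectives do not enforce.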
4 Identifying Objective Mismatch
We now experimentally study the issue of objective mismatch in MBRL to answer the following questions: 1) Does the distribution of models obtained from running an MBRL algorithm show a strong correlation between NLL and reward? 2) Are there signs of suboptimality in the dynamics model training process that could be limiting performance? 3) What model differences are reflected in reward but not in NLL?
Experimental Setting
In our experiments, we use two popular RL benchmark tasks: cartpole and half cheetah. For more details on these tasks, model parameters, and control properties, see chua2018deep. For our experiments, we aggregated cartpole and half cheetah models from many PETS runs. With a large set of on-policy dynamics models, we then used a set of three different datasets to evaluate how different assumptions made in MBRL affect performance. We start with expert datasets (cartpole: 2400 points, half cheetah: 3000 points) to test whether on-policy performance is linked to having adequately explored the environment. As a baseline, we compare the expert data to datasets collected on-policy by the PETS algorithm or by sampling tuples representative of the entire state space. More details, additional experiments, and an expanded manuscript can be found on the website https://sites.google.com/view/mbrlmismatch.
4.1 Exploration of Model Loss vs Episode Reward Space
In the MBRL framework, it is often assumed that there is a clear correlation between model accuracy and policy performance; we challenge this assumption even in simple domains and on specific on-policy data. The relationship between model accuracy and reward on data representing the full environment space shows no clear trend in Fig. LABEL:fig:nllVrc,f. The simplicity of the cartpole environment results in quick learning and a concentration of networks around peak reward. The distribution of rewards versus log-likelihood (LL) shown in Fig. LABEL:fig:nllVr a-c shows substantial variance, with points of disagreement overshadowing a visual trend of increased reward as LL decreases. The bimodal distribution on the half cheetah expert dataset, shown most clearly in Fig. LABEL:fig:nllVrd, relates to an unrecoverable-state failure mode in early half cheetah trials. The contrast between Fig. LABEL:fig:nllVre and Fig. LABEL:fig:nllVrd,f shows a considerable difference in the transitions represented within the datasets. These results confirm that objective mismatch is not so strong as to break all correlation between validation loss and episode reward. Rather, the mismatch likely acts as a ceiling on performance by weakening the correlation between model accuracy and evaluation reward, and it specifically conflicts with any guarantee of improvement that is expected when training a 'better' model.
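One simple way to quantify such trends (an illustrative analysis, not the paper's protocol) is a rank correlation between validation log-likelihood and episode reward over a population of models. The data below are synthetic stand-ins with weak coupling and heavy scatter, mimicking the clouds described above:

```python
import numpy as np

def spearman(x, y):
    # Spearman rank correlation implemented directly (no SciPy needed);
    # assumes no ties, which holds for continuous samples.
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

rng = np.random.default_rng(1)
# Synthetic per-model statistics: validation log-likelihood and episode
# reward for a population of 200 trained dynamics models, with weak
# coupling and large scatter (these numbers are invented).
ll = rng.normal(size=200)
reward = 0.3 * ll + rng.normal(size=200)

rho = spearman(ll, reward)   # weakly positive, far from 1
```

A rank statistic is preferable to Pearson correlation here because rewards and likelihoods live on very different, non-linear scales.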
4.2 Model Loss vs Episode Reward During Training
This section explores how model training impacts performance at the per-epoch level. These experiments shed light on the impact of the strong dataset assumptions outlined in Sec. 3. As a dynamics model is trained, there are two key inflection points: the first is the training epoch where episode reward is maximized, and the second is where error on the validation set is minimized. These experiments are focused on showing the disconnect between three practices in MBRL: a) the assumption that on-policy dynamics data can express large portions of the state space, b) the idea that simple neural networks can satisfactorily capture complex dynamics, and c) the practice of treating model training as a simple optimization problem disconnected from reward.
For the grid cartpole dataset, Fig. LABEL:fig:cptc shows that, when evaluating the controller per epoch of training, the reward is maximized at a drastically different time than when the validation loss is minimized. Fig. LABEL:fig:cptceval highlights how well the trained models represent datasets other than the ones they are trained on (in terms of additional validation errors). There is no indication that on-policy data will lead to a complete understanding of the dynamics, because the grid validation data rapidly diverges in Fig. LABEL:fig:cptcevalb. When training on grid data, the fact that the on-policy data diverges in Fig. LABEL:fig:cptcevala before the reward decreases is encouraging, as objective mismatch may be preventable in simple tasks. Similar experiments on half cheetah are omitted because models for this environment are trained incrementally on aggregated data rather than fully on each dataset (chua2018deep).
4.3 Decoupling Model Loss from Controller Performance
In this section, we explore how differences in dynamics models, even those with similar NLLs, are reflected in control policies, to show that an accurate dynamics model does not guarantee performance.
Adversarial attack on model performance
We performed an adversarial attack (szegedy2013intriguing) on a neural network dynamics model so that it attains a good likelihood but poor reward. We start with a dynamics model that achieves high likelihood and high reward and tweak its parameters so that it continues achieving high likelihood but obtains low reward. Specifically, we fine-tune the network's last layer with a zeroth-order optimizer, CMA-ES (the cumulative reward is non-differentiable), to lower the reward, with a large penalty if the model's validation likelihood drops. As a starting point for this experiment we sampled a dynamics model from the last trial of a PETS run on cartpole; this model achieves high reward and low NLL on its on-policy training dataset. Using CMA-ES, we substantially reduced the on-policy reward of the model while slightly improving the NLL; the CMA-ES convergence over population iterations is shown in Fig. 7 and the difference between the two models is visualized in Fig. 6. Fine-tuning all model parameters would be even more likely to find suboptimally performing controllers with low model loss, because the output layer consists of only a small fraction of the total model parameters. This experiment shows that the model parameters that achieve a low model loss inhabit a broader space than the subset that also achieves high reward.
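A minimal version of this attack can be sketched as follows. All quantities are hypothetical surrogates rather than the paper's actual setup: a three-parameter "last layer", a quadratic stand-in for validation NLL that is flat in two of the three directions, a reward that depends on exactly those ignored directions, and a simple (1+1) evolution strategy in place of full CMA-ES:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical surrogates: a 3-parameter "last layer" w, a validation
# NLL that only constrains w[0], and a reward that depends on the
# directions the NLL ignores.
w_star = np.array([1.0, -0.5, 0.25])

def val_nll(w):
    return float((w[0] - w_star[0]) ** 2)        # ignores w[1], w[2]

def reward(w):
    return float(10.0 - (w[1] - w_star[1]) ** 2 - (w[2] - w_star[2]) ** 2)

def attack(w, sigma=0.2, iters=500, tol=1e-3):
    # (1+1)-ES as a lightweight stand-in for CMA-ES: propose random
    # perturbations, reject any that degrade the validation NLL, and
    # keep those that lower the induced reward.
    base = val_nll(w)
    for _ in range(iters):
        cand = w + sigma * rng.normal(size=w.shape)
        if val_nll(cand) > base + tol:
            continue                             # likelihood must hold
        if reward(cand) < reward(w):
            w = cand                             # reward pushed down
    return w

w_adv = attack(w_star.copy())
```

Because the likelihood surrogate ignores two parameter directions, the search is free to degrade reward arbitrarily while the constraint on validation loss stays satisfied: the same flat-likelihood, low-reward region the experiment above exposes.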
5 Addressing Objective Mismatch During Model Training
Tweaking dynamics model training can partially mitigate the problem of objective mismatch. Taking inspiration from imitation learning, we propose that the learning capacity of the model is most useful when accurately modeling the dynamics along trajectories that are relevant to the task at hand, while maintaining knowledge of nearby transitions for robustness under a stochastic controller. Intuitively, it is more important to model the dynamics accurately along the optimal trajectory than to model parts of the state-action space that might never be visited when solving the particular task. However, using the NLL as the model loss considers neither the density of the data nor their usefulness for the task. For this reason, we now propose a model loss aimed at alleviating this issue.
Given an element of the state-action space, we quantify the distance between any two tuples. With this distance, we reweight the loss of points further from the optimal policy to be lower, so that points on the optimal trajectory get a weight close to 1, and points at the edge of the grid dataset used in Sec. 4 get a weight close to 0. Using the expert dataset discussed in Sec. 4 as a distance baseline, we generated tuples by uniformly sampling across the state and action space of cartpole. We sorted these data by taking the minimum orthogonal distance from each of the points to the dataset from the optimal trajectory (i.e., an expert trajectory that achieved high reward). To create different datasets that range from near-optimal to nearly uniform across the state space, we vary the distance bound and the number of points trained on. This simple form of reweighting the neural network loss, with a sharp initial roll-off from the exponential shown in Eq. (2a,b,c), demonstrated an improvement in sample efficiency when learning the cartpole task, as seen in Fig. 8. Unfortunately, this approach is impractical for many applications where the optimal trajectory is not known in advance. However, building on these encouraging results, future work could develop an iterative method to jointly estimate the optimal trajectory and reweight samples in an online training method to address objective mismatch.
$d_i = \min_j \lVert x_i - x_j^{\text{opt}} \rVert \quad \text{(2a)} \qquad w_i = \exp(-\beta d_i) \quad \text{(2b)} \qquad \mathcal{L} = \textstyle\sum_i w_i\, \mathcal{L}\big(f_\theta(s_i, a_i), s_{i+1}\big) \quad \text{(2c)}$, where $x_i = (s_i, a_i)$ and $x_j^{\text{opt}}$ ranges over the tuples of the expert trajectory.
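A minimal sketch of this reweighting, assuming hypothetical array shapes, an invented temperature beta, and random per-point NLL values as stand-ins for actual model losses:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: x_opt holds (s, a) tuples along an expert
# trajectory; x_batch is a training batch drawn from a wide grid; nll
# is a random stand-in for the per-point model loss.
x_opt = rng.uniform(-0.2, 0.2, size=(50, 4))
x_batch = rng.uniform(-2.0, 2.0, size=(128, 4))
nll = rng.uniform(0.5, 1.5, size=128)

def reweighted_loss(x_batch, nll, x_opt, beta=5.0):
    # Distance from each training point to the nearest expert tuple,
    # an exponential weight that rolls off sharply away from the
    # expert trajectory, and the weighted model loss.
    d = np.min(np.linalg.norm(x_batch[:, None, :] - x_opt[None, :, :],
                              axis=-1), axis=1)
    w = np.exp(-beta * d)
    return float(np.sum(w * nll) / np.sum(w)), w

loss, w = reweighted_loss(x_batch, nll, x_opt)
```

Points inside the expert tube dominate the loss, while transitions far from the trajectory contribute almost nothing, matching the intuition that model capacity should be spent where the task lives.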
6 Discussion, Related Work, and Future Work
Objective mismatch impacts the performance of MBRL. Our experiments have examined this fragility in the context of state-of-the-art MBRL algorithms. Beyond the reweighting of the NLL presented in Sec. 5, here we summarize and discuss the relevant work in the community.
Learning the dynamics model to optimize the task performance
Most relevant are research directions on controllers that directly connect the reward signal back to the controller. In theory, this exactly solves the objective mismatch problem we discuss in this paper, but in practice current approaches have proven difficult to scale to complex systems. One way to do this is by designing systems that are fully differentiable, so that the task reward can be backpropagated through the dynamics. This has been investigated with differentiable MPC (amos2018differentiable), Path Integral control (okada2017path), and stochastic optimization (donti2017task). Universal Planning Networks (srinivas2018universal) propose a differentiable planner that unrolls gradient descent steps over the action space of a planning network. Bansal2017 instead use a zeroth-order optimizer to maximize the controller's performance without having to compute gradients explicitly.

Add heuristics to the dynamics model structure or training process to make control easier
If it is infeasible or intractable to shape the dynamics for a controller, adding heuristics to the training process of the dynamics model is reasonable and can yield improvements in many settings. One challenge with these heuristics is that they may be unstable and difficult to fix or improve when they do not work in new environments. These heuristics can manifest as learning a latent space that is locally linear, e.g., in Embed to Control and related methods (watter2015embed); enforcing that the model makes long-horizon predictions (ke2019learning); ignoring uncontrollable parts of the state space (ghosh2018learning); detecting and correcting when a predictive model steps off the manifold of reasonable states (talvitie2017self); adding reward-signal prediction on top of the latent space (gelada2019deepmdp); or adding noise when training transitions (mankowitz2019robust).

Add inductive biases to the controller
Prior knowledge can be added to the controller in the form of hyperparameters, such as the horizon length, or by penalizing unreasonable control sequences using, e.g., a slew-rate penalty. These heuristics can significantly improve performance if done correctly, but can be difficult to tune. jiang2015dependence use complexity theory to justify using a short planning horizon with an approximate model to reduce the class of induced policies.
Continuing Experiments
Our experiments represent an initial exploration into the challenges of objective mismatch in MBRL. Sec. 4.2 is limited to cartpole due to the computational challenges of training with large dynamics datasets, and Sec. 4.3 could be strengthened by defining quantitative comparisons of controller performance. Additionally, these effects should be quantified in other MBRL algorithms such as MBPO (janner2019trust) and POPLIN (wang2019exploring).
7 Conclusion
This paper identifies, formalizes, and analyzes the issue of objective mismatch in MBRL. This fundamental disconnect between the likelihood of the dynamics model and the overall task reward emerges from incorrect assumptions at the origins of MBRL. Experimental results highlight the negative effects that objective mismatch has on the performance of a current state-of-the-art MBRL algorithm. In providing a first insight into the issue of objective mismatch in MBRL, we hope future work will examine this issue deeply and overcome it with a new generation of MBRL algorithms.
References
Appendix
Appendix A Effect of Dataset Distribution when Learning
Learning speed can be slowed by many factors in the dataset distribution, such as the addition of irrelevant transitions. When extra transitions from a specific area of the state space are included in the training set, the dynamics model will spend increased expressive capacity on these transitions. The NLL of the model will be biased downward as it learns these data, but the learning speed will be reduced as new, more relevant transitions are added to the training set.
Running cartpole random data collection with a short horizon of 10 steps (while forcing the initial babbling state to always be 0) for 20, 200, 400, and 2000 babbling rollouts (which sums to 200, 2000, 4000, and 20000 transitions in the dataset) shows some regression in the learning speed for runs with more useless data in the motor babbling. These data highlight the importance of a careful exploration-exploitation trade-off, or of changing how models are trained to be selective with data.
Appendix B Task Generalization in Simple Environments
In this section, we compare the performance of a model trained on data for the standard cartpole task (x-position goal at 0) to policies attempting to move the cart to different positions along the x-axis. Fig. 11 is a learning curve of PETS with a PE model using the CEM optimizer. Even though performance levels out, the NLL continues to decrease as the dynamics models accrue more data. With more complicated systems, such as half cheetah, the reward of different tasks versus the global likelihood of the model would likely be more interesting (especially with incremental model training); we will investigate this in future work. Below, we show that the dynamics model generalizes well to tasks close to zero, both positive (Fig. LABEL:fig:right) and negative positions (Fig. LABEL:fig:left), but performance drops off in areas the training set does not cover as well.
Below the learning curves in Fig. LABEL:fig:general_dist, we include snapshots of the distributions of training data used for these models at different trials, showing how coverage relates to reward in cartpole. It is worth investigating how many points can be removed from the training set while maintaining peak performance on each task.
Appendix C Validating models with trajectories rather than random tuples
The goal of the dynamics model in planning is to predict stable long-term rollouts conditioned on different actions. We propose that evaluating the test set when training a dynamics model could be more reliable (in terms of the relation between loss and reward) if the model is validated on batches consisting entirely of the same trajectory, rather than a random shuffle of points. When randomly shuffling points, the test loss can easily be dominated by an outlier in each batch.
To test this, we reran experiments from Sec. 4.1 with the NLL being calculated on trajectories rather than random batches. The results are shown in Fig. LABEL:fig:trajloss.
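The motivation can be illustrated with synthetic loss values (all numbers below are invented): with random shuffling, a handful of outlier transitions contaminate several validation batches, while trajectory-wise grouping confines them to a single score:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical per-transition validation losses: 20 trajectories of 50
# steps each, where one trajectory contains a few extreme outliers.
losses = rng.uniform(0.1, 0.3, size=(20, 50))
losses[7, :3] = 100.0                      # outlier transitions

# Trajectory-wise validation confines the outliers to one score...
trajectory_scores = losses.mean(axis=1)

# ...while random shuffling spreads them across up to three batches,
# contaminating several batch-level validation scores.
flat = losses.flatten()
rng.shuffle(flat)
shuffled_scores = flat.reshape(20, 50).mean(axis=1)

n_clean_traj = int(np.sum(trajectory_scores < 1.0))
n_clean_shuffled = int(np.sum(shuffled_scores < 1.0))
```

Under trajectory grouping, 19 of the 20 scores remain clean and interpretable; under shuffling, the number of contaminated batches depends on where the outliers happen to land.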
Appendix D Ways model mismatch can harm the performance of a controller
Model mismatch between fitting the likelihood and optimizing the task's reward manifests itself in many ways. Here we highlight two of them, and in Sec. 6 we discuss how related work connects with these issues.
Long-horizon rollouts of the model may be unstable and inaccurate.
Time-series or dynamics models that are unrolled for long periods of time easily diverge from the true prediction and can step into predicting future states that are off the manifold of reasonable trajectories. These faulty dynamics models are then used as part of a controller that optimizes some cost function under a poor approximation to the dynamics. Issues can especially manifest if, e.g., the approximate dynamics do not properly capture stationarity properties necessary for the optimality of the true physical system being modeled.
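A tiny numerical example (a hypothetical scalar system, chosen only to illustrate the compounding) makes the divergence concrete:

```python
# Hypothetical scalar dynamics (not from the paper): the true system is
# s' = 0.95 s, the learned model says s' = 0.99 s. The one-step error
# is tiny, but it compounds multiplicatively over a long rollout.
a_true, a_model = 0.95, 0.99

s_true = s_model = 1.0
one_step_err = abs(a_model - a_true) * s_true   # 0.04

for _ in range(100):
    s_true *= a_true
    s_model *= a_model

long_horizon_err = abs(s_model - s_true)        # many times the 0.04
```

Here 0.99^100 versus 0.95^100 differ by roughly an order of magnitude more than the one-step error, and a model with a coefficient above 1 would diverge outright.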
Table: Default hyperparameters for the cartpole and half cheetah experiments.

Parameter | Cartpole | Half Cheetah
Experiment Parameters
  Trial Timesteps | 200 | 1000
Random Sampling Parameters
  Horizon | 25 | 30
  Trajectories | 2000 | 2500
CEM Parameters
  Horizon | 25 | 30
  Trajectories | 400 | 500
  Elites | 40 | 50
  CEM Iterations | 5 | 5
Network Parameters
  Width | 500 | 200
  Depth | 2 | 3
  E | 5 | 5
Training Parameters
  Training Type | Full | Incremental
  Full / Initial Epochs | 100 | 20
  Incremental Epochs | - | 10
  Optimizer | Adam | Adam
  Batch Size | 16 | 64
  Learning Rate | 1E-4 | 1E-4
  Test/Train Split | 0.9 | 0.9
Nonconvex and nonsmooth models may make the control optimization problem challenging
The approximate dynamics might have properties that make the control optimization problem much more difficult than on the true system, even when the true optimal action sequence remains optimal under the approximate model. This is especially true when using neural networks, as they introduce nonlinearities and nonsmoothness that make many classical control approaches difficult to apply.
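A small numerical illustration of this point, using hypothetical 1-D action-cost landscapes rather than anything from the paper: adding a nonsmooth ripple to a smooth quadratic cost multiplies the number of local minima a planner must contend with, even though the global optimum barely moves:

```python
import numpy as np

# Hypothetical 1-D action-cost landscapes: a smooth cost induced by the
# true system vs. the same cost plus a nonsmooth ripple, mimicking the
# artifacts a neural network model can introduce.
def true_cost(a):
    return (a - 0.3) ** 2

def approx_cost(a):
    return (a - 0.3) ** 2 + 0.2 * np.abs(np.sin(20.0 * a))

def n_local_minima(c):
    # count strict interior local minima on a dense grid of costs
    return int(np.sum((c[1:-1] < c[:-2]) & (c[1:-1] < c[2:])))

grid = np.linspace(-1.0, 1.0, 2001)
n_true = n_local_minima(true_cost(grid))      # a single minimum
n_approx = n_local_minima(approx_cost(grid))  # many spurious minima
```

A gradient-based or shooting planner on the rippled landscape can terminate in any of the spurious minima, while on the smooth landscape the same planner would find the unique optimum.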
Sampling models with similar NLLs, different rewards
To better understand objective mismatch, we also compared how differences in model loss can impact a control policy. We sampled models with similar NLLs but extremely different rewards from Fig. LABEL:fig:nllVrde and visualized the chosen optimal action sequences along an expert trajectory. The control policies and dynamics models appear to converge to different regions of the state space. In these visualizations, there is no emphatic reason why the models achieved different rewards, so further study is needed to quantify the impact of model differences. Interpretability of the differences between models and controllers will be important for solving the objective mismatch issue.
Appendix E Hyperparameters and Simulation Environment
Here is a table including the default parameters for our cartpole and half cheetah experiments. Both of these experiments were run with a fixed Mujoco version (which we found to be significant in replicating various papers across the field of deep RL).
Experimental datasets
We include a table with the sizes of each dataset used in the experimental section of this paper. The expert datasets are generated either by running PETS with the true, environment-based dynamics model for prediction, or by running soft actor-critic at convergence. The on-policy data are taken from the end of a trial that solved the given task (rather than sampling from all on-policy data). The grid dataset for cartpole is generated by slicing the state and action spaces evenly. Due to the high dimensionality of half cheetah, uniform slicing does not work, so the dataset is generated by uniformly sampling within the state and action spaces.
Type | Number of points
Cartpole Datasets
  Grid | 16807
  On-policy | 3780
  Expert | 2400
Half Cheetah Datasets
  Sampled | 200000
  On-policy | 90900
  Expert | 3000