1 Introduction
In model-based reinforcement learning (RL), the actions of an agent are computed based on information from an explicit model of the environment dynamics. Model-based control with human-engineered simulator-based models is widely used in robotics and has been demonstrated to solve challenging tasks such as humanoid locomotion (Tassa et al., 2012, 2014) and dexterous in-hand manipulation (Lowrey et al., 2018). However, this is not possible when the simulator is unavailable (the environment is unknown) or slow. In such cases, we can learn a dynamics model of the environment and use it for planning.
Arguably the most important benefit of model-based RL algorithms is that they are typically more sample-efficient than their model-free counterparts (Deisenroth et al., 2013; Arulkumaran et al., 2017; Chua et al., 2018). Due to their sample-efficiency, model-based RL methods are attractive in real-world applications where collecting samples is expensive (Deisenroth & Rasmussen, 2011). In addition, it is trivial to use data collected with any policy to train the dynamics model, and the dynamics model can transfer well to other tasks in the same environment.
Trajectory optimization with learned dynamics models is particularly challenging because overconfident predictions of out-of-distribution trajectories can yield highly overoptimistic rewards. Powerful function approximators such as deep neural networks are required to learn the dynamics of complex environments, but such models are likely to produce erroneous predictions outside the training distribution. This problem can be severe in high-dimensional environments, especially when there is little training data available. Recent works learn an ensemble of dynamics models to attenuate this problem (Chua et al., 2018; Kurutach et al., 2018; Clavera et al., 2018).

In this paper, we propose to use a denoising autoencoder (DAE) (Vincent et al., 2010) to regularize trajectory optimization by penalizing trajectories that are unlikely to appear in the past experience. The DAE is trained to denoise the same trajectories used to train the dynamics models. The intuition is that the denoising error will be large for trajectories that are far from the training distribution, signaling that the dynamics model predictions will be less reliable because it has not been trained on such data. We can use this as a regularization term to optimize for trajectories that yield a high return while keeping the uncertainty low. Previously, DAEs have been successfully used to regularize conditional iterative generation of images (Nguyen et al., 2017).
The main contributions of this work are as follows:

Denoising regularization. We present a new algorithm for regularizing trajectory optimization with learned dynamics models. We empirically demonstrate the efficacy of the algorithm in open-loop control of an industrial process and closed-loop control of a set of popular motor control tasks.

Gradient-based trajectory optimization. We demonstrate effective gradient-based trajectory optimization with learned global dynamics models to achieve state-of-the-art sample-efficiency.
2 Model-Based Reinforcement Learning
Problem setting. At every time step $t$, the environment is in a state $s_t$ and the agent observes $o_t$. In a partially observable Markov decision process (POMDP), the state $s_t$ is hidden and the observation $o_t$ does not completely reveal $s_t$. In a fully observable Markov decision process (MDP), the state is observable and hence $o_t = s_t$. The agent takes an action $a_t$, causing the environment to transition to a new state $s_{t+1}$ following the stochastic dynamics $p(s_{t+1} \mid s_t, a_t)$, and the agent receives a reward $r_t = r(s_t, a_t)$. The goal is to implement a policy that maximizes the expected cumulative reward $\mathbb{E}\left[\sum_t r_t\right]$.

Learned dynamics models. Model-based RL differs from model-free RL by modeling the dynamics and using it to influence the choice of actions. In this paper, we use environments with deterministic dynamics, i.e. $s_{t+1} = f(s_t, a_t)$. In fully observable environments (such as in Section 5), the dynamics model can be a fully-connected network trained to predict the state transition from time $t$ to $t+1$:
$$\hat{s}_{t+1} = f_\theta(s_t, a_t).$$
In partially observable environments (such as in Section 4), the dynamics model can be a recurrent neural network trained to directly predict the future observations based on past observations and actions:
$$\hat{o}_{t+1} = f_\theta(o_1, \ldots, o_t, a_1, \ldots, a_t).$$
In this paper, we assume access to the reward function and that it can be computed from the agent observations, i.e. $r_t = r(o_t, a_t)$, although this can easily be extended by modeling the reward function.
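As a concrete illustration of the fully observable case, the sketch below shows a small fully-connected network acting as a one-step dynamics model $\hat{s}_{t+1} = f_\theta(s_t, a_t)$. The layer sizes and dimensions here are hypothetical, not the ones used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Random MLP parameters: sizes = [input, hidden..., output]."""
    return [(rng.normal(0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    """Forward pass: tanh hidden activations, linear output."""
    for W, b in params[:-1]:
        x = np.tanh(x @ W + b)
    W, b = params[-1]
    return x @ W + b

state_dim, action_dim = 4, 2                       # hypothetical dimensions
f_theta = init_mlp([state_dim + action_dim, 64, 64, state_dim])

def predict_next_state(s, a):
    # One-step deterministic model: s_{t+1} = f_theta(s_t, a_t)
    return mlp(f_theta, np.concatenate([s, a]))

s_next = predict_next_state(np.zeros(state_dim), np.zeros(action_dim))
```

In practice, $f_\theta$ would be trained by regression on observed transitions (see Section 5.2 for the architecture actually used).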
Trajectory optimization. The learned dynamics model is used as a proxy to the environment to optimize a trajectory of actions that maximizes the expected cumulative reward. The goal of trajectory optimization is, given the past observations $o_1, \ldots, o_t$ and planning horizon $H$, to find a sequence of actions $a_t, \ldots, a_{t+H-1}$ that yields the highest expected cumulative reward
$$G = \mathbb{E}\left[\sum_{\tau=t}^{t+H-1} r_\tau\right],$$
such that $\hat{s}_{\tau+1} = f_\theta(s_\tau, a_\tau)$ (for MDPs) or $\hat{o}_{\tau+1} = f_\theta(o_1, \ldots, o_\tau, a_1, \ldots, a_\tau)$ (for POMDPs).
Open-loop control and closed-loop control. The optimized sequence of actions from trajectory optimization can be directly applied to the environment without any further interaction (open-loop control) or provided as suggestions to a human (human-in-the-loop control). We demonstrate effective open-loop control of an industrial process in Section 4. Open-loop control is challenging because the dynamics model has to be able to make accurate long-range predictions. We can account for modeling errors and feedback from the environment by taking only the first action of the optimized trajectory and then replanning at each step (closed-loop control). In the control literature, this flavor of model-based RL is called model-predictive control (MPC) (Mayne et al., 2000; Rossiter, 2003; Kouvaritakis & Cannon, 2001; Nagabandi et al., 2018). MPC has been used successfully to control various simulated robotic agents, as shown in Tassa et al. (2012, 2014); Lowrey et al. (2018). We demonstrate closed-loop control of several motor control tasks using MPC in Section 5.
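The closed-loop MPC procedure can be sketched on a toy problem as follows. This is a minimal illustration with a known one-dimensional dynamics function and a naive random-shooting planner, not the learned models or optimizers used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def dynamics(s, a):
    """Toy deterministic environment: a 1-D point moved by the action."""
    return s + 0.1 * a

def reward(s, a):
    return -(s - 1.0) ** 2          # reach the target state s = 1

def plan(model, s0, horizon=10, n_candidates=500):
    """Random shooting: sample action sequences, return the best one."""
    best_seq, best_ret = None, -np.inf
    for _ in range(n_candidates):
        seq = rng.uniform(-1, 1, horizon)
        s, ret = s0, 0.0
        for a in seq:
            ret += reward(s, a)
            s = model(s, a)
        if ret > best_ret:
            best_seq, best_ret = seq, ret
    return best_seq

# Closed-loop MPC: apply only the first planned action, then replan.
s = 0.0
for _ in range(30):
    a = plan(dynamics, s)[0]
    s = dynamics(s, a)

final_error = abs(s - 1.0)
```

Open-loop control would instead apply the entire first plan without replanning, which succeeds only if the model's long-range predictions are accurate.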
3 Regularized Trajectory Optimization
The trajectory optimization problem is known to be highly non-trivial, with many sources of numerical instability. One potential problem in the trajectory optimization procedure is that the estimates of the expected return can be grossly inaccurate for out-of-distribution trajectories, and the agent is likely to be tempted to try trajectories that look very different from the collected experience.
In this paper, we propose to address this problem by penalizing trajectories that are unlikely to appear in the past experience. This can be achieved by adding a penalty term to the objective function:
$$G_{\text{reg}} = G + \alpha \log p(o_t, a_t, \ldots, o_{t+H-1}, a_{t+H-1}),$$
where $p(\cdot)$ is the probability of observing the trajectory in the past experience and $\alpha$ is a tuning hyperparameter. In this paper, instead of using the joint probability of the whole trajectory, we use marginal probabilities over short windows of size $w$:
$$G_{\text{reg}} = G + \alpha \sum_{\tau} \log p(x_\tau). \qquad (1)$$
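The windowed penalty of Equation (1) can be sketched as follows. The trajectory, the window size, and the placeholder density (a standard normal, standing in for a model of the past experience) are all hypothetical:

```python
import numpy as np

def windows(x, w):
    """Sliding windows of w consecutive observation-action vectors,
    each concatenated into a single vector (the x_tau of Eq. 1)."""
    return np.stack([x[i:i + w].ravel() for i in range(len(x) - w + 1)])

def regularized_return(rewards, oa_pairs, log_p, alpha, w):
    """G_reg = sum_t r_t + alpha * sum over windows of log p(window)."""
    G = np.sum(rewards)
    penalty = sum(log_p(win) for win in windows(oa_pairs, w))
    return G + alpha * penalty

# Example with a placeholder density: standard-normal log-likelihood.
T, d = 8, 3                      # horizon, obs+action dimension
oa = np.zeros((T, d))            # hypothetical trajectory
r = np.ones(T)
log_p = lambda v: -0.5 * np.sum(v ** 2) - 0.5 * len(v) * np.log(2 * np.pi)
G_reg = regularized_return(r, oa, log_p, alpha=0.1, w=3)
n_windows = len(windows(oa, 3))
```

The penalty discourages trajectories whose windows have low probability under the model of the past experience.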
The optimal sequence of actions can be found by a gradient-based optimization procedure. The gradients $\partial G_{\text{reg}} / \partial a_t$ can be computed by backpropagation in a computational graph built using the trained dynamics model (see Fig. 1). In such a backpropagation-through-time procedure, one needs to compute the gradient with respect to the actions at each time instance $t$:
$$\frac{\partial G_{\text{reg}}}{\partial a_t} = \frac{\partial G}{\partial a_t} + \alpha \sum_{\tau} \frac{\partial \log p(x_\tau)}{\partial a_t}, \qquad (2)$$
where we denote by $x_\tau$ a concatenated vector of observations $o_{\tau-w+1}, \ldots, o_\tau$ and actions $a_{\tau-w+1}, \ldots, a_\tau$ over a window of size $w$. Thus, to enable a regularized gradient-based optimization procedure, we need means to compute $\partial \log p(x) / \partial x$.

In order to evaluate $\log p(x)$ (or its derivative), one needs to train a separate model of the past experience, which is the task of unsupervised learning. In principle, any probabilistic model can be used for that. In this paper, we propose to regularize trajectory optimization with a denoising autoencoder (DAE), which does not build an explicit probabilistic model $p(x)$ but rather learns to approximate the derivative of the log probability density, which can be used directly in a gradient-based optimization procedure.

The theory of denoising autoencoders states (Arponen et al., 2017) that the optimal denoising function $g$ (for zero-mean Gaussian corruption) is given by the following expression:
$$g(\tilde{x}) = \tilde{x} + \sigma^2 \frac{\partial \log p(\tilde{x})}{\partial \tilde{x}},$$
where $p(\tilde{x})$ is the probability density function of the data corrupted with noise and $\sigma$ is the standard deviation of the Gaussian corruption. Thus, by training a DAE and assuming $p(\tilde{x}) \approx p(x)$, we can approximate the required gradient as
$$\frac{\partial \log p(x)}{\partial x} \approx \frac{g(x) - x}{\sigma^2}. \qquad (3)$$
In automatic differentiation software, this gradient can be computed by adding the penalty term $-\frac{\alpha}{2\sigma^2} \sum_\tau \|x_\tau - g(x_\tau)\|^2$ to $G$ and stopping the gradient propagation through $g(x_\tau)$. In practice, stopping the gradient through $g(x_\tau)$ did not yield any benefits in our experiments compared to simply adding the penalty term to the cumulative reward, so we used the simple penalty term in our experiments.
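The behavior of the denoising penalty can be illustrated on a one-dimensional toy problem: a linear denoiser fitted by least squares to Gaussian "past experience" yields a small denoising error near the data and a large error far from it. The data and the (linear) model class here are hypothetical simplifications of a real DAE:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "past experience": 1-D points concentrated near the origin.
x = rng.normal(0.0, 1.0, size=5000)
sigma = 0.5                               # corruption noise level
x_noisy = x + rng.normal(0.0, sigma, size=x.shape)

# Train a linear denoiser g(v) = w * v + b by least squares,
# minimizing ||g(x_noisy) - x||^2 (the DAE objective for this model class).
A = np.stack([x_noisy, np.ones_like(x_noisy)], axis=1)
w, b = np.linalg.lstsq(A, x, rcond=None)[0]

def denoise(v):
    return w * v + b

def penalty(v):
    """DAE regularization term ||v - g(v)||^2."""
    return (v - denoise(v)) ** 2

in_dist = penalty(0.5)       # a point near the training data
out_dist = penalty(10.0)     # a point far outside the training data
```

For this Gaussian toy data the optimal linear denoiser shrinks inputs toward the data mean, so the denoising error, and hence the penalty, grows with the distance from the training distribution.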
Although gradientbased trajectory optimization is potentially effective, in practice it can fail due to several reasons:

Sensitivity to early actions. The optimized reward is much more sensitive to the actions at the beginning of a plan than at the end, so using smaller learning rates for actions at earlier time steps can be beneficial. This problem can be addressed by using second-order methods such as iLQR (Todorov & Li, 2005).

Multimodality of the optimization landscape. In many tasks, there can be multiple locally optimal trajectories that produce substantially different cumulative rewards. This problem can be overcome by using global optimization methods such as random shooting.

Instability of backpropagation through long chains. One needs to do backpropagation in a computational graph which contains $H$ transitions through the dynamics model $f_\theta$. Therefore, problems with explosions in the forward computations and vanishing gradients easily arise.
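Still, the basic backpropagation-through-time computation behind Equation (2) can be illustrated on toy linear dynamics, where the gradient of the return with respect to each action has a closed form (the dynamics and reward below are hypothetical, and the DAE penalty term is omitted):

```python
import numpy as np

# Toy linear dynamics s_{t+1} = s_t + a_t with reward r_t = -(s_{t+1} - 1)^2.
# Because a_t shifts every later state by the same amount, the gradient of
# the return G with respect to a_t is: dG/da_t = sum_{k >= t} -2 (s_{k+1} - 1).

H = 10                        # planning horizon
actions = np.zeros(H)
s = np.zeros(H + 1)

for _ in range(2000):         # plain gradient ascent on G
    # Forward pass: roll out the states under the model.
    for t in range(H):
        s[t + 1] = s[t] + actions[t]
    # Backward pass: accumulate the gradient from later time steps.
    grad = np.zeros(H)
    g = 0.0
    for t in reversed(range(H)):
        g += -2.0 * (s[t + 1] - 1.0)
        grad[t] = g
    actions += 0.01 * grad

final_state = s[-1]
```

Even on this benign quadratic problem the learning rate must be kept small because the curvature with respect to early actions is much larger than for late ones, which is exactly the sensitivity issue noted above.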
Despite these challenges, we demonstrate effective regularization of gradient-based trajectory optimization in an industrial process control task and a set of popular motor control tasks. We also demonstrate that the DAE regularization can be effectively used with gradient-free optimization methods such as the cross-entropy method (CEM) (Botev et al., 2013), where we optimize the sum of the expected cumulative reward and the DAE penalty term:
$$G_{\text{reg}} = G - \alpha \sum_{\tau} \| x_\tau - g(x_\tau) \|^2. \qquad (4)$$
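As a minimal illustration of optimizing an objective of the form (4) with CEM, the sketch below runs the cross-entropy method on a hypothetical two-dimensional problem in which a quadratic stand-in plays the role of the DAE penalty $\|x_\tau - g(x_\tau)\|^2$:

```python
import numpy as np

rng = np.random.default_rng(0)

def cem(objective, dim, iters=20, pop=200, elite_frac=0.1):
    """Cross-entropy method: iteratively refit a Gaussian to the elites."""
    mu, std = np.zeros(dim), np.ones(dim)
    n_elite = int(pop * elite_frac)
    for _ in range(iters):
        cand = mu + std * rng.normal(size=(pop, dim))
        scores = np.array([objective(c) for c in cand])
        elite = cand[np.argsort(scores)[-n_elite:]]       # top candidates
        mu, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu

# Toy objective in the form of Eq. (4): reward minus a weighted penalty
# that grows away from the "familiar" region around the origin.
target = np.array([2.0, -1.0])
alpha = 0.05
def objective(a):
    reward = -np.sum((a - target) ** 2)
    dae_like_penalty = np.sum(a ** 2)     # stand-in for ||x - g(x)||^2
    return reward - alpha * dae_like_penalty

a_star = cem(objective, dim=2)
```

The penalty pulls the solution slightly toward the familiar region: the optimum here is target / (1 + alpha) rather than the unregularized target.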
Algorithm. Initially, we collect training data by performing random actions or using an existing policy and train the dynamics model and DAE on this data. Then, we interact with the environment by performing regularized trajectory optimization using the learned dynamics model and DAE. After each episode, we store the transitions generated by our algorithm and retrain our dynamics model and DAE. The general procedure is described in Algorithm 1.
4 Experiments on Industrial Process Control
To study trajectory optimization, we first consider the control of a simple industrial process. An effective industrial control system can achieve better production and economic efficiency than manually operated controls. In this paper, we learn the dynamics of an industrial process and use the model to optimize the controls by minimizing a cost function. In some critical processes, safety is of utmost importance, and regularization methods could prevent adaptive control methods from exploring unsafe trajectories.
We consider the problem of controlling a continuous nonlinear two-phase reactor from (Ricker, 1993). The simulated industrial process consists of a single vessel that represents a combination of the reactor and the separation system. The process has two feeds: one contains substances A, B and C, and the other is pure A. The reaction occurs in the vapour phase. The liquid is pure D, which is the product. The process is manipulated by three valves which regulate the flows in the two feeds and an output stream which contains A, B and C. The plant has ten measured variables, including the flow rates of the four streams, the pressure, the liquid holdup volume, and the mole % of A, B and C in the purge. The control problem is to transition to a specified product rate and maintain it by manipulating the three valves. The pressure must be kept below the shutdown limit of 3000 kPa. The original paper suggests a multi-loop control strategy with several PI controllers (Ricker, 1993).
We collected simulated data corresponding to about 0.5M steps of operation by randomly generating control setpoints and using the original multi-loop control strategy. The collected data were used to train a neural network model with one layer of 80 LSTM units and a linear readout layer to predict the next-step measurements. The inputs were the three controls and the ten process measurements. The data were preprocessed by scaling such that the standard deviation of the derivatives of each measured variable was of the same scale. This way, the model better learned the dynamics of slowly changing variables. We trained a DAE with a fully-connected architecture (hidden layer sizes 100-200-100-20-100-200-100) on windows of five successive measurement-control pairs. The scaled measurement-control pairs in a window were concatenated into a single vector and corrupted with zero-mean Gaussian noise, and the DAE was trained to denoise it.
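The derivative-based scaling can be sketched as follows (with synthetic data standing in for the plant measurements): each channel is divided by the standard deviation of its first differences, so that slowly changing variables are amplified to the same scale as fast ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical plant measurements: one fast- and one slow-changing channel.
T = 1000
fast = np.cumsum(rng.normal(0, 1.0, T))
slow = np.cumsum(rng.normal(0, 0.01, T))
data = np.stack([fast, slow], axis=1)          # shape (T, channels)

# Scale each channel so the standard deviation of its first
# differences (its "derivative") is 1; slow channels get amplified,
# letting the model fit their dynamics as well as the fast ones.
scale = np.diff(data, axis=0).std(axis=0)
scaled = data / scale

diff_stds = np.diff(scaled, axis=0).std(axis=0)
```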
The trained model was then used to optimize a sequence of actions that ramps the production rate as rapidly as possible to a higher target while satisfying all other constraints (Scenario II from Ricker, 1993). We formulated the objective function as the Euclidean distance to the desired targets (after preprocessing). The targets corresponded to three measurements: the specified product rate (in kmol/h), 2850 kPa for the pressure, and 63 mole % of A in the purge.
We optimized a plan of actions 30 hours ahead (or 300 discretized time steps). The optimized sequence of controls was initialized with the original multi-loop policy applied to the trained dynamics model. That control sequence, together with the predicted and the real outcomes (black and red curves, respectively), is shown in Fig. 2a. We then optimized the control sequence using 10000 iterations of Adam with learning rate 0.01, both without and with DAE regularization.
Figure 2: (a) Multi-loop PI control. (b) No regularization. (c) DAE regularization.
The results are shown in Fig. 2. One can see that without regularization the control signals change abruptly and the trajectory imagined by the model deviates from reality (Fig. 2b). In contrast, the open-loop plan found with the DAE regularization is noticeably the best solution (Fig. 2c), leading the plant to the specified product rate much faster than the human-engineered multi-loop PI control from (Ricker, 1993). The realized trajectory stays close to the model's predictions, and the targets are reached in about ten hours. This shows that even in a low-dimensional environment with a large amount of training data, regularization is necessary for planning with a learned model.
5 Experiments on Motor Control
In order to compare the proposed method to existing model-based RL methods, we also test it on the same set of MuJoCo-based (Todorov et al., 2012; Brockman et al., 2016) continuous motor control tasks as in (Chua et al., 2018):
Cartpole. This task involves a pole attached to a moving cart on a frictionless track, with the goal of swinging up the pole and balancing it in an upright position in the center of the screen. The cost at every time step is measured as the angular distance between the tip of the pole and the target position. Each episode is 200 steps long.
Reacher
. This environment consists of a simulated PR2 robot arm with seven degrees of freedom, with the goal of reaching a particular position in space. The cost at every time step is measured as the distance between the arm and the target position. The target position changes every episode. Each episode is 150 steps long.
Pusher. This environment also consists of a simulated PR2 robot arm, with a goal of pushing an object to a target position that changes every episode. The cost at every time step is measured as the distance between the object and the target position. Each episode is 150 steps long.
Halfcheetah. This environment involves training a two-legged "half-cheetah" to run forward as fast as possible by applying torques to 6 different joints. The cost at every time step is measured as the negative forward velocity. Each episode is 1000 steps long, but the length is reduced to 200 for the benchmark with (Clavera et al., 2018).
Additionally, we also test it on the Ant environment with larger state and action spaces:
Ant. This is the most challenging environment of the benchmark (with a large state space dimension of 111). It consists of a four-legged "ant" controlled by applying torques to its 8 joints. The cost, similar to Halfcheetah, is the negative forward velocity.
5.1 Comparison to Prior Methods
We focus on the control performance in the initial 10 episodes because: 1) there is little data available and regularization is very important, 2) it is easy to disentangle the benefits of regularization since the asymptotic performance could be affected by differences in exploration, which is outside of the scope of this paper. We compare our proposed method to the following stateoftheart methods:
PETS. Probabilistic Ensembles with Trajectory Sampling (PETS) (Chua et al., 2018) consists of an ensemble of probabilistic neural networks and uses particle-based trajectory sampling to regularize trajectory optimization. We compare to the best results of PETS (denoted as PETS-1) reported in (Chua et al., 2018). In PETS-1, the next state prediction is sampled from a different model in the ensemble at each time step. We obtain the benchmark results by running the code provided by the authors.
MB-MPO. We also compare the performance of our method with Model-Based Meta Policy Optimization (MB-MPO) (Clavera et al., 2018), an approach that combines the benefits of model-based RL and meta-learning: the algorithm trains a policy using simulations generated by an ensemble of models learned from data. Meta-learning, via Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017), allows this policy to quickly adapt to the various dynamics models, and hence to adapt quickly in the real environment. We use the benchmark results published by the authors on the companion site of their paper (Clavera et al., 2018).
GP. In the Cartpole environment, we also compare our results against Gaussian Processes (GP). Here, a GP is used as the dynamics model and only the expectation of the next state prediction is considered (denoted as GP-E in Chua et al. (2018)). This algorithm performed the best in this task in (Chua et al., 2018). These benchmark results were also obtained by running the code provided by Chua et al. (2018).
5.2 Model Architecture and Hyperparameters
We use the probabilistic neural network from (Chua et al., 2018) as our baseline. Given a state-action pair, the probabilistic neural network predicts the mean and variance of the next state (assuming a Gaussian distribution). Although we only use the mean prediction, we found that also training to predict the variance improves the stability of the training. In (Chua et al., 2018), the dynamics model is only trained for 5 epochs after every episode, and we observed that this leads to underfitting of the training data and severely limits the control performance during the initial episodes. This is demonstrated in Figure 4. Since the dynamics model is poor during these initial episodes, regularizing the trajectory optimization is not sensible in this setting. To alleviate this problem, we train the dynamics model for 100 or more epochs after every episode so that the training converges. We found that this significantly improves the learning progress of the algorithm and use this as our baseline. It can also be seen in Figure 4 that similarly training PETS for more epochs does not yield noticeable improvements over the baseline.

We use dynamics models with the same architecture for all environments: 3 hidden layers of size 200 with the Swish nonlinearity (Ramachandran et al., 2017). Similar to prior works, we train the dynamics model to predict the difference $s_{t+1} - s_t$ instead of predicting $s_{t+1}$ directly. We use the same architecture as the dynamics model for the denoising autoencoder. The state-action pairs in the past episodes were corrupted with zero-mean Gaussian noise and the DAE was trained to denoise them. The important hyperparameters used in our experiments are reported in the Appendix.
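The construction of the training pairs described above can be sketched as follows; the buffer contents and dimensions are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical replay buffer of states and actions.
T, s_dim, a_dim = 100, 5, 2
states = rng.normal(size=(T, s_dim))
actions = rng.normal(size=(T, a_dim))

# Dynamics-model training pairs: inputs (s_t, a_t), targets s_{t+1} - s_t.
model_inputs = np.concatenate([states[:-1], actions[:-1]], axis=1)
model_targets = states[1:] - states[:-1]

# DAE training pairs: corrupt state-action vectors with zero-mean
# Gaussian noise; the DAE is trained to recover the clean vector.
sigma = 0.1                       # noise level (a tuned hyperparameter)
clean = np.concatenate([states, actions], axis=1)
corrupted = clean + sigma * rng.normal(size=clean.shape)
```

Predicting the state difference rather than the next state keeps the regression targets small and roughly zero-centered, which is easier to fit.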
5.3 Results
Comparison to Chua et al. The learning progress of our algorithm is illustrated in Figure 3. In (Chua et al., 2018), the learning curves of all algorithms are reported using the maximum rewards seen so far. This can be misleading, so we instead report the average returns across different seeds after each episode. In Cartpole, all the methods eventually converge to the maximum cumulative reward. There are, however, differences in how quickly the methods converge, with the proposed method converging the fastest. Interestingly, our algorithm also surpasses the Gaussian Process (GP) baseline, which is known to be a sample-efficient method widely used for control of simple systems. In Reacher, the proposed method converges to the same asymptotic performance as PETS, but faster. In Pusher, all algorithms perform similarly. In Halfcheetah, the proposed method is the fastest, learning an effective running gait in only a couple of episodes (videos of our agents during training can be found at https://sites.google.com/view/regularizingmbrlwithdae/home). Denoising regularization is effective for both gradient-free and gradient-based planning, with gradient-based planning performing the best. The result after 10 episodes using the proposed method is an improvement over (Chua et al., 2018), and even surpasses the asymptotic performance of several recent model-free algorithms as reported in (Fujimoto et al., 2018; Duan et al., 2016). In all the tested environments, denoising regularization shows a faster or comparable learning speed, quickly reaching good asymptotic performance.
Comparison to Clavera et al. In Figure 5, we compare our method against MB-MPO and other model-based methods included in (Clavera et al., 2018) on the Halfcheetah environment with shorter episodes (200 time steps). Also in this case, our method learns faster than the comparison methods.
Comparison to Gaussian regularization. To emphasize the importance of denoising regularization, we also compare against a simple Gaussian regularization baseline: we fit a Gaussian distribution (with diagonal covariance matrix) to the states and actions in the replay buffer and regularize the trajectory optimization by adding a penalty term to the cost, proportional to the negative log probability of the states and actions in the trajectory (Equation 1). The performance of this baseline in the Halfcheetah task (with an episode length of 200) is shown in Figure 6. We observe that the Gaussian distribution fits the trajectories poorly and consistently leads the optimization to a bad local minimum.
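A minimal sketch of this Gaussian baseline (with a hypothetical replay buffer) fits a diagonal Gaussian to the buffer and uses the negative log probability as the penalty:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical replay buffer of state-action vectors.
buffer = rng.normal(loc=1.0, scale=0.5, size=(10000, 4))

# Fit a diagonal Gaussian to the buffer.
mu = buffer.mean(axis=0)
var = buffer.var(axis=0)

def neg_log_prob(x):
    """Penalty term: negative log N(x; mu, diag(var))."""
    return 0.5 * np.sum((x - mu) ** 2 / var + np.log(2 * np.pi * var))

in_dist = neg_log_prob(mu)                 # a typical point
out_dist = neg_log_prob(mu + 5.0)          # far from the buffer
```

Because a single diagonal Gaussian cannot capture the multimodal, correlated structure of real trajectories, this penalty is a much cruder familiarity measure than the DAE denoising error.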
Experiments with Ant. In Figure 7, we show our results on the Ant environment. We use the OpenAI Gym environment, and episodes are terminated if the done condition is met. We use the forward velocity as the reward at each step. We collect initial data using a random policy (choosing actions uniformly in the range of -0.5 to 0.5) for 200 episodes. Then, we train the agents online for 10 episodes. After the random exploration, trajectory optimization with denoising regularization learns very quickly and performs much better than the baseline.
6 Related Work
Several methods have been proposed for planning with learned dynamics models. Locally linear time-varying models (Kumar et al., 2016; Levine & Abbeel, 2014) and Gaussian processes (Deisenroth & Rasmussen, 2011; Ko et al., 2007) are data-efficient but have problems scaling to high-dimensional environments. Recently, deep neural networks have been successfully applied to model-based RL. Nagabandi et al. (2018) use deep neural networks as dynamics models in model-predictive control to achieve good performance, and then show how model-based RL can be fine-tuned with a model-free approach to achieve even better performance. Chua et al. (2018) introduce PETS, a method to improve model-based performance by estimating and propagating uncertainty with an ensemble of networks and sampling techniques. They demonstrate how their approach can beat several recent model-based and model-free techniques. Clavera et al. (2018) combine model-based RL and meta-learning with MB-MPO, training a policy to quickly adapt to slightly different learned dynamics models, thus enabling faster learning.
Levine & Koltun (2013) and Kumar et al. (2016) use a KL-divergence penalty between action distributions to stay close to the training distribution. Similar bounds are also used to stabilize training of policy-gradient methods (Schulman et al., 2015, 2017). While such a KL penalty bounds the evolution of action distributions, the proposed method also bounds the familiarity of states, which could be important in high-dimensional state spaces. While penalizing unfamiliar states also penalizes exploration, it allows for more controlled and efficient exploration. Exploration is out of the scope of this paper but was studied in (Di Palo & Valpola, 2018), where a non-zero optimum of the proposed DAE penalty was used as an intrinsic reward to alternate between familiarity and exploration.
7 Conclusion
In this work, we introduced a novel algorithm for regularization in model-based reinforcement learning. We tackled the problem of regularizing trajectory optimization in order to avoid the planning inaccuracies caused by out-of-distribution errors of deep neural networks. After a theoretical analysis of the proposed method, we empirically demonstrated how this approach enables fast learning in different continuous control environments.
In recent years, a lot of effort has been put into making deep reinforcement learning algorithms more sample-efficient, and thus adaptable to real-world scenarios. Model-based reinforcement learning has shown promising results, obtaining sample-efficiency even orders of magnitude better than model-free counterparts, but these methods have often suffered from suboptimal performance for a variety of reasons. As already noted in the recent literature (Nagabandi et al., 2018; Chua et al., 2018), out-of-distribution errors and model overfitting are often sources of performance degradation when using complex function approximators, and our experiments highlight how tackling these problems improves the performance of model-based reinforcement learning. We argue that the increase in learning speed can make this method, if further explored, a viable solution for real-world motor control learning in complex robots.
Recent work on uncertainty estimation in model-based control has typically either used less complex models, like Gaussian processes, or approximated Bayesian neural networks using ensembles of models and sampling, as in (Chua et al., 2018), or stochastic dropout over several feedforward computations (Kahn et al., 2017; Gal, 2016). In this paper, we used denoising autoencoders as a regularization tool in trajectory optimization and showed that this enables state-of-the-art sample-efficiency in a set of popular continuous control tasks.
A possible avenue for further research would be to explore gradient-based trajectory optimization for multimodal distributions. Gradient-based optimization is expected to scale more easily to high-dimensional action spaces and could pave the way for successes in the low-data regime of complex reinforcement learning problems.
Acknowledgements
We would like to thank Jussi Sainio, Jari Rosti and Isabeau PrémontSchwarz for their valuable contributions in the experiments on industrial process control.
References
 Arponen et al. (2017) Arponen, H., Herranen, M., and Valpola, H. On the exact relationship between the denoising function and the data distribution. arXiv preprint arXiv:1709.02797, 2017.
 Arulkumaran et al. (2017) Arulkumaran, K., Deisenroth, M. P., Brundage, M., and Bharath, A. A. A brief survey of deep reinforcement learning. arXiv preprint arXiv:1708.05866, 2017.
 Botev et al. (2013) Botev, Z. I., Kroese, D. P., Rubinstein, R. Y., and L'Ecuyer, P. The cross-entropy method for optimization. In Handbook of Statistics, volume 31, pp. 35–59. Elsevier, 2013.
 Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI gym. arXiv preprint arXiv:1606.01540, 2016.
 Chua et al. (2018) Chua, K., Calandra, R., McAllister, R., and Levine, S. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 4759–4770. 2018.
 Clavera et al. (2018) Clavera, I., Rothfuss, J., Schulman, J., Fujita, Y., Asfour, T., and Abbeel, P. Model-based reinforcement learning via meta-policy optimization. arXiv preprint arXiv:1809.05214, 2018.
 Deisenroth & Rasmussen (2011) Deisenroth, M. and Rasmussen, C. E. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 465–472, 2011.
 Deisenroth et al. (2013) Deisenroth, M. P., Neumann, G., Peters, J., et al. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1–2):1–142, 2013.
 Di Palo & Valpola (2018) Di Palo, N. and Valpola, H. Improving model-based control and active exploration with reconstruction uncertainty optimization. arXiv preprint arXiv:1812.03955, 2018.
 Duan et al. (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338, 2016.
 Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 1126–1135, 2017.
 Fujimoto et al. (2018) Fujimoto, S., Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pp. 1582–1591, 2018.

 Gal (2016) Gal, Y. Uncertainty in Deep Learning. University of Cambridge, 2016.
 Kahn et al. (2017) Kahn, G., Villaflor, A., Pong, V., Abbeel, P., and Levine, S. Uncertainty-aware reinforcement learning for collision avoidance. arXiv preprint arXiv:1702.01182, 2017.
 Ko et al. (2007) Ko, J., Klein, D. J., Fox, D., and Haehnel, D. Gaussian processes and reinforcement learning for identification and control of an autonomous blimp. In Robotics and Automation, 2007 IEEE International Conference on, pp. 742–747. IEEE, 2007.
 Kouvaritakis & Cannon (2001) Kouvaritakis, B. and Cannon, M. Nonlinear Predictive Control: Theory and Practice. IET, 2001.
 Kumar et al. (2016) Kumar, V., Todorov, E., and Levine, S. Optimal control with learned local models: Application to dexterous manipulation. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pp. 378–383. IEEE, 2016.
 Kurutach et al. (2018) Kurutach, T., Clavera, I., Duan, Y., Tamar, A., and Abbeel, P. Model-ensemble trust-region policy optimization. In International Conference on Learning Representations, 2018.
 Levine & Abbeel (2014) Levine, S. and Abbeel, P. Learning neural network policies with guided policy search under unknown dynamics. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27, pp. 1071–1079. Curran Associates, Inc., 2014.
 Levine & Koltun (2013) Levine, S. and Koltun, V. Guided policy search. In International Conference on Machine Learning, pp. 1–9, 2013.
 Lowrey et al. (2018) Lowrey, K., Rajeswaran, A., Kakade, S., Todorov, E., and Mordatch, I. Plan online, learn offline: Efficient learning and exploration via modelbased control. arXiv preprint arXiv:1811.01848, 2018.
 Mayne et al. (2000) Mayne, D. Q., Rawlings, J. B., Rao, C. V., and Scokaert, P. O. Constrained model predictive control: Stability and optimality. Automatica, 36(6):789–814, 2000.
 Nagabandi et al. (2018) Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7559–7566. IEEE, 2018.

 Nguyen et al. (2017) Nguyen, A., Clune, J., Bengio, Y., Dosovitskiy, A., and Yosinski, J. Plug & play generative networks: Conditional iterative generation of images in latent space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4467–4477, 2017.
 Parmas et al. (2018) Parmas, P., Rasmussen, C. E., Peters, J., and Doya, K. PIPPS: Flexible model-based policy search robust to the curse of chaos. In International Conference on Machine Learning, pp. 4062–4071, 2018.
 Ramachandran et al. (2017) Ramachandran, P., Zoph, B., and Le, Q. V. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.
 Ricker (1993) Ricker, N. L. Model predictive control of a continuous, nonlinear, twophase reactor. Journal of Process Control, 3(2):109–123, 1993.
 Rossiter (2003) Rossiter, J. Model-Based Predictive Control: A Practical Approach. CRC Press, 2003.
 Schulman et al. (2015) Schulman, J., Levine, S., Moritz, P., Jordan, M., and Abbeel, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1889–1897, 2015.
 Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Tassa et al. (2012) Tassa, Y., Erez, T., and Todorov, E. Synthesis and stabilization of complex behaviors through online trajectory optimization. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4906–4913. IEEE, 2012.
 Tassa et al. (2014) Tassa, Y., Mansard, N., and Todorov, E. Controllimited differential dynamic programming. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 1168–1175. IEEE, 2014.
 Todorov & Li (2005) Todorov, E. and Li, W. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In American Control Conference, 2005. Proceedings of the 2005, pp. 300–306. IEEE, 2005.
 Todorov et al. (2012) Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for modelbased control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.
 Vincent et al. (2010) Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.A. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(Dec):3371–3408, 2010.
Appendix A Additional Experimental Details
The important hyperparameters for all our experiments are shown in Tables 1, 2 and 3. We found the DAE noise level, regularization penalty weight and Adam learning rate to be the most important hyperparameters.
Table 1: Hyperparameters for the motor control tasks (Section 5).

Environment  | Optimizer | Optim Iters | Epochs | Adam LR | DAE noise | Batch Norm
Cartpole     | CEM       | 5           | 500    | -       | 0.001     | 0.1
Cartpole     | Adam      | 10          | 500    | 0.001   | 0.001     | 0.2
Reacher      | CEM       | 5           | 500    | -       | 0.01      | 0.1
Reacher      | Adam      | 5           | 300    | 1       | 0.01      | 0.1
Pusher       | CEM       | 5           | 500    | -       | 0.01      | 0.1
Pusher       | Adam      | 5           | 300    | 1       | 0.01      | 0.1
Halfcheetah  | CEM       | 5           | 100    | -       | 2         | 0.1
Halfcheetah  | Adam      | 10          | 200    | 0.1     | 1         | 0.2
Table 2: Hyperparameters for the Halfcheetah benchmark with 200-step episodes (comparison to Clavera et al., 2018).

Environment  | Optimizer | Optim Iters | Epochs | Adam LR | DAE noise | Batch Norm
Halfcheetah  | CEM       | 5           | 20     | -       | 2         | 0.2
Halfcheetah  | Adam      | 10          | 40     | 0.1     | 1         | 0.1
Table 3: Hyperparameters for the Ant experiments.

Environment  | Optimizer | Optim Iters | Epochs | Adam LR | DAE noise | Batch Norm
Ant          | CEM       | 5           | 100    | -       | 0.5       | 0.2