In model-based reinforcement learning (RL), the actions of an agent are computed using an explicit model of the environment dynamics. Model-based control with human-engineered, simulator-based models is widely used in robotics and has been demonstrated to solve challenging tasks such as humanoid locomotion (Tassa et al., 2012, 2014) and dexterous in-hand manipulation (Lowrey et al., 2018). However, this approach is not possible when a simulator is unavailable (the environment is unknown) or too slow. In such cases, we can learn a dynamics model of the environment and use it for planning.
Arguably the most important benefit of model-based RL algorithms is that they are typically more sample-efficient than their model-free counterparts (Deisenroth et al., 2013; Arulkumaran et al., 2017; Chua et al., 2018). Due to this sample-efficiency, model-based RL methods are attractive in real-world applications where collecting samples is expensive (Deisenroth & Rasmussen, 2011). In addition, data collected with any policy can be used to train the dynamics model, and the dynamics model can transfer well to other tasks in the same environment.
Trajectory optimization with learned dynamics models is particularly challenging because overconfident predictions of out-of-distribution trajectories can yield highly over-optimistic rewards. Powerful function approximators such as deep neural networks are required to learn the dynamics of complex environments, but such models are likely to produce erroneous predictions outside the training distribution. This problem can be severe in high-dimensional environments, especially when little training data is available. Recent works learn an ensemble of dynamics models to attenuate this problem (Chua et al., 2018; Kurutach et al., 2018; Clavera et al., 2018).
In this paper, we propose to use a denoising autoencoder (DAE) (Vincent et al., 2010) to regularize trajectory optimization by penalizing trajectories that are unlikely to appear in the past experience. The DAE is trained to denoise the same trajectories used to train the dynamics models. The intuition is that the denoising error will be large for trajectories that are far from the training distribution, signaling that the dynamics model predictions will be less reliable as it has not been trained on such data. We can use this as a regularization term to optimize for trajectories that yield a high return while keeping the uncertainty low. Previously, DAEs have been successfully used to regularize conditional iterative generation of images (Nguyen et al., 2017).
The main contributions of this work are as follows:
Denoising regularization. We present a new algorithm for regularizing trajectory optimization with learned dynamics models. We empirically demonstrate the efficacy of the algorithm in open-loop control of an industrial process and closed-loop control of a set of popular motor control tasks.
Gradient-based trajectory optimization. We demonstrate effective gradient-based trajectory optimization with learned global dynamics models to achieve state-of-the-art sample-efficiency.
2 Model-Based Reinforcement Learning
Problem setting. At every time step $t$, the environment is in a state $s_t$ and the agent observes $o_t$. In a partially observable Markov decision process (POMDP), the state $s_t$ is hidden and the observation $o_t$ does not completely reveal $s_t$. In a fully observable Markov decision process (MDP), the state is observable and hence $o_t = s_t$. The agent takes an action $a_t$, causing the environment to transition to a new state $s_{t+1}$, following the stochastic dynamics $p(s_{t+1} \mid s_t, a_t)$, and the agent receives a reward $r_t = r(s_t, a_t)$. The goal is to implement a policy that maximizes the expected cumulative reward $\mathbb{E}\left[\sum_t r_t\right]$.
Learned dynamics models. Model-based RL differs from model-free RL by modeling the dynamics and using the model to influence the choice of actions. In this paper, we use environments with deterministic dynamics, i.e., $s_{t+1} = f(s_t, a_t)$. In fully observable environments (such as in Section 5), the dynamics model can be a fully-connected network trained to predict the state transition from time $t$ to $t+1$:
$$\hat{s}_{t+1} = f_\theta(s_t, a_t).$$
In partially observable environments (such as in Section 4), the dynamics model can be a recurrent neural network trained to directly predict the future observations based on past observations and actions:
$$\hat{o}_{t+1} = f_\theta(o_1, \dots, o_t, a_1, \dots, a_t).$$
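As a concrete illustration of the fully observable case, a deterministic forward model can be sketched as a small MLP over the concatenated state and action. This is only a sketch: the layer sizes and single hidden layer here are illustrative, not the architecture used in the experiments.

```python
import numpy as np

class ForwardModel:
    """Minimal deterministic dynamics model s_{t+1} = f(s_t, a_t).

    A single-hidden-layer MLP with random (untrained) weights, purely to
    illustrate the input/output structure of the dynamics model.
    """

    def __init__(self, state_dim, action_dim, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = state_dim + action_dim
        self.W1 = rng.normal(0, 1.0 / np.sqrt(in_dim), (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, 1.0 / np.sqrt(hidden), (hidden, state_dim))
        self.b2 = np.zeros(state_dim)

    def predict(self, state, action):
        # Concatenate (s_t, a_t), pass through the network, output s_{t+1}.
        x = np.concatenate([state, action])
        h = np.tanh(x @ self.W1 + self.b1)
        return h @ self.W2 + self.b2

model = ForwardModel(state_dim=4, action_dim=2)
next_state = model.predict(np.zeros(4), np.ones(2))
```

In the partially observable case, the same interface holds but the network carries a recurrent hidden state summarizing past observations and actions.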
In this paper, we assume access to the reward function and that it can be computed from the agent observations, i.e., $r_t = r(o_t, a_t)$, although the approach can easily be extended by modeling the reward function as well.
Trajectory optimization. The learned dynamics model is used as a proxy for the environment to optimize a trajectory of actions that maximizes the expected cumulative reward. The goal of trajectory optimization is, given the past observations $o_1, \dots, o_t$ and a planning horizon $H$, to find the sequence of actions $(a_t, \dots, a_{t+H-1})$ that yields the highest expected cumulative reward
$$G = \mathbb{E}\left[\sum_{\tau=t}^{t+H-1} r(o_\tau, a_\tau)\right],$$
such that $\hat{s}_{\tau+1} = f(s_\tau, a_\tau)$ (for MDPs) or $\hat{o}_{\tau+1} = f(o_1, \dots, o_\tau, a_1, \dots, a_\tau)$ (for POMDPs).
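Evaluating a candidate action sequence therefore amounts to rolling the model forward and summing the rewards. A minimal sketch, using a stand-in linear model and a quadratic cost (both hypothetical, for illustration only):

```python
import numpy as np

def rollout_return(dynamics, reward, s0, actions):
    """Cumulative reward of an open-loop action sequence under a model."""
    s, total = s0, 0.0
    for a in actions:
        total += reward(s, a)
        s = dynamics(s, a)  # model rollout: feed predictions back in
    return total

# Stand-in components: a stable linear system and a cost penalizing
# distance from the origin plus control effort.
dynamics = lambda s, a: 0.9 * s + 0.1 * a
reward = lambda s, a: -float(s @ s) - 0.01 * float(a @ a)

s0 = np.array([1.0, -1.0])
plan = [np.zeros(2) for _ in range(10)]
G = rollout_return(dynamics, reward, s0, plan)
```

A planner searches over `actions` to maximize this quantity; the next sections discuss why this search needs regularization when `dynamics` is learned.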
Open-loop control and closed-loop control. The optimized sequence of actions from trajectory optimization can be directly applied to the environment without any further interaction (open-loop control) or provided as suggestions to a human (human-in-the-loop). We demonstrate effective open-loop control of an industrial process in Section 4. Open-loop control is challenging because the dynamics model has to be able to make accurate long-range predictions. We can account for modeling errors and feedback from the environment by only taking the first action of the optimized trajectory and then re-planning at each step (closed-loop control). In the control literature, this flavor of model-based RL is called model-predictive control (MPC) (Mayne et al., 2000; Rossiter, 2003; Kouvaritakis & Cannon, 2001; Nagabandi et al., 2018). MPC has been used successfully to control various simulated robotic agents, as shown in Tassa et al. (2012, 2014); Lowrey et al. (2018). We demonstrate closed-loop control of several motor control tasks using MPC in Section 5.
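The closed-loop (MPC) scheme described above can be sketched as follows: optimize a full plan, execute only its first action, then re-plan from the new observation. The components below (`env_step`, `plan_fn`) are hypothetical stand-ins, not the experimental setup.

```python
def mpc_step(plan_fn, obs):
    """Plan a full action sequence but return only its first action."""
    plan = plan_fn(obs)
    return plan[0]

def run_closed_loop(env_step, plan_fn, obs, n_steps):
    """Receding-horizon control: re-plan from scratch at every step."""
    trajectory = [obs]
    for _ in range(n_steps):
        a = mpc_step(plan_fn, obs)
        obs = env_step(obs, a)  # feedback from the real environment
        trajectory.append(obs)
    return trajectory

# Stand-in pieces: scalar dynamics and a "planner" that steers toward 0.
env_step = lambda s, a: s + a
plan_fn = lambda s: [-0.5 * s] * 10  # hypothetical fixed-horizon plan

traj = run_closed_loop(env_step, plan_fn, obs=8.0, n_steps=6)
```

Because only the first action of each plan is executed, modeling errors later in the horizon are corrected by the subsequent re-planning steps.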
3 Regularized Trajectory Optimization
The trajectory optimization problem is known to be highly non-trivial, with many sources of numerical instability. One key problem is that the estimates of the expected return can be grossly inaccurate for out-of-distribution trajectories, so the agent is tempted to try trajectories that look very different from the collected experience.
In this paper, we propose to address this problem by penalizing trajectories that are unlikely to appear in the past experience. This can be achieved by adding a penalty term to the objective function:
$$G_{\text{reg}} = \mathbb{E}\left[\sum_{\tau=t}^{t+H-1} r(o_\tau, a_\tau)\right] + \alpha \log p(o_t, a_t, \dots, o_{t+H-1}, a_{t+H-1}), \tag{1}$$
where $p(\cdot)$ is the probability of observing the trajectory in the past experience and $\alpha$ is a tuning hyperparameter. In this paper, instead of using the joint probability of the whole trajectory, we use marginal probabilities over short windows of size $w$:
$$G_{\text{reg}} = \mathbb{E}\left[\sum_{\tau=t}^{t+H-1} r(o_\tau, a_\tau)\right] + \alpha \sum_{\tau} \log p(x_\tau),$$
where $x_\tau$ denotes a window of $w$ consecutive observation-action pairs.
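A sketch of how the windowed penalty enters the objective. The `log_prob` function here is a placeholder density (a standard normal over the window vector); in the paper this role is played implicitly by the denoising autoencoder rather than an explicit density model.

```python
import numpy as np

def windows(observations, actions, w):
    """Concatenate w successive observation-action pairs into vectors x_tau."""
    pairs = [np.concatenate([o, a]) for o, a in zip(observations, actions)]
    return [np.concatenate(pairs[i:i + w]) for i in range(len(pairs) - w + 1)]

def regularized_return(rewards, observations, actions, log_prob, alpha, w):
    """Expected cumulative reward plus the windowed familiarity penalty."""
    penalty = sum(log_prob(x) for x in windows(observations, actions, w))
    return sum(rewards) + alpha * penalty

# Placeholder density over a window vector (a real implementation would
# use a model trained on past experience).
log_prob = lambda x: -0.5 * float(x @ x) - 0.5 * len(x) * np.log(2 * np.pi)

obs = [np.zeros(3) for _ in range(5)]
acts = [np.zeros(1) for _ in range(5)]
G = regularized_return([1.0] * 5, obs, acts, log_prob, alpha=0.1, w=2)
```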
The optimal sequence of actions can be found by a gradient-based optimization procedure. The gradients $\partial G_{\text{reg}} / \partial a_\tau$ can be computed by backpropagation in a computational graph using the trained dynamics model (see Fig. 1). In such a backpropagation-through-time procedure, one needs to compute the gradient of the penalty term with respect to the actions at each time instance $\tau$, which involves
$$\frac{\partial \log p(x_\tau)}{\partial x_\tau},$$
where we denote by $x_\tau$ the concatenated vector of observations $o_{\tau-w+1}, \dots, o_\tau$ and actions $a_{\tau-w+1}, \dots, a_\tau$ over a window of size $w$. Thus, to enable a regularized gradient-based optimization procedure, we need a means to compute $\partial \log p(x) / \partial x$.
In order to evaluate $\log p(x)$ (or its derivative), one needs to train a separate model of the past experience, which is a task of unsupervised learning. In principle, any probabilistic model can be used for this. In this paper, we propose to regularize trajectory optimization with a denoising autoencoder (DAE) (Vincent et al., 2010), which does not build an explicit probabilistic model $p(x)$ but rather learns to approximate the derivative $\partial \log p(x) / \partial x$ of the log probability density, which can be used directly in a gradient-based optimization procedure.
The theory of denoising autoencoders (Arponen et al., 2017) states that the optimal denoising function $g(\tilde{x})$ (for zero-mean Gaussian corruption) is given by
$$g(\tilde{x}) = \tilde{x} + \sigma_n^2 \frac{\partial \log p(\tilde{x})}{\partial \tilde{x}},$$
where $p(\tilde{x})$ is the probability density function of the data corrupted with noise and $\sigma_n$ is the standard deviation of the Gaussian corruption. Thus, by training a DAE $g$ and assuming $p(\tilde{x}) \approx p(x)$, we can approximate the required gradient as
$$\frac{\partial \log p(x)}{\partial x} \approx \frac{g(x) - x}{\sigma_n^2}.$$
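The identity above can be checked in closed form when the data distribution is Gaussian: for data $x \sim \mathcal{N}(0, s^2)$ corrupted with noise of variance $\sigma_n^2$, the optimal denoiser is $g(\tilde{x}) = \tilde{x}\, s^2/(s^2+\sigma_n^2)$, and $(g(x)-x)/\sigma_n^2$ recovers the score of the corrupted density exactly. A purely illustrative check (in practice $g$ is a trained network):

```python
import numpy as np

s2 = 4.0         # data variance
sigma_n2 = 0.25  # corruption variance

def g(x):
    """Optimal denoiser for N(0, s2) data under N(0, sigma_n2) corruption."""
    return x * s2 / (s2 + sigma_n2)

def score_corrupted(x):
    """d/dx log p~(x) for the corrupted marginal N(0, s2 + sigma_n2)."""
    return -x / (s2 + sigma_n2)

x = np.linspace(-3, 3, 7)
approx = (g(x) - x) / sigma_n2  # DAE-based score estimate
exact = score_corrupted(x)      # analytic score
```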
In automatic differentiation software, this gradient can be computed by adding the term $g(x) - x$ to the backpropagated gradient and stopping the gradient propagation through $g$. In practice, stopping the gradient through $g$ did not yield any benefits in our experiments compared to simply adding the denoising error $\lVert g(x) - x \rVert^2$ as a weighted penalty on the cumulative reward, so we used this simpler penalty in our experiments.
Although gradient-based trajectory optimization is potentially effective, in practice it can fail for several reasons:
The optimized reward is much more sensitive to the actions at the beginning of a plan than at the end, so using smaller learning rates for earlier actions (smaller $\tau$) can be beneficial. This problem can also be addressed by using second-order methods such as iLQR (Todorov & Li, 2005).
Multi-modality of the optimization landscape. In many tasks, there can be multiple locally optimal trajectories that produce substantially different cumulative rewards. This problem can be overcome by using global optimization methods such as random shooting.
One needs to backpropagate through a computational graph that chains $H$ transitions of the dynamics model $f$. Therefore, problems with explosions in the forward computations and vanishing gradients easily arise.
Despite these challenges, we demonstrate effective regularization of gradient-based trajectory optimization in an industrial process control task and a set of popular motor control tasks. We also demonstrate that the DAE regularization can be effectively used with gradient-free optimization methods such as the cross-entropy method (CEM) (Botev et al., 2013), where we optimize the sum of the expected cumulative reward and the DAE penalty term.
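A minimal sketch of CEM optimizing a penalized objective on a toy one-step problem. The objective and the penalty are stand-ins: a real planner would roll out the dynamics model for the reward term and use the DAE denoising error as the penalty.

```python
import numpy as np

def cem(objective, dim, iters=20, pop=200, elite=20, seed=0):
    """Cross-entropy method: refit a Gaussian to the elite samples each round."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=(pop, dim))
        scores = np.array([objective(x) for x in samples])
        top = samples[np.argsort(scores)[-elite:]]      # highest scores
        mu, sigma = top.mean(axis=0), top.std(axis=0) + 1e-6
    return mu

# Toy penalized objective: reward peaked at a = 2, with a familiarity
# penalty (stand-in for the DAE term) keeping actions near the origin.
alpha = 1.0
objective = lambda a: -np.sum((a - 2.0) ** 2) - alpha * np.sum(a ** 2)

a_star = cem(objective, dim=3)
```

The penalty pulls the optimum from the reward peak at 2 toward the "familiar" region around 0; the compromise lands near 1 in each dimension.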
Algorithm. Initially, we collect training data by performing random actions or using an existing policy and train the dynamics model and DAE on this data. Then, we interact with the environment by performing regularized trajectory optimization using the learned dynamics model and DAE. After each episode, we store the transitions generated by our algorithm and re-train our dynamics model and DAE. The general procedure is described in Algorithm 1.
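The procedure can be sketched as the following loop. All components here (`ToyEnv`, `StubModel`, the trivial planner) are hypothetical stand-ins used only to exercise the control flow of Algorithm 1.

```python
import random

class ToyEnv:
    """Hypothetical 1-D environment, used only to exercise the loop."""
    def run_episode(self, policy, length=5):
        transitions, s = [], 0.0
        for _ in range(length):
            a = policy(s)
            s_next = 0.5 * s + a              # stand-in true dynamics
            transitions.append((s, a, s_next))
            s = s_next
        return transitions

class StubModel:
    """Placeholder for the dynamics model / DAE; fit() just records data size."""
    def fit(self, buffer):
        self.n_seen = len(buffer)

def train_agent(env, dynamics, dae, planner, n_random_episodes, n_episodes):
    """Algorithm 1: random exploration, then alternate training and planning."""
    buffer = []
    for _ in range(n_random_episodes):        # initial random data collection
        buffer.extend(env.run_episode(lambda s: random.uniform(-1, 1)))
    for _ in range(n_episodes):
        dynamics.fit(buffer)                  # re-train dynamics model
        dae.fit(buffer)                       # re-train DAE
        plan = lambda s: planner(s, dynamics, dae)
        buffer.extend(env.run_episode(plan))  # regularized trajectory optimization
    return buffer

random.seed(0)
dynamics, dae = StubModel(), StubModel()
data = train_agent(ToyEnv(), dynamics, dae,
                   planner=lambda s, f, g: 0.0,  # trivial stand-in planner
                   n_random_episodes=2, n_episodes=3)
```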
4 Experiments on Industrial Process Control
To study trajectory optimization, we first consider the problem of control of a simple industrial process. An effective industrial control system could achieve better production and economic efficiency than manually operated controls. In this paper, we learn the dynamics of an industrial process and use it to optimize the controls, by minimizing a cost function. In some critical processes, safety is of utmost importance and regularization methods could prevent adaptive control methods from exploring unsafe trajectories.
We consider the problem of controlling the continuous nonlinear two-phase reactor from (Ricker, 1993). The simulated industrial process consists of a single vessel that represents a combination of the reactor and the separation system. The process has two feeds: one contains substances A, B and C, and the other is pure A. The reaction occurs in the vapour phase, and the liquid product is pure D. The process is manipulated by three valves, which regulate the flows in the two feeds and in an output stream containing A, B and C. The plant has ten measured variables, including the flow rates of the four streams, the pressure, the liquid holdup volume, and the mole % of A, B and C in the purge. The control problem is to transition to a specified product rate and maintain it by manipulating the three valves, while keeping the pressure below the shutdown limit of 3000 kPa. The original paper suggests a multiloop control strategy with several PI controllers (Ricker, 1993).
We collected simulated data corresponding to about 0.5M steps of operation by randomly generating control setpoints and using the original multiloop control strategy. The collected data were used to train a neural network model with one layer of 80 LSTM units and a linear readout layer to predict the next-step measurements. The inputs were the three controls and the ten process measurements. The data were pre-processed by scaling each measured variable such that the standard deviations of the derivatives of all variables were of the same scale; this way, the model better learned the dynamics of slowly changing variables. For the DAE, we used a fully-connected network with hidden layers of sizes 100-200-100-20-100-200-100, trained on windows of five successive measurement-control pairs. The scaled measurement-control pairs in a window were concatenated into a single vector and corrupted with zero-mean Gaussian noise, and the DAE was trained to denoise it.
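The described scaling, equalizing the standard deviation of each variable's first differences, can be sketched as follows (the signals here are synthetic stand-ins for the plant measurements):

```python
import numpy as np

def derivative_scaling(data):
    """Scale each column so the std of its first differences is 1.

    `data` has shape (timesteps, variables); slowly changing variables get
    larger weights, so the model is not dominated by fast-moving ones.
    """
    diffs = np.diff(data, axis=0)
    scale = diffs.std(axis=0)
    scale[scale == 0] = 1.0  # guard against constant variables
    return data / scale, scale

# Synthetic stand-ins: a fast-changing and a slowly changing signal.
t = np.arange(200)
data = np.stack([np.sin(t), np.sin(0.05 * t)], axis=1)
scaled, scale = derivative_scaling(data)
```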
The trained model was then used to optimize a sequence of actions to ramp production as rapidly as possible to a higher product rate while satisfying all other constraints (Scenario II from Ricker, 1993). We formulated the objective function as the Euclidean distance to the desired targets (after pre-processing) for three measurements: the product rate, a pressure of 2850 kPa, and 63 mole % of A in the purge.
We optimized a plan of actions 30 hours ahead (300 discretized time steps). The sequence of controls was initialized with the original multiloop policy applied to the trained dynamics model. That control sequence, together with the predicted and the real outcomes (black and red curves, respectively), is shown in Fig. 2a. We then optimized the control sequence using 10000 iterations of Adam with learning rate 0.01, both without and with DAE regularization.
Figure 2: (a) multiloop PI control, (b) no regularization, (c) DAE regularization.
The results are shown in Fig. 2. Without regularization, the control signals change abruptly and the trajectory imagined by the model deviates from reality (Fig. 2b). In contrast, the open-loop plan found with DAE regularization is clearly the best solution (Fig. 2c), leading the plant to the specified product rate much faster than the human-engineered multiloop PI control from (Ricker, 1993). The realized trajectory stays close to the predictions, and the targets are reached in about ten hours. This shows that even in a low-dimensional environment with a large amount of training data, regularization is necessary for planning with a learned model.
5 Experiments on Motor Control
In order to compare the proposed method to existing model-based RL methods, we also test it on the same set of MuJoCo-based (Todorov et al., 2012; Brockman et al., 2016) continuous motor control tasks as in (Chua et al., 2018):
Cartpole. This task involves a pole attached to a cart moving on a frictionless track, with the goal of swinging up the pole and balancing it in an upright position in the center of the screen. The cost at every time step is measured as the angular distance between the tip of the pole and the target position. Each episode is 200 steps long.
Reacher. This environment consists of a simulated PR2 robot arm with seven degrees of freedom, with the goal of reaching a particular position in space. The cost at every time step is measured as the distance between the arm and the target position. The target position changes every episode. Each episode is 150 steps long.
Pusher. This environment also consists of a simulated PR2 robot arm, with a goal of pushing an object to a target position that changes every episode. The cost at every time step is measured as the distance between the object and the target position. Each episode is 150 steps long.
Half-cheetah. This environment involves training a two-legged "half-cheetah" to run forward as fast as possible by applying torques to 6 different joints. The cost at every time step is measured as the negative forward velocity. Each episode is 1000 steps long, but the length is reduced to 200 for the benchmark against (Clavera et al., 2018).
Additionally, we also test it on the Ant environment with larger state and action spaces:
Ant. This is the most challenging environment of the benchmark (with a large state space dimension of 111). It consists of a four-legged "ant" controlled by applying torques to its 8 joints. The cost, similar to Half-cheetah, is the negative forward velocity.
5.1 Comparison to Prior Methods
We focus on the control performance in the initial 10 episodes because: 1) with little data available, regularization is most important, and 2) it is easier to disentangle the benefits of regularization, since asymptotic performance could be affected by differences in exploration, which is outside the scope of this paper. We compare our proposed method to the following state-of-the-art methods:
PETS. Probabilistic Ensembles with Trajectory Sampling (PETS) (Chua et al., 2018) consists of an ensemble of probabilistic neural networks and uses particle-based trajectory sampling to regularize trajectory optimization. We compare to the best results of PETS (denoted as PE-TS1) reported in (Chua et al., 2018). In PE-TS1, the next state prediction is sampled from a different model in the ensemble at each time step. We obtain the benchmark results by running the code provided by the authors.
MB-MPO. We also compare the performance of our method with Model-Based Meta Policy Optimization (MB-MPO) (Clavera et al., 2018), an approach that combines the benefits of model-based RL and meta learning: the algorithm trains a policy using simulations generated by an ensemble of models, learned from data. Meta-learning allows this policy to quickly adapt to the various dynamics, hence learning how to quickly adapt in the real environment, using Model-Agnostic Meta Learning (MAML) (Finn et al., 2017). We use the benchmark results published by the authors on the companion site of their paper (Clavera et al., 2018).
GP. In the Cartpole environment, we also compare our results against Gaussian Processes (GP). Here, a GP is used as the dynamics model and only the expectation of the next state prediction is considered (denoted as GP-E in Chua et al. (2018)). This algorithm performed the best in this task in (Chua et al., 2018). These benchmark results were also obtained by running the code provided by Chua et al. (2018).
5.2 Model Architecture and Hyperparameters
We use the probabilistic neural network from (Chua et al., 2018) as our baseline. Given a state-action pair, the probabilistic neural network predicts the mean and variance of the next state (assuming a Gaussian distribution). Although we only use the mean prediction, we found that also training to predict the variance improves the stability of training. In (Chua et al., 2018), the dynamics model is trained for only 5 epochs after every episode, and we observed that this leads to underfitting of the training data and severely limits the control performance during the initial episodes, as demonstrated in Figure 4. Since the dynamics model is poor during these initial episodes, regularizing the trajectory optimization is not sensible in this setting. To alleviate this problem, we train the dynamics model for 100 or more epochs after every episode so that training converges. We found that this significantly improves the learning progress of the algorithm, so we use this as our baseline (also illustrated in Figure 4). It can also be seen that similarly training PETS for more epochs does not yield noticeable improvements over the baseline.
We use dynamics models with the same architecture for all environments: 3 hidden layers of size 200 with the Swish non-linearity (Ramachandran et al., 2017). Similar to prior works, we train the dynamics model to predict the difference between $s_{t+1}$ and $s_t$ instead of predicting $s_{t+1}$ directly. We use the same architecture as the dynamics model for the denoising autoencoder. The state-action pairs in the past episodes were corrupted with zero-mean Gaussian noise and the DAE was trained to denoise them. Important hyperparameters used in our experiments are reported in the Appendix.
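Training on state differences rather than absolute next states can be sketched as the following target construction and prediction wrapper. The "network" `net` here is a hypothetical stand-in for a trained model.

```python
import numpy as np

def make_targets(states, next_states):
    """Regression targets: the state change, not the absolute next state."""
    return next_states - states

def predict_next(net, state, action):
    """Recover the next state from a predicted difference."""
    return state + net(np.concatenate([state, action]))

# Stand-in "network" that has learned a constant drift of +0.1 per step.
net = lambda x: 0.1 * np.ones(2)

s = np.array([1.0, 2.0])
s_next = predict_next(net, s, np.zeros(1))
```

This parameterization centers the regression targets near zero, which tends to be easier for the network than reproducing the (often large) state values.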
Comparison to Chua et al. The learning progress of our algorithm is illustrated in Figure 3. In (Chua et al., 2018), the learning curves of all algorithms are reported using the maximum rewards seen so far. This can be misleading, so we instead report the average returns across different seeds after each episode. In Cartpole, all the methods eventually converge to the maximum cumulative reward. There are, however, differences in how quickly the methods converge, with the proposed method converging the fastest. Interestingly, our algorithm also surpasses the Gaussian Process (GP) baseline, which is known to be a sample-efficient method widely used for control of simple systems. In Reacher, the proposed method converges to the same asymptotic performance as PETS, but faster. In Pusher, all algorithms perform similarly. In Half-cheetah, the proposed method is the fastest, learning an effective running gait in only a couple of episodes (videos of our agents during training can be found at https://sites.google.com/view/regularizing-mbrl-with-dae/home). Denoising regularization is effective for both gradient-free and gradient-based planning, with gradient-based planning performing best. The results after 10 episodes using the proposed method improve over (Chua et al., 2018), and even exceed the asymptotic performance of several recent model-free algorithms as reported in (Fujimoto et al., 2018; Duan et al., 2016). In all the tested environments, denoising regularization shows faster or comparable learning, quickly reaching good asymptotic performance.
Comparison to Clavera et al. In Figure 5 we compare our method against MB-MPO and other model-based methods included in (Clavera et al., 2018) for the Half-cheetah environment with shorter episodes (200 timesteps). Also in this case, our method learns faster than the comparison methods.
Comparison to Gaussian regularization. To emphasize the importance of denoising regularization, we also compare against a simple Gaussian regularization baseline: we fit a Gaussian distribution (with diagonal covariance matrix) to the states and actions in the replay buffer and regularize the trajectory optimization by adding a penalty term to the cost, proportional to the negative log probability of the states and actions in the trajectory (Equation 1). The performance of this baseline in the Half-cheetah task (with an episode length of 200) is shown in Figure 6. We observe that the Gaussian distribution poorly fits the trajectories and consistently leads the optimization to a bad local minimum.
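The Gaussian baseline can be sketched as follows: fit a diagonal Gaussian to the buffer vectors, then penalize the negative log-likelihood of planned state-action pairs. The synthetic buffer below is illustrative only.

```python
import numpy as np

class DiagonalGaussian:
    """Diagonal-covariance Gaussian fit to replay-buffer state-action vectors."""

    def fit(self, data):                    # data: (n_samples, dim)
        self.mu = data.mean(axis=0)
        self.var = data.var(axis=0) + 1e-8  # guard against zero variance
        return self

    def log_prob(self, x):
        return float(-0.5 * np.sum((x - self.mu) ** 2 / self.var
                                   + np.log(2 * np.pi * self.var)))

rng = np.random.default_rng(0)
buffer = rng.normal(0.0, 1.0, size=(1000, 4))
model = DiagonalGaussian().fit(buffer)

# Familiar points get higher log-probability than out-of-distribution ones.
in_dist = model.log_prob(np.zeros(4))
out_dist = model.log_prob(10.0 * np.ones(4))
```

Because the fit is unimodal and axis-aligned, it cannot represent the correlated, multi-modal structure of real trajectories, which is consistent with the poor performance observed for this baseline.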
Experiments with Ant. In Figure 7, we show our results on the Ant environment. We use the OpenAI Gym environment, and episodes are terminated if the done condition is met. We use the forward movement velocity as the reward at each step. We collect initial data using a random policy (choosing actions uniformly in the range [-0.5, 0.5]) for 200 episodes. Then, we train the agents online for 10 episodes. After the random exploration, trajectory optimization with denoising regularization learns very quickly and performs much better than the baseline.
6 Related Work
Several methods have been proposed for planning with learned dynamics models. Locally linear time-varying models (Kumar et al., 2016; Levine & Abbeel, 2014) and Gaussian processes (Deisenroth & Rasmussen, 2011; Ko et al., 2007) are data-efficient but have problems scaling to high-dimensional environments. Recently, deep neural networks have been successfully applied to model-based RL. Nagabandi et al. (2018) use deep neural networks as dynamics models in model-predictive control to achieve good performance, and then show how model-based RL can be fine-tuned with a model-free approach to achieve even better performance. Chua et al. (2018) introduce PETS, a method to improve model-based performance by estimating and propagating uncertainty with an ensemble of networks and sampling techniques. They demonstrate how their approach can beat several recent model-based and model-free techniques. Clavera et al. (2018) combine model-based RL and meta-learning with MB-MPO, training a policy to quickly adapt to slightly different learned dynamics models, thus enabling faster learning.
Levine & Koltun (2013) and Kumar et al. (2016) use a KL-divergence penalty between action distributions to stay close to the training distribution. Similar bounds are also used to stabilize training of policy-gradient methods (Schulman et al., 2015, 2017). While such KL penalties bound the evolution of action distributions, the proposed method also bounds the familiarity of states, which can be important in high-dimensional state spaces. While penalizing unfamiliar states also penalizes exploration, it allows for more controlled and efficient exploration. Exploration is out of the scope of this paper but was studied in (Di Palo & Valpola, 2018), where a non-zero optimum of the proposed DAE penalty was used as an intrinsic reward to alternate between familiarity and exploration.
7 Conclusion

In this work, we introduced a novel algorithm for regularization in model-based reinforcement learning. We tackled the problem of regularizing trajectory optimization in order to avoid the planning inaccuracies caused by out-of-distribution errors of deep neural networks. After a theoretical analysis of the proposed method, we empirically demonstrated how this approach enables fast learning in different continuous control environments.
In recent years, much effort has been put into making deep reinforcement learning algorithms more sample-efficient, and thus adaptable to real-world scenarios. Model-based reinforcement learning has shown promising results, achieving sample-efficiency even orders of magnitude better than model-free counterparts, but these methods have often suffered from sub-optimal performance for various reasons. As already noted in the recent literature (Nagabandi et al., 2018; Chua et al., 2018), out-of-distribution errors and model overfitting are often sources of performance degradation when using complex function approximators, and our experiments highlight how tackling these problems improves the performance of model-based reinforcement learning. We argue that the increase in learning speed can make this method, if further explored, a viable solution for real-world motor control learning in complex robots.
Recent work on uncertainty estimation in model-based control has typically either used less complex models, such as Gaussian processes, or approximated Bayesian neural networks using ensembles of models and sampling, as in (Chua et al., 2018), or stochastic dropout over several feed-forward computations (Kahn et al., 2017; Gal, 2016). In this paper, we used denoising autoencoders as a regularization tool in trajectory optimization and showed that they enable state-of-the-art sample-efficiency in a set of popular continuous control tasks.
A possible avenue for further research would be to explore the possibilities of gradient-based trajectory optimization for multi-modal distributions. Gradient-based optimization is expected to be easier to scale to high-dimensional action spaces and could pave the way for successes in the low data regime of complex reinforcement learning problems.
Acknowledgements

We would like to thank Jussi Sainio, Jari Rosti and Isabeau Prémont-Schwarz for their valuable contributions in the experiments on industrial process control.
References

- Arponen et al. (2017) Arponen, H., Herranen, M., and Valpola, H. On the exact relationship between the denoising function and the data distribution. arXiv preprint arXiv:1709.02797, 2017.
- Arulkumaran et al. (2017) Arulkumaran, K., Deisenroth, M. P., Brundage, M., and Bharath, A. A. A brief survey of deep reinforcement learning. arXiv preprint arXiv:1708.05866, 2017.
- Botev et al. (2013) Botev, Z. I., Kroese, D. P., Rubinstein, R. Y., and L’Ecuyer, P. The cross-entropy method for optimization. In Handbook of statistics, volume 31, pp. 35–59. Elsevier, 2013.
- Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI gym. arXiv preprint arXiv:1606.01540, 2016.
- Chua et al. (2018) Chua, K., Calandra, R., McAllister, R., and Levine, S. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 4759–4770. 2018.
- Clavera et al. (2018) Clavera, I., Rothfuss, J., Schulman, J., Fujita, Y., Asfour, T., and Abbeel, P. Model-based reinforcement learning via meta-policy optimization. arXiv preprint arXiv:1809.05214, 2018.
- Deisenroth & Rasmussen (2011) Deisenroth, M. and Rasmussen, C. E. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pp. 465–472, 2011.
- Deisenroth et al. (2013) Deisenroth, M. P., Neumann, G., Peters, J., et al. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1–2):1–142, 2013.
- Di Palo & Valpola (2018) Di Palo, N. and Valpola, H. Improving model-based control and active exploration with reconstruction uncertainty optimization. arXiv preprint arXiv:1812.03955, 2018.
- Duan et al. (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338, 2016.
- Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135, 2017.
- Fujimoto et al. (2018) Fujimoto, S., Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pp. 1582–1591, 2018.
- Gal (2016) Gal, Y. Uncertainty in deep learning. PhD thesis, University of Cambridge, 2016.
- Kahn et al. (2017) Kahn, G., Villaflor, A., Pong, V., Abbeel, P., and Levine, S. Uncertainty-aware reinforcement learning for collision avoidance. arXiv preprint arXiv:1702.01182, 2017.
- Ko et al. (2007) Ko, J., Klein, D. J., Fox, D., and Haehnel, D. Gaussian processes and reinforcement learning for identification and control of an autonomous blimp. In Robotics and Automation, 2007 IEEE International Conference on, pp. 742–747. IEEE, 2007.
- Kouvaritakis & Cannon (2001) Kouvaritakis, B. and Cannon, M. Non-linear Predictive Control: theory and practice. Iet, 2001.
- Kumar et al. (2016) Kumar, V., Todorov, E., and Levine, S. Optimal control with learned local models: Application to dexterous manipulation. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pp. 378–383. IEEE, 2016.
- Kurutach et al. (2018) Kurutach, T., Clavera, I., Duan, Y., Tamar, A., and Abbeel, P. Model-ensemble trust-region policy optimization. In International Conference on Learning Representations, 2018.
- Levine & Abbeel (2014) Levine, S. and Abbeel, P. Learning neural network policies with guided policy search under unknown dynamics. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27, pp. 1071–1079. Curran Associates, Inc., 2014.
- Levine & Koltun (2013) Levine, S. and Koltun, V. Guided policy search. In International Conference on Machine Learning, pp. 1–9, 2013.
- Lowrey et al. (2018) Lowrey, K., Rajeswaran, A., Kakade, S., Todorov, E., and Mordatch, I. Plan online, learn offline: Efficient learning and exploration via model-based control. arXiv preprint arXiv:1811.01848, 2018.
- Mayne et al. (2000) Mayne, D. Q., Rawlings, J. B., Rao, C. V., and Scokaert, P. O. Constrained model predictive control: Stability and optimality. Automatica, 36(6):789–814, 2000.
- Nagabandi et al. (2018) Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7559–7566. IEEE, 2018.
- Nguyen et al. (2017) Nguyen, A., Clune, J., Bengio, Y., Dosovitskiy, A., and Yosinski, J. Plug & play generative networks: Conditional iterative generation of images in latent space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- Parmas et al. (2018) Parmas, P., Rasmussen, C. E., Peters, J., and Doya, K. Pipps: Flexible model-based policy search robust to the curse of chaos. In International Conference on Machine Learning, pp. 4062–4071, 2018.
- Ramachandran et al. (2017) Ramachandran, P., Zoph, B., and Le, Q. V. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.
- Ricker (1993) Ricker, N. L. Model predictive control of a continuous, nonlinear, two-phase reactor. Journal of Process Control, 3(2):109–123, 1993.
- Rossiter (2003) Rossiter, J. Model-based Predictive Control-a Practical Approach. CRC Press, 01 2003.
- Schulman et al. (2015) Schulman, J., Levine, S., Moritz, P., Jordan, M., and Abbeel, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on International Conference on Machine Learning-Volume 37, pp. 1889–1897, 2015.
- Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Tassa et al. (2012) Tassa, Y., Erez, T., and Todorov, E. Synthesis and stabilization of complex behaviors through online trajectory optimization. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4906–4913. IEEE, 2012.
- Tassa et al. (2014) Tassa, Y., Mansard, N., and Todorov, E. Control-limited differential dynamic programming. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 1168–1175. IEEE, 2014.
- Todorov & Li (2005) Todorov, E. and Li, W. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In American Control Conference, 2005. Proceedings of the 2005, pp. 300–306. IEEE, 2005.
- Todorov et al. (2012) Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.
- Vincent et al. (2010) Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(Dec):3371–3408, 2010.
Appendix A Additional Experimental Details
The important hyperparameters for all our experiments are shown in Tables 1, 2 and 3. We found the DAE noise level, regularization penalty weight and Adam learning rate to be the most important hyperparameters.
| Environment | Optimizer | Optim Iters | Epochs | Adam LR | DAE noise | Batch Norm |