Regularizing Trajectory Optimization with Denoising Autoencoders

03/28/2019 · Rinu Boney et al.

Trajectory optimization with learned dynamics models can often suffer from erroneous predictions of out-of-distribution trajectories. We propose to regularize trajectory optimization by means of a denoising autoencoder that is trained on the same trajectories as the dynamics model. We visually demonstrate the effectiveness of the regularization in gradient-based trajectory optimization for open-loop control of an industrial process. We compare with recent model-based reinforcement learning algorithms on a set of popular motor control tasks to demonstrate that the denoising regularization enables state-of-the-art sample-efficiency. We demonstrate the efficacy of the proposed method in regularizing both gradient-based and gradient-free trajectory optimization.


1 Introduction

In model-based reinforcement learning (RL), the actions of an agent are computed based on information from an explicit model of the environment dynamics. Model-based control with human-engineered simulator-based models is widely used in robotics and has been demonstrated to solve challenging tasks such as human locomotion (Tassa et al., 2012, 2014) and dexterous in-hand manipulation (Lowrey et al., 2018). However, this is not possible when the simulator is unavailable (the environment is unknown) or slow. In such cases, we can learn a dynamics model of the environment and use it for planning.

The arguably most important benefit of model-based RL algorithms is that they are typically more sample-efficient than their model-free counterparts (Deisenroth et al., 2013; Arulkumaran et al., 2017; Chua et al., 2018). Due to their sample-efficiency, model-based RL methods are attractive for real-world applications where collecting samples is expensive (Deisenroth & Rasmussen, 2011). In addition, it is trivial to use data collected with any policy to train the dynamics model, and the dynamics model can transfer well to other tasks in the same environment.

Trajectory optimization with learned dynamics models is particularly challenging because of overconfident predictions of out-of-distribution trajectories that yield highly over-optimistic rewards. Powerful function approximators such as deep neural networks are required to learn dynamics of complex environments. Such models are likely to produce erroneous predictions outside the training distribution. This problem can be severe in high-dimensional environments, especially when there is little training data available. Recent works learn an ensemble of dynamics models to attenuate this problem (Chua et al., 2018; Kurutach et al., 2018; Clavera et al., 2018).

In this paper, we propose to use a denoising autoencoder (DAE) (Vincent et al., 2010) to regularize trajectory optimization by penalizing trajectories that are unlikely to appear in the past experience. The DAE is trained to denoise the same trajectories used to train the dynamics models. The intuition is that the denoising error will be large for trajectories that are far from the training distribution, signaling that the dynamics model predictions will be less reliable as it has not been trained on such data. We can use this as a regularization term to optimize for trajectories that yield a high return while keeping the uncertainty low. Previously, DAEs have been successfully used to regularize conditional iterative generation of images (Nguyen et al., 2017).

The main contributions of this work are as follows:

  1. Denoising regularization. We present a new algorithm for regularizing trajectory optimization with learned dynamics models. We empirically demonstrate the efficacy of the algorithm in open-loop control of an industrial process and closed-loop control of a set of popular motor control tasks.

  2. Gradient-based trajectory optimization. We demonstrate effective gradient-based trajectory optimization with learned global dynamics models to achieve state-of-the-art sample-efficiency.

2 Model-Based Reinforcement Learning

Problem setting. At every time step $t$, the environment is in a state $s_t$ and the agent receives an observation $o_t$. In a partially observable Markov decision process (POMDP), the state $s_t$ is hidden and the observation $o_t$ does not completely reveal $s_t$. In a fully observable Markov decision process (MDP), the state is observable and hence $o_t = s_t$. The agent takes an action $a_t$, causing the environment to transition to a new state $s_{t+1}$ following the stochastic dynamics $p(s_{t+1} \mid s_t, a_t)$, and the agent receives a reward $r_t = r(o_t, a_t)$. The goal is to implement a policy that maximizes the expected cumulative reward $\mathbb{E}\left[\sum_t r(o_t, a_t)\right]$.

Learned dynamics models. Model-based RL differs from model-free RL by modeling the dynamics and using it to influence the choice of actions. In this paper, we use environments with deterministic dynamics, i.e. $s_{t+1} = f(s_t, a_t)$. In fully observable environments (such as in Section 5), the dynamics model can be a fully-connected network trained to predict the state transition from time $t$ to $t+1$:
$$\hat{s}_{t+1} = f_\theta(s_t, a_t).$$
In partially observable environments (such as in Section 4), the dynamics model can be a recurrent neural network trained to directly predict the future observations based on past observations and actions:
$$\hat{o}_{t+1} = f_\theta(o_1, \ldots, o_t, a_1, \ldots, a_t).$$
In this paper, we assume access to the reward function and that it can be computed from the agent observations, i.e. $r_t = r(o_t, a_t)$, although this can easily be extended by modeling the reward function.
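As a concrete illustration, a minimal PyTorch sketch of a fully-connected dynamics model for the MDP case might look as follows. This is only a sketch: the layer sizes, activation, and the choice of predicting a state difference (as done later in Section 5) are illustrative assumptions, not the exact architecture used in the experiments. For the POMDP case, the network body would be replaced by a recurrent module (e.g. an LSTM) consuming the history of observations and actions.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Deterministic dynamics model: predicts s_{t+1} from (s_t, a_t)."""
    def __init__(self, state_dim, action_dim, hidden=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        # Predict the change of state and add it to the current state.
        delta = self.net(torch.cat([state, action], dim=-1))
        return state + delta
```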

Trajectory optimization. The learned dynamics model is used as a proxy to the environment to optimize a trajectory of actions that maximizes the expected cumulative reward. The goal of trajectory optimization is, given the past observations $o_1, \ldots, o_t$ and planning horizon $H$, to find a sequence of actions $a_t, \ldots, a_{t+H}$ that yields the highest expected cumulative reward
$$G = \mathbb{E}\left[\sum_{\tau=t}^{t+H} r(o_\tau, a_\tau)\right],$$
such that $\hat{o}_{\tau+1} = f_\theta(\hat{o}_\tau, a_\tau)$ (for MDPs) or $\hat{o}_{\tau+1} = f_\theta(o_1, \ldots, \hat{o}_\tau, a_1, \ldots, a_\tau)$ (for POMDPs).

Open-loop control and closed-loop control. The optimized sequence of actions from trajectory optimization can be directly applied to the environment without any further interaction (open-loop control) or provided as suggestions to a human (human-in-the-loop). We demonstrate effective open-loop control of an industrial process in Section 4. Open-loop control is challenging because the dynamics model has to be able to make accurate long-range predictions. We can account for modeling errors and feedback from the environment by only taking the first action of the optimized trajectory and then re-planning at each step (closed-loop control). In the control literature, this flavor of model-based RL is called model-predictive control (MPC) (Mayne et al., 2000; Rossiter, 2003; Kouvaritakis & Cannon, 2001; Nagabandi et al., 2018). MPC has been used successfully to control various simulated robotic agents, as shown in Tassa et al. (2012, 2014); Lowrey et al. (2018). We demonstrate closed-loop control of several motor control tasks using MPC in Section 5.
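The distinction between open-loop planning and MPC can be summarized by the following sketch. The `plan_actions` routine stands for any trajectory optimizer built on the learned model, and the Gym-style `env.step` interface is an assumption used only for illustration.

```python
def open_loop_control(env, plan_actions, obs_history, horizon):
    """Optimize once, then execute the whole plan without re-planning."""
    actions = plan_actions(obs_history, horizon)
    for a in actions:
        obs, reward, done, _ = env.step(a)
        if done:
            break

def mpc_control(env, plan_actions, obs_history, horizon, max_steps):
    """Model-predictive control: re-plan at every step, apply only the first action."""
    for _ in range(max_steps):
        actions = plan_actions(obs_history, horizon)
        obs, reward, done, _ = env.step(actions[0])
        obs_history.append(obs)
        if done:
            break
```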

3 Regularized Trajectory Optimization

The trajectory optimization problem is known to be highly non-trivial, with many sources of numerical instability. One potential problem in the trajectory optimization procedure is that the estimates of the expected returns can be highly inaccurate for out-of-distribution trajectories, and the agent is therefore tempted to try trajectories that look very different from the collected experience.

In this paper, we propose to address this problem by penalizing trajectories that are unlikely to appear in the past experience. This can be achieved by adding a penalty term to the objective function:
$$G_{\text{reg}} = G + \alpha \log p(o_t, a_t, \ldots, o_{t+H}, a_{t+H}),$$
where $p(o_t, a_t, \ldots, o_{t+H}, a_{t+H})$ is the probability of observing the trajectory in the past experience and $\alpha$ is a tuning hyperparameter. In this paper, instead of using the joint probability of the whole trajectory, we use marginal probabilities over short windows of size $w$:
$$G_{\text{reg}} = G + \alpha \sum_{\tau} \log p(x_\tau), \qquad x_\tau = (o_{\tau-w+1}, a_{\tau-w+1}, \ldots, o_\tau, a_\tau). \tag{1}$$
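In code, the windowed penalty of Equation (1) amounts to slicing the planned trajectory into overlapping windows of concatenated observation-action pairs and summing a log-density estimate over them. A minimal sketch is given below; the `log_prob` callable is a placeholder for whatever density (or DAE-based) estimate is used.

```python
import torch

def windowed_penalty(observations, actions, log_prob, window=1):
    """Sum of log p(x_tau) over windows of `window` consecutive (o, a) pairs.

    observations: tensor of shape (H, obs_dim)
    actions:      tensor of shape (H, act_dim)
    """
    pairs = torch.cat([observations, actions], dim=-1)  # (H, obs_dim + act_dim)
    total = 0.0
    for tau in range(window - 1, pairs.shape[0]):
        x_tau = pairs[tau - window + 1: tau + 1].reshape(-1)  # concatenated window
        total = total + log_prob(x_tau)
    return total
```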

The optimal sequence of actions can be found by a gradient-based optimization procedure. The gradients $\partial G_{\text{reg}} / \partial a_\tau$ can be computed by backpropagation in a computational graph using the trained dynamics model (see Fig. 1). In such a backpropagation-through-time procedure, one needs to compute the gradient with respect to the actions at each time instance $\tau$:
$$\frac{\partial G_{\text{reg}}}{\partial a_\tau} = \frac{\partial G}{\partial a_\tau} + \alpha \sum_i \frac{\partial \log p(x_i)}{\partial a_\tau}, \tag{2}$$
where we denote by $x_i$ a concatenated vector of observations $o_{i-w+1}, \ldots, o_i$ and actions $a_{i-w+1}, \ldots, a_i$ over a window of size $w$. Thus, to enable a regularized gradient-based optimization procedure, we need means to compute $\partial \log p(x) / \partial x$.

Figure 1: Example: fragment of a computational graph used during trajectory optimization in an MDP, with the DAE penalty computed on windows $x_\tau$ of concatenated observation-action pairs.

In order to evaluate $\log p(x)$ (or its derivative), one needs to train a separate model $p(x)$ of the past experience, which is the task of unsupervised learning. In principle, any probabilistic model can be used for that. In this paper, we propose to regularize trajectory optimization with a denoising autoencoder (DAE), which does not build an explicit probabilistic model $p(x)$ but rather learns to approximate the derivative $\partial \log p(x) / \partial x$ of the log probability density, which can be used directly in a gradient-based optimization procedure.

The theory of denoising autoencoders states (Arponen et al., 2017) that the optimal denoising function $g(\tilde{x})$ (for zero-mean Gaussian corruption) is given by the following expression:
$$g(\tilde{x}) = \tilde{x} + \sigma_n^2 \frac{\partial \log p(\tilde{x})}{\partial \tilde{x}},$$
where $p(\tilde{x})$ is the probability density function for data $\tilde{x}$ corrupted with noise and $\sigma_n$ is the standard deviation of the Gaussian corruption. Thus, by training a DAE $g$ and assuming $p(\tilde{x}) \approx p(x)$, we can approximate the required gradient as
$$\frac{\partial \log p(x)}{\partial x} \approx \frac{1}{\sigma_n^2} \left( g(x) - x \right),$$
which yields
$$\frac{\partial G_{\text{reg}}}{\partial a_\tau} = \frac{\partial G}{\partial a_\tau} + \frac{\alpha}{\sigma_n^2} \sum_i \left( \frac{\partial x_i}{\partial a_\tau} \right)^{\!\top} \left( g(x_i) - x_i \right). \tag{3}$$

In automatic differentiation software, this gradient can be computed by adding the penalty term to the objective $G$ and stopping the gradient propagation through the DAE output $g(x)$. In practice, stopping the gradient through $g(x)$ did not yield any benefits in our experiments compared to simply adding the penalty term to the cumulative reward, so we used the simple penalty term in our experiments.
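A minimal sketch of this idea in PyTorch is shown below, assuming `dae` is a trained denoising autoencoder and `alpha`, `sigma_n` are the penalty weight and corruption level. Detaching the DAE output treats it as a constant target, so the gradient of the penalty with respect to $x$ matches the term in Equation (3); as noted above, omitting the detach worked equally well in practice.

```python
import torch

def dae_penalty(x, dae, alpha, sigma_n, stop_grad=True):
    """Penalty whose gradient w.r.t. x approximates alpha * d log p(x) / dx.

    x:   concatenated window of observations and actions (requires_grad)
    dae: trained denoising autoencoder g, with g(x) ~ x + sigma_n^2 * d log p / dx
    """
    g_x = dae(x)
    if stop_grad:
        g_x = g_x.detach()  # treat the DAE output as a constant target
    # -||x - g(x)||^2, scaled so that its gradient matches Equation (3)
    return -(alpha / (2.0 * sigma_n ** 2)) * ((x - g_x) ** 2).sum()
```

The penalty is simply added to the predicted cumulative reward before calling `.backward()` on the total objective.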

  Gather initial data $D$ by executing random actions
  for episode = 1, 2, ... do
     Train dynamics model $f_\theta$ and DAE $g$ on $D$
     while episode is not over do
        Sample random action trajectories
        for optimization steps = 1, 2, ... do
           Predict future states of the action trajectories using $f_\theta$
           Compute the regularization penalty using $g$
           Evaluate trajectories using the regularized cumulative reward (Equation 4)
           Optimize the action trajectories
        end for
        Execute the first action from the optimized trajectory
        Collect data, $D \leftarrow D \cup \{(o_t, a_t, o_{t+1}, r_t)\}$
     end while
  end for
Algorithm 1: Regularizing online trajectory optimization using denoising autoencoders
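A compact Python rendering of Algorithm 1 could look like the following sketch. The helpers `collect_random_episodes`, `train_models`, and `optimize_trajectory`, as well as the Gym-style environment interface, are placeholders for the components described above, not the exact implementation.

```python
def run(env, num_episodes, plan_horizon, num_random_episodes=1):
    data = collect_random_episodes(env, num_random_episodes)  # initial data D
    for episode in range(num_episodes):
        dynamics, dae = train_models(data)          # train f_theta and g on D
        obs, done = env.reset(), False
        obs_history = [obs]
        while not done:
            # Regularized trajectory optimization (CEM or Adam, Equation 4)
            plan = optimize_trajectory(dynamics, dae, obs_history, plan_horizon)
            obs, reward, done, _ = env.step(plan[0])  # execute only the first action
            data.append((obs_history[-1], plan[0], obs, reward))
            obs_history.append(obs)
    return data
```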

Although gradient-based trajectory optimization is potentially effective, in practice it can fail for several reasons:

  1. The optimized reward is much more sensitive to the actions at the beginning of a plan than at the end, so using smaller learning rates for earlier actions (smaller $\tau$) can be beneficial. This problem can be addressed by using second-order methods such as iLQR (Todorov & Li, 2005).

  2. Multi-modality of the optimization landscape. In many tasks, there can be multiple locally optimal trajectories that produce substantially different cumulative rewards. This problem can be overcome by using global optimization methods such as random shooting.

  3. One needs to do backpropagation through a computational graph that contains $H$ transitions of the dynamics model $f_\theta$. Therefore, problems with explosions in the forward computations and vanishing gradients easily arise.

  4. The learned dynamics model can become chaotic, which can cause high variance in the estimates of the backpropagated gradients (Parmas et al., 2018).

Despite these challenges, we demonstrate effective regularization of gradient-based trajectory optimization in an industrial process control task and a set of popular motor control tasks. We also demonstrate that the DAE regularization can be effectively used with gradient-free optimization methods such as the cross-entropy method (CEM) (Botev et al., 2013), where we optimize the sum of the expected cumulative reward and the DAE penalty term:
$$G_{\text{reg}} = \mathbb{E}\left[\sum_{\tau=t}^{t+H} r(o_\tau, a_\tau)\right] - \alpha \sum_i \left\| g(x_i) - x_i \right\|^2. \tag{4}$$
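For gradient-free optimization, the DAE penalty simply enters the score used to rank sampled action sequences. A hedged CEM sketch is given below; the population size, elite fraction, and Gaussian sampling distribution are illustrative choices, and `score_fn` is assumed to return the regularized return of Equation (4) computed with the learned dynamics model and the DAE penalty.

```python
import numpy as np

def cem_plan(score_fn, horizon, act_dim, iters=5, pop=500, elite_frac=0.1):
    """Cross-entropy method over action sequences."""
    mean = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        # Sample candidate action sequences around the current distribution
        candidates = mean + std * np.random.randn(pop, horizon, act_dim)
        scores = np.array([score_fn(a) for a in candidates])
        elite = candidates[np.argsort(scores)[-n_elite:]]  # best-scoring sequences
        mean, std = elite.mean(axis=0), elite.std(axis=0)
    return mean
```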

Algorithm. Initially, we collect training data by performing random actions or using an existing policy and train the dynamics model and DAE on this data. Then, we interact with the environment by performing regularized trajectory optimization using the learned dynamics model and DAE. After each episode, we store the transitions generated by our algorithm and re-train our dynamics model and DAE. The general procedure is described in Algorithm 1.

4 Experiments on Industrial Process Control

To study trajectory optimization, we first consider the problem of control of a simple industrial process. An effective industrial control system could achieve better production and economic efficiency than manually operated controls. In this paper, we learn the dynamics of an industrial process and use it to optimize the controls, by minimizing a cost function. In some critical processes, safety is of utmost importance and regularization methods could prevent adaptive control methods from exploring unsafe trajectories.

We consider the problem of controlling a continuous nonlinear two-phase reactor from (Ricker, 1993). The simulated industrial process consists of a single vessel that represents a combination of the reactor and separation system. The process has two feeds: one contains substances A, B and C and the other one is pure A. The reaction occurs in the vapour phase. The liquid is pure D, which is the product. The process is manipulated by three valves which regulate the flows in the two feeds and an output stream which contains A, B and C. The plant has ten measured variables, including the flow rates of the four streams, the pressure, the liquid holdup volume, and the mole % of A, B and C in the purge. The control problem is to transition to a specified product rate and maintain it by manipulating the three valves. The pressure must be kept below the shutdown limit of 3000 kPa. The original paper suggests a multiloop control strategy with several PI controllers (Ricker, 1993).

We collected simulated data corresponding to about 0.5M steps of operation by randomly generating control setpoints and using the original multiloop control strategy. The collected data were used to train a neural network model with one layer of 80 LSTM units and a linear readout layer to predict the next-step measurements. The inputs were the three controls and the ten process measurements. The data were pre-processed by scaling such that the standard deviation of the derivatives of each measured variable was of the same scale. This way, the model better learned the dynamics of slowly changing variables. We used a fully-connected bottleneck architecture with hidden layers of sizes 100-200-100-20-100-200-100 to train a DAE on windows of five successive measurement-control pairs. The scaled measurement-control pairs in a window were concatenated into a single vector and corrupted with zero-mean Gaussian noise, and the DAE was trained to denoise it.
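The scaling described above, which equalizes the magnitude of one-step changes across measured variables, can be sketched in a few lines of NumPy. This is a minimal illustration under the stated assumption that "derivatives" are one-step differences of the measurements; variable names are illustrative.

```python
import numpy as np

def derivative_scaling(measurements):
    """Scale each variable so the std of its one-step differences is 1.

    measurements: array of shape (T, num_variables) of raw process measurements.
    Returns the scaled data and the per-variable scale factors.
    """
    diffs = np.diff(measurements, axis=0)   # one-step "derivatives"
    scale = diffs.std(axis=0) + 1e-8        # avoid division by zero
    return measurements / scale, scale
```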

The trained model was then used for optimizing a sequence of actions to ramp production up to a higher product rate as rapidly as possible, while satisfying all other constraints (Scenario II from Ricker, 1993). We formulated the objective function as the Euclidean distance to the desired targets (after pre-processing). The targets corresponded to three measurements: the new product rate (in kmol h⁻¹), 2850 kPa for the pressure, and 63 mole % for A in the purge.

We optimized a plan of actions 30 hours ahead (300 discretized time steps). The sequence of controls was initialized with the original multiloop policy applied to the trained dynamics model. That control sequence, together with the predicted and the real outcomes (black and red curves respectively), is shown in Fig. 2a. We then optimized the control sequence using 10000 iterations of Adam with learning rate 0.01, without and with DAE regularization (with a fixed penalty weight $\alpha$).
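A minimal sketch of this open-loop plan optimization is shown below: the control sequence is treated as a free tensor, rolled out through the learned LSTM model, and updated with Adam on the regularized cost. The `rollout` and `dae_penalty` helpers (and tensor shapes) are assumptions used for illustration, not the exact implementation.

```python
import torch

def optimize_plan(model, dae_penalty, init_controls, targets, steps=10000, lr=0.01):
    """Gradient-based open-loop planning for the process-control task."""
    controls = init_controls.clone().requires_grad_(True)  # (T, 3) valve positions
    opt = torch.optim.Adam([controls], lr=lr)
    for _ in range(steps):
        predicted = rollout(model, controls)          # (T, 10) predicted measurements
        cost = ((predicted - targets) ** 2).sum()     # distance to setpoint targets
        cost = cost - dae_penalty(predicted, controls)  # subtract familiarity bonus (DAE term)
        opt.zero_grad()
        cost.backward()
        opt.step()
    return controls.detach()
```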

(a) Multiloop PI control (b) No regularization (c) DAE regularization
Figure 2: Open-loop planning for the continuous nonlinear two-phase reactor from (Ricker, 1993). The three subplots in every subfigure show three measured variables (solid lines): product rate, pressure and A in the purge. The black curves represent the model's imagination while the red curves represent the reality when those controls are applied in open-loop mode. The targets for the variables are shown with dashed lines. The fourth (lower right) subplot shows the three manipulated variables: valve for feed 1 (blue), valve for feed 2 (red) and valve for stream 3 (green).

The results are shown in Fig. 2. One can see that without regularization the control signals change abruptly and the trajectory imagined by the model deviates from reality (Fig. 2b). In contrast, the open-loop plan found with DAE regularization is clearly the best solution (Fig. 2c), leading the plant to the specified product rate much faster than the human-engineered multiloop PI control from (Ricker, 1993). The imagined trajectory (black) stays close to the real outcome (red) and the targets are reached in about ten hours. This shows that even in a low-dimensional environment with a large amount of training data, regularization is necessary for planning with a learned model.

5 Experiments on Motor Control

Figure 3: Results of our experiments on the four benchmark environments, in comparison to PETS (Chua et al., 2018). We show the return obtained in each episode. All the results are averaged across 5 seeds, with the shaded area representing standard deviation. PETS is a recent state-of-the-art model-based RL algorithm and GP-based (Gaussian Processes) control algorithms are well known to be sample-efficient and are extensively used for the control of simple systems.

In order to compare the proposed method to the performance of existing model-based RL methods we also test it on the same set of Mujoco-based (Todorov et al., 2012; Brockman et al., 2016) continuous motor control tasks as in (Chua et al., 2018):

Cartpole. This task involves a pole attached to a moving cart in a frictionless track, with the goal of swinging up the pole and balancing it in an upright position in the center of the screen. The cost at every time step is measured as the angular distance between the tip of the pole and the target position. Each episode is 200 steps long.

Reacher. This environment consists of a simulated PR2 robot arm with seven degrees of freedom, with the goal of reaching a particular position in space. The cost at every time step is measured as the distance between the arm and the target position. The target position changes every episode. Each episode is 150 steps long.

Pusher. This environment also consists of a simulated PR2 robot arm, with a goal of pushing an object to a target position that changes every episode. The cost at every time step is measured as the distance between the object and the target position. Each episode is 150 steps long.

Half-cheetah. This environment involves training a two-legged "half-cheetah" to run forward as fast as possible by applying torques to 6 different joints. The cost at every time step is measured as the negative forward velocity. Each episode is 1000 steps long, but the length is reduced to 200 steps for the comparison with MB-MPO (Clavera et al., 2018).

We additionally test the method on the Ant environment, which has larger state and action spaces:

Ant. This is the most challenging environment of the benchmark (with a large state space dimension of 111). It consists of a four-legged ”ant” controlled by applying torques to its 8 joints. The cost, similar to Half-cheetah, is the negative forward velocity.

5.1 Comparison to Prior Methods

We focus on the control performance in the initial 10 episodes because: 1) there is little data available and regularization is very important, 2) it is easy to disentangle the benefits of regularization since the asymptotic performance could be affected by differences in exploration, which is outside of the scope of this paper. We compare our proposed method to the following state-of-the-art methods:

PETS. Probabilistic Ensembles with Trajectory Sampling (PETS) (Chua et al., 2018) consists of an ensemble of probabilistic neural networks and uses particle-based trajectory sampling to regularize trajectory optimization. We compare to the best results of PETS (denoted as PE-TS1) reported in (Chua et al., 2018). In PE-TS1, the next state prediction is sampled from a different model in the ensemble at each time step. We obtain the benchmark results by running the code provided by the authors.

MB-MPO. We also compare the performance of our method with Model-Based Meta Policy Optimization (MB-MPO) (Clavera et al., 2018), an approach that combines the benefits of model-based RL and meta learning: the algorithm trains a policy using simulations generated by an ensemble of models, learned from data. Meta-learning allows this policy to quickly adapt to the various dynamics, hence learning how to quickly adapt in the real environment, using Model-Agnostic Meta Learning (MAML) (Finn et al., 2017). We use the benchmark results published by the authors on the companion site of their paper (Clavera et al., 2018).

GP. In the Cartpole environment, we also compare our results against Gaussian Processes (GP). Here, a GP is used as the dynamics model and only the expectation of the next state prediction is considered (denoted as GP-E in Chua et al. (2018)). This algorithm performed the best in this task in (Chua et al., 2018). These benchmark results were also obtained by running the code provided by Chua et al. (2018).

5.2 Model Architecture and Hyperparameters

We use the probabilistic neural network from (Chua et al., 2018) as our baseline. Given a state and action pair, the probabilistic neural network predicts the mean and variance of the next state (assuming a Gaussian distribution). Although we only use the mean prediction, we found that also training to predict the variance improves the stability of the training. In (Chua et al., 2018), the dynamics model is only trained for 5 epochs after every episode, and we observed that this leads to underfitting of the training data and severely limits the control performance during the initial episodes. This is demonstrated in Figure 4. Since the dynamics model is poor during these initial episodes, regularizing the trajectory optimization is not sensible in this setting. To alleviate this problem, we train the dynamics model for 100 or more epochs after every episode so that the training converges. We found that this significantly improves the learning progress of the algorithm and use this as our baseline, as illustrated in Figure 4. It can also be seen that similarly training PETS for more epochs does not yield noticeable improvements over the baseline.

We use dynamics models with the same architecture for all environments: 3 hidden layers of size 200 with the Swish non-linearity (Ramachandran et al., 2017). Similar to prior works, we train the dynamics model to predict the difference between $s_{t+1}$ and $s_t$ instead of predicting $s_{t+1}$ directly. We use the same architecture as the dynamics model for the denoising autoencoder. The state-action pairs in the past episodes were corrupted with zero-mean Gaussian noise and the DAE was trained to denoise them. Important hyperparameters used in our experiments are reported in the Appendix.
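A hedged sketch of the DAE training described above is given below: concatenated state-action windows from the replay buffer are corrupted with zero-mean Gaussian noise and the network is trained to reconstruct the clean inputs. Batch size, learning rate, and the data layout are illustrative assumptions; the noise level per environment is listed in the Appendix tables.

```python
import torch
import torch.nn as nn

def train_dae(dae, replay_windows, noise_std, epochs=100, lr=1e-3, batch_size=256):
    """Train a DAE to reconstruct state-action windows from noisy copies.

    replay_windows: tensor of shape (N, window_dim) of concatenated
                    (state, action) windows from past episodes.
    """
    opt = torch.optim.Adam(dae.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        perm = torch.randperm(replay_windows.shape[0])
        for i in range(0, replay_windows.shape[0], batch_size):
            x = replay_windows[perm[i:i + batch_size]]
            x_noisy = x + noise_std * torch.randn_like(x)  # zero-mean Gaussian corruption
            loss = loss_fn(dae(x_noisy), x)                # denoising objective
            opt.zero_grad()
            loss.backward()
            opt.step()
    return dae
```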


Figure 4: Effect of increased number of training epochs after every episode: we can see that training the dynamics model for more epochs after each episode leads to a much better performance in the initial episodes. With this modification, a single dynamics model with no regularization seems to work almost as well as PETS. It can also be clearly seen that the use of denoising regularization enables an improvement in the learning progress. To compare with PETS, we used the CEM optimizer in this ablation study.

5.3 Results


Figure 5: Results of our experiments on Half-cheetah, in comparison to MB-MPO (Clavera et al., 2018), MB-TRPO (Kurutach et al., 2018) and MB-MPC (Nagabandi et al., 2018). We plot the average return over the last 20 episodes. Our results are averaged across 3 seeds, with the shaded area representing standard deviation. Note that the comparison numbers are picked from (Clavera et al., 2018) and the results from the first 20 episodes are not reported.

Figure 6: Comparison to Gaussian regularization: we can see that trajectory optimization with Adam without any regularization is very unstable and completely fails in the initial episodes. While Gaussian regularization helps in the first few episodes, it is not able to fit the data properly and seems to consistently lead the optimization to a local minimum. As shown earlier in Figure 5, denoising regularization is able to successfully regularize the optimization, enabling good asymptotic performance from very few episodes of interaction.

Figure 7: Results of our experiments on Ant. We use a random policy to collect data for the first 200 episodes and then train the agent online for 10 episodes. We show the rewards of the random policy on episodes 190 to 200 for comparison, and we can see that trajectory optimization with DAE regularization performs much better than the baseline.

Comparison to Chua et al. The learning progress of our algorithm is illustrated in Figure 3. In (Chua et al., 2018), the learning curves of all algorithms are reported using the maximum rewards seen so far. This can be misleading, so we instead report the average returns across different seeds after each episode. In Cartpole, all the methods eventually converge to the maximum cumulative reward. There are, however, differences in how quickly the methods converge, with the proposed method converging the fastest. Interestingly, our algorithm also surpasses the Gaussian Process (GP) baseline, which is known to be a sample-efficient method widely used for control of simple systems. In Reacher, the proposed method converges to the same asymptotic performance as PETS, but faster. In Pusher, all algorithms perform similarly. In Half-cheetah, the proposed method is the fastest, learning an effective running gait in only a couple of episodes (videos of our agents during training can be found at https://sites.google.com/view/regularizing-mbrl-with-dae/home). Denoising regularization is effective for both gradient-free and gradient-based planning, with gradient-based planning performing the best. The result after 10 episodes using the proposed method is an improvement over (Chua et al., 2018), and even exceeds the asymptotic performance of several recent model-free algorithms as reported in (Fujimoto et al., 2018; Duan et al., 2016). In all the tested environments, denoising regularization shows a faster or comparable learning speed, quickly reaching good asymptotic performance.

Comparison to Clavera et al. In Figure 5 we compare our method against MB-MPO and other model-based methods included in (Clavera et al., 2018) for the Half-cheetah environment with shorter episodes (200 timesteps). Also in this case, our method learns faster than the comparison methods.

Comparison to Gaussian regularization. To emphasize the importance of denoising regularization, we also compare against a simple Gaussian regularization baseline: we fit a Gaussian distribution (with diagonal covariance matrix) to the states and actions in the replay buffer and regularize the trajectory optimization by adding a penalty term to the cost, proportional to the negative log probability of the states and actions in the trajectory (Equation 1). The performance of this baseline in the Half-cheetah task (with an episode length of 200) is shown in Figure 6. We observe that the Gaussian distribution poorly fits the trajectories and consistently leads the optimization to a bad local minimum.
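The Gaussian baseline can be implemented in a few lines: fit a diagonal Gaussian to the state-action pairs in the replay buffer and penalize their negative log-probability along the planned trajectory. A minimal sketch under those assumptions:

```python
import numpy as np

def fit_diagonal_gaussian(buffer_xa):
    """buffer_xa: array of shape (N, D) of concatenated states and actions."""
    mean = buffer_xa.mean(axis=0)
    var = buffer_xa.var(axis=0) + 1e-8
    return mean, var

def gaussian_log_prob(x, mean, var):
    """Diagonal-Gaussian log density used as the regularization term (Equation 1)."""
    return -0.5 * np.sum((x - mean) ** 2 / var + np.log(2.0 * np.pi * var), axis=-1)
```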

Experiments with Ant. In Figure 7, we show our results on the Ant environment. We use the OpenAI Gym environment, and episodes are terminated if the done condition is met. We use the forward velocity as the reward at each step. We collect initial data using a random policy (choosing actions uniformly in the range [-0.5, 0.5]) for 200 episodes. Then, we train the agents online for 10 episodes. After the random exploration, trajectory optimization with denoising regularization learns very quickly and performs much better than the baseline.

6 Related Work

Several methods have been proposed for planning with learned dynamics models. Locally linear time-varying models (Kumar et al., 2016; Levine & Abbeel, 2014) and Gaussian processes (Deisenroth & Rasmussen, 2011; Ko et al., 2007) are data-efficient but have problems scaling to high-dimensional environments. Recently, deep neural networks have been successfully applied to model-based RL. Nagabandi et al. (2018) use deep neural networks as dynamics models in model-predictive control to achieve good performance, and then show how model-based RL can be fine-tuned with a model-free approach to achieve even better performance. Chua et al. (2018) introduce PETS, a method to improve model-based performance by estimating and propagating uncertainty with an ensemble of networks and sampling techniques. They demonstrate how their approach can beat several recent model-based and model-free techniques. Clavera et al. (2018) combine model-based RL and meta-learning with MB-MPO, training a policy to quickly adapt to slightly different learned dynamics models, thus enabling faster learning.

Levine & Koltun (2013) and Kumar et al. (2016) use a KL-divergence penalty between action distributions to stay close to the training distribution. Similar bounds are also used to stabilize the training of policy gradient methods (Schulman et al., 2015, 2017). While such a KL penalty bounds the evolution of action distributions, the proposed method also bounds the familiarity of states, which could be important in high-dimensional state spaces. While penalizing unfamiliar states also penalizes exploration, it allows for more controlled and efficient exploration. Exploration is outside the scope of this paper but was studied in (Di Palo & Valpola, 2018), where a non-zero optimum of the proposed DAE penalty was used as an intrinsic reward to alternate between familiarity and exploration.

7 Conclusion

In this work we introduced a novel algorithm for regularization in model-based reinforcement learning. We tackled the problem of regularizing trajectory optimization in order to avoid the planning inaccuracies caused by out-of-distribution errors of deep neural networks. After a theoretical analysis of the proposed method, we empirically demonstrated how this approach enables high learning speed in different continuous control environments.

In recent years, a lot of effort has been put into making deep reinforcement learning algorithms more sample-efficient, and thus adaptable to real-world scenarios. Model-based reinforcement learning has shown promising results, achieving sample-efficiency even orders of magnitude better than model-free counterparts, but these methods have often suffered from sub-optimal performance for a variety of reasons. As already noted in the recent literature (Nagabandi et al., 2018; Chua et al., 2018), out-of-distribution errors and model overfitting are often sources of performance degradation when using complex function approximators, and our experiments highlight how tackling these problems improves the performance of model-based reinforcement learning. We argue that the increase in learning speed can make this method, if further explored, a viable solution for real-world motor control learning in complex robots.

Recent work on uncertainty estimation in model-based control has typically either used less complex models, like Gaussian processes, or approximated Bayesian neural networks using ensembles of models and sampling, as in (Chua et al., 2018), or stochastic dropout over several feed-forward computations (Kahn et al., 2017; Gal, 2016). In this paper, we used denoising autoencoders as a regularization tool in trajectory optimization and showed that they enable state-of-the-art sample-efficiency in a set of popular continuous control tasks.

A possible avenue for further research would be to explore the possibilities of gradient-based trajectory optimization for multi-modal distributions. Gradient-based optimization is expected to be easier to scale to high-dimensional action spaces and could pave the way for successes in the low data regime of complex reinforcement learning problems.

Acknowledgements

We would like to thank Jussi Sainio, Jari Rosti and Isabeau Prémont-Schwarz for their valuable contributions in the experiments on industrial process control.

References

Appendix A Additional Experimental Details

The important hyperparameters for all our experiments are shown in Tables 1, 2 and 3. We found the DAE noise level, regularization penalty weight and Adam learning rate to be the most important hyperparameters.


Environment | Optimizer | Optim. iters | Epochs | Adam LR | DAE noise | Batch norm
Cartpole | CEM | 5 | 500 | - | 0.001 | 0.1
Cartpole | Adam | 10 | 500 | 0.001 | 0.001 | 0.2
Reacher | CEM | 5 | 500 | - | 0.01 | 0.1
Reacher | Adam | 5 | 300 | 1 | 0.01 | 0.1
Pusher | CEM | 5 | 500 | - | 0.01 | 0.1
Pusher | Adam | 5 | 300 | 1 | 0.01 | 0.1
Half-cheetah | CEM | 5 | 100 | - | 2 | 0.1
Half-cheetah | Adam | 10 | 200 | 0.1 | 1 | 0.2
Table 1: Important hyperparameters used in our experiments for comparison with PETS. Additionally, for the experiments with gradient-based trajectory optimization on Reacher and Pusher, we initialize the trajectory with a few iterations (2 iterations for Reacher and 5 iterations for Pusher) of CEM.

Environment | Optimizer | Optim. iters | Epochs | Adam LR | DAE noise | Batch norm
Half-cheetah | CEM | 5 | 20 | - | 2 | 0.2
Half-cheetah | Adam | 10 | 40 | 0.1 | 1 | 0.1
Table 2: Important hyperparameters used in our experiments for comparison with MB-MPO.

Environment | Optimizer | Optim. iters | Epochs | Adam LR | DAE noise | Batch norm
Ant | CEM | 5 | 100 | - | 0.5 | 0.2
Table 3: Important hyperparameters used in our experiments with the Ant environment.