Synthesizing Neural Network Controllers with Probabilistic Model-Based Reinforcement Learning

03/06/2018, by Juan Camilo Gamboa Higuera et al., McGill University

We present an algorithm for rapidly learning controllers for robotic systems. The algorithm follows the model-based reinforcement learning paradigm, and improves upon existing algorithms; namely Probabilistic Inference for Learning Control (PILCO) and a sample-based version of PILCO with neural network dynamics (Deep-PILCO). We propose training a neural network dynamics model using variational dropout with truncated Log-Normal noise. This allows us to obtain a dynamics model with calibrated uncertainty, which can be used to simulate controller executions via rollouts. We also describe a set of techniques, inspired by viewing PILCO as a recurrent neural network model, that are crucial for improving the convergence of the method. We test our method on a variety of benchmark tasks, demonstrating data-efficiency that is competitive with PILCO, while being able to optimize complex neural network controllers. Finally, we assess the performance of the algorithm for learning motor controllers for a six-legged autonomous underwater vehicle. This demonstrates the potential of the algorithm to scale up to higher dimensionality and larger datasets in more complex control tasks.


I Introduction

Model-based reinforcement learning (RL) is an attractive framework for addressing the synthesis of controllers for robots of all kinds due to its promise of data-efficiency. An RL agent can use learned dynamics models to search for good controllers in simulation. This has the potential of minimizing costly trials on real robots. Minimizing interactions, however, means that datasets will often not be large enough to obtain accurate models. Bayesian models are very helpful in this situation. Instead of requiring an accurate model, the robot agent may keep track of a distribution over hypotheses of models that are compatible with its experience. Evaluating a controller then involves quantifying its performance over the model distribution. To improve its chances of working in the real world, an effective controller should perform well, on average, on models drawn from this distribution. PILCO (Probabilistic Inference for Learning COntrol) and Deep-PILCO are successful applications of this idea.

PILCO [1] uses Gaussian Process (GP) models to fit one-step dynamics and networks of radial basis functions (RBFs) as feedback policies. PILCO has been shown to perform very well with little data in simulated tasks and on real robots [1]. We have used PILCO successfully for synthesizing swimming controllers for an underwater swimming robot [2]. However, PILCO is computationally expensive: model fitting scales cubically with the dataset size n, and long-term predictions scale quadratically with n for each of the D state dimensions, limiting its applicability to scenarios with small datasets and low dimensionality.

Fig. 1: The AQUA robot executing a 6-leg knife edge maneuver. The robot starts in its resting position and must swim forward at a constant depth while stabilizing a roll angle of 90 degrees. The sequence of images illustrates a controller obtained with our proposed approach.

Deep-PILCO [3] aims to address these limitations by employing Bayesian Neural Networks (BNNs), implemented via binary dropout [4, 5]. Deep-PILCO performs a sampling-based procedure for simulating trajectories with BNN models of the dynamics. Policy search and model learning are done via stochastic gradient optimization, which scales more favorably to larger datasets and higher dimensionality. Deep-PILCO has been shown to produce better policies for a cart-pole swing-up benchmark task, but shows reduced data efficiency when compared with PILCO. We extend the results of [3] by:

  • Modifying the simulation procedure to incorporate the use of fixed random numbers for policy optimization

  • Clipping gradients to stabilize optimization with back-propagation through time (BPTT)

  • Using BNNs with multiplicative parameter noise where the noise distribution is adapted from data [6]

We show how these improvements allow us to optimize neural network controllers with Deep-PILCO while matching the data efficiency of PILCO on the cart-pole swing-up task; i.e. learning a successful controller with the same amount of experience. We also show how training stochastic policies (implemented as BNNs) can improve convergence toward robust policies. Finally, we demonstrate how these methods can be applied to learning swimming controllers for a six-legged autonomous underwater vehicle.

II Related Work

Dynamics models have long been a core element in the modeling and control of robotic systems. Trajectory optimization approaches [7, 8, 9] can produce highly effective controllers for complex robotic systems when precise analytical models are available. For complex and stochastic systems such as swimming robots, classical models are less reliable. In these cases, either performing online system identification [10] or learning complete dynamics models from data has proven to be effective, and can be integrated tightly with model-based control schemes [11, 12, 13, 14].

Multiple works have applied Deep RL methods to learn various continuous control tasks [15, 16], including full-body control of humanoid characters [17]. These methods do not assume a known reward function, estimating the value of each action from experience. Along with their model-free nature, this results in lower data efficiency compared with the methods we consider here, but there are ongoing efforts to connect model-based and model-free approaches [18].

The most similar works to our own are those which use probabilistic dynamics models for policy optimization. Locally linear controllers can be learned in this fashion, for example by extending the classical Differential Dynamic Programming (DDP) [19] method or Iterative LQG [20] to use GP models. For more complex robots, it is desirable to learn complex non-linear policies using the predictions of learned dynamics. Black-DROPS [21] has recently shown promising performance competitive with the gradient-based PILCO [1] for training GP and NN policies using GP dynamics models. As yet, we are only aware of BNNs being used in the policy learning loop within Deep-PILCO [3], which is the method we directly improve upon. Our approach is the first model-based RL approach to utilize BNNs for both the dynamics as well as the policy network.

III Problem Statement

We focus on model-based policy search methods for episodic tasks. We consider systems that can be modeled with discrete-time dynamics x_{t+1} = f(x_t, u_t), where f is unknown, with states x_t ∈ R^D and controls u_t ∈ R^U, indexed by time-step t. The goal is to find the parameters θ of a policy u_t = π_θ(x_t) that minimize a task-dependent cost function c(x_t) accumulated over a finite time horizon H,

J(θ) = E[ Σ_{t=0}^{H} c(x_t) | θ ]    (1)

The expectation in our case is due to not knowing the true dynamics f, which induces a distribution over trajectories. The objective could be minimized by black-box optimization or likelihood-ratio methods, obtaining trajectory samples directly from the target system. However, such methods are known to require a large number of evaluations, which may be impractical for applications with real robot systems. An alternative is to use experience to fit a model of the dynamics and use it to estimate the objective in Eq. (1). Alg. (1) describes a sketch of model-based optimization methods. A goal of these methods is data-efficiency: to use as little real-world experience as possible. Since we consider fixed-horizon tasks, data-efficiency can be measured in the number of episodes, or trials, until the task is successfully learned.

1: Initialize policy parameters θ and dataset D
2: for episode i = 1 to N do
3:     Obtain trajectory τ_i by executing π_θ for H steps on the robot
4:     Append τ_i to D
5:     Use D to update the dynamics model f̂ (model learning)
6:     Use f̂ to minimize J(θ) and update θ (policy optimization)
7: Return θ
Algorithm 1 Episodic Model-Based RL
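The loop in Algorithm 1 can be sketched in a few lines; the three helper functions below are hypothetical stand-ins for the actual components (robot rollout, model learning, and policy optimization), not part of the paper's implementation:

```python
# Sketch of Algorithm 1 (episodic model-based RL). The helper functions
# passed in are hypothetical placeholders for the real components.

def run_episodic_mbrl(execute_policy, fit_model, optimize_policy,
                      theta, n_episodes, horizon):
    dataset = []                                     # experience dataset D
    for _ in range(n_episodes):
        trajectory = execute_policy(theta, horizon)  # rollout on the robot
        dataset.append(trajectory)                   # append trajectory to D
        model = fit_model(dataset)                   # model learning step
        theta = optimize_policy(model, theta)        # policy optimization step
    return theta
```

Passing in dummy components (a constant rollout, a no-op model fit, and a fixed parameter update) is enough to exercise the control flow.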

IV Background

IV-A Learning a dynamics model with BNNs

A key to data-efficiency is avoiding model bias [22, 23], i.e. optimizing Eq. (1) with a model that makes bad predictions with high confidence. BNNs address model bias by using the posterior distribution over their parameters. Given a model with parameters θ and a dataset D, we'd like to use the posterior p(θ | D) to make predictions at new test points. This distribution represents the uncertainty about the true value of θ, which induces uncertainty on the model predictions: p(y* | x*, D) = ∫ p(y* | x*, θ) p(θ | D) dθ, where y* is the prediction at test point x*. Using the true posterior for predictions with a neural network is intractable. Fortunately, various methods based on variational inference exist, which use tractable approximate posteriors and Monte Carlo integration for predictions [5, 24, 25, 26, 27]. Fitting is done by minimizing the Kullback-Leibler (KL) divergence between the true and the approximate posterior, which can be done by optimizing the objective

L(φ) = -E_{q_φ(θ)}[log p(D | θ)] + KL(q_φ(θ) || p(θ)),    (2)

where the first term is the expected log-likelihood log p(D | θ) under the approximate posterior q_φ(θ), and p(θ) is a user-defined prior on the parameters. These methods usually set θ = g(φ, ε) as a deterministic transformation of noise samples ε, where φ are the parameters of the posterior [25]. For example, in binary dropout g multiplies the weight matrices of each layer of the network with the dropout masks ε, consisting of diagonal noise matrices with entries drawn from a Bernoulli distribution with dropout probability p [4]. To fit the dynamics model, we build the dataset of tuples ((x_t, u_t), Δ_t); where (x_t, u_t) are the state-action pairs that we use as input to the dynamics model, and Δ_t = x_{t+1} - x_t are the changes in state after applying action u_t. We fit the model by minimizing the objective in Eq. (2) via stochastic gradient descent.
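The construction of the training set described above can be sketched as follows; the function name is illustrative:

```python
import numpy as np

# Build the dynamics-model training set: inputs are state-action pairs
# (x_t, u_t), targets are the state changes Delta_t = x_{t+1} - x_t.

def make_dynamics_dataset(states, actions):
    """states: (T+1, D) array of visited states; actions: (T, U) array."""
    X = np.concatenate([states[:-1], actions], axis=1)  # (T, D + U) inputs
    Y = states[1:] - states[:-1]                        # (T, D) state deltas
    return X, Y
```

Predicting state changes rather than absolute next states keeps the regression targets centered near zero, which is the convention both PILCO and Deep-PILCO use.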

IV-B Policy optimization with learned models

To estimate the objective function in Eq. (1) we base our approach on Deep-PILCO [3], which we summarize in Alg. (2). For every optimization iteration, the algorithm draws K particles, each consisting of an initial state x_0^(k) and a set of weights θ^(k) sampled from q(θ), as shown in line 2. For the models used in [3] and this work, sampling weights θ^(k) is equivalent to sampling dropout masks ε^(k). The loop in lines 4 to 6 can be executed in parallel using batch processing. This algorithm requires the task cost function to be known and differentiable. Deep-PILCO uses back-propagation through time (BPTT) to estimate the policy gradients ∇_θ J.

1: for each optimization iteration do
2:     Sample K particles {x_0^(k), θ^(k)} from p(x_0) and q(θ)
3:     for t = 0 to H-1 do
4:         for k = 1 to K do
5:             Evaluate policy u_t^(k) = π_θ(x_t^(k))
6:             Propagate state x̃_{t+1}^(k) = f̂_{θ^(k)}(x_t^(k), u_t^(k))
7:         Fit mean μ_{t+1} and covariance Σ_{t+1} to {x̃_{t+1}^(k)}
8:         Resample x_{t+1}^(k) from N(μ_{t+1}, Σ_{t+1})
9:     Evaluate objective J(θ) ≈ (1/K) Σ_k Σ_t c(x_t^(k))
10:    Compute gradient estimate ∇_θ J via BPTT
11:    Update θ by a stochastic gradient descent step
Algorithm 2 Policy search with Deep-PILCO
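A minimal numpy sketch of the particle simulation in Algorithm 2 is shown below. It assumes the sampled dynamics models are given as plain functions (one per set of dropout masks); the names and interface are illustrative, not the paper's API, and the gradient computation via BPTT is omitted:

```python
import numpy as np

def simulate_cost(dynamics_samples, policy, x0_mean, x0_std, horizon,
                  cost, rng):
    """Monte Carlo estimate of the expected cost in Eq. (1), following
    the rollout in Algorithm 2. `dynamics_samples` is a list of K
    sampled one-step models x, u -> x_next."""
    K = len(dynamics_samples)
    X = rng.normal(x0_mean, x0_std, size=(K, len(x0_mean)))  # particles
    total = 0.0
    for _ in range(horizon):
        U = np.stack([policy(x) for x in X])                  # evaluate policy
        X = np.stack([f(x, u)                                 # propagate state
                      for f, x, u in zip(dynamics_samples, X, U)])
        mu = X.mean(axis=0)                                   # fit mean and
        cov = np.atleast_2d(np.cov(X.T))                      # covariance
        X = rng.multivariate_normal(mu, cov, size=K)          # resample
        total += np.mean([cost(x) for x in X])                # accumulate cost
    return total
```

The per-step Gaussian fit-and-resample keeps the particle cloud unimodal, mirroring the moment-matching behavior of the original PILCO predictions.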

V Improvements to Deep-PILCO

Here we describe the changes we have made to Deep-PILCO that were crucial for improving its data-efficiency and obtaining the results we describe in Sec. VI. Our changes are summarized in Alg. (3). This algorithm can still be executed efficiently using batch processing with state-of-the-art deep learning frameworks.

V-A Common random numbers for policy evaluation

The convergence of Algorithms (2) and (3) is highly dependent on the variance of the estimated gradient ∇_θ J. In this case, the variance of the gradients depends on the sources of randomness used for simulating trajectories: the initial state samples x_0^(k), the multiplicative noise masks ε^(k), and the random numbers z_t^(k) used for re-sampling. A common variance reduction technique used in stochastic optimization is to fix random numbers during optimization [28]. Using common random numbers (CRNs) reduces variance in two ways: gradient evaluations become deterministic, and evaluations at different values of the optimization variable become correlated. We introduce CRNs by drawing all the random numbers we need for simulating trajectories at the beginning of policy optimization (lines 1 to 3 in Alg. (3)) and keeping them fixed as the policy parameters are updated. This is possible because we use BNNs that rely on the re-parametrization trick [25] for evaluation. This is effective in reducing variance and improving convergence, but it may introduce bias. A simple way to deal with bias is to increase the number of particles used for gradient evaluation. We increased K from 10, the number used in [3], to 100 for our experiments, and found this to improve convergence with a small penalty on running time.

Fixing random numbers in the context of policy search is known as the PEGASUS (Policy Evaluation-of-Goodness And Search Using Scenarios) algorithm [29]. PEGASUS transforms a given Markov Decision Process (MDP) into "an equivalent one where all transitions are deterministic" by assuming access to a deterministic simulative model of the MDP. A deterministic simulative model is one that has no internal random number generator, so any random numbers that are needed must be given to it as input. This is the case when using BNN models. PEGASUS provides theoretical justification for our approach, particularly in that, to decrease the upper bound on the error of estimates of J(θ) obtained with CRNs, it suffices to increase the number of scenarios K.

V-B Stabilization for back-propagation through time

As noted in [3], the recurrent application of BNNs in Algorithm 2 can be interpreted as a Recurrent Neural Network (RNN) model. As such, Deep-PILCO is prone to vanishing and exploding gradients when computing them via BPTT [30], especially when dealing with tasks that require long time horizons or very deep models for the dynamics and policy. Although numerous techniques have been proposed in the RNN literature, we opted to deal with these problems by using ReLU activations for the policy and dynamics model, and by clipping the gradients to have a norm of at most a chosen threshold c. We show the effect of various settings of the clipping value c on the convergence of policy search in Fig. (2(b)).
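Norm-based gradient clipping, as used in lines 17-18 of Algorithm 3, is a few lines of code; this is a minimal sketch, and in practice the deep learning framework's built-in clipping would be used:

```python
import numpy as np

def clip_gradient(grads, max_norm):
    """Rescale a list of gradient arrays so that their joint L2 norm is
    at most `max_norm`, preserving the gradient direction. This is the
    standard rescaling used to stabilize BPTT."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads
```

Gradients whose norm is already below the threshold pass through unchanged, so clipping only intervenes on the exploding-gradient steps.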

1: Sample noise ε_f^(k) for dynamics, k = 1, ..., K
2: Sample noise ε_π^(k) for policy
3: Sample state noise z_t^(k), t = 0, ..., H
4: for each optimization iteration do
5:     Form particles x_0^(k) using the fixed noise z_0^(k)
6:     for t = 0 to H-1 do
7:         for k = 1 to K do
8:             θ_f^(k) = g(φ_f, ε_f^(k))
9:             θ_π^(k) = g(φ_π, ε_π^(k))
10:            Evaluate policy u_t^(k) = π_{θ_π^(k)}(x_t^(k))
11:            Propagate state x̃_{t+1}^(k) = f̂_{θ_f^(k)}(x_t^(k), u_t^(k))
12:        Fit mean μ_{t+1} and covariance Σ_{t+1} to {x̃_{t+1}^(k)}
13:        for k = 1 to K do
14:            x_{t+1}^(k) = μ_{t+1} + chol(Σ_{t+1}) z_{t+1}^(k)
15:    Evaluate objective J(θ)
16:    Compute gradient estimate ∇_θ J via BPTT
17:    if ‖∇_θ J‖ > c then
18:        ∇_θ J ← c ∇_θ J / ‖∇_θ J‖
19:    Update θ by a stochastic gradient descent step
Algorithm 3 Our method: Deep-PILCO with PEGASUS evaluation and gradient clipping

V-C BNN models with Log-Normal multiplicative noise

We focused on methods that use multiplicative noise on the activations (e.g. binary dropout) because of their simplicity and computational efficiency. Deep-PILCO with binary dropout requires tuning the dropout probability to a value appropriate for the model size. We experimented with various BNN models [6, 25, 26, 27] that enable learning the dropout probabilities from data. The best-performing method in our experiments used truncated Log-Normal dropout with a truncated log-uniform prior. This choice of prior constrains the multiplicative noise to values between 0 and 1 [6].
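The following sketch illustrates what such truncated multiplicative noise looks like: z = exp(s), with s drawn from a Normal distribution truncated to an interval with upper bound 0, so that z lies in (0, 1]. Rejection sampling is used here only for simplicity; the function name, parameters, and default truncation bounds are illustrative, not the scheme of [6]:

```python
import numpy as np

def truncated_lognormal_noise(mu, sigma, size, lo=-5.0, hi=0.0, rng=None):
    """Sample multiplicative noise z = exp(s), s ~ N(mu, sigma^2)
    truncated to [lo, hi]. With hi = 0, the noise is constrained
    to (0, 1], matching the support of a truncated log-uniform prior."""
    rng = rng if rng is not None else np.random.default_rng(0)
    out = np.empty(size)
    filled = 0
    while filled < size:
        s = rng.normal(mu, sigma, size)          # propose candidates
        s = s[(s >= lo) & (s <= hi)]             # keep in-range samples
        take = min(size - filled, s.size)
        out[filled:filled + take] = s[:take]
        filled += take
    return np.exp(out)
```

Unlike binary dropout masks, these samples take continuous values in (0, 1], and in variational dropout schemes the noise distribution's parameters (here mu and sigma) are adapted from data rather than hand-tuned.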

V-D Training neural network controllers

While Deep-PILCO had been limited to training single-layer Radial Basis Function policies, the application of gradient clipping and CRNs allows stable training of deep neural network policies, opening the door for richer behaviors. We found that adding dropout to the policy networks improves performance. During policy evaluation, we sample policies the same way as we do dynamics models: a policy sample corresponds to a set of dropout masks ε_π^(k). Thus each simulated state particle has a corresponding dynamics model and policy, which remain fixed during policy optimization. This can be interpreted as attempting to learn a distribution of controllers that are likely to perform well over plausible dynamics models. We make the policy stochastic during execution on the target system by re-sampling the policy dropout mask at every time step. This provides a degree of exploration that we found beneficial.
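Re-sampling the dropout mask at every call is what makes the executed policy stochastic; a toy single-layer sketch (the paper's policies are deep networks, and the inverted-dropout scaling here is a standard convention, not the paper's exact scheme):

```python
import numpy as np

def stochastic_policy_action(weights, x, p=0.1, rng=None):
    """Evaluate a dropout policy, re-drawing the dropout mask on every
    call, as done during execution on the real system."""
    rng = rng if rng is not None else np.random.default_rng()
    # Inverted dropout: zero each input w.p. p, rescale survivors by 1/(1-p)
    mask = (rng.random(x.shape) >= p) / (1.0 - p)
    return weights @ (x * mask)
```

Calling this at every time step yields slightly different actions for the same state, which is the source of the mild exploration noted above; during policy optimization, by contrast, the mask for each particle is drawn once and held fixed.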

VI Results

(a) Cart-pole task
(b) Double cart-pole task
Fig. 2: Benchmark tasks used in Sec. VI. In both tasks, the pendulum starts hanging straight down, with the cart centered at x = 0. The goal is to balance the tip of the pole at its highest possible location while keeping the cart at x = 0; this occurs when the pole is upright in the cart-pole, and when both links are upright in the double cart-pole.

We tested the improvements described in Section V on two benchmark scenarios: swinging up and stabilizing an inverted pendulum on a cart, and swinging up and stabilizing a double pendulum on a cart (see Fig. (2)). The first task was meant to compare performance on the same experiment as [3]. We chose the second scenario to compare the methods on a harder long-term prediction task, due to the chaotic dynamics of the double pendulum. In both cases, the system is controlled by applying a horizontal force to the cart. We also evaluate our approach on gait learning tasks for an underwater hexapod robot [2] to demonstrate its applicability to locomotion tasks on complex robot systems. We use the ADAM optimizer [31] for model fitting and policy optimization, with the default parameters suggested by the authors, and report the best results obtained after manual hyper-parameter tuning.

VI-A Cart-pole swing-up task

(a) Comparison of the effect of introducing CRNs
(b) Effect of clipping gradients
Fig. 3: (a) Illustrates the benefit of fixing random numbers for policy evaluation (Alg. (3)) vs. stochastic policy evaluations (Alg. (2)). In (b) we show the area under the learning curve for the cart-pole task for various gradient clipping values (lower is better).

While previous experiments combining PILCO with PEGASUS were unsuccessful [23], we found its application to Deep-PILCO to result in a significant improvement in convergence when training neural network policies. Fig. (2(a)) shows how the use of CRNs (deterministic policy evaluation) results in faster convergence than the original Deep-PILCO formulation (stochastic policy evaluation), which only matches the cost of our approach after around the 20th trial. These experiments were done with fixed settings of the learning rate and gradient clipping value. Fig. (2(b)) illustrates the effect of gradient clipping for different values of the clipping threshold in Alg. (3). The area under the learning curve gives an idea of the speed of convergence as the clipping value changes. The trend is that any amount of gradient clipping yielded a large improvement over not clipping at all, and that performance was stable across the specific choice of clipping value.

(a) Cart-pole RBF Policies
(b) Cart-pole Deep Policies
Fig. 4: Cost per trial on the cart-pole swing-up task. In (a), we compare different dynamics models for learning RBF policies. (b) compares BNN models for learning NN policies, showing how our approach matches the data-efficiency of PILCO with better final performance.

Fig. (4) summarizes our results for the cart-pole domain (the code used in these experiments is available at https://github.com/juancamilog/kusanagi). Fig. (3(a)) illustrates the difference in performance between PILCO using sparse spectrum GP (SSGP) regression [32] for the dynamics and two versions of Deep-PILCO using BNN dynamics: one using binary dropout with a fixed dropout probability, and the other using Log-Normal dropout with a truncated log-uniform prior. The BNN models are ReLU networks with two hidden layers of 200 units and a linear output layer. The models predict heteroscedastic noise, which is used to corrupt the input to the policy during simulation. We used data from all previous episodes for model learning after each trial. The initial experience was gathered with a single execution of a policy that selects actions uniformly at random. Separate fixed learning rates were used for model learning and policy optimization. The policies were RBF networks with 30 units. Fig. (3(b)) provides a comparison of different BNN dynamics models when training neural network policies. The policy networks are ReLU networks with two hidden layers of 200 units. For BNN policies (Drop MLP) we set a constant dropout probability. Note that our method is able to train neural network controllers with better performance (lower cost) than either PILCO or Deep-PILCO with RBF controllers, within a similar number of trials. Using truncated Log-Normal dropout (Log-Normal Drop Dyn) for learning a stochastic policy (Drop MLP) results in the best performance for the cart-pole task.

VI-B Double pendulum on cart swing-up task

Fig. (5) illustrates the effect of learning neural network controllers on the more complicated double cart-pole swing-up task. We were unable to get Alg. (2) with RBF policies to converge on this task. The setup is similar to the cart-pole task, but we change the network architectures since the dynamics are more complex. The dynamics models are ReLU networks with 4 hidden layers of 200 units and a linear output layer. The policies are ReLU networks with four hidden layers of 50 units, trained with a fixed learning rate. The initial experience comes from 2 runs with random actions. Here the differences in performance are more pronounced: our method converges after 42 trials, corresponding to 126 s of experience at 10 Hz. This is close to the 84 s at 13.3 Hz reported in [23]. The combination of BNN dynamics (Log-Normal Drop Dyn) and a BNN policy (Drop MLP Pol) requires the fewest trials to achieve the lowest cost.

Fig. 5: Cost per trial on the double cart-pole swing-up task. The shaded regions correspond to half a standard deviation. This demonstrates the benefit of using Log-Normal multiplicative noise for the dynamics with dropout regularization for the policies.

VI-C Learning swimming gaits on an underwater robot

(a) 2-Leg Knife edge
(b) 2-Leg Belly up
(c) 2-Leg Corkscrew
(d) 2-Leg Depth change
Fig. 6: Learning curve and the evolution of the trajectory distribution as learning progresses for the 2-leg tasks. The robot learns to control its pose by setting the appropriate amplitudes and leg offset angles for its back 2 legs. The dashed lines represent the desired target states. Additional results and videos of these behaviours are available at https://github.com/mcgillmrl/robot_learning

These tasks consist of finding feedback controllers for the robot's 3D pose via periodic motion of its legs. Fig. (1) illustrates the execution of a gait learned using our methods. The robot's state space consists of readings from its inertial measurement unit (IMU), its depth sensor, and its motor encoders. To compare with previously published results, the action space is defined as the parameters of the periodic leg command (PLC) pattern generator [2], with the same constraints as prior work. We conducted experiments on the following control tasks (the code used for these experiments and video examples of other learned gaits are available at https://github.com/mcgillmrl/robot_learning):

  1. knife edge: swimming straight ahead while stabilizing a roll angle of 90 degrees

  2. belly up: swimming straight ahead while stabilizing a roll angle of 180 degrees

  3. corkscrew: swimming straight ahead with a constant rolling velocity (anti-clockwise)

  4. 1 m depth change: diving and stabilizing 1 meter below the current depth.

There were two versions of these experiments. In the first, which we call the 2-leg tasks, the robot controls only the amplitudes and offsets of the two back legs (4 control dimensions). Its state corresponds to the angles and angular velocities measured by the IMU, plus the depth sensor measurement (7 state dimensions). In the second version, the robot controls the amplitudes, offsets, and phases of all 6 legs (18 control dimensions). In this case, the state consists of the IMU and depth sensor readings plus the leg angles as measured from the motor encoders (13 state dimensions). We transform angle dimensions into their complex representation before passing the state as input to the dynamics model and policy, as described in [23].
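The angle transform mentioned above can be sketched as follows: each angle dimension is replaced by its (cos, sin) pair so that the model never sees the 2π wrap-around. The index layout chosen here (non-angle dimensions first, then the cos/sin pairs) is illustrative:

```python
import numpy as np

def encode_angles(state, angle_idx):
    """Replace each angle dimension of `state` with its (cos, sin)
    pair, leaving the remaining dimensions unchanged."""
    angle_idx = np.asarray(angle_idx)
    rest = np.delete(state, angle_idx)   # non-angle dimensions
    ang = state[angle_idx]               # angle dimensions
    return np.concatenate([rest, np.cos(ang), np.sin(ang)])
```

This representation makes states just below and just above the wrap-around point map to nearby inputs, which is what makes it suitable for regression with smooth models.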

(a) 6-Leg Knife edge
(b) 6-Leg Belly up
(c) 6-Leg Corkscrew
(d) 6-Leg Depth change
Fig. 7: Learning curve and the evolution of the trajectory distribution as learning progresses for 6-leg tasks. In this case, the robot is trying to control the amplitudes, leg angle offsets, and phase offsets for all 6 legs. The algorithm takes longer to converge in this case, when compared to the 2-leg tasks. This is possibly due to the larger state and action spaces (13 state dimensions + 18 action dimensions). Nevertheless, this demonstrates that the algorithm can scale to higher dimensional problems.

We trained dynamics models and policies with 4 hidden layers of 200 units each. The dynamics models use truncated Log-Normal dropout, and we enable dropout for the policy with a fixed dropout probability. We used fixed settings of the learning rate and gradient clipping threshold. The experience dataset is initialized with 5 random trials, common to all tasks that share the same state and action spaces.

Figs. (6) and (7) show the results of gait learning in the simulation environment described in [2]. In addition to learning curves on the left of each task panel, we show detailed state telemetry for selected learning episodes on the right to provide intuition on stability and learning progression. The shaded regions represent the variance of the trajectory distributions on the target system over 10 different runs. In each case attempted, our method was able to learn effective swimming behavior, coordinating the motions of multiple flippers and overcoming simulated hydrodynamic effects without any prior model. For the 2-leg tasks, our method obtains successful policies in 10-20 trials, a number competitive with results reported in [2]. We obtained successful controllers for the depth-change task, which was unsuccessful in prior work. The 6-leg tasks, with their considerably higher-dimensional state and action spaces, take roughly double the number of trials, but all tasks still converged by trial 50, which remains practical for real deployment.

VII Conclusion

We have presented improvements to a probabilistic model-based reinforcement learning algorithm, Deep-PILCO, to enable fast synthesis of controllers for robotics applications. Our algorithm is based on treating neural network models trained with dropout as an approximation to the posterior distribution of dynamics models given the experience data. Sampling dynamics models from this distribution helps avoid model bias during policy optimization; policies are optimized for a finite sample of dynamics models, obtained through the application of dropout noise masks. Our changes enable training of neural network controllers, which we demonstrate to outperform RBF controllers on the cart-pole swing-up task. We obtain competitive performance on the task of swing-up and stabilization of a double pendulum on a cart. Finally, we demonstrated the usefulness of the algorithm on the higher-dimensional tasks of learning gaits for pose stabilization of a six-legged underwater robot. We replicate previous results [2] where we control the robot with 2 flippers, and provide new results on learning to control the robot using all 6 legs, now including phase offsets.

References

  • [1] M. P. Deisenroth, D. Fox, and C. E. Rasmussen, “Gaussian processes for data-efficient learning in robotics and control,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.
  • [2] D. Meger, J. C. G. Higuera, A. Xu, P. Giguere, and G. Dudek, “Learning legged swimming gaits from experience,” in Robotics and Automation (ICRA), 2015 IEEE International Conference on, 2015.
  • [3] Y. Gal, R. McAllister, and C. E. Rasmussen, “Improving PILCO with Bayesian neural network dynamics models,” in Data-Efficient Machine Learning Workshop, ICML, 2016.
  • [4] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting.” Journal of Machine Learning Research, 2014.
  • [5] Y. Gal and Z. Ghahramani, “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning,” in Proceedings of The 33rd International Conference on Machine Learning, 2016.
  • [6] K. Neklyudov, D. Molchanov, A. Ashukha, and D. P. Vetrov, “Structured Bayesian pruning via log-normal multiplicative noise,” in Advances in Neural Information Processing Systems, 2017.
  • [7] D. H. Jacobson and D. Q. Mayne, Differential Dynamic Programming.    Elsevier, 1970.
  • [8] W. Li and E. Todorov, “Iterative linear quadratic regulator design for nonlinear biological movement systems,” in Proceedings of the 1st International Conference on Informatics in Control, Automation and Robotics, 2004.
  • [9] Y. Tassa, N. Mansard, and E. Todorov, “Control-limited differential dynamic programming,” in Proceedings of the International Conference on Robotics and Automation (ICRA), 2014.
  • [10] A. D. Marchese, R. Tedrake, and D. Rus, “Dynamics and trajectory optimization for a soft spatial fluidic elastomer manipulator,” International Journal of Robotics Research, 2015.
  • [11] D. Nguyen-Tuong and J. Peters, “Model learning for robot control: a survey,” Cognitive Processing, vol. 12, no. 4, pp. 319–340, Nov 2011.
  • [12] C. G. Atkeson, A. W. Moore, and S. Schaal, “Locally weighted learning for control,” Lazy learning, pp. 75 – 113, 1997.
  • [13] P. Abbeel, A. Coates, M. Quigley, and A. Y. Ng, “An application of reinforcement learning to aerobatic helicopter flight,” in Proceedings of Neural Information Processing Systems (NIPS), 2006.

II Related Work

Dynamics models have long been a core element in the modeling and control of robotic systems. Trajectory optimization approaches [7, 8, 9] can produce highly effective controllers for complex robotic systems when precise analytical models are available. For complex and stochastic systems such as swimming robots, classical models are less reliable. In these cases, either performing online system identification [10] or learning complete dynamics models from data has proven to be effective, and can be integrated tightly with model-based control schemes [11, 12, 13, 14].

Multiple works have applied Deep RL methods to learn various continuous control tasks [15, 16], including full-body control of humanoid characters [17]. These methods do not assume a known reward function, estimating the value of each action from experience. Along with their model-free nature, this results in lower data efficiency compared with the methods we consider here, but there are ongoing efforts to connect model-based and model-free approaches [18].

The most similar works to our own are those which use probabilistic dynamics models for policy optimization. Locally linear controllers can be learned in this fashion, for example by extending the classical Differential Dynamic Programming (DDP) [19] method or Iterative LQG [20] to use GP models. For more complex robots, it is desirable to learn complex non-linear policies using the predictions of learned dynamics. Black-DROPS [21] has recently shown promising performance competitive with the gradient-based PILCO [1] for training GP and NN policies using GP dynamics models. As yet, we are only aware of BNNs being used in the policy learning loop within Deep-PILCO [3], which is the method we directly improve upon. Our approach is the first model-based RL approach to utilize BNNs for both the dynamics as well as the policy network.

III Problem Statement

We focus on model-based policy search methods for episodic tasks. We consider systems that can be modeled with discrete-time dynamics x_{t+1} = f(x_t, u_t), where f is unknown, with states x_t and controls u_t, indexed by time-step t. The goal is to find the parameters θ of a policy u_t = π_θ(x_t) that minimize a task-dependent cost function c(x_t) accumulated over a finite time horizon H,

J(θ) = E_τ[ Σ_{t=0}^{H} c(x_t) ]    (1)

The expectation in our case is due to not knowing the true dynamics f, which induces a distribution over trajectories τ = (x_0, u_0, ..., x_H). The objective could be minimized by black-box optimization or likelihood ratio methods, obtaining trajectory samples directly from the target system. However, such methods are known to require a large number of evaluations, which may be impractical for applications with real robot systems. An alternative is to use experience to fit a model of the dynamics f and use it to estimate the objective in Eq. (1). Alg. (1) describes a sketch for model-based optimization methods. A goal of these methods is data-efficiency: to use as little real-world experience as possible. Since we consider fixed-horizon tasks, data-efficiency can be measured by the number of episodes, or trials, until the task is successfully learned.

1: Initialize policy parameters θ and dataset D
2: for episode = 1 to N do
3:      Obtain trajectory τ by executing π_θ for H steps on robot
4:      Append τ to D
5:      Use D to update dynamics model f̂        ▷ model learning
6:      Use f̂ to minimize J(θ) and update θ     ▷ policy optimization
7: Return θ
Algorithm 1 Episodic Model-Based RL
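As a concrete illustration, the loop in Alg. (1) can be sketched on a toy scalar system. The linear dynamics, least-squares model fit, and grid-search policy optimizer below are illustrative stand-ins for the BNN dynamics and gradient-based policy search used in the paper:

```python
import numpy as np

def run_episode(theta, horizon, rng, random_actions=False):
    """Execute the policy u = -theta*x on a toy scalar system x' = 0.9x + u + noise."""
    x, traj = 1.0, []
    for _ in range(horizon):
        u = rng.uniform(-1.0, 1.0) if random_actions else -theta * x
        x_next = 0.9 * x + u + 0.01 * rng.standard_normal()
        traj.append((x, u, x_next))
        x = x_next
    return traj

def fit_model(dataset):
    """Least-squares fit of x' = a*x + b*u from experience (stand-in for the BNN)."""
    X = np.array([[x, u] for x, u, _ in dataset])
    y = np.array([x_next for _, _, x_next in dataset])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # (a, b)

def optimize_policy(model, horizon):
    """Pick the gain that minimizes the predicted cumulative cost sum(x^2)."""
    a, b = model
    def predicted_cost(theta):
        x, cost = 1.0, 0.0
        for _ in range(horizon):
            x = a * x + b * (-theta * x)
            cost += x ** 2
        return cost
    return min(np.linspace(0.0, 1.5, 31), key=predicted_cost)

rng = np.random.default_rng(0)
theta, dataset = 0.0, []
for episode in range(3):
    # seed the dataset with a random rollout, as in the experiments below
    dataset += run_episode(theta, horizon=20, rng=rng,
                           random_actions=(episode == 0))
    theta = optimize_policy(fit_model(dataset), horizon=20)
```

After three trials the learned model recovers the true coefficients (a ≈ 0.9, b ≈ 1.0) and the optimizer selects a stabilizing gain near 0.9.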

IV Background

IV-A Learning a dynamics model with BNNs

A key to data-efficiency is avoiding model bias [22, 23], i.e. optimizing Eq. (1) with a model that makes bad predictions with high confidence. BNNs address model bias by using the posterior distribution over their parameters. Given a model with parameters ω and a dataset D, we would like to use the posterior p(ω | D) to make predictions at new test points. This distribution represents the uncertainty about the true value of ω, which induces uncertainty on the model predictions: p(y* | x*, D) = ∫ p(y* | x*, ω) p(ω | D) dω, where y* is the prediction at test point x*. Using the true posterior for predictions with a neural network is intractable. Fortunately, various methods based on variational inference exist, which use tractable approximate posteriors and Monte Carlo integration for predictions [5, 24, 25, 26, 27]. Fitting is done by minimizing the Kullback-Leibler (KL) divergence between the true and the approximate posterior, which can be done by optimizing the objective

L(φ) = −E_{q_φ(ω)}[log p(D | ω)] + KL(q_φ(ω) || p(ω))    (2)

where the first term is the expected value of the log-likelihood log p(D | ω) under the approximate posterior q_φ(ω), and p(ω) is a user-defined prior on the parameters. These methods usually set ω as a deterministic transformation of noise samples, ω = g(φ, ε), where φ are the parameters of the posterior [25]. For example, binary dropout multiplies the weight matrices of each layer of the network with dropout masks z, consisting of diagonal noise matrices with entries drawn from a Bernoulli distribution with dropout probability p [4]. To fit the dynamics model, we build a dataset of tuples (x̃_t, Δ_t), where x̃_t = (x_t, u_t) are the state-action pairs that we use as input to the dynamics model, and Δ_t = x_{t+1} − x_t are the changes in state after applying action u_t. We fit the model by minimizing the objective in Eq. (2) via stochastic gradient descent.
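A minimal sketch of prediction with a dropout-based approximate posterior: each forward pass samples a Bernoulli mask over the hidden units, giving one draw from the distribution over networks, and statistics over many draws quantify predictive uncertainty. The two-layer network here is an illustrative stand-in, not the architecture used in the experiments:

```python
import numpy as np

def mc_dropout_predict(x, W1, W2, p=0.1, n_samples=50, rng=None):
    """Monte Carlo prediction with binary dropout: sample a mask z per forward
    pass and report mean/std of the outputs as predictive mean/uncertainty."""
    rng = rng if rng is not None else np.random.default_rng(0)
    preds = []
    for _ in range(n_samples):
        h = np.maximum(0.0, x @ W1)                  # ReLU hidden layer
        z = rng.binomial(1, 1.0 - p, size=h.shape)   # keep each unit w.p. 1-p
        preds.append((h * z / (1.0 - p)) @ W2)       # rescaled dropout output
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.std(axis=0)

rng = np.random.default_rng(1)
W1 = rng.standard_normal((3, 8))
W2 = rng.standard_normal((8, 2))
x = np.ones((1, 3))
mean, std = mc_dropout_predict(x, W1, W2, p=0.2)
```

With p = 0 every mask is all-ones, all draws coincide, and the predictive standard deviation collapses to zero, as expected.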

IV-B Policy optimization with learned models

To estimate the objective function in Eq. (1), we base our approach on Deep-PILCO [3], which we summarize in Alg. (2). For every optimization iteration, the algorithm draws K particles, each consisting of an initial state sample x_0^(k) and a set of weights ω^(k) sampled from q_φ(ω), as shown in line 2. For the models used in [3] and this work, sampling weights is equivalent to sampling dropout masks z^(k). The loop in lines 4 to 6 can be executed in parallel using batch processing. This algorithm requires the task cost function c to be known and differentiable. Deep-PILCO uses back-propagation through time (BPTT) to estimate the policy gradients dJ/dθ.

1: for i = 1 to N_opt do
2:      Sample K particles {x_0^(k), z^(k)}
3:      for t = 0 to H−1 do
4:           for k = 1 to K do
5:                Evaluate policy u_t^(k) = π_θ(x_t^(k))
6:                Propagate state x_{t+1}^(k) = f̂(x_t^(k), u_t^(k); z^(k))
7:           Fit mean μ_{t+1} and covariance Σ_{t+1} to {x_{t+1}^(k)}
8:           Resample x_{t+1}^(k) from N(μ_{t+1}, Σ_{t+1})
9:      Evaluate objective J(θ)
10:     Compute gradient estimate dJ/dθ via BPTT
11:     Update θ by stochastic gradient descent step
Algorithm 2 Policy search with Deep-PILCO
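The particle propagation and moment-matching steps of Alg. (2) can be sketched as follows; `policy`, `step_fn`, and the quadratic cost are hypothetical placeholders standing in for the learned policy, the sampled BNN dynamics, and the task cost:

```python
import numpy as np

def simulate_rollout(policy, step_fn, x0, masks, horizon, rng):
    """Sketch of Alg. 2's inner loop: propagate K particles, each paired with
    its own sampled dropout mask (i.e. its own dynamics model), then
    moment-match: fit a Gaussian to the particle cloud and resample from it."""
    X, step_costs = x0.copy(), []
    K = X.shape[0]
    for _ in range(horizon):
        U = policy(X)                                           # one action per particle
        X = np.stack([step_fn(X[k], U[k], masks[k]) for k in range(K)])
        mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-8        # fit diagonal Gaussian
        X = mu + sigma * rng.standard_normal(X.shape)           # resample particles
        step_costs.append(float((X ** 2).sum(axis=1).mean()))   # example quadratic cost
    return sum(step_costs)

rng = np.random.default_rng(0)
x0 = np.ones((10, 1))                              # K=10 particles, 1-D state
masks = rng.binomial(1, 0.9, size=10) / 0.9        # one binary dropout mask per particle
J = simulate_rollout(policy=lambda X: -0.5 * X,
                     step_fn=lambda x, u, m: m * (0.9 * x + u),
                     x0=x0, masks=masks, horizon=5, rng=rng)
```

In the real algorithm these operations run batched on all particles at once; the explicit per-particle loop here only makes the mask pairing visible.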

V Improvements to Deep-PILCO

Here we describe the changes we made to Deep-PILCO that were crucial for improving its data-efficiency and obtaining the results we describe in Sec. VI. Our changes are summarized in Alg. (3). This algorithm can still be executed efficiently using batch processing with state-of-the-art deep learning frameworks.

V-A Common random numbers for policy evaluation

The convergence of Algorithms (2) and (3) is highly dependent on the variance of the estimated gradient dJ/dθ. In this case, the variance of the gradients depends on the sources of randomness used for simulating trajectories: the initial state samples x_0^(k), the multiplicative noise masks z^(k), and the random numbers used for re-sampling. A common variance reduction technique used in stochastic optimization is to fix random numbers during optimization [28]. Using common random numbers (CRNs) reduces variance in two ways: gradient evaluations become deterministic, and evaluations over different values of the optimization variable become correlated. We introduce CRNs by drawing all the random numbers we need for simulating trajectories at the beginning of policy optimization (lines 1 to 3 in Alg. (3)) and keeping them fixed as the policy parameters are updated. This is possible because we use BNNs that rely on the re-parametrization trick [25] for evaluation. This is effective in reducing variance and improving convergence, but it may introduce bias. A simple way to deal with bias is to increase the number of particles K used for gradient evaluation. We increased K from 10, the number used in [3], to 100 for our experiments, and found this to improve convergence with a small penalty on running time.
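The effect of CRNs can be illustrated on a toy stochastic rollout: all noise is drawn once up front, so the returned objective is a deterministic function of the policy parameter (here a scalar gain; the dynamics are an illustrative stand-in, not the BNN simulation):

```python
import numpy as np

def make_crn_objective(n_particles, horizon, seed=0):
    """Draw all simulation noise once (common random numbers); the returned
    objective is then a deterministic function of the policy parameter."""
    rng = np.random.default_rng(seed)
    x0 = rng.standard_normal(n_particles)              # initial-state samples
    eps = rng.standard_normal((horizon, n_particles))  # per-step re-sampling noise
    def J(theta):
        x, cost = x0.copy(), 0.0
        for t in range(horizon):
            x = 0.9 * x - theta * x + 0.05 * eps[t]    # toy stochastic dynamics
            cost += float(np.mean(x ** 2))
        return cost
    return J
```

Repeated evaluations at the same parameter now return identical values, so finite-difference or back-propagated gradients of J are noise-free with respect to the simulation randomness.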

Fixing random numbers in the context of policy search is known as the PEGASUS (Policy Evaluation-of-Goodness And Search Using Scenarios) algorithm [29]. PEGASUS transforms a given Markov Decision Process (MDP) into "an equivalent one where all transitions are deterministic" by assuming access to a deterministic simulative model of the MDP. A deterministic simulative model is one that has no internal random number generator, so any random numbers that are needed must be given to it as input. This is the case when using BNN models. PEGASUS provides theoretical justification for our approach, particularly in that to decrease the upper bound on the error of estimates of J(θ) using CRNs it suffices to increase the number of particles K.

V-B Stabilization for back-propagation through time

As noted in [3], the recurrent application of BNNs in Algorithm 2 can be interpreted as a Recurrent Neural Network (RNN) model. As such, Deep-PILCO is prone to suffer from vanishing and exploding gradients when computing them via BPTT [30], especially when dealing with tasks that require long time horizons or very deep models for the dynamics and policy. Although numerous techniques have been proposed in the RNN literature, we opted to deal with these problems by using ReLU activations for the policy and dynamics model, and clipping the gradients so that their norm does not exceed a maximum value. We show the effect of various settings of the clipping value on the convergence of policy search in Fig. (2(b)).
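Clipping by global norm can be sketched as follows; this is the standard rescaling rule from [30], not code from the authors' implementation:

```python
import numpy as np

def clip_gradient(grads, max_norm):
    """Rescale a list of gradient arrays so their global L2 norm is at most
    max_norm, the standard remedy for exploding gradients in BPTT."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total <= max_norm or total == 0.0:
        return grads                      # already within bounds: leave unchanged
    scale = max_norm / total
    return [g * scale for g in grads]
```

The clipped gradient keeps its direction and only its magnitude is bounded, which is why the specific clipping threshold matters less than clipping at all.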

1: Sample noise masks for dynamics z_dyn^(k)
2: Sample noise masks for policy z_pol^(k)
3: Sample state noise ε_t^(k)
4: for i = 1 to N_opt do
5:      Sample K particles x_0^(k)
6:      for t = 0 to H−1 do
7:           for k = 1 to K do
8:                Set dynamics mask to z_dyn^(k)
9:                Set policy mask to z_pol^(k)
10:               Evaluate policy u_t^(k) = π_θ(x_t^(k); z_pol^(k))
11:               Propagate state x_{t+1}^(k) = f̂(x_t^(k), u_t^(k); z_dyn^(k))
12:          Fit mean μ_{t+1} and covariance Σ_{t+1} to {x_{t+1}^(k)}
13:          for k = 1 to K do
14:               x_{t+1}^(k) = μ_{t+1} + Σ_{t+1}^{1/2} ε_t^(k)
15:     Evaluate objective J(θ)
16:     Compute gradient estimate g = dJ/dθ via BPTT
17:     if ||g|| > g_max then
18:          g = g_max · g / ||g||
19:     Update θ by stochastic gradient descent step
Algorithm 3 Our method: Deep-PILCO with PEGASUS evaluation and gradient clipping

V-C BNN models with Log-Normal multiplicative noise

We focused on methods that use multiplicative noise on the activations (e.g. binary dropout) because of their simplicity and computational efficiency. Deep-PILCO with binary dropout requires tuning the dropout probability to a value appropriate for the model size. We experimented with various BNN models [6, 25, 26, 27] to enable learning the dropout probabilities from data. The best performing method in our experiments was truncated Log-Normal dropout with a truncated log-uniform prior. This choice of prior constrains the multiplicative noise to values between 0 and 1 [6].
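One simple way to draw such noise is rejection sampling: sample Log-Normal values and keep those inside the truncation interval. This is an illustrative sketch of the noise distribution only, not the variational parameterization used in [6]:

```python
import numpy as np

def sample_trunc_lognormal(mu, sigma, size, rng, low=1e-8, high=1.0):
    """Rejection-sample multiplicative noise z with log z ~ N(mu, sigma^2),
    truncated so that z stays in (low, high]: activations are scaled by
    factors between 0 and 1, as with Log-Normal dropout."""
    out = np.empty(size)
    filled = 0
    while filled < size:
        z = np.exp(rng.normal(mu, sigma, size=size))  # candidate Log-Normal draws
        z = z[(z > low) & (z <= high)]                # keep only in-range samples
        take = min(size - filled, z.size)
        out[filled:filled + take] = z[:take]
        filled += take
    return out
```

In practice the variational approach learns mu and sigma per unit from data; the sampler above only shows the shape of the resulting multiplicative noise.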

V-D Training neural network controllers

While Deep-PILCO had been limited to training single-layer Radial Basis Function (RBF) policies, the application of gradient clipping and CRNs allows stable training of deep neural network policies, opening the door for richer behaviors. We found that adding dropout to the policy networks improves performance. During policy evaluation, we sample policies the same way as we do dynamics models: a policy sample corresponds to a set of dropout masks. Thus each simulated state particle has a corresponding dynamics model and policy, which remain fixed during policy optimization. This can be interpreted as attempting to learn a distribution of controllers that are likely to perform well over plausible dynamics models. We make the policy stochastic during execution on the target system by re-sampling the policy dropout mask at every time step. This provides a degree of exploration that we found beneficial.

VI Results

(a) Cart-pole task
(b) Double cart-pole task
Fig. 2: Benchmark tasks used in Sec. VI. In both tasks, the tip of the pendulum starts pointing down, with the cart centered at the origin. The goal is to balance the tip of the pole at its highest possible location, while keeping the cart at the origin.

We tested the improvements described in Section V on two benchmark scenarios: swinging up and stabilizing an inverted pendulum on a cart, and swinging up and stabilizing a double pendulum on a cart (see Fig. (2)). The first task was meant to compare performance on the same experiment as [3]. We chose the second scenario to compare the methods on a harder long-term prediction task, due to the chaotic dynamics of the double pendulum. In both cases, the system is controlled by applying a horizontal force to the cart. We also evaluate our approach on gait learning tasks for an underwater hexapod robot [2], to demonstrate its applicability to locomotion tasks on complex robot systems. We use the ADAM optimizer [31] for model fitting and policy optimization, with the default parameters suggested by the authors, and report the best results obtained after manual hyper-parameter tuning.

VI-A Cart-pole swing-up task

(a) Comparison of the effect of introducing CRNs
(b) Effect of clipping gradients
Fig. 3: (a) Illustrates the benefit of fixing random numbers for policy evaluation (Alg. (3)) vs. stochastic policy evaluations (Alg. (2)). In (b) we show the area under the learning curve for the cart-pole task for various gradient clipping values (lower is better).

While previous experiments combining PILCO with PEGASUS were unsuccessful [23], we found its application to Deep-PILCO to result in a significant improvement in convergence when training neural network policies. Fig. (2(a)) shows how the use of CRNs (deterministic policy evaluation) results in faster convergence than the original Deep-PILCO formulation (stochastic policy evaluation), which only matches the cost of our approach after around the 20th trial. These experiments were done with fixed learning rate and gradient clipping values. Fig. (2(b)) illustrates the effect of gradient clipping for different clipping values in Alg. (3). The area under the learning curve gives us an idea of the speed of convergence as the clipping value changes. The trend is that any amount of gradient clipping made a large improvement over not clipping at all, and that performance was stable across specific choices of the clipping value.

(a) Cart-pole RBF Policies
(b) Cart-pole Deep Policies
Fig. 4: Cost per trial on the cart-pole swing-up task. In (a), we compare different dynamics models for learning RBF policies. (b) compares BNN models for learning NN policies, showing how our approach matches the data-efficiency of PILCO with better final performance.

Fig. (4) summarizes our results for the cart-pole domain (the code used in these experiments is available at https://github.com/juancamilog/kusanagi). Fig. (3(a)) illustrates the difference in performance between PILCO using sparse spectrum GP (SSGP) regression [32] for the dynamics and two versions of Deep-PILCO using BNN dynamics: one using binary dropout with a fixed dropout probability, and the other using Log-Normal dropout with a truncated log-uniform prior. The BNN models are ReLU networks with two hidden layers of 200 units and a linear output layer. The models predict heteroscedastic noise, which is used to corrupt the input to the policy during simulation. We used data from all previous episodes for model learning after each trial. The initial experience was gathered with a single execution of a policy that selects actions uniformly at random. Separate learning rates were used for model learning and policy optimization. The policies were RBF networks with 30 units. Fig. (3(b)) provides a comparison of different BNN dynamics models when training neural network policies. The policy networks are ReLU networks with two hidden layers of 200 units. For BNN policies (Drop MLP) we set a constant dropout probability. Note that our method is able to train neural network controllers with better performance (lower cost) than either PILCO or Deep-PILCO with RBF controllers, within a similar number of trials. Using truncated Log-Normal dropout (Log-Normal Drop Dyn) for learning a stochastic policy (Drop MLP) results in the best performance for the cart-pole task.

VI-B Double pendulum on cart swing-up task

Fig. (5) illustrates the effect of learning neural network controllers on the more complicated double cart-pole swing-up task. We were unable to get Alg. (2) with RBF policies to converge on this task. The setup is similar to the cart-pole task, but we change the network architectures since the dynamics are more complex. The dynamics models are ReLU networks with 4 hidden layers of 200 units and a linear output layer. The policies are ReLU networks with four hidden layers of 50 units. The initial experience came from 2 runs with random actions. Here the differences in performance are more pronounced: our method converges after 42 trials, corresponding to 126 s of experience at 10 Hz. This is close to the 84 s at 13.3 Hz reported in [23]. We see that the combination of BNN dynamics (Log-Normal Drop Dyn) and a BNN policy (Drop MLP Pol) requires the fewest trials to achieve the lowest cost.

Fig. 5: Cost per trial on the double cart-pole swing-up task. The shaded regions correspond to half a standard deviation. This demonstrates the benefit of using Log-Normal multiplicative noise for the dynamics with dropout regularization for the policies.

VI-C Learning swimming gaits on an underwater robot

(a) 2-Leg Knife edge
(b) 2-Leg Belly up
(c) 2-Leg Corkscrew
(d) 2-Leg Depth change
Fig. 6: Learning curves and the evolution of the trajectory distribution as learning progresses for the 2-leg tasks. The robot learns to control its pose by setting the appropriate amplitudes and leg offset angles for its back 2 legs. The dashed lines represent the desired target states. Additional results and videos of these behaviours are available at https://github.com/mcgillmrl/robot_learning

These tasks consist of finding feedback controllers for controlling the robot's 3D pose via periodic motion of its legs. Fig. (1) illustrates the execution of a gait learned using our method. The robot's state space consists of readings from its inertial measurement unit (IMU), its depth sensor, and motor encoders. To compare with previously published results, the action space is defined as the parameters of the periodic leg command (PLC) pattern generator [2], with the same constraints as prior work. We conducted experiments on the following control tasks (the code used for these experiments and video examples of other learned gaits are available at https://github.com/mcgillmrl/robot_learning):

  1. knife edge: Swimming straight-ahead with roll

  2. belly up: Swimming straight-ahead with roll

  3. corkscrew: Swimming straight-ahead with rolling velocity (anti-clockwise)

  4. 1 m depth change: Diving and stabilizing 1 meter below current depth.

There were two versions of these experiments. In the first, which we call the 2-leg tasks, the robot controls only the amplitudes and offsets of the two back legs (4 control dimensions). Its state corresponds to the angles and angular velocities, as measured by the IMU, plus the depth sensor measurement (7 state dimensions). In the second version, the robot controls amplitudes, offsets, and phases for all 6 legs (18 control dimensions). In this case, the state consists of the IMU and depth sensor readings plus the leg angles as measured from the motor encoders (13 state dimensions). We transform angle dimensions into their complex representation before passing the state as input to the dynamics model and policy, as described in [23].
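The angle transformation amounts to replacing each angle dimension with its (cos, sin) pair, which removes the 2π wrap-around discontinuity; the helper below is a hypothetical illustration of that preprocessing:

```python
import numpy as np

def augment_angles(state, angle_dims):
    """Replace each angle dimension of a state vector with its (cos, sin) pair,
    so that angles differing by 2*pi map to the same model/policy input."""
    angles = state[..., angle_dims]
    rest = np.delete(state, angle_dims, axis=-1)   # non-angle dimensions pass through
    return np.concatenate([rest, np.cos(angles), np.sin(angles)], axis=-1)
```

Each angle adds one dimension to the input, but the representation is continuous everywhere, which helps both the dynamics model and the policy generalize across the wrap-around point.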

(a) 6-Leg Knife edge
(b) 6-Leg Belly up
(c) 6-Leg Corkscrew
(d) 6-Leg Depth change
Fig. 7: Learning curve and the evolution of the trajectory distribution as learning progresses for 6-leg tasks. In this case, the robot is trying to control the amplitudes, leg angle offsets, and phase offsets for all 6 legs. The algorithm takes longer to converge in this case, when compared to the 2-leg tasks. This is possibly due to the larger state and action spaces (13 state dimensions + 18 action dimensions). Nevertheless, this demonstrates that the algorithm can scale to higher dimensional problems.

We trained dynamics models and policies with 4 hidden layers of 200 units each. The dynamics models use truncated Log-Normal dropout, and we enable dropout for the policy. We used fixed learning rate and gradient clipping settings. The experience dataset is initialized with 5 random trials, common to all the tasks with the same state and action spaces.

Fig. (6) and (7) show the results of gait learning in the simulation environment described in [2]. In addition to the learning curves on the left of each task panel, we show detailed state telemetry for selected learning episodes on the right, to provide intuition on stability and learning progression. The shaded regions represent the variance of the trajectory distributions on the target system over 10 different runs. In each case attempted, our method was able to learn effective swimming behavior, coordinating the motions of multiple flippers and overcoming simulated hydrodynamic effects without any prior model. For the 2-leg tasks, our method obtains successful policies in 10-20 trials, competitive with the results reported in [2]. We obtained successful controllers for the depth-change task, which was unsuccessful in prior work. The 6-leg tasks, with their considerably higher-dimensional state and action spaces, take roughly double the number of trials. Even so, all tasks converged by trial 50, which remains practical for real deployment.

VII Conclusion

We have presented improvements to a probabilistic model-based reinforcement learning algorithm, Deep-PILCO, to enable fast synthesis of controllers for robotics applications. Our algorithm is based on treating neural network models trained with dropout as an approximation to the posterior distribution of dynamics models given the experience data. Sampling dynamics models from this distribution helps avoid model bias during policy optimization; policies are optimized for a finite sample of dynamics models, obtained through the application of dropout noise masks. Our changes enable the training of neural network controllers, which we demonstrate to outperform RBF controllers on the cart-pole swing-up task. We obtain competitive performance on the task of swing-up and stabilization of a double pendulum on a cart. Finally, we demonstrated the usefulness of the algorithm on the higher-dimensional tasks of learning gaits for pose stabilization of a six-legged underwater robot. We replicate previous results [2] where the robot is controlled with 2 flippers, and provide new results on learning to control the robot using all 6 legs, now including phase offsets.

References

  • [1] M. P. Deisenroth, D. Fox, and C. E. Rasmussen, “Gaussian processes for data-efficient learning in robotics and control,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.
  • [2] D. Meger, J. C. G. Higuera, A. Xu, P. Giguere, and G. Dudek, “Learning legged swimming gaits from experience,” in Robotics and Automation (ICRA), 2015 IEEE International Conference on, 2015.
  • [3] Y. Gal, R. McAllister, and C. E. Rasmussen, “Improving PILCO with Bayesian neural network dynamics models,” in Data-Efficient Machine Learning Workshop, ICML, 2016.
  • [4] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting.” Journal of Machine Learning Research, 2014.
  • [5] Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” in Proceedings of The 33rd International Conference on Machine Learning, 2016.
  • [6] K. Neklyudov, D. Molchanov, A. Ashukha, and D. P. Vetrov, “Structured bayesian pruning via log-normal multiplicative noise,” in Advances in Neural Information Processing Systems, 2017.
  • [7] D. H. Jacobson and D. Q. Mayne, Differential Dynamic Programming.    Elsevier, 1970.
  • [8] W. Li and E. Todorov, “Iterative linear quadratic regulator design for nonlinear biological movement systems,” in Proceedings of the 1st International Conference on Informatics in Control, Automation and Robotics, 2004.
  • [9] Y. Tassa, N. Mansard, and E. Todorov, “Control-limited differential dynamic programming,” in Proceedings of the International Conference on Robotics and Automation (ICRA), 2014.
  • [10] A. D. Marchese, R. Tedrake, and D. Rus, “Dynamics and trajectory optimization for a soft spatial fluidic elastomer manipulator,” International Journal of Robotics Research, 2015.
  • [11] D. Nguyen-Tuong and J. Peters, “Model learning for robot control: a survey,” Cognitive Processing, vol. 12, no. 4, pp. 319–340, Nov 2011.
  • [12] C. G. Atkeson, A. W. Moore, and S. Schaal, “Locally weighted learning for control,” Lazy learning, pp. 75 – 113, 1997.
  • [13] P. Abbeel, A. Coates, M. Quigley, and A. Y. Ng, “An application of reinforcement learning to aerobatic helicopter flight,” in Proceedings of Neural Information Processing Systems (NIPS), 2006.
  • [14] S. Levine and P. Abbeel, “Learning neural network policies with guided policy search under unknown dynamics,” in Advances in Neural Information Processing Systems, 2014, pp. 1071–1079.
  • [15] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, “Continuous deep q-learning with model-based acceleration,” in International Conference on Machine Learning, 2016.
  • [16] N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa, “Learning continuous control policies by stochastic value gradients,” in Advances in Neural Information Processing Systems, 2015.
  • [17] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” CoRR, vol. abs/1509.02971, 2015.
  • [18] S. Bansal, R. Calandra, S. Levine, and C. Tomlin, “MBMF: model-based priors for model-free reinforcement learning,” CoRR, vol. abs/1709.03153, 2017.
  • [19] Y. Pan and E. Theodorou, “Probabilistic differential dynamic programming,” in Advances in Neural Information Processing Systems, 2014.
  • [20] G. Lee, S. S. Srinivasa, and M. T. Mason, “GP-ILQG: data-driven robust optimal control for uncertain nonlinear dynamical systems,” CoRR, vol. abs/1705.05344, 2017.
  • [21] K. Chatzilygeroudis, R. Rama, R. Kaushik, D. Goepp, V. Vassiliades, and J.-B. Mouret, “Black-box data-efficient policy search for robotics,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.
  • [22] C. G. Atkeson and J. C. Santamaria, “A comparison of direct and model-based reinforcement learning,” in Proceedings of International Conference on Robotics and Automation, vol. 4, 1997.
  • [23] M. P. Deisenroth, Efficient reinforcement learning using Gaussian processes.    KIT Scientific Publishing, 2010.
  • [24] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, “Weight uncertainty in neural network,” in International Conference on Machine Learning, 2015.
  • [25] D. P. Kingma, T. Salimans, and M. Welling, “Variational dropout and the local reparameterization trick,” in Advances in Neural Information Processing Systems, 2015.
  • [26] D. Molchanov, A. Ashukha, and D. Vetrov, “Variational dropout sparsifies deep neural networks,” in International Conference on Machine Learning, 2017.
  • [27] Y. Gal, J. Hron, and A. Kendall, “Concrete dropout,” in Advances in Neural Information Processing Systems, 2017.
  • [28] N. L. Kleinman, J. C. Spall, and D. Q. Naiman, “Simulation-based optimization with stochastic approximation using common random numbers,” Management Science, 1999.
  • [29] A. Y. Ng and M. Jordan, “PEGASUS: A policy search method for large MDPs and POMDPs,” in Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, 2000.
  • [30] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in International Conference on Machine Learning, 2013.
  • [31] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
  • [32] M. Lázaro-Gredilla, J. Quiñonero Candela, C. E. Rasmussen, and A. R. Figueiras-Vidal, “Sparse spectrum gaussian process regression,” Journal of Machine Learning Research, 2010.

Iii Problem Statement

We focus on model-based policy search methods for episodic tasks. We consider systems that can be modeled with discrete-time dynamics , where is unknown, with states and controls , indexed by time-step . The goal is to find the parameters of a policy that minimize a task-dependent cost function accumulated over a finite time horizon ,

(1)

The expectation in our case is due to not knowing the true dynamics , which induces a distribution over trajectories . The objective could be minimized by black-box optimization or likelihood ratio methods, obtaining trajectory samples directly from the target system. However, such methods are known to require a large number of evaluations, which may be impractical for applications with real robot systems. An alternative is to use experience to fit a model of the dynamics and use it to estimate the objective in Eq. (1). Alg. (1) describes a sketch for model-based optimization methods. A goal of these methods is data-efficiency: to use as little real-world experience as possible. Since we consider fixed horizon tasks, data-efficiency can be measured in the number of episodes, or trials, until the task is successfully learned.

1:Initialize parameters , and dataset
2:for episode in  do
3:     Obtain by executing for steps on robot
4:     Append to
5:     Use to update model learning
6:     Use to minimize , update policy optimization
7:Return
Algorithm 1 Episodic Model-Based RL

Iv Background

Iv-a Learning a dynamics model with BNNs

A key to data-efficiency is avoiding model bias [22, 23], i.e. optimizing Eq. (1) with a model that makes bad predictions with high confidence. BNNs address model bias by using the posterior distribution over their parameters. Given a model with parameters and a dataset we’d like to use the posterior to make predictions at new test points. This distribution represents the uncertainty about the true value of , which induces uncertainty on the model predictions: , where is the prediction at test point . Using the true posterior for predictions on a neural network is intractable. Fortunately, various methods based on variational inference exist, which use tractable approximate posteriors and Monte Carlo integration for predictions [5, 24, 25, 26, 27]. Fitting is done by minimizing the Kullback-Leibler (KL) divergence between the true and the approximate posterior, which can done by optimizing the objective

(2)

where E_{q_ψ(θ)}[log p(D | θ)] is the expected value of the log-likelihood, q_ψ(θ) is the approximate posterior and p(θ) is a user-defined prior on the parameters. These methods usually set θ as a deterministic transformation of noise samples, θ = g(ψ, ε), where ψ are the parameters of the posterior [25]. For example, binary dropout multiplies the weight matrices of each layer of the network with dropout masks: diagonal noise matrices whose entries are drawn from a Bernoulli distribution with dropout probability p [4]. To fit the dynamics model, we build a dataset of tuples ((x_t, u_t), Δ_t), where the state-action pairs (x_t, u_t) are the inputs to the dynamics model, and the targets Δ_t = x_{t+1} − x_t are the changes in state after applying action u_t. We fit the model by minimizing the objective in Eq. (2) via stochastic gradient descent.
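As a concrete, deliberately simplified illustration of fitting a dropout dynamics model on (state-action, state-change) pairs, the sketch below trains a one-hidden-layer ReLU network with binary dropout by plain stochastic gradient descent on the data term only. All shapes, the dropout probability, and the learning rate are illustrative assumptions, and the KL term of Eq. (2) is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy experience: a near-linear system with small process noise.
states = rng.standard_normal((200, 2))
actions = rng.standard_normal((200, 1))
next_states = 0.95 * states + 0.05 * actions + 0.01 * rng.standard_normal((200, 2))

X = np.hstack([states, actions])       # inputs: state-action pairs (x_t, u_t)
Y = next_states - states               # targets: state changes Δ_t = x_{t+1} − x_t

W1 = 0.1 * rng.standard_normal((3, 32))
W2 = 0.1 * rng.standard_normal((32, 2))
p_drop = 0.1                           # fixed dropout probability (illustrative)

def forward(X, mask):
    h = np.maximum(X @ W1, 0.0) * mask  # ReLU units, dropped by the sampled mask
    return h @ W2

for step in range(200):                 # SGD on the squared-error data term
    idx = rng.integers(0, len(X), 32)
    mask = (rng.random((1, 32)) > p_drop) / (1 - p_drop)   # Bernoulli dropout mask
    h = np.maximum(X[idx] @ W1, 0.0) * mask
    err = h @ W2 - Y[idx]
    gW2 = h.T @ err / 32
    gh = (err @ W2.T) * mask * (X[idx] @ W1 > 0)           # back-prop through ReLU
    gW1 = X[idx].T @ gh / 32
    W1 -= 0.05 * gW1
    W2 -= 0.05 * gW2

# Monte Carlo prediction: average over sampled dropout masks.
samples = [forward(X[:5], (rng.random((1, 32)) > p_drop) / (1 - p_drop))
           for _ in range(20)]
mean_pred = np.mean(samples, axis=0)
print(mean_pred.shape)
```

Averaging predictions over sampled masks is what makes the trained network behave like a Monte Carlo approximation to the posterior predictive distribution.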

IV-B Policy optimization with learned models

To estimate the objective function in Eq. (1) we base our approach on Deep-PILCO [3], which we summarize in Alg. (2). At every optimization iteration, the algorithm draws K particles, each consisting of an initial state sample and a set of dynamics model weights sampled from the approximate posterior, as shown in line 2. For the models used in [3] and in this work, sampling weights is equivalent to sampling dropout masks. The loop in lines 4 to 6 can be executed in parallel using batch processing. This algorithm requires the task cost function to be known and differentiable. Deep-PILCO uses back-propagation through time (BPTT) to estimate the policy gradients ∇_θ J.

1: for i = 1 to N_opt do
2:     Sample K particles: initial states x_0^(k) ~ p(x_0) and weights W^(k) ~ q(W)
3:     for t = 0 to H−1 do
4:         for k = 1 to K do
5:             Evaluate policy: u_t^(k) = π_θ(x_t^(k))
6:             Propagate state: x_{t+1}^(k) = f(x_t^(k), u_t^(k); W^(k))
7:         Fit mean μ_{t+1} and covariance Σ_{t+1} of the particles {x_{t+1}^(k)}
8:         Resample {x_{t+1}^(k)} from N(μ_{t+1}, Σ_{t+1})
9:     Evaluate objective J(θ)
10:    Compute gradient estimate ∇_θ J
11:    Update θ by a stochastic gradient descent step
Algorithm 2 Policy search with Deep-PILCO
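The particle propagation in Algorithm 2 can be sketched as follows, with a toy linear policy and a stand-in "sampled BNN" (a fixed dropout mask per particle); the moment-matching and re-sampling step mirrors lines 7-8. Everything here is an illustrative stand-in, not the paper's dynamics or cost.

```python
import numpy as np

rng = np.random.default_rng(2)
K, H, D = 10, 15, 2                          # particles, horizon, state dim

def policy(x):
    return -0.5 * x[:, :1]                   # toy linear feedback policy

def dynamics(x, u, mask):
    # One "sampled BNN" per particle: its dropout mask stays fixed.
    return 0.9 * x * mask + 0.1 * np.hstack([u, u])

masks = (rng.random((K, D)) > 0.1) / 0.9     # one mask per particle (fixed)
x = rng.standard_normal((K, D)) * 0.1        # initial state particles
cost = 0.0
for t in range(H):
    u = policy(x)
    x = dynamics(x, u, masks)                # propagate each particle
    mu, cov = x.mean(0), np.cov(x.T)         # fit mean and covariance
    x = rng.multivariate_normal(mu, cov, size=K)   # resample particles
    cost += np.mean(np.sum(x ** 2, axis=1))        # accumulate a toy expected cost
print(x.shape)
```

The resampling step keeps particles approximately Gaussian at every time step, which is what allows the moment-matched trajectory distribution to stay well-behaved over long horizons.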

V Improvements to Deep-PILCO

Here we describe the changes we made to Deep-PILCO that were crucial for improving its data-efficiency and obtaining the results described in Sec. VI. Our changes are summarized in Alg. (3). This algorithm can still be executed efficiently using batch processing with state-of-the-art deep learning frameworks.

V-A Common random numbers for policy evaluation

The convergence of Algorithms (2) and (3) is highly dependent on the variance of the estimated gradient. In this case, the variance of the gradients depends on the sources of randomness used for simulating trajectories: the initial state samples, the multiplicative noise masks, and the random numbers used for re-sampling. A common variance reduction technique in stochastic optimization is to fix the random numbers during optimization [28]. Using common random numbers (CRNs) reduces variance in two ways: gradient evaluations become deterministic, and evaluations at different values of the optimization variable become correlated. We introduce CRNs by drawing all the random numbers needed for simulating trajectories at the beginning of policy optimization (lines 1 to 3 in Alg. (3)) and keeping them fixed as the policy parameters are updated. This is possible because we use BNNs that rely on the re-parametrization trick [25] for evaluation. This is effective in reducing variance and improving convergence, but it may introduce bias. A simple way to deal with the bias is to increase the number of particles used for gradient evaluation. We increased the number of particles from 10, the value used in [3], to 100 for our experiments, and found this to improve convergence with a small penalty on running time.
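As a concrete illustration of CRN-based policy evaluation, the sketch below (with a toy linear policy and toy dynamics; all names and shapes are illustrative, not the paper's implementation) draws every source of randomness once up front, so repeated evaluations of the objective at the same parameters are exactly deterministic:

```python
import numpy as np

rng = np.random.default_rng(3)
K, H, D = 100, 10, 2
# Analogue of lines 1-3 of Alg. 3: fix every source of randomness up front.
x0 = rng.standard_normal((K, D)) * 0.1           # initial state particles
dyn_masks = (rng.random((K, D)) > 0.1) / 0.9     # dynamics dropout masks
resample_eps = rng.standard_normal((H, K, D))    # re-sampling noise

def objective(theta):
    # With all noise fixed, this function is deterministic in theta,
    # so its gradient could be computed exactly by autodiff.
    x = x0.copy()
    cost = 0.0
    for t in range(H):
        u = x @ theta                            # toy linear policy
        x = 0.9 * x * dyn_masks + 0.1 * u
        mu, sd = x.mean(0), x.std(0) + 1e-8
        x = mu + sd * resample_eps[t]            # re-parametrized re-sampling
        cost += np.mean(np.sum(x ** 2, axis=1))
    return cost

theta = np.zeros((D, D))
a, b = objective(theta), objective(theta)
print(a == b)  # → True
```

Because evaluations at nearby parameter values share the same noise draws, they are strongly correlated, which is precisely the variance-reduction effect of CRNs.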

Fixing random numbers in the context of policy search is known as the PEGASUS (Policy Evaluation-of-Goodness And Search Using Scenarios) algorithm [29]. PEGASUS transforms a given Markov Decision Process (MDP) into "an equivalent one where all transitions are deterministic" by assuming access to a deterministic simulative model of the MDP. A deterministic simulative model is one that has no internal random number generator, so any random numbers that are needed must be given to it as input. This is the case when using BNN models. PEGASUS provides theoretical justification for our approach, particularly in that, to decrease the upper bound on the error of objective estimates obtained with CRNs, it suffices to increase the number of particles.

V-B Stabilization for back-propagation through time

As noted in [3], the recurrent application of BNNs in Algorithm 2 can be interpreted as a Recurrent Neural Network (RNN) model. As such, Deep-PILCO is prone to vanishing and exploding gradients when computing them via BPTT [30], especially for tasks that require long time horizons or very deep models for the dynamics and policy. Although numerous techniques have been proposed in the RNN literature, we opted to deal with these problems by using ReLU activations for the policy and dynamics model, and by clipping the gradients to have a norm of at most a fixed threshold. We show the effect of various settings of the clipping value on the convergence of policy search in Fig. (2(b)).
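Gradient clipping by norm is simple to implement; the sketch below rescales a gradient only when its norm exceeds a cap, and the threshold used in the example is an arbitrary illustrative value:

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale `grad` so its Euclidean norm is at most `max_norm`."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:                  # rescale only when the norm exceeds the cap
        grad = grad * (max_norm / norm)
    return grad

g = np.array([3.0, 4.0])                 # norm 5
print(clip_by_norm(g, 1.0))              # rescaled to norm 1
```

Clipping preserves the gradient's direction while bounding the step size, which is why it stabilizes BPTT without changing the descent direction.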

1: Sample noise for the dynamics model: dropout masks W^(k), k = 1, …, K
2: Sample noise for the policy: dropout masks V^(k)
3: Sample state noise ε_t^(k) for all t, k
4: for i = 1 to N_opt do
5:     Sample initial state particles x_0^(k) ~ p(x_0)
6:     for t = 0 to H−1 do
7:         for k = 1 to K do
8:             Select the fixed dynamics noise W^(k) for particle k
9:             Select the fixed policy noise V^(k) for particle k
10:            Evaluate policy: u_t^(k) = π_θ(x_t^(k); V^(k))
11:            Propagate state: x_{t+1}^(k) = f(x_t^(k), u_t^(k); W^(k))
12:        Fit mean μ_{t+1} and covariance Σ_{t+1} of the particles
13:        for k = 1 to K do
14:            Resample: x_{t+1}^(k) = μ_{t+1} + Σ_{t+1}^{1/2} ε_{t+1}^(k)
15:    Evaluate objective J(θ)
16:    Compute gradient estimate g = ∇_θ J
17:    if ‖g‖ exceeds the clipping value then
18:        Rescale g to the clipping value
19:    Update θ by a stochastic gradient descent step
Algorithm 3 Our method: Deep-PILCO with PEGASUS evaluation and gradient clipping

V-C BNN models with Log-Normal multiplicative noise

We focused on methods that use multiplicative noise on the activations (e.g. binary dropout) because of their simplicity and computational efficiency. Deep-PILCO with binary dropout requires tuning the dropout probability to a value appropriate for the model size. We experimented with various BNN models [6, 25, 26, 27] to enable learning the dropout probabilities from data. The best performing method in our experiments used truncated Log-Normal dropout with a truncated log-uniform prior. This choice of prior constrains the multiplicative noise to values between 0 and 1 [6].
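For intuition, truncated Log-Normal multiplicative noise can be sampled as follows: log z is drawn from a Normal distribution truncated to [a, 0], so z = exp(log z) lies in (e^a, 1]. The sketch uses rejection sampling for simplicity, and the values of `mu`, `sigma`, and `a` are illustrative, not learned posterior parameters.

```python
import numpy as np

def sample_trunc_lognormal(mu, sigma, a, size, rng):
    """Sample multiplicative noise z with log z ~ Normal(mu, sigma) truncated to [a, 0]."""
    out = np.empty(size)
    filled = 0
    while filled < size:
        s = rng.normal(mu, sigma, size)      # propose in log-space
        s = s[(s >= a) & (s <= 0.0)]         # keep draws inside [a, 0]
        take = min(size - filled, len(s))
        out[filled:filled + take] = s[:take]
        filled += take
    return np.exp(out)                       # map back: z lies in (e^a, 1]

rng = np.random.default_rng(4)
z = sample_trunc_lognormal(mu=-0.5, sigma=0.5, a=-3.0, size=1000, rng=rng)
print(z.min() >= np.exp(-3.0), z.max() <= 1.0)  # → True True
```

Because the noise is bounded in (0, 1], it behaves like a soft, continuous version of binary dropout, while its distribution parameters can be learned from data.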

V-D Training neural network controllers

While Deep-PILCO had previously been limited to training single-layer Radial Basis Function (RBF) policies, the application of gradient clipping and CRNs allows stable training of deep neural network policies, opening the door to richer behaviors. We found that adding dropout to the policy networks improves performance. During policy evaluation, we sample policies the same way we sample dynamics models: a policy sample corresponds to a set of dropout masks. Thus each simulated state particle has a corresponding dynamics model and policy, which remain fixed during policy optimization. This can be interpreted as attempting to learn a distribution of controllers that are likely to perform well over plausible dynamics models. We make the policy stochastic during execution on the target system by re-sampling the policy dropout mask at every time step. This provides a degree of exploration that we found beneficial.

VI Results

(a) Cart-pole task
(b) Double cart-pole task
Fig. 2: Benchmark tasks used in Sec. VI. In both tasks, the tip of the pendulum starts hanging straight down, with the cart centered at the origin. The goal is to balance the tip of the pole at its highest possible location while keeping the cart at the origin, which requires swinging the pole (or both pole links, in the double cart-pole) up to the upright configuration.

We tested the improvements described in Section V on two benchmark scenarios: swinging up and stabilizing an inverted pendulum on a cart, and swinging up and stabilizing a double pendulum on a cart (see Fig. (2)). The first task allows a direct comparison with the experiments in [3]. We chose the second scenario to compare the methods on a harder long-term prediction task, due to the chaotic dynamics of the double pendulum. In both cases, the system is controlled by applying a horizontal force to the cart. We also evaluate our approach on gait learning tasks for an underwater hexapod robot [2] to demonstrate its applicability to locomotion tasks on complex robot systems. We use the Adam optimizer [31] for model fitting and policy optimization, with the default parameters suggested by its authors, and report the best results obtained after manual hyper-parameter tuning.

VI-A Cart-pole swing-up task

(a) Comparison of the effect of introducing CRNs
(b) Effect of clipping gradients
Fig. 3: (a) illustrates the benefit of fixing random numbers for policy evaluation (Alg. (3)) vs. stochastic policy evaluation (Alg. (2)). In (b) we show the area under the learning curve for the cart-pole task for various gradient clipping values (lower is better).

While previous experiments combining PILCO with PEGASUS were unsuccessful [23], we found its application to Deep-PILCO to result in a significant improvement in convergence when training neural network policies. Fig. (2(a)) shows how the use of CRNs (deterministic policy evaluation) results in faster convergence than the original Deep-PILCO formulation (stochastic policy evaluation), which only matches the cost of our approach after around the 20th trial. These experiments were run with fixed settings of the learning rate and clipping value. Fig. (2(b)) illustrates the effect of gradient clipping for different clipping values in Alg. (3). The area under the learning curve gives an idea of the speed of convergence as the clipping value changes. The trend is that any amount of gradient clipping yields a large improvement over not clipping at all, and that performance is stable across the specific clipping values tried.

(a) Cart-pole RBF Policies
(b) Cart-pole Deep Policies
Fig. 4: Cost per trial on the cart-pole swing-up task. In (a), we compare different dynamics models for learning RBF policies. (b) compares BNN models for learning NN policies, showing how our approach matches the data-efficiency of PILCO with better final performance.

Fig. (4) summarizes our results for the cart-pole domain (the code used in these experiments is available at https://github.com/juancamilog/kusanagi). Fig. (3(a)) illustrates the difference in performance between PILCO using sparse spectrum GP (SSGP) regression [32] for the dynamics and two versions of Deep-PILCO using BNN dynamics: one using binary dropout with a fixed dropout probability, and the other using Log-Normal dropout with a truncated log-uniform prior. The BNN models are ReLU networks with two hidden layers of 200 units and a linear output layer. The models predict heteroscedastic noise, which is used to corrupt the input to the policy during simulation. We used data from all previous episodes for model learning after each trial. The initial experience was gathered with a single execution of a policy that selects actions uniformly at random. Separate learning rates were used for model learning and for policy optimization. The policies were RBF networks with 30 units. Fig. (3(b)) provides a comparison of different BNN dynamics models when training neural network policies. The policy networks are ReLU networks with two hidden layers of 200 units. For BNN policies (Drop MLP) we set a constant dropout probability. Note that our method is able to train neural network controllers with better performance (lower cost) than either PILCO or Deep-PILCO with RBF controllers, within a similar number of trials. Using truncated Log-Normal dropout (Log-Normal Drop Dyn) for learning a stochastic policy (Drop MLP) results in the best performance for the cart-pole task.

VI-B Double pendulum on cart swing-up task

Fig. (5) illustrates the effect of learning neural network controllers on the more complicated double cart-pole swing-up task. We were unable to get Alg. (2) with RBF policies to converge on this task. The setup is similar to the cart-pole task, but we changed the network architectures since the dynamics are more complex. The dynamics models are ReLU networks with 4 hidden layers of 200 units and a linear output layer. The policies are ReLU networks with four hidden layers of 50 units. The learning rate for policy learning was tuned for this task. The initial experience came from 2 runs with random actions. Here the differences in performance are more pronounced: our method converges after 42 trials, corresponding to 126 s of experience at 10 Hz. This is close to the 84 s at 13.3 Hz reported in [23]. We see that the combination of BNN dynamics (Log-Normal Drop Dyn) and a BNN policy (Drop MLP Pol) requires the fewest trials to achieve the lowest cost.

Fig. 5: Cost per trial on the double cart-pole swing-up task. The shaded regions correspond to half a standard deviation. This demonstrates the benefit of using Log-Normal multiplicative noise for the dynamics with dropout regularization for the policies.

VI-C Learning swimming gaits on an underwater robot

(a) 2-Leg Knife edge
(b) 2-Leg Belly up
(c) 2-Leg Corkscrew
(d) 2-Leg Depth change
Fig. 6: Learning curve and the evolution of the trajectory distribution as learning progresses for the 2-leg tasks. The robot learns to control its pose by setting the appropriate amplitudes and leg offset angles for its back 2 legs. The dashed lines represent the desired target states. Additional results and videos of these behaviours are available at https://github.com/mcgillmrl/robot_learning

These tasks consist of finding feedback controllers for the robot's 3D pose via periodic motion of its legs. Fig. (1) illustrates the execution of a gait learned using our method. The robot's state space consists of readings from its inertial measurement unit (IMU), its depth sensor and its motor encoders. To compare with previously published results, the action space is defined as the parameters of the periodic leg command (PLC) pattern generator [2], with the same constraints as prior work. We conducted experiments on the following control tasks (the code used for these experiments and video examples of other learned gaits are available at https://github.com/mcgillmrl/robot_learning):

  1. knife edge: swimming straight ahead while holding a knife-edge roll angle
  2. belly up: swimming straight ahead while rolled belly up
  3. corkscrew: swimming straight ahead with a constant anti-clockwise rolling velocity
  4. 1 m depth change: diving and stabilizing 1 meter below the current depth.

There were two versions of these experiments. In the first, which we call the 2-leg tasks, the robot controls only the amplitudes and offsets of its two back legs (4 control dimensions). Its state corresponds to the angles and angular velocities measured by the IMU, plus the depth sensor measurement (7 state dimensions). In the second version, the robot controls the amplitudes, offsets and phases of all 6 legs (18 control dimensions). In this case, the state consists of the IMU and depth sensor readings plus the leg angles as measured from the motor encoders (13 state dimensions). We transform angle dimensions into their complex representation before passing the state as input to the dynamics model and policy, as described in [23].
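The angle transformation mentioned above amounts to replacing each angle θ by the pair (sin θ, cos θ), which removes the 2π discontinuity at the angle wrap-around; the function name and state layout below are illustrative:

```python
import numpy as np

def encode_angles(state, angle_dims):
    """Replace each angle dimension with its (sin, cos) unit-circle embedding."""
    parts = []
    for i, v in enumerate(state):
        if i in angle_dims:
            parts.extend([np.sin(v), np.cos(v)])  # continuous across the wrap-around
        else:
            parts.append(v)                        # pass non-angle dims through
    return np.array(parts)

# Example: a toy state of [angle, angular velocity, depth].
s = np.array([np.pi, 0.5, 2.0])
print(encode_angles(s, angle_dims={0}))
```

States that differ only by a multiple of 2π map to the same encoded vector, so the dynamics model never sees an artificial discontinuity in its inputs.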

(a) 6-Leg Knife edge
(b) 6-Leg Belly up
(c) 6-Leg Corkscrew
(d) 6-Leg Depth change
Fig. 7: Learning curve and the evolution of the trajectory distribution as learning progresses for 6-leg tasks. In this case, the robot is trying to control the amplitudes, leg angle offsets, and phase offsets for all 6 legs. The algorithm takes longer to converge in this case, when compared to the 2-leg tasks. This is possibly due to the larger state and action spaces (13 state dimensions + 18 action dimensions). Nevertheless, this demonstrates that the algorithm can scale to higher dimensional problems.

We trained dynamics models and policies with 4 hidden layers of 200 units each. The dynamics models use truncated Log-Normal dropout, and we enable dropout for the policy with a fixed dropout probability. The learning rate and gradient clipping value were kept fixed across tasks. The experience dataset is initialized with 5 random trials, common to all tasks sharing the same state and action spaces.

Fig. (6) and (7) show the results of gait learning in the simulation environment described in [2]. In addition to the learning curves on the left of each task panel, we show detailed state telemetry for selected learning episodes on the right to provide intuition on stability and learning progression. The shaded regions represent the variance of the trajectory distributions on the target system over 10 different runs. In each case attempted, our method learned effective swimming behavior, coordinating the motions of multiple flippers and overcoming simulated hydrodynamic effects without any prior model. For the 2-leg tasks, our method obtains successful policies in 10-20 trials, a number competitive with the results reported in [2]. We also obtained successful controllers for the depth-change task, which was unsuccessful in prior work. The 6-leg tasks, with their considerably higher-dimensional state and action spaces, take roughly double the number of trials; however, all tasks still converged by trial 50, which remains practical for real deployment.

VII Conclusion

We have presented improvements to a probabilistic model-based reinforcement learning algorithm, Deep-PILCO, that enable fast synthesis of controllers for robotics applications. Our algorithm treats neural network models trained with dropout as an approximation to the posterior distribution of dynamics models given the experience data. Sampling dynamics models from this distribution helps to avoid model bias during policy optimization: policies are optimized over a finite sample of dynamics models, obtained through the application of dropout noise masks. Our changes enable the training of neural network controllers, which we demonstrate to outperform RBF controllers on the cart-pole swing-up task. We obtain competitive performance on the task of swing-up and stabilization of a double pendulum on a cart. Finally, we demonstrated the usefulness of the algorithm on the higher-dimensional tasks of learning gaits for pose stabilization of a six-legged underwater robot. We replicate previous results [2] in which the robot is controlled with 2 flippers, and provide new results on learning to control the robot using all 6 legs, now including phase offsets.

References

  • [1] M. P. Deisenroth, D. Fox, and C. E. Rasmussen, “Gaussian processes for data-efficient learning in robotics and control,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.
  • [2] D. Meger, J. C. G. Higuera, A. Xu, P. Giguere, and G. Dudek, “Learning legged swimming gaits from experience,” in Robotics and Automation (ICRA), 2015 IEEE International Conference on, 2015.
  • [3] Y. Gal, R. McAllister, and C. E. Rasmussen, “Improving PILCO with Bayesian neural network dynamics models,” in Data-Efficient Machine Learning Workshop, ICML, 2016.
  • [4] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting.” Journal of Machine Learning Research, 2014.
  • [5] Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” in Proceedings of The 33rd International Conference on Machine Learning, 2016.
  • [6] K. Neklyudov, D. Molchanov, A. Ashukha, and D. P. Vetrov, “Structured bayesian pruning via log-normal multiplicative noise,” in Advances in Neural Information Processing Systems, 2017.
  • [7] D. H. Jacobson and D. Q. Mayne, Differential Dynamic Programming.    Elsevier, 1970.
  • [8] W. Li and E. Todorov, “Iterative linear quadratic regulator design for nonlinear biological movement systems,” in Proceedings of the 1st International Conference on Informatics in Control, Automation and Robotics, 2004.
  • [9] Y. Tassa, N. Mansard, and E. Todorov, “Control-limited differential dynamic programming,” in Proceedings of the International Conference on Robotics and Automation (ICRA), 2014.
  • [10] A. D. Marchese, R. Tedrake, and D. Rus, “Dynamics and trajectory optimization for a soft spatial fluidic elastomer manipulator,” International Journal of Robotics Research, 2015.
  • [11] D. Nguyen-Tuong and J. Peters, “Model learning for robot control: a survey,” Cognitive Processing, vol. 12, no. 4, pp. 319–340, Nov 2011.
  • [12] C. G. Atkeson, A. W. Moore, and S. Schaal, “Locally weighted learning for control,” Lazy learning, pp. 75 – 113, 1997.
  • [13] P. Abbeel, A. Coates, M. Quigley, and A. Y. Ng, “An application of reinforcement learning to aerobatic helicopter flight,” in Proceedings of Neural Information Processing Systems (NIPS), 2006.
  • [14] S. Levine and P. Abbeel, “Learning neural network policies with guided policy search under unknown dynamics,” in Advances in Neural Information Processing Systems, 2014, pp. 1071–1079.
  • [15] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, “Continuous deep q-learning with model-based acceleration,” in International Conference on Machine Learning, 2016.
  • [16] N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa, “Learning continuous control policies by stochastic value gradients,” in Advances in Neural Information Processing Systems, 2015.
  • [17] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” CoRR, vol. abs/1509.02971, 2015.
  • [18] S. Bansal, R. Calandra, S. Levine, and C. Tomlin, “MBMF: model-based priors for model-free reinforcement learning,” CoRR, vol. abs/1709.03153, 2017.
  • [19] Y. Pan and E. Theodorou, “Probabilistic differential dynamic programming,” in Advances in Neural Information Processing Systems, 2014.
  • [20] G. Lee, S. S. Srinivasa, and M. T. Mason, “GP-ILQG: data-driven robust optimal control for uncertain nonlinear dynamical systems,” CoRR, vol. abs/1705.05344, 2017.
  • [21] K. Chatzilygeroudis, R. Rama, R. Kaushik, D. Goepp, V. Vassiliades, and J.-B. Mouret, “Black-box data-efficient policy search for robotics,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.
  • [22] C. G. Atkeson and J. C. Santamaria, “A comparison of direct and model-based reinforcement learning,” in Proceedings of International Conference on Robotics and Automation, vol. 4, 1997.
  • [23] M. P. Deisenroth, Efficient reinforcement learning using Gaussian processes.    KIT Scientific Publishing, 2010.
  • [24] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, “Weight uncertainty in neural network,” in International Conference on Machine Learning, 2015.
  • [25] D. P. Kingma, T. Salimans, and M. Welling, “Variational dropout and the local reparameterization trick,” in Advances in Neural Information Processing Systems, 2015.
  • [26] D. Molchanov, A. Ashukha, and D. Vetrov, “Variational dropout sparsifies deep neural networks,” in International Conference on Machine Learning, 2017.
  • [27] Y. Gal, J. Hron, and A. Kendall, “Concrete dropout,” in Advances in Neural Information Processing Systems, 2017.
  • [28] N. L. Kleinman, J. C. Spall, and D. Q. Naiman, “Simulation-based optimization with stochastic approximation using common random numbers,” Management Science, 1999.
  • [29] A. Y. Ng and M. Jordan, “PEGASUS: A policy search method for large MDPs and POMDPs,” in Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, 2000.
  • [30] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in International Conference on Machine Learning, 2013.
  • [31] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
  • [32] M. Lázaro-Gredilla, J. Quiñonero Candela, C. E. Rasmussen, and A. R. Figueiras-Vidal, “Sparse spectrum gaussian process regression,” Journal of Machine Learning Research, 2010.
