I Related Work
Reinforcement Learning
The environment in an RL problem is often modelled as a Markov Decision Process (MDP) with a discrete set of states and actions [5]. In this work we focus on problems with infinite/continuous state and action spaces. These include complex motor control tasks that have become a popular benchmark in the machine learning literature [6]. Many recent RL approaches are based on policy gradient methods [7], where the gradient of the expected future discounted reward with respect to the policy parameters is approximated and used to update the policy. Recent advances in combining policy gradient methods and deep learning have led to impressive results for numerous problems, including Atari games and bipedal motion control [8, 9, 10, 11, 12, 13].
Sample Efficient RL
While policy gradient methods provide a general framework for how to update a policy given data, it is still a challenge to generate good data. Sample-efficient RL methods are an important area of research, as learning complex policies for motion control can take days and collecting data on physical robots is time-consuming. Learning can be made more sample efficient by further parameterizing the policy and passing noise through the network as an alternative to adding vanilla Gaussian noise [3, 2]. Other work encourages exploration of parts of the state space that have not yet been seen by the agent [14]. There has been success in incorporating model-based methods to generate synthetic data or locally approximate the dynamics [15, 16, 17]. Two methods are similar to the Model-Based Action Exploration (MBAE) work that we propose. Deep Deterministic Policy Gradient (DDPG) is a method that directly links the policy and value function, propagating gradients into the policy from the value function [4]. Another is Stochastic Value Gradients (SVG), a framework for blending between model-based and model-free learning [18]. However, these methods do not use the gradients for action exploration.
Model-Based RL
Model-based RL generally refers to methods that use the structure of the problem to assist learning. Typically any method that uses more than a policy and value function is considered to fall into this category. Significant improvements have been made recently by including some model-based knowledge in the RL problem. For example, first learning a policy using model-based RL and then training a model-free method to act like the model-based method yields significant improvements [19]. There is also interest in learning and using models of the transition dynamics to improve learning [20]. The work in [16] uses model-based policy optimization methods along with very accurate dynamics models to learn good policies. In this work, we learn a model of the dynamics to compute gradients that maximize future discounted reward for action exploration. The dynamics model used in this work does not need to be particularly accurate, as the underlying model-free RL algorithm can cope with a noisy action distribution.
II Framework
In this section we outline the MDP-based framework used to describe the RL problem.
II-A Markov Decision Process
An MDP is a tuple consisting of $\{S, A, R(\cdot), T(\cdot), \gamma\}$. Here $S$ is the space of all possible state configurations and $A$ is the set of available actions. The reward function $R(s, a)$ determines the reward for taking action $a \in A$ in state $s \in S$. The probability of ending up in state $s' \in S$ after taking action $a$ in state $s$ is described by the transition dynamics function $T(s'|s, a)$. Lastly, the discount factor $\gamma$ controls the planning horizon and gives preference to more immediate rewards. A stochastic policy $\pi(a|s)$ models the probability of choosing action $a$ given state $s$. The quality of the policy can be computed as the expectation over future discounted rewards for the given policy starting in state $s_0$ and taking action $a_0$.
$$J(\pi) = \mathbb{E}_{r_0, \ldots, r_T}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right] \tag{1}$$
The actions $a_t$ over the trajectory are determined by the policy $\pi(a_t|s_t)$. The successor state $s_{t+1}$ is determined by the transition function $T(s_{t+1}|s_t, a_t)$.
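As a concrete illustration of the discounted-reward objective, the following minimal sketch computes the discounted return of a single sampled trajectory. The function name `discounted_return` and the example rewards are illustrative assumptions and do not come from the original method.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one trajectory (a single-sample estimate)."""
    total = 0.0
    # Accumulate backwards so each step is one multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = [1.0, 0.0, 2.0]  # hypothetical per-step rewards
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

Averaging this quantity over many trajectories sampled from the policy gives a Monte Carlo estimate of the policy quality.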
II-B Policy Learning
The state-value function $V_\pi(s)$ estimates Eq. 1 starting from state $s$ for the policy $\pi$. The action-value function $Q_\pi(s, a)$ models the future discounted reward for taking action $a$ in state $s$ and following policy $\pi$ thereafter. The advantage function $A_\pi(s, a)$ is a measure of the benefit of taking action $a$ in state $s$ with respect to the current policy performance.
$$A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s) \tag{2}$$
The advantage function is then used as a metric for improving the policy.
(3) 
II-C Deep Reinforcement Learning
During each episode of interaction with the environment, data is collected for each action taken, as an experience tuple $(s, a, r, s')$.
II-C1 Exploration
In continuous spaces the stochastic policy is often modeled by a Gaussian distribution with mean $\mu(s|\theta_\mu)$. The standard deviation can be modeled by a state-dependent neural network model, $\sigma(s|\theta_\sigma)$, or can be state independent and sampled from $\mathcal{N}(0, \Sigma)$.
II-C2 Exploitation
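We train a neural network to model the value function on data collected from the policy. The loss function used to train the value function ($V_\pi(s|\theta_v)$) is the temporal difference error:
$$L(\theta_v) = \mathbb{E}\left[\left(r + \gamma V_\pi(s'|\theta_v) - V_\pi(s|\theta_v)\right)^2\right] \tag{4}$$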
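A minimal sketch of a one-step temporal difference loss over a batch is given below, assuming the value predictions have already been computed by some function approximator. The helper name `td_loss` and the `terminal` mask are assumptions for illustration; the text does not describe how episode ends are handled.

```python
import numpy as np


def td_loss(v, v_next, rewards, gamma, terminal=None):
    """Mean squared one-step TD error for a batch.

    v, v_next: value predictions for states and successor states.
    terminal:  optional boolean mask; the bootstrap term is dropped there
               (an assumption, the text does not discuss episode ends).
    """
    v = np.asarray(v, dtype=float)
    v_next = np.asarray(v_next, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    if terminal is None:
        terminal = np.zeros_like(rewards, dtype=bool)
    # Bootstrapped target: reward plus discounted value of the successor.
    target = rewards + gamma * np.where(terminal, 0.0, v_next)
    td_error = target - v
    return float(np.mean(td_error ** 2)), td_error
```

In a neural network setting the target is usually treated as a constant when differentiating this loss, so that only the prediction for the current state is adjusted.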
Using the learned value function as a baseline, the advantage function can be estimated from data. With an estimate of the policy gradient, via the advantage, policy updates can be performed to increase the policy’s likelihood of selecting actions with higher advantage:
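$$\theta_\pi \leftarrow \theta_\pi + \alpha \nabla_{\theta_\pi} \log \pi(a|s, \theta_\pi)\, A_\pi(s, a) \tag{5}$$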
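The advantage-weighted update can be sketched for a deliberately simple case: a linear-Gaussian policy with mean `W @ state` and a fixed standard deviation. The names `td_advantage` and `policy_gradient_step`, the linear policy, and the default hyperparameters are all illustrative assumptions rather than the networks used in this work.

```python
import numpy as np


def td_advantage(reward, v_s, v_next, gamma):
    """One-step advantage estimate using the value function as a baseline."""
    return reward + gamma * v_next - v_s


def policy_gradient_step(W, state, action, advantage, sigma=0.5, lr=0.1):
    """One likelihood-ratio update for a linear-Gaussian policy.

    The policy mean is W @ state with fixed standard deviation sigma, so
    grad_W log N(action; W @ state, sigma^2)
        = outer((action - mean) / sigma^2, state).
    """
    mean = W @ state
    grad_log_prob = np.outer((action - mean) / sigma**2, state)
    # Positive advantage raises the likelihood of the action, negative lowers it.
    return W + lr * advantage * grad_log_prob


W = np.zeros((1, 2))
state = np.array([1.0, 0.0])
action = np.array([1.0])
adv = td_advantage(reward=1.0, v_s=0.2, v_next=0.5, gamma=0.9)
W_new = policy_gradient_step(W, state, action, adv)
```

With a positive advantage the updated mean `W_new @ state` moves toward the sampled action; with a negative advantage it moves away.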
III Model-Based Action Exploration
In model-based RL we are trying to solve the same policy parameter optimization as in Eq. 3. To model the dynamics, we train one model to estimate the reward function and another to estimate the successor state. The former is modeled as a direct prediction, while the latter is modeled as a distribution from which samples can be drawn via a GAN (generative adversarial network).
III-A Stochastic Model-Based Action Exploration
A diagram of the MBAE method is shown in Figure 2. With the combination of the transition probability model and a value function, an action-valued function is constructed. Using MBAE, action gradients can be computed and used for exploration.
By using a stochastic transition function, the gradients computed by MBAE are non-deterministic. Algorithm 1 shows the method used to compute action gradients when predicted future states are sampled from a distribution. We use a Generative Adversarial Network (GAN) [21] to model the stochastic distribution. Our implementation closely follows [22], which uses a Conditional Generative Adversarial Network (cGAN) and combines a Mean Squared Error (MSE) loss with the normal GAN loss. We expect the simulation dynamics to have correlated terms, which the GAN can learn.
Here $\alpha_a$ is a learning rate specific to MBAE and $z$ is the random noise sample used by the cGAN. This exploration method can be easily incorporated into RL algorithms. The pseudocode for using MBAE is given in Algorithm 2.
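The idea of following the value gradient through a transition model to obtain an exploratory action can be sketched as follows. This is not Algorithm 1 or Algorithm 2: the toy `forward_model`, the toy `value_fn`, the constant `GOAL`, and the finite-difference gradient (standing in for backpropagation through learned networks) are all assumptions made to keep the example self-contained.

```python
import numpy as np

GOAL = np.array([1.0, -1.0])  # hypothetical target used by the toy value function


def forward_model(state, action, noise):
    """Toy stand-in for the learned stochastic transition model (assumed form)."""
    return state + action + 0.01 * noise


def value_fn(state):
    """Toy stand-in for the learned value function: higher is closer to GOAL."""
    return -float(np.sum((state - GOAL) ** 2))


def action_gradient(state, action, noise, eps=1e-5):
    """Central finite-difference estimate of d V(f(s, a, z)) / d a."""
    grad = np.zeros_like(action)
    for i in range(action.size):
        step = np.zeros_like(action)
        step[i] = eps
        v_plus = value_fn(forward_model(state, action + step, noise))
        v_minus = value_fn(forward_model(state, action - step, noise))
        grad[i] = (v_plus - v_minus) / (2.0 * eps)
    return grad


def mbae_action(state, action, rng, action_lr=0.1):
    """Nudge the policy's action along the predicted value gradient."""
    noise = rng.standard_normal(state.shape)  # noise sample fed to the model
    return action + action_lr * action_gradient(state, action, noise)


rng = np.random.default_rng(0)
s, a = np.zeros(2), np.zeros(2)
a_explore = mbae_action(s, a, rng)
```

Because the noise sample changes on every call, repeated calls from the same state and action produce slightly different exploratory actions, which is the non-determinism described above.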
III-B Dyna
In practice the successor state distribution produced from MBAE will differ from the environment’s true distribution. To compensate for this difference we perform additional training updates on the value function, replacing the successor states in the batch with ones produced by the learned transition probability model. This helps the value function better estimate future discounted reward for states produced by MBAE. This method is similar to Dyna [23, 17], but here we are performing these updates for the purposes of conditioning the value function on the transition dynamics model.
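A minimal sketch of such an additional value update is shown below, assuming a linear value function `V(s) = s @ w` and a callable `model(states, actions)` that returns predicted successor states. Both are simplifications chosen for illustration; the function name `dyna_value_update` and its defaults are hypothetical.

```python
import numpy as np


def dyna_value_update(w, states, actions, rewards, model, gamma=0.9, lr=0.05):
    """One extra TD(0) update of a linear value function V(s) = s @ w,
    using model-predicted successor states instead of the observed ones."""
    next_states = model(states, actions)            # synthetic successors
    targets = rewards + gamma * (next_states @ w)   # bootstrap on model states
    td_error = targets - states @ w
    # Semi-gradient step: the target is treated as a constant.
    return w + lr * (states.T @ td_error) / len(states)


# Toy usage with an assumed additive dynamics model.
toy_model = lambda s, a: s + a
states = np.array([[1.0, 0.0], [0.0, 1.0]])
actions = np.zeros((2, 2))
rewards = np.array([1.0, 1.0])
w = dyna_value_update(np.zeros(2), states, actions, rewards, toy_model)
```

The only difference from an ordinary TD update is the source of the successor states, which ties the value estimates to the states the transition model actually produces.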
IV Connections to Policy Gradient Methods
Action-value functions can be preferred because they model the effect of taking specific actions and can also implicitly encode the policy. However, performing a value iteration update over all actions is intractable in continuous action spaces.
$$Q(s_t, a_t) \leftarrow r_t + \gamma \max_{a'} Q(s_{t+1}, a') \tag{6}$$
DPG [24] compensates for this issue by linking the value and policy functions together, allowing gradients to be passed from the value function through to the policy. The policy parameters are then updated to increase the action-value function returns. This method has been successful [25] but has stability challenges [26].
More recently SVG [18] has been proposed as a method to unify model-free and model-based methods for learning continuous action control policies. The method learns a stochastic policy, value function and stochastic model of the dynamics that are used to estimate policy gradients. While SVG uses a similar model to compute gradients to optimize a policy, here we use this model to generate more informed exploratory actions.
V Results
MBAE is evaluated on a number of tasks, including: Membrane robot simulation of move-to-target and stacking, Membrane robot hardware move-to-target, OpenAI Gym HalfCheetah, OpenAI Gym 2D Reacher, 2D Biped simulation and N-dimensional particle navigation. The supplementary video provides a short overview of these systems and tasks. The method is evaluated using the Continuous Actor Critic Learning Automaton (CACLA) stochastic-policy RL algorithm [11]. CACLA updates the policy mean using an MSE loss for actions that have positive advantage.
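The CACLA-style rule of regressing the policy mean toward actions with positive advantage can be sketched as below. A linear policy mean `W @ s` is assumed purely for brevity, and the function name `cacla_update` and its learning rate are hypothetical.

```python
import numpy as np


def cacla_update(W, states, actions, advantages, lr=0.1):
    """Move a linear policy mean (W @ s) toward actions with positive advantage.

    Only samples with advantage > 0 contribute; the step follows the negative
    gradient of 0.5 * ||action - W @ s||^2 averaged over those samples.
    """
    mask = advantages > 0
    if not np.any(mask):
        return W  # nothing to imitate in this batch
    s, a = states[mask], actions[mask]
    error = a - s @ W.T                  # shape (n, action_dim)
    return W + lr * (error.T @ s) / len(s)


W = np.zeros((1, 2))
states = np.array([[1.0, 0.0], [0.0, 1.0]])
actions = np.array([[1.0], [5.0]])
advantages = np.array([1.0, -1.0])
W_new = cacla_update(W, states, actions, advantages)
```

Unlike a likelihood-ratio update, the magnitude of the advantage is ignored here; only its sign decides whether a sample is used.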
V-A N-Dimensional Particle
This environment is similar to a continuous action space version of the common grid world problem. In the grid world problem the agent (blue dot) is trying to reach a target location (red dot), shown in the left of Figure 2(a). In this version the agent receives reward for moving closer to its goal. This problem is chosen because it can be extended to an N-dimensional world very easily, which is helpful as a simple evaluation of scalability as the action-space dimensionality increases. We use a 10D version here [27, 28].
Figure 3 shows a visualization of a number of components used in MBAE. In Figure 4(a) we compare the learning curves of using a standard CACLA learning algorithm and one augmented with MBAE for additional action exploration. The learning curves show a significant improvement in learning speed and policy quality over the standard CACLA algorithm. We also evaluated the impact of pretraining the deterministic transition probability model for MBAE. This pretraining did not provide noticeable improvements.
V-B 2D Biped Imitation
In this environment the agent is rewarded for developing a 2D walking gait. Reward is given for matching an overall desired velocity and for matching a given reference motion. This environment is similar to [29]. The 2D Biped used in the simulation is shown in Figure 3(a).
In Figure 4(b), five evaluations are used for the 2D Biped and the mean learning curves are shown. In this case MBAE consistently learns times faster than the standard CACLA algorithm. We further find that the use of MBAE also leads to improved learning stability and more optimal policies.
V-C Gym and Membrane Robot Examples
We evaluate MBAE on two environments from OpenAI Gym, 2D Reacher (Figure 3(b)) and HalfCheetah (Figure 3(c)). MBAE does not significantly improve the learning speed for the 2D Reacher. However, it results in a higher-value policy (Figure 4(c)). For the HalfCheetah, MBAE provides a significant learning improvement (Figure 4(d)), resulting in a final policy with more than times the average reward.
Finally, we evaluate MBAE on a simulation of the juggling Membrane robot shown in Figure 0(a). The underactuated system, with complex dynamics and regular discontinuities due to contacts, makes this a challenging problem. Results for two tasks, stacking one box on top of another and moving a ball to a target location, are shown in Figure 4(f) and Figure 4(e). For both these environments the addition of MBAE provides only slight improvements. We believe that, due to the complexity of this learning task, it is difficult to learn a good policy for this problem in general. The simulated version of the membrane-stack task is shown in Figure 5(c).






We also assess MBAE on the Membrane robot shown in Figure 0(a). OpenCV is used to track the location of a ball that is affected by the actuation of servos that cause pins to move linearly, shown in Figure 5(b). The pins are connected by passive prismatic joints that form the membrane. The robot begins each new episode by resetting itself, which involves tossing the ball up and randomly repositioning the membrane. Please see the accompanying video for details. We transfer the move-to-target policy trained in simulation for use with the Membrane robot. We show the results of training on the robot with and without MBAE for hours each in Figure 5(a). Our main objective here is to demonstrate the feasibility of learning on the robot hardware; our current results are only from a single training run for each case. With this caveat in mind, MBAE appears to support improved learning. We believe that this is related to the transition probability model adjusting quickly to the new state distribution of the robot.



V-D Transition Probability Network Design
We have experimented with many network designs for the transition probability model. We have found that using a DenseNet [30] works well and increases the model's accuracy. We use dropout on the input and output layers, as well as the inner layers, to reduce overfitting. This makes the gradients passed through the transition probability model less biased.
VI Discussion
Exploration Action Randomization and Scaling
Initially, when learning begins, the estimated policy gradient is flat, making MBAE actions . As learning progresses the estimated policy gradient gets sharper, leading to actions produced from MBAE with magnitude . By using a normalized version of the action gradient, we maintain a reasonably sized explorative action; this is similar to the many methods used to normalize gradients between layers for deep learning [31, 32]. However, with normalized actions, we run the risk of being overly deterministic in action exploration. The addition of positive Gaussian noise to the normalized action length helps compensate for this. Modeling the transition dynamics stochasticity allows us to generate future states from a distribution, further increasing the stochastic nature of the action exploration.
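One way to read the normalization and noise scaling described above is sketched below. Interpreting "positive Gaussian noise" as the absolute value of a standard normal sample is an assumption, as are the function name `scaled_exploration_step` and the `step_size` parameter.

```python
import numpy as np


def scaled_exploration_step(action_grad, rng, step_size=1.0, eps=1e-8):
    """Normalize the action gradient, then scale it by a random positive length."""
    # Unit direction: removes the dependence on how flat or sharp the gradient is.
    direction = action_grad / (np.linalg.norm(action_grad) + eps)
    # "Positive Gaussian noise" read here as |N(0, 1)| (an assumption).
    length = abs(rng.standard_normal())
    return step_size * length * direction


rng = np.random.default_rng(0)
grad = np.array([3.0, 4.0])
step = scaled_exploration_step(grad, rng)
```

The step always points along the gradient, but its length varies from call to call, so the resulting exploration is not fully deterministic.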
Transition Probability Model Accuracy
Initially, the models do not need to be highly accurate. They only have to perform better than random (Gaussian) sampling. We found it important to train the transition probability model while learning. This allows the model to adjust and be most accurate for the changing state distribution observed during training. This makes it more accurate as the policy converges.
MBAE Hyperparameters
To estimate the policy gradient well and to maintain reasonably accurate value estimates, Gaussian exploration should still be performed. This helps the value function get a better estimate of the current policy performance. From empirical analysis, we have found that sampling actions from MBAE with a probability of has worked well across multiple environments. The learning progress can be more sensitive to the action learning rate . We found that annealing values between and MBAE assisted learning. The form of normalization that worked the best for MBAE was a form of batch normalization, where we normalize the action standard deviation to be similar to the policy distribution.
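Mixing the two exploration sources can be sketched as follows. The probability `p_mbae` and the annealing endpoints are left as parameters because their values are not given here, and the linear schedule in `anneal` is an assumption; both function names are hypothetical.

```python
import numpy as np


def explore(policy_mean, sigma, mbae_step, p_mbae, rng):
    """Pick an exploratory action: MBAE with probability p_mbae, else Gaussian."""
    if rng.random() < p_mbae:
        return policy_mean + mbae_step, "mbae"
    noise = sigma * rng.standard_normal(policy_mean.shape)
    return policy_mean + noise, "gaussian"


def anneal(start, end, fraction):
    """Linearly interpolate a hyperparameter as training progresses (0 to 1)."""
    fraction = min(max(fraction, 0.0), 1.0)
    return start + (end - start) * fraction
```

For example, `anneal(start, end, episode / total_episodes)` could supply the action learning rate used to compute `mbae_step` at each episode.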
One concern could be that MBAE is benefiting mostly from the extra training that is being seen for the value function. We performed an evaluation of this effect by training MBAE without the use of exploratory actions from MBAE. We found no noticeable impact on the learning speed or final policy quality.
VI-A Future Work
It might still be possible to further improve MBAE by pre-training the transition probability model offline. As well, learning a more complex transition probability model similar to what has been done in [16] could improve the accuracy of the MBAE-generated actions. It might also be helpful to learn a better model of the reward function using a method similar to [33]. One challenge is the addition of another step size for how much action gradient should be applied to the policy action, and it can be non-trivial to select this step size.
While we believe that MBAE is promising, the learning method can suffer from stability issues when the value function is inaccurate, leading to poor gradients. We are currently investigating methods to limit the KL divergence of the policy between updates. These constraints are gaining popularity in recent RL methods [34]. This should reduce the amount the policy shifts from parameter updates, further increasing the stability of learning. The Membrane-related tasks are particularly difficult to do well on; even after significant training the policies could still be improved. Lastly, while our focus has been on evaluating the method on many environments, we would also like to evaluate MBAE in the context of additional RL algorithms, such as PPO or Q-Prop, to further assess its benefit.
References
 [1] I. Osband, D. Russo, Z. Wen, and B. Van Roy, “Deep Exploration via Randomized Value Functions,” ArXiv e-prints, Mar. 2017.
 [2] M. Fortunato, M. Gheshlaghi Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin, C. Blundell, and S. Legg, “Noisy Networks for Exploration,” ArXiv e-prints, June 2017.
 [3] M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz, “Parameter Space Noise for Exploration,” ArXiv e-prints, June 2017.
 [4] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” CoRR, vol. abs/1509.02971, 2015. [Online]. Available: http://arxiv.org/abs/1509.02971
 [5] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press Cambridge, 1998, vol. 1, no. 1.
 [6] N. Heess, D. TB, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. M. A. Eslami, M. Riedmiller, and D. Silver, “Emergence of Locomotion Behaviours in Rich Environments,” ArXiv e-prints, July 2017.
 [7] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems 12. MIT Press, 2000, pp. 1057–1063.
 [8] J. Schulman, P. Moritz, S. Levine, M. I. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” CoRR, vol. abs/1506.02438, 2015. [Online]. Available: http://arxiv.org/abs/1506.02438
 [9] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” CoRR, vol. abs/1602.01783, 2016. [Online]. Available: http://arxiv.org/abs/1602.01783
 [10] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
 [11] H. Van Hasselt, “Reinforcement learning in continuous state and action spaces,” in Reinforcement Learning. Springer, 2012, pp. 207–251.
 [12] N. Heess, G. Wayne, Y. Tassa, T. P. Lillicrap, M. A. Riedmiller, and D. Silver, “Learning and transfer of modulated locomotor controllers,” CoRR, vol. abs/1610.05182, 2016. [Online]. Available: http://arxiv.org/abs/1610.05182
 [13] X. B. Peng, G. Berseth, K. Yin, and M. Van De Panne, “Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning,” ACM Trans. Graph., vol. 36, no. 4, pp. 41:1–41:13, July 2017. [Online]. Available: http://doi.acm.org/10.1145/3072959.3073602
 [14] R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. D. Turck, and P. Abbeel, “Curiosity-driven exploration in deep reinforcement learning via bayesian neural networks,” CoRR, vol. abs/1605.09674, 2016. [Online]. Available: http://arxiv.org/abs/1605.09674
 [15] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, “Continuous deep Q-learning with model-based acceleration,” in Proc. ICML, 2016, pp. 2829–2838.
 [16] N. Mishra, P. Abbeel, and I. Mordatch, “Prediction and control with temporal segment models,” CoRR, vol. abs/1703.04070, 2017. [Online]. Available: http://arxiv.org/abs/1703.04070
 [17] R. S. Sutton, “Dyna, an integrated architecture for learning, planning, and reacting,” ACM SIGART Bulletin, vol. 2, no. 4, pp. 160–163, 1991.
 [18] N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa, “Learning continuous control policies by stochastic value gradients,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds., 2015, pp. 2944–2952.
 [19] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, “Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning,” ArXiv e-prints, Aug. 2017.
 [20] S. Bansal, R. Calandra, S. Levine, and C. Tomlin, “MBMF: Model-Based Priors for Model-Free Reinforcement Learning,” ArXiv e-prints, Sept. 2017.
 [21] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds., 2014, pp. 2672–2680.
 [22] P. Isola, J. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” CoRR, vol. abs/1611.07004, 2016. [Online]. Available: http://arxiv.org/abs/1611.07004
 [23] R. S. Sutton, “Integrated architectures for learning, planning, and reacting based on approximating dynamic programming,” in Proceedings of the Seventh International Conference on Machine Learning. Morgan Kaufmann, 1990, pp. 216–224.
 [24] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in Proc. ICML, 2014.
 [25] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” CoRR, vol. abs/1509.02971, 2015. [Online]. Available: http://arxiv.org/abs/1509.02971
 [26] M. J. Hausknecht and P. Stone, “Deep reinforcement learning in parameterized action space,” CoRR, vol. abs/1511.04143, 2015. [Online]. Available: http://arxiv.org/abs/1511.04143
 [27] A. Tamar, S. Levine, and P. Abbeel, “Value iteration networks,” CoRR, vol. abs/1602.02867, 2016. [Online]. Available: http://arxiv.org/abs/1602.02867
 [28] C. Finn, T. Yu, J. Fu, P. Abbeel, and S. Levine, “Generalizing skills with semi-supervised reinforcement learning,” CoRR, vol. abs/1612.00429, 2016. [Online]. Available: http://arxiv.org/abs/1612.00429
 [29] X. B. Peng and M. van de Panne, “Learning locomotion skills using DeepRL: Does the choice of action space matter?” in Proceedings of the ACM SIGGRAPH / Eurographics Symposium on Computer Animation, ser. SCA ’17. New York, NY, USA: ACM, 2017, pp. 12:1–12:13. [Online]. Available: http://doi.acm.org/10.1145/3099564.3099567

 [30] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 [31] J. Lei Ba, J. R. Kiros, and G. E. Hinton, “Layer Normalization,” ArXiv e-prints, July 2016.
 [32] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” CoRR, vol. abs/1502.03167, 2015. [Online]. Available: http://arxiv.org/abs/1502.03167
 [33] D. Silver, H. van Hasselt, M. Hessel, T. Schaul, A. Guez, T. Harley, G. Dulac-Arnold, D. Reichert, N. Rabinowitz, A. Barreto, and T. Degris, “The Predictron: End-to-End Learning and Planning,” 2016. [Online]. Available: http://arxiv.org/abs/1612.08810
 [34] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust region policy optimization,” CoRR, vol. abs/1502.05477, 2015. [Online]. Available: http://arxiv.org/abs/1502.05477
 [35] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal Policy Optimization Algorithms,” ArXiv e-prints, July 2017.
VII Appendix
VII-A Max Over All Actions, Value Iteration
By using MBAE in an iterative manner for a single state, it is possible to compute the max over all actions. This is a form of value iteration over the space of possible actions. It has been shown that embedding value iteration in the model design can be very beneficial [27]. The algorithm to perform this computation is given in Algorithm 3.
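The iterative use of action gradients for a single state can be sketched as plain gradient ascent on the action. This is an illustrative stand-in and not a reproduction of Algorithm 3: the function name `argmax_action`, the stopping rule, and the toy concave action-value gradient are assumptions.

```python
import numpy as np


def argmax_action(q_grad, action, lr=0.1, iters=200, tol=1e-8):
    """Approximate argmax_a Q(s, a) for one state by repeated gradient steps.

    q_grad(action) returns dQ/da at the fixed state; this only finds a local
    maximum unless Q is concave in the action.
    """
    for _ in range(iters):
        step = lr * q_grad(action)
        action = action + step
        if np.linalg.norm(step) < tol:
            break  # the update has become negligible
    return action


# Toy concave Q with its maximum at a = [0.5, -0.25] (illustrative only).
best = np.array([0.5, -0.25])
q_grad = lambda a: -2.0 * (a - best)
a_star = argmax_action(q_grad, np.zeros(2))
```

With learned models the gradient would come from backpropagating through the transition model and value function rather than from a closed-form expression.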
VII-B More Results
We perform additional evaluation on MBAE. First we use MBAE with the Proximal Policy Optimization (PPO) [35] algorithm in Figure 6(a) to show that the method works with other learning algorithms. We also created a modified version of CACLA that is on-policy to further study the advantage of using MBAE in this setting (Figure 6(b)).

