Model-Based Action Exploration

01/11/2018 ∙ by Glen Berseth, et al.

Deep reinforcement learning has made great strides in solving challenging motion control tasks. Recently there has been a significant amount of work on methods to exploit the data gathered during training, but less work has been done on good methods for generating data to learn from. For continuous action domains, the typical method for generating exploratory actions is to sample from a Gaussian distribution centred around the mean of the policy. Although these methods can find an optimal policy, in practice they do not scale well, and solving environments with many action dimensions becomes impractical. We consider learning a forward dynamics model to predict the result, s_{t+1}, of taking a particular action, a, given a specific observation of the state, s_t. With such a model we can do what comes naturally to biological systems that have already collected experience: perform internal predictions of outcomes and endeavour to try actions we believe have a reasonable chance of success. This method greatly reduces the space of exploratory actions, increasing learning speed, and enables higher quality solutions to difficult problems such as robotic locomotion.







I Related Work

Reinforcement Learning

The environment in an RL problem is often modelled as a Markov Dynamic Process (MDP) with a discrete set of states and actions [5]. In this work we focus on problems with infinite/continuous state and action spaces. These include complex motor control tasks that have become a popular benchmark in the machine learning literature [6]. Many recent RL approaches are based on policy gradient methods [7], where the gradient of the policy with respect to future discounted reward is approximated and used to update the policy. Recent advances in combining policy gradient methods and deep learning have led to impressive results for numerous problems, including Atari games and bipedal motion control [8, 9, 10, 11, 12, 13].

Sample Efficient RL

While policy gradient methods provide a general framework for how to update a policy given data, it is still a challenge to generate good data. Sample efficient RL methods are an important area of research, as learning complex policies for motion control can take days and running experiments on physical robots is time-consuming. Learning can be made more sample efficient by further parameterizing the policy and passing noise through the network as an alternative to adding vanilla Gaussian noise [3, 2]. Other work encourages exploration of parts of the state space that have not yet been seen by the agent [14]. There has been success in incorporating model-based methods to generate synthetic data or locally approximate the dynamics [15, 16, 17]. Two methods are similar to the MBAE work that we propose. Deep Deterministic Policy Gradient (DDPG) directly links the policy and value function, propagating gradients into the policy from the value function [4]. Another is Stochastic Value Gradients (SVG), a framework for blending between model-based and model-free learning [18]. However, these methods do not use the gradients as a method for action exploration.

Model-Based RL

Model-based RL generally refers to methods that use the structure of the problem to assist learning. Typically, any method that uses more than a policy and value function is considered to fall into this category. Significant improvements have been made recently by including some model-based knowledge in the RL problem. For example, first learning a policy with model-based RL and then training a model-free method to imitate it yields large gains [19]. There is also interest in learning and using models of the transition dynamics to improve learning [20]. The work in [16] uses model-based policy optimization methods along with very accurate dynamics models to learn good policies. In this work, we learn a model of the dynamics and use it to compute gradients that maximize future discounted reward for action exploration. The dynamics model does not need to be particularly accurate, as the underlying model-free RL algorithm can cope with a noisy action distribution.

II Framework

In this section we outline the MDP-based framework used to describe the RL problem.

II-A Markov Dynamic Process

An MDP is a tuple (S, A, R, P, γ). Here S is the space of all possible state configurations and A is the set of available actions. The reward function R(s, a) determines the reward for taking action a in state s. The probability of ending up in state s′ after taking action a in state s is described by the transition dynamics function P(s′ | s, a). Lastly, the discount factor γ ∈ [0, 1) controls the planning horizon and gives preference to more immediate rewards. A stochastic policy π(a | s) models the probability of choosing action a given state s. The quality of the policy can be computed as the expectation over future discounted rewards for the given policy starting in state s_0 and taking action a_0:

J(π) = E[ ∑_{t=0}^{T} γ^t r_t ]    (1)

where r_t = R(s_t, a_t). The actions over the trajectory are determined by the policy, a_t ∼ π(a | s_t). The successor state is determined by the transition function, s_{t+1} ∼ P(s′ | s_t, a_t).

II-B Policy Learning

The state-value function V^π(s) estimates Eq. 1 starting from state s for the policy π. The action-value function Q^π(s, a) models the future discounted reward for taking action a in state s and following policy π thereafter. The advantage function is a measure of the benefit of taking action a in state s with respect to the current policy performance:

A^π(s, a) = Q^π(s, a) − V^π(s)    (2)

The advantage function is then used as a metric for improving the policy:

θ ← θ + λ E[ A^π(s, a) ∇_θ log π(a | s, θ) ]    (3)
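The one-step advantage estimate described above can be sketched in a single line of code; the function name and toy values below are illustrative, not from the paper:

```python
def advantage(r, v_next, v_curr, gamma=0.99):
    """One-step advantage estimate: A(s, a) ~ r + gamma * V(s') - V(s).

    A positive value means the action outperformed the policy's average
    behaviour from state s.
    """
    return r + gamma * v_next - v_curr

# With r = 1.0, V(s') = 2.0, V(s) = 2.5, gamma = 0.9 the advantage is ~0.3.
adv = advantage(1.0, 2.0, 2.5, gamma=0.9)
```

In practice V is the learned baseline network, so this estimate is only as good as the current value function.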
II-C Deep Reinforcement Learning

During each episode of interaction with the environment, data is collected for each action taken, as an experience tuple (s_t, a_t, r_t, s_{t+1}).

II-C1 Exploration

In continuous action spaces the stochastic policy is often modeled by a Gaussian distribution with mean μ(s | θ). The standard deviation can be modeled by a state-dependent neural network model, σ(s | θ), or can be state-independent, with exploratory actions sampled as a_t ∼ N(μ(s_t | θ), σ²).
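The baseline Gaussian exploration scheme, which MBAE is later contrasted with, can be sketched as (function name and values are illustrative):

```python
import random

def gaussian_exploration(mean, std):
    """Standard Gaussian exploration: perturb each dimension of the policy
    mean with independent Gaussian noise. `std` may come from a
    state-dependent model or be a fixed, state-independent value."""
    return [random.gauss(m, s) for m, s in zip(mean, std)]

# Sample one exploratory action around a two-dimensional policy mean.
action = gaussian_exploration(mean=[0.0, 0.5], std=[0.1, 0.1])
```

Note that this perturbs every action dimension blindly, which is exactly the scaling problem in high-dimensional action spaces that motivates MBAE.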

II-C2 Exploitation

We train a neural network to model the value function on data collected from the policy. The loss function used to train the value function V(s | θ_v) is the temporal difference error:

L(θ_v) = ( r_t + γ V(s_{t+1} | θ_v) − V(s_t | θ_v) )²

Using the learned value function as a baseline, the advantage function can be estimated from data. With an estimate of the policy gradient, via the advantage, policy updates can be performed to increase the policy's likelihood of selecting actions with higher advantage.
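The temporal-difference update can be sketched on a tabular value function; the paper trains a neural network, so the dictionary below is a simplifying stand-in:

```python
def td_update(V, s, r, s_next, gamma=0.9, lr=0.1):
    """One temporal-difference update on a tabular value function V (a dict),
    reducing the squared TD error (r + gamma * V[s'] - V[s])^2."""
    td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + lr * td_error
    return td_error

# A state that loops to itself with reward 1 should approach V = 1/(1-0.9) = 10.
V = {}
for _ in range(500):
    td_update(V, "s0", 1.0, "s0")
```

Repeated updates drive the TD error toward zero, at which point V satisfies the Bellman equation for the sampled transitions.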


III Model-Based Action Exploration

In model-based RL we are trying to solve the same policy parameter optimization as in Eq. 3. To model the dynamics, we train one model to estimate the reward function and another to estimate the successor state. The former is modeled as a direct prediction, while the latter is modeled as a distribution from which samples can be drawn via a GAN (generative adversarial network).
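The two dynamics models can be sketched with stand-in implementations; the paper learns neural networks for both, so the quadratic reward and noisy linear step below are placeholders that only illustrate the interfaces:

```python
import random

def reward_model(state, action):
    """Deterministic reward prediction r(s, a); a stand-in quadratic penalty."""
    return -sum((s - a) ** 2 for s, a in zip(state, action))

def successor_model(state, action, noise_scale=0.05):
    """Stochastic successor prediction: one sample of s_{t+1} given (s_t, a_t).
    The paper samples from a conditional GAN; noisy linear dynamics stand in."""
    return [s + a + random.gauss(0.0, noise_scale)
            for s, a in zip(state, action)]
```

Sampling `successor_model` twice for the same (s, a) generally yields different states, which is what makes the action gradients computed through it stochastic.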

III-A Stochastic Model-Based Action Exploration

A diagram of the MBAE method is shown in Figure 2. By combining the transition probability model with a value function, an action-value function is constructed. Using MBAE, action gradients can be computed and used for exploration.

Fig. 2: Schematic of the Model-Based Action Exploration design. States are generated from the simulator, the policy produces an action a_t, and (s_t, a_t) are used to predict the next state ŝ_{t+1}. The gradient is computed back through the value function to give the gradient of the state, which is then used to compute a gradient that changes the action by Δa to produce a predicted state with higher value.

By using a stochastic transition function the gradients computed by MBAE are non-deterministic. Algorithm 1 shows the method used to compute action gradients when predicted future states are sampled from a distribution. We use a Generative Adversarial Network (GAN) [21] to model the stochastic distribution. Our implementation closely follows [22], which uses a conditional GAN (cGAN) and combines a mean squared error (MSE) loss with the normal GAN loss. We expect the simulation dynamics to have correlated terms, which the GAN can learn.

1:function getActionDelta(s_t, a_t)
2:     ξ ∼ N(0, I)
3:     ŝ_{t+1} ← f(s_t, a_t, ξ)
4:     g ← ∂V(ŝ_{t+1}) / ∂a_t
5:     Δa ← α · g / ‖g‖
6:     Δa ← Δa · (1 + |η|), η ∼ N(0, σ)
7:     return Δa
8:end function
Algorithm 1 Compute Action Gradient

Here, α is a learning rate specific to MBAE and ξ is the random noise sample used by the cGAN. This exploration method can be easily incorporated into existing RL algorithms. The pseudo code for using MBAE is given in Algorithm 2.

1:Randomly initialize model parameters
2:while not done do
3:     while simulating episode do
4:         if generate exploratory action then
5:              a_t ∼ π(a | s_t, θ)
6:              if exploring with MBAE then
7:                   a_t ← a_t + getActionDelta(s_t, a_t)
8:              end if
9:         else
10:              a_t ← μ(s_t | θ)
11:         end if
12:     end while
13:     Sample batch τ from collected experience
14:     Update value function, policy and transition probability model given τ
15:end while
Algorithm 2 MBAE algorithm
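The getActionDelta step of Algorithm 1 can be sketched numerically; finite differences stand in for backpropagation through the learned value and dynamics networks, and all function names and the toy models are illustrative:

```python
def get_action_delta(state, action, value_fn, dynamics_fn, alpha=0.1, eps=1e-5):
    """Estimate dV(f(s, a))/da by finite differences, then return a
    normalized, alpha-scaled step in that direction."""
    base = value_fn(dynamics_fn(state, action))
    grad = []
    for i in range(len(action)):
        nudged = list(action)
        nudged[i] += eps
        grad.append((value_fn(dynamics_fn(state, nudged)) - base) / eps)
    norm = sum(g * g for g in grad) ** 0.5
    if norm == 0.0:
        return [0.0] * len(action)
    return [alpha * g / norm for g in grad]

# Toy check: linear dynamics and a value function peaking at the state
# [1, 1] should push the action toward the goal in both dimensions.
dynamics = lambda s, a: [si + ai for si, ai in zip(s, a)]
value = lambda s: -sum((si - 1.0) ** 2 for si in s)
delta = get_action_delta([0.0, 0.0], [0.0, 0.0], value, dynamics)
```

Because the gradient is normalized before scaling, the returned step always has magnitude alpha, matching the gradient-scaling discussion in Section VI.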

III-B Dyna

In practice the successor state distribution produced by MBAE will differ from the environment's true distribution. To compensate for this difference we perform additional training updates on the value function, replacing the successive states in the batch with ones produced by the transition probability model. This helps the value function better estimate future discounted rewards for states produced by MBAE. This method is similar to DYNA [23, 17], but here we perform these updates for the purpose of conditioning the value function on the transition dynamics model.
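These extra value updates can be sketched on a tabular value function; hashable keys stand in for states, and the lambda model in the usage line is a placeholder for the learned transition model:

```python
def dyna_value_updates(V, batch, successor_model, gamma=0.9, lr=0.1):
    """DYNA-style refit: replace the recorded next state in each experience
    tuple with one sampled from the learned transition model, so the value
    function is conditioned on the states the model will actually propose."""
    for s, a, r, _recorded_next in batch:
        model_next = successor_model(s, a)
        td_error = r + gamma * V.get(model_next, 0.0) - V.get(s, 0.0)
        V[s] = V.get(s, 0.0) + lr * td_error

V = {}
batch = [("s0", "a0", 1.0, "s1")]
dyna_value_updates(V, batch, successor_model=lambda s, a: "s1")
```

The recorded next state is deliberately ignored: the point is to make V accurate on the model's state distribution, not the environment's.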

IV Connections to Policy Gradient Methods

Action-value functions can be preferred because they model the effect of taking specific actions and can also implicitly encode the policy. However, performing a value iteration update over all actions,

V(s) ← max_a ( R(s, a) + γ E_{s′ ∼ P(· | s, a)}[ V(s′) ] ),

is intractable in continuous action spaces.
DPG [24] compensates for this issue by linking the value and policy functions together, allowing gradients to be passed from the value function through to the policy. The policy parameters are then updated to increase the action-value function's returns. This method has been successful [25] but has stability challenges [26].

More recently SVG [18] has been proposed as a method to unify model-free and model-based methods for learning continuous action control policies. The method learns a stochastic policy, value function and stochastic model of the dynamics that are used to estimate policy gradients. While SVG uses a similar model to compute gradients to optimize a policy, here we use this model to generate more informed exploratory actions.

V Results

MBAE is evaluated on a number of tasks, including: Membrane robot simulations of move-to-target and stacking, Membrane robot hardware move-to-target, OpenAI Gym HalfCheetah, OpenAI Gym 2D Reacher, a 2D Biped simulation and N-dimensional particle navigation. The supplementary video provides a short overview of these systems and tasks. The method is evaluated using the Continuous Actor Critic Learning Automaton (CACLA) stochastic policy RL algorithm [11]. CACLA updates the policy mean using an MSE loss for actions that have positive advantage.

V-A N-Dimensional Particle

This environment is similar to a continuous action space version of the common grid world problem. In the grid world problem the agent (blue dot) is trying to reach a target location (red dot), shown in the left of Figure 2(a). In this version the agent receives reward for moving closer to its goal. This problem is chosen because it can be extended to an N-dimensional world very easily, which makes it a simple evaluation of scalability as the action-space dimensionality increases. We use a 10D version here [27, 28].
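A minimal sketch of such an environment, with progress toward the goal as the reward; the class and method names are our own for illustration, not from the paper's code:

```python
import random

class ParticleEnv:
    """N-dimensional point navigation: the agent is rewarded for
    reducing its distance to a goal position."""

    def __init__(self, n_dims=10, step_size=0.1):
        self.n_dims = n_dims
        self.step_size = step_size
        self.reset()

    def reset(self):
        self.pos = [random.uniform(-1.0, 1.0) for _ in range(self.n_dims)]
        self.goal = [random.uniform(-1.0, 1.0) for _ in range(self.n_dims)]
        return list(self.pos)

    def _distance(self):
        return sum((p - g) ** 2 for p, g in zip(self.pos, self.goal)) ** 0.5

    def step(self, action):
        before = self._distance()
        self.pos = [p + self.step_size * a for p, a in zip(self.pos, action)]
        reward = before - self._distance()  # positive when moving closer
        return list(self.pos), reward

# Moving along the direction of the goal should never yield negative reward.
env = ParticleEnv(n_dims=3)
toward_goal = [g - p for g, p in zip(env.goal, env.pos)]
_, reward = env.step(toward_goal)
```

The same class extends to any dimensionality by changing `n_dims`, which is what makes the task a convenient scalability benchmark.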

(a) Nav environment and current policy
(b) transition probability error
(c) MBAE direction
Fig. 3: In (a), left: the current layout of the continuous grid world. The agent is blue, the target location for the agent is the red dot, and the green boxes are obstacles. In (a), right: the current policy is shown as if the agent were located at each arrow, with each arrow giving the unit direction of the action; the current value at each state is visualized by the colour of the arrows, red being the highest. In (b) the error of the forward dynamics model is visualized as the distance between the predicted and actual successive states. (c) shows the unit-length action gradient from MBAE. Only the first two dimensions of the state and action are visualized here.

Figure 3 shows a visualization of a number of components used in MBAE. In Figure 4(a) we compare the learning curves of using a standard CACLA learning algorithm and one augmented with MBAE for additional action exploration. The learning curves show a significant improvement in learning speed and policy quality over the standard CACLA algorithm. We also evaluated the impact of pre-training the deterministic transition probability model for MBAE. This pre-training did not provide noticeable improvements.

V-B 2D Biped Imitation

In this environment the agent is rewarded for developing a 2D walking gait. Reward is given for matching an overall desired velocity and for matching a given reference motion. This environment is similar to [29]. The 2D Biped used in the simulation is shown in Figure 3(a).

(a) 2D Biped
(b) 2D Reacher (c) HalfCheetah
Fig. 4: Additional environments MBAE is evaluated on.

In Figure 4(b), five evaluations are used for the 2D Biped and the mean learning curves are shown. In this case MBAE consistently learns faster than the standard CACLA algorithm. We further find that the use of MBAE also leads to improved learning stability and higher-quality policies.

V-C Gym and Membrane Robot Examples

We evaluate MBAE on two environments from OpenAI Gym: 2D Reacher (Figure 3(b)) and HalfCheetah (Figure 3(c)). MBAE does not significantly improve the learning speed for the 2D Reacher; however, it results in a higher-value policy (Figure 4(c)). For the HalfCheetah, MBAE provides a significant learning improvement (Figure 4(d)), resulting in a final policy with a higher average reward.

Finally, we evaluate MBAE on a simulation of the juggling Membrane robot shown in Figure 0(a). The under-actuated system, with complex dynamics and regular discontinuities due to contacts, makes this a challenging problem. The results for two tasks, attempting to stack one box on top of another and moving a ball to a target location, are shown in Figure 4(f) and Figure 4(e). For both of these environments the addition of MBAE provides only slight improvements. We believe that due to the complexity of this learning task, it is difficult to learn a good policy for this problem in general. The simulated version of the membrane-stack task is shown in Figure 5(c).

(a) Particle 10D
(b) 2D PD Biped
(c) Reacher 2D
(d) Half-Cheetah
(e) Membrane Target
(f) Membrane Stack
Fig. 5: Comparisons of the CACLA learning method with and without MBAE. These performance curves are the average of separate simulations with different random seeds.

We also assess MBAE on the Membrane robot shown in Figure 0(a). OpenCV is used to track the location of a ball that is affected by the actuation of servos that cause pins to move linearly, shown in Figure 5(b). The pins are connected by passive prismatic joints that form the membrane. The robot begins each new episode by resetting itself, which involves tossing the ball up and randomly repositioning the membrane. Please see the accompanying video for details. We transfer the move-to-target policy trained in simulation for use with the Membrane robot. We show the results of training on the robot with and without MBAE, for an equal number of hours each, in Figure 5(a). Our main objective here is to demonstrate the feasibility of learning on the robot hardware; our current results are only from a single training run for each case. With this caveat in mind, MBAE appears to support improved learning. We believe that this is related to the transition probability model quickly adjusting to the new state distribution of the robot.

(a) Membrane Target
(b) Membrane Camera
(c) Membrane Stack
Fig. 6: (a) Comparison of using MBAE on the physical robot task. (b) The camera view the robot uses to track its state, and (c) a still frame from the simulated box-stacking task.

V-D Transition Probability Network Design

We have experimented with many network designs for the transition probability model. We have found that using a DenseNet [30] works well and increases the model's accuracy. We use dropout on the input and output layers, as well as the inner layers, to reduce overfitting. This makes the gradients passed through the transition probability model less biased.

VI Discussion

Exploration Action Randomization and Scaling

Initially, when learning begins, the estimated policy gradient is flat, making the actions MBAE produces close to zero. As learning progresses the estimated policy gradient gets sharper, leading to actions produced from MBAE with larger magnitude. By using a normalized version of the action gradient, we maintain a reasonably sized explorative action; this is similar to the many methods used to normalize gradients between layers in deep learning [31, 32]. However, with normalized actions, we run the risk of being overly deterministic in action exploration. The addition of positive Gaussian noise to the normalized action length helps compensate for this. Modeling the stochasticity of the transition dynamics allows us to generate future states from a distribution, further increasing the stochastic nature of the action exploration.
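The normalization-plus-positive-noise scheme just described can be sketched as follows; the step size and noise scale are arbitrary illustrative choices, not values from the paper:

```python
import random

def scaled_exploratory_delta(action_grad, alpha=0.25, noise_scale=0.5):
    """Normalize the action gradient to unit length, then stretch it by a
    strictly positive random factor so the exploratory step neither vanishes
    when gradients are flat nor becomes overly deterministic when sharp."""
    norm = sum(g * g for g in action_grad) ** 0.5
    if norm == 0.0:
        return [0.0] * len(action_grad)
    length = alpha * (1.0 + abs(random.gauss(0.0, noise_scale)))  # always > alpha
    return [length * g / norm for g in action_grad]

delta = scaled_exploratory_delta([3.0, 4.0], alpha=0.25)
```

Only the direction of the gradient survives normalization; the randomized length restores stochasticity to the exploration.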

Transition Probability Model Accuracy

Initially, the models do not need to be particularly accurate; they only have to perform better than random (Gaussian) sampling. We found it important to continue training the transition probability model while learning. This allows the model to adjust to, and remain most accurate for, the changing state distribution observed during training, making it more accurate as the policy converges.

MBAE Hyperparameters

To estimate the policy gradient well and to maintain reasonably accurate value estimates, Gaussian exploration should still be performed; this helps the value function obtain a better estimate of the current policy's performance. From empirical analysis, we have found that sampling actions from MBAE with a fixed probability has worked well across multiple environments. The learning progress can be more sensitive to the action learning rate α; we found that annealing its value over training assisted learning. The form of normalization that worked best for MBAE was a form of batch normalization, where we normalize the action standard deviation to be similar to that of the policy distribution.

One concern could be that MBAE benefits mostly from the extra training the value function receives. We evaluated this effect by performing the additional training without using exploratory actions from MBAE, and found no noticeable impact on the learning speed or final policy quality.

VI-A Future Work

It might still be possible to further improve MBAE by pre-training the transition probability model offline. Additionally, learning a more complex transition probability model, similar to what has been done in [16], could improve the accuracy of the MBAE-generated actions. It might also be helpful to learn a better model of the reward function using a method similar to [33]. One challenge is the addition of another step size controlling how much of the action gradient is applied to the policy action; selecting this step size can be non-trivial.

While we believe that MBAE is promising, the learning method can suffer from stability issues when the value function is inaccurate, leading to poor gradients. We are currently investigating methods to limit the KL divergence of the policy between updates; such constraints are gaining popularity in recent RL methods [34]. This should reduce how much the policy shifts with each parameter update, further increasing the stability of learning. The Membrane-related tasks are particularly difficult to do well on; even after significant training the policies could still be improved. Lastly, while our focus has been on evaluating the method on many environments, we would also like to evaluate MBAE in the context of additional RL algorithms, such as PPO or Q-Prop, to further assess its benefit.


VII Appendix

VII-A Max Over All Actions, Value Iteration

By using MBAE in an iterative manner for a single state s, it is possible to compute the max over all actions. This is a form of value iteration over the space of possible actions. It has been shown that embedding value iteration in the model design can be very beneficial [27]. The algorithm to perform this computation is given in Algorithm 3.

1:a ← μ(s | θ)
2:while not done do
3:     a ← a + getActionDelta(s, a)
4:end while
Algorithm 3 Action optimization
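The loop of Algorithm 3 can be sketched with any getActionDelta-style callable; the quadratic toy gradient below, with a known best action of 1.0 per dimension, is for illustration only:

```python
def optimize_action(state, action, get_action_delta, n_iters=50):
    """Repeatedly apply the action gradient for a single state,
    approximating a max over all actions."""
    current = list(action)
    for _ in range(n_iters):
        delta = get_action_delta(state, current)
        current = [c + d for c, d in zip(current, delta)]
    return current

# Toy gradient step toward the (known) best action 1.0 in each dimension.
toy_delta = lambda state, action: [0.2 * (1.0 - a) for a in action]
best = optimize_action(state=None, action=[0.0, 0.0], get_action_delta=toy_delta)
```

With a fixed state, this is plain gradient ascent in action space; the number of iterations plays the role of the "not done" condition in the pseudocode.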

VII-B More Results

We perform additional evaluations of MBAE. First we use MBAE with the Proximal Policy Optimization (PPO) [35] algorithm in Figure 6(a) to show that the method works with other learning algorithms. We also created a modified version of CACLA that is on-policy to further study the advantage of using MBAE in this setting (Figure 6(b)).

(a) PPO game
(b) on-policy CACLA game
Fig. 7: (a) Result of applying MBAE to PPO. In (b) we show that an on-policy version of CACLA with MBAE can learn faster than CACLA alone.