Discrete Sequential Prediction of Continuous Actions for Deep RL

05/14/2017 ∙ by Luke Metz, et al. ∙ Google, Nvidia

It has long been assumed that high dimensional continuous control problems cannot be solved effectively by discretizing individual dimensions of the action space due to the exponentially large number of bins over which policies would have to be learned. In this paper, we draw inspiration from the recent success of sequence-to-sequence models for structured prediction problems to develop policies over discretized spaces. Central to this method is the realization that complex functions over high dimensional spaces can be modeled by neural networks that use next step prediction. Specifically, we show how Q-values and policies over continuous spaces can be modeled using a next step prediction model over discretized dimensions. With this parameterization, it is possible to both leverage the compositional structure of action spaces during learning, as well as compute maxima over action spaces (approximately). On a simple example task we demonstrate empirically that our method can perform global search, which effectively gets around the local optimization issues that plague DDPG and NAF. We apply the technique to off-policy (Q-learning) methods and show that our method can achieve the state-of-the-art for off-policy methods on several continuous control tasks.

1 Introduction

Reinforcement learning has long been considered a general framework applicable to a broad range of problems. Reinforcement learning algorithms have been categorized in various ways; an important distinction arises between discrete and continuous action spaces. In discrete domains, several algorithms, such as Q-learning, leverage backups through Bellman equations and dynamic programming to solve problems effectively. These strategies have led to the use of deep neural networks to learn policies and value functions that achieve superhuman performance in several games mnih2013playing ; silver2016mastering where actions lie in discrete domains. This success spurred the development of RL techniques that use deep neural networks for continuous control problems lillicrap2015continuous ; gu2016continuous ; levine2016end . The gains in these domains, however, have not been as outsized as they have been for discrete action domains, and superhuman performance seems unachievable with current techniques.

This disparity is in part a result of the inherent difficulty in maximizing an arbitrary function on a continuous domain, even in low dimensional settings; further, techniques such as beam searches do not exist or do not apply directly. This makes it difficult to perform inference and learning in such settings. Additionally, it becomes harder to apply dynamic programming methods to back up value function estimates from successor states to parent states. Several of the recent continuous control reinforcement learning approaches attempt to borrow characteristics from discrete problems by proposing models that allow maximization and backups more easily gu2016continuous .

One way in which continuous control can avail itself of the above advantages is to discretize each of the dimensions of continuous control action spaces. As noted in lillicrap2015continuous , doing this naively would create an exponentially large discrete space of actions. For example, with $N$ dimensions each discretized into $B$ bins, the problem would balloon to a discrete space with $B^N$ possible actions.
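As a rough illustration of the blow-up described above, the sketch below compares the number of joint actions under naive per-dimension binning with the number of outputs needed when one dimension is predicted at a time. The function names and the example sizes are hypothetical and are not taken from the paper.

```python
def joint_action_count(num_dims: int, num_bins: int) -> int:
    """Joint actions when every dimension is binned and all combinations are enumerated."""
    return num_bins ** num_dims


def sequential_output_count(num_dims: int, num_bins: int) -> int:
    """Total network outputs when each dimension gets its own vector of bin values."""
    return num_dims * num_bins


if __name__ == "__main__":
    # Example sizes are illustrative only.
    for dims in (2, 6, 10):
        print(dims, joint_action_count(dims, 32), sequential_output_count(dims, 32))
```

The first count grows exponentially in the number of dimensions, while the second grows linearly, which is the property the sequential formulation relies on.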

We leverage the recent success of sequence-to-sequence type models seq2seq to train such discretized models, without falling into the trap of requiring an exponentially large number of actions. Our method relies on a technique that was first introduced in bengio1999modeling , which allows us to escape the curse of dimensionality in high dimensional spaces by modeling complicated probability distributions using the chain rule decomposition. In this paper, we similarly parameterize functions of interest – Q-values – using a decomposition of the joint function into a sequence of conditional approximations. Families of models that exhibit these sequential prediction properties have also been referred to as autoregressive. With this formulation, we are able to achieve fine-grained discretization of individual domains, without an explosion in the number of parameters; at the same time we can model arbitrarily complex distributions while maintaining the ability to perform (approximate) maximization.

While this strategy can be applied to most function approximation settings in RL, we focus on off-policy settings with a DQN inspired algorithm for Q-learning. Complementary results for an actor-critic model sutton1998reinforcement ; sutton1999policy based on the same autoregressive concept are presented in Appendix D. Empirical results on an illustrative multimodal problem demonstrate how our model is able to perform global maximization, avoiding the exploration problems faced by algorithms like NAF naf and DDPG lillicrap2015continuous . We also show the effectiveness of our method in solving a range of benchmark continuous control problems, from hopper to humanoid.

2 Method

In this paper, we introduce the idea of building continuous control algorithms utilizing sequential, or autoregressive, models that predict over action spaces one dimension at a time. Here, we use discrete distributions over each dimension (achieved by discretizing each continuous dimension into bins) and apply them to off-policy learning. We explore one instantiation of such a model in the body of this work and discuss three additional variants in Appendix C, Appendix D and Appendix E.

2.1 Preliminaries

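We briefly describe the notation we use in this paper. Let $s_t \in \mathbb{R}^L$ be the observed state of the agent, $a \in \mathbb{R}^N$ be an action in the $N$ dimensional action space, and $\mathcal{E}$ be the stochastic environment in which the agent operates. Finally, let $a^{i:j} = [a^i, \dots, a^j]^T$ be the vector obtained by taking the sub-range of a vector $a = [a^1, \dots, a^N]^T$.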

At each step $t$, the agent takes an action $a_t$, receives a reward $r_t$ from the environment and transitions stochastically to a new state $s_{t+1}$ according to (possibly unknown) dynamics $p(s_{t+1} \mid s_t, a_t)$. An episode consists of a sequence of such steps $(s_t, a_t, r_t, s_{t+1})$, with $t = 1, \dots, H$ where $H$ is the last time step. An episode terminates when a stopping criterion is true (for example when a game is lost, or when the number of steps is greater than some threshold length).

Let $R_t = \sum_{i=t}^{H} \gamma^{i-t} r_i$ be the discounted reward received by the agent starting at step $t$ of an episode, where $\gamma$ is the discount factor. As with standard reinforcement learning, the goal of our agent is to learn a policy $\pi(s_t)$ that maximizes the expected total reward it would receive from the environment by following this policy.

Because this paper is focused on off-policy learning with Q-Learning watkins1992q , we will provide a brief description of the algorithm.

2.1.1 Q-Learning

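Q-learning is an off-policy algorithm that learns an action-value function $Q(s, a)$ and a corresponding greedy policy, $\pi^Q(s) = \arg\max_a Q(s, a)$. The model is trained by finding the fixed point of the transition operator, i.e.

$$Q(s_t, a_t) = \mathbb{E}_{s_{t+1}}\left[ r_t + \gamma\, Q(s_{t+1}, \pi^Q(s_{t+1})) \right] \quad \forall (s_t, a_t) \qquad (1)$$

This is done by minimizing the Bellman error over the exploration distribution $\rho^\beta(s)$,

$$L = \mathbb{E}_{s_t \sim \rho^\beta,\, s_{t+1}}\left[ \left( Q(s_t, a_t) - \left( r_t + \gamma\, Q(s_{t+1}, \pi^Q(s_{t+1})) \right) \right)^2 \right] \qquad (2)$$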

Traditionally, $Q$ is represented as a table of state-action pairs or with linear function approximators or shallow neural networks watkins1992q ; tesauro1995temporal . Recently, there has been an effort to apply these techniques to more complex domains using non-linear function approximators that are deep neural networks mnih2013playing ; mnih2015human . In these models, a Deep Q-Network (DQN) with parameters $\theta$ is used to predict Q-values, i.e. $Q(s, a) = f(s, a; \theta)$. The DQN parameters $\theta$ are trained by performing gradient descent on the error in equation 2, without taking a gradient through the Q-values of the successor states (although, see Baird95residualalgorithms for an approach that takes this into account).

Since the greedy policy, $\pi^Q(s)$, uses the action with the maximum Q-value, it is essential that any parametric form of $Q$ allow a maximum to be found easily with respect to actions. For a DQN where the output layer predicts the Q-values for each of the discrete outputs, it is easy to find this max – it is simply the action corresponding to the index of the output with the highest estimated Q-value. In continuous action problems, it can be tricky to formulate a parametric form of the Q-value where it is easy to find such a maximum. Techniques like Normalized Advantage Function (NAF) naf constrain the function approximator such that it is easy to compute a max analytically, at the cost of reduced flexibility. Another approach has been followed by the Deep Deterministic Policy Gradients algorithm, which uses an actor-critic algorithm in an off-policy setting to train a deterministic policy lillicrap2015continuous . The model trains a critic to estimate the $Q$-function and a separate policy to try to approximate a max over the critic.
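For the discrete case described above, the greedy action and the one-step Q-learning target can be sketched in a few lines. This is a minimal illustration with hypothetical function names and plain Python lists standing in for network outputs; the target is treated as a constant, mirroring the remark that no gradient is taken through successor-state Q-values.

```python
from typing import Sequence


def greedy_action(q_values: Sequence[float]) -> int:
    """Index of the output with the highest estimated Q-value."""
    return max(range(len(q_values)), key=lambda i: q_values[i])


def td_target(reward: float, gamma: float,
              next_q_values: Sequence[float], terminal: bool) -> float:
    """One-step target: r + gamma * max_a' Q(s', a'), or just r at episode end."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


def bellman_error(q_sa: float, target: float) -> float:
    """Squared error between the current estimate and the (fixed) target."""
    return (q_sa - target) ** 2
```

With a continuous action there is no finite output vector to scan, which is why the max in `td_target` becomes the difficult step.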

In this work, we develop a similar technique where we modify the form of our Q-value function while still retaining the ability to find local maxima over actions for use in a greedy policy.

2.2 Sequential DQN

We propose a Sequential DQN (SDQN) model that structures a $Q$ function in the form of a sequential model over action dimensions. To introduce this idea, we first show a transformation done on the environment which splits an $N$-D action into a sequence of $N$ 1D actions. We then shift this modification from an environment transformation to the construction of a $Q$ function, resulting in our final technique.

2.2.1 Transformed Environment

Consider an environment with an $N$ dimensional action space. We can transform this environment into a similar environment by inserting $N - 1$ fictitious states. Each new fictitious state has a 1D action space. Each new action choice is conditioned on the previous state as well as on all previously selected actions. When the $N$th action is selected, the environment dynamics take effect, using the $N$ selected actions to perform one step. The process is depicted for $N = 3$ in Figure 1.

Figure 1: Demonstration of a transformation on an environment with a three dimensional action space. Fictitious states are introduced to keep the action dimension at each transition one dimensional. Each circle represents a state in the MDP. The transformed environment's replicated states are now augmented with the previously selected action. When all three action dimensions are chosen, the underlying environment progresses to $s_{t+1}$.

This transformation reduces the $N$-D actions to a series of 1D actions. We can now discretize the 1D output space and directly apply $Q$-learning. Note that we could apply this strategy to continuous values, without discretization, by choosing a conditional distribution, such as a mixture of 1-D Gaussians, over which a maximum can easily be found. As such, this approach is as applicable to purely continuous domains as it is to discrete approximations.

The downside to this transformation is that it increases the number of steps needed to solve an MDP. In practice, this transformation makes learning a $Q$ function considerably harder. The extra steps of dynamic programming coupled with learned function approximators cause large overestimation and stability issues.
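The transformation described in this section can be sketched as a thin wrapper around an environment. Everything here is an assumption made for illustration: the `reset()`/`step(action)` interface, the `ToyEnv` class, and the choice to return zero reward on fictitious steps are hypothetical and not taken from the paper.

```python
class ToyEnv:
    """Hypothetical one-step environment: reward is minus the squared action norm."""

    def reset(self):
        return 0.0

    def step(self, action):
        reward = -sum(a * a for a in action)
        return 1.0, reward, True  # (next_state, reward, done)


class SequentialActionEnv:
    """Exposes an N-dimensional action as N consecutive 1D action choices."""

    def __init__(self, env, num_dims):
        self.env = env
        self.num_dims = num_dims
        self._state = None
        self._partial = []

    def reset(self):
        self._state = self.env.reset()
        self._partial = []
        return (self._state, ())

    def step(self, scalar_action):
        self._partial.append(scalar_action)
        if len(self._partial) < self.num_dims:
            # Fictitious state: same underlying state, augmented with the
            # actions chosen so far. Zero reward here is an assumption.
            return (self._state, tuple(self._partial)), 0.0, False
        # All dimensions chosen: advance the underlying environment once.
        self._state, reward, done = self.env.step(list(self._partial))
        self._partial = []
        return (self._state, ()), reward, done
```

Each underlying step now costs as many decisions as there are action dimensions, which is the lengthening of the MDP noted above.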

2.2.2 Model

In this section, we shift the previous section's ideas into a model that acts on the original, untransformed environment. We learn a deterministic policy $\pi(s)$ defined as:

$$\pi(s) = \left[ \pi^1(s),\; \pi^2(s, a^1),\; \dots,\; \pi^N(s, a^{1:N-1}) \right] \qquad (3)$$

Here $N$ is the number of action dimensions, and $\pi^i$ is the deterministic policy for the $i$th dimension action choice, i.e. $a^i = \pi^i(s, a^{1:i-1})$. A greedy policy from $Q$-learning is used to compute these policies:

$$\pi^i(s, a^{1:i-1}) = \arg\max_{a^i \in A} Q^i(s, a^{1:i}) \qquad (4)$$

where $a^{1:i-1}$ are the actions already selected for the preceding dimensions and $A$ is the set of possible values for $a^i$. Here $Q^i$ is a "partial Q-value function" that we estimate using $Q$-learning and is defined recursively because $a^{1:i-1}$ is an input to $Q^i$. These "partial Q-value functions" correspond to the Q-values occurring on the fictitious states introduced in the previous section.

We parameterize $Q^i$ as a neural network that receives $(s, a^{1:i-1})$ as input. The neural network outputs a $B$ dimensional vector, where $B$ is the number of bins into which we discretize each action dimension. This parameterization allows for arbitrarily fine resolution at the cost of increasing the number of bins, $B$. We found $B = 32$ to be a good tradeoff between performance and tractability.
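To make the discretization concrete, the sketch below maps between bin indices and continuous action values. It assumes uniformly spaced bins whose centers represent the bin, over an action range of [-1, 1]; the source does not specify the binning scheme, so both the spacing and the range are assumptions, and the function names are hypothetical.

```python
def bin_to_action(index: int, num_bins: int,
                  low: float = -1.0, high: float = 1.0) -> float:
    """Continuous value at the center of the given bin (uniform bins assumed)."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width


def action_to_bin(value: float, num_bins: int,
                  low: float = -1.0, high: float = 1.0) -> int:
    """Bin index containing the value; out-of-range values are clipped first."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * num_bins)
    return min(index, num_bins - 1)  # the upper edge belongs to the last bin
```

Under these assumptions the resolution per dimension is the bin width, so doubling the number of bins halves the worst-case rounding error of a selected action.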

This model is similar to doing $Q$-learning on the transformed environment described in the previous section. As before, the entire chain of $Q^i$'s can then be trained with $Q$-learning. In practice, however, we found that the $Q$ values were significantly overestimated, due to the increased time dependencies and the max operator in $Q$-learning. When training, there is an approximation error in value predictions; combined with the max operator in $Q$-learning, this error causes overestimation. To overcome this issue we introduce a second function, $Q^D$, inspired by double DQN hasselt2016deep , which we use to produce a final estimate of the $Q$-value of all action dimensions. Note that in the traditional double DQN work, a past version of the $Q$ function is used. In our setting, we train an entirely separate network, $Q^D$, on the original, untransformed environment. By doing this, learning becomes easier as there are fewer long term dependencies.

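There are several components we must train for our model. First, the full-action network $Q^D$ minimizes the Bellman error:

$$l_{TD} = \mathbb{E}_{(s, a, r, s') \sim R}\left[ \left( r + \gamma\, Q^D(s', \pi(s')) - Q^D(s, a) \right)^2 \right] \qquad (5)$$

Next, the last partial function $Q^N$ is trained to match the predicted value from $Q^D$:

$$l_{D} = \mathbb{E}_{(s, a) \sim R}\left[ \left( Q^D(s, a) - Q^N(s, a^{1:N}) \right)^2 \right] \qquad (6)$$

Finally, each $Q^i$ learns to satisfy the Bellman equation, ensuring that $Q^i(s, a^{1:i}) = \max_{a^{i+1} \in A} Q^{i+1}(s, a^{1:i+1})$. We do this by creating an internal consistency loss for each dimension $i$:

$$l^i_{inner} = \mathbb{E}_{(s, a) \sim R}\left[ \left( Q^i(s, a^{1:i}) - \max_{a^{i+1} \in A} Q^{i+1}(s, a^{1:i+1}) \right)^2 \right] \qquad (7)$$

We sample transitions from a replay buffer, $R$, to minimize all three losses using SGD mnih2013playing . As in mnih2013playing , at each training step, $R$ is filled with one transition sampled using the current policy, $\pi$.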
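The three training signals described above can be sketched for a single transition as follows. This is one reading of the text, not the authors' implementation: `q_full` stands for the separate full-action network, `q_partial(i, state, prefix)` is a hypothetical callable returning one value per bin for dimension `i` given the bins already chosen, actions are tuples of bin indices, and targets are treated as constants.

```python
def sdqn_losses(q_full, q_partial, policy, transition, gamma):
    """Return (td_loss, match_loss, inner_loss) for one sampled transition."""
    state, action, reward, next_state, done = transition
    num_dims = len(action)

    # 1) Bellman error for the full-action network on the original environment.
    if done:
        target = reward
    else:
        target = reward + gamma * q_full(next_state, policy(next_state))
    td_loss = (q_full(state, action) - target) ** 2

    # Value each partial function assigns to the bin that was actually taken.
    taken = [q_partial(i, state, action[:i])[action[i]] for i in range(num_dims)]

    # 2) The last partial function should match the full-action value.
    match_loss = (taken[-1] - q_full(state, action)) ** 2

    # 3) Internal consistency: each partial value should equal the best
    #    value achievable in the next dimension, given the same prefix.
    inner_loss = 0.0
    for i in range(num_dims - 1):
        best_next = max(q_partial(i + 1, state, action[: i + 1]))
        inner_loss += (taken[i] - best_next) ** 2

    return td_loss, match_loss, inner_loss
```

Only the first term involves the environment's time step; the other two propagate values backwards across action dimensions within a single state.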

Figure 2: Pictorial view of the SDQN policy, $\pi$, with 32 discretization bins. A state, $s$, is fed into the network, gets embedded, and a distribution of $Q^1$ values for the first action dimension, $a^1$, is predicted. The max bin is selected and converted to a continuous action. This action is then fed in with the state to predict the distribution of the second action dimension, $a^2$. See Figure A.1 for a pictorial view of the training procedure.

The policy action selection, $\pi$, is illustrated in Figure 2, while training is depicted in Figure A.1.
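The selection procedure illustrated in Figure 2 amounts to a short loop over action dimensions. The sketch below assumes a hypothetical callable `q_partial(i, state, prefix)` that returns one value per bin for dimension `i` given the bins chosen so far; it returns bin indices and leaves the conversion to continuous values to the caller.

```python
def sequential_greedy(q_partial, state, num_dims):
    """Pick the max bin for each dimension in turn, conditioning on earlier picks."""
    prefix = []
    for i in range(num_dims):
        values = q_partial(i, state, tuple(prefix))
        best_bin = max(range(len(values)), key=lambda b: values[b])
        prefix.append(best_bin)
    return tuple(prefix)
```

The loop evaluates one vector of bin values per dimension instead of scoring every joint action, so the result is the exact joint maximum only if the per-dimension values are consistent with one another.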

Another motivation for structuring our losses in this way is to decouple the model that learns values from the model that chooses max actions, i.e. the policy. In this way our approach is similar to DDPG. A value is learned in one network, $Q^D$, and a policy to approximate the max is learned in another, the chain of $Q^i$'s.

2.3 Exploration

Numerous exploration strategies have been explored in reinforcement learning osband2016deep ; blundell2015weight ; houthooft2016vime . When computing actions from our sequential policies, we can additionally inject noise at each choice and then select the remaining actions based on this choice. We found this yields better performance as compared to epsilon greedy.
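One way to read the per-choice noise described above is sketched below: at each dimension a random bin is taken with some probability, and the remaining dimensions are then chosen conditioned on that pick. The uniform replacement noise, the `epsilon` parameter, and the `q_partial(i, state, prefix)` callable are assumptions made for illustration; the source does not state the exact noise model.

```python
import random


def noisy_sequential_action(q_partial, state, num_dims, epsilon, rng):
    """Sequential action selection with independent noise at each dimension."""
    prefix = []
    for i in range(num_dims):
        values = q_partial(i, state, tuple(prefix))
        if rng.random() < epsilon:
            choice = rng.randrange(len(values))  # exploratory pick for this dimension
        else:
            choice = max(range(len(values)), key=lambda b: values[b])
        # Later dimensions see this choice, whether greedy or exploratory.
        prefix.append(choice)
    return tuple(prefix)
```

Unlike whole-action epsilon greedy, a perturbed early dimension is followed by later choices that respond to it, for example `noisy_sequential_action(q_partial, state, 3, 0.1, random.Random(0))`.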

2.4 Neural Network Parameterization

We implemented each learned function in the above model as a deep neural network. We explored two parameterizations of the sequential components, the $Q^i$s. First, we looked at a recurrent LSTM model hochreiter1997long . This model has shared weights and passes information via hidden activations from one action to another. The input at each time step is a function of the current state, $s$, and action, $a^i$. Second, we looked at a version with no shared weights, passing all information into each sequential prediction model. These models are feed forward neural networks that take as input a concatenation of all previous action selections as well as the state. In more complex domains, such as vision based control tasks, it would make sense to untie only a subset of the weights. In practice, we found that the untied version performed better. Optimizing these model families is still ongoing work. For full details of model architectures and training procedures, see Appendix F.

3 Related Work

Our work was motivated by two distinct desires – to learn policies over exponentially large discrete action spaces, and to approximate value functions over high dimensional continuous action spaces effectively. In our paper we used a sequential parameterization of policies that helps us achieve this without making an assumption about the actual functional form of the model. Other prior work attempts to handle high dimensional action spaces by assuming specific decompositions. For example, sallans2004reinforcement were able to scale up learning to extremely large action sizes by factoring the action value function and using a product of experts to learn policies. An alternative strategy was proposed in dulac2015deep using action embeddings and applying k-nearest neighbors to reduce scaling of action sizes. By laying out actions on a hypercube, pazis2011generalized are able to perform a binary search over actions, resulting in a logarithmic search for the optimal action. Their method is similar to SDQN, as both construct a $Q$-value from sub $Q$-values. Their approach presupposes these constraints, however, and optimizes the Bellman equation by optimizing hyperplanes independently, thus enabling optimization via linear programming. Our approach is iterative and refines the action selection, in contrast to their independent sub-plane maximization.

Along with the development of discrete space algorithms, researchers have developed specialized solutions to learn over continuous state and action environments, including lever2014deterministic ; lillicrap2015continuous ; naf . More recently, novel deep RL approaches have been developed for continuous state and action problems. TRPO schulman2015trust and A3C mnih2016asynchronous use a stochastic policy parameterized by diagonal covariance Gaussian distributions. NAF naf relies on a quadratic advantage function, enabling closed form optimization of the optimal action. Other methods structure the network such that it is convex in the actions while being non-convex with respect to states amos2016input or use a linear policy rajeswaran2017towards .

In the context of reinforcement learning, sequential or autoregressive policies have previously been used to describe exponentially large action spaces such as the space of neural architectures zoph2016neural and sequences of words norouzi2016reward ; shen2015minimum . These approaches rely on policy gradient methods whereas we explore off-policy methods. Hierarchical/options based methods, including dayan1993feudal , which performs spatial abstraction, and sutton1999option , which performs temporal abstraction, pose another way to factor action spaces. These methods refine their action selection over time, whereas our approach operates on the same timescale and factors the action space.

A vast literature on constructing sequential models to solve tasks exists outside of RL. These models are a natural fit when the data is generated in a sequential process such as in language modeling bengio2003neural . One of the first and most effective deep learned sequence-to-sequence models for language modeling was proposed in sutskever2014sequence , which used an encoder-decoder architecture. In other domains, techniques such as NADE larochelle2011neural have been developed to compute tractable likelihood. Techniques like Pixel RNN oord2016pixel have been used to great success in the image domain where there is no clear generation sequence. Hierarchical softmax morin2005hierarchical performs a hierarchical decomposition based on WordNet semantic information.

The second motivation of our work was to enable learning over more flexible, possibly multimodal policy landscapes. Existing methods use stochastic neural networks florensa2017 or construct energy models haarnoja2017reinforcement sampled with Stein variational gradient descent liu2016stein ; wang2016learning . In our work, instead of sampling, we construct a secondary network to evaluate a max.

4 Experiments

4.1 Multimodal Example Environment

To evaluate the effectiveness of our algorithm, we consider a deterministic environment with a single time step and a 2D action space. This can be thought of as a two-armed bandit problem with deterministic rewards, or as a search problem in 2D action space. We chose our reward function to be a multimodal distribution as shown in the first column in Figure 3. A large suboptimal mode and a smaller optimal mode exist.

As with bandit problems, this formulation helps us isolate the ability of our method to find an optimal policy, without the confounding effect that arises from backing up rewards via the Bellman operator for sequential problems. We look at the behavior of SDQN as well as that of DDPG and NAF on this task. As in traditional RL, we do exploration while learning. We consider uniform sampling ($\epsilon$-greedy with $\epsilon = 1$) as well as sampling data from a normal distribution centered at the current policy – we refer to this as "local." A visualization of the final $Q$ surfaces as well as training curves can be found in Figure 3.

Figure 3: Left: Final reward/$Q$ surface for each algorithm tested. The final policy is marked in red. The SDQN model is capable of performing global search and thus finds the global maximum. The top row contains data collected uniformly over the action space. SDQN and DDPG use this to accurately reconstruct the target $Q$ surface. Algorithms like NAF, however, fail to even converge to a local maximum. In the bottom row, actions are sampled from a normal distribution centered on the policy. This results in more sample efficiency but yields poor approximations of the $Q$ surface outside of where the policy is. Right: Smoothed reward achieved over time. DDPG quickly converges to a local maximum. SDQN has high variance performance initially as it searches the space, but then quickly converges to the global maximum as the $Q$ surface estimate becomes more accurate. NAF, when sampling uniformly, fails to converge to a global maximum. The location of the max is actually lower than a global maximum.

First, we consider the performance of DDPG. DDPG uses local optimization to learn a policy on a constantly changing estimate of $Q$ values predicted by a critic. The form of the $Q$ distribution is flexible and as such there are no closed form properties we can make use of for learning a policy. As such, we resort to gradient descent, a local optimization algorithm. Due to its local nature, it is possible for this algorithm to get stuck in sub-optimal policies that are local maxima of the current critic. We hypothesize that these local maxima in policy space exist in more realistic simulated environments as well. Traditionally, deep learning methods use local optimizers and avoid local minima or maxima by working in a high dimensional parameter space choromanska2015loss . In RL, however, the action space of a policy is relatively low dimensional, so it is much more likely that such local maxima exist. For example, in the hopper environment, a common failure mode we experienced when training algorithms like DDPG is to learn to balance instead of moving forward and hopping.
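The local-versus-global distinction above can be illustrated on a toy surface. The reward function below is hypothetical (two Gaussian bumps, one wide and suboptimal, one narrow and optimal) and is not the paper's reward; plain gradient ascent stands in for a local policy update, and an exhaustive scan over a 32 by 32 grid of bin centers stands in for a max over discretized actions. Neither is an implementation of DDPG or SDQN.

```python
import math


def reward(x: float, y: float) -> float:
    """Hypothetical multimodal surface: wide low mode plus narrow high mode."""
    wide = 0.7 * math.exp(-((x + 0.4) ** 2 + (y + 0.4) ** 2) / 0.18)
    narrow = 1.0 * math.exp(-((x - 0.5) ** 2 + (y - 0.5) ** 2) / 0.02)
    return wide + narrow


def gradient_ascent(x, y, steps=500, lr=0.05, h=1e-5):
    """Local search using central finite-difference gradients."""
    for _ in range(steps):
        gx = (reward(x + h, y) - reward(x - h, y)) / (2 * h)
        gy = (reward(x, y + h) - reward(x, y - h)) / (2 * h)
        x, y = x + lr * gx, y + lr * gy
    return x, y


def grid_search(num_bins=32):
    """Global search over the centers of a uniform grid on [-1, 1]^2."""
    centers = [-1.0 + (i + 0.5) * 2.0 / num_bins for i in range(num_bins)]
    return max(((x, y) for x in centers for y in centers),
               key=lambda p: reward(*p))
```

Started from the origin, the local search climbs the wide mode and stays there, while the grid scan lands next to the narrow, higher mode, up to the resolution of the bins.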

Next, we consider how NAF behaves on this environment. NAF makes an assumption that the $Q$ function is quadratic in action space. During training, NAF fits this quadratic surface to minimize the expected loss evaluated with transitions from a replay buffer. Given the restricted functional form of the model, it is no longer possible to model the entire space without error. As such, the distribution of sampled points used for learning (i.e. the behavior policy) matters greatly. When the behavior policy is a uniform distribution over the action space, the quadratic approximation yields a surface where the maximum is in a low reward region with no path to improve. Interestingly though, this is no longer the case when the behavior policy is a stochastic Gaussian policy whose mean is the greedy policy w.r.t. the previous estimate of the $Q$ values. In this setting, NAF only models the quadratic surface around where samples are taken, i.e. locally around the maxima of the estimated $Q$ values. This non-stationary optimization results in behavior quite similar to DDPG in that it performs a local optimization. This experiment suggests that in NAF there is a balance between fitting a global quadratic model and exploiting local structure when fitting a $Q$ function.

Finally, we consider an SDQN. As expected, this model is capable of completely representing the $Q$ surface (under the limits of discretization) and does not suffer from the inflexibility of the previous methods. The optimization of the policy is not done locally – both the uniform behavior policy and the stochastic Gaussian behavior policy converge to the optimal solution. Much like DDPG, the loss surface learned can be stationary – it does not need to shift over time to learn the optimal solution. Unlike DDPG, however, the policy will not get stuck in a local maximum. With the uniform behavior policy setting, the model slowly reaches the right solution, as there are many wasted samples. With a behavior policy that is closer to being on-policy (such as the stochastic Gaussian greedy policy referred to above), this slowness is reduced. Much of the error comes from selecting overestimated actions. When sampling more on-policy, the overestimated data points get sampled more frequently, and the model converges to the optimal solution. Unlike other global algorithms like NAF, performance will not decrease if we increase sampling noise to cover more of the action space. This allows us to balance exploration and learning of the Q function more easily. [Footnote 1: This assumes that the models have enough capacity. In a limited capacity setting, one would still want to explore locally. Much like NAF, SDQN models will shift capacity to modeling the spaces that are sampled, thus making better use of the capacity.]

4.2 Mujoco environments

To evaluate the relative performance of these models we perform a series of experiments on common continuous control tasks. We test the hopper, swimmer, half cheetah, walker2d and humanoid environments from the OpenAI gym suite gym . [Footnote 2: For technical reasons, our simulations use a different numerical simulation strategy provided by Mujoco mujoco . In practice though, we found the differences in final reward to be within the expected variability of rerunning an algorithm with a different random seed.]

We performed a wide hyper parameter search over various parameters in our models (described in Appendix F), and selected the best performing runs. We then ran 10 random seeds of the same hyper parameters to evaluate consistency and to get a more realistic estimate of performance. We believe this replication is necessary as many of these algorithms are sensitive not only to hyper parameters but also to random seeds. This is not a quality we would like in our RL algorithms, and by doing the 10x replications we are able to detect this phenomenon.

Figure 4: Learning curves of highest performing hyper parameters trained on Mujoco tasks. We show a smoothed median (solid line) with the 25th to 75th percentile range (transparent line) from the 10 random seeds run. SDQN quickly achieves good performance on these tasks.

First, we look at learning curves of some of the environments tested in Figure 4. Our method achieves good policies much faster than DDPG.

Next, for a more quantitative analysis, we use the best reward achieved while training, averaged over a 25,000 step window with evaluations sampled every 5,000 steps. Again we average over 10 different random seeds. This metric gives a much better sense of stability than the traditionally reported instantaneous max reward achieved during training.
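The metric described above can be sketched as a rolling mean over consecutive evaluations followed by a max, then an average over seeds. The window of 5 evaluations is an assumption derived from 25,000 / 5,000; whether the original window spans 5 or 6 evaluation points is not stated, and the function names are hypothetical.

```python
from typing import Sequence


def best_windowed_reward(eval_rewards: Sequence[float], window: int = 5) -> float:
    """Max over training of the mean reward across `window` consecutive evaluations."""
    if len(eval_rewards) < window:
        raise ValueError("need at least `window` evaluations")
    means = [
        sum(eval_rewards[i:i + window]) / window
        for i in range(len(eval_rewards) - window + 1)
    ]
    return max(means)


def average_over_seeds(runs: Sequence[Sequence[float]], window: int = 5) -> float:
    """Mean of the per-seed windowed best reward."""
    return sum(best_windowed_reward(r, window) for r in runs) / len(runs)
```

A single lucky evaluation of 100 surrounded by zeros scores 20 under this metric rather than 100, which is why it reflects stability better than an instantaneous max.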

We compare our algorithm to the current state-of-the-art in off-policy continuous control: DDPG. Through careful model selection and extensive hyper parameter tuning, we train models with performance better than previously published for DDPG on some of these tasks. Despite this search, however, we believe that there is still space for significant performance gain for all the models given different neural network architectures and hyper parameters. Results can be seen in Figure 5. Our algorithm achieves better performance on four of the five environments we tested.

agent   hopper    swimmer   half cheetah   humanoid   walker2d
SDQN    3342.62   179.23    7774.77        3096.71    3227.73
DDPG    3296.49   133.52    6614.26        3055.98    3640.93

Figure 5: Maximum reward achieved over training averaged over a 25,000 step window with evaluations every 5,000 steps. Results are averaged over 10 randomly initialized trials with fixed hyper parameters. SDQN models perform competitively as compared to DDPG.

5 Discussion

Conceptually, our approach centers on the idea that action selection at each stage can be factored and sequentially selected using an autoregressive formulation. In this work we use 1D action spaces that are discretized. Existing work in the image modeling domain suggests that using a mixture of logistic units salimans2017pixelcnn++ greatly speeds up training and would also satisfy our need for a closed form max. Additionally, this work imposes a prespecified ordering of actions which may negatively impact training for certain classes of problems. To address this, we could learn to factor the action space into the sequential order for continuous action spaces or learn to group action sets for discrete action spaces. Another promising direction is to combine this approximate max action with a gradient based optimization procedure. This would relieve some of the complexity of the modeling task of the maxing network, at the cost of increased compute when sampling from the policy. Finally, the work presented here is exclusively on off-policy methods. An autoregressive policy with discretized actions could also be used as the policy for any stochastic policy optimization algorithm such as TRPO schulman2015trust or A3C mnih2016asynchronous .

6 Conclusion

In this work we present a continuous control algorithm that utilizes discretized action spaces and sequential models. The technique we propose is an off-policy RL algorithm that utilizes sequential prediction and discretization. We decompose our model into a $Q$ function and an auxiliary network that acts as a policy and is responsible for computing an approximate max over actions. The effectiveness of our method is demonstrated on illustrative and benchmark tasks, as well as on more complex continuous control tasks. Two additional formulations of discretized sequential prediction models are presented in Appendix C and Appendix D.

Acknowledgements

We would like to thank Nicolas Heess for his insight on exploration and assistance in scaling up task complexity, and Oscar Ramirez for his assistance running some experiments. We would like to thank Eric Jang, Sergey Levine, Mohammad Norouzi, Leslie Phillips, Chase Roberts, and Vincent Vanhoucke for their comments and feedback. Finally we would like to thank the entire Brain Team for their support.

References

Appendix A Model Diagrams

Figure A.1: Pictorial view for the SDQN network showing training. See Figure 2 for model in evaluation mode.

Appendix B Model Visualization

To gain insight into the characteristics of the $Q$ function that our SDQN algorithm learns, we visualized results from the hopper environment, because these are easier to comprehend.

First we compute each action dimension's $Q$ distribution, $Q^i(s, a^{1:i})$, and compare those distributions to that of the double DQN network for the full action taken, $Q^D(s, a)$. A figure containing these distributions and the corresponding state visualization can be found in Figure B.2.

For most states in the hopper walk cycle, the distribution is very flat. This implies that small changes in the action taken in a state will have little impact on future reward. This makes sense as the system can recover from any action taken in this frame. However, this is not true for all states – certain critical states exist, such as when the hopper is pushing off, where not selecting the correct action value greatly degrades performance. This can be seen in frame 466.

Our algorithm is trained with a number of soft constraints. First, if fully converged, we would expect >= as every new sub-action taken should maintain or improve the expected future discounted reward. Additionally, we would expect (from eq. 2.2.2). In the majority of frames these properties seem correct, but there is certainly room for improvement.

Figure B.2: Exploration of the sub-DQN after training. The top row shows the predictions for a given frame (action dimensions correspond to the joint starting at the top and moving toward the bottom; action 3 is the ankle joint). The bottom row shows the corresponding rendering of the current state. For insensitive parts of the gait, such as when the hopper is in the air (e.g. frames 430, 442, 490, 502), the network learns to be agnostic to the choice of actions; this is reflected in the flat Q-value distribution, viewed as a function of action index. On the other hand, for critical parts of the gait, such as when the hopper is in contact with the ground (e.g. frames 446, 478), the network learns that certain actions are much better than others, and the Q-distribution is no longer a flat line. This reflects the fact that taking wrong actions in these regimes could lead to bad results such as tripping, yielding a lower reward.

Next, we attempt to look at surfaces in a more global manner. We plot 2D cross sections for each pair of actions and assume the third dimension is zero. Figure B.3 shows the results.
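The cross-section procedure can be sketched as follows. This is an illustrative sketch only: `q_fn` is a hypothetical callable standing in for a trained network, `toy_q` is a placeholder used to make the example runnable, and the grid resolution and the action range of [-1, 1] are assumptions.

```python
from itertools import combinations


def cross_sections(q_fn, state, num_dims=3, grid=5, low=-1.0, high=1.0):
    """Return {(i, j): 2D list of Q-values}, holding all other dimensions at zero."""
    axis = [low + k * (high - low) / (grid - 1) for k in range(grid)]
    sections = {}
    for i, j in combinations(range(num_dims), 2):
        surface = []
        for a_i in axis:
            row = []
            for a_j in axis:
                action = [0.0] * num_dims  # remaining dimension stays at zero
                action[i], action[j] = a_i, a_j
                row.append(q_fn(state, action))
            surface.append(row)
        sections[(i, j)] = surface
    return sections


def toy_q(state, action):
    """Hypothetical placeholder for a learned Q function."""
    return sum(action)


sections = cross_sections(toy_q, state=[0.0])
print(sorted(sections))
```

For a three-dimensional action space this yields three surfaces, one per pair of action dimensions.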

As seen in the previous visualization, the surfaces of both the autoregressive model and the double DQN are not smooth, which is expected as the environment action space for Hopper is expected to be highly non-linear. Some regions of the surface seem quite noisy, which is not expected. Interestingly though, these regions of noise do not seem to lower the performance of the final policy. In Q-learning, only the maximum value regions have any impact on the policy that is followed. Future work is needed to better characterize this effect. We would like to explore techniques that use "soft" Q-learning [29, schulman2017equivalencee, 14]. These techniques would use more of the surface and might thus smooth the representations.

Additionally, we notice that the dimensions of the autoregressive model are modeled differently. The last action dimension has considerably more noise than the previous two action dimensions. This large difference in the smoothness and shape of the surfaces demonstrates that the order of the action dimensions matters. This figure suggests that the model has a harder time learning sharp features in the dimension. In future work, we would like to explore learned orderings, or bidirectional models, to combat this.

Finally, the form of is extremely noisy and has many cube artifacts. The input of this function is both a one hot quantized action, as well as the floating point representation. It appears the model uses the quantization as its main feature and learns a sharp surface.

Figure B.3: Q surfaces given a fixed state. The top row is the autoregressive model. The bottom row is the double DQN. We observe high noise in both models. Additionally, we see smoother variation in earlier action dimensions, which suggests that conditioning order matters greatly. Q values are computed with a reward scale of 0.1 and a discount factor of 0.995.

Appendix C Add SDQN

In this section, we discuss a different model that also makes use of sequential prediction and quantization. The SDQN model uses partial-DQNs to define sequential greedy policies over all dimensions. In this setting, one can think of it as acting similarly to an environment transformed to predict one dimension at a time. Thus reward signals from the final reward must be propagated back through a chain of partial-DQNs. This results in a more difficult credit assignment problem. The model in this section attempts to address this by changing the structure of the networks. This formulation, called Add SDQN, replaces the series of maxes from the Bellman backup with a summation over learned functions.

Results for this method and the others presented in the appendices can be found in Appendix E.2.

C.1 Method

As before, we aim to learn a deterministic policy of the DQN, where

(8)

Here the Q value is defined as the sum of the sequential components:

(9)

Unlike before, the sequential components no longer represent functions, so we will swap the notation of our compositional function from to .

The parameters of all the models are trained by matching , in equation 9 to as follows:

(10)

We train with Q-learning, as shown in equation 5.

Unlike the sequential max policy of the previous section, here we find the optimal action by beam search to maximize equation 9. In practice, the learning dynamics of the neural network parameterizations we use yield solutions that are amenable to this shallow search and do not require a search of the full exponential space.
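A beam search over a sum of per-dimension terms can be sketched as below. This is a minimal illustration rather than the authors' code: `r_fn` and `toy_r` are hypothetical stand-ins for the learned additive components, and actions are represented by bin indices.

```python
def beam_search_action(r_fn, state, num_dims, num_bins, beam_width):
    """Approximately maximize the sum over dimensions of r_fn(state, prefix, dim)[bin]."""
    beams = [(0.0, [])]  # (running score, chosen bin indices so far)
    for dim in range(num_dims):
        candidates = []
        for score, prefix in beams:
            terms = r_fn(state, prefix, dim)  # one additive term per bin
            for b in range(num_bins):
                candidates.append((score + terms[b], prefix + [b]))
        # Keep only the highest scoring partial actions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


def toy_r(state, prefix, dim):
    """Hypothetical components where the greedy first choice is suboptimal."""
    if dim == 0:
        return [1.0, 0.5]
    return [0.0, 4.0] if prefix[0] == 1 else [0.0, 0.0]


print(beam_search_action(toy_r, None, num_dims=2, num_bins=2, beam_width=1))
print(beam_search_action(toy_r, None, num_dims=2, num_bins=2, beam_width=2))
```

In this toy case a single beam commits to the first bin and reaches a total of 1.0, while two beams recover the better action with a total of 4.5. A beam width of 1 reduces to the greedy sequential choice.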

A figure showing this network’s training procedure can be found in Figure C.4.

Figure C.4: Pictorial view for the Add network showing training. Policy evaluation from this network is done in a procedure similar to that shown in Figure 2.

C.2 Network Parameterization Note

At this point, we have only tested the LSTM variant and not the untied parameterization. Optimizing these model families is ongoing work, and Add SDQN could potentially perform better with the untied version, as used for the originally presented SDQN algorithm.

Appendix D Prob SDQN

In the previous sections we showed the use of our technique within the Q-learning framework. Here we describe its use within the off-policy, actor-critic framework to learn a sequential policy [42, 43, 10]. We name this model Prob SDQN.

Results for this method and the others presented in the appendices can be found in Appendix E.2.

D.1 Method

We define a policy, , recursively as follows where is defined as:

(11)

Unlike in the previous two models, the policy is not some form of argmax of another function but a learned function.

As in previous work, we optimize the expected reward of the policy (as predicted by a Q function) under data collected from some other policy [10, 19].

(12)

where is an estimate of values and is trained to minimize equation 5.

We use policy gradients / REINFORCE to compute gradients of equation 12 [49]. Because the policy is trained off-policy, we include an importance sampling term to correct for the mismatch between the state distribution of the learned policy and the state distribution of the behavior policy. To reduce variance, we employ a Monte Carlo baseline, which is the mean reward from samples from the policy [23].

(13)
(14)

In practice, using importance sampling ratios like this has been known to introduce high variance in gradients [28]. In this work, we assume that the two state distributions are very similar; the smaller the replay buffer is, the better this assumption holds. The importance sampling term can then be removed, so we are no longer required to compute it. This assumption introduces bias, but drastically lowers the variance of this gradient in practice.
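The estimator described above, with the importance sampling term dropped, can be sketched for a single discretized action dimension. This is an illustration under stated assumptions, not the authors' implementation: the policy is a softmax over bin logits, `q_values` is a hypothetical table of critic estimates per bin, and all function names are invented for the example.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_grad(logits, q_values, num_baseline_samples, rng):
    """Score-function gradient of E[Q(a)] w.r.t. the logits, with a sampled baseline."""
    probs = softmax(logits)
    bins = range(len(logits))
    action = rng.choices(bins, weights=probs)[0]
    # Monte Carlo baseline: mean critic value over extra samples from the policy.
    samples = rng.choices(bins, weights=probs, k=num_baseline_samples)
    baseline = sum(q_values[b] for b in samples) / num_baseline_samples
    advantage = q_values[action] - baseline
    # For a softmax policy, d log pi(action) / d logits = onehot(action) - probs.
    return [advantage * ((1.0 if b == action else 0.0) - probs[b]) for b in bins]


grad = reinforce_grad([0.0, 1.0, 2.0], [1.0, 2.0, 4.0], 3, random.Random(0))
print(grad)
```

Averaging this estimate over a batch gives an ascent direction for the policy logits. When every bin has the same critic value, the advantage and therefore the gradient are zero.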

During training, we approximate the highest probability path with a beam search and treat our policy as deterministic.

As is the case with off-policy algorithms, training the policy does not require samples from the environment under the target policy, only evaluations of the Q function. This makes it much more attractive for tasks where sampling the environment is challenging, such as robotics.

A figure showing the training procedure can be found in Figure D.5.

Figure D.5: Pictorial view for the Prob network showing training. Policy evaluation from this network is done in a procedure similar to that shown in Figure 2.

D.2 Network Parameterization Note

At this time, we have only tested the LSTM variant and not the untied parameterization. Optimizing these model families is ongoing work, and Prob SDQN could potentially perform better with the untied version, as used for the originally presented SDQN algorithm.

Appendix E Independent DQN

All of the previous methods combine discretization and sequential prediction to enable arbitrarily complex distributions. We wished to separate these two factors, so we constructed a model that performs only discretization and keeps the independence assumption that is commonly used in RL.

Results for this method and the others presented in the appendices can be found in Appendix E.2.

E.1 Method

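We define a Q function as the mean of many independent functions, $Q^i$, one per action dimension:

$$Q(s, a) = \frac{1}{N} \sum_{i=1}^{N} Q^i(s, a^i)$$ (15)

Because each $Q^i$ is independent of all other actions, a tractable max exists and as such we define our policy as an argmax over each independent dimension:

$$\pi(s)^i = \arg\max_{a^i} Q^i(s, a^i)$$ (16)

As in previous models, $Q$ is then trained with Q-learning as in equation 5.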
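The independence assumption makes the max tractable, which a short sketch can show. The names here are hypothetical and the per-dimension values are given as plain lists rather than produced by a network.

```python
def idqn_q_value(per_dim_q, bins):
    """Mean of the per-dimension Q-values selected by the bin indices in `bins`."""
    return sum(q[b] for q, b in zip(per_dim_q, bins)) / len(per_dim_q)


def idqn_greedy_bins(per_dim_q):
    """Independent argmax for each action dimension."""
    return [max(range(len(q)), key=lambda b: q[b]) for q in per_dim_q]


# Two action dimensions with three bins each (illustrative numbers).
per_dim_q = [[0.0, 2.0, 1.0], [4.0, 0.0, 3.0]]
best = idqn_greedy_bins(per_dim_q)
print(best, idqn_q_value(per_dim_q, best))
```

Because the joint value is a mean of terms that each depend on one dimension only, maximizing every term separately also maximizes the joint value, so no search over the exponential joint space is needed.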

E.2 Results

Results for the additional techniques can be seen in Table 1. SDQN and DDPG are copied from the previous section for ease of reference.

The Add SDQN method performs about 700 reward better than SDQN on our hardest task, humanoid, but performs worse on most of the simpler environments. The IDQN method, somewhat surprisingly, is able to learn reasonable policies in spite of its limited functional form. In the case of swimmer, the independent model performs slightly better than SDQN, but worse than the other versions of our models. Prob SDQN performs the best on the swimmer task by a large margin, but underperforms dramatically on half cheetah and humanoid. Exploring the trade-offs of model design with respect to environments is of interest to us for future work.

agent        hopper    swimmer   half cheetah   humanoid   walker2d
SDQN         3342.62   179.23    7774.77        3096.71    3227.73
Prob SDQN    3056.35   268.88    650.33         691.11     2342.97
Add SDQN     1624.33   202.04    4051.47        3811.44    1517.17
IDQN         2135.47   189.52    2563.25        1032.60    668.28
DDPG         3296.49   133.52    6614.26        3055.98    3640.93


Table 1: Maximum reward achieved over training, averaged over a 25,000 step window with evaluations every 5,000 steps. Results are averaged over 10 randomly initialized trials with fixed hyper parameters.
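One way to read the metric in the caption is sketched below. It assumes that a 25,000 step window with evaluations every 5,000 steps corresponds to 5 consecutive evaluations; the function name and the example numbers are hypothetical.

```python
def max_windowed_reward(eval_rewards, window=5):
    """Maximum over training of the mean reward across `window` consecutive evaluations."""
    if len(eval_rewards) < window:
        raise ValueError("need at least `window` evaluations")
    means = [
        sum(eval_rewards[i:i + window]) / window
        for i in range(len(eval_rewards) - window + 1)
    ]
    return max(means)


# Illustrative evaluation rewards for one trial, one entry per 5,000 steps.
print(max_windowed_reward([0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 0.0]))
```

Under this reading, the reported number for each agent would be this quantity averaged over the 10 trials.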

Appendix F Training and Model details

F.1 Hyper Parameter Sensitivity

The most sensitive hyper parameters were the learning rates of the two networks, the reward scale, and finally the discount factor. Parameters such as batch size, quantization amounts, and network sizes mattered to a lesser extent. We found it best to have exploration be done on less than 10% of the transitions. We didn’t see any particular value below this that gave better performance. In our experiments, the structure of the model used also had a large impact on performance, for example using tied versus untied weights for each sequential prediction model.

In future work, we would like to reduce the number of hyper parameters needed in these algorithms and study the effects of various hyper parameters more thoroughly.

F.2 SDQN

In this model, we looked at a number of configurations. Of the greatest importance is the form of the model itself. We looked at an LSTM parameterization as well as an untied weight variant. The untied weight variant’s parameter search is listed below.

To compute the per-dimension Q-values, we first take the state and previous actions and apply one fully connected layer of size "embedding size" to each. We then concatenate these representations and feed them into a 2 hidden layer MLP with "hidden size" units. The output of this network is "quantization bins" wide with no activation.

The double DQN uses the same embedding of state and action and then passes it through a 1 hidden layer fully connected network, finally outputting a single scalar.

Hyper Parameter Range Notes
use batchnorm on, off use batchnorm on the networks
replay capacity: 2e4, 2e5, inf
batch size 256, 512
quantization bins 32 We found higher values generally converged to better final solutions.
hidden size 256, 512
embedding size 128
reward scaling 0.05, 0.1
target network moving average 0.99, 0.99, 0.98
adam learning rate for TD updates 1e-3, 1e-4, 1e-5
adam learning rate for maxing network 1e-3, 1e-4, 1e-5
gradient clipping off, 10
l2 regularization off, 1e-1, 1e-2, 1e-3, 1e-4
learning rate decay log linear, none
learning rate decay delta -2 Decay 2 orders of magnitude down from 0 to 1m steps.
td term multiplier 0.2, 0.5,
using target network on double q network on, off
tree consistency multiplier 5 Scalar on the tree loss
energy use penalty 0, 0.05, 0.1, 0.2 Factor multiplied by velocity and subtracted from reward
gamma (discount factor) 0.9, 0.99, 0.999
drag down regularizer 0.0, 0.1 Constant factor to penalize high q values. This is used to control over exploration. It has a very small effect on final performance.
tree target greedy penalty 1.0 A penalty on MSE or Q predictions from greedy net to target. This acts to prevent taking too large steps in function space
exploration type boltzmann or epsilon greedy
boltzmann temperature 1.0, 0.5, 0.1, 0.01, 0.001
prob to sample from boltzmann (vs take max) 1.0, 0.5, 0.2, 0.1, 0.0
boltzmann decay decay both prob to sample and boltzmann temperature to 0.001
epsilon noise 0.5, 0.2, 0.1, 0.05, 0.01
epsilon decay linearly to 0.001 over the first 1M steps
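The exploration settings in the table above can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the linear schedule is taken from the epsilon decay row, and applying the same schedule to the Boltzmann parameters would be an assumption.

```python
import math
import random


def linear_decay(start, end, step, total_steps):
    """Linearly interpolate from `start` to `end` over `total_steps`, then hold."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


def boltzmann_or_max(q_values, temperature, prob_sample, rng):
    """With probability `prob_sample` draw a bin from a Boltzmann distribution, else take the max."""
    bins = range(len(q_values))
    best = max(bins, key=lambda b: q_values[b])
    if rng.random() >= prob_sample:
        return best
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / temperature) for q in q_values]
    return rng.choices(bins, weights=weights)[0]


rng = random.Random(0)
epsilon = linear_decay(0.5, 0.001, step=500_000, total_steps=1_000_000)
print(epsilon, boltzmann_or_max([0.0, 5.0, 1.0], temperature=1.0, prob_sample=0.2, rng=rng))
```

Lower temperatures concentrate the sampling distribution on the highest valued bin, so the temperature and the sampling probability both control how often a non-greedy bin is chosen.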

Best hyper parameters for a subset of environments.

Hyper Parameter Hopper Cheetah
use batchnorm off off
replay capacity: inf inf
batch size 512 512
quantization bins 32 32
hidden size 256 512
embedding size 128 128
reward scaling 0.1 0.1
target network moving average 0.99 0.9
adam learning rate for TD updates 1e-3 1e-3
adam learning rate for maxing network 5e-5 1e-4
gradient clipping off off
l2 regularization 1e-4 1e-4
learning rate decay for q log linear log linear
learning rate decay delta for q 2 orders of magnitude down, interpolated over 1M steps 2 orders of magnitude down, interpolated over 1M steps
learning rate decay for tree none none
learning rate decay delta for tree NA NA
td term multiplier 0.5 0.5
using target network on double q network off on
tree consistency multiplier 5 5
energy use penalty 0 0.0
gamma (discount factor) 0.995 0.99
drag down regularizer 0.1 0.1
tree target greedy penalty 1.0 1.0
exploration type boltzmann boltzmann
boltzmann temperature 1.0 0.1
prob to sample from boltzmann (vs take max) 0.2 1.0
boltzmann decay decay both prob to sample and boltzmann temperature to 0.001 over 1M steps decay both prob to sample and boltzmann temperature to 0.001 over 1M steps
epsilon noise NA NA
epsilon decay NA NA
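The "log linear" learning rate decay listed in the tables above can be read as an interpolation that is linear in the exponent. The sketch below assumes that reading; the function name and default arguments are hypothetical, with the delta of -2 and the 1M step horizon taken from the tables.

```python
def log_linear_lr(base_lr, step, delta=-2.0, total_steps=1_000_000):
    """Decay `base_lr` by `delta` orders of magnitude, linearly in log space, then hold."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return base_lr * 10.0 ** (delta * frac)


for step in (0, 500_000, 1_000_000):
    print(step, log_linear_lr(1e-3, step))
```

With a base rate of 1e-3, this schedule passes through roughly 1e-4 at the halfway point and ends near 1e-5.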

F.3 Add SDQN

is parameterized the same as in F.2. The policy is parameterized by a multi layer LSTM. Each step of the LSTM takes in the previous action value, and outputs some number of "quantization bins." An action is selected, converted back to a one hot representation, and fed into an embedding module. The result of the embedding is then fed into the LSTM.

When predicting the first action, a learned initial state variable is fed into the LSTM as the embedding.

Hyper Parameter Range Notes
replay capacity: 2e4, 2e5, inf
batch size 256, 512
quantization bins 8, 16, 32 We found higher values generally converged to better final solutions.
lstm hidden size 128, 256, 512
number of lstm layers 1, 2
embedding size 128
Adam learning rate for TD updates 1e-3, 1e-4, 1e-5
Adam learning rate for maxing network 1e-3, 1e-4, 1e-5
td term multiplier 1.0, 0.2, 0.5,
target network moving average 0.99, 0.99, 0.999
using target network on double q network on, off
reward scaling 0.01, 0.05, 0.1, 0.12, 0.08
train number beams 1,2,3 number of beams used when evaluating argmax during training.
eval number beams 1,2,3 number of beams used when evaluating the argmax during evaluation.
exploration type boltzmann or epsilon greedy Epsilon noise injected after each action choice
boltzmann temperature 1.0, 0.5, 0.1, 0.01, 0.001
prob to sample from boltzmann (vs take max) 1.0, 0.5, 0.2, 0.1, 0.05
epsilon noise 0.5, 0.2, 0.1, 0.05, 0.01

Best hyper parameters for a subset of environments.

Hyper Parameter Hopper Cheetah
replay capacity: inf inf
batch size 256 256
quantization bins 16 32
lstm hidden size 128 256
number of lstm layers 1 1
embedding size 128
Adam learning rate for TD updates 1e-4 5e-3
Adam learning rate for maxing network 1e-5 5e-5
td term multiplier 0.2 1.0
target network moving average 0.999 0.99
using target network on double q network on on
reward scaling 0.05 0.12
train number beams 2 1
eval number beams 2 2
exploration type boltzmann
boltzmann temperature 0.1 0.1
prob to sample from boltzmann (vs take max) 0.5 0.5
epsilon noise NA NA

F.4 Prob SDQN

is parameterized the same as in F.2. The policy is parameterized by a multi layer LSTM. Each step of the LSTM takes in the previous action value, and outputs some number of "quantization bins". A softmax activation is applied thus normalizing this distribution. An action is selected, converted back to a one hot representation and fed into an embedding module. The result is then fed into the next time step.

When predicting the first action, a learned initial state variable is fed into the LSTM as the embedding.

Hyper Parameter Range Notes
replay capacity: 2e4, 2e5, inf
batch size 256, 512
quantization bins 8, 16, 32 We found higher values generally converged to better final solutions.
hidden size 256
embedding size 128
adam learning rate for TD updates 1e-3, 1e-4
adam learning rate for maxing network 1e-4, 1e-5, 1e-6
td term multiplier 10, 1.0, 0.5, 0.1,
target network moving average 0.995, 0.99, 0.999, 0.98
number of baseline samples 2, 3, 4
train number beams 1,2,3 number of beams used when evaluating argmax during training.
eval number beams 1,2,3 number of beams used when evaluating the argmax during evaluation.
epsilon noise 0.5, 0.2, 0.1, 0.05, 0.01
epsilon noise decay linearly move to 0.001 over 1m steps
reward scaling 0.0005, 0.001, 0.01, 0.015, 0.1, 1
energy use penalty 0, 0.05, 0.1, 0.2 Factor multiplied by velocity and subtracted from reward
entropy regularizer penalty 1.0

Best hyper parameters for a subset of environments.

Hyper Parameter Hopper Cheetah
replay capacity: 2e4 2e4
batch size 512 256
quantization bins 32 32
hidden size 256 256
embedding size 128 128
adam learning rate for TD updates 1e-4 1e-3
adam learning rate for maxing network 1e-5 1e-4
td term multiplier 10 10
target network moving average 0.98 0.99
number of baseline samples 2 4
train number beams 1 1
eval number beams 1 1
epsilon noise 0.1 0.0
epsilon noise decay linearly move to 0.001 over 1m steps NA
reward scaling 0.1 0.5
energy use penalty 0.05 0.0
entropy regularizer penalty 1.0 1.0

F.5 IDQN

We construct a 1 hidden layer MLP for each action dimension. Each MLP has "hidden size" units. We perform Bellman updates using the same strategy as in DQN [25].

We perform our initial hyper parameter search with points sampled from the following grid.

Hyper Parameter Range Notes
replay capacity: inf
batch size 256, 512
quantization bins 8, 16, 32
hidden size 128, 256, 512
gamma (discount factor) 0.95, 0.99, 0.995, 0.999
reward scaling 0.05, 0.1
target network moving average 0.99, 0.99, 0.98
l2 regularization off, 1e-1, 1e-2, 1e-3, 1e-4
noise type epsilon greedy
epsilon amount(percent of time noise) 0.01, 0.05, 0.1, 0.2
adam learning rate 1e-3, 1e-4, 1e-5

Best hyper parameters on a subset of environments.

Hyper Parameter Hopper Cheetah
replay capacity: inf inf
batch size 512 256
quantization bins 16 8
hidden size 512 128
gamma (discount factor) 0.99 0.995
reward scaling 0.1 0.1
target network moving average 0.99 0.99
l2 regularization off 1e-4
noise type epsilon greedy epsilon greedy
epsilon amount(percent of time noise) 0.05 0.01
adam learning rate 1e-4 1e-4

F.6 DDPG

Due to our DDPG implementation, we chose to do a mix of random search and parameter selection on a grid.

Hyper Parameter Range Notes
learning rate : [1e-5, 1e-3] Done on log scale
gamma (discount factor) 0.95, 0.99, 0.995, 0.999
batch size [10, 500]
actor hidden 1 layer units [10, 300]
actor hidden 2 layer units [5, 200]
critic hidden 1 layer units [10, 400]
critic hidden 2 layer units [4, 300]
reward scaling 0.0005, 0.001, 0.01, 0.015, 0.1, 1
target network update rate [10, 500]
target network update fraction [1e-3, 1e-1] Done on log scale
gradient clipping from critic to target [0, 10]
OU noise damping [0, 1]
OU noise std [0, 1]

Best hyper parameters on a subset of environments.

Hyper Parameter Hopper Cheetah
learning rate : 0.00026 0.000086
gamma (discount factor) 0.995 0.995
batch size 451 117
actor hidden 1 layer units 48 11
actor hidden 2 layer units 107 199
critic hidden 1 layer units 349 164
critic hidden 2 layer units 299 256
reward scaling 0.01 0.01
target network update rate 10 445
target network update fraction 0.0103 0.0677
gradient clipping from critic to target 8.49 0.600
OU noise damping 0.0367 0.6045
OU noise std 0.074 0.255
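The two OU noise rows above refer to Ornstein-Uhlenbeck exploration noise. The sketch below shows one common discrete-time form; how "damping" and "std" enter the update in the authors' DDPG implementation is not stated in the text, so this parameterization and the function name are assumptions.

```python
import random


def ou_noise(num_steps, damping, std, rng, x0=0.0):
    """Discrete Ornstein-Uhlenbeck process: pull toward zero, then add Gaussian noise."""
    x, path = x0, []
    for _ in range(num_steps):
        x = x - damping * x + std * rng.gauss(0.0, 1.0)
        path.append(x)
    return path


# Values from the Hopper column, used purely for illustration.
print(ou_noise(5, damping=0.0367, std=0.074, rng=random.Random(0)))
```

In this form a damping near 0 gives slowly varying, strongly correlated noise, while a damping near 1 gives nearly independent noise at each step.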