Q-Learning in enormous action spaces via amortized approximate maximization

01/22/2020 · Tom Van de Wiele, et al. · Google

Applying Q-learning to high-dimensional or continuous action spaces can be difficult due to the required maximization over the set of possible actions. Motivated by techniques from amortized inference, we replace the expensive maximization over all actions with a maximization over a small subset of possible actions sampled from a learned proposal distribution. The resulting approach, which we dub Amortized Q-learning (AQL), is able to handle discrete, continuous, or hybrid action spaces while maintaining the benefits of Q-learning. Our experiments on continuous control tasks with up to 21-dimensional actions show that AQL outperforms D3PG (Barth-Maron et al., 2018) and QT-Opt (Kalashnikov et al., 2018). Experiments on structured discrete action spaces demonstrate that AQL can efficiently learn good policies in spaces with thousands of discrete actions.


1 Introduction

In the recent resurgence of interest in combining reinforcement learning with neural network function approximators, Q-learning (watkins1992q) and its many neural variants (mnih2015human; bellemare2017distributional) have remained competitive with state-of-the-art methods (horgan2018distributed; kapturowski2019recurrent). The simplicity of Q-learning makes it relatively straightforward to implement, even in combination with neural networks (mnih2015human). Because Q-learning is an off-policy reinforcement learning algorithm, it is trivial to combine with techniques like experience replay (lin1993reinforcement) for improved data efficiency, or to implement in a distributed training setting where experience is generated using slightly stale network parameters, without requiring the importance sampling-based off-policy corrections used by stochastic actor-critic methods (espeholt2018impala). Q-learning can also be easily and robustly combined with exotic exploration strategies (ostrovski2017count; mnih2016asynchronous; horgan2018distributed) as well as data augmentation through post-hoc modification (kaelbling1993goals; hindsight2017).

One limitation of Q-learning is the requirement to maximize over the set of possible actions, limiting its applicability in environments with continuous or high-cardinality discrete action spaces. In the case of Q-learning with neural network function approximation, the common approach of computing values for all actions in a single forward pass becomes infeasible, requiring instead an architecture which accepts both the state and the action as inputs and produces a scalar value estimate as output. Other types of methods, such as stochastic actor-critic or policy gradient approaches, can naturally handle discrete, continuous, or even hybrid action spaces (williams1992simple; schulman2015trust; mnih2016asynchronous) because they do not perform a maximization over actions and only require the ability to efficiently sample actions from an appropriately parameterized stochastic policy. Furthermore, the structure of an action space often suggests a stochastic policy parametrization which admits efficient sampling even when the number of distinct actions is enormous. For example, a K-level discretization of a continuous action space with D degrees of freedom (throughout this work, we will refer to this as a D-dimensional action space) will have K^D distinct actions, but a policy can be designed to represent these actions as a product of D discrete distributions, each with arity K, allowing for sampling in O(D) rather than O(K^D) time. The same principle cannot be directly applied to Q-learning, where identification of the maximal value would require an exhaustive enumeration of all actions. While a number of approaches for dealing with the intractable maximization over actions in Q-learning have been proposed in order to make it applicable to richer action spaces, these approaches are usually specific to a particular form of action space (hafner2009nfqca; pazis2009binary; yoshida2015binary).
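The factored-sampling argument above can be made concrete with a short sketch. Assuming an illustrative K = 5, D = 21 (the largest discretized action space considered later in the paper), a product of per-dimension categoricals samples a joint action with one draw per dimension, never enumerating the K^D joint actions:

```python
import numpy as np

rng = np.random.default_rng(0)

K, D = 5, 21  # arity per dimension, number of degrees of freedom
# K**D = 5**21 joint actions -- far too many to enumerate exhaustively.

# A factored policy: one categorical distribution per dimension
# (random illustrative parameters; in practice these come from a network).
logits = rng.standard_normal((D, K))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

def sample_action(probs, rng):
    """Draw one joint action in O(D) time: one categorical draw per dimension."""
    return np.array([rng.choice(K, p=probs[d]) for d in range(D)])

a = sample_action(probs, rng)
```

The same factorization cannot speed up an exact argmax over Q-values, which is the problem AQL addresses.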

Here, we show that instead of performing an exact maximization over the set of actions at each time step, it can be preferable to learn to search for the best action, thus amortizing the action selection cost over the course of training (hafner2009nfqca; lillicrap2016continuous). We treat the search for the best action as another learning problem and replace the exact maximization over all actions with a maximization over a set of actions sampled from a learned proposal distribution. We show that an effective proposal can be learned by training a neural network to predict the best known action found by a stochastic search procedure. This approach, which we dub Amortized Q-learning (AQL), is able to naturally handle high-dimensional discrete, continuous, or even hybrid action spaces (consisting of both discrete and continuous degrees of freedom) because, like stochastic actor-critic methods, AQL only requires the ability to sample actions from an appropriately parameterized proposal distribution.

We evaluate the effectiveness of Amortized Q-learning on both continuous action spaces and large, but structured, discrete action spaces. On continuous control tasks from the DeepMind Control Suite, Amortized Q-learning outperforms D3PG and QT-Opt, two strong continuous control methods. On foraging tasks from the DeepMind Lab 3D first-person environment, Amortized Q-learning is able to learn effective policies using an action space with over 3500 actions. AQL thus bridges the gap between Q-learning and stochastic actor-critic methods in their flexibility towards the action space, removing the need for Q-learning methods tailored to specific action spaces.

2 Background

We consider discrete time reinforcement learning problems with large action spaces. At time step t the agent observes a state s_t and produces an action a_t. The agent then receives a reward r_t and transitions to the next state s_{t+1} according to the environment transition distribution p(s_{t+1} | s_t, a_t). The aim of the agent is to maximize the expected future discounted return E[Σ_t γ^t r_t], where γ ∈ [0, 1) is a discount factor. While in the standard Markov Decision Process formulation the actions a_t come from a finite set of possible actions A, we consider a more general case. Specifically, we assume that the action space is combinatorially structured, i.e. defined as a Cartesian product of sub-action spaces. Formally, we assume A = A_1 × … × A_D, where each A_i is either a finite set or an interval of the real numbers. We therefore consider an action a to be a D-dimensional vector, and each component a_i of a is referred to as a sub-action.

Given a policy π that maps states to distributions over actions, the action-value function for policy π is defined as Q^π(s, a) = E[Σ_t γ^t r_t | s_0 = s, a_0 = a, π]. The optimal action-value function, defined as Q*(s, a) = max_π Q^π(s, a), gives the expected return for taking action a in state s and acting optimally thereafter. If Q* is known, an optimal deterministic policy can be obtained by acting greedily with respect to Q*, i.e. taking argmax_a Q*(s, a). An important property of Q* is that it can be decomposed recursively according to the Bellman equation

Q*(s, a) = E[r + γ max_{a'} Q*(s', a')].    (1)

In value-based reinforcement learning methods the aim is to learn the optimal value function by starting with a parametric estimate Q(s, a; θ) and iteratively improving the parameters θ based on experience sampled from the environment.

The Q-learning algorithm (watkins1992q) is one such value-based method, which uses an iterative update based on the recursive relationship in the Bellman equation to learn Q*. Specifically, Q-learning can be formalized as optimizing the loss

L(θ) = E[(y - Q(s, a; θ))²],    (2)

where y = r + γ max_{a'} Q(s', a'; θ) but is treated as a constant for the purposes of optimization, i.e. no gradients flow through it.
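For a tabular Q-function, one gradient step on the squared error in (2), with the bootstrap target held constant, can be sketched as follows (a minimal illustration; the state/action sizes and learning rate are arbitrary choices, not from the paper):

```python
import numpy as np

gamma = 0.99
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 3))  # toy tabular Q: 4 states, 3 actions

def q_learning_update(Q, s, a, r, s_next, alpha=0.1):
    """One step on the squared TD error (y - Q(s, a))^2, with the
    bootstrap target y treated as a constant (no gradient flows through it)."""
    y = r + gamma * Q[s_next].max()   # max over actions at the next state
    td_error = y - Q[s, a]
    Q[s, a] += alpha * td_error       # gradient descent on the squared error
    return td_error

q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
```

The `Q[s_next].max()` term is exactly the maximization that becomes intractable in the large action spaces this paper targets.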

3 Related Work

Deterministic policy gradient algorithms (hafner2009nfqca; silver2014deterministic; lillicrap2016continuous) learn a deterministic policy that maps states to continuous actions by following the gradient of an action-value function critic with respect to the action. silver2014deterministic justified this approach by proving the deterministic policy gradient theorem, which shows how the gradient of an action-value function with respect to the action at a state can be used to improve the policy at that state. As others have noted (haarnoja2017reinforcement), these methods can be interpreted as Q-learning, with the deterministic policy serving as an approximate maximizer over the action space, partially explaining why such methods work well with off-policy data. Taking this view makes deterministic policy gradient methods similar to our proposed method. However, AQL learns an approximate maximizer using explicit search and supervised learning rather than using the gradient of the critic. This lack of reliance on gradients with respect to actions renders AQL agnostic to the type of the action space, be it discrete, continuous, or a combination of the two, while deterministic policy gradients are inapplicable to discrete action spaces.

gu2016continuous address the intractability of Q-maximization in the continuous setting by restricting the form of Q to be a state-dependent quadratic function of the continuous actions, rendering the maximization trivial. While considerably simpler than DDPG, the practical consequences of this representational restriction in the continuous case are unclear, and the technique is inapplicable to discrete or hybrid action spaces.

metz2017discrete apply Q-learning to multi-dimensional continuous action spaces by discretizing each continuous sub-action. Their Sequential DQN approach avoids maximizing over a number of actions that is exponential in the number of sub-actions by considering an extended MDP in which the agent chooses its sub-actions sequentially. Sequential DQN learns Q-functions for the original MDP as well as the extended MDP, using the latter to perform Q-learning backups for the former across environment transitions. tavakoli2017 model Q-values for all sub-actions using a shared state value and a sub-action-specific advantage parametrization. Their best-performing parametrization subtracts the mean advantage within each action dimension. The target Q-value is shared across all action dimensions and is computed using the observed rewards and a bootstrapped value obtained from the target network's argmax sub-actions, averaged over dimensions. Notably, metz2017discrete condition the lower-level Q-function on sub-actions already taken as part of the current macro-action; we experiment with a similar strategy for our learned proposal distribution.

Another alternative to performing an exact maximization over actions is to replace this maximization with a fixed stochastic search procedure. kalashnikov2018qt and quillen2018grasping used the cross-entropy method to perform an approximate maximization over actions in order to apply Q-learning to robotic grasping tasks. This approach has been shown to work very well for tasks with up to 6-dimensional action spaces, but it is unclear if it can scale to higher-dimensional action spaces. Notably, the concurrent work of lim2019actor (versions of both lim2019actor and the present manuscript appeared concurrently at the NeurIPS 2018 Deep Reinforcement Learning Workshop) develops an Actor-Expert framework wherein an approximate Q-function is trained on continuous actions and a stochastic policy is updated with a state-conditional variant of the cross-entropy method. Our work demonstrates that a simpler approach, rendered efficient by parallel computation of Q-values, can scale to even larger continuous action spaces, as well as high-dimensional discrete and hybrid action spaces.

On the subject of replacing policies with proposal distributions, the concurrent work of hunt2018 learns proposal distributions in the stochastic case. Their proposal distribution consists of a mixture of truncated normal distributions that is learned by minimizing the forward KL divergence with the Boltzmann policy. This method is focused on sampling in continuous action spaces during transfer, and does not deal with discrete or hybrid action spaces.

Finally, we also note the concurrent work of wiehe2018sampled, which applied the idea of learning an approximate maximizer over actions using search and supervised learning to continuous control. Our work shows that this idea is more generally applicable, and our experiments validate it on more challenging tasks.

4 Amortized Q-Learning

Input: Proposal network parameters φ, Q-network parameters θ, number of actions N to draw from the proposal, number of actions M to draw uniformly, unroll length T, exploration probability ε
repeat
      for t = 1, …, T do
            Observe state s_t
            With probability ε:
                  Select a_t uniformly at random
            Otherwise:
                  Draw N actions from the proposal π(· | s_t; φ) and M actions uniformly at random
                  a_t ← the sampled action with the highest Q(s_t, ·; θ)
            Take action a_t, receive reward r_t
      end for
      Update θ using the observed states, actions, and rewards by descending the gradient of (4).
      Update φ using the observed states and actions by descending the gradient of (3).
until termination
Algorithm 1 AQL

Amortized inference (dayan1995helmholtz) has revolutionized variational inference by eliminating the need for a costly maximization with respect to the parameters of the approximate variational posterior, making training latent variable models much more efficient. Key to this approach is a form of amortization in which an expensive procedure, such as iterative optimization over the variational parameters, is replaced with a more efficient forward pass in a parametric model, such as a neural network. By training the model to optimize the same objective as the original procedure, we obtain an efficient approximation that can be applied to new instances of the problem. We propose to leverage the same principle in order to amortize the maximization of Q(s, a; θ) with respect to the vector of sub-actions a. Specifically, in addition to a neural network representing the Q-function (which takes the state s and the vector of sub-actions a as inputs), we train an additional neural network to predict, from the state s, the sufficient statistics of a proposal distribution π(a | s; φ) over possible actions. We then replace exact maximization of Q over actions, which is required both for acting greedily and for performing Q-learning updates, with maximization over a set of samples from the proposal.

As long as the proposal distribution assigns non-zero probability to all actions, this method will approach Q-learning with exact maximization for discrete action spaces as the number of samples from the proposal increases. (To see why this is true, let p be the probability assigned to the action with the highest Q-value by the proposal. Then after drawing n samples from the proposal, the probability that the best action has been sampled at least once is 1 - (1 - p)^n, which tends to 1 as n tends to infinity.) Since the computational cost of the approach grows linearly with the number of samples taken from the proposal, for practical reasons we are interested in learning a proposal from which we need to draw the fewest possible samples.
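The coverage argument above is easy to check numerically. Assuming an illustrative proposal probability p = 0.05 for the best action, the chance of sampling it at least once grows quickly with the number of draws n:

```python
# If the proposal assigns probability p to the best action, the chance that
# n i.i.d. samples include it at least once is 1 - (1 - p)**n, which tends
# to 1 as n grows.
p = 0.05
ns = (10, 100, 1000)
coverage = [1 - (1 - p) ** n for n in ns]
for n, c in zip(ns, coverage):
    print(n, round(c, 4))
```

This is also why a *learned* proposal matters: the larger the probability it places on near-optimal actions, the fewer samples are needed for the same coverage.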

In this paper, we consider a simple approach to learning a proposal distribution which, at a high level, works as follows. To update the proposal for state s, we first perform an approximate stochastic search for the action with the highest Q-value at state s. We then update the proposal to predict the best action found by the search procedure, i.e. the one with the highest Q-value, using supervised learning to make the action found by the search procedure more probable under the proposal distribution. Because the proposal is meant to track approximately maximal actions for a given state as prescribed by a parametric Q-function which is itself continually changing, we add a regularization term to the proposal objective in order to prevent the proposal from becoming deterministic and potentially becoming stuck in a bad local optimum.

More formally, we update the proposal distribution for a state s by drawing a set of M samples from the uniform distribution over all actions and a set of N samples from the current proposal for state s. Denoting the action with the highest Q-value from the union of these two sets as a*, the regularized loss for training the proposal distribution is given by

L(φ) = -log π(a* | s; φ) - λ H(π(· | s; φ)).    (3)

The first term in the objective is the negative log-likelihood for the proposal with a* as the target. Minimizing it makes the sample with the highest Q-value more likely under the proposal. The second term is the negative entropy of the proposal distribution (weighted by a coefficient λ), which encourages uncertainty in π throughout training and prevents it from collapsing to a deterministic distribution.
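For a single categorical proposal head, the two terms of this regularized loss can be sketched as below (a minimal sketch; the function name and the entropy coefficient `entropy_weight` are illustrative choices, not the paper's hyperparameters):

```python
import numpy as np

def proposal_loss(logits, best_action, entropy_weight=0.01):
    """Negative log-likelihood of the best sampled action plus a
    negative-entropy regularizer that keeps the proposal stochastic."""
    z = logits - logits.max()                    # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()
    nll = -np.log(p[best_action])                # first term: NLL of a*
    neg_entropy = np.sum(p * np.log(p + 1e-12))  # second term: -H(p)
    return nll + entropy_weight * neg_entropy
```

A proposal already concentrated on the best action incurs a much smaller loss than a uniform one, while the entropy term penalizes full collapse to a point mass.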

Finally, we can define the AQL loss for updating the action-value function as

L(θ) = E[(r + γ max_{a' ∈ S'} Q(s', a'; θ) - Q(s, a; θ))²],    (4)

where S' denotes the set of actions produced by the stochastic search procedure at state s'. This loss is identical to the Q-learning loss but with the maximization over all actions replaced by the stochastic search based on the proposal distribution. Pseudocode for the complete Amortized Q-learning algorithm is shown in Algorithm 1. While the algorithm treats the action-value and proposal parameters as separate, we consider the possibility of sharing some of the parameters between the two, following standard practice. The parameter sharing is made clear in the network architecture diagram in Figure 1, which details the architecture used for all experiments.
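The approximate maximization used in both acting and bootstrapping can be sketched as follows. The sample counts, action dimensionality, and the toy stand-ins for the Q-function and proposal sampler are all illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def approx_max_q(q_fn, state, proposal_sample, n_proposal=16, n_uniform=128, action_dim=6):
    """Approximate max over actions of Q(state, a): evaluate Q on candidates
    drawn from the learned proposal plus uniform samples, take the best.

    q_fn(state, actions) -> array of Q-values; proposal_sample(state, n) -> actions.
    Both are placeholders standing in for learned networks."""
    cand = np.concatenate([proposal_sample(state, n_proposal),
                           rng.uniform(-1, 1, size=(n_uniform, action_dim))])
    q = q_fn(state, cand)            # one batched evaluation of all candidates
    best = cand[np.argmax(q)]
    return best, q.max()

# Toy stand-ins: Q prefers actions close to the zero vector,
# and the proposal already samples near it.
q_fn = lambda s, acts: -np.sum(acts ** 2, axis=1)
proposal_sample = lambda s, n: rng.normal(0.0, 0.3, size=(n, 6))

a_star, q_star = approx_max_q(q_fn, state=None, proposal_sample=proposal_sample)
```

The single batched Q evaluation over all candidates is what makes the search cheap in practice, as noted in the experiments.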

The AQL algorithm can be applied to continuous, discrete, and hybrid discrete/continuous action spaces simply by changing the form of the proposal distribution π. We use autoregressive proposals of the form π(a | s) = ∏_i π(a_i | a_1, …, a_{i-1}, s) in order to incorporate the dependencies among sub-actions. For discrete sub-actions we parameterize the proposal π(a_i | a_1, …, a_{i-1}, s) using a softmax. We experimented with discretized and continuous proposal distributions for continuous sub-actions: these are parameterized either using a softmax over a set of discrete choices representing uniformly-spaced values from the continuous sub-action's range, or using a Gaussian distribution with fixed variance. A detailed description of the architecture is included in the Appendix.

Figure 1: Left: The AQL architecture used in the experiments. The shared state embedding network is the identity function for the Control Suite experiments (i.e. the proposal network and Q-network each operate directly on the observations and share no parameters). For DeepMind Lab experiments, where the input consists of pixel observations, the shared state embedding consists of 3 layers of ResNet blocks followed by a fully connected layer with 256 units and a recurrent LSTM core with 256 cells, as in espeholt2018impala. The left head represents the proposal distribution. There is a proposal output layer of dimension K_i for each sub-action i, where K_i is the number of choices for dimension i for discrete or discretized continuous actions, or the number of distribution parameters for continuous actions. Samples from the proposal distribution are embedded and concatenated with an embedding of the state and used to compute Q. Right: Learning curves of the mean return across all tasks in the Control Suite. The error bars represent the standard error of the mean episode return.

5 Experiments

We performed experiments on two families of tasks: the set of DeepMind Control Suite (tassa2018deepmind) tasks with constrained actions (16 distinct domains, with a total of 39 tasks spread between them) and two tasks in the DeepMind Lab (beattie2016deepmind) 3D first-person environment. Descriptions of the architectures and hyperparameter settings are available in the Appendix.

5.1 Control Suite

The DeepMind Control Suite is a collection of continuous control tasks built on the MuJoCo physics engine (todorov2012mujoco). The Control Suite is considered a good benchmark as it spans a wide range of tasks of varying action complexity. In the simplest tasks, the agent controls only a single actuator while the humanoid tasks require selection of at least 21 sub-actions at every time step. As is common, we employed the underlying state variables rather than visual representations of the state as observations, as the focus of our inquiry is the complexity of the action space rather than the observation space. We explore a domain where the observations are pixel renderings in Section 5.2.

We compared AQL to several baseline continuous control methods. First, we considered an ablation which replaces the learned proposal distribution with a fixed uniform proposal distribution; we dub this method Uniform Q-learning. We also considered two strong continuous control methods: D3PG (barth2018distributed), a distributed version of DDPG (lillicrap2016continuous), and QT-Opt (kalashnikov2018qt), which corresponds to Q-learning where action maximization is performed using the cross-entropy method (CEM) (kalashnikov2018qt; rubinstein2004). CEM is a derivative-free iterative optimization algorithm that samples N values at each iteration, fits a Gaussian distribution to the best M of these samples, and then samples the next batch of N values from that distribution. The iterative process is initialized with uniform samples from the action space. In our implementation, we used the same number of initial samples as the number of proposal actions, performed three iterations, and fit independent Gaussian distributions for each sub-action to the M actions with the highest corresponding Q-values. Following previous work (kalashnikov2018qt; hafner2018Planet), M is chosen to be an order of magnitude smaller than N. The final baseline method consisted of the IMPALA agent (espeholt2018impala), an importance-weighted advantage actor-critic method that is inspired by A3C (mnih2016asynchronous).
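The CEM maximizer described above can be sketched as follows. The sample counts, elite count, action bounds, and the toy Q-function with a known maximizer are illustrative assumptions, not the QT-Opt baseline's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def cem_maximize(q_fn, state, action_dim, n_samples=64, n_elite=6, n_iters=3):
    """Cross-entropy method over actions in [-1, 1]^action_dim:
    sample, keep the elite actions by Q-value, refit an independent
    Gaussian per sub-action, and repeat."""
    actions = rng.uniform(-1, 1, size=(n_samples, action_dim))  # uniform init
    for _ in range(n_iters):
        elite = actions[np.argsort(q_fn(state, actions))[-n_elite:]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3
        actions = np.clip(rng.normal(mu, sigma, size=(n_samples, action_dim)), -1, 1)
    return actions[np.argmax(q_fn(state, actions))]

# Toy Q-function with a known maximizer at a = (0.3, ..., 0.3).
q_fn = lambda s, acts: -np.sum((acts - 0.3) ** 2, axis=1)
a_star = cem_maximize(q_fn, state=None, action_dim=4)
```

Unlike AQL's learned proposal, this search procedure is fixed and state-agnostic, which is the distinction the experiments probe.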

Our AQL implementation used an autoregressive proposal distribution which models sub-actions sequentially using the ordering in which they appear in the Control Suite task specifications. We considered both discretized categorical and Gaussian action distributions, the latter with fixed variance. The logits (in the discretized case) or distribution means (in the continuous case) of the proposal distribution for sub-action i are computed by a linear transformation of a concatenation of features derived from the state and one-hot encodings of all previously sampled sub-actions a_1, …, a_{i-1}. Note that entire composite actions sampled from the proposal distribution remain independent and identically distributed.

For the discretized AQL method and the Uniform Q-learning method, actions were discretized for each degree of freedom to one of five uniformly-spaced values in the sub-action's range, yielding a total of 5^D distinct actions for a D-dimensional action space. All methods except D3PG used N proposal samples and M uniform search samples as described in Algorithm 1. We used more uniform search samples than proposal samples since they are less expensive to compute, as they do not require the relatively expensive autoregressive sampling procedure. The Q-value evaluations were implemented as a single batched operation for each actor step, which rendered evaluation highly efficient. Following the literature, our D3PG experiments made use of an Ornstein-Uhlenbeck process (uhlenbeck1930) for exploration. We employed the Adam optimizer (adamoptimizer2014) to train all models. In order to fairly compare across methods, all architectural details that are not specific to the method under consideration were held fixed across conditions. Further details are included in the Appendix.

Figure 1 depicts the learning curves for 200 million actor steps averaged over all tasks (task-specific learning curves are reported in the Appendix) and shows that discretized AQL performs best on average. Returns at a given point in training were obtained by allowing the agent to act in the environment for 1000 time steps and summing the rewards. Rewards range between 0 and 1 for all tasks in the Control Suite, which makes the average rewards comparable between tasks. While not shown in Figure 1, we note that all methods substantially outperform the A3C results reported in tassa2018deepmind, with the exception of IMPALA, whose results are only marginally better than A3C's.

Closer inspection revealed that the difference between the methods in Figure 1 was mostly explained by the performance on the medium- and high-dimensional tasks. Exploration in high-dimensional action spaces is typically a harder problem, and the same appears to hold for searching for the action with the maximum Q-value. Analysis of tasks with high-dimensional action spaces revealed that learning a discretized proposal distribution (AQL) or a deterministic policy (D3PG) clearly outperformed Uniform Q-learning and QT-Opt, both of which use fixed stochastic search procedures for action selection. IMPALA performed worse than the other methods on most Control Suite tasks.

The continuous implementation of AQL performed similarly to the discretized AQL variant on the low- and medium-dimensional tasks but failed to learn on the high-dimensional tasks. This may be related to the observation that exploration is often easier when the action space is discretized (metz2017discrete; peng2017learning). Figure 2 shows the mean final performance for the medium- and high-dimensional tasks. Results on the low-dimensional Control Suite tasks are available in the Appendix and show less drastic differences between the methods. These experimental results support the hypothesis that learning a proposal distribution is beneficial for more complex or higher-dimensional action spaces.

Figure 2: Mean final performance, averaged over 3 seeds, for the medium- and high-dimensional Control Suite tasks. The error bars represent the standard error of the mean episode return.

Overall, these results demonstrate that AQL can scale up to problems with a relatively high-dimensional action space. The 21-dimensional humanoid tasks are of particular interest: although our discretization scheme leads to a total of 5^21 (roughly 5 × 10^14) possible discrete actions on these tasks, AQL was able to learn policies competitive with D3PG.

Uniform Q-learning and QT-Opt, both of which correspond to Q-learning with a fixed procedure for maximizing over actions, largely failed to learn on the humanoid tasks, demonstrating the advantage of learning a state-dependent maximization procedure in high-dimensional action spaces.

5.2 DeepMind Lab

DeepMind Lab (beattie2016deepmind) is a platform for 3D first person reinforcement learning environments. We evaluated AQL on DeepMind Lab because it offers a combination of a rich observation space (we consider RGB observations) and a complex, structured action space.

The full action space consists of 7 discrete sub-actions which can be selected independently. The first two sub-actions specify the angular velocity of the viewport rotation (left/right and up/down). Each of these represents an integer between -512 and 512. The next two sub-actions represent sets of three mutually exclusive movement-related options: strafe left/right/no-op and forward/backward/no-op. The last three sub-actions are binary: fire/no-op, jump/no-op and crouch/no-op. We considered two action sets: a curated set with 11 actions based on the minimal subset that allows good performance on all tasks, and a second, larger action set with 3528 actions consisting of the Cartesian product of 7 rotation actions for each of the two rotational axes and the 5 remaining ternary and binary sub-actions (7 × 7 × 3 × 3 × 2 × 2 × 2 = 3528).
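The large action set described above can be enumerated as a Cartesian product; a minimal sketch (the option labels are illustrative, not DeepMind Lab's actual action encoding):

```python
from itertools import product

# 7 rotation options per rotational axis, two ternary movement sub-actions,
# and three binary sub-actions (fire, jump, crouch), per the description above.
look_lr = range(7)
look_ud = range(7)
strafe = ("left", "right", "noop")
move = ("forward", "backward", "noop")
fire = jump = crouch = (0, 1)

action_set = list(product(look_lr, look_ud, strafe, move, fire, jump, crouch))
print(len(action_set))  # 7 * 7 * 3 * 3 * 2 * 2 * 2 = 3528
```

Exact Q-learning must evaluate all 3528 entries of this list at every step, whereas AQL only scores the sampled candidates.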

We compared three methods on each of the two action sets: our distributed implementation of AQL with N proposal samples and M uniform search samples, an ablation that performs exact Q-learning (i.e. evaluates the Q-values for all possible discrete actions at every time step), and the IMPALA method. As with our AQL implementation, the Q-learning ablation agent was distributed, replay-based, and used a recurrent Q-function, making it a strong baseline similar to R2D2 (kapturowski2019recurrent). Other details such as network architecture and optimizer choice were held fixed across conditions. The AQL implementation for the large action set incorporated the 7-dimensional structure of the discrete action space and used the same style of autoregressive proposal as the Control Suite experiments. Every action was repeated 4 times and each seed was run for 200 million environment steps (50 million actor steps given the repeated actions). The precise network architecture is described in the Appendix.

Figure 3: Learning curves on two DeepMind Lab exploration tasks. The error bars represent the standard error of the mean episode return over 5 seeds. The top row shows the mean episode return versus the number of actor steps and the bottom row shows the mean episode return versus wall-clock time for the first 24 hours of the experiment.

Figure 3 shows the results for two exploration tasks from DeepMind Lab. In both tasks the agent was rewarded for navigating to the object(s) in a maze. The explore_goal task only contains a single object with goal location(s), level layout and theme randomized per episode. As expected, there was no significant performance difference between AQL and exact Q-learning (i.e. computing Q-values for all 11 actions) when using the curated action set. IMPALA also performs comparably with the curated action set. When training with the large action set, exploration is harder, which explains why the initial performance was worse for all methods. AQL achieved the best final score on both tasks, eventually outperforming IMPALA, exact Q-learning with the curated action set and exact Q-learning on the large action set. The agent trained with AQL on the large action set was able to outperform agents trained with the curated action set by navigating more efficiently (e.g. by simultaneously strafing and moving forward). While it may be surprising that AQL achieves better data efficiency than Q-learning on the large action set (where Q-learning evaluates all possible actions), we speculate that AQL benefits from the additional stochasticity in action selection due to approximate maximization, which potentially improves exploration. It is also possible that by using approximate maximization for bootstrapping, AQL is less prone to over-estimation of Q-values, from which Q-learning with exact maximization is known to suffer (hasselt2010double); we leave this investigation to future research. Critically, considering all 3528 actions with exact Q-learning takes about 10 times longer in wall-clock time because the Q-function must be evaluated for each action. IMPALA is also slowed down drastically when considering 3528 actions. 
These experiments show that AQL can indeed efficiently learn good policies in large discrete structured action spaces and, unsurprisingly, performs comparably to exact Q-learning in low cardinality action spaces.

6 Discussion

We presented AQL, a simple approach to scaling up Q-learning to multi-dimensional action spaces where the sub-actions can be discrete, continuous or a combination of the two. We showed that AQL is competitive with several strong continuous-control methods on low-dimensional control tasks, and is able to outperform them on medium- and high-dimensional action spaces. Perhaps most notably, AQL closes the gap between Q-learning and stochastic actor-critic methods in their ability to handle high-dimensional and structured action spaces. While actor-critic methods have been able to handle such action spaces simply by changing the form of the stochastic policy (schulman2015trust; mnih2016asynchronous), different variants of Q-learning have been developed to handle each specific case, i.e. DDPG for continuous actions and DQN for low-dimensional discrete action spaces. AQL allows one to handle different action spaces simply by varying the form of the proposal distribution while keeping the rest of the algorithm unchanged. While AQL and stochastic actor-critics are equally general in terms of which action spaces they can handle (being limited only by the ability to efficiently sample from a distribution over actions), AQL may have some advantages over actor-critic methods. Namely, AQL supports off-policy learning without the need for importance sampling-based correction terms used by actor-critic methods, making the method easier to implement in replay-based agents.

There are several promising avenues for future work. AQL could be combined with the cross-entropy method (CEM) by modeling the parameters of the initial distribution of CEM with the proposal distribution. Another interesting direction would be to use the proposal distribution for more intelligent exploration than epsilon-greedy exploration. Currently AQL uses a simple way of learning the proposal based on supervised learning. Better ways of optimizing the proposal parameters could improve the overall algorithm. For example, one natural choice would be to train the proposal by maximizing the expected Q-value of a sample from it using REINFORCE (williams1992simple). Considering sequential proposal distributions is another interesting possibility. By conditioning each sample from the proposal on all previous samples, it may be possible to learn a proposal that achieves the same overall performance using many fewer samples.
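The REINFORCE-style proposal update suggested above can be sketched in a few lines. This is our own toy illustration, not code from the paper: a single categorical proposal is parameterized by softmax logits, one action is sampled, and the logits are nudged in the direction of Q(s, a) ∇ log q(a|s); all names (`reinforce_proposal_step`, `q_of`, `lr`) are ours.

```python
import math
import random

def reinforce_proposal_step(logits, q_of, lr=0.1, rng=random):
    """One REINFORCE step on a categorical proposal: estimates
    grad E_{a~q}[Q(s, a)] = E[Q(s, a) * grad log q(a|s)] with one sample."""
    # softmax over the logits
    probs = [math.exp(l) for l in logits]
    z = sum(probs)
    probs = [p / z for p in probs]
    # sample one action from the categorical proposal
    u, a, acc = rng.random(), 0, 0.0
    for i, p in enumerate(probs):
        acc += p
        if u <= acc:
            a = i
            break
    q = q_of(a)
    # for a softmax, d log q(a) / d logit_j = 1[j == a] - probs[j]
    return [l + lr * q * ((1.0 if j == a else 0.0) - probs[j])
            for j, l in enumerate(logits)]
```

Unlike the supervised proposal loss used in the paper, this estimator directly increases the probability of actions with high Q-values, at the cost of higher gradient variance.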

Acknowledgements

We thank Marlos C. Machado, Jonathan Hunt and Tim Harley for invaluable feedback on drafts of the manuscript. We additionally thank Jonathan Hunt, Benigno Uria, Siddhant Jayakumar and Catalin Ionescu for helpful discussions. Additional credit goes to Wojtek Czarnecki for creating the DeepMind Lab action set with 3528 actions.

References

Appendix

A1 Architecture Details

We employ a distributed reinforcement learning setup inspired by the IMPALA architecture (espeholt2018impala): a single centralized GPU-based learner performs batched parameter updates on experience collected by a large number of CPU-based parallel actors. In all our experiments we use 100 parallel actors. All experiments employ the architecture detailed in the main paper.

For the Control Suite, the state embedding network is the identity function, so there is no parameter sharing between the proposal and value networks. For the DeepMind Lab experiments, the state embedding network is the convolutional network employed in espeholt2018impala: 3 layers of ResNet blocks followed by a fully connected layer with 256 units and a recurrent LSTM core with 256 cells. The weights of the network are optimized with two optimizers: one for the proposal distribution head and one for the shared state embedding network and Q-value head. The parameters of the shared state embedding network are treated as constant when optimizing the proposal loss. The actors use a local FIFO replay buffer from which unrolls are sampled without prioritization.
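The two-optimizer split above, where the shared embedding is held constant for the proposal loss, can be sketched with explicit gradients. This is a toy NumPy illustration with made-up shapes and names, not the paper's TensorFlow code: a shared linear-plus-tanh "embedding" feeds a Q-value head and a proposal head, and only the Q-loss gradients reach the shared weights.

```python
import numpy as np

rng = np.random.default_rng(0)
W_embed = rng.normal(size=(4, 8))   # shared state-embedding weights
w_q = rng.normal(size=8)            # Q-value head
w_prop = rng.normal(size=8)         # proposal head (one logit, for brevity)

def forward(x):
    e = np.tanh(x @ W_embed)        # shared embedding
    return e, e @ w_q, e @ w_prop   # embedding, Q-value, proposal logit

x = rng.normal(size=4)
target_q, target_logit = 1.0, 0.0
e, q, logit = forward(x)

# Q-loss gradients flow into BOTH the Q head and the shared embedding.
dq = 2 * (q - target_q)
grad_w_q = dq * e
grad_W_embed = np.outer(x, dq * w_q * (1 - e**2))

# Proposal-loss gradients update ONLY the proposal head: the shared
# embedding is treated as a constant (a stop-gradient in TensorFlow terms).
dp = 2 * (logit - target_logit)
grad_w_prop = dp * e

lr = 0.01
w_q -= lr * grad_w_q
W_embed -= lr * grad_W_embed        # updated by the Q-loss only
w_prop -= lr * grad_w_prop          # W_embed untouched by the proposal loss
```

In a framework implementation the same effect is achieved by giving each optimizer a disjoint variable list and applying a stop-gradient to the embedding inside the proposal loss.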

A2 Experimental Details

The actors send the current trajectory to the learner queue at the end of an unroll, along with two samples from the local replay buffer. Each local replay buffer stores a fixed number of steps, and since 100 parallel actors were used, the total effective replay size is 100 times that of a single buffer. The unroll length is set to 30 for Control Suite experiments and to 100 for DeepMind Lab experiments. We train both optimizers with Adam (adamoptimizer2014), using default TensorFlow hyperparameters apart from the learning rate, and a fixed mini-batch size. A target Q-network as described in the main text could be used, but we do not use one, since we did not observe a significant gain in any of our hyperparameter iterations. The discount factor is held fixed across experiments. We use the Q(λ) variant due to peng1994incremental (see also Chapter 7 of sutton1998book) to compute the targets. Following mnih2016asynchronous and horgan2018distributed, we use a different amount of ε-greedy exploration for each actor, as this has been shown to improve exploration: the first 10 actors share one value of ε and the remaining actors share a second, smaller value.

procedure Actor
       Input : proposal network parameters φ,
       Q-network parameters θ,
       number of actions to draw from the proposal N,
       number of actions to draw uniformly M,
       unroll length T,
       number of replay steps R,
       exploration probability ε
       Initialize local replay buffer B.
       repeat
              for t = 1, ..., T do
                    Observe state s_t.
                    With probability ε:
                          Select a_t uniformly at random.
                    Otherwise:
                          Sample N actions from the proposal q(· | s_t; φ) and M actions uniformly at random.
                          Set a_t to the sampled action maximizing Q(s_t, ·; θ).
                    Take action a_t, receive reward r_t.
              end for
              Send unroll to the learner. Add unroll to B.
              for i = 1, ..., R do
                    Sample unroll from B. Send unroll to the learner.
              end for
              Poll the learner periodically for updated values of φ, θ.
              Reset the environment if the episode has terminated.
       until termination
procedure Learner
       Input : batch size K
       repeat
              Assemble a batch of K unrolls of experience.
              Update θ with a step of gradient descent on the Q-learning loss.
              Update φ with a step of gradient descent on the proposal loss.
              Periodically update the target network parameters.
       until termination
Algorithm 2 AQL (Distributed)
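The actor's action-selection step in the algorithm above (the amortized approximate maximization at the heart of AQL) can be sketched as follows. This is an illustrative snippet with our own names, assuming a small discrete action space and callables for the proposal and the Q-function:

```python
import random

def select_action(state, propose, q_value, action_space,
                  n_proposal, m_uniform, epsilon, rng=random):
    """Epsilon-greedy over an approximate argmax: instead of maximizing Q over
    the full action space, act greedily over N proposal samples plus M uniform
    samples, as in the Actor procedure above."""
    if rng.random() < epsilon:
        return rng.choice(action_space)              # exploration step
    candidates = [propose(state) for _ in range(n_proposal)]
    candidates += [rng.choice(action_space) for _ in range(m_uniform)]
    return max(candidates, key=lambda a: q_value(state, a))
```

The uniform samples guarantee nonzero probability of evaluating actions the proposal currently neglects, which keeps the approximate maximization from collapsing onto a poor proposal early in training.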

A3 Control Suite

We have uploaded a video of the final performance of the discretized AQL agent on all Control Suite tasks at https://youtu.be/WgTXjJhe6iQ. The video shows the behavior of the greedy policy along with the proposal distribution and the Q-values of the sampled proposal actions. The videos were selected by picking the seed with the best performance after training. Closer inspection of the trained proposal distributions of the AQL agent reveals that the sub-action proposal distributions tend to alternate between near-deterministic, low-entropy distributions at critical decision times and high-entropy distributions otherwise. The proposal distributions typically still have high entropy once the task is solved and there is a clear optimal action for maintaining the equilibrium. Examples of this behavior can be found in the finger, ball-in-cup, hopper, humanoid, and reacher tasks. A notable side effect of high proposal entropy can be seen in the walker-stand task, where the high entropy of the policy results in alternating balancing and walking behavior. Figure 4 visualizes the proposal distribution for the pendulum-swingup task. The proposal distribution has low entropy when swinging the pendulum up and high entropy during the balancing stage of the task.

Figure 4: Chronological visualization of the proposal distribution for the pendulum-swingup task. The top two images show the pendulum during the swingup phase and the bottom two images visualize the balancing phase. The leftmost part of each plot shows the behavior of the agent. The middle part plots a histogram of Q-values for 100 sampled actions from the proposal distribution, with an x-axis that ranges from 0 to 100 (the maximum attainable value given the discount factor). The right part shows the probabilities of the 5 discretized action options for the proposal distribution. Green bars in the proposal distribution belong to the argmax action, which is also the selected action since the agent follows the greedy policy. The proposal distribution is near-deterministic while swinging up and near-uniform when the pendulum gets close to the balancing equilibrium.

We also considered a simpler AQL variant on the Control Suite. The independent AQL variant models the different sub-actions as conditionally independent given the state, as opposed to the variant from the main text, where an order over sub-actions is assumed and each sub-action is conditioned on the state as well as the preceding sub-actions. Figure 5 shows that the independent, discretized variant performed only slightly worse on average on the Control Suite. The AQL implementation with a continuous proposal distribution used an autoregressive policy. We have also uploaded a video of the final performance of the independent-proposal AQL agent on all Control Suite tasks at https://youtu.be/9YIujaHjsQY. The video shows the behavior along with the proposal distribution and the Q-values of the sampled proposal actions.
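The contrast between the two proposal factorizations can be made concrete with a small sketch. This is our own toy illustration, not the paper's code: each "head" returns a categorical distribution over the options of one sub-action, either from the state alone (independent) or from the state plus the previously sampled sub-actions (autoregressive).

```python
import random

def weighted_choice(probs, rng):
    """Sample an index from a list of probabilities."""
    u, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if u <= acc:
            return i
    return len(probs) - 1

def sample_independent(state, heads, rng):
    # heads[i](state) -> per-option probabilities for sub-action i;
    # each sub-action is sampled given the state only.
    return [weighted_choice(h(state), rng) for h in heads]

def sample_autoregressive(state, heads, rng):
    # heads[i](state, prefix) -> probabilities conditioned on the state
    # and the tuple of previously sampled sub-actions.
    action = []
    for h in heads:
        action.append(weighted_choice(h(state, tuple(action)), rng))
    return action
```

The autoregressive variant can represent correlations between sub-actions (e.g. "strafe left only while moving forward"), which the independent factorization cannot; the experiments above suggest this extra expressiveness buys only a modest improvement on the Control Suite.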

Figure 6 shows the mean final performance results for the low-dimensional Control Suite tasks, with 1 or 2 sub-actions. IMPALA and D3PG are the weakest baselines on these tasks. The mean final performance results for the medium-dimensional Control Suite tasks, with 4 to 6 sub-actions, are shown in Figure 7. For medium-dimensional tasks, D3PG and IMPALA again stand out as performing worse than the other baselines on average. Results for the high-dimensional tasks are shown in Figure 8. Here, D3PG and the AQL methods outperform QT-Opt, IMPALA, and Uniform Q-learning.

Figure 5: Learning curves of the mean return across all tasks in the Control Suite. The error bars represent the standard error of the mean episode return over 3 seeds. The AQL variant where sub-actions are sampled independently performs slightly worse on average.
Figure 6: Mean final performance, averaged over 3 seeds, for the low-dimensional Control Suite tasks. The error bars represent the standard error of the mean episode return.
Figure 7: Mean final performance, averaged over 3 seeds, for the medium-dimensional Control Suite tasks. The error bars represent the standard error of the mean episode return.
Figure 8: Mean final performance, averaged over 3 seeds, for the high-dimensional Control Suite tasks. The error bars represent the standard error of the mean episode return.

Figures 9 and 10 show the individual learning curves of the Control Suite for all tasks.

Figure 9: Task specific learning curves for the Control Suite (1/2).
Figure 10: Task specific learning curves for the Control Suite (2/2).

Somewhat surprisingly, we found that the choice of network architecture and optimizer, along with the optimizer hyperparameters, can have a dramatic effect on final performance. We chose the hyperparameters for the preceding experiments by first running the baselines for the considered methods on some of the high-dimensional tasks and then committing to the settings that worked best on average across all tasks. Figure 11 shows the results of an earlier sweep with an alternative architecture and the RMSProp optimizer (tieleman2012rmsprop) instead of Adam. The architecture in the earlier sweep shared weights in the first two layers and was deeper but had fewer units per layer. The results of the earlier sweep are significantly worse on average for all baselines except IMPALA and continuous AQL. The results for continuous AQL are competitive with the best AQL baseline (discretized, autoregressive proposal distribution), while the IMPALA results using this weight-sharing architecture are comparable to those from the main text. D3PG was the most affected by the choice of architecture and optimizer, being the second-worst performing method when trained with RMSProp and the second-best when trained with Adam.

Figure 11: Left: alternative architecture with RMSProp as the optimizer. Right: the corresponding learning curves of the mean return across all tasks in the Control Suite. The error bars represent the standard error of the mean episode return. The results are significantly worse for all baselines compared with the results in the main text.