1 Introduction
Combining the recent advances in deep learning techniques (LeCun et al., 2015; Schmidhuber, 2015; Goodfellow et al., 2016) with reinforcement learning algorithms (Bertsekas and Tsitsiklis, 1996; Sutton and Barto, 1998; Szepesvari, 2010) has proven to be effective in many domains. Notable examples include the Deep Q-Network (DQN) (Mnih et al., 2013, 2015) and AlphaGo (Silver et al., 2016, 2017). The main advantage of using artificial neural networks as function approximators in reinforcement learning is their ability to deal with high-dimensional input data by modeling complex hierarchical or compositional data abstractions and features. Despite these successes, which have enabled the use of reinforcement learning algorithms in domains with unprocessed, high-dimensional sensory input, the application of these methods to high-dimensional, discrete action spaces continues to suffer from the same issue as tabular reinforcement learning: the number of actions that need to be explicitly represented grows exponentially with increasing action dimensionality. Formally, for an environment with an $N$-dimensional action space and $n_d$ discrete sub-actions for each dimension $d$, using the existing discrete-action algorithms, a total of $\prod_{d=1}^{N} n_d$ possible actions need to be considered. This can rapidly render the application of discrete-action reinforcement learning algorithms intractable in domains with multidimensional action spaces, as such large action spaces are difficult to explore efficiently (Lillicrap et al., 2016). This limitation is significant, as there are numerous efficient discrete-action algorithms whose applications are currently restricted to domains with relatively small discrete action spaces. For instance, Q-learning (Watkins and Dayan, 1992) is a powerful discrete-action algorithm with many extensions (Hessel et al., 2017) which, due to its off-policy nature, can in principle achieve better sample efficiency than policy gradient methods by reusing transitions from a replay memory of past experience transitions or demonstrations (Gu et al., 2016, 2017).
Given the potential of discrete-action reinforcement learning algorithms and their currently limited application, in this paper we introduce a novel neural architecture that enables the use of discrete-action algorithms in deep reinforcement learning for domains with high-dimensional discrete or continuous action spaces. The core notion of the proposed architecture is to distribute the representation of the action controllers across individual network branches while maintaining a shared decision module among them to encode a latent representation of the input and help with the coordination of the branches (see Figure 1). The proposed decomposition of the actions enables the linear growth of the total number of network outputs with increasing action dimensionality, as opposed to the combinatorial growth in current discrete-action algorithms. This simple idea can potentially enable a spectrum of fundamental discrete-action reinforcement learning algorithms to be effectively applied to domains with high-dimensional discrete or continuous action spaces using neural network function approximators.
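As a rough illustration of the two growth rates, the following sketch counts the outputs a network would need under each scheme. The function names and the example sizes (6 action dimensions with 11 sub-actions each) are hypothetical and chosen only for illustration; they are not taken from the experiments.

```python
from math import prod


def joint_action_count(sub_actions_per_dim):
    """Outputs needed when every joint action is represented explicitly."""
    return prod(sub_actions_per_dim)


def branching_output_count(sub_actions_per_dim):
    """Outputs needed when each action dimension has its own branch."""
    return sum(sub_actions_per_dim)


# Hypothetical example: 6 action dimensions, 11 sub-actions per dimension.
dims = [11] * 6
print(joint_action_count(dims))      # 1771561 (combinatorial growth)
print(branching_output_count(dims))  # 66 (linear growth)
```

Adding one more dimension multiplies the first count by 11 but only adds 11 to the second.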
To showcase this capability, we introduce a novel agent, called Branching Dueling Q-Network (BDQ), which is a branching variant of the Dueling Double DQN (Dueling DDQN) (Wang et al., 2016). We evaluate BDQ on a variety of complex control problems via fine-grained discretization of the continuous action space. Our empirical study shows that BDQ can scale robustly to environments with high-dimensional action spaces to solve the benchmark domains and even outperform the Deep Deterministic Policy Gradient (DDPG) algorithm (Lillicrap et al., 2016) in the most challenging task with a corresponding discretized combinatorial action space of approximately action tuples. Solving problems in environments with discrete action spaces of this magnitude was previously thought intractable for discrete-action algorithms (Lillicrap et al., 2016; Schulman et al., 2017). In order to demonstrate the vital role of the shared decision module in our architecture, we compare BDQ against a completely independent variant which we refer to as Independent Dueling Q-Network (IDQ), an agent consisting of multiple independent networks, one for each action dimension, and without any shared parameters among the networks. The results show that the performance of IDQ quickly deteriorates with increasing action dimensionality. This suggests that the agent is unable to coordinate the independent action decisions across its several networks. This could be due to any of the several well-known issues for independent fully-cooperative learning agents: Pareto-selection, non-stationarity, stochasticity, alter-exploration, and shadowed equilibria (Matignon et al., 2012; Wei and Luke, 2016).
Partial distribution of control, or action branching as we call it, is also found in nature. Octopuses, for instance, have complex neural systems where each arm is able to function with a degree of autonomy and even respond to stimuli after being detached from the central control. In fact, more than half of the neurons in an octopus are spread throughout its body, especially within the arms (Godfrey-Smith, 2016). Since the octopuses' arms have virtually unlimited degrees of freedom, they are highly difficult to control in comparison to jointed limbs. This calls for the partial delegation of control to the arms in order to work out the details of their motions themselves. Interestingly, not only do the arms have a degree of autonomy, they have also been observed to engage in independent exploration (Godfrey-Smith, 2016).
2 Related Work
To enable the application of reinforcement learning algorithms to large-scale, discrete-action problems, Dulac-Arnold et al. (2015) propose the Wolpertinger policy architecture based on a combination of DDPG and an approximate nearest-neighbor method. This approach leverages prior information about the discrete actions in order to embed them in a continuous space upon which it can generalize, while achieving logarithmic-time lookup complexity relative to the number of actions. Because the underlying algorithm is essentially a continuous-action algorithm, this approach may be unsuitable for domains with naturally discrete action spaces where no assumption should be imposed on having associated continuous space correlations. Also, this approach does not enable the application of discrete-action algorithms to domains with high-dimensional action spaces as it relies on a continuous-action algorithm.
Concurrently with our work, Metz et al. (2017) have developed an approach that can deal with problem domains with high-dimensional discrete action spaces using Q-learning. They use an autoregressive network architecture to sequentially predict the action value for each action dimension. This requires manual ordering of the action dimensions, which imposes a priori assumptions on the structure of the task. Additionally, due to the sequential structure of the network, as the number of action dimensions increases, so does the noise in the Q-value estimations. Therefore, with increasing number of action dimensions, the Q-value estimates on the later layers may become too noisy to be useful. Due to the parallel representation of the action values or policies, our proposed approach is not prone to cumulative estimation noise with increasing number of action dimensions and does not impose manual action factorization. Furthermore, our proposed approach is much simpler to implement as it does not require advanced neural network architectures, such as recurrent neural networks.
A potential approach towards achieving scalability with increasing number of action dimensions is to extend deep reinforcement learning algorithms to fully-cooperative multi-agent settings in which each agent, responsible for controlling an individual degree of freedom, observes the global state, selects an individual action, and receives a team reward common to all agents. Tampuu et al. (2017) combine DQN with independent Q-learning, in which each agent independently and simultaneously learns its own action-value function. Even though this approach has been successfully applied in practice to domains with two agents, in principle, it can lead to convergence problems (Matignon et al., 2012). In this paper, we empirically investigate this scenario and show that by maintaining a shared set of parameters among the action branches, our approach is able to scale to high-dimensional action spaces.
3 The Action Branching Architecture
The key insight behind the proposed action branching architecture is that for solving problems in multidimensional action spaces, it is possible to optimize for each action dimension with a degree of independence. If executed appropriately, this altered perspective has the potential to trigger a dramatic reduction in the number of required network outputs. However, it is well-known that the naïve distribution of the value function or the policy representation across several independent function approximators is subject to numerous challenges which can lead to convergence problems (Matignon et al., 2012). To address this, the proposed neural architecture distributes the representation of the value function or the policy across several network branches while keeping a shared decision module among them to encode a latent representation of the common input state (see Figure 1). We hypothesize that this shared network module, paired with an appropriate training procedure, can play a significant role in coordinating the sub-actions that are based on the semi-independent branches and, therefore, achieve training stability and convergence to good policies. We believe this is due to the rich features in the shared module, which is trained via the backpropagation of the gradients originating from all the branches.
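The structure described above can be sketched as a forward pass in NumPy. This is a minimal illustration assuming a single shared hidden layer and linear branch heads; the layer sizes, the initialization, and all names are hypothetical and do not reflect the configuration used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_layer(n_in, n_out):
    """Small random weights and zero biases for one dense layer."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)


def dense(x, layer, relu=True):
    """Apply one dense layer, optionally followed by a ReLU."""
    w, b = layer
    out = x @ w + b
    return np.maximum(out, 0.0) if relu else out


# Hypothetical sizes: 8 state features, 3 action dimensions, 5 sub-actions each.
state_dim, hidden, num_branches, num_sub_actions = 8, 32, 3, 5

shared = init_layer(state_dim, hidden)            # shared decision module
branches = [init_layer(hidden, num_sub_actions)   # one head per action dimension
            for _ in range(num_branches)]


def forward(states):
    """Return branch outputs of shape (batch, num_branches, num_sub_actions)."""
    latent = dense(states, shared)                # common latent representation
    heads = [dense(latent, head, relu=False) for head in branches]
    return np.stack(heads, axis=1)


outputs = forward(rng.normal(size=(4, state_dim)))
joint_action = outputs.argmax(axis=2)  # one greedy sub-action index per branch
```

Every branch reads the same latent vector, so gradients from all branches would flow into the shared layer during training; the network emits 3 x 5 = 15 outputs instead of 5^3 = 125.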
To verify this capability, we present a novel agent that is based on the incorporation of the proposed action branching architecture into a popular discrete-action reinforcement learning agent, the Dueling Double Deep Q-Network (Dueling DDQN). The proposed agent, which we call Branching Dueling Q-Network (BDQ), is only an example of how we envision our action branching architecture can be combined with a discrete-action algorithm in order to enable its direct application to problem domains with high-dimensional, discrete or continuous action spaces. We select deep Q-learning (also known as DQN) as the algorithmic basis for our proof-of-concept agent as it is a simple, yet powerful, off-policy algorithm with an excellent track record and numerous extensions (Hessel et al., 2017).
While our experiments focus on a specific algorithm (i.e. deep Q-learning), we believe that the empirical verification of the aforementioned hypothesis suggests the potential of the proposed approach in enabling the direct application of a spectrum of existing discrete-action algorithms to environments with high-dimensional action spaces.
4 Branching Dueling Q-Network
In this section, we begin by providing a brief overview of a select set of available extensions for DQN that we incorporate into the proposed BDQ agent. We then describe the details of the proposed agent, including the specific methods that were used to adapt DQN and its selected extensions into our proposed action branching architecture. Figure 2 shows a pictorial view of the BDQ network.
4.1 Background
The following is an outline of three existing key innovations, designed to improve upon the sample efficiency and policy evaluation quality of the DQN algorithm.
Double Q-learning.
Both tabular Q-learning and DQN have been shown to suffer from the overestimation of the action values (van Hasselt, 2010; van Hasselt et al., 2016). This overoptimism stems from the fact that the same values are accessed in order to both select and evaluate actions. In the standard DQN algorithm (Mnih et al., 2013, 2015), a previous version of the current Q-network, called the target network, is used to select the next greedy action involved in the Q-learning updates. To address the overoptimism in the Q-value estimations, van Hasselt et al. (2016) propose the Double DQN (DDQN) algorithm that uses the current Q-network to select the next greedy action, but evaluates it using the target network.
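The decoupling of action selection from action evaluation can be sketched as follows. This is a minimal NumPy illustration, not a reference implementation: the function name, the array layout, and the masking of terminal transitions via a `dones` flag are assumptions made for the example.

```python
import numpy as np


def ddqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    """Double DQN targets for a batch of transitions.

    q_online_next, q_target_next: arrays of shape (batch, num_actions) holding
    the next-state action values of the current and target networks.
    """
    # Select the greedy next action with the current (online) network ...
    greedy = q_online_next.argmax(axis=1)
    # ... but evaluate that action with the target network.
    evaluated = q_target_next[np.arange(len(greedy)), greedy]
    # Assumption: terminal transitions (dones == 1) do not bootstrap.
    return rewards + gamma * (1.0 - dones) * evaluated


targets = ddqn_targets(
    rewards=np.array([1.0, 0.0]),
    dones=np.array([0.0, 1.0]),
    q_online_next=np.array([[1.0, 2.0], [3.0, 0.0]]),
    q_target_next=np.array([[10.0, 20.0], [30.0, 40.0]]),
)
```

In the first transition the online network prefers action 1, so the target bootstraps from the target network's value of 20 for that action rather than from the target network's own maximum.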
Prioritized Replay.
Experience replay enables online, off-policy reinforcement learning agents to reuse past experiences or demonstrations. In the standard DQN algorithm, the experience transitions are sampled uniformly from a replay buffer. To enable more efficient learning from the experience transitions, Schaul et al. (2016) propose a framework for prioritizing experience in order to replay important experience transitions, which have a high expected learning progress, more frequently.
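One common instantiation is the proportional variant of Schaul et al. (2016), in which a transition is sampled with probability proportional to its priority raised to a power, and importance-sampling weights correct for the resulting bias. The sketch below assumes this variant; the function name and the default exponents `alpha` and `beta` are illustrative choices, not values taken from the experiments.

```python
import numpy as np


def sample_prioritized(priorities, batch_size, alpha=0.6, beta=0.4, rng=None):
    """Sample transition indices in proportion to priority ** alpha.

    Returns the sampled indices and their importance-sampling weights,
    normalized so that the largest weight in the batch is 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    scaled = np.asarray(priorities, dtype=float) ** alpha
    probs = scaled / scaled.sum()
    idx = rng.choice(len(probs), size=batch_size, p=probs)
    weights = (len(probs) * probs[idx]) ** (-beta)
    return idx, weights / weights.max()


idx, weights = sample_prioritized(
    [1.0, 2.0, 3.0, 4.0], batch_size=8, rng=np.random.default_rng(0)
)
```

Setting `alpha` to 0 recovers uniform sampling, as in the standard DQN algorithm.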
Dueling Network Architecture.
The dueling network architecture (Wang et al., 2016) explicitly separates the representation of the state value and the (state-dependent) action advantages into two separate branches while sharing a common feature-learning module among them. The two branches are combined, via a special aggregating layer, to produce an estimate of the action-value function. By training this network with no considerations beyond those used for the DQN algorithm, the dueling network automatically produces separate estimates of the state value and advantage functions. Wang et al. (2016) introduce multiple aggregation methods for combining the state value and advantages. They demonstrate that subtracting the mean of the advantages from each individual advantage and then summing them with the state value results in improved learning stability when compared to the naïve summation of the state value and advantages. The dueling network architecture has been shown to lead to better policy evaluation in the presence of many similar-valued (or redundant) actions, and thus achieves faster generalization over large action spaces.
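The mean-subtraction aggregation can be sketched in a few lines of NumPy. The function name and array shapes are assumptions made for this illustration.

```python
import numpy as np


def dueling_q(value, advantages):
    """Combine a state value with mean-centered advantages.

    value: shape (batch,); advantages: shape (batch, num_actions).
    """
    centered = advantages - advantages.mean(axis=1, keepdims=True)
    return value[:, None] + centered


q = dueling_q(np.array([1.0]), np.array([[1.0, 2.0, 3.0]]))
print(q)  # [[0. 1. 2.]]
```

With this aggregation, the mean of the resulting action values over the actions equals the state value.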
4.2 Methods
Here we introduce various methods for adapting the DQN algorithm, as well as its notable extensions that were explained earlier, into the action branching architecture. For brevity, we mainly focus on the methods that result in our best-performing DQN-based agent, BDQ.
Common State-Value Estimator.
As demonstrated in the action branching network of Figure 2, BDQ uses a common state-value estimator for all action branches. This approach, which can be thought of as an adaptation of the dueling network into the action branching architecture, generally yields better performance. The use of the dueling architecture with action branching is a particularly interesting augmentation for learning in large action spaces. This is because the dueling architecture can more rapidly identify action redundancies and generalize more efficiently by learning a general value that is shared across many similar actions. In order to adapt the dueling architecture into our action branching network, we distribute the representation of the (state-dependent) action advantages across the several action branches while adding a single additional branch for estimating the state-value function. Similar to the dueling architecture, the advantages and the state value are combined, via a special aggregating layer, to produce estimates of the distributed action values. We experimented with several aggregation methods and our best-performing method is to locally subtract each branch's mean advantage from its sub-action advantages, prior to their summation with the state value. Formally, for an action dimension $d \in \{1, \dots, N\}$ with $|\mathcal{A}_d| = n$ discrete sub-actions, the individual branch's Q-value at state $s \in \mathcal{S}$ and sub-action $a_d \in \mathcal{A}_d$ is expressed in terms of the common state value $V(s)$ and the corresponding (state-dependent) sub-action advantage $A_d(s, a_d)$ by:
$$Q_d(s, a_d) = V(s) + \Big( A_d(s, a_d) - \frac{1}{n} \sum_{a'_d \in \mathcal{A}_d} A_d(s, a'_d) \Big). \tag{1}$$
We realize that this aggregation method does not resolve the lack of identifiability for which the maximum and the average reduction methods were originally proposed (Wang et al., 2016). However, based on our experimentation, this method yields better performance than both the naïve alternative,
$$Q_d(s, a_d) = V(s) + A_d(s, a_d), \tag{2}$$
and the local maximum reduction method, which replaces the averaging operator in Equation 1 with a maximum operator:
$$Q_d(s, a_d) = V(s) + \Big( A_d(s, a_d) - \max_{a'_d \in \mathcal{A}_d} A_d(s, a'_d) \Big). \tag{3}$$
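The three aggregation variants discussed above (local mean reduction, naïve summation, and local maximum reduction) can be compared side by side in a short NumPy sketch. The function name, the `reduce` argument, and the array layout are assumptions made for this illustration.

```python
import numpy as np


def branch_q_values(value, advantages, reduce="mean"):
    """Combine a common state value with per-branch advantages.

    value: shape (batch,); advantages: shape (batch, N, n) for N branches
    with n sub-actions each. Returns Q-values of shape (batch, N, n).
    """
    if reduce == "mean":      # subtract each branch's own mean advantage
        offset = advantages.mean(axis=2, keepdims=True)
    elif reduce == "max":     # subtract each branch's own maximum advantage
        offset = advantages.max(axis=2, keepdims=True)
    elif reduce == "naive":   # plain summation, no reduction
        offset = 0.0
    else:
        raise ValueError(f"unknown reduce mode: {reduce}")
    return value[:, None, None] + (advantages - offset)


value = np.array([2.0])
adv = np.array([[[0.0, 1.0, 2.0], [4.0, 4.0, 7.0]]])  # one state, N = 2, n = 3
q_mean = branch_q_values(value, adv, "mean")
q_max = branch_q_values(value, adv, "max")
q_naive = branch_q_values(value, adv, "naive")
```

The reduction is applied within each branch separately (along the last axis), which is what makes it local.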
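Temporal-Difference Target.
We tried several different methods for generating the temporal-difference (TD) targets for the DQN updates. A simple approach is to calculate a TD target, similar to that in DDQN, for each individual action dimension separately:
$$y_d = r + \gamma \, Q_d^{-}\Big(s', \arg\max_{a'_d \in \mathcal{A}_d} Q_d(s', a'_d)\Big), \tag{4}$$
with $Q_d^{-}$ denoting branch $d$ of the target network $Q^{-}$, $r$ the reward, $\gamma$ the discount factor, and $s'$ the next state. Alternatively, the maximum DDQN-based TD target over the action branches may be set as a single global learning target for all action dimensions:
$$y = r + \gamma \max_{d} Q_d^{-}\Big(s', \arg\max_{a'_d \in \mathcal{A}_d} Q_d(s', a'_d)\Big). \tag{5}$$
The best-performing method, also used for BDQ, replaces the maximum operator in Equation 5 with a mean operator:
$$y = r + \gamma \frac{1}{N} \sum_{d} Q_d^{-}\Big(s', \arg\max_{a'_d \in \mathcal{A}_d} Q_d(s', a'_d)\Big). \tag{6}$$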
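The three target variants (per-branch, global maximum, and global mean) can be sketched together in NumPy. The function name, the `mode` argument, and the masking of terminal transitions via `dones` are assumptions made for this illustration.

```python
import numpy as np


def bdq_td_targets(rewards, dones, q_online_next, q_target_next,
                   gamma=0.99, mode="mean"):
    """TD targets of shape (batch, N) for N action branches.

    q_online_next, q_target_next: next-state branch Q-values of the current
    and target networks, each of shape (batch, N, n).
    """
    # Per branch: select with the current network, evaluate with the target.
    greedy = q_online_next.argmax(axis=2)
    evaluated = np.take_along_axis(
        q_target_next, greedy[..., None], axis=2)[..., 0]
    if mode == "per_branch":   # a separate target for every branch
        bootstrap = evaluated
    elif mode == "max":        # one global target: maximum over branches
        bootstrap = np.broadcast_to(
            evaluated.max(axis=1, keepdims=True), evaluated.shape)
    elif mode == "mean":       # one global target: mean over branches
        bootstrap = np.broadcast_to(
            evaluated.mean(axis=1, keepdims=True), evaluated.shape)
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Assumption: terminal transitions (dones == 1) do not bootstrap.
    return rewards[:, None] + gamma * (1.0 - dones[:, None]) * bootstrap


q_on = np.array([[[1.0, 2.0], [5.0, 0.0]]])       # greedy sub-actions: 1 and 0
q_tg = np.array([[[10.0, 20.0], [30.0, 40.0]]])   # evaluated values: 20 and 30
r, done = np.array([1.0]), np.array([0.0])
y_branch = bdq_td_targets(r, done, q_on, q_tg, gamma=0.5, mode="per_branch")
y_max = bdq_td_targets(r, done, q_on, q_tg, gamma=0.5, mode="max")
y_mean = bdq_td_targets(r, done, q_on, q_tg, gamma=0.5, mode="mean")
```

In the two global modes every branch receives the same target value, so the returned array simply repeats it along the branch axis.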
Loss Function.
There exist numerous ways by which the distributed TD errors across the branches can be aggregated to specify a loss. A simple approach is to define the loss to be the expected value of a function of the averaged TD errors across the branches. However, because these errors are signed, they can cancel out when summed, which generally reduces the magnitude of the loss. To overcome this, the loss can be specified as the expected value of a function of the averaged absolute TD errors across the branches. In practice, we found that defining the loss to be the expected value of the mean squared TD error across the branches mildly enhances the performance:
$$L = \mathbb{E}_{(s, \mathbf{a}, r, s') \sim \mathcal{D}} \Big[ \frac{1}{N} \sum_{d} \big( y_d - Q_d(s, a_d) \big)^2 \Big], \tag{7}$$
where $\mathcal{D}$ denotes a (prioritized) replay buffer, $\mathbf{a}$ denotes the joint-action tuple $(a_1, a_2, \dots, a_N)$, and $y_d$ denotes the TD target used for branch $d$ (the same value for every branch when a single global target is used).
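The cancellation issue and the mean-squared alternative can be illustrated with a small NumPy sketch. The function name and the optional importance-sampling `weights` argument are assumptions made for this example.

```python
import numpy as np


def bdq_loss(q_taken, targets, weights=None):
    """Mean squared TD error across branches, averaged over the batch.

    q_taken, targets: shape (batch, N), holding the Q-value of the executed
    sub-action and the TD target for each of the N branches.
    weights: optional per-transition weights of shape (batch,).
    """
    td_errors = targets - q_taken
    per_transition = (td_errors ** 2).mean(axis=1)
    if weights is not None:
        per_transition = weights * per_transition
    return per_transition.mean()


q_taken = np.array([[0.0, 0.0]])
targets = np.array([[1.0, -1.0]])
averaged_td = (targets - q_taken).mean(axis=1)  # signed errors cancel out
loss = bdq_loss(q_taken, targets)               # squared errors do not
```

Here the two branch errors are +1 and -1: their average is zero, while the mean squared error is 1.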
Error for Experience Prioritization.
Adapting the prioritized replay into the action branching architecture requires an appropriate method for aggregating the distributed TD errors (of a single transition) into a unified one. This error is then used by the replay memory to calculate the transition’s priority. In order to preserve the magnitudes of the errors, for BDQ, we specify the unified prioritization error to be the sum across a transition’s absolute, distributed TD errors:
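$$e_{\mathcal{D}}(s, \mathbf{a}, r, s') = \sum_{d} \big| y_d - Q_d(s, a_d) \big|, \tag{8}$$
where $e_{\mathcal{D}}(s, \mathbf{a}, r, s')$ denotes the error used for prioritization of the transition tuple $(s, \mathbf{a}, r, s')$.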
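The unified prioritization error described above, the sum of a transition's absolute TD errors over the branches, can be sketched as follows. The function name and array layout are assumptions made for this illustration.

```python
import numpy as np


def prioritization_error(q_taken, targets):
    """Sum of absolute TD errors across branches, one value per transition.

    q_taken, targets: shape (batch, N) for N action branches.
    """
    return np.abs(targets - q_taken).sum(axis=1)


errors = prioritization_error(
    q_taken=np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]),
    targets=np.array([[0.5, 2.0, -3.0], [1.0, 1.0, 1.0]]),
)
print(errors)  # [5.5 0. ]
```

Because absolute values are summed rather than averaged, errors of opposite sign do not cancel and their magnitudes are preserved.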
Gradient Rescaling.
During the backward pass, since all branches backpropagate gradients through the shared network module, we rescale the combined gradient prior to entering the deepest layer in the shared network module by .
5 Experiments
We evaluate the performance of the proposed BDQ agent on several challenging continuous control environments of varying action dimensionality and complexity. These environments are simulated using the MuJoCo physics engine (Todorov et al., 2012). We first study the performance of BDQ against its standard non-branching variant, Dueling DDQN, on a set of custom reaching tasks with increasing degrees of freedom and under two discretization granularities. We then compare the performance of BDQ against a state-of-the-art continuous control algorithm, Deep Deterministic Policy Gradient (DDPG), on a set of standard continuous control manipulation and locomotion benchmark domains from OpenAI's MuJoCo Gym collection (Brockman et al., 2016; Duan et al., 2016). We also compare BDQ against a fully independent alternative, Independent Dueling Q-Network (IDQ), in order to verify our hypothesis regarding the significance of the shared network module in coordinating the distributed policies. To make the continuous-action domains compatible with the discrete-action algorithms in our study (i.e. BDQ, Dueling DDQN, and IDQ), in both sets of experiments, we discretize each action dimension $d \in \{1, \dots, N\}$, in the underlying continuous action space, into $n$ equally spaced values, yielding a discrete combinatorial action space of $n^N$ possible actions.
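The discretization step can be sketched as a per-dimension lookup table built with `np.linspace`. The function names and the example action bounds are hypothetical; this is an illustration of equally spaced discretization, not the code used in the experiments.

```python
import numpy as np


def make_action_grid(low, high, n):
    """Build an (N, n) table of n equally spaced values per action dimension."""
    return np.stack([np.linspace(lo, hi, n) for lo, hi in zip(low, high)])


def to_continuous(grid, indices):
    """Map one sub-action index per dimension to a continuous action vector."""
    return grid[np.arange(grid.shape[0]), indices]


# Hypothetical bounds for a 2-dimensional action space, 5 values per dimension.
grid = make_action_grid(low=[-1.0, -2.0], high=[1.0, 2.0], n=5)
action = to_continuous(grid, [0, 3])
print(action)  # [-1.  1.]
```

An agent with one branch per dimension only needs to output the index vector; the table converts it into the continuous command sent to the environment.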
5.1 Custom N-Dimensional Action-Space Problems
We begin by comparing the performance of BDQ against its standard non-branching variant, the Dueling DDQN agent, on a set of physical manipulation tasks with increasing action dimensionality (see Figure 3). These tasks are custom variants of the standard Reacher-v1 task (from OpenAI's MuJoCo Gym collection) that feature more actuated joints (i.e. ) with constraints on their ranges of motion to prevent collision between segments. Unlike the original Reacher-v1 domain, reaching the target position immediately terminates an episode without the need to decelerate and maintain position at the target. This was done to simplify these complex control tasks (as a result of more frequently experienced episodic successes) in order to allow faster experimentation. We consider two discretization resolutions resulting in and sub-actions per joint. This is done in order to examine the impact of finer granularity, or equivalently more discrete sub-actions per action dimension, with increasing degrees of freedom. The general idea is to empirically study the effectiveness of action branching in the face of increasing action-space dimensionality as compared to the standard non-branching variant. Therefore, the tasks are designed to have sufficiently small action spaces for a standard non-branching algorithm to still be tractable for the purpose of our evaluations.
Figure caption (fragment): runs with random initialization seeds, while shaded areas show the standard deviations. Evaluations were conducted every episodes of training for episodes with a greedy policy.
The performances are summarized in Figure 4. The results show that in the low-dimensional reaching task with , all agents learn at about the same rate, with slightly steeper learning curves towards the end for Dueling DDQN. In the task with , we see that the Dueling DDQN agent with starts off less efficiently (i.e. with a slower learning curve) than its corresponding BDQ agent, but eventually converges and outperforms both BDQ agents in their final performance. However, in the same task, the Dueling DDQN agent with shows a significantly less efficient learning performance against its BDQ counterpart. In the high-dimensional reaching task with , we see that the Dueling DDQN agent with performs rather poorly in terms of its sample efficiency. For this task, we were unable to run the Dueling DDQN agent with since running it was computationally expensive, due to the large number of actions that need to be explicitly represented by its network (i.e. ) and consequently the extremely large number of network parameters that need to be trained at every iteration. In contrast, in the same task, we see that BDQ performs well and converges to good policies with robustness against the discretization granularity.
5.2 Standard Continuous Control Benchmarks
Here we evaluate the performance of BDQ on a set of standard continuous control benchmark domains from OpenAI's MuJoCo Gym collection. Figure 5 shows sample illustrations of the environments used in our experiments. We compare the performance of BDQ against a state-of-the-art continuous-action reinforcement learning algorithm, DDPG, as well as against a completely independent agent, IDQ. For all environments, we evaluate the performance of BDQ with two different discretization resolutions resulting in and sub-actions per degree of freedom. We do this to compare the relative performance of BDQ for the same environments with substantially larger discrete action spaces. Where feasible (i.e. Reacher-v1 and Hopper-v1), we also run the Dueling DDQN agent with .
The results in Figure 6 show that IDQ's performance quickly deteriorates with increasing action dimensionality, while BDQ continues to perform competitively against DDPG. Interestingly, BDQ significantly outperforms DDPG in the most challenging domain, the Humanoid-v1 task which involves action dimensions, leading to a combinatorial action space of approximately possible actions for . Our ablation study on BDQ (with a shared network module) and IDQ (no shared network module) verifies the significance of the shared decision module in coordinating the distributed policies, and thus enabling the BDQ agent to progress in learning and to converge to good policies in a stable manner. Remarkably, performing competitively against a state-of-the-art continuous control algorithm in such high-dimensional domains was previously considered intractable for discrete-action algorithms (Lillicrap et al., 2016; Schulman et al., 2017). However, in the simpler tasks DDPG performs better than or on par with BDQ. We think a potential explanation for this could be the use of a specialized exploration noise process by DDPG which, due to its temporally correlated nature, enables effective exploration in domains with momentum.
By comparing the performance of BDQ for and , we see that, despite the significant difference in the total number of possible actions, the proposed agent continues to learn rather efficiently and converges to similar final performance levels. An interesting point to note is the exceptional performance of Dueling DDQN for in Reacher-v1. Yet, increasing the action dimensionality by only one degree of freedom (from in Reacher-v1 to in Hopper-v1) renders the Dueling DDQN agent ineffective.
Finally, it is noteworthy that BDQ is highly robust against the specifications of the TD target and loss function, while its performance deteriorates substantially when the prioritized replay is ablated. Characterizing the role of the prioritized experience replay in stabilizing the learning process for action branching networks remains the subject of future research.
6 Conclusion
We introduced a novel neural network architecture that distributes the representation of the policy or the value function over several network branches while maintaining a shared network module for enabling a form of implicit centralized coordination. We adapted the DQN algorithm, along with several of its most notable extensions, into the proposed action branching architecture. We illustrated the effectiveness of the proposed architecture in enabling the application of a currently restricted discrete-action algorithm to domains with high-dimensional discrete or continuous action spaces, which was previously thought intractable. We believe that the highly promising performance of the action branching architecture in scaling DQN, together with its potential generality, warrants further theoretical and empirical investigation.
References
 LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015. URL http://dx.doi.org/10.1038/nature14539.
 Schmidhuber (2015) Jürgen Schmidhuber. Deep learning in neural networks: an overview. Neural Networks, 61:85–117, 2015. URL https://doi.org/10.1016/j.neunet.2014.09.003.
 Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
 Bertsekas and Tsitsiklis (1996) Dimitri P. Bertsekas and John Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
 Sutton and Barto (1998) Richard S Sutton and Andrew G Barto. Reinforcement Learning: an Introduction. MIT Press, 1998.
 Szepesvari (2010) Csaba Szepesvari. Algorithms for Reinforcement Learning. Morgan and Claypool, 2010.
 Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015. URL http://dx.doi.org/10.1038/nature14236.
 Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016. URL http://dx.doi.org/10.1038/nature16961.
 Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017. URL http://dx.doi.org/10.1038/nature24270.
 Lillicrap et al. (2016) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.
 Watkins and Dayan (1992) Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279–292, 1992. URL http://dx.doi.org/10.1007/BF00992698.
 Hessel et al. (2017) Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298, 2017.
 Gu et al. (2016) Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, pages 2829–2838, 2016.
 Gu et al. (2017) Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E Turner, and Sergey Levine. Q-Prop: sample-efficient policy gradient with an off-policy critic. In International Conference on Learning Representations, 2017.
 Wang et al. (2016) Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, pages 1995–2003, 2016.
 Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

 Matignon et al. (2012) Laetitia Matignon, Guillaume J Laurent, and Nadine Le Fort-Piat. Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems. The Knowledge Engineering Review, 27(1):1–31, 2012.
 Wei and Luke (2016) Ermo Wei and Sean Luke. Lenient learning in independent-learner stochastic cooperative games. Journal of Machine Learning Research, 17(84):1–42, 2016. URL http://jmlr.org/papers/v17/15417.html.
 Godfrey-Smith (2016) Peter Godfrey-Smith. Other minds: the octopus, the sea, and the deep origins of consciousness. Farrar, Straus and Giroux, 2016.
 Dulac-Arnold et al. (2015) Gabriel Dulac-Arnold, Richard Evans, Hado van Hasselt, Peter Sunehag, Timothy Lillicrap, Jonathan Hunt, Timothy Mann, Theophane Weber, Thomas Degris, and Ben Coppin. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679, 2015.
 Metz et al. (2017) Luke Metz, Julian Ibarz, Navdeep Jaitly, and James Davidson. Discrete sequential prediction of continuous actions for deep reinforcement learning. arXiv preprint arXiv:1705.05035, 2017.
 Tampuu et al. (2017) Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru, Jaan Aru, and Raul Vicente. Multiagent cooperation and competition with deep reinforcement learning. PLOS ONE, 12(4):1–15, 2017. URL https://doi.org/10.1371/journal.pone.0172395.
 van Hasselt (2010) Hado van Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems, pages 2613–2621, 2010.

van Hasselt et al. (2016)
Hado van Hasselt, Arthur Guez, and David Silver.
Deep reinforcement learning with double Qlearning.
In
AAAI Conference on Artificial Intelligence
, pages 2094–2100, 2016.  Schaul et al. (2016) Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In International Conference on Learning Representations, 2016.
 Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: a physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012.
 Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
 Duan et al. (2016) Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016.
 Hesse et al. (2017) Christopher Hesse, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. OpenAI Baselines. https://github.com/openai/baselines, 2017.
 Kingma and Ba (2015) Diederik Kingma and Jimmy Ba. Adam: a method for stochastic optimization. In International Conference on Learning Representations, 2015.
 Glorot et al. (2011) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics, pages 315–323, 2011.
 Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
 Uhlenbeck and Ornstein (1930) G. E. Uhlenbeck and L. S. Ornstein. On the theory of the Brownian motion. Physical Review, 36(5):823–841, 1930. URL https://link.aps.org/doi/10.1103/PhysRev.36.823.
Appendix A Experiment Details
Here we provide information about the technical details and hyperparameters used for training the agents in our experiments. Common to all agents, training always started after the first
steps and, thereafter, we ran one step of training at every time step. We did not perform tuning of the reward scaling parameter for either of the algorithms and, instead, used each domain’s raw rewards. We used the OpenAI Baselines [Hesse et al., 2017] implementation of DQN as the basis for the development of all the DQN-based agents.
a.1 Branching Dueling Q-Network
We used the Adam optimizer [Kingma and Ba, 2015] with a learning rate of , , and . We trained with a minibatch size of and a discount factor . The target network was updated every
time steps. We used the rectified nonlinearity (or ReLU)
[Glorot et al., 2011] for all hidden layers and linear activation on the output layers. The network had two hidden layers with and units in the shared network module and one hidden layer per branch with units. The weights were initialized using Xavier initialization [Glorot and Bengio, 2010], and the biases were initialized to zero. A gradient clipping of size
was applied. We used the prioritized replay with a buffer size of and hyperparameters , , , and .
While an ε-greedy policy is often used with Q-learning, random exploration (with an exploration probability) in physical, continuous-action domains can be inefficient. To explore well in physical environments with momentum, such as those in our experiments, DDPG uses an Ornstein-Uhlenbeck process [Uhlenbeck and Ornstein, 1930], which creates temporally correlated exploration noise centered around the output of its deterministic policy. Applying such a noise process to discrete-action algorithms is, nevertheless, somewhat nontrivial. For BDQ, we decided to sample actions from a Gaussian distribution with its mean at the greedy actions and with a small fixed standard deviation throughout the training to encourage lifelong exploration. We used a fixed standard deviation of
during training and zero during evaluation. This exploration strategy yielded mildly better performance than an ε-greedy policy with a fixed or linearly annealed exploration probability. For the custom reaching domains, however, we used an ε-greedy policy with a linearly annealed exploration probability, similar to that commonly used for Dueling DDQN.
a.2 Dueling Double Deep Q-Network
We generally used the same hyperparameters as for BDQ. The gradients from the dueling streams were rescaled by prior to entering the shared feature module as recommended by Wang et al. [2016]. As in the best-performing agent reported by Wang et al. [2016], the average aggregation method was used to combine the state value and advantages. We experimented with both a Gaussian exploration policy and an ε-greedy exploration policy with a linearly annealed exploration probability, and observed moderately better performance for the linearly annealed ε-greedy strategy. Therefore, in our experiments we used the latter.
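The two exploration strategies compared in this appendix can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the default start and end probabilities, and in particular the way the Gaussian sample is mapped back to a discrete sub-action (perturb the greedy bin's continuous value, clip, then pick the nearest bin) are assumptions made for the example.

```python
import random


def linear_epsilon(step, total_steps, eps_start=1.0, eps_end=0.05):
    """Linearly anneal the exploration probability, then hold it at eps_end.

    eps_start and eps_end are hypothetical defaults, not values from the paper.
    """
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return eps_start + fraction * (eps_end - eps_start)


def greedy_index(q_values):
    """Index of the sub-action with the largest Q-value."""
    return max(range(len(q_values)), key=lambda i: q_values[i])


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random sub-action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_index(q_values)


def gaussian_around_greedy(q_values, sigma, low=-1.0, high=1.0, rng=random):
    """Sample around the greedy sub-action (assumed mapping, needs at least 2 bins).

    The greedy bin is converted to its continuous value in [low, high],
    perturbed with Gaussian noise of standard deviation sigma, clipped,
    and mapped back to the nearest bin index.
    """
    n = len(q_values)
    spacing = (high - low) / (n - 1)
    value = low + greedy_index(q_values) * spacing
    noisy = min(max(rng.gauss(value, sigma), low), high)
    return round((noisy - low) / spacing)
```

For an agent with several action dimensions, either function would be applied once per dimension to that dimension's Q-values. Setting `sigma` to zero recovers the greedy action, which matches the use of zero noise during evaluation.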
a.3 Independent Dueling Q-Network
Once more, we generally used the same hyperparameters as for BDQ. Similarly, the same number of hidden layers and hidden units per layer were used for each independent network, the difference being that the first two hidden layers were not shared across the networks (as they were for BDQ). The dueling architecture was applied to each network independently (i.e., each network had its own state-value estimator). This agent serves as a baseline for investigating the significance of the shared decision module in the proposed action branching architecture.
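The difference between the two dueling setups can be illustrated with the average aggregation method of Wang et al. [2016], in which each Q-value is the state value plus the advantage minus the mean advantage. The sketch below is illustrative only: the function names are hypothetical, and the assumption that BDQ reuses a single state value across all branches is inferred from the contrast drawn above rather than stated in this appendix.

```python
def dueling_q_values(state_value, advantages):
    """Average aggregation: Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]


def shared_value_branches(state_value, branch_advantages):
    """Assumed BDQ-style setup: every branch reuses one shared state value."""
    return [dueling_q_values(state_value, adv) for adv in branch_advantages]


def independent_networks(state_values, branch_advantages):
    """Independent setup: each network supplies its own state-value estimate."""
    return [
        dueling_q_values(value, adv)
        for value, adv in zip(state_values, branch_advantages)
    ]
```

In both cases the greedy sub-action within a dimension depends only on that dimension's advantages, since the state value and the mean advantage shift all of its Q-values equally.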
a.4 Deep Deterministic Policy Gradient
Appendix B Environment Details
Table 1 lists the dimensionality of the standard benchmark domains from OpenAI’s MuJoCo Gym collection that were used in our experiments. The values provided are calculated for the specific case of , the finest granularity that we experimented with.
Domain  dim()  

Reacher-v1  
Hopper-v1  
Walker2d-v1  
Humanoid-v1 
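
The growth rates that motivate the branching architecture can be made concrete with a short calculation. The sketch below is illustrative: the symbols N (number of action dimensions) and n (number of discrete sub-actions per dimension), the function names, and the example values are assumptions for the example and are not taken from Table 1. The branching count covers only the per-dimension outputs and excludes any additional state-value output.

```python
def joint_action_count(num_dims, num_subactions):
    """Actions a conventional discrete-action agent must represent: n ** N."""
    return num_subactions ** num_dims


def branching_output_count(num_dims, num_subactions):
    """Outputs needed with one branch per action dimension: N * n."""
    return num_dims * num_subactions


if __name__ == "__main__":
    # Hypothetical settings chosen only to show the trend.
    for num_dims in (2, 4, 8):
        for num_subactions in (5, 9):
            print(
                f"N={num_dims} n={num_subactions} "
                f"joint={joint_action_count(num_dims, num_subactions)} "
                f"branching={branching_output_count(num_dims, num_subactions)}"
            )
```

The joint count grows exponentially in N, while the branching count grows linearly, which is why finer discretizations remain tractable only in the second case.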