1 Introduction
The ability to solve complex control tasks with high-dimensional input and action spaces is a key milestone in developing real-world artificial intelligence. The use of reinforcement learning to solve these types of tasks has exploded following the work of the Deep Q Network (DQN) algorithm (Mnih et al., 2015), capable of human-level performance on many Atari games. Similarly, groundbreaking achievements have been made in classical games such as Go (Silver et al., 2016). However, these algorithms are restricted to problems with a finite number of discrete actions.
In control tasks, commonly seen in the robotics domain, continuous action spaces are the norm. For algorithms such as DQN the policy is only implicitly defined in terms of its value function, with actions selected by maximizing this function. In the continuous control domain this would require either a costly optimization step or discretization of the action space. While discretization is perhaps the most straightforward solution, it can prove a particularly poor approximation in high-dimensional settings or those that require finer-grained control. Instead, a more principled approach is to parameterize the policy explicitly and directly optimize the long-term value of following this policy.
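To make the cost of discretization concrete, the following sketch counts the joint actions produced when each action dimension is split into a fixed number of bins. The function name and the choice of 11 bins per dimension are illustrative assumptions rather than values taken from this work.

```python
def num_discrete_actions(bins_per_dim: int, action_dims: int) -> int:
    """Size of the joint action set when every dimension is binned independently."""
    if bins_per_dim < 1 or action_dims < 0:
        raise ValueError("bins_per_dim must be >= 1 and action_dims >= 0")
    return bins_per_dim ** action_dims


# Hypothetical grid of 11 bins per dimension for a few action-space sizes.
for dims in (1, 6, 21):
    print(dims, num_discrete_actions(11, dims))
```

Even a coarse grid quickly becomes too large to maximize over exhaustively once the action space has more than a handful of dimensions.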
In this work we consider a number of modifications to the Deep Deterministic Policy Gradient (DDPG) algorithm (Lillicrap et al., 2015). This algorithm, which is at its core an off-policy actor-critic method, has several properties that make it ideal for the enhancements we consider. In particular, the policy gradient used to update the actor network depends only on a learned critic. This means that any improvements to the critic learning procedure will directly improve the quality of the actor updates. In this work we utilize a distributional (Bellemare et al., 2017) version of the critic update which provides a better, more stable learning signal. Such distributions model the randomness due to intrinsic factors; among these is the inherent uncertainty imposed by function approximation in a continuous environment. We will see that using this distributional update directly results in better gradients and hence improves the performance of the learning algorithm.
Because DDPG is capable of learning off-policy, it is also possible to modify the way in which experience is gathered. In this work we utilize this fact to run many actors in parallel, all feeding into a single replay table. This allows us to seamlessly distribute the task of gathering experience, which we implement using the Ape-X framework (Horgan et al., 2018). This results in significant savings in terms of wall-clock time for difficult control tasks. We will also introduce a number of small improvements to the DDPG algorithm, and in our experiments will show the individual contributions of each component. Finally, this algorithm, which we call the Distributed Distributional DDPG algorithm (D4PG), obtains state-of-the-art performance across a wide variety of control tasks, including hard manipulation and locomotion tasks.
1.1 Related work
Historically, estimation of the policy gradient has relied on the likelihood ratio trick (see e.g. Glynn, 1990), more commonly known as REINFORCE (Williams, 1992) in the reinforcement learning community. Modern variants of these so-called “vanilla” policy gradient methods include the work of Mnih et al. (2016). Alternatively, one can consider second-order or “natural” variants of this objective, a set of techniques that include e.g. the Natural Actor-Critic (Peters & Schaal, 2008) and Trust Region Policy Optimization (TRPO) (Schulman et al., 2015) algorithms. More recently Proximal Policy Optimization (PPO) (Schulman et al., 2017), which can be seen as an approximation of TRPO, has proven very effective in large-scale distributed settings. Often, however, algorithms of this form are restricted to learning on-policy, which can limit the amount of data reuse as well as restrict the types of policies that are used for exploration.
The Deterministic Policy Gradient (DPG) algorithm (Silver et al., 2014) upon which this work is based starts from a different set of ideas, namely the policy gradient theorem of Sutton et al. (2000). The deterministic policy gradient theorem builds upon this earlier approach, but replaces the stochastic policy with one that includes no randomness. This approach is particularly important because it had previously been believed that the deterministic policy gradient did not exist in a model-free setting. The form of this gradient is also interesting in that it does not require one to integrate over the action space, and hence may require fewer samples to learn. DPG was later built upon by Lillicrap et al. (2015), who extended this algorithm and made use of a deep neural network as the function approximator, primarily as a mechanism for extending these results to work with vision-based inputs. Further, this entire endeavor lends itself very readily to an off-policy actor-critic architecture such that the actor’s gradients depend only on derivatives through the learned critic. This means that by improving estimation of the critic one is directly able to improve the actor gradients. Most interestingly, there have also been recent attempts to distribute updates for the DDPG algorithm (e.g. Popov et al., 2017), and more generally in this work we build on the work of Horgan et al. (2018) for implementing distributed actors.
Recently, Bellemare et al. (2017) showed that the distribution over returns, whose expectation is the value function, obeys a distributional Bellman equation. Although the idea of estimating a distribution over returns has been revisited before (Sobel, 1982; Morimura et al., 2010), Bellemare et al. demonstrated that this estimation alone was enough to achieve state-of-the-art results on the Atari 2600 benchmarks. Crucially, this technique achieves these gains by directly improving updates for the critic.
2 Background
In this work we consider a standard reinforcement learning setting wherein an agent interacts with an environment in discrete time. At each timestep $t$ the agent makes observations $x_t \in \mathcal{X}$, takes actions $a_t \in \mathcal{A}$, and receives rewards $r(x_t, a_t) \in \mathbb{R}$. Although we will in general make no assumptions about the inputs $\mathcal{X}$, we will assume that the environments considered in this work have real-valued actions $\mathcal{A} = \mathbb{R}^d$.
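In this standard setup, the agent’s behavior is controlled by a policy $\pi : \mathcal{X} \to \mathcal{A}$ which maps each observation to an action. The state-action value function, which describes the expected return conditioned on first taking action $a \in \mathcal{A}$ from state $x \in \mathcal{X}$ and subsequently acting according to $\pi$, is defined as
$$Q_\pi(x, a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(x_t, a_t)\Big], \quad x_0 = x,\ a_0 = a,\ x_t \sim p(\cdot \mid x_{t-1}, a_{t-1}),\ a_t = \pi(x_t), \tag{1}$$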
and is commonly used to evaluate the quality of a policy. While it is possible to derive an updated policy directly from $Q_\pi$, such an approach typically requires maximizing this function with respect to $a$ and is made complicated by the continuous action space. Instead we will consider a parameterized policy $\pi_\theta$ and maximize the expected value of this policy by optimizing $J(\theta) = \mathbb{E}[Q_{\pi_\theta}(x, \pi_\theta(x))]$. By making use of the deterministic policy gradient theorem (Silver et al., 2014) one can write the gradient of this objective as
$$\nabla_\theta J(\theta) \approx \mathbb{E}_\rho\Big[\nabla_\theta \pi_\theta(x)\, \nabla_a Q_{\pi_\theta}(x, a)\big|_{a = \pi_\theta(x)}\Big], \tag{2}$$
where $\rho$ is the state-visitation distribution associated with some behavior policy. Note that by letting the behavior policy differ from $\pi$ we are able to empirically evaluate this gradient using data gathered off-policy.
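As a small worked illustration of the deterministic policy gradient, the sketch below uses a scalar linear policy, pi_theta(x) = theta * x, and a hand-written critic Q(x, a) = -(a - 2x)^2. Both are assumptions chosen so that the two gradients can be written in closed form; they are not the networks used in this work. Each update averages the product of the policy gradient and the critic's action gradient, evaluated at the policy's own action, over a batch of sampled states.

```python
import random


def policy(theta: float, x: float) -> float:
    """Hypothetical deterministic linear policy pi_theta(x) = theta * x."""
    return theta * x


def dq_da(x: float, a: float) -> float:
    """Action gradient of the assumed critic Q(x, a) = -(a - 2x)^2."""
    return -2.0 * (a - 2.0 * x)


def dpg_estimate(theta: float, states: list) -> float:
    """Sample-based estimate of E[ d pi/d theta * dQ/da at a = pi_theta(x) ]."""
    total = 0.0
    for x in states:
        a = policy(theta, x)
        dpi_dtheta = x  # derivative of theta * x with respect to theta
        total += dpi_dtheta * dq_da(x, a)
    return total / len(states)


def train(theta: float = 0.0, lr: float = 0.1, steps: int = 200,
          batch: int = 32, seed: int = 0) -> float:
    """Gradient ascent on the policy parameter using the estimate above."""
    rng = random.Random(seed)
    for _ in range(steps):
        states = [rng.uniform(-1.0, 1.0) for _ in range(batch)]
        theta += lr * dpg_estimate(theta, states)
    return theta
```

Because the assumed critic is maximized by a = 2x, the parameter should approach 2. The states here are drawn from a fixed uniform distribution rather than from the policy itself, which mirrors the off-policy evaluation of the gradient.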
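While the exact gradient given by (2) assumes access to the true value function of the current policy, we can instead approximate this quantity with a parameterized critic $Q_w(x, a)$. By introducing the Bellman operator
$$(\mathcal{T}_\pi Q)(x, a) = r(x, a) + \gamma\, \mathbb{E}\big[Q(x', \pi(x')) \mid x, a\big], \tag{3}$$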
whose expectation is taken with respect to the next state $x'$, we can minimize the temporal difference (TD) error, i.e. the difference between the value function before and after applying the Bellman update. Typically the TD error will be evaluated under separate target policy and value networks, i.e. networks with separate parameters $(\theta', w')$, in order to stabilize learning. By taking the two-norm of this error we can write the resulting loss as
$$L(w) = \mathbb{E}_\rho\Big[\big(Q_w(x, a) - (\mathcal{T}_{\pi_{\theta'}} Q_{w'})(x, a)\big)^2\Big]. \tag{4}$$
In practice we will periodically replace the target networks with copies of the current network weights. Finally, by training a neural network policy using the deterministic policy gradient in (2) and training a deep neural network critic to minimize the TD error in (4) we obtain the Deep Deterministic Policy Gradient (DDPG) algorithm (Lillicrap et al., 2016). Here a sample-based approximation to these gradients is employed by using data gathered in some replay table.
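The critic update described above can be sketched in a few lines. The helper names, the dictionary standing in for network weights, and the copy period are assumptions for illustration only. The target value is computed from quantities produced by the target networks, and the online weights are copied into the target periodically.

```python
def td_target(reward: float, next_q_target: float, gamma: float) -> float:
    """One-step target r + gamma * Q_target(x', pi_target(x'))."""
    return reward + gamma * next_q_target


def td_loss(q_values, rewards, next_q_targets, gamma: float) -> float:
    """Mean squared TD error over a sampled batch."""
    errors = [
        q - td_target(r, nq, gamma)
        for q, r, nq in zip(q_values, rewards, next_q_targets)
    ]
    return sum(e * e for e in errors) / len(errors)


def maybe_sync(step: int, period: int, online: dict, target: dict) -> dict:
    """Return a copy of the online weights every `period` steps, else the old target."""
    if step % period == 0:
        return dict(online)
    return target
```

Holding the target weights fixed between copies keeps the regression target from moving with every gradient step.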
3 Distributed Distributional DDPG
The approach taken in this work starts from the DDPG algorithm and includes a number of enhancements. These extensions, which we will detail in this section, include a distributional critic update, the use of distributed parallel actors, $N$-step returns, and prioritization of the experience replay.
First, and perhaps most crucially, we consider the inclusion of a distributional critic as introduced in Bellemare et al. (2017). In order to introduce the distributional update we first revisit (1) in terms of the return as a random variable $Z_\pi$, such that $Q_\pi(x, a) = \mathbb{E}\, Z_\pi(x, a)$. The distributional Bellman operator can be defined as
$$(\mathcal{T}_\pi Z)(x, a) = r(x, a) + \gamma\, \mathbb{E}\big[Z(x', \pi(x')) \mid x, a\big], \tag{5}$$
where equality is with respect to the probability law of the random variables; note that this expectation is taken with respect to the distribution of $Z$ as well as the transition dynamics.
While the definition of this operator looks very similar to the canonical Bellman operator defined in (3), it differs in the types of functions it acts on. The distributional variant takes functions which map from state-action pairs to distributions, and returns a function of the same form. In order to use this function within the context of the actor-critic architecture introduced above, we must parameterize this distribution and define a loss similar to that of Equation 4. We will write the loss as
$$L(w) = \mathbb{E}_\rho\Big[d\big(\mathcal{T}_{\pi_{\theta'}} Z_{w'}(x, a),\, Z_w(x, a)\big)\Big] \tag{6}$$
for some metric $d$ that measures the distance between two distributions. Two components that can have a significant impact on the performance of this algorithm are the specific parameterization used for $Z_w$ and the metric $d$ used to measure the distributional TD error. In both cases we will give further details in Appendix A; in the experiments that follow we will use the Categorical distribution detailed in that section.
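We can complete this distributional policy gradient algorithm by including the action-value distribution inside the actor update from Equation 2. This is done by taking the expectation with respect to the action-value distribution, i.e.
$$\nabla_\theta J(\theta) \approx \mathbb{E}_\rho\Big[\nabla_\theta \pi_\theta(x)\, \nabla_a Q_w(x, a)\big|_{a = \pi_\theta(x)}\Big] = \mathbb{E}_\rho\Big[\nabla_\theta \pi_\theta(x)\, \mathbb{E}\big[\nabla_a Z_w(x, a)\big]\big|_{a = \pi_\theta(x)}\Big]. \tag{7}$$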
As before, this update can be empirically evaluated by replacing the outer expectation with a sample-based approximation.
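Next, we consider a modification to the DDPG update which utilizes $N$-step returns when estimating the TD error. This can be seen as replacing the Bellman operator with an $N$-step variant
$$(\mathcal{T}^N_\pi Q)(x_0, a_0) = r(x_0, a_0) + \mathbb{E}\Big[\sum_{n=1}^{N-1} \gamma^n r(x_n, a_n) + \gamma^N Q(x_N, \pi(x_N)) \,\Big|\, x_0, a_0\Big], \tag{8}$$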
where the expectation is with respect to the $N$-step transition dynamics. Although not used by Lillicrap et al. (2016), $N$-step returns are widely used in the context of many policy gradient algorithms (e.g. Mnih et al., 2016) as well as Q-learning variants (Hessel et al., 2017). This modification can be applied analogously to the distributional Bellman operator in order to make use of it when updating the distributional critic.
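For a single sampled trajectory, the multi-step (N-step) target reduces to a discounted sum of the first N rewards plus a discounted bootstrap value. The sketch below computes it by folding from the end of the trajectory; the function and argument names are assumptions for illustration.

```python
def n_step_target(rewards, bootstrap_value: float, gamma: float) -> float:
    """Sampled N-step target.

    rewards: [r_0, ..., r_{N-1}] observed along the trajectory.
    bootstrap_value: critic estimate Q(x_N, pi(x_N)) at the final state.
    """
    target = bootstrap_value
    for r in reversed(rewards):
        target = r + gamma * target
    return target
```

With a single reward the function returns the usual one-step target, so the same code path covers both settings.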
Finally, we also modify the standard training procedure in order to distribute the process of gathering experience. Note from Equations (2) and (4) that the actor and critic updates rely entirely on sampling from some state-visitation distribution $\rho$. We can parallelize this process by using $K$ independent actors, each writing to the same replay table. A learner process can then sample from some replay table of size $R$ and perform the necessary network updates using this data. Additionally, sampling can be implemented using non-uniform priorities $p_i$ as in Schaul et al. (2016). Note that this requires the use of importance sampling, implemented by weighting the critic update by a factor of $1/(R p_i)$. We implement this procedure using the Ape-X framework (Horgan et al., 2018) and refer the reader there for more details.
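The interaction between non-uniform sampling and the importance-sampling correction can be sketched as follows. This assumes that each stored item is sampled with probability equal to its normalized priority and that the correction weight is the reciprocal of the table size times that probability; the function name and this particular normalization are assumptions, and the actual Ape-X implementation may differ in detail.

```python
import random


def sample_prioritized(priorities, batch_size: int, rng: random.Random):
    """Sample indices in proportion to priority and return importance weights."""
    total = sum(priorities)
    probs = [p / total for p in priorities]
    table_size = len(priorities)
    indices = rng.choices(range(table_size), weights=probs, k=batch_size)
    # Weight 1 / (table_size * p_i) undoes the bias of non-uniform sampling.
    weights = [1.0 / (table_size * probs[i]) for i in indices]
    return indices, weights
```

When all priorities are equal the weights are exactly one, so the prioritized update reduces to the uniform one.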
Pseudocode for the D4PG algorithm, which includes all the above-mentioned modifications, can be found in Algorithm 1. Here the actor and critic parameters are updated using stochastic gradient descent with learning rates $\alpha_t$ and $\beta_t$ respectively, which are adjusted online using ADAM (Kingma & Ba, 2015). While this pseudocode focuses on the learning process, also shown is pseudocode for actor processes which in parallel fill the replay table with data.
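The division of labor between actors and the replay table can be illustrated with a small in-process sketch. It uses threads and a lock-protected buffer as a stand-in for the distributed setup; the class and function names are hypothetical, the transitions are placeholders, and none of this reflects the actual Ape-X implementation.

```python
import threading
from collections import deque


class ReplayTable:
    """Bounded, thread-safe buffer shared by all actors."""

    def __init__(self, capacity: int):
        self._data = deque(maxlen=capacity)  # oldest items are evicted first
        self._lock = threading.Lock()

    def add(self, transition) -> None:
        with self._lock:
            self._data.append(transition)

    def __len__(self) -> int:
        with self._lock:
            return len(self._data)


def actor(actor_id: int, replay: ReplayTable, num_steps: int) -> None:
    """Placeholder actor: a real one would act in an environment."""
    for t in range(num_steps):
        replay.add((actor_id, t))  # stands in for a stored transition


def run_actors(num_actors: int, num_steps: int, capacity: int) -> ReplayTable:
    """Run all actors to completion and return the filled table."""
    replay = ReplayTable(capacity)
    threads = [
        threading.Thread(target=actor, args=(k, replay, num_steps))
        for k in range(num_actors)
    ]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return replay
```

A learner would sample from the same table concurrently; that side is omitted here to keep the sketch short.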
4 Results
In this section we describe the performance of the D4PG algorithm across a variety of continuous control tasks. To do so, in each environment we run our learning procedure and periodically snapshot the policy in order to test it without exploration noise. We will primarily be interested in the performance as a function of wall-clock time; however, we will also examine the data efficiency. Most interestingly, from a scientific perspective, we also perform a number of ablations which individually remove components of the D4PG algorithm in order to determine their specific contributions.
First, we experiment with and without distributional updates. In this setting we focus on use of a categorical distribution as we found in preliminary experiments that the use of a mixture of Gaussians performed worse and was less stable with respect to hyperparameter values across different tasks; a selection of these runs can be found in Appendix C. Across all tasks, except for one which we will introduce later, we use 51 atoms for the categorical distribution. In what follows we will refer to non-distributional variants of this algorithm as Distributed DDPG (D3PG).
Next, we consider prioritized and non-prioritized versions of these algorithm variants. For the non-prioritized variants, transitions are sampled from replay uniformly. For prioritized variants we use the absolute TD error to sample from replay in the case of D3PG, and for D4PG we use the absolute distributional TD error as described in Appendix A. We also vary the trajectory length $N$.
In all experiments we use a replay table of size and only consider behavior policies which add fixed Gaussian noise to the current online policy; in all experiments we use a value of . We experimented with correlated noise drawn from an Ornstein-Uhlenbeck process, as suggested by Lillicrap et al. (2016); however, we found this was unnecessary and did not add to performance. For all algorithms we initialize the learning rates for both actor and critic updates to the same value. In the next section we will present a suite of simple control problems for which this value corresponds to ; for the following, harder problems we set this to a smaller value of . Similarly for the control suite we utilize a batch size of and for all subsequent problems we will increase this to .
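A behavior policy of this kind can be sketched as the online policy's action plus independent Gaussian noise in each dimension. The function name and the `noise_scale` argument are hypothetical; no particular value for the noise magnitude is assumed here.

```python
import random


def behavior_action(policy_action, noise_scale: float, rng: random.Random):
    """Add fixed-scale, uncorrelated Gaussian noise to each action dimension."""
    return [a + noise_scale * rng.gauss(0.0, 1.0) for a in policy_action]


# Example call with an arbitrary, purely illustrative scale.
noisy = behavior_action([0.5, -0.2], 0.1, random.Random(0))
```

Setting the scale to zero recovers the deterministic policy, which is how the policy is evaluated without exploration noise.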
4.1 Standard control suite
We first consider evaluating performance on a number of simple, physical control tasks by utilizing a suite of benchmark tasks (Tassa et al., 2018) developed in the MuJoCo physics simulator (Todorov et al., 2012). Each task is run for exactly 1000 steps and provides either an immediate dense reward or sparse reward depending on the particular task. For each domain, the inputs presented to the agent consist of reasonably low-dimensional observations, many consisting of physical state, joint angles, etc. These observations range between 6 and 60 dimensions; note, however, that the difficulty of the task is not immediately associated with its dimensionality. For example the acrobot is one of the lowest-dimensional tasks in this suite which, due to its level of controllability, can prove much more difficult to learn than other, higher-dimensional tasks. For an illustration of these domains see Figure 9; see Appendix D for more details.
For algorithms in these experiments we consider actor and critic architectures of the form given in Figure 1 and for each experiment we use actors. Figure 2 shows the performance of D4PG and its various ablations across the entire suite of control tasks. This set of plots is quite busy; however, it serves as a broad set of tasks with which we can obtain a general idea of the algorithms’ performance. Later experiments on harder domains look more closely at the difference between algorithms. Here we also compare against the canonical (non-distributed) DDPG algorithm as a baseline, shown as a dotted black line. This removes all the enhancements proposed in this paper, and we can see that except on the simplest domain, Cartpole (Swingup), it performs worse than all other methods. This performance disparity worsens as we increase the difficulty of tasks, and hence for further experiments we will drop this line from the plot.
Next, across all tasks we see that the best performance is obtained by the full D4PG algorithm (shown in purple and bold). Here we see that the longer unroll length is uniformly better (shown as solid lines), and in particular we see for both D3PG and D4PG that the shorter unroll length (shown as dashed lines) can occasionally result in instability. This is especially apparent in the Cheetah (Walk) and Cartpole (Swingup Sparse) tasks.
The next biggest gain is arguably due to the inclusion of the distributional critic update, where it is particularly helpful on the hardest tasks e.g. Humanoid (Run) and Acrobot. The manipulator is also quite difficult among this suite of tasks, and here we see that the inclusion of the distributional update does not help as much as in other tasks, although note that here the D3PG and D4PG variants obtain approximately the same performance. As far as the use of prioritization is concerned, it does not appear to contribute significantly to the performance of D4PG. This is not the case for D3PG, however, which on many tasks is helped significantly by the inclusion of prioritization.
4.2 Manipulation
Next, we consider a set of tasks designed to highlight the ability of the D4PG agent to learn dexterous manipulation. Tasks of this form can prove difficult for many reasons, most notably the higher dimensionality of the control task, intermittent contact dynamics, and potential underactuation of the manipulator.
Here we use a simulated hand model implemented within MuJoCo, consisting of 13 actuators which control 22 degrees of freedom. For these experiments the wrist site is attached to a fixed location in space, about which it is allowed to rotate axially. In particular this allows the hand to pick up objects, rotate into a palm-up position, and manipulate them. We first consider a task in which a cylinder is dropped onto the hand from a random height, and the goal of the task is to catch the falling cylinder. The next task requires the agent to pick up an object from the tabletop and then maneuver it to a target position and orientation. The final task is one wherein a broad cylinder must be rotated in-hand in order to match a target orientation. See Appendix E for further details regarding both the model and the tasks. For these tasks we use the same network architectures as in the previous section as well as actors.
In Figure 3 we again compare the D4PG algorithm against ablations of its constituent components. Here we split the algorithms between in the top row and in the bottom row, and in particular we can see that across all algorithms is uniformly better. For all tasks, the full D4PG algorithm performs either at the same level or better than other ablations; this is particularly apparent in the case. Overall the use of prioritization never seems to harm D4PG; however, it does appear to be of limited additional value. Interestingly this is not necessarily the case with the D3PG variant (i.e. without distributional updates). Here we can see that prioritization sometimes harms the performance of D3PG, and this is very readily seen in the case where the algorithm can either become unstable, or in the case of the Pickup and Orient task it completely fails to learn.
4.3 Parkour
Finally, we consider the parkour domain introduced by Heess et al. (2017). In this setting the agent controls a simplified robotic walker which is rewarded for forward movement, but is impeded by a number of randomly sampled obstacles; see Figure 4 for a visualization and refer to the earlier work for further details. The first of our experiments considers a two-dimensional walker, i.e. a domain in which the walker is allowed to move horizontally and vertically, but is constrained to a fixed depth position. In this domain the obstacles presented to the agent include gaps in the floor surface, barriers it must jump over, and platforms that it can either run over or underneath. The agent is presented with proprioceptive observations corresponding to the angles of its limbs and other functions of these quantities. It is also given access to observations which include features such as a depth map of the upcoming terrain, etc. In order to accommodate these inputs we utilize a network architecture as specified in Figure 1. In particular we make use of a stack of feed-forward layers which process the terrain information to reduce it to a smaller number of hidden units before concatenating with the proprioceptive information for further processing. The actions in this domain take the form of torque controls .
In order to examine the performance of the D4PG algorithm in this setting we consider the ablations of the previous sections and we have further introduced a PPO baseline as utilized in the earlier paper of Heess et al. (2017). For all algorithms, including PPO, we use actors. These results are shown in Figure 5 in the top row. As before we examine the performance separately for and , and again we see that the higher unroll length results in better performance. Note that we show the PPO baseline on both plots for consistency, but in both plots this is the same algorithm, with settings proposed in the earlier paper and unrolls of length 50.
Here we again see a clear delineation and clear gains for each of the other algorithm components. The biggest gain comes from the inclusion of the distributional update, which we can see by comparing the non-prioritized D3PG/D4PG variants. We see marginal benefit to using prioritization for D3PG, but this gain disappears when we consider the distributional update. Finally, we can see when comparing to the PPO baseline that this algorithm compares favorably to D3PG in the case of , but is outperformed by D4PG; when all algorithms outperform PPO.
Next, in the plots shown in Figure 5 on the bottom row we also consider the performance not just in terms of training time, but also in terms of the sample complexity. In order to do so we plot the performance of each algorithm versus the number of actor steps, i.e. the quantity of transitions collected. This is perhaps more favorable to PPO, as the parallel actors considered in this work are not necessarily tuned for sample efficiency. Here we see that PPO is able to outperform the non-prioritized version of D3PG, and early on in training is favorable compared to the prioritized version, although this trails off. However, we still see significant performance gains by utilizing the distributional updates, both in a prioritized and non-prioritized setting. Interestingly we see that the use of prioritization does not gain much, if anything, over the non-prioritized D4PG version. Early in the trajectory for , in fact, we see that the non-prioritized D4PG exhibits better performance; however, later these performance curves level out. With respect to wall-clock time these small differences may be due to small latencies in the scheduling of different runs, as we see that this difference is smaller for the plot with respect to actor steps.
Finally we consider a humanoid walker which is able to move in all three dimensions. The obstacles in this domain consist of gaps in the floor, barriers that must be jumped over, and walls with gaps that allow the agent to run through. For this experiment we utilize the same network architecture as in the previous experiment, except now the observations are of size and . Again actions are torque controls, but in 21 dimensions. In this task we also increased the number of atoms for the categorical distribution from 51 to 101. This change increases the level of resolution for the distribution in order to keep the resolution roughly consistent with other tasks. This is a much higher-dimensional problem than the previous parkour task with a significantly more difficult control task: the walker is more unstable and there are many more ways for the agent to fail than in the previous experiment. The results for this particular domain are displayed in Figure 6, and here we concentrate on performance as a function of wall-clock time, restricted to the previously best performing rollout length of . In this setting we see a clear delineation between first the PPO results which are the poorest performing, the D3PG results where the prioritized version has a slight edge, and finally the D4PG results. Interestingly, for D4PG we again see, as in the two-dimensional walker case, that the use of prioritization seems to have no benefit, with both versions having almost identical performance curves; in fact the performance here is perhaps even closer than that of the previous set of experiments.
5 Discussion
In this work we introduced the D4PG, or Distributed Distributional DDPG, algorithm. Our main contributions include the inclusion of a distributional update in the DDPG algorithm, combined with the use of multiple distributed workers all writing into the same replay table. We also consider a number of other, smaller changes to the algorithm. All of these simple modifications contribute to the overall performance of the D4PG algorithm; the biggest performance gain of these simple changes is arguably the use of $N$-step returns. Interestingly we found that the use of prioritization was less crucial to the overall D4PG algorithm, especially on harder problems. While the use of prioritization was definitely able to increase the performance of the D3PG algorithm, we found that it can also lead to unstable updates. This was most apparent in the manipulation tasks.
Finally, as our results can attest, the D4PG algorithm is capable of state-of-the-art performance on a number of very difficult continuous control problems.
References

 Bellemare et al. (2017) Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pp. 449–458, 2017.
 Glynn (1990) Peter W Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84, 1990.
 Hafner & Riedmiller (2011) Roland Hafner and Martin Riedmiller. Reinforcement learning in feedback control. Machine Learning, 84(1-2):137–169, jul 2011. doi: 10.1007/s10994-011-5235-x.
 Heess et al. (2017) Nicolas Heess, Dhruva TB, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, Ali Eslami, Martin Riedmiller, and David Silver. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.
 Hessel et al. (2017) Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298, 2017.
 Horgan et al. (2018) Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado van Hasselt, and David Silver. Distributed prioritized experience replay. International Conference on Learning Representations, 2018.
 Johannes et al. (2011) Matthew S. Johannes, John D. Bigelow, James M. Burck, Stuart D. Harshbarger, Matthew V. Kozlowski, and Thomas Van Doren. An overview of the developmental process for the modular prosthetic limb. Johns Hopkins APL Technical Digest (Applied Physics Laboratory), 30(3):207–216, 2011.
 Kingma & Ba (2015) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
 Kumar & Todorov (2015) Vikash Kumar and Emanuel Todorov. MuJoCo HAPTIX: A virtual reality system for hand manipulation. In IEEE-RAS International Conference on Humanoid Robots, volume 2015-December, pp. 657–663. IEEE, 2015. doi: 10.1109/HUMANOIDS.2015.7363441.
 Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 Lillicrap et al. (2016) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.
 Morimura et al. (2010) Tetsuro Morimura, Hirotaka Hachiya, Masashi Sugiyama, Toshiyuki Tanaka, and Hisashi Kashima. Parametric Return Density Estimation for Reinforcement Learning. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2010.
 Peters & Schaal (2008) Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7):1180–1190, 2008.
 Popov et al. (2017) Ivaylo Popov, Nicolas Heess, Timothy Lillicrap, Roland Hafner, Gabriel Barth-Maron, Matej Vecerik, Thomas Lampe, Yuval Tassa, Tom Erez, and Martin Riedmiller. Data-efficient deep reinforcement learning for dexterous manipulation. arXiv preprint arXiv:1704.03073, 2017.
 Schaul et al. (2016) Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. International Conference on Learning Representations, 2016.
 Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1889–1897, 2015.
 Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Silver et al. (2014) David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In International Conference on Machine Learning, 2014.
 Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

 Sobel (1982) Matthew J. Sobel. The variance of discounted Markov decision processes. Journal of Applied Probability, 19(04):794–802, 1982.
 Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.
 Tassa et al. (2018) Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy Lillicrap, and Martin Riedmiller. DeepMind control suite, 2018. URL http://arxiv.org/abs/1801.00690.
 Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.
 Uhlenbeck & Ornstein (1930) George E Uhlenbeck and Leonard S Ornstein. On the theory of the Brownian motion. Physical Review, 36(5):823, 1930.
 Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
Appendix A Distributions and losses
In this section we consider two potential parameterized distributions for D4PG. Parameterized distributions, in this framework, are implemented as a neural network layer mapping the output of the critic torso (see Figure 1) to the parameters of a given distribution (e.g. mean and variance). In what follows we will detail the distributions and their corresponding losses.
Categorical
Following Bellemare et al. (2017), we first consider the categorical parameterization, a layer whose parameters are the logits $\omega_i$ of a discrete-valued distribution defined over a fixed set of atoms $z_i$. This distribution has hyperparameters for the number of atoms $\ell$ and the bounds on the support $(V_{\min}, V_{\max})$. Given these, $\Delta = \frac{V_{\max} - V_{\min}}{\ell - 1}$ corresponds to the distance between atoms, and $z_i = V_{\min} + i\Delta$ gives the location of each atom. We can then define the action-value distribution as
$$Z = z_i \quad \text{w.p.} \quad p_i \propto \exp\{\omega_i\}. \tag{9}$$
Observe that this distributional layer simply corresponds to a linear layer from the critic torso to the logits $\omega_i$, followed by a softmax activation (see Figure 7, left).
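As a minimal sketch of this layer's output stage, the snippet below builds evenly spaced atoms between the support bounds and turns a vector of logits into atom probabilities with a softmax. The function names, the support bounds, the number of atoms, and the logits are all hypothetical values chosen for illustration; the linear layer from the critic torso is omitted.

```python
import math

def make_atoms(v_min, v_max, num_atoms):
    """Evenly spaced atom locations between v_min and v_max (inclusive)."""
    delta = (v_max - v_min) / (num_atoms - 1)  # distance between atoms
    return [v_min + i * delta for i in range(num_atoms)]

def softmax(logits):
    """Numerically stable softmax: logits -> atom probabilities."""
    shift = max(logits)
    exps = [math.exp(w - shift) for w in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_value(atoms, probs):
    """Mean of the categorical distribution, i.e. a scalar value estimate."""
    return sum(z * p for z, p in zip(atoms, probs))

# Hypothetical settings, chosen only for illustration.
atoms = make_atoms(v_min=-10.0, v_max=10.0, num_atoms=5)
probs = softmax([0.0, 1.0, 2.0, 1.0, 0.0])
print(atoms)
print(probs)
```

Taking the mean of the resulting distribution, as in `expected_value`, recovers a scalar action-value estimate from the distributional output.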
However, this distribution is not closed under the Bellman operator defined earlier, because the shifted and scaled atom values generally no longer lie on the support defined by the atoms. This support is explicitly defined by the $(V_{\min}, V_{\max})$ hyperparameters. As a result we instead use a projected version of the distributional Bellman operator (Bellemare et al., 2017); see Appendix B for more details. Letting $p'_i$ be the probabilities of the projected distributional Bellman operator $\Phi\mathcal{T}_\pi$ applied to some target distribution $Z'$, we can write the loss in terms of the cross-entropy
$$d(\Phi\mathcal{T}_\pi Z', Z) = -\sum_{i=0}^{\ell-1} p'_i \log p_i, \qquad p_i = \frac{\exp\{\omega_i\}}{\sum_{j} \exp\{\omega_j\}}. \tag{10}$$
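The cross-entropy between the projected target probabilities and the softmax of the critic logits can be sketched as follows. This is an illustrative implementation under the assumption that the loss is the standard categorical cross-entropy; the function names and example inputs are hypothetical.

```python
import math

def log_softmax(logits):
    """Log-probabilities from logits, using the log-sum-exp trick."""
    shift = max(logits)
    log_norm = shift + math.log(sum(math.exp(w - shift) for w in logits))
    return [w - log_norm for w in logits]

def categorical_cross_entropy(target_probs, logits):
    """Cross-entropy between target probabilities and softmax(logits)."""
    log_probs = log_softmax(logits)
    return -sum(p * lp for p, lp in zip(target_probs, log_probs))

# Hypothetical target: uniform over four atoms.
uniform_target = [0.25, 0.25, 0.25, 0.25]
loss_flat = categorical_cross_entropy(uniform_target, [0.0, 0.0, 0.0, 0.0])
loss_peaked = categorical_cross_entropy(uniform_target, [5.0, 0.0, 0.0, 0.0])
print(loss_flat, loss_peaked)
```

The loss is smallest when the predicted probabilities match the target, so the flat logits score better than the peaked ones against a uniform target.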
Mixture of Gaussians
We can also consider parameterizing the action-value distribution using a mixture of Gaussians; here the random variable $Z$ has density given by
$$p(z) \propto \sum_{i=0}^{\ell-1} \omega_i \, \mathcal{N}(z \mid \mu_i, \sigma_i^2). \tag{11}$$
Thus, the distribution layer maps, through a linear layer, from the critic torso to the mixture weight $\omega_i$, mean $\mu_i$, and variance $\sigma_i^2$ for each mixture component (see Figure 7, center). We can then specify a loss corresponding to the cross-entropy portion of the KL divergence between two distributions. Given a sample transition $(x, a, r, x')$, we can take samples $y_j$ from the target density $p'$ and approximate the cross-entropy term using
$$d(\mathcal{T}_\pi Z', Z) \approx -\frac{1}{m} \sum_{j=1}^{m} \log p(y_j). \tag{12}$$
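A sample-based cross-entropy of this kind can be sketched as the average negative log-density of target samples under the predicted mixture. The code below is an assumption-laden illustration, not the authors' implementation: the function names are hypothetical, the mixture weights are normalized explicitly and must be positive, and the target samples are taken as given.

```python
import math

def mog_log_density(z, weights, means, variances):
    """Log-density of a Gaussian mixture at z (weights are normalized here)."""
    total_weight = sum(weights)
    terms = [
        math.log(w / total_weight)
        - 0.5 * math.log(2.0 * math.pi * var)
        - 0.5 * (z - mu) ** 2 / var
        for w, mu, var in zip(weights, means, variances)
    ]
    shift = max(terms)  # log-sum-exp for numerical stability
    return shift + math.log(sum(math.exp(t - shift) for t in terms))

def sample_cross_entropy(target_samples, weights, means, variances):
    """Monte Carlo estimate of the cross-entropy from target samples."""
    log_liks = [mog_log_density(y, weights, means, variances)
                for y in target_samples]
    return -sum(log_liks) / len(log_liks)

# Hypothetical two-component mixture and three target samples.
loss = sample_cross_entropy([0.1, -0.2, 0.4],
                            weights=[0.3, 0.7],
                            means=[0.0, 1.0],
                            variances=[1.0, 0.5])
print(loss)
```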
Appendix B Categorical projection operator
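The categorical parameterized distribution has finite support. Thus, the result of applying the distributional Bellman equation will generally not coincide with this support. Therefore, some projection step is required before minimizing the cross-entropy. For a target distribution placing probability $p_j$ on the location $\tilde{z}_j$, the categorical projection of Bellemare et al. (2017) is given by $(\Phi p)_i = \sum_{j=0}^{\ell-1} h_{z_i}(\tilde{z}_j)\, p_j$, where $h_{z_i}$ is a piecewise linear ‘hat’ function,
$$h_{z_i}(z) = \begin{cases} 1 & z \le z_0 \text{ and } i = 0, \\ 1 & z \ge z_{\ell-1} \text{ and } i = \ell - 1, \\ \dfrac{z - z_{i-1}}{z_i - z_{i-1}} & z_{i-1} \le z \le z_i \text{ and } i > 0, \\ \dfrac{z_{i+1} - z}{z_{i+1} - z_i} & z_i \le z \le z_{i+1} \text{ and } i < \ell - 1, \\ 0 & \text{otherwise.} \end{cases} \tag{13}$$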
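The projection can be sketched in a few lines. The version below assumes evenly spaced atoms, a triangular hat function whose width equals the atom spacing, and clipping of out-of-range values to the support bounds, as in the projection of Bellemare et al. (2017). The function names and example values are hypothetical.

```python
def hat(z, center, delta):
    """Triangular 'hat' weight of location z for the atom at `center`."""
    return max(0.0, 1.0 - abs(z - center) / delta)

def project_categorical(target_probs, atoms, reward, discount):
    """Project the distribution of reward + discount * Z' onto the atoms."""
    v_min, v_max = atoms[0], atoms[-1]
    delta = atoms[1] - atoms[0]  # assumes evenly spaced atoms
    # Shift and scale each atom, then clip to the support bounds.
    shifted = [min(max(reward + discount * z, v_min), v_max) for z in atoms]
    return [
        sum(hat(tz, z_i, delta) * p for tz, p in zip(shifted, target_probs))
        for z_i in atoms
    ]

# Hypothetical example: all target mass on the middle atom, shifted by 0.5.
atoms = [-1.0, 0.0, 1.0]
projected = project_categorical([0.0, 1.0, 0.0], atoms, reward=0.5, discount=1.0)
print(projected)  # [0.0, 0.5, 0.5]
```

Because the hat weights of any in-range location sum to one across atoms, the projection conserves probability mass.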
Appendix C Mixtures of Gaussians control suite results
In Figure 8 we display results of running D4PG on a selection of control suite tasks using a mixture of Gaussians output distribution for two choices of learning rate. Here the distributional TD loss is minimized using the sample-based KL loss introduced earlier. Although this technique merits further exploration, we found in initial experiments that this choice of distribution underperformed the categorical distribution by a fair margin. This lends further credence to the choice of distribution made by Bellemare et al. (2017).
Appendix D Control suite details
In this section we provide further details for the control suite domains. In particular, see Figure 9 for images of the control suite tasks. The physics state $\mathcal{S}$, action $\mathcal{A}$, and observation $\mathcal{O}$ dimensionalities for each task are provided in Table 1.
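Domain       Task            $\dim(\mathcal{A})$  $\dim(\mathcal{S})$  $\dim(\mathcal{O})$
acrobot      swingup         1                    4                    6
acrobot      swingup_sparse  1                    4                    6
cartpole     swingup         1                    4                    5
cartpole     swingup_sparse  1                    4                    5
cheetah      walk            6                    18                   17
finger       turn_easy       2                    6                    12
finger       turn_hard       2                    6                    12
fish         upright         5                    27                   24
fish         swim            5                    27                   24
hopper       stand           4                    14                   15
humanoid     stand           21                   55                   67
humanoid     walk            21                   55                   67
humanoid     run             21                   55                   67
manipulator  bring_ball      2                    22                   37
swimmer      swimmer6        5                    16                   25
swimmer      swimmer15       14                   34                   61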
Appendix E Manipulation details
For the dexterous manipulation tasks we used a simulated model of the Johns Hopkins Modular Prosthetic Limb hand (Johannes et al., 2011) implemented in MuJoCo (Kumar & Todorov, 2015). This anthropomorphic hand has a total of 22 degrees of freedom (19 in the fingers, 3 in the wrist), which are driven by a set of 13 position actuators (PD controllers). The underactuation of the hand is due to coupling between some of the finger joints. For these experiments the wrist was positioned in a fixed location above a table, such that rotation and flexion about the wrist joints allowed the hand to pick up objects from the table, rotate into a palm-up position, and then manipulate them.
We focused on a set of three tasks where the agent must learn to manipulate a cylindrical object (Figure 10). In each of these tasks, the observations contain the positions and velocities of all of the joints in the hand, the current position targets for the actuators in the hand, the position and quaternion of the object being manipulated, and its translational and rotational velocities. The observations given in each task are summarized in Table 2. The agent’s actions are increments applied to the position targets for the actuators.
Group   Observation       Size  Catch  Pick-up-and-orient  Rotate-in-hand
Hand    joint positions   22    ✓      ✓                   ✓
Hand    joint velocities  22    ✓      ✓                   ✓
Hand    actuator targets  13    ✓      ✓                   ✓
Object  position          3     ✓      ✓                   ✓
Object  quaternion        4     ✓      ✓                   ✓
Object  velocity          6     ✓      ✓                   ✓
Target  position          3     –      ✓                   –
Target  quaternion        4     –      ✓                   –
Target  ,                 2     –      –                   ✓
Total                           70     77                  72
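As a quick consistency check, the per-task totals follow from summing the sizes of the observation components each task receives. In the sketch below the dictionary keys are illustrative labels; in particular, `target angle features` is a hypothetical name for the two-dimensional target entry used only in the rotate-in-hand task.

```python
# Components observed in every task (sizes taken from the table).
shared = {
    "joint positions": 22,
    "joint velocities": 22,
    "actuator targets": 13,
    "object position": 3,
    "object quaternion": 4,
    "object velocity": 6,
}

# Task-specific target components; key names are illustrative labels.
extra = {
    "catch": {},
    "pick-up-and-orient": {"target position": 3, "target quaternion": 4},
    "rotate-in-hand": {"target angle features": 2},
}

totals = {
    task: sum(shared.values()) + sum(components.values())
    for task, components in extra.items()
}
print(totals)  # {'catch': 70, 'pick-up-and-orient': 77, 'rotate-in-hand': 72}
```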
In the ‘catch’ task the agent must learn to catch a falling object before it strikes the table below. The position, height, and orientation of the object are randomly initialized at the start of each episode. The reward is given by
(14) 
where is a soft indicator function similar to one described by Hafner & Riedmiller (2011)
(15) 
Here , and the tolerance and margin parameters are and respectively. Contact between the object and the table causes the current episode to terminate immediately with no reward, otherwise it will continue until a 500 step limit is reached.
In the ‘pick-up-and-orient’ task, the agent must pick up a cylindrical object from the table and maneuver it into a target position and orientation. Both the initial position and orientation of the object, and the position and orientation of the target are randomized between episodes. The reward function consists of two additive components that depend on the distance from the object to the target position, and on the angle between the axes of the object and target body frames
(16) 
where =, =, =, =. Note that the distance-dependent component of the reward multiplicatively gates the orientation component. This encourages the agent to pick up the object before attempting to orient it to match the target. Each episode has a fixed duration of 500 steps.
Finally, in the ‘rotate-in-hand’ task the agent begins with a broad cylinder in its palm, and must rotate it axially in order to match a moving target. This requires dynamically forming and breaking contacts with the object being manipulated. The target angle is initialized uniformly, and then incremented on each time step using temporally correlated noise drawn from an Ornstein-Uhlenbeck process (=, =0.01; Uhlenbeck & Ornstein, 1930). The reward consists of two multiplicative components
(17) 
where =, =, =, =, and denotes projection onto the global plane. The first component provides an incentive to match the axial rotation of the target, and the second component penalizes the agent for allowing the orientation of the cylinder’s long axis to deviate too far from that of the target. The maximum episode duration is 1000 steps, with early termination if the object makes contact with the table.
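The moving target in the ‘rotate-in-hand’ task can be illustrated with a simple discretized Ornstein-Uhlenbeck process whose value is added to the target angle at every step. This is only a sketch: the parameter values (`theta`, `sigma`), the initial range $[-\pi, \pi)$, the unit time step, and all function names are assumptions made for illustration, not the settings used in the experiments.

```python
import math
import random

def wrap_angle(angle):
    """Wrap an angle to (approximately) the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi

def target_angle_trajectory(num_steps, theta, sigma, seed=0):
    """Target angle driven by temporally correlated (OU) increments."""
    rng = random.Random(seed)
    angle = rng.uniform(-math.pi, math.pi)  # assumed initial range
    noise = 0.0
    trajectory = [angle]
    for _ in range(num_steps):
        # Euler step of an OU process with zero mean and unit time step.
        noise += -theta * noise + sigma * rng.gauss(0.0, 1.0)
        angle = wrap_angle(angle + noise)
        trajectory.append(angle)
    return trajectory

# Placeholder parameter values, chosen only for illustration.
trajectory = target_angle_trajectory(num_steps=1000, theta=0.15, sigma=0.01)
print(len(trajectory))  # 1001
```

Because each increment depends on the previous one, the target drifts smoothly rather than jittering independently from step to step.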