Model-free Reinforcement Learning (RL) is a learning paradigm which aims to maximize a cumulative reward signal based on experience gathered through interaction with an environment (Sutton and Barto, 1998). It is divided into two primary categories. Value-based approaches involve learning the value of each action and acting greedily with respect to it (i.e., selecting the action with the highest value). On the other hand, policy-based approaches (the focus of this work) learn the policy directly, thereby explicitly learning a mapping from state to action.
Policy gradients (PGs) (Sutton et al., 2000b)
have been the go-to approach for learning policies in empirical applications. The combination of the policy gradient with recent advances in deep learning has enabled the application of RL in complex and challenging environments. Such domains include continuous control problems, in which an agent controls complex robotic machines both in simulation (Schulman et al., 2015; Haarnoja et al., 2017; Peng et al., 2018) as well as in real life (Levine et al., 2016; Andrychowicz et al., 2018; Riedmiller et al., 2018). Nevertheless, there exists a fundamental problem when PG methods are applied to continuous control regimes. As the gradients require knowledge of the probability of the performed action, the PG is empirically limited to parametric distribution functions. Common parametric distributions used in the literature include the Gaussian (Schulman et al., 2015, 2017), Beta (Chou et al., 2017) and Delta (Silver et al., 2014; Lillicrap et al., 2015; Fujimoto et al., 2018) distribution functions.
In this work, we show that while the PG is properly defined over parametric distribution functions, it is prone to converge to sub-optimal extrema (Section 3). The leading reason is that these sets of distributions are not convex in the distribution space [1] and are thus limited to local improvement in the action space itself. Inspired by Approximate Policy Iteration schemes, for which convergence guarantees exist (Puterman and Brumelle, 1979), we introduce the Distributional Policy Optimization (DPO) framework, in which an agent's policy evolves towards a distribution

[1] As an example, consider the set of Gaussian distributions, which is known to be non-convex: a mixture of two Gaussians is not itself a Gaussian.
over improving actions. This framework requires the ability to minimize a distance (loss function) which is defined over two distributions, as opposed to the policy gradient approach which requires an explicit differentiation through the density function.
DPO establishes the building blocks for our generative algorithm, the Generative Actor Critic [2]. It is composed of three elements: a generative model which represents the policy, a value network, and a critic. The value and the critic are combined to obtain the advantage of each action. A target distribution is then defined as one which improves the value (i.e., all actions with negative advantage receive zero probability mass). The generative model is optimized directly from samples, without an explicit definition of the underlying probability distribution, using quantile regression and Autoregressive Implicit Quantile Networks (see Section 4). The Generative Actor Critic is evaluated on tasks in the MuJoCo control suite (Section 5), showing promising results on several difficult tasks.

[2] Code is provided in the following anonymous repository: github.com/neurips-2019/GAC
We consider an infinite-horizon discounted Markov Decision Process (MDP) with a continuous action space. An MDP is defined as the 5-tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$ (Puterman, 1994), where $\mathcal{S}$ is a countable state space, $\mathcal{A}$ the continuous action space, $P$ is a transition kernel, $r$ is a reward function, and $\gamma \in (0, 1)$ is the discount factor. Let $\pi : \mathcal{S} \to \mathcal{P}(\mathcal{A})$ be a stationary policy, where $\mathcal{P}(\mathcal{A})$ is the set of probability measures on the Borel sets of $\mathcal{A}$. We denote by $\Pi$ the set of stationary stochastic policies. In addition to $\Pi$, one is often interested in optimizing over a set of parametric distributions. We denote the set of possible distribution parameters by $\Theta$ (e.g., the mean $\mu$ and variance $\sigma^2$ of a Gaussian distribution).
Two measures of interest in RL are the value and action-value functions, $v$ and $Q$, respectively. The action-value of a policy $\pi$, starting at state $s$ and performing action $a$, is defined by $Q^\pi(s, a) = \mathbb{E}^\pi \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a \right]$. The value function is then defined by $v^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)} \left[ Q^\pi(s, a) \right]$. Given the action-value and value functions, the advantage of an action $a$ at state $s$ is defined by $A^\pi(s, a) = Q^\pi(s, a) - v^\pi(s)$. The optimal policy is defined by $\pi^* \in \arg\max_{\pi \in \Pi} v^\pi$ and the optimal value by $v^* = v^{\pi^*}$.
3 From Policy Gradient to Distributional Policy Optimization
Current practical approaches leverage the Policy Gradient Theorem (Sutton et al., 2000b) in order to optimize a policy, which updates the policy parameters $\theta$ according to
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)} \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \right],$$
where $d^{\pi_\theta}$ is the stationary distribution of states under $\pi_\theta$. Since this update rule requires knowledge of the log probability of each action under the current policy, $\log \pi_\theta(a \mid s)$, empirical methods in continuous control resort to parametric distribution functions. Most commonly used are the Gaussian (Schulman et al., 2017), Beta (Chou et al., 2017) and deterministic Delta (Lillicrap et al., 2015) distribution functions. However, as we show in Proposition 1, this approach is not ensured to converge, even though there exists an optimal policy which is deterministic (i.e., a Delta distribution), a policy which is contained within this set.
The sub-optimality of uni-modal policies such as Gaussian or Delta distributions does not occur due to the limitation induced by their parametrization (e.g., the neural network), but is rather a result of the predefined set of policies. As an example, consider the set of Delta distributions. As illustrated in Figure 1, while this set is convex in the parameter $\mu$ (the mean of the distribution), it is not convex in the set of distributions: a convex combination $\alpha \delta_{\mu_1} + (1 - \alpha) \delta_{\mu_2}$ results in a stochastic distribution over two supports, which cannot be represented using a single Delta function. Parametric distributions such as Gaussian and Delta functions highlight this issue, as the policy gradient considers the gradient w.r.t. the parameters $\theta$. This results in local movement in the action space. Clearly, such an approach can only guarantee convergence to a locally optimal solution, not a global one.
Proposition 1. For any initial Gaussian policy $\pi_0 = \mathcal{N}(\mu_0, \sigma_0^2)$ and any $\epsilon > 0$, there exists an MDP such that $\pi_{\mathrm{PG}}$ satisfies
$$v^{\pi_{\mathrm{PG}}} \le v^* - \epsilon,$$
where $\pi_{\mathrm{PG}}$ is the convergent result of a PG method with step size bounded by some constant $M$. Moreover, given $R > 0$, the result follows even when the optimal action $a^*$ is only known to lie in some ball of radius $R$ around the initial mean, $\|a^* - \mu_0\| \le R$.
For brevity, we prove the result for the one-dimensional case, in which the action space is a finite interval; the proof of the general case can be found in the supplementary material. We consider a single-state MDP (i.e., an x-armed bandit) with a continuous action space and a multi-modal reward function (similar to the illustration in Figure 1(c)), defined as a sum of window functions, where the window containing the optimal action is separated from the initial policy's mean by a low-reward region.

In PG, we assume the policy is parameterized by some parameters $\theta$. Without loss of generality, consider the derivative with respect to the mean $\mu$. Writing the derivative at iteration $t$, one observes that when the policy places negligible probability mass on the window containing the optimal action, the gradient, and hence the bounded-step-size PG update, is governed entirely by the nearby local mode. By induction, if the mean lies outside the optimal window at iteration $t$, then so does the mean at iteration $t+1$. That is, the policy can never reach the interval in which the optimal solution lies. Hence, the PG method converges to a sub-optimal solution and the result follows. ∎
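The failure mode in the proof can be illustrated numerically. The following is a minimal sketch with a hypothetical reward of our own (smooth modes rather than the paper's window-function construction): a Gaussian REINFORCE learner on a 1-D continuum-armed bandit, initialized near an inferior mode, climbs it and never crosses the flat region towards the global optimum.

```python
import numpy as np

# Hypothetical illustration of Proposition 1 (our own reward, not the paper's
# construction): a 1-D bandit whose reward has a local mode at a = -1 and a
# higher global mode at a = +2. A Gaussian policy trained with the
# score-function (REINFORCE) gradient, initialized near the local mode,
# climbs it and never discovers the global optimum.

def reward(a):
    return 1.0 * np.exp(-10 * (a + 1) ** 2) + 2.0 * np.exp(-10 * (a - 2) ** 2)

rng = np.random.default_rng(0)
mu, sigma, lr = -1.5, 0.2, 0.03   # fixed standard deviation for simplicity

for _ in range(2000):
    a = rng.normal(mu, sigma, size=64)
    r = reward(a)
    # score-function gradient w.r.t. the mean: E[(r - b) * (a - mu) / sigma^2]
    grad_mu = np.mean((r - r.mean()) * (a - mu) / sigma ** 2)
    mu += lr * grad_mu

print(round(mu, 1))  # stuck near the local mode at -1, far from the optimum at +2
```

With the policy's mass concentrated 15 standard deviations away from the global mode, the sampled gradient never points towards it, matching the induction argument above.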
3.1 Distributional Policy Optimization (DPO)
In order to overcome the issues present in parametric distribution functions, we consider an alternative approach. In our solution, the policy does not evolve based on the gradient w.r.t. the distribution parameters (e.g., $\mu, \sigma$), but rather updates the policy distribution according to
$$\pi_{k+1} = \Gamma\big(\pi_k - \eta\, \nabla_\pi\, d(\pi_k, \mathcal{I}^{\pi_k})\big),$$
where $\Gamma$ is a projection operator onto the set of distributions, $d$ is a distance measure (e.g., the Wasserstein distance), and $\mathcal{I}^{\pi_k}$ is a distribution defined over the support of positive-advantage actions, $\{a : A^{\pi_k}(s, a) \ge 0\}$. Table 1 provides examples of such distributions.
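Target distributions over positive-advantage actions can be sketched as follows, in the spirit of Table 1. The function names and the exact normalization are our assumptions; the defining property is that negative-advantage actions receive zero probability mass.

```python
import numpy as np

# Sketch of target distributions over positive-advantage actions (names and
# normalization are our assumptions): mass is zero wherever the advantage is
# non-positive.

def linear_target(adv):
    w = np.maximum(adv, 0.0)              # weight proportional to the advantage
    return w / w.sum()

def boltzmann_target(adv, beta=1.0):
    w = np.where(adv > 0, np.exp(beta * adv), 0.0)
    return w / w.sum()

adv = np.array([-1.0, 0.5, 2.0, -0.2])    # advantages of four candidate actions
print(linear_target(adv))                 # [0.  0.2 0.8 0. ]
```

Both variants concentrate mass on improving actions only, so a policy matched to either target cannot decrease the value.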
Algorithm 1 describes the Distributional Policy Optimization (DPO) framework as a three-time-scale approach to learning the policy. It can be shown, under standard stochastic approximation assumptions (Borkar, 2009; Konda and Tsitsiklis, 2000; Bhatnagar and Lakshmanan, 2012; Chow et al., 2017), to converge to an optimal solution. DPO consists of 4 elements: (1) a policy on a fast timescale, (2) a delayed policy on a slow timescale, and (3) a value and (4) a critic, which estimate the quality of the delayed policy on an intermediate timescale. Unlike the PG approach, DPO does not require access to the underlying p.d.f. of the policy. In addition, as the policy is updated on the fast timescale, it can be optimized using supervised learning techniques. Finally, we note that in DPO, the target distribution induces a higher value than the current policy, ensuring an always-improving policy.
The concept of policy evolution using the positive advantage is depicted in Figure 2. While the policy starts as a uni-modal distribution, it is not restricted to this subset of policies. As the policy evolves, fewer actions have positive advantage, and the process converges to an optimal solution. In the next section we construct a practical algorithm under the DPO framework using a generative actor.
In this section we present our method, the Generative Actor Critic, which learns a policy based on the Distributional Policy Optimization framework (Section 3). Distributional Policy Optimization requires a model which is both capable of representing arbitrarily complex distributions and can be optimized by minimizing a distributional distance. We consider the Autoregressive Implicit Quantile Network (Ostrovski et al., 2018), which is detailed below.
4.1 Quantile Regression & Autoregressive Implicit Quantile Networks
As seen in Algorithm 1, DPO requires the ability to minimize a distance between two distributions. The Implicit Quantile Network (IQN) (Dabney et al., 2018a) provides such an approach using the Wasserstein metric. The IQN receives a quantile value $\tau \in [0, 1]$ and is tasked with returning the value of the corresponding quantile of a target distribution. As the IQN learns to predict the value of each quantile, it allows one to sample from the underlying distribution (i.e., by sampling $\tau \sim U[0, 1]$ and performing a forward pass). Learning such a model requires the ability to estimate the quantiles. The quantile regression loss (Koenker and Hallock, 2001) provides this ability. It is given by $\rho_\tau(\delta) = \delta \cdot (\tau - \mathbb{1}\{\delta < 0\})$, where $\tau$ is the quantile and $\delta$ the error.
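A minimal sketch of quantile regression with this loss (the example and its parameters are ours): gradient descent on the mean pinball loss over a scalar recovers the $\tau$-quantile of a sample distribution.

```python
import numpy as np

# Minimizing the pinball loss rho_tau(delta) = delta * (tau - 1{delta < 0})
# over a scalar theta recovers the tau-quantile of the samples.

rng = np.random.default_rng(1)
samples = rng.exponential(scale=1.0, size=100_000)
tau = 0.5                          # the median

theta, lr = 0.0, 0.05
for _ in range(2000):
    delta = samples - theta
    # gradient of the mean pinball loss w.r.t. theta
    grad = -np.mean(tau - (delta < 0).astype(float))
    theta -= lr * grad

print(round(theta, 2))  # close to the true median, ln(2) ≈ 0.69
```

Setting `tau` to other values in $(0, 1)$ recovers the corresponding quantiles, which is exactly the supervision signal an IQN is trained with.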
Nevertheless, the IQN is only capable of coping with univariate (scalar) distribution functions. Ostrovski et al. (2018) proposed to extend the IQN to the multivariate case using quantile autoregression (Koenker and Xiao, 2006). Let $X = (X_1, \ldots, X_n)$ be an $n$-dimensional random variable. Given a fixed ordering of the $n$ dimensions, the distribution can be written as the product of conditional likelihoods:
$$p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1}).$$
The Autoregressive Implicit Quantile Network (AIQN) receives an i.i.d. vector $\tau = (\tau_1, \ldots, \tau_n) \sim U[0, 1]^n$. The network architecture then ensures that each output dimension $x_i$ is conditioned on the previously generated values $x_1, \ldots, x_{i-1}$; the network is trained by minimizing the quantile regression loss.
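This autoregressive sampling scheme can be illustrated with a toy two-dimensional target whose conditional quantile functions are known in closed form (a correlated Gaussian; the example is ours, not the AIQN architecture itself):

```python
import numpy as np
from statistics import NormalDist

# Dimension-by-dimension quantile sampling: an i.i.d. uniform vector
# (tau1, tau2) is mapped through per-dimension conditional quantile functions,
# each conditioned on the values generated so far.

rho = 0.8
inv = NormalDist().inv_cdf                 # standard normal quantile function

rng = np.random.default_rng(2)
taus = rng.uniform(1e-9, 1 - 1e-9, size=(50_000, 2))

x1 = np.array([inv(t) for t in taus[:, 0]])
# conditional quantile of x2 given x1: x2 | x1 ~ N(rho * x1, 1 - rho**2)
x2 = rho * x1 + np.sqrt(1 - rho ** 2) * np.array([inv(t) for t in taus[:, 1]])

print(round(float(np.corrcoef(x1, x2)[0, 1]), 2))  # ≈ 0.8
```

An AIQN learns these per-dimension conditional quantile functions from data instead of using a closed form, but the sampling pattern is the same.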
4.2 Generative Actor Critic (GAC)
Next, we introduce a practical implementation of the DPO framework. As shown in Section 3, DPO is composed of 4 elements: an actor, a delayed actor, a value, and an action-value estimator. The Generative Actor Critic (GAC) uses a generative actor trained using an AIQN, as described below. Contrary to parametric distribution functions, a generative neural network acts as a universal function approximator, enabling us to represent arbitrarily complex distributions, as a corollary of the following lemma.
Lemma (Kernels and Randomization (Kallenberg, 2006)).
Let $\mu$ be a probability kernel from a measurable space $S$ to a Borel space $T$. Then there exists some measurable function $f : S \times [0, 1] \to T$ such that if $\vartheta \sim U(0, 1)$, then $f(s, \vartheta)$ has distribution $\mu(s, \cdot)$ for every $s \in S$.
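A concrete instance of the lemma (our own example): choosing $f(s, \vartheta)$ to be the inverse c.d.f. of the kernel $\mu(s, \cdot)$, a single uniform sample reproduces a state-dependent distribution.

```python
import numpy as np

# Randomization in practice: with mu(s, .) = Exp(rate = s), the measurable map
# f(s, theta) below is its inverse c.d.f., so f(s, theta) with theta ~ U(0, 1)
# has distribution mu(s, .) for every "state" s.

def f(s, theta):
    return -np.log(1.0 - theta) / s        # inverse c.d.f. of Exp(rate = s)

rng = np.random.default_rng(3)
theta = rng.uniform(size=200_000)
for s in (0.5, 2.0):
    x = f(s, theta)
    print(s, round(float(x.mean()), 2))    # mean of Exp(rate = s) is 1 / s
```

The generative actor plays the role of $f$: it is a learned map from noise to actions whose existence this lemma guarantees.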
Actor: DPO defines the actor as one which is capable of representing arbitrarily complex policies. To achieve this, we construct a generative neural network, an AIQN. The AIQN learns a mapping from a sampled noise vector to a target distribution.

As illustrated in Figure 3, the actor network contains a recurrent cell which enables sequential generation of the action. This generation scheme ensures the autoregressive nature of the model: each generated action dimension $a_i$ is conditioned only on the current sampled noise scalar $\tau_i$ and the previous action dimensions $a_1, \ldots, a_{i-1}$. In order to train the generative actor, the AIQN requires the ability to produce samples from the target distribution. Although we are unable to sample from this distribution directly, given an action, we are able to estimate its probability. An unbiased estimator of the loss can therefore be attained by uniformly sampling actions and weighting each sampled action's quantile regression loss by its probability under the target distribution; we refer to this as the weighted autoregressive quantile loss.
Delayed Actor: The delayed actor, updated via Polyak averaging (Polyak, 1990), is a natural requirement, as such delayed (target) networks are common in off-policy actor-critic schemes (Lillicrap et al., 2015). The delayed actor is an additional AIQN which tracks the actor; it is updated towards the actor's parameters and is used for training the value and critic networks.
Value and Action-Value: While it is possible to train a critic and use its empirical mean w.r.t. the policy as a value estimate, we found this estimate to be noisy, resulting in poor convergence. We therefore train a value network to estimate the expectation of the critic w.r.t. the delayed policy. In addition, as suggested by Fujimoto et al. (2018), we train two critic networks in parallel. During both policy and value updates, we use the minimum of the two critics. We observed that this indeed reduced variance and improved overall performance.
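The clipped double-critic computation described above reduces to an element-wise minimum; a minimal sketch with network details elided (the variable names are our own):

```python
import numpy as np

# Sketch of the clipped double-critic trick (Fujimoto et al., 2018): policy and
# value targets use the element-wise minimum of two critic estimates, which
# counteracts overestimation bias. q1 and q2 stand for the two critics'
# outputs on a batch of sampled actions.

q1 = np.array([1.2, 0.7, 2.1])
q2 = np.array([0.9, 0.8, 1.5])
q_min = np.minimum(q1, q2)                 # pessimistic estimate used in targets
print(q_min)                               # [0.9 0.7 1.5]

# Polyak averaging for the delayed actor's parameters (tau is an assumed name):
def polyak_update(target_params, params, tau=0.005):
    return [(1 - tau) * t + tau * p for t, p in zip(target_params, params)]
```

The same `q_min` feeds both the advantage estimate and the value network's regression target.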
To summarize, GAC combines 4 elements. The delayed actor tracks the actor using a Polyak averaging scheme. The value and critic networks estimate the performance of the delayed actor. Provided the value and critic estimates, we are able to estimate the advantage of each action and thus compute the weighted autoregressive quantile loss, used to train the actor network. We refer the reader to the supplementary material for an exhaustive overview of the algorithm and architectural details.
In order to evaluate our approach, we test GAC on a variety of continuous control tasks in the MuJoCo control suite (Todorov et al., 2012). The agents are composed of joints: from 2 joints in the simplistic Swimmer task up to 17 in the Humanoid robot task. The state is a vector representation of the agent, containing the spatial location and angular velocity of each element. The action is a continuous vector representing how much torque to apply to each joint. The task in these domains is to move forward as far as possible within a given time limit.
We run each task for 1 million steps and, as GAC is an off-policy approach, evaluate the policy every 5000 steps, reporting the average over 10 evaluations. We train GAC using a batch size of 128 and uncorrelated Gaussian noise for exploration. Results are depicted in Figure 4. Each curve presented is the product of 5 training procedures with randomly sampled seeds. In addition to our raw results, we compare to the relevant baselines [3], including: (1) DDPG (Lillicrap et al., 2015); (2) TD3 (Fujimoto et al., 2018), an off-policy actor-critic approach which represents the policy using a deterministic Delta distribution; and (3) PPO (Schulman et al., 2017), an on-policy method which represents the policy using a Gaussian distribution.

[3] We use the implementations of DDPG and PPO from the OpenAI baselines repository (Dhariwal et al., 2017), and of TD3 (Fujimoto et al., 2018) from the authors' GitHub repository.
As we have shown in the previous sections, DPO and GAC only require some target distribution to be defined, namely, a distribution over actions with positive advantage. In our results we present two such distributions: the Linear and Boltzmann distributions (see Table 1). We also test a non-autoregressive version of our model using an IQN [4]. For completeness, the supplementary material provides additional discussion of the various parameters and how they performed, as well as a pseudo-code illustration of our approach.

[4] Theoretically, the dimensions of the actions may be correlated and should thus be represented using an autoregressive model.
Comparison to the policy gradient baselines: Results in Figure 4 show the ability of GAC to solve complex, high-dimensional problems. GAC attains competitive results across all domains, often outperforming the baseline policy gradient algorithms and exhibiting lower variance. This is somewhat surprising, as GAC is a vanilla algorithm: it is not supported by the numerous improvements apparent in recent PG methods. In addition to these results, we provide numerical results in the supplementary material which emphasize this claim.
Parameter Comparison: Below we discuss how various parameters affect the behavior of GAC in terms of convergence rates and overall performance:
At each step, the target policy is approximated through samples using the weighted quantile loss (Equation (3)). The results presented in Figure 4 are obtained using 256 samples at each step. 128 samples are taken uniformly over the action space and 128 from the delayed policy (a form of combining exploration and exploitation). Ablation tests showed that increasing the number of samples improved stability and overall performance. Moreover, we observed that the combination of both sampling methods is crucial for success.
We present results for Linear and Boltzmann type target policies. Not presented is the Uniform distribution over the target actions, which did not work well. We believe this is due to the fact that the Uniform target assigns equal weight to actions which are very good and to those which barely improve the value.
We observed that in most tasks, similar to the observations of Korenkevych et al. (2019), the AIQN model outperforms the IQN (non-autoregressive) one. Nevertheless, in the Humanoid task, the IQN version dramatically outperforms all other approaches. As the IQN is contained within the AIQN approach, we believe this phenomenon is due to the complexity of modeling the inter-dimension dependencies, which results in faster convergence of the simpler IQN model.
6 Related Work
Distributional RL: Recent interest in distributional methods for RL has grown with the introduction of deep RL approaches for learning the distribution of the return. Bellemare et al. (2017) presented the C51-DQN which partitions the possible values into a fixed number of bins and estimates the p.d.f. of the return over this discrete set. Dabney et al. (2017) extended this work by representing the c.d.f. using a fixed number of quantiles. Finally, Dabney et al. (2018a) extended the QR-DQN to represent the entire distribution using the Implicit Quantile Network (IQN). In addition to the empirical line of work, Qu et al. (2018) and Rowland et al. (2018) have provided fundamental theoretical results for this framework.
Generative Modeling: Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) combine two neural networks in a game-theoretic approach which attempts to find a Nash Equilibrium. This equilibrium is found when the generative model is capable of “fooling” the discriminator (i.e., the discriminator is no longer capable of distinguishing between samples produced from the real distribution and those from the generator). Multiple GAN models and training methods have been introduced, including the Wasserstein-GAN (Arjovsky et al., 2017) which minimizes the Wasserstein loss. However, as the optimization scheme is highly non-convex, these approaches are not proven to converge and may thus suffer from instability and mode collapse (Salimans et al., 2016).
Policy Learning: Learning a policy is generally performed using one of two methods. The Policy Gradient (PG) (Williams, 1992; Sutton et al., 2000a) defines the gradient as the direction which maximizes the reward under the assumed policy parametrization class. Although there have been a multitude of improvements, including the ability to cope with deterministic policies (Silver et al., 2014; Lillicrap et al., 2015), stabilized learning through trust region updates (Schulman et al., 2015, 2017) and Bayesian approaches (Ghavamzadeh et al., 2016), these methods are bound to parametric distribution sets (as the gradient is w.r.t. the log probability of the action). An alternative line of work formulates the problem as maximum entropy RL (Haarnoja et al., 2018), which enables the definition of the target policy using an energy functional. However, training is performed by minimizing the KL-divergence; the need to compute the KL-divergence limits practical implementations to parametric distribution functions, similar to PG methods.
7 Discussion and Future Work
In this work we presented limitations inherent to empirical Policy Gradient (PG) approaches in continuous control. While current PG methods in continuous control are computationally efficient, they are not ensured to converge to a global extremum. As the policy gradient is defined w.r.t. the log probability of the policy, the gradient results in local changes in the action space (e.g., changing the mean and variance of a Gaussian policy). These limitations do not occur in discrete action spaces.
In order to ensure better asymptotic results, it is often necessary to use methods that are more complex and computationally demanding (i.e., “No Free Lunch” (Wolpert et al., 1997)). Existing approaches attempting to mitigate these issues either enrich the policy space using mixture models or discretize the action space. However, while the discretization scheme is appealing, there is a clear trade-off between optimality and efficiency: while finer discretization improves guarantees, the complexity (number of discrete actions) grows exponentially in the action dimension (Tang and Agrawal, 2019).
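The exponential blow-up is easy to quantify: with $K$ bins per dimension, the discrete action set has $K^d$ elements (the bin count below is an arbitrary choice of ours).

```python
# Discretizing each of d action dimensions into K bins yields K**d discrete
# actions; e.g. with K = 11 bins per joint (an arbitrary choice):
for d in (1, 3, 6, 17):            # d = 17 matches the Humanoid action dimension
    print(d, 11 ** d)
```

Even a coarse 11-bin grid already yields more than $10^{17}$ discrete actions for the 17-dimensional Humanoid.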
These limitations also exist when considering mixture models, such as Gaussian mixtures. A mixture model of $k$ Gaussians provides a categorical distribution over $k$ Gaussian distributions. The policy gradient w.r.t. these parameters, as in the single-Gaussian model, directly controls the mean and variance of each Gaussian independently. As such, even a mixture model is confined to local improvement in the action space.
In practical scenarios, as the number of Gaussians grows, it is likely that the modes of the mixture would be located in the vicinity of a global optimum. A Gaussian mixture model may therefore be able to cope with various non-convex continuous control problems. Nevertheless, we note that Gaussian mixture models, unlike a single Gaussian, are numerically unstable: due to the summation over Gaussians, the log probability of the mixture does not reduce to a simple linear form, but requires a log-sum-exp over components. This can cause numerical instability and thus hinder the learning process. These insights lead us to question the optimality of current PG approaches in continuous control, suggesting that, although these approaches are well understood, there is room for research into alternative policy-based approaches.
In this paper we suggested the Distributional Policy Optimization (DPO) framework and its empirical implementation - the Generative Actor Critic (GAC). We evaluated GAC on a series of continuous control tasks under the MuJoCo control suite. When considering overall performance, we observed that despite the algorithmic maturity of PG methods, GAC attains competitive performance and often outperforms the various baselines. Nevertheless, as noted above, there is “no free lunch”. While GAC remains as sample efficient as the current PG methods (in terms of the batch size during training and number of environment interactions), it suffers from high computational complexity.
Finally, the elementary framework presented in this paper can be extended in various future research directions. First, improving computational efficiency is a top priority for GAC to achieve deployment on real robotic agents. In addition, as the target distribution is defined w.r.t. the advantage function, future work may consider integrating uncertainty estimates in order to improve exploration. Moreover, PG methods have been thoroughly researched, and many of their improvements, such as trust region optimization (Schulman et al., 2015), can be adapted to the DPO framework. Finally, DPO and GAC can be readily applied to other well-known frameworks such as the Soft Actor-Critic (Haarnoja et al., 2018), in which entropy of the policy is encouraged through an augmented reward function. We believe this work is a first step towards a principled alternative for RL in continuous action space domains.
- Andrychowicz et al.  Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177, 2018.
- Arjovsky et al.  Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.
- Bellemare et al.  Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 449–458. JMLR.org, 2017.
- Bhatnagar and Lakshmanan  Shalabh Bhatnagar and K Lakshmanan. An online actor–critic algorithm with function approximation for constrained markov decision processes. Journal of Optimization Theory and Applications, 153(3):688–708, 2012.
- Borkar  Vivek S Borkar. Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer, 2009.
- Chou et al.  Po-Wei Chou, Daniel Maturana, and Sebastian Scherer. Improving stochastic policy gradients in continuous control with deep reinforcement learning using the beta distribution. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 834–843. JMLR.org, 2017.
- Chow et al.  Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research, 18(1):6070–6120, 2017.
- Dabney et al.  Will Dabney, Mark Rowland, Marc G Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. arXiv preprint arXiv:1710.10044, 2017.
- Dabney et al. [2018a] Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. arXiv preprint arXiv:1806.06923, 2018a.
- Dabney et al. [2018b]  Will Dabney, Mark Rowland, Marc G Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018b.
- Dhariwal et al.  Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. Openai baselines. https://github.com/openai/baselines, 2017.
- Fujimoto et al.  Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.
- Ghavamzadeh et al.  Mohammad Ghavamzadeh, Yaakov Engel, and Michal Valko. Bayesian policy gradient and actor-critic algorithms. The Journal of Machine Learning Research, 17(1):2319–2371, 2016.
- Goodfellow et al.  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
- Haarnoja et al.  Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1352–1361. JMLR.org, 2017.
- Haarnoja et al.  Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1856–1865, 2018.
- Huber  Peter J Huber. Robust estimation of a location parameter. In Breakthroughs in statistics, pages 492–518. Springer, 1992.
- Kallenberg  Olav Kallenberg. Foundations of modern probability. Springer Science & Business Media, 2006.
- Koenker and Hallock  Roger Koenker and Kevin Hallock. Quantile regression: An introduction. Journal of Economic Perspectives, 15(4):43–56, 2001.
- Koenker and Xiao  Roger Koenker and Zhijie Xiao. Quantile autoregression. Journal of the American Statistical Association, 101(475):980–990, 2006.
- Konda and Tsitsiklis  Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000.
- Korenkevych et al.  Dmytro Korenkevych, A Rupam Mahmood, Gautham Vasan, and James Bergstra. Autoregressive policies for continuous control deep reinforcement learning. arXiv preprint arXiv:1903.11524, 2019.
- Levine et al.  Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
- Lillicrap et al.  Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- Ostrovski et al.  Georg Ostrovski, Will Dabney, and Rémi Munos. Autoregressive quantile networks for generative modeling. arXiv preprint arXiv:1806.05575, 2018.
- Peng et al.  Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. arXiv preprint arXiv:1804.02717, 2018.
- Polyak [1990] Boris T Polyak. New stochastic approximation type procedures. Automat. i Telemekh, 7(98-107):2, 1990.
- Puterman [1994] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 1994.
- Puterman and Brumelle [1979] Martin L Puterman and Shelby L Brumelle. On the convergence of policy iteration in stationary dynamic programming. Mathematics of Operations Research, 4(1):60–69, 1979.
- Qu et al. [2018] Chao Qu, Shie Mannor, and Huan Xu. Nonlinear distributional gradient temporal-difference learning. arXiv preprint arXiv:1805.07732, 2018.
- Riedmiller et al. [2018] Martin Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas Degrave, Tom Wiele, Vlad Mnih, Nicolas Heess, and Jost Tobias Springenberg. Learning by playing – solving sparse reward tasks from scratch. In International Conference on Machine Learning, pages 4341–4350, 2018.
- Rowland et al. [2018] Mark Rowland, Marc G Bellemare, Will Dabney, Rémi Munos, and Yee Whye Teh. An analysis of categorical distributional reinforcement learning. arXiv preprint arXiv:1802.08163, 2018.
- Salimans et al. [2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in neural information processing systems, pages 2234–2242, 2016.
- Salimans et al. [2017] Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
- Schulman et al. [2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
- Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Silver et al. [2014] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.
- Sutton and Barto [1998] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
- Sutton et al. [2000a] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000a.
- Sutton et al. [2000b] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000b.
- Tang and Agrawal [2019] Yunhao Tang and Shipra Agrawal. Discretizing continuous action space for on-policy optimization. arXiv preprint arXiv:1901.10500, 2019.
- Todorov et al. [2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012.
- Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
- Williams [1992] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
- Wolpert et al. [1997] David H Wolpert, William G Macready, et al. No free lunch theorems for optimization. IEEE transactions on evolutionary computation, 1(1):67–82, 1997.
Appendix A Proof of Proposition 1
Let . We consider a single-state MDP (i.e., an x-armed bandit) with action space and a multi-modal reward function defined by
where will be defined later, and is the Dirac delta function satisfying for all continuous compactly supported functions .
Denote by the multivariate Gaussian distribution, defined by
In PG, we assume is parameterized by some parameters . Without loss of generality, let us consider the derivative with respect to . At iteration the derivative can be written as
PG will thus update the policy parameter by
Notice that given a Bernoulli random variable , one can write . Then by Fubini’s theorem we have
We wish to show that the gradient has a higher correlation with the direction of rather than . That is, we wish to show that
Substituting the above, the equation is equivalent to
Since we only need to show that for large enough (which depends on the constants and )
as all other values tend to zero.
If then we are done. Otherwise, if then
where in the first step we used the Cauchy–Schwarz inequality, and in the second step we used the fact that if a vector satisfies then for any constant , .
Assume Equation (4) holds for some . Then by the gradient procedure we know that , and thus we can use the same proof as in the base case. Hence, and the result follows for .
Appendix B Experimental Details
Target policy estimation:
To estimate the target policy, for each state , we sample 128 actions uniformly from the action space and 128 actions from the target policy , and the per-sample loss is weighted by the positive advantage . This can be seen as a form of ‘exploration-exploitation’: while uniform sampling ensures proper exploration of the action set, sampling from the policy has a higher probability of producing actions with positive advantage.
The loss is thus a weighted quantile loss. We note that while one would ideally define the target policy as the linear/Boltzmann distribution over the positive advantage, this is not possible in practice: as actions are sampled, we can only construct such a distribution on a per-batch basis. This approach does place higher weight on better-performing actions, but it results in a different underlying distribution. In addition, to ensure stability, we normalize the quantile loss weights in each batch; this ensures that very small (very large) advantage values do not incur near-zero (very large) gradients, which could harm model stability.
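As a rough sketch, the per-batch weight normalization and the quantile (pinball) loss described above can be written as follows. The function names and the clipping of advantages at zero are our own illustrative choices, not the exact implementation used in the experiments:

```python
def quantile_loss(pred, target, tau):
    """Pinball loss for quantile level tau in (0, 1)."""
    diff = target - pred
    return (tau if diff >= 0 else tau - 1.0) * diff

def weighted_quantile_loss(preds, targets, taus, advantages):
    """Per-sample quantile loss, weighted by the positive advantage and
    normalized within the batch so extreme advantages do not dominate."""
    weights = [max(a, 0.0) for a in advantages]   # keep only positive-advantage weight
    total = sum(weights) or 1.0                   # guard against an all-zero batch
    weights = [w / total for w in weights]        # per-batch normalization
    return sum(w * quantile_loss(p, t, tau)
               for w, p, t, tau in zip(weights, preds, targets, taus))
```

The normalization step is what prevents a single very large advantage from producing a huge gradient within a batch.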
Actor: As presented in Figure 3, our architecture incorporates a recurrent cell. The recurrent cell ensures that each dimension of the action is a function of the state , the sampled quantile and the previously predicted action dimensions . Notice that with this architecture, the prediction of is not affected by . This ordering is a strict requirement of the autoregressive approach.
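The sampling procedure this architecture induces can be sketched as below; `cell` stands in for the recurrent cell, and its exact form, as well as the placeholder input for the first dimension, are illustrative assumptions:

```python
import random

def autoregressive_sample(state, action_dim, cell, hidden):
    """Generate an action one dimension at a time: dimension i sees the
    state, a fresh quantile sample, and only previously generated dims."""
    action, prev = [], 0.0            # placeholder 'previous dim' input for i = 0
    for _ in range(action_dim):
        tau = random.random()         # quantile sample ~ U(0, 1)
        out, hidden = cell(state, tau, prev, hidden)
        action.append(out)
        prev = out                    # a_i feeds a_{i+1}, never the reverse
    return action
```

The loop makes the ordering constraint explicit: each emitted dimension only ever consumes dimensions generated before it.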
We believe other, potentially more efficient, architectures can be explored. For instance, a fully connected network, similar to the non-autoregressive approach, with attention over the previous action dimensions may work well [Vaswani et al., 2017]. Such an evaluation is outside the scope of this work and is an interesting direction for future work.
Value & Critic:
While the actor architecture is a non-standard approach, for both the value and critic networks we use the classic MLP. Specifically, we use a two-layer fully connected network with 400 and 300 neurons in the first and second layer, respectively. Similarly to Fujimoto et al., the critic receives a concatenated vector of both the state and action as input.
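A minimal sketch of such a critic follows, in plain Python with randomly initialized weights, purely to make the 400/300 layout and the state-action concatenation concrete; the actual implementation and initialization scheme may differ:

```python
import math, random

def linear(x, w, b):
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def relu(x):
    return [max(v, 0.0) for v in x]

def make_layer(n_in, n_out, rng):
    s = 1.0 / math.sqrt(n_in)                      # fan-in scaled init (illustrative)
    w = [[rng.uniform(-s, s) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def critic_forward(state, action, params):
    """Q(s, a): the critic consumes the concatenated state-action vector."""
    x = list(state) + list(action)                 # concatenation, as described above
    (w1, b1), (w2, b2), (w3, b3) = params
    x = relu(linear(x, w1, b1))                    # first layer: 400 units
    x = relu(linear(x, w2, b2))                    # second layer: 300 units
    return linear(x, w3, b3)[0]                    # scalar Q-value head
```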
Appendix C Discussion and Common Mistakes
As shown in the body of the paper, there exist alternative approaches. We use this section to provide additional discussion of how and why we decided on certain approaches, and of what else can be done.
C.1 Alternative Gradient Approaches
Going back to the policy gradient approach, specifically the deterministic version, we can write the value of the current policy of our generative model (policy) as:
or an estimation using samples
It may then be desirable to directly optimize this objective function by taking the gradient with respect to the parameters of . However, this approach does not ensure optimality. The gradient direction is provided by the critic for each value of . This can be seen as optimizing an ensemble of DDPG models, where each value selects a different model from this set. As DDPG learns a uni-modal parametric distribution and is thus not ensured to converge to an optimal policy, this approach suffers from the same caveats.
Evolution Strategies [Salimans et al., 2017], however, is a feasible approach. As opposed to the gradient method, this approach can be seen as directly calculating , i.e., it estimates the best direction in which to move the policy. As long as the policy is capable of representing arbitrarily complex distributions, this approach should, in theory, converge to a global maximum. However, as we are interested in sample-efficient learning, our focus in this work was on introducing an off-policy learning method under the common actor-critic framework.
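The ES estimator referred to above can be sketched as follows; this is a direct transcription of the Salimans et al. [2017] search-gradient estimator, with illustrative hyperparameters:

```python
import random

def es_gradient(theta, fitness, sigma=0.1, n=1000, rng=None):
    """Estimate grad ~ (1 / (n * sigma)) * sum_i F(theta + sigma * eps_i) * eps_i,
    a finite-sample version of the ES search gradient."""
    rng = rng or random.Random(0)
    grad = [0.0] * len(theta)
    for _ in range(n):
        eps = [rng.gauss(0.0, 1.0) for _ in theta]     # Gaussian perturbation
        f = fitness([t + sigma * e for t, e in zip(theta, eps)])
        for j, e in enumerate(eps):
            grad[j] += f * e
    return [g / (n * sigma) for g in grad]
```

Note that the estimator never differentiates through the policy itself, which is why it sidesteps the parametric-distribution limitation discussed in the body of the paper.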
C.2 Target Networks and Stability
Our empirical approach, as shown in Algorithm 2, uses a target network for each approximator (critic, value and target policy). While the critic and value target networks serve mainly to stabilize the empirical approach and can be disposed of, the policy target network is required for the algorithm to converge (as shown in Section 3).
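One standard realization of a slowly tracking target network is Polyak averaging; the sketch below is an illustration, and the `tau` value is an assumed hyperparameter rather than the one used in our experiments:

```python
def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: the target slowly tracks the online network,
    keeping the regression target quasi-static."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]
```

Repeated application converges geometrically towards the online parameters, which is what makes the delayed networks appear quasi-static on the slower timescale.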
The quantile loss, and any distribution loss in general, is concerned with moving probability mass from the current distribution towards the target distribution. This leads to two potential issues when the delayed policy is absent: (1) non-quasi-stationarity of the target distribution, and (2) a non-increasing policy.
The first point is important from an optimization point of view. As the quantile loss aims to estimate some target distribution, the assumption is that this distribution is static. Without the delayed policy network, this distribution potentially changes at every time step and thus cannot be properly estimated using sample-based approaches. The delayed policy solves this problem: since it tracks the policy on a slower timescale, it can be seen as quasi-static, and the target distribution becomes well defined.
The second point is important from an RL point of view. In general, RL proofs revolve around two concepts: either one attempts to learn the optimal Q-values, and convergence is shown by proving the operator contracts towards a unique globally stable equilibrium; or the goal is to learn a policy, and the proof is based on showing that the policy improves monotonically. As the delayed policy network slowly tracks the policy network, the multi-timescale framework tells us that “by the time” the delayed policy network changes, the policy network can be assumed to have converged. As the policy network aims to estimate a distribution over the positive advantage of the delayed policy, this approach ensures that the delayed policy improves monotonically (under the correct theoretical step-size and realizability assumptions).
C.3 Sample Complexity and Policy Samples
When considering sample complexity in its simplest form, our approach is as efficient as the baselines we compare against: it requires neither larger batches nor more environment samples. However, as we are optimizing a generative model, it does require sampling from the model itself.
As opposed to Dabney et al. [2018a], we found that in our approach the number of samples does affect the network’s ability to converge. While using 16 samples for each transition in the batch resulted in relatively good policies, increasing this number affected stability and performance positively. For this reason, we ran with a sample size of 128. This results in longer training times: for instance, training the TD3 algorithm on the Hopper-v2 domain using two NVIDIA GTX 1080 Ti cards took around 3 hours, whereas our approach took 40 hours to train. We argue that, as the resulting policy is often what matters, it is worth sacrificing time efficiency to obtain a better final result.
C.4 Generative Adversarial Policy Training
Our approach uses the AIQN framework to train a generative policy. An alternative method for learning distributions from samples is the GAN framework: a discriminator can be trained to differentiate between samples from the current policy and samples from the target distribution; training the policy to ‘fool’ the discriminator then results in a generated distribution similar to the target.
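For reference, the standard (non-saturating) GAN objectives that such an alternative would optimize look as follows; this is a generic sketch of the GAN losses, not something we trained:

```python
import math

def discriminator_loss(d_target, d_policy):
    """D maximizes log D(target sample) + log(1 - D(policy sample));
    d_target and d_policy are the discriminator's outputs in (0, 1)."""
    return -(math.log(d_target) + math.log(1.0 - d_policy))

def generator_loss(d_policy):
    """The policy 'fools' D via the non-saturating objective:
    maximize log D(policy sample)."""
    return -math.log(d_policy)
```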
However, while the GAN framework has seen multiple successes, it still lacks theoretical guarantees of convergence to a Nash equilibrium. As opposed to the AIQN, which is trained with a supervision signal, the GAN approach is modeled as a two-player zero-sum game.
Appendix D Distributional Policy Optimization Assumptions
We provide the assumptions required for the 3-timescale stochastic approximation approach, namely DPO, to converge.
The first assumption regards the step sizes. It ensures that the policy moves on the fastest timescale, the value and critic on an intermediate timescale, and the delayed policy on the slowest. This enables the quasi-static analysis, in which the fast elements see the slower ones as static, and the slow elements view the faster ones as if they have already converged.
[Step size assumption]
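To make the ordering concrete, the following hypothetical schedules satisfy the usual multi-timescale requirements (square-summability of each schedule and vanishing step-size ratios between timescales); the exponents are illustrative, not the ones used in our experiments:

```python
def step_sizes(n):
    """Illustrative schedules: the policy (fastest) uses the largest,
    slowest-decaying step; the ratios b_n / a_n and c_n / b_n vanish
    as n grows, enabling the quasi-static view."""
    a_n = (n + 1) ** -0.55   # policy (fastest timescale)
    b_n = (n + 1) ** -0.75   # value and critic (intermediate)
    c_n = (n + 1) ** -1.0    # delayed policy (slowest)
    return a_n, b_n, c_n
```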
The second assumption requires that the action set be compact. Since there exists a deterministic policy which is optimal, this assumption ensures that this policy is indeed finite and thus the process converges.
[Compact action set] The action set is compact for every .
The final two assumptions (3 and 4) ensure that , moving on the fast timescale, converges. The Lipschitz assumption ensures that the action-value function, and in turn the target distribution, are smooth.
[Lipschitz and bounded Q] The action-value function is Lipschitz and bounded for every and .
For any and , there exists a loss such that as .
Finally, it can be shown that DPO converges under these assumptions using the standard multi-timescale approach.