Distributional Policy Optimization: An Alternative Approach for Continuous Control

05/23/2019 ∙ by Chen Tessler, et al.

We identify a fundamental problem in policy gradient-based methods in continuous control. As policy gradient methods require the agent's underlying probability distribution, they limit policy representation to parametric distribution classes. We show that optimizing over such sets results in local movement in the action space and thus convergence to sub-optimal solutions. We suggest a novel distributional framework, able to represent arbitrary distribution functions over the continuous action space. Using this framework, we construct a generative scheme, trained using an off-policy actor-critic paradigm, which we call the Generative Actor Critic (GAC). Compared to policy gradient methods, GAC does not require knowledge of the underlying probability distribution, thereby overcoming these limitations. Empirical evaluation shows that our approach is comparable to, and often surpasses, current state-of-the-art baselines in continuous domains.




1 Introduction

Model-free Reinforcement Learning (RL) is a learning paradigm which aims to maximize a cumulative reward signal based on experience gathered through interaction with an environment (Sutton and Barto, 1998). It is divided into two primary categories. Value-based approaches involve learning the value of each action and acting greedily with respect to it (i.e., selecting the action with the highest value). On the other hand, policy-based approaches (the focus of this work) learn the policy directly, thereby explicitly learning a mapping from state to action.

Policy gradients (PGs) (Sutton et al., 2000b) have been the go-to approach for learning policies in empirical applications. The combination of the policy gradient with recent advances in deep learning has enabled the application of RL in complex and challenging environments. Such domains include continuous control problems, in which an agent controls complex robotic machines both in simulation (Schulman et al., 2015; Haarnoja et al., 2017; Peng et al., 2018) as well as real life (Levine et al., 2016; Andrychowicz et al., 2018; Riedmiller et al., 2018). Nevertheless, there exists a fundamental problem when PG methods are applied to continuous control regimes. As the gradients require knowledge of the probability of the performed action, the PG is limited in practice to parametric distribution functions. Common parametric distributions used in the literature include the Gaussian (Schulman et al., 2015, 2017), Beta (Chou et al., 2017) and Delta (Silver et al., 2014; Lillicrap et al., 2015; Fujimoto et al., 2018) distribution functions.

In this work, we show that while the PG is properly defined over parametric distribution functions, it is prone to converge to sub-optimal extrema (Section 3). The leading reason is that these distributions are not convex in the distribution space (as an example, consider the Gaussian distribution, which is known to be non-convex) and are thus limited to local improvement in the action space itself. Inspired by Approximate Policy Iteration schemes, for which convergence guarantees exist (Puterman and Brumelle, 1979), we introduce the Distributional Policy Optimization (DPO) framework in which an agent’s policy evolves towards a distribution over improving actions. This framework requires the ability to minimize a distance (loss function) which is defined over two distributions, as opposed to the policy gradient approach which requires an explicit differentiation through the density function.

DPO establishes the building blocks for our generative algorithm, the Generative Actor Critic (GAC; code is provided in the following anonymous repository: github.com/neurips-2019/GAC). It is composed of three elements: a generative model which represents the policy, a value, and a critic. The value and the critic are combined to obtain the advantage of each action. A target distribution is then defined as one which improves the value (i.e., all actions with negative advantage receive zero probability mass). The generative model is optimized directly from samples without the explicit definition of the underlying probability distribution using quantile regression and Autoregressive Implicit Quantile Networks (see Section 4). Generative Actor Critic is evaluated on tasks in the MuJoCo control suite (Section 5), showing promising results on several difficult baselines.

2 Preliminaries

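We consider an infinite-horizon discounted Markov Decision Process (MDP) with a continuous action space. An MDP is defined as the 5-tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$ (Puterman, 1994), where $\mathcal{S}$ is a countable state space, $\mathcal{A}$ the continuous action space, $P$ is a transition kernel, $r$ is a reward function, and $\gamma \in (0, 1)$ is the discount factor. Let $\pi : \mathcal{S} \to \mathcal{B}(\mathcal{A})$ be a stationary policy, where $\mathcal{B}(\mathcal{A})$ is the set of probability measures on the Borel sets of $\mathcal{A}$. We denote by $\Pi$ the set of stationary stochastic policies. In addition to $\Pi$, often one is interested in optimizing over a set of parametric distributions. We denote the set of possible distribution parameters by $\Theta$ (e.g., the mean $\mu$ and variance $\sigma^2$ of a Gaussian distribution).

Two measures of interest in RL are the value and action-value functions $v^\pi$ and $Q^\pi$, respectively. The value of a policy $\pi$, starting at state $s$ and performing action $a$, is defined by $Q^\pi(s, a) = \mathbb{E}^\pi \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a \right]$. The value function is then defined by $v^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)} \left[ Q^\pi(s, a) \right]$. Given the action-value and value functions, the advantage of an action $a$ at state $s$ is defined by $A^\pi(s, a) = Q^\pi(s, a) - v^\pi(s)$. The optimal policy is defined by $\pi^* = \arg\max_{\pi \in \Pi} v^\pi$ and the optimal value by $v^* = v^{\pi^*}$.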
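The quantities above can be illustrated with a minimal sketch. It assumes a finite reward sequence (a truncated horizon) and a finite set of candidate actions at a single state, both simplifications made only for illustration; the function names are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum_t gamma^t * r_t for a finite reward sequence (truncated horizon)."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def value_and_advantage(q_values, action_probs):
    """Return v(s) = sum_a pi(a|s) Q(s, a) and A(s, a) = Q(s, a) - v(s).

    Assumes a finite list of candidate actions at one state (an illustrative
    simplification of the continuous action space).
    """
    v = sum(p * q for p, q in zip(action_probs, q_values))
    return v, [q - v for q in q_values]


ret = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
v, adv = value_and_advantage([1.0, 3.0], [0.5, 0.5])
```

Actions with a positive entry in `adv` are those that improve on the current value, which is the set the remainder of the paper builds on.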

3 From Policy Gradient to Distributional Policy Optimization

Current practical approaches leverage the Policy Gradient Theorem (Sutton et al., 2000b) in order to optimize a policy, which updates the policy parameters according to

$$\theta_{k+1} = \theta_k + \alpha_k \, \mathbb{E}_{s \sim d(\pi_{\theta_k}),\, a \sim \pi_{\theta_k}(\cdot \mid s)} \left[ \nabla_\theta \log \pi_\theta(a \mid s) \big|_{\theta = \theta_k} \, Q^{\pi_{\theta_k}}(s, a) \right],$$

where $d(\pi)$ is the stationary distribution of states under $\pi$. Since this update rule requires knowledge of the log probability of each action under the current policy, empirical methods in continuous control resort to parametric distribution functions. Most commonly used are the Gaussian (Schulman et al., 2017), Beta (Chou et al., 2017) and deterministic Delta (Lillicrap et al., 2015) distribution functions. However, as we show in Proposition 1, this approach is not guaranteed to converge to an optimal policy, even though there exists an optimal policy which is deterministic (i.e., Delta), a policy which is contained within this set.

The sub-optimality of uni-modal policies such as Gaussian or Delta distributions does not occur due to the limitation induced by their parametrization (e.g., the neural network), but is rather a result of the predefined set of policies. As an example, consider the set of Delta distributions. As illustrated in Figure 1, while this set is convex in the parameter $\mu$ (the mean of the distribution), it is not convex in the set $\Pi$. This is due to the fact that a mixture $\alpha \delta_{\mu_1} + (1 - \alpha) \delta_{\mu_2}$ results in a stochastic distribution over two supports, which cannot be represented using a single Delta function. Parametric distributions such as Gaussian and Delta functions highlight this issue, as the policy gradient considers the gradient w.r.t. the parameters $\theta$. This results in local movement in the action space. Clearly such an approach can only guarantee convergence to a locally optimal solution and not a global one.
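The local-movement argument can be checked numerically with a small sketch. It assumes a single-state problem with a two-window reward (a lower local mode near -1 and a higher global mode near 3), a Gaussian policy with fixed standard deviation, and exact gradients of the expected reward; all constants are illustrative choices and not taken from the paper.

```python
import math

SIGMA = 0.3  # fixed policy standard deviation (illustrative assumption)
# (lower, upper, height) of each reward window; the second one is optimal.
MODES = [(-1.5, -0.5, 0.5), (2.5, 3.5, 1.0)]


def normal_pdf(x, mu):
    z = (x - mu) / SIGMA
    return math.exp(-0.5 * z * z) / (SIGMA * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mu):
    return 0.5 * (1.0 + math.erf((x - mu) / (SIGMA * math.sqrt(2.0))))


def expected_reward(mu):
    """J(mu) = E_{a ~ N(mu, SIGMA^2)}[r(a)] for the window reward."""
    return sum(h * (normal_cdf(u, mu) - normal_cdf(l, mu)) for l, u, h in MODES)


def policy_gradient(mu):
    """Exact dJ/dmu, i.e. E[grad_mu log pi(a) * r(a)] in closed form."""
    return sum(h * (normal_pdf(l, mu) - normal_pdf(u, mu)) for l, u, h in MODES)


def run_pg(mu0, step=0.05, iters=500):
    mu = mu0
    for _ in range(iters):
        mu += step * policy_gradient(mu)  # gradient ascent on the mean
    return mu


mu_final = run_pg(-1.2)
```

Started near the local mode, the mean settles at the center of the lower window: the gradient contribution of the distant, better window is numerically negligible, so the policy never moves toward it.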

(a) Policy vs. Parameter Space
(b) Delta
(c) Gaussian
Figure 1: (a): A conceptual diagram comparing policy optimization in parameter space (black dots) with optimization in distribution space (white dots). Plots depict values in both spaces. As parameterized policies are non-convex in the distribution space, they are prone to converge to a local optimum. Considering the entire policy space ensures convergence to the global optimum. (b,c): Policy evolution of Delta and Gaussian parameterized policies for multi-modal problems.
Proposition 1.

For any initial Gaussian policy and there exists an MDP such that satisfies


where is the convergent result of a PG method with step size bounded by . Moreover, given the result follows even when is only known to lie in some ball of radius R around , .

Proof sketch.

For brevity we prove for the case of , such that is a finite interval . We also assume , and . The general case proof can be found in the supplementary material. Let . We consider a single state MDP (i.e., x-armed bandit) with action space and a multi-modal reward function (similar to the illustration in Figure 1(c)), defined by

where is the window function.

In PG, we assume is parameterized by some parameters . Without loss of generality, let us consider the derivative with respect to . At iteration the derivative can be written as PG will thus update the policy parameter by As , it holds that It follows that if and then so is . Then, . That is, the policy can never reach the interval in which the optimal solution lies. Hence, and the result follows for . ∎

3.1 Distributional Policy Optimization (DPO)

In order to overcome issues present in parametric distribution functions, we consider an alternative approach. In our solution, the policy does not evolve based on the gradient w.r.t. distribution parameters (e.g., $\mu, \sigma$), but rather updates the policy distribution according to

$$\pi_{k+1} = \Gamma \left( \pi_k - \alpha_k \nabla_\pi d \left( \mathcal{D}^{\pi_k}_{I^{\pi_k}}, \pi \right) \Big|_{\pi = \pi_k} \right),$$

where $\Gamma$ is a projection operator onto the set of distributions, $d$ is a distance measure (e.g., Wasserstein distance), and $\mathcal{D}^{\pi}_{I^{\pi}}$ is a distribution defined over the support $I^{\pi} = \{ a : A^{\pi}(s, a) > 0 \}$ (i.e., the positive advantage). Table 1 provides examples of such distributions.

Algorithm 1 describes the Distributional Policy Optimization (DPO) framework as a three time-scale approach to learning the policy. It can be shown, under standard stochastic approximation assumptions (Borkar, 2009; Konda and Tsitsiklis, 2000; Bhatnagar and Lakshmanan, 2012; Chow et al., 2017), to converge to an optimal solution. DPO consists of 4 elements: (1) a policy $\pi$ on a fast timescale, (2) a delayed policy $\pi'$ on a slow timescale, (3) a value and (4) a critic, which estimate the quality of the delayed policy $\pi'$ on an intermediate timescale. Unlike the PG approach, DPO does not require access to the underlying p.d.f. In addition, as $\pi$ is updated on the fast timescale, it can be optimized using supervised learning techniques. Finally, we note that in DPO, the target distribution induces a higher value than the current policy, ensuring an always improving policy.

The concept of policy evolution using positive advantage is depicted in Figure 2. While the policy starts as a uni-modal distribution, it is not restricted to this subset of policies. As the policy evolves, fewer actions have positive advantage, and the process converges to an optimal solution. In the next section we construct a practical algorithm under the DPO framework using a generative actor.
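The following toy sketch illustrates policy evolution toward the positive-advantage set. It makes several simplifying assumptions that are not part of the framework as stated: a single state, a discretized action grid, a two-window reward, a linear target (mass proportional to the positive advantage), and a plain convex-combination step toward the target in place of the projected distance-gradient step.

```python
def reward(a):
    if -1.5 <= a <= -0.5:
        return 0.5  # local mode
    if 2.5 <= a <= 3.5:
        return 1.0  # global mode
    return 0.0


# Discretized action grid on [-2, 4] (illustrative assumption).
ACTIONS = [-2.0 + 0.05 * i for i in range(121)]
REWARDS = [reward(a) for a in ACTIONS]


def value(policy):
    return sum(p * r for p, r in zip(policy, REWARDS))


def target_distribution(policy):
    """Linear target: mass proportional to the positive part of the advantage."""
    v = value(policy)
    positive = [max(r - v, 0.0) for r in REWARDS]
    total = sum(positive)
    if total == 0.0:
        return None  # no improving action remains
    return [x / total for x in positive]


def dpo_step(policy, alpha=0.1):
    target = target_distribution(policy)
    if target is None:
        return policy
    # A convex combination stays inside the probability simplex.
    return [(1.0 - alpha) * p + alpha * t for p, t in zip(policy, target)]


# Start with all mass on the action closest to the local mode at -1.
start = min(range(len(ACTIONS)), key=lambda i: abs(ACTIONS[i] + 1.0))
policy = [1.0 if i == start else 0.0 for i in range(len(ACTIONS))]

values = [value(policy)]
for _ in range(100):
    policy = dpo_step(policy)
    values.append(value(policy))

global_mass = sum(p for p, r in zip(policy, REWARDS) if r == 1.0)
```

Because the update moves probability mass directly onto improving actions, the policy leaves the local mode even though no small shift of its initial support would reach the better window.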

Algorithm 1 Distributional Policy Optimization (DPO)
Input: learning rates
Table 1: Examples of target distributions over the set of improving actions (e.g., Boltzmann)

4 Method

In this section we present our method, the Generative Actor Critic, which learns a policy based on the Distributional Policy Optimization framework (Section 3). Distributional Policy Optimization requires a model which is both capable of representing arbitrarily complex distributions and can be optimized by minimizing a distributional distance. We consider the Autoregressive Implicit Quantile Network (Ostrovski et al., 2018), which is detailed below.

4.1 Quantile Regression & Autoregressive Implicit Quantile Networks

As seen in Algorithm 1, DPO requires the ability to minimize a distance between two distributions. The Implicit Quantile Network (IQN) (Dabney et al., 2018a) provides such an approach using the Wasserstein metric. The IQN receives a quantile value $\tau \in [0, 1]$ and is tasked with returning the value of the corresponding quantile from a target distribution. As the IQN learns to predict the value of the quantile, it allows one to sample from the underlying distribution (i.e., by sampling $\tau \sim U([0, 1])$ and performing a forward pass). Learning such a model requires the ability to estimate the quantiles. The quantile regression loss (Koenker and Hallock, 2001) provides this ability. It is given by $\rho_\tau(u) = (\tau - \mathbb{1}\{u \le 0\}) \, u$, where $\tau \in [0, 1]$ is the quantile and $u$ the error.
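As a small illustration of the quantile regression loss, the sketch below evaluates the standard pinball form on a toy data set and confirms that its minimizer is the empirical quantile. The error is taken as target minus prediction, and the helper names are hypothetical.

```python
def quantile_loss(tau, u):
    """Pinball loss: (tau - 1{u <= 0}) * u, with u = target - prediction."""
    indicator = 1.0 if u <= 0 else 0.0
    return (tau - indicator) * u


def empirical_quantile(data, tau):
    """Return the candidate in `data` that minimizes the summed quantile loss."""
    return min(data, key=lambda q: sum(quantile_loss(tau, x - q) for x in data))


data = [float(i) for i in range(101)]  # the values 0, 1, ..., 100
q50 = empirical_quantile(data, 0.5)
q90 = empirical_quantile(data, 0.9)
```

The asymmetric weighting is what selects the quantile: under-predictions cost `tau` per unit of error and over-predictions cost `1 - tau`.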

Nevertheless, the IQN is only capable of coping with univariate (scalar) distribution functions. Ostrovski et al. (2018) proposed to extend the IQN to the multi-variate case using quantile autoregression (Koenker and Xiao, 2006). Let $X = (X_1, \dots, X_n)$ be an n-dimensional random variable. Given a fixed ordering of the $n$ dimensions, the c.d.f. can be written as the product of conditional likelihoods

$$F_X(x) = P(X_1 \le x_1, \dots, X_n \le x_n) = \prod_{i=1}^{n} F_{X_i \mid X_{i-1}, \dots, X_1}(x_i).$$

The Autoregressive Implicit Quantile Network (AIQN) receives an i.i.d. vector $\tau \sim U([0, 1]^n)$. The network architecture then ensures each output dimension $x_i$ is conditioned on the previously generated values $x_1, \dots, x_{i-1}$; it is trained by minimizing the quantile regression loss.
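The autoregressive sampling idea can be sketched without a neural network by using known conditional quantile functions. The example assumes a two-dimensional Gaussian target with correlation `RHO` (an illustrative choice), for which the second coordinate given the first is again Gaussian; a learned AIQN would replace these closed forms.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()
RHO = 0.8  # correlation of the illustrative 2-D Gaussian target


def sample_autoregressive(tau1, tau2):
    """Map i.i.d. uniforms to a correlated sample, one dimension at a time."""
    x1 = STD_NORMAL.inv_cdf(tau1)  # quantile of the first marginal
    # X2 | X1 = x1 is N(RHO * x1, 1 - RHO^2), so its quantile is:
    x2 = RHO * x1 + (1.0 - RHO ** 2) ** 0.5 * STD_NORMAL.inv_cdf(tau2)
    return x1, x2


def uniform_open(rng):
    """Uniform draw in the open interval (0, 1); inv_cdf rejects 0 and 1."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def correlation(pairs):
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x, _ in pairs)
    vy = sum((y - my) ** 2 for _, y in pairs)
    return cov / (vx * vy) ** 0.5


rng = random.Random(0)
samples = [
    sample_autoregressive(uniform_open(rng), uniform_open(rng))
    for _ in range(20000)
]
estimated_rho = correlation(samples)
```

Conditioning the second quantile function on the first output is what lets independent noise coordinates produce dependent action dimensions.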

Figure 2: Policy evolution of a general, non-parametric policy, where the target policy is a distribution over the actions with positive advantage. The horizontal dashed line denotes the current value of the policy, the colored green region denotes the target distribution (i.e., the actions with a positive advantage) and denotes the policy after multiple updates. As opposed to Delta and Gaussian distributions, the fixed point of this approach is the optimal policy.

4.2 Generative Actor Critic (GAC)

Next, we introduce a practical implementation of the DPO framework. As shown in Section 3, DPO is composed of 4 elements: an actor, a delayed actor, a value, and an action-value estimator. The Generative Actor Critic (GAC) uses a generative actor trained using an AIQN, as described below. Contrary to parametric distribution functions, a generative neural network acts as a universal function approximator, enabling us to represent arbitrarily complex distributions, as a corollary of the following lemma.

Lemma (Kernels and Randomization (Kallenberg, 2006)).

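Let $\mu$ be a probability kernel from a measurable space $S$ to a Borel space $T$. Then there exists some measurable function $f : S \times [0, 1] \to T$ such that if $\vartheta$ is $U(0, 1)$, then $f(s, \vartheta)$ has distribution $\mu(s, \cdot)$ for every $s \in S$.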

Actor: DPO defines the actor as one which is capable of representing arbitrarily complex policies. To obtain this we construct a generative neural network, an AIQN. The AIQN learns a mapping from a sampled noise vector to a target distribution.

As illustrated in Figure 3, the actor network contains a recurrent cell which enables sequential generation of the action. This generation scheme ensures the autoregressive nature of the model. Each generated action dimension is conditioned only on the current sampled noise scalar and the previous action dimensions. In order to train the generative actor, the AIQN requires the ability to produce samples from the target distribution. Although we are unable to sample from this distribution, given an action, we are able to estimate its probability. An unbiased estimator of the loss can be attained by uniformly sampling actions and then weighting the loss of each sampled action by its corresponding weight. More specifically, the weighted autoregressive quantile loss is defined by


where is the coordinate of action , and is the Huber quantile loss (Huber, 1992; Dabney et al., 2018b). Estimation of in the target distribution is obtained using the estimated advantage.
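A minimal sketch of a weighted quantile Huber loss follows. It uses the Huber quantile loss in the form given by Dabney et al. (2018b); the per-action weights are taken as given inputs, and the function names, argument layout and the threshold `kappa=1.0` are assumptions made for illustration rather than the exact loss used here.

```python
def huber(u, kappa=1.0):
    """Huber loss: quadratic within [-kappa, kappa], linear outside."""
    if abs(u) <= kappa:
        return 0.5 * u * u
    return kappa * (abs(u) - 0.5 * kappa)


def quantile_huber(tau, u, kappa=1.0):
    """|tau - 1{u < 0}| * huber(u) / kappa, with u = target - prediction."""
    indicator = 1.0 if u < 0 else 0.0
    return abs(tau - indicator) * huber(u, kappa) / kappa


def weighted_quantile_loss(actions, predictions, taus, weights, kappa=1.0):
    """Sum over sampled actions of weight * per-coordinate quantile Huber loss.

    actions[i][d]     : d-th coordinate of the i-th sampled action
    predictions[i][d] : model output for quantile taus[i][d]
    weights[i]        : weight of action i under the target distribution
    """
    total = 0.0
    for action, pred, tau, w in zip(actions, predictions, taus, weights):
        total += w * sum(
            quantile_huber(t, a - p, kappa) for a, p, t in zip(action, pred, tau)
        )
    return total


loss = weighted_quantile_loss(
    actions=[[1.0], [0.0]],
    predictions=[[0.0], [0.0]],
    taus=[[0.5], [0.5]],
    weights=[1.0, 0.0],
)
```

An action with zero weight (for instance, one with non-positive advantage) contributes nothing, so the actor is only pulled toward improving actions.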

Delayed Actor: The delayed actor, maintained via Polyak averaging (Polyak, 1990), is an appealing requirement as it is common in off-policy actor-critic schemes (Lillicrap et al., 2015). The delayed actor is an additional AIQN $\pi_{\theta'}$, which tracks $\pi_\theta$. It is updated based on $\theta'_{k+1} = (1 - \alpha) \theta'_k + \alpha \theta_k$ and is used for training the value and critic networks.
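The tracking behavior of a delayed network can be sketched with a Polyak averaging step on plain parameter lists. The mixing rate `alpha` and the flat-list parameter representation are illustrative assumptions.

```python
def polyak_update(target_params, online_params, alpha):
    """Move each delayed parameter a fraction `alpha` toward the online one."""
    return [
        (1.0 - alpha) * t + alpha * p
        for t, p in zip(target_params, online_params)
    ]


one_step = polyak_update([0.0, 2.0], [1.0, 2.0], alpha=0.25)

# Repeated updates against a fixed online network converge to it.
delayed = [0.0, 2.0]
for _ in range(200):
    delayed = polyak_update(delayed, [1.0, 2.0], alpha=0.1)
```

A small `alpha` makes the delayed actor change slowly, which keeps the targets used for the value and critic updates stable.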

Figure 3: Illustration of the actor’s architecture. is the Hadamard product, a concatenation operator, and a mapping .

Value and Action-Value: While it is possible to train a critic and use its empirical mean w.r.t. the policy as a value estimate, we found it to be noisy, resulting in poor convergence. We therefore train a value network to estimate the expectation of the critic w.r.t. the delayed policy. In addition, as suggested in Fujimoto et al. (2018), we train two critic networks in parallel. During both policy and value updates, we use the minimum of the two critics' estimates. We observed that this indeed reduced variance and improved overall performance.
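The combination of twin critics and a separate value estimate can be sketched as follows. The sketch assumes the critic outputs for a batch of actions sampled from the delayed policy at one state are already available as plain lists; the function names are hypothetical.

```python
def min_critic(q1, q2):
    """Element-wise minimum of the two critics' estimates."""
    return [min(a, b) for a, b in zip(q1, q2)]


def value_target(q1, q2):
    """Regression target for the value network: mean of the pessimistic
    critic over actions sampled from the delayed policy."""
    q = min_critic(q1, q2)
    return sum(q) / len(q)


def advantages(q1, q2, v):
    """Advantage estimate A = min(Q1, Q2) - v for each sampled action."""
    return [q - v for q in min_critic(q1, q2)]


q1 = [1.0, 4.0, 2.0]
q2 = [2.0, 3.0, 2.0]
v = value_target(q1, q2)
adv = advantages(q1, q2, v)
```

Taking the minimum counteracts over-estimation by either critic, at the cost of a pessimistic bias.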

To summarize, GAC combines 4 elements. The delayed actor tracks the actor using a Polyak averaging scheme. The value and critic networks estimate the performance of the delayed actor. Given the value and critic estimates, we are able to estimate the advantage of each action and thus propose the weighted autoregressive quantile loss, used to train the actor network. We refer the reader to the supplementary material for an exhaustive overview of the algorithm and architectural details.

5 Experiments

In order to evaluate our approach, we test GAC on a variety of continuous control tasks in the MuJoCo control suite (Todorov et al., 2012). The agents are composed of joints: from 2 joints in the simple Swimmer task up to 17 in the Humanoid robot task. The state is a vector representation of the agent, containing the spatial location and angular velocity of each element. The action is a continuous vector, representing how much torque to apply to each joint. The task in these domains is to move forward as much as possible within a given time-limit.

We run each task for 1 million steps and, as GAC is an off-policy approach, evaluate the policy every 5000 steps and report the average over 10 evaluations. We train GAC using a batch size of 128 and uncorrelated Gaussian noise for exploration. Results are depicted in Figure 4. Each curve presented is a product of 5 training procedures with a randomly sampled seed. In addition to our raw results, we compare to the relevant baselines (we use the implementations of DDPG and PPO from the OpenAI baselines repo (Dhariwal et al., 2017), and TD3 (Fujimoto et al., 2018) from the authors' GitHub repository), including: (1) DDPG (Lillicrap et al., 2015), (2) TD3 (Fujimoto et al., 2018), an off-policy actor critic approach which represents the policy using a deterministic delta distribution, and (3) PPO (Schulman et al., 2017), an on-policy method which represents the policy using a Gaussian distribution.

As we have shown in the previous sections, DPO and GAC only require some target distribution to be defined, namely, a distribution over actions with positive advantage. In our results we present two such distributions: the linear and Boltzmann distributions (see Table 1). We also test a non-autoregressive version of our model using an IQN (theoretically, the dimensions of the actions may be correlated and thus should be represented using an auto-regressive model). For completeness, we provide additional discussion regarding the various parameters and how they performed, in addition to a pseudo-code illustration of our approach, in the supplementary material.

Figure 4: Training curves on continuous control benchmarks. For the Generative Actor Critic approach we present three models: (i) Autoregressive with Linear target distribution, (ii) Autoregressive with Boltzmann target distribution and (iii) Non-autoregressive with Boltzmann target distribution.

Comparison to the policy gradient baselines: Results in Figure 4 show the ability of GAC to solve complex, high dimensional problems. GAC attains competitive results across all domains, often outperforming the baseline policy gradient algorithms and exhibiting lower variance. This is somewhat surprising, as GAC is a vanilla algorithm that is not supported by the numerous improvements present in recent PG methods. In addition to these results, we provide numerical results in the supplementary material, which support this claim.

Parameter Comparison: Below we discuss how various parameters affect the behavior of GAC in terms of convergence rates and overall performance:

  1. At each step, the target policy is approximated through samples using the weighted quantile loss (Equation (3)). The results presented in Figure 4 are obtained using 256 samples at each step. 128 samples are taken uniformly over the action space and 128 from the delayed policy (a form of combining exploration and exploitation). Ablation tests showed that increasing the number of samples improved stability and overall performance. Moreover, we observed that the combination of both sampling methods is crucial for success.

  2. Figure 4 presents results for Linear and Boltzmann type target policies. Not presented is the Uniform distribution over the target actions, which did not work well. We believe this is due to the fact that the Uniform target gives equal weight to actions which are very good and to those which barely improve the value.

  3. We observed that in most tasks, similar to the observations of Korenkevych et al. (2019), the AIQN model outperforms the IQN (non-autoregressive) one. Nevertheless, in the Humanoid task, the IQN version dramatically outperforms all other approaches. As the IQN is contained within the AIQN approach, we believe that this phenomenon is due to the complexity of modeling the inter-action dependencies, which results in faster convergence of the simpler IQN model.
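The difference between the target types discussed in this list can be made concrete with a small sketch that turns a batch of advantage estimates into normalized weights. The exact functional forms (in particular the Boltzmann weight `exp(A / beta)` and the temperature `beta`) are assumptions made for illustration; the paper's Table 1 is the authoritative definition.

```python
import math


def target_weights(advantages, kind="linear", beta=1.0):
    """Per-batch weights over actions with positive advantage.

    kind: "uniform", "linear" or "boltzmann" (illustrative forms).
    Actions with non-positive advantage always receive zero weight.
    """
    if kind == "uniform":
        raw = [1.0 if a > 0.0 else 0.0 for a in advantages]
    elif kind == "linear":
        raw = [a if a > 0.0 else 0.0 for a in advantages]
    elif kind == "boltzmann":
        raw = [math.exp(a / beta) if a > 0.0 else 0.0 for a in advantages]
    else:
        raise ValueError("unknown target kind: " + kind)
    total = sum(raw)
    if total == 0.0:
        return raw  # no improving action in this batch
    return [w / total for w in raw]


adv = [-1.0, 0.1, 0.9]
uniform = target_weights(adv, "uniform")
linear = target_weights(adv, "linear")
boltzmann = target_weights(adv, "boltzmann")
```

With the uniform target, the barely improving action (advantage 0.1) receives the same weight as the clearly better one (advantage 0.9), which is consistent with the explanation offered in item 2.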

Environment Humanoid-v2 Walker2d-v2 Hopper-v2 HalfCheetah-v2 Ant-v2 Swimmer-v2 Relative Result

Table 2: Relative best GAC results compared to the best policy gradient baseline

6 Related Work

Distributional RL: Recent interest in distributional methods for RL has grown with the introduction of deep RL approaches for learning the distribution of the return. Bellemare et al. (2017) presented the C51-DQN which partitions the possible values into a fixed number of bins and estimates the p.d.f. of the return over this discrete set. Dabney et al. (2017) extended this work by representing the c.d.f. using a fixed number of quantiles (QR-DQN). Finally, Dabney et al. (2018a) extended the QR-DQN to represent the entire distribution using the Implicit Quantile Network (IQN). In addition to the empirical line of work, Qu et al. (2018) and Rowland et al. (2018) have provided fundamental theoretical results for this framework.

Generative Modeling: Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) combine two neural networks in a game-theoretic approach which attempts to find a Nash equilibrium. This equilibrium is found when the generative model is capable of “fooling” the discriminator (i.e., the discriminator is no longer capable of distinguishing between samples produced from the real distribution and those from the generator). Multiple GAN models and training methods have been introduced, including the Wasserstein-GAN (Arjovsky et al., 2017) which minimizes the Wasserstein loss. However, as the optimization scheme is highly non-convex, these approaches are not proven to converge and may thus suffer from instability and mode collapse (Salimans et al., 2016).

Policy Learning: Learning a policy is generally performed using one of two methods. The Policy Gradient (PG) (Williams, 1992; Sutton et al., 2000a) defines the gradient as the direction which maximizes the reward under the assumed policy parametrization class. Although there have been a multitude of improvements, including the ability to cope with deterministic policies (Silver et al., 2014; Lillicrap et al., 2015), stabilize learning through trust region updates (Schulman et al., 2015, 2017) and Bayesian approaches (Ghavamzadeh et al., 2016), these methods are bound to parametric distribution sets (as the gradient is w.r.t. the log probability of the action). An alternative line of work formulates the problem as a maximum entropy problem (Haarnoja et al., 2018), which enables the definition of the target policy using an energy functional. However, training is performed via minimizing the KL-divergence. The need to know the KL-divergence limits practical implementation to parametric distribution functions, similar to PG methods.

7 Discussion and Future Work

In this work we presented limitations inherent to empirical Policy Gradient (PG) approaches in continuous control. While current PG methods in continuous control are computationally efficient, they are not guaranteed to converge to a global extremum. As the policy gradient is defined w.r.t. the log probability of the policy, the gradient results in local changes in the action space (e.g., changing the mean and variance of a Gaussian policy). These limitations do not occur in discrete action spaces.

In order to ensure better asymptotic results, it is often necessary to use methods that are more complex and computationally demanding (i.e., “No Free Lunch” (Wolpert et al., 1997)). Existing approaches attempting to mitigate these issues either enrich the policy space using mixture models, or discretize the action space. However, while the discretization scheme is appealing, there is a clear trade-off between optimality and efficiency. While finer discretization improves guarantees, the complexity (number of discrete actions) grows exponentially in the action dimension (Tang and Agrawal, 2019).

The limitations inherent in PG approaches also exist when considering mixture models, such as Gaussian Mixtures. A mixture model of Gaussians provides a categorical distribution over Gaussian distributions. The policy gradient w.r.t. these parameters, similarly to the single Gaussian model, directly controls the mean and variance of each Gaussian independently. As such, even a mixture model is confined to local improvement in the action space.

In practical scenarios, and as the number of Gaussians grows, it is likely that the modes of the mixture would be located in the vicinity of a global optimum. A Gaussian Mixture model may therefore be able to cope with various non-convex continuous control problems. Nevertheless, we note that Gaussian Mixture models, unlike a single Gaussian, are numerically unstable. Due to the summation over Gaussians, the log probability of a mixture of Gaussians does not result in a linear representation. This can cause numerical instability, and thus hinder the learning process. These insights lead us to question the optimality of current PG approaches in continuous control, suggesting that, although these approaches are well understood, there is room for research into alternative policy-based approaches.
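One concrete form of the numerical issue mentioned above can be seen in a short sketch: because the logarithm is applied to a sum of densities, a direct evaluation underflows for actions far from every component, whereas the single-Gaussian log density stays finite. The log-sum-exp rearrangement shown as a remedy is a standard technique and is not something the paper proposes; all names are illustrative.

```python
import math


def gaussian_log_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def mixture_log_prob_naive(x, weights, mus, sigmas):
    """log(sum_k w_k * N(x; mu_k, sigma_k)) evaluated directly."""
    density = sum(
        w * math.exp(gaussian_log_pdf(x, m, s))
        for w, m, s in zip(weights, mus, sigmas)
    )
    return math.log(density)  # fails once the density underflows to 0.0


def mixture_log_prob_stable(x, weights, mus, sigmas):
    """Same quantity via log-sum-exp, shifted by the largest term."""
    terms = [
        math.log(w) + gaussian_log_pdf(x, m, s)
        for w, m, s in zip(weights, mus, sigmas)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


weights, mus, sigmas = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
near = 0.0
far = 50.0
stable_far = mixture_log_prob_stable(far, weights, mus, sigmas)
```

For the nearby action both functions agree; for the distant action the naive version raises a math domain error while the shifted version returns a finite log probability.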

In this paper we suggested the Distributional Policy Optimization (DPO) framework and its empirical implementation - the Generative Actor Critic (GAC). We evaluated GAC on a series of continuous control tasks under the MuJoCo control suite. When considering overall performance, we observed that despite the algorithmic maturity of PG methods, GAC attains competitive performance and often outperforms the various baselines. Nevertheless, as noted above, there is “no free lunch”. While GAC remains as sample efficient as the current PG methods (in terms of the batch size during training and number of environment interactions), it suffers from high computational complexity.

Finally, the elementary framework presented in this paper can be extended in various future research directions. First, improving the computational efficiency is a top priority for GAC to achieve deployment in real robotic agents. In addition, as the target distribution is defined w.r.t. the advantage function, future work may consider integrating uncertainty estimates in order to improve exploration. Moreover, PG methods have been thoroughly researched and many of their improvements, such as trust region optimization (Schulman et al., 2015), can be adapted to the DPO framework. Finally, DPO and GAC can be readily applied to other well-known frameworks such as the Soft-Actor-Critic (Haarnoja et al., 2018), in which entropy of the policy is encouraged through an augmented reward function. We believe this work is a first step towards a principled alternative for RL in continuous action space domains.


Appendix A Proof of Proposition 1

Let . We consider a single state MDP (i.e., x-armed bandit) with action space and a multi-modal reward function defined by

where will be defined later, and is the Dirac delta function satisfying for all continuous compactly supported functions .

Denote by the multivariate Gaussian distribution, defined by

In PG, we assume is parameterized by some parameters . Without loss of generality, let us consider the derivative with respect to . At iteration the derivative can be written as

PG will thus update the policy parameter by

Notice that given a Bernoulli random variable , one can write . Then by Fubini’s theorem we have

We wish to show that the gradient has a higher correlation with the direction of rather than . That is we wish to show that

Substituting the above equation is equivalent to


Proving Equation (4) for all will complete the proof.
We continue the proof by induction on .
Base case (k = 0):
Recall that . Writing Equation (4) explicitly we get

Since we only need to show that for large enough (which depends on the constants and )

as all other values tend to zero.
If then we are done. Otherwise, if then

where in the first step we used the Cauchy–Schwarz inequality, and in the second step we used the fact if a vector satisfies then for any constant , .

Induction step:
Assume Equation (4) holds for some . Then by the gradient procedure we know that , and thus we can use the same proof as the base case. Hence, and the result follows for .

Appendix B Experimental Details

Algorithm 2 Generative Actor Critic
Input: number of time steps, policy samples, minibatch size
Initialize critic networks, value network and actor network with random parameters
Initialize target networks
Initialize replay buffer
for each time step do
     Select action with exploration noise and observe reward and new state
     Store transition tuple in the replay buffer
     Sample mini-batch of transitions from the replay buffer
     Update critics
     Update value
     Sample actions from sampling policy
     Update actor
     Update target networks

Our approach is depicted in Algorithm 2. In addition, we provide a numerical comparison of the various approaches in Table 3. These results show a clear picture.

Target policy estimation:

To estimate the target policy, for each state, we sample 128 actions uniformly from the action space and 128 actions from the delayed (target) policy, and the per-sample loss is weighted by the positive advantage. This can be seen as a form of ‘exploration-exploitation’: while uniform sampling ensures proper exploration of the action set, sampling from the policy has a higher probability of producing actions with positive advantage.

The loss is thus the weighted quantile loss. We do note that while one would want to define the target policy as the linear/Boltzmann distribution over the positive advantage, this is not possible in practice. As actions are sampled, we can only construct such a distribution on a per-batch basis. This approach does provide higher weight for better performing actions, but does result in a different underlying distribution. In addition, in order to ensure stability, we normalize the quantile loss weights in each batch, so that very small (large) advantage values do not incur near-zero (huge) gradients which may harm model stability.

Architectural Details:

Actor: As presented in Figure 3, our architecture incorporates a recurrent cell. The recurrent cell ensures that each dimension of the action is a function of the state , the sampled quantile and the previous predicted action dimensions . Notice that using this architecture, the prediction of is not affected by . This approach is a strict requirement when considering the autoregressive approach.

We believe other, potentially more efficient architectures can be explored. For instance, a fully connected network, similar to the non-autoregressive approach, with attention over the previous action dimensions may work well [Vaswani et al., 2017]. Such evaluation is out of the scope of this work and is an interesting investigation for future work.

Value & Critic:

While the actor architecture is a non-standard approach, for both the value and critic networks, we use the classic MLP network. Specifically, we use a two layer fully connected network with 400 and 300 neurons in each layer, respectively. Similarly to Fujimoto et al. [2018], the critic receives a concatenated vector of both the state and action as input.
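As a rough, dependency-free sketch of the critic layout described above: two hidden layers of 400 and 300 units, a scalar output, and the concatenated state-action vector as input. The ReLU activation, the Gaussian initialization and the toy dimensions are assumptions; a practical implementation would use a deep learning library.

```python
import random


def init_mlp(sizes, rng):
    """One (weights, biases) pair per layer; weights[j] feeds output unit j."""
    return [
        ([[rng.gauss(0.0, 0.05) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(layers, x):
    """Forward pass with ReLU on hidden layers and a linear output layer."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


def critic(layers, state, action):
    """Q(s, a): the state and action are concatenated into a single input."""
    return forward(layers, list(state) + list(action))[0]


state_dim, action_dim = 3, 1  # toy dimensions chosen for the example
layers = init_mlp([state_dim + action_dim, 400, 300, 1], random.Random(0))
q_value = critic(layers, [0.1, 0.2, 0.3], [0.5])
```

The value network would follow the same layout with the state alone as input.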

Table 3: Comparison of the maximal attained value across training.

Appendix C Discussion and Common Mistakes

As shown in the body of the paper, there exist alternative approaches. We take this section in order to provide some additional discussion into how and why we decided on certain approaches and what else can be done.

C.1 Alternative Gradient Approaches

Going back to the policy gradient approach, specifically the deterministic version, we can write the value of the current policy of our generative model (policy) as:

or an estimation using samples

It may then be desirable to optimize this objective directly by taking the gradient w.r.t. the parameters of the policy. However, this approach does not ensure optimality. The gradient direction is provided by the critic for each value of the sampled quantile. This can be seen as optimizing an ensemble of DDPG models, where each quantile value selects a different model from the set. As DDPG is a uni-modal parametric distribution and is thus not ensured to converge to an optimal policy, this approach suffers from the same caveats.
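As a hedged illustration of the objective discussed above, assume the generative policy maps a state s and a quantile vector τ, drawn uniformly from [0,1]^d, to an action π_θ(s, τ), and let Q denote the critic. These symbols are assumptions made for this sketch rather than notation fixed by the text.

```latex
% Assumed notation: \pi_\theta(s,\tau) is the generated action, Q the critic.
v^{\pi_\theta}(s)
  = \mathbb{E}_{\tau \sim U([0,1]^d)} \big[ Q\big(s, \pi_\theta(s,\tau)\big) \big]
  \approx \frac{1}{N} \sum_{i=1}^{N} Q\big(s, \pi_\theta(s,\tau_i)\big),
\qquad
\nabla_\theta v^{\pi_\theta}(s)
  \approx \frac{1}{N} \sum_{i=1}^{N}
    \nabla_a Q(s,a)\big|_{a=\pi_\theta(s,\tau_i)} \,
    \nabla_\theta \pi_\theta(s,\tau_i).
```

For a fixed τ_i, each summand has the form of a deterministic policy gradient, which is the sense in which each quantile value selects a separate DDPG-like model.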

However, Evolution Strategies [Salimans et al., 2017] is a feasible approach. As opposed to the gradient method, this approach can be seen as directly estimating the gradient of the policy's value w.r.t. its parameters, i.e., it estimates the best direction in which to move the policy. As long as the policy is still capable of representing arbitrarily complex distributions, this approach should, in theory, converge to a global maximum. However, as there is interest in sample-efficient learning, our focus in this work was on introducing an off-policy learning method under the common actor-critic framework.
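The idea of estimating the ascent direction from perturbed evaluations can be sketched as below. This is a simplified estimator with mirrored (antithetic) perturbations on a toy objective, assuming a black-box return function `f`; it is not the full distributed algorithm of Salimans et al., and all names and hyperparameters are illustrative.

```python
import random


def es_gradient(f, theta, sigma=0.1, n_pairs=50, rng=None):
    """Estimate the gradient of f at theta from mirrored Gaussian perturbations."""
    rng = rng or random.Random()
    grad = [0.0] * len(theta)
    for _ in range(n_pairs):
        eps = [rng.gauss(0.0, 1.0) for _ in theta]
        f_plus = f([t + sigma * e for t, e in zip(theta, eps)])
        f_minus = f([t - sigma * e for t, e in zip(theta, eps)])
        # Finite difference along the perturbation, averaged over all pairs.
        scale = (f_plus - f_minus) / (2.0 * sigma * n_pairs)
        for i, e in enumerate(eps):
            grad[i] += scale * e
    return grad


def es_ascent(f, theta, steps=300, lr=0.05, rng=None):
    """Plain gradient ascent using the ES estimate; no gradient of f is needed."""
    rng = rng or random.Random(0)
    for _ in range(steps):
        g = es_gradient(f, theta, rng=rng)
        theta = [t + lr * gi for t, gi in zip(theta, g)]
    return theta


# Toy return with a single maximum at (3, -1); stands in for a policy's value.
toy_return = lambda theta: -((theta[0] - 3.0) ** 2 + (theta[1] + 1.0) ** 2)
theta_star = es_ascent(toy_return, [0.0, 0.0])
```

Each step needs `2 * n_pairs` fresh evaluations of the return, which illustrates why such methods tend to be less sample efficient than off-policy actor-critic updates.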

C.2 Target Networks and Stability

Our empirical approach, as shown in Algorithm 2, uses a target network for each approximator (critic, value and the target policy). The critic and value target networks serve mainly to stabilize the empirical approach and can be disposed of, whereas the policy target network is required for the algorithm to converge (as shown in Section 3).

The quantile loss, and any distribution loss in general, is concerned with moving probability mass from the current distribution towards the target distribution. This leads to two potential issues when lacking the delayed policy: (1) non-quasi-stationarity of the target distribution, and (2) non-increasing policy.

The first point is important from an optimization point of view. As the quantile loss aims to estimate some target distribution, the assumption is that this distribution is static. Lacking the delayed policy network, this distribution potentially changes at each time step and thus cannot be properly estimated using sample-based approaches. The delayed policy solves this problem: as it tracks the policy on a slower timescale, it can be seen as quasi-static, and thus the target distribution becomes well defined.

The second point is important from an RL point of view. In general, RL proofs revolve around two concepts: either one attempts to learn the optimal Q values, and convergence is shown by proving that the operator contracts towards a unique globally stable equilibrium, or the goal is to learn a policy, and the proof is based on showing that the policy is monotonically improving. As the delayed policy network slowly tracks the policy network, the multi-timescale framework tells us that “by the time” the delayed policy network changes, the policy network can be assumed to have converged. As the policy network aims to estimate a distribution over the positive advantage of the delayed policy, this approach ensures that the delayed policy is monotonically improving (under the correct theoretical step-size and realizability assumptions).
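One common way to make a target network track its online counterpart on a slower timescale is a soft (Polyak) update, sketched below on flat parameter lists. Whether the delayed policy is updated in exactly this manner, and the rate of 0.005, are assumptions made for illustration.

```python
def soft_update(target_params, online_params, rate=0.005):
    """Move each target parameter a small fraction of the way to the online one."""
    return [
        (1.0 - rate) * t + rate * o
        for t, o in zip(target_params, online_params)
    ]


# With a frozen online network, the delayed copy approaches it geometrically.
delayed = [0.0]
for _ in range(2000):
    delayed = soft_update(delayed, [1.0])
```

A small rate keeps the delayed policy nearly constant between consecutive updates, which is what allows the faster policy update to treat its target distribution as quasi-static.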

C.3 Sample Complexity and Policy Samples

When considering sample complexity in its simplest form, our approach is as efficient as the baselines we compared to. It does not require the use of larger batches nor does it require more environment samples. However, as we are optimizing a generative model, it does require sampling from the model itself.

As opposed to Dabney et al. [2018a], we found that in our approach the number of samples does affect the convergence ability of the network. While using 16 samples for each transition in the batch did result in relatively good policies, increasing this number improved both stability and performance. For this reason, we decided to run with a sample size of 128, which results in longer training times. For instance, training the TD3 algorithm on the Hopper-v2 domain using two NVIDIA GTX 1080-TI cards took around 3 hours, whereas our approach took 40 hours. We argue that since the resulting policy is often what matters, it is worth sacrificing time efficiency in order to gain a better final result.

C.4 Generative Adversarial Policy Training

Our approach used the AIQN framework in order to train a generative policy. An alternative method for learning distributions from samples is using the GAN framework. A discriminator can be trained to differentiate between samples from the current policy and those from the target distribution; thus, training the policy to ‘fool’ the discriminator will result in generating a distribution similar to the target.

However, while the GAN framework has seen multiple successes, it still lacks the theoretical guarantees of convergence to the Nash equilibrium. As opposed to the AIQN which is trained on a supervision signal, the GAN approach is modeled as a two player zero-sum game.
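For reference, the two-player game can be written by adapting the standard GAN objective to the policy setting. The notation is assumed for this sketch: D is the discriminator, p^{target}(·|s) is the target action distribution, and π_θ(s, τ) is the action generated from a uniformly sampled τ.

```latex
% Standard GAN minimax objective, adapted (as an assumption) to policies.
\min_{\theta} \max_{D} \;
  \mathbb{E}_{a \sim p^{\mathrm{target}}(\cdot \mid s)} \big[ \log D(s, a) \big]
  + \mathbb{E}_{\tau \sim U([0,1]^d)} \big[ \log\big(1 - D(s, \pi_\theta(s,\tau))\big) \big]
```

The policy only receives its learning signal through the discriminator, in contrast to the quantile loss, where sampled target actions supervise the policy directly.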

Appendix D Distributional Policy Optimization Assumptions

We provide the assumptions required for the 3-timescale stochastic approximation approach, namely DPO, to converge.

The first assumption is regarding the step-sizes. It ensures that the policy moves on the fastest time-scale, the value and critic on an intermediate and the delayed policy on the slowest. This enables the quasi-static analysis in which the fast elements see the slower as static and the slow view the faster as if they have already converged.

Assumption 1.

[Step size assumption]
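A typical form of such a condition in multi-timescale stochastic approximation is given below as an illustration. The symbols are assumptions: α_k denotes the step size of the policy, β_k that of the value and critic, and δ_k that of the delayed policy.

```latex
% Assumed symbols: \alpha_k (policy), \beta_k (value and critic), \delta_k (delayed policy).
\sum_{k} \alpha_k = \sum_{k} \beta_k = \sum_{k} \delta_k = \infty,
\qquad
\sum_{k} \big( \alpha_k^2 + \beta_k^2 + \delta_k^2 \big) < \infty,
\qquad
\lim_{k \to \infty} \frac{\beta_k}{\alpha_k}
  = \lim_{k \to \infty} \frac{\delta_k}{\beta_k} = 0.
```

For example, α_k = k^{-0.6}, β_k = k^{-0.8} and δ_k = k^{-1} satisfy all three conditions, with the policy moving fastest and the delayed policy slowest.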

The second assumption requires that the action set be compact. Since there exists a deterministic policy which is optimal, this assumption ensures that this policy is indeed finite and thus the process converges.

Assumption 2.

[Compact action set] The action set is compact for every state s.

The final two assumptions (3 and 4) ensure that the policy, moving on the fast time-scale, converges. The Lipschitz assumption ensures that the action-value function and in turn the target distribution are smooth.

Assumption 3.

[Lipschitz and bounded Q] The action-value function is Lipschitz and bounded for every and .

Assumption 4.

For any and , there exists a loss such that as .

Finally, it can be shown that DPO converges under these assumptions using the standard multi-timescale approach.