1 Introduction
Reinforcement learning (RL) has shown potential in solving a variety of complex problems in robotics and control [MniKavDil15, LevFinDar16]. RL algorithms can be roughly divided into two categories: value function based methods and policy gradient methods. Value function based algorithms do not explicitly parameterize the policy, but rather obtain the policy from a learned value function. SARSA [RumNir94] and Q-learning [WatDay92] are popular methods for estimating the value function based on the Bellman equation. Recently, deep Q-networks (DQNs) [MniKavSil15] have been used to approximate the value function and have been shown to achieve human-level performance on computer games. As an alternative to value function based approaches, policy gradient methods improve a parameterized policy based on the policy gradient theorem [SutMcaSin00] and have been shown to be more effective in continuous action space control settings. In particular, deep deterministic policy gradient (DDPG) [LilHunPri16] uses neural networks to parameterize the policy and has been successful in solving continuous control tasks.
Instead of modeling the value function as the expected sum of discounted rewards, the recently proposed distributional reinforcement learning (DRL) framework [BelDabMun17] works with the full distribution of random returns, known as the value or return distribution. Several typical DRL algorithms such as C51 [BelDabMun17], D4PG [MarHofBud18], and QR-DQN [DabRowBel18] have shown significant performance improvements over their non-distributional counterparts in multiple environments, including Atari games and the DeepMind Control Suite [Tas18]. In DRL, the return distribution is usually represented by a discrete categorical form [BelDabMun17, MarHofBud18, QuManXu18] or a quantile function [DabRowBel18, ZhaMarYao19]. Most existing work within the DRL framework is value function based and is thus not suitable for tasks with continuous action spaces. One of the exceptions is D4PG [MarHofBud18], an actor-critic type policy gradient algorithm based on DRL. It has demonstrated much better performance [MarHofBud18, Tas18] than its non-distributional counterpart (DDPG). However, it still suffers from drawbacks such as sample inefficiency and the extra burden of parameter tuning and projection, which are largely due to the fact that the return distribution in D4PG is modeled by a discrete categorical distribution.
In this paper, we advocate using samples to represent the return distribution instead of a categorical form or quantiles. Our algorithm, which we call sample-based distributional policy gradient (SDPG), learns the return distribution by directly generating return samples via reparameterization of simple random (e.g. Gaussian) noise samples. SDPG is an actor-critic type policy gradient algorithm within the DRL framework which employs two neural networks: an actor network to parameterize the policy and a critic network to mimic the target return distribution determined via the distributional Bellman equation based on samples. Since the return distribution is usually one-dimensional, we leverage the quantile Huber loss as a surrogate of the Wasserstein distance for comparing return distributions and thereby learning the critic network.
From a theoretical perspective, SDPG has the following advantages over D4PG:

There is no discretization of the value distribution. The value function network is capable of generating arbitrary value distributions, which are in general not categorical as in D4PG.

SDPG does not require a priori knowledge of the range of the return distribution. In contrast, D4PG requires domain knowledge in the form of bounds on the return distribution.

No projection is required; instead, the Wasserstein distance (via the quantile Huber loss) gives a finer comparison between return distributions.

Once the model is trained, the value distribution can be recovered easily to arbitrary precision by sampling. In contrast, in D4PG, the resolution of the value distribution is fixed once trained.
Empirically, we compare the performance of our algorithm with that of D4PG on multiple OpenAI Gym [BroChePet16] environments for continuous control tasks. We observe that SDPG exhibits better sample efficiency and performs better than or on par with D4PG in terms of rewards in almost all the environments.
Related Work: Most of the algorithms proposed under the DRL framework are value function based methods, which run into scalability issues for problems with continuous action spaces. The C51 algorithm [BelDabMun17], a value based algorithm, represents the return distribution using a discrete distribution parameterized by 51 uniformly spaced atoms in a specified range. Later, the QR-DQN algorithm [DabRowBel18] proposed using a discrete set of quantiles to represent the return distribution and demonstrated its effectiveness over C51 on the Atari 2600 games. QR-DQN was further extended in IQN [DabOstSilMun18] to learn the full quantile function. D4PG [MarHofBud18] and Reactor [GruDabAza17] are existing policy gradient based methods within the DRL framework; D4PG deals with control problems in continuous action spaces, whereas Reactor was studied in discrete action settings. However, D4PG also uses the discrete categorical form to represent the return distribution, similar to C51 [BelDabMun17], which limits the expressive power of the value distribution network. There have been several works on using samples to represent return distributions. Generative adversarial networks (GANs) [GodPouMir14], which are sample based generative models, have been employed in value function based approaches: GAN-DQN [DoaMazLyl18], value distribution GAN learning (VDGL) [FreShiMei19], and GAN-DDQN [HuaLiZha19]. GAN-DQN focused on discrete action spaces and did not show significant improvement over traditional value based methods such as Q-learning and DQN. VDGL used GANs to learn multivariate return distributions and thereby the value function. GAN-DDQN combined GAN and IQN to learn the value function for resource allocation in communication systems.
An important point to note is that, apart from being value based approaches, the existing GAN based DRL methods employ two networks, a generator and a discriminator, for generating return samples by solving a saddle-point problem. In contrast, we use the quantile Huber loss, as a surrogate of the Wasserstein distance, computed directly from samples [DesZhaSch18]. This results in a single objective rather than a saddle-point formulation, thereby eliminating the need for a discriminator to learn the return distribution.
2 Background
2.1 Distributional RL
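We consider a standard RL problem with underlying model $(\mathcal{X}, \mathcal{A}, R, P, \gamma)$ where, as usual, $\mathcal{X}$ and $\mathcal{A}$ denote the state and action spaces respectively, $R(x, a)$ is the reward of taking action $a$ at state $x$, $P(\cdot \mid x, a)$ is the transition kernel and $\gamma \in (0, 1)$ is the discount factor. The reward $R$ can be random in general. This differs from the convention where $R$ is the expected reward. The state and action spaces could be either discrete or continuous, though we focus on the more challenging continuous setting. The goal of RL is to find a stationary policy $\pi$ to maximize the long-term accumulated reward
(1) $J(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(x_t, a_t)\right], \quad a_t \sim \pi(\cdot \mid x_t), \quad x_{t+1} \sim P(\cdot \mid x_t, a_t).$
Since we focus on continuous state-action tasks, we restrict the policy to be deterministic, that is, $\pi: \mathcal{X} \to \mathcal{A}$.
The Q-function, denoted by $Q_\pi(x, a)$, describes the expected reward of the agent from taking action $a$ from state $x$, that is,
(2) $Q_\pi(x, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(x_t, a_t)\right], \quad x_0 = x, \ a_0 = a, \ a_t = \pi(x_t) \ (t \ge 1), \ x_{t+1} \sim P(\cdot \mid x_t, a_t).$
It satisfies Bellman’s equation [Bel66]
(3) $Q_\pi(x, a) = \mathbb{E}\left[R(x, a)\right] + \gamma \, \mathbb{E}\left[Q_\pi(x', \pi(x'))\right].$
Here we have adopted the convention that $x'$ denotes the state succeeding $x$, i.e., $x' \sim P(\cdot \mid x, a)$.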
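In [BelDabMun17], the authors proposed distributional reinforcement learning (DRL), which relies on a random version of the Q-function, defined by
(4) $Z_\pi(x, a) = \sum_{t=0}^{\infty} \gamma^t R(x_t, a_t), \quad x_0 = x, \ a_0 = a, \ a_t = \pi(x_t) \ (t \ge 1), \ x_{t+1} \sim P(\cdot \mid x_t, a_t).$
Clearly,
(5) $Q_\pi(x, a) = \mathbb{E}\left[Z_\pi(x, a)\right],$
namely, $Q_\pi(x, a)$ is the statistical mean of the random variable $Z_\pi(x, a)$. So in principle, one should be able to recover everything based on $Q_\pi$ using $Z_\pi$. Moreover, the return distribution contains extra information such as variance that may be used to incorporate risk in the RL framework. The $Z_\pi$ function satisfies a modified Bellman’s equation [BelDabMun17]
(6) $Z_\pi(x, a) \overset{D}{=} R(x, a) + \gamma \, Z_\pi(x', \pi(x')),$
where $x'$ is the state succeeding $x$ and the equation holds in the probability sense.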
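As a rough illustration of how (5) and (6) fit together, the sketch below applies one sample-based distributional Bellman backup and checks that its mean coincides with the usual scalar backup. The function name `bellman_target_samples` and the Gaussian stand-in for the next-state return samples are assumptions made for this example only; they are not taken from the paper.

```python
import random
from statistics import mean

def bellman_target_samples(reward, next_return_samples, gamma):
    """Apply r + gamma * z' to every sample z' of the next-state return."""
    return [reward + gamma * z for z in next_return_samples]

random.seed(0)
# Hypothetical samples of the next-state return, drawn from a Gaussian
# purely for illustration.
next_samples = [random.gauss(10.0, 2.0) for _ in range(1000)]
targets = bellman_target_samples(1.0, next_samples, gamma=0.9)

# Averaging the target samples gives the scalar backup r + gamma * E[Z'],
# which is the relation between the return distribution and the Q-function.
q_backup = 1.0 + 0.9 * mean(next_samples)
assert abs(mean(targets) - q_backup) < 1e-9
```

The samples themselves, rather than only their mean, are what a distributional critic is trained to match.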
Multiple approaches have been proposed to model the return distribution, including distribution quantiles [DabRowBel18, ZhaMarYao19] and discrete categorical distributions [BelDabMun17, MarHofBud18, QuManXu18]. In [BelDabMun17] (C51), the return distribution is modeled by a categorical discretization at each state-action pair. More specifically, the return distribution at each pair is described by a probability vector (histogram) over a fixed set of bins. The positions of the bins are chosen a priori and need to be tuned according to the environment under consideration.
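To make the role of the fixed bins concrete, here is a minimal sketch of the categorical projection commonly used with this parameterization, in the spirit of [BelDabMun17]. The helper names `make_atoms` and `project_onto_atoms` are hypothetical, and the tiny three-atom grid is chosen only so the result can be checked by hand.

```python
import math

def make_atoms(v_min, v_max, n_atoms):
    """Uniformly spaced support; v_min and v_max must be chosen in advance."""
    delta = (v_max - v_min) / (n_atoms - 1)
    return [v_min + i * delta for i in range(n_atoms)]

def project_onto_atoms(atoms, probs, reward, gamma):
    """Project the shifted distribution r + gamma * z back onto the fixed atoms."""
    v_min, v_max = atoms[0], atoms[-1]
    delta = (v_max - v_min) / (len(atoms) - 1)
    projected = [0.0] * len(atoms)
    for z, p in zip(atoms, probs):
        # Shifted atom, clipped to the predefined range.
        tz = min(max(reward + gamma * z, v_min), v_max)
        # Fractional index on the grid, guarded against round-off at the top.
        b = min((tz - v_min) / delta, len(atoms) - 1)
        lo, hi = math.floor(b), math.ceil(b)
        if lo == hi:
            projected[lo] += p                # lands exactly on an atom
        else:
            projected[lo] += p * (hi - b)     # split mass between neighbours
            projected[hi] += p * (b - lo)
    return projected

atoms = make_atoms(0.0, 2.0, 3)               # [0.0, 1.0, 2.0]
print(project_onto_atoms(atoms, [0.0, 1.0, 0.0], reward=0.5, gamma=1.0))
# prints [0.0, 0.5, 0.5]
```

Any return mass falling outside `[v_min, v_max]` is clipped, which is why the range has to be tuned per environment.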
2.2 D4PG
DRL was extended to the policy gradient framework in [MarHofBud18]. In the policy gradient framework, the policy is modeled directly by a network. For continuous-state tasks, a widely used method is deterministic policy gradient (DPG), which relies on the deterministic policy gradient theorem [SilLevHeeRie14]. Let $J(\pi_\theta)$ be the average return with control strategy $\pi_\theta$, then
(7) $\nabla_\theta J(\pi_\theta) = \mathbb{E}_{x \sim \rho}\left[\nabla_\theta \pi_\theta(x) \, \nabla_a Q_{\pi_\theta}(x, a)\big|_{a = \pi_\theta(x)}\right],$
with $\rho$ the state distribution induced by $\pi_\theta$. This theorem is generalized to the DRL setting [MarHofBud18], stated as
(8) $\nabla_\theta J(\pi_\theta) = \mathbb{E}_{x \sim \rho}\left[\nabla_\theta \pi_\theta(x) \, \mathbb{E}\left[\nabla_a Z_{\pi_\theta}(x, a)\right]\big|_{a = \pi_\theta(x)}\right].$
This result follows directly from (5). The distributed distributional deterministic policy gradients (D4PG) algorithm [MarHofBud18] is based on this extension (8) of the policy gradient theorem. It is an actor-critic type algorithm in which the critic learns the return distribution via a neural network. Similar to [BelDabMun17], the distribution is modeled by a categorical discretization at each state-action pair. The actor is updated via the generalized policy gradient theorem (8), with the expectation replaced by an empirical average.
2.3 Optimal transport
Optimal (mass) transport (OT) is a powerful mathematical tool to study probability distributions [Vil03]. Given two random variables $X, Y$ in the Euclidean space $\mathbb{R}^d$ associated with probability distributions $\mu, \nu$, the OT problem seeks the solution to
(9) $\inf_{\eta \in \Pi(\mu, \nu)} \mathbb{E}_{(X, Y) \sim \eta}\left[c(X, Y)\right]$
with $\Pi(\mu, \nu)$ denoting the set of feasible joint distributions of $X$ and $Y$. The unit cost function is often taken to be $c(X, Y) = \|X - Y\|^p$, $p \ge 1$, in which case (9) defines the ($p$-th power of the) Wasserstein distance $W_p(\mu, \nu)$ [Vil03] between $\mu$ and $\nu$. It has an equivalent form
(10)
The Wasserstein distance is a metric and possesses many nice properties, including weak continuity [Vil03, Bil71], which gives a reasonable measure of the difference between two distributions with disjoint supports. This property is extremely useful in data science, as most datasets lie on a low-dimensional submanifold and therefore any small perturbation can lead to disjoint supports. One representative application relying on this property is the Wasserstein generative adversarial network [ArjChiBot17].
Computing the Wasserstein distance requires solving the linear program (9). Despite recent algorithmic developments in OT [PeyCut19], computational complexity remains a bottleneck. One exception is the one-dimensional problem, which has a closed-form solution. Let $F, G$ be the CDFs of $\mu, \nu$ respectively, then
(11) $W_p(\mu, \nu)^p = \int_0^1 \left|F^{-1}(t) - G^{-1}(t)\right|^p \, dt.$
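When only samples generated by $\mu, \nu$ are available, their Wasserstein distance can be approximated as follows. Let $x_1, \dots, x_N$ and $y_1, \dots, y_N$ be i.i.d. samples corresponding to $\mu, \nu$ respectively, and let $\tilde{x}_1 \le \dots \le \tilde{x}_N$ ($\tilde{y}_1 \le \dots \le \tilde{y}_N$) be the ascending sorted version of $x_1, \dots, x_N$ ($y_1, \dots, y_N$), then
(12) $W_p(\mu, \nu)^p \approx \frac{1}{N} \sum_{i=1}^{N} \left|\tilde{x}_i - \tilde{y}_i\right|^p.$
The approximation error goes to zero at the rate given in [DesZhaSch18]. The computational complexity of the sorting operation is $O(N \log N)$.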
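The sorted-sample approximation of the one-dimensional Wasserstein distance described above can be sketched in a few lines. The function name `wasserstein_1d` is an assumption for this example, and the sketch returns the $p$-th power of the distance, assuming both sample sets have the same size.

```python
def wasserstein_1d(xs, ys, p=1):
    """Estimate the p-th power of W_p between two 1-D sample sets of equal size."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes the same number of samples on both sides")
    # Sorting pairs the i-th smallest sample of one set with the
    # i-th smallest sample of the other; it costs O(N log N).
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    return sum(abs(x - y) ** p for x, y in zip(xs_sorted, ys_sorted)) / len(xs)

print(wasserstein_1d([3.0, 1.0, 2.0], [2.0, 3.0, 1.0]))   # same samples, reordered: 0.0
print(wasserstein_1d([0.0, 2.0], [1.0, 5.0], p=2))        # (1 + 9) / 2 = 5.0
```

Because only a sort and an elementwise difference are needed, the estimate avoids solving a linear program.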
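However, as noted in [DabRowBel18], Equation (12) does not give an unbiased approximation of the Wasserstein distance. In general, minimizing the distance to the empirical distribution composed of samples from one distribution is not equivalent to minimizing the distance to that distribution itself [DabRowBel18]. To circumvent this difficulty, we borrow tools from quantile regression and use the quantile Huber loss [Huber1964, DabRowBel18] as a surrogate of the Wasserstein distance. Following [DabRowBel18], this is given by
(13) $\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \rho_{\hat{\tau}_i}\left(y_j - \tilde{x}_i\right)$
where
$\rho_\tau(u) = \left|\tau - \mathbb{1}\{u < 0\}\right| \, L_\kappa(u)$
and
$L_\kappa(u) = \frac{1}{2} u^2$ if $|u| \le \kappa$, and $L_\kappa(u) = \kappa\left(|u| - \frac{1}{2}\kappa\right)$ otherwise, for a threshold $\kappa > 0$,
with $\hat{\tau}_i = \frac{2i - 1}{2N}$, $i = 1, \dots, N$.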
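A minimal sketch of a quantile Huber loss between two sample sets, written in the style of [DabRowBel18], may help here. The function names, the midpoint quantile levels, the threshold `kappa=1.0` and the normalization by the number of target samples are assumptions of this sketch rather than details confirmed by the text above.

```python
def huber(u, kappa=1.0):
    """Huber loss: quadratic near zero, linear in the tails."""
    a = abs(u)
    return 0.5 * u * u if a <= kappa else kappa * (a - 0.5 * kappa)

def quantile_huber_loss(pred_samples, target_samples, kappa=1.0):
    """Asymmetric Huber loss between sorted predictions and target samples."""
    preds = sorted(pred_samples)      # sorting assigns each sample a quantile level
    n = len(preds)
    total = 0.0
    for i, x in enumerate(preds):
        tau = (2 * i + 1) / (2 * n)   # midpoint level of the i-th sorted sample (0-based i)
        for y in target_samples:
            u = y - x
            weight = abs(tau - (1.0 if u < 0 else 0.0))
            total += weight * huber(u, kappa)
    return total / len(target_samples)

matched = quantile_huber_loss([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0])
shifted = quantile_huber_loss([5.0, 6.0, 7.0, 8.0], [0.0, 1.0, 2.0, 3.0])
assert matched < shifted   # predictions far from the targets are penalized more
```

The weight `abs(tau - indicator)` penalizes over- and under-estimation differently, which is what ties each sorted sample to its quantile level.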
2.4 Reparameterization
Reparameterization is an effective method to model random variables, especially when the goal is to sample from a target distribution rather than to model it directly using function approximation. Briefly, to model a random variable $X$ with distribution $\mu$, the reparameterization trick seeks a neural network $g$ that maps a simple random variable $\epsilon$ (e.g. Gaussian) to the target random variable $X$, that is, $X \approx g(\epsilon)$. The hope is that, after training, the random variable $g(\epsilon)$ is close to $X$ in the probability sense. This is extremely useful for sampling purposes because one only needs to sample from the simple distribution of $\epsilon$ in order to generate samples of $X$. The reparameterization trick has been widely used in generative adversarial networks (GANs) [GodPouMir14], where the generator is a map from a simple random variable to the target data set. It has also been used in the variational autoencoder [KinWel13] to model the encoder.
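As a toy illustration of the idea, the sketch below replaces the neural network with a hand-written affine map, so the target distribution is a Gaussian with known mean and standard deviation. The name `generator` and the parameters `mu` and `sigma` are hypothetical and stand in for learned network weights.

```python
import random
from statistics import mean, stdev

def generator(eps, mu, sigma):
    """Toy stand-in for the network: an affine map of the noise sample."""
    return mu + sigma * eps

random.seed(0)
# Sample only from the simple noise distribution ...
noise = [random.gauss(0.0, 1.0) for _ in range(20000)]
# ... and push the noise through the map to obtain target samples.
samples = [generator(e, mu=3.0, sigma=0.5) for e in noise]

# The transformed samples have approximately the intended mean and spread.
assert abs(mean(samples) - 3.0) < 0.05
assert abs(stdev(samples) - 0.5) < 0.05
```

Since the map is deterministic given the noise, the samples are differentiable with respect to `mu` and `sigma`, which is the property a learned generator relies on.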
3 Algorithm
A flow diagram of our SDPG algorithm is shown in Figure 1. We model the return distribution by samples reparameterized via a simple noise distribution (we use a zero-mean Gaussian distribution with unit variance in our implementation). The critic network $Z_w$, which is a neural network with parameters $w$, generates the return samples $z_i = Z_w(x, a; \epsilon_i)$, $i = 1, \dots, N$, for each state-action pair $(x, a)$ by transforming the noise samples $\epsilon_i$. These generated samples are compared against the samples $y_j$, $j = 1, \dots, N$, generated using the distributional Bellman equation (6) applied to the target critic network $Z_{w'}$. The critic network is learned by minimizing the quantile Huber loss defined in (13), which is a surrogate of the Wasserstein distance between the two one-dimensional return distributions. Therefore, the loss function for the critic network is
(14) $L_{\mathrm{critic}}(w) = \mathbb{E}\left[\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \rho_{\hat{\tau}_i}\left(y_j - \tilde{z}_i\right)\right]$
where $\rho_{\hat{\tau}_i}$ is the quantile Huber loss of (13) and $\tilde{z}_i$ are the samples $z_i$ after sorting. We emphasize that the sorting is important here to associate each sample with a reasonable $\hat{\tau}_i$. This differs from [DabOstSilMun18], where $\tau$ itself is the random seed over $[0, 1]$.
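The actor network $\pi_\theta$, parameterized by $\theta$, outputs the action $a = \pi_\theta(x)$ given a state $x$. The actor network receives feedback from the critic network in terms of the gradients of the return distribution with respect to the actions determined by the policy. This feedback is used to update the actor network by applying the distributional form of the policy gradient theorem given by (8). Therefore, the gradient of the actor network loss function is
(15) $\nabla_\theta J(\theta) \approx \mathbb{E}\left[\nabla_\theta \pi_\theta(x) \, \frac{1}{N} \sum_{i=1}^{N} \nabla_a Z_w(x, a; \epsilon_i)\big|_{a = \pi_\theta(x)}\right],$
where $Z_w(x, a; \epsilon_i)$ denotes the $i$-th return sample generated by the critic network from the noise sample $\epsilon_i$.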
All the steps of our SDPG algorithm are described in Algorithm 1. The network parameters of the actor and critic networks are updated alternately in a stochastic gradient ascent/descent fashion.
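The actor update can be illustrated on a toy problem where every gradient is available in closed form. In the sketch below, the scalar policy `theta * x`, the quadratic toy critic and the step size are all assumptions chosen so that the optimal parameter is known to be 1; none of them come from the paper's implementation.

```python
import random

def critic_sample_grad_a(x, a, eps):
    """d/da of the toy return sample z = -(a - x)**2 + 0.1 * eps.

    The noise is additive here, so it does not affect the gradient.
    """
    return -2.0 * (a - x)

def policy(theta, x):
    return theta * x          # toy deterministic policy

def policy_grad_theta(theta, x):
    return x                  # derivative of theta * x with respect to theta

def actor_gradient(theta, states, n_samples=8):
    grad = 0.0
    for x in states:
        a = policy(theta, x)
        noise = [random.gauss(0.0, 1.0) for _ in range(n_samples)]
        # Empirical average of the action-gradient over the return samples.
        dz_da = sum(critic_sample_grad_a(x, a, e) for e in noise) / n_samples
        # Chain rule: policy Jacobian times averaged action-gradient.
        grad += policy_grad_theta(theta, x) * dz_da
    return grad / len(states)

random.seed(0)
theta = 0.0
for _ in range(200):
    batch = [random.uniform(-1.0, 1.0) for _ in range(32)]
    theta += 0.5 * actor_gradient(theta, batch)   # gradient ascent step

assert abs(theta - 1.0) < 1e-6
```

In a real implementation both derivatives would come from automatic differentiation through the actor and critic networks rather than from hand-written formulas.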
In contrast to the categorical parameterization of the return distribution considered in D4PG, we use samples to represent the return distribution. Since the return distribution is required to be differentiable with respect to the network parameters in order to be learned, we use the reparameterization trick discussed in Section 2.4 to model the return distribution via random noise input. This allows us to learn a continuous distribution via samples, as opposed to the discrete-valued categorical distribution in D4PG. Moreover, D4PG requires a projection step in every training iteration so that the target distribution resulting from the distributional Bellman equation shares the support of the categorical distribution being learned; SDPG eliminates the need for such a projection step. Furthermore, the range of the discretized grid required in D4PG must be tuned according to the reward values of each environment; SDPG does not require such tuning. Another advantage of SDPG is that one can recover the return distribution to arbitrary precision by sampling from the trained critic network. In D4PG, by contrast, the resolution of the return distribution is fixed, and the critic network has to be trained again from scratch if one wants to change the resolution.
4 Experimental Results
We compare the performance of the proposed SDPG algorithm with D4PG on a range of challenging continuous control tasks from the OpenAI Gym environments. Figure 2 shows example screenshots from the different domains considered in our experiments. Note that in the original D4PG paper, the environments considered were from the DeepMind Control Suite [Tas18], where the rewards are bounded between 0 and 1 for all the domains. In [MarHofBud18] and [Tas18], it was demonstrated that D4PG outperforms DDPG [LilHunPri16] in almost all the environments; therefore, we compare our algorithm with the only existing policy gradient method in DRL, namely D4PG. All the experiments are performed using TensorFlow with one NVIDIA TITAN Xp GPU.
For both the actor and critic networks, we use a two-layer feedforward neural network with hidden layer sizes of 400 and 300, respectively, and rectified linear units (ReLU) between the hidden layers. We also use batch normalization on all the layers of both networks. Moreover, the output of the actor network is passed through a hyperbolic tangent (Tanh) activation unit.
In all experiments we use learning rates of , batch size , exploration constant , and . We use a replay table of size for all the domains except for Pendulum and LunarLanderContinuous. Across all the tasks, for D4PG we use 51 atoms to represent the categorical distribution and similarly, for SDPG we use
number of samples to represent return distributions. Moreover, we run each task for a maximum of 1000 steps per episode. Note that SDPG is a centralized algorithm at this moment for a single agent. Thus, the distributed algorithm in D4PG is deactivated for fair comparison. One can easily establish a distributed version of SDPG. Since SDPG requires sorting operation during training, SDPG takes a little more time per episode (almost
) than D4PG.
Figure 6: Mean returns of SDPG (green) and D4PG (red) agents over environment steps for different OpenAI Gym environments. The shaded region represents the standard deviation of the average returns over 5 random seeds. The curves are smoothed uniformly for visual clarity.
Table 1: Maximal mean returns (with standard deviation) of D4PG and SDPG.
Domain  Simulator  Train Steps  Reward (D4PG)  Reward (SDPG)
Pendulum  Classic control  
LunarLander  Box2D  
BipedalWalker  Box2D  
Reacher  MuJoCo  
Swimmer  MuJoCo  
Ant  MuJoCo  
HalfCheetah  MuJoCo  
Humanoid  MuJoCo 
4.1 Training and Evaluation
First, we demonstrate the ability of the critic network to learn the return distribution using the variant of the Huber loss in Equation (13) based on the distributional Bellman equation. Figure 3 shows the histograms of the samples generated by the learned critic network and the corresponding histograms of samples generated based on the distributional Bellman equation on the BipedalWalker-v2 domain. The histograms match almost perfectly, which demonstrates that the critic network in SDPG successfully learns the target return distribution determined via the distributional Bellman equation. Note that this learned critic network can be used to generate as many samples as required (see Figure 4) to approximate the return distribution at arbitrary resolution.
Next, we study the effect of varying the number of samples representing the return distribution during training. Figure 5 depicts the training curves with different numbers of samples on the Ant-v2 domain. For a fixed number of samples, the algorithm is trained with five different seeds: the solid lines represent the mean returns over five trials and the shaded regions represent the corresponding standard deviation. Initially, increasing the number of samples improves the performance in terms of both efficiency and returns. However, when 100 samples are used for training, efficiency improves but the returns drop significantly.
For comparison, we train five different instances of each algorithm with different random seeds, with each performing one evaluation rollout every 5000 environment steps. Figure 6 shows the comparison of mean returns on different environments. It is evident from the figure that SDPG exhibits significantly better sample efficiency than D4PG on almost all the environments. Moreover, in terms of average returns, SDPG performs better than D4PG on all the domains except Humanoid-v2. This may be due to insufficient training steps. The performance of SDPG on Humanoid-v2 keeps increasing during the entire training process, and this trend is expected to continue.
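Table 2: Number of episodes needed by D4PG and SDPG to reach a given return threshold.
Environment  Threshold  Episodes (D4PG)  Episodes (SDPG)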
Pendulum  150  
LunarLander  200  
BipedalWalker  250  
Reacher  7  
Swimmer  90  
Ant  3500  
HalfCheetah  4700  
Humanoid  2500 
We evaluate the performance of our algorithm, SDPG, based on two criteria: average returns and sample efficiency. Table 1 lists the maximal mean returns (the average of the maximal returns over different trials) along with the standard deviation. The average returns are evaluated every 5000 training steps over 100 episodes. We observe that the returns for SDPG are significantly better than those for D4PG on all the environments except Humanoid. To compare the sample efficiency of D4PG and SDPG, we list the number of episodes needed to reach a certain return threshold in Table 2. The episode numbers reported in the table are averaged over 5 different trials, and for each trial the episode number is the number of episodes required before the reward crosses the threshold. It is evident that SDPG requires significantly fewer episodes than D4PG on many environments.
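The sample-efficiency metric described above can be computed with a short helper. The sketch below is only one plausible reading of that metric: the function names are hypothetical, the threshold test uses `>=`, and a trial that never reaches the threshold makes the average undefined (returned as `None`).

```python
from statistics import mean

def episodes_to_threshold(episode_returns, threshold):
    """1-based index of the first episode whose return reaches the threshold."""
    for episode, ret in enumerate(episode_returns, start=1):
        if ret >= threshold:
            return episode
    return None   # the threshold was never reached in this trial

def mean_episodes_to_threshold(trials, threshold):
    """Average the per-trial episode counts over all trials."""
    counts = [episodes_to_threshold(t, threshold) for t in trials]
    if any(c is None for c in counts):
        return None
    return mean(counts)

# Two made-up trials: the first crosses 200 at episode 3, the second at episode 2.
print(mean_episodes_to_threshold([[0, 100, 250], [0, 300]], threshold=200))   # 2.5
```

How to treat trials that never cross the threshold is a reporting choice that should be stated alongside the table.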
5 Conclusion
In this paper, driven by applications in continuous action spaces, we proposed the sample-based distributional policy gradient (SDPG) algorithm for learning the policy within the DRL framework. This algorithm combines an actor-critic type policy gradient method with DRL. In contrast to D4PG, the existing state-of-the-art distributional policy gradient algorithm, the sample-based reparameterization technique used in SDPG enables us to learn the return distribution to arbitrary resolution. We compared the performance of SDPG with D4PG on multiple OpenAI Gym environments. Our algorithm showed better sample efficiency than D4PG in most environments and achieved higher average returns on all environments except Humanoid.