## 1 Introduction

Deep neural networks (NNs) provide rich representations that can enable reinforcement learning (RL) algorithms to master a variety of challenging domains, from games to robotic control (mnih2015DQN; silver2016mastering; mnih2016A3C; silver2017mastering). However, most RL algorithms are known to sometimes learn unrealistically high action values Q, resulting in suboptimal policies.

The overestimation problem in RL was first found in the Q-learning algorithm (watkins1989Q-learning), which is the prototype of most existing value-based RL algorithms (sutton2018reinforcement). For this algorithm, van2016double_DQN demonstrated that any kind of estimation error can induce an upward bias, irrespective of whether the error is caused by system noise, function approximation, or any other source. This overestimation bias is first induced by the max operator over all noisy Q estimates of the same state, which tends to prefer overestimated to underestimated Q values (touretzky1996overestimation). The bias is then propagated and exaggerated through temporal difference learning (sutton2018reinforcement), wherein the Q estimate of a state is updated using the Q estimate of a subsequent state. mnih2015DQN proposed the Deep Q-Networks (DQN) algorithm, which employs a deep NN to estimate the Q value. Although the deep NN can provide rich representations with the potential for low asymptotic approximation errors, overestimation still exists, even in deterministic environments (van2016double_DQN; Fujimoto2018TD3).

Fujimoto2018TD3 showed that the overestimation problem also persists in actor-critic RL, such as Deterministic Policy Gradient (DPG) and Deep DPG (DDPG) (silver2014DPG; lillicrap2015DDPG). In practice, inaccurate estimation exists in almost all RL algorithms because, on the one hand, any algorithm will introduce some estimation bias and variance, simply because the true Q values are initially unknown (sutton2018reinforcement). On the other hand, function approximation errors are usually unavoidable. This is particularly problematic because inaccurate estimation can cause arbitrarily suboptimal actions to be overestimated, resulting in a suboptimal policy.

To reduce overestimation in standard Q-learning, hasselt2010double_Q proposed Double Q-learning, which decouples the max operation in the target into action selection and action evaluation. To update one of the two Q estimates, that estimate is used to determine the greedy action, while the other estimate is used to determine its value, resulting in unbiased estimates.
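The decoupling of action selection from action evaluation can be illustrated with a small tabular sketch. This is only an illustration under simplifying assumptions (discrete states and actions stored in nested dictionaries, a coin flip to choose which table to update); the names `double_q_update`, `q_a`, `q_b`, `alpha`, `gamma` and `rng` are introduced here and do not come from the cited work.

```python
import random


def double_q_update(q_a, q_b, s, a, r, s_next, alpha=0.1, gamma=0.99, rng=random):
    """One tabular Double Q-learning step on two tables of the form {state: {action: value}}."""
    # Randomly choose which table is updated; the other one evaluates.
    if rng.random() < 0.5:
        update, evaluate = q_a, q_b
    else:
        update, evaluate = q_b, q_a
    # Action selection uses the table being updated ...
    greedy = max(update[s_next], key=update[s_next].get)
    # ... while the value of that action comes from the other table.
    target = r + gamma * evaluate[s_next][greedy]
    update[s][a] += alpha * (target - update[s][a])
    return target
```

Because the evaluating table did not choose the action, an upward noise spike in one table is not automatically carried into the target.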

van2016double_DQN further proposed a deep variant of Double Q-learning, Double DQN, to deal with the overestimation problem of the DQN algorithm without introducing additional NNs. The target Q network of DQN provides a natural candidate for the second Q estimate, which is used to evaluate the actions selected using the online Q network. However, these two methods can only handle discrete action spaces.

Fujimoto2018TD3 proposed actor-critic variants of the standard Double DQN and Double Q-learning based on DDPG for the continuous control setting, making action selections using the policy optimized with respect to the corresponding Q estimate. However, the actor-critic Double DQN suffers from overestimation similar to that of DDPG, because the current and target Q estimates are too similar to provide independent estimates. While actor-critic Double Q-learning is more effective, it introduces additional Q and policy networks, which increases the computation time of each iteration. To address this problem, Fujimoto2018TD3 finally proposed Clipped Double Q-learning, which takes the minimum of the two Q estimates. They also extended this method with a number of modifications, such as slow-updating target NNs and delayed policy updates, to reduce the estimation variance, and proposed the Twin Delayed Deep Deterministic policy gradient (TD3) algorithm based on DDPG. However, this method still requires an additional Q function.

In this paper, we show that the distributional Q-learning framework can be used to improve the estimation accuracy of the Q-value, and that it is more effective than all the above approaches. We apply the distributional return function to the maximum entropy RL framework (Haarnoja2017Soft-Q; schulman2017PG_Soft-Q; Haarnoja2018SAC; Haarnoja2018ASAC) to form the Distributional Soft Actor-Critic algorithm, DSAC. Unlike traditional distributional RL algorithms, such as C51 (bellemare2017C51) and the Distributed Distributional Deep Deterministic Policy Gradient algorithm (D4PG) (barth-maron2018D4PG), which typically only learn a discrete return distribution, DSAC can directly learn a continuous return distribution by truncating the difference between the target and current return distributions to prevent gradient explosion. Additionally, we propose a new Parallel Asynchronous Buffer-Actor-Learner architecture (PABAL) to improve the learning efficiency. We evaluate our method on the suite of MuJoCo tasks, achieving state-of-the-art performance.

## 2 Related Work

Over the last decade, numerous deep RL algorithms have appeared (mnih2015DQN; lillicrap2015DDPG; schulman2015TRPO; mnih2016A3C; schulman2017PPO; heess2017DPPO; Fujimoto2018TD3; barth-maron2018D4PG; Haarnoja2018SAC), and several approaches have been proposed to address overestimation in RL, such as Double Q-learning, Double DQN, and Clipped Double Q-learning (hasselt2010double_Q; van2016double_DQN; Fujimoto2018TD3). In this paper, we show that learning the distributional return function can be used to improve the estimation accuracy of the Q-value, and that it is more effective than all the above approaches. Our algorithm mainly focuses on the continuous control setting, so an actor-critic architecture with separate policy and value networks is employed (sutton2018reinforcement). We also incorporate the off-policy formulation to improve sample efficiency, and the maximum entropy framework based on a stochastic policy network to encourage stability and exploration. As in algorithms such as DDPG (lillicrap2015DDPG), TD3 (Fujimoto2018TD3) and D4PG (barth-maron2018D4PG), off-policy learning and continuous control can be easily enabled by using separate Q and policy networks. Therefore, we mainly review prior work on the maximum entropy framework and distributional RL in this section.

Maximum entropy RL optimizes policies to maximize both the expected return and the expected policy entropy. While many prior RL algorithms consider the entropy of the policy, they only use it as a regularizer (schulman2015TRPO; mnih2016A3C; schulman2017PPO). Recently, several papers have noted the connection between Q-learning and policy gradient methods in the maximum entropy framework (ODonoghue2016PGQL; schulman2017PG_Soft-Q; nachum2017bridging). However, many prior maximum entropy RL algorithms (sallans2004Boltzmann_explore; ODonoghue2016PGQL; Fox2016G-learning) only care about the policy entropy of the current state. Haarnoja2017Soft-Q proposed the Soft Q-learning algorithm, which directly augments the reward with an entropy term, such that the optimal policy aims to reach states where it will have high entropy in the future. Haarnoja2018SAC developed an off-policy actor-critic variant of Soft Q-learning for large continuous domains, called Soft Actor-Critic (SAC). Haarnoja2018ASAC later devised a gradient-based method for SAC that can automatically learn the optimal temperature of the entropy term during training. In this paper, we build on the work of Haarnoja2018SAC and Haarnoja2018ASAC to implement the maximum entropy framework.

The distributional RL method, in which one models the distribution over returns, whose expectation is the value function, was recently introduced by bellemare2017C51. The first distributional RL algorithm, C51, achieved large performance improvements on many Atari 2600 benchmarks. Since then, many distributional RL algorithms and accompanying analyses have appeared in the literature (dabney2018quantileregre; davney2018quantilenet; rowland2018distributional_ana; lyle2019comparative_dstributional). However, these works can only handle discrete and low-dimensional action spaces, as they only learn a Q network like DQN. barth-maron2018D4PG combined the distributional return function with an actor-critic framework for policy learning in the continuous control setting, and proposed the Distributed Distributional Deep Deterministic Policy Gradient (D4PG) algorithm. Existing distributional RL algorithms usually learn a discrete value distribution because it is computationally friendly. However, this poses a problem: the value range must be divided into multiple discrete intervals in advance. This is inconvenient because different tasks usually require different numbers of divisions and different intervals. In addition, the role of the distributional return function in solving overestimation or underestimation problems has barely been discussed before.

## 3 Preliminaries

### 3.1 Notation

We consider the standard reinforcement learning (RL) setting wherein an agent interacts with an environment $\mathcal{E}$ in discrete time. This environment can be modeled as a Markov Decision Process, defined by the tuple $(\mathcal{S}, \mathcal{A}, R, p)$. The state space $\mathcal{S}$ and action space $\mathcal{A}$ are assumed to be continuous, $R(r \mid s, a)$ is a stochastic reward function mapping state-action pairs to a distribution over a set of bounded rewards, and the unknown state transition probability $p(s' \mid s, a)$ maps the given $s$ and $a$ to the probability distribution over $s'$. For the sake of simplicity, the current and next state-action pairs are also denoted as $(s, a)$ and $(s', a')$, respectively.

At each time step $t$, the agent receives a state $s_t \in \mathcal{S}$ and selects an action $a_t \in \mathcal{A}$. In return, the agent receives the next state $s_{t+1} \in \mathcal{S}$ and a scalar reward $r_t \sim R(\cdot \mid s_t, a_t)$. The process continues until the agent reaches a terminal state, after which the process restarts. The agent's behavior is defined by a stochastic policy $\pi(a_t \mid s_t)$, which maps states to a probability distribution over actions. We will use $\rho_\pi(s)$ and $\rho_\pi(s, a)$ to denote the state and state-action distributions induced by policy $\pi$ in environment $\mathcal{E}$.

### 3.2 Maximum Entropy RL

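The goal in standard RL is to learn a policy which maximizes the expected future accumulated return $\mathbb{E}_{(s_{i \ge t}, a_{i \ge t}) \sim \rho_\pi,\, r_{i \ge t} \sim R}\big[\sum_{i=t}^{\infty} \gamma^{i-t} r_i\big]$, where $\gamma \in (0, 1)$ is the discount factor. In this paper, we will consider a more general entropy-augmented objective, which augments the reward with an entropy term $\mathcal{H}$:

$$J_\pi = \mathbb{E}_{(s_{i \ge t}, a_{i \ge t}) \sim \rho_\pi,\, r_{i \ge t} \sim R}\Big[\sum_{i=t}^{\infty} \gamma^{i-t}\big[r_i + \alpha \mathcal{H}(\pi(\cdot \mid s_i))\big]\Big]. \tag{1}$$

This objective improves the exploration efficiency of the optimal policy by maximizing both the expected future return and the policy entropy. The temperature parameter $\alpha$ determines the relative importance of the entropy term against the reward. Maximum entropy RL gradually approaches conventional RL as $\alpha \to 0$.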
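To make the role of the temperature concrete, the following minimal sketch computes an entropy-augmented discounted return along one finite sampled trajectory. It assumes the common single-sample entropy estimate, in which the entropy bonus at each step is `-log pi(a_i | s_i)`; the function name `soft_return`, its arguments, and the numbers in the example are illustrative and not taken from the paper.

```python
def soft_return(rewards, log_probs, gamma=0.99, alpha=0.2):
    """Discounted sum of r_i - alpha * log pi(a_i | s_i) over a finite trajectory."""
    total = 0.0
    # Accumulate backwards so each step adds its own term plus the discounted tail.
    for reward, log_prob in zip(reversed(rewards), reversed(log_probs)):
        total = (reward - alpha * log_prob) + gamma * total
    return total


rewards = [1.0, 1.0]
log_probs = [-1.0, -1.0]  # made-up log-probabilities of the sampled actions
print(soft_return(rewards, log_probs, gamma=0.5, alpha=0.5))  # entropy-augmented return
print(soft_return(rewards, log_probs, gamma=0.5, alpha=0.0))  # ordinary discounted return
```

Setting the temperature to zero removes the entropy bonus and recovers the ordinary discounted return, which matches the limiting behavior described above.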

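We use $G_t = \sum_{i=t}^{\infty} \gamma^{i-t}\big[r_i - \alpha \log \pi(a_i \mid s_i)\big]$ to denote the entropy-augmented accumulated return from $s_t$. The soft Q-value of policy $\pi$ is defined as

$$Q^\pi(s_t, a_t) = \mathbb{E}_{r \sim R(\cdot \mid s_t, a_t)}[r] + \gamma\, \mathbb{E}_{(s_{i > t}, a_{i > t}) \sim \rho_\pi,\, r_{i > t} \sim R}\big[G_{t+1}\big], \tag{2}$$

which describes the expected entropy-augmented return for selecting $a_t$ in state $s_t$ and thereafter following policy $\pi$.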

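The optimal maximum entropy policy is learned by a maximum entropy variant of the policy iteration method, called soft policy iteration, which alternates between policy evaluation and policy improvement. In the policy evaluation process, given policy $\pi$, the soft Q-value can be learned by repeatedly applying a modified Bellman operator $\mathcal{T}^\pi$ under policy $\pi$ given by

$$\mathcal{T}^\pi Q^\pi(s, a) = \mathbb{E}_{r \sim R(\cdot \mid s, a)}[r] + \gamma\, \mathbb{E}_{s' \sim p,\, a' \sim \pi}\big[Q^\pi(s', a') - \alpha \log \pi(a' \mid s')\big].$$

The goal of the policy improvement process is to find a new policy $\pi_{\text{new}}$ that is better than the current policy $\pi_{\text{old}}$, such that $J_{\pi_{\text{new}}} \ge J_{\pi_{\text{old}}}$. Hence, we can update the policy directly by maximizing the entropy-augmented objective (1), i.e.,

$$\pi_{\text{new}} = \arg\max_\pi J_\pi = \arg\max_\pi \mathbb{E}_{s \sim \rho_\pi,\, a \sim \pi}\big[Q^{\pi_{\text{old}}}(s, a) - \alpha \log \pi(a \mid s)\big]. \tag{3}$$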

The convergence and optimality of soft policy iteration have been verified by Haarnoja2017Soft-Q; Haarnoja2018SAC; Haarnoja2018ASAC and schulman2017PG_Soft-Q.
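In sample-based implementations, the policy evaluation step of soft policy iteration is usually driven by a one-sample backup target. The sketch below assumes the standard soft backup form `r + gamma * (Q(s', a') - alpha * log pi(a' | s'))`; the function name `soft_bellman_target` and the `done` flag for terminal transitions are assumptions added for illustration.

```python
def soft_bellman_target(reward, next_q, next_log_prob, gamma=0.99, alpha=0.2, done=False):
    """One-sample estimate of r + gamma * (Q(s', a') - alpha * log pi(a' | s'))."""
    if done:
        # No bootstrapping past a terminal state (an assumption of this sketch).
        return reward
    return reward + gamma * (next_q - alpha * next_log_prob)


# Made-up numbers: reward 1, next-state soft Q-value 2, log-probability -1 for a'.
print(soft_bellman_target(1.0, 2.0, -1.0, gamma=0.5, alpha=0.5))
```

A less likely next action (more negative log-probability) raises the target, which is how the entropy bonus enters the value estimate.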

### 3.3 State-Action Return Distribution

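The return of policy $\pi$ from a state-action pair $(s_t, a_t)$ is defined as

$$Z^\pi(s_t, a_t) = r_t + \gamma G_{t+1}. \tag{4}$$

The return $Z^\pi(s_t, a_t)$ is usually a random variable due to the randomness in the state transition $p$, reward function $R$ and policy $\pi$. From (2), it is clear that

$$Q^\pi(s, a) = \mathbb{E}\big[Z^\pi(s, a)\big]. \tag{5}$$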

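Instead of considering only the expected return $Q^\pi(s, a)$ from a state-action pair, one can choose to directly model the distribution of returns $Z^\pi(s, a)$. We will use the notation $\mathcal{Z}^\pi(\cdot \mid s, a)$ to denote the return distribution function. The distributional variant of the Bellman operator in the maximum entropy framework can be derived as

$$\mathcal{T}^\pi_{\mathcal{D}} Z^\pi(s, a) \overset{D}{=} r + \gamma\big(Z^\pi(s', a') - \alpha \log \pi(a' \mid s')\big),$$

where $A \overset{D}{=} B$ denotes that two random variables $A$ and $B$ have equal probability laws. Suppose $\mathcal{T}^\pi_{\mathcal{D}} Z^\pi(s, a) \sim \mathcal{T}^\pi_{\mathcal{D}} \mathcal{Z}^\pi(\cdot \mid s, a)$; the return distribution can then be optimized by

$$\mathcal{Z}_{\text{new}} = \arg\min_{\mathcal{Z}} \mathbb{E}_{(s, a) \sim \rho_\pi}\Big[d\big(\mathcal{T}^\pi_{\mathcal{D}} \mathcal{Z}_{\text{old}}(\cdot \mid s, a),\, \mathcal{Z}(\cdot \mid s, a)\big)\Big], \tag{6}$$

where $d$ is some metric to measure the distance between two distributions.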

If $d$ is a Wasserstein metric, the distributional variant of policy iteration which alternates between (6) and (3) has been proved to converge uniformly to the optimal return distribution and policy (bellemare2017C51). In practical distributional RL algorithms, the Kullback-Leibler (KL) divergence, denoted as $D_{\text{KL}}$, is usually used in place of the Wasserstein metric for computational convenience (bellemare2017C51; barth-maron2018D4PG).

## 4 Overestimation Bias

This section mainly focuses on the impact of distributional return learning on reducing overestimation. Therefore, the entropy coefficient $\alpha$ is assumed to be $0$ here.

### 4.1 Overestimation in Q-learning

In Q-learning with discrete actions, suppose the Q-value is approximated by a parameterized state Q-function , where are parameters. Defining the greedy target , the Q-estimate can be updated by minimizing the loss using gradient descent methods, i.e.,

(7) |

where is the learning rate. However, in practical applications, and usually contain random errors, which may be caused by system noises and function approximation. Denoting the true parameters as , we have

(8) | ||||

Suppose both random errors and have zero mean. Let represent the post-update parameters obtained based on the true parameters , that is,

(9) |

Clearly, this error causes some inaccuracy on the right-hand side of (7). It is known that (thrun1993issues). Hence, it follows that

Define . Then, can be further expressed as

Supposing is sufficiently small, the post-update Q-function can be well-approximated by linearizing around using Taylor’s expansion:

Therefore, the upward bias of the post-update Q-function can be calculated as

(10) |

In fact, any kind of estimation error can induce an upward bias due to the max operator. Although it is reasonable to expect the upward bias caused by a single update to be small, these overestimation errors can be further exaggerated through temporal difference learning, which may result in a large overestimation bias and suboptimal policy updates.
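The upward bias introduced by the max operator can be checked numerically. The following sketch is a toy Monte Carlo experiment, not part of the paper's derivation: it adds independent zero-mean Gaussian noise to a set of true action values and estimates how much the expected maximum of the noisy values exceeds the true maximum. The function name `max_bias` and all default values are assumptions chosen for illustration.

```python
import random


def max_bias(true_q, noise_std=1.0, trials=20000, seed=0):
    """Monte Carlo estimate of E[max_a(Q(a) + eps_a)] - max_a Q(a) for zero-mean eps."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Each action value receives its own independent zero-mean error.
        total += max(q + rng.gauss(0.0, noise_std) for q in true_q)
    return total / trials - max(true_q)


print(max_bias([0.0, 0.0, 0.0, 0.0]))          # clearly positive: the max picks up noise
print(max_bias([0.0]))                         # close to zero: nothing to maximize over
print(max_bias([1.0, 2.0], noise_std=0.0))     # no noise, no bias
```

The bias appears even though every individual error has zero mean, and it grows with the number of competing actions and with the noise scale.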

### 4.2 Q-distribution for Reducing Overestimation

Before discussing the distributional version of the Q-learning algorithm, we first assume that the random returns obey a Gaussian distribution , i.e., . Suppose the Q-value and standard deviation of are approximated by two independent parameterized functions and , with parameters and .

Similar to standard Q-learning mentioned before, the Q-distribution estimate is also updated with a random greedy target , where . Suppose . Assuming , then . Therefore, . Considering the loss function (6) under the KL divergence measurement, and are updated by minimizing

So, the update rule can be expressed as

(11) | ||||

For convenience, we assume here. Suppose and contain the same random errors described in (8). Similar to (9), the true parameters and updated based on true Q-value can be expressed as

Similar to the derivation of (10), the overestimation bias of in distributional Q-learning can be calculated as

(12) | ||||

Clearly, the overestimation errors decrease quadratically as the estimated standard deviation increases. On the other hand, from (11), we have

It is clear that tends to be a larger value in areas with high target return variance and random errors . Since is often positively related to the randomness of systems , reward function and next-state return distribution , distributional Q-learning can be used to reduce the overestimation errors caused by problem randomness and approximation errors.
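The dependence on the standard deviation can be illustrated with the gradients of a Gaussian negative log-likelihood, which equals the KL divergence to a fixed target distribution up to a term that does not depend on the parameters. The sketch below is an assumption-laden illustration rather than the paper's update rule: `gaussian_nll` and `nll_grads` are hypothetical helper names, and a single scalar target `y` stands in for a sample of the target return.

```python
import math


def gaussian_nll(y, mean, std):
    """Negative log-density of a target y under N(mean, std^2)."""
    return 0.5 * math.log(2.0 * math.pi) + math.log(std) + (y - mean) ** 2 / (2.0 * std ** 2)


def nll_grads(y, mean, std):
    """Analytic derivatives of gaussian_nll with respect to mean and std."""
    d_mean = -(y - mean) / std ** 2
    d_std = (std ** 2 - (y - mean) ** 2) / std ** 3
    return d_mean, d_std


# The same target error produces a mean gradient that shrinks as 1 / std^2.
for std in (1.0, 2.0, 4.0):
    print(std, nll_grads(1.0, 0.0, std)[0])
```

Under these assumptions, a noisy target moves the mean estimate less where the learned standard deviation is large, and the standard deviation gradient pushes the estimate up whenever the squared target error exceeds the current variance.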

## 5 Distributional Soft Actor-Critic

In this section, we adapt the existing distributional return learning method by introducing a return boundary, so that it can be directly used to learn a continuous return distribution. Finally, we present the off-policy Distributional Soft Actor-Critic (DSAC) algorithm, along with a new asynchronous parallel architecture, based on this theory within the maximum entropy actor-critic framework.

We will consider a parameterized state-action return distribution function $\mathcal{Z}_\theta(\cdot \mid s, a)$ and a stochastic policy $\pi_\phi(\cdot \mid s)$, where $\theta$ and $\phi$ are parameters. For example, both the return distribution and policy function can be modeled as Gaussian with mean and covariance given by neural networks. We will next derive update rules for these parameter vectors.

### 5.1 Algorithm

The soft return distribution can be trained to minimize the loss function in (6) under the KL-divergence measurement

where c is a constant. We provide details of derivation in the supplementary material. The parameters can be optimized with the following gradients

The gradients are prone to explode when is a continuous Gaussian or Gaussian mixture model because as . To address this problem, we propose to clip to keep it close to the expectation value of the current distribution . This makes our modified update gradients:

where

and is the clipping boundary.
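The effect of such a clipping boundary can be sketched for a Gaussian return model. The code below is a hedged illustration only: it assumes the clipped quantity is the target return, that it is clipped to within a distance `b` of the current mean estimate, and that the clipped value is used in the squared-error term of the standard deviation gradient. The names `clip_target` and `std_grad_with_clipping` are hypothetical.

```python
def clip_target(y, q_mean, b):
    """Keep the target within distance b of the current mean estimate."""
    return max(q_mean - b, min(y, q_mean + b))


def std_grad_with_clipping(y, q_mean, q_std, b):
    """d(NLL)/d(std) of a Gaussian, using the clipped target in the squared error."""
    y_clipped = clip_target(y, q_mean, b)
    return (q_std ** 2 - (y_clipped - q_mean) ** 2) / q_std ** 3


# An extreme target no longer produces an extreme gradient on the standard deviation.
print(std_grad_with_clipping(1e6, 0.0, 1.0, b=3.0))
```

With the clip in place, the squared-error term is bounded by `b ** 2`, so a single outlying target cannot dominate the update of the standard deviation.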

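In order to stabilize the learning process, target return distribution and policy functions with separate parameters $\theta'$ and $\phi'$ are used to evaluate the target function. The target networks use a slow-moving update rate, parameterized by $\tau$, such as

$$\theta' \leftarrow \tau \theta + (1 - \tau)\theta', \qquad \phi' \leftarrow \tau \phi + (1 - \tau)\phi'.$$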
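A slow-moving target update of this kind is commonly implemented as an exponential moving average of the online parameters. The sketch below assumes that form and represents parameters as flat lists of floats; the function name `soft_update` and the default rate are illustrative choices.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Return new target parameters: tau * online + (1 - tau) * target, elementwise."""
    return [tau * w + (1.0 - tau) * w_t for w_t, w in zip(target_params, online_params)]


target = [0.0, 1.0]
online = [1.0, 1.0]
print(soft_update(target, online, tau=0.25))
```

A small rate keeps the target function nearly fixed between updates, which is what stabilizes the bootstrapped target.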

The policy can be learned by directly maximizing a parameterized variant of the objective function (3)

There are several options for maximizing this objective, such as the log-derivative and reparameterization tricks. In this paper, we apply the reparameterization trick to reduce the gradient estimation variance (kingma2013repa).
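The idea behind the reparameterization trick can be sketched for a one-dimensional Gaussian policy: the random action is written as a deterministic function of the policy outputs and an independent noise variable, so derivatives can pass through the sample. The names `reparam_action` and `pathwise_grads`, and the choice of a log standard deviation parameterization, are assumptions made for this illustration.

```python
import math
import random


def reparam_action(mean, log_std, xi):
    """Deterministic map from noise xi to an action of a Gaussian policy."""
    return mean + math.exp(log_std) * xi


def pathwise_grads(log_std, xi):
    """Derivatives of the action with respect to mean and log_std for fixed noise."""
    return 1.0, math.exp(log_std) * xi


rng = random.Random(0)
xi = rng.gauss(0.0, 1.0)  # auxiliary noise drawn from a fixed distribution
action = reparam_action(0.5, -1.0, xi)
print(action, pathwise_grads(-1.0, xi))
```

Because the noise does not depend on the policy parameters, the gradient of any differentiable function of the action follows from the chain rule through these pathwise derivatives.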

If the Q-value function is explicitly parameterized through parameters , we only need to express the random action as a deterministic variable, i.e.,

where is an auxiliary variable which is sampled form some fixed distribution. Then the policy update gradient can be approximated with

If cannot be expressed explicitly through , we also need to reparameterize the random return as

In this case, we have

Besides, the distribution offers a richer set of predictions for learning than its expected value . Therefore, we can also choose to maximize the th percentile of

where denotes the th percentile. The gradient of the objective function can also be approximated using the reparameterization trick.
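When the return distribution is modeled as a Gaussian, a percentile has a closed form in terms of the mean and standard deviation. The sketch below uses `statistics.NormalDist` from the Python standard library; the function name `return_percentile` and the example numbers are assumptions for illustration.

```python
from statistics import NormalDist


def return_percentile(q_mean, q_std, k):
    """k-th percentile (0 < k < 100) of a Gaussian return distribution."""
    return NormalDist(mu=q_mean, sigma=q_std).inv_cdf(k / 100.0)


# A low percentile gives a pessimistic value; the 50th percentile is the mean.
print(return_percentile(5.0, 2.0, 10))
print(return_percentile(5.0, 2.0, 50))
```

Choosing a percentile below 50 trades expected return for robustness to the spread of the return distribution, while a percentile above 50 favors optimistic actions.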

Finally, according to (Haarnoja2018ASAC), the temperature is updated by minimizing the following objective

where is the expected entropy.
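A single gradient step on the temperature can be sketched as follows. The sketch assumes the objective has the commonly used form `J(alpha) = E[-alpha * log pi(a | s) - alpha * H_target]`, whose derivative with respect to `alpha` is the gap between the sampled policy entropy and the expected entropy. The function name `temperature_step`, the plain gradient step on `alpha` (rather than on its logarithm, a common alternative), and the learning rate are assumptions.

```python
def temperature_step(alpha, log_probs, target_entropy, lr=1e-3):
    """One gradient-descent step on J(alpha) = mean(-alpha * log_prob - alpha * target_entropy)."""
    # dJ/dalpha is the sampled entropy (-log_prob) minus the target entropy, averaged.
    grad = sum(-lp - target_entropy for lp in log_probs) / len(log_probs)
    return alpha - lr * grad


# Sampled entropy 0.5 is below the target 1.0, so the temperature increases.
print(temperature_step(0.2, [-0.5, -0.5], target_entropy=1.0, lr=0.1))
```

The temperature rises when the policy is less random than desired and falls when it is more random, which steers the policy entropy toward the expected value.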

In addition, two-timescale updates, i.e., less frequent policy updates, usually result in higher quality policy updates (Fujimoto2018TD3). Therefore, the policy, temperature and target networks are updated every iterations in this paper. The final algorithm is listed in Algorithm 1.

### 5.2 Architecture

To improve the learning efficiency, we propose a new Parallel Asynchronous Buffer-Actor-Learner architecture (PABAL), drawing on other distributed architectures such as IMPALA and Ape-X (Espeholt2018IMPALA; horgan2018Ape-X). As shown in Fig. 1, buffers, actors and learners are all distributed across multiple workers, which improves the efficiency of storage and sampling, exploration, and updating, respectively.

Both actors and learners asynchronously fetch the latest parameters from the shared memory. The experience generated by each actor is asynchronously sent to a randomly chosen buffer at each time step. Each buffer continuously stores data and sends sampled experience to a random learner. Based on the received samples, the learners calculate the update gradients using their local functions, and then use the gradients to update the shared value and policy functions. For practical applications, we always implement our DSAC algorithm within the PABAL architecture.
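The data flow between actors, buffers and learners can be sketched with threads and queues. This is a toy illustration under strong simplifications, not the implementation used in the paper: threads stand in for distributed workers, transitions are placeholder tuples, and the gradient step is replaced by a counter on a shared dictionary. The name `run_pabal_toy` and all parameters are hypothetical.

```python
import queue
import random
import threading


def run_pabal_toy(n_actors=2, n_buffers=2, n_learners=2, steps=50, seed=0):
    """Toy data flow: actors -> random buffer -> random learner -> shared state."""
    shared = {"updates": 0}  # stands in for the shared value and policy parameters
    lock = threading.Lock()
    buffer_in = [queue.Queue() for _ in range(n_buffers)]
    learner_in = [queue.Queue() for _ in range(n_learners)]

    def actor(idx):
        rng = random.Random(seed + idx)
        for t in range(steps):
            # Send each placeholder transition to a randomly chosen buffer.
            rng.choice(buffer_in).put((idx, t))

    def buffer(idx):
        rng = random.Random(seed + 100 + idx)
        stored = []
        while True:
            item = buffer_in[idx].get()
            if item is None:  # sentinel: no more experience will arrive
                break
            stored.append(item)
            # Forward one uniformly sampled stored transition to a random learner.
            rng.choice(learner_in).put(rng.choice(stored))

    def learner(idx):
        while True:
            sample = learner_in[idx].get()
            if sample is None:
                break
            with lock:  # placeholder for a gradient step on the shared parameters
                shared["updates"] += 1

    def start(target, count):
        threads = [threading.Thread(target=target, args=(i,)) for i in range(count)]
        for th in threads:
            th.start()
        return threads

    learners = start(learner, n_learners)
    buffers = start(buffer, n_buffers)
    actors = start(actor, n_actors)

    # Shut down stage by stage so that no item is dropped.
    for th in actors:
        th.join()
    for q in buffer_in:
        q.put(None)
    for th in buffers:
        th.join()
    for q in learner_in:
        q.put(None)
    for th in learners:
        th.join()
    return shared["updates"]


print(run_pabal_toy())
```

In this toy version every stored transition triggers exactly one sampled batch of size one, so the number of updates equals the number of generated transitions; a real system would decouple these rates.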

## 6 Experiments

Our experimental evaluation aims to study two primary questions: (1) How well does DSAC perform on benchmark RL tasks in terms of Q-value estimation and average return, compared to state-of-the-art model-free algorithms? (2) Can the return distribution learning mechanism be extended to other RL algorithms?

We evaluate the performance of the proposed DSAC on a suite of MuJoCo continuous control tasks without modifications to the environment or reward (todorov2012mujoco), interfaced through OpenAI Gym (brockman2016openaigym). The tasks we used were InvertedDoublePendulum, Walker2d, HalfCheetah, Ant, and Humanoid. Their details are listed in Appendix D.

| Algorithm | InvertedDoublePendulum-v2 | Ant-v2 | HalfCheetah-v2 | Walker2d-v2 | Humanoid-v2 |
| --- | --- | --- | --- | --- | --- |
| DSAC | 8700 | 7800 | 17200 | 6500 | 10400 |
| SAC (ours) | 8700 | 8000 | 14200 | 5300 | 8500 |
| SAC (paper) | None | 6000 | 15000 | 6000 | 8000 |
| Double-Q SAC | 8700 | 8000 | 12500 | 6000 | None |
| TD3 | 8700 | 8000 | 15000 | 4500 | None |
| DDPG | 8700 | 6000 | 12500 | 3000 | None |
| TD4 | 8700 | 8000 | 2500 | 6000 | None |

To answer the above two questions¹, we compare our algorithm against Deep Deterministic Policy Gradient (DDPG) (lillicrap2015DDPG), Twin Delayed Deep Deterministic policy gradient (TD3) (Fujimoto2018TD3), and Soft Actor-Critic (SAC) (Haarnoja2018ASAC). Additionally, we compare our method with two variants: our proposed Twin Delayed Deep Deterministic Distributional policy gradient algorithm (TD4), which is developed by replacing the clipped double Q-learning in TD3 with distributional return learning, and a Double Q-learning variant of SAC (Double-Q SAC), in which we update the soft Q-value of SAC using the original Double Q-learning formulation (hasselt2010double_Q). See the supplementary material for a detailed description of TD4 and Double-Q SAC, and for the hyper-parameters of our baselines.

¹ We will evaluate and compare the accuracy of the Q-value estimates of different algorithms in later versions. Work still in progress.

All the algorithms mentioned above are implemented in the proposed PABAL architecture, including 6 learners, 6 actors and 4 buffers. For all algorithms, we use a fully connected network with 5 hidden layers of 256 units each, with Gaussian Error Linear Units (GELU) between layers, for both actor and critic. The Adam method with a cosine annealing learning rate is used to update all the parameters. All algorithms adopt almost the same hyperparameters, and the details are shown in Table 4.

Fig. 2 shows the average return over the best 3 of 5 evaluations during training². Table 1 gives the policy performance after 3 million iterations (1 million for InvertedDoublePendulum-v2). The results show that DSAC matches or outperforms the baselines on most benchmarks; the exception is Ant-v2, where it is slightly behind several of them. In particular, DSAC outperforms the original SAC by a large margin on HalfCheetah-v2 and Humanoid-v2 (on the order of 2000).

² In this preprint, we run each algorithm only once in each environment. Work still in progress.

Fig. 3 compares the time efficiency of different algorithms³. The results show that the time consumption per iteration is roughly proportional to the total number of networks. DSAC, together with TD4 and DDPG, is the most time-efficient because it uses the smallest number of networks.

³ Results are based on a single run. We will refine our results later.

We also compare the performance of DDPG, TD3 and TD4 to assess how well the distributional return learning technique extends to other algorithms. While DDPG performs worst due to poor value estimation of the target Q, TD3 and TD4 improve on it via clipped Double Q-learning and distributional return learning, respectively. TD4 performs at least as well as TD3 except on HalfCheetah-v2, where it failed to learn, which provides empirical evidence that distributional return learning can be combined with other policy learning methods.

## 7 Conclusion

In current RL methods, function approximation errors are known to lead to overestimated or underestimated Q-value estimates, which in turn lead to suboptimal policies. We show that learning a state-action return distribution function can be used to improve the estimation accuracy of the Q-value. We combine the distributional return function with the maximum entropy RL framework to develop what we call the Distributional Soft Actor-Critic algorithm, DSAC, which is an off-policy method for the continuous control setting. Unlike traditional distributional Q algorithms, which typically only learn a discrete return distribution, DSAC can directly learn a continuous return distribution by truncating the difference between the target and current Q distributions to prevent gradient explosion. We also develop a distributional variant of the TD3 algorithm, called TD4. Additionally, we propose a new Parallel Asynchronous Buffer-Actor-Learner architecture (PABAL) to improve the learning efficiency. We evaluate our method on the suite of MuJoCo tasks, achieving state-of-the-art performance.

## References

## Appendix A Derivation of the Objective Function for Soft Return Distribution Update

## Appendix B Double-Q SAC Algorithm

Suppose the Q-value and policy are approximated by parameterized functions and respectively. A pair of Q-value functions and policies is required in Double-Q SAC, where is updated with respect to and with respect to . Given separate target functions and policies , the target Q-values of and are calculated as:

The soft Q-value can be trained by directly minimizing

The policy can be learned by directly maximizing a parameterized variant of the objective function (3)

We reparameterize the policy as , then the policy update gradient can be approximated with

The temperature is updated by minimizing the following objective

The pseudo-code of Double-Q SAC is shown in Algorithm 2.

## Appendix C TD4 Algorithm

Consider a parameterized state-action return distribution function and a deterministic policy , where and are parameters. The target networks and are used to stabilize learning. The return distribution can be trained to minimize

where
