Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics

05/08/2020, by Arsenii Kuznetsov et al.

The overestimation bias is one of the major impediments to accurate off-policy learning. This paper investigates a novel way to alleviate the overestimation bias in a continuous control setting. Our method—Truncated Quantile Critics, TQC—blends three ideas: distributional representation of a critic, truncation of the critics' prediction, and ensembling of multiple critics. Distributional representation and truncation allow for arbitrarily granular overestimation control, while ensembling provides additional score improvements. TQC outperforms the current state of the art on all environments from the continuous control benchmark suite, demonstrating 25% improvement on the most challenging Humanoid environment.


1 Introduction

Sample-efficient off-policy reinforcement learning demands accurate approximation of the Q-function. The quality of this approximation is key for stability and performance, since it is the cornerstone for temporal difference target computation and action selection in value-based methods (Mnih et al., 2013), or policy optimization in continuous actor-critic settings (Haarnoja et al., 2018a; Fujimoto et al., 2018).

In continuous domains, policy optimization relies on gradients of the Q-function approximation, sensing and exploiting unavoidable erroneous positive biases. Recently, Fujimoto et al. (2018) significantly improved the performance of a continuous policy by introducing a novel way to alleviate the overestimation bias (Thrun and Schwartz, 1993). We continue this line of research and propose an alternative highly competitive method for controlling overestimation bias.

Thrun and Schwartz (1993) elucidate the overestimation as a consequence of Jensen's inequality: the maximum of the Q-function over actions is not greater than the expected maximum of the noisy (approximate) Q-function. Specifically, for any action-dependent random noise $\epsilon_a$ such that $\mathbb{E}[\epsilon_a] = 0$,

$$\max_a Q(s, a) \leq \mathbb{E}_{\epsilon}\left[ \max_a \big( Q(s, a) + \epsilon_a \big) \right]. \qquad (1)$$
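The effect is easy to reproduce numerically. Below is a minimal sketch (not from the paper) that adds zero-mean Gaussian noise to a fixed set of true Q-values and compares the maximum of the true values with the expected maximum of the noisy ones; all names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
q_true = np.array([1.0, 0.5, 0.0, -0.5])                    # true Q(s, a) for four actions
noise = rng.normal(0.0, 0.5, size=(100_000, q_true.size))   # zero-mean noise, E[eps_a] = 0

max_of_true = q_true.max()                                   # max_a Q(s, a)
expected_max_of_noisy = (q_true + noise).max(axis=1).mean()  # E[max_a (Q(s, a) + eps_a)]

print(max_of_true)             # 1.0
print(expected_max_of_noisy)   # noticeably larger than 1.0 -> overestimation
```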

In practice, the noise may arise for various reasons and from various sources, such as spontaneous errors in function approximation, Q-function invalidation due to ongoing policy optimization, stochasticity of environment, etc. Off-policy algorithms grounded in temporal difference learning are especially sensitive to approximation errors since errors are propagated backward through episodic time and accumulate over the learning process.

The de facto standard for alleviating overestimation in discrete control is the double estimator (Van Hasselt, 2010, 2013). However, Fujimoto et al. (2018) argue that for continuous control this estimator may still overestimate in highly variable state-action space regions, and propose to promote underestimation by taking the minimum over two separate approximators. These approximators naturally constitute an ensemble, the size of which controls the intensity of underestimation: more approximators correspond to more severe underestimation (Lan et al., 2020). We argue that this approach, while very successful in practice, has a few shortcomings:

  • The overestimation control is coarse: it is impossible to take the minimum over a fractional number of approximators (see Section 4.1).

  • The aggregation with the minimum is wasteful: it ignores all estimates except the minimal one, diminishing the power of the ensemble of approximators.

We address these shortcomings with a novel method called Truncated Quantile Critics (TQC). In the design of TQC, we draw on three ideas: distributional representation of a critic, truncation of approximated distribution, and ensembling.

Distributional representations  The distributional perspective (Bellemare et al., 2017) advocates the modeling of the distribution of the random return, instead of the more common modeling of the Q-function, the expectation of the return. In our work, we adapt QR-DQN (Dabney et al., 2018b) for continuous control and approximate the quantiles of the return distribution conditioned on the state and action. Distributional perspective allows for learning the intrinsic randomness of the environment and policy, also called aleatoric uncertainty. We are not aware of any prior work employing aleatoric uncertainty for overestimation bias control. We argue that the granularity of distributional representation is especially useful for precise overestimation control.

Truncation  To control the overestimation, we propose to truncate the right tail of the return distribution approximation by dropping several of the topmost atoms. By varying the number of dropped atoms, we can balance between over- and underestimation. In a sense, the truncation operator is parsimonious: we drop only a small number of atoms (typically, around 8% of the total number of atoms). Additionally, truncation does not require multiple separate approximators: our method surpasses the current state of the art (which uses multiple approximators) on some benchmarks even using only a single one (Figure 1).

Ensembling  The core operation of our method—truncation of return distribution—does not impose any restrictions on the number of required approximators. This effectively decouples overestimation control from ensembling, which, in turn, provides for additional performance improvement (Figure 1).

Our method improves the performance on all environments in the standard OpenAI Gym (Brockman et al., 2016) benchmark suite powered by MuJoCo (Todorov et al., 2012), with up to 30% improvement on some of the environments. For the most challenging Humanoid environment this improvement translates into twice the running speed of the previous SOTA (since the agent receives a constant alive bonus as part of the reward at each step until it falls). The price to pay for this improvement is the computational overhead carried by distributional representations and ensembling (Section 5.2).

This work makes the following contributions to the field of continuous control:

  1. We design a practical method for fine-grained control over the overestimation bias, called Truncated Quantile Critics (Section 3). For the first time, we (1) incorporate aleatoric uncertainty into overestimation bias control, (2) decouple overestimation control from the multiplicity of approximators, and (3) ensemble distributional approximators in a novel way.

  2. We advance the state of the art on the standard continuous control benchmark suite (Section 4) and perform extensive ablation study (Section 5).

Figure 1: Evaluation on the Humanoid environment. Results are averaged over 4 seeds, std is shaded.

To facilitate reproducibility, we carefully document the experimental setup, perform exhaustive ablation, average experimental results over a large number of seeds, publish raw data of seed runs, and release the code for TensorFlow (https://github.com/bayesgroup/tqc) and PyTorch (https://github.com/bayesgroup/tqc_pytorch).

2 Background

2.1 Notation

We consider a Markov decision process, MDP, defined by the tuple $(\mathcal{S}, \mathcal{A}, p, R, \gamma)$, with continuous state and action spaces $\mathcal{S}$ and $\mathcal{A}$, unknown state transition density $p(s' \mid s, a)$, random variable reward function $R(s, a)$, and discount factor $\gamma \in [0, 1)$.

A policy $\pi(\cdot \mid s)$ maps each state $s$ to a distribution over $\mathcal{A}$. We write $\mathcal{H}(\pi(\cdot \mid s))$ to denote the entropy of the policy conditioned on the state $s$.

We write $\dim \mathcal{A}$ for the dimensionality of the space $\mathcal{A}$. Unless explicitly stated otherwise, $\mathbb{E}[\cdot]$ signifies the expectation over transitions from the experience replay $\mathcal{D}$ and actions from the policy $\pi$. We use the overlined notation to denote the parameters of target networks, i.e., $\bar{\psi}$ denotes the exponential moving average of parameters $\psi$.

2.2 Soft Actor Critic

The Soft Actor Critic (SAC) (Haarnoja et al., 2018a) is an off-policy actor-critic algorithm based on the maximum entropy framework. The objective encourages policy stochasticity by augmenting the reward with the entropy at each step.

The policy parameters $\phi$ can be learned by minimizing the objective

$$J_\pi(\phi) = \mathbb{E}_{\mathcal{D}}\left[ D_{\mathrm{KL}}\!\left( \pi_\phi(\cdot \mid s) \,\Big\|\, \frac{\exp\!\big(\tfrac{1}{\alpha} Q_\psi(s, \cdot)\big)}{Z_\psi(s)} \right) \right], \qquad (2)$$

where $Q_\psi$ is the soft Q-function and $Z_\psi(s)$ is the normalizing constant.

The soft Q-function parameters $\psi$ can be learned by minimizing the soft Bellman residual

$$J_Q(\psi) = \mathbb{E}_{\mathcal{D}}\left[ \big( Q_\psi(s, a) - y(r, s') \big)^2 \right], \qquad (3)$$

where $y(r, s')$ denotes the temporal difference target

$$y(r, s') = r + \gamma\, \mathbb{E}_{a' \sim \pi_\phi}\!\left[ Q_{\bar\psi}(s', a') - \alpha \log \pi_\phi(a' \mid s') \right], \qquad (4)$$

and $\alpha$ is the entropy temperature coefficient. Haarnoja et al. (2018b) proposed to dynamically adjust $\alpha$ by taking a gradient step with respect to the loss

$$J(\alpha) = \mathbb{E}_{\mathcal{D},\, a \sim \pi_\phi}\left[ -\alpha \log \pi_\phi(a \mid s) - \alpha \mathcal{H}_T \right] \qquad (5)$$

each time the policy changes. This decreases $\alpha$ if the stochastic estimate of the policy entropy, $-\log \pi_\phi(a \mid s)$, is higher than the target entropy $\mathcal{H}_T$, and increases it otherwise. The target entropy is usually set heuristically to $\mathcal{H}_T = -\dim \mathcal{A}$.

Haarnoja et al. (2018b) take the minimum over two Q-function approximators to compute the target in equation (4) and the policy objective in equation (2).
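For reference, a minimal sketch of the target computation of equation (4) with this clipped double-Q minimum, assuming a `policy.sample` method that returns an action and its log-probability; the tensor names, shapes, and the termination mask are assumptions for illustration, not the original SAC code.

```python
import torch

def sac_td_target(reward, done, next_obs, policy, target_critics, alpha, gamma=0.99):
    # reward, done: [batch]; next_obs: [batch, obs_dim]
    next_action, next_log_prob = policy.sample(next_obs)    # a' ~ pi(.|s'), log pi(a'|s')
    next_qs = torch.stack([critic(next_obs, next_action) for critic in target_critics])
    next_q = next_qs.min(dim=0).values                      # clipped double-Q: min over critics
    soft_value = next_q - alpha * next_log_prob             # soft state-value estimate
    return reward + (1.0 - done) * gamma * soft_value       # y(r, s') of equation (4)
```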

2.3 Distributional Reinforcement Learning with Quantile Regression

Distributional reinforcement learning focuses on approximating the return random variable $Z^\pi(s, a) = \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)$, where $s_0 = s$, $a_0 = a$, $s_{t+1} \sim p(\cdot \mid s_t, a_t)$, and $a_t \sim \pi(\cdot \mid s_t)$, as opposed to approximating the expectation of the return, also known as the Q-function, $Q^\pi(s, a) = \mathbb{E}\left[ Z^\pi(s, a) \right]$.

QR-DQN (Dabney et al., 2018b) approximates the distribution of $Z^\pi(s, a)$ with $Z_\psi(s, a) = \frac{1}{M} \sum_{m=1}^{M} \delta\big( \theta_\psi^m(s, a) \big)$, a mixture of $M$ atoms—Dirac delta functions at locations $\theta_\psi^1(s, a), \dots, \theta_\psi^M(s, a)$ given by a parametric model $\theta_\psi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^M$.

Parameters $\psi$ are optimized by minimizing the 1-Wasserstein distance, averaged over the replay, between $Z_\psi(s, a)$ and the temporal difference target distribution $\mathcal{T}^\pi Z_\psi(s, a)$, where $\mathcal{T}^\pi$ is the distributional Bellman operator (Bellemare et al., 2017):

$$\mathcal{T}^\pi Z(s, a) \overset{D}{:=} R(s, a) + \gamma Z(s', a'), \quad s' \sim p(\cdot \mid s, a),\ a' \sim \pi(\cdot \mid s'). \qquad (6)$$

As Dabney et al. (2018b) show, this minimization can be performed by learning quantile locations $\theta_\psi^m(s, a)$ for fractions $\tau_m = \frac{2m - 1}{2M}$, $m \in [1..M]$, via quantile regression. The quantile regression loss, defined for a quantile fraction $\tau$, is

$$\mathcal{L}_{\mathrm{QR}}^{\tau}(\theta) = \mathbb{E}_{\tilde Z \sim Z}\left[ \rho_\tau(\tilde Z - \theta) \right], \quad \rho_\tau(u) = u\,\big(\tau - \mathbb{I}(u < 0)\big). \qquad (7)$$

To improve gradients for small errors, the authors propose to use the Huber quantile loss (asymmetric Huber loss)

$$\rho_\tau^H(u) = \big|\tau - \mathbb{I}(u < 0)\big|\, \mathcal{L}_H(u), \qquad (8)$$

where $\mathcal{L}_H(u)$ is a Huber loss with parameter $\kappa$.
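A compact sketch of the Huber quantile loss of equation (8), written for a batch of predicted quantile locations and target atoms; the shapes and the `kappa` default are assumptions for illustration.

```python
import torch

def quantile_huber_loss(pred_atoms, target_atoms, kappa=1.0):
    # pred_atoms: [batch, M] quantile locations theta^m; target_atoms: [batch, K]
    M = pred_atoms.shape[1]
    tau = (torch.arange(M, dtype=pred_atoms.dtype, device=pred_atoms.device) + 0.5) / M
    u = target_atoms.unsqueeze(1) - pred_atoms.unsqueeze(2)          # [batch, M, K], u = y - theta^m
    abs_u = u.abs()
    huber = torch.where(abs_u <= kappa, 0.5 * u ** 2, kappa * (abs_u - 0.5 * kappa))
    loss = (tau.view(1, M, 1) - (u < 0).to(u.dtype)).abs() * huber   # |tau_m - I(u < 0)| * L_H(u)
    return loss.mean()
```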

3 Truncated Quantile Critics, TQC

We start with an informal explanation of TQC and motivate our design choices. Next, we outline the formal procedure at the core of TQC, specify the loss functions and present an algorithm for practical implementation.

3.1 Overview

Figure 2: Selection of atoms for the temporal difference target distribution $Y(s, a)$. First, we compute approximations of the return distribution conditioned on $s'$ and $a'$ by evaluating the separate target critics. Second, we form a mixture of the distributions from the previous step. Third, we truncate the right tail of this mixture to obtain the atoms of equation (11).

To achieve granularity in controlling the overestimation, we "decompose" the expected return into the atoms of a distributional representation. By varying the number of atoms, we can control the precision of the return distribution approximation.

To control the overestimation, we propose to truncate the approximation of the return distribution: we drop atoms with the largest locations and estimate the Q-value by averaging the locations of the remaining atoms. By varying the total number of atoms and the number of dropped ones, we can flexibly balance between under- and overestimation. The truncation naturally accounts for the inflated overestimation due to the high return variance: the higher the variance, the lower the Q-value estimate after truncation.

To improve the Q-value estimation, we ensemble multiple distributional approximators in the following way. First, we form a mixture of the distributions predicted by the approximators. Second, we truncate this mixture by removing the atoms with the largest locations and estimate the Q-value by averaging the locations of the remaining atoms. The order of operations—the truncation of the mixture vs. the mixture of truncated distributions—may matter. The truncation of a mixture removes the largest outliers from the pool of all predictions. Such a truncation may be useful in the hypothetical case where one of the critics goes astray and overestimates much more than the others: the truncation of the mixture then removes the atoms predicted by this inadequate critic. In contrast, the mixture of truncated distributions truncates all critics evenly.

Our method differs from previous approaches (Zhang and Yao, 2019; Dabney et al., 2018a) that distorted the critic's distribution at the policy optimization stage only. We use non-truncated critics' predictions for policy optimization and truncate the target return distribution only at the value-learning stage. Intuitively, this prevents errors from propagating to other states via TD-learning updates and eases policy optimization.

Next, we present TQC formally and summarize the procedure in Algorithm 1.

3.2 Computation of the target distribution

We propose to train $N$ approximations $Z_{\psi_1}, \dots, Z_{\psi_N}$ of the policy-conditioned return distribution $Z^\pi(s, a)$. Each $Z_{\psi_n}$ maps each $(s, a)$ to a probability distribution

$$Z_{\psi_n}(s, a) = \frac{1}{M} \sum_{m=1}^{M} \delta\big( \theta_{\psi_n}^m(s, a) \big) \qquad (9)$$

supported on atoms $\theta_{\psi_n}^1(s, a), \dots, \theta_{\psi_n}^M(s, a)$.

We train the approximations on the temporal difference target distribution $Y(s, a)$, which we construct as follows. We pool the atoms of the distributions $Z_{\psi_1}(s', a'), \dots, Z_{\psi_N}(s', a')$ into a set

$$\mathcal{Z}(s', a') := \big\{ \theta_{\psi_n}^m(s', a') \ \big|\ n \in [1..N],\ m \in [1..M] \big\} \qquad (10)$$

and denote the elements of $\mathcal{Z}(s', a')$ sorted in ascending order by $z_{(i)}(s', a')$, with $i \in [1..MN]$.

The $kN$ smallest elements of $\mathcal{Z}(s', a')$ define the atoms

$$y_i(s, a) := r(s, a) + \gamma \big[ z_{(i)}(s', a') - \alpha \log \pi_\phi(a' \mid s') \big], \quad i \in [1..kN], \qquad (11)$$

of the target distribution

$$Y(s, a) := \frac{1}{kN} \sum_{i=1}^{kN} \delta\big( y_i(s, a) \big). \qquad (12)$$

In practice, we always populate $\mathcal{Z}(s', a')$ with atoms predicted by the target networks $Z_{\bar\psi_1}, \dots, Z_{\bar\psi_N}$, which are more stable.
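A minimal sketch of equations (10)-(12): pool the atoms of the N target critics, sort them, keep the kN smallest, and apply the entropy-augmented Bellman update. The `policy.sample` API, tensor shapes, and the termination mask are assumptions for illustration.

```python
import torch

def truncated_target_atoms(reward, done, next_obs, policy, target_critics,
                           alpha, gamma=0.99, drop_per_net=2):
    next_action, next_log_prob = policy.sample(next_obs)               # a' ~ pi(.|s')
    # each target critic Z_psibar_n returns its M atoms, shape [batch, M]
    atoms = torch.cat([critic(next_obs, next_action) for critic in target_critics], dim=1)
    atoms, _ = torch.sort(atoms, dim=1)                                # z_(1) <= ... <= z_(MN)
    keep = atoms.shape[1] - drop_per_net * len(target_critics)         # keep the kN smallest atoms
    truncated = atoms[:, :keep]
    # y_i = r + gamma * (z_(i) - alpha * log pi(a'|s'))                -- equation (11)
    return reward.unsqueeze(1) + (1.0 - done).unsqueeze(1) * gamma * (
        truncated - alpha * next_log_prob.unsqueeze(1))                # [batch, kN]
```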

3.3 Loss functions

We minimize the 1-Wasserstein distance between each of $Z_{\psi_n}(s, a)$, $n \in [1..N]$, and the temporal difference target distribution $Y(s, a)$. Equivalently (Dabney et al., 2018b), to minimize this distance we can approximate the quantiles of the target distribution, i.e., learn the locations $\theta_{\psi_n}^m(s, a)$ for the quantile fractions $\tau_m = \frac{2m - 1}{2M}$, $m \in [1..M]$.

We approximate the quantiles of $Y(s, a)$ with $Z_{\psi_n}(s, a)$ by minimizing the loss

$$J_Z(\psi_n) = \mathbb{E}_{\mathcal{D}, \pi}\left[ L^k(s, a; \psi_n) \right] \qquad (13)$$

over the parameters $\psi_n$, where

$$L^k(s, a; \psi_n) = \frac{1}{kNM} \sum_{m=1}^{M} \sum_{i=1}^{kN} \rho_{\tau_m}^H\big( y_i(s, a) - \theta_{\psi_n}^m(s, a) \big). \qquad (14)$$

In this way, each learnable location becomes dependent on all atoms of the truncated mixture of target distributions.
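A sketch of the critic loss of equations (13)-(14), reusing the `quantile_huber_loss` sketch from Section 2.3: every atom of every critic regresses onto all kN atoms of the (detached) truncated target. Names and shapes remain assumptions for illustration.

```python
def critic_loss(obs, action, target_atoms, critics):
    # target_atoms: [batch, kN], e.g. the detached output of truncated_target_atoms(...)
    losses = [
        quantile_huber_loss(critic(obs, action), target_atoms.detach())  # Eq. (14) for each Z_psi_n
        for critic in critics
    ]
    return sum(losses)
```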

The policy parameters $\phi$ can be optimized to maximize the entropy-penalized estimate of the Q-value by minimizing the loss

$$J_\pi(\phi) = \mathbb{E}_{\mathcal{D},\, a \sim \pi_\phi}\left[ \alpha \log \pi_\phi(a \mid s) - \frac{1}{NM} \sum_{m=1}^{M} \sum_{n=1}^{N} \theta_{\psi_n}^m(s, a) \right], \qquad (15)$$

where $a \sim \pi_\phi(\cdot \mid s)$. We use the non-truncated estimate of the Q-value for policy optimization to avoid double truncation: the Z-functions approximate an already truncated future distribution.
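A sketch of the policy loss of equation (15), assuming a reparameterized `policy.rsample` method; the Q-value estimate is the mean over all non-truncated atoms of all critics.

```python
import torch

def policy_loss(obs, policy, critics, alpha):
    action, log_prob = policy.rsample(obs)            # reparameterized sample a ~ pi_phi(.|s)
    # non-truncated Q estimate: mean over all N*M atoms of all critics
    q_estimate = torch.cat([critic(obs, action) for critic in critics], dim=1).mean(dim=1)
    return (alpha * log_prob - q_estimate).mean()     # equation (15)
```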

   Initialize policy $\pi_\phi$ and critics $Z_{\psi_n}$ for $n \in [1..N]$
   Set replay $\mathcal{D} \leftarrow \emptyset$ and target parameters $\bar\psi_n \leftarrow \psi_n$, $n \in [1..N]$
  for each iteration do
     for each environment step, until done do
        collect transition $(s, a, r, s')$ with policy $\pi_\phi$
        $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s, a, r, s')\}$
     end for
     for each gradient step do
        sample a batch from the replay $\mathcal{D}$
        $\alpha \leftarrow \alpha - \lambda \hat\nabla_\alpha J(\alpha)$  Eq. (5)
        $\phi \leftarrow \phi - \lambda \hat\nabla_\phi J_\pi(\phi)$  Eq. (15)
        $\psi_n \leftarrow \psi_n - \lambda \hat\nabla_{\psi_n} J_Z(\psi_n)$, $n \in [1..N]$  Eq. (13)
        $\bar\psi_n \leftarrow \beta \psi_n + (1 - \beta)\bar\psi_n$, $n \in [1..N]$
     end for
  end for
  return policy $\pi_\phi$, critics $Z_{\psi_n}$, $n \in [1..N]$.
Algorithm 1 TQC. $\hat\nabla$ denotes a stochastic gradient estimate, $\lambda$ the learning rate, and $\beta$ the target smoothing coefficient.
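The loss sketches above can be tied together into a single gradient step mirroring Algorithm 1; the optimizer handling, the `log_alpha` parameterization of the temperature, and the policy API are assumptions for illustration.

```python
import torch

def step(optimizer, loss):
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def tqc_gradient_step(batch, policy, critics, target_critics, log_alpha,
                      optimizers, target_entropy, gamma=0.99, drop_per_net=2, beta=0.005):
    obs, action, reward, done, next_obs = batch
    alpha = log_alpha.exp().detach()

    # temperature update (Eq. 5)
    _, log_prob = policy.sample(obs)
    alpha_loss = -(log_alpha * (log_prob + target_entropy).detach()).mean()
    step(optimizers["alpha"], alpha_loss)

    # policy update (Eq. 15), using non-truncated critics
    step(optimizers["policy"], policy_loss(obs, policy, critics, alpha))

    # critic update (Eq. 13) against the truncated target of Eqs. (10)-(12)
    y = truncated_target_atoms(reward, done, next_obs, policy, target_critics,
                               alpha, gamma, drop_per_net)
    step(optimizers["critics"], critic_loss(obs, action, y, critics))

    # exponential moving average of the target critics' parameters
    for critic, target in zip(critics, target_critics):
        for p, tp in zip(critic.parameters(), target.parameters()):
            tp.data.mul_(1.0 - beta).add_(beta * p.data)
```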

4 Experiments

First, we compare our method with other possible ways to mitigate the overestimation bias on a simple MDP, for which we can compute the true Q-function and the optimal policy.

Next, we quantitatively compare our method with competitors on a standard continuous control benchmark – the set of MuJoCo (Todorov et al., 2012) environments implemented in OpenAI Gym (Brockman et al., 2016). The details of the experimental setup are in Appendix A.

We implement TQC on top of SAC (Haarnoja et al., 2018b) with auto-tuning of the entropy temperature (Section 2.2). For all MuJoCo experiments we use critic networks with three hidden layers of 512 neurons each, M = 25 atoms, and the best number of dropped atoms per network, d = 2, if not stated otherwise. The other hyperparameters are the same as in SAC (see Appendix B).

4.1 Single state MDP

In this experiment we evaluate bias correction techniques (Table 1) in a single-state, continuous-action, infinite-horizon MDP (Figure 5). We train Q-networks (or Z-networks, depending on the method) with two hidden layers from scratch on a fixed replay buffer, for enough iterations for all methods to converge. We populate the buffer by sampling a reward once for each action from a uniform action grid. At each step of temporal difference learning, we use a policy which is greedy with respect to the policy objective in Table 1.

We define $\Delta(a)$ as the signed discrepancy between the approximate and the true Q-value (for tqc, the Q-value is recovered as the mean of the predicted atoms). We vary the parameters controlling the overestimation for each method and report the robust average (a fraction of each tail is truncated) over seeds of the mean and the variance of $\Delta(a)$.

For avg and min we vary the number of networks; for tqc, the number of dropped quantiles per network. We present the results in Figure 6 with bubbles whose diameter is inversely proportional to the seed-averaged absolute distance between the optimal action and the argmax of the policy objective.

Figure 5: Infinite-horizon MDP with a single state and a one-dimensional action space: (a) diagram, (b) reward function. At each step the agent receives a stochastic reward (see Appendix C for details).
Table 1: Bias correction methods (critic target and policy objective for avg, min, and tqc). For simplicity, we omit the state from all arguments.
Figure 6: Robust average (a fraction of each tail is truncated) of the bias and variance of the Q-function approximation for different methods: tqc, min, avg. See the text for details about the axis labels.

The results (Figure 6) suggest TQC can achieve the lowest variance and the smallest bias of Q-function approximation among all the competitors. The variance and the bias correlate well with the policy performance, suggesting TQC may be useful in practice.

4.2 Comparative Evaluation

Figure 7: Average performances of methods on MuJoCo Gym Environments with std shaded. Smoothed with a window of 100.

We compare our method with the original implementations of state-of-the-art algorithms: SAC (https://github.com/rail-berkeley/softlearning), TrulyPPO (https://github.com/wangyuhuix/TrulyPPO), and TD3 (https://github.com/sfujim/TD3). For HalfCheetah, Walker2d, and Ant we evaluate the methods on an extended frame range, until all methods plateau. For Hopper, we extended the range as well.

For our method we selected the number of dropped atoms d for each environment independently, based on a separate evaluation. The best value for Hopper is d = 5, for HalfCheetah d = 0, and for the rest d = 2 (Table 6).

Figure 7 shows the learning curves. In Table 2 we report the average and std over 10 seeds. Each seed's performance is the average of the 100 last evaluations. We evaluate the performance every 1000 frames as an average of 10 deterministic rollouts. As our results suggest, TQC performs consistently better than any of the competitors. TQC also improves upon the maximal published score on four out of five environments (Table 3).

Env TrulyPPO TD3 SAC TQC
Hop
HC *
Wal *
Ant *
Hum *
Table 2: Average and std of the seed returns (thousands). The best average return is bolded and marked with * if it is the best at the chosen significance level according to the two-sided Welch's t-test with Bonferroni correction for multiple comparison testing.

Env ARS-V2-t SAC TQC
Hop
HC
Wal
Ant
Hum
Table 3: Maximum immediate evaluation score (thousands). The maximum was taken over the learning progress and over 10 seeds (see Figure 7 for the mean plot). ARS results are taken from Mania et al. (2018). The best return per row is bolded.

5 Ablation study

We ablate TQC on the Humanoid 3D environment, which has the highest resolution power due to its difficulty, and Walker2d—a 2D environment with the largest sizes of action and observation spaces. In this section and in the Appendix E we average metrics over four seeds.

5.1 Design choices evaluation

The path from SAC (Haarnoja et al., 2018b) to TQC comprises five modifications: Q-network size increase (Big), quantile Q-network introduction (Quantile), target distribution truncation (Truncate), atom pooling (Pool), and ensembling. To reveal the effects behind these modifications, we build four methods—intermediate steps on the incremental path from SAC to TQC. Each subsequent method adds the next modification from the list to the previous method or changes the order in which modifications are applied. For all modifications except the final one (ensembling), we use N = 2 networks. In all truncation operations we drop dN atoms in total.

B-SAC is SAC with an increased size of Q-networks (Big SAC): 3 layers with 512 neurons versus 2 layers of 256 neurons in SAC. Policy network size does not change.

QB-SAC is B-SAC with Quantile distributional networks (Dabney et al., 2018b). This modification changes the form of the Q-networks and the loss function to the quantile Huber loss (equation 8). We adapt the clipped double estimator (Fujimoto et al., 2018) to quantile networks: we recover Q-values from the distributions and use the atoms of the critic with the minimal Q-value to compute the target distribution $Y(s, a)$ with atoms

$$y_i(s, a) = r(s, a) + \gamma \big[ \theta^i_{\bar\psi_{n^*}}(s', a') - \alpha \log \pi_\phi(a' \mid s') \big], \quad n^* = \arg\min_n \frac{1}{M} \sum_{m=1}^{M} \theta^m_{\bar\psi_n}(s', a'), \qquad (16)$$

for $i \in [1..M]$.
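A sketch of this QB-SAC target under the assumptions above: recover per-critic Q-values as atom means, pick the critic with the minimal Q-value for each sample, and apply the entropy-augmented Bellman update to its atoms.

```python
import torch

def clipped_double_quantile_target(reward, done, next_obs, policy,
                                   target_critics, alpha, gamma=0.99):
    next_action, next_log_prob = policy.sample(next_obs)
    atoms = torch.stack([c(next_obs, next_action) for c in target_critics])  # [N, batch, M]
    q_values = atoms.mean(dim=2)                        # recover Q-values per critic, [N, batch]
    n_star = q_values.argmin(dim=0)                     # index of the minimal critic per sample
    batch_idx = torch.arange(atoms.shape[1], device=atoms.device)
    chosen = atoms[n_star, batch_idx]                   # atoms of the argmin critic, [batch, M]
    return reward.unsqueeze(1) + (1.0 - done).unsqueeze(1) * gamma * (
        chosen - alpha * next_log_prob.unsqueeze(1))    # target atoms of equation (16)
```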

TQB-SAC is QB-SAC with individual truncation instead of the min: each $Z_{\psi_n}$ is trained to approximate a truncated temporal difference distribution $Y_n(s, a)$ that is based on the predictions of the single target network $Z_{\bar\psi_n}$ only,

$$Y_n(s, a) = \frac{1}{M - d} \sum_{i=1}^{M - d} \delta\big( y_i^{(n)}(s, a) \big), \quad y_i^{(n)}(s, a) = r(s, a) + \gamma \big[ z_{(i)}^{(n)}(s', a') - \alpha \log \pi_\phi(a' \mid s') \big], \qquad (17)$$

where $z_{(i)}^{(n)}(s', a')$ are the atoms of $Z_{\bar\psi_n}(s', a')$ sorted in ascending order.

PTQB-SAC is TQB-SAC with pooling: each $Z_{\psi_n}$ approximates the mixture of the (already truncated) per-network targets:

$$Y(s, a) = \frac{1}{N} \sum_{n=1}^{N} Y_n(s, a). \qquad (18)$$

TQC = TPQB-SAC is PTQB-SAC with the pooling and truncation operations swapped. This modification drops the same number of atoms as the two previous methods but differs in which atoms are dropped: TQC drops the dN largest atoms from the union of the critics' predictions, while PTQB-SAC and TQB-SAC (no pooling) each drop the d largest atoms from each of the N critics, as sketched below.
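A small sketch contrasting the two orders of operations, given already sorted per-critic atoms of shape [batch, N, M]; both variants drop d*N atoms in total, and the names are assumptions for illustration.

```python
import torch

def truncate_then_pool(z, d):
    # TQB/PTQB-SAC style: drop the d largest atoms of every critic separately
    kept = z[..., : z.shape[-1] - d]                      # [batch, N, M - d]
    return kept.reshape(z.shape[0], -1)                   # [batch, N * (M - d)]

def pool_then_truncate(z, d):
    # TQC style: drop the d*N largest atoms of the pooled set
    pooled, _ = torch.sort(z.reshape(z.shape[0], -1), dim=1)
    return pooled[:, : pooled.shape[1] - d * z.shape[1]]  # [batch, N * M - d * N]
```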

Ensembling To illustrate the advantage brought by ensembling, we include the results for TQC with two and five Z-networks.

The ablation results (Figure 8) suggest the following. The increased network size does not necessarily improve SAC (though some improvement is visible on Walker2d). The quantile representation improves the performance in both environments, with the most notable improvement on Humanoid. Among the following three modifications—individual truncation and the two orders of applying pooling and truncation—TQC is the winner on Humanoid, and the "truncate-then-pool" modification seems to be better on Walker2d. Overall, truncation stabilizes results on Walker2d and seems to reduce the seed variance on Humanoid. Finally, ensembling consistently improves results on both environments.

Figure 8: Design choices evaluation, with the settings of Section 5.1 where not stated otherwise. Smoothed with a window of 200, std is shaded.

5.2 Sensitivity to hyperparameters

Number of truncated quantiles   In this experiment we vary the number of atoms dropped per network, d; the total number of atoms dropped is dN. We fix the number of atoms of each Q-network to M = 25. The results (Figure 9) show that (1) truncation is essential and (2) there is an optimal number of dropped atoms.

Figure 9: Varying the number of dropped atoms per critic d, with N networks and M = 25 atoms per critic. Smoothed with a window of 200, std is plotted.

Number of total quantiles   In this experiment we vary the total number of atoms M and adjust the number of dropped quantiles to keep the truncation ratio approximately constant. The results (Appendix E) suggest this parameter does not have much influence, except for very small M; for larger M the learning curves are indistinguishable.

Number of Z-networks   In this experiment we vary the number of Z-networks N. The results (Appendix E) suggest that (1) a single network is consistently inferior to larger ensembles and (2) the performance improvement saturates as the ensemble grows.


Env SAC B-SAC TQC N=2 TQC N=5
Walker2d 9.5 13.9 14.1 32.4
Humanoid 10.7 16.5 17.4 36.8
Table 4: Time measurements (in seconds) of a single training epoch (1000 frames), averaged over 1000 epochs, executed on a Tesla P40 GPU.

Ensembling and distributional networks incur additional computational overhead, which we quantify for different methods in Table 4.

6 Related work

Distributional perspective

Since the introduction of the distributional paradigm (see White (1988) and references therein) and its reincarnation in deep reinforcement learning (Bellemare et al., 2017), a great body of research has emerged. Dabney et al. proposed methods to learn quantile values (or locations) for a uniform grid of fixed (Dabney et al., 2018b) or sampled (Dabney et al., 2018a) quantile fractions. Yang et al. (2019) proposed a method to learn both quantile fractions and quantile values (i.e., the locations and probabilities of the elements in the mixture approximating the unknown distribution). Choi et al. (2019) used a mixture of Gaussians for approximating the distribution of returns. Most of these works, as well as their influential follow-ups, such as Hessel et al. (2018), are devoted to the discrete control setting.

The adoption of the distributional paradigm in continuous control, to the best of our knowledge, starts from D4PG (Barth-Maron et al., 2018)—a distributed distributional off-policy algorithm building on the C51 (Bellemare et al., 2017). Recently, the distributional paradigm was adopted in distributed continuous control for robotic grasping (Bodnar et al., 2019). The authors proposed Q2-Opt as a collective name for two variants: based on QR-DQN (Dabney et al., 2018b), and on IQN (Dabney et al., 2018a). In contrast to D4PG and Q2-Opt, we focus on the usual, non-distributed setting and modify the target on which the critic is trained.

A number of works develop exploration methods based on the quantile form of value-function. DLTV (Mavrin et al., 2019) uses variability of the quantile distribution in the exploration bonus. QUOTA (Zhang and Yao, 2019)—the option-based modification of QR-DQN—partitions a range of quantiles into contiguous windows and trains a separate intra-option policy to maximize an average of quantiles in each of the windows. Our work proposes an alternative method for critic training, which is unrelated to the exploration problem.

Most importantly, our work differs from the research outlined above in our aim to control the overestimation bias by leveraging quantile representation of a critic network.

Overestimation bias

The overestimation bias is a long-standing topic in several research areas. It is known as the max-operator bias in statistics (D’Eramo et al., 2017) and as the ”winner’s curse” in economics (Smith and Winkler, 2006; Thaler, 2012).

The statistical community studies estimators of the maximum expected value of a set of independent random variables. The simplest estimator—the maximum over sample means, the Maximum Estimator (ME)—is positively biased, while for many distributions, such as the Gaussian, an unbiased estimator does not exist (Ishwaei D et al., 1985). The Double Estimator (DE) (Stone, 1974; Van Hasselt, 2013) uses cross-validation to decorrelate the estimation of the argmax and of the value for that argmax. He and Guo (2019) proposed a coupled estimator as an extension of DE to the case of partially overlapping cross-validation folds. Many works have aimed at alleviating the negative bias of DE, which in absolute value can be even larger than that of ME. D'Eramo et al. (2016) assumed a Gaussian distribution for the sample mean and proposed the Weighted Estimator (WE), with a bias in between those of ME and DE. Imagaw and Kaneko (2017) improved WE by using UCB for weight computation. D'Eramo et al. (2017) assumed a certain spatial correlation and extended WE to continuous sets of random variables. The problem of overestimation has also been discussed in the context of optimal stopping and sequential testing (Kaufmann et al., 2018).

The reinforcement learning community became interested in the bias since the work of Thrun and Schwartz (1993), who attributed a systematic overestimation to the generalization error. The authors proposed multiple ways to alleviate the problem, including (1) bias compensation with additive pseudo costs and (2) underestimation (e.g., in uncertain areas). The underestimation concept became much more prominent, while the adoption of "additive compensation" has been quite limited to date (Patnaik and Anwar, 2008; Lee and Powell, 2012).

Van Hasselt (2010) proposed Double Estimation in Q-learning, which was subsequently adapted to neural networks as Double DQN (Van Hasselt et al., 2015). Later, Zhang et al. (2017) and Lv et al. (2019) introduced the Weighted Estimator to the reinforcement learning community.

Another approach against overestimation, and for overall Q-function quality improvement, is based on the idea of averaging or ensembling. Embodiments of this approach are based on dropout (Anschel et al., 2017), employing previous Q-function approximations (Ghavamzadeh et al., 2011; Anschel et al., 2017), a linear combination of estimates over a pool of Q-networks (Li and Hou, 2019; Kumar et al., 2019), or a random mixture of predictions from the pool (Agarwal et al., 2019). Buckman et al. (2018) reported a reduction in overestimation originating from ensembling in model-based learning.

In continuous control, Fujimoto et al. (2018) proposed the TD3 algorithm, taking the minimum over two approximators of the Q-function to reduce the overestimation bias. Later, for discrete control, Lan et al. (2020) developed Maxmin Q-learning, taking the minimum over more than two Q-functions. We build upon the minimization idea of Fujimoto et al. (2018) and, following Lan et al. (2020), use multiple approximators.

Our work differs in that we do not propose to control the bias by choosing between multiple approximators or weighting them. For the first time, we propose to successfully control the overestimation even for a single approximator and use ensembling only to improve the performance further.

7 Conclusion and Future Work

In this work, we propose to control the overestimation bias on the basis of aleatoric uncertainty. The method we propose comprises three essential ideas: distributional representations, truncation of a distribution, and ensembling.

Simulations reveal favorable properties of our method: low expected variance of the approximation error as well as the fine control over the under- and overestimation. The exceptional results on the standard continuous control benchmark suggest that distributional representations may be useful for controlling the overestimation bias.

Since little is known about the connection between aleatoric uncertainty and overestimation, we see the investigation of it as an exciting avenue for future work.

8 Acknowledgements

We would like to thank Artem Sobolev, Arsenii Ashukha, Oleg Ivanov and Dmitry Nikulin for their comments and suggestions regarding the early versions of the manuscript. We also thank the anonymous reviewers for their feedback.

This research was supported in part by the Russian Science Foundation grant no. 19-71-30020.

References

  • R. Agarwal, D. Schuurmans, and M. Norouzi (2019) Striving for simplicity in off-policy deep reinforcement learning. arXiv preprint arXiv:1907.04543. Cited by: §6.
  • O. Anschel, N. Baram, and N. Shimkin (2017) Averaged-dqn: variance reduction and stabilization for deep reinforcement learning. pp. 176–185. Cited by: §6.
  • G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, A. Muldal, N. Heess, and T. Lillicrap (2018) Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617. Cited by: §6.
  • M. G. Bellemare, W. Dabney, and R. Munos (2017) A Distributional Perspective on Reinforcement Learning. arXiv e-prints, pp. arXiv:1707.06887. External Links: 1707.06887 Cited by: §1, §2.3, §6, §6.
  • C. Bodnar, A. Li, K. Hausman, P. Pastor, and M. Kalakrishnan (2019) Quantile qt-opt for risk-aware vision-based robotic grasping. arXiv preprint arXiv:1910.02787. Cited by: §6.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) Openai gym. arXiv preprint arXiv:1606.01540. Cited by: §1, §4.
  • J. Buckman, D. Hafner, G. Tucker, E. Brevdo, and H. Lee (2018) Sample-efficient reinforcement learning with stochastic ensemble value expansion. pp. 8224–8234. Cited by: §6.
  • Y. Choi, K. Lee, and S. Oh (2019) Distributional deep reinforcement learning with a mixture of gaussians. In 2019 International Conference on Robotics and Automation (ICRA), pp. 9791–9797. Cited by: §6.
  • C. D’Eramo, A. Nuara, M. Pirotta, and M. Restelli (2017) Estimating the maximum expected value in continuous reinforcement learning problems. Cited by: §6, §6.
  • C. D’Eramo, M. Restelli, and A. Nuara (2016) Estimating maximum expected value through gaussian approximation. pp. 1032–1040. Cited by: §6.
  • W. Dabney, G. Ostrovski, D. Silver, and R. Munos (2018a) Implicit quantile networks for distributional reinforcement learning. arXiv preprint arXiv:1806.06923. Cited by: §3.1, §6, §6.
  • W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos (2018b) Distributional reinforcement learning with quantile regression. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §1, §2.3, §2.3, §3.3, §5.1, §6, §6.
  • S. Fujimoto, H. van Hoof, and D. Meger (2018) Addressing function approximation error in actor-critic methods. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 1587–1596. External Links: Link Cited by: §E.4, §1, §1, §1, §5.1, §6.
  • M. Ghavamzadeh, H. J. Kappen, M. G. Azar, and R. Munos (2011) Speedy q-learning. In Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger (Eds.), pp. 2411–2419. External Links: Link Cited by: §6.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018a) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: §1, §2.2.
  • T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, et al. (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905. Cited by: Figure 15, §E.4, §2.2, §2.2, §4, §5.1.
  • M. He and H. Guo (2019) Interleaved q-learning with partially coupled training process. pp. 449–457. Cited by: §6.
  • M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver (2018) Rainbow: combining improvements in deep reinforcement learning. Cited by: §6.
  • T. Imagaw and T. Kaneko (2017) Estimating the maximum expected value through upper confidence bound of likelihood. pp. 202–207. Cited by: §6.
  • B. Ishwaei D, D. Shabma, and K. Krishnamoorthy (1985) Non-existence of unbiased estimators of ordered parameters. Statistics: A Journal of Theoretical and Applied Statistics 16 (1), pp. 89–95. Cited by: §6.
  • E. Kaufmann, W. M. Koolen, and A. Garivier (2018) Sequential test for the lowest mean: from thompson to murphy sampling. pp. 6332–6342. Cited by: §6.
  • A. Kumar, J. Fu, M. Soh, G. Tucker, and S. Levine (2019) Stabilizing off-policy q-learning via bootstrapping error reduction. pp. 11761–11771. Cited by: §6.
  • Q. Lan, Y. Pan, A. Fyshe, and M. White (2020) Maxmin q-learning: controlling the estimation bias of q-learning. External Links: Link Cited by: §E.4, §1, §6.
  • D. Lee and W. B. Powell (2012) An intelligent battery controller using bias-corrected q-learning. Cited by: §6.
  • Z. Li and X. Hou (2019) Mixing update q-value for deep reinforcement learning. pp. 1–6. Cited by: §6.
  • P. Lv, X. Wang, Y. Cheng, and Z. Duan (2019) Stochastic double deep q-network. IEEE Access 7 (), pp. 79446–79454. External Links: Document, ISSN 2169-3536 Cited by: §6.
  • H. Mania, A. Guy, and B. Recht (2018) Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055. Cited by: Table 3.
  • B. Mavrin, H. Yao, L. Kong, K. Wu, and Y. Yu (2019) Distributional reinforcement learning for efficient exploration. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 4424–4434. Cited by: §6.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §1.
  • K. Patnaik and S. Anwar (2008) Q learning in context of approximation spaces. Contemporary Engineering Sciences 1 (1), pp. 41–49. Cited by: §6.
  • J. E. Smith and R. L. Winkler (2006) The optimizer’s curse: skepticism and postdecision surprise in decision analysis. Management Science 52 (3), pp. 311–322. Cited by: §6.
  • M. Stone (1974) Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Methodological) 36 (2), pp. 111–133. Cited by: §6.
  • R. Thaler (2012) The winner’s curse: paradoxes and anomalies of economic life. Simon and Schuster. Cited by: §6.
  • S. Thrun and A. Schwartz (1993) Issues in using function approximation for reinforcement learning. Cited by: §1, §1, §6.
  • E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §1, §4.
  • H. Van Hasselt, A. Guez, and D. Silver (2015) Deep Reinforcement Learning with Double Q-learning. arXiv e-prints, pp. arXiv:1509.06461. External Links: 1509.06461 Cited by: §6.
  • H. Van Hasselt (2010) Double q-learning. In Advances in Neural Information Processing Systems, pp. 2613–2621. Cited by: §1, §6.
  • H. Van Hasselt (2013) Estimating the maximum expected value: an analysis of (nested) cross validation and the maximum sample average. arXiv preprint arXiv:1302.7175. Cited by: §1, §6.
  • D. White (1988) Mean, variance, and probabilistic criteria in finite markov decision processes: a review. Journal of Optimization Theory and Applications 56 (1), pp. 1–29. Cited by: §6.
  • D. Yang, L. Zhao, Z. Lin, T. Qin, J. Bian, and T. Liu (2019) Fully parameterized quantile function for distributional reinforcement learning. In Advances in Neural Information Processing Systems, pp. 6190–6199. Cited by: §6.
  • S. Zhang and H. Yao (2019) QUOTA: the quantile option architecture for reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5797–5804. Cited by: §3.1, §6.
  • Z. Zhang, Z. Pan, and M. J. Kochenderfer (2017) Weighted double q-learning.. pp. 3455–3461. Cited by: §6.

Appendix A Experimental setting

We would like to caution about the use of MuJoCo 2.0 with versions of Gym at least up to v0.15.4 (the last released at the moment of writing). For these versions, Gym incorrectly nullifies the state components corresponding to contact forces, which in turn makes results incomparable to previous works.

In our work we use MuJoCo 1.5 and the v3 versions of the environments. The versions of all other packages we used are listed in the Conda environment file distributed with the source code.

Appendix B Hyperparameters

Critic networks are fully connected, with the last layer output size equal to the number of atoms M.


Hyperparameter TQC SAC
Optimizer Adam
Learning rate
Discount 0.99
Replay buffer size
Number of critics 5 2
Number of hidden layers in critic networks 3 2
Size of hidden layers in critic networks 512 256
Number of hidden layers in policy network 2
Size of hidden layers in policy network 256
Minibatch size 256
Entropy target
Nonlinearity ReLU
Target smoothing coefficient 0.005
Target update interval 1
Gradient steps per iteration 1
Environment steps per iteration 1
Number of atoms 25
Huber loss parameter 1
Table 5: Hyperparameters values.

Environment Number of dropped atoms, Number of environment steps
Hopper 5
HalfCheetah 0
Walker2d 2
Ant 2
Humanoid 2
Table 6: Environment dependent hyperparameters for TQC.

Appendix C Toy experiment setting

The task is a simplistic infinite-horizon MDP with only one state and a one-dimensional action space. Since there is only one state, the state transition function and the initial state distribution are delta functions. At each step the agent gets a stochastic reward whose mean, as a function of the action, is a cosine with slowly increasing amplitude (Figure 10).

The discount factor is .

Figure 10: Reward function. The x-axis represents the one-dimensional action space; the y-axis, the corresponding stochastic rewards and their expectation.

These parameters give rise to three local maxima: one near the left end, one in the right half (the global maximum), and one at the right end. The optimal policy in this environment always selects the action at the global maximum.

In the toy experiment we evaluate bias correction techniques (Table 1) in this MDP. We train Q-networks (or Z-networks, depending on the method) with two hidden layers from scratch on a fixed replay buffer. We populate the buffer by sampling a reward once for each action from a uniform action grid. At each step of temporal difference learning, we use a policy which is greedy with respect to the policy objective in Table 1.

We define $\Delta(a)$ as the signed discrepancy between the approximate and the true Q-value (for TQC, the Q-value is recovered as the mean of the predicted atoms). We vary the parameters controlling the overestimation for each method and report the robust average (a fraction of each tail is truncated) over seeds of the mean and the variance of $\Delta(a)$. The expectation and the variance are estimated over a dense uniform grid of actions and then averaged over seeds.

For avg and min we vary the number of networks; for tqc, the number of dropped quantiles per network. We present the results in Figure 6 with bubbles whose diameter is inversely proportional to the seed-averaged absolute distance between the optimal action and the argmax of the policy objective.

To prevent the subtleties of policy optimization from interfering with conclusions about Q-function approximation quality, we use the implicit deterministic policy induced by the value networks: the argmax of the approximation. To find the maximum, we evaluate the approximation over a dense uniform grid of actions.

Each dataset consists of a uniform grid of actions and the corresponding sampled rewards. For each method we average results over several datasets and evaluate on different dataset sizes. In this way, the current policy is defined implicitly as the greedy policy with respect to the value function. This policy does not interact with the environment; instead, the actions are predefined to lie on a uniform grid.

Appendix D Mistakenly unreferenced appendix section

We are sorry for the void section. We keep this section to make references in the main text valid and will remove it in the camera ready version.

Appendix E Additional experimental results

E.1 Number of critics

Figure 11: Varying the number of critic networks N for TQC, with a fixed number of atoms per critic and a fixed number of dropped atoms per critic. Smoothed with a window of 100, std is plotted.

E.2 Total number of atoms

Figure 12: Varying the number of atoms per critic M for TQC, with a fixed number of critics and of dropped atoms per critic. Smoothed with a window of 100, std is plotted.

E.3 Removed atoms stats

TQC drops the atoms with the largest locations after pooling the atoms from multiple Z-networks. Experimentally, this procedure drops more atoms for some Z-networks than for others. To quantify this imbalance, we compute the ratio of dropped atoms to total atoms for each of the Z-networks. These proportions, once sorted and averaged over the replay, are approximately constant throughout learning (Figure 13).

Figure 13: Proportions of atoms dropped per critic, sorted and averaged over the minibatches drawn from the experience replay, for TQC with different numbers of critics and dropped atoms per critic. For example, the upper right plot should be read as "on average, the largest proportion of dropped atoms per critic is 35%, i.e., roughly a third of the dropped atoms were predicted by a single critic." Smoothed with a window of 100, std is plotted.
Figure 14: Proportions of atoms dropped per critic, averaged over the minibatches drawn from the experience replay, for TQC with different numbers of critics and dropped atoms per critic. Same plot as Figure 13, but without sorting. The figure illustrates that no single critic consistently overestimates more than the others over the whole state-action space. Smoothed with a window of 100, std is plotted.

Interestingly, without sorting, the averaging over the replay gives almost perfectly equal proportions (Figure 14). These results suggest that each critic overestimates more than the others only in some regions of the state-action space. In other words, in practice the systematic overestimation of a single critic (with respect to the other critics' predictions) over the whole state-action space does not occur.

E.4 Clipped Double Q-learning

To ensure that it is not possible to match the performance of TQC by carefully tuning previous methods, we varied the number of critic networks used in the Clipped Double Q-learning estimate (Fujimoto et al., 2018) for SAC (Haarnoja et al., 2018b). The larger the number of networks under the min, the stronger the underestimation (Lan et al., 2020).

We have found that for the MuJoCo benchmarks it is not possible to improve upon the published results by controlling the overestimation in such a coarse way, both for the regular network size (Figure 15) and for the increased network size (Figure 16).

Figure 15: Varying the number of critic networks under the min operation of the Clipped Double Q-learning estimate for SAC. Each critic network has 2 hidden layers of 256 neurons (the same network structure as in SAC (Haarnoja et al., 2018b)). Smoothed with a window of 100, std is plotted.
Figure 16: Varying the number of critic networks under the min operation of the Clipped Double Q-learning estimate for SAC. Each critic network has 3 hidden layers of 512 neurons. Smoothed with a window of 100, std is plotted.