Implementation of the Truncated Quantile Critics (TQC) method for continuous reinforcement learning. https://bayesgroup.github.io/tqc/
The overestimation bias is one of the major impediments to accurate off-policy learning. This paper investigates a novel way to alleviate the overestimation bias in a continuous control setting. Our method, Truncated Quantile Critics (TQC), blends three ideas: distributional representation of a critic, truncation of critics' predictions, and ensembling of multiple critics. Distributional representation and truncation allow for arbitrarily granular overestimation control, while ensembling provides additional score improvements. TQC outperforms the current state of the art on all environments from the continuous control benchmark suite, demonstrating a 25% improvement on the most challenging Humanoid environment.
Sample-efficient off-policy reinforcement learning demands accurate approximation of the Q-function. The quality of this approximation is key for stability and performance, since it is the cornerstone of temporal difference target computation and of action selection in value-based methods (Mnih et al., 2013) or policy optimization in continuous actor-critic settings (Haarnoja et al., 2018a; Fujimoto et al., 2018).
In continuous domains, policy optimization relies on gradients of the Q-function approximation, sensing and exploiting unavoidable erroneous positive biases. Recently, Fujimoto et al. (2018) significantly improved the performance of a continuous policy by introducing a novel way to alleviate the overestimation bias (Thrun and Schwartz, 1993). We continue this line of research and propose an alternative highly competitive method for controlling overestimation bias.
Thrun and Schwartz (1993) elucidate the overestimation as a consequence of Jensen's inequality: the maximum of the Q-function over actions is not greater than the expected maximum of the noisy (approximate) Q-function. Specifically, for any action-dependent random noise U(a) such that E[U(a)] = 0,

max_a Q(s, a) = max_a E_U[Q(s, a) + U(a)] <= E_U[max_a (Q(s, a) + U(a))].
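This inequality is easy to verify numerically. The following sketch (our own illustrative values, not from the paper) adds zero-mean Gaussian noise to a toy Q-function and shows that the expected maximum over actions is positively biased:

```python
import numpy as np

rng = np.random.default_rng(0)
q = np.array([1.0, 0.5, 0.0])   # true Q-values for three actions
true_max = q.max()              # max_a Q(s, a) = 1.0

# Add zero-mean noise U(a) and average max_a (Q + U) over many draws.
noise = rng.normal(0.0, 1.0, size=(100_000, q.size))
expected_noisy_max = (q + noise).max(axis=1).mean()

# E[max_a (Q + U)] exceeds max_a Q even though E[U(a)] = 0.
print(expected_noisy_max > true_max)  # True: the estimate is positively biased
```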
In practice, the noise may arise for various reasons and from various sources, such as spontaneous errors in function approximation, Q-function invalidation due to ongoing policy optimization, stochasticity of environment, etc. Off-policy algorithms grounded in temporal difference learning are especially sensitive to approximation errors since errors are propagated backward through episodic time and accumulate over the learning process.
The de facto standard for alleviating overestimation in discrete control is the double estimator (Van Hasselt, 2010, 2013). However, Fujimoto et al. (2018) argue that for continuous control this estimator may still overestimate in highly variable state-action space regions, and propose to promote underestimation by taking the minimum over two separate approximators. These approximators naturally constitute an ensemble, the size of which controls the intensity of underestimation: more approximators correspond to more severe underestimation (Lan et al., 2020). We argue that this approach, while very successful in practice, has a few shortcomings:
The overestimation control is coarse: it is impossible to take the minimum over a fractional number of approximators (see Section 4.1).
The aggregation with min is wasteful: it ignores all estimates except the minimal one, diminishing the power of the ensemble of approximators.
We address these shortcomings with a novel method called Truncated Quantile Critics (TQC). In the design of TQC, we draw on three ideas: distributional representation of a critic, truncation of approximated distribution, and ensembling.
Distributional representations The distributional perspective (Bellemare et al., 2017) advocates modeling the distribution of the random return instead of the more common modeling of the Q-function, the expectation of the return. In our work, we adapt QR-DQN (Dabney et al., 2018b) for continuous control and approximate the quantiles of the return distribution conditioned on the state and action. The distributional perspective allows for learning the intrinsic randomness of the environment and policy, also called aleatoric uncertainty. We are not aware of any prior work employing aleatoric uncertainty for overestimation bias control. We argue that the granularity of distributional representation is especially useful for precise overestimation control.
Truncation To control the overestimation, we propose to truncate the right tail of the return distribution approximation by dropping several of the topmost atoms. By varying the number of dropped atoms, we can balance between over- and underestimation. In a sense, the truncation operator is parsimonious: we drop only a small number of atoms (typically, around 8% of the total number of atoms). Additionally, truncation does not require multiple separate approximators: our method surpasses the current state of the art (which uses multiple approximators) on some benchmarks even using only a single one (Figure 1).
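As a rough numpy sketch of this operator (the function name and values are ours, purely illustrative): given the predicted atom locations, drop the largest ones and average the rest.

```python
import numpy as np

def truncated_q_estimate(atom_locations: np.ndarray, n_drop: int):
    """Drop the n_drop atoms with the largest locations and average
    the remaining ones to obtain a conservative Q-value estimate."""
    kept = np.sort(atom_locations)[: atom_locations.size - n_drop]
    return kept.mean()

atoms = np.array([0.0, 1.0, 2.0, 3.0, 10.0])   # one atom badly overestimates
print(truncated_q_estimate(atoms, n_drop=0))   # 3.2  (plain mean)
print(truncated_q_estimate(atoms, n_drop=1))   # 1.5  (outlier removed)
```

Note how a higher return variance inflates the right tail and is therefore penalized more by the truncation, matching the intuition in the text.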
Ensembling The core operation of our method—truncation of return distribution—does not impose any restrictions on the number of required approximators. This effectively decouples overestimation control from ensembling, which, in turn, provides for additional performance improvement (Figure 1).
Our method improves the performance on all environments in the standard OpenAI Gym (Brockman et al., 2016) benchmark suite powered by MuJoCo (Todorov et al., 2012), with up to 30% improvement on some of the environments. For the most challenging Humanoid environment this improvement translates into twice the running speed of the previous SOTA (since the agent receives a fixed alive bonus as part of the reward at each step until it falls). The price to pay for this improvement is the computational overhead carried by distributional representations and ensembling (Section 5.2).
This work makes the following contributions to the field of continuous control:
We design a practical method for the fine-grained control over the overestimation bias, called Truncated Quantile Critics (Section 3). For the first time, we (1) incorporate aleatoric uncertainty into the overestimation bias control, (2) decouple overestimation control and multiplicity of approximators, (3) ensemble distributional approximators in a novel way.
To facilitate reproducibility, we carefully document the experimental setup, perform exhaustive ablations, average experimental results over a large number of seeds, publish the raw data of seed runs, and release the code for TensorFlow: https://github.com/bayesgroup/tqc
We consider a Markov decision process (MDP) defined by the tuple (S, A, P, R, γ), with continuous state and action spaces S and A, unknown state transition density P(s' | s, a), random variable reward function R(s, a), and discount factor γ ∈ [0, 1).
A policy π maps each state s ∈ S to a distribution π(· | s) over A. We write H(π(· | s)) to denote the entropy of the policy conditioned on the state s.
We write dim A for the dimensionality of the space A. Unless explicitly stated otherwise, E[·] signifies the expectation over transitions (s, a, r, s') from the experience replay D and actions a' from π(· | s'). We use the overlined notation to denote the parameters of target networks, i.e., θ̄ denotes the exponential moving average of parameters θ.
The Soft Actor Critic (SAC) (Haarnoja et al., 2018a) is an off-policy actor-critic algorithm based on the maximum entropy framework. The objective encourages policy stochasticity by augmenting the reward with the entropy at each step.
The policy parameters φ can be learned by minimizing the expected KL divergence

J_π(φ) = E_{s ∼ D} [ D_KL( π_φ(· | s) ‖ exp(Q_θ(s, ·) / α) / Z_θ(s) ) ],

where Q_θ is the soft Q-function and Z_θ(s) is the normalizing constant.
The soft Q-function parameters θ can be learned by minimizing the soft Bellman residual

J_Q(θ) = E_{(s, a) ∼ D} [ ( Q_θ(s, a) − y(s, a) )² ],

where y(s, a) denotes the temporal difference target

y(s, a) = r(s, a) + γ E_{s', a'} [ Q_θ̄(s', a') − α log π_φ(a' | s') ],

and α is the entropy temperature coefficient. Haarnoja et al. (2018b) proposed to dynamically adjust α by taking a gradient step with respect to the loss

J(α) = E_{s ∼ D, a ∼ π_φ} [ −α log π_φ(a | s) − α H_T ]

each time the policy changes. This decreases α if the stochastic estimate of the policy entropy, −log π_φ(a | s), is higher than the target entropy H_T, and increases α otherwise. The target entropy is usually set heuristically to H_T = −dim A.
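The temperature adjustment above can be sketched as a single gradient step on a log-parameterized α (a minimal sketch with plain SGD and illustrative names, not the authors' implementation):

```python
import numpy as np

def alpha_update(log_alpha: float, log_probs: np.ndarray,
                 target_entropy: float, lr: float = 1e-3) -> float:
    """One gradient step on J(alpha) = E[-alpha * (log pi + H_T)].
    With alpha = exp(log_alpha), dJ/d(log_alpha) = -alpha * E[log pi + H_T]."""
    alpha = np.exp(log_alpha)
    grad = -alpha * np.mean(log_probs + target_entropy)
    return log_alpha - lr * grad

# Entropy estimate -log pi = -1.0 is below the target H_T = -0.5,
# so the temperature is pushed up to encourage more exploration.
new_log_alpha = alpha_update(0.0, log_probs=np.array([1.0]), target_entropy=-0.5)
print(new_log_alpha > 0.0)  # True: alpha grows
```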
Distributional reinforcement learning focuses on approximating the return random variable Z^π(s, a) = Σ_{t≥0} γ^t R(s_t, a_t), where s_0 = s, a_0 = a, s_t ∼ P(· | s_{t−1}, a_{t−1}), and a_t ∼ π(· | s_t), as opposed to approximating the expectation of the return, also known as the Q-function, Q^π(s, a) = E[Z^π(s, a)].
QR-DQN (Dabney et al., 2018b) approximates the distribution of Z^π(s, a) with Z_θ(s, a) := (1/M) Σ_{m=1}^{M} δ(θ^m(s, a)), a mixture of M atoms—Dirac delta functions at locations θ^1(s, a), …, θ^M(s, a) given by a parametric model.
Parameters θ are optimized by minimizing the 1-Wasserstein distance, averaged over the replay, between Z_θ(s, a) and the temporal difference target distribution T^π Z_θ(s, a), where T^π is the distributional Bellman operator (Bellemare et al., 2017):

(T^π Z)(s, a) := R(s, a) + γ Z(s', a'),   s' ∼ P(· | s, a),  a' ∼ π(· | s').
As Dabney et al. (2018b) show, this minimization can be performed by learning quantile locations θ^m(s, a) for the fractions τ_m = (2m − 1) / (2M), m ∈ 1..M, via quantile regression. The quantile regression loss, defined for a quantile fraction τ, is

L^τ(θ) = E_{z ∼ Z} [ ρ_τ(z − θ) ],   ρ_τ(u) = u · (τ − I(u < 0)).
To improve the gradients for small residuals u, the authors propose to use the Huber quantile loss (asymmetric Huber loss)

ρ^H_τ(u) = | τ − I(u < 0) | · L_κ(u) / κ,

where L_κ is the Huber loss with parameter κ:

L_κ(u) = ½ u²  if |u| ≤ κ,   κ (|u| − ½ κ)  otherwise.
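A numpy sketch of the Huber quantile loss (we assume the standard QR-DQN form with Huber parameter κ; function names are ours):

```python
import numpy as np

def huber(u: np.ndarray, kappa: float = 1.0) -> np.ndarray:
    """Huber loss: quadratic near zero, linear in the tails."""
    return np.where(np.abs(u) <= kappa,
                    0.5 * u ** 2,
                    kappa * (np.abs(u) - 0.5 * kappa))

def quantile_huber(u: np.ndarray, tau: float, kappa: float = 1.0) -> np.ndarray:
    """Asymmetric Huber loss for quantile fraction tau, applied to
    residuals u = target - prediction."""
    return np.abs(tau - (u < 0)) * huber(u, kappa) / kappa

# Under-predicting a high quantile (tau = 0.9) is penalized more heavily
# than over-predicting it by the same amount.
print(quantile_huber(np.array([2.0]), tau=0.9))   # [1.35]
print(quantile_huber(np.array([-2.0]), tau=0.9))  # [0.15]
```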
We start with an informal explanation of TQC and motivate our design choices. Next, we outline the formal procedure at the core of TQC, specify the loss functions and present an algorithm for practical implementation.
To achieve granularity in controlling the overestimation, we "decompose" the expected return into the atoms of a distributional representation. By varying the number of atoms, we can control the precision of the return distribution approximation.
To control the overestimation, we propose to truncate the approximation of the return distribution: we drop atoms with the largest locations and estimate the Q-value by averaging the locations of the remaining atoms. By varying the total number of atoms and the number of dropped ones, we can flexibly balance between under- and overestimation. The truncation naturally accounts for the inflated overestimation due to the high return variance: the higher the variance, the lower the Q-value estimate after truncation.
To improve the Q-value estimation, we ensemble multiple distributional approximators in the following way. First, we form a mixture of the distributions predicted by N approximators. Second, we truncate this mixture by removing the atoms with the largest locations and estimate the Q-value by averaging the locations of the remaining atoms. The order of operations—the truncation of the mixture vs. the mixture of truncated distributions—may matter. The truncation of a mixture removes the largest outliers from the pool of all predictions. Such a truncation may be useful in the hypothetical case where one of the critics goes crazy and overestimates much more than the others: the truncation of the mixture then removes the atoms predicted by this inadequate critic. In contrast, the mixture of truncated distributions truncates all critics evenly.
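The difference between the two orders can be sketched directly (illustrative numbers; in both variants the same total number of atoms is dropped):

```python
import numpy as np

def truncate_mixture(critic_atoms: list, k_total: int) -> np.ndarray:
    """TQC order: pool all critics' atoms, then drop the k_total largest."""
    pooled = np.sort(np.concatenate(critic_atoms))
    return pooled[: pooled.size - k_total]

def mixture_of_truncated(critic_atoms: list, k_per_critic: int) -> np.ndarray:
    """Alternative order: drop the k_per_critic largest atoms of EACH
    critic, then pool."""
    kept = [np.sort(a)[: a.size - k_per_critic] for a in critic_atoms]
    return np.concatenate(kept)

sane = np.array([0.0, 1.0, 2.0])
crazy = np.array([9.0, 10.0, 11.0])   # one critic overestimates wildly

# Pooled truncation removes two atoms of the crazy critic...
print(truncate_mixture([sane, crazy], k_total=2))            # keeps 0, 1, 2, 9
# ...while even truncation keeps an inflated atom for every sane atom dropped.
print(np.sort(mixture_of_truncated([sane, crazy], k_per_critic=1)))  # keeps 0, 1, 9, 10
```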
Our method is different from previous approaches (Zhang and Yao, 2019; Dabney et al., 2018a) that distorted the critic's distribution at the policy optimization stage only. We use nontruncated critics' predictions for policy optimization and truncate the target return distribution at the value learning stage. Intuitively, this prevents errors from propagating to other states via TD learning updates and eases policy optimization.
Next, we present TQC formally and summarize the procedure in Algorithm 1.
We propose to train N approximations Z_ψ1, …, Z_ψN of the policy-conditioned return distribution Z^π(s, a). Each Z_ψn maps each (s, a) to a probability distribution

Z_ψn(s, a) := (1/M) Σ_{m=1}^{M} δ( θ^m_ψn(s, a) ),

supported on atoms θ^1_ψn(s, a), …, θ^M_ψn(s, a).
We train the approximations on the temporal difference target distribution Y(s, a), which we construct as follows. We pool the atoms of the distributions Z_ψ1(s', a'), …, Z_ψN(s', a') into a set

Z(s', a') := { θ^m_ψn(s', a') | n ∈ 1..N, m ∈ 1..M },

and denote the elements of Z(s', a') sorted in ascending order by z_(i)(s', a'), with i ∈ 1..MN. The kN smallest elements of Z(s', a') define the atoms

y_i(s, a) := r(s, a) + γ [ z_(i)(s', a') − α log π_φ(a' | s') ]

of the target distribution

Y(s, a) := (1/kN) Σ_{i=1}^{kN} δ( y_i(s, a) ).
In practice, we always populate Z(s', a') with atoms predicted by the target networks ψ̄1, …, ψ̄N, which are more stable.
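The target-atom construction can be sketched as follows (a simplified, non-batched sketch; the atom predictions and `next_action_log_prob` stand in for the target networks and the policy, and all names are ours):

```python
import numpy as np

def tqc_target_atoms(reward: float, gamma: float, alpha: float,
                     next_atoms_per_critic: list,
                     next_action_log_prob: float, k: int) -> np.ndarray:
    """Pool the target critics' atoms for (s', a'), keep the kN smallest,
    and apply the entropy-regularized distributional Bellman backup."""
    n_critics = len(next_atoms_per_critic)
    pooled = np.sort(np.concatenate(next_atoms_per_critic))  # MN atoms, ascending
    kept = pooled[: k * n_critics]                           # truncate the right tail
    return reward + gamma * (kept - alpha * next_action_log_prob)

atoms = [np.array([0.0, 1.0, 5.0]), np.array([0.5, 1.5, 7.0])]  # N=2, M=3
y = tqc_target_atoms(reward=1.0, gamma=0.99, alpha=0.0,
                     next_atoms_per_critic=atoms,
                     next_action_log_prob=-1.0, k=2)
print(y)  # the 5.0 and 7.0 atoms were dropped before the backup
```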
We minimize the 1-Wasserstein distance between each of Z_ψn(s, a), n ∈ 1..N, and the temporal difference target distribution Y(s, a). Equivalently (Dabney et al., 2018b), to minimize this distance we can approximate the quantiles of the target distribution, i.e., learn the locations θ^m_ψn(s, a) for the quantile fractions τ_m = (2m − 1) / (2M), m ∈ 1..M.
We approximate the quantiles of Y(s, a) with Z_ψn(s, a) by minimizing the loss

J_Z(ψ_n) = E_{D, π} [ (1 / (kNM)) Σ_{m=1}^{M} Σ_{i=1}^{kN} ρ^H_{τ_m}( y_i(s, a) − θ^m_ψn(s, a) ) ]

over the parameters ψ_n, where ρ^H_τ is the Huber quantile loss from Section 2.3.
In this way, each learnable location becomes dependent on all atoms of the truncated mixture of target distributions.
The policy parameters φ can be optimized to maximize the entropy-penalized estimate of the Q-value by minimizing the loss

J_π(φ) = E_{s ∼ D, a ∼ π_φ} [ α log π_φ(a | s) − (1 / (NM)) Σ_{m=1}^{M} Σ_{n=1}^{N} θ^m_ψn(s, a) ].

We use the nontruncated estimate of the Q-value for policy optimization to avoid double truncation: the Z-functions already approximate a truncated future distribution.
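The per-sample policy objective can be sketched as follows (illustrative, non-batched; note that the mean runs over ALL NM atoms, with no truncation on the policy side):

```python
import numpy as np

def policy_loss_sample(atoms_per_critic: list,
                       action_log_prob: float, alpha: float) -> float:
    """Per-sample policy loss: alpha * log pi(a|s) minus the mean of
    all atoms from all critics (nontruncated Q-value estimate)."""
    q_estimate = np.concatenate(atoms_per_critic).mean()
    return alpha * action_log_prob - q_estimate

atoms = [np.array([1.0, 2.0, 3.0]), np.array([2.0, 3.0, 4.0])]
print(policy_loss_sample(atoms, action_log_prob=-2.0, alpha=0.25))  # -3.0
```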
First, we compare our method with other possible ways to mitigate the overestimation bias on a simple MDP, for which we can compute the true Q-function and the optimal policy.
Next, we quantitatively compare our method with competitors on a standard continuous control benchmark – the set of MuJoCo (Todorov et al., 2012) environments implemented in OpenAI Gym (Brockman et al., 2016). The details of the experimental setup are in Appendix A.
We implement TQC on top of SAC (Haarnoja et al., 2018b) with auto-tuning of the entropy temperature (Section 2.2). For all MuJoCo experiments, unless stated otherwise, we use N = 5 critic networks with three hidden layers of 512 neurons each, M = 25 atoms, and the best common number of dropped atoms per network, d = 2. The other hyperparameters are the same as in SAC (see Appendix B).
In this experiment we evaluate bias correction techniques (Table 1) in a single-state, continuous-action, infinite-horizon MDP (Figure 5). We train Q-networks (or Z-networks, depending on the method) with two hidden layers from scratch on a replay buffer for a fixed number of iterations, which is enough for all methods to converge. We populate the buffer by sampling a reward once for each action from a uniform action grid. At each step of temporal difference learning, we use a policy which is greedy with respect to the objective in Table 1.
We define Δ(a) as the signed discrepancy between the approximate and the true Q-value; for tqc, the approximate Q-value is the mean of the predicted atoms. We vary the parameters controlling the overestimation for each method and report the robust average (a fixed fraction of each tail is truncated) over seeds of E[Δ] and Var[Δ].
For avg and min we vary the number of networks; for tqc, the number of dropped quantiles per network. We present the results in Figure 6 with bubbles whose diameter is inversely proportional to the absolute distance, averaged over the seeds, between the optimal action and the argmax of the policy objective.
Table 1. Method | Critic target | Policy objective
The results (Figure 6) suggest TQC can achieve the lowest variance and the smallest bias of Q-function approximation among all the competitors. The variance and the bias correlate well with the policy performance, suggesting TQC may be useful in practice.
We compare our method with the original implementations of state-of-the-art algorithms: SAC (https://github.com/rail-berkeley/softlearning), TrulyPPO (https://github.com/wangyuhuix/TrulyPPO), and TD3 (https://github.com/sfujim/TD3). For HalfCheetah, Walker, and Ant we evaluate the methods on an extended frame range, until all methods plateau. For Hopper, we also extended the range.
For our method we selected the number of dropped atoms d for each environment independently, based on a separate evaluation. The best value for Hopper is d = 5, for HalfCheetah d = 0, and for the rest d = 2.
Figure 7 shows the learning curves. In Table 2 we report the average and std of 10 seeds. Each seed performance is an average of 100 last evaluations. We evaluate the performance every 1000 frames as an average of 10 deterministic rollouts. As our results suggest, TQC performs consistently better than any of the competitors. TQC also improves upon the maximal published score on four out of five environments (Table 3).
The improvements are statistically significant according to the two-sided Welch's t-test with Bonferroni correction for multiple comparison testing.
We ablate TQC on the Humanoid 3D environment, which has the highest resolution power due to its difficulty, and Walker2d—a 2D environment with the largest sizes of action and observation spaces. In this section and in the Appendix E we average metrics over four seeds.
The path from SAC (Haarnoja et al., 2018b) to TQC comprises five modifications: Q-network size increase (Big), quantile Q-network introduction (Quantile), target distribution truncation (Truncate), atom pooling (Pool), and ensembling. To reveal the effects behind these modifications, we build four methods – the intermediate steps on the incremental path from SAC to TQC. Each subsequent method adds the next modification from the list to the previous method or changes the order of applying modifications. For all modifications except the final one (ensembling), we use N = 2 networks. In all truncation operations we drop dN atoms in total, where d = 2.
B-SAC is SAC with an increased size of Q-networks (Big SAC): 3 layers with 512 neurons versus 2 layers of 256 neurons in SAC. Policy network size does not change.
QB-SAC is B-SAC with Quantile distributional networks (Dabney et al., 2018b). This modification changes the form of the Q-networks and the loss function, which becomes the quantile Huber loss (equation 8). We adapt the clipped double estimator (Fujimoto et al., 2018) to quantile networks: we recover Q-values from the predicted distributions and use the atoms of the network attaining the minimum of the Q-values to compute the target distribution.
TQB-SAC is QB-SAC with individual truncation instead of the minimum: each Z_ψn is trained to approximate a truncated temporal difference distribution based on the predictions of its single target network only.
PTQB-SAC is TQB-SAC with pooling: each Z_ψn approximates the mixture of the (already truncated) per-network target distributions.
TQC = TPQB-SAC is PTQB-SAC with the pooling and truncation operations swapped. This modification drops the same number of atoms as the two previous methods, but differs in which atoms are dropped: TQC drops the largest atoms from the union of the critics' predictions, while each of PTQB-SAC and TQB-SAC (no pooling) drops the largest atoms from each critic separately.
Ensembling To illustrate the advantage brought by ensembling, we include the results for TQC with two and five Z-networks.
The ablation results (Figure 8) suggest the following. The increased network size does not necessarily improve SAC (though some improvement is visible on Walker2d). The quantile representation improves the performance in both environments, with the most notable improvement on Humanoid. Among the following three modifications—individual truncation and the two orders of applying pooling and truncation—TQC is the winner on Humanoid, and the "truncate-pooling" modification seems to be better on Walker2d. Overall, truncation stabilizes results on Walker2d and seems to reduce the seed variance on Humanoid. Finally, ensembling consistently improves results in both environments.
Number of truncated quantiles In this experiment we vary the number of atoms d (per network) to drop; the total number of atoms dropped is dN. We fix the number of atoms for each Z-network to M = 25. The results (Figure 9) show that (1) truncation is essential and (2) there is an optimal number of dropped atoms (i.e., d = 2 or d = 3).
Number of total quantiles In this experiment we vary the total number of atoms M and adjust the number of dropped quantiles d to keep the truncation ratio approximately constant. The results (Appendix E) suggest this parameter does not have much influence, except for very small M; for larger M the learning curves are indistinguishable.
Number of Z-networks In this experiment we vary the number of Z-networks N. The results (Appendix E) suggest that (1) a single network is consistently inferior to larger ensembles and (2) the performance improvement saturates as N grows.
Env | SAC | B-SAC | TQC N=2 | TQC N=5
Time measurements (in seconds) of a single training epoch (1000 frames), averaged over 1000 epochs, executed on the Tesla P40 GPU.
Ensembling and distributional networks incur additional computational overhead, which we quantify for different methods in Table 4.
Since the introduction of the distributional paradigm (see White (1988) and references therein) and its reincarnation for deep reinforcement learning (Bellemare et al., 2017), a great body of research has emerged. Dabney et al. proposed methods to learn quantile values (or locations) for a uniform grid of fixed (Dabney et al., 2018b) or sampled (Dabney et al., 2018a) quantile fractions. Yang et al. (2019) proposed a method to learn both quantile fractions and quantile values (i.e., locations and probabilities of the elements in the mixture approximating the unknown distribution). Choi et al. (2019) used a mixture of Gaussians for approximating the distribution of returns. Most of these works, as well as their influential follow-ups, such as Hessel et al. (2018), are devoted to the discrete control setting.
The adoption of the distributional paradigm in continuous control, to the best of our knowledge, starts from D4PG (Barth-Maron et al., 2018)—a distributed distributional off-policy algorithm building on the C51 (Bellemare et al., 2017). Recently, the distributional paradigm was adopted in distributed continuous control for robotic grasping (Bodnar et al., 2019). The authors proposed Q2-Opt as a collective name for two variants: based on QR-DQN (Dabney et al., 2018b), and on IQN (Dabney et al., 2018a). In contrast to D4PG and Q2-Opt, we focus on the usual, non-distributed setting and modify the target on which the critic is trained.
A number of works develop exploration methods based on the quantile form of value-function. DLTV (Mavrin et al., 2019) uses variability of the quantile distribution in the exploration bonus. QUOTA (Zhang and Yao, 2019)—the option-based modification of QR-DQN—partitions a range of quantiles into contiguous windows and trains a separate intra-option policy to maximize an average of quantiles in each of the windows. Our work proposes an alternative method for critic training, which is unrelated to the exploration problem.
Most importantly, our work differs from the research outlined above in our aim to control the overestimation bias by leveraging quantile representation of a critic network.
The overestimation bias is a long-standing topic in several research areas. It is known as the max-operator bias in statistics (D’Eramo et al., 2017) and as the ”winner’s curse” in economics (Smith and Winkler, 2006; Thaler, 2012).
The statistical community
studies estimators of the maximum expected value of a set of independent random variables. The simplest estimator—the maximum over sample means, the Maximum Estimator (ME)—is biased positively, while for many distributions, such as the Gaussian, an unbiased estimator does not exist (Ishwaei D et al., 1985). The Double Estimator (DE) (Stone, 1974; Van Hasselt, 2013) uses cross-validation to decorrelate the estimation of the argmax and of the value at that argmax. He and Guo (2019) proposed a coupled estimator as an extension of DE to the case of partially overlapping cross-validation folds. Many works have aimed at alleviating the negative bias of DE, which in absolute value can be even larger than that of ME. D'Eramo et al. (2016) assumed a Gaussian distribution for the sample mean and proposed the Weighted Estimator (WE), with a bias in between those of ME and DE. Imagaw and Kaneko (2017) improved WE by using UCB for the weights computation. D'Eramo et al. (2017) assumed a certain spatial correlation and extended WE to continuous sets of random variables. The problem of overestimation has also been discussed in the context of optimal stopping and sequential testing (Kaufmann et al., 2018).
The reinforcement learning community became interested in the bias since the work of Thrun and Schwartz (1993), who attributed a systematic overestimation to the generalization error. The authors proposed multiple ways to alleviate the problem, including (1) bias compensation with additive pseudo costs and (2) underestimation (e.g., in the uncertain areas). The underestimation concept became much more prominent, while the adoption of ”additive compensation” is quite limited to date (Patnaik and Anwar, 2008; Lee and Powell, 2012).
Van Hasselt (2010)
proposed Double Estimation in Q-learning, which was subsequently adapted to neural networks as Double DQN (Van Hasselt et al., 2015). Subsequently, Zhang et al. (2017) and Lv et al. (2019) introduced the Weighted Estimator to the reinforcement learning community.
Another approach against overestimation, and for overall Q-function quality improvement, is based on the idea of averaging or ensembling. Embodiments of this approach are based on dropout (Anschel et al., 2017), on employing previous Q-function approximations (Ghavamzadeh et al., 2011; Anschel et al., 2017), on a linear combination of the minimum and the maximum over a pool of Q-networks (Li and Hou, 2019; Kumar et al., 2019), or on a random mixture of predictions from the pool (Agarwal et al., 2019). Buckman et al. (2018) reported a reduction in overestimation originating from ensembling in model-based learning.
In continuous control, Fujimoto et al. (2018) proposed the TD3 algorithm, taking the minimum over two approximators of the Q-function to reduce the overestimation bias. Later, for discrete control, Lan et al. (2020) developed MaxMin Q-learning, taking the minimum over more than two Q-functions. We build upon the minimization idea of Fujimoto et al. (2018) and, following Lan et al. (2020), use multiple approximators.
Our work differs in that we do not propose to control the bias by choosing between multiple approximators or weighting them. For the first time, we propose to successfully control the overestimation even for a single approximator and use ensembling only to improve the performance further.
In this work, we propose to control the overestimation bias on the basis of aleatoric uncertainty. The method we propose comprises three essential ideas: distributional representations, truncation of a distribution, and ensembling.
Simulations reveal favorable properties of our method: low expected variance of the approximation error as well as the fine control over the under- and overestimation. The exceptional results on the standard continuous control benchmark suggest that distributional representations may be useful for controlling the overestimation bias.
Since little is known about the connection between aleatoric uncertainty and overestimation, we see the investigation of it as an exciting avenue for future work.
We would like to thank Artem Sobolev, Arsenii Ashukha, Oleg Ivanov and Dmitry Nikulin for their comments and suggestions regarding the early versions of the manuscript. We also thank the anonymous reviewers for their feedback.
This research was supported in part by the Russian Science Foundation grant no. 19-71-30020.
We would like to caution about the use of MuJoCo 2.0 with versions of Gym at least up to v0.15.4 (the last released at the moment). For these versions, Gym incorrectly nullifies the state components corresponding to contact forces, which, in turn, makes results incomparable to previous works.
In our work we use MuJoCo 1.5 and the v3 versions of the environments. The versions of all other packages we used are listed in the Conda environment file distributed with the source code.
Critic networks are fully connected, with the last layer output size equal to the number of atoms M.
Hyperparameter | TQC | SAC
Replay buffer size | 10^6 | 10^6
Number of critics | 5 | 2
Number of hidden layers in critic networks | 3 | 2
Size of hidden layers in critic networks | 512 | 256
Number of hidden layers in policy network | 2 | 2
Size of hidden layers in policy network | 256 | 256
Target smoothing coefficient | 0.005 | 0.005
Target update interval | 1 | 1
Gradient steps per iteration | 1 | 1
Environment steps per iteration | 1 | 1
Number of atoms | 25 | —
Huber loss parameter | 1 | —
Environment | Number of dropped atoms d | Number of environment steps
The task is a simplistic infinite-horizon MDP with only one state and a 1-dimensional continuous action space. Since there is only one state, the state transition function and the initial state distribution are delta functions. On each step, the agent gets a stochastic reward whose mean, as a function of the action, is a cosine with slowly increasing amplitude (Figure 10); the discount factor is fixed. These parameters give rise to three local maxima: one near the left end of the action range, one in the right half (the global maximum), and one at the right end. The optimal policy in this environment always selects the globally optimal action.
In the toy experiment we evaluate bias correction techniques (Table 6) in this MDP. We train Q-networks (or Z-networks, depending on the method) with two hidden layers from scratch on a replay buffer for a fixed number of iterations. We populate the buffer by sampling a reward once for each action from a uniform action grid. At each step of temporal difference learning, we use a policy which is greedy with respect to the objective in Table 6.
We define Δ(a) as the signed discrepancy between the approximate and the true Q-value; for TQC, the approximate Q-value is the mean of the predicted atoms. We vary the parameters controlling the overestimation for each method and report the robust average (a fixed fraction of each tail is truncated) over seeds of E[Δ] and Var[Δ]. The expectation and variance are estimated over a dense uniform grid of actions and then averaged over seeds.
For avg and min we vary the number of networks; for tqc, the number of dropped quantiles per network. We present the results in Figure 6 with bubbles whose diameter is inversely proportional to the absolute distance, averaged over the seeds, between the optimal action and the argmax of the policy objective.
To prevent the subtleties of policy optimization from interfering with conclusions about the quality of Q-function approximation, we use the implicit deterministic policy induced by the value networks: the argmax of the approximation. To find the maximum, we evaluate the approximation over a dense uniform grid of actions.
Each dataset consists of a uniform grid of actions and the corresponding sampled rewards. For each method we average results over several datasets and evaluate different dataset sizes. In this way, the current policy is defined implicitly as the greedy one with respect to the value function. This policy does not interact with the environment; instead, the actions are predefined to form a uniform grid.
We are sorry for the void section. We keep this section to make references in the main text valid and will remove it in the camera ready version.
TQC drops the atoms with the largest locations after pooling the atoms from multiple Z-networks. Experimentally, this procedure drops more atoms for some Z-networks than for others. To quantify this disbalance, we compute the ratio of dropped atoms to the total number of atoms for each Z-network. These proportions, once sorted and averaged over the replay, are approximately constant throughout learning for both ensemble sizes (Figure 13).
Interestingly, without sorting, the averaging over the replay gives almost perfectly equal proportions (Figure 14). These results suggest that each critic overestimates more than the other critics in some regions of the state-action space. In other words, in practice a single critic does not systematically overestimate (w.r.t. the other critics' predictions) over the whole state-action space.
To ensure that it is not possible to match the performance of TQC by carefully tuning previous methods, we varied the number of critic networks used in the Clipped Double Q-learning estimate (Fujimoto et al., 2018) for SAC (Haarnoja et al., 2018b). The larger the number of networks under the min, the more severe the underestimation (Lan et al., 2020).