Dropout Q-Functions for Doubly Efficient Reinforcement Learning
Source files to replicate experiments in my Arxiv paper.
Randomized ensembled double Q-learning (REDQ) has recently achieved state-of-the-art sample efficiency on continuous-action reinforcement learning benchmarks. This superior sample efficiency is made possible by using a large Q-function ensemble. However, REDQ is much less computationally efficient than non-ensemble counterparts such as Soft Actor-Critic (SAC). To make REDQ more computationally efficient, we propose a method called Dr.Q, a variant of REDQ that uses a small ensemble of dropout Q-functions. Our dropout Q-functions are simple Q-functions equipped with dropout connection and layer normalization. Despite its simplicity of implementation, our experimental results indicate that Dr.Q is doubly (sample and computationally) efficient: it achieved sample efficiency comparable to REDQ, much better computational efficiency than REDQ, and computational efficiency comparable to that of SAC.
In the reinforcement learning (RL) community, improving the sample efficiency of RL methods has long been an important goal. Traditional RL methods have been shown to be promising for solving complex control tasks such as dexterous in-hand manipulation (DBLP:journals/corr/abs180800177). However, RL methods generally require millions of training samples to solve a task (mendonca2019guided). This poor sample efficiency is a severe obstacle to practical RL applications (e.g., applications on limited computational resources or in real-world environments without simulators). Motivated by these issues, many RL methods have been proposed to achieve higher sample efficiency. For example, haarnoja2018softa; haarnoja2018soft proposed Soft Actor-Critic (SAC), which achieved higher sample efficiency than the previous state-of-the-art RL methods (lillicrap2015continuous; fujimoto2018addressing; schulman2017proximal).
Since 2019, RL methods that use a high update-to-data (UTD) ratio to achieve high sample efficiency have emerged. The UTD ratio is defined as the number of Q-function updates divided by the number of actual interactions with the environment. A high UTD ratio promotes sufficient training of the Q-functions within a few interactions, which leads to sample-efficient learning. Model-Based Policy Optimization (MBPO) (janner2019whento) is a seminal RL method that uses a high UTD ratio of 20–40 and achieves significantly higher sample efficiency than SAC, which uses a UTD ratio of 1. Encouraged by the success of MBPO, many RL methods with high UTD ratios have been proposed (shen2020ampo; lai2020bidirectional).
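The role of the UTD ratio can be sketched in a few lines. This is a minimal illustration (not the authors' code) of how a high UTD ratio G changes the training loop: G Q-function updates per single environment interaction. `env_step`, `update_q`, and `update_policy` are hypothetical callables standing in for the real components.

```python
# Sketch of a high update-to-data (UTD) ratio training loop.
# `env_step`, `update_q`, and `update_policy` are placeholder callables.

def train_one_interaction(replay_buffer, utd_ratio_g,
                          env_step, update_q, update_policy):
    """One environment interaction followed by G Q-function updates."""
    replay_buffer.append(env_step())   # 1 actual interaction with the env
    for _ in range(utd_ratio_g):       # G gradient updates of the Q-functions
        update_q(replay_buffer)
    update_policy(replay_buffer)       # policy is updated once per interaction
```

With G = 1 this reduces to the usual SAC-style loop; MBPO and REDQ correspond to G in the 20–40 range.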
Among such methods, randomized ensembled double Q-learning (REDQ), proposed by chen2021randomized, is currently the most sample-efficient method on the MuJoCo benchmark. REDQ uses a high UTD ratio and a large ensemble of Q-functions. The high UTD ratio promotes estimation bias in policy evaluation, which degrades sample-efficient learning; REDQ uses the ensemble of Q-functions to suppress this estimation bias and improve sample efficiency.
chen2021randomized demonstrated that the sample efficiency of REDQ is equal to or even better than that of MBPO. However, REDQ leaves room for improvement in terms of computational efficiency. REDQ runs 1.1 to 1.4 times faster than MBPO (chen2021randomized) but is still less computationally efficient than non-ensemble RL methods (e.g., SAC) due to its use of large ensembles. In Section 4.2, we show that REDQ runs more than two times slower than SAC and requires much more memory. Computational efficiency is important in several scenarios, e.g., in RL applications with much lighter on-device computation (e.g., mobile phones or other lightweight edge devices) (chen2021improving), or in situations in which rapid trial and error is required for developing RL agents (e.g., hyperparameter tuning or proof of concept for RL applications with limited time resources). Therefore, RL methods that are superior not only in sample efficiency but also in computational efficiency are preferable.
We propose a method for improving computational efficiency called Dr.Q, a REDQ variant that uses a small ensemble of dropout Q-functions, in which dropout (JMLR:v15:srivastava14a) and layer normalization (ba2016layer) are used (Section 3). We experimentally show that Dr.Q is doubly (computationally and sample) efficient: (i) it improves computational efficiency over REDQ by more than two times (Section 4.2) and (ii) achieves sample efficiency comparable to REDQ (Section 4.1).
Although our primary contribution is proposing a doubly efficient RL method, we also make three significant contributions from other perspectives:
1. Simplicity of implementation. Dr.Q can be implemented basically by adding a few lines of readily available functions (dropout and layer normalization) to the Q-functions in REDQ (and SAC). This simplicity enables one to easily replicate and extend it.
2. First successful demonstration of the usefulness of dropout in high UTD ratio settings.
Previous studies incorporated dropout into RL (gal2016dropout; Harrigan2016DeepRL; moerland2017efficient; gal2016improving; NIPS2017_84ddfb34; kahn2017uncertaintyaware) (see Section 5 for details).
However, these studies focused on low UTD ratio settings.
Dropout approaches generally do not work as well as ensemble approaches (NEURIPS2019_8558cb40; lakshminarayanan2017simple; Durasov21).
For this reason, instead of dropout approaches, ensemble approaches have been used in RL with high UTD ratio settings (chen2021randomized; janner2019whento; shen2020ampo; hiraoka2020meta; lai2020bidirectional).
In Section 4, we show that Dr.Q achieves almost the same or better bias-reduction ability and sample and computational efficiency compared with ensemble-based RL methods in high UTD ratio settings.
This sheds light on dropout approaches once again and promotes their use as a reasonable alternative (or complement) to ensemble approaches in high UTD ratio settings.
3. Discovery of engineering insights to effectively apply dropout to RL.
Specifically, we discovered that the following three engineering practices are effective in reducing bias and improving sample efficiency:
(i) using the dropout and ensemble approaches together for constructing Q-functions (i.e., using multiple dropout Q-functions) (Section 4.1 and Appendix A.3);
(ii) introducing layer normalization into dropout Q-functions (Section 4.3);
and (iii) using dropout for both the current and target Q-functions (Appendix C). These engineering insights were not revealed in previous RL studies and would be useful to practitioners who attempt to apply dropout to RL.
RL addresses the problem of an agent learning to act in an environment. At each discrete time step $t$, the environment provides the agent with a state $s_t$, the agent responds by selecting an action $a_t$, and then the environment provides the next reward $r_{t+1}$ and state $s_{t+1}$. For convenience, as needed, we use the simpler notations $r$, $s$, $a$, $s'$, and $a'$ to refer to a reward, state, action, next state, and next action, respectively.
We focus on maximum entropy RL, in which an agent aims to find a policy that maximizes the expected return with an entropy bonus: $\sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t))\right]$. Here, $\pi$ is a policy and $\mathcal{H}$ is entropy. The temperature $\alpha$ balances exploitation and exploration and affects the stochasticity of the policy.
REDQ (chen2021randomized) is a sample-efficient model-free method for solving maximum entropy RL problems (Algorithm 1).
It has two primary components to achieve high sample efficiency.
1. High UTD ratio: It uses a high UTD ratio $G$, which is the number of Q-function updates (lines 4–10) divided by the number of actual interactions with the environment (line 3).
The high UTD ratio promotes sufficient training of Qfunctions within a few interactions.
However, this also promotes overestimation bias in the Q-function training, which degrades sample-efficient learning (chen2021randomized).
2. Ensemble of Q-functions: To reduce the overestimation bias, it uses an ensemble of Q-functions for the target to be minimized (lines 6–7).
Specifically, a random subset of the ensemble is selected (line 6) and then used for calculating the target (line 7).
The size of the subset is kept fixed and is denoted as $M$.
In addition, each Q-function in the ensemble is randomly and independently initialized but updated with the same target (lines 8–9).
chen2021randomized showed that using a large ensemble ($N = 10$) and a small subset ($M = 2$) successfully reduces the bias.
Although using a large ensemble of Q-functions is beneficial for reducing bias and improving sample efficiency, it makes REDQ computationally intensive. In the next section, we discuss reducing the ensemble size.
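The subset-minimum target described above can be sketched as follows. This is an illustration under assumed notation, not the authors' implementation: a random size-M subset of the N target Q-values is sampled, and its minimum enters the soft Bellman target.

```python
import numpy as np

# Sketch of the REDQ in-target minimization for a single transition.
# target_q_values: length-N array of ensemble target Q-values at (s', a').

def redq_target(reward, target_q_values, log_pi_next,
                gamma=0.99, alpha=0.2, subset_size=2, rng=None):
    rng = rng or np.random.default_rng()
    q = np.asarray(target_q_values, dtype=float)
    idx = rng.choice(len(q), size=subset_size, replace=False)  # random subset M
    # Soft Bellman target with the subset minimum and entropy term.
    return reward + gamma * (q[idx].min() - alpha * log_pi_next)
```

When all ensemble members agree, the subset choice does not matter; e.g., `redq_target(1.0, [5.0] * 10, -1.0)` gives 1 + 0.99 * (5.0 + 0.2) = 6.148.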
In this section, we discuss replacing the large ensemble of Q-functions in REDQ with a small ensemble of dropout Q-functions. We start by reviewing what the ensemble in REDQ does from the viewpoint of model uncertainty injection. Then, we propose to use dropout for model uncertainty injection instead of the large ensemble. Specifically, we propose (i) the dropout Q-function, a Q-function equipped with dropout and layer normalization, and (ii) Dr.Q, a variant of REDQ that uses a small ensemble of dropout Q-functions. Finally, we explain why the ensemble size can be smaller in Dr.Q than in REDQ.
We first explain our insight that, in REDQ, model (Q-function parameter) uncertainty is injected into the target. In REDQ, the subset of Q-functions is used to compute the target value (lines 6–7 in Algorithm 1). This can be interpreted as an approximation of the expected target value with respect to model uncertainty:

$$\mathbb{E}_{\bar{\theta} \sim p(\bar{\theta})}\left[y(\bar{\theta})\right] \approx \mathbb{E}_{\bar{\theta} \sim q(\bar{\theta})}\left[y(\bar{\theta})\right] \approx y(\bar{\theta}'), \quad \bar{\theta}' \sim q(\bar{\theta}),$$

where $y(\bar{\theta}) = r + \gamma\left(Q_{\bar{\theta}}(s', a') - \alpha \log \pi(a' \mid s')\right)$ is the soft Bellman target. In the first approximation, the model distribution $p(\bar{\theta})$ is replaced with a proposal distribution $q(\bar{\theta})$, which is based on resampling from the ensemble in line 6 in Algorithm 1 (we assume that the ensemble parameters independently follow an identical distribution). In the second approximation, the expected target value is further approximated by a one-sample average on the basis of $q(\bar{\theta})$. The resulting approximation is used in line 7 in Algorithm 1. In the remainder of this section, we discuss another means to approximate the expected target value.
We use dropout Q-functions for the target value approximation (Figure 1). Here, a dropout Q-function $Q_{\theta}$ is a Q-function that has dropout connections (JMLR:v15:srivastava14a). The left part of Figure 1 shows an implementation obtained by adding dropout layers to the Q-function implementation used by chen2021randomized. Layer normalization (ba2016layer) is applied after dropout to use dropout more effectively, as in NIPS2017_3f5ee243; NEURIPS2019_2f4fe03d. By using $Q_{\theta}$, the target value is approximated as

$$\mathbb{E}_{\bar{\theta} \sim p(\bar{\theta})}\left[y(\bar{\theta})\right] \approx \mathbb{E}_{\bar{\theta} \sim q_{\mathrm{do}}(\bar{\theta})}\left[y(\bar{\theta})\right] \approx y(\bar{\theta}'), \quad \bar{\theta}' \sim q_{\mathrm{do}}(\bar{\theta}),$$

where $y(\bar{\theta}) = r + \gamma\left(Q_{\bar{\theta}}(s', a') - \alpha \log \pi(a' \mid s')\right)$ is the soft Bellman target. Instead of $q(\bar{\theta})$, a proposal distribution $q_{\mathrm{do}}(\bar{\theta})$ based on the dropout is used in the first approximation. The expected target value is further approximated by a one-sample average on the basis of $q_{\mathrm{do}}(\bar{\theta})$ in the second approximation. We use the resulting approximation for injecting model uncertainty into the target value (right part of Figure 1). For calculating the target, we use $M$ dropout Q-functions, which have independently initialized and trained parameters $\theta_1, \ldots, \theta_M$. Using $M$ dropout Q-functions improves the performance of Dr.Q (our RL method described in the next paragraph) compared with using a single dropout Q-function (further details are given in Appendix A.3).
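A forward pass through such a dropout Q-function can be sketched as below. This is an illustration of the structure described above (each hidden layer followed by dropout and then layer normalization); the weight shapes and ReLU activation are assumptions for exposition, not the authors' exact implementation.

```python
import numpy as np

# Sketch of a dropout Q-function forward pass: hidden layer -> ReLU ->
# dropout -> layer normalization, repeated, then a linear scalar output.

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def dropout(x, rate, rng):
    # Inverted dropout: drop each unit with probability `rate`,
    # rescale the survivors so the expected activation is unchanged.
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def dropout_q_forward(params, state_action, rate, rng):
    """params: [(W1, b1), (W2, b2), (W_out, b_out)]; returns a scalar Q-value."""
    h = np.asarray(state_action, dtype=float)
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)           # hidden layer + ReLU
        h = layer_norm(dropout(h, rate, rng))    # dropout, then layer norm
    w_out, b_out = params[-1]
    return float(h @ w_out + b_out)
```

At dropout rate 0 the mask keeps every unit and the pass is deterministic; at a positive rate, repeated evaluations of the same input give different Q-values, which provides the stochasticity used for model uncertainty injection.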
We now explain Dr.Q (the period between “Dr” and “Q” is necessary to avoid a name conflict with DrQ (yarats2021image)), in which the dropout Q-functions are used for considering model uncertainty. The algorithmic description of Dr.Q is shown in Algorithm 2. Dr.Q is a variant of REDQ, and the parts modified from REDQ are highlighted in red in the algorithm. In line 6, the dropout Q-functions are used to inject model uncertainty into the target, as discussed in the previous paragraph. In lines 8 and 10, $M$ dropout Q-functions are used instead of $N$ Q-functions to make Dr.Q more computationally efficient.
The ensemble size of the dropout Q-functions for Dr.Q (i.e., $M$) should be smaller than that of the Q-functions for REDQ (i.e., $N$). $M$ for Dr.Q is equal to the subset size $M$ for REDQ, which is not greater than $N$. In practice, $M$ is much smaller than $N$ (e.g., $M = 2$ and $N = 10$ in chen2021randomized). This reduction in the number of Q-functions makes Dr.Q more computationally efficient. Specifically, Dr.Q is faster due to the reduction in the number of Q-function updates (line 8 in Algorithm 2). Dr.Q also requires less memory for holding Q-function parameters. In Section 4.2, we show that Dr.Q is faster and less memory intensive than REDQ.
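The resulting target computation can be sketched as below (paraphrasing line 6 of Algorithm 2, with assumed names): the minimum over M small, independently parameterized dropout Q-functions replaces REDQ's minimum over a random subset of a large ensemble.

```python
# Sketch of the Dr.Q target for one transition. Each entry of
# dropout_q_fns is a callable dropout Q-function that applies its own
# fresh dropout mask when evaluated.

def drq_target(reward, dropout_q_fns, next_state_action, log_pi_next,
               gamma=0.99, alpha=0.2):
    q_values = [q(next_state_action) for q in dropout_q_fns]
    # Soft Bellman target with the minimum over the M dropout Q-functions.
    return reward + gamma * (min(q_values) - alpha * log_pi_next)
```

Because only M = 2 networks (rather than N = 10) are touched per target computation and per update, this is where the speed and memory savings come from.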
We conducted experiments to evaluate and analyse Dr.Q (source code to replicate the experiments is available at https://github.com/TakuyaHiraoka/Dropout-Q-Functions-for-Doubly-Efficient-Reinforcement-Learning). In Section 4.1, we discuss the evaluation of Dr.Q's performance (sample efficiency and bias-reduction ability). In Section 4.2, we discuss the evaluation of the computational efficiency of Dr.Q. In Section 4.3, we explain the ablation study for Dr.Q.
To evaluate the performance of Dr.Q, we compared it with three baseline methods in MuJoCo benchmark environments (todorov2012mujoco; brockman2016openai).
Following chen2021randomized; janner2019whento, we prepared the following environments: Hopper, Walker2d, Ant, and Humanoid.
In these environments, we compared the following four methods:
REDQ: Baseline method that follows the REDQ algorithm (chen2021randomized) (Algorithm 1).
SAC: Baseline method that follows the SAC algorithm (haarnoja2018softa; haarnoja2018soft). To improve the sample efficiency of this method, a delayed policy update and a high UTD ratio were used, as suggested by chen2021randomized.
Dr.Q: Proposed method that follows Algorithm 2.
DUVN: Baseline method. This method is a variant of Dr.Q that uses double uncertainty value networks (Harrigan2016DeepRL; moerland2017efficient) for policy evaluation.
Specifically, a single (i.e., $M = 1$) target dropout Q-function is used for the target calculation in line 6 of Algorithm 2.
As in Harrigan2016DeepRL; moerland2017efficient, layer normalization is not applied in the dropout Q-function.
Following chen2021randomized, we set the hyperparameters as $G = 20$, $M = 2$, and $N = 10$ for all methods except DUVN ($M = 1$). More detailed hyperparameter settings are given in Appendix D.
The methods were compared on the basis of average return and estimation bias.
Average return:
An average return over ten test episodes. We regarded 1000 environment steps in Hopper and 3000 environment steps in the other environments as one epoch. After every epoch, we ran ten test episodes with the current policy and recorded the average return.
Average and standard deviation of the normalized estimation error (bias) of Q-functions (chen2021randomized):
The error represents how significantly the Q-value estimation differs from the actual return. Formally, the error is the normalized difference between $\hat{Q}$ and $Q^{\pi}$, where $Q^{\pi}$ is the discounted Monte Carlo return obtained from the current policy in the test episodes and $\hat{Q}$ is its estimation. The estimate $\hat{Q}$ was computed from the ensemble of Q-functions for SAC and REDQ and from the dropout Q-functions for Dr.Q and DUVN.

The comparison results (Figure 2) indicate that Dr.Q achieved almost the same level of performance as REDQ. Regarding the average return, REDQ and Dr.Q achieved almost the same sample efficiency overall. In Walker2d and Ant, their learning curves largely overlapped, and there was no significant difference between them. In Humanoid, REDQ was slightly better than Dr.Q. In Hopper, Dr.Q was better than REDQ. In all environments except Hopper, REDQ and Dr.Q improved their average return significantly earlier than SAC and DUVN. Regarding bias, REDQ and Dr.Q consistently kept the value-estimation bias closer to zero than SAC and DUVN in all environments.
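One plausible formalization of this metric (an assumption, following the description above) normalizes the raw error by the magnitude of the average Monte Carlo return over the evaluation points:

```python
import numpy as np

# Sketch of the normalized estimation error: raw error Q_hat - Q_mc divided
# by the magnitude of the mean Monte Carlo return (an assumed normalization).

def normalized_bias(q_estimates, mc_returns, eps=1e-8):
    q_hat = np.asarray(q_estimates, dtype=float)
    q_mc = np.asarray(mc_returns, dtype=float)
    errors = (q_hat - q_mc) / (np.abs(q_mc.mean()) + eps)
    return errors.mean(), errors.std()   # average bias and its spread
```

A mean near zero with a small standard deviation corresponds to the "bias kept close to zero" behavior reported for REDQ and Dr.Q.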
We next evaluated the computational efficiency of Dr.Q. We compared Dr.Q with the baseline methods on the basis of the following criteria: (i) process time required for executing the methods; (ii) number of parameters of each method; and (iii) bottleneck memory consumption reported by the PyTorch profiler (https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html). Bottleneck memory consumption is the maximum memory consumption recorded when running the methods. For the evaluation, we ran each method on a machine equipped with two Intel(R) Xeon(R) E5-2667 v4 CPUs and one NVIDIA Tesla K80.

The process times of the methods (Table 1) indicate that Dr.Q runs more than two times faster than REDQ. Dr.Q (and SAC) requires process times in the 800–900 msec range, whereas REDQ requires process times in the 2300–2400 msec range. The numbers in parentheses show that learning the Q-functions dominates the overall process, which suggests that using more compact (e.g., fewer) Q-functions is important for improving process times.
| Method | Hopper-v2 | Walker2d-v2 | Ant-v2 | Humanoid-v2 |
|---|---|---|---|---|
| SAC | 910 (870) | 870 (835) | 888 (848) | 893 (854) |
| REDQ | 2328 (2269) | 2339 (2283) | 2336 (2277) | 2400 (2340) |
| DUVN | 664 (636) | 762 (731) | 733 (700) | 692 (660) |
| Dr.Q | 948 (905) | 933 (892) | 954 (913) | 989 (946) |

Table 1: Process time in msec (time spent learning Q-functions in parentheses).
The number of parameters and bottleneck memory consumption of each method indicate that Dr.Q is more memory efficient than REDQ. Regarding the numbers of parameters (Table 2), those of Dr.Q (and SAC and DUVN) are about one-fifth of those of REDQ. Note that the number of parameters of Dr.Q is equivalent to that of SAC since Dr.Q and SAC use the same number (two) of Q-functions. Regarding the bottleneck memory consumption (Table 3), that for Dr.Q (and SAC and DUVN) is about one-third of that for REDQ. We can also see that the bottleneck memory consumption is almost independent of the environment. This is because the most memory-intensive process is the evaluation of the hidden layers in the Q-functions (e.g., applying the ReLU layer), which does not depend on the environment-specific numbers of state and action dimensions (in our experiments, the number of hidden units is invariant over all environments; see Appendix D).
| Method | Hopper-v2 | Walker2d-v2 | Ant-v2 | Humanoid-v2 |
|---|---|---|---|---|
| SAC | 141,826 | 146,434 | 152,578 | 166,402 |
| REDQ | 698,890 | 721,930 | 752,650 | 821,770 |
| DUVN | 139,778 | 144,386 | 150,530 | 164,354 |
| Dr.Q | 141,826 | 146,434 | 152,578 | 166,402 |

Table 2: Number of parameters.
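The counts in Table 2 can be reproduced arithmetically. This is a sketch assuming each Q-function is an MLP with two 256-unit hidden layers (Appendix D), the standard Hopper-v2 input sizes (11-dim observation, 3-dim action), and a gain and a bias per normalized unit for layer normalization.

```python
# Arithmetic reproduction of the Hopper-v2 column of Table 2 under the
# assumptions stated above (two 256-unit hidden layers, 11+3 input dims).

def q_net_params(obs_dim, act_dim, hidden=256, layer_norm=False):
    n = (obs_dim + act_dim) * hidden + hidden   # input -> hidden 1
    n += hidden * hidden + hidden               # hidden 1 -> hidden 2
    n += hidden * 1 + 1                         # hidden 2 -> scalar Q-value
    if layer_norm:
        n += 2 * (2 * hidden)                   # gain + bias, two LN layers
    return n

hopper = (11, 3)
print(2 * q_net_params(*hopper))                   # 139778 (DUVN row)
print(2 * q_net_params(*hopper, layer_norm=True))  # 141826 (Dr.Q row)
print(10 * q_net_params(*hopper))                  # 698890 (REDQ row)
```

The same arithmetic with Walker2d-v2's 17/6 dimensions reproduces its column; the small Dr.Q/DUVN gap is exactly the layer-normalization parameters. The Ant and Humanoid columns imply smaller input sizes than the raw Gym observations, suggesting truncated observations in these implementations.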
| Method | Hopper-v2 | Walker2d-v2 | Ant-v2 | Humanoid-v2 |
|---|---|---|---|---|
| SAC | 73 / 64 / 62 | 73 / 64 / 62 | 73 / 64 / 62 | 73 / 65 / 62 |
| REDQ | 241 / 211 / 200 | 241 / 211 / 200 | 241 / 211 / 200 | 241 / 212 / 201 |
| DUVN | 72 / 71 / 51 | 72 / 71 / 51 | 72 / 71 / 51 | 72 / 71 / 52 |
| Dr.Q | 73 / 72 / 69 | 73 / 72 / 69 | 73 / 72 / 70 | 73 / 72 / 70 |

Table 3: Bottleneck memory consumption.
As an ablation study of Dr.Q, we investigated the performance of Dr.Q variants that do not use dropout, layer normalization, or both. We refer to the variant without dropout as DO, the one without layer normalization as LN, and the one without both as DOLN. The results (Figure 3) indicate that the synergistic effect of using dropout and layer normalization together is high, especially in complex environments (Ant and Humanoid). In these environments, Dr.Q significantly outperformed its ablation variants (DO, LN, and DOLN) in terms of both average return and bias reduction.
In this section, we review related studies and compare them with ours (Table 4).
| Type of study | Type of model uncertainty injection | Injection target | High UTD ratio setting? |
|---|---|---|---|
| Ensemble Q-functions | Ensemble | Q-functions | Yes |
| Ensemble transition models | Ensemble | Transition models | Yes |
| Dropout Q-functions | Dropout | Q-functions | No |
| Dropout transition models | Dropout | Transition models | No |
| Normalization | – | – | No |
| Our study | Dropout (with ensemble) | Q-functions | Yes |

Table 4: Comparison between related studies and ours. We classify related studies into five types (e.g., “Ensemble Q-functions”) on the basis of three criteria (e.g., “Type of model uncertainty”).
Ensemble Q-functions: Ensembles of Q-functions have been used in RL to consider model uncertainty (fausser2015neural; NIPS2016_8d8818c8; anschel2017averaged; agarwal2020optimistic; lee2021sunrise; Lan2020Maxmin; chen2021randomized).

Ensemble transition models: Ensembles of transition (and reward) models have been introduced to model-based RL, e.g., (NEURIPS2018_3de568f8; kurutach2018modelensemble; janner2019whento; shen2020ampo; NEURIPS2020_a322852c; lee2020context; hiraoka2020meta; abraham2020model).

The methods proposed in the above studies use a large ensemble of Q-functions or transition models and are thus computationally intensive. Dr.Q does not use a large ensemble of Q-functions and is thus computationally lighter.
Dropout transition models: gal2016improving; NIPS2017_84ddfb34; kahn2017uncertaintyaware introduced dropout and its modified variants to the transition models of model-based RL methods.

Dropout Q-functions: gal2016dropout introduced dropout to a Q-function at action selection for considering model uncertainty in exploration. Harrigan2016DeepRL and moerland2017efficient introduced dropout to policy evaluation in the same vein as us. However, there are three main differences between these studies and ours. (i) Target calculation: they introduced dropout to a single Q-function and used it for the target value, whereas we introduce dropout to multiple Q-functions and use the minimum of their outputs for the target value. On the basis of insights from fujimoto2018addressing; haarnoja2018soft, overestimation bias should be significantly suppressed by this minimum operation in the target. (ii) Use of engineering to stabilize learning: their methods do not use engineering to stabilize the learning of dropout Q-functions, whereas Dr.Q uses layer normalization to stabilize the learning. (iii) RL setting: they focused on a low UTD ratio (and standard RL) setting, whereas we focus on a high UTD ratio (and maximum entropy RL) setting. As we explained in Section 2.2, a high UTD ratio setting promotes high estimation bias and is thus more challenging than a low UTD ratio setting. In Section 4.1, we showed that their methods do not perform successfully in a high UTD ratio setting.
Normalization in RL: Normalization (e.g., batch normalization (ioffe2015batch) or layer normalization (ba2016layer)) has been introduced into RL. Batch normalization and its variant were introduced in deep deterministic policy gradient (DDPG) (lillicrap2015continuous) and twin delayed DDPG (fujimoto2018addressing) (bhatt2020crossnorm). Layer normalization was introduced in the implementation of maximum a posteriori policy optimisation (abdolmaleki2018maximum; hoffman2020acme). It was also introduced in SAC extensions (ma2020dsac; zhang2021learning). Unlike our study, the above studies did not introduce dropout to consider model uncertainty. In addition, although some studies (ma2020dsac; zhang2021learning) focused on the maximum entropy RL setting, none focused on a high UTD ratio setting.

We proposed Dr.Q, an RL method based on a small ensemble of Q-functions equipped with dropout connection and layer normalization. We experimentally demonstrated that Dr.Q significantly improves computational and memory efficiency over REDQ while achieving sample efficiency comparable with REDQ. In the ablation study, we found that using both dropout and layer normalization had a synergistic effect on improving sample efficiency, especially in complex environments (e.g., Humanoid).
Dr.Q (Algorithm 2) uses an ensemble of multiple ($M$) dropout Q-functions. This raises the question: “Why not use a single dropout Q-function for Dr.Q?” To answer this question, we compared Dr.Q with SinDr.Q, a variant of Dr.Q that uses a single dropout Q-function. Specifically, with SinDr.Q, the target in line 6 of Algorithm 2 is calculated by evaluating a single dropout Q-function $M$ times in place of the $M$ distinct dropout Q-functions. Note that the output of the dropout Q-function can differ in each evaluation due to the use of a dropout connection. The remaining part of SinDr.Q is the same as Dr.Q.
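The SinDr.Q variant described above can be sketched as follows (assumed names, following the target in line 6 of Algorithm 2): one dropout Q-function is queried M times, relying on a different dropout mask per evaluation rather than on M separately parameterized networks.

```python
# Sketch of the SinDr.Q target: the same dropout Q-function is evaluated
# M times; only the dropout noise differs between evaluations.

def sindrq_target(reward, single_dropout_q, next_state_action, log_pi_next,
                  m=2, gamma=0.99, alpha=0.2):
    q_values = [single_dropout_q(next_state_action) for _ in range(m)]
    return reward + gamma * (min(q_values) - alpha * log_pi_next)
```

The difference from Dr.Q is only parameter sharing: the M values here come from a single parameter vector, so the dropout noise is the sole source of diversity.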
From the comparison results of Dr.Q and SinDr.Q (Figure 6), we can see that the average return of SinDr.Q was lower than that of Dr.Q. This result indicates that using multiple dropout Qfunctions is preferable to using a single dropout Qfunction with Dr.Q.
In Section 4, we discussed comparing Dr.Q with REDQ, which uses an ensemble size of 10 (i.e., $N = 10$). To make a more detailed comparison, we compared Dr.Q and REDQ while varying the ensemble size for REDQ. We denote REDQ that uses an ensemble size of $N$ as “REDQ$N$” (e.g., REDQ5 for REDQ with an ensemble size of five).
Regarding the average return (left part of Figure 7), overall, Dr.Q was superior to REDQ2–5. Dr.Q was comparable with REDQ2 and REDQ3 in Hopper but superior in the more complex environments (Walker2d, Ant, and Humanoid). In addition, Dr.Q was somewhat better than REDQ5 in all environments. Regarding estimation bias (middle and right parts of Figure 7), overall, Dr.Q was significantly better than REDQ2 and comparable with REDQ3–10. Regarding processing speed (Table 5), Dr.Q ran as fast as REDQ3 and 1.4 to 1.5 times faster than REDQ5. Regarding memory efficiency (Tables 6 and 7), Dr.Q was less memory intensive than REDQ3–10.
| Method | Hopper-v2 | Walker2d-v2 | Ant-v2 | Humanoid-v2 |
|---|---|---|---|---|
| Dr.Q | 948 (905) | 933 (892) | 954 (913) | 989 (946) |
| REDQ2 | 832 (792) | 802 (768) | 675 (641) | 820 (773) |
| REDQ3 | 1052 (1014) | 876 (838) | 919 (881) | 950 (906) |
| REDQ5 | 1414 (1368) | 1425 (1378) | 1373 (1327) | 1552 (1503) |
| REDQ10 | 2328 (2269) | 2339 (2283) | 2336 (2277) | 2400 (2340) |

Table 5: Process time in msec (time spent learning Q-functions in parentheses).
| Method | Hopper-v2 | Walker2d-v2 | Ant-v2 | Humanoid-v2 |
|---|---|---|---|---|
| Dr.Q | 141,826 | 146,434 | 152,578 | 166,402 |
| REDQ2 | 139,778 | 144,386 | 150,530 | 164,354 |
| REDQ3 | 209,667 | 216,579 | 225,795 | 246,531 |
| REDQ5 | 349,445 | 360,965 | 376,325 | 410,885 |
| REDQ10 | 698,890 | 721,930 | 752,650 | 821,770 |

Table 6: Number of parameters.
| Method | Hopper-v2 | Walker2d-v2 | Ant-v2 | Humanoid-v2 |
|---|---|---|---|---|
| Dr.Q | 73 / 72 / 69 | 73 / 72 / 69 | 73 / 72 / 70 | 73 / 72 / 70 |
| REDQ2 | 73 / 51 / 51 | 73 / 51 / 51 | 73 / 51 / 51 | 73 / 52 / 52 |
| REDQ3 | 94 / 64 / 62 | 94 / 64 / 62 | 94 / 64 / 62 | 94 / 65 / 62 |
| REDQ5 | 136 / 106 / 100 | 136 / 106 / 100 | 136 / 106 / 100 | 136 / 107 / 101 |
| REDQ10 | 241 / 211 / 200 | 241 / 211 / 200 | 241 / 211 / 200 | 241 / 212 / 201 |

Table 7: Bottleneck memory consumption.
Dropout is introduced into three parts of the Dr.Q algorithm (i.e., lines 6, 8, and 10 of Algorithm 2 in Section 3).
In this section, we conduct an ablation study to answer the question “which dropout introduction contributes to the overall performance improvement of Dr.Q?”
We compared Dr.Q with the following variants:
DO@TargetQ: A Dr.Q variant that does not use dropout in line 6, i.e., dropout is not used in the target Q-functions for the target calculation in line 6.
DO@CurrentQ: A Dr.Q variant that does not use dropout in line 8, i.e., dropout is not used in the Q-functions being updated in line 8.
DO@PolicyOpt: A Dr.Q variant that does not use dropout in line 10, i.e., dropout is not used in the Q-functions used for policy optimization in line 10.
DO: A Dr.Q variant that does not use dropout in lines 6, 8, and 10.
In this ablation study, we compared the methods on the basis of average return and estimation bias.
The comparison results (Figure 8) indicate that the use of dropout for the target and current Q-functions (lines 6 and 8) is effective. Regarding the average return, the Dr.Q variants that do not use dropout for either the target Q-functions in line 6 or the current Q-functions in line 8 perform significantly worse than Dr.Q. For example, in Ant, the variants that do not use dropout for the target Q-functions (DO@TargetQ and DO) perform much worse than Dr.Q. In addition, in Humanoid, the variants that do not use dropout for the current Q-functions (DO@CurrentQ and DO) perform much worse than Dr.Q. Regarding estimation bias, the variants that do not use dropout for the target Q-functions (DO@TargetQ and DO) are significantly worse than Dr.Q in all environments.
The hyperparameter settings for each method in the experiments discussed in Section 4 are listed in Table 8. Parameter values, except for (i) the dropout rate for Dr.Q and DUVN and (ii) the in-target minimization parameter $M$ for DUVN, were set according to chen2021randomized. The dropout rate (i) was set through a line search, and $M$ for DUVN (ii) was set according to Harrigan2016DeepRL; moerland2017efficient.
| Method | Parameter | Value |
|---|---|---|
| SAC, REDQ, Dr.Q, and DUVN | optimizer | Adam (kingma2014adam) |
| | learning rate | |
| | discount rate | 0.99 |
| | target-smoothing coefficient | 0.005 |
| | replay buffer size | |
| | number of hidden layers for all networks | 2 |
| | number of hidden units per layer | 256 |
| | mini-batch size | 256 |
| | random starting data | 5000 |
| | UTD ratio ($G$) | 20 |
| REDQ and Dr.Q | in-target minimization parameter ($M$) | 2 |
| REDQ | ensemble size ($N$) | 10 |
| Dr.Q and DUVN | dropout rate | 0.01 |
| DUVN | in-target minimization parameter ($M$) | 1 |

Table 8: Hyperparameter settings.