Dropout Q-Functions for Doubly Efficient Reinforcement Learning

10/05/2021 · by Takuya Hiraoka, et al. · NEC · The University of Tokyo

Randomized ensembled double Q-learning (REDQ) has recently achieved state-of-the-art sample efficiency on continuous-action reinforcement learning benchmarks. This superior sample efficiency is achieved by using a large Q-function ensemble. However, REDQ is much less computationally efficient than non-ensemble counterparts such as Soft Actor-Critic (SAC). To make REDQ more computationally efficient, we propose Dr.Q, a variant of REDQ that uses a small ensemble of dropout Q-functions. Our dropout Q-functions are simple Q-functions equipped with dropout connections and layer normalization. Despite its simplicity of implementation, our experimental results indicate that Dr.Q is doubly (sample and computationally) efficient: it achieves sample efficiency comparable to that of REDQ, computational efficiency much better than that of REDQ, and computational efficiency comparable to that of SAC.

1 Introduction

In the reinforcement learning (RL) community, improving the sample efficiency of RL methods has been an important goal. RL methods have been shown to be promising for solving complex control tasks such as dexterous in-hand manipulation (DBLP:journals/corr/abs-1808-00177). However, RL methods generally require millions of training samples to solve a task (mendonca2019guided). This poor sample efficiency is a severe obstacle to practical RL applications (e.g., applications on limited computational resources or in real-world environments without simulators). Motivated by these issues, many RL methods have been proposed to achieve higher sample efficiency. For example, haarnoja2018softa; haarnoja2018soft proposed Soft Actor-Critic (SAC), which achieved higher sample efficiency than the previous state-of-the-art RL methods (lillicrap2015continuous; fujimoto2018addressing; schulman2017proximal).

Since 2019, RL methods that use a high update-to-data (UTD) ratio to achieve high sample efficiency have emerged. The UTD ratio is defined as the number of Q-function updates divided by the number of actual interactions with the environment. A high UTD ratio promotes sufficient training of the Q-functions within a few interactions, which leads to sample-efficient learning. Model-Based Policy Optimization (MBPO) (janner2019whento) is a seminal RL method that uses a high UTD ratio of 20–40 and achieves significantly higher sample efficiency than SAC, which uses a UTD ratio of 1. Encouraged by the success of MBPO, many RL methods with high UTD ratios have been proposed (shen2020ampo; lai2020bidirectional).
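To make the UTD ratio concrete, the following sketch (not taken from any of the cited implementations; `env`, `agent`, and `buffer` are hypothetical placeholders) shows how a UTD ratio of G = 20 translates into 20 critic updates per environment interaction.

```python
# Hedged sketch of a high update-to-data (UTD) ratio training loop.
# `env`, `agent`, and `buffer` are hypothetical placeholders, not the authors' code.
UTD_RATIO_G = 20  # e.g., G = 20 as in MBPO/REDQ-style methods

def train_one_environment_step(env, agent, buffer, state):
    # One actual interaction with the environment...
    action = agent.act(state)
    next_state, reward, done, _ = env.step(action)
    buffer.add(state, action, reward, next_state, done)

    # ...followed by G Q-function (critic) updates: UTD ratio = G / 1.
    for _ in range(UTD_RATIO_G):
        batch = buffer.sample(batch_size=256)
        agent.update_critics(batch)

    # The policy (actor) is typically updated once per environment step.
    agent.update_actor(buffer.sample(batch_size=256))
    return next_state if not done else env.reset()
```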

Among such methods, randomized ensembled double Q-learning (REDQ), proposed by chen2021randomized, is currently the most sample-efficient method on the MuJoCo benchmark. REDQ uses a high UTD ratio and a large ensemble of Q-functions. The high UTD ratio promotes estimation bias in policy evaluation, which degrades sample-efficient learning; REDQ uses the ensemble of Q-functions to suppress this bias and improve sample efficiency. chen2021randomized demonstrated that the sample efficiency of REDQ is equal to or even better than that of MBPO.

However, REDQ leaves room for improvement in terms of computational efficiency. REDQ runs 1.1 to 1.4 times faster than MBPO (chen2021randomized) but is still less computationally efficient than non-ensemble RL methods (e.g., SAC) due to its use of a large ensemble. In Section 4.2, we show that REDQ runs more than two times slower than SAC and requires much more memory. Computational efficiency is important in several scenarios, e.g., in RL applications with much lighter on-device computation (e.g., mobile phones or other lightweight edge devices) (chen2021improving), or in situations in which rapid trial and error is required for developing RL agents (e.g., hyperparameter tuning or proof of concept for RL applications with limited time resources). Therefore, RL methods that are superior not only in sample efficiency but also in computational efficiency are preferable.

We propose a method for improving computational efficiency called Dr.Q, a REDQ variant that uses a small ensemble of dropout Q-functions, in which dropout (JMLR:v15:srivastava14a) and layer normalization (ba2016layer) are used (Section 3). We experimentally show that Dr.Q is doubly (computationally and sample) efficient: (i) it improves computational efficiency over REDQ by more than two times (Section 4.2) and (ii) it achieves sample efficiency comparable to that of REDQ (Section 4.1).

Although our primary contribution is proposing a doubly efficient RL method, we also make three significant contributions from other perspectives:
1. Simplicity of implementation. Dr.Q can be implemented by basically adding a few lines of readily available functions (dropout and layer normalization) to Q-functions in REDQ (and SAC). This simplicity enables one to easily replicate and extend it.
2. First successful demonstration of the usefulness of dropout in high UTD ratio settings. Previous studies incorporated dropout into RL (gal2016dropout; Harrigan2016DeepRL; moerland2017efficient; gal2016improving; NIPS2017_84ddfb34; kahn2017uncertaintyaware) (see Section 5 for details). However, these studies focused on low UTD ratio settings. Dropout approaches generally do not work as well as ensemble approaches (NEURIPS2019_8558cb40; lakshminarayanan2017simple; Durasov21). For this reason, ensemble approaches, rather than dropout approaches, have been used in RL with high UTD ratio settings (chen2021randomized; janner2019whento; shen2020ampo; hiraoka2020meta; lai2020bidirectional). In Section 4, we show that Dr.Q achieves almost the same or better bias-reduction ability and sample and computational efficiency compared with ensemble-based RL methods in high UTD ratio settings. This sheds light on dropout approaches once again and promotes their use as a reasonable alternative (or complement) to ensemble approaches in high UTD ratio settings.
3. Discovery of engineering insights to effectively apply dropout to RL. Specifically, we discovered that the following three engineering practices are effective in reducing bias and improving sample efficiency: (i) using the dropout and ensemble approaches together for constructing Q-functions (i.e., using multiple dropout Q-functions) (Section 4.1 and Appendix A.3); (ii) introducing layer normalization into dropout Q-functions (Section 4.3); and (iii) using dropout for both the current and target Q-functions (Appendix C). These engineering insights were not revealed in previous RL studies and would be useful to practitioners who attempt to apply dropout to RL.

2 Preliminaries

2.1 Maximum Entropy Reinforcement Learning (maximum entropy RL)

RL addresses the problem of an agent learning to act in an environment. At each discrete time step $t$, the environment provides the agent with a state $s_t$, the agent responds by selecting an action $a_t$, and then the environment provides the next reward $r_t$ and state $s_{t+1}$. For convenience, as needed, we use the simpler notations $r$, $s$, $a$, $s'$, and $a'$ to refer to a reward, state, action, next state, and next action, respectively.

We focus on maximum entropy RL, in which an agent aims to find a policy that maximizes the expected return with an entropy bonus: $\pi^* = \arg\max_{\pi} \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}} \left[ r(s_t, a_t) + \alpha \mathcal{H}\bigl(\pi(\cdot \mid s_t)\bigr) \right]$. Here, $\pi$ is a policy and $\mathcal{H}$ is entropy. The temperature $\alpha$ balances exploitation and exploration and affects the stochasticity of the policy.

2.2 Randomized Ensemble Double Q-Learning (REDQ)

REDQ (chen2021randomized) is a sample-efficient model-free method for solving maximum-entropy RL problems (Algorithm 1). It has two primary components to achieve high sample efficiency.
1. High UTD ratio: It uses a high UTD ratio $G$, which is the number of Q-function updates (lines 4–10) divided by the number of actual interactions with the environment (line 3). The high UTD ratio promotes sufficient training of the Q-functions within a few interactions. However, it also promotes overestimation bias in Q-function training, which degrades sample-efficient learning (chen2021randomized).
2. Ensemble of Q-functions: To reduce the overestimation bias, it uses an ensemble of $N$ Q-functions for the target to be minimized (lines 6–7). Specifically, a random subset of the ensemble is selected (line 6) and then used for calculating the target (line 7). The size of the subset is kept fixed and is denoted as $M$. In addition, each Q-function in the ensemble is randomly and independently initialized but updated with the same target (lines 8–9). chen2021randomized showed that using a large ensemble ($N = 10$) and a small subset ($M = 2$) successfully reduces the bias. A minimal code sketch of this in-target minimization is given after Algorithm 1.

1:  Initialize policy parameters $\phi$, $N$ Q-function parameters $\theta_i$ ($i = 1, \ldots, N$), and an empty replay buffer $\mathcal{D}$. Set target parameters $\bar{\theta}_i \leftarrow \theta_i$ for $i = 1, \ldots, N$.
2:  repeat
3:     Take action $a \sim \pi_{\phi}(\cdot \mid s)$; observe reward $r$ and next state $s'$; $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s, a, r, s')\}$.
4:     for $G$ updates do
5:        Sample a mini-batch $B = \{(s, a, r, s')\}$ from $\mathcal{D}$.
6:        Sample a set $\mathcal{M}$ of $M$ distinct indices from $\{1, \ldots, N\}$.
7:        Compute the Q target $y$ (same for all $N$ Q-functions): $y = r + \gamma \bigl( \min_{i \in \mathcal{M}} Q_{\bar{\theta}_i}(s', \tilde{a}') - \alpha \log \pi_{\phi}(\tilde{a}' \mid s') \bigr)$, $\tilde{a}' \sim \pi_{\phi}(\cdot \mid s')$.
8:        for $i = 1, \ldots, N$ do
9:           Update $\theta_i$ with gradient descent using $\nabla_{\theta_i} \frac{1}{|B|} \sum_{(s, a, r, s') \in B} \bigl( Q_{\theta_i}(s, a) - y \bigr)^2$.
10:          Update target networks with $\bar{\theta}_i \leftarrow \tau \theta_i + (1 - \tau) \bar{\theta}_i$.
11:    Update $\phi$ with gradient ascent using $\nabla_{\phi} \frac{1}{|B|} \sum_{s \in B} \bigl( \frac{1}{N} \sum_{i=1}^{N} Q_{\theta_i}(s, \tilde{a}) - \alpha \log \pi_{\phi}(\tilde{a} \mid s) \bigr)$, $\tilde{a} \sim \pi_{\phi}(\cdot \mid s)$.
Algorithm 1 REDQ
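To make the in-target minimization of lines 6–7 concrete, here is a minimal PyTorch-style sketch of the REDQ target computation. It is our own illustration rather than the authors' code, and the names `policy`, `target_qs`, and the `batch` keys are assumptions.

```python
import random
import torch

def redq_target(batch, target_qs, policy, gamma=0.99, alpha=0.2, M=2):
    """REDQ-style target: min over a random subset of M target Q-functions.

    `target_qs` is a list of N target Q-networks; `policy.sample` returns actions
    and their log-probabilities. Names and signatures are assumptions.
    """
    s2, r, done = batch["next_obs"], batch["reward"], batch["done"]
    with torch.no_grad():
        a2, logp_a2 = policy.sample(s2)                   # a' ~ pi(.|s')
        idx = random.sample(range(len(target_qs)), M)     # random subset of indices
        q_vals = torch.stack([target_qs[i](s2, a2) for i in idx], dim=0)
        min_q = q_vals.min(dim=0).values                  # in-target minimization
        y = r + gamma * (1.0 - done) * (min_q - alpha * logp_a2)
    return y  # the same target is used for all N Q-functions (lines 8-9)
```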

Although using a large ensemble of Q-functions is beneficial for reducing bias and improving sample efficiency, this makes REDQ computationally intensive. In the next section, we discuss reducing the ensemble size.

3 Injecting Model Uncertainty into Target with Dropout Q-functions

In this section, we discuss replacing the large ensemble of Q-functions in REDQ with a small ensemble of dropout Q-functions. We start our discussion by reviewing what the ensemble in REDQ does from the viewpoint of model uncertainty injection. Then, we propose to use dropout for model uncertainty injection instead of the large ensemble. Specifically, we propose (i) the dropout Q-function that is a Q-function equipped with dropout and layer normalization, and (ii) Dr.Q, a variant of REDQ that uses a small ensemble of dropout Q-functions. Finally, we explain that the size of the ensemble can be smaller in Dr.Q than in REDQ.

We first explain our insight that, in REDQ, model (Q-function parameter) uncertainty is injected into the target. In REDQ, a subset of the Q-functions is used to compute the target value (lines 6–7 in Algorithm 1). This can be interpreted as an approximation of $\mathbb{E}_{\bar{\theta} \sim p(\bar{\theta})}\left[ y(\bar{\theta}) \right]$, the expected target value with respect to model uncertainty:

$\mathbb{E}_{\bar{\theta} \sim p(\bar{\theta})}\left[ y(\bar{\theta}) \right] = \mathbb{E}_{\bar{\theta} \sim p(\bar{\theta})}\left[ r + \gamma \bigl( Q_{\bar{\theta}}(s', a') - \alpha \log \pi_{\phi}(a' \mid s') \bigr) \right]$
$\quad \approx \mathbb{E}_{\{\bar{\theta}_i\}_{i \in \mathcal{M}} \sim q}\left[ r + \gamma \bigl( \min_{i \in \mathcal{M}} Q_{\bar{\theta}_i}(s', a') - \alpha \log \pi_{\phi}(a' \mid s') \bigr) \right] \approx r + \gamma \bigl( \min_{i \in \mathcal{M}} Q_{\bar{\theta}_i}(s', a') - \alpha \log \pi_{\phi}(a' \mid s') \bigr)$

On the left-hand side of the second line, the model distribution $p$ is replaced with a proposal distribution $q$, which is based on resampling from the ensemble in line 6 of Algorithm 1 (we assume that $\bar{\theta}_1, \ldots, \bar{\theta}_N$ independently follow an identical distribution). On the right-hand side (RHS) of the second line, the expected target value is further approximated by a one-sample average on the basis of $q$. The resulting approximation is used in line 7 of Algorithm 1. In the remainder of this section, we discuss another means of approximating the expected target value.

We use dropout Q-functions $Q^{\mathrm{do}}_{\theta}$ for the target value approximation (Figure 1). Here, $Q^{\mathrm{do}}_{\theta}$ is a Q-function that has dropout connections (JMLR:v15:srivastava14a). The left part of Figure 1 shows an implementation obtained by adding dropout layers to the Q-function implementation used by chen2021randomized. Layer normalization (ba2016layer) is applied after dropout to use dropout more effectively, as in NIPS2017_3f5ee243; NEURIPS2019_2f4fe03d. By using $Q^{\mathrm{do}}_{\theta}$, the target value is approximated as

$\mathbb{E}_{\bar{\theta} \sim p(\bar{\theta})}\left[ y(\bar{\theta}) \right] \approx \mathbb{E}_{\bar{\theta} \sim q_{\mathrm{do}}}\left[ r + \gamma \bigl( \min_{i = 1, \ldots, M} Q^{\mathrm{do}}_{\bar{\theta}_i}(s', a') - \alpha \log \pi_{\phi}(a' \mid s') \bigr) \right]$
$\quad \approx r + \gamma \bigl( \min_{i = 1, \ldots, M} Q^{\mathrm{do}}_{\bar{\theta}_i}(s', a') - \alpha \log \pi_{\phi}(a' \mid s') \bigr)$

Instead of $p$, a proposal distribution $q_{\mathrm{do}}$ based on the dropout is used on the RHS of the first line. The expected target value is further approximated by a one-sample average on the basis of $q_{\mathrm{do}}$ in the second line. We use the resulting approximation for injecting model uncertainty into the target value (right part of Figure 1). For calculating it, we use $M$ dropout Q-functions, which have independently initialized and trained parameters $\theta_1, \ldots, \theta_M$. Using $M$ dropout Q-functions improves the performance of Dr.Q (our RL method described below) compared with using a single dropout Q-function (further details are given in Appendix A.3).

Figure 1: Dropout Q-function implementation (left) and how dropout Q-functions are used in the target (right). Dropout Q-function implementation: our dropout Q-function is implemented by modifying the Q-function implementation used by chen2021randomized. Our modification (highlighted in red in the figure) is the addition of dropout (Dropout) and layer normalization (LayerNorm). "Weight" is a weight layer and "ReLU" is the rectified-linear-unit activation layer. The parameters $\theta$ represent the weights and biases of the weight layers. How dropout Q-functions are used in the target: the $M$ dropout Q-functions are used to calculate the target value as $y = r + \gamma \bigl( \min_{i = 1, \ldots, M} Q^{\mathrm{do}}_{\bar{\theta}_i}(s', a') - \alpha \log \pi_{\phi}(a' \mid s') \bigr)$.
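A minimal PyTorch sketch of the dropout Q-function in Figure 1, assuming the two-hidden-layer, 256-unit architecture listed in Appendix D; the class name and constructor arguments are ours, and the only substantive additions relative to a plain MLP Q-function are the Dropout and LayerNorm layers (layer normalization applied after dropout).

```python
import torch
import torch.nn as nn

class DropoutQFunction(nn.Module):
    """Q-function with dropout and layer normalization (sketch of Figure 1)."""

    def __init__(self, obs_dim, act_dim, hidden=256, dropout_rate=0.01):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden),  # Weight
            nn.Dropout(p=dropout_rate),            # Dropout (added)
            nn.LayerNorm(hidden),                  # LayerNorm (added, after dropout)
            nn.ReLU(),                             # ReLU
            nn.Linear(hidden, hidden),
            nn.Dropout(p=dropout_rate),
            nn.LayerNorm(hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),                  # scalar Q-value
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)
```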

We now explain Dr.Q (the period between "Dr" and "Q" is necessary to avoid a name conflict with DrQ (yarats2021image)), in which $Q^{\mathrm{do}}_{\theta}$ is used to consider model uncertainty. The algorithmic description of Dr.Q is shown in Algorithm 2. Dr.Q is a variant of REDQ, and the parts modified from REDQ are highlighted in red in the algorithm. In line 6, $Q^{\mathrm{do}}_{\bar{\theta}_i}$ is used to inject model uncertainty into the target, as discussed in the previous paragraph. In lines 8 and 10, $M$ dropout Q-functions are used instead of $N$ Q-functions ($M < N$) to make Dr.Q more computationally efficient.

1:  Initialize policy parameters $\phi$, $M$ dropout Q-function parameters $\theta_i$ ($i = 1, \ldots, M$), and an empty replay buffer $\mathcal{D}$. Set target parameters $\bar{\theta}_i \leftarrow \theta_i$ for $i = 1, \ldots, M$.
2:  repeat
3:     Take action $a \sim \pi_{\phi}(\cdot \mid s)$; observe reward $r$ and next state $s'$; $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s, a, r, s')\}$.
4:     for $G$ updates do
5:        Sample a mini-batch $B = \{(s, a, r, s')\}$ from $\mathcal{D}$.
6:        Compute the Q target $y$ for the dropout Q-functions: $y = r + \gamma \bigl( \min_{i = 1, \ldots, M} Q^{\mathrm{do}}_{\bar{\theta}_i}(s', \tilde{a}') - \alpha \log \pi_{\phi}(\tilde{a}' \mid s') \bigr)$, $\tilde{a}' \sim \pi_{\phi}(\cdot \mid s')$.
7:        for $i = 1, \ldots, M$ do
8:           Update $\theta_i$ with gradient descent using $\nabla_{\theta_i} \frac{1}{|B|} \sum_{(s, a, r, s') \in B} \bigl( Q^{\mathrm{do}}_{\theta_i}(s, a) - y \bigr)^2$.
9:           Update target networks with $\bar{\theta}_i \leftarrow \tau \theta_i + (1 - \tau) \bar{\theta}_i$.
10:    Update $\phi$ with gradient ascent using $\nabla_{\phi} \frac{1}{|B|} \sum_{s \in B} \bigl( \frac{1}{M} \sum_{i=1}^{M} Q^{\mathrm{do}}_{\theta_i}(s, \tilde{a}) - \alpha \log \pi_{\phi}(\tilde{a} \mid s) \bigr)$, $\tilde{a} \sim \pi_{\phi}(\cdot \mid s)$.
Algorithm 2 Dr.Q
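For illustration, the following hedged sketch implements one Dr.Q critic update corresponding to lines 5–9 of Algorithm 2: the target takes the minimum over all $M$ dropout Q-functions (no index resampling as in REDQ), and only $M$ networks are maintained and updated. All function and variable names are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def drq_critic_update(batch, qs, target_qs, q_optims, policy,
                      gamma=0.99, alpha=0.2, tau=0.005):
    """One Dr.Q critic update (a sketch of lines 5-9 of Algorithm 2).

    `qs` / `target_qs` are lists of M dropout Q-functions (e.g., M = 2).
    Dropout is kept active in both current and target Q-functions, so every
    forward pass injects model uncertainty (cf. Appendix C).
    """
    s, a, r, s2, done = (batch[k] for k in ("obs", "act", "reward", "next_obs", "done"))

    with torch.no_grad():
        a2, logp_a2 = policy.sample(s2)
        min_q = torch.stack([q(s2, a2) for q in target_qs], dim=0).min(dim=0).values
        y = r + gamma * (1.0 - done) * (min_q - alpha * logp_a2)

    for q, opt in zip(qs, q_optims):
        loss = F.mse_loss(q(s, a), y)  # same target y for every dropout Q-function
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Soft (Polyak) update of the target networks with smoothing coefficient tau.
    with torch.no_grad():
        for q, q_targ in zip(qs, target_qs):
            for p, p_targ in zip(q.parameters(), q_targ.parameters()):
                p_targ.data.mul_(1.0 - tau).add_(tau * p.data)
```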

The ensemble size of the dropout Q-functions for Dr.Q (i.e., $M$) is smaller than that of the Q-functions for REDQ (i.e., $N$). $M$ for Dr.Q is equal to the subset size $M$ for REDQ, which is not greater than $N$. In practice, $M$ is much smaller than $N$ (e.g., $M = 2$ and $N = 10$ in chen2021randomized). This reduction in the number of Q-functions makes Dr.Q more computationally efficient. Specifically, Dr.Q is faster due to the reduction in the number of Q-function updates (line 8 in Algorithm 2). Dr.Q also requires less memory for holding Q-function parameters. In Section 4.2, we show that Dr.Q is computationally faster and less memory intensive than REDQ.

4 Experiments

We conducted experiments to evaluate and analyse Dr.Q (source code to replicate the experiments is available at https://github.com/TakuyaHiraoka/Dropout-Q-Functions-for-Doubly-Efficient-Reinforcement-Learning). In Section 4.1, we discuss the evaluation of Dr.Q's performance (sample efficiency and bias-reduction ability). In Section 4.2, we discuss the evaluation of the computational efficiency of Dr.Q. In Section 4.3, we explain the ablation study for Dr.Q.

4.1 Sample efficiency and bias-reduction ability of Dr.Q

To evaluate the performances of Dr.Q, we compared Dr.Q with three baseline methods in MuJoCo benchmark environments (todorov2012mujoco; brockman2016openai). Following chen2021randomized; janner2019whento, we prepared the following environments: Hopper, Walker2d, Ant, and Humanoid. In these environments, we compared the following four methods:
REDQ: Baseline method that follows the REDQ algorithm (chen2021randomized) (Algorithm 1).
SAC: Baseline method that follows the SAC algorithm (haarnoja2018softa; haarnoja2018soft). To improve the sample efficiency of this method, delayed policy update and high UTD ratio were used, as suggested by chen2021randomized.
Dr.Q: Proposed method that follows Algorithm 2.
DUVN: Baseline method. This method is a variant of Dr.Q that uses double uncertainty value networks (Harrigan2016DeepRL; moerland2017efficient) for policy evaluation. Specifically, a single (i.e., $M = 1$) target dropout Q-function is used for the target calculation in line 6 of Algorithm 2: $y = r + \gamma \bigl( Q^{\mathrm{do}}_{\bar{\theta}_1}(s', a') - \alpha \log \pi_{\phi}(a' \mid s') \bigr)$. As in Harrigan2016DeepRL; moerland2017efficient, layer normalization is not applied in $Q^{\mathrm{do}}_{\bar{\theta}_1}$. Following chen2021randomized, we set the UTD ratio to $G = 20$, the ensemble size to $N = 10$ (for REDQ), and the in-target minimization parameter to $M = 2$ for all methods except DUVN ($M = 1$). More detailed hyperparameter settings are given in Appendix D.

The methods were compared on the basis of average return and estimation bias.
Average return: the average return over test episodes. We regarded 1000 environment steps in Hopper and 3000 environment steps in the other environments as one epoch. After every epoch, we ran ten test episodes with the current policy and recorded the average return.

Average/std. bias: the average and standard deviation of the normalized estimation error (bias) of the Q-functions (chen2021randomized). The error represents how significantly the Q-value estimate differs from the actual return. Formally, the error is defined as $\bigl( \hat{Q}(s, a) - Q^{\pi}(s, a) \bigr) / \bigl| \mathbb{E}_{\bar{s}, \bar{a} \sim \pi}\bigl[ Q^{\pi}(\bar{s}, \bar{a}) \bigr] \bigr|$, where $Q^{\pi}(s, a)$ is the discounted Monte Carlo return obtained from the current policy in the test episodes and $\hat{Q}(s, a)$ is its estimate. $\hat{Q}$ was computed from the $N$ Q-functions for SAC and REDQ and from the $M$ dropout Q-functions for Dr.Q and DUVN, respectively.
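The bias statistics can be computed roughly as follows; this is our own sketch of the metric described above (array names are ours), assuming the Q estimate and Monte Carlo return have already been collected for each test state-action pair.

```python
import numpy as np

def normalized_bias_stats(q_estimates, mc_returns):
    """Average and std of the normalized Q-estimation bias (sketch).

    `q_estimates`: Q-values predicted for test state-action pairs;
    `mc_returns`: the corresponding discounted Monte Carlo returns
    under the current policy.
    """
    q_estimates = np.asarray(q_estimates, dtype=np.float64)
    mc_returns = np.asarray(mc_returns, dtype=np.float64)
    # Normalize the error by the magnitude of the average Monte Carlo return
    # so that the bias is comparable across environments and training stages.
    normalizer = np.abs(mc_returns.mean()) + 1e-8
    errors = (q_estimates - mc_returns) / normalizer
    return errors.mean(), errors.std()
```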

The comparison results (Figure 2) indicate that Dr.Q achieved almost the same level of performance as REDQ. Regarding the average return, REDQ and Dr.Q achieved almost the same sample efficiency overall. In Walker2d and Ant, their learning curves highly overlapped, and there was no significant difference between them. In Humanoid, REDQ was slightly better than Dr.Q. In Hopper, Dr.Q was better than REDQ. In all environments except Hopper, REDQ and Dr.Q improved their average return significantly earlier than SAC and DUVN. Regarding bias, REDQ and Dr.Q consistently kept the value-estimation bias closer to zero than SAC and DUVN in all environments.

Figure 2: Average return and average and standard deviation of estimation bias for REDQ, SAC, DUVN, and Dr.Q. The horizontal axis represents the number of interactions with the environment (i.e., the number of executions of line 3 of Algorithm 2). For each method, the average score of five independent trials is plotted as a solid line, and the standard deviation across trials is plotted as a transparent shaded region.

4.2 Computational efficiency of Dr.Q

We next evaluated the computational efficiency of Dr.Q. We compared Dr.Q with the baseline methods on the basis of the following criteria: (i) process time required for executing the methods; (ii) number of parameters of each method; and (iii) bottleneck memory consumption reported by the PyTorch profiler (https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html). Bottleneck memory consumption is the maximum memory consumption recorded when running the methods. For the evaluation, we ran each method on a machine equipped with two Intel(R) Xeon(R) E5-2667 v4 CPUs and one NVIDIA Tesla K80 GPU.
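Bottleneck memory of the kind reported below can be inspected with the PyTorch profiler along the lines of the linked recipe. The snippet is a generic usage sketch, not the authors' measurement script; `agent.update(batch)` is a placeholder for one update step.

```python
from torch.profiler import profile, ProfilerActivity

def profile_one_update(agent, batch):
    # Profile one update step and print the operators with the largest
    # self CPU memory consumption (see the PyTorch profiler recipe linked above).
    with profile(activities=[ProfilerActivity.CPU],
                 profile_memory=True, record_shapes=True) as prof:
        agent.update(batch)  # placeholder for one Q-function/policy update step
    print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=3))
```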

The process times of the methods (Table 1) indicate that Dr.Q runs more than two times faster than REDQ. Dr.Q (and SAC) requires process times of under 1000 msec, whereas REDQ requires process times in the 2300–2400 msec range. The numbers in parentheses show that learning the Q-functions dominates the overall process. This suggests that using a smaller number of Q-functions is important for improving process time.

Hopper-v2 Walker2d-v2 Ant-v2 Humanoid-v2
SAC 910 (870) 870 (835) 888 (848) 893 (854)
REDQ 2328 (2269) 2339 (2283) 2336 (2277) 2400 (2340)
DUVN 664 (636) 762 (731) 733 (700) 692 (660)
Dr.Q 948 (905) 933 (892) 954 (913) 989 (946)
Table 1: Process times (in msec) required for executing overall loop (e.g., lines 3–10 in Algorithm 2). Times for updating Q-functions (e.g., lines 5–9 in Algorithm 2) are shown in parentheses.

The number of parameters and the bottleneck memory consumption of each method indicate that Dr.Q is more memory efficient than REDQ. Regarding the number of parameters (Table 2), those of Dr.Q (and SAC and DUVN) are about one-fifth those of REDQ. Note that the number of parameters of Dr.Q is equivalent to that of SAC since Dr.Q and SAC use the same number (two) of Q-functions. Regarding bottleneck memory consumption (Table 3), that of Dr.Q (and SAC and DUVN) is about one-third that of REDQ. We can also see that the bottleneck memory consumption is almost independent of the environment. This is because the most memory-intensive process is the evaluation of the hidden layers in the Q-functions (e.g., applying the ReLU layer), which does not depend on the environment-specific dimensionality of states and actions (in our experiments, the number of hidden units is the same in all environments; see Appendix D).

Hopper-v2 Walker2d-v2 Ant-v2 Humanoid-v2
SAC 141,826 146,434 152,578 166,402
REDQ 698,890 721,930 752,650 821,770
DUVN 139,778 144,386 150,530 164,354
Dr.Q 141,826 146,434 152,578 166,402
Table 2: Number of parameters of each method.
Hopper-v2 Walker2d-v2 Ant-v2 Humanoid-v2
SAC 73 / 64 / 62 73 / 64 / 62 73 / 64 / 62 73 / 65 / 62
REDQ 241 / 211/ 200 241 / 211 / 200 241 / 211 / 200 241 / 212 / 201
DUVN 72 / 71 / 51 72 / 71 / 51 72 / 71 / 51 72 / 71 / 52
Dr.Q 73 / 72 / 69 73 / 72 / 69 73 / 72 / 70 73 / 72 / 70
Table 3: Bottleneck memory consumption (in megabytes) of each method. The three worst bottleneck memory consumptions are shown in the form "1st worst / 2nd worst / 3rd worst."

4.3 Ablation study

As an ablation study of Dr.Q, we investigated the performance of Dr.Q variants that do not use dropout, layer normalization, or both. We refer to the variant without dropout as -DO, the one without layer normalization as -LN, and the one without both as -DO-LN. The results (Figure 3) indicate that the synergistic effect of using dropout and layer normalization together is strong, especially in complex environments (Ant and Humanoid). In these environments, Dr.Q significantly outperformed its ablation variants (-DO, -LN, and -DO-LN) in terms of both average return and bias reduction.

Figure 3: Ablation study results

5 Related Work

In this section, we review related studies and compare them with ours (Table 4).

Type of study | Method for injecting model uncertainty | Type of model uncertainty | Focus on high UTD ratio (and maximum-entropy RL) setting?
Ensemble Q-functions (chen2021randomized) | Ensemble | Q-functions | Partially yes
Ensemble transition models (e.g., janner2019whento) | Ensemble | Transition models | Partially yes
Dropout Q-functions | Dropout | Q-functions | No
Dropout transition models | Dropout | Transition models | No
Normalization | – | – | No
Our study | Dropout (with ensemble) | Q-functions | Yes
Table 4: Comparison between related studies and ours. We classify related studies into five types (e.g., "Ensemble Q-functions") on the basis of three criteria (e.g., "Type of model uncertainty").

Ensemble Q-functions: Ensembles of Q-functions have been used in RL to consider model uncertainty (fausser2015neural; NIPS2016_8d8818c8; anschel2017averaged; agarwal2020optimistic; lee2021sunrise; Lan2020Maxmin; chen2021randomized). Ensemble transition models: Ensembles of transition (and reward) models have been introduced to model-based RL, e.g., (NEURIPS2018_3de568f8; kurutach2018modelensemble; janner2019whento; shen2020ampo; NEURIPS2020_a322852c; lee2020context; hiraoka2020meta; abraham2020model). The methods proposed in the above studies use a large ensemble of Q-functions or transition models and thus are computationally intensive. Dr.Q does not use a large ensemble of Q-functions and thus is computationally lighter.

Dropout transition models: gal2016improving; NIPS2017_84ddfb34; kahn2017uncertaintyaware introduced dropout and modified variants of it to the transition models of model-based RL methods. Dropout Q-functions: gal2016dropout introduced dropout to a Q-function at action selection to consider model uncertainty in exploration. Harrigan2016DeepRL and moerland2017efficient introduced dropout to policy evaluation in the same vein as us. However, there are three main differences between these studies and ours. (i) Target calculation: they introduced dropout to a single Q-function and used it for the target value, whereas we introduce dropout to multiple Q-functions and use the minimum of their outputs for the target value. On the basis of insights from fujimoto2018addressing; haarnoja2018soft, overestimation bias should be significantly suppressed by this minimum operation in the target. (ii) Use of engineering to stabilize learning: their methods do not use engineering to stabilize the learning of dropout Q-functions, whereas Dr.Q uses layer normalization to stabilize the learning. (iii) RL setting: they focused on a low UTD ratio (and standard RL) setting, whereas we focus on a high UTD ratio (and maximum entropy RL) setting. As explained in Section 2.2, a high UTD ratio setting promotes high estimation bias and is thus more challenging than a low UTD ratio setting. In Section 4.1, we showed that their methods do not perform successfully in a high UTD ratio setting.

Normalization in RL: Normalization (e.g., batch normalization (ioffe2015batch) or layer normalization (ba2016layer)) has been introduced into RL. Batch normalization and its variant are introduced in deep deterministic policy gradient (DDPG) (lillicrap2015continuous) and twin delayed DDPG (fujimoto2018addressing) (bhatt2020crossnorm). Layer normalization is introduced in the implementation of maximum a posteriori policy optimisation (abdolmaleki2018maximum; hoffman2020acme) and in SAC extensions (ma2020dsac; zhang2021learning). Unlike our study, the above studies did not introduce dropout to consider model uncertainty. In addition, although some studies (ma2020dsac; zhang2021learning) focused on the maximum entropy RL setting, none focused on a high UTD ratio setting.

6 Conclusion

We proposed Dr.Q, an RL method based on a small ensemble of Q-functions equipped with dropout connections and layer normalization. We experimentally demonstrated that Dr.Q significantly improves computational and memory efficiency over REDQ while achieving sample efficiency comparable to that of REDQ. In the ablation study, we found that using both dropout and layer normalization had a synergistic effect on improving sample efficiency, especially in complex environments (e.g., Humanoid).

References

Appendix A Effect of dropout rate on Dr.Q and its variants

A.1 Dr.Q

Figure 4: Average return and average and standard deviation of estimation bias for Dr.Q with different dropout rates. Scores for Dr.Q are plotted as solid lines and labelled in accordance with dropout rates (e.g., "0.2").

A.2 Dr.Q without layer normalization

Figure 5: Average return and average and standard deviation of estimation bias for Dr.Q without layer normalization (i.e., -LN in Section 4.3) with different dropout rates

A.3 Sin-Dr.Q: Dr.Q variant using a single dropout Q-function

Dr.Q (Algorithm 2) uses an ensemble of multiple ($M$) dropout Q-functions. This raises the question: "Why not use a single dropout Q-function for Dr.Q?" To answer this question, we compared Dr.Q with Sin-Dr.Q, a variant of Dr.Q that uses a single dropout Q-function. Specifically, with Sin-Dr.Q, the target in line 6 of Algorithm 2 is calculated by evaluating $Q^{\mathrm{do}}_{\bar{\theta}_1}$, a single dropout Q-function, $M$ times: $y = r + \gamma \bigl( \min_{j = 1, \ldots, M} Q^{\mathrm{do}}_{\bar{\theta}_1}(s', a') - \alpha \log \pi_{\phi}(a' \mid s') \bigr)$, where each of the $M$ evaluations uses an independently sampled dropout mask.

Note that the output of $Q^{\mathrm{do}}_{\bar{\theta}_1}$ can differ in each evaluation due to the use of the dropout connection. The remaining parts of Sin-Dr.Q are the same as in Dr.Q.
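A hedged sketch of the Sin-Dr.Q target: a single target dropout Q-function is evaluated $M$ times, and because the dropout mask is redrawn on every forward pass (the network must be kept in training mode), the minimum is taken over $M$ stochastic evaluations of the same network. Names are assumptions.

```python
import torch

def sin_drq_target(batch, single_target_q, policy, gamma=0.99, alpha=0.2, M=2):
    """Sin-Dr.Q target: M stochastic forward passes of one dropout Q-function."""
    s2, r, done = batch["next_obs"], batch["reward"], batch["done"]
    with torch.no_grad():
        a2, logp_a2 = policy.sample(s2)
        # Each call redraws the dropout mask (network kept in training mode),
        # so the M evaluations of the same parameters differ.
        q_samples = torch.stack([single_target_q(s2, a2) for _ in range(M)], dim=0)
        y = r + gamma * (1.0 - done) * (q_samples.min(dim=0).values - alpha * logp_a2)
    return y
```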

From the comparison results of Dr.Q and Sin-Dr.Q (Figure 6), we can see that the average return of Sin-Dr.Q was lower than that of Dr.Q. This result indicates that using multiple dropout Q-functions is preferable to using a single dropout Q-function with Dr.Q.

Figure 6: Average return and average and standard deviation of estimation bias for Dr.Q and Sin-Dr.Q. Scores for Dr.Q are plotted as dashed lines. Scores for Sin-Dr.Q are plotted as solid lines and labelled in accordance with dropout rates (e.g., "0.2").

Appendix B REDQ with different ensemble size

In Section 4, we compared Dr.Q with REDQ using an ensemble size of 10 (i.e., $N = 10$). To make a more detailed comparison, we compared Dr.Q and REDQ while varying the ensemble size of REDQ. We denote REDQ with an ensemble size of $N$ as "REDQ$N$" (e.g., REDQ5 for REDQ with an ensemble size of five).

Regarding average return (left part of Figure 7), overall, Dr.Q was superior to REDQ2–5. Dr.Q was comparable with REDQ2 and REDQ3 in Hopper but superior in the more complex environments (Walker2d, Ant, and Humanoid). In addition, Dr.Q was somewhat better than REDQ5 in all environments. Regarding estimation bias (middle and right parts of Figure 7), overall, Dr.Q was significantly better than REDQ2 and comparable with REDQ3–10. Regarding processing speed (Table 5), Dr.Q ran as fast as REDQ3 and 1.4 to 1.5 times faster than REDQ5. Regarding memory efficiency (Tables 6 and 7), Dr.Q was less memory intensive than REDQ3–10.

Figure 7: Average return and average and standard deviation of estimation bias for Dr.Q and REDQ with different ensemble sizes. Scores for Dr.Q are plotted as dashed lines. Scores for REDQ are plotted as solid lines and labelled as "REDQ$N$".
Hopper-v2 Walker2d-v2 Ant-v2 Humanoid-v2
Dr.Q 948 (905) 933 (892) 954 (913) 989 (946)
REDQ2 832 (792) 802 (768) 675 (641) 820 (773)
REDQ3 1052 (1014) 876 (838) 919 (881) 950 (906)
REDQ5 1414 (1368) 1425 (1378) 1373 (1327) 1552 (1503)
REDQ10 2328 (2269) 2339 (2283) 2336 (2277) 2400 (2340)
Table 5: Process times (in msec) required for executing overall loop (e.g., lines 3–10 in Algorithm 2). Times for updating Q-functions (e.g., lines 5–9 in Algorithm 2) are shown in parentheses.
Hopper-v2 Walker2d-v2 Ant-v2 Humanoid-v2
Dr.Q 141,826 146,434 152,578 166,402
REDQ2 139,778 144,386 150,530 164,354
REDQ3 209,667 216,579 225,795 246,531
REDQ5 349,445 360,965 376,325 410,885
REDQ10 698,890 721,930 752,650 821,770
Table 6: Number of parameters of each method.
Hopper-v2 Walker2d-v2 Ant-v2 Humanoid-v2
Dr.Q 73 / 72 / 69 73 / 72 / 69 73 / 72 / 70 73 / 72 / 70
REDQ2 73 / 51 / 51 73 / 51 / 51 73 / 51 / 51 73 / 52 / 52
REDQ3 94 / 64 / 62 94 / 64 / 62 94 / 64 / 62 94 / 65 / 62
REDQ5 136 / 106 / 100 136 / 106 / 100 136 / 106 / 100 136 / 107 / 101
REDQ10 241 / 211 / 200 241 / 211 / 200 241 / 211 / 200 241 / 212 / 201
Table 7: Bottleneck memory consumption (in megabytes) of each method. The three worst bottleneck memory consumptions are shown in the form "1st worst / 2nd worst / 3rd worst."

Appendix C Additional ablation study of Dr.Q

Dropout is introduced into three parts of the Dr.Q algorithm (i.e., lines 6, 8, and 10 of Algorithm 2 in Section 3). In this section, we conducted an ablation study to answer the question "Which dropout introduction contributes to the overall performance improvement of Dr.Q?" We compared Dr.Q with the following variants:
-DO@TargetQ: A Dr.Q variant that does not use dropout in line 6, i.e., dropout is not used in the target Q-functions $Q^{\mathrm{do}}_{\bar{\theta}_i}$ when computing the target.
-DO@CurrentQ: A Dr.Q variant that does not use dropout in line 8, i.e., dropout is not used in the current Q-functions $Q^{\mathrm{do}}_{\theta_i}$ when computing the critic loss.
-DO@PolicyOpt: A Dr.Q variant that does not use dropout in line 10, i.e., dropout is not used in the Q-functions when optimizing the policy.
-DO: A Dr.Q variant that does not use dropout in lines 6, 8, and 10.
In this ablation study, we compared the methods on the basis of average return and estimation bias.

The comparison results (Figure 8) indicate that the use of dropout for both the target and current Q-functions (lines 6 and 8) is effective. Regarding average return, the Dr.Q variants that do not use dropout for either the target Q-functions in line 6 or the current Q-functions in line 8 perform significantly worse than Dr.Q. For example, in Ant, the variants that do not use dropout for the target Q-functions (-DO@TargetQ and -DO) perform much worse than Dr.Q. In addition, in Humanoid, the variants that do not use dropout for the current Q-functions (-DO@CurrentQ and -DO) perform much worse than Dr.Q. Regarding estimation bias, the variants that do not use dropout for the target Q-functions (-DO@TargetQ and -DO) are significantly worse than Dr.Q in all environments.

Figure 8: Additional ablation study result

Appendix D Hyperparameter settings

The hyperparameter settings for each method in the experiments discussed in Section 4 are listed in Table 8. All parameter values, except for (i) the dropout rate for Dr.Q and DUVN and (ii) the in-target minimization parameter $M$ for DUVN, were set according to chen2021randomized. The dropout rate (i) was set through a line search, and $M$ for DUVN (ii) was set according to Harrigan2016DeepRL; moerland2017efficient.

Method Parameter Value
SAC, REDQ, Dr.Q, and DUVN optimizer Adam (kingma2014adam)
learning rate
discount rate (γ) 0.99
target-smoothing coefficient (τ) 0.005
replay buffer size
number of hidden layers for all networks 2
number of hidden units per layer 256
mini-batch size 256
random starting data 5000
UTD ratio (G) 20
REDQ and Dr.Q in-target minimization parameter (M) 2
REDQ ensemble size (N) 10
Dr.Q and DUVN dropout rate 0.01
DUVN in-target minimization parameter (M) 1
Table 8: Hyperparameter settings
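For convenience, the shared settings in Table 8 can be collected into a configuration dictionary. The sketch below lists only the values stated in the table and uses our own key names; entries whose values are not given above (e.g., learning rate, replay buffer size) are omitted.

```python
# Shared hyperparameters from Table 8 (key names are ours).
COMMON_CONFIG = {
    "optimizer": "Adam",
    "discount_gamma": 0.99,
    "target_smoothing_tau": 0.005,
    "num_hidden_layers": 2,
    "hidden_units_per_layer": 256,
    "mini_batch_size": 256,
    "random_starting_data": 5000,
    "utd_ratio_G": 20,
}

# Method-specific settings from Table 8.
METHOD_CONFIG = {
    "REDQ": {"ensemble_size_N": 10, "in_target_min_M": 2},
    "Dr.Q": {"in_target_min_M": 2, "dropout_rate": 0.01},
    "DUVN": {"in_target_min_M": 1, "dropout_rate": 0.01},
}
```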