SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning

07/09/2020, by Kimin Lee, et al.

Model-free deep reinforcement learning (RL) has been successful in a range of challenging domains. However, some issues remain, such as stabilizing the optimization of nonlinear function approximators, preventing error propagation due to the Bellman backup in Q-learning, and exploring efficiently. To mitigate these issues, we present SUNRISE, a simple unified ensemble method, which is compatible with various off-policy RL algorithms. SUNRISE integrates three key ingredients: (a) bootstrap with random initialization, which improves the stability of the learning process by training a diverse ensemble of agents, (b) weighted Bellman backups, which prevent error propagation in Q-learning by reweighting sample transitions based on uncertainty estimates from the ensembles, and (c) an inference method that selects actions using the highest upper-confidence bounds for efficient exploration. Our experiments show that SUNRISE significantly improves the performance of existing off-policy RL algorithms, such as Soft Actor-Critic and Rainbow DQN, for both continuous and discrete control tasks on both low-dimensional and high-dimensional environments. Our training code is available at https://github.com/pokaxpoka/sunrise.


1 Introduction

Model-free reinforcement learning (RL), with high-capacity function approximators such as deep neural networks (DNNs), has been used to solve a variety of sequential decision-making problems, including board games Silver et al. (2017, 2018), video games Mnih et al. (2015); Vinyals et al. (2019), and robotic manipulation Kalashnikov et al. (2018). However, it is well established that these successes come at the cost of high sample complexity Kaiser et al. (2020). Recently, considerable progress has been made toward more sample-efficient model-free RL through improvements in off-policy learning, in both discrete and continuous domains Fujimoto et al. (2018); Haarnoja et al. (2018); Hessel et al. (2018). Nevertheless, substantial challenges remain when training off-policy RL algorithms. First, the learning process is often unstable and sensitive to hyperparameters because optimizing large nonlinear policies such as DNNs is a complex problem Henderson et al. (2018). Second, Q-learning often converges to sub-optimal solutions due to error propagation in the Bellman backup, i.e., errors induced in the target value can lead to an increase in the overall error of the Q-function Kumar et al. (2019, 2020). Third, it is hard to balance exploration and exploitation, which is necessary for efficient RL Chen et al. (2017); Osband et al. (2016) (see Section 2 for further details).

One way to address the above issues with off-policy RL algorithms is to use ensemble methods, which combine multiple models of the value function and (or) policy Chen et al. (2017); Lan et al. (2020); Osband et al. (2016); Wiering and Van Hasselt (2008). One example is the twin-Q trick Fujimoto et al. (2018), which was proposed to handle the overestimation of value functions for continuous control tasks. Bootstrapped DQN Osband et al. (2016) leveraged an ensemble of Q-functions for more effective exploration, and Chen et al. (2017) further improved it by adapting upper-confidence bound (UCB) algorithms Audibert et al. (2009); Auer et al. (2002) based on uncertainty estimates from ensembles. However, most prior work has studied these axes of improvement from ensemble methods in isolation and has ignored the error propagation aspect.

In this paper, we present SUNRISE, a simple unified ensemble method that is compatible with most modern off-policy RL algorithms, such as Q-learning and actor-critic algorithms. SUNRISE consists of the following key ingredients (see Figure 1(a)):


  • Bootstrap with random initialization: To enforce diversity between ensemble agents, we initialize the model parameters randomly and apply different training samples to each agent. Similar to Osband et al. (2016), we find that this simple technique stabilizes the learning process and improves performance by combining diverse agents.

  • Weighted Bellman backup: Errors in the target Q-function can propagate to the current Q-function Kumar et al. (2019, 2020) because the Bellman backup is usually applied with a learned target Q-function (see Section 3.2 for more details). To handle this issue, we reweight the Bellman backup based on uncertainty estimates of the target Q-functions. Because prediction errors can be characterized by uncertainty estimates from ensembles (i.e., the variance of predictions), as shown in Figure 1(b), we find that the proposed method significantly mitigates error propagation in Q-learning.

  • UCB exploration: We define an upper-confidence bound (UCB) based on the mean and variance of Q-functions similar to Chen et al. (2017), and introduce an inference method, which selects actions with highest UCB for efficient exploration. This inference method can encourage exploration by providing a bonus for visiting unseen state-action pairs, where ensembles produce high uncertainty, i.e., high variance (see Figure 1(b)).

We demonstrate the effectiveness of SUNRISE using Soft Actor-Critic (SAC) Haarnoja et al. (2018) for continuous control benchmarks (specifically, OpenAI Gym Brockman et al. (2016) and DeepMind Control Suite Tassa et al. (2018)) and Rainbow DQN Hessel et al. (2018) for discrete control benchmarks (specifically, Atari games Bellemare et al. (2013)). In our experiments, SUNRISE consistently improves the performance of existing off-policy RL methods and outperforms baselines, including model-based RL methods such as POPLIN Wang and Ba (2020), Dreamer Hafner et al. (2020), and SimPLe Kaiser et al. (2020).

2 Related work

Off-policy RL algorithms. Recently, various off-policy RL algorithms have provided large gains in sample-efficiency by reusing past experiences Fujimoto et al. (2018); Haarnoja et al. (2018); Hessel et al. (2018). Rainbow DQN Hessel et al. (2018) achieved state-of-the-art performance on the Atari games Bellemare et al. (2013) by combining several techniques, such as double Q-learning Van Hasselt et al. (2016) and distributional DQN Bellemare et al. (2017). For continuous control tasks, SAC Haarnoja et al. (2018) achieved state-of-the-art sample-efficiency results by incorporating the maximum entropy framework, and Laskin et al. (2020) showed that the sample-efficiency of SAC can be further improved on high-dimensional environments by incorporating data augmentations. Our ensemble method brings orthogonal benefits and is complementary and compatible with these existing state-of-the-art algorithms.

Ensemble methods in RL. Ensemble methods have been studied for different purposes in RL Agarwal et al. (2020); Anschel et al. (2017); Chen et al. (2017); Chua et al. (2018); Kurutach et al. (2018); Osband et al. (2016); Wiering and Van Hasselt (2008). Chua et al. (2018) showed that modeling errors in model-based RL can be reduced using an ensemble of dynamics models, and Kurutach et al. (2018) accelerated policy learning by generating imagined experiences from the ensemble of dynamics models. Bootstrapped DQN Osband et al. (2016) leveraged the ensemble of Q-functions for efficient exploration. However, our method is different in that we propose a unified framework that handles various issues in off-policy RL algorithms.

Stabilizing Q-learning. It has been empirically observed that instability in Q-learning can be caused by applying the Bellman backup on the learned value function Anschel et al. (2017); Fujimoto et al. (2018); Hasselt (2010); Kumar et al. (2019, 2020); Van Hasselt et al. (2016). For discrete control tasks, double Q-learning Hasselt (2010); Van Hasselt et al. (2016) addressed value overestimation by maintaining two independent estimators of the action values, and this idea was later extended to continuous control tasks in TD3 Fujimoto et al. (2018). Recently, Kumar et al. (2020) handled the error propagation issue by reweighting the Bellman backup based on cumulative Bellman errors. While most prior work has improved stability by taking the minimum over Q-functions or by estimating cumulative errors, we propose an alternative approach that also utilizes ensembles to estimate uncertainty and provides more stable, higher-signal-to-noise backups.

Exploration in RL. To balance exploration and exploitation, several methods, such as the maximum entropy frameworks Haarnoja et al. (2018); Ziebart (2010) and exploration bonus rewards Bellemare et al. (2016); Choi et al. (2019); Houthooft et al. (2016); Pathak et al. (2017), have been proposed. Despite the success of these exploration methods, a potential drawback is that agents can focus on irrelevant aspects of the environment because these methods do not depend on the rewards. To handle this issue, Chen et al. (2017) proposed an exploration strategy that considers both best estimates (i.e., mean) and uncertainty (i.e., variance) of Q-functions for discrete control tasks. We further extend this strategy to continuous control tasks and show that it can be combined with other techniques.

(a) SUNRISE: actor-critic version
(b) Uncertainty estimates
Figure 1: (a) Illustration of our framework. We consider independent agents (i.e., no shared parameters between agents) with one replay buffer. (b) Uncertainty estimates from an ensemble of neural networks on a toy regression task (see Appendix C for more experimental details). The black line is the ground truth curve, and the red dots are training samples. The blue lines show the mean and variance of predictions over ten ensemble models. The ensemble can produce well-calibrated uncertainty estimates (i.e., variance) on unseen samples.

3 Sunrise

We present SUNRISE: Simple UNified framework for ReInforcement learning using enSEmbles. In principle, SUNRISE can be used in conjunction with most modern off-policy RL algorithms, such as SAC Haarnoja et al. (2018) and Rainbow DQN Hessel et al. (2018). For ease of exposition, we describe only the SAC version of SUNRISE in the main body. The Rainbow DQN version of SUNRISE follows the same principles and is fully described in Appendix B.

3.1 Preliminaries: reinforcement learning and soft actor-critic

We consider a standard RL framework where an agent interacts with an environment in discrete time. Formally, at each timestep t, the agent receives a state s_t from the environment and chooses an action a_t based on its policy π. The environment returns a reward r_t and the agent transitions to the next state s_{t+1}. The return R_t = Σ_{k=0}^∞ γ^k r_{t+k} is the total accumulated reward from timestep t with a discount factor γ ∈ [0, 1). RL then maximizes the expected return from each state s_t.

SAC Haarnoja et al. (2018) is an off-policy actor-critic method based on the maximum entropy RL framework Ziebart (2010), which encourages robustness to noise and exploration by maximizing a weighted objective of the reward and the policy entropy (see Appendix A for further details). To update the parameters, SAC alternates between a soft policy evaluation and a soft policy improvement. At the soft policy evaluation step, a soft Q-function, which is modeled as a neural network with parameters θ, is updated by minimizing the following soft Bellman residual:

L_critic(θ) = E_{τ_t ∼ B} [ L_Q(τ_t, θ) ],   (1)

L_Q(τ_t, θ) = ( Q_θ(s_t, a_t) − r_t − γ V̄(s_{t+1}) )²,  with  V̄(s_t) = E_{a_t ∼ π_φ} [ Q_θ̄(s_t, a_t) − α log π_φ(a_t | s_t) ],   (2)

where τ_t = (s_t, a_t, r_t, s_{t+1}) is a transition, B is a replay buffer, θ̄ are the delayed parameters, and α is a temperature parameter. At the soft policy improvement step, the policy π_φ with parameter φ is updated by minimizing the following objective:

L_actor(φ) = E_{s_t ∼ B} [ L_π(s_t, φ) ],  with  L_π(s_t, φ) = E_{a_t ∼ π_φ} [ α log π_φ(a_t | s_t) − Q_θ(s_t, a_t) ],   (3)

where the policy π_φ is modeled as a Gaussian with mean and covariance given by neural networks to handle continuous action spaces.
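To make the two SAC objectives concrete, the following is a minimal PyTorch-style sketch of the per-sample soft Bellman residual (2) and the policy objective (3). It assumes placeholder networks q_net(s, a) and q_target(s, a) and a policy object whose sample(s) method returns an action and its log-probability; these interfaces and the default hyperparameter values are illustrative rather than taken from a specific implementation.

import torch

def soft_bellman_residual(q_net, q_target, policy, batch, gamma=0.99, alpha=0.2):
    # batch: tensors for a sampled transition (s_t, a_t, r_t, s_{t+1})
    s, a, r, s_next = batch
    with torch.no_grad():
        a_next, logp_next = policy.sample(s_next)              # a_{t+1} ~ pi_phi(.|s_{t+1})
        v_bar = q_target(s_next, a_next) - alpha * logp_next   # soft value V̄(s_{t+1}) in Eq. (2)
        target = r + gamma * v_bar
    return (q_net(s, a) - target).pow(2)                        # per-sample L_Q(τ_t, θ)

def policy_loss(q_net, policy, s, alpha=0.2):
    a, logp = policy.sample(s)                                  # reparameterized sample
    return (alpha * logp - q_net(s, a)).mean()                  # Eq. (3)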

3.2 Unified ensemble methods for off-policy RL algorithms

In the design of SUNRISE, we integrate the three key ingredients, i.e., bootstrap with random initialization, weighted Bellman backup, and UCB exploration, into a single framework.

Bootstrap with random initialization. Formally, we consider an ensemble of N SAC agents, i.e., {Q_θ_i, π_φ_i}_{i=1}^N, where θ_i and φ_i denote the parameters of the i-th soft Q-function and policy (we remark that each Q-function Q_θ_i has a unique target Q-function Q_θ̄_i). To train the ensemble of agents, we use bootstrap with random initialization Efron (1982); Osband et al. (2016), which enforces diversity between agents through two simple ideas. First, we initialize the model parameters of all agents with random parameter values to induce initial diversity in the models. Second, we apply different training samples to each agent. Specifically, for each SAC agent i at each timestep t, we draw a binary mask m_{t,i} from the Bernoulli distribution with parameter β ∈ (0, 1] and store the masks in the replay buffer. Then, when updating the model parameters of the agents, we multiply the bootstrap mask with each objective function, i.e., we minimize m_{t,i} L_Q(τ_t, θ_i) and m_{t,i} L_π(s_t, φ_i) in (2) and (3). We remark that Osband et al. (2016) applied this simple technique to train an ensemble of DQN agents Mnih et al. (2015) only for discrete control tasks, while we apply it to SAC Haarnoja et al. (2018) and Rainbow DQN Hessel et al. (2018) for both continuous and discrete tasks, with the additional techniques described in the following paragraphs.
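As a sketch of the bootstrap step, the snippet below draws one Bernoulli mask per agent for each new transition and uses the stored masks to weight each agent's per-sample losses; the names (num_agents, beta) and the NumPy-only setting are illustrative assumptions.

import numpy as np

def sample_bootstrap_masks(num_agents, beta=0.5, rng=np.random.default_rng()):
    # m_{t,i} ~ Bernoulli(beta), one mask per ensemble member, stored alongside the transition.
    return rng.binomial(n=1, p=beta, size=num_agents).astype(np.float32)

def masked_objective(per_sample_losses, masks):
    # per_sample_losses, masks: arrays of shape (batch,) for a single agent i.
    # Multiplying by the stored masks gives each agent its own bootstrapped subset of the data.
    return float((masks * per_sample_losses).sum() / max(masks.sum(), 1.0))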

Weighted Bellman backup. Since conventional Q-learning is based on the Bellman backup in (2), it can be affected by error propagation, i.e., errors in the target Q-function get propagated into the Q-function at the current state. Recently, Kumar et al. (2020) showed that this error propagation can cause inconsistency and unstable convergence. To mitigate this issue, for each agent i, we consider a weighted Bellman backup as follows:

L_WQ(τ_t, θ_i) = w(s_{t+1}, a_{t+1}) ( Q_θ_i(s_t, a_t) − r_t − γ V̄(s_{t+1}) )²,   (4)

where τ_t = (s_t, a_t, r_t, s_{t+1}) is a transition, a_{t+1} ∼ π_φ_i(a | s_{t+1}), and w(s, a) is a confidence weight based on an ensemble of target Q-functions:

w(s, a) = σ( −Q̄_std(s, a) · T ) + 0.5,   (5)

where T > 0 is a temperature, σ is the sigmoid function, and Q̄_std(s, a) is the empirical standard deviation of all target Q-functions {Q_θ̄_i}_{i=1}^N. Note that the confidence weight is bounded in (0.5, 1.0] because the standard deviation is always non-negative (we find that it is empirically stable to set the minimum value of the weight to 0.5). The proposed objective L_WQ down-weights sample transitions with high variance across target Q-functions, resulting in a loss function for the Q-updates that has a better signal-to-noise ratio.
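A possible PyTorch-style sketch of the confidence weight in (5) and the resulting weighted, masked loss in (4) is given below; the ensemble is assumed to be a list of target Q-networks callable as q(s, a), and the default temperature value shown here is a placeholder for T.

import torch

def confidence_weight(target_q_ensemble, s_next, a_next, temperature=10.0):
    # Eq. (5): w(s, a) = sigmoid(-Q_std(s, a) * T) + 0.5, which lies in (0.5, 1.0].
    with torch.no_grad():
        qs = torch.stack([q(s_next, a_next) for q in target_q_ensemble], dim=0)  # (N, batch)
        q_std = qs.std(dim=0)                                                     # empirical std across the ensemble
    return torch.sigmoid(-q_std * temperature) + 0.5

def weighted_masked_bellman_loss(td_error, weight, mask):
    # Eq. (4) combined with the bootstrap mask m_{t,i}: high-variance targets contribute less.
    return (mask * weight * td_error.pow(2)).mean()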

1: for each iteration do
2:     for each timestep t do
3:         // UCB exploration
4:         Collect N action samples: A_t = { a_{t,i} ∼ π_φ_i(a | s_t) | i = 1, ..., N }
5:         Choose the action that maximizes UCB: a_t = argmax_{a ∈ A_t} Q_mean(s_t, a) + λ Q_std(s_t, a)
6:         Collect state s_{t+1} and reward r_t from the environment by taking action a_t
7:         Sample bootstrap masks M_t = { m_{t,i} ∼ Bernoulli(β) | i = 1, ..., N }
8:         Store transition τ_t = (s_t, a_t, r_t, s_{t+1}) and masks M_t in the replay buffer B
9:     end for
10:     // Update agents via bootstrap and weighted Bellman backup
11:     for each gradient step do
12:         Sample a random minibatch {(τ_j, M_j)} from B
13:         for each agent i do
14:             Update the Q-function by minimizing Σ_j m_{j,i} L_WQ(τ_j, θ_i)
15:             Update the policy by minimizing Σ_j m_{j,i} L_π(s_j, φ_i)
16:         end for
17:     end for
18: end for
Algorithm 1 SUNRISE: SAC version

UCB exploration. The ensemble can also be leveraged for efficient exploration Chen et al. (2017); Osband et al. (2016) because it can express higher uncertainty on unseen samples. Motivated by this, following the idea of Chen et al. (2017), we consider an optimism-based exploration that chooses the action that maximizes

a_t = argmax_a { Q_mean(s_t, a) + λ Q_std(s_t, a) },   (6)

where Q_mean(s, a) and Q_std(s, a) are the empirical mean and standard deviation of all Q-functions {Q_θ_i}_{i=1}^N, and λ > 0 is a hyperparameter. This inference method can encourage exploration by adding an exploration bonus (i.e., the standard deviation Q_std) for visiting unseen state-action pairs, similar to the UCB algorithm Auer et al. (2002). We remark that this inference method was originally proposed in Chen et al. (2017) for efficient exploration in discrete action spaces. However, in continuous action spaces, finding the action that maximizes the UCB is not straightforward. To handle this issue, we propose a simple approximation scheme, which first generates a candidate action set from the ensemble policies {π_φ_i}_{i=1}^N, and then chooses the action that maximizes the UCB (Line 4 in Algorithm 1). For evaluation, we approximate the maximum a posteriori action by averaging the means of the Gaussian distributions modeled by each ensemble policy. The full procedure is summarized in Algorithm 1.
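The snippet below sketches this candidate-based UCB action selection for continuous actions and the averaged-mean action used at evaluation time. It assumes a batch of size one, policies with sample(state) and mean(state) methods, and Q-networks callable as q(states, actions); these interfaces and the default λ are illustrative assumptions.

import torch

def ucb_action(policies, q_ensemble, state, lam=1.0):
    # state: tensor of shape (1, state_dim); one candidate action is drawn per ensemble policy.
    with torch.no_grad():
        candidates = torch.stack([pi.sample(state)[0].squeeze(0) for pi in policies], dim=0)  # (N, act_dim)
        states = state.expand(candidates.shape[0], -1)
        qs = torch.stack([q(states, candidates) for q in q_ensemble], dim=0)                  # (N, N)
        ucb = qs.mean(dim=0) + lam * qs.std(dim=0)                                             # Eq. (6) over the candidates
    return candidates[ucb.argmax()]

def evaluation_action(policies, state):
    # Approximate MAP action: average the Gaussian means of the ensemble policies.
    with torch.no_grad():
        return torch.stack([pi.mean(state) for pi in policies], dim=0).mean(dim=0)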

4 Experimental results

We designed our experiments to answer the following questions:


  • Can SUNRISE improve off-policy RL algorithms, such as SAC Haarnoja et al. (2018) and Rainbow DQN Hessel et al. (2018), for both continuous (see Table 1 and Table 2) and discrete (see Table 3) control tasks?

  • Does SUNRISE mitigate error propagation (see Figure 2(a))?

  • How does ensemble size affect the performance of SUNRISE (see Figure 2(b) and Figure 2(c))?

  • What is the contribution of each technique in SUNRISE (see Table 4)?

4.1 Setups

Continuous control tasks. We evaluate SUNRISE on several continuous control tasks using simulated robots from OpenAI Gym Brockman et al. (2016) and DeepMind Control Suite Tassa et al. (2018). For OpenAI Gym experiments with proprioceptive inputs (e.g., positions and velocities), we compare to PETS Chua et al. (2018), a state-of-the-art model-based RL method based on ensembles of dynamics models; POPLIN-P Wang and Ba (2020), a state-of-the-art model-based RL method which uses a policy network to generate actions for planning; POPLIN-A Wang and Ba (2020), a variant of POPLIN-P which adds noise in the action space; METRPO Kurutach et al. (2018), a hybrid RL method which augments TRPO Schulman et al. (2015) using ensembles of dynamics models; and two state-of-the-art model-free RL methods, TD3 Fujimoto et al. (2018) and SAC Haarnoja et al. (2018). For our method, we consider a combination of SAC and SUNRISE, as described in Algorithm 1. Following the setup in POPLIN Wang and Ba (2020), we report the mean and standard deviation across four runs after 200K timesteps on four complex environments: Cheetah, Walker, Hopper, and Ant. More experimental details and learning curves are in Appendix D.

For DeepMind Control Suite with image inputs, we compare to PlaNet Hafner et al. (2019), a model-based RL method which learns a latent dynamics model and uses it for planning; Dreamer Hafner et al. (2020), a hybrid RL method which utilizes the latent dynamics model to generate synthetic roll-outs; SLAC Lee et al. (2019), a hybrid RL method which combines the latent dynamics model with SAC; and three state-of-the-art model-free RL methods which apply contrastive learning (CURL Srinivas et al. (2020)) or data augmentation (RAD Laskin et al. (2020) and DrQ Kostrikov et al. (2020)) to SAC. For our method, we consider a combination of RAD (i.e., SAC with random crop) and SUNRISE. Following the setup in RAD Laskin et al. (2020), we report the mean and standard deviation across five runs after 100k (i.e., low sample regime) and 500k (i.e., asymptotically optimal regime) environment steps on six environments: Finger-spin, Cartpole-swing, Reacher-easy, Cheetah-run, Walker-walk, and Cup-catch. More experimental details and learning curves are in Appendix E.

Discrete control benchmarks. For discrete control tasks, we demonstrate the effectiveness of SUNRISE on several Atari games Bellemare et al. (2013). We compare to SimPLe Kaiser et al. (2020), a hybrid RL method which updates the policy only using samples generated by a learned dynamics model; Rainbow DQN Hessel et al. (2018) with hyperparameters modified for sample-efficiency van Hasselt et al. (2019); a random agent Kaiser et al. (2020); CURL Srinivas et al. (2020), a model-free RL method which applies contrastive learning to Rainbow DQN; and human performance as reported in Kaiser et al. (2020) and van Hasselt et al. (2019). Following the setups in SimPLe Kaiser et al. (2020), we report the mean across three runs after 100K interactions (i.e., 400K frames with an action repeat of 4). For our method, we consider a combination of the sample-efficient version of Rainbow DQN van Hasselt et al. (2019) and SUNRISE (see Algorithm 3 in Appendix B). More experimental details and learning curves are in Appendix F.

For our method, we do not alter any hyperparameters of the original off-policy RL algorithms and train five ensemble agents. There are only three additional hyperparameters: the Bernoulli parameter β for bootstrap, the temperature T for the weighted Bellman backup, and the penalty parameter λ for UCB exploration; we provide details in Appendices D, E, and F.

4.2 Comparative evaluation

Method | Cheetah | Walker | Hopper | Ant
PETS Chua et al. (2018) | 2288.4 ± 1019.0 | 282.5 ± 501.6 | 114.9 ± 621.0 | 1165.5 ± 226.9
POPLIN-A Wang and Ba (2020) | 1562.8 ± 1136.7 | -105.0 ± 249.8 | 202.5 ± 962.5 | 1148.4 ± 438.3
POPLIN-P Wang and Ba (2020) | 4235.0 ± 1133.0 | 597.0 ± 478.8 | 2055.2 ± 613.8 | 2330.1 ± 320.9
METRPO Kurutach et al. (2018) | 2283.7 ± 900.4 | -1609.3 ± 657.5 | 1272.5 ± 500.9 | 282.2 ± 18.0
TD3 Fujimoto et al. (2018) | 3015.7 ± 969.8 | -516.4 ± 812.2 | 1816.6 ± 994.8 | 870.1 ± 283.8
SAC Haarnoja et al. (2018) | 4035.7 ± 268.0 | -382.5 ± 849.5 | 2020.6 ± 692.9 | 836.5 ± 68.4
SUNRISE | 5370.6 ± 483.1 | 1926.5 ± 694.8 | 2601.9 ± 306.5 | 1627.0 ± 292.7
Table 1: Performance on OpenAI Gym at 200K timesteps. The results show the mean and standard deviation averaged over four runs, and the best results are indicated in bold. For baseline methods, we report the best numbers reported in POPLIN Wang and Ba (2020).
500K step | PlaNet Hafner et al. (2019) | Dreamer Hafner et al. (2020) | SLAC Lee et al. (2019) | CURL Srinivas et al. (2020) | DrQ Kostrikov et al. (2020) | RAD Laskin et al. (2020) | SUNRISE
Finger-spin | 561 ± 284 | 796 ± 183 | 673 ± 92 | 926 ± 45 | 938 ± 103 | 975 ± 16 | 983 ± 1
Cartpole-swing | 475 ± 71 | 762 ± 27 | - | 845 ± 45 | 868 ± 10 | 873 ± 3 | 876 ± 4
Reacher-easy | 210 ± 44 | 793 ± 164 | - | 929 ± 44 | 942 ± 71 | 916 ± 49 | 982 ± 3
Cheetah-run | 305 ± 131 | 570 ± 253 | 640 ± 19 | 518 ± 28 | 660 ± 96 | 624 ± 10 | 678 ± 46
Walker-walk | 351 ± 58 | 897 ± 49 | 842 ± 51 | 902 ± 43 | 921 ± 45 | 938 ± 9 | 953 ± 13
Cup-catch | 460 ± 380 | 879 ± 87 | 852 ± 71 | 959 ± 27 | 963 ± 9 | 966 ± 9 | 969 ± 5
100K step | PlaNet | Dreamer | SLAC | CURL | DrQ | RAD | SUNRISE
Finger-spin | 136 ± 216 | 341 ± 70 | 693 ± 141 | 767 ± 56 | 901 ± 104 | 811 ± 146 | 905 ± 57
Cartpole-swing | 297 ± 39 | 326 ± 27 | - | 582 ± 146 | 759 ± 92 | 373 ± 90 | 591 ± 55
Reacher-easy | 20 ± 50 | 314 ± 155 | - | 538 ± 233 | 601 ± 213 | 567 ± 54 | 722 ± 50
Cheetah-run | 138 ± 88 | 235 ± 137 | 319 ± 56 | 299 ± 48 | 344 ± 67 | 381 ± 79 | 413 ± 35
Walker-walk | 224 ± 48 | 277 ± 12 | 361 ± 73 | 403 ± 24 | 612 ± 164 | 641 ± 89 | 667 ± 147
Cup-catch | 0 ± 0 | 246 ± 174 | 512 ± 110 | 769 ± 43 | 913 ± 53 | 666 ± 181 | 633 ± 241
Table 2: Performance on DeepMind Control Suite at 100K and 500K environment steps. The results show the mean and standard deviation averaged over five runs, and the best results are indicated in bold. For baseline methods, we report the best numbers reported in prior works Kostrikov et al. (2020); Laskin et al. (2020).

OpenAI Gym. Table 1 shows the average returns of evaluation roll-outs for all methods. SUNRISE consistently improves the performance of SAC across all environments and outperforms the state-of-the-art POPLIN-P on all environments except Ant. In particular, the average returns are improved from 597.0 to 1926.5 compared to POPLIN-P on the Walker environment, which most model-based RL methods cannot solve efficiently. We remark that SUNRISE is more compute-efficient than modern model-based RL methods, such as POPLIN and PETS, because they also utilize ensembles (of dynamics models) and perform planning to select actions. Namely, SUNRISE is simple to implement, computationally efficient, and readily parallelizable.

DeepMind Control Suite. As shown in Table 2, SUNRISE also consistently improves the performance of RAD (i.e., SAC with random crop) on all environments from DeepMind Control Suite. This implies that the proposed method can be useful for high-dimensional and complex input observations. Moreover, our method achieves state-of-the-art performances in almost all environments against existing pixel-based RL methods. We remark that SUNRISE can also be combined with DrQ, and expect that it can achieve better performances on Cartpole-swing and Cup-catch at 100K environment steps.

Atari games. We also evaluate SUNRISE on discrete control tasks using Rainbow DQN on Atari games. Table 3 shows that SUNRISE improves the performance of Rainbow in almost all environments, and achieves state-of-the-art performance on 12 out of 26 environments. Here, we remark that SUNRISE is also compatible with CURL, which could enable even better performance. These results demonstrate that SUNRISE is a general approach, and can be applied to various off-policy RL algorithms.

Game Human Random SimPLe Kaiser et al. (2020) CURL Srinivas et al. (2020) Rainbow van Hasselt et al. (2019) SUNRISE
Alien 7127.7 227.8 616.9 558.2 789.0 872.0
Amidar 1719.5 5.8 88.0 142.1 118.5 122.6
Assault 742.0 222.4 527.2 600.6 413.0 594.8
Asterix 8503.3 210.0 1128.3 734.5 533.3 755.0
BankHeist 753.1 14.2 34.2 131.6 97.7 266.7
BattleZone 37187.5 2360.0 5184.4 14870.0 7833.3 15700.0
Boxing 12.1 0.1 9.1 1.2 0.6 6.7
Breakout 30.5 1.7 16.4 4.9 2.3 1.8
ChopperCommand 7387.8 811.0 1246.9 1058.5 590.0 1040.0
CrazyClimber 35829.4 10780.5 62583.6 12146.5 25426.7 22230.0
DemonAttack 1971.0 152.1 208.1 817.6 688.2 919.8
Freeway 29.6 0.0 20.3 26.7 28.7 30.2
Frostbite 4334.7 65.2 254.7 1181.3 1478.3 2026.7
Gopher 2412.5 257.6 771.0 669.3 348.7 654.7
Hero 30826.4 1027.0 2656.6 6279.3 3675.7 8072.5
Jamesbond 302.8 29.0 125.3 471.0 300.0 390.0
Kangaroo 3035.0 52.0 323.1 872.5 1060.0 2000.0
Krull 2665.5 1598.0 4539.9 4229.6 2592.1 3087.2
KungFuMaster 22736.3 258.5 17257.2 14307.8 8600.0 10306.7
MsPacman 6951.6 307.3 1480.0 1465.5 1118.7 1482.3
Pong 14.6 -20.7 12.8 -16.5 -19.0 -19.3
PrivateEye 69571.3 24.9 58.3 218.4 97.8 100.0
Qbert 13455.0 163.9 1288.8 1042.4 646.7 1830.8
RoadRunner 7845.0 11.5 5640.6 5661.0 9923.3 11913.3
Seaquest 42054.7 68.4 683.3 384.5 396.0 570.7
UpNDown 11693.2 533.4 3350.3 2955.2 3816.0 5074.0
Table 3: Performance on Atari games at 100K interactions. The results show the scores averaged over three runs, and the best results are indicated in bold. For baseline methods, we report the best numbers reported in prior works Kaiser et al. (2020); van Hasselt et al. (2019).

OpenAI Gym with stochastic rewards. To verify the effectiveness of SUNRISE in mitigating error propagation, following Kumar et al. (2019), we evaluate on a modified version of the OpenAI Gym environments with stochastic rewards, obtained by adding zero-mean Gaussian noise to the reward function: r̂(s, a) = r(s, a) + z, where z is Gaussian noise. This increases the noise in value estimates. Following Kumar et al. (2019), we only inject this noisy reward during training and report the deterministic ground-truth reward during evaluation. For our method, we also consider a variant of SUNRISE which selects actions without UCB exploration, to isolate the effect of the proposed weighted Bellman backup. Specifically, we select a policy index uniformly at random and generate actions from the selected policy for the duration of that episode, similar to Bootstrapped DQN Osband et al. (2016) (see Algorithm 2 in Appendix A). Our method is compared with DisCor Kumar et al. (2020), which improves SAC by reweighting the Bellman backup based on estimated cumulative Bellman errors (see Appendix G for more details).
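For reference, a noisy-reward training environment of this kind could be wrapped as sketched below, assuming the OpenAI Gym API; the environment id and noise scale sigma are placeholders, since the exact values are not given here, and evaluation would use the unwrapped environment with the ground-truth reward.

import gym
import numpy as np

class NoisyRewardWrapper(gym.RewardWrapper):
    """Adds zero-mean Gaussian noise to the reward during training only."""

    def __init__(self, env, sigma=1.0):  # sigma: placeholder noise scale
        super().__init__(env)
        self.sigma = sigma

    def reward(self, reward):
        return reward + np.random.normal(0.0, self.sigma)

# train_env = NoisyRewardWrapper(gym.make("HalfCheetah-v2"), sigma=1.0)
# eval_env = gym.make("HalfCheetah-v2")  # deterministic ground-truth reward for evaluation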

Figure 2(a) shows the learning curves of all methods on the Cheetah environment with stochastic rewards. SUNRISE outperforms baselines such as SAC and DisCor, even when only using the proposed weighted Bellman backup (green curve). This implies that errors in the target Q-function can be effectively characterized by the proposed confidence weight in (5). By additionally utilizing UCB exploration, both the sample-efficiency and the asymptotic performance of SUNRISE are further improved (blue curve). More evaluation results with DisCor on other environments are available in Appendix G, where the overall trend is similar.

(a) Cheetah with stochastic rewards
(b) Cheetah
(c) Ant
Figure 2: (a) Comparison with DisCor on the modified Cheetah environment, where we add Gaussian noise to the reward function to increase the error in value estimates. (b, c) Learning curves of SUNRISE with varying ensemble size on the (b) Cheetah and (c) Ant environments.
Atari games
Method | BOOT | WB | UCB | Seaquest | ChopperCommand | Gopher
Rainbow | - | - | - | 396.0 ± 37.7 | 590.0 ± 127.3 | 348.7 ± 43.8
SUNRISE | ✓ | - | - | 547.3 ± 110.0 | 590.0 ± 85.2 | 222.7 ± 34.7
SUNRISE | ✓ | ✓ | - | 550.7 ± 67.0 | 860.0 ± 235.5 | 377.3 ± 195.6
SUNRISE | ✓ | - | ✓ | 477.3 ± 48.5 | 623.3 ± 216.4 | 286.0 ± 39.2
SUNRISE | ✓ | ✓ | ✓ | 570.7 ± 43.6 | 1040.0 ± 77.9 | 654.7 ± 218.0
OpenAI Gym
Method | BOOT | WB | UCB | Cheetah | Hopper | Ant
SAC | - | - | - | 4035.7 ± 268.0 | 2020.6 ± 692.9 | 836.5 ± 68.4
SUNRISE | ✓ | - | - | 4213.5 ± 249.1 | 2378.3 ± 538.0 | 1033.4 ± 106.0
SUNRISE | ✓ | ✓ | - | 5197.4 ± 448.1 | 2586.5 ± 317.0 | 1164.6 ± 488.4
SUNRISE | ✓ | - | ✓ | 4789.3 ± 192.3 | 2393.2 ± 316.9 | 1684.8 ± 1060.9
SUNRISE | ✓ | ✓ | ✓ | 5370.6 ± 483.1 | 2601.9 ± 306.5 | 1627.0 ± 292.7
Table 4: Contribution of each technique in SUNRISE, i.e., bootstrap with random initialization (BOOT), weighted Bellman backup (WB), and UCB exploration, on several environments from OpenAI Gym and Atari games at 200K timesteps and 100K interactions, respectively. The results show the mean and standard deviation averaged over four runs (OpenAI Gym) and three runs (Atari games).

4.3 Ablation study

Effects of ensemble size. We analyze the effects of ensemble size on the Cheetah and Ant environments from OpenAI Gym. Figure 2(b) and Figure 2(c) show that the performance can be improved by increasing the ensemble size, but the improvement saturates around five agents. Thus, we use five ensemble agents for all experiments. More experimental results on the Hopper and Walker environments are available in Appendix D, where the overall trend is similar.

Contribution of each technique. In order to verify the individual effects of each technique in SUNRISE, we incrementally apply our techniques. For SUNRISE without UCB exploration, we use the random inference proposed in Bootstrapped DQN Osband et al. (2016), which selects a policy index uniformly at random and generates actions from the selected actor for the duration of that episode (see Algorithm 2 in Appendix A). Table 4 shows the performance of SUNRISE on several environments from OpenAI Gym and Atari games. First, we remark that the performance gain from SUNRISE with bootstrap only, which corresponds to a naive extension of Bootstrapped DQN Osband et al. (2016), is marginal compared to the other techniques, such as the weighted Bellman backup and UCB exploration. However, by utilizing all proposed techniques, we obtain the best performance in almost all environments. This shows that all proposed techniques can be integrated and that they are indeed largely complementary.

5 Conclusion

In this paper, we present SUNRISE, a simple unified ensemble method that is compatible with various off-policy RL algorithms. In particular, SUNRISE integrates bootstrap with random initialization, weighted Bellman backups, and UCB exploration to handle various issues in off-policy RL algorithms. Our experiments show that SUNRISE consistently improves the performance of existing off-policy RL algorithms, such as Soft Actor-Critic and Rainbow DQN, and outperforms state-of-the-art RL algorithms for both continuous and discrete control tasks on both low-dimensional and high-dimensional environments. We believe that SUNRISE could be useful for other relevant topics such as sim-to-real transfer Tobin et al. (2017), imitation learning Torabi et al. (2018), offline RL Agarwal et al. (2020), and planning Srinivas et al. (2018); Tamar et al. (2016).

Acknowledgements

This research is supported in part by ONR PECASE N000141612723, Tencent, and Berkeley Deep Drive. We would like to thank Hao Liu for improving the presentation and giving helpful feedback. We would also like to thank Aviral Kumar and Kai Arulkumaran for providing tips on implementation of DisCor and Rainbow.

Broader Impact

Despite impressive progress in deep RL over the last few years, a number of issues prevent RL algorithms from being deployed to real-world problems like autonomous navigation Bojarski et al. (2016) and industrial robotic manipulation Kalashnikov et al. (2018). One issue, among several others, is training stability. RL algorithms are often sensitive to hyperparameters, noisy, and prone to converging to suboptimal policies. Our work addresses the stability issue by providing a unified framework for utilizing ensembles during training. The resulting algorithm significantly improves the stability of prior methods. Though we demonstrate results on common RL benchmarks, SUNRISE could be one component, of many, that helps stabilize training RL policies in real-world tasks like robotically assisted elderly care, automation of household tasks, and robotic assembly in manufacturing.

One downside to the SUNRISE method is that it requires additional compute proportional to the ensemble size. A concern is that developing methods that require increased computing resources to improve performance and deploying them at scale could lead to increased carbon emissions due to the energy required to power large compute clusters Schwartz (2020). For this reason, it is also important to develop complementary methods for training large networks energy-efficiently Howard et al. (2017).

References

  • R. Agarwal, D. Schuurmans, and M. Norouzi (2020) An optimistic perspective on offline reinforcement learning. In ICML, Cited by: §2, §5.
  • O. Anschel, N. Baram, and N. Shimkin (2017) Averaged-dqn: variance reduction and stabilization for deep reinforcement learning. In ICML, Cited by: §2, §2.
  • J. Audibert, R. Munos, and C. Szepesvári (2009) Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science 410 (19), pp. 1876–1902. Cited by: §1.
  • P. Auer, N. Cesa-Bianchi, and P. Fischer (2002) Finite-time analysis of the multiarmed bandit problem. Machine learning 47 (2-3), pp. 235–256. Cited by: §B.2, §1, §3.2.
  • M. G. Bellemare, W. Dabney, and R. Munos (2017) A distributional perspective on reinforcement learning. In ICML, Cited by: §B.1, §2.
  • M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling (2013) The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, pp. 253–279. Cited by: §1, §2, §4.1.
  • M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos (2016) Unifying count-based exploration and intrinsic motivation. In NeurIPS, Cited by: §2.
  • M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. (2016) End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316. Cited by: Broader Impact.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) Openai gym. arXiv preprint arXiv:1606.01540. Cited by: Appendix D, §1, §4.1.
  • R. Y. Chen, S. Sidor, P. Abbeel, and J. Schulman (2017) UCB exploration via q-ensembles. arXiv preprint arXiv:1706.01502. Cited by: §B.2, Appendix D, Appendix E, Appendix F, 3rd item, §1, §1, §2, §2, §3.2.
  • J. Choi, Y. Guo, M. Moczulski, J. Oh, N. Wu, M. Norouzi, and H. Lee (2019) Contingency-aware exploration in reinforcement learning. In ICLR, Cited by: §2.
  • K. Chua, R. Calandra, R. McAllister, and S. Levine (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In NeurIPS, Cited by: Appendix D, §2, §4.1, Table 1.
  • B. Efron (1982) The jackknife, the bootstrap, and other resampling plans. Vol. 38, Siam. Cited by: §B.2, §3.2.
  • S. Fujimoto, H. Van Hoof, and D. Meger (2018) Addressing function approximation error in actor-critic methods. In ICML, Cited by: §1, §1, §2, §2, §4.1, Table 1.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, Cited by: Appendix A, §1, §1, §2, §2, §3.1, §3.2, §3, 1st item, §4.1, Table 1.
  • D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi (2020) Dream to control: learning behaviors by latent imagination. In ICLR, Cited by: §1, §4.1, Table 2.
  • D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson (2019) Learning latent dynamics for planning from pixels. In ICML, Cited by: §4.1, Table 2.
  • H. V. Hasselt (2010) Double q-learning. In NeurIPS, Cited by: §2.
  • P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger (2018) Deep reinforcement learning that matters. In AAAI, Cited by: §1.
  • M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver (2018) Rainbow: combining improvements in deep reinforcement learning. In AAAI, Cited by: §B.1, §1, §1, §2, §3.2, §3, 1st item, §4.1.
  • R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel (2016) Vime: variational information maximizing exploration. In NeurIPS, Cited by: §2.
  • A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: Broader Impact.
  • L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, et al. (2020) Model-based reinforcement learning for atari. In ICLR, Cited by: §1, §1, §4.1, Table 3.
  • D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al. (2018) Qt-opt: scalable deep reinforcement learning for vision-based robotic manipulation. In CoRL, Cited by: §1, Broader Impact.
  • I. Kostrikov, D. Yarats, and R. Fergus (2020) Image augmentation is all you need: regularizing deep reinforcement learning from pixels. arXiv preprint arXiv:2004.13649. Cited by: §4.1, Table 2.
  • A. Kumar, J. Fu, M. Soh, G. Tucker, and S. Levine (2019) Stabilizing off-policy q-learning via bootstrapping error reduction. In NeurIPS, Cited by: 2nd item, §1, §2, §4.2.
  • A. Kumar, A. Gupta, and S. Levine (2020) DisCor: corrective feedback in reinforcement learning via distribution correction. arXiv preprint arXiv:2003.07305. Cited by: §B.2, Appendix G, 2nd item, §1, §2, §3.2, §4.2.
  • T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel (2018) Model-ensemble trust-region policy optimization. In ICLR, Cited by: §2, §4.1, Table 1.
  • Q. Lan, Y. Pan, A. Fyshe, and M. White (2020) Maxmin q-learning: controlling the estimation bias of q-learning. In ICLR, Cited by: §1.
  • M. Laskin, K. Lee, A. Stooke, L. Pinto, P. Abbeel, and A. Srinivas (2020) Reinforcement learning with augmented data. arXiv preprint arXiv:2004.14990. Cited by: Appendix E, §2, §4.1, Table 2.
  • A. X. Lee, A. Nagabandi, P. Abbeel, and S. Levine (2019) Stochastic latent actor-critic: deep reinforcement learning with a latent variable model. arXiv preprint arXiv:1907.00953. Cited by: §4.1, Table 2.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §B.1, §1, §3.2.
  • I. Osband, C. Blundell, A. Pritzel, and B. Van Roy (2016) Deep exploration via bootstrapped dqn. In NeurIPS, Cited by: Appendix A, §B.2, §B.2, Appendix D, Appendix E, Appendix F, 1st item, §1, §1, §2, §3.2, §3.2, §4.2, §4.3, footnote 6.
  • D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell (2017) Curiosity-driven exploration by self-supervised prediction. In ICML, Cited by: §2.
  • T. Schaul, J. Quan, I. Antonoglou, and D. Silver (2016) Prioritized experience replay. In ICLR, Cited by: §B.2.
  • J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In ICML, Cited by: §4.1.
  • M. O. Schwartz (2020) Groundwater contamination associated with a potential nuclear waste repository at yucca mountain, usa. Bulletin of Engineering Geology and the Environment 79 (2), pp. 1125–1136. Cited by: Broader Impact.
  • D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al. (2018) A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science 362 (6419), pp. 1140–1144. Cited by: §1.
  • D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. (2017) Mastering the game of go without human knowledge. Nature 550 (7676), pp. 354. Cited by: §1.
  • A. Srinivas, A. Jabri, P. Abbeel, S. Levine, and C. Finn (2018) Universal planning networks. arXiv preprint arXiv:1804.00645. Cited by: §5.
  • A. Srinivas, M. Laskin, and P. Abbeel (2020) CURL: contrastive unsupervised representations for reinforcement learning. In ICML, Cited by: §4.1, §4.1, Table 2, Table 3.
  • A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel (2016) Value iteration networks. In NeurIPS, Cited by: §5.
  • Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, et al. (2018) Deepmind control suite. arXiv preprint arXiv:1801.00690. Cited by: §1, §4.1.
  • J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In IROS, Cited by: §5.
  • F. Torabi, G. Warnell, and P. Stone (2018) Behavioral cloning from observation. In IJCAI, Cited by: §5.
  • H. Van Hasselt, A. Guez, and D. Silver (2016) Deep reinforcement learning with double q-learning. In AAAI, Cited by: §B.1, §2, §2.
  • H. P. van Hasselt, M. Hessel, and J. Aslanides (2019) When to use parametric models in reinforcement learning? In NeurIPS, Cited by: §4.1, Table 3.
  • O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al. (2019) Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature 575 (7782), pp. 350–354. Cited by: §1.
  • T. Wang and J. Ba (2020) Exploring model-based planning with policy networks. In ICLR, Cited by: Appendix D, §1, §4.1, Table 1.
  • T. Wang, X. Bao, I. Clavera, J. Hoang, Y. Wen, E. Langlois, S. Zhang, G. Zhang, P. Abbeel, and J. Ba (2019) Benchmarking model-based reinforcement learning. arXiv preprint arXiv:1907.02057. Cited by: footnote 5.
  • M. A. Wiering and H. Van Hasselt (2008) Ensemble algorithms in reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics 38 (4), pp. 930–936. Cited by: §1, §2.
  • D. Yarats, A. Zhang, I. Kostrikov, B. Amos, J. Pineau, and R. Fergus (2019) Improving sample efficiency in model-free reinforcement learning from images. arXiv preprint arXiv:1910.01741. Cited by: Appendix E.
  • B. D. Ziebart (2010) Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Cited by: §2, §3.1.

Appendix A SUNRISE: Soft actor-critic

Background. SAC Haarnoja et al. (2018) is a state-of-the-art off-policy algorithm for continuous control problems. SAC learns a policy π_φ and a critic Q_θ, and aims to maximize a weighted objective of the reward and the policy entropy, E_{s_t, a_t ∼ π_φ} [ Σ_t r(s_t, a_t) + α H(π_φ(· | s_t)) ]. To update the parameters, SAC alternates between a soft policy evaluation and a soft policy improvement. At the soft policy evaluation step, a soft Q-function, which is modeled as a neural network with parameters θ, is updated by minimizing the following soft Bellman residual:

L_Q(τ_t, θ) = ( Q_θ(s_t, a_t) − r_t − γ V̄(s_{t+1}) )²,  with  V̄(s_t) = E_{a_t ∼ π_φ} [ Q_θ̄(s_t, a_t) − α log π_φ(a_t | s_t) ],

where τ_t = (s_t, a_t, r_t, s_{t+1}) is a transition, B is a replay buffer, θ̄ are the delayed parameters, and α is a temperature parameter. At the soft policy improvement step, the policy π_φ with parameter φ is updated by minimizing the following objective:

L_π(s_t, φ) = E_{a_t ∼ π_φ} [ α log π_φ(a_t | s_t) − Q_θ(s_t, a_t) ].

We remark that this corresponds to minimizing the Kullback-Leibler divergence between the policy and a Boltzmann distribution induced by the current soft Q-function.

SUNRISE without UCB exploration. For SUNRISE without UCB exploration, we use the random inference proposed in Bootstrapped DQN Osband et al. (2016), which selects a policy index uniformly at random and generates actions from the selected actor for the duration of that episode (see Line 3 in Algorithm 2).

1: for each iteration do
2:     // Random inference
3:     Select a policy index î uniformly at random: î ∼ Uniform{1, ..., N}
4:     for each timestep t do
5:         Get the action from the selected policy: a_t ∼ π_φ_î(a | s_t)
6:         Collect state s_{t+1} and reward r_t from the environment by taking action a_t
7:         Sample bootstrap masks M_t = { m_{t,i} ∼ Bernoulli(β) | i = 1, ..., N }
8:         Store transition τ_t = (s_t, a_t, r_t, s_{t+1}) and masks M_t in the replay buffer B
9:     end for
10:     // Update agents via bootstrap and weighted Bellman backup
11:     for each gradient step do
12:         Sample a random minibatch {(τ_j, M_j)} from B
13:         for each agent i do
14:             Update the Q-function by minimizing Σ_j m_{j,i} L_WQ(τ_j, θ_i)
15:             Update the policy by minimizing Σ_j m_{j,i} L_π(s_j, φ_i)
16:         end for
17:     end for
18: end for
Algorithm 2 SUNRISE: SAC version (random inference)

Appendix B Extension to Rainbow DQN

B.1 Preliminaries: Rainbow DQN

Background. The DQN algorithm Mnih et al. (2015) learns a Q-function, which is modeled as a neural network with parameters θ, by minimizing the following Bellman residual:

L_DQN(θ) = E_{τ_t ∼ B} [ ( Q_θ(s_t, a_t) − r_t − γ max_a Q_θ̄(s_{t+1}, a) )² ],   (7)

where τ_t = (s_t, a_t, r_t, s_{t+1}) is a transition, B is a replay buffer, and θ̄ are the delayed parameters. Even though Rainbow DQN integrates several techniques, such as double Q-learning Van Hasselt et al. (2016) and distributional DQN Bellemare et al. (2017), applying SUNRISE to Rainbow DQN can be described based on the standard DQN algorithm. For ease of exposition, we refer the reader to Hessel et al. (2018) for more detailed explanations of Rainbow DQN.

B.2 SUNRISE: Rainbow DQN

Bootstrap with random initialization. Formally, we consider an ensemble of N Q-functions, i.e., {Q_θ_i}_{i=1}^N, where θ_i denotes the parameters of the i-th Q-function (here, we remark that each Q-function has a unique target Q-function). To train the ensemble of Q-functions, we use bootstrap with random initialization Efron (1982); Osband et al. (2016), which enforces diversity between Q-functions through two simple ideas: First, we initialize the model parameters of all Q-functions with random parameter values to induce initial diversity in the models. Second, we apply different samples to train each Q-function. Specifically, for each Q-function i at each timestep t, we draw a binary mask m_{t,i} from the Bernoulli distribution with parameter β ∈ (0, 1] and store the masks in the replay buffer. Then, when updating the model parameters of the Q-functions, we multiply the bootstrap mask with each objective function.

Weighted Bellman backup. Since conventional Q-learning is based on the Bellman backup in (7), it can be affected by error propagation, i.e., errors in the target Q-function get propagated into the Q-function at the current state. Recently, Kumar et al. (2020) showed that this error propagation can cause inconsistency and unstable convergence. To mitigate this issue, for each Q-function i, we consider a weighted Bellman backup as follows:

L_WQ(τ_t, θ_i) = w(s_{t+1}, a_{t+1}) ( Q_θ_i(s_t, a_t) − r_t − γ max_a Q_θ̄_i(s_{t+1}, a) )²,

where τ_t = (s_t, a_t, r_t, s_{t+1}) is a transition, a_{t+1} = argmax_a Q_θ̄_i(s_{t+1}, a), and w(s, a) is a confidence weight based on an ensemble of target Q-functions:

w(s, a) = σ( −Q̄_std(s, a) · T ) + 0.5,   (8)

where T > 0 is a temperature, σ is the sigmoid function, and Q̄_std(s, a) is the empirical standard deviation of all target Q-functions {Q_θ̄_i}_{i=1}^N. Note that the confidence weight is bounded in (0.5, 1.0] because the standard deviation is always non-negative (we find that it is empirically stable to set the minimum value of the weight to 0.5). The proposed objective down-weights sample transitions with high variance across target Q-functions, resulting in a loss function for the updates that has a better signal-to-noise ratio. Note that we combine the proposed weighted Bellman backup with prioritized replay Schaul et al. (2016) by multiplying both weights into the Bellman backups.
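As a brief sketch of this combination, the per-sample loss for ensemble member i could simply multiply the prioritized-replay importance weight, the confidence weight of (8), and the bootstrap mask, as below; the variable names are illustrative.

import torch

def rainbow_weighted_loss(td_error, per_weight, conf_weight, mask):
    # td_error: per-sample error from the (distributional) Bellman backup
    # per_weight: importance-sampling weight from prioritized replay (Schaul et al., 2016)
    # conf_weight: ensemble confidence weight w(s', a') from Eq. (8)
    # mask: bootstrap mask m_{t,i} for this ensemble member
    return (mask * per_weight * conf_weight * td_error.pow(2)).mean()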

UCB exploration. The ensemble can also be leveraged for efficient exploration Chen et al. (2017); Osband et al. (2016) because it can express higher uncertainty on unseen samples. Motivated by this, following the idea of Chen et al. (2017), we consider an optimism-based exploration that chooses the action that maximizes

a_t = argmax_a { Q_mean(s_t, a) + λ Q_std(s_t, a) },   (9)

where Q_mean(s, a) and Q_std(s, a) are the empirical mean and standard deviation of all Q-functions {Q_θ_i}_{i=1}^N, and λ > 0 is a hyperparameter. This inference method can encourage exploration by adding an exploration bonus (i.e., the standard deviation Q_std) for visiting unseen state-action pairs, similar to the UCB algorithm Auer et al. (2002). This inference method was originally proposed in Chen et al. (2017) for efficient exploration in DQN, and we further extend it to Rainbow DQN. For evaluation, we approximate the maximum a posteriori action by choosing the action that maximizes the mean of the Q-functions, i.e., a_t = argmax_a Q_mean(s_t, a). The full procedure is summarized in Algorithm 3.
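For discrete actions the maximization in (9) is a simple argmax over the action set, as in the sketch below; the ensemble is assumed to be a list of Q-networks mapping a state to a vector of Q-values, and the default λ is a placeholder.

import torch

def ucb_action_discrete(q_ensemble, state, lam=1.0):
    # Eq. (9): argmax_a Q_mean(s, a) + lambda * Q_std(s, a) over all discrete actions.
    with torch.no_grad():
        qs = torch.stack([q(state) for q in q_ensemble], dim=0)  # (N, num_actions)
        ucb = qs.mean(dim=0) + lam * qs.std(dim=0)
    return int(ucb.argmax().item())

def greedy_action(q_ensemble, state):
    # Evaluation: act greedily with respect to the ensemble mean (no exploration bonus).
    with torch.no_grad():
        qs = torch.stack([q(state) for q in q_ensemble], dim=0)
        return int(qs.mean(dim=0).argmax().item())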

1: for each iteration do
2:     for each timestep t do
3:         // UCB exploration
4:         Choose the action that maximizes UCB: a_t = argmax_a Q_mean(s_t, a) + λ Q_std(s_t, a)
5:         Collect state s_{t+1} and reward r_t from the environment by taking action a_t
6:         Sample bootstrap masks M_t = { m_{t,i} ∼ Bernoulli(β) | i = 1, ..., N }
7:         Store transition τ_t = (s_t, a_t, r_t, s_{t+1}) and masks M_t in the replay buffer B
8:     end for
9:     // Update Q-functions via bootstrap and weighted Bellman backup
10:     for each gradient step do
11:         Sample a random minibatch {(τ_j, M_j)} from B
12:         for each agent i do
13:             Update the Q-function by minimizing Σ_j m_{j,i} L_WQ(τ_j, θ_i)
14:         end for
15:     end for
16: end for
Algorithm 3 SUNRISE: Rainbow version

Appendix C Implementation details for toy regression tasks

We evaluate the quality of uncertainty estimates from an ensemble of neural networks on a toy regression task. To this end, we generate twenty training samples from a noisy ground-truth function (the black curve and red dots in Figure 1(b)) and train an ensemble of ten regression networks using bootstrap with random initialization. Each regression network is a fully-connected neural network with 2 hidden layers and 50 rectified linear units in each layer. For bootstrap, we draw the binary masks from a Bernoulli distribution. As uncertainty estimates, we measure the empirical variance of the networks' predictions. As shown in Figure 1(b), the ensemble can produce well-calibrated uncertainty estimates (i.e., variance) on unseen samples.
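A minimal PyTorch sketch of this toy experiment is given below. The data-generating curve, number of gradient steps, and learning rate are placeholders (the exact ground-truth function is not reproduced here); the architecture and ensemble size match the description above.

import torch
import torch.nn as nn

def make_mlp():
    # Fully-connected regressor: 2 hidden layers with 50 ReLU units each.
    return nn.Sequential(nn.Linear(1, 50), nn.ReLU(),
                         nn.Linear(50, 50), nn.ReLU(),
                         nn.Linear(50, 1))

# Placeholder noisy 1-D training data (twenty samples).
x = torch.rand(20, 1) * 8.0 - 4.0
y = torch.sin(x) * x + 0.3 * torch.randn(20, 1)

ensemble, beta = [make_mlp() for _ in range(10)], 1.0
for net in ensemble:
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    mask = torch.bernoulli(torch.full((20, 1), beta))  # bootstrap mask for this member
    for _ in range(2000):
        opt.zero_grad()
        loss = (mask * (net(x) - y) ** 2).mean()
        loss.backward()
        opt.step()

# Epistemic uncertainty = variance of the members' predictions on a dense grid of unseen inputs.
x_test = torch.linspace(-6.0, 6.0, 200).unsqueeze(1)
with torch.no_grad():
    preds = torch.stack([net(x_test) for net in ensemble], dim=0)
mean, variance = preds.mean(dim=0), preds.var(dim=0)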

Appendix D Experimental setups and results: OpenAI Gym

Environments. We evaluate the performance of SUNRISE on four complex environments based on the standard benchmarking environments from OpenAI Gym Brockman et al. (2016) (we used the reference implementation at https://github.com/WilsonWangTHU/mbbl Wang et al. (2019)). Note that we do not use the modified Cheetah environment from PETS Chua et al. (2018) (denoted as Cheetah in POPLIN Wang and Ba (2020)) because it includes additional information in the observations.

Training details. We consider a combination of SAC and SUNRISE using the publicly released implementation repository (https://github.com/vitchyr/rlkit) without any modifications to hyperparameters or architectures. For our method, the temperature T for the weighted Bellman backup, the mean β of the Bernoulli distribution, and the penalty parameter λ are each chosen from a small candidate set, and we train five ensemble agents. The optimal parameters are chosen to achieve the best performance on training environments. Here, we remark that training ensemble agents using the same training samples but with different initializations (i.e., β = 1) usually achieves the best performance, similar to Osband et al. (2016) and Chen et al. (2017). We expect that this is because splitting samples can reduce sample-efficiency. Also, the initial diversity from random initialization can be sufficient because each Q-function has a unique target Q-function, i.e., the target value also differs according to the initialization.

Learning curves. Figure 3 shows the learning curves on all environments. One can note that SUNRISE consistently improves the performance of SAC by a large margin.

(a) Cheetah
(b) Walker
(c) Hopper
(d) Ant
Figure 3: Learning curves on (a) Cheetah, (b) Walker, (c) Hopper, and (d) Ant environments from OpenAI Gym. The solid line and shaded regions represent the mean and standard deviation, respectively, across four runs.

Effects of ensembles. Figure 4 shows the learning curves of SUNRISE with varying ensemble size on all environments. The performance can be improved by increasing the ensemble size, but the improvement saturates around five agents.

(a) Cheetah
(b) Walker
(c) Hopper
(d) Ant
Figure 4: Learning curves of SUNRISE with varying values of ensemble size . The solid line and shaded regions represent the mean and standard deviation, respectively, across four runs.
Hyperparameter Value Hyperparameter Value
Random crop True Initial temperature
Observation rendering Learning rate cheetah, run
Observation downsampling otherwise
Replay buffer size Learning rate ()
Initial steps Batch Size (cheetah), (rest)
Stacked frames function EMA
Action repeat finger, spin; walker, walk Critic target update freq
cartpole, swingup Convolutional layers
otherwise Number of filters
Hidden units (MLP) Non-linearity ReLU
Evaluation episodes Encoder EMA
Optimizer Adam Latent dimension
Discount
Table 5: Hyperparameters used for DeepMind Control Suite experiments. Most hyperparameters values are unchanged across environments with the exception for action repeat, learning rate, and batch size.

Appendix E Experimental setups and results: DeepMind Control Suite

Training details. We consider a combination of RAD and SUNRISE using the publicly released implementation repository (https://github.com/MishaLaskin/rad) with the full list of hyperparameters in Table 5. Similar to Laskin et al. (2020), we use the same encoder architecture as in Yarats et al. (2019), and the actor and critic share the same encoder to embed image observations (however, the ensemble agents do not share encoders with each other, unlike Bootstrapped DQN Osband et al. (2016)). For our method, the temperature T for the weighted Bellman backup, the mean β of the Bernoulli distribution, and the penalty parameter λ are each chosen from a small candidate set, and we train five ensemble agents. The optimal parameters are chosen to achieve the best performance on training environments. Here, we remark that training ensemble agents using the same training samples but with different initializations (i.e., β = 1) usually achieves the best performance, similar to Osband et al. (2016) and Chen et al. (2017). We expect that this is because splitting samples can reduce sample-efficiency. Also, the initial diversity from random initialization can be sufficient because each Q-function has a unique target Q-function, i.e., the target value also differs according to the initialization.

Learning curves. Figures 5(g), 5(h), 5(i), 5(j), 5(k), and 5(l) show the learning curves on all environments. Since RAD already achieves near-optimal performance and the room for improvement is small, we see small but consistent gains from SUNRISE. To verify the effectiveness of SUNRISE more clearly, we also consider a combination of SAC and SUNRISE in Figures 5(a), 5(b), 5(c), 5(d), 5(e), and 5(f), where the gain from SUNRISE is more significant.

(a) Finger-spin
(b) Cartpole-swing
(c) Reacher-easy
(d) Cheetah-run
(e) Walker-walk
(f) Cup-catch
(g) Finger-spin
(h) Cartpole-swing
(i) Reacher-easy
(j) Cheetah-run
(k) Walker-walk
(l) Cup-catch
Figure 5: Learning curves of (a-f) SAC and (g-l) RAD on DeepMind Control Suite. The solid line and shaded regions represent the mean and standard deviation, respectively, across five runs.

Appendix F Experimental setups and results: Atari games

Training details. We consider a combination of the sample-efficient version of Rainbow DQN and SUNRISE using the publicly released implementation repository (https://github.com/Kaixhin/Rainbow) without any modifications to hyperparameters or architectures. For our method, the temperature T for the weighted Bellman backup, the mean β of the Bernoulli distribution, and the penalty parameter λ are each chosen from a small candidate set, and we train five ensemble agents. The optimal parameters are chosen to achieve the best performance on training environments. Here, we remark that training ensemble agents using the same training samples but with different initializations (i.e., β = 1) usually achieves the best performance, similar to Osband et al. (2016) and Chen et al. (2017). We expect that this is because splitting samples can reduce sample-efficiency. Also, the initial diversity from random initialization can be sufficient because each Q-function has a unique target Q-function, i.e., the target value also differs according to the initialization.

Learning curves. Figure 6, Figure 7 and Figure 8 show the learning curves on all environments.

(a) Seaquest
(b) BankHeist
(c) Assault
(d) CrazyClimber
(e) DemonAttack
(f) ChopperCommand
(g) KungFuMaster
(h) Kangaroo
(i) UpNDown
Figure 6: Learning curves on Atari games. The solid line and shaded regions represent the mean and standard deviation, respectively, across three runs.
(a) Amidar
(b) Alien
(c) Pong
(d) Frostbite
(e) MsPacman
(f) Boxing
(g) Jamesbond
(h) Krull
(i) BattleZone
(j) RoadRunner
(k) Hero
(l) Asterix
Figure 7: Learning curves on Atari games. The solid line and shaded regions represent the mean and standard deviation, respectively, across three runs.
(a) PrivateEye
(b) Qbert
(c) Breakout
(d) Freeway
(e) Gopher
Figure 8: Learning curves on Atari games. The solid line and shaded regions represent the mean and standard deviation, respectively, across three runs.

Appendix G Experimental setups and results: OpenAI Gym with stochastic rewards

DisCor. DisCor Kumar et al. (2020) was proposed to prevent the error propagation issue in Q-learning. In addition to standard Q-learning, DisCor trains an error model Δ_ψ(s, a), which approximates the cumulative sum of discounted Bellman errors over the past iterations of training. Then, using the error model, DisCor reweights the Bellman backups based on a confidence weight defined as follows:

w(s_t, a_t) ∝ exp( −γ Δ_ψ̄(s_{t+1}, a_{t+1}) / τ ),

where γ is a discount factor and τ is a temperature. Following the setups in Kumar et al. (2020), we take a network with one extra hidden layer relative to the corresponding Q-network as the error model; the remaining hyperparameters (e.g., the error-model learning rate) follow the released setup, and we update the temperature τ via a moving average. We use the SAC algorithm as the RL objective coupled with DisCor and build on top of the publicly released implementation repository (https://github.com/vitchyr/rlkit).

Learning curves. Figure 9 shows the learning curves of SUNRISE and DisCor on the OpenAI Gym environments with stochastic rewards. SUNRISE outperforms baselines such as SAC and DisCor, even when only using the proposed weighted Bellman backup (green curve). This implies that errors in the target Q-function can be effectively characterized by the proposed confidence weight in (5). By additionally utilizing UCB exploration, both the sample-efficiency and the asymptotic performance of SUNRISE are further improved (blue curve).

(a) Cheetah
(b) Walker
(c) Hopper
(d) Ant
Figure 9: Comparison with DisCor on (a) Cheetah, (b) Walker, (c) Hopper, and (d) Ant environments with stochastic rewards. We add Gaussian noise to the reward function to increase the error in value estimates.