SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning
Model-free deep reinforcement learning (RL) has been successful in a range of challenging domains. However, there are some remaining issues, such as stabilizing the optimization of nonlinear function approximators, preventing error propagation due to the Bellman backup in Q-learning, and efficient exploration. To mitigate these issues, we present SUNRISE, a simple unified ensemble method, which is compatible with various off-policy RL algorithms. SUNRISE integrates three key ingredients: (a) bootstrap with random initialization, which improves the stability of the learning process by training a diverse ensemble of agents, (b) weighted Bellman backups, which prevent error propagation in Q-learning by reweighting sample transitions based on uncertainty estimates from the ensembles, and (c) an inference method that selects actions using highest upper-confidence bounds for efficient exploration. Our experiments show that SUNRISE significantly improves the performance of existing off-policy RL algorithms, such as Soft Actor-Critic and Rainbow DQN, for both continuous and discrete control tasks on both low-dimensional and high-dimensional environments. Our training code is available at https://github.com/pokaxpoka/sunrise.
Model-free reinforcement learning (RL), with high-capacity function approximators, such as deep neural networks (DNNs), has been used to solve a variety of sequential decision-making problems, including board games Silver et al. (2017, 2018), video games Mnih et al. (2015); Vinyals et al. (2019), and robotic manipulation Kalashnikov et al. (2018). It has been well established that the above successes are highly sample inefficient Kaiser et al. (2020). Recently, a lot of progress has been made in more sample-efficient model-free RL algorithms through improvements in off-policy learning both in discrete and continuous domains Fujimoto et al. (2018); Haarnoja et al. (2018); Hessel et al. (2018). However, there are still substantial challenges when training off-policy RL algorithms. First, the learning process is often unstable and sensitive to hyperparameters because it is a complex problem to optimize large nonlinear policies such as DNNs Henderson et al. (2018). Second, Q-learning often converges to sub-optimal solutions due to error propagation in the Bellman backup, i.e., the errors induced in the target value can lead to an increase in overall error in the Q-function Kumar et al. (2019, 2020). Third, it is hard to balance exploration and exploitation, which is necessary for efficient RL Chen et al. (2017); Osband et al. (2016) (see Section 2 for further details).
One way to address the above issues with off-policy RL algorithms is to use ensemble methods, which combine multiple models of the value function and (or) policy Chen et al. (2017); Lan et al. (2020); Osband et al. (2016); Wiering and Van Hasselt (2008). One example is the twin-Q trick Fujimoto et al. (2018) that was proposed to handle the overestimation of value functions for continuous control tasks. Bootstrapped DQN Osband et al. (2016) leveraged an ensemble of Q-functions for more effective exploration, and Chen et al. (2017) further improved it by adapting upper-confidence bound algorithms Audibert et al. (2009); Auer et al. (2002) based on uncertainty estimates from ensembles. However, most prior works have studied the various axes of improvements from ensemble methods in isolation and have ignored the error propagation aspect.
In this paper, we present SUNRISE, a simple unified ensemble method that is compatible with most modern off-policy RL algorithms, such as Q-learning and actor-critic algorithms. SUNRISE consists of the following key ingredients (see Figure 1(a)):
Bootstrap with random initialization: To enforce diversity between ensemble agents, we initialize the model parameters randomly and apply different training samples to each agent. Similar to Osband et al. (2016), we find that this simple technique stabilizes the learning process and improves performance by combining diverse agents.
Weighted Bellman backup: Errors in the target Q-function can propagate to the current Q-function Kumar et al. (2019, 2020) because the Bellman backup is usually applied with a learned target Q-function (see Section 3.2 for more details). To handle this issue, we reweight the Bellman backup based on uncertainty estimates of target Q-functions. Because prediction errors can be characterized by uncertainty estimates from ensembles (i.e., variance of predictions) as shown in Figure 1(b), we find that the proposed method significantly mitigates error propagation in Q-learning.
UCB exploration: We define an upper-confidence bound (UCB) based on the mean and variance of Q-functions similar to Chen et al. (2017), and introduce an inference method, which selects actions with highest UCB for efficient exploration. This inference method can encourage exploration by providing a bonus for visiting unseen state-action pairs, where ensembles produce high uncertainty, i.e., high variance (see Figure 1(b)).
We demonstrate the effectiveness of SUNRISE using Soft Actor-Critic (SAC) Haarnoja et al. (2018) for continuous control benchmarks (specifically, OpenAI Gym Brockman et al. (2016) and DeepMind Control Suite Tassa et al. (2018)) and Rainbow DQN Hessel et al. (2018) for discrete control benchmarks (specifically, Atari games Bellemare et al. (2013)). In our experiments, SUNRISE consistently improves the performance of existing off-policy RL methods and outperforms baselines, including model-based RL methods such as POPLIN Wang and Ba (2020), Dreamer Hafner et al. (2020), and SimPLe Kaiser et al. (2020).
Off-policy RL algorithms. Recently, various off-policy RL algorithms have provided large gains in sample-efficiency by reusing past experiences Fujimoto et al. (2018); Haarnoja et al. (2018); Hessel et al. (2018). Rainbow DQN Hessel et al. (2018) achieved state-of-the-art performance on the Atari games Bellemare et al. (2013) by combining several techniques, such as double Q-learning Van Hasselt et al. (2016) and distributional DQN Bellemare et al. (2017). For continuous control tasks, SAC Haarnoja et al. (2018) achieved state-of-the-art sample-efficiency results by incorporating the maximum entropy framework, and Laskin et al. (2020) showed that the sample-efficiency of SAC can be further improved on high-dimensional environments by incorporating data augmentations. Our ensemble method brings orthogonal benefits and is complementary to and compatible with these existing state-of-the-art algorithms.
Ensemble methods in RL. Ensemble methods have been studied for different purposes in RL Agarwal et al. (2020); Anschel et al. (2017); Chen et al. (2017); Chua et al. (2018); Kurutach et al. (2018); Osband et al. (2016); Wiering and Van Hasselt (2008). Chua et al. (2018) showed that modeling errors in model-based RL can be reduced using an ensemble of dynamics models, and Kurutach et al. (2018) accelerated policy learning by generating imagined experiences from the ensemble of dynamics models. Bootstrapped DQN Osband et al. (2016) leveraged the ensemble of Q-functions for efficient exploration. However, our method is different in that we propose a unified framework that handles various issues in off-policy RL algorithms.
Stabilizing Q-learning. It has been empirically observed that instability in Q-learning can be caused by applying the Bellman backup on the learned value function Anschel et al. (2017); Fujimoto et al. (2018); Hasselt (2010); Kumar et al. (2019, 2020); Van Hasselt et al. (2016). For discrete control tasks, double Q-learning Hasselt (2010); Van Hasselt et al. (2016) addressed the value overestimation by maintaining two independent estimators of the action values, and was later extended to continuous control tasks in TD3 Fujimoto et al. (2018). Recently, Kumar et al. (2020) handled the error propagation issue by reweighting the Bellman backup based on cumulative Bellman errors. While most prior work has improved the stability by taking the minimum over Q-functions or estimating cumulative errors, we propose an alternative way that also utilizes ensembles to estimate uncertainty and provide more stable, higher-signal-to-noise backups.
Exploration in RL. To balance exploration and exploitation, several methods, such as the maximum entropy frameworks Haarnoja et al. (2018); Ziebart (2010) and exploration bonus rewards Bellemare et al. (2016); Choi et al. (2019); Houthooft et al. (2016); Pathak et al. (2017), have been proposed. Despite the success of these exploration methods, a potential drawback is that agents can focus on irrelevant aspects of the environment because these methods do not depend on the rewards. To handle this issue, Chen et al. (2017) proposed an exploration strategy that considers both best estimates (i.e., mean) and uncertainty (i.e., variance) of Q-functions for discrete control tasks. We further extend this strategy to continuous control tasks and show that it can be combined with other techniques.
We present SUNRISE: Simple UNified framework for ReInforcement learning using enSEmbles. In principle, SUNRISE can be used in conjunction with most modern off-policy RL algorithms, such as SAC Haarnoja et al. (2018) and Rainbow DQN Hessel et al. (2018). For the exposition, we describe only the SAC version of SUNRISE in the main body. The Rainbow DQN version of SUNRISE follows the same principles and is fully described in Appendix B.
We consider a standard RL framework where an agent interacts with an environment in discrete time. Formally, at each timestep $t$, the agent receives a state $s_t$ from the environment and chooses an action $a_t$ based on its policy $\pi$. The environment returns a reward $r_t$ and the agent transitions to the next state $s_{t+1}$. The return $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ is the total accumulated rewards from timestep $t$ with a discount factor $\gamma \in [0, 1)$. RL then maximizes the expected return from each state $s_t$.
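As a small worked illustration of the discounted return described above, the following sketch accumulates rewards backwards over a finite episode. The function name `discounted_return` and the argument `gamma` (the discount factor) are illustrative choices, not names from the paper or its code.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] over a finite list of rewards.

    Accumulating from the last reward backwards avoids computing powers.
    """
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three-step episode with rewards 1, 0, 2 and discount 0.5:
# 1 + 0.5 * 0 + 0.25 * 2 = 1.5
example_return = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
```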
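SAC Haarnoja et al. (2018) is an off-policy actor-critic method based on the maximum entropy RL framework Ziebart (2010), which encourages robustness to noise and exploration by maximizing a weighted objective of the reward and the policy entropy (see Appendix A for further details). To update the parameters, SAC alternates between a soft policy evaluation and a soft policy improvement. At the soft policy evaluation step, a soft Q-function, which is modeled as a neural network with parameters $\theta$, is updated by minimizing the following soft Bellman residual:
$$\mathcal{L}^{\text{SAC}}_{\text{critic}}(\theta) = \mathbb{E}_{\tau_t \sim \mathcal{B}}\left[\mathcal{L}_Q(\tau_t, \theta)\right], \tag{1}$$
$$\mathcal{L}_Q(\tau_t, \theta) = \left(Q_\theta(s_t, a_t) - r_t - \gamma \bar{V}(s_{t+1})\right)^2, \quad \text{with} \quad \bar{V}(s_t) = \mathbb{E}_{a_t \sim \pi_\phi}\left[Q_{\bar{\theta}}(s_t, a_t) - \alpha \log \pi_\phi(a_t | s_t)\right], \tag{2}$$
where $\tau_t = (s_t, a_t, r_t, s_{t+1})$ is a transition, $\mathcal{B}$ is a replay buffer, $\bar{\theta}$ are the delayed parameters, and $\alpha$ is a temperature parameter. At the soft policy improvement step, the policy $\pi$ with its parameter $\phi$ is updated by minimizing the following objective:
$$\mathcal{L}^{\text{SAC}}_{\text{actor}}(\phi) = \mathbb{E}_{s_t \sim \mathcal{B}}\left[\mathcal{L}_\pi(s_t, \phi)\right], \quad \text{where} \quad \mathcal{L}_\pi(s_t, \phi) = \mathbb{E}_{a_t \sim \pi_\phi}\left[\alpha \log \pi_\phi(a_t | s_t) - Q_\theta(s_t, a_t)\right]. \tag{3}$$
Here, the policy is modeled as a Gaussian with mean and covariance given by neural networks to handle continuous action spaces.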
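To make the soft policy evaluation step concrete, here is a minimal single-transition sketch of a soft Bellman residual. It assumes the standard SAC form, in which the target is the reward plus the discounted soft value of the next state, estimated with one sampled next action. All function and argument names are hypothetical and scalar inputs stand in for network outputs; this is not the authors' implementation.

```python
def soft_bellman_residual(q_value, reward, gamma, alpha,
                          target_q_next, log_prob_next):
    """Squared soft Bellman residual for one transition.

    q_value:        current Q estimate for (s_t, a_t)
    target_q_next:  delayed (target) Q estimate for (s_{t+1}, a_{t+1})
    log_prob_next:  log-probability of a_{t+1} under the current policy
    alpha:          entropy temperature
    """
    # One-sample estimate of the soft value of the next state.
    soft_value_next = target_q_next - alpha * log_prob_next
    target = reward + gamma * soft_value_next
    return (q_value - target) ** 2


# Soft value = 2.0 + 0.2 = 2.2, target = 0.5 + 0.9 * 2.2 = 2.48,
# residual = (1.0 - 2.48) ** 2 = 2.1904
residual = soft_bellman_residual(
    q_value=1.0, reward=0.5, gamma=0.9, alpha=0.2,
    target_q_next=2.0, log_prob_next=-1.0,
)
```

In practice the residual is averaged over a minibatch drawn from the replay buffer, and the target is treated as a constant when differentiating.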
In the design of SUNRISE, we integrate the three key ingredients, i.e., bootstrap with random initialization, weighted Bellman backup, and UCB exploration, into a single framework.
Bootstrap with random initialization. Formally, we consider an ensemble of $N$ SAC agents, i.e., $\{Q_{\theta_i}, \pi_{\phi_i}\}_{i=1}^{N}$, where $\theta_i$ and $\phi_i$ denote the parameters of the $i$-th soft Q-function and policy. (We remark that each Q-function $Q_{\theta_i}$ has a unique target Q-function $Q_{\bar{\theta}_i}$.) To train the ensemble of agents, we use the bootstrap with random initialization Efron (1982); Osband et al. (2016), which enforces the diversity between agents through two simple ideas: First, we initialize the model parameters of all agents with random parameter values for inducing an initial diversity in the models. Second, we apply different samples to train each agent. Specifically, for each SAC agent $i$ in each timestep $t$, we draw the binary masks $m_{t,i}$ from the Bernoulli distribution with parameter $\beta \in (0, 1]$, and store them in the replay buffer. Then, when updating the model parameters of agents, we multiply the bootstrap mask to each objective function, such as: $m_{t,i} \mathcal{L}_Q(\tau_t, \theta_i)$ and $m_{t,i} \mathcal{L}_\pi(s_t, \phi_i)$ in (2) and (3). We remark that Osband et al. (2016) applied this simple technique to train an ensemble of DQN Mnih et al. (2015) only for discrete control tasks, while we apply it to SAC Haarnoja et al. (2018) and Rainbow DQN Hessel et al. (2018) for both continuous and discrete tasks with additional techniques in the following paragraphs.
Weighted Bellman backup. Since conventional Q-learning is based on the Bellman backup in (2), it can be affected by error propagation. That is, error in the target Q-function $Q_{\bar{\theta}}(s_{t+1}, a_{t+1})$ gets propagated into the Q-function $Q_\theta(s_t, a_t)$ at the current state. Recently, Kumar et al. (2020) showed that this error propagation can cause inconsistency and unstable convergence. To mitigate this issue, for each agent $i$, we consider a weighted Bellman backup as follows:
$$\mathcal{L}_{WQ}(\tau_t, \theta_i) = w(s_{t+1}, a_{t+1}) \left(Q_{\theta_i}(s_t, a_t) - r_t - \gamma \bar{V}(s_{t+1})\right)^2, \tag{4}$$
where $\tau_t = (s_t, a_t, r_t, s_{t+1})$ is a transition, $a_{t+1} \sim \pi_{\phi_i}(a | s_{t+1})$, and $w(s, a)$ is a confidence weight based on ensemble of target Q-functions:
$$w(s, a) = \sigma\left(-\bar{Q}_{\text{std}}(s, a) \cdot T\right) + 0.5, \tag{5}$$
where $T > 0$ is a temperature, $\sigma$ is the sigmoid function, and $\bar{Q}_{\text{std}}(s, a)$ is the empirical standard deviation of all target Q-functions $\{Q_{\bar{\theta}_i}\}_{i=1}^{N}$. Note that the confidence weight is bounded in $[0.5, 1.0]$ because standard deviation is always positive. (We find that it is empirically stable to set minimum value of weight as 0.5.) The proposed objective $\mathcal{L}_{WQ}$ down-weights the sample transitions with high variance across target Q-functions, resulting in a loss function for the Q-updates that has a better signal-to-noise ratio.
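The confidence weight described above (a sigmoid of the negated ensemble standard deviation scaled by a temperature, shifted by 0.5) can be sketched as follows. The function names are hypothetical, scalar values stand in for the target Q-network outputs, and the use of the population standard deviation is an assumption, since the text only says "empirical standard deviation".

```python
import math
import statistics


def confidence_weight(target_q_values, temperature):
    """Weight sigmoid(-std * T) + 0.5 for one next state-action pair.

    target_q_values: predictions of every target Q-function in the ensemble
    temperature:     positive scalar T
    """
    # Assumption: population standard deviation across ensemble members.
    std = statistics.pstdev(target_q_values)
    # sigmoid(-x) = exp(-x) / (1 + exp(-x)); no overflow because x >= 0.
    e = math.exp(-std * temperature)
    return e / (1.0 + e) + 0.5


def weighted_bellman_loss(q_value, target, weight):
    """Squared Bellman error scaled by the confidence weight."""
    return weight * (q_value - target) ** 2


agree = confidence_weight([1.0, 1.0, 1.0], temperature=10.0)   # members agree
disagree = confidence_weight([0.0, 10.0], temperature=20.0)    # members disagree
```

When all ensemble members agree the weight is 1.0, and it decays towards 0.5 as their disagreement grows, so uncertain targets contribute at most half as much to the loss.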
UCB exploration. The ensemble can also be leveraged for efficient exploration Chen et al. (2017); Osband et al. (2016) because it can express higher uncertainty on unseen samples. Motivated by this, by following the idea of Chen et al. (2017), we consider an optimism-based exploration that chooses the action that maximizes
$$a_t = \arg\max_{a} \left\{ Q_{\text{mean}}(s_t, a) + \lambda Q_{\text{std}}(s_t, a) \right\}, \tag{6}$$
where $Q_{\text{mean}}(s, a)$ and $Q_{\text{std}}(s, a)$ are the empirical mean and standard deviation of all Q-functions $\{Q_{\theta_i}\}_{i=1}^{N}$, and $\lambda > 0$ is a hyperparameter. This inference method can encourage exploration by adding an exploration bonus (i.e., standard deviation $Q_{\text{std}}$) for visiting unseen state-action pairs similar to the UCB algorithm Auer et al. (2002). We remark that this inference method was originally proposed in Chen et al. (2017) for efficient exploration in discrete action spaces. However, in continuous action spaces, finding the action that maximizes the UCB is not straightforward. To handle this issue, we propose a simple approximation scheme, which first generates a candidate action set of size $N$ from the ensemble policies $\{\pi_{\phi_i}\}_{i=1}^{N}$, and then chooses the action that maximizes the UCB (Line 4 in Algorithm 1). For evaluation, we approximate the maximum a posteriori action by averaging the mean of Gaussian distributions modeled by each ensemble policy. The full procedure is summarized in Algorithm 1.
We designed our experiments to answer the following questions:
Continuous control tasks. We evaluate SUNRISE on several continuous control tasks using simulated robots from OpenAI Gym Brockman et al. (2016) and DeepMind Control Suite Tassa et al. (2018). For OpenAI Gym experiments with proprioceptive inputs (e.g., positions and velocities), we compare to PETS Chua et al. (2018), a state-of-the-art model-based RL method based on ensembles of dynamics models; POPLIN-P Wang and Ba (2020), a state-of-the-art model-based RL method which uses a policy network to generate actions for planning; POPLIN-A Wang and Ba (2020), a variant of POPLIN-P which adds noise in the action space; ME-TRPO Kurutach et al. (2018), a hybrid RL method which augments TRPO Schulman et al. (2015) using ensembles of dynamics models; and two state-of-the-art model-free RL methods, TD3 Fujimoto et al. (2018) and SAC Haarnoja et al. (2018). For our method, we consider a combination of SAC and SUNRISE, as described in Algorithm 1. Following the setup in POPLIN Wang and Ba (2020), we report the mean and standard deviation across four runs after 200K timesteps on four complex environments: Cheetah, Walker, Hopper, and Ant. More experimental details and learning curves are in Appendix D.
For DeepMind Control Suite with image inputs, we compare to PlaNet Hafner et al. (2019), a model-based RL method which learns a latent dynamics model and uses it for planning; Dreamer Hafner et al. (2020), a hybrid RL method which utilizes the latent dynamics model to generate synthetic roll-outs; SLAC Lee et al. (2019), a hybrid RL method which combines the latent dynamics model with SAC; and three state-of-the-art model-free RL methods which apply contrastive learning (CURL Srinivas et al. (2020)) or data augmentation (RAD Laskin et al. (2020) and DrQ Kostrikov et al. (2020)) to SAC. For our method, we consider a combination of RAD (i.e., SAC with random crop) and SUNRISE. Following the setup in RAD Laskin et al. (2020), we report the mean and standard deviation across five runs after 100K (i.e., low sample regime) and 500K (i.e., asymptotically optimal regime) environment steps on six environments: Finger-spin, Cartpole-swing, Reacher-easy, Cheetah-run, Walker-walk, and Cup-catch. More experimental details and learning curves are in Appendix E.
Discrete control benchmarks. For discrete control tasks, we demonstrate the effectiveness of SUNRISE on several Atari games Bellemare et al. (2013). We compare to SimPLe Kaiser et al. (2020), a hybrid RL method which updates the policy only using samples generated by a learned dynamics model; Rainbow DQN Hessel et al. (2018) with modified hyperparameters for sample-efficiency van Hasselt et al. (2019); Random agent Kaiser et al. (2020); CURL Srinivas et al. (2020), a model-free RL method which applies contrastive learning to Rainbow DQN; and Human performances reported in Kaiser et al. (2020) and van Hasselt et al. (2019). Following the setups in SimPLe Kaiser et al. (2020), we report the mean across three runs after 100K interactions (i.e., 400K frames with action repeat of 4). For our method, we consider a combination of the sample-efficient version of Rainbow DQN van Hasselt et al. (2019) and SUNRISE (see Algorithm 3 in Appendix B). More experimental details and learning curves are in Appendix F.
Table 1: Average returns on OpenAI Gym after 200K timesteps (mean ± standard deviation across four runs).
Method  Cheetah  Walker  Hopper  Ant
PETS Chua et al. (2018)  2288.4 ± 1019.0  282.5 ± 501.6  114.9 ± 621.0  1165.5 ± 226.9
POPLIN-A Wang and Ba (2020)  1562.8 ± 1136.7  105.0 ± 249.8  202.5 ± 962.5  1148.4 ± 438.3
POPLIN-P Wang and Ba (2020)  4235.0 ± 1133.0  597.0 ± 478.8  2055.2 ± 613.8  2330.1 ± 320.9
ME-TRPO Kurutach et al. (2018)  2283.7 ± 900.4  1609.3 ± 657.5  1272.5 ± 500.9  282.2 ± 18.0
TD3 Fujimoto et al. (2018)  3015.7 ± 969.8  516.4 ± 812.2  1816.6 ± 994.8  870.1 ± 283.8
SAC Haarnoja et al. (2018)  4035.7 ± 268.0  382.5 ± 849.5  2020.6 ± 692.9  836.5 ± 68.4
SUNRISE  5370.6 ± 483.1  1926.5 ± 694.8  2601.9 ± 306.5  1627.0 ± 292.7
[Table 2: Performance on DeepMind Control Suite at 500K and 100K environment steps. Columns: PlaNet Hafner et al. (2019), Dreamer Hafner et al. (2020), SLAC Lee et al. (2019), CURL Srinivas et al. (2020), DrQ Kostrikov et al. (2020), RAD Laskin et al. (2020), and SUNRISE. Rows for each step budget: Finger-spin, Cartpole-swing, Reacher-easy, Cheetah-run, Walker-walk, and Cup-catch. Numeric entries are missing in this copy.]
OpenAI Gym. Table 1 shows the average returns of evaluation roll-outs for all methods. SUNRISE consistently improves the performance of SAC across all environments and outperforms the state-of-the-art POPLIN-P on all environments except Ant. In particular, the average returns are improved from 597.0 to 1926.5 compared to POPLIN-P on the Walker environment, which most model-based RL methods cannot solve efficiently. We remark that SUNRISE is more compute-efficient than modern model-based RL methods, such as POPLIN and PETS, because they also utilize ensembles (of dynamics models) and perform planning to select actions. Namely, SUNRISE is simple to implement, computationally efficient, and readily parallelizable.
DeepMind Control Suite. As shown in Table 2, SUNRISE also consistently improves the performance of RAD (i.e., SAC with random crop) on all environments from DeepMind Control Suite. This implies that the proposed method can be useful for high-dimensional and complex input observations. Moreover, our method achieves state-of-the-art performances in almost all environments against existing pixel-based RL methods. We remark that SUNRISE can also be combined with DrQ, and expect that it can achieve better performances on Cartpole-swing and Cup-catch at 100K environment steps.
Atari games. We also evaluate SUNRISE on discrete control tasks using Rainbow DQN on Atari games. Table 3 shows that SUNRISE improves the performance of Rainbow in almost all environments, and achieves state-of-the-art performance on 12 out of 26 environments. Here, we remark that SUNRISE is also compatible with CURL, which could enable even better performance. These results demonstrate that SUNRISE is a general approach, and can be applied to various off-policy RL algorithms.
Game  Human  Random  SimPLe Kaiser et al. (2020)  CURL Srinivas et al. (2020)  Rainbow van Hasselt et al. (2019)  SUNRISE 

Alien  7127.7  227.8  616.9  558.2  789.0  872.0 
Amidar  1719.5  5.8  88.0  142.1  118.5  122.6 
Assault  742.0  222.4  527.2  600.6  413.0  594.8 
Asterix  8503.3  210.0  1128.3  734.5  533.3  755.0 
BankHeist  753.1  14.2  34.2  131.6  97.7  266.7 
BattleZone  37187.5  2360.0  5184.4  14870.0  7833.3  15700.0 
Boxing  12.1  0.1  9.1  1.2  0.6  6.7 
Breakout  30.5  1.7  16.4  4.9  2.3  1.8 
ChopperCommand  7387.8  811.0  1246.9  1058.5  590.0  1040.0 
CrazyClimber  35829.4  10780.5  62583.6  12146.5  25426.7  22230.0 
DemonAttack  1971.0  152.1  208.1  817.6  688.2  919.8 
Freeway  29.6  0.0  20.3  26.7  28.7  30.2 
Frostbite  4334.7  65.2  254.7  1181.3  1478.3  2026.7 
Gopher  2412.5  257.6  771.0  669.3  348.7  654.7 
Hero  30826.4  1027.0  2656.6  6279.3  3675.7  8072.5 
Jamesbond  302.8  29.0  125.3  471.0  300.0  390.0 
Kangaroo  3035.0  52.0  323.1  872.5  1060.0  2000.0 
Krull  2665.5  1598.0  4539.9  4229.6  2592.1  3087.2 
KungFuMaster  22736.3  258.5  17257.2  14307.8  8600.0  10306.7 
MsPacman  6951.6  307.3  1480.0  1465.5  1118.7  1482.3 
Pong  14.6  20.7  12.8  16.5  19.0  19.3 
PrivateEye  69571.3  24.9  58.3  218.4  97.8  100.0 
Qbert  13455.0  163.9  1288.8  1042.4  646.7  1830.8 
RoadRunner  7845.0  11.5  5640.6  5661.0  9923.3  11913.3 
Seaquest  42054.7  68.4  683.3  384.5  396.0  570.7 
UpNDown  11693.2  533.4  3350.3  2955.2  3816.0  5074.0 
OpenAI Gym with stochastic rewards. To verify the effectiveness of SUNRISE in mitigating error propagation, following Kumar et al. (2019), we evaluate on a modified version of OpenAI Gym environments with stochastic rewards by adding Gaussian noise to the reward function: $r'(s, a) = r(s, a) + z$, where $z \sim \mathcal{N}(0, 1)$. This increases the noise in value estimates. Following Kumar et al. (2019), we only inject this noisy reward during training and report the deterministic ground-truth reward during evaluation. For our method, we also consider a variant of SUNRISE, which selects actions without UCB exploration to isolate the effect of the proposed weighted Bellman backup. Specifically, we select a policy index uniformly at random and generate actions from the selected policy for the duration of that episode, similar to Bootstrapped DQN Osband et al. (2016) (see Algorithm 2 in Appendix A). Our method is compared with DisCor Kumar et al. (2020), which improves SAC by reweighting the Bellman backup based on estimated cumulative Bellman errors (see Appendix G for more details).
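The stochastic-reward protocol above (noise injected only during training, ground-truth reward reported during evaluation) can be sketched as a small helper. The function name, the `training` flag, and the default unit noise scale `noise_std=1.0` are assumptions made for illustration; this is not taken from the authors' code.

```python
import random


def noisy_reward(reward, rng, noise_std=1.0, training=True):
    """Return the reward with additive zero-mean Gaussian noise.

    Noise is added only when training is True; evaluation sees the
    deterministic ground-truth reward.
    """
    if not training:
        return reward
    return reward + rng.gauss(0.0, noise_std)


rng = random.Random(0)  # seeded generator for reproducibility
train_samples = [noisy_reward(2.0, rng) for _ in range(10000)]
train_mean = sum(train_samples) / len(train_samples)
eval_reward = noisy_reward(2.0, rng, training=False)
```

Because the noise is zero-mean, the expected training reward matches the true reward, while individual targets used in the Bellman backup become noisier.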
Figure 2(a) shows the learning curves of all methods on the Cheetah environment with stochastic rewards. SUNRISE outperforms baselines such as SAC and DisCor, even when only using the proposed weighted Bellman backup (green curve). This implies that errors in the target Q-function can be characterized effectively by the proposed confidence weight in (5). By additionally utilizing UCB exploration, both sample-efficiency and asymptotic performance of SUNRISE are further improved (blue curve). More evaluation results with DisCor on other environments are also available in Appendix G, where the overall trend is similar.

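Table 4: Contribution of each technique in SUNRISE, i.e., bootstrap with random initialization (BOOT), weighted Bellman backup (WB), and UCB exploration (UCB), on several environments from Atari games and OpenAI Gym (mean ± standard deviation).
Method  BOOT  WB  UCB  Seaquest  ChopperCommand  Gopher
Rainbow  -  -  -  396.0 ± 37.7  590.0 ± 127.3  348.7 ± 43.8
SUNRISE  ✓  -  -  547.3 ± 110.0  590.0 ± 85.2  222.7 ± 34.7
SUNRISE  ✓  ✓  -  550.7 ± 67.0  860.0 ± 235.5  377.3 ± 195.6
SUNRISE  ✓  -  ✓  477.3 ± 48.5  623.3 ± 216.4  286.0 ± 39.2
SUNRISE  ✓  ✓  ✓  570.7 ± 43.6  1040.0 ± 77.9  654.7 ± 218.0

Method  BOOT  WB  UCB  Cheetah  Hopper  Ant
SAC  -  -  -  4035.7 ± 268.0  2020.6 ± 692.9  836.5 ± 68.4
SUNRISE  ✓  -  -  4213.5 ± 249.1  2378.3 ± 538.0  1033.4 ± 106.0
SUNRISE  ✓  ✓  -  5197.4 ± 448.1  2586.5 ± 317.0  1164.6 ± 488.4
SUNRISE  ✓  -  ✓  4789.3 ± 192.3  2393.2 ± 316.9  1684.8 ± 1060.9
SUNRISE  ✓  ✓  ✓  5370.6 ± 483.1  2601.9 ± 306.5  1627.0 ± 292.7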
Effects of ensemble size. We analyze the effects of ensemble size $N$ on the Cheetah and Ant environments from OpenAI Gym. Figure 2(b) and Figure 2(c) show that the performance can be improved by increasing the ensemble size, but the improvement saturates around $N = 5$. Thus, we use five ensemble agents for all experiments. More experimental results on the Hopper and Walker environments are also available in Appendix D, where the overall trend is similar.
Contribution of each technique. In order to verify the individual effects of each technique in SUNRISE, we incrementally apply our techniques. For SUNRISE without UCB exploration, we use random inference proposed in Bootstrapped DQN Osband et al. (2016), which selects a policy index uniformly at random and generates the action from the selected actor for the duration of that episode (see Algorithm 2 in Appendix A). Table 4 shows the performance of SUNRISE on several environments from OpenAI Gym and Atari games. First, we remark that the performance gain from SUNRISE only with bootstrap, which corresponds to a naive extension of Bootstrapped DQN Osband et al. (2016), is marginal compared to other techniques, such as weighted Bellman backup and UCB exploration. However, by utilizing all proposed techniques, we obtain the best performance in almost all environments. This shows that all proposed techniques can be integrated and that they are indeed largely complementary.
In this paper, we present SUNRISE, a simple unified ensemble method, which is compatible with various off-policy RL algorithms. In particular, SUNRISE integrates bootstrap with random initialization, weighted Bellman backup, and UCB exploration to handle various issues in off-policy RL algorithms. Our experiments show that SUNRISE consistently improves the performances of existing off-policy RL algorithms, such as Soft Actor-Critic and Rainbow DQN, and outperforms state-of-the-art RL algorithms for both continuous and discrete control tasks on both low-dimensional and high-dimensional environments. We believe that SUNRISE could be useful to other relevant topics such as sim-to-real transfer Tobin et al. (2017); Torabi et al. (2018), offline RL Agarwal et al. (2020), and planning Srinivas et al. (2018); Tamar et al. (2016).
This research is supported in part by ONR PECASE N000141612723, Tencent, and Berkeley Deep Drive. We would like to thank Hao Liu for improving the presentation and giving helpful feedback. We would also like to thank Aviral Kumar and Kai Arulkumaran for providing tips on implementation of DisCor and Rainbow.
Despite impressive progress in Deep RL over the last few years, a number of issues prevent RL algorithms from being deployed to real-world problems like autonomous navigation Bojarski et al. (2016) and industrial robotic manipulation Kalashnikov et al. (2018). One issue, among several others, is training stability. RL algorithms are often sensitive to hyperparameters, noisy, and prone to converging to suboptimal policies. Our work addresses the stability issue by providing a unified framework for utilizing ensembles during training. The resulting algorithm significantly improves the stability of prior methods. Though we demonstrate results on common RL benchmarks, SUNRISE could be one component, of many, that helps stabilize training RL policies in real-world tasks like robotically assisted elderly care, automation of household tasks, and robotic assembly in manufacturing.
One downside to the SUNRISE method is that it requires additional compute proportional to the ensemble size. A concern is that developing methods that require increased computing resources to improve performance and deploying them at scale could lead to increased carbon emissions due to the energy required to power large compute clusters Schwartz (2020). For this reason, it is also important to develop complementary methods for training large networks energy-efficiently Howard et al. (2017).
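Bellemare et al. (2013). Journal of Artificial Intelligence Research 47, pp. 253–279.
Howard et al. (2017). Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
van Hasselt et al. (2019). When to use parametric models in reinforcement learning? In NeurIPS.
Background. SAC Haarnoja et al. (2018) is a state-of-the-art off-policy algorithm for continuous control problems. SAC learns a policy, $\pi_\phi(a | s)$, and a critic, $Q_\theta(s, a)$, and aims to maximize a weighted objective of the reward and the policy entropy, $\mathbb{E}_{s_t, a_t \sim \pi}\left[\sum_t r_t + \alpha \mathcal{H}(\pi_\phi(\cdot | s_t))\right]$. To update the parameters, SAC alternates between a soft policy evaluation and a soft policy improvement. At the soft policy evaluation step, a soft Q-function, which is modeled as a neural network with parameters $\theta$, is updated by minimizing the following soft Bellman residual:
$$\mathcal{L}^{\text{SAC}}_{\text{critic}}(\theta) = \mathbb{E}_{\tau_t \sim \mathcal{B}}\left[\mathcal{L}_Q(\tau_t, \theta)\right],$$
$$\mathcal{L}_Q(\tau_t, \theta) = \left(Q_\theta(s_t, a_t) - r_t - \gamma \mathbb{E}_{a_{t+1} \sim \pi_\phi}\left[Q_{\bar{\theta}}(s_{t+1}, a_{t+1}) - \alpha \log \pi_\phi(a_{t+1} | s_{t+1})\right]\right)^2,$$
where $\tau_t = (s_t, a_t, r_t, s_{t+1})$ is a transition, $\mathcal{B}$ is a replay buffer, $\bar{\theta}$ are the delayed parameters, and $\alpha$ is a temperature parameter. At the soft policy improvement step, the policy $\pi$ with its parameter $\phi$ is updated by minimizing the following objective:
$$\mathcal{L}^{\text{SAC}}_{\text{actor}}(\phi) = \mathbb{E}_{s_t \sim \mathcal{B}}\left[\mathcal{L}_\pi(s_t, \phi)\right], \quad \text{where} \quad \mathcal{L}_\pi(s_t, \phi) = \mathbb{E}_{a_t \sim \pi_\phi}\left[\alpha \log \pi_\phi(a_t | s_t) - Q_\theta(s_t, a_t)\right].$$
We remark that this corresponds to minimizing the Kullback-Leibler divergence between the policy and a Boltzmann distribution induced by the current soft Q-function.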
SUNRISE without UCB exploration. For SUNRISE without UCB exploration, we use random inference proposed in Bootstrapped DQN Osband et al. (2016), which selects a policy index uniformly at random and generates the action from the selected actor for the duration of that episode (see Line 3 in Algorithm 2).
Background. The DQN algorithm Mnih et al. (2015) learns a Q-function, which is modeled as a neural network with parameters $\theta$, by minimizing the following Bellman residual:
$$\mathcal{L}^{\text{DQN}}(\theta) = \mathbb{E}_{\tau_t \sim \mathcal{B}}\left[\left(Q_\theta(s_t, a_t) - r_t - \gamma \max_{a} Q_{\bar{\theta}}(s_{t+1}, a)\right)^2\right], \tag{7}$$
where $\tau_t = (s_t, a_t, r_t, s_{t+1})$ is a transition, $\mathcal{B}$ is a replay buffer, and $\bar{\theta}$ are the delayed parameters. Even though Rainbow DQN integrates several techniques, such as double Q-learning Van Hasselt et al. (2016) and distributional DQN Bellemare et al. (2017), applying SUNRISE to Rainbow DQN can be described based on the standard DQN algorithm. For exposition, we refer the reader to Hessel et al. (2018) for more detailed explanations of Rainbow DQN.
Bootstrap with random initialization. Formally, we consider an ensemble of $N$ Q-functions, i.e., $\{Q_{\theta_i}\}_{i=1}^{N}$, where $\theta_i$ denotes the parameters of the $i$-th Q-function. (Here, we remark that each Q-function has a unique target Q-function.) To train the ensemble of Q-functions, we use the bootstrap with random initialization Efron (1982); Osband et al. (2016), which enforces the diversity between Q-functions through two simple ideas: First, we initialize the model parameters of all Q-functions with random parameter values for inducing an initial diversity in the models. Second, we apply different samples to train each Q-function. Specifically, for each Q-function $i$ in each timestep $t$, we draw the binary masks $m_{t,i}$ from the Bernoulli distribution with parameter $\beta \in (0, 1]$, and store them in the replay buffer. Then, when updating the model parameters of Q-functions, we multiply the bootstrap mask to each objective function.
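The masking scheme described above can be sketched as follows: one Bernoulli mask entry is drawn per ensemble member when a transition is collected, stored alongside it, and later multiplied into each member's loss. The names `draw_bootstrap_mask`, `masked_losses`, and `beta` (the Bernoulli parameter) are illustrative, and plain lists stand in for tensors.

```python
import random


def draw_bootstrap_mask(num_members, beta, rng):
    """Draw one binary mask entry per ensemble member with P(1) = beta."""
    return [1 if rng.random() < beta else 0 for _ in range(num_members)]


def masked_losses(per_member_losses, mask):
    """Zero out the loss of every member whose mask entry is 0."""
    return [m * loss for m, loss in zip(mask, per_member_losses)]


rng = random.Random(0)
# The mask is stored in the replay buffer together with the transition.
transition = {"state": 0, "action": 1, "reward": 0.5, "next_state": 2,
              "mask": draw_bootstrap_mask(num_members=5, beta=0.5, rng=rng)}
losses = masked_losses([0.3, 0.7, 1.1], [1, 0, 1])
```

With `beta = 1.0` every member trains on every transition, so diversity then comes only from the random initialization.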
Weighted Bellman backup. Since conventional Q-learning is based on the Bellman backup in (7), it can be affected by error propagation, i.e., error in the target Q-function gets propagated into the Q-function at the current state. Recently, Kumar et al. (2020) showed that this error propagation can cause inconsistency and unstable convergence. To mitigate this issue, for each Q-function $i$, we consider a weighted Bellman backup as follows:
$$\mathcal{L}^{\tt DQN}_{WQ}(\tau_t, \theta_i) = w(s_{t+1}) \big(Q_{\theta_i}(s_t, a_t) - r_t - \gamma \max_{a} Q_{\bar{\theta}_i}(s_{t+1}, a)\big)^2,$$
where $\tau_t = (s_t, a_t, r_t, s_{t+1})$ is a transition, and $w(s)$ is a confidence weight based on the ensemble of target Q-functions:
$$w(s) = \sigma\big(-\bar{Q}_{\text{std}}(s) \cdot T\big) + 0.5, \tag{8}$$
where $T > 0$ is a temperature, $\sigma$ is the sigmoid function, and $\bar{Q}_{\text{std}}(s)$ is the empirical standard deviation of all target Q-functions $\{\max_{a} Q_{\bar{\theta}_i}(s, a)\}_{i=1}^{N}$. Note that the confidence weight is bounded in $[0.5, 1.0]$ because the standard deviation is always nonnegative. (Footnote 4: We find that it is empirically stable to set the minimum value of the weight to 0.5.) The proposed objective down-weights the sample transitions with high variance across target Q-functions, resulting in a loss function for the updates that has a better signal-to-noise ratio. Note that we combine the proposed weighted Bellman backup with prioritized replay Schaul et al. (2016) by multiplying the Bellman backups by both weights.
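To make the weighting concrete, here is a small sketch that assumes the confidence weight has the form sigmoid(-std × temperature) + 0.5, which matches the description above (sigmoid, temperature, standard deviation, minimum value of 0.5). The function names are hypothetical, and the use of the population standard deviation is an assumption; an implementation may use the sample standard deviation instead.

```python
import math
import statistics

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def confidence_weight(target_values, temperature):
    """target_values: one target Q estimate per ensemble member."""
    std = statistics.pstdev(target_values)  # disagreement across the ensemble
    return sigmoid(-std * temperature) + 0.5

def weighted_residual(q_value, reward, gamma, target_value, weight):
    """Squared Bellman residual scaled by the confidence weight."""
    return weight * (q_value - reward - gamma * target_value) ** 2


# Full agreement gives the maximum weight; strong disagreement approaches 0.5.
w_agree = confidence_weight([1.0, 1.0, 1.0], temperature=10.0)
w_disagree = confidence_weight([0.0, 100.0], temperature=10.0)
```

Because the sigmoid argument is never positive, the sigmoid term stays in (0, 0.5], so transitions whose targets the ensemble disagrees on are scaled down by at most a factor of two rather than discarded.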
UCB exploration. The ensemble can also be leveraged for efficient exploration Chen et al. (2017); Osband et al. (2016) because it can express higher uncertainty on unseen samples. Motivated by this, and following the idea of Chen et al. (2017), we consider an optimism-based exploration that chooses the action that maximizes
$$Q_{\text{mean}}(s_t, a) + \lambda Q_{\text{std}}(s_t, a), \tag{9}$$
where $Q_{\text{mean}}(s, a)$ and $Q_{\text{std}}(s, a)$ are the empirical mean and standard deviation of all Q-functions $\{Q_{\theta_i}\}_{i=1}^{N}$, and $\lambda > 0$ is a hyperparameter. This inference method can encourage exploration by adding an exploration bonus (i.e., the standard deviation $Q_{\text{std}}$) for visiting unseen state-action pairs, similar to the UCB algorithm Auer et al. (2002). This inference method was originally proposed in Chen et al. (2017) for efficient exploration in DQN, but we further extend it to Rainbow DQN. For evaluation, we approximate the maximum a posteriori action by choosing the action that maximizes the mean of the Q-functions, i.e., $a_t = \arg\max_{a} Q_{\text{mean}}(s_t, a)$. The full procedure is summarized in Algorithm 3.
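The action selection rule can be illustrated with a short sketch over a discrete action set: each action is scored by the ensemble mean plus a multiple of the ensemble standard deviation, and evaluation uses the mean alone. The function names and the nested-list layout of `q_values` are assumptions for illustration; the population standard deviation is used here as one possible choice of empirical standard deviation.

```python
import statistics

def ucb_action(q_values, lam):
    """q_values[i][a]: Q-value of action a under ensemble member i."""
    def score(a):
        column = [member[a] for member in q_values]
        # Optimistic score: ensemble mean plus an uncertainty bonus.
        return statistics.mean(column) + lam * statistics.pstdev(column)
    return max(range(len(q_values[0])), key=score)

def greedy_action(q_values):
    """Evaluation-time action: maximize the ensemble mean only."""
    return ucb_action(q_values, lam=0.0)


# Two ensemble members, two actions. Action 0 has the higher mean (2.0),
# while action 1 has a lower mean (1.5) but high disagreement (std 1.5).
q_values = [[2.0, 0.0], [2.0, 3.0]]
explore = ucb_action(q_values, lam=1.0)   # picks action 1
evaluate = greedy_action(q_values)        # picks action 0
```

In this toy case the bonus flips the choice toward the action the ensemble is uncertain about, which is the intended exploratory behavior.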
We evaluate the quality of uncertainty estimates from an ensemble of neural networks on a toy regression task. To this end, we generate twenty training samples drawn as , where , and train ten ensembles of regression networks using bootstrap with random initialization. Each regression network is a fully-connected neural network with 2 hidden layers and 50 rectified linear units in each layer. For bootstrap, we draw the binary masks from the Bernoulli distribution with mean . As uncertainty estimates, we measure the empirical variance of the networks' predictions. As shown in Figure 1(b), the ensemble can produce well-calibrated uncertainty estimates (i.e., variance) on unseen samples.
Environments. We evaluate the performance of SUNRISE on four complex environments based on the standard benchmarking environments from OpenAI Gym Brockman et al. (2016). (Footnote 5: We used the reference implementation at https://github.com/WilsonWangTHU/mbbl Wang et al. (2019).) Note that we do not use the modified Cheetah environment from PETS Chua et al. (2018) (denoted as Cheetah in POPLIN Wang and Ba (2020)) because it includes additional information in observations.
Training details. We consider a combination of SAC and SUNRISE using the publicly released implementation repository (https://github.com/vitchyr/rlkit) without any modifications to hyperparameters and architectures. For our method, the temperature $T$ for weighted Bellman backups is chosen from , the mean $\beta$ of the Bernoulli distribution is chosen from , the penalty parameter $\lambda$ is chosen from , and we train five ensemble agents. The optimal parameters are chosen to achieve the best performance on training environments. Here, we remark that training ensemble agents using the same training samples but with different initialization (i.e., $\beta = 1$) usually achieves the best performance, similar to Osband et al. (2016) and Chen et al. (2017). We expect that this is because splitting samples can reduce sample-efficiency. Also, the initial diversity from random initialization can be enough because each Q-function has a unique target Q-function, i.e., the target value also differs according to the initialization.
Learning curves. Figure 3 shows the learning curves on all environments. One can note that SUNRISE consistently improves the performance of SAC by a large margin.
Effects of ensembles. Figure 4 shows the learning curves of SUNRISE with varying values of the ensemble size on all environments. Performance improves as the ensemble size increases, but the improvement saturates around .
Hyperparameter | Value | Hyperparameter | Value
Random crop | True | Initial temperature |
Observation rendering | | Learning rate | (cheetah, run)
Observation downsampling | | | (otherwise)
Replay buffer size | | Learning rate ($\alpha$) |
Initial steps | | Batch size | (cheetah), (rest)
Stacked frames | | Q-function EMA |
Action repeat | (finger, spin; walker, walk) | Critic target update freq |
 | (cartpole, swingup) | Convolutional layers |
 | (otherwise) | Number of filters |
Hidden units (MLP) | | Nonlinearity | ReLU
Evaluation episodes | | Encoder EMA |
Optimizer | Adam | Latent dimension |
Discount | | |
Training details. We consider a combination of RAD and SUNRISE using the publicly released implementation repository (https://github.com/MishaLaskin/rad) with a full list of hyperparameters in Table 5. Similar to Laskin et al. (2020), we use the same encoder architecture as in Yarats et al. (2019), and the actor and critic share the same encoder to embed image observations. (Footnote 6: However, we remark that the agents do not share encoders, unlike Bootstrapped DQN Osband et al. (2016).) For our method, the temperature $T$ for weighted Bellman backups is chosen from , the mean $\beta$ of the Bernoulli distribution is chosen from , the penalty parameter $\lambda$ is chosen from , and we train five ensemble agents. The optimal parameters are chosen to achieve the best performance on training environments. Here, we remark that training ensemble agents using the same training samples but with different initialization (i.e., $\beta = 1$) usually achieves the best performance, similar to Osband et al. (2016) and Chen et al. (2017). We expect that this is because splitting samples can reduce sample-efficiency. Also, the initial diversity from random initialization can be enough because each Q-function has a unique target Q-function, i.e., the target value also differs according to the initialization.
Learning curves. Figures 5(g), 5(h), 5(i), 5(j), 5(k), and 5(l) show the learning curves on all environments. Since RAD already achieves near-optimal performance and the room for improvement is small, we see small but consistent gains from SUNRISE. To verify the effectiveness of SUNRISE more clearly, we consider a combination of SAC and SUNRISE in Figures 5(a), 5(b), 5(c), 5(d), 5(e), and 5(f), where the gain from SUNRISE is more significant.
Training details. We consider a combination of sample-efficient versions of Rainbow DQN and SUNRISE using the publicly released implementation repository (https://github.com/Kaixhin/Rainbow) without any modifications to hyperparameters and architectures. For our method, the temperature $T$ for weighted Bellman backups is chosen from , the mean $\beta$ of the Bernoulli distribution is chosen from , the penalty parameter $\lambda$ is chosen from , and we train five ensemble agents. The optimal parameters are chosen to achieve the best performance on training environments. Here, we remark that training ensemble agents using the same training samples but with different initialization (i.e., $\beta = 1$) usually achieves the best performance, similar to Osband et al. (2016) and Chen et al. (2017). We expect that this is because splitting samples can reduce sample-efficiency. Also, the initial diversity from random initialization can be enough because each Q-function has a unique target Q-function, i.e., the target value also differs according to the initialization.
DisCor. DisCor Kumar et al. (2020) was proposed to prevent the error propagation issue in Q-learning. In addition to standard Q-learning, DisCor trains an error model $\Delta_\psi(s, a)$, which approximates the cumulative sum of discounted Bellman errors over the past iterations of training. Then, using the error model, DisCor reweights the Bellman backups based on a confidence weight defined as follows:
$$w(s, a) \propto \exp\left(-\frac{\gamma \Delta_\psi(s, a)}{T}\right),$$
where $\gamma$ is a discount factor and $T$ is a temperature. Following the setups in Kumar et al. (2020), we take a network with one more hidden layer than the corresponding Q-network as an error model, and chose for all experiments. We update the temperature via a moving average and use the learning rate of . We use the SAC algorithm as the RL objective coupled with DisCor and build on top of the publicly released implementation repository (https://github.com/vitchyr/rlkit).
Learning curves. Figure 9 shows the learning curves of SUNRISE and DisCor on stochastic reward OpenAI Gym environments. SUNRISE outperforms baselines such as SAC and DisCor, even when only using the proposed weighted Bellman backup (green curve). This implies that errors in the target Q-function can be characterized effectively by the proposed confidence weight in (5). By additionally utilizing UCB exploration, both the sample-efficiency and the asymptotic performance of SUNRISE are further improved (blue curve).