The deep reinforcement learning community has made several independent improvements to the DQN algorithm. However, it is unclear which of these extensions are complementary and can be fruitfully combined. This paper examines six extensions to the DQN algorithm and empirically studies their combination. Our experiments show that the combination provides state-of-the-art performance on the Atari 2600 benchmark, both in terms of data efficiency and final performance. We also provide results from a detailed ablation study that shows the contribution of each component to overall performance.
The many recent successes in scaling reinforcement learning (RL) to complex sequential decision-making problems were kick-started by the Deep Q-Networks algorithm (DQN; Mnih et al. 2015). Its combination of Q-learning with convolutional neural networks and experience replay enabled it to learn, from raw pixels, how to play many Atari games at human-level performance. Since then, many extensions have been proposed that enhance its speed or stability.
Double DQN (DDQN; van Hasselt, Guez, and Silver 2016) addresses an overestimation bias of Q-learning [van Hasselt 2010], by decoupling selection and evaluation of the bootstrap action. Prioritized experience replay [Schaul et al. 2015] improves data efficiency, by replaying more often transitions from which there is more to learn. The dueling network architecture [Wang et al. 2016] helps to generalize across actions by separately representing state values and action advantages. Learning from multi-step bootstrap targets [Sutton 1988; Sutton and Barto 1998], as used in A3C [Mnih et al. 2016], shifts the bias-variance trade-off and helps to propagate newly observed rewards faster to earlier visited states. Distributional Q-learning [Bellemare, Dabney, and Munos 2017] learns a categorical distribution of discounted returns, instead of estimating the mean. Noisy DQN [Fortunato et al. 2017] uses stochastic network layers for exploration. This list is, of course, far from exhaustive.
Each of these algorithms enables substantial performance improvements in isolation. Since they do so by addressing radically different issues, and since they build on a shared framework, they could plausibly be combined. In some cases this has been done: Prioritized DDQN and Dueling DDQN both use double Q-learning, and Dueling DDQN was also combined with prioritized experience replay. In this paper we propose to study an agent that combines all the aforementioned ingredients. We show how these different ideas can be integrated, and that they are indeed largely complementary. In fact, their combination results in new state-of-the-art results on the benchmark suite of 57 Atari 2600 games from the Arcade Learning Environment [Bellemare et al.2013], both in terms of data efficiency and of final performance. Finally we show results from ablation studies to help understand the contributions of the different components.
Reinforcement learning addresses the problem of an agent learning to act in an environment in order to maximize a scalar reward signal. No direct supervision is provided to the agent; for instance, it is never directly told the best action.
At each discrete time step t, the environment provides the agent with an observation S_t, the agent responds by selecting an action A_t, and then the environment provides the next reward R_{t+1}, discount γ_{t+1}, and state S_{t+1}. This interaction is formalized as a Markov Decision Process, or MDP, which is a tuple ⟨S, A, T, r, γ⟩, where S is a finite set of states, A is a finite set of actions, T(s, a, s') = P[S_{t+1} = s' | S_t = s, A_t = a] is the (stochastic) transition function, r(s, a) = E[R_{t+1} | S_t = s, A_t = a] is the reward function, and γ ∈ [0, 1] is a discount factor. In our experiments MDPs will be episodic with a constant γ_t = γ = 0.99, except on episode termination where γ_t = 0, but the algorithms are expressed in the general form.
On the agent side, action selection is given by a policy π(s, a) that defines a probability distribution over actions for each state. From the state S_t encountered at time t, we define the discounted return G_t = Σ_{k=0}^∞ γ_t^(k) R_{t+k+1} as the discounted sum of future rewards collected by the agent, where the discount for a reward k steps in the future is given by the product of discounts before that time, γ_t^(k) = Π_{i=1}^{k} γ_{t+i}. An agent aims to maximize the expected discounted return by finding a good policy.
The policy may be learned directly, or it may be constructed as a function of some other learned quantities. In value-based reinforcement learning, the agent learns an estimate of the expected discounted return, or value, when following a policy π starting from a given state, v^π(s) = E_π[G_t | S_t = s], or state-action pair, q^π(s, a) = E_π[G_t | S_t = s, A_t = a]. A common way of deriving a new policy from a state-action value function is to act ε-greedily with respect to the action values. This corresponds to taking the action with the highest value (the greedy action) with probability 1 − ε, and to otherwise act uniformly at random with probability ε. Policies of this kind are used to introduce a form of exploration: by randomly selecting actions that are sub-optimal according to its current estimates, the agent can discover and correct its estimates when appropriate. The main limitation is that it is difficult to discover alternative courses of action that extend far into the future; this has motivated research on more directed forms of exploration.
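As an illustrative sketch (ours, not code from the paper), the ε-greedy rule described above can be written in a few lines of Python, with action values held in a plain list:

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon act uniformly at random; otherwise act greedily.

    q_values: estimated action values for the current state, one per action.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy action: the index of the maximal action value.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With ε = 0 this is pure greedy action selection; with ε = 1 it is uniformly random, matching the exploration schedule described later for DQN.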
Large state and/or action spaces make it intractable to learn Q value estimates for each state and action pair independently. In deep reinforcement learning, we represent the various components of agents, such as the policy π(s, a) or the values q(s, a), with deep (i.e., multi-layer) neural networks. The parameters of these networks are trained by gradient descent to minimize some suitable loss function.
In DQN [Mnih et al. 2015] deep networks and reinforcement learning were successfully combined by using a convolutional neural net to approximate the action values for a given state S_t (which is fed as input to the network in the form of a stack of raw pixel frames). At each step, based on the current state, the agent selects an action ε-greedily with respect to the action values, and adds a transition (S_t, A_t, R_{t+1}, γ_{t+1}, S_{t+1}) to a replay memory buffer [Lin 1992], that holds the last million transitions. The parameters of the neural network are optimized by using stochastic gradient descent to minimize the loss

(R_{t+1} + γ_{t+1} max_{a'} q_θ̄(S_{t+1}, a') − q_θ(S_t, A_t))² ,    (1)

where t is a time step randomly picked from the replay memory. The gradient of the loss is back-propagated only into the parameters θ of the online network (which is also used to select actions); the term θ̄ represents the parameters of a target network, a periodic copy of the online network which is not directly optimized. The optimization is performed using RMSprop [Tieleman and Hinton 2012], a variant of stochastic gradient descent, on mini-batches sampled uniformly from the experience replay. This means that in the loss above, the time index t will be a random time index from the last million transitions, rather than the current time. The use of experience replay and target networks enables relatively stable learning of Q values, and led to super-human performance on several Atari games.
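The shape of this loss can be sketched in plain Python (a didactic sketch under our own naming; the published agent additionally clips the TD error and works on mini-batches of sampled transitions):

```python
def q_learning_target(reward, discount, next_q_target):
    """One-step target: R_{t+1} + gamma_{t+1} * max_a' q_targetnet(S_{t+1}, a')."""
    return reward + discount * max(next_q_target)

def dqn_loss(q_online_sa, reward, discount, next_q_target):
    """Squared TD error between the online estimate and the bootstrap target.

    Only q_online_sa would receive gradients; next_q_target comes from the
    frozen target network and is treated as a constant.
    """
    return (q_learning_target(reward, discount, next_q_target) - q_online_sa) ** 2
```

The separation between the online estimate and the target-network bootstrap mirrors the stop-gradient on θ̄ described above.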
DQN has been an important milestone, but several limitations of this algorithm are now known, and many extensions have been proposed. We propose a selection of six extensions that each have addressed a limitation and improved overall performance. To keep the size of the selection manageable, we picked a set of extensions that address distinct concerns (e.g., just one of the many addressing exploration).
Conventional Q-learning is affected by an overestimation bias, due to the maximization step in Equation 1, and this can harm learning. Double Q-learning [van Hasselt 2010] addresses this overestimation by decoupling, in the maximization performed for the bootstrap target, the selection of the action from its evaluation. It is possible to effectively combine this with DQN [van Hasselt, Guez, and Silver 2016], using the loss

(R_{t+1} + γ_{t+1} q_θ̄(S_{t+1}, argmax_{a'} q_θ(S_{t+1}, a')) − q_θ(S_t, A_t))² .

This change was shown to reduce harmful overestimations that were present for DQN, thereby improving performance.
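The decoupling can be made concrete in a short sketch (our illustration, assuming per-action value lists for the online and target networks):

```python
def double_q_target(reward, discount, next_q_online, next_q_target):
    """Double Q-learning target.

    The bootstrap action is SELECTED with the online network, but its value
    is EVALUATED with the target network, which reduces overestimation.
    """
    a_star = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + discount * next_q_target[a_star]
```

Compare with the ordinary Q-learning target: with online values [1.0, 0.0] and target values [0.5, 10.0], the double-Q target bootstraps from 0.5, while a single max over the target values would bootstrap from the (possibly overestimated) 10.0.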
DQN samples uniformly from the replay buffer. Ideally, we want to sample more frequently those transitions from which there is much to learn. As a proxy for learning potential, prioritized experience replay [Schaul et al. 2015] samples transitions with probability p_t relative to the last encountered absolute TD error:

p_t ∝ | R_{t+1} + γ_{t+1} max_{a'} q_θ̄(S_{t+1}, a') − q_θ(S_t, A_t) |^ω ,

where ω is a hyper-parameter that determines the shape of the distribution. New transitions are inserted into the replay buffer with maximum priority, providing a bias towards recent transitions. Note that stochastic transitions might also be favoured, even when there is little left to learn about them.
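A minimal sketch of turning TD errors into sampling probabilities (our illustration; real implementations use a sum-tree so that sampling is O(log N) rather than O(N), and the small `eps` constant is our addition to keep every probability non-zero):

```python
def replay_probabilities(td_errors, omega=0.5, eps=1e-6):
    """Sampling probabilities proportional to |TD error|^omega."""
    priorities = [(abs(d) + eps) ** omega for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]
```

Larger values of ω sharpen the distribution towards high-error transitions; ω = 0 recovers uniform replay.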
The dueling network is a neural network architecture designed for value based RL. It features two streams of computation, the value and advantage streams, sharing a convolutional encoder, and merged by a special aggregator [Wang et al. 2016]. This corresponds to the following factorization of action values:

q_θ(s, a) = v_η(f_ξ(s)) + a_ψ(f_ξ(s), a) − (1 / N_actions) Σ_{a'} a_ψ(f_ξ(s), a') ,

where ξ, η, and ψ are, respectively, the parameters of the shared encoder f_ξ, of the value stream v_η, and of the advantage stream a_ψ; and θ = {ξ, η, ψ} is their concatenation.
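The aggregator itself is a one-liner once the two stream outputs are available (sketch with our own function name; in the real network the value and advantages are produced by learned heads on the shared encoder):

```python
def dueling_aggregate(value, advantages):
    """q(s, a) = v(s) + adv(s, a) - mean_a' adv(s, a').

    Subtracting the mean advantage makes the decomposition identifiable:
    adding a constant to all advantages leaves the action values unchanged.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [value + adv - mean_adv for adv in advantages]
```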
Q-learning accumulates a single reward and then uses the greedy action at the next step to bootstrap. Alternatively, forward-view multi-step targets can be used [Sutton 1988]. We define the truncated n-step return from a given state S_t as

R_t^(n) ≡ Σ_{k=0}^{n−1} γ_t^(k) R_{t+k+1} .

A multi-step variant of DQN is then defined by minimizing the alternative loss

(R_t^(n) + γ_t^(n) max_{a'} q_θ̄(S_{t+n}, a') − q_θ(S_t, A_t))² .

Multi-step targets with a suitably tuned n often lead to faster learning [Sutton and Barto 1998].
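The truncated n-step return above can be computed in one pass over the n stored transitions (our sketch; it also returns the compound discount γ_t^(n) needed to weight the bootstrap term):

```python
def truncated_n_step_return(rewards, discounts):
    """Compute R_t^(n) and the compound discount gamma_t^(n).

    rewards:   [R_{t+1}, ..., R_{t+n}]
    discounts: [gamma_{t+1}, ..., gamma_{t+n}]
    The k-th reward is weighted by the product of the first k discounts,
    so the first reward enters undiscounted.
    """
    ret, cum_discount = 0.0, 1.0
    for r, g in zip(rewards, discounts):
        ret += cum_discount * r
        cum_discount *= g
    return ret, cum_discount
```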
We can learn to approximate the distribution of returns instead of the expected return. Recently Bellemare, Dabney, and Munos (2017) proposed to model such distributions with probability masses placed on a discrete support z, where z is a vector with N_atoms atoms, defined by z^i = v_min + (i − 1)(v_max − v_min)/(N_atoms − 1) for i ∈ {1, …, N_atoms}. The approximating distribution d_t at time t is defined on this support, with the probability mass p_θ^i(S_t, A_t) on each atom i, such that d_t = (z, p_θ(S_t, A_t)). The goal is to update θ such that this distribution closely matches the actual distribution of returns.
To learn the probability masses, the key insight is that return distributions satisfy a variant of Bellman's equation. For a given state S_t and action A_t, the distribution of the returns under the optimal policy should match a target distribution defined by taking the distribution for the next state S_{t+1} and action a*_{t+1}, contracting it towards zero according to the discount, and shifting it by the reward (or distribution of rewards, in the stochastic case). A distributional variant of Q-learning is then derived by first constructing a new support for the target distribution, and then minimizing the Kullback–Leibler divergence between the distribution d_t and the target distribution d'_t ≡ (R_{t+1} + γ_{t+1} z, p_θ̄(S_{t+1}, a*_{t+1})),

D_KL(Φ_z d'_t ∥ d_t) .    (3)

Here Φ_z is a L2-projection of the target distribution onto the fixed support z, and a*_{t+1} = argmax_a q_θ̄(S_{t+1}, a) is the greedy action with respect to the mean action values q_θ̄(S_{t+1}, a) = z⊤ p_θ̄(S_{t+1}, a) in state S_{t+1}.
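The support construction and the projection Φ_z can both be sketched concretely (our implementation under the definitions above; each shifted mass is split linearly between the two nearest fixed atoms, after clipping into [v_min, v_max]):

```python
import math

def make_support(v_min, v_max, n_atoms):
    """Evenly spaced atoms z^i = v_min + (i - 1)(v_max - v_min)/(n_atoms - 1)."""
    dz = (v_max - v_min) / (n_atoms - 1)
    return [v_min + i * dz for i in range(n_atoms)]

def l2_project(target_atoms, target_probs, z):
    """Project masses at positions target_atoms onto the fixed support z."""
    v_min, v_max, dz = z[0], z[-1], z[1] - z[0]
    out = [0.0] * len(z)
    for x, p in zip(target_atoms, target_probs):
        x = min(max(x, v_min), v_max)   # clip into the support
        b = (x - v_min) / dz            # fractional atom index
        lo, hi = math.floor(b), math.ceil(b)
        if lo == hi:
            out[lo] += p                # landed exactly on an atom
        else:
            out[lo] += p * (hi - b)     # split mass between neighbours
            out[hi] += p * (b - lo)
    return out
```

The projection preserves total probability mass, so the result is again a valid categorical distribution on z.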
As in the non-distributional case, we can use a frozen copy of the parameters θ̄ to construct the target distribution. The parametrized distribution can be represented by a neural network, as in DQN, but with N_atoms × N_actions outputs. A softmax is applied independently for each action dimension of the output to ensure that the distribution for each action is appropriately normalized.
The limitations of exploring using ε-greedy policies are clear in games such as Montezuma's Revenge, where many actions must be executed to collect the first reward. Noisy Nets [Fortunato et al. 2017] propose a noisy linear layer that combines a deterministic and a noisy stream,

y = (b + W x) + (b_noisy ⊙ ε^b + (W_noisy ⊙ ε^w) x) ,    (4)

where ε^b and ε^w are random variables, and ⊙ denotes the element-wise product. This transformation can then be used in place of the standard linear y = b + W x. Over time, the network can learn to ignore the noisy stream, but will do so at different rates in different parts of the state space, allowing state-conditional exploration with a form of self-annealing.
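A forward pass of such a layer can be sketched in pure Python (our illustration of Equation (4) with independent Gaussian noise per entry; the paper's agents use the cheaper factorised variant described later, and in practice the layer would be a module of a deep-learning framework):

```python
import random

def noisy_linear(x, w, b, w_sigma, b_sigma, rng=random):
    """y = (b + W x) + (b_sigma * eps_b + (w_sigma * eps_w) x).

    w, w_sigma: [out][in] weight matrices; b, b_sigma: [out] biases.
    Fresh standard-Gaussian noise is drawn for every entry on each call,
    while the learned w_sigma/b_sigma scale how much noise is injected.
    """
    y = []
    for i in range(len(b)):
        deterministic = b[i] + sum(w[i][j] * x[j] for j in range(len(x)))
        noisy = b_sigma[i] * rng.gauss(0.0, 1.0) + sum(
            w_sigma[i][j] * rng.gauss(0.0, 1.0) * x[j] for j in range(len(x)))
        y.append(deterministic + noisy)
    return y
```

If the network drives all noise scales to zero, the layer reduces to a plain linear transform, which is the self-annealing behaviour described above.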
In this paper we integrate all the aforementioned components into a single integrated agent, which we call Rainbow.
First, we replace the 1-step distributional loss (3) with a multi-step variant. We construct the target distribution by contracting the value distribution in S_{t+n} according to the cumulative discount, and shifting it by the truncated n-step discounted return. This corresponds to defining the target distribution as d_t^(n) = (R_t^(n) + γ_t^(n) z, p_θ̄(S_{t+n}, a*_{t+n})). The resulting loss is

D_KL(Φ_z d_t^(n) ∥ d_t) ,

where, again, Φ_z is the projection onto z.
We combine the multi-step distributional loss with double Q-learning by using the greedy action in S_{t+n} selected according to the online network as the bootstrap action a*_{t+n}, and evaluating that action using the target network.
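Putting these pieces together, the construction of Rainbow's target distribution can be sketched end to end (our self-contained illustration; the categorical projection is inlined, and the per-action distributions are plain lists of masses):

```python
import math

def rainbow_target(n_step_return, n_step_discount,
                   next_probs_online, next_probs_target, z):
    """Multi-step distributional target with a double-Q bootstrap action.

    next_probs_online/target: per-action categorical masses at S_{t+n}.
    z: the fixed, evenly spaced support.
    """
    # Double Q-learning: select a* by the ONLINE network's mean values ...
    q_online = [sum(p * zi for p, zi in zip(dist, z)) for dist in next_probs_online]
    a_star = max(range(len(q_online)), key=lambda a: q_online[a])
    # ... then take the TARGET network's distribution for a*, with the
    # support shifted by the n-step return and contracted by gamma_t^(n).
    shifted = [n_step_return + n_step_discount * zi for zi in z]
    # Project the shifted distribution back onto the fixed support z.
    v_min, v_max, dz = z[0], z[-1], z[1] - z[0]
    out = [0.0] * len(z)
    for x, p in zip(shifted, next_probs_target[a_star]):
        x = min(max(x, v_min), v_max)
        b = (x - v_min) / dz
        lo, hi = math.floor(b), math.ceil(b)
        if lo == hi:
            out[lo] += p
        else:
            out[lo] += p * (hi - b)
            out[hi] += p * (b - lo)
    return out
```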
In standard proportional prioritized replay [Schaul et al. 2015] the absolute TD error is used to prioritize the transitions. This can be computed in the distributional setting, using the mean action values. However, in our experiments all distributional Rainbow variants prioritize transitions by the KL loss, since this is what the algorithm is minimizing:

p_t ∝ ( D_KL(Φ_z d_t^(n) ∥ d_t) )^ω .
The KL loss as priority might be more robust to noisy stochastic environments because the loss can continue to decrease even when the returns are not deterministic.
The network architecture is a dueling network architecture adapted for use with return distributions. The network has a shared representation f_ξ(s), which is then fed into a value stream v_η with N_atoms outputs, and into an advantage stream a_ψ with N_atoms × N_actions outputs, where a_ψ^i(f_ξ(s), a) will denote the output corresponding to atom i and action a. For each atom z^i, the value and advantage streams are aggregated, as in dueling DQN, and then passed through a softmax layer to obtain the normalised parametric distributions used to estimate the returns' distributions:

p_θ^i(s, a) = exp( v_η^i(φ) + a_ψ^i(φ, a) − ā_ψ^i(φ) ) / Σ_j exp( v_η^j(φ) + a_ψ^j(φ, a) − ā_ψ^j(φ) ) ,

where φ = f_ξ(s) and ā_ψ^i(φ) = (1 / N_actions) Σ_{a'} a_ψ^i(φ, a').
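The per-atom aggregation followed by a per-action softmax can be sketched as follows (our illustration; stream outputs are taken as plain lists of logits rather than network heads):

```python
import math

def distributional_dueling(value_logits, adv_logits):
    """Per-atom dueling aggregation, then a softmax over atoms per action.

    value_logits: [n_atoms] outputs of the value stream.
    adv_logits:   [n_actions][n_atoms] outputs of the advantage stream.
    Returns one normalised distribution over atoms for each action.
    """
    n_actions, n_atoms = len(adv_logits), len(value_logits)
    mean_adv = [sum(adv_logits[a][i] for a in range(n_actions)) / n_actions
                for i in range(n_atoms)]
    dists = []
    for a in range(n_actions):
        logits = [value_logits[i] + adv_logits[a][i] - mean_adv[i]
                  for i in range(n_atoms)]
        m = max(logits)                      # subtract max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        dists.append([e / total for e in exps])
    return dists
```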
We then replace all linear layers with their noisy equivalent described in Equation (4). Within these noisy linear layers we use factorised Gaussian noise [Fortunato et al.2017] to reduce the number of independent noise variables.
We now describe the methods and setup used for configuring and evaluating the learning agents.
We evaluated all agents on 57 Atari 2600 games from the Arcade Learning Environment [Bellemare et al. 2013]. We follow the training and evaluation procedures of Mnih et al. (2015) and van Hasselt et al. (2016). The average scores of the agent are evaluated during training, every 1M steps in the environment, by suspending learning and evaluating the latest agent for 500K frames. Episodes are truncated at 108K frames (or 30 minutes of simulated play), as in van Hasselt et al. (2016).
Agents’ scores are normalized, per game, so that 0% corresponds to a random agent and 100% to the average score of a human expert. Normalized scores can be aggregated across all Atari levels to compare the performance of different agents. It is common to track the median human normalized performance across all games. We also consider the number of games where the agent’s performance is above some fraction of human performance, to disentangle where improvements in the median come from. The mean human normalized performance is potentially less informative, as it is dominated by a few games (e.g., Atlantis) where agents achieve scores orders of magnitude higher than humans do.
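The normalization described above is a simple linear rescaling per game (sketch with our own function name, following the stated 0%/100% anchors):

```python
def human_normalized_score(agent_score, random_score, human_score):
    """0% at the random agent's level, 100% at the human expert's level."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)
```

Aggregating then amounts to taking the median of these per-game percentages across the 57 games.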
Besides tracking the median performance as a function of environment steps, at the end of training we re-evaluate the best agent snapshot using two different testing regimes. In the no-ops starts regime, we insert a random number (up to 30) of no-op actions at the beginning of each episode (as we do also in training). In the human starts regime, episodes are initialized with points randomly sampled from the initial portion of human expert trajectories [Nair et al.2015]; the difference between the two regimes indicates the extent to which the agent has over-fit to its own trajectories.
Due to space constraints, we focus on aggregate results across games. However, in the appendix we provide full learning curves for all games and all agents, as well as detailed comparison tables of raw and normalized scores, in both the no-op and human starts testing regimes.
|Min history to start learning||80K frames|
|Adam learning rate||0.0000625|
|Target network period||32K frames|
|Prioritization importance sampling β||0.4 → 1.0|
|Distributional min/max values||[−10, 10]|
All Rainbow's components have a number of hyper-parameters. The combinatorial space of hyper-parameters is too large for an exhaustive search, therefore we performed limited tuning. For each component, we started with the values used in the paper that introduced it, and tuned the most sensitive hyper-parameters by manual coordinate descent.
DQN and its variants do not perform learning updates during the first 200K frames, to ensure sufficiently uncorrelated updates. We have found that, with prioritized replay, it is possible to start learning sooner, after only 80K frames.
DQN starts with an exploration ε of 1, corresponding to acting uniformly at random; it anneals the amount of exploration over the first 4M frames, to a final value of 0.1 (lowered to 0.01 in later variants). Whenever using Noisy Nets, we acted fully greedily (ε = 0), with a value of 0.5 for the σ0 hyper-parameter used to initialize the weights in the noisy stream. (The noise was generated on the GPU. Tensorflow noise generation can be unreliable on GPU; if generating the noise on the CPU, lowering σ0 to 0.1 may be helpful.) For agents without Noisy Nets, we used ε-greedy but decreased the exploration rate faster than was previously used, annealing ε to 0.01 in the first 250K frames.
We used the Adam optimizer [Kingma and Ba 2014], which we found less sensitive to the choice of the learning rate than RMSProp. DQN uses a learning rate of α = 0.00025. In all Rainbow's variants we used a learning rate of α/4 = 0.0000625, selected among {α/2, α/4, α/6}, and a value of 1.5 × 10⁻⁴ for Adam's ε hyper-parameter.
For replay prioritization we used the recommended proportional variant, with priority exponent ω = 0.5, and linearly increased the importance sampling exponent β from 0.4 to 1 over the course of training. The priority exponent was tuned by comparing a small set of candidate values. Using the KL loss of distributional DQN as priority, we have observed that performance is very robust to the choice of ω.
The number of steps n in multi-step learning is a sensitive hyper-parameter of Rainbow. We compared values of n ∈ {1, 3, 5}. We observed that both n = 3 and n = 5 did well initially, but overall n = 3 performed the best by the end.
The hyper-parameters (see Table 1) are identical across all 57 games, i.e., the Rainbow agent really is a single agent setup that performs well across all the games.
In this section we analyse the main experimental results. First, we show that Rainbow compares favorably to several published agents. Then we perform ablation studies, comparing several variants of the agent, each corresponding to removing a single component from Rainbow.
In Figure 1 we compare Rainbow's performance (measured in terms of the median human normalized score across games) to the corresponding curves for A3C, DQN, DDQN, Prioritized DDQN, Dueling DDQN, Distributional DQN, and Noisy DQN. We thank the authors of the Dueling and Prioritized agents for providing the learning curves of these, and report our own re-runs for DQN, A3C, DDQN, Distributional DQN and Noisy DQN. The performance of Rainbow is significantly better than that of any of the baselines, both in data efficiency and in final performance. Note that Rainbow matches the final performance of DQN after 7M frames, surpasses the best final performance of these baselines within 44M frames, and reaches substantially improved final performance.
In the final evaluations of the agent, after the end of training, Rainbow achieves a median score of 223% in the no-ops regime; in the human starts regime we measured a median score of 153%. In Table 2 we compare these scores to the published median scores of the individual baselines.
|Agent||no-ops||human starts|
|Prioritized DDQN (*)||140%||128%|
|Dueling DDQN (*)||151%||117%|
In Figure 2 (top row) we plot the number of games where an agent has reached some specified level of human normalized performance. From left to right, the subplots show on how many games the different agents have achieved 20%, 50%, 100%, 200% and 500% human normalized performance. This allows us to identify where the overall improvements in performance come from. Note that the gap in performance between Rainbow and other agents is apparent at all levels of performance: the Rainbow agent is improving scores on games where the baseline agents were already good, as well as improving in games where baseline agents are still far from human performance.
As in the original DQN setup, we ran each agent on a single GPU. The 7M frames required to match DQN's final performance correspond to less than 10 hours of wall-clock time. A full run of 200M frames corresponds to approximately 10 days, and this varies by less than 20% between all of the discussed variants. The literature contains many alternative training setups that improve performance as a function of wall-clock time by exploiting parallelism, e.g., Nair et al. (2015), Salimans et al. (2017), and Mnih et al. (2016). Properly relating the performance across such very different hardware/compute resources is non-trivial, so we focused exclusively on algorithmic variations, allowing apples-to-apples comparisons. While we consider them to be important and complementary, we leave questions of scalability and parallelism to future work.
Since Rainbow integrates several different ideas into a single agent, we conducted additional experiments to understand the contribution of the various components, in the context of this specific combination.
To gain a better understanding of the contribution of each component to the Rainbow agent, we performed ablation studies. In each ablation, we removed one component from the full Rainbow combination. Figure 3 shows a comparison for median normalized score of the full Rainbow to six ablated variants. Figure 2 (bottom row) shows a more detailed breakdown of how these ablations perform relative to different thresholds of human normalized performance, and Figure 4 shows the gain or loss from each ablation for every game, averaged over the full learning run.
Prioritized replay and multi-step learning were the two most crucial components of Rainbow, in that removing either component caused a large drop in median performance. Unsurprisingly, the removal of either of these hurt early performance. Perhaps more surprisingly, the removal of multi-step learning also hurt final performance. Zooming in on individual games (Figure 4), we see both components helped almost uniformly across games (the full Rainbow performed better than either ablation in 53 games out of 57).
Distributional Q-learning ranked immediately below the previous techniques for relevance to the agent's performance. Notably, in early learning no difference is apparent, as shown in Figure 3: for the first 40 million frames the distributional-ablation performed as well as the full agent. Without distributions, however, the performance of the agent then started lagging behind. When the results are separated relative to human performance in Figure 2, we see that the distributional-ablation primarily lags on games that are above or near human level.
In terms of median performance, the agent performed better when Noisy Nets were included; when these are removed and exploration is delegated to the traditional -greedy mechanism, performance was worse in aggregate (red line in Figure 3). While the removal of Noisy Nets produced a large drop in performance for several games, it also provided small increases in other games (Figure 4).
In aggregate, we did not observe a significant difference when removing the dueling network from the full Rainbow. The median score, however, hides the fact that the impact of Dueling differed between games, as shown by Figure 4. Figure 2 shows that Dueling perhaps provided some improvement on games with above-human performance levels, and some degradation on games with sub-human performance.
Also in the case of double Q-learning, the observed difference in median performance (Figure 3) is limited, with the component sometimes harming or helping depending on the game (Figure 4). To further investigate the role of double Q-learning, we compared the predictions of our trained agents to the actual discounted returns computed from clipped rewards. Comparing Rainbow to the agent where double Q-learning was ablated, we observed that the actual returns are often higher than 10 and therefore fall outside the support of the distribution, spanning from −10 to 10. This leads to underestimated returns, rather than overestimations. We hypothesize that clipping the values to this constrained range counteracts the overestimation bias of Q-learning. Note, however, that the importance of double Q-learning may increase if the support of the distributions is expanded.
In the appendix, for each game we show final performance and learning curves for Rainbow, its ablations, and baselines.
We have demonstrated that several improvements to DQN can be successfully integrated into a single learning algorithm that achieves state-of-the-art performance. Moreover, we have shown that within the integrated algorithm, all but one of the components provided clear performance benefits. There are many more algorithmic components that we were not able to include, which would be promising candidates for further experiments on integrated agents. Among the many possible candidates, we discuss several below.
We have focused here on value-based methods in the Q-learning family. We have not considered purely policy-based RL algorithms such as trust-region policy optimisation [Schulman et al.2015], nor actor-critic methods [Mnih et al.2016, O’Donoghue et al.2016].
A number of algorithms exploit a sequence of data to achieve improved learning efficiency. Optimality tightening [He et al.2016] uses multi-step returns to construct additional inequality bounds, instead of using them to replace the 1-step targets used in Q-learning. Eligibility traces allow a soft combination over n-step returns [Sutton1988]. However, sequential methods all leverage more computation per gradient than the multi-step targets used in Rainbow. Furthermore, introducing prioritized sequence replay raises questions of how to store, replay and prioritise sequences.
Episodic control [Blundell et al.2016] also focuses on data efficiency, and was shown to be very effective in some domains. It improves early learning by using episodic memory as a complementary learning system, capable of immediately re-enacting successful action sequences.
Besides Noisy Nets, numerous other exploration methods could also be useful algorithmic ingredients: among these Bootstrapped DQN [Osband et al.2016], intrinsic motivation [Stadie, Levine, and Abbeel2015] and count-based exploration [Bellemare et al.2016]. Integration of these alternative components is fruitful subject for further research.
In this paper we have focused on the core learning updates, without exploring alternative computational architectures. Asynchronous learning from parallel copies of the environment, as in A3C [Mnih et al.2016], Gorila [Nair et al.2015], or Evolution Strategies [Salimans et al.2017], can be effective in speeding up learning, at least in terms of wall-clock time. Note, however, they can be less data efficient.
Hierarchical RL has also been applied with success to several complex Atari games. Among successful applications of HRL we highlight h-DQN [Kulkarni et al.2016a] and Feudal Networks [Vezhnevets et al.2017].
The state representation could also be made more efficient by exploiting auxiliary tasks such as pixel control or feature control [Jaderberg et al.2016], supervised predictions [Dosovitskiy and Koltun2016] or successor features [Kulkarni et al.2016b].
To evaluate Rainbow fairly against the baselines, we have followed the common domain modifications of clipping rewards, fixed action-repetition, and frame-stacking, but these might be removed by other learning algorithm improvements. Pop-Art normalization [van Hasselt et al. 2016] allows reward clipping to be removed, while preserving a similar level of performance. Fine-grained action repetition [Sharma, Lakshminarayanan, and Ravindran 2017] made it possible to learn how to repeat actions. A recurrent state network [Hausknecht and Stone 2015] can learn a temporal state representation, replacing the fixed stack of observation frames. In general, we believe that exposing the real game to the agent is a promising direction for future research.