1 Introduction
In Reinforcement Learning (RL), an agent seeks to maximize the cumulative rewards obtained from interactions with an unknown environment. Given only knowledge based on previously observed trajectories, the agent faces the exploration-exploitation dilemma: should the agent take actions that maximize rewards based on its current knowledge, or instead investigate poorly understood states and actions to potentially improve future performance? Thus, in order to find the optimal policy, the agent needs to use an appropriate exploration strategy.
Popular exploration strategies, such as $\epsilon$-greedy (Sutton & Barto, 1998), rely on random perturbations of the agent’s policy, which leads to undirected exploration. The theoretical RL literature offers a variety of statistically efficient methods that are based on a measure of uncertainty in the agent’s model. Examples include upper confidence bound (UCB) (Auer et al., 2002) and Thompson sampling (TS) (Thompson, 1933). In recent years, these have been extended to practical exploration algorithms for large state spaces and shown to improve performance (Osband et al., 2016a; Chen et al., 2017; O’Donoghue et al., 2018; Fortunato et al., 2018). However, these methods assume that the observation noise distribution is independent of the evaluation point, while in practice heteroscedastic observation noise is omnipresent in RL. This means that the noise depends on the evaluation point, rather than being identically distributed (homoscedastic). For instance, the return distribution typically depends on a sequence of interactions and, potentially, on hidden states or inherently heteroscedastic reward observations. Kirschner & Krause (2018) recently demonstrated that, even in the simpler bandit setting, classical approaches such as UCB and TS fail to efficiently account for heteroscedastic noise.
In this work, we propose to use Information-Directed Sampling (IDS) (Russo & Van Roy, 2014; Kirschner & Krause, 2018) for efficient exploration in RL. The IDS framework can be used to design exploration-exploitation strategies that balance the estimated instantaneous regret and the expected information gain. Importantly, through the choice of an appropriate information-gain function, IDS is able to account for parametric uncertainty and heteroscedastic observation noise during exploration.
As our main contribution, we propose a novel, tractable RL algorithm based on the IDS principle. We combine recent advances in distributional RL (Bellemare et al., 2017; Dabney et al., 2018b) and approximate parameter uncertainty methods in order to develop both homoscedastic and heteroscedastic variants of an agent that is similar to DQN (Mnih et al., 2015), but uses information-directed exploration. Our evaluation on Atari 2600 games shows the importance of accounting for heteroscedastic noise and indicates that our approach can substantially outperform alternative state-of-the-art algorithms that focus on modeling either only epistemic or only aleatoric uncertainty. To the best of our knowledge, we are the first to develop a tractable IDS algorithm for RL in large state spaces.
2 Related Work
Exploration algorithms are well understood in bandits and have inspired successful extensions to RL (Bubeck & Cesa-Bianchi, 2012; Lattimore & Szepesvári, 2018). Many strategies rely on the “optimism in the face of uncertainty” principle (Lai & Robbins, 1985). These algorithms act greedily w.r.t. an augmented reward function that incorporates an exploration bonus. One prominent example is the upper confidence bound (UCB) algorithm (Auer et al., 2002), which uses a bonus based on confidence intervals. A related strategy is Thompson sampling (TS) (Thompson, 1933), which samples actions according to their posterior probability of being optimal in a Bayesian model. This approach often provides better empirical results than optimistic strategies (Chapelle & Li, 2011).
In order to extend TS to RL, one needs to maintain a distribution over Markov Decision Processes (MDPs), which is difficult in general. Similar to TS, Osband et al. (2016b) propose randomized linear value functions to maintain a Bayesian posterior distribution over value functions. Bootstrapped DQN (Osband et al., 2016a) extends this idea to deep neural networks by using an ensemble of Q-functions. To explore, Bootstrapped DQN randomly samples a Q-function from the ensemble and acts greedily w.r.t. the sample.
Fortunato et al. (2018) and Plappert et al. (2018) investigate a similar idea and propose to adaptively perturb the parameter space, which can also be thought of as tracking an approximate parameter posterior. O’Donoghue et al. (2018) propose TS in combination with an uncertainty Bellman equation, which propagates the agent’s uncertainty in the Q-values over multiple time steps. Additionally, Chen et al. (2017) propose to use the Q-ensemble of Bootstrapped DQN to obtain approximate confidence intervals for a UCB policy. There are also multiple other ways to approximate the parametric posterior in neural networks, including Neural Bayesian Linear Regression (Snoek et al., 2015; Azizzadenesheli et al., 2018), Variational Inference (Blundell et al., 2015), Monte Carlo methods (Neal, 1995; Mandt et al., 2016; Welling & Teh, 2011), and Bayesian Dropout (Gal & Ghahramani, 2016). For an empirical comparison of these, we refer the reader to Riquelme et al. (2018).
A shortcoming of all approaches mentioned above is that, while they consider parametric uncertainty, they do not account for heteroscedastic noise during exploration. In contrast, distributional RL algorithms, such as Categorical DQN (C51) (Bellemare et al., 2017) and Quantile Regression DQN (QR-DQN) (Dabney et al., 2018b), approximate the distribution over the Q-values directly. However, neither method takes advantage of the return distribution for exploration; both use $\epsilon$-greedy exploration. Implicit Quantile Networks (IQN) (Dabney et al., 2018a) instead use a risk-sensitive policy based on a return distribution learned via quantile regression and outperform both C51 and QR-DQN on Atari-57. Similarly, Moerland et al. (2018) and Dilokthanakul & Shanahan (2018) act optimistically w.r.t. the return distribution in deterministic MDPs. However, these approaches do not consider parametric uncertainty.
Return and parametric uncertainty have previously been combined for exploration by Tang & Agrawal (2018) and Moerland et al. (2017). Both methods account for parametric uncertainty by sampling parameters that define a distribution over Q-values. The former then acts greedily with respect to the expectation of this distribution, while the latter additionally samples a return for each action and then acts greedily with respect to it. However, like Thompson sampling, these approaches do not appropriately exploit the heteroscedastic nature of the return. In particular, noisier actions are more likely to be chosen, which can slow down learning.
Our method is based on Information-Directed Sampling (IDS), which can explicitly account for parametric uncertainty and heteroscedasticity in the return distribution. IDS has been primarily studied in the bandit setting (Russo & Van Roy, 2014; Kirschner & Krause, 2018). Zanette & Sarkar (2017) extend it to finite MDPs, but their approach remains impractical for large state spaces, since it requires finding the optimal policies for a set of MDPs at the beginning of each episode.
3 Background
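We model the agent-environment interaction with an MDP $(\mathcal{S}, \mathcal{A}, R, P, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, $R(s, a)$ is the stochastic reward function, $P(s' \mid s, a)$ is the probability of transitioning from state $s$ to state $s'$ after taking action $a$, and $\gamma \in [0, 1)$ is the discount factor. A policy $\pi(\cdot \mid s)$ maps a state $s \in \mathcal{S}$ to a distribution over actions. For a fixed policy $\pi$, the discounted return of action $a$ in state $s$ is a random variable $Z^\pi(s, a) = \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)$, with initial state $s = s_0$ and action $a = a_0$ and transition probabilities $s_t \sim P(\cdot \mid s_{t-1}, a_{t-1})$, $a_t \sim \pi(\cdot \mid s_t)$. The return distribution satisfies the Bellman equation,
$$Z^\pi(s, a) \overset{D}{=} R(s, a) + \gamma Z^\pi(s', a'), \tag{1}$$
where $\overset{D}{=}$ denotes distributional equality. If we take the expectation of (1), the usual Bellman equation (Bellman, 1957) for the Q-function, $Q^\pi(s, a) = \mathbb{E}[Z^\pi(s, a)]$, follows as
$$Q^\pi(s, a) = \mathbb{E}[R(s, a)] + \gamma \mathbb{E}_{P, \pi}[Q^\pi(s', a')]. \tag{2}$$
The objective is to find an optimal policy $\pi^*$ that maximizes the expected total discounted return $\mathbb{E}[Z^\pi(s, a)] = Q^\pi(s, a)$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}$.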
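As a small illustration of the discounted return defined above, the following Python sketch evaluates the sum of discounted rewards for one finite trajectory by running the recursion behind eq. (1) backwards, $G_t = r_t + \gamma G_{t+1}$. The function name `discounted_return` and the truncation to a finite reward sequence are assumptions made for this example only.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a finite reward sequence.

    Illustrative helper (not from the paper): it applies the one-step
    recursion G_t = r_t + gamma * G_{t+1}, starting from the last reward.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# One sampled trajectory with three unit rewards and gamma = 0.5:
# 1 + 0.5 * 1 + 0.25 * 1 = 1.75
example = discounted_return([1.0, 1.0, 1.0], 0.5)
```

A single call like this yields one sample of the random return; the Q-function is the expectation of such samples over transitions and policy choices.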
3.1 Uncertainty in Reinforcement Learning
To find such an optimal policy, the majority of RL algorithms use a point estimate of the Q-function, $Q(s, a)$. However, such methods can be inefficient, because they can be overconfident about the performance of suboptimal actions if the optimal ones have not been evaluated before. A natural solution for more efficient exploration is to use uncertainty information. In this context, there are two sources of uncertainty. Parametric (epistemic) uncertainty is a result of ambiguity over the class of models that explain the data seen so far, while intrinsic (aleatoric) uncertainty is caused by stochasticity in the environment or policy, and is captured by the distribution over returns (Moerland et al., 2017).
Osband et al. (2016a) estimate parametric uncertainty with a Bootstrapped DQN. They maintain an ensemble of $K$ Q-functions, $\{Q_k\}_{k=1}^{K}$, which is represented by a multi-headed deep neural network. To train the network, the standard bootstrap method (Efron, 1979; Hastie et al., 2001) constructs $K$ different datasets by sampling with replacement from the global data pool. Instead, Osband et al. (2016a) train all network heads on the exact same data and diversify the Q-ensemble via two other mechanisms. First, each head $Q_k(s, a)$ is trained on its own independent target head $Q'_k(s, a)$, which is periodically updated (Mnih et al., 2015). Further, each head is randomly initialized, which, combined with the nonlinear parameterization and the independent targets, provides sufficient diversification.
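Intrinsic uncertainty is captured by the return distribution $Z^\pi$. While Q-learning (Watkins, 1989) aims to estimate the expected discounted return $Q^\pi(s, a) = \mathbb{E}[Z^\pi(s, a)]$, distributional RL approximates the random return $Z^\pi(s, a)$ directly. As in standard Q-learning (Watkins, 1989), one can define a distributional Bellman optimality operator based on (1),
$$\mathcal{T} Z(s, a) \overset{D}{=} R(s, a) + \gamma Z\big(s', \arg\max_{a' \in \mathcal{A}} \mathbb{E}[Z(s', a')]\big). \tag{3}$$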
To estimate the distribution of $Z$, we use the approach of C51 (Bellemare et al., 2017) in the following. It parameterizes the return as a categorical distribution over a set of equidistant atoms in a fixed interval $[V_{\min}, V_{\max}]$. The atom probabilities are parameterized by a softmax distribution over the outputs of a parametric model. Since the parameterization $Z_\theta$ and the Bellman update $\mathcal{T} Z_\theta$ have disjoint supports, the algorithm requires an additional step $\Phi$ that projects the shifted support of $\mathcal{T} Z_\theta$ onto $[V_{\min}, V_{\max}]$. Then it minimizes the Kullback-Leibler divergence $D_{\mathrm{KL}}(\Phi \mathcal{T} Z_\theta \,\|\, Z_\theta)$.
3.2 Heteroscedasticity in Reinforcement Learning
In RL, heteroscedasticity means that the variance of the return distribution $Z(s, a)$ depends on the state and action. This can occur in a number of ways. The variance of the reward function itself may depend on $s$ or $a$. Even with deterministic or homoscedastic rewards, in stochastic environments the variance of the observed return is a function of the stochasticity in the transitions over a sequence of steps. Furthermore, Partially Observable MDPs (Monahan, 1982) are also heteroscedastic due to the possibility of different states aliasing to the same observation.
Interestingly, heteroscedasticity also occurs in value-based RL regardless of the environment. This is due to Bellman targets being generated based on an evolving policy $\pi$. To demonstrate this, consider a standard observation model used in supervised learning, $y_t = f(x_t) + \epsilon_t(x_t)$, with true function $f$ and Gaussian noise $\epsilon_t(x_t)$. In Temporal Difference (TD) algorithms (Sutton & Barto, 1998), given a sample transition $(s_t, a_t, r_t, s_{t+1})$, the learning target is generated as $y_t = r_t + \gamma Q^\pi(s_{t+1}, a')$, for some action $a'$. Similarly to the observation model above, we can describe TD-targets for learning $Q^*$ being generated as $y_t = f(s_t, a_t) + \epsilon_t^\pi(s_t, a_t)$, with $f(s_t, a_t)$ and $\epsilon_t^\pi(s_t, a_t)$ given by
$$\begin{aligned} f(s_t, a_t) &= Q^*(s_t, a_t) = \mathbb{E}[R(s_t, a_t)] + \gamma \mathbb{E}_{s' \sim P(\cdot \mid s_t, a_t)}\big[\max_{a''} Q^*(s', a'')\big], \\ \epsilon_t^\pi(s_t, a_t) &= r_t + \gamma Q^\pi(s_{t+1}, a') - f(s_t, a_t) \\ &= \big(r_t - \mathbb{E}[R(s_t, a_t)]\big) + \gamma \Big(Q^\pi(s_{t+1}, a') - \mathbb{E}_{s' \sim P(\cdot \mid s_t, a_t)}\big[\max_{a''} Q^*(s', a'')\big]\Big). \end{aligned} \tag{4}$$
The last term clearly shows the dependence of the noise function $\epsilon_t^\pi(s, a)$ on the policy $\pi$ used to generate the Bellman target. Note additionally that heteroscedastic targets are not limited to TD-learning methods, but also occur in TD($\lambda$) and Monte Carlo learning (Sutton & Barto, 1998), no matter if the environment is stochastic or not.
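To make the notion of heteroscedastic learning targets concrete, the sketch below computes the exact mean and variance of a one-step TD target $r + \gamma V(s')$ for two hypothetical actions. The outcome lists, the next-state values, and the function name `td_target_moments` are invented for illustration; they are not taken from the paper's experiments.

```python
def td_target_moments(outcomes, gamma):
    """Mean and variance of the TD target r + gamma * next_value.

    `outcomes` is a list of (probability, reward, next_value) triples
    describing a hypothetical one-step transition model.
    """
    mean = sum(p * (r + gamma * v) for p, r, v in outcomes)
    var = sum(p * (r + gamma * v - mean) ** 2 for p, r, v in outcomes)
    return mean, var


gamma = 0.9
# "Safe" action: always reaches a next state with value 1.
safe = [(1.0, 0.0, 1.0)]
# "Risky" action: reaches a next state with value 0 or 2 with equal probability.
risky = [(0.5, 0.0, 0.0), (0.5, 0.0, 2.0)]

mean_safe, var_safe = td_target_moments(safe, gamma)
mean_risky, var_risky = td_target_moments(risky, gamma)
# Both targets have the same mean (gamma * 1), but only the risky one is noisy.
```

Both actions look identical to a learner that only tracks the expected target, yet the noise on the target differs by action, which is exactly the situation that a homoscedastic noise model cannot represent.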
3.3 Information-Directed Sampling
Information-Directed Sampling (IDS) is a bandit algorithm, which was first introduced in the Bayesian setting by Russo & Van Roy (2014), and later adapted to the frequentist framework by Kirschner & Krause (2018). Here, we concentrate on the latter formulation in order to avoid keeping track of a posterior distribution over the environment, which itself is a difficult problem in RL. The bandit problem is equivalent to a single-state MDP with stochastic reward function $R(a)$ and optimal action $a^* = \arg\max_{a \in \mathcal{A}} \mathbb{E}[R(a)]$. We define the (expected) regret $\Delta(a) := \mathbb{E}[R(a^*) - R(a)]$, which is the loss in reward for choosing a suboptimal action $a$. Note, however, that we cannot directly compute $\Delta(a)$, since it depends on $R$ and the unknown optimal action $a^*$. Instead, IDS uses a conservative regret estimate $\hat{\Delta}_t(a) = \max_{a' \in \mathcal{A}} u_t(a') - l_t(a)$, where $[l_t(a), u_t(a)]$ is a confidence interval which contains the true expected reward $\mathbb{E}[R(a)]$ with high probability.
In addition, assume for now that we are given an information gain function $I_t(a)$. Then, at any time step $t$, the IDS policy is defined by
$$a_t^{\mathrm{IDS}} \in \arg\min_{a \in \mathcal{A}} \frac{\hat{\Delta}_t(a)^2}{I_t(a)}. \tag{5}$$
Technically, this is known as deterministic IDS which, for simplicity, we refer to as IDS throughout this work. Intuitively, IDS chooses actions with a small regret-information ratio $\hat{\Delta}_t(a)^2 / I_t(a)$ to balance between incurring regret and acquiring new information at each step. Kirschner & Krause (2018) introduce several information-gain functions and derive a high-probability bound on the cumulative regret, $\sum_{t=1}^{T} \Delta(a_t) \le \mathcal{O}(\sqrt{T \gamma_T})$. Here, $\gamma_T$ is an upper bound on the total information gain $\sum_{t=1}^{T} I_t(a_t)$, which has a sublinear dependence in $T$ for different function classes and the specific information-gain function we use in the following (Srinivas et al., 2010). The overall regret bound for IDS matches the best bound known for the widely used UCB policy for linear and kernelized reward functions.
One particular choice of the information gain function, which works well empirically and which we focus on in the following, is $I_t(a) = \log\left(1 + \sigma_t(a)^2 / \rho(a)^2\right)$ (Kirschner & Krause, 2018). Here $\sigma_t(a)^2$ is the variance in the parametric estimate of $\mathbb{E}[R(a)]$ and $\rho(a)^2$ is the variance of the observed reward. In particular, the information gain $I_t(a)$ is small for actions with little uncertainty in the true expected reward or with reward that is subject to high observation noise. Importantly, note that $\rho(a)^2$ may explicitly depend on the selected action $a$, which allows the policy to account for heteroscedastic noise.
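The following Python sketch shows one way the deterministic IDS rule could be evaluated for a finite-armed bandit. It assumes symmetric confidence intervals of the form mean ± `lam` · sigma and an information gain of $\log(1 + \sigma^2/\rho^2)$; the function name `ids_bandit_action`, the default `lam`, and the small guard `eps` are hypothetical choices for this example rather than values from the paper.

```python
import math


def ids_bandit_action(means, sigmas, rhos, lam=1.0, eps=1e-12):
    """Pick the arm minimizing (conservative regret)^2 / information gain.

    means, sigmas: per-arm estimate of the expected reward and its
        parametric standard deviation (assumed given).
    rhos: per-arm observation-noise standard deviation (may differ by arm).
    lam, eps: illustrative interval width and division guard.
    """
    upper = [m + lam * s for m, s in zip(means, sigmas)]
    lower = [m - lam * s for m, s in zip(means, sigmas)]
    best_upper = max(upper)
    ratios = []
    for low, s, rho in zip(lower, sigmas, rhos):
        regret = best_upper - low  # conservative regret estimate
        info = math.log1p(s ** 2 / rho ** 2) + eps  # information gain
        ratios.append(regret ** 2 / info)
    action = min(range(len(ratios)), key=lambda i: ratios[i])
    return action, ratios


# Two arms with identical estimates but very different observation noise.
action, ratios = ids_bandit_action(
    means=[1.0, 1.0], sigmas=[0.5, 0.5], rhos=[0.1, 2.0]
)
```

In this example both arms have the same regret estimate, so the rule prefers the arm with the lower observation noise (index 0), since an observation there is more informative.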
We demonstrate the advantage of such a strategy in the Gaussian Process setting (Murphy, 2012). In particular, for an arbitrary set of actions $a_1, \dots, a_N$, we model the distribution of $R(a_1), \dots, R(a_N)$ by a multivariate Gaussian, with covariance $\mathrm{Cov}[R(a_i), R(a_j)] = \kappa(a_i, a_j)$, where $\kappa$ is a positive definite kernel. In our toy example, the goal is to maximize $R(a)$ under heteroscedastic observation noise with variance $\rho(a)^2$ (Figure 1). As UCB and TS do not consider observation noise in the acquisition function, they may sample at points where $\rho(a)^2$ is large. Instead, by exploiting kernel correlation, IDS is able to shrink the uncertainty in the high-noise region with fewer samples, by selecting a nearby point with potentially higher regret but small noise.
4 Information-Directed Sampling for Reinforcement Learning
In this section, we use the IDS strategy from the previous section in the context of deep RL. In order to do so, we have to define a tractable notion of regret $\Delta_t$ and information gain $I_t$.
4.1 Estimating Regret and Information Gain
In the context of RL, it is natural to extend the definition of instantaneous regret of action $a$ in state $s$ using the Q-function
$$\Delta_t^\pi(s, a) := \mathbb{E}_P\big[\max_{a'} Q^\pi(s, a') - Q^\pi(s, a) \,\big|\, \mathcal{F}_{t-1}\big], \tag{6}$$
where $\mathcal{F}_t = \{s_1, a_1, r_1, \dots, s_t, a_t, r_t\}$ is the history of observations at time $t$. The regret definition in eq. (6) captures the loss in return when selecting action $a$ in state $s$ rather than the optimal action. This is similar to the notion of the advantage function. Since $\Delta_t^\pi(s, a)$ depends on the true Q-function $Q^\pi$, which is not available in practice and can only be estimated based on finite data, the IDS framework instead uses a conservative estimate.
To do so, we must characterize the parametric uncertainty in the Q-function. Since we use neural networks as function approximators, we can obtain approximate confidence bounds using a Bootstrapped DQN (Osband et al., 2016a). In particular, given an ensemble of $K$ action-value functions, we compute the empirical mean and variance of the estimated Q-values,
$$\mu(s, a) = \frac{1}{K} \sum_{k=1}^{K} Q_k(s, a), \qquad \sigma(s, a)^2 = \frac{1}{K} \sum_{k=1}^{K} \big(Q_k(s, a) - \mu(s, a)\big)^2. \tag{7}$$
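A minimal sketch of the ensemble statistics described above, assuming the Q-values of each bootstrap head for a single state are available as plain Python lists. The name `ensemble_stats` and the use of the biased (divide-by-$K$) variance are assumptions of this example.

```python
def ensemble_stats(q_heads):
    """Per-action mean and variance across K bootstrap heads.

    q_heads: list of K lists; q_heads[k][a] is the Q-value that head k
        assigns to action a in the current state.
    Returns (mu, var), each a list with one entry per action.
    """
    k = len(q_heads)
    n_actions = len(q_heads[0])
    mu = [sum(head[a] for head in q_heads) / k for a in range(n_actions)]
    var = [
        sum((head[a] - mu[a]) ** 2 for head in q_heads) / k
        for a in range(n_actions)
    ]
    return mu, var


# Two heads, two actions: the heads disagree on action 0 and agree on action 1.
mu, var = ensemble_stats([[1.0, 0.0], [3.0, 0.0]])
```

Here the disagreement between heads on action 0 shows up as a non-zero variance, while action 1, on which the heads agree, has zero estimated parametric uncertainty.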
Based on the mean and variance estimate in the Q-values, we can define a surrogate for the regret using confidence intervals,
$$\hat{\Delta}_t^\pi(s, a) = \max_{a' \in \mathcal{A}} \big(\mu_t(s, a') + \lambda_t \sigma_t(s, a')\big) - \big(\mu_t(s, a) - \lambda_t \sigma_t(s, a)\big), \tag{8}$$
where $\lambda_t$ is a scaling hyperparameter. The first term corresponds to the maximum plausible value that the Q-function could take at a given state, while the second term lower-bounds the Q-value given the chosen action. As a result, eq. (8) provides a conservative estimate of the regret in eq. (6).
Given the regret surrogate, the only missing component to use the IDS strategy in eq. (5) is to compute the information gain function $I_t$. In particular, we use $I_t(s, a) = \log\left(1 + \sigma_t(s, a)^2 / \rho(s, a)^2\right) + \epsilon_2$ based on the discussion in Kirschner & Krause (2018). In addition to the previously defined predictive parametric variance estimates for the regret, it depends on the variance of the noise distribution, $\rho^2$. While in the bandit setting we track one-step rewards, in RL we focus on learning from returns from complete trajectories. Therefore, the instantaneous reward observation noise variance $\rho(a)^2$ in the bandit setting transfers to the variance of the return distribution $\mathrm{Var}(Z(s, a))$ in RL. We point out that the scale of $\mathrm{Var}(Z(s, a))$ can substantially vary depending on the stochasticity of the policy and the environment, as well as the reward scaling. This directly affects the scale of the information gain and the degree to which the agent chooses to explore. Since the weighting between regret and information gain in the IDS ratio is implicit, for stable performance across a range of environments, we propose computing the information gain $I_t(s, a)$ using the normalized variance
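$$\rho(s, a)^2 = \frac{\mathrm{Var}(Z(s, a))}{\epsilon_1 + \frac{1}{|\mathcal{A}|} \sum_{a' \in \mathcal{A}} \mathrm{Var}(Z(s, a'))}, \tag{9}$$
where $\epsilon_1$ and $\epsilon_2$ are small constants that prevent division by $0$ ($\epsilon_1$ in eq. (9) and $\epsilon_2$ in the information gain). This normalization step brings the mean of all variances to $1$, while keeping their values positive. Importantly, it preserves the signal needed for noise-sensitive exploration and allows the agent to account for numerical differences across environments and favor the same amount of risk. We also experimentally found this version to give better results compared to the unnormalized variance $\rho(s, a)^2 = \mathrm{Var}(Z(s, a))$.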
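As a worked example of the normalization step, the sketch below divides each action's return variance by the mean variance over actions plus a small constant, which is our reading of eq. (9). The function name `normalized_return_variance` and the default value of `eps1` are assumptions for illustration.

```python
def normalized_return_variance(return_vars, eps1=1e-5):
    """Rescale per-action return variances so that their mean is close to 1.

    return_vars: list with Var(Z(s, a)) for every action a in one state.
    eps1: small illustrative constant guarding against division by zero.
    """
    mean_var = sum(return_vars) / len(return_vars)
    return [v / (eps1 + mean_var) for v in return_vars]


# Raw variances 1 and 3 (mean 2) become roughly 0.5 and 1.5 (mean about 1).
rho2 = normalized_return_variance([1.0, 3.0])
# Scaling all rewards by 10 multiplies the raw variances by 100,
# but leaves the normalized values essentially unchanged.
rho2_scaled = normalized_return_variance([100.0, 300.0])
```

The second call illustrates why the normalization helps across environments: the relative noise levels of the actions are preserved while the dependence on reward scale is removed.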
4.2 Information-Directed Reinforcement Learning
Using the estimates for regret and information gain, we provide the complete control algorithm in Algorithm 1. At each step, we compute the parametric uncertainty over $Q(s, a)$ as well as the distribution over returns $Z(s, a)$. We then follow the steps from Section 4.1 to compute the regret and the information gain of each action, and select the one that minimizes the regret-information ratio $\hat{\Delta}_t(s, a)^2 / I_t(s, a)$.
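The action-selection step described above can be sketched end to end as follows. This is an illustrative composition under stated assumptions, not the authors' implementation: the inputs are per-head Q-values and per-action return variances for one state, and the defaults for `lam`, `eps1`, `eps2`, and the floor `rho2_min` are placeholder values chosen for the example.

```python
import math


def ids_select(q_heads, return_vars, lam=0.1, eps1=1e-5, eps2=1e-5,
               rho2_min=0.25):
    """Return the index of the action with the smallest regret-information ratio.

    q_heads: list of K lists of Q-values (one list per bootstrap head).
    return_vars: per-action variance of the estimated return distribution.
    lam, eps1, eps2, rho2_min: hypothetical hyperparameters.
    """
    k, n = len(q_heads), len(q_heads[0])
    # Parametric uncertainty: mean and variance across heads.
    mu = [sum(h[a] for h in q_heads) / k for a in range(n)]
    sigma2 = [sum((h[a] - mu[a]) ** 2 for h in q_heads) / k for a in range(n)]
    sigma = [math.sqrt(v) for v in sigma2]
    # Conservative regret: best upper bound minus the action's lower bound.
    best_upper = max(m + lam * s for m, s in zip(mu, sigma))
    regret = [best_upper - (m - lam * s) for m, s in zip(mu, sigma)]
    # Normalized return variance with a lower bound against "noiseless" actions.
    mean_var = sum(return_vars) / n
    rho2 = [max(v / (eps1 + mean_var), rho2_min) for v in return_vars]
    # Information gain and regret-information ratio.
    info = [math.log1p(s2 / r2) + eps2 for s2, r2 in zip(sigma2, rho2)]
    ratio = [d ** 2 / i for d, i in zip(regret, info)]
    return min(range(n), key=lambda a: ratio[a])


# Two heads that rate both actions equally; action 1 has the noisier return.
chosen = ids_select(q_heads=[[1.0, 1.0], [2.0, 2.0]], return_vars=[0.5, 1.5])
```

With equal regret estimates for both actions, the rule selects action 0, whose return is less noisy and whose observations are therefore more informative about the Q-value.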
To estimate parametric uncertainty, we use the exact same training procedure and architecture as Bootstrapped DQN (Osband et al., 2016a): we split the DQN architecture (Mnih et al., 2015) into $K$ bootstrap heads after the convolutional layers. Each head $Q_k(s, a)$ is trained against its own target head $Q'_k(s, a)$ and all heads are trained on the exact same data. We use Double DQN targets (van Hasselt et al., 2016) and normalize gradients propagated by each head by $1/K$.
To estimate $Z(s, a)$, it makes sense to share some of the weights from the Bootstrapped DQN. We propose to use the output of the last convolutional layer as input to a separate head that estimates $Z(s, a)$. The output of this head is the only one used for computing $\rho(s, a)^2$ and is also not included in the bootstrap estimate. For instance, this head can be trained using C51 or QR-DQN, with variance $\mathrm{Var}(Z(s, a)) = \sum_i p_i \big(z_i - \mathbb{E}[Z(s, a)]\big)^2$, where $z_i$ denotes the atoms of the distribution support, $p_i$ their corresponding probabilities, and $\mathbb{E}[Z(s, a)] = \sum_i p_i z_i$. To isolate the effect of noise-sensitive exploration from the advantages of distributional training, we do not propagate distributional loss gradients in the convolutional layers and use the representation learned only from the bootstrap branch. This is not a limitation of our approach and both (or either) bootstrap and distributional gradients can be propagated through the convolutional layers.
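For a categorical return distribution such as the one produced by a C51-style head, the mean and variance follow directly from the atoms and their probabilities. The sketch below assumes both are given as plain lists for a single state-action pair; the function name `categorical_moments` and the three-atom support are illustrative only.

```python
def categorical_moments(atoms, probs):
    """Mean and variance of a categorical distribution over fixed atoms.

    atoms: support values z_i of the return distribution.
    probs: probabilities p_i of the atoms (assumed to sum to 1).
    """
    mean = sum(p * z for z, p in zip(atoms, probs))
    var = sum(p * (z - mean) ** 2 for z, p in zip(atoms, probs))
    return mean, var


# A symmetric three-atom distribution on {-1, 0, 1}.
mean_z, var_z = categorical_moments([-1.0, 0.0, 1.0], [0.25, 0.5, 0.25])
```

The resulting variance is the per-action quantity that is subsequently normalized and used as the noise term in the information gain.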
Importantly, our method can account for deep exploration, since both the parametric uncertainty and the intrinsic uncertainty estimates in the information gain are extended beyond a single time step and propagate information over sequences of states. We note the difference with intrinsic motivation methods, which augment the reward function by adding an exploration bonus to the step reward (Houthooft et al., 2016; Stadie et al., 2015; Schmidhuber, 2010; Bellemare et al., 2016; Tang et al., 2017). While the bonus is sometimes based on an information-gain measure, the estimated optimal policy is often affected by the augmentation of the rewards.
5 Experiments
We now provide experimental results on 55 of the Atari 2600 games from the Arcade Learning Environment (ALE) (Bellemare et al., 2013), simulated via the OpenAI gym interface (Brockman et al., 2016). We exclude Defender and Surround from the standard Atari57 selection, since they are not available in OpenAI gym. Our method builds on the standard DQN architecture and we expect it to benefit from recent improvements such as Dueling DQN (Wang et al., 2016) and prioritized replay (Schaul et al., 2016). However, in order to separately study the effect of changing the exploration strategy, we compare our method without these additions. Our code can be found at https://github.com/nikonikolov/rltf/tree/idsdrl.
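Table 1: Mean and median human-normalized scores.
Method            | Mean  | Median
------------------|-------|-------
DQN               | 232%  | 79%
DDQN              | 313%  | 118%
Dueling           | 379%  | 151%
NoisyNet-DQN      | 389%  | 123%
Prior.            | 444%  | 124%
Bootstrapped DQN  | 553%  | 139%
Prior. Dueling    | 608%  | 172%
NoisyNet-Dueling  | 651%  | 172%
DQN-IDS           | 757%  | 187%
C51               | 721%  | 178%
QR-DQN            | 888%  | 193%
IQN               | 1048% | 218%
C51-IDS           | 1058% | 253%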
We evaluate two versions of our method: a homoscedastic one, called DQN-IDS, for which we do not estimate $Z(s, a)$ and set $\rho(s, a)^2$ to a constant, and a heteroscedastic one, C51-IDS, for which we estimate $Z(s, a)$ using C51 as previously described. DQN-IDS uses the exact same network architecture as Bootstrapped DQN. For C51-IDS, we add the fully-connected part of the C51 network (Bellemare et al., 2017) on top of the last convolutional layer of the DQN-IDS architecture, but we do not propagate distributional loss gradients into the convolutional layers. We use a target network to compute Bellman updates, with double DQN targets only for the bootstrap heads, but not for the distributional update. Weights are updated using the Adam optimizer (Kingma & Ba, 2015). We evaluate the performance of our method using a mean greedy policy that is computed on the bootstrap heads,
$$\arg\max_{a \in \mathcal{A}} \frac{1}{K} \sum_{k=1}^{K} Q_k(s, a). \tag{10}$$
Due to computational limitations, we did not perform an extensive hyperparameter search. Our final algorithm uses fixed values of $\lambda$ and $\rho^2$ (the latter for DQN-IDS) and a target update frequency of 40000 agent steps, based on a parameter search over $\lambda$, $\rho^2$, and the target update frequency. For C51-IDS, we put a heuristically chosen lower bound on $\rho(s, a)^2$ to prevent the agent from fixating on “noiseless” actions. This bound is introduced primarily for numerical reasons, since, even in the bandit setting, the strategy may degenerate as the noise variance of a single action goes to zero. We also ran separate experiments without this lower bound and, while the per-game scores slightly differ, the overall change in mean human-normalized score was only 2-3%. We also use the suggested hyperparameters from C51 and Bootstrapped DQN, including the learning rate, the number of heads $K$, and the number of atoms. The rest of our training procedure is identical to that of Mnih et al. (2015), with the difference that we do not use $\epsilon$-greedy exploration. All episodes begin with up to 30 random no-ops (Mnih et al., 2015) and the horizon is capped at 108K frames (van Hasselt et al., 2016). Complete details are provided in Appendix A.
To provide comparable results with existing work, we report evaluation results under the best agent protocol. Every 1M training frames, learning is frozen, the agent is evaluated for 500K frames and performance is computed as the average episode return from this latest evaluation run. Table 1 shows the mean and median human-normalized scores (van Hasselt et al., 2016) of the best agent performance after 200M training frames. Additionally, we illustrate the distributions learned by C51 and C51-IDS in Figure 3.
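For readers unfamiliar with the metric, a human-normalized score is commonly computed as $100 \cdot (\text{agent} - \text{random}) / (\text{human} - \text{random})$ per game and then aggregated by mean and median across games. The sketch below illustrates this computation; the game names and raw scores are made up for the example and are not results from the paper.

```python
from statistics import mean, median


def human_normalized(agent, random_score, human):
    """Human-normalized score in percent for a single game."""
    return 100.0 * (agent - random_score) / (human - random_score)


# Hypothetical (agent, random, human) raw scores for three games.
raw_scores = {
    "game_a": (50.0, 0.0, 100.0),
    "game_b": (300.0, 100.0, 200.0),
    "game_c": (10.0, 10.0, 20.0),
}
normalized = [human_normalized(*scores) for scores in raw_scores.values()]
mean_score = mean(normalized)
median_score = median(normalized)
```

Because a few games with very large normalized scores can dominate the mean, the median is usually reported alongside it, as in Table 1.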
We first point out the results of DQN-IDS and Bootstrapped DQN. While both methods use the same architecture and similar optimization procedures, DQN-IDS outperforms Bootstrapped DQN by around 200 percentage points in mean human-normalized score (757% vs. 553% in Table 1). This suggests that simply changing the exploration strategy from TS to IDS (along with the type of optimizer), even without accounting for heteroscedastic noise, can substantially improve performance. Furthermore, DQN-IDS slightly outperforms C51, even though C51 has the benefits of distributional learning.
We also see that C51-IDS outperforms C51 and QR-DQN and achieves slightly better results than IQN. Importantly, the fact that C51-IDS substantially outperforms DQN-IDS highlights the significance of accounting for heteroscedastic noise. We also experimented with a QR-DQN-IDS version, which uses QR-DQN instead of C51 to estimate $Z(s, a)$, and noticed that our method can benefit from a better approximation of the return distribution. While we expect the performance over IQN to be higher, we do not include QR-DQN-IDS scores since we were unable to reproduce the reported QR-DQN results on some games. We also note that, unlike C51-IDS, IQN is specifically tuned for risk sensitivity. One way to get a risk-sensitive IDS policy is by tuning for $\beta$ in the additive IDS formulation $\hat{\Delta}(s, a)^2 - \beta I(s, a)$, proposed by Russo & Van Roy (2014). We verified on several games that C51-IDS scores can be improved by using this additive formulation and we believe such gains can be extended to the rest of the games.
6 Conclusion
We extended the idea of frequentist Information-Directed Sampling to a practical RL exploration algorithm that can account for heteroscedastic noise. To the best of our knowledge, we are the first to propose a tractable IDS algorithm for RL in large state spaces. Our method suggests a new way to use the return distribution in combination with parametric uncertainty for efficient deep exploration and demonstrates substantial gains on Atari games. We also identified several sources of heteroscedasticity in RL and demonstrated the importance of accounting for heteroscedastic noise for efficient exploration. Additionally, our evaluation results demonstrated that, similarly to the bandit setting, IDS has the potential to outperform alternative strategies such as TS in RL.
There remain promising directions for future work. Our preliminary results show that similar improvements can be observed when IDS is combined with continuous control RL methods such as the Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2016). Developing a computationally efficient approximation of the randomized IDS version, which minimizes the regret-information ratio over the set of stochastic policies, is another idea to investigate. Additionally, as indicated by Russo & Van Roy (2014), IDS should be seen as a design principle rather than a specific algorithm, and thus alternative information gain functions are an important direction for future research.
Acknowledgments
We thank Ian Osband and Will Dabney for providing details about the Atari evaluation protocol. This work was supported by SNSF grant 200020_159557, the Vector Institute and the Open Philanthropy Project AI Fellows Program.
References
 Auer et al. (2002) Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.
 Azizzadenesheli et al. (2018) Kamyar Azizzadenesheli, Emma Brunskill, and Animashree Anandkumar. Efficient exploration through bayesian deep q-networks. arXiv, abs/1802.04412, 2018.
 Bellemare et al. (2013) M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
 Bellemare et al. (2016) Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems 29, pp. 1471–1479. 2016.
 Bellemare et al. (2017) Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proc. of the International Conference on Machine Learning, volume 70, pp. 449–458, 2017.
 Bellman (1957) Richard Bellman. Dynamic Programming. Princeton University Press, 1st edition, 1957.
 Blundell et al. (2015) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. In Proc. of the International Conference on Machine Learning, volume 37, pp. 1613–1622, 2015.
 Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv, abs/1606.01540, 2016.
 Bubeck & Cesa-Bianchi (2012) Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
 Chapelle & Li (2011) Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling. In Advances in Neural Information Processing Systems 24, pp. 2249–2257. 2011.
 Chen et al. (2017) Richard Y. Chen, Szymon Sidor, Pieter Abbeel, and John Schulman. UCB and infogain exploration via q-ensembles. arXiv, abs/1706.01502, 2017.
 Dabney et al. (2018a) Will Dabney, Georg Ostrovski, David Silver, and Remi Munos. Implicit quantile networks for distributional reinforcement learning. In Proc. of the International Conference on Machine Learning, volume 80, pp. 1096–1105, 2018a.
 Dabney et al. (2018b) Will Dabney, Mark Rowland, Marc Bellemare, and Remi Munos. Distributional reinforcement learning with quantile regression. In Proc. of the AAAI Conference on Artificial Intelligence, 2018b.
 Dilokthanakul & Shanahan (2018) Nat Dilokthanakul and Murray Shanahan. Deep reinforcement learning with risk-seeking exploration. In From Animals to Animats 15, pp. 201–211. Springer International Publishing, 2018.
 Efron (1979) Bradley Efron. Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1):1–26, 1979.
 Fortunato et al. (2018) Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Matteo Hessel, Ian Osband, Alex Graves, Volodymyr Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, and Shane Legg. Noisy networks for exploration. In Proc. of the International Conference on Learning Representations, 2018.
 Gal & Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Proc. of the International Conference on Machine Learning, volume 48, pp. 1050–1059, 2016.
 Hastie et al. (2001) Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer New York Inc., 2001.
 Houthooft et al. (2016) Rein Houthooft, Xi Chen, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime: Variational information maximizing exploration. In Advances In Neural Information Processing Systems 29, pp. 1109–1117. 2016.
 Kingma & Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. 2015.
 Kirschner & Krause (2018) Johannes Kirschner and Andreas Krause. Information directed sampling and bandits with heteroscedastic noise. In Proc. International Conference on Learning Theory (COLT), 2018.
 Lai & Robbins (1985) Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4 – 22, 1985.
 Lattimore & Szepesvári (2018) Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, draft edition, 2018.
 Lillicrap et al. (2016) Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In Proc. of the International Conference on Learning Representations, 2016.
 Mandt et al. (2016) Stephan Mandt, Matthew Hoffman, and David Blei. A variational analysis of stochastic gradient algorithms. In Proc. of the International Conference on Machine Learning, volume 48, pp. 354–363, 2016.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 Moerland et al. (2017) Thomas M. Moerland, Joost Broekens, and Catholijn M. Jonker. Efficient exploration with double uncertain value networks. Symposium on Deep Reinforcement Learning, NIPS, 2017.
 Moerland et al. (2018) Thomas M. Moerland, Joost Broekens, and Catholijn M. Jonker. The potential of the return distribution for exploration in rl. arXiv, abs/1806.04242, 2018.
 Monahan (1982) George E. Monahan. A survey of partially observable markov decision processes: Theory, models, and algorithms. Management Science, 28(1):1–16, 1982.
 Murphy (2012) Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
 Nair et al. (2015) Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, Shane Legg, Volodymyr Mnih, Koray Kavukcuoglu, and David Silver. Massively parallel methods for deep reinforcement learning. ICML Workshop on Deep Learning, 2015.
 Neal (1995) Radford M. Neal. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, 1995.
 O’Donoghue et al. (2018) Brendan O’Donoghue, Ian Osband, Remi Munos, and Vlad Mnih. The uncertainty Bellman equation and exploration. In Proc. of the International Conference on Machine Learning, volume 80, pp. 3839–3848, 2018.
 Osband et al. (2016a) Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped dqn. In Advances in Neural Information Processing Systems 29, pp. 4026–4034. 2016a.
 Osband et al. (2016b) Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. In Proc. of the International Conference on Machine Learning, volume 48, pp. 2377–2386, 2016b.
 Plappert et al. (2018) Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y. Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz. Parameter space noise for exploration. In Proc. of the International Conference on Learning Representations, 2018.
 Riquelme et al. (2018) Carlos Riquelme, George Tucker, and Jasper Snoek. Deep bayesian bandits showdown: An empirical comparison of bayesian deep networks for thompson sampling. In Proc. of the International Conference on Learning Representations, 2018.
 Russo & Van Roy (2014) Daniel Russo and Benjamin Van Roy. Learning to optimize via information-directed sampling. In Advances in Neural Information Processing Systems 27, pp. 1583–1591. 2014.
 Schaul et al. (2016) Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In Proc. of the International Conference on Learning Representations, 2016.
 Schmidhuber (2010) Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.
 Snoek et al. (2015) Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Prabhat, and Ryan P. Adams. Scalable bayesian optimization using deep neural networks. In Proc. of the International Conference on Machine Learning, volume 37, pp. 2171–2180, 2015.
 Srinivas et al. (2010) Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proc. of the International Conference on Machine Learning, pp. 1015–1022, 2010.
 Stadie et al. (2015) Bradly C. Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv, abs/1507.00814, 2015.
 Sutton & Barto (1998) Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, 1st edition, 1998.
 Tang et al. (2017) Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. #exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems 30, pp. 2753–2762. 2017.
 Tang & Agrawal (2018) Yunhao Tang and Shipra Agrawal. Exploration by distributional reinforcement learning. In Proc. of the International Joint Conference on Artificial Intelligence, pp. 2710–2716, 2018.
 Thompson (1933) William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3–4):285–294, 1933.
 van Hasselt et al. (2016) Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proc. of the AAAI Conference on Artificial Intelligence, pp. 2094–2100, 2016.
 Wang et al. (2016) Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling network architectures for deep reinforcement learning. In Proc. of the International Conference on Machine Learning, volume 48, pp. 1995–2003, 2016.
 Watkins (1989) Christopher J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, 1989.
 Welling & Teh (2011) Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proc. of the International Conference on Machine Learning, pp. 681–688, 2011.
 Zanette & Sarkar (2017) Andrea Zanette and Rahul Sarkar. Information directed reinforcement learning. Technical report, 2017.
Appendix A Hyperparameters
Hyperparameter  Value  Description 
regret scale factor  0.1  Scale factor for computing regret surrogate
observation noise variance  1.0  Observation noise variance for DQN-IDS
information-ratio constants  0.00001  Information-ratio constants; prevent division by 0
minibatch size  32  Size of minibatch samples for gradient descent step 
replay buffer size  1M  The number of most recent observations stored in the replay buffer 
agent history length  4  The number of most recent frames concatenated as input to the network 
action repeat  4  Repeat each action selected by the agent this many times 
discount factor  0.99  Discount factor
training frequency  4  The number of times an action is selected by the agent between successive gradient descent steps 
bootstrap heads  10  Number of bootstrap heads
Adam $\beta_1$  0.9  Adam optimizer parameter
Adam $\beta_2$  0.99  Adam optimizer parameter
Adam $\epsilon$  0.01/32  Adam optimizer parameter
learning rate  0.00005  Learning rate
learning starts  50000  Agent step at which learning starts. Random policy beforehand 
number of bins  51  Number of bins for Categorical DQN (C51) 
C51 value range  [-10, 10]  C51 distribution range
number of quantiles  200  Number of quantiles for QR-DQN
target network update frequency  40000  Number of agent steps between consecutive target updates
evaluation length  125K  Number of agent steps each evaluation window lasts for. Equivalent to 500K frames 
evaluation frequency  250K  The number of steps the agent takes in training mode between two evaluation runs. Equivalent to 1M frames 
eval episode length  27K  Number of maximum agent steps during an evaluation episode. Equivalent to 108K frames 
max no-ops  30  Maximum number of no-op actions before the episode starts
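For reference, the settings above can be collected in a single configuration object. The following is a minimal sketch, not the authors' code: the class name `IDSConfig` and all field names are hypothetical, the assignment of the three Adam values to $\beta_1$, $\beta_2$ and $\epsilon$ follows the usual Adam convention, and the C51 range is assumed to be symmetric around zero. The helper `frames` converts agent steps to emulator frames using the action repeat, which reproduces the frame counts quoted in the table.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class IDSConfig:
    """Hypothetical container for the hyperparameters listed in Appendix A."""

    regret_scale: float = 0.1             # scale factor for the regret surrogate
    noise_variance: float = 1.0           # observation noise variance (DQN-IDS)
    info_ratio_eps: float = 1e-5          # prevents division by 0 in the information ratio
    minibatch_size: int = 32
    replay_buffer_size: int = 1_000_000
    agent_history_length: int = 4
    action_repeat: int = 4
    discount: float = 0.99
    training_frequency: int = 4
    bootstrap_heads: int = 10
    adam_beta1: float = 0.9               # assumed order: beta1, beta2, epsilon
    adam_beta2: float = 0.99
    adam_eps: float = 0.01 / 32
    learning_rate: float = 5e-5
    learning_starts: int = 50_000
    c51_bins: int = 51
    c51_range: Tuple[float, float] = (-10.0, 10.0)  # assumed symmetric range
    qr_quantiles: int = 200
    target_update_frequency: int = 40_000
    evaluation_length: int = 125_000
    evaluation_frequency: int = 250_000
    eval_episode_length: int = 27_000
    max_noops: int = 30

    def frames(self, agent_steps: int) -> int:
        """Convert agent steps to emulator frames via the action repeat."""
        return agent_steps * self.action_repeat


config = IDSConfig()
print(config.frames(config.evaluation_length))     # frames per evaluation window
print(config.frames(config.evaluation_frequency))  # training frames between evaluations
print(config.frames(config.eval_episode_length))   # frame cap per evaluation episode
```

Freezing the dataclass keeps a run's settings immutable once constructed, which makes it harder to change a value by accident between training and evaluation.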
Appendix B Supplemental Results
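Human-normalized scores are computed as (van Hasselt et al., 2016),
$$\text{score} = 100 \times \frac{\text{score}_{\text{agent}} - \text{score}_{\text{random}}}{\text{score}_{\text{human}} - \text{score}_{\text{random}}}, \tag{11}$$
where $\text{score}_{\text{agent}}$, $\text{score}_{\text{human}}$ and $\text{score}_{\text{random}}$ represent the per-game raw scores.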
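As a worked illustration, the normalization can be sketched in a few lines of Python. The sketch assumes the convention of van Hasselt et al. (2016), expressed in percent: 100 × (agent − random) / (human − random). The function name `human_normalized_score` is hypothetical, and the Alien raw scores are taken from the tables in this appendix.

```python
def human_normalized_score(agent: float, random: float, human: float) -> float:
    """Return the human-normalized score in percent.

    0 corresponds to the random policy and 100 to the human baseline.
    """
    if human == random:
        raise ValueError("human and random scores must differ")
    return 100.0 * (agent - random) / (human - random)


# Alien raw scores as listed in the tables of this appendix (C51-IDS agent).
alien = human_normalized_score(agent=11_473.6, random=227.8, human=7_127.7)
print(f"Alien, C51-IDS: {alien:.1f}%")
```

A value above 100 indicates that the agent exceeds the human baseline on that game under this normalization.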
Game  DQN  DDQN  Duel.  Bootstrap  Prior. Duel.  DQN-IDS
Alien  1,620.0  3,747.7  4,461.4  2,436.6  3,941.0  9,780.1 
Amidar  978.0  1,793.3  2,354.5  1,272.5  2,296.8  2,457.0 
Assault  4,280.4  5,393.2  4,621.0  8,047.1  11,477.0  9,446.7 
Asterix  4,359.0  17,356.5  28,188.0  19,713.2  375,080.0  50,167.3 
Asteroids  1,364.5  734.7  2,837.7  1,032.0  1,192.7  1,959.7 
Atlantis  279,987.0  106,056.0  382,572.0  994,500.0  395,762.0  993,212.5 
Bank Heist  455.0  1,030.6  1,611.9  1,208.0  1,503.1  1,226.1 
Battle Zone  29,900.0  31,700.0  37,150.0  38,666.7  35,520.0  67,394.2 
Beam Rider  8,627.5  13,772.8  12,164.0  23,429.8  30,276.5  30,426.6 
Berzerk  585.6  1,225.4  1,472.6  1,077.9  3,409.0  4,816.2 
Bowling  50.4  68.1  65.5  60.2  46.7  50.7 
Boxing  88.0  91.6  99.4  93.2  98.9  99.9 
Breakout  385.5  418.5  345.3  855.0  366.0  600.1 
Centipede  4,657.7  5,409.4  7,561.4  4,553.5  7,687.5  5,860.2 
Chopper Command  6,126.0  5,809.0  11,215.0  4,100.0  13,185.0  13,385.4 
Crazy Climber  110,763.0  117,282.0  143,570.0  137,925.9  162,224.0  194,935.7 
Demon Attack  12,149.4  58,044.2  60,813.3  82,610.0  72,878.6  130,687.2 
Double Dunk  6.6  5.5  0.1  3.0  12.5  1.2 
Enduro  729.0  1,211.8  2,258.2  1,591.0  2,306.4  2,358.2 
Fishing Derby  4.9  15.5  46.4  26.0  41.3  45.2 
Freeway  30.8  33.3  0.0  33.9  33.0  34.0 
Frostbite  797.4  1,683.3  4,672.8  2,181.4  7,413.0  5,884.3 
Gopher  8,777.4  14,840.8  15,718.4  17,438.4  104,368.2  47,826.2 
Gravitar  473.0  412.0  588.0  286.1  238.0  771.0 
H.E.R.O.  20,437.8  20,818.2  23,037.7  21,021.3  21,036.5  15,165.4 
Ice Hockey  1.9  2.7  0.5  1.3  0.4  1.7 
James Bond  768.5  1,358.0  1,312.5  1,663.5  812.0  1,782.2 
Kangaroo  7,259.0  12,992.0  14,854.0  14,862.5  1,792.0  15,364.5 
Krull  8,422.3  7,920.5  11,451.9  8,627.9  10,374.4  10,587.3 
Kung-Fu Master  26,059.0  29,710.0  34,294.0  36,733.3  48,375.0  38,113.5
Montezuma’s Revenge  0.0  0.0  0.0  100.0  0.0  0.0 
Ms. Pac-Man  3,085.6  2,711.4  6,283.5  2,983.3  3,327.3  7,273.7
Name This Game  8,207.8  10,616.0  11,971.1  11,501.1  15,572.5  15,576.7 
Phoenix  8,485.2  12,252.5  23,092.2  14,964.0  70,324.3  176,493.2 
Pitfall!  286.1  29.9  0.0  0.0  0.0  0.0 
Pong  19.5  20.9  21.0  20.9  20.9  21.0 
Private Eye  146.7  129.7  103.0  1,812.5  206.0  201.1 
Q*Bert  13,117.3  15,088.5  19,220.3  15,092.7  18,760.3  26,098.5 
River Raid  7,377.6  14,884.5  21,162.6  12,845.0  20,607.6  27,648.3 
Road Runner  39,544.0  44,127.0  69,524.0  51,500.0  62,151.0  59,546.2 
Robotank  63.9  65.1  65.3  66.6  27.5  68.6 
Seaquest  5,860.6  16,452.7  50,254.2  9,083.1  931.6  58,909.8 
Skiing  13,062.3  9,021.8  8,857.4  9,413.2  19,949.9  7,415.3 
Solaris  3,482.8  3,067.8  2,250.8  5,443.3  133.4  2,086.8 
Space Invaders  1,692.3  2,525.5  6,427.3  2,893.0  15,311.5  35,422.1 
Star Gunner  54,282.0  60,142.0  89,238.0  55,725.0  125,117.0  84,241.0 
Tennis  12.2  22.8  5.1  0.0  0.0  23.6 
Time Pilot  4,870.0  8,339.0  11,666.0  9,079.4  7,553.0  13,464.8 
Tutankham  68.1  218.4  211.4  214.8  245.9  265.5 
Up and Down  9,989.9  22,972.2  44,939.6  26,231.0  33,879.1  85,903.5 
Venture  163.0  98.0  497.0  212.5  48.0  389.1 
Video Pinball  196,760.4  309,941.9  98,209.5  811,610.0  479,197.0  696,914.0 
Wizard Of Wor  2,704.0  7,492.0  7,855.0  6,804.7  12,352.0  19,267.9 
Yars’ Revenge  18,098.9  11,712.6  49,622.1  17,782.3  69,618.1  25,279.5 
Zaxxon  5,363.0  10,163.0  12,944.0  11,491.7  13,886.0  16,789.2 
Game  Random  Human  C51  QR-DQN  IQN  C51-IDS
Alien  227.8  7,127.7  3,166.0  4,871.0  7,022.0  11,473.6 
Amidar  5.8  1,719.5  1,735.0  1,641.0  2,946.0  1,757.6 
Assault  222.4  742.0  7,203.0  22,012.0  29,091.0  21,829.1 
Asterix  210.0  8,503.3  406,211.0  261,025.0  342,016.0  536,273.0 
Asteroids  719.1  47,388.7  1,516.0  4,226.0  2,898.0  2,549.1 
Atlantis  12,850.0  29,028.1  841,075.0  971,850.0  978,200.0  1,032,150.0 
Bank Heist  14.2  753.1  976.0  1,249.0  1,416.0  1,338.3 
Battle Zone  2,360.0  37,187.5  28,742.0  39,268.0  42,244.0  66,724.0 
Beam Rider  363.9  16,926.5  14,074.0  34,821.0  42,776.0  42,196.7 
Berzerk  123.7  2,630.4  1,645.0  3,117.0  1,053.0  23,227.3 
Bowling  23.1  160.7  81.8  77.2  86.5  57.0 
Boxing  0.1  12.1  97.8  99.9  99.8  99.9 
Breakout  1.7  30.5  748.0  742.0  734.0  575.5 
Centipede  2,090.9  12,017.0  9,646.0  12,447.0  11,561.0  9,840.5 
Chopper Command  811.0  7,387.8  15,600.0  14,667.0  16,836.0  12,309.5 
Crazy Climber  10,780.5  35,829.4  179,877.0  161,196.0  179,082.0  205,629.6 
Demon Attack  152.1  1,971.0  130,955.0  121,551.0  128,580.0  129,667.5 
Double Dunk  18.6  16.4  2.5  21.9  5.6  1.2 
Enduro  0.0  860.5  3,454.0  2,355.0  2,359.0  2,370.1 
Fishing Derby  91.7  38.7  8.9  39.0  33.8  49.8 
Freeway  0.0  29.6  33.9  34.0  34.0  34.0 
Frostbite  65.2  4,334.7  3,965.0  4,384.0  4,324.0  10,924.1 
Gopher  257.6  2,412.5  33,641.0  113,585.0  118,365.0  123,337.5 
Gravitar  173.0  3,351.4  440.0  995.0  911.0  885.5 
H.E.R.O.  1,027.0  30,826.4  38,874.0  21,395.0  28,386.0  17,545.3 
Ice Hockey  11.2  0.9  3.5  1.7  0.2  0.5 
James Bond  29.0  302.8  1,909.0  4,703.0  35,108.0  9,687.0 
Kangaroo  52.0  3,035.0  12,853.0  15,356.0  15,487.0  16,143.5 
Krull  1,598.0  2,665.5  9,735.0  11,447.0  10,707.0  10,454.5 
Kung-Fu Master  258.5  22,736.3  48,192.0  76,642.0  73,512.0  59,710.7
Montezuma’s Revenge  0.0  4,753.3  0.0  0.0  0.0  0.0 
Ms. Pac-Man  307.3  6,951.6  3,415.0  5,821.0  6,349.0  6,616.2
Name This Game  2,292.3  8,049.0  12,542.0  21,890.0  22,682.0  15,248.1 
Phoenix  761.4  7,242.6  17,490.0  16,585.0  56,599.0  89,050.8 
Pitfall!  229.4  6,463.7  0.0  0.0  0.0  0.0 
Pong  20.7  14.6  20.9  21.0  21.0  21.0 
Private Eye  24.9  69,571.3  15,095.0  350.0  200.0  150.0 
Q*Bert  163.9  13,455.0  23,784.0  572,510.0  25,750.0  27,844.0 
River Raid  1,338.5  17,118.0  17,322.0  17,571.0  17,765.0  30,637.1 
Road Runner  11.5  7,845.0  55,839.0  64,262.0  57,900.0  61,550.3 
Robotank  2.2  11.9  52.3  59.4  62.5  69.8 
Seaquest  68.4  42,054.7  266,434.0  8,268.0  30,140.0  86,989.3 
Skiing  17,098.1  4,336.9  13,901.0  9,324.0  9,289.0  7,785.4 
Solaris  1,236.3  12,326.7  8,342.0  6,740.0  8,007.0  3,571.3 
Space Invaders  148.0  1,668.7  5,747.0  20,972.0  28,888.0  46,244.2 
Star Gunner  664.0  10,250.0  49,095.0  77,495.0  74,677.0  137,453.6 
Tennis  23.8  8.3  23.1  23.6  23.6  23.5 
Time Pilot  3,568.0  5,229.2  8,329.0  10,345.0  12,236.0  14,351.4 
Tutankham  11.4  167.6  280.0  297.0  293.0  200.2 
Up and Down  533.4  11,693.2  15,612.0  71,260.0  88,148.0  109,045.9 
Venture  0.0  1,187.5  1,520.0  43.9  1,318.0  495.6 
Video Pinball  16,256.9  17,667.9  949,604.0  705,662.0  698,045.0  756,111.1 
Wizard Of Wor  563.5  4,756.5  9,300.0  25,061.0  31,190.0  18,817.4 
Yars’ Revenge  3,092.9  54,576.9  35,050.0  26,447.0  28,379.0  64,822.9 
Zaxxon  32.5  9,173.3  10,513.0  13,112.0  21,772.0  18,295.4 