1 Introduction
Reinforcement learning algorithms sutton2018reinforcement are faced with two types of uncertainty: aleatoric uncertainty and epistemic uncertainty. Aleatoric uncertainty, also known as aleatoric risk osband2016risk, is uncertainty that stems from the inherent randomness of the environment or of an agent’s actions, but which can be characterized statistically. For example, given an unbiased coin, we do not know what the outcome of the next flip will be, but we can assign a probability to each outcome. Epistemic uncertainty, or parametric uncertainty, is uncertainty stemming from imperfect knowledge of the environment, which can be decreased with more information. For instance, if we are given a biased coin that either always yields heads or always yields tails, the uncertainty on the outcome of the first coin flip is epistemic: it vanishes after observing the outcome of the first flip.
Distinguishing between the two types of uncertainty is important in reinforcement learning moerland2017efficient. With an estimate of the epistemic uncertainty, exploring a new environment can be done more efficiently, since actions can be taken to identify and better explore poorly known states. Aleatoric risk, in contrast, can be irrelevant for exploration. Conversely, when designing risk-aware policies, the aleatoric risk can be more informative than the epistemic uncertainty, since a full description of the environmental uncertainty can allow planning for all possible outcomes. Conflating both types of uncertainty can lead to inadequate exploration or risk-awareness.
We propose a neural network method for separately estimating aleatoric risk and epistemic uncertainty in reinforcement learning, in both stochastic and deterministic environments. We use the well-established distributional reinforcement learning framework, which aims to learn the entire return distribution instead of only its expected value bellemare2017distributional, to describe the aleatoric risk. Our main contributions are to show that 1) distributional reinforcement learning can be framed as a Bayesian inference problem, which allows us to introduce the notion of epistemic uncertainty in this setting, 2) the disagreement between an ensemble consisting of only two estimators of the return distribution provides a low-variance estimate of the epistemic uncertainty, and 3) this uncertainty metric can successfully be used in complex reinforcement learning domains to achieve better exploration and design uncertainty-aware agents. Our work allows both types of uncertainty to be estimated in a theoretically grounded and computationally cheap way.
This paper is structured as follows. In sections 2 and 3, we provide an introduction to uncertainty estimation and learning the return distribution in reinforcement learning. Section 4 presents our Bayesian framework for distributional reinforcement learning, and shows how epistemic uncertainties can be estimated in this framework. Section 5 then experimentally illustrates our method and some of its applications.
2 Background and related work
There has been increasing interest in developing practical methods for uncertainty estimation in reinforcement learning, with prior work mostly focusing either on aleatoric risk or epistemic uncertainty. For the estimation of epistemic uncertainty in complex environments, several methods have been proposed. Pseudo-counts bellemare2016unifying ; tang2017exploration that approximate the number of visits of states or state-action pairs by an agent can be interpreted as an uncertainty measure. Bayesian inference techniques over the parameters that define the value function have been demonstrated osband2014generalization ; lipton2017bbq ; azizzadenesheli2018efficient. The disagreement between the predictions of an ensemble of neural networks has also been proposed as a way of estimating uncertainty tibshirani1996comparison ; osband2016deep ; lakshminarayanan2017simple ; pearce2018bayesian ; osband2018randomized ; lutjens2018safe ; burda2018exploration. Dropout in neural networks has also been used gal2016dropout ; moerland2017efficient. Our work builds on epistemic uncertainty estimation using dropout gal2016dropout and anchored neural networks pearce2018bayesian and proposes an expansion to the distributional setting, thus allowing us to consider both aleatoric and epistemic uncertainty in a single framework.
Prior work on aleatoric risk in reinforcement learning has focused on the stochastic nature of the returns, which can be caused either by randomness inherent to the environment or randomness in the agent’s actions. Several approaches have been developed to estimate higher-order moments of the return distribution sobel1982variance ; prashanth2013actor ; tamar2016learning. More recently, distributional reinforcement learning algorithms have been developed that aim to learn the entire distribution of the returns morimura2010nonparametric ; bellemare2017distributional. Our work builds on this distributional approach by proposing a method that also estimates epistemic uncertainty in this setting.
Accounting for both aleatoric risk and epistemic uncertainty has been used in model-based reinforcement learning to mitigate model bias depeweg2018decomposition ; chua2018deep ; henaff2019model. However, uncertainties on the model cannot straightforwardly be used to estimate the agent’s own uncertainty about its actions in a real environment. Both types of uncertainty on the agent were accounted for in model-free reinforcement learning in moerland2017efficient, and more recently in nikolov2019information. In moerland2017efficient, the return distribution is used in a deterministic environment to simultaneously propagate both types of uncertainty, but they cannot be separately estimated. In nikolov2019information, separate estimates of both types of uncertainty are used to drive better exploration with information-directed sampling. Uncertainty estimates are provided by both an ensemble to estimate epistemic uncertainty and another network to estimate the return distribution; this is significantly more computationally expensive than our proposed method.
3 Preliminaries
3.1 Reinforcement learning with the return distribution
We frame the reinforcement learning problem as follows. We consider a discounted Markov Decision Process (MDP) defined by $(\mathcal{S}, \mathcal{A}, R, P, \gamma)$, in which $\mathcal{S}$ and $\mathcal{A}$ represent the state and action spaces, $R$ is the distribution of rewards associated with performing actions given the states, $P$ is the probability of transitioning between states given the actions, and $\gamma$ is the reward discount factor. At each time step $t$, an agent observes state $s_t$, performs an action $a_t$ chosen according to a policy $\pi$ that maps states to actions, and receives reward $r_t$ and a new state observation $s_{t+1}$. The agent’s objective is to find a policy that maximizes the expected discounted return $\mathbb{E}\left[\sum_{t} \gamma^{t} r_t\right]$.
The distribution of returns $Z^{\pi}(s, a)$ associated with taking action $a$ in state $s$ and then following a policy $\pi$ can be learned using dynamic programming with the Bellman operator bellemare2017distributional,
$$\mathcal{T}^{\pi} Z(s, a) \overset{D}{=} R(s, a) + \gamma Z(s', a'), \tag{1}$$
where $\overset{D}{=}$ denotes that both distributions have equal probability laws, $s'$ is distributed according to $P(\cdot \mid s, a)$, and $a'$ is chosen according to $\pi(\cdot \mid s')$. A Bellman optimality operator for the return distribution can also be defined as
$$\mathcal{T} Z(s, a) \overset{D}{=} R(s, a) + \gamma Z\left(s', \arg\max_{a'} \mathbb{E}\left[Z(s', a')\right]\right), \tag{2}$$
where $s'$ is distributed according to $P(\cdot \mid s, a)$. The Bellman optimality operator can be used to learn an optimal policy over an MDP, which then consists of always picking the action with the highest expected return.
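As a rough illustration of the two Bellman operators for the return distribution, the following Python sketch applies a sample-based backup to a return distribution that is represented by a plain list of samples. The sample representation, the function names and the toy numbers are assumptions made for illustration only; they are not taken from the paper.

```python
import random


def bellman_target_samples(reward_sampler, next_return_samples, gamma, rng):
    """Samples of R(s, a) + gamma * Z(s', a') for one sampled (s', a').

    reward_sampler(rng) draws one reward; next_return_samples is a list of
    samples standing in for the return distribution at the next state-action pair.
    """
    return [reward_sampler(rng) + gamma * z for z in next_return_samples]


def greedy_next_action(next_returns_by_action):
    """Optimality backup: pick the next action with the highest mean return."""
    def mean(xs):
        return sum(xs) / len(xs)
    return max(next_returns_by_action, key=lambda a: mean(next_returns_by_action[a]))


rng = random.Random(0)
# Hypothetical return samples for the two actions available in the next state.
next_returns = {"left": [0.0, 1.0, 2.0], "right": [1.0, 1.5, 5.0]}
a_star = greedy_next_action(next_returns)
target = bellman_target_samples(lambda r: r.gauss(1.0, 0.1), next_returns[a_star], 0.9, rng)
print(a_star, len(target))
```

In the evaluation setting the next action would instead be drawn from the policy being evaluated; only the choice of next action differs between the two backups.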
Distributional reinforcement learning has several advantages compared to related reinforcement learning methods such as mnih2015human that aim to learn only the expected value of the returns. By learning the return distribution, distributional reinforcement learning can be used to account for risk morimura2010nonparametric ; morimura2012parametric ; dabney2018implicit. Moreover, learning the return distribution instead of only its expected value has been shown to lead on its own to improved performance on reinforcement learning benchmarks bellemare2017distributional ; dabney2017distributional, and can lead to greater robustness to hyperparameter choices barth2018distributed.
Distributional reinforcement learning algorithms can parameterize the return distribution in different ways. bellemare2017distributional use a categorical parameterization with a fixed number of atoms, which requires knowing the support of the return distribution in advance. dabney2017distributional propose a quantile parameterization with a fixed number of quantiles.
3.2 Quantile distributional reinforcement learning
Of particular relevance to our work is the quantile parameterization of dabney2017distributional. In this framework, a probability distribution $Z$ is parameterized by $N$ quantiles, and for each quantile $\tau_i$ of $Z$ we aim to learn the corresponding quantile value $q_i$. We denote as $q$ the vector of these quantile estimates. Learning the quantile values proceeds by minimizing the quantile regression loss koenker2001quantile,
$$\mathcal{L}(q) = \sum_{i=1}^{N} \mathbb{E}_{z \sim Z}\left[\rho_{\tau_i}(z - q_i)\right], \qquad \rho_{\tau}(u) = u\left(\tau - \mathbb{1}_{u < 0}\right). \tag{3}$$
Intuitively, this loss penalizes overestimations with weight $1 - \tau_i$ and underestimations with weight $\tau_i$. This loss can be minimized stochastically for each new value sampled from $Z$.
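The stochastic minimization of the quantile regression loss can be sketched in a few lines of Python. The sketch below assumes the standard pinball form rho_tau(u) = u (tau - 1[u < 0]), quantile levels placed at the midpoints (2i + 1) / (2N), and a constant learning rate; these choices and all names are illustrative assumptions rather than the settings used in the paper.

```python
import random


def pinball(u, tau):
    """Quantile (pinball) loss rho_tau(u) = u * (tau - 1[u < 0])."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def fit_quantiles(sampler, taus, steps=5000, lr=0.01, seed=0):
    """Stochastic subgradient descent on sum_i rho_{tau_i}(z - q_i)."""
    rng = random.Random(seed)
    q = [0.0] * len(taus)
    for _ in range(steps):
        z = sampler(rng)  # one new sample from the target distribution
        for i, tau in enumerate(taus):
            # d/dq_i rho_tau(z - q_i) = -(tau - 1[z < q_i]); step against it.
            q[i] += lr * (tau - (1.0 if z < q[i] else 0.0))
    return q


n = 5
taus = [(2 * i + 1) / (2 * n) for i in range(n)]  # midpoints 0.1, 0.3, ..., 0.9
q = fit_quantiles(lambda rng: rng.random(), taus)
print([round(v, 2) for v in q])  # should land near taus for Uniform(0, 1)
```

For a Uniform(0, 1) target the quantile value at level tau is tau itself, which makes the result easy to check by eye. Each estimate moves up by lr * tau when the sample lies above it and down by lr * (1 - tau) otherwise, so it settles where a fraction tau of the samples falls below it.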
For temporal difference learning, is replaced with the Bellman target as per equation 1 for the evaluation setting, or equation 2 for the control setting. The loss is thus
(4) 
which minimizes the average temporal difference error between all pairs of quantiles. In dabney2017distributional, quantile regression DQN (QR-DQN) reinforcement learning algorithms use both the strict quantile loss of equation 4 and a modified quantile Huber loss that is smooth at 0. In the following, we focus on the strict quantile loss since the quantile Huber loss produces biased quantile estimates.
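A minimal Python sketch of a strict quantile temporal difference loss over all pairs of quantiles is given below. The normalization by the number of target quantiles and the function names are assumptions made for illustration; the sketch is not meant to reproduce equation 4 term by term.

```python
def pinball(u, tau):
    """Quantile (pinball) loss rho_tau(u) = u * (tau - 1[u < 0])."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def quantile_td_loss(q, next_q, reward, gamma, taus):
    """Average over target quantiles of the summed pinball losses.

    q:      current quantile estimates for (s, a)
    next_q: quantile estimates for the next state-action pair
    """
    targets = [reward + gamma * qj for qj in next_q]  # sample-based Bellman target
    total = sum(pinball(t - qi, tau) for t in targets for qi, tau in zip(q, taus))
    return total / len(targets)


taus = [0.25, 0.75]
print(quantile_td_loss([0.0, 1.0], [0.0, 2.0], reward=1.0, gamma=0.5, taus=taus))  # 0.75
```

Each target quantile is treated as one sample of the target distribution, so every current estimate is compared against every target value.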
4 Epistemic uncertainty and the return distribution
Here, we present a practical and theoretically grounded method for estimating the epistemic uncertainty in the distributional setting, thus allowing us to separately quantify aleatoric and epistemic uncertainty. We choose to use a quantile parameterization of the return distribution based on dabney2017distributional , since such a parameterization does not assume a specific support for the returns and has empirically been shown to outperform a categorical parameterization on several benchmarks.
4.1 Bayesian inference for the return distribution
We first formulate learning the return distribution as a Bayesian inference problem. We use a likelihood that is based on the asymmetric Laplace distribution yu2001bayesian. This formulation is justified even when the actual distribution of the data is different from that assumed by this choice of likelihood sriram2013posterior. Specifically, a random variable is said to follow an asymmetric Laplace distribution if its probability density function is
(5) 
where and is the same as in equation 3. For each quantile estimate , we use as the corresponding likelihood. With data consisting of samples drawn from distribution and quantiles, the likelihood is then
(6) 
We note that this expression for the likelihood assumes that the quantile estimates are independent. Specifically, we do not enforce the condition that the estimates be in a specific order, so that the resulting distributions may be ill-defined. In practice, as more data is collected, likely quantile estimates converge towards a well-defined distribution.
If we replace the expectation over the distribution in the expression for the quantile loss in equation 3 by the average over the observed samples, we then have
(7) 
The loss can now be interpreted as the negative log-likelihood of the data given the quantile estimates. Minimizing the quantile loss is thus equivalent to finding maximum likelihood estimates of the quantiles.
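The link between the quantile loss and the asymmetric Laplace likelihood can be checked numerically. The sketch below assumes the standard density tau (1 - tau) / scale * exp(-rho_tau(z - q) / scale) from the Bayesian quantile regression literature; the scale parameter, the toy samples and the function names are assumptions for illustration.

```python
import math


def pinball(u, tau):
    """Quantile (pinball) loss rho_tau(u) = u * (tau - 1[u < 0])."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def ald_log_density(z, q, tau, scale=1.0):
    """Log-density of the asymmetric Laplace distribution (standard form, assumed)."""
    return math.log(tau * (1.0 - tau) / scale) - pinball(z - q, tau) / scale


def negative_log_likelihood(samples, q, tau, scale=1.0):
    return -sum(ald_log_density(z, q, tau, scale) for z in samples)


def empirical_quantile_loss(samples, q, tau):
    return sum(pinball(z - q, tau) for z in samples) / len(samples)


samples = [0.3, -1.2, 0.8, 2.5, 0.1, -0.4, 1.7]
tau = 0.5
grid = [i / 10 for i in range(-20, 31)]  # candidate quantile values
best_nll = min(grid, key=lambda q: negative_log_likelihood(samples, q, tau))
best_loss = min(grid, key=lambda q: empirical_quantile_loss(samples, q, tau))
print(best_nll, best_loss)  # both equal the sample median, 0.3
```

Because the negative log-likelihood differs from the empirical loss only by a positive factor and an additive constant that does not depend on the estimate, both objectives share the same minimizer.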
Estimating uncertainties in the Bayesian setting involves defining a suitable prior over the quantiles. The posterior distribution over the quantiles, from which the notion of uncertainty is derived, will then be proportional to the product of the likelihood and the prior. In practice, the quantile estimates are parameterized by a set of parameters, such as neural network weights and biases, that are optimized during the learning process. A common choice for prior distributions over these parameters is a normal distribution centered at the origin. With a normal prior with a given standard deviation, and assuming a scale parameter relating the magnitude of the quantiles to that of the parameters, the posterior distribution is
(8) 
where . Minimizing the regularized quantile loss,
(9) 
is now equivalent to finding maximum a posteriori (MAP) estimates of the parameters given the observed data mackay2003information . The use of this framework allows us to frame learning the return distribution as a Bayesian inference problem, thus giving us the tools to separately quantify aleatoric risk and epistemic uncertainty. Specifically, the variance of the return distribution quantifies the aleatoric uncertainty, and the variance of the Bayesian posterior distribution can be used to estimate the epistemic uncertainty.
4.2 Bayesian ensembles of return distributions
When using nonlinear function approximators such as neural networks, the posterior distribution is difficult to estimate. Variational approaches can be used to approximate the posterior graves2011practical ; blundell2015weight , but require tracking uncertainties for all parameters.
Instead, we consider two methods for drawing samples from this posterior. First, dropout gal2016dropout used during both training and inference can be used to sample from an approximate posterior distribution. Second, pearce2018bayesian have recently proposed an "anchored neural networks" scheme, in which each network in an ensemble is regularized around coordinates randomly drawn from the prior. Then, with some assumptions, when the networks are trained on the same data the parameter values towards which they converge are drawn from the posterior distribution. Specifically, for each network we draw coordinates by sampling from the prior distribution, and train the network to minimize
(10) 
Using either sampling method, we can obtain useful epistemic uncertainty estimates. First, for any given quantile we can obtain an ensemble of estimates drawn from the posterior distribution of that quantile. This ensemble can then be used to estimate the uncertainty on that quantile, for example using the variance of the ensemble, or other parameters of interest. The per-quantile uncertainty is helpful for identifying which parts of an estimated distribution are uncertain (see appendix). Second, since the quantile loss is separable, the posterior distribution for each quantile depends only on the data and the prior and not on the other quantiles. The uncertainties on each quantile are thus independent of each other, and therefore can be aggregated to produce low-variance uncertainty measures for some statistical properties of the data. We expand on this property in the following.
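The anchoring idea can be illustrated with a deliberately simplified Python sketch in which the "network" is just a vector of quantile estimates. It assumes a regularizer equal to the squared distance to the anchor, weighted by reg / M for M data points, and an N(0, 1) prior for the anchors; this weighting, the optimizer and all names are assumptions for illustration and are not the paper's equation 10.

```python
import bisect
import random


def fit_anchored_quantiles(data, taus, anchors, reg=1.0, steps=1000, lr=0.5):
    """Subgradient descent on: mean pinball loss + (reg / M) * sum_i (q_i - anchor_i)^2."""
    xs = sorted(data)
    m = len(xs)
    q = list(anchors)  # start each estimate at its anchor
    for t in range(steps):
        step = lr / (t + 1) ** 0.5  # decaying step size
        for i, tau in enumerate(taus):
            frac_below = bisect.bisect_left(xs, q[i]) / m  # share of data below q_i
            grad = -(tau - frac_below) + 2.0 * (reg / m) * (q[i] - anchors[i])
            q[i] -= step * grad
    return q


def disagreement(q_a, q_b):
    """Mean squared difference between two sets of quantile estimates."""
    return sum((a - b) ** 2 for a, b in zip(q_a, q_b)) / len(q_a)


rng = random.Random(0)
taus = [0.1, 0.3, 0.5, 0.7, 0.9]
anchors_a = [rng.gauss(0.0, 1.0) for _ in taus]  # draws from an assumed N(0, 1) prior
anchors_b = [rng.gauss(0.0, 1.0) for _ in taus]
for m in (2, 20, 200):
    data = [rng.gauss(0.0, 1.0) for _ in range(m)]
    q_a = fit_anchored_quantiles(data, taus, anchors_a)
    q_b = fit_anchored_quantiles(data, taus, anchors_b)
    print(m, round(disagreement(q_a, q_b), 4))
```

With few data points the two estimators stay close to their different anchors, whereas with more data the data term dominates and both are pulled towards the empirical quantiles, so their disagreement would be expected to shrink.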
4.3 A two network ensemble is sufficient
The use of large ensembles of neural networks is cumbersome in practice; we show that using our framework an ensemble of only two networks is sufficient to estimate the epistemic uncertainty on the mean of the distribution. This is one of the main results of our work, and implies that the epistemic uncertainty can be cheaply estimated in the distributional setting. Informally, this result can be understood as arising from the fact that the disagreement between the two networks provides independent error measures (one for each quantile). Although the error for each quantile is noisy, the average error provides a low variance uncertainty measure for the mean of the distribution. We formalize this intuition in the following.
Given an estimator of the quantiles, we approximate the mean of the return distribution using estimator ,
(11) 
It can easily be shown that if the first and second order moments of the estimators
are bounded for all and , then the mean of the quantile estimates indeed converges to the mean of the distribution. Since for each quantile the posterior distribution is estimated using the same data, we approximate the epistemic uncertainty of the mean of the return distribution using the following upper bound, which we denote ,(12) 
being an upper bound on the true variance of , this expression allows for some margin of error in the estimate of the uncertainty.
Since we don’t have direct access to the posterior distribution, we propose to use the difference between two networks A and B to obtain an estimate of ,
(13) 
where and are the estimates of the value of quantile produced by networks A and B. We make use of the fact that the uncertainties on the quantiles are independent to show that, with increasing , is with some reasonable assumptions a good estimate of .
Proposition 1: We assume that the expectation value of , the variance of and the variance of are all bounded for all and large enough . Then .
Proof: See appendix
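As a numerical sanity check of this behaviour, the Python sketch below simulates two estimators whose per-quantile errors are independent normal draws with a common standard deviation sigma. It assumes that the two-network estimate is the sum of squared per-quantile differences divided by 2N, a form chosen here because the difference of two such draws has variance 2 sigma^2; this form, the simulation settings and the names are assumptions for illustration.

```python
import random


def two_network_estimate(q_a, q_b):
    """Assumed estimator: sum_i (q_a[i] - q_b[i])^2 / (2 N)."""
    n = len(q_a)
    return sum((a - b) ** 2 for a, b in zip(q_a, q_b)) / (2 * n)


def simulate(n_quantiles, sigma=1.0, trials=1000, seed=0):
    """Mean and standard deviation of estimate / sigma^2 over repeated draws."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(trials):
        q_a = [rng.gauss(0.0, sigma) for _ in range(n_quantiles)]
        q_b = [rng.gauss(0.0, sigma) for _ in range(n_quantiles)]
        ratios.append(two_network_estimate(q_a, q_b) / sigma ** 2)
    mean = sum(ratios) / trials
    std = (sum((r - mean) ** 2 for r in ratios) / (trials - 1)) ** 0.5
    return mean, std


for n in (10, 50, 200):
    mean, std = simulate(n)
    # Last column: sqrt(2 / N), the spread expected from a scaled chi-squared variable.
    print(n, round(mean, 3), round(std, 3), round((2 / n) ** 0.5, 3))
```

Under these assumptions the ratio of the estimate to sigma^2 has mean 1 and a spread close to sqrt(2 / N), so the estimate tightens as the number of quantiles grows.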
We consider an illustrative example in which an analytic expression for the variance on is available. We assume that the posterior distribution of the quantile estimates is given by a homoskedastic normal distribution with standard deviation . For each both and are now normally distributed with standard deviation and the same mean. Their difference is thus normally distributed with standard deviation and mean 0, so that
follows a Chi-squared distribution with
degrees of freedom. The relative error between and thus decreases with increasing . For example with and , as was used in the QR-TD and QR-DQN algorithms demonstrated by dabney2017distributional , is with 95% confidence within and , respectively, of .
4.4 Distributional Q-learning with uncertainty
Until now, we have been mainly concerned with learning the return distribution given an ensemble of samples of this distribution. Here, we discuss how our uncertainty estimate can be adapted to temporal difference learning of the return distribution. First, we replace the empirical return distribution with the Bellman target defined in equation 1 for the evaluation setting, and in equation 2 for the control setting. By interpreting each quantile value of the target distribution as a sample drawn from the target distribution, we can now write the likelihood associated with the quantile estimates,
(14) 
where now refers to the targets, is as defined in equation 4, and we use (corresponding to a strict quantile loss). The framework developed in the previous sections can thus be adapted to temporal difference methods. One further modification is required to adapt our uncertainty metric to temporal difference learning. Our theoretical framework requires that both networks be trained on the same target distributions. Since we now have access to two estimates of the return distribution, we propose to use their averages as the common target distribution for both networks.
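One possible reading of the common target is sketched below in Python: the two networks' next-state quantile estimates are averaged quantile by quantile before the Bellman backup is applied. The quantile-wise averaging, the handling of terminal transitions and the function names are assumptions made for illustration.

```python
def shared_target_quantiles(reward, gamma, next_q_a, next_q_b, terminal=False):
    """Common Bellman target built from the quantile-wise average of two networks."""
    if terminal:
        return [reward] * len(next_q_a)  # no bootstrapping past a terminal state
    return [reward + gamma * 0.5 * (a + b) for a, b in zip(next_q_a, next_q_b)]


def mean_value(quantiles):
    return sum(quantiles) / len(quantiles)


def greedy_action(next_q_a_by_action, next_q_b_by_action):
    """Control setting: act greedily w.r.t. the mean of the averaged quantiles."""
    def score(action):
        pair = zip(next_q_a_by_action[action], next_q_b_by_action[action])
        return mean_value([0.5 * (a + b) for a, b in pair])
    return max(next_q_a_by_action, key=score)


print(shared_target_quantiles(1.0, 0.5, [0.0, 2.0], [2.0, 4.0]))  # [1.5, 2.5]
```

Both networks would then be regressed towards the same list of target quantiles, so that any remaining disagreement between them reflects their different anchors (or dropout masks) rather than different targets.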
5 Experiments
We perform several experiments. We first empirically verify that our uncertainty estimates perform as expected on the simple problem of learning a static distribution, and compare them to alternative metrics using a stochastic contextual bandit problem. We then illustrate applications of these estimates in two reinforcement learning environments: Cartpole and the Atari suite. The code to reproduce these experiments is available at https://github.com/unchartedtechnologies/riskanduncertainty.
5.1 Variance and number of quantiles
Figure 1 (caption, partially recovered): [...] ε-greedy agent (red), and two agents that select actions with Thompson sampling and different epistemic uncertainty metrics. Yellow: our metric with two anchored networks. Purple: Bayes by Backprop blundell2015weight.
First, we empirically verify that two networks are sufficient to provide a low variance estimate of the epistemic uncertainty, and that low variance can be achieved with a reasonable number of quantiles. We train 20 anchored neural networks on a fixed set of samples from a standard normal distribution for different numbers of quantiles. We then use this ensemble to measure epistemic uncertainty in two different ways. The first uncertainty measure is the ensemble uncertainty, corresponding to the average standard deviation of the quantile estimates over the ensemble. The second uncertainty measure is ours using two networks, which we calculate for all 190 pairs in this ensemble. Our results are shown in Fig. 1. We see that both uncertainty measures give the same result on average. Our uncertainty metric is noisier, but as the number of quantiles increases the noise decreases.
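The agreement on average between the ensemble measure and the pairwise measure can be related to a simple identity for variances: averaging half the squared difference over all unordered pairs of ensemble members gives exactly the unbiased sample variance. The Python sketch below checks this for a single quantile and 20 hypothetical ensemble members; the random draws stand in for network outputs and are an assumption for illustration.

```python
import random
from itertools import combinations
from statistics import variance


def mean_pairwise_estimate(values):
    """Average of (a - b)^2 / 2 over all unordered pairs of ensemble members."""
    pairs = list(combinations(values, 2))
    return sum((a - b) ** 2 / 2 for a, b in pairs) / len(pairs)


rng = random.Random(0)
ensemble = [rng.gauss(0.0, 1.0) for _ in range(20)]  # one quantile, 20 hypothetical networks
print(len(list(combinations(ensemble, 2))))  # 190 pairs
print(abs(mean_pairwise_estimate(ensemble) - variance(ensemble)) < 1e-9)  # True
```

A single pair is therefore a noisy but unbiased draw of the same quantity, in terms of variances, and averaging over quantiles is what reduces the noise.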
5.2 Experimental comparison of epistemic uncertainty measures
Next, we use a stochastic contextual bandit problem to compare our epistemic uncertainty measure to other uncertainty measures in neural networks. We use a contextual bandit problem used for example in guez2015sample ; blundell2015weight, in which at each step an agent is shown a mushroom, characterized by a set of features, and must decide to eat it or not. The agent is penalized for eating a toxic mushroom and rewarded for eating an edible mushroom, and must learn to distinguish good from bad mushrooms from their features. We make this problem stochastic by drawing the rewards for eating mushrooms from normal probability distributions centered around 1 (for a good mushroom) and 3 (for a bad mushroom). Figure 1 shows the performance achieved by several agents. Two agents use two different epistemic uncertainty measures, yielded by our method and by Bayes-by-Backprop blundell2015weight, and pick actions via Thompson sampling (see appendix). One baseline agent uses an ε-greedy policy. We see that the agent that uses our uncertainty metric performs as well as the agent using Bayes-by-Backprop, which shows that both methods produce similar uncertainty estimates. In addition, the agent that uses our method successfully learns the stochastic rewards associated with eating the mushrooms.
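A simple way to turn a mean return estimate and an epistemic uncertainty estimate into a Thompson sampling rule is sketched below in Python. It assumes that a plausible mean return is drawn for each action from a normal distribution whose standard deviation is the epistemic uncertainty; the paper's exact procedure is described in its appendix and may differ, and the action names and numbers are hypothetical.

```python
import random


def thompson_action(mean_returns, epistemic_stds, rng):
    """Sample one plausible mean return per action, then act greedily on the samples."""
    sampled = {
        action: rng.gauss(mean_returns[action], epistemic_stds[action])
        for action in mean_returns
    }
    return max(sampled, key=sampled.get)


rng = random.Random(0)
means = {"eat": -1.0, "skip": 0.0}   # hypothetical current estimates
stds = {"eat": 2.0, "skip": 0.0}     # "eat" is still poorly known
picks = [thompson_action(means, stds, rng) for _ in range(1000)]
print(picks.count("eat") / len(picks))
```

An action that currently looks worse is still tried from time to time while its epistemic uncertainty is large, and stops being tried once that uncertainty has shrunk. Only the epistemic part enters the sampling, so inherently noisy rewards alone do not trigger further exploration.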
5.3 Applications in reinforcement learning
5.3.1 Better generalization through exploration
We first show that our epistemic uncertainty estimate allows us to design agents that efficiently explore their environment, allowing them to better generalize to a wide variety of initial conditions different from what they were trained on. We use the OpenAI Gym domain Cartpole brockman2016openai, in which the agent must keep a pole upright as long as possible on a cart that it can move either left or right. The episode terminates either after 200 time steps, if the cart leaves the track, or if the pole falls over. We measure generalization ability by training agents on the standard Cartpole domain in which the cart is initialized at the center of the track, and test them on modified domains in which the cart is initialized at different locations along the track.
In Fig. 2, we compare the performance of two variants of distributional QR-DQN agents on this generalization task. The first agent uses an ε-greedy policy, and the second agent uses Thompson sampling with our epistemic uncertainty measure. We see that the agent that explores using our uncertainty metric learns more slowly but achieves significantly improved generalization abilities. This result is consistent with our agent spending more time exploring its environment compared to the ε-greedy agent: whereas the ε-greedy agent quickly finds and exploits one successful policy, thus incompletely exploring the MDP, the second agent discovers a larger variety of states and policies that help it generalize to different starting conditions. We note that this link between enhanced exploration and generalization has been observed before, for example in arumugam2018mitigating ; witty2018measuring.
5.3.2 Tracking the epistemic uncertainty in distributional RL
Finally, we show that our uncertainty metric can be used in more complex environments such as the Atari game Breakout bellemare2013arcade to keep track of the epistemic uncertainty in a distributional agent over the course of an episode. For this experiment, we train an agent with the QR-DQN algorithm as in dabney2017distributional with an ε-greedy exploration policy, except that we use a network architecture that allows us to implement the anchored network scheme to measure the epistemic uncertainty (see appendix). Since the trained agent consistently achieves very high scores, we study the agent’s epistemic uncertainty over an episode in which it follows a 5% ε-greedy policy, which forces it to sometimes make mistakes.
The trained agent’s epistemic uncertainty over the course of an episode is shown in Fig. 3. The uncertainty behaves in the way we would expect it to: in addition to the specific spikes in uncertainty pointed out in the figure that we can easily interpret, we also observe that the agent’s uncertainty generally increases during an episode. This is because the agent encounters a wider variety of possible states at the end of an episode than at its start. Monitoring an agent’s uncertainty in this way in complex domains would be valuable for real-life reinforcement learning agents, for example to detect and prevent potential accidents.
6 Conclusion
Estimating both aleatoric and epistemic uncertainty is crucial for building real-world learning strategies that can both explore efficiently and account for risk in their actions. We propose a method for estimating both types of uncertainty in reinforcement learning. Our method uses the distributional approach to estimate aleatoric risk, and a Bayesian framework to estimate epistemic uncertainty. Consisting of the disagreement between an ensemble of only two networks, our epistemic uncertainty metric is practical and computationally cheap. Our experimental results in the Cartpole and Atari domains illustrate applications of our method.
7 Acknowledgments
We thank Gilles Stoltz for feedback on our paper, and the team at Uncharted Technologies for their encouragement, feedback, and support.
References
 (1) Richard S Sutton and Andrew G Barto. Reinforcement learning: an introduction. MIT press, 2018.
 (2) Ian Osband. Risk versus uncertainty in deep learning: Bayes, bootstrap and the dangers of dropout. In Proceedings of the NIPS 2016 Workshop on Bayesian Deep Learning, 2016.
 (3) Thomas M Moerland, Joost Broekens, and Catholijn M Jonker. Efficient exploration with double uncertain value networks. In Conference on Neural Information Processing Systems (NIPS), 2017.
 (4) Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2017.
 (5) Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.
 (6) Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip DeTurck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 2753–2762, 2017.
 (7) Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. In Proceedings of the International Conference on Machine Learning, pages 2377–2386, 2016.
 (8) Zachary Lipton, Xiujun Li, Jianfeng Gao, Lihong Li, Faisal Ahmed, and Li Deng. BBQ-networks: efficient exploration in deep reinforcement learning for task-oriented dialogue systems. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
 (9) Kamyar Azizzadenesheli, Emma Brunskill, and Animashree Anandkumar. Efficient exploration through Bayesian deep Q-networks. Information Theory and Applications Workshop, 2018.
 (10) Robert Tibshirani. A comparison of some error estimates for neural network models. Neural Computation, 8(1):152–163, 1996.
 (11) Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing systems, pages 4026–4034, 2016.
 (12) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017.
 (13) Tim Pearce, Nicolas Anastassacos, Mohamed Zaki, and Andy Neely. Bayesian inference with anchored ensembles of neural networks, and application to reinforcement learning. arXiv preprint arXiv:1805.11324, 2018.
 (14) Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems, 2018.
 (15) Björn Lütjens, Michael Everett, and Jonathan P How. Safe reinforcement learning with model uncertainty estimates. arXiv preprint arXiv:1810.08700, 2018.
 (16) Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.
 (17) Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proceedings of the International Conference on Machine Learning, pages 1050–1059, 2016.
 (18) Matthew J Sobel. The variance of discounted Markov decision processes. Journal of Applied Probability, 19(4):794–802, 1982.
 (19) LA Prashanth and Mohammad Ghavamzadeh. Actor-critic algorithms for risk-sensitive MDPs. In Advances in neural information processing systems, pages 252–260, 2013.
 (20) Aviv Tamar, Dotan Di Castro, and Shie Mannor. Learning the variance of the reward-to-go. The Journal of Machine Learning Research, 17(1):361–396, 2016.
 (21) Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and Toshiyuki Tanaka. Nonparametric return distribution approximation for reinforcement learning. In Proceedings of the International Conference on Machine Learning, pages 799–806, 2010.
 (22) Stefan Depeweg, Jose-Miguel Hernandez-Lobato, Finale Doshi-Velez, and Steffen Udluft. Decomposition of uncertainty in Bayesian deep learning for efficient and risk-sensitive learning. In Proceedings of the International Conference on Machine Learning, pages 1192–1201, 2018.
 (23) Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pages 4754–4765, 2018.
 (24) Mikael Henaff, Alfredo Canziani, and Yann LeCun. Model-predictive policy learning with uncertainty regularization for driving in dense traffic. arXiv preprint arXiv:1901.02705, 2019.
 (25) Nikolay Nikolov, Johannes Kirschner, Felix Berkenkamp, and Andreas Krause. Information-directed exploration for deep reinforcement learning. Proceedings of the International Conference on Learning Representations, 2019.
 (26) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 (27) Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and Toshiyuki Tanaka. Parametric return density estimation for reinforcement learning. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 368–375. AUAI Press, 2010.
 (28) Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. Proceedings of the International Conference on Machine Learning, 2018.
 (29) Will Dabney, Mark Rowland, Marc G Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
 (30) Gabriel Barth-Maron, Matthew W Hoffman, David Budden, Will Dabney, Dan Horgan, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients. In Proceedings of the International Conference on Learning Representations, 2018.
 (31) Roger Koenker and Kevin F Hallock. Quantile regression. Journal of economic perspectives, 15(4):143–156, 2001.
 (32) Keming Yu and Rana A Moyeed. Bayesian quantile regression. Statistics & Probability Letters, 54(4):437–447, 2001.
 (33) Karthik Sriram, RV Ramamoorthi, Pulak Ghosh, et al. Posterior consistency of Bayesian quantile regression based on the misspecified asymmetric Laplace density. Bayesian Analysis, 8(2):479–504, 2013.
 (34) David JC MacKay. Information theory, inference and learning algorithms. Cambridge University Press, 2003.
 (35) Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.
 (36) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. In Proceedings of the International Conference on Machine Learning, pages 1613–1622. JMLR.org, 2015.
 (37) AG Guez. Sample-based search methods for Bayes-adaptive planning. PhD thesis, UCL (University College London), 2015.
 (38) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
 (39) Dilip Arumugam, David Abel, Kavosh Asadi, Nakul Gopalan, Christopher Grimm, Jun Ki Lee, Lucas Lehnert, and Michael L Littman. Mitigating planner overfitting in model-based reinforcement learning. arXiv preprint arXiv:1812.01129, 2018.
 (40) Sam Witty, Jun Ki Lee, Emma Tosch, Akanksha Atrey, Michael Littman, and David Jensen. Measuring and characterizing generalization in deep reinforcement learning. arXiv preprint arXiv:1812.02868, 2018.
 (41) Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
 (42) Brendan O’Donoghue, Ian Osband, Remi Munos, and Volodymyr Mnih. The uncertainty Bellman equation and exploration. In Proceedings of the International Conference on Machine Learning, 2018.
 (43) Kevin Bache and Moshe Lichman. UCI machine learning repository, 2013.
 (44) Danijar Hafner, Dustin Tran, Alex Irpan, Timothy Lillicrap, and James Davidson. Reliable uncertainty estimates in deep neural networks using noise contrastive priors. arXiv preprint arXiv:1807.09289, 2018.
Appendix A Thompson sampling using our epistemic uncertainty metric
Here we provide our method for selecting an action with Thompson sampling using our epistemic uncertainty estimate, which we use for both our bandit experiment and our experiment on Cartpole. Thompson sampling in these contexts should be used with the epistemic uncertainty and not the aleatoric uncertainty; our method allows us to separate both contributions. Since we do not have access to the exact shape of the epistemic posterior distribution for the mean of the returns, we approximate it with a normal distribution.
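The selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure used in our experiments: we assume each of two networks outputs a list of quantile estimates per action, that the posterior mean is the average of all quantile estimates, and that the epistemic variance is half the mean squared disagreement between the two networks. The function names and the data layout are hypothetical.

```python
import random
from statistics import fmean


def epistemic_variance(theta_a, theta_b):
    """Half the mean squared disagreement between two sets of quantile estimates."""
    return fmean((a - b) ** 2 / 2 for a, b in zip(theta_a, theta_b))


def thompson_action(quantiles_a, quantiles_b, rng):
    """Sample each action's mean return from a normal approximation and pick the best.

    quantiles_a[k] and quantiles_b[k] hold the quantile estimates of the two
    networks for action k (assumed layout, for illustration only).
    """
    sampled = []
    for theta_a, theta_b in zip(quantiles_a, quantiles_b):
        # Centre of the normal approximation: average over quantiles and networks.
        mean = fmean(theta_a + theta_b)
        # Spread of the normal approximation: epistemic standard deviation only.
        std = epistemic_variance(theta_a, theta_b) ** 0.5
        sampled.append(rng.gauss(mean, std))
    # Act greedily with respect to the sampled means.
    return max(range(len(sampled)), key=sampled.__getitem__)
```

When the two networks agree exactly, the sampled means collapse onto the predicted means and the rule reduces to greedy action selection; the aleatoric spread of the quantiles plays no role in the sampling noise.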
Appendix B Proof of Proposition 1
Let and be the mean and standard deviation for quantile according to the posterior distribution. Since both and are random variables drawn according to the posterior distribution for , we can write , where is a random variable with mean 0. We thus have that:
(15) 
Moreover, the satisfy:
(16) 
With our assumption that , , and the variance of are bounded for all and large enough , the variance of is also bounded by some constant for large enough . Since the estimates of and are independent, the are independent random variables, so that:
(17) 
where the inequality holds for large enough . Since is 0 in expectation and its variance converges to 0, converges to 0.
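The structure of this argument can be written out as a short worked derivation. The notation below is an assumption introduced for illustration only: $\theta_{A,i}$ and $\theta_{B,i}$ denote two independent posterior draws for quantile $i$ with mean $\mu_i$ and variance $\sigma_i^2$, $N$ is the number of quantiles, and $C$ is the assumed bound on the variance of the noise terms. It is a sketch consistent with the steps described above, not a restatement of equations (15) to (17).

```latex
% Assumed notation (illustrative): theta_{A,i}, theta_{B,i} are independent
% posterior draws for quantile i, with mean mu_i and variance sigma_i^2.
\begin{align*}
\mathbb{E}\!\left[\tfrac{1}{2}(\theta_{A,i}-\theta_{B,i})^2\right]
  &= \tfrac{1}{2}\left(\operatorname{Var}[\theta_{A,i}] + \operatorname{Var}[\theta_{B,i}]\right)
     + \tfrac{1}{2}\left(\mathbb{E}[\theta_{A,i}] - \mathbb{E}[\theta_{B,i}]\right)^2
   = \sigma_i^2, \\
\tfrac{1}{2}(\theta_{A,i}-\theta_{B,i})^2
  &= \sigma_i^2 + \epsilon_i, \qquad \mathbb{E}[\epsilon_i] = 0, \\
\frac{1}{N}\sum_{i=1}^{N} \tfrac{1}{2}(\theta_{A,i}-\theta_{B,i})^2
  &= \frac{1}{N}\sum_{i=1}^{N}\sigma_i^2 + \frac{1}{N}\sum_{i=1}^{N}\epsilon_i, \\
\operatorname{Var}\!\left[\frac{1}{N}\sum_{i=1}^{N}\epsilon_i\right]
  &= \frac{1}{N^2}\sum_{i=1}^{N}\operatorname{Var}[\epsilon_i]
   \le \frac{C}{N} \xrightarrow[N\to\infty]{} 0 .
\end{align*}
```

The last line uses the independence of the $\epsilon_i$ across quantiles and the bound $\operatorname{Var}[\epsilon_i] \le C$; a zero-mean quantity whose variance vanishes converges to 0 in probability.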
Appendix C What type of epistemic uncertainty are we measuring?
It is important in reinforcement learning to distinguish between local and global uncertainties. Global uncertainty estimates propagate uncertainties through multiple time steps, whereas local uncertainties consider only the uncertainty at the current step. Since we consider fixed Bellman targets during training, our uncertainty estimate measures the local uncertainty. The global uncertainty is usually a more useful quantity in the reinforcement learning problem; however, estimates of the local uncertainty can be converted into an estimate of the global uncertainty using an uncertainty Bellman equation [42]. Our metric provides a simple estimate of the local uncertainty, which can, if necessary, be used to quantify the global uncertainty.
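The conversion from local to global uncertainty can be illustrated with a small tabular sketch. We assume here that the uncertainty Bellman equation takes the fixed-point form u(s,a) = ν(s,a) + γ² Σ P(s'|s,a) π(a'|s') u(s',a'), where ν is a local uncertainty expressed as a variance; the exact statement and its conditions should be checked against [42]. The function name and the nested-list layout are hypothetical.

```python
def solve_ube(nu, P, pi, gamma, iters=200):
    """Fixed-point iteration for a tabular uncertainty Bellman equation.

    nu[s][a]   : local uncertainty (a variance) for state s and action a
    P[s][a][t] : probability of moving from state s to state t under action a
    pi[s][a]   : probability that the policy picks action a in state s
    """
    n_s, n_a = len(nu), len(nu[0])
    u = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        # Policy-averaged uncertainty of each state under the current estimate.
        v = [sum(pi[s][a] * u[s][a] for a in range(n_a)) for s in range(n_s)]
        # Variances are discounted by gamma squared, not gamma.
        u = [[nu[s][a] + gamma ** 2 * sum(P[s][a][t] * v[t] for t in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
    return u
```

For a single state with a self-loop, a local uncertainty of 1, and γ = 0.5, the iteration converges to 1 / (1 − γ²) = 4/3, which shows how a constant local uncertainty accumulates over time steps.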
Appendix D Further information and results on the contextual bandit problem
Here, we provide more information and results on the contextual bandit experiment.
D.1 Experiment setup
We use the UCI mushroom data set [43], where each entry contains features about different mushrooms and whether they are edible or not. We convert this data set into a contextual bandit problem in which at each step an agent must choose between eating a given mushroom or not eating it. The agent receives stochastic rewards drawn from normal distributions with standard deviation 1, and means -3 and 1 for respectively eating a toxic mushroom and eating an edible mushroom. The agent receives a deterministic reward of 0 for not eating the mushroom.
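The reward structure of this bandit can be sketched in a few lines. This is an illustration rather than our experiment code: it assumes the mean reward for eating a toxic mushroom is negative (-3), and the function name and arguments are hypothetical.

```python
import random


def mushroom_reward(eat, edible, rng):
    """Reward for one step of the mushroom bandit (illustrative sketch)."""
    if not eat:
        # Declining to eat yields a deterministic reward of 0.
        return 0.0
    # Assumed means: +1 for an edible mushroom, -3 for a toxic one.
    mean = 1.0 if edible else -3.0
    # Eating yields a normally distributed reward with standard deviation 1.
    return rng.gauss(mean, 1.0)
```

Under these assumptions the optimal policy eats only edible mushrooms, so the per-step regret of a wrong decision is 1 for skipping an edible mushroom and 3 in expectation for eating a toxic one.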
Our agents all use neural networks to predict the reward or distribution of rewards corresponding to each action (eating or not eating), as a function of the mushroom’s features. Each neural network uses two hidden layers of 100 neurons each. The distributional agents each have 50 outputs, each corresponding to one quantile. Each time an agent acts, its action as well as the corresponding reward is stored in memory. Every ten actions, the neural network is updated using 100 batches of 32 action-reward pairs randomly drawn from memory. For the methods that select actions using Thompson sampling (see Appendix A), the weight of the prior is annealed linearly with the number of mushrooms in the replay buffer. For each agent, the hyperparameters corresponding to either the prior or the dropout rate were optimized to achieve the best performance. Each experiment was repeated over 10 random seeds, and the plots show both the median cumulative regrets (in bold) and quantiles 0.1 and 0.9, such that 80% of our results lie in the shaded area.
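The summary statistics shown in the plots can be computed as sketched below. This is a minimal illustration, assuming linearly interpolated quantiles across seeds at each step; the interpolation rule, the function names, and the data layout are assumptions rather than a description of our plotting code.

```python
def quantile(values, q):
    """Linearly interpolated q-quantile of a list of numbers (0 <= q <= 1)."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def regret_bands(regret_per_seed):
    """Return (q0.1, median, q0.9) of the cumulative regret at every step.

    regret_per_seed[k][t] is the cumulative regret of seed k at step t.
    """
    # zip(*...) regroups the data so that each column holds one step across seeds.
    return [(quantile(col, 0.1), quantile(col, 0.5), quantile(col, 0.9))
            for col in zip(*regret_per_seed)]
```

The median gives the bold curve and the two outer quantiles bound the shaded area.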
D.2 Results
First, in figure 4 we compare two agents that both pick actions according to Thompson sampling with our epistemic uncertainty metric, but that sample from the posterior distribution in two different ways. One agent uses two anchored networks [13], and the second agent draws two samples from the dropout distribution [17] with dropout probability 20% (we tested dropout probabilities of 5%, 10%, 20%, and 50% and found the best performance at 20%). We find that both agents obtain much better scores than the epsilon-greedy agent. However, the dropout agent seems to achieve slightly worse longer-term performance.
Next, we examine the reward distributions learned by our agents, and use a larger ensemble of networks to also study the per-quantile uncertainty. Specifically, in figure 4 we plot the reward distribution predicted by our agent for an edible mushroom. We see that the agent has correctly learned the distribution. We also plot the per-quantile uncertainty on this distribution, obtained from multiple samples from the approximated posterior for each quantile. We observe that the per-quantile uncertainty provides important information: the agent is a lot less certain about the lower quantiles than the rest of the distribution. This is because the possibility of receiving a very negative reward if the model is wrong affects the lower quantiles much more than the upper quantiles.
Appendix E Locating the epistemic uncertainty in the distribution
We further demonstrate that our Bayesian formulation of learning a distribution allows us to locate the uncertainty in the estimated distribution. We produce 20 samples of a fixed distribution to be learned by our network, ten of which have a value of -1 and ten of which have a value of 1. With these samples, the epistemic uncertainty on the middle quantiles of the estimated distribution that produced these samples should be higher than that on the lower quantiles. For example, there are not enough data points to decide whether the median value should be -1 or 1. We train an ensemble of 20 anchored neural networks on these samples, and measure the per-quantile epistemic uncertainty using the observed standard deviation in the predictions.
Our experimental results are shown in figure 5. The epistemic uncertainty is indeed higher on the middle quantiles than on the other quantiles. This is reflected in more noise in the predictions of the ensemble for those quantile values.
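The intuition behind this result can be checked without any neural network. The sketch below swaps the anchored ensemble for a plain bootstrap over the 20 samples, which is a different and much cruder technique, and assumes the two sample values are -1 and 1. It only illustrates why the data constrain the middle quantiles least; the function names are hypothetical.

```python
import random
from statistics import pstdev


def empirical_quantile(sample, tau):
    """Order-statistic estimate of the tau-quantile of a sample."""
    xs = sorted(sample)
    return xs[min(int(tau * len(xs)), len(xs) - 1)]


def quantile_spread(data, taus, n_resamples, rng):
    """Standard deviation of each quantile estimate across bootstrap resamples."""
    draws = {tau: [] for tau in taus}
    for _ in range(n_resamples):
        # Resample the data set with replacement (bootstrap stand-in for an ensemble).
        resample = [rng.choice(data) for _ in data]
        for tau in taus:
            draws[tau].append(empirical_quantile(resample, tau))
    return {tau: pstdev(vals) for tau, vals in draws.items()}


data = [-1.0] * 10 + [1.0] * 10
spread = quantile_spread(data, [0.1, 0.5, 0.9], 2000, random.Random(0))
```

The spread at the median is far larger than at the 0.1 and 0.9 quantiles, because a resample almost always contains enough points of each value to pin down the outer quantiles, while the median flips between the two values depending on which one happens to be in the majority.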
Appendix F Further results on Cartpole
We provide a visual representation of the predictions given by our two networks in the Cartpole domain in figure 6. The anchoring scheme causes each network’s predictions to be noisy, and the noise can be interpreted as an indication of the agent’s uncertainty. As we can expect, the uncertainty on the suboptimal action is higher than for the better action, since that action is selected less often in that state.
We note that for our Cartpole experiment, we use a smooth quantile loss with (see [29]) instead of a strict quantile loss as in the rest of our paper. This is because of the density of positive rewards in this environment; if the loss function is not made sensitive to outliers then the value estimates tend to increase exponentially, so learning is often unstable. Such an effect can also be observed with the DQN algorithm [26] in this domain.
We also note that we do not decrease the importance of the prior with the amount of experience that we collect in this experiment. This causes the agent to always maintain a minimum amount of uncertainty, so that it tends to continue exploring even when it has found good solutions to the problem. On the one hand, this causes slower learning; on the other hand, it also leads to more exploration at every stage in the learning process as the agent discovers new parts of the MDP.
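The difference between the strict and the smooth quantile loss mentioned above can be sketched as follows. We assume the smooth loss is the quantile Huber loss of [29], with threshold parameter κ; the value of κ is left as a free argument here, and the function names are hypothetical.

```python
def quantile_loss(target, prediction, tau):
    """Strict quantile (pinball) loss for the tau-quantile."""
    u = target - prediction
    return u * (tau - (1.0 if u < 0 else 0.0))


def huber(u, kappa):
    """Huber loss: quadratic for |u| <= kappa, linear beyond."""
    if abs(u) <= kappa:
        return 0.5 * u * u
    return kappa * (abs(u) - 0.5 * kappa)


def quantile_huber_loss(target, prediction, tau, kappa):
    """Smooth quantile loss: Huber loss with the asymmetric quantile weighting."""
    u = target - prediction
    # Weight is tau for under-predictions (u >= 0) and 1 - tau for over-predictions.
    weight = abs(tau - (1.0 if u < 0 else 0.0))
    return weight * huber(u, kappa)
```

The smooth loss has a gradient that shrinks to zero as the error shrinks, whereas the strict loss keeps a constant-magnitude gradient on either side of zero.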
Appendix G Further results on the Atari suite
G.1 Tracking the epistemic uncertainty: experimental setup
Our experiment reproduces the training procedure of [29], except that we use a network architecture allowing us to measure the epistemic uncertainty using our method and we train our agent for 50 million steps instead of 200 million. Our network architecture includes the following modifications. Following [11], instead of having two separate networks, we have both networks share common parameters within a "body" consisting of convolutional layers, on top of which lie two "heads" with separate parameters, consisting of a single linear layer, that each correspond to one of the networks. As in [44] we only define a prior on the two heads. The outputs from both heads are averaged to produce the value estimates used by the policy.
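The way the two heads are combined can be sketched in plain Python. This is an illustration under assumptions, not our network code: the shared convolutional body is replaced by a precomputed feature vector, each head is a single linear layer whose outputs are laid out action by action, and the value of an action is taken to be the mean of its averaged quantile estimates. All names are hypothetical.

```python
def linear(weights, bias, features):
    """Single linear layer: one output per row of the weight matrix."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def two_head_values(features, head_a, head_b, n_actions, n_quantiles):
    """Average the quantile outputs of two heads and reduce them to action values.

    features : output of the shared body (assumed to be precomputed)
    head_a/b : (weights, bias) pairs producing n_actions * n_quantiles outputs
    """
    out_a = linear(*head_a, features)
    out_b = linear(*head_b, features)
    values = []
    for k in range(n_actions):
        block = slice(k * n_quantiles, (k + 1) * n_quantiles)
        # Average the two heads quantile by quantile.
        avg = [(a + b) / 2 for a, b in zip(out_a[block], out_b[block])]
        # The value estimate is the mean of the averaged quantiles.
        values.append(sum(avg) / n_quantiles)
    return values
```

Because only the heads differ, the disagreement between `out_a` and `out_b` is available at the cost of one extra linear layer rather than a second full network.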
G.2 Exploration via Thompson sampling
We also performed an experiment in which we trained agents that select actions using Thompson sampling (in the same way as we did in the Cartpole environment) on the following Atari games: Alien, Amidar, Assault, Asterix, and Breakout. These agents used the same network architecture as described above, and were trained over 50 million game frames. Every 1M frames, these agents were evaluated on 500k frames, as in [26, 29]. Table G.2 shows the best evaluation results achieved during training for both our implementation of QR-DQN with an ε-greedy policy (as in [29]), and QR-DQN with a Thompson sampling policy. Fig. 7 shows the evaluation scores as a function of game frames for all 5 games.
| Game | Human | QR-DQN (ε-greedy) | QR-DQN (Thompson sampling) |
|---|---|---|---|
| Alien | 7128 | 1825 | 1899 |
| Amidar | 1719 | 1035 | 442 |
| Assault | 742 | 29359 | 14377 |
| Asterix | 8503 | 44074 | 20518 |
| Breakout | 31 | 565 | 515 |
We observe that agents that use Thompson sampling with our uncertainty metric all learn successful policies on these games. However, on some games agents that use Thompson sampling to select actions learn more slowly than agents that use an ε-greedy policy. This result is in line with what we observe on Cartpole: the agents that use Thompson sampling spend more time exploring less rewarding parts of the MDP, and at the end of training lag behind the ε-greedy agents that have spent more time exploiting.
We hypothesize that, similarly to Cartpole, agents trained using the more exploratory policy experience a wider variety of states and thus generalize better. However, there is no straightforward and computationally cheap way of comparing generalization performance on Atari, and we thus leave such an analysis for future work.
We note that our findings would seem to contrast with the results reported in [9], in which an agent with a more exploratory policy learns faster than a baseline DDQN agent with an ε-greedy policy. However, the implementation of the agent in [9] requires significant design changes compared to their baseline agent, such as a different network architecture, learning rate, and a larger set of hyperparameters. We thus cannot directly compare our results to theirs.