1 Introduction
Distributional reinforcement learning (Jaquette, 1973; Sobel, 1982; White, 1988; Morimura et al., 2010b; Bellemare et al., 2017) focuses on the intrinsic randomness of returns within the reinforcement learning (RL) framework. As the agent interacts with the environment, irreducible randomness seeps in through the stochasticity of these interactions, the approximations in the agent’s representation, and even the inherently chaotic nature of physical interaction (Yu et al., 2016). Distributional RL aims to model the distribution over returns, whose mean is the traditional value function, and to use these distributions to evaluate and optimize a policy.
Any distributional RL algorithm is characterized by two aspects: the parameterization of the return distribution, and the distance metric or loss function being optimized. Together, these choices control assumptions about the random returns and how approximations will be traded off. Categorical DQN (Bellemare et al., 2017, C51) combines a categorical distribution and the cross-entropy loss with the Cramér-minimizing projection (Rowland et al., 2018). For this, it assumes returns are bounded in a known range and trades off mean-preservation at the cost of overestimating variance.
C51 outperformed all previous improvements to DQN on a set of 57 Atari 2600 games in the Arcade Learning Environment (Bellemare et al., 2013), which we refer to as the Atari-57 benchmark. Subsequently, several papers have built upon this successful combination to achieve significant improvements to the state-of-the-art in Atari-57 (Hessel et al., 2018; Gruslys et al., 2018), and challenging continuous control tasks (Barth-Maron et al., 2018).
These algorithms are restricted to assigning probabilities to an a priori fixed, discrete set of possible returns.
Dabney et al. (2018) propose an alternate pair of choices, parameterizing the distribution by a uniform mixture of Diracs whose locations are adjusted using quantile regression. Their algorithm, QR-DQN, while restricted to a discrete set of quantiles, automatically adapts return quantiles to minimize the Wasserstein distance between the Bellman updated and current return distributions. This flexibility allows QR-DQN to significantly improve on C51’s Atari-57 performance.
In this paper, we extend the approach of Dabney et al. (2018), from learning a discrete set of quantiles to learning the full quantile function, a continuous map from probabilities to returns. When combined with a base distribution, such as $U([0,1])$, this forms an implicit distribution capable of approximating any distribution over returns given sufficient network capacity. Our approach, implicit quantile networks (IQN), is best viewed as a simple distributional generalization of the DQN algorithm (Mnih et al., 2015), and provides several benefits over QR-DQN.
First, the approximation error for the distribution is no longer controlled by the number of quantiles output by the network, but by the size of the network itself, and the amount of training. Second, IQN can be used with as few, or as many, samples per update as desired, providing improved data efficiency with increasing number of samples per training update. Third, the implicit representation of the return distribution allows us to expand the class of policies to more fully take advantage of the learned distribution. Specifically, by taking the base distribution to be non-uniform, we expand the class of policies to $\epsilon$-greedy policies on arbitrary distortion risk measures (Yaari, 1987; Wang, 1996).
We begin by reviewing distributional reinforcement learning, related work, and introducing the concepts surrounding risk-sensitive RL. In subsequent sections, we introduce our proposed algorithm, IQN, and present a series of experiments using the Atari-57 benchmark, investigating the robustness and performance of IQN. Despite being a simple distributional extension to DQN, and forgoing any other improvements, IQN significantly outperforms QR-DQN and nearly matches the performance of Rainbow, which combines many orthogonal advances. In fact, in human-starts as well as in the hardest Atari games (where current RL agents still underperform human players) IQN improves over Rainbow.
2 Background / Related Work
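We consider the standard RL setting, in which the interaction of an agent and an environment is modeled as a Markov Decision Process $(\mathcal{X}, \mathcal{A}, R, P, \gamma)$ (Puterman, 1994), where $\mathcal{X}$ and $\mathcal{A}$ denote the state and action spaces, $R$ the (state- and action-dependent) reward function, $P(\cdot \mid x, a)$ the transition kernel, and $\gamma \in (0, 1)$ a discount factor. A policy $\pi(\cdot \mid x)$ maps a state to a distribution over actions.
For an agent following policy $\pi$, the discounted sum of future rewards is denoted by the random variable $Z^\pi(x, a) = \sum_{t=0}^{\infty} \gamma^t R(x_t, a_t)$, where $x_0 = x$, $a_0 = a$, $x_t \sim P(\cdot \mid x_{t-1}, a_{t-1})$, and $a_t \sim \pi(\cdot \mid x_t)$. The action-value function is defined as $Q^\pi(x, a) = \mathbb{E}[Z^\pi(x, a)]$, and can be characterized by the Bellman equation
$$Q^\pi(x, a) = \mathbb{E}[R(x, a)] + \gamma \, \mathbb{E}_{P, \pi}[Q^\pi(x', a')].$$
The objective in RL is to find an optimal policy $\pi^*$, which maximizes $\mathbb{E}[Z^\pi]$, i.e. $Q^{\pi^*}(x, a) \geq Q^\pi(x, a)$ for all $\pi$ and all $x, a$. One approach is to find the unique fixed point $Q^* = Q^{\pi^*}$ of the Bellman optimality operator (Bellman, 1957):
$$Q(x, a) = \mathcal{T} Q(x, a) := \mathbb{E}[R(x, a)] + \gamma \, \mathbb{E}_{P}\left[\max_{a'} Q(x', a')\right].$$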
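As a minimal illustration of the fixed-point view, the following sketch repeatedly applies the Bellman optimality operator to a small tabular problem until the estimate stops changing. The two-state MDP, its rewards and transition probabilities, and the names `bellman_optimality` and `solve` are hypothetical choices made for this example only.

```python
# Hypothetical two-state, two-action MDP used only for illustration.
GAMMA = 0.9
STATES = (0, 1)
ACTIONS = (0, 1)
# REWARD[x][a] is the expected reward E[R(x, a)].
REWARD = {0: {0: 0.0, 1: 1.0}, 1: {0: 2.0, 1: 0.0}}
# P[x][a] maps each next state x' to its probability P(x' | x, a).
P = {
    0: {0: {0: 1.0}, 1: {0: 0.5, 1: 0.5}},
    1: {0: {1: 1.0}, 1: {0: 1.0}},
}


def bellman_optimality(q):
    """Apply the operator once: (TQ)(x,a) = E[R] + gamma * E_P[max_a' Q(x',a')]."""
    return {
        x: {
            a: REWARD[x][a]
            + GAMMA * sum(p * max(q[x2].values()) for x2, p in P[x][a].items())
            for a in ACTIONS
        }
        for x in STATES
    }


def solve(tol=1e-10, max_iters=10_000):
    """Iterate the operator from Q = 0 until successive estimates agree."""
    q = {x: {a: 0.0 for a in ACTIONS} for x in STATES}
    for _ in range(max_iters):
        q_next = bellman_optimality(q)
        gap = max(abs(q_next[x][a] - q[x][a]) for x in STATES for a in ACTIONS)
        q = q_next
        if gap < tol:
            break
    return q


q_star = solve()
```

Because the operator is a $\gamma$-contraction in the supremum norm, the iteration converges from any starting point; Q-learning replaces the exact expectations used here with sampled transitions.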
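To this end, Q-learning (Watkins, 1989) iteratively improves an estimate, $Q_\theta$, of the optimal action-value function, $Q^*$, by repeatedly applying the Bellman update:
$$Q_\theta(x, a) \leftarrow \mathbb{E}[R(x, a)] + \gamma \, \mathbb{E}_{P}\left[\max_{a'} Q_\theta(x', a')\right].$$
The action-value function can be approximated by a parameterized function $Q_\theta$ (e.g. a neural network), and trained by minimizing the squared temporal difference (TD) error,
$$\delta_t^2 = \left[r_t + \gamma \max_{a' \in \mathcal{A}} Q_\theta(x_{t+1}, a') - Q_\theta(x_t, a_t)\right]^2,$$
over samples $(x_t, a_t, r_t, x_{t+1})$ observed while following an $\epsilon$-greedy policy over $Q_\theta$. This policy acts greedily with respect to $Q_\theta$ with probability $1 - \epsilon$ and uniformly at random otherwise. DQN (Mnih et al., 2015) uses a convolutional neural network to parameterize $Q_\theta$ and the Q-learning algorithm to achieve human-level play on the Atari-57 benchmark.
2.1 Distributional RL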
In distributional RL, the distribution over returns (the law of $Z^\pi$) is considered instead of the scalar value function $Q^\pi$ that is its expectation. This change in perspective has yielded new insights into the dynamics of RL (Azar et al., 2012), and been a useful tool for analysis (Lattimore & Hutter, 2012). Empirically, distributional RL algorithms show improved sample complexity and final performance, as well as increased robustness to hyperparameter variation (Barth-Maron et al., 2018).
An analogous distributional Bellman equation of the form
$$Z^\pi(x, a) \overset{D}{=} R(x, a) + \gamma Z^\pi(X', A')$$
can be derived, where $A \overset{D}{=} B$ denotes that two random variables $A$ and $B$ have equal probability laws, and the random variables $X'$ and $A'$ are distributed according to $P(\cdot \mid x, a)$ and $\pi(\cdot \mid x')$, respectively.
Morimura et al. (2010a) defined the distributional Bellman operator explicitly in terms of conditional probabilities, parameterized by the mean and scale of a Gaussian or Laplace distribution, and minimized the Kullback-Leibler (KL) divergence between the Bellman target and the current estimated return distribution. However, the distributional Bellman operator is not a contraction in the KL.
As with the scalar setting, a distributional Bellman optimality operator can be defined by
$$\mathcal{T} Z(x, a) \overset{D}{:=} R(x, a) + \gamma Z\left(X', \arg\max_{a' \in \mathcal{A}} \mathbb{E}\, Z(X', a')\right),$$
with $X'$ distributed according to $P(\cdot \mid x, a)$. While the distributional Bellman operator for policy evaluation is a contraction in the $p$-Wasserstein distance (Bellemare et al., 2017), this no longer holds for the control case. Convergence to the optimal policy can still be established, but requires a more involved argument.
Bellemare et al. (2017) parameterize the return distribution as a categorical distribution over a fixed set of equidistant points and minimize the KL divergence to the projected distributional Bellman target. Their algorithm, C51, outperformed previous DQN variants on the Atari-57 benchmark. Subsequently, Hessel et al. (2018) combined C51 with enhancements such as prioritized experience replay (Schaul et al., 2016), $n$-step updates (Sutton, 1988), and the dueling architecture (Wang et al., 2016), leading to the Rainbow agent, current state-of-the-art in Atari-57.
The categorical parameterization, using the projected KL loss, has also been used in recent work to improve the critic of a policy gradient algorithm, D4PG, achieving significantly improved robustness and state-of-the-art performance across a variety of continuous control tasks (Barth-Maron et al., 2018).
2.2 Wasserstein Metric
The $p$-Wasserstein metric, for $p \in [1, \infty]$, plays a key role in recent results in distributional RL (Bellemare et al., 2017; Dabney et al., 2018). It has also been a topic of increasing interest in generative modeling (Arjovsky et al., 2017; Bousquet et al., 2017; Tolstikhin et al., 2017), because unlike the KL divergence, the Wasserstein metric inherently trades off approximate solutions with likelihoods.
The $p$-Wasserstein distance is the $L^p$ metric on inverse cumulative distribution functions (c.d.f.), also known as quantile functions (Müller, 1997). For random variables $U$ and $V$ with quantile functions $F_U^{-1}$ and $F_V^{-1}$, respectively, the $p$-Wasserstein distance is given by
$$W_p(U, V) = \left(\int_0^1 \left|F_U^{-1}(\omega) - F_V^{-1}(\omega)\right|^p d\omega\right)^{1/p}.$$
The class of optimal transport metrics express distances between distributions in terms of the minimal cost for transporting mass to make the two distributions identical. This cost is given in terms of some metric, $c : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_{\geq 0}$, on the underlying space $\mathcal{X}$. The $p$-Wasserstein metric corresponds to $c = L^p$. We are particularly interested in the Wasserstein metrics due to the predominant use of $L^p$ spaces in mean-value reinforcement learning.
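To make the quantile-function view concrete, here is a small sketch for the special case of two empirical distributions with the same number of equally weighted samples. In that case both quantile functions are step functions with identical step widths, so the integral reduces to an average over sorted samples. The function name `wasserstein_p` and the equal-sample-size restriction are assumptions of this example.

```python
def wasserstein_p(u, v, p=1.0):
    """p-Wasserstein distance between two equally sized empirical samples.

    Sorting each sample gives its empirical quantile function evaluated on a
    common grid, so the distance is the L^p norm of the sorted differences.
    """
    if len(u) != len(v) or not u:
        raise ValueError("samples must be non-empty and of equal size")
    sorted_u, sorted_v = sorted(u), sorted(v)
    n = len(sorted_u)
    total = sum(abs(a - b) ** p for a, b in zip(sorted_u, sorted_v))
    return (total / n) ** (1.0 / p)


# Shifting every sample by a constant moves all mass by that constant.
shift_distance = wasserstein_p([0.0, 1.0, 2.0], [3.0, 1.0, 2.0], p=1.0)
```

For samples of unequal size or with unequal weights, the two step functions no longer share breakpoints and the integral has to be evaluated piecewise instead.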
2.3 Quantile Regression for Distributional RL
Bellemare et al. (2017) showed that the distributional Bellman operator is a contraction in the $p$-Wasserstein metric, but as the proposed algorithm did not itself minimize the Wasserstein metric, this left a theory-practice gap for distributional RL. Recently, this gap was closed, in both directions. First and most relevant to this work, Dabney et al. (2018) proposed the use of quantile regression for distributional RL and showed that by choosing the quantile targets suitably the resulting projected distributional Bellman operator is a contraction in the $\infty$-Wasserstein metric. Concurrently, Rowland et al. (2018) showed the original class of categorical algorithms are a contraction in the Cramér distance, the $L^2$ metric on cumulative distribution functions.
By estimating the quantile function at precisely chosen points, QR-DQN minimizes the 1-Wasserstein distance to the distributional Bellman target (Dabney et al., 2018). This estimation uses quantile regression, which has been shown to converge to the true quantile function value when minimized using stochastic approximation (Koenker, 2005).
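In QR-DQN, the random return is approximated by a uniform mixture of $N$ Diracs,
$$Z_\theta(x, a) := \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta_i(x, a)},$$
with each $\theta_i$ assigned a fixed quantile target, $\hat{\tau}_i = \frac{\tau_{i-1} + \tau_i}{2}$ for $1 \leq i \leq N$, where $\tau_i = i/N$. These quantile estimates are trained using the Huber (1964) quantile regression loss, with threshold $\kappa$,
$$\rho_\tau^\kappa(\delta_{ij}) = \left|\tau - \mathbb{I}\{\delta_{ij} < 0\}\right| \frac{\mathcal{L}_\kappa(\delta_{ij})}{\kappa}, \quad \text{with} \quad \mathcal{L}_\kappa(\delta_{ij}) = \begin{cases} \frac{1}{2} \delta_{ij}^2, & \text{if } |\delta_{ij}| \leq \kappa \\ \kappa \left(|\delta_{ij}| - \frac{1}{2} \kappa\right), & \text{otherwise,} \end{cases}$$
on the pairwise TD-errors
$$\delta_{ij} = r + \gamma \theta_j(x', \pi(x')) - \theta_i(x, a).$$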
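The Huber quantile regression loss can be sketched in a few lines of plain Python. The function names (`huber`, `quantile_huber`, `qr_targets`, `qr_dqn_loss`) are chosen for this example, and averaging over the target index $j$ while summing over $i$ is the normalization assumed here for a single transition.

```python
def huber(delta, kappa=1.0):
    """Huber loss: quadratic within [-kappa, kappa], linear outside."""
    if abs(delta) <= kappa:
        return 0.5 * delta * delta
    return kappa * (abs(delta) - 0.5 * kappa)


def quantile_huber(delta, tau, kappa=1.0):
    """Asymmetric weighting |tau - 1{delta < 0}| applied to the Huber loss."""
    indicator = 1.0 if delta < 0 else 0.0
    return abs(tau - indicator) * huber(delta, kappa) / kappa


def qr_targets(n):
    """Quantile midpoints (tau_{i-1} + tau_i) / 2 with tau_i = i / n."""
    return [(2 * i - 1) / (2 * n) for i in range(1, n + 1)]


def qr_dqn_loss(theta, theta_next, reward, gamma, kappa=1.0):
    """Loss for one transition from quantile estimates at (x, a) and (x', a')."""
    taus = qr_targets(len(theta))
    total = 0.0
    for tau_i, theta_i in zip(taus, theta):
        for theta_j in theta_next:
            delta = reward + gamma * theta_j - theta_i  # pairwise TD-error
            total += quantile_huber(delta, tau_i, kappa)
    return total / len(theta_next)
```

With `tau = 0.25`, an overestimate (negative `delta`) is weighted three times as heavily as an underestimate of the same size, which is what pushes the estimate towards the 25th percentile.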
2.4 Risk in Reinforcement Learning
Distributional RL algorithms have been theoretically justified for the Wasserstein and Cramér metrics (Bellemare et al., 2017; Rowland et al., 2018), and learning the distribution over returns, in and of itself, empirically results in significant improvements to data efficiency, final performance, and stability (Bellemare et al., 2017; Dabney et al., 2018; Gruslys et al., 2018; Barth-Maron et al., 2018). However, in each of these recent works the policy used was based entirely on the mean of the return distribution, just as in standard reinforcement learning. A natural question arises: can we expand the class of policies using information provided by the distribution over returns (i.e. to the class of risk-sensitive policies)? Furthermore, when would this larger policy class be beneficial?
Here, ‘risk’ refers to the uncertainty over possible outcomes, and risk-sensitive policies are those which depend upon more than the mean of the outcomes. At this point, it is important to highlight the difference between intrinsic uncertainty, captured by the distribution over returns, and parametric uncertainty, the uncertainty over the value estimate typically associated with Bayesian approaches such as PSRL (Osband et al., 2013) and Kalman TD (Geist & Pietquin, 2010). Distributional RL seeks to capture the former, which classic approaches to risk are built upon. (One exception is the recent work of Moerland et al. (2017) towards combining both forms of uncertainty to improve exploration.)
Expected utility theory states that if a decision policy is consistent with a particular set of four axioms regarding its choices then the decision policy behaves as though it is maximizing the expected value of some utility function $U$ (von Neumann & Morgenstern, 1947),
$$\pi(x) = \arg\max_a \mathbb{E}_{Z(x, a)}[U(z)].$$
This is perhaps the most pervasive notion of risk-sensitivity. A policy maximizing a linear utility function is called risk-neutral, whereas concave or convex utility functions give rise to risk-averse or risk-seeking policies, respectively. Many previous studies on risk-sensitive RL adopt the utility function approach (Howard & Matheson, 1972; Marcus et al., 1997; Maddison et al., 2017).
A crucial axiom of expected utility is independence: given random variables $X$, $Y$ and $Z$, such that $X \succ Y$ ($X$ preferred over $Y$), any mixture between $X$ and $Z$ is preferred to the same mixture between $Y$ and $Z$ (von Neumann & Morgenstern, 1947). Stated in terms of the cumulative probability functions, $\alpha F_X + (1 - \alpha) F_Z \geq \alpha F_Y + (1 - \alpha) F_Z$ for all $\alpha \in [0, 1]$. This axiom in particular has troubled many researchers because it is consistently violated by human behavior (Tversky & Kahneman, 1992). The Allais paradox is a frequently used example of a decision problem where people violate the independence axiom of expected utility theory (Allais, 1990).
However, as Yaari (1987) showed, this axiom can be replaced by one in terms of convex combinations of outcome values, instead of mixtures of distributions. Specifically, if as before $X \succ Y$, then for any $\alpha \in [0, 1]$ and random variable $Z$, $\alpha F_X^{-1} + (1 - \alpha) F_Z^{-1} \geq \alpha F_Y^{-1} + (1 - \alpha) F_Z^{-1}$. This leads to an alternate, dual, theory of choice to that of expected utility. Under these axioms the decision policy behaves as though it is maximizing a distorted expectation, for some continuous monotonic function $h$:
$$\pi(x) = \arg\max_a \int_{-\infty}^{\infty} z \frac{\partial}{\partial z} \left(h \circ F_{Z(x, a)}\right)(z) \, dz.$$
Such a function $h$ is known as a distortion risk measure, as it distorts the cumulative probabilities of the random variable (Wang, 1996). That is, we have two fundamentally equivalent approaches to risk-sensitivity. Either we choose a utility function and follow the expectation of this utility, or we choose a reweighting of the distribution and compute the expectation under this distortion measure. Indeed, Yaari (1987) further showed that these two functions are inverses of each other. The choice between them amounts to a choice over whether the behavior should be invariant to mixing with random events or to convex combinations of outcomes.
Distortion risk measures include, as special cases, cumulative probability weighting used in cumulative prospect theory (Tversky & Kahneman, 1992), conditional value at risk (Chow & Ghavamzadeh, 2014), and many other methods (Morimura et al., 2010b). Recently Majumdar & Pavone (2017) argued for the use of distortion risk measures in robotics.
3 Implicit Quantile Networks
We now introduce the implicit quantile network (IQN), a deterministic parametric function trained to reparameterize samples from a base distribution, e.g. $\tau \sim U([0, 1])$, to the respective quantile values of a target distribution. IQN provides an effective way to learn an implicit representation of the return distribution, yielding a powerful function approximator for a new DQN-like agent.
Let $F_Z^{-1}(\tau)$ be the quantile function at $\tau \in [0, 1]$ for the random variable $Z$. For notational simplicity we write $Z_\tau := F_Z^{-1}(\tau)$, thus for $\tau \sim U([0, 1])$ the resulting state-action return distribution sample is $Z_\tau(x, a) \sim Z(x, a)$.
We propose to model the state-action quantile function as a mapping from state-actions and samples from some base distribution, typically $\tau \sim U([0, 1])$, to $Z_\tau(x, a)$, viewed as samples from the implicitly defined return distribution.
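Let $\beta : [0, 1] \to [0, 1]$ be a distortion risk measure, with identity corresponding to risk-neutrality. Then, the distorted expectation of $Z(x, a)$ under $\beta$ is given by
$$Q_\beta(x, a) := \mathbb{E}_{\tau \sim U([0, 1])}\left[Z_{\beta(\tau)}(x, a)\right].$$
Notice that the distorted expectation is equal to the expected value of $F_{Z(x, a)}^{-1}$ weighted by $\beta$, that is, $Q_\beta = \int_0^1 F_Z^{-1}(\tau) \, d\beta(\tau)$. The immediate implication of this is that for any $\beta$, there exists a sampling distribution for $\tau$ such that the mean of $Z_\tau$ is equal to the distorted expectation of $Z$ under $\beta$, that is, any distorted expectation can be represented as a weighted sum over the quantiles (Dhaene et al., 2012). Denote by $\pi_\beta$ the risk-sensitive greedy policy
$$\pi_\beta(x) = \arg\max_{a \in \mathcal{A}} Q_\beta(x, a). \tag{1}$$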
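For two samples $\tau, \tau' \sim U([0, 1])$, and policy $\pi_\beta$, the sampled temporal difference (TD) error at step $t$ is
$$\delta_t^{\tau, \tau'} = r_t + \gamma Z_{\tau'}(x_{t+1}, \pi_\beta(x_{t+1})) - Z_\tau(x_t, a_t). \tag{2}$$
Then, the IQN loss function is given by
$$\mathcal{L}(x_t, a_t, r_t, x_{t+1}) = \frac{1}{N'} \sum_{i=1}^{N} \sum_{j=1}^{N'} \rho_{\tau_i}^\kappa\left(\delta_t^{\tau_i, \tau_j'}\right), \tag{3}$$
where $N$ and $N'$ denote the respective number of iid samples $\tau_i, \tau_j' \sim U([0, 1])$ used to estimate the loss. A corresponding sample-based risk-sensitive policy is obtained by approximating $Q_\beta$ in Equation 1 by $K$ samples of $\tilde{\tau} \sim U([0, 1])$:
$$\tilde{\pi}_\beta(x) = \arg\max_{a \in \mathcal{A}} \frac{1}{K} \sum_{k=1}^{K} Z_{\beta(\tilde{\tau}_k)}(x, a).$$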
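The sampled loss and the sample-based policy can be sketched for a single transition as follows. This is a plain-Python illustration rather than a training loop: the quantile functions are passed in as ordinary callables (`z_current`, `z_next`, `z`), which stand in for the network and are hypothetical names, as are `iqn_loss` and `greedy_action`.

```python
import random


def quantile_huber(delta, tau, kappa=1.0):
    """Huber quantile regression loss for one TD-error."""
    if abs(delta) <= kappa:
        base = 0.5 * delta * delta
    else:
        base = kappa * (abs(delta) - 0.5 * kappa)
    indicator = 1.0 if delta < 0 else 0.0
    return abs(tau - indicator) * base / kappa


def iqn_loss(z_current, z_next, reward, gamma, n=8, n_prime=8, kappa=1.0, rng=random):
    """Sampled loss for one transition.

    z_current(tau) stands in for Z_tau(x_t, a_t); z_next(tau) stands in for
    Z_tau(x_{t+1}, a*) with a* already chosen by the policy.
    """
    taus = [rng.random() for _ in range(n)]
    taus_prime = [rng.random() for _ in range(n_prime)]
    total = 0.0
    for tau in taus:
        for tau_prime in taus_prime:
            delta = reward + gamma * z_next(tau_prime) - z_current(tau)
            total += quantile_huber(delta, tau, kappa)
    return total / n_prime


def greedy_action(z, actions, beta=lambda tau: tau, k=32, rng=random):
    """Pick argmax_a of the average of Z_{beta(tau_k)}(x, a) over K samples.

    The same K distorted samples are reused for every action.
    """
    taus = [beta(rng.random()) for _ in range(k)]
    return max(actions, key=lambda a: sum(z(t, a) for t in taus) / k)
```

Passing `beta=lambda tau: 0.1 * tau` restricts evaluation to the lowest tenth of each return distribution, so an action with a higher mean but a heavy lower tail can lose to a safer one.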
Implicit quantile networks differ from the approach of Dabney et al. (2018) in two ways. First, instead of approximating the quantile function at $n$ fixed values of $\tau$ we approximate it with $Z_\tau(x, a) \approx f(\psi(x), \phi(\tau))_a$ for some differentiable functions $f$, $\psi$, and $\phi$. If we ignore the distributional interpretation for a moment and view each $Z_\tau(x, a)$ as a separate action-value function, this highlights that implicit quantile networks are a type of universal value function approximator (UVFA) (Schaul et al., 2015). There may be additional benefits to implicit quantile networks beyond the obvious increase in representational fidelity. As with UVFAs, we might hope that training over many different $\tau$’s (goals in the case of the UVFA) leads to better generalization between values and improved sample complexity than attempting to train each separately.
Second, $\tau$, $\tau'$, and $\tilde{\tau}$ are sampled from continuous, independent, distributions. Besides $U([0, 1])$, we also explore risk-sensitive policies $\pi_\beta$, with non-linear $\beta$. The independent sampling of each $\tau$, $\tau'$ results in the sample TD errors being decorrelated, and the estimated action-values go from being the true mean of a mixture of $n$ Diracs to a sample mean of the implicit distribution defined by reparameterizing the sampling distribution via the learned quantile function.
3.1 Implementation
Consider the neural network structure used by the DQN agent (Mnih et al., 2015). Let $\psi : \mathcal{X} \to \mathbb{R}^d$ be the function computed by the convolutional layers and $f : \mathbb{R}^d \to \mathbb{R}^{|\mathcal{A}|}$ the subsequent fully-connected layers mapping $\psi(x)$ to the estimated action-values, such that $Q(x, a) \approx f(\psi(x))_a$. For our network we use the same functions $\psi$ and $f$ as in DQN, but include an additional function $\phi : [0, 1] \to \mathbb{R}^d$ computing an embedding for the sample point $\tau$. We combine these to form the approximation $Z_\tau(x, a) \approx f(\psi(x) \odot \phi(\tau))_a$, where $\odot$ denotes the element-wise (Hadamard) product.
As the network for $f$ is not particularly deep, we use the multiplicative form, $\psi \odot \phi$, to force interaction between the convolutional features and the sample embedding. Alternative functional forms, e.g. concatenation or a ‘residual’ function $\psi \odot (1 + \phi)$, are conceivable, and $\phi(\tau)$ can be parameterized in different ways. To investigate these, we compared performance across a number of architectural variants on six Atari 2600 games (Asterix, Assault, Breakout, Ms. Pacman, QBert, Space Invaders). Full results are given in the Appendix. Despite minor variation in performance, we found the general approach to be robust to the various choices. Based upon the results we used the following function in our later experiments, for embedding dimension $n = 64$:
$$\phi_j(\tau) := \text{ReLU}\left(\sum_{i=0}^{n-1} \cos(\pi i \tau) w_{ij} + b_j\right). \tag{4}$$
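The embedding and its combination with the convolutional features can be written out directly. The sketch below uses plain Python lists instead of tensors; `cosine_embedding` and `hadamard` are names chosen for this example, and the weight layout (`weights[i][j]`, one row per cosine feature) is an assumption.

```python
import math


def cosine_embedding(tau, weights, biases):
    """phi_j(tau) = ReLU(sum_i cos(pi * i * tau) * w_ij + b_j).

    weights has one row per cosine feature i and one column per output unit j.
    """
    n = len(weights)
    features = [math.cos(math.pi * i * tau) for i in range(n)]
    return [
        max(0.0, sum(features[i] * weights[i][j] for i in range(n)) + biases[j])
        for j in range(len(biases))
    ]


def hadamard(psi_x, phi_tau):
    """Element-wise product of state features and the sample embedding."""
    return [p * q for p, q in zip(psi_x, phi_tau)]


# Toy sizes: two cosine features, two output units, identity weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
phi = cosine_embedding(1.0, identity, [0.0, 0.0])
combined = hadamard([2.0, 3.0], phi)
```

The output dimension of the embedding has to match that of $\psi(x)$ for the element-wise product to be defined; in a deep learning framework the same computation would be a batched linear layer applied to the cosine features.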
After settling on a network architecture, we study the effect of the number of samples, $N$ and $N'$, used in the estimate terms of Equation 3.
We hypothesized that $N$, the number of samples of $\tau \sim U([0, 1])$, would affect the sample complexity of IQN, with larger values leading to faster learning, and that with $N = 1$ one would potentially approach the performance of DQN. This would support the hypothesis that the improved performance of many distributional RL algorithms rests on their effect as auxiliary loss functions, which would vanish in the case of $N = 1$. Furthermore, we believed that $N'$, the number of samples of $\tau'$, would affect the variance of the gradient estimates much like a mini-batch size hyperparameter. Our prediction was that $N'$ would have the greatest effect on variance of the long-term performance of the agent.
We used the same set of six games as before, with our chosen architecture, and varied $N, N' \in \{1, 8, 32, 64\}$. In Figure 2 we report the average human-normalized scores on the six games for each configuration. Figure 2 (left) shows the average performance over the first ten million frames, while (right) shows the average performance over the last ten million (from 190M to 200M).
As expected, we found that $N$ has a dramatic effect on early performance, shown by the continual improvement in score as the value increases. Additionally, we observed that $N'$ affected performance very differently than expected: it had a strong effect on early performance, but minimal impact on long-term performance past $N' = 8$.
Overall, while using more samples for both distributions is generally favorable, $N = N' = 8$ appears to be sufficient to achieve the majority of improvements offered by IQN for long-term performance, with variation past this point largely insignificant. To our surprise we found that even for $N = N' = 1$, which is comparable to DQN in the number of loss components, the longer term performance is still quite strong ($\approx 3\times$ DQN).
In an informal evaluation, we did not find IQN to be sensitive to $K$, the number of samples used for the policy, and have fixed it at $K = 32$ for all experiments.
4 RiskSensitive Reinforcement Learning
In this section, we explore the effects of varying the distortion risk measure, $\beta$, away from identity. This only affects the policy, $\pi_\beta$, used both in Equation 2 and for acting in the environment. As we have argued, evaluating under different distortion risk measures is equivalent to changing the sampling distribution for $\tau$, allowing us to achieve various forms of risk-sensitive policies. We focus on a handful of sampling distributions and their corresponding distortion measures. The first one is the cumulative probability weighting parameterization proposed in cumulative prospect theory (Tversky & Kahneman, 1992; Gonzalez & Wu, 1999):
$$\text{CPW}(\eta, \tau) = \frac{\tau^\eta}{\left(\tau^\eta + (1 - \tau)^\eta\right)^{1/\eta}}.$$
In particular, we use the parameter value $\eta = 0.71$ found by Wu & Gonzalez (1996) to most closely match human subjects. This choice is interesting as, unlike the others we consider, it is neither globally convex nor concave. For small values of $\tau$ it is locally concave and for larger values of $\tau$ it becomes locally convex. Recall that concavity corresponds to risk-averse and convexity to risk-seeking policies.
Second, we consider the distortion risk measure proposed by Wang (2000), where $\Phi$ and $\Phi^{-1}$ are taken to be the standard Normal cumulative distribution function and its inverse:
$$\text{Wang}(\eta, \tau) = \Phi\left(\Phi^{-1}(\tau) + \eta\right).$$
For $\eta < 0$, this produces risk-averse policies and we include it due to its simple interpretation and ability to switch between risk-averse and risk-seeking distortions.
Third, we consider a simple power formula for risk-averse ($\eta < 0$) or risk-seeking ($\eta > 0$) policies:
$$\text{Pow}(\eta, \tau) = \begin{cases} \tau^{\frac{1}{1 + |\eta|}}, & \text{if } \eta \geq 0 \\ 1 - (1 - \tau)^{\frac{1}{1 + |\eta|}}, & \text{otherwise.} \end{cases}$$
Finally, we consider conditional value-at-risk (CVaR):
$$\text{CVaR}(\eta, \tau) = \eta \tau.$$
CVaR has been widely studied in and out of reinforcement learning (Chow & Ghavamzadeh, 2014). Its implementation as a modification to the sampling distribution of $\tau$ is particularly simple, as it changes $\tau \sim U([0, 1])$ to $\tau \sim U([0, \eta])$. Another interesting sampling distribution, not included in our experiments, is denoted $\text{Norm}(\eta)$ and corresponds to $\tau$ sampled by averaging $\eta$ samples from $U([0, 1])$.
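Each of these distortion measures is a one-line transformation of a uniform sample. The sketch below uses the standard library's `statistics.NormalDist` for $\Phi$ and $\Phi^{-1}$; the function names (`cpw`, `wang`, `power`, `cvar`) are chosen for this example.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def cpw(eta, tau):
    """Cumulative probability weighting."""
    numerator = tau ** eta
    return numerator / (numerator + (1.0 - tau) ** eta) ** (1.0 / eta)


def wang(eta, tau):
    """Wang transform; inv_cdf requires 0 < tau < 1, so clip samples if needed."""
    return _STD_NORMAL.cdf(_STD_NORMAL.inv_cdf(tau) + eta)


def power(eta, tau):
    """Power formula: risk-averse for eta < 0, risk-seeking for eta > 0."""
    exponent = 1.0 / (1.0 + abs(eta))
    if eta >= 0:
        return tau ** exponent
    return 1.0 - (1.0 - tau) ** exponent


def cvar(eta, tau):
    """Conditional value-at-risk: maps U([0, 1]) samples to U([0, eta])."""
    return eta * tau
```

A distortion that maps $\tau$ below itself shifts sampling towards the lower quantiles of the return distribution and so behaves pessimistically; for example `wang(-1.0, 0.5)` is below 0.5 while `wang(1.0, 0.5)` is above it.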
In Figure 3 (right) we give an example of a distribution (Neutral) and how each of these distortion measures affects the implied distribution due to changing the sampling distribution of $\tau$. Norm and CPW reduce the impact of the tails of the distribution, while Wang and CVaR heavily shift the distribution mass towards the tails, creating a risk-averse or risk-seeking preference. Additionally, while CVaR entirely ignores all values corresponding to $\tau > \eta$, Wang gives these non-zero, but vanishingly small, probability.
By using these sampling distributions we can induce various risk-sensitive policies in IQN. We evaluate these on the same set of six Atari 2600 games previously used. Our algorithm simply changes the policy to maximize the distorted expectations instead of the usual sample mean. Figure 3 (left) shows our results in this experiment, with average scores reported under the usual, risk-neutral, evaluation criterion.
Intuitively, we expected to see a qualitative effect from risk-sensitive training, e.g. strengthened exploration from a risk-seeking objective. Although we did see qualitative differences, these did not always match our expectations. For two of the games, Asterix and Assault, there is a very significant advantage to the risk-averse policies. Although CPW(.71) tends to perform almost identically to the standard risk-neutral policy, and the risk-seeking Wang(1.5) performs as well or worse than risk-neutral, we find that both risk-averse policies improve performance over standard IQN. However, we also observe that the more risk-averse of the two, CVaR(0.1), suffers some loss in performance on two other games (QBert and Space Invaders).
Additionally, we note that the risk-seeking policy significantly underperforms the risk-neutral policy on three of the six games. It remains an open question as to exactly why we see improved performance for risk-averse policies. There are many possible explanations for this phenomenon, e.g. that risk-aversion encodes a heuristic to stay alive longer, which in many games is correlated with increased rewards.
5 Full Atari57 Results
Finally, we evaluate IQN on the full Atari-57 benchmark, comparing with the state-of-the-art performance of Rainbow, a distributional RL agent that combines several advances in deep RL (Hessel et al., 2018), the closely related algorithm QR-DQN (Dabney et al., 2018), prioritized experience replay DQN (Schaul et al., 2016), and the original DQN agent (Mnih et al., 2015). Note that in this section we use the risk-neutral variant of the IQN, that is, the policy of the IQN agent is the regular $\epsilon$-greedy policy with respect to the mean of the state-action return distribution.
It is important to remember that Rainbow builds upon the distributional RL algorithm C51 (Bellemare et al., 2017), but also includes prioritized experience replay (Schaul et al., 2016), Double DQN (van Hasselt et al., 2016), Dueling Network architecture (Wang et al., 2016), Noisy Networks (Fortunato et al., 2017), and multi-step updates (Sutton, 1988). In particular, besides the distributional update, $n$-step updates and prioritized experience replay were found to have significant impact on the performance of Rainbow. Our other competitive baseline is QR-DQN, which is currently state-of-the-art for agents that do not combine distributional updates, $n$-step updates, and prioritized replay.
Thus, between QR-DQN and the much more complex Rainbow we compare to the two most closely related, and best performing, agents in published work. In particular, we would expect that IQN would benefit from the additional enhancements in Rainbow, just as Rainbow improved significantly over C51.
Figure 4 shows the mean (left) and median (right) human-normalized scores during training over the Atari-57 benchmark. IQN dramatically improves over QR-DQN, which itself improves on many previously published results. At 100 million frames IQN has reached the same level of performance as QR-DQN at 200 million frames. Table 1 gives a comparison between the same methods in terms of their best, human-normalized, scores per game under the 30 random no-op start condition. These are averages over the given number of seeds. Additionally, using human-starts, IQN achieves a 162% median human-normalized score, whereas Rainbow reaches 153% (Hessel et al., 2018), see Table 2.
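Table 1: Mean and median human-normalized scores on Atari-57 (30 random no-op starts), the human-gap metric, and the number of seeds averaged over.
Agent  |  Mean  |  Median  |  Human Gap  |  Seeds
DQN  |  228%  |  79%  |  0.334  |  1
Prior.  |  434%  |  124%  |  0.178  |  1
C51  |  701%  |  178%  |  0.152  |  1
Rainbow  |  1189%  |  230%  |  0.144  |  2
QR-DQN  |  864%  |  193%  |  0.165  |  3
IQN  |  1019%  |  218%  |  0.141  |  5
Table 2: Median human-normalized scores under human-starts.
DQN  |  Prior.  |  A3C  |  C51  |  Rainbow  |  IQN
68%  |  128%  |  116%  |  125%  |  153%  |  162%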
Finally, we took a closer look at the games in which each algorithm continues to underperform humans, and computed, on average, how far below human-level they perform (details of how this is computed can be found in the Appendix). We refer to this value as the human-gap metric (thanks to Joseph Modayil for proposing this metric) and give results in Table 1. Interestingly, C51 outperforms QR-DQN in this metric, and IQN outperforms all others. This shows that the remaining gap between Rainbow and IQN is entirely from games on which both algorithms are already super-human. The games where the most progress in RL is needed happen to be the games where IQN shows the greatest improvement over QR-DQN and Rainbow.
6 Discussion and Conclusions
We have proposed a generalization of recent work based around using quantile regression to learn the distribution over returns of the current policy. Our generalization leads to a simple change to the DQN agent to enable distributional RL, the natural integration of risk-sensitive policies, and significantly improved performance over existing methods. The IQN algorithm provides, for the first time, a fully integrated distributional RL agent without prior assumptions on the parameterization of the return distribution.
IQN can be trained with as little as a single sample from each state-action value distribution, or as many as computational limits allow to improve the algorithm’s data efficiency. Furthermore, IQN allows us to expand the class of control policies to a large class of risk-sensitive policies connected to distortion risk measures. Finally, we show substantial gains on the Atari-57 benchmark over QR-DQN, even halving the distance between QR-DQN and Rainbow.
Despite the significant empirical successes in this paper there are many areas in need of additional theoretical analysis. We highlight a few particularly relevant open questions we were unable to address in the present work. First, sample-based convergence results have been recently shown for a class of categorical distributional RL algorithms (Rowland et al., 2018). Could existing sample-based RL convergence results be extended to the QR-based algorithms?
Second, can the contraction mapping results for a fixed grid of quantiles given by Dabney et al. (2018) be extended to the more general class of approximate quantile functions studied in this work? Finally, and particularly salient to our experiments with distortion risk measures, theoretical guarantees for risk-sensitive RL have been building over recent years, but have been largely limited to special cases and restricted classes of risk-sensitive policies. Can the convergence of the distribution of returns under the Bellman operator be leveraged to show convergence to a fixed-point in distorted expectations? In particular, can the control results of Bellemare et al. (2017) be expanded to cover some class of risk-sensitive policies?
There remain many intriguing directions for future research into distributional RL, even on purely empirical fronts. Hessel et al. (2018) recently showed that distributional RL agents can be significantly improved when combined with other techniques. Creating a Rainbow-IQN agent could yield even greater improvements on Atari-57. We also recall the surprisingly rich return distributions found by Barth-Maron et al. (2018), and hypothesize that the continuous control setting may be a particularly fruitful area for the application of distributional RL in general, and IQN in particular.
References
 Allais (1990) Allais, M. Allais paradox. In Utility and Probability, pp. 3–9. Springer, 1990.
 Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein Generative Adversarial Networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.
 Azar et al. (2012) Azar, M. G., Munos, R., and Kappen, H. J. On the sample complexity of reinforcement learning with a generative model. In Proceedings of the International Conference on Machine Learning (ICML), 2012.
 Barth-Maron et al. (2018) Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., TB, D., Muldal, A., Heess, N., and Lillicrap, T. Distributional policy gradients. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.

 Bellemare et al. (2013) Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The Arcade Learning Environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
 Bellemare et al. (2017) Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.
 Bellman (1957) Bellman, R. E. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.
 Bousquet et al. (2017) Bousquet, O., Gelly, S., Tolstikhin, I., Simon-Gabriel, C.-J., and Schoelkopf, B. From optimal transport to generative modeling: the VEGAN cookbook. arXiv preprint arXiv:1705.07642, 2017.
 Chow & Ghavamzadeh (2014) Chow, Y. and Ghavamzadeh, M. Algorithms for CVaR optimization in MDPs. In Advances in Neural Information Processing Systems, pp. 3509–3517, 2014.
 Dabney et al. (2018) Dabney, W., Rowland, M., Bellemare, M. G., and Munos, R. Distributional reinforcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
 Dhaene et al. (2012) Dhaene, J., Kukush, A., Linders, D., and Tang, Q. Remarks on quantiles and distortion risk measures. European Actuarial Journal, 2(2):319–328, 2012.
 Fortunato et al. (2017) Fortunato, M., Azar, M. G., Piot, B., Menick, J., Osband, I., Graves, A., Mnih, V., Munos, R., Hassabis, D., Pietquin, O., et al. Noisy networks for exploration. arXiv preprint arXiv:1706.10295, 2017.
 Geist & Pietquin (2010) Geist, M. and Pietquin, O. Kalman temporal differences. Journal of Artificial Intelligence Research, 39:483–532, 2010.
 Gonzalez & Wu (1999) Gonzalez, R. and Wu, G. On the shape of the probability weighting function. Cognitive Psychology, 38(1):129–166, 1999.
 Gruslys et al. (2018) Gruslys, A., Dabney, W., Azar, M. G., Piot, B., Bellemare, M. G., and Munos, R. The Reactor: a fast and sample-efficient actor-critic agent for reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.
 Hessel et al. (2018) Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. Rainbow: combining improvements in deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
 Howard & Matheson (1972) Howard, R. A. and Matheson, J. E. Risk-sensitive Markov decision processes. Management Science, 18(7):356–369, 1972.
 Huber (1964) Huber, P. J. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
 Jaquette (1973) Jaquette, S. C. Markov decision processes with a new optimality criterion: discrete time. The Annals of Statistics, 1(3):496–505, 1973.
 Koenker (2005) Koenker, R. Quantile Regression. Cambridge University Press, 2005.
 Lattimore & Hutter (2012) Lattimore, T. and Hutter, M. PAC bounds for discounted MDPs. In International Conference on Algorithmic Learning Theory, pp. 320–334. Springer, 2012.
 Maddison et al. (2017) Maddison, C. J., Lawson, D., Tucker, G., Heess, N., Doucet, A., Mnih, A., and Teh, Y. W. Particle value functions. arXiv preprint arXiv:1703.05820, 2017.
 Majumdar & Pavone (2017) Majumdar, A. and Pavone, M. How should a robot assess risk? Towards an axiomatic theory of risk in robotics. arXiv preprint arXiv:1710.11040, 2017.
 Marcus et al. (1997) Marcus, S. I., Fernández-Gaucherand, E., Hernández-Hernandez, D., Coraluppi, S., and Fard, P. Risk sensitive Markov decision processes. In Systems and Control in the Twenty-First Century, pp. 263–279. Springer, 1997.
 Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 Moerland et al. (2017) Moerland, T. M., Broekens, J., and Jonker, C. M. Efficient exploration with double uncertain value networks. arXiv preprint arXiv:1711.10789, 2017.
 Morimura et al. (2010a) Morimura, T., Hachiya, H., Sugiyama, M., Tanaka, T., and Kashima, H. Parametric return density estimation for reinforcement learning. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2010a.
 Morimura et al. (2010b) Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., and Tanaka, T. Nonparametric return distribution approximation for reinforcement learning. In Proceedings of the 27th International Conference on Machine Learning (ICML), pp. 799–806, 2010b.
 Müller (1997) Müller, A. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.

 Nair et al. (2015) Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., De Maria, A., Panneershelvam, V., Suleyman, M., Beattie, C., Petersen, S., et al. Massively parallel methods for deep reinforcement learning. In ICML Workshop on Deep Learning, 2015.
 Osband et al. (2013) Osband, I., Russo, D., and Van Roy, B. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pp. 3003–3011, 2013.
 Puterman (1994) Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.
 Rowland et al. (2018) Rowland, M., Bellemare, M. G., Dabney, W., Munos, R., and Teh, Y. W. An analysis of categorical distributional reinforcement learning. In AISTATS, 2018.
 Schaul et al. (2015) Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal value function approximators. In International Conference on Machine Learning, pp. 1312–1320, 2015.
 Schaul et al. (2016) Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.
 Sobel (1982) Sobel, M. J. The variance of discounted Markov decision processes. Journal of Applied Probability, 19(04):794–802, 1982.
 Sutton (1988) Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.
 Tolstikhin et al. (2017) Tolstikhin, I., Bousquet, O., Gelly, S., and Schoelkopf, B. Wasserstein autoencoders. arXiv preprint arXiv:1711.01558, 2017.
 Tversky & Kahneman (1992) Tversky, A. and Kahneman, D. Advances in prospect theory: cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5(4):297–323, 1992.
 van Hasselt et al. (2016) van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016.
 von Neumann & Morgenstern (1947) von Neumann, J. and Morgenstern, O. Theory of Games and Economic Behavior. Princeton University Press, 1947.
 Wang (1996) Wang, S. Premium calculation by transforming the layer premium density. ASTIN Bulletin: The Journal of the IAA, 26(1):71–92, 1996.
 Wang (2000) Wang, S. S. A class of distortion operators for pricing financial and insurance risks. Journal of Risk and Insurance, pp. 15–36, 2000.
 Wang et al. (2016) Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and de Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), 2016.
 Watkins (1989) Watkins, C. J. C. H. Learning from delayed rewards. PhD thesis, King’s College, Cambridge, 1989.
 White (1988) White, D. J. Mean, variance, and probabilistic criteria in finite Markov decision processes: a review. Journal of Optimization Theory and Applications, 56(1):1–29, 1988.
 Wu & Gonzalez (1996) Wu, G. and Gonzalez, R. Curvature of the probability weighting function. Management Science, 42(12):1676–1690, 1996.
 Yaari (1987) Yaari, M. E. The dual theory of choice under risk. Econometrica: Journal of the Econometric Society, pp. 95–115, 1987.
 Yu et al. (2016) Yu, K.-T., Bauza, M., Fazeli, N., and Rodriguez, A. More than a million ways to be pushed. A high-fidelity experimental dataset of planar pushing. In Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on, pp. 30–37. IEEE, 2016.
Appendix
Architecture and Hyperparameters
We considered multiple architectural variants for parameterizing an IQN. All of these build on the Q-network of a regular DQN (Mnih et al., 2015), which can be seen as the composition of a convolutional stack $\psi : \mathcal{X} \to \mathbb{R}^d$ and an MLP $f : \mathbb{R}^d \to \mathbb{R}^{|\mathcal{A}|}$, and extend it by an embedding of the sample point, $\phi : [0, 1] \to \mathbb{R}^d$, and a merging function $m : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$, resulting in the function
$$\mathrm{IQN}(x, \tau) = f\big(m(\psi(x), \phi(\tau))\big).$$
For the embedding $\phi$, we considered a number of variants: a learned linear embedding, a learned MLP embedding with a single hidden layer of size $n$, and a learned linear function of $n$ cosine basis functions of the form $\cos(\pi i \tau)$ for integer $i$. Each of those was followed by either a ReLU or sigmoid nonlinearity.
For the merging function $m$, the simplest choice would be a vector concatenation of $\psi(x)$ and $\phi(\tau)$. Note, however, that the MLP $f$, which takes in the output of $m$ and outputs the action-value quantiles, only has a single hidden layer in the DQN network. Therefore, to force a sufficiently early interaction between the two representations, we also considered a multiplicative function $m(\psi, \phi) = \psi \odot \phi$, where $\odot$ denotes the element-wise (Hadamard) product of two vectors, as well as a ‘residual’ function $m(\psi, \phi) = \psi \odot (1 + \phi)$.
Early experiments showed that a simple linear embedding of $\tau$ was insufficient to achieve good performance, and the residual version of $m$ didn’t show any marked difference to the multiplicative variant, so we do not include results for these here. For the other configurations, Figure 5 shows pairwise comparisons between 1) a cosine basis function embedding and a completely learned MLP embedding, 2) an embedding size (hidden layer size or number of cosine basis elements) of 32 and 64, 3) ReLU and sigmoid nonlinearity following the embedding, and 4) concatenation and a multiplicative interaction between $\psi(x)$ and $\phi(\tau)$.
Each comparison ‘violin plot’ can be understood as a marginalization over the other variants of the architecture, with the human-normalized performance at the end of training, averaged across six Atari 2600 games, on the y-axis. Each white dot corresponds to a configuration (each represented by two seeds); the black dots show the position of our preferred configuration. The width of the colored regions corresponds to a kernel density estimate of the number of configurations at each performance level.
Our final choice is a multiplicative interaction with a linear function of a cosine embedding, with $n = 64$ and a ReLU nonlinearity (see Equation 4), as this configuration yielded the highest performance consistently over multiple seeds. Also noteworthy is the overall robustness of the approach to these variations: most of the configurations consistently outperform the QR-DQN baseline, shown as a grey horizontal line for comparison.
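As a rough illustration of the final configuration described above (a cosine embedding of the sample point followed by a linear layer and ReLU, merged multiplicatively with the state features), the following NumPy sketch may help. The function names, the feature width `d = 8`, the random weights, and the choice of basis indices 0 to n − 1 are all assumptions made for illustration, and the convolutional stack is replaced by a random feature vector.

```python
import numpy as np

def cosine_embedding(tau, weights, bias):
    """Embed sample points tau in [0, 1] using n cosine basis functions.

    tau:     shape (num_samples,)
    weights: shape (n, d), learned in a real agent (random here)
    bias:    shape (d,)
    returns: shape (num_samples, d)
    """
    n = weights.shape[0]
    i = np.arange(n)  # assumed index range 0..n-1
    basis = np.cos(np.pi * i[None, :] * tau[:, None])  # (num_samples, n)
    return np.maximum(basis @ weights + bias, 0.0)     # linear map, then ReLU

def merge_multiplicative(psi_x, phi_tau):
    """Hadamard product of state features (d,) with each embedding row."""
    return psi_x[None, :] * phi_tau

rng = np.random.default_rng(0)
d, n, num_samples = 8, 64, 4
psi_x = rng.normal(size=d)                 # stand-in for convolutional features
tau = rng.uniform(size=num_samples)        # sample points drawn from U([0, 1])
weights = rng.normal(size=(n, d)) / np.sqrt(n)
bias = np.zeros(d)

phi_tau = cosine_embedding(tau, weights, bias)
merged = merge_multiplicative(psi_x, phi_tau)  # would be fed to the MLP head
```

Each row of `merged` corresponds to one sample point, so a single forward pass of the convolutional stack can be reused across all sampled quantile levels.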
We give pseudocode for the IQN loss in Algorithm 1. All other hyperparameters for this agent correspond to the ones used by Dabney et al. (2018). In particular, the Bellman target is computed using a target network. Notice that IQN will generally be more computationally expensive per sample than QR-DQN. However, in practice IQN requires far fewer samples per update than QR-DQN, so the actual running times are comparable.
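Algorithm 1 is not reproduced here, so the following is only a sketch of a quantile Huber loss of the kind used in quantile-regression-based agents, not the authors' pseudocode. It assumes the loss sums over the N current sample points and averages over the N' target sample points, with Huber threshold `kappa = 1`. All function and argument names are hypothetical, and in an actual agent the target quantiles would come from a target network.

```python
import numpy as np

def huber(delta, kappa=1.0):
    """Huber loss: quadratic within [-kappa, kappa], linear outside."""
    abs_delta = np.abs(delta)
    return np.where(abs_delta <= kappa,
                    0.5 * delta ** 2,
                    kappa * (abs_delta - 0.5 * kappa))

def quantile_huber_loss(current_quantiles, target_quantiles, taus, kappa=1.0):
    """Quantile Huber loss for one transition.

    current_quantiles: shape (N,), estimates at the sample points `taus`
    target_quantiles:  shape (N',), Bellman targets r + gamma * Z(x', a*)
    taus:              shape (N,), sample points in [0, 1]
    """
    # Pairwise TD errors: delta[i, j] = target_j - current_i
    delta = target_quantiles[None, :] - current_quantiles[:, None]
    # Asymmetric weight |tau_i - 1{delta < 0}|
    weight = np.abs(taus[:, None] - (delta < 0).astype(float))
    rho = weight * huber(delta, kappa) / kappa
    # Average over target samples, sum over current samples
    return rho.mean(axis=1).sum()

# Illustrative usage with made-up numbers
taus = np.array([0.1, 0.5, 0.9])
current = np.array([-1.0, 0.0, 1.0])
next_quantiles = np.array([0.0, 0.5])  # would come from a target network
reward, gamma = 1.0, 0.99
targets = reward + gamma * next_quantiles
loss = quantile_huber_loss(current, targets, taus)
```

The asymmetric weight is what distinguishes this from an ordinary Huber regression: underestimates are penalized in proportion to tau and overestimates in proportion to 1 − tau.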
Evaluation
The human-normalized scores reported in this paper are given by the formula (van Hasselt et al., 2016; Dabney et al., 2018)
$$\text{score} = \frac{\text{agent} - \text{random}}{\text{human} - \text{random}},$$
where $\text{agent}$, $\text{human}$, and $\text{random}$ are the per-game raw scores (undiscounted returns) for the given agent, a reference human player, and a random agent baseline (Mnih et al., 2015).
The ‘human-gap’ metric referred to at the end of Section 5 builds on the human-normalized score, but emphasizes the remaining improvement for the agent to reach super-human performance. It is given by $\text{gap} = \max(1 - \text{score}, 0)$, with a value of $1$ corresponding to random play, and a value of $0$ corresponding to a super-human level of performance. To avoid degeneracies in the case of $\text{human} < \text{random}$, the quantity $(\text{human} - \text{random})$ is clipped above at $0$.
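The two evaluation metrics above can be sketched in a few lines of Python. This is a minimal illustration, assuming the gap is one minus the human-normalized score floored at zero, so that random play gives 1 and human-level or better gives 0. The function names are hypothetical, the clipping of the denominator is deliberately not implemented, and the example values are the Alien and Assault rows of the score table.

```python
def human_normalized_score(agent, human, random_score):
    """(agent - random) / (human - random) for one game's raw scores."""
    return (agent - random_score) / (human - random_score)

def human_gap(agent, human, random_score):
    """Remaining distance to human-level play, floored at zero (assumed form)."""
    return max(1.0 - human_normalized_score(agent, human, random_score), 0.0)

# Alien: random 227.8, human 7,127.7, IQN 7,022
alien_score = human_normalized_score(7022.0, 7127.7, 227.8)
alien_gap = human_gap(7022.0, 7127.7, 227.8)

# Assault: random 222.4, human 742.0, IQN 29,091 (well above human level)
assault_gap = human_gap(29091.0, 742.0, 222.4)
```

On these rows the agent sits just below human level on Alien, leaving a small positive gap, while on Assault the gap is zero because the score exceeds the human reference.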
| Game | Random | Human | DQN | Prior. Duel. | QR-DQN | IQN |
|---|---|---|---|---|---|---|
| Alien | 227.8 | 7,127.7 | 1,620.0 | 3,941.0 | 4,871 | 7,022 |
| Amidar | 5.8 | 1,719.5 | 978.0 | 2,296.8 | 1,641 | 2,946 |
| Assault | 222.4 | 742.0 | 4,280.4 | 11,477.0 | 22,012 | 29,091 |
| Asterix | 210.0 | 8,503.3 | 4,359.0 | 375,080.0 | 261,025 | 342,016 |
| Asteroids | 719.1 | 47,388.7 | 1,364.5 | 1,192.7 | 4,226 | 2,898 |
| Atlantis | 12,850.0 | 29,028.1 | 279,987.0 | 395,762.0 | 971,850 | 978,200 |
| Bank Heist | 14.2 | 753.1 | 455.0 | 1,503.1 | 1,249 | 1,416 |
| Battle Zone | 2,360.0 | 37,187.5 | 29,900.0 | 35,520.0 | 39,268 | 42,244 |
| Beam Rider | 363.9 | 16,926.5 | 8,627.5 | 30,276.5 | 34,821 | 42,776 |
| Berzerk | 123.7 | 2,630.4 | 585.6 | 3,409.0 | 3,117 | 1,053 |
| Bowling | 23.1 | 160.7 | 50.4 | 46.7 | 77.2 | 86.5 |
| Boxing | 0.1 | 12.1 | 88.0 | 98.9 | 99.9 | 99.8 |
| Breakout | 1.7 | 30.5 | 385.5 | 366.0 | 742 | 734 |
| Centipede | 2,090.9 | 12,017.0 | 4,657.7 | 7,687.5 | 12,447 | 11,561 |
| Chopper Command | 811.0 | 7,387.8 | 6,126.0 | 13,185.0 | 14,667 | 16,836 |
| Crazy Climber | 10,780.5 | 35,829.4 | 110,763.0 | 162,224.0 | 161,196 | 179,082 |
| Defender | 2,874.5 | 18,688.9 | 23,633.0 | 41,324.5 | 47,887 | 53,537 |
| Demon Attack | 152.1 | 1,971.0 | 12,149.4 | 72,878.6 | 121,551 | 128,580 |
| Double Dunk | 18.6 | 16.4 | 6.6 | 12.5 | 21.9 | 5.6 |
| Enduro | 0.0 | 860.5 | 729.0 | 2,306.4 | 2,355 | 2,359 |
| Fishing Derby | 91.7 | 38.7 | 4.9 | 41.3 | 39.0 | 33.8 |
| Freeway | 0.0 | 29.6 | 30.8 | 33.0 | 34.0 | 34.0 |
| Frostbite | 65.2 | 4,334.7 | 797.4 | 7,413.0 | 4,384 | 4,324 |
| Gopher | 257.6 | 2,412.5 | 8,777.4 | 104,368.2 | 113,585 | 118,365 |
| Gravitar | 173.0 | 3,351.4 | 473.0 | 238.0 | 995 | 911 |
| H.E.R.O. | 1,027.0 | 30,826.4 | 20,437.8 | 21,036.5 | 21,395 | 28,386 |
| Ice Hockey | 11.2 | 0.9 | 1.9 | 0.4 | 1.7 | 0.2 |
| James Bond | 29.0 | 302.8 | 768.5 | 812.0 | 4,703 | 35,108 |
| Kangaroo | 52.0 | 3,035.0 | 7,259.0 | 1,792.0 | 15,356 | 15,487 |
| Krull | 1,598.0 | 2,665.5 | 8,422.3 | 10,374.4 | 11,447 | 10,707 |
| Kung-Fu Master | 258.5 | 22,736.3 | 26,059.0 | 48,375.0 | 76,642 | 73,512 |
| Montezuma’s Revenge | 0.0 | 4,753.3 | 0.0 | 0.0 | 0.0 | 0.0 |
| Ms. Pac-Man | 307.3 | 6,951.6 | 3,085.6 | 3,327.3 | 5,821 | 6,349 |
| Name This Game | 2,292.3 | 8,049.0 | 8,207.8 | 15,572.5 | 21,890 | 22,682 |
| Phoenix | 761.4 | 7,242.6 | 8,485.2 | 70,324.3 | 16,585 | 56,599 |
| Pitfall! | 229.4 | 6,463.7 | 286.1 | 0.0 | 0.0 | 0.0 |
| Pong | 20.7 | 14.6 | 19.5 | 20.9 | 21.0 | 21.0 |
| Private Eye | 24.9 | 69,571.3 | 146.7 | 206.0 | 350 | 200 |
| Q*Bert | 163.9 | 13,455.0 | 13,117.3 | 18,760.3 | 572,510 | 25,750 |
| River Raid | 1,338.5 | 17,118.0 | 7,377.6 | 20,607.6 | 17,571 | 17,765 |
| Road Runner | 11.5 | 7,845.0 | 39,544.0 | 62,151.0 | 64,262 | 57,900 |
| Robotank | 2.2 | 11.9 | 63.9 | 27.5 | 59.4 | 62.5 |
| Seaquest | 68.4 | 42,054.7 | 5,860.6 | 931.6 | 8,268 | 30,140 |
| Skiing | 17,098.1 | 4,336.9 | 13,062.3 | 19,949.9 | 9,324 | 9,289 |
| Solaris | 1,236.3 | 12,326.7 | 3,482.8 | 133.4 | 6,740 | 8,007 |
| Space Invaders | 148.0 | 1,668.7 | 1,692.3 | 15,311.5 | 20,972 | 28,888 |
| Star Gunner | 664.0 | 10,250.0 | 54,282.0 | 125,117.0 | 77,495 | 74,677 |
| Surround | 10.0 | 6.5 | 5.6 | 1.2 | 8.2 | 9.4 |
| Tennis | 23.8 | 8.3 | 12.2 | 0.0 | 23.6 | 23.6 |
| Time Pilot | 3,568.0 | 5,229.2 | 4,870.0 | 7,553.0 | 10,345 | 12,236 |
| Tutankham | 11.4 | 167.6 | 68.1 | 245.9 | 297 | 293 |
| Up and Down | 533.4 | 11,693.2 | 9,989.9 | 33,879.1 | 71,260 | 88,148 |
| Venture | 0.0 | 1,187.5 | 163.0 | 48.0 | 43.9 | 1,318 |
| Video Pinball | 16,256.9 | 17,667.9 | 196,760.4 | 479,197.0 | 705,662 | 698,045 |
| Wizard Of Wor | 563.5 | 4,756.5 | 2,704.0 | 12,352.0 | 25,061 | 31,190 |
| Yars’ Revenge | 3,092.9 | 54,576.9 | 18,098.9 | 69,618.1 | 26,447 | 28,379 |
| Zaxxon | 32.5 | 9,173.3 | 5,363.0 | 13,886.0 | 13,112 | 21,772 |