1 Introduction
In reinforcement learning (RL), a central notion is the return, the sum of discounted rewards. Typically, the average of these returns is estimated by a value function and used for policy improvement. Recently, however, approaches that attempt to learn the distribution of the return have been shown to be surprisingly effective (Morimura et al., 2010a, b; Bellemare et al., 2017; Dabney et al., 2017, 2018; Gruslys et al., 2018); we refer to the general approach of learning return distributions as distributional RL (DRL).
Despite impressive experimental performance (Bellemare et al., 2017; Barth-Maron et al., 2018; Dabney et al., 2018) and fundamental theoretical results (Rowland et al., 2018; Qu et al., 2018), it remains challenging to develop and analyse DRL algorithms. In this paper, we propose to address these challenges by phrasing DRL algorithms in terms of recursive estimation of sets of statistics of the return distribution. We observe that DRL algorithms can be viewed as combining a statistical estimator with a procedure we refer to as an imputation strategy, which generates a return distribution consistent with the set of statistical estimates. This highly general approach (see Figure 1) requires a precise treatment of the differing roles of statistics and samples in distributional RL.
Using this framework we are able to provide new theoretical results for existing DRL algorithms as well as demonstrate the derivation of a new algorithm based on the expectiles of the return distribution. More importantly, our novel approach immediately applies to a large class of statistics and imputation strategies, suggesting several avenues for future research. Specifically, we are able to provide answers to the following questions:

(i) Can we describe existing DRL algorithms in a unifying framework, and could such a framework be used to develop new algorithms?

(ii) What return distribution statistics can be learnt exactly through Bellman updates?

(iii) If certain statistics cannot be learnt exactly, how can we estimate them in a principled manner, and give guarantees on their approximation error relative to the true values of these statistics?

After reviewing relevant background material, we begin with (i) by presenting a new framework for understanding DRL in terms of a set of statistics to be learnt and an imputation strategy for specifying a dynamic programming update. We then formalise (ii) by introducing the notion of Bellman closedness for collections of statistics, and show that in a wide class of statistics, the only properties of return distributions that can be learnt exactly through Bellman updates are moments. Interestingly, this rules out statistics such as quantiles that have formed the basis of successful existing DRL algorithms. However, we then address (iii) by showing that the framework allows us to give guarantees on the approximation error introduced in learning these statistics, through the notion of approximate Bellman closedness. We apply the framework developed in answering these questions to the case of expectile statistics to develop a new distributional RL algorithm, which we term Expectile Distributional RL (EDRL). Finally, we test these new insights on a variety of MDPs and larger-scale environments to illustrate and expand on the theoretical contributions developed earlier in the paper.
2 Background
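Consider a Markov decision process $(\mathcal{X}, \mathcal{A}, p, \gamma, \mathcal{R})$ with finite state space $\mathcal{X}$, finite action space $\mathcal{A}$, transition kernel $p : \mathcal{X} \times \mathcal{A} \to \mathscr{P}(\mathcal{X})$, discount rate $\gamma \in [0, 1)$, and reward distributions $\mathcal{R}(x, a) \in \mathscr{P}(\mathbb{R})$ for each $(x, a) \in \mathcal{X} \times \mathcal{A}$. Thus, if an agent is at state $x_t \in \mathcal{X}$ at time $t \in \mathbb{N}_0$, and an action $a_t \in \mathcal{A}$ is taken, the agent transitions to a state $x_{t+1} \sim p(\cdot \mid x_t, a_t)$ and receives a reward $r_t \sim \mathcal{R}(x_t, a_t)$. We now briefly review two principal goals in reinforcement learning.
Firstly, given a Markov policy $\pi : \mathcal{X} \to \mathscr{P}(\mathcal{A})$, evaluation of $\pi$ consists of computing the expected returns $Q^\pi(x, a) = \mathbb{E}_\pi[\sum_{t=0}^\infty \gamma^t R_t \mid X_0 = x, A_0 = a]$, where $\mathbb{E}_\pi$ indicates that at each time step $t \geq 1$, the agent's action $A_t$ is sampled from $\pi(\cdot \mid X_t)$. Secondly, the task of control consists of finding a policy $\pi$ for which the expected returns are maximised.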
2.1 Bellman equations
The classical Bellman equation (Bellman, 1957) relates expected returns at each state-action pair $(x, a)$ to the expected returns at possible next states in the MDP by:
(1)  $Q^\pi(x, a) = \mathbb{E}_\pi[R_0 + \gamma Q^\pi(X_1, A_1) \mid X_0 = x, A_0 = a]$
This gives rise to the following fixed-point iteration scheme
(2)  $Q(x, a) \leftarrow \mathbb{E}_\pi[R_0 + \gamma Q(X_1, A_1) \mid X_0 = x, A_0 = a]$
for updating a collection of approximations $(Q(x, a) : (x, a) \in \mathcal{X} \times \mathcal{A})$ towards their true values. This fundamental algorithm, together with techniques from approximate dynamic programming and stochastic approximation, allows expected returns in an MDP to be learnt and improved upon, forming the basis of all value-based RL (Sutton & Barto, 2018).
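As a concrete illustration of the fixed-point iteration in Expression (2), the following minimal sketch evaluates a policy on a tabular MDP. The dictionary-based representation, the function name `evaluate_policy`, and the two-state example are assumptions made for illustration; they do not come from the paper.

```python
def evaluate_policy(transitions, rewards, policy, gamma, num_iters=200):
    """Iterate the Bellman backup of Expression (2) on a tabular MDP.

    transitions[x][a] maps next states to probabilities, rewards[x][a] is the
    expected immediate reward, and policy[x] maps actions to probabilities.
    """
    q = {(x, a): 0.0 for x in transitions for a in transitions[x]}
    for _ in range(num_iters):
        # Synchronous update: the right-hand side only reads the old table.
        q = {
            (x, a): rewards[x][a] + gamma * sum(
                p_next * sum(p_act * q[(y, b)] for b, p_act in policy[y].items())
                for y, p_next in transitions[x][a].items()
            )
            for (x, a) in q
        }
    return q


# Hypothetical two-state example: state 0 moves to state 1, which loops on
# itself and pays reward 1 at every step.
transitions = {0: {"go": {1: 1.0}}, 1: {"go": {1: 1.0}}}
rewards = {0: {"go": 0.0}, 1: {"go": 1.0}}
policy = {0: {"go": 1.0}, 1: {"go": 1.0}}
q = evaluate_policy(transitions, rewards, policy, gamma=0.5)
print(q)
```

With a discount of 0.5 the looping state has value 1 / (1 - 0.5) = 2 and the start state has value 0.5 × 2 = 1, which the iteration approaches geometrically.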
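The distributional Bellman equation describes a similar relationship to Equation (1) at the level of probability distributions (Morimura et al., 2010a, b; Bellemare et al., 2017). Letting $\eta_\pi(x, a) \in \mathscr{P}(\mathbb{R})$ be the distribution of the random return $\sum_{t=0}^\infty \gamma^t R_t \mid X_0 = x, A_0 = a$ when actions are selected according to $\pi$, we have
(3)  $\eta_\pi(x, a) = \mathbb{E}_\pi[(f_{R_0, \gamma})_\# \eta_\pi(X_1, A_1) \mid X_0 = x, A_0 = a]$
where the expectation gives a mixture distribution over next-states, $f_{r, \gamma} : \mathbb{R} \to \mathbb{R}$ is defined by $f_{r, \gamma}(x) = r + \gamma x$, and $g_\# \mu$ is the pushforward of the measure $\mu$ through the function $g$, so that for all Borel subsets $A \subseteq \mathbb{R}$, we have $g_\# \mu(A) = \mu(g^{-1}(A))$ (Rowland et al., 2018).
Stated in terms of the random return $Z^\pi(x, a)$, distributed according to $\eta_\pi(x, a)$, this takes a more familiar form with
$Z^\pi(x, a) \overset{D}{=} R_0 + \gamma Z^\pi(X_1, A_1)$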
In analogy with Expression (2), an update operation could be defined from Equation (3) to move a collection of approximate distributions towards the true return distributions. However, since the space of distributions is infinite-dimensional, it is typically impossible to work directly with the distributional Bellman equation, and existing approaches to distributional RL generally rely on parametric approximations to this equation; we briefly review some important examples of these approaches below.
2.2 Categorical and quantile distributional RL
To date, the main approaches to DRL employed at scale have included learning discrete categorical distributions (Bellemare et al., 2017; Barth-Maron et al., 2018; Qu et al., 2018), and learning distribution quantiles (Dabney et al., 2017, 2018; Zhang et al., 2019); we refer to these approaches as CDRL and QDRL respectively. We give brief accounts of the dynamic programming versions of these algorithms here, with full descriptions of stochastic versions, related results, and visualisations given in Appendix Section A for completeness. We note also that other approaches, such as learning mixtures of Gaussians, have been explored (Barth-Maron et al., 2018).
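CDRL. CDRL assumes a categorical form for return distributions, taking $\eta(x, a) = \sum_{k=1}^K p_k(x, a) \delta_{z_k}$, where $\delta_z$ denotes the Dirac distribution at location $z$. The values $z_1 < \cdots < z_K$ are an evenly spaced, fixed set of supports, and the probability parameters $p_{1:K}(x, a)$ are learnt. The corresponding Bellman update takes the form
$\eta(x, a) \leftarrow \Pi_{\mathcal{C}} \mathbb{E}_\pi[(f_{R_0, \gamma})_\# \eta(X_1, A_1) \mid X_0 = x, A_0 = a]$
where $\Pi_{\mathcal{C}}$ is a projection operator which ensures the right-hand side of the expression above is a distribution supported only on $\{z_1, \ldots, z_K\}$; full details are reviewed in Appendix Section A.
QDRL. In contrast, QDRL assumes a parametric form for return distributions $\eta(x, a) = \frac{1}{K} \sum_{k=1}^K \delta_{z_k(x, a)}$, where now $z_{1:K}(x, a)$ are learnable parameters. The Bellman update is given by moving the atom location $z_k(x, a)$ in $\eta(x, a)$ to the $\tau_k$-quantile (where $\tau_k = \frac{2k-1}{2K}$) of the target distribution $\mu := \mathbb{E}_\pi[(f_{R_0, \gamma})_\# \eta(X_1, A_1) \mid X_0 = x, A_0 = a]$, defined as the minimiser $q^* \in \mathbb{R}$ of the quantile regression loss
(4)  $\mathrm{QR}(q; \mu, \tau_k) = \mathbb{E}_{Z \sim \mu}[(\tau_k \mathbb{1}_{Z > q} + (1 - \tau_k) \mathbb{1}_{Z \leq q}) |Z - q|]$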
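To make the QDRL update concrete, the sketch below evaluates the quantile regression loss for an empirical target distribution and places atoms at the quantile midpoints (2k - 1) / (2K). The function names are hypothetical, and representing the target by a plain list of equally weighted samples is an assumption made for illustration.

```python
def quantile_loss(q, samples, tau):
    """Quantile regression loss for the empirical distribution of `samples`."""
    return sum(
        (tau if z > q else 1.0 - tau) * abs(z - q) for z in samples
    ) / len(samples)


def quantile(samples, tau):
    """A minimiser of the loss; for an empirical distribution one lies at a sample."""
    return min(samples, key=lambda q: quantile_loss(q, samples, tau))


def qdrl_atoms(target_samples, num_atoms):
    """Atom locations after one QDRL update towards the target distribution."""
    taus = [(2 * k - 1) / (2 * num_atoms) for k in range(1, num_atoms + 1)]
    return [quantile(target_samples, tau) for tau in taus]


target = [3.0, 1.0, 5.0, 2.0, 4.0]
atoms = qdrl_atoms(target, num_atoms=5)
print(atoms)
```

With five equally weighted target samples and five atoms, each atom lands on one of the sorted samples, so everything else about the target is discarded once the quantiles are extracted.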
3 The role of statistics in distributional RL
In this section, we describe a new perspective on existing distributional RL algorithms, with a focus on learning sets of statistics, rather than approximate distributions. We begin with a precise definition.
Definition 3.1 (Statistics).
A statistic is a function $s : \mathscr{P}(\mathbb{R}) \to \mathbb{R}$. We also allow statistics to be defined on subsets of $\mathscr{P}(\mathbb{R})$, in situations where an assumption (such as finite moments) is required for the statistic to be defined.
The QDRL update described in Section 2.2 is readily interpreted from the perspective of learning statistics; the update extracts the values of a finite set of quantile statistics from the target distribution, and all other information about the target is lost. It is less obvious whether the CDRL update can also be interpreted as keeping track of a finite set of statistics, but the following lemma shows that this is indeed the case.
Lemma 3.2.
CDRL updates, with distributions supported on $z_1 < \cdots < z_K$, can be interpreted as learning the values of the following statistics of return distributions:
$s_{z_k, z_{k+1}}(\mu) = \mathbb{E}_{Z \sim \mu}[h_{z_k, z_{k+1}}(Z)]$ for $k = 1, \ldots, K-1$,
where for $a < b$, $h_{a,b} : \mathbb{R} \to [0, 1]$ is a piecewise linear function defined so that $h_{a,b}(x)$ is equal to $1$ for $x \leq a$, equal to $0$ for $x \geq b$, and linearly interpolating between $h_{a,b}(a)$ and $h_{a,b}(b)$ for $x \in [a, b]$.
Although viewing distributional RL as approximating the return distribution with some parameterisation is intuitive from an algorithmic standpoint, there are advantages to thinking in terms of sets of statistics and their recursive estimation; this perspective allows us to precisely quantify what information is being passed through successive distributional Bellman updates. This in turn leads to new insights in the development and analysis of DRL algorithms. Before addressing these points, we first consider a motivating example where a lack of precision could lead us astray.
3.1 Expectiles
Motivated by the success of QDRL, we consider learning expectiles of return distributions, a family of statistics introduced by Newey & Powell (1987). Expectiles generalise the mean in analogy with how quantiles generalise the median. As the goal of RL is to maximise mean returns, we conjectured that expectiles, in particular, might lead to successful DRL algorithms. We begin with a formal definition.
Definition 3.3 (Expectiles).
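Given a distribution $\mu \in \mathscr{P}(\mathbb{R})$ with finite second moment, and $\tau \in (0, 1)$, the $\tau$-expectile of $\mu$ is defined to be the minimiser $q^* \in \mathbb{R}$ of the expectile regression loss $\mathrm{ER}(q; \mu, \tau)$, given by
$\mathrm{ER}(q; \mu, \tau) = \mathbb{E}_{Z \sim \mu}[(\tau \mathbb{1}_{Z > q} + (1 - \tau) \mathbb{1}_{Z \leq q}) (Z - q)^2]$
For each $\tau \in (0, 1)$, we denote the $\tau$-expectile of $\mu$ by $e_\tau(\mu)$.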
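Definition 3.3 can be made concrete for an empirical distribution. The sketch below is one possible way to compute an expectile, assuming equally weighted samples and using bisection on the first-order condition of the expectile regression loss; the function name and the choice of bisection are illustrative assumptions, not the paper's procedure.

```python
def expectile(samples, tau, num_steps=100):
    """tau-expectile of the empirical distribution of `samples`, by bisection.

    The derivative of the expectile regression loss is monotone in q, so the
    minimiser is the unique root of the first-order condition.
    """
    def condition(q):
        # Proportional to minus the derivative of the loss at q; decreasing in q.
        return sum((tau if z > q else 1.0 - tau) * (z - q) for z in samples)

    lo, hi = min(samples), max(samples)
    for _ in range(num_steps):
        mid = 0.5 * (lo + hi)
        if condition(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


samples = [0.0, 1.0, 3.0]
print([expectile(samples, tau) for tau in (0.25, 0.5, 0.75)])
```

For the samples {0, 1, 3}, solving the first-order condition by hand gives 0.8, 4/3 and 2 for the three values of tau; the middle value is the sample mean.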
We remark that: (i) the expectile regression loss is an asymmetric version of the squared loss, just as the quantile regression loss is an asymmetric version of the absolute value loss; and (ii) the $1/2$-expectile of $\mu$ is simply its mean. Because of this, we can attempt to derive an algorithm by replacing the quantile regression loss in QDRL with the expectile regression loss in Definition 3.3, so as to learn the expectiles corresponding to $\tau_1, \ldots, \tau_K$.
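Following this logic, we again take approximate distributions of the form $\eta(x, a) = \frac{1}{K} \sum_{k=1}^K \delta_{z_k(x, a)}$, and we perform updates according to
(5)  $z_k(x, a) \leftarrow \arg\min_{q \in \mathbb{R}} \mathrm{ER}(q; \mu, \tau_k)$
where $\mu = \mathbb{E}_\pi[(f_{R_0, \gamma})_\# \eta(X_1, A_1) \mid X_0 = x, A_0 = a]$ is the target distribution.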
In practice, however, this algorithm does not perform as we might expect, and in fact the variance of the learnt distributions collapses as training proceeds, indicating that the algorithm does not approximate the true expectiles in any reasonable sense. In Figure 2, we illustrate this point by comparing the learnt statistics for this “naive” approach with those of CDRL and our proposed algorithm EDRL (introduced in Section 3.3). All methods accurately approximate the immediate reward distribution (right), but as successive Bellman updates are applied the different algorithms show characteristic approximation errors. The CDRL algorithm overestimates the variance of the return distribution due to the projection splitting probability mass across the discrete support. By contrast, the naive expectile approach underestimates the true variance, quickly converging to a single Dirac.
We observe that there is a “type error” present in Expression (5); the parameter being updated, $z_k(x, a)$, has the semantics of a statistic, as the minimiser of the loss, whilst the parameters appearing in the target distribution have the semantics of outcomes/samples. A crucial message of this paper is the need to distinguish between statistics and samples in distributional RL; in the next section, we describe a general framework for achieving this.
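The collapse caused by this conflation can be reproduced in a few lines. The sketch below is a simplified illustration, not the experiment of Figure 2: it assumes a deterministic chain with zero intermediate rewards and no discounting, so that each naive backup simply replaces the atoms by the expectiles of the successor's atoms, treated as samples. All names are hypothetical.

```python
def expectile(samples, tau, num_steps=100):
    """tau-expectile of an empirical distribution, found by bisection."""
    lo, hi = min(samples), max(samples)
    for _ in range(num_steps):
        mid = 0.5 * (lo + hi)
        slope = sum((tau if z > mid else 1.0 - tau) * (z - mid) for z in samples)
        if slope > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def naive_backup(atoms, taus):
    """Naive update: successor atoms are used as samples, then replaced by
    their expectiles (assumes zero reward and no discounting)."""
    return [expectile(atoms, tau) for tau in taus]


taus = [0.25, 0.5, 0.75]
atoms = [0.0, 1.0, 3.0]          # terminal distribution, as three samples
spreads = [max(atoms) - min(atoms)]
for _ in range(30):              # thirty backups along the chain
    atoms = naive_backup(atoms, taus)
    spreads.append(max(atoms) - min(atoms))
print(spreads[:4], spreads[-1])
```

In this setting the true return distribution is the same at every state, yet the spread of the atoms shrinks at every backup, because expectiles for tau strictly between 0 and 1 lie strictly inside the range of the samples.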
3.2 Imputation strategies
If we had access to full return distribution estimates $\eta(x', a')$ at each possible next state-action pair $(x', a')$, we would be able to avoid the conflation between samples and statistics described in the previous section. Denoting the approximation to the value of a statistic $s_k$ at a state-action pair $(x, a)$ by $\hat{s}_k(x, a)$, we would like to update according to:
(6)  $\hat{s}_k(x, a) \leftarrow s_k(\mathbb{E}_\pi[(f_{R_0, \gamma})_\# \eta(X_1, A_1) \mid X_0 = x, A_0 = a])$
Thus, a principled way in which to design DRL algorithms for collections of statistics is to include an additional step in the algorithm in which, for any state-action pair $(x', a')$ that we would like to back up from, the estimated statistics $\hat{s}_{1:K}(x', a')$ are converted into a consistent distribution $\eta(x', a')$. This would then allow backups of the form in Expression (6) to be carried out. This notion is formalised in the following definition.
Definition 3.4 (Imputation strategies).
Given a set of statistics $\{s_1, \ldots, s_K\}$, an imputation strategy is a function $\Psi : \mathbb{R}^K \to \mathscr{P}(\mathbb{R})$ that maps each vector of statistic values to a distribution that has those statistics. Mathematically, $\Psi$ is such that $s_i(\Psi(\hat{s}_1, \ldots, \hat{s}_K)) = \hat{s}_i$, for each $i \in \{1, \ldots, K\}$ and each collection of statistic values $(\hat{s}_1, \ldots, \hat{s}_K)$.
Thus, an imputation strategy is simply a function that takes in a collection of values for certain statistics, and returns a probability distribution with those statistic values; in some sense, it is a pseudoinverse of $(s_1, \ldots, s_K)$.
Example 3.5 (Imputation strategies in CDRL and QDRL).
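In QDRL, the imputation strategy is given by $(q_1, \ldots, q_K) \mapsto \frac{1}{K} \sum_{k=1}^K \delta_{q_k}$. In CDRL, given approximate statistics $\hat{s}_{z_k, z_{k+1}}$ for $k = 1, \ldots, K-1$, the imputation strategy is given by selecting the distribution $\sum_{k=1}^K p_k \delta_{z_k}$ such that $\sum_{j=1}^{k} p_j = \hat{s}_{z_k, z_{k+1}}$, for $k = 1, \ldots, K-1$, and $\sum_{k=1}^K p_k = 1$.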
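The two imputation strategies of Example 3.5 can be sketched as follows. The code assumes the piecewise linear functions of Lemma 3.2 equal 1 to the left of the lower support point and 0 to the right of the upper one, so that for a distribution on the support each CDRL statistic is a cumulative probability. Function names and the list-based representation are hypothetical.

```python
def hat_statistic(atoms, probs, a, b):
    """E[h_{a,b}(Z)] for a discrete distribution with the given atoms and probs."""
    def h(x):
        if x <= a:
            return 1.0
        if x >= b:
            return 0.0
        return (b - x) / (b - a)   # linear interpolation on [a, b]
    return sum(p * h(z) for z, p in zip(atoms, probs))


def cdrl_statistics(probs, support):
    """The K-1 statistics s_{z_k, z_{k+1}} of a categorical distribution."""
    return [hat_statistic(support, probs, support[k], support[k + 1])
            for k in range(len(support) - 1)]


def cdrl_impute(stats):
    """Recover categorical probabilities by differencing cumulative sums."""
    cumulative = list(stats) + [1.0]
    return [cumulative[0]] + [cumulative[k] - cumulative[k - 1]
                              for k in range(1, len(cumulative))]


def qdrl_impute(quantiles):
    """Equally weighted Dirac masses at the quantile values: (atoms, probs)."""
    n = len(quantiles)
    return list(quantiles), [1.0 / n] * n


support = [0.0, 1.0, 2.0, 3.0]
probs = [0.1, 0.2, 0.3, 0.4]
stats = cdrl_statistics(probs, support)
recovered = cdrl_impute(stats)
print(stats, recovered)
```

Applying `cdrl_impute` to the statistics of a categorical distribution returns the original probabilities, which is the sense in which the imputation strategy acts as a pseudoinverse of the statistics.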
3.3 Expectile distributional reinforcement learning
We now apply the general framework of statistics and imputation strategies developed in Section 3.2 to the specific case of expectiles, introduced in Section 3.1. We will define an imputation strategy so that updates of the form given in Expression (6) can be applied to learn expectiles.
The imputation strategy has the task of accepting as input a collection of expectile values $e_1, \ldots, e_K$, corresponding to $\tau_1 < \cdots < \tau_K$, and computing a probability distribution $\mu$ such that $e_{\tau_i}(\mu) = e_i$ for $i = 1, \ldots, K$. Since $\mathrm{ER}(q; \mu, \tau)$ is strictly convex as a function of $q$, this can be restated as finding a probability distribution $\mu$ satisfying the first-order optimality conditions
(7)  $\nabla_q \mathrm{ER}(q; \mu, \tau_i) \big|_{q = e_i} = 0$ for $i = 1, \ldots, K$
This defines a root-finding problem, but may equivalently be formulated as a minimisation problem, with objective
(8)  $\sum_{i=1}^K \left( \nabla_q \mathrm{ER}(q; \mu, \tau_i) \big|_{q = e_i} \right)^2$
By constraining the distribution $\mu$ to be of the form $\frac{1}{N} \sum_{n=1}^N \delta_{z_n}$ and viewing the minimisation objective above as a function of $z_{1:N}$, it is straightforwardly verifiable that this minimisation problem is convex. The imputation strategy is thus defined implicitly, by stating that $\Psi(e_1, \ldots, e_K)$ is given by a minimiser of (8) of the form $\frac{1}{N} \sum_{n=1}^N \delta_{z_n}$. We remark that other parametric choices for $\mu$ are possible, but the mixture of Dirac deltas described above leads to a particularly tractable optimisation problem.
Having established an imputation strategy, Algorithm 1 now yields a full DRL algorithm for learning expectiles, which we term EDRL. Returning to Figure 2, we observe that EDRL (bottom row) is able to accurately represent the true return distribution, even after many Bellman updates through the chain, and does not exhibit the collapse observed with the naive approach in Section 3.1.
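One way to carry out the minimisation in (8) is sketched below. It assumes an equally weighted sample distribution and uses `scipy.optimize.least_squares`, which minimises half the sum of squared residuals; the paper states only that a SciPy optimisation routine is used, so the specific routine, the function names, and the starting point are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import least_squares


def expectile_conditions(samples, expectiles, taus):
    """First-order optimality residuals, one per expectile.

    Entry i is proportional to the derivative of the expectile regression
    loss at q = expectiles[i], for the uniform distribution over `samples`.
    """
    diffs = samples[None, :] - expectiles[:, None]            # shape (K, N)
    weights = np.where(diffs > 0, taus[:, None], 1.0 - taus[:, None])
    return (weights * diffs).mean(axis=1)


def impute_samples(expectiles, taus, x0=None):
    """Search for sample locations whose expectiles match the given values."""
    expectiles = np.asarray(expectiles, dtype=float)
    taus = np.asarray(taus, dtype=float)
    if x0 is None:
        x0 = expectiles.copy()   # assumption: start from the expectile values
    result = least_squares(
        lambda z: expectile_conditions(z, expectiles, taus),
        np.asarray(x0, dtype=float),
    )
    return result.x


taus = np.array([0.25, 0.5, 0.75])
expectiles = np.array([0.8, 4.0 / 3.0, 2.0])   # expectiles of samples {0, 1, 3}
samples = impute_samples(expectiles, taus, x0=[-0.5, 1.2, 2.5])
residual = np.abs(expectile_conditions(samples, expectiles, taus)).max()
print(samples, residual)
```

The residuals are piecewise linear in the sample locations, so the result can depend on the starting point; the example passes an explicit `x0` for that reason.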
3.4 Stochastic approximation
Practically speaking, it is often not possible to compute the updates in Expression (6), owing to MDP dynamics being unknown and/or intractable to integrate over. Because of this, it is often necessary to apply stochastic approximation. Let $(x, a, r, x', a')$ be a sample of the random variables $(X_0, A_0, R_0, X_1, A_1)$, obtained by direct interaction with the environment. Then, we update $\hat{s}_k(x, a)$ using the gradient of a loss function $L$:
(9)  $\nabla_{\hat{s}_k(x, a)} L(\hat{s}_k(x, a); (f_{r, \gamma})_\# \Psi(\hat{s}_{1:K}(x', a')))$
For EDRL, a natural such loss function for the estimated statistic $\hat{s}_k(x, a)$ is the expectile regression loss of Definition 3.3 at $\tau_k$; this yields a stochastic version of EDRL, described in Algorithm 2.
To ensure convergence of these stochastic gradient updates to the correct statistic, it should be the case that the expectation of the (sub)gradient (9) at the true value of the statistics is equal to $0$. It can be verified that this is the case whenever (i) the true statistic $s(\mu)$ of a distribution $\mu$ satisfies $s(\mu) = \arg\min_{s} L(s; \mu)$, and (ii) the loss $L$ is affine in the probability distribution argument. M-estimator losses and their associated statistics (Huber & Ronchetti, 2009) satisfy these conditions, and thus represent a large family of statistics to which this approach to DRL could immediately be applied; the statistics in CDRL, QDRL and EDRL are all special cases of M-estimators.
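The zero-expected-gradient condition can be checked numerically for the expectile regression loss. The sketch below assumes target samples are given directly as numbers; in a full stochastic algorithm they would be of the form r + γ z', with z' drawn from the imputed next-state distribution. The function names and the learning rate are hypothetical.

```python
def expectile_loss_grad(estimate, sample, tau):
    """Derivative in `estimate` of the expectile regression loss at one sample."""
    weight = tau if sample > estimate else 1.0 - tau
    return -2.0 * weight * (sample - estimate)


def sgd_update(estimate, sample, tau, learning_rate):
    """One stochastic gradient step on the expectile estimate."""
    return estimate - learning_rate * expectile_loss_grad(estimate, sample, tau)


# At the true 0.25-expectile of {0, 1, 3}, which is 0.8, the gradients
# average to zero over the target samples.
target_samples = [0.0, 1.0, 3.0]
mean_grad = sum(
    expectile_loss_grad(0.8, z, 0.25) for z in target_samples
) / len(target_samples)
stepped = sgd_update(0.0, 3.0, tau=0.25, learning_rate=0.1)
print(mean_grad, stepped)
```

Because the loss is an expectation over the target distribution, averaging per-sample gradients gives the gradient of the full loss, which is what makes the stochastic update consistent at the true statistic.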
4 Analysing distributional RL
We now use the framework of statistics and imputation strategies developed in Section 3 to build a deeper understanding of the accuracy with which statistics in distributional RL may be learnt via Bellman updates.
4.1 Bellman closedness
The classical Bellman equation (1) shows that there is a closed-form relationship between expected returns at each state-action pair of an MDP; if the goal is to learn expected returns, we are not required to keep track of any other statistics of the return distributions. This well-known observation, together with the new interpretation of DRL algorithms as learning collections of statistics of return distributions, motivates a more general question:
“Given a set of statistics $\{s_1, \ldots, s_K\}$, if we want to learn the values $s_{1:K}(\eta_\pi(x, a))$ for all $(x, a) \in \mathcal{X} \times \mathcal{A}$ via dynamic programming, is it sufficient to keep track of only these statistics?”
The following definition formalises this question.
Definition 4.1 (Bellman closedness).
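A set of statistics $\{s_1, \ldots, s_K\}$ is Bellman closed if for each $(x, a) \in \mathcal{X} \times \mathcal{A}$, the statistics $s_{1:K}(\eta_\pi(x, a))$ can be expressed, in an MDP-independent manner, in terms of the random variables $R_0$ and $s_{1:K}(\eta_\pi(X_1, A_1)) \mid X_0 = x, A_0 = a$, and the discount factor $\gamma$. We refer to any such expression for a Bellman closed set of statistics as a Bellman equation, and write $T^\pi$ for the corresponding operator such that the Bellman equation can be written
(10)  $\mathbf{s} = T^\pi \mathbf{s}$
where $\mathbf{s} = (s_k(\eta_\pi(x, a)) : (x, a) \in \mathcal{X} \times \mathcal{A},\ k = 1, \ldots, K)$.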
Thus, the singleton set consisting of the mean statistic is Bellman closed; the corresponding Bellman equation is Equation (1). It is also known that the set consisting of the mean and variance statistics is Bellman closed (Sobel, 1982). In principle, given a Bellman closed set of statistics, the corresponding statistics of the return distributions can be found by solving a fixed-point equation corresponding to the relevant Bellman operator. Further, if this operator is a contraction in some metric, then it is possible to find the true statistics for the MDP via a fixed-point iteration scheme based on it. In contrast, if a collection of statistics is not Bellman closed, there is no Bellman equation relating the statistics of the return distributions, and consequently it is not possible to learn the statistics exactly using dynamic programming in a self-contained way; the set of statistics must either be enlarged to make it Bellman closed, or an imputation strategy can be used to perform backups as described in Section 3.2.
An important class of Bellman closed sets of statistics is given in the following result (Sobel, 1982; Lattimore & Hutter, 2012).
Lemma 4.2.
For each $K \in \mathbb{N}$, the set of statistics consisting of the first $K$ moments is Bellman closed.
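To see why moments are Bellman closed, it may help to work through the case of the first two moments. The derivation below is a sketch under the assumption, consistent with the MDP setup of Section 2, that the reward $R_0$ is independent of $(X_1, A_1)$ given $(X_0, A_0)$; the symbol $m_2$ for the second moment is introduced here for illustration. All expectations on the last line are conditional on $X_0 = x, A_0 = a$.

```latex
\begin{align*}
m_2(x, a) &:= \mathbb{E}_\pi\big[ Z^\pi(x, a)^2 \big] \\
          &= \mathbb{E}_\pi\big[ (R_0 + \gamma Z^\pi(X_1, A_1))^2 \mid X_0 = x, A_0 = a \big] \\
          &= \mathbb{E}[R_0^2] + 2 \gamma\, \mathbb{E}[R_0]\, \mathbb{E}_\pi[Q^\pi(X_1, A_1)]
             + \gamma^2\, \mathbb{E}_\pi[m_2(X_1, A_1)]
\end{align*}
```

The right-hand side involves only the reward and the first two moments at the next state-action pair, so the pair (mean, second moment) can be backed up without reference to any other feature of the return distribution.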
The next result shows that across a wide range of statistics, collections of moments are effectively the only finite sets of statistics that are Bellman closed; the proof relies on a result of Engert (1970) which characterises finite-dimensional vector spaces of measurable functions closed under translation.
Theorem 4.3.
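The only finite sets of statistics of the form $s(\mu) = \mathbb{E}_{Z \sim \mu}[h(Z)]$ that are Bellman closed are given by collections of statistics $s_1, \ldots, s_K$ with the property that the linear span of $\{s_0, s_1, \ldots, s_K\}$ is equal to the linear span of the set of moment functionals $\{\mu \mapsto \mathbb{E}_{Z \sim \mu}[Z^\ell] : \ell = 0, \ldots, L\}$, for some $L \leq K$, where $s_0$ is the constant functional equal to $1$.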
We believe this to be an important novel result, which helps to highlight how rare it is for statistics to be Bellman closed. One important corollary of Theorem 4.3, given the characterisation of CDRL as learning expectations of return distributions in Lemma 3.2, is that the sets of statistics learnt in CDRL are not Bellman closed. A similar result holds for QDRL, and we record these facts in the following result.
Lemma 4.4.
The sets of statistics learnt under (i) CDRL, and (ii) QDRL, are not Bellman closed.
The immediate upshot of this is that in general, the learnt values of statistics in distributional RL algorithms need not correspond exactly to the true underlying values for the MDP (even in tabular settings), as the statistics propagated through DRL dynamic programming updates are not sufficient to determine the statistics we seek to learn. This inexactness was noted specifically for CDRL and QDRL in the original papers (Bellemare et al., 2017; Dabney et al., 2017). In this paper, our analysis and experiments confirm that these artefacts arise even with tabular agents in fully-observed domains, thus representing intrinsic properties of the distributional RL algorithms concerned. However, empirically the distributions learnt by these algorithms are often accurate. In the next section, we provide theoretical guarantees that describe this phenomenon quantitatively.
4.2 Approximate Bellman closedness
In light of the results on Bellman closedness in Section 4.1, we might ask in what sense the values of the statistics learnt by DRL algorithms relate to the corresponding true underlying values for the MDP concerned. A key task in this analysis is to formalise the notion of low approximation error in DRL algorithms that seek to learn collections of statistics that are not Bellman closed. Perhaps surprisingly, in general it is not possible to simultaneously achieve low approximation error on all statistics in a non-Bellman closed set; we give several examples for CDRL and QDRL to this end in Appendix Section C.
Due to the fact that it is in general not possible to learn statistics uniformly well, we formalise the notion of approximate closedness in terms of the average approximation error across a collection of statistics, as described below.
Definition 4.5 (Approximate Bellman closedness).
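A collection of statistics $s_1, \ldots, s_K$, together with an imputation strategy $\Psi$, are said to be $\varepsilon$-approximately Bellman closed for a class $\mathcal{M}$ of MDPs if, for each MDP in $\mathcal{M}$ and every policy $\pi$, we have
$\sup_{(x, a) \in \mathcal{X} \times \mathcal{A}} \frac{1}{K} \sum_{k=1}^K |s_k(\eta_\pi(x, a)) - \hat{s}_k(x, a)| \leq \varepsilon$
where $\hat{s}_k(x, a)$ denotes the learnt value of the statistic $s_k$ for the return distribution at the state-action pair $(x, a)$.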
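The error measure behind this definition is straightforward to compute once true and learnt statistics are available. The sketch below assumes the error is the average absolute difference across the K statistics, with the worst case then taken over state-action pairs; the function name and the dictionary representation are hypothetical.

```python
def approximation_error(true_stats, learnt_stats):
    """Largest (over state-action pairs) average absolute error across statistics.

    Both arguments map a state-action pair to a list of K statistic values.
    """
    return max(
        sum(abs(t - l) for t, l in zip(true_stats[key], learnt_stats[key]))
        / len(true_stats[key])
        for key in true_stats
    )


true_stats = {("x", "a"): [0.0, 1.0, 2.0], ("y", "a"): [1.0, 1.0, 1.0]}
learnt_stats = {("x", "a"): [0.0, 1.3, 2.0], ("y", "a"): [1.0, 1.0, 1.6]}
error = approximation_error(true_stats, learnt_stats)
print(error)
```

Averaging across statistics rather than taking a maximum reflects the observation above that the statistics cannot in general all be learnt uniformly well.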
We can now study the approximation errors of CDRL and QDRL in light of this new concept. Whilst the analysis in Section 4.1 shows that CDRL and QDRL necessarily induce some approximation error due to lack of Bellman closedness, the following results reassuringly show that the approximation error can be made arbitrarily small by increasing the number of learnt statistics.
Theorem 4.6.
Consider the class of MDPs with a fixed discount factor , and immediate reward distributions supported on . The set of statistics and imputation strategy corresponding to CDRL with evenly spaced bin locations at is approximately Bellman closed for , where .
Theorem 4.7.
Consider the class of MDPs with a fixed discount factor , and immediate reward distributions supported on . Then the collection of quantile statistics for , together with the standard QDRL imputation strategy, is approximately Bellman closed for , where .
4.3 Mean consistency
So far, our discussion has been focused around evaluation. For control, it is important to correctly estimate expected returns, so that accurate policy improvement can be performed. We analyse to what extent expected returns are correctly learnt in existing DRL algorithms in the following result. The result for CDRL has been shown previously (Rowland et al., 2018; Lyle et al., 2019), but our proof here gives a new perspective in terms of statistics.
Lemma 4.8.
(i) Under CDRL updates using support locations , if all approximate reward distributions have support bounded in , expected returns are exactly learnt. (ii) Under QDRL updates, expected returns are not exactly learnt.
Importantly, for EDRL, as long as the $1/2$-expectile (i.e. the mean) is included in the set of statistics, expected returns are learnt exactly; we return to this point in Section 5.2.
5 Experimental results
We first present results with a tabular version of EDRL to illustrate and expand upon the theoretical results presented in Sections 3 and 4. We then combine the EDRL update with a DQN-style architecture to create a novel deep RL algorithm (ER-DQN), and evaluate performance on the Atari-57 environments. We give full details of the architectures used in experiments in Appendix Section D.1.
There are several ways in which the root-finding/optimisation problems (7) and (8) may be solved in practice. In our experiments, we use a SciPy optimisation routine (Jones et al., 2001).
5.1 Tabular policy evaluation
We empirically validate that EDRL, which uses a sample imputation strategy, better approximates the true expectiles of a policy’s return distribution as compared to the naive approach described in Section 3.1. We then show that the same is true for a variant of QDRL.
We use a variant of the classic Chain domain (see Figure 3). This environment is a one-dimensional chain of length $N$ with two possible actions at each state: (i) forward, which moves the agent right by one step with probability 0.95 and to $x_0$ with probability 0.05, and (ii) backward, which moves the agent to $x_0$ with probability 0.95 and one step to the right with probability 0.05. The reward is when transitioning to the leftmost state, when transitioning to the rightmost state, and zero elsewhere. Episodes begin in the leftmost state and terminate when the rightmost state is reached. The discount factor is . For an $N$-Chain with length 15, we compute the return distribution of the optimal policy, which selects the forward action at each state. This environment formulation induces an increasingly multimodal return distribution under the policy as the distance from the goal state increases. We compute the ground truth start state expectiles from the empirical distribution of 1,000 Monte Carlo rollouts under the policy.
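The Monte Carlo procedure for ground truth returns can be sketched as follows. Several specifics are assumptions, because the text above does not preserve them: the rewards of -1 at the leftmost state and +1 at the rightmost state, the discount factor of 0.99, and the convention that a failed move sends the agent to the leftmost state. The function names and the seed are hypothetical.

```python
import random


def chain_step(state, action, length, rng):
    """One transition of the Chain; returns (next_state, reward, done)."""
    p_right = 0.95 if action == "forward" else 0.05
    next_state = state + 1 if rng.random() < p_right else 0
    if next_state == length - 1:
        return next_state, 1.0, True     # assumed reward at the rightmost state
    if next_state == 0:
        return next_state, -1.0, False   # assumed reward at the leftmost state
    return next_state, 0.0, False


def rollout_return(length, gamma, rng):
    """Discounted return of one episode under the always-forward policy."""
    state, total, discount, done = 0, 0.0, 1.0, False
    while not done:
        state, reward, done = chain_step(state, "forward", length, rng)
        total += discount * reward
        discount *= gamma
    return total


rng = random.Random(0)
returns = [rollout_return(length=15, gamma=0.99, rng=rng) for _ in range(1000)]
print(min(returns), max(returns))
```

Ground truth expectiles can then be estimated from `returns` by minimising the expectile regression loss of Definition 3.3 over this empirical distribution.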
EDRL. We ran two DRL algorithms on this Chain environment: (i) EDRL, using a SciPy optimisation routine to impute target samples at each step; and (ii) EDRL-Naive, using the update described in Section 3.1. We learned expectiles, set the learning rate to , and performed 30,000 training steps.
In Figure 4 we illustrate the collapse of the start state expectiles learned by the EDRL-Naive algorithm with expectiles, which leads to high expectile estimation error, measured as in Definition 4.5. In Figure 5, we show that this error grows as both the distance to the goal state and the number of expectiles learned increase. In contrast, under EDRL this error remains relatively low for varying numbers of expectiles and distances to the goal. In Appendix E, we illustrate that this observation generalises to other return distributions in the Chain.
QDRL. In practical implementations, QDRL often minimises the Huber-quantile loss
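(11)  $\mathrm{HQ}_\kappa(q; \mu, \tau) = \mathbb{E}_{Z \sim \mu}[(\tau \mathbb{1}_{Z > q} + (1 - \tau) \mathbb{1}_{Z \leq q}) H_\kappa(Z - q)]$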
rather than the quantile loss (4) for numerical stability, where $H_\kappa$ is the Huber loss function with width parameter $\kappa$, as in Dabney et al. (2017) (we set ). As with naive EDRL, simply replacing the quantile regression loss in QDRL with Expression (11) conflates samples and statistics, leading to worse approximation of the distribution. We propose a new algorithm for learning Huber quantiles, Huber-QDRL-Imputation, that incorporates an imputation strategy by solving an optimisation problem analogous to (8) in the case of the Huber quantile loss. In Figure 6, we compare this to Huber-QDRL-Naive, the standard algorithm for learning Huber quantiles, on the Chain environment. As in the case of expectiles, the Huber quantile estimation error is vastly reduced when using an imputation strategy.
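For reference, a Huber-quantile loss of this kind can be written in a few lines. The sketch assumes the standard Huber function (quadratic within the width, linear outside) combined with the same asymmetric weighting as the quantile regression loss, and a default width of 1.0; the width used in the experiments is not preserved in the text above, so that default is an assumption, as are the function names.

```python
def huber(u, kappa):
    """Huber function: quadratic for |u| <= kappa, linear beyond."""
    if abs(u) <= kappa:
        return 0.5 * u * u
    return kappa * (abs(u) - 0.5 * kappa)


def huber_quantile_loss(q, samples, tau, kappa=1.0):
    """Asymmetrically weighted Huber loss for an empirical distribution."""
    return sum(
        (tau if z > q else 1.0 - tau) * huber(z - q, kappa) for z in samples
    ) / len(samples)


loss = huber_quantile_loss(0.0, [2.0, -0.5], tau=0.75)
print(huber(0.5, 1.0), huber(2.0, 1.0), loss)
```

Near its minimiser this loss behaves like the expectile regression loss and far from it like the quantile regression loss, which is why its minimiser is in general neither a quantile nor an expectile.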
5.2 Tabular control
In Section 4.3 we argued for the importance of mean consistency. In Figure 7a we give a simple five-state MDP in which the learned control policy is directly affected by mean consistency. At start state the agent has the choice of two actions, leading down two paths and culminating in two different reward distributions. The rewards at terminal states and are sampled from (shifted) exponential distributions with densities () and (), respectively. Transitions are deterministic, and . For CDRL, we take bin locations at .
Figure 7b shows the true return distributions, their expectations, and the means estimated by CDRL, QDRL and EDRL. Due to a lack of mean consistency, both CDRL and QDRL learn a suboptimal greedy policy. For CDRL, this is due to the true return distributions having support outside , and for QDRL, this is due to the quantiles not capturing tail behaviour. In contrast, EDRL correctly learns the means of both return distributions, and so is able to act optimally.
5.3 Expectile regression DQN
To demonstrate the effectiveness of EDRL at scale, we combine the EDRL update in Algorithm 2 with the architecture of QR-DQN to obtain a new deep RL agent, expectile regression DQN (ER-DQN). Precise details of the architecture, training algorithm, and environments are given in Appendix Section D. We evaluate ER-DQN on a suite of 57 Atari games using the Arcade Learning Environment (Bellemare et al., 2013). In Figure 8, we plot mean and median human normalised scores for ER-DQN with 11 atoms, and compare against DQN, QR-DQN (which learns 200 Huber quantile statistics), and a naive implementation of ER-DQN that does not use an imputation strategy, learning 201 expectiles. All methods were rerun for this paper, and results were averaged over 3 seeds. In practice, we found that with 11 expectiles, ER-DQN already offers strong performance relative to these other approaches, and that with this number of statistics, the additional training overhead due to the SciPy optimiser calls is low.
In terms of mean human normalised score, ER-DQN represents a substantial improvement over both QR-DQN and the naive version of ER-DQN that does not use an imputation strategy. We hypothesise that the mean consistency of EDRL (in contrast to other DRL methods; see Section 4.3) is partially responsible for these improvements, and leave further investigation of the role of mean consistency in DRL as a direction for future work. We also remark that the performance of ER-DQN shows that there may be significant practical value in applying the framework developed in this paper to other families of statistics. It remains to be seen if the presence of partial observability may induce non-trivial distributions, which could also explain ER-DQN’s improved performance in some games. Investigation into the robustness of ER-DQN with regard to the precise imputation strategy used is also a natural question for future work.
6 Conclusion
We have developed a unifying framework for DRL in terms of statistical estimators and imputation strategies. Through this framework, we have developed a new algorithm, EDRL, as well as proposing algorithmic adjustments to an existing approach. We have also used this framework to define the notion of Bellman closedness, and provided new approximation guarantees for existing algorithms.
This paper also opens up several avenues for future research. Firstly, the framework of imputation strategies has the potential to be applied to a wide range of collections of statistics, opening up a large space of new algorithms to explore. Secondly, our analysis has shown that a lack of Bellman closedness necessarily introduces a source of approximation error into many DRL algorithms; it will be interesting to see how this interacts with errors introduced by function approximation. Finally, we have focused on DRL algorithms that can be interpreted as learning a finite collection of statistics in this paper. One notable alternative is implicit quantile networks (Dabney et al., 2018), which attempt to learn an uncountable collection of quantiles with a finite-capacity function approximator; it will also be interesting to extend our analysis to this setting.
Acknowledgements
The authors acknowledge the vital contributions of their colleagues at DeepMind. Thanks to Hado van Hasselt for detailed comments on an earlier draft, and to Georg Ostrovski for useful suggestions regarding the SciPy optimisation calls within ER-DQN.
References
 Barth-Maron et al. (2018) Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., TB, D., Muldal, A., Heess, N., and Lillicrap, T. Distributional policy gradients. Proceedings of the International Conference on Learning Representations (ICLR), 2018.

 Bellemare et al. (2013) Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
 Bellemare et al. (2017) Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. Proceedings of the International Conference on Machine Learning (ICML), 2017.
 Bellman (1957) Bellman, R. Dynamic Programming. Princeton University Press, 1st edition, 1957.
 Dabney et al. (2017) Dabney, W., Rowland, M., Bellemare, M. G., and Munos, R. Distributional reinforcement learning with quantile regression. Proceedings of the AAAI Conference on Artificial Intelligence, 2017.
 Dabney et al. (2018) Dabney, W., Ostrovski, G., Silver, D., and Munos, R. Implicit quantile networks for distributional reinforcement learning. Proceedings of the International Conference on Machine Learning (ICML), 2018.
 Engert (1970) Engert, M. Finite dimensional translation invariant subspaces. Pacific Journal of Mathematics, 32(2):333–343, 1970.
 Gruslys et al. (2018) Gruslys, A., Dabney, W., Azar, M. G., Piot, B., Bellemare, M., and Munos, R. The reactor: A fast and sample-efficient actor-critic agent for reinforcement learning. Proceedings of the International Conference on Learning Representations (ICLR), 2018.
 Huber & Ronchetti (2009) Huber, P. J. and Ronchetti, E. Robust Statistics. Wiley New York, 2nd edition, 2009.
 Jones et al. (2001) Jones, E., Oliphant, T., and Peterson, P. SciPy: Open source scientific tools for Python, 2001. URL http://www.scipy.org/.
 Lattimore & Hutter (2012) Lattimore, T. and Hutter, M. PAC bounds for discounted MDPs. In International Conference on Algorithmic Learning Theory (ALT), 2012.
 Lyle et al. (2019) Lyle, C., Castro, P. S., and Bellemare, M. G. A comparative analysis of expected and distributional reinforcement learning. Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
 Morimura et al. (2010a) Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., and Tanaka, T. Nonparametric return distribution approximation for reinforcement learning. Proceedings of the International Conference on Machine Learning (ICML), 2010a.
 Morimura et al. (2010b) Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., and Tanaka, T. Parametric return density estimation for reinforcement learning. Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2010b.
 Newey & Powell (1987) Newey, W. K. and Powell, J. L. Asymmetric least squares estimation and testing. Econometrica: Journal of the Econometric Society, pp. 819–847, 1987.
 Qu et al. (2018) Qu, C., Mannor, S., and Xu, H. Nonlinear distributional gradient temporal-difference learning. arXiv preprint arXiv:1805.07732, 2018.
 Rowland et al. (2018) Rowland, M., Bellemare, M. G., Dabney, W., Munos, R., and Teh, Y. W. An analysis of categorical distributional reinforcement learning. Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.
 Sobel (1982) Sobel, M. J. The variance of discounted Markov decision processes. Journal of Applied Probability, 19(4):794–802, 1982.
 Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.
 Zhang et al. (2019) Zhang, S., Mavrin, B., Yao, H., Kong, L., and Liu, B. QUOTA: The quantile option architecture for reinforcement learning. Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
Appendices
Appendix A Distributional reinforcement learning algorithms
For completeness, we give full descriptions of CDRL and QDRL algorithms in this section, complementing the details given in Section 2.2. We also summarise CDRL, QDRL, the exact approach to distributional RL, and our proposed algorithm EDRL, in Figure 10 at the end of this section.
A.1 The distributional Bellman operator
A.2 Categorical distributional reinforcement learning
As described in Section 2.2, CDRL algorithms are an approach to distributional RL that restrict approximate distributions to the parametric family of the form , where are an evenly spaced, fixed set of supports. For evaluation of a policy , given a collection of approximations , the approximation at is updated according to:
Here, is a projection operator defined for a single Dirac delta as
(12) 
and extended affinely and continuously. In the language of operators, the CDRL update may be neatly described as , where we abuse notation by interpreting as an operator on collections of distributions indexed by state-action pairs, applying the transformation in Expression (12) to each distribution. The supremum-Cramér distance is defined as
for all , where for any , denotes the CDF of . The operator is a contraction in the supremum-Cramér distance, and so by the contraction mapping theorem, repeated CDRL updates converge to a unique limit point, regardless of the initial approximate distributions. For more details on these results and further background, see Bellemare et al. (2017); Rowland et al. (2018).
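As an illustration of the projection step, the following is a minimal sketch rather than a definitive implementation. It assumes the standard piecewise-linear rule of Bellemare et al. (2017) and Rowland et al. (2018): a Dirac outside the support range is moved to the nearest endpoint, and a Dirac between two neighbouring support points has its mass split linearly between them. The function names `project_dirac` and `project_mixture` are hypothetical.

```python
from bisect import bisect_right


def project_dirac(y, support):
    """Project a Dirac at y onto a sorted, fixed support (assumed rule).

    Mass outside the support range goes to the nearest endpoint; mass
    between two neighbouring atoms is split linearly between them.
    """
    probs = [0.0] * len(support)
    if y <= support[0]:
        probs[0] = 1.0
    elif y >= support[-1]:
        probs[-1] = 1.0
    else:
        k = bisect_right(support, y) - 1  # support[k] <= y < support[k + 1]
        width = support[k + 1] - support[k]
        probs[k] = (support[k + 1] - y) / width
        probs[k + 1] = (y - support[k]) / width
    return probs


def project_mixture(atoms, weights, support):
    """Extend the projection affinely to a finite mixture of Diracs."""
    out = [0.0] * len(support)
    for y, w in zip(atoms, weights):
        for k, p in enumerate(project_dirac(y, support)):
            out[k] += w * p
    return out
```

Under this rule the projection preserves the mean of any distribution supported within the range of the fixed support, since each Dirac is replaced by a two-point distribution with the same expectation.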
Stochastic approximation. The update is typically not computable in practice, due to unknown/intractable dynamics. An unbiased approximation to may be obtained by interacting with the environment to obtain a transition , and computing the target
It can be shown (Rowland et al., 2018) that the following is an unbiased estimator for the CDRL update:
Finally, the current estimate can be moved towards the stochastic target by following the (semi-)gradient of some loss, in analogy with semi-gradient methods in classical RL. Bellemare et al. (2017) consider the KL loss
and update by taking the gradient of the loss through the second argument with respect to the parameters . Other losses, such as the Cramér distance, may also be considered (Rowland et al., 2018).
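The semi-gradient step on the KL loss can be sketched as follows, under the assumption (not stated in this appendix) that the categorical probabilities are parameterised as a softmax over logits. The projected target is held fixed, and the gradient is taken only through the current estimate. All function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_loss(target, logits):
    """KL(target || softmax(logits)), with 0 * log 0 treated as 0."""
    q = softmax(logits)
    return sum(p * math.log(p / qk) for p, qk in zip(target, q) if p > 0.0)


def kl_semigradient(target, logits):
    """Gradient of the KL loss w.r.t. the logits, holding the target fixed.

    For a softmax parameterisation and a target summing to one, this is
    softmax(logits) - target.
    """
    q = softmax(logits)
    return [qk - p for p, qk in zip(target, q)]


def sgd_step(target, logits, lr):
    """One gradient-descent step on the logits."""
    grad = kl_semigradient(target, logits)
    return [l - lr * g for l, g in zip(logits, grad)]
```

In a deep RL agent the logits would be network outputs and the step would be taken by an optimiser on the network parameters; the explicit logit update here is only meant to make the direction of the update concrete.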
Control. All variants of CDRL for evaluation may be modified to become control algorithms. This is achieved by adjusting the distribution of the action in the backup, in analogy with classical RL algorithms. Rather than having , we select based on the currently estimated expected returns for each of the actions at the state . For Q-learning-style algorithms, the action corresponding to the highest estimated expected return is selected:
However, other choices are possible, such as SARSA-style greedy action selection.
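The Q-learning-style selection rule can be sketched as below, assuming each action's return distribution is stored as a probability vector over a shared fixed support. The names `expected_return` and `greedy_action` are hypothetical.

```python
def expected_return(probs, support):
    """Mean of a categorical distribution on a fixed support."""
    return sum(p * z for p, z in zip(probs, support))


def greedy_action(probs_by_action, support):
    """Index of the action whose estimated return distribution has the highest mean."""
    means = [expected_return(p, support) for p in probs_by_action]
    return max(range(len(means)), key=means.__getitem__)
```

Ties are broken in favour of the lowest action index, which is a choice made for this sketch only.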
A.3 Quantile distributional reinforcement learning
As described in Section 2.2, QDRL algorithms are an approach to distributional RL that restrict approximate distributions to the parametric family of the form . For evaluation of a policy , given a collection of approximations , the approximation at is updated according to:
Here, is a projection operator defined by
where , and is the CDF of . As noted in Section 2.2, may also be characterised as the minimiser (over ) of the quantile regression loss ; this perspective turns out to be crucial in deriving a stochastic approximation version of the algorithm.
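A minimal sketch of the quantile projection for a discrete distribution follows. It assumes the quantile levels are the midpoints (2k - 1) / (2K) for k = 1, ..., K, as in Dabney et al. (2017), and uses the generalised inverse CDF; both are assumptions made here because the defining expressions are not reproduced in this text. Function names are hypothetical.

```python
def quantile_midpoints(K):
    """Quantile levels (2k - 1) / (2K) for k = 1, ..., K (assumed convention)."""
    return [(2 * k - 1) / (2 * K) for k in range(1, K + 1)]


def inverse_cdf(atoms, weights, tau):
    """Generalised inverse CDF inf{z : F(z) >= tau} of a discrete distribution."""
    pairs = sorted(zip(atoms, weights))
    cum = 0.0
    for z, w in pairs:
        cum += w
        if cum >= tau - 1e-12:  # small tolerance for floating-point accumulation
            return z
    return pairs[-1][0]


def quantile_projection(atoms, weights, K):
    """Locations of K equally weighted atoms placed at the midpoint quantiles."""
    return [inverse_cdf(atoms, weights, tau) for tau in quantile_midpoints(K)]
```

For example, projecting the distribution with atoms 0, 1, 2 and weights 0.2, 0.5, 0.3 onto K = 2 atoms gives locations at the 0.25 and 0.75 quantiles.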
Stochastic approximation. As for CDRL, the update is typically not computable in practice, due to unknown/intractable dynamics. Instead, a stochastic target may be computed by using a transition , and updating each atom location at the current state-action pair by following the gradient of the QR loss:
Because the loss is affine in its second argument, this yields an unbiased estimator of the true gradient
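The stochastic update can be sketched in tabular form as follows. This is an illustration under stated assumptions: midpoint quantile levels, a target formed from the atoms at the sampled next state-action pair, and the QR loss (y - q)(tau - 1[y < q]) for a target sample y and location q. The name `qr_update` and its arguments are hypothetical.

```python
def qr_update(theta, next_theta, reward, gamma, lr):
    """One stochastic QR step on the atom locations at a state-action pair.

    theta      -- current atom locations at the updated state-action pair
    next_theta -- atom locations at the sampled next state-action pair
    """
    K = len(theta)
    taus = [(2 * k + 1) / (2 * K) for k in range(K)]  # midpoint levels (assumed)
    targets = [reward + gamma * z for z in next_theta]
    new_theta = []
    for q, tau in zip(theta, taus):
        # tau - 1[y < q] is the negative derivative of the QR loss in q,
        # averaged here over the atoms of the sampled target distribution.
        step = sum(tau - (1.0 if y < q else 0.0) for y in targets) / len(targets)
        new_theta.append(q + lr * step)
    return new_theta
```

Repeating this step against a fixed target moves each location towards the corresponding quantile of the target, up to oscillations on the order of the step size.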
Control. The methods for evaluation described above may be modified to yield control methods in exactly the same way as described for CDRL in Section A.2.
A.4 Quantiles versus expectiles
Quantiles of a distribution are given by the inverse of the cumulative distribution function. As such, they fundamentally represent threshold values for the cumulative probabilities. That is, the quantile at , , is greater than or equal to of the outcome values. In contrast, expectiles also take into account the magnitude of outcomes; the expectile at , , is such that the expectation of the deviations below of the random variable is equal to of the expectation of the deviations above . We illustrate these points in Figure 9.
Appendix B Proofs
B.1 Proofs of results from Section 3
See 3.2
Proof.
We first observe that the projection operator , defined in Section A.2, preserves each of the statistics , in the sense that for any distribution , we have for all . Secondly, we observe that the map is injective; each distribution has a unique vector of statistics. Thus, CDRL can indeed be interpreted as learning precisely the set of statistics . ∎
B.2 Proofs of results from Section 4.1
See 4.2
Proof.
We begin by introducing notation. Let be the ^{th} moment functional, for . We now compute
Thus, can be expressed in terms of and , as required. ∎
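The idea behind this computation can be checked numerically in a simplified setting. The sketch below assumes a deterministic immediate reward r, which is a simplification of the general case treated in the proof, and uses the binomial theorem: the k-th moment of r + gamma * Z is the sum over j from 0 to k of C(k, j) r^(k - j) gamma^j m_j(Z). The function names are hypothetical.

```python
from math import comb


def moments(atoms, weights, K):
    """Raw moments m_0, ..., m_K of a discrete distribution."""
    return [sum(w * z ** k for z, w in zip(atoms, weights)) for k in range(K + 1)]


def backed_up_moments(next_moments, reward, gamma):
    """Moments of reward + gamma * Z computed from the moments of Z alone."""
    K = len(next_moments) - 1
    return [
        sum(
            comb(k, j) * reward ** (k - j) * gamma ** j * next_moments[j]
            for j in range(k + 1)
        )
        for k in range(K + 1)
    ]
```

The first K moments of the backed-up distribution therefore depend on the next-state distribution only through its first K moments, which is the closure property being established.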
See 4.3
Proof.
Suppose form a Bellman closed set of statistical functionals of the form for some measurable , for each . Now note that for any MDP , we have the following equation:
for all , and for each . By assumption of Bellman closedness, the right-hand side of this equation may be written as a function of , , and the collection of statistics . Since this must hold across all valid sets of return distributions, it must be the case that each may be written as a function of , and ; we will write for some .
We next claim that is affine in . To see this, note that both and are affine as functions of the distribution , by assumption on the form of the statistics . Therefore too is affine on the (convex) codomain of .
Thus, we have