This is a TensorFlow implementation of DeepMind's A Distributional Perspective on Reinforcement Learning (C51-DDPG).
In this paper we argue for the fundamental importance of the value distribution: the distribution of the random return received by a reinforcement learning agent. This is in contrast to the common approach to reinforcement learning which models the expectation of this return, or value. Although there is an established body of literature studying the value distribution, thus far it has always been used for a specific purpose such as implementing risk-aware behaviour. We begin with theoretical results in both the policy evaluation and control settings, exposing a significant distributional instability in the latter. We then use the distributional perspective to design a new algorithm which applies Bellman's equation to the learning of approximate value distributions. We evaluate our algorithm using the suite of games from the Arcade Learning Environment. We obtain both state-of-the-art results and anecdotal evidence demonstrating the importance of the value distribution in approximate reinforcement learning. Finally, we combine theoretical and empirical evidence to highlight the ways in which the value distribution impacts learning in the approximate setting.
Contains an implementation of the Categorical DQN Reinforcement Learning Algorithms Described here: https://arxiv.org/pdf/1707.06887.pdf
One of the major tenets of reinforcement learning states that, when not otherwise constrained in its behaviour, an agent should aim to maximize its expected utility $Q$, or value (Sutton & Barto, 1998). Bellman’s equation succinctly describes this value in terms of the expected reward and expected outcome of the random transition $(x, a) \to (X', A')$:
$$Q(x, a) = \mathbb{E}\, R(x, a) + \gamma\, \mathbb{E}\, Q(X', A').$$
In this paper, we aim to go beyond the notion of value and argue in favour of a distributional perspective on reinforcement learning. Specifically, the main object of our study is the random return $Z$ whose expectation is the value $Q$. This random return is also described by a recursive equation, but one of a distributional nature:
$$Z(x, a) \overset{D}{=} R(x, a) + \gamma Z(X', A').$$
The distributional Bellman equation states that the distribution of $Z$ is characterized by the interaction of three random variables: the reward $R$, the next state-action $(X', A')$, and its random return $Z(X', A')$. By analogy with the well-known case, we call this quantity the value distribution.
Although the distributional perspective is almost as old as Bellman’s equation itself (Jaquette, 1973; Sobel, 1982; White, 1988), in reinforcement learning it has thus far been subordinated to specific purposes: to model parametric uncertainty (Dearden et al., 1998), to design risk-sensitive algorithms (Morimura et al., 2010b, a), or for theoretical analysis (Azar et al., 2012; Lattimore & Hutter, 2012). By contrast, we believe the value distribution has a central role to play in reinforcement learning.
Contraction of the policy evaluation Bellman operator. Basing ourselves on results by Rösler (1992) we show that, for a fixed policy, the Bellman operator over value distributions is a contraction in a maximal form of the Wasserstein (also called Kantorovich or Mallows) metric. Our particular choice of metric matters: the same operator is not a contraction in total variation, Kullback-Leibler divergence, or Kolmogorov distance.
Instability in the control setting. We will demonstrate an instability in the distributional version of Bellman’s optimality equation, in contrast to the policy evaluation case. Specifically, although the optimality operator is a contraction in expected value (matching the usual optimality result), it is not a contraction in any metric over distributions. These results provide evidence in favour of learning algorithms that model the effects of nonstationary policies.
Better approximations. From an algorithmic standpoint, there are many benefits to learning an approximate distribution rather than its approximate expectation. The distributional Bellman operator preserves multimodality in value distributions, which we believe leads to more stable learning. Approximating the full distribution also mitigates the effects of learning from a nonstationary policy. As a whole, we argue that this approach makes approximate reinforcement learning significantly better behaved.
We will illustrate the practical benefits of the distributional perspective in the context of the Arcade Learning Environment (Bellemare et al., 2013). By modelling the value distribution within a DQN agent (Mnih et al., 2015), we obtain considerably increased performance across the gamut of benchmark Atari 2600 games, and in fact achieve state-of-the-art performance on a number of games. Our results echo those of Veness et al. (2015), who obtained extremely fast learning by predicting Monte Carlo returns.
From a supervised learning perspective, learning the full value distribution might seem obvious: why restrict ourselves to the mean? The main distinction, of course, is that in our setting there are no given targets. Instead, we use Bellman’s equation to make the learning process tractable; we must, as Sutton & Barto (1998) put it, “learn a guess from a guess”. It is our belief that this guesswork ultimately carries more benefits than costs.
We consider an agent interacting with an environment in the standard fashion: at each step, the agent selects an action based on its current state, to which the environment responds with a reward and the next state. We model this interaction as a time-homogeneous Markov Decision Process $(\mathcal{X}, \mathcal{A}, R, P, \gamma)$. As usual, $\mathcal{X}$ and $\mathcal{A}$ are respectively the state and action spaces, $P$ is the transition kernel $P(\cdot \mid x, a)$, $\gamma \in [0, 1]$ is the discount factor, and $R$ is the reward function, which in this work we explicitly treat as a random variable. A stationary policy $\pi$ maps each state $x \in \mathcal{X}$ to a probability distribution over the action space $\mathcal{A}$.
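The return $Z^\pi$ is the sum of discounted rewards along the agent’s trajectory of interactions with the environment. The value function $Q^\pi$ of a policy $\pi$ describes the expected return from taking action $a \in \mathcal{A}$ from state $x \in \mathcal{X}$, then acting according to $\pi$:
$$Q^\pi(x, a) := \mathbb{E}\, Z^\pi(x, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(x_t, a_t)\right], \tag{1}$$
$$x_t \sim P(\cdot \mid x_{t-1}, a_{t-1}), \quad a_t \sim \pi(\cdot \mid x_t), \quad x_0 = x, \quad a_0 = a.$$
Fundamental to reinforcement learning is the use of Bellman’s equation (Bellman, 1957) to describe the value function:
$$Q^\pi(x, a) = \mathbb{E}\, R(x, a) + \gamma\, \mathbb{E}_{P, \pi}\, Q^\pi(x', a').$$
In reinforcement learning we are typically interested in acting so as to maximize the return. The most common approach for doing so involves the optimality equation
$$Q^*(x, a) = \mathbb{E}\, R(x, a) + \gamma\, \mathbb{E}_P \max_{a' \in \mathcal{A}} Q^*(x', a').$$
This equation has a unique fixed point $Q^*$, the optimal value function, corresponding to the set of optimal policies $\Pi^*$ ($\pi^*$ is optimal if $\mathbb{E}_{a \sim \pi^*} Q^*(x, a) = \max_a Q^*(x, a)$).
We view value functions as vectors in $\mathbb{R}^{\mathcal{X} \times \mathcal{A}}$, and the expected reward function as one such vector. In this context, the Bellman operator $\mathcal{T}^\pi$ and optimality operator $\mathcal{T}$ are
$$\mathcal{T}^\pi Q(x, a) := \mathbb{E}\, R(x, a) + \gamma\, \mathbb{E}_{P, \pi}\, Q(x', a') \tag{2}$$
$$\mathcal{T} Q(x, a) := \mathbb{E}\, R(x, a) + \gamma\, \mathbb{E}_P \max_{a' \in \mathcal{A}} Q(x', a'). \tag{3}$$
These operators are useful as they describe the expected behaviour of popular learning algorithms such as SARSA and Q-Learning. In particular they are both contraction mappings, and their repeated application to some initial $Q_0$ converges exponentially to $Q^\pi$ or $Q^*$, respectively (Bertsekas & Tsitsiklis, 1996).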
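As a small numerical illustration of the contraction property, the sketch below applies the optimality operator to two arbitrary value tables on a randomly generated MDP and checks that their sup-norm distance shrinks by at least a factor of the discount. The MDP (4 states, 2 actions, random rewards and transitions) and all function and variable names are assumptions made for this example; they do not come from the paper.

```python
import numpy as np

def bellman_optimality_operator(Q, R, P, gamma):
    """Apply (T Q)(x, a) = E R(x, a) + gamma * E_P max_a' Q(x', a').

    Q, R: arrays of shape (states, actions), with R holding expected rewards.
    P: array of shape (states, actions, next_states) of transition probabilities.
    """
    return R + gamma * P @ Q.max(axis=1)

# Hypothetical toy MDP, generated at random for illustration only.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
R = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

# Contraction check: one application shrinks the sup-norm distance by gamma.
Q1 = rng.normal(size=(n_states, n_actions))
Q2 = rng.normal(size=(n_states, n_actions))
before = np.abs(Q1 - Q2).max()
after = np.abs(
    bellman_optimality_operator(Q1, R, P, gamma)
    - bellman_optimality_operator(Q2, R, P, gamma)
).max()
contraction_holds = after <= gamma * before + 1e-12

# Repeated application converges to the fixed point of the operator.
Q_star = np.zeros((n_states, n_actions))
for _ in range(500):
    Q_star = bellman_optimality_operator(Q_star, R, P, gamma)
```

The same loop with a fixed policy in place of the maximum would approximate the policy evaluation fixed point instead.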
In this paper we take away the expectations inside Bellman’s equations and consider instead the full distribution of the random variable $Z^\pi$. From here on, we will view $Z^\pi$ as a mapping from state-action pairs to distributions over returns, and call it the value distribution.
Our first aim is to gain an understanding of the theoretical behaviour of the distributional analogues of the Bellman operators, in particular in the less well-understood control setting. The reader strictly interested in the algorithmic contribution may choose to skip this section.
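It will sometimes be convenient to make use of the probability space $(\Omega, \mathcal{F}, \Pr)$. The reader unfamiliar with measure theory may think of $\Omega$ as the space of all possible outcomes of an experiment (Billingsley, 1995). We will write $\|u\|_p$ to denote the $L_p$ norm of a vector $u \in \mathbb{R}^{\mathcal{X}}$ for $1 \le p \le \infty$; the same applies to vectors in $\mathbb{R}^{\mathcal{X} \times \mathcal{A}}$. The $L_p$ norm of a random vector $U : \Omega \to \mathbb{R}^{\mathcal{X}}$ (or $\mathbb{R}^{\mathcal{X} \times \mathcal{A}}$) is then $\|U\|_p := \left[\mathbb{E}\left[\|U(\omega)\|_p^p\right]\right]^{1/p}$, and for $p = \infty$ we have $\|U\|_\infty = \operatorname{ess\,sup} \|U(\omega)\|_\infty$ (we will omit the dependency on $\omega \in \Omega$ whenever unambiguous). We will denote the c.d.f. of a random variable $U$ by $F_U(y) := \Pr\{U \le y\}$, and its inverse c.d.f. by $F_U^{-1}(q) := \inf\{y : F_U(y) \ge q\}$.
A distributional equation $U \overset{D}{:=} V$ indicates that the random variable $U$ is distributed according to the same law as $V$. Without loss of generality, the reader can understand the two sides of a distributional equation as relating the distributions of two independent random variables. Distributional equations have been used in reinforcement learning by Engel et al. (2005); Morimura et al. (2010a) among others, and in operations research by White (1988).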
The main tool for our analysis is the Wasserstein metric $d_p$ between cumulative distribution functions (see e.g. Bickel & Freedman, 1981, where it is called the Mallows metric). For $F$, $G$ two c.d.fs over the reals, it is defined as
$$d_p(F, G) := \inf_{U, V} \|U - V\|_p,$$
where the infimum is taken over all pairs of random variables $(U, V)$ with respective cumulative distributions $F$ and $G$. The infimum is attained by the inverse c.d.f. transform of a random variable $\mathcal{U}$ uniformly distributed on $[0, 1]$:
$$d_p(F, G) = \|F^{-1}(\mathcal{U}) - G^{-1}(\mathcal{U})\|_p.$$
For $p < \infty$ this is more explicitly written as
$$d_p(F, G) = \left(\int_0^1 \left|F^{-1}(u) - G^{-1}(u)\right|^p du\right)^{1/p}.$$
Given two random variables $U$, $V$ with c.d.fs $F_U$, $F_V$, we will write $d_p(U, V) := d_p(F_U, F_V)$. We will find it convenient to conflate the random variables under consideration with their versions under the $\inf$, writing
$$d_p(U, V) = \|U - V\|_p$$
whenever unambiguous; we believe the greater legibility justifies the technical inaccuracy. Finally, we extend this metric to vectors of random variables, such as value distributions, using the corresponding $L_p$ norm.
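For two empirical distributions with the same number of equally weighted samples, the inverse c.d.f. form of the Wasserstein metric reduces to comparing sorted samples. The sketch below uses that fact; the function name, the equal-sample-size restriction, and the Gaussian test data are assumptions of this example rather than anything prescribed by the paper.

```python
import numpy as np

def wasserstein_p(samples_u, samples_v, p=1.0):
    """Empirical Wasserstein distance between two equally sized samples.

    Sorting each sample evaluates its empirical inverse c.d.f. on a common
    grid of quantiles, so the integral over [0, 1] becomes a mean over
    order statistics.
    """
    u = np.sort(np.asarray(samples_u, dtype=float))
    v = np.sort(np.asarray(samples_v, dtype=float))
    if u.shape != v.shape:
        raise ValueError("this sketch assumes equally sized samples")
    gaps = np.abs(u - v)
    if np.isinf(p):
        return float(gaps.max())
    return float(np.mean(gaps ** p) ** (1.0 / p))

# Hypothetical data: two Gaussian samples.
rng = np.random.default_rng(0)
u = rng.normal(size=1000)
v = rng.normal(loc=1.0, scale=2.0, size=1000)

d_base = wasserstein_p(u, v, p=2)
d_scaled = wasserstein_p(0.5 * u, 0.5 * v, p=2)    # scaling both by a = 0.5
d_shifted = wasserstein_p(u + 3.0, v + 3.0, p=2)   # adding the same constant
```

In this empirical setting, scaling both samples by 0.5 halves the distance and shifting both by a constant leaves it unchanged, which is the kind of behaviour the analysis relies on.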
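Consider a scalar $a$ and a random variable $A$ independent of $U, V$. The metric $d_p$ has the following properties:
$$d_p(aU, aV) \le |a|\, d_p(U, V) \tag{P1}$$
$$d_p(A + U, A + V) \le d_p(U, V) \tag{P2}$$
$$d_p(AU, AV) \le \|A\|_p\, d_p(U, V). \tag{P3}$$
We will need the following additional property, which makes no independence assumptions on its variables. Its proof, and that of later results, is given in the appendix.
Lemma 1 (Partition lemma). Let $A_1, A_2, \dots$ be a set of random variables describing a partition of $\Omega$, i.e. $A_i(\omega) \in \{0, 1\}$ and for any $\omega$ there is exactly one $A_i$ with $A_i(\omega) = 1$. Let $U, V$ be two random variables. Then
$$d_p(U, V) \le \sum_i d_p(A_i U, A_i V).$$
Let $\mathcal{Z}$ denote the space of value distributions with bounded moments. For two value distributions $Z_1, Z_2 \in \mathcal{Z}$ we will make use of a maximal form of the Wasserstein metric:
$$\bar{d}_p(Z_1, Z_2) := \sup_{x, a} d_p(Z_1(x, a), Z_2(x, a)).$$
We will use $\bar{d}_p$ to establish the convergence of the distributional Bellman operators.
Lemma 2. $\bar{d}_p$ is a metric over value distributions.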
In the policy evaluation setting (Sutton & Barto, 1998) we are interested in the value function $V^\pi$ associated with a given policy $\pi$. The analogue here is the value distribution $Z^\pi$. In this section we characterize $Z^\pi$ and study the behaviour of the policy evaluation operator $\mathcal{T}^\pi$. We emphasize that $Z^\pi$ describes the intrinsic randomness of the agent’s interactions with its environment, rather than some measure of uncertainty about the environment itself.
We view the reward function as a random vector $R \in \mathcal{Z}$, and define the transition operator $P^\pi : \mathcal{Z} \to \mathcal{Z}$
$$P^\pi Z(x, a) \overset{D}{:=} Z(X', A'), \quad X' \sim P(\cdot \mid x, a), \quad A' \sim \pi(\cdot \mid X'), \tag{4}$$
where we use capital letters to emphasize the random nature of the next state-action pair $(X', A')$. We define the distributional Bellman operator $\mathcal{T}^\pi : \mathcal{Z} \to \mathcal{Z}$ as
$$\mathcal{T}^\pi Z(x, a) \overset{D}{:=} R(x, a) + \gamma P^\pi Z(x, a). \tag{5}$$
While $\mathcal{T}^\pi$ bears a surface resemblance to the usual Bellman operator (2), it is fundamentally different. In particular, three sources of randomness define the compound distribution $\mathcal{T}^\pi Z$:
The randomness in the reward $R$,
The randomness in the transition $P^\pi$, and
The next-state value distribution $Z(X', A')$.
In particular, we make the usual assumption that these three quantities are independent. In this section we will show that (5) is a contraction mapping whose unique fixed point is the random return $Z^\pi$.
Consider the process $Z_{k+1} := \mathcal{T}^\pi Z_k$, starting with some $Z_0 \in \mathcal{Z}$. We may expect the limiting expectation of $\{Z_k\}$ to converge exponentially quickly, as usual, to $Q^\pi$. As we now show, the process converges in a stronger sense: $\mathcal{T}^\pi$ is a contraction in $\bar{d}_p$, which implies that all moments also converge exponentially quickly.
Lemma 3. $\mathcal{T}^\pi : \mathcal{Z} \to \mathcal{Z}$ is a $\gamma$-contraction in $\bar{d}_p$.
Using Lemma 3, we conclude using Banach’s fixed point theorem that $\mathcal{T}^\pi$ has a unique fixed point. By inspection, this fixed point must be $Z^\pi$ as defined in (1). As we assume all moments are bounded, this is sufficient to conclude that the sequence $\{Z_k\}$ converges to $Z^\pi$ in $\bar{d}_p$ for $1 \le p \le \infty$.
To conclude, we remark that not all distributional metrics are equal; for example, Chung & Sobel (1987) have shown that $\mathcal{T}^\pi$ is not a contraction in total variation distance. Similar results can be derived for the Kullback-Leibler divergence and the Kolmogorov distance.
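Observe that $d_2(U, V)$ (and more generally, $d_p$) relates to a coupling $C(\omega) := U(\omega) - V(\omega)$, in the sense that
$$d_2^2(U, V) \le \mathbb{E}\left[(U - V)^2\right] = \mathbb{V}(C) + (\mathbb{E}\, C)^2.$$
As a result, we cannot directly use $d_2$ to bound the variance difference $\left|\mathbb{V}(\mathcal{T}^\pi Z(x, a)) - \mathbb{V}(Z^\pi(x, a))\right|$. However, $\mathcal{T}^\pi$ is in fact a contraction in variance (Sobel, 1982, see also appendix). In general, $\mathcal{T}^\pi$ is not a contraction in the $p$-th centered moment, $p > 2$, but the centered moments of the iterates $\{Z_k\}$ still converge exponentially quickly to those of $Z^\pi$; the proof extends the result of Rösler (1992).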
Thus far we have considered a fixed policy $\pi$, and studied the behaviour of its associated operator $\mathcal{T}^\pi$. We now set out to understand the distributional operators of the control setting, where we seek a policy $\pi$ that maximizes value, and the corresponding notion of an optimal value distribution. As with the optimal value function, this notion is intimately tied to that of an optimal policy. However, while all optimal policies attain the same value $Q^*$, in our case a difficulty arises: in general there are many optimal value distributions.
In this section we show that the distributional analogue of the Bellman optimality operator converges, in a weak sense, to the set of optimal value distributions. However, this operator is not a contraction in any metric between distributions, and is in general much more temperamental than the policy evaluation operators. We believe the convergence issues we outline here are a symptom of the inherent instability of greedy updates, as highlighted by e.g. Tsitsiklis (2002) and most recently Harutyunyan et al. (2016).
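Let $\Pi^*$ be the set of optimal policies. We begin by characterizing what we mean by an optimal value distribution.
Definition 1 (Optimal value distribution). An optimal value distribution is the v.d. of an optimal policy. The set of optimal value distributions is $\mathcal{Z}^* := \{Z^{\pi^*} : \pi^* \in \Pi^*\}$.
We emphasize that not all value distributions with expectation $Q^*$ are optimal: they must match the full distribution of the return under some optimal policy.
Definition 2. A greedy policy $\pi$ for $Z \in \mathcal{Z}$ maximizes the expectation of $Z$. The set of greedy policies for $Z$ is
$$\mathcal{G}_Z := \left\{\pi : \sum_a \pi(a \mid x)\, \mathbb{E}\, Z(x, a) = \max_{a' \in \mathcal{A}} \mathbb{E}\, Z(x, a')\right\}.$$
Recall that the expected Bellman optimality operator $\mathcal{T}$ is
$$\mathcal{T} Q(x, a) = \mathbb{E}\, R(x, a) + \gamma\, \mathbb{E}_P \max_{a' \in \mathcal{A}} Q(x', a'). \tag{6}$$
The maximization at $x'$ corresponds to some greedy policy. Although this policy is implicit in (6), we cannot ignore it in the distributional setting. We will call a distributional Bellman optimality operator any operator $\mathcal{T}$ which implements a greedy selection rule, i.e.
$$\mathcal{T} Z = \mathcal{T}^\pi Z \text{ for some } \pi \in \mathcal{G}_Z.$$
As in the policy evaluation setting, we are interested in the behaviour of the iterates $Z_{k+1} := \mathcal{T} Z_k$, $Z_0 \in \mathcal{Z}$. Our first result is to assert that $\mathbb{E}\, Z_k$ behaves as expected.
Lemma 4. Let $Z_1, Z_2 \in \mathcal{Z}$. Then
$$\|\mathbb{E}\, \mathcal{T} Z_1 - \mathbb{E}\, \mathcal{T} Z_2\|_\infty \le \gamma\, \|\mathbb{E}\, Z_1 - \mathbb{E}\, Z_2\|_\infty,$$
and in particular $\mathbb{E}\, Z_k \to Q^*$ exponentially quickly.
By inspecting Lemma 4, we might expect that $Z_k$ converges quickly in $\bar{d}_p$ to some fixed point in $\mathcal{Z}^*$. Unfortunately, convergence is neither quick nor assured to reach a fixed point. In fact, the best we can hope for is pointwise convergence, not even to the set $\mathcal{Z}^*$ but to the larger set of nonstationary optimal value distributions.
Definition 3. A nonstationary optimal value distribution $Z^{**}$ is the value distribution corresponding to a sequence of optimal policies. The set of n.o.v.d. is $\mathcal{Z}^{**}$.
Theorem 1 (Convergence in the control setting). Let $\mathcal{X}$ be measurable and suppose that $\mathcal{A}$ is finite. Then
$$\lim_{k \to \infty} \inf_{Z^{**} \in \mathcal{Z}^{**}} d_p(Z_k(x, a), Z^{**}(x, a)) = 0 \quad \forall x, a.$$
If $\mathcal{X}$ is finite, then $Z_k$ converges to $\mathcal{Z}^{**}$ uniformly. Furthermore, if there is a total ordering $\prec$ on $\Pi^*$, such that for any $Z^* \in \mathcal{Z}^*$,
$$\mathcal{T} Z^* = \mathcal{T}^\pi Z^* \text{ with } \pi \in \mathcal{G}_{Z^*}, \quad \pi \prec \pi' \;\; \forall \pi' \in \mathcal{G}_{Z^*} \setminus \{\pi\},$$
then $\mathcal{T}$ has a unique fixed point $Z^* \in \mathcal{Z}^*$.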
Comparing Theorem 1 to Lemma 4 reveals a significant difference between the distributional framework and the usual setting of expected return. While the mean of $Z_k$ converges exponentially quickly to $Q^*$, its distribution need not be as well-behaved! To emphasize this difference, we now provide a number of negative results concerning $\mathcal{T}$.
Proposition 1. The operator $\mathcal{T}$ is not a contraction.
Consider the following example (Figure 2, left). There are two states, $x_1$ and $x_2$; a unique transition from $x_1$ to $x_2$; from $x_2$, action $a_1$ yields no reward, while the optimal action $a_2$ yields $1 + \epsilon$ or $-1 + \epsilon$ with equal probability. Both actions are terminal. There is a unique optimal policy and therefore a unique fixed point $Z^*$. Now consider $Z$ as given in Figure 2 (right), and its distance to $Z^*$:
$$\bar{d}_1(Z, Z^*) = d_1(Z(x_2, a_2), Z^*(x_2, a_2)) = 2\epsilon,$$
where we made use of the fact that $Z = Z^*$ everywhere except at $(x_2, a_2)$. When we apply $\mathcal{T}$ to $Z$, however, the greedy action $a_1$ is selected and $\mathcal{T} Z(x_1) = Z(x_2, a_1)$. But
$$\bar{d}_1(\mathcal{T} Z, \mathcal{T} Z^*) = d_1(\mathcal{T} Z(x_1), Z^*(x_1)) = \tfrac{1}{2}|1 - \epsilon| + \tfrac{1}{2}|1 + \epsilon| > 2\epsilon$$
for a sufficiently small $\epsilon$. This shows that the undiscounted update is not a nonexpansion: $\bar{d}_1(\mathcal{T} Z, \mathcal{T} Z^*) > \bar{d}_1(Z, Z^*)$. With $\gamma < 1$, the same proof shows it is not a contraction. Using a more technically involved argument, we can extend this result to any metric which separates $Z$ and $\mathcal{T} Z$.
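The two-state counterexample can be checked numerically. In the sketch below each return distribution is stored as two equally likely outcomes, and the perturbed entry at the second action of the second state is taken to be $-\epsilon \pm 1$; that specific value is an assumption chosen to be consistent with the stated distance of $2\epsilon$, as are the variable names and the choice $\epsilon = 0.1$.

```python
import numpy as np

def w1(u, v):
    """1-Wasserstein distance between two equally weighted two-point samples."""
    return float(np.mean(np.abs(np.sort(u) - np.sort(v))))

eps = 0.1  # any value in (0, 0.5) shows the effect

# Return distributions at x2, each listed as two equally likely outcomes.
z_star_x2_a1 = np.array([0.0, 0.0])              # a1: no reward
z_star_x2_a2 = np.array([eps - 1.0, eps + 1.0])  # a2: optimal, mean eps
z_x2_a1 = z_star_x2_a1.copy()                    # Z agrees with Z* here
z_x2_a2 = np.array([-eps - 1.0, -eps + 1.0])     # assumed perturbation, mean -eps

# Distance before the update: Z and Z* differ only at (x2, a2).
dist_before = w1(z_x2_a2, z_star_x2_a2)

# Undiscounted greedy update at x1 (no reward on the x1 -> x2 transition).
tz_x1 = z_x2_a1 if z_x2_a1.mean() > z_x2_a2.mean() else z_x2_a2
tz_star_x1 = (
    z_star_x2_a2 if z_star_x2_a2.mean() > z_star_x2_a1.mean() else z_star_x2_a1
)

# Distance after the update: the greedy choices differ, so the gap grows.
dist_after = w1(tz_x1, tz_star_x1)
```

Under these assumptions the distance grows from $2\epsilon = 0.2$ to $1$, even though the two expectations at the first state differ only by $\epsilon$.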
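Proposition 2. Not all optimality operators have a fixed point $Z^* = \mathcal{T} Z^*$.
To see this, consider the same example, now with $\epsilon = 0$, and a greedy operator $\mathcal{T}$ which breaks ties by picking $a_2$ if $Z(x_1) = 0$, and $a_1$ otherwise. Then the sequence $\mathcal{T} Z^*(x_1), \mathcal{T}^2 Z^*(x_1), \dots$ alternates between $Z^*(x_2, a_1)$ and $Z^*(x_2, a_2)$.
Proposition 3. That $\mathcal{T}$ has a fixed point $Z^* = \mathcal{T} Z^*$ is insufficient to guarantee the convergence of $\{Z_k\}$ to $\mathcal{Z}^*$.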
Theorem 1 paints a rather bleak picture of the control setting. It remains to be seen whether the dynamical eccentricities highlighted here actually arise in practice. One open question is whether theoretically more stable behaviour can be derived using stochastic policies, for example from conservative policy iteration (Kakade & Langford, 2002).
In this section we propose an algorithm based on the distributional Bellman optimality operator. In particular, this will require choosing an approximating distribution. Although the Gaussian case has previously been considered (Morimura et al., 2010a; Tamar et al., 2016), to the best of our knowledge we are the first to use a rich class of parametric distributions.
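We will model the value distribution using a discrete distribution parametrized by $N \in \mathbb{N}$ and $V_{\text{MIN}}, V_{\text{MAX}} \in \mathbb{R}$, and whose support is the set of atoms $\{z_i = V_{\text{MIN}} + i \triangle z : 0 \le i < N\}$, $\triangle z := \frac{V_{\text{MAX}} - V_{\text{MIN}}}{N - 1}$. In a sense, these atoms are the “canonical returns” of our distribution. The atom probabilities are given by a parametric model $\theta : \mathcal{X} \times \mathcal{A} \to \mathbb{R}^N$
$$Z_\theta(x, a) = z_i \quad \text{w.p. } p_i(x, a) := \frac{e^{\theta_i(x, a)}}{\sum_j e^{\theta_j(x, a)}}.$$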
The discrete distribution has the advantages of being highly expressive and computationally friendly (see e.g. Van den Oord et al., 2016).
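A minimal sketch of this parametrization follows: evenly spaced atoms between the two support bounds, and a softmax over per-action logits to obtain atom probabilities. The function names, the three-action example, and the all-zero logits are assumptions for illustration; in an agent the logits would come from a network.

```python
import numpy as np

def make_atoms(v_min, v_max, n_atoms):
    """Evenly spaced support z_i = v_min + i * delta_z, for i = 0..n_atoms-1."""
    return np.linspace(v_min, v_max, n_atoms)

def atom_probabilities(logits):
    """Softmax over the last axis, shifted by the max for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

atoms = make_atoms(-10.0, 10.0, 51)
delta_z = atoms[1] - atoms[0]

# Hypothetical logits for one state and three actions: shape (actions, atoms).
logits = np.zeros((3, 51))
probs = atom_probabilities(logits)

# Expected action-values are the probability-weighted means of the atoms.
q_values = probs @ atoms
```

With all-zero logits every action gets the uniform distribution over atoms, so each expected action-value is the midpoint of the support.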
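Using a discrete distribution poses a problem: the Bellman update $\mathcal{T} Z_\theta$ and our parametrization $Z_\theta$ almost always have disjoint supports. From the analysis of Section 3 it would seem natural to minimize the Wasserstein metric (viewed as a loss) between $\mathcal{T} Z_\theta$ and $Z_\theta$, which is also conveniently robust to discrepancies in support. However, a second issue prevents this: in practice we are typically restricted to learning from sample transitions, which is not possible under the Wasserstein loss (see Prop. 5 and toy results in the appendix).
Instead, we project the sample Bellman update $\hat{\mathcal{T}} Z_\theta$ onto the support of $Z_\theta$ (Figure 1, Algorithm 1), effectively reducing the Bellman update to multiclass classification. Let $\pi$ be the greedy policy w.r.t. $\mathbb{E}\, Z_\theta$. Given a sample transition $(x, a, r, x')$, we compute the Bellman update $\hat{\mathcal{T}} z_j := r + \gamma z_j$ for each atom $z_j$, then distribute its probability $p_j(x', \pi(x'))$ to the immediate neighbours of $\hat{\mathcal{T}} z_j$. The $i$-th component of the projected update $\Phi \hat{\mathcal{T}} Z_\theta(x, a)$ is
$$(\Phi \hat{\mathcal{T}} Z_\theta(x, a))_i = \sum_{j=0}^{N-1} \left[1 - \frac{\left|[\hat{\mathcal{T}} z_j]_{V_{\text{MIN}}}^{V_{\text{MAX}}} - z_i\right|}{\triangle z}\right]_0^1 p_j(x', \pi(x')), \tag{7}$$
where $[\cdot]_a^b$ bounds its argument in the range $[a, b]$ (Algorithm 1 computes this projection in time linear in $N$). As is usual, we view the next-state distribution as parametrized by a fixed parameter $\tilde{\theta}$. The sample loss $\mathcal{L}_{x, a}(\theta)$ is the cross-entropy term of the KL divergence
$$D_{\text{KL}}\left(\Phi \hat{\mathcal{T}} Z_{\tilde{\theta}}(x, a) \,\|\, Z_\theta(x, a)\right),$$
which is readily minimized e.g. using gradient descent. We call this choice of distribution and loss the categorical algorithm. When $N = 2$, a simple one-parameter alternative is $\Phi \hat{\mathcal{T}} Z_\theta(x, a) := \left[(\mathbb{E}[\hat{\mathcal{T}} Z_\theta(x, a)] - V_{\text{MIN}}) / \triangle z\right]_0^1$; we call this the Bernoulli algorithm. We note that, while these algorithms appear unrelated to the Wasserstein metric, recent work (Bellemare et al., 2017) hints at a deeper connection.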
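The projection step can be sketched directly from its description: shift and shrink each atom by the sampled reward and discount, clip to the support, and split each atom's probability between its two nearest neighbours in proportion to proximity. The version below builds the full weight matrix, so it costs quadratic rather than linear time in the number of atoms; it is an illustrative NumPy sketch with assumed function names, not the paper's Algorithm 1 or its TensorFlow implementation.

```python
import numpy as np

def project_categorical(next_probs, reward, gamma, atoms):
    """Project the sample update r + gamma * z_j back onto the fixed atoms.

    next_probs: probabilities p_j(x', a*) of the greedy next action, shape (N,).
    atoms: evenly spaced support z_0 < ... < z_{N-1}, shape (N,).
    """
    v_min, v_max = atoms[0], atoms[-1]
    delta_z = (v_max - v_min) / (len(atoms) - 1)
    shifted = np.clip(reward + gamma * atoms, v_min, v_max)  # clipped r + gamma z_j
    # weights[i, j]: share of atom j's mass assigned to atom i (triangular kernel).
    weights = np.clip(
        1.0 - np.abs(shifted[None, :] - atoms[:, None]) / delta_z, 0.0, 1.0
    )
    return weights @ next_probs

def cross_entropy_loss(target_probs, logits):
    """Cross-entropy between a projected target and softmax(logits)."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -float(np.sum(target_probs * log_probs))

# Tiny example: three atoms, all next-state mass on the middle atom.
atoms = np.linspace(0.0, 2.0, 3)
next_probs = np.array([0.0, 1.0, 0.0])
target = project_categorical(next_probs, reward=0.5, gamma=1.0, atoms=atoms)
loss = cross_entropy_loss(target, np.zeros(3))
```

In the tiny example the middle atom moves to 1.5, halfway between the atoms at 1 and 2, so its mass is split evenly between them.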
To understand the approach in a complex setting, we applied the categorical algorithm to games from the Arcade Learning Environment (ALE; Bellemare et al., 2013). While the ALE is deterministic, stochasticity does occur in a number of guises: 1) from state aliasing, 2) learning from a nonstationary policy, and 3) from approximation errors. We used five training games (Fig 3) and 52 testing games.
For our study, we use the DQN architecture (Mnih et al., 2015), but output the atom probabilities $p_i(x, a)$ instead of action-values, and chose $V_{\text{MAX}} = -V_{\text{MIN}} = 10$ from preliminary experiments over the training games. We call the resulting architecture Categorical DQN.
We replace the squared loss $(r + \gamma \max_{a'} Q(x', a') - Q(x, a))^2$ by $\mathcal{L}_{x, a}(\theta)$ and train the network to minimize this loss (for $N = 51$, our TensorFlow implementation trains at roughly 75% of DQN’s speed). As in DQN, we use a simple $\epsilon$-greedy policy over the expected action-values; we leave as future work the many ways in which an agent could select actions on the basis of the full distribution. The rest of our training regime matches Mnih et al.’s, including the use of a target network for $\tilde{\theta}$.
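Action selection over expected action-values can be sketched as follows: collapse each action's atom probabilities to a mean, then act greedily except for an occasional uniformly random action. The function name, argument layout, and the two-action example are assumptions of this sketch.

```python
import numpy as np

def epsilon_greedy_action(probs, atoms, epsilon, rng):
    """Pick an action from per-action atom probabilities of shape (actions, atoms)."""
    q_values = probs @ atoms              # expected return of each action
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore uniformly
    return int(np.argmax(q_values))       # exploit the highest expectation

atoms = np.linspace(-10.0, 10.0, 51)
probs = np.zeros((2, 51))
probs[0, 0] = 1.0    # action 0: all mass on the lowest atom
probs[1, -1] = 1.0   # action 1: all mass on the highest atom
rng = np.random.default_rng(0)
greedy_choice = epsilon_greedy_action(probs, atoms, epsilon=0.0, rng=rng)
```

Only the means enter the decision here; two actions with equal means but very different spreads would be treated identically.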
Figure 4 illustrates the typical value distributions we observed in our experiments. In this example, three actions (those including the button press) lead to the agent releasing its laser too early and eventually losing the game. The corresponding distributions reflect this: they assign a significant probability to 0 (the terminal value). The safe actions have similar distributions (left, which tracks the invaders’ movement, is slightly favoured). This example helps explain why our approach is so successful: the distributional update keeps separated the low-value, “losing” event from the high-value, “survival” event, rather than average them into one (unrealizable) expectation (video: http://youtu.be/yFBwyPuO2Vg).
One surprising fact is that the distributions are not concentrated on one or two values, in spite of the ALE’s determinism, but are often close to Gaussians. We believe this is due to our discretizing the diffusion process induced by $\gamma$.
We began by studying our algorithm’s performance on the training games in relation to the number of atoms (Figure 3). For this experiment, we set $\epsilon = 0.05$. From the data, it is clear that using too few atoms can lead to poor behaviour, and that more atoms always increases performance; this is not immediately obvious as we may have expected to saturate the network capacity. The difference in performance between the 51-atom version and DQN is particularly striking: the latter is outperformed in all five games, and in Seaquest we attain state-of-the-art performance. As an additional point of comparison, the single-parameter Bernoulli algorithm performs better than DQN in 3 games out of 5, and is most notably more robust in Asterix.
One interesting outcome of this experiment was to find out that our method does pick up on stochasticity. Pong exhibits intrinsic randomness: the exact timing of the reward depends on internal registers and is truly unobservable. We see this clearly reflected in the agent’s prediction (Figure 5): over five consecutive frames, the value distribution shows two modes indicating the agent’s belief that it has yet to receive a reward. Interestingly, since the agent’s state does not include past rewards, it cannot even extinguish the prediction after receiving the reward, explaining the relative proportions of the modes.
The performance of the 51-atom agent (from here onwards, C51) on the training games, presented in the last section, is particularly remarkable given that it involved none of the other algorithmic ideas present in state-of-the-art agents. We next asked whether incorporating the most common hyperparameter choice, namely a smaller training $\epsilon$, could lead to even better results. Specifically, we set $\epsilon = 0.01$ (instead of $0.05$); furthermore, every 1 million frames, we evaluate our agent’s performance with $\epsilon = 0.001$.
We compare our algorithm to DQN ($\epsilon = 0.01$), Double DQN (van Hasselt et al., 2016), the Dueling architecture (Wang et al., 2016), and Prioritized Replay (Schaul et al., 2016), comparing the best evaluation score achieved during training. We see that C51 significantly outperforms these other algorithms (Figures 6 and 7). In fact, C51 surpasses the current state-of-the-art by a large margin in a number of games, most notably Seaquest. One particularly striking fact is the algorithm’s good performance on sparse reward games, for example Venture and Private Eye. This suggests that value distributions are better able to propagate rarely occurring events. Full results are provided in the appendix.
We also include in the appendix (Figure 12) a comparison, averaged over 3 seeds, showing the number of games in which C51’s training performance outperforms fully-trained DQN and human players. These results continue to show dramatic improvements, and are more representative of an agent’s average performance. Within 50 million frames, C51 has outperformed a fully trained DQN agent on 45 out of 57 games. This suggests that the full 200 million training frames, and its ensuing computational cost, are unnecessary for evaluating reinforcement learning algorithms within the ALE.
The most recent version of the ALE contains a stochastic execution mechanism designed to ward against trajectory overfitting. Specifically, on each frame the environment rejects the agent’s selected action with probability $p = 0.25$. Although DQN is mostly robust to stochastic execution, there are a few games in which its performance is reduced. On a score scale normalized with respect to the random and DQN agents, C51 obtains mean and median score improvements of 126% and 21.5% respectively, confirming the benefits of C51 beyond the deterministic setting.
In this work we sought a more complete picture of reinforcement learning, one that involves value distributions. We found that learning value distributions is a powerful notion that allows us to surpass most gains previously made on Atari 2600, without further algorithmic adjustments.
It is surprising that, when we use a policy which aims to maximize expected return, we should see any difference in performance. The distinction we wish to make is that learning distributions matters in the presence of approximation. We now outline some possible reasons.
Reduced chattering. Our results from Section 3.4 highlighted a significant instability in the Bellman optimality operator. When combined with function approximation, this instability may prevent the policy from converging, what Gordon (1995) called chattering. We believe the gradient-based categorical algorithm is able to mitigate these effects by effectively averaging the different distributions, similar to conservative policy iteration (Kakade & Langford, 2002). While the chattering persists, it is integrated to the approximate solution.
State aliasing. Even in a deterministic environment, state aliasing may result in effective stochasticity. McCallum (1995), for example, showed the importance of coupling representation learning with policy learning in partially observable domains. We saw an example of state aliasing in Pong, where the agent could not exactly predict the reward timing. Again, by explicitly modelling the resulting distribution we provide a more stable learning target.
A richer set of predictions. A recurring theme in artificial intelligence is the idea of an agent learning from a multitude of predictions (Caruana 1997; Utgoff & Stracuzzi 2002; Sutton et al. 2011; Jaderberg et al. 2017). The distributional approach naturally provides us with a rich set of auxiliary predictions, namely: the probability that the return will take on a particular value. Unlike previously proposed approaches, however, the accuracy of these predictions is tightly coupled with the agent’s performance.
Framework for inductive bias. The distributional perspective on reinforcement learning allows a more natural framework within which we can impose assumptions about the domain or the learning problem itself. In this work we used distributions with support bounded in $[V_{\text{MIN}}, V_{\text{MAX}}]$. Treating this support as a hyperparameter allows us to change the optimization problem by treating all extremal returns (e.g. greater than $V_{\text{MAX}}$) as equivalent. Surprisingly, a similar value clipping in DQN significantly degrades performance in most games. To take another example: interpreting the discount factor $\gamma$ as a proper probability, as some authors have argued, leads to a different algorithm.
Well-behaved optimization. It is well-accepted that the KL divergence between categorical distributions is a reasonably easy loss to minimize. This may explain some of our empirical performance. Yet early experiments with alternative losses, such as KL divergence between continuous densities, were not fruitful, in part because the KL divergence is insensitive to the values of its outcomes. A closer minimization of the Wasserstein metric should yield even better results than what we presented here.
In closing, we believe our results highlight the need to account for distribution in the design, theoretical or otherwise, of algorithms.
The authors acknowledge the important role played by their colleagues at DeepMind throughout the development of this work. Special thanks to Yee Whye Teh, Alex Graves, Joel Veness, Guillaume Desjardins, Tom Schaul, David Silver, Andre Barreto, Max Jaderberg, Mohammad Azar, Georg Ostrovski, Bernardo Avila Pires, Olivier Pietquin, Audrunas Gruslys, Tom Stepleton, Aaron van den Oord; and particularly Chris Maddison for his comprehensive review of an earlier draft. Thanks also to Marek Petrik for pointers to the relevant literature, and Mark Rowland for fine-tuning details in the final version.
The camera-ready copy of this paper incorrectly reported a mean score of 1010% for C51. The corrected figure stands at 701%, which remains higher than the other comparable baselines. The median score remains unchanged at 178%.
The error was due to evaluation episodes in one game (Atlantis) lasting over 30 minutes; in comparison, the other results presented here cap episodes at 30 minutes, as is standard. The previously reported score on Atlantis was 3.7 million; our 30-minute score is 841,075, which we believe is close to the achievable maximum in this time frame. Capping at 30 minutes brings our human-normalized score on Atlantis from 22824% to a mere (!) 5199%, unfortunately enough to noticeably affect the mean score, whose sensitivity to outliers is well-documented.
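The size of the correction can be sanity-checked with simple arithmetic, assuming the mean is an unweighted average of human-normalized scores over 57 games (the game count used elsewhere in the evaluation) and that only the Atlantis score changed. Both of those are assumptions of this sketch.

```python
# Assumption: the mean is an unweighted average over 57 games and only
# the Atlantis entry changed between the two reported figures.
n_games = 57
atlantis_uncapped = 22824.0   # human-normalized score (%), uncapped episodes
atlantis_capped = 5199.0      # human-normalized score (%), 30-minute cap
reported_mean = 1010.0        # mean (%) in the camera-ready copy

# Changing one game's score by d shifts the mean by d / n_games.
mean_shift = (atlantis_uncapped - atlantis_capped) / n_games
corrected_mean = reported_mean - mean_shift
```

Under these assumptions the shift is about 309 percentage points, which takes 1010% to roughly 701% and matches the corrected figure.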
Proceedings of the International Conference on Machine Learning, 2012.
An expectation maximization algorithm for continuous Markov decision processes with arbitrary reward. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2009.
Parametric return density estimation for reinforcement learning. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2010a.
ICML Workshop on Deep Learning, 2015.
Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2), 2012.
Pixel recurrent neural networks. In Proceedings of the International Conference on Machine Learning, 2016.
To the best of our knowledge, the work closest to ours are two papers (Morimura et al., 2010b, a) studying the distributional Bellman equation from the perspective of its cumulative distribution functions. The authors propose both parametric and nonparametric solutions to learn distributions for risk-sensitive reinforcement learning. They also provide some theoretical analysis for the policy evaluation setting, including a consistency result in the nonparametric case. By contrast, we also analyze the control setting, and emphasize the use of the distributional equations to improve approximate reinforcement learning.
The variance of the return has been extensively studied in the risk-sensitive setting. Of note, Tamar et al. (2016) analyze the use of linear function approximation to learn this variance for policy evaluation, and Prashanth & Ghavamzadeh (2013) estimate the return variance in the design of a risk-sensitive actor-critic algorithm. Mannor & Tsitsiklis (2011) provides negative results regarding the computation of a variance-constrained solution to the optimal control problem.
The distributional formulation also arises when modelling uncertainty. Dearden et al. (1998) considered a Gaussian approximation to the value distribution, and modelled the uncertainty over the parameters of this approximation using a Normal-Gamma prior. Engel et al. (2005) leveraged the distributional Bellman equation to define a Gaussian process over the unknown value function. More recently, Geist & Pietquin (2010) proposed an alternative solution to the same problem based on unscented Kalman filters. We believe much of the analysis we provide here, which deals with the intrinsic randomness of the environment, can also be applied to modelling uncertainty.
Our work here is based on a number of foundational results, in particular concerning alternative optimality criteria. Early on, Jaquette (1973) showed that a moment optimality criterion, which imposes a total ordering on distributions, is achievable and defines a stationary optimal policy, echoing the second part of Theorem 1. Sobel (1982) is usually cited as the first reference to Bellman equations for the higher moments (but not the distribution) of the return. Chung & Sobel (1987) provide results concerning the convergence of the distributional Bellman operator in total variation distance. White (1988) studies “nonstandard MDP criteria” from the perspective of optimizing the state-action pair occupancy.
A number of probabilistic frameworks for reinforcement learning have been proposed in recent years. The planning as inference approach (Toussaint & Storkey, 2006; Hoffman et al., 2009) embeds the return into a graphical model, and applies probabilistic inference to determine the sequence of actions leading to maximal expected reward. Wang et al. (2008) considered the dual formulation of reinforcement learning, where one optimizes the stationary distribution subject to constraints given by the transition function (Puterman, 1994), in particular its relationship to linear approximation. Related to this dual is the Compress and Control algorithm (Veness et al., 2015), which describes a value function by learning a return distribution using density models. One of the aims of this work was to address the question left open by their work of whether one could design a practical distributional algorithm based on the Bellman equation, rather than Monte Carlo estimation.
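Let $A_1, A_2, \dots$ be a set of random variables describing a partition of $\Omega$, i.e. $A_i(\omega) \in \{0, 1\}$ and for any $\omega$ there is exactly one $A_i$ with $A_i(\omega) = 1$. Let $U, V$ be two random variables. Then
$$d_p(U, V) \le \sum_i d_p(A_i U, A_i V).$$
We will give the proof for $p < \infty$, noting that the same applies to $p = \infty$. Let $Y_i \overset{D}{:=} A_i U$ and $Z_i \overset{D}{:=} A_i V$, respectively. First note that
$$d_p^p(A_i U, A_i V) = \inf_{Y_i, Z_i} \mathbb{E}\big[|Y_i - Z_i|^p\big] = \inf_{Y_i, Z_i} \mathbb{E}\Big[\mathbb{E}\big[|Y_i - Z_i|^p \mid A_i\big]\Big].$$
Now, $|A_i U - A_i V|^p = 0$ whenever $A_i = 0$. It follows that we can choose $Y_i, Z_i$ so that also $|Y_i - Z_i|^p = 0$ whenever $A_i = 0$, without increasing the expected norm. Hence
$$d_p^p(A_i U, A_i V) = \inf_{Y_i, Z_i} \Pr\{A_i = 1\}\, \mathbb{E}\big[|Y_i - Z_i|^p \mid A_i = 1\big]. \tag{8}$$
Next, we claim that
$$\inf_{U, V} \sum_i \Pr\{A_i = 1\}\, \mathbb{E}\big[|A_i U - A_i V|^p \mid A_i = 1\big] \le \inf_{\substack{Y_1, Y_2, \dots \\ Z_1, Z_2, \dots}} \sum_i \Pr\{A_i = 1\}\, \mathbb{E}\big[|Y_i - Z_i|^p \mid A_i = 1\big]. \tag{9}$$
Specifically, the left-hand side of the equation is an infimum over all r.v.'s whose cumulative distributions are $F_U$ and $F_V$, respectively, while the right-hand side is an infimum over sequences of r.v.'s $Y_1, Y_2, \dots$ and $Z_1, Z_2, \dots$ whose cumulative distributions are $F_{A_i U}, F_{A_i V}$, respectively. To prove this upper bound, consider the c.d.f. of $U$:
$$F_U(y) = \Pr\{U \le y\} = \sum_i \Pr\{A_i = 1\} \Pr\{U \le y \mid A_i = 1\} = \sum_i \Pr\{A_i = 1\} \Pr\{A_i U \le y \mid A_i = 1\}.$$
Hence the distribution $F_U$ is equivalent, in an almost sure sense, to one that first picks an element $A_i$ of the partition, then picks a value for $U$ conditional on the choice $A_i$. On the other hand, the c.d.f. of $Y_i \overset{D}{=} A_i U$ is
$$F_{A_i U}(y) = \Pr\{A_i = 1\} \Pr\{A_i U \le y \mid A_i = 1\} + \Pr\{A_i = 0\} \Pr\{A_i U \le y \mid A_i = 0\} = \Pr\{A_i = 1\} \Pr\{A_i U \le y \mid A_i = 1\} + \Pr\{A_i = 0\}\, \mathbb{I}[y \ge 0].$$
Thus the right-hand side infimum in (9) has the additional constraint that it must preserve the conditional c.d.f.s, in particular when $y \ge 0$. Put another way, instead of having the freedom to completely reorder the mapping $U : \Omega \to \mathbb{R}$, we can only reorder it within each element of the partition. We now write
$$d_p^p(U, V) = \inf_{U, V} \mathbb{E}\big[|U - V|^p\big] \overset{(a)}{=} \inf_{U, V} \sum_i \Pr\{A_i = 1\}\, \mathbb{E}\big[|U - V|^p \mid A_i = 1\big] = \inf_{U, V} \sum_i \Pr\{A_i = 1\}\, \mathbb{E}\big[|A_i U - A_i V|^p \mid A_i = 1\big],$$
where (a) follows because $A_1, A_2, \dots$ is a partition. Using (9), this implies
$$d_p^p(U, V) \le \inf_{\substack{Y_1, Y_2, \dots \\ Z_1, Z_2, \dots}} \sum_i \Pr\{A_i = 1\}\, \mathbb{E}\big[|Y_i - Z_i|^p \mid A_i = 1\big] \overset{(b)}{=} \sum_i \inf_{Y_i, Z_i} \Pr\{A_i = 1\}\, \mathbb{E}\big[|Y_i - Z_i|^p \mid A_i = 1\big] \overset{(c)}{=} \sum_i d_p^p(A_i U, A_i V),$$
because in (b) the individual components of the sum are independently minimized; and (c) from (8). ∎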
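The partition bound can be checked numerically in a simple setting. The sketch below is an illustration only and makes several assumptions: the sample space is finite with uniform probability, the partition is encoded as an integer label per outcome, and only the case $p = 1$ is exercised, where the distance between two equally weighted empirical distributions is the mean absolute difference of their sorted samples. The helper names `w1` and `partition_sides` are hypothetical.

```python
import random


def w1(u, v):
    """d_1 between two equally weighted empirical distributions of equal size."""
    return sum(abs(a - b) for a, b in zip(sorted(u), sorted(v))) / len(u)


def partition_sides(u, v, labels):
    """Return (d_1(U, V), sum_i d_1(A_i U, A_i V)) for the partition given by labels."""
    lhs = w1(u, v)
    rhs = 0.0
    for i in set(labels):
        # A_i U keeps U on the i-th element of the partition and is 0 elsewhere.
        masked_u = [x if lab == i else 0.0 for x, lab in zip(u, labels)]
        masked_v = [x if lab == i else 0.0 for x, lab in zip(v, labels)]
        rhs += w1(masked_u, masked_v)
    return lhs, rhs


rng = random.Random(0)
n = 500
u = [rng.gauss(0.0, 1.0) for _ in range(n)]
v = [rng.gauss(0.5, 2.0) for _ in range(n)]
labels = [rng.randrange(4) for _ in range(n)]

lhs, rhs = partition_sides(u, v, labels)
print("bound holds:", lhs <= rhs)
```

A quick hand-checkable case is `partition_sides([-1.0, 2.0], [1.0, 0.0], [0, 1])`, where the left side is 1 and the right side is 2.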
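$\bar{d}_p$ is a metric over value distributions.
The only nontrivial property is the triangle inequality. For any value distribution $Y \in \mathcal{Z}$, write
$$\bar{d}_p(Z_1, Z_2) = \sup_{x, a} d_p\big(Z_1(x, a), Z_2(x, a)\big) \overset{(a)}{\le} \sup_{x, a} \big[ d_p\big(Z_1(x, a), Y(x, a)\big) + d_p\big(Y(x, a), Z_2(x, a)\big) \big] \le \sup_{x, a} d_p\big(Z_1(x, a), Y(x, a)\big) + \sup_{x, a} d_p\big(Y(x, a), Z_2(x, a)\big) = \bar{d}_p(Z_1, Y) + \bar{d}_p(Y, Z_2),$$
where in (a) we used the triangle inequality for $d_p$. ∎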
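As a small illustration of the maximal metric, the following sketch represents a value distribution as a dictionary from hypothetical $(x, a)$ keys to equally weighted samples, takes the largest per-key Wasserstein distance, and checks the triangle inequality on random data. The sample-based representation, the helper names, and the choice $p = 2$ are assumptions made for the example, not part of the original text.

```python
import random


def d_p(u, v, p):
    """Wasserstein-p distance between equally weighted empirical distributions."""
    total = sum(abs(a - b) ** p for a, b in zip(sorted(u), sorted(v)))
    return (total / len(u)) ** (1.0 / p)


def d_bar_p(z1, z2, p):
    """Maximal form: largest d_p over all (x, a) keys."""
    return max(d_p(z1[key], z2[key], p) for key in z1)


def random_value_distribution(rng, keys, n_atoms):
    """Hypothetical value distribution: n_atoms Gaussian samples per (x, a) key."""
    return {key: [rng.gauss(0.0, 1.0) for _ in range(n_atoms)] for key in keys}


rng = random.Random(1)
keys = [(x, a) for x in range(3) for a in range(2)]
z1, z2, y = (random_value_distribution(rng, keys, 64) for _ in range(3))

p = 2
gap = d_bar_p(z1, y, p) + d_bar_p(y, z2, p) - d_bar_p(z1, z2, p)
print("triangle inequality holds:", gap >= 0.0)
```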
$\mathcal{T}^\pi$ is a $\gamma$-contraction in $\bar{d}_p$.
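The contraction property can be observed on a toy problem. The sketch below assumes a small finite set of state-action pairs with the policy folded into the transitions, deterministic rewards, and a fixed number of equally likely successor pairs, so that one policy-evaluation backup simply pools the shifted and discounted samples of the successors. All names, sizes, and the sample-based representation are hypothetical choices for illustration.

```python
import random


def d_p(u, v, p):
    """Wasserstein-p distance between equally weighted empirical distributions."""
    total = sum(abs(a - b) ** p for a, b in zip(sorted(u), sorted(v)))
    return (total / len(u)) ** (1.0 / p)


def d_bar_p(z1, z2, p):
    """Largest d_p over all state-action pairs (lists indexed by pair)."""
    return max(d_p(a, b, p) for a, b in zip(z1, z2))


def backup(z, rewards, successors, gamma):
    """Sample-based backup: pool r + gamma * z' over equally likely successors."""
    return [
        [rewards[s] + gamma * atom for nxt in successors[s] for atom in z[nxt]]
        for s in range(len(z))
    ]


rng = random.Random(2)
n_pairs, n_atoms, gamma, p = 5, 32, 0.9, 2
rewards = [rng.uniform(-1.0, 1.0) for _ in range(n_pairs)]
successors = [[rng.randrange(n_pairs) for _ in range(3)] for _ in range(n_pairs)]
z1 = [[rng.gauss(0.0, 1.0) for _ in range(n_atoms)] for _ in range(n_pairs)]
z2 = [[rng.gauss(1.0, 2.0) for _ in range(n_atoms)] for _ in range(n_pairs)]

before = d_bar_p(z1, z2, p)
after = d_bar_p(
    backup(z1, rewards, successors, gamma),
    backup(z2, rewards, successors, gamma),
    p,
)
print("contracted by at least gamma:", after <= gamma * before + 1e-9)
```

Pooling samples works here only because every successor is equally likely; unequal transition probabilities would need weighted samples.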
Consider two value distributions $Z_1, Z_2 \in \mathcal{Z}$, and write $\mathbb{V}(Z_i)$ to be the vector of variances of $Z_i$. Then