The Potential of the Return Distribution for Exploration in RL

06/11/2018 ∙ by Thomas M. Moerland, et al.

This paper studies the potential of the return distribution for exploration in deterministic environments. We study network losses and propagation mechanisms for Gaussian, Categorical and Mixture of Gaussian distributions. Combined with exploration policies that leverage this return distribution, we solve, for example, a randomized Chain task of length 100, which has not been reported before when learning with neural networks.




1 Introduction

Reinforcement learning (RL) is the dominant class of algorithms to learn sequential decision-making from data. Most RL approaches focus on learning the mean action value $Q(s,a)$. Recently, Bellemare et al. (2017) studied distributional RL, where one propagates the entire return distribution (of which $Q(s,a)$ is the expectation) through the Bellman equation. Bellemare et al. (2017) show increased performance in a variety of RL tasks.

However, Bellemare et al. (2017) did not yet leverage the return distribution for exploration. In the present paper, we identify the potential of the return distribution for informed exploration. The return distribution may be induced by two sources of stochasticity: 1) our stochastic policy and 2) a stochastic environment. For this work we assume a deterministic environment, which makes the return distribution entirely induced by the stochastic policy. Thereby, we may actually act optimistically with respect to this distribution (Section 7 discusses the different types of uncertainty present in sequential decision making more thoroughly). The present paper explores this idea, in the context of neural networks and for different propagation distributions (Gaussian, Categorical and Gaussian mixture). Results show vastly improved learning in a challenging exploration task, which had not been solved with neural networks before. We also provide extensive visual illustration of the process of return-based exploration, which shows a natural shift from exploration to exploitation.

2 Distributional Reinforcement Learning

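We adopt a Markov Decision Process (MDP) (Sutton & Barto, 1998) given by the tuple $\{\mathcal{S}, \mathcal{A}, T, R, \gamma\}$. For this work, we assume a discrete action space and deterministic transition and reward functions. At every time-step $t$ we observe a state $s_t \in \mathcal{S}$ and pick an action $a_t \in \mathcal{A}$. The MDP follows the transition dynamics $s_{t+1} = T(s_t, a_t)$ and returns rewards $r_t = R(s_t, a_t)$. We act in the MDP according to a stochastic policy $\pi(a|s)$. The (discounted) return $Z^\pi(s,a)$ from a state-action pair $(s,a)$ is a random process given by

$$Z^\pi(s,a) = \sum_{t=0}^{\infty} \gamma^t r_t, \quad s_0 = s,\ a_0 = a, \tag{1}$$

where $\gamma \in [0,1]$ is the discount factor. The return $Z^\pi(s,a)$ is a random variable, where the distribution of $Z^\pi(s,a)$ is induced by the stochastic policy $\pi$ (as we assume a deterministic environment). Eq. 1 can be rewritten in recursive form, known as the distributional Bellman equation (Bellemare et al., 2017) (omitting the superscript $\pi$ from now on):

$$Z(s,a) \overset{d}{=} R(s,a) + \gamma Z(s',a'), \quad s' = T(s,a),\ a' \sim \pi(\cdot|s'), \tag{2}$$

where $\overset{d}{=}$ denotes distributional equality (Engel et al., 2005). The state-action value $Q(s,a) = \mathbb{E}[Z(s,a)]$ is the expectation of the return distribution. Applying this expectation to Eq. 2 gives

$$Q(s,a) = R(s,a) + \gamma\, \mathbb{E}_{a' \sim \pi(\cdot|s')}\big[Q(s',a')\big], \tag{3}$$

which is known as the Bellman equation (Sutton & Barto, 1998). Most RL algorithms learn this mean action value $Q(s,a)$, and explore by some random perturbation of these means.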

3 Distributional Perspective on Exploration

As mentioned in the introduction, the return distribution may be induced by two sources of stochasticity: 1) our stochastic policy and 2) a stochastic environment. Therefore, if we assume a deterministic environment, then the return distribution is entirely induced by our own policy. As we may modify our policy, it actually makes sense to act optimistically with respect to the return distribution.

As an illustration, consider a state-action pair with a particular mean value estimate $Q(s,a)$. It matters whether this average originates from a highly varying return or from consistently the same return. It matters because our policy may influence the shape of this distribution, i.e. for the highly varying returns we may actively transform the distribution towards the good returns. In other words, what we really care about in deterministic domains is the best return, or the upper end of the return distribution, because it is an indication of what we may achieve once we have figured out how to act in the future. By starting from broad distribution initializations that gradually narrow when subpolicies converge, we observe a natural shift from exploration to exploitation.

4 Distributional Policy Evaluation

Following Bellemare et al. (2017), we introduce a neural network with parameters $\phi$ to model the return distribution $p_\phi(Z|s,a)$. For this work we will consider three parametric distributions to approximate the return distribution: Gaussian, Categorical (as previously studied by Bellemare et al. (2017)), and Gaussian mixture.

To perform policy evaluation, we need to discuss two topics:

  1. How to propagate the distribution through the Bellman equation (based on newly observed data). We will denote the propagated distribution as $q(Z|s,a)$.

  2. A loss function between the current network predictions and the new target: $\mathcal{L}\big(p_\phi(Z|s,a),\, q(Z|s,a)\big)$.


Due to space restrictions, we will only show the propagation and loss for the Gaussian case. For the Categorical and Gaussian mixture outcome we specify the distributional loss in Appendix A.1 and the Bellman propagation in Appendix A.2.

Distribution propagation

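Define the distributional Bellman back-up operator $\mathcal{B}$ as the recursive application of Eq. 2 to $Z(s,a)$. For a Gaussian network output, $p_\phi(Z|s,a) = \mathcal{N}(Z \mid \mu_\phi(s,a), \sigma_\phi(s,a))$, we need to propagate both the mean and variance through the Bellman equation:

$$\mu(s,a) = r + \gamma \sum_{a'} \pi(a'|s')\, \mu(s',a') \tag{4}$$

$$\sigma^2(s,a) = \gamma^2 \sum_{a'} \pi(a'|s')^2\, \sigma^2(s',a') \tag{5}$$

because the reward is deterministic ($\mathbb{E}[r] = r$, $\mathrm{Var}[r] = 0$), and we assume the next state distributions are independent so we may ignore the covariances. (For random variables $X, Y$ and scalar constants $c_0, c_1, c_2$ we have: $\mathbb{E}[c_0 + c_1 X + c_2 Y] = c_0 + c_1 \mathbb{E}[X] + c_2 \mathbb{E}[Y]$ and $\mathrm{Var}[c_0 + c_1 X + c_2 Y] = c_1^2 \mathrm{Var}[X] + c_2^2 \mathrm{Var}[Y] + 2 c_1 c_2 \mathrm{Cov}[X,Y]$.)

In practice, we approximate the expectation over the policy probabilities at the next state, $\pi(a'|s')$, by sampling a next action $a'$ once (either on- or off-policy). This is the most common solution in RL, and will be right in expectation over multiple traces. The 1-step bootstrap distribution estimate then becomes $q(Z|s,a) = \mathcal{N}\big(Z \mid r + \gamma \mu_\phi(s',a'),\ \gamma \sigma_\phi(s',a')\big)$.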
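The single-sample Gaussian propagation can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `propagate_gaussian` is hypothetical, and the handling of terminal transitions (a target that collapses onto the reward) is an assumption based on the deterministic reward function.

```python
def propagate_gaussian(r, gamma, mu_next, sigma_next, terminal=False):
    """One-step Bellman target for a Gaussian return distribution.

    mu_next and sigma_next are the network's mean and standard deviation
    at the sampled next state-action pair (s', a').
    Returns the mean and standard deviation of the target distribution.
    """
    if terminal:
        # Assumption: no future return after termination, so the target
        # is (a point mass at) the deterministic reward.
        return float(r), 0.0
    # Mean shifts by r and scales by gamma; the standard deviation scales by gamma.
    return float(r + gamma * mu_next), float(gamma * sigma_next)


mu_q, sigma_q = propagate_gaussian(r=0.0, gamma=0.9, mu_next=1.0, sigma_next=2.0)
mu_t, sigma_t = propagate_gaussian(r=1.0, gamma=0.9, mu_next=5.0, sigma_next=3.0, terminal=True)
```

In a real implementation a small positive floor on the target standard deviation may be needed, since a zero-width target makes a Gaussian cross-entropy loss favor ever narrower predictions.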

Figure 1: Gaussian distribution propagation. Left: Example 2-step MDP. Learning process for the state-action pairs in the dotted box is shown on the right. Right: Three (a-c) return exploration phases for the left half of the MDP. Each plot also displays the mean ($\mu$), standard deviation ($\sigma$), and policy probabilities under Thompson (tho) sampling and UCB (ucb).


Next, we want to move our current network predictions $p_\phi(Z|s,a)$ closer to the new target $q(Z|s,a)$, for which we will use a distributional distance. A well-known choice in machine learning is the cross-entropy $H(q, p_\phi)$:

$$\mathcal{L}(\phi) = H(q, p_\phi) = -\int q(Z|s,a)\, \log p_\phi(Z|s,a)\, dZ \tag{6}$$

For both the Gaussian and Categorical output distributions we can derive closed-form expressions for the cross-entropy, see Appendix A.1. However, for the Gaussian mixture outcome we do not have a closed-form cross-entropy expression, and we instead minimize the $L_2$ distance. See Appendix A.1 for details as well.

In practice, we store a database $\mathcal{D}$ of transition tuples $(s, a, r, s', a')$, where $a'$ can also be computed either on- or off-policy, and minimize:

$$\phi^\star = \arg\min_\phi\ \mathbb{E}_{(s,a,r,s',a') \sim \mathcal{D}} \Big[ \mathcal{L}\big(p_\phi(Z|s,a),\, q(Z|s,a)\big) \Big] \tag{7}$$

where $q(Z|s,a)$ is computed based on Bellman propagation. This completes the policy evaluation step for the Gaussian case.

5 Distributional Exploration (Policy Improvement)

Our real interest is usually not in policy evaluation only, as we want to gradually improve our policy as well. The major benefit of probabilistic policy evaluation (previous section) is that we have additional information to balance exploration and exploitation. We will treat the return distribution as something against which we can act optimistically. Exploration under uncertainty has been extensively studied in the bandit literature. Two of the most successful algorithms, which we both consider in this work, are

  1. Thompson sampling (Thompson, 1933), which takes a sample $z_a \sim p_\phi(Z|s,a)$ for each action and picks the action with the highest draw.

  2. Upper Confidence Bounds (UCB) (Auer et al., 2002), which picks the action with the highest upper confidence bound $\mu(s,a) + c \cdot \sigma(s,a)$ for some constant $c$. Analytic expressions for $\sigma(s,a)$ for the different output distributions are provided in Appendix A.3.
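Both exploration rules can be sketched for a Gaussian output, given per-action means and standard deviations at the current state. The function names, the example numbers and the value of the UCB constant are illustrative assumptions, not values from the paper.

```python
import numpy as np


def thompson_action(mu, sigma, rng):
    """Draw one return sample per action and pick the action with the highest draw."""
    draws = rng.normal(mu, sigma)
    return int(np.argmax(draws))


def ucb_action(mu, sigma, c=2.0):
    """Pick the action with the highest upper confidence bound mu + c * sigma."""
    return int(np.argmax(mu + c * sigma))


rng = np.random.default_rng(0)
mu = np.array([0.5, 0.4])       # action 0 has the higher mean ...
sigma = np.array([0.05, 0.5])   # ... but action 1 is far more uncertain

a_ucb = ucb_action(mu, sigma, c=2.0)     # optimistic choice
a_greedy = ucb_action(mu, sigma, c=0.0)  # reduces to greedy on the mean
a_tho = thompson_action(mu, sigma, rng)  # random, favors high mean or wide actions
```

With these numbers UCB prefers the uncertain action, whereas a greedy policy on the mean would never try it. As the standard deviations shrink during learning, both rules approach the greedy policy.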

6 Experiments

Figure 2: Learning curves on Chain domain for different types of return distributions. Plots progress row-wise for increased depth of the Chain, i.e. increased exploration difficulty. Exploration uses a UCB policy with a randomly drawn constant (this induces slight randomness in the otherwise deterministic UCB decision). Results averaged over 10 repetitions.

We now show several results of return-based exploration on a Toy example, the Chain domain and an OpenAI Gym task. Fig. 1, left shows an example 2-step MDP to illustrate the concept of return-based exploration. On the right of Fig. 1 we display three phases of learning in this MDP for a Gaussian $p_\phi(Z|s,a)$. Due to space constraints we only visualize the distributions for the left half of the MDP; the full learning process is shown in Figure 5 (Appendix). In Fig. 1a we just initialized the network, and both Thompson sampling and UCB follow almost uniform policies. After some training (Fig. 1b) the second-state distributions start converging, but the uncertainty at the root state still remains broader (as it generalizes over the sometimes explored inferior action in the second state). Thompson sampling and UCB gradually start to prefer the optimal action in the root state now. Finally, after some additional episodes (Fig. 1c) the distribution estimates have converged on the optimal state-action values, and both Thompson sampling and UCB have automatically converged on the optimal policy.

We next consider the Chain domain (Appendix C, Figure 4), which has been previously studied in RL literature (Osband et al., 2016). The domain consists of a chain of states of length $N$, with two available actions at each state. The only trace giving a positive, non-zero reward is to select the ‘correct’ action at every step, which is randomly picked at domain initialization. The domain has a strong exploration challenge, which grows exponentially with the length of the chain for undirected exploration methods (see Appendix C).

Figure 2 shows the learning curves of return-based exploration for different types of output distributions, and compares the results to $\epsilon$-greedy exploration. The plots progress row-wise to longer chain lengths. First of all, we note that $\epsilon$-greedy learns slowly in the short chain, and does not learn at all in the longer chains. However, the methods with return uncertainty do learn, and consistently solve the domain even for the long chain of length 100. Note that this is a very challenging exploration problem, as we need to take 100 steps correctly while there is no structure in the domain at all (i.e., the correct action randomly changes at each depth in the chain, so a function approximator with local generalization is more harmful than beneficial). The mixture of Gaussians (mog) return distribution is less stable than the Gaussian and categorical. This might be due to the $L_2$-loss used for the Gaussian mixture, which is different from the cross-entropy losses used for the Gaussian and categorical distributions (see Appendix A). A detailed illustration of the learning process for the categorical return distribution is shown in Figure 6 (Appendix).

Figure 3: Return-based exploration versus $\epsilon$-greedy on FrozenLake-v1. The return-based exploration methods use Thompson sampling. Compared to the OpenAI Gym implementation, we modify the environment to be fully deterministic (by removing the random ‘slipping’ effect of the task). Results averaged over 5 repetitions.

In Figure 3 we show the results of return-based exploration on a task from the OpenAI Gym repository: FrozenLake. Again, we observe that the return-based exploration methods learn better than $\epsilon$-greedy. These experiments use Thompson sampling for exploration, which shows that return-based exploration can be employed with either UCB or Thompson exploration.

7 Discussion

We briefly discuss why return-based exploration seems to work so well in the challenging exploration task of the Chain. It turns out that the return distributions narrow strongly when a certain action terminates the episode. In such cases, we bootstrap a very narrow next state distribution around 0 (because the reward function is assumed deterministic). On the Chain, we see that all the terminating actions very quickly narrow, while the trace along the full path keeps some additional uncertainty. It appears as if the return distribution in this implementation identifies a specific type of uncertainty related to the termination probabilities and asymmetry in the domain search tree, which relates our work to ideas from Monte Carlo Tree Search (Moerland et al., 2018) as well.

The benefit of exploration based on uncertainty is that policy improvement almost comes for free. The only things that we propagate are full distributions, which we initialize wide, and which then gradually converge when the distributions behind them start converging. This creates a more natural transition from the exploration to the exploitation phase, a trait which most undirected methods ($\epsilon$-greedy, Boltzmann) lack.

An important direction for future work is to connect the policy-dependent return uncertainty, as studied in this paper, to the statistical (or epistemic) uncertainty of the mean action-value, which is a function of the local number of visits to a state. The return distribution mechanism in this paper clearly identifies a different aspect of (future policy) uncertainty, which may be related to the termination structure of subtrees, or to the fact that uncertainty in an MDP should propagate over steps as well (Dearden et al., 1998). In any case, due to the sequential nature of MDPs there appear to be more aspects to epistemic/reducible (Osband et al., 2018) uncertainty, and these distinctions have yet to be properly identified. Finally, another important extension is to stochastic environments (Depeweg et al., 2016; Moerland et al., 2017b), i.e., separating which part of the return distribution originates from our own policy uncertainty (for which we can be optimistic) and which part originates from the stochastic environment (for which we want to act on the expectation).

8 Conclusion

This paper identified the potential of the return distribution for targeted exploration. In deterministic domains, the return distribution is induced by our own policy, and since we may modify this policy ourselves, it makes sense to act optimistically with respect to this distribution. Exploration based on the return distribution, especially for the Gaussian and Categorical case, manages to solve the ‘randomized’ Chain of length 100 with function approximation, which we believe has not been reported before. Moreover, it also performs well in another task from the OpenAI Gym. Future work should expand these ideas to stochastic environments, and identify the connections to exploration based on the statistical uncertainty of the mean action value.


Appendix A Distributional Details

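The current network distributions are denoted by $p_\phi(Z|s,a)$, which we want to update with a newly calculated target distribution $q(Z|s,a)$. For readability, we will omit the dependency on $(s,a)$ in the remainder of this section. We study three types of network output distributions:

  1. Gaussian: $p(Z) = \mathcal{N}(Z \mid \mu_p, \sigma_p)$

  2. Categorical: parametrized by the number of bins $K$ and edges $Z_{\min}, Z_{\max}$. Define the set of bins as $\{z_i = Z_{\min} + i \Delta z : 0 \le i < K\}$, for $\Delta z = (Z_{\max} - Z_{\min})/(K-1)$. Each bin has associated density $p_i$, with $\sum_i p_i = 1$.

  3. Gaussian mixture: $p(Z) = \sum_{i=1}^{M} w_i\, \mathcal{N}(Z \mid \mu_i, \sigma_i)$, for $M$ mixtures. Here $w_i$ denotes the weight of the $i$-th mixture component, $\sum_i w_i = 1$.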

We now detail the loss, Bellman propagation and analytic standard deviation (as used in the UCB policy) for each of these output distributions.

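A.1 Loss

Gaussian

The main text already introduced the cross-entropy loss used for Gaussian $p(Z) = \mathcal{N}(Z \mid \mu_p, \sigma_p)$ and target $q(Z) = \mathcal{N}(Z \mid \mu_q, \sigma_q)$. Here we derive the analytical expression of this cross-entropy:

$$H(q,p) = -\int q(Z) \log p(Z)\, dZ = -\int q(Z) \log \Big[ \frac{1}{\sigma_p \sqrt{2\pi}} \exp\Big( -\frac{(Z-\mu_p)^2}{2\sigma_p^2} \Big) \Big] dZ \tag{8}$$

Bringing everything that does not depend on $Z$ out of the integral and taking the logarithm:

$$H(q,p) = \tfrac{1}{2} \log(2\pi\sigma_p^2) \int q(Z)\, dZ + \frac{1}{2\sigma_p^2} \int q(Z)\, (Z-\mu_p)^2\, dZ \tag{9}$$

Which can be rewritten as

$$H(q,p) = \tfrac{1}{2} \log(2\pi\sigma_p^2) + \frac{1}{2\sigma_p^2} \Big( \mathbb{E}_q[Z^2] - 2\mu_p \mathbb{E}_q[Z] + \mu_p^2 \Big) \tag{10}$$

We can rewrite the second moment $\mathbb{E}_q[Z^2]$ in terms of the mean and variance of $q$:

$$\mathbb{E}_q[Z^2] = \mathrm{Var}_q[Z] + \mathbb{E}_q[Z]^2 = \sigma_q^2 + \mu_q^2 \tag{11}$$

Therefore, we can simplify the full expression to

$$H(q,p) = \tfrac{1}{2} \log(2\pi\sigma_p^2) + \frac{\sigma_q^2 + (\mu_q - \mu_p)^2}{2\sigma_p^2} \tag{12}$$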


which is used as the closed-form loss for the Gaussian experiments (Eq. 6) in this paper. Note that we also experimented with (other) closed form distributional losses for Gaussians, such as the Bhattacharyya distance and Hellinger distance, but these did not significantly improve performance.
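As a sanity check on the derivation, the sketch below evaluates the closed-form cross-entropy between a Gaussian target and a Gaussian prediction, $\tfrac{1}{2}\log(2\pi\sigma_p^2) + (\sigma_q^2 + (\mu_q-\mu_p)^2)/(2\sigma_p^2)$, and compares it with a brute-force numerical integral. The function names and test values are illustrative assumptions.

```python
import numpy as np


def gaussian_cross_entropy(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form H(q, p) for target q = N(mu_q, sigma_q) and prediction p = N(mu_p, sigma_p)."""
    return (0.5 * np.log(2.0 * np.pi * sigma_p ** 2)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2))


def numerical_cross_entropy(mu_q, sigma_q, mu_p, sigma_p, n=200001):
    """Riemann-sum approximation of -integral q(z) log p(z) dz on a wide grid."""
    z = np.linspace(mu_q - 12.0 * sigma_q, mu_q + 12.0 * sigma_q, n)
    q = np.exp(-0.5 * ((z - mu_q) / sigma_q) ** 2) / (sigma_q * np.sqrt(2.0 * np.pi))
    log_p = -0.5 * np.log(2.0 * np.pi * sigma_p ** 2) - (z - mu_p) ** 2 / (2.0 * sigma_p ** 2)
    return float(-np.sum(q * log_p) * (z[1] - z[0]))


closed = gaussian_cross_entropy(0.3, 0.7, -0.2, 1.1)
numeric = numerical_cross_entropy(0.3, 0.7, -0.2, 1.1)
```

The $\log \sigma_p$ term is also the source of the gradient issue noted in Appendix D: its derivative with respect to $\sigma_p$ grows without bound as the predicted standard deviation narrows.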


Categorical

For the categorical target distribution $q$ with bin probabilities $q_i$ we again minimize the cross-entropy with $p$:

$$H(q,p) = -\sum_{i} q_i \log p_i \tag{13}$$


Gaussian Mixture

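There is no closed form expression for the KL-divergence or cross-entropy between two Gaussian mixtures. We could of course approximate such a loss by repeated sampling, but this will strongly increase the computational burden. Therefore, we instead searched for a distance measure between Gaussian mixtures that does have a closed form expression, which is the $L_2$-distance. Writing $q(Z) = \sum_i w^q_i\, \mathcal{N}(Z \mid \mu^q_i, \sigma^q_i)$ and $p(Z) = \sum_j w^p_j\, \mathcal{N}(Z \mid \mu^p_j, \sigma^p_j)$:

$$L_2(q,p) = \int \big(q(Z) - p(Z)\big)^2 dZ = \sum_{i,j} w^q_i w^q_j \int \mathcal{N}(Z \mid \mu^q_i, \sigma^q_i)\, \mathcal{N}(Z \mid \mu^q_j, \sigma^q_j)\, dZ - 2 \sum_{i,j} w^q_i w^p_j \int \mathcal{N}(Z \mid \mu^q_i, \sigma^q_i)\, \mathcal{N}(Z \mid \mu^p_j, \sigma^p_j)\, dZ + \sum_{i,j} w^p_i w^p_j \int \mathcal{N}(Z \mid \mu^p_i, \sigma^p_i)\, \mathcal{N}(Z \mid \mu^p_j, \sigma^p_j)\, dZ \tag{14}$$

We may simplify the remaining integrals in this expression, since for any two Gaussians $\mathcal{N}(Z \mid \mu_1, \sigma_1)$ and $\mathcal{N}(Z \mid \mu_2, \sigma_2)$ we have (Petersen et al., 2008):

$$\int \mathcal{N}(Z \mid \mu_1, \sigma_1)\, \mathcal{N}(Z \mid \mu_2, \sigma_2)\, dZ = \mathcal{N}\Big(\mu_1 \,\Big|\, \mu_2, \sqrt{\sigma_1^2 + \sigma_2^2}\Big) \tag{15}$$

Therefore, Eq. 14 simplifies to

$$L_2(q,p) = \sum_{i,j} w^q_i w^q_j\, \mathcal{N}\Big(\mu^q_i \,\Big|\, \mu^q_j, \sqrt{(\sigma^q_i)^2 + (\sigma^q_j)^2}\Big) - 2 \sum_{i,j} w^q_i w^p_j\, \mathcal{N}\Big(\mu^q_i \,\Big|\, \mu^p_j, \sqrt{(\sigma^q_i)^2 + (\sigma^p_j)^2}\Big) + \sum_{i,j} w^p_i w^p_j\, \mathcal{N}\Big(\mu^p_i \,\Big|\, \mu^p_j, \sqrt{(\sigma^p_i)^2 + (\sigma^p_j)^2}\Big) \tag{16}$$

which can be evaluated in $O(M^2)$ time for $M$ mixture components.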
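The closed-form $L_2$ distance between two Gaussian mixtures can be sketched with NumPy broadcasting: each of the three double sums is a weighted sum of Gaussian densities evaluated at pairs of component means. The function names, the `(weights, means, stds)` tuple layout and the example mixtures are assumptions made for illustration. A brute-force integral is included as a check.

```python
import numpy as np


def normal_pdf(x, mu, sigma):
    """Density of N(x | mu, sigma), with sigma a standard deviation."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def overlap(a, b):
    """Integral of the product of two mixtures, each given as (weights, means, stds)."""
    w_a, mu_a, sd_a = a
    w_b, mu_b, sd_b = b
    # Pairwise standard deviations sqrt(sd_i^2 + sd_j^2) for all component pairs.
    sd = np.sqrt(sd_a[:, None] ** 2 + sd_b[None, :] ** 2)
    pairwise = normal_pdf(mu_a[:, None], mu_b[None, :], sd)
    return float(w_a @ pairwise @ w_b)


def mixture_l2(q, p):
    """Closed-form integral of (q - p)^2, quadratic in the number of components."""
    return overlap(q, q) - 2.0 * overlap(q, p) + overlap(p, p)


def mixture_pdf(z, mix):
    """Mixture density on a grid z, used only for the numerical check."""
    w, mu, sd = mix
    return np.sum(w * normal_pdf(z[:, None], mu, sd), axis=1)


q = (np.array([0.3, 0.7]), np.array([-1.0, 2.0]), np.array([0.5, 1.0]))
p = (np.array([0.6, 0.4]), np.array([0.0, 1.5]), np.array([0.8, 0.6]))

z = np.linspace(-15.0, 15.0, 300001)
numeric = float(np.sum((mixture_pdf(z, q) - mixture_pdf(z, p)) ** 2) * (z[1] - z[0]))
closed = mixture_l2(q, p)
```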

Sample-based loss

For some output distributions we either do not have a density (like some deep generative models) or the available analytic distributional loss performs suboptimally. However, we can always sample from our model. For example, for a 1-step Q-learning update, we can (repeatedly) sample from our network at the next timestep, $z'_k \sim p_\phi(Z|s',a')$, transform these samples through the Bellman equation, and then train our model on a negative log-likelihood loss:

$$\mathcal{L}(\phi) = -\sum_{k} \log p_\phi\big(r + \gamma z'_k \mid s, a\big) \tag{17}$$


Results of this approach are not shown, but were comparable to the results with approximate return propagation shown in the Results section. However, this approach is clearly more computationally expensive.

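A.2 Bellman Propagation

Given a data tuple $(s, a, r, s', a')$, where $a'$ may either be on- or off-policy, and a bootstrapped distribution $p_\phi(Z|s',a')$, we want to calculate the one-step Bellman transformed distribution $q(Z|s,a)$.

Categorical

For the categorical distribution, we may Bellman transform each individual atom/bin, and then project the probabilities of the transformed means back on the atoms (denoted by operator $\Pi$). This procedure follows Bellemare et al. (2017). Each atom $z_i$ is moved to $r + \gamma z_i$ while keeping its probability $p_i(s',a')$, after which $\Pi$ redistributes this mass over the neighbouring atoms of the fixed support:

$$q(Z|s,a) = \Pi \big( r + \gamma z_i,\ p_i(s',a') \big)$$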
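The projection step can be sketched as follows, assuming equally spaced atoms and the linear-interpolation projection of Bellemare et al. (2017): each shifted atom splits its probability between its two nearest atoms on the fixed support, in proportion to proximity. The function name and example values are illustrative, not taken from the paper.

```python
import numpy as np


def project_categorical(p_next, atoms, r, gamma):
    """Bellman-transform a categorical distribution and project it back onto `atoms`."""
    p_next = np.asarray(p_next, dtype=float)
    atoms = np.asarray(atoms, dtype=float)
    n = len(atoms)
    dz = atoms[1] - atoms[0]
    # Shift and scale every atom, clipping to the support edges.
    shifted = np.clip(r + gamma * atoms, atoms[0], atoms[-1])
    # Fractional index of each shifted atom on the original grid.
    b = np.clip((shifted - atoms[0]) / dz, 0.0, n - 1)
    lo = np.floor(b).astype(int)
    hi = np.ceil(b).astype(int)
    q = np.zeros(n)
    # Split the mass between the lower and upper neighbour.
    np.add.at(q, lo, p_next * (hi - b))
    np.add.at(q, hi, p_next * (b - lo))
    # If a shifted atom lands exactly on a grid point, both weights above are zero.
    exact = lo == hi
    np.add.at(q, lo[exact], p_next[exact])
    return q


atoms = np.linspace(0.0, 1.0, 5)
p_next = np.array([0.1, 0.2, 0.3, 0.25, 0.15])
q = project_categorical(p_next, atoms, r=0.0, gamma=0.5)
q_terminal = project_categorical(p_next, atoms, r=1.0, gamma=0.0)
```

As long as no clipping occurs, the projection preserves the mean of the transformed distribution; with `gamma=0.0` all mass moves to the atom at the reward.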


Gaussian mixture

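For the Gaussian mixture case, with bootstrapped distribution $p_\phi(Z|s',a') = \sum_i w_i\, \mathcal{N}(Z \mid \mu_i, \sigma_i)$, we have

$$q(Z|s,a) = \sum_i w_i\, \mathcal{N}\big(Z \mid r + \gamma \mu_i,\ \gamma \sigma_i\big)$$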


This implies that we may propagate each Gaussian mixture component individually, as discussed in Section 4, keeping each mixture weight the same.

A.3 Standard Deviation

For UCB exploration, we require fast (i.e., analytic) access to the distribution standard deviation, to prevent repeatedly having to sample. Clearly, for the Gaussian output we directly have the standard deviation available.

Categorical

For a categorical output distribution with categories $z_i$ and associated probabilities $p_i$, we have the standard deviation as:

$$\sigma = \sqrt{\sum_i p_i\, (z_i - \mu)^2}$$

where $\mu = \sum_i p_i z_i$.
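A minimal sketch of the categorical standard deviation, as it could be used inside a UCB rule (the function name and example values are illustrative):

```python
import numpy as np


def categorical_std(atoms, probs):
    """Standard deviation of a categorical distribution over the values in `atoms`."""
    atoms = np.asarray(atoms, dtype=float)
    probs = np.asarray(probs, dtype=float)
    mu = np.sum(probs * atoms)
    return float(np.sqrt(np.sum(probs * (atoms - mu) ** 2)))


wide = categorical_std([0.0, 1.0], [0.5, 0.5])    # mass split over both ends
narrow = categorical_std([0.0, 1.0], [0.0, 1.0])  # converged point mass
```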

Gaussian mixture

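For a Gaussian mixture model with mixture weights $w_i$, mixture means $\mu_i$ and mixture standard deviations $\sigma_i$, we start from:

$$\mathrm{Var}[Z] = \mathbb{E}[Z^2] - \mathbb{E}[Z]^2 = \sum_i w_i\, \mathbb{E}_i[Z^2] - \Big( \sum_i w_i \mu_i \Big)^2 \tag{21}$$

Now we may again use Eq. 11 to rewrite the second moments of the mixture components in terms of their means and variances, i.e. $\mathbb{E}_i[Z^2] = \mu_i^2 + \sigma_i^2$. Plugging this expression into Eq. 21 gives:

$$\mathrm{Var}[Z] = \sum_i w_i \big( \mu_i^2 + \sigma_i^2 \big) - \Big( \sum_i w_i \mu_i \Big)^2 \tag{22}$$

This last expression gives the variance of the mixture in terms of the weight, mean and variance of the mixture components.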
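The mixture variance, $\sum_i w_i(\mu_i^2 + \sigma_i^2) - (\sum_i w_i \mu_i)^2$, translates directly into code. The sketch below uses illustrative names and values, and returns the standard deviation needed by UCB:

```python
import numpy as np


def mixture_std(weights, means, stds):
    """Standard deviation of a Gaussian mixture from its component parameters."""
    w = np.asarray(weights, dtype=float)
    mu = np.asarray(means, dtype=float)
    sd = np.asarray(stds, dtype=float)
    second_moment = np.sum(w * (mu ** 2 + sd ** 2))
    mean = np.sum(w * mu)
    return float(np.sqrt(second_moment - mean ** 2))


# Two unit-variance components at -1 and +1: variance = 1 (within) + 1 (between) = 2.
spread = mixture_std([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
# A single component recovers its own standard deviation.
single = mixture_std([1.0], [3.0], [0.4])
```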

Appendix B Related Work

Return Uncertainty

While the distributional Bellman equation (Eq. 2) is certainly not new (Sobel, 1982; White, 1988), nearly all RL research has focussed on the mean action-value. Most papers that do study the underlying return distribution study the ‘variance of the return’. Engel et al. (2005) learned the distribution of the return with Gaussian Processes, but did not use it for exploration. Tamar et al. (2016) studied the variance of the return with linear function approximation. Mannor & Tsitsiklis (2011) theoretically study policies that bound the variance of the return.

The variance of the return has actually primarily been studied in the context of risk-sensitive RL. In several scenarios we may want to avoid incidental large negative pay-offs, which can e.g. be disastrous for a real-world robot, or in a financial portfolio. Morimura et al. (2012) studied parametric return distribution propagation as well. They do risk-sensitive exploration by softmax exploration over quantile Q-functions (also known as the Value-at-Risk (VaR) in financial management literature). Their distribution losses are based on KL-divergences (including Normal, Laplace and skewed Laplace distributions), but their implementations do remain in the tabular setting.

Bellemare et al. (2017) were the first to theoretically study the distributional Bellman operator, and also implement a distributional policy evaluation algorithm in the context of neural networks. Thereby, their work can be considered the basis of our work, where we present an extension that uses the return distribution for exploration. Concurrently with our work, Tang & Agrawal (2018); Tang & Kucukelbir (2017) interpreted the return distribution from a variational perspective and leveraged it for exploration as well. Moerland et al. (2017a) also provided initial work on the return distribution for exploration. Our present paper is more extensive on the theoretical side, for example specifying full distributional loss functions and comparing different types of network output distributions. However, Moerland et al. (2017a) do try to connect the concept of return distribution to the statistical uncertainty of the mean action value as well, which both seem plausible quantities for exploration.

Another branch of related work is from the Tree Search community. Various papers have focussed on propagating distributions within the tree, e.g. Tesauro et al. (2012) and Kaufmann & Koolen (2017). The tree search approach by Moerland et al. (2018) does not explicitly propagate distributions (only -like estimates), but their idea (the remaining uncertainty should also incorporate the remaining uncertainty in the subtree below an action) is observable in the return-based exploration and learning visualizations of this paper as well.

Other Uncertainty-based exploration methods

There exists a long history of work on the statistical uncertainty of the mean action value for exploration, in the context of function approximation for example by Osband et al. (2016), Gal et al. (2016) and more recently Azizzadenesheli et al. (2017), Henderson et al. (2017) and Jeong & Lee (2017). Moreover, the uncertainty theme for exploration also appears in count-based exploration approaches (Bellemare et al., 2016) and model-based RL (Guez et al., 2012; Moerland et al., 2017b).

Appendix C Randomized Chain

Figure 4: Chain domain. Example MDP where undirected exploration is highly inefficient. Based on Osband et al. (2014).

We here present the randomized Chain, which we believe is the correct implementation of a well-known RL task known as the Chain (Osband et al., 2014) (Fig. 4). The domain illustrates the difficulty of exploration with sparse rewards. The MDP consists of a chain of $N$ states. At each time step the agent has two available actions: ‘left’ and ‘right’. At every step, one of both actions is the ‘correct’ one, which deterministically moves the agent one step further in the chain. The wrong action terminates the episode. All states have zero reward except the final chain state, which has reward $r = 1$.

Variants of this problem have been studied more frequently in RL (Osband et al., 2014). In the ‘ordered’ implementation, the correct action is always the same (e.g. ‘right’), and the optimal policy is to always walk right. This is the variant illustrated in Fig. 4 as well. However, in our ‘randomized’ Chain implementation the correct action at each state is randomly picked at domain initialization. The problem with the ordered version is that it introduces a systematic bias which is easily exploited when learning with neural networks. Due to its generalization, a neural network relatively easily predicts to always take the same action, and then suddenly solves the entire chain. With the randomized version, there is actually no structure in the domain at all, and learning with a neural network only makes the domain more complicated. The ‘randomized’ version therefore gives the true exponential complexity, as reported before (Osband et al., 2014; Moerland et al., 2017a), when learning with neural networks.
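A minimal sketch of such a randomized Chain is given below. The class name, the integer state encoding, the `reset`/`step` interface and the choice to end the episode on reaching the final state are assumptions for illustration; the authors' implementation may differ in these details.

```python
import numpy as np


class RandomizedChain:
    """Chain where the 'correct' action at every depth is drawn at initialization."""

    def __init__(self, length, seed=None):
        rng = np.random.default_rng(seed)
        self.length = length
        # One correct action (0 = 'left', 1 = 'right') per depth, fixed for the domain.
        self.correct = rng.integers(0, 2, size=length)
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        """Return (next_state, reward, done). Call reset() after an episode ends."""
        if action != int(self.correct[self.state]):
            return self.state, 0.0, True          # wrong action terminates with zero reward
        self.state += 1
        done = self.state == self.length          # reached the end of the chain
        return self.state, (1.0 if done else 0.0), done


env = RandomizedChain(length=5, seed=0)

# Following the (normally unknown) correct actions reaches the reward.
s, done, total = env.reset(), False, 0.0
while not done:
    s, r, done = env.step(int(env.correct[s]))
    total += r

# A single wrong action ends the episode immediately.
s = env.reset()
_, r_wrong, done_wrong = env.step(1 - int(env.correct[s]))
```

Under a uniformly random policy, the probability of reaching the reward in one episode is $2^{-N}$ for a chain of length $N$, which is the exponential difficulty that undirected exploration faces.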

Appendix D Implementation Details

Network architecture consists of a 3 layer neural network per discrete action with 256 nodes in each hidden layer and ELU activations. Learning rates were 0.0005 on all experiments. Optimization is performed with stochastic gradient descent on minibatches of size 32 using Adam updates in Tensorflow. We use a replay database of size 50,000. After collecting a new (set of) roll-outs, we randomly sample an equal amount of data from the replay for processing. All new collected data is processed on-policy, while all replay data is processed off-policy. The maximum length per episode is 200. We use a discount factor $\gamma$. All $\epsilon$-greedy experiments have $\epsilon$ fixed at 0.05 throughout learning.

For the categorical outcome we put the bin edges slightly above and below the highest and lowest expected reward in the domain. In the chain we use bins, on the other domains we use bins. For the Gaussian output we add an initialization bias of to the standard deviation at initialization. For the Gaussian mixture output we use mixture components, where we spread out the mixture means upon initialization. Due to the logarithm appearing in the Gaussian cross-entropy loss we see that the gradient may explode when the standard deviation strongly narrows. We mitigate this problem by clipping gradients.

Thompson sampling is best implemented in an ‘episode-wise’ fashion, where we sample from a posterior distribution over parameters once at the beginning of a new episode (Russo et al., 2017). This ensures deep exploration. However, for the return based uncertainty we directly sample in the network output space per action, and we cannot implement this correlated form of Thompson sampling.

Figure 5: Example of Gaussian distribution propagation on the Toy example of Fig. 1. a) Return distributions at initialization. Both Thompson sampling and UCB have largely uniform policies. b) Distributions after training for 64 episodes. The terminal state distributions start gradually converging, while the distributions at the root state remain broader. The terminal node decisions are already greedy, while the first node already starts to assign higher probability to the optimal action. c) Converged distributions after training for some additional time. Both UCB and Thompson sampling now deterministically sample the optimal policy.

Figure 6: Example of return distribution-based exploration on the Chain of length 7. Each plot (a-c) shows successive states (left-to-right) and both actions (up and down). The correct action at each step (randomly drawn at domain initialization) is indicated by a green box around the plot. We use a categorical distribution with 7 atoms. a) Return distributions after 2 episodes. The distributions are almost uniform, which makes the policy fully exploratory. b) Return distributions after 28 episodes. Both the correct and wrong action have propagated mass towards 0. However, the distributions of the wrong actions converge faster, because the correct actions propagate the remaining uncertainty at the next timestep. The correct action in the last state already started to move towards a value of 1. c) Converged return distributions after 68 episodes. All the correct actions have now backpropagated the return from the end of the chain. The policy now consistently exploits. Note that some of the wrong actions (red boxes) put some mass at a return of 1 as well. This is due to the generalization from neighboring states (which are treated as continuous in the network input), but disappears with enough data.