1 Introduction
A reinforcement learning (RL) agent is tasked with two fundamental, interdependent problems: exploration (how to discover useful data), and credit assignment (how to incorporate it). In this work, we take a careful look at the problem of credit assignment. The instrumental learning object in RL – the value function – quantifies the following question: “how does choosing an action in a state affect future return?”. This is a challenging question for several reasons.
Issue 1: Variance.
The simplest way of estimating the value function is by averaging returns (future discounted sums of rewards) starting from taking action $a$ in state $x$. This Monte Carlo style of estimation is inefficient, since there can be a lot of randomness in trajectories.
Issue 2: Partial observability.
To amortize the search and reduce variance, temporal difference (TD) methods, like Sarsa and Q-learning, use a learned approximation of the value function and bootstrap. This introduces bias due to the approximation, as well as a reliance on the Markov assumption, which is especially problematic when the agent operates outside of a Markov Decision Process (MDP), for example if the state is partially observed, or if there is function approximation. Bootstrapping may then cause the value function to not converge at all, or to remain permanently biased singh1994learning.
Issue 3: Time as a proxy.
TD($\lambda$) methods control this bias-variance tradeoff, but they rely on time as the sole metric for relevance: the more recent the action, the more credit or blame it receives from a future reward sutton1988learning ; Sutton2019Book. Although time is a reasonable proxy for cause-and-effect (especially in MDPs), in general it is a heuristic, and can hence be improved by learning.
Issue 4: No counterfactuals.
The only data used for estimating an action’s value are trajectories that contain that action, while ideally we would like to be able to use the same trajectory to update all relevant actions, not just the ones that happened to (serendipitously) occur.
Figure 1 illustrates these issues concretely. At a high level, we wish to achieve credit assignment mechanisms that are both sample-efficient (issues 1 and 4), and expressive (issues 2 and 3). To this end, we propose to reverse the key learning question, and learn estimators that measure: “given the future outcome (reward or state), how relevant was the choice of $a$ in $x$ to achieve it?”, which is essentially the credit assignment question itself. Although eligibility traces consider the same question, they do so in a way that is (purposefully) equivalent to the forward view sutton1988learning, and so they have to rely mainly on “vanilla” features, like time, to decide credit assignment. Reasoning in the backward view explicitly opens up a new family of algorithms. Specifically, we propose to use a form of hindsight conditioning to determine the relevance of a past action to a particular outcome. We show that the usual value functions can be rewritten in hindsight, yielding a new family of estimators, and derive policy gradient algorithms that use these estimators. We demonstrate empirically the ability of these algorithms to address the highlighted issues through a set of diagnostic tasks, which are not handled well by other means.
2 Background and Notation
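A Markov decision process (MDP) puterman1994markov is a tuple $(\mathcal{X}, \mathcal{A}, p, r, \gamma)$, with $\mathcal{X}$ being the state space, $\mathcal{A}$ the action space, $p$ the state-transition distribution (with $p(y|x,a)$ denoting the probability of transitioning to state $y$ from $x$ by choosing action $a$), $r$ the reward function, and $\gamma$ the scalar discount factor. A stochastic policy $\pi$ maps each state to a distribution over actions: $\pi(a|x)$ denotes the probability of choosing action $a$ in state $x$. Let $\mathcal{T}(x,\pi)$ and $\mathcal{T}(x,a,\pi)$ be the distributions over trajectories $\tau = (X_k, A_k, R_k)_{k\ge 0}$ generated by a policy $\pi$, given $X_0 = x$ and $(X_0, A_0) = (x,a)$, respectively. Let $Z(\tau) = \sum_{k\ge 0}\gamma^k R_k$ be the return obtained along the trajectory $\tau$. The value (or V-) function $V^\pi$ and the action-value (or Q-) function $Q^\pi$ denote the expected return under the policy $\pi$ given $X_0 = x$ and $(X_0, A_0) = (x,a)$, respectively:
$$V^\pi(x) = \mathbb{E}_{\tau\sim\mathcal{T}(x,\pi)}\left[Z(\tau)\right], \qquad Q^\pi(x,a) = \mathbb{E}_{\tau\sim\mathcal{T}(x,a,\pi)}\left[Z(\tau)\right]. \tag{1}$$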
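The benefit of choosing a given action over the usual policy is measured by the advantage function $A^\pi(x,a) = Q^\pi(x,a) - V^\pi(x)$. Policy gradient algorithms improve the policy by changing $\pi_\theta$ in the direction of the gradient of the value function $V^{\pi_\theta}$ Sutton:2000. This gradient at some initial state $x_0$ is
$$\nabla_\theta V^{\pi_\theta}(x_0) = \sum_{x,a} d^{\pi_\theta}(x|x_0)\,\nabla_\theta \pi_\theta(a|x)\,Q^{\pi_\theta}(x,a) = \mathbb{E}_{\tau\sim\mathcal{T}(x_0,\pi_\theta)}\Big[\sum_{k\ge 0}\gamma^k\,\nabla_\theta\log\pi_\theta(A_k|X_k)\,A^{\pi_\theta}(X_k,A_k)\Big],$$
where $d^{\pi_\theta}(x|x_0) = \sum_{k\ge 0}\gamma^k\,\mathbb{P}_{\tau\sim\mathcal{T}(x_0,\pi_\theta)}(X_k = x)$ is the (unnormalized) discounted state-visitation distribution. Practical algorithms such as REINFORCE williams1992simple approximate $Q^\pi$ or $A^\pi$ with an $n$-step truncated return, possibly combined with a bootstrapped approximate value function $V$, which is also often used as baseline (see Sutton:2000 ; A3C_2016) along a trajectory $\tau = (X_k, A_k, R_k)_{k\ge 0}$:
$$A^{\pi_\theta}(X_0, A_0) \approx \sum_{k=0}^{n-1}\gamma^k R_k + \gamma^n V(X_n) - V(X_0).$$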
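The $n$-step estimate described above (a truncated discounted return, plus a bootstrapped value at the truncation point, minus the baseline at the start) can be sketched in a few lines. This is an illustrative helper only: the function name and argument names are assumptions, and the value estimates are passed in as plain numbers rather than produced by a learned critic.

```python
from typing import Sequence


def n_step_advantage(rewards: Sequence[float], v_start: float,
                     v_end: float, gamma: float) -> float:
    """Hypothetical helper: n-step advantage estimate for the first step.

    rewards : R_0, ..., R_{n-1} observed along the trajectory
    v_start : approximate value V(X_0), used as the baseline
    v_end   : approximate value V(X_n), used for bootstrapping
    gamma   : discount factor
    """
    n = len(rewards)
    # Truncated discounted return: sum_{k<n} gamma^k R_k
    truncated = sum((gamma ** k) * r for k, r in enumerate(rewards))
    # Bootstrap from V(X_n), then subtract the baseline V(X_0)
    return truncated + (gamma ** n) * v_end - v_start


example = n_step_advantage([1.0, 0.0, 2.0], v_start=0.5, v_end=4.0, gamma=0.5)
```

With an empty reward list and `v_end == v_start` the estimate is zero, which is a quick sanity check that the baseline and bootstrap terms are aligned.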
3 Conditioning on the Future
The classical value function attempts to answer the question: "how does the current action affect future outcomes?" By relying on predictions about these future outcomes, existing approaches often exacerbate problems around variance (issue 1) and partial observability (issue 2). Furthermore, these methods tend to use temporal distance as a proxy for relevance (issue 3) and are unable to assign credit counterfactually (issue 4). We propose to learn estimators that explicitly consider the credit assignment question: "given an outcome, how relevant were past decisions?", and try to answer it explicitly.
This approach can in fact be linked to some classical methods in statistical estimation. In particular, Monte Carlo simulation is known to be inaccurate when there are rare events that are of interest: the averaging requires an infeasible number of samples to obtain an accurate estimate rubino2009rare. One solution is to change measures, that is, to use another distribution for which the events are less rare, and correct with importance sampling. The Girsanov theorem is a well-known example of this in processes with Brownian dynamics girsanov1960transforming, known to produce lower-variance estimates.
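As a small illustration of the change-of-measure idea (not of the Girsanov theorem itself), the sketch below estimates the probability that a standard normal variable exceeds a threshold by sampling from a normal distribution shifted onto the rare region and correcting with the density ratio. The function name, the choice of proposal, and the sample size are all assumptions made for the example.

```python
import math
import random


def rare_event_probability(threshold: float, num_samples: int, seed: int = 0) -> float:
    """Estimate P(X > threshold) for X ~ N(0, 1) with importance sampling.

    Samples are drawn from the proposal N(threshold, 1), under which the
    event is no longer rare, and reweighted by the density ratio.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x = rng.gauss(threshold, 1.0)
        if x > threshold:
            # N(x; 0, 1) / N(x; threshold, 1) = exp(-threshold * x + threshold^2 / 2)
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / num_samples


# Closed-form tail probability of the standard normal, for comparison.
exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))
estimate = rare_event_probability(4.0, 20000)
```

Plain Monte Carlo with the same number of samples would typically see no exceedances at all for this threshold, whereas about half of the proposal samples land in the event.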
This scenario of rare random events is particularly relevant to efficient credit assignment in RL. When a new significant outcome is experienced, the agent ought to quickly update its estimates and policy accordingly. Let $\tau$ be a sampled trajectory, and $f(\tau)$ some function of it. By changing measures from the policy $\pi(a|x)$ with which it was sampled to a future-conditional, or hindsight distribution $h(a|x,\pi,f(\tau))$, we hope to improve the efficiency of credit assignment. The importance sampling ratio $\frac{h(a|x,\pi,f(\tau))}{\pi(a|x)}$ then precisely denotes the relevance of an action $a$ to the specific future $f(\tau)$. If the distribution $h$ is accurate, this allows us to quickly assign credit to all actions relevant to achieving $f(\tau)$. In this work, we consider $f(\tau)$ to be a future state, or a future return. To highlight the use of the future-conditional distribution, we refer to the resulting family of methods as Hindsight Credit Assignment (HCA).
The remainder of this section formalizes the insight outlined above, and derives the usual value functions in terms of the hindsight distributions, while the subsequent section presents novel policy gradient algorithms based on these estimators.
3.1 Conditioning on Future States
The agent composes its estimates of the return from an action $a$ by summing over the rewards obtained from future states $X_k$. One option of hindsight conditioning is to consider, at each step $k$, the likelihood of an action $a$ given that the future state $X_k$ was reached.
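Definition 1 (State-conditional hindsight distributions).
For any action $a$ and any state $y$, define $h_k(a|x,\pi,y)$ to be the conditional probability over trajectories $\tau\sim\mathcal{T}(x,\pi)$ of the first action $A_0$ of trajectory $\tau$ being equal to $a$, given that the state $y$ has occurred at step $k$ along trajectory $\tau$:
$$h_k(a|x,\pi,y) = \mathbb{P}_{\tau\sim\mathcal{T}(x,\pi)}\left(A_0 = a \mid X_k = y\right). \tag{2}$$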
Intuitively, $h_k(a|x,\pi,y)$ quantifies the relevance of action $a$ to the future state $y$. If $a$ is not relevant to reaching $y$, this probability is simply the policy $\pi(a|x)$ (there is no relevant information in $X_k$). If $a$ is instrumental to reaching $y$, $h_k(a|x,\pi,y) > \pi(a|x)$, and vice versa, if $a$ detracts from reaching $y$, $h_k(a|x,\pi,y) < \pi(a|x)$. In general, $h_k$ is a lower-entropy distribution than $\pi$. The relationship of $h_k$ to more familiar quantities can be understood through the following identity obtained by an application of Bayes’ rule:
$$\frac{h_k(a|x,\pi,y)}{\pi(a|x)} = \frac{\mathbb{P}_{\tau\sim\mathcal{T}(x,a,\pi)}(X_k = y)}{\mathbb{P}_{\tau\sim\mathcal{T}(x,\pi)}(X_k = y)}.$$
Using this identity and importance sampling, we can rewrite the usual Q-function in terms of $h_k$. Since throughout there is only one policy $\pi$ involved, we will drop the explicit conditioning on $\pi$, but it is implied.
Theorem 1.
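Consider an action $a$ and a state $x$ for which $\pi(a|x) > 0$. Then the following holds:
$$Q^\pi(x,a) = r(x,a) + \mathbb{E}_{\tau\sim\mathcal{T}(x,\pi)}\Big[\sum_{k\ge 1}\gamma^k\,\frac{h_k(a|x,X_k)}{\pi(a|x)}\,R_k\Big].$$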
So, each of the rewards $R_k$ along the way is weighted by the ratio $\frac{h_k(a|x,X_k)}{\pi(a|x)}$, which exactly quantifies how relevant $a$ was in achieving the corresponding state $X_k$. Following the discussion above, this ratio is $1$ if $a$ is irrelevant, and larger or smaller than $1$ in the other cases. The expression for the Q-function is similar to that in Eq. (1), but the new expectation is no longer conditioned on the initial action $a$: the policy $\pi$ is followed from the start ($\tau\sim\mathcal{T}(x,\pi)$ instead of $\tau\sim\mathcal{T}(x,a,\pi)$). This is an important point, as it will allow us to use returns generated by any action to update the values of all actions, to the extent that they are relevant according to $h_k$. Theorem 1 implies the following expression for the advantage:
$$A^\pi(x,a) = r(x,a) - r^\pi(x) + \mathbb{E}_{\tau\sim\mathcal{T}(x,\pi)}\Big[\sum_{k\ge 1}\gamma^k\Big(\frac{h_k(a|x,X_k)}{\pi(a|x)} - 1\Big)R_k\Big], \tag{3}$$
where $r^\pi(x) = \sum_a \pi(a|x)\,r(x,a)$. This form of the advantage is particularly appealing, since it directly removes irrelevant rewards from consideration. Indeed, whenever $h_k(a|x,X_k) = \pi(a|x)$, the reward $R_k$ does not participate in the advantage for the value of action $a$. When there is inconsequential noise that is outside of the agent’s control, this may greatly reduce the variance of the estimates.
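The hindsight rewriting of the Q-function can be checked by exact enumeration on a toy problem. The sketch below assumes a hypothetical two-action, one-transition MDP (all numbers are made up for illustration): the hindsight probability is computed from Bayes' rule, and the hindsight-weighted expectation under the policy is compared with the usual forward definition.

```python
GAMMA = 0.9
PI = [0.7, 0.3]                      # pi(a | x0), hypothetical policy
P_NEXT = [[0.9, 0.1], [0.2, 0.8]]    # P(X_1 = y | x0, a), hypothetical dynamics
R0 = [0.0, 0.5]                      # immediate reward r(x0, a)
R1 = [1.0, 3.0]                      # reward collected on reaching X_1 = y


def q_forward(a: int) -> float:
    """Usual Q-value: condition on the first action and average over next states."""
    return R0[a] + GAMMA * sum(P_NEXT[a][y] * R1[y] for y in range(2))


def hindsight(a: int, y: int) -> float:
    """h_1(a | x0, y): probability that the first action was a, given X_1 = y."""
    marginal = sum(PI[b] * P_NEXT[b][y] for b in range(2))
    return PI[a] * P_NEXT[a][y] / marginal


def q_hindsight(a: int) -> float:
    """Hindsight form: trajectories follow pi from the start, and the future
    reward is reweighted by h_1(a | x0, X_1) / pi(a | x0)."""
    total = R0[a]
    for b in range(2):               # first action actually sampled from pi
        for y in range(2):
            prob = PI[b] * P_NEXT[b][y]
            total += prob * GAMMA * hindsight(a, y) / PI[a] * R1[y]
    return total
```

Both forms agree for every action, even though `q_hindsight` averages over trajectories that started with either action. If the two rows of `P_NEXT` were equal, the ratio would be exactly 1 and the future reward would contribute nothing to the advantage.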
Removing time dependence.
For clarity of exposition, here we have considered the hindsight distribution to be additionally conditioned on time. Indeed, $h_k$ depends not only on reaching the state, but also on the number of timesteps $k$ that it takes to do so. In general, this can be limiting, as it introduces a stronger dependence on the particular trajectory, and a harder estimation problem of the hindsight distribution. It turns out we can generalize all of the results presented here to a time-independent distribution $h_\beta(a|x,y)$, which gives the probability of $a$ conditioned on reaching $y$ at some point in the future. The scalar $\beta$ is the "probability of survival" at each step. This can either be the discount $\gamma$, or a termination probability if the problem is undiscounted. In the discounted reward case Eq. (3) can be re-expressed in terms of $h_\beta$ as follows:
$$A^\pi(x,a) = r(x,a) - r^\pi(x) + \mathbb{E}_{\tau\sim\mathcal{T}(x,\pi)}\Big[\sum_{k\ge 1}\gamma^k\Big(\frac{h_\beta(a|x,X_k)}{\pi(a|x)} - 1\Big)R_k\Big], \tag{4}$$
with the choice of $\beta = \gamma$. The interested reader may find the relevant proofs in the appendix.
Finally, it is possible to obtain a hindsight V-function, analogously to the Q-function from Theorem 1. The next section does this for return-conditional HCA. We include other variations in the appendix.
3.2 Conditioning on Future Returns
The previous section derived Q-functions that explicitly reweigh the rewards at each step, based on the corresponding states’ connection to the action whose value we wish to estimate. Since ultimately we are interested in the return, we could alternatively use it for future conditioning itself.
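Definition 2 (Return-conditional hindsight distributions).
For any action $a$ and any possible return $z$, define $h_z(a|x,\pi,z)$ to be the conditional probability over trajectories $\tau\sim\mathcal{T}(x,\pi)$ of the first action $A_0$ being $a$, given that $z$ has been observed along $\tau$:
$$h_z(a|x,\pi,z) = \mathbb{P}_{\tau\sim\mathcal{T}(x,\pi)}\left(A_0 = a \mid Z(\tau) = z\right).$$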
The distribution $h_z(a|x,\pi,z)$ is intuitively similar to $h_k$, but instead of future states, it directly quantifies the relevance of $a$ to obtaining the entire return $z$. This is appealing, since in the end we care about returns. Further, this could be simpler to learn, since instead of the possibly high-dimensional state, we now need to worry only about a scalar outcome. On the other hand, it is no longer "jumpy" in time, so may benefit less from structure in the dynamics. As with $h_k$, we will drop the explicit conditioning on $\pi$, but it is implied. We have the following result.
Theorem 2.
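Consider an action $a$, and assume that for any possible random return $z = Z(\tau)$ for some trajectory $\tau\sim\mathcal{T}(x,\pi)$ we have $h_z(a|x,z) > 0$. Then we have:
$$V^\pi(x) = \mathbb{E}_{\tau\sim\mathcal{T}(x,a,\pi)}\Big[Z(\tau)\,\frac{\pi(a|x)}{h_z(a|x,Z(\tau))}\Big]. \tag{5}$$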
The V- (rather than Q-) function form here has interesting properties that we will discuss in the next section. Mathematically, the two forms are analogous to derive, but the ratio is now flipped. Equations (5) and (1) imply the following expression for the advantage:
$$A^\pi(x,a) = \mathbb{E}_{\tau\sim\mathcal{T}(x,a,\pi)}\Big[\Big(1 - \frac{\pi(a|x)}{h_z(a|x,Z(\tau))}\Big)Z(\tau)\Big]. \tag{6}$$
The factor $c(a|x,Z) = 1 - \frac{\pi(a|x)}{h_z(a|x,Z)}$ expresses how much a single action $a$ contributed to obtaining a return $Z$. If other actions (drawn from $\pi(\cdot|x)$) would have yielded the same return, $c(a|x,Z) = 0$, and the advantage is $0$. If an action $a$ has made achieving $Z$ more likely, then $c(a|x,Z) > 0$, and conversely, if other actions would have contributed to achieving $Z$ more than $a$, then $c(a|x,Z) < 0$. Hence, $c(a|x,Z)$ expresses the impact an action has on the environment, in terms of the return, if everything else (future decisions as well as randomness of the environment) is unchanged.
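The return-conditional form of the advantage can likewise be checked by enumeration. The sketch below assumes a hypothetical two-armed bandit with two possible returns, both reachable from either action so that the hindsight probability is always positive; the numbers are illustrative only.

```python
PI = [0.6, 0.4]                                   # pi(a | x), hypothetical policy
# P(Z = z | x, a): every return is reachable from every action
P_RETURN = [{0.0: 0.8, 1.0: 0.2}, {0.0: 0.1, 1.0: 0.9}]


def p_return(z: float) -> float:
    """Marginal probability of return z when the first action is drawn from pi."""
    return sum(PI[b] * P_RETURN[b][z] for b in range(2))


def h_z(a: int, z: float) -> float:
    """Return-conditional hindsight probability of action a, via Bayes' rule."""
    return PI[a] * P_RETURN[a][z] / p_return(z)


def advantage_forward(a: int) -> float:
    """Usual advantage Q(x, a) - V(x)."""
    q = [sum(p * z for z, p in P_RETURN[b].items()) for b in range(2)]
    v = sum(PI[b] * q[b] for b in range(2))
    return q[a] - v


def advantage_hindsight(a: int) -> float:
    """Hindsight advantage: expectation over returns generated by action a of
    (1 - pi(a | x) / h_z(a | x, Z)) * Z."""
    return sum(p * (1.0 - PI[a] / h_z(a, z)) * z for z, p in P_RETURN[a].items())
```

In this example the return `1.0` is much more likely after action 1, so the factor `1 - PI[1] / h_z(1, 1.0)` is positive, while the same factor for action 0 is negative.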
Both $h_\beta$ and $h_z$ can be learned online from sampled trajectories (see Sec. 4 for algorithms, and a discussion in Sec. 4.1). Finally, while we chose to focus on state and return conditioning, one could consider other options. For example, conditioning on the reward (instead of the state) at a future time $k$, or an embedding of (or part of) the future trajectory, could have interesting properties.
3.3 Policy Gradients
We now give a policy gradient theorem based on the new expressions of the value function.
Theorem 3.
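Let $\pi_\theta$ be the policy parameterized by $\theta$, and $\beta = \gamma$. Then, the gradient of the value at some state $x_0$ is:
$$\nabla_\theta V^{\pi_\theta}(x_0) = \mathbb{E}_{\tau\sim\mathcal{T}(x_0,\pi_\theta)}\Big[\sum_{k\ge 0}\gamma^k\sum_a \nabla_\theta\pi_\theta(a|X_k)\,Q^x(X_k,a)\Big], \qquad Q^x(X_k,a) = r(X_k,a) + \sum_{t\ge k+1}\gamma^{t-k}\,\frac{h_\beta(a|X_k,X_t)}{\pi_\theta(a|X_k)}\,R_t, \tag{7}$$
$$\nabla_\theta V^{\pi_\theta}(x_0) = \mathbb{E}_{\tau\sim\mathcal{T}(x_0,\pi_\theta)}\Big[\sum_{k\ge 0}\gamma^k\,\nabla_\theta\log\pi_\theta(A_k|X_k)\,A^z(X_k,A_k)\Big], \qquad A^z(X_k,A_k) = \Big(1 - \frac{\pi_\theta(A_k|X_k)}{h_z(A_k|X_k,Z_k)}\Big)Z_k, \tag{8}$$
where $Z_k = \sum_{t\ge k}\gamma^{t-k}R_t$ is the return from step $k$ onwards.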
Note that the expression for state HCA in Eq. (7) is written for all actions, rather than only the sampled one. Interestingly, this form does not require (or benefit from) a baseline. Contrary to the usual all-actions algorithm which uses the critic, the HCA reweighting allows us to use returns sampled from a particular starting action to obtain value estimates for all actions.
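To make the all-actions point concrete, the sketch below computes a hindsight-weighted return estimate for every action at the first state of a single sampled trajectory. It is a simplified illustration, not the paper's algorithm: the function name and the callables `pi`, `h` and `r_hat` are hypothetical stand-ins for the policy, a state-conditional hindsight model and an immediate-reward model, and no bootstrapping is performed.

```python
from typing import Callable, Hashable, List, Sequence


def all_action_q_estimates(
    states: Sequence[Hashable],
    rewards: Sequence[float],
    num_actions: int,
    pi: Callable[[int, Hashable], float],            # pi(a | x), assumed given
    h: Callable[[int, Hashable, Hashable], float],   # h(a | x, y), assumed given
    r_hat: Callable[[Hashable, int], float],         # immediate-reward estimate
    gamma: float,
) -> List[float]:
    """Estimate the value of every action at states[0] from one trajectory.

    rewards[t] is the reward collected at step t; rewards[0] is not used,
    because the immediate reward of each candidate action comes from r_hat.
    """
    x0 = states[0]
    estimates = []
    for a in range(num_actions):
        total = r_hat(x0, a)
        for t in range(1, len(rewards)):
            # Later rewards count only to the extent that action a is
            # relevant, in hindsight, to reaching the state where they occurred.
            ratio = h(a, x0, states[t]) / pi(a, x0)
            total += (gamma ** t) * ratio * rewards[t]
        estimates.append(total)
    return estimates
```

When the hindsight model equals the policy (the first action is irrelevant to every later state), all actions receive the same future contribution and differ only in their immediate reward, so the all-actions policy gradient sees no spurious signal from the future.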
4 Algorithms
Using the new policy gradient theorem, we will now give novel algorithms based on sampling the expectations (7) and (8). Then, we will discuss the training of the relevant hindsight distributions.
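State-Conditional HCA
Consider a parametric representation of the policy $\pi(a|x)$ and the future-state-conditional distribution $h_\beta(a|x,y)$, as well as the baseline $V(x)$ and an estimate of the immediate reward $\hat r(x,a)$. Generate $T$-step trajectories $\tau^T = (X_s, A_s, R_s)_{0\le s\le T}$. We can compose an estimate of the return for all actions $a$ (see Theorem 7 in appendix):
$$Q^x(X_s,a) \approx \hat r(X_s,a) + \sum_{t=s+1}^{T-1}\gamma^{t-s}\,\frac{h_\beta(a|X_s,X_t)}{\pi(a|X_s)}\,R_t + \gamma^{T-s}\,\frac{h_\beta(a|X_s,X_T)}{\pi(a|X_s)}\,V(X_T).$$
The algorithm proceeds by training $V(X_s)$ to predict the usual return $Z_s = \sum_{t=s}^{T-1}\gamma^{t-s}R_t + \gamma^{T-s}V(X_T)$ and $\hat r(X_s,A_s)$ to predict $R_s$ (square loss), the hindsight distribution $h_\beta(a|X_s,X_t)$ to predict $A_s$ (cross entropy loss), and finally by updating the policy logits with $\sum_a\nabla\pi(a|X_s)\,Q^x(X_s,a)$. See Algorithm 1 in appendix for the detailed pseudocode.
Return-Conditional HCA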
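Consider a parametric representation of the policy $\pi(a|x)$ and the return-conditioned distribution $h_z(a|x,z)$. Generate full trajectories $\tau = (X_s, A_s, R_s)_{s\ge 0}$ and compute the sampled advantage at each step:
$$A^z(X_s,A_s) = \Big(1 - \frac{\pi(A_s|X_s)}{h_z(A_s|X_s,Z_s)}\Big)Z_s,$$
where $Z_s = \sum_{t\ge s}\gamma^{t-s}R_t$. The algorithm proceeds by training the hindsight distribution $h_z(a|X_s,Z_s)$ to predict $A_s$ (cross entropy loss), and updating the policy gradient with $\nabla\log\pi(A_s|X_s)\,A^z(X_s,A_s)$. See Algorithm 2 in appendix for the detailed pseudocode.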
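The per-step computation in the return-conditional variant is simple enough to sketch directly. The helper below is an assumption-laden illustration rather than the paper's Algorithm 2: `pi` and `h_z` are hypothetical callables standing in for the current policy and a learned return-conditional hindsight model, and only the sampled advantages are computed (no parameter updates).

```python
from typing import Callable, Hashable, List, Sequence


def return_hca_advantages(
    states: Sequence[Hashable],
    actions: Sequence[int],
    rewards: Sequence[float],
    pi: Callable[[int, Hashable], float],           # pi(a | x), assumed given
    h_z: Callable[[int, Hashable, float], float],   # h_z(a | x, z), assumed given
    gamma: float,
) -> List[float]:
    """Sampled advantage (1 - pi / h_z) * Z_s at every step of one full trajectory."""
    # Discounted returns-to-go, computed backwards in a single pass.
    returns = [0.0] * len(rewards)
    running = 0.0
    for s in reversed(range(len(rewards))):
        running = rewards[s] + gamma * running
        returns[s] = running
    return [
        (1.0 - pi(a, x) / h_z(a, x, z)) * z
        for x, a, z in zip(states, actions, returns)
    ]
```

If the hindsight model coincides with the policy, every advantage is zero: the observed return carries no information about which action was taken, so no credit is assigned.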
RL without value functions.
The return-conditional version lends itself to a particularly simple algorithm. In particular, we no longer need to learn the value function: if $h_z$ is estimated well, using complete rollouts is feasible without variance issues. This takes our idea of reversing the direction of the learning question to the extreme; it is now entirely in hindsight.
The result is an actor-critic algorithm, where the usual baseline $V(X_s)$ is replaced by $b_s = \frac{\pi(A_s|X_s)}{h_z(A_s|X_s,Z_s)}Z_s$. This baseline is strongly correlated to the return $Z_s$ (it is proportional to it), which is desirable since we would like to remove as much of the variance (due to the dynamics of the world, or the agent’s own policy) as possible. The following proposition verifies that despite being correlated, this baseline does not introduce bias into the policy gradient.
Proposition 1.
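The baseline $b_k = \frac{\pi(A_k|X_k)}{h_z(A_k|X_k,Z_k)}Z_k$ does not introduce any bias in the policy gradient:
$$\mathbb{E}_{\tau\sim\mathcal{T}(x_0,\pi)}\Big[\sum_{k\ge 0}\gamma^k\,\nabla\log\pi(A_k|X_k)\,\big(Z_k - b_k\big)\Big] = \nabla V^\pi(x_0).$$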
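To see why a return-dependent baseline of this form can still be unbiased, it may help to work through the first time step under simplifying assumptions. The derivation below is only a sketch: it assumes discrete returns, a hindsight probability that is positive wherever it appears, and writes $b_0 = \frac{\pi(A_0|x)}{h_z(A_0|x,Z)}Z$ for the baseline at step $0$. It is not the general proof.

```latex
\begin{align*}
\mathbb{E}_{\tau\sim\mathcal{T}(x,\pi)}\!\left[\nabla\log\pi(A_0|x)\,b_0\right]
&= \sum_a \pi(a|x)\,\nabla\log\pi(a|x) \sum_z \mathbb{P}(Z=z \mid x,a)\,
   \frac{\pi(a|x)}{h_z(a|x,z)}\,z \\
&= \sum_a \nabla\pi(a|x) \sum_z \mathbb{P}(Z=z \mid x,a)\,
   \frac{\mathbb{P}(Z=z \mid x)}{\mathbb{P}(Z=z \mid x,a)}\,z
   && \text{(Bayes' rule)} \\
&= \sum_a \nabla\pi(a|x) \sum_z \mathbb{P}(Z=z \mid x)\,z \\
&= V^\pi(x)\,\nabla\sum_a \pi(a|x) \;=\; 0 .
\end{align*}
```

The key step is that the ratio $\pi / h_z$ converts the action-conditional return distribution back into the marginal one, so the inner sum no longer depends on the action and the gradient of the normalized policy sums to zero.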
4.1 Learning Hindsight Distributions
We have given equivalent rewritings of the usual value functions in terms of the proposed hindsight distributions, and have motivated their properties, when they are accurate. Now, the question is if it is feasible to learn good estimates of those distributions from experience, and whether shifting the learning problem in this way is beneficial. The remainder of this section discusses this question, while the next one provides empirical evidence for the affirmative.
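As a rough illustration of what learning a hindsight distribution from experience can look like in the simplest (tabular) case, the sketch below keeps counts of which first action preceded each observed outcome. In a tabular model the cross-entropy-minimizing prediction is the empirical conditional frequency, so counting is the natural estimator there. The class and method names are hypothetical, and the outcome can be a future state or a discretized return.

```python
from collections import defaultdict
from typing import Hashable


class TabularHindsight:
    """Hypothetical count-based estimate of h(a | x, outcome)."""

    def __init__(self, num_actions: int, prior_count: float = 0.0) -> None:
        self.num_actions = num_actions
        self.prior_count = prior_count   # optional additive smoothing
        self.counts = defaultdict(lambda: [0.0] * num_actions)

    def update(self, state: Hashable, action: int, outcome: Hashable) -> None:
        """Record that `action` was taken in `state` and `outcome` followed."""
        self.counts[(state, outcome)][action] += 1.0

    def prob(self, action: int, state: Hashable, outcome: Hashable) -> float:
        """Estimated probability that `action` was taken, given the outcome."""
        counts = self.counts[(state, outcome)]
        total = sum(counts) + self.prior_count * self.num_actions
        if total == 0.0:
            return 1.0 / self.num_actions   # nothing observed yet: uniform guess
        return (counts[action] + self.prior_count) / total


model = TabularHindsight(num_actions=2)
for taken in (0, 0, 0, 1):
    model.update("x", taken, "y")
```

Note that the table is indexed by outcomes but only ever predicts actions, which is the sense in which the model is discriminative: it never has to represent a distribution over states.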
There are several conventional objects that could be learned to help with credit assignment: a value function, a forward model, or an inverse model over states. An accurate forward model allows one to compute value functions directly with no variance, and an accurate inverse model allows one to perform precise credit assignment. However, learning such generative models accurately is difficult and has been a long-standing challenge in RL, especially in high-dimensional state spaces. Interestingly, the hindsight distribution is a discriminative, rather than generative model, and is hence not required to model the full distribution over states. Additionally, the action space is usually much smaller than the state space, and so shifting the focus to actions potentially makes the problem much easier. When certain structure in the dynamics is present, learning hindsight distributions may be significantly easier still, e.g. if the transition model is stochastic or the policy is changing, a particular $(x,a)$ can lead to many possible future states, but a particular future state can be explained by a small number of past actions. In general, learning $h_\beta$ and $h_z$ are supervised learning problems, so the new algorithms delegate some of the learning difficulty in RL to a supervised setting, for which many efficient approaches exist (e.g. gutmann2010noise ; van2018representation).
Figure 4 (caption fragment): … overcomes partial observability, but is prone to noise. The plot depicts learning curves for the setting with added white noise. Right: The average performance w.r.t. different noise levels; predictably, state HCA is the most robust.
Figure 5 (caption fragment): Ambiguous bandit with Gaussian rewards. Left: The state identity is observed. Both HCA methods improve on PG. Middle: The state identity is hidden, handicapping state HCA, but return HCA continues to improve on PG. Right: Average performance w.r.t. different crossover probabilities. Note that the optimal value itself decays in this case.
5 Experiments
To empirically validate our proposal in a controlled way, we devised a set of diagnostic tasks that highlight issues 1–4, while also being representative of what occurs in practice (Fig. 2). We then systematically verify the intuitions developed throughout the paper. In all cases, we learn the hindsight distributions in tandem with the control policy. For each problem we compare HCA with state and return conditioning to standard baseline policy gradient, that is: $n$-step advantage actor-critic (with $n = \infty$ for Monte Carlo). All the results are an average of independent runs, with the plots depicting means and standard deviations. For simplicity we take in all of the tasks.
Shortcut.
We begin with an example capturing the intuition from Fig. 1 (left). Fig. 2 (left) depicts a chain of length with a rewarding final state. At each step, one action takes a shortcut and directly transitions to the final state, while the other continues on the longer path, which may be more likely according to the policy. There is a per-step penalty (of ), and a final reward of . There is also a chance (of ) that the agent transitions to the absorbing state directly.
This problem highlights two issues: (1) the importance of counterfactual credit assignment (issue 4): when the long path is taken more frequently than the shortcut path, counterfactual updates become increasingly effective (see Fig. 3, right); and (2) the use of time as a proxy for relevance (issue 3) is shown to be only a heuristic, even in a fully observable MDP. The relevance of the states along the chain is not accurately reflected in the long temporal distance between them and the goal state. In Fig. 3 we show that HCA is more effective at quickly adjusting the policy towards the shortcut action.
Delayed Effect.
The next task instantiates the example from Fig. 1 (right). Fig. 2 (middle) depicts a POMDP, in which after the first decision, there is aliasing until the final state. This is a common case of partial observability, and is especially pertinent if the features are being learned. We show that (1) bootstrapping naively is inadequate in this case (issue 2), but HCA is able to carry the appropriate information (see the discussion in Appendix F); and (2) while Monte Carlo is able to overcome the partial observability, its performance deteriorates when intermediate reward noise is present (issue 1). HCA on the other hand is able to reduce the variance due to the irrelevant noise in the rewards.
Additionally, in this example the first decision is the most relevant choice, despite being the most temporally remote, once again highlighting that using temporal proximity for credit assignment is a heuristic (issue 3). One of the final states is rewarding (with ), the other penalizing (with ), and the middle states contain white noise of standard deviation . Fig. 4 depicts our results. In this task, the return-conditional HCA has a more difficult learning problem, as it needs to correctly model the noise distribution to condition on, which is as difficult as learning the values naively, and hence performs similarly to the baseline.
Ambiguous Bandit.
Finally, to emphasize that credit assignment can be challenging, even when it is not long-term, we consider a problem without a temporal component. Fig. 2 (right) depicts a bandit with two actions, leading to two different states, whose reward functions are similar (here: drawn from overlapping Gaussian distributions), with some probability of crossover. The challenge here is due to variance (issue 1) and a lack of counterfactual updates (issue 4). It is difficult to tell whether an action was genuinely better, or just happened to be on the tail end of the distribution. This is a common scenario when bootstrapping with similar values. Due to the explicit aim at modeling the distributions, the hindsight algorithms are more efficient (Fig. 5 (left)).
To highlight the differences between the two types of hindsight conditioning, we introduce partial observability (issue 2), see Fig. 5 (right). The return-conditional policy is still able to improve over policy gradient, but state-conditioning now fails to provide informative conditioning (by construction).
6 Related Work
Hindsight experience replay (HER) andrychowicz2017hindsight introduces the idea of off-policy learning about many goals from the same trajectory. The intuition is that regardless of what goal the trajectory was pursuing originally, in hindsight it, e.g., successfully found the one corresponding to its final state, and there is something to be learned. Rauber et al. rauber2018hindsight extend the same intuition to policy gradient algorithms, with goal-conditioned policies. Goyal et al. goyal2019recall also use goal conditioning and learn a backtracking model, which predicts the state-action pairs occurring on trajectories that end up in goal states. These works share our intuition of using the same data in hindsight to learn about many things, but in the context of goal-conditioned policies, while we essentially contrast conditional and unconditional policies, where the conditioning is on the extra outcome (state or return). Note that we never act w.r.t. the conditional policy, and it is used solely for credit assignment. Prioritized sweeping can be viewed as changing the sampling distribution with hindsight knowledge of the TD errors moore1993prioritized.
Another line of work that aims to propagate credit efficiently backward in time is the temporal value transport algorithm hung2018optimizing , in which an attention mechanism over memory is used to jump over parts of a trajectory that are irrelevant for the rewards obtained. While demonstrated on challenging problems, that method is biased; a promising direction for future research would be to apply our unbiased hindsight mechanism with past states chosen by such an attention mechanism.
A large number of variance reduction techniques have been applied in RL, e.g. using learned value functions as critics, and other control variates (e.g. weber2019credit). When a model of the environment is available, it can be used to reduce variance. Rollouts from the same state fill the same role in policy gradients schulman2015trust. Differentiable system dynamics allow low-variance estimates of the Q-value gradient by using the pathwise derivative estimator, effectively backpropagating the gradient of the objective along trajectories (e.g. schulman2015stochastic ; heess2015learning ; henaff2017model). In stochastic systems this requires knowledge of the environment noise. To bypass this, Heess et al. heess2015learning infer the noise given an observed trajectory. Buesing et al. buesing2018woulda apply this idea to POMDPs, where it can be viewed as reasoning about events in hindsight. They use a structural causal model of the dynamics and infer the posterior over latent causes from empirical trajectories. Using an empirical rather than a learned distribution over latent causes can reduce bias and, together with the (deterministic) model of the system dynamics, allows exploring the effect of alternative action choices for an observed trajectory.
Inverse models similar to the ones we use appear, for instance, in variational intrinsic control gregor2016variational (see also e.g. hausman2018learning). However, in our work, the inverse model serves as a way of determining the influence of an action on a future outcome, whereas the work in gregor2016variational ; hausman2018learning aims to use the inverse model to derive an intrinsic reward for training policies in which actions influence the future observations.
7 Closing
We proposed a new family of algorithms that explicitly consider the question of credit assignment as a part of, or instead of, estimating the traditional value function. The proposed estimators come with new properties, and as we validate empirically, are able to address some of the key issues in credit assignment. Investigating the scalability of these algorithms in the deep reinforcement learning setting is an exciting problem for future research.
Acknowledgements
The authors thank Joseph Modayil for reviews of earlier manuscripts, Theo Weber for several insightful suggestions, and the anonymous reviewers for their useful feedback.
References
 (1) Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058, 2017.
 (2) Lars Buesing, Theophane Weber, Yori Zwols, Sébastien Racanière, Arthur Guez, Jean-Baptiste Lespiau, and Nicolas Heess. Woulda, coulda, shoulda: Counterfactually-guided policy search. CoRR, abs/1811.06272, 2018.
 (3) Igor Vladimirovich Girsanov. On transforming a certain class of stochastic processes by absolutely continuous substitution of measures. Theory of Probability & Its Applications, 5(3):285–301, 1960.
 (4) Anirudh Goyal, Philemon Brakel, William Fedus, Soumye Singhal, Timothy Lillicrap, Sergey Levine, Hugo Larochelle, and Yoshua Bengio. Recall traces: Backtracking models for efficient reinforcement learning. In International Conference on Learning Representations (ICLR), 2019.
 (5) Karol Gregor, Danilo Jimenez Rezende, and Daan Wierstra. Variational intrinsic control. arXiv preprint arXiv:1611.07507, 2016.

 (6) Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010.
 (7) Karol Hausman, Jost Tobias Springenberg, Ziyu Wang, Nicolas Heess, and Martin Riedmiller. Learning an embedding space for transferable robot skills. In International Conference on Learning Representations (ICLR), 2018.
 (8) Nicolas Heess, Gregory Wayne, David Silver, Timothy Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2944–2952. Curran Associates, Inc., 2015.
 (9) Mikael Henaff, William F Whitney, and Yann LeCun. Model-based planning with discrete and continuous actions. arXiv preprint arXiv:1705.07177, 2017.
 (10) Chia-Chun Hung, Timothy Lillicrap, Josh Abramson, Yan Wu, Mehdi Mirza, Federico Carnevale, Arun Ahuja, and Greg Wayne. Optimizing agent behavior over long time scales by transporting value. arXiv preprint arXiv:1810.06721, 2018.

 (11) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1928–1937, New York, New York, USA, 20–22 Jun 2016. PMLR.
 (12) Andrew W Moore and Christopher G Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine learning, 13(1):103–130, 1993.
 (13) Martin Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 1994.
 (14) Paulo Rauber, Avinash Ummadisingu, Filipe Mutz, and Jürgen Schmidhuber. Hindsight policy gradients. In International Conference on Learning Representations (ICLR), 2019.
 (15) Gerardo Rubino, Bruno Tuffin, et al. Rare event simulation using Monte Carlo methods, volume 73. Wiley Online Library, 2009.
 (16) John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. CoRR, abs/1506.05254, 2015.
 (17) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
 (18) Satinder Singh, Tommi Jaakkola, and Michael I. Jordan. Learning without state estimation in partially observable environments. In International Conference on Machine Learning (ICML), 1994.
 (19) Richard S Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
 (20) Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA, 2nd edition, 2018.
 (21) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
 (22) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
 (23) Théophane Weber, Nicolas Heess, Lars Buesing, and David Silver. Credit assignment techniques in stochastic computation graphs. CoRR, abs/1901.01761, 2019.
 (24) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
Appendix A Proofs
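A.1 Proof of Theorem 1
Lemma 1.
For any initial state $x$, a state $y$ that can occur on a trajectory $\tau\sim\mathcal{T}(x,\pi)$, that is: $\mathbb{P}_{\tau\sim\mathcal{T}(x,\pi)}(X_k = y) > 0$ for some $k$, and an action $a$ for which $\pi(a|x) > 0$, we have:
$$\frac{h_k(a|x,y)}{\pi(a|x)} = \frac{\mathbb{P}_{\tau\sim\mathcal{T}(x,a,\pi)}(X_k = y)}{\mathbb{P}_{\tau\sim\mathcal{T}(x,\pi)}(X_k = y)}. \tag{9}$$
Proof.
From Bayes’ rule, we have:
$$h_k(a|x,y) = \mathbb{P}_{\tau\sim\mathcal{T}(x,\pi)}\left(A_0 = a \mid X_k = y\right) = \frac{\mathbb{P}_{\tau\sim\mathcal{T}(x,a,\pi)}(X_k = y)\,\pi(a|x)}{\mathbb{P}_{\tau\sim\mathcal{T}(x,\pi)}(X_k = y)}.$$
Dividing both sides by $\pi(a|x)$ gives the claim.
∎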
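A.2 Proof of Theorem 2
Proof.
For any action $a$, the value function writes as
$$V^\pi(x) = \sum_z z\,\mathbb{P}(Z(\tau) = z \mid X_0 = x) = \sum_z z\,\mathbb{P}(Z(\tau) = z \mid X_0 = x, A_0 = a)\,\frac{\mathbb{P}(Z(\tau) = z \mid X_0 = x)}{\mathbb{P}(Z(\tau) = z \mid X_0 = x, A_0 = a)} \overset{(a)}{=} \sum_z z\,\mathbb{P}(Z(\tau) = z \mid X_0 = x, A_0 = a)\,\frac{\pi(a|x)}{h_z(a|x,z)} = \mathbb{E}_{\tau\sim\mathcal{T}(x,a,\pi)}\Big[Z(\tau)\,\frac{\pi(a|x)}{h_z(a|x,Z(\tau))}\Big],$$
where (a) follows from Bayes’ rule.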
∎
A.3 Proof of Theorem 3
Proof.
Similarly, for the return version and any action , we have:
∎
A.4 Proof of Proposition 1
Proof.
Appendix B Other variants
Analogously to Theorems 1 and 2, we can obtain the V- and Q-functions for state and return conditioning, respectively. We have:
Theorem 4.
Consider an action for which and for any state sampled on :
Theorem 5.
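Consider an action $a$ for which $\pi(a|x) > 0$. We have:
$$Q^\pi(x,a) = \mathbb{E}_{\tau\sim\mathcal{T}(x,\pi)}\Big[Z(\tau)\,\frac{h_z(a|x,Z(\tau))}{\pi(a|x)}\Big]. \tag{12}$$
Proof.
The Q-function writes:
$$Q^\pi(x,a) = \sum_z z\,\mathbb{P}(Z(\tau) = z \mid X_0 = x, A_0 = a) \overset{(a)}{=} \sum_z z\,\mathbb{P}(Z(\tau) = z \mid X_0 = x)\,\frac{h_z(a|x,z)}{\pi(a|x)} = \mathbb{E}_{\tau\sim\mathcal{T}(x,\pi)}\Big[Z(\tau)\,\frac{h_z(a|x,Z(\tau))}{\pi(a|x)}\Big],$$
where (a) follows from Bayes’ rule. ∎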
Appendix C TimeIndependent StateConditional Case
We begin by introducing a time-independent variant of the state-conditional distribution. Let and be the geometric distribution on . Then the state-conditional distribution writes as follows for a future state :