Hindsight Credit Assignment

12/05/2019 ∙ by Anna Harutyunyan, et al. ∙ Google

We consider the problem of efficient credit assignment in reinforcement learning. In order to efficiently and meaningfully utilize new data, we propose to explicitly assign credit to past decisions based on the likelihood of them having led to the observed outcome. This approach uses new information in hindsight, rather than employing foresight. Somewhat surprisingly, we show that value functions can be rewritten through this lens, yielding a new family of algorithms. We study the properties of these algorithms, and empirically show that they successfully address important credit assignment challenges, through a set of illustrative tasks.


1 Introduction

A reinforcement learning (RL) agent is tasked with two fundamental, interdependent problems: exploration (how to discover useful data), and credit assignment (how to incorporate it). In this work, we take a careful look at the problem of credit assignment. The instrumental learning object in RL – the value function – quantifies the following question: “how does choosing an action in a state affect future return?”. This is a challenging question for several reasons.

Issue 1: Variance.

The simplest way of estimating the value function is by averaging returns (future discounted sums of rewards) starting from taking action $a$ in state $x$. This Monte Carlo style of estimation is inefficient, since there can be a lot of randomness in trajectories.

Issue 2: Partial observability.

To amortize the search and reduce variance, temporal difference (TD) methods, like Sarsa and Q-learning, use a learned approximation of the value function and bootstrap. This introduces bias due to the approximation, as well as a reliance on the Markov assumption, which is especially problematic when the agent operates outside of a Markov Decision Process (MDP), for example if the state is partially observed, or if there is function approximation. Bootstrapping may then cause the value function to not converge at all, or to remain permanently biased [singh1994learning].

Issue 3: Time as a proxy.

TD($\lambda$) methods control this bias-variance trade-off, but they rely on time as the sole metric for relevance: the more recent the action, the more credit or blame it receives from a future reward [sutton1988learning; Sutton2019Book]. Although time is a reasonable proxy for cause and effect (especially in MDPs), in general it is a heuristic, and can hence be improved by learning.

Issue 4: No counterfactuals.

The only data used for estimating an action’s value are trajectories that contain that action, while ideally we would like to be able to use the same trajectory to update all relevant actions, not just the ones that happened to (serendipitously) occur.

Figure 1 illustrates these issues concretely. At a high level, we wish to achieve credit assignment mechanisms that are both sample-efficient (issues 1 and 4) and expressive (issues 2 and 3). To this end, we propose to reverse the key learning question, and to learn estimators that measure: “given the future outcome (reward or state), how relevant was the choice of $a$ in $x$ to achieve it?”, which is essentially the credit assignment question itself. Although eligibility traces consider the same question, they do so in a way that is (purposefully) equivalent to the forward view [sutton1988learning], and so they have to rely mainly on “vanilla” features, like time, to decide credit assignment. Reasoning in the backward view explicitly opens up a new family of algorithms. Specifically, we propose to use a form of hindsight conditioning to determine the relevance of a past action to a particular outcome. We show that the usual value functions can be rewritten in hindsight, yielding a new family of estimators, and derive policy gradient algorithms that use these estimators. We demonstrate empirically the ability of these algorithms to address the highlighted issues through a set of diagnostic tasks that are not handled well by other means.

Figure 1: Left. Consider the trajectory shown by solid arrows to be the sampled trajectory. An RL algorithm will typically assign credit for the reward obtained in the final state to the actions along this trajectory. This is unsatisfying for two reasons: (1) the first action was not essential in reaching the intermediate state, any other action would have been just as effective; hence, overemphasizing it is a source of variance; (2) from the intermediate state, the sampled action led to a multi-step trajectory into the rewarding state, while the alternative action transitions to it directly; so the alternative should get more of the credit for that reward. Note that the alternative could have been an exploratory action, but it could also have been more likely under the policy; given that the rewarding state was in fact reached, it was the more likely choice in hindsight. Right. The choice between the two actions at the start state causes a transition to one of two successor states, but these are perceptually aliased. On the next decision, the same action transitions the agent to different final states, depending on the true underlying successor. The aliased part can be a single state, or could itself be a trajectory. This scenario can happen, e.g., when the features are being learned. A TD algorithm that bootstraps in the aliased region will not be able to learn the correct values of the two initial actions, since it will average over the rewards of the two final states. When the aliased part is a potentially long trajectory with a noisy reward, a Monte Carlo algorithm will incorporate the noise along it into the values of both initial actions, despite it being irrelevant to the choice between them. We would like to be able to directly determine the relevance of each initial action to ending up in a particular final state.

2 Background and Notation

A Markov decision process (MDP) [puterman1994markov] is a tuple $(\mathcal{X}, \mathcal{A}, P, r, \gamma)$, with $\mathcal{X}$ being the state space, $\mathcal{A}$ the action space, $P$ the state-transition distribution (with $P(y \mid x, a)$ denoting the probability of transitioning to state $y$ from $x$ by choosing action $a$), $r : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ the reward function, and $\gamma \in [0, 1)$ the scalar discount factor. A stochastic policy $\pi$ maps each state to a distribution over actions: $\pi(a \mid x)$ denotes the probability of choosing action $a$ in state $x$. Let $\mathbb{P}_\pi(\cdot \mid x)$ and $\mathbb{P}_\pi(\cdot \mid x, a)$ be the distributions over trajectories generated by a policy $\pi$, given $x$ and $(x, a)$, respectively. Let $Z(\tau) = \sum_{k \ge 0} \gamma^k r_k$ be the return obtained along the trajectory $\tau$. The value (or V-) function $V^\pi$ and the action-value (or Q-) function $Q^\pi$ denote the expected return under the policy given $x$ and $(x, a)$, respectively:

$$V^\pi(x) = \mathbb{E}_{\tau \sim \mathbb{P}_\pi(\cdot \mid x)}[Z(\tau)], \qquad Q^\pi(x, a) = \mathbb{E}_{\tau \sim \mathbb{P}_\pi(\cdot \mid x, a)}[Z(\tau)]. \qquad (1)$$

The benefit of choosing a given action over the usual policy is measured by the advantage function $A^\pi(x, a) := Q^\pi(x, a) - V^\pi(x)$. Policy gradient algorithms improve the policy by changing its parameters $\theta$ in the direction of the gradient of the value function [Sutton:2000]. This gradient at some initial state $x_0$ is

$$\nabla_\theta V^\pi(x_0) = \sum_{x} d^\pi(x) \sum_{a} \nabla_\theta \pi(a \mid x)\, Q^\pi(x, a),$$

where $d^\pi$ is the (unnormalized) discounted state-visitation distribution. Practical algorithms such as REINFORCE [williams1992simple] approximate $Q^\pi$ or $A^\pi$ with an $n$-step truncated return, possibly combined with a bootstrapped approximate value function $\hat{V}$, which is also often used as a baseline (see [Sutton:2000; A3C_2016]), along a trajectory $\tau = (x_0, a_0, r_0, x_1, \ldots)$:

$$\nabla_\theta V^\pi(x_0) \approx \sum_{t \ge 0} \gamma^t \nabla_\theta \log \pi(a_t \mid x_t) \Big( \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n \hat{V}(x_{t+n}) - \hat{V}(x_t) \Big).$$
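
For later reference, the following minimal Python sketch computes this standard $n$-step advantage for a single recorded trajectory; the list-based interface is an illustrative assumption rather than anything prescribed by the paper.

```python
def nstep_advantage(rewards, values, t, n, gamma):
    """Standard n-step advantage estimate at time t along one trajectory.

    rewards: r_t for t = 0..T-1
    values:  V(x_t) for t = 0..T (the entry for a terminal state should be 0)
    Returns  sum_{k<n} gamma^k r_{t+k} + gamma^n V(x_{t+n}) - V(x_t).
    """
    n = min(n, len(rewards) - t)                 # truncate at the episode end
    g = sum(gamma ** k * rewards[t + k] for k in range(n))
    g += gamma ** n * values[t + n]              # bootstrap (0 if terminal)
    return g - values[t]


# Tiny example: three steps of reward 1, with an uninformative value function.
rewards = [1.0, 1.0, 1.0]
values = [0.0, 0.0, 0.0, 0.0]
print(nstep_advantage(rewards, values, t=0, n=2, gamma=0.9))  # 1 + 0.9 = 1.9
```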

3 Conditioning on the Future

The classical value function attempts to answer the question: "how does the current action affect future outcomes?" By relying on predictions about these future outcomes, existing approaches often exacerbate problems around variance (issue 1) and partial observability (issue 2). Furthermore, these methods tend to use temporal distance as a proxy for relevance (issue 3) and are unable to assign credit counterfactually (issue 4). We propose instead to learn estimators that consider the credit assignment question, "given an outcome, how relevant were past decisions?", and answer it explicitly.

This approach can in fact be linked to classical methods in statistical estimation. In particular, Monte Carlo simulation is known to be inaccurate when rare events are of interest: the averaging requires an infeasible number of samples to obtain an accurate estimate [rubino2009rare]. One solution is to change measures, that is, to use another distribution under which the events are less rare, and to correct with importance sampling. The Girsanov theorem [girsanov1960transforming] is a well-known example of this for processes with Brownian dynamics, and is known to produce lower-variance estimates.

This scenario of rare random events is particularly relevant to efficient credit assignment in RL. When a new significant outcome is experienced, the agent ought to quickly update its estimates and policy accordingly. Let $\tau$ be a sampled trajectory, and $f(\tau)$ some function of it. By changing measures from the policy with which $\tau$ was sampled to a future-conditional, or hindsight, distribution over actions given $f(\tau)$, we hope to improve the efficiency of credit assignment. The importance sampling ratio between the hindsight distribution and the policy then precisely denotes the relevance of an action to the specific future $f(\tau)$. If the hindsight distribution is accurate, this allows us to quickly assign credit to all actions relevant to achieving $f(\tau)$. In this work, we consider $f(\tau)$ to be a future state, or a future return. To highlight the use of the future-conditional distribution, we refer to the resulting family of methods as Hindsight Credit Assignment (HCA).

The remainder of this section formalizes the insight outlined above, and derives the usual value functions in terms of the hindsight distributions, while the subsequent section presents novel policy gradient algorithms based on these estimators.

3.1 Conditioning on Future States

The agent composes its estimate of the return from an action by summing over the rewards obtained from future states. One option for hindsight conditioning is to consider, at each step $k$, the likelihood of an action given that a particular future state was reached at that step.

Definition 1 (State-conditional hindsight distributions).

For any action $a$, any states $x, y$, and any time step $k \ge 1$, define $h_k(a \mid x, y)$ to be the conditional probability, over trajectories starting in $x$, of the first action being equal to $a$, given that the state $y$ has occurred at step $k$ along the trajectory:

$$h_k(a \mid x, y) := \mathbb{P}_\pi(A_0 = a \mid X_0 = x, X_k = y). \qquad (2)$$

Intuitively, $h_k(a \mid x, y)$ quantifies the relevance of action $a$ to the future state $y$. If $a$ is not relevant to reaching $y$, this probability is simply the policy $\pi(a \mid x)$ (there is no relevant information in $y$). If $a$ is instrumental in reaching $y$, $h_k(a \mid x, y) > \pi(a \mid x)$, and vice versa, if $a$ detracts from reaching $y$, $h_k(a \mid x, y) < \pi(a \mid x)$. In general, $h_k(\cdot \mid x, y)$ is a lower-entropy distribution than $\pi(\cdot \mid x)$. The relationship of $h_k$ to more familiar quantities can be understood through the following identity, obtained by an application of Bayes' rule:

$$\frac{h_k(a \mid x, y)}{\pi(a \mid x)} = \frac{\mathbb{P}_\pi(X_k = y \mid X_0 = x, A_0 = a)}{\mathbb{P}_\pi(X_k = y \mid X_0 = x)}.$$

Using this identity and importance sampling, we can rewrite the usual Q-function in terms of $h_k$. Since throughout there is only one policy involved, we drop the explicit dependence on $\pi$ from the notation, but it is implied.

Theorem 1.

Consider an action $a$ and a state $x$ for which $\pi(a \mid x) > 0$. Then the following holds:

$$Q^\pi(x, a) = r(x, a) + \mathbb{E}_{\tau \sim \mathbb{P}_\pi(\cdot \mid x)}\Big[ \sum_{k \ge 1} \gamma^k \, \frac{h_k(a \mid x, x_k)}{\pi(a \mid x)}\, r_k \Big].$$

So, each of the rewards $r_k$ along the way is weighted by the ratio $h_k(a \mid x, x_k) / \pi(a \mid x)$, which exactly quantifies how relevant $a$ was in achieving the corresponding state $x_k$. Following the discussion above, this ratio is $1$ if $a$ is irrelevant, and larger or smaller than $1$ in the other cases. The expression for the Q-function is similar to that in Eq. (1), but the new expectation is no longer conditioned on the initial action: the policy is followed from the start ($\mathbb{P}_\pi(\cdot \mid x)$ instead of $\mathbb{P}_\pi(\cdot \mid x, a)$). This is an important point, as it will allow us to use returns generated by any action to update the values of all actions, to the extent that they are relevant according to $h_k$. Theorem 1 implies the following expression for the advantage:

$$A^\pi(x, a) = r(x, a) - r_\pi(x) + \mathbb{E}_{\tau \sim \mathbb{P}_\pi(\cdot \mid x)}\Big[ \sum_{k \ge 1} \gamma^k \Big( \frac{h_k(a \mid x, x_k)}{\pi(a \mid x)} - 1 \Big) r_k \Big], \qquad (3)$$

where $r_\pi(x) := \sum_b \pi(b \mid x)\, r(x, b)$. This form of the advantage is particularly appealing, since it directly removes irrelevant rewards from consideration. Indeed, whenever $h_k(a \mid x, x_k) = \pi(a \mid x)$, the reward $r_k$ does not participate in the advantage of action $a$. When there is inconsequential noise that is outside of the agent's control, this may greatly reduce the variance of the estimates.
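
For concreteness, the following minimal Python sketch applies the reweighting of Eq. (3), as reconstructed above, to one sampled trajectory. The hindsight probabilities are assumed to be supplied by some learned model; the function signature is illustrative, not the paper's implementation.

```python
def state_hca_advantage(pi_a, r_hat_a, r_hat_mean, rewards, hindsight_probs, gamma):
    """Sampled state-HCA advantage for an action a at the start of a trajectory.

    pi_a            : pi(a | x_0), the policy probability of a at the initial state
    r_hat_a         : estimated immediate reward r(x_0, a)
    r_hat_mean      : estimated expected immediate reward sum_b pi(b | x_0) r(x_0, b)
    rewards         : sampled rewards r_1, r_2, ... along the trajectory
    hindsight_probs : h_k(a | x_0, x_k) for k = 1, 2, ...
    """
    adv = r_hat_a - r_hat_mean
    for k, (r_k, h_k) in enumerate(zip(rewards, hindsight_probs), start=1):
        adv += gamma ** k * (h_k / pi_a - 1.0) * r_k   # rewards with h_k == pi_a drop out
    return adv


# If every future state was equally reachable under any first action (h_k == pi_a),
# only the immediate-reward difference remains.
print(state_hca_advantage(pi_a=0.5, r_hat_a=1.0, r_hat_mean=0.5,
                          rewards=[0.0, 1.0], hindsight_probs=[0.5, 0.5],
                          gamma=0.9))  # 0.5
```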

Removing time dependence.

For clarity of exposition, here we have considered the hindsight distribution to be additionally conditioned on time: $h_k$ depends not only on reaching the state, but also on the number of timesteps that it takes to do so. In general, this can be limiting, as it introduces a stronger dependence on the particular trajectory, and a harder estimation problem for the hindsight distribution. It turns out we can generalize all of the results presented here to a time-independent distribution $h_\beta(a \mid x, y)$, which gives the probability of $a$ conditioned on reaching $y$ at some point in the future. The scalar $\beta \in [0, 1)$ is the "probability of survival" at each step. This can either be the discount $\gamma$, or a termination probability if the problem is undiscounted. In the discounted-reward case, Eq. (3) can be re-expressed in terms of $h_\beta$ as follows:

$$A^\pi(x, a) = r(x, a) - r_\pi(x) + \mathbb{E}_{\tau \sim \mathbb{P}_\pi(\cdot \mid x)}\Big[ \sum_{k \ge 1} \gamma^k \Big( \frac{h_\beta(a \mid x, x_k)}{\pi(a \mid x)} - 1 \Big) r_k \Big], \qquad (4)$$

with the choice $\beta = \gamma$. The interested reader may find the relevant proofs in the appendix.

Finally, it is possible to obtain a hindsight V-function, analogously to the Q-function from Theorem 1. The next section does this for return-conditional HCA. We include other variations in the appendix.

3.2 Conditioning on Future Returns

The previous section derived Q-functions that explicitly reweight the rewards at each step, based on the corresponding states' connection to the action whose value we wish to estimate. Since ultimately we are interested in the return, we could alternatively condition on it directly.

Definition 2 (Return-conditional hindsight distributions).

For any action $a$, any state $x$, and any possible return $z$, define $h(a \mid x, z)$ to be the conditional probability, over trajectories starting in $x$, of the first action being $a$, given that the return $z$ has been observed along the trajectory:

$$h(a \mid x, z) := \mathbb{P}_\pi(A_0 = a \mid X_0 = x, Z = z).$$

The distribution $h(a \mid x, z)$ is intuitively similar to $h_k(a \mid x, y)$, but instead of future states, it directly quantifies the relevance of $a$ to obtaining the entire return $z$. This is appealing, since in the end we care about returns. Further, this could be simpler to learn, since instead of the possibly high-dimensional state, we now need to worry only about a scalar outcome. On the other hand, it is no longer "jumpy" in time, so it may benefit less from structure in the dynamics. As with $h_k$, we will drop the explicit dependence on $\pi$, but it is implied. We have the following result.

Theorem 2.

Consider an action $a$, and assume that for any possible return $z$ of a trajectory $\tau \sim \mathbb{P}_\pi(\cdot \mid x)$ we have $h(a \mid x, z) > 0$. Then we have:

$$V^\pi(x) = \mathbb{E}_{\tau \sim \mathbb{P}_\pi(\cdot \mid x, a)}\Big[ \frac{\pi(a \mid x)}{h(a \mid x, Z)}\, Z \Big]. \qquad (5)$$

The V- (rather than Q-) function form here has interesting properties that we will discuss in the next section. Mathematically, the two forms are analogous to derive, but the ratio is now flipped. Equations (5) and (1) imply the following expression for the advantage:

$$A^\pi(x, a) = \mathbb{E}_{\tau \sim \mathbb{P}_\pi(\cdot \mid x, a)}\Big[ \Big( 1 - \frac{\pi(a \mid x)}{h(a \mid x, Z)} \Big) Z \Big]. \qquad (6)$$

The factor $1 - \pi(a \mid x)/h(a \mid x, z)$ expresses how much a single action $a$ contributed to obtaining a return $z$. If other actions (drawn from $\pi$) would have yielded the same return, $h(a \mid x, z) = \pi(a \mid x)$, and the advantage is $0$. If an action has made achieving $z$ more likely, then $h(a \mid x, z) > \pi(a \mid x)$ and the advantage is positive; conversely, if other actions would have contributed to achieving $z$ more than $a$, then $h(a \mid x, z) < \pi(a \mid x)$ and the advantage is negative. Hence, the factor expresses the impact an action has on the environment, in terms of the return, if everything else (future decisions as well as the randomness of the environment) is unchanged.
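
The following small Python example, with made-up numbers, computes the return-conditional hindsight distribution exactly by Bayes' rule in a one-state problem with binary returns, and checks that the advantage of Eq. (6) coincides with the standard $Q - V$ advantage.

```python
import numpy as np

pi = np.array([0.5, 0.5])          # pi(a): made-up policy
p_z1 = np.array([0.8, 0.2])        # P(Z = 1 | a): made-up return distribution

p_z1_marg = pi @ p_z1              # P(Z = 1) under the policy
q = p_z1 * 1.0                     # Q(a) = E[Z | a], since returns are 0 or 1
v = pi @ q                         # V = E[Z]

# Return-conditional hindsight distribution via Bayes' rule: h(a | z) = P(z | a) pi(a) / P(z).
h_z1 = p_z1 * pi / p_z1_marg
h_z0 = (1 - p_z1) * pi / (1 - p_z1_marg)   # not needed below: the Z = 0 term vanishes

# Return-HCA advantage: E[(1 - pi(a) / h(a | Z)) Z | a].
adv_hca = p_z1 * (1 - pi / h_z1) * 1.0
print(adv_hca)   # [ 0.3 -0.3]
print(q - v)     # [ 0.3 -0.3], the standard advantage
```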

Both $h_k$ and $h$ can be learned online from sampled trajectories (see Sec. 4 for algorithms, and a discussion in Sec. 4.1). Finally, while we chose to focus on state and return conditioning, one could consider other options. For example, conditioning on the reward (instead of the state) at a future time step, or on an embedding of (or part of) the future trajectory, could have interesting properties.

3.3 Policy Gradients

We now give a policy gradient theorem based on the new expressions of the value function.

Theorem 3.

Let $\pi_\theta$ be the policy parameterized by $\theta$, and let $Z_t := \sum_{k \ge t} \gamma^{k - t} r_k$ denote the return from time $t$. Then, the gradient of the value at some state $x_0$ is:

$$\nabla_\theta V^\pi(x_0) = \mathbb{E}_{\tau \sim \mathbb{P}_\pi(\cdot \mid x_0)}\Big[ \sum_{t \ge 0} \gamma^t \sum_a \nabla_\theta \pi(a \mid x_t) \Big( r(x_t, a) + \sum_{k \ge 1} \gamma^k \, \frac{h_k(a \mid x_t, x_{t+k})}{\pi(a \mid x_t)}\, r_{t+k} \Big) \Big] \qquad (7)$$

$$\nabla_\theta V^\pi(x_0) = \mathbb{E}_{\tau \sim \mathbb{P}_\pi(\cdot \mid x_0)}\Big[ \sum_{t \ge 0} \gamma^t \, \nabla_\theta \log \pi(a_t \mid x_t) \Big( 1 - \frac{\pi(a_t \mid x_t)}{h(a_t \mid x_t, Z_t)} \Big) Z_t \Big]. \qquad (8)$$

Note that the expression for state HCA in Eq. (7) is written for all actions, rather than only the sampled one. Interestingly, this form does not require (or benefit from) a baseline. Contrary to the usual all-actions algorithm which uses the critic, the HCA reweighting allows us to use returns sampled from a particular starting action to obtain value estimates for all actions.

4 Algorithms

Using the new policy gradient theorem, we will now give novel algorithms based on sampling the expectations (7) and (8). Then, we will discuss the training of the relevant hindsight distributions.

State-Conditional HCA

Consider a parametric representation of the policy $\pi_\theta$, of the future-state-conditional distribution $h$, of a baseline $V$, and of an estimate $\hat{r}$ of the immediate reward. Generate $n$-step trajectories $(x_t, a_t, r_t, \ldots, x_{t+n})$. We can compose an estimate of the return for all actions $a$ (see Theorem 7 in the appendix):

$$G(x_t, a) = \hat{r}(x_t, a) + \sum_{k=1}^{n-1} \gamma^k \, \frac{h(a \mid x_t, x_{t+k})}{\pi(a \mid x_t)}\, r_{t+k} + \gamma^n \, \frac{h(a \mid x_t, x_{t+n})}{\pi(a \mid x_t)}\, V(x_{t+n}).$$

The algorithm proceeds by training $V$ to predict the usual return and $\hat{r}$ to predict the immediate reward (squared loss), training the hindsight distribution $h(\cdot \mid x_t, x_{t+k})$ to predict the taken action $a_t$ (cross-entropy loss), and finally by updating the policy logits with the all-actions gradient $\sum_a \nabla_\theta \pi(a \mid x_t)\big(G(x_t, a) - V(x_t)\big)$. See Algorithm 1 in the appendix for the detailed pseudocode.
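
A rough tabular sketch of one such update from a single episode is given below. The environment interface (env.reset() and env.step(a) returning (next_state, reward, done)), the single time-independent hindsight table, the Monte Carlo rather than $n$-step return, and the hyperparameters are all assumptions of the sketch, not the paper's Algorithm 1.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def state_hca_episode_update(env, logits, V, r_hat, h_logits, gamma=0.99, lr=0.1):
    """One episode of state-conditional HCA; all tables are updated in place.

    logits   : [n_states, n_actions] policy logits
    V        : [n_states] baseline values
    r_hat    : [n_states, n_actions] immediate-reward model
    h_logits : [n_states, n_states, n_actions] hindsight-classifier logits
    """
    xs, acts, rews = [], [], []
    x, done = env.reset(), False
    while not done:                                        # roll out one episode
        a = np.random.choice(len(logits[x]), p=softmax(logits[x]))
        x_next, r, done = env.step(a)
        xs.append(x); acts.append(a); rews.append(r)
        x = x_next
    n_actions = logits.shape[1]
    T = len(xs)
    for t in range(T):
        ret = sum(gamma ** (k - t) * rews[k] for k in range(t, T))
        V[xs[t]] += lr * (ret - V[xs[t]])                                # value regression
        r_hat[xs[t], acts[t]] += lr * (rews[t] - r_hat[xs[t], acts[t]])  # reward model
        for k in range(t + 1, T):                                        # hindsight classifier
            h = softmax(h_logits[xs[t], xs[k]])
            h_logits[xs[t], xs[k]] += lr * (np.eye(n_actions)[acts[t]] - h)
        pi = softmax(logits[xs[t]])
        for a in range(n_actions):                         # all-actions policy update
            g = r_hat[xs[t], a] + sum(
                gamma ** (k - t) * softmax(h_logits[xs[t], xs[k]])[a] / pi[a] * rews[k]
                for k in range(t + 1, T))
            logits[xs[t]] += lr * pi[a] * (g - V[xs[t]]) * (np.eye(n_actions)[a] - pi)
```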

Return-Conditional HCA

Consider a parametric representation of the policy $\pi_\theta$ and of the return-conditional distribution $h$. Generate full trajectories and compute the sampled advantage at each step:

$$\hat{A}_t = \Big( 1 - \frac{\pi(a_t \mid x_t)}{h(a_t \mid x_t, Z_t)} \Big) Z_t, \qquad \text{where } Z_t := \sum_{k \ge t} \gamma^{k - t} r_k.$$

The algorithm proceeds by training the hindsight distribution $h(\cdot \mid x_t, Z_t)$ to predict the taken action $a_t$ (cross-entropy loss), and updating the policy with the gradient $\nabla_\theta \log \pi(a_t \mid x_t)\, \hat{A}_t$. See Algorithm 2 in the appendix for the detailed pseudocode.
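
A corresponding rough sketch of one return-conditional update from an already-collected trajectory follows; discretising returns into bins, so that $h$ can be stored as a small table of logits, is an assumption made for the sketch rather than part of the paper's Algorithm 2.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def return_hca_update(states, actions, rewards, logits, h_logits, z_bins,
                      gamma=0.99, lr=0.1):
    """One return-conditional HCA update from a full trajectory (in place).

    logits   : [n_states, n_actions] policy logits
    h_logits : [n_states, len(z_bins) + 1, n_actions] hindsight-classifier logits
    z_bins   : bin edges used to discretise returns
    """
    n_actions = logits.shape[1]
    T = len(states)
    for t in range(T):
        z = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))   # return Z_t
        b = int(np.digitize(z, z_bins))                               # return bin index
        pi = softmax(logits[states[t]])
        h = softmax(h_logits[states[t], b])
        # Cross-entropy step: push h(. | x_t, Z_t) towards the action actually taken.
        h_logits[states[t], b] += lr * (np.eye(n_actions)[actions[t]] - h)
        # Policy step with the hindsight advantage (1 - pi/h) Z_t; no value function.
        adv = (1.0 - pi[actions[t]] / h[actions[t]]) * z
        logits[states[t]] += lr * adv * (np.eye(n_actions)[actions[t]] - pi)
```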

RL without value functions.

The return-conditional version lends itself to a particularly simple algorithm. In particular, we no longer need to learn the value function: if $h$ is estimated well, using complete rollouts is feasible without variance issues. This takes our idea of reversing the direction of the learning question to the extreme; it is now entirely in hindsight.

The result is an actor-critic-like algorithm in which the usual baseline is replaced by $\frac{\pi(a_t \mid x_t)}{h(a_t \mid x_t, Z_t)}\, Z_t$. This baseline is strongly correlated with the return (it is proportional to it), which is desirable, since we would like to remove as much of the variance (due to the dynamics of the world, or the agent's own policy) as possible. The following proposition verifies that, despite being correlated, this baseline does not introduce bias into the policy gradient.

Proposition 1.

The baseline does not introduce any bias in the policy gradient:

$$\mathbb{E}\Big[ \nabla_\theta \log \pi(a_t \mid x_t)\, \frac{\pi(a_t \mid x_t)}{h(a_t \mid x_t, Z_t)}\, Z_t \Big] = 0.$$
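
This can be checked by direct enumeration in a small example. The sketch below uses a one-state problem with binary returns and made-up probabilities, computes $h$ by Bayes' rule, and verifies that the baseline term, as reconstructed above, has zero expected gradient under a softmax policy.

```python
import numpy as np

pi = np.array([0.7, 0.3])                    # made-up policy probabilities
p_z1 = np.array([0.9, 0.1])                  # P(Z = 1 | a), made up
p_z = np.array([1 - pi @ p_z1, pi @ p_z1])   # P(Z = 0), P(Z = 1)

def h(a, z):                                  # h(a | z) by Bayes' rule
    p_za = p_z1[a] if z == 1 else 1 - p_z1[a]
    return p_za * pi[a] / p_z[z]

def dlogpi(a):                                # d log pi(a) / d logits for a softmax policy
    return np.eye(2)[a] - pi

# E[ dlogpi(A) * (pi(A) / h(A | Z)) * Z ], computed by enumeration over (a, z).
g = sum(pi[a] * (p_z1[a] if z == 1 else 1 - p_z1[a])
        * dlogpi(a) * (pi[a] / h(a, z)) * z
        for a in range(2) for z in (0, 1))
print(g)   # ~[0, 0]: the correlated baseline (pi/h) Z does not bias the gradient
```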

4.1 Learning Hindsight Distributions

We have given equivalent rewritings of the usual value functions in terms of the proposed hindsight distributions, and have motivated their properties when they are accurate. The question now is whether it is feasible to learn good estimates of those distributions from experience, and whether shifting the learning problem in this way is beneficial. The remainder of this section discusses this question, while the next one provides empirical evidence for the affirmative.

There are several conventional objects that could be learned to help with credit assignment: a value function, a forward model, or an inverse model over states. An accurate forward model allows one to compute value functions directly with no variance, and an accurate inverse model allows one to perform precise credit assignment. However, learning such generative models accurately is difficult and has been a long-standing challenge in RL, especially in high-dimensional state spaces. Interestingly, the hindsight distribution is a discriminative, rather than generative, model, and is hence not required to model the full distribution over states. Additionally, the action space is usually much smaller than the state space, so shifting the focus to actions potentially makes the problem much easier. When certain structure in the dynamics is present, learning hindsight distributions may be significantly easier still: e.g. if the transition model is stochastic or the policy is changing, a particular state-action pair can lead to many possible future states, but a particular future state can be explained by a small number of past actions. In general, learning $h_k$ and $h$ are supervised learning problems, so the new algorithms delegate some of the learning difficulty in RL to a supervised setting, for which many efficient approaches exist (e.g. [gutmann2010noise; van2018representation]).
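
As a simple illustration of this supervised view, the sketch below estimates a time-independent hindsight table by counting, over a batch of trajectories, how often each first action was taken at a state on trajectories that later visited a given future state. The data format, the uniform treatment of all future time steps (no $\beta$-weighting), and the smoothing constant are simplifying assumptions.

```python
import numpy as np

def estimate_hindsight(trajectories, n_states, n_actions, smoothing=1e-3):
    """Count-based estimate of h(a | x, x') from trajectories of (state, action) pairs."""
    counts = np.full((n_states, n_states, n_actions), smoothing)
    for traj in trajectories:
        for t, (x, a) in enumerate(traj):
            # Each distinct future state on this trajectory counts once for (x, a).
            for x_future in {x_k for (x_k, _) in traj[t + 1:]}:
                counts[x, x_future, a] += 1.0
    return counts / counts.sum(axis=-1, keepdims=True)    # h[x, x', a]

# Usage (with hypothetical data): h = estimate_hindsight(trajs, n_states=10, n_actions=2)
```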

Figure 2: Left: Shortcut. Each state has two actions, one transitions directly to the goal, the other to the next state of the chain. Center: Delayed effect. Start state presents a choice of two actions, followed by an aliased chain, with the consequence of the initial choice apparent only in the final state. Right: Ambiguous bandit. Each action transitions to a particular state with high probability, but to the other action’s state with low probability. When the two states have noisy rewards, credit assignment to each action becomes challenging.
Figure 3: Shortcut. Left: learning curves, with the policy between the long and the short path initialized uniformly. Explicitly considering the likelihood of reaching the final state allows state-conditioned HCA to more quickly adjust its policy. Right: the advantage of the shortcut action, estimated by performing rollouts under a fixed policy. The x-axis depicts the policy probabilities of the actions on the long path. The oracle is computed analytically, without sampling. When the shortcut action is unlikely and rarely encountered, it is difficult to obtain an accurate estimate of the advantage. HCA is consistently able to maintain larger (and more accurate) advantages.
Figure 4: Delayed effect. Left: Bootstrapping. Learning curves for the HCA variants and for an n-step return that causes the agent to bootstrap in the partially observed region. As expected, naive bootstrapping is unable to learn a good estimate. Middle: Using full Monte Carlo returns overcomes partial observability, but is prone to noise; the plot depicts learning curves for the setting with added white noise on the intermediate rewards. Right: Average performance w.r.t. different noise levels; predictably, state HCA is the most robust.
Figure 5: Ambiguous bandit with Gaussian rewards of differing means and a shared standard deviation. Left: The state identity is observed. Both HCA methods improve on PG. Middle: The state identity is hidden, handicapping state HCA, but return HCA continues to improve on PG. Right: Average performance w.r.t. different crossover probabilities, for a second setting of the Gaussian rewards; note that the optimal value itself decays in this case.

5 Experiments

To empirically validate our proposal in a controlled way, we devised a set of diagnostic tasks that highlight issues 1-4, while also being representative of what occurs in practice (Fig. 2). We then systematically verify the intuitions developed throughout the paper. In all cases, we learn the hindsight distributions in tandem with the control policy. For each problem we compare HCA with state and return conditioning against the standard baseline policy gradient, that is, n-step advantage actor-critic (with $n \to \infty$ corresponding to Monte Carlo). All results are averages over independent runs, with the plots depicting means and standard deviations. For simplicity, the same configuration is used in all of the tasks.

Shortcut.

We begin with an example capturing the intuition from Fig. 1 (left). Fig. 2 (left) depicts a chain with a rewarding final state. At each step, one action takes a shortcut and directly transitions to the final state, while the other continues on the longer path, which may be more likely according to the policy. There is a small per-step penalty and a positive final reward, and there is also a small chance that the agent transitions to the absorbing final state directly, regardless of the action taken.
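
A minimal sketch of such a chain environment is given below; the length, penalty, reward, and skip probability are illustrative placeholders, not the paper's exact values.

```python
import numpy as np

class ShortcutChain:
    """Chain with a shortcut action; all numeric parameters are illustrative."""

    def __init__(self, length=10, step_penalty=-0.1, final_reward=1.0,
                 skip_prob=0.05, seed=0):
        self.length = length
        self.step_penalty = step_penalty
        self.final_reward = final_reward
        self.skip_prob = skip_prob
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # Action 1 takes the shortcut straight to the goal; action 0 moves one
        # step along the chain, with a small chance of jumping to the goal anyway.
        if action == 1 or self.rng.random() < self.skip_prob:
            self.pos = self.length
        else:
            self.pos += 1
        done = self.pos >= self.length
        reward = self.final_reward if done else self.step_penalty
        return self.pos, reward, done
```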

This problem highlights two issues: (1) the importance of counterfactual credit assignment (issue 4): when the long path is taken more frequently than the shortcut, counterfactual updates become increasingly effective (see Fig. 3, right); and (2) the use of time as a proxy for relevance (issue 3) is shown to be only a heuristic, even in a fully observable MDP: the relevance of the states along the chain is not accurately reflected in the long temporal distance between them and the goal state. In Fig. 3 we show that HCA is more effective at quickly adjusting the policy towards the shortcut action.

Delayed Effect.

The next task instantiates the example from Fig. 1 (right). Fig. 2 (middle) depicts a POMDP in which, after the first decision, there is aliasing until the final state. This is a common case of partial observability, and is especially pertinent if the features are being learned. We show that (1) bootstrapping naively is inadequate in this case (issue 2), but HCA is able to carry the appropriate information (see the discussion in Appendix F); and (2) while Monte Carlo is able to overcome the partial observability, its performance deteriorates when intermediate reward noise is present (issue 1). HCA, on the other hand, is able to reduce the variance due to the irrelevant noise in the rewards.

Additionally, in this example the first decision is the most relevant choice, despite being the most temporally remote, once again highlighting that using temporal proximity for credit assignment is a heuristic (issue 3). One of the final states is rewarding, the other penalizing, and the middle states carry white-noise rewards. Fig. 4 depicts our results. In this task, return-conditional HCA has a more difficult learning problem, as it needs to correctly model the noise distribution it conditions on, which is as difficult as learning the values naively; hence it performs similarly to the baseline.

Ambiguous Bandit.

Finally, to emphasize that credit assignment can be challenging even when it is not long-term, we consider a problem without a temporal component. Fig. 2 (right) depicts a bandit with two actions, leading to two different states whose reward functions are similar (here: drawn from overlapping Gaussian distributions), with some probability of crossover. The challenge here is due to variance (issue 1) and a lack of counterfactual updates (issue 4): it is difficult to tell whether an action was genuinely better, or whether its sampled rewards just happened to fall on the favorable tail of the distribution. This is a common scenario when bootstrapping with similar values. Due to explicitly aiming to model the reward distributions, the hindsight algorithms are more efficient (Fig. 5, left).
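
A minimal sketch of this bandit follows; the means, standard deviation, and crossover probability are illustrative placeholders.

```python
import numpy as np

def ambiguous_bandit_step(action, means=(1.0, 1.2), std=1.0, crossover=0.1, rng=None):
    """One pull: the chosen action usually leads to its own state, but with
    probability `crossover` to the other one; the reward is Gaussian around
    that state's mean.  All numbers here are illustrative assumptions."""
    if rng is None:
        rng = np.random.default_rng()
    state = action if rng.random() > crossover else 1 - action
    reward = rng.normal(means[state], std)
    return state, reward

# Example: the empirical mean reward of action 0 under the made-up parameters.
print(np.mean([ambiguous_bandit_step(0)[1] for _ in range(10000)]))
```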

To highlight the differences between the two types of hindsight conditioning, we introduce partial observability (issue 2), see Fig. 5 (right). The return-conditional policy is still able to improve over policy gradient, but state-conditioning now fails to provide informative conditioning (by construction).

6 Related Work

Hindsight experience replay (HER) [andrychowicz2017hindsight] introduces the idea of off-policy learning about many goals from the same trajectory. The intuition is that, regardless of which goal the trajectory was originally pursuing, in hindsight it successfully found, e.g., the one corresponding to its final state, and there is something to be learned from that. Rauber et al. [rauber2018hindsight] extend the same intuition to policy gradient algorithms with goal-conditioned policies. Goyal et al. [goyal2019recall] also use goal conditioning and learn a backtracking model, which predicts the state-action pairs occurring on trajectories that end up in goal states. These works share our intuition of using, in hindsight, the same data to learn about many things, but do so in the context of goal-conditioned policies, while we essentially contrast conditional and unconditional policies, where the conditioning is on the extra outcome (state or return). Note that we never act according to the conditional policy; it is used solely for credit assignment. Prioritized sweeping can be viewed as changing the sampling distribution with hindsight knowledge of the TD errors [moore1993prioritized].

Another line of work that aims to propagate credit efficiently backward in time is the temporal value transport algorithm [hung2018optimizing], in which an attention mechanism over memory is used to jump over parts of a trajectory that are irrelevant for the rewards obtained. While demonstrated on challenging problems, that method is biased; a promising direction for future research would be to apply our unbiased hindsight mechanism with past states chosen by such an attention mechanism.

A large number of variance reduction techniques have been applied in RL, e.g. using learned value functions as critics, and other control variates (e.g. [weber2019credit]). When a model of the environment is available, it can be used to reduce variance. Rollouts from the same state fill the same role in policy gradients [schulman2015trust]. Differentiable system dynamics allow low-variance estimates of the Q-value gradient by using the pathwise derivative estimator, effectively backpropagating the gradient of the objective along trajectories (e.g. [schulman2015stochastic; heess2015learning; henaff2017model]). In stochastic systems this requires knowledge of the environment noise. To bypass this, Heess et al. [heess2015learning] infer the noise given an observed trajectory. Buesing et al. [buesing2018woulda] apply this idea to POMDPs, where it can be viewed as reasoning about events in hindsight. They use a structural causal model of the dynamics and infer the posterior over latent causes from empirical trajectories. Using an empirical rather than a learned distribution over latent causes can reduce bias and, together with the (deterministic) model of the system dynamics, allows exploring the effect of alternative action choices for an observed trajectory.

Inverse models similar to the ones we use appear, for instance, in variational intrinsic control [gregor2016variational] (see also e.g. [hausman2018learning]). However, in our work the inverse model serves as a way of determining the influence of an action on a future outcome, whereas [gregor2016variational; hausman2018learning] aim to use the inverse model to derive an intrinsic reward for training policies in which actions influence the future observations.

7 Closing

We proposed a new family of algorithms that explicitly consider the question of credit assignment as a part of, or instead of, estimating the traditional value function. The proposed estimators come with new properties, and as we validate empirically, are able to address some of the key issues in credit assignment. Investigating the scalability of these algorithms in the deep reinforcement learning setting is an exciting problem for future research.

Acknowledgements

The authors thank Joseph Modayil for reviews of earlier manuscripts, Theo Weber for several insightful suggestions, and the anonymous reviewers for their useful feedback.

References

Appendix A Proofs

A.1 Proof of Theorem 1

Lemma 1.

For any initial state $x$, any state $y$ that can occur on a trajectory from $x$ (that is, $\mathbb{P}_\pi(X_k = y \mid X_0 = x) > 0$ for some $k$), and any action $a$ for which $\pi(a \mid x) > 0$, we have:

$$\frac{h_k(a \mid x, y)}{\pi(a \mid x)} = \frac{\mathbb{P}_\pi(X_k = y \mid X_0 = x, A_0 = a)}{\mathbb{P}_\pi(X_k = y \mid X_0 = x)}. \qquad (9)$$
Proof.

From Bayes' rule, we have:

$$h_k(a \mid x, y) = \mathbb{P}_\pi(A_0 = a \mid X_0 = x, X_k = y) = \frac{\mathbb{P}_\pi(X_k = y \mid X_0 = x, A_0 = a)\, \pi(a \mid x)}{\mathbb{P}_\pi(X_k = y \mid X_0 = x)},$$

and dividing both sides by $\pi(a \mid x)$ gives Eq. (9). ∎

Proof of Theorem 1.

From the definition of the Q-function for a state-action pair $(x, a)$, we have

$$Q^\pi(x, a) = r(x, a) + \mathbb{E}_{\tau \sim \mathbb{P}_\pi(\cdot \mid x, a)}\Big[ \sum_{k \ge 1} \gamma^k r_k \Big] = r(x, a) + \sum_{k \ge 1} \gamma^k \sum_y \mathbb{P}_\pi(X_k = y \mid X_0 = x, A_0 = a)\, r_\pi(y), \qquad (10)$$

where $r_\pi(y) := \sum_b \pi(b \mid y)\, r(y, b)$.

Combining Eq. (9) with Eq. (10) we deduce

$$Q^\pi(x, a) = r(x, a) + \sum_{k \ge 1} \gamma^k \sum_y \mathbb{P}_\pi(X_k = y \mid X_0 = x)\, \frac{h_k(a \mid x, y)}{\pi(a \mid x)}\, r_\pi(y) = r(x, a) + \mathbb{E}_{\tau \sim \mathbb{P}_\pi(\cdot \mid x)}\Big[ \sum_{k \ge 1} \gamma^k \, \frac{h_k(a \mid x, x_k)}{\pi(a \mid x)}\, r_k \Big]. \; ∎$$

A.2 Proof of Theorem 2

Proof.

For any action $a$ satisfying the assumption, the value function writes as

$$V^\pi(x) = \int_z \mathbb{P}_\pi(Z = z \mid x)\, z \,\mathrm{d}z \overset{(*)}{=} \int_z \mathbb{P}_\pi(Z = z \mid x, a)\, \frac{\pi(a \mid x)}{h(a \mid x, z)}\, z \,\mathrm{d}z = \mathbb{E}_{\tau \sim \mathbb{P}_\pi(\cdot \mid x, a)}\Big[ \frac{\pi(a \mid x)}{h(a \mid x, Z)}\, Z \Big],$$

where $(*)$ follows from Bayes' rule. ∎

A.3 Proof of Theorem 3

Proof.

Using (3), and exchanging the gradient with the expectation, we obtain the state-HCA form (7); the key step uses the fact that $\sum_a \nabla_\theta \pi(a \mid x)\, c = 0$ for any $c$ that does not depend on $a$, so adding or removing such action-independent terms does not change the gradient.

Similarly, for the return version, Eq. (6), together with the standard policy gradient argument, yields the form (8) for any action $a$. ∎

A.4 Proof of Proposition 1

Proof.

We have:

$$\mathbb{E}\Big[ \nabla_\theta \log \pi(a \mid x)\, \frac{\pi(a \mid x)}{h(a \mid x, Z)}\, Z \Big] = \sum_a \nabla_\theta \pi(a \mid x)\; \mathbb{E}_{\tau \sim \mathbb{P}_\pi(\cdot \mid x, a)}\Big[ \frac{\pi(a \mid x)}{h(a \mid x, Z)}\, Z \Big] \overset{(*)}{=} V^\pi(x) \sum_a \nabla_\theta \pi(a \mid x) = 0,$$

where $(*)$ follows from Theorem 2. ∎

Appendix B Other variants

Analogously to Theorems 1 and 2, we can obtain the V- and Q-functions for state and return conditioning, respectively. We have:

Theorem 4.

Consider an action $a$ for which $\pi(a \mid x) > 0$ and $h_k(a \mid x, y) > 0$ for any state $y$ sampled on trajectories from $(x, a)$. Then:

$$V^\pi(x) = r_\pi(x) + \mathbb{E}_{\tau \sim \mathbb{P}_\pi(\cdot \mid x, a)}\Big[ \sum_{k \ge 1} \gamma^k \, \frac{\pi(a \mid x)}{h_k(a \mid x, x_k)}\, r_k \Big].$$

Proof.

We can flip the result of Lemma 1 for actions for which $\pi(a \mid x) > 0$ and $h_k(a \mid x, y) > 0$:

$$\frac{\pi(a \mid x)}{h_k(a \mid x, y)} = \frac{\mathbb{P}_\pi(X_k = y \mid X_0 = x)}{\mathbb{P}_\pi(X_k = y \mid X_0 = x, A_0 = a)}. \qquad (11)$$

Let $r_\pi(y) := \sum_b \pi(b \mid y)\, r(y, b)$. We have

$$V^\pi(x) = r_\pi(x) + \sum_{k \ge 1} \gamma^k \sum_y \mathbb{P}_\pi(X_k = y \mid X_0 = x)\, r_\pi(y) = r_\pi(x) + \mathbb{E}_{\tau \sim \mathbb{P}_\pi(\cdot \mid x, a)}\Big[ \sum_{k \ge 1} \gamma^k \, \frac{\pi(a \mid x)}{h_k(a \mid x, x_k)}\, r_k \Big]. \; ∎$$

Theorem 5.

Consider an action $a$ for which $\pi(a \mid x) > 0$. We have:

$$Q^\pi(x, a) = \mathbb{E}_{\tau \sim \mathbb{P}_\pi(\cdot \mid x)}\Big[ \frac{h(a \mid x, Z)}{\pi(a \mid x)}\, Z \Big]. \qquad (12)$$
Proof.

The Q-function writes:

$$Q^\pi(x, a) = \int_z \mathbb{P}_\pi(Z = z \mid x, a)\, z \,\mathrm{d}z \overset{(*)}{=} \int_z \mathbb{P}_\pi(Z = z \mid x)\, \frac{h(a \mid x, z)}{\pi(a \mid x)}\, z \,\mathrm{d}z = \mathbb{E}_{\tau \sim \mathbb{P}_\pi(\cdot \mid x)}\Big[ \frac{h(a \mid x, Z)}{\pi(a \mid x)}\, Z \Big],$$

where $(*)$ follows from Bayes' rule. ∎

Appendix C Time-Independent State-Conditional Case

We begin by introducing a time-independent variant of the state-conditional distribution. Let $\beta \in [0, 1)$, and let $K \sim \mathrm{Geom}(1 - \beta)$ be a geometrically distributed random time on $\{1, 2, \ldots\}$. Then the time-independent state-conditional distribution writes as follows for a future state $y$: