The machine learning community has recently seen breakthroughs in challenging problems in classification, density modeling, and reinforcement learning (RL). To a large extent, successful methods have relied on gradient-based optimization (in particular on the backpropagation algorithm (Rumelhart et al., 1985)) for credit assignment, i.e. for answering the question of how individual parameters (or units) affect the value of the objective. Recently, Schulman et al. (2015a) have shown that such problems can be formalized as optimization in stochastic computation graphs (SCGs). Furthermore, they derive a general gradient estimator that remains valid even in the presence of stochastic or non-differentiable computational nodes. This unified view reveals that numerous previously proposed, domain-specific gradient estimators, such as the likelihood ratio estimator (Glasserman, 1992), also known as ‘REINFORCE’ (Williams, 1992), as well as the pathwise derivative estimator, also known as the “reparameterization trick” (Glasserman, 1991; Kingma & Welling, 2014; Rezende et al., 2014), can be regarded as instantiations of the general SCG estimator. While theoretically correct and conceptually satisfying, the resulting estimator often exhibits high variance in practice, and significant effort has gone into developing techniques to mitigate this problem (Ng et al., 1999; Sutton et al., 2000; Schulman et al., 2015b; Arjona-Medina et al., 2018). Moreover, like backpropagation, the general SCG estimator requires a full forward and backward pass through the entire graph for each gradient evaluation, making the learning dynamics global instead of local. This can become prohibitive for models consisting of hundreds of layers, or recurrent models trained over long temporal horizons.
In this paper, starting from the SCG framework, we target those limitations by introducing a collection of results which unify and generalize a growing body of work dealing with credit assignment. In combination, they lead to a spectrum of approaches that provide estimates of model parameter gradients for very general deterministic or stochastic models. Taking advantage of the model structure, they make it possible to trade off bias and variance of these estimates in a flexible manner. Furthermore, they provide mechanisms to derive local and asynchronous gradient estimates that relieve the need for a full model evaluation. Our results borrow from and generalize a class of methods popular primarily in the reinforcement learning literature, namely that of learned approximations to the surrogate loss or its gradients, also known as value functions, baselines and critics. As new models with increasing structure and complexity are actively being developed by the machine learning community, we expect these methods to contribute powerful new training algorithms in a variety of fields such as hierarchical RL, multi-agent RL, and probabilistic programming.
This paper is structured as follows. We review the stochastic computation graph framework and recall the core result of Schulman et al. (2015a) in section 2. In section 3 we discuss the notions of value functions, baselines, and critics in arbitrary computation graphs, and discuss how and under which conditions they can be used to obtain both lower variance and local estimates of the model gradients, as well as local learning rules for other value functions. In section 4 we provide similar results for gradient critics, i.e. for estimates or approximations of the downstream loss gradient. In section 5 we go through many examples of more structured SCGs arising from various applications and see how our techniques allow us to derive different estimators. In section 7, we discuss how the techniques and concepts introduced in the previous sections can be combined to obtain a wide spectrum of gradient estimators with different strengths and weaknesses. We conclude by investigating a simple chain graph example in detail in section 8.
Notation for derivatives
We use a ‘physics style’ notation by representing a function and its output by the same letter. For any two variables $x$ and $y$ in a computation graph, we use the partial derivative $\frac{\partial y}{\partial x}$ to denote the direct derivative of $y$ with respect to $x$, and $\frac{dy}{dx}$ to denote the total derivative of $y$ with respect to $x$, taking into account all paths (or effects) from $x$ on $y$; we use this notation even if $y$ is still effectively a function of multiple variables. For any variable $x$, we let $\tilde{x}$ denote the value of $x$ which we treat as a constant in the gradient formula; i.e. $\tilde{x}$ can be thought of as the output of a ‘function’ which behaves as the identity, but has gradient zero everywhere (such an operation is often called ‘stop gradient’ in the deep learning community). Finally, we only use the derivative notation for deterministic computation; the gradient of any sampling operation is assumed to be $0$.
All proofs are omitted from the main text and can be found in the appendix.
2 Gradient estimation for expectation of a single function
An important class of problems in machine learning can be formalized as optimizing an expected loss $\mathbb{E}_{x \sim p(x;\theta)}[f(x,\theta)]$ over parameters $\theta$, where both the sampling distribution $p$ as well as the loss function $f$ can depend on $\theta$. As we explain in greater detail in the appendix, concrete examples of this setup are reinforcement learning (where $p$ is a composition of known policy and potentially unknown system dynamics), and variational autoencoders (where $p$ is a composition of data distribution and inference network); cf. Fig. 1. Because of the dependency of the distribution $p$ on $\theta$, backpropagation does not directly apply to this problem.
Two well-known estimators, the score function estimator and the pathwise derivative, have been proposed in the literature. Both turn the gradient of an expectation into an expectation of gradients and hence allow for unbiased Monte Carlo estimates by simulation from known distributions, thus opening the door to stochastic approximations of gradient-based algorithms such as gradient descent (Robbins & Monro, 1985).
Likelihood ratio estimator
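For a random variable $x$ parameterized by $\theta$, i.e. $x \sim p(x;\theta)$, the gradient of an expectation can be obtained using the following estimator:
$$\frac{d}{d\theta}\,\mathbb{E}_{x \sim p(x;\theta)}\left[f(x)\right] = \mathbb{E}_{x \sim p(x;\theta)}\left[f(x)\,\frac{\partial \log p(x;\theta)}{\partial \theta}\right].$$
This classical result from the literature is known under a variety of names, including likelihood ratio estimator, or “REINFORCE”, and can readily be derived by noting that $\frac{dh}{d\theta} = h\,\frac{d \log h}{d\theta}$ for any positive function $h$. The quantity $\frac{\partial \log p(x;\theta)}{\partial \theta}$ is known as the score function of variable $x$.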
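As a minimal numerical sketch of the likelihood ratio estimator (not taken from the paper), consider the assumed toy case of a unit-variance Gaussian $x \sim \mathcal{N}(\theta, 1)$ with $f(x) = x^2$, for which the exact gradient is $2\theta$. The function name `score_function_gradient` and all constants are illustrative choices.

```python
import numpy as np

def score_function_gradient(theta, f, num_samples, rng):
    """Monte Carlo estimate of d/dtheta E_{x ~ N(theta, 1)}[f(x)]."""
    x = rng.normal(loc=theta, scale=1.0, size=num_samples)
    score = x - theta  # d/dtheta log N(x; theta, 1)
    return np.mean(f(x) * score)

rng = np.random.default_rng(0)
theta = 0.5
estimate = score_function_gradient(theta, lambda x: x ** 2, 200_000, rng)
exact = 2.0 * theta  # since E[x^2] = theta^2 + 1
print(f"estimate={estimate:.3f}, exact={exact:.3f}")
```

Note that the estimator only needs values of $f$, never its derivative, which is why it applies to non-differentiable losses.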
Pathwise derivative estimator
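In many cases, a random variable $x$ can be rewritten as a differentiable, deterministic function $x = g(\epsilon, \theta)$ of a fixed random variable $\epsilon$ with parameterless distribution $p(\epsilon)$ (note that reparametrization is always possible, but differentiable reparametrization is not). This leads to a new expectation $\mathbb{E}_{\epsilon}\left[f(g(\epsilon,\theta))\right]$ for which $\theta$ now only appears inside the expectation and the gradient can be straightforwardly estimated:
$$\frac{d}{d\theta}\,\mathbb{E}_{\epsilon}\left[f(g(\epsilon,\theta))\right] = \mathbb{E}_{\epsilon}\left[\frac{\partial f}{\partial x}\,\frac{\partial g}{\partial \theta}\right].$$
Both estimators remain applicable when $f$ is a function of $\theta$ itself. In particular, for the score function estimator we obtain
$$\frac{d}{d\theta}\,\mathbb{E}_{x \sim p(x;\theta)}\left[f(x,\theta)\right] = \mathbb{E}_{x \sim p(x;\theta)}\left[f(x,\theta)\,\frac{\partial \log p(x;\theta)}{\partial \theta} + \frac{\partial f(x,\theta)}{\partial \theta}\right],$$
and for the reparametrization approach, we obtain:
$$\frac{d}{d\theta}\,\mathbb{E}_{\epsilon}\left[f(g(\epsilon,\theta),\theta)\right] = \mathbb{E}_{\epsilon}\left[\frac{\partial f}{\partial x}\,\frac{\partial g}{\partial \theta} + \frac{\partial f}{\partial \theta}\right].$$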
Other estimators have been introduced recently, relying on the implicit function theorem (Figurnov et al., 2018), continuous approximations to discrete sampling using the Gumbel function (Maddison et al., 2016; Jang et al., 2016), the law of large numbers applied to sums of discrete samples (Bengio et al., 2013), and others.
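The following sketch compares the pathwise derivative and score function estimators on an assumed toy problem, $x \sim \mathcal{N}(\theta, 1)$ with $f(x) = x^2$, reparameterized as $x = \theta + \epsilon$. The setup and variable names are illustrative and not from the paper; the comparison of per-sample variances holds for this example and is not a general guarantee.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, num_samples = 0.5, 200_000

eps = rng.normal(size=num_samples)  # parameterless noise
x = theta + eps                     # x = g(eps, theta), so dx/dtheta = 1

# Assumed toy objective f(x) = x**2; the exact gradient is 2 * theta.
pathwise_samples = 2.0 * x * 1.0        # df/dx * dx/dtheta
score_samples = x ** 2 * (x - theta)    # f(x) * d log p(x; theta) / d theta

pathwise_estimate = pathwise_samples.mean()
score_estimate = score_samples.mean()
print(pathwise_estimate, score_estimate)
print(pathwise_samples.var(), score_samples.var())
```

Both sample means approximate the same gradient; in this example the pathwise samples have markedly lower variance because they exploit the derivative of $f$.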
2.1 Stochastic computation graphs
We quickly recall the main results from Schulman et al. (2015a).
Definition 1 (Stochastic Computation Graph).
A stochastic computation graph is a directed, acyclic graph , with two classes of nodes (also called variables): deterministic nodes, and stochastic nodes.
Stochastic nodes (represented with circles, denoted ), which are distributed conditionally given their parents.
Deterministic nodes (represented with squares, denoted ), which are deterministic functions of their parents.
We further specialize our notion of deterministic nodes as follows:
Certain deterministic nodes with no parents in the graphs are referred to as input nodes , which are set externally, including the parameters we differentiate with respect to.
Certain deterministic nodes are designated as losses or costs (represented with diamonds) and denoted . We aim to compute the gradient of the expected sum of costs with respect to some input node .
A parent of a node is connected to it by a directed edge . Let be the total cost.
For a node , we let denote the set of parents of in the graph. A path between and is a sequence of nodes such that all are directed edges in the graph; if any such path exists, we say that descends from and denote it . We say the path is blocked by a set of variables if any of the . By convention, descends from itself. For a variable and set , we say can be deterministically computed from if there is no path from a stochastic variable to which is not blocked by (cf. Fig. 2). Algorithmically, this means that knowing the values of all variables in allows to compute without sampling any other variables; mathematically, it means that conditional on , is a constant. Finally, whenever we use the notion of conditional independence, we refer to the notion of conditional independence informed by the graph (i.e. d-separation) (Geiger et al., 1990; Koller & Friedman, 2009).
Gradient estimator for SCG.
Consider the expected loss . We present the general gradient estimator for the gradient of the expected loss derived in (Schulman et al., 2015a).
For any stochastic variable $v$, we let $\log p(v)$ denote the conditional log-probability of $v$ given its parents (note that $p(v)$ is indeed the conditional distribution, not the marginal one; the parents of $v$ are implicit in this notation, by analogy with deterministic layers, whose parents are typically not explicitly written out), and let $s(v, \theta)$ denote the score function $\frac{d \log p(v)}{d\theta}$.
Theorem 1 (from Schulman et al., 2015a).
Under simple regularity conditions,
$$\frac{d}{d\theta}\,\mathbb{E}\left[L\right] = \mathbb{E}\left[\sum_{v} s(v,\theta)\,L + \sum_{\ell} \frac{d\ell}{d\theta}\right],$$
where the first sum runs over stochastic nodes $v$, the second over cost nodes $\ell$, and $L$ is the total cost. (Since we defined the gradient of sampling operations to be zero, we do not need to use the notion of deterministic descendance as in the original theorem, as the gradient of non-deterministic descendants with respect to inputs is always zero.)
Here, the first term corresponds to the influence $\theta$ has on the loss through the non-differentiable path mediated by stochastic nodes. Intuitively, when using this estimator for gradient descent, $\theta$ is changed so as to increase or ‘reinforce’ the probability of samples of $v$ that empirically led to lower total cost $L$. The second term corresponds to the direct influence $\theta$ has on the total cost through differentiable paths. Note that differentiable paths include paths going through reparameterized random variables.
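To make the two terms of the SCG estimator concrete, here is a small sketch on a hypothetical graph (an assumption for illustration, not an example from the paper): an input $\theta$, one stochastic node $x \sim \mathcal{N}(\theta, 1)$, one cost $(x - 2)^2$ reached only through sampling, and one cost $\theta^2$ reached through a differentiable path.

```python
import numpy as np

def scg_gradient_samples(theta, num_samples, rng):
    """Per-sample SCG gradient estimates for the assumed toy graph."""
    x = rng.normal(loc=theta, scale=1.0, size=num_samples)  # stochastic node
    cost_stochastic = (x - 2.0) ** 2   # cost reached only through sampling
    cost_direct = theta ** 2           # cost reached through a differentiable path
    total_cost = cost_stochastic + cost_direct
    score = x - theta                  # d log N(x; theta, 1) / d theta
    d_cost_direct = 2.0 * theta        # total derivative of the direct cost
    # Sampling has zero gradient, so cost_stochastic adds no derivative term.
    return score * total_cost + d_cost_direct

rng = np.random.default_rng(0)
theta = 0.5
estimate = scg_gradient_samples(theta, 400_000, rng).mean()
# E[L] = (theta - 2)^2 + 1 + theta^2, differentiated by hand.
exact = 2.0 * (theta - 2.0) + 2.0 * theta
print(f"estimate={estimate:.3f}, exact={exact:.3f}")
```

The score term is weighted by the total cost, including the part that $x$ cannot influence; this is one source of the variance that later sections address.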
3 Value based methods
The gradient estimator from theorem 1 is very general and conceptually simple but it tends to have high variance (see for instance the analysis found in Mnih & Rezende, 2016), which affects convergence speed (see e.g. Schmidt et al., 2011). Furthermore, it requires a full evaluation of the graph. To address these issues, we first discuss several variations of the basic estimator in which the total cost is replaced by its deviation from the expected total cost, or conditional expectations thereof, with the aim of reducing the variance of the estimator. We then discuss how approximations of these conditional expectations can be learned locally, leading to a scheme in which gradient computations from partial model evaluations become possible.
In this section, we use the simple concept of conditional expectations to introduce a general definition of value function in a stochastic computation graph.
Definition 2 (Value function).
Let $X$ be an arbitrary subset of the nodes of the graph, $x$ an assignment of possible values to variables in $X$ and $S$ an arbitrary scalar value in the graph. The value function for set $X$ is the expectation of the quantity $S$ conditioned on $X$:
$$V(x) = \mathbb{E}\left[S \mid X = x\right].$$
Intuitively, a value function is an estimate of the cost which averages out the effect of stochastic variables not in $X$; therefore, the larger the set, the fewer variables are averaged out.
The definition of the value function as conditional expectation results in the following characterization:
Lemma 1. For a given assignment $x$ of $X$, $V(x)$ is the optimal mean-squared error estimator of $S$ given input $x$:
$$V(x) = \arg\min_{c}\; \mathbb{E}\left[(S - c)^2 \mid X = x\right].$$
Consider an arbitrary node $v$, and let $L_v$ denote the $v$-rooted cost-to-go, i.e. the sum of costs ‘downstream’ from $v$ (similar notation $L_X$ is used if $X$ is a set). The scalar $S$ will often be the cost-to-go $L_v$ for some fixed node $v$; furthermore, when clear from context, we use $X$ to both refer to the variables and the values they take. For notational simplicity, we will denote the corresponding value function $V(X)$.
Fig. 3 shows multiple examples of value functions for different graphs. The above definition is broader than the typical one used in reinforcement learning. There, due to the simple chain structure of the Markov Decision Processes, the resulting Markov properties of the graph, and the particular choice of , the expectation is only with respect to downstream nodes. Importantly, according to Def. 2 the value can depend on via ancestors of (e.g. example in Fig. 3c). Lemma 1 remains valid nevertheless.
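The characterization of a value function as a mean-squared error minimizer can be checked numerically. The sketch below uses an assumed toy graph with a binary conditioning variable `s` and a noisy cost; the tabular estimate and the helper `mse` are illustrative names, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.integers(0, 2, size=50_000)              # conditioning variable
cost = 3.0 * s + rng.normal(size=s.size)         # scalar quantity to predict

# Tabular value function: empirical conditional mean of the cost given s.
value = np.array([cost[s == k].mean() for k in (0, 1)])

def mse(table):
    """Mean-squared error of predicting the cost with table[s]."""
    return np.mean((cost - table[s]) ** 2)

perturbed = value + np.array([0.1, -0.1])
print(value, mse(value), mse(perturbed))
```

Any perturbation of the conditional means increases the empirical squared error, which is what makes regression a natural way to learn value functions.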
3.2 Baselines and critics
In this section, we will define the notions of baselines and critics and use them to introduce a generalization of theorem 1 which can be used to compute lower-variance estimators of the gradient of the expected cost. We will then show how to use value functions to design baselines and critics.
Consider an arbitrary node and input .
Definition 3 (Baseline).
A baseline for is any function of the graph such that . A baseline set is an arbitrary subset of the non-descendants of .
Baseline sets are of interest because of the following property:
Let be an arbitrary scalar function of . Then is a baseline for .
Common choices are constant baselines, i.e. , or baselines only depending on the parents of .
Definition 4 (Critic).
A critic of cost for is any function of the graph such that .
By linearity of expectations, linear combinations of baselines are baselines, and convex combinations of critics are critics.
The use of the terms critic and baseline is motivated by their respective roles in the following theorem, which generalizes the policy gradient theorem (Sutton et al., 2000):
Consider an arbitrary baseline and critic for each stochastic node . Then,
The difference between a critic and a baseline is called an advantage function.
Theorem 2 enables the derivation of a surrogate loss. Let be defined as , where we recall that the tilde notation indicates a constant from the point of view of computing gradients. Then, the gradient of the expected cost equals the gradient of in expectation: .
Before providing intuition on this theorem, we see how value functions can be used to design baselines and critics:
Definition 5 (Baseline value function and critic value function).
For any node and baseline set , a special case of a baseline is to choose the value function with set . Such a baseline is called a baseline value function.
Let a critic set be a set such that , and and are conditionally independent given ; a special case is when is such that is deterministically computable given . Then the value function for set is a critic for which we call a critic value function for .
In the standard MDP setup of the RL literature, the critic set consists of the state $s$ and the action $a$, which is taken by a stochastic policy in state $s$ with probability $\pi(a \mid s)$, which is a deterministic function of $(s, a)$. Definition 5 is more general than this conventional usage of critics since it does not require the critic set to contain all stochastic ancestor nodes that are required to evaluate the probability of the action. For instance, assume that the action is conditionally sampled from the state $s$ and some source of noise $\xi$, for instance due to dropout, with distribution $\pi(a \mid s, \xi)$ (in this example, it is important that $\xi$ is used only once; it cannot be used to compute other actions). The critic set may but does not need to include $\xi$; if it does not, $\pi(a \mid s, \xi)$ is not a deterministic function of $s$ and $a$. The corresponding critic remains useful and valid.
Figure 3 contains several examples of value functions which take the role of baselines and critics for different nodes.
Three related ideas guide the derivation of theorem 2. To give intuition, let us analyze the term $s(v,\theta)\,(Q_v - B_v)$, where $Q_v$ is the critic and $B_v$ the baseline for node $v$, which replaces the score function weighted by the total cost, $s(v,\theta)\,L$. First, the conditional distribution of $v$ only influences the costs downstream from $v$, hence we only have to reinforce the probability of $v$ with the cost-to-go $L_v$ instead of the total cost $L$. Second, the extent to which a particular sample contributed to cost-to-go should be compared to the cost-to-go the graph typically produces in the first place. This is the intuition behind subtracting the baseline $B_v$, also known as a control variate. Third, we ideally would like to understand the precise contribution of $v$ to the cost-to-go, not for a particular value of downstream random variables, but on average. This is the idea behind the critic $Q_v$. The advantage (difference between critic and baseline) therefore provides an estimate of ‘how much better than anticipated’ the cost was, as a function of the random choice $v$.
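The effect of subtracting a baseline can be illustrated with a short sketch. The setup is an assumption made for illustration: a single stochastic node $x \sim \mathcal{N}(\theta, 1)$ with cost $(x - 2)^2 + 10$, and a constant baseline equal to the exact expected cost; the sampled cost itself plays the role of the critic.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, num_samples = 0.5, 400_000

x = rng.normal(loc=theta, scale=1.0, size=num_samples)
cost = (x - 2.0) ** 2 + 10.0
score = x - theta                              # d log N(x; theta, 1) / d theta

# Constant baseline: the exact expected cost, computed by hand for this toy case.
baseline = (theta - 2.0) ** 2 + 1.0 + 10.0

plain_samples = score * cost                   # score weighted by the cost
baselined_samples = score * (cost - baseline)  # score weighted by the advantage
exact = 2.0 * (theta - 2.0)

print(plain_samples.mean(), baselined_samples.mean(), exact)
print(plain_samples.var(), baselined_samples.var())
```

Both estimators have the same expectation because the score has zero mean, while the baselined version has much lower variance in this example.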
Baseline value functions are often used as baselines as they approximate the optimal baseline (see Appendix B.1). Critic value functions are often used as they provide an expected downstream cost given the conditioning set. Furthermore, as we will see in the next section, value functions can be estimated in a recursive fashion, enabling local learning of the values, and sharing of value functions between baselines and critics. For these reasons, in the rest of this paper, we will only consider baseline value functions and critic value functions.
In the remainder of this section, we consider an arbitrary value function with conditioning set .
3.3 Recursive estimation and Markov properties
A fundamental principle in RL is given by the Bellman equation – which details how a value function can be defined recursively in terms of the value function at the next time step. In this section, we generalize the notion of recursive computation to arbitrary graphs.
The main result, which follows immediately from the law of iterated expectations, characterizes the value function for one set, as an expectation of a value function (or critic / baseline value function) of a larger set:
Consider two sets , and an arbitrary quantity . Then we have: .
This lemma is powerful, as it allows us to relate value functions to one another by averaging: the value function for a smaller set is an average of the value function for a larger set. A simple example in RL is the relation (here, in the infinite-horizon discounted case) between the Q function $Q^\pi$ of a policy $\pi$ and the corresponding value function $V^\pi$, which is given by $V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$. Note this equation relates a critic value function to a value function typically used as baseline.
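The RL instance of this relation can be checked on a small tabular example. The Q-values and policy probabilities below are arbitrary illustrative numbers, not data from the paper.

```python
import numpy as np

# Rows index states, columns index actions (hypothetical numbers).
q_values = np.array([[1.0, 3.0],
                     [0.0, 4.0]])
policy = np.array([[0.5, 0.5],
                   [0.25, 0.75]])

# Averaging the critic over the action gives the state value function.
state_values = (policy * q_values).sum(axis=1)

# The same quantity for state 1, obtained by sampling actions instead.
rng = np.random.default_rng(0)
actions = rng.choice(2, size=100_000, p=policy[1])
mc_value = q_values[1, actions].mean()
print(state_values, mc_value)
```

The sampled average converges to the exact sum, which is the law of iterated expectations applied to the sets {state} and {state, action}.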
To fully leverage the lemma above, we proceed with a Markov property for graphs (borrowed from well-known conditional independence conditions in graphical models, and adapted to our purposes), which captures the following situation: given two conditioning sets , it may be the case that the additional information contained in does not improve the accuracy of the cost prediction compared to the information contained in the smaller set .
For conditioning set , we say that is Markov (for ) if for any such that there exists a directed path from to not blocked by , none of the descendants of are in .
Let be the set of all ancestors of nodes (recall that by convention nodes are descendants of themselves, so ).
Let be Markov, consider any such that . For any assignment of values to the variables in , let be the restriction of to the variables in . Then:
which we will simply denote, with a slight abuse of notation,
In other words, the information contained in is irrelevant in terms of cost prediction, given access to the information in . Several examples are shown in Fig. 4. It is worth noting that Def. 6 does not rule out changes in the expected value of after adding additional nodes to (cf. Fig. 4(d,e)). Instead, it rules out correlations between and that are mediated via ancestors of nodes in , as in the example in Fig. 4(a,b,c).
The notion of Markov set can be used to refine Lemma 2 as follows:
Lemma 3 (Generalized Bellman equation).
Consider two sets , and suppose is Markov. Then we have: .
The Markov assumption is critical in allowing us to ‘push’ the boundary at which the expectation is defined; without it, lemma 2 only allows us to relate value functions of sets which are subsets of one another. Notice, however, that no such inclusion is required between and themselves. In the context of RL, this corresponds to equations of the type (see Fig. 5), though to get the separation between the reward and the value at the next time step, we need a slight refinement, which we detail in the next section.
3.4 Decomposed costs and bootstrap
In the previous sections we have considered a value function with respect to a node which predicts an estimate of the cost-to-go from node (note was implicit in most of our notation). In this section, we write the cost-to-go at a node as a function of the cost-to-go from other nodes or collections of nodes, and leverage the linearity of expectation to turn these relations between costs into relations between value functions.
A first simple observation is that because of the linearity of expectations, for any two scalar quantities , real value and set , we have .
Definition 7 (Decomposed costs).
For a node and a collection in the graph, we say that the cost can be decomposed with set if .
This implies that cost nodes can be grouped in disjoint sets corresponding to the descendants of different sets , without double-counting. A common special case is a tree, where each is a singleton containing a single child of .
Theorem 3 (Bootstrap principle for SCGs).
Suppose the cost-to-go from node can be decomposed with sets , and consider an arbitrary set with associated value function . Furthermore, for each set , consider a set and associated value function: . If for each , , or if for each , is Markov and , then:
Fig. 6 highlights potential difficulties of defining correct bootstrap equations for various graphs.
From the bootstrap equation follows a special case, which we call partial averaging, often used for critics:
Corollary 1 (Partial averages).
Suppose that for each , is Markov and . Without loss of generality, define as the collection of all cost nodes which can be deterministically computed from . Then,
The term ‘partial average’ indicates that the value function is a conditional expectation (i.e. ‘averaging’ variables) but that it combines averaged cost estimates (the value terms ) and empirical costs (). Fig. 7 shows some examples for generic graphs.
In the case of RL for instance, a k-step return is a form of partial average, since the return (the sum of all rewards downstream from state ) can be written as the sum of all rewards in and downstream from ; the critic value function is therefore equal to (we assume for simplicity that the rewards are deterministic functions of the state; the result can be trivially generalized). This implies in turn that is also equal to .
3.5 Approximate Value functions
In practice, value functions often cannot be computed exactly. In such cases, one can resort to learning parametric approximations. For node , conditioning set , we will consider an approximate value function as an approximation (with parameters ) to the value function .
Following lemma 1, we know that for a possible assignment of variables , minimizes over . We therefore elect to optimize by considering the following weighted average, called a regression on return in reinforcement learning:
from which we obtain (note that does not affect the distribution of any variable in the graph, and therefore exchange of derivative and integration follows under common regularity conditions):
which can easily be computed by forward sampling from , even if conditional sampling given is difficult. This is possible because of the use of as a particular weighting on the collection of problems of the type .
We now leverage the recursion methods from the previous sections in two different ways. The first is to use the combination of approximate value functions and partial averages to define other value functions. For a partial average as defined in corollary 1 and family of approximate value functions , we can define an approximate value function through the bootstrap equation: . In other words, using the bootstrap equations, approximating value functions for certain sets automatically defines other approximate value functions for other sets.
In general, we can trade bias and variance by making larger (which will typically result in lower bias, higher variance) or not, i.e. by shifting the boundary at which variables are integrated out. An extreme case of a partial average is not an average at all, where , in which case the value function is the empirical return . $k$-step returns in reinforcement learning (see section 5.1) are an example of trading bias and variance by choosing the integration boundary to be all nodes at a distance greater than $k$, and all costs at a distance less than $k$. $\lambda$-weighted returns in the RL literature (section 5.1) are convex combinations of partial averages; the parameter $\lambda$ similarly controls a bias-variance tradeoff.
By following this gradient, the value function will tend towards the bootstrap value instead of the return . Because the former has averaged out stochastic nodes, it is a lower variance target, and should in practice provide a stronger learning signal. Furthermore, as it can be evaluated as soon as is evaluated, it provides a local or online learning rule for the value at ; by this we mean the corresponding gradient update can be computed as soon as all sets are evaluated. In RL, this local learning property can be found in actor-critic schemes: when taking action in state , as soon as the immediate reward is computed and next state is evaluated, the value function (which is a baseline for ) can be regressed against low-variance target (which is also a critic for ), and the temporal difference error (or advantage) can be used to update the policy by following .
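The local learning rule described above can be sketched with tabular TD(0), a standard RL algorithm used here purely as an illustration (it is not presented as the paper's method). The environment is an assumed three-state chain with deterministic transitions and noisy rewards of mean 1, so the exact undiscounted values are 3, 2 and 1.

```python
import numpy as np

def td0_chain(num_states, num_episodes, step_size, rng):
    """Tabular TD(0) on a left-to-right chain with noisy unit rewards."""
    values = np.zeros(num_states + 1)  # last entry: terminal state, value 0
    for _ in range(num_episodes):
        for s in range(num_states):
            reward = 1.0 + rng.normal()
            # Bootstrap target: immediate reward plus the next state's value.
            target = reward + values[s + 1]
            # Local update, available as soon as the next state is reached.
            values[s] += step_size * (target - values[s])
    return values[:-1]

rng = np.random.default_rng(0)
learned = td0_chain(num_states=3, num_episodes=5_000, step_size=0.01, rng=rng)
print(learned)
```

Each update only involves a state, its immediate reward and its successor, so no full rollout of the chain is required before learning can start.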
4 Gradient-based methods
In the previous section, we developed techniques to lower the variance of the score-function terms in the gradient estimate. This led to the construction of a surrogate loss which satisfies .
In this section, we develop corresponding techniques to lower the variance estimates of the gradients of surrogate cost . To this end, we will again make use of conditional expectations to partially average out variability from stochastic nodes. This leads to the idea of a gradient-critic, the equivalent of the value critic for gradient-based approaches.
Definition 8 (Value-gradient).
The value-gradient for with set is the following function of :
Value-gradients are not directly useful in our derivations, but we will see later that certain value-gradients can reduce the variance of our estimators. We call these value-gradients gradient-critics.
Definition 9 (Gradient-critic).
Consider two nodes and , and a value-gradient for node with set . If and are conditionally independent given (see lemma 7 in the appendix for a characterization of conditional independence between total derivatives), then we say the value-gradient is a gradient-critic for with respect to .
If is deterministically computable from , then is a gradient-critic for with respect to .
We can use gradient-critics in the backpropagation equation. First, we recall the equation for backpropagation and stochastic backpropagation. Let be an arbitrary node of , and be the children of in . The backpropagation equations state that:
From this we obtain the stochastic backpropagation equations:
Gradient-critics allow for replacing these stochastic estimates by conditional expectations, potentially achieving lower variance:
For each child of , let be a gradient-critic for with respect to . We then have:
Note the similarity with the intuition behind the critics defined in the previous section. In both cases, we want to evaluate the expectation of a product of two correlated random variables, and replace one by its expectation given a set which makes the variables conditionally independent.
4.2 Horizon gradient-critic
More generally, we do not have to limit ourselves to being children of . We define a separator set for in to be a set such that every deterministic path from to the loss is blocked by a . For simplicity, we further require the separator set to be unordered, which means that for any , cannot be an ancestor to ; we drop this assumption for a generalized result in the appendix A. Under these assumptions, the backpropagation rule can be rewritten (see (Naumann, 2008; Parmas, 2018), also appendix A):
Assume that for every , is a gradient critic for with respect to . We then have:
This theorem allows us to ‘push’ the horizon after which we start using gradient-critics. It constitutes the gradient equivalent of partial averaging, since it combines stochastic backpropagation (the terms ) and gradient critics .
4.3 The gradient-critic bootstrap
We now show how the result from the previous section allows us to derive a generic notion of bootstrapping for gradient-critics.
Theorem 6 (Gradient-critic bootstrap).
Consider a node , unordered separator set . Consider value-gradient with set for node , and with Markov sets critics for with respect to . Suppose that for all , . Then,
4.4 Gradient-critic and gradient of critic
The section above proposes an operational definition of a gradient critic, in that one can replace the sampled gradient by the expectation of the gradient . A natural question follows – is a value-gradient the gradient of a value function? Similarly, is a gradient-critic the gradient of a critic function?
It is in general not true that the value-gradient must be the gradient of a value function. However, if the critic set is Markov, the gradient-critic is the gradient of the critic.
Consider a node and critic set , and corresponding critic value function and gradient-critic . If is Markov for , then we have:
This characterization of the gradient-critic as gradient of a critic plays a key role in using reparametrization techniques when gradients are not computable. For instance, in a continuous control application of reinforcement learning, the state of the environment can be assumed to be an unknown but differentiable function of the previous state and of the action. In this context, a critic can readily be learned by predicting total costs. By the argument above, the gradient of this critic actually corresponds to the gradient-critic of the unknown environment dynamics. This technique is at the heart of differentiable policy gradients (Lillicrap et al., 2015) and stochastic value gradients (Heess et al., 2015).
When estimating the gradient-critic from the critic, one needs to make sure that the distribution of conditional on has ‘full density’ (i.e. that the loss function can be evaluated in a neighborhood of the values of ), otherwise the resulting gradient estimate will be incorrect. This is an issue, for instance, if variables in are deterministic functions of one another. To address this issue, one can sample from a different distribution than , for instance by injecting additional noise in the variables. One may then have to use the bootstrap equation instead of regression on return, since otherwise we would be estimating the critic of a different graph (with added noise). See for instance Silver et al. (2014) and Lillicrap et al. (2015).
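The relation between the gradient of a learned critic and the gradient-critic can be illustrated on an assumed one-step problem (not an example from the paper): an action $a$, dynamics $s' = a + \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$, and cost $(s' - 1)^2$. The critic is fitted by quadratic regression on sampled costs, with actions drawn from a broad range so that the cost is observed in a neighborhood of the query point.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples = 100_000

# Exploratory actions with full density around the query point.
actions = rng.uniform(-2.0, 2.0, size=num_samples)
noise = rng.normal(size=num_samples)
costs = (actions + noise - 1.0) ** 2       # only cost values are observed

# Critic: quadratic regression of the cost on the action.
coeffs = np.polyfit(actions, costs, deg=2)
query_action = 0.5
critic_gradient = np.polyval(np.polyder(coeffs), query_action)

# Gradient-critic: average pathwise derivative, which needs the true dynamics.
fresh_noise = rng.normal(size=num_samples)
pathwise_gradient = np.mean(2.0 * (query_action + fresh_noise - 1.0))

exact = 2.0 * (query_action - 1.0)
print(critic_gradient, pathwise_gradient, exact)
```

The regression never differentiates the dynamics, yet the derivative of the fitted critic matches the averaged pathwise gradient. If all actions were fixed to the query value, the fit would be degenerate, which mirrors the ‘full density’ requirement.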
4.5 Gradient-critic approximation and computation
Following the arguments regarding conditional expectation and square minimization from section 3.1, we know that satisfies the following minimization problem:
For a parametric approximation , and using the same weighting scheme as section 3.5, it follows that:
Finally, if is Markovian for , from Theorem 7, the gradient-critic can be defined in two ways: first, as the critic of a gradient (