On the Variance of Unbiased Online Recurrent Optimization

02/06/2019 · by Tim Cooijmans, et al.

The recently proposed Unbiased Online Recurrent Optimization algorithm (UORO, arXiv:1702.05043) uses an unbiased approximation of RTRL to achieve fully online gradient-based learning in RNNs. In this work we analyze the variance of the gradient estimate computed by UORO, and propose several possible changes to the method which reduce this variance both in theory and practice. We also contribute significantly to the theoretical and intuitive understanding of UORO (and its existing variance reduction technique), and demonstrate a fundamental connection between its gradient estimate and the one that would be computed by REINFORCE if small amounts of noise were added to the RNN's hidden units.


1 Introduction

All learning algorithms are driven by some form of credit assignment—identification of the causal effect of past actions on a learning signal (Minsky, 1961; Sutton, 1984). This enables agents to learn from experience by amplifying behaviors that lead to success, and attenuating behaviors that lead to failure. The problem of performing efficient and precise credit assignment, especially in temporal agents, is a central one in artificial intelligence.

Knowledge of the inner workings of the agent can simplify the problem considerably, as we can trace responsibility for the agent’s decisions back to its parameters. In this work, we consider credit assignment in recurrent neural networks (rnns; Elman, 1990; Hochreiter and Schmidhuber, 1997), where the differentiability of the learning signal with respect to past hidden units allows us to assign credit using derivatives. But even with this structure, online credit assignment across long or indefinite stretches of time remains a largely unsolved problem.

Typically, differentiation occurs by Backpropagation Through Time (bptt; Rumelhart et al., 1986; Werbos, 1990), which requires a “forward pass” in which the network is evaluated for a length of time, followed by a “backwards pass” in which the gradient with respect to the model’s parameters is computed. This is impractical for very long sequences, and a common trick is to “truncate” the backwards pass after some fixed number of iterations (Williams and Peng, 1990). As a consequence, parameter updates are infrequent, expensive, and limited in the range of temporal dependencies they reflect.

bptt’s more natural dual, Real-Time Recurrent Learning (rtrl; Williams and Zipser, 1989), carries gradient information forward rather than backward. It runs alongside the model and provides parameter updates at every time step. To do so, however, it must retain a large matrix relating the model’s internal state to its parameters. Even when this matrix can be stored at all, updating it is prohibitively expensive. Various approximations to rtrl have been proposed (e.g. Mak et al., 1999) in order to obtain cheaper gradient estimates at the cost of reducing their accuracy.

In this paper we consider Unbiased Online Recurrent Optimization (uoro; Ollivier et al., 2015; Tallec and Ollivier, 2018), an unbiased stochastic approximation to rtrl that compresses the gradient information through random projections. We analyze the variance of the uoro gradient estimator, relate it to other gradient estimators, and propose various modifications to it that reduce its variance both in theory and practice.

2 Outline of the Paper

We begin with a detailed discussion of the relationship and tradeoffs between rtrl and bptt in Section 3. Before narrowing our focus to approximations to rtrl, we briefly review other approaches to online credit assignment in Section 4. We then contribute a novel and arguably more intuitive derivation of the uoro algorithm in Section 5.

In Sections 6 and 7 we give our main contribution in the form of a thorough analysis of uoro and the variance it incurs, and derive a new variance reduction method based on this analysis. Sections 6.1 and 6.2 discuss limitations of the variance reduction scheme of Tallec and Ollivier (2018), and Section 6.3 proposes to augment its scalar coefficients with matrix-valued transformations. We develop a framework for analysis of uoro-style estimators in Sections 6.4 and 6.5, which allows us to determine the total variance incurred when accumulating consecutive gradient estimates over time. Working within this framework, we derive a formula for matrices that gives the optimal variance reduction subject to certain structural constraints (Section 7.1). We evaluate our theory in a tractable empirical setting in Section 7.2, and explore avenues toward a practical algorithm in Section 7.1.3.

Section 8 introduces a variant of uoro that avoids one of its two levels of approximation. It exploits the fact that gradients with respect to weight matrices are naturally rank-one. We show this reduces the variance by a factor on the order of the number of hidden units, at the cost of increasing computation time by the same factor.

Finally, we study the relationship between uoro and reinforce (Williams, 1992) in Section 9. The analysis uncovers a close connection when reinforce is used to train rnns with perturbed hidden states. We show that when this noise is annealed, the reinforce estimator converges to the uoro estimator plus an additional term that has expectation zero but unbounded variance.

3 Automatic Differentiation in Recurrent Neural Networks

Recurrent Neural Networks (rnns; Elman, 1990; Hochreiter and Schmidhuber, 1997) are a general class of nonlinear sequence models endowed with memory. Given a sequence of input vectors and an initial state vector, an rnn’s state evolves according to

where is an arbitrary continuously differentiable transition function parameterized by that produces the next state given the previous state and the current observation . Typically, will take the form of an affine map followed by a nonlinear function:

(1)

Here the “activation function” is assumed to be continuously differentiable (and is typically nonlinear and coordinate-wise), and the weight matrix is a square parameter whose vectorization forms the parameter vector.
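To make the setup concrete, the following minimal sketch (in our own notation, not necessarily the paper's exact parameterization) implements one such transition: a tanh RNN whose square recurrent matrix W plays the role of the parameter, with the input entering through a separate projection U that we do not differentiate with respect to.

    import numpy as np

    def rnn_step(h_prev, x, W, U):
        """One RNN transition: an affine map followed by a coordinate-wise
        nonlinearity.  Sketch only; the paper's exact parameterization may
        differ.  W is the square recurrent matrix whose vectorization stands
        in for the parameter vector, and U is an input projection treated as
        fixed."""
        a = W @ h_prev + U @ x           # preactivations (affine map)
        return np.tanh(a)                # coordinate-wise activation function

    # Tiny usage example with arbitrary sizes.
    rng = np.random.default_rng(0)
    n, d = 4, 3                          # hidden size, input size
    W = rng.normal(scale=0.5, size=(n, n))
    U = rng.normal(scale=0.5, size=(n, d))
    h = np.zeros(n)
    for x in rng.normal(size=(5, d)):    # a short stream of inputs
        h = rnn_step(h, x, W, U)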

The defining feature of recurrent neural networks as compared to feed-forward neural networks is the fact that their weights are tied over time; that is, the same parameter is applied at every time step. However, we will continue to distinguish the per-time-step copies of the parameter in the recurrence, as this allows us to refer to individual “applications” of the parameter in the analysis (which will be useful later).

Although we will treat the sequence as finite, i.e. for some sequence length , we are interested mainly in streaming tasks for which may as well be infinite.

At each time step , we incur a loss which is some differentiable function of . In order to minimize the aggregate loss with respect to , we require an estimate of its gradient with respect to . We will write (or occasionally ) for the Jacobian of with respect to . We can express the gradient as a double sum over time that factorizes in two interesting ways:

(2)

Each of the terms indicates how the use of the parameter at time affected the loss at time . This is a double sum over time with terms, but since future parameter applications do not affect past losses, we have for . Both factorizations exploit this triangular structure and allow the gradient to be computed in linear time by recursive accumulation.

By far the most popular strategy for breaking down this computation goes by the name of Back-Propagation Through Time (bptt; Werbos, 1990). It is an instance of what is known as reverse-mode accumulation in the autodifferentiation community, and relies on the reverse factorization in Equation 2. bptt computes gradients of total future loss with respect to states in reverse chronological order by the recursion

(3)

At each step, a term of the gradient is accumulated.

Since the quantities , and generally depend on and , the use of bptt in practice implies running the model forward for steps to obtain the sequence of hidden states and losses , and subsequently running backward to compute the gradient.
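The backward recursion can be summarized in a short sketch (our notation and an illustrative squared loss, not the paper's): run the tanh RNN from the previous sketch forward while storing its states, then sweep backward, accumulating the gradient with respect to the recurrent matrix at each step.

    import numpy as np

    def bptt(xs, targets, W, U):
        """BPTT sketch for h_t = tanh(W h_{t-1} + U x_t) with the illustrative
        loss l_t = 0.5 * ||h_t - targets[t]||^2.  Returns dL/dW."""
        n = W.shape[0]
        hs = [np.zeros(n)]                          # forward pass: store all states
        for x in xs:
            hs.append(np.tanh(W @ hs[-1] + U @ x))

        grad_W = np.zeros_like(W)
        delta = np.zeros(n)                         # dL/dh_t, carried backwards
        for t in reversed(range(len(xs))):
            h_t, h_prev = hs[t + 1], hs[t]
            delta = delta + (h_t - targets[t])      # add the immediate loss gradient
            Dd = (1.0 - h_t ** 2) * delta           # backprop through tanh
            grad_W += np.outer(Dd, h_prev)          # contribution of this application of W
            delta = W.T @ Dd                        # push dL/dh_t back to dL/dh_{t-1}
        return grad_W

Note that all states must be stored before the backward sweep can begin, which is exactly the storage cost that motivates truncation.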

Its converse, Real-Time Recurrent Learning (rtrl; Williams and Zipser, 1989), is an instance of forward-mode accumulation. It exploits the forward factorization of the gradient in Equation 2, computing Jacobians of hidden states with respect to past applications of the parameter recursively according to

(4)

What rtrl provides over bptt is that we can run it forward alongside our model, and at each time-step update the model parameters immediately (using ), thus performing fully online learning. This is to be contrasted with bptt, where we must run the model forward for time-steps before we can make a parameter update, thus introducing a long delay between the reception of a learning signal and the parameter update that takes it into account.

There is a caveat to the above, which is that as soon as we update our parameter , the Jacobian accumulated by rtrl is no longer quite correct, as it is based on previous values of . However, as argued by Williams and Zipser (1995) and Ollivier et al. (2015) this problem can be mostly ignored as long as the learning rate is small enough in relation to the rate of the natural decay of the Jacobian (which occurs due to the vanishing gradient phenomenon).

The main drawback of rtrl is that the accumulated quantity is a large matrix. If the size of the parameters is where is the hidden state size, then this matrix requires space to store. This is typically much larger than bptt’s space. Moreover, the rtrl recursions involve propagating a matrix forward by the matrix-matrix product , which takes time. bptt on the other hand only propagates a vector through time at a cost of . Although rtrl frees us to grow and capture arbitrarily-long-term dependencies, the algorithm is grossly impractical for models of even modest size.
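The following sketch of the forward recursion (Equation 4) for the same tanh RNN makes the cost concrete: with hidden size n and parameter dimension n², the carried Jacobian is n × n², so storage is O(n³) and the matrix-matrix propagation costs O(n⁴) per step. Names and the column-major vectorization convention are ours.

    import numpy as np

    def rtrl_step(h_prev, x, J_prev, W, U, dl_dh):
        """One RTRL step for h_t = tanh(W h_{t-1} + U x_t).  Sketch only.

        J_prev is the (n x n^2) Jacobian dh_{t-1}/dvec(W) (column-major vec).
        dl_dh is the immediate loss gradient dl_t/dh_t, supplied by the caller.
        Returns the new state, the updated Jacobian, and dl_t/dvec(W) reshaped
        to the shape of W, available immediately (online)."""
        n = W.shape[0]
        h = np.tanh(W @ h_prev + U @ x)
        dphi = 1.0 - h ** 2                              # tanh' at the new preactivations
        # Immediate Jacobian of h_t w.r.t. vec(W): dh_i/dW_{ij} = dphi_i * h_prev_j.
        immediate = np.kron(h_prev.reshape(1, -1), np.diag(dphi))
        # Forward accumulation: dh_t/dvec(W) = (dh_t/dh_{t-1}) J_{t-1} + immediate.
        J = (dphi[:, None] * W) @ J_prev + immediate
        grad_W = (dl_dh @ J).reshape(n, n, order="F")    # undo the column-major vec
        return h, J, grad_W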

4 Other Approaches to Credit Assignment

A number of techniques have been proposed to reduce the memory requirements of bptt. Storage of past hidden states may be traded for time by recomputing the states on demand, in the extreme case resulting in a quadratic-time algorithm. Better choices for this tradeoff are explored by Chen et al. (2016); Gruslys et al. (2016). Reversible Recurrent Neural Networks (MacKay et al., 2018; Gomez et al., 2017) allow the on-demand computation of past states to occur in reverse order, restoring the linear time complexity while limiting the model class. Stochastic Attentive Backtracking (Ke et al., 2018) sidesteps the storage requirements of backprop through long periods of time by retaining only a sparse subset of states in the distant past. This subset is selected based on an attention mechanism that is part of the model being trained. Gradient from future loss is propagated backwards to these states only through the attention connections. Synthetic gradients (Jaderberg et al., 2017) approximates bptt by use of a predictive model of the total future gradient , which is trained online based on bptt.

Instead of transporting derivatives through time, we may assign credit by transporting value over time. For example, actor-critic architectures (Konda and Tsitsiklis, 2000; Barto et al., 1983) employ Temporal Difference Learning (Sutton, 1988) to obtain a predictive model of the total future loss. By differentiation, the estimated total future loss may be used to estimate the total future gradient. More commonly, such estimates are used directly as a proxy for the total future loss, or as a reinforce baseline. Along similar lines as our analysis of reinforce in Section 9, we may interpret these methods as effectively differentiating the estimate in expectation. rudder (Arjona-Medina et al., 2018) redistributes the total loss over time, replacing the immediate losses by surrogates determined by a process similar to backpropagation through a critic. These surrogates preserve the total loss but in an RL setting may better reflect the long-term impact of the action taken at time . Temporal Value Transport (Hung et al., 2018) relies on attention weights to determine which past time steps were relevant to which future time steps, and injects the estimated total future loss from the future time steps into the immediate loss for the associated past time steps.

5 Unbiased Online Recurrent Optimization

The recently proposed Unbiased Online Recurrent Optimization algorithm (uoro; Tallec and Ollivier, 2018) and its predecessor NoBackTrack (Ollivier et al., 2015) approximate rtrl by maintaining a rank-one estimate of the Jacobian . We now briefly derive the basic algorithm.

5.1 Derivation

First, we note that can be written as . We then perform a rank-one projection of each term in this sum using a random vector (which is chosen to satisfy ). This gives us the estimator

Unbiasedness follows from a simple application of linearity of expectation:

We will refer to this projection as the spatial projection to distinguish it from the temporal projection that is to follow.

It is interesting to note that can be interpreted as a “directional Jacobian”, which measures the instantaneous change in as a function of ’s movement along the direction . Similarly is essentially the gradient of with respect to , and thus measures the instantaneous change of along the direction of , as a function of the change in . Thus the intuition behind this first approximation is that we are guessing the relevant direction of change in and performing the gradient computations only along that direction.
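The unbiasedness of the spatial projection is easy to verify numerically. The toy check below (ours, not from the paper) averages rank-one samples of the form ν(ν^T A) for a random sign vector ν satisfying E[νν^T] = I, and confirms that the mean approaches the projected matrix A.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(4, 6))                  # stands in for a Jacobian being projected

    num_samples = 200_000
    estimate = np.zeros_like(A)
    for _ in range(num_samples):
        nu = rng.choice([-1.0, 1.0], size=4)     # E[nu nu^T] = I
        estimate += np.outer(nu, nu @ A)         # rank-one sample nu (nu^T A)
    estimate /= num_samples

    print(np.max(np.abs(estimate - A)))          # small: the projection is unbiased in expectation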

We can generalize the spatial projection from the standard uoro method by projecting in the space of any cut vertex on the computational path from to . For uoro, ; other choices include for projection in parameter space, and for projection in preactivation space. We will make extensive use of this choice in later Sections.

This gives the generalized estimator

which is unbiased following a similar argument as before.

The random projections serve to reduce the large matrix into the more manageable vector quantities and . But because the sum of rank-one matrices is not itself rank one, the resultant estimator will still be too expensive to maintain and update online.

In order to obtain a practical algorithm we make a second rank-one approximation, now across time instead of -space. To this end we introduce random scalar coefficients satisfying (where is the Kronecker delta which is 1 if and 0 otherwise) and define the following rank-one estimator:

By linearity of expectation this is an unbiased estimate of the previous spatially projected estimator , and is thus also an unbiased estimator of , although with potentially much higher variance.

Going forward we will assume that are iid random signs and are iid standard normal vectors, so that we may treat the product as a single Gaussian-distributed random vector , which will simplify our analysis.

The two factors and of the rank-one approximation are maintained by the following pair of recursions:

(5)

with initialized to zero vectors. Notably these recursions are similar in structure to those used by rtrl to compute the exact Jacobian (cf. Equation 4). As with the rtrl equations, their validity follows from the fact that .

In these recursions we have introduced coefficients and to implement the variance reduction technique from Tallec and Ollivier (2018); Ollivier et al. (2015), which we will refer to as greedy iterative rescaling (gir). We will discuss gir in detail in the next subsection.

Finally, at each step we estimate using the estimator . This is a small deviation from the one given by Tallec and Ollivier (2018), which uses backpropagation to compute exactly, and the remaining part of the gradient, , is estimated as . Although our version has slightly higher variance, it is conceptually simpler.

The projected Jacobians that appear in Equation 5.1 can be computed efficiently without explicitly handling the full Jacobians. Specifically, can be computed by reverse-mode differentiating with respect to , and substituting in place of the adjoint . By a similar trick, one can compute and using forward-mode differentiation. The resulting algorithm has the same time complexity as backpropagation through time, but its storage does not grow with time.
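The following sketch assembles one full UORO update for the tanh RNN used in the earlier sketches, including the rescaling coefficients discussed in the next subsection. It mirrors the structure of the recursions above but uses our own notation, random-sign noise, and explicit small-model Jacobians in place of forward- and reverse-mode calls; it is illustrative rather than a faithful reimplementation.

    import numpy as np

    def uoro_step(h_prev, x, h_tilde, w_tilde, W, U, dl_dh, rng, eps=1e-7):
        """One UORO step for h_t = tanh(W h_{t-1} + U x_t).  Sketch only.

        h_tilde (size n) and w_tilde (size n^2) are the two factors of the
        rank-one Jacobian estimate: outer(h_tilde, w_tilde) ~ dh_t/dvec(W).
        Returns the new state, the updated factors, and a stochastic estimate
        of dl_t/dvec(W) reshaped to the shape of W."""
        n = W.shape[0]
        h = np.tanh(W @ h_prev + U @ x)
        dphi = 1.0 - h ** 2

        nu = rng.choice([-1.0, 1.0], size=n)             # random signs, E[nu nu^T] = I
        fwd = (dphi[:, None] * W) @ h_tilde              # (dh_t/dh_{t-1}) h_tilde, forward mode
        # nu^T (dh_t/dvec(W)): the noise-projected immediate parameter Jacobian.
        proj = np.outer(nu * dphi, h_prev).flatten(order="F")

        # Greedy iterative rescaling coefficients (cf. Ollivier et al., 2015).
        rho0 = np.sqrt(np.linalg.norm(w_tilde) / (np.linalg.norm(fwd) + eps)) + eps
        rho1 = np.sqrt(np.linalg.norm(proj) / (np.linalg.norm(nu) + eps)) + eps

        h_tilde = rho0 * fwd + rho1 * nu
        w_tilde = w_tilde / rho0 + proj / rho1

        grad_W = (dl_dh @ h_tilde) * w_tilde             # scalar times vector of size n^2
        return h, h_tilde, w_tilde, grad_W.reshape(n, n, order="F")

Compared with the RTRL sketch above, the carried state is two vectors of sizes n and n² rather than an n × n² matrix, and each step costs roughly as much as the forward pass itself.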

5.2 Greedy Iterative Rescaling

This subsection explains gir and the role of the coefficients in Equation 5.1.

Whereas our above derivation of the algorithm introduced a temporal projection, Ollivier et al. (2015); Tallec and Ollivier (2018) interpret the algorithm given by Equation 5.1 as implementing a series of projections. Under this view, is a rank-one estimate of the rank-two matrix that is the sum of the forwarded previous Jacobian estimate and the approximate contribution :

The temporal “cross-terms” and , which are zero in expectation (but contribute variance), constitute the error introduced in the transition from time to . The coefficients and provide an extra degree of freedom with which we can minimize this error. As shown by Ollivier et al. (2015), the minimizers ensure the terms and their counterparts have small norm, so that their contribution to the variance is small as well.

The total (trace) variance of with respect to is given by the expected squared Frobenius norm of the error:

As the common sign does not affect the norm, this is simply

where denotes the Frobenius inner product.

The coefficients and affect the error through the single degree of freedom . By differentiation and use of the identity we find that the optimal choices satisfy

This includes the solution from Ollivier et al. (2015).

Examining their use in Equation 5.1 we can see that for this particular solution plays the important role of contracting , which would otherwise grow indefinitely (being a sum of independent random quantities). While division by in the recursion for causes an expansive effect, this is more than counteracted by the natural contractive property of the Jacobian (which is due to gradient vanishing in well-behaved rnns). Thus we can interpret the role of as distributing this contraction evenly between and , which limits the growth of both quantities and thus keeps the variance of their product under control. A formal treatment of the growth of variance over time is given by Massé (2017).
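For intuition, the scalar trade-off solved at each step has the following form (in our notation): if A and B denote the expected squared norms of the two cross-terms when scaled by ρ and 1/ρ respectively, then

    \rho^{\star} \;=\; \operatorname*{arg\,min}_{\rho > 0}\ \big(\rho^{2} A + \rho^{-2} B\big)
                 \;=\; \big(B / A\big)^{1/4},
    \qquad\text{so that}\qquad
    (\rho^{\star})^{2} A \;=\; (\rho^{\star})^{-2} B \;=\; \sqrt{A B}.

That is, the optimal coefficient balances the two error contributions so that each has the same magnitude, which is exactly the norm-balancing behavior described above.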

6 Variance Analysis

In this section we analyze the variance behavior of uoro-style algorithms. We first discuss limitations of the gir variance reduction scheme introduced in Section 5.2, namely that it is greedy (Section 6.1) and derives from a somewhat inappropriate objective (Section 6.2). We then generalize the algorithm and develop a more holistic theoretical framework for its analysis (Sections 6.3 through 6.5).

6.1 Greedy Iterative Rescaling is Greedy

In Section 5.2 we discussed how gir can be interpreted as minimizing the variance of a rank-one estimate of a rank-two matrix (which is a stochastic approximation that occurs at each step in uoro). Here we unify this sequence of approximations into a single temporal rank-one estimation (as introduced in Section 5.1), which helps us reveal the inherent limitations of gir.

Recall that the uoro recursions (Equation 5.1) maintain past contributions in the form of sums and , and at each step gir applies the respective scaling factors and to these sums. This gives rise to an overall scaling (and similarly ) of contributions made at time step and propagated forward through time step . We can write the estimates produced by uoro in terms of as follows:

Note that each such estimate is but one element in a sequence of estimates. In the next section, we will establish a notion of the variance for this sequence, so that we may speak meaningfully about its minimization. For now, we will consider the minimization of the variance of at each time step as an independent problem, with independent decision variables . The optimal coefficients given by (derived in Appendix B) minimize the variance of with respect to .

This solution is generally different from that of gir, which is constrained to have the form for (where is independent of ). This relationship between and breaks the independence of consecutive variance minimization problems, and therefore the resulting coefficients cannot in general be optimal for all .

We can see this by writing the optimal coefficients for that minimize the variance of in terms of the coefficients that minimize the variance of :

We see that in order to minimize the variance of given coefficients that minimize the variance of , we should divide each contribution by the square root of its contraction due to forward-propagation through , and multiply each by the same factor. Crucially, this factor depends on and therefore cannot be expressed by gir, which is constrained to rescale all past contributions by a constant factor independent of . This is true of any algorithm that maintains past contributions in a reduced form such as .

6.2 Greedy Iterative Rescaling Optimizes an Inappropriate Objective

In the previous subsection, we saw a sense in which gir is greedy: its ability to minimize the variance of is hampered by its own past decisions. To see this, we took a holistic view of the sequence of variance minimization problems solved by gir, and showed that the choice of coefficients at time constrains the choice of future coefficients. Here we take a further step back, and argue that the variance of is not the right objective in light of the downstream application of these estimates.

The Jacobian estimates are used to determine a sequence of gradient estimates , which are accumulated by a gradient descent process. We argue that the quantity of interest is the variance of the total gradient estimate incurred during steps of optimization (which estimates the total gradient ).

Since consecutive gradient contributions depend largely on the same stochastic quantities, the variance of this sum is not simply the sum of the individual variances. Hence even if we could independently minimize the variances of the Jacobian estimates, doing so is not equivalent to minimizing the variance of the total gradient estimate.
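Concretely, writing the per-step gradient estimates as ĝ_t (our notation) and interpreting variances and covariances as traces of the corresponding covariance blocks,

    \operatorname{Var}\Big(\sum_{t=1}^{T} \hat{g}_{t}\Big)
    \;=\; \sum_{t=1}^{T} \operatorname{Var}(\hat{g}_{t})
    \;+\; \sum_{s \neq t} \operatorname{Cov}(\hat{g}_{s}, \hat{g}_{t}),

and the covariance terms are generally nonzero because consecutive estimates are built from overlapping noise. Minimizing each Var(ĝ_t) in isolation therefore need not minimize the left-hand side, which is the quantity that actually matters to the optimizer.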

6.3 Generalized Recursions

Before proceeding with the variance computation we will generalize the uoro recursions by replacing the and coefficients by an invertible matrix as follows:

(6)

can be interpreted as modifying the covariance of the noise vector (although differently for either recursion). Analogously to the standard uoro recursions, our generalized recursions compute the following sums:

We can view as a matrix-valued generalization of the gir coefficients, with equivalence when . The extra degrees of freedom allow more fine-grained control over the norms of cross-terms (by "cross-term" we mean a term that appears in the expanded sum which is zero in expectation but contributes variance), as can be seen when we expand both the temporal and the spatial projections in the estimator :

Each term’s scaling depends not just on temporal indices but now also on the indices of units. As we shall see, in expectation, terms where both the temporal indices and units correspond remain unaffected, and it is only the undesired cross-terms for which or that are affected.

Tallec and Ollivier (2018) hint at a related approach which would correspond to choosing to be diagonal matrices. However, they derive their choice by optimizing the norms of only temporally corresponding terms for which , and ignoring temporal cross terms which make up the bulk of the error. We instead consider a class of matrices that is not constrained to be diagonal, and whose value minimizes a measure of variance that is more relevant to the optimization process.

Thus our recursion in Equation 6.3 is a strict generalization of the uoro recursion in Equation 5.1. The matrices can express a broad class of variance reduction mechanisms, including gir. That said, our analysis of this system will be limited to cases where the are independent of the noise vectors for all . Notably, this precludes gir because of its complex nonlinear interaction with the noise.
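One consistent way to realize such matrix-valued coefficients while preserving unbiasedness (a sketch under our own conventions; the paper's exact placement of the transpose and inverse may differ) is to inject Q ν on the state side and Q^{-T} ν on the parameter side, since E[(Qν)(Q^{-T}ν)^T] = Q E[νν^T] Q^{-1} = I. Taking Q = ρI recovers greedy iterative rescaling.

    import numpy as np

    def generalized_noise(nu, Q):
        """Matrix-valued generalization of the scalar GIR coefficients (sketch).
        The state-side noise becomes Q nu and the parameter-side noise becomes
        Q^{-T} nu, so their outer product is the identity in expectation."""
        return Q @ nu, np.linalg.solve(Q.T, nu)

    # Quick numerical check of the unbiasedness condition E[(Q nu)(Q^{-T} nu)^T] = I.
    rng = np.random.default_rng(0)
    n = 3
    Q = rng.normal(size=(n, n)) + n * np.eye(n)      # a well-conditioned invertible matrix
    acc = np.zeros((n, n))
    num_samples = 100_000
    for _ in range(num_samples):
        nu = rng.choice([-1.0, 1.0], size=n)
        a, b = generalized_noise(nu, Q)
        acc += np.outer(a, b)
    print(np.round(acc / num_samples, 2))            # approximately the identity matrix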

6.4 A Simple Expression for the Gradient Estimate

In this subsection we will derive a simple expression for the gradient estimate which will prove useful in our subsequent computations.

To reduce visual clutter we define the following notational aliases, which we will make heavy use of throughout the rest of the manuscript:

First, we observe that , as derivatives of past losses with respect to future activations are zero. Next we observe that

Given these observations we may express the estimate of each gradient contribution as

where in the last step we have:

  • consolidated the temporal and spatial projections by concatenating the into a single vector , and the noise vectors into a single vector ,

  • stacked the ’s into the matrix ,

  • defined to be the block-diagonal matrix , and

  • introduced the “truncated identity matrix” with diagonal blocks .

Finally, the total gradient estimate is given by

(7)

The matrix accounts for the fact that at time of the algorithm, contributions from future steps are not included in . Omitting this matrix would introduce terms that are zero in expectation and hence would not bias the total gradient estimate, but they would still contribute to the variance of the estimator (to a degree which would adversely affect the usefulness of our subsequent analysis).

It is easy to see that this estimator is unbiased as long as . This can happen, for example, when and are independent with . We will focus our analysis on this case.

6.5 Computing the Variance of the Total Gradient Estimate

In this section we derive the variance of the total gradient estimate. We assume that is independent of , so that we may use the general results from Appendix A.

By bilinearity, the covariance matrix of the total gradient estimate is

Combining this with the identity from the previous subsection and applying Corollary A (with ) yields the following expression for the same quantity:

Corollary A also yields the following expression for the total variance (we define the "total variance" to be the trace of the covariance matrix) of the total gradient estimate:

7 Variance Reduction

We now turn to the problem of reducing the variance given in Equation 6.5. In Sections 7.1 through 7.1.3 we develop an improved (though as yet impractical) variance reduction scheme. Finally, we evaluate our theory in Section 7.2.

7.1 Optimizing subject to restrictions on its form

Denote by the part of the total variance (Equation 6.5) that depends on . Making use of the cyclic property of the trace, and the fact that is block-diagonal, we can write this as

(8)

We wish to optimize with respect to in a way that leads to a practical online algorithm. To this end, we require that be of the form , with a scalar and a constant matrix. This restriction makes sense from a practical standpoint; we envision an algorithm that maintains a statistical estimate of the optimal value of . The stationarity assumption enables us to amortize over time both the sample complexity of obtaining this estimate, and the computational cost associated with inverting it.

We furthermore assume projection occurs in preactivation space, that is, . This assumption gives , which is a convenient algebraic structure to work with.

Even given this restricted form we cannot find the jointly optimal solution for and . Instead, we will consider optimizing while holding the ’s fixed, and vice versa.

7.1.1 Optimizing coefficients given

Let us first simplify the expression for . Given the restricted form we may write

(9)

where we have collected the factors that do not depend on into the matrix with elements

(10)

Now we wish to solve

(11)

The optimization problem considered here differs from that given in Section 6.1. Although the objective considered there can similarly be written in terms of a matrix like , that matrix would have rank one (see Appendix B). This difference is a consequence of being the variance of the total gradient estimate rather than that of a single contribution . In particular, the rank-one property is lost due to our inclusion of the