Efficient Optimization of Loops and Limits with Randomized Telescoping Sums

05/16/2019 · Alex Beatson, et al.

We consider optimization problems in which the objective requires an inner loop with many steps or is the limit of a sequence of increasingly costly approximations. Meta-learning, training recurrent neural networks, and optimization of the solutions to differential equations are all examples of optimization problems with this character. In such problems, it can be expensive to compute the objective function value and its gradient, but truncating the loop or using less accurate approximations can induce biases that damage the overall solution. We propose randomized telescope (RT) gradient estimators, which represent the objective as the sum of a telescoping series and sample linear combinations of terms to provide cheap unbiased gradient estimates. We identify conditions under which RT estimators achieve optimization convergence rates independent of the length of the loop or the required accuracy of the approximation. We also derive a method for tuning RT estimators online to maximize a lower bound on the expected decrease in loss per unit of computation. We evaluate our adaptive RT estimators on a range of applications including meta-optimization of learning rates, variational inference of ODE parameters, and training an LSTM to model long sequences.


1 Introduction

Many important optimization problems consist of objective functions that can only be computed iteratively or as the limit of an approximation. Machine learning and scientific computing provide many important examples. In meta-learning, evaluation of the objective typically requires the training of a model, a case of bi-level optimization. When training a model on sequential data or to make decisions over time, each learning step requires looping over time steps. More broadly, in many scientific and engineering applications one wishes to optimize an objective that is defined as the limit of a sequence of approximations, with both fidelity and computational cost increasing according to a natural number $n$. Inner-loop examples include: integration by Monte Carlo or quadrature with $n$ evaluation points; solving ordinary differential equations (ODEs) with an Euler or Runge-Kutta method with $n$ steps and $\mathcal{O}(1/n)$ step size; and solving partial differential equations (PDEs) with a finite element basis of size or order increasing with $n$.

Whether the task is fitting parameters to data, identifying the parameters of a natural system, or optimizing the design of a mechanical part, in this work we seek to more rapidly solve problems in which the objective function demands a tradeoff between computational cost and accuracy. We formalize this by considering parameters $\theta$ and a loss function $\mathcal{L}(\theta)$ that is the uniform limit of a sequence $\mathcal{L}_n(\theta)$:

$$\mathcal{L}(\theta) = \lim_{n \to H} \mathcal{L}_n(\theta) \qquad (1)$$

Some problems may involve a finite horizon $H$; in other cases $H = \infty$. We also introduce a cost function $C(n)$ that is nondecreasing in $n$ to represent the cost of computing $\mathcal{L}_n$ and its gradient.

A principal challenge of optimization problems of the form in Eq. 1 is selecting a finite $n$ such that the minimum of the surrogate $\mathcal{L}_n$ is close to that of $\mathcal{L}$, but without $\mathcal{L}_n$ (or its gradients) being too expensive. Choosing a large $n$ can be computationally prohibitive, while choosing a small $n$ may bias optimization. Meta-optimizing learning rates with truncated horizons can select hyperparameters that are wrong by orders of magnitude (Wu et al., 2018). Truncating backpropagation through time for recurrent neural networks (RNNs) favors short-term dependencies (Tallec & Ollivier, 2017). Using too coarse a discretization to solve an ODE or PDE can cause error in the solution and bias outer-loop optimization. These optimization problems thus experience a sharp trade-off between efficient computation and bias.

We propose randomized telescope (RT) gradient estimators, which provide cheap unbiased gradient estimates to allow efficient optimization of these objectives. RT estimators represent the objective or its gradients as a telescoping series of differences between intermediate values, and draw weighted samples from this series to maintain unbiasedness while balancing variance and expected computation.

The paper proceeds as follows. Section 2 introduces RT estimators and their history. Section 3 formalizes RT estimators for optimization, and discusses related work in optimization. Section 4 discusses conditions for finite variance and computation, and proves RT estimators can achieve optimization guarantees for loops and limits. Section 5 discusses designing RT estimators by maximizing a bound on expected improvement per unit of computation. Section 6 describes practical considerations for adapting RT estimators online. Section 7 presents experimental results. Section 8 discusses limitations and future work. Appendix A presents algorithm pseudocode. Appendix B presents proofs. Code may be found at https://github.com/PrincetonLIPS/randomized_telescopes.

2 Unbiased randomized truncation

In this section, we discuss the general problem of estimating limits through randomized truncation. The first subsection presents the randomized telescope family of unbiased estimators, while the second subsection describes their history (dating back to von Neumann and Ulam). In the following sections, we will describe how this technique can be used to provide cheap unbiased gradient estimates and accelerate optimization for many problems.

2.1 Randomized telescope estimators

Consider estimating a quantity $Y_H = \lim_{n \to H} Y_n$, where $H \in \mathbb{N} \cup \{\infty\}$. Assume that we can compute $Y_n$ for any finite $n$, but that since the cost is nondecreasing in $n$ there is a point at which this becomes impractical. Rather than truncating at some fixed value short of the limit, we may find it useful to construct an unbiased estimator of $Y_H$ and take on some randomness in return for reduced computational cost.

Define the backward difference $\Delta_n = Y_n - Y_{n-1}$, with $Y_0 := 0$, and represent the quantity of interest with a telescoping series:

$$Y_H = \sum_{n=1}^{H} \Delta_n.$$

We may sample from this telescoping series to provide unbiased estimates of $Y_H$, introducing variance to our estimator in exchange for reducing expected computation. We use the name randomized telescope (RT) to refer to the family of estimators indexed by a distribution $q$ over the integers $1, \dots, H$ (for example, a geometric distribution) and a weight function $W(n, N)$:

$$\hat{Y}_H = \sum_{n=1}^{N} W(n, N)\, \Delta_n, \qquad N \sim q. \qquad (2)$$
Proposition 2.1.

Unbiasedness of RT estimators. The RT estimators in (2) are unbiased estimators of $Y_H$ as long as

$$\mathbb{E}_{N \sim q}\big[W(n, N)\, \mathbb{1}\{N \ge n\}\big] = \sum_{N'=n}^{H} q(N')\, W(n, N') = 1 \qquad \forall\, n \le H.$$

See Appendix B for a short proof. Although we are coining the term “randomized telescope” to refer to the family of estimators with the form of Eq. 2, the underlying trick has a long history, discussed in the next section. The literature we are aware of focuses on one or both of two special cases of Eq. 2, defined by the choice of weight function $W$. We will also focus on these two variants of RT estimators, but we observe that there is a larger family.

Most related work uses the “Russian roulette” estimator originally discovered and named by von Neumann and Ulam (Kahn, 1955), which we term RT-RR and which has the form

$$\hat{Y}_H = \sum_{n=1}^{N} \frac{\Delta_n}{1 - Q(n)}, \qquad Q(n) = \sum_{n'=1}^{n-1} q(n'), \qquad N \sim q. \qquad (3)$$

It can be seen as summing the iterates $\Delta_n$ while flipping a biased coin at each iterate. With probability $\frac{q(n)}{1 - Q(n)}$, the series is truncated at term $n$. With the remaining probability, the process continues, and all future terms are upweighted by the inverse of that continuation probability to maintain unbiasedness.

The other important special case of Eq. 2 is the “single sample” estimator RT-SS, referred to as “single term weighted truncation” in Lyne et al. (2015). RT-SS takes

$$\hat{Y}_H = \frac{\Delta_N}{q(N)}, \qquad N \sim q. \qquad (4)$$

This is directly importance sampling the differences $\Delta_n$.

We will later prove conditions under which RT-SS and RT-RR should be preferred. Of all estimators in the form of Eq. 2 which obey Proposition 2.1 with $q(n) > 0$ for all $n \le H$, RT-SS minimizes the variance across worst-case covariances $\mathrm{Cov}(\Delta_i, \Delta_j)$. Within the same family, RT-RR achieves minimum variance when $\Delta_i$ and $\Delta_j$ are independent for all $i \neq j$.
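To make the two special cases concrete, the following is a minimal sketch (not the paper's released code) of RT-SS and RT-RR for a scalar sequence; the example sequence and the geometric choice of $q$ are illustrative assumptions.

```python
import numpy as np

def rt_ss(Y, q, rng):
    """RT-SS: draw N ~ q and return Delta_N / q(N) (Eq. 4)."""
    N = rng.choice(len(Y), p=q)                 # 0-indexed truncation point
    delta = Y[N] - (Y[N - 1] if N > 0 else 0.0)
    return delta / q[N]

def rt_rr(Y, q, rng):
    """RT-RR: draw N ~ q and return sum_{n<=N} Delta_n / P(N >= n) (Eq. 3)."""
    N = rng.choice(len(Y), p=q)
    survival = 1.0 - np.concatenate(([0.0], np.cumsum(q)[:-1]))  # P(N >= n)
    prev, total = 0.0, 0.0
    for n in range(N + 1):
        total += (Y[n] - prev) / survival[n]
        prev = Y[n]
    return total

rng = np.random.default_rng(0)
Y = 2.0 * (1.0 - 0.5 ** np.arange(1, 21))   # Y_n -> 2 geometrically
q = 0.5 ** np.arange(1, 21)
q /= q.sum()                                 # tails match the convergence rate
print(np.mean([rt_ss(Y, q, rng) for _ in range(100_000)]))  # approx 2.0
```

Despite usually evaluating only the first few terms, both estimators match the limit in expectation; RT-RR reuses every computed term, which is why it wins when the differences are uncorrelated.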

2.2 A brief history of unbiased randomized truncation

The essential trick—unbiased estimation of a quantity via randomized truncation of a series—dates back to unpublished work from John von Neumann and Stanislaw Ulam. They are credited with using it to develop a Monte Carlo method for matrix inversion in Forsythe & Leibler (1950), and a method for particle diffusion in Kahn (1955).

It has been applied and rediscovered in a number of fields and applications. The early work from von Neumann and Ulam led to its use in computational physics, in neutron transport problems (Spanier & Gelbard, 1969), for studying lattice fermions (Kuti, 1982), and to estimate functional integrals (Wagner, 1987). In computer graphics, Arvo & Kirk (1990) introduced its use for ray tracing; it is now widely used in rendering software. In statistical estimation, it has been used for estimation of derivatives (Rychlik, 1990), unbiased kernel density estimation (Rychlik, 1995), doubly-intractable Bayesian posterior distributions (Girolami et al., 2013; Lyne et al., 2015; Wei & Murray, 2016), and unbiased Markov chain Monte Carlo (Jacob et al., 2017).

The underlying trick has been rediscovered by Fearnhead et al. (2008) for unbiased estimation in particle filtering, by McLeish (2010) for debiasing Monte Carlo estimates, by Rhee & Glynn (2012, 2015) for unbiased estimation in stochastic differential equations, and by Tallec & Ollivier (2017) to debias truncated backpropagation. The latter work also uses RT estimators for optimization; however, it considers only fixed “Russian roulette”-style randomized telescope estimators, and does not consider convergence rates or how to adapt the estimator online (our main contributions).

3 Optimizing loops and limits

In this paper, we consider optimizing functions defined as limits. Consider a problem where, given parameters $\theta$, we can obtain a series of approximate losses $\mathcal{L}_n(\theta)$ which converges uniformly to some limit $\mathcal{L}(\theta) = \lim_{n \to H} \mathcal{L}_n(\theta)$, for $n \le H$ and $H \in \mathbb{N} \cup \{\infty\}$. We assume the sequence of gradients with respect to $\theta$, denoted $G_n(\theta) = \nabla_\theta \mathcal{L}_n(\theta)$, converges uniformly to a limit $G(\theta)$. Under this uniform convergence, and assuming convergence of $\mathcal{L}_n$, we have $G(\theta) = \nabla_\theta \mathcal{L}(\theta)$ (see Theorem 7.17 in Rudin (1976)), and so $G$ is indeed the gradient of our objective $\mathcal{L}$. We assume there is a computational cost $C(n)$ associated with evaluating $\mathcal{L}_n$ or $G_n$, nondecreasing in $n$, and we wish to efficiently minimize $\mathcal{L}$ with respect to $\theta$. Loops are an important special case of this framework, where $\mathcal{L}_n$ is the loss resulting from running e.g. a training loop or RNN for some number of steps increasing in $n$.

3.1 Randomized telescopes for optimization

We propose using randomized telescopes as a stochastic gradient estimator for such optimization problems. We aim to accelerate optimization much as mini-batch stochastic gradient descent accelerates optimization for large datasets: using Monte Carlo sampling to decrease the expected cost of each optimization step, at the price of increasing variance in the gradient estimates, without introducing bias.

Consider the gradient $G(\theta) = \lim_{n \to H} G_n(\theta)$, and the backward difference $\Delta_n(\theta) = G_n(\theta) - G_{n-1}(\theta)$, where $G_0(\theta) := 0$, so that $G(\theta) = \sum_{n=1}^{H} \Delta_n(\theta)$. We use the randomized telescope estimator

$$\hat{G}(\theta) = \sum_{n=1}^{N} W(n, N)\, \Delta_n(\theta), \qquad (5)$$

where $N$ is drawn according to a proposal distribution $q$, and together $q$ and $W$ satisfy Proposition 2.1.

Note that due to linearity of differentiation, and letting $\delta_n(\theta) = \mathcal{L}_n(\theta) - \mathcal{L}_{n-1}(\theta)$, we have

$$\sum_{n=1}^{N} W(n, N)\, \Delta_n(\theta) = \nabla_\theta \sum_{n=1}^{N} W(n, N)\, \delta_n(\theta).$$

Thus, when the computation of $\mathcal{L}_n$ can reuse most of the computation performed for $\mathcal{L}_{n-1}$, we can evaluate $\hat{G}$ via forward or backward automatic differentiation with cost approximately equal to computing $\mathcal{L}_N$, i.e., $\hat{G}$ has computation cost $C(N)$. This most often occurs when evaluating $\mathcal{L}_n$ involves an inner loop with a step size which does not change with $n$, e.g., meta-learning and training RNNs, but not solving ODEs or PDEs. When computing $\mathcal{L}_n$ does not reuse computation, evaluating $\hat{G}$ has computation cost which may be as large as $\sum_{n=1}^{N} C(n)$.
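As a concrete, hypothetical illustration, the sketch below applies RT-SS to the gradient of a toy unrolled inner loop using PyTorch autodiff. For clarity it recomputes each truncation rather than reusing computation, and the inner objective, step counts, and $q$ are all illustrative choices rather than anything from the paper.

```python
import torch

def inner_loss(theta, n_steps):
    """L_n(theta): outer loss of the inner-loop state after n_steps updates."""
    x = torch.zeros_like(theta)
    for _ in range(n_steps):
        x = x - 0.1 * (x - theta)          # toy inner update depending on theta
    return ((x - 3.0) ** 2).sum()

def rt_ss_grad(theta, steps, q):
    """Unbiased RT-SS estimate of grad L_{steps[-1]} (Eq. 5 with single sample)."""
    i = torch.multinomial(q, 1).item()
    delta = torch.autograd.grad(inner_loss(theta, steps[i]), theta)[0]
    if i > 0:
        g_lo = torch.autograd.grad(inner_loss(theta, steps[i - 1]), theta)[0]
        delta = delta - g_lo
    return delta / q[i]

theta = torch.tensor([0.0, 0.0], requires_grad=True)
steps = [1, 2, 4, 8, 16]                   # doubling truncations (cf. Sec. 6.2)
q = torch.tensor([0.5 ** k for k in range(len(steps))])
q = q / q.sum()
print(rt_ss_grad(theta, steps, q))         # cheap unbiased estimate of grad L_16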

3.2 Related work in optimization

Gradient-based bi-level optimization has seen extensive work in the literature. See Jameson (1988) for an early example of optimizing implicit functions, Christianson (1998) for a mathematical treatment, and Maclaurin et al. (2015) and Franceschi et al. (2017) for recent treatments in machine learning. Shaban et al. (2018) propose truncating only the backward pass, by backpropagating through only the final few optimization steps, to reduce memory requirements. Metz et al. (2018) propose linearly increasing the number of inner steps over the course of the outer optimization.

An important case of bi-level optimization is optimization of architectures and hyperparameters. Truncation causes bias, as shown by Wu et al. (2018) for learning rates and by Metz et al. (2018) for neural optimizers.

Bi-level optimization is also used for meta-learning across related tasks (Schmidhuber, 1987; Bengio et al., 1992). Ravi & Larochelle (2016) train an initialization and optimizer, and Finn et al. (2017) only an initialization, to minimize validation loss. The latter paper shows increasing performance with the number of steps used in the inner optimization. However, in practice the number of inner loop steps must be kept small to allow training over many tasks.

Bi-level optimization can be accelerated by amortization. Variational inference can be seen as bi-level optimization; variational autoencoders (Kingma & Welling, 2014) amortize the inner optimization with a predictive model of the solution to the inner objective. Recent work such as Brock et al. (2018) and Lorraine & Duvenaud (2018) amortizes hyperparameter optimization in a similar fashion.

However, amortizing the inner loop induces bias. Cremer et al. (2018) demonstrate this in VAEs, while Kim et al. (2018) show that in VAEs, combining amortization with truncation by taking several gradient steps on the output of the encoder can reduce this bias. These techniques are thus orthogonal to our contributions: while fully amortizing the inner optimization causes bias, predictive models of the limit can accelerate convergence of $\mathcal{L}_n$ to $\mathcal{L}$.

Our work is also related to work on training sequence models. Tallec & Ollivier (2017) use the Russian roulette estimator to debias truncated backpropagation through time. They use a fixed, geometrically decaying $q(N)$, and show that this improves validation loss on Penn Treebank. They do not consider the efficiency of optimization, or methods to automatically set or adapt the hyperparameters of the randomized telescope. Trinh et al. (2018) learn long-term dependencies with auxiliary losses. Other work accelerates optimization of sequence models by replacing recurrent models with models which use convolution or attention (Vaswani et al., 2017), which can be trained more efficiently.

4 Convergence rates with fixed RT estimators

Before considering more complex large-scale problems, we examine the simple RT estimator for stochastic gradient descent on convex problems. We assume that the sequence $G_n$ and the units of computation are chosen such that $C(n) = n$. We study RT-SS, with $q$ fixed a priori. We consider optimizing parameters $\theta \in \Theta$, where $\Theta$ is a bounded, convex and compact set with diameter bounded by $D$. We assume each $\mathcal{L}_n$ is convex in $\theta$, and the gradients $G_n$ converge to $G$ according to $\|G_n(\theta) - G(\theta)\| \le \psi_n$, where $\psi_n$ converges polynomially or geometrically to zero. The quantity of interest is the instantaneous regret, $R_t = \mathcal{L}(\theta_t) - \min_{\theta \in \Theta} \mathcal{L}(\theta)$, where $\theta_t$ is the parameter after $t$ steps of SGD.

In this setting, any fixed truncation scheme using $\mathcal{L}_n$ as a surrogate for $\mathcal{L}$, with fixed $n < H$, cannot achieve $\lim_{t \to \infty} R_t = 0$. Meanwhile, the fully unrolled estimator has computational cost which scales with $H$. In the many situations where $H = \infty$, it is impossible to take even a single gradient step with this estimator.

The randomized telescope estimator overcomes these drawbacks by exploiting the fact that $G_n$ converges according to $\psi_n$. As long as $q$ is chosen to have tails no lighter than the rate of convergence of $\psi_n$, for sufficiently fast convergence, the resulting RT-SS gradient estimator achieves asymptotic regret bounds invariant to $H$ in terms of convergence rate.

All proofs are deferred to Appendix B. We begin by proving bounds on the variance and expected computation for polynomially decaying $\psi_n$ and correspondingly heavy-tailed $q$.

Theorem 4.1.

Bounded variance and compute with polynomial convergence of $\psi$. Assume $\psi_n$ converges according to $\psi_n \le c\, n^{-p}$ or faster, for constants $c > 0$ and $p > 0$. Choose the RT-SS estimator with $q(n) \propto n^{-p}$. The resulting estimator $\hat{G}$ achieves expected compute $\mathbb{E}[C] = H_H^{(p-1)} / H_H^{(p)}$, where $H_H^{(p)}$ is the $H$th generalized harmonic number of order $p$, and expected squared norm $\mathbb{E}\|\hat{G}\|^2 = \mathcal{O}\big((H_H^{(p)})^2\big)$. The limit $\lim_{H \to \infty} H_H^{(p)}$ is finite iff $p > 1$, in which case it is given by the Riemann zeta function, $\lim_{H \to \infty} H_H^{(p)} = \zeta(p)$. Accordingly, the estimator achieves horizon-agnostic variance and expected compute bounds iff $p > 2$.

The corresponding bounds for geometrically decaying $\psi_n$ and $q$ follow.

Theorem 4.2.

Bounded variance and compute with geometric convergence of $\psi$. Assume $\psi_n$ converges according to $\psi_n \le c\, g^n$, or faster, for some $0 < g < 1$ and constant $c > 0$. Choose RT-SS with $q(n) \propto g^n$. The resulting estimator $\hat{G}$ achieves expected compute $\mathbb{E}[C] \le \frac{1}{1-g}$ and expected squared norm $\mathbb{E}\|\hat{G}\|^2 = \mathcal{O}\big(\frac{c^2}{(1-g)^2}\big)$. Thus, the estimator achieves horizon-agnostic variance and expected compute bounds for all $0 < g < 1$.
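A brief worked instance, using the bounds exactly as stated above: with geometric convergence at rate $g = 0.9$ and $q(n) \propto 0.9^n$,

$$\mathbb{E}[C] \le \frac{1}{1 - 0.9} = 10, \qquad \mathbb{E}\|\hat{G}\|^2 = \mathcal{O}\!\left(\frac{c^2}{(1 - 0.9)^2}\right) = \mathcal{O}(100\, c^2),$$

so the expected compute per gradient estimate and the variance bound are both constants, whether the loop has a hundred steps or a million.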

Given a setting and estimator $\hat{G}$ from either Theorem 4.1 or 4.2, with corresponding expected compute cost $\mathbb{E}[C]$ and upper bound $\tilde{G}^2$ on the expected squared norm, the following theorem considers regret guarantees when using this estimator to perform stochastic gradient descent.

Theorem 4.3.

Asymptotic regret bounds for optimizing infinite-horizon programs. Assume the setting from Theorem 4.1 or 4.2, and the corresponding $\mathbb{E}[C]$ and $\tilde{G}$ from those theorems. Let $R_t$ be the instantaneous regret at the $t$th step of optimization, $R_t = \mathcal{L}(\theta_t) - \min_{\theta \in \Theta} \mathcal{L}(\theta)$. Let $T(B)$ be the greatest $t$ such that a computational budget $B$ is not exceeded. Use online gradient descent with step size $\eta_t = \frac{D}{\tilde{G}\sqrt{t}}$. As $B \to \infty$, the asymptotic instantaneous regret is bounded by $R_{T(B)} \le \mathcal{O}\big(\tilde{G} D \sqrt{\mathbb{E}[C]/B}\big)$, independent of $H$.

Theorem 4.3 indicates that if $\mathcal{L}_n$ converges sufficiently fast and each $\mathcal{L}_n$ is convex, the RT estimator provably optimizes the limit $\mathcal{L}$.

5 Adaptive RT estimators

In practice, the estimator considered in the previous section may have high variance. This section develops an objective for designing adaptive estimators, and derives closed-form $q$ and $W$ which maximize this objective given estimates of the expected squared differences $\mathbb{E}\|\Delta_n\|^2$ and assumptions on the correlation structure of the $\Delta_n$.

5.1 Choosing between unbiased estimators

We propose choosing an estimator which achieves the best lower bound on the expected improvement per compute unit spent, given smoothness assumptions on the loss. Our analysis builds on that of Balles et al. (2016): they adaptively choose a batch size using batch covariance information, while we choose between arbitrary unbiased gradient estimators using knowledge of those estimators' expected squared norms and computation costs.

Here we assume that the true gradient of the objective, $\underline{G}(\theta) = \mathbb{E}[G(\theta)]$ (written $\underline{G}$ for compactness of notation), is smooth in $\theta$. We do not assume convexity. Note that $\underline{G}$ is not necessarily equal to $G$, as the loss $\mathcal{L}$ and its gradient $G$ may be random variables due to sampling of data and/or latent variables.

We assume that $\mathcal{L}$ is $M$-smooth (the gradients of $\mathcal{L}$ are $M$-Lipschitz), i.e., there exists a constant $M$ such that $\|\underline{G}(\theta) - \underline{G}(\theta')\| \le M \|\theta - \theta'\|$ for all $\theta, \theta'$. It follows (Balles et al., 2016; Bottou et al., 2018) that, when performing SGD with learning rate $\eta$ and an unbiased stochastic gradient estimator $\hat{G}$,

$$\mathbb{E}[\mathcal{L}(\theta_{t+1})] - \mathcal{L}(\theta_t) \le -\eta\, \underline{G}(\theta_t)^{\top} \mathbb{E}[\hat{G}(\theta_t)] + \frac{\eta^2 M}{2}\, \mathbb{E}\|\hat{G}(\theta_t)\|^2 \qquad (6)$$

Unbiasedness of $\hat{G}$ implies $\mathbb{E}[\hat{G}(\theta_t)] = \underline{G}(\theta_t)$, thus:

$$\mathbb{E}[\mathcal{L}(\theta_{t+1})] - \mathcal{L}(\theta_t) \le -\eta \|\underline{G}(\theta_t)\|^2 + \frac{\eta^2 M}{2}\, \mathbb{E}\|\hat{G}(\theta_t)\|^2 = -J(\eta) \qquad (7)$$

Above, $J(\eta) = \eta \|\underline{G}\|^2 - \frac{\eta^2 M}{2}\, \mathbb{E}\|\hat{G}\|^2$ is a lower bound on the expected improvement in the loss from one optimization step. Given a fixed choice of $\hat{G}$, how should one pick the learning rate $\eta$ to maximize $J$, and what is the corresponding lower bound on expected improvement?

Optimizing $J$ by finding $\eta^*$ s.t. $\frac{\partial J}{\partial \eta} = 0$ yields

$$\eta^* = \frac{\|\underline{G}\|^2}{M\, \mathbb{E}\|\hat{G}\|^2} \qquad (8)$$

$$J(\eta^*) = \frac{\|\underline{G}\|^4}{2 M\, \mathbb{E}\|\hat{G}\|^2} \qquad (9)$$

This tells us how to choose $\eta$ if we know $\|\underline{G}\|$, $M$, and $\mathbb{E}\|\hat{G}\|^2$. In practice, it is unlikely that we know $M$ or even $\|\underline{G}\|$. We instead assume we have access to some “reference” learning rate $\eta_0$, which has been optimized for use with a “reference” gradient estimator $\hat{G}_0$ with known expected squared norm $\mathbb{E}\|\hat{G}_0\|^2$. When using RT estimators, we may have access to learning rates which have been optimized for use with the un-truncated estimator. Even when we do not know an optimal reference learning rate, this construction means we need only tune one hyperparameter (the reference learning rate), and can still choose between a family of gradient estimators online. Instead of directly maximizing $J$, we choose $\eta$ for an estimator $\hat{G}$ by maximizing improvement relative to the reference estimator in terms of $J$, the lower bound on expected improvement.

Assume that $\eta_0$ has been set optimally for the problem and reference estimator $\hat{G}_0$ up to some constant $\kappa > 0$, i.e.,

$$\eta_0 = \kappa\, \frac{\|\underline{G}\|^2}{M\, \mathbb{E}\|\hat{G}_0\|^2} \qquad (10)$$

Then the expected improvement $J_0$ obtained by the reference estimator $\hat{G}_0$ is:

$$J_0 = \left(\kappa - \frac{\kappa^2}{2}\right) \frac{\|\underline{G}\|^4}{M\, \mathbb{E}\|\hat{G}_0\|^2} \qquad (11)$$

We assume that $\kappa < 2$, such that $J_0$ is positive and the reference has guaranteed expected improvement. Now set the learning rate according to

$$\eta = \eta_0\, \frac{\mathbb{E}\|\hat{G}_0\|^2}{\mathbb{E}\|\hat{G}\|^2} \qquad (12)$$

It follows that the expected improvement $J$ obtained by the estimator $\hat{G}$ is

$$J = \left(\kappa - \frac{\kappa^2}{2}\right) \frac{\|\underline{G}\|^4}{M\, \mathbb{E}\|\hat{G}\|^2} = J_0\, \frac{\mathbb{E}\|\hat{G}_0\|^2}{\mathbb{E}\|\hat{G}\|^2} \qquad (13)$$

Let the expected computation cost of evaluating $\hat{G}$ be $\mathbb{E}[C(\hat{G})]$. We want to maximize the expected improvement per unit of computation, $J / \mathbb{E}[C(\hat{G})]$. If we use the above method to choose $\eta$, we have $J / \mathbb{E}[C(\hat{G})] \propto \frac{1}{\mathbb{E}[C(\hat{G})]\, \mathbb{E}\|\hat{G}\|^2}$. We call this quantity the relative optimization efficiency, or ROE. We decide between gradient estimators $\hat{G}$ by choosing the one which maximizes the ROE. Once an estimator is chosen, we choose a learning rate according to (12) relative to the reference learning rate $\eta_0$ and estimator $\hat{G}_0$.
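A minimal sketch of this selection rule, assuming the expected compute and squared norms of each candidate estimator have already been estimated (the function and variable names here are illustrative):

```python
def roe(expected_compute, expected_sq_norm):
    """Relative optimization efficiency: 1 / (E[C] * E||G_hat||^2)."""
    return 1.0 / (expected_compute * expected_sq_norm)

def choose_estimator(candidates, ref_lr, ref_sq_norm):
    """candidates: list of (name, expected_compute, expected_sq_norm)."""
    best = max(candidates, key=lambda c: roe(c[1], c[2]))
    name, _, sq_norm = best
    lr = ref_lr * ref_sq_norm / sq_norm   # Eq. (12): rescale the reference lr
    return name, lr

# Example: un-truncated reference vs. an RT estimator with 4x the squared
# norm but one tenth the expected compute.
candidates = [("full", 100.0, 1.0), ("rt_ss", 10.0, 4.0)]
print(choose_estimator(candidates, ref_lr=0.1, ref_sq_norm=1.0))
# -> ('rt_ss', 0.025): higher ROE, learning rate shrunk to absorb the variance
```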

5.2 Optimal weighted sampling for RT estimators

Now that we have an objective, we can consider designing RT estimators which optimize the ROE. For the classes of single sample and Russian roulette estimators, we prove conditions under which that class maximizes the ROE across an arbitrary choice of RT estimators. We also derive closed-form expressions for the optimal sampling distribution  for each class, under the conditions where that class is optimal.

We assume that computation can be reused and evaluating $\Delta_n$ has computation cost $C(n)$. As described in Section 3.1, this is approximately true for many objectives. When it is not, the cost of computing $\Delta_n$ is $C(n) + C(n-1)$. This would penalize the ROE of dense weight functions $W$ and favor sparse $W$, possibly impacting the optimality conditions for RT-RR. We mitigate this inaccuracy by subsequence selection (described in the following subsection), which allows construction of sparse sampling strategies.

We begin by showing that the RT-SS estimator is optimal with regard to worst-case correlations between the $\Delta_n$, and deriving the corresponding optimal $q$.

Theorem 5.1.

Optimality of RT-SS under adversarial correlation. Consider the family of estimators presented in Equation 2, and assume the $\Delta_n$ are univariate. For any fixed sampling distribution $q$, the single-sample RT estimator RT-SS minimizes the worst-case variance of $\hat{G}$ across an adversarial choice of covariances $\mathrm{Cov}(\Delta_i, \Delta_j)$.

Theorem 5.2.

Optimal $q$ under adversarial correlation. Consider the family of estimators presented in Equation 2, and assume the covariance matrices $\mathrm{Cov}(\Delta_i, \Delta_j)$ are diagonal. The RT-SS estimator with $q(n) \propto \sqrt{\mathbb{E}\|\Delta_n\|^2 / C(n)}$ maximizes the ROE across an adversarial choice of diagonal covariance matrices $\mathrm{Cov}(\Delta_i, \Delta_j)$.

We next show that the RT-RR estimator is optimal when $\mathrm{Cov}(\Delta_i, \Delta_j)$ is diagonal and $\Delta_i$ and $\Delta_j$ are independent for $i \neq j$, and derive the corresponding optimal $q$.

Theorem 5.3.

Optimality of RT-RR under independence. Consider the family of estimators presented in Eq. 2, and assume the $\Delta_n$ are univariate. When the $\Delta_n$ are uncorrelated, for any sampling distribution $q$, the Russian roulette estimator achieves the minimum variance in this family and thus maximizes the optimization efficiency lower bound.

Theorem 5.4.

Optimal $q$ under independence. Consider the family of estimators presented in Equation 2. Assume $\mathrm{Cov}(\Delta_i, \Delta_j)$ is diagonal and $\Delta_i$ and $\Delta_j$ are independent for $i \neq j$. The RT-RR estimator with $Q(n) \propto \sqrt{\mathbb{E}\|\Delta_n\|^2 / (C(n) - C(n-1))}$, where $Q(n) = P(N \ge n)$, maximizes the ROE.
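The closed forms above are easy to compute from estimated squared norms and costs. Below is a hedged sketch; the monotone clipping applied to the RT-RR survival probabilities is an implementation assumption, added so they form a valid nonincreasing sequence starting at 1.

```python
import numpy as np

def rt_ss_q(a, c):
    """RT-SS: q(n) proportional to sqrt(a_n / c_n), per Theorem 5.2.
    a[n] ~= E||Delta_n||^2, c[n] = C(n)."""
    q = np.sqrt(a / c)
    return q / q.sum()

def rt_rr_survival(a, c):
    """RT-RR: P(N >= n) proportional to sqrt(a_n / (c_n - c_{n-1})),
    per Theorem 5.4, clipped to be nonincreasing with first element 1."""
    dc = np.diff(c, prepend=0.0)
    s = np.sqrt(a / dc)
    return np.minimum.accumulate(np.minimum(s / s[0], 1.0))

a = np.array([1.0, 0.25, 0.06, 0.015])   # geometrically decaying differences
c = np.array([1.0, 2.0, 4.0, 8.0])       # doubling compute costs
print(rt_ss_q(a, c))          # concentrates mass on cheap, high-variance terms
print(rt_rr_survival(a, c))   # survival curve: always evaluates n = 1
```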

5.3 Subsequence selection

The scheme for designing RT estimators given in the previous subsection contains assumptions which will often not hold in practice. To partially alleviate these concerns, we can design the sequence of iterates over which we apply the RT estimator to maximize the ROE.

Some sequences may result in more efficient estimators than others, depending on how the intermediate iterates correlate with the limit. The variance of the estimator will be reduced, and the ROE increased, if we choose a sequence whose successive iterates are strongly positively correlated, so that the differences $\Delta_n$ are small.

We begin with a reference sequence of gradient iterates $G_1, \dots, G_H$, with cost function $C$, where $G_H = G$ and where $G_n$ has cost $C(n)$. We assume knowledge of the expected squared distances $\mathbb{E}\|G_i - G_j\|^2$. We aim to find a subsequence $S \in \mathcal{S}$, where $\mathcal{S}$ is the set of subsequences over the integers $1, \dots, H$ which have final element $H$. Given $S$, we take the new iterates $G'_n = G_{S_n}$, differences $\Delta'_n = G'_n - G'_{n-1}$, costs $C'(n) = C(S_n)$, and horizon $H' = |S|$, where $G'_{H'} = G_H = G$.

In practice, we greedily construct $S$ by adding indexes to the subsequence or removing indexes from it. As this step requires minimal computation, we perform both greedy adding and greedy removal and return the $S$ with the best ROE, as sketched below. The minimal subsequence $S = (H)$ is always considered, allowing RT estimators to fall back on the original full-horizon estimator.
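A compact sketch of the greedy-add half of this procedure (mirroring Algorithm 3 in Appendix A); `sequence_cost` is assumed to return the product of expected compute and expected squared norm, as in Algorithm 5, so minimizing it maximizes the ROE:

```python
def greedy_add(H, sequence_cost):
    """Grow a subsequence ending at H, accepting any index that lowers
    expected_compute * expected_squared_norm."""
    S = [H]                                # minimal subsequence: just the limit
    best = sequence_cost(S)
    improved = True
    while improved:
        improved = False
        for i in range(1, H):
            if i in S:
                continue
            trial = sorted(S + [i])
            cost = sequence_cost(trial)
            if cost < best:
                S, best = trial, cost
                improved = True
                break                      # restart the scan after each change
    return S, best
```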

6 Practical implementation

6.1 Tuning the estimator

We estimate the expected squared distances $\mathbb{E}\|G_i - G_j\|^2$ by maintaining exponential moving averages. We keep track of the computational budget used so far by the RT estimator, and “tune” the estimator every $t_f \cdot C(H)$ units of computation, where $C(H)$ is the compute required to evaluate $G_H$ and $t_f$ is a “tuning frequency” hyperparameter. During tuning, the gradients at each truncation in the base subsequence are computed, the squared norms of their differences are computed, and the exponential moving averages are updated. At the end of tuning, the estimator is updated using the expected squared norms; i.e., a subsequence is selected, $q$ and $W$ are set according to Section 5.2 (with the choice of RT-RR or RT-SS left as a hyperparameter), and the learning rate is adapted according to Section 5.1.
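A minimal sketch of the bookkeeping this implies; the EMA weight and dictionary layout are assumptions, not the paper's implementation:

```python
import numpy as np

class SquaredDistanceTracker:
    """Exponential moving averages of ||G_i - G_j||^2 across truncations."""

    def __init__(self, alpha=0.9):
        self.alpha = alpha
        self.ema = {}                          # (i, j) -> EMA of ||G_i - G_j||^2

    def update(self, grads):
        """grads: dict mapping truncation index -> gradient vector."""
        for i in grads:
            for j in grads:
                if i < j:
                    d = float(np.sum((grads[i] - grads[j]) ** 2))
                    old = self.ema.get((i, j), d)   # initialize on first visit
                    self.ema[(i, j)] = self.alpha * old + (1 - self.alpha) * d
```

These EMA estimates are exactly the norm estimates consumed by the subsequence selection and the closed-form $q$ of Section 5.2.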

6.2 Controlling sequence length

Tuning and subsequence selection require computation. Consider using RT to optimize an objective with an inner loop of size $H$. If we let $G_n$ be the gradient of the loss after $n$ inner steps, we must maintain $\mathcal{O}(H^2)$ exponential moving averages, and compute $H$ gradients each time we tune the estimator. The computational cost of the tuning step under this scheme grows quadratically with $H$. This is unacceptable if we wish our method to scale well with the size of the loops we might wish to optimize.

To circumvent this, we choose base subsequences such that the compute doubles from each element to the next, i.e., $C(S_i) = 2\, C(S_{i-1})$. This ensures that the base subsequence has $\mathcal{O}(\log H)$ elements, where $H$ is the maximum number of steps we wish to unroll, so we must maintain only $\mathcal{O}(\log^2 H)$ exponential moving averages. Computing the gradients at the base subsequence during each tuning step requires compute $\sum_i C(S_i)$. Noting that $C(S_{|S|}) = C(H)$ and that the costs form a geometric series yields $\sum_i C(S_i) \le 2\, C(H)$.
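For example, a doubling base subsequence can be built in a few lines (illustrative helper, not the paper's code):

```python
def doubling_subsequence(H):
    """Truncation indices 1, 2, 4, ... capped by the full horizon H."""
    seq, n = [], 1
    while n < H:
        seq.append(n)
        n *= 2
    return seq + [H]

print(doubling_subsequence(257))  # [1, 2, 4, 8, 16, 32, 64, 128, 256, 257]
```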

7 Experiments

For all experiments, we tune learning rates for the full-horizon un-truncated estimator via grid search over a logarithmic grid. The same learning rates are used for the truncated estimators and (as reference learning rates) for the RT estimators. We do not decay the learning rate. Experiments are run with multiple random seeds, and we plot means and standard deviations.

We use the same hyperparameters for our online tuning procedure in all experiments: the tuning frequency and the exponential moving average weight are held fixed across problems. These hyperparameters were not extensively tuned. For each problem, we compare deterministic, RT-SS, and RT-RR estimators, each with a range of truncations.

7.1 Lotka-Volterra ODE

We first experiment with variational inference of parameters of a Lotka-Volterra (LV) ODE. LV ODEs are defined by the predator-prey equations, where $u$ and $v$ are the predator and prey populations, respectively:

$$\frac{dv}{dt} = \alpha v - \beta u v, \qquad \frac{du}{dt} = \delta u v - \gamma u.$$

We aim to infer the parameters $\lambda = (\alpha, \beta, \gamma, \delta)$. The true parameters are drawn from a distribution chosen empirically to ensure stability when solving the equations. We generate ground-truth data by solving the equations using RK4 (a common fourth-order Runge-Kutta method) on a fine time grid. The learner is given access to five equally spaced noisy observations, generated by adding Gaussian noise to the ground-truth solution.
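For concreteness, a small sketch of generating such data, with placeholder parameter values rather than the paper's:

```python
import numpy as np

def lv_dynamics(state, params):
    u, v = state                                   # u: predator, v: prey
    alpha, beta, gamma, delta = params
    return np.array([delta * u * v - gamma * u,    # du/dt
                     alpha * v - beta * u * v])    # dv/dt

def rk4_solve(state0, params, t1, n_steps):
    """Classic RK4 integration with n_steps steps; coarser n_steps gives the
    cheaper, more biased members of the sequence L_n."""
    h = t1 / n_steps
    traj = [np.asarray(state0, dtype=float)]
    for _ in range(n_steps):
        s = traj[-1]
        k1 = lv_dynamics(s, params)
        k2 = lv_dynamics(s + 0.5 * h * k1, params)
        k3 = lv_dynamics(s + 0.5 * h * k2, params)
        k4 = lv_dynamics(s + h * k3, params)
        traj.append(s + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4))
    return np.stack(traj)

params = (1.0, 0.2, 1.0, 0.2)                      # placeholder values
coarse = rk4_solve([1.0, 2.0], params, t1=10.0, n_steps=16)    # cheap, biased
fine = rk4_solve([1.0, 2.0], params, t1=10.0, n_steps=1024)    # costly surrogate
```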

We place a diagonal Gaussian prior on $\lambda$ with the same mean and standard deviation as the data-generating distribution. The variational posterior is a diagonal Gaussian with mean $\mu$ and standard deviation $\sigma$. The parameters optimized are $\theta = (a, b)$. We let $\mu = e^{a}$ and $\sigma = e^{b}$, where $a, b \in \mathbb{R}^4$, to ensure positivity. We use a reflecting boundary to ensure positivity of parameter samples from the posterior. The variational posterior is initialized to have mean equal to the prior mean and standard deviation 0.1.

The loss considered is the negative evidence lower bound (negative ELBO). The ELBO is:

$$\mathrm{ELBO}(\theta) = \mathbb{E}_{\lambda \sim q_\theta}\Big[\sum_i \log p\big(y_i \mid x(\lambda, t_i)\big)\Big] - \mathrm{KL}\big(q_\theta(\lambda) \,\|\, p(\lambda)\big).$$

Above, $x(\lambda, t)$ is the value of the solution to the LV ODE with parameters $\lambda$, evaluated at time $t$. We consider a sequence $\mathcal{L}_n$, where in computing the ELBO, $x(\lambda, t)$ is approximated by solving the ODE using RK4 with a number of steps increasing in $n$, and linearly interpolating the solution to the observation times. The outer-loop optimization is performed with a small batch of Monte Carlo samples of $\lambda$ drawn at each step; evaluation is performed with a larger batch.

Figure 1: Lotka-Volterra parameter inference. Negative ELBO vs. gradient evaluations (thousands).

Figure 1 shows the loss of the different estimators over the course of training. RT-SS estimators outperform the un-truncated estimator without inducing bias. They are competitive with the milder truncations while avoiding the bias present under aggressive truncation, at the cost of some variance. Some RT-RR estimators experience issues with optimization, appearing to converge to the same biased solution as the truncated estimators.

7.2 MNIST learning rate

We next experiment with meta-optimization of a learning rate on MNIST. We largely follow the procedure used by Wu et al. (2018). We use a feedforward network with two hidden layers of 100 units, with weights initialized from a Gaussian with standard deviation 0.1, and biases initialized to zero. Optimization is performed with a batch size of 100.

The neural network is trained by SGD with momentum, using Polyak averaging, with the momentum parameter held fixed. We aim to learn a learning rate $\alpha$ and decay $\lambda$ for the inner-loop optimization; the learning rate used at inner optimization step $t$ is a decaying function of $t$ parameterized by $\alpha$ and $\lambda$.

As in Wu et al. (2018), we pre-train the net for 50 steps with a fixed learning rate. $\mathcal{L}_n$ is the evaluation loss after a number of training steps increasing in $n$. The evaluation loss is measured over a fixed number of validation batches or the entire validation set, whichever is smaller. The outer optimization is performed with a fixed learning rate.

RT-SS estimators achieve faster convergence than fixed-truncation estimators. RT-RR estimators suffer from very poor convergence. Truncated estimators appear to obtain biased solutions. The un-truncated estimator achieves a slightly better loss than the RT estimators, but takes significantly longer to converge.

Figure 2: MNIST learning rate meta-optimization. Evaluation loss vs. neural network evaluations (thousands).

7.3 enwik8 LSTM

Finally, we study a high-dimensional optimization problem: training an LSTM to model sequences on enwik8. These data are the first 100M bytes of a Wikipedia XML dump. There are 205 unique tokens. We use the first 90M, 5M, and 5M characters as the training, evaluation, and test sets, respectively.

We build on code from Merity et al. (2017, 2018) (http://github.com/salesforce/awd-lstm-lm). We train an LSTM with 1000 hidden units and 400-dimensional input and output embeddings. The model has 5.9M parameters. The only regularization is an $L_2$ penalty on the weights. The optimization is performed with a fixed learning rate. This model is not state-of-the-art: our aim is to investigate the performance of RT estimators when optimizing high-dimensional neural networks, rather than to maximize performance at a language modeling task.

We choose $\mathcal{L}_n$ to be the mean cross-entropy after unrolling the LSTM training for a number of steps increasing in $n$. We choose the horizon such that the un-truncated loop has 257 steps, close to the 200-length training sequences used by Merity et al. (2018).

Figure 3 shows the training bits-per-character (proportional to the training cross-entropy loss). RT estimators provide some acceleration over the un-truncated estimator early in training but, after about 200k cell evaluations, fall back on the un-truncated estimator, subsequently progressing slightly more slowly due to the computational cost of tuning. We conjecture that the diagonal covariance assumption in Section 5 is unsuited to high-dimensional problems, and leads to overly conservative estimators.

Figure 3: LSTM training on enwik8. Training bits-per-character vs. LSTM cell evaluations (thousands).

8 Limitations and future work

Other optimizers. We develop the lower bound on expected improvement for SGD. Important future directions would investigate adaptive and momentum-based SGD methods such as Adam (Kingma & Ba, 2014).

Tuning step. Our method includes a tuning step which requires computation. It might be possible to remove this tuning step by estimating the covariance structure online, using just the values of the differences $\Delta_n$ observed during each optimization step.

RT estimators beyond RT-SS and RT-RR. There is a rich family of RT estimators defined by choices of $q$ and $W$. The optimal member depends on the covariance structure between the $\Delta_n$. We explore RT-SS and RT-RR under strict covariance assumptions. Relaxing these assumptions and optimizing $q$ and $W$ across a wider family could improve adaptive estimator performance for high-dimensional problems such as training RNNs.

Predictive models of the sequence limit. Using any sequence with RT yields an unbiased estimator as long as the sequence is consistent, i.e. its limit is the true gradient. Combining randomized telescopes with predictive models of the gradients (Jaderberg et al., 2017; Weber et al., 2019) might yield a fast-converging sequence, leading to estimators with low computation and variance.

9 Conclusion

We investigated the use of randomly truncated unbiased gradient estimators for optimizing objectives which involve loops and limits. We proved these estimators can achieve horizon-independent convergence rates for optimizing loops and limits. We derived adaptive variants which can be tuned online to maximize a lower bound on expected improvement per unit computation. Experimental results matched theoretical intuitions that the single sample estimator is more robust than Russian roulette for optimization. The adaptive RT-SS estimator often significantly accelerates optimization, and can otherwise fall back on the un-truncated estimator.

10 Acknowledgements

We would like to thank Matthew Johnson, Peter Orbanz, and James Saunderson for helpful discussions. This work was funded by the Alfred P. Sloan Foundation and NSF IIS-1421780.

References

Appendix A Algorithm pseudocode

  Input: initial parameter θ, gradient routine which returns G_n(θ), compute costs C, exponential decay α, tuning frequency t_f, horizon H, reference learning rate η_0
  Initialize squared-distance estimates D, next_tune ← 0, computeUsed ← 0
  repeat
     if computeUsed ≥ next_tune then
        D, q, W, S ← tune(θ, D, G, C, α, H)
        expectedCompute, expectedSquaredNorm ← compute_and_variance(D, C, S)
        η ← η_0 · (squared-norm estimate of the reference un-truncated estimator) / expectedSquaredNorm     {Eq. 12}
        next_tune ← computeUsed + t_f · C(H)
     end if
     Sample N ∼ q
     for n ← 1 to N do
        Δ_n ← G_{S_n}(θ) − G_{S_{n−1}}(θ)
     end for
     Ĝ ← Σ_{n=1}^{N} W(n, N) Δ_n
     θ ← θ − η Ĝ
     if compute reused then
        computeUsed ← computeUsed + C(S_N)
     else
        computeUsed ← computeUsed + Σ_{n=1}^{N} C(S_n)
     end if
  until converged
Algorithm 1 Optimization with randomized telescopes
  Input: current parameter θ, current squared distance estimates D, gradient routine which returns G_n(θ), compute costs C, exponential decay α, horizon H
  Let S̄ be the base subsequence of 1, …, H (Section 6.2), with S̄_0 ← 0 and G_0(θ) ← 0
  for i ← 1 to |S̄| do
     compute G_{S̄_i}(θ)
  end for
  for i ← 0 to |S̄| − 1 do
     for j ← i + 1 to |S̄| do
        D_{ij} ← α D_{ij} + (1 − α) ‖G_{S̄_i}(θ) − G_{S̄_j}(θ)‖²
     end for
  end for
  S ← greedy_subsequence_select(D, C)
  q, W ← q_and_W(D, C, S)
  Return: updated estimates D, sampling distribution q, weight function W, and subsequence S
Algorithm 2 tune
  Input: Norm estimates D, compute costs C
  Initialize S_add ← (H), S_remove ← (1, 2, …, H)
  Initialize converged ← FALSE, bestAddCost ← sequence_cost(D, C, S_add), bestRemoveCost ← sequence_cost(D, C, S_remove)
  while not converged do
     for i ← 1 to H, if i ∉ S_add do
        trial ← sort(S_add ∪ {i})
        trialCost ← sequence_cost(D, C, trial)
        if trialCost < bestAddCost then
           S_add ← trial
           bestAddCost ← trialCost
           converged ← FALSE
           BREAK
        else
           converged ← TRUE
        end if
     end for
  end while
  converged ← FALSE
  while not converged do
     for i in S_remove, if i ≠ H do
        trial ← (j for j in S_remove if j ≠ i)
        trialCost ← sequence_cost(D, C, trial)
        if trialCost < bestRemoveCost then
           S_remove ← trial
           bestRemoveCost ← trialCost
           converged ← FALSE
           BREAK
        else
           converged ← TRUE
        end if
     end for
  end while
  if bestRemoveCost ≤ bestAddCost then
     Return: S_remove
  else
     Return: S_add
  end if
Algorithm 3 greedy_subsequence_select
  Input: Norm estimates D, compute costs C, sequence S
  q, W ← q_and_W(D, C, S)
  expectedCompute ← Σ_{n=1}^{|S|} q(n) C(S_n)
  if RT-SS then
     expectedSquaredNorm ← Σ_{n=1}^{|S|} D_n / q(n)
  else if RT-RR then
     expectedSquaredNorm ← Σ_{n=1}^{|S|} D_n / Q(n), where Q(n) = Σ_{n′ ≥ n} q(n′)
  else
     Undefined: must specify RT-SS or RT-RR
  end if
  Return: expectedCompute, expectedSquaredNorm
Algorithm 4 compute_and_variance
  Input: Norm estimates D, compute costs C, sequence S
  expectedCompute, expectedSquaredNorm ← compute_and_variance(D, C, S)
  Return: expectedCompute · expectedSquaredNorm
Algorithm 5 sequence_cost
  Input: D, C, and S
  if RT-SS then
     q(n) ∝ sqrt(D_n / C(S_n))
     W(n, N) ← 1{n = N} / q(n)
  else if RT-RR then
     Q(n) ∝ sqrt(D_n / (C(S_n) − C(S_{n−1})))
     normalize and clip Q so that Q(1) = 1 and Q is nonincreasing
     q(n) ← Q(n) − Q(n + 1), with Q(|S| + 1) ← 0
     W(n, N) ← 1{n ≤ N} / Q(n)
  else
     Undefined: must specify RT-SS or RT-RR
  end if
  Return: q, W
Algorithm 6 q_and_W

Appendix B Proofs

B.1 Proofs for Section 2

B.1.1 Proposition 2.1

Unbiasedness of RT estimators. The RT estimators in (2) are unbiased estimators of $Y_H$ as long as

$$\mathbb{E}_{N \sim q}\big[W(n, N)\, \mathbb{1}\{N \ge n\}\big] = \sum_{N'=n}^{H} q(N')\, W(n, N') = 1 \qquad \forall\, n \le H.$$

Proof.

A randomized telescope estimator which satisfies the above linear constraint condition has expectation

$$\mathbb{E}_{N \sim q}\big[\hat{Y}_H\big] = \mathbb{E}_{N \sim q}\Big[\sum_{n=1}^{N} W(n, N)\, \Delta_n\Big] = \sum_{n=1}^{H} \Delta_n\, \mathbb{E}_{N \sim q}\big[W(n, N)\, \mathbb{1}\{N \ge n\}\big] = \sum_{n=1}^{H} \Delta_n = Y_H. \qquad \square$$

B.2 Proofs for Section 4

B.2.1 Theorem 4.1

Bounded variance and compute with polynomial convergence of $\psi$. Assume $\psi_n$ converges according to $\psi_n \le c\, n^{-p}$ or faster, for constants $c > 0$ and $p > 0$. Choose the RT-SS estimator with $q(n) \propto n^{-p}$. The resulting estimator $\hat{G}$ achieves expected compute $\mathbb{E}[C] = H_H^{(p-1)} / H_H^{(p)}$, where $H_H^{(p)}$ is the $H$th generalized harmonic number of order $p$, and expected squared norm $\mathbb{E}\|\hat{G}\|^2 = \mathcal{O}\big((H_H^{(p)})^2\big)$. The limit $\lim_{H \to \infty} H_H^{(p)}$ is finite iff $p > 1$, in which case it is given by the Riemann zeta function, $\lim_{H \to \infty} H_H^{(p)} = \zeta(p)$. Accordingly, the estimator achieves horizon-agnostic variance and expected compute bounds iff $p > 2$.

Proof.

Begin by noting that the RT-SS estimator returns $\hat{G} = \Delta_n / q(n)$ with probability $q(n)$. Let $a_n = \mathbb{E}\|\Delta_n\|^2$ and $Z = \sum_{n=1}^{H} n^{-p}$, such that $q(n) = n^{-p} / Z$. First, note $Z = H_H^{(p)}$. Now inspect the expected squared norm $\mathbb{E}\|\hat{G}\|^2$: