1 Introduction
Many important optimization problems involve objective functions that can only be computed iteratively or as the limit of an approximation. Machine learning and scientific computing provide many important examples. In meta-learning, evaluation of the objective typically requires the training of a model, a case of bilevel optimization. When training a model on sequential data or to make decisions over time, each learning step requires looping over time steps. More broadly, in many scientific and engineering applications one wishes to optimize an objective that is defined as the limit of a sequence of approximations, with both fidelity and computational cost increasing according to a natural number n. Inner-loop examples include: integration by Monte Carlo or quadrature with n evaluation points; solving ordinary differential equations (ODEs) with an Euler or Runge-Kutta method with n steps and step size inversely proportional to n; and solving partial differential equations (PDEs) with a finite element basis whose size or order increases with n.

Whether the task is fitting parameters to data, identifying the parameters of a natural system, or optimizing the design of a mechanical part, in this work we seek to more rapidly solve problems in which the objective function demands a tradeoff between computational cost and accuracy. We formalize this by considering parameters θ and a loss function L(θ) that is the uniform limit of a sequence L_n(θ):

L(θ) = lim_{n→∞} L_n(θ).    (1)
Some problems may involve a finite horizon H; in other cases H = ∞. We also introduce a cost function C(n), nondecreasing in n, to represent the cost of computing L_n(θ) and its gradient.
A principal challenge of optimization problems with the form in Eq. 1 is selecting a finite N such that the minimum of the surrogate L_N is close to that of L, but without L_N (or its gradients) being too expensive. Choosing a large N can be computationally prohibitive, while choosing a small N may bias optimization. Meta-optimizing learning rates with truncated horizons can choose hyperparameters that are wrong by orders of magnitude (Wu et al., 2018). Truncating backpropagation through time for recurrent neural networks (RNNs) favors short-term dependencies (Tallec & Ollivier, 2017). Using too coarse a discretization to solve an ODE or PDE can cause error in the solution and bias outer-loop optimization. These optimization problems thus present a sharp tradeoff between efficient computation and bias.

We propose randomized telescope (RT) gradient estimators, which provide cheap unbiased gradient estimates to allow efficient optimization of these objectives. RT estimators represent the objective or its gradients as a telescoping series of differences between intermediate values, and draw weighted samples from this series to maintain unbiasedness while balancing variance and expected computation.
The paper proceeds as follows. Section 2 introduces RT estimators and their history. Section 3 formalizes RT estimators for optimization, and discusses related work in optimization. Section 4 discusses conditions for finite variance and computation, and proves RT estimators can achieve optimization guarantees for loops and limits. Section 5 discusses designing RT estimators by maximizing a bound on expected improvement per unit of computation. Section 6 describes practical considerations for adapting RT estimators online. Section 7 presents experimental results. Section 8 discusses limitations and future work. Appendix A presents algorithm pseudocode. Appendix B presents proofs. Code may be found at https://github.com/PrincetonLIPS/randomized_telescopes.
2 Unbiased randomized truncation
In this section, we discuss the general problem of estimating limits through randomized truncation. The first subsection presents the randomized telescope family of unbiased estimators, while the second subsection describes their history (dating back to von Neumann and Ulam). In the following sections, we will describe how this technique can be used to provide cheap unbiased gradient estimates and accelerate optimization for many problems.
2.1 Randomized telescope estimators
Consider estimating any quantity Y_H := lim_{n→H} Y_n for H ∈ ℕ ∪ {∞}. Assume that we can compute Y_n for any finite n, but since the cost is nondecreasing in n there is a point at which this becomes impractical. Rather than truncating at a fixed value short of the limit, we may find it useful to construct an unbiased estimator of Y_H and take on some randomness in return for reduced computational cost.

Define the backward difference Δ_n := Y_n − Y_{n−1}, with Y_0 := 0, and represent the quantity of interest with a telescoping series:

Y_H = Σ_{n=1}^{H} Δ_n.

We may sample from this telescoping series to provide unbiased estimates of Y_H, introducing variance to our estimator in exchange for reducing expected computation. We use the name randomized telescope (RT) to refer to the family of estimators indexed by a distribution q over the integers 1, …, H (for example, a geometric distribution) and a weight function W(n, N):

Ŷ_H = Σ_{n=1}^{N} W(n, N) Δ_n,  where N ∼ q.    (2)
Proposition 2.1.
Unbiasedness of RT estimators. The RT estimators in (2) are unbiased estimators of Y_H as long as E_{N∼q}[W(n, N) 1{N ≥ n}] = 1 for all n ≤ H.
See Appendix B for a short proof. Although we are coining the term “randomized telescope” to refer to the family of estimators with the form of Eq. 2, the underlying trick has a long history, discussed in the next section. The literature we are aware of focuses on one or both of two special cases of Eq. 2, defined by choice of the weight function W. We will also focus on these two variants of RT estimators, but we observe that there is a larger family.
Most related work uses the “Russian roulette” estimator originally discovered and named by von Neumann and Ulam (Kahn, 1955), which we term RT-RR and which has the form

Ŷ_H = Σ_{n=1}^{N} Δ_n / P(N ≥ n),  where N ∼ q.    (3)

It can be seen as summing the iterates Δ_n while flipping a biased coin at each iterate. With probability P(N = n | N ≥ n), the series is truncated at term n. Otherwise, the process continues, and all future terms are upweighted accordingly to maintain unbiasedness.

The other important special case of Eq. 2 is the “single sample” estimator RT-SS, referred to as “single term weighted truncation” in Lyne et al. (2015). RT-SS takes W(n, N) = 1{n = N}/q(N), i.e.,

Ŷ_H = Δ_N / q(N),  where N ∼ q.    (4)

This directly importance samples the differences Δ_n.
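As a concrete illustration (a toy sketch, not code from the paper's repository), both special cases can be written in a few lines of Python. Here the sequence Y_n = Σ_{k≤n} 2^{-k} has differences Δ_k = 2^{-k}, and with a truncated geometric proposal q, averaging many draws of either estimator recovers Y_H:

```python
import random

def delta(k):
    # Backward difference of Y_n = sum_{j<=n} 2**-j, so Delta_k = 2**-k.
    return 2.0 ** -k

def sample(q):
    # Draw N from the distribution q (dict mapping n -> probability).
    ns = sorted(q)
    return random.choices(ns, weights=[q[n] for n in ns])[0]

def rt_ss(q):
    # Single-sample estimator: draw N ~ q, return Delta_N / q(N).
    n = sample(q)
    return delta(n) / q[n]

def rt_rr(q):
    # Russian roulette: sum Delta_1..Delta_N, weighting Delta_k by 1 / P(N >= k).
    n = sample(q)
    total, tail = 0.0, 1.0  # tail tracks P(N >= k)
    for k in range(1, n + 1):
        total += delta(k) / tail
        tail -= q.get(k, 0.0)
    return total

random.seed(0)
H = 12
q = {n: 0.5 ** n for n in range(1, H)}
q[H] = 1.0 - sum(q.values())          # truncated geometric proposal
y_true = sum(delta(n) for n in range(1, H + 1))
est_ss = sum(rt_ss(q) for _ in range(100000)) / 100000
est_rr = sum(rt_rr(q) for _ in range(100000)) / 100000
```

Both Monte Carlo averages approach y_true, while most individual draws touch only the first few terms of the series.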
We will later prove conditions under which RT-SS and RT-RR should be preferred. Of all estimators in the form of Eq. 2 which obey Proposition 2.1, and for any choice of q, RT-SS minimizes the variance across worst-case diagonal covariances Cov(Δ_i, Δ_j). Within the same family, RT-RR achieves minimum variance when Δ_i and Δ_j are independent for all i ≠ j.
2.2 A brief history of unbiased randomized truncation
The essential trick—unbiased estimation of a quantity via randomized truncation of a series—dates back to unpublished work from John von Neumann and Stanislaw Ulam. They are credited for using it to develop a Monte Carlo method for matrix inversion in Forsythe & Leibler (1950), and for a method for particle diffusion in Kahn (1955).
It has been applied and rediscovered in a number of fields and applications. The early work from von Neumann and Ulam led to its use in computational physics, in neutron transport problems (Spanier & Gelbard, 1969), for studying lattice fermions (Kuti, 1982), and to estimate functional integrals (Wagner, 1987). In computer graphics Arvo & Kirk (1990) introduced its use for ray tracing; it is now widely used in rendering software. In statistical estimation, it has been used for estimation of derivatives (Rychlik, 1990)
, unbiased kernel density estimation
(Rychlik, 1995), doubly-intractable Bayesian posterior distributions (Girolami et al., 2013; Lyne et al., 2015; Wei & Murray, 2016), and unbiased Markov chain Monte Carlo
(Jacob et al., 2017). The underlying trick has been rediscovered by Fearnhead et al. (2008) for unbiased estimation in particle filtering, by McLeish (2010) for debiasing Monte Carlo estimates, by Rhee & Glynn (2012, 2015) for unbiased estimation in stochastic differential equations, and by Tallec & Ollivier (2017)
to debias truncated backpropagation. The latter also uses RT estimators for optimization; however, it only considers fixed “Russian roulette”style randomized telescope estimators and does not consider convergence rates or how to adapt the estimator online (our main contributions).
3 Optimizing loops and limits
In this paper, we consider optimizing functions defined as limits. Consider a problem where, given parameters θ, we can obtain a series of approximate losses L_n(θ) which converges uniformly to some limit L(θ) := lim_{n→H} L_n(θ). We assume the sequence of gradients with respect to θ, denoted G_n(θ) := ∇_θ L_n(θ), converges uniformly to a limit G(θ). Under this uniform convergence and assuming convergence of L_n, we have lim_{n→H} ∇_θ L_n(θ) = ∇_θ lim_{n→H} L_n(θ) (see Theorem 7.17 in Rudin et al. (1976)), and so G is indeed the gradient of our objective L. We assume there is a computational cost C(n) associated with evaluating L_n or G_n, nondecreasing with n, and we wish to efficiently minimize L with respect to θ. Loops are an important special case of this framework, where L_n is the final output resulting from running, e.g., a training loop or RNN for some number of steps increasing in n.
3.1 Randomized telescopes for optimization
We propose using randomized telescopes as a stochastic gradient estimator for such optimization problems. We aim to accelerate optimization much as minibatch stochastic gradient descent accelerates optimization for large datasets: using Monte Carlo sampling to decrease the expected cost of each optimization step, at the price of increasing variance in the gradient estimates, without introducing bias.
Consider the gradient G(θ) := ∇_θ L(θ) and the backward differences Δ_n(θ) := G_n(θ) − G_{n−1}(θ), where G_0(θ) := 0, so that G(θ) = Σ_{n=1}^{H} Δ_n(θ). We use the randomized telescope estimator

Ĝ(θ) = Σ_{n=1}^{N} W(n, N) Δ_n(θ),    (5)

where N is drawn according to a proposal distribution q, and together q and W satisfy Proposition 2.1.

Note that, due to linearity of differentiation, we have

Ĝ(θ) = ∇_θ Σ_{n=1}^{N} W(n, N) (L_n(θ) − L_{n−1}(θ)).

Thus, when the computation of L_n can reuse most of the computation performed for L_{n−1}, we can evaluate Ĝ via forward or backward automatic differentiation with cost approximately equal to that of computing L_N, i.e., Ĝ has computation cost C(N). This most often occurs when evaluating L_n involves an inner loop with a step size which does not change with n, e.g., meta-learning and training RNNs, but not solving ODEs or PDEs. When computing L_n does not reuse computation, evaluating Ĝ has computation cost equal to the sum of C(n) over the terms with nonzero weight.
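To make this concrete, the following toy sketch (not from the paper's codebase; the quadratic objective is hypothetical) runs SGD with a single-sample RT gradient on L_n(θ) = ½(θ − (1 − 2^{-n}))². The limit objective ½(θ − 1)² is minimized at θ = 1, even though no finite-n surrogate shares that minimizer:

```python
import random

H = 20  # truncated horizon for the proposal

def grad_n(theta, n):
    # Gradient of L_n(theta) = 0.5 * (theta - (1 - 2**-n))**2, with G_0 := 0.
    if n == 0:
        return 0.0
    return theta - (1.0 - 2.0 ** -n)

def rt_ss_grad(theta, q):
    # Single-sample RT gradient: Delta_N / q(N), where Delta_n = G_n - G_{n-1}.
    ns = list(range(1, H + 1))
    n = random.choices(ns, weights=[q[k] for k in ns])[0]
    return (grad_n(theta, n) - grad_n(theta, n - 1)) / q[n]

random.seed(1)
raw = {k: 0.5 ** k for k in range(1, H + 1)}
z = sum(raw.values())
q = {k: w / z for k, w in raw.items()}  # geometric proposal over 1..H

theta, lr, avg, burn, total = 0.0, 0.05, 0.0, 2000, 4000
for t in range(total):
    theta -= lr * rt_ss_grad(theta, q)
    if t >= burn:
        avg += theta / (total - burn)
# avg approaches the minimizer of the limit objective, theta* = 1.
```

Each step samples a random truncation, so most updates are cheap, yet the iterate averages toward the minimizer of the limit rather than that of any fixed truncation.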
3.2 Related work in optimization
Gradient-based bilevel optimization has seen extensive work in the literature. See Jameson (1988) for an early example of optimizing implicit functions, Christianson (1998) for a mathematical treatment, and Maclaurin et al. (2015); Franceschi et al. (2017) for recent treatments in machine learning. Shaban et al. (2018) propose truncating only the backward pass, backpropagating through just the final few optimization steps to reduce memory requirements. Metz et al. (2018) propose linearly increasing the number of inner steps over the course of the outer optimization.
An important case of bilevel optimization is optimization of architectures and hyperparameters. Truncation causes bias, as shown by Wu et al. (2018) for learning rates and by Metz et al. (2018) for neural optimizers.
Bilevel optimization is also used for metalearning across related tasks (Schmidhuber, 1987; Bengio et al., 1992). Ravi & Larochelle (2016) train an initialization and optimizer, and Finn et al. (2017) only an initialization, to minimize validation loss. The latter paper shows increasing performance with the number of steps used in the inner optimization. However, in practice the number of inner loop steps must be kept small to allow training over many tasks.
Bilevel optimization can be accelerated by amortization. Variational inference can be seen as bilevel optimization; variational autoencoders
(Kingma & Welling, 2014) amortize the inner optimization with a predictive model of the solution to the inner objective. Recent work such as Brock et al. (2018); Lorraine & Duvenaud (2018) amortizes hyperparameter optimization in a similar fashion. However, amortizing the inner loop induces bias. Cremer et al. (2018) demonstrate this in VAEs, while Kim et al. (2018) show that in VAEs, combining amortization with truncation by taking several gradient steps on the output of the encoder can reduce this bias. This shows these techniques are orthogonal to our contributions: while fully amortizing the inner optimization causes bias, predictive models of the limit can accelerate convergence of the sequence to its limit.
Our work is also related to work on training sequence models. Tallec & Ollivier (2017) use the Russian roulette estimator to debias truncated backpropagation through time. They use a fixed, geometrically decaying q, and show that this improves validation loss for Penn Treebank. They do not consider efficiency of optimization, or methods to automatically set or adapt the hyperparameters of the randomized telescope. Trinh et al. (2018) learn long-term dependencies with auxiliary losses. Other work accelerates optimization of sequence models by replacing recurrent models with models which use convolution or attention (Vaswani et al., 2017), which can be trained more efficiently.
4 Convergence rates with fixed RT estimators
Before considering more complex large-scale problems, we examine the simple RT estimator for stochastic gradient descent on convex problems. We assume that the sequence and the units of computation are chosen such that C(n) = n. We study RT-SS, with q fixed a priori. We consider optimizing parameters θ ∈ Θ, where Θ is a bounded, convex and compact set with diameter bounded by D. We assume L is convex in θ, and that the G_n converge according to ||Δ_n|| ≤ ψ(n), where ψ converges polynomially or geometrically. The quantity of interest is the instantaneous regret, R_t = L(θ_t) − min_θ L(θ), where θ_t is the parameter after t steps of SGD.

In this setting, any fixed truncation scheme using L_N as a surrogate for L, with fixed N < H, cannot achieve lim_{t→∞} R_t = 0. Meanwhile, the fully unrolled estimator has computational cost which scales with H. In the many situations where H = ∞, it is impossible to take even a single gradient step with this estimator.

The randomized telescope estimator overcomes these drawbacks by exploiting the fact that G_n converges according to ψ(n). As long as q is chosen to have tails no lighter than ψ(n), for sufficiently fast convergence, the resulting RT-SS gradient estimator achieves asymptotic regret bounds invariant to H in terms of convergence rate.

All proofs are deferred to Appendix B. We begin by proving bounds on the variance and expected computation for polynomially decaying ψ and q.
Theorem 4.1.
Bounded variance and compute with polynomial convergence of ψ. Assume Δ_n converges according to ψ(n) = c/n^p or faster, for constants c > 0 and p > 0. Choose the RT-SS estimator with q(n) ∝ 1/n^p. The resulting estimator achieves expected compute H_H^(p−1)/H_H^(p), where H_H^(k) is the Hth generalized harmonic number of order k, and expected squared norm bounded by c²(H_H^(p))². The limit lim_{H→∞} H_H^(k) is finite iff k > 1, in which case it is given by the Riemann zeta function, ζ(k). Accordingly, the estimator achieves horizon-agnostic variance and expected compute bounds iff p > 2.
The corresponding bounds for geometrically decaying ψ and q follow.
Theorem 4.2.
Bounded variance and compute with geometric convergence of ψ. Assume Δ_n converges according to ψ(n) = c rⁿ, or faster, for constants c > 0 and 0 < r < 1. Choose RT-SS and q with q(n) ∝ rⁿ. The resulting estimator achieves expected compute at most 1/(1 − r) and expected squared norm bounded by c²r²/(1 − r)². Thus, the estimator achieves horizon-agnostic variance and expected compute bounds for all 0 < r < 1.
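These bounds are easy to check numerically. The sketch below (an illustration under the stated geometric-decay assumption, not code from the paper) computes the expected compute Σ_n n·q(n) and the RT-SS second-moment bound Σ_n ψ(n)²/q(n) for ψ(n) = c·rⁿ and q(n) ∝ rⁿ, and both quantities approach finite limits (1/(1−r) and c²r²/(1−r)², respectively) as the horizon H grows:

```python
def rt_ss_bounds(c, r, H):
    # psi(n) = c * r**n; proposal q(n) proportional to r**n on {1, ..., H}.
    weights = [r ** n for n in range(1, H + 1)]
    z = sum(weights)
    q = [w / z for w in weights]
    # Expected compute with C(n) = n, and the bound sum_n psi(n)^2 / q(n)
    # on the expected squared norm of the single-sample estimator.
    expected_compute = sum((n + 1) * q[n] for n in range(H))
    second_moment_bound = sum((c * r ** (n + 1)) ** 2 / q[n] for n in range(H))
    return expected_compute, second_moment_bound

c, r = 1.0, 0.5
for H in (10, 100, 1000):
    compute, moment = rt_ss_bounds(c, r, H)
    # Both stay bounded as H grows: compute -> 1/(1-r), moment -> (c*r/(1-r))**2.
```

With c = 1 and r = 0.5 the limits are 2 and 1, and the values computed for H = 1000 are already indistinguishable from them, which is the horizon-agnostic behavior the theorem describes.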
Given a setting and estimator from either Theorem 4.1 or 4.2, with corresponding expected compute cost C̄ and upper bound Ψ² on the expected squared norm, the following theorem considers regret guarantees when using this estimator to perform stochastic gradient descent.
Theorem 4.3.
Asymptotic regret bounds for optimizing infinite-horizon programs. Assume the setting from Theorem 4.1 or 4.2, with the corresponding expected compute C̄ and squared-norm bound Ψ² from those theorems. Let R_t be the instantaneous regret at the tth step of optimization, R_t = L(θ_t) − min_θ L(θ). Let t(B) be the greatest t such that a computational budget B is not exceeded. Use online gradient descent with step size α_t = D/(Ψ√t). As B → ∞, the asymptotic instantaneous regret is bounded by R_{t(B)} ≤ O(DΨ√(C̄/B)), independent of H.
Theorem 4.3 indicates that if Δ_n converges sufficiently fast and L is convex, the RT estimator provably optimizes the limit objective L.
5 Adaptive RT estimators
In practice, the estimator considered in the previous section may have high variance. This section develops an objective for designing such estimators, and derives closed-form q and W which maximize this objective given estimates of E||Δ_n||² and assumptions on the covariance structure of the Δ_n.
5.1 Choosing between unbiased estimators
We propose choosing an estimator which achieves the best lower bound on the expected improvement per compute unit spent, given smoothness assumptions on the loss. Our analysis builds on that of Balles et al. (2016): they adaptively choose a batch size using batch covariance information, while we choose between arbitrary unbiased gradient estimators using knowledge of those estimators’ expected squared norms and computation cost.
Here we assume that the true gradient of the objective, G(θ) := ∇_θ L(θ), is smooth in θ. We do not assume convexity. Note that a stochastic gradient estimator Ĝ(θ) is not necessarily equal to G(θ), as the loss and its gradient may be random variables due to sampling of data and/or latent variables.

We assume that L is K-smooth (the gradients of L are K-Lipschitz), i.e., there exists a constant K such that ||∇_θ L(θ₁) − ∇_θ L(θ₂)|| ≤ K||θ₁ − θ₂|| for all θ₁, θ₂. It follows (Balles et al., 2016; Bottou et al., 2018) that, when performing SGD with step size α and an unbiased stochastic gradient estimator Ĝ(θ),

E[L(θ_{t+1})] − L(θ_t) ≤ −α ⟨G(θ_t), E[Ĝ(θ_t)]⟩ + (α²K/2) E||Ĝ(θ_t)||².    (6)

Unbiasedness of Ĝ implies E[Ĝ(θ_t)] = G(θ_t), thus:

E[L(θ_{t+1})] − L(θ_t) ≤ −α ||G(θ_t)||² + (α²K/2) E||Ĝ(θ_t)||² := −J(α, Ĝ).    (7)
Above, J(α, Ĝ) is a lower bound on the expected improvement in the loss from one optimization step. Given a fixed choice of Ĝ, how should one pick the learning rate α to maximize J, and what is the corresponding lower bound on expected improvement?

Optimizing J by finding α* s.t. ∂J/∂α = 0 yields

α*(Ĝ) = ||G(θ_t)||² / (K E||Ĝ(θ_t)||²),    (8)

J(α*, Ĝ) = ||G(θ_t)||⁴ / (2K E||Ĝ(θ_t)||²).    (9)
This tells us how to choose α if we know G(θ_t), K, etc. In practice, it is unlikely that we know K or even ||G(θ_t)||. We instead assume we have access to some “reference” learning rate α̂, which has been optimized for use with a “reference” gradient estimator Ĝ_ref with known E||Ĝ_ref||². When using RT estimators, we may have access to learning rates which have been optimized for use with the untruncated estimator. Even when we do not know an optimal reference learning rate, this construction means we need only tune one hyperparameter (the reference learning rate), and can still choose between a family of gradient estimators online. Instead of directly maximizing J, we choose α for an estimator Ĝ by maximizing improvement relative to the reference estimator in terms of J, the lower bound on expected improvement.
Assume that α̂ has been set optimally for the problem and reference estimator up to some constant c, i.e.,

α̂ = c α*(Ĝ_ref) = c ||G(θ_t)||² / (K E||Ĝ_ref(θ_t)||²).    (10)
Then the expected improvement obtained by the reference estimator is:

J(α̂, Ĝ_ref) = (c − c²/2) ||G(θ_t)||⁴ / (K E||Ĝ_ref(θ_t)||²).    (11)
We assume that 0 < c < 2, such that J(α̂, Ĝ_ref) is positive and the reference estimator has guaranteed expected improvement. Now set the learning rate for an estimator Ĝ according to

α(Ĝ) = α̂ E||Ĝ_ref(θ_t)||² / E||Ĝ(θ_t)||².    (12)
It follows that the expected improvement obtained by the estimator Ĝ is

J(α(Ĝ), Ĝ) = (c − c²/2) ||G(θ_t)||⁴ / (K E||Ĝ(θ_t)||²).    (13)

Let the expected computation cost of evaluating Ĝ be C(Ĝ). We want to maximize the expected improvement per unit of computation, J/C(Ĝ). If we use the above method to choose α, we have J/C(Ĝ) ∝ 1/(C(Ĝ) E||Ĝ(θ_t)||²). We call 1/(C(Ĝ) E||Ĝ(θ_t)||²) the relative optimization efficiency, or ROE. We decide between gradient estimators by choosing the one which maximizes the ROE. Once an estimator is chosen, one should choose a learning rate according to (12) relative to the reference learning rate α̂ and estimator Ĝ_ref.
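The resulting selection rule can be sketched as follows (the estimator statistics here are hypothetical numbers for illustration; the ROE formula is the one derived above):

```python
def choose_estimator(stats, ref_name, ref_lr):
    """Pick the estimator maximizing the ROE 1 / (C(g) * E||g||^2), then
    rescale the reference learning rate as in Eq. (12).

    stats maps an estimator name to (expected_cost, expected_sq_norm)."""
    roe = {name: 1.0 / (cost * sq) for name, (cost, sq) in stats.items()}
    best = max(roe, key=roe.get)
    lr = ref_lr * stats[ref_name][1] / stats[best][1]
    return best, lr

# Hypothetical numbers: the full-horizon reference is expensive but low
# variance; an RT estimator is 10x cheaper with 4x larger expected sq. norm.
stats = {"full": (100.0, 1.0), "rt": (10.0, 4.0)}
best, lr = choose_estimator(stats, ref_name="full", ref_lr=0.1)
# best == "rt"; its learning rate is scaled down to 0.1 * 1.0 / 4.0 = 0.025.
```

Note the learning-rate rescaling: a higher-variance estimator is automatically paired with a proportionally smaller step size, so the guaranteed improvement per step is preserved while the cost per step drops.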
5.2 Optimal weighted sampling for RT estimators
Now that we have an objective, we can consider designing RT estimators which optimize the ROE. For the classes of single sample and Russian roulette estimators, we prove conditions under which that class maximizes the ROE across an arbitrary choice of RT estimators. We also derive closedform expressions for the optimal sampling distribution for each class, under the conditions where that class is optimal.
We assume that computation can be reused and evaluating Ĝ has computation cost C(N), where N is the sampled truncation. As described in Section 3.1, this is approximately true for many objectives. When it is not, the cost of computing Ĝ is the sum of C(n) over the terms with nonzero weight W(n, N). This would penalize the ROE of dense W and favor sparse W, possibly impacting the optimality conditions for RT-RR. We mitigate this inaccuracy by subsequence selection (described in the following subsection), which allows construction of sparse sampling strategies.
We begin by showing the RT-SS estimator is optimal with regard to worst-case diagonal covariances Cov(Δ_i, Δ_j), and deriving the corresponding optimal q.
Theorem 5.1.
Optimality of RT-SS under adversarial correlation. Consider the family of estimators presented in Equation 2. Assume the Δ_n are univariate. For any fixed sampling distribution q, the single-sample RT estimator RT-SS minimizes the worst-case variance of Ŷ_H across an adversarial choice of covariances Cov(Δ_i, Δ_j).
Theorem 5.2.
Optimal q under adversarial correlation. Consider the family of estimators presented in Equation 2. Assume the covariance matrices Cov(Δ_i, Δ_j) are diagonal. The RT-SS estimator with q(n) ∝ √(E||Δ_n||²/C(n)) maximizes the ROE across an adversarial choice of diagonal covariance matrices Cov(Δ_i, Δ_j).
We next show the RT-RR estimator is optimal when Cov(Δ_i, Δ_j) is diagonal and Δ_i and Δ_j are independent for i ≠ j, and derive the corresponding optimal q.
Theorem 5.3.
Optimality of RT-RR under independence. Consider the family of estimators presented in Eq. 2. Assume the Δ_n are univariate. When the Δ_n are uncorrelated, for any importance sampling distribution q, the Russian roulette estimator achieves the minimum variance in this family and thus maximizes the optimization efficiency lower bound.
Theorem 5.4.
Optimal q under independence. Consider the family of estimators presented in Equation 2. Assume Cov(Δ_i, Δ_j) is diagonal and Δ_i and Δ_j are independent for i ≠ j. The RT-RR estimator with P(N ≥ n) ∝ √(E||Δ_n||²/c_n), where c_n := C(n) − C(n−1), maximizes the ROE.
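Given estimates of E||Δ_n||² and the cost model, both closed forms reduce to computing a normalization. The sketch below is illustrative (the geometric decay of the differences and the linear cost C(n) = n are hypothetical inputs): it computes q for RT-SS and the survival probabilities P(N ≥ n) for RT-RR:

```python
import math

def q_rt_ss(delta_sq, total_cost):
    # RT-SS: q(n) proportional to sqrt(E||Delta_n||^2 / C(n)).
    raw = [math.sqrt(d / c) for d, c in zip(delta_sq, total_cost)]
    z = sum(raw)
    return [x / z for x in raw]

def survival_rt_rr(delta_sq, total_cost):
    # RT-RR: P(N >= n) proportional to sqrt(E||Delta_n||^2 / c_n), where
    # c_n = C(n) - C(n-1), scaled so P(N >= 1) = 1 and kept nonincreasing.
    inc = [total_cost[0]] + [
        total_cost[i] - total_cost[i - 1] for i in range(1, len(total_cost))
    ]
    raw = [math.sqrt(d / c) for d, c in zip(delta_sq, inc)]
    surv = []
    for x in raw:
        p = min(1.0, x / raw[0])
        if surv:
            p = min(p, surv[-1])  # survival probabilities cannot increase
        surv.append(p)
    return surv

# Hypothetical inputs: geometrically decaying differences, linear cost.
delta_sq = [4.0 ** -n for n in range(1, 6)]
total_cost = [float(n) for n in range(1, 6)]
q = q_rt_ss(delta_sq, total_cost)
surv = survival_rt_rr(delta_sq, total_cost)
```

Both constructions put most probability mass on early, cheap terms while keeping just enough tail mass on later terms that the reweighted differences retain bounded variance.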
5.3 Subsequence selection
The scheme for designing RT estimators given in the previous subsection contains assumptions which will often not hold in practice. To partially alleviate these concerns, we can design the sequence of iterates over which we apply the RT estimator to maximize the ROE.
Some sequences may result in more efficient estimators, depending on how the intermediate iterates correlate with the limit G. The variance of the estimator, and thus the ROE, will be improved if we choose a sequence such that each G_i is positively correlated with G.
We begin with a reference sequence G_1, …, G_H with cost function C, where G_H = G and computing G_n has cost C(n). We assume knowledge of the expected squared distances E||G_i − G_j||². We aim to find a subsequence S ∈ 𝒮, where 𝒮 is the set of subsequences over the integers 1, …, H which have final element H. Given S, we take G′_i := G_{S_i}, Δ′_i := G′_i − G′_{i−1}, G′_0 := 0, C′(i) := C(S_i), and H′ := |S|, where i ∈ {1, …, H′}.
In practice, we greedily construct S by adding indexes to the minimal sequence {H} or removing indexes from the full sequence {1, …, H}. As this step requires minimal computation, we perform both greedy adding and greedy removal and return the subsequence with the better ROE. The minimal subsequence {H} is always considered, allowing RT estimators to fall back on the original full-horizon estimator.
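A greedy-adding pass can be sketched as follows (an illustration, not the paper's implementation: the ROE expression is the RT-SS product bound under the optimal q of Section 5.2, and D is a hypothetical table of estimated squared distances, with index 0 standing for G_0 = 0):

```python
import math

def roe_ss(seq, D, cost):
    # ROE of RT-SS on subsequence seq with its optimal q, which is
    # 1 / (sum_j sqrt(E||Delta'_j||^2 * C(s_j)))^2, where
    # Delta'_j = G_{s_j} - G_{s_{j-1}} and D[i][j] estimates E||G_i - G_j||^2.
    prev, total = 0, 0.0
    for s in seq:
        total += math.sqrt(D[prev][s] * cost[s])
        prev = s
    return 1.0 / total ** 2

def greedy_subsequence(H, D, cost):
    # Greedily add indices to the minimal subsequence {H} while ROE improves.
    seq = [H]
    improved = True
    while improved:
        improved = False
        best_seq, best_roe = seq, roe_ss(seq, D, cost)
        for i in range(1, H):
            if i in seq:
                continue
            cand = sorted(seq + [i])
            cand_roe = roe_ss(cand, D, cost)
            if cand_roe > best_roe:
                best_seq, best_roe, improved = cand, cand_roe, True
        seq = best_seq
    return seq

# Hypothetical distance table from a scalar sequence G_n = 1 - 2**-n (G_0 = 0).
H = 4
g = [0.0] + [1.0 - 2.0 ** -n for n in range(1, H + 1)]
D = [[(a - b) ** 2 for b in g] for a in g]
cost = list(range(H + 1))  # C(n) = n; cost[0] is unused
seq = greedy_subsequence(H, D, cost)
```

The removal direction is symmetric: start from {1, …, H} and drop the index whose removal most increases the ROE, stopping when no removal helps.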
6 Practical implementation
6.1 Tuning the estimator
We estimate the expected squared distances E||G_i − G_j||² by maintaining exponential moving averages. We keep track of the computational budget used so far by the RT estimator, and “tune” the estimator after every fixed multiple of the compute required to evaluate G_H; this multiple is a “tuning frequency” hyperparameter. During tuning, the gradients G_i are computed for each i in the base subsequence, the squared norms ||G_i − G_j||² are computed, and the exponential moving averages are updated. At the end of tuning, the estimator is updated using the expected squared norms; i.e., a subsequence is selected, q is set according to Section 5.2 with the choice of RT-RR or RT-SS left as a hyperparameter, and the learning rate is adapted according to Section 5.1.
6.2 Controlling sequence length
Tuning and subsequence selection require computation. Consider using RT to optimize an objective with an inner loop of size H. If we let G_n be the gradient of the loss after n inner steps, we must maintain O(H²) exponential moving averages and compute H gradients each time we tune the estimator. The computational cost of the tuning step under this scheme is Σ_{n=1}^{H} C(n). This is unacceptable if we wish our method to scale well with the size of loops we might wish to optimize.
To circumvent this, we choose base subsequences containing only the powers of two up to H, i.e., S̄_i = 2^{i−1}. This ensures that H′ = O(log H), where H is the maximum number of steps we wish to unroll, so we must maintain only O(log² H) exponential moving averages. Computing the gradients during each tuning step requires compute Σ_i C(2^{i−1}). Noting that C is nondecreasing and that Σ_{i=0}^{⌊log₂ H⌋} 2^i ≤ 2H, this tuning compute is O(C(H)) when C(n) is linear in n.
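For instance (a minimal sketch of this dyadic choice; the helper below is hypothetical, not the paper's code), the base subsequence for a loop of H = 257 steps has only ten elements, and with C(n) = n its total gradient cost stays within a small constant factor of C(H):

```python
def dyadic_subsequence(H):
    # Powers of two below H, with H itself as the final element.
    seq, n = [], 1
    while n < H:
        seq.append(n)
        n *= 2
    seq.append(H)
    return seq

seq = dyadic_subsequence(257)
# seq == [1, 2, 4, 8, 16, 32, 64, 128, 256, 257]: O(log H) terms, and with
# C(n) = n a full tuning pass costs sum(seq), which is O(H).
```

Subsequence selection then operates over this logarithmically sized base sequence rather than over all H iterates.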
7 Experiments
For all experiments, we tune learning rates for the full-horizon untruncated estimator via grid search over a fixed logarithmic grid. The same learning rates are used for the truncated estimators and (as reference learning rates) for the RT estimators. We do not decay the learning rate. Experiments are run with a fixed set of random seeds, and we plot means and standard deviations.

We use the same hyperparameters for our online tuning procedure in all experiments: the tuning frequency and the exponential moving average weight are held fixed, and were not extensively tuned. For each problem, we compare deterministic, RT-SS, and RT-RR estimators, each with a range of truncations.
7.1 Lotka-Volterra ODE
We first experiment with variational inference of parameters of a Lotka-Volterra (LV) ODE. LV ODEs are defined by the predator-prey equations, where u is the predator population and v is the prey population:

du/dt = δuv − γu,
dv/dt = αv − βuv.
We aim to infer the parameters θ = (α, β, γ, δ). The true parameters are drawn from a distribution chosen empirically to ensure stability when solving the equations. We generate ground-truth data by solving the equations using RK4 (a common fourth-order Runge-Kutta method) over a fixed time interval with a fine step grid. The learner is given access to five equally spaced noisy observations, generated by adding Gaussian noise to the true solution.
We place a diagonal Gaussian prior on θ with the same mean and standard deviation as the data-generating distribution. The variational posterior is a diagonal Gaussian. The optimized parameters are the posterior mean and a transformed standard deviation, parameterized to ensure positivity. We use a reflecting boundary to ensure positivity of parameter samples from the posterior. The variational posterior is initialized to have mean equal to the prior and standard deviation 0.1.
The loss considered is the negative evidence lower bound (negative ELBO). The ELBO is:

E_{q(θ)}[log p(y | θ)] − KL(q(θ) || p(θ)),

where q(θ) is the variational posterior, p(θ) the prior, and p(y | θ) the likelihood of the observations y.
The likelihood above is evaluated at x(t; θ), the value of the solution to the LV ODE with parameters θ at observation time t. We consider a sequence of losses L_n in which, when computing the ELBO, x(t; θ) is approximated by solving the ODE using RK4 with a number of steps increasing in n, and linearly interpolating the solution to the observation times. The outer-loop optimization is performed with a fixed batch size (i.e., several samples of θ are drawn at each step) and a fixed learning rate. Evaluation is performed with a larger batch size.
[Figure 1: Negative ELBO versus gradient evaluations (thousands).]
Figure 1 shows the loss of the different estimators over the course of training. RT-SS estimators outperform the untruncated estimator without inducing bias. They are competitive with the longer truncations, while avoiding the bias observed with the shorter truncations, at the cost of some variance. Some RT-RR estimators experience issues with optimization, appearing to obtain the same biased solution as a truncated estimator.
7.2 MNIST learning rate
We next experiment with metaoptimization of a learning rate on MNIST. We largely follow the procedure used by Wu et al. (2018). We use a feedforward network with two hidden layers of 100 units, with weights initialized from a Gaussian with standard deviation 0.1, and biases initialized to zero. Optimization is performed with a batch size of 100.
The neural network is trained by SGD with momentum using Polyak averaging, with a fixed momentum parameter. We aim to learn a learning rate and decay for the inner-loop optimization, both initialized to fixed values; the learning rate used at inner step t is determined by the learned initial learning rate and decay.

As in Wu et al. (2018), we pretrain the net for 50 steps with a fixed learning rate. L_n is the evaluation loss after a number of training steps increasing with n, with a fixed batch size. The evaluation loss is measured over a fixed number of validation batches or the entire validation set, whichever is smaller. The outer optimization is performed with a fixed learning rate.
RTSS estimators achieve faster convergence than fixedtruncation estimators. RTRR estimators suffer from very poor convergence. Truncated estimators appear to obtain biased solutions. The untruncated estimator achieves a slightly better loss than the RT estimators, but takes significantly longer to converge.
[Figure 2: Evaluation loss versus neural network evaluations (thousands).]
7.3 enwik8 LSTM
Finally, we study a highdimensional optimization problem: training an LSTM to model sequences on enwik8. These data are the first 100M bytes of a Wikipedia XML dump. There are 205 unique tokens. We use the first 90M, 5M, and 5M characters as the training, evaluation, and test sets.
We build on code from Merity et al. (2017, 2018), available at http://github.com/salesforce/awd-lstm-lm. We train an LSTM with 1000 hidden units and 400-dimensional input and output embeddings. The model has 5.9M parameters. The only regularization is an L2 penalty on the weights. The optimization is performed with a fixed learning rate. This model is not state-of-the-art: our aim is to investigate performance of RT estimators for optimizing high-dimensional neural networks, rather than to maximize performance at a language modeling task.
We choose L_n to be the mean cross-entropy after unrolling the LSTM training for a number of steps increasing with n. We choose the horizon H such that the untruncated loop has 257 steps, close to the 200-length training sequences used by Merity et al. (2018).
Figure 3 shows the training bits-per-character (proportional to the training cross-entropy loss). RT estimators provide some acceleration over the untruncated estimator early in training, but after about 200k cell evaluations, fall back on the untruncated estimator, subsequently progressing slightly more slowly due to the computational cost of tuning. We conjecture that the diagonal covariance assumption in Section 5 is unsuited to high-dimensional problems, and leads to overly conservative estimators.
[Figure 3: Training bits-per-character versus LSTM cell evaluations (thousands).]
8 Limitations and future work
Other optimizers. We develop the lower bound on expected improvement for SGD. Important future directions would investigate adaptive and momentum-based SGD methods such as Adam (Kingma & Ba, 2014).
Tuning step. Our method includes a tuning step which requires computation. It might be possible to remove this tuning step by estimating covariance structure online, using just the values of Ĝ observed during each optimization step.
RT estimators beyond RT-SS and RT-RR. There is a rich family of RT estimators defined by choices of q and W. The optimal member depends on the covariance structure of the Δ_n. We explore RT-SS and RT-RR under strict covariance assumptions. Relaxing these assumptions and optimizing q and W across a wider family could improve adaptive estimator performance for high-dimensional problems such as training RNNs.
Predictive models of the sequence limit. Using any sequence with RT yields an unbiased estimator as long as the sequence is consistent, i.e., its limit is the true gradient. Combining randomized telescopes with predictive models of the gradients (Jaderberg et al., 2017; Weber et al., 2019) might yield a fast-converging sequence, leading to estimators with low computation and variance.
9 Conclusion
We investigated the use of randomly truncated unbiased gradient estimators for optimizing objectives which involve loops and limits. We proved these estimators can achieve horizon-independent convergence rates for optimizing loops and limits. We derived adaptive variants which can be tuned online to maximize a lower bound on expected improvement per unit computation. Experimental results matched theoretical intuitions that the single sample estimator is more robust than Russian roulette for optimization. The adaptive RT-SS estimator often significantly accelerates optimization, and can otherwise fall back on the untruncated estimator.
10 Acknowledgements
We would like to thank Matthew Johnson, Peter Orbanz, and James Saunderson for helpful discussions. This work was funded by the Alfred P. Sloan Foundation and NSF IIS1421780.
References
 Arvo & Kirk (1990) Arvo, J. and Kirk, D. Particle transport and image synthesis. ACM SIGGRAPH Computer Graphics, 24(4):63–66, 1990.

Balles et al. (2016) Balles, L., Romero, J., and Hennig, P. Coupling adaptive batch sizes with learning rates. In Uncertainty in Artificial Intelligence, 2016.
 Bengio et al. (1992) Bengio, S., Bengio, Y., Cloutier, J., and Gecsei, J. On the optimization of a synaptic learning rule. In Conference on Optimality in Artificial and Biological Neural Networks, 1992.
 Bottou et al. (2018) Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
 Brock et al. (2018) Brock, A., Lim, T., Ritchie, J., and Weston, N. SMASH: One-shot model architecture search through hypernetworks. In International Conference on Learning Representations, 2018.
 Christianson (1998) Christianson, B. Reverse accumulation and implicit functions. Optimization Methods and Software, 9(4):307–322, 1998.
 Cremer et al. (2018) Cremer, C., Li, X., and Duvenaud, D. Inference suboptimality in variational autoencoders. arXiv preprint arXiv:1801.03558, 2018.
 Fearnhead et al. (2008) Fearnhead, P., Papaspiliopoulos, O., and Roberts, G. O. Particle filters for partially observed diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(4):755–777, 2008.
 Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017.
 Forsythe & Leibler (1950) Forsythe, G. E. and Leibler, R. A. Matrix inversion by a Monte Carlo method. Mathematics of Computation, 4(31):127–129, 1950.
 Franceschi et al. (2017) Franceschi, L., Donini, M., Frasconi, P., and Pontil, M. Forward and reverse gradient-based hyperparameter optimization. In International Conference on Machine Learning, 2017.
 Girolami et al. (2013) Girolami, M., Lyne, A.M., Strathmann, H., Simpson, D., and Atchade, Y. Playing Russian roulette with intractable likelihoods. Technical report, Citeseer, 2013.
 Hazan et al. (2016) Hazan, E. et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3–4):157–325, 2016.
 Jacob et al. (2017) Jacob, P. E., O’Leary, J., and Atchadé, Y. F. Unbiased Markov chain Monte Carlo with couplings. arXiv preprint arXiv:1708.03625, 2017.
 Jaderberg et al. (2017) Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., Silver, D., and Kavukcuoglu, K. Decoupled neural interfaces using synthetic gradients. In International Conference on Machine Learning, pp. 1627–1635. JMLR. org, 2017.
 Jameson (1988) Jameson, A. Aerodynamic design via control theory. Journal of Scientific Computing, 3(3):233–260, 1988.
 Kahn (1955) Kahn, H. Use of different Monte Carlo sampling techniques. 1955.
 Kim et al. (2018) Kim, Y., Wiseman, S., Miller, A. C., Sontag, D., and Rush, A. M. Semi-amortized variational autoencoders. In International Conference on Machine Learning, 2018.
 Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2014.
 Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
 Kuti (1982) Kuti, J. Stochastic method for the numerical study of lattice fermions. Physical Review Letters, 49(3):183, 1982.
 Lorraine & Duvenaud (2018) Lorraine, J. and Duvenaud, D. Stochastic hyperparameter optimization through hypernetworks. arXiv preprint arXiv:1802.09419, 2018.

 Lyne et al. (2015) Lyne, A.M., Girolami, M., Atchadé, Y., Strathmann, H., Simpson, D., et al. On Russian roulette estimates for Bayesian inference with doubly-intractable likelihoods. Statistical Science, 30(4):443–467, 2015.
 Maclaurin et al. (2015) Maclaurin, D., Duvenaud, D., and Adams, R. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113–2122, 2015.
 McLeish (2010) McLeish, D. A general method for debiasing a Monte Carlo estimator. Monte Carlo Methods and Applications, 2010.
 Merity et al. (2017) Merity, S., Keskar, N. S., and Socher, R. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182, 2017.
 Merity et al. (2018) Merity, S., Keskar, N. S., and Socher, R. An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240, 2018.
 Metz et al. (2018) Metz, L., Maheswaranathan, N., Nixon, J., Freeman, C. D., and Sohl-Dickstein, J. Learned optimizers that outperform SGD on wall-clock and validation loss. arXiv preprint arXiv:1810.10180, 2018.
 Ravi & Larochelle (2016) Ravi, S. and Larochelle, H. Optimization as a model for fewshot learning. In International Conference on Learning Representations, 2016.
 Rhee & Glynn (2012) Rhee, C.-h. and Glynn, P. W. A new approach to unbiased estimation for SDEs. In Proceedings of the Winter Simulation Conference, pp. 17. Winter Simulation Conference, 2012.
 Rhee & Glynn (2015) Rhee, C.-h. and Glynn, P. W. Unbiased estimation with square root convergence for SDE models. Operations Research, 63(5):1026–1043, 2015.
 Rudin et al. (1976) Rudin, W. et al. Principles of Mathematical Analysis, volume 3. McGraw-Hill, New York, 1976.
 Rychlik (1990) Rychlik, T. Unbiased nonparametric estimation of the derivative of the mean. Statistics & Probability Letters, 10(4):329–333, 1990.

 Rychlik (1995) Rychlik, T. A class of unbiased kernel estimates of a probability density function. Applicationes Mathematicae, 22(4):485–497, 1995.
 Schmidhuber (1987) Schmidhuber, J. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-… hook. PhD thesis, Technische Universität München, 1987.
 Shaban et al. (2018) Shaban, A., Cheng, C.A., Hatch, N., and Boots, B. Truncated backpropagation for bilevel optimization. arXiv preprint arXiv:1810.10667, 2018.
 Spanier & Gelbard (1969) Spanier, J. and Gelbard, E. M. Monte Carlo Principles and Neutron Transport Problems. AddisonWesley Publishing Company, 1969.
 Tallec & Ollivier (2017) Tallec, C. and Ollivier, Y. Unbiasing truncated backpropagation through time. arXiv preprint arXiv:1705.08209, 2017.
 Trinh et al. (2018) Trinh, T. H., Dai, A. M., Luong, T., and Le, Q. V. Learning longer-term dependencies in RNNs with auxiliary losses. arXiv preprint arXiv:1803.00144, 2018.
 Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Neural Information Processing Systems, pp. 5998–6008, 2017.
 Wagner (1987) Wagner, W. Unbiased Monte Carlo evaluation of certain functional integrals. Journal of Computational Physics, 71(1):21–33, 1987.
 Weber et al. (2019) Weber, T., Heess, N., Buesing, L., and Silver, D. Credit assignment techniques in stochastic computation graphs. arXiv preprint arXiv:1901.01761, 2019.
 Wei & Murray (2016) Wei, C. and Murray, I. Markov chain truncation for doubly-intractable inference. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.
 Wu et al. (2018) Wu, Y., Ren, M., Liao, R., and Grosse, R. Understanding short-horizon bias in stochastic meta-optimization. In International Conference on Learning Representations, 2018.
Appendix A Algorithm pseudocode
Appendix B Proofs
B.1 Proofs for Section 2
B.1.1 Proposition 2.1
Unbiasedness of RT estimators. The RT estimators in (2) are unbiased estimators of $Y_H$ as long as
$$\mathbb{E}_{N \sim q}\left[ W(n, N)\, \mathbb{1}\{n \le N\} \right] = 1 \quad \forall\, n \le H.$$
Proof.
A randomized telescope estimator which satisfies the above linear constraint condition has expectation
$$\mathbb{E}_{N \sim q}\left[ \sum_{n=1}^{N} W(n, N)\, \Delta_n \right] = \sum_{n=1}^{H} \Delta_n\, \mathbb{E}_{N \sim q}\left[ W(n, N)\, \mathbb{1}\{n \le N\} \right] = \sum_{n=1}^{H} \Delta_n = Y_H.$$
∎
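As a sanity check, the linear constraint can be verified numerically for the two standard weight choices, W_SS(n, N) = 1{n = N}/q(N) (single sample) and W_RR(n, N) = 1/P(N >= n) (Russian roulette); a small sketch with an arbitrary example distribution q:

```python
# Numerical check of the linear constraint E_{N~q}[ W(n,N) 1{n <= N} ] = 1
# for the single-sample and Russian-roulette weight choices.
q = [0.5, 0.25, 0.15, 0.1]  # an arbitrary truncation distribution
H = len(q)

def w_ss(n, N):            # W(n, N) = 1{n = N} / q(N)
    return 1.0 / q[N] if n == N else 0.0

def w_rr(n, N):            # W(n, N) = 1 / P(N >= n)
    return 1.0 / sum(q[n:])

for W in (w_ss, w_rr):
    for n in range(H):
        # expectation of W(n, N) over N ~ q, restricted to N >= n
        expectation = sum(q[N] * W(n, N) for N in range(n, H))
        assert abs(expectation - 1.0) < 1e-9
```

Both choices satisfy the constraint for every n, so by the proposition each yields an unbiased estimator.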
B.2 Proofs for Section 4
B.2.1 Theorem 4.1
Bounded variance and compute with polynomial convergence of $\psi$. Assume $\psi_n$ converges according to $\psi_n \le c\, n^{-p}$ or faster, for constants $c > 0$ and $p > 0$. Choose the RTSS estimator with $q(n) \propto n^{-(p+1)/2}$. The resulting estimator achieves expected compute $\mathcal{H}_{H,(p-1)/2} / \mathcal{H}_{H,(p+1)/2}$, where $\mathcal{H}_{H,k} = \sum_{n=1}^{H} n^{-k}$ is the $H$th generalized harmonic number of order $k$, and expected squared norm $\mathbb{E}\|\hat{G}\|^2 \le c\, \mathcal{H}_{H,(p+1)/2}\, \mathcal{H}_{H,(p-1)/2}$. The limit $\lim_{H \to \infty} \mathcal{H}_{H,k}$ is finite iff $k > 1$, in which case it is given by the Riemann zeta function, $\zeta(k)$. Accordingly, the estimator achieves horizon-agnostic variance and expected compute bounds iff $p > 3$.
Proof.
Begin by noting the RTSS estimator returns $\Delta_n / q(n)$ with probability $q(n)$. Let $\tilde{q}(n) = n^{-(p+1)/2}$ and $Z = \sum_{n=1}^{H} \tilde{q}(n)$, such that $q(n) = \tilde{q}(n)/Z$. First, note $Z = \mathcal{H}_{H,(p+1)/2}$. Now inspect the expected squared norm $\mathbb{E}\|\hat{G}\|^2$:
$$\mathbb{E}\|\hat{G}\|^2 = \sum_{n=1}^{H} q(n)\, \frac{\mathbb{E}\|\Delta_n\|^2}{q(n)^2} = Z \sum_{n=1}^{H} \frac{\mathbb{E}\|\Delta_n\|^2}{\tilde{q}(n)} \le Z \sum_{n=1}^{H} \frac{\psi_n}{\tilde{q}(n)} \le c\, Z \sum_{n=1}^{H} n^{(1-p)/2} = c\, \mathcal{H}_{H,(p+1)/2}\, \mathcal{H}_{H,(p-1)/2}.$$
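The generalized harmonic numbers appearing in the bound are straightforward to compute directly; the following sketch (function name is ours) also illustrates the finite-limit condition via zeta(2) = pi^2/6:

```python
import math

def gen_harmonic(m, k):
    """Generalized harmonic number H_{m,k} = sum_{n=1}^{m} n^(-k)."""
    return sum(n ** -k for n in range(1, m + 1))

# The partial sums converge iff k > 1, in which case the limit is the
# Riemann zeta function: H_{m,2} approaches zeta(2) = pi^2 / 6 as m grows.
assert abs(gen_harmonic(100_000, 2) - math.pi ** 2 / 6) < 1e-4
```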