
# Super-efficiency of automatic differentiation for functions defined as a minimum

In min-min optimization or max-min optimization, one has to compute the gradient of a function defined as a minimum. In most cases, the minimum has no closed-form, and an approximation is obtained via an iterative algorithm. There are two usual ways of estimating the gradient of the function: using either an analytic formula obtained by assuming exactness of the approximation, or automatic differentiation through the algorithm. In this paper, we study the asymptotic error made by these estimators as a function of the optimization error. We find that the error of the automatic estimator is close to the square of the error of the analytic estimator, reflecting a super-efficiency phenomenon. The convergence of the automatic estimator greatly depends on the convergence of the Jacobian of the algorithm. We analyze it for gradient descent and stochastic gradient descent and derive convergence rates for the estimators in these cases. Our analysis is backed by numerical experiments on toy problems and on Wasserstein barycenter computation. Finally, we discuss the computational complexity of these estimators and give practical guidelines to choose between them.

12/08/2020


## 1 Introduction

In machine learning, many objective functions are expressed as the minimum of another function: functions defined as

$$\ell(x) = \min_{z \in \mathbb{R}^m} L(z, x), \tag{1}$$

where $L : \mathbb{R}^m \times \mathbb{R}^n \to \mathbb{R}$. Such a formulation arises for instance in dictionary learning, where $x$ is a dictionary, $z$ a sparse code, and $L$ is the Lasso cost (Mairal et al., 2010). In this case, $\ell$ measures the ability of the dictionary to encode the input data. Another example is the computation of the Wasserstein barycenter of distributions in optimal transport (Agueh and Carlier, 2011): $x$ represents the barycenter, $\ell$ is the sum of distances to the input distributions, and the distances themselves are defined by minimizing the transport cost. In the field of optimization, formulation (1) is also encountered as a smoothing technique, for instance in reweighted least-squares (Daubechies et al., 2010), where $L$ is smooth but $\ell$ is not. In game theory, such problems naturally appear in two-player maximin games (von Neumann, 1928), with applications for instance to generative adversarial nets (Goodfellow et al., 2014). In this setting, $\ell$ should be maximized.

A key point to optimize $\ell$ – either maximize or minimize it – is usually to compute its gradient $g^*(x) \triangleq \nabla \ell(x)$. If the minimizer $z^*(x)$ is available, the first-order optimality condition imposes $\nabla_z L(z^*(x), x) = 0$, and the gradient is given by

$$g^*(x) = \nabla_x L(z^*(x), x). \tag{2}$$

However, in most cases the minimizer $z^*(x)$ is not available in closed form. It is approximated via an iterative algorithm, which produces a sequence of iterates $z_t(x)$. There are then three ways to estimate $g^*(x)$:

The analytic estimator corresponds to plugging the approximation $z_t(x)$ in (2):

$$g_1^t(x) \triangleq \nabla_x L(z_t(x), x). \tag{3}$$

The automatic estimator is $g_2^t(x) \triangleq \frac{d}{dx}\,L(z_t(x), x)$, where the derivative is computed with respect to $x$ through $z_t(x)$ as well. The chain rule gives

$$g_2^t(x) = \nabla_x L(z_t(x), x) + \frac{\partial z_t}{\partial x} \nabla_z L(z_t(x), x). \tag{4}$$

This expression can be computed efficiently using automatic differentiation (Baydin et al., 2018), in most cases at a cost similar to that of computing $z_t(x)$.

If $\nabla_{zz} L(z^*(x), x)$ is invertible, the implicit function theorem gives $\frac{\partial z^*}{\partial x} = J(z^*(x), x)$, where $J(z, x) \triangleq -\nabla_{xz} L(z, x)\,\left[\nabla_{zz} L(z, x)\right]^{-1}$. The implicit estimator is

$$g_3^t(x) \triangleq \nabla_x L(z_t(x), x) + J(z_t(x), x)\,\nabla_z L(z_t(x), x). \tag{5}$$

This estimator can be more costly to compute than the previous ones, as a linear system has to be solved.
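As a concrete sanity check, the three estimators can be compared on a small ridge-regression inner problem, where the minimizer, the exact gradient and the implicit Jacobian all have closed forms. The following numpy sketch is illustrative only: the matrix, data, step size and iteration count are hypothetical choices, not taken from the paper.

```python
import numpy as np

# Inner problem L(z, x) = 0.5*|x - D z|^2 + 0.5*lam*|z|^2 (the loss L1 of
# section 5). Closed forms let us compare the three estimators exactly.
D = np.array([[1.0, 0.2, 0.0, 0.1],
              [0.0, 1.2, 0.3, 0.0],
              [0.2, 0.0, 0.9, 0.4],
              [0.1, 0.3, 0.0, 1.1],
              [0.0, 0.1, 0.2, 0.8]])
x = np.array([1.0, -0.5, 0.3, 0.8, -0.2])
lam = 1.0
n, m = D.shape

H = D.T @ D + lam * np.eye(m)          # Hessian of L in z (constant here)
z_star = np.linalg.solve(H, D.T @ x)   # exact minimizer z*(x)
g_star = x - D @ z_star                # exact g*(x) = nabla_x L(z*(x), x)

# Gradient descent on z, propagating the Jacobian J_t = dz_t/dx forward.
rho = 1.0 / np.linalg.eigvalsh(H).max()
z, J = np.zeros(m), np.zeros((m, n))
for _ in range(30):
    z = z - rho * (D.T @ (D @ z - x) + lam * z)
    J = J - rho * (H @ J - D.T)        # derivative of the update w.r.t. x

grad_x = x - D @ z                     # nabla_x L(z_t, x)
grad_z = D.T @ (D @ z - x) + lam * z   # nabla_z L(z_t, x)
g1 = grad_x                            # analytic estimator (3)
g2 = grad_x + J.T @ grad_z             # automatic estimator (4)
g3 = grad_x + D @ np.linalg.solve(H, grad_z)   # implicit estimator (5)

err1, err2, err3 = (np.linalg.norm(g - g_star) for g in (g1, g2, g3))
```

On this quadratic problem the implicit estimator is exact up to floating-point error, and the automatic estimator lands much closer to $g^*$ than the analytic one, as Propositions 1–3 below predict.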

These estimators have been proposed and used by different communities. The analytic one corresponds to alternate optimization of $L$, where one updates $x$ while considering that $z_t$ is fixed. It is used for instance in dictionary learning (Olshausen and Field, 1997; Mairal et al., 2010) or in optimal transport (Feydy et al., 2019). The second is common in the deep learning community as a way to differentiate through optimization problems (Gregor and Le Cun, 2010). Recently, it has been used to accelerate convolutional dictionary learning (Tolooshams et al., 2018). It has also been used to differentiate through the Sinkhorn algorithm in optimal transport applications (Boursier and Perchet, 2019; Genevay et al., 2018). It integrates smoothly in machine learning frameworks, with dedicated libraries (Abadi et al., 2016; Paszke et al., 2019). The third one is found in bi-level optimization, for instance for hyperparameter optimization (Bengio, 2000). It is also the cornerstone of the use of convex optimization as layers in neural networks (Agrawal et al., 2019).

Contribution In this article, we want to answer the following question: which one of these estimators is the best? The central result, presented in section 2, is the following set of convergence speeds, which hold when $z^*$ is differentiable and under mild regularity hypotheses (Propositions 1, 2 and 3):

$$|g_1^t(x) - g^*(x)| = O(|z_t(x) - z^*(x)|),$$
$$|g_2^t(x) - g^*(x)| = o(|z_t(x) - z^*(x)|),$$
$$|g_3^t(x) - g^*(x)| = O(|z_t(x) - z^*(x)|^2).$$

This is a super-efficiency phenomenon for the automatic estimator, illustrated in Figure 1 on a toy example. As our analysis reveals, the bound on $g_2^t$ depends on the convergence speed of the Jacobian of $z_t$, which itself depends on the optimization algorithm used to produce $z_t$. In section 3, we build on the work of Gilbert (1992) and give accurate bounds on the convergence of the Jacobian for gradient descent (Proposition 5) and stochastic gradient descent (Propositions 8 and 9) in the strongly convex case. We then study a simple case of a non-strongly convex problem (Proposition 12). To the best of our knowledge, these bounds are novel. This analysis allows us to refine the convergence rates of the gradient estimators.

In section 4, we start by recalling and extending the consequences of using inexact gradients in an optimization algorithm (Propositions 13 and 14). Then, since each gradient estimator comes at a different cost, we put the convergence bounds developed in the paper in perspective with a complexity analysis. This leads to practical and principled guidelines about which estimator should be used in which case. Finally, we provide numerical illustrations of the aforementioned results in section 5.
Notation  The Euclidean norm of a vector $z$ is $|z|$. The operator norm of a matrix $M$ is $\|M\|$ and its Frobenius norm is $\|M\|_F$. The vector of size $n$ full of $1$s is $\mathbf{1}_n$. The Euclidean scalar product is $\langle \cdot \,|\, \cdot \rangle$.

The proofs are only sketched in the article; full proofs are deferred to the appendix.

## 2 Convergence speed of gradient estimates

We consider a compact set $\mathcal{X} \subset \mathbb{R}^n$. We make the following assumptions on $L$.

H1: $L$ is twice differentiable over $\mathbb{R}^m \times \mathcal{X}$, with second derivatives $\nabla_{zz} L$ and $\nabla_{xz} L$ respectively $L_{zz}$- and $L_{xz}$-Lipschitz with respect to $z$.

H2: For all $x \in \mathcal{X}$, $L(\cdot, x)$ has a unique minimizer $z^*(x)$. The mapping $x \mapsto z^*(x)$ is differentiable, with Jacobian $J^* \triangleq \frac{\partial z^*}{\partial x}$.

H1 implies that $\nabla_x L$ and $\nabla_z L$ are Lipschitz, with respective constants $L_x$ and $L_z$. The Jacobian of $z_t$ at $x$ is $J_t \triangleq \frac{\partial z_t}{\partial x}$. For the rest of the section, we consider a fixed $x \in \mathcal{X}$, and we denote $z_t = z_t(x)$, $z^* = z^*(x)$ and $g^* = g^*(x)$.

### 2.1 Analytic estimator g1

The analytic estimator (3) approximates $g^*$ as well as $z_t$ approximates $z^*$, since $\nabla_x L$ is $L_x$-Lipschitz.

###### Proposition 1 (Convergence of the analytic estimator).

The analytic estimator verifies $|g_1^t - g^*| \leq L_x\,|z_t - z^*|$.

### 2.2 Automatic estimator g2

The automatic estimator (4) can be written as

$$g_2^t = g^* + R(J_t)(z_t - z^*) + R_{xz} + J_t R_{zz}, \tag{6}$$

where

$$R_{xz} \triangleq \nabla_x L(z_t, x) - \nabla_x L(z^*, x) - \nabla_{xz} L(z^*, x)(z_t - z^*),$$
$$R_{zz} \triangleq \nabla_z L(z_t, x) - \nabla_{zz} L(z^*, x)(z_t - z^*)$$

are Taylor remainders and

$$R(J) \triangleq J\,\nabla_{zz} L(z^*, x) + \nabla_{xz} L(z^*, x). \tag{7}$$

The implicit function theorem states that $R(J^*) = 0$. Importantly, in a non-strongly-convex setting where $\nabla_{zz} L(z^*, x)$ is not invertible, it might happen that $R(J_t)$ goes to $0$ even though $J_t$ does not converge to $J^*$. H1 implies a quadratic bound on the remainders:

$$|R_{xz}| \leq \frac{L_{xz}}{2}|z_t - z^*|^2 \quad \text{and} \quad |R_{zz}| \leq \frac{L_{zz}}{2}|z_t - z^*|^2. \tag{8}$$

We assume that $\|J_t\|$ is bounded by a constant $L_J$. This holds when $J_t$ converges, which is the subject of section 3. The triangle inequality in Equation 6 gives:

###### Proposition 2 (Convergence of the automatic estimator).

We define

$$L \triangleq L_{xz} + L_J L_{zz}. \tag{9}$$

Then $|g_2^t - g^*| \leq \|R(J_t)\|\,|z_t - z^*| + \frac{L}{2}|z_t - z^*|^2$.

This proposition shows that the rate of convergence of $g_2^t$ depends on the speed of convergence of $R(J_t)$ to $0$. For instance, if $\|R(J_t)\|$ goes to $0$, we have

$$g_2^t - g^* = o(|z_t - z^*|).$$

Unfortunately, it might happen that, even though $z_t$ goes to $z^*$, $J_t$ does not go to $J^*$, since differentiation is not a continuous operation: for instance, the functions $x \mapsto \sin(tx)/t$ converge uniformly to $0$, while their derivatives $x \mapsto \cos(tx)$ do not converge. In section 3, we refine this convergence rate by analyzing the convergence speed of the Jacobian in different settings.

### 2.3 Implicit estimator g3

The implicit estimator (5) is well defined provided that $\nabla_{zz} L(z_t, x)$ is invertible. We obtain convergence bounds by making a Lipschitz assumption on $J$.

###### Proposition 3.

[Convergence of the implicit estimator] Assume that $J(\cdot, x)$ is $L_J$-Lipschitz with respect to its first argument, and that $\|\nabla_{zz} L\| \leq L_z$. Then, for $L$ as defined in (9),

$$|g_3^t - g^*| \leq \left(\frac{L}{2} + L_J L_z\right)|z_t - z^*|^2. \tag{10}$$
###### Sketch of proof.

The proof is similar to that of Proposition 2, using the fact that $R(J^*) = 0$. ∎

Therefore, this estimator converges twice as fast as $g_1^t$, and at least as fast as $g_2^t$. Just like $g_1^t$, this estimator does not need to store the past iterates in memory, since it is a function of $z_t$ and $x$ only. However, it is usually much more costly to compute.

### 2.4 Link with bi-level optimization

Bi-level optimization appears in a variety of machine-learning problems, such as hyperparameter optimization (Pedregosa, 2016) or supervised dictionary learning (Mairal et al., 2012). It considers problems of the form

$$\min_{x \in \mathbb{R}^n} \ell'(x) \triangleq L'(z^*(x), x) \quad \text{s.t.} \quad z^*(x) \in \operatorname*{arg\,min}_{z \in \mathbb{R}^m} L(z, x), \tag{11}$$

where $L'$ is another objective function. The setting of our paper is the special instance of bi-level optimization where $L' = L$. The gradient of $\ell'$ is

$$g'^* = \nabla_x \ell'(x) = \nabla_x L'(z^*, x) + J^*\,\nabla_z L'(z^*, x).$$

When $z_t(x)$ is a sequence of approximate minimizers of $L(\cdot, x)$, gradient estimates can be defined as

$$g'^t_1 = \nabla_x L'(z_t(x), x),$$
$$g'^t_2 = \nabla_x L'(z_t(x), x) + J_t\,\nabla_z L'(z_t(x), x),$$
$$g'^t_3 = \nabla_x L'(z_t(x), x) + J(z_t(x), x)\,\nabla_z L'(z_t(x), x).$$

Here, $\nabla_z L'(z^*, x) \neq 0$ in general, since $z^*$ does not minimize $L'(\cdot, x)$. Hence, $g'^t_1$ does not estimate $g'^*$. Moreover, in general,

$$\nabla_{xz} L'(z^*, x) + J^*\,\nabla_{zz} L'(z^*, x) \neq 0.$$

Therefore, there is no cancellation to allow super-efficiency of $g'^t_2$ and $g'^t_3$, and we only obtain linear rates:

$$g'^t_2 - g'^* = O(|z_t - z^*|), \qquad g'^t_3 - g'^* = O(|z_t - z^*|).$$
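To make the bi-level case concrete, here is a numpy sketch of the "automatic" hypergradient $g'^t_2$, obtained by propagating $J_t$ through the inner iterations. The setting is hypothetical and not from the paper: a scalar regularization weight plays the role of $x$, a ridge fit is the inner problem $L$, and a validation loss is $L'$.

```python
import numpy as np

# Bi-level sketch: inner ridge fit, outer validation loss. The outer variable
# x is a (hypothetical) scalar regularization weight, so the hypergradient
# has a closed form to compare against. All data below are toy values.
A = np.array([[1.0, 0.2, 0.0],
              [0.3, 0.9, 0.1],
              [0.0, 0.2, 1.1],
              [0.2, 0.0, 0.7],
              [0.1, 0.4, 0.0]])
b = np.array([1.0, 0.2, -0.5, 0.7, 0.1])
A_val = np.array([[0.8, 0.1, 0.3],
                  [0.0, 1.0, 0.2],
                  [0.4, 0.0, 0.9],
                  [0.2, 0.5, 0.0]])
b_val = np.array([0.5, -0.3, 0.8, 0.2])
x = 0.5                                  # outer variable: regularization weight

H = A.T @ A + x * np.eye(3)
z_star = np.linalg.solve(H, A.T @ b)     # exact inner solution z*(x)
J_star = -np.linalg.solve(H, z_star)     # dz*/dx via the implicit function theorem
g_exact = J_star @ (A_val.T @ (A_val @ z_star - b_val))  # exact hypergradient

# Inner gradient descent, propagating the derivative J_t = dz_t/dx.
rho = 1.0 / np.linalg.eigvalsh(H).max()
z, J = np.zeros(3), np.zeros(3)
for _ in range(200):
    z, J = (z - rho * (A.T @ (A @ z - b) + x * z),
            J - rho * (H @ J + z))       # note: z_t itself enters the J update
g2 = J @ (A_val.T @ (A_val @ z - b_val))  # "automatic" hypergradient g'_2
```

Since $L'$ does not depend on $x$ directly here, $\nabla_x L' = 0$ and $g'^t_1 = 0$: only the Jacobian term recovers the hypergradient, and only at the linear rate discussed above.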

## 3 Convergence speed of the Jacobian

In order to get a better understanding of the convergence properties of the gradient estimators – in particular $g_2^t$ – we analyze them in different settings. A large portion of the analysis is devoted to the convergence of $J_t$ to $J^*$, since it does not directly follow from the convergence of $z_t$ to $z^*$. In most cases, we show convergence of $J_t$ to $J^*$, and use

$$\|R(J_t)\| \leq L_z \|J_t - J^*\| \tag{12}$$

in the bound of Proposition 2.

### 3.1 Contractive setting

When the $z_t$ are the iterates of a fixed-point iteration with a contractive mapping, we recall the following result due to Gilbert (1992).

###### Proposition 4 (Convergence speed of the Jacobian).

Assume that $z_t(x)$ is produced by a fixed-point iteration

$$z_{t+1}(x) = \Phi(z_t(x), x),$$

where $\Phi$ is differentiable. We suppose that $\Phi$ is contractive: there exists $\kappa < 1$ such that $\|\partial_z \Phi(z, x)\| \leq \kappa$ for all $z$. Under mild regularity conditions on $\Phi$:

• $z_t$ converges to a differentiable function $z^*$ such that $z^*(x) = \Phi(z^*(x), x)$, with Jacobian $J^*$;

• $J_t$ converges to $J^*$ as $t \to +\infty$.

### 3.2 Gradient descent in the strongly convex case

We consider the gradient descent iterations produced by the mapping $\Phi(z, x) = z - \rho \nabla_z L(z, x)$, with a step size $\rho$. We assume that $L$ is $\mu$-strongly convex with respect to $z$, i.e. $\nabla_{zz} L(z, x) \succeq \mu\,\mathrm{Id}$ for all $z, x$. In this setting, $\Phi$ satisfies the hypotheses of Proposition 4, and we obtain precise bounds.

###### Proposition 5.

[Convergence speed of the Jacobian of gradient descent in a strongly convex setting] Let $z_t$ be produced by the recursion $z_{t+1} = z_t - \rho \nabla_z L(z_t, x)$, with $\rho \leq 1/L_z$ and $\kappa \triangleq 1 - \rho\mu$. We have $|z_t - z^*| \leq |z_0 - z^*|\,\kappa^t$ and $\|J_t - J^*\| \leq \|J_0 - J^*\|\,\kappa^t + \rho L\,t\,\kappa^{t-1}\,|z_0 - z^*|$, where $L$ is defined in (9).

###### Sketch of proof (c.1).

We show that $\Delta_t \triangleq \|J_t - J^*\|$ satisfies the recursive inequality $\Delta_{t+1} \leq \kappa\,\Delta_t + \rho L\,|z_t - z^*|$. ∎

As a consequence, Prop. 1, 2, 3 together with Eq. (12) give in this case

$$|g_1^t - g^*| \leq L_x |z_0 - z^*|\,\kappa^t,$$
$$|g_2^t - g^*| \leq \left(\rho L_z t + \frac{\kappa}{2}\right) L\,|z_0 - z^*|^2\,\kappa^{2t-1}, \tag{13}$$
$$|g_3^t - g^*| \leq \left(\frac{L}{2} + L_J L_z\right)|z_0 - z^*|^2\,\kappa^{2t}.$$

We get the convergence speed $|g_2^t - g^*| = O(t\,\kappa^{2t})$, which is almost twice better than the $O(\kappa^t)$ rate for $g_1^t$. Importantly, the order of magnitude in Proposition 5 is tight, as can be seen in Proposition 15.
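The rate doubling in (13) can be checked numerically on a scalar quadratic, where the errors decay exactly as $\kappa^t$ for $g_1^t$ and $\kappa^{2t}$ for $g_2^t$. The constants below are toy values chosen for illustration, not from the paper.

```python
import numpy as np

# Scalar quadratic toy: L(z, x) = 0.5*(d*z - x)^2 + 0.5*lam*z^2. Here
# z_t - z* and J_t - J* both decay as kappa^t, so the error of g2 decays
# as kappa^{2t}: its log-error falls twice as fast as that of g1.
d, lam, x = 2.0, 0.5, 1.0
H = d * d + lam                  # scalar Hessian of L in z
z_star = d * x / H
g_star = x - d * z_star          # nabla_x L at the minimizer
rho = 0.1                        # kappa = 1 - rho*H = 0.55

def errors(T):
    z, J = 0.0, 0.0
    for _ in range(T):
        z, J = z - rho * (H * z - d * x), J - rho * (H * J - d)
    g1 = x - d * z
    g2 = g1 + J * (d * (d * z - x) + lam * z)   # chain rule
    return abs(g1 - g_star), abs(g2 - g_star)

e1_5, e2_5 = errors(5)
e1_10, e2_10 = errors(10)
# The ratio of g2 errors equals the *square* of the ratio of g1 errors.
r1, r2 = e1_10 / e1_5, e2_10 / e2_5
```

On this example $g_2^t - g^* = H (J_t - J^*)(z_t - z^*)$ holds exactly, so the doubled exponential rate is visible without any higher-order terms.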

### 3.3 Stochastic gradient descent in z

We provide an analysis of the convergence of $J_t$ in the stochastic gradient descent setting, assuming once again the $\mu$-strong convexity of $L$ in $z$. We suppose that $L$ is an expectation

$$L(z, x) = \mathbb{E}_\xi\left[C(z, x, \xi)\right],$$

where $\xi$ is drawn from a distribution $d$, and $C$ is twice differentiable in $(z, x)$. Stochastic gradient descent (SGD) with steps $\rho_t$ iterates

$$z_{t+1}(x) = z_t(x) - \rho_t \nabla_z C(z_t(x), x, \xi_{t+1}), \quad \text{where } \xi_{t+1} \sim d.$$

In the stochastic setting, Proposition 2 becomes

###### Proposition 6.

Define

$$\delta_t = \mathbb{E}\left[\|J_t - J^*\|_F^2\right] \quad \text{and} \quad d_t = \mathbb{E}\left[|z_t - z^*|^2\right]. \tag{14}$$

We have $\mathbb{E}\left[|g_2^t - g^*|\right] \leq L_z \sqrt{\delta_t}\sqrt{d_t} + \frac{L}{2}\,d_t$.

###### Sketch of proof (c.3).

We use the Cauchy–Schwarz inequality and the norm inequality $\|M\| \leq \|M\|_F$ to bound $\mathbb{E}[\|R(J_t)\|\,|z_t - z^*|]$. ∎

We begin by deriving a recursive inequality on $\delta_t$, inspired by classical analysis techniques for SGD (Moulines and Bach, 2011).

###### Proposition 7.

[Bounding inequality for the Jacobian] We assume bounded Hessian noise, in the sense that the second derivatives of $C$ have bounded second moments, yielding constants $r$, $L$ and $B$ in the bound below. We have

$$\delta_{t+1} \leq (1 - 2\rho_t\mu)\,\delta_t + 2\rho_t \sqrt{r}\,L\,\sqrt{d_t}\sqrt{\delta_t} + \rho_t^2 B^2. \tag{15}$$
###### Sketch of proof (c.4).

A standard strong convexity argument gives the bound

$$\delta_{t+1} \leq (1 - 2\rho_t\mu)\,\delta_t + 2\rho_t \sqrt{r}\,L\,\mathbb{E}\left[\|J_t - J^*\|_F\,|z_t - z^*|\right] + \rho_t^2 B^2.$$

The middle term is then bounded using the Cauchy–Schwarz inequality. ∎

Therefore, any convergence bound on $d_t$ provides a convergence bound on $\delta_t$ by unrolling Eq. (15). We first analyze the fixed step-size case, using the simple "bounded gradients" hypothesis and the bounds on $d_t$ from Moulines and Bach (2011). In this setting, the iterates converge linearly until they reach a threshold caused by gradient variance.

###### Proposition 8.

[SGD with constant step-size] Assume that the gradients have bounded variance $\sigma^2$. Assume a constant step size $\rho_t = \rho$, and let $\kappa = 1 - 2\rho\mu$. In this setting,

$$\delta_t \leq \left(\kappa^{t/2}\left(\|J^*\|_F + t\,\alpha\right) + B_2\right)^2,$$

where $\alpha$ and $B_2$ are constants depending on $\rho$, $\mu$, $r$, $L$, $\sigma$ and $B$.

###### Sketch of proof (c.5).

Moulines and Bach (2011) give a linear convergence bound on $d_t$ up to a noise level, which implies a similar bound on $\sqrt{d_t}$. A bit of work on Eq. (15) then gives

$$\sqrt{\delta_{t+1}} \leq \sqrt{\kappa}\,\sqrt{\delta_t} + \alpha\,\kappa^{t/2} + \left(1 - \sqrt{\kappa}\right)B_2.$$

Unrolling the recursion gives the proposed bound. ∎

This bound showcases that the familiar "two-stage" behavior of SGD also holds for $\delta_t$: a transient linear decay at rate $\kappa^{t/2}$ in the first iterations, and then convergence to a stationary noise level $B_2^2$. This bound is, however, not tight enough to provide a good estimate of the noise level in $\delta_t$. Let $B_3^2$ be the limit of the sequence defined by the recursion (15), where $\beta$ denotes the stationary noise level of $\sqrt{d_t}$. We find

$$B_3^2 = (1 - 2\rho\mu)B_3^2 + 2\rho\sqrt{r}\,L\,\beta\,B_3 + \rho^2 B^2,$$

which gives by expanding

$$B_3 = \frac{\sqrt{\rho}\,\sqrt{r}\,L\,\sigma}{(2\mu)^{3/2}}\left(1 + \sqrt{1 + \frac{4\mu^2 B^2}{r L^2 \sigma^2}}\right).$$

This noise level scales as $\sqrt{\rho}$, which is observed in practice. In this scenario, Prop. 1, 2 and 3 show that $g_1^t$ reaches a noise level proportional to $\sqrt{\rho}$, just like $z_t$, while both $g_2^t$ and $g_3^t$ reach a noise level proportional to $\rho$: $g_2^t$ and $g_3^t$ can estimate $g^*$ to an accuracy that can never be reached by $g_1^t$. We now turn to the decreasing step-size case.
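The two noise floors can be observed numerically. The following sketch runs constant-step SGD on a ridge loss written as an expectation over rows, propagates $J_t$ through the stochastic updates, and compares the plateau errors of $g_1^t$ and $g_2^t$. The per-sample cost $C(z, x, i) = \frac{n}{2}(x_i - d_i^\top z)^2 + \frac{\lambda}{2}|z|^2$ and all sizes, data and step sizes are hypothetical choices for illustration.

```python
import numpy as np

# SGD noise-floor sketch: |g1 - g*| plateaus at O(sqrt(rho)) like |z_t - z*|,
# while |g2 - g*| plateaus at O(rho). Toy instantiation, not from the paper.
rng = np.random.default_rng(0)
n, m, lam, rho = 5, 4, 1.0, 5e-4
D = rng.standard_normal((n, m)) / np.sqrt(m)
x = rng.standard_normal(n)

H = D.T @ D + lam * np.eye(m)
z_star = np.linalg.solve(H, D.T @ x)
g_star = x - D @ z_star

z, J = np.zeros(m), np.zeros((m, n))
e1, e2 = [], []
for t in range(20000):
    i = rng.integers(n)
    d_i = D[i]
    # stochastic gradient of C w.r.t. z, and its derivative w.r.t. x
    z = z - rho * (n * d_i * (d_i @ z - x[i]) + lam * z)
    J = J - rho * (n * np.outer(d_i, d_i @ J - np.eye(n)[i]) + lam * J)
    if t >= 10000:                      # measure only on the plateau
        grad_z = D.T @ (D @ z - x) + lam * z
        e1.append(np.linalg.norm((x - D @ z) - g_star))
        e2.append(np.linalg.norm((x - D @ z) + J.T @ grad_z - g_star))
```

With this small step size, the measured plateau of the automatic estimator sits clearly below that of the analytic one, consistent with the $\rho$ versus $\sqrt{\rho}$ scaling.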

###### Proposition 9.

[SGD with decreasing step-size] Assume that $\rho_t = \rho_0 t^{-\alpha}$ with $0 < \alpha \leq 1$. Assume a bound on $d_t$ of the form $d_t \leq d^2 t^{-\alpha}$. Then

$$\delta_t \leq \frac{4\rho_0 B^2 \mu + r L^2 d^2}{\mu^2}\,t^{-\alpha} + o(t^{-\alpha}).$$
###### Sketch of proof (c.6).

We use (15) to obtain a recursion

$$\delta_{t+1} \leq \left(1 - \mu\rho_0 t^{-\alpha}\right)\delta_t + \left(B^2\rho_0^2 + \frac{r L^2 d^2 \rho_0}{\mu}\right)t^{-2\alpha},$$

which is then unrolled. ∎

When $\alpha < 1$, the results of Moulines and Bach (2011) show that $d_t = O(t^{-\alpha})$, so the assumption is verified for some $d$. One could use their precise bounds to obtain non-asymptotic bounds on $\delta_t$ as well.

Overall, we recover bounds for $\delta_t$ with the same behavior as the bounds for $d_t$. Plugging them into Proposition 6 gives the asymptotic behavior of the gradient estimators.

###### Proposition 10 (Convergence speed of the gradient estimators for SGD with decreasing step).

Assume that $\rho_t = \rho_0 t^{-\alpha}$ with $0 < \alpha < 1$. Then

$$\mathbb{E}_\xi\left[|g_1^t - g^*|\right] = O(t^{-\alpha/2}), \qquad \mathbb{E}_\xi\left[|g_2^t - g^*|\right] = O(t^{-\alpha}), \qquad \mathbb{E}_\xi\left[|g_3^t - g^*|\right] = O(t^{-\alpha}).$$

The super-efficiency of $g_2^t$ and $g_3^t$ is once again illustrated, as they converge at the same speed as $d_t$.

### 3.4 Beyond strong convexity

All the previous results rely critically on the strong convexity of $L$ in $z$. A function $f$ with minimum $f^*$ is $p$-Łojasiewicz (Attouch and Bolte, 2009) when $|\nabla f(z)|^2 \geq c\,(f(z) - f^*)^{2 - 2/p}$ for some $c > 0$. Any strongly convex function is $2$-Łojasiewicz: the set of $p$-Łojasiewicz functions for $p > 2$ offers a framework beyond strong convexity that still provides convergence rates on the iterates. The general study of gradient descent on this class of functions is out of the scope of this paper. We analyze a simple class of $p$-Łojasiewicz functions, the least mean $p$-th problem, where

$$L(z, x) \triangleq \frac{1}{p}\sum_{i=1}^n \left(x_i - [Dz]_i\right)^p \tag{16}$$

for an even integer $p$ and a matrix $D \in \mathbb{R}^{n \times m}$ that is overcomplete ($m > n$, full rank). In this simple case, $L$ is minimized by cancelling the residuals $x - Dz$, so that $\ell(x) = 0$ and $g^* = 0$.

In the case of least squares ($p = 2$), we can perfectly describe the behavior of gradient descent, which converges linearly.

###### Proposition 11.

Let $z_t$ be the iterates of gradient descent with step $\rho < 1/\|D\|^2$ on (16) with $p = 2$, started from a point $z_0$ independent of $x$ (so that $J_0 = 0$). It holds

$$g_1^t = D(z_t - z^*), \qquad g_2^t = D(z_{2t} - z^*) \qquad \text{and} \qquad g_3^t = 0.$$

###### Proof.

The iterates verify $z_{t+1} - z^* = (\mathrm{Id} - \rho D^\top D)(z_t - z^*)$, and we find $DJ_t = \mathrm{Id} - (\mathrm{Id} - \rho DD^\top)^t$. The result follows. ∎

The automatic estimator therefore goes exactly twice as fast as the analytic one towards $g^* = 0$, while the implicit estimator is exact. Next, we analyze the case $p > 2$ in a more restrictive setting.
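Proposition 11 can be verified numerically: for an overcomplete least-squares problem, the automatic estimator at iteration $t$ coincides, up to floating-point error, with the analytic estimator at iteration $2t$. A minimal sketch with illustrative data:

```python
import numpy as np

# Least-squares check: L(z, x) = 0.5*|x - D z|^2 with overcomplete D
# (full row rank, m > n). From z_0 = 0 and J_0 = 0, the automatic
# estimator satisfies g2(t) = g1(2t): exactly "twice as fast".
rng = np.random.default_rng(1)
n, m = 3, 6                             # m > n: overcomplete
D = rng.standard_normal((n, m))
x = rng.standard_normal(n)
rho = 0.5 / np.linalg.norm(D @ D.T, 2)  # safe step size

def g1_g2(T):
    z, J = np.zeros(m), np.zeros((m, n))
    for _ in range(T):
        r = x - D @ z                   # residual
        z = z + rho * (D.T @ r)         # gradient step on z
        J = J + rho * (D.T @ (np.eye(n) - D @ J))  # forward Jacobian
    g1 = x - D @ z                      # nabla_x L(z_t, x) = residual
    g2 = g1 - J.T @ (D.T @ g1)          # chain rule; nabla_z L = -D^T r
    return g1, g2

g1_t, g2_t = g1_g2(10)
g1_2t, _ = g1_g2(20)
```

The identity follows from $\mathrm{Id} - DJ_t = (\mathrm{Id} - \rho DD^\top)^t$, so that $g_2^t = (\mathrm{Id} - \rho DD^\top)^{2t} r_0$, i.e. the residual after $2t$ steps.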

###### Proposition 12.

For $p > 2$, we place ourselves in a more restrictive setting. Let $\alpha \triangleq \frac{p-1}{p-2}$. We have

$$|g_1^t| = O(t^{-\alpha}), \qquad |g_2^t| = O(t^{-2\alpha}), \qquad g_3^t = 0.$$
###### Sketch of proof (c.7).

We first show that the residuals $r_t = x - Dz_t$ verify $|r_t| = O(t^{-1/(p-2)})$, which gives the result for $g_1^t$. A similar recursion governs $DJ_t$; using the development of $r_t$ and unrolling this recursion concludes the proof. ∎

For this problem, $g_2^t$ is of the order of magnitude of $g_1^t$ squared, and as $p \to +\infty$, we see that the rate of convergence of $g_1^t$ goes to $t^{-1}$, while that of $g_2^t$ goes to $t^{-2}$.

## 4 Consequence on optimization

In this section, we study the impact of using the previous inexact estimators for first order optimization. These estimators nicely fit in the framework of inexact oracles introduced by Devolder et al. (2014).

### 4.1 Inexact oracle

We assume that $\ell$ is $\mu_x$-strongly convex and $L_x$-smooth with minimizer $x^*$. A $\delta$-inexact oracle is a couple $(\ell_\delta, g_\delta)$ such that $\ell_\delta$ is the inexact value function, $g_\delta$ is the inexact gradient, and for all $x, y$,

$$\frac{\mu_x}{2}|x - y|^2 \leq \ell(x) - \ell_\delta(y) - \langle g_\delta(y) \,|\, x - y\rangle \leq \frac{L_x}{2}|x - y|^2 + \delta. \tag{17}$$

Devolder et al. (2013) show that if the gradient approximation verifies $|g_i^t(x) - g^*(x)| \leq \Delta_i$ for all $x$, then $(\ell, g_i^t)$ is a $\delta_i$-inexact oracle, with

$$\delta_i = \Delta_i^2\left(\frac{1}{\mu_x} + \frac{1}{2L_x}\right). \tag{18}$$

We consider the optimization of $\ell$ with inexact gradient descent: starting from $x_0$, it iterates

$$x_{q+1} = x_q - \eta\,g_i^t(x_q), \tag{19}$$

with a step size $\eta$, a fixed number of inner iterations $t$, and $i = 1$, $2$ or $3$.

###### Proposition 13.

[Devolder et al. 2013, Theorem 4] The iterates (19) with estimate $g_i^t$ verify

$$\ell(x_q) - \ell(x^*) \leq 2L_x\left(1 - \frac{\mu_x}{4L_x}\right)^q |x_0 - x^*|^2 + \delta_i,$$

with $\delta_i$ defined in (18).

As $q$ goes to infinity, the error made on $\ell$ tends towards $\delta_i$. Thus, a more precise gradient estimate achieves a lower optimization error. This illustrates the importance of using gradient estimates with an error as small as possible.

We now consider stochastic optimization for our problem, with the loss defined as

$$\ell(x) = \mathbb{E}_\upsilon\left[h(x, \upsilon)\right] \quad \text{with} \quad h(x, \upsilon) = \min_z H(z, x, \upsilon).$$

Stochastic gradient descent with constant step size and inexact gradients iterates

$$x_{q+1} = x_q - \eta\,g_i^t(x_q, \upsilon_{q+1}),$$

where $g_i^t(x, \upsilon)$ is computed by an approximate minimization of $H(\cdot, x, \upsilon)$.

###### Proposition 14.

We assume that $h(\cdot, \upsilon)$ is $\mu_x$-strongly convex, $L_x$-smooth and verifies

$$\mathbb{E}\left[|\nabla_x h(x, \upsilon) - \nabla_x \ell(x)|^2\right] \leq \sigma^2.$$

The iterates of SGD with approximate gradient $g_i^t$ and step size $\eta$ verify

$$\mathbb{E}\,|x_q - x^*|^2 \leq \left(1 - \frac{\eta\mu_x}{2}\right)^q |x_0 - x^*|^2 + \frac{2\eta}{\mu_x}\sigma^2 + \frac{4}{\mu_x}\delta_i,$$

with $\delta_i$ defined in (18).

The proof is deferred to Appendix D. In this case, it is pointless to achieve an estimation error on the gradient smaller than some fraction of the gradient variance $\sigma^2$.

As a final note, these results extend without difficulty to the problem of maximizing $\ell$, by considering gradient ascent or stochastic gradient ascent.

### 4.2 Time and memory complexity

In the following, we put our results in perspective with a computational and memory complexity analysis, allowing us to provide practical guidelines for the optimization of $\ell$.

Computational complexity of the estimators  The cost of computing the estimators depends on the cost function $L$. We give a complexity analysis in the least-squares case (16), summarized in Table 1. In this case, computing the gradient $\nabla_z L$ takes $O(nm)$ operations; therefore, the cost of computing $z_t$ with gradient descent is $O(nmt)$. Computing $g_1^t$ comes at the same price. The estimator $g_2^t$ requires a reverse pass on the computational graph, which costs a factor $\omega$ of the forward computational cost: the final cost is $O(\omega nmt)$. Griewank and Walther (2008) showed that $\omega$ is typically a small constant. Finally, computing $g_3^t$ requires a costly Hessian computation and a linear system inversion, with a final cost of $O(nmt + m^2 n + m^3)$. The linear scaling of $g_1^t$ and $g_2^t$ in $t$ is highlighted in Figure A.3. In addition, computing $g_2^t$ usually requires storing all intermediate variables in memory, which might be a burden. However, some optimization algorithms are invertible, such as SGD with momentum (Maclaurin et al., 2015). In this case, no additional memory is needed.
Linear convergence: a case for the analytic estimator  In the time it takes to compute $g_2^t$, one can instead compute $g_1^{\omega t}$ at the same cost. If $z_t$ converges linearly at rate $\kappa$, Proposition 1 shows that $|g_1^{\omega t} - g^*| = O(\kappa^{\omega t})$, while Proposition 2 gives, at best, $|g_2^t - g^*| = O(\kappa^{2t})$: $g_1^{\omega t}$ is a better estimator of $g^*$ than $g_2^t$, provided that $\omega > 2$. In the quadratic case, we even have $g_2^t = g_1^{2t}$ (Proposition 11). Further, computing $g_2^t$ might require additional memory: $g_1^t$ should be preferred over $g_2^t$ in this setting. However, our analysis is only asymptotic, and other effects might come into play to tip the balance in favor of $g_2^t$.

As appears clearly in Table 1, choosing $g_3^t$ over $g_1^t$ depends on the relative sizes of $t$ and $m$: when $t$ is large in front of $m$, the additional cost of computing $g_3^t$ is negligible, and it should be preferred since it is more accurate. This is, however, a rare situation in a large-scale setting.
Sublinear convergence  We have provided two settings where $z_t$ converges sub-linearly. In the stochastic gradient descent case with a fixed step size, one can benefit from using $g_2^t$ over $g_1^t$, since it allows one to reach an accuracy that can never be reached by $g_1^t$. With a decreasing step size, reaching $|g_1^t - g^*| \leq \varepsilon$ requires $O(\varepsilon^{-2/\alpha})$ iterations, while reaching $|g_2^t - g^*| \leq \varepsilon$ only takes $O(\varepsilon^{-1/\alpha})$ iterations. For $\varepsilon$ small enough, the computational overhead of $g_2^t$ is negligible in front of this gap: it is always beneficial to use $g_2^t$ if memory capacity allows it.

The story is similar for the simple non-strongly convex problem studied in subsection 3.4: because of the slow convergence of the algorithms, $g_2^t$ is much closer to $g^*$ than $g_1^t$. Although our analysis was carried out on the simple least mean $p$-th problem, we conjecture it could be extended to the more general setting of $p$-Łojasiewicz functions (Attouch and Bolte, 2009).

## 5 Experiments

All experiments are performed in Python using pytorch (Paszke et al., 2019). The code to reproduce the figures is available online (see Appendix).

### 5.1 Considered losses

In our experiments, we consider several losses with different properties. For each experiment, the details on the size of the problems are reported in subsection A.1.
Regression  For a design matrix $D \in \mathbb{R}^{n \times m}$ and a regularization parameter $\lambda > 0$, we define

$$L_1(z, x) = \frac{1}{2}|x - Dz|^2 + \frac{\lambda}{2}|z|^2,$$
$$L_2(z, x) = \sum_{i=1}^n \log\left(1 + e^{-x_i [Dz]_i}\right) + \frac{\lambda}{2}|z|^2.$$