In machine learning, many objective functions are expressed as the minimum of another function: functionsdefined as
where . Such formulation arises for instance in dictionary learning, where is a dictionary, a sparse code, and is the Lasso cost (Mairal et al., 2010). In this case, measures the ability of the dictionary to encode the input data. Another example is the computation of the Wassertein barycenter of distributions in optimal transport (Agueh and Carlier, 2011): represents the barycenter, is the sum of distances to , and the distances themselves are defined by minimizing the transport cost. In the field of optimization, formulation (1) is also encountered as a smoothing technique, for instance in reweighted least-squares (Daubechies et al., 2010) where is smooth but not
. In game theory, such problems naturally appear in two-players maximin games(von Neumann, 1928), with applications for instance to generative adversarial nets (Goodfellow et al., 2014). In this setting, should be maximized.
A key point to optimize – either maximize or minimize – is usually to compute the gradient of , . If the minimizer is available, the first order optimality conditions impose that and the gradient is given by
However, in most cases the minimizer of the function is not available in closed-form.
It is approximated via an iterative algorithm, which produces a sequence of iterates .
There are then three ways to estimate :
The analytic estimator corresponds to plugging the approximation in (2)
The automatic estimator is , where the derivative is computed with respect to
as well. The chain rule gives
This expression can be computed efficiently using automatic differentiation (Baydin et al., 2018), in most cases at a cost similar to that of computing .
If is invertible, the implicit function theorem gives where . The implicit estimator is
This estimator can be more costly to compute than the previous ones, as a linear system has to be solved.
These estimates have been proposed and used by different communities. The analytic one corresponds to alternate optimization of , where one updates while considering that is fixed. It is used for instance in dictionary learning (Olshausen and Field, 1997; Mairal et al., 2010) or in optimal transport (Feydy et al., 2019)
. The second is common in the deep learning community as a way to differentiate through optimization problems(Gregor and Le Cun, 2010). Recently, it has been used as a way to accelerate convolutional dictionary learning (Tolooshams et al., 2018). It has also been used to differentiate through the Sinkhorn algorithm in optimal transport applications (Boursier and Perchet, 2019; Genevay et al., 2018). It integrates smoothly in a machine learning framework, with dedicated libraries (Abadi et al., 2016; Paszke et al., 2019)
. The third one is found in bi-level optimization, for instance for hyperparameter optimization(Bengio, 2000)
. It is also the cornerstone of the use of convex optimization as layers in neural networks(Agrawal et al., 2019).
Contribution In this article, we want to answer the following question: which one of these estimators is the best? The central result, presented in section 2, is the following convergence speed, when is differentiable and under mild regularity hypothesis (Proposition 1, 2 and 3)
This is a super-efficiency phenomenon for the automatic estimator, illustrated in Figure 1 on a toy example. As our analysis reveals, the bound on depends on the convergence speed of the Jacobian of , which itself depends on the optimization algorithm used to produce . In section 3, we build on the work of Gilbert (1992) and give accurate bounds on the convergence of the Jacobian for gradient descent (Proposition 5) and stochastic gradient descent (Proposition 8 and 9) in the strongly convex case. We then study a simple case of non-strongly convex problem (Proposition 12). To the best of our knowledge, these bounds are novel. This analysis allows us to refine the convergence rates of the gradient estimators.
In section 4, we start by recalling and extending the consequence of using wrong gradients in an optimization algorithm (Proposition 13 and 14). Then, since each gradient estimator comes at a different cost, we put the convergence bounds developed in the paper in perspective with a complexity analysis. This leads to practical and principled guidelines about which estimator should be used in which case.
Finally, we provide numerical illustrations of the aforementioned results in section 5.
Notation The norm of is . The operator norm of is and the Frobenius norm is
. The vector of sizefull of s is . The Euclidean scalar product is .
The proofs are only sketched in the article, full proofs are deferred to appendix.
2 Convergence speed of gradient estimates
We consider a compact set . We make the following assumptions on .
H1: is twice differentiable over with second derivatives and respectively and -Lipschitz.
H2:For all , has a unique minimizer . The mapping is differentiable, with Jacobian .
H1 implies that and are Lipschitz, with constants and . The Jacobian of at is . For the rest of the section, we consider a point , and we denote , and .
2.1 Analytic estimator
The analytic estimator (3) approximates as well as approximates by definition of the -smoothness.
Proposition 1 (Convergence of the analytic estimator).
The analytic estimator verifies .
2.2 Automatic estimator
The automatic estimator (4) can be written as
are Taylor’s rests and
The implicit function theorem states that . Importantly, in a non strongly-convex setting where is not invertible, it might happen that goes to even though does not converge to . H1 implies a quadratic bound on the rests
Proposition 2 (Convergence of the automatic estimator).
This proposition shows that the rate of convergence of depends on the speed of convergence of . For instance, if goes to , we have
Unfortunately, it might happen that, even though goes to , does not go to since differentiation is not a continuous operation. In section 3, we refine this convergence rate by analyzing the convergence speed of the Jacobian in different settings.
2.3 Implicit estimator
The implicit estimator (5) is well defined provided that is invertible. We obtain convergence bounds by making a Lipschitz assumption on .
[Convergence of the implicit estimator] Assume that is -Lipschitz with respect to its first argument, and that . Then, for as defined in (9),
Sketch of proof.
The proof is similar to that of Proposition 2, using . ∎
Therefore this estimator converges twice as fast as , and at least as fast as . Just like this estimator does not need to store the past iterates in memory, since it is a function of and . However, it is usually much more costly to compute.
2.4 Link with bi-level optimization
Bi-level optimization appears in a variety of machine-learning problems, such as hyperparameter optimization (Pedregosa, 2016) or supervised dictionary learning (Mairal et al., 2012). It considers problems of the form
where is another objective function. The setting of our paper is a special instance of bi-level optimization where . The gradient of is
When is a sequence of approximate minimizers of , gradient estimates can be defined as
Here, , since does not minimize . Hence, does not estimate . Moreover, in general,
Therefore, there is no cancellation to allow super-efficiency of and and we only obtain linear rates
3 Convergence speed of the Jacobian
In order to get a better understanding of the convergence properties of the gradient estimators – in particular – we analyze it in different settings. A large portion of the analysis is devoted to the convergence of to , since it does not directly follow from the convergence of . In most cases, we show convergence of to , and use
in the bound of Proposition 2.
3.1 Contractive setting
When are the iterates of a fixed point iteration with a contractive mapping, we recall the following result due to Gilbert (1992).
Proposition 4 (Convergence speed of the Jacobian).
Assume that is produced by a fixed point iteration
where is differentiable. We suppose that is contractive: there exists such that for all , Under mild regularity conditions on :
converges to a differentiable function such that , with Jacobian .
3.2 Gradient descent in the strongly convex case
We consider the gradient descent iterations produced by the mapping , with a step-size . We assume that is -strongly convex with respect to , i.e. for all . In this setting, satisfies the hypothesis of Proposition 4, and we obtain precise bounds.
[Convergence speed of the Jacobian of gradient descent in a strongly convex setting] Let produced by the recursion with and . We have and where is defined in (9).
Sketch of proof (c.1).
We show that satisfies the recursive inequality . ∎
3.3 Stochastic gradient descent in
We provide an analysis of the convergence of in the stochastic gradient descent setting, assuming once again the -strong convexity of . We suppose that is an expectation
where is drawn from a distribution , and is twice differentiable. Stochastic gradient descent (SGD) with steps iterates
In the stochastic setting, Proposition 2 becomes
We have .
Sketch of proof (c.3).
We use Cauchy-Schwarz and the norm inequality to bound . ∎
We begin by deriving a recursive inequality on , inspired by the analysis techniques of .
[Bounding inequality for the Jacobian] We assume bounded Hessian noise, in the sense that and . Let , and . We have
Sketch of proof (c.4).
A standard strong convexity argument gives the bound
The middle term is then bounded using Cauchy-Schwarz inequality. ∎
Therefore, any convergence bound on provides another convergence bound on by unrolling Eq. (15). We first analyze the fixed step-size case by using the simple “bounded gradients” hypothesis and bounds on from Moulines and Bach (2011)
. In this setting, the iterates converge linearly until they reach a threshold caused by gradient variance.
[SGD with constant step-size] Assume that the gradients have bounded variance . Assume , and let and . In this setting
where and .
Sketch of proof (c.5).
This bound showcases that the familiar “two-stages” behavior also stands for : a transient linear decay at rate in the first iterations, and then convergence to a stationary noise level . This bound is not tight enough to provide a good estimate of the noise level in . Let the limit of the sequence defined by the recursion (15). We find
which gives by expanding
This noise level scales as , which is observed in practice. In this scenario, Prop. 1, 2 and 3 show that reaches a noise level proportional to , just like , while both and reach a noise level proportional to : and can estimate to an accuracy that can never be reached by . We now turn to the decreasing step-size case.
[SGD with decreasing step-size] Assume that with . Assume a bound on of the form . Then
When , we have so the assumption is verified for some . One could use the precise bounds of (Moulines and Bach, 2011) to obtain non-asymptotic bounds on as well.
Overall, we recover bounds for with the same behavior than the bounds for . Pluging them in Proposition 6 gives the asymptotic behaviors for the gradient estimators.
Proposition 10 (Convergence speed of the gradient estimators for SGD with decreasing step).
Assume that with . Then
The super-efficiency of and is once again illustrated, as they converge at the same speed as .
3.4 Beyond strong convexity
All the previous results rely critically on the strong convexity of . A function with minimizer is -Łojasiewicz (Attouch and Bolte, 2009) when for some . Any strongly convex function is -Łojasiewicz: the set of -Łojasiewicz functions for offers a framework beyond strong-convexity that still provides convergence rates on the iterates. The general study of gradient descent on this class of function is out of scope for this paper. We analyze a simple class of -Łojasiewicz functions, the least mean -th problem, where
for an even integer and is overcomplete (rank). In this simple case, is minimized by cancelling , and .
In the case of least squares () we can perfectly describe the behavior of gradient descent, which converges linearly.
Let the iterates of gradient descent with step in (16) with , and . It holds
The iterates verify , and we find . The result follows. ∎
The automatic estimator therefore goes exactly twice as fast as the analytic one to , while the implicit estimator is exact. Then, we analyze the case where in a more restrictive setting.
For , we assume . Let . We have
Sketch of proof (c.7).
We first show that the residuals verify , which gives the result for . We find where verifies . Using the development of and unrolling the recursion concludes the proof. ∎
For this problem, is of the order of magnitude of squared and as , we see that the rate of convergence of goes to , while the one of goes to .
4 Consequence on optimization
In this section, we study the impact of using the previous inexact estimators for first order optimization. These estimators nicely fit in the framework of inexact oracles introduced by Devolder et al. (2014).
4.1 Inexact oracle
We assume that is -strongly convex and -smooth with minimizer . A -inexact oracle is a couple such that is the inexact value function, is the inexact gradient and for all
Devolder et al. (2013) show that if the gradient approximation verifies for all , then is a -inexact oracle, with
We consider the optimization of with inexact gradient descent: starting from , it iterates
with , a fixed and or .
As goes to infinity, the error made on tends towards . Thus, a more precise gradient estimate achieves lower optimization error. This illustrates the importance of using gradients estimates with an error as small as possible.
We now consider stochastic optimization for our problem, with loss defined as
Stochastic gradient descent with constant step-size and inexact gradients iterates
where is computed by an approximate minimization of .
We assume that is -strongly convex, -smooth and verifies
The iterates of SGD with approximate gradient and step-size verify
The proof is deferred to Appendix D. In this case, it is pointless to achieve an estimation error on the gradient smaller than some fraction of the gradient variance .
As a final note, these results extend without difficulty to the problem of maximizing , by considering gradient ascent or stochastic gradient ascent.
4.2 Time and memory complexity
In the following, we put our results in perspective with a computational and memory complexity analysis, allowing us to provide practical guidelines for optimization of .
|Gradient estimate||Computational cost|
Computational complexity of the estimators
The cost of computing the estimators depends on the cost function .
We give a complexity analysis in the least squares case (16) which is summarized in Table 1.
In this case, computing the gradient takes operations, therefore the cost of computing with gradient descent is .
Computing comes at the same price.
The estimator requires a reverse pass on the computational graph, which costs a factor of the forward computational cost:
the final cost is .
Griewank and Walther (2008) showed that typically .
Finally, computing requires a costly Hessian computation, and a linear system inversion. The final cost is .
The linear scaling of and is highlighted in Figure A.3.
In addition, computing usually requires to store in memory all intermediate variables, which might be a burden.
However, some optimization algorithms are invertible, such as SGD with momentum (Maclaurin et al., 2015).
In this case, no additional memory is needed.
Linear convergence: a case for the analytic estimator In the time it takes to compute , one can at the same cost compute . If converges linearly at rate , Proposition 1 shows that , while Proposition 2 gives, at best, : is a better estimator of than , provided that . In the quadratic case, we even have . Further, computing might requires additional memory: should be preferred over in this setting. However, our analysis is only asymptotic, and other effects might come into play to tip the balance in favor of .
As it appears clearly in Table 1, choosing over depends on : when , the additional cost of computing is negligible, and it should be preferred since it is more accurate.
This is however a rare situation in a large scale setting.
Sublinear convergence We have provided two settings where converges sub-linearly. In the stochastic gradient descent case with a fixed step-size, one can benefit from using over , since it allows to reach an accuracy that can never be reached by . With a decreasing step-size, reaching requires iterations, while reaching only takes iterations. For small enough, we have : it is always beneficial to use if memory capacity allows it.
The story is similar for the simple non-strongly convex problem studied in subsection 3.4: because of the slow convergence of the algorithms, is much closer to than . Although our analysis was carried in the simple least mean -th problem, we conjecture it could be extended to the more general setting of -Łojasiewicz functions (Attouch and Bolte, 2009).
5.1 Considered losses
In our experiments, we considered several losses with different properties. For each experiments, the details on the size of the problems are reported in subsection A.1.
Regression For a design matrix and a regularization parameter , we define