# On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization

Conventional wisdom in deep learning states that increasing depth improves expressiveness but complicates optimization. This paper suggests that, sometimes, increasing depth can speed up optimization. The effect of depth on optimization is decoupled from expressiveness by focusing on settings where additional layers amount to overparameterization: linear neural networks, a well-studied model. Theoretical analysis, as well as experiments, show that here depth acts as a preconditioner which may accelerate convergence. Even on simple convex problems such as linear regression with $\ell_p$ loss, $p>2$, gradient descent can benefit from transitioning to a non-convex overparameterized objective, more than it would from some common acceleration schemes. We also prove that it is mathematically impossible to obtain the acceleration effect of overparameterization via gradients of any regularizer.

## 1 Introduction

How does depth help? This central question of deep learning still eludes full theoretical understanding. The general consensus is that there is a trade-off: increasing depth improves expressiveness, but complicates optimization. Superior expressiveness of deeper networks, long suspected, is now confirmed by theory, albeit for fairly limited learning problems (Eldan & Shamir, 2015; Raghu et al., 2016; Lee et al., 2017; Cohen et al., 2017; Daniely, 2017; Arora et al., 2018). Difficulties in optimizing deeper networks have also long been clear: the signal held by a gradient gets buried as it propagates through many layers. This is known as the “vanishing/exploding gradient problem”. Modern techniques such as batch normalization (Ioffe & Szegedy, 2015) and residual connections (He et al., 2015) have somewhat alleviated these difficulties in practice.

Given the longstanding consensus on expressiveness vs. optimization trade-offs, this paper conveys a rather counterintuitive message: increasing depth can accelerate optimization. The effect is shown, via first-cut theoretical and empirical analyses, to resemble a combination of two well-known tools in the field of optimization: momentum, which led to provable acceleration bounds (Nesterov, 1983); and adaptive regularization, a more recent technique proven to accelerate by Duchi et al. (2011) in their proposal of the AdaGrad algorithm. Explicit mergers of both techniques are quite popular in deep learning (Kingma & Ba, 2014; Tieleman & Hinton, 2012). It is thus intriguing that merely introducing depth, with no other modification, can have a similar effect, but implicitly.

There is an obvious hurdle in isolating the effect of depth on optimization: if increasing depth leads to faster training on a given dataset, how can one tell whether the improvement arose from a true acceleration phenomenon, or simply due to better representational power (the shallower network was unable to attain the same training loss)? We respond to this hurdle by focusing on linear neural networks (cf. Saxe et al. (2013); Goodfellow et al. (2016); Hardt & Ma (2016); Kawaguchi (2016)). With these models, adding layers does not alter expressiveness; it manifests itself only in the replacement of a matrix parameter by a product of matrices – an overparameterization.

We provide a new analysis of linear neural network optimization via direct treatment of the differential equations associated with gradient descent when training arbitrarily deep networks on arbitrary loss functions. We find that the overparameterization introduced by depth leads gradient descent to operate as if it were training a shallow (single layer) network, while employing a particular preconditioning scheme. The preconditioning promotes movement along directions already taken by the optimization, and can be seen as an acceleration procedure that combines momentum with adaptive learning rates. Even on simple convex problems such as linear regression with $\ell_p$ loss, $p>2$, overparameterization via depth can significantly speed up training. Surprisingly, in some of our experiments, not only did overparameterization outperform naïve gradient descent, but it was also faster than two well-known acceleration methods: AdaGrad (Duchi et al., 2011) and AdaDelta (Zeiler, 2012). In addition to purely linear networks, we also demonstrate (empirically) the implicit acceleration of overparameterization on a non-linear model, by replacing hidden layers with depth-$2$ linear networks. The implicit acceleration of overparameterization is different from standard regularization: we prove its effect cannot be attained via gradients of any fixed regularizer.

Both our theoretical analysis and our empirical evaluation indicate that acceleration via overparameterization need not be computationally expensive. From an optimization perspective, overparameterizing using wide or narrow networks has the same effect – it is only the depth that matters.

The remainder of the paper is organized as follows. In Section 2 we review related work. Section 3 presents a warmup example of linear regression with $\ell_p$ loss, demonstrating the immense effect overparameterization can have on optimization, with as little as a single additional scalar. Our theoretical analysis begins in Section 4, with a setup of preliminary notation and terminology. Section 5 derives the preconditioning scheme implicitly induced by overparameterization, followed by Section 6 which shows that this form of preconditioning is not attainable via any regularizer. In Section 7 we qualitatively analyze a very simple learning problem, demonstrating how the preconditioning can speed up optimization. Our empirical evaluation is delivered in Section 8. Finally, Section 9 concludes.

## 2 Related Work

Theoretical study of optimization in deep learning is a highly active area of research. Works along this line typically analyze critical points (local minima, saddles) in the landscape of the training objective, either for linear networks (see for example Kawaguchi (2016); Hardt & Ma (2016) or Baldi & Hornik (1989) for a classic account), or for specific non-linear networks under different restrictive assumptions (cf. Choromanska et al. (2015); Haeffele & Vidal (2015); Soudry & Carmon (2016); Safran & Shamir (2017)). Other works characterize other aspects of objective landscapes, for example Safran & Shamir (2016) showed that under certain conditions a monotonically descending path from initialization to global optimum exists (in compliance with the empirical observations of Goodfellow et al. (2014)).

The dynamics of optimization were studied in Fukumizu (1998) and Saxe et al. (2013), for linear networks. Like ours, these works analyze gradient descent through its corresponding differential equations. Fukumizu (1998) focuses on linear regression with $\ell_2$ loss, and does not consider the effect of varying depth: only a two (single hidden) layer network is analyzed. Saxe et al. (2013) also focuses on $\ell_2$ regression, but considers any depth beyond two (inclusive), ultimately concluding that increasing depth can slow down optimization, albeit by a modest amount. In contrast to these two works, our analysis applies to a general loss function, and any depth including one. Intriguingly, we find that for $\ell_p$ regression, acceleration by depth is revealed only when $p>2$. This explains why the conclusion reached in Saxe et al. (2013) differs from ours.

Turning to general optimization, accelerated gradient (momentum) methods were introduced in Nesterov (1983), and later studied in numerous works (see Wibisono et al. (2016) for a short review). Such methods effectively accumulate gradients throughout the entire optimization path, using the collected history to determine the step at a current point in time. Use of preconditioners to speed up optimization is also a well-known technique. Indeed, the classic Newton’s method can be seen as preconditioning based on second derivatives. Adaptive preconditioning with only first-order (gradient) information was popularized by the BFGS algorithm and its variants (cf. Nocedal (1980)). Relevant theoretical guarantees, in the context of regret minimization, were given in Hazan et al. (2007); Duchi et al. (2011). In terms of combining momentum and adaptive preconditioning, Adam (Kingma & Ba, 2014) is a popular approach, particularly for optimization of deep networks.

Algorithms with certain theoretical guarantees for non-convex optimization, and in particular for training deep neural networks, were recently suggested in various works, for example Ge et al. (2015); Agarwal et al. (2017); Carmon et al. (2016); Janzamin et al. (2015); Livni et al. (2014) and references therein. Since the focus of this paper lies on the analysis of algorithms already used by practitioners, such works lie outside our scope.

## 3 Warmup: $\ell_p$ Regression

We begin with a simple yet striking example of the effect being studied. For linear regression with $\ell_p$ loss, we will see how even the slightest overparameterization can have an immense effect on optimization. Specifically, we will see that simple gradient descent on an objective overparameterized by a single scalar corresponds to a form of accelerated gradient descent on the original objective.

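Consider the objective for a scalar linear regression problem with $\ell_p$ loss ($p$ an even positive integer):

$$L(\mathbf{w}) = \mathbb{E}_{(\mathbf{x},y)\sim S}\left[\tfrac{1}{p}\left(\mathbf{x}^\top\mathbf{w} - y\right)^p\right]$$

Here $\mathbf{x}\in\mathbb{R}^d$ are instances, $y\in\mathbb{R}$ are continuous labels, $S$ is a finite collection of labeled instances (training set), and $\mathbf{w}\in\mathbb{R}^d$ is a learned parameter vector. Suppose now that we apply a simple overparameterization, replacing the parameter vector $\mathbf{w}$ by a vector $\mathbf{w}_1\in\mathbb{R}^d$ times a scalar $w_2\in\mathbb{R}$:

$$L(\mathbf{w}_1, w_2) = \mathbb{E}_{(\mathbf{x},y)\sim S}\left[\tfrac{1}{p}\left(\mathbf{x}^\top\mathbf{w}_1 w_2 - y\right)^p\right]$$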

Obviously the overparameterization does not affect the expressiveness of the linear model. How does it affect optimization? What happens to gradient descent on this non-convex objective?

###### Observation 1.

Gradient descent over $L(\mathbf{w}_1, w_2)$, with fixed small learning rate and near-zero initialization, is equivalent to gradient descent over $L(\mathbf{w})$ with particular adaptive learning rate and momentum terms.

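To see this, consider the gradients of $L(\mathbf{w})$ and $L(\mathbf{w}_1, w_2)$:

$$\begin{aligned}
\nabla_{\mathbf{w}} &:= \mathbb{E}_{(\mathbf{x},y)\sim S}\left[(\mathbf{x}^\top\mathbf{w} - y)^{p-1}\,\mathbf{x}\right] \\
\nabla_{\mathbf{w}_1} &:= \mathbb{E}_{(\mathbf{x},y)\sim S}\left[(\mathbf{x}^\top\mathbf{w}_1 w_2 - y)^{p-1}\,w_2\,\mathbf{x}\right] \\
\nabla_{w_2} &:= \mathbb{E}_{(\mathbf{x},y)\sim S}\left[(\mathbf{x}^\top\mathbf{w}_1 w_2 - y)^{p-1}\,\mathbf{w}_1^\top\mathbf{x}\right]
\end{aligned}$$

Gradient descent over $L(\mathbf{w}_1, w_2)$ with learning rate $\eta>0$:

$$\mathbf{w}_1^{(t+1)} \leftarrow \mathbf{w}_1^{(t)} - \eta\nabla_{\mathbf{w}_1^{(t)}}, \qquad w_2^{(t+1)} \leftarrow w_2^{(t)} - \eta\nabla_{w_2^{(t)}}$$

The dynamics of the underlying parameter $\mathbf{w} = \mathbf{w}_1 w_2$ are:

$$\begin{aligned}
\mathbf{w}^{(t+1)} = \mathbf{w}_1^{(t+1)} w_2^{(t+1)} &\leftarrow \left(\mathbf{w}_1^{(t)} - \eta\nabla_{\mathbf{w}_1^{(t)}}\right)\left(w_2^{(t)} - \eta\nabla_{w_2^{(t)}}\right) \\
&= \mathbf{w}_1^{(t)} w_2^{(t)} - \eta\, w_2^{(t)}\nabla_{\mathbf{w}_1^{(t)}} - \eta\,\nabla_{w_2^{(t)}}\mathbf{w}_1^{(t)} + O(\eta^2) \\
&= \mathbf{w}^{(t)} - \eta\,(w_2^{(t)})^2\,\nabla_{\mathbf{w}^{(t)}} - \eta\,(w_2^{(t)})^{-1}\nabla_{w_2^{(t)}}\,\mathbf{w}^{(t)} + O(\eta^2)
\end{aligned}$$

The last equality uses $\nabla_{\mathbf{w}_1} = w_2\nabla_{\mathbf{w}}$ (with $\nabla_{\mathbf{w}}$ evaluated at $\mathbf{w} = \mathbf{w}_1 w_2$) and $\mathbf{w}_1 = w_2^{-1}\mathbf{w}$. $\eta$ is assumed to be small, thus we neglect $O(\eta^2)$. Denoting $\rho^{(t)} := \eta\,(w_2^{(t)})^2 \in \mathbb{R}$ and $\gamma^{(t)} := \eta\,(w_2^{(t)})^{-1}\nabla_{w_2^{(t)}} \in \mathbb{R}$, this gives:

$$\mathbf{w}^{(t+1)} \leftarrow \mathbf{w}^{(t)} - \rho^{(t)}\nabla_{\mathbf{w}^{(t)}} - \gamma^{(t)}\mathbf{w}^{(t)}$$

Since by assumption $\mathbf{w}_1$ and $w_2$ are initialized near zero, $\mathbf{w}$ will initialize near zero as well. This implies that at every iteration $t$, $\mathbf{w}^{(t)}$ is a weighted combination of past gradients. There thus exist $\mu^{(t,\tau)}\in\mathbb{R}$ such that:

$$\mathbf{w}^{(t+1)} \leftarrow \mathbf{w}^{(t)} - \rho^{(t)}\nabla_{\mathbf{w}^{(t)}} - \sum_{\tau=1}^{t-1}\mu^{(t,\tau)}\nabla_{\mathbf{w}^{(\tau)}}$$

We conclude that the dynamics governing the underlying parameter $\mathbf{w}$ correspond to gradient descent with a momentum term, where both the learning rate ($\rho^{(t)}$) and momentum coefficients ($\mu^{(t,\tau)}$) are time-varying and adaptive.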
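The one-step identity behind Observation 1 can be checked numerically. The sketch below is only an illustration under assumed settings (synthetic Gaussian data, $p=4$, $\eta=10^{-4}$, and the helper names `grad_w` and `overparam_step` are all hypothetical): it takes one gradient step on $L(\mathbf{w}_1, w_2)$ and compares the resulting product $\mathbf{w}_1 w_2$ with the update $\mathbf{w} - \rho\nabla_{\mathbf{w}} - \gamma\mathbf{w}$ derived above. The two should differ only by the neglected $O(\eta^2)$ term.

```python
import numpy as np

def grad_w(w, X, y, p):
    """Gradient of L(w) = mean((x^T w - y)^p) / p with respect to w."""
    r = X @ w - y
    return (r ** (p - 1)) @ X / len(y)

def overparam_step(w1, w2, X, y, p, eta):
    """One gradient-descent step on L(w1, w2), where the end-to-end parameter is w1 * w2."""
    g = grad_w(w1 * w2, X, y, p)   # gradient with respect to the product w
    g1 = w2 * g                    # gradient with respect to the vector w1
    g2 = w1 @ g                    # gradient with respect to the scalar w2
    return w1 - eta * g1, w2 - eta * g2, g, g2

# Synthetic data and hyperparameters, assumed purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
p, eta = 4, 1e-4
w1, w2 = 0.1 * rng.normal(size=3), 0.1

new_w1, new_w2, g, g2 = overparam_step(w1, w2, X, y, p, eta)
w = w1 * w2
rho = eta * w2 ** 2          # adaptive learning rate rho^(t)
gamma = eta * g2 / w2        # coefficient gamma^(t) of the momentum-like term
predicted = w - rho * g - gamma * w
actual = new_w1 * new_w2
gap = np.linalg.norm(actual - predicted)   # only the O(eta^2) cross term remains
```

Here `gap` equals the norm of the dropped cross term $\eta^2\, w_2 \nabla_{w_2} \nabla_{\mathbf{w}}$, so it shrinks quadratically as `eta` decreases.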

## 4 Linear Neural Networks

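Let $\mathcal{X} := \mathbb{R}^d$ be a space of objects (e.g. images or word embeddings) that we would like to infer something about, and let $\mathcal{Y} := \mathbb{R}^k$ be the space of possible inferences. Suppose we are given a training set $\{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})\}_{i=1}^m \subset \mathcal{X}\times\mathcal{Y}$, along with a (point-wise) loss function $l: \mathcal{Y}\times\mathcal{Y}\to\mathbb{R}_{\geq 0}$. For example, $\mathbf{y}^{(i)}$ could hold continuous values with $l(\cdot)$ being the $\ell_2$ loss: $l(\hat{\mathbf{y}}, \mathbf{y}) = \frac{1}{2}\|\hat{\mathbf{y}} - \mathbf{y}\|_2^2$; or it could hold one-hot vectors representing categories with $l(\cdot)$ being the softmax-cross-entropy loss: $l(\hat{\mathbf{y}}, \mathbf{y}) = -\sum_{r=1}^k y_r \log\big(e^{\hat{y}_r} / \sum_{r'=1}^k e^{\hat{y}_{r'}}\big)$, where $y_r$ and $\hat{y}_r$ stand for coordinate $r$ of $\mathbf{y}$ and $\hat{\mathbf{y}}$ respectively. For a predictor $\phi$, i.e. a mapping from $\mathcal{X}$ to $\mathcal{Y}$, the overall training loss is $L(\phi) := \frac{1}{m}\sum_{i=1}^m l(\phi(\mathbf{x}^{(i)}), \mathbf{y}^{(i)})$. If $\phi$ comes from some parametric family $\Phi := \{\phi_\theta : \mathcal{X}\to\mathcal{Y} \mid \theta\in\Theta\}$, we view the corresponding training loss as a function of the parameters, i.e. we consider $L^{\Phi}(\theta) := \frac{1}{m}\sum_{i=1}^m l(\phi_\theta(\mathbf{x}^{(i)}), \mathbf{y}^{(i)})$. For example, if the parametric family in question is the class of (directly parameterized) linear predictors:

$$\Phi^{lin} := \left\{\mathbf{x}\mapsto W\mathbf{x} \;\middle|\; W\in\mathbb{R}^{k,d}\right\} \tag{1}$$

the respective training loss is a function from $\mathbb{R}^{k,d}$ to $\mathbb{R}_{\geq 0}$.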

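In our context, a depth-$N$ ($N\geq 2$) linear neural network, with hidden widths $n_1, n_2, \ldots, n_{N-1}\in\mathbb{N}$, is the following parametric family of linear predictors: $\Phi^{n_1\ldots n_{N-1}} := \{\mathbf{x}\mapsto W_N W_{N-1}\cdots W_1\mathbf{x} \mid W_j\in\mathbb{R}^{n_j,n_{j-1}},\ j=1\ldots N\}$, where by definition $n_0 := d$ and $n_N := k$. As customary, we refer to each $W_j$, $j=1\ldots N$, as the weight matrix of layer $j$. For simplicity of presentation, we hereinafter omit from our notation the hidden widths $n_1\ldots n_{N-1}$, and simply write $\Phi^N$ instead of $\Phi^{n_1\ldots n_{N-1}}$ ($n_1\ldots n_{N-1}$ will be specified explicitly if not clear by context). That is, we denote:

$$\Phi^N := \left\{\mathbf{x}\mapsto W_N W_{N-1}\cdots W_1\mathbf{x} \;\middle|\; W_j\in\mathbb{R}^{n_j,n_{j-1}},\ j=1\ldots N\right\} \tag{2}$$

For completeness, we regard a depth-$1$ network as the family of directly parameterized linear predictors, i.e. we set $\Phi^1 := \Phi^{lin}$ (see Equation 1).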

The training loss that corresponds to a depth-$N$ linear network, $L^{\Phi^N}(W_1,\ldots,W_N)$, is a function from $\mathbb{R}^{n_1,n_0}\times\cdots\times\mathbb{R}^{n_N,n_{N-1}}$ to $\mathbb{R}_{\geq 0}$. For brevity, we will denote this function by $L^N(\cdot)$. Our focus lies on the behavior of gradient descent when minimizing $L^N(\cdot)$. More specifically, we are interested in the dependence of this behavior on $N$, and in particular, in the possibility of increasing $N$ leading to acceleration. Notice that for any $N\geq 2$ we have:

$$L^N(W_1,\ldots,W_N) = L^1(W_N W_{N-1}\cdots W_1) \tag{3}$$

and so the sole difference between the training loss of a depth-$N$ network and that of a depth-$1$ network (classic linear model) lies in the replacement of a matrix parameter by a product of $N$ matrices. This implies that if increasing $N$ can indeed accelerate convergence, it is not an outcome of any phenomenon other than favorable properties of depth-induced overparameterization for optimization.
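Equation 3 can be illustrated with a few lines of NumPy. The sketch below assumes a squared-error loss and random synthetic data (both chosen here only for illustration, as are the function names): it evaluates a depth-3 linear network layer by layer and compares the result with the depth-1 loss at the end-to-end product.

```python
import numpy as np
from functools import reduce

def end_to_end(weights):
    """Product W_N ... W_1 for weights given in layer order [W_1, ..., W_N]."""
    return reduce(lambda acc, W: W @ acc, weights)

def loss_shallow(W, X, Y):
    """L^1(W): squared-error training loss of x -> W x (loss assumed for illustration)."""
    return 0.5 * np.mean(np.sum((X @ W.T - Y) ** 2, axis=1))

def loss_deep(weights, X, Y):
    """L^N(W_1, ..., W_N): the same loss, computed by passing inputs through every layer."""
    H = X
    for W in weights:
        H = H @ W.T
    return 0.5 * np.mean(np.sum((H - Y) ** 2, axis=1))

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(30, 5)), rng.normal(size=(30, 2))        # d = 5, k = 2
weights = [rng.normal(size=s) for s in [(4, 5), (3, 4), (2, 3)]]  # hidden widths 4 and 3

deep_value = loss_deep(weights, X, Y)
shallow_value = loss_shallow(end_to_end(weights), X, Y)
```

The two values agree up to floating-point error for any choice of hidden widths, which is the sense in which depth adds parameters without adding expressiveness.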

## 5 Implicit Dynamics of Gradient Descent

In this section we present a new result for linear neural networks, tying the dynamics of gradient descent on $L^N(\cdot)$, the training loss corresponding to a depth-$N$ network, to those on $L^1(\cdot)$, the training loss of a depth-$1$ network (classic linear model). Specifically, we show that gradient descent on $L^N(\cdot)$, a complicated and seemingly pointless overparameterization, can be directly rewritten as a particular preconditioning scheme over gradient descent on $L^1(\cdot)$.

When applied to $L^N(\cdot)$, gradient descent takes on the following form:

$$W_j^{(t+1)} \leftarrow (1-\eta\lambda)\,W_j^{(t)} - \eta\,\frac{\partial L^N}{\partial W_j}\left(W_1^{(t)},\ldots,W_N^{(t)}\right), \quad j=1\ldots N \tag{4}$$

Here $\eta>0$ is a learning rate, and $\lambda\geq 0$ is an optional weight decay coefficient. For simplicity, we regard both $\eta$ and $\lambda$ as fixed (no dependence on $t$). Define the underlying end-to-end weight matrix:

$$W_e := W_N W_{N-1}\cdots W_1 \tag{5}$$

Given that $L^N(W_1,\ldots,W_N) = L^1(W_e)$ (Equation 3), we view $W_e$ as an optimized weight matrix for $L^1(\cdot)$, whose dynamics are governed by Equation 4. Our interest then boils down to the study of these dynamics for different choices of $N$. For $N=1$ they are (trivially) equivalent to standard gradient descent over $L^1(\cdot)$. We will characterize the dynamics for $N\geq 2$.
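The per-layer gradients appearing in Equation 4 follow from Equation 3 by the chain rule: $\frac{\partial L^N}{\partial W_j} = (W_N\cdots W_{j+1})^\top \cdot \frac{dL^1}{dW}(W_e) \cdot (W_{j-1}\cdots W_1)^\top$. The sketch below checks this expression against a finite-difference estimate. The squared-error loss, the synthetic data, and the helper names (`chain`, `layer_grads`) are assumptions made only for this illustration.

```python
import numpy as np

def chain(mats, size):
    """Product M_last @ ... @ M_first of matrices listed in layer order (identity if empty)."""
    out = np.eye(size)
    for M in mats:
        out = M @ out
    return out

def loss_L1(W, X, Y):
    """Assumed example loss: L^1(W) = mean_i ||W x_i - y_i||^2 / 2."""
    return 0.5 * np.mean(np.sum((X @ W.T - Y) ** 2, axis=1))

def grad_L1(W, X, Y):
    """Gradient dL^1/dW of the loss above."""
    return (X @ W.T - Y).T @ X / len(X)

def layer_grads(weights, X, Y):
    """Gradients of L^N with respect to each layer, obtained via the chain rule."""
    d = weights[0].shape[1]
    G = grad_L1(chain(weights, d), X, Y)
    grads = []
    for j, Wj in enumerate(weights):
        below = chain(weights[:j], d)                 # W_{j-1} ... W_1
        above = chain(weights[j + 1:], Wj.shape[0])   # W_N ... W_{j+1}
        grads.append(above.T @ G @ below.T)
    return grads

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(30, 4)), rng.normal(size=(30, 2))
weights = [rng.normal(size=s) for s in [(3, 4), (3, 3), (2, 3)]]
analytic = layer_grads(weights, X, Y)[1][0, 1]   # one entry of dL^N/dW_2

# Central finite difference on the same entry.
eps = 1e-6
plus = [W.copy() for W in weights]
plus[1][0, 1] += eps
minus = [W.copy() for W in weights]
minus[1][0, 1] -= eps
numeric = (loss_L1(chain(plus, 4), X, Y) - loss_L1(chain(minus, 4), X, Y)) / (2 * eps)
```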

To be able to derive, in our general setting, an explicit update rule for the end-to-end weight matrix $W_e$ (Equation 5), we introduce an assumption by which the learning rate is small, i.e. $\eta^2\approx 0$. Formally, this amounts to translating Equation 4 to the following set of differential equations:

$$\dot{W}_j(t) = -\eta\lambda\,W_j(t) - \eta\,\frac{\partial L^N}{\partial W_j}\left(W_1(t),\ldots,W_N(t)\right), \quad j=1\ldots N \tag{6}$$

where $t$ is now a continuous time index, and $\dot{W}_j(t)$ stands for the derivative of $W_j$ with respect to time. The use of differential equations, for both theoretical analysis and algorithm design, has a long and rich history in optimization research (see Helmke & Moore (2012) for an overview). When step sizes (learning rates) are taken to be small, trajectories of discrete optimization algorithms converge to smooth curves modeled by continuous-time differential equations, paving the way to the well-established theory of the latter (cf. Boyce et al. (1969)). This approach has led to numerous interesting findings, including recent results in the context of acceleration methods (e.g. Su et al. (2014); Wibisono et al. (2016)).
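As a toy illustration of the small-learning-rate limit (not taken from the paper), consider the scalar objective $L(w) = w^2/2$. Discrete gradient descent multiplies $w$ by $(1-\eta)$ at every step, while the corresponding differential equation $\dot{w}(t) = -\eta\, w(t)$ has the closed-form solution $w(t) = w(0)e^{-\eta t}$. The sketch below, with an assumed step size and horizon, compares the two.

```python
import math

def gradient_descent(w0, eta, steps):
    """Discrete gradient descent on L(w) = w^2 / 2, whose gradient is w."""
    w = w0
    for _ in range(steps):
        w = w - eta * w
    return w

def gradient_flow(w0, eta, t):
    """Closed-form solution of dw/dt = -eta * w at time t."""
    return w0 * math.exp(-eta * t)

eta, steps = 1e-3, 1000   # assumed values, chosen only for illustration
discrete = gradient_descent(1.0, eta, steps)
continuous = gradient_flow(1.0, eta, steps)
difference = abs(discrete - continuous)
```

Shrinking `eta` while keeping `eta * steps` fixed makes `difference` smaller, which is the regime the continuous formulation relies on.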

With the continuous formulation in place, we turn to express the dynamics of the end-to-end matrix $W_e$:

###### Theorem 1.

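Assume the weight matrices $W_1\ldots W_N$ follow the dynamics of continuous gradient descent (Equation 6). Assume also that their initial values (time $t_0$) satisfy, for $j=1\ldots N-1$:

$$W_{j+1}^\top(t_0)\,W_{j+1}(t_0) = W_j(t_0)\,W_j^\top(t_0) \tag{7}$$

Then, the end-to-end weight matrix $W_e$ (Equation 5) is governed by the following differential equation:

$$\dot{W}_e(t) = -\eta\lambda N\cdot W_e(t) - \eta\sum_{j=1}^{N}\left[W_e(t)W_e^\top(t)\right]^{\frac{j-1}{N}}\cdot\frac{dL^1}{dW}\left(W_e(t)\right)\cdot\left[W_e^\top(t)W_e(t)\right]^{\frac{N-j}{N}} \tag{8}$$

where $[\cdot]^{\frac{j-1}{N}}$ and $[\cdot]^{\frac{N-j}{N}}$, $j=1\ldots N$, are fractional power operators defined over positive semidefinite matrices.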

###### Proof.

(sketch; full details in Appendix A.1) If $\lambda=0$ (no weight decay) then one can easily show that $W_{j+1}^\top(t)\,\dot{W}_{j+1}(t) = \dot{W}_j(t)\,W_j^\top(t)$ throughout optimization. Taking the transpose of this equation and adding to itself, followed by integration over time, imply that the difference between $W_{j+1}^\top(t)W_{j+1}(t)$ and $W_j(t)W_j^\top(t)$ is constant. This difference is zero at initialization (Equation 7), thus will remain zero throughout, i.e.:

$$W_{j+1}^\top(t)\,W_{j+1}(t) = W_j(t)\,W_j^\top(t), \quad \forall t\geq t_0 \tag{9}$$

A slightly more delicate treatment shows that this is true even if $\lambda>0$, i.e. with weight decay included.

Equation 9 implies alignment of the (left and right) singular spaces of $W_j(t)$ and $W_{j+1}(t)$, simplifying the product $W_{j+1}(t)W_j(t)$. Successive application of this simplification allows a clean computation for the product of all layers (that is, $W_e$), leading to the explicit form presented in theorem statement (Equation 8). ∎
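The conservation property used in the proof sketch can be observed numerically. The following sketch runs discrete gradient descent on a depth-2 linear network without weight decay and tracks $W_2^\top W_2 - W_1 W_1^\top$. The squared-error loss, the synthetic data and the step size are assumptions made for illustration. In discrete time the quantity is conserved only up to $O(\eta^2)$ per step, so its drift should be far smaller than the movement of the weights themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(20, 4)), rng.normal(size=(20, 2))   # synthetic data (assumed)
W1, W2 = 0.3 * rng.normal(size=(3, 4)), 0.3 * rng.normal(size=(2, 3))

def imbalance(W1, W2):
    """The quantity W_2^T W_2 - W_1 W_1^T, constant under continuous gradient descent."""
    return W2.T @ W2 - W1 @ W1.T

eta, steps = 1e-3, 100
D0, W1_start = imbalance(W1, W2), W1.copy()
for _ in range(steps):
    G = (X @ (W2 @ W1).T - Y).T @ X / len(X)   # dL^1/dW at W_e = W_2 W_1 (squared error)
    # Simultaneous update of both layers; the right-hand side uses the old W1 and W2.
    W1, W2 = W1 - eta * (W2.T @ G), W2 - eta * (G @ W1.T)

drift = np.linalg.norm(imbalance(W1, W2) - D0)   # change in the conserved quantity
movement = np.linalg.norm(W1 - W1_start)         # how far the first layer travelled
```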

Translating the continuous dynamics of Equation 8 back to discrete time, we obtain the sought-after update rule for the end-to-end weight matrix:

$$W_e^{(t+1)} \leftarrow (1-\eta\lambda N)\,W_e^{(t)} - \eta\sum_{j=1}^{N}\left[W_e^{(t)}(W_e^{(t)})^\top\right]^{\frac{j-1}{N}}\cdot\frac{dL^1}{dW}\left(W_e^{(t)}\right)\cdot\left[(W_e^{(t)})^\top W_e^{(t)}\right]^{\frac{N-j}{N}} \tag{10}$$

This update rule relies on two assumptions: first, that the learning rate $\eta$ is small enough for discrete updates to approximate continuous ones; and second, that weights are initialized on par with Equation 7, which will approximately be the case if initialization values are close enough to zero. It is customary in deep learning for both learning rate and weight initializations to be small, but nonetheless the above assumptions are only met to a certain extent. We support their applicability by showing empirically (Section 8) that the end-to-end update rule (Equation 10) indeed provides an accurate description for the dynamics of $W_e$.
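A small numerical check of Equation 10 is sketched below. It is not the authors' implementation: the quadratic loss $L^1(W) = \frac{1}{2}\|W - T\|_F^2$, the random target $T$, the balanced factorization built from an SVD (one convenient way to satisfy Equation 7), and all function names are assumptions introduced for illustration. The sketch takes one gradient step on the individual layers, multiplies them out, and compares the result with one step of the end-to-end rule.

```python
import numpy as np

def psd_power(A, p):
    """Fractional power of a symmetric positive semidefinite matrix (power 0 gives identity)."""
    vals, vecs = np.linalg.eigh(A)
    vals = np.where(vals > 1e-12 * vals.max(), vals, 0.0)  # treat numerically zero eigenvalues as zero
    return (vecs * vals ** p) @ vecs.T

def end_to_end_step(We, G, eta, N, lam=0.0):
    """One step of Equation 10, given the gradient G = dL^1/dW evaluated at We."""
    left, right = We @ We.T, We.T @ We
    pre = sum(psd_power(left, (j - 1) / N) @ G @ psd_power(right, (N - j) / N)
              for j in range(1, N + 1))
    return (1 - eta * lam * N) * We - eta * pre

def balanced_factors(We, N):
    """Layers W_1..W_N (N >= 2) with product We that satisfy Equation 7."""
    U, s, Vt = np.linalg.svd(We, full_matrices=False)
    root = np.diag(s ** (1.0 / N))
    return [root @ Vt] + [root] * (N - 2) + [U @ root]

def chain(mats, size):
    """Product M_last @ ... @ M_first of matrices listed in layer order (identity if empty)."""
    out = np.eye(size)
    for M in mats:
        out = M @ out
    return out

def layered_step(weights, grad_fn, eta):
    """One gradient step on every layer (no weight decay); returns the new end-to-end matrix."""
    d = weights[0].shape[1]
    G = grad_fn(chain(weights, d))
    new = []
    for j, Wj in enumerate(weights):
        below = chain(weights[:j], d)
        above = chain(weights[j + 1:], Wj.shape[0])
        new.append(Wj - eta * above.T @ G @ below.T)
    return chain(new, d)

rng = np.random.default_rng(0)
target = rng.normal(size=(2, 3))
grad_fn = lambda W: W - target            # gradient of the assumed loss ||W - T||_F^2 / 2
We = 0.5 * rng.normal(size=(2, 3))
N, eta = 3, 1e-4

via_layers = layered_step(balanced_factors(We, N), grad_fn, eta)
via_rule = end_to_end_step(We, grad_fn(We), eta, N)
gap = np.linalg.norm(via_layers - via_rule)   # expected to be O(eta^2)
moved = np.linalg.norm(via_rule - We)         # size of the step itself, O(eta)
```

After a single step the layers are no longer exactly balanced, so over many discrete steps the agreement is only approximate, in line with the two assumptions stated above.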

A close look at Equation 10 reveals that the dynamics of the end-to-end weight matrix $W_e$ are similar to gradient descent over $L^1(\cdot)$, the training loss corresponding to a depth-$1$ network (classic linear model). The only difference (besides the scaling by $N$ of the weight decay coefficient $\lambda$) is that the gradient $\frac{dL^1}{dW}(W_e)$ is subject to a transformation before being used. Namely, for $j=1\ldots N$, it is multiplied from the left by $[W_eW_e^\top]^{\frac{j-1}{N}}$ and from the right by $[W_e^\top W_e]^{\frac{N-j}{N}}$, followed by summation over $j$. Clearly, when $N=1$ (depth-$1$ network) this transformation reduces to identity, and as expected, $W_e$ precisely adheres to gradient descent over $L^1(\cdot)$. When $N\geq 2$ the dynamics of $W_e$ are less interpretable. We arrange it as a vector to gain more insight:

###### Claim 1.

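For an arbitrary matrix $A$, denote by $\mathrm{vec}(A)$ its arrangement as a vector in column-first order. Then, the end-to-end update rule in Equation 10 can be written as:

$$\mathrm{vec}\left(W_e^{(t+1)}\right) \leftarrow (1-\eta\lambda N)\cdot\mathrm{vec}\left(W_e^{(t)}\right) - \eta\cdot P_{W_e^{(t)}}\,\mathrm{vec}\left(\frac{dL^1}{dW}\left(W_e^{(t)}\right)\right) \tag{11}$$

where $P_{W_e^{(t)}}$ is a positive semidefinite preconditioning matrix that depends on $W_e^{(t)}$. Namely, if we denote the singular values of $W_e^{(t)}\in\mathbb{R}^{k,d}$ by $\sigma_1\ldots\sigma_{\max\{k,d\}}\in\mathbb{R}_{\geq 0}$ (by definition $\sigma_r=0$ if $r>\min\{k,d\}$), and corresponding left and right singular vectors by $\mathbf{u}_1\ldots\mathbf{u}_k\in\mathbb{R}^k$ and $\mathbf{v}_1\ldots\mathbf{v}_d\in\mathbb{R}^d$ respectively, the eigenvectors of $P_{W_e^{(t)}}$ are:

$$\mathrm{vec}\left(\mathbf{u}_r\mathbf{v}_{r'}^\top\right), \quad r=1\ldots k,\ r'=1\ldots d$$

with corresponding eigenvalues:

$$\sum_{j=1}^{N}\sigma_r^{2\frac{N-j}{N}}\,\sigma_{r'}^{2\frac{j-1}{N}}, \quad r=1\ldots k,\ r'=1\ldots d$$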

###### Proof.

The result readily follows from the properties of the Kronecker product – see Appendix A.2 for details. ∎
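Claim 1 can be illustrated numerically using the identity $\mathrm{vec}(AGB) = (B^\top\otimes A)\,\mathrm{vec}(G)$ for column-first vectorization. The sketch below is an assumption-laden illustration, not the construction from Appendix A.2: it uses a random $2\times 3$ matrix in place of $W_e^{(t)}$, takes $N=3$, and uses hypothetical function names. It assembles the preconditioner from Kronecker products and checks the stated eigenvectors and eigenvalues.

```python
import numpy as np

def psd_power(A, p):
    """Fractional power of a symmetric positive semidefinite matrix (power 0 gives identity)."""
    vals, vecs = np.linalg.eigh(A)
    vals = np.where(vals > 1e-12 * vals.max(), vals, 0.0)  # treat numerically zero eigenvalues as zero
    return (vecs * vals ** p) @ vecs.T

def preconditioner(We, N):
    """Matrix P with vec(sum_j A_j G B_j) = P vec(G), where A_j, B_j are the factors of Equation 10."""
    left, right = We @ We.T, We.T @ We
    return sum(np.kron(psd_power(right, (N - j) / N), psd_power(left, (j - 1) / N))
               for j in range(1, N + 1))

rng = np.random.default_rng(0)
We, N = rng.normal(size=(2, 3)), 3
k, d = We.shape
P = preconditioner(We, N)

U, s, Vt = np.linalg.svd(We)                        # full SVD: U is k x k, Vt is d x d
sigma = np.concatenate([s, np.zeros(max(k, d) - len(s))])

max_err = 0.0
for r in range(k):
    for rp in range(d):
        e = np.outer(U[:, r], Vt[rp]).flatten(order="F")     # vec(u_r v_r'^T), column-first
        lam = sum(sigma[r] ** (2 * (N - j) / N) * sigma[rp] ** (2 * (j - 1) / N)
                  for j in range(1, N + 1))
        max_err = max(max_err, np.linalg.norm(P @ e - lam * e))

min_eig = np.linalg.eigvalsh(P).min()               # positive semidefiniteness check
```

Note that the eigenvalue sum is symmetric under $j \mapsto N+1-j$, which is why the exponents in the claim can be written in either order.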

Claim 1 implies that in the end-to-end update rule of Equation 10, the transformation applied to the gradient $\frac{dL^1}{dW}(W_e^{(t)})$ is essentially a preconditioning, whose eigendirections and eigenvalues depend on the singular value decomposition of $W_e^{(t)}$. The eigendirections are the rank-$1$ matrices $\mathbf{u}_r\mathbf{v}_{r'}^\top$, where $\mathbf{u}_r$ and $\mathbf{v}_{r'}$ are left and right (respectively) singular vectors of $W_e^{(t)}$. The eigenvalue of $\mathbf{u}_r\mathbf{v}_{r'}^\top$ is $\sum_{j=1}^{N}\sigma_r^{2(N-j)/N}\sigma_{r'}^{2(j-1)/N}$, where $\sigma_r$ and $\sigma_{r'}$ are the singular values of $W_e^{(t)}$ corresponding to $\mathbf{u}_r$ and $\mathbf{v}_{r'}$ (respectively). When $N\geq 2$, an increase in $\sigma_r$ or $\sigma_{r'}$ leads to an increase in the eigenvalue corresponding to the eigendirection $\mathbf{u}_r\mathbf{v}_{r'}^\top$. Qualitatively, this implies that the preconditioning favors directions that correspond to singular vectors whose presence in $W_e^{(t)}$ is stronger. We conclude that the effect of overparameterization, i.e. of replacing a classic linear model (depth-$1$ network) by a depth-$N$ linear network, boils down to modifying gradient descent by promoting movement along directions that fall in line with the current location in parameter space.

A-priori, such a preference may seem peculiar: why should an optimization algorithm be sensitive to its location in parameter space? Indeed, we generally expect sensible algorithms to be translation invariant, i.e. be oblivious to parameter value. However, if one takes into account the common practice in deep learning of initializing weights near zero, the location in parameter space can also be regarded as the overall movement made by the algorithm. We thus interpret our findings as indicating that overparameterization promotes movement along directions already taken by the optimization, and therefore can be seen as a form of acceleration. This intuitive interpretation will become more concrete in the subsection that follows.

A final point to make is that the end-to-end update rule (Equation 10 or 11), which obviously depends on $N$, the number of layers in the deep linear network, does not depend on the hidden widths $n_1\ldots n_{N-1}$ (see Section 4). This implies that from an optimization perspective, overparameterizing using wide or narrow networks has the same effect: it is only the depth that matters. Consequently, the acceleration of overparameterization can be attained at a minimal computational price, as we demonstrate empirically in Section 8.

### 5.1 Single Output Case

To facilitate a straightforward presentation of our findings, we hereinafter focus on the special case where the optimized models have a single output, i.e. where $k=1$. This corresponds, for example, to a binary (two-class) classification problem, or to the prediction of a numeric scalar property (regression). It admits a particularly simple form for the end-to-end update rule of Equation 10:

###### Claim 2.

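Assume $k=1$, i.e. $W_e\in\mathbb{R}^{1,d}$. Then, the end-to-end update rule in Equation 10 can be written as follows:

$$W_e^{(t+1)} \leftarrow (1-\eta\lambda N)\cdot W_e^{(t)} - \eta\,\big\|W_e^{(t)}\big\|_2^{2-\frac{2}{N}}\cdot\left(\frac{dL^1}{dW}\left(W_e^{(t)}\right) + (N-1)\cdot\mathrm{Pr}_{W_e^{(t)}}\left\{\frac{dL^1}{dW}\left(W_e^{(t)}\right)\right\}\right) \tag{12}$$

where $\|\cdot\|_2^{2-\frac{2}{N}}$ stands for Euclidean norm raised to the power of $2-\frac{2}{N}$, and $\mathrm{Pr}_W\{\cdot\}$, $W\in\mathbb{R}^{1,d}$, is defined to be the projection operator onto the direction of $W$:

$$\mathrm{Pr}_W : \mathbb{R}^{1,d}\to\mathbb{R}^{1,d}, \qquad \mathrm{Pr}_W\{V\} := \begin{cases}\dfrac{W}{\|W\|_2}V^\top\cdot\dfrac{W}{\|W\|_2}, & W\neq 0 \\[2mm] 0, & W=0\end{cases} \tag{13}$$

###### Proof.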

The result follows from the definition of a fractional power operator over matrices – see Appendix A.3. ∎

Claim 2 implies that in the single output case, the effect of overparameterization (replacing classic linear model by depth-$N$ linear network) on gradient descent is twofold: first, it leads to an adaptive learning rate schedule, by introducing the multiplicative factor $\|W_e\|_2^{2-\frac{2}{N}}$; and second, it amplifies (by $N$) the projection of the gradient on the direction of $W_e$. Recall that we view $W_e$ not only as the optimized parameter, but also as the overall movement made in optimization (initialization is assumed to be near zero). Accordingly, the adaptive learning rate schedule can be seen as gaining confidence (increasing step sizes) when optimization moves farther away from initialization, and the gradient projection amplification can be thought of as a certain type of momentum that favors movement along the azimuth taken so far. These effects bear potential to accelerate convergence, as we illustrate qualitatively in Section 7, and demonstrate empirically in Section 8.
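The single-output rule of Equation 12 is simple enough to write out directly. The sketch below is a minimal illustration (function names and the numeric example are assumptions, and row vectors are represented as 1-D arrays). It shows the two effects described above: with $N=1$ the step is plain gradient descent, while with $N=3$ the step is rescaled by $\|W\|_2^{4/3}$ and the gradient component along $W$ is multiplied by $N=3$.

```python
import numpy as np

def project(W, V):
    """Projection of V onto the direction of W (Equation 13); zero when W is zero."""
    norm = np.linalg.norm(W)
    if norm == 0:
        return np.zeros_like(V)
    unit = W / norm
    return (unit @ V) * unit

def single_output_step(W, G, eta, N, lam=0.0):
    """One step of Equation 12 for a single-output model, given G = dL^1/dW at W."""
    scale = np.linalg.norm(W) ** (2 - 2 / N)      # adaptive learning-rate factor
    return (1 - eta * lam * N) * W - eta * scale * (G + (N - 1) * project(W, G))

# Hypothetical values chosen so the effect is easy to read off.
W = np.array([2.0, 0.0])
G = np.array([1.0, 1.0])
eta = 0.1

plain = single_output_step(W, G, eta, N=1)   # reduces to W - eta * G
deep = single_output_step(W, G, eta, N=3)    # W - eta * 2^(4/3) * [3, 1]
```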

## 6 Overparameterization Effects Cannot Be Attained via Regularization

Adding a regularizer to the objective is a standard approach for improving optimization (though lately the term regularization is typically associated with generalization). For example, AdaGrad was originally invented to compete with the best regularizer from a particular family. The next theorem shows (for single output case) that the effects of overparameterization cannot be attained by adding a regularization term to the original training loss, or via any similar modification. This is not obvious a-priori, as unlike many acceleration methods that explicitly maintain memory of past gradients, updates under overparameterization are by definition the gradients of something. The assumptions in the theorem are minimal and also necessary, as one must rule out the trivial counter-example of a constant training loss.

###### Theorem 2.

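Assume $\frac{dL^1}{dW}$ does not vanish at $W=0$, and is continuous on some neighborhood around this point. For a given $N\in\mathbb{N}$, $N>2$,¹ define:

$$F(W) := \|W\|_2^{2-\frac{2}{N}}\cdot\left(\frac{dL^1}{dW}(W) + (N-1)\cdot\mathrm{Pr}_W\left\{\frac{dL^1}{dW}(W)\right\}\right) \tag{14}$$

where $\mathrm{Pr}_W\{\cdot\}$ is the projection given in Equation 13. Then, there exists no function (of $W$) whose gradient field is $F(\cdot)$.

¹ For the result to hold with $N=2$, additional assumptions on $L^1(\cdot)$ are required; otherwise any non-zero linear function $L^1(W)=WU^\top$ serves as a counter-example: it leads to a vector field $F(\cdot)$ that is the gradient of $W\mapsto\|W\|_2\cdot WU^\top$.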

###### Proof.

(sketch; full details in Appendix A.4) The proof uses elementary differential geometry (Buck, 2003): curves, arc length and the fundamental theorem for line integrals, which states that the integral of $\nabla g$ for any differentiable function $g$ amounts to $0$ along every closed curve.

Overparameterization changes gradient descent’s behavior: instead of following the original gradient $\frac{dL^1}{dW}$, it follows some other direction $F(\cdot)$ (see Equations 12 and 14) that is a function of the original gradient as well as the current point $W$. We think of this change as a transformation that maps one vector field $\phi(\cdot)$ to another, $F_\phi(\cdot)$:

$$F_\phi(W) = \begin{cases}\|W\|^{2-\frac{2}{N}}\left(\phi(W) + (N-1)\left\langle\phi(W), \dfrac{W}{\|W\|}\right\rangle\dfrac{W}{\|W\|}\right), & W\neq 0 \\[2mm] 0, & W=0\end{cases}$$

Notice that for $\phi=\frac{dL^1}{dW}$, we get exactly the vector field $F(\cdot)$ defined in theorem statement.

We note simple properties of the mapping $\phi\mapsto F_\phi$. First, it is linear, since for any vector fields $\phi_1,\phi_2$ and scalar $c$: $F_{\phi_1+\phi_2}=F_{\phi_1}+F_{\phi_2}$ and $F_{c\cdot\phi_1}=c\cdot F_{\phi_1}$. Second, because of the linearity of line integrals, for any curve $\Gamma$, the functional $\phi\mapsto\int_\Gamma F_\phi$, a mapping of vector fields to scalars, is linear.

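We show that $F(\cdot)$ contradicts the fundamental theorem for line integrals. To do so, we construct a closed curve $\Gamma=\Gamma_{r,R}$ for which the linear functional $\phi\mapsto\oint_\Gamma F_\phi$ does not vanish at $\phi=\frac{dL^1}{dW}$. Let $\mathbf{e} := \frac{dL^1}{dW}(W{=}0)\,/\,\big\|\frac{dL^1}{dW}(W{=}0)\big\|$, which is well-defined since by assumption $\frac{dL^1}{dW}(W{=}0)\neq 0$. For $r<R$ we define (see Figure 1):

$$\Gamma_{r,R} := \Gamma^1_{r,R} \to \Gamma^2_{r,R} \to \Gamma^3_{r,R} \to \Gamma^4_{r,R}$$

where:

• $\Gamma^1_{r,R}$ is the line segment from $-R\cdot\mathbf{e}$ to $-r\cdot\mathbf{e}$.

• $\Gamma^2_{r,R}$ is a spherical curve from $-r\cdot\mathbf{e}$ to $r\cdot\mathbf{e}$.

• $\Gamma^3_{r,R}$ is the line segment from $r\cdot\mathbf{e}$ to $R\cdot\mathbf{e}$.

• $\Gamma^4_{r,R}$ is a spherical curve from $R\cdot\mathbf{e}$ to $-R\cdot\mathbf{e}$.

With the definition of $\Gamma_{r,R}$ in place, we decompose $\frac{dL^1}{dW}$ into a constant vector field $\kappa\equiv\frac{dL^1}{dW}(W{=}0)$ plus a residual $\xi$. We explicitly compute the line integrals along $\Gamma^1_{r,R}\ldots\Gamma^4_{r,R}$ for $F_\kappa$, and derive bounds for $F_\xi$. This, along with the linearity of the functional $\phi\mapsto\int_\Gamma F_\phi$, provides a lower bound on the line integral of $F(\cdot)$ over $\Gamma_{r,R}$. We show the lower bound is positive as $r,R\to 0$, thus $F(\cdot)$ indeed contradicts the fundamental theorem for line integrals. ∎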
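A complementary numerical illustration (not the argument of Appendix A.4) uses the fact that a smooth gradient field in two dimensions must have zero curl, i.e. $\partial F_2/\partial W_1 = \partial F_1/\partial W_2$. The sketch below assumes a linear loss with constant gradient $U$, so that $\phi\equiv U$, and estimates the curl of $F$ at an arbitrarily chosen point by central differences. The point, the vector `u` and the function names are all assumptions. For $N=2$ the curl vanishes, consistent with the counter-example in the footnote to Theorem 2, while for $N=3$ it does not.

```python
import numpy as np

def field(W, u, N):
    """F(W) from Equation 14 when dL^1/dW is the constant vector u (linear loss, assumed)."""
    norm = np.linalg.norm(W)
    if norm == 0:
        return np.zeros_like(W)
    unit = W / norm
    return norm ** (2 - 2 / N) * (u + (N - 1) * (unit @ u) * unit)

def curl_2d(W, u, N, h=1e-5):
    """Central-difference estimate of dF_2/dW_1 - dF_1/dW_2 at the point W."""
    e1, e2 = np.array([h, 0.0]), np.array([0.0, h])
    dF2_d1 = (field(W + e1, u, N)[1] - field(W - e1, u, N)[1]) / (2 * h)
    dF1_d2 = (field(W + e2, u, N)[0] - field(W - e2, u, N)[0]) / (2 * h)
    return dF2_d1 - dF1_d2

point = np.array([1.0, 0.5])   # arbitrary non-zero evaluation point
u = np.array([0.0, 1.0])       # constant gradient of the assumed linear loss

curl_depth2 = curl_2d(point, u, N=2)   # field is a gradient: curl should vanish
curl_depth3 = curl_2d(point, u, N=3)   # field is not a gradient: curl is non-zero
```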

## 7 Illustration of Acceleration

Up to this point, we showed that overparameterization (use of depth-$N$ linear network in place of classic linear model) induces on gradient descent a particular preconditioning scheme (Equation 10 in general and 12 in the single output case), which can be interpreted as introducing some forms of momentum and adaptive learning rate. We now illustrate qualitatively, on a very simple hypothetical learning problem, the potential of these to accelerate optimization.

Consider the task of linear regression, assigning to vectors in $\mathbb{R}^2$ labels in $\mathbb{R}$. Suppose that our training set consists of two points in $\mathbb{R}^2\times\mathbb{R}$: $([1,0]^\top, y_1)$ and $([0,1]^\top, y_2)$. Assume also that the loss function of interest is $\ell_p$, $p\in 2\mathbb{N}$: $\ell_p(\hat{y},y)=\frac{1}{p}(\hat{y}-y)^p$. Denoting the learned parameter by $\mathbf{w}=[w_1,w_2]^\top$, the overall training loss can be written as:²

$$L(w_1,w_2) = \tfrac{1}{p}(w_1-y_1)^p + \tfrac{1}{p}(w_2-y_2)^p$$

² We omit the averaging constant $\frac{1}{2}$ for conciseness.

With fixed learning rate $\eta>0$ (weight decay omitted for simplicity), gradient descent over $L(\cdot)$ gives:

$$w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta\left(w_i^{(t)}-y_i\right)^{p-1}, \quad i=1,2$$

Changing variables per $\Delta_i = w_i - y_i$, we have:

$$\Delta_i^{(t+1)} \leftarrow \Delta_i^{(t)}\left(1-\eta\left(\Delta_i^{(t)}\right)^{p-2}\right), \quad i=1,2 \tag{15}$$

Assuming the original weights $w_1$ and $w_2$ are initialized near zero, $\Delta_1$ and $\Delta_2$ start off at $-y_1$ and $-y_2$ respectively, and will eventually reach the optimum $\Delta_1^*=\Delta_2^*=0$ if the learning rate is small enough to prevent divergence:

$$\eta < \frac{2}{y_i^{p-2}}, \quad i=1,2$$

Suppose now that the problem is ill-conditioned, in the sense that $y_1\gg y_2$. If $p=2$ this has no effect on the bound for $\eta$.³ If $p>2$ the learning rate is determined by $y_1$, leading $\Delta_2$ to converge very slowly. In a sense, $\Delta_2$ will suffer from the fact that there is no “communication” between the coordinates (this will actually be the case not just with gradient descent, but with most algorithms typically used in large-scale settings: AdaGrad, Adam, etc.).

³ Optimal learning rate for gradient descent on quadratic objective does not depend on current parameter value (cf. Goh (2017)).

Now consider the scenario where we optimize $L(\cdot)$ via overparameterization, i.e. with the update rule in Equation 12 (single output). In this case the coordinates are coupled, and as $\Delta_1$ gets small ($w_1$ gets close to $y_1$), the learning rate is effectively scaled by $y_1^{2-\frac{2}{N}}$ (in addition to a scaling by $N$ in coordinate $1$ only), allowing (if $y_1>1$) faster convergence of $\Delta_2$. We thus have the luxury of temporarily slowing down $\Delta_2$ to ensure that $\Delta_1$ does not diverge, with the latter speeding up the former as it reaches safe grounds. In Appendix B we consider a special case and formalize this intuition, deriving a concrete bound for the acceleration of overparameterization.
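The qualitative argument of this section can be simulated in a few lines. The sketch below is only an illustration under assumed settings that do not come from the paper: labels $y=(10,1)$, $p=4$, learning rate $10^{-4}$, $20000$ iterations, and an initialization of $0.01$ in both coordinates. It runs the update of Equation 12 on the two-point problem, once with $N=1$ (plain gradient descent) and once with $N=2$, using the same learning rate for both.

```python
import numpy as np

def run(y, p, eta, steps, N):
    """Minimize L(w) = sum_i (w_i - y_i)^p / p with the update of Equation 12.

    N = 1 recovers plain gradient descent; weight decay is omitted.
    """
    w = np.full(2, 1e-2)                      # near-zero (but non-zero) initialization
    for _ in range(steps):
        g = (w - y) ** (p - 1)                # gradient of the l_p training loss
        norm = np.linalg.norm(w)
        unit = w / norm
        w = w - eta * norm ** (2 - 2 / N) * (g + (N - 1) * (unit @ g) * unit)
    return np.sum((w - y) ** p) / p           # final training loss

y = np.array([10.0, 1.0])                     # ill-conditioned labels, y_1 >> y_2 (assumed)
p, eta, steps = 4, 1e-4, 20000                # assumed hyperparameters

loss_plain = run(y, p, eta, steps, N=1)
loss_deep = run(y, p, eta, steps, N=2)
```

With these settings the learning rate satisfies the stability bound $\eta < 2/y_1^{p-2}$ for plain gradient descent, and the overparameterized run benefits from the effective rescaling by $\|W_e\|_2$ once $w_1$ approaches $y_1$. A fair comparison would tune the learning rate separately for each model, as is done in Section 8.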

## 8 Experiments

Our analysis (Section 5) suggests that overparameterization, i.e. replacement of a classic linear model by a deep linear network, induces on gradient descent a certain preconditioning scheme. We qualitatively argued (Section 7) that in some cases, this preconditioning may accelerate convergence. In this section we put these claims to the test, through a series of empirical evaluations based on the TensorFlow toolbox (Abadi et al., 2016). For conciseness, many of the details behind our implementation are deferred to Appendix C.

We begin by evaluating our analytically-derived preconditioning scheme, the end-to-end update rule in Equation 10. Our objective in this experiment is to ensure that our analysis, continuous in nature and based on a particular assumption on weight initialization (Equation 7), is indeed applicable to practical scenarios. We focus on the single output case, where the update rule takes on a particularly simple (and efficiently implementable) form: Equation 12. The dataset chosen was UCI Machine Learning Repository’s “Gas Sensor Array Drift at Different Concentrations” (Vergara et al., 2012; Rodriguez-Lujan et al., 2014). Specifically, we used the dataset’s “Ethanol” problem, a scalar regression task that is one of the largest numeric regression tasks in the repository. As training objectives, we tried both $\ell_2$ and $\ell_4$ losses. Figure 2 shows convergence (training objective per iteration) of gradient descent optimizing depth-$2$ and depth-$3$ linear networks, against optimization of a single layer model using the respective preconditioning schemes (Equation 12 with $N=2,3$). As can be seen, the preconditioning schemes reliably emulate deep network optimization, suggesting that, at least in some cases, our analysis indeed captures practical dynamics.

Alongside the validity of the end-to-end update rule, Figure 2 also demonstrates the negligible effect of network width on convergence, in accordance with our analysis (see Section 5). Specifically, it shows that in the evaluated setting, hidden layers of size 1 (scalars) suffice for the essence of overparameterization to fully emerge. Unless otherwise indicated, all results reported hereinafter are based on this configuration, i.e. on scalar hidden layers. The computational toll associated with overparameterization is thus virtually non-existent.
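To make the scalar-hidden-layer configuration concrete, here is a minimal NumPy sketch (not the paper's TensorFlow implementation). It assumes a single-output model, a mean squared loss, hypothetical synthetic data, and hidden scalars initialized to 1; the function names, learning rate and step count are illustrative choices rather than settings taken from the experiments.

```python
import numpy as np

def end_to_end(w1, scalars):
    """Collapse a depth-N network with scalar hidden layers into one weight vector."""
    return np.prod(scalars) * w1

def gradients(w1, scalars, g):
    """Chain rule for each parameter, given g = dL/dw_e at the end-to-end weights."""
    grad_w1 = np.prod(scalars) * g
    grad_scalars = np.array([
        np.prod(np.delete(scalars, k)) * (w1 @ g) for k in range(len(scalars))
    ])
    return grad_w1, grad_scalars

def train(X, y, depth, lr=0.01, steps=300, seed=0):
    """Plain gradient descent on the overparameterized objective (squared loss assumed)."""
    rng = np.random.default_rng(seed)
    w1 = 0.1 * rng.standard_normal(X.shape[1])
    scalars = np.ones(depth - 1)  # assumed initialization; depth 1 has no extra scalars
    losses = []
    for _ in range(steps):
        residual = X @ end_to_end(w1, scalars) - y
        losses.append(0.5 * np.mean(residual ** 2))
        g = X.T @ residual / len(y)  # gradient of the loss w.r.t. end-to-end weights
        grad_w1, grad_scalars = gradients(w1, scalars, g)
        w1 = w1 - lr * grad_w1
        scalars = scalars - lr * grad_scalars
    return np.array(losses)

# Hypothetical synthetic regression problem with 5 features.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = X @ np.arange(1.0, 6.0)
curves = {depth: train(X, y, depth) for depth in (1, 2, 3)}
extra_params = {depth: depth - 1 for depth in curves}  # one scalar per added layer
```

Each added layer costs a single extra parameter, which is why the overhead of depth is negligible in this configuration. The sketch makes no claim about which depth converges fastest; that depends on the loss and on per-model learning-rate tuning.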

As a final observation on Figure 2, notice that it exhibits faster convergence with a deeper network. This however does not serve as evidence in favor of acceleration by depth, as we did not set learning rates optimally per model (we simply used the common choice of ). To conduct a fair comparison between the networks, and more importantly, between them and a classic single layer model, multiple learning rates were tried, and the one giving fastest convergence was taken on a per-model basis. Figure 3 shows the results of this experiment. As can be seen, convergence of deeper networks is (slightly) slower in the case of  loss. This falls in line with the findings of Saxe et al. (2013). In stark contrast, and on par with our qualitative analysis in Section 7, is the fact that with  loss adding depth significantly accelerated convergence. To the best of our knowledge, this provides the first empirical evidence that depth, even without any gain in expressiveness, and despite introducing non-convexity to a formerly convex problem, can lead to favorable optimization.

In light of the speedup observed with  loss, it is natural to ask how the implicit acceleration of depth compares against explicit methods for acceleration and adaptive learning. Figure 4-left shows convergence of a depth- network (optimized with gradient descent) against that of a single layer model optimized with AdaGrad (Duchi et al., 2011) and AdaDelta (Zeiler, 2012). The displayed curves correspond to optimal learning rates, chosen individually via grid search. Quite surprisingly, we find that in this specific setting, overparameterizing, thereby turning a convex problem non-convex, is a more effective optimization strategy than carefully designed algorithms tailored for convex problems. We note that this was not observed with all algorithms – for example Adam (Kingma & Ba, 2014) was considerably faster than overparameterization. However, when introducing overparameterization simultaneously with Adam (a setting we did not theoretically analyze), further acceleration is attained – see Figure 4-right. This suggests that at least in some cases, not only plain gradient descent benefits from depth, but also more elaborate algorithms commonly employed in state of the art applications.

An immediate question arises at this point. If depth indeed accelerates convergence, why not add as many layers as one can computationally afford? The reason, which is actually apparent in our analysis, is the so-called vanishing gradient problem. When training a very deep network (large $N$), while initializing weights to be small, the end-to-end matrix $W_e$ (Equation 5) is extremely close to zero, severely attenuating gradients in the preconditioning scheme (Equation 10). A possible approach for alleviating this issue is to initialize weights to be larger, yet small enough such that the end-to-end matrix does not “explode”. The choice of identity (or near identity) initialization leads to what is known as linear residual networks (Hardt & Ma, 2016), akin to the successful residual networks architecture (He et al., 2015) commonly employed in deep learning. Notice that identity initialization satisfies the condition in Equation 7, rendering the end-to-end update rule (Equation 10) applicable. Figure 5-left shows convergence, under gradient descent, of a single layer model against deeper networks than those evaluated before – depths  and . As can be seen, with standard, near-zero initialization, the depth- network starts making visible progress only after about  iterations, whereas the depth- network seems stuck even after  iterations. In contrast, under identity initialization, both networks immediately make progress, and again depth serves as an implicit accelerator.
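The attenuation described above can be illustrated numerically. The following NumPy sketch compares the norm of the end-to-end matrix under small random initialization and under identity initialization; the layer width, depths and standard deviation are hypothetical values chosen for illustration, not the ones used in the experiments.

```python
import numpy as np

def end_to_end_matrix(weights):
    """Multiply out a linear network: returns W_N ... W_1 for weights = [W_1, ..., W_N]."""
    product = weights[0]
    for W in weights[1:]:
        product = W @ product
    return product

rng = np.random.default_rng(0)
d = 4        # hypothetical layer width
std = 0.01   # hypothetical "near-zero" initialization scale

norms_small, norms_identity = {}, {}
for depth in (2, 4, 8):
    small = [std * rng.standard_normal((d, d)) for _ in range(depth)]
    identity = [np.eye(d) for _ in range(depth)]
    # Frobenius norm of the end-to-end matrix under each initialization.
    norms_small[depth] = np.linalg.norm(end_to_end_matrix(small))
    norms_identity[depth] = np.linalg.norm(end_to_end_matrix(identity))

for depth in norms_small:
    print(depth, norms_small[depth], norms_identity[depth])
```

With small initialization the norm of the product shrinks roughly geometrically in depth, whereas with identity initialization the product stays the identity regardless of depth.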

As a final sanity test, we evaluate the effect of overparameterization on optimization in a non-idealized (yet simple) deep learning setting. Specifically, we experiment with the convolutional network tutorial for MNIST built into TensorFlow, which includes convolution, pooling and dense layers, ReLU non-linearities, stochastic gradient descent with momentum, and dropout (Srivastava et al., 2014). We introduced overparameterization by simply placing two matrices in succession instead of the matrix in each dense layer. Here, as opposed to previous experiments, widths of the newly formed hidden layers were not set to 1, but rather to the minimal values that do not deteriorate expressiveness (see Appendix C). Overall, with an addition of roughly  in number of parameters, optimization has accelerated considerably – see Figure 5-right. The displayed results were obtained with the hyperparameter settings hardcoded into the tutorial. We have tried alternative settings (varying learning rates and standard deviations of initializations – see Appendix C), and in all cases observed an outcome similar to that in Figure 5-right – overparameterization led to significant speedup. Nevertheless, as reported above for linear networks, it is likely that for non-linear networks the effect of depth on optimization is mixed – some settings are accelerated by it, while others are not. Comprehensive characterization of the cases in which depth accelerates optimization warrants much further study. We hope our work will spur interest in this avenue of research.
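The replacement of a dense layer's matrix by a product of two can be sketched as follows. This NumPy example assumes that the "minimal width that does not deteriorate expressiveness" is the smaller of the layer's input and output dimensions (the actual choice is specified in Appendix C). The function name and shapes are hypothetical, and the factorization shown is only an existence argument, not the initialization used for training.

```python
import numpy as np

def overparameterize(W):
    """Write W (out x in) as W2 @ W1 with hidden width min(out, in).

    Shows that a two-matrix layer of this width can represent any single
    matrix, so expressiveness is unchanged. Not a training initialization.
    """
    out_dim, in_dim = W.shape
    if in_dim <= out_dim:
        W1, W2 = np.eye(in_dim), W.copy()    # hidden width = in_dim
    else:
        W1, W2 = W.copy(), np.eye(out_dim)   # hidden width = out_dim
    return W1, W2

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))              # hypothetical dense layer: 5 inputs, 3 outputs
W1, W2 = overparameterize(W)

params_before = W.size
params_after = W1.size + W2.size
print(W1.shape, W2.shape, params_before, params_after)
```

Because the hidden width equals the smaller dimension, the product `W2 @ W1` can reach every matrix the original layer could, and the extra parameters are a square matrix of that smaller size.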

## 9 Conclusion

Through theory and experiments, we demonstrated that overparameterizing a neural network by increasing its depth can accelerate optimization, even on very simple problems.

Our analysis of linear neural networks, the subject of various recent studies, yielded a new result: for these models, overparameterization by depth can be understood as a preconditioning scheme with a closed form description (Theorem 1 and the claims thereafter). The preconditioning may be interpreted as a combination between certain forms of adaptive learning rate and momentum. Given that it depends on network depth but not on width, acceleration by overparameterization can be attained at a minimal computational price, as we demonstrate empirically in Section 8.

Clearly, complete theoretical analysis for non-linear networks will be challenging. Empirically however, we showed that the trivial idea of replacing an internal weight matrix by a product of two can significantly accelerate optimization, with absolutely no effect on expressiveness (Figure 5-right).

The fact that gradient descent over classic convex problems such as linear regression with $\ell_p$ loss, $p > 2$, can accelerate from transitioning to a non-convex overparameterized objective, does not coincide with conventional wisdom, and provides food for thought. Can this effect be rigorously quantified, similarly to analyses of explicit acceleration methods such as momentum or adaptive regularization (AdaGrad)?

## Acknowledgments

Sanjeev Arora’s work is supported by NSF, ONR, Simons Foundation, Schmidt Foundation, Mozilla Research, Amazon Research, DARPA and SRC. Elad Hazan’s work is supported by NSF grant 1523815 and Google Brain. Nadav Cohen is a member of the Zuckerman Israeli Postdoctoral Scholars Program, and is supported by Eric and Wendy Schmidt.

## Appendix A Deferred Proofs

### A.1 Proof of Theorem 1

Before delving into the proof, we introduce notation that will admit a more compact presentation of formulae. For $1 \le a \le b \le N$, we denote:

$$\prod\nolimits_{a}^{j=b} W_j := W_b W_{b-1} \cdots W_a \qquad , \qquad \prod\nolimits_{j=a}^{b} W_j^\top := W_a^\top W_{a+1}^\top \cdots W_b^\top$$

where $W_1 \ldots W_N$ are the weight matrices of the depth-$N$ linear network (Equation 2). If $a > b$, then by definition both $\prod_{a}^{j=b} W_j$ and $\prod_{j=a}^{b} W_j^\top$ are identity matrices, with size depending on context, i.e. on the dimensions of matrices they are multiplied against. Given any square matrices (possibly scalars) $A_1 \ldots A_m$, we denote by $\mathrm{diag}(A_1 \ldots A_m)$ a block-diagonal matrix holding them on its diagonal:

$$\mathrm{diag}(A_1 \ldots A_m) = \begin{bmatrix} A_1 & 0 & 0 & 0 \\ 0 & \ddots & 0 & 0 \\ 0 & 0 & A_m & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$

As illustrated above, $\mathrm{diag}(A_1 \ldots A_m)$ may hold additional, zero-valued rows and columns beyond $A_1 \ldots A_m$. Conversely, it may also trim (omit) rows and columns, from its bottom and right ends respectively, so long as only zeros are being removed. The exact shape of $\mathrm{diag}(A_1 \ldots A_m)$ is again determined by context: when it is multiplied by one matrix from the left and another from the right, its number of rows equals the number of columns in the left factor, and its number of columns equals the number of rows in the right factor.

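Turning to the actual proof, we disregard the trivial case $N = 1$, and begin by noticing that Equation 3, along with the definition of $W_e$ (Equation 5), imply that for every $j = 1 \ldots N$:

$$\frac{\partial L^N}{\partial W_j}(W_1, \ldots, W_N) = \prod_{i=j+1}^{N} W_i^\top \cdot \frac{dL^1}{dW}(W_e) \cdot \prod_{i=1}^{j-1} W_i^\top$$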

Plugging this into the differential equations of gradient descent (Equation 6), we get:

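$$\dot{W}_j(t) = -\eta\lambda W_j(t) - \eta \prod_{i=j+1}^{N} W_i^\top(t) \cdot \frac{dL^1}{dW}(W_e(t)) \cdot \prod_{i=1}^{j-1} W_i^\top(t) \quad , \quad j = 1 \ldots N \tag{16}$$

For $j = 1 \ldots N-1$, multiply the $j$'th equation by $W_j^\top(t)$ from the right, and the $(j+1)$'th equation by $W_{j+1}^\top(t)$ from the left. After both multiplications the gradient terms coincide, as each equals $-\eta \prod_{i=j+1}^{N} W_i^\top(t) \cdot \frac{dL^1}{dW}(W_e(t)) \cdot \prod_{i=1}^{j} W_i^\top(t)$. This yields:

$$W_{j+1}^\top(t) \dot{W}_{j+1}(t) + \eta\lambda \cdot W_{j+1}^\top(t) W_{j+1}(t) = \dot{W}_j(t) W_j^\top(t) + \eta\lambda \cdot W_j(t) W_j^\top(t)$$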
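The identity above can be checked numerically. The sketch below assumes a squared loss $L^1(W) = \frac{1}{2}\|WX - Y\|_F^2$ (an illustrative choice; the derivation holds for any differentiable $L^1$), a depth-3 network with hypothetical layer sizes, and arbitrary values of $\eta$ and $\lambda$. The helper `flow_rhs` is a hypothetical name for the right-hand side of Equation 16.

```python
import numpy as np

def flow_rhs(weights, G, j, eta, lam):
    """Right-hand side of Equation 16 for layer j (1-indexed).

    weights = [W_1, ..., W_N]; G = dL^1/dW evaluated at the end-to-end matrix.
    """
    term = G
    for W in reversed(weights[j:]):   # left factor: W_{j+1}^T ... W_N^T
        term = W.T @ term
    for W in weights[:j - 1]:         # right factor: W_1^T ... W_{j-1}^T
        term = term @ W.T
    return -eta * lam * weights[j - 1] - eta * term

rng = np.random.default_rng(0)
dims = [5, 4, 3, 2]                   # hypothetical sizes: input 5, hidden 4 and 3, output 2
weights = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(3)]
X = rng.standard_normal((5, 20))
Y = rng.standard_normal((2, 20))

W_e = weights[2] @ weights[1] @ weights[0]
G = (W_e @ X - Y) @ X.T               # gradient of the assumed squared loss at W_e
eta, lam = 0.1, 0.5

gaps = []
for j in (1, 2):
    Wj, Wj1 = weights[j - 1], weights[j]
    lhs = Wj1.T @ flow_rhs(weights, G, j + 1, eta, lam) + eta * lam * (Wj1.T @ Wj1)
    rhs = flow_rhs(weights, G, j, eta, lam) @ Wj.T + eta * lam * (Wj @ Wj.T)
    gaps.append(np.max(np.abs(lhs - rhs)))

print(gaps)  # both entries should be at floating-point round-off level
```

The check works because, after the two multiplications, both sides reduce to the same product of transposed weight matrices around the gradient, so the weight-decay terms are the only place where the two layers differ.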

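Taking the transpose of these equations and adding to themselves, we obtain, for every $j = 1 \ldots N-1$:

$$W_{j+1}^\top(t) \dot{W}_{j+1}(t) + \dot{W}_{j+1}^\top(t) W_{j+1}(t) + 2\eta\lambda \cdot W_{j+1}^\top(t) W_{j+1}(t) = \dot{W}_j(t) W_j^\top(t) + W_j(t) \dot{W}_j^\top(t) + 2\eta\lambda \cdot W_j(t) W_j^\top(t) \tag{17}$$

Denote for $j = 1 \ldots N$:

$$C_j(t) := W_j(t) W_j^\top(t) \qquad , \qquad C'_j(t) := W_j^\top(t) W_j(t)$$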

Equation 17 can now be written as:

 ˙C′j+1(t)+2ηλ⋅