How does depth help? This central question of deep learning still eludes full theoretical understanding. The general consensus is that there is a trade-off: increasing depth improves expressiveness, but complicates optimization. Superior expressiveness of deeper networks, long suspected, is now confirmed by theory, albeit for fairly limited learning problems (Eldan & Shamir, 2015; Raghu et al., 2016; Lee et al., 2017; Cohen et al., 2017; Daniely, 2017; Arora et al., 2018). Difficulties in optimizing deeper networks have also long been clear – the signal held by a gradient gets buried as it propagates through many layers. This is known as the “vanishing/exploding gradient problem”. Modern techniques such as batch normalization (Ioffe & Szegedy, 2015) and residual connections (He et al., 2015) have somewhat alleviated these difficulties in practice.
Given the longstanding consensus on expressiveness vs. optimization trade-offs, this paper conveys a rather counterintuitive message: increasing depth can accelerate optimization. The effect is shown, via first-cut theoretical and empirical analyses, to resemble a combination of two well-known tools in the field of optimization: momentum, which led to provable acceleration bounds (Nesterov, 1983); and adaptive regularization, a more recent technique proven to accelerate by Duchi et al. (2011) in their proposal of the AdaGrad algorithm. Explicit mergers of both techniques are quite popular in deep learning (Kingma & Ba, 2014; Tieleman & Hinton, 2012). It is thus intriguing that merely introducing depth, with no other modification, can have a similar effect, but implicitly.
There is an obvious hurdle in isolating the effect of depth on optimization: if increasing depth leads to faster training on a given dataset, how can one tell whether the improvement arose from a true acceleration phenomenon, or simply due to better representational power (the shallower network was unable to attain the same training loss)? We respond to this hurdle by focusing on linear neural networks (cf. Saxe et al. (2013); Goodfellow et al. (2016); Hardt & Ma (2016); Kawaguchi (2016)). With these models, adding layers does not alter expressiveness; it manifests itself only in the replacement of a matrix parameter by a product of matrices – an overparameterization.
We provide a new analysis of linear neural network optimization via direct treatment of the differential equations associated with gradient descent when training arbitrarily deep networks on arbitrary loss functions. We find that the overparameterization introduced by depth leads gradient descent to operate as if it were training a shallow (single layer) network, while employing a particular preconditioning scheme. The preconditioning promotes movement along directions already taken by the optimization, and can be seen as an acceleration procedure that combines momentum with adaptive learning rates. Even on simple convex problems such as linear regression with $\ell_p$ loss, $p>2$, overparameterization via depth can significantly speed up training. Surprisingly, in some of our experiments, not only did overparameterization outperform naïve gradient descent, but it was also faster than two well-known acceleration methods – AdaGrad (Duchi et al., 2011) and AdaDelta (Zeiler, 2012). In addition to purely linear networks, we also demonstrate (empirically) the implicit acceleration of overparameterization on a non-linear model, by replacing hidden layers with depth-$2$ linear networks. The implicit acceleration of overparameterization is different from standard regularization – we prove its effect cannot be attained via gradients of any fixed regularizer.
Both our theoretical analysis and our empirical evaluation indicate that acceleration via overparameterization need not be computationally expensive. From an optimization perspective, overparameterizing using wide or narrow networks has the same effect – it is only the depth that matters.
The remainder of the paper is organized as follows. In Section 2 we review related work. Section 3 presents a warmup example of linear regression with $\ell_p$ loss, demonstrating the immense effect overparameterization can have on optimization, with as little as a single additional scalar. Our theoretical analysis begins in Section 4, with a setup of preliminary notation and terminology. Section 5 derives the preconditioning scheme implicitly induced by overparameterization, followed by Section 6 which shows that this form of preconditioning is not attainable via any regularizer. In Section 7 we qualitatively analyze a very simple learning problem, demonstrating how the preconditioning can speed up optimization. Our empirical evaluation is delivered in Section 8. Finally, Section 9 concludes.
2 Related Work
Theoretical study of optimization in deep learning is a highly active area of research. Works along this line typically analyze critical points (local minima, saddles) in the landscape of the training objective, either for linear networks (see for example Kawaguchi (2016); Hardt & Ma (2016) or Baldi & Hornik (1989) for a classic account), or for specific non-linear networks under different restrictive assumptions (cf. Choromanska et al. (2015); Haeffele & Vidal (2015); Soudry & Carmon (2016); Safran & Shamir (2017)). Other works characterize other aspects of objective landscapes, for example Safran & Shamir (2016) showed that under certain conditions a monotonically descending path from initialization to global optimum exists (in compliance with the empirical observations of Goodfellow et al. (2014)).
The dynamics of optimization were studied in Fukumizu (1998) and Saxe et al. (2013), for linear networks. Like ours, these works analyze gradient descent through its corresponding differential equations. Fukumizu (1998) focuses on linear regression with $\ell_2$ loss, and does not consider the effect of varying depth – only a two (single hidden) layer network is analyzed. Saxe et al. (2013) also focuses on $\ell_2$ regression, but considers any depth beyond two (inclusive), ultimately concluding that increasing depth can slow down optimization, albeit by a modest amount. In contrast to these two works, our analysis applies to a general loss function, and any depth including one. Intriguingly, we find that for $\ell_p$ regression, acceleration by depth is revealed only when $p>2$. This explains why the conclusion reached in Saxe et al. (2013) differs from ours.
Turning to general optimization, accelerated gradient (momentum) methods were introduced in Nesterov (1983), and later studied in numerous works (see Wibisono et al. (2016) for a short review). Such methods effectively accumulate gradients throughout the entire optimization path, using the collected history to determine the step at a current point in time. Use of preconditioners to speed up optimization is also a well-known technique. Indeed, the classic Newton’s method can be seen as preconditioning based on second derivatives. Adaptive preconditioning with only first-order (gradient) information was popularized by the BFGS algorithm and its variants (cf. Nocedal (1980)). Relevant theoretical guarantees, in the context of regret minimization, were given in Hazan et al. (2007); Duchi et al. (2011). In terms of combining momentum and adaptive preconditioning, Adam (Kingma & Ba, 2014) is a popular approach, particularly for optimization of deep networks.
Algorithms with certain theoretical guarantees for non-convex optimization, and in particular for training deep neural networks, were recently suggested in various works, for example Ge et al. (2015); Agarwal et al. (2017); Carmon et al. (2016); Janzamin et al. (2015); Livni et al. (2014) and references therein. Since the focus of this paper lies on the analysis of algorithms already used by practitioners, such works lie outside our scope.
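3 Warmup: $\ell_p$ Regression
We begin with a simple yet striking example of the effect being studied. For linear regression with $\ell_p$ loss, we will see how even the slightest overparameterization can have an immense effect on optimization. Specifically, we will see that simple gradient descent on an objective overparameterized by a single scalar corresponds to a form of accelerated gradient descent on the original objective.
Consider the objective for a scalar linear regression problem with $\ell_p$ loss ($p$ – even positive integer):
$$L(\mathbf{w}) = \mathbb{E}_{(\mathbf{x},y)\sim S}\left[\tfrac{1}{p}\left(\mathbf{x}^\top\mathbf{w}-y\right)^p\right]$$
Here $\mathbf{x}\in\mathbb{R}^d$ are instances, $y\in\mathbb{R}$ are continuous labels, $S$ is a finite collection of labeled instances (training set), and $\mathbf{w}\in\mathbb{R}^d$ is a learned parameter vector. Suppose now that we apply a simple overparameterization, replacing the parameter vector $\mathbf{w}$ by a vector $\mathbf{w}_1\in\mathbb{R}^d$ times a scalar $w_2\in\mathbb{R}$:
$$L(\mathbf{w}_1,w_2) = \mathbb{E}_{(\mathbf{x},y)\sim S}\left[\tfrac{1}{p}\left(\mathbf{x}^\top\mathbf{w}_1 w_2-y\right)^p\right]$$
Obviously the overparameterization does not affect the expressiveness of the linear model. How does it affect optimization? What happens to gradient descent on this non-convex objective?
Gradient descent over $L(\mathbf{w}_1,w_2)$, with fixed small learning rate and near-zero initialization, is equivalent to gradient descent over $L(\mathbf{w})$ with particular adaptive learning rate and momentum terms.
To see this, consider the gradients of $L(\mathbf{w})$ and $L(\mathbf{w}_1,w_2)$:
$$\nabla_{\mathbf{w}} := \mathbb{E}_{(\mathbf{x},y)\sim S}\left[\left(\mathbf{x}^\top\mathbf{w}-y\right)^{p-1}\mathbf{x}\right]$$
$$\nabla_{\mathbf{w}_1} := \mathbb{E}_{(\mathbf{x},y)\sim S}\left[\left(\mathbf{x}^\top\mathbf{w}_1 w_2-y\right)^{p-1} w_2\,\mathbf{x}\right], \qquad \nabla_{w_2} := \mathbb{E}_{(\mathbf{x},y)\sim S}\left[\left(\mathbf{x}^\top\mathbf{w}_1 w_2-y\right)^{p-1}\mathbf{w}_1^\top\mathbf{x}\right]$$
Gradient descent over $L(\mathbf{w}_1,w_2)$ with learning rate $\eta>0$:
$$\mathbf{w}_1^{(t+1)} \leftarrow \mathbf{w}_1^{(t)} - \eta\nabla_{\mathbf{w}_1^{(t)}}, \qquad w_2^{(t+1)} \leftarrow w_2^{(t)} - \eta\nabla_{w_2^{(t)}}$$
The dynamics of the underlying parameter $\mathbf{w}=\mathbf{w}_1 w_2$ are:
$$\mathbf{w}^{(t+1)} = \mathbf{w}_1^{(t+1)} w_2^{(t+1)} = \mathbf{w}^{(t)} - \eta\left(w_2^{(t)}\right)^2\nabla_{\mathbf{w}^{(t)}} - \eta\left(w_2^{(t)}\right)^{-1}\nabla_{w_2^{(t)}}\mathbf{w}^{(t)} + \mathcal{O}(\eta^2)$$
$\eta$ is assumed to be small, thus we neglect $\mathcal{O}(\eta^2)$. Denoting $\rho^{(t)} := \eta\left(w_2^{(t)}\right)^2$ and $\gamma^{(t)} := \eta\left(w_2^{(t)}\right)^{-1}\nabla_{w_2^{(t)}}$, this gives:
$$\mathbf{w}^{(t+1)} \leftarrow \mathbf{w}^{(t)} - \rho^{(t)}\nabla_{\mathbf{w}^{(t)}} - \gamma^{(t)}\mathbf{w}^{(t)}$$
Since by assumption $\mathbf{w}_1$ and $w_2$ are initialized near zero, $\mathbf{w}$ will initialize near zero as well. This implies that at every iteration $t$, $\mathbf{w}^{(t)}$ is a weighted combination of past gradients. There thus exist $\mu^{(t,\tau)}\in\mathbb{R}$ such that:
$$\mathbf{w}^{(t+1)} \leftarrow \mathbf{w}^{(t)} - \rho^{(t)}\nabla_{\mathbf{w}^{(t)}} - \sum_{\tau=1}^{t-1}\mu^{(t,\tau)}\nabla_{\mathbf{w}^{(\tau)}}$$
We conclude that the dynamics governing the underlying parameter $\mathbf{w}$ correspond to gradient descent with a momentum term, where both the learning rate ($\rho^{(t)}$) and momentum coefficients ($\mu^{(t,\tau)}$) are time-varying and adaptive.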
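The one-step identity behind this argument can be checked numerically. The following is a minimal sketch, assuming a hypothetical three-point training set and $p=4$ (neither is taken from the paper): it performs one gradient descent step on the overparameterized objective, forms the product of the updated vector and scalar, and compares it with the update that uses the adaptive learning rate $\eta w_2^2$ and the shrinkage coefficient $\eta w_2^{-1}\nabla_{w_2}$. The two should differ only by a term of order $\eta^2$.

```python
P = 4  # even exponent of the l_p loss (illustrative choice)
# Hypothetical toy training set S of (x, y) pairs, not from the paper.
DATA = [([1.0, 2.0], 1.0), ([0.5, -1.0], -0.5), ([2.0, 0.0], 0.3)]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def grad_w(w):
    """Gradient of L(w) = mean over S of (1/p) * (x.w - y)^p."""
    g = [0.0] * len(w)
    for x, y in DATA:
        residual = dot(x, w) - y
        for i, xi in enumerate(x):
            g[i] += residual ** (P - 1) * xi / len(DATA)
    return g


def overparameterized_step(w1, w2, eta):
    """One gradient descent step on L(w1, w2), where w = w1 * w2."""
    g = grad_w([w2 * v for v in w1])
    new_w1 = [v - eta * w2 * gi for v, gi in zip(w1, g)]  # grad wrt w1 = w2 * grad_w
    new_w2 = w2 - eta * dot(w1, g)                         # grad wrt w2 = w1 . grad_w
    return new_w1, new_w2


def induced_step(w1, w2, eta):
    """Update w - rho * grad_w - gamma * w, i.e. the O(eta^2) term is dropped."""
    w = [w2 * v for v in w1]
    g = grad_w(w)
    rho = eta * w2 ** 2
    gamma = eta * dot(w1, g) / w2
    return [wi - rho * gi - gamma * wi for wi, gi in zip(w, g)]


w1, w2, eta = [0.1, -0.2], 0.3, 1e-3
new_w1, new_w2 = overparameterized_step(w1, w2, eta)
exact = [new_w2 * v for v in new_w1]
approx = induced_step(w1, w2, eta)
gap = max(abs(a - b) for a, b in zip(exact, approx))
print("max deviation between the two updates:", gap)
```

Expanding the product of the two updated factors shows the deviation is exactly $\eta^2\,\nabla_{w_2}\nabla_{\mathbf{w}_1}$, which is why it shrinks quadratically as the learning rate decreases.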
4 Linear Neural Networks
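Let $\mathcal{X} := \mathbb{R}^d$ be a space of objects (e.g. images or word embeddings) that we would like to infer something about, and let $\mathcal{Y} := \mathbb{R}^k$ be the space of possible inferences. Suppose we are given a training set $\{(\mathbf{x}^{(i)},\mathbf{y}^{(i)})\}_{i=1}^m \subset \mathcal{X}\times\mathcal{Y}$, along with a (point-wise) loss function $l:\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}_{\geq 0}$. For example, $\mathbf{y}$ could hold continuous values with $l(\cdot)$ being the $\ell_2$ loss: $l(\hat{\mathbf{y}},\mathbf{y}) = \frac{1}{2}\|\hat{\mathbf{y}}-\mathbf{y}\|_2^2$; or it could hold one-hot vectors representing categories with $l(\cdot)$ being the softmax-cross-entropy loss: $l(\hat{\mathbf{y}},\mathbf{y}) = -\sum_{r=1}^k y_r \log\left(e^{\hat{y}_r} / \sum_{r'=1}^k e^{\hat{y}_{r'}}\right)$, where $y_r$ and $\hat{y}_r$ stand for coordinate $r$ of $\mathbf{y}$ and $\hat{\mathbf{y}}$ respectively. For a predictor $\phi$, i.e. a mapping from $\mathcal{X}$ to $\mathcal{Y}$, the overall training loss is $L(\phi) := \frac{1}{m}\sum_{i=1}^m l(\phi(\mathbf{x}^{(i)}),\mathbf{y}^{(i)})$. If $\phi$ comes from some parametric family $\Phi := \{\phi_\theta:\mathcal{X}\to\mathcal{Y} \mid \theta\in\Theta\}$, we view the corresponding training loss as a function of the parameters, i.e. we consider $L^{\Phi}(\theta) := \frac{1}{m}\sum_{i=1}^m l(\phi_\theta(\mathbf{x}^{(i)}),\mathbf{y}^{(i)})$. For example, if the parametric family in question is the class of (directly parameterized) linear predictors:
$$\Phi^{lin} := \{\mathbf{x}\mapsto W\mathbf{x} \mid W\in\mathbb{R}^{k,d}\} \tag{1}$$
the respective training loss $L^{\Phi^{lin}}(W)$ is a function from $\mathbb{R}^{k,d}$ to $\mathbb{R}_{\geq 0}$.
In our context, a depth-$N$ ($N\geq 2$) linear neural network, with hidden widths $n_1,n_2,\ldots,n_{N-1}\in\mathbb{N}$, is the following parametric family of linear predictors: $\Phi^{n_1\ldots n_{N-1}} := \{\mathbf{x}\mapsto W_N W_{N-1}\cdots W_1\mathbf{x} \mid W_j\in\mathbb{R}^{n_j,n_{j-1}},\ j=1,\ldots,N\}$, where by definition $n_0 := d$ and $n_N := k$. As customary, we refer to each $W_j$, $j=1,\ldots,N$, as the weight matrix of layer $j$. For simplicity of presentation, we hereinafter omit from our notation the hidden widths $n_1\ldots n_{N-1}$, and simply write $\Phi^N$ instead of $\Phi^{n_1\ldots n_{N-1}}$ ($n_1\ldots n_{N-1}$ will be specified explicitly if not clear by context). That is, we denote:
$$\Phi^N := \{\mathbf{x}\mapsto W_N W_{N-1}\cdots W_1\mathbf{x} \mid W_j\in\mathbb{R}^{n_j,n_{j-1}},\ j=1,\ldots,N\} \tag{2}$$
For completeness, we regard a depth-$1$ network as the family of directly parameterized linear predictors, i.e. we set $\Phi^1 := \Phi^{lin}$ (see Equation 1).
The training loss that corresponds to a depth-$N$ linear network – $L^{\Phi^N}(W_1,\ldots,W_N)$, is a function from $\mathbb{R}^{n_1,n_0}\times\cdots\times\mathbb{R}^{n_N,n_{N-1}}$ to $\mathbb{R}_{\geq 0}$. For brevity, we will denote this function by $L^N(\cdot)$. Our focus lies on the behavior of gradient descent when minimizing $L^N(\cdot)$. More specifically, we are interested in the dependence of this behavior on $N$, and in particular, in the possibility of increasing $N$ leading to acceleration. Notice that for any $N\geq 2$ we have:
$$L^N(W_1,\ldots,W_N) = L^1(W_N W_{N-1}\cdots W_1) \tag{3}$$
and so the sole difference between the training loss of a depth-$N$ network and that of a depth-$1$ network (classic linear model) lies in the replacement of a matrix parameter by a product of $N$ matrices. This implies that if increasing $N$ can indeed accelerate convergence, it is not an outcome of any phenomenon other than favorable properties of depth-induced overparameterization for optimization.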
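The point that depth leaves expressiveness unchanged can be made concrete with a small sketch. The code below, using hypothetical integer weight matrices chosen only for illustration, evaluates a depth-2 linear network layer by layer and compares the result with a single multiplication by the product of its weight matrices.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def apply(matrix, x):
    """Matrix-vector product."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]


def deep_forward(weights, x):
    """Output of the linear network x -> W_N ... W_1 x; weights = [W_1, ..., W_N]."""
    h = x
    for w in weights:
        h = apply(w, h)
    return h


def end_to_end(weights):
    """Collapse the layers into the single matrix W_N ... W_1."""
    product = weights[0]
    for w in weights[1:]:
        product = matmul(w, product)
    return product


# Hypothetical depth-2 network with d = 3 inputs, hidden width 2, k = 1 output.
W1 = [[1, 2, 0], [0, 1, 3]]
W2 = [[2, -1]]
x = [1, 1, 1]

print(deep_forward([W1, W2], x))        # layer-by-layer evaluation
print(apply(end_to_end([W1, W2]), x))   # single-matrix evaluation
```

Both evaluations give the same prediction for every input, so any loss computed on the deep network's outputs equals the loss of the collapsed single-layer model.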
5 Implicit Dynamics of Gradient Descent
In this section we present a new result for linear neural networks, tying the dynamics of gradient descent on $L^N(\cdot)$ – the training loss corresponding to a depth-$N$ network, to those on $L^1(\cdot)$ – training loss of a depth-$1$ network (classic linear model). Specifically, we show that gradient descent on $L^N(\cdot)$, a complicated and seemingly pointless overparameterization, can be directly rewritten as a particular preconditioning scheme over gradient descent on $L^1(\cdot)$.
When applied to $L^N(\cdot)$, gradient descent takes on the following form:
$$W_j^{(t+1)} \leftarrow (1-\eta\lambda)W_j^{(t)} - \eta\frac{\partial L^N}{\partial W_j}\left(W_1^{(t)},\ldots,W_N^{(t)}\right), \quad j=1,\ldots,N \tag{4}$$
Here $\eta>0$ is a learning rate, and $\lambda\geq 0$ is an optional weight decay coefficient. For simplicity, we regard both $\eta$ and $\lambda$ as fixed (no dependence on $t$). Define the underlying end-to-end weight matrix:
$$W_e := W_N W_{N-1}\cdots W_1 \tag{5}$$
Given that $L^N(W_1,\ldots,W_N) = L^1(W_e)$ (Equation 3), we view $W_e$ as an optimized weight matrix for $L^1(\cdot)$, whose dynamics are governed by Equation 4. Our interest then boils down to the study of these dynamics for different choices of $N$. For $N=1$ they are (trivially) equivalent to standard gradient descent over $L^1(\cdot)$. We will characterize the dynamics for $N\geq 2$.
To be able to derive, in our general setting, an explicit update rule for the end-to-end weight matrix $W_e$ (Equation 5), we introduce an assumption by which the learning rate is small, i.e. $\eta^2\approx 0$. Formally, this amounts to translating Equation 4 to the following set of differential equations:
$$\dot{W}_j(t) = -\eta\lambda W_j(t) - \eta\frac{\partial L^N}{\partial W_j}\left(W_1(t),\ldots,W_N(t)\right), \quad j=1,\ldots,N \tag{6}$$
where $t$ is now a continuous time index, and $\dot{W}_j(t)$ stands for the derivative of $W_j$ with respect to time. The use of differential equations, for both theoretical analysis and algorithm design, has a long and rich history in optimization research (see Helmke & Moore (2012) for an overview). When step sizes (learning rates) are taken to be small, trajectories of discrete optimization algorithms converge to smooth curves modeled by continuous-time differential equations, paving the way to the well-established theory of the latter (cf. Boyce et al. (1969)). This approach has led to numerous interesting findings, including recent results in the context of acceleration methods (e.g. Su et al. (2014); Wibisono et al. (2016)).
With the continuous formulation in place, we turn to express the dynamics of the end-to-end matrix $W_e$:
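Theorem 1. Assume the weight matrices $W_1,\ldots,W_N$ follow the dynamics of continuous gradient descent (Equation 6). Assume also that their initial values (time $t_0$) satisfy, for $j=1,\ldots,N-1$:
$$W_{j+1}^\top(t_0)W_{j+1}(t_0) = W_j(t_0)W_j^\top(t_0) \tag{7}$$
Then, the end-to-end weight matrix $W_e$ (Equation 5) is governed by the following differential equation:
$$\dot{W}_e(t) = -\eta\lambda N\cdot W_e(t) - \eta\sum_{j=1}^N\left[W_e(t)W_e^\top(t)\right]^{\frac{j-1}{N}}\cdot\frac{dL^1}{dW}\left(W_e(t)\right)\cdot\left[W_e^\top(t)W_e(t)\right]^{\frac{N-j}{N}} \tag{8}$$
where $[\cdot]^{\frac{j-1}{N}}$ and $[\cdot]^{\frac{N-j}{N}}$, $j=1,\ldots,N$, are fractional power operators defined over positive semidefinite matrices.
Proof (sketch – full details in Appendix A.1). If $\lambda=0$ (no weight decay) then one can easily show that $W_{j+1}^\top(t)\dot{W}_{j+1}(t) = \dot{W}_j(t)W_j^\top(t)$ throughout optimization. Taking the transpose of this equation and adding to itself, followed by integration over time, imply that the difference between $W_{j+1}^\top(t)W_{j+1}(t)$ and $W_j(t)W_j^\top(t)$ is constant. This difference is zero at initialization (Equation 7), thus will remain zero throughout, i.e.:
$$W_{j+1}^\top(t)W_{j+1}(t) = W_j(t)W_j^\top(t), \quad j=1,\ldots,N-1, \quad \forall t\geq t_0 \tag{9}$$
A slightly more delicate treatment shows that this is true even if $\lambda>0$, i.e. with weight decay included.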
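The conservation property used in the proof sketch can be observed numerically in the simplest setting. The sketch below is an illustration under stated assumptions, not the paper's experiment: it takes a depth-2 network with a single scalar hidden unit (first layer a vector `w1`, second layer a scalar `w2`), a hypothetical quadratic training loss with targets `TARGETS`, no weight decay, and a balanced initialization in which the squared scalar equals the squared norm of the vector. With discrete steps the balance is preserved only up to a drift of order $\eta^2$ per step.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


TARGETS = [1.0, 0.5]  # hypothetical regression targets


def loss_gradient(w):
    """Gradient of the toy loss L^1(w) = 0.5 * sum_i (w_i - y_i)^2."""
    return [wi - yi for wi, yi in zip(w, TARGETS)]


def gradient_descent_depth2(w1, w2, eta, steps):
    """Gradient descent on L^1(w2 * w1); records w2^2 - ||w1||^2 after every step."""
    gaps = [w2 * w2 - dot(w1, w1)]
    for _ in range(steps):
        g = loss_gradient([w2 * v for v in w1])
        grad_w1 = [w2 * gi for gi in g]   # chain rule through the scalar layer
        grad_w2 = dot(w1, g)              # chain rule through the vector layer
        w1 = [v - eta * gv for v, gv in zip(w1, grad_w1)]
        w2 = w2 - eta * grad_w2
        gaps.append(w2 * w2 - dot(w1, w1))
    return w1, w2, gaps


# Balanced initialization: w2^2 = 0.25 = ||w1||^2.
final_w1, final_w2, gaps = gradient_descent_depth2([0.3, 0.4], 0.5, eta=1e-3, steps=2000)
max_drift = max(abs(g) for g in gaps)
print("largest |w2^2 - ||w1||^2| along the run:", max_drift)
```

The first-order terms of the two squared quantities cancel exactly at every step, so the recorded difference changes only through second-order terms in the learning rate, which is the discrete counterpart of the constant difference in the continuous analysis.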
Translating the continuous dynamics of Equation 8 back to discrete time, we obtain the sought-after update rule for the end-to-end weight matrix:
$$W_e^{(t+1)} \leftarrow (1-\eta\lambda N)W_e^{(t)} - \eta\sum_{j=1}^N\left[W_e^{(t)}(W_e^{(t)})^\top\right]^{\frac{j-1}{N}}\cdot\frac{dL^1}{dW}\left(W_e^{(t)}\right)\cdot\left[(W_e^{(t)})^\top W_e^{(t)}\right]^{\frac{N-j}{N}} \tag{10}$$
This update rule relies on two assumptions: first, that the learning rate $\eta$ is small enough for discrete updates to approximate continuous ones; and second, that weights are initialized on par with Equation 7, which will approximately be the case if initialization values are close enough to zero. It is customary in deep learning for both learning rate and weight initializations to be small, but nonetheless the above assumptions are only met to a certain extent. We support their applicability by showing empirically (Section 8) that the end-to-end update rule (Equation 10) indeed provides an accurate description for the dynamics of $W_e$.
A close look at Equation 10 reveals that the dynamics of the end-to-end weight matrix $W_e$ are similar to gradient descent over $L^1(\cdot)$ – training loss corresponding to a depth-$1$ network (classic linear model). The only difference (besides the scaling by $N$ of the weight decay coefficient $\lambda$) is that the gradient $\frac{dL^1}{dW}(W_e)$ is subject to a transformation before being used. Namely, for $j=1,\ldots,N$, it is multiplied from the left by $[W_e W_e^\top]^{\frac{j-1}{N}}$ and from the right by $[W_e^\top W_e]^{\frac{N-j}{N}}$, followed by summation over $j$. Clearly, when $N=1$ (depth-$1$ network) this transformation reduces to identity, and as expected, $W_e$ precisely adheres to gradient descent over $L^1(\cdot)$. When $N\geq 2$ the dynamics of $W_e$ are less interpretable. We arrange it as a vector to gain more insight:
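Claim 1. For an arbitrary matrix $A$, denote by $vec(A)$ its arrangement as a vector in column-first order. Then, the end-to-end update rule in Equation 10 can be written as:
$$vec\left(W_e^{(t+1)}\right) \leftarrow (1-\eta\lambda N)\cdot vec\left(W_e^{(t)}\right) - \eta\cdot P_{W_e^{(t)}}\, vec\left(\frac{dL^1}{dW}\left(W_e^{(t)}\right)\right) \tag{11}$$
where $P_{W_e^{(t)}}$ is a positive semidefinite preconditioning matrix that depends on $W_e^{(t)}$. Namely, if we denote the singular values of $W_e^{(t)}\in\mathbb{R}^{k,d}$ by $\sigma_1,\ldots,\sigma_{\max\{k,d\}}\in\mathbb{R}_{\geq 0}$ (by definition $\sigma_r=0$ if $r>\min\{k,d\}$), and corresponding left and right singular vectors by $\mathbf{u}_1,\ldots,\mathbf{u}_k\in\mathbb{R}^k$ and $\mathbf{v}_1,\ldots,\mathbf{v}_d\in\mathbb{R}^d$ respectively, the eigenvectors of $P_{W_e^{(t)}}$ are:
$$vec\left(\mathbf{u}_r\mathbf{v}_{r'}^\top\right), \quad r=1,\ldots,k, \quad r'=1,\ldots,d$$
with corresponding eigenvalues:
$$\sum_{j=1}^N\sigma_r^{2\frac{N-j}{N}}\sigma_{r'}^{2\frac{j-1}{N}}, \quad r=1,\ldots,k, \quad r'=1,\ldots,d$$
Proof. The result readily follows from the properties of the Kronecker product – see Appendix A.2 for details. ∎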
The transformation applied to the gradient in Equation 11 is essentially a preconditioning, whose eigendirections and eigenvalues depend on the singular value decomposition of $W_e^{(t)}$. The eigendirections are the rank-$1$ matrices $\mathbf{u}_r\mathbf{v}_{r'}^\top$, where $\mathbf{u}_r$ and $\mathbf{v}_{r'}$ are left and right (respectively) singular vectors of $W_e^{(t)}$. The eigenvalue of $\mathbf{u}_r\mathbf{v}_{r'}^\top$ is $\sum_{j=1}^N\sigma_r^{2(N-j)/N}\sigma_{r'}^{2(j-1)/N}$, where $\sigma_r$ and $\sigma_{r'}$ are the singular values of $W_e^{(t)}$ corresponding to $\mathbf{u}_r$ and $\mathbf{v}_{r'}$ (respectively). When $N\geq 2$, an increase in $\sigma_r$ or $\sigma_{r'}$ leads to an increase in the eigenvalue corresponding to the eigendirection $\mathbf{u}_r\mathbf{v}_{r'}^\top$. Qualitatively, this implies that the preconditioning favors directions that correspond to singular vectors whose presence in $W_e^{(t)}$ is stronger. We conclude that the effect of overparameterization, i.e. of replacing a classic linear model (depth-$1$ network) by a depth-$N$ linear network, boils down to modifying gradient descent by promoting movement along directions that fall in line with the current location in parameter space. A-priori, such a preference may seem peculiar – why should an optimization algorithm be sensitive to its location in parameter space? Indeed, we generally expect sensible algorithms to be translation invariant, i.e. be oblivious to parameter value. However, if one takes into account the common practice in deep learning of initializing weights near zero, the location in parameter space can also be regarded as the overall movement made by the algorithm. We thus interpret our findings as indicating that overparameterization promotes movement along directions already taken by the optimization, and therefore can be seen as a form of acceleration. This intuitive interpretation will become more concrete in the subsection that follows.
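As a worked illustration of how the preconditioner's eigenvalues depend on depth, the sum $\sum_{j=1}^N\sigma_r^{2(N-j)/N}\sigma_{r'}^{2(j-1)/N}$ can be written out for small $N$. The symbols $\sigma_r$ and $\sigma_{r'}$ denote the singular values of the end-to-end matrix attached to the left and right singular vectors of the eigendirection; the convention $0^0=1$ is assumed here.

```latex
\begin{align*}
N=1:&\quad \sigma_r^{0}\,\sigma_{r'}^{0} = 1
      && \text{(identity preconditioner, plain gradient descent)}\\
N=2:&\quad \sigma_r^{1}\,\sigma_{r'}^{0} + \sigma_r^{0}\,\sigma_{r'}^{1}
      = \sigma_r + \sigma_{r'}\\
N=3:&\quad \sigma_r^{4/3} + \sigma_r^{2/3}\sigma_{r'}^{2/3} + \sigma_{r'}^{4/3}
\end{align*}
```

For $N\geq 2$ every term is non-decreasing in both singular values, so directions aligned with strong singular vectors receive larger effective step sizes, while a direction whose two singular values are both zero receives none.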
A final point to make is that the end-to-end update rule (Equation 10 or 11), which obviously depends on $N$ – number of layers in the deep linear network, does not depend on the hidden widths $n_1\ldots n_{N-1}$ (see Section 4). This implies that from an optimization perspective, overparameterizing using wide or narrow networks has the same effect – it is only the depth that matters. Consequently, the acceleration of overparameterization can be attained at a minimal computational price, as we demonstrate empirically in Section 8.
5.1 Single Output Case
To facilitate a straightforward presentation of our findings, we hereinafter focus on the special case where the optimized models have a single output, i.e. where $k=1$. This corresponds, for example, to a binary (two-class) classification problem, or to the prediction of a numeric scalar property (regression). It admits a particularly simple form for the end-to-end update rule of Equation 10:
Claim 2. Assume $k=1$, i.e. $W_e\in\mathbb{R}^{1,d}$. Then, the end-to-end update rule in Equation 10 can be written as follows:
$$W_e^{(t+1)} \leftarrow (1-\eta\lambda N)\cdot W_e^{(t)} - \eta\left\|W_e^{(t)}\right\|_2^{2-\frac{2}{N}}\cdot\left(\frac{dL^1}{dW}\left(W_e^{(t)}\right) + (N-1)\cdot Pr_{W_e^{(t)}}\left\{\frac{dL^1}{dW}\left(W_e^{(t)}\right)\right\}\right) \tag{12}$$
where $\|\cdot\|_2^{2-\frac{2}{N}}$ stands for Euclidean norm raised to the power of $2-\frac{2}{N}$, and $Pr_W\{\cdot\}$, $W\in\mathbb{R}^{1,d}$, is defined to be the projection operator onto the direction of $W$:
$$Pr_W:\mathbb{R}^{1,d}\to\mathbb{R}^{1,d}, \qquad Pr_W\{V\} := \begin{cases}\frac{W}{\|W\|_2}V^\top\cdot\frac{W}{\|W\|_2}, & W\neq 0\\ 0, & W=0\end{cases} \tag{13}$$
Proof. The result follows from the definition of a fractional power operator over matrices – see Appendix A.3. ∎
Claim 2 implies that in the single output case, the effect of overparameterization (replacing classic linear model by depth-$N$ linear network) on gradient descent is twofold: first, it leads to an adaptive learning rate schedule, by introducing the multiplicative factor $\|W_e\|_2^{2-\frac{2}{N}}$; and second, it amplifies (by $N$) the projection of the gradient on the direction of $W_e$. Recall that we view $W_e$ not only as the optimized parameter, but also as the overall movement made in optimization (initialization is assumed to be near zero). Accordingly, the adaptive learning rate schedule can be seen as gaining confidence (increasing step sizes) when optimization moves farther away from initialization, and the gradient projection amplification can be thought of as a certain type of momentum that favors movement along the azimuth taken so far. These effects bear potential to accelerate convergence, as we illustrate qualitatively in Section 7, and demonstrate empirically in Section 8.
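The single-output update can be sketched in a few lines. The code below is an illustrative implementation under stated assumptions, not the authors' code: the function names `project` and `end_to_end_step` are hypothetical, the gradient of the depth-1 training loss is passed in by the caller, and the end-to-end weights are stored as a plain list. It scales the step by the Euclidean norm raised to the power $2-2/N$ and adds $N-1$ extra copies of the gradient's projection onto the current weights.

```python
import math


def project(w, v):
    """Projection of v onto the direction of w; the zero vector if w is zero."""
    norm_sq = sum(wi * wi for wi in w)
    if norm_sq == 0.0:
        return [0.0] * len(v)
    coeff = sum(wi * vi for wi, vi in zip(w, v)) / norm_sq
    return [coeff * wi for wi in w]


def end_to_end_step(w, grad, eta, depth, weight_decay=0.0):
    """One step of the single-output end-to-end update for a depth-`depth` network.

    w    -- current end-to-end weights (list of floats)
    grad -- gradient of the depth-1 training loss at w (supplied by the caller)
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    scale = norm ** (2.0 - 2.0 / depth)        # adaptive learning rate factor
    proj = project(w, grad)
    shrink = 1.0 - eta * weight_decay * depth  # weight decay is scaled by depth
    return [shrink * wi - eta * scale * (gi + (depth - 1) * pi)
            for wi, gi, pi in zip(w, grad, proj)]


w = [3.0, 4.0]
print(end_to_end_step(w, [1.0, 2.0], eta=0.1, depth=1))    # plain gradient descent
print(end_to_end_step(w, [3.0, 4.0], eta=0.01, depth=2))   # gradient parallel to w
print(end_to_end_step(w, [4.0, -3.0], eta=0.01, depth=2))  # gradient orthogonal to w
```

With depth 1 the factor and the projection term both drop out, recovering ordinary gradient descent. With depth 2 and weights of norm 5, a gradient parallel to the weights is applied with an effective multiplier of $2\cdot 5 = 10$, while an orthogonal gradient only receives the norm factor of 5.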
6 Overparameterization Effects Cannot Be Attained via Regularization
Adding a regularizer to the objective is a standard approach for improving optimization (though lately the term regularization is typically associated with generalization). For example, AdaGrad was originally invented to compete with the best regularizer from a particular family. The next theorem shows (for single output case) that the effects of overparameterization cannot be attained by adding a regularization term to the original training loss, or via any similar modification. This is not obvious a-priori, as unlike many acceleration methods that explicitly maintain memory of past gradients, updates under overparameterization are by definition the gradients of something. The assumptions in the theorem are minimal and also necessary, as one must rule out the trivial counter-example of a constant training loss.
Theorem 2. Assume $\frac{dL^1}{dW}$ does not vanish at $W=0$, and is continuous on some neighborhood around this point. For a given $N\in\mathbb{N}$, $N>2$, define:
$$F(W) := \|W\|_2^{2-\frac{2}{N}}\cdot\left(\frac{dL^1}{dW}(W) + (N-1)\cdot Pr_W\left\{\frac{dL^1}{dW}(W)\right\}\right) \tag{14}$$
where $Pr_W\{\cdot\}$ is the projection given in Equation 13. Then, there exists no function (of $W$) whose gradient field is $F(\cdot)$. (For the result to hold with $N=2$, additional assumptions on $L^1(\cdot)$ are required; otherwise any non-zero linear function $L^1(W)=WU^\top$ serves as a counter-example – it leads to a vector field $F(\cdot)$ that is the gradient of $W\mapsto\|W\|_2\cdot WU^\top$.)
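Proof (sketch – full details in Appendix A.4). The proof uses elementary differential geometry (Buck, 2003): curves, arc length and the fundamental theorem for line integrals, which states that the integral of $\nabla g$ for any differentiable function $g$ amounts to $0$ along every closed curve.
Overparameterization changes gradient descent’s behavior: instead of following the original gradient $\frac{dL^1}{dW}$, it follows some other direction $F(\cdot)$ (see Equations 12 and 14) that is a function of the original gradient as well as the current point $W$. We think of this change as a transformation that maps one vector field $\phi(\cdot)$ to another – $F_\phi(\cdot)$:
$$F_\phi(W) = \begin{cases}\|W\|^{2-\frac{2}{N}}\left(\phi(W) + (N-1)\left\langle\phi(W),\frac{W}{\|W\|}\right\rangle\frac{W}{\|W\|}\right), & W\neq 0\\ 0, & W=0\end{cases}$$
Notice that for $\phi=\frac{dL^1}{dW}$, we get exactly the vector field $F(\cdot)$ defined in theorem statement.
We note simple properties of the mapping $\phi\mapsto F_\phi$. First, it is linear, since for any vector fields $\phi_1,\phi_2$ and scalar $c$: $F_{\phi_1+\phi_2}=F_{\phi_1}+F_{\phi_2}$ and $F_{c\cdot\phi_1}=c\cdot F_{\phi_1}$. Second, because of the linearity of line integrals, for any curve $\Gamma$, the functional $\phi\mapsto\int_\Gamma F_\phi$, a mapping of vector fields to scalars, is linear.
We show that $F(\cdot)$ contradicts the fundamental theorem for line integrals. To do so, we construct a closed curve $\Gamma=\Gamma_{r,R}$ for which the linear functional $\phi\mapsto\oint_\Gamma F_\phi$ does not vanish at $\phi=\frac{dL^1}{dW}$. Let $\mathbf{e} := \frac{dL^1}{dW}(W=0)/\left\|\frac{dL^1}{dW}(W=0)\right\|$, which is well-defined since by assumption $\frac{dL^1}{dW}(W=0)\neq 0$. For $r<R$ we define $\Gamma_{r,R} := \Gamma^1_{r,R}\to\Gamma^2_{r,R}\to\Gamma^3_{r,R}\to\Gamma^4_{r,R}$ (see Figure 1), where:
$\Gamma^1_{r,R}$ is the line segment from $-R\cdot\mathbf{e}$ to $-r\cdot\mathbf{e}$.
$\Gamma^2_{r,R}$ is a spherical curve from $-r\cdot\mathbf{e}$ to $r\cdot\mathbf{e}$.
$\Gamma^3_{r,R}$ is the line segment from $r\cdot\mathbf{e}$ to $R\cdot\mathbf{e}$.
$\Gamma^4_{r,R}$ is a spherical curve from $R\cdot\mathbf{e}$ to $-R\cdot\mathbf{e}$.
With the definition of $\Gamma_{r,R}$ in place, we decompose $\frac{dL^1}{dW}$ into a constant vector field $\kappa\equiv\frac{dL^1}{dW}(W=0)$ plus a residual $\xi$. We explicitly compute the line integrals along $\Gamma^1_{r,R},\ldots,\Gamma^4_{r,R}$ for $F_\kappa$, and derive bounds for $F_\xi$. This, along with the linearity of the functional $\phi\mapsto\int_\Gamma F_\phi$, provides a lower bound on the line integral of $F(\cdot)$ over $\Gamma_{r,R}$. We show the lower bound is positive as $r,R\to 0$, thus $F(\cdot)$ indeed contradicts the fundamental theorem for line integrals. ∎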
7 Illustration of Acceleration
So far, we showed that overparameterization (use of depth-$N$ linear network in place of classic linear model) induces on gradient descent a particular preconditioning scheme (Equation 10 in general and Equation 12 in the single output case), which can be interpreted as introducing some forms of momentum and adaptive learning rate. We now illustrate qualitatively, on a very simple hypothetical learning problem, the potential of these to accelerate optimization.
Consider the task of $\ell_p$ linear regression, assigning to vectors in $\mathbb{R}^2$ labels in $\mathbb{R}$. Suppose that our training set consists of two points in $\mathbb{R}^2\times\mathbb{R}$: $([1,0]^\top,y_1)$ and $([0,1]^\top,y_2)$. Assume also that the loss function of interest is $\ell_p$, $p\in 2\mathbb{N}$: $\ell_p(\hat{y},y)=\frac{1}{p}(\hat{y}-y)^p$. Denoting the learned parameter by $\mathbf{w}=[w_1,w_2]^\top$, the overall training loss can be written as (we omit the averaging constant $\frac{1}{2}$ for conciseness):
$$L(w_1,w_2) = \tfrac{1}{p}(w_1-y_1)^p + \tfrac{1}{p}(w_2-y_2)^p$$
With fixed learning rate $\eta>0$ (weight decay omitted for simplicity), gradient descent over $L(\cdot)$ gives:
$$w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta\left(w_i^{(t)}-y_i\right)^{p-1}, \quad i=1,2$$
Changing variables per $\Delta_i=w_i-y_i$, we have:
$$\Delta_i^{(t+1)} \leftarrow \Delta_i^{(t)}\left(1-\eta\left(\Delta_i^{(t)}\right)^{p-2}\right), \quad i=1,2$$
Assuming the original weights $w_1$ and $w_2$ are initialized near zero, $\Delta_1$ and $\Delta_2$ start off at $-y_1$ and $-y_2$ respectively, and will eventually reach the optimum $\Delta_1^*=\Delta_2^*=0$ if the learning rate is small enough to prevent divergence:
$$\eta < \frac{2}{y_i^{p-2}}, \quad i=1,2$$
Suppose now that the problem is ill-conditioned, in the sense that $y_1\gg y_2$. If $p=2$ this has no effect on the bound for $\eta$ (the optimal learning rate for gradient descent on a quadratic objective does not depend on the current parameter value, cf. Goh (2017)). If $p>2$ the learning rate is determined by $y_1$, leading $\Delta_2$ to converge very slowly. In a sense, $\Delta_2$ will suffer from the fact that there is no “communication” between the coordinates (this will actually be the case not just with gradient descent, but with most algorithms typically used in large-scale settings – AdaGrad, Adam, etc.).
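The role of the learning rate bound can be seen by iterating the per-coordinate error recursion directly. The following is a minimal sketch with hypothetical values ($p=4$, labels 10 and 1, chosen only for illustration): a learning rate above $2/y_1^{p-2}$ makes the large coordinate diverge, while a rate just below it is stable but leaves the small coordinate far from its optimum after hundreds of steps.

```python
def run_delta(delta0, eta, p, steps):
    """Iterate delta <- delta * (1 - eta * delta^(p-2)) for a number of steps."""
    delta = delta0
    for _ in range(steps):
        delta = delta * (1.0 - eta * delta ** (p - 2))
    return delta


P = 4
Y1, Y2 = 10.0, 1.0                 # hypothetical ill-conditioned labels, y1 >> y2
BOUND = 2.0 / Y1 ** (P - 2)        # stability threshold dictated by the larger label

too_large = run_delta(-Y1, eta=1.5 * BOUND, p=P, steps=5)
stable_1 = run_delta(-Y1, eta=0.75 * BOUND, p=P, steps=200)
stable_2 = run_delta(-Y2, eta=0.75 * BOUND, p=P, steps=200)

print("bound on eta:", BOUND)
print("coordinate 1, eta above bound, 5 steps:", too_large)
print("coordinate 1, eta below bound, 200 steps:", stable_1)
print("coordinate 2, eta below bound, 200 steps:", stable_2)
```

With the stable learning rate, the second coordinate shrinks by only a small fraction per step because its contraction factor is close to one, which is the slow convergence described above.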
Now consider the scenario where we optimize $L(\cdot)$ via overparameterization, i.e. with the update rule in Equation 12 (single output). In this case the coordinates are coupled, and as $\Delta_1$ gets small ($w_1$ gets close to $y_1$), the learning rate is effectively scaled by $y_1^{2-\frac{2}{N}}$ (in addition to a scaling by $N$ in coordinate $1$ only), allowing (if $y_1>1$) faster convergence of $\Delta_2$. We thus have the luxury of temporarily slowing down $\Delta_2$ to ensure that $\Delta_1$ does not diverge, with the latter speeding up the former as it reaches safe grounds. In Appendix B we consider a special case and formalize this intuition, deriving a concrete bound for the acceleration of overparameterization.
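To make the coupling explicit, the single-output update can be specialized to this two-point problem. The derivation below is a sketch under the assumptions of this section (no weight decay, gradient coordinates $\Delta_i^{p-1}$ with $\Delta_i = w_i - y_i$) and uses the regime in which the first coordinate has nearly converged and dominates the norm.

```latex
\begin{align*}
w_i^{(t+1)} &\leftarrow w_i - \eta\,\|\mathbf{w}\|_2^{2-\frac{2}{N}}
  \left(\Delta_i^{p-1} + (N-1)\,
  \frac{w_1\Delta_1^{p-1} + w_2\Delta_2^{p-1}}{\|\mathbf{w}\|_2^{2}}\; w_i\right),
  \quad i=1,2 \\[4pt]
\text{if } \Delta_1\approx 0 \text{ and } w_1\approx y_1\gg |w_2|:\quad
  &\|\mathbf{w}\|_2\approx y_1,\qquad
  \frac{w_2^2}{\|\mathbf{w}\|_2^{2}}\approx 0 \\[4pt]
\Rightarrow\quad \Delta_2^{(t+1)} &\approx
  \Delta_2\left(1-\eta\, y_1^{2-\frac{2}{N}}\,\Delta_2^{p-2}\right)
\end{align*}
```

Compared with the plain gradient descent recursion for $\Delta_2$, the learning rate is multiplied by $y_1^{2-\frac{2}{N}}$, which exceeds one exactly when $y_1>1$ and $N\geq 2$.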
8 Experiments
Our analysis (Section 5) suggests that overparameterization – replacement of a classic linear model by a deep linear network, induces on gradient descent a certain preconditioning scheme. We qualitatively argued (Section 7) that in some cases, this preconditioning may accelerate convergence. In this section we put these claims to the test, through a series of empirical evaluations based on the TensorFlow toolbox (Abadi et al., 2016). For conciseness, many of the details behind our implementation are deferred to Appendix C.
We begin by evaluating our analytically-derived preconditioning scheme – the end-to-end update rule in Equation 10. Our objective in this experiment is to ensure that our analysis, continuous in nature and based on a particular assumption on weight initialization (Equation 7), is indeed applicable to practical scenarios. We focus on the single output case, where the update rule takes on a particularly simple (and efficiently implementable) form – Equation 12. The dataset chosen was UCI Machine Learning Repository’s “Gas Sensor Array Drift at Different Concentrations” (Vergara et al., 2012; Rodriguez-Lujan et al., 2014). Specifically, we used the dataset’s “Ethanol” problem – a scalar regression task with $2565$ examples, each comprising $128$ features (one of the largest numeric regression tasks in the repository). As training objectives, we tried both $\ell_2$ and $\ell_4$ losses. Figure 2 shows convergence (training objective per iteration) of gradient descent optimizing depth-$2$ and depth-$3$ linear networks, against optimization of a single layer model using the respective preconditioning schemes (Equation 12 with $N=2,3$). As can be seen, the preconditioning schemes reliably emulate deep network optimization, suggesting that, at least in some cases, our analysis indeed captures practical dynamics.
Alongside the validity of the end-to-end update rule, Figure 2 also demonstrates the negligible effect of network width on convergence, in accordance with our analysis (see Section 5). Specifically, it shows that in the evaluated setting, hidden layers of size $1$ (scalars) suffice in order for the essence of overparameterization to fully emerge. Unless otherwise indicated, all results reported hereinafter are based on this configuration, i.e. on scalar hidden layers. The computational toll associated with overparameterization will thus be virtually non-existent.
As a final observation on Figure 2, notice that it exhibits faster convergence with a deeper network. This however does not serve as evidence in favor of acceleration by depth, as we did not set learning rates optimally per model (we simply used the common choice of $10^{-3}$). To conduct a fair comparison between the networks, and more importantly, between them and a classic single layer model, multiple learning rates were tried, and the one giving fastest convergence was taken on a per-model basis. Figure 3 shows the results of this experiment. As can be seen, convergence of deeper networks is (slightly) slower in the case of $\ell_2$ loss. This falls in line with the findings of Saxe et al. (2013). In stark contrast, and on par with our qualitative analysis in Section 7, is the fact that with $\ell_4$ loss adding depth significantly accelerated convergence. To the best of our knowledge, this provides the first empirical evidence that depth, even without any gain in expressiveness, and despite introducing non-convexity to a formerly convex problem, can lead to favorable optimization.
In light of the speedup observed with $\ell_4$ loss, it is natural to ask how the implicit acceleration of depth compares against explicit methods for acceleration and adaptive learning. Figure 4-left shows convergence of a depth-$3$ network (optimized with gradient descent) against that of a single layer model optimized with AdaGrad (Duchi et al., 2011) and AdaDelta (Zeiler, 2012). The displayed curves correspond to optimal learning rates, chosen individually via grid search. Quite surprisingly, we find that in this specific setting, overparameterizing, thereby turning a convex problem non-convex, is a more effective optimization strategy than carefully designed algorithms tailored for convex problems. We note that this was not observed with all algorithms – for example Adam (Kingma & Ba, 2014) was considerably faster than overparameterization. However, when introducing overparameterization simultaneously with Adam (a setting we did not theoretically analyze), further acceleration is attained – see Figure 4-right. This suggests that at least in some cases, not only plain gradient descent benefits from depth, but also more elaborate algorithms commonly employed in state of the art applications.
An immediate question arises at this point. If depth indeed accelerates convergence, why not add as many layers as one can computationally afford? The reason, which is actually apparent in our analysis, is the so-called vanishing gradient problem. When training a very deep network (large $N$), while initializing weights to be small, the end-to-end matrix $W_e$ (Equation 5) is extremely close to zero, severely attenuating gradients in the preconditioning scheme (Equation 10). A possible approach for alleviating this issue is to initialize weights to be larger, yet small enough such that the end-to-end matrix does not “explode”. The choice of identity (or near identity) initialization leads to what is known as linear residual networks (Hardt & Ma, 2016), akin to the successful residual networks architecture (He et al., 2015) commonly employed in deep learning. Notice that identity initialization satisfies the condition in Equation 7, rendering the end-to-end update rule (Equation 10) applicable. Figure 5-left shows convergence, under gradient descent, of a single layer model against deeper networks than those evaluated before – depths $4$ and $8$. As can be seen, with standard, near-zero initialization, the depth-$4$ network starts making visible progress only after about $65K$ iterations, whereas the depth-$8$ network seems stuck even after $100K$ iterations. In contrast, under identity initialization, both networks immediately make progress, and again depth serves as an implicit accelerator.
As a final sanity test, we evaluate the effect of overparameterization on optimization in a non-idealized (yet simple) deep learning setting. Specifically, we experiment with the convolutional network tutorial for MNIST built into TensorFlow (https://github.com/tensorflow/models/tree/master/tutorials/image/mnist), which uses dropout (Srivastava et al., 2014). We introduced overparameterization by simply placing two matrices in succession instead of the matrix in each dense layer. Here, as opposed to previous experiments, widths of the newly formed hidden layers were not set to , but rather to the minimal values that do not deteriorate expressiveness (see Appendix C). Overall, with an addition of roughly in number of parameters, optimization has accelerated considerably (see Figure 5-right). The displayed results were obtained with the hyperparameter settings hardcoded into the tutorial. We have tried alternative settings (varying learning rates and standard deviations of initializations; see Appendix C), and in all cases observed an outcome similar to that in Figure 5-right: overparameterization led to significant speedup. Nevertheless, as reported above for linear networks, it is likely that for non-linear networks the effect of depth on optimization is mixed: some settings are accelerated by it, while others are not. Comprehensive characterization of the cases in which depth accelerates optimization warrants much further study. We hope our work will spur interest in this avenue of research.
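The replacement of a dense layer's matrix by two matrices in succession can be sketched as follows. This is a hedged illustration and not the modified tutorial code: it assumes the row-vector convention `x @ W`, and it assumes that the minimal hidden width preserving expressiveness is the smaller of the layer's input and output dimensions (Appendix C is not reproduced here). The function names and the example shapes are hypothetical.

```python
import numpy as np

def factorized_dense(x, W1, W2, b):
    """Dense layer whose weight matrix is the product W1 @ W2."""
    return (x @ W1) @ W2 + b

def trivial_factorization(W):
    """Show that any (in_dim x out_dim) matrix W is reachable as W1 @ W2
    with hidden width k = min(in_dim, out_dim)."""
    in_dim, out_dim = W.shape
    if in_dim <= out_dim:
        return np.eye(in_dim), W      # W1: in_dim x k, W2: k x out_dim
    return W, np.eye(out_dim)         # W1: in_dim x k, W2: k x out_dim

def extra_parameters(in_dim, out_dim):
    """Parameters added by the factorization: (in*k + k*out) - in*out = k*k."""
    k = min(in_dim, out_dim)
    return in_dim * k + k * out_dim - in_dim * out_dim

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 3))
b = np.zeros(3)
W1, W2 = trivial_factorization(W)
x = rng.standard_normal((4, 6))
print(np.allclose(factorized_dense(x, W1, W2, b), x @ W + b))
print(extra_parameters(6, 3))
```

In training, both factors would be initialized randomly and learned jointly. The trivial factorization above serves only to show that the factored layer can represent every linear map the original layer can.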
Through theory and experiments, we demonstrated that overparameterizing a neural network by increasing its depth can accelerate optimization, even on very simple problems.
Our analysis of linear neural networks, the subject of various recent studies, yielded a new result: for these models, overparameterization by depth can be understood as a preconditioning scheme with a closed form description (Theorem 1 and the claims thereafter). The preconditioning may be interpreted as a combination of certain forms of adaptive learning rate and momentum. Given that it depends on network depth but not on width, acceleration by overparameterization can be attained at a minimal computational price, as we demonstrate empirically in Section 8.
Clearly, complete theoretical analysis for non-linear networks will be challenging. Empirically, however, we showed that the trivial idea of replacing an internal weight matrix by a product of two can significantly accelerate optimization, with absolutely no effect on expressiveness (Figure 5-right).
The fact that gradient descent over classic convex problems such as linear regression with loss, , can accelerate from transitioning to a non-convex overparameterized objective, does not coincide with conventional wisdom, and provides food for thought. Can this effect be rigorously quantified, similarly to analyses of explicit acceleration methods such as momentum or adaptive regularization (AdaGrad)?
Sanjeev Arora’s work is supported by NSF, ONR, Simons Foundation, Schmidt Foundation, Mozilla Research, Amazon Research, DARPA and SRC. Elad Hazan’s work is supported by NSF grant 1523815 and Google Brain. Nadav Cohen is a member of the Zuckerman Israeli Postdoctoral Scholars Program, and is supported by Eric and Wendy Schmidt.
- Abadi et al. (2016) Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pp. 265–283, 2016.
- Agarwal et al. (2017) Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., and Ma, T. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1195–1199. ACM, 2017.
- Arora et al. (2018) Arora, R., Basu, A., Mianjy, P., and Mukherjee, A. Understanding deep neural networks with rectified linear units. International Conference on Learning Representations (ICLR), 2018.
- Baldi & Hornik (1989) Baldi, P. and Hornik, K. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.
- Boyce et al. (1969) Boyce, W. E., DiPrima, R. C., and Haines, C. W. Elementary differential equations and boundary value problems, volume 9. Wiley New York, 1969.
- Buck (2003) Buck, R. C. Advanced calculus. Waveland Press, 2003.
- Carmon et al. (2016) Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. Accelerated methods for non-convex optimization. arXiv preprint arXiv:1611.00756, 2016.
- Choromanska et al. (2015) Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pp. 192–204, 2015.
- Cohen et al. (2017) Cohen, N., Sharir, O., Levine, Y., Tamari, R., Yakira, D., and Shashua, A. Analysis and design of convolutional networks via hierarchical tensor decompositions. arXiv preprint arXiv:1705.02302, 2017.
- Daniely (2017) Daniely, A. Depth separation for neural networks. arXiv preprint arXiv:1702.08489, 2017.
- Duchi et al. (2011) Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
- Eldan & Shamir (2015) Eldan, R. and Shamir, O. The power of depth for feedforward neural networks. arXiv preprint arXiv:1512.03965, 2015.
- Fukumizu (1998) Fukumizu, K. Effect of batch learning in multilayer neural networks. Gen, 1(04):1E–03, 1998.
- Ge et al. (2015) Ge, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle points - online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pp. 797–842, 2015.
- Goh (2017) Goh, G. Why momentum really works. Distill, 2017. doi: 10.23915/distill.00006. URL http://distill.pub/2017/momentum.
- Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., and Courville, A. Deep learning, volume 1. MIT Press, Cambridge, 2016.
- Goodfellow et al. (2014) Goodfellow, I. J., Vinyals, O., and Saxe, A. M. Qualitatively characterizing neural network optimization problems. arXiv preprint arXiv:1412.6544, 2014.
- Haeffele & Vidal (2015) Haeffele, B. D. and Vidal, R. Global Optimality in Tensor Factorization, Deep Learning, and Beyond. CoRR abs/1202.2745, cs.NA, 2015.
- Hardt & Ma (2016) Hardt, M. and Ma, T. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.
- Hazan et al. (2007) Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms for online convex optimization. Mach. Learn., 69(2-3):169–192, December 2007. ISSN 0885-6125.
- He et al. (2015) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
- Helmke & Moore (2012) Helmke, U. and Moore, J. B. Optimization and dynamical systems. Springer Science & Business Media, 2012.
- Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448–456, 2015.
- Janzamin et al. (2015) Janzamin, M., Sedghi, H., and Anandkumar, A. Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods. CoRR abs/1506.08473, 2015.
- Jones et al. (2001–) Jones, E., Oliphant, T., Peterson, P., et al. SciPy: Open source scientific tools for Python, 2001–. URL http://www.scipy.org/. [Online; accessed <today>].
- Kawaguchi (2016) Kawaguchi, K. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pp. 586–594, 2016.
- Kingma & Ba (2014) Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Lee et al. (2017) Lee, H., Ge, R., Risteski, A., Ma, T., and Arora, S. On the ability of neural nets to express distributions. arXiv preprint arXiv:1702.07028, 2017.
- Livni et al. (2014) Livni, R., Shalev-Shwartz, S., and Shamir, O. On the computational efficiency of training neural networks. Advances in Neural Information Processing Systems, 2014.
- Nesterov (1983) Nesterov, Y. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.
- Nocedal (1980) Nocedal, J. Updating quasi-newton matrices with limited storage. Mathematics of Computation, 35(151):773–782, 1980.
- Raghu et al. (2016) Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., and Sohl-Dickstein, J. On the expressive power of deep neural networks. arXiv preprint arXiv:1606.05336, 2016.
- Rodriguez-Lujan et al. (2014) Rodriguez-Lujan, I., Fonollosa, J., Vergara, A., Homer, M., and Huerta, R. On the calibration of sensor arrays for pattern recognition using the minimal number of experiments. Chemometrics and Intelligent Laboratory Systems, 130:123–134, 2014.
- Safran & Shamir (2016) Safran, I. and Shamir, O. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pp. 774–782, 2016.
- Safran & Shamir (2017) Safran, I. and Shamir, O. Spurious local minima are common in two-layer relu neural networks. arXiv preprint arXiv:1712.08968, 2017.
- Saxe et al. (2013) Saxe, A. M., McClelland, J. L., and Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
- Soudry & Carmon (2016) Soudry, D. and Carmon, Y. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.
- Srivastava et al. (2014) Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- Su et al. (2014) Su, W., Boyd, S., and Candes, E. A differential equation for modeling nesterov’s accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pp. 2510–2518, 2014.
- Tieleman & Hinton (2012) Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
- Vergara et al. (2012) Vergara, A., Vembu, S., Ayhan, T., Ryan, M. A., Homer, M. L., and Huerta, R. Chemical gas sensor drift compensation using classifier ensembles. Sensors and Actuators B: Chemical, 166:320–329, 2012.
- Wibisono et al. (2016) Wibisono, A., Wilson, A. C., and Jordan, M. I. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358, 2016.
- Zeiler (2012) Zeiler, M. D. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Appendix A Deferred Proofs
A.1 Proof of Theorem 1
Before delving into the proof, we introduce notation that will admit a more compact presentation of formulae. For , we denote:
where are the weight matrices of the depth- linear network (Equation 2). If , then by definition both and are identity matrices, with size depending on context, i.e. on the dimensions of matrices they are multiplied against. Given any square matrices (possibly scalars) , we denote by a block-diagonal matrix holding them on its diagonal:
As illustrated above, may hold additional, zero-valued rows and columns beyond . Conversely, it may also trim (omit) rows and columns, from its bottom and right ends respectively, so long as only zeros are being removed. The exact shape of is again determined by context, and so if and are matrices, the expression infers a number of rows equal to the number of columns in , and a number of columns equal to the number of rows in .
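For concreteness, the notation just introduced can be written out as below. This is a sketch under assumptions: the symbols $W_1, \ldots, W_N$ for the weight matrices, the index bounds $1 \le a \le b \le N$, the names $A_1, \ldots, A_m$, and the ordering of factors are chosen here for illustration and should be checked against Equation 2 and the displayed definitions of the paper.

```latex
% Assumed symbols: W_1, ..., W_N are the layer weight matrices, 1 <= a <= b <= N.
\prod\nolimits_{a}^{j=b} W_j := W_b W_{b-1} \cdots W_a ,
\qquad
\prod\nolimits_{j=a}^{b} W_j^\top := W_a^\top W_{a+1}^\top \cdots W_b^\top

% Block-diagonal matrix, shown here padded with trailing zero rows and columns.
\mathrm{diag}(A_1, \ldots, A_m) :=
\begin{bmatrix}
A_1 &        &     &   \\
    & \ddots &     &   \\
    &        & A_m &   \\
    &        &     & 0
\end{bmatrix}
```

With this convention, the first product equals the end-to-end matrix when $a = 1$ and $b = N$, and both products reduce to an identity matrix when $a > b$.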
Plugging this into the differential equations of gradient descent (Equation 6), we get:
For , multiply the ’th equation by from the right, and the ’th equation by from the left. This yields:
Taking the transpose of these equations and adding to themselves, we obtain, for every :
Denote for :
Equation 17 can now be written as:
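The chain of steps above can be sketched in LaTeX as follows. This is a reconstruction under stated assumptions rather than a verbatim copy of the paper's displays: it assumes that Equation 6 describes gradient flow with learning rate $\eta$ and weight decay coefficient $\lambda$ on the overparameterized loss, that $L^1$ denotes the single-layer loss evaluated at the end-to-end matrix $W_e = W_N \cdots W_1$, and that the symbols $C_j$ and $C'_j$ name the two Gram matrices of layer $j$. Time arguments are omitted for brevity.

```latex
\begin{align*}
% Gradient flow on each layer (assumed form of Equation 6), j = 1, ..., N:
\dot{W}_j &= -\eta\lambda W_j
  - \eta \Big( \prod\nolimits_{i=j+1}^{N} W_i^\top \Big)
    \frac{dL^1}{dW}(W_e)
    \Big( \prod\nolimits_{i=1}^{j-1} W_i^\top \Big) \\
% Right-multiply equation j by W_j^T, left-multiply equation j+1 by W_{j+1}^T;
% the gradient terms coincide, so for j = 1, ..., N-1:
W_{j+1}^\top \dot{W}_{j+1} + \eta\lambda W_{j+1}^\top W_{j+1}
  &= \dot{W}_j W_j^\top + \eta\lambda W_j W_j^\top \\
% Add each equation to its transpose:
W_{j+1}^\top \dot{W}_{j+1} + \dot{W}_{j+1}^\top W_{j+1}
  + 2\eta\lambda W_{j+1}^\top W_{j+1}
  &= \dot{W}_j W_j^\top + W_j \dot{W}_j^\top + 2\eta\lambda W_j W_j^\top \\
% With C_j := W_j W_j^T and C'_j := W_j^T W_j:
\dot{C}'_{j+1} + 2\eta\lambda C'_{j+1} &= \dot{C}_j + 2\eta\lambda C_j
\end{align*}
```

The last line uses the product rule, $\frac{d}{dt}\big(W_j W_j^\top\big) = \dot{W}_j W_j^\top + W_j \dot{W}_j^\top$, and likewise for $W_{j+1}^\top W_{j+1}$.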