One of the most important recent innovations for optimizing deep neural networks is Batch Normalization (Bn) (Ioffe and Szegedy, 2015). This technique has been proven to successfully stabilize and accelerate training of deep neural networks and is thus by now standard in many state-of-the art architectures such as ResNets (He et al., 2016) and the latest Inception Nets (Szegedy et al., 2017). The success of Batch Normalization has promoted its key idea that normalizing the inner layers of a neural network stabilizes training which recently led to the development of many such normalization methods such as (Arpit et al., 2016; Klambauer et al., 2017; Salimans and Kingma, 2016) and (Ba et al., 2016) to name just a few.
Yet, despite the ever more important role of Batch Normalization for training deep neural networks, the Machine Learning community is mostly relying on empirical evidence and thus lacking a thorough theoretical understanding that can explain such success. Indeed – to the best of our knowledge – there exists no theoretical result which provably shows faster convergence rates for this technique on any problem instance. So far, there only exists competing hypotheses that we briefly summarize below.
1.1 Related work
Internal Covariate Shift
The most widespread idea is that Batch Normalization accelerates training by reducing the so-called internal covariate shift, defined as the change in the distribution of layer inputs while the conditional distribution of outputs is unchanged. This change can be significant especially for deep neural networks where the successive composition of layers drives the activation distribution away from the initial input distribution. Ioffe and Szegedy (2015)
argue that Batch Normalization reduces the internal covariate shift by employing a normalization technique that enforces the input distribution of each activation layer to be whitened - i.e. enforced to have zero means and unit variances - and decorrelated . Yet, as pointed out byLipton and Steinhardt (2018), the covariate shift phenomenon itself is not rigorously shown to be the reason behind the performance of Batch Normalization. Furthermore, a recent empirical study published by (Santurkar et al., 2018) provides strong evidence supporting the hypothesis that the performance gain of Batch Normilization is not explained by the reduction of internal covariate shift.
Smoothing of objective function
Recently, Santurkar et al. (2018)
argue that under certain assumptions a normalization layer simplifies optimization by smoothing the loss landscape of the optimization problem of the preceding layer. Yet, we note that this effect may - at best - only improve the constant factor of the convergence rate of Gradient Descent and not the rate itself (e.g. from sub-linear to linear). Furthermore, the analysis treats only the largest eigenvalue and thus one direction in the landscape (at any given point) and keeps the (usually trainable)BN parameters fixed to zero-mean and unit variance. For a thorough conclusion about the overall landscape, a look at the entire eigenspectrum (including negative and zero eigenvalues) would be needed. Yet, this is particularly hard to do as soon as one allows for learnable mean and variance parameters since the effect of their interplay on the distribution of eigenvalues is highly non-trivial.
Finally, a different perspective was brought up by another normalization technique termed Weight Normalization (Wn) (Salimans and Kingma, 2016)
. This technique performs a very simple normalization that is independent of any data statistics with the goal of decoupling the length of the weight vector from its direction. The optimization of the training objective is then performed by training the two parts separately. As discussed in Section2, Bn and Wn differ in how the weights are normalized but share the above mentioned decoupling effect. Interestingly, weight normalization has been shown empirically to benefit from similar acceleration properties as Batch Normalization (Gitman and Ginsburg, 2017; Salimans and Kingma, 2016). This raises the obvious question whether the empirical success of training with Batch Normalization can (at least partially) be attributed to its length-direction decoupling aspect.
1.2 Contribution and organization
We contribute to a better theoretical understanding of Batch Normalization by analyzing it from an optimization perspective. In this regard, we particularly address the following question:
Can we find a setting in which Batch Normalization provably accelerates optimization with Gradient Descent and does the length-direction decoupling play a role in this phenomenon?
We answer both questions affirmatively. In particular, we show that the specific variance transformation of Bn decouples the length and directional components of the weight vectors in such a way that allows local search methods to exploit certain global properties of the optimization landscape (present in the directional component of the optimal weight vector). Using this fact and endowing the optimization method with an adaptive stepsize scheme, we obtain an exponential (or as more commonly termed linear) convergence rate for Batch Norm Gradient Descent on the (possibly) non-convex problem of Learning Halfspaces with Gaussian inputs (Section 4), which is a prominent problem in machine learning Erdogdu et al. (2016). We thereby turn Bn
from an effective practical heuristic into a provably converging algorithm. Additionally we show that the length-direction decoupling can be considered as a non-linear reparametrization of the weight space, which may be beneficial for even simple convex optimization tasks such as logistic regressions. Interestingly, non-linear weightspace transformations have received little to no attention within the optimization community (see(Mikhalevich et al., 1988) for an exception).
Finally, in Section 5 we analyze the effect of Bn for training a multilayer neural network (MLP) and prove – again under a similar Gaussianity assumption – that Bn acts in such a way that the cross dependencies between layers are reduced and thus the curvature structure of the network is simplified. Again, this is due to a certain global property in the directional part of the optimization landscape, which BN can exploit via the length-direction decoupling. As a result, gradient-based optimization in reparametrized coordinates (and with an adaptive stepsize policy) can enjoy a linear convergence rate on each individual unit. We substantiate both findings with experimental results on real world datasets that confirm the validity of our analysis outside the setting of our theoretical assumptions that cannot be certified to always hold in practice.
2.1 Assumptions on data distribution
Suppose that is a random input vector and is the corresponding output variable. Throughout this paper we recurrently use the following statistics
and make the following (weak) assumption.
[Weak assumption on data distribution] We assume that . We further assume that the spectrum of the matrix is bounded as
As a result, is the symmetric positive definite covariance matrix of .
relies on a stronger assumption on the data distribution. In this regard we consider the combined random variable
whose mean vector and covariance matrix are and as defined above in Eq. (1).
[Normality assumption on data distribution]
We assume that
is a multivariate normal random variable distributed with mean and second-moment
In the absence of further knowledge, assuming Gaussian data is plausible from an information-theoretic point of view since the Gaussian distribution maximizes the entropy over the set of all absolutely continuous distributions with fixed first and second moment(Dowson and Wragg, 1973). Thus, many recent studies on neural networks make this assumption on (see e.g. (Brutzkus and Globerson, 2017; Du and Lee, 2018)). Here we assume Gaussianity on instead which is even less restrictive in some cases111For example, suppose that conditional distribution is gaussian with mean for positive labels and for negative labels (mixture of gaussians). If the covariance matrix of these marginal distributions are the same, is Gaussian while is not..
2.2 Batch normalization as a reparameterization of the weight space
In neural networks a Bn layer normalizes the input of each unit of the following layer. This is done on the basis of data statistics in a training batch, but for the sake of analyzablility we will work directly with population statistics. In particular, the output of a specific unit, which projects an input to its weight vector
and applies a sufficiently smooth activation functionas follows
is normalized on the pre-activation level. That is, the input-output mapping of this unit becomes
As stated (in finite-sum terms) in Algorithm 1 of (Ioffe and Szegedy, 2015) the normalization operation amounts to computing
where and are (trainable) mean and variance adjustment parameters. In the following, we assume that is zero mean (Assumption 1) and omit .222In the (non-compositional) models of Section 3 and 4, fulfilling the assumption that is zero-mean is as simple as centering the dataset. Yet, we also omit centering a neurons input as well as learning
is zero-mean is as simple as centering the dataset. Yet, we also omit centering a neurons input as well as learningfor the neural network analysis in Section 5. This is done for the sake of simplicity but note that Salimans and Kingma (2016) found that these aspects of Bn yield no empirical improvements for optimization. Then the variance can be written as follows
and replacing this expression into the batch normalized output of Eq.(5) yields
In order to keep concise notations, we will often use the induced norm of the positive definite matrix defined as . Comparing Eq. (4) and (8) it becomes apparent that Bn can be considered as a reparameterization of the weight space. We thus define
and note that accounts for the direction and for the length of . As a result, the batch normalized output can then be written as
Note that Weight Normalization (Wn) is another instance of the above reparametrization, where the covariance matrix
is replaced by the identity matrix(Salimans and Kingma, 2016). In both cases, the objective becomes invariant to linear scaling of . From a geometry perspective, the directional part of Wn can be understood as performing optimization on the unit sphere while Bn operates on the -sphere (ellipsoid) (Cho and Lee, 2017). Note that one can compute the variance term (7) in a matrix-free manner, i.e. never needs to be computed explicitly for Bn.
Of course, this type of reparametrization is not exclusive to applications in neural networks. In the following two sections we first show how reparametrizing the weight space of linear models can be advantageous from a classical optimization point of view. In Section 5 we extend this analysis to training Batch Normalized neural networks with adaptive-stepsize Gradient Descent and show that the length-direction split induces an interesting decoupling effect of the individual network layers which simplifies the curvature structure.
3 Ordinary Least Squares
As a preparation for subsequent analyses, we start with the simple convex quadratic objective encountered when minimizing an ordinary least squares problem
One can think of this as a linear neural network with just one neuron and a quadratic loss. Thus, applying Bn resembles reparametrizing according to Eq. (9) and the objective turns into the non-convex problem
Despite the non-convexity of this new objective, we will prove that Gradient Descent (Gd) enjoys a linear convergence rate. Interestingly, our analysis establishes a link between in reparametrized coordinates (Eq. (12)) and the well-studied task of minimizing (generalized) Rayleigh quotients as it is commonly encountered in eigenvalue problems (Argentati et al., 2017).
3.1 Convergence analysis
To simplify the analysis, note that, for a given , the objective of Eq. (12) is convex w.r.t. the scalar and thus the optimal value can be found by setting , which gives . Replacing this closed-form solution into Eq. (12) yields the following optimization problem
which – as discussed in Appendix A.2 – is a special case of minimizing the generalized Rayleigh quotient for which an extensive literature exists (Knyazev, 1998; D’yakonov and McCormick, 1995). Here, we particularly consider solving (13) with Gd, which applies the following iterative updates to the parameters
Based upon existing results, the next theorem establishes a linear convergence rate for the above iterates to the minimizer in the normalized coordinates.
This convergence rate is of the same order as the rate of standard Gd on the original objective of Eq. (11) (Nesterov, 2013). Yet, it is interesting to see that the non-convexity of the normalized objective does not slow gradient-based optimization down. In the following, we will repeatedly invoke this result to analyze more complex objectives for which Gd only achieves a sublinear convergence rate in the original coordinate space but is provably accelerated after using Batch Normalization.
4 Learning Halfspaces
We now turn our attention to the problem of Learning Halfspaces, which encompasses training the simplest possible neural network: the Perceptron. This optimization problem can be written as
is a loss function. Common choices forinclude the zero-one, piece-wise linear, logistic and sigmoidal loss. We here tailor our analysis to the following choice of loss function.
[Assumptions on loss function] We assume that the loss function is infinitely differentiable, i.e. , with a bounded derivative, i.e. such that .
Furthermore, we need to be sufficiently smooth.
[Smoothness assumption] We assume that the objective is -smooth if it is differentiable on and its gradient is -Lipschitz. Furthermore, we assume that a solution exists that is bounded in the sense that 333This is a rather technical but not so restrictive assumption. For example, it always holds for the sigmoid loss unless the classification error of is already zero.
Since globally optimizing (17) is in general NP-hard (Guruswami and Raghavendra, 2009), we instead focus on understanding the effect of the normalized parameterization when searching for a stationary point. Towards this end we now assume that is a multivariate normal random variable (see Assumption 2 and discussion there).
4.1 Global characterization of the objective
The learning halfspaces objective – on Gaussian inputs – has a remarkable property: all critical points lie on the same line, independent of the choice of the loss function . We formalize this claim in the next lemma.
Interestingly, the optimal direction of these critical points spans the same line as the solution of a corresponding least squares regression problem (see Eq. (38) in Appendix A). In the context of convex optimization of generalized linear models, this fact was first pointed out in (Brillinger, 2012). Although the global optima of the two objectives are aligned, classical optimization methods - which perform updates based on local information - are generally blind to such global properties of the objective function. This is unfortunate since Gradient Descent converges linearly in the quadratic least-squares setting but only sublinearly on general Learning Halfspace problems (Zhang et al., 2015).
To accelerate the convergence of Gradient Descent, Erdogdu et al. (2016) thus proposed a two-step global optimization procedure for solving generalized linear models, which first involves finding the optimal direction by optimizing a least squares regression as a surrogate objective and secondly searching for a proper scaling factor of that minimizer. Here, we show that running Gd in coordinates reparameterized as in Eq. (9) makes this two-step procedure redundant. More specifically, splitting the optimization problem into searching for the optimal direction and scaling separately, allows even local optimization methods to exploit the property of global minima alignment. Thus - without having to solve a least squares problem in the first place - the directional updates on the Learning Halfspace problem can mimic the least squares dynamics and thereby inherit the linear convergence rate. Combined with a fast (one dimensional) search for the optimal scaling in each step the overall convergence stays linear.
As an illustration, Figure 1 shows the level sets as well as the optimal direction of a least squares-, a logistic- and a sigmoidal regression problem on the same Gaussian dataset. Furthermore, it shows iterates of Gd in original coordinates and a sequential version of Gd in normalized coordinates that first optimizes the direction and then the scaling of its parameters (Gdnp). Both methods start at the same point and run with an infinitesimally small stepsize. It can be seen that, while Gd takes completely different paths towards the optimal points of each problem instance, the dynamics of Gdnp are exactly the same until the optimal direction444Depicted by the dotted red line. Note that – as a result of Lemma 1 – this line is identical in all problems. is found and differ only in the final scaling.
|[width=0.485trim=22pt 22pt 30pt 30pt,clip]exp/GD_ols_log.pdf||[width=0.485trim=22pt 22pt 30pt 30pt,clip]exp/BN_ols_log.pdf|
|[width=0.485trim=22pt 22pt 30pt 30pt,clip]exp/GD_logreg.pdf||[width=0.485trim=22pt 22pt 30pt 30pt,clip]exp/BN_logreg.pdf|
|[width=0.485trim=22pt 22pt 30pt 30pt,clip]exp/GD_lh.pdf||[width=0.485trim=22pt 22pt 30pt 30pt,clip]exp/BN_lh.pdf|
4.2 Local optimization in normalized parameterization
Let be the -reparameterized objective with as defined in Eq. (9). We here consider optimizing this objective using Gradient Descent in Normalized Parameters (Gdnp). In each iteration, Gdnp performs a gradient step to optimize with respect to the direction
and does a one-dimensional bisection search to estimate the scaling factor(see Algorithm 1). It is therefore only a slight modification of performing Batch Normalization plus Gradient Descent: Instead of taking a gradient step on both and , we search for the (locally)-optimal scaling at each iteration. This modification is cheap and it simplifies the theoretical analysis but it can also be substituted easily in practice by performing multiple Gd steps on the scaling factor, which we do for the experiments in Section 4.4.
4.3 Convergence result
We now show that Algorithm 1 can achieve a linear convergence rate to a critical point on the possibly non-convex objective with Gaussian inputs. Note that all information for computing the adaptive stepsize is readily available and can be computed efficiently.
for , where
is a stopping criterion. If initialized such that (see Eq. (13)), then is an approximate critical point of in the sense that
To complete the picture, Table 1 compares this result to the order complexity for reaching an -optimal critical point (i.e.) of Gradient Descent and Accelerated Gradient Descent (Agd) in original coordinates. Although we observe that the accelerated rate achieved by Gdnp is significantly better than the best known rate of Agd for non-convex objective functions, we need to point out that the rate for Gdnp relies on the assumption of a Gaussian data distribution. Yet, Section 4.4 includes promising experimental results in a more practical setting with non-Gaussian data.
Finally, we note that the proof of this result relies specifically on the -reparametrization done by Batch Normalization. In Appendix B.3.6 we detail out why our proof strategy is not suitable for the -reparametrization of Weight Normalization and thus leave it as an interesting open question if other settings (or proof strategies) can be found where linear rates for Wn are provable.
|Agd||Smoothness||Sublinear||(Jin et al., 2017)|
|Gdnp||1, 2, 3, and 4||Linear||This paper|
4.4 Experiments I
In order to substantiate the above analysis we compare the convergence behavior of Gd and Agd to three versions of Gradient Descent in normalized coordinates. Namely, we benchmark (i) Gdnp (Algorithm 1) with multiple gradient steps on instead of Bisection, (ii) a simpler version (Bn) which updates and with just one fixed step-size gradient step555Thus Bn is conceptually very close to the classical Batch Norm Gradient Descent presented in (Ioffe and Szegedy, 2015) and (iii) Weight Normalization (Wn) as presented in (Salimans and Kingma, 2016). All methods use full batch sizes and – except for Gdnp on – each method is run with a problem specific, constant stepsize.
We consider empirical risk minimization as a surrogate for (17) on the common real-world dataset a9a as well as on synthetic data drawn from a multivariate Gaussian distribution. We center the datasets and use two different functions . First, we choose the softplus which resembles the classical logistic regression (convex). Secondly, we use the sigmoid which is a commonly used (non-convex) continuous approximation of the 0-1 loss (Zhang et al., 2015). Further details can be found in Appendix D.
|[width=0.5trim=22pt 22pt 30pt 30pt,clip]exp/gau_logistic_logloss.pdf||[width=0.5trim=22pt 22pt 30pt 30pt,clip]exp/a9a_logistic_logloss.pdf|
|gaussian softplus||a9a softplus|
|[width=0.5trim=22pt 22pt 30pt 30pt,clip]exp/gau_sigmoid_loggrads.pdf||[width=0.5trim=22pt 22pt 30pt 30pt,clip]exp/a9a_sigmoid_loggrads.pdf|
|gaussian sigmoid||a9a sigmoid|
Results of an average run (solid line) in terms of log suboptimality (softplus) and log gradient norm (sigmoid) over iterations as well as 90% confidence intervals of 10 runs with random initialization.
The Gaussian design experiments clearly confirm Theorem 2 in the sense that the loss in the convex-, as well as the gradient norm in the non-convex case decrease at a linear rate. The results on a9a show that Gdnp can accelerate optimization even when the normality assumption does not hold and in a setting where no covariate shift is present, which motivates future research of normalization techniques in optimization. Interestingly, the performance of simple Bn and Wn is similar to that of Gd, which suggests that the length-direction decoupling on its own does not capture the entire potential of these methods. Gdnp on the other hand takes full advantage of the parameter splitting, both in terms of multiple steps on and – more importantly – adaptive stepsizes in .
5 Neural Networks
We now turn our focus to optimizing the weights of a multilayer perceptron (MLP) with one hidden layer andhidden units that maps an input to a scalar output in the following way
The and terms are the input and output weights of unit and is a so-called activation function. We assume to be a , which is a common choice in many neural network architectures used in practice. Given a loss function , the optimal input- and output weights are determined by minimizing the following optimization problem
In the following, we analyze the optimization of the input weights with frozen output weights and thus write hereafter for simplicity. As previously assumed for the case of learning halfspaces, our analysis relies on Assumption 2. This approach is rather common in the analysis of neural networks, e.g. Brutzkus and Globerson (2017) showed that for a special type of one-hidden layer network with isotropic Gaussian inputs, all local minimizers are global. Under the same assumption, similar results for different classes of neural networks were derived in (Li and Yuan, 2017; Soltanolkotabi et al., 2017; Du et al., 2017).
5.1 Global characterization of the objective
We here show that, under the same assumption, the landscape of exhibits a global property similar to one of the Learning Halfspaces objective (see Section 4.1). In fact, the critical points of all neurons in a hidden layer align along one and the same single line in and this direction depends only on incoming information into the hidden layer.
See Appendix C.2 for a discussion of possible implications for deep neural networks.
5.2 Convergence result
We again consider normalizing the weights of each unit by means of its input covariance matrix , i.e. and present a simple algorithmic manner to leverage the input-output decoupling of Lemma 2. Contrary to Bn, which optimizes all the weights simultaneously, our approach is based on alternating optimization over the different hidden units. We thus adapt Gdnp to the problem of optimizing the units of a neural network in Algorithm 2 but note that this modification is mainly to simplify the theoretical analysis.
Optimizing each unit independently formally results in minimizing the function as defined in Algorithm 2. In the next theorem, we prove that this version of Gdnp achieves a linear rate of convergence to optimize each .
[Convergence of Gdnp on MLP] Suppose Assumptions 1– 4 hold. We consider optimizing the weights of unit , assuming that all directions are critical points of and for . Then, Gdnp with step-size policy as in (98) and stopping criterion as in (99) yields a linear convergence rate on in the sense that
where the constant is defined in Eq. (102).
The result of Theorem 3 relies on the fact that each is either zero or has zero gradient. If we assume that an exact critical point is reached after optimizing each individual unit, then the result directly implies that the alternating minimization presented in Algorithm 2 reaches a critical point of the overall objective. Since the established convergence rate for each individual unit is linear, this assumption sounds realistic. We leave a more precise convergence analysis, that takes into account that optimizing each individual unit for a finite number of steps may yield numerical suboptimalities, for future work.
5.3 Experiments II
In the proof of Theorem 3 we show that Gdnp can leverage the length-direction decoupling in a way that lowers cross-dependencies between hidden layers and yields faster convergence. A central part of the proof is Lemma 2 which says that – given Gaussian inputs – the optimal direction of a given layer is independent of all downstream layers. Since this assumption is rather strong and since Algorithm 2 is intended for analysis purposes only, we test the validity of the above hypothesis outside the Gaussian setting by training a Batch Normalized multilayer feedforward network (Bn) on a real-world image classification task with plain Gradient Descent. For comparison, a second unnormalized network is trained by Gd. To validate Lemma 2 we measure the interdependency between the central and all other hidden layers in terms of the Frobenius norm of their second partial cross derivatives (in the directional component). Further details can be found in Appendix D.
|[width=0.5trim=22pt 22pt 30pt 30pt,clip]exp/cif_50_loss.pdf||[width=0.5trim=22pt 22pt 30pt 30pt,clip]exp/cif_50_grads.pdf|
|[width=0.5trim=22pt 22pt 30pt 30pt,clip]exp/cif_50_hess_w.pdf||
[width=0.5trim=22pt 22pt 30pt 30pt,clip]exp/cif_50_hess_GD.pdf
Figure 3 confirms that the directional gradients of the central layer are affected far more by the upstream than by the downstream layers to a surprisingly large extent. Interestingly, this holds even before reaching a critical point. The downstream cross-dependencies are generally decaying for the Batch Normalized network (Bn) (especially in the first 1000 iterations where most progress is made) while they remain elevated in the un-normalized network (Gd), which suggest that using Batch Normalization layers indeed simplifies the networks curvature structure in such that the length-direction decoupling allows Gradient Descent to exploit simpler trajectories in these normalized coordinates for faster convergence.
Of course, we cannot untangle this effect fully from other possible positive aspects of training with Bn (see introduction). Yet, the fact that the (de-)coupling increases in the distance to the middle layer (note how earlier (later) layers are more (less) important for ) emphasizes the relevance of this analysis particularly for deep neural network structures, where downstream dependencies might vanish completely with depth. This does not only make gradient based training easier but also suggests the possibility of using partial second order information, such as diagonal Hessian approximations (e.g. proposed in (Martens et al., 2012)).
We took a theoretical approach to study the acceleration provided by Batch Normalization. In a somewhat simplified setting, we have shown that the reparametrization performed by Batch Normalization leads to a provable acceleration of gradient-based optimization by splitting it into subtasks that are easier to solve. In order to evaluate the impact of the assumptions required for our analysis, we also performed experiments on real-world datasets that agree with the results of the theoretical analysis to a surprisingly large extent.
We consider this work as a first step for two particular directions of future research. First, it raises the question of how to optimally train Batch Normalized neural networks. Particularly, our results suggest that different and adaptive stepsize schemes for the two parameters - length and direction - can lead to significant accelerations. Second, the analysis of Section 3 and 4 reveals that a better understanding of non-linear coordinate transformations is a promising direction for the continuous optimization community.
- Absil et al. (2009) Absil, P.-A., Mahony, R., and Sepulchre, R. (2009). Optimization algorithms on matrix manifolds. Princeton University Press.
- Argentati et al. (2017) Argentati, M. E., Knyazev, A. V., Neymeyr, K., Ovtchinnikov, E. E., and Zhou, M. (2017). Convergence theory for preconditioned eigenvalue solvers in a nutshell. Foundations of Computational Mathematics, 17(3):713–727.
- Arpit et al. (2016) Arpit, D., Zhou, Y., Kota, B. U., and Govindaraju, V. (2016). Normalization propagation: A parametric technique for removing internal covariate shift in deep networks. arXiv preprint arXiv:1603.01431.
- Ba et al. (2016) Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
- Brillinger (2012) Brillinger, D. R. (2012). A generalized linear model with “gaussian” regressor variables. In Selected Works of David Brillinger, pages 589–606. Springer.
- Brutzkus and Globerson (2017) Brutzkus, A. and Globerson, A. (2017). Globally optimal gradient descent for a convnet with gaussian inputs. arXiv preprint arXiv:1702.07966.
- Cho and Lee (2017) Cho, M. and Lee, J. (2017). Riemannian approach to batch normalization. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 5225–5235. Curran Associates, Inc.
- Dowson and Wragg (1973) Dowson, D. and Wragg, A. (1973). Maximum-entropy distributions having prescribed first and second moments (corresp.). IEEE Transactions on Information Theory, 19(5):689–693.
- Du and Lee (2018) Du, S. S. and Lee, J. D. (2018). On the power of over-parametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206.
- Du et al. (2017) Du, S. S., Lee, J. D., Tian, Y., Poczos, B., and Singh, A. (2017). Gradient descent learns one-hidden-layer cnn: Don’t be afraid of spurious local minima. arXiv preprint arXiv:1712.00779.
- D’yakonov and McCormick (1995) D’yakonov, E. G. and McCormick, S. (1995). Optimization in solving elliptic problems. CRC Press.
- Erdogdu et al. (2016) Erdogdu, M. A., Dicker, L. H., and Bayati, M. (2016). Scaled least squares estimator for glms in large-scale problems. In Advances in Neural Information Processing Systems, pages 3324–3332.
- Gitman and Ginsburg (2017) Gitman, I. and Ginsburg, B. (2017). Comparison of batch normalization and weight normalization algorithms for the large-scale image classification. arXiv preprint arXiv:1709.08145.
- Guruswami and Raghavendra (2009) Guruswami, V. and Raghavendra, P. (2009). Hardness of learning halfspaces with noise. SIAM Journal on Computing, 39(2):742–765.
- He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In , pages 770–778.
- Ioffe and Szegedy (2015) Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pages 448–456. JMLR.org.
- Jin et al. (2017) Jin, C., Netrapalli, P., and Jordan, M. I. (2017). Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456.
- Klambauer et al. (2017) Klambauer, G., Unterthiner, T., Mayr, A., and Hochreiter, S. (2017). Self-normalizing neural networks. In Advances in Neural Information Processing Systems, pages 972–981.
- Knyazev and Shorokhodov (1991) Knyazev, A. and Shorokhodov, A. (1991). On exact estimates of the convergence rate of the steepest ascent method in the symmetric eigenvalue problem. Linear algebra and its applications, 154:245–257.
- Knyazev (1998) Knyazev, A. V. (1998). Preconditioned eigensolvers—an oxymoron. Electron. Trans. Numer. Anal, 7:104–123.
- Knyazev and Neymeyr (2003) Knyazev, A. V. and Neymeyr, K. (2003). A geometric theory for preconditioned inverse iteration iii: A short and sharp convergence estimate for generalized eigenvalue problems. Linear Algebra and its Applications, 358(1-3):95–114.
- Krizhevsky and Hinton (2009) Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images.
Landsman and Nevslehová (2008)
Landsman, Z. and Nevslehová, J. (2008).
Stein’s lemma for elliptical random vectors.
Journal of Multivariate Analysis, 99(5):912–927.
Li and Yuan (2017)
Li, Y. and Yuan, Y. (2017).
Convergence analysis of two-layer neural networks with relu activation.In Advances in Neural Information Processing Systems, pages 597–607.
- Lipton and Steinhardt (2018) Lipton, Z. C. and Steinhardt, J. (2018). Troubling trends in machine learning scholarship. arXiv preprint arXiv:1807.03341.
- Martens et al. (2012) Martens, J., Sutskever, I., and Swersky, K. (2012). Estimating the hessian by back-propagating curvature. arXiv preprint arXiv:1206.6464.
- Mikhalevich et al. (1988) Mikhalevich, V., Redkovskii, N., and Antonyuk, A. (1988). Minimization methods for smooth nonconvex functions. Cybernetics, 24(4):395–403.
- Nesterov (2013) Nesterov, Y. (2013). Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media.
Paszke et al. (2017)
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z.,
Desmaison, A., Antiga, L., and Lerer, A. (2017).
Automatic differentiation in pytorch.In NIPS-W.
- Salimans and Kingma (2016) Salimans, T. and Kingma, D. P. (2016). Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pages 901–909.
- Santurkar et al. (2018) Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. (2018). How does batch normalization help optimization?(no, it is not about internal covariate shift). arXiv preprint arXiv:1805.11604.
- Soltanolkotabi et al. (2017) Soltanolkotabi, M., Javanmard, A., and Lee, J. D. (2017). Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926.
Szegedy et al. (2017)
Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. A. (2017).
Inception-v4, inception-resnet and the impact of residual connections on learning.In AAAI, volume 4, page 12.
- Zhang et al. (2015) Zhang, Y., Lee, J. D., Wainwright, M. J., and Jordan, M. I. (2015). Learning halfspaces and neural networks with random initialization. arXiv preprint arXiv:1511.07948.
Appendix A Least Squares Analysis
a.1 Restated result
Recall that, after normalizing according to (9) and using the closed form solution for the optimal scaling factor , optimizing the ordinary least squares objective can be written as the following minimization problem
We consider optimizing the above objective by Gd which takes iterative steps of the form
Furthermore, we choose
Our analysis relies on the (weak) data distribution assumption stated in A1. It is noteworthy that holds under this assumption. In the next theorem, we establish a convergence rate of Gd in terms of function value as well as gradient norm.
The proof of this result crucially relies on the insight that the minimization problem given in (12
) resembles the problem of maximizing the generalized Rayleigh quotient which is commonly encountered in generalized eigenproblems. We will thus first review this area, where convergence rates are usually provided in terms of the angle of the current iterate with the maximizer, which is the principal eigenvector. Interestingly, this angle can be related to both, the current function value as well as the the norm of the current gradient. We will make use of these connections to prove the above Theorem in SectionA.5. Although not necessarily needed for convex function, we introduce the gradient norm relation as we will later go on to prove a similar result for possibly non-convex functions in the learning halfspace setting (Theorem 2).
a.2 Background on eigenvalue problems
a.2.1 Rayleigh quotient
Optimizing the Rayleigh quotient is a classical non-convex optimization problem that is often encountered in eigenvector problems. Let be a symmetric matrix, then is the principal eigenvector of if it maximizes
and is called the Rayleigh quotient. Notably, this quotient satisfies the so-called Rayleigh inequality
where and are the smallest and largest eigenvalue of respectively.
Maximizing is a non-convex (strict-saddle) optimization problem, where the -th critical point constitutes the -th eigenvector with corresponding eigenvalue (see (Absil et al., 2009), Section 4.6.2 for details). It is known that optimizing with Gd - using an iteration-dependent stepsize - converges linearly to the principal eigenvector . The convergence analysis is based on the ”minidimensional” method and yields the following result
under weak assumptions on . Details as well as the proof of this result can be found in (Knyazev and Shorokhodov, 1991).
a.2.2 Generalized rayleigh quotient
The reparametrized least squares objective (13), however, is not exactly equivalent to (27) because of the covariance matrix that appears in the denominator. As a matter of fact, our objective is a special instance of the generalized Rayleigh quotient
where is defined as above and is a symmetric positive definite matrix.
Maximizing (29) is a generalized eigenproblem in the sense that it solves the task of finding eigenvalues of the matrix pencil for which , i.e. finding a vector that obeys . Again we have
again under weak assumptions on .
a.2.3 Our contribution
Since our setting in Eq. (13) is a minimization task, we note that for our specific choice of and we have and we recall the general result that then
More importantly, we here have a special case where the nominator of has a particular low rank structure. In fact, is a rank one matrix. Instead of directly invoking the convergence rate in (Knyazev and Neymeyr, 2003), this allows for a much simpler analysis of the convergence rate of Gd on since the rank one property yields a simpler representation of the relevant vectors. Furthermore, we establish a connection between suboptimality on function value and the -norm of the gradient. As mentioned earlier, we need such a guarantee in our future analysis on learning halfspaces which is an instance of a (possibly) non-convex optimization problem.
Notations Let be a symmetric positive definite matrix. We introduce the following compact notations that will be used throughout the analysis.
-inner product of : .
-norm of : .
-angle between two vectors :
-orthogonal projection of to span for :
-spectral norm of a matrix :
These notations allow us to make the analysis similar to the simple Rayleigh quotient case. For example, the denominator in (29) can now be written as .
Properties We will use the following elementary properties of the induced terms defined above.
If is the -orthogonal projection of to span, then it holds that
The -spectral norm of a matrix can be written in the alternative form
(34) (35) (36)
Let be a square matrix and , then the following result holds due to the definition of -spectral norm
a.4 Characterization of the LS minimizer
By setting the gradient of (11) to zero and recalling the convexity of we immediately see that the minimizer of this objective is
Indeed, one can easily verify that is also an eigenvector of the matrix pair since
where is the corresponding generalized eigenvalue. The associated eigenvector with is
Thereby we extend the normalized eigenvector to an -orthogonal basis of such that
holds for all . Let be the matrix whose -th column is . The matrix is orthogonal to the matrix since
which is a zero matrix due to the-orthogonality of the basis (see Eq. (41)). As a result, the columns of are eigenvectors associated with a zero eigenvalue. Since and form an -orthonormal basis of no further eigenvalues exist. We can conclude that any vector is a minimizer of the reparametrized ordinary least squares problem as presented in (13) and the minimum value of relates to the eigenvalue as
Spectral representation of suboptimality Our convergence analysis is based on the angle between the current iterate and the leading eigenvector , for which we recall property (P.1). We can express in the -orthogonal basis that we defined above:
and since is the -orthogonal projection of to span, the result of (P.2) implies
Clearly this metric is zero for the optimal solution and else bounded by one from above. To justify it is a proper choice, the next proposition proves that suboptimality on , i.e. , relates directly to this angle.
The suboptimality of on relates to as
where . This is equivalent to
We use the proposed eigenexpansion of Eq. (43) to rewrite
and replace the above result into . Then
which proves the second part of the proposition. The first follows directly from property (P.1). ∎
Gradient-suboptimality connection Fermat’s first-order optimality condition implies that the gradient is zero at the minimizer of . Considering the structure of , we propose a precise connection between the norm of gradient and suboptimality. Our analysis relies on the representation of the gradient in the -orthonormal basis which is described in the next proposition.
Using the -orthogonal basis as given in Eq. (41), the gradient vector can be expanded as
Exploiting the gradient representation of the last proposition, the next proposition establishes the connection between suboptimality and the -norm of gradient .
Suppose that , then the -norm of the gradient relates to the suboptimality as
Multiplying the gradient representation in Proposition 2 by yields