1 Introduction
One of the most important recent innovations for optimizing deep neural networks is Batch Normalization (Bn) (Ioffe and Szegedy, 2015). This technique has been proven to successfully stabilize and accelerate training of deep neural networks and is thus by now standard in many state-of-the-art architectures such as ResNets (He et al., 2016) and the latest Inception Nets (Szegedy et al., 2017). The success of Batch Normalization has promoted its key idea that normalizing the inner layers of a neural network stabilizes training, which has recently led to the development of many further normalization methods such as those of Arpit et al. (2016), Klambauer et al. (2017), Salimans and Kingma (2016) and Ba et al. (2016), to name just a few.
Yet, despite the ever more important role of Batch Normalization for training deep neural networks, the machine learning community mostly relies on empirical evidence and thus lacks a thorough theoretical understanding that can explain its success. Indeed – to the best of our knowledge – there exists no theoretical result which provably shows faster convergence rates for this technique on any problem instance. So far, there exist only competing hypotheses, which we briefly summarize below.
1.1 Related work
Internal Covariate Shift
The most widespread idea is that Batch Normalization accelerates training by reducing the so-called internal covariate shift, defined as the change in the distribution of layer inputs while the conditional distribution of outputs is unchanged. This change can be significant especially for deep neural networks, where the successive composition of layers drives the activation distribution away from the initial input distribution. Ioffe and Szegedy (2015)
argue that Batch Normalization reduces the internal covariate shift by employing a normalization technique that enforces the input distribution of each activation layer to be whitened – i.e. enforced to have zero mean and unit variance – and decorrelated. Yet, as pointed out by
Lipton and Steinhardt (2018), the covariate shift phenomenon itself has not rigorously been shown to be the reason behind the performance of Batch Normalization. Furthermore, a recent empirical study published by Santurkar et al. (2018) provides strong evidence supporting the hypothesis that the performance gain of Batch Normalization is not explained by the reduction of internal covariate shift.
Smoothing of objective function
Recently, Santurkar et al. (2018)
argue that under certain assumptions a normalization layer simplifies optimization by smoothing the loss landscape of the optimization problem of the preceding layer. Yet, we note that this effect may – at best – only improve the constant factor of the convergence rate of Gradient Descent and not the rate itself (e.g. from sublinear to linear). Furthermore, the analysis treats only the largest eigenvalue, and thus one direction in the landscape (at any given point), and keeps the (usually trainable) Bn parameters fixed to zero mean and unit variance. For a thorough conclusion about the overall landscape, a look at the entire eigenspectrum (including negative and zero eigenvalues) would be needed. Yet, this is particularly hard to do as soon as one allows for learnable mean and variance parameters, since the effect of their interplay on the distribution of eigenvalues is highly non-trivial.
Length-direction decoupling
Finally, a different perspective was brought up by another normalization technique termed Weight Normalization (Wn) (Salimans and Kingma, 2016)
. This technique performs a very simple normalization that is independent of any data statistics with the goal of decoupling the length of the weight vector from its direction. The optimization of the training objective is then performed by training the two parts separately. As discussed in Section
2, Bn and Wn differ in how the weights are normalized but share the above-mentioned decoupling effect. Interestingly, Weight Normalization has been shown empirically to benefit from similar acceleration properties as Batch Normalization (Gitman and Ginsburg, 2017; Salimans and Kingma, 2016). This raises the obvious question whether the empirical success of training with Batch Normalization can (at least partially) be attributed to its length-direction decoupling aspect.
1.2 Contribution and organization
We contribute to a better theoretical understanding of Batch Normalization by analyzing it from an optimization perspective. In this regard, we particularly address the following question:
Can we find a setting in which Batch Normalization provably accelerates optimization with Gradient Descent, and does the length-direction decoupling play a role in this phenomenon?
We answer both questions affirmatively. In particular, we show that the specific variance transformation of Bn decouples the length and directional components of the weight vectors in such a way that allows local search methods to exploit certain global properties of the optimization landscape (present in the directional component of the optimal weight vector). Using this fact and endowing the optimization method with an adaptive step-size scheme, we obtain an exponential (or, as more commonly termed, linear) convergence rate for Batch Norm Gradient Descent on the (possibly) non-convex problem of Learning Halfspaces with Gaussian inputs (Section 4), which is a prominent problem in machine learning (Erdogdu et al., 2016). We thereby turn Bn
from an effective practical heuristic into a provably converging algorithm. Additionally, we show that the length-direction decoupling can be considered as a non-linear reparametrization of the weight space, which may be beneficial even for simple convex optimization tasks such as logistic regression. Interestingly, non-linear weight-space transformations have received little to no attention within the optimization community (see (Mikhalevich et al., 1988) for an exception).
Finally, in Section 5 we analyze the effect of Bn for training a multilayer neural network (MLP) and prove – again under a similar Gaussianity assumption – that Bn acts in such a way that the cross-dependencies between layers are reduced and thus the curvature structure of the network is simplified. Again, this is due to a certain global property in the directional part of the optimization landscape, which Bn can exploit via the length-direction decoupling. As a result, gradient-based optimization in reparametrized coordinates (and with an adaptive step-size policy) can enjoy a linear convergence rate on each individual unit. We substantiate both findings with experimental results on real-world datasets that confirm the validity of our analysis outside the setting of our theoretical assumptions, which cannot be certified to always hold in practice.
2 Background
2.1 Assumptions on data distribution
Suppose that is a random input vector and is the corresponding output variable. Throughout this paper we recurrently use the following statistics
(1) 
and make the following (weak) assumption.
Assumption 1.
[Weak assumption on data distribution] We assume that . We further assume that the spectrum of the matrix is bounded as
(2) 
As a result, is the symmetric positive definite covariance matrix of .
The part of our analysis presented in Sections 4 and 5
relies on a stronger assumption on the data distribution. In this regard we consider the combined random variable
(3) 
whose mean vector and covariance matrix are and as defined above in Eq. (1).
Assumption 2.
[Normality assumption on data distribution] We assume that is a multivariate normal random variable distributed with mean
and second moment .
In the absence of further knowledge, assuming Gaussian data is plausible from an information-theoretic point of view, since the Gaussian distribution maximizes the entropy over the set of all absolutely continuous distributions with fixed first and second moment (Dowson and Wragg, 1973). Thus, many recent studies on neural networks make this assumption on (see e.g. (Brutzkus and Globerson, 2017; Du and Lee, 2018)). Here we assume Gaussianity on instead, which is even less restrictive in some cases (for example, suppose that the conditional distribution is Gaussian with mean for positive labels and for negative labels, i.e. a mixture of Gaussians; if the covariance matrices of these distributions are the same, then is Gaussian while is not).
2.2 Batch normalization as a reparameterization of the weight space
In neural networks, a Bn layer normalizes the input of each unit of the following layer. This is done on the basis of data statistics in a training batch, but for the sake of analyzability we will work directly with population statistics. In particular, the output of a specific unit, which projects an input to its weight vector
and applies a sufficiently smooth activation function
as follows
(4) 
is normalized at the pre-activation level. That is, the input-output mapping of this unit becomes
(5) 
As stated (in finite-sum terms) in Algorithm 1 of Ioffe and Szegedy (2015), the normalization operation amounts to computing
(6) 
where and are (trainable) mean and variance adjustment parameters. In the following, we assume that is zero mean (Assumption 1) and omit . (In the non-compositional models of Sections 3 and 4, fulfilling the zero-mean assumption is as simple as centering the dataset. We also omit centering a neuron's input as well as learning for the neural network analysis in Section 5. This is done for the sake of simplicity, but note that Salimans and Kingma (2016) found that these aspects of Bn yield no empirical improvements for optimization.) Then the variance can be written as follows
(7)  
and substituting this expression into the batch normalized output of Eq. (5) yields
(8) 
To keep notation concise, we will often use the induced norm of the positive definite matrix defined as . Comparing Eq. (4) and (8), it becomes apparent that Bn can be considered as a reparameterization of the weight space. We thus define
(9) 
and note that accounts for the direction and for the length of . As a result, the batch normalized output can then be written as
(10) 
Note that Weight Normalization (Wn) is another instance of the above reparametrization, where the covariance matrix
is replaced by the identity matrix
(Salimans and Kingma, 2016). In both cases, the objective becomes invariant to linear scaling of . From a geometric perspective, the directional part of Wn can be understood as performing optimization on the unit sphere, while Bn operates on the sphere (ellipsoid) (Cho and Lee, 2017). Note that one can compute the variance term (7) in a matrix-free manner, i.e. never needs to be computed explicitly for Bn.
Of course, this type of reparametrization is not exclusive to applications in neural networks. In the following two sections we first show how reparametrizing the weight space of linear models can be advantageous from a classical optimization point of view. In Section 5 we extend this analysis to training Batch Normalized neural networks with adaptive step-size Gradient Descent and show that the length-direction split induces an interesting decoupling effect between the individual network layers, which simplifies the curvature structure.
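For concreteness, the following minimal numpy sketch computes the output of a single batch normalized unit from a centered data sample, using the empirical covariance as a stand-in for the population statistics discussed above. The names S, v, g and phi are our own illustrative notation, not the paper's symbols.

```python
import numpy as np

def bn_unit_output(X, v, g, phi=np.tanh):
    """Length-direction view of a batch normalized unit.

    The pre-activation X @ v is divided by its standard deviation
    sqrt(v^T S v) (with S the input covariance) and rescaled by g,
    i.e. the effective weight is w = g * v / ||v||_S as in Eq. (9)."""
    S = np.cov(X, rowvar=False)              # empirical stand-in for the covariance
    sigma = np.sqrt(v @ S @ v)               # ||v||_S; matrix-free alternative:
                                             # sigma = np.sqrt(np.mean((X @ v) ** 2))
    return phi(g * (X @ v) / sigma)

# Weight Normalization corresponds to replacing S by the identity,
# i.e. sigma = np.linalg.norm(v); both choices are invariant to rescaling of v.
```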
3 Ordinary Least Squares
As a preparation for subsequent analyses, we start with the simple convex quadratic objective encountered when minimizing an ordinary least squares problem
(11)  
One can think of this as a linear neural network with just one neuron and a quadratic loss. Thus, applying Bn amounts to reparametrizing according to Eq. (9), and the objective turns into the non-convex problem
(12) 
Despite the non-convexity of this new objective, we will prove that Gradient Descent (Gd) enjoys a linear convergence rate. Interestingly, our analysis establishes a link between in reparametrized coordinates (Eq. (12)) and the well-studied task of minimizing (generalized) Rayleigh quotients, as commonly encountered in eigenvalue problems (Argentati et al., 2017).
3.1 Convergence analysis
To simplify the analysis, note that, for a given , the objective of Eq. (12) is convex w.r.t. the scalar and thus the optimal value can be found by setting , which gives . Replacing this closed-form solution into Eq. (12) yields the following optimization problem
(13) 
which – as discussed in Appendix A.2 – is a special case of minimizing the generalized Rayleigh quotient for which an extensive literature exists (Knyazev, 1998; D’yakonov and McCormick, 1995). Here, we particularly consider solving (13) with Gd, which applies the following iterative updates to the parameters
(14) 
Based upon existing results, the next theorem establishes a linear convergence rate for the above iterates to the minimizer in the normalized coordinates.
Theorem 1.
This convergence rate is of the same order as the rate of standard Gd on the original objective of Eq. (11) (Nesterov, 2013). Yet, it is interesting to see that the non-convexity of the normalized objective does not slow down gradient-based optimization. In the following, we will repeatedly invoke this result to analyze more complex objectives for which Gd only achieves a sublinear convergence rate in the original coordinate space but is provably accelerated after applying Batch Normalization.
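To illustrate the reparametrized least squares dynamics, the following numpy sketch runs Gd on the directional objective of Eq. (13), with the closed-form optimal scaling plugged back in. We write u for the input-output cross moment and S for the input covariance; these names, the constant step-size and the assumed rank-one Rayleigh-quotient form of the objective are our own working choices, whereas Theorem 1 uses an adaptive step-size scheme.

```python
import numpy as np

def gd_normalized_ols(u, S, v0, iters=200, eta=None):
    """Gd on the direction v for reparametrized least squares.

    With w = g * v / ||v||_S and the optimal scaling g substituted,
    the remaining objective is assumed to take the rank-one
    generalized-Rayleigh-quotient form
        rho(v) = -(u @ v)**2 / (2 * v @ S @ v),
    cf. Eq. (13) and Appendix A."""
    v = np.array(v0, dtype=float)
    if eta is None:
        eta = 1.0 / np.linalg.eigvalsh(S).max()   # crude constant step-size
    for _ in range(iters):
        vSv, uv = v @ S @ v, u @ v
        # gradient of rho(v)
        grad = -(uv / vSv) * u + (uv ** 2 / vSv ** 2) * (S @ v)
        v -= eta * grad
    g = (u @ v) / np.sqrt(v @ S @ v)              # optimal length for this direction
    return g * v / np.sqrt(v @ S @ v)             # back to original coordinates
```

With the adaptive step-size of Theorem 1 the directional error contracts at a linear rate; the constant step-size above is only a simplification for illustration.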
4 Learning Halfspaces
We now turn our attention to the problem of Learning Halfspaces, which encompasses training the simplest possible neural network: the Perceptron. This optimization problem can be written as
(17) 
where and
is a loss function. Common choices for
include the zero-one, piecewise linear, logistic and sigmoidal loss. We here tailor our analysis to the following choice of loss function.
Assumption 3.
[Assumptions on loss function] We assume that the loss function is infinitely differentiable, i.e. , with a bounded derivative, i.e. such that .
Furthermore, we need to be sufficiently smooth.
Assumption 4.
[Smoothness assumption] We assume that the objective is smooth, i.e. it is differentiable on and its gradient is Lipschitz. Furthermore, we assume that a solution exists that is bounded in the sense that (this is a rather technical but not so restrictive assumption; for example, it always holds for the sigmoid loss unless the classification error of is already zero).
Since globally optimizing (17) is in general NP-hard (Guruswami and Raghavendra, 2009), we instead focus on understanding the effect of the normalized parameterization when searching for a stationary point. Towards this end, we now assume that is a multivariate normal random variable (see Assumption 2 and the discussion there).
4.1 Global characterization of the objective
The learning halfspaces objective – on Gaussian inputs – has a remarkable property: all critical points lie on the same line, independent of the choice of the loss function . We formalize this claim in the next lemma.
Lemma 1.
Interestingly, the optimal direction of these critical points spans the same line as the solution of a corresponding least squares regression problem (see Eq. (38) in Appendix A). In the context of convex optimization of generalized linear models, this fact was first pointed out in (Brillinger, 2012). Although the global optima of the two objectives are aligned, classical optimization methods – which perform updates based on local information – are generally blind to such global properties of the objective function. This is unfortunate since Gradient Descent converges linearly in the quadratic least-squares setting but only sublinearly on general Learning Halfspace problems (Zhang et al., 2015).
To accelerate the convergence of Gradient Descent, Erdogdu et al. (2016) thus proposed a two-step global optimization procedure for solving generalized linear models, which first finds the optimal direction by optimizing a least squares regression as a surrogate objective and then searches for a proper scaling factor of that minimizer. Here, we show that running Gd in coordinates reparameterized as in Eq. (9) makes this two-step procedure redundant. More specifically, splitting the optimization problem into searching for the optimal direction and the optimal scaling separately allows even local optimization methods to exploit the property of global minima alignment. Thus – without having to solve a least squares problem in the first place – the directional updates on the Learning Halfspace problem can mimic the least squares dynamics and thereby inherit the linear convergence rate. Combined with a fast (one-dimensional) search for the optimal scaling in each step, the overall convergence stays linear.
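The alignment of the least squares and halfspace minimizers can be checked numerically; the following small numpy experiment is purely illustrative (the single-index data model, the softplus surrogate and all constants are our own choices, not the exact setting of Lemma 1).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 50_000
S_true = np.diag(np.arange(1.0, d + 1.0))
X = rng.multivariate_normal(np.zeros(d), S_true, size=n)   # Gaussian inputs
w_star = rng.normal(size=d)
y = np.sign(X @ w_star + rng.normal(size=n))                # single-index labels

S = np.cov(X, rowvar=False)
w_ls = np.linalg.solve(S, X.T @ y / n)                      # least squares direction

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(d)                                             # Gd on the softplus surrogate
for _ in range(2000):
    w -= 0.5 * X.T @ (sigmoid(-y * (X @ w)) * (-y)) / n

cos = w @ w_ls / (np.linalg.norm(w) * np.linalg.norm(w_ls))
print(round(cos, 3))   # close to 1: both minimizers point in (nearly) the same direction
```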
As an illustration, Figure 1 shows the level sets as well as the optimal direction of a least squares, a logistic and a sigmoidal regression problem on the same Gaussian dataset. Furthermore, it shows iterates of Gd in original coordinates and of a sequential version of Gd in normalized coordinates that first optimizes the direction and then the scaling of its parameters (Gdnp). Both methods start at the same point and run with an infinitesimally small step-size. It can be seen that, while Gd takes completely different paths towards the optimal points of each problem instance, the dynamics of Gdnp are exactly the same until the optimal direction (depicted by the dotted red line; note that, as a result of Lemma 1, this line is identical in all problems) is found, and they differ only in the final scaling.
[Figure 1: level sets and iterates of Gd (left column) and Gdnp (right column) on a least squares (top), logistic (middle) and sigmoidal (bottom) regression problem.]
4.2 Local optimization in normalized parameterization
Let be the reparameterized objective with as defined in Eq. (9). We here consider optimizing this objective using Gradient Descent in Normalized Parameters (Gdnp). In each iteration, Gdnp performs a gradient step to optimize with respect to the direction
and does a one-dimensional bisection search to estimate the scaling factor
(see Algorithm 1). It is therefore only a slight modification of performing Batch Normalization plus Gradient Descent: instead of taking a gradient step on both and , we search for the (locally) optimal scaling at each iteration. This modification is cheap and simplifies the theoretical analysis, but it can also easily be substituted in practice by performing multiple Gd steps on the scaling factor, which we do for the experiments in Section 4.4.
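The following numpy sketch captures the spirit of this scheme. It is not the paper's Algorithm 1: the bisection search and the adaptive step-sizes of the analysis are replaced by plain gradient steps with user-chosen constants, and grad_f, S and all parameter names are our own placeholders.

```python
import numpy as np

def gdnp(grad_f, S, v0, g0=1.0, iters=100, eta_v=0.1, eta_g=0.1, g_steps=10):
    """Gradient Descent in Normalized Parameters (sketch).

    The weight is parametrized as w = g * v / ||v||_S with ||v||_S = sqrt(v @ S @ v).
    Each outer iteration takes one gradient step on the direction v and then
    (approximately) optimizes the scaling g by a few one-dimensional steps."""
    v, g = np.array(v0, dtype=float), float(g0)
    for _ in range(iters):
        sigma = np.sqrt(v @ S @ v)
        grad_w = grad_f(g * v / sigma)
        # chain rule through w(g, v): gradient with respect to the direction v
        grad_v = g * (grad_w / sigma - (v @ grad_w) * (S @ v) / sigma ** 3)
        v -= eta_v * grad_v
        for _ in range(g_steps):                 # scaling step (Algorithm 1 uses bisection)
            sigma = np.sqrt(v @ S @ v)
            g -= eta_g * (v @ grad_f(g * v / sigma)) / sigma
    sigma = np.sqrt(v @ S @ v)
    return g * v / sigma                         # final weight in original coordinates
```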
4.3 Convergence result
We now show that Algorithm 1 can achieve a linear convergence rate to a critical point on the possibly non-convex objective with Gaussian inputs. Note that all information for computing the adaptive step-size is readily available and can be computed efficiently.
Theorem 2.
[Convergence rate of Gdnp on learning halfspaces] Suppose Assumptions 1–4 hold. Let be the output of Gdnp on with the following choice of step-sizes
(18) 
for , where
(19)  
is a stopping criterion. If initialized such that (see Eq. (13)), then is an approximate critical point of in the sense that
(20) 
To complete the picture, Table 1 compares this result to the order complexity for reaching an optimal critical point (i.e.) with Gradient Descent and Accelerated Gradient Descent (Agd) in original coordinates. Although the accelerated rate achieved by Gdnp is significantly better than the best known rate of Agd for non-convex objective functions, we point out that the rate for Gdnp relies on the assumption of a Gaussian data distribution. Yet, Section 4.4 includes promising experimental results in a more practical setting with non-Gaussian data.
Finally, we note that the proof of this result relies specifically on the reparametrization performed by Batch Normalization. In Appendix B.3.6 we detail why our proof strategy is not suitable for the reparametrization of Weight Normalization and thus leave it as an interesting open question whether other settings (or proof strategies) can be found in which linear rates for Wn are provable.
Method | Assumptions | Complexity | Rate | Reference
Gd | Smoothness |  | Sublinear | (Nesterov, 2013)
Agd | Smoothness |  | Sublinear | (Jin et al., 2017)
Agd | Smoothness + convexity |  | Sublinear | (Nesterov, 2013)
Gdnp | Assumptions 1, 2, 3 and 4 |  | Linear | This paper
4.4 Experiments I
Setting
In order to substantiate the above analysis, we compare the convergence behavior of Gd and Agd to three versions of Gradient Descent in normalized coordinates. Namely, we benchmark (i) Gdnp (Algorithm 1) with multiple gradient steps on instead of bisection, (ii) a simpler version (Bn) which updates and with just one fixed step-size gradient step (Bn is thus conceptually very close to the classical Batch Norm Gradient Descent presented in Ioffe and Szegedy (2015)), and (iii) Weight Normalization (Wn) as presented in Salimans and Kingma (2016). All methods use full batch sizes and – except for Gdnp on – each method is run with a problem-specific, constant step-size.
We consider empirical risk minimization as a surrogate for (17) on the common real-world dataset a9a as well as on synthetic data drawn from a multivariate Gaussian distribution. We center the datasets and use two different functions . First, we choose the softplus, which resembles classical logistic regression (convex). Second, we use the sigmoid, which is a commonly used (non-convex) continuous approximation of the 0-1 loss (Zhang et al., 2015). Further details can be found in Appendix D.
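For reference, the gradients of the two surrogate losses can be written as below; we assume the margin-based empirical risk f(w) = mean_i ℓ(-y_i x_i^T w) with labels y_i in {-1, +1}, which is one standard way to instantiate an objective like (17) and can be plugged into the gdnp sketch above as grad_f.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softplus_grad(X, y, w):
    """Gradient of the convex surrogate f(w) = mean_i softplus(-y_i * x_i @ w)."""
    m = -y * (X @ w)
    return X.T @ (sigmoid(m) * (-y)) / len(y)

def sigmoid_loss_grad(X, y, w):
    """Gradient of the non-convex surrogate f(w) = mean_i sigmoid(-y_i * x_i @ w),
    a smooth approximation of the 0-1 loss."""
    m = -y * (X @ w)
    s = sigmoid(m)
    return X.T @ (s * (1.0 - s) * (-y)) / len(y)
```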
[Figure 2: panels for Gaussian softplus and a9a softplus (top row), Gaussian sigmoid and a9a sigmoid (bottom row).]
Results of an average run (solid line) in terms of log suboptimality (softplus) and log gradient norm (sigmoid) over iterations as well as 90% confidence intervals of 10 runs with random initialization.
Results
The Gaussian design experiments clearly confirm Theorem 2 in the sense that the loss in the convex case, as well as the gradient norm in the non-convex case, decreases at a linear rate. The results on a9a show that Gdnp can accelerate optimization even when the normality assumption does not hold and in a setting where no covariate shift is present, which motivates future research on normalization techniques in optimization. Interestingly, the performance of simple Bn and Wn is similar to that of Gd, which suggests that the length-direction decoupling on its own does not capture the entire potential of these methods. Gdnp, on the other hand, takes full advantage of the parameter splitting, both in terms of multiple steps on and – more importantly – adaptive step-sizes in .
5 Neural Networks
We now turn our focus to optimizing the weights of a multilayer perceptron (MLP) with one hidden layer and
hidden units that maps an input to a scalar output in the following way.
The and terms are the input and output weights of unit and is a so-called activation function. We assume to be a , which is a common choice in many neural network architectures used in practice. Given a loss function , the optimal input and output weights are determined by minimizing the following optimization problem
(21) 
where
In the following, we analyze the optimization of the input weights with frozen output weights and thus write hereafter for simplicity. As previously assumed for the case of learning halfspaces, our analysis relies on Assumption 2. This approach is rather common in the analysis of neural networks; e.g. Brutzkus and Globerson (2017) showed that for a special type of one-hidden-layer network with isotropic Gaussian inputs, all local minimizers are global. Under the same assumption, similar results for different classes of neural networks were derived in (Li and Yuan, 2017; Soltanolkotabi et al., 2017; Du et al., 2017).
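A minimal sketch of the forward pass of such a network with per-unit normalized pre-activations is given below. The variable names, the choice phi = tanh and the use of empirical statistics are illustrative assumptions on our part.

```python
import numpy as np

def mlp_bn_forward(X, V, c, g, phi=np.tanh):
    """One-hidden-layer MLP with per-unit normalized pre-activations (sketch).

    X: (n, d) centered inputs; V: (m, d) input weights, one row per hidden unit;
    g: (m,) scaling parameters; c: (m,) frozen output weights.
    Each pre-activation x @ v_j is divided by ||v_j||_S before the activation,
    mirroring the reparametrization of Section 2.2 (mean terms omitted)."""
    S = np.cov(X, rowvar=False)                           # input covariance
    sigmas = np.sqrt(np.einsum('md,de,me->m', V, S, V))   # ||v_j||_S for every unit
    H = phi(g * (X @ V.T) / sigmas)                       # normalized hidden layer, (n, m)
    return H @ c                                          # scalar output per sample
```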
5.1 Global characterization of the objective
We here show that, under the same assumption, the landscape of exhibits a global property similar to that of the Learning Halfspaces objective (see Section 4.1). In fact, the critical points of all neurons in a hidden layer align along one and the same line in , and this direction depends only on the incoming information into the hidden layer.
Lemma 2.
See Appendix C.2 for a discussion of possible implications for deep neural networks.
5.2 Convergence result
We again consider normalizing the weights of each unit by means of its input covariance matrix , i.e. , and present a simple algorithmic way to leverage the input-output decoupling of Lemma 2. Contrary to Bn, which optimizes all the weights simultaneously, our approach is based on alternating optimization over the different hidden units. We thus adapt Gdnp to the problem of optimizing the units of a neural network in Algorithm 2, but note that this modification mainly serves to simplify the theoretical analysis.
Optimizing each unit independently formally results in minimizing the function as defined in Algorithm 2. In the next theorem, we prove that this version of Gdnp achieves a linear rate of convergence to optimize each .
Theorem 3.
[Convergence of Gdnp on MLP] Suppose Assumptions 1–4 hold. We consider optimizing the weights of unit , assuming that all directions are critical points of and for . Then, Gdnp with step-size policy as in (98) and stopping criterion as in (99) yields a linear convergence rate on in the sense that
(23)  
where the constant is defined in Eq. (102).
The result of Theorem 3 relies on the fact that each is either zero or has zero gradient. If we assume that an exact critical point is reached after optimizing each individual unit, then the result directly implies that the alternating minimization presented in Algorithm 2 reaches a critical point of the overall objective. Since the established convergence rate for each individual unit is linear, this assumption is plausible. We leave a more precise convergence analysis, which takes into account that optimizing each individual unit for a finite number of steps may yield numerical suboptimality, for future work.
5.3 Experiments II
Setting
In the proof of Theorem 3 we show that Gdnp can leverage the length-direction decoupling in a way that lowers cross-dependencies between hidden layers and yields faster convergence. A central part of the proof is Lemma 2, which states that – given Gaussian inputs – the optimal direction of a given layer is independent of all downstream layers. Since this assumption is rather strong and since Algorithm 2 is intended for analysis purposes only, we test the validity of the above hypothesis outside the Gaussian setting by training a Batch Normalized multilayer feedforward network (Bn) on a real-world image classification task with plain Gradient Descent. For comparison, a second, unnormalized network is trained with Gd. To validate Lemma 2, we measure the interdependency between the central and all other hidden layers in terms of the Frobenius norm of their second partial cross derivatives (in the directional component). Further details can be found in Appendix D.
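One simple (if inefficient) way to compute such a cross-dependency measure with automatic differentiation is sketched below; the function name and the row-by-row double-backward strategy are our own, and the actual experimental setup is the one described in Appendix D.

```python
import torch

def cross_hessian_fro(loss, theta_i, theta_j):
    """Frobenius norm of the second partial cross derivative
    d^2 loss / (d theta_i d theta_j) between two parameter tensors.

    loss must be a scalar tensor built from parameters with requires_grad=True.
    The explicit loop over entries is meant for illustration only."""
    g_i = torch.autograd.grad(loss, theta_i, create_graph=True)[0].reshape(-1)
    total = 0.0
    for k in range(g_i.numel()):
        row = torch.autograd.grad(g_i[k], theta_j, retain_graph=True)[0]
        total += float((row ** 2).sum())
    return total ** 0.5
```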
[Figure 3: training loss, gradient norms and cross-dependency (second partial cross derivative) measures for the Batch Normalized (Bn) and unnormalized (Gd) networks.]
Results
Figure 3 confirms that the directional gradients of the central layer are affected far more by the upstream than by the downstream layers, to a surprisingly large extent. Interestingly, this holds even before reaching a critical point. The downstream cross-dependencies generally decay for the Batch Normalized network (Bn) (especially in the first 1000 iterations, where most progress is made) while they remain elevated in the unnormalized network (Gd). This suggests that using Batch Normalization layers indeed simplifies the network's curvature structure, in the sense that the length-direction decoupling allows Gradient Descent to exploit simpler trajectories in these normalized coordinates for faster convergence.
Of course, we cannot fully untangle this effect from other possible positive aspects of training with Bn (see the introduction). Yet, the fact that the (de)coupling increases with the distance to the middle layer (note how earlier (later) layers are more (less) important for ) emphasizes the relevance of this analysis particularly for deep neural network structures, where downstream dependencies might vanish completely with depth. This not only makes gradient-based training easier but also suggests the possibility of using partial second-order information, such as diagonal Hessian approximations (e.g. as proposed in (Martens et al., 2012)).
6 Conclusion
We took a theoretical approach to study the acceleration provided by Batch Normalization. In a somewhat simplified setting, we have shown that the reparametrization performed by Batch Normalization leads to a provable acceleration of gradient-based optimization by splitting the problem into subtasks that are easier to solve. In order to evaluate the impact of the assumptions required for our analysis, we also performed experiments on real-world datasets, which agree with the results of the theoretical analysis to a surprisingly large extent.
We consider this work a first step for two particular directions of future research. First, it raises the question of how to optimally train Batch Normalized neural networks. In particular, our results suggest that different and adaptive step-size schemes for the two parameters – length and direction – can lead to significant accelerations. Second, the analysis of Sections 3 and 4 reveals that a better understanding of non-linear coordinate transformations is a promising direction for the continuous optimization community.
References
 Absil et al. (2009) Absil, P.A., Mahony, R., and Sepulchre, R. (2009). Optimization algorithms on matrix manifolds. Princeton University Press.
 Argentati et al. (2017) Argentati, M. E., Knyazev, A. V., Neymeyr, K., Ovtchinnikov, E. E., and Zhou, M. (2017). Convergence theory for preconditioned eigenvalue solvers in a nutshell. Foundations of Computational Mathematics, 17(3):713–727.
 Arpit et al. (2016) Arpit, D., Zhou, Y., Kota, B. U., and Govindaraju, V. (2016). Normalization propagation: A parametric technique for removing internal covariate shift in deep networks. arXiv preprint arXiv:1603.01431.
 Ba et al. (2016) Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
 Brillinger (2012) Brillinger, D. R. (2012). A generalized linear model with “gaussian” regressor variables. In Selected Works of David Brillinger, pages 589–606. Springer.
 Brutzkus and Globerson (2017) Brutzkus, A. and Globerson, A. (2017). Globally optimal gradient descent for a convnet with gaussian inputs. arXiv preprint arXiv:1702.07966.
 Cho and Lee (2017) Cho, M. and Lee, J. (2017). Riemannian approach to batch normalization. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 5225–5235. Curran Associates, Inc.
 Dowson and Wragg (1973) Dowson, D. and Wragg, A. (1973). Maximumentropy distributions having prescribed first and second moments (corresp.). IEEE Transactions on Information Theory, 19(5):689–693.
 Du and Lee (2018) Du, S. S. and Lee, J. D. (2018). On the power of overparametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206.
 Du et al. (2017) Du, S. S., Lee, J. D., Tian, Y., Poczos, B., and Singh, A. (2017). Gradient descent learns onehiddenlayer cnn: Don’t be afraid of spurious local minima. arXiv preprint arXiv:1712.00779.
 D’yakonov and McCormick (1995) D’yakonov, E. G. and McCormick, S. (1995). Optimization in solving elliptic problems. CRC Press.
 Erdogdu et al. (2016) Erdogdu, M. A., Dicker, L. H., and Bayati, M. (2016). Scaled least squares estimator for glms in largescale problems. In Advances in Neural Information Processing Systems, pages 3324–3332.
 Gitman and Ginsburg (2017) Gitman, I. and Ginsburg, B. (2017). Comparison of batch normalization and weight normalization algorithms for the largescale image classification. arXiv preprint arXiv:1709.08145.
 Guruswami and Raghavendra (2009) Guruswami, V. and Raghavendra, P. (2009). Hardness of learning halfspaces with noise. SIAM Journal on Computing, 39(2):742–765.

 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
 Ioffe and Szegedy (2015) Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML'15, pages 448–456. JMLR.org.
 Jin et al. (2017) Jin, C., Netrapalli, P., and Jordan, M. I. (2017). Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456.
 Klambauer et al. (2017) Klambauer, G., Unterthiner, T., Mayr, A., and Hochreiter, S. (2017). Selfnormalizing neural networks. In Advances in Neural Information Processing Systems, pages 972–981.
 Knyazev and Shorokhodov (1991) Knyazev, A. and Shorokhodov, A. (1991). On exact estimates of the convergence rate of the steepest ascent method in the symmetric eigenvalue problem. Linear algebra and its applications, 154:245–257.
 Knyazev (1998) Knyazev, A. V. (1998). Preconditioned eigensolvers—an oxymoron. Electron. Trans. Numer. Anal, 7:104–123.
 Knyazev and Neymeyr (2003) Knyazev, A. V. and Neymeyr, K. (2003). A geometric theory for preconditioned inverse iteration iii: A short and sharp convergence estimate for generalized eigenvalue problems. Linear Algebra and its Applications, 358(13):95–114.
 Krizhevsky and Hinton (2009) Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images.

 Landsman and Nešlehová (2008) Landsman, Z. and Nešlehová, J. (2008). Stein's lemma for elliptical random vectors. Journal of Multivariate Analysis, 99(5):912–927.
 Li and Yuan (2017) Li, Y. and Yuan, Y. (2017). Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems, pages 597–607.
 Lipton and Steinhardt (2018) Lipton, Z. C. and Steinhardt, J. (2018). Troubling trends in machine learning scholarship. arXiv preprint arXiv:1807.03341.
 Martens et al. (2012) Martens, J., Sutskever, I., and Swersky, K. (2012). Estimating the hessian by backpropagating curvature. arXiv preprint arXiv:1206.6464.
 Mikhalevich et al. (1988) Mikhalevich, V., Redkovskii, N., and Antonyuk, A. (1988). Minimization methods for smooth nonconvex functions. Cybernetics, 24(4):395–403.
 Nesterov (2013) Nesterov, Y. (2013). Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media.

 Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in pytorch. In NIPS-W.
 Salimans and Kingma (2016) Salimans, T. and Kingma, D. P. (2016). Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pages 901–909.
 Santurkar et al. (2018) Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. (2018). How does batch normalization help optimization? (No, it is not about internal covariate shift). arXiv preprint arXiv:1805.11604.
 Soltanolkotabi et al. (2017) Soltanolkotabi, M., Javanmard, A., and Lee, J. D. (2017). Theoretical insights into the optimization landscape of overparameterized shallow neural networks. arXiv preprint arXiv:1707.04926.

 Szegedy et al. (2017) Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. A. (2017). Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, volume 4, page 12.
 Zhang et al. (2015) Zhang, Y., Lee, J. D., Wainwright, M. J., and Jordan, M. I. (2015). Learning halfspaces and neural networks with random initialization. arXiv preprint arXiv:1511.07948.
Appendix
Appendix A Least Squares Analysis
A.1 Restated result
Recall that, after normalizing according to (9) and using the closed form solution for the optimal scaling factor , optimizing the ordinary least squares objective can be written as the following minimization problem
(13 revisited) 
We consider optimizing the above objective by Gd which takes iterative steps of the form
(24) 
where
(25) 
Furthermore, we choose
(26) 
Our analysis relies on the (weak) data distribution assumption stated in Assumption 1. It is noteworthy that holds under this assumption. In the next theorem, we establish a convergence rate of Gd in terms of function value as well as gradient norm.
See 1
The proof of this result crucially relies on the insight that the minimization problem given in (12) resembles the problem of maximizing the generalized Rayleigh quotient, which is commonly encountered in generalized eigenproblems. We will thus first review this area, where convergence rates are usually provided in terms of the angle of the current iterate with the maximizer, which is the principal eigenvector. Interestingly, this angle can be related to both the current function value and the norm of the current gradient. We will make use of these connections to prove the above theorem in Section A.5. Although not necessarily needed for convex functions, we introduce the gradient norm relation as we will later go on to prove a similar result for possibly non-convex functions in the learning halfspaces setting (Theorem 2).
A.2 Background on eigenvalue problems
A.2.1 Rayleigh quotient
Optimizing the Rayleigh quotient is a classical nonconvex optimization problem that is often encountered in eigenvector problems. Let be a symmetric matrix, then is the principal eigenvector of if it maximizes
(27) 
and is called the Rayleigh quotient. Notably, this quotient satisfies the so-called Rayleigh inequality
where and are the smallest and largest eigenvalue of , respectively.
Maximizing is a non-convex (strict-saddle) optimization problem, where the -th critical point constitutes the -th eigenvector with corresponding eigenvalue (see Absil et al. (2009), Section 4.6.2 for details). It is known that optimizing with Gd – using an iteration-dependent step-size – converges linearly to the principal eigenvector . The convergence analysis is based on the "mini-dimensional" method and yields the following result
(28) 
under weak assumptions on . Details as well as the proof of this result can be found in (Knyazev and Shorokhodov, 1991).
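For concreteness, a small numpy sketch of this classical iteration (with a conservative fixed step instead of the iteration-dependent step-size used in the cited analysis):

```python
import numpy as np

def rayleigh_ascent(A, x0, iters=1000):
    """Gradient ascent on the Rayleigh quotient r(x) = x^T A x / x^T x of a
    symmetric matrix A; for a generic starting point the iterates converge to
    the principal eigenvector (the cited analyses quantify the linear rate)."""
    x = x0 / np.linalg.norm(x0)
    eta = 1.0 / (4.0 * np.linalg.norm(A, 2))    # conservative fixed step-size
    for _ in range(iters):
        r = x @ A @ x                           # Rayleigh quotient (||x|| = 1)
        x = x + eta * 2.0 * (A @ x - r * x)     # gradient step on r(x)
        x = x / np.linalg.norm(x)               # renormalize (r is scale-invariant)
    return x, x @ A @ x                         # eigenvector and eigenvalue estimate
```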
A.2.2 Generalized Rayleigh quotient
The reparametrized least squares objective (13), however, is not exactly equivalent to (27) because of the covariance matrix that appears in the denominator. As a matter of fact, our objective is a special instance of the generalized Rayleigh quotient
(29) 
where is defined as above and is a symmetric positive definite matrix.
Maximizing (29) is a generalized eigenproblem in the sense that it solves the task of finding eigenvalues of the matrix pencil for which , i.e. finding a vector that obeys . Again we have
Among the rich literature on solving generalized symmetric eigenproblems, a Gd convergence rate similar to (28) has been established in Theorem 6 of (Knyazev and Neymeyr, 2003), which yields
again under weak assumptions on .
A.2.3 Our contribution
Since our setting in Eq. (13) is a minimization task, we note that for our specific choice of and we have and we recall the general result that then
More importantly, we here have a special case in which the numerator of has a particular low-rank structure. In fact, is a rank-one matrix. Instead of directly invoking the convergence rate of (Knyazev and Neymeyr, 2003), this allows for a much simpler analysis of the convergence rate of Gd on , since the rank-one property yields a simpler representation of the relevant vectors. Furthermore, we establish a connection between the suboptimality in function value and the norm of the gradient. As mentioned earlier, we need such a guarantee in our later analysis of learning halfspaces, which is an instance of a (possibly) non-convex optimization problem.
A.3 Preliminaries
Notations Let be a symmetric positive definite matrix. We introduce the following compact notations that will be used throughout the analysis.

inner product of : .

norm of : .

angle between two vectors :
(30) 
orthogonal projection of to span for :
(31) 
spectral norm of a matrix :
(32)
These notations allow us to make the analysis similar to the simple Rayleigh quotient case. For example, the denominator in (29) can now be written as .
Properties We will use the following elementary properties of the induced terms defined above.


If is the orthogonal projection of to span, then it holds that
(33) 
The spectral norm of a matrix can be written in the alternative form
(34) (35) (36) 
Let be a square matrix and , then the following result holds due to the definition of spectral norm
(37)
A.4 Characterization of the LS minimizer
By setting the gradient of (11) to zero and recalling the convexity of we immediately see that the minimizer of this objective is
(38) 
Indeed, one can easily verify that is also an eigenvector of the matrix pair since
(39)  
where is the corresponding generalized eigenvalue. The associated eigenvector with is
(40) 
Thereby we extend the normalized eigenvector to an orthogonal basis of such that
(41) 
holds for all . Let be the matrix whose th column is . The matrix is orthogonal to the matrix since
(42)  
which is a zero matrix due to the
orthogonality of the basis (see Eq. (41)). As a result, the columns of are eigenvectors associated with a zero eigenvalue. Since and form an orthonormal basis of , no further eigenvalues exist. We can conclude that any vector is a minimizer of the reparametrized ordinary least squares problem as presented in (13) and that the minimum value of relates to the eigenvalue as
Spectral representation of suboptimality Our convergence analysis is based on the angle between the current iterate and the leading eigenvector , for which we recall property (P.1). We can express in the orthogonal basis that we defined above:
(43) 
and since is the orthogonal projection of to span, the result of (P.2) implies
(44) 
Clearly this metric is zero for the optimal solution and otherwise bounded from above by one. To justify that it is a proper choice, the next proposition proves that the suboptimality on , i.e. , relates directly to this angle.
Proposition 1.
The suboptimality of on relates to as
(45) 
where . This is equivalent to
(46) 
Proof.
We use the proposed eigenexpansion of Eq. (43) to rewrite
(47) 
and replace the above result into . Then
which proves the second part of the proposition. The first follows directly from property (P.1). ∎
Gradient-suboptimality connection Fermat's first-order optimality condition implies that the gradient is zero at the minimizer of . Considering the structure of , we establish a precise connection between the norm of the gradient and the suboptimality. Our analysis relies on the representation of the gradient in the orthonormal basis, which is described in the next proposition.
Proposition 2.
Using the orthogonal basis as given in Eq. (41), the gradient vector can be expanded as
(48)  
Proof.
The above derivation is based on two results: (i) is an eigenvector of and (ii) the representation of in Proposition 1. We recall the definition of in (25) and write
(49)  
∎
Exploiting the gradient representation of the last proposition, the next proposition establishes the connection between the suboptimality and the norm of the gradient .
Proposition 3.
Suppose that , then the norm of the gradient relates to the suboptimality as
(50) 
Proof.
Multiplying the gradient representation in Proposition 2 by yields