# Towards a Theoretical Understanding of Batch Normalization

Normalization techniques such as Batch Normalization have been applied very successfully for training deep neural networks. Yet, despite its apparent empirical benefits, the reasons behind the success of Batch Normalization are mostly hypothetical. We thus aim to provide a more thorough theoretical understanding from an optimization perspective. Our main contribution towards this goal is the identification of various problem instances in the realm of machine learning where, under certain assumptions, Batch Normalization can provably accelerate optimization with gradient-based methods. We thereby turn Batch Normalization from an effective practical heuristic into a provably converging algorithm for these settings. Furthermore, we substantiate our analysis with empirical evidence that suggests the validity of our theoretical results in a broader context.

## Authors

• 7 publications
• 37 publications
• 107 publications
• 1 publication
• 48 publications
• 5 publications
• ### Weight and Gradient Centralization in Deep Neural Networks

Batch normalization is currently the most widely used variant of interna...
10/02/2020 ∙ by Wolfgang Fuhl, et al. ∙ 0

• ### Normalization Techniques in Training DNNs: Methodology, Analysis and Application

Normalization techniques are essential for accelerating the training and...
09/27/2020 ∙ by Lei Huang, et al. ∙ 8

• ### On the Convergence and Robustness of Batch Normalization

Despite its empirical success, the theoretical underpinnings of the stab...
09/29/2018 ∙ by Yongqiang Cai, et al. ∙ 0

• ### Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks

We present weight normalization: a reparameterization of the weight vect...
02/25/2016 ∙ by Tim Salimans, et al. ∙ 0

• ### Revisit Batch Normalization: New Understanding from an Optimization View and a Refinement via Composition Optimization

Batch Normalization (BN) has been used extensively in deep learning to a...
10/15/2018 ∙ by Xiangru Lian, et al. ∙ 0

• ### On the accuracy of self-normalized log-linear models

Calculation of the log-normalizer is a major computational obstacle in a...
06/12/2015 ∙ by Jacob Andreas, et al. ∙ 0

• ### Training BatchNorm and Only BatchNorm: On the Expressive Power of Random Features in CNNs

Batch normalization (BatchNorm) has become an indispensable tool for tra...
02/29/2020 ∙ by Jonathan Frankle, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

One of the most important recent innovations for optimizing deep neural networks is Batch Normalization (Bn(Ioffe and Szegedy, 2015). This technique has been proven to successfully stabilize and accelerate training of deep neural networks and is thus by now standard in many state-of-the art architectures such as ResNets (He et al., 2016) and the latest Inception Nets (Szegedy et al., 2017). The success of Batch Normalization has promoted its key idea that normalizing the inner layers of a neural network stabilizes training which recently led to the development of many such normalization methods such as (Arpit et al., 2016; Klambauer et al., 2017; Salimans and Kingma, 2016) and (Ba et al., 2016) to name just a few.

Yet, despite the ever more important role of Batch Normalization for training deep neural networks, the Machine Learning community is mostly relying on empirical evidence and thus lacking a thorough theoretical understanding that can explain such success. Indeed – to the best of our knowledge – there exists no theoretical result which provably shows faster convergence rates for this technique on any problem instance. So far, there only exists competing hypotheses that we briefly summarize below.

### 1.1 Related work

##### Internal Covariate Shift

The most widespread idea is that Batch Normalization accelerates training by reducing the so-called internal covariate shift, defined as the change in the distribution of layer inputs while the conditional distribution of outputs is unchanged. This change can be significant especially for deep neural networks where the successive composition of layers drives the activation distribution away from the initial input distribution. Ioffe and Szegedy (2015)

argue that Batch Normalization reduces the internal covariate shift by employing a normalization technique that enforces the input distribution of each activation layer to be whitened - i.e. enforced to have zero means and unit variances - and decorrelated . Yet, as pointed out by

Lipton and Steinhardt (2018), the covariate shift phenomenon itself is not rigorously shown to be the reason behind the performance of Batch Normalization. Furthermore, a recent empirical study published by (Santurkar et al., 2018) provides strong evidence supporting the hypothesis that the performance gain of Batch Normilization is not explained by the reduction of internal covariate shift.

##### Smoothing of objective function

Recently, Santurkar et al. (2018)

argue that under certain assumptions a normalization layer simplifies optimization by smoothing the loss landscape of the optimization problem of the preceding layer. Yet, we note that this effect may - at best - only improve the constant factor of the convergence rate of Gradient Descent and not the rate itself (e.g. from sub-linear to linear). Furthermore, the analysis treats only the largest eigenvalue and thus one direction in the landscape (at any given point) and keeps the (usually trainable)

BN parameters fixed to zero-mean and unit variance. For a thorough conclusion about the overall landscape, a look at the entire eigenspectrum (including negative and zero eigenvalues) would be needed. Yet, this is particularly hard to do as soon as one allows for learnable mean and variance parameters since the effect of their interplay on the distribution of eigenvalues is highly non-trivial.

##### Length-direction decoupling

Finally, a different perspective was brought up by another normalization technique termed Weight Normalization (Wn(Salimans and Kingma, 2016)

. This technique performs a very simple normalization that is independent of any data statistics with the goal of decoupling the length of the weight vector from its direction. The optimization of the training objective is then performed by training the two parts separately. As discussed in Section

2, Bn and Wn differ in how the weights are normalized but share the above mentioned decoupling effect. Interestingly, weight normalization has been shown empirically to benefit from similar acceleration properties as Batch Normalization (Gitman and Ginsburg, 2017; Salimans and Kingma, 2016). This raises the obvious question whether the empirical success of training with Batch Normalization can (at least partially) be attributed to its length-direction decoupling aspect.

### 1.2 Contribution and organization

We contribute to a better theoretical understanding of Batch Normalization by analyzing it from an optimization perspective. In this regard, we particularly address the following question:

Can we find a setting in which Batch Normalization provably accelerates optimization with Gradient Descent and does the length-direction decoupling play a role in this phenomenon?

We answer both questions affirmatively. In particular, we show that the specific variance transformation of Bn decouples the length and directional components of the weight vectors in such a way that allows local search methods to exploit certain global properties of the optimization landscape (present in the directional component of the optimal weight vector). Using this fact and endowing the optimization method with an adaptive stepsize scheme, we obtain an exponential (or as more commonly termed linear) convergence rate for Batch Norm Gradient Descent on the (possibly) non-convex problem of Learning Halfspaces with Gaussian inputs (Section 4), which is a prominent problem in machine learning Erdogdu et al. (2016). We thereby turn Bn

from an effective practical heuristic into a provably converging algorithm. Additionally we show that the length-direction decoupling can be considered as a non-linear reparametrization of the weight space, which may be beneficial for even simple convex optimization tasks such as logistic regressions. Interestingly, non-linear weightspace transformations have received little to no attention within the optimization community (see

(Mikhalevich et al., 1988) for an exception).

Finally, in Section 5 we analyze the effect of Bn for training a multilayer neural network (MLP) and prove – again under a similar Gaussianity assumption – that Bn acts in such a way that the cross dependencies between layers are reduced and thus the curvature structure of the network is simplified. Again, this is due to a certain global property in the directional part of the optimization landscape, which BN can exploit via the length-direction decoupling. As a result, gradient-based optimization in reparametrized coordinates (and with an adaptive stepsize policy) can enjoy a linear convergence rate on each individual unit. We substantiate both findings with experimental results on real world datasets that confirm the validity of our analysis outside the setting of our theoretical assumptions that cannot be certified to always hold in practice.

## 2 Background

### 2.1 Assumptions on data distribution

Suppose that is a random input vector and is the corresponding output variable. Throughout this paper we recurrently use the following statistics

 (1)

and make the following (weak) assumption.

###### Assumption 1.

[Weak assumption on data distribution] We assume that . We further assume that the spectrum of the matrix is bounded as

 (2)

As a result, is the symmetric positive definite covariance matrix of .

The part of our analysis presented in Section 4 and 5

relies on a stronger assumption on the data distribution. In this regard we consider the combined random variable

 z:=−yx, (3)

whose mean vector and covariance matrix are and as defined above in Eq. (1).

###### Assumption 2.

[Normality assumption on data distribution] We assume that is a multivariate normal random variable distributed with mean

and second-moment

.

In the absence of further knowledge, assuming Gaussian data is plausible from an information-theoretic point of view since the Gaussian distribution maximizes the entropy over the set of all absolutely continuous distributions with fixed first and second moment

(Dowson and Wragg, 1973). Thus, many recent studies on neural networks make this assumption on (see e.g. (Brutzkus and Globerson, 2017; Du and Lee, 2018)). Here we assume Gaussianity on instead which is even less restrictive in some cases111For example, suppose that conditional distribution is gaussian with mean for positive labels and for negative labels (mixture of gaussians). If the covariance matrix of these marginal distributions are the same, is Gaussian while is not..

### 2.2 Batch normalization as a reparameterization of the weight space

In neural networks a Bn layer normalizes the input of each unit of the following layer. This is done on the basis of data statistics in a training batch, but for the sake of analyzablility we will work directly with population statistics. In particular, the output of a specific unit, which projects an input to its weight vector

and applies a sufficiently smooth activation function

as follows

 f(w)=Ex[φ(x⊤w)] (4)

is normalized on the pre-activation level. That is, the input-output mapping of this unit becomes

 f\scBN(w,g,γ)=Ex[φ(\scBN(x⊤w))]. (5)

As stated (in finite-sum terms) in Algorithm 1 of (Ioffe and Szegedy, 2015) the normalization operation amounts to computing

 \scBN(x⊤w)=gx⊤w−Ex[x⊤w]varx[x⊤w]1/2+γ, (6)

where and are (trainable) mean and variance adjustment parameters. In the following, we assume that is zero mean (Assumption 1) and omit .222In the (non-compositional) models of Section 3 and 4, fulfilling the assumption that

is zero-mean is as simple as centering the dataset. Yet, we also omit centering a neurons input as well as learning

for the neural network analysis in Section 5. This is done for the sake of simplicity but note that Salimans and Kingma (2016) found that these aspects of Bn yield no empirical improvements for optimization. Then the variance can be written as follows

 varx[x⊤w] =Ex[(x⊤w)2]=Ex[(w⊤x)(x⊤w)] (7) =w⊤Ex[xx⊤]w=w⊤Sw

and replacing this expression into the batch normalized output of Eq.(5) yields

 f\scBN(w,g)=Ex[φ(gx⊤w(w⊤Sw)1/2)]. (8)

In order to keep concise notations, we will often use the induced norm of the positive definite matrix defined as . Comparing Eq. (4) and (8) it becomes apparent that Bn can be considered as a reparameterization of the weight space. We thus define

 ~w:=gw∥w∥S (9)

and note that accounts for the direction and for the length of . As a result, the batch normalized output can then be written as

 f\scBN(w,g)=Ex[φ(x⊤~w)]. (10)

Note that Weight Normalization (Wn) is another instance of the above reparametrization, where the covariance matrix

is replaced by the identity matrix

(Salimans and Kingma, 2016). In both cases, the objective becomes invariant to linear scaling of . From a geometry perspective, the directional part of Wn can be understood as performing optimization on the unit sphere while Bn operates on the -sphere (ellipsoid) (Cho and Lee, 2017). Note that one can compute the variance term (7) in a matrix-free manner, i.e. never needs to be computed explicitly for Bn.

Of course, this type of reparametrization is not exclusive to applications in neural networks. In the following two sections we first show how reparametrizing the weight space of linear models can be advantageous from a classical optimization point of view. In Section 5 we extend this analysis to training Batch Normalized neural networks with adaptive-stepsize Gradient Descent and show that the length-direction split induces an interesting decoupling effect of the individual network layers which simplifies the curvature structure.

## 3 Ordinary Least Squares

As a preparation for subsequent analyses, we start with the simple convex quadratic objective encountered when minimizing an ordinary least squares problem

 min~w∈Rd(f{\sc ols}% (~w):=Ex,y[(y−x⊤~w)2]) (11) (A???)⇔ min~w∈Rd(2u⊤~w+~w⊤S~w).

One can think of this as a linear neural network with just one neuron and a quadratic loss. Thus, applying Bn resembles reparametrizing according to Eq. (9) and the objective turns into the non-convex problem

 (12)

Despite the non-convexity of this new objective, we will prove that Gradient Descent (Gd) enjoys a linear convergence rate. Interestingly, our analysis establishes a link between in reparametrized coordinates (Eq. (12)) and the well-studied task of minimizing (generalized) Rayleigh quotients as it is commonly encountered in eigenvalue problems (Argentati et al., 2017).

### 3.1 Convergence analysis

To simplify the analysis, note that, for a given , the objective of Eq. (12) is convex w.r.t. the scalar and thus the optimal value can be found by setting , which gives . Replacing this closed-form solution into Eq. (12) yields the following optimization problem

 minw∈Rd∖{0}(ρ(w):=−w⊤uu⊤ww⊤Sw), (13)

which – as discussed in Appendix A.2 – is a special case of minimizing the generalized Rayleigh quotient for which an extensive literature exists (Knyazev, 1998; D’yakonov and McCormick, 1995). Here, we particularly consider solving (13) with Gd, which applies the following iterative updates to the parameters

 wt+1:=wt+2ηt((u⊤wt)u+ρ(wt)Swt)w⊤tSwt. (14)

Based upon existing results, the next theorem establishes a linear convergence rate for the above iterates to the minimizer in the normalized coordinates.

###### Theorem 1.

[Convergence rate on least squares] Suppose that the (weak) Assumption 1 on the data distribution holds. Consider the Gd iterates given in Eq. (14) with the stepsize and starting from . Then,

 Δρt≤(1−μL)2tΔρ0, (15)

where . Furthermore, the -norm of the gradient relates to the suboptimality as

 ∥wt∥2S∥∇ρ(wt)∥2S−1/|4ρ(wt)|=Δρt. (16)

This convergence rate is of the same order as the rate of standard Gd on the original objective of Eq. (11) (Nesterov, 2013). Yet, it is interesting to see that the non-convexity of the normalized objective does not slow gradient-based optimization down. In the following, we will repeatedly invoke this result to analyze more complex objectives for which Gd only achieves a sublinear convergence rate in the original coordinate space but is provably accelerated after using Batch Normalization.

## 4 Learning Halfspaces

We now turn our attention to the problem of Learning Halfspaces, which encompasses training the simplest possible neural network: the Perceptron. This optimization problem can be written as

 min~w∈Rd (f{\sc lh}(~w):=Ey,x[φ(z⊤~w)]) (17)

where and

is a loss function. Common choices for

include the zero-one, piece-wise linear, logistic and sigmoidal loss. We here tailor our analysis to the following choice of loss function.

###### Assumption 3.

[Assumptions on loss function] We assume that the loss function is infinitely differentiable, i.e. , with a bounded derivative, i.e. such that .

Furthermore, we need to be sufficiently smooth.

###### Assumption 4.

[Smoothness assumption] We assume that the objective is -smooth if it is differentiable on and its gradient is -Lipschitz. Furthermore, we assume that a solution exists that is bounded in the sense that 333This is a rather technical but not so restrictive assumption. For example, it always holds for the sigmoid loss unless the classification error of is already zero.

Since globally optimizing (17) is in general NP-hard (Guruswami and Raghavendra, 2009), we instead focus on understanding the effect of the normalized parameterization when searching for a stationary point. Towards this end we now assume that is a multivariate normal random variable (see Assumption 2 and discussion there).

### 4.1 Global characterization of the objective

The learning halfspaces objective – on Gaussian inputs – has a remarkable property: all critical points lie on the same line, independent of the choice of the loss function . We formalize this claim in the next lemma.

###### Lemma 1.

Under Assumptions 1 and 2, all bounded critical points of have the general form

 ~w∗=g∗S−1u,

where the scalar depends on and the choice of the loss function .

Interestingly, the optimal direction of these critical points spans the same line as the solution of a corresponding least squares regression problem (see Eq. (38) in Appendix A). In the context of convex optimization of generalized linear models, this fact was first pointed out in (Brillinger, 2012). Although the global optima of the two objectives are aligned, classical optimization methods - which perform updates based on local information - are generally blind to such global properties of the objective function. This is unfortunate since Gradient Descent converges linearly in the quadratic least-squares setting but only sublinearly on general Learning Halfspace problems (Zhang et al., 2015).

To accelerate the convergence of Gradient Descent, Erdogdu et al. (2016) thus proposed a two-step global optimization procedure for solving generalized linear models, which first involves finding the optimal direction by optimizing a least squares regression as a surrogate objective and secondly searching for a proper scaling factor of that minimizer. Here, we show that running Gd in coordinates reparameterized as in Eq. (9) makes this two-step procedure redundant. More specifically, splitting the optimization problem into searching for the optimal direction and scaling separately, allows even local optimization methods to exploit the property of global minima alignment. Thus - without having to solve a least squares problem in the first place - the directional updates on the Learning Halfspace problem can mimic the least squares dynamics and thereby inherit the linear convergence rate. Combined with a fast (one dimensional) search for the optimal scaling in each step the overall convergence stays linear.

As an illustration, Figure 1 shows the level sets as well as the optimal direction of a least squares-, a logistic- and a sigmoidal regression problem on the same Gaussian dataset. Furthermore, it shows iterates of Gd in original coordinates and a sequential version of Gd in normalized coordinates that first optimizes the direction and then the scaling of its parameters (Gdnp). Both methods start at the same point and run with an infinitesimally small stepsize. It can be seen that, while Gd takes completely different paths towards the optimal points of each problem instance, the dynamics of Gdnp are exactly the same until the optimal direction444Depicted by the dotted red line. Note that – as a result of Lemma 1 – this line is identical in all problems. is found and differ only in the final scaling.

### 4.2 Local optimization in normalized parameterization

Let be the -reparameterized objective with as defined in Eq. (9). We here consider optimizing this objective using Gradient Descent in Normalized Parameters (Gdnp). In each iteration, Gdnp performs a gradient step to optimize with respect to the direction

and does a one-dimensional bisection search to estimate the scaling factor

(see Algorithm 1). It is therefore only a slight modification of performing Batch Normalization plus Gradient Descent: Instead of taking a gradient step on both and , we search for the (locally)-optimal scaling at each iteration. This modification is cheap and it simplifies the theoretical analysis but it can also be substituted easily in practice by performing multiple Gd steps on the scaling factor, which we do for the experiments in Section 4.4.

### 4.3 Convergence result

We now show that Algorithm 1 can achieve a linear convergence rate to a critical point on the possibly non-convex objective with Gaussian inputs. Note that all information for computing the adaptive stepsize is readily available and can be computed efficiently.

###### Theorem 2.

[Convergence rate of Gdnp on learning halfspaces] Suppose Assumptions 14 hold. Let be the output of Gdnp on with the following choice of stepsizes

 st:=s(wt,gt)=−∥wt∥3SLgth(wt,gt) (18)

for , where

 h(wt,gt):= Ez[φ′(z⊤~wt)](u⊤wt) (19) −Ez[φ′′(z⊤~wt)](u⊤wt)2

is a stopping criterion. If initialized such that (see Eq. (13)), then is an approximate critical point of in the sense that

 ∥∇~wf(~wTd)∥2≤ (1−μ/L)2TdΦ2(ρ(w0)−ρ∗) +2−Tsζ|b(0)t−a(0)t|/μ2. (20)

To complete the picture, Table 1 compares this result to the order complexity for reaching an -optimal critical point (i.e.) of Gradient Descent and Accelerated Gradient Descent (Agd) in original coordinates. Although we observe that the accelerated rate achieved by Gdnp is significantly better than the best known rate of Agd for non-convex objective functions, we need to point out that the rate for Gdnp relies on the assumption of a Gaussian data distribution. Yet, Section 4.4 includes promising experimental results in a more practical setting with non-Gaussian data.

Finally, we note that the proof of this result relies specifically on the -reparametrization done by Batch Normalization. In Appendix B.3.6 we detail out why our proof strategy is not suitable for the -reparametrization of Weight Normalization and thus leave it as an interesting open question if other settings (or proof strategies) can be found where linear rates for Wn are provable.

### 4.4 Experiments I

##### Setting

In order to substantiate the above analysis we compare the convergence behavior of Gd and Agd to three versions of Gradient Descent in normalized coordinates. Namely, we benchmark (i) Gdnp (Algorithm 1) with multiple gradient steps on instead of Bisection, (ii) a simpler version (Bn) which updates and with just one fixed step-size gradient step555Thus Bn is conceptually very close to the classical Batch Norm Gradient Descent presented in (Ioffe and Szegedy, 2015) and (iii) Weight Normalization (Wn) as presented in (Salimans and Kingma, 2016). All methods use full batch sizes and – except for Gdnp on – each method is run with a problem specific, constant stepsize.

We consider empirical risk minimization as a surrogate for (17) on the common real-world dataset a9a as well as on synthetic data drawn from a multivariate Gaussian distribution. We center the datasets and use two different functions . First, we choose the softplus which resembles the classical logistic regression (convex). Secondly, we use the sigmoid which is a commonly used (non-convex) continuous approximation of the 0-1 loss (Zhang et al., 2015). Further details can be found in Appendix D.

##### Results

The Gaussian design experiments clearly confirm Theorem 2 in the sense that the loss in the convex-, as well as the gradient norm in the non-convex case decrease at a linear rate. The results on a9a show that Gdnp can accelerate optimization even when the normality assumption does not hold and in a setting where no covariate shift is present, which motivates future research of normalization techniques in optimization. Interestingly, the performance of simple Bn and Wn is similar to that of Gd, which suggests that the length-direction decoupling on its own does not capture the entire potential of these methods. Gdnp on the other hand takes full advantage of the parameter splitting, both in terms of multiple steps on and – more importantly – adaptive stepsizes in .

## 5 Neural Networks

We now turn our focus to optimizing the weights of a multilayer perceptron (MLP) with one hidden layer and

hidden units that maps an input to a scalar output in the following way

 Fx(~W,Θ)=m∑i=1θiφ(x⊤~w(i)).

The and terms are the input and output weights of unit and is a so-called activation function. We assume to be a , which is a common choice in many neural network architectures used in practice. Given a loss function , the optimal input- and output weights are determined by minimizing the following optimization problem

 min~W,Θ(f{\sc nn}(~W,Θ):=Ey,x[ℓ(−yFx(~W,Θ))]), (21)

where

 ~W:={~w(1),…,~w(m)},Θ:={θ(1),…,θ(m)}.

In the following, we analyze the optimization of the input weights with frozen output weights and thus write hereafter for simplicity. As previously assumed for the case of learning halfspaces, our analysis relies on Assumption 2. This approach is rather common in the analysis of neural networks, e.g.  Brutzkus and Globerson (2017) showed that for a special type of one-hidden layer network with isotropic Gaussian inputs, all local minimizers are global. Under the same assumption, similar results for different classes of neural networks were derived in (Li and Yuan, 2017; Soltanolkotabi et al., 2017; Du et al., 2017).

### 5.1 Global characterization of the objective

We here show that, under the same assumption, the landscape of exhibits a global property similar to one of the Learning Halfspaces objective (see Section 4.1). In fact, the critical points of all neurons in a hidden layer align along one and the same single line in and this direction depends only on incoming information into the hidden layer.

###### Lemma 2.

Suppose Assumptions 1 and 2 hold and let be a critical point of with respect to hidden unit and for a fixed . Then, there exits a scalar such that

 ^w(i)=^c(i)S−1u,∀i=1,…,m. (22)

See Appendix C.2 for a discussion of possible implications for deep neural networks.

### 5.2 Convergence result

We again consider normalizing the weights of each unit by means of its input covariance matrix , i.e. and present a simple algorithmic manner to leverage the input-output decoupling of Lemma 2. Contrary to Bn, which optimizes all the weights simultaneously, our approach is based on alternating optimization over the different hidden units. We thus adapt Gdnp to the problem of optimizing the units of a neural network in Algorithm 2 but note that this modification is mainly to simplify the theoretical analysis.

Optimizing each unit independently formally results in minimizing the function as defined in Algorithm 2. In the next theorem, we prove that this version of Gdnp achieves a linear rate of convergence to optimize each .

###### Theorem 3.

[Convergence of Gdnp on MLP] Suppose Assumptions 14 hold. We consider optimizing the weights of unit , assuming that all directions are critical points of and for . Then, Gdnp with step-size policy as in (98) and stopping criterion as in  (99) yields a linear convergence rate on in the sense that

 ∥∇~w(i)f(~w(i)t)∥2S−1≤ (1−μ/L)2tC(ρ(w0)−ρ∗) (23) +2−T(i)sζ|b(0)t−a(0)t|/μ2,

where the constant is defined in Eq. (102).

The result of Theorem 3 relies on the fact that each is either zero or has zero gradient. If we assume that an exact critical point is reached after optimizing each individual unit, then the result directly implies that the alternating minimization presented in Algorithm 2 reaches a critical point of the overall objective. Since the established convergence rate for each individual unit is linear, this assumption sounds realistic. We leave a more precise convergence analysis, that takes into account that optimizing each individual unit for a finite number of steps may yield numerical suboptimalities, for future work.

### 5.3 Experiments II

##### Setting

In the proof of Theorem 3 we show that Gdnp can leverage the length-direction decoupling in a way that lowers cross-dependencies between hidden layers and yields faster convergence. A central part of the proof is Lemma 2 which says that – given Gaussian inputs – the optimal direction of a given layer is independent of all downstream layers. Since this assumption is rather strong and since Algorithm 2 is intended for analysis purposes only, we test the validity of the above hypothesis outside the Gaussian setting by training a Batch Normalized multilayer feedforward network (Bn) on a real-world image classification task with plain Gradient Descent. For comparison, a second unnormalized network is trained by Gd. To validate Lemma 2 we measure the interdependency between the central and all other hidden layers in terms of the Frobenius norm of their second partial cross derivatives (in the directional component). Further details can be found in Appendix D.

##### Results

Figure 3 confirms that the directional gradients of the central layer are affected far more by the upstream than by the downstream layers to a surprisingly large extent. Interestingly, this holds even before reaching a critical point. The downstream cross-dependencies are generally decaying for the Batch Normalized network (Bn) (especially in the first 1000 iterations where most progress is made) while they remain elevated in the un-normalized network (Gd), which suggest that using Batch Normalization layers indeed simplifies the networks curvature structure in such that the length-direction decoupling allows Gradient Descent to exploit simpler trajectories in these normalized coordinates for faster convergence.

Of course, we cannot untangle this effect fully from other possible positive aspects of training with Bn (see introduction). Yet, the fact that the (de-)coupling increases in the distance to the middle layer (note how earlier (later) layers are more (less) important for ) emphasizes the relevance of this analysis particularly for deep neural network structures, where downstream dependencies might vanish completely with depth. This does not only make gradient based training easier but also suggests the possibility of using partial second order information, such as diagonal Hessian approximations (e.g. proposed in (Martens et al., 2012)).

## 6 Conclusion

We took a theoretical approach to study the acceleration provided by Batch Normalization. In a somewhat simplified setting, we have shown that the reparametrization performed by Batch Normalization leads to a provable acceleration of gradient-based optimization by splitting it into subtasks that are easier to solve. In order to evaluate the impact of the assumptions required for our analysis, we also performed experiments on real-world datasets that agree with the results of the theoretical analysis to a surprisingly large extent.

We consider this work as a first step for two particular directions of future research. First, it raises the question of how to optimally train Batch Normalized neural networks. Particularly, our results suggest that different and adaptive stepsize schemes for the two parameters - length and direction - can lead to significant accelerations. Second, the analysis of Section 3 and 4 reveals that a better understanding of non-linear coordinate transformations is a promising direction for the continuous optimization community.

## Appendix A Least Squares Analysis

### a.1 Restated result

Recall that, after normalizing according to (9) and using the closed form solution for the optimal scaling factor , optimizing the ordinary least squares objective can be written as the following minimization problem

 minw∈Rd(ρ(w):=−w⊤uu⊤ww⊤Sw). (13 revisited)

We consider optimizing the above objective by Gd which takes iterative steps of the form

 wt+1=wt−ηt∇ρ(wt), (24)

where

 ∇ρ(wt)=−2(Bwt+ρ(wt)Swt)/w⊤tSwt. (25)

Furthermore, we choose

 ηt=w⊤tSwt2L|ρ(wt)|. (26)

Our analysis relies on the (weak) data distribution assumption stated in A1. It is noteworthy that holds under this assumption. In the next theorem, we establish a convergence rate of Gd in terms of function value as well as gradient norm.

See 1

The proof of this result crucially relies on the insight that the minimization problem given in (12

) resembles the problem of maximizing the generalized Rayleigh quotient which is commonly encountered in generalized eigenproblems. We will thus first review this area, where convergence rates are usually provided in terms of the angle of the current iterate with the maximizer, which is the principal eigenvector. Interestingly, this angle can be related to both, the current function value as well as the the norm of the current gradient. We will make use of these connections to prove the above Theorem in Section

A.5. Although not necessarily needed for convex function, we introduce the gradient norm relation as we will later go on to prove a similar result for possibly non-convex functions in the learning halfspace setting (Theorem 2).

### a.2 Background on eigenvalue problems

#### a.2.1 Rayleigh quotient

Optimizing the Rayleigh quotient is a classical non-convex optimization problem that is often encountered in eigenvector problems. Let be a symmetric matrix, then is the principal eigenvector of if it maximizes

 q(w)=w⊤Bww⊤w (27)

and is called the Rayleigh quotient. Notably, this quotient satisfies the so-called Rayleigh inequality

 λmin(B)≤q(w)≤λ1(B),∀w∈R∖{0},

where and are the smallest and largest eigenvalue of respectively.

Maximizing is a non-convex (strict-saddle) optimization problem, where the -th critical point constitutes the -th eigenvector with corresponding eigenvalue (see (Absil et al., 2009), Section 4.6.2 for details). It is known that optimizing with Gd - using an iteration-dependent stepsize - converges linearly to the principal eigenvector . The convergence analysis is based on the ”minidimensional” method and yields the following result

 λ1−q(wt)cos2∠(wt,v1)≤(1−λ1−λ2λ1−λmin)2tλ1−q(w0)cos2∠(w0,v1) (28)

under weak assumptions on . Details as well as the proof of this result can be found in (Knyazev and Shorokhodov, 1991).

#### a.2.2 Generalized rayleigh quotient

The reparametrized least squares objective (13), however, is not exactly equivalent to (27) because of the covariance matrix that appears in the denominator. As a matter of fact, our objective is a special instance of the generalized Rayleigh quotient

 ~ρ(w)=w⊤Bww⊤Aw, (29)

where is defined as above and is a symmetric positive definite matrix.

Maximizing (29) is a generalized eigenproblem in the sense that it solves the task of finding eigenvalues of the matrix pencil for which , i.e. finding a vector that obeys . Again we have

 λmin(B,A)≤~ρ(w)≤λ1(B,A),∀w∈R∖{0}.

Among the rich literature on solving generalized symmetric eigenproblems, a Gd convergence rate similar to (28) has been established in Theorem 6 of (Knyazev and Neymeyr, 2003), which yields

 λ1−ρ(wt+1)ρ(wt+1)−λ2≤(1−λ1−λ2λ1−λmin)2tλ1−ρ(wt)ρ(wt)−λ2,

again under weak assumptions on .

#### a.2.3 Our contribution

Since our setting in Eq. (13) is a minimization task, we note that for our specific choice of and we have and we recall the general result that then

 minρ(w)=−max(~ρ(w)).

More importantly, we here have a special case where the nominator of has a particular low rank structure. In fact, is a rank one matrix. Instead of directly invoking the convergence rate in (Knyazev and Neymeyr, 2003), this allows for a much simpler analysis of the convergence rate of Gd on since the rank one property yields a simpler representation of the relevant vectors. Furthermore, we establish a connection between suboptimality on function value and the -norm of the gradient. As mentioned earlier, we need such a guarantee in our future analysis on learning halfspaces which is an instance of a (possibly) non-convex optimization problem.

### a.3 Preliminaries

Notations Let be a symmetric positive definite matrix. We introduce the following compact notations that will be used throughout the analysis.

• -inner product of : .

• -norm of : .

• -angle between two vectors :

 ∠A(w,v):=arccos(⟨w,v⟩A∥w∥A∥v∥A) (30)
• -orthogonal projection of to span for :

 ^w∈span{v}with ⟨w−^w,v⟩A=0 (31)
• -spectral norm of a matrix :

 ∥C∥A:=maxw∈Rd/{0}∥Cw∥A∥w∥A (32)

These notations allow us to make the analysis similar to the simple Rayleigh quotient case. For example, the denominator in (29) can now be written as .

Properties We will use the following elementary properties of the induced terms defined above.

• If is the -orthogonal projection of to span, then it holds that

 cos2∠A(w,v)=∥^w∥2A∥w∥2A,sin∠Awv=∥w−^w∥A∥w∥A (33)
• The -spectral norm of a matrix can be written in the alternative form

 ∥C∥A =maxw∈Rd∖{0}∥A1/2Cw∥2∥A1/2w∥2 (34) =maxw∈Rd∖{0}∥(A1/2CA−1/2)A1/2w∥2∥A1/2w∥2 (35) =∥A1/2CA−1/2∥2 (36)
• Let be a square matrix and , then the following result holds due to the definition of -spectral norm

 ∥Bw∥A≤∥B∥A∥w∥A (37)

### a.4 Characterization of the LS minimizer

By setting the gradient of (11) to zero and recalling the convexity of we immediately see that the minimizer of this objective is

 ~w∗:=−S−1u. (38)

Indeed, one can easily verify that is also an eigenvector of the matrix pair since

 B~w∗=uu⊤(−S−1u)=−∥u∥2S−1u (39) = (∥u∥2S−1)S(−S−1u)=λ1S~w∗

where is the corresponding generalized eigenvalue. The associated eigenvector with is

 v1:=~w∗/∥u∥S−1. (40)

Thereby we extend the normalized eigenvector to an -orthogonal basis of such that

 ⟨vi,vj⟩S={0,i≠j1,i=j (41)

holds for all . Let be the matrix whose -th column is . The matrix is orthogonal to the matrix since

 BV2=uu⊤V2=uu⊤S−1SV2 (42) ∥S−1u∥Su⎛⎜⎝−v⊤1SV20⎞⎟⎠,

which is a zero matrix due to the

-orthogonality of the basis (see Eq. (41)). As a result, the columns of are eigenvectors associated with a zero eigenvalue. Since and form an -orthonormal basis of no further eigenvalues exist. We can conclude that any vector is a minimizer of the reparametrized ordinary least squares problem as presented in (13) and the minimum value of relates to the eigenvalue as

 λ1=−minwρ(w).

Spectral representation of suboptimality Our convergence analysis is based on the angle between the current iterate and the leading eigenvector , for which we recall property (P.1). We can express in the -orthogonal basis that we defined above:

 w=α1v1+V2α2,α1∈R,α2∈Rd−1 (43)

and since is the -orthogonal projection of to span, the result of (P.2) implies

 cos2∠S(w,v1)=∥α1v1∥2S∥w∥2S=α21∥w∥2S. (44)

Clearly this metric is zero for the optimal solution and else bounded by one from above. To justify it is a proper choice, the next proposition proves that suboptimality on , i.e. , relates directly to this angle.

###### Proposition 1.

The suboptimality of on relates to as

 ρ(w)−ρ(v1)=λ1sin2∠S(w,v1), (45)

where . This is equivalent to

 ρ(w)=−λ1cos2∠S(w,v1). (46)
###### Proof.

We use the proposed eigenexpansion of Eq. (43) to rewrite

 Bw=(α1Bv1+BV2α2)α1Bv1α1λ1Sv1 (47)

and replace the above result into . Then

 ρ(w) =−w⊤Bww⊤Sw−α1λ1(α1v1+V2α2)⊤Sv1∥w∥2S −λ1α21∥w∥2S−λ1cos2∠S(w,v1),

which proves the second part of the proposition. The first follows directly from property (P.1). ∎

Gradient-suboptimality connection Fermat’s first-order optimality condition implies that the gradient is zero at the minimizer of . Considering the structure of , we propose a precise connection between the norm of gradient and suboptimality. Our analysis relies on the representation of the gradient in the -orthonormal basis which is described in the next proposition.

###### Proposition 2.

Using the -orthogonal basis as given in Eq. (41), the gradient vector can be expanded as

 ∥w∥2S∇ρ(w)/2= −λ1α1sin2∠S(w,v1)Sv1 (48) +λ1cos2∠S(w,v1)SV2α2
###### Proof.

The above derivation is based on two results: (i) is an eigenvector of and (ii) the representation of in Proposition 1. We recall the definition of in (25) and write

 ∇ρ(w)∥w∥2S/2 =−ρ(w)Sw−Bw (49) −ρ(w)S(α1v1+V2α2)+λ1α1Sv1 −(1−cos2∠S(w,v1))λ1α1Sv1 + λ1cos2∠S(w,v1)SV2α2 =−λ1α1sin2∠S(w,v1)Sv1 + λ1cos2∠S(w,v1)SV2α2

Exploiting the gradient representation of the last proposition, the next proposition establishes the connection between suboptimality and the -norm of gradient .

###### Proposition 3.

Suppose that , then the -norm of the gradient relates to the suboptimality as

 ∥w∥2S∥∇ρ(w)∥2S−1/(4|ρ(w)|)=ρ(w)−ρ(v1) (50)
###### Proof.

Multiplying the gradient representation in Proposition 2 by yields

 S−1∇ρ(w)∥w∥2S/2= −λ1α1sin2∠