A Deep Conditioning Treatment of Neural Networks

02/04/2020 ∙ by Naman Agarwal, et al. ∙ Google

We study the role of depth in training randomly initialized overparameterized neural networks. We give the first general result showing that depth improves trainability of neural networks by improving the conditioning of certain kernel matrices of the input data. This result holds for arbitrary non-linear activation functions, and we provide a characterization of the improvement in conditioning as a function of the degree of non-linearity and the depth of the network. We provide versions of the result that hold for training just the top layer of the neural network, as well as for training all layers, via the neural tangent kernel. As applications of these general results, we provide a generalization of the results of Das et al. (2019) showing that learnability of deep random neural networks with arbitrary non-linear activations (under mild assumptions) degrades exponentially with depth. Additionally, we show how benign overfitting can occur in deep neural networks via the results of Bartlett et al. (2019b).

1 Introduction

Deep neural networks have enjoyed tremendous empirical success, and theory is starting to emerge which attempts to explain this success. A sequence of papers has recently shown the benefits of overparametrization via large width for training neural networks: see, for example, (Li and Liang, 2018; Du et al., 2019; Allen-Zhu et al., 2019; Zou and Gu, 2019) and the references therein. These papers show that with sufficiently large width, starting from a random initialization of the network weights, gradient descent provably finds a global minimizer of the loss function on the training set.

While several of the aforementioned papers do analyze deep neural networks, to our knowledge, there is no prior work that provably demonstrates the benefits of depth for training neural networks in general settings. Prevailing wisdom is that while depth enables the neural network to express more complicated functions (see, for example, (Eldan and Shamir, 2016; Telgarsky, 2016; Raghu et al., 2017; Lee et al., 2017; Daniely, 2017a) and the references therein), it hinders efficient training, which is the primary concern in this paper. Indeed, the papers mentioned earlier showing convergence of gradient descent either assume very shallow (one hidden layer) networks, or expend considerable effort to show that depth doesn’t degrade training by more than a polynomial factor. A few exceptions are the papers (Arora et al., 2018b, 2019a) which do show that depth helps in training neural networks, but are restricted to very specific problems with linear activations. See Section A for an in-depth discussion of these and other related works.

In this paper, we provide general results showing how depth improves trainability of neural networks by improving the conditioning of certain kernel matrices of the input data. Recent developments (Jacot et al., 2018; Yang, 2019; Arora et al., 2019c) have shown that training wide, randomly initialized neural networks is effectively a kernel method, and thus a convex optimization problem. It is well-known that the rate of convergence of gradient descent depends crucially on the condition number (or related quantities, such as smoothness or strong convexity) of the function being minimized, and in the case of kernel methods, these quantities are directly related to the eigenvalues of the kernel matrix of the input data.

Our main result is that for a randomly initialized neural network with an arbitrary non-linear activation function, the condition number of the appropriate kernel matrices tends to the best possible value, 1, exponentially fast in the depth of the network. This result holds under very mild conditions on the input data, and a suitable normalization of the activation function. The rate at which the condition number tends to 1 is determined by a coefficient of non-linearity of the activation function, a concept that we define in this paper.

We then apply our main result to show that when training large width neural networks of sufficient depth, gradient descent with square loss drives the training error down at a linear rate, regardless of the initial conditioning of the data. This is in contrast to prior works (Arora et al., 2019c; Allen-Zhu et al., 2018) and demonstrates the optimization benefits of using deeper networks. This result holds for either training just the top layer of the neural network, or all layers of the network with a sufficiently small learning rate (the so-called lazy training regime). In particular, when training just the top layer with the popular ReLU activations, we show that the width of the network only needs to grow logarithmically in the initial conditioning of the input data. These results are established by using our main result to show that the smoothness and the strong convexity of the loss function improve exponentially with depth.

At the core of our work is an analysis of the case where the network has infinite width. We establish conditioning for the infinite-width kernel and its neural tangent counterpart. Building on the analysis for infinite-width networks, the extension to finite width typically follows by applying standard concentration inequalities. Our optimization results then follow from the standard paradigm of choosing a suitably small step size, which allows for little movement of the underlying kernel. The generality of our results also leads to multiple applications beyond optimization.

As an application of our conditioning results, we extend the recent work of Das et al. (2019) on learnability of randomly initialized deep neural networks under the statistical query (Kearns, 1998) framework of learning. More specifically, we show that learning a target function that is a sufficiently deep, randomly initialized neural network with a general class of activations (including sign, ReLU and tanh) requires exponentially (in depth) many queries in the statistical query model of learning. As another application, we extend the work of Bartlett et al. (2019b) on interpolating classifiers and show that randomly initialized and sufficiently deep neural networks can not only fit the training data, but in fact, the minimum norm (in the appropriate RKHS) interpolating solution achieves non-trivial excess risk guarantees in some settings as well.

2 Notation and preliminaries

For two vectors of like dimension, we denote their inner product by $\langle \cdot, \cdot \rangle$. Unless otherwise specified, $\|\cdot\|$ denotes the Euclidean norm for vectors and the spectral norm for matrices. For a symmetric positive definite matrix $M$, the condition number $\kappa(M)$ is defined to be the ratio $\lambda_{\max}(M)/\lambda_{\min}(M)$, where $\lambda_{\max}(M)$ and $\lambda_{\min}(M)$ are the largest and smallest eigenvalues of $M$, respectively. For a positive integer $n$, define $[n] := \{1, 2, \ldots, n\}$.

We are given a training set of $n$ examples $\{(x_i, y_i)\}_{i=1}^n \subseteq \mathbb{R}^d \times \mathcal{Y}$, where $\mathcal{Y}$ is the output space. We assume, as is standard in related literature, that for all $i \in [n]$ we have $\|x_i\| = 1$. Let $G$ be the Gram matrix of the training data, i.e. $G_{ij} = \langle x_i, x_j \rangle$. We make one of the following two assumptions on the input data:

Assumption 1

For all $i, j \in [n]$ with $i \neq j$, we have $|\langle x_i, x_j \rangle| \le 1 - \delta$ for some $\delta \in (0, 1]$.

Assumption 2

The Gram matrix $G$ is non-singular, i.e. $\lambda_{\min}(G) > 0$.

ass:separation is a standard non-degeneracy assumption made in the literature. ass:non-singularity is a stronger assumption than ass:separation but still quite benign. In particular we show in thm:one-layer-conditioning-sgn (in app:relu_cond) that for ReLU activations, the representations derived after passing a dataset satisfying ass:separation through one layer, satisfy ass:non-singularity. This statement can also be made for more general activations, see thm:general-one-layer-conditioning.
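The following short numerical sketch (ours, not from the paper; the helper name check_assumptions is hypothetical) illustrates the two assumptions on a random unit-norm dataset: ass:separation asks that no two distinct inputs have inner product too close to $\pm 1$, and ass:non-singularity asks that the Gram matrix be non-singular.

```python
import numpy as np

def check_assumptions(X, tol=1e-12):
    """X: (n, d) array of unit-norm rows. Returns (delta, lambda_min, cond).

    Illustrative only: Assumption 1 holds with separation delta if
    |<x_i, x_j>| <= 1 - delta for all i != j; Assumption 2 holds if the
    Gram matrix G = X X^T has a strictly positive smallest eigenvalue.
    """
    G = X @ X.T                                   # Gram matrix of the inputs
    off_diag = G - np.diag(np.diag(G))
    delta = 1.0 - np.abs(off_diag).max()          # separation parameter
    eigvals = np.linalg.eigvalsh(G)
    lam_min, lam_max = eigvals[0], eigvals[-1]
    cond = lam_max / lam_min if lam_min > tol else np.inf
    return delta, lam_min, cond

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 20, 50
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # place the inputs on the unit sphere
    delta, lam_min, cond = check_assumptions(X)
    print(f"separation delta ~ {delta:.3f}, lambda_min(G) ~ {lam_min:.3f}, "
          f"condition number ~ {cond:.1f}")
```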

To keep the presentation as clean as possible, we assume a very simple architecture for the neural network (extending our analysis to layers of different sizes and outputs of length greater than 1 poses no mathematical difficulty and is omitted for the sake of clarity of notation): it has $L$ hidden fully-connected layers, each of width $m$, takes as input $x \in \mathbb{R}^d$ and outputs a single real number, with the activation function $\sigma$ applied entry-wise. Note that we use the so-called neural tangent kernel parameterization (Jacot et al., 2018) instead of the standard parameterization. The network can thus be defined as the following function:

$$f(x; w) \;=\; \tfrac{1}{\sqrt{m}}\, w_{L+1}^\top\, \sigma\!\Big(\tfrac{1}{\sqrt{m}} W_L\, \sigma\big(\cdots\, \sigma\big(\tfrac{1}{\sqrt{d}} W_1 x\big)\cdots\big)\Big),$$

where $W_1 \in \mathbb{R}^{m \times d}$ and $W_2, \ldots, W_L \in \mathbb{R}^{m \times m}$ denote the weight matrices for the hidden layers, $w_{L+1} \in \mathbb{R}^m$ denotes the weight vector of the output layer, and $w$ denotes a vector obtained by concatenating vectorizations of the weight matrices. We use the notation $N(\mu, \Sigma)$ for the normal distribution with mean $\mu$ and covariance $\Sigma$. All weights are initialized to independent, standard normal variables (i.e. drawn i.i.d. from $N(0, 1)$).

We assume that the activation function $\sigma$ is normalized (via translation and scaling) to satisfy the following conditions:

$$\mathbb{E}_{X \sim N(0,1)}[\sigma(X)] = 0 \qquad \text{and} \qquad \mathbb{E}_{X \sim N(0,1)}[\sigma(X)^2] = 1. \tag{1}$$

The first condition is somewhat non-standard and is crucial to our conditioning analysis. In sec:discussion we discuss how the commonly used BatchNorm operation makes it possible for us to assume this condition without loss of generality. Throughout the paper, statements of the type “If [a quantity] is large enough, then [consequence]” should be taken to mean that there exist universal constants such that if the quantity exceeds the corresponding constant, then [consequence] follows. We use big-O and big-Omega notation in a similar manner. Similarly, statements of the type “If [a quantity] is polynomially large, then [consequence]” should be taken to mean that there exists a polynomial of bounded degree in the relevant arguments such that if the quantity equals that polynomial, then [consequence] follows.
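As a concrete illustration of the normalization in eq:normalization, the sketch below (ours, not the paper's code; the helper name normalize_activation is hypothetical) numerically translates and scales an arbitrary activation, here the ReLU, so that its output under a standard normal input has mean 0 and second moment 1.

```python
import numpy as np

def normalize_activation(sigma, n_samples=2_000_000, seed=0):
    """Return a translated and scaled copy of `sigma` with E[s(X)] ~ 0 and
    E[s(X)^2] ~ 1 for X ~ N(0, 1), with the constants estimated by Monte Carlo."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)
    y = sigma(x)
    mean = y.mean()                              # translate to center the activation
    scale = np.sqrt(np.mean((y - mean) ** 2))    # scale to unit second moment
    return lambda z: (sigma(z) - mean) / scale

if __name__ == "__main__":
    relu = lambda z: np.maximum(z, 0.0)
    nrelu = normalize_activation(relu)
    x = np.random.default_rng(1).standard_normal(1_000_000)
    print("E[sigma(X)]   ~", nrelu(x).mean())          # close to 0
    print("E[sigma(X)^2] ~", np.mean(nrelu(x) ** 2))   # close to 1
```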

3 Main results on conditioning of kernel matrices

3.1 Top layer kernel matrix.

The first kernel matrix we study is the one defined by the (random) feature mapping generated at the top layer by the lower layer weights (note that this feature mapping does not depend on the top-layer component of $w$; this notation is chosen for simplicity). The feature mapping defines a kernel function, and the associated kernel matrix on a training set, as follows:

The main results on conditioning in this paper are cleanest to express in the limit of infinite width neural networks, i.e. $m \to \infty$. In this limit, the kernel function and the kernel matrix tend almost surely to deterministic limits (Daniely et al., 2016). We study the conditioning of this limiting kernel matrix next. The rate at which its condition number improves with depth depends on the following notion of degree of non-linearity of the activation function $\sigma$: the coefficient of non-linearity of $\sigma$ is defined to be $\mu := 1 - a_1^2$, where $a_1 := \mathbb{E}_{X \sim N(0,1)}[X \sigma(X)]$. The normalization eq:normalization of the activation function implies via lem:mu-bounds (in app:conditioning, where all missing proofs of results in this section can be found) that for any non-linear activation function $\sigma$, we have $0 < \mu \le 1$. To state our main result, it is convenient to define the following quantities for the separation parameter $\delta$ and a positive integer depth:

We are now ready to state our main result on conditioning of the kernel matrix: The following bounds hold:

  1. Under ass:separation, we have for all with .

  2. Under ass:non-singularity, we have .

The following corollary is immediate, showing that the condition number of the kernel matrix approaches the smallest possible value, 1, exponentially fast as depth increases. The following bounds on the condition number hold:

  1. Under ass:separation, if , then .

  2. Under ass:non-singularity, we have .
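The following numerical sketch (our illustration, not the paper's code; the Monte Carlo estimator and helper names such as dual and deep_kernel_matrix are ours) estimates the dual activation of the normalized ReLU by sampling correlated Gaussians, applies it entrywise and repeatedly to the Gram matrix of a random dataset, and reports the condition number of the resulting infinite-width top-layer kernel matrix, which approaches 1 as the depth grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Normalized ReLU: centered and scaled so E[s(X)] = 0 and E[s(X)^2] = 1 for X ~ N(0,1).
_mean = 1.0 / np.sqrt(2 * np.pi)                 # E[max(X, 0)]
_std = np.sqrt(0.5 - 1.0 / (2 * np.pi))          # std of max(X, 0)
sigma = lambda z: (np.maximum(z, 0.0) - _mean) / _std

def dual(rho, n=200_000):
    """Monte Carlo estimate of the dual activation: E[sigma(X) sigma(Y)] for
    (X, Y) jointly Gaussian with unit variances and correlation rho."""
    x = rng.standard_normal(n)
    y = rho * x + np.sqrt(max(1.0 - rho ** 2, 0.0)) * rng.standard_normal(n)
    return float(np.mean(sigma(x) * sigma(y)))

def deep_kernel_matrix(G, depth):
    """Apply the dual activation entrywise `depth` times to the Gram matrix G,
    approximating the infinite-width top-layer kernel matrix at that depth."""
    K = G.copy()
    for _ in range(depth):
        K = np.vectorize(lambda r: dual(float(np.clip(r, -1.0, 1.0))))(K)
    return K

if __name__ == "__main__":
    n, d = 8, 10
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    G = X @ X.T
    for L in [0, 2, 4, 8]:
        ev = np.linalg.eigvalsh(deep_kernel_matrix(G, L))
        print(f"depth {L}: condition number ~ {ev[-1] / ev[0]:.3f}")
```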

3.1.1 Proof of thm:main-infinite-width-top-layer.

For the conditioning analysis, we need a key concept from Daniely et al. (2016), viz. the notion of the dual activation for the activation $\sigma$: for $\rho \in [-1, 1]$, define the matrix $\Sigma_\rho := \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$. Define the conjugate (dual) activation function as $\hat{\sigma}(\rho) := \mathbb{E}_{(X, Y) \sim N(0, \Sigma_\rho)}[\sigma(X)\sigma(Y)]$.

The random initialization of the neural network induces a feature representation of the input vectors at every depth of the neural network, and this feature representation naturally yields a kernel function at every depth. The results of Daniely et al. (2016) imply that $\hat{\sigma}$ describes the behavior of these kernels through the layers: suppose $\langle x, x' \rangle = \rho$. Then for any depth $\ell$, the depth-$\ell$ kernel evaluated at $(x, x')$ tends to $\hat{\sigma}^{(\ell)}(\rho)$ as $m \to \infty$, where $\hat{\sigma}^{(\ell)}$ denotes the $\ell$-fold composition of $\hat{\sigma}$ with itself.

By analyzing the Hermite expansion of $\hat{\sigma}$ (see app:conditioning for details), we have the following key lemma which shows that pairwise correlations decay to 0 rapidly as we move up the layers: [Correlation decay lemma] Suppose $|\rho| \le 1 - \delta$ for some $\delta \in (0, 1]$. Then

The proof of this lemma relies crucially on the normalization eq:normalization of the activation $\sigma$, and its non-linearity. The normalization implies that each application of the dual activation decreases pairwise inner products of the input feature representations, at a rate governed by the coefficient of non-linearity. Using this fact repeatedly leads to the stated bound.

The final technical ingredient we need is the following linear-algebraic lemma, which gives a lower bound on the smallest eigenvalue of a matrix obtained by the application of a given function to all entries of another positive definite matrix: [Eigenvalue lower bound lemma] Let $f$ be an arbitrary function whose power series converges everywhere in $[-1, 1]$ and has non-negative coefficients. Let $B$ be a positive definite matrix with $\lambda_{\min}(B) \ge \lambda$ for some $\lambda \ge 0$, and all diagonal entries equal to 1. Let $f[B]$ be the matrix obtained by entrywise application of $f$ to $B$. Then we have

We can now prove thm:main-infinite-width-top-layer: [thm:main-infinite-width-top-layer] Part 1 follows directly from lem:recursive.

As for part 2, ass:non-singularity implies that $\lambda_{\min}(G) > 0$. Since the iterated dual activation defines a kernel on the unit sphere, by Schoenberg's theorem (Schoenberg, 1942), its power series expansion has only non-negative coefficients, so lem:general-eigenvalue-lb applies to it, and we have

using lem:recursive and the fact that .
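To build intuition for the eigenvalue lower bound lemma and the role of Schoenberg's theorem, the sketch below (ours; it does not reproduce the lemma's exact bound) applies a function with non-negative power-series coefficients and no constant term entrywise to a positive definite matrix with unit diagonal, and checks numerically that the result stays positive definite, with condition number approaching 1 as the off-diagonal entries shrink (as they do with depth).

```python
import numpy as np

rng = np.random.default_rng(0)

# A positive definite Gram-type matrix with unit diagonal.
n, d = 10, 30
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
B = X @ X.T

# f(t) = (exp(t) - 1) / (e - 1): a power series with non-negative coefficients,
# f(0) = 0 and f(1) = 1, loosely mimicking an iterated dual activation. By the
# Schur product theorem, applying such an f entrywise to a PSD matrix with unit
# diagonal yields another PSD matrix with unit diagonal.
f = lambda t: np.expm1(t) / np.expm1(1.0)

for shrink in [1.0, 0.5, 0.1, 0.01]:
    # Shrink the off-diagonal entries, mimicking the effect of depth on the
    # pairwise correlations, then apply f entrywise.
    C = np.eye(n) + shrink * (B - np.eye(n))
    ev = np.linalg.eigvalsh(f(C))
    print(f"max |off-diag| = {shrink * np.abs(B - np.eye(n)).max():.3f}: "
          f"lambda_min(f[C]) ~ {ev[0]:.3f}, condition number ~ {ev[-1] / ev[0]:.2f}")
```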

3.2 Neural tangent kernel matrix.

The second kernel matrix we study arises from the neural tangent kernel, which was introduced by Jacot et al. (2018). This kernel matrix naturally arises when all the layers of the neural network are trained via gradient descent. For a given set of network weights $w$, the neural tangent kernel matrix is defined as

As in the previous section, as the width of the hidden layers tends to infinity, the random tends to a deterministic limit, . For this infinite width limit, we have the following theorem analogous to part 1 of thm:main-infinite-width-top-layer: The diagonal entries of are all equal. Furthermore, the following bounds hold if :

  1. Under ass:separation, we have for all with .

  2. Under ass:non-singularity, we have .

The following corollary, analogous to cor:main-top-layer-cn, is immediate: The following bounds on the condition number hold:

  1. Under ass:separation, if , then .

  2. Under ass:non-singularity, if , then .

3.2.1 Proof of thm:main-infinite-width-ntk

We use the following formula for the NTK given by Arora et al. (2019c): defining and to be the derivative of , we have

(2)

We need the following bound in our analysis which follows via the Hermite expansion of : For any , we have .

We can now prove thm:main-infinite-width-ntk: [(thm:main-infinite-width-ntk)] First, we show that all diagonal values of are equal. For every , we have , and since for any , we have from eq:ntk-formula,

which is a fixed constant.

To prove part 1, let . It is easy to show (say, via the Hermite expansion of ) that . Thus, we have

where the penultimate inequality follows from lem:dualact-dot-ratio and the final one from lem:recursive. We now show that since , for any , we have

which gives the bound of part 1. We do this in two cases: if , then , which gives the required bound since all terms in the product are at most . Otherwise, if , then there are at least values of in which are larger than , and for these values of , we have , so . The product of these terms is therefore at most , which gives the required bound in this case.

To prove part 2, define as . Equation eq:ntk-formula shows that this defines a kernel on the unit sphere, and so by Schoenberg’s theorem (Schoenberg, 1942), its power series expansion has only non-negative coefficients. Thus, applying lem:general-eigenvalue-lb to , we conclude that

using the calculations in part 1. Since , the bound of part 2 follows.
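As a numerical companion to this section, the sketch below (ours; it follows the standard infinite-width NTK recursion from the literature, and its indexing conventions and helper names such as ntk_entry may differ from the paper's eq. (2)) estimates the infinite-width NTK matrix for the normalized ReLU via Monte Carlo estimates of the dual activations, and reports its condition number as the depth grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Normalized ReLU and its derivative.
_mean = 1.0 / np.sqrt(2 * np.pi)
_std = np.sqrt(0.5 - 1.0 / (2 * np.pi))
sigma = lambda z: (np.maximum(z, 0.0) - _mean) / _std
dsigma = lambda z: (z > 0).astype(float) / _std

def dual(g, rho, n=200_000):
    """Monte Carlo estimate of E[g(X) g(Y)] for standard normals with correlation rho."""
    x = rng.standard_normal(n)
    y = rho * x + np.sqrt(max(1.0 - rho ** 2, 0.0)) * rng.standard_normal(n)
    return float(np.mean(g(x) * g(y)))

def ntk_entry(rho0, L):
    """Infinite-width NTK between two unit-norm inputs with inner product rho0
    for a network with L hidden layers (standard recursion; conventions may
    differ slightly from the paper's eq. (2))."""
    rhos = [float(rho0)]                        # layer-wise kernel values
    for _ in range(L):
        rhos.append(np.clip(dual(sigma, float(np.clip(rhos[-1], -1, 1))), -1, 1))
    # Derivative kernels for the hidden layers, plus 1 for the linear output layer.
    dots = [dual(dsigma, float(np.clip(r, -1, 1))) for r in rhos[:-1]] + [1.0]
    theta, suffix = 0.0, 1.0
    for h in range(L, -1, -1):
        # Accumulate: kernel after h layers times the product of the derivative
        # kernels of all layers above it.
        suffix *= dots[h]
        theta += rhos[h] * suffix
    return theta

if __name__ == "__main__":
    n, d = 8, 10
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    G = X @ X.T
    for L in [1, 4, 8]:
        T = np.array([[ntk_entry(1.0 if i == j else G[i, j], L) for j in range(n)]
                      for i in range(n)])
        T = (T + T.T) / 2                       # symmetrize away Monte Carlo noise
        ev = np.linalg.eigvalsh(T)
        print(f"L = {L}: equal diagonal ~ {T[0, 0]:.2f}, condition number ~ {ev[-1] / ev[0]:.2f}")
```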

4 Implications for optimization

Suppose we train the network using gradient descent on a loss function , which defines the empirical loss function

For the rest of this section we will assume that the loss function is the square loss, i.e. . The results presented can appropriately be extended to the setting where the loss function is smooth and strongly convex. Training a finite-width neural network necessitates the study of the conditioning of the finite-width kernel matrices and , rather than their infinite-width counterparts. In such settings optimization results typically follow from a simple 2-step modular analysis:

  • Step 1. [Initial Stability] Standard concentration inequalities imply that if the width is large enough, conditioning of the infinite-width kernel matrices transfers to their finite-width counterparts at initialization.

  • Step 2. [Training Stability] Standard optimization theory implies that conditioning in finite-width kernel matrices leads to fast training. In the case of training only the top layer this is sufficient. When training all layers, a much more careful analysis is needed to show that the NTK stays "close" to initialization, leading to conditioning throughout the training process.

We now provide a couple of representative optimization results that follow from this type of analysis. Our goal here is to merely provide representative examples of typical optimization scenarios and highlight what benefits conditioning can lead to. Indeed, we believe extensions and improvements can be derived with significantly better bounds.

4.1 Training only the top layer

We consider a mode of training where only the top layer weight vector is updated, while keeping the hidden-layer weights frozen at their randomly initialized values. To highlight this, we introduce dedicated notation for the network viewed as a function of the top-layer weight vector alone. Let $\eta$ be a step size; the update rule at iteration $t$ is given by

Note that in this mode of training, the associated optimization problem is convex in the top-layer weight vector. To implement Step 1 of the modular analysis, we appeal to the results of Daniely et al. (2016). They show that when the activations are suitably bounded (see Definition 6 in their paper for $C$-bounded activations) and the width is large enough, then with high probability, each entry in the finite-width kernel matrix is close to the corresponding entry of its infinite-width counterpart. Specifically, via Theorems 2 and 3 in their paper, we have the following version of thm:main-infinite-width-top-layer for finite width neural networks: [Via Theorem 2 in Daniely et al. (2016)] For any , suppose that either

  • The activation is -bounded and , or

  • The activation is ReLU, and .

Then with high probability, we have that for all , .

Step 2 follows by using standard convex optimization theory (Nesterov, 2014), which tells us that the convergence rate of gradient descent for this problem depends on the condition number of . Specifically, we have the following result:

Suppose . Then,

  • If is -bounded and the width , or

  • If is ReLU and the width .

Then setting , we get that with high probability over the initialization,

Alternatively, in order to find a point that is sub-optimal, gradient descent needs steps.
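To make the top-layer training regime concrete, the following minimal numpy sketch (ours; widths, step size, and helper names such as features are our own choices, not the paper's) freezes the randomly initialized hidden layers of a deep network with the normalized ReLU, forms the empirical top-layer kernel matrix, and runs gradient descent on the resulting convex least-squares problem with step size $1/\lambda_{\max}(K)$; the loss then decays geometrically at a rate governed by the condition number of $K$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Normalized ReLU.
_mean = 1.0 / np.sqrt(2 * np.pi)
_std = np.sqrt(0.5 - 1.0 / (2 * np.pi))
sigma = lambda z: (np.maximum(z, 0.0) - _mean) / _std

def features(X, Ws):
    """Top-layer feature map of a network with frozen hidden weights Ws,
    in the NTK (1/sqrt(fan-in)) parameterization."""
    H = X
    for W in Ws:
        H = sigma(H @ W.T / np.sqrt(W.shape[1]))
    return H / np.sqrt(H.shape[1])

n, d, m, L = 16, 20, 1024, 4
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.standard_normal(n)

Ws = [rng.standard_normal((m, d))] + [rng.standard_normal((m, m)) for _ in range(L - 1)]
Phi = features(X, Ws)                        # n x m random features
K = Phi @ Phi.T                              # empirical top-layer kernel matrix
ev = np.linalg.eigvalsh(K)
print(f"empirical condition number of K: {ev[-1] / ev[0]:.2f}")

# Gradient descent on the convex top-layer least-squares problem.
a = np.zeros(m)
eta = 1.0 / ev[-1]                           # step size 1 / lambda_max(K)
for t in range(301):
    r = Phi @ a - y                          # residual vector
    if t % 100 == 0:
        print(f"iter {t:3d}: squared-error loss {0.5 * float(r @ r):.3e}")
    a -= eta * (Phi.T @ r)                   # gradient of 0.5 * ||Phi a - y||^2
```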

Similarly, one can also derive a linear convergence theorem for stochastic gradient descent: with the same choice of parameters as in thm:top-layer-general, an appropriate choice of the step size, and with high probability over the initialization, stochastic gradient descent finds a point that is $\epsilon$-sub-optimal in expectation in at most steps.

The rate in the exponent in the theorem above naturally depends upon the condition number of the kernel matrix K. For simplicity, we choose to state the theorem for a depth at which the condition number is bounded by a constant. Precise rates, depending on the depth, can be derived from cor:main-top-layer-cn.

4.2 Training All The Layers Together

In this section we provide a representative result for the training dynamics when all the layers are trained together with a fixed common learning rate. The dynamics are given by

Now, since the bottom layers also move, the kernel changes at every step. The standard analysis in this setting follows from carefully establishing that the NTK does not change too much during the training procedure, allowing for the rest of the analysis to go through. The following theorem from Lee et al. (2019) summarizes one such setting for smooth activation functions.

[Theorem G.4 in Lee et al. (2019)] Suppose that the activation and its derivative further satisfy the property that there exists a constant such that for all

Then there exists a constant (depending on L, n, ) such that for width and setting the learning rate , with high probability over the initialization the following is satisfied for gradient descent for all ,

The following corollary is now a simple application of the above theorem and cor:main-infinite-width-ntk-cn. Suppose the conditions in thm:lee-smooth-act-optimization are satisfied, the width is taken to be a large enough constant (depending on ), and further ; then gradient descent with high probability finds an $\epsilon$-suboptimal point in total time .

As stated in thm:lee-smooth-act-optimization, the width required could be a very large constant. However, note that we require the depth to be only logarithmic in for achieving a constant condition number. Therefore the exponential-in-L factors accrued in the analysis of thm:lee-smooth-act-optimization are actually polynomial in . Therefore, merging results from Arora et al. (2019c), we can derive a polynomial in upper bound on the width of the network. This matches the best known bounds on the overparameterization while improving the optimization rates exponentially (in ). Further, we believe similar results can also be derived for ReLU activations following techniques in Allen-Zhu et al. (2018).
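The sketch below (ours; the manual backpropagation, widths, and learning rate are our own choices and are not taken from Lee et al. (2019)) trains all layers of a small finite-width network in the NTK parameterization with a small common learning rate, and reports both the training loss and the relative movement of the weights from initialization; the loss decreases while the weights barely move, which is the lazy-training behavior described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Normalized ReLU and its derivative.
_mean = 1.0 / np.sqrt(2 * np.pi)
_std = np.sqrt(0.5 - 1.0 / (2 * np.pi))
sigma = lambda z: (np.maximum(z, 0.0) - _mean) / _std
dsigma = lambda z: (z > 0).astype(float) / _std

def forward(X, Ws, a):
    """Forward pass in the NTK parameterization; returns outputs and activations."""
    Hs, Zs, H = [X], [], X
    for W in Ws:
        Z = H @ W.T / np.sqrt(W.shape[1])
        H = sigma(Z)
        Zs.append(Z)
        Hs.append(H)
    return H @ a / np.sqrt(a.shape[0]), Hs, Zs

def gradients(X, y, Ws, a):
    """Manual backpropagation for the squared loss 0.5 * ||f - y||^2."""
    f, Hs, Zs = forward(X, Ws, a)
    g = f - y                                        # dLoss / df, shape (n,)
    m = a.shape[0]
    da = Hs[-1].T @ g / np.sqrt(m)
    delta = (g[:, None] * a[None, :] / np.sqrt(m)) * dsigma(Zs[-1])
    dWs = [None] * len(Ws)
    for l in range(len(Ws) - 1, -1, -1):
        fan_in = Ws[l].shape[1]
        dWs[l] = delta.T @ Hs[l] / np.sqrt(fan_in)
        if l > 0:
            delta = (delta @ Ws[l]) / np.sqrt(fan_in) * dsigma(Zs[l - 1])
    return dWs, da, 0.5 * float(g @ g)

n, d, m, L = 16, 20, 512, 3
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.standard_normal(n)
Ws = [rng.standard_normal((m, d))] + [rng.standard_normal((m, m)) for _ in range(L - 1)]
a = rng.standard_normal(m)
Ws0 = [W.copy() for W in Ws]

eta = 0.05                                           # small common learning rate
for t in range(501):
    dWs, da, loss = gradients(X, y, Ws, a)
    if t % 100 == 0:
        move = max(np.linalg.norm(W - W0) / np.linalg.norm(W0)
                   for W, W0 in zip(Ws, Ws0))
        print(f"iter {t:3d}: loss {loss:.3e}, max relative weight movement {move:.4f}")
    for W, dW in zip(Ws, dWs):
        W -= eta * dW
    a -= eta * da
```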

The proofs from this section follow easily from our established results and standard arguments from optimization theory. We have included the proofs in app:opt-proofs for completeness.

5 SQ Learnability of Random Deep Neural Nets

In this section we show that our main result in Theorem 3.1 leads to a generalization of the recent result of Das et al. (2019) regarding learnability of random neural networks. The work of Das et al. (2019) studied randomly initialized deep neural networks with sign activations at hidden units. Motivated by the perspective of complexity of learning, they studied learnability of random neural networks in the popular statistical query learning (SQ) framework (Kearns, 1998; Bshouty and Feldman, 2002). Their main result establishes that any algorithm for learning a function that is a randomly initialized deep network with sign activations requires exponentially (in depth) many SQ queries in the worst case. Here we extend their result to arbitrary activations under mild assumptions and show that randomly initialized deep neural networks with arbitrary activations are hard to learn under the SQ model. Specifically, we prove our result under the assumption that the (normalized) activation is subgaussian with constant subgaussian norm. In particular we assume that

(3)

for a constant . Many activations such as the sign, ReLU and tanh satisfy this assumption.

A key component in establishing SQ hardness of learning is to show that given two non-collinear unit length vectors, a randomly initialized network of depth and sufficiently large width makes, in expectation, the pair nearly orthogonal. In other words, the magnitude of the expected dot product between any pair decreases exponentially with depth. While Das et al. (2019) proved the result for sign activations, we prove the statement for more general activations and then use it to establish SQ hardness of learning. We will work with networks that normalize the output of each layer to unit length via the operation . Then we have the following theorem: Let be a non-linear activation with being the coefficient of non-linearity as in Definition 3.1 and satisfying (3). Let be unit length vectors such that . Define , where each column of is sampled from and each column of is sampled from for . Furthermore, the operation normalizes the output of each layer to unit length. Let for a universal constant and for define be the dot product obtained by taking the representation of at depth of the network defined above. Then for any , it holds that

where and is a universal constant. While the above theorem is not a black box application of our main result (thm:main-infinite-width-top-layer), since careful concentration arguments are required due to finite width, the calculations are of a similar flavor.
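The sketch below (ours; the widths, the choice of initial correlation, and helper names such as depth_wise_dots are our own) simulates this decay at finite width: two unit vectors with inner product 0.95 are passed through a random network whose layer outputs are renormalized to unit length, and their dot product shrinks toward 0 with depth for the normalized sign, ReLU and tanh activations.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_normalized(sigma, n=2_000_000):
    """Center and scale an activation so E[s(X)] ~ 0, E[s(X)^2] ~ 1 for X ~ N(0,1)."""
    g = rng.standard_normal(n)
    v = sigma(g)
    mu, sd = v.mean(), v.std()
    return lambda z: (sigma(z) - mu) / sd

acts = {
    "sign": make_normalized(np.sign),
    "relu": make_normalized(lambda z: np.maximum(z, 0.0)),
    "tanh": make_normalized(np.tanh),
}

def depth_wise_dots(x, xp, act, depth, width):
    """Pass two unit vectors through a random network, renormalizing each layer's
    output to unit length, and record their dot products layer by layer."""
    dots, h, hp = [], x, xp
    for _ in range(depth):
        # Unit-norm inputs give standard normal pre-activations for N(0,1) weights.
        W = rng.standard_normal((width, h.shape[0]))
        h, hp = act(W @ h), act(W @ hp)
        h, hp = h / np.linalg.norm(h), hp / np.linalg.norm(hp)
        dots.append(float(h @ hp))
    return dots

d, width, depth = 50, 2000, 12
x = rng.standard_normal(d); x /= np.linalg.norm(x)
u = rng.standard_normal(d); u -= (u @ x) * x; u /= np.linalg.norm(u)
xp = 0.95 * x + np.sqrt(1 - 0.95 ** 2) * u           # <x, xp> = 0.95
for name, act in acts.items():
    dots = depth_wise_dots(x, xp, act, depth, width)
    print(f"{name:>4}: " + " ".join(f"{v:+.3f}" for v in dots))
```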

We now show how the above theorem can be used to generalize the SQ lower bound of Das et al. (2019). Before describing our results, we recall that in the SQ model (Kearns, 1998) the learning algorithm does not have access to a labeled training set. Instead, for a given target function and a distribution over , the algorithm has access to a query oracle . The oracle takes as input a query function , and outputs a value such that . The goal of the algorithm is to use the query oracle to output a function that approximates , i.e., , for a given .

(Das et al., 2019) established an SQ learnability lower bound for a subclass of neural networks with the property that a randomly initialized neural network falls in this subclass with high probability. This, however, only establishes that the class is hard to SQ learn, as opposed to showing that a randomly initialized neural network is hard to learn. Furthermore, the lower bound only applies to networks with sign activations. We now show how to generalize their result in two ways: (a) we allow arbitrary activations satisfying (3), and (b) our lower bound shows that a randomly initialized network is hard to learn in the SQ model with constant probability. We achieve the stronger lower bound by carefully adapting the lower bound technique of Bshouty and Feldman (2002).

In our context we will fix a non-linear activation and let the target be of the form

where each column of is sampled from and each column of is sampled from for . Furthermore, we will use the depth and the dimensionality to parameterize the bit complexity of the network description. We say that an algorithm -SQ learns if with probability at least over the randomness in , the algorithm makes at most queries to the SQ oracle for , receives responses from the oracle up to tolerance and outputs a that -approximates . Furthermore, each query function used by the algorithm can be evaluated in time .

Then we have the following lower bound extending the result of Das et al. (2019). The proofs of this section can be found in app:sq-app. Fix any non-linear activation with the coefficient of non-linearity being that satisfies (3). Any algorithm that -SQ learns the random depth networks as defined above with width must satisfy .

6 Benign Overfitting in Deep Neural Networks

In this section, we give an application of our conditioning results showing how interpolating classifiers (i.e. classifiers achieving perfect training accuracy) can generalize well in the context of deep neural networks. Specifically, building on the work of Bartlett et al. (2019b), we consider the problem of linear regression with square loss where the feature representation is obtained via a randomly initialized deep network, and an interpolating linear predictor is obtained by training only the top layer (i.e. the top-layer weight vector). Since there are infinitely many interpolating linear predictors in the overparameterized setting we consider, we focus our attention on the minimum norm predictor.

In this setting, the input space is the unit sphere in $\mathbb{R}^d$, the output space is $\mathbb{R}$, and samples are drawn from an unknown distribution . The training set is . To simplify the presentation, we work in the infinite width setting, i.e. we learn the minimum norm linear predictor in the RKHS corresponding to the kernel function for a deep neural network as defined in sec:top-layer. The number of hidden layers in the neural network, $L$, depends on the sample size $n$ in our results.

Following the notational conventions in (Bartlett et al., 2019b), for , we denote by their inner product. Let be the feature map corresponding to . We denote by the infinite matrix the linear map from corresponding to the inputs , so that for any , has th component . Note that , the kernel matrix for the training data defined by . We denote by the vector , and by the data covariance matrix.

The loss of a linear predictor parameterized by on an example is . We denote by the optimal linear predictor, i.e. a vector in . If is non-singular, then the linear predictor interpolates on , i.e., for all , and indeed, is the minimum norm interpolating linear predictor. Our goal is to bound the excess risk of , i.e. .
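For intuition, the following sketch (ours; the target function, widths, and helper names are our own, and we use a finite-width random-feature approximation in place of the infinite-width kernel of the paper) computes the minimum norm interpolating predictor by kernel "ridgeless" regression, $f(x) = k(x)^\top K^{-1} y$, on features produced by a randomly initialized deep network with frozen hidden layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Normalized ReLU.
_mean = 1.0 / np.sqrt(2 * np.pi)
_std = np.sqrt(0.5 - 1.0 / (2 * np.pi))
sigma = lambda z: (np.maximum(z, 0.0) - _mean) / _std

def features(X, Ws):
    """Top-layer (random) feature map with frozen weights, NTK parameterization."""
    H = X
    for W in Ws:
        H = sigma(H @ W.T / np.sqrt(W.shape[1]))
    return H / np.sqrt(H.shape[1])

def sample_sphere(k, d):
    Z = rng.standard_normal((k, d))
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

target = lambda X: np.sin(3 * X[:, 0]) + X[:, 1]     # an arbitrary smooth, noiseless target

n, n_test, d, m, L = 100, 500, 10, 1024, 4
Xtr, Xte = sample_sphere(n, d), sample_sphere(n_test, d)
ytr, yte = target(Xtr), target(Xte)

Ws = [rng.standard_normal((m, d))] + [rng.standard_normal((m, m)) for _ in range(L - 1)]
Ptr, Pte = features(Xtr, Ws), features(Xte, Ws)
K = Ptr @ Ptr.T                               # kernel matrix on the training data
alpha = np.linalg.solve(K, ytr)               # min-norm interpolant: f(.) = k(.)^T K^{-1} y
print(f"train MSE {np.mean((K @ alpha - ytr) ** 2):.2e}   (interpolation)")
print(f"test  MSE {np.mean(((Pte @ Ptr.T) @ alpha - yte) ** 2):.3f}")
```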

A key quantity of interest is the function defined as follows: is the largest value of for which ass:separation holds for a randomly drawn sample set of size with probability at least . Specifically, if denotes a sample set of size drawn i.i.d. from the marginal distribution of over the -coordinate, then

With this definition, we have the following excess risk bound (proof in app:interpolation): For any , let . Then, with probability at least over the choice of , there exists an interpolating linear predictor, and we have

A few caveats about the theorem are in order. Note that the number of layers, $L$, and therefore the kernel and the optimal linear predictor, depend on the sample size $n$. Thus, the excess risk goes to 0 as $n$ increases whenever the right-hand side of the bound vanishes in that limit.

7 Discussion

Our main result stated in thm:main-infinite-width-top-layer and its applications clearly demonstrate the joint benefit of using deeper networks with non-linear activations from an optimization and generalization perspective. Our ass:separation is standard and has been used in prior works on optimization of neural networks via stochastic gradient descent (Allen-Zhu et al., 2018; Zou and Gu, 2019; Du et al., 2019). ass:non-singularity is stronger but is easily satisfied by randomly initialized one layer networks with popular activations such as the ReLU. We establish this in thm:one-layer-conditioning-sgn.

Using our conditioning analysis we obtain in cor:train_all_layers that when training all layers of a deep enough network via gradient descent, the iteration complexity is independent of and the initial separation , thereby clearly demonstrating the benefit of depth. This is in contrast to prior works where the iteration complexity depends polynomially in the depth and (Allen-Zhu et al., 2018; Zou and Gu, 2019; Du et al., 2019). For the case of training all the layers with ReLU activations, our improved analysis implies that (see thm:top-layer-general) the width requirement only has a logarithmic dependence in , as opposed to polynomial in in prior works.

Finally, note that all our theorems and their implications hold for the case of normalized activations as defined in eq:normalization. As discussed earlier, the only somewhat non-standard part of the normalization is the requirement that the activation is centered so that its expectation on standard normal inputs is 0. This requirement is not simply a limitation of our analysis but is inherently necessary since when working with uncentered activations, a similar analysis to the one in this paper shows all pairwise dot products approach 1 (rather than 0) at an exponential rate.

Beyond this fact, we note that the commonly used batch normalization (BatchNorm) operation (Ioffe and Szegedy, 2015) makes it possible to assume that activations are centered without loss of generality. BatchNorm is an essential operation for efficient training of deep networks: in this operation, the input to a given layer is normalized by subtracting the mean and dividing by the standard deviation of the inputs computed over a given batch of examples. It is evident from the definition of BatchNorm that the output of the operation is invariant to translation of the activation by a fixed constant. Thus, without loss of generality we can assume that the activation is centered. In this light, our results can be viewed as providing a theoretical justification for the superior optimization performance of BatchNorm that has been observed in practice.
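A minimal numerical check of this translation invariance (our sketch, with no learned scale or shift parameters): adding any fixed constant to the activation's output leaves the BatchNorm output unchanged, so one may as well work with the centered activation.

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_norm(Z, eps=1e-8):
    """Normalize each feature over the batch dimension (no learned scale/shift)."""
    mu = Z.mean(axis=0, keepdims=True)
    sd = Z.std(axis=0, keepdims=True)
    return (Z - mu) / (sd + eps)

relu = lambda z: np.maximum(z, 0.0)
Z = rng.standard_normal((64, 32))             # a batch of pre-activations

out_plain = batch_norm(relu(Z))               # uncentered activation
out_shifted = batch_norm(relu(Z) + 3.7)       # the same activation translated by a constant
print("max difference:", np.abs(out_plain - out_shifted).max())   # ~ 0
```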

References

Appendix A Related Work

Representational Benefits of Depth.

Analogous to depth hierarchy theorems in circuit complexity, many recent works have aimed to characterize the representational power of deep neural networks when compared to their shallow counterparts. The work of Delalleau and Bengio [2011] studies sum-product networks and constructs examples of functions that can be efficiently represented by networks of a certain depth or higher and require exponentially many neurons for representation with depth-one networks. The works of Martens and Medabalimi [2014] and Kane and Williams [2016] study networks of linear threshold gates and provide similar separation results. Eldan and Shamir [2016] show that for many popular activations such as sigmoid, ReLU etc. there are simple functions that can be computed by depth-three feed forward networks but require exponentially (in the input dimensionality) many neurons to represent using two layer feed forward networks. Telgarsky [2016] generalizes this to construct, for any integer , a family of functions that can be approximated by layers and size and require exponentially many neurons to represent with smaller depth.

Optimization Benefits of Depth.

While the benefits of depth are well understood in terms of the representation power using a small number of neurons, the question of whether increasing depth helps with optimization is currently poorly understood. The recent work of Arora et al. [2018b] aims to understand this question for the special case of linear neural networks. For the case of regression, they show that gradient descent updates on a depth linear network correspond to accelerated gradient descent type updates on the original weight vector. Similarly, they derive the form of the weight updates for a general over parameterized deep linear neural network and show that these updates can be viewed as performing gradient descent on the original network but with a preconditioning operation applied to the gradient at each step. Empirically this leads to faster convergence. The works of Bartlett et al. [2019a] and Arora et al. [2018a] study the convergence of gradient descent on linear regression problems when solved via an over parameterized deep linear network. These works establish that under suitable assumptions on the initialization, gradient descent on the over parameterized deep linear networks enjoys the same rate of convergence as performing linear regression in the original parameter space which is a smooth and strongly convex problem.

In a similar vein, the recent work of Arora et al. [2019b] analyzes over parameterized deep linear networks for solving matrix factorization, and shows that the solution to the gradient flow equations approaches the minimum nuclear norm solution at a rate that increases with the depth of the network. The recent work of Malach and Shalev-Shwartz [2019] studies depth separation between shallow and deeper networks over distributions that have a certain fractal structure. In certain regimes of the parameters of the distribution the authors show that, surprisingly, the stronger the depth separation is, the harder it becomes to learn the distribution via a deep network using gradient based algorithms.

Optimization of Neural Networks via Gradient Descent

In recent years there has been a large body of work in analyzing the convergence of gradient descent and stochastic gradient descent (SGD) on over parameterized neural networks. The work of Andoni et al. [2014] shows that depth one neural networks with quadratic activations can efficiently represent low degree polynomials and performing gradient descent on the network starting with random initialization can efficiently learn such classes. The work of Li and Yuan [2017] shows convergence of gradient descent on the population loss and under Gaussian input distribution, of a two layer feed forward network with relu activations and the identity mapping mimicking the ResNet architecture. Under similar assumptions the work of Soltanolkotabi et al. [2018] analyzes SGD for two layer neural networks with quadratic activations. The work of Li and Liang [2018] extends these results to more realistic data distributions.

Building upon the work of Daniely et al. [2016], Daniely [2017b] shows that SGD when run on over parameterized neural networks achieves at most excess loss (on the training set) over the best predictor in the conjugate kernel class at the rate that depends on and , the norm of the best predictor. This result is extended in the work of Du et al. [2019] showing that by running SGD on a randomly initialized two layer over parameterized networks with relu activations, one can get loss on the training data at the rate that depends on and the smallest eigenvalue of a certain kernel matrix. While the authors show that this eigenvalue is positive, no explicit bound is provided. These results are extended to higher depth in [Du et al., 2018] at the expense of an exponential dependence on the depth on the amount of over parameterization needed. In [Allen-Zhu et al., 2018] the authors provide an alternate analysis under the weaker ass:separation and at the same time obtain convergence rates that depend on and only polynomially in the depth of the network. The recent work of Zou and Gu [2019] provides an improved analysis with better dependence on the parameters. We would like to point out that all the above works fail to explain the optimization benefits of depth, and in fact the resulting bounds degrade as the network gets deeper.

The work of Jacot et al. [2018] proposed the Neural Tangent Kernel (NTK) that is associated with a randomly initialized neural network in the infinite width regime. The authors show that in this regime performing gradient descent on the parameters of the network is equivalent to kernel regression using the NTK. The work of Lee et al. [2019] and Yang [2019] generalizes this result and the recent work of Arora et al. [2019c]

provides a non-asymptotic analysis and an algorithm for exact computation of the NTK for feed forward and convolutional neural networks. There have also been works analyzing the mean field dynamics of SGD on infinite width neural networks

[Mei et al., 2018, Chizat and Bach, 2018, Rotskoff and Vanden-Eijnden, 2018, Sirignano and Spiliopoulos, 2018] as well as works designing provable learning algorithms for shallow neural networks under certain assumptions [Arora et al., 2016, Ge et al., 2017, Goel and Klivans, 2017, Ge et al., 2018, Goel et al., 2018, Bakshi et al., 2018, Vempala and Wilmes, 2018]. Recent works have also explored the question of providing sample complexity based separation between training via the NTK vs. training all the layers [Wei et al., 2019, Allen-Zhu and Li, 2020].

SQ Learnability of Neural Networks.

Several recent works have studied the statistical query (SQ) framework of Kearns [1998] to provide lower bounds on the number of queries needed to learn neural networks with a certain structure [Song et al., 2017, Vempala and Wilmes, 2018, Das et al., 2019]. The closest to us is the recent work of Das et al. [2019] that shows that learning a function that is a randomly initialized deep neural network with sign activations requires exponential in depth many statistical queries. A crucial part of their analysis requires showing that for randomly initialized neural networks with sign activations, the pairwise (normalized) dot products decrease exponentially fast with depth. Our main result in thm:main-infinite-width-top-layer strictly generalizes this result for arbitrary non-linear activations (under mild assumptions) thereby implying exponential SQ lower bounds for networks with arbitrary non linear activations. In particular, we show any algorithm that works in the statistical query framework, and learns (with high probability) a sufficiently deep randomly initialized network with an arbitrary non-linear activation, must necessarily use exponentially (in depth) many queries in the worst case. The only requirement we impose on the non-linear activations is that they satisfy subgaussianity (see Section C), a condition satisfied by popular activations such as relu, sign, and tanh.

Generalization in Neural Networks.

It has been observed repeatedly that modern deep neural networks have sufficient capacity to perfectly memorize the training data, yet generalize to test data very well (see, e.g., [Zhang et al., 2017]

). This observation flies in the face of conventional statistical learning theory which indicates that such overfitting should lead to poor generalization. Since then there has been a line of work providing generalization bounds for neural networks that depend on compressibility of the network

[Arora et al., 2018c], norm based bounds [Neyshabur et al., 2015, Bartlett et al., 2017], bounds via PAC-bayes analysis [Neyshabur et al., 2017, Dziugaite and Roy, 2017, Nagarajan and Kolter, 2019] and bounds that depend on the distance to initialization [Long and Sedghi, 2019]. Since randomly initialized neural networks are interpolating classifiers, i.e., they achieve zero error on the training set, there have also been recent works (e.g. [Belkin et al., 2018, 2019b, Liang and Rakhlin, 2018, Liang et al., 2019, Bartlett et al., 2019b, Belkin et al., 2019a, Hastie et al., 2019]) that study the generalization phenomenon in the context of specific interpolating methods (i.e. methods which perfectly fit the training data) and show how the obtained predictors can generalize well.

Appendix B Conditioning Analysis

Recall the notion of the dual activation for the activation $\sigma$: for $\rho \in [-1, 1]$, define the matrix $\Sigma_\rho := \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$. Define the conjugate activation function as follows:

The following facts can be found in Daniely et al. [2016]:

  1. Let $x, x'$ be unit vectors such that $\langle x, x' \rangle = \rho$. Then

  2. Since $\mathbb{E}_{X \sim N(0,1)}[\sigma(X)^2] < \infty$, $\sigma$ is square integrable w.r.t. the Gaussian measure. The (probabilist's) Hermite polynomials form an orthogonal basis for the Hilbert space of square integrable functions w.r.t. the Gaussian measure, and hence $\sigma$ can be written as $\sigma = \sum_{i=0}^{\infty} a_i h_i$, where $h_i$ is the degree-$i$ normalized Hermite polynomial and $a_i := \mathbb{E}_{X \sim N(0,1)}[\sigma(X) h_i(X)]$. This expansion is known as the Hermite expansion for $\sigma$.

  3. We have $\hat{\sigma}(\rho) = \sum_{i=0}^{\infty} a_i^2 \rho^i$.

  4. The normalization eq:normalization has the following consequences. Since $\mathbb{E}_{X \sim N(0,1)}[\sigma(X)] = 0$, we have $a_0 = 0$, and since $\mathbb{E}_{X \sim N(0,1)}[\sigma(X)^2] = 1$, we have $\sum_{i=1}^{\infty} a_i^2 = 1$.

  5. If $\sigma'$ denotes the derivative of $\sigma$, then $\widehat{\sigma'} = \hat{\sigma}'$.

The above facts imply the following simple bound on the coefficient of non-linearity $\mu$: for any normalized non-linear activation function $\sigma$, we have $0 < \mu \le 1$. Indeed, the degree-1 Hermite polynomial is $h_1(x) = x$, so $a_1 = \mathbb{E}_{X \sim N(0,1)}[\sigma(X) X]$. Since $\sigma$ is non-linear, for at least one $i \ne 1$ we have $a_i \ne 0$. This, coupled with the fact that $\sum_{i=1}^{\infty} a_i^2 = 1$, implies that $a_1^2 < 1$, which implies that $\mu = 1 - a_1^2 > 0$.
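The sketch below (ours; the quadrature order, the truncation level, and the reading of the coefficient of non-linearity as $\mu = 1 - a_1^2$ are our own choices) computes the Hermite coefficients of the normalized ReLU by Gauss-Hermite quadrature in the probabilists' convention, and checks $a_0 \approx 0$, $\sum_i a_i^2 \approx 1$, and the resulting value of $\mu$.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval
from math import factorial, sqrt, pi

# Normalized ReLU (centered and scaled for N(0,1) inputs).
_mean = 1.0 / sqrt(2 * pi)
_std = sqrt(0.5 - 1.0 / (2 * pi))
sigma = lambda z: (np.maximum(z, 0.0) - _mean) / _std

# Gauss-Hermite quadrature for the weight exp(-x^2/2): E[f(X)] ~ sum_k w_k f(x_k) / sqrt(2 pi).
xk, wk = hermegauss(150)
expect = lambda f: float(np.sum(wk * f(xk)) / sqrt(2 * pi))

def hermite_coeff(i):
    """a_i = E[sigma(X) He_i(X)] / sqrt(i!): coefficient in the orthonormal
    probabilists' Hermite basis."""
    c = np.zeros(i + 1)
    c[i] = 1.0                                   # coefficient vector selecting He_i
    return expect(lambda x: sigma(x) * hermeval(x, c)) / sqrt(factorial(i))

a = np.array([hermite_coeff(i) for i in range(20)])
print("a_0              ~", round(a[0], 4))                    # ~ 0 by the centering in eq. (1)
print("sum_i a_i^2      ~", round(float(np.sum(a ** 2)), 4))   # ~ 1 by the scaling in eq. (1)
print("mu = 1 - a_1^2   ~", round(1 - a[1] ** 2, 4))           # coefficient of non-linearity (our reading)
```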

The random initialization of the neural network induces a feature representation of the input vectors at every depth in the neural network: . This feature representation naturally yields a kernel function . In particular, after the first layer, the kernel function