While stochastic gradient descent (SGD) from a random initialization is probably the most popular supervised learning algorithm today, we have very few results that describe conditions that guarantee its success. Indeed, to the best of our knowledge, Andoni et al. provide the only known result of this form, and it is valid in a rather restricted setting. Namely, for depth-2 networks, where the underlying distribution is Gaussian, the algorithm is full gradient descent (rather than SGD), and the task is regression when the learnt function is a constant degree polynomial.
We build on the framework of Daniely et al. to establish guarantees on SGD in a rather general setting. Daniely et al. defined a framework that associates a reproducing kernel to a network architecture. They also connected the kernel to the network via the random initialization. Namely, they showed that right after the random initialization, any function in the kernel space can be approximated by changing the weights of the last layer. The quality of the approximation depends on the size of the network and the norm of the function in the kernel space.
As optimizing the last layer is a convex procedure, the result of Daniely et al. intuitively shows that the optimization process starts from a favourable point for learning a function in the conjugate kernel space. In this paper we verify this intuition. Namely, for a fairly general family of architectures (that contains fully connected networks and convolutional networks) and supervised learning tasks, we show that if the network is large enough, the learning rate is small enough, and the number of SGD steps is large enough as well, SGD is guaranteed to learn any function in the corresponding kernel space. We emphasize that the number of steps and the size of the network are only required to be polynomial (which is best possible) in the relevant parameters: the norm of the function, the required accuracy parameter ($\epsilon$), and the dimension of the input and the output of the network. Likewise, the result holds for any input distribution.
To evaluate our result, one should understand which functions it guarantees that SGD will learn. Namely, what functions reside in the conjugate kernel space, how rich it is, and how good those functions are as predictors. From an empirical perspective, it has been shown that for standard convolutional networks the conjugate class contains functions whose performance is close to the performance of the function that is actually learned by the network. This is based on experiments on the standard CIFAR-10 dataset. From a theoretical perspective, we list below a few implications that demonstrate the richness of the conjugate kernel space. These implications are valid for fully connected networks of any depth between and , where is the input dimension. Likewise, they are also valid for convolutional networks of any depth between and , and with constantly many convolutional layers.
SGD is guaranteed to learn in polynomial time constant degree polynomials with polynomially bounded coefficients. As a corollary, SGD is guaranteed to learn in polynomial time conjunctions, DNF and CNF formulas with constantly many terms, and DNF and CNF formulas with constantly many literals in each term. These function classes comprise a considerable fraction of the function classes that are known to be poly-time (PAC) learnable by any method. Exceptions include constant degree polynomial thresholds with no restriction on the coefficients, decision lists and parities.
SGD is guaranteed to learn, not necessarily in polynomial time, any continuous function. This complements classical universal approximation results that show that neural networks can (approximately) express any continuous function (see  for a survey). Our results strengthen those results and show that networks are not only able to express those functions, but are actually guaranteed to learn them.
1.1 Related work
Guarantees on SGD.
As noted above, there are very few results that provide polynomial time guarantees for SGD on NN. One notable exception is the work of Andoni et al., which proves a result that is similar to ours, but in a substantially more restricted setting. Concretely, their result holds for depth-2 fully connected networks, as opposed to rather general architectures and constant or logarithmic depth in our case. Likewise, the marginal distribution on the instance space is assumed to be Gaussian or uniform, as opposed to arbitrary in our case. In addition, the algorithm they consider is full gradient descent, which corresponds to SGD with an infinitely large mini-batch, as opposed to SGD with arbitrary mini-batch size in our case. Finally, the underlying task is regression in which the target function is a constant degree polynomial, whereas we consider a rather general supervised learning setting.
Other polynomial time guarantees on learning deep architectures.
Various recent papers show that poly-time learning is possible in the case that the learnt function can be realized by a neural network with certain (usually fairly strong) restrictions on the weights [23, 34, 33, 35], or under the assumption that the data is generated by a generative model that is derived from the network architecture [3, 4]. We emphasize that the main difference between those results and both our results and the results of Andoni et al. is that they do not provide guarantees on the standard SGD learning algorithm. Rather, they show that under the aforementioned conditions, there are some algorithms, usually very different from SGD on the network, that are able to learn in polynomial time.
Connection to kernels.
As mentioned earlier, our paper builds on Daniely et al., who developed the association of kernels to NN which we rely on. Several previous papers [24, 10, 28, 27, 25, 32, 18, 26, 6, 5, 16, 2] investigated such associations, but in more restricted settings (i.e., for fewer architectures). Some of those papers [28, 27, 13, 18, 6, 5] also provide concentration of measure results, which show that w.h.p. the random initialization of the network’s weights is rich enough to approximate the functions in the corresponding kernel space. As a result, these papers provide polynomial time guarantees on the variant of SGD where only the last layer is trained. We remark that with the exception of , those results apply just to depth-2 networks.
1.2 Discussion and future directions
We next want to place this work in the appropriate learning theoretic context, and to elaborate further on this paper’s approach for investigating neural networks. For the sake of concreteness, let us restrict the discussion to binary classification over the Boolean cube. Namely, given examples from a distribution on , the goal is to learn a function whose 0-1 error, , is as small as possible. We will use a bit of terminology. A model is a distribution on and a model class is a collection of models. We note that any function class defines a model class, , consisting of all models such that for some . We define the capacity of a model class as the minimal number for which there is an algorithm such that for every the following holds. Given samples from , the algorithm is guaranteed to return, w.p. over the samples and its internal randomness, a function with 0-1 error . We note that for function classes the capacity is the VC dimension, up to a constant factor.
Learning theory analyses learning algorithms via model classes. Concretely, one fixes some model class and shows that the algorithm is guaranteed to succeed whenever the underlying model is from . Often, the connection between the algorithm and the class at hand is very clear. For example, in the case that the model is derived from a function class , the algorithm might simply be one that finds a function in that makes no mistake on the given sample. The natural choice for a model class for analyzing SGD on NN would be the class of all functions that can be realized by the network, possibly with some reasonable restrictions on the weights. Unfortunately, this approach is probably doomed to fail, as implied by various computational hardness results [8, 19, 7, 20, 21, 22, 12, 11].
So, what model classes should we consider? With a few isolated exceptions (e.g. ) all known efficiently learnable model classes are either a linear model class, or contained in an efficiently learnable linear model class. Namely, function classes composed of compositions of some predefined embedding with linear threshold functions, or linear functions over some finite field.
Coming up with new tractable models would be fascinating progress. Still, as linear function classes are the main tool that learning theory currently has for providing guarantees on learning, it seems natural to try to analyze SGD via linear model classes. Our work follows this line of thought, and we believe that there is much more to achieve via this approach. Concretely, while our bounds are polynomial, the degree of the polynomials is rather large, and possibly much better quantitative bounds can be achieved. To be more concrete, suppose that we consider a simple fully connected architecture, with 2 layers, ReLU activation, and hidden neurons. In this case, the capacity of the model class that our results guarantee that SGD will learn is . For comparison, the capacity of the class of all functions that are realized by this network is . As a challenge, we encourage the reader to prove that with this architecture (possibly with an activation that is different from the ReLU), SGD is guaranteed to learn some model class of capacity that is super-linear in .
We denote vectors by bold-face letters (e.g.), matrices by upper case letters (e.g. ), and collection of matrices by bold-face upper case letters (e.g. ). The -norm of is denoted by . We will also use the convention that . For functions we let
Let be a directed acyclic graph. The set of neighbors incoming to a vertex is denoted . We also denote . Given weight function and we let . The dimensional sphere is denoted . We use to denote .
Throughout the paper we assume that each example is a sequence of elements, each of which is represented as a unit vector. Namely, we fix and take the input space to be . Each input example is denoted,
While this notation is slightly non-standard, it unifies input types seen in various domains (see ).
The goal in supervised learning is to devise a mapping from the input space to an output space based on a sample , where drawn i.i.d. from a distribution over . A supervised learning problem is further specified by an output length
and a loss function, and the goal is to find a predictor whose loss, , is small. The empirical loss is commonly used as a proxy for the loss . When is defined by a vector of parameters, we will use the notations , and .
Regression problems correspond to , and, for instance, the squared loss . Binary classification is captured by , and, say, the zero-one loss or the hinge loss . Multiclass classification is captured by being the number of classes, , and, say, the zero-one loss or the logistic loss where is given by . A loss is -Lipschitz if for all , the function is -Lipschitz. Likewise, it is convex if is convex for every .
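The losses named above can be sketched in a few lines. This is a minimal illustration that assumes the usual textbook forms (squared loss $(\hat{y}-y)^2$, hinge loss $\max(0, 1-\hat{y}y)$ with labels in $\{-1,+1\}$, and the multiclass logistic loss $-\log p_y(\hat{y})$ with $p$ the softmax of the scores); the function names are ours, and scaling conventions may differ from the ones intended in the text.

```python
import math

def squared_loss(y_hat, y):
    # Regression: squared difference between prediction and target.
    return (y_hat - y) ** 2

def zero_one_loss_binary(y_hat, y):
    # Binary classification with labels in {-1, +1}: 1 on a sign mistake.
    return 1.0 if y_hat * y <= 0 else 0.0

def hinge_loss(y_hat, y):
    # Convex surrogate of the zero-one loss; 1-Lipschitz in y_hat.
    return max(0.0, 1.0 - y_hat * y)

def logistic_loss(scores, y):
    # Multiclass: -log softmax(scores)[y], via a stable log-sum-exp.
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[y]
```

The hinge and logistic losses are convex in the prediction, which is the property the main results rely on; the zero-one loss is not.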
Neural network learning.
We define a neural network to be a vertex-weighted directed acyclic graph (DAG) whose nodes are denoted and edges . The weight function will be denoted by , and its sole role is to dictate the distribution of the initial weights. We will refer to ’s nodes as neurons. Each non-input neuron, i.e. a neuron with incoming edges, is associated with an activation function . In this paper, an activation can be any function that is right and left differentiable, square integrable with respect to the Gaussian measure on , and is normalized in the sense that . The neurons having only incoming edges are called the output neurons. To match the setup of supervised learning defined above, a network has input neurons and output neurons, denoted . A network together with a weight vector defines a predictor whose prediction is given by “propagating” forward through the network. Concretely, we define to be the output of the subgraph of the neuron as follows: for an input neuron , outputs the corresponding coordinate in , and for internal neurons, we define recursively as
For output neurons, we define as
Finally, we let .
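The forward propagation just described can be sketched as a single pass over the neurons in topological order. The sketch assumes the standard form in which an internal neuron applies its activation to a weighted sum of its incoming neurons plus a bias, and an output neuron returns the weighted sum without an activation or bias; the function `forward` and its data layout are hypothetical and chosen only for illustration.

```python
def relu(z):
    return max(0.0, z)

def forward(order, in_edges, weights, biases, activations, x):
    """Propagate the input x through a DAG network.

    order:       neuron names in topological order
    in_edges:    dict mapping a non-input neuron to its incoming neurons
    weights:     dict mapping an edge (u, v) to its weight
    biases:      dict mapping an internal neuron to its bias (default 0)
    activations: dict mapping an internal neuron to its activation;
                 neurons without an entry (outputs) are linear
    x:           dict mapping each input neuron to its coordinate
    """
    h = dict(x)  # input neurons output their coordinate
    for v in order:
        if v in h:
            continue
        pre = sum(weights[(u, v)] * h[u] for u in in_edges[v])
        pre += biases.get(v, 0.0)
        act = activations.get(v)
        h[v] = act(pre) if act is not None else pre
    return h

# A tiny network: two inputs, two ReLU neurons, one linear output.
order = ["x1", "x2", "h1", "h2", "out"]
in_edges = {"h1": ["x1", "x2"], "h2": ["x1", "x2"], "out": ["h1", "h2"]}
weights = {("x1", "h1"): 2.0, ("x2", "h1"): 0.5,
           ("x1", "h2"): -1.0, ("x2", "h2"): 1.0,
           ("h1", "out"): 1.0, ("h2", "out"): 3.0}
biases = {"h1": 0.5}
activations = {"h1": relu, "h2": relu}
outputs = forward(order, in_edges, weights, biases, activations,
                  {"x1": 1.0, "x2": -2.0})
```

Because the graph is acyclic, one pass in topological order suffices: every neuron is evaluated only after all of its incoming neurons.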
We next describe the learning algorithm that we analyze in this paper. While there is no standard training algorithm for neural networks, the algorithms used in practice are usually quite similar to the one we describe, both in the way the weights are initialized and the way they are updated. We will use the popular Xavier initialization  for the network weights. Fix . We say that are -biased random weights (or, -biased random initialization) if each weight is sampled independently from a normal distribution with mean and variance if is an input neuron and otherwise. Finally, each bias term is sampled independently from a normal distribution with mean and variance . We note that the rationale behind this initialization scheme is that for every example and every neuron we have (see )
A function is a reproducing kernel, or simply a kernel, if for every , the matrix is positive semi-definite. Each kernel induces a Hilbert space of functions from to with a corresponding norm . For we denote . A kernel and its corresponding space are normalized if .
Kernels give rise to popular benchmarks for learning algorithms. Fix a normalized kernel and . It is well known that for -Lipschitz loss , the SGD algorithm is guaranteed to return a function such that using examples. In the context of multiclass classification, for we define by . We say that a distribution on is -separable w.r.t. if there is such that and . In this case, the perceptron algorithm is guaranteed to return a function such that using examples. We note that both for perceptron and SGD, the above mentioned results are best possible, in the sense that any algorithm with the same guarantees will have to use at least the same number of examples, up to a constant factor.
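To make the perceptron benchmark concrete, the following is a minimal sketch of a kernel perceptron. For brevity it treats binary labels in $\{-1,+1\}$ rather than the multiclass setting of the text, and it uses the linear kernel on unit vectors as a stand-in for a normalized kernel; the toy data and all names are illustrative assumptions.

```python
import math

def linear_kernel(x, y):
    # Inner product; equals 1 on the diagonal for unit vectors.
    return sum(a * b for a, b in zip(x, y))

def kernel_perceptron(xs, ys, kernel, max_epochs=100):
    # alpha[i] counts the mistakes made on example i; the predictor is
    # f(x) = sum_i alpha[i] * ys[i] * kernel(xs[i], x).
    alpha = [0] * len(xs)

    def score(x):
        return sum(a * y * kernel(xi, x) for a, xi, y in zip(alpha, xs, ys))

    for _ in range(max_epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(xs, ys)):
            if y * score(x) <= 0:  # mistake (or zero margin): update
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:  # a full pass without mistakes: done
            break
    return alpha, score

# Unit vectors on the circle, separable with a large margin.
angles = [0.2, 0.5, -0.3, math.pi + 0.1, math.pi - 0.4, math.pi + 0.3]
xs = [(math.cos(t), math.sin(t)) for t in angles]
ys = [1, 1, 1, -1, -1, -1]
alpha, score = kernel_perceptron(xs, ys, linear_kernel)
```

On separable data the number of updates is controlled by the margin and not by the number of examples, which is the source of the sample bound quoted above.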
Computation skeletons 
In this section we define a simple structure which we term a computation skeleton. The purpose of a computation skeleton is to compactly describe a feed-forward computation from an input to an output. A single skeleton encompasses a family of neural networks that share the same skeletal structure. Likewise, it defines a corresponding normalized kernel. A computation skeleton is a DAG with inputs, whose non-input nodes are labeled by activations, and has a single output node . Figure 1 shows four example skeletons, omitting the designation of the activation functions. We denote by the number of non-input nodes of . The following definition shows how a skeleton, accompanied with a replication parameter and a number of output nodes , induces a neural network architecture.
[Realization of a skeleton] Let be a computation skeleton and consider input coordinates in as in (1). For we define the following neural network . For each input node in , has corresponding input neurons with weight . For each internal node labelled by an activation , has neurons , each with an activation and weight . In addition, has output neurons with the identity activation and weight . There is an edge whenever . For every output node in , each neuron is connected to all output neurons . We term the -fold realization of .
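The construction in the definition above can be sketched as a small graph-building routine. This is only an illustration of the combinatorial structure: the vertex weights are omitted, a skeleton edge `(u, v)` is assumed to mean that `u` feeds `v`, and the names `realize`, `d` (coordinates per input node), `r` (replication parameter) and `k` (number of outputs) are ours.

```python
def realize(inputs, internal, edges, output_node, r, k, d):
    """Build the neurons and edges of the r-fold realization of a skeleton.

    inputs:      skeleton input nodes
    internal:    skeleton non-input nodes (including output_node)
    edges:       skeleton edges (u, v), meaning u feeds v
    output_node: the single output node of the skeleton
    """
    copies = {}
    for v in inputs:
        copies[v] = [(v, i) for i in range(d)]  # d input neurons per input node
    for v in internal:
        copies[v] = [(v, i) for i in range(r)]  # r replicas per internal node
    outs = [("out", j) for j in range(k)]       # k linear output neurons

    # Every replica of u is connected to every replica of v.
    net_edges = [(a, b) for (u, v) in edges
                 for a in copies[u] for b in copies[v]]
    # Every replica of the skeleton's output node feeds all output neurons.
    net_edges += [(a, o) for a in copies[output_node] for o in outs]

    neurons = [n for v in inputs + internal for n in copies[v]] + outs
    return neurons, net_edges

# A depth-2 skeleton: two input nodes feed "h", which feeds the output node "o".
neurons, net_edges = realize(
    inputs=["x1", "x2"], internal=["h", "o"],
    edges=[("x1", "h"), ("x2", "h"), ("h", "o")],
    output_node="o", r=4, k=2, d=3)
```

In this example the realization has $2 \cdot 3 + 2 \cdot 4 + 2 = 16$ neurons, and each skeleton edge between internal nodes expands into $r^2$ network edges.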
Note that the notion of the replication parameter corresponds, in the terminology of convolutional networks, to the number of channels taken in a convolutional layer and to the number of hidden neurons taken in a fully-connected layer.
In addition to networks’ architectures, a computation skeleton also defines a normalized kernel . To define the kernel, we use the notion of a conjugate activation. For , we denote by the multivariate Gaussian distribution on with mean and covariance matrix .
[Conjugate activation] The conjugate activation of an activation is the function defined as
The following definition gives the kernel corresponding to a skeleton.
[Compositional kernels] Let be a computation skeleton and let . For every node , inductively define a kernel as follows. For an input node corresponding to the th coordinate, define . For a non-input node , define
The final kernel is . The resulting Hilbert space and norm are denoted and respectively.
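As an illustration of the conjugate activation, the sketch below estimates it by Monte Carlo and compares it with a closed form for the normalized ReLU. It assumes the standard definition $\hat{\sigma}(\rho) = \mathbb{E}_{(X,Y)\sim N_\rho}\,\sigma(X)\sigma(Y)$, where $N_\rho$ is the centered Gaussian on $\mathbb{R}^2$ with unit variances and correlation $\rho$, and it takes the normalized ReLU to be $\sqrt{2}\max(0,x)$; the closed form is the well-known arc-cosine expression. The function names are hypothetical.

```python
import math
import random

def relu_normalized(x):
    # sqrt(2) * max(0, x) has unit second moment under a standard Gaussian.
    return math.sqrt(2.0) * max(0.0, x)

def conjugate_mc(sigma, rho, n_samples=100_000, seed=0):
    # Monte Carlo estimate of E[sigma(X) sigma(Y)] for standard Gaussians
    # X, Y with correlation rho, using Y = rho*X + sqrt(1 - rho^2)*Z.
    rng = random.Random(seed)
    c = math.sqrt(max(0.0, 1.0 - rho * rho))
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0)
        z = rng.gauss(0.0, 1.0)
        total += sigma(x) * sigma(rho * x + c * z)
    return total / n_samples

def relu_conjugate_closed_form(rho):
    # Arc-cosine form of the conjugate activation of the normalized ReLU.
    return (math.sqrt(1.0 - rho * rho)
            + (math.pi - math.acos(rho)) * rho) / math.pi
```

The closed form equals 1 at $\rho = 1$, which reflects the normalization of the activation, and $1/\pi$ at $\rho = 0$, the squared mean of the normalized ReLU under a standard Gaussian.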
3 Main results
An activation is called -bounded if . Fix a skeleton and -Lipschitz convex loss . (Footnote 1: If is -Lipschitz, we can replace by and the learning rate by . The operation of algorithm 1 will be identical to its operation before the modification. Given this observation, it is very easy to derive results for general given our results. Hence, to save one parameter, we will assume that .) Define and , where is the minimal number for which all the activations in are -bounded, and is the maximal length of a path from an input node to . We also define , where is the minimal number for which all the activations in are -Lipschitz and satisfy . Throughout this and the remaining sections we use to hide universal constants. Likewise, we fix the bias parameter and therefore omit it from the relevant notation.
We note that for constant depth skeletons with maximal degree that is polynomial in , and are polynomial in . These quantities are polynomial in also for various log-depth skeletons. For example, this is true for fully connected skeletons, or more generally, layered skeletons with constantly many layers that are not fully connected.
Suppose that all activations are -bounded. Let . Suppose that we run algorithm 1 on the network with the following parameters:
Zero initialized prediction layer
Then, w.p. over the choice of the initial weights, there is such that . Here, the expectation is over the training examples.
We next consider ReLU activations. Here, . Suppose that all activations are the ReLU. Let . Suppose that we run algorithm 1 on the network with the following parameters:
Zero initialized prediction layer
Then, w.p. over the choice of the initial weights, there is such that . Here, the expectation is over the training examples.
Finally, we consider the case in which the last layer is also initialized randomly. Here, we provide guarantees in a more restricted setting of supervised learning. Concretely, we consider multiclass classification, when is separable with margin, and is the logistic loss.
Suppose that all activations are -bounded, that is -separable with w.r.t. and let . Suppose we run algorithm 1 on with the following parameters:
Randomly initialized prediction layer
Then, w.p. over the choice of the initial weights and the training examples, there is such that
To demonstrate our results, let us elaborate on a few implications for specific network architectures. To this end, let us fix the instance space to be either or . Also, fix a bias parameter , a batch size , and a skeleton that is a skeleton of a fully connected network of depth between and . Finally, we also fix the activation function to be either the ReLU or a -bounded activation, assume that the prediction layer is initialized to , and fix the loss function to be some convex and Lipschitz loss function. Very similar results are valid for convolutional networks with constantly many convolutional layers. We however omit the details for brevity.
Our first implication shows that SGD is guaranteed to efficiently learn constant degree polynomials with polynomially bounded weights. To this end, let us denote by the collection of degree polynomials. Furthermore, for any polynomial we denote by the norm of its coefficients. Fix any positive integers . Suppose that we run algorithm 1 on the network with the following parameters:
Then, w.p. over the choice of the initial weights, there is such that . Here, the expectation is over the training examples. We note that several hypothesis classes that were studied in PAC learning can be realized by polynomial threshold functions with polynomially bounded coefficients. This includes conjunctions, DNF and CNF formulas with constantly many terms, and DNF and CNF formulas with constantly many literals in each term. If we take the loss function to be the logistic loss or the hinge loss, Corollary 3.1 implies that SGD efficiently learns these hypothesis classes as well.
Our second implication shows that any continuous function is learnable (not necessarily in polynomial time) by SGD. Fix a continuous function and . Assume that is realized by (that is, if then with probability ). Assume that we run algorithm 1 on the network . If is sufficiently small and and are sufficiently large, then, w.p. over the choice of the initial weights, there is such that .
We next remark on two extensions of our main results. The extended results can be proved in a similar fashion to our results. To avoid cumbersome notation, we restrict the proofs to the main theorems as stated, and will elaborate on the extended results in an extended version of this manuscript. First, we assume that the replication parameter is the same for all nodes. In practice, replication parameters for different nodes are different. This can be captured by a vector . Our main results can be extended to this case if for all , (a requirement that usually holds in practice). Second, we assume that there is no weight sharing that is standard in convolutional networks. Our results can be extended to convolutional networks with weight sharing.
We also note that we assume that in each step of algorithm 1, a fresh batch of examples is given. In practice this is often not the case. Rather, the algorithm is given a training set of examples, and at each step it samples from that set. In this case, our results provide guarantees on the training loss. If the training set is large enough, this also implies guarantees on the population loss via standard sample complexity results.
Throughout, we fix a loss , a skeleton , a replication parameter , the network and a bias parameter . For a matrix we denote , , and . We will often use the fact that . For and we abuse notation and denote .
For a skeleton we denote by the set of ’s internal nodes. We will aggregate the weights of
by a collection of matrices and bias vectors
Here, are the matrix and vector that map the output of all the neurons corresponding to nodes in , to the neurons corresponding to . Likewise, is the matrix that maps the output of the neurons corresponding to to the final output of the network. We decompose further as a concatenation of two matrices that correspond to the internal and input nodes in respectively. For a prediction matrix and weights we denote by the weights obtained by replacing with . We let
Finally, we let and . For we denote by the output on of the network with the weights . Given we let be the output of the neurons corresponding to . We denote by the output of the representation layer. We also let be the concatenation of . Note that and . For we denote and for we denote . We let . Finally, we let .
We next review the proof of theorem 3. The proof of theorem 3 is similar. Later, we will also comment on how the proof can be modified to establish theorem 3. Let be some function with and let be the weights produced by the SGD algorithm. Our goal is to show that w.h.p. over the choice of , there is such that .
In section 4.4 we show that w.h.p. over the choice of , there is a prediction matrix so that and . This follows from the results of , and some extensions of those. Namely, we extend the original from to general , and also eliminate a certain logarithmic dependence on the size of the support of .
Given that such exists, standard online learning results (e.g. Chapter 21 in ) imply that if we applied SGD only on the last layer, with the learning rate specified in theorem 3, i.e. for , we would be guaranteed to have some step in which .
However, as we consider SGD on all weights, this is not enough. Hence, in section 4.3, we show that with the above mentioned learning rate, the weights of the non-last layers change slowly enough, so that for all . Given this, we can invoke the online-learning based argument again.
In order to show that the layers beneath the last layer change slowly, we need to bound the magnitude of the gradient of the training objective. In section 4.2 we establish such a bound on the gradient of the loss for every example. As and are averages of such functions, the same bound holds for them as well. We note that our bound depends on the spectral norm of the matrices . We show that for random matrices, w.h.p. the magnitude of the norm implies a bound that is good enough for our purposes. Likewise, through the training process, the norm does not grow too much, so the desired bound is valid throughout the optimization process.
The structure of the proof of theorem 3 is similar, but has a few differences. The first step is to show that if is -separable w.r.t. , then w.h.p. over the choice of , there is a prediction matrix such that is tiny, and . Again, this is based on the results and techniques of , and is done in section 4.4. Given this, again, running SGD on the top layer would be fine. However, now we cannot utilize the online-learning based argument we used before, because the starting point is not , but rather a random vector, whose norm is too large to carry out the analysis. In light of that, we take a somewhat different approach.
We show that the weights beneath the last layer change slowly enough, so that the following holds throughout the optimization process: As long as the 0-1 error is larger than , the magnitude of the gradient is . More precisely, the derivative in the direction of is smaller than . Given this, and bounds on both the first and second derivative of the loss (proved in section 4.2), we are able to establish the proof by adapting a standard argument from smooth convex optimization (done in section 4.3).
4.2 Boundedness of the objective function
Let be an open set. For a function , a unit vector and we denote . We say that is -bounded at if is twice differentiable and
We say that is -bounded if it is -bounded in any . We note that for , is -bounded at if and only if , and . In particular, when too, is -bounded at if and only if , and . We will say that is -bounded if it is -bounded.
Let be a -bounded function. Suppose that is -bounded. We have that is -bounded. If we furthermore assume that then we have that is -bounded.
The first part follows from the facts that
The second part follows from the fact that in the case that we have that .
Let be a -bounded function. Suppose that is -bounded. We have that is -bounded.
This follows from the fact that