Deep learning has achieved immense practical success, revolutionizing fields such as image, text, and speech recognition, and it is increasingly being used in engineering, medicine, and finance. Despite this practical success, however, there is currently limited mathematical understanding of deep neural networks. This has motivated a growing body of recent mathematical research on deep learning models.
Neural networks are nonlinear statistical models whose parameters are estimated from data using stochastic gradient descent (SGD) methods. Deep learning uses neural networks with many layers (i.e., “deep” neural networks), which produces a highly flexible, powerful and effective model in practice.
In this paper we characterize multi-layer neural networks (i.e., “deep neural networks”) in the asymptotic regime of large network sizes and large numbers of stochastic gradient descent iterations. We rigorously prove the limit of the neural network output as the number of hidden units increases to infinity. The proof relies upon weak convergence analysis for stochastic processes. The result can be considered a “law of large numbers” for the neural network’s output when both the network size and the number of stochastic gradient descent steps grow to infinity.
Recently, laws of large numbers and central limit theorems have been established for neural networks with a single hidden layer. For a single hidden layer, one can directly study the weak convergence of the empirical measure of the parameters. However, in a neural network with multiple layers, there is a closure problem when studying the empirical measure of the parameters (explained in Section 3.3). Consequently, the law of large numbers for a multi-layer network is not a straightforward extension of the single-layer result, and the analysis involves unique challenges which require new approaches. In this paper we establish the limiting behavior of the output of the neural network.
To illustrate the idea, we consider a multi-layer neural network with two hidden layers:
\[
g^{N_1,N_2}_\theta(x) = \frac{1}{N_2} \sum_{i=1}^{N_2} C^i \, \sigma\!\left( \frac{1}{N_1} \sum_{j=1}^{N_1} W^{2,i,j} \, \sigma\big( W^{1,j} \cdot x \big) \right). \tag{1.1}
\]
As we will see in Section 3.2, the limit procedure can be extended to neural networks with three layers and subsequently to neural networks with any fixed number of hidden layers.
Notice now that (1.1) can also be written as
\[
H^{1,j}(x) = \sigma\big( W^{1,j} \cdot x \big), \qquad
H^{2,i}(x) = \sigma\!\left( \frac{1}{N_1} \sum_{j=1}^{N_1} W^{2,i,j} H^{1,j}(x) \right), \qquad
g^{N_1,N_2}_\theta(x) = \frac{1}{N_2} \sum_{i=1}^{N_2} C^i H^{2,i}(x), \tag{1.2}
\]
where $1 \le j \le N_1$ and $1 \le i \le N_2$. The neural network model has parameters
\[
\theta = \Big( C^1, \ldots, C^{N_2},\; W^{1,1}, \ldots, W^{1,N_1},\; W^{2,1,1}, \ldots, W^{2,N_2,N_1} \Big),
\]
which must be estimated from data. The number of hidden units in the first layer is $N_1$ and the number of hidden units in the second layer is $N_2$. The multi-layer neural network (1.2) includes a normalization factor of $\frac{1}{N_1}$ in the first hidden layer and $\frac{1}{N_2}$ in the second hidden layer.
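To make the normalizations in the architecture above concrete, here is a minimal numpy sketch of the forward pass. The sizes, seed, and the choice $\sigma = \tanh$ are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def two_layer_net(x, W1, W2, C, sigma=np.tanh):
    """Mean-field two-layer network.

    x  : input vector, shape (d,)
    W1 : first-layer weights, shape (N1, d)
    W2 : second-layer weights, shape (N2, N1)
    C  : output weights, shape (N2,)
    The 1/N1 and 1/N2 factors are the mean-field normalizations
    described in the text.
    """
    H1 = sigma(W1 @ x)                  # (N1,) first hidden layer
    H2 = sigma(W2 @ H1 / W1.shape[0])   # (N2,) second hidden layer, 1/N1 factor
    return C @ H2 / C.shape[0]          # scalar output, 1/N2 factor

# Example with hypothetical sizes d=3, N1=100, N2=50.
d, N1, N2 = 3, 100, 50
W1 = rng.normal(size=(N1, d))
W2 = rng.normal(size=(N2, N1))
C = rng.normal(size=N2)
x = rng.normal(size=d)
y = two_layer_net(x, W1, W2, C)
```

Note that, because $|\sigma| \le 1$ here, the output is bounded by the average magnitude of the $C^i$, regardless of the widths.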
The loss function is
\[
\mathcal{L}(\theta) = \frac{1}{2} \, \mathbb{E}_{(X,Y) \sim \pi} \Big[ \big( Y - g^{N_1,N_2}_\theta(X) \big)^2 \Big], \tag{1.3}
\]
where the data $(X,Y) \sim \pi$. The goal is to estimate a set of parameters $\theta$ which minimizes the objective function (1.3).
The stochastic gradient descent (SGD) algorithm for estimating the parameters takes, at each iteration $k$, a step in the direction of the negative gradient of the squared error on a single data sample, with separate learning rates for the parameters $C^i$, $W^{1,j}$, and $W^{2,i,j}$. The learning rates may depend upon $N_1$ and $N_2$. The parameters at step $k$ are denoted $\theta_k$, and the data pairs $(X_k, Y_k)$ used at the SGD steps are i.i.d. samples of the random variables $(X, Y)$.
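The SGD update can be sketched in numpy by differentiating the squared-error loss by hand through the two-layer forward pass. The learning-rate values below are arbitrary constants chosen for illustration; how the learning rates should scale with $N_1$ and $N_2$ is precisely the subject of Section 2:

```python
import numpy as np

def sgd_step(x, y, W1, W2, C, lrs):
    """One SGD step for the two-layer mean-field network with loss
    0.5*(y - g(x))**2 and sigma = tanh.  lrs = (lr_W1, lr_W2, lr_C) are
    per-parameter-group learning rates (plain constants in this sketch).
    Updates W1, W2, C in place and returns the prediction g(x)."""
    N1, N2 = W1.shape[0], W2.shape[0]
    H1 = np.tanh(W1 @ x)                               # (N1,)
    Z2 = W2 @ H1 / N1                                  # (N2,) pre-activations
    H2 = np.tanh(Z2)
    g = C @ H2 / N2                                    # network output
    err = y - g
    # Gradients of the loss, computed by the chain rule.
    dC = -err * H2 / N2                                # (N2,)
    dZ2 = -err * C * (1.0 - H2**2) / N2                # (N2,)
    dW2 = np.outer(dZ2, H1) / N1                       # (N2, N1)
    dH1 = W2.T @ dZ2 / N1                              # (N1,)
    dW1 = (dH1 * (1.0 - H1**2))[:, None] * x[None, :]  # (N1, d)
    lr_W1, lr_W2, lr_C = lrs
    W1 -= lr_W1 * dW1
    W2 -= lr_W2 * dW2
    C -= lr_C * dC
    return g

# Toy run on a single fixed sample (hypothetical sizes).
rng = np.random.default_rng(1)
d, N1, N2 = 2, 10, 10
W1 = rng.normal(size=(N1, d))
W2 = rng.normal(size=(N2, N1))
C = rng.normal(size=N2)
x, y = rng.normal(size=d), 0.7
preds = [sgd_step(x, y, W1, W2, C, lrs=(5.0, 5.0, 5.0)) for _ in range(200)]
```

On this single fixed sample the prediction error shrinks over the iterations, since each step is an exact gradient step on the same loss.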
The goal of this paper is to characterize the limit of an appropriate rescaling of the multi-layer neural network output as both the number of hidden units and the number of stochastic gradient descent iterates become large. This is the topic of Theorem 2.3. The idea is to first take $N_1 \to \infty$ with $N_2$ held fixed. In Lemma 2.2, we prove that the empirical measure of the parameters converges, as $N_1 \to \infty$ (with $N_2$ fixed), to a limit measure which satisfies a measure evolution equation. This naturally implies a limit for the neural network output as $N_1 \to \infty$. The next step is to take $N_2 \to \infty$. Theorem 2.3 proves that the limiting distribution can be represented via a system of ODEs.
The rest of the paper is organized as follows. Our main result, which characterizes the asymptotic behavior of a neural network with two hidden layers when the number of hidden units becomes large, is presented in Section 2. The result can be easily extended to an arbitrary number of hidden layers. Section 3 discusses the theoretical results, includes a numerical study to showcase some of the theoretical implications, and, as an example, presents the limit for a three-layer neural network. The proof of the convergence theorem is in Section 4. The uniqueness of a solution to the limiting system is established in Section 5. The proof of the limit of the first layer (i.e., the proof of Lemma 2.2) is provided in Appendix A.
2 Main Results
Let us start by presenting our assumptions, which will hold throughout the paper. We shall work on a filtered probability space on which all the random variables are defined. The probability space is equipped with a filtration that is right continuous and contains all negligible sets.
We assume the following conditions throughout the paper.
The activation function $\sigma \in C^2_b(\mathbb{R})$, i.e., it is twice continuously differentiable and bounded.
The distribution $\pi$ of the data has compact support, i.e., the data $(X,Y)$ take values in a compact set.
The random initializations of the parameters, i.e., $C^i_0$, $W^{1,j}_0$, and $W^{2,i,j}_0$, are i.i.d. and take values in compact sets. We denote the probability distributions of $C^i_0$, $W^{1,j}_0$, and $W^{2,i,j}_0$ by fixed probability measures, one for each group of parameters.
For reasons that will become clearer later on, we shall choose the learning rates to scale with the number of hidden units in each layer, as specified in (2.1). Note that the weights in the second layer are trained faster than the other parameters. This choice of learning rates is necessary for convergence to a non-trivial limit as $N_1, N_2 \to \infty$. If the parameters in all the layers are trained with the same learning rate, it can be mathematically shown that the network will not train as $N_1$ and $N_2$ become large. We further explore this interesting fact in Section 3.1.
Define the empirical measure
The neural network’s output can be re-written in terms of the empirical measure:
Here $\langle f, \mu \rangle$ denotes the inner product (integral) of a function $f$ against a measure $\mu$; for example, the network output can be written as such an inner product of a test function with the empirical measure.
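As a sanity check of this notation, the pairing $\langle f, \mu \rangle$ against an empirical measure is simply an average of $f$ over the parameter samples. The sketch below (hypothetical sizes, tanh activation, and the single-layer case for simplicity) verifies that the direct average and the measure pairing agree:

```python
import numpy as np

rng = np.random.default_rng(2)

def inner(f, particles):
    """<f, nu> for the empirical measure nu = (1/N) * sum_j delta_{p_j}
    of a list of particles p_j: just the average of f over the samples."""
    return np.mean([f(p) for p in particles])

# Single-layer illustration: g(x) = <c * sigma(w . x), nu_N>.
d, N = 3, 500
particles = [(rng.normal(), rng.normal(size=d)) for _ in range(N)]
x = rng.normal(size=d)

g_sum = np.mean([c * np.tanh(w @ x) for c, w in particles])        # direct average
g_measure = inner(lambda p: p[0] * np.tanh(p[1] @ x), particles)   # <f, nu>
```

Both expressions perform the identical arithmetic, so they coincide exactly.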
Let us next define the time-scaled empirical measure
and the corresponding time-scaled neural network output is
At any time $t$, the scaled empirical measure is measure-valued. It is a random element of $D_E([0,T])$, where $D_E([0,T])$ is the set of maps from $[0,T]$ into $E$ which are right-continuous and which have left-hand limits, with $E$ the space of measures on the parameter space.
We study convergence using iterated limits. We first let the number of units in the first layer, $N_1$, and the number of stochastic gradient descent steps grow to infinity together. Then, we let the number of units in the second layer $N_2 \to \infty$.
We begin by letting the number of hidden units in the first layer $N_1 \to \infty$.
The scaled empirical measure process converges in distribution, as $N_1 \to \infty$, to a measure-valued process taking values in $D_E([0,T])$. For every $t \in [0,T]$, the limit satisfies the measure evolution equation (2.3)-(2.4).
The proof of this lemma concerns the limit in the first layer as the number of hidden units in the first layer grows while the number of hidden units in the second layer is held fixed. The proof is analogous to the single-layer case, and the details are presented for completeness in Appendix A. ∎
In particular, the neural network output converges in probability as $N_1 \to \infty$.
The next step is to study the limit as $N_2 \to \infty$. To do so, we study the limit, as $N_2 \to \infty$, of the random ODE whose law is characterized by (2.3)-(2.4). Our main goal is the characterization of the limit of the neural network output. The following convergence result characterizes the neural network output for large $N_1$ and $N_2$.
For any $t \in [0,T]$ and input $x$, the neural network output converges in probability as $N_2 \to \infty$ (a sequence of random variables converges in probability to a limit if, for every $\varepsilon > 0$, the probability of a deviation from the limit larger than $\varepsilon$ tends to zero), where the limit is characterized below.
Notice that the limit can equivalently be written as the solution of an integral equation.
3 Discussion on the limiting results and extensions to multi-layer networks with greater depth
In Subsection 3.1, we discuss some of the implications of our theoretical convergence results and present related numerical results. In Subsection 3.2, we show that the procedure can be extended to treat deep neural networks with more than two hidden layers. General challenges in the study of multi-layer neural networks are explored in Subsection 3.3.
3.1 Discussion on the limiting results
It is instructive to notice that the results of this paper recover the known single-layer results (see also [32, 35]) if we restrict attention to the one-layer case. Indeed, specializing (1.1)-(1.2) to a single hidden layer, we get the single-layer neural network
\[
g^{N}_\theta(x) = \frac{1}{N} \sum_{i=1}^{N} C^i \, \sigma\big( W^{i} \cdot x \big),
\]
with the corresponding empirical measure of the parameters becoming
\[
\mu^{N}_t = \frac{1}{N} \sum_{i=1}^{N} \delta_{\big( C^i_{\lfloor N t \rfloor},\, W^i_{\lfloor N t \rfloor} \big)}.
\]
In that case notice that we can simply write
\[
g^{N}_t(x) = \big\langle c \, \sigma( w \cdot x ),\, \mu^{N}_t \big\rangle.
\]
Then, it is relatively straightforward to notice that the result of Lemma 2.2 boils down to the known one-layer convergence results, see also [32, 35]. Namely, if we write $\bar{\mu}_t$ for the limit in probability of $\mu^{N}_t$, we get that
\[
g_t(x) = \big\langle c \, \sigma( w \cdot x ),\, \bar{\mu}_t \big\rangle. \tag{3.2}
\]
It is useful to compare the limits of the neural network output in the one layer and two layer cases, (3.2) and (2.6) respectively. It is clear that the two layer case is more involved, which provides some intuition for the increased complexity of deep neural networks when compared to shallow neural networks.
Perhaps more interestingly, the law of large numbers for a single hidden-layer network indicates that the network will converge to a deterministic limit in probability. That is, after a certain point, adding more hidden units will not increase the accuracy. In order for the trained network to change, more layers must be added to the network. This matches the behavior of neural networks in practice, where we see that deep neural networks (more layers) outperform shallow neural networks (only a few layers).
Theorem 2.3 shows that, under an appropriate choice of the learning rates, there is a well-defined limit for the neural network output, given by (2.5), and, as a consequence, for the objective function as well.
The parametrization of the learning rates, see (2.1), indicates that one should use larger learning rates for the weights that connect the different layers, the $W^{2,i,j}$ in this case, as opposed to the weights that are specific to the individual layers, the $W^{1,j}$ and $C^i$ in this case. Notice that this is also true for the three-layer network outlined in Subsection 3.2 below.
As will be explained in Section 3.2, the law of large numbers can be extended to deep neural networks with an arbitrary number of layers. The law of large numbers will only hold under a certain choice of the learning rates: the learning rates need to be scaled with the number of hidden units in each layer. For a multi-layer network with $\ell$ layers, the learning rate for each group of parameters is scaled with the layer widths $N_1, \ldots, N_\ell$, where $N_i$ is the number of hidden units in the $i$-th layer.
If the learning rates are held constant in the number of hidden units, it turns out that the network will not train as the number of hidden units grows (i.e., in the limit, the network parameters will remain at their initial conditions).
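The mechanism is visible already in a single-layer toy example: with the mean-field normalization $1/N$, the gradient with respect to each output weight is $O(1/N)$, so an $N$-independent learning rate moves the parameters a vanishing amount as $N$ grows. The numpy sketch below (all sizes, seeds, and learning-rate values are illustrative choices, not the paper's experiment) measures the average parameter displacement after a fixed number of SGD steps:

```python
import numpy as np

def c_update_size(N, lr, steps=10, seed=0):
    """Average displacement |c - c_0| of the output weights after `steps`
    SGD steps on a single-layer mean-field net g(x) = (1/N) sum_i c_i tanh(w_i.x).
    Only c is trained, on one fixed sample, which isolates the effect of
    the 1/N normalization on the gradient (a toy sketch)."""
    rng = np.random.default_rng(seed)
    d = 3
    w = rng.normal(size=(N, d))
    c = rng.normal(size=N)
    c0 = c.copy()
    x, y = rng.normal(size=d), 1.0
    h = np.tanh(w @ x)               # hidden-unit outputs, fixed
    for _ in range(steps):
        g = c @ h / N
        c += lr * (y - g) * h / N    # descent step; the gradient is O(1/N)
    return np.abs(c - c0).mean()

# With an N-independent learning rate, the parameters barely move as N grows;
# scaling the learning rate by N keeps the motion of order one.
small = c_update_size(N=100, lr=1.0)
large = c_update_size(N=10000, lr=1.0)
scaled = c_update_size(N=10000, lr=10000.0)
```

Comparing the three displacements shows the wider network freezing under the unscaled learning rate, while the scaled learning rate restores order-one parameter motion.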
The necessity of scaling the learning rates in the asymptotic regime of large numbers of hidden units (i.e., wide layers) is one of the interesting products of the mean-field limit analysis. A numerical example is presented in Figure 1 below. A deep neural network is trained to classify images from the CIFAR10 dataset. The CIFAR10 dataset contains 60,000 color images in 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck). The dataset is divided into 50,000 training images and 10,000 test images. Each image has 32 × 32 pixels. The goal is to train a neural network to correctly classify each image based solely on the image pixels as an input. The neural network we use has the mean-field normalizations in each layer; it consists of convolution layers followed by two fully-connected layers. We first train the neural network using the scaled learning rates. Then, we also train the neural network with the standard stochastic gradient descent algorithm (no scaling of the learning rates). Using the scaled learning rates, we achieve a high test accuracy. However, without the scalings, the neural network does not train (i.e., it remains at a very low accuracy).
3.2 Extension to deep neural networks with more layers
The procedure developed in this paper naturally extends to deep neural networks with more layers. For brevity, let us present the result in the case of three layers; the situation for more layers is the same, albeit with more complicated algebra. A deep neural network with three hidden layers takes the form
\[
g^{N_1,N_2,N_3}_\theta(x) = \frac{1}{N_3} \sum_{i=1}^{N_3} C^i \, \sigma\!\left( \frac{1}{N_2} \sum_{j=1}^{N_2} W^{3,i,j} \, \sigma\!\left( \frac{1}{N_1} \sum_{k=1}^{N_1} W^{2,j,k} \, \sigma\big( W^{1,k} \cdot x \big) \right) \right),
\]
which can also be written layer by layer as
\[
H^{1,k}(x) = \sigma\big( W^{1,k} \cdot x \big), \qquad
H^{2,j}(x) = \sigma\!\left( \frac{1}{N_1} \sum_{k=1}^{N_1} W^{2,j,k} H^{1,k}(x) \right), \qquad
H^{3,i}(x) = \sigma\!\left( \frac{1}{N_2} \sum_{j=1}^{N_2} W^{3,i,j} H^{2,j}(x) \right),
\]
with output $g^{N_1,N_2,N_3}_\theta(x) = \frac{1}{N_3} \sum_{i=1}^{N_3} C^i H^{3,i}(x)$, where $1 \le k \le N_1$, $1 \le j \le N_2$, and $1 \le i \le N_3$. The neural network model has parameters
\[
\theta = \big( C^i, W^{1,k}, W^{2,j,k}, W^{3,i,j} \big)_{i,j,k},
\]
which must be estimated from data. The number of hidden units in the first layer is $N_1$, the number of hidden units in the second layer is $N_2$, and the number of hidden units in the third layer is $N_3$. Naturally, the loss function now becomes
\[
\mathcal{L}(\theta) = \frac{1}{2} \, \mathbb{E}_{(X,Y) \sim \pi} \Big[ \big( Y - g^{N_1,N_2,N_3}_\theta(X) \big)^2 \Big],
\]
where the data $(X,Y) \sim \pi$.
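As a concrete illustration of the three-layer architecture, a minimal numpy forward pass (hypothetical sizes, tanh activation, all choices illustrative) with the per-layer mean-field normalizations might look as follows:

```python
import numpy as np

rng = np.random.default_rng(3)

def three_layer_net(x, W1, W2, W3, C, sigma=np.tanh):
    """Mean-field network with three hidden layers, normalizing each
    layer's input sum by the previous layer's width, and the output by
    the last width, mirroring the two-layer case."""
    H1 = sigma(W1 @ x)                  # (N1,)
    H2 = sigma(W2 @ H1 / W1.shape[0])   # (N2,), 1/N1 normalization
    H3 = sigma(W3 @ H2 / W2.shape[0])   # (N3,), 1/N2 normalization
    return C @ H3 / C.shape[0]          # scalar, 1/N3 normalization

# Hypothetical sizes.
d, N1, N2, N3 = 3, 40, 30, 20
W1 = rng.normal(size=(N1, d))
W2 = rng.normal(size=(N2, N1))
W3 = rng.normal(size=(N3, N2))
C = rng.normal(size=N3)
x = rng.normal(size=d)
out = three_layer_net(x, W1, W2, W3, C)
```

As in the two-layer sketch, the bounded activation keeps the output controlled by the average magnitude of the output weights, uniformly in the widths.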
The stochastic gradient descent (SGD) algorithm for estimating the parameters proceeds exactly as in the two-layer case, with a separate learning rate for each group of parameters; the learning rates may depend upon $N_1$, $N_2$, and $N_3$. The parameters at step $k$ are denoted $\theta_k$, and the data used at each SGD step are i.i.d. samples of the random variables $(X, Y)$. We assume a condition analogous to Assumption 2.1.
Let us now choose the learning rates scaled with the layer widths, analogously to (2.1).
Similar to before, define the empirical measure
The time-scaled empirical measure is
and the corresponding time-scaled neural network output is