Mean Field Analysis of Deep Neural Networks

We analyze multi-layer neural networks in the asymptotic regime of simultaneously (A) large network sizes and (B) large numbers of stochastic gradient descent training iterations. We rigorously establish the limiting behavior of the multi-layer neural network output. The limit procedure is valid for any number of hidden layers, and it naturally also describes the limiting behavior of the training loss. The main ideas that we explore are to (a) sequentially take the limits of each hidden layer and (b) characterize the evolution of the parameters in terms of their initialization. The limit satisfies a system of integro-differential equations.


1 Introduction

Deep learning has achieved immense practical success, revolutionizing fields such as image, text, and speech recognition. It is also increasingly being used in engineering, medicine, and finance. However, despite their success in practice, there is currently limited mathematical understanding of deep neural networks. This has motivated recent mathematical research on deep learning models such as [26], [27], [28], [29], [30], [25], [36], [37], [32], and [35].

Neural networks are nonlinear statistical models whose parameters are estimated from data using stochastic gradient descent (SGD) methods. Deep learning uses neural networks with many layers (i.e., “deep” neural networks), which produces a highly flexible, powerful and effective model in practice.

Applications of deep learning include image recognition (see [21] and [15]), facial recognition [46], driverless cars [5], speech recognition (see [21], [4], [22], and [47]), and text recognition (see [49] and [44]). Neural networks are also finding increasingly many applications in engineering, robotics, medicine, and finance (see [23], [24], [45], [17], [34], [3], [38], [39], [40], and [41]).

In this paper we characterize multi-layer neural networks (i.e., “deep neural networks”) in the asymptotic regime of large network sizes and large numbers of stochastic gradient descent iterations. We rigorously prove the limit of the neural network output as the number of hidden units increases to infinity. The proof relies upon weak convergence analysis for stochastic processes. The result can be considered a “law of large numbers” for the neural network’s output when both the network size and the number of stochastic gradient descent steps grow to infinity.

Recently, laws of large numbers and central limit theorems have been established for neural networks with a single hidden layer; see [36], [37], [32], and [35]. For a single hidden layer, one can directly study the weak convergence of the empirical measure of the parameters. However, in a neural network with multiple layers, there is a closure problem when studying the empirical measure of the parameters (explained in Section 3.3). Consequently, the law of large numbers for a multi-layer network is not a straightforward extension of the single-layer result, and the analysis involves unique challenges which require new approaches. In this paper we establish the limiting behavior of the output of the neural network.

To illustrate the idea, we consider a multi-layer neural network with two hidden layers:

(1.1)

As we will see in Section 3.2, the limit procedure can be extended to neural networks with three layers and subsequently to neural networks with any fixed number of hidden layers.

Notice now that (1.1) can also be written as

(1.2)

The neural network model has parameters which must be estimated from data. The number of hidden units in the first layer is $N_1$ and the number of hidden units in the second layer is $N_2$. The multi-layer neural network (1.2) includes a normalization factor of $1/N_1$ in the first hidden layer and $1/N_2$ in the second hidden layer.
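For concreteness, a two-hidden-layer network with these normalizations can be written in the following representative form; the activation $\sigma$ and the parameter names $C^i$, $W^{1,j}$, $W^{2,i,j}$ are illustrative notation chosen to match the structure described above, not a transcription of (1.1)-(1.2):

\[
g^{N_1,N_2}_\theta(x) \;=\; \frac{1}{N_2}\sum_{i=1}^{N_2} C^i\,
\sigma\!\left(\frac{1}{N_1}\sum_{j=1}^{N_1} W^{2,i,j}\,\sigma\big(W^{1,j}\cdot x\big)\right),
\qquad
\theta = \big(C^1,\dots,C^{N_2},\;W^{1,1},\dots,W^{1,N_1},\;W^{2,1,1},\dots,W^{2,N_2,N_1}\big).
\]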

The loss function is

(1.3)

where the data are drawn from a fixed distribution. The goal is to estimate a set of parameters which minimizes the objective function (1.3).
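A representative choice consistent with this setup is the mean-squared error over the data distribution; the quadratic form and the symbol $\pi$ for the data distribution are illustrative assumptions rather than a transcription of (1.3):

\[
L(\theta) \;=\; \frac{1}{2}\,\mathbb{E}_{(X,Y)\sim\pi}\!\left[\big(Y - g^{N_1,N_2}_\theta(X)\big)^2\right].
\]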

The stochastic gradient descent (SGD) algorithm for estimating the parameters is, for each SGD step $k$,

(1.4)

where the three learning rates (one per parameter group) may depend upon $N_1$ and $N_2$. The parameters at step $k$ are updated using data points drawn i.i.d. from the data distribution.
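To make one update concrete, the following NumPy sketch performs a single SGD step for the representative two-hidden-layer network above with a quadratic per-sample loss. The activation (tanh), the learning-rate values, and all variable names are illustrative assumptions; in particular, the placeholder rates lr_c, lr_w1, lr_w2 stand in for the specific scalings chosen in (2.1).

import numpy as np

def sgd_step(x, y, W1, W2, C, lr_w1, lr_w2, lr_c):
    """One SGD step for the representative mean-field network
    g(x) = (1/N2) * sum_i C[i] * sigma( (1/N1) * sum_j W2[i,j] * sigma(W1[j] @ x) ),
    with sigma = tanh and the illustrative per-sample loss 0.5 * (y - g(x))**2."""
    N1, N2 = W1.shape[0], W2.shape[0]

    # Forward pass with the mean-field normalizations 1/N1 and 1/N2.
    H = np.tanh(W1 @ x)              # first hidden layer, shape (N1,)
    Z = (W2 @ H) / N1                # pre-activation of second layer, shape (N2,)
    S = np.tanh(Z)                   # second hidden layer, shape (N2,)
    g = np.dot(C, S) / N2            # network output, scalar

    # Backward pass (gradients of the per-sample quadratic loss).
    dg = -(y - g)                                         # dL/dg
    dC = dg * S / N2                                      # dL/dC,  shape (N2,)
    dZ = dg * (C / N2) * (1.0 - S ** 2)                   # dL/dZ,  shape (N2,)
    dW2 = np.outer(dZ, H) / N1                            # dL/dW2, shape (N2, N1)
    dH = (W2.T @ dZ) / N1                                 # dL/dH,  shape (N1,)
    dW1 = (dH * (1.0 - H ** 2))[:, None] * x[None, :]     # dL/dW1, shape (N1, d)

    # Parameter updates with separate learning rates per parameter group.
    return W1 - lr_w1 * dW1, W2 - lr_w2 * dW2, C - lr_c * dC

# Example usage with random data and initialization (illustrative sizes and rates).
rng = np.random.default_rng(0)
d, N1, N2 = 5, 100, 50
W1, W2, C = rng.normal(size=(N1, d)), rng.normal(size=(N2, N1)), rng.normal(size=N2)
x, y = rng.normal(size=d), 1.0
W1, W2, C = sgd_step(x, y, W1, W2, C, lr_w1=0.1, lr_w2=0.1 * N1, lr_c=0.1)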

The goal of this paper is to characterize the limit of an appropriate rescaling of the multi-layer neural network output as both the number of hidden units and the stochastic gradient descent iterates become large. This is the topic of Theorem 2.3. The idea is to first take $N_1 \to \infty$ with $N_2$ fixed. In Lemma 2.2, we prove that the empirical measure of the parameters converges, as $N_1 \to \infty$ (with $N_2$ fixed), to a limit measure which satisfies a measure evolution equation. This naturally implies a limit for the neural network output as $N_1 \to \infty$. The next step is to take $N_2 \to \infty$. Theorem 2.3 proves that the limiting distribution can be represented via a system of ODEs.

The rest of the paper is organized as follows. Our main result, which characterizes the asymptotic behavior of a neural network with two hidden layers when the number of hidden units becomes large, is presented in Section 2. The result can be easily extended to an arbitrary number of hidden layers. Section 3 discusses the theoretical results, includes a numerical study to showcase some of the theoretical implications, and, as an example, presents the limit for a three-layer neural network. The proof of the convergence theorem is in Section 4. The uniqueness of a solution to the limiting system is established in Section 5. The proof of the limit of the first layer (i.e., the proof of Lemma 2.2) is provided in Appendix A.

2 Main Results

Let us start by presenting our assumptions, which will hold throughout the paper. We shall work on a filtered probability space on which all the random variables are defined. The probability space is equipped with a filtration that is right-continuous and contains all $\mathbb{P}$-negligible sets.

Assumption 2.1.

We assume the following conditions throughout the paper.

  • The activation function $\sigma \in C^2_b(\mathbb{R})$, i.e., it is twice continuously differentiable and bounded.

  • The data distribution has compact support, i.e., the data takes values in a compact set.

  • The random initializations of the parameters are i.i.d. and take values in compact sets.

  • The probability distributions of the initial parameters admit continuous probability density functions.

We fix notation for the probability distributions of the initial parameters in each of the three parameter groups; these initial laws enter the limiting equations below.

For reasons that will become clearer later on, we shall choose the learning rates to be

(2.1)

Note that the weights in the second layer are trained faster than the other parameters. This choice of learning rates is necessary for convergence to a non-trivial limit as $N_1, N_2 \to \infty$. If the parameters in all the layers are trained with the same learning rate, it can be mathematically shown that the network will not train as $N_1$ and $N_2$ become large. We further explore this interesting fact in Section 3.1.
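A heuristic for this choice, based on the representative two-layer form given in Section 1 (an illustration, not part of the proof), is that the mean-field normalizations give the three parameter groups gradients of different orders of magnitude:

\[
\frac{\partial g^{N_1,N_2}_\theta(x)}{\partial C^i} = O\!\left(\frac{1}{N_2}\right),
\qquad
\frac{\partial g^{N_1,N_2}_\theta(x)}{\partial W^{1,j}} = O\!\left(\frac{1}{N_1}\right),
\qquad
\frac{\partial g^{N_1,N_2}_\theta(x)}{\partial W^{2,i,j}} = O\!\left(\frac{1}{N_1 N_2}\right).
\]

The inter-layer weights $W^{2,i,j}$ therefore receive the weakest raw gradient signal, which is why their learning rate must be inflated for their updates to survive in the limit; the precise scalings are those of (2.1).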

Define the empirical measure

(2.2)

The neural network's output can be re-written in terms of the empirical measure, by pairing test functions with measures.
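Here we use the standard pairing of an integrable test function $f$ with a measure $\mu$,

\[
\langle f, \mu \rangle \;:=\; \int f(w)\,\mu(dw),
\]

so that averages over the hidden units can be expressed as integrals against the empirical measure of the parameters (the symbols $f$, $\mu$, $w$ are generic).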

Let us next define the time-scaled empirical measure

and the corresponding time-scaled neural network output is

At any time $t \in [0,T]$, the scaled empirical measure is measure-valued; it is a random element of $D_E([0,T])$, the set of maps from $[0,T]$ into the state space $E$ of measures which are right-continuous and have left-hand limits.

We study convergence using iterated limits. We first let the number of units in the first layer, $N_1$, and the number of stochastic gradient descent steps grow to infinity together. Then, we let the number of units in the second layer, $N_2$, grow to infinity.

We begin by letting the number of hidden units in the first layer, $N_1$, grow to infinity.

Lemma 2.2.

As $N_1 \to \infty$, the time-scaled empirical measure process converges in distribution to a measure-valued limit process. For every $t \in [0,T]$, the limit satisfies the measure evolution equation

(2.3)

where

(2.4)
Proof.

The proof of this lemma concerns the limit of the first layer as the number of hidden units in the first layer grows while the number of hidden units in the second layer is held fixed. The proof is analogous to the proof in [36] and the details are presented for completeness in Appendix A. ∎

Lemma 2.2 studies the limit of the empirical measure as $N_1 \to \infty$ with $N_2$ fixed. The limit is characterized by the stochastic evolution equation (2.3)-(2.4). Notice that Lemma 2.2 immediately implies that the corresponding neural network output converges in probability as $N_1 \to \infty$.

The next step is to study the limit as $N_2 \to \infty$. To do so, we study the limit as $N_2 \to \infty$ of the random ODE whose law is characterized by (2.3)-(2.4). Our main goal is the characterization of the limit of the neural network output. The following convergence result characterizes the neural network output for large $N_1$ and $N_2$.

Theorem 2.3.

For any time $t$ and input $x$, the rescaled neural network output converges in probability as $N_2 \to \infty$ (recall that a sequence $\zeta^{N}$ converges to $\zeta$ in probability if, for all $\epsilon > 0$, $\lim_{N \to \infty} \mathbb{P}\big(|\zeta^{N} - \zeta| > \epsilon\big) = 0$) to a limit, where we have that

with

(2.5)

The system in (2.5) has a unique solution. In addition, for the pre-limit output defined through Lemma 2.2, we have the following rate of convergence

for some finite constant.

Notice that we can also write that the limit of the neural network output satisfies

(2.6)

The proof of Theorem 2.3 is given in Section 4. Section 3 discusses some of its consequences as well as challenges that come up in the study of the limiting behavior of multi-layer neural networks as the number of the hidden units grows.

3 Discussion on the limiting results and extensions to multi-layer networks with greater depth

In Subsection 3.1, we discuss some of the implications of our theoretical convergence results and present related numerical results. In Subsection 3.2, we show that the procedure can be extended to treat deep neural networks with more than two hidden layers. General challenges in the study of multi-layer neural networks are explored in Subsection 3.3.

3.1 Discussion on the limiting results

It is instructive to notice that the results of this paper recover the results of [36] (see also [32, 35]) if we restrict attention to the one-layer case. Indeed, specializing the choices in (1.1)-(1.2) to a single hidden layer, we get the single-layer neural network

with the corresponding empirical measure of the parameters becoming

(3.1)

In that case, notice that the network output can simply be written as an integral against the empirical measure of the parameters.

Then, it is relatively straightforward to see that the result of Lemma 2.2 boils down to the one-layer convergence results of [36]; see also [32, 35]. Namely, writing the limit in probability of the network output, we get that

(3.2)
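For orientation, in the representative single-layer notation (parameters $(c, w)$ for each hidden unit and limiting parameter measure $\bar\mu_t$; both symbols are ours), the single-layer mean-field limit established in [36] takes the form

\[
g_t(x) \;=\; \int c\,\sigma(w \cdot x)\,\bar\mu_t(dc, dw),
\]

i.e., the limiting network output is the integral of a single neuron's response against the limiting distribution of the parameters.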

It is useful to compare the limits of the neural network output in the one-layer and two-layer cases, (3.2) and (2.6) respectively. It is clear that the two-layer case is more involved, which provides some intuition for the increased complexity of deep neural networks when compared to shallow neural networks.

Perhaps more interestingly, the law of large numbers for a single hidden-layer network indicates that the network will converge to a deterministic limit in probability. That is, after a certain point, adding more hidden units will not increase the accuracy. In order for the trained network to change, more layers must be added to the network. This matches the behavior of neural networks in practice, where we see that deep neural networks (more layers) outperform shallow neural networks (only a few layers).

In addition, Theorem 2.3 gives us the limiting behavior of the objective function from (1.3) after proper rescaling. Indeed, we have that

where the limiting network output is given by (2.5).
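In particular, with the representative quadratic objective used for illustration in Section 1 (and $\pi$ again denoting the data distribution), the limiting objective takes the form

\[
\frac{1}{2}\,\mathbb{E}_{(X,Y)\sim\pi}\!\left[\big(Y - g_t(X)\big)^2\right],
\]

where $g_t$ stands for the limiting network output characterized by (2.5) (the symbol $g_t$ is our notation).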

Theorem 2.3 shows that under an appropriate choice of the learning rates there is a well-defined limit for the neural network output and, as a consequence, for the objective function as well.

The parametrization of the learning rates, see (2.1), indicates that one should use larger learning rates for the weights that connect the different hidden layers (the inter-layer weights $W^{2,i,j}$ in the representative notation), as opposed to the weights that are specific to the individual layers. Notice that this is also the case for the three-layer network outlined in Subsection 3.2 below.

As will be explained in Section 3.2, the law of large numbers can be extended to deep neural networks with an arbitrary number of layers. The law of large numbers will only hold under a certain choice of the learning rates: the learning rates need to be scaled with the number of hidden units in each layer. For a multi-layer network with $L$ hidden layers, the learning rate of each parameter group is scaled with the layer widths, where $N_\ell$ denotes the number of hidden units in the $\ell$-th layer.

If the learning rates are constant in the number of hidden units, it turns out that the network will not train as the layer widths grow (i.e., in the limit, the network parameters will remain at their initial conditions).

The necessity of scaling the learning rates in the asymptotic regime of large numbers of hidden units (i.e., wide layers) is one of the interesting products of the mean-field limit analysis. A numerical example is presented in Figure 1 below. A deep neural network is trained to classify images from the CIFAR10 dataset [31]. The CIFAR10 dataset contains 60,000 color images in 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck). The dataset is divided into 50,000 training images and 10,000 test images. Each image has 32×32 pixels. The goal is to train a neural network to correctly classify each image based solely on the image pixels as an input. The neural network we use has the mean-field normalization $1/N_\ell$ in each layer $\ell$. There are several convolutional layers, each with the same number of channels, followed by two fully-connected layers, each with the same number of units. We first train the neural network using the scaled learning rates. Then, we also train the neural network with the standard stochastic gradient descent algorithm (no scaling of the learning rates). Using the scaled learning rates, we achieve a high test accuracy. However, without the scalings, the neural network does not train (i.e., it remains at a very low accuracy).

Figure 1: Performance of deep neural network on CIFAR10 dataset with and without scaled learning rates.
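The experiment itself is not reproduced here, but the following PyTorch sketch shows how width-dependent, per-parameter-group learning rates of this kind can be attached to a standard training loop. The fully-connected architecture, the tanh activation, the chosen widths, and the factor $N_1$ applied to the inter-layer weights are illustrative placeholders, not the convolutional architecture or the exact rates behind Figure 1.

import torch
import torch.nn as nn

# Hypothetical two-hidden-layer network with mean-field normalizations 1/N1 and 1/N2.
class MeanFieldNet(nn.Module):
    def __init__(self, d_in, N1, N2):
        super().__init__()
        self.N1, self.N2 = N1, N2
        self.W1 = nn.Linear(d_in, N1, bias=False)   # first-layer weights
        self.W2 = nn.Linear(N1, N2, bias=False)     # inter-layer weights
        self.C = nn.Linear(N2, 1, bias=False)       # output weights

    def forward(self, x):
        h = torch.tanh(self.W1(x))                  # first hidden layer
        z = torch.tanh(self.W2(h) / self.N1)        # second hidden layer (1/N1 factor)
        return self.C(z) / self.N2                  # output (1/N2 factor)

net = MeanFieldNet(d_in=3 * 32 * 32, N1=1000, N2=1000)

base_lr = 0.1
optimizer = torch.optim.SGD(
    [
        {"params": net.W1.parameters(), "lr": base_lr},
        {"params": net.C.parameters(), "lr": base_lr},
        # Inter-layer weights are trained "faster": the width-dependent factor
        # below is a placeholder standing in for the precise scaling in (2.1).
        {"params": net.W2.parameters(), "lr": base_lr * net.N1},
    ],
    lr=base_lr,
)

# A standard training loop over CIFAR10 mini-batches would then call
# optimizer.zero_grad(), loss.backward(), and optimizer.step() as usual.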

3.2 Extension to deep neural networks with more layers

The procedure developed in this paper naturally extends to deep neural networks with more layers. For brevity, let us present the result in the case of three layers. The situation for more layers is the same, albeit with more complicated algebra. A deep neural network with three layers takes the form

(3.3)

which can also be written as

(3.4)

The neural network model has parameters which must be estimated from data. The number of hidden units in the first layer is $N_1$, the number of hidden units in the second layer is $N_2$, and the number of hidden units in the third layer is $N_3$. Naturally, the loss function is now the analogue of (1.3), with the three-layer output in place of the two-layer output and the data drawn from the same distribution.
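In the same representative (and illustrative) notation as in the two-layer case, a three-hidden-layer network with these normalizations can be written as

\[
g^{N_1,N_2,N_3}_\theta(x) \;=\; \frac{1}{N_3}\sum_{i=1}^{N_3} C^i\,
\sigma\!\left(\frac{1}{N_2}\sum_{j=1}^{N_2} W^{3,i,j}\,
\sigma\!\left(\frac{1}{N_1}\sum_{k=1}^{N_1} W^{2,j,k}\,\sigma\big(W^{1,k}\cdot x\big)\right)\right),
\]

with a normalization factor of $1/N_\ell$ in the $\ell$-th hidden layer.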

The stochastic gradient descent (SGD) algorithm for estimating the parameters is

where

(3.6)

where the learning rates for the different parameter groups may depend upon $N_1$, $N_2$, and $N_3$. The parameters at step $k$ are updated using data points drawn i.i.d. from the data distribution. We assume a condition analogous to Assumption 2.1.

Let us now choose the learning rates to be

Similarly to before, define the empirical measure

The time-scaled empirical measure is

and the corresponding time-scaled neural network output is