# Mean Field Analysis of Neural Networks: A Central Limit Theorem

Machine learning has revolutionized fields such as image, text, and speech recognition. There is also growing interest in applying machine and deep learning methods in science, engineering, medicine, and finance. Despite their immense success in practice, there is limited mathematical understanding of neural networks. We mathematically study neural networks in the asymptotic regime of simultaneously (A) large network sizes and (B) large numbers of stochastic gradient descent training iterations. We rigorously prove that the neural network satisfies a central limit theorem. Our result describes the neural network's fluctuations around its mean-field limit. The fluctuations have a Gaussian distribution and satisfy a stochastic partial differential equation.


## 1 Introduction

Neural network models have been used as computational tools in many different contexts, including machine learning, pattern recognition, physics, neuroscience, and statistical mechanics; see for example [21]. Neural network models, particularly in machine learning, have achieved immense practical success over the past decade in fields such as image, text, and speech recognition. We mathematically analyze neural networks with a single hidden layer in the asymptotic regime of large network sizes and large numbers of stochastic gradient descent iterations. A law of large numbers was previously proven in [30]; see also [27, 29] for related results. This paper rigorously proves a central limit theorem (CLT) for the empirical distribution of the neural network parameters. The central limit theorem describes the fluctuations of the finite empirical distribution of the neural network parameters around its mean-field limit.

The mean-field limit is a law of large numbers for the empirical measure of the neural network parameters as $N \to \infty$, where $N$ is the number of hidden units. It satisfies a deterministic nonlinear partial differential equation. The mean-field limit is of course only accurate in the limit $N \to \infty$, and the central limit theorem provides a first-order correction in $N^{-1/2}$. The central limit theorem quantifies the fluctuations of the finite empirical measure around its mean-field limit. The limiting fluctuation satisfies a linear stochastic partial differential equation (SPDE) driven by a Gaussian process. In particular, our result shows that the trained neural network behaves as $\mu^N \approx \bar{\mu} + \frac{1}{\sqrt{N}} \bar{\eta}$, where $\mu^N$ is the empirical measure of the parameters for a neural network with $N$ hidden units, $\bar{\mu}$ is the mean-field limit, and $\bar{\eta}$ is the Gaussian correction from the central limit theorem.

The proof requires a linearization of the nonlinear pre-limit evolution equation for the empirical distribution of the neural network parameters. This linearization produces several remainder terms which must be shown to vanish in the limit (similar to a perturbation analysis for PDEs). The SPDE for the CLT is linearized around the nonlinear PDE for the mean-field limit $\bar{\mu}$. The CLT SPDE and mean-field limit PDE are therefore coupled. We must also show that the pre-limit evolution equation (which is in discrete time since stochastic gradient descent is a discrete-time algorithm) converges to a continuous-time limit.

The proof relies upon weak convergence analysis for interacting particle systems. The convergence analysis is technically challenging since the fluctuation of the empirical distribution is a signed-measure-valued process and its limit process turns out to be distribution-valued in the appropriate space. Unfortunately, the space of signed measures endowed with the weak topology is in general not metrizable (see [11] and [32] for further discussion of the space of signed measures). We study the convergence of the fluctuations as a process taking values in the dual space of an appropriate Sobolev space. We prove that the pre-limit fluctuation process is relatively compact in that space and that any limit point is unique in that space. In particular, we will use the dual space $W^{-J,2}(\Theta)$ of the Sobolev space $W^{J,2}_0(\Theta)$, with $\Theta$ a bounded subset of the appropriate Euclidean space and where $J$ is sufficiently large; see Section 2 for a detailed description. Since the pre-limit evolution equation has discrete updates, we study convergence in the Skorokhod space $D_{W^{-J,2}}([0,T])$. ($D_E([0,T])$ is the set of maps from $[0,T]$ into $E$ which are right-continuous and which have left-hand limits.)

Most of the literature on central limit theorems for interacting particle systems considers continuous-time systems, see for example [15, 26, 32, 5, 10, 6]. In contrast, in this article the pre-limit process is in discrete time and converges to a continuous-time limit process after an appropriate time rescaling. At a practical level, this shows that the relation between the number of particles (“hidden units” in the language of neural networks) and the number of stochastic gradient steps should be of the same order to have convergence and statistically good behavior. At a more mathematical level, this passage from discrete to continuous time produces a number of additional remainder terms that must be shown to vanish at the correct rate in order for a CLT to hold. We resolve all these issues for one-layer neural network models, rigorously establishing and characterizing the fluctuations limit.

Weak convergence and mean field analysis have been used in many other disciplines, including interacting particle systems in physics, neural networks in biology, and financial modeling; see for example [17], [18], [7], [8], [9], [4], [20], [12], [23], [28], [34], [31] and the references therein for a certainly incomplete list. Recently, [30], [35], [27], and [29] studied mean-field limits of machine learning algorithms, including neural networks. In this paper, we rigorously establish a central limit theorem for neural networks trained with stochastic gradient descent. [29] also formally studies corrections to the mean field limit.

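Consider the one-layer neural network

$$g^N_\theta(x) = \frac{1}{N} \sum_{i=1}^N c^i \sigma(w^i \cdot x), \tag{1.1}$$

where for every $i \in \{1, \dots, N\}$, $c^i \in \mathbb{R}$, $x, w^i \in \mathbb{R}^d$, and $\sigma: \mathbb{R} \to \mathbb{R}$. For notational convenience we shall interpret $w^i \cdot x$ as the standard scalar inner product. The neural network model has parameters $\theta = (c^1, \dots, c^N, w^1, \dots, w^N) \in \mathbb{R}^{(1+d)N}$, which must be estimated from data.

The neural network (1.1) takes a linear function of the original data, applies an element-wise nonlinear operation using the function $\sigma(\cdot)$, and then takes another linear function to produce the output. The activation function $\sigma(\cdot)$ is a nonlinear function such as a sigmoid or tanh function. The quantity $\sigma(w^i \cdot x)$ is referred to as the $i$-th "hidden unit", and the vector $(\sigma(w^1 \cdot x), \dots, \sigma(w^N \cdot x))$ is called the "hidden layer". The number of units in the hidden layer is $N$.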
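As a rough illustration of (1.1), the network output is just an average over hidden units of $c^i \sigma(w^i \cdot x)$. The sketch below assumes a tanh activation and uniformly sampled parameters; the function name `network_output` and the sample values are illustrative choices, not taken from the paper.

```python
import numpy as np

def network_output(c, w, x, sigma=np.tanh):
    """Evaluate g(x) = (1/N) * sum_i c[i] * sigma(w[i] . x).

    c: (N,) output weights, w: (N, d) hidden weights, x: (d,) input.
    Note the 1/N (mean-field) normalization rather than 1/sqrt(N).
    """
    return float(np.mean(c * sigma(w @ x)))

# Illustrative parameters (assumption: uniform on [-1, 1], not from the paper).
rng = np.random.default_rng(0)
N, d = 1000, 3
c = rng.uniform(-1.0, 1.0, size=N)
w = rng.uniform(-1.0, 1.0, size=(N, d))
x = np.array([0.5, -0.2, 0.1])
print(network_output(c, w, x))
```

The $1/N$ factor is what makes the output an average over "particles" $(c^i, w^i)$, which is the form the mean-field analysis relies on.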

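The objective function is

$$L(\theta) = \mathbb{E}_{Y,X}\left[ (Y - g^N_\theta(X))^2 \right], \tag{1.2}$$

where the data $(X, Y)$ is assumed to have a joint distribution $\pi(dx, dy)$. We shall write $\mathcal{X} \subset \mathbb{R}^d$ and $\mathcal{Y} \subset \mathbb{R}$ for the state spaces of $X$ and $Y$, respectively. The parameters $\theta$ are estimated using stochastic gradient descent:

$$\begin{aligned} c^i_{k+1} &= c^i_k + \frac{\alpha}{N} \left( y_k - g^N_{\theta_k}(x_k) \right) \sigma(w^i_k \cdot x_k), \\ w^{i,j}_{k+1} &= w^{i,j}_k + \frac{\alpha}{N} \left( y_k - g^N_{\theta_k}(x_k) \right) c^i_k \sigma'(w^i_k \cdot x_k) x^j_k, \quad j = 1, \dots, d, \end{aligned} \tag{1.3}$$

where $\alpha$ is the learning rate and $(x_k, y_k) \sim \pi(dx, dy)$. Stochastic gradient descent minimizes (1.2) using a sequence of noisy (but unbiased) gradient descent steps $\nabla_\theta \left[ (y_k - g^N_{\theta_k}(x_k))^2 \right]$. Stochastic gradient descent typically converges more rapidly than gradient descent for large datasets. For this reason, stochastic gradient descent is widely used in machine learning.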
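The update (1.3) can be sketched in a few lines. This is a minimal sketch assuming a tanh activation and a hypothetical toy data distribution (inputs uniform on $[-1,1]^d$, target $\sin$ of the first coordinate); `sgd_step` and all numerical values are illustrative and not part of the paper.

```python
import numpy as np

def sgd_step(c, w, x, y, alpha, sigma=np.tanh,
             dsigma=lambda z: 1.0 - np.tanh(z) ** 2):
    """One update of all N hidden units from a single sample (x, y).

    c: (N,) output weights, w: (N, d) hidden weights, x: (d,), y: scalar.
    Both updates use the parameters from the current iteration k.
    """
    N = c.shape[0]
    z = w @ x                          # pre-activations w^i . x
    err = y - np.mean(c * sigma(z))    # y_k - g(x_k) with current parameters
    c_new = c + (alpha / N) * err * sigma(z)
    w_new = w + (alpha / N) * err * (c * dsigma(z))[:, None] * x[None, :]
    return c_new, w_new

# Hypothetical toy data: y = sin(x_1), x uniform on [-1, 1]^d (an assumption).
rng = np.random.default_rng(0)
N, d, alpha, T = 200, 2, 1.0, 1.0
c = rng.uniform(-1.0, 1.0, size=N)
w = rng.uniform(-1.0, 1.0, size=(N, d))
for k in range(int(N * T)):            # number of steps proportional to N
    x = rng.uniform(-1.0, 1.0, size=d)
    c, w = sgd_step(c, w, x, np.sin(x[0]), alpha)
```

The step size is $\alpha / N$ and the loop runs for a number of iterations proportional to $N$, which is the joint scaling of network size and training steps studied in this paper.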

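Define the empirical measure

$$\nu^N_k(dc, dw) = \frac{1}{N} \sum_{i=1}^N \delta_{c^i_k, w^i_k}(dc, dw).$$

The neural network's output can be re-written in terms of the empirical measure:

$$g^N_{\theta_k}(x) = \left\langle c \sigma(w \cdot x), \nu^N_k \right\rangle.$$

$\langle f, h \rangle$ denotes the inner product of $f$ and $h$. For example, $\left\langle c \sigma(w \cdot x), \nu^N_k \right\rangle = \int c \sigma(w \cdot x) \, \nu^N_k(dc, dw)$.

The scaled empirical measure is

$$\mu^N_t = \nu^N_{\lfloor N t \rfloor}.$$

The scaled empirical measure is a random element of the Skorokhod space $D_E([0,T])$ with $E = \mathcal{M}(\mathbb{R}^{1+d})$, the space of probability measures on $\mathbb{R}^{1+d}$.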
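For an empirical measure, the pairing $\langle f, \nu^N_k \rangle$ is simply the average of $f$ over the $N$ particles, and the time rescaling picks out iterate $\lfloor N t \rfloor$. The following sketch makes both operations concrete; the helper names `pair` and `scaled_measure` and the stored `history` list are illustrative assumptions.

```python
import numpy as np

def pair(f, c, w):
    """<f, nu> for nu = (1/N) sum_i delta_{(c_i, w_i)}: the mean of f over particles.

    f must accept c of shape (N,) and w of shape (N, d) and return shape (N,).
    """
    return float(np.mean(f(c, w)))

def scaled_measure(history, t):
    """Return the particles representing mu^N_t = nu^N_{floor(N t)}.

    history[k] = (c_k, w_k) holds the parameters after k SGD steps.
    """
    N = history[0][0].shape[0]
    k = int(np.floor(N * t))
    return history[k]

# Usage: the network output at x is the pairing of c * sigma(w . x) with nu.
rng = np.random.default_rng(0)
c = rng.uniform(-1.0, 1.0, size=5)
w = rng.uniform(-1.0, 1.0, size=(5, 2))
x = np.array([0.3, -0.7])
output = pair(lambda c, w: c * np.tanh(w @ x), c, w)
```

One unit of rescaled time therefore corresponds to $N$ stochastic gradient steps.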

We shall work on a filtered probability space $(\Omega, \mathcal{F}, \mathbb{P})$ on which all the random variables are defined. The probability space is equipped with a filtration $\mathfrak{F}_t$ that is right continuous and contains all $\mathbb{P}$-null sets.

We impose the following conditions.

###### Assumption 1.1.

We have that

• The activation function $\sigma \in C^\infty_b(\mathbb{R})$, i.e. $\sigma$ is infinitely differentiable and bounded.

• The data distribution $\pi(dx, dy)$ is compactly supported.

• The sequence of data samples $(x_k, y_k)$ is i.i.d.

• The random initialization $(c^i_0, w^i_0)$ is i.i.d., generated from a distribution $\bar{\mu}_0$ with compact support.

### 1.1 Law of Large Numbers

[30] proves the mean-field limit as $N \to \infty$. The convergence theorems of [30] are summarized below.

###### Theorem 1.2.

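Assume Assumption 1.1. The scaled empirical measure $\mu^N_t$ converges in distribution to $\bar{\mu}_t$ in $D_E([0,T])$ as $N \to \infty$. For every $f \in C^2_b(\mathbb{R}^{1+d})$, $\bar{\mu}_t$ is the deterministic unique solution of the measure evolution equation

$$\left\langle f, \bar{\mu}_t \right\rangle = \left\langle f, \bar{\mu}_0 \right\rangle + \int_0^t \left( \int_{\mathcal{X} \times \mathcal{Y}} \alpha \left( y - \left\langle c' \sigma(w' \cdot x), \bar{\mu}_s \right\rangle \right) \left\langle \nabla(c \sigma(w \cdot x)) \cdot \nabla f, \bar{\mu}_s \right\rangle \pi(dx, dy) \right) ds, \tag{1.4}$$

where $\nabla f = (\partial_c f, \nabla_w f)$.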

###### Remark 1.3.

Since weak convergence to a constant implies convergence in probability, Theorem 1.2 leads to the stronger result of convergence in probability

$$\lim_{N \to \infty} \mathbb{P}\left\{ d_E(\mu^N, \bar{\mu}) \geq \delta \right\} = 0$$

for every $\delta > 0$, where $d_E$ is the metric for $D_E([0,T])$.

###### Corollary 1.4.

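Assume Assumption 1.1. Suppose that $\bar{\mu}_0$ admits a density $p_0(c, w)$ and that there exists a unique solution to the nonlinear partial differential equation

$$\begin{aligned} \frac{\partial p(t,c,w)}{\partial t} &= -\alpha \int_{\mathcal{X} \times \mathcal{Y}} \left( \left( y - \left\langle c' \sigma(w' \cdot x), p(t,c',w') \right\rangle \right) \partial_c \left[ \sigma(w \cdot x) p(t,c,w) \right] \right) \pi(dx, dy) \\ &\quad - \alpha \int_{\mathcal{X} \times \mathcal{Y}} \left( \left( y - \left\langle c' \sigma(w' \cdot x), p(t,c',w') \right\rangle \right) x \cdot \nabla_w \left[ c \sigma'(w \cdot x) p(t,c,w) \right] \right) \pi(dx, dy), \\ p(0,c,w) &= p_0(c,w). \end{aligned}$$

Then, we have that the solution to the measure evolution equation (1.4) is such that

$$\bar{\mu}_t(dc, dw) = p(t,c,w) \, dc \, dw.$$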

### 1.2 Main Result: A Central Limit Theorem

In this paper, we prove a central limit theorem for one-layer neural networks as the size of the network and the number of training steps become large. The central limit theorem quantifies the speed of convergence of the finite neural network to its mean-field limit as well as how the finite neural network fluctuates around the mean-field limit for large $N$.

Define the fluctuation process

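$$\eta^N_t = \sqrt{N} \left( \mu^N_t - \bar{\mu}_t \right).$$

We prove that $\eta^N \overset{d}{\to} \bar{\eta}$, where $\bar{\eta}$ satisfies a stochastic partial differential equation. This result characterizes the fluctuations of the finite empirical measure $\mu^N$ around its mean-field limit $\bar{\mu}$ for large $N$. The limit $\bar{\eta}$ has a Gaussian distribution. We study the convergence of $\eta^N$ in the space $D_{W^{-J,2}}([0,T])$, where $W^{-J,2} = W^{-J,2}(\Theta)$ is the dual of the Sobolev space $W^{J,2}_0(\Theta)$ with $\Theta$ a bounded domain. These spaces are described in detail in Section 2.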

###### Theorem 1.5.

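Assume Assumption 1.1 and let $J$ be sufficiently large (see Section 2). Let $0 < T < \infty$ be given. The sequence $\{\eta^N_t, t \in [0,T]\}_{N \in \mathbb{N}}$ is relatively compact in $D_{W^{-J,2}}([0,T])$. The sequence of processes $\{\eta^N_t, t \in [0,T]\}_{N \in \mathbb{N}}$ converges in distribution in $D_{W^{-J,2}}([0,T])$ to the process $\{\bar{\eta}_t, t \in [0,T]\}$, which, for every $f \in W^{J,2}_0$, satisfies the stochastic partial differential equation

$$\begin{aligned} \left\langle f, \bar{\eta}_t \right\rangle &= \left\langle f, \bar{\eta}_0 \right\rangle + \int_0^t \int_{\mathcal{X} \times \mathcal{Y}} \alpha \left( y - \left\langle c \sigma(w \cdot x), \bar{\mu}_s \right\rangle \right) \left\langle \nabla(c \sigma(w \cdot x)) \cdot \nabla f, \bar{\eta}_s \right\rangle \pi(dx, dy) \, ds \\ &\quad - \int_0^t \int_{\mathcal{X} \times \mathcal{Y}} \alpha \left\langle c \sigma(w \cdot x), \bar{\eta}_s \right\rangle \left\langle \nabla(c \sigma(w \cdot x)) \cdot \nabla f, \bar{\mu}_s \right\rangle \pi(dx, dy) \, ds + \left\langle f, \bar{M}_t \right\rangle. \end{aligned} \tag{1.5}$$

$\bar{M}_t$ is a mean-zero Gaussian process; see Lemma 5.2 for its covariance structure. Finally, the stochastic evolution equation (1.5) has a unique solution in $W^{-J,2}$, which implies that $\bar{\eta}$ is unique.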

The CLT SPDE (1.5) is coupled with the mean-field limit PDE (1.4). (1.4) is a deterministic nonlinear PDE while (1.5) is a stochastic linear PDE. The SPDE (1.5) is linear in $\bar{\eta}$ and driven by a Gaussian process; therefore, the limit $\bar{\eta}$ itself is a Gaussian process.

Theorem 1.5 indicates that for large $N$ the empirical distribution of the neural network's parameters behaves as

$$\nu^N_{\lfloor N \cdot \rfloor} = \mu^N_\cdot \approx \bar{\mu}_\cdot + \frac{1}{\sqrt{N}} \bar{\eta}_\cdot,$$

where $\bar{\eta}$ has a Gaussian distribution. Combined, Theorems 1.2 and 1.5 show that the number of particles ("hidden units" in the language of neural networks) and the number of stochastic gradient steps should be of the same order to have convergence and statistically good behavior. Under this scaling, as a measure-valued process, the empirical distribution of the parameters behaves as a Gaussian distribution with a specific variance-covariance structure (as indicated by Theorem 1.5).
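The $\sqrt{N}$ scaling can be checked numerically in the simplest case $t = 0$, where the parameters are i.i.d. and the statement reduces to the classical central limit theorem: $\langle f, \eta^N_0 \rangle$ is approximately Gaussian with mean zero and variance $\mathrm{Var}_{\bar{\mu}_0}(f)$. The sketch below covers only this initial-time case and does not simulate the SPDE; the uniform initial law and the test function $f(c, w) = c$ are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, replicas = 500, 2000

# Assumed initial law (illustrative): c^i_0 ~ Uniform(-1, 1), i.i.d.
# Test function f(c, w) = c, so <f, mu_bar_0> = 0 and Var(f) = 1/3.
c0 = rng.uniform(-1.0, 1.0, size=(replicas, N))

# <f, eta^N_0> = sqrt(N) * (<f, mu^N_0> - <f, mu_bar_0>), one value per replica.
fluct = np.sqrt(N) * (c0.mean(axis=1) - 0.0)

print("sample mean of <f, eta_0>:", fluct.mean())
print("sample variance of <f, eta_0>:", fluct.var(), "(classical CLT predicts 1/3)")
```

Without the $\sqrt{N}$ factor the deviations $\langle f, \mu^N_0 \rangle - \langle f, \bar{\mu}_0 \rangle$ would shrink to zero, which is the law of large numbers of Theorem 1.2.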

### 1.3 Outline of Paper

In Section 2 we present the Sobolev spaces with respect to which convergence is studied. The pre-limit evolution equation for the fluctuation process is derived in Section 3. Section 4 proves relative compactness. Section 5 derives the limiting SPDE (1.5). Uniqueness of the SPDE (1.5) is proven in Section 6. Section 7 collects these results and proves Theorem 1.5. Conclusions are in Section 8.

## 2 Sobolev Spaces

We study convergence in a Sobolev space [1]. Weighted Sobolev spaces have been previously used to study central limit theorems of mean field systems in papers such as [15], [26] and [32]. Weights are not necessary in this paper since $\mu^N_t$ and $\bar{\mu}_t$ are compactly supported uniformly with respect to $N$ and $t \in [0,T]$ (see Lemma 4.3).

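Let $\Theta \subset \mathbb{R}^{1+d}$ be a bounded domain. For any integer $J \in \mathbb{N}$, consider the space of real-valued functions $f$ with partial derivatives up to order $J$ which satisfy

$$\| f \|_J = \left( \sum_{|k| \leq J} \int_\Theta \left| D^k f(x) \right|^2 dx \right)^{1/2} < \infty.$$

Define the space $W^{J,2}_0(\Theta)$ as the closure of functions of class $C^\infty_0(\Theta)$ in the norm defined above. $C^\infty_0(\Theta)$ is the space of all functions in $C^\infty(\Theta)$ with compact support. (The space $W^{J,2}_0(\Theta)$ is frequently also denoted by $H^J_0(\Theta)$ in the literature.) $W^{J,2}_0(\Theta)$ is a Hilbert space (see Theorem 3.5 and Remark 3.33 in [1]) and has the inner product

$$\langle f, g \rangle_J = \sum_{|k| \leq J} \int_\Theta D^k f(x) D^k g(x) \, dx.$$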

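When , we write . $W^{-J,2}(\Theta)$ denotes the dual space of $W^{J,2}_0(\Theta)$, equipped with the norm

$$\| f \|_{-J} = \sup_{g \in W^{J,2}_0(\Theta)} \frac{\left| \langle f, g \rangle \right|}{\| g \|_J}.$$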
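To make the norm $\| f \|_J$ concrete, the following sketch approximates it for a function of one variable sampled on a uniform grid, using finite differences for the derivatives and the trapezoid rule for the integrals. This is only a one-dimensional numerical illustration under those assumptions; the function name `sobolev_norm_1d` is hypothetical and the paper itself works with $\Theta \subset \mathbb{R}^{1+d}$.

```python
import numpy as np

def sobolev_norm_1d(values, h, J):
    """Approximate ||f||_J on an interval from samples on a uniform grid.

    values: samples of f, h: grid spacing, J: highest derivative order.
    Sums the squared L2 norms of f, f', ..., f^(J) and takes the square root.
    """
    total = 0.0
    deriv = np.asarray(values, dtype=float)
    for _ in range(J + 1):
        sq = deriv ** 2
        total += h * (sq.sum() - 0.5 * (sq[0] + sq[-1]))  # trapezoid rule
        deriv = np.gradient(deriv, h)                     # next derivative
    return float(np.sqrt(total))

# Example: f(x) = sin(pi x) on (0, 1); exact ||f||_1^2 = 1/2 + pi^2 / 2.
grid = np.linspace(0.0, 1.0, 2001)
h = grid[1] - grid[0]
print(sobolev_norm_1d(np.sin(np.pi * grid), h, J=1))
```

Larger $J$ penalizes higher derivatives, so the dual norm $\| \cdot \|_{-J}$ becomes weaker as $J$ grows, which is why a sufficiently large $J$ is needed to accommodate the distribution-valued limit.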

We will study convergence in the Sobolev space corresponding to . From Lemma 4.3, we have that and are compactly supported. In particular, there exists a compact set such that and vanish outside the compact set for every and . We choose where . Note that , and thus the domain , may depend upon fixed parameters of the problem such that , , , and , but what is important is that the bounded set is fixed and does not change with or .

Sometimes, we may write for simplicity in place of and in place of .

## 3 Preliminary Calculations

The goal of this section is to write $\langle f, \eta^N_t \rangle$, with $\eta^N$ being the fluctuation process and $f$ a test function, in a way that allows us to take limits. In particular, our goal is to describe the evolution of $\langle f, \eta^N_t \rangle$ in terms of the equation (3.5). In order to do this, we need some preliminary computations.

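We consider the evolution of the empirical measure $\nu^N_k$ via test functions $f \in C^2_b(\mathbb{R}^{1+d})$. A Taylor expansion yields

$$\begin{aligned} \left\langle f, \nu^N_{k+1} \right\rangle - \left\langle f, \nu^N_k \right\rangle &= \frac{1}{N} \sum_{i=1}^N \left( f(c^i_{k+1}, w^i_{k+1}) - f(c^i_k, w^i_k) \right) \\ &= \frac{1}{N} \sum_{i=1}^N \partial_c f(c^i_k, w^i_k)(c^i_{k+1} - c^i_k) + \frac{1}{N} \sum_{i=1}^N \nabla_w f(c^i_k, w^i_k)^\top (w^i_{k+1} - w^i_k) \\ &\quad + \frac{1}{N} \sum_{i=1}^N \partial^2_c f(\bar{c}^i_k, \bar{w}^i_k)(c^i_{k+1} - c^i_k)^2 + \frac{1}{N} \sum_{i=1}^N (c^i_{k+1} - c^i_k) \nabla_{cw} f(\bar{c}^i_k, \bar{w}^i_k)(w^i_{k+1} - w^i_k) \\ &\quad + \frac{1}{N} \sum_{i=1}^N (w^i_{k+1} - w^i_k)^\top \nabla^2_w f(\bar{c}^i_k, \bar{w}^i_k)(w^i_{k+1} - w^i_k), \end{aligned} \tag{3.1}$$

for points $\bar{c}^i_k$ and $\bar{w}^i_k$ in the segments connecting $c^i_{k+1}$ with $c^i_k$ and $w^i_{k+1}$ with $w^i_k$, respectively. Under the compactness part of Assumption 1.1, the results of [30] imply that the parameters are uniformly bounded (in both $N$ and $k$):

$$|c^i_k| + \| w^i_k \| < C. \tag{3.2}$$

We shall also denote by $\mathcal{F}^N_k$ the $\sigma$-algebra generated by $\{(c^i_0, w^i_0)\}_{i=1}^N$ and $\{(x_j, y_j)\}_{j=0}^{k-1}$. Using the relation (1.3), equation (3.1) becomes

$$\begin{aligned} \left\langle f, \nu^N_{k+1} \right\rangle - \left\langle f, \nu^N_k \right\rangle &= \frac{1}{N^2} \sum_{i=1}^N \partial_c f(c^i_k, w^i_k) \, \alpha \left( y_k - g^N_{\theta_k}(x_k) \right) \sigma(w^i_k \cdot x_k) \\ &\quad + \frac{1}{N^2} \sum_{i=1}^N \alpha \left( y_k - g^N_{\theta_k}(x_k) \right) c^i_k \sigma'(w^i_k \cdot x_k) \nabla_w f(c^i_k, w^i_k) \cdot x_k + \frac{G^N_k}{N^2}, \end{aligned}$$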

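where $G^N_k$ is an $\mathcal{O}(1)$ term with

$$\begin{aligned} G^N_k &= N^2 \Bigg( \frac{1}{N} \sum_{i=1}^N \partial^2_c f(\bar{c}^i_k, \bar{w}^i_k)(c^i_{k+1} - c^i_k)^2 + \frac{1}{N} \sum_{i=1}^N (c^i_{k+1} - c^i_k) \nabla_{cw} f(\bar{c}^i_k, \bar{w}^i_k)(w^i_{k+1} - w^i_k) \\ &\qquad + \frac{1}{N} \sum_{i=1}^N (w^i_{k+1} - w^i_k)^\top \nabla^2_w f(\bar{c}^i_k, \bar{w}^i_k)(w^i_{k+1} - w^i_k) \Bigg). \end{aligned}$$

Note that $G^N_k$ is $\mathcal{O}(1)$ due to the uniform bound (3.2), $\pi(dx, dy)$ having compact support, and the relation (1.3). Here $K$ denotes a compact set that contains the parameters $(c^i_k, w^i_k)$; such a set exists by (3.2).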

We next define the following components:

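$$\begin{aligned} D^{1,N}_k &= \frac{1}{N} \int_{\mathcal{X} \times \mathcal{Y}} \alpha \left( y - \left\langle c \sigma(w \cdot x), \nu^N_k \right\rangle \right) \left\langle \sigma(w \cdot x) \partial_c f, \nu^N_k \right\rangle \pi(dx, dy), \\ D^{2,N}_k &= \frac{1}{N} \int_{\mathcal{X} \times \mathcal{Y}} \alpha \left( y - \left\langle c \sigma(w \cdot x), \nu^N_k \right\rangle \right) \left\langle c \sigma'(w \cdot x) x \cdot \nabla_w f, \nu^N_k \right\rangle \pi(dx, dy), \\ \left\langle f, M^{1,N}_k \right\rangle &= \frac{1}{N} \alpha \left( y_k - \left\langle c \sigma(w \cdot x_k), \nu^N_k \right\rangle \right) \left\langle \sigma(w \cdot x_k) \partial_c f, \nu^N_k \right\rangle - D^{1,N}_k, \\ \left\langle f, M^{2,N}_k \right\rangle &= \frac{1}{N} \alpha \left( y_k - \left\langle c \sigma(w \cdot x_k), \nu^N_k \right\rangle \right) \left\langle c \sigma'(w \cdot x_k) x_k \cdot \nabla_w f, \nu^N_k \right\rangle - D^{2,N}_k. \end{aligned}$$

Combining the different terms together, we subsequently obtain

$$\left\langle f, \nu^N_{k+1} \right\rangle - \left\langle f, \nu^N_k \right\rangle = D^{1,N}_k + D^{2,N}_k + \left\langle f, M^{1,N}_k \right\rangle + \left\langle f, M^{2,N}_k \right\rangle + \mathcal{O}(N^{-2}).$$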
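The split of each first-order increment into a drift part $D^{1,N}_k + D^{2,N}_k$ (the average over the data distribution) and a fluctuation part $M^{1,N}_k + M^{2,N}_k$ (the deviation of the sampled value from that average) can be sketched numerically. The sketch assumes that $\pi$ is the uniform distribution on a finite dataset, so the integral over $\mathcal{X} \times \mathcal{Y}$ is a plain average; the function names, the tanh activation, and the test function $f(c, w) = c \| w \|^2$ are illustrative assumptions.

```python
import numpy as np

def first_order_increment(c, w, x, y, alpha, df_dc, df_dw, sigma=np.tanh,
                          dsigma=lambda z: 1.0 - np.tanh(z) ** 2):
    """(1/N) alpha (y - <c sigma(w.x), nu>) <sigma d_c f + c sigma' x . grad_w f, nu>."""
    N = c.shape[0]
    z = w @ x
    err = y - np.mean(c * sigma(z))
    pairing = np.mean(sigma(z) * df_dc(c, w) + c * dsigma(z) * (df_dw(c, w) @ x))
    return alpha * err * pairing / N

def drift_and_martingale(c, w, X, Y, alpha, df_dc, df_dw):
    """Return D = D^1 + D^2 and the values of M = M^1 + M^2 for each possible sample."""
    inc = np.array([first_order_increment(c, w, x, y, alpha, df_dc, df_dw)
                    for x, y in zip(X, Y)])
    D = inc.mean()        # integral against pi (assumed uniform on the dataset)
    return D, inc - D     # deviation from the conditional mean, one entry per sample

# Illustrative setup (all values are assumptions).
rng = np.random.default_rng(1)
N, d, m = 50, 2, 8
c = rng.uniform(-1.0, 1.0, size=N)
w = rng.uniform(-1.0, 1.0, size=(N, d))
X = rng.uniform(-1.0, 1.0, size=(m, d))
Y = np.sin(X[:, 0])
df_dc = lambda c, w: (w ** 2).sum(axis=1)     # d/dc of f(c, w) = c |w|^2
df_dw = lambda c, w: 2.0 * c[:, None] * w     # grad_w of f(c, w) = c |w|^2
D, M = drift_and_martingale(c, w, X, Y, 1.0, df_dc, df_dw)
print(D, M.mean())
```

By construction the fluctuation part averages to zero over the data distribution given the current parameters, which is the property that makes its partial sums martingale terms.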

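Next, we define the scaled versions of $D^{1,N}_k, D^{2,N}_k$ and $M^{1,N}_k, M^{2,N}_k$:

$$\begin{aligned} D^{1,N}(t) &= \sum_{k=0}^{\lfloor Nt \rfloor - 1} D^{1,N}_k, \qquad D^{2,N}(t) = \sum_{k=0}^{\lfloor Nt \rfloor - 1} D^{2,N}_k, \\ \left\langle f, M^{1,N}(t) \right\rangle &= \sum_{k=0}^{\lfloor Nt \rfloor - 1} \left\langle f, M^{1,N}_k \right\rangle, \qquad \left\langle f, M^{2,N}(t) \right\rangle = \sum_{k=0}^{\lfloor Nt \rfloor - 1} \left\langle f, M^{2,N}_k \right\rangle. \end{aligned}$$

As will be demonstrated in Section 4.2, $M^{1,N}(t)$ and $M^{2,N}(t)$ are martingale terms. We also define

$$\left\langle f, M^N_t \right\rangle = \left\langle f, M^{1,N}(t) \right\rangle + \left\langle f, M^{2,N}(t) \right\rangle.$$

$D^{1,N}(t)$ and $D^{2,N}(t)$ can be approximated by integrals:

$$\begin{aligned} \sum_{k=0}^{\lfloor Nt \rfloor - 1} D^{1,N}_k &= \sum_{k=0}^{\lfloor Nt \rfloor - 1} \int_{\frac{k}{N}}^{\frac{k+1}{N}} \int_{\mathcal{X} \times \mathcal{Y}} \alpha \left( y - \left\langle c \sigma(w \cdot x), \nu^N_k \right\rangle \right) \left\langle \sigma(w \cdot x) \partial_c f, \nu^N_k \right\rangle \pi(dx, dy) \, ds \\ &= \sum_{k=0}^{\lfloor Nt \rfloor - 1} \int_{\frac{k}{N}}^{\frac{k+1}{N}} \int_{\mathcal{X} \times \mathcal{Y}} \alpha \left( y - \left\langle c \sigma(w \cdot x), \mu^N_s \right\rangle \right) \left\langle \sigma(w \cdot x) \partial_c f, \mu^N_s \right\rangle \pi(dx, dy) \, ds \\ &= \int_0^t \int_{\mathcal{X} \times \mathcal{Y}} \alpha \left( y - \left\langle c \sigma(w \cdot x), \mu^N_s \right\rangle \right) \left\langle \sigma(w \cdot x) \partial_c f, \mu^N_s \right\rangle \pi(dx, dy) \, ds + V^{1,N}_t, \end{aligned}$$

where $V^{1,N}_t$ is a remainder term defined below. Similarly,

$$\sum_{k=0}^{\lfloor Nt \rfloor - 1} D^{2,N}_k = \int_0^t \int_{\mathcal{X} \times \mathcal{Y}} \alpha \left( y - \left\langle c \sigma(w \cdot x), \mu^N_s \right\rangle \right) \left\langle c \sigma'(w \cdot x) x \cdot \nabla_w f, \mu^N_s \right\rangle \pi(dx, dy) \, ds + V^{2,N}_t.$$

The remainder terms $V^{1,N}_t$ and $V^{2,N}_t$ are

$$\begin{aligned} V^{1,N}_t &= -\int_{\frac{\lfloor Nt \rfloor}{N}}^{t} \int_{\mathcal{X} \times \mathcal{Y}} \alpha \left( y - \left\langle c \sigma(w \cdot x), \mu^N_s \right\rangle \right) \left\langle \sigma(w \cdot x) \partial_c f, \mu^N_s \right\rangle \pi(dx, dy) \, ds, \\ V^{2,N}_t &= -\int_{\frac{\lfloor Nt \rfloor}{N}}^{t} \int_{\mathcal{X} \times \mathcal{Y}} \alpha \left( y - \left\langle c \sigma(w \cdot x), \mu^N_s \right\rangle \right) \left\langle c \sigma'(w \cdot x) x \cdot \nabla_w f, \mu^N_s \right\rangle \pi(dx, dy) \, ds, \\ V^N_t &= V^{1,N}_t + V^{2,N}_t. \end{aligned}$$

$V^N_t$ is a càdlàg process with jumps at times $t = k/N$, $k \in \mathbb{N}$. Furthermore, due to the uniform bound (3.2) and $K$ being a compact set, $V^N_t$ is an $\mathcal{O}(N^{-1})$ remainder term:

$$\sup_{t \in [0,T]} |V^N_t| \leq \frac{C}{N} \sum_{|\alpha| = 1} \sup_{c, w \in K} |D^\alpha f(c, w)|. \tag{3.3}$$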
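The passage from the sum over SGD steps to a time integral, and the origin of the $\mathcal{O}(N^{-1})$ remainder, can be seen with a generic piecewise-constant integrand. In the sketch below, `h[k]` stands for the value of the integrand on $[k/N, (k+1)/N)$; the function name and the sample values are hypothetical.

```python
import numpy as np

def riemann_sum_and_remainder(h, N, t):
    """Compare sum_{k < floor(Nt)} h[k] / N with the integral of the step function on [0, t].

    h[k] is the integrand's value on [k/N, (k+1)/N); requires len(h) > floor(N t).
    Returns (total, integral, V) with total = integral + V.
    """
    K = int(np.floor(N * t))
    total = float(np.sum(h[:K]) / N)          # the sum of per-step drift terms
    tail = (t - K / N) * h[K]                 # integral over the partial step [K/N, t]
    integral = total + tail                   # exact integral of the step function on [0, t]
    V = -tail                                 # remainder, bounded by max|h| / N
    return total, integral, V

h = np.arange(10.0)
total, integral, V = riemann_sum_and_remainder(h, N=4, t=1.3)
print(total, integral, V)
```

The remainder only involves the incomplete last step of length at most $1/N$, so it is bounded by the supremum of the integrand divided by $N$ and resets to zero at each time $k/N$.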

The scaled empirical measure can be written as the telescoping sum

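$$\left\langle f, \mu^N_t \right\rangle - \left\langle f, \mu^N_0 \right\rangle = \left\langle f, \nu^N_{\lfloor Nt \rfloor} \right\rangle - \left\langle f, \nu^N_0 \right\rangle = \sum_{k=0}^{\lfloor Nt \rfloor - 1} \left( \left\langle f, \nu^N_{k+1} \right\rangle - \left\langle f, \nu^N_k \right\rangle \right).$$

Therefore, the scaled empirical measure satisfies

$$\begin{aligned} \left\langle f, \mu^N_t \right\rangle - \left\langle f, \mu^N_0 \right\rangle &= \sum_{k=0}^{\lfloor Nt \rfloor - 1} \left( \left\langle f, \nu^N_{k+1} \right\rangle - \left\langle f, \nu^N_k \right\rangle \right) \\ &= \sum_{k=0}^{\lfloor Nt \rfloor - 1} \left( D^{1,N}_k + D^{2,N}_k + \left\langle f, M^{1,N}_k \right\rangle + \left\langle f, M^{2,N}_k \right\rangle \right) + \frac{1}{N^2} \sum_{k=0}^{\lfloor Nt \rfloor - 1} G^N_k \\ &= \int_0^t \int_{\mathcal{X} \times \mathcal{Y}} \alpha \left( y - \left\langle c \sigma(w \cdot x), \mu^N_s \right\rangle \right) \left\langle \sigma(w \cdot x) \partial_c f, \mu^N_s \right\rangle \pi(dx, dy) \, ds \\ &\quad + \int_0^t \int_{\mathcal{X} \times \mathcal{Y}} \alpha \left( y - \left\langle c \sigma(w \cdot x), \mu^N_s \right\rangle \right) \left\langle c \sigma'(w \cdot x) x \cdot \nabla_w f, \mu^N_s \right\rangle \pi(dx, dy) \, ds \\ &\quad + \left\langle f, M^N_t \right\rangle + \frac{1}{N^2} \sum_{k=0}^{\lfloor Nt \rfloor - 1} G^N_k + V^N_t. \end{aligned} \tag{3.4}$$

Note that $\frac{1}{N^2} \sum_{k=0}^{\lfloor Nt \rfloor - 1} G^N_k$ is $\mathcal{O}(N^{-1})$. Define the fluctuation process

$$\eta^N_t = \sqrt{N} \left( \mu^N_t - \bar{\mu}_t \right).$$

Then, subtracting (1.4) from (3.4), multiplying by $\sqrt{N}$, and substituting $\mu^N_s = \bar{\mu}_s + N^{-1/2} \eta^N_s$ gives

$$\begin{aligned} \left\langle f, \eta^N_t \right\rangle - \left\langle f, \eta^N_0 \right\rangle &= \int_0^t \left( \int_{\mathcal{X} \times \mathcal{Y}} \alpha \left( y - \left\langle c \sigma(w \cdot x), \bar{\mu}_s \right\rangle \right) \left\langle \sigma(w \cdot x) \partial_c f, \eta^N_s \right\rangle \pi(dx, dy) \right) ds \\ &\quad - \int_0^t \left( \int_{\mathcal{X} \times \mathcal{Y}} \alpha \left\langle c \sigma(w \cdot x), \eta^N_s \right\rangle \left\langle \sigma(w \cdot x) \partial_c f, \bar{\mu}_s \right\rangle \pi(dx, dy) \right) ds \\ &\quad + \int_0^t \left( \int_{\mathcal{X} \times \mathcal{Y}} \alpha \left( y - \left\langle c \sigma(w \cdot x), \bar{\mu}_s \right\rangle \right) \left\langle c \sigma'(w \cdot x) x \cdot \nabla_w f, \eta^N_s \right\rangle \pi(dx, dy) \right) ds \\ &\quad - \int_0^t \left( \int_{\mathcal{X} \times \mathcal{Y}} \alpha \left\langle c \sigma(w \cdot x), \eta^N_s \right\rangle \left\langle c \sigma'(w \cdot x) x \cdot \nabla_w f, \bar{\mu}_s \right\rangle \pi(dx, dy) \right) ds \\ &\quad + \sqrt{N} \left\langle f, M^N_t \right\rangle + \Gamma^{1,N}_t + \Gamma^{2,N}_t + R^{1,N}_t + R^{2,N}_t, \end{aligned} \tag{3.5}$$

where

$$\begin{aligned} \Gamma^{1,N}_t &= \frac{1}{\sqrt{N}} \int_0^t \int_{\mathcal{X} \times \mathcal{Y}} -\alpha \left\langle c \sigma(w \cdot x), \eta^N_s \right\rangle \left\langle \sigma(w \cdot x) \partial_c f, \eta^N_s \right\rangle \pi(dx, dy) \, ds, \\ \Gamma^{2,N}_t &= \frac{1}{\sqrt{N}} \int_0^t \int_{\mathcal{X} \times \mathcal{Y}} -\alpha \left\langle c \sigma(w \cdot x), \eta^N_s \right\rangle \left\langle c \sigma'(w \cdot x) x \cdot \nabla_w f, \eta^N_s \right\rangle \pi(dx, dy) \, ds. \end{aligned}$$

$R^{1,N}_t$ and $R^{2,N}_t$ are remainder terms where

$$R^{1,N}_t = N^{-3/2} \sum_{k=0}^{\lfloor Nt \rfloor - 1} G^N_k, \qquad R^{2,N}_t = \sqrt{N} \, V^N_t.$$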

## 4 Relative Compactness

This section proves the relative compactness of the pre-limit processes in and of in . Lemma 4.9 states that relative compactness of and of in . The proof is based on Theorem 4.20 of [25], see also Theorem 8.6 in Chapter 3 of [13]. We need to prove that and are appropriately uniformly bounded, see Lemma 4.8 and Lemma 4.5 respectively, and that they satisfy an appropriate regularity type of property, see Lemma 4.7 and Lemma 4.6 respectively.

### 4.1 Uniform bound on the fluctuations process ηN

The main result of this section is Lemma 4.1 below and it provides a uniform bound with respect to and