1 Introduction
Neural network models have been used as computational tools in many different contexts, including machine learning, pattern recognition, physics, neuroscience, and statistical mechanics; see for example [21]. Neural network models, particularly in machine learning, have achieved immense practical success over the past decade in fields such as image, text, and speech recognition. We mathematically analyze neural networks with a single hidden layer in the asymptotic regime of large network sizes and large numbers of stochastic gradient descent iterations. A law of large numbers was previously proven in [30]; see also [27, 29] for related results. This paper rigorously proves a central limit theorem (CLT) for the empirical distribution of the neural network parameters. The central limit theorem describes the fluctuations of the finite empirical distribution of the neural network parameters around its mean-field limit.
The mean-field limit is a law of large numbers for the empirical measure of the neural network parameters as $N \to \infty$, where $N$ is the number of hidden units. It satisfies a deterministic nonlinear partial differential equation. The mean-field limit is of course only accurate in the limit $N \to \infty$, and the central limit theorem provides a first-order correction in $N^{-1/2}$. The limiting fluctuation process satisfies a linear stochastic partial differential equation (SPDE) driven by a Gaussian process. In particular, our result shows that the trained neural network behaves as $\mu^N \approx \bar{\mu} + \frac{1}{\sqrt{N}} \bar{\eta}$, where $\mu^N$ is the empirical measure of the parameters for a neural network with $N$ hidden units, $\bar{\mu}$ is the mean-field limit, and $\bar{\eta}$ is the Gaussian correction from the central limit theorem.
The proof requires a linearization of the nonlinear pre-limit evolution equation for the empirical distribution of the neural network parameters. This linearization produces several remainder terms which must be shown to vanish in the limit (similar to a perturbation analysis for PDEs). The SPDE for the CLT is linearized around the nonlinear PDE for the mean-field limit $\bar{\mu}$. The CLT SPDE and mean-field limit PDE are therefore coupled. We must also show that the pre-limit evolution equation (which is in discrete time since stochastic gradient descent is a discrete-time algorithm) converges to a continuous-time limit.
The proof relies upon weak convergence analysis for interacting particle systems. The convergence analysis is technically challenging since the fluctuation process of the empirical distribution is a signed-measure-valued process and its limit process turns out to be distribution-valued in the appropriate space. Unfortunately, the space of signed measures endowed with the weak topology is in general not metrizable (see [11] and [32] for further discussion of the space of signed measures). We therefore study the convergence of the fluctuations as a process taking values in the dual space of an appropriate Sobolev space. We prove that the pre-limit fluctuation process is relatively compact in that space and that any limit point is unique in that space. In particular, we will use the dual space $W^{-J,2}(\Theta)$ of the Sobolev space $W_0^{J,2}(\Theta)$, with $\Theta$ a bounded subset of the appropriate Euclidean space and where $J$ is sufficiently large; see Section 2 for a detailed description. Since the pre-limit evolution equation has discrete updates, we study convergence in the Skorokhod space $D_{W^{-J,2}}([0,T])$. ($D_E([0,T])$ is the set of maps from $[0,T]$ into $E$ which are right-continuous and which have left-hand limits.)
Most of the literature on central limit theorems for interacting particle systems considers continuous-time systems; see for example [15, 26, 32, 5, 10, 6]. In contrast, in this article the pre-limit process is in discrete time and converges to a continuous-time limit process after an appropriate time rescaling. At a practical level, this shows that the number of particles (“hidden units” in the language of neural networks) and the number of stochastic gradient steps should be of the same order to have convergence and statistically good behavior. At a more mathematical level, this passage from discrete to continuous time produces a number of additional remainder terms that must be shown to vanish at the correct rate in order for a CLT to hold. We resolve all these issues for one-layer neural network models, rigorously establishing and characterizing the fluctuation limit.
Weak convergence and mean-field analysis have been used in many other disciplines, including interacting particle systems in physics, neural networks in biology, and financial modeling; see for example [17], [18], [7], [8], [9], [4], [20], [12], [23], [28], [34], [31] and the references therein for a certainly incomplete list. Recently, [30], [35], [27], and [29] studied mean-field limits of machine learning algorithms, including neural networks. In this paper, we rigorously establish a central limit theorem for neural networks trained with stochastic gradient descent. [29] also formally studies corrections to the mean-field limit.
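Consider the one-layer neural network
(1.1) $g^N_\theta(x) = \frac{1}{N} \sum_{i=1}^{N} c^i \sigma(w^i \cdot x),$
where for every $i \in \{1, \ldots, N\}$, $c^i \in \mathbb{R}$ and $x, w^i \in \mathbb{R}^d$. For notational convenience we shall interpret $w^i \cdot x = \langle w^i, x \rangle$ as the standard scalar inner product. The neural network model has parameters $\theta = (c^1, \ldots, c^N, w^1, \ldots, w^N) \in \mathbb{R}^{(1+d)N}$, which must be estimated from data.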
The neural network (1.1) takes a linear function of the original data, applies an elementwise nonlinear operation using the function $\sigma: \mathbb{R} \to \mathbb{R}$, and then takes another linear function to produce the output. The activation function $\sigma(\cdot)$ is a nonlinear function such as a sigmoid or tanh function. The quantity $\sigma(w^i \cdot x)$ is referred to as the $i$-th “hidden unit”, and the vector $(\sigma(w^1 \cdot x), \ldots, \sigma(w^N \cdot x))$ is called the “hidden layer”. The number of units in the hidden layer is $N$.
The objective function is
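(1.2) $L(\theta) = \frac{1}{2} \mathbb{E}_{Y,X}\left[ (Y - g^N_\theta(X))^2 \right],$
where the data $(X, Y)$ is assumed to have a joint distribution $\pi(dx, dy)$. We shall write $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} \subseteq \mathbb{R}$ for the state spaces of $X$ and $Y$, respectively. The parameters $\theta$ are estimated using stochastic gradient descent:
(1.3) $c^i_{k+1} = c^i_k + \frac{\alpha}{N} (y_k - g^N_{\theta_k}(x_k)) \sigma(w^i_k \cdot x_k), \qquad w^i_{k+1} = w^i_k + \frac{\alpha}{N} (y_k - g^N_{\theta_k}(x_k)) c^i_k \sigma'(w^i_k \cdot x_k) x_k,$
for $k = 0, 1, \ldots$, where $\alpha$ is the learning rate and $(x_k, y_k) \sim \pi(dx, dy)$. Stochastic gradient descent minimizes (1.2) using a sequence of noisy (but unbiased) gradient descent steps $\nabla_\theta \left[ \frac{1}{2} (y_k - g^N_{\theta_k}(x_k))^2 \right]$. Stochastic gradient descent typically converges more rapidly than gradient descent for large datasets. For this reason, stochastic gradient descent is widely used in machine learning.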
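As a minimal sketch of the model and training rule described above, the following Python code evaluates a one-hidden-layer network with a $1/N$ normalization and performs one stochastic gradient step on the per-sample loss $\frac{1}{2}(y - g(x))^2$. The choice of tanh as the activation, the function names `forward` and `sgd_step`, and the list-based representation of the parameters are assumptions made for illustration only.

```python
import math

def forward(c, w, x):
    """Network output g(x) = (1/N) * sum_i c[i] * tanh(w[i] . x)."""
    n = len(c)
    total = 0.0
    for ci, wi in zip(c, w):
        pre = sum(wj * xj for wj, xj in zip(wi, x))  # inner product w[i] . x
        total += ci * math.tanh(pre)
    return total / n

def sgd_step(c, w, x, y, alpha):
    """One SGD step on 0.5 * (y - g(x))**2; returns the updated (c, w)."""
    n = len(c)
    err = y - forward(c, w, x)  # residual computed with the current parameters
    new_c, new_w = [], []
    for ci, wi in zip(c, w):
        s = math.tanh(sum(wj * xj for wj, xj in zip(wi, x)))
        ds = 1.0 - s * s  # derivative of tanh at w[i] . x
        new_c.append(ci + (alpha / n) * err * s)
        new_w.append([wj + (alpha / n) * err * ci * ds * xj
                      for wj, xj in zip(wi, x)])
    return new_c, new_w
```

Because the gradient of the normalized output with respect to each hidden unit carries a factor $1/N$, every parameter moves by $O(1/N)$ per step, which is why order $N$ steps are needed for an $O(1)$ change.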
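Define the empirical measure
$\nu^N_k(dc, dw) = \frac{1}{N} \sum_{i=1}^{N} \delta_{c^i_k, w^i_k}(dc, dw).$
The neural network’s output can be rewritten in terms of the empirical measure:
$g^N_{\theta_k}(x) = \langle c \sigma(w \cdot x), \nu^N_k \rangle.$
$\langle f, h \rangle$ denotes the inner product of $f$ and $h$. For example, $\langle c \sigma(w \cdot x), \nu^N_k \rangle = \int c \sigma(w \cdot x) \, \nu^N_k(dc, dw)$.
The scaled empirical measure is
$\mu^N_t = \nu^N_{\lfloor N t \rfloor}.$
The scaled empirical measure is a random element of the Skorokhod space $D_E([0,T])$ with $E = \mathcal{M}(\mathbb{R}^{1+d})$. ($D_E([0,T])$ is the set of maps from $[0,T]$ into $E$ which are right-continuous and which have left-hand limits.)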
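The pairing of a test function with the empirical measure, and the time rescaling that defines the scaled empirical measure, can be sketched as follows. This is an illustration under the assumption that the empirical measure places mass $1/N$ on each parameter pair and that the scaled measure at time $t$ is the empirical measure after $\lfloor N t \rfloor$ gradient steps; the names `pair`, `output_via_measure`, and `scaled_index`, and the use of tanh, are hypothetical.

```python
import math

def pair(f, c, w):
    """<f, nu> for nu = (1/N) sum_i delta_{(c[i], w[i])}: average of f over particles."""
    return sum(f(ci, wi) for ci, wi in zip(c, w)) / len(c)

def output_via_measure(c, w, x):
    """Network output written as <c * sigma(w . x), nu>, with sigma = tanh."""
    def integrand(ci, wi):
        return ci * math.tanh(sum(a * b for a, b in zip(wi, x)))
    return pair(integrand, c, w)

def scaled_index(t, n):
    """SGD iteration k = floor(n * t) whose empirical measure plays the role of mu^N_t."""
    return math.floor(n * t)
```

Under this convention, one unit of rescaled time corresponds to $N$ stochastic gradient steps.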
We shall work on a filtered probability space $(\Omega, \mathcal{F}, \mathbb{P})$ on which all the random variables are defined. The probability space is equipped with a filtration $\mathfrak{F}_t$ that is right-continuous and contains all $\mathbb{P}$-null sets.
We impose the following conditions.
Assumption 1.1.
We have that

The activation function $\sigma \in C_b^\infty(\mathbb{R})$.

The data is compactly supported.

The sequence of data samples $(x_k, y_k)$ is i.i.d.

The random initialization $(c^i_0, w^i_0)$ is i.i.d., generated from a distribution $\bar{\mu}_0$ with compact support.
1.1 Law of Large Numbers
Theorem 1.2.
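Assume Assumption 1.1. The scaled empirical measure $\mu^N_t$ converges in distribution to $\bar{\mu}_t$ in $D_E([0,T])$ as $N \to \infty$. For every $f \in C_b^2(\mathbb{R}^{1+d})$, $\bar{\mu}_t$ is the deterministic unique solution of the measure evolution equation
(1.4) $\langle f, \bar{\mu}_t \rangle = \langle f, \bar{\mu}_0 \rangle + \int_0^t \left( \int_{\mathcal{X} \times \mathcal{Y}} \alpha \left( y - \langle c' \sigma(w' \cdot x), \bar{\mu}_s \rangle \right) \langle \nabla(c \sigma(w \cdot x)) \cdot \nabla f, \bar{\mu}_s \rangle \, \pi(dx, dy) \right) ds,$
where $\nabla f = (\partial_c f, \nabla_w f)$, so that $\nabla(c \sigma(w \cdot x)) \cdot \nabla f = \sigma(w \cdot x) \partial_c f + c \sigma'(w \cdot x) x \cdot \nabla_w f$.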
Remark 1.3.
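Since weak convergence to a constant implies convergence in probability, Theorem 1.2 leads to the stronger result of convergence in probability
$\lim_{N \to \infty} \mathbb{P}\left\{ d_E(\mu^N, \bar{\mu}) \geq \delta \right\} = 0$
for every $\delta > 0$ and where $d_E$ is the metric for $D_E([0,T])$.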
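The form of the mean-field limit can be motivated by a heuristic first-order expansion. The sketch below assumes that the stochastic gradient update moves each parameter by a term of order $\alpha/N$ times the residual $y_k - g^N_{\theta_k}(x_k)$, with $\nu^N_k$ denoting the empirical measure of the parameters after $k$ steps; it is not a substitute for the rigorous argument and ignores all remainder terms.

```latex
\begin{align*}
\langle f, \nu^N_{k+1} \rangle - \langle f, \nu^N_k \rangle
  &= \frac{1}{N} \sum_{i=1}^{N} \left[ f(c^i_{k+1}, w^i_{k+1}) - f(c^i_k, w^i_k) \right] \\
  &\approx \frac{1}{N} \sum_{i=1}^{N} \left[ \partial_c f(c^i_k, w^i_k) \, (c^i_{k+1} - c^i_k)
     + \nabla_w f(c^i_k, w^i_k) \cdot (w^i_{k+1} - w^i_k) \right] \\
  &= \frac{\alpha}{N} \left( y_k - g^N_{\theta_k}(x_k) \right)
     \left\langle \sigma(w \cdot x_k) \, \partial_c f + c \, \sigma'(w \cdot x_k) \, x_k \cdot \nabla_w f, \; \nu^N_k \right\rangle .
\end{align*}
```

Each step therefore changes $\langle f, \nu^N_k \rangle$ by $O(1/N)$. Summing over $\lfloor N t \rfloor$ steps and averaging over the data distribution turns the sum into a time integral, which is the heuristic origin of the drift term in the limiting evolution equation.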
1.2 Main Result: A Central Limit Theorem
In this paper, we prove a central limit theorem for one-layer neural networks as the size of the network and the number of training steps become large. The central limit theorem quantifies the speed of convergence of the finite neural network to its mean-field limit as well as how the finite neural network fluctuates around the mean-field limit for large $N$.
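Define the fluctuation process
$\eta^N_t = \sqrt{N} \left( \mu^N_t - \bar{\mu}_t \right).$
We prove that $\eta^N$ converges in distribution to $\bar{\eta}$, where $\bar{\eta}$ satisfies a stochastic partial differential equation. This result characterizes the fluctuations of the finite empirical measure $\mu^N$ around its mean-field limit $\bar{\mu}$ for large $N$. The limit $\bar{\eta}$ has a Gaussian distribution. We study the convergence of $\eta^N$ in the space $D_{W^{-J,2}}([0,T])$, where $W^{-J,2} = W^{-J,2}(\Theta)$ is the dual of the Sobolev space $W_0^{J,2}(\Theta)$ with $\Theta$ a bounded domain. These spaces are described in detail in Section 2.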
Theorem 1.5.
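Assume Assumption 1.1 and let $J$ be the Sobolev index specified in Section 2. Let $0 < T < \infty$ be given. The sequence $\{\eta^N_t, t \in [0,T]\}_{N \in \mathbb{N}}$ is relatively compact in $D_{W^{-J,2}}([0,T])$. The sequence of processes $\{\eta^N_t, t \in [0,T]\}_{N \in \mathbb{N}}$ converges in distribution in $D_{W^{-J,2}}([0,T])$ to the process $\{\bar{\eta}_t, t \in [0,T]\}$, which, for every $f \in W_0^{J,2}$, satisfies the stochastic partial differential equation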
(1.5)  
The CLT SPDE (1.5) is coupled with the mean-field limit PDE (1.4). Equation (1.4) is a deterministic nonlinear PDE while (1.5) is a stochastic linear PDE. The SPDE (1.5) is linear in $\bar{\eta}$ and driven by a Gaussian process; therefore, the limit $\bar{\eta}$ itself is a Gaussian process.
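A rough numerical illustration of the $\sqrt{N}$ fluctuation scaling is sketched below. It is a toy experiment under several assumptions that are not taken from the paper: one-dimensional inputs drawn uniformly from $[-1, 1]$, a hypothetical target $y = \sin(2x)$, tanh activation, uniform initialization, and the test function $f(c, w) = c \tanh(w)$. Since the mean-field limit is deterministic, $N$ times the variance of $\langle f, \mu^N_t \rangle$ across independent runs would be expected to remain of comparable size as $N$ grows if fluctuations are of order $N^{-1/2}$.

```python
import math
import random
import statistics

def train_statistic(n, t, alpha, seed):
    """Run floor(n * t) SGD steps on a toy 1-d problem; return <f, mu^N_t>, f(c, w) = c * tanh(w)."""
    rng = random.Random(seed)
    c = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    w = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(math.floor(n * t)):
        x = rng.uniform(-1.0, 1.0)
        y = math.sin(2.0 * x)  # hypothetical target, chosen only for illustration
        s = [math.tanh(wi * x) for wi in w]
        err = y - sum(ci * si for ci, si in zip(c, s)) / n
        # Both updates use the parameters from before the step.
        c, w = (
            [ci + (alpha / n) * err * si for ci, si in zip(c, s)],
            [wi + (alpha / n) * err * ci * (1.0 - si * si) * x
             for ci, wi, si in zip(c, w, s)],
        )
    return sum(ci * math.tanh(wi) for ci, wi in zip(c, w)) / n

def scaled_variance(n, t=1.0, alpha=1.0, runs=30):
    """n times the sample variance of <f, mu^N_t> across independent seeded runs."""
    values = [train_statistic(n, t, alpha, seed) for seed in range(runs)]
    return n * statistics.variance(values)
```

Comparing, for example, `scaled_variance(50)` with `scaled_variance(200)` gives a crude check of the scaling; with so few runs the estimates are noisy, so this is only a qualitative diagnostic.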
Theorem 1.5 indicates that for large $N$ the empirical distribution of the neural network’s parameters behaves as
$\mu^N_t \approx \bar{\mu}_t + \frac{1}{\sqrt{N}} \bar{\eta}_t,$
where $\bar{\eta}_t$ has a Gaussian distribution. Combined, Theorems 1.2 and 1.5 show that the number of particles (“hidden units” in the language of neural networks) and the number of stochastic gradient steps should be of the same order to have convergence and statistically good behavior. Under this scaling, as a measure-valued process, the empirical distribution of the parameters behaves as a Gaussian distribution with a specific variance-covariance structure (as indicated by Theorem 1.5).
1.3 Outline of Paper
In Section 2 we present the Sobolev spaces with respect to which convergence is studied. The prelimit evolution equation for the fluctuation process is derived in Section 3. Section 4 proves relative compactness. Section 5 derives the limiting SPDE (1.5). Uniqueness of the SPDE (1.5) is proven in Section 6. Section 7 collects these results and proves Theorem 1.5. Conclusions are in Section 8.
2 Sobolev Spaces
We study convergence in a Sobolev space [1]. Weighted Sobolev spaces have been previously used to study central limit theorems of mean-field systems in papers such as [15], [26] and [32]. Weights are not necessary in this paper since $\mu^N_t$ and $\bar{\mu}_t$ are compactly supported uniformly with respect to $N$ and $t$ (see Lemma 4.3).
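Let $\Theta \subset \mathbb{R}^{1+d}$ be a bounded domain. For any integer $J$, consider the space of real-valued functions $f$ with partial derivatives up to order $J$ which satisfy
$\| f \|_J = \left( \sum_{|k| \leq J} \int_\Theta | D^k f(x) |^2 \, dx \right)^{1/2} < \infty.$
Define the space $W_0^{J,2}(\Theta)$ as the closure of functions of class $C_0^\infty(\Theta)$ in the norm defined above. $C_0^\infty(\Theta)$ is the space of all functions in $C^\infty(\Theta)$ with compact support. (The space $W_0^{J,2}(\Theta)$ is frequently also denoted by $H_0^J(\Theta)$ in the literature.) $W_0^{J,2}(\Theta)$ is a Hilbert space (see Theorem 3.5 and Remark 3.33 in [1]) and has the inner product
$\langle f, g \rangle_J = \sum_{|k| \leq J} \int_\Theta D^k f(x) \, D^k g(x) \, dx.$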
When , we write . denotes the dual space of that is equipped with the norm
We will study convergence in the Sobolev space corresponding to . From Lemma 4.3, we have that and are compactly supported. In particular, there exists a compact set such that and vanish outside the compact set for every and . We choose where . Note that , and thus the domain , may depend upon fixed parameters of the problem such that , , , and , but what is important is that the bounded set is fixed and does not change with or .
Sometimes, we may write $W_0^{J,2}$ for simplicity in place of $W_0^{J,2}(\Theta)$ and $W^{-J,2}$ in place of $W^{-J,2}(\Theta)$.
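To make the Sobolev norm concrete, the sketch below approximates it in one dimension, assuming the standard definition in which the squared norm of order $J$ is the sum of the squared $L^2$ norms of all derivatives up to order $J$. The function name `sobolev_norm_1d`, the midpoint quadrature rule, and the requirement that the caller supply the derivatives explicitly are choices made for illustration.

```python
import math

def sobolev_norm_1d(derivs, a, b, m=10000):
    """Approximate ||f||_J on (a, b) from derivs = [f, f', ..., f^(J)] using the midpoint rule."""
    h = (b - a) / m
    total = 0.0
    for k in range(m):
        x = a + (k + 0.5) * h  # midpoint of the k-th subinterval
        total += sum(d(x) ** 2 for d in derivs) * h
    return math.sqrt(total)

# Example: f(x) = sin(pi x) on (0, 1) vanishes at both endpoints.
f = lambda x: math.sin(math.pi * x)
df = lambda x: math.pi * math.cos(math.pi * x)
norm_1 = sobolev_norm_1d([f, df], 0.0, 1.0)
```

For this example the exact value is $\sqrt{1/2 + \pi^2/2}$, since $\int_0^1 \sin^2(\pi x)\,dx = \int_0^1 \cos^2(\pi x)\,dx = 1/2$.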
3 Preliminary Calculations
The goal of this section is to write $\langle f, \eta^N_t \rangle$, with $\eta^N_t$ being the fluctuation process and $f$ a test function, in a way that allows us to take limits. In particular, our goal is to describe the evolution of $\langle f, \eta^N_t \rangle$ in terms of the equation (3.5). In order to do this, we need some preliminary computations.
We consider the evolution of the empirical measure via test functions . A Taylor expansion yields
(3.1) 
for points in the segments connecting with and with , respectively. Under the compactness part of Assumption 1.1, the results of [30] imply that the parameters are uniformly bounded (in both and ):
(3.2) 
We shall also denote by to be the algebra generated by and . Using the relation (1.3), equation (3.1) becomes
where is an term with
Note that due to the uniform bound , having compact support, and the relation (1.3). is the compact set .
We next define the following components:
Combining the different terms together, we subsequently obtain
Next, we define the scaled versions of and :
As will be demonstrated in Section 4.2, and are martingale terms. We also define
and can be approximated by integrals:
where is a remainder term defined below. Similarly,
The remainder terms and are
is a càdlàg process with jumps at times . Furthermore, due to the uniform bound (3.2) and being a compact set, is an remainder term:
(3.3) 
The scaled empirical measure can be written as the telescoping sum
Therefore, the scaled empirical measure satisfies
(3.4)  
Note that is . Define the fluctuation process
Then,
(3.5)  
where
and are remainder terms where
4 Relative Compactness
This section proves the relative compactness of the pre-limit processes in and of in . Lemma 4.9 states the relative compactness of and of in . The proof is based on Theorem 4.20 of [25]; see also Theorem 8.6 in Chapter 3 of [13]. We need to prove that and are appropriately uniformly bounded (see Lemma 4.8 and Lemma 4.5, respectively) and that they satisfy an appropriate regularity property (see Lemma 4.7 and Lemma 4.6, respectively).
4.1 Uniform bound on the fluctuations process
The main result of this section is Lemma 4.1 below and it provides a uniform bound with respect to