1. Motivation and main results
While classification problems continue to be an active area research, extraordinary progress has been made on both speech and image recognition, problems that appeared intractable only a decade ago . By harvesting the power of neural networks while simultaneously benefiting from advances in computational hardware, complex tasks such as automatic language translation are now routinely performed by computers with a high degree of reliability. The underlying explanation for these significant advances seems to be related to the expressive power of neural networks, and their ability to represent high dimensional functions with accuracy.
. Any numerical calculation that uses a given function begins with a finite-dimensional representation of that function. Because standard approximations, e.g., Galerkin truncations or finite element decompositions, suffer from the curse of dimensionality, it is nearly impossible to scale such methods to large dimensions. Fundamentally, these representations are linear combinations of basis functions. The issue arises because the dimensionality of the representation is equal to that of the truncation. Neural networks, on the other hand, are highly nonlinear in their adjusting parameters. As a result, their effective dimensionality is much higher than the number of their parameters, which may explain their better capabilities observed in practice to approximate functions even when is large. Characterizing this observation with analysis is non-trivial though, precisely because the representation of a function by a neural network is nonlinear in its parameters. This renders many of the standard tools of numerical analysis useless, since they are in large part based on linear algebra.
The significant achievements of machine learning have inspired many efforts to provide theoretical justification to a vast and growing body of empirical knowledge. At the core of our understanding of neural networks are the “Universal Approximation Theorems” that specify the conditions under which a neural network can represent a target function with arbitrary accuracy [10, 4, 23]. These results do not, however, indicate how the network parameters should be determined to achieve maximal accuracy when their number is fixed . Additionally, these theorems do not provide guidance on how the error scales with the number of parameters. Several recent papers have focused on the analysis of the shape and properties of the objective or “loss” function landscape [24, 9, 3]. These studies have mainly focused on the fine features of this landscape, trying to understand how non-convex it is and making analogies with glassy landscapes. Additionally, some analysis has been performed in cases where the number of parameters vastly exceeds the amount of training data, a setting that guarantees convexity and dramatically simplifies the landscape. Further studies have examined the dynamics of the parameters on the loss landscape to understand the properties of optimization procedures based on stochastic gradient descent.
In this paper, we adopt a different perspective which enables powerful tools for a more formal analysis. Similar to what was recently proposed in [22, 29], we view the parameters in the network as particles and the loss function as a potential that dictates the interaction between them. Correspondingly, training the network is thought of as the evolution of the particles in this interaction potential. We also consider the empirical distribution of the interacting particles / parameters and analyze the properties of this distribution when the number is large using standard limit theorems[17, 26, 18, 27]. This viewpoint allows us to bypass many of the difficulties that arise with approaches that attempt to study the dynamics of the individual particles. For example, we rederive the Universal Approximation Theorem as a corollary to the Law of Large Numbers (LLN) for the empirical distribution of the particles. We also establish that the loss landscape is asymptotically convex for large in the space of the empirical distribution of the particles, and assert that convergence towards equilibrium of this distribution occurs on a time scale that is independent of to leading order—similar results were obtained in [22, 29]. Finally, we prove a Central Limit Theorem for the empirical distribution, and thereby conclude that the approximation error of the function representation by a neural network is universal and scales as as in any . These results are established first in situation where gradient descent (GD) is used to perform the training of the parameters in the network, and then shown to also apply in the context of stochastic gradient descent (SGD). In the latter case, our analysis shed light on the nature of the noise introduced in SGD, and indicates how the time step and he batch size should be scaled to achieve the optimal error. Let us briefly elaborate on these statements next, starting with a precise formulation of the problem.
1.1. Problem set-up
Given a function defined on , consider its approximation by
where , with are parameters to be learned for , and is some kernel—for simplicity we assume throughout this paper that is compact. Many models used in machine learning can be cast in the form (1.1):
Radial basis function networks. In this case and where is some kernel, for example that of a radial function such as
where is a fixed constant.
Single-layer neural networks. In this case, and with , , and
is e.g. a sigmoid function.
Multi-layer neural networks. These are essentially iterated versions of single-layer neural networks. For example, to construct a two-layer network we take as above and for , define such that
where , , , , . Therefore here we have (where with a slight abuse of notation we view the matrix
has a vector in). Three-layer networks, etc. can be constructed similarly. Note that our results apply to deep neural networks when the final layer grows large, not the total number of parameters.
In view of the growing range of applications of these methods, it is natural to ask:
How good can the approximation (1.1) be if we optimize ?
Can we guarantee the convergence of the commonly used optimization algorithms?
To answer the first question, we need to introduce a distance, or loss function, between and . A natural candidate often used in practice is
where is some measure on with a positive density and such that (for example the Lebesgue measure, , if is compact). This also offers a way to address the second question, since we can view as an objective function for :
where and we defined
Trying to minimize (1.3) over leads to difficulties, however, since this is potentially (and presumably) a non-convex optimization problem, which typically have local minimizers. As a result determining the distance (1.2) at the minimum (and its scaling with , say) is also nontrivial. To bypass these difficulties we study the problem in terms of its empirical distribution.
1.2. Functional formulation
Assume that the set is such that
for some signed density on . If is smooth, a sufficient condition is that the empirical measure
converges weakly to as . Clearly, this can be achieved, e.g. by drawing the
’s independently from some probability density functionsuch that for all and , and setting : we then have as by the Law of Large Numbers, with an error scaling as by the Central Limit Theorem. We show this scaling can be improved upon. In the large limit, (1.3) converges to an objective function for :
1.3. Universal Representation Theorem
To formalize these ideas, it is useful to view as a map from to and introduce also its adjoint . These operators are defined for any suitable and respectively as
We can then write the loss function in (1.2) evaluated on as
where and are given explicitly in (1.4). We can also express these operators as and : viewed as an operator mapping into itself, is symmetric and nonnegative.
The operator is bounded, i.e. , and square integrable:
Indeed this assumption holds for the kernels used in machine learning, and it guarantees that both and are compact operators whose domains are and , respectively. Under Assumption 1.1, it is well known that (1.9) admits at least one minimizer (and (1.15) at least one solution) if and only if (where ‘ran’ denotes the range of the operator), which is a dense subspace of . This solution/minimizer may not be unique, since we can add to it any element of , which is not necessarily trivial. In the present context, this is no real issue, however, since we only care that gives for any solution of (1.15). This is why the kernels used in machine learning are designed so that they satisfy:
Assumption 1.2 (Discriminating Kernel).
The null space of the adjoint is trivial, i.e.
We can summarize the discussion above into:
Theorem 1.3 (Universal Representation Theorem).
we see that (1.14) follows by Assumption 1.2. To show that there is a choice of such that (1.16) holds pick, for example, the independently from some probability density function such that for all and , and set . (1.16) then follows by the Law of Large Numbers. ∎
Note that the Universal Representation Theorem only gives , not . This suffices in our case since we cannot manipulate (1.15) directly, but rather need to go back to the representation (1.6). Using introduces errors due to the finiteness of which, even if they decrease as , remain larger than those induced by replacing by if we pick small enough. In other words, we can bound and make smaller than . For this reason, in what follows we do not distinguish from . What remains to be shown, however, is that the Universal Representation Theorem can be approximately realized in practice via dynamic training of the parameters in (1.1) (recall that we have no direct access to ). In addition we seek to assess quantitatively the rate of convergence in time toward this approximation and the error due to the finiteness of . We achieve this aim by treating the parameters as a set of interacting particles.
Depending on the situation, it may be useful to consider other objective functions than (1.2). For example, in the context of elliptic problems, one is led to minimize
is some positive-definite tensor and, and we impose e.g. on . This problem lends itself naturally to the present framework upon straightforward modifications, such as redefinition of the norm. For simplicity, in the present paper we only address the minimization of (1.2).
1.4. Parameters as particles with loss function as interaction potential
It is useful to view the set as particles (with viewed as a particle position and the weight as its charge), and use (1.3) as an interaction potential between them. In this interpretation, we can perform the training by making these parameters evolve by gradient descent (GD) in this potential, or stochastic gradient descent (SGD)—the method of choice used in machine learning to train neural networks. If we denote these time-dependent parameters as with , we study the way
evolves in time, as well as the behavior of this function for large . In particular, we establish a Law of Large Numbers (LLN) for as well as a Central Limit Theorem (CLT), and thereby assess the error and trainability of neural networks representations.
1.5. Discrete training set and stochastic gradient descent
For most choices of the kernel commonly encountered in machine learning it is not possible to calculate (1.2) and (1.4) exactly. Rather, we approximate these integrals using a training set, i.e. a set of points distributed according to , possibly independent, and over which is known. We then approximate by
and and/or by
These approximations are precisely what SGD relies upon to calculate the gradient of the interaction potential / loss function, and we assess the errors they introduce.
We focus on situations in which we can redraw the training set as often as we need, namely, at every step during the learning process. In this case, in the limit as the updating time step used in SGD tends to zero, SGD becomes asymptotically equivalent to an SDE whose drift terms coincide with those of GD but with multiplicative noise terms added.
1.6. Universal scaling error of neural networks
The main results we obtain can be summarized as follows: the function (1.19) evolves by GD in Wasserstein metric on a convex landscape (quadratic at leading order in ) and is such that
with a convergence rate in time that is independent of to leading order as . The convergence (1.22) holds for initial conditions satisfying specific conditions outlined in Sec. 2.3. This scaling also holds generically for SGD if we choose the size of the batch used in (1.21) such that as . If we set with , we lose accuracy and (1.22) is replaced by
If we set with , there is no gain and we get (1.22) back. These results are stated in Proposition 2.6 in the context of GD and in Proposition 3.2 in the context of SGD. In Appendix A we also establish a finite-temperature variant (Langevin dynamics) of (1.22) which applies when additive noise terms are added in the evolution equation for the particles / parameters. This result reads
where is a parameter playing the role of inverse temperature, is some given (non-random) function and
is a white-noise process, i.e. a Gaussian random field with zero mean and covariance. Note that (1.24) gives (1.22) back after quenching (i.e. by sending ). The result in (1.24) is stated in Proposition A.3
1.7. Style and organization
As is apparent from the discussion above, our approach has strong ties with the statistical mechanics of systems of many interacting particles. This is an active area of research in which rigorous results have been obtained recently, primarily in the context of Coulomb or Riesz interaction potentials. These potentials lead to kernels that are not compact operators. The situation we consider here is therefore different, and simpler in some technical ways. The main additional issues we face are that (i) our kernels are degenerate, i.e. is not injective in general, and (ii) the weights / charges are not fixed, but rather evolve alongside the particles.
Despite these difficulties, we believe that providing rigorous proof to each of our statements can be achieved using the mathematical apparatuses developed in the context of interacting particle systems—to a certain extent, this program was already started in[22, 29]. In this paper we adopt a semi-rigorous presentation and rely on formal asymptotic arguments to derive our results. This has the advantage of making the developments easier to follow. We also exploit the specific structure of the interaction potentials arising in the context of neural networks. The structure of the class of problems we study alleviates certain difficulties arising in standard interacting particle systems.
The remainder of this paper is organized as follows:
In Sec. 2 we study a model system of interacting particles undergoing gradient descent on the network loss function—this dynamics is similar to stochastic gradient descent, which is used in machine learning to train networks, but is more readily amenable to analysis. The equations for the empirical distribution of the particles are derived in Sec. 2.1. In Sec. 2.2 we study the limiting behavior of the empirical distribution and analyze the scaling of the fluctuations around this limit. The equations derived this way are then used in Sec. 2.3 to establish a Law of Large Numbers and in Sec. 2.4 a Central Limit Theorem. These results have direct implications in terms of the approximation error of the function representation by the neural network and its scaling with the number of particles, as well as the dynamics of training.
We then study stochastic gradient descent in Sec. 3, where we revisit all our results in a more practical context: In Sec. 3.1 we derive the stochastic differential equation (SDE) to which SGD is asymptotically equivalent for small time steps. The SDEs we obtain have multiplicative noise terms added to the drift terms in the GD equations studied in Sec. 2. The limiting behavior of the empirical distribution and the scaling of the fluctuations around it are then analyzed in Sec. 3.4 where we rederive a LLN and a CLT in the context of SGD, and discuss their implications in terms of approximation error and trainability.
These results are illustrated in Sec. 4, where we use a spherical -spin model with as test complex function to represent with a neural network. We show that the network accurately approximates this function in up to dimensions, with a scaling of the error consistent with the results established in Secs. 2 and 3. These results are obtained using both a radial basis function network, and a single-layer network using sigmoid functions.
2. Interacting particles with adaptive fractional charges for training
We define an idealized set of dynamical equations for that can be used to train the network by updating its parameters dynamically, and we analyze these equations as and . Specifically, we assume that
satisfy the following system of ordinary differential equations (ODEs):
for . As we show in Sec. 3, (2.1) share many properties with the stochastic gradient descent, though in SGD a multiplicative noise term persists in the equations. In Appendix A we also consider a finite-temperature version of these equations which have additive noise terms. The ODEs in (2.1) are the gradient descent flow on the energy:
This energy is simply the loss function in (1.3) rescaled by .
We consider (2.1) with initial conditions such that every pair in is drawn independently from some probability density function such that:
The density is smooth in both its arguments, and such that in and .
We denote the measure for the infinite set constructed this way by . Initial conditions of this type are frequently used in practice.
In order to guarantee global existence and uniqueness of the solution to (2.1) we also make
The kernel is a continuously differentiable function of for all .
This assumption guarantees that the functions and are continuously differentiable in their arguments, and that the energy is continuously differentiable and coercive, i.e., for every the sub-level set
2.1. Empirical distribution and McKean-Vlasov equations
To proceed, we consider the empirical distribution
in terms of which we can express (1.19) as
where we used the shorthands and .
Derivation of Dean’s equation (2.6)
Let us take the time derivative of (2.4
). By the chain rule:
Pulling the derivatives in front of the sums and using (2.1) we can write this equation as
where we used the Dirac delta to replace by and by . We can now use the definition of to replace by . In addition, if we use
2.2. Limit behavior and fluctuations scaling
Let us now use Dean’s equation (2.6) to derive equations for the limit of as and for the fluctuations around this limit. All limits should be understood in the weak (or distributional) sense because in the end we care about , not itself, and is obtained by testing against , as in (2.5).
2.2.1. Zeroth order term—mean field limit
If we formally take the limit as of (2.6), we deduce that , where satisfies
Even though (2.10) is formally identical to (2.6), it has a different meaning: We can look for a solution of it for the smooth initial condition since we know that -almost surely as , by the Law of Large Numbers. However, to justify that , the solution of (2.10) for this initial condition, is indeed the weak limit of the empirical distribution at later times, we have to control the fluctuations of around . We do this next and postpone the analysis of (2.10) until Sec. 2.3.
2.2.2. Fluctuations around the mean
Let us now consider the fluctuations of around . The scale of these fluctuations change with time and to account for this effect, we define via:
where the exponent depends on as specified below. Explicitly, (2.15) means:
As we show next, the function must initially be , the scale set by the Central Limit Theorem applied to the initial parameter selection. However, subsequent annealing of the error will allow us to change so that .
To see that choosing sets the right scale to look at the fluctuations around the initial conditions, notice that if we pick a test function the CLT tells us that under
where, and we defined
We can write (2.17) distributionally as
In order to take the limit as of this equation, we need to consider carefully the behavior of the the factors in (2.20) that contain explicitly, that is, and . Consider the latter first. If we set
where denotes expectation with respect to —for example, if , this expectation is where is given in (2.18). Equation (2.22) implies that -almost surely as , at if . We can now argue that this statement also holds at times if (2.21) holds. To this end we write (2.20) compactly as
where contains the terms at the right hand side of (2.20) that are linear in , and contains the terms involving . In order to control the term, we can write an equation for : this equation is of the form (2.23) with an additional linear term involving (same as but acting on ), the source term replaced by one involving , and the last term in (2.23) replaced by : a calculation similar to the one that gives (2.22) indicates that at this source term is higher order than the rest and goes to zero -almost surely as . The same is true for if (2.21) holds. We can then derive equations for and so on, and each time reach the same conclusion: they involve a linear part made of operators , , etc. and a remainder that is higher order. The specific form of the operator appearing in (2.23) is given by its action on any function
where is defined in (2.13) and