Neural Networks as Interacting Particle Systems: Asymptotic Convexity of the Loss Landscape and Universal Scaling of the Approximation Error

05/02/2018 ∙ by Grant M. Rotskoff, et al. ∙ NYU

Neural networks, a central tool in machine learning, have demonstrated remarkable, high fidelity performance on image recognition and classification tasks. These successes evince an ability to accurately represent high dimensional functions, potentially of great use in computational and applied mathematics. That said, there are few rigorous results about the representation error and trainability of neural networks, as well as how they scale with the network size. Here we characterize both the error and scaling by reinterpreting the standard optimization algorithm used in machine learning applications, stochastic gradient descent, as the evolution of a particle system with interactions governed by a potential related to the objective or "loss" function used to train the network. We show that, when the number n of parameters is large, the empirical distribution of the particles descends on a convex landscape towards a minimizer at a rate independent of n. We establish a Law of Large Numbers and a Central Limit Theorem for the empirical distribution, which together show that the approximation error of the network universally scales as o(n^-1). Remarkably, these properties do not depend on the dimensionality of the domain of the function that we seek to represent. Our analysis also quantifies the scale and nature of the noise introduced by stochastic gradient descent and provides guidelines for the step size and batch size to use when training a neural network. We illustrate our findings on examples in which we train a neural network to learn the energy function of the continuous 3-spin model on the sphere. The approximation error scales as our analysis predicts in as high a dimension as d=25.


1. Motivation and main results

While classification problems continue to be an active area of research, extraordinary progress has been made on both speech and image recognition, problems that appeared intractable only a decade ago [19]. By harnessing the power of neural networks while simultaneously benefiting from advances in computational hardware, complex tasks such as automatic language translation are now routinely performed by computers with a high degree of reliability. The underlying explanation for these significant advances seems to be related to the expressive power of neural networks, and their ability to represent high dimensional functions with accuracy.

These successes open exciting possibilities in applied and computational mathematics that are only beginning to be explored [6, 25, 16, 7, 12, 5, 30]. Any numerical calculation that uses a given function begins with a finite-dimensional representation of that function. Because standard approximations, e.g., Galerkin truncations or finite element decompositions, suffer from the curse of dimensionality, it is nearly impossible to scale such methods to large dimensions. Fundamentally, these representations are linear combinations of basis functions, and the issue arises because the dimensionality of the representation is equal to that of the truncation. Neural networks, on the other hand, are highly nonlinear in their adjustable parameters. As a result, their effective dimensionality is much higher than the number of their parameters, which may explain their better capabilities, observed in practice, to approximate functions even when the dimension is large. Characterizing this observation with analysis is nontrivial, though, precisely because the representation of a function by a neural network is nonlinear in its parameters. This renders many of the standard tools of numerical analysis useless, since they are in large part based on linear algebra.

The significant achievements of machine learning have inspired many efforts to provide theoretical justification to a vast and growing body of empirical knowledge. At the core of our understanding of neural networks are the “Universal Approximation Theorems” that specify the conditions under which a neural network can represent a target function with arbitrary accuracy [10, 4, 23]. These results do not, however, indicate how the network parameters should be determined to achieve maximal accuracy when their number is fixed [8]. Additionally, these theorems do not provide guidance on how the error scales with the number of parameters. Several recent papers have focused on the analysis of the shape and properties of the objective or “loss” function landscape [24, 9, 3]. These studies have mainly focused on the fine features of this landscape, trying to understand how non-convex it is and making analogies with glassy landscapes. Additionally, some analysis has been performed in cases where the number of parameters vastly exceeds the amount of training data, a setting that guarantees convexity and dramatically simplifies the landscape. Further studies have examined the dynamics of the parameters on the loss landscape to understand the properties of optimization procedures based on stochastic gradient descent.

In this paper, we adopt a different perspective which enables powerful tools for a more formal analysis. Similar to what was recently proposed in [22, 29], we view the parameters in the network as particles and the loss function as a potential that dictates the interaction between them. Correspondingly, training the network is thought of as the evolution of the particles in this interaction potential. We also consider the empirical distribution of the interacting particles / parameters and analyze the properties of this distribution when the number n of parameters is large, using standard limit theorems [17, 26, 18, 27]. This viewpoint allows us to bypass many of the difficulties that arise with approaches that attempt to study the dynamics of the individual particles. For example, we rederive the Universal Approximation Theorem as a corollary to the Law of Large Numbers (LLN) for the empirical distribution of the particles. We also establish that the loss landscape is asymptotically convex for large n in the space of the empirical distribution of the particles, and assert that convergence towards equilibrium of this distribution occurs on a time scale that is independent of n to leading order—similar results were obtained in [22, 29]. Finally, we prove a Central Limit Theorem for the empirical distribution, and thereby conclude that the approximation error of the function representation by a neural network is universal and scales as o(n^-1) in any dimension d. These results are established first in the situation where gradient descent (GD) is used to perform the training of the parameters in the network, and then shown to also apply in the context of stochastic gradient descent (SGD). In the latter case, our analysis sheds light on the nature of the noise introduced in SGD, and indicates how the time step and the batch size should be scaled to achieve the optimal error. Let us briefly elaborate on these statements next, starting with a precise formulation of the problem.

1.1. Problem set-up

Given a function f defined on a domain in R^d, consider its approximation by

(1.1)

where the parameters to be learned consist, for each of the n terms, of a scalar weight and a position in a parameter domain, and where the sum is built from some kernel—for simplicity we assume throughout this paper that the parameter domain is compact. Many models used in machine learning can be cast in the form (1.1); a minimal code sketch of the first two cases is given just after this list:

  • Radial basis function networks. In this case the positions live in the same domain as the input variable, and the unit response is a kernel of radial type, for example a Gaussian of the distance between the input and the position, whose width is a fixed constant.

  • Single-layer neural networks. In this case each position collects an inner-layer weight vector together with a bias, and the unit response is a fixed nonlinearity, e.g. a sigmoid function, applied to the corresponding affine function of the input.

  • Multi-layer neural networks. These are essentially iterated versions of single-layer neural networks. For example, to construct a two-layer network we take units as above and compose them: the outputs of a first layer of units are combined linearly and passed through the nonlinearity again, so that the parameters of each term collect the weights and biases of both layers (with a slight abuse of notation, the weight matrix of the inner layer is viewed as a vector). Three-layer networks, etc. can be constructed similarly. Note that our results apply to deep neural networks when the size of the final layer grows large, not the total number of parameters.

In view of the growing range of applications of these methods, it is natural to ask:

  1. How good can the approximation (1.1) be if we optimize over the parameters?

  2. Can we guarantee the convergence of the commonly used optimization algorithms?

To answer the first question, we need to introduce a distance, or loss function, between the target function f and its approximation. A natural candidate often used in practice is

(1.2)

where the integral is taken with respect to some measure with a positive density and finite total mass (for example the Lebesgue measure if the domain is compact). This also offers a way to address the second question, since we can view the loss as an objective function for the parameters:

(1.3)

where the first term is a constant that does not depend on the parameters, and we defined

(1.4)

Trying to minimize (1.3) over the parameters leads to difficulties, however, since this is potentially (and presumably) a non-convex optimization problem, which typically has local minimizers. As a result, determining the distance (1.2) at the minimum (and its scaling with n, say) is also nontrivial. To bypass these difficulties we study the problem in terms of the empirical distribution of the parameters.
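
Because the displayed formulas were lost in the extraction of this text, it may help to record one consistent way of writing the objects referred to in (1.1)-(1.4). The notation below (f_n, c_i, y_i, \varphi, \mu, F, K) is our own, chosen to match the surrounding description, and is not necessarily the paper's exact choice of symbols:

    \[
    f(x) \approx f_n(x) = \frac{1}{n}\sum_{i=1}^{n} c_i\,\varphi(x,y_i), \qquad
    \ell_n = \frac{1}{2}\int \big|f(x)-f_n(x)\big|^2\,d\mu(x),
    \]
    \[
    \ell_n = C_f - \frac{1}{n}\sum_{i=1}^{n} c_i F(y_i)
           + \frac{1}{2n^2}\sum_{i,j=1}^{n} c_i c_j K(y_i,y_j),
    \]
    \[
    F(y) = \int f(x)\,\varphi(x,y)\,d\mu(x), \qquad
    K(y,y') = \int \varphi(x,y)\,\varphi(x,y')\,d\mu(x), \qquad
    C_f = \tfrac{1}{2}\int f^2\,d\mu.
    \]

In this form the single sum acts as an external-field term and the double sum as a pairwise interaction, which is the structure exploited in Sec. 1.4.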

1.2. Functional formulation

Assume that the function we wish to represent is such that

(1.5)

for some signed density on the parameter domain. If the kernel is smooth, a sufficient condition for the approximation (1.1) to converge to the target is that the empirical measure

(1.6)

converges weakly to this density as n → ∞. Clearly, this can be achieved, e.g., by drawing the positions independently from some probability density function that is positive on the support of the signed density, and setting each weight equal to the ratio of the signed density to this probability density at the corresponding position: we then have convergence as n → ∞ by the Law of Large Numbers, with an error scaling as n^{-1/2} by the Central Limit Theorem. We show this scaling can be improved upon. In the large-n limit, (1.3) converges to an objective function for the density:

(1.7)

Unlike (1.3), this objective function is quadratic in the density. This means that minimizing (1.7) over densities, rather than (1.3) over the parameters, is conceptually simpler than the direct minimization.
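
In the same assumed notation (ours, not necessarily the paper's), the functional formulation sketched here reads: the target admits an integral representation against a signed density \gamma on the parameter domain, the weighted empirical measure approximates \gamma, and the limiting objective is quadratic in \gamma:

    \[
    f(x) = \int \varphi(x,y)\,\gamma(y)\,dy, \qquad
    \gamma_n = \frac{1}{n}\sum_{i=1}^{n} c_i\,\delta_{y_i}, \qquad
    C[\gamma] = \frac{1}{2}\int \Big| f(x) - \int \varphi(x,y)\,\gamma(y)\,dy \Big|^2 d\mu(x).
    \]

Quadratic functionals of this type are convex, which is the key structural simplification over the parameter-space objective (1.3).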

1.3. Universal Representation Theorem

To formalize these ideas, it is useful to view the kernel as defining an integral operator mapping densities on the parameter domain to functions on the input domain, and to introduce also its adjoint. These operators are defined, for any suitable density and function respectively, as

(1.8)

We can then write the loss function in (1.2) evaluated on the representation (1.5) as

(1.9)

and the question becomes whether there exists a minimizer of this objective function at which the loss vanishes. Note that any such minimizer is also a solution to the Euler-Lagrange equation for both (1.9) and (1.7):

(1.10)

where the function and the kernel appearing in it are given explicitly in (1.4); they can also be expressed as the adjoint applied to the target and as the kernel of the composition of the adjoint with the original operator, respectively. Viewed as an operator mapping the space of densities into itself, this composition is symmetric and nonnegative.

The question of existence of solutions to (1.15) falls within the realm of the Fredholm theory of integral equations of the first kind [13]. For neural networks, it is natural to make:

Assumption 1.1.

The kernel is bounded, i.e. its supremum is finite, and square integrable:

(1.11)

Indeed this assumption holds for the kernels used in machine learning, and it guarantees that the operator and its adjoint are both compact. Under Assumption 1.1, it is well known that (1.9) admits at least one minimizer (and (1.15) at least one solution) if and only if the target function lies in the range of the operator, which is a dense subspace. This solution/minimizer may not be unique, since we can add to it any element of the null space of the operator, which is not necessarily trivial. In the present context this is no real issue, however, since every solution of (1.15) gives back the same representation of the target function. This is why the kernels used in machine learning are designed so that they satisfy:

Assumption 1.2 (Discriminating Kernel).

The null space of the adjoint is trivial, i.e.

(1.12)

We can summarize the discussion above as follows:

Theorem 1.3 (Universal Representation Theorem).

Under Assumptions 1.1 and 1.2, given any square-integrable target function and any tolerance, there exists a function such that

(1.13)

and admits the representation

(1.14)

where the density solves

(1.15)

The function can also be realized as

(1.16)

for some choice of the parameters in (1.1).

Proof.

By Assumption 1.1, (1.15) admits at least one solution whenever the function to be represented lies in the range of the operator. Since this subspace is dense, (1.13) can be satisfied. By writing (1.15) as

(1.17)

we see that (1.14) follows by Assumption 1.2. To show that there is a choice of parameters such that (1.16) holds, pick, for example, the positions independently from some probability density function that is positive on the support of the solution density, and set each weight equal to the ratio of the solution density to this probability density at the corresponding position. (1.16) then follows by the Law of Large Numbers. ∎

Note that the Universal Representation Theorem only gives an approximation of the target function to within the prescribed tolerance, not the target itself. This suffices in our case since we cannot manipulate (1.15) directly, but rather need to go back to the representation (1.6). Using a finite number of parameters introduces errors which, even if they decrease as n grows, remain larger than the error incurred by replacing the target by its approximation if we pick the tolerance small enough. For this reason, in what follows we do not distinguish the target function from its approximation. What remains to be shown, however, is that the Universal Representation Theorem can be approximately realized in practice via dynamic training of the parameters in (1.1) (recall that we have no direct access to the solution of (1.15)). In addition we seek to assess quantitatively the rate of convergence in time toward this approximation and the error due to the finiteness of n. We achieve this aim by treating the parameters as a set of interacting particles.

Remark 1.1.

Depending on the situation, it may be useful to consider other objective functions than (1.2). For example, in the context of elliptic problems, one is led to minimize

(1.18)

where the coefficient is some positive-definite tensor, the right-hand side is a given function, and we impose e.g. homogeneous boundary conditions on the boundary of the domain. This problem lends itself naturally to the present framework upon straightforward modifications, such as a redefinition of the norm. For simplicity, in the present paper we only address the minimization of (1.2).

1.4. Parameters as particles with loss function as interaction potential

It is useful to view the set of parameters as particles (with the position component of each parameter viewed as a particle position and its weight as a charge), and to use (1.3) as an interaction potential between them. In this interpretation, we can perform the training by making these parameters evolve by gradient descent (GD) in this potential, or by stochastic gradient descent (SGD)—the method of choice used in machine learning to train neural networks. If we denote these parameters as functions of the training time, we study the way

(1.19)

evolves in time, as well as the behavior of this function for large n. In particular, we establish a Law of Large Numbers (LLN) as well as a Central Limit Theorem (CLT) for it, and thereby assess the error and trainability of neural network representations.
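
To make the particle picture concrete, here is a minimal sketch (in Python, with our own names and notation) of one gradient-descent step for the charges and positions, assuming the one-body term F and pairwise kernel K from the hedged expansion at the end of Sec. 1.1 are available as callables together with their gradients:

    import numpy as np

    def gd_step(c, y, F, gradF, K, grad1K, dt=0.1):
        # One explicit-Euler step of gradient descent on n times the loss:
        # charges feel the one-body term minus the mean kernel field,
        # positions are advected by their own charge times the field gradient.
        # c: (n,) charges; y: (n, d) positions; F, gradF, K, grad1K: callables.
        n = len(c)
        Kmat = np.array([[K(yi, yj) for yj in y] for yi in y])   # (n, n) pairwise kernel
        field = Kmat @ c / n                                     # (1/n) sum_j c_j K(y_i, y_j)
        gfield = np.array([sum(cj * grad1K(yi, yj) for cj, yj in zip(c, y))
                           for yi in y]) / n                     # gradient of that field at each y_i
        c_new = c + dt * (np.array([F(yi) for yi in y]) - field)
        y_new = y + dt * c[:, None] * (np.array([gradF(yi) for yi in y]) - gfield)
        return c_new, y_new

The pairwise term is what makes this an interacting particle system: the force on particle i involves a sum over all other particles weighted by their charges.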

1.5. Discrete training set and stochastic gradient descent

For most choices of the kernel commonly encountered in machine learning it is not possible to calculate (1.2) and (1.4) exactly. Rather, we approximate these integrals using a training set, i.e. a set of points distributed according to the measure in (1.2), possibly drawn independently, at which the target function is known. We then approximate the loss by

(1.20)

and the one-body and/or two-body integrals in (1.4) by

(1.21)

These approximations are precisely what SGD relies upon to calculate the gradient of the interaction potential / loss function, and we assess the errors they introduce.

We focus on situations in which we can redraw the training set as often as we need, namely, at every step during the learning process. In this case, in the limit as the updating time step used in SGD tends to zero, SGD becomes asymptotically equivalent to an SDE whose drift terms coincide with those of GD but with multiplicative noise terms added.
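
A minimal sketch of the corresponding SGD update under this resampling assumption: a fresh batch of P points is drawn from the measure at every step and used to estimate the residual and the gradient of the batch loss (1.20). All names here (phi, grad_y_phi, sample_mu, P) are our own, introduced only for illustration:

    import numpy as np

    def sgd_step(c, y, f, phi, grad_y_phi, sample_mu, P=256, dt=0.05, rng=None):
        # One SGD step: redraw a batch x ~ mu, estimate the residual f - f_n on it,
        # form the batch gradients (scaled by n, as in the GD flow), and update.
        # f evaluates the target on a batch; sample_mu(P, rng) returns P points.
        rng = rng or np.random.default_rng()
        X = sample_mu(P, rng)                                    # fresh batch at every step
        Phi = np.array([[phi(x, yi) for yi in y] for x in X])    # (P, n)
        resid = f(X) - Phi @ c / len(c)                          # (P,) residual on the batch
        grad_c = -Phi.T @ resid / P                              # (n,)
        grad_y = -np.array([
            sum(r * ci * grad_y_phi(x, yi) for r, x in zip(resid, X)) / P
            for ci, yi in zip(c, y)
        ])                                                       # (n, d)
        return c - dt * grad_c, y - dt * grad_y

Because the batch is redrawn independently at every step, the gradient estimate is unbiased, and its fluctuations are what give rise, in the small-time-step limit, to the multiplicative noise terms mentioned above.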

1.6. Universal scaling of the approximation error of neural networks

The main results we obtain can be summarized as follows: the function (1.19) evolves by GD in the Wasserstein metric on a convex landscape (quadratic at leading order as n grows) and is such that

(1.22)

with a convergence rate in time that is independent of n to leading order as n → ∞. The convergence (1.22) holds for initial conditions satisfying specific conditions outlined in Sec. 2.3. This scaling also holds generically for SGD if we choose the size of the batch used in (1.21) to grow sufficiently fast with n. If the batch size grows more slowly than this, we lose accuracy and (1.22) is replaced by

(1.23)

If the batch size grows faster than necessary, there is no further gain and we get (1.22) back. These results are stated in Proposition 2.6 in the context of GD and in Proposition 3.2 in the context of SGD. In Appendix A we also establish a finite-temperature variant (Langevin dynamics) of (1.22), which applies when additive noise terms are added to the evolution equation for the particles / parameters. This result reads

(1.24)

where one parameter plays the role of an inverse temperature, the additional term involves some given (non-random) function, and the forcing is a white-noise process, i.e. a Gaussian random field with zero mean and delta-correlated covariance. Note that (1.24) gives (1.22) back after quenching (i.e. by sending the inverse temperature to infinity). The result in (1.24) is stated in Proposition A.3.

1.7. Style and organization

As is apparent from the discussion above, our approach has strong ties with the statistical mechanics of systems of many interacting particles. This is an active area of research in which rigorous results have been obtained recently, primarily in the context of Coulomb or Riesz interaction potentials. These potentials lead to kernels that are not compact operators. The situation we consider here is therefore different, and simpler in some technical ways. The main additional issues we face are that (i) our kernels are degenerate, i.e. the associated operator is not injective in general, and (ii) the weights / charges are not fixed, but rather evolve alongside the particles.

Despite these difficulties, we believe that rigorous proofs of each of our statements can be achieved using the mathematical apparatus developed in the context of interacting particle systems—to a certain extent, this program was already started in [22, 29]. In this paper we adopt a semi-rigorous presentation and rely on formal asymptotic arguments to derive our results. This has the advantage of making the developments easier to follow. We also exploit the specific structure of the interaction potentials arising in the context of neural networks; the structure of the class of problems we study alleviates certain difficulties arising in standard interacting particle systems.

The remainder of this paper is organized as follows:

In Sec. 2 we study a model system of interacting particles undergoing gradient descent on the network loss function—this dynamics is similar to stochastic gradient descent, which is used in machine learning to train networks, but is more readily amenable to analysis. The equations for the empirical distribution of the particles are derived in Sec. 2.1. In Sec. 2.2 we study the limiting behavior of the empirical distribution and analyze the scaling of the fluctuations around this limit. The equations derived this way are then used in Sec. 2.3 to establish a Law of Large Numbers and in Sec. 2.4 a Central Limit Theorem. These results have direct implications in terms of the approximation error of the function representation by the neural network and its scaling with the number of particles, as well as the dynamics of training.

We then study stochastic gradient descent in Sec. 3, where we revisit all our results in a more practical context: In Sec. 3.1 we derive the stochastic differential equation (SDE) to which SGD is asymptotically equivalent for small time steps. The SDEs we obtain have multiplicative noise terms added to the drift terms in the GD equations studied in Sec. 2. The limiting behavior of the empirical distribution and the scaling of the fluctuations around it are then analyzed in Sec. 3.4 where we rederive a LLN and a CLT in the context of SGD, and discuss their implications in terms of approximation error and trainability.

These results are illustrated in Sec. 4, where we use a spherical 3-spin model as a complex test function to represent with a neural network. We show that the network accurately approximates this function in up to d = 25 dimensions, with a scaling of the error consistent with the results established in Secs. 2 and 3. These results are obtained using both a radial basis function network and a single-layer network using sigmoid functions.

Some concluding remarks are made in Sec. 5 and the situation where the optimization is done with GD at finite temperature is treated in Appendix A.

2. Interacting particles with adaptive fractional charges for training

We define an idealized set of dynamical equations for the parameters that can be used to train the network by updating them dynamically, and we analyze these equations as n → ∞ and as time grows. Specifically, we assume that the time-dependent parameters satisfy the following system of ordinary differential equations (ODEs):

(2.1)

for i = 1, …, n. As we show in Sec. 3, the equations (2.1) share many properties with stochastic gradient descent, though in SGD a multiplicative noise term persists in the equations. In Appendix A we also consider a finite-temperature version of these equations, which has additive noise terms. The ODEs in (2.1) are the gradient descent flow on the energy:

(2.2)

This energy is simply the loss function in (1.3) rescaled by a factor of n.
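
In the notation assumed earlier (ours, not necessarily the paper's), a consistent way of writing (2.1)-(2.2) is as the gradient flow of the rescaled energy E_n = n \ell_n, which gives

    \[
    \dot c_i = F(y_i) - \frac{1}{n}\sum_{j=1}^{n} c_j\,K(y_i,y_j), \qquad
    \dot y_i = c_i\Big(\nabla F(y_i) - \frac{1}{n}\sum_{j=1}^{n} c_j\,\nabla_1 K(y_i,y_j)\Big),
    \qquad i = 1,\dots,n,
    \]

where \nabla_1 K denotes the gradient of K with respect to its first argument. The explicit Euler discretization of these equations is the full-batch gradient-descent step sketched at the end of Sec. 1.4.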

We consider (2.1) with initial conditions such that every weight-position pair is drawn independently from some probability density function such that:

Assumption 2.1.

The density is smooth in both its arguments and satisfies suitable positivity and integrability conditions on its domain.

The construction induces a measure on the infinite sequence of parameters, which we use below for almost-sure statements. Initial conditions of this type are frequently used in practice.

In order to guarantee global existence and uniqueness of the solution to (2.1) we also make

Assumption 2.2.

The kernel is a continuously differentiable function of the parameter position for every value of the input.

This assumption guarantees that the functions defined in (1.4) are continuously differentiable in their arguments, and that the energy is continuously differentiable and coercive, i.e., for every finite value of the energy the sub-level set

(2.3)

is compact.

2.1. Empirical distribution and McKean-Vlasov equations

To proceed, we consider the empirical distribution

(2.4)

in terms of which we can express (1.19) as

(2.5)

The empirical distribution (2.4) is useful to work with because it satisfies the McKean-Vlasov equation [21]

(2.6)

where we use shorthands for the convolutions of the kernel terms with the empirical distribution, spelled out in the derivation below.

When there is noise, (2.6) is often referred to as Dean’s equation [11]. It should be viewed as a formal identity which is useful to analyze the properties of the empirical distribution as n → ∞, as we do in Secs. 2.2, 2.3 and 2.4.

Derivation of Dean’s equation (2.6)

Let us take the time derivative of (2.4). By the chain rule:

(2.7)

Pulling the derivatives in front of the sums and using (2.1) we can write this equation as

(2.8)

where we used the Dirac delta to replace the particle coordinates by the corresponding integration variables. We can now use the definition of the empirical distribution to replace the sums over particles by integrals against it. In addition, if we use

(2.9)

and similarly for the other kernel term, we see that we can write the right-hand side of (2.8) precisely as that in (2.6).
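
For reference, in the assumed notation used above (ours, hedged, not the paper's verbatim display), this computation produces the standard Dean / McKean-Vlasov form. Writing \theta = (c, y) for the combined particle coordinates, the empirical distribution \rho_n(t, c, y) = n^{-1}\sum_i \delta(c - c_i(t))\,\delta(y - y_i(t)) formally satisfies

    \[
    \partial_t \rho_n = \nabla_\theta \cdot \big( \rho_n \, \nabla_\theta V(c, y; \rho_n) \big), \qquad
    V(c, y; \rho) = -\,c\,F(y) + c \int c'\,K(y,y')\,\rho(c',y')\,dc'\,dy',
    \]

which simply restates the chain-rule computation: each particle moves with velocity equal to minus the gradient of the single-particle potential V generated by all the others.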

2.2. Limit behavior and fluctuations scaling

Let us now use Dean’s equation (2.6) to derive equations for the limit of the empirical distribution as n → ∞ and for the fluctuations around this limit. All limits should be understood in the weak (or distributional) sense because in the end we care about the represented function (1.19), not the empirical distribution itself, and the former is obtained by testing the latter against the kernel, as in (2.5).

2.2.1. Zeroth order term—mean field limit

If we formally take the limit as n → ∞ of (2.6), we deduce that the empirical distribution converges to a limiting density, which satisfies

(2.10)

Even though (2.10) is formally identical to (2.6), it has a different meaning: we can look for a solution of it with the smooth initial density from which the parameters are drawn, since we know that the empirical distribution converges to this density almost surely as n → ∞ at the initial time, by the Law of Large Numbers. However, to justify that the solution of (2.10) for this initial condition is indeed the weak limit of the empirical distribution at later times, we have to control the fluctuations of the empirical distribution around it. We do this next and postpone the analysis of (2.10) until Sec. 2.3.

Notice that (2.10) can be written as

(2.11)

where the energy functional is given by:

(2.12)

In (2.11) we can also write

(2.13)

where we defined

(2.14)

The energy in (2.12) is the continuous limit of (2.2) rescaled by 1/n, and (2.11) is the gradient descent flow on this energy in the Wasserstein metric [15, 28].
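
In the same assumed notation (a hedged sketch consistent with the mean-field descriptions cited in the text, not necessarily the paper's exact display), the limiting energy and the Wasserstein gradient-flow structure read

    \[
    E[\rho] = \frac{1}{2}\int \Big| f(x) - \int c\,\varphi(x,y)\,\rho(c,y)\,dc\,dy \Big|^2 d\mu(x), \qquad
    \partial_t \rho = \nabla_\theta \cdot \Big( \rho \, \nabla_\theta \frac{\delta E}{\delta \rho} \Big),
    \]

with \delta E / \delta \rho equal to the single-particle potential V written at the end of Sec. 2.1. Since E[\rho] is quadratic, hence convex, in \rho, the landscape seen by the limiting distribution is convex even though the finite-n landscape over the individual parameters is not.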

2.2.2. Fluctuations around the mean

Let us now consider the fluctuations of the empirical distribution around its mean-field limit. The scale of these fluctuations changes with time and, to account for this effect, we define the rescaled fluctuation field via:

(2.15)

where the scaling exponent may depend on time, as specified below. Explicitly, (2.15) means:

(2.16)

As we show next, the fluctuation scale must initially be n^{-1/2}, the scale set by the Central Limit Theorem applied to the initial parameter selection. However, subsequent annealing of the error will allow us to adjust the exponent so that the fluctuations become smaller at later times.

To see that n^{-1/2} sets the right scale at which to look at the fluctuations around the initial conditions, notice that if we pick a test function the CLT tells us that, under the initial product measure,

(2.17)

where the limit is a Gaussian random variable with mean zero and variance defined by

(2.18)

We can write (2.17) distributionally as

(2.19)

To see what happens at later times, we derive an equation for the fluctuation field by subtracting (2.10) from (2.6) and using (2.15):

(2.20)

In order to take the limit as n → ∞ of this equation, we need to consider carefully the behavior of the factors in (2.20) that contain n explicitly; there are two of them. Consider the latter first. If we set

(2.21)

the last term on the right-hand side of (2.20) is of higher order. Note that (2.21) means that we can vary the exponent, but only slowly. For the other factor, a direct calculation shows that, for any test function and any time,

(2.22)

where the expectation is taken with respect to the initial measure—for example, at the initial time this expectation reduces to the variance given in (2.18). Equation (2.22) implies that this factor vanishes almost surely as n → ∞ at the initial time if the exponent is chosen appropriately. We can now argue that this statement also holds at later times if (2.21) holds. To this end we write (2.20) compactly as

(2.23)

where the first operator collects the terms on the right-hand side of (2.20) that are linear in the fluctuation field, and the second collects the remaining terms. In order to control the latter, we can write an equation for it: this equation is of the form (2.23), with an additional linear term (the same operator acting on the new quantity), the source term replaced by one involving it, and the last term in (2.23) replaced accordingly. A calculation similar to the one that gives (2.22) indicates that at the initial time this source term is of higher order than the rest and goes to zero almost surely as n → ∞. The same is true at later times if (2.21) holds. We can then derive equations for the successive corrections, and so on, and each time reach the same conclusion: they involve a linear part made of the same family of operators and a remainder that is of higher order. The specific form of the operator appearing in (2.23) is given by its action on any function

(2.24)

where is defined in (2.13) and