The mean field (MF) regime refers to a newly discovered scaling regime, in which as the width tends to infinity, the behavior of an appropriately scaled neural network under training converges to a well-defined and nonlinear dynamical limit. The MF limit has been investigated for two-layer networks [MMN18, CB18, RVE18, SS18] as well as multilayer setups [Ngu19, AOY19, SS19, NP20].
In a recent work [NP20], we introduced a framework to describe the MF limit of multilayer neural networks under training and proved a connection between a large-width network and its MF limit. Underlying this framework is the idea of a neuronal embedding that encapsulates neural networks of arbitrary sizes. Using this framework, we showed in [NP20] global convergence guarantees for two-layer and three-layer neural networks. It is worth noting that although these global convergence results were proven in the context of independent and identically distributed (i.i.d.) initializations, the framework is not restricted to initializations of this type. In [NP20], it was also proven that when there are more than three layers, i.i.d. initializations (with zero initial biases) can cause a certain strong simplifying effect, which we believe to be undesirable in general. This clarifies a phenomenon that was first discovered in [AOY19].
The present note complements our previous work [NP20]. Our main task here is to show that the approach in [NP20] can be readily extended to prove a similar global convergence guarantee for neural networks with any number of layers. We however do not assume i.i.d. initializations. Our result applies to a type of correlated initialization, and the analysis crucially relies on the ‘neuronal embedding’ framework. As such, our result realizes the vision in [Ngu19] of an MF limit that does not exhibit the aforementioned simplifying effect. Furthermore, our result cannot be established by the formulations in [AOY19, SS19], which are specific to i.i.d. initializations.
Similar to the global convergence guarantees in [NP20] and unlike other works, our result does not rely critically on convexity and instead emphasizes certain universal approximation properties of neural networks. To be precise, the key is a diversity condition, which is shown to hold at any finite training time. The insight on diversity first appeared in the work [CB18]: in the context of two-layer networks, it refers to the full-support condition of the first layer’s weight in the Euclidean space. Our previous work [NP20] partially hinged on the same insight to analyze three-layer networks. Our present result introduces a new notion of diversity in the context of general multilayer networks. Firstly, it is realized in function spaces that are naturally described by the ‘neuronal embedding’ framework. Secondly, it is bidirectional: roughly speaking, for intermediate layers, diversity holds in both the forward and backward passes. The effect of bidirectional diversity is that a certain universal approximation property, at any finite training time, is propagated from the first layer to the second-to-last one.
We first describe the multilayer setup and the MF limit in Section 2 to make the note self-contained. Our main result on global convergence (Theorem 2), stated and proven for the MF limit, is presented in Section 3. Lastly, Section 4 connects the result to large-width multilayer networks.
Since the emphasis here is on the global convergence result, we keep the note concise: other results are stated with proofs omitted, as they can be found in [NP20] or established in a manner similar to that work.
We use $K$ to denote a generic constant that may change from line to line. We use $|\cdot|$ to denote the absolute value for a scalar, the Euclidean norm for a vector, and the respective norm for an element of a Banach space. For a positive integer $n$, we let $[n] = \{1, \ldots, n\}$. We write $\mathrm{cl}(S)$ to denote the closure of a set $S$ in a topological space.
2 Multilayer neural networks and the mean field limit
2.1 Multilayer neural network
We consider the following $L$-layer network:
in which is the input, is the weight with , , is the activation. Here the network has widths with , and denotes the time, i.e. we shall let the network evolve in (discrete) time.
We train the network with stochastic gradient descent (SGD) w.r.t. the loss. We assume that at each time , we draw independently a fresh sample from a training distribution . Given an initialization , we update according to
in which , is the learning rate, is the learning rate schedule for , and for , we define
In short, for an initialization , we obtain an SGD trajectory of an $L$-layer network with size .
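As a concrete toy instance of the SGD training loop above, one can write the following sketch. The squared loss, $\tanh$ activations, $1/n$ mean-field scaling, and finite-difference gradients are illustrative choices made here for brevity, not the note's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance of an L-layer network trained by single-sample SGD.
# Loss, activation, scaling, and gradient method are illustrative choices.
L = 3
widths = [2, 8, 8, 1]  # [input dim, n_1, n_2, n_3 = 1 (scalar output)]
W = [rng.normal(size=(widths[i], widths[i + 1])) for i in range(L)]

def forward(x):
    """Forward pass; layers beyond the first average over incoming neurons."""
    h = np.tanh(x @ W[0])
    for i in range(1, L):
        h = h @ W[i] / W[i].shape[0]  # 1/n mean-field scaling
        if i < L - 1:
            h = np.tanh(h)
    return h  # shape (1,)

def loss(x, y):
    """Squared loss on a single sample."""
    return (0.5 * (forward(x) - y) ** 2).item()

def sgd_step(x, y, lr=0.05, eps=1e-6):
    """One SGD step; gradients via finite differences for brevity."""
    base = loss(x, y)
    grads = []
    for Wi in W:
        g = np.zeros_like(Wi)
        it = np.nditer(Wi, flags=["multi_index"])
        for _ in it:
            idx = it.multi_index
            old = Wi[idx]
            Wi[idx] = old + eps
            g[idx] = (loss(x, y) - base) / eps
            Wi[idx] = old
        grads.append(g)
    for Wi, g in zip(W, grads):
        Wi -= lr * g
```

Running a few steps on a fixed sample decreases the loss, as expected of plain gradient descent at this learning rate.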
2.2 Mean field limit
The MF limit is a continuous-time infinite-width analog of the neural network under training. We first recall from [NP20]
the concept of a neuronal ensemble. Given a product probability space with , we independently sample , . In the following, we use
to denote the expectation w.r.t. the random variable and
to denote a dummy variable. The space is called a neuronal ensemble.
Given a neuronal ensemble , the MF limit is described by a time-evolving system with parameter , where the time and with and . It entails the quantities:
The MF limit evolves according to a continuous-time dynamics, described by a system of ODEs, which we refer to as the MF ODEs. Specifically, given an initialization , the dynamics solves:
Here , denotes the expectation w.r.t. the data , and for , we define
In short, given a neuronal ensemble , for each initialization , we have defined a MF limit .
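The finite-width/MF connection is made precise in Section 4; informally, a width-$n$ layer is a Monte Carlo discretization of the neuronal ensemble, with each neuron an i.i.d. sample and expectations replaced by empirical averages. A toy two-layer sketch of this viewpoint follows; the Gaussian ensemble and the functions `a`, `b` are hypothetical choices for illustration, not the note's construction.

```python
import numpy as np

# Toy two-layer illustration of the neuronal-ensemble viewpoint: a width-n
# network f_n(x) = (1/n) * sum_j a(C_j) * tanh(b(C_j) * x), with i.i.d.
# neurons C_j drawn from the ensemble, approximates the MF quantity
# E_C[a(C) * tanh(b(C) * x)] as n grows. The standard-Gaussian ensemble and
# the functions a, b below are hypothetical choices.
rng = np.random.default_rng(0)

def a(c):
    return np.sin(c)

def b(c):
    return c

def f_width_n(x, n):
    """Empirical average over n i.i.d. neurons sampled from the ensemble."""
    C = rng.standard_normal(n)
    return np.mean(a(C) * np.tanh(b(C) * x))

def f_mf(x, n_ref=1_000_000):
    """Monte Carlo proxy for the MF expectation E_C[a(C) * tanh(b(C) * x)]."""
    C = rng.standard_normal(n_ref)
    return np.mean(a(C) * np.tanh(b(C) * x))
```

By the law of large numbers, `f_width_n(x, n)` concentrates around the MF value as `n` grows.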
3 Convergence to global optima
3.1 Main result: global convergence
To measure the learning quality, we consider the loss averaged over the data :
where a set of measurable functions , for .
We also recall the concept of a neuronal embedding from [NP20]. Formally, in the present context, it is a tuple , comprising a neuronal ensemble and a set of measurable functions in which and for . The neuronal embedding connects a finite-width neural network and its MF limit via their initializations, which are specified by . We shall revisit this connection in Section 4. In the following, we focus on the analysis of the MF limit.
Consider a neuronal embedding , recalling and with . Consider the MF limit associated with the neuronal ensemble with initialization such that and . We make the following assumptions:
Regularity: We assume that
is -bounded for , is -bounded and -Lipschitz for , and is non-zero everywhere,
is -Lipschitz in the second variable and -bounded,
with probability ,
is -bounded and -Lipschitz for ,
and for .
Diversity: The functions satisfy that
(Remark: we write to denote the random mapping , and similarly for .)
Convergence: There exist limits such that as ,
(Here we take for .)
Universal approximation: The set has dense span in (the space of square integrable functions w.r.t. the measure , which is the distribution of the input ). Furthermore, for each , is non-obstructive in the sense that the set has dense span in .
The first assumption is satisfied by several common setups and loss functions. The third assumption, similar to [CB18, NP20], is technical and sets the focus on settings where the MF dynamics converges with time; we note that it is an assumption on the mode of convergence only, not on the limits . The fourth assumption is natural and can be satisfied by common activations; for example, the $\tanh$ activation suffices. In general, for a bounded and continuous activation to be non-obstructive, it suffices that it is not a constant function. The second assumption is new: it refers to an initialization scheme that introduces correlation among the weights. In particular, i.i.d. initializations do not satisfy this assumption for .
Given any neuronal ensemble and a set of functions such that the regularity assumption listed in Assumption 1 is satisfied, and given an initialization such that and , there exists a unique solution to the MF ODEs on .
This theorem can be proven in a similar manner to [NP20, Theorem 3], so we will not show the complete proof here. The main focus is on the global convergence result, which we state next.
Consider a neuronal embedding and the MF limit as in Assumption 1. Assume . Then:
Case 1 (convex loss): If is convex in the second variable, then is a global minimizer of :
Case 2 (generic non-negative loss): Suppose that implies . If is a function of , then .
The assumptions here are similar to those made in [NP20]. We remark on one key difference. In [NP20], the diversity assumption refers to a full-support condition on the first layer’s weight only. Here our diversity assumption refers to a certain full-support condition for all layers. Upon a closer look, the condition is in the function space and reflects a certain bidirectional diversity. In particular, this assumption implies that both and have full supports in and respectively (which we shall refer to as forward diversity and backward diversity, respectively), for .
The proof proceeds with several insights that have already appeared in [NP20]. The novelty of our present analysis lies in the use of the aforementioned bidirectional diversity. To clarify the point, let us give a brief high-level idea of the proof. At time sufficiently large, we expect to have:
for -almost every . If the set of mappings , indexed by , is diverse in the sense that , then since is non-obstructive, we obtain
for -almost every . The desired conclusion then follows.
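To make this high-level idea concrete, here is a schematic rendering in notation of our own choosing (illustrative only; the note's precise quantities are those defined in Section 2), writing $\hat{y}(x)$ for the network output, $\mathcal{L}$ for the loss, $\sigma_L$ for the last activation, and $H_{L-1}(x; c_{L-1})$ for the second-to-last layer's output at neuron index $c_{L-1}$:

```latex
% Schematic only: our own notation, not the note's exact symbols.
% Near stationarity, the gradient w.r.t. the last layer's weight vanishes:
\mathbb{E}_{Z}\!\left[ \partial_2 \mathcal{L}\big(y, \hat{y}(x)\big)\,
    \sigma_L\big(H_{L-1}(x; c_{L-1})\big) \right] \approx 0
    \qquad \text{for almost every neuron index } c_{L-1}.
% If the family of maps x -> H_{L-1}(x; c_{L-1}) has dense span in L^2 of
% the input distribution and sigma_L is non-obstructive, this upgrades to
\mathbb{E}\!\left[ \partial_2 \mathcal{L}\big(y, \hat{y}(x)\big)
    \,\middle|\, x \right] \approx 0
    \qquad \text{for almost every input } x,
% from which the global optimality follows.
```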
Hence the crux of the proof is to show that . In fact, we show that this holds for any finite time . This follows if we can prove the forward diversity property of the weights, namely that has full support in for any and , together with a similar property for . Interestingly, to that end, we actually show that bidirectional diversity, and hence both forward diversity and backward diversity, hold at any time , even though we only need forward diversity for our purpose.
3.2 Proof of Theorem 2
We divide the proof into several steps.
Step 1: Diversity of the weights.
We show that and for , for any . We do so by showing a stronger statement, that the following bidirectional diversity condition holds at any finite training time:
for any .
We prove the first statement. Given a MF trajectory and , , we consider the following flow on :
for , with the initialization and . Existence and uniqueness of follow similarly to Theorem 1. We next prove that for all finite and , there exists such that
We consider the following auxiliary dynamics on :
initialized at and , for . Existence and uniqueness of follow similarly to Theorem 1. Observe that the pair
solves the system
In particular, the solution of the ODE (3) with this initialization satisfies
Let and . Then we have and as desired.
Using this, by continuity of the map , for every , there exists a neighborhood of such that for any , . Notice that the MF trajectory satisfies
Then, since has full support in , for any finite , it follows that has full support in , proving the first statement.
The other statements can be proven similarly by considering the following pairs of flows on , for :
initialized at and , and
initialized at and , in which we define:
for and .
Step 2: Diversity of the pre-activations.
We show that for any , for by induction.
Firstly consider the base case . Recall that
Observe that the set is a closed linear subspace of . Hence this set is equal to if it has dense span in , which we show now. Indeed, suppose that for some such that , we have for all . Equivalently,
for all . As such, for -almost every ,