1 Introduction
Explaining the success of highly overparameterized models such as deep neural networks is a central problem in the theory of modern machine learning (Belkin, 2021). Classical theory would imply that such models are prone to overfitting the training data. On the contrary, practice shows that while this can happen at moderate overparameterization, around the interpolation threshold where a model is just expressive enough to perfectly fit the data, further increasing model capacity leads to better generalization performance. This so-called double descent phenomenon
(Belkin et al., 2019) is often attributed to the implicit regularization of gradient descent and its variants, which are widely used for training deep neural networks. Many theoretical works (Oymak and Soltanolkotabi, 2019; Liu et al., 2022) have proposed that optimization problems in modern machine learning enjoy the so-called Polyak-Łojasiewicz condition (Polyak, 1963), which can informally be described as strong convexity without convexity (Karimi et al., 2016). This condition is closely linked to the positive definiteness of the neural tangent kernel (Jacot et al., 2018; Liu et al., 2022). Together with the Lipschitz gradient condition, this property implies linear convergence of gradient descent to a global optimum. Moreover, the particular optimum the algorithm converges to is one that is close to the initial point (Oymak and Soltanolkotabi, 2019), a property which could potentially explain why models trained in such a manner enjoy excellent generalization performance.

Our work was motivated by Oymak and Soltanolkotabi (2019) and Liu et al. (2022), which analyze the behavior of gradient descent when training nonlinear models, in particular neural networks, from the perspective of the Lipschitz gradient and Polyak-Łojasiewicz conditions. Contemporary works in the area place numerous assumptions on the model, the loss or the data, e.g., focusing on the least squares loss (Oymak and Soltanolkotabi, 2019; Liu et al., 2022), but to the best of our knowledge even the more general ones treat only the case of supervised learning. Instead, we set out to construct a framework that is general enough to incorporate modern machine learning problems beyond supervised learning, while also being modular enough that its components are clearly separated and we are able to derive the minimal set of conditions they must satisfy for the ideal behavior of gradient descent.
We managed to do so in the form of a prototype problem to which many real-life machine learning problems can be translated.
In order to define this problem we need a number of ingredients. Let $\mathcal{X}$ be a Borel space (the input space), let $\mathcal{H}$ be a Hilbert space (the parameter space) and let $\mu \in \mathcal{P}(\mathcal{X})$ be a probability measure (the data). Let $f : \mathcal{X} \times \mathcal{H} \to \mathcal{Y}$ be a function (the model, e.g., a neural network) mapping an input $x \in \mathcal{X}$ and a parameter $w \in \mathcal{H}$ to an output $f(x, w) \in \mathcal{Y}$. Let $\ell : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ be a function (the integrand, e.g., a loss) mapping an input $x$ and an output $y$ to a loss value $\ell(x, y)$. Define the induced function $F : \mathcal{H} \to \mathcal{F}$ by mapping a parameter $w$ to the equivalence class of the $\mathcal{Y}$-valued function $f(\cdot, w)$ with respect to the data $\mu$, where $\mathcal{F} = L^2(\mu; \mathcal{Y})$. Finally, define the integral functional $\mathcal{L} : \mathcal{F} \to \mathbb{R}$ by mapping an equivalence class $g$ to the integral $\int_{\mathcal{X}} \ell(x, g(x)) \, d\mu(x)$. Then the prototype problem is $\min_{w \in \mathcal{H}} (\mathcal{L} \circ F)(w)$, where one has $(\mathcal{L} \circ F)(w) = \int_{\mathcal{X}} \ell(x, f(x, w)) \, d\mu(x)$.

This problem formulation allows us to incorporate the infinite width limit of neural networks, as well as infinite data. It enables us to translate not only supervised learning problems in general, but variational autoencoders, and even gradient regularized discriminators for generative adversarial networks to particular instances of the prototype problem. Analyzing the behavior of gradient descent on the prototype problem leads to two theoretical investigations. The first, more abstract one concerns the compositional nature of the function being minimized. It asks what assumptions the functions $F : \mathcal{H} \to \mathcal{F}$
and $\mathcal{L} : \mathcal{F} \to \mathbb{R}$, with $\mathcal{H}$ and $\mathcal{F}$ being Hilbert spaces, have to satisfy in order for the composition $\mathcal{L} \circ F$ to satisfy the Lipschitz gradient and Polyak-Łojasiewicz conditions. The second one is about the prototype problem, asking for the requirements on the data $\mu$, the model $f$ and the integrand $\ell$ that ensure that the induced functions $F$ and $\mathcal{L}$ satisfy the assumptions resulting from the first question. We present novel theoretical results that conclude these two investigations, generalizing results from contemporary works. In particular, coercivity of the neural tangent kernel (which is equivalent to positive definiteness in the case of finite data) is an assumption that is required for our result to hold.

After listing our contributions in Subsection 1.1, we conclude Section 1 with a summary of related work in Subsection 1.2. In Section 2, we review the theory of the Lipschitz gradient and Polyak-Łojasiewicz conditions. Then in Section 3, we present our framework containing the prototype problem, motivating the two theoretical investigations mentioned above. The findings of the first, concerning gradient descent on composite functions, are presented in Subsection 3.1. Those of the second, on determining the requirements on the components of the prototype problem, are presented in Subsection 3.2, along with their consequences for gradient descent. To demonstrate the usefulness of our framework, we present some examples of popular learning problems that can be translated to the prototype problem in Subsection 3.3. Finally, we discuss the limitations of our work and directions for future research in Section 4. Rigorous proofs of our own results and of the ones reviewed in Section 2 are included in the appendices, which are referred to throughout the paper.
1.1 Our contributions
We propose

a functional analytic framework consisting of a prototype optimization problem covering many machine learning applications,

novel theoretical results of independent interest concerning gradient descent on a composition $\mathcal{L} \circ F$ of functions $F : \mathcal{H} \to \mathcal{F}$ and $\mathcal{L} : \mathcal{F} \to \mathbb{R}$, with $\mathcal{H}$ and $\mathcal{F}$ being Hilbert spaces,

requirements for the components of our prototype problem that ensure convergence of gradient descent to a global optimum close to initialization, and

examples of popular machine learning problems translated to our prototype problem.
1.2 Related work
Studying the behavior of gradient descent and its variants for the training of neural networks is one of the central topics in theoretical machine learning. Aside from the optimization method, the three main components of such learning problems are the neural network model, the loss function and the data. These together determine the loss landscape, and therefore the solutions that the chosen optimization method finds, as well as its generalization performance. Contemporary works in this line of study have made simplifying assumptions on these components in order to be able to derive theoretical results, with the hope that the conclusions drawn from such results would apply to more general cases covering real world problems.
Regarding the model, the strongest assumption is linearity (Gunasekar et al., 2017; Arora et al., 2018; Soudry et al., 2018; Nacson et al., 2019; Zhu et al., 2019). A popular assumption that is a step closer to nonlinear deep networks is that the model is shallow, i.e., has one hidden layer with a nonlinear activation (Brutzkus et al., 2018; Chizat and Bach, 2018; Ge et al., 2018; Mei et al., 2018; Safran and Shamir, 2018; Du et al., 2019; Nitanda et al., 2019; Williams et al., 2019; Soltanolkotabi et al., 2019; Li et al., 2020; Oymak and Soltanolkotabi, 2020). Some works proved results for general models covering deep neural networks (Oymak and Soltanolkotabi, 2019; Allen-Zhu et al., 2019; Azizan and Hassibi, 2019; Du et al., 2019; Liu et al., 2022). The least squares loss is the most popular choice for analysis (Du et al., 2019, 2019; Oymak and Soltanolkotabi, 2019; Soltanolkotabi et al., 2019; Williams et al., 2019; Li et al., 2020; Oymak and Soltanolkotabi, 2020; Liu et al., 2022), but other specific losses are used as well (Brutzkus et al., 2018; Soudry et al., 2018; Nitanda et al., 2019). Some works use a general loss function for supervised learning satisfying certain assumptions (Chizat and Bach, 2018; Azizan and Hassibi, 2019; Allen-Zhu et al., 2019, 2019; Nacson et al., 2019). Some works have placed assumptions on the data (Ge et al., 2018; Soudry et al., 2018; Allen-Zhu et al., 2019; Li et al., 2020). Another way to simplify analysis is to take the learning rate to zero, resulting in continuous time gradient descent, i.e., a gradient flow (Gunasekar et al., 2017; Chizat and Bach, 2018; Williams et al., 2019). Often no specific optimization method is assumed and analysis is restricted to the loss landscape (Safran and Shamir, 2018; Sagun et al., 2018; Li et al., 2019; Zhu et al., 2019; Geiger et al., 2021).
A prime candidate for the explanation of the surprising generalization of overparameterized models is the implicit regularization of gradient methods. Many works show that such methods take a short path in the parameter space (Azizan and Hassibi, 2019; Oymak and Soltanolkotabi, 2019; Zhang et al., 2020; Gupta et al., 2021). Similarly to our paper, the Polyak-Łojasiewicz condition (Polyak, 1963; Karimi et al., 2016) has been previously studied as a possible reason why gradient methods converge to global optima (Allen-Zhu et al., 2019; Oymak and Soltanolkotabi, 2019; Liu et al., 2022). This condition is determined by the smallest eigenvalue of the neural tangent kernel (Jacot et al., 2018), which has previously been proposed to govern gradient descent dynamics, and is influenced by overparameterization. Additionally, this condition implies that the global optimum found by gradient descent is close to initialization.

Our main influences were Polyak (1963) on the convergence of gradient descent for nonconvex losses, as well as Oymak and Soltanolkotabi (2019) and Liu et al. (2022) on the behavior of gradient descent on the nonlinear least squares problem. Our results in Subsection 3.1 generalize the classical ones in Polyak (1963), and in particular, Theorem 4 is a significant generalization of Oymak and Soltanolkotabi (2019, Theorem 2.1) and Liu et al. (2022, Theorem 6). Even those works which treat general loss functions (such as Allen-Zhu et al. (2019)) focus on the supervised learning problem, with both the number of parameters and the amount of data assumed to be finite. To the best of our knowledge, the framework we propose is the only one so far that is general enough to incorporate many problems besides supervised learning, covering the cases of infinitely many parameters and infinite data, while also making it clear what assumptions the different components have to satisfy in order for gradient descent to find a global optimum that is close to initialization.
2 Lipschitz gradient and Polyak-Łojasiewicz conditions
Informally speaking, the Lipschitz gradient (LG) and Polyak-Łojasiewicz (PL) conditions quantify how well a certain function behaves under gradient descent (Polyak, 1963; Karimi et al., 2016). This section reviews the classical results of Polyak (1963), which we include to highlight the analogy to our results in Subsection 3.1. Let $\mathcal{H}$ be a Hilbert space and $f : \mathcal{H} \to \mathbb{R}$ a Fréchet differentiable function. If $\|\nabla f(w_1) - \nabla f(w_2)\| \leq L \|w_1 - w_2\|$ for all $w_1, w_2 \in S$ for some $L \geq 0$ and $S \subseteq \mathcal{H}$, i.e., if the gradient mapping $\nabla f$ is Lipschitz on $S$, then $f$ is said to satisfy the LG condition with constant $L$ on $S$, or to be LG on $S$. Note that one can take $L = 0$ if and only if $f$ is linear. Such functions satisfy a strong form of the fundamental rule of calculus (see Proposition 12). In particular, $f(w_2) \leq f(w_1) + \langle \nabla f(w_1), w_2 - w_1 \rangle + \frac{L}{2} \|w_2 - w_1\|^2$ for all $w_1, w_2$ such that the segment $[w_1, w_2]$ lies in $S$. Additional corollaries (see Proposition 14 and Proposition 15) are that $\|\nabla f(w)\|^2 \leq 2L\,(f(w) - \inf f)$ for all $w$ such that the relevant segment lies in $S$. And that $f(w^+) \leq f(w) - \eta \left(1 - \frac{L\eta}{2}\right) \|\nabla f(w)\|^2$ for all $w$ such that $[w, w^+] \subseteq S$, where $w^+ = w - \eta \nabla f(w)$. If $f^* = \inf f > -\infty$ and $\|\nabla f(w)\|^2 \geq 2\mu\,(f(w) - f^*)$ for all $w \in S$ and some $\mu > 0$, then $f$ is said to satisfy the PL condition with constant $\mu$ on $S$, or to be PL on $S$. Given a learning rate $\eta > 0$, define a gradient descent step as $w^+ = w - \eta \nabla f(w)$ for any $w \in \mathcal{H}$, and a gradient descent sequence recursively as $w_{k+1} = w_k - \eta \nabla f(w_k)$ for some initial point $w_0 \in \mathcal{H}$. The LG and PL conditions lead to the following classical convergence result for gradient descent due to Polyak (1963) (see Appendix A for proofs of the results in this section). The theorem not only tells us that gradient descent finds a global optimum, but that the convergence is linear, and the distance traveled from initialization is bounded by a constant times the distance to a closest optimum.
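As a small self-contained illustration (our own, not from the paper), consider $f(x) = x^2 + 3\sin^2(x)$, the standard example from Karimi et al. (2016) of a nonconvex function that is PL (with constant $\mu = 1/32$) and LG with $L = 8$, since $|f''(x)| = |2 + 6\cos(2x)| \leq 8$. Gradient descent with $\eta = 1/L$ reaches the global minimum $f(0) = 0$ despite the nonconvexity:

```python
import math

def f(x):
    # Nonconvex but PL: the standard example from Karimi et al. (2016).
    return x * x + 3.0 * math.sin(x) ** 2

def grad_f(x):
    return 2.0 * x + 3.0 * math.sin(2.0 * x)

# |f''(x)| = |2 + 6 cos(2x)| <= 8, so f is LG with L = 8; take eta = 1/L.
eta, x = 1.0 / 8.0, 2.0
losses = [f(x)]
for _ in range(200):
    x -= eta * grad_f(x)
    losses.append(f(x))
```

With this learning rate the descent lemma guarantees a monotone decrease, and the iterates converge to the global optimum at $0$.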
Theorem 1.
Let $\mathcal{H}$ be a Hilbert space and $L \geq \mu > 0$ be some constants. And let $f : \mathcal{H} \to \mathbb{R}$ be a function that is PL with constant $\mu$ and LG with constant $L$ on $B(w_0, r)$. Fix any $w_0 \in \mathcal{H}$. Let $f^* = \inf f$, $\rho = 1 - \eta \mu (2 - \eta L)$, so that $\rho \in [0, 1)$, and $r = \frac{2 \sqrt{2 L (f(w_0) - f^*)}}{\mu}$. If
$0 < \eta \leq \frac{1}{L},$
then for all $k \in \mathbb{N}$
$f(w_k) - f^* \leq \rho^k \, (f(w_0) - f^*),$
and the series $\sum_k \|w_{k+1} - w_k\|$ converges, so that $(w_k)_k$ converges to some $w_\infty \in B(w_0, r)$ such that $f(w_\infty) = f^*$ and
$\|w_\infty - w_0\| \leq \frac{2 \sqrt{2 L (f(w_0) - f^*)}}{\mu}.$
Moreover, if $w^* \in \arg\min f$ satisfies $\|w_0 - w^*\| = \operatorname{dist}(w_0, \arg\min f)$, i.e., $w^*$ is an optimum of $f$ that is closest to $w_0$, then
$\|w_\infty - w_0\| \leq \frac{2L}{\mu} \|w_0 - w^*\|.$
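The linear rate of Theorem 1 can be checked numerically on a simple instance (an illustration of ours, using the standard Polyak contraction factor $\rho = 1 - \eta\mu(2 - \eta L)$). For the quadratic $f(w) = \frac{1}{2}(w_1^2 + 10 w_2^2)$, which is PL with $\mu = 1$ and LG with $L = 10$ on all of $\mathbb{R}^2$, the bound holds at every iteration:

```python
# Sanity check of the linear rate in Theorem 1 on a quadratic (illustration only).
# f(w) = 0.5 * (w1^2 + 10 * w2^2) is PL with mu = 1 and LG with L = 10 on R^2.
mu, L, eta = 1.0, 10.0, 0.05          # eta <= 1/L

def f(w):
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)

def grad_f(w):
    return [w[0], 10.0 * w[1]]

w = [3.0, -2.0]
f0 = f(w)
rho = 1.0 - eta * mu * (2.0 - eta * L)   # contraction factor of the theorem
bounds_hold = True
for k in range(1, 101):
    g = grad_f(w)
    w = [w[0] - eta * g[0], w[1] - eta * g[1]]
    # Polyak's bound: f(w_k) - f* <= rho^k * (f(w_0) - f*), with f* = 0 here.
    bounds_hold = bounds_hold and f(w) <= rho ** k * f0 + 1e-12
```

The actual decrease is faster than the worst-case rate, as expected from a bound that only uses the extreme constants $\mu$ and $L$.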
3 A framework for overparameterized learning
We propose an abstract prototype problem that covers many real world scenarios (as exemplified in Subsection 3.3) to analyze the problem of training neural networks with gradient descent. This framework enables us to analyze the building blocks of the problem in order to determine the properties of the whole. There are going to be three main components. One is the dataset, represented by a probability measure $\mu$ on a Borel space $\mathcal{X}$. Another is the neural network, a mapping $f : \mathcal{X} \times \mathcal{H} \to \mathcal{Y}$ with $\mathcal{X}$ being the input space, the Hilbert space $\mathcal{H}$ being the parameter space, $\mathcal{Y}$ being the output space, and $f$ being measurable in $x$ and Fréchet differentiable in $w$ for all $x \in \mathcal{X}$. Assuming that for all $w \in \mathcal{H}$ the integral $\int_{\mathcal{X}} \|f(x, w)\|^2 \, d\mu(x)$ exists and is finite, we consider the induced mapping $F : \mathcal{H} \to \mathcal{F}$ defined as $F(w)$ being the equivalence class of the function $f(\cdot, w)$ with respect to $\mu$ for any $w \in \mathcal{H}$. The Hilbert space $\mathcal{F} = L^2(\mu; \mathcal{Y})$ of equivalence classes of square integrable $\mathcal{Y}$-valued functions with respect to $\mu$ is the feature space, with norm $\|g\| = \left( \int_{\mathcal{X}} \|g(x)\|^2 \, d\mu(x) \right)^{1/2}$ for $g \in \mathcal{F}$. The last component is the integrand $\ell : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ that quantifies the loss corresponding to an $x \in \mathcal{X}$ and an output $y \in \mathcal{Y}$, with $\ell$ being measurable in $x$ and differentiable in $y$. It induces the integral functional (Rockafellar, 1976) $\mathcal{L} : \mathcal{F} \to \mathbb{R}$ defined as $\mathcal{L}(g) = \int_{\mathcal{X}} \ell(x, g(x)) \, d\mu(x)$.
The prototype learning problem is then defined as
$\min_{w \in \mathcal{H}} (\mathcal{L} \circ F)(w),$
which we aim to solve via gradient descent.
This motivates two theoretical investigations from the perspective of the LG and PL conditions. The first is about the compositional nature of the loss being minimized, and can be stated abstractly as follows. Given Hilbert spaces $\mathcal{H}, \mathcal{F}$ and functions $F : \mathcal{H} \to \mathcal{F}$ and $\mathcal{L} : \mathcal{F} \to \mathbb{R}$ such that $\mathcal{L}$ is assumed to be LG and PL for some constants, what assumptions does $F$ have to satisfy in order for the composition $\mathcal{L} \circ F$ to enjoy the benefits of the LG and PL conditions? The second, less abstract question asks what assumptions do $\mu$, $f$ and $\ell$ have to satisfy in order for $F$ to satisfy the assumptions discovered in the first question, and for $\mathcal{L}$ to be LG and PL for some constants? We answer both in the following sections, and then we present some examples of popular learning problems that can be translated to the prototype learning problem.
3.1 Gradient descent on a composition
The notation in this section is similar to that of Section 2, emphasizing the fact that the results proposed are general enough to be of independent mathematical interest. Let $\mathcal{H}, \mathcal{F}$ be Hilbert spaces and $F : \mathcal{H} \to \mathcal{F}$ and $\mathcal{L} : \mathcal{F} \to \mathbb{R}$ be Fréchet differentiable functions. If $\|\nabla \mathcal{L}(u)\| \leq G$ for all $u \in T$ for some $G > 0$ and $T \subseteq \mathcal{F}$, then $\mathcal{L}$ is said to satisfy the Bounded Gradient (BG) condition with constant $G$ on $T$, or to be BG on $T$. Note that if $\mathcal{L}$ is LG on $T$ and $T$ is bounded, then $\mathcal{L}$ is BG on $T$ (see Proposition 13). Denote the Jacobian or Fréchet differential of $F$ at $w$ by $DF(w)$, which is the unique bounded linear operator from $\mathcal{H}$ to $\mathcal{F}$ that satisfies $\lim_{h \to 0} \|F(w + h) - F(w) - DF(w) h\| / \|h\| = 0$. We are going to introduce the Bounded Jacobian (BJ) and Lipschitz Jacobian (LJ) conditions, generalizing the BG and LG conditions. If $\|DF(w)\| \leq B$ (with the operator norm) for all $w \in S$ for some $B > 0$ and $S \subseteq \mathcal{H}$, then $F$ is said to satisfy the BJ condition with constant $B$ on $S$, or to be BJ on $S$. If $\|DF(w_1) - DF(w_2)\| \leq M \|w_1 - w_2\|$ for all $w_1, w_2 \in S$ for some $M \geq 0$ and $S \subseteq \mathcal{H}$, i.e., if the Jacobian mapping $DF$ is Lipschitz on $S$ with respect to the operator norm, then $F$ is said to satisfy the LJ condition with constant $M$ on $S$, or to be LJ on $S$. (Note that one can take $M = 0$ if and only if $F$ is linear, i.e., $DF$ is constant.) Such functions satisfy a strong form of the fundamental rule of calculus (see Proposition 12), in particular that $F(w_2) - F(w_1) = \int_0^1 DF(w_1 + t(w_2 - w_1))(w_2 - w_1) \, dt$ (where the integral is the Bochner integral (Cobzaş et al., 2019, Definition 1.6.8)) for all $w_1, w_2$ such that the segment $[w_1, w_2]$ lies in $S$. The following lemma summarizes sufficient conditions for $\mathcal{L} \circ F$ to satisfy the LG condition and its corollaries (see Proposition 22, Proposition 23 and Proposition 24).
Lemma 2.
Let $\mathcal{H}, \mathcal{F}$ be Hilbert spaces, $F : \mathcal{H} \to \mathcal{F}$ be BJ with constant $B$ and LJ with constant $M$ on $S \subseteq \mathcal{H}$ and $\mathcal{L} : \mathcal{F} \to \mathbb{R}$ be LG with constant $L$ on $F(S)$.

If $S$ is convex and $\mathcal{L}$ is BG with constant $G$ on $F(S)$ for some $G > 0$, then $\mathcal{L} \circ F$ is LG with constant $L B^2 + M G$ on $S$.

If for some fixed $w \in S$ one has $[w, w'] \subseteq S$ for some $w' \in \mathcal{H}$ and $\mathcal{L}$ is BG with constant $G$ on $F([w, w'])$, then
$(\mathcal{L} \circ F)(w') \leq (\mathcal{L} \circ F)(w) + \langle \nabla (\mathcal{L} \circ F)(w), w' - w \rangle + \frac{L B^2 + M G}{2} \|w' - w\|^2.$

If for some fixed $w \in S$ one has $[w, w^+] \subseteq S$ for $w^+ = w - \eta \nabla (\mathcal{L} \circ F)(w)$ with some learning rate $\eta > 0$ and $\mathcal{L}$ is BG with constant $G$ on $F([w, w^+])$,
then
$(\mathcal{L} \circ F)(w^+) \leq (\mathcal{L} \circ F)(w) - \eta \left(1 - \frac{(L B^2 + M G)\,\eta}{2}\right) \|\nabla (\mathcal{L} \circ F)(w)\|^2.$
An operator $A \in B(\mathcal{F})$ is said to be coercive with constant $\mu > 0$ if $\langle A u, u \rangle \geq \mu \|u\|^2$ for all $u \in \mathcal{F}$. Denote the adjoint of the Jacobian of $F$ at $w$ by $DF(w)^*$. Generalizing Liu et al. (2022, Definition 3), $F$ is said to be uniformly conditioned (or UC) with constant $\mu > 0$ on $S \subseteq \mathcal{H}$ if $DF(w) DF(w)^*$ is coercive with constant $\mu$ for all $w \in S$. The following lemma tells us how the composition $\mathcal{L} \circ F$ can satisfy a property that is similar to the PL condition (see Proposition 21).
Lemma 3.
Let $\mathcal{H}, \mathcal{F}$ be Hilbert spaces, $F : \mathcal{H} \to \mathcal{F}$ be UC with constant $\mu_F$ on $S \subseteq \mathcal{H}$ and $\mathcal{L} : \mathcal{F} \to \mathbb{R}$ be PL with constant $\mu$ on $F(S)$. Then the composition satisfies
$\|\nabla (\mathcal{L} \circ F)(w)\|^2 \geq 2 \mu_F \mu \, ((\mathcal{L} \circ F)(w) - \inf \mathcal{L})$
for all $w \in S$.
Given a learning rate $\eta > 0$, define a gradient descent step as $w^+ = w - \eta \nabla (\mathcal{L} \circ F)(w)$ for any $w \in \mathcal{H}$, and a gradient descent sequence recursively as $w_{k+1} = w_k - \eta \nabla (\mathcal{L} \circ F)(w_k)$ for some initial point $w_0 \in \mathcal{H}$. The above lemmas lead to the following convergence result for gradient descent (see Theorem 20). The proof consists of repeated applications of Lemma 2 to $\mathcal{L} \circ F$, Lemma 3 to get the descent rate and Lemma 2 to bound the distance traveled, modulo some technicalities.
Theorem 4.
Let $L, \mu, B, M, G, \mu_F > 0$ be constants such that $L \geq \mu$ and $B^2 \geq \mu_F$. Let $\mathcal{H}, \mathcal{F}$ be Hilbert spaces, let $T \subseteq \mathcal{F}$ be bounded and let $w_0 \in \mathcal{H}$. And let $F : \mathcal{H} \to \mathcal{F}$ be BJ with constant $B$, LJ with constant $M$ and UC with constant $\mu_F$ on $B(w_0, r)$ and $\mathcal{L} : \mathcal{F} \to \mathbb{R}$ be LG with constant $L$ and PL with constant $\mu$ on $T \supseteq F(B(w_0, r))$. Let $\mathcal{L}^* = \inf \mathcal{L}$, $G$ be a BG constant of $\mathcal{L}$ on $T$, $\hat{L} = L B^2 + M G$ (so that $\mathcal{L} \circ F$ is LG with constant $\hat{L}$), $\hat{\mu} = \mu_F \mu$ (so that $\hat{L} \geq \hat{\mu}$), $\rho = 1 - \eta \hat{\mu} (2 - \eta \hat{L})$ and $r = \frac{2 B \sqrt{2 L ((\mathcal{L} \circ F)(w_0) - \mathcal{L}^*)}}{\hat{\mu}}$. If
$0 < \eta \leq \frac{1}{\hat{L}},$
then for all $k \in \mathbb{N}$
$(\mathcal{L} \circ F)(w_k) - \mathcal{L}^* \leq \rho^k \, ((\mathcal{L} \circ F)(w_0) - \mathcal{L}^*),$
and the series $\sum_k \|w_{k+1} - w_k\|$ converges, so that $(w_k)_k$ converges to some $w_\infty \in B(w_0, r)$ such that $\mathcal{L}(F(w_\infty)) = \mathcal{L}^*$ (so that in particular $F(w_\infty)$ is a global optimum of $\mathcal{L}$) and
$\|w_\infty - w_0\| \leq \frac{2 B \sqrt{2 L ((\mathcal{L} \circ F)(w_0) - \mathcal{L}^*)}}{\hat{\mu}}.$
Moreover, if $w^* \in \arg\min (\mathcal{L} \circ F)$, i.e., if $w^*$ is an optimum of $\mathcal{L} \circ F$ that is closest to $w_0$, then one has
$\|w_\infty - w_0\| \leq \frac{2 B \sqrt{L \hat{L}}}{\hat{\mu}} \|w_0 - w^*\|.$
Note that $\eta = 1/\hat{L}$ gives the optimal rate $\rho = 1 - \hat{\mu}/\hat{L}$. Theorem 1 is exactly the special case of $\mathcal{F} = \mathcal{H}$ and $F$ being the identity map. The above result is general enough to be conveniently applicable to our prototype problem.
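The setting of Theorem 4 can be illustrated numerically (a sketch of ours, not from the paper). Take $\mathcal{L}(u) = \frac{1}{2}\|u - y\|^2$, which is LG and PL with constants $1$, composed with a nonlinear, overparameterized map $F : \mathbb{R}^p \to \mathbb{R}^n$ with $p \gg n$; the specific $F$ below (a fixed random-feature-style matrix applied to coordinatewise $\tanh$) is an arbitrary choice for the demonstration:

```python
import math

# Gradient descent on L∘F with L(u) = 0.5 * ||u - y||^2 and a nonlinear,
# overparameterized F: R^p -> R^n, F(w)_i = (1/sqrt(p)) sum_j A_ij tanh(w_j).
n, p = 2, 20
A = [[math.cos(1.7 * (i + 1) * (j + 1)) for j in range(p)] for i in range(n)]
y = [1.0, -0.5]

def F(w):
    s = 1.0 / math.sqrt(p)
    return [s * sum(A[i][j] * math.tanh(w[j]) for j in range(p)) for i in range(n)]

def loss(w):
    return 0.5 * sum((Fi - yi) ** 2 for Fi, yi in zip(F(w), y))

def grad(w):
    # Chain rule: grad(L∘F)(w) = DF(w)^T (F(w) - y).
    s = 1.0 / math.sqrt(p)
    res = [Fi - yi for Fi, yi in zip(F(w), y)]
    return [s * (1.0 - math.tanh(w[j]) ** 2) * sum(A[i][j] * res[i] for i in range(n))
            for j in range(p)]

w0 = [0.0] * p
w, eta = list(w0), 0.25
for _ in range(5000):
    g = grad(w)
    w = [wj - eta * gj for wj, gj in zip(w, g)]

final_loss = loss(w)
path_length = math.sqrt(sum((wj - w0j) ** 2 for wj, w0j in zip(w, w0)))
```

Because $DF(w_0) DF(w_0)^\top$ is positive definite here (the uniform conditioning of the theorem, locally), the iterates drive the loss to its global minimum while traveling only a short distance `path_length` from initialization.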
3.2 Requirements for convergence
In order to apply Theorem 4 to the prototype problem, some assumptions on $F$ and $\mathcal{L}$ need to hold. The lemma below provides sufficient conditions for the former (see Lemma 26).
Lemma 5.
Let $\mathcal{X}$ be a Borel space, $\mathcal{H}$ a Hilbert space, $S \subseteq \mathcal{H}$ open, and $f : \mathcal{X} \times \mathcal{H} \to \mathcal{Y}$ such that $f$ is measurable in $x$ and Fréchet differentiable in $w$ for all $x \in \mathcal{X}$ and the integral $\int_{\mathcal{X}} \|f(x, w)\|^2 \, d\mu(x)$ exists and is finite for all $w \in \mathcal{H}$. Suppose that there exist $B, M > 0$ such that for almost every $x \in \mathcal{X}$, $f(x, \cdot)$ is BJ with constant $B$ and LJ with constant $M$ on $S$. Then $F$ is differentiable on $S$, for all $w \in S$ the Jacobian is such that
$(DF(w) h)(x) = D_w f(x, w) h$
for all $h \in \mathcal{H}$, the adjoint Jacobian is such that
$DF(w)^* g = \int_{\mathcal{X}} D_w f(x, w)^* g(x) \, d\mu(x)$
for all $g \in \mathcal{F}$, while $F$ is BJ with constant $B$ and LJ with constant $M$ on $S$. Furthermore, if there exists $\mu_F > 0$ such that
$\langle DF(w) DF(w)^* g, g \rangle \geq \mu_F \|g\|^2$
for all $g \in \mathcal{F}$ and $w \in S$, then $F$ is UC with constant $\mu_F$.
We assume that $\ell$ is finite, measurable in $x$ and differentiable in $y$ for all $x \in \mathcal{X}$. If $\ell(x, \cdot)$ is LG with a common constant $L$ for almost every $x \in \mathcal{X}$, then $\ell$ is said to be an LG integrand. If $\ell(x, \cdot)$ is PL with a common constant $\mu$ for almost every $x \in \mathcal{X}$, then it is said to be a PL integrand. The following lemma shows that the LG and PL conditions of $\ell$ are inherited by $\mathcal{L}$ (see Lemma 27).
Lemma 6.
Let $\mathcal{X}$ be a Borel space, $\mu \in \mathcal{P}(\mathcal{X})$, $\ell$ an LG integrand with constant $L$ and $\mathcal{L}$ the induced integral functional. Then one has that
$\nabla \mathcal{L}(g) = [x \mapsto \nabla_y \ell(x, g(x))]$
for all $g \in \mathcal{F}$ and $\mathcal{L}$ is LG with constant $L$. If moreover $\ell$ is also a PL integrand with constant $\mu$ (so that $L \geq \mu$), then one has that $\mathcal{L}$ is PL with constant $\mu$.
Denote $\ell^*(x) = \inf_{y \in \mathcal{Y}} \ell(x, y)$ and $\mathcal{L}^* = \int_{\mathcal{X}} \ell^*(x) \, d\mu(x)$. (Note that one has $\inf_{g \in \mathcal{F}} \mathcal{L}(g) = \mathcal{L}^*$ by Rockafellar (1976, Theorem 3A).) Given a learning rate $\eta > 0$, define a gradient descent step as $w^+ = w - \eta \nabla (\mathcal{L} \circ F)(w)$ for any $w \in \mathcal{H}$, and a gradient descent sequence recursively as $w_{k+1} = w_k - \eta \nabla (\mathcal{L} \circ F)(w_k)$ for some initial point $w_0 \in \mathcal{H}$. We are now in a position to apply Theorem 4.
Theorem 7.
Let $\mathcal{X}$ be a Borel space, $\mathcal{H}$ a Hilbert space, $S \subseteq \mathcal{H}$ bounded and open, $w_0 \in S$, $\mu \in \mathcal{P}(\mathcal{X})$, $f : \mathcal{X} \times \mathcal{H} \to \mathcal{Y}$ such that $f$ is measurable in $x$ and Fréchet differentiable in $w$ for all $x \in \mathcal{X}$ and the integral $\int_{\mathcal{X}} \|f(x, w)\|^2 \, d\mu(x)$ exists and is finite for all $w \in \mathcal{H}$, $\ell$ an LG and PL integrand with constants $L \geq \mu > 0$, and assume that there exist $B, M, \mu_F > 0$ such that $f(x, \cdot)$ is BJ with constant $B$ and LJ with constant $M$ on $S$ for almost every $x \in \mathcal{X}$ and $\langle DF(w) DF(w)^* g, g \rangle \geq \mu_F \|g\|^2$ for all $g \in \mathcal{F}$ and $w \in S$. Let $G$ be a BG constant of $\mathcal{L}$ on $F(S)$, $\hat{L} = L B^2 + M G$ (so that $\mathcal{L} \circ F$ is LG with constant $\hat{L}$ on $S$), $\hat{\mu} = \mu_F \mu$, $\rho = 1 - \eta \hat{\mu} (2 - \eta \hat{L})$, and $r = \frac{2 B \sqrt{2 L ((\mathcal{L} \circ F)(w_0) - \mathcal{L}^*)}}{\hat{\mu}}$ (so that $B(w_0, r) \subseteq S$ is assumed). If
$0 < \eta \leq \frac{1}{\hat{L}},$
then one has
$(\mathcal{L} \circ F)(w_k) - \mathcal{L}^* \leq \rho^k \, ((\mathcal{L} \circ F)(w_0) - \mathcal{L}^*)$
for all $k \in \mathbb{N}$, and the series $\sum_k \|w_{k+1} - w_k\|$ converges, so that $(w_k)_k$ converges to some $w_\infty \in B(w_0, r)$ with $\mathcal{L}(F(w_\infty)) = \mathcal{L}^*$ (so that in particular $F(w_\infty)$ is a global optimum of $\mathcal{L}$) and
$\|w_\infty - w_0\| \leq \frac{2 B \sqrt{2 L ((\mathcal{L} \circ F)(w_0) - \mathcal{L}^*)}}{\hat{\mu}}.$
Moreover, if $w^* \in \arg\min (\mathcal{L} \circ F)$, i.e., if $w^*$ is an optimum of $\mathcal{L} \circ F$ that is closest to $w_0$, then one has
$\|w_\infty - w_0\| \leq \frac{2 B \sqrt{L \hat{L}}}{\hat{\mu}} \|w_0 - w^*\|.$
Note that the self-adjoint operator $\Theta_w$ on the feature space $\mathcal{F}$ given by
$\Theta_w = DF(w) DF(w)^*, \qquad (\Theta_w g)(x) = \int_{\mathcal{X}} D_w f(x, w) D_w f(x', w)^* g(x') \, d\mu(x')$
for $g \in \mathcal{F}$ is exactly the NTK at $w$. (The operator $DF(w)^* DF(w)$ on the parameter space $\mathcal{H}$ could be referred to as the dual NTK.) $F$ being UC with constant $\mu_F$ and BJ with constant $B$ means that the spectrum of the NTK is contained in $[\mu_F, B^2]$. In the case of finite data, i.e., $\mu = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}$, one has that $\mathcal{F}$ is finite dimensional, and the NTK being coercive reduces to its matrix representation being positive definite with its smallest eigenvalue bounded from below by $\mu_F$. In the case of infinite data, coercivity is a stronger condition than positive definiteness, since the descending eigenvalues of a positive definite operator can converge to $0$. In any case, having $\dim \mathcal{H} \geq \dim \mathcal{F}$, i.e., overparameterization, is a necessary condition for $F$ being UC with $\mu_F > 0$. Note also that $\inf_{w \in \mathcal{H}} (\mathcal{L} \circ F)(w) = \mathcal{L}^*$, i.e., the neural network can interpolate the data if the stated conditions hold on a large enough subset around the initial point $w_0$.
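In the finite-data case the NTK is just the Gram matrix of the tangent features $\nabla_w f(x_i, w)$, and coercivity can be checked by computing its smallest eigenvalue. The following sketch (our own, for a hypothetical one-hidden-layer network with a deterministic toy initialization) illustrates this for $n = 2$ data points:

```python
import math

def ntk_matrix(xs, a, b):
    # Tangent features: gradient of f(x, w) w.r.t. all parameters w = (a, b),
    # for a one-hidden-layer net f(x, w) = (1/sqrt(m)) * sum_k a_k tanh(b_k x).
    m = len(a)
    s = 1.0 / math.sqrt(m)
    feats = []
    for x in xs:
        grad_a = [s * math.tanh(b[k] * x) for k in range(m)]
        grad_b = [s * a[k] * (1.0 - math.tanh(b[k] * x) ** 2) * x for k in range(m)]
        feats.append(grad_a + grad_b)
    n = len(xs)
    return [[sum(fi * fj for fi, fj in zip(feats[i], feats[j])) for j in range(n)]
            for i in range(n)]

m = 50
a = [1.0 if k % 2 == 0 else -1.0 for k in range(m)]   # toy initialization
b = [0.1 + 0.03 * k for k in range(m)]
K = ntk_matrix([0.5, -1.0], a, b)

# Smallest eigenvalue of the 2x2 NTK via the closed-form quadratic formula.
tr = K[0][0] + K[1][1]
det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
lam_min = (tr - math.sqrt(tr * tr - 4.0 * det)) / 2.0
```

A strictly positive `lam_min` is exactly the (finite-data) coercivity constant $\mu_F$ that the uniform conditioning assumption asks for at this parameter value.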
3.3 Examples
In this section, we detail three examples that can be translated to our prototype problem. The first covers effectively all of supervised learning. The second is a popular unsupervised learning method, and shows that the framework being general enough to incorporate infinite data is a useful property. The third shows that even gradient regularization can be treated. The examples are intended to demonstrate that our framework covers real world learning problems. Their analysis is beyond the scope of this paper and is left for future work.
3.3.1 Supervised learning
A typical supervised learning task is as follows. Let $\mathcal{X}, \mathcal{T}$ be Borel spaces, where $\mathcal{X}$ is the set of inputs and $\mathcal{T}$ is the set of targets. The dataset is a probability measure $\nu \in \mathcal{P}(\mathcal{X} \times \mathcal{T})$ of input-target pairs and the loss $c : \mathcal{T} \times \mathcal{Y} \to \mathbb{R}$ maps target-output pairs to the corresponding loss values. Let $\mu$ be the first marginal of the data $\nu$ and $(\nu_x)_{x \in \mathcal{X}}$ the disintegration of $\nu$ along $\mu$. Assuming that one has $\nu_x = \delta_{t(x)}$ for some measurable function $t : \mathcal{X} \to \mathcal{T}$ mapping inputs to targets, i.e., all inputs have exactly one corresponding target in the data, the integrand can be defined as $\ell(x, y) = c(t(x), y)$, which is clearly an LG or PL integrand if and only if $c$ is LG or PL in its second argument, so that it suffices to analyze the latter to determine if the induced integral functional satisfies the properties required for the convergence of gradient descent. With the induced neural network mapping $F$, it is apparent that any supervised learning problem such that all inputs have exactly one corresponding target in the data can be translated to the prototype problem $\min_{w \in \mathcal{H}} (\mathcal{L} \circ F)(w)$.

In particular, maximum likelihood training is a supervised learning problem where, given input-target pairs drawn from a dataset, one trains a neural network to output the parameters of a probability distribution such that the likelihood of the target under the distribution for its input is maximized. In this case, $c$ is the function mapping a pair consisting of a target $t$ and an output $y$ to minus the natural logarithm of the probability density function of the probability distribution parameterized by $y$, evaluated at $t$. We describe two cases.

When $\mathcal{T} = \mathbb{R}^d$
, a popular approach is to fit the targets to Gaussian distributions. For simplicity, we consider the case with diagonal covariance matrices. In this case, $\mathcal{Y} = \mathbb{R}^d \times \mathbb{R}^d_{>0}$, and for an output $y = (m, \sigma^2)$ the distribution can be parameterized by a mean $m \in \mathbb{R}^d$ and variance $\sigma^2 \in \mathbb{R}^d_{>0}$. Based on the probability density function of a Gaussian, the integrand becomes
$c(t, (m, \sigma^2)) = \sum_{j=1}^d \left( \frac{(t_j - m_j)^2}{2 \sigma_j^2} + \frac{1}{2} \log \sigma_j^2 + \frac{1}{2} \log 2\pi \right).$
One can also fit the targets to Gaussian distributions with fixed variance. In this case, $\mathcal{Y} = \mathbb{R}^d$ with a given variance $\sigma^2 \in \mathbb{R}^d_{>0}$, and the integrand is the same expression with $\sigma^2$ held constant, which is equivalent to least squares if $\sigma^2$ is a constant vector.
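The reduction of the fixed-variance Gaussian integrand to least squares can be verified directly (a sketch of ours): with unit variance, the negative log-likelihood equals the least squares loss up to an additive constant.

```python
import math

def gaussian_nll(target, mean, std):
    # Negative log-likelihood of a diagonal Gaussian, summed over coordinates:
    # sum_j (t_j - m_j)^2 / (2 s_j^2) + log s_j + 0.5 * log(2 pi)
    return sum((t - m) ** 2 / (2.0 * s * s) + math.log(s) + 0.5 * math.log(2.0 * math.pi)
               for t, m, s in zip(target, mean, std))

target, mean = [1.0, -2.0], [0.5, -1.0]
fixed = gaussian_nll(target, mean, [1.0, 1.0])
least_squares = 0.5 * sum((t - m) ** 2 for t, m in zip(target, mean))
# With unit variance the NLL equals least squares plus this additive constant:
const = len(target) * 0.5 * math.log(2.0 * math.pi)
```

Since the additive constant does not affect gradients, training with this integrand is identical to training with the least squares loss.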
When $\mathcal{T}$ is a finite set, one can fit the targets to categorical distributions. A popular tool is to use the softmax function to parameterize such distributions. In this case, $\mathcal{Y} = \mathbb{R}^{|\mathcal{T}|}$, and an output $y$ is mapped to the probability vector $\operatorname{softmax}(y) = \left( e^{y_j} / \sum_{j'} e^{y_{j'}} \right)_j$. The probability density of a target $t$ is simply the $t$-th coordinate, so that the integrand becomes
$c(t, y) = -\log \operatorname{softmax}(y)_t = -y_t + \log \sum_{j} e^{y_j},$
which is exactly the usual cross-entropy loss function used in classification.
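The categorical integrand above can be sketched in a few lines (our own illustration, using the standard max-subtraction trick for numerical stability); exponentiating the negative losses over all classes recovers the softmax probabilities, which sum to one.

```python
import math

def cross_entropy(logits, target_index):
    # -log softmax(y)_t = -y_t + log(sum_j exp(y_j)), computed stably
    # by subtracting the max logit before exponentiating.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target_index]

# exp(-loss) over all classes gives back the softmax probability vector.
probs_sum_check = sum(
    math.exp(-cross_entropy([2.0, 0.5, -1.0], k)) for k in range(3)
)
```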
3.3.2 Variational autoencoder
The variational autoencoder (VAE) (Kingma and Welling, 2014; Rezende et al., 2014) can be translated to our prototype problem as follows. First, denote the dataset by $\mu \in \mathcal{P}(\mathcal{X})$, the prior distribution by $\pi \in \mathcal{P}(\mathcal{Z})$ with $\mathcal{Z}$ the latent space, and the reparameterization distribution by $\rho \in \mathcal{P}(\mathcal{E})$ (which may or may not (Joo et al., 2020) be equivalent to $\pi$). The two components of the VAE are represented by the encoder map $e : \mathcal{X} \times \mathcal{H}_e \to \mathcal{A}$, differentiable in its second argument, and the decoder map $d : \mathcal{Z} \times \mathcal{H}_d \to \mathcal{B}$, differentiable in both arguments. A key component is the reparameterization function $R : \mathcal{E} \times \mathcal{A} \to \mathcal{Z}$, which is measurable in its first argument with respect to $\rho$ and differentiable in its second argument, and satisfies an absolute continuity property with respect to the prior $\pi$ for all encoder outputs. These are combined into the map $f : (\mathcal{X} \times \mathcal{E}) \times \mathcal{H} \to \mathcal{Y}$ with $\mathcal{H} = \mathcal{H}_e \times \mathcal{H}_d$ and $\mathcal{Y} = \mathcal{A} \times \mathcal{B}$, defined as $f((x, \varepsilon), (w_e, w_d)) = (e(x, w_e), d(R(\varepsilon, e(x, w_e)), w_d))$ for $(x, \varepsilon) \in \mathcal{X} \times \mathcal{E}$ and $(w_e, w_d) \in \mathcal{H}$. Denoting the data by $\mu \otimes \rho$ and assuming that $\int \|f((x, \varepsilon), w)\|^2 \, d(\mu \otimes \rho)(x, \varepsilon)$ exists and is finite for all $w \in \mathcal{H}$, we define the induced map $F : \mathcal{H} \to \mathcal{F}$.

Similarly to the maximum likelihood problem, let $\ell_{\mathrm{rec}}$ be the function mapping a pair consisting of an input $x$ and a decoder output $b$ to minus the natural logarithm of the probability density function of the probability distribution parameterized by $b$, evaluated at $x$. Let $\beta > 0$ be a constant and $\ell_{\mathrm{KL}}$ the function mapping an encoder output $a$ to the Kullback-Leibler divergence of the posterior probability distribution parameterized by $a$ from the prior distribution $\pi$. The integrand is then defined as $\ell((x, \varepsilon), (a, b)) = \ell_{\mathrm{rec}}(x, b) + \beta \, \ell_{\mathrm{KL}}(a)$, consisting of the reconstruction term and the divergence term, the latter being weighted by $\beta$ (Higgins et al., 2017). Training a VAE is exactly the minimization problem $\min_{w \in \mathcal{H}} (\mathcal{L} \circ F)(w)$. One can analyze the behavior of this problem when optimized by gradient descent by expressing the Jacobian $DF$ in terms of the Jacobians of the encoder and the decoder.
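The per-sample integrand of the translated problem can be sketched as follows (our own illustration, assuming a diagonal Gaussian encoder, a standard normal prior, a unit-variance Gaussian decoder, and the usual reparameterization $z = m + \sigma \varepsilon$; the toy linear `encode`/`decode` maps below are hypothetical stand-ins for neural networks):

```python
import math, random

def kl_std_normal(mu, log_var):
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ), closed form for diagonal Gaussians:
    # 0.5 * sum_j ( mu_j^2 + exp(log_var_j) - 1 - log_var_j )
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def reparameterize(mu, log_var, eps):
    # z = mu + sigma * eps with sigma = exp(log_var / 2)
    return [m + math.exp(lv / 2.0) * e for m, lv, e in zip(mu, log_var, eps)]

def beta_vae_integrand(x, eps, encode, decode, beta):
    mu, log_var = encode(x)
    z = reparameterize(mu, log_var, eps)
    x_hat = decode(z)
    # Reconstruction term: unit-variance Gaussian decoder NLL up to a constant.
    rec = 0.5 * sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
    return rec + beta * kl_std_normal(mu, log_var)

# Hypothetical linear encoder/decoder, standing in for neural networks.
encode = lambda x: ([0.5 * v for v in x], [0.0 for _ in x])
decode = lambda z: [2.0 * v for v in z]
random.seed(0)
eps = [random.gauss(0.0, 1.0) for _ in range(2)]
value = beta_vae_integrand([1.0, -1.0], eps, encode, decode, beta=1.0)
```

Averaging this integrand over the data $\mu$ and the reparameterization noise $\rho$ yields exactly the $\beta$-VAE objective in the prototype form $\mathcal{L}(F(w))$.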