Gradient Descent Converges to Ridgelet Spectrum

07/07/2020 ∙ by Sho Sonoda, et al.

Deep learning achieves a high generalization performance in practice, despite the non-convexity of the gradient descent learning problem. Recently, the inductive bias in deep learning has been studied through the characterization of local minima. In this study, we show that the distribution of parameters learned by gradient descent converges to a spectrum of the ridgelet transform, based on ridgelet analysis, which is a wavelet-like analysis developed for neural networks. This convergence is stronger than those shown in previous results, and guarantees that the shape of the parameter distribution is identified with the ridgelet spectrum. In numerical experiments with finite models, we visually confirm the resemblance between the distribution of learned parameters and the ridgelet spectrum. Our study provides a better understanding of the theoretical background of an inductive bias theory based on lazy regimes.


1 Introduction

Characterizing the local minima of gradient descent (GD) learning is important for the theoretical study of neural networks, and it is a hard problem due to the non-convexity of the learning objective. Over-parametrization, the assumption that the number of parameters is sufficiently larger than the sample size, is considered an important factor in proving the strong performance of deep learning (Arora et al., 2019b; Allen-Zhu et al., 2019; Nitanda and Suzuki, 2017; Rotskoff and Vanden-Eijnden, 2018; Mei et al., 2018; Sirignano and Spiliopoulos, 2020a; Chizat and Bach, 2018; Jacot et al., 2018; Lee et al., 2019; Frankle and Carbin, 2019). By regarding the weights of a neural network as a signed distribution, we analyze the over-parametrized regime by means of the integral representation (Barron, 1993; Murata, 1996; Sonoda and Murata, 2017): S[γ](x) = ∫ γ(a,b) η(a·x - b) da db. Here, γ represents the weights, or the parameter distribution, and η(a·x - b) represents a hidden unit with an activation function η, input x, and hidden parameters (a,b). This is a weighted integral of infinitely many hidden units, but we remark that by formally taking a singular measure γ = Σ_j c_j δ_{(a_j, b_j)}, we can also represent a weighted sum of finitely many hidden units as S[γ](x) = Σ_j c_j η(a_j·x - b_j).
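For concreteness, the following minimal NumPy sketch evaluates the finite-sum case above: a Dirac-discretized parameter distribution read off as a 2-layer network. The Tanh activation, the parameter ranges, and all variable names are our own illustrative choices, not taken from the paper.

```python
import numpy as np

def finite_network(x, a, b, c, eta=np.tanh):
    """S[gamma](x) = sum_j c_j * eta(a_j . x - b_j) for the singular measure
    gamma = sum_j c_j * delta_{(a_j, b_j)}.
    x: (n, m) inputs, a: (p, m) weights, b: (p,) biases, c: (p,) output weights."""
    return eta(x @ a.T - b) @ c  # (n, p) hidden activations times output weights -> (n,)

# Example: p = 1000 random hidden units on 1-dimensional inputs (hypothetical sizes).
rng = np.random.default_rng(0)
p, m = 1000, 1
a = rng.uniform(-3.0, 3.0, size=(p, m))
b = rng.uniform(-3.0, 3.0, size=p)
c = rng.normal(0.0, 1.0 / p, size=p)
x = np.linspace(-1.0, 1.0, 50).reshape(-1, m)
print(finite_network(x, a, b, c).shape)  # (50,)
```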

Since all the hidden parameters are integrated out, we do not need to update the hidden parameters during training; we only need to update the parameter distribution γ. This is a strong advantage because the learning problem regains convexity in the function space. This convexification trick has been known and employed in the integral representation theory (Barron, 1993), a.k.a. ridgelet analysis (Murata, 1996; Candès, 1999), in convex neural networks (Bengio et al., 2006), and in random Fourier features (Rahimi and Recht, 2008). Recently, the mean-field regime (Mei et al., 2018; Rotskoff and Vanden-Eijnden, 2018; Sirignano and Spiliopoulos, 2020a), a.k.a. the Wasserstein gradient flow theory (Chizat and Bach, 2018; Nitanda and Suzuki, 2017), has also adopted this formulation, and the integral representation has been recognized as a crucial tool for proving the global convergence of deep learning.

Thus far, we know much less about the minimizers themselves, due to the non-trivial null space of the integral operator S. In fact, there are infinitely many different parameter distributions, say γ_1 and γ_2, that represent the same function: S[γ_1] = S[γ_2] (see Appendix C.3 for more details). In this study, we consider a regularized square loss minimization problem and provide a unique explicit representation of the global minimizer in terms of the ridgelet transform on the torus.

The ridgelet transform, which is a wavelet-like integral transform, was originally developed in (Murata, 1996; Candès, 1998; Sonoda and Murata, 2017), and has remarkable applications to the analysis of neural networks. Whereas the original studies of the ridgelet transform employed Fourier analysis on the Euclidean space, we utilize Fourier analysis on the torus and develop a simple but flexible framework to study neural networks with modern activation functions such as the rectified linear unit (ReLU). Although Fourier analysis on the torus imposes periodicity on the activation function, we show theoretically that a periodic activation function still provides sufficient and effective power to analyze over-parametrized neural networks.
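As a minimal sketch of the periodicity requirement (our own construction; the exact period and periodization used in the paper are not reproduced here), a ReLU can be restricted to one period and repeated, so that Fourier analysis on the torus applies:

```python
import numpy as np

def periodized_relu(z, period=2 * np.pi):
    """Restrict ReLU to one period [-period/2, period/2) and repeat it periodically.
    The period 2*pi is an assumption for illustration."""
    z_wrapped = (z + period / 2) % period - period / 2  # wrap z into [-period/2, period/2)
    return np.maximum(z_wrapped, 0.0)

z = np.linspace(-10, 10, 7)
print(periodized_relu(z))  # agrees with ReLU inside one period, repeats outside
```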

To be precise, our main theorem is described as follows:

Main Theorem.

Let the space of data-generating functions be the L² space of a finite Borel measure with a bounded density, and let the space of parameter distributions be the L² space over a bounded domain of hidden parameters, with bounds as fixed in Section 2. We assume the activation function η is periodic. For any regularization parameter and any data-generating function, the unique solution of the following minimization problem

(1)

is represented by the ridgelet transform as follows:

(2)

Here, the residual terms tend to 0 (see Theorem 3.2 for the precise statement).
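Because the displayed formulas (1) and (2) were lost in extraction, the following LaTeX sketch records one plausible reading, using placeholder symbols: S for the integral representation operator, R for the ridgelet transform of Section 2.1, β for the regularization weight, c_η for an activation-dependent constant, and ε_β for the residual terms. It is an assumption about the general form, not the paper's exact statement.

```latex
% Hypothetical reading of (1) and (2); all symbols are placeholders.
\begin{align*}
  &\text{(1)}\qquad
  \gamma_\beta \;=\; \operatorname*{arg\,min}_{\gamma}\;
    \lVert S[\gamma] - f \rVert_{L^2(\mu)}^{2} \;+\; \beta\,\lVert \gamma \rVert^{2},\\
  &\text{(2)}\qquad
  \gamma_\beta \;=\; c_\eta\,\mathcal{R}[f] \;+\; \varepsilon_\beta,
  \qquad \varepsilon_\beta \to 0 .
\end{align*}
```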

Numerical simulation confirms our main results: the scatter plot of parameter distributions learned by GD shows a pattern similar to the ridgelet spectrum. As a consequence, we can also gain a better understanding of the theoretical background of lazy learning, a recent line of inductive bias theory stating that the learned parameters are very close to the initial parameters, such as the neural tangent kernel (Jacot et al., 2018; Lee et al., 2019; Arora et al., 2019a) and the strong lottery ticket hypothesis (Frankle and Carbin, 2019).

The structure of this paper is as follows: In Section 2, we develop the theory of the ridgelet transform on the torus, which is the theoretical basis for the explicit representation of the over-parametrized neural network at the global minimum. We then introduce a positive definite kernel to prove the universality of neural networks with a bounded range of hidden parameters, and give a precise definition of integral representations. In Section 3, we give our main results. In Section 4, we conduct numerical simulations to see that the ridgelet spectrum coincides with the parameter distribution of over-parametrized neural networks trained by gradient descent, as suggested by our main results. In Section 5, we discuss the relation to previous studies.

Notation.

For a measurable space with a positive measure, we denote by L² the space of square-integrable functions on it with respect to that measure.

2 Ridgelet Transforms on the Torus

In this section, we establish the ridgelet transform on the torus, which is the theoretical basis of this study. We fix the input dimension and the torus, which we often regard as an interval of one period, and we fix a bounded measurable function on the torus, or equivalently, a bounded measurable periodic function on the real line. Originally, the ridgelet transform was defined on the Euclidean space (Murata, 1996; Candès, 1999). However, the original definition excludes non-integrable activation functions such as Tanh and ReLU. Sonoda and Murata (Sonoda and Murata, 2017) extended the ridgelet transform to accept such non-integrable activation functions by introducing an auxiliary dual activation function. However, their theory sacrifices the Plancherel formula, which we need in this study. Therefore, in order to cover non-integrable activation functions, we instead assume periodic activation functions.

2.1 Ridgelet transform

We introduce the ridgelet transform and its reconstruction formula.

Definition 2.1.

We define the ridgelet transform by

(3)
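Since the display (3) did not survive extraction, here is a hedged LaTeX sketch of a torus ridgelet transform, modeled on the Euclidean definitions of Murata and Candès; the integration measure and the complex conjugation are assumptions rather than the paper's exact formula.

```latex
% Hypothetical form of (3): ridgelet transform of f against a periodic activation eta,
% with hidden parameters (a, b), where b lives on the torus.
\[
  \mathcal{R}[f](a,b) \;=\; \int_{\mathbb{R}^m} f(x)\,\overline{\eta(a\cdot x - b)}\,dx .
\]
```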

To be precise, we define the ridgelet transform for all square-integrable functions via bounded extension, by essentially the same argument as in the definition of the L²-Fourier transform. Namely, we first define it on a dense subspace of integrable functions, for which the defining integral is absolutely convergent, and then extend it to all square-integrable functions as the common limit along any approximating sequence that converges in L². Let us now introduce the admissibility condition on the activation function:

Assumption 2.2 (admissibility condition).

The function satisfies the following two conditions: (1) , and (2) .
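The two conditions of Assumption 2.2 were lost in extraction. Guided by the remark below that they can be enforced "by multiplying and subtracting constants," and by the Euclidean admissibility condition of ridgelet analysis, one plausible form, stated with the Fourier coefficients of the activation on the torus, is sketched here; it is our guess, not the paper's statement.

```latex
% Guessed form of Assumption 2.2, stated with the Fourier coefficients \hat{\eta}(k) of eta:
\[
  \text{(1)}\quad \hat{\eta}(0) = 0 ,
  \qquad\qquad
  \text{(2)}\quad \sum_{k \neq 0} \frac{|\hat{\eta}(k)|^{2}}{|k|^{m}} \;=\; 1 .
\]
% (1) can be enforced by subtracting a constant, (2) by multiplying by one;
% the sum converges whenever eta is square integrable, since |k|^m >= 1.
```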

We need the admissibility condition in the proof of the reconstruction formula below. It is not at all strong. In fact, the infinite sum in the second condition always converges because the activation function is square integrable; thus, we may replace the activation with a function satisfying these conditions merely by multiplying by and subtracting constants. In particular, the restrictions of Tanh and ReLU to one period satisfy this assumption after slight modifications of the constants. Under the admissibility condition, the ridgelet transform satisfies the reconstruction formula and the Plancherel formula as follows:

Theorem 2.3.

Impose Assumption 2.2 on . Then for , we have

(4)
(5)

By the Plancherel formula (5), the adjoint operator of the ridgelet transform can be computed explicitly. This adjoint might be regarded as an integral representation of the neural network. However, it is defined only on the image of the ridgelet transform, which is hard to specify (due to the non-triviality of the null space). Thus, we will introduce a modified version of it in Section 2.3. By discretizing the integral in (4), we obtain, as a corollary of Theorem 2.3, the well-known universality of 2-layer neural networks with the activation η in the L² space of any finite Borel measure:

Corollary 2.4.

For any finite Borel measure on , the linear space generated by is dense in .

2.2 Reproducing kernel Hilbert space with inner product of features

In this section, we introduce an RKHS, which is an effective framework for analyzing the behavior of the parameters and the expressive power of neural networks, and we prove a stronger universality result (Corollary 2.7) for 2-layer neural networks with restricted parameters, for later use.

We fix a positive number and let . We define a positive definite kernel on by

(6)
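A minimal numerical sketch of the "inner product of features" idea: the kernel value at (x, y) as the average of η(a·x - b) η(a·y - b) over hidden parameters drawn uniformly from a bounded box. The box bounds, the Tanh activation, and the Monte Carlo estimator are illustrative assumptions, not the paper's definition (6).

```python
import numpy as np

def feature_kernel(x, y, eta=np.tanh, A=3.0, B=np.pi, n_mc=20000, seed=0):
    """Monte Carlo estimate of k(x, y) = E_{(a,b) ~ Unif(D)}[eta(a.x - b) * eta(a.y - b)],
    i.e. the inner product of the feature maps phi_x(a, b) = eta(a . x - b)
    over the bounded box D = [-A, A]^m x [-B, B]."""
    rng = np.random.default_rng(seed)
    m = len(x)
    a = rng.uniform(-A, A, size=(n_mc, m))
    b = rng.uniform(-B, B, size=n_mc)
    return np.mean(eta(a @ x - b) * eta(a @ y - b))

x = np.array([0.5])
y = np.array([0.7])
print(feature_kernel(x, x), feature_kernel(x, y))  # typically k(x, x) >= k(x, y)
```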

We denote the RKHS associated with this kernel, and call it the RKHS with the inner product of features. We remark that the kernel is continuous and bounded. Next, we discuss the characteristic property and universality (Sriperumbudur et al., 2010, p. 2392) of the kernel, namely, density properties in function spaces. To deal with this problem, let us introduce the following mild assumption on the activation function:

Assumption 2.5.

The bounded measurable function on satisfies .

In other words, the activation function cannot be a trigonometric polynomial (a finite sum of sines and cosines); thus, any discontinuous square-integrable function on the torus satisfies this assumption. For example, the periodized Tanh and ReLU mentioned above satisfy it. Under this assumption, we have the following theorem:

Theorem 2.6.

Under Assumption 2.5, the kernel is characteristic. If we additionally impose Assumption 2.2 on the activation function, the kernel is universal.

By means of Theorem 2.6, we prove a stronger form of universality as follows:

Corollary 2.7.

For any finite Borel measure on and , the linear space generated by is dense in . Here we define .

Proof.

We here denote the inner product of the relevant L² space. It suffices to show that any element orthogonal to all the generating functions vanishes. Since the feature is a nonzero constant function for a suitable choice of hidden parameters, such an element is orthogonal to the constants. Moreover, since the RKHS is generated by the features, the element is contained in the orthogonal complement of the RKHS. Since the kernel is characteristic, the sum of the RKHS and the constants is dense (cf. (Fukumizu et al., 2009, Proposition 5), (Sriperumbudur et al., 2010, Section 3.2)). Hence the element is zero. ∎

Compared with Corollary 2.4, Corollary 2.7 provides a practically important conclusion: it says that even if the parameters in the hidden layer are bounded, 2-layer neural networks have sufficient expressive power under Assumption 2.5.

2.3 Integral representation of neural networks

In this section, we define an integral representation of a 2-layer neural network. It can also be regarded as a truncated version of the adjoint operator of the ridgelet transform. Although the theory of the ridgelet transform on the torus is very clear, it has a flaw when used to analyze neural networks: the relevant function space does not contain finite neural networks, so we cannot see a direct connection between finite neural networks and integral representations of neural networks. To circumvent this technical issue, we consider a weighted version with hidden parameters restricted to a bounded domain.

Definition 2.8.

We define an integral representation of a neural network by

(7)
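To illustrate the remark below that this operator is a limit of finite neural networks whose hidden parameters lie in a bounded box, here is a hedged Monte Carlo sketch; the box, the normalization by the number of units, and the example distribution are our own choices, not the paper's definition (7).

```python
import numpy as np

def integral_representation_mc(gamma, x, eta=np.tanh, A=3.0, B=np.pi, p=5000, seed=0):
    """Monte Carlo approximation of the integral representation
    S[gamma](x) ~ average over (a, b) ~ Unif(D) of gamma(a, b) * eta(a . x - b),
    realized as a finite 2-layer network with p hidden units drawn from
    the box D = [-A, A]^m x [-B, B] (normalization is an assumption)."""
    rng = np.random.default_rng(seed)
    m = x.shape[-1]
    a = rng.uniform(-A, A, size=(p, m))
    b = rng.uniform(-B, B, size=p)
    c = gamma(a, b) / p                      # output weights c_j = gamma(a_j, b_j) / p
    return eta(x @ a.T - b) @ c

# Example with a hypothetical parameter distribution gamma(a, b) = exp(-(a^2 + b^2)).
gamma = lambda a, b: np.exp(-(a[:, 0] ** 2 + b ** 2))
x = np.linspace(-1, 1, 5).reshape(-1, 1)
print(integral_representation_mc(gamma, x))
```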

The operator can be regarded as a limit of finite neural networks whose hidden parameters are contained in the bounded domain. By a simple computation, we see that its adjoint operator is explicitly represented as a weighted analogue of the ridgelet transform (cf. Definition 2.1); thus, the integral representation is the adjoint of this weighted ridgelet transform. Then we have the following proposition describing its expressive power:

Proposition 2.9.

Assume is continuous at a point and . Then the image of is dense in .

Proof.

Denote the inner product of the relevant space. Since the closure of the image of the operator coincides with the orthogonal complement of the kernel of its adjoint, it suffices to show that any element in the kernel of the adjoint vanishes. In fact, such an element is orthogonal to the features for almost every hidden parameter, and by continuity for all hidden parameters in the bounded domain. In addition, it is orthogonal to the constants. Thus, by Corollary 2.7, the element is zero. ∎

3 Main Results

In this section, we describe the formulation of our problem and main results. We impose Assumptions 2.2 and 2.5 on the bounded measurable map on . We fix an absolutely continuous finite Borel measure on . We assume has a bounded density function .

3.1 Square loss minimization for the integral representation

Our main goal is to provide an explicit representation of the global minimizer of the learning problem:

(8)

We regard the function to be learned as represented by a parameter distribution, and we will give an explicit representation of the distribution attaining the solution in terms of the ridgelet transform.

For , , and , we consider the -regularized square loss of the integral representation :

(9)
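Since the displayed objective (9) was lost, the following LaTeX sketch records the standard Tikhonov-regularization form that the surrounding discussion refers to, written with placeholder symbols S (integral representation operator), f (target function), and β (regularization weight); it is an assumption about the general form, not the paper's exact display.

```latex
% Standard Tikhonov form for J(gamma) = ||S gamma - f||^2 + beta ||gamma||^2 (placeholder symbols):
\[
  \gamma_\beta \;=\; (S^{*}S + \beta I)^{-1} S^{*} f ,
\]
% which exists and is unique for every beta > 0, because S^*S + beta I is boundedly
% invertible whenever S is a densely defined closed operator.
```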

We denote the unique element attaining the minimum, which always exists as long as the integral representation operator is a densely defined closed operator; see Appendix D for more details. Although the minimizer may blow up as the regularization parameter tends to zero, by Proposition 2.9 we have the following proposition:

Proposition 3.1.

Assume is continuous at a point and . Then the square loss converges to 0 as .

Proof.

Let be an arbitrary positive number. By Proposition 2.9, there exists an element of such that . Take . Then satisfies . Thus we have . ∎

3.2 An explicit representation of the global minimizer

Our first main result is the explicit representation of the minimizer of the regularized square loss minimization problem in terms of the ridgelet transform.

Theorem 3.2.

Let be an absolutely continuous finite Borel measure on with bounded density function . Let . Then, automatically and we have

(10)

where is an element of such that

(11)

By formally completing the square in the Hilbert space, we can verify the existence and uniqueness of the minimizer. However, this argument alone does not clarify its concrete properties. Theorem 3.2 provides an explicit representation of the minimizer via the ridgelet transform.
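The "completing the square" argument mentioned above can be spelled out as follows; this is a standard Hilbert-space computation written with the same placeholder symbols as before, not a quotation of the paper.

```latex
% Completing the square for J(gamma) = ||S gamma - f||^2 + beta ||gamma||^2:
\begin{align*}
  J(\gamma)
    &= \langle (S^{*}S + \beta I)\,\gamma, \gamma\rangle
       - 2\,\mathrm{Re}\,\langle \gamma, S^{*}f\rangle + \lVert f\rVert^{2} \\
    &= \bigl\lVert (S^{*}S + \beta I)^{1/2}\,(\gamma - \gamma_\beta)\bigr\rVert^{2}
       + \lVert f\rVert^{2}
       - \langle (S^{*}S + \beta I)^{-1} S^{*}f,\; S^{*}f\rangle ,
\end{align*}
% so J is uniquely minimized at gamma_beta = (S*S + beta I)^{-1} S* f.
```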

3.3 Relation to the 2-layer finite neural networks

In this section, we prove that over-parametrized finite neural networks converge to the minimizer of the integral-representation problem (9) as the number of parameters tends to infinity. Over-parametrization lets us suppose that the empirical distribution of the hidden parameters weakly converges to the uniform distribution on the bounded parameter domain, so that the problem reduces to the optimization of the output weights. The weak convergence assumption is satisfied, for example, when the hidden parameters are i.i.d. samples drawn from the uniform distribution; however, randomness is not necessary in the proof. Let us consider the supervised learning problem as follows: given a sequence of Borel measures, obtained by allocating mass to each hidden parameter, that weakly converges to the Lebesgue measure on the parameter domain, we consider the following optimization problem:

(12)

Let be the unique minimizer of (12).

Theorem 3.3.

Assume the measures weakly converge to the Lebesgue measure as the number of parameters tends to infinity. Then the minimizer of (12) converges to the minimizer of (9).

Although we cannot determine the shape of the distribution of the optimal solution when the number of parameters is small, the over-parametrized neural networks converge to a common parameter distribution. Combining Theorem 3.3 with Theorem 3.2, we obtain an explicit representation of the global minimizer via the ridgelet transform. Theorem 3.3 implies the weak convergence of parameter distributions, which is a stronger notion of convergence of over-parametrized neural networks to the global minimum than in previous results:

Corollary 3.4.

The distribution weakly converges to an absolutely continuous distribution , namely, for any bounded continuous function on , we have as .

In Section 4 below, we consider the learning problem of a 2-layer neural network trained via gradient descent, and we see that the parameters of the over-parametrized neural networks accumulate on the ridgelet spectrum.

4 Numerical Simulation

In order to verify the main results, we conducted numerical simulations with artificial datasets. Here, we display only the results of Experiment 1; readers are encouraged to refer to the supplementary materials for further experimental results.

4.1 Scatter plots of GD trained parameters.

Given a dataset, we repeatedly trained neural networks with Gaussian, Tanh, and ReLU activation functions. The training was conducted by minimizing the square loss using stochastic gradient descent with weight decay. Note that weight decay has an effect equivalent to the L² regularization assumed in the main theory. After training, we obtained sets of parameters (a_j, b_j, c_j) and plotted them in the (a, b)-plane, with c_j visualized in color. See the supplementary materials for more details on the settings.
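A minimal, self-contained sketch of the training procedure described above (our own reimplementation, not the authors' code): a 2-layer Tanh network trained by full-batch gradient descent with weight decay on a 1-dimensional regression task. The target function, initialization ranges, and hyperparameters are illustrative assumptions; after training, the triples (a_j, b_j, c_j) would be scatter-plotted in the (a, b)-plane with c_j as color.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 1000                        # sample size and hidden units (over-parametrized)
x = rng.uniform(-1.0, 1.0, size=n)
y = np.sin(np.pi * x)                   # hypothetical data-generating function

# Uniform initialization of (a_j, b_j, c_j), as mentioned in Section 4.3.
a = rng.uniform(-6.0, 6.0, size=p)
b = rng.uniform(-6.0, 6.0, size=p)
c = rng.uniform(-0.1, 0.1, size=p)

lr, wd = 1e-2, 1e-4                     # step size and weight decay (acts like L2 regularization)
for _ in range(2000):
    h = np.tanh(np.outer(x, a) - b)     # (n, p) hidden activations tanh(a_j x_i - b_j)
    r = h @ c - y                       # residuals of the square loss
    dh = 1.0 - h ** 2                   # derivative of tanh
    grad_c = 2.0 / n * (h.T @ r) + 2 * wd * c
    grad_a = 2.0 / n * c * (dh.T @ (r * x)) + 2 * wd * a
    grad_b = -2.0 / n * c * (dh.T @ r) + 2 * wd * b
    a -= lr * grad_a
    b -= lr * grad_b
    c -= lr * grad_c

# The learned triples (a_j, b_j, c_j) would then be scatter-plotted in the (a, b)-plane
# with c_j shown as color, as in Figure 1 (a)-(c).
print(float(np.mean((np.tanh(np.outer(x, a) - b) @ c - y) ** 2)))  # final training loss
```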

4.2 Ridgelet spectrum

Given a dataset, we approximately compute the ridgelet spectrum of the data-generating function at the sample points by numerical integration:

(13)

where the normalizing factor is a constant because we assume the sample points to be uniformly distributed. We remark that more sophisticated methods for the numerical computation of the ridgelet transform have been developed; see (Do and Vetterli, 2003) and (Sonoda and Murata, 2014), for example.
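A hedged sketch of the numerical integration in (13): under the assumption that the samples are approximately uniformly distributed, the ridgelet spectrum at (a, b) is estimated by an empirical average of f(x_i) η(a·x_i - b); here the normalizing constant is simply absorbed into 1/n, and all names are our own.

```python
import numpy as np

def ridgelet_spectrum(xs, fs, a_grid, b_grid, eta=np.tanh):
    """Empirical ridgelet spectrum R[f](a, b) ~ (1/n) sum_i f(x_i) * eta(a * x_i - b)
    on a grid of (a, b) for 1-dimensional inputs; the normalizing constant is
    absorbed into 1/n (assumption). Returns an array of shape (len(a_grid), len(b_grid))."""
    A, B = np.meshgrid(a_grid, b_grid, indexing="ij")      # (na, nb)
    Z = A[..., None] * xs - B[..., None]                   # (na, nb, n) arguments a * x_i - b
    return np.mean(fs * eta(Z), axis=-1)

xs = np.random.default_rng(1).uniform(-1.0, 1.0, size=500)
fs = np.sin(np.pi * xs)                                    # hypothetical target values f(x_i)
spec = ridgelet_spectrum(xs, fs, np.linspace(-6, 6, 121), np.linspace(-6, 6, 121))
print(spec.shape)                                          # (121, 121), shown as a heat map
```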

4.3 Results

Figure 1: Parameter distributions trained by SGD (top: (a) Gaussian, (b) Tanh, (c) ReLU) and ridgelet spectra obtained by numerical integration (bottom: (d) Gaussian, (e) Tanh, (f) ReLU) for the common data-generating function.

In Figure 1, we compare the scatter plots of gradient descent (GD) trained parameters with the ridgelet spectra. All six panels are obtained from the common data-generating function. Despite the fact that the scatter plots and the ridgelet spectra are obtained by different procedures, namely numerical optimization and numerical integration, they share common characteristics. For example, red and blue parameters in the scatter plots (a-c) concentrate in the areas where the ridgelet spectra (d-f) indicate the same colors. Due to the periodicity assumption, the ridgelet spectrum repeats infinitely in the b-direction with the period of the activation. On the other hand, due to the weight decay and the initial locations of the parameters, the GD-trained parameters gather around the origin; here, we used a uniform distribution for the initialization. We can understand these differences between the scatter plot and the spectrum as the residual term in the main theorem. Another remarkable fact is that the GD-trained parameters essentially did not move from their initialized positions. This is possible because the support of the initial parameters overlaps the ridgelet spectrum from the beginning. We can understand this phenomenon as the so-called lazy regime.

5 Related Works

In the past, many authors have investigated the local minima of deep learning. However, these results have often imposed strong assumptions, such as: (A1) the activation function is limited to linear or ReLU (Kawaguchi, 2016; Soudry and Carmon, 2016; Nguyen and Hein, 2017; Hardt and Ma, 2017; Lu and Kawaguchi, 2017; Yun et al., 2018); (A2) the parameters are random (Choromanska et al., 2015; Poole et al., 2016; Pennington et al., 2018; Jacot et al., 2018; Lee et al., 2019; Frankle and Carbin, 2019); (A3) the input follows a normal distribution (Brutzkus and Globerson, 2017); or (A4) the target functions are low-degree polynomials or another sparse neural network (Yehudai and Shamir, 2019; Ghorbani et al., 2019). Due to these simplifying assumptions, we know very little about the minimizers themselves. In this study, from the perspective of functional analysis, we present a stronger characterization of the distribution of parameters in the over-parametrized setting. As a result, our theory (A1') accepts a wide range of activation functions, (A2') need not assume randomness of the parameter distributions, (A3') need not specify the data distribution, and (A4') preserves the universal approximation property of neural networks, such as the density in L².

The mean-field regime theory (Rotskoff and Vanden-Eijnden, 2018; Mei et al., 2018; Sirignano and Spiliopoulos, 2020a, b), a.k.a. the gradient flow theory (Nitanda and Suzuki, 2017; Chizat and Bach, 2018; Arbel et al., 2019), has also employed the integral representation and the parameter distribution to prove global convergence. These lines of study claim that, for the stochastic gradient descent learning of 2-layer networks, the "time evolution" of a discrete parameter distribution with finite parameter number and continuous training time asymptotically converges to the time evolution of the continuous parameter distribution as the parameter number tends to infinity. Here, the time evolution is described by a gradient flow (a partial differential equation, the Wasserstein gradient flow, or the McKean-Vlasov equation) with a given initial condition. However, we should point out that the convergence in this argument is weaker than our result. As we explained in Appendix C.3, the equation S[γ] = f has infinitely many different solutions, say γ_1 and γ_2, satisfying S[γ_1] = S[γ_2] but γ_1 ≠ γ_2. Hence, even though the network outputs converge, we cannot expect the parameter distributions to converge in general, which leaves the parameter distribution indeterminate. In contrast, by explicitly imposing a regularization term, we have specified the parameter distribution of the global minimizer and shown norm convergence in the space of parameter distributions.

In order to avoid potential confusion, we provide a supplementary explanation of the trick behind the mean-field theory. In the mean-field theory, the gradient flow is often explained as a system of interacting particles, by identifying the parameters with the coordinates of physical particles. The particles obey a non-linear equation of motion with an interaction potential, which is derived simply by expanding the squared loss function. Based on this physical analogy, we may accept this potential as natural. However, here lies the trick: by simply changing the order of integration, we can verify that the null space of the integral operator is eliminated by implicitly applying the operator inside the potential:

(14)
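To make the order-of-integration argument explicit (filling in the lost display (14) with our own notation, which may differ from the paper's), the interaction energy can be rewritten so that the integral operator S appears inside a squared norm:

```latex
% Hypothetical spelling-out of (14), writing theta = (a, b) and, for a real activation eta,
% W(theta, theta') = int eta(a.x - b) eta(a'.x - b') dmu(x):
\[
  \iint \gamma(\theta)\,\gamma(\theta')\,W(\theta,\theta')\,d\theta\,d\theta'
  \;=\; \int \Bigl(\int \gamma(\theta)\,\eta(a\cdot x - b)\,d\theta\Bigr)^{\!2} d\mu(x)
  \;=\; \lVert S[\gamma] \rVert_{L^2(\mu)}^{2} .
\]
```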

This clearly indicates that the interaction potential is degenerate on the null space of the integral operator (try two distributions that differ by an element of the null space, for example), and this is the reason why the mean-field theory cannot show the stronger convergence.

Lazy learning, such as the neural tangent kernel (Jacot et al., 2018; Lee et al., 2019; Arora et al., 2019a) and the strong lottery ticket hypothesis (Frankle and Carbin, 2019), has employed a different formulation of over-parametrization to investigate the inductive bias of deep learning. These lines of study have drawn much attention by radically claiming that the minimizers are very close to the initialized state. In this study, we revealed that the shape of the parameter distribution is identified with the ridgelet spectrum. From this perspective, lazy learning is reasonable when the initial parameter distribution covers the ridgelet spectrum in its support, because the initial parameters then need not be actively updated. Furthermore, this assumption can be reasonable because the initial parameter distribution is typically a normal (or sometimes a uniform) distribution centered at the origin; and if the data-generating function is a low-frequency function, then the ridgelet spectrum concentrates at the origin.

6 Conclusion

In this study, we have derived the unique explicit representation, namely the ridgelet spectrum with a residual term, of the over-parametrized neural network at the global minimum. To date, many studies have proven the global convergence of deep learning. However, we know very little about the minimizer itself because (1) the settings are typically very simplified and (2) the integral representation operator has a non-trivial null space. To circumvent these problems, we developed the ridgelet transform on the torus and analyzed the regularized square loss minimization. In the numerical simulation, the scatter plots of learned parameters showed a pattern very similar to the ridgelet spectra, which supports our theoretical result.

Broader Impact

We believe this section is not applicable to this paper because of the theoretical nature of this study.

Acknowledgments

The authors are grateful to Taiji Suzuki, Atsushi Nitanda, Kei Hagihara, Yoshihiro Sawano, Takuo Matsubara, and Noboru Murata for productive comments. This work was supported by JSPS KAKENHI 18K18113.

References

Appendix A Proofs

A.1 Theorem 2.3

These formulas follow from the computations described in "Reconstruction formula" in Appendix C.2.

A.2 Theorem 2.6

Theorem A.1.

Under Assumption 2.5, the kernel is characteristic. If we additionally impose Assumption 2.2 on the activation function, the kernel is universal.

Proof.

By direct computation, we have

Put the auxiliary function as above. Since we assume Assumption 2.5, the support of this function is the whole space. Therefore, the associated kernel is universal (see Section 3.2 of (Sriperumbudur et al., 2010)), and thus characteristic. Under Assumption 2.2, it follows that the kernel itself is universal. ∎

A.3 Theorem 3.2

Here, we define

and for , we define

We define a bounded absolutely integrable function by

Lemma A.2.

The correspondence is a bounded and continuous mapping between the relevant spaces.

Proof.

We may assume the function is continuous; thus, we immediately see the continuity. The boundedness is obvious. ∎

Corollary A.3.

Let , and let be a bounded linear operator on . Then for any , is a well-defined element in and satisfies, for any ,

Lemma A.4.

For , we have

Proof.

Put . Since

for , by direct computation, we have

By taking the limit, we obtain the formula. ∎

Corollary A.5.

For any , the integral is well-defined in a similar manner to the Fourier transform. Moreover, we have

Theorem A.6.

Let be an absolutely continuous finite Borel measure on with density function . Let . Assume is bounded and . Then we have

(15)

where is an element of such that

Proof.

By the theory of Tikhonov regularization, the minimizer is explicitly described as follows:

where we write . We denote by . By direct computation, we have

where we define by

By Lemma A.4, we have . By Corollary A.5, we see that in . Therefore, we define

and the limit of is zero as . ∎

A.4 Theorem 3.3

Here we prove the following statement:

Theorem A.7 (Theorem 3.3).

Let . For every , let with . Assume that the measures weakly converge to the Lebesgue measure on the parameter domain, where the weak convergence means that the corresponding integrals converge for any bounded continuous function. Then the minimizer of

converges to the minimizer of in the sense