Integral representation of the global minimizer

Sho Sonoda et al. · 05/19/2018

We have obtained an integral representation of the shallow neural network that attains the global minimum of its backpropagation (BP) training problem. Unpublished numerical simulations conducted several years before this study had suggested that such an integral representation might exist, but it had not been proven until now. First, we introduce a Hilbert space of coefficient functions and a reproducing kernel Hilbert space (RKHS) of hypotheses associated with the integral representation; the RKHS reflects the approximation ability of neural networks. Second, we establish ridgelet analysis on the RKHS, in which the analytic properties of the integral representation become remarkably clear. Third, we reformulate BP training as an optimization problem in the space of coefficient functions and, following Tikhonov regularization theory, obtain a formal expression of the unique global minimizer. Finally, we demonstrate that the global minimizer is the shrink ridgelet transform. Although BP is convex in the integral representation, the relation between the integral representation and an ordinary finite network is not obvious, so we cannot immediately answer questions such as "is every local minimum a global minimum?" However, the obtained integral representation provides an explicit expression of the global minimizer without linearity-like assumptions such as partial linearity or monotonicity. Furthermore, it indicates that the ordinary ridgelet transform provides the minimum-norm solution to the original training equation.


1 Introduction

Training a neural network is conducted by backpropagation (BP), which results in a high-dimensional and non-convex optimization problem. Despite the difficulty of this optimization problem, deep learning has achieved great success in a wide range of applications such as image recognition (Redmon et al., 2016), speech synthesis (van den Oord et al., 2016), and game playing (Silver et al., 2017). The empirical success of deep learning suggests the conjecture that "all" local minima of the training problem are close or equal to global minima (Dauphin et al., 2014; Choromanska et al., 2015). Therefore, radical re-examinations of the shape of loss surfaces are ongoing (Draxler et al., 2018; Garipov et al., 2018). However, these lines of study impose strong assumptions such as linear activation (Kawaguchi, 2016; Hardt and Ma, 2017), overparameterization (Nguyen and Hein, 2017), Gaussian data distribution (Brutzkus and Globerson, 2017), and shallow networks, i.e. a single hidden layer (Li and Yuan, 2017; Soltanolkotabi, 2017; Zhong et al., 2017; Du and Lee, 2018; Ge et al., 2018; Soudry and Hoffer, 2018).

The scope of this study is the shape of the global optimizer itself, rather than the reachability of the global optimum via empirical risk minimization. By recasting BP training as a variational problem, i.e. an optimization problem in a function space, in the setting of a shallow neural network with an arbitrary activation function and the mean squared error, we present an explicit expression of the global minimizer via the ridgelet transform. By virtue of functional analysis, our result is independent of the parameterization of neural networks.

Figure 1 presents an intriguing example that motivates our study. Both Figure 1(a) and Figure 1(b) were obtained from the same dataset, shown in Figure 1(c), and they show similar patterns to each other. However, they were obtained by entirely different procedures: numerical optimization and numerical integration. In the following, we provide a brief explanation of the experiments; see § 3 for more details.

(a) BP trained parameters
(b) ridgelet spectrum
(c) dataset and an example of training results
Figure 1: Motivating example: Scatter plot (a) and ridgelet spectrum (b) were obtained from the same dataset (c) and bear an intriguing resemblance to each other, despite the fact that they were obtained from different procedures—numerical optimization and numerical integration.

Figure 1(a) shows a scatter plot of the parameters of neural networks that had been trained on the dataset. The dataset is composed of inputs drawn uniformly at random from an interval and responses corrupted by Gaussian noise. We trained a number of shallow neural networks, each with the same number of hidden units and the same activation function, using ADAM. The scatter plot presents the sets of trained hidden parameters, with the corresponding output parameters visualized in color.

On the other hand, Figure 1(b) shows the spectrum of the (classic) ridgelet transform

(1)

of the data with a certain ridgelet function. (See § 2 for more details on the ridgelet transform.) We calculated the spectrum by numerical integration, again using the dataset. Therefore, the two figures are obtained from the same dataset.

Even though the two figures are obtained from different procedures, both exhibit the same star-shaped pattern. In other words, the BP-trained parameters concentrate in the high-intensity areas of the ridgelet spectrum. From this intriguing similarity, we can conjecture that the global minimizer has a certain relation to the ridgelet transform.

In this study, we investigate the relation between the BP training problem and the ridgelet transform by reformulating the BP training in a function space, and show that the ridgelet transform can offer the global minimizer of the BP training problem.

2 Preliminaries

We introduce notation and describe the problem formulation. The most important notion is 'BP in the function space,' which plays a key role in formulating our research question.

2.1 Mathematical Notation

We use the following standard notation: the complex conjugate of a complex number; the Fourier transform of a function; the reflection of a function, i.e. the function with its argument negated; the Hilbert space equipped with its inner product; the adjoint operator of a linear operator on a Hilbert space; and the expectation of a function with respect to a random variable.

2.2 Problem Settings

Neural Network.  We consider a shallow neural network with vector input, scalar output, and an arbitrary activation function:

(2)

where the network has a finite number of hidden units, each with hidden parameters and an output parameter; we collectively write the set of all parameters as a single symbol. We remark that the one-dimensional output assumption is only for simplicity, and we can easily generalize our results to the multi-dimensional output case. Examples of the activation function are the Gaussian, the hyperbolic tangent, the sigmoid function, and the rectified linear unit (ReLU).
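For reference, a common way to write such a network is the following sketch, under assumed notation (the input x lies in R^m, eta is the activation, p is the number of hidden units, (a_j, b_j) are the hidden parameters, and c_j the output parameters; the paper's own symbols are not recoverable from the extracted text):

\[
  g_\theta(x) \;=\; \sum_{j=1}^{p} c_j\,\eta\!\left(a_j \cdot x - b_j\right),
  \qquad a_j \in \mathbb{R}^m,\ b_j \in \mathbb{R},\ c_j \in \mathbb{R}.
\]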

Cost Function.  We formulate BP training as the minimization of the mean squared error

(3)

with a certain regularization term, where the expectation is taken over the data distribution and the target is the ground-truth function. We remark that this formulation covers any empirical risk function, by choosing the data distribution to be an empirical distribution.
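A hedged reconstruction of (3), assuming x follows the data distribution P, f is the ground-truth function, Omega is the regularizer, and lambda >= 0 its weight (these symbols are ours):

\[
  L_p(\theta) \;=\; \mathbb{E}_{x\sim P}\,\big|\, g_\theta(x) - f(x) \,\big|^{2} \;+\; \lambda\,\Omega(\theta).
\]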

2.3 Integral Representation of Neural Network

In order to recast BP training in the function space, we introduce the integral representation of a neural network:

(4)

where the coefficient function weights the hidden units, the activation function is the one employed in (2), and the base measure is defined on the space of hidden parameters.
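A hedged reconstruction of (4), writing S for the integral representation operator, gamma for the coefficient function, eta for the activation of (2), and mu for the base measure on the hidden-parameter space (the operator symbol and normalization are assumptions of ours):

\[
  S[\gamma](x) \;=\; \int_{\mathbb{R}^m\times\mathbb{R}} \gamma(a,b)\,\eta(a\cdot x - b)\,\mathrm{d}\mu(a,b).
\]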

Brief Description.  Formally speaking, the integral representation is an infinite sum of hidden units. All the hidden parameters are integrated out, and only the output parameter, now a coefficient function, is left. In other words, the coefficient function indicates which hidden units to use by weighting them.

Function Class.  In this study, we assume that the coefficient function lies in a Hilbert space of functions on the hidden-parameter space, and that the base measure is a Borel measure. As described in Proposition 4.2, the base measure controls the expressive power of neural networks, i.e. the capacity of the class of representable functions.

Important Examples.  Two extreme cases are important: (a) the base measure is the Lebesgue measure, and (b) the base measure is a sum of Dirac measures. When it is the Lebesgue measure, the integral representation can express any square-integrable function (Sonoda and Murata, 2017). On the other hand, when it is a sum of Dirac measures, the integral representation can express any finite neural network (2). With a slight abuse of notation, one writes the coefficient function as a weighted sum of Dirac deltas centered at the hidden parameters, with the output parameters as weights,

(5)

and the integral representation then reduces to the finite network. In other words, the integral representation is a reparameterization of neural networks, and this is the simplest way to connect the integral representation and the ordinary representation.
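Under the same assumed notation as in the earlier sketches, the Dirac reparameterization (5) would read

\[
  \gamma_\theta \;=\; \sum_{j=1}^{p} c_j\,\delta_{(a_j,b_j)},
  \qquad\text{so that}\qquad
  S[\gamma_\theta](x) \;=\; \sum_{j=1}^{p} c_j\,\eta(a_j\cdot x - b_j) \;=\; g_\theta(x),
\]

where \( \delta_{(a,b)} \) denotes the Dirac delta centered at \( (a,b) \); this is the slight abuse of notation mentioned above, since Dirac deltas are measures rather than functions.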

Advantages.  The integral representation has at least two advantages over the 'ordinary representation' (2). The first advantage is that the coefficient function can express any distribution of parameters. By virtue of this flexibility, we can identify the scatter plot in Figure 1(a) as a point spectrum, and Figure 1(b) as a continuous spectrum.

The second advantage is that the hidden parameters are integrated out, and the coefficient function is the only trainable quantity. Recall that BP training of ordinary neural networks is a non-convex optimization problem. The non-convexity is caused by the hidden parameters, because they appear inside the nonlinear activation function; it is never caused by the output parameters, because they appear outside of it. In the integral representation, no trainable parameters appear inside the activation function. By virtue of this linearity, the BP training of the integral representation, which is described later in this section, becomes a convex optimization problem.

Brief History.  Originally, the integral representation and the ridgelet transform were developed to investigate the expressive power of neural networks (Barron, 1993; Murata, 1996; Candès, 1998) and to estimate approximation errors (Kůrková, 2012). Recently, they have been applied to synthesize neural networks without BP training, by approximating the integral transform with a Riemann sum (Sonoda and Murata, 2014; Bach, 2017a, b); to elucidate the inner mechanism of so-called "black-box" networks (Sonoda and Murata, 2018); and to estimate the generalization errors of deep neural networks from the decay of eigenvalues (Suzuki, 2018).

2.4 Ridgelet Transform

We place the explanation of the ridgelet transform right after the integral representation, because it is natural to understand the ridgelet transform as a right-inverse of the integral representation operator.

Let us consider the integral equation in which the integral representation operator applied to an unknown coefficient function equals a given function. In the context of neural networks, this equation is a prototype of learning: to learn is to find a coefficient function that reproduces the observed function. Murata (1996) and Candès (1998) discovered that the ridgelet transform provides a particular solution to this equation.

To be precise, when the base measure is the Lebesgue measure, the given function belongs to a suitable class of integrable functions, and there exists a ridgelet function that satisfies the admissibility condition

(6)

with respect to the activation function, then a particular solution to the equation is given by the ridgelet transform

(7)

This is what we call the classic ridgelet transform.
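For reference, the classic ridgelet transform (7) and the admissibility condition (6) are commonly written as follows; this is a sketch following Sonoda and Murata (2017), and the conjugation and normalization conventions vary across the literature, so the constants here are assumptions:

\[
  R_\psi f(a,b) \;=\; \int_{\mathbb{R}^m} f(x)\,\overline{\psi(a\cdot x - b)}\,\mathrm{d}x,
  \qquad (a,b)\in\mathbb{R}^m\times\mathbb{R},
\]
\[
  K_{\psi,\eta} \;:=\; (2\pi)^{m-1}\int_{\mathbb{R}} \frac{\overline{\widehat{\psi}(\zeta)}\,\widehat{\eta}(\zeta)}{|\zeta|^{m}}\,\mathrm{d}\zeta
  \quad\text{finite and nonzero (normalized to 1),}
\]

so that, under admissibility and for suitable f, the reconstruction \( S[R_\psi f] = f \) holds, i.e. the ridgelet transform is a right inverse of the integral representation operator.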

Here, we remark that the solution is not unique. On the contrary, there are infinitely many different particular solutions that map to the same function under the integral representation operator while differing from each other as coefficient functions. This is immediate from the fact that there are infinitely many different admissible ridgelet functions. Therefore, a single ridgelet transform (specified by one admissible ridgelet function) is not the exact inverse of the integral representation operator, which would have to be both a left and a right inverse; it is only a right inverse.

In the context of neural networks, the existence of a solution operator for any function means the universal approximation property, because a neural network can express any function simply by taking the ridgelet transform of that function as its coefficient function.

As demonstrated in § 3, the ridgelet transform can be computed by numerical integration. See Starck et al. (2010) and Sonoda and Murata (2017) for more details on ridgelet analysis.

2.5 BP in the Function Space

We rewrite BP training as the minimization problem of

(8)

with respect to the coefficient function. This reformulation formally extends the ordinary formulation (3), because every finite network corresponds to a coefficient function via (5). In other words, we can understand the ordinary BP problem in the function space, as depicted in Figure 2. We call this minimization problem BP in the function space. As mentioned above, by virtue of the linearity of the integral representation operator, BP in the function space reduces to a quadratic programming problem.
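A hedged reconstruction of the function-space cost (8), under the assumed notation of the previous sketches and assuming a Tikhonov-type squared-norm regularizer with weight lambda (consistent with the regularization theory invoked in § 4 and Appendix A, but not recoverable from the extracted text):

\[
  L(\gamma) \;=\; \mathbb{E}_{x\sim P}\,\big|\, S[\gamma](x) - f(x) \,\big|^{2} \;+\; \lambda\,\|\gamma\|^{2}.
\]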

Figure 2: Relation between the minimization problems in the parameter space and in the function space. The green curve depicts a finite-dimensional subspace, parameterized by the network parameters and embedded in the ambient function space. Since the optimization is restricted to the subspace, the gradient vector (dashed) is projected onto the tangent space (solid), and thus gradient descent generally heads in a different direction from the global minimizer. If the subspace is curved in the ambient function space, there can be multiple local optima.

Mathematically speaking, in contrast to the finite-dimensional optimization problem, the existence and uniqueness of the solution depend on the properties of the operator and the regularization. For the sake of simplicity, we consider a simple case with a positive regularization weight. In this case, a sufficient condition for the unique existence of the solution is a Lipschitz continuity condition; see Appendix A for more details.

Where are the Local Minima?  BP training in the function space has a unique global minimum because it is a quadratic program, while BP training in the parameter space generally has a large number of local minima. This is not a paradox, but simply a matter of parameterization.

To resolve this apparent paradox, let us consider performing gradient descent on the two cost functions: for the function-space cost we use the functional gradient (Fréchet derivative), and for the parameter-space cost we use the partial derivative with respect to the parameters.

Between these two derivatives, a chain rule holds:

(9)

In other words, the parameter-space gradient is obtained from the functional gradient by a change of coordinates.

According to the chain rule, if the functional gradient vanishes at some point of the subspace, then the partial derivative also vanishes at the corresponding parameters. However, the converse is not always true: as depicted in Figure 2, the partial derivative may vanish at parameters that are merely local optimizers.

2.6 Main Problem

Finally, our research problem is formulated as showing that

(10)

i.e. that the global minimizer of BP in the function space is given by a (suitably reformulated) ridgelet transform.

3 Details on Motivating Examples

Figure 3: Experimental results (optimization: ADAM). (a–c) Sinusoidal Curve, without and with Gaussian noise at two levels; (d) Gaussian Noise; (e) High Frequency Sinusoidal Curve; (f) Topologist's Sinusoidal Curve; (g–i) Gaussian Kernel at three locations; (j) Square Wave. See the supplementary materials for all the examples with larger images.

In Figure 1, we compared the BP-trained parameters and the ridgelet spectrum. Here, we explain the details of these experiments and review the results on nine additional datasets and three additional conditions. Readers are also encouraged to refer to the supplementary materials for further results.

3.1 Datasets

We prepared artificial datasets. For the sake of visualization, all the datasets are one-dimensional in both input and output. We emphasize that our main results described in § 4 are valid for any dimension. In the following, we use the standard notation for the normal distribution with a given mean and variance, and for the uniform distribution over a given interval.

Common Settings.  All the datasets use the same basic settings and, except for the 'Topologist's Sinusoidal Curve', the same sample size.

Sinusoidal Curve.  We prepared this dataset as a basic example.

Sinusoidal Curve with Gaussian Noise.  The sinusoidal curve with additive Gaussian noise at two noise levels. We prepared these datasets to examine the effect of noise. By the linearity of the ridgelet transform, we can expect that the effect of noise will be cancelled out on average.

Gaussian Noise.  Pure Gaussian noise. We prepared this dataset to isolate the effect of noise. In theory, the resulting ridgelet spectrum is a random process; therefore, the visualization is only a single realization of that process.

High Frequency Sinusoidal Curve.  A sinusoidal curve with a higher frequency. We prepared this dataset to examine the effect of a change in frequency. Since the slope component of the hidden parameters reflects the frequency, we can expect the spectrum to shift accordingly.

Topologist's Sinusoidal Curve.  We prepared this dataset to further examine the effect of frequency; its sample size differs from the common setting. Compared to the ordinary sinusoidal curve, it contains an infinitely wide range of frequencies.

Gaussian Kernel.  A Gaussian bump placed at three different locations. We prepared these datasets to examine the effect of a change in location. Since the intercept component of the hidden parameters reflects the location, we can expect the spectrum to shift accordingly.

Square Wave.  We prepared this dataset to examine the effect of discontinuity. By the locality of the ridgelet transform, we can expect that the effect of each discontinuity is also localized.

3.2 Scatter Plots of BP Trained Parameters

Given a dataset, we repeatedly trained neural networks. The training was conducted by minimizing the empirical mean squared error. After training, we collected the resulting sets of parameters and plotted the hidden parameters in a scatter plot, with the corresponding output parameters visualized in color.

We adjusted the number of hidden units in advance according to the dataset. Otherwise, the plots become noisy, typically because some initial parameters are never moved during training. If the number of hidden units is too small, the network underfits and all the parameters become nothing more than noise in the plot; when a network underfits, some output parameters become extremely large. On the other hand, if the number of hidden units is too large, a large majority of hidden parameters remain close to their initial values, which again appear as noise in the plot; when a parameter is left nearly untrained, its output parameter becomes extremely small. We can therefore judge whether a unit is noise by checking whether its output parameter is either extremely large or extremely small, and for the sake of visualization we removed such noisy parameters.
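To make this procedure concrete, the following is a minimal sketch in Python, assuming PyTorch and matplotlib are available. The dataset, network width, number of trials, and hyperparameters are illustrative choices of ours, not the paper's exact settings, and the noise-filtering step described above is omitted.

```python
# Train many small 1-d tanh networks on a noisy dataset and scatter the hidden
# parameters (a, b), colored by the output weight c. Illustrative sketch only.
import numpy as np
import torch
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(200, 1)).astype(np.float32)                  # inputs
y = (np.sin(np.pi * x) + 0.1 * rng.normal(size=x.shape)).astype(np.float32)   # noisy responses
X, Y = torch.from_numpy(x), torch.from_numpy(y)

A, B, C = [], [], []                       # trained hidden/output parameters
for trial in range(30):                    # repeatedly train small networks
    net = torch.nn.Sequential(
        torch.nn.Linear(1, 20), torch.nn.Tanh(), torch.nn.Linear(20, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for step in range(1500):               # minimize the empirical mean squared error
        opt.zero_grad()
        loss = torch.mean((net(X) - Y) ** 2)
        loss.backward()
        opt.step()
    a = net[0].weight.detach().numpy().ravel()   # hidden slopes a_j
    b = -net[0].bias.detach().numpy().ravel()    # hidden intercepts, convention a*x - b
    c = net[2].weight.detach().numpy().ravel()   # output weights c_j
    A.append(a); B.append(b); C.append(c)

A, B, C = map(np.concatenate, (A, B, C))
plt.scatter(A, B, c=C, s=4, cmap="coolwarm")     # (a, b)-plane, colored by c
plt.xlabel("a"); plt.ylabel("b"); plt.colorbar(label="c")
plt.show()
```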

We examined the following settings. See supplementary materials for all the results.

Activation Function.  tanh and ReLU.

Optimization Method.  LBFGS and ADAM.

3.3 Numerical Integration of Ridgelet Spectrum

We employed the classical definition of the ridgelet transform given in (7). As admissible ridgelet functions, we employed one function paired with the tanh activation and another paired with ReLU, constructed with the help of the Dawson function. See Sonoda and Murata (2017) for more details on the construction of other admissible functions.

Given the dataset, we conducted a simple Monte Carlo integration at every grid point:

(11)

up to a normalizing constant (which is well defined because the inputs are uniformly distributed). For simplicity, we omitted the normalizing constant and simply rescaled the spectrum for visualization. We remark that more sophisticated methods for the numerical computation of the ridgelet transform have been developed; see Do and Vetterli (2003) and Sonoda and Murata (2014), for example.
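The Monte Carlo approximation (11) can likewise be sketched in Python as below. The grid range, the ridgelet function psi (a Mexican-hat-type function), and the rescaling are illustrative assumptions of ours; whether a given psi is admissible for a given activation must be checked against condition (6).

```python
# Approximate the classic ridgelet spectrum on a grid of (a, b) by Monte Carlo
# over the dataset: R_psi f(a, b) ~ (1/n) * sum_i y_i * psi(a * x_i - b),
# up to a normalizing constant. Illustrative sketch only.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=x.size)

def psi(t):
    # Real-valued oscillating, rapidly decaying function (2nd Gaussian derivative).
    # Since psi is real, the complex conjugate in (7) is omitted.
    return (1.0 - t ** 2) * np.exp(-t ** 2 / 2.0)

a_grid = np.linspace(-30.0, 30.0, 201)
b_grid = np.linspace(-30.0, 30.0, 201)
A, B = np.meshgrid(a_grid, b_grid)

# Monte Carlo estimate of the integral at every grid point (a, b)
spectrum = np.zeros_like(A)
for xi, yi in zip(x, y):
    spectrum += yi * psi(A * xi - B)
spectrum /= x.size                         # normalizing constant omitted, as in the text
spectrum /= np.max(np.abs(spectrum))       # rescale for visualization

plt.pcolormesh(A, B, spectrum, cmap="coolwarm")
plt.xlabel("a"); plt.ylabel("b"); plt.colorbar(label="normalized ridgelet spectrum")
plt.show()
```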

3.4 Experimental Results

Figure 3 presents the experimental results for a fixed activation function with the ADAM optimizer. As mentioned in § 2.4, there are infinitely many different ridgelet transforms, and the presented spectra are calculated with just one particular ridgelet function. Nevertheless, we can find a visual resemblance in all the cases (a–j).

In (a–c), the three spectra are also similar to each other; as expected, the effect of noise has been canceled, while the three scatter plots get blurred as the noise level increases. In (d), the spectrum presents a single realization of the random field. In the scatter plot, parameters accumulate along a line; this is because when a hidden parameter lies on this line, the corresponding basis function tends to be a constant function over the data domain, as in the fitting example depicted at the top. In (e–f), the scatter plots are noisy because the training easily fails, as shown in the training example; we can nevertheless find some sharp peaks and troughs. In (g–i), we can observe that the location in the real domain is encoded as the angle in the spectrum. In (j), the parameters accumulate along a few sharp lines, which encode the locations of the discontinuities in the real domain.

4 Theory

We prove the main theorem in this section; all the proofs are given in Appendix B. We fix a bounded, locally integrable activation function and a probability measure on the input space. We remark that the boundedness is not critical but is assumed just for simplicity; we can treat unbounded activation functions with the small modifications employed in Sonoda and Murata (2017). We define a positive definite kernel on the hidden-parameter space by

(12)

We consider base measures compatible with this kernel; for example, a Dirac measure supported at a single parameter point is contained in this class.
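A hedged reconstruction of the kernels in (12) and, below, in (13), assuming the operator S of § 2.3, the activation eta, the data distribution P, and the base measure mu (the conjugation conventions are assumptions of ours):

\[
  k\big((a,b),(a',b')\big) \;=\; \mathbb{E}_{x\sim P}\Big[\eta(a\cdot x - b)\,\overline{\eta(a'\cdot x - b')}\Big],
  \qquad
  K(x,x') \;=\; \int \eta(a\cdot x - b)\,\overline{\eta(a\cdot x' - b)}\,\mathrm{d}\mu(a,b).
\]

Under these conventions, the adjoint would act as \( S^{*}h(a,b) = \int \overline{\eta(a\cdot x - b)}\,h(x)\,\mathrm{d}P(x) \), which is consistent with the reading of Proposition 4.5 below that the adjoint is a ridgelet transform.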

4.1 Integral Representation of Neural Network

Here, we investigate the properties of the integral representation operator.

Proposition 4.1.

The integral representation operator is a bounded linear operator; more precisely, for any coefficient function, the norm of its image is bounded by an explicit constant times the norm of the coefficient function (see Appendix B.1).

Let a positive definite kernel on the input space be defined by

(13)

and let the reproducing kernel Hilbert space (RKHS) associated with it be given. Then we have

Proposition 4.2.

The image of the integral representation operator is the RKHS associated with the kernel defined in (13).

Proposition 4.2 means that the representation ability of the integral representation is described by the RKHS. If the kernel is universal (cc-universal in the sense of Sriperumbudur et al. (2010)), the RKHS can approximate any continuous function on any compact set, which is precisely the universal approximation property of neural networks for an arbitrary probability distribution of the data. In fact, under mild conditions, we can obtain a universal kernel as follows:

Theorem 4.3.

Let a (large) positive number be fixed, and assume that the activation function is a non-constant periodic function with that period. We also impose one of two additional conditions on the activation and the base measure (stated precisely in Appendix B.3). Then the activation induces a universal kernel.

Note that in real-world problems, we can use a periodic function with a sufficiently large period as an alternative activation function. Thus, the periodicity condition on the activation is not harmful; in particular, we can deal with ReLU (by replacing it with a periodic function that agrees with ReLU on a large bounded region).

4.2 Ridgelet Transform

Here, we introduce a modified version of the ridgelet transform, which attains the global minimum of the BP training problem.

Given the data distribution and the base measure fixed above, we define the ridgelet transform with respect to them as a linear operator by

(14)

We note that this definition includes the classic definition (7) as a special case.

Proposition 4.4.

The ridgelet transform is a bounded linear operator; more precisely, its operator norm admits an explicit bound (see Appendix B.4).

The adjoint of the integral representation operator is described by the ridgelet transform:

Proposition 4.5.

The adjoint of the integral representation operator equals a ridgelet transform of the form (14), with a specific ridgelet function determined by the activation.

Let the integral transform associated with the kernel defined above be given, namely,

(15)

Then we have the following proposition:

Proposition 4.6.

The integral transform (15) is expressed as a composition of the operators introduced above (see Appendix B.6 for the precise identity).

4.3 Main Result

For a given target function and regularization weight, let the risk function for the integral representation of neural networks be defined as in (8). Then we have the following theorem:

Theorem 4.7.

For a given target function and a positive regularization weight, there exists a coefficient function that attains the unique minimum of the minimization problem of the risk. Moreover, the minimizer satisfies

(16)

where the operators are those defined above (see Appendix B.7).

In the classic case, the solution is given in terms of the classic ridgelet transform with any admissible ridgelet function, which results in a shrink version of the classic ridgelet transform. See Appendix B.8 for a sketch of the proof.

5 Conclusion

We have shown that the global minimizer of the BP training problem is given by the ridgelet transform. In order to treat the scatter plots of hidden parameters, such as Figure 1(a), we introduced the integral representation of neural networks and reformulated the BP training problem in the Hilbert space of coefficient functions. As a result, the BP training problem was reduced to a quadratic program, without sacrificing generality in the choice of activation function. Finally, we discovered a modified version of the ridgelet transform that attains the global minimum; in the classic setting, the modified transform simplifies to a shrink ridgelet transform. By virtue of functional analysis, our formulation is independent of the parameterization of neural networks, which should contribute to a geometric understanding of neural networks. Extensions to general risk functions and deep networks, as well as applications to the analysis of local minima, are important directions for future work.

Appendix A Optimization Problem in Hilbert spaces

Let two Hilbert spaces be given, each endowed with its inner product, and let a densely defined closed linear operator between them be given.

For a given element of the target space, we seek an element of the domain minimizing the sum of the squared residual and a weighted squared norm (the Tikhonov-regularized least squares problem). For this problem, we have the following.

Proposition A.1.

Let the regularization weight be positive. Then, for every element of the target space, the objective functional attains its minimum at an explicit element expressed in terms of the adjoint operator of the given operator.

Proof.

A direct computation (completing the square in the Hilbert space) verifies the claimed identity; therefore, the objective functional attains its minimum at the stated element. ∎
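For the reader's convenience, the standard Tikhonov-regularization fact that this appendix appears to invoke reads as follows (a sketch for a bounded operator A between Hilbert spaces; the symbols A, f, u, lambda are ours, and the densely defined closed case is analogous on the domain of A). The problem

\[
  \min_{u\in H_1}\; \|Au - f\|_{H_2}^{2} + \lambda\,\|u\|_{H_1}^{2}, \qquad \lambda>0,
\]

has the unique minimizer

\[
  u^{\star} \;=\; (A^{*}A + \lambda\,\mathrm{Id})^{-1}A^{*}f,
\]

which follows by completing the square (real case shown; the complex case is analogous):

\[
  \|Au-f\|^{2}+\lambda\|u\|^{2}
  \;=\; \big\langle (A^{*}A+\lambda)(u-u^{\star}),\, u-u^{\star}\big\rangle
  \;+\; \|f\|^{2} \;-\; \big\langle (A^{*}A+\lambda)\,u^{\star},\, u^{\star}\big\rangle.
\]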

Appendix B Proofs

B.1 Proposition 4.1

Proof.

Let . Then we have

Here, in the last line, we use the Cauchy-Schwarz inequality. ∎

B.2 Proposition 4.2

Proof.

Let . Let be the closure of the linear subspace generated by . By definition, we have and . Thus, induces an isomorphism between Hilbert spaces and ; in particular, the image of is . ∎

B.3 Theorem 4.3

Proof.

Since the activation is a periodic function, it has a Fourier series expansion. A direct computation then shows that, under either of the two assumed conditions, the support of the resulting spectral function covers the whole frequency domain. Therefore, the kernel is universal (see Section 3.2 of Sriperumbudur et al. (2010)). ∎

B.4 Proposition 4.4

Proof.

Let . Then we have

In the last line, we use the Cauchy–Schwarz inequality twice. ∎

B.5 Proposition 4.5

Proof.

It suffices to prove that . By straightforward computation, we have

Thus we have . ∎

B.6 Proposition 4.6

Proof.

By Proposition 4.5, a direct computation yields the claimed identity. ∎

B.7 Theorem 4.7

Proof.

Since the relevant operator is an isomorphism, we may define the auxiliary element accordingly. By Proposition A.1, the minimizer is explicitly given by the Tikhonov formula. By Propositions 4.5 and 4.6, it then suffices to verify one remaining identity, which follows from a simple computation. ∎

B.8 Remark on Theorem 4.7

Let an arbitrary admissible ridgelet function be given. Then a chain of equalities holds, the last of which follows from the admissibility condition (6). Thus, the stated expression solves the equation characterizing the minimizer.

Acknowledgements

This work was supported by JSPS KAKENHI 18K18113.

References

  • Bach (2017a) Bach, F. (2017a). Breaking the Curse of Dimensionality with Convex Neural Networks. Journal of Machine Learning Research, 18(19):1–53.
  • Bach (2017b) Bach, F. (2017b). On the Equivalence between Kernel Quadrature Rules and Random Feature Expansions. Journal of Machine Learning Research, 18(21):1–38.
  • Barron (1993) Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945.
  • Brutzkus and Globerson (2017) Brutzkus, A. and Globerson, A. (2017). Globally Optimal Gradient Descent for a ConvNet with Gaussian Inputs. In ICML, pages 605–614.
  • Candès (1998) Candès, E. J. (1998). Ridgelets: theory and applications. PhD thesis, Stanford University.
  • Choromanska et al. (2015) Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2015). The Loss Surfaces of Multilayer Networks. In AISTATS, pages 192–204.
  • Dauphin et al. (2014) Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In NIPS, pages 2933–2941.
  • Do and Vetterli (2003) Do, M. N. and Vetterli, M. (2003). The finite ridgelet transform for image representation. IEEE Transactions on Image Processing, 12(1):16–28.
  • Draxler et al. (2018) Draxler, F., Veschgini, K., Salmhofer, M., and Hamprecht, F. A. (2018). Essentially No Barriers in Neural Network Energy Landscape. In ICML, pages 1309–1318.
  • Du and Lee (2018) Du, S. S. and Lee, J. D. (2018). On the Power of Over-parametrization in Neural Networks with Quadratic Activation. In ICML, pages 1329–1338.
  • Garipov et al. (2018) Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D., and Wilson, A. G. (2018). Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs. In NIPS, pages 8803–8812.
  • Ge et al. (2018) Ge, R., Lee, J. D., and Ma, T. (2018). Learning One-hidden-layer Neural Networks with Landscape Design. In ICLR, pages 1–37.
  • Hardt and Ma (2017) Hardt, M. and Ma, T. (2017). Identity Matters in Deep Learning. In ICLR, pages 1–14.
  • Kawaguchi (2016) Kawaguchi, K. (2016). Deep Learning without Poor Local Minima. In NIPS, pages 586–594.
  • Kůrková (2012) Kůrková, V. (2012). Complexity estimates based on integral transforms induced by computational units. Neural Networks, 33:160–167.
  • Li and Yuan (2017) Li, Y. and Yuan, Y. (2017). Convergence Analysis of Two-layer Neural Networks with ReLU Activation. In NIPS, pages 597–607.
  • Murata (1996) Murata, N. (1996). An integral representation of functions using three-layered networks and their approximation bounds. Neural Networks, 9(6):947–956.
  • Nguyen and Hein (2017) Nguyen, Q. and Hein, M. (2017). The Loss Surface of Deep and Wide Neural Networks. In ICML, pages 2603–2612.
  • Redmon et al. (2016) Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016). You Only Look Once: Unified, Real-Time Object Detection. In CVPR, pages 779–788.
  • Silver et al. (2017) Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., and Hassabis, D. (2017). Mastering the game of Go without human knowledge. Nature, 550:354–359.
  • Soltanolkotabi (2017) Soltanolkotabi, M. (2017). Learning ReLUs via Gradient Descent. In NIPS, pages 2007–2017.
  • Sonoda and Murata (2014) Sonoda, S. and Murata, N. (2014). Sampling hidden parameters from oracle distribution. In ICANN, pages 539–546.
  • Sonoda and Murata (2017) Sonoda, S. and Murata, N. (2017). Neural network with unbounded activation functions is universal approximator. Applied and Computational Harmonic Analysis, 43(2):233–268.
  • Sonoda and Murata (2018) Sonoda, S. and Murata, N. (2018). Transport Analysis of Infinitely Deep Neural Network. Journal of Machine Learning Research, 19:1–52.
  • Soudry and Hoffer (2018) Soudry, D. and Hoffer, E. (2018). Exponentially vanishing sub-optimal local minima in multilayer neural networks. In ICLR, pages 1–35.
  • Sriperumbudur et al. (2010) Sriperumbudur, B. K., Fukumizu, K., and Lanckriet, G. R. G. (2010). Universality, Characteristic Kernels and RKHS Embedding of Measures. Journal of Machine Learning Research, 12(Jul):2389–2410.
  • Starck et al. (2010) Starck, J.-L., Murtagh, F., and Fadili, J. M. (2010). The ridgelet and curvelet transforms. In Sparse Image and Signal Processing: Wavelets, Curvelets, Morphological Diversity, pages 89–118. Cambridge University Press.
  • Suzuki (2018) Suzuki, T. (2018). Fast generalization error bound of deep learning from a kernel perspective. In AISTATS, pages 1397–1406.
  • van den Oord et al. (2016) van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A Generative Model for Raw Audio.
  • Zhong et al. (2017) Zhong, K., Song, Z., Jain, P., Bartlett, P. L., and Dhillon, I. S. (2017). Recovery Guarantees for One-hidden-layer Neural Networks. In ICML, pages 4140–4149.


Appendix C Further Examples

We present all the experimental results described in § 3. The results are presented as sets of subfigures. Each set corresponds to one of the datasets and contains a scatter plot of the dataset, four scatter plots of trained parameters, and two ridgelet spectra.

On the whole, the scatter plots with ADAM are sharper than those with BFGS, and those with ReLU are sharper than those with tanh. These results suggest an implicit regularization property of both ADAM and ReLU. The scatter plots with ADAM tend to be sharp because fewer parameters can remain untrained across the variety of mini-batches. The scatter plots with ReLU tend to be sharp because ReLU is positively homogeneous, so two hidden parameters that differ only by a positive rescaling indicate the same basis function up to a rescaling of the output weight.

Each of Figures 4–13 follows the same panel layout: (a) dataset; (b) tanh, BFGS; (c) tanh, ADAM; (d) ridgelet spectrum for the tanh case; (e) ReLU, BFGS; (f) ReLU, ADAM; (g) ridgelet spectrum for the ReLU case.

Figure 4: Sinusoidal Curve
Figure 5: Sinusoidal Curve with Gaussian Noise
Figure 6: Sinusoidal Curve with Gaussian Noise
Figure 7: Gaussian Noise
Figure 8: High Frequency Sinusoidal Curve
Figure 9: Topologist's Sinusoidal Curve
Figure 10: Gaussian Kernel
Figure 11: Gaussian Kernel
Figure 12: Gaussian Kernel
Figure 13: Square Wave