# Numerical Integration Method for Training Neural Network

We propose a new numerical integration method for training a shallow neural network by using the ridgelet transform, with a fast convergence guarantee. Given a training dataset, the ridgelet transform provides the parameters of the neural network that attains the global optimum of the training problem. In other words, we can obtain the global minimizer of the training problem by numerically computing the ridgelet transform, instead of numerically optimizing the training problem by backpropagation. We employ kernel quadrature as the basis of the numerical integration, because it is known to converge faster, i.e. at O(1/p) in the number p of hidden units, than random methods such as Monte Carlo integration, which converge at O(1/√p). Kernel quadrature was originally developed for computing posterior means, where the measure is assumed to be a probability measure and the final product is a single number. In contrast, our problem is the computation of an integral transform, where the measure is generally a signed measure and the final product is a function. In addition, the performance of kernel quadrature is sensitive to the choice of kernel. In this paper, we develop a generalized kernel quadrature method that is applicable to signed measures and has a fast convergence guarantee in a function norm, and we propose a natural choice of kernels.

## 1 Introduction

Training a neural network by backpropagation (BP) amounts to minimizing a non-convex loss function with many local minima. For the least-squares regression problem with shallow neural networks, Sonoda et al. (2018) proved that the global minimizer is given by the ridgelet transform (Murata, 1996; Candès, 1998). The ridgelet transform is an integral transform that provides the parameters of a neural network from the training dataset. In other words, by computing the ridgelet transform from the dataset, we can obtain the global minimizer instead of solving the non-convex minimization problem. In this study, we investigate a training method for neural networks based on numerical integration.

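The integral representation $S[\gamma]$ of a neural network with activation function $\sigma$ and coefficient function $\gamma$ is given by

$$S[\gamma](x) := \int_{\mathbb{R}^m \times \mathbb{R}} \gamma(a,b)\,\sigma(a \cdot x - b)\,\mathrm{d}a\,\mathrm{d}b. \tag{1}$$

In comparison with an ordinary shallow neural network $\sum_{j=1}^{p} w_j\,\sigma(a_j \cdot x - b_j)$, the indices $j$ are replaced with integration variables $(a,b)$, the hidden parameters $(a_j, b_j)$ are integrated out, and the output parameters $w_j$ remain as the only trainable parameter $\gamma(a,b)$. We remark that the integral representation with a sum of point masses $\gamma_p := \sum_{j=1}^{p} w_j\,\delta_{(a_j,b_j)}$ corresponds to the ordinary network.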

Historically, the ridgelet transform

$$R[f;\rho](a,b) := \int_{\mathbb{R}^m} f(x)\,\overline{\rho(a \cdot x - b)}\,\mathrm{d}x \tag{2}$$

with an admissible function $\rho$, where the definition of admissibility is explained later, was discovered and developed as the right inverse operator of the integral representation operator $S$. Namely, given a function $f$, consider the integral equation $S[\gamma] = f$ of the unknown function $\gamma$. The ridgelet transform $R[f;\rho]$ of $f$ is a particular solution to the equation, i.e. $S[R[f;\rho]] = f$. In general, the solution to $S[\gamma] = f$ is not unique. Thus, $R$ is only a right inverse, i.e. $SR = I$. In other words, $R$ does not satisfy $RS = I$ in general.

Computing the ridgelet transform amounts to the numerical integration of an integral operator. Several methods were proposed for the numerical computation of the ridgelet transform. Murata (1996) and Sonoda and Murata (2014) proposed random coding methods based on the Monte Carlo integration. Candès (1998), Donoho (2001) and Do and Vetterli (2003) proposed a series of discrete ridgelet transforms, in parallel with the discrete wavelet transform. We note that the ridgelet transform has a close relation to the wavelet transform (and the Radon transform). See Starck et al. (2010) for example.

In general, the numerical integration of an integral transform of a given function with respect to an integral kernel amounts to finding a sequence of finite sums, with weights $w_j$ and sigma points $\theta_j$ ($j = 1, \dots, p$), that approximates the integral transform as $p \to \infty$ (in a certain sense of convergence). In the random coding, the sigma points are chosen randomly according to the probability distribution that is proportional to the spectrum of the ridgelet transform. As is often the case, such a Monte Carlo method converges as slowly as $O(1/\sqrt{p})$. In the discrete transform, on the other hand, the sigma points are chosen regularly according to some grid, which is often unrealistic in the present setting of machine learning. We remark that the discrete ridgelet transform has been developed in image processing. Therefore, irregular grids and random samples have received less attention.
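The slow Monte Carlo rate can be observed on a toy integral. The following is a minimal sketch and not part of the original experiments: the integrand `u**2` on the unit interval, the number of repetitions, and the function name `mc_rms_error` are assumptions chosen only for illustration.

```python
import numpy as np

def mc_rms_error(p, repeats=200, seed=0):
    """RMS error of a p-sample Monte Carlo estimate of the integral of u^2 over [0, 1]."""
    rng = np.random.default_rng(seed)
    exact = 1.0 / 3.0
    # Each row is one independent Monte Carlo run with p uniform samples.
    samples = rng.uniform(0.0, 1.0, size=(repeats, p))
    estimates = np.mean(samples ** 2, axis=1)
    return float(np.sqrt(np.mean((estimates - exact) ** 2)))

if __name__ == "__main__":
    for p in (100, 10_000):
        print(p, mc_rms_error(p))
```

Increasing `p` by a factor of 100 should shrink the error by roughly a factor of 10, which is the $O(1/\sqrt{p})$ behaviour that kernel quadrature aims to improve on.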

Kernel quadrature (O’Hagan, 1991; Rasmussen and Ghahramani, 2003; Welling, 2009; Chen et al., 2010; Huszár and Duvenaud, 2012; Bach et al., 2012; Bach, 2017; Briol et al., 2015) is an emerging technique for numerical integration. Kernel quadrature is a version of the Monte Carlo method. However, it converges faster than ordinary Monte Carlo methods by choosing each sigma point depending on all the past samples. The fast convergence arises from the equivalence between kernel quadrature and the conditional gradient method. We remark that it has been studied for the purpose of computing posterior means. Therefore, the measure has been assumed to be positive and finite.

In this study, we develop a generalized kernel quadrature method for computing the reconstruction formula

$$S[R[f;\rho]] = f \tag{3}$$

from a given dataset with a fast convergence guarantee, where $R[f;\rho]$ denotes the ridgelet transform of $f$ with respect to $\rho$. Moreover, we propose a natural choice of kernels for use in the generalized kernel quadrature.

## 2 Preliminaries

In this section, we introduce some notation, formulate the problem, and describe the key concepts.

### Mathematical Notation

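For the sake of simplicity and generality, we write the hidden parameters as $\theta$, instead of $(a,b)$; and write the integral transform and the finite sum as

$$S[\gamma](x) := \int_\Theta \sigma(x;\theta)\,\mathrm{d}\gamma(\theta), \quad \text{and} \quad S[\gamma_p](x) := \sum_{j=1}^{p} w_j\,\sigma(x;\theta_j), \tag{4}$$

respectively. Here, $\gamma$ is a signed measure on the space $\Theta$ of parameters, and $\gamma_p := \sum_{j=1}^{p} w_j\,\delta_{\theta_j}$ is a weighted sum of Dirac measures on $\Theta$. In the following, the term ‘measure $\gamma$’ includes singular measures such as $\gamma_p$.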
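The finite sum in (4) can be evaluated for a whole batch of inputs with one matrix product. The sketch below assumes the common parameterization $\theta = (a, b)$ with $\sigma(x;\theta) = \sigma(a \cdot x - b)$ and uses `tanh` as a placeholder activation; the function name and argument layout are illustrative choices, not taken from the paper.

```python
import numpy as np

def finite_sum_network(x, a, b, w, activation=np.tanh):
    """Evaluate S[gamma_p](x) = sum_j w_j * sigma(a_j . x - b_j) for a batch of inputs.

    x: (n, m) inputs, a: (p, m) hidden weights, b: (p,) hidden biases, w: (p,) output weights.
    """
    hidden = activation(x @ a.T - b)  # shape (n, p); column j holds sigma(x_i; theta_j)
    return hidden @ w                 # shape (n,)
```

Keeping the hidden matrix separate from the weights is convenient here, because the sigma points and the weights are chosen in separate steps.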

A hat, as in $\hat{f}$, denotes the Fourier transform of a function $f$. An overline, as in $\overline{z}$, denotes the complex conjugate of a complex number $z$. We also use standard notation for the uniform distribution on an interval and for the normal distribution with a given mean and variance.

### 2.1 Problem Formulation

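Given a dataset $\{(x_i, y_i)\}_{i=1}^{n}$ of $n$ examples, where $x_i$ is distributed according to a probability distribution $\mu$ with a density, and $y_i = f(x_i) + \varepsilon_i$ with an unknown function $f$ and iid noise $\varepsilon_i$, our goal is to estimate the integral representation $S[\gamma]$ with $\gamma = R[f;\rho]$, the ridgelet transform of $f$, which equals $f$, by using a finite sum $S[\gamma_p]$.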

### 2.2 Integral Representation Theory

Following Sonoda and Murata (2017), we assume that the activation function $\sigma$ is an arbitrary element of the tempered distributions ($\mathcal{S}'(\mathbb{R})$), i.e. the strong dual space of Schwartz functions. Examples of $\sigma$ include the rectified linear unit (ReLU), the hyperbolic tangent ($\tanh$), and the Gaussian kernel.

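For a fixed $\sigma$, a Schwartz function $\rho$ is said to be admissible when it satisfies the admissibility condition

$$\int_{\mathbb{R} \setminus \{0\}} \hat{\sigma}(\zeta)\,\overline{\hat{\rho}(\zeta)}\,|\zeta|^{-m}\,\mathrm{d}\zeta = 1, \tag{5}$$

where $m$ is the dimension of the input vectors $x$, i.e. of the domain $\mathbb{R}^m$ of $f$. When $\sigma$ and $\rho$ are admissible, the reconstruction formula $S[R[f;\rho]] = f$ holds for any $f$ in an appropriate function class (Sonoda and Murata, 2017, Theorem 5.11).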

### 2.3 Construction of Admissible Functions

A simple way to construct an admissible is to find a Schwartz function that satisfies a subcondition . Then,

:odd

) or satisfies the admissibility condition.

For example, if , then (Gel’fand and Shilov, 1964, § 9.3). Thus, satisfies the subcondition. See Sonoda and Murata (2017, § 6) for more examples.

### 2.4 Kernel Mean Embedding and Maximum Mean Discrepancy

We define the kernel mean embedding (KME) and maximum mean discrepancy (MMD) for signed measures. We refer to Muandet et al. (2017) for original definitions for probability measures. See § 4 for the fundamental properties such as the well-definedness.

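Let $k$ be a symmetric positive definite kernel on the space $\Theta$ of parameters. According to the Moore-Aronszajn theorem, $k$ is associated with the reproducing kernel Hilbert space (RKHS) $H$ equipped with the inner product $\langle \cdot, \cdot \rangle_k$ and the induced norm $\|\cdot\|_k$. We write $k_\theta := k(\cdot, \theta)$, which satisfies the reproducing property

$$h(\theta) = \langle h, k_\theta \rangle_k, \quad \text{for every } h \in H \text{ and } \theta \in \Theta. \tag{6}$$

Let $\gamma$ be a signed measure on $\Theta$. We write the action of $\gamma$ on $h$ as $\gamma[h] := \int_\Theta h(\theta)\,\mathrm{d}\gamma(\theta)$. With this notation, we define the kernel mean embedding (KME) of $\gamma$ as

$$\gamma[k](\theta) := \int_\Theta k_{\theta'}(\theta)\,\mathrm{d}\gamma(\theta'), \quad \theta \in \Theta. \tag{7}$$

We call the image $\gamma[k]$ the mean element. Under appropriate conditions, the mean element is an element of $H$, and it has the reproducing property of the ‘expectation’

$$\gamma[h] = \langle h, \gamma[k] \rangle_k, \quad \text{for every } h \in H. \tag{8}$$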

See § 4 for the proof.

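We define the maximum mean discrepancy (MMD) between $\gamma$ and $\gamma'$ as

$$\mathrm{MMD}(\gamma, \gamma') := \sup_{\|h\|_k \le 1} \{\gamma[h] - \gamma'[h]\}. \tag{9}$$

By the reproducing property, an MMD equals the distance between the mean elements:

$$\mathrm{MMD}(\gamma, \gamma') = \sup_{\|h\|_k \le 1} \langle h, \gamma[k] - \gamma'[k] \rangle_k = \|\gamma[k] - \gamma'[k]\|_k. \tag{10}$$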

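In order to describe the ordinary kernel quadrature, we consider computing the expectation $\gamma[h]$ of a function $h \in H$ with respect to a probability measure $\gamma$. We use a weighted sum $\gamma_p$ of point masses for approximating $\gamma$.

In the kernel quadrature, we minimize $\mathrm{MMD}(\gamma, \gamma_p)$. Since $|\gamma[h] - \gamma_p[h]| \le \|h\|_k\,\mathrm{MMD}(\gamma, \gamma_p)$, if $\mathrm{MMD}(\gamma, \gamma_p) \le \varepsilon$ for some $\varepsilon > 0$, then the estimate $\gamma_p[h]$ also approximates the expectation $\gamma[h]$, i.e. $|\gamma[h] - \gamma_p[h]| \le \varepsilon\,\|h\|_k$. We note that the minimization of $\mathrm{MMD}(\gamma, \gamma_p)^2$ is a quadratic program, which implies that if the domain is compact and convex, then under appropriate assumptions, the optimization converges at $O(1/p)$ by using the conditional gradient method (Bach et al., 2012).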
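When both measures are finite weighted sums of point masses, the squared MMD reduces to three kernel-matrix quadratic forms, because the squared distance between mean elements expands into inner products of the form $k(\theta_i, \theta_j)$. The sketch below illustrates this special case only; the Gaussian kernel on scalar parameters, the bandwidth, and the function names are assumptions made for the example. Negative weights are allowed, so signed measures are covered.

```python
import numpy as np

def gaussian_kernel(t1, t2, bandwidth=1.0):
    """Gaussian kernel matrix between two sets of scalar parameters."""
    diff = np.asarray(t1, dtype=float)[:, None] - np.asarray(t2, dtype=float)[None, :]
    return np.exp(-0.5 * (diff / bandwidth) ** 2)

def mmd_squared(theta_a, w_a, theta_b, w_b, kernel=gaussian_kernel):
    """Squared MMD between sum_i w_a[i] delta_{theta_a[i]} and sum_j w_b[j] delta_{theta_b[j]}."""
    w_a = np.asarray(w_a, dtype=float)
    w_b = np.asarray(w_b, dtype=float)
    return float(
        w_a @ kernel(theta_a, theta_a) @ w_a
        - 2.0 * w_a @ kernel(theta_a, theta_b) @ w_b
        + w_b @ kernel(theta_b, theta_b) @ w_b
    )
```

For a continuous target measure, the first term is a constant that does not depend on the sigma points, and the cross term is expressed through the mean element instead of a kernel matrix.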

## 3 Method

We describe the generalized kernel quadrature method for computing the integral representation.

### 3.1 Generalized Kernel Quadrature for Integral Operators

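We write $\sigma_x := \sigma(x; \cdot)$ and assume that $\sigma_x \in H$ for any $x \in X$. Then, we can write $S[\gamma](x) = \gamma[\sigma_x]$, and we let $s := \sup_{x \in X} \|\sigma_x\|_k$, which we assume to be finite. By using the reproducing property and the Cauchy-Schwarz inequality,

$$\sup_{x \in X} |S[\gamma](x) - S[\gamma_p](x)| = \sup_{\sigma_x \in H} |\gamma[\sigma_x] - \gamma_p[\sigma_x]| = \sup_{\sigma_x \in H} |\langle \sigma_x, \gamma[k] - \gamma_p[k] \rangle_k| \le s \cdot \mathrm{MMD}(\gamma, \gamma_p). \tag{11}$$

Therefore, in the same way as the ordinary kernel quadrature, if we minimize $\mathrm{MMD}(\gamma, \gamma_p)$ by using the conditional gradient method, then we can approximate $S[\gamma]$ by $S[\gamma_p]$ at the rate $O(1/p)$ in the sense of uniform convergence. This is faster than $O(1/\sqrt{p})$, which is attained by an ordinary Monte Carlo integration method (Sonoda and Murata, 2014).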
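The last inequality in (11) combines two elementary steps. As a worked sketch for a fixed input $x$, writing $\sigma_x$ for the activation viewed as a function of the parameter and $s$ for a uniform bound on $\|\sigma_x\|_k$ (this notation is our reading of the surrounding text):

```latex
\begin{aligned}
\bigl|\langle \sigma_x,\ \gamma[k] - \gamma_p[k] \rangle_k\bigr|
  &\le \|\sigma_x\|_k \,\bigl\|\gamma[k] - \gamma_p[k]\bigr\|_k
  && \text{(Cauchy-Schwarz)} \\
  &= \|\sigma_x\|_k \,\mathrm{MMD}(\gamma, \gamma_p)
  && \text{(MMD is the distance between mean elements)} \\
  &\le s \cdot \mathrm{MMD}(\gamma, \gamma_p)
  && \text{(uniform bound on } \|\sigma_x\|_k \text{)}.
\end{aligned}
```

Since the right-hand side does not depend on $x$, taking the supremum over $x$ on the left gives the uniform bound.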

### 3.2 Greedy Minimization

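In the experiments, we employed the greedy minimization of $\mathrm{MMD}(\gamma, \gamma_p)$. In the initial iteration, we select $\theta_1$ drawn from some prior, and $w_1$ for the fixed $\theta_1$. In the $p$-th iteration, we select $(\theta_p, w_p)$ that maximizes the gap $\mathrm{MMD}(\gamma, \gamma_{p-1})^2 - \mathrm{MMD}(\gamma, \gamma_p)^2$. Namely, we select

$$(\theta_p, w_p) := \operatorname*{arg\,max}_{\theta \in \Theta,\ w \in \mathbb{R}} \Bigl\{ 2 w\,\gamma[k](\theta) - 2 \sum_{j=1}^{p-1} w_j w\,k(\theta_j, \theta) - w^2 k(\theta, \theta) \Bigr\}. \tag{12}$$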

Instead of running the gradient descent on (12), we employed a Monte Carlo search. Namely, at every iteration, we draw some samples from a proposal distribution on , and select one according to (12). We employed the normal distribution for the proposal distribution.
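One iteration of such a Monte Carlo search might look like the sketch below. It is not the author's implementation. The callables `mean_element` and `kernel`, the candidate list, and the function name are hypothetical. For each candidate, the weight is set to the closed-form maximizer of the quadratic in $w$ in (12), which is our own simplification and assumes a real-valued kernel.

```python
import numpy as np

def greedy_step(thetas, weights, mean_element, kernel, candidates):
    """Pick the candidate (theta, w) that maximizes the objective in (12).

    thetas, weights: sigma points and weights selected so far (possibly empty lists).
    mean_element(theta): value of the mean element gamma[k](theta).
    kernel(theta1, theta2): real-valued positive definite kernel.
    candidates: iterable of proposals for theta (e.g. draws from a normal distribution).
    """
    best_theta, best_w, best_gain = None, 0.0, -np.inf
    for theta in candidates:
        cross = sum(w_j * kernel(theta_j, theta) for theta_j, w_j in zip(thetas, weights))
        residual = mean_element(theta) - cross
        k_tt = kernel(theta, theta)
        if k_tt <= 0.0:
            continue  # skip degenerate candidates
        w = residual / k_tt                        # maximizer of the quadratic in w
        gain = 2.0 * w * residual - w ** 2 * k_tt  # value of (12) at (theta, w)
        if gain > best_gain:
            best_theta, best_w, best_gain = theta, w, gain
    return best_theta, best_w, best_gain
```

Calling this repeatedly, appending the returned pair to `thetas` and `weights` each time, gives a greedy sequence of sigma points. The returned gain is the decrease in squared MMD achieved by the step.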

### 3.3 Selection of Kernel

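We define the inner product of activations (IPA) kernel:

$$k_\sigma(\theta, \theta') := \int_{\mathbb{R}^m} \sigma(x;\theta)\,\overline{\sigma(x;\theta')}\,\mathrm{d}\mu(x) = \mu\bigl[\sigma(\cdot;\theta)\,\overline{\sigma(\cdot;\theta')}\bigr], \tag{13}$$

with activation function $\sigma$ and data distribution $\mu$. By construction, it is a reproducing kernel. The interpretation of $k_\sigma$ is simple: $k_\sigma(\theta, \theta')$ measures the similarity between two functions $\sigma(\cdot;\theta)$ and $\sigma(\cdot;\theta')$ in the sense of the $L^2(\mu)$ inner product. Furthermore, the empirical estimation is straightforward: $\hat{k}_\sigma(\theta, \theta') = \frac{1}{n} \sum_{i=1}^{n} \sigma(x_i;\theta)\,\overline{\sigma(x_i;\theta')}$.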
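The empirical IPA kernel of (13) is simply a Gram matrix of hidden-unit outputs over the data. A minimal sketch, assuming the parameterization $\sigma(x;\theta) = \sigma(a \cdot x - b)$ and `tanh` as a placeholder activation (both are illustrative assumptions, as is the function name):

```python
import numpy as np

def ipa_gram(x, a, b, activation=np.tanh):
    """Empirical IPA kernel matrix.

    K[j, k] = (1/n) * sum_i sigma(x_i; theta_j) * conj(sigma(x_i; theta_k)),
    with x: (n, m) inputs, a: (p, m) hidden weights, b: (p,) hidden biases.
    """
    hidden = activation(x @ a.T - b)  # shape (n, p)
    return hidden.T @ hidden.conj() / x.shape[0]
```

The matrix is positive semidefinite by construction, which matches the claim that the IPA kernel is a reproducing kernel.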

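For $\gamma = R[f;\rho]$, i.e. $\mathrm{d}\gamma(\theta') = \bigl[\int_{\mathbb{R}^m} f(x)\,\overline{\rho(x;\theta')}\,\mathrm{d}x\bigr]\,\mathrm{d}\theta'$, the mean element is calculated as below:

$$\begin{aligned}
\gamma[k_\sigma](\theta)
&= \int_\Theta \left[\int_{\mathbb{R}^m} f(x)\,\overline{\rho(x;\theta')}\,\mathrm{d}x\right] \left[\int_{\mathbb{R}^m} \overline{\sigma(x';\theta)}\,\sigma(x';\theta')\,\mathrm{d}\mu(x')\right] \mathrm{d}\theta' \\
&= \int_{\mathbb{R}^m \times \mathbb{R}^m} f(x)\,\overline{\sigma(x';\theta)} \left[\int_\Theta \overline{\rho(x;\theta')}\,\sigma(x';\theta')\,\mathrm{d}\theta'\right] \mathrm{d}x\,\mathrm{d}\mu(x') \\
&= \int_{\mathbb{R}^m \times \mathbb{R}^m} f(x)\,\overline{\sigma(x';\theta)}\,\delta(x - x')\,\mathrm{d}x\,\mathrm{d}\mu(x') \\
&= \int_{\mathbb{R}^m} f(x)\,\overline{\sigma(x;\theta)}\,\mathrm{d}\mu(x),
\end{aligned} \tag{14}$$

where we used the admissibility condition in the form $\int_\Theta \overline{\rho(x;\theta')}\,\sigma(x';\theta')\,\mathrm{d}\theta' = \delta(x - x')$. To our surprise, $\rho$ has vanished, which implies that we do not need to select $\rho$ for calculating the MMD. Namely,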

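$$\widehat{\mathrm{MMD}}(\gamma, \gamma_p)^2 = (\text{const.}) - 2 \sum_{j=1}^{p} \sum_{i=1}^{n} w_j\,y_i\,\overline{\sigma(x_i;\theta_j)} + \sum_{j,k=1}^{p} \sum_{i=1}^{n} w_j w_k\,\sigma(x_i;\theta_j)\,\overline{\sigma(x_i;\theta_k)}. \tag{15}$$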
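The non-constant part of (15) needs only the matrix of hidden-unit outputs on the data. The sketch below assumes real-valued activations, so the conjugates drop out; the function name and the argument `hidden` (with `hidden[i, j]` holding $\sigma(x_i;\theta_j)$) are illustrative choices.

```python
import numpy as np

def empirical_mmd_objective(hidden, y, w):
    """Evaluate (15) without its constant term, for real-valued activations.

    hidden: (n, p) matrix with hidden[i, j] = sigma(x_i; theta_j).
    y: (n,) observed outputs.
    w: (p,) weights.
    """
    cross = -2.0 * w @ (hidden.T @ y)   # -2 sum_j sum_i w_j y_i sigma(x_i; theta_j)
    quad = w @ (hidden.T @ hidden) @ w  # sum_{j,k} sum_i w_j w_k sigma(x_i; theta_j) sigma(x_i; theta_k)
    return float(cross + quad)
```

In this real-valued case, completing the square shows that adding $\sum_i y_i^2$ to the returned value gives the residual sum of squares $\sum_i (y_i - S[\gamma_p](x_i))^2$, so minimizing this objective is a least-squares fit on the training data.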

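We note that the mean element results in another ridgelet transform, namely that of $f$ with respect to $\sigma$ and $\mu$, which corresponds to the adjoint operator $S^*$ of $S$. Namely, $S^*$ satisfies $\langle S[\gamma], f \rangle_{L^2(\mu)} = \langle \gamma, S^*[f] \rangle_{L^2(\Theta)}$ for any $\gamma$ and $f$ for which both sides are well-defined.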

## 4 Fundamental Properties of KME for General Measures

We investigate the fundamental properties of the KME for signed measures and vector-valued measures.

### 4.1 Well-Definedness

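Let $B$ be a Banach space, and let $\mu$ be a $B$-valued vector measure on a measurable space $\Omega$. $|\mu|$ denotes the total variation of $\mu$, and $\mu[f]$ denotes the integral of a measurable function $f$ on $\Omega$ with respect to $\mu$. We have the following inequality.

$$|\mu[f]| \le |\mu|[|f|]. \tag{16}$$

We fix a measurable positive definite kernel $k$ on $\Omega$, and write $k_X := k(X, \cdot)$. We consider the collection of finite $B$-valued vector measures $\mu$ on $\Omega$ that satisfy

$$|\mu|\bigl[\sqrt{k(X,X)}\bigr] = |\mu|\bigl[\|k_X\|_k\bigr] < \infty. \tag{17}$$

We define the KME for such a vector measure $\mu$ as

$$\mu[k_X] := \int_\Omega k(x, \cdot)\,\mathrm{d}\mu(x). \tag{18}$$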

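For an arbitrary $\mu$ in this collection, we define a linear functional $L_\mu[f] := \mu[\langle f, k_X \rangle_k]$ on the RKHS $H$ of $k$. Then, $L_\mu$ is bounded because

$$|L_\mu[f]| = |\mu[\langle f, k_X \rangle_k]| \le |\mu|\bigl[|\langle f, k_X \rangle_k|\bigr] \le \|f\|_k\,|\mu|\bigl[\|k_X\|_k\bigr] < \infty.$$

Therefore, $L_\mu$ belongs to the dual space of $H$. By the Riesz representation theorem, for any such $\mu$ there exists a representer in $H$, which we identify with $\mu[k_X]$, such that $L_\mu[f] = \langle f, \mu[k_X] \rangle_k$ for every $f \in H$. In particular, the reproducing property of the ‘expectation’ holds:

$$\mu[f(X)] = \langle f, \mu[k_X] \rangle_k. \tag{19}$$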

### 4.2 Characteristics

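A bounded measurable positive definite kernel $k$ on $\Omega$ is said to be characteristic when the KME operator $\mu \mapsto \mu[k_X]$ is injective. In other words, $k$ is characteristic if and only if

$$\bigl(\forall f \in H:\ \mu[f] = \nu[f]\bigr) \ \Rightarrow\ \mu = \nu. \tag{20}$$

We claim that if the KME operator with respect to $k$ is injective on the collection of probability measures on $\Omega$, then it is also injective on the collection of finite vector measures considered in § 4.1.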

###### Proof.

We prove the claim first for , then for .

First, we consider the KME for signed measures. Recall that is a bounded linear operator. Since , the restriction is injective. We show the converse: If is injective, then is injective. Assume that there exist that satisfies . Since (because is obvious and the converse follows from Hahn’s decomposition theorem), we can take a basis of that is composed of the elements of . Thus, by rewriting and , we have . By the assumption that is injective, the image set is linearly independent, which concludes for every components. Namely, .

Then, we consider the KME for vector measures. Assume that satisfies . In other words, for every component and . In each component, we can reuse the result for the KME for signed measures, and obtain , which concludes . ∎

## 5 Experimental Results

### 5.1 Experimental Setup

Methods. We compared two methods: (IS) importance sampling of sigma points from the probability distribution that is proportional to the spectrum of the ridgelet transform, which corresponds to the existing methods (Sonoda and Murata, 2014); and (GKQ) generalized kernel quadrature, i.e. greedy minimization of the MMD with the IPA kernel, which is the proposed method. In both methods, we determined the weights $w_j$ by using linear regression.
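Once the sigma points are fixed, fitting the weights is an ordinary least-squares problem in the hidden-unit outputs. A minimal sketch using NumPy's least-squares solver; the paper does not state which solver was used, and the function name and the `hidden` matrix layout are assumptions for illustration.

```python
import numpy as np

def fit_output_weights(hidden, y):
    """Least-squares output weights for fixed sigma points.

    hidden: (n, p) matrix with hidden[i, j] = sigma(x_i; theta_j).
    y: (n,) observed outputs.
    """
    w, *_ = np.linalg.lstsq(hidden, y, rcond=None)
    return w
```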

Network. We employed the shallow neural network, and the activation function was the first derivative of Gaussian kernel.

Datasets. We present the results with sinusoidal curve with Gaussian noise: with , and .

Evaluation. We employed the empirical maximum error (ME) , and root mean squares error (RMSE) .
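Both metrics are straightforward to compute from predictions and targets. The sketch below assumes the usual definitions (largest absolute deviation, and square root of the mean squared deviation); whether the reference values are the noisy observations or the noiseless function is not specified here, so the argument names are neutral.

```python
import numpy as np

def max_error(y_true, y_pred):
    """Empirical maximum error: largest absolute deviation over the sample."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.max(np.abs(diff)))

def rmse(y_true, y_pred):
    """Root mean squared error over the sample."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))
```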

### 5.2 Experimental Results

Figure 1 compares the error decay of IS and GKQ (proposed). In the comparison of RMSE, both IS and GKQ reached the gray line, which is the standard deviation of the noise in the dataset. This indicates that the numerical integration was conducted correctly in both cases. IS decayed along the red line, which corresponds to the slower rate $O(1/\sqrt{p})$. On the other hand, GKQ decayed along the blue line, which corresponds to the faster rate $O(1/p)$.

## 6 Conclusion

We have developed a generalized kernel quadrature (GKQ) method for the integral representation operator with a fast convergence guarantee, and proposed a natural choice of kernels, i.e. IPA kernels. Numerical experiments showed that the proposed method converged faster than the existing method (IS), which uses Monte Carlo integration. The key finding is the uniform bound (11) of the approximation error by the MMD. By using the conditional gradient method, we can decrease the MMD at $O(1/p)$. As a result, we can decrease the approximation error faster than ordinary Monte Carlo methods.

#### Acknowledgements

The author is grateful to N. Murata for suggesting the topic treated in this paper. The author would like to thank T. Suzuki and T. Matsubara for useful discussions. This work was supported by JSPS KAKENHI 18K18113.

## References

• Sonoda et al. (2018) Sho Sonoda, Isao Ishikawa, Masahiro Ikeda, Kei Hagihara, Yoshihiro Sawano, Takuo Matsubara, and Noboru Murata. The global optimum of shallow neural network is attained by ridgelet transform. 2018.
• Murata (1996) Noboru Murata. Neural Networks, 9(6):947–956, 1996.
• Candès (1998) Emmanuel Jean Candès. PhD thesis, Stanford University, 1998.
• Sonoda and Murata (2014) Sho Sonoda and Noboru Murata. In 24th International Conference on Artificial Neural Networks 2014, pages 539–546, Hamburg, Germany, 2014.
• Donoho (2001) David L Donoho. Journal of Approximation Theory, 111(2):143–179, 2001.
• Do and Vetterli (2003) M N Do and M Vetterli. Image Processing, IEEE Transactions on, 12(1):16–28, 2003.
• Starck et al. (2010) Jean-Luc Starck, Fionn Murtagh, and Jalal M. Fadili. In Sparse Image and Signal Processing: Wavelets, Curvelets, Morphological Diversity, pages 89–118. Cambridge University Press, 2010.
• O’Hagan (1991) A. O’Hagan. Journal of Statistical Planning and Inference, 29(3):245–260, 1991.
• Rasmussen and Ghahramani (2003) Carl Edward Rasmussen and Zoubin Ghahramani. In Advances in Neural Information Processing Systems 15, pages 505–512, 2003.
• Welling (2009) Max Welling. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1121–1128, Montreal, QC, 2009.
• Chen et al. (2010) Yutian Chen, Max Welling, and Alex Smola. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, pages 109–116, Catalina Island, CA, 2010.
• Huszár and Duvenaud (2012) Ferenc Huszár and David Duvenaud. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, pages 377–386, Catalina Island, CA, 2012.
• Bach et al. (2012) Francis Bach, Simon Lacoste-Julien, and Guillaume Obozinski. In Proceedings of the 29th International Conference on Machine Learning, pages 1355–1362, Edinburgh, Scotland, UK, 2012.
• Bach (2017) Francis Bach. Journal of Machine Learning Research, 18(21):1–38, 2017.
• Briol et al. (2015) François-Xavier Briol, Chris J. Oates, Mark Girolami, and Michael A. Osborne. In Advances in Neural Information Processing Systems 28, pages 1162–1170, Montreal, QC, 2015.
• Sonoda and Murata (2017) Sho Sonoda and Noboru Murata. Applied and Computational Harmonic Analysis, 43(2):233–268, 2017.
• Gel’fand and Shilov (1964) I. M. Gel’fand and G. E. Shilov. Generalized Functions, Vol. 1: Properties and Operations. Academic Press, New York, 1964.
• Muandet et al. (2017) Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. Foundations and Trends® in Machine Learning, 10(1-2):1–141, 2017.