Numerical Integration Method for Training Neural Network

02/02/2019 · by Sho Sonoda, et al.

We propose a new numerical integration method for training a shallow neural network by using the ridgelet transform with a fast convergence guarantee. Given a training dataset, the ridgelet transform can provide the parameters of the neural network that attain the global optimum of the training problem. In other words, we can obtain the global minimizer of the training problem by numerically computing the ridgelet transform, instead of by numerically optimizing the so-called backpropagation training problem. We employ the kernel quadrature as the basis of the numerical integration, because it is known to converge faster, i.e. at O(1/p) in the number p of hidden units, than other random methods such as Monte Carlo integration, which converge at O(1/√p). Originally, the kernel quadrature was developed for computing posterior means, where the measure is assumed to be a probability measure and the final product is a single number. Our problem, on the other hand, is the computation of an integral transform, where the measure is generally a signed measure and the final product is a function. In addition, the performance of the kernel quadrature is sensitive to the choice of its kernel. In this paper, we develop a generalized kernel quadrature method with a fast convergence guarantee in a function norm that is applicable to signed measures, and propose a natural choice of kernels.


1 Introduction

Training a neural network by backpropagation (BP) amounts to minimizing a non-convex loss function with many local minima. In the least-squares regression problem with shallow neural networks, Sonoda et al. (2018) proved that the global minimizer is given by the ridgelet transform (Murata, 1996; Candès, 1998). The ridgelet transform is an integral transform that provides the parameters of a neural network from the training dataset. In other words, by computing the ridgelet transform from the dataset, we can obtain the global minimizer instead of solving the non-convex minimization problem. In this study, we investigate a training method for neural networks based on numerical integration.

The integral representation $S[\gamma]$ of a neural network with activation function $\eta$ and coefficient function $\gamma$ is given by

(1)   $S[\gamma](x) = \int_{\mathbb{R}^m \times \mathbb{R}} \gamma(a,b)\,\eta(a \cdot x - b)\,\mathrm{d}a\,\mathrm{d}b.$

In comparison with an ordinary shallow neural network $g_p(x) = \sum_{j=1}^{p} c_j\,\eta(a_j \cdot x - b_j)$, the indices $j$ are replaced with integration variables $(a,b)$, the hidden parameters $(a_j, b_j)$ are integrated out, and the output parameters $c_j$ remain as the only trainable parameter, namely the coefficient function $\gamma$. We remark that the integral representation with $\gamma$ replaced by a sum of point masses corresponds to the ordinary network $g_p$.
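For concreteness, the following minimal sketch (not taken from the paper; the tanh activation and the random parameters are illustrative assumptions) spells out the ordinary finite network that the integral representation (1) generalizes.

    import numpy as np

    def shallow_net(x, a, b, c, eta=np.tanh):
        """Ordinary shallow network g_p(x) = sum_j c_j * eta(a_j . x - b_j)."""
        # x: (n, m) inputs, a: (p, m) hidden weights, b: (p,) biases, c: (p,) output weights
        return eta(x @ a.T - b) @ c

    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, size=(5, 2))   # five 2-dimensional inputs
    a = rng.normal(size=(10, 2))              # p = 10 hidden units
    b = rng.normal(size=10)
    c = rng.normal(size=10)
    print(shallow_net(x, a, b, c))            # the integral representation replaces the sum over j
                                              # by an integral over (a, b) against gamma(a, b) da db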

Historically, the ridgelet transform

(2)   $R[f;\psi](a,b) = \int_{\mathbb{R}^m} f(x)\,\overline{\psi(a \cdot x - b)}\,\mathrm{d}x,$

with an admissible function $\psi$, where the definition of admissibility is explained later, was discovered and developed as a right inverse operator of the integral representation operator $S$. Namely, given a function $f$, consider the integral equation $S[\gamma] = f$ in the unknown function $\gamma$. The ridgelet transform $R[f;\psi]$ of $f$ is a particular solution to this equation, i.e. $S[R[f;\psi]] = f$. In general, the solution to $S[\gamma] = f$ is not unique. Thus, $R$ is only a right inverse, i.e. $S \circ R = \mathrm{id}$. In other words, $R$ does not satisfy $R \circ S = \mathrm{id}$.

Computing the ridgelet transform amounts to the numerical integration of an integral operator. Several methods have been proposed for the numerical computation of the ridgelet transform. Murata (1996) and Sonoda and Murata (2014) proposed random coding methods based on Monte Carlo integration. Candès (1998), Donoho (2001) and Do and Vetterli (2003) proposed a series of discrete ridgelet transforms, in parallel with the discrete wavelet transform. We note that the ridgelet transform is closely related to the wavelet transform (and the Radon transform); see Starck et al. (2010), for example.

In general, the numerical integration of an integral transform $S[\gamma]$ of a given coefficient function $\gamma$ with respect to the integral kernel $\eta(a \cdot x - b)$ amounts to finding a sequence of finite sums $\sum_{j=1}^{p} w_j\,\eta(a_j \cdot x - b_j)$, with weights $w_j$ and sigma points $(a_j, b_j)$, that approximates the integral transform, namely $\sum_{j=1}^{p} w_j\,\eta(a_j \cdot x - b_j) \to S[\gamma](x)$ as $p \to \infty$ (in a certain sense of convergence). In the random coding, the sigma points $(a_j, b_j)$ are chosen randomly according to a probability distribution that is proportional to the spectrum $|R[f;\psi](a,b)|$. As is often the case, such a Monte Carlo method converges as slowly as $O(1/\sqrt{p})$. In the discrete transform, on the other hand, the sigma points are chosen regularly on a grid, which is often unrealistic in the present setting of machine learning. We remark that the discrete ridgelet transform has been developed for image processing; therefore, irregular grids, or random samples, have received less consideration.
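As a rough illustration of the random coding idea, the sketch below draws sigma points from a distribution proportional to the absolute value of a given spectrum over a finite candidate set; the candidate pool, the toy spectrum, and the self-normalization are assumptions made for illustration, not the authors' implementation.

    import numpy as np

    def sample_sigma_points(spectrum, candidates, p, rng):
        """Draw p sigma points with probability proportional to |spectrum| over the candidates."""
        w = np.abs(spectrum(candidates))
        w /= w.sum()                                   # self-normalized sampling weights
        idx = rng.choice(len(candidates), size=p, p=w)
        return candidates[idx]

    rng = np.random.default_rng(0)
    candidates = rng.normal(size=(10000, 2))           # crude proposal over theta = (a, b), 1-d inputs
    toy_spectrum = lambda ab: np.exp(-0.5 * (ab ** 2).sum(axis=1))   # stand-in for |R[f; psi](a, b)|
    sigma_points = sample_sigma_points(toy_spectrum, candidates, p=100, rng=rng)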

The kernel quadrature (O’Hagan, 1991; Rasmussen and Ghahramani, 2003; Welling, 2009; Chen et al., 2010; Huszár and Duvenaud, 2012; Bach et al., 2012; Bach, 2017; Briol et al., 2015) is an emerging technique for numerical integration. The kernel quadrature is a version of the Monte Carlo method; however, it converges faster than ordinary Monte Carlo methods do, by choosing each sigma point depending on all the past samples. The fast convergence arises from the equivalence between the kernel quadrature and the conditional gradient method. We remark that it has been studied for the purpose of computing posterior means; therefore, the measure has been assumed to be positive and finite.

In this study, we develop a generalized kernel quadrature method for computing the reconstruction formula

(3)   $f(x) = \int_{\mathbb{R}^m \times \mathbb{R}} R[f;\psi](a,b)\,\eta(a \cdot x - b)\,\mathrm{d}a\,\mathrm{d}b$

from a given dataset with a fast convergence guarantee. Moreover, we propose a natural choice of kernels that is used in the generalized kernel quadrature.

2 Preliminaries

In this section, we introduce notation, formulate the problem, and describe the key concepts.

Mathematical Notation

For the sake of simplicity and generality, we write the hidden parameters as $\theta := (a,b) \in \Theta := \mathbb{R}^m \times \mathbb{R}$, instead of $(a,b)$; and write the integral transform and the finite sum by

(4)   $S[\mu](x) = \int_{\Theta} \eta(a \cdot x - b)\,\mathrm{d}\mu(\theta), \qquad S[\mu_p](x) = \sum_{j=1}^{p} w_j\,\eta(a_j \cdot x - b_j), \quad \mu_p := \sum_{j=1}^{p} w_j\,\delta_{\theta_j},$

respectively. Here, $\mu$ is a signed measure on the space $\Theta$ of parameters, and $\mu_p$ is a weighted sum of Dirac measures on $\Theta$. In the following, the term ‘measure $\mu$’ includes singular measures such as $\mu_p$.

$\widehat{f}$ denotes the Fourier transform of a function $f$, and $\overline{z}$ denotes the complex conjugate of a complex number $z$. $U(I)$ denotes the uniform distribution on an interval $I$, and $N(c, \sigma^2)$ denotes the normal distribution with mean $c$ and variance $\sigma^2$.

2.1 Problem Formulation

Given a dataset $\{(x_i, y_i)\}_{i=1}^{n}$ of $n$ examples, where $x_i$ is distributed according to a probability distribution with density $q$, and $y_i = f(x_i) + \varepsilon_i$ with an unknown function $f$ and i.i.d. noise $\varepsilon_i$, our goal is to estimate the integral representation $S[\mu]$ with $\mathrm{d}\mu = R[f;\psi](a,b)\,\mathrm{d}a\,\mathrm{d}b$, which equals $f$, by using a finite sum $S[\mu_p]$.

2.2 Integral Representation Theory

Following Sonoda and Murata (2017), we assume that the activation function $\eta$ is an arbitrary element of the space $\mathcal{S}'(\mathbb{R})$ of tempered distributions, i.e. the strong dual space of the space of Schwartz functions. Examples of $\eta$ include the rectified linear unit (ReLU), the hyperbolic tangent ($\tanh$), and the Gaussian kernel.

For a fixed $\eta$, a Schwartz function $\psi$ is said to be admissible when it satisfies the admissibility condition

(5)   $K_{\psi,\eta} := \int_{\mathbb{R}} \frac{\overline{\widehat{\psi}(\zeta)}\,\widehat{\eta}(\zeta)}{|\zeta|^{m}}\,\mathrm{d}\zeta = 1,$

where $m$ is the dimension of the input vectors $x$ and of the domain of $f$. When $\psi$ is admissible with respect to $\eta$, the reconstruction formula $S[R[f;\psi]] = f$ holds for any $f$ in a suitable class of functions (Sonoda and Murata, 2017, Theorem 5.11).

2.3 Construction of Admissible Functions

A simple way to construct an admissible $\psi$ is to find a Schwartz function $\psi_0$ that satisfies the weaker subcondition that $\int_{\mathbb{R}} \overline{\widehat{\psi_0}(\zeta)}\,\widehat{\eta}(\zeta)\,|\zeta|^{-m}\,\mathrm{d}\zeta$ converges to a nonzero constant. An admissible $\psi$ is then obtained from $\psi_0$ by differentiation of an appropriate order (odd or even, depending on the parity of $m$) and normalization. For example, if $\eta$ is the ReLU, then its Fourier transform can be computed in the distributional sense (Gel’fand and Shilov, 1964, § 9.3), and the subcondition can be verified directly for a suitable $\psi_0$. See Sonoda and Murata (2017, § 6) for more examples.

2.4 Kernel Mean Embedding and Maximum Mean Discrepancy

We define the kernel mean embedding (KME) and the maximum mean discrepancy (MMD) for signed measures. We refer to Muandet et al. (2017) for the original definitions for probability measures. See § 4 for fundamental properties such as well-definedness.

Let $k$ be a symmetric positive definite kernel on the space $\Theta$ of parameters. According to the Moore–Aronszajn theorem, $k$ is associated with the reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$ equipped with the inner product $\langle \cdot, \cdot \rangle$ and the induced norm $\| \cdot \|$. We write $k_\theta := k(\cdot, \theta)$, which satisfies the reproducing property

(6)   $\langle h, k_\theta \rangle = h(\theta), \quad h \in \mathcal{H}_k.$

Let $\mu$ be a signed measure on $\Theta$. We write the action of $\mu$ on a measurable function $h$ as $\mu[h] := \int_\Theta h(\theta)\,\mathrm{d}\mu(\theta)$. With this notation, we define the kernel mean embedding (KME) of $\mu$ as

(7)   $\mathrm{KME}(\mu) := \int_\Theta k(\cdot, \theta)\,\mathrm{d}\mu(\theta).$

We call the image $\mathrm{KME}(\mu)$ the mean element. Under appropriate conditions, the mean element is an element of $\mathcal{H}_k$, and it has the reproducing property of the ‘expectation’

(8)   $\langle h, \mathrm{KME}(\mu) \rangle = \mu[h], \quad h \in \mathcal{H}_k.$

See § 4 for the proof.

We define the maximum mean discrepancy (MMD) between $\mu$ and $\nu$ as

(9)   $\mathrm{MMD}(\mu, \nu) := \sup_{\|h\| \le 1} \big| \mu[h] - \nu[h] \big|.$

By the reproducing property, the MMD equals the distance between the mean elements:

(10)   $\mathrm{MMD}(\mu, \nu) = \big\| \mathrm{KME}(\mu) - \mathrm{KME}(\nu) \big\|.$

2.5 Kernel Quadrature

In order to describe the ordinary kernel quadrature, we consider computing the expectation $\pi[h] = \int_\Theta h(\theta)\,\mathrm{d}\pi(\theta)$ of a function $h \in \mathcal{H}_k$ with respect to a probability measure $\pi$. We use a weighted sum of point masses $\pi_p := \sum_{j=1}^{p} w_j\,\delta_{\theta_j}$ for approximating $\pi$.

In the kernel quadrature, we minimize $\mathrm{MMD}(\pi, \pi_p)$. Since $|\pi[h] - \pi_p[h]| \le \|h\|\,\mathrm{MMD}(\pi, \pi_p)$, if $\mathrm{MMD}(\pi, \pi_p) \le \varepsilon$ for some $\varepsilon > 0$, then the estimate $\pi_p[h]$ also approximates the expectation $\pi[h]$, i.e. $|\pi[h] - \pi_p[h]| \le \varepsilon\,\|h\|$. We note that the minimization of the MMD is a quadratic program, which implies that if the domain is compact and convex, then, under appropriate assumptions, the optimization converges at $O(1/p)$ by using the conditional gradient method (Bach et al., 2012).
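To make the ordinary kernel quadrature concrete, here is a minimal kernel-herding-style sketch with a Gaussian kernel, uniform weights, and a Monte Carlo candidate pool; all of these concrete choices are illustrative assumptions rather than the paper's algorithm.

    import numpy as np

    def gauss_gram(u, v, gamma=1.0):
        """Gaussian kernel matrix k(u_i, v_j) = exp(-gamma * ||u_i - v_j||^2)."""
        d2 = ((u[:, None, :] - v[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    def kernel_herding(pi_samples, candidates, p):
        """Greedily pick p sigma points so that the empirical MMD to pi keeps decreasing."""
        mu = gauss_gram(candidates, pi_samples).mean(axis=1)   # KME of pi evaluated at each candidate
        running = np.zeros(len(candidates))                    # sum_j k(candidate, theta_j) so far
        chosen = []
        for t in range(p):
            scores = mu - running / max(t, 1)                  # conditional-gradient / herding criterion
            j = int(np.argmax(scores))
            chosen.append(candidates[j])
            running += gauss_gram(candidates, candidates[j:j + 1])[:, 0]
        return np.asarray(chosen)

    rng = np.random.default_rng(0)
    pi_samples = rng.normal(size=(1000, 2))    # samples representing the probability measure pi
    candidates = rng.normal(size=(2000, 2))    # Monte Carlo search space for the arg max
    sigma_points = kernel_herding(pi_samples, candidates, p=50)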

3 Method

We describe the generalized kernel quadrature method for computing the integral representation.

3.1 Generalized Kernel Quadrature for Integral Operators

We write $\eta_x(\theta) := \eta(a \cdot x - b)$, regarded as a function of $\theta = (a,b)$ for each fixed $x$, and assume that $\eta_x \in \mathcal{H}_k$ for any $x$. Then, we can write $S[\mu](x) = \mu[\eta_x] = \langle \eta_x, \mathrm{KME}(\mu) \rangle$, and there exists a finite number $C$ such that $\|\eta_x\| \le C$ for all $x$. By using the reproducing property and the Cauchy–Schwarz inequality,

(11)   $\big| S[\mu](x) - S[\mu_p](x) \big| = \big| \langle \eta_x, \mathrm{KME}(\mu) - \mathrm{KME}(\mu_p) \rangle \big| \le C\,\mathrm{MMD}(\mu, \mu_p) \quad \text{for all } x.$

Therefore, in the same way as in the ordinary kernel quadrature, if we minimize $\mathrm{MMD}(\mu, \mu_p)$ by using the conditional gradient method, then we can approximate $S[\mu]$ by $S[\mu_p]$ at the rate $O(1/p)$ in the sense of uniform convergence. This is faster than the $O(1/\sqrt{p})$ attained by an ordinary Monte Carlo integration method (Sonoda and Murata, 2014).

3.2 Greedy Minimization

In the experiments, we employed a greedy minimization of $\mathrm{MMD}(\mu, \mu_p)$. In the initial iteration, we select $\theta_1 \sim \pi_0$ for some prior $\pi_0$, with a fixed weight $w_1$. In the $p$-th iteration, we select the $\theta_p$ that maximizes the gap between the mean element of the target $\mu$ and that of the current estimate $\mu_{p-1}$. Namely, we select

(12)   $\theta_p \in \operatorname*{arg\,max}_{\theta \in \Theta}\ \big[ \mathrm{KME}(\mu)(\theta) - \mathrm{KME}(\mu_{p-1})(\theta) \big].$

Instead of running a gradient descent on (12), we employed a Monte Carlo search: at every iteration, we draw some samples from a proposal distribution on $\Theta$ and select one of them according to (12), as sketched below. We employed the normal distribution as the proposal distribution.
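A minimal sketch of one greedy iteration with the Monte Carlo search; `kme_target` and `kme_current` are hypothetical callables standing in for $\mathrm{KME}(\mu)(\theta)$ and $\mathrm{KME}(\mu_{p-1})(\theta)$ (they are not names from the paper), and the normal proposal mirrors the description above.

    import numpy as np

    def greedy_step(kme_target, kme_current, rng, n_proposals=1000, scale=1.0):
        """Draw proposals from a normal distribution and keep the parameter theta that
        maximizes the gap KME(mu)(theta) - KME(mu_{p-1})(theta)."""
        proposals = rng.normal(scale=scale, size=(n_proposals, 2))   # candidate theta = (a, b), 1-d inputs
        gap = kme_target(proposals) - kme_current(proposals)
        return proposals[int(np.argmax(gap))]

    rng = np.random.default_rng(0)
    theta_next = greedy_step(lambda t: np.exp(-(t ** 2).sum(axis=1)),   # toy stand-in for KME(mu)
                             lambda t: np.zeros(len(t)),                # toy stand-in for KME(mu_{p-1})
                             rng)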

3.3 Selection of Kernel

We define the inner product of activations (IPA) kernel:

(13)   $k(\theta, \theta') := \int_{\mathbb{R}^m} \eta(a \cdot x - b)\,\eta(a' \cdot x - b')\,q(x)\,\mathrm{d}x,$

with activation function $\eta$ and data density $q$. By construction, it is a reproducing kernel. The interpretation of $k$ is simple: $k(\theta, \theta')$ measures the similarity between the two ridge functions $\eta(a \cdot x - b)$ and $\eta(a' \cdot x - b')$ in the sense of $L^2(q)$. Furthermore, the empirical estimation is straightforward: $\widehat{k}(\theta, \theta') = \frac{1}{n} \sum_{i=1}^{n} \eta(a \cdot x_i - b)\,\eta(a' \cdot x_i - b')$.
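The empirical IPA kernel is just an average of products of hidden-unit activations over the sample; the sketch below assumes a tanh activation, inputs stored row-wise, and parameters packed as $\theta_j = (a_j, b_j)$, all of which are illustrative conventions.

    import numpy as np

    def ipa_gram(thetas, x, eta=np.tanh):
        """Empirical IPA kernel k(theta, theta') ~ (1/n) sum_i eta(a.x_i - b) * eta(a'.x_i - b')."""
        a, b = thetas[:, :-1], thetas[:, -1]   # last column of theta is the bias b
        h = eta(x @ a.T - b)                   # (n, p) activations eta(a_j . x_i - b_j)
        return h.T @ h / len(x)                # (p, p) Gram matrix

    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, size=(200, 2))  # n = 200 inputs drawn from the data distribution q
    thetas = rng.normal(size=(10, 3))          # p = 10 parameters theta_j = (a_j, b_j), with a_j in R^2
    K = ipa_gram(thetas, x)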

For $\mathrm{d}\mu = R[f;\psi](a,b)\,\mathrm{d}a\,\mathrm{d}b$, the mean element is calculated as below:

(14)   $\mathrm{KME}(\mu)(\theta) = \int_{\mathbb{R}^m} \eta(a \cdot x - b)\,\Big( \int_{\Theta} \eta(a' \cdot x - b')\,R[f;\psi](a',b')\,\mathrm{d}a'\,\mathrm{d}b' \Big)\,q(x)\,\mathrm{d}x = \int_{\mathbb{R}^m} \eta(a \cdot x - b)\,f(x)\,q(x)\,\mathrm{d}x,$

where we used the admissibility condition (5) in the form of the reconstruction formula (3). To our surprise, $\psi$ has vanished, which implies that we do not need to select $\psi$ for calculating the MMD. Namely,

(15)   $\mathrm{KME}(\mu)(\theta) = \mathbb{E}_{x \sim q}\big[ f(x)\,\eta(a \cdot x - b) \big],$

which is free of $\psi$. We note that the mean element is itself another ridgelet transform, say $R^\ast$, taken with respect to $\eta$ and the weight $q$, which corresponds to the adjoint operator $S^\ast$ of $S$. Namely, $S^\ast$ satisfies $\langle S[\mu], g \rangle_{L^2(q)} = \langle \mu, S^\ast[g] \rangle$ for any $\mu$ and $g$.
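Since the mean element (15) is free of $\psi$, it can be estimated directly from the training data; the following helper is an illustrative sketch (tanh activation, the same parameter packing as above), not the authors' code.

    import numpy as np

    def kme_target_hat(thetas, x, y, eta=np.tanh):
        """Empirical estimate of KME(mu)(theta) ~ (1/n) sum_i y_i * eta(a . x_i - b)."""
        a, b = thetas[:, :-1], thetas[:, -1]
        return eta(x @ a.T - b).T @ y / len(x)              # one value per row of thetas

    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, size=(200, 1))               # 1-d regression inputs
    y = np.sin(3.0 * x[:, 0]) + 0.1 * rng.normal(size=200)  # toy targets y_i = f(x_i) + noise
    thetas = rng.normal(size=(10, 2))                       # candidate parameters theta = (a, b)
    print(kme_target_hat(thetas, x, y))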

4 Fundamental Properties of KME for General Measures

We investigate the fundamental properties of the KME for signed measures and vector-valued measures.

4.1 Well-Definedness

Let $B$ be a Banach space, and let $\mu$ be a $B$-valued vector measure on a measurable space $\Theta$. $|\mu|$ denotes the total variation of $\mu$, and $\mu[h] := \int_\Theta h(\theta)\,\mathrm{d}\mu(\theta)$ denotes the integral of a measurable function $h$ on $\Theta$. We have the following inequality:

(16)   $\big\| \mu[h] \big\|_B \le \int_\Theta |h(\theta)|\,\mathrm{d}|\mu|(\theta).$

We fix a measurable positive definite kernel $k$ on $\Theta$. Let $\mathcal{M}_k(\Theta; B)$ be the collection of finite $B$-valued vector measures $\mu$ on $\Theta$ that satisfy

(17)   $\int_\Theta \sqrt{k(\theta, \theta)}\,\mathrm{d}|\mu|(\theta) < \infty.$

We define the KME for a vector measure $\mu \in \mathcal{M}_k(\Theta; B)$ as

(18)   $\mathrm{KME}(\mu) := \int_\Theta k(\cdot, \theta)\,\mathrm{d}\mu(\theta).$

For an arbitrary $\mu \in \mathcal{M}_k(\Theta; B)$, we define a linear map $T_\mu : \mathcal{H}_k \to B$ by $T_\mu(h) := \mu[h]$. Then, $T_\mu$ is bounded because, by (16), the reproducing property and the Cauchy–Schwarz inequality,

$\| T_\mu(h) \|_B \le \int_\Theta |h(\theta)|\,\mathrm{d}|\mu|(\theta) = \int_\Theta \big| \langle h, k(\cdot,\theta) \rangle \big|\,\mathrm{d}|\mu|(\theta) \le \|h\| \int_\Theta \sqrt{k(\theta,\theta)}\,\mathrm{d}|\mu|(\theta) < \infty.$

Therefore, $T_\mu$ is continuous. By the Riesz representation theorem, there exists $\mathrm{KME}(\mu)$ such that $T_\mu(h) = \langle h, \mathrm{KME}(\mu) \rangle$ for any $h \in \mathcal{H}_k$. In particular, the reproducing property of the ‘expectation’ holds:

(19)   $\langle h, \mathrm{KME}(\mu) \rangle = \mu[h], \quad h \in \mathcal{H}_k.$

4.2 Characteristics

A bounded measurable positive definite kernel $k$ on $\Theta$ is said to be characteristic when the KME operator is injective. In other words, $k$ is characteristic if and only if

(20)   $\mathrm{KME}(\mu) = \mathrm{KME}(\nu) \implies \mu = \nu.$

We claim that if the KME operator with respect to $k$ is injective on the collection $\mathcal{P}(\Theta)$ of probability measures on $\Theta$, then it is also injective on $\mathcal{M}_k(\Theta; B)$.

Proof.

We prove the claim first for signed measures $\mathcal{M}_k(\Theta; \mathbb{R})$, then for vector measures $\mathcal{M}_k(\Theta; B)$.

First, we consider the KME for signed measures. Recall that the KME is a bounded linear operator. Since $\mathcal{P}(\Theta) \subset \mathcal{M}_k(\Theta; \mathbb{R})$, injectivity on $\mathcal{M}_k(\Theta; \mathbb{R})$ implies injectivity of the restriction to $\mathcal{P}(\Theta)$. We show the converse: if the restriction to $\mathcal{P}(\Theta)$ is injective, then the KME is injective on $\mathcal{M}_k(\Theta; \mathbb{R})$. Assume that there exist $\mu, \nu \in \mathcal{M}_k(\Theta; \mathbb{R})$ that satisfy $\mathrm{KME}(\mu) = \mathrm{KME}(\nu)$. Since $\mathcal{M}_k(\Theta; \mathbb{R}) = \operatorname{span} \mathcal{P}(\Theta)$ (because $\supset$ is obvious and the converse follows from Hahn’s decomposition theorem), we can take a basis of $\mathcal{M}_k(\Theta; \mathbb{R})$ that is composed of elements of $\mathcal{P}(\Theta)$. Thus, by rewriting $\mu = \sum_i c_i \pi_i$ and $\nu = \sum_i d_i \pi_i$ with basis elements $\pi_i \in \mathcal{P}(\Theta)$, we have $\sum_i (c_i - d_i)\,\mathrm{KME}(\pi_i) = 0$. By the assumption that the restriction to $\mathcal{P}(\Theta)$ is injective, the image set $\{ \mathrm{KME}(\pi_i) \}_i$ is linearly independent, which concludes $c_i = d_i$ for every component. Namely, $\mu = \nu$.

Then, we consider the KME for vector measures. Assume that $\mu, \nu \in \mathcal{M}_k(\Theta; B)$ satisfy $\mathrm{KME}(\mu) = \mathrm{KME}(\nu)$. In other words, $\mathrm{KME}(\mu_e) = \mathrm{KME}(\nu_e)$ for every component $e$, where $\mu_e$ and $\nu_e$ denote the signed component measures of $\mu$ and $\nu$. In each component, we can reuse the result for the KME for signed measures and obtain $\mu_e = \nu_e$, which concludes $\mu = \nu$. ∎

5 Experimental Results

5.1 Experimental Setup

Methods. We compared two methods: (IS) importance sampling of sigma points from the probability distribution proportional to $|R[f;\psi](a,b)|$, which corresponds to the existing methods (Sonoda and Murata, 2014); and (GKQ) generalized kernel quadrature, i.e. greedy minimization of the MMD with the IPA kernel, which is the proposed method. In both methods, we determined the weights $w_j$ by using linear regression.
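A sketch of the weight determination: once the sigma points are fixed (by IS or GKQ), the weights $w_j$ are obtained by least squares on the hidden-unit design matrix; the tanh activation and the toy data are assumptions made for illustration.

    import numpy as np

    def fit_weights(thetas, x, y, eta=np.tanh):
        """Determine the output weights w by linear regression, given fixed sigma points theta_j."""
        a, b = thetas[:, :-1], thetas[:, -1]
        design = eta(x @ a.T - b)                          # (n, p) matrix of hidden activations
        w, *_ = np.linalg.lstsq(design, y, rcond=None)     # minimizes ||design @ w - y||^2
        return w

    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, size=(300, 1))
    y = np.sin(3.0 * x[:, 0]) + 0.1 * rng.normal(size=300)
    thetas = rng.normal(size=(20, 2))                      # sigma points chosen by IS or GKQ
    w = fit_weights(thetas, x, y)
    y_hat = np.tanh(x @ thetas[:, :-1].T - thetas[:, -1]) @ w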

Network. We employed a shallow neural network, and the activation function was the first derivative of the Gaussian kernel.

Datasets. We present the results for a sinusoidal curve with Gaussian noise: $y_i = f(x_i) + \varepsilon_i$ with $f$ a sinusoid, $x_i$ drawn from a uniform distribution, and $\varepsilon_i$ drawn from a normal distribution.

Evaluation. We employed the empirical maximum error (ME) $\max_i |y_i - S[\mu_p](x_i)|$ and the root mean squared error (RMSE) $\big( \frac{1}{n} \sum_i (y_i - S[\mu_p](x_i))^2 \big)^{1/2}$.

5.2 Experimental Results

Figure 1 compares the error decay of IS and GKQ (proposed). In the comparison of RMSE, both IS and GKQ reached the gray line, which is the standard deviation of the noise in the dataset. This indicates that the numerical integration was conducted correctly in both cases. IS decayed along the red line, which corresponds to the slower rate $O(1/\sqrt{p})$. On the other hand, GKQ decayed along the blue line, which corresponds to the faster rate $O(1/p)$.

Figure 1: Error decay of the numerical integration results for (a) IS and (b) GKQ (proposed), shown in a double logarithmic plot. The red and blue lines correspond to $O(1/\sqrt{p})$ and $O(1/p)$, respectively. The gray line in the RMSE plot indicates the standard deviation of the noise in the dataset. Both IS and GKQ reached the gray line, and GKQ decays faster than IS.

6 Conclusion

We have developed a generalized kernel quadrature (GKQ) method for the integral representation operator with a fast convergence guarantee, and proposed a natural choice of kernels, i.e. the IPA kernels. Numerical experiments showed that the proposed method converges faster than the existing method (IS), which uses Monte Carlo integration. The key finding is the uniform bound (11) on the approximation error in terms of the MMD. By using the conditional gradient method, we can decrease the MMD at $O(1/p)$. As a result, we can decrease the approximation error faster than with ordinary Monte Carlo methods.

Acknowledgements

The author is grateful to N. Murata for suggesting the topic treated in this paper. The author would like to thank T. Suzuki and T. Matsubara for useful discussions. This work was supported by JSPS KAKENHI 18K18113.

References