Neural network integral representations with the ReLU activation function

10/07/2019 ∙ by Anton Dereventsov et al. ∙ Oak Ridge National Laboratory ∙ The University of Tennessee, Knoxville

We derive a formula for neural network integral representations on the sphere with the ReLU activation function, under the assumption that the outer weights have finite L_1 norm with respect to the Lebesgue measure on the sphere. In the one-dimensional case, we further characterize, via a closed-form formula, all possible such representations. Additionally, in this case our formula allows one to explicitly find the neural network representation with the least L_1 norm for a given function.

1 Introduction

Artificial neural networks were introduced in the 1940s as a mathematical model for biological neural networks. In the mid-1980s there was a spike of interest in this area, including from the mathematical community, which is largely attributed to the invention of the back-propagation method for training neural networks [23]. However, the amount of research in the field had slowly diminished by the end of the 1990s. The advancement of computational tools during the last decade has largely contributed to the revival of interest in this field. Deeper architectures have been observed to perform better than shallow ones [22], and faster GPUs have accelerated the training of deep networks, since the computations can be carried out in parallel in network-type dataflow structures. However, despite recent developments in both theory and practical tools, many fundamental questions, even those concerning shallow neural network representations and their training, remain unanswered.

The focus of this paper is the so-called integral representation approach to shallow networks. Early mathematical research on artificial neural networks was mainly focused on investigating the expressive power [8, 7, 17] and the approximation complexity [10, 2, 18] of single hidden layer neural networks. One of the main approaches employed was to represent the target function as an integral whose appropriate discretization results in a neural network approximation. There were also attempts at investigating the integral representations themselves independently [9, 2, 19, 16, 11]. This line of work is, in particular, motivated by the ridgelet transform introduced by E. Candès in [6] and its later variations such as the curvelet and contourlet transforms.

1.1 Shallow neural networks

Definition 1.

A shallow neural network with an activation function $\sigma$ and $m$ nodes is a function of the form

$$ f_m(x) = \sum_{j=1}^{m} c_j\, \sigma(\langle w_j, x \rangle + b_j), \qquad x \in \mathbb{R}^d, \qquad (1) $$

where $w_j \in \mathbb{R}^d$, $b_j \in \mathbb{R}$, and $c_j \in \mathbb{R}$ are called the inner weights, biases, and outer weights, respectively.

In this paper we consider the ReLU (rectified linear unit) activation, given by $\sigma(t) = \max(t, 0)$, which is the conventional choice of activation in most modern architectures.
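For concreteness, the following minimal NumPy sketch evaluates a shallow ReLU network of the form (1); the weight values below are placeholders chosen only for illustration.

```python
import numpy as np

def relu(t):
    # ReLU activation: sigma(t) = max(t, 0)
    return np.maximum(t, 0.0)

def shallow_network(x, W, b, c):
    """Evaluate f_m(x) = sum_j c_j * relu(<w_j, x> + b_j).

    x : (n, d) array of inputs
    W : (m, d) array of inner weights
    b : (m,)  array of biases
    c : (m,)  array of outer weights
    """
    return relu(x @ W.T + b) @ c

# Example with m = 3 nodes in d = 2 dimensions (placeholder weights).
x = np.random.randn(5, 2)
W = np.random.randn(3, 2)
b = np.random.randn(3)
c = np.random.randn(3)
print(shallow_network(x, W, b, c).shape)  # (5,)
```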

Let $f : \mathbb{R}^d \to \mathbb{R}$ be the target function, e.g. an image classifier, the solution to a PDE, a specific parameter associated with a model, etc. A neural network represents a computationally simple parametric family (since propagating an input through a neural network consists of matrix multiplications and activations). The aim is to find (learn, train) a neural network representation $f_m$ of the target function such that $f_m \approx f$. Some of the challenges that arise are the following:

  1. Neural network architectures are typically selected by trial and error, and it is often unclear beforehand how many layers and nodes the network should have;

  2. The models lack interpretability; in particular, it is not clear what the weights represent;

  3. The training process is computationally expensive, and there is no good initialization method for the training;

  4. There are numerous stability issues, both during the training process (sensitivity to initialization) and after training (sensitivity to adversarial attacks), among many others.

1.2 Integral representations of shallow networks

Our main approach is to think of a shallow network as a discretization of a suitable integral representation of the target function $f$. More precisely, the sum (1) is regarded as a discretization of an integral of the form

$$ \int_{\mathbb{R}^d \times \mathbb{R}} \sigma(\langle w, x \rangle + b)\, d\mu(w, b), \qquad (2) $$

where $\mu$ is a Radon measure on $\mathbb{R}^d \times \mathbb{R}$. Note that a shallow network with $m$ nodes can be written in the form (2) with an atomic measure $\mu$ consisting of $m$ atoms, so this type of representation is quite general. We believe this approach may open a wide area for research, potentially leading to faster and more stable algorithms for neural network training and to architectures best fitted for specific problems. The integral form of neural network representations is more concise and better suited for analysis. From a numerical perspective, utilizing various discretization methods may allow one to obtain a network that approximates the target function while completely bypassing the optimization-based training process; the constructed neural network can then be refined further via gradient-based training.
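As a toy illustration of this discretization viewpoint (an example we add for exposition, not one of the results below), the sketch uses the classical Taylor formula with integral remainder, $f(x) = \int_0^1 f''(b)\, \sigma(x - b)\, db$ for smooth $f$ on $[0, 1]$ with $f(0) = f'(0) = 0$, and discretizes the integral on a uniform grid; each grid point becomes one ReLU node, and no gradient-based training is involved.

```python
import numpy as np

# Toy univariate target on [0, 1] with f(0) = f'(0) = 0, so that the Taylor formula
# with integral remainder gives f(x) = \int_0^1 f''(b) * max(x - b, 0) db for x in [0, 1].
f = lambda x: x**2
f_second = lambda b: 2.0 * np.ones_like(b)

# Discretize the integral by a left-endpoint Riemann sum on a uniform grid of m biases:
# each grid point becomes one ReLU node whose outer weight is f''(b_j) times the spacing.
m = 50
b = np.linspace(0.0, 1.0, m, endpoint=False)
outer = f_second(b) * (1.0 / m)

def network(x):
    # Shallow ReLU network obtained purely by discretization; no training involved.
    return np.maximum(x[:, None] - b[None, :], 0.0) @ outer

x = np.linspace(0.0, 1.0, 200)
print(np.max(np.abs(network(x) - f(x))))  # discretization error, decreasing in m
```

Refining the grid shrinks the discretization error, which mirrors the idea that a finer discretization of (2) yields a better network approximation.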

Neural network integral representations have been considered by various authors, where typically the Radon measure $\mu$ is assumed to be of a special form, e.g. supported on a given set or absolutely continuous with respect to a probability measure. One specific type of integral representation, discussed below, originates from the harmonic analysis perspective on shallow neural networks [6] and employs the ridgelet transform. There it is assumed that the target function can be written as

(3)

The function  is called the dual ridgelet transform of the function  with respect to . The ‘direct’ transform, called the ridgelet transform of  with respect to , is given by

(4)

It is shown in [27] that if the pair satisfies the admissibility condition

then the reconstruction formula holds, thus providing a particular integral representation of the target function:
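For orientation, we recall one common way the ridgelet transform and its dual are written in the literature; the normalization constants and sign conventions used in [6, 27] may differ from the sketch below:

$$ \mathcal{R}_\psi f(w, b) = \int_{\mathbb{R}^d} f(x)\, \overline{\psi(\langle w, x \rangle - b)}\, dx, \qquad \mathcal{R}^*_\eta T(x) = \int_{\mathbb{R}^d \times \mathbb{R}} T(w, b)\, \eta(\langle w, x \rangle - b)\, dw\, db. $$

Admissibility of the pair $(\psi, \eta)$ roughly requires that $\int_{\mathbb{R}} \hat{\psi}(\zeta)\, \overline{\hat{\eta}(\zeta)}\, |\zeta|^{-d}\, d\zeta$ be finite and nonzero, in which case $\mathcal{R}^*_\eta \mathcal{R}_\psi f$ reproduces $f$ up to a constant factor.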

Methods for discretizing integral representations of the form (2) have been considered by various authors. In particular, the authors in [5] employ a greedy method to discretize the solution of (2) with the smallest total variation norm. The same problem is solved by the conditional gradient algorithm in [1]. In [20, 21] the authors suggested Monte–Carlo sampling of the parameters to discretize the integral and least-squares regularization to find the values of the outer weights; they call this method the random vector functional-link (RVFL) network. A related Monte–Carlo discretization method for integral representations of radial basis function (RBF) neural networks is considered in [18]. In [26] integral representations are used to obtain better weight initializations. Other related work includes [12, 13, 14, 15, 3, 4].
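The RVFL recipe just described admits a very short implementation; in the sketch below the target function, the sampling distributions for the inner parameters, and the ridge parameter are illustrative choices on our part rather than the ones used in [20, 21].

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative univariate target and training data.
f = lambda x: np.sin(2.0 * np.pi * x)
x_train = rng.uniform(-1.0, 1.0, size=(200, 1))
y_train = f(x_train).ravel()

# Step 1: sample the inner weights and biases at random and keep them fixed.
m = 100
W = rng.normal(size=(m, 1))
b = rng.uniform(-1.0, 1.0, size=m)
features = np.maximum(x_train @ W.T + b, 0.0)  # (200, m) ReLU feature matrix

# Step 2: fit only the outer weights by ridge-regularized least squares.
lam = 1e-3
c = np.linalg.solve(features.T @ features + lam * np.eye(m), features.T @ y_train)

# The resulting shallow network; the inner parameters were never trained.
predict = lambda x: np.maximum(x @ W.T + b, 0.0) @ c
x_test = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)
print(predict(x_test))
```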

1.3 Our approach

We note that, due to the positive homogeneity of the ReLU activation function, the representation (1) can be rewritten with the weight-bias pairs placed on the unit sphere $\mathbb{S}^d \subset \mathbb{R}^{d+1}$ (see the rescaling identity after the list below). In this setting, we consider target functions that admit integral representations of the form

$$ f(x) = \int_{\mathbb{S}^{d}} g(w, b)\, \sigma(\langle w, x \rangle + b)\, d\mu(w, b), \qquad (5) $$

where $\mu$ is the Lebesgue (surface) measure on $\mathbb{S}^d$ and $g \in L_1(\mathbb{S}^d, \mu)$, the class of Lebesgue integrable functions on $\mathbb{S}^d$ with respect to $\mu$. Note that, for every $x \in \mathbb{R}^d$, the function $(w, b) \mapsto \sigma(\langle w, x \rangle + b)$ is bounded on $\mathbb{S}^d$, and so the integral (5) is well defined. The three main problems posed here in this regard are the following:

  • Characterize the functions $f$ that admit the representation (5);

  • For a given $f$, find all kernels $g \in L_1(\mathbb{S}^d, \mu)$ for which (5) holds;

  • Find the least $L_1$-norm solution $g$ of (5).
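To make the reduction to the sphere explicit, here is the rescaling identity behind it: for any $c \in \mathbb{R}$ and any nonzero pair $(w, b) \in \mathbb{R}^d \times \mathbb{R}$, the positive homogeneity $\sigma(\lambda t) = \lambda\, \sigma(t)$ for $\lambda > 0$ gives

$$ c\, \sigma(\langle w, x \rangle + b) = c\, \|(w, b)\|\; \sigma\!\left( \left\langle \tfrac{w}{\|(w, b)\|}, x \right\rangle + \tfrac{b}{\|(w, b)\|} \right), $$

so every node of (1) can be replaced by a node whose weight-bias pair lies on $\mathbb{S}^d$, with the scaling factor absorbed into the outer weight.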

For $d = 1$, we have fully solved these problems. A similar approach was employed in [24, 25]; however, to our knowledge, the characterization theorems addressing the first two bullets above are new and are presented here for the first time.

2 Main results

The next theorem gives a sufficient condition for $f$ to have the representation (5). Before stating it, we introduce the following two transforms. The Radon transform of a function $f \in L_1(\mathbb{R}^d)$ is the function $Rf : \mathbb{S}^{d-1} \times \mathbb{R} \to \mathbb{R}$ given by the formula

$$ Rf(w, p) = \int_{\{x \in \mathbb{R}^d :\, \langle w, x \rangle = p\}} f(x)\, dx, $$

where the integration is with respect to the $(d-1)$-dimensional Lebesgue measure on the hyperplane $\{x \in \mathbb{R}^d : \langle w, x \rangle = p\}$. The Hilbert transform of a function $\varphi : \mathbb{R} \to \mathbb{R}$ is defined as

$$ H\varphi(t) = \frac{1}{\pi}\, \mathrm{p.v.}\! \int_{\mathbb{R}} \frac{\varphi(s)}{t - s}\, ds.
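As a quick sanity check of the Radon transform (an illustration added here, not a computation from the paper), the Gaussian $f(x) = e^{-\|x\|^2}$ in $\mathbb{R}^2$ satisfies $Rf(w, p) = \sqrt{\pi}\, e^{-p^2}$ by rotational invariance; the following numerical integration over the line $\{x : \langle w, x \rangle = p\}$ reproduces this value.

```python
import numpy as np

def radon_gaussian(w, p, s_max=8.0, n=4001):
    # Numerically integrate f(x) = exp(-|x|^2) over the line {x in R^2 : <w, x> = p},
    # parametrized as x = p*w + s*w_perp with w a unit vector.
    w = np.asarray(w, dtype=float)
    w = w / np.linalg.norm(w)
    w_perp = np.array([-w[1], w[0]])
    s = np.linspace(-s_max, s_max, n)
    x = p * w[None, :] + s[:, None] * w_perp[None, :]
    values = np.exp(-np.sum(x**2, axis=1))
    return values.sum() * (s[1] - s[0])  # simple Riemann sum over the line parameter

w, p = np.array([0.6, 0.8]), 0.7
print(radon_gaussian(w, p), np.sqrt(np.pi) * np.exp(-p**2))  # the two values agree
```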

We now formulate one of our main results, which provides a particular kernel for the integral representation (5).

Theorem 1.

Any compactly supported function admits the representation

The stated theorem offers a way to construct a specific kernel $g$ for which the integral representation formula (5) holds. A similar result was proved in [12] for the Heaviside function, which is the derivative of the ReLU.

2.1 Univariate target functions

For the case $d = 1$ we state a stronger version of Theorem 1 that characterizes the class of target functions that can be represented in the form (5).

Note that the unit circle $\mathbb{S}^1$ can be parametrized by $(\cos\theta, \sin\theta)$, where $\theta \in [0, 2\pi)$. Moreover, for $d = 1$ the hyperplane $\{x \in \mathbb{R} : wx = p\}$ consists of the single point $wp$ (with $w \in \{\pm 1\}$), and hence $Rf(w, p) = f(wp)$. Thus Theorem 1 provides, for any compactly supported function, that

In this subsection we provide a more general representation and extend the set of admissible functions by defining a class of functions $f$ such that the derivative $f'$ exists everywhere on $\mathbb{R}$, the second derivative $f''$ exists almost everywhere on $\mathbb{R}$, and

The following theorem characterizes the class of target functions that admit the integral representation (5) with an integrable kernel.

Theorem 2.

The function admits the representation

(6)

with some integrable kernel $g$ if and only if $f$ has the form

(7)

where and

Moreover, the set of such weight functions coincides with the set of functions of the form

(8)

where .

The following theorem characterizes the class  as the set of functions that admit an integral representation with an appropriate integrable weight function.

Theorem 3.

A function belongs to the class if and only if it admits the representation

with a kernel satisfying

Since in Theorem 3 we assume that the kernel is integrable, we can pose the interesting question of finding the kernel with the smallest $L_1$-norm for a given target function $f$. In particular, $L_1$-norm minimization is commonly used in compressed sensing for finding sparse solutions. In the next theorem we answer this question.

Theorem 4.

For the minimum

(9)

is attained at

A similar result has been obtained by the authors in [24], and its multidimensional analogue in [25]. They show that the value

where  is the set of all Radon measures on  and  is the total variation norm, for  is exactly equal to

They do not make assumptions on the smoothness of the function; instead, the second derivative is interpreted in the distributional sense. This result is consistent with Theorem 4 if we perform a change of variables from  to .
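As a computational aside (our own illustration, with an arbitrarily chosen target, grids, and solver), the least-$L_1$-norm problem (9) has a natural discrete counterpart: fix a grid of ReLU nodes parametrized by angles on the unit circle and minimize the $\ell_1$ norm of the outer weights subject to matching the target at sample points. This is a standard basis-pursuit linear program.

```python
import numpy as np
from scipy.optimize import linprog

# Discrete analogue of the least-L1-norm problem: minimize sum_j |c_j| subject to
# sum_j c_j * relu(x_i*cos(theta_j) + sin(theta_j)) = f(x_i) at the sample points x_i.
f = lambda x: np.abs(x) - 0.5                                  # illustrative target
x = np.linspace(-1.0, 1.0, 21)                                 # sample points (constraints)
theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)     # ReLU nodes on the circle

A = np.maximum(np.outer(x, np.cos(theta)) + np.sin(theta), 0.0)  # (21, 200) ReLU dictionary
y = f(x)

# Basis pursuit: write c = c_plus - c_minus with c_plus, c_minus >= 0,
# so that |c_j| = c_plus_j + c_minus_j and the objective becomes linear.
m = theta.size
res = linprog(c=np.ones(2 * m), A_eq=np.hstack([A, -A]), b_eq=y,
              bounds=[(0, None)] * (2 * m), method="highs")
coef = res.x[:m] - res.x[m:]
print(res.status, np.abs(coef).sum(), np.max(np.abs(A @ coef - y)))
```

Basic optimal solutions of this linear program have at most as many nonzero outer weights as there are sample points, which reflects the sparsity-promoting effect of $\ell_1$ minimization mentioned above.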

3 Proofs

3.1 Proof of Theorem 1

Lemma 1.

For any we have

Proof.

The statement of the lemma is trivial for . For , consider the following change of variables given by the spherical coordinates

where and , and . The area element on the unit sphere is given by . Therefore we obtain

To complete the proof, we change the variable to to get and . Substituting into the above integral provides the required result. ∎

We also use the following technical result, which is a corollary of Proposition 8.1 from [12].

Lemma 2.

Let be the Heaviside function, i.e. . Then for any compactly supported function in the following reconstruction formula holds:

We now proceed to the proof of Theorem 1. Let $d$ be odd. From Lemma 1 we get

where we use the positive homogeneity of the Radon transform

Then integration by parts provides

and applying Lemma 2 completes the proof of this case. The case of even $d$ is proved analogously.

3.2 Proof of Theorem 2

Before proving the theorem we perform several related calculations. From the equality we obtain

and, in the same way,

From the mutual orthogonality of the functions we deduce

where we changed the variable to . Similarly,

We now prove the direct implication. Assume that a function admits the integral representation

with a weight function . Note that due to the mutual orthogonality of the functions we can assume without loss of generality that by replacing the weight function with

Then from the mutual orthogonality of we deduce

(10)

We will show that is in the class . First, we show that . Indeed, from condition (10) we get for any

By using the relation we obtain

(11)

Then from condition (10) we get . By a similar argument we have .

Next, we show the existence of the derivative and that . Let  denote the Heaviside step function; then, by the dominated convergence theorem, we get for any

(12)

By taking into account estimate (11) we derive . Condition  is proved in a similar way.

Next we show that the second derivative exists almost everywhere. Indeed, from (12) we see that  exists at every  such that  is a Lebesgue point of , which is the case almost everywhere since . In that case we get

(13)

Finally, by changing the variable from to , we estimate

Therefore .

Lastly, we show that the weight function has the form (8) with some . Denote

Then and satisfies conditions (10), hence

From (13) we deduce that for almost all

Combining these relations we conclude

Since we initially subtracted the term  from the weight function , in the general case we have

which completes the proof of the direct implication.


We now prove the converse implication. Assume that the function  has the form

with some function  and constants . As in the direct case, we can assume that  by replacing the function  with

Denote

We will show that and that . First, note that

and hence . Taking into account that and using the assumption we get