# Neural network integral representations with the ReLU activation function

We derive a formula for neural network integral representations on the sphere with the ReLU activation function under the finite L_1 norm (with respect to Lebesgue measure on the sphere) assumption on the outer weights. In one dimensional case, we further solve via a closed-form formula all possible such representations. Additionally, in this case our formula allows one to explicitly solve the least L_1 norm neural network representation for a given function.


## 1 Introduction

Artificial neural networks were introduced in the 1940s as a mathematical model for biological neural networks. In the mid-1980s there was a spike of interest in this area, including from the mathematical community, largely attributed to the invention of the back-propagation method for training neural networks [23]. However, the amount of research in the field gradually diminished by the end of the 1990s. The advancement of computational tools during the last decade has largely contributed to a revival of interest in the field. Deeper architectures have been observed to perform better than shallow ones [22], and faster GPUs have accelerated deep network training, since the computations in a network's dataflow structure can be carried out in parallel. However, despite recent developments in both theory and practical tools, many fundamental questions, even those concerning shallow neural network representations and training, remain unanswered.

The focus of this paper is the so-called integral representation approach to shallow networks. Early mathematical research on artificial neural networks focused mainly on the expressive power [8, 7, 17] and the approximation complexity [10, 2, 18] of single-hidden-layer neural networks. One of the main approaches employed was to represent the target function as an integral whose appropriate discretization yields a neural network approximation. There were also attempts to investigate the integral representations themselves [9, 2, 19, 16, 11]. This line of work was motivated, in particular, by the ridgelet transform introduced by E. Candès in [6] and its later variations such as the curvelet and contourlet transforms.

### 1.1 Shallow neural networks

###### Definition 1.

A shallow neural network with an activation function $\sigma$ and $m$ nodes is a function of the form

$$L(x)=\sum_{j=1}^{m} c_j\,\sigma(a_j\cdot x+b_j), \tag{1}$$

where $a_j\in\mathbb{R}^d$, $b_j\in\mathbb{R}$, and $c_j\in\mathbb{R}$ are called the inner weights, biases, and outer weights, respectively.
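As an illustration (ours, not from the paper), a network of the form (1) can be evaluated in a few lines of NumPy; all names and the example weights below are arbitrary:

```python
import numpy as np

def shallow_relu_net(x, inner, biases, outer):
    """Evaluate L(x) = sum_j c_j * relu(a_j . x + b_j) for a batch of inputs.

    x:      (n, d) array of inputs
    inner:  (m, d) array of inner weights a_j
    biases: (m,)   array of biases b_j
    outer:  (m,)   array of outer weights c_j
    """
    pre = x @ inner.T + biases           # (n, m) pre-activations a_j . x + b_j
    return np.maximum(pre, 0.0) @ outer  # ReLU, then weighted sum over nodes

# a network with m = 2 nodes in d = 1 realizing |x| = relu(x) + relu(-x)
x = np.array([[-2.0], [0.5], [3.0]])
inner = np.array([[1.0], [-1.0]])
biases = np.zeros(2)
outer = np.ones(2)
print(shallow_relu_net(x, inner, biases, outer))
```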

In this paper we consider the ReLU (rectified linear unit) activation, given by $\sigma(t)=\max\{t,0\}$, which is the conventional choice of activation in most modern architectures.

Let $f:\mathbb{R}^d\to\mathbb{R}$ be the target function, e.g. an image classifier, the solution to a PDE, a specific parameter associated with a model, etc. A neural network represents a computationally simple parametric family, since propagating an input through a neural network consists of matrix multiplications and activations. The aim is to find (learn, train) a neural network representation $L$ of the target function such that $L\approx f$. Some of the challenges that arise are the following:

1. Neural network architectures are typically selected by trial and error, and often it is not clear beforehand how many layers and nodes should be taken in the network;

2. Models lack interpretability; in particular, it is not clear what the weights represent;

3. The training process is computationally expensive, and there is no good initialization method for the training;

4. There are many stability issues, both during the training process (sensitivity to initialization) and after training (sensitivity to adversarial attacks), among others.

### 1.2 Integral representations of shallow networks

Our main approach is to regard a shallow network as a discretization of a suitable integral representation of the target function $f$. More precisely, the sum (1) is regarded as a discretization of an integral of the form

$$f(x)=\int_{\mathbb{R}^d\times\mathbb{R}}\sigma(a\cdot x+b)\,d\nu(a,b), \tag{2}$$

where $\nu$ is a Radon measure on $\mathbb{R}^d\times\mathbb{R}$. Note that a shallow network with $m$ nodes can be written in the form (2) with an atomic measure $\nu$ with $m$ atoms, so this type of representation is quite general. We believe this approach may open a wide area for research, potentially leading to faster and more stable algorithms for neural network training, and to architectures best fitted for specific problems. The integral form of a neural network representation is more concise and better suited for analysis. From a numerical perspective, various discretization methods may allow one to obtain a network that approximates the target function while completely bypassing the optimization-based training process; the constructed network can then be refined further via gradient-based training.

Neural network integral representations have been considered by various authors, where typically the Radon measure $\nu$ is assumed to be of a special form, e.g. supported on a given set or absolutely continuous with respect to a probability measure. One specific type of integral representation, discussed below, originates from the harmonic analysis perspective on shallow neural networks [6] and employs the ridgelet transform. There it is assumed that the target function can be written as the dual ridgelet transform of some function $g$ with respect to $\sigma$:

$$f(x)=\int_{\mathbb{R}^d\times\mathbb{R}} g(a,b)\,\sigma(a\cdot x+b)\,da\,db. \tag{3}$$

The 'direct' transform $R_\tau f$, called the ridgelet transform of $f$ with respect to $\tau$, is given by

$$R_{\tau}f(a,b):=\int_{\mathbb{R}^d} f(x)\,\overline{\tau(a\cdot x+b)}\,dx. \tag{4}$$

It is shown in [27] that if the pair $(\sigma,\tau)$ satisfies the admissibility condition

$$(2\pi)^{d-1}\int_{\mathbb{R}}\frac{\hat{\sigma}(\omega)\,\overline{\hat{\tau}(\omega)}}{|\omega|^{d}}\,d\omega=1,$$

then the reconstruction formula holds, i.e. the dual ridgelet transform of $R_\tau f$ with respect to $\sigma$ recovers $f$, thus providing a particular integral representation of the target function $f$.

Methods for discretizing integral representations of the form (2) have been considered by various authors. In particular, the authors of [5] employ a greedy method to discretize the solution of (2) with the smallest total variation norm; the same problem is solved by the conditional gradient algorithm in [1]. In [20, 21] the authors suggested Monte Carlo sampling of the parameters $(a,b)$ to discretize the integral, and least squares regularization to find the values of the outer weights; they call this method the random vector functional-link (RVFL) network. A related Monte Carlo discretization method for integral representations of radial basis function (RBF) neural networks is considered in [18]. In [26] integral representations are used to obtain a better weight initialization. Other related work includes [12, 13, 14, 15, 3, 4].

### 1.3 Our approach

We note that, due to the positive homogeneity of the ReLU activation function, the representation (1) can be rewritten with the weights $(a_j,b_j)$ on the unit sphere $S^d\subset\mathbb{R}^{d+1}$. In this setting, we consider target functions that admit integral representations of the form

$$f(x)=\int_{S^d} c(a,b)\,\sigma(a\cdot x+b)\,d\nu_d(a,b), \tag{5}$$

where $\nu_d$ is the Lebesgue (surface) measure on $S^d$ and $c\in L_1(S^d)$, the class of Lebesgue-integrable functions on $S^d$ with respect to $\nu_d$. Note that, for every $x$, the function $(a,b)\mapsto\sigma(a\cdot x+b)$ is bounded on $S^d$, and so the integral (5) is well defined. The three main problems posed here in this regard are the following:

• Characterize the functions $f$ that have a representation of the form (5);

• For a given $f$, find all kernels $c$ for which (5) holds;

• Find the least $L_1$-norm solution $c$ to (5).

For $d=1$ we have fully solved these problems. A similar approach was employed in [24, 25]; however, to our knowledge, the characterization theorems addressing the first two problems above are new and presented here for the first time.
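To see that the second problem is nontrivial, note that representations (5) are not unique: some nonzero kernels represent the zero function. For instance (our computation, not from the paper), for $d=1$, with $S^1$ parametrized by $\phi$, the kernel $c(\phi)=\cos 3\phi$ integrates against $\sigma(x\cos\phi+\sin\phi)$ to zero for every $x$, as a direct quadrature confirms:

```python
import numpy as np

# c(phi) = cos(3*phi) is a nonzero kernel representing the zero function:
# the quadrature below returns ~0 for every tested x.
n = 200_000
phi = (np.arange(n) + 0.5) * 2 * np.pi / n   # midpoint rule on [0, 2*pi]
for x in (-3.0, 0.0, 0.7, 10.0):
    vals = np.cos(3 * phi) * np.maximum(x * np.cos(phi) + np.sin(phi), 0.0)
    print(x, vals.sum() * 2 * np.pi / n)
```

Analytically this follows from the fact that $|\cos\psi|$ has only even-order Fourier harmonics, so it is orthogonal to $\cos 3\psi$ and $\sin 3\psi$.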

## 2 Main results

The next theorem gives a sufficient condition for $f$ to have the representation (5). Before stating it, we introduce the following two transforms. The Radon transform of a function $f:\mathbb{R}^d\to\mathbb{R}$ is the function given by the formula

$$R[f](a,b):=\int_{a\cdot x+b=0} f(x)\,d\nu_{d-1},$$

where the integration is with respect to the $(d-1)$-dimensional Lebesgue measure $\nu_{d-1}$ on the hyperplane $\{x\in\mathbb{R}^d : a\cdot x+b=0\}$. The Hilbert transform of a function $g:\mathbb{R}\to\mathbb{R}$ is defined as

$$H[g](b):=\frac{1}{\pi}\ \mathrm{p.v.}\int_{-\infty}^{\infty}\frac{g(z)}{b-z}\,dz.$$
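The principal value in the definition of $H$ can be computed numerically by sampling symmetrically around the singularity, so that the divergent contributions cancel. The following sketch (ours, not from the paper) checks the classical pair $H[1/(1+z^2)](b)=b/(1+b^2)$:

```python
import numpy as np

def hilbert_pv(g, b, h=1e-3, cutoff=200.0):
    """Principal-value quadrature of H[g](b) = (1/pi) p.v. int g(z)/(b - z) dz.

    Nodes are placed symmetrically around z = b, which cancels the
    singularity, so a plain midpoint sum converges to the principal value.
    """
    k = np.arange(int(cutoff / h))
    z_right = b + (k + 0.5) * h     # nodes to the right of the singularity
    z_left = b - (k + 0.5) * h      # mirror nodes to the left
    vals = g(z_right) / (b - z_right) + g(z_left) / (b - z_left)
    return vals.sum() * h / np.pi

# classical pair: H[1/(1+z^2)](b) = b/(1+b^2)
g = lambda z: 1.0 / (1.0 + z**2)
for b in (0.0, 1.0, -2.5):
    print(hilbert_pv(g, b), b / (1 + b**2))
```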

We now formulate one of our main results which provides a particular kernel for the integral representation (5).

###### Theorem 1.

Any compactly supported function $f$ admits the representation

$$f(x)=\begin{cases}\dfrac{(-1)^{(d+1)/2}}{2(2\pi)^{d-1}}\displaystyle\int_{S^d}\frac{1}{\|a\|^{d+2}}\,\frac{\partial^{d+1}}{\partial b^{d+1}}R[f](a,b)\;\sigma(a\cdot x+b)\,d\nu_d(a,b) & \text{if } d \text{ is odd};\\[2ex]\dfrac{(-1)^{d/2}}{2(2\pi)^{d-1}}\displaystyle\int_{S^d}\frac{1}{\|a\|^{d+2}}\,\frac{\partial^{d+1}}{\partial b^{d+1}}H\big[R[f](a,\cdot)\big](b)\;\sigma(a\cdot x+b)\,d\nu_d(a,b) & \text{if } d \text{ is even}.\end{cases}$$

The stated theorem offers a way to construct a specific kernel $c$ for which the integral representation (5) holds. A similar result was proved in [12] for the Heaviside function, which is the derivative of ReLU.

### 2.1 Univariate target functions

For the case $d=1$ we state a stronger version of Theorem 1 that characterizes the class of target functions that can be represented in the form (5).

Note that the unit circle $S^1$ can be parametrized by $(\cos\phi,\sin\phi)$, where $\phi\in[0,2\pi)$. Then the hyperplane $\{x : x\cos\phi+\sin\phi=0\}$ consists of the single point $x=-\tan\phi$, and hence $R[f](\phi)=f(-\tan\phi)$. Thus Theorem 1 provides, for any compactly supported function $f$,

$$f(x)=\int_{0}^{2\pi}\frac{f''(-\tan\phi)}{2|\cos^3\phi|}\,\sigma(x\cos\phi+\sin\phi)\,d\phi.$$

In this subsection we provide a more general representation and extend the set of admissible functions by defining the class $W(\mathbb{R})$ consisting of functions $g$ such that $g'$ exists everywhere on $\mathbb{R}$, $g''$ exists almost everywhere on $\mathbb{R}$, and

$$\lim_{x\to\pm\infty} g(x)=\lim_{x\to\pm\infty} x\,g'(x)=0,\qquad \int_{-\infty}^{\infty}|g''(x)|\sqrt{1+x^2}\,dx<\infty.$$

The following theorem characterizes the class of target functions that admit the integral representation (5) with an integrable kernel $c$.

###### Theorem 2.

A function $f$ admits the representation

$$f(x)=\int_{0}^{2\pi} c(\phi)\,\sigma(x\cos\phi+\sin\phi)\,d\phi \tag{6}$$

with some $c\in L_1[0,2\pi]$ if and only if $f$ has the form

$$f(x)=g(x)+\alpha x+\beta+\gamma(x\arctan x+1)+\eta\arctan x, \tag{7}$$

where $g\in W(\mathbb{R})$ and

$$\alpha=\frac{2}{\pi}\int_{0}^{2\pi} c(\phi)\cos\phi\,d\phi,\quad \beta=\frac{2}{\pi}\int_{0}^{2\pi} c(\phi)\sin\phi\,d\phi,\quad \gamma=\int_{0}^{2\pi} c(\phi)\,|\cos\phi|\,d\phi,\quad \eta=\int_{0}^{2\pi} c(\phi)\,s(\phi)\,d\phi,$$

with $s(\phi)=s(\phi+\pi)=\sin\phi$ for $\phi\in[-\pi/2,\pi/2)$.

Moreover, the set of such weight functions $c$ coincides with the set of functions of the form

$$c(\phi)=\frac{f''(-\tan\phi)}{2|\cos^3\phi|}+\frac{k''(-\tan\phi)}{2\cos^3\phi}+\alpha\cos\phi+\beta\sin\phi+\gamma|\cos\phi|+\eta\,s(\phi), \tag{8}$$

where $k\in W(\mathbb{R})$.

The following theorem characterizes the class $W(\mathbb{R})$ as the set of functions that admit an integral representation with an integrable weight function satisfying additional orthogonality conditions.

###### Theorem 3.

A function $f$ belongs to the class $W(\mathbb{R})$ if and only if it admits the representation

$$f(x)=\int_{0}^{2\pi} c(\phi)\,\sigma(x\cos\phi+\sin\phi)\,d\phi$$

with a kernel $c\in L_1[0,2\pi]$ satisfying

$$\int_{-\pi/2}^{\pi/2} c(\phi)\cos\phi\,d\phi=\int_{-\pi/2}^{\pi/2} c(\phi+\pi)\cos\phi\,d\phi=\int_{-\pi/2}^{\pi/2} c(\phi)\sin\phi\,d\phi=\int_{-\pi/2}^{\pi/2} c(\phi+\pi)\sin\phi\,d\phi=0.$$

Since in Theorem 3 we only assume $c\in L_1[0,2\pi]$, it is natural to ask for the kernel $c$ with the smallest $L_1$-norm representing a given target function $f$; in particular, $L_1$-norm minimization is commonly used in compressed sensing to find sparse solutions. The next theorem answers this question.

###### Theorem 4.

For $f\in W(\mathbb{R})$ the minimum

$$\min_{c\in L_1[0,2\pi]}\|c\|_1\quad\text{s.t.}\quad \int_{0}^{2\pi} c(\phi)\,\sigma(x\cos\phi+\sin\phi)\,d\phi=f(x) \tag{9}$$

is attained at

$$c(\phi)=\frac{f''(-\tan\phi)}{2|\cos^3\phi|}.$$
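As a sanity check (ours, not from the paper), the minimizing kernel of Theorem 4 can be verified numerically for a concrete rapidly decaying target, here $f(x)=e^{-x^2}$, whose second derivative is $f''(x)=(4x^2-2)e^{-x^2}$:

```python
import numpy as np

# Quadrature check that the Theorem-4 kernel c(phi) = f''(-tan(phi)) / (2|cos(phi)|^3)
# reproduces the target f(x) = exp(-x^2) through the representation (6).
f = lambda x: np.exp(-x**2)
f2 = lambda x: (4 * x**2 - 2) * np.exp(-x**2)  # second derivative of f

n = 200_000
phi = (np.arange(n) + 0.5) * 2 * np.pi / n     # midpoint nodes on [0, 2*pi]
c = f2(-np.tan(phi)) / (2 * np.abs(np.cos(phi)) ** 3)

for x in (-1.5, 0.0, 0.8):
    integral = (c * np.maximum(x * np.cos(phi) + np.sin(phi), 0.0)).sum() * 2 * np.pi / n
    print(x, integral, f(x))   # the last two columns agree
```

Near $\phi=\pm\pi/2$ the Gaussian factor in $f''(-\tan\phi)$ decays faster than $|\cos\phi|^{-3}$ grows, so the kernel stays bounded and the plain midpoint rule suffices.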

A similar result has been obtained by the authors of [24], and its multidimensional analogue in [25]. They show that the value

$$\min_{\nu\in\mathcal{A},\,\delta\in\mathbb{R}}\|\nu\|_{TV}\quad\text{s.t.}\quad \int_{S^{d-1}\times\mathbb{R}}\sigma(a\cdot x+b)\,d\nu(a,b)+\delta=f(x),$$

where $\mathcal{A}$ is the set of all Radon measures on $S^{d-1}\times\mathbb{R}$ and $\|\cdot\|_{TV}$ is the total variation norm, is for $d=1$ exactly equal to

$$\max\Big\{\int_{\mathbb{R}}|f''(x)|\,dx,\ \lim_{x\to\infty}|f'(x)+f'(-x)|\Big\}.$$

They make no smoothness assumptions on the function $f$; instead the second derivative is interpreted in the distributional sense. This result is consistent with Theorem 4 after the change of variables from $\phi$ to $\tan\phi$.

## 3 Proofs

### 3.1 Proof of Theorem 1

###### Lemma 1.

For any $F\in L_1(S^d)$ we have

$$\int_{S^d} F(a,b)\,d\nu_d(a,b)=\int_{0}^{\pi}\sin^{d-1}\phi\int_{S^{d-1}} F(\alpha\sin\phi,\cos\phi)\,d\nu_{d-1}(\alpha)\,d\phi=\int_{S^{d-1}}\int_{\mathbb{R}}\frac{1}{\sqrt{1+\beta^2}^{\,d+1}}\,F\Big(\frac{\alpha}{\sqrt{1+\beta^2}},\frac{\beta}{\sqrt{1+\beta^2}}\Big)\,d\nu_{d-1}(\alpha)\,d\beta.$$
###### Proof.

The statement of the lemma is trivial for $d=1$. For $d\ge 2$, consider the change of variables given by the spherical coordinates $(a,b)=(\alpha\sin\phi_1,\cos\phi_1)$, where $\alpha\in S^{d-1}$ is parametrized by the angles $\phi_2,\dots,\phi_d$, with $\phi_1,\dots,\phi_{d-1}\in[0,\pi]$ and $\phi_d\in[0,2\pi)$. The area element on the unit sphere is given by $d\nu_d=\sin^{d-1}\phi_1\,\sin^{d-2}\phi_2\cdots\sin\phi_{d-1}\,d\phi_1\cdots d\phi_d$. Therefore we obtain

$$\int_{S^d} F(a,b)\,d\nu_d(a,b)=\int_0^\pi\!\cdots\!\int_0^\pi\!\int_0^{2\pi} F(\alpha\sin\phi_1,\cos\phi_1)\,\sin^{d-1}\phi_1\sin^{d-2}\phi_2\cdots\sin\phi_{d-1}\,d\phi_d\cdots d\phi_2\,d\phi_1=\int_0^\pi\sin^{d-1}\phi_1\int_0^\pi\!\cdots\!\int_0^{2\pi} F(\alpha\sin\phi_1,\cos\phi_1)\,\sin^{d-2}\phi_2\cdots\sin\phi_{d-1}\,d\phi_d\cdots d\phi_2\,d\phi_1.$$

To complete the proof, we change the variable $\phi_1$ to $\beta=\cot\phi_1$, so that $\sin\phi_1=\frac{1}{\sqrt{1+\beta^2}}$, $\cos\phi_1=\frac{\beta}{\sqrt{1+\beta^2}}$, and $d\phi_1=-\frac{d\beta}{1+\beta^2}$. Substituting into the above integral provides the required result. ∎
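The two changes of variables in Lemma 1 can be checked numerically (our sketch, not from the paper). For $d=2$ and $F(a,b)=b^2$ (so $a\in\mathbb{R}^2$, $b\in\mathbb{R}$, $\|a\|^2+b^2=1$), both right-hand sides should reproduce $\int_{S^2} b^2\,d\nu_2=4\pi/3$:

```python
import numpy as np

exact = 4 * np.pi / 3  # int over S^2 of b^2

# first form: int_0^pi sin(phi) * [nu_1(S^1) = 2*pi] * cos(phi)^2 dphi
n = 100_000
hp = np.pi / n
phi = (np.arange(n) + 0.5) * hp
first = 2 * np.pi * np.sum(np.sin(phi) * np.cos(phi) ** 2) * hp

# second form: int over S^1 and beta of (1+beta^2)^(-3/2) * beta^2/(1+beta^2)
m = 2_000_000
hb = 1000.0 / m
beta = -500.0 + (np.arange(m) + 0.5) * hb
second = 2 * np.pi * np.sum(beta**2 / (1 + beta**2) ** 2.5) * hb

print(first, second, exact)
```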

We also use the following technical result, which is a corollary of Proposition 8.1 from [12].

###### Lemma 2.

Let $H$ be the Heaviside function, i.e. $H(t)=1$ for $t\ge 0$ and $H(t)=0$ for $t<0$. Then for any compactly supported function $f$ the following reconstruction formula holds:

$$f(x)=\begin{cases}-\dfrac{(-1)^{(d+1)/2}}{2(2\pi)^{d-1}}\displaystyle\int_{S^{d-1}}\int_{\mathbb{R}}\frac{\partial^{d}}{\partial\beta^{d}}R[f](\alpha,\beta)\;H(\alpha\cdot x+\beta)\,d\beta\,d\nu_{d-1}(\alpha) & \text{if } d \text{ is odd};\\[2ex]-\dfrac{(-1)^{d/2}}{2(2\pi)^{d-1}}\displaystyle\int_{S^{d-1}}\int_{\mathbb{R}}\frac{\partial^{d}}{\partial\beta^{d}}H\big[R[f](\alpha,\cdot)\big](\beta)\;H(\alpha\cdot x+\beta)\,d\beta\,d\nu_{d-1}(\alpha) & \text{if } d \text{ is even}.\end{cases}$$

We now proceed to the proof of Theorem 1. Let $d$ be odd. From Lemma 1 we get

$$\int_{S^d}\frac{1}{\|a\|^{d+2}}\,\frac{\partial^{d+1}}{\partial b^{d+1}}R[f](a,b)\;\sigma(a\cdot x+b)\,d\nu_d(a,b)=\int_{S^{d-1}}\int_{\mathbb{R}}\frac{\partial^{d+1}}{\partial\beta^{d+1}}R[f]\Big(\frac{\alpha}{\sqrt{1+\beta^2}},\frac{\beta}{\sqrt{1+\beta^2}}\Big)\,\sigma(\alpha\cdot x+\beta)\,d\beta\,d\nu_{d-1}(\alpha)=\int_{S^{d-1}}\int_{\mathbb{R}}\frac{\partial^{d+1}}{\partial\beta^{d+1}}R[f](\alpha,\beta)\,\sigma(\alpha\cdot x+\beta)\,d\beta\,d\nu_{d-1}(\alpha),$$

where we used the positive homogeneity of the Radon transform:

$$R[f]\Big(\frac{\alpha}{\sqrt{1+\beta^2}},\frac{\beta}{\sqrt{1+\beta^2}}\Big)=R[f](\alpha,\beta).$$

Then integration by parts provides

$$\int_{\mathbb{R}}\frac{\partial^{d+1}}{\partial\beta^{d+1}}R[f](\alpha,\beta)\,\sigma(\alpha\cdot x+\beta)\,d\beta=-\int_{\mathbb{R}}\frac{\partial^{d}}{\partial\beta^{d}}R[f](\alpha,\beta)\,H(\alpha\cdot x+\beta)\,d\beta,$$

and applying Lemma 2 completes the proof of this case. The case of even $d$ is proved analogously.

### 3.2 Proof of Theorem 2

Before proving the theorem we perform several auxiliary calculations. From the identity $\sigma(t)=\frac12(|t|+t)$ we obtain

$$\int_0^{2\pi}\cos\phi\;\sigma(x\cos\phi+\sin\phi)\,d\phi=\frac12\int_0^{2\pi}\cos\phi\,|x\cos\phi+\sin\phi|\,d\phi+\frac12\int_0^{2\pi}\cos\phi\,(x\cos\phi+\sin\phi)\,d\phi=\frac12\Big(\int_0^{\pi}+\int_{\pi}^{2\pi}\Big)\cos\phi\,|x\cos\phi+\sin\phi|\,d\phi+\frac{\pi x}{2}=\frac{\pi x}{2}$$

and, in the same way,

$$\int_0^{2\pi}\sin\phi\;\sigma(x\cos\phi+\sin\phi)\,d\phi=\frac{\pi}{2}.$$

Using the orthogonality of $|\cos\phi|$ to $\cos\phi$ and $\sin\phi$ on $[0,2\pi]$, we deduce

$$\int_0^{2\pi}|\cos\phi|\,\sigma(x\cos\phi+\sin\phi)\,d\phi=\frac12\int_0^{2\pi}|\cos\phi|\,|x\cos\phi+\sin\phi|\,d\phi+\frac12\int_0^{2\pi}|\cos\phi|\,(x\cos\phi+\sin\phi)\,d\phi=\int_{-\pi/2}^{\pi/2}|\cos\phi|\,|x\cos\phi+\sin\phi|\,d\phi=\int_{-\pi/2}^{\pi/2}\cos^2\phi\,|x+\tan\phi|\,d\phi=\int_{-\infty}^{\infty}\frac{|x+z|}{(1+z^2)^2}\,dz=x\arctan x+1,$$

where we changed the variable $\phi$ to $z=\tan\phi$. Similarly,

$$\int_0^{2\pi} s(\phi)\,\sigma(x\cos\phi+\sin\phi)\,d\phi=\frac12\int_0^{2\pi}s(\phi)\,|x\cos\phi+\sin\phi|\,d\phi+\frac12\int_0^{2\pi}s(\phi)\,(x\cos\phi+\sin\phi)\,d\phi=\frac12\int_0^{2\pi}s(\phi)\,|x\cos\phi+\sin\phi|\,d\phi=\int_{-\pi/2}^{\pi/2}\tan\phi\,\cos^2\phi\,|x+\tan\phi|\,d\phi=\int_{-\infty}^{\infty}\frac{z\,|x+z|}{(1+z^2)^2}\,dz=\arctan x.$$
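These four closed-form integrals are easy to confirm by quadrature (our check, not from the paper; note that the definition of $s$ unfolds to $s(\phi)=\sin\phi\cdot\operatorname{sign}(\cos\phi)$):

```python
import numpy as np

# Midpoint-rule check of the four identities:
#   int cos*relu = pi*x/2,  int sin*relu = pi/2,
#   int |cos|*relu = x*arctan(x)+1,  int s*relu = arctan(x).
n = 400_000
h = 2 * np.pi / n
phi = (np.arange(n) + 0.5) * h
s = np.where(np.cos(phi) >= 0, np.sin(phi), -np.sin(phi))  # s(phi)

for x in (-2.0, 0.3, 1.7):
    relu = np.maximum(x * np.cos(phi) + np.sin(phi), 0.0)
    print(np.sum(np.cos(phi) * relu) * h, np.pi * x / 2)
    print(np.sum(np.sin(phi) * relu) * h, np.pi / 2)
    print(np.sum(np.abs(np.cos(phi)) * relu) * h, x * np.arctan(x) + 1)
    print(np.sum(s * relu) * h, np.arctan(x))
```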

We now prove the direct implication. Assume that a function $f$ admits the integral representation

$$f(x)=\int_0^{2\pi} c(\phi)\,\sigma(x\cos\phi+\sin\phi)\,d\phi$$

with a weight function $c\in L_1[0,2\pi]$. Note that, due to the mutual orthogonality of the functions $\cos\phi$, $\sin\phi$, $|\cos\phi|$, and $s(\phi)$, we can assume without loss of generality that $\alpha=\beta=\gamma=\eta=0$ by replacing the weight function $c(\phi)$ with

$$c(\phi)-\alpha\cos\phi-\beta\sin\phi-\gamma|\cos\phi|-\eta\,s(\phi).$$

Then from the mutual orthogonality of these functions we deduce

$$\int_{-\pi/2}^{\pi/2} c(\phi)\cos\phi\,d\phi=\int_{-\pi/2}^{\pi/2} c(\phi+\pi)\cos\phi\,d\phi=\int_{-\pi/2}^{\pi/2} c(\phi)\sin\phi\,d\phi=\int_{-\pi/2}^{\pi/2} c(\phi+\pi)\sin\phi\,d\phi=0. \tag{10}$$

We will show that $f$ is in the class $W(\mathbb{R})$. First, we show that $\lim_{x\to\pm\infty} f(x)=0$. Indeed, from conditions (10) we get, for any $x$,

$$f(x)=\int_{-\pi/2}^{\pi/2} c(\phi)\,\sigma(x\cos\phi+\sin\phi)\,d\phi+\int_{-\pi/2}^{\pi/2} c(\phi+\pi)\,\sigma(-x\cos\phi-\sin\phi)\,d\phi=\int_{-\arctan x}^{\pi/2} c(\phi)\,(x\cos\phi+\sin\phi)\,d\phi+\int_{-\pi/2}^{-\arctan x} c(\phi+\pi)\,(-x\cos\phi-\sin\phi)\,d\phi=-x\int_{-\pi/2}^{-\arctan x}\big(c(\phi)+c(\phi+\pi)\big)\cos\phi\,d\phi-\int_{-\pi/2}^{-\arctan x}\big(c(\phi)+c(\phi+\pi)\big)\sin\phi\,d\phi.$$

Using that $|\cos\phi|\le\frac{1}{\sqrt{1+x^2}}$ for $\phi\in[-\pi/2,-\arctan x]$, we obtain

$$\bigg|x\int_{-\pi/2}^{-\arctan x}\big(c(\phi)+c(\phi+\pi)\big)\cos\phi\,d\phi\bigg|\le\frac{x}{\sqrt{1+x^2}}\int_{-\pi/2}^{-\arctan x}\big|c(\phi)+c(\phi+\pi)\big|\,d\phi. \tag{11}$$

It follows that $\lim_{x\to\infty} f(x)=0$, and by a similar argument $\lim_{x\to-\infty} f(x)=0$.

Next, we show the existence of the derivative $f'$ and that $\lim_{x\to\pm\infty} x f'(x)=0$. Let $H$ denote the Heaviside step function; then by the dominated convergence theorem we get, for any $x$,

$$f'(x)=\int_0^{2\pi} c(\phi)\,H(x\cos\phi+\sin\phi)\cos\phi\,d\phi=\int_{-\pi/2}^{\pi/2} c(\phi)\,H(x\cos\phi+\sin\phi)\cos\phi\,d\phi-\int_{-\pi/2}^{\pi/2} c(\phi+\pi)\,H(-x\cos\phi-\sin\phi)\cos\phi\,d\phi=\int_{-\arctan x}^{\pi/2} c(\phi)\cos\phi\,d\phi-\int_{-\pi/2}^{-\arctan x} c(\phi+\pi)\cos\phi\,d\phi=-\int_{-\pi/2}^{-\arctan x}\big(c(\phi)+c(\phi+\pi)\big)\cos\phi\,d\phi. \tag{12}$$

Taking into account estimate (11), we derive $\lim_{x\to\infty} x f'(x)=0$; the condition $\lim_{x\to-\infty} x f'(x)=0$ is proved in a similar way.

Next we show that the second derivative $f''$ exists almost everywhere. Indeed, from (12) we see that $f''(x)$ exists at every $x$ such that $-\arctan x$ is a Lebesgue point of $c(\phi)+c(\phi+\pi)$, which holds for almost every $x$ since $c\in L_1[0,2\pi]$. In that case we get

$$f''(x)=\frac{1}{1+x^2}\,\big(c(-\arctan x)+c(-\arctan x+\pi)\big)\cos(-\arctan x)=\frac{c(-\arctan x)+c(-\arctan x+\pi)}{(1+x^2)^{3/2}}. \tag{13}$$

Finally, by changing the variable $x$ to $\phi=-\arctan x$, we estimate

$$\int_{-\infty}^{\infty}\big|f''(x)\big|\sqrt{1+x^2}\,dx=\int_{-\infty}^{\infty}\frac{1}{1+x^2}\,\big|c(-\arctan x)+c(-\arctan x+\pi)\big|\,dx=\int_{-\pi/2}^{\pi/2}\big|c(\phi)+c(\phi+\pi)\big|\,d\phi\le\|c\|_1<\infty.$$

Therefore $f\in W(\mathbb{R})$.

Lastly, we show that the weight function $c$ has the form (8) with some $k\in W(\mathbb{R})$. Denote

$$\bar c(\phi)=\begin{cases}c(\phi), & \phi\in[-\pi/2,\pi/2),\\ -c(\phi), & \phi\in[\pi/2,3\pi/2).\end{cases}$$

Then $\bar c\in L_1[0,2\pi]$ and $\bar c$ satisfies conditions (10); hence

$$k(x):=\int_0^{2\pi}\bar c(\phi)\,\sigma(x\cos\phi+\sin\phi)\,d\phi\in W(\mathbb{R}).$$

From (13), applied to $f$ and to $k$, we deduce that for almost all $\phi\in(-\pi/2,\pi/2)$

$$f''(-\tan\phi)=\big(c(\phi)+c(\phi+\pi)\big)\,|\cos^3\phi|,\qquad k''(-\tan\phi)=\big(c(\phi)-c(\phi+\pi)\big)\,|\cos^3\phi|.$$

Combining these relations we conclude

$$c(\phi)=\frac{f''(-\tan\phi)}{2|\cos^3\phi|}+\frac{k''(-\tan\phi)}{2\cos^3\phi}.$$

Since we initially subtracted the term $\alpha\cos\phi+\beta\sin\phi+\gamma|\cos\phi|+\eta\,s(\phi)$ from the weight function $c$, in the general case we have

$$c(\phi)=\frac{f''(-\tan\phi)}{2|\cos^3\phi|}+\frac{k''(-\tan\phi)}{2\cos^3\phi}+\alpha\cos\phi+\beta\sin\phi+\gamma|\cos\phi|+\eta\,s(\phi),$$

which completes the proof of the direct implication.

We now prove the inverse implication. Assume that the function $f$ has the form

$$f(x)=g(x)+\alpha x+\beta+\gamma(x\arctan x+1)+\eta\arctan x$$

with some function $g\in W(\mathbb{R})$ and constants $\alpha,\beta,\gamma,\eta\in\mathbb{R}$. As in the direct case, we can assume that $\alpha=\beta=\gamma=\eta=0$ by replacing the function $f(x)$ with

$$f(x)-\alpha x-\beta-\gamma(x\arctan x+1)-\eta\arctan x.$$

Denote

$$c(\phi)=\frac{f''(-\tan\phi)}{2|\cos^3\phi|}+\frac{g''(-\tan\phi)}{2\cos^3\phi}.$$

We will show that $c\in L_1[0,2\pi]$ and that the representation (6) holds with this $c$. First, note that

$$\int_0^{2\pi}|c(\phi)|\,d\phi\le 2\int_{-\pi/2}^{\pi/2}\frac{|f''(-\tan\phi)|}{2|\cos\phi|^3}\,d\phi+2\int_{-\pi/2}^{\pi/2}\frac{|g''(-\tan\phi)|}{2|\cos\phi|^3}\,d\phi=\int_{-\infty}^{\infty}|f''(z)|\sqrt{1+z^2}\,dz+\int_{-\infty}^{\infty}|g''(z)|\sqrt{1+z^2}\,dz<\infty,$$

and hence $c\in L_1[0,2\pi]$. Taking into account that $\sigma(t)=\frac12(|t|+t)$ and using the assumption $\lim_{x\to\pm\infty}g(x)=\lim_{x\to\pm\infty}x\,g'(x)=0$, we get

$$\int_0^{2\pi}\frac{f''(-\tan\phi)}{2|\cos^3\phi|}\,\sigma(x\cos\phi+\sin\phi)\,d\phi=\int_{-\pi/2}^{\pi/2}\frac{f''(-\tan\phi)}{2\cos^2\phi}\,\sigma(x+\tan\phi)\,d\phi+\int_{\pi/2}^{3\pi/2}\frac{f''(-\tan\phi)}{2\cos^2\phi}\,\sigma(-x-\tan\phi)\,d\phi=\frac12\int_{-\infty}^{\infty} f''(z)\,|x-z|\,dz$$