Artificial neural networks were introduced in the 1940s as a mathematical model for biological neural networks. In the mid-1980s there was a spike of interest in this area, including from the mathematical community, which is largely attributed to the invention of the back-propagation method for training neural networks. However, the amount of research in the field slowly diminished by the end of the 1990s. The advancement of computational tools during the last decade has largely contributed to a revival of interest in the field. Deeper architectures have been observed to perform better than shallow ones, and faster GPUs have accelerated deep network training, since the computations can be carried out in parallel over the network's dataflow structure. However, despite recent developments in both theory and practical tools, many fundamental questions, even those concerning shallow neural network representations and training, remain unanswered.
The focus of this paper is the so-called integral representation approach to shallow networks. Early mathematical research on artificial neural networks was mainly focused on investigating the expressive power [8, 7, 17] and the approximation complexity [10, 2, 18] of single-hidden-layer neural networks. One of the main approaches employed was to represent the target function as an integral whose appropriate discretization results in a neural network approximation. There were also attempts to investigate the integral representations themselves independently [9, 2, 19, 16, 11]. This line of work is, in particular, motivated by the ridgelet transform introduced by E. Candès and its later variations such as the curvelet and contourlet transforms.
1.1 Shallow neural networks
A shallow neural network with activation $\sigma$ and $N$ nodes is a function of the form
$$f_\theta(x) = \sum_{j=1}^{N} c_j\,\sigma(w_j \cdot x - b_j), \tag{1}$$
where $w_j$, $b_j$, and $c_j$ are called the inner weights, biases, and outer weights, respectively.
In this paper we consider the ReLU (rectified linear unit) activation, given by $\sigma(t) = \max\{0, t\}$, which seems to be the conventional choice of activation in most modern architectures.
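As a concrete illustration, here is a minimal NumPy sketch of a shallow ReLU network of the form (1); the function names and parameter values are illustrative placeholders, not taken from the paper:

```python
import numpy as np

def relu(t):
    """ReLU activation: sigma(t) = max(0, t), applied elementwise."""
    return np.maximum(t, 0.0)

def shallow_net(x, W, b, c):
    """Evaluate f(x) = sum_j c_j * relu(w_j . x - b_j) on a batch of inputs.

    x: (n, d) inputs; W: (N, d) inner weights; b: (N,) biases; c: (N,) outer weights.
    """
    return relu(x @ W.T - b) @ c

# Arbitrary placeholder parameters for a network with d = 3 inputs and N = 4 nodes.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
W = rng.normal(size=(4, 3))
b = rng.normal(size=4)
c = rng.normal(size=4)
y = shallow_net(x, W, b, c)   # one output per input row, shape (5,)
```

Note that scaling $(w_j, b_j)$ by $\lambda > 0$ and leaving $c_j$ fixed scales the output by $\lambda$, by the positive homogeneity of the ReLU.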
Let $f$ be the target function, e.g. an image classifier, the solution to a PDE, a specific parameter associated with a model, etc. A neural network represents a computationally simple parametric family (since propagating an input through a neural network consists of matrix multiplications and activations). The aim is to find (learn, train) a neural network representation $f_\theta$ of the target function such that $f_\theta \approx f$. Some of the challenges that arise are:
Neural network architectures are typically selected by trial and error, and often it is not clear beforehand how many layers and nodes should be taken in the network;
Models lack interpretability; in particular, it is not clear what the weights represent;
The training process is computationally expensive and there is no good initialization method for the training;
There are many stability issues, both during the training process (sensitivity to initialization) and post training (sensitivity to adversarial attacks), among many others.
1.2 Integral representations of shallow networks
Our main approach is to think of a shallow network as a discretization of a suitable integral representation of the target function $f$. More precisely, the sum (1) is regarded as a discretization of an integral of the form
$$f(x) = \int \sigma(w \cdot x - b)\, d\mu(w, b), \tag{2}$$
where $\mu$ is a Radon measure on the parameter space. Note that a shallow network with $N$ nodes can be written in the form (2) with an atomic measure $\mu$ with $N$ atoms, so this type of representation is quite general. We believe this approach may open a wide area for research, potentially leading to faster and more stable algorithms for neural network training, and to architectures best fitted for specific problems. The integral form of a neural network representation is more concise and better suited for analysis. From a numerical perspective, utilizing various discretization methods may allow one to obtain a network that approximates the target function while completely bypassing the optimization-based training process. The constructed neural network can then be refined further via gradient-based training.
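As a toy illustration of this discretization idea (the target, kernel, and sampling scheme below are illustrative choices, not the paper's construction): for the univariate representation $f(x) = \int \sigma(x - b)\,c(b)\,db$ with the constant kernel $c \equiv 2$ on $[-1, 1]$, direct integration gives $f(x) = (x + 1)^2$ for $x \in [-1, 1]$, and Monte-Carlo sampling of the biases yields a shallow network without any gradient-based training:

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)
rng = np.random.default_rng(0)

# Kernel c(b) = 2 on [-1, 1]; then f(x) = int relu(x - b) c(b) db = (x + 1)^2
# for x in [-1, 1].
N = 200_000
b = rng.uniform(-1.0, 1.0, size=N)        # Monte-Carlo samples of the bias
outer = np.full(N, 2.0) * 2.0 / N         # importance weights c(b) / p(b) / N, p = 1/2

x = np.linspace(-1.0, 1.0, 11)
approx = relu(x[:, None] - b[None, :]) @ outer   # shallow net with N random nodes
exact = (x + 1.0) ** 2
err = np.max(np.abs(approx - exact))      # Monte-Carlo error, O(1/sqrt(N))
```

The resulting atomic measure places mass `outer[j]` at the parameter point $(1, b_j)$, which is exactly a network of the form (1).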
Neural network integral representations have been considered by various authors, where typically the Radon measure $\mu$ is assumed to be of a special form, e.g. supported on a given set or absolutely continuous with respect to a probability measure. One specific type of integral representation, discussed below, originates from the harmonic analysis perspective on shallow neural networks and employs the ridgelet transform. There it is assumed that the target function can be written as
$$f(x) = \int_{\mathbb{R}^d \times \mathbb{R}} T(w, b)\,\eta(w \cdot x - b)\,dw\,db =: (R^*_\eta T)(x). \tag{3}$$
The function $R^*_\eta T$ is called the dual ridgelet transform of the function $T$ with respect to $\eta$. The 'direct' transform $R_\psi f$, called the ridgelet transform of $f$ with respect to $\psi$, is given by
$$(R_\psi f)(w, b) = \int_{\mathbb{R}^d} f(x)\,\overline{\psi(w \cdot x - b)}\,dx. \tag{4}$$
It is shown that if the pair $(\psi, \eta)$ satisfies an admissibility condition, which requires in particular that
$$\int_{\mathbb{R}} \frac{\overline{\hat\psi(\zeta)}\,\hat\eta(\zeta)}{|\zeta|^{d}}\, d\zeta$$
be finite, nonzero, and appropriately normalized, then the reconstruction formula $R^*_\eta R_\psi f = f$ holds, thus providing a particular integral representation of the target function $f$.
Methods for discretizing integral representations of the form (2) have been considered by various authors. In particular, one approach employs a greedy method to discretize the solution of (2) with the smallest total variation norm; the same problem has also been solved via the conditional gradient algorithm. In [20, 21] the authors suggested Monte-Carlo sampling of the parameters to discretize the integral, together with regularized least-squares to find the values of the outer weights. They call this method the random vector functional-link (RVFL) network. A related Monte-Carlo discretization method has been considered for integral representations of radial basis function (RBF) neural networks, and integral representations have also been used to obtain better weight initializations. Other related work includes [12, 13, 14, 15, 3, 4].
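A minimal sketch of this Monte-Carlo approach in the univariate ReLU setting (the target function, sampling ranges, and the use of plain least squares in place of a regularized solve are all illustrative simplifications): only the outer weights are fitted, the inner weights and biases stay random.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda t: np.maximum(t, 0.0)

target = lambda x: np.sin(3.0 * x)                 # toy univariate target
x = np.linspace(-1.0, 1.0, 400)

N = 200
w = rng.choice([-1.0, 1.0], size=N)                # random inner weights (unit "sphere" in 1D)
b = rng.uniform(-1.5, 1.5, size=N)                 # random biases

A = relu(x[:, None] * w[None, :] - b[None, :])     # design matrix: A[i, j] = relu(w_j x_i - b_j)
c, *_ = np.linalg.lstsq(A, target(x), rcond=None)  # outer weights by least squares

err = np.max(np.abs(A @ c - target(x)))            # fit error on the grid
```

Since only a linear system is solved, no gradient-based training is involved; the price is that the random features must cover the relevant range of biases.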
1.3 Our approach
We note that, due to the positive homogeneity of the ReLU activation function, the representation (1) can be rewritten with the inner weights on the unit sphere. In this setting, we consider target functions that admit integral representations of the form
$$f(x) = \int \sigma(w \cdot x - b)\, c(w, b)\, d\lambda(w, b), \tag{5}$$
where $\lambda$ is the Lebesgue measure on the parameter domain and $c \in L_1(\lambda)$, the class of Lebesgue integrable functions with respect to $\lambda$. Note that, for every admissible $x$, the function $(w, b) \mapsto \sigma(w \cdot x - b)$ is bounded on the relevant parameter domain, and so the integral (5) is well defined. The three main problems posed here in this regard are the following:
Characterize the functions $f$ that have a representation of the form (5),
For a given $f$, find all kernels $c$ for which (5) holds,
Find the least $L_1$-norm solution $c$ to (5).
In the univariate case we have fully solved these problems. A similar approach was employed in [24, 25]; however, to our knowledge, the characterization theorems in the first two bullets above are new and presented here for the first time.
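The renormalization of the weights to the unit sphere used above rests on the positive homogeneity of the ReLU, $\sigma(\lambda t) = \lambda\,\sigma(t)$ for $\lambda > 0$, which gives $c\,\sigma(w \cdot x - b) = (c\,|w|)\,\sigma\big((w/|w|) \cdot x - b/|w|\big)$. A quick numerical check of this identity (all values arbitrary):

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)
rng = np.random.default_rng(1)

# c * relu(w . x - b) == (c * |w|) * relu((w/|w|) . x - b/|w|)
x = rng.normal(size=3)
w = rng.normal(size=3)
b, c = 0.4, 1.7
r = np.linalg.norm(w)

lhs = c * relu(w @ x - b)
rhs = (c * r) * relu((w / r) @ x - b / r)
```

The identity holds for every input, so restricting the inner weights to the unit sphere loses no generality.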
2 Main results
The next theorem gives a sufficient condition for $f$ to have a representation of the form (5). Before stating it, we introduce the following two transforms. The Radon transform of a function $f$ is the function $Rf$ given by the formula
$$Rf(w, b) = \int_{\{x \,:\, w \cdot x = b\}} f(x)\, dx,$$
where the integration is with respect to the $(d-1)$-dimensional Lebesgue measure on the hyperplane $\{x : w \cdot x = b\}$. The Hilbert transform of a univariate function $g$ is defined as
$$(Hg)(b) = \frac{1}{\pi}\,\mathrm{p.v.}\!\int_{\mathbb{R}} \frac{g(t)}{b - t}\, dt.$$
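A small numerical illustration of the Hilbert transform (a discrete, FFT-based periodic version, offered as an illustrative sketch rather than the exact operator used in the proofs), checking the classical identity $H[\cos] = \sin$:

```python
import numpy as np

# Discrete Hilbert transform via the Fourier multiplier -i * sgn(frequency),
# on a full-period grid where the FFT version is essentially exact.
n = 4096
t = np.linspace(0, 2 * np.pi, n, endpoint=False)
f = np.cos(5 * t)

F = np.fft.fft(f)
k = np.fft.fftfreq(n)                            # signed frequencies
H = np.real(np.fft.ifft(-1j * np.sign(k) * F))   # apply the multiplier

err = np.max(np.abs(H - np.sin(5 * t)))          # H[cos(5t)] = sin(5t)
```

The multiplier form $\widehat{Hg}(\zeta) = -i\,\mathrm{sgn}(\zeta)\,\hat{g}(\zeta)$ is the standard frequency-domain description of the principal-value integral above.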
We now formulate one of our main results which provides a particular kernel for the integral representation (5).
Any compactly supported function admits the representation
The stated theorem offers a way to construct a specific kernel $c$ for which the integral representation formula (5) holds. A similar result was previously proved for the Heaviside function, which is the derivative of the ReLU.
2.1 Univariate target functions
Note that in the univariate case the unit sphere consists of two points, so the inner weights can be parametrized by a sign. The hyperplane then consists of a single point, and the Radon transform reduces to evaluation of $f$ at that point. Thus Theorem 1 provides, for any compactly supported function, that
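In the spirit of this univariate reduction, one can check numerically that a smooth compactly supported $f$ is recovered by integrating the ReLU kernel against its second derivative, $f(x) = \int_{\mathbb{R}} \sigma(x - b)\, f''(b)\, db$ (integrate by parts twice; the boundary terms vanish by compact support). A sketch with an illustrative bump function and finite-difference derivatives:

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)

def bump(x):
    """Smooth bump supported on (-1, 1)."""
    out = np.zeros_like(x)
    inside = np.abs(x) < 1.0
    out[inside] = np.exp(-1.0 / (1.0 - x[inside] ** 2))
    return out

b = np.linspace(-1.5, 1.5, 20001)                  # grid for the bias variable
h = b[1] - b[0]
f2 = np.gradient(np.gradient(bump(b), h), h)       # finite-difference f''

xs = np.linspace(-0.8, 0.8, 9)
recon = (relu(xs[:, None] - b[None, :]) * f2).sum(axis=1) * h   # Riemann sum in b
err = np.max(np.abs(recon - bump(xs)))             # quadrature + differencing error
```

Here $f''$ plays the role of the weight function in the univariate integral representation.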
In this subsection we provide a more general representation and extend the set of admissible functions by defining the class consisting of those functions $f$ for which $f'$ exists everywhere, $f''$ exists almost everywhere, and an appropriate integrability condition holds.
The following theorem characterizes the class of target functions that admit the integral representation (5) with an integrable kernel.
The function admits the representation
with some if and only if has the form
Moreover, the set of such weight functions coincides with the set of functions of the form
The following theorem characterizes the class as the functions that admit an integral representation with an appropriate integrable weight function.
A function belongs to the class if and only if it admits the representation
with a kernel satisfying
Since in Theorem 3 we assume the kernel to be integrable, we can pose the interesting question of finding the kernel with the smallest $L_1$-norm for a given target function $f$. In particular, $\ell_1$-norm minimization is commonly used in compressed sensing for finding sparse solutions. In the next theorem we answer this question.
For the minimum
is attained at
where the minimization is performed over the set of all Radon measures equipped with the total variation norm; for the relevant class of functions the minimum is exactly equal to
They do not make assumptions on the smoothness of the function; instead, the second derivative is interpreted in the distributional sense. This result is consistent with that of Theorem 4 after a suitable change of variables.
3.1 Proof of Theorem 1
For any we have
The statement of the lemma is trivial in the one-dimensional case. In higher dimensions, consider the following change of variables given by the spherical coordinates
where and , and . The area element on the unit sphere is given by . Therefore we obtain
To complete the proof, we change the variable to to get and . Substituting into the above integral provides the required result. ∎
We also use the following technical result, which is a corollary of Proposition 8.1 from .
Let $h$ be the Heaviside function, i.e. $h(t) = 1$ for $t \ge 0$ and $h(t) = 0$ for $t < 0$. Then for any compactly supported function the following reconstruction formula holds:
3.2 Proof of Theorem 2
Before proving the theorem we perform several related calculations. From the equality we obtain
and, in the same way,
From the mutual orthogonality of the functions we deduce
where we changed the variable to . Similarly,
We now prove the direct implication. Assume that a function admits the integral representation
with a weight function . Note that due to the mutual orthogonality of the functions we can assume without loss of generality that by replacing the weight function with
Then from the mutual orthogonality of we deduce
We will show that is in the class . First, we show that . Indeed, from condition (10) we get for any
By using the relation we obtain
Then from condition (10) we get . By a similar argument we have .
Next, we show the existence of the derivative and the corresponding estimate. Let $h$ denote the Heaviside step function; then, by the dominated convergence theorem, we get for any
Taking into account estimate (11), we derive the required bound. The remaining condition is proved in a similar way.
Next we show that the second derivative exists almost everywhere. Indeed, from (12) we see that it exists at every Lebesgue point of the integrand, which covers almost every point. In that case we get
Finally, by changing the variable from to , we estimate
Lastly, we show that the weight function has the form (8) with some . Denote
Then and satisfies conditions (10), hence
From (13) we deduce that for almost all
Combining these relations we conclude
Since we initially subtracted the corresponding term from the weight function, in the general case we will have
which completes the proof of the direct implication.
We now prove the converse implication. Assume that the function has the form
with some function and constants. Similarly to the direct case, we can assume the same normalization by replacing the function with
We will show that and that . First, note that
and hence . Taking into account that and using the assumption we get