Connections between integration along hypersurfaces, Radon transforms, and neural networks are exploited to highlight an integral-geometric interpretation of neural networks. By analyzing the properties of neural networks as operators on probability distributions of observed data, we show that the distribution of outputs for any node in a neural network can be interpreted as a nonlinear projection along hypersurfaces defined by level surfaces over the input data space. We utilize these descriptions to provide new interpretations of phenomena such as nonlinearity, pooling, activation functions, and adversarial examples in neural network-based learning problems.
Artificial Neural Networks (NNs) have long been used as a mathematical modeling method and have recently found numerous applications in science and technology, including computer vision, signal processing, and machine learning, to name a few. Although NN-based methods are recognized as powerful techniques, much remains to be explored about neural networks as mathematical operators (one notable exception is the function approximation results in [2, 3]). As a consequence, NN practitioners are often left with open questions such as: How does depth add nonlinearity in a NN? What is the effect of different activation functions? What are the effects of pooling? And many others.
This didactic note is meant to highlight an alternative interpretation of NN-based techniques and their use in supervised learning problems. By investigating the connections of machine learning classification methods with projections along hyperplanes and hypersurfaces, we highlight the links between different NN architectures and the integration geometry of linear and nonlinear Radon transforms. We then use these concepts to examine different properties of neural networks, which may help shed light on the questions raised above, as well as potentially provide a path for novel studies and developments. For brevity and to reduce prerequisites, the derivations presented fall short of rigorous mathematical proofs. The Python code to reproduce all of the figures used here is available at https://github.com/rohdelab/radon-neural-network.
Let $X$ be a compact domain of a manifold in Euclidean space $\mathbb{R}^d$ (the space corresponding to input digital data) and let $f: X \to Y$ represent a map (oracle) which assigns outputs (e.g., labels) $y = f(x)$ to input data $x$ (a random variable). In learning problems, $y$ is usually a vector for which the value of the $k$-th element represents the probability that the sample $x$ belongs to the $k$-th class, although other regression problems can also be formulated with the same approach.
Omitting here a measure-theoretic formulation (see the references for a more complete development), let $p_X$ and $p_Y \in L^1$ (the space of absolutely integrable functions) define the probability density functions (PDFs) of the random variables $x$ and $y$, respectively. Now, utilizing a technique often used in the theoretical physics community, known as the random variable transformation (RVT), we can write the PDF of the output as a function of $p_X$ via:

$$p_Y(y) = \int_X p_X(x)\,\delta\big(y - f(x)\big)\,dx,$$

where $\delta$ is the standard Dirac distribution. See the supplementary material for a derivation. The same transformation-of-random-variables technique can be used to derive the distribution of the output of any model $\hat f$ applied to $x$.
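The RVT identity above can be checked numerically: a Monte Carlo estimate of the density of $y = f(x)$ should match the change-of-variables formula. The following sketch is not from the paper; the choice $f = \tanh$ is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample x ~ N(0, 1) and push it through an (illustrative) nonlinear oracle f.
x = rng.standard_normal(200_000)
f = np.tanh
y = f(x)

# RVT says p_Y is the pushforward of p_X; for a monotone f this reduces to
# p_Y(t) = p_X(f^{-1}(t)) |d f^{-1}/dt|.  Compare with a histogram estimate.
t = 0.5                 # evaluate the density at y = t
h = 0.01
empirical = np.mean(np.abs(y - t) < h) / (2 * h)   # histogram estimate of p_Y(t)

x_t = np.arctanh(t)                                # f^{-1}(t)
analytic = (np.exp(-x_t**2 / 2) / np.sqrt(2 * np.pi)) / (1 - t**2)

print(empirical, analytic)   # the two values should be close
```

The same Monte Carlo check applies to any node of a network, since each node is just another map $f$ applied to the input random variable.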
The goal in a regression task is to estimate a model $\hat{f}$ so that it accurately 'predicts' the dependent variable $y$ for each input $x$. In other words, we wish to find $\hat{f} \approx f$ over the distribution of the input space. To that end, 'goodness of fit' measures are used to fit a model to given labeled data (supervised learning). One popular approach is to find the $\hat{f}$ that minimizes the discrepancy between $\hat{y} = \hat{f}(x)$ and $y = f(x)$ according to a dissimilarity measure $\mathcal{L}$:

$$\min_{\hat f} \int_X \mathcal{L}\big(\hat f(x), f(x)\big)\, p_X(x)\, dx,$$

which can be interpreted in relation to the random variables $y$ and $\hat y$ and their respective distributions $p_Y$ and $p_{\hat Y}$. For instance, the cross-entropy minimization strategy can be viewed as a maximum likelihood estimate of the model, which is equivalent to minimizing the KL-divergence between $p_Y$ and $p_{\hat Y}$.
Next, we consider the formulations of the standard Radon transform and its generalized version, and demonstrate a connection between these transformations and the statistical learning concepts reviewed above.
The standard Radon transform, $\mathcal{R}$, maps a distribution $p$ to the infinite set of its integrals over the hyperplanes of $\mathbb{R}^d$ and is defined as

$$\mathcal{R}p(t, \theta) = \int_{\mathbb{R}^d} p(x)\,\delta(t - x \cdot \theta)\,dx,$$

where $\delta$ is the one-dimensional Dirac delta function, $\theta \in \mathbb{S}^{d-1}$, where $\mathbb{S}^{d-1}$ is the unit sphere in $\mathbb{R}^d$, and $t \in \mathbb{R}$. Each hyperplane can be written as:

$$H(t, \theta) = \{x \in \mathbb{R}^d : x \cdot \theta = t\},$$

which alternatively could be thought of as the level set of the function $g(x) = x \cdot \theta$. For a fixed $\theta$, the integrals over all hyperplanes orthogonal to $\theta$ define a continuous function, $\mathcal{R}p(\cdot, \theta): \mathbb{R} \to \mathbb{R}$, that is a projection/slice of $p$. We note that the Radon transform is more broadly defined as a linear operator $\mathcal{R}: L^1(\mathbb{R}^d) \to L^1(\mathbb{R} \times \mathbb{S}^{d-1})$. Figure 1 provides a visual representation of the Radon transform, the integration hyperplanes (i.e., lines for $d = 2$), and the slices $\mathcal{R}p(\cdot, \theta)$.
The Radon transform is an invertible linear transformation (i.e., a linear bijection). The inverse of the Radon transform, denoted by $\mathcal{R}^{-1}$, is defined as:

$$p(x) = \mathcal{R}^{-1}\big(\mathcal{R}p\big)(x) = \int_{\mathbb{S}^{d-1}} \big(\mathcal{R}p(\cdot, \theta) * \eta\big)(x \cdot \theta)\, d\theta,$$

where $\eta$ is a one-dimensional high-pass filter with corresponding Fourier transform $\hat\eta(\omega) = c\,|\omega|^{d-1}$ (it appears due to the Fourier slice theorem; see the supplementary material) and '$*$' denotes the convolution operation. The above definition of the inverse Radon transform is also known as the filtered back-projection method, which is extensively used for image reconstruction in the biomedical imaging community. Intuitively, each one-dimensional projection/slice, $\mathcal{R}p(\cdot, \theta)$, is first filtered via a high-pass filter and then smeared back into $\mathbb{R}^d$ along the corresponding hyperplanes to approximate $p$. The summation of all filtered, smeared approximations then reconstructs $p$. Note that in practice, acquiring an infinite number of projections is not feasible; therefore, the integration in the filtered back-projection formulation is replaced with a finite summation over projections.
In most machine learning applications one does not have direct access to the actual distribution of the data, but only to samples $x_n \sim p_X$, $n = 1, \dots, N$. In such scenarios the empirical distribution of the data is used as an approximation for $p_X$:

$$p_X(x) \approx \frac{1}{N} \sum_{n=1}^{N} \delta(x - x_n),$$

where $\delta$ is the Dirac delta function in $\mathbb{R}^d$. Then, it is straightforward to show that the Radon transform of the empirical distribution is:

$$\mathcal{R}p_X(t, \theta) \approx \frac{1}{N} \sum_{n=1}^{N} \delta(t - x_n \cdot \theta).$$

See the supplementary material for detailed derivations of these equations. Given the high-dimensional nature of $p_X$, estimating the density in $\mathbb{R}^d$ requires a large number of samples. The projections/slices $\mathcal{R}p_X(\cdot, \theta)$, however, are one-dimensional, and therefore a large number of samples may not be critical for estimating these one-dimensional densities.
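Concretely, the empirical Radon slice along a direction $\theta$ is just the one-dimensional collection of dot products $x_n \cdot \theta$. A minimal sketch (the isotropic-Gaussian data are illustrative, not from the paper); isotropy implies every slice is standard normal:

```python
import numpy as np

rng = np.random.default_rng(1)

# Samples from an isotropic Gaussian in R^5.
d, N = 5, 100_000
X = rng.standard_normal((N, d))

# A random direction theta on the unit sphere S^{d-1}.
theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)

# The empirical Radon slice R p_X(., theta) is the 1-D distribution of the
# projections t_n = x_n . theta (a sum of Diracs, summarized here by moments).
t = X @ theta
print(t.mean(), t.std())   # ~0 and ~1 for any theta, by isotropy
```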
Now, let us consider the supervised learning of a linear binary classifier. Given the data samples $x_n$ and their corresponding labels $y_n \in \{0, 1\}$, the task is to learn a linear function $w \cdot x$ of the input samples, for $w \in \mathbb{R}^d$, such that

$$\hat y_n = \begin{cases} 1, & w \cdot x_n > 0 \\ 0, & \text{otherwise.} \end{cases}$$

Many methods exist to obtain the optimal $w$. Noting that the same $w$ is applied to each sample, we can consider $w \cdot x$ as an operator and inquire about the distribution $p_Z$ of $z = w \cdot x$. Here $p_Z$ is the density of $z$ when $x \sim p_X$. One can clearly see that $p_Z$ corresponds to a slice of the input distribution with respect to the direction $w$, hence there is a natural relationship between the Radon transform and linear classification. Figure 2 depicts this phenomenon.
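This slicing view of a linear classifier can be sketched numerically: the class-conditional distributions, projected onto the classifier direction, are exactly the slices the classifier "sees." The mean-difference direction below is an illustrative stand-in for a trained $w$, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two Gaussian classes in R^10, separated along the first coordinate.
d, N = 10, 5_000
mu = np.zeros(d); mu[0] = 3.0
X0 = rng.standard_normal((N, d))          # class 0
X1 = rng.standard_normal((N, d)) + mu     # class 1

# A simple (illustrative, not optimal) linear discriminant direction:
# the normalized difference of class means.
w = X1.mean(axis=0) - X0.mean(axis=0)
w /= np.linalg.norm(w)

# The classifier sees only the slice of the data distribution along w:
# the 1-D projected class-conditional densities.
t0, t1 = X0 @ w, X1 @ w
print(t0.mean(), t1.mean())   # well-separated projected classes
```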
The generalized Radon transform (GRT) extends the original idea of the classic Radon transform introduced by J. Radon from integration over hyperplanes of $\mathbb{R}^d$ to integration over hypersurfaces [7, 8] (i.e., $(d-1)$-dimensional manifolds). The GRT has various applications, including Thermoacoustic Tomography (TAT), where the hypersurfaces are spheres, and Electrical Impedance Tomography (EIT), where integration over hyperbolic surfaces appears.
To formally define the GRT, we introduce a function $H(x, \theta)$ defined on $\mathbb{R}^d \times (\mathbb{R}^n \setminus \{0\})$. We say that $H$ is a defining function when it satisfies the four conditions below:

1. $H$ is a real-valued $C^\infty$ function on $\mathbb{R}^d \times (\mathbb{R}^n \setminus \{0\})$;
2. $H$ is homogeneous of degree one in $\theta$, i.e., $H(x, \lambda\theta) = \lambda H(x, \theta)$ for all $\lambda \in \mathbb{R}$;
3. $H$ is non-degenerate in the sense that $\partial H / \partial x \neq 0$ in $\mathbb{R}^d \times (\mathbb{R}^n \setminus \{0\})$;
4. The mixed Hessian of $H$ is strictly positive, i.e., $\det\big(\partial^2 H / \partial x_i\, \partial \theta_j\big) > 0$.
For a given defining function, $H$, the generalized Radon transform $\mathcal{G}$ is a linear operator defined as:

$$\mathcal{G}p(t, \theta) = \int_{\mathbb{R}^d} p(x)\,\delta\big(t - H(x, \theta)\big)\,dx.$$

From a geometrical perspective, and for fixed $t$ and $\theta$, $\mathcal{G}p(t, \theta)$ is the integral of $p$ along the hypersurface $\{x : H(x, \theta) = t\}$. Note that the classic Radon transform is a special case of the generalized Radon transform with $H(x, \theta) = x \cdot \theta$.
The investigation of sufficient and necessary conditions for the injectivity of GRTs is a long-standing topic [7, 8]. Conditions 1–4 for a defining function, $H$, enumerated in this section, are necessary for injectivity but not sufficient. Though the inversion of the GRT is beyond the scope of this article, an inversion approach is given in the literature.
Here, we list a few examples of known defining functions that lead to injective GRTs. The circular defining function, $H(x, \theta) = \|x - r\theta\|_2$ with $r \in \mathbb{R}^+$ and $\theta \in \mathbb{S}^{d-1}$, was shown to provide an injective GRT. More interestingly, homogeneous polynomials of odd degree also yield an injective GRT, i.e.,

$$H(x, \theta) = \sum_{|\alpha| = m} \theta_\alpha x^\alpha,$$

where we use the multi-index notation $\alpha = (\alpha_1, \dots, \alpha_d) \in \mathbb{N}^d$, $|\alpha| = \sum_{i=1}^d \alpha_i$, and $x^\alpha = x_1^{\alpha_1} \cdots x_d^{\alpha_d}$. Here, the summation iterates over all possible multi-indices $\alpha$ such that $|\alpha| = m$, where $m$ denotes the (odd) degree of the polynomial and $\theta_\alpha \in \mathbb{R}$. The parameter set for homogeneous polynomials is then $\theta \in \mathbb{S}^{n_m - 1}$, where $n_m$ is the number of such multi-indices. We can observe that choosing $m = 1$ reduces $H$ to the linear case $x \cdot \theta$, since the set of multi-indices with $|\alpha| = 1$ becomes $\{(1, 0, \dots, 0), \dots, (0, \dots, 0, 1)\}$ and contains $d$ elements.
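The multi-index bookkeeping above can be made concrete with a short sketch (the helper names are ours, not the paper's). It enumerates all $\alpha$ with $|\alpha| = m$, evaluates $H(x, \theta)$, and confirms that $H$ is homogeneous of degree one in $\theta$:

```python
import itertools
import numpy as np

def multi_indices(d, m):
    """All multi-indices alpha in N^d with |alpha| = m."""
    return [a for a in itertools.product(range(m + 1), repeat=d) if sum(a) == m]

def H_poly(x, theta, alphas):
    """Homogeneous-polynomial defining function H(x, theta) = sum_a theta_a x^a."""
    return sum(th * np.prod(x ** np.array(a)) for th, a in zip(theta, alphas))

d, m = 2, 3                        # 2-D data, degree-3 (odd) polynomial
alphas = multi_indices(d, m)       # [(0, 3), (1, 2), (2, 1), (3, 0)]
theta = np.ones(len(alphas)) / np.sqrt(len(alphas))   # a point on the sphere
x = np.array([0.5, -1.2])

# H is linear, hence homogeneous of degree one, in theta:
print(np.isclose(H_poly(x, 2 * theta, alphas), 2 * H_poly(x, theta, alphas)))
```

For $m = 1$ the enumeration returns exactly the $d$ coordinate indices, recovering the linear case $x \cdot \theta$.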
To illustrate the relationship between deep neural networks and the generalized Radon transform we start by describing the link between perceptrons and the standard Radon transform.
Let $z = \sigma(w \cdot x)$, with $w \in \mathbb{R}^d$, define a perceptron for input data $x$, where we have absorbed the bias, $b$, into $w$. Treating $x$ as a random variable and using the RVT, it is straightforward to show that $p_Z$ is isomorphic to a single slice of $\mathcal{R}p_X$, namely $\mathcal{R}p_X(\cdot, w)$, when $\sigma$ is invertible (see the supplementary material for a proof). This isomorphic relationship provides a fresh perspective on perceptrons, stating that the distribution of the perceptron's output, $p_Z$, is equivalent to an integration of the original data distribution, $p_X$, along hyperplanes (see Equation (6)). In addition, one can show that the distribution of the output of a perceptron is equal to a generalized Radon transform of $p_X$ with a defining function determined by the perceptron.

An important and distinctive point here is that we are interested in the distribution of the output of a perceptron, $p_Z$, and its relationship to the original distribution of the data, $p_X$, as opposed to the individual responses of the perceptron, $z_n = \sigma(w \cdot x_n)$. Columns (a) and (b) in Figure 3 demonstrate the level sets (level curves, since the data here are two-dimensional) and the corresponding line integrals for two perceptrons. Note that samples that lie on the same level set are mapped to the same projection (a constant value of $z$). In other words, samples that lie on the same level set of $\sigma(w \cdot x)$ are indistinguishable in the range of the perceptron. Next we discuss the case of having multiple perceptrons.
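The isomorphism between the perceptron output distribution and the Radon slice can be observed empirically: for a monotone (invertible) $\sigma$, the cumulative distribution of $z = \sigma(w \cdot x)$ at $\sigma(t_0)$ equals that of the slice $w \cdot x$ at $t_0$. A minimal sketch, with illustrative choices of $w$ and data:

```python
import numpy as np

rng = np.random.default_rng(3)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Samples in R^3 and a perceptron z = sigma(w . x).
X = rng.standard_normal((50_000, 3))
w = np.array([1.0, -2.0, 0.5])
t = X @ w              # linear part: samples from the Radon slice R p_X(., w)
z = sigmoid(t)         # perceptron output

# Because sigma is strictly monotone, the output distribution is just the
# slice pushed through sigma: P(z <= sigma(t0)) = P(t <= t0) for any t0.
t0 = 0.7
print(np.mean(z <= sigmoid(t0)), np.mean(t <= t0))   # identical fractions
```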
To obtain a hierarchical (multilayer) model, the concept of a perceptron can be applied recursively. As before, let $W_1$ and $W_2$ correspond to two matrices whose rows contain sets of projection vectors (the different $w$'s of the preceding section): e.g., the first row of $W_1$ is the transpose of the projection vector corresponding to the first node/perceptron in layer 1. A two-layer NN model can be written as $\sigma(W_2\, \sigma(W_1 x))$. Expanding the idea further, we may then define a general formula for an $L$-layer NN as

$$f(x) = w \cdot \sigma\big(W_L\, \sigma(W_{L-1} \cdots \sigma(W_1 x))\big).$$

Note that $w$ above refers to a column vector which collapses the output of the neural network to one node, and that $W_l \in \mathbb{R}^{d_l \times d_{l-1}}$, where $d_l$ is the number of neurons in the $l$'th layer of a deep neural network.
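The $L$-layer formula can be sketched directly; the layer widths and parameter values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

def mlp_scalar(x, Ws, w, sigma=np.tanh):
    """f(x) = w . sigma(W_L sigma(... sigma(W_1 x))): an L-layer net
    collapsed to a single output node by the final vector w."""
    h = x
    for W in Ws:
        h = sigma(W @ h)
    return float(w @ h)

d = 2
Ws = [rng.standard_normal((8, d)), rng.standard_normal((8, 8))]  # W_1, W_2
w = rng.standard_normal(8)                                       # collapsing vector

x = np.array([0.3, -0.7])
print(mlp_scalar(x, Ws, w))   # a single scalar response for input x
```

Because $\tanh$ keeps each hidden coordinate in $[-1, 1]$, the output is bounded by $\sum_i |w_i|$ regardless of $x$.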
Now, let $\sigma$ be a Lipschitz continuous nonlinear activation function; its composition with itself is therefore also Lipschitz continuous. For invertible activation functions $\sigma$, and for $W_l$ square and invertible, the gradient of the multi-layer perceptron in equation (12) does not vanish in any compact subset of $\mathbb{R}^d$, and therefore the level sets are well behaved. Therefore, from the definition in (1), the distribution over the output node can be considered a slice of the generalized Radon transform of $p_X$, with the defining function given by the network map and the projection parameters given by the network weights.
Figure 3, columns (c) and (d), demonstrates the level sets and the line integrals obtained using a multi-layer perceptron as the nonlinear projection. Column (c) is initialized randomly, and column (d) shows the result after the network parameters are trained in a supervised classification setting to discriminate the modes of the half-moon distribution. It can be seen that after training, the level sets only traverse a single mode of the half-moon distribution, which indicates that samples from different modes are not projected onto the same point (i.e., the distribution is not integrated across different modes). The facility with which neural networks can generate highly nonlinear functions, even with relatively few parameters, also becomes readily apparent (below we compare these to other polynomials). We note that with just one layer, NNs can form nonlinear decision boundaries, as the superposition of surfaces formed by individual perceptrons can add curvature to the resulting surface. Note also that, generally speaking, the integration streamlines (hypersurfaces for higher-dimensional data) have the ability to become more curved (nonlinear) as the number of layers increases.
It has been noted recently that NNs (e.g., convolutional neural networks) can at times work better when non-invertible activation functions such as the rectified linear unit (ReLU) are used [10, 11, 12]. This experience has encouraged others to try different activation functions such as the 'leaky' ReLU. While a theory describing which type of activation function should be used with which type of learning problem is yet to emerge, the interpretation of NNs as nonlinear projections can help highlight the differences between activation function types. Specifically, Figure 4 can help visualize the effects of different activation functions on the integration geometry over the input data space.
First note that the ReLU is a non-invertible map, given that all negative values map to zero. This causes the surface generated by a perceptron constructed with the ReLU to have a region over which it is flat, whereby all points in that region are integrated and mapped to the same value (zero) in the output space. This ability may provide ReLU neural networks with the flexibility to construct adaptable characteristic-function-type models for different regions in the data space, although the outcome of the optimization procedure will dictate whether such regions emerge in the final model. Finally, note that both the ReLU and the leaky-ReLU activation functions contain non-differentiable points, which are also imparted on the surface function (hence the sharp 'kinks' that appear over the iso-surface lines).
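The flattening effect of the ReLU on the output distribution can be seen numerically: for zero-mean data, roughly half of the probability mass collapses onto the single value 0, i.e., $p_Z$ acquires a point mass. A minimal sketch with illustrative data and weights:

```python
import numpy as np

rng = np.random.default_rng(5)

relu = lambda t: np.maximum(t, 0.0)

# Perceptron z = ReLU(w . x) on symmetric (zero-mean Gaussian) data:
# the half-space {x : w . x < 0} is flattened onto the single value 0.
X = rng.standard_normal((100_000, 4))
w = rng.standard_normal(4)
z = relu(X @ w)

print(np.mean(z == 0.0))   # ~0.5: half the input space collapses to one point
```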
Pooling (e.g., average or maximum sampling) operations are typically used in large neural networks, especially CNNs. The reasons are often practical, as subsampling can be used as a way to control and reduce the dimension (number of parameters) of the model. Another often-stated reason is that pooling can add a certain amount of invariance (e.g., to translation) to the NN model. In the case of average pooling, it is clear that the operation can be written as a linear operator folded into equation (12): the pooling can be performed by replacing a particular row of $W_l$ with the desired linear combination of two rows, for example. 'Max'-pooling, on the other hand, selects the maximum surface value (perceptron response) over all surfaces (each generated by a different perceptron) in a given layer. Figure 5 shows a graphical description of the concept, though it should be noted that, as defined above, it is not the integration lines that are being added, but rather the underlying 'level' surfaces.
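The distinction between average pooling (a linear operator that can be folded into the weight matrices) and max pooling (a selection over perceptron surfaces) can be sketched as follows; the pooling matrix $P$ and all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# A layer of 4 perceptrons acting on a 2-D input.
W = rng.standard_normal((4, 2))
x = rng.standard_normal(2)
h = sigmoid(W @ x)                     # the 4 perceptron "surface" values at x

# Average pooling of node pairs is a linear operator P acting on h,
# so it can be absorbed into the next layer's weight matrix ...
P = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5]])
avg_pooled = P @ h

# ... while max pooling selects the largest surface value in each pair
# and is not expressible as a fixed linear operator.
max_pooled = np.array([h[:2].max(), h[2:].max()])
print(avg_pooled, max_pooled)
```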
It has often been noted that highly flexible nonlinear learning systems such as CNNs can be 'brittle' in the sense that a seemingly small perturbation of the input data can cause the learning system to produce confident, but erroneous, outputs. Such perturbed examples are often termed adversarial examples. Figure 7 utilizes the integral-geometric perspective described above to provide a visualization of how neural networks (as well as other classification systems) can be fooled by small perturbations. To find the minimum displacement that could cause misclassification, using the blue point as the starting point, we perform gradient ascent until we reach the other side of the decision boundary (which is indicated by the orange point). We limit the magnitude of the displacement so that it is small enough that the two points belong to the same distribution. However, once integrated along the isosurfaces corresponding to the NN drawn in the figure, due to the uneven curvature of the corresponding surface, the two points are projected onto opposite ends of the output node, thus fooling the classifier into making a confident, but erroneous, prediction.
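The gradient-ascent procedure described above can be sketched with a fixed linear-sigmoid model standing in for a trained network; all parameter values below are illustrative:

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# An (illustrative) fixed "classifier" f(x) = sigma(w . x + b) in 2-D,
# standing in for a trained NN output node.
w = np.array([2.0, -1.0])
b = -0.1

def f(x):
    return sigmoid(w @ x + b)

def grad_f(x):
    s = f(x)
    return s * (1 - s) * w      # chain rule through the sigmoid

# Start just on the "class 0" side and ascend the output in small unit
# steps until the decision boundary f(x) = 0.5 is crossed.
x0 = np.array([0.0, 0.2])       # f(x0) < 0.5 here
x = x0.copy()
eps = 0.05
while f(x) < 0.5:
    x = x + eps * grad_f(x) / np.linalg.norm(grad_f(x))

print(np.linalg.norm(x - x0), f(x))  # a small displacement flips the prediction
```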
In this note we explored links between Radon transforms and artificial neural networks to study the properties of the latter as operators on the probability density function associated with a given learning problem. More specifically, it can be shown that the probability density function associated with any output node of a neural network (or a single perceptron within a hierarchical NN) can be interpreted as a particular hypersurface integration operation over the input distribution $p_X$. This interpretation has natural connections with generalized Radon transforms, which similarly represent a high-dimensional PDF by integrals defined over linear or nonlinear hypersurfaces in $\mathbb{R}^d$. The projections can be linear (in the case of simple linear logistic regression) or nonlinear, in which case the projections are computed over nonlinear hypersurfaces. The analogy has limitations, however: depending on the number of nodes in each layer, as well as the choice of activation function, the conditions for the generalized Radon transform (i.e., invertibility, homogeneity, etc.) may not be satisfied by specific neural network architectures.
Despite these limitations, the analogy is useful for providing a mechanistic understanding of NN operators, and it may also be useful as a path to study the effect of different neural network architectures and related concepts (number of layers, number of nodes in each layer, choice of activation function, recurrency, etc.), as well as to provide ideas for alternative models. For example, other types of projections may be considered within a learning problem. Figure 6 compares linear projections, circular projections, a homogeneous polynomial of degree 5, and an ANN of depth 1, all trained to minimize the logistic regression cost function. While it is clear that linear and circular projections do not have enough 'flexibility' to solve the separation problem, a homogeneous polynomial of degree 5 seems to emulate the behavior of an ANN of depth 1. It is possible that in the future, the point of view provided by analyzing the nonlinear projections associated with different NNs can provide inspiration for alternative models.
Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
P. Baldi and K. Hornik, “Neural networks and principal component analysis: Learning from examples without local minima,” Neural Networks, vol. 2, no. 1, pp. 53–58, 1989.
G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, 1989.
V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 315–323.
A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
To define the inverse of the Radon transform we start with the Fourier slice theorem. Let $\mathcal{F}_d$ denote the $d$-dimensional Fourier transform; then the one-dimensional Fourier transform of a projection/slice is

$$\mathcal{F}_1\big(\mathcal{R}p(\cdot, \theta)\big)(\omega) = \mathcal{F}_d(p)(\omega\theta),$$

which indicates that the one-dimensional Fourier transform of each projection/slice is equal to a slice of the $d$-dimensional Fourier transform in spherical coordinates. Taking the inverse $d$-dimensional Fourier transform of $\mathcal{F}_d(p)$ in Cartesian coordinates then leads to the reconstruction of $p$; the change from spherical to Cartesian coordinates introduces the Jacobian factor $|\omega|^{d-1}$, which gives rise to the high-pass filter $\eta$ and is often approximated up to a small angle-independent constant $c$.
Here we show the derivation of Equation (1). Recall that $f$ is the true map (oracle) from the data samples to their corresponding labels, $y = f(x)$. Let $g$ be an arbitrary real function; then we can write $g(y) = g(f(x))$. By definition, the average of the quantity on the left with respect to $p_Y$ should be equal to the average of the quantity on the right with respect to $p_X$, and we can write:

$$\int_Y g(y)\, p_Y(y)\, dy = \int_X g(f(x))\, p_X(x)\, dx.$$

Now let $g(\cdot) = \delta(y' - \cdot)$ for a fixed $y'$. For the left-hand side of the equation above we have:

$$\int_Y \delta(y' - y)\, p_Y(y)\, dy = p_Y(y'),$$

and for the right-hand side we have:

$$\int_X \delta\big(y' - f(x)\big)\, p_X(x)\, dx.$$

Equating the two sides concludes the derivation. For a more complete analysis, see the references on the random variable transformation.
For the perceptron, $z = \sigma(w \cdot x)$, where $w \in \mathbb{R}^d$ and $\sigma$ is invertible, the distribution of the output can be obtained from:

$$p_Z(z) = \int_X p_X(x)\, \delta\big(z - \sigma(w \cdot x)\big)\, dx;$$

on the other hand, the Radon slice of $p_X$ is obtained from

$$\mathcal{R}p_X(t, w) = \int_X p_X(x)\, \delta(t - w \cdot x)\, dx.$$

We first show that having $\mathcal{R}p_X(\cdot, w)$ one can recover $p_Z$. Let $z = \sigma(t)$, where $t = w \cdot x$; therefore, using the RVT, the distribution of $z$ is equal to:

$$p_Z(z) = \int_{\mathbb{R}} \mathcal{R}p_X(t, w)\, \delta\big(z - \sigma(t)\big)\, dt.$$

Now we show the reverse argument. For invertible $\sigma$, let $t = \sigma^{-1}(z)$; then we can obtain the distribution of $t$ from:

$$\mathcal{R}p_X(t, w) = \int p_Z(z)\, \delta\big(t - \sigma^{-1}(z)\big)\, dz;$$

therefore the two distributions, $p_Z$ and $\mathcal{R}p_X(\cdot, w)$, are isomorphic.