1 Introduction
Deep Neural Networks (DNNs) are universal function approximators providing state-of-the-art solutions on a wide range of applications. Common perceptual tasks such as speech recognition, image classification, and object tracking are now routinely tackled via DNNs. Some fundamental problems remain: (1) the lack of a mathematical framework providing an explicit and interpretable input-output formula for any topology, (2) the quantification of DNN stability with respect to adversarial examples (i.e. modified inputs that fool DNN predictions while remaining undetectable to humans), (3) the absence of generalization guarantees and of controllable behavior for ambiguous patterns, (4) the difficulty of leveraging unlabeled data to apply DNNs to domains where expert labeling is scarce, as in the medical field. Answering those points would provide theoretical perspectives for further developments based on a common ground. Furthermore, DNNs are now deployed in applications with considerable societal impact, increasing the need to fill this theoretical gap to ensure control, reliability, and interpretability.
DNNs are models involving compositions of nonlinear and linear transforms. (1) We will provide a straightforward methodology to express the nonlinearities as affine spline functions. The linear part being a degenerate case of a spline function, we can rewrite any given DNN topology as a succession of such functionals, making the network itself a piecewise linear spline. This formulation provides a universal piecewise linear expression of the input-output mapping of DNNs, clarifying the role of its internal components. (2) In functional analysis, the regularity of a mapping is defined via its Lipschitz constant. Our formulation eases the analytical derivation of this stability variable, which measures the sensitivity to adversarial examples. For any given architecture, we provide a measure of risk to adversarial attacks. (3) Recently, the deep learning community has focused on the theory of flat and sharp minima to provide generalization guarantees. Flat minima are regions in the parameter space associated with great generalization capacities. We will first prove the equivalence between flat minima and spline smoothness. After bridging those theories, we will motivate a novel regularization technique pushing the learning of DNNs towards flat minima, maximizing generalization performance. (4) From (1) we will reinterpret DNNs as template matching algorithms. When coupled with insights derived from (2), we will integrate unlabeled data information into the network during learning. To do so, we will propose to guide DNN templates towards their input via a scheme that can be interpreted as a reconstruction formula for DNNs. This inversion can be computed efficiently by backpropagation, leading to no computational overhead. From this, any semi-supervised technique can be used out-of-the-box with current DNNs, for which we provide state-of-the-art results. Unsupervised tasks, considered the keystone of learning by the neuroscience community, would also become reachable to DNNs.
To date, these problems have been studied independently, leading to overspecialized solutions that are generally topology specific and cumbersome to incorporate into a pre-existing pipeline. In contrast, all the solutions proposed here require negligible software updates and are suited for efficient large-scale deployment.
Non-exhaustive list of the main contributions: 1) We first develop spline operators (SOs) (A.1.1), a natural generalization of multivariate spline functions, as well as their linear case (LSOs). LSOs are shown to “span” DNN layers, the latter being restricted cases of LSOs (3.2). From this, the composition of those operators leads to the explicit analytical input-output formula of DNNs, for any architecture (3.3). We then dive into some analysis:

Symbols
”Dummy” variable representing an input/observation 

”Dummy” variable representing an output/prediction associated to input  
Observation of shape .  
Target variable associated to , for classification ,  
for regression .  
(resp. )  Labeled training set with (resp. ) samples . 
Unlabeled training set with samples .  
Layer at level with internal parameters .  
Collection of all parameters .  
Deep Neural Network mapping with  
Shape of the representation at layer with and  
.  
Dimension of the flattened representation at layer with , and .  
Representation of at layer in an unflattened format of shape ,  
with  
Value at channel and spatial position .  
Representation of at layer in a flattened format of dimension  
Value at dimension 
2 Background: Deep Neural Networks for Function Approximation
Most problems of interest in applied mathematics take the form of function approximation. Two main cases arise: one where the target function to approximate is known, and one where only a set of samples is observed, providing limited information on the domain-codomain structure of
. The latter case is the one of supervised learning. Given the
training set with , the unknown functional is estimated through the approximator
. Finding an approximant with correct behavior on is usually an ill-posed problem with many possible solutions. Yet, each one might behave differently for new observations, leading to different generalization performance. Generalization is the ability to replicate the behavior of on new inputs not present in and thus not exploited to obtain . Hence, one seeks an approximator having the best generalization performance. In some applications, the unobserved is known to fulfill some properties, such as boundary and regularity conditions for PDE approximation. In machine learning, however, the lack of physics-based principles means that no property constrains the search for a good approximator
except the performance measure based on the training set and an estimate of generalization performance based on a test set. To tackle this search, one commonly resorts to a parametric functional where contains all the free parameters controlling the behavior of . The task thus “reduces” to finding the optimal set of parameters minimizing the empirical error on the training set and maximizing empirical generalization performance on the test set. We now refer to this estimation problem as a regression problem if is continuous and a classification problem if is categorical or discrete. We also restrict ourselves to being a Deep Neural Network (DNN) and denote . Also, is used for a generic input as opposed to the given sample .
DNNs are a powerful and increasingly applied machine learning framework for complex prediction tasks like object and speech recognition. In fact, they are proven to be universal function approximators [Cybenko, 1989, Hornik et al., 1989], fitting perfectly the function approximation context of supervised learning described above. There are many flavors of DNNs, including convolutional, residual, recurrent, probabilistic, and beyond. Regardless of the actual network topology, we represent the mapping from the input signal to the output prediction as . By its parametric nature, the behavior of is governed by its underlying parameters . All current deep neural networks boil down to a composition of “layer mappings” denoted by
(1) 
In all the following cases, a neural network layer at level is an operator
that takes as input a vector-valued signal
which at is the input signal and produces a vector-valued output . This succession of mappings is in general non-commutative, making the analysis of the complete sequence of generated signals crucial, denoted by
(2)
For concreteness, we will focus on processing channel inputs , such as RGB images, stereo signals, as well as multichannel representations, which we refer to as a “signal”. This signal is indexed , , where are usually spatial coordinates, and is the channel. Any signal with more index dimensions falls under the following analysis by adaptation of the notations and operators. Hence, the volume is of shape with and . For consistency with the introduced layer mappings, we will use , the flattened version of as depicted in Fig. 1. The dimension of is thus . In this section, we introduce the basic concepts and notations of the main layers used to build state-of-the-art DNNs, as well as standard training techniques to update the parameters .
2.1 Layers Description
In this section we describe the common layers one can use to create the mapping . The notations we introduce will be used throughout the report. We now describe the following: fully-connected, convolutional, nonlinearity, sub-sampling, skip-connection, and recurrent layers.
Fully-Connected Layer
A Fully-Connected (FC) layer is at the origin of DNNs known as Multi-Layer Perceptrons (MLPs)
[Pal and Mitra, 1992], composed exclusively of FC layers and nonlinearities. This layer performs a linear transformation of its input as
(3)
The internal parameters are defined as and . This linear mapping produces an output vector of length . In current topologies, FC layers are used at the end of the mapping, as layers and , for their capacity to perform nonlinear dimensionality reduction in order to output
output values. However, due to their high number of degrees of freedom
and the unconstrained internal structure of , MLPs inherit poor generalization performance for common perception tasks, as demonstrated on computer vision tasks in
[Zhang et al., 2016].
Convolutional Layer
The greatest accuracy improvements in DNNs occurred after the introduction of the convolutional layer
. Through convolutions, it leverages one of the most natural operations, used for decades in signal processing and template matching. In fact, as opposed to the FC layer, the convolutional layer is the cornerstone of DNNs dealing with perceptual tasks, thanks to its ability to perform local feature extraction from its input. It is defined as
(4) 
where a special structure is defined on so that it performs multichannel convolutions on the vector . To highlight this fact, we first recall the multichannel convolution operation performed on the unflattened input of shape given a filter bank composed of filters, each being a tensor of shape with . Hence with representing the depth of the filters, equal to the number of channels of the input, and the spatial size of the filters. The application of the linear filters on the signal forms another multichannel signal as
(5) 
where the output of this convolution contains channels, the number of filters in . Then a bias term is added for each output channel, shared across spatial positions. We denote this bias term as . As a result, to create channel of the output, we perform a convolution of each channel of the input with the impulse response and then sum those outputs elementwise over to finally add the bias leading to as
(6) 
In general, the input is first transformed in order to apply some boundary conditions such as zero-padding, symmetric or mirror. Those are standard padding techniques in signal processing
[Mallat, 1999]. We now describe how to obtain the matrix and vector corresponding to the operations of Eq. 6 but applied on the flattened input and producing the output vector . The matrix is obtained by replicating the filter weights into circulant-block-circulant matrices [Jayaraman et al., 2009] and stacking them into the supermatrix
(7)
We provide an example in Fig. 2 for and .
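As a minimal sketch of this flattened view, the following restricts itself to a single-channel 1D circular convolution (a simplifying assumption; the text treats multichannel 2D signals). It builds the circulant matrix induced by a filter and checks that the matrix-vector product reproduces the direct convolution. All function names are illustrative and do not come from the original.

```python
def circular_convolution(signal, kernel):
    """Direct circular convolution: y[i] = sum_k kernel[k] * signal[(i - k) mod N]."""
    n = len(signal)
    return [
        sum(kernel[k] * signal[(i - k) % n] for k in range(len(kernel)))
        for i in range(n)
    ]


def circulant_matrix(kernel, n):
    """Build the N x N matrix W such that W applied to a signal equals the convolution."""
    # Zero-pad the filter to the signal length (requires len(kernel) <= n).
    padded = list(kernel) + [0.0] * (n - len(kernel))
    # Entry (i, j) is the filter tap that multiplies signal[j] in output i.
    return [[padded[(i - j) % n] for j in range(n)] for i in range(n)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, -1.0, 0.5]

W = circulant_matrix(kernel, len(signal))
direct = circular_convolution(signal, kernel)
via_matrix = matvec(W, signal)
```

The 5 x 5 matrix has 25 entries but only 3 free parameters, which illustrates the reduction in degrees of freedom compared with an unconstrained matrix of the same size.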
Because the bias is shared across spatial positions, the bias term inherits a specific structure. It is defined by replicating on all spatial positions of each output channel :
(8)
The internal parameters of a convolutional layer are . The number of degrees of freedom for this layer is much lower than for an FC layer: it is of . If the convolution is circular, the spatial size of the output is preserved, leading to , and thus the output dimension only changes in the number of channels. Constraining the matrix to perform convolutions takes the special topology of the input into account; coupled with the low number of degrees of freedom and a high-dimensional output, this leads to very efficient training and generalization performance in many perceptual tasks, which we will discuss in detail in 4.1. While it remains difficult to understand what is encoded in the filters , it has been empirically shown that for images, the first filter bank applied on the input images converges toward an overcomplete Gabor filter bank, considered as a natural basis for images [Meyer, 1993, Olshausen et al., 1996]. Hence, many signal processing tools and results await to be applied for analysis.
Elementwise Nonlinearity Layer
A scalar/elementwise nonlinearity layer applies a nonlinearity to each entry of a vector and thus preserves the input vector dimension . As a result, this layer produces its output via application of across all positions as
(9)
The choice of nonlinearity greatly impacts the learning and performance of the DNN. For example, sigmoid and tanh are known to suffer from vanishing gradients for high-amplitude inputs, while ReLU-based activations lead to unbounded activations and dying-neuron problems. Typical choices include

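Sigmoid: $\sigma(u) = \frac{1}{1+e^{-u}}$,

tanh: $\sigma(u) = \frac{e^{u}-e^{-u}}{e^{u}+e^{-u}}$,

ReLU: $\sigma(u) = \max(u, 0)$,

Leaky ReLU: $\sigma(u) = \max(u, 0) + \eta \min(u, 0)$ with $\eta > 0$,

Absolute Value: $\sigma(u) = |u|$.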
The presence of nonlinearities in DNNs is crucial, as otherwise the composition of linear layers would produce another linear layer, with factorized parameters. When applied after an FC layer or a convolutional layer, we will consider the linear transformation and the nonlinearity as part of one layer. Hence we will denote for example.
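The typical nonlinearities listed above can be sketched with their standard textbook definitions. The function names and the default leak value `eta=0.01` are assumptions made for illustration, not the authors' notation.

```python
import math


def sigmoid(u):
    """Logistic sigmoid, written to avoid overflow for large negative inputs."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def relu(u):
    """Rectified linear unit: max(u, 0)."""
    return max(u, 0.0)


def leaky_relu(u, eta=0.01):
    """Leaky ReLU with a hypothetical default slope eta on the negative side."""
    return u if u > 0 else eta * u


def absolute_value(u):
    """Absolute value nonlinearity."""
    return abs(u)


def apply_elementwise(nonlinearity, z):
    """Nonlinearity layer: apply the scalar function to every entry of z."""
    return [nonlinearity(v) for v in z]


z = [-2.0, -0.5, 0.0, 1.5]
activations = {
    "sigmoid": apply_elementwise(sigmoid, z),
    "tanh": apply_elementwise(math.tanh, z),
    "relu": apply_elementwise(relu, z),
    "leaky_relu": apply_elementwise(leaky_relu, z),
    "abs": apply_elementwise(absolute_value, z),
}
```

Each output list has the same length as `z`, reflecting that an elementwise nonlinearity layer preserves the input dimension.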
Pooling Layer
A pooling layer performs a sub-sampling operation on its input according to a sub-sampling policy and a collection of regions on which is applied. We denote each region to be sub-sampled by with being the total number of pooling regions. Each region contains the set of indices on which the pooling policy is applied, leading to
(10) 
where is the pooling operator and
. Usually one uses mean or max pooling defined as

MaxPooling: ,

MeanPooling: .
The regions can be of different cardinality and can overlap . However, in order to treat all input dimensions, it is natural to require that each input dimension belongs to at least one region: . The benefits of a pooling layer are threefold. Firstly, by reducing the output dimension it allows for faster computation and lower memory requirements. Secondly, it greatly reduces the redundancy of information present in the input
. In fact, sub-sampling, even though linear, is common in signal processing after filter convolutions. Finally, in the case of max-pooling, gradients are only backpropagated through the pooled coefficient, enforcing specialization of the neurons. The latter is the cornerstone of the winner-take-all strategy, which states that each neuron specializes in what it performs best. Similarly to the nonlinearity layer, we consider the pooling layer as part of its previous layer.
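The pooling layer described above can be sketched as follows, with regions given as lists of indices into the flattened input. The function names and the coverage check are illustrative assumptions; the example uses overlapping regions of different cardinality to match the general case discussed in the text.

```python
def max_pool(values):
    """Max-pooling policy: keep the largest coefficient of the region."""
    return max(values)


def mean_pool(values):
    """Mean-pooling policy: average the coefficients of the region."""
    return sum(values) / len(values)


def pooling_layer(z, regions, policy):
    """Apply a pooling policy to each region (a list of indices into z)."""
    covered = set().union(*regions)
    if covered != set(range(len(z))):
        # Enforce that every input dimension belongs to at least one region.
        raise ValueError("regions must cover every input dimension")
    return [policy([z[i] for i in region]) for region in regions]


z = [1.0, 5.0, -2.0, 3.0, 0.0, 4.0]
regions = [[0, 1, 2], [2, 3], [4, 5]]  # index 2 is shared by two regions

max_pooled = pooling_layer(z, regions, max_pool)
mean_pooled = pooling_layer(z, regions, mean_pool)
```

The output has one entry per region, so the output dimension equals the number of pooling regions rather than the input dimension.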
Skip-Connection
A skip-connection layer can be considered as a bypass connection added between the input of a layer and its output. Hence, it allows the input of a layer, such as a convolutional layer or FC layer, to be linearly combined with its own output. The added connections lead to better training stability and overall performance, as there always exists a direct linear link from the input to all inner layers. Simply written, given a layer and its input , the skip-connection layer is defined as
(11)
In case of shape mismatch between and , a “reshape” operator is applied to before the elementwise addition. Usually this is done via a spatial down-sampling and/or through a convolutional layer with filters of spatial size .
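A minimal sketch of the skip-connection, assuming the plain additive form (layer output plus input) and an optional user-supplied reshape operator. Both toy layers and the down-sampling reshape are hypothetical examples chosen only to exercise the two cases.

```python
def skip_connection(layer, z, reshape=None):
    """Return layer(z) + z, reshaping z first when the shapes differ."""
    out = layer(z)
    shortcut = z if reshape is None else reshape(z)
    if len(shortcut) != len(out):
        raise ValueError("shortcut and layer output must have the same length")
    return [o + s for o, s in zip(out, shortcut)]


def scaled_relu_layer(z):
    """Toy shape-preserving layer: scale by 0.5, then apply ReLU."""
    return [max(0.5 * v, 0.0) for v in z]


def pairwise_sum_layer(z):
    """Toy layer that halves the dimension by summing neighbouring pairs."""
    return [z[i] + z[i + 1] for i in range(0, len(z) - 1, 2)]


def downsample(z):
    """Hypothetical reshape operator: keep every other entry."""
    return z[::2]


same_shape = skip_connection(scaled_relu_layer, [2.0, -4.0, 6.0])
reshaped = skip_connection(pairwise_sum_layer, [1.0, 2.0, 3.0, 4.0], reshape=downsample)
```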
Recurrent
Finally, another type of layer is the recurrent layer, which aims to act on time series. It is defined as a recursive application along time, transforming the input while also using its previous output. The simplest form of this layer is a fully recurrent layer defined as
(12)  
(13) 
While some applications use recurrent layers on images by considering the series of ordered local patches as a time series, the main application resides in sequence generation and analysis, especially with more complex topologies such as LSTM [Graves and Schmidhuber, 2005] and GRU [Chung et al., 2014] networks. We depict an example topology in Fig. 3.
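The recursion can be sketched as below. The defining equations are not reproduced in this text, so the sketch assumes the common Elman form, in which the new state is a nonlinearity applied to a linear function of the current input and the previous state. The tanh nonlinearity and the names `W_in`, `W_rec`, and `b` are assumptions, not the authors' notation.

```python
import math


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def recurrent_step(x_t, h_prev, W_in, W_rec, b):
    """One time step: h_t = tanh(W_in x_t + W_rec h_prev + b) (assumed form)."""
    pre_activation = [
        a + r + c
        for a, r, c in zip(matvec(W_in, x_t), matvec(W_rec, h_prev), b)
    ]
    return [math.tanh(p) for p in pre_activation]


def run_recurrent_layer(inputs, h0, W_in, W_rec, b):
    """Apply the step recursively along time and collect every state."""
    h, states = h0, []
    for x_t in inputs:
        h = recurrent_step(x_t, h, W_in, W_rec, b)
        states.append(h)
    return states


W_in = [[0.5, 0.0], [0.0, 0.5]]
W_rec = [[0.0, 1.0], [1.0, 0.0]]
b = [0.0, 0.0]
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

states = run_recurrent_layer(inputs, [0.0, 0.0], W_in, W_rec, b)
```

Because the same parameters are reused at every step, the parameter count does not depend on the length of the series.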
2.2 Deep Convolutional Network
The combination of the possible layers and their order matters greatly for final performance, and while many newly developed stochastic optimization techniques allow for faster learning, a suboptimal layer chain is almost never recoverable. We now describe a “typical” network topology, the deep convolutional network (DCN), to highlight the way the previously described layers can be combined to provide powerful predictors. Its main development goes back to [LeCun et al., 1995] for digit classification. A DCN is defined as a succession of blocks made of layers: Convolution → Elementwise Nonlinearity → Pooling layer. In a DCN, several such blocks are cascaded end-to-end to create a sequence of activation maps, usually followed by one or two FC layers. Using the above notations, a single block can be rewritten as . Hence a basic model with blocks and FC layers is defined as
(14)  
(15) 
The astonishing results that a DCN can achieve come from the ability of the blocks to convolve the learned filter banks with their input, “separating” the underlying features relevant to the task at hand. This is followed by a nonlinearity and a spatial sub-sampling to select, compress and reduce the redundant representation while highlighting task-dependent features. Finally, the MLP part simply acts as a nonlinear classifier, the final key for prediction. The duality between representation/high-dimensional mappings and the subsequent dimensionality reduction/classification is a core concept in machine learning, referred to as preprocessing-classification.
2.3 Learning
In order to optimize all the weights leading to the predicted output , one has at one's disposal (1) a labeled dataset
, (2) a loss function
, (3) a learning policy to update the parameters . In the context of classification, the target variable associated to an input is categorical . In order to predict such a target, the output of the last layer of a network is transformed via a softmax nonlinearity [de Brébisson and Vincent, 2015]. It is used to transform into a probability distribution and is defined as
(16)
thus leading to representing . The loss function used to quantify the distance between and is the cross-entropy (CE) defined as
(17)  
(18) 
For regression problems, the target is continuous and thus the final DNN output is taken as the prediction . The loss function is usually the ordinary squared error (SE) defined as
(19) 
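The softmax, cross-entropy, and squared error can be sketched with their standard definitions. The function names are illustrative, the squared error is shown as an unnormalized sum of squared differences (the normalization used in the original may differ), and the max-subtraction in the softmax is a common numerical-stability trick rather than something taken from the text.

```python
import math


def softmax(z):
    """Map a vector of scores to a probability distribution."""
    shift = max(z)  # subtracting the max leaves the result unchanged
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(z, label):
    """Cross-entropy between softmax(z) and a one-hot target at index `label`."""
    return -math.log(softmax(z)[label])


def squared_error(prediction, target):
    """Sum of squared differences, used here for regression targets."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target))


scores = [2.0, 1.0, 0.1]
probabilities = softmax(scores)
loss_if_class_0 = cross_entropy(scores, 0)
loss_if_class_2 = cross_entropy(scores, 2)
regression_loss = squared_error([1.0, 2.0], [1.0, 4.0])
```

The cross-entropy is smaller when the true class is the one with the highest score, which is the behavior the loss is meant to reward.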
Since all of the operations introduced above in standard DNNs are differentiable almost everywhere with respect to their parameters and inputs, given a training set and a loss function, one defines an update strategy for the weights . This takes the form of an iterative scheme based on a first-order optimization procedure. Updates for the weights are computed on each input and usually averaged over mini-batches containing exemplars with . This produces an estimate of the “correct” update for and is applied after each mini-batch. Once all the training instances of have been seen, after mini-batches, this terminates an epoch
. The dataset is then shuffled and this procedure is performed again. Usually a network needs hundreds of epochs to converge. For any given iterative procedure, the updates are computed for all the network parameters by
backpropagation [Hecht-Nielsen et al., 1988], which follows from applying the chain rule of calculus. Common policies are Gradient Descent (GD)
[Rumelhart et al., 1988], being the simplest application of backpropagation, Nesterov Momentum
[Bengio et al., 2013], which uses the last performed updates in order to “accelerate” convergence, and finally more complex adaptive methods with internal hyperparameters updated based on the weights/updates statistics, such as Adam [Kingma and Ba, 2014], Adadelta [Zeiler, 2012], Adagrad [Duchi et al., 2011], RMSprop
[Tieleman and Hinton, 2012], … Finally, to measure the actual performance of a trained network in the context of classification, one uses the accuracy loss defined as
(20)
such that a smaller value is better.
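The training loop described above (shuffle, split into mini-batches, average the per-example gradients, update, repeat over epochs) can be sketched on a toy problem. The model here is a one-dimensional linear regressor with squared error rather than a DNN, and the learning rate, batch size, and epoch count are arbitrary illustrative choices. The accuracy loss is assumed to be the misclassification rate, since its defining equation is not reproduced in this text.

```python
import random


def minibatch_gradient_descent(data, lr=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y = w * x + b by mini-batch gradient descent on the squared error."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # reshuffle the dataset at every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average the per-example gradients over the mini-batch.
            grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


def accuracy_loss(predictions, targets):
    """Fraction of misclassified examples (assumed definition), lower is better."""
    errors = sum(1 for p, t in zip(predictions, targets) if p != t)
    return errors / len(targets)


# Noise-free samples of y = 3x - 1 on a regular grid in [-1, 1].
data = [(k / 10.0, 3.0 * (k / 10.0) - 1.0) for k in range(-10, 11)]
w, b = minibatch_gradient_descent(data)
error_rate = accuracy_loss([0, 1, 1, 2], [0, 1, 2, 2])
```

In a DNN the two hand-written gradients would be replaced by backpropagation through all layers, but the surrounding loop keeps the same structure.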
3 Understand Deep Neural Networks Internal Mechanisms
In this section we develop spline operators, a generalization of spline functions which also generalize DNN layers. By doing so, we open DNNs to explicit analysis and, in particular, to an understanding of their behavior and potential through spline region inference. DNNs will be shown to leverage template matching, a standard technique in signal processing to tackle perception tasks.
Let us first motivate the need to adopt the theory of spline functions for deep learning and machine learning in general. As described in Sec. 2, the task at hand is to use parametric functionals to be able to understand [Cheney, 1980], predict, and interpolate the world around us [Reinsch, 1967]. For example, partial differential equations allow one to approximate real-world physics
[Bloor and Wilson, 1990, Smith, 1985] based on grounded principles. For this case, one knows the underlying laws that must be fulfilled by . In machine learning, however, one only has access to observed inputs or input-output pairs . To tackle this approximation problem, splines offer great advantages. From a computational standpoint, polynomials are very efficient to evaluate via, for example, the Horner scheme [Peña, 2000]. Yet, polynomials have “chaotic” behaviors, especially as their degree grows, leading to Runge's phenomenon [Boyd and Xu, 2009], synonymous with extremely poor interpolation and extrapolation capacities. On the other hand, low-degree polynomials are not flexible enough for modeling arbitrary functionals. Splines, however, are defined as a collection of polynomials, each one acting on a specific region of the input space. The collection of possible regions forms a partition of the input space. On each of those regions , the associated and usually low-degree polynomial is used to transform the input . Through the per-region activation, splines can model highly nonlinear functionals, yet the low-degree polynomials avoid Runge's phenomenon. Hence, splines are the tools of choice for functional approximation if one seeks robust interpolation/extrapolation performance without sacrificing the modeling of potentially very irregular underlying mappings . In fact, as we will now describe in detail, current state-of-the-art DNNs are linear spline functions, and we now proceed to develop the notations and formulation accordingly. Let us first briefly recall the case of multivariate linear splines.
Definition 1.
Given a partition of , we denote multivariate spline with local mappings , with the mapping
(21)  
(22) 
where the input dependent selection is abbreviated via
(23) 
If the local mappings are linear we have with and . We denote this functional as a multivariate linear spline:
(24)  
(25) 
where we make the polynomial parameters explicit via and .
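Definition 1 can be sketched as follows: a multivariate linear spline first selects the region containing the input, then applies the slope vector and offset attached to that region. The representation of the partition as a `region_of` function returning a region index is an implementation assumption, and the two-half-plane example is hypothetical; it happens to reproduce a ReLU applied to the sum of the two coordinates.

```python
def linear_spline(x, region_of, slopes, offsets):
    """Evaluate a multivariate linear spline at x.

    region_of(x) returns the index of the partition region containing x;
    slopes[r] and offsets[r] are the linear parameters used on region r.
    """
    r = region_of(x)
    return sum(a * v for a, v in zip(slopes[r], x)) + offsets[r]


def half_plane_region(x):
    """Hypothetical partition of the plane into two half-planes."""
    return 1 if x[0] + x[1] >= 0 else 0


# Region 0 maps everything to zero, region 1 returns x[0] + x[1].
slopes = [[0.0, 0.0], [1.0, 1.0]]
offsets = [0.0, 0.0]

inside = linear_spline([1.0, 2.0], half_plane_region, slopes, offsets)
outside = linear_spline([-1.0, -2.0], half_plane_region, slopes, offsets)
```

The function is linear inside each region and nonlinear overall, which is the property exploited when nonlinearities are rewritten as splines.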
In the next sections, we study the capacity of linear spline operators to span standard DNN layers. The full development of the spline operator, as well as a detailed review of multivariate spline functions, is contained in Appendix A.1.1. Afterwards, the composition of the developed linear operators will lead to the explicit analytical input-output mapping of DNNs, allowing us to derive all the theoretical results in the remainder of the report. In the following sections, we omit the cases of regularity constraints on the presented functionals, thus leading to the most general cases.
3.1 Spline Operators
A natural extension of spline functions is the spline operator (SO), which we denote . We present here a general definition and propose in the next section an intuitive way to construct spline operators via a collection of multivariate splines, the special case of current DNNs.
Definition 2.
A spline operator is a mapping defined by a collection of local mappings associated with a partition of denoted as s.t.
where we denoted the region specific mapping associated to the input by .
A special case occurs when the mappings are linear. We thus define in this case the linear spline operator (LSO), which will play an important role for DNN analysis. In this case, , with . As a result, an LSO can be rewritten as
where we denoted the collection of slopes and biases as , and finally the input-specific activation as and .
Such operators can also be defined via a collection of multivariate polynomial (resp. linear) splines. Given multivariate spline functions , their respective outputs are “stacked” to produce an output vector of dimension . The internal parameters of each multivariate spline are , a partition of with and . Stacking their respective outputs to form an output vector leads to the induced spline operator .
Definition 3.
The spline operator defined with multivariate splines with is defined as
(26) 
with , .
The use of multivariate splines to construct an SO does not directly provide the explicit collection of mappings and regions . Yet, it is clear that the SO is jointly governed by all the individual multivariate splines. Let us first present some intuition on this fact. The spline operator output is computed with each of the splines having “activated” a region-specific functional depending on its own input space partitioning. In particular, each region of the input space leading to a specific joint configuration is the one of interest, leading to and . We can thus write explicitly the new regions of the spline operator based on the ensemble of partitions of all the involved multivariate splines as
(27)
We also denote the number of regions associated to this SO as . From this, the local mappings of the SO correspond to the joint mappings of the splines being activated on , which we denote
(28)
with s.t. . In fact, for each region of the SO there is a unique region for each of the splines , such that it is a subset as and it is disjoint from all others . In other words, we have the following property:
(29)
where we recall that .
This leads to the following SO formulation
(30) 
We can now study the case of linear splines leading to LSOs. If an SO is constructed via aggregation of linear multivariate splines, the result is an LSO. The linearity allows notational simplifications. It is defined as
(31) 
with , , .
Clearly, the collection of matrices and biases together with the partitions completely defines an LSO. Hence, we denote the set of all possible matrices and biases as , . Any LSO is thus written as .
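The construction of an LSO from stacked linear splines can be sketched as follows. Each spline reports which of its own regions contains the input; the tuple of those indices identifies the joint region of the operator (the intersection of the individual regions), and the selected slope vectors and offsets are stacked into a matrix and a bias vector. The dictionary layout and the names `relu_spline`, `lso_parameters`, and `lso_apply` are illustrative assumptions. The example stacks one ReLU-like spline per input dimension.

```python
def relu_spline(k, dim):
    """Linear spline computing max(x[k], 0) on an input of dimension `dim`."""
    unit = [1.0 if j == k else 0.0 for j in range(dim)]
    return {
        "region_of": lambda x: int(x[k] >= 0),  # two regions: x[k] < 0 or x[k] >= 0
        "slopes": [[0.0] * dim, unit],
        "offsets": [0.0, 0.0],
    }


def lso_parameters(x, splines):
    """Return the joint region code and the stacked matrix and bias for input x."""
    region_code = tuple(s["region_of"](x) for s in splines)
    A = [list(s["slopes"][r]) for s, r in zip(splines, region_code)]
    b = [s["offsets"][r] for s, r in zip(splines, region_code)]
    return region_code, A, b


def lso_apply(x, splines):
    """Evaluate the operator as A x + b with the region-specific parameters."""
    _, A, b = lso_parameters(x, splines)
    return [sum(a * v for a, v in zip(row, x)) + bias for row, bias in zip(A, b)]


splines = [relu_spline(k, 2) for k in range(2)]
x = [1.5, -2.0]
region_code, A, b = lso_parameters(x, splines)
output = lso_apply(x, splines)

# Probing one point per quadrant visits every joint region of this operator.
probes = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
joint_regions = {lso_parameters(p, splines)[0] for p in probes}
```

Two splines with two regions each yield four joint regions here, one per sign pattern of the input, and within each of them the operator is a single affine map.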
3.2 Linear Spline Operator: Generalized Neural Network Layers
In this section we demonstrate how current DNNs are expressed as compositions of LSOs. We first describe layer-specific notations and analytical formulas, and then compose LSOs to obtain analytical DNN mappings in the next section.
3.2.1 Nonlinearity layers
We first analyze the elementwise nonlinearity layer. Our analysis deals with any given nonlinearity. If this nonlinearity is by definition a spline, such as ReLU, leaky ReLU, or absolute value, it falls directly into this analysis. If not, arbitrary functions such as tanh or sigmoid are approximated via linear splines. We recall that a nonlinearity layer is defined by applying a nonlinearity on each dimension of its input and produces a new output vector . While in general the same nonlinearity is applied on each dimension, we present here a more general case where one has a specific per dimension. In addition, we present the case where the nonlinearity might not act on only one input dimension but on any part of the input. We thus define by the nonlinearity acting on the input dimension, . We illustrate well-known nonlinearities in Table 1, which are cases where the output dimension at position only depends on the input dimension of .
ReLU[Glorot et al., 2011]  LReLU[Xu et al., 2015]  Abs.Value 
