Deep Neural Networks (DNNs) are universal function approximators providing state-of-the-art solutions on a wide range of applications. Common perceptual tasks such as speech recognition, image classification, and object tracking are now routinely tackled via DNNs. Some fundamental problems remain: (1) the lack of a mathematical framework providing an explicit and interpretable input-output formula for any topology, (2) the quantification of DNN stability with respect to adversarial examples (i.e., modified inputs that fool DNN predictions while remaining undetectable to humans), (3) the absence of generalization guarantees and of controllable behavior for ambiguous patterns, (4) the leveraging of unlabeled data to apply DNNs to domains where expert labeling is scarce, as in the medical field. Answering those points would provide theoretical perspectives for further developments based on a common ground. Furthermore, DNNs are now deployed in applications of tremendous societal impact, increasing the need to fill this theoretical gap to ensure control, reliability, and interpretability.
DNNs are models involving compositions of nonlinear and linear transforms. (1) We will provide a straightforward methodology to express the nonlinearities as affine spline functions. The linear part being a degenerate case of a spline function, we can rewrite any given DNN topology as a succession of such functionals, making the network itself a piecewise linear spline. This formulation provides a universal piecewise linear expression of the input-output mapping of DNNs, clarifying the role of its internal components. (2) In functional analysis, the regularity of a mapping is quantified via its Lipschitz constant. Our formulation eases the analytical derivation of this stability measure, which quantifies the sensitivity to adversarial examples. For any given architecture, we provide a measure of the risk of adversarial attacks. (3) Recently, the deep learning community has revisited the theory of flat and sharp minima to provide generalization guarantees. Flat minima are regions of the parameter space associated with great generalization capacity. We will first prove the equivalence between flat minima and spline smoothness. After bridging those theories, we will motivate a novel regularization technique pushing the learning of DNNs towards flat minima, maximizing generalization performance. (4) From (1) we will reinterpret DNNs as template matching algorithms. When coupled with insights derived from (2), we will integrate unlabeled data information into the network during learning. To do so, we will propose to guide the DNN templates towards their input via a scheme that can be interpreted as a reconstruction formula for DNNs. This inversion can be computed efficiently by backpropagation, leading to no computational overhead. From this, any semi-supervised technique can be used out-of-the-box with current DNNs, for which we provide state-of-the-art results.
Unsupervised tasks, considered the keystone of learning by the neuroscience community, would also become accessible to DNNs. To date, those problems have been studied independently, leading to over-specialized solutions that are generally topology specific and cumbersome to incorporate into a pre-existing pipeline. In contrast, all the proposed solutions require negligible software updates and are suited for efficient large-scale deployment.
Non-exhaustive list of the main contributions
1) We first develop spline operators (SOs) (Appendix A.1.1), a natural generalization of multivariate spline functions, as well as their linear case (LSOs). LSOs are shown to "span" DNN layers, the latter being restricted cases of LSOs (Sec. 3.2). From this, the composition of those operators leads to the explicit analytical input-output formula of DNNs, for any architecture (Sec. 3.3). We then dive into some analysis:
|"Dummy" variable representing an input/observation|
|"Dummy" variable representing an output/prediction associated to input|
|Observation of shape .|
|Target variable associated to , for classification , for regression .|
|(resp. )||Labeled training set with (resp. ) samples .|
|Unlabeled training set with samples .|
|Layer at level with internal parameters .|
|Collection of all parameters .|
|Deep Neural Network mapping with|
|Shape of the representation at layer with and|
|Dimension of the flattened representation at layer with , and .|
|Representation of at layer in an unflattened format of shape ,|
|Value at channel and spatial position .|
|Representation of at layer in a flattened format of dimension|
|Value at dimension|
2 Background: Deep Neural Networks for Function Approximation
Most problems of interest in applied mathematics take the form of function approximation. Two main cases arise: one where the target function to approximate is known, and one where only a set of samples is observed, providing limited information on the domain-codomain structure of the target function. The latter case is that of supervised learning. Given the training set of input-output samples, the unknown functional is estimated through an approximator. Finding an approximant with correct behavior on the training set is usually an ill-posed problem with many possible solutions. Yet each solution might behave differently on new observations, leading to different generalization performances. Generalization is the ability to replicate the behavior of the unknown functional on new inputs that are not present in the training set and thus were not exploited to obtain the approximator. Hence, one seeks the approximator with the best generalization performance. In some applications, the unobserved functional is known to fulfill some properties, such as boundary and regularity conditions for PDE approximation. In machine learning, however, the lack of physics-based principles provides no property constraining the search for a good approximator, except the performance measure based on the training set and an estimate of generalization performance based on a test set. To tackle this search, one commonly resorts to a parametric functional whose free parameters control its behavior. The task thus "reduces" to finding the set of parameters that minimizes the empirical error on the training set and maximizes the empirical generalization performance on the test set. We refer to this estimation problem as a regression problem if the target is continuous and as a classification problem if the target is categorical or discrete. We also restrict the approximator to be a Deep Neural Network (DNN). Finally, the notation distinguishes a generic input from a given training sample.
DNNs are a powerful and increasingly applied machine learning framework for complex prediction tasks like object and speech recognition. In fact, they are proven to be universal function approximators [Cybenko, 1989, Hornik et al., 1989], fitting perfectly the function approximation context of supervised learning described above. There are many flavors of DNNs, including convolutional, residual, recurrent, probabilistic, and beyond. Regardless of the actual network topology, we represent the DNN as a single mapping from the input signal to the output prediction. By its parametric nature, the behavior of this mapping is governed by its underlying parameters. All current deep neural networks boil down to a composition of "layer mappings".
In all the following cases, a neural network layer at a given level is an operator that takes as input a vector-valued signal (the input signal itself at the first level) and produces a vector-valued output. This succession of mappings is in general non-commutative, which makes the analysis of the complete sequence of generated signals crucial.
For concreteness, we will focus on processing multi-channel inputs, such as RGB images, stereo signals, as well as multi-channel representations, which we refer to as a "signal". This signal is indexed by coordinates that are usually spatial and by a channel index. Any signal with more index dimensions falls under the following analysis by adapting the notations and operators. Hence, the volume has a shape given by its number of channels and its spatial sizes. For consistency with the introduced layer mappings, we will use the flattened version of this volume, as depicted in Fig. 1. The dimension of the flattened signal is thus the product of these sizes. In this section, we introduce the basic concepts and notations of the main layers used to build state-of-the-art DNNs, as well as the standard training techniques used to update the parameters.
2.1 Layers Description
In this section we describe the common layers one can use to create the DNN mapping. The notations we introduce will be used throughout the report. We now describe the following layers: fully-connected, convolutional, nonlinearity, sub-sampling, skip-connection, and recurrent.
The Fully-Connected (FC) layer is at the origin of the DNNs known as Multi-Layer Perceptrons (MLPs) [Pal and Mitra, 1992], which are composed exclusively of FC-layers and nonlinearities. This layer performs a linear transformation of its input: the input vector is multiplied by a weight matrix and a bias vector is added. The internal parameters are thus the weight matrix and the bias vector, and the length of the output vector equals the number of rows of the weight matrix. In current topologies, FC layers are used at the end of the mapping, as the last layers, for their capacity to perform nonlinear dimensionality reduction down to the desired number of output values. However, due to their high number of degrees of freedom and the unconstrained internal structure of the weight matrix, MLPs inherit poor generalization performance on common perception tasks, as demonstrated on computer vision tasks in [Zhang et al., 2016].
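As a minimal sketch of the FC layer described above, the following NumPy snippet applies a weight matrix and a bias to a flattened input and counts the degrees of freedom. The function name `fc_layer` and the dimensions `d_in` and `d_out` are illustrative assumptions, not notation from the report.

```python
import numpy as np

def fc_layer(z, W, b):
    """Affine map of a flattened input z: returns W @ z + b."""
    return W @ z + b

rng = np.random.default_rng(0)
d_in, d_out = 6, 3                      # hypothetical input/output dimensions
W = rng.standard_normal((d_out, d_in))  # unconstrained weight matrix
b = rng.standard_normal(d_out)          # bias vector
z = rng.standard_normal(d_in)           # flattened input signal

out = fc_layer(z, W, b)
n_params = W.size + b.size              # d_out * d_in + d_out degrees of freedom
```

Every entry of `W` is a free parameter, which is the unconstrained structure the text contrasts with the convolutional layer.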
The greatest accuracy improvements in DNNs occurred after the introduction of the convolutional layer. Through convolutions, it leverages one of the most natural operations, used for decades in signal processing and template matching. In fact, as opposed to the FC-layer, the convolutional layer is the cornerstone of DNNs dealing with perceptual tasks, thanks to its ability to perform local feature extraction from its input. It is defined, like the FC-layer, as a matrix-vector product followed by the addition of a bias, but a special structure is imposed on the matrix so that it performs multi-channel convolutions on the flattened input vector. To highlight this fact, we first recall the multi-channel convolution operation performed on the unflattened input, given a filter bank composed of several filters, each being a three-dimensional tensor. The depth of each filter equals the number of channels of the input, and its remaining two dimensions are the spatial size of the filter. Applying the linear filters to the signal forms another multi-channel signal, whose number of channels equals the number of filters in the filter bank. A bias term, shared across spatial positions, is then added for each output channel. As a result, to create a given channel of the output, we convolve each channel of the input with the corresponding channel of the filter's impulse response, sum those outputs element-wise over the input channels, and finally add the bias.
In general, the input is first transformed in order to apply some boundary conditions such as zero-padding, symmetric, or mirror extension. Those are standard padding techniques in signal processing [Mallat, 1999]. We now describe how to obtain the matrix and the vector corresponding to the operations of Eq. 6, but applied to the flattened input and producing the flattened output vector. The matrix is obtained by replicating the filter weights into circulant-block-circulant matrices [Jayaraman et al., 2009] and stacking them into a super-matrix.
We provide an example in Fig. 2 for and .
Because the bias is shared across spatial positions, the bias vector inherits a specific structure: it is obtained by replicating the bias of each output channel over all spatial positions of that channel.
The internal parameters of a convolutional layer are the filter bank and the per-channel biases. The number of degrees of freedom of this layer is much lower than that of an FC-layer, as it only comprises the filter coefficients and one bias per output channel. If the convolution is circular, the spatial size of the output is preserved, and thus the output dimension only changes in the number of channels. Constraining the matrix to perform convolutions takes into account the special topology of the input; coupled with the low number of degrees of freedom and a high-dimensional output, this leads to very efficient training and good generalization performance in many perceptual tasks, which we will discuss in detail in Sec. 4.1. While it remains difficult to understand what is encoded in the filters, it has been empirically shown that, for images, the first filter-bank applied to the input images converges toward an over-complete Gabor filter-bank, considered a natural basis for images [Meyer, 1993, Olshausen et al., 1996]. Hence, many signal processing tools and results await to be applied for analysis.
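The equivalence between a multi-channel convolution and a structured matrix acting on the flattened input can be checked numerically. The sketch below assumes circular boundary conditions and a channel-major flattening order; the function names `circular_conv` and `conv_as_matrix` and all sizes are hypothetical choices made for illustration.

```python
import numpy as np

def circular_conv(z, W, b):
    """Direct multi-channel circular convolution.
    z: (C, I, J) input, W: (K, C, M, N) filter bank, b: (K,) biases."""
    K, C, M, N = W.shape
    out = np.zeros((K,) + z.shape[1:])
    for k in range(K):
        for c in range(C):
            for m in range(M):
                for n in range(N):
                    # np.roll(...)[i, j] equals z[c, (i - m) % I, (j - n) % J]
                    out[k] += W[k, c, m, n] * np.roll(z[c], (m, n), axis=(0, 1))
        out[k] += b[k]  # bias shared across spatial positions
    return out

def conv_as_matrix(W, b, I, J):
    """Structured matrix and bias vector acting on the flattened input."""
    K, C, M, N = W.shape
    Wmat = np.zeros((K * I * J, C * I * J))
    for k in range(K):
        for c in range(C):
            for i in range(I):
                for j in range(J):
                    row = k * I * J + i * J + j
                    for m in range(M):
                        for n in range(N):
                            col = c * I * J + ((i - m) % I) * J + ((j - n) % J)
                            Wmat[row, col] += W[k, c, m, n]
    bvec = np.repeat(b, I * J)  # each channel bias replicated over positions
    return Wmat, bvec

rng = np.random.default_rng(0)
C, I, J, K, M, N = 2, 3, 4, 3, 2, 2
z = rng.standard_normal((C, I, J))
W = rng.standard_normal((K, C, M, N))
b = rng.standard_normal(K)

Wmat, bvec = conv_as_matrix(W, b, I, J)
flat_out = Wmat @ z.ravel() + bvec
direct_out = circular_conv(z, W, b).ravel()
n_free = W.size + b.size  # free parameters, versus Wmat.size matrix entries
```

In this toy setting the structured matrix has 864 entries but only 27 free parameters, which illustrates the reduction in degrees of freedom relative to an FC-layer of the same input and output dimensions.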
Element-wise Nonlinearity Layer
A scalar (element-wise) nonlinearity layer applies a scalar nonlinearity to each entry of a vector and thus preserves the input vector dimension. As a result, this layer produces its output by applying the nonlinearity across all positions of its input.
The choice of nonlinearity greatly impacts the learning and performance of the DNN. For example, sigmoid and tanh are known to suffer from vanishing gradients for high-amplitude inputs, while ReLU-based activations lead to unbounded activations and dying-neuron problems. Typical choices include
Leaky ReLU: $\max(\eta u, u)$ for a scalar input $u$ and a slope $\eta > 0$,
Absolute Value: $|u|$.
The presence of nonlinearities in DNNs is crucial, as otherwise the composition of linear layers would produce another linear layer with factorized parameters. When applied after an FC-layer or a convolutional layer, the nonlinearity and the preceding linear transformation are considered as parts of one layer and are denoted as a single layer mapping.
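To make the two points above concrete, the sketch below implements a few common element-wise nonlinearities and verifies that two stacked linear layers without a nonlinearity collapse into a single linear layer. The function names and the slope value `eta=0.1` are assumptions for illustration; the max form of the leaky ReLU used here requires a slope between 0 and 1.

```python
import numpy as np

def relu(u):
    return np.maximum(u, 0.0)

def leaky_relu(u, eta=0.1):
    # Max form of the leaky ReLU; valid for 0 < eta < 1 (eta is a hypothetical value).
    return np.maximum(eta * u, u)

def absolute_value(u):
    return np.abs(u)

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((5, 4)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((3, 5)), rng.standard_normal(3)
z = rng.standard_normal(4)

# Two linear layers without nonlinearity ...
two_layers = W2 @ (W1 @ z + b1) + b2
# ... equal one linear layer with factorized parameters.
W_eq, b_eq = W2 @ W1, W2 @ b1 + b2
one_layer = W_eq @ z + b_eq

# Inserting a nonlinearity between the layers breaks this equivalence in general.
with_relu = W2 @ relu(W1 @ z + b1) + b2
```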
A pooling layer performs a sub-sampling operation on its input according to a sub-sampling policy and a collection of regions on which the policy is applied. Each region to be sub-sampled contains the set of input indices on which the pooling policy is applied, so that each output dimension is the result of applying the pooling operator to the input entries of one region; the number of regions thus determines the output dimension. Usually one uses mean or max pooling, which respectively output the average and the maximum of the entries of each region. The regions can be of different cardinality and can overlap. However, in order to treat all input dimensions, it is natural to require that each input dimension belongs to at least one region. The benefits of a pooling layer are three-fold. Firstly, by reducing the output dimension, it allows for faster computation and lower memory requirements. Secondly, it greatly reduces the redundancy of information present in the input. In fact, sub-sampling, even in its linear form, is common in signal processing after filter convolutions. Finally, in the case of max-pooling, gradients are backpropagated only through the pooled coefficient, enforcing specialization of the neurons. The latter is the cornerstone of the winner-take-all strategy, stating that each neuron specializes in what it performs best. Similarly to the nonlinearity layer, we consider the pooling layer as part of its previous layer.
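A pooling layer over arbitrary index regions can be sketched as follows. The helper name `pool`, the example input, and the chosen regions are hypothetical; the regions deliberately overlap and have different cardinalities while still covering every input dimension.

```python
import numpy as np

def pool(z, regions, policy=np.max):
    """Apply a pooling policy to each region (an array of input indices)."""
    return np.array([policy(z[r]) for r in regions])

z = np.array([1.0, 5.0, 2.0, 4.0, 3.0, 0.0])
# Overlapping regions of different cardinality (index 2 belongs to two regions).
regions = [np.array([0, 1, 2]), np.array([2, 3]), np.array([4, 5])]

max_out = pool(z, regions, np.max)
mean_out = pool(z, regions, np.mean)

# Coverage requirement: every input dimension belongs to at least one region.
covered = set(np.concatenate(regions).tolist()) == set(range(len(z)))
```

The output has one entry per region, so the output dimension is 3 here for an input of dimension 6.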
A skip-connection layer can be considered as a bypass connection added between the input of a layer and its output. Hence, it allows the input of a layer, such as a convolutional layer or an FC-layer, to be linearly combined with its own output. The added connections lead to better training stability and overall performance, as there always exists a direct linear link from the input to all inner layers. Simply written, given a layer and its input, the skip-connection layer outputs the sum of the layer's output and of its input.
In case of a shape mismatch between the input and the output of the layer, a "reshape" operator is applied to the input before the element-wise addition. Usually this is done via a spatial down-sampling and/or through a convolutional layer with filters of spatial size .
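The skip-connection can be sketched as a wrapper around any layer. In the snippet below, `skip_connection` is a hypothetical helper, and the "reshape" operator used in the mismatched case is assumed to be a simple linear projection `P`, chosen purely for illustration.

```python
import numpy as np

def skip_connection(layer, z, reshape=None):
    """Return layer(z) plus the (optionally reshaped) input z."""
    shortcut = z if reshape is None else reshape(z)
    return layer(z) + shortcut

rng = np.random.default_rng(0)
z = rng.standard_normal(4)

# Case 1: the layer preserves the shape, so the input is added directly.
W, b = rng.standard_normal((4, 4)), rng.standard_normal(4)
layer = lambda v: np.maximum(W @ v + b, 0.0)
out_same = skip_connection(layer, z)

# Case 2: the layer maps dimension 4 to 2; a hypothetical linear "reshape" P
# brings the input to the output shape before the element-wise addition.
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)
P = rng.standard_normal((2, 4))
layer2 = lambda v: np.maximum(W2 @ v + b2, 0.0)
out_mismatch = skip_connection(layer2, z, reshape=lambda v: P @ v)
```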
Finally, another type of layer is the recurrent layer, which aims to act on time series. It is defined as a recursive application along time that transforms the current input while also using its own previous output. The simplest form of this layer is the fully recurrent layer. While some applications use recurrent layers on images by considering the series of ordered local patches as a time series, the main application resides in sequence generation and analysis, especially with more complex topologies such as LSTM [Graves and Schmidhuber, 2005] and GRU [Chung et al., 2014] networks. We depict an example topology in Fig. 3.
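A fully recurrent layer can be sketched as below. The exact parametrization is an assumption: the snippet uses the common Elman-style update, in which the output at each time step is a nonlinearity applied to a linear function of the current input and of the previous output, with an initial output of zero. The names `recurrent_layer`, `W`, `U`, and `b` are illustrative.

```python
import numpy as np

def recurrent_layer(xs, W, U, b, sigma=np.tanh):
    """Assumed update: z_t = sigma(W @ x_t + U @ z_{t-1} + b), with z_0 = 0."""
    z = np.zeros(U.shape[0])
    outputs = []
    for x_t in xs:                 # recursive application along time
        z = sigma(W @ x_t + U @ z + b)
        outputs.append(z)
    return np.stack(outputs)

rng = np.random.default_rng(0)
T, d_in, d_out = 5, 3, 2           # hypothetical sequence length and dimensions
xs = rng.standard_normal((T, d_in))
W = rng.standard_normal((d_out, d_in))
U = rng.standard_normal((d_out, d_out))
b = rng.standard_normal(d_out)

zs = recurrent_layer(xs, W, U, b)
```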
2.2 Deep Convolutional Network
The choice of layers and their order matter greatly for final performance, and while many newly developed stochastic optimization techniques allow for faster learning, a sub-optimal layer chain is almost never recoverable. We now describe a "typical" network topology, the deep convolutional network (DCN), to highlight how the previously described layers can be combined to provide powerful predictors. Its main development goes back to [LeCun et al., 1995] for digit classification. A DCN is defined as a succession of blocks, each made of three layers: convolution, element-wise nonlinearity, and pooling. In a DCN, several such blocks are cascaded end-to-end to create a sequence of activation maps, usually followed by one or two FC-layers. Using the above notations, a single block can be rewritten as one layer mapping. Hence, a basic model is defined as the composition of a number of such blocks followed by FC-layers.
The astonishing results that a DCN can achieve come from the ability of the blocks to convolve the learned filter-banks with their input, "separating" the underlying features that are relevant to the task at hand. This is followed by a nonlinearity and a spatial sub-sampling to select, compress, and reduce the redundant representation while highlighting task-dependent features. Finally, the MLP part simply acts as a nonlinear classifier, the final key for prediction. This duality between representation (high-dimensional mappings) and dimensionality reduction (classification) is a core concept in machine learning, referred to as pre-processing followed by classification.
In order to optimize all the weights leading to the predicted output, one has at one's disposal (1) a labeled dataset, (2) a loss function, and (3) a learning policy to update the parameters. In the context of classification, the target variable associated with an input is categorical. To predict such a target, the output of the last layer of the network is transformed via a softmax nonlinearity [de Brébisson and Vincent, 2015], which maps it into a probability distribution over the classes: each entry is exponentiated and divided by the sum of the exponentials of all entries. Each output entry thus represents the predicted probability of the corresponding class. The loss function used to quantify the distance between the prediction and the target is the cross-entropy (CE), i.e., the negative logarithm of the probability assigned to the correct class.
For regression problems, the target is continuous and thus the final DNN output is taken as the prediction. The loss function is usually the ordinary squared error (SE), i.e., the squared difference between the prediction and the target.
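The softmax, cross-entropy, and squared-error computations can be sketched as follows. The function names are illustrative, the class label is assumed to be an integer index, and the squared error is assumed to be the plain sum of squared differences without any scaling factor. Subtracting the maximum before exponentiating is a standard numerical-stability step that leaves the softmax unchanged.

```python
import numpy as np

def softmax(z):
    """Map a vector of scores to a probability distribution."""
    e = np.exp(z - np.max(z))      # shift for numerical stability
    return e / e.sum()

def cross_entropy(z, y):
    """Negative log-probability of the correct class y (an integer index)."""
    return -np.log(softmax(z)[y])

def squared_error(z, y):
    """Sum of squared differences between prediction z and target y."""
    return np.sum((np.asarray(z) - np.asarray(y)) ** 2)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
ce_uniform = cross_entropy(np.zeros(4), 2)      # uniform prediction over 4 classes
se = squared_error([1.0, 2.0], [0.0, 0.0])
```

For a uniform prediction over four classes the cross-entropy equals log 4, whichever class is the correct one.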
Since all of the operations introduced above in standard DNNs are differentiable almost everywhere with respect to their parameters and inputs, given a training set and a loss function, one defines an update strategy for the weights. This takes the form of an iterative scheme based on a first-order optimization procedure. Updates for the weights are computed on each input and usually averaged over mini-batches of exemplars. This produces an estimate of the "correct" update for the weights, which is applied after each mini-batch. Once all the training instances have been seen, this terminates an epoch. The dataset is then shuffled and the procedure is performed again. Usually a network needs hundreds of epochs to converge. For any given iterative procedure, the updates are computed for all the network parameters by backpropagation [Hecht-Nielsen et al., 1988], which follows from applying the chain rule of calculus. Common policies are Gradient Descent (GD) [Rumelhart et al., 1988], the simplest application of backpropagation; Nesterov Momentum [Bengio et al., 2013], which uses the last performed updates in order to "accelerate" convergence; and more complex adaptive methods with internal hyper-parameters updated based on the weights/updates statistics, such as Adam [Kingma and Ba, 2014], Adadelta [Zeiler, 2012], Adagrad [Duchi et al., 2011], and RMSprop [Tieleman and Hinton, 2012]. Finally, to measure the actual performance of a trained network in the context of classification, one uses the accuracy loss, defined such that a smaller value is better.
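The mini-batch procedure can be illustrated on a deliberately simple model. The sketch below runs plain mini-batch gradient descent on a noise-free linear regression problem with a squared-error loss, for which the gradient is available in closed form, so no backpropagation library is needed. All names, the learning rate, the batch size, and the number of epochs are hypothetical. The `accuracy_loss` helper assumes that the accuracy loss is the fraction of misclassified samples, which is one definition consistent with "smaller is better".

```python
import numpy as np

def accuracy_loss(pred_labels, labels):
    """Assumed definition: fraction of misclassified samples (smaller is better)."""
    return float(np.mean(np.asarray(pred_labels) != np.asarray(labels)))

def train_minibatch_gd(X, Y, lr=0.1, batch_size=20, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))          # shuffle at every epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - Y[idx]
            # Gradient of 0.5 * squared error, averaged over the mini-batch.
            grad = X[idx].T @ residual / len(idx)
            w -= lr * grad                       # update applied after each mini-batch
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true                                   # noise-free targets
w_hat = train_minibatch_gd(X, Y)
```

Adaptive policies such as Adam would replace only the line `w -= lr * grad`; the epoch and mini-batch structure stays the same.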
3 Understand Deep Neural Networks Internal Mechanisms
In this section we develop spline operators, a generalization of spline functions that also generalizes DNN layers. By doing so, we will open DNNs to explicit analysis and, in particular, understand their behavior and potential through spline region inference. DNNs will be shown to leverage template matching, a standard technique in signal processing for tackling perception tasks.
Let us first motivate the need to adopt the theory of spline functions for deep learning and machine learning in general. As described in Sec. 2, the task at hand is to use parametric functionals to understand [Cheney, 1980], predict, and interpolate the world around us [Reinsch, 1967]. For example, partial differential equations allow one to approximate real-world physics [Bloor and Wilson, 1990, Smith, 1985] based on grounded principles. In this case, one knows the underlying laws that must be fulfilled by the approximator. In machine learning, however, one only has access to observed inputs or input-output pairs. To tackle this approximation problem, splines offer great advantages. From a computational standpoint, polynomials are very efficient to evaluate via, for example, the Horner scheme [Peña, 2000]. Yet, polynomials have "chaotic" behaviors, especially as their degree grows, leading to Runge's phenomenon [Boyd and Xu, 2009], synonymous with extremely poor interpolation and extrapolation capacities. On the other hand, low-degree polynomials are not flexible enough to model arbitrary functionals. Splines, however, are defined as a collection of polynomials, each one acting on a specific region of the input space. The collection of possible regions forms a partition of the input space. On each of those regions, the associated and usually low-degree polynomial is used to transform the input. Through this per-region activation, splines can model highly nonlinear functionals, yet the low-degree polynomials avoid Runge's phenomenon. Hence, splines are the tools of choice for functional approximation if one seeks robust interpolation/extrapolation performance without sacrificing the modeling of potentially very irregular underlying mappings.
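Runge's phenomenon and its avoidance by low-degree piecewise polynomials can be observed numerically. The sketch below interpolates the classical Runge function on 11 equispaced nodes, once with a single degree-10 polynomial and once with a piecewise linear spline. The choice of test function, interval, and number of nodes is an illustrative assumption.

```python
import numpy as np

def runge(x):
    """Classical test function exhibiting Runge's phenomenon."""
    return 1.0 / (1.0 + 25.0 * x ** 2)

nodes = np.linspace(-1.0, 1.0, 11)       # equispaced interpolation nodes
x_test = np.linspace(-1.0, 1.0, 1001)    # dense evaluation grid

# Single high-degree polynomial through all nodes.
coeffs = np.polyfit(nodes, runge(nodes), deg=10)
poly_err = np.max(np.abs(np.polyval(coeffs, x_test) - runge(x_test)))

# Piecewise linear spline through the same nodes.
lin_err = np.max(np.abs(np.interp(x_test, nodes, runge(nodes)) - runge(x_test)))
```

On the same nodes, the global polynomial oscillates strongly near the interval boundaries, whereas the piecewise linear interpolant stays close to the target everywhere.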
In fact, as we will now describe in detail, current state-of-the-art DNNs are linear spline functions, and we now proceed to develop the notations and formulation accordingly. Let us first briefly recall the case of multivariate linear splines.
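Given a partition of the input space into regions, a multivariate spline is defined by a collection of local mappings, one per region: the spline maps an input to the output of the local mapping associated with the region containing that input. This input-dependent selection of the local mapping is abbreviated in the notation. If the local mappings are linear, each of them is the sum of an inner product between a region-specific slope vector and the input, and of a region-specific scalar bias. We denote this functional as a multivariate linear spline, whose polynomial parameters are made explicit as the collection of slope vectors and the collection of biases.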
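A multivariate linear spline can be evaluated by first locating the region of the input and then applying that region's slope and bias. The sketch below is a two-region example in which the partition is given by a half-space test, so that the spline coincides with a ReLU applied to an affine function. The names `linear_spline` and `region_index`, as well as the vector `w` and offset `c`, are hypothetical.

```python
import numpy as np

def linear_spline(x, slopes, biases, region_index):
    """Evaluate <a[x], x> + b[x], where [x] selects the region containing x."""
    r = region_index(x)
    return slopes[r] @ x + biases[r]

# Hypothetical two-region partition of the plane by the hyperplane <w, x> + c = 0.
w, c = np.array([1.0, -2.0]), 0.5
region_index = lambda x: int(w @ x + c > 0)

slopes = np.array([[0.0, 0.0],     # region 0: the spline is identically zero
                   w])             # region 1: the spline follows <w, x> + c
biases = np.array([0.0, c])

x_pos = np.array([2.0, 0.0])       # lies in region 1
x_neg = np.array([0.0, 2.0])       # lies in region 0
out_pos = linear_spline(x_pos, slopes, biases, region_index)
out_neg = linear_spline(x_neg, slopes, biases, region_index)
```

On each region the mapping is affine, and the selection step is the only source of nonlinearity.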
In the next sections, we study the capacity of linear spline operators to span standard DNN layers. The full development of the spline operator, as well as a detailed review of multivariate spline functions, is contained in Appendix A.1.1. Afterwards, the composition of the developed linear operators will lead to the explicit analytical input-output mapping of DNNs, allowing us to derive all the theoretical results in the remainder of the report. In the following sections, we omit regularity constraints on the presented functionals, thus treating the most general case.
3.1 Spline Operators
A natural extension of spline functions is the spline operator (SO). We present here a general definition and propose in the next section an intuitive way to construct spline operators via a collection of multivariate splines, which is the special case corresponding to current DNNs.
A spline operator is a vector-valued mapping defined by a collection of local mappings associated with a partition of the input space: an input is transformed by the region-specific mapping associated with the region that contains it.
A special case occurs when the local mappings are linear. We define in this case the linear spline operator (LSO), which will play an important role for DNN analysis. Each local mapping then consists of a region-specific matrix applied to the input plus a region-specific bias vector. As a result, an LSO can be rewritten as the product of the input-specific matrix with the input plus the input-specific bias, where the input-specific matrix and bias are those of the region containing the input, taken from the collections of all slope matrices and biases.
Such operators can also be defined via a collection of multivariate polynomial (resp. linear) splines. Given several multivariate spline functions sharing the same input space, their respective outputs are "stacked" to produce an output vector whose dimension equals the number of splines. The internal parameters of each multivariate spline are its own partition of the input space and its own collection of local mappings. Stacking their respective outputs to form an output vector leads to the induced spline operator, whose output entry at a given position is the output of the corresponding multivariate spline.
The use of multivariate splines to construct an SO does not directly provide the explicit collection of mappings and regions of the SO. Yet, it is clear that the SO is jointly governed by all the individual multivariate splines. Let us first present some intuition on this fact. The spline operator output is computed with each of the splines having "activated" a region-specific functional depending on its own partitioning of the input space. The regions of interest are thus the regions of the input space on which the joint configuration of activated regions is constant. We can therefore write the regions of the spline operator explicitly from the partitions of all the involved multivariate splines: they are the non-empty intersections obtained by picking one region from each spline's partition. The number of regions of this SO is the number of such non-empty intersections. From this, the local mapping of the SO on one of its regions corresponds to the stacking of the local mappings that the individual splines activate on that region. In fact, for each region of the SO there is a unique region of each spline that contains it, and it is disjoint from all the other regions of that spline, since the regions of each spline form a partition. This leads to the SO formulation in which an input is transformed by the joint local mapping of the SO region that contains it.
We can now study the case of linear splines, leading to LSOs. If an SO is constructed via aggregation of linear multivariate splines, the linear property allows notation simplifications: the slope vectors of the activated local mappings are stacked as the rows of a region-specific matrix, and their biases are stacked into a region-specific bias vector.
Clearly, the collection of matrices and biases together with the partitions completely define an LSO. Hence, we denote the set of all possible matrices and the set of all possible biases, and any LSO is written in terms of them.
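The construction of an LSO from stacked linear splines can be illustrated with a ReLU layer, assuming each output unit is a two-region linear spline (active or inactive). In the sketch below, the joint activation pattern identifies the region of the input, and the region-specific matrix and bias are obtained by zeroing the rows of inactive units. The names `lso_parameters` and `relu_layer` and all sizes are hypothetical.

```python
import numpy as np

def relu_layer(x, W, b):
    return np.maximum(W @ x + b, 0.0)

def lso_parameters(x, W, b):
    """Region-specific matrix, bias, and region code for the input x."""
    q = (W @ x + b > 0).astype(float)  # joint activation pattern of the K splines
    A_x = q[:, None] * W               # row k is w_k if spline k is active, else 0
    b_x = q * b
    return A_x, b_x, tuple(q.astype(int))

rng = np.random.default_rng(0)
K, D = 4, 3                            # number of stacked splines, input dimension
W = rng.standard_normal((K, D))
b = rng.standard_normal(K)

x = rng.standard_normal(D)
A_x, b_x, code = lso_parameters(x, W, b)
lso_out = A_x @ x + b_x                # piecewise affine form of the layer output

# The joint regions are indexed by activation patterns: at most 2**K of them.
samples = rng.standard_normal((500, D))
codes = {lso_parameters(s, W, b)[2] for s in samples}
```

All inputs sharing the same activation pattern share the same matrix and bias, which is the per-region affine behavior that defines an LSO.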
3.2 Linear Spline Operator: Generalized Neural Network Layers
In this section we demonstrate how current DNNs are expressed as compositions of LSOs. We first describe layer-specific notations and analytical formulas, and then compose LSOs to obtain analytical DNN mappings in the next section.
3.2.1 Nonlinearity layers
We first analyze the element-wise nonlinearity layer. Our analysis deals with any given nonlinearity. Nonlinearities that are splines by definition, such as ReLU, leaky ReLU, and absolute value, fall directly into this analysis. Otherwise, arbitrary functions such as tanh or sigmoid are approximated via linear splines. We recall that a nonlinearity layer is defined by applying a nonlinearity to each dimension of its input, producing a new output vector. While in general the same nonlinearity is applied to every dimension, we present here a more general case where each dimension has its own specific nonlinearity. In addition, we present the case where the nonlinearity might not act on only one input dimension but on any part of the input. We thus define a separate nonlinearity for each output dimension. We provide illustrations of common nonlinearities in Table 1, which are cases where the output at a given position only depends on the input at the same position.
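The approximation of a smooth nonlinearity by a linear spline can be sketched with linear interpolation between knots. The snippet below approximates tanh on an interval with two knot densities and compares the maximum errors. The interval, the knot counts, and the helper name `piecewise_linear_approx` are illustrative assumptions; outside the knot range, `np.interp` holds the boundary values constant.

```python
import numpy as np

def piecewise_linear_approx(f, knots):
    """Linear spline interpolating f at the given (increasing) knots."""
    values = f(knots)
    return lambda u: np.interp(u, knots, values)

u = np.linspace(-4.0, 4.0, 2001)   # dense evaluation grid

def max_error(n_knots):
    knots = np.linspace(-4.0, 4.0, n_knots)
    approx = piecewise_linear_approx(np.tanh, knots)
    return np.max(np.abs(approx(u) - np.tanh(u)))

err_coarse = max_error(9)          # knot spacing 1.0
err_fine = max_error(33)           # knot spacing 0.25
```

Refining the partition reduces the approximation error, so a smooth nonlinearity can be brought arbitrarily close to a linear spline at the cost of more regions.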