I. Introduction
Deep networks [1] are parametric models that perform sequential operations on their input data. Each such operation, colloquially called a "layer", consists of a linear transformation, say, a convolution of its input, followed by a pointwise nonlinear "activation function", e.g., a sigmoid. Deep networks have recently led to dramatic improvements in classification performance in various applications in speech and natural language processing, and computer vision. The crucial property of deep networks that is believed to be the root of their performance is that they have a large number of layers as compared to classical neural networks, but other architectural modifications, such as rectified linear activations (ReLUs) [2] and residual "shortcut" connections [3], also play an important role. Other major factors in their success are the availability of massive datasets, say, millions of images in datasets like ImageNet [4], and efficient GPU computing hardware for solving the resulting high-dimensional optimization problem, which may have millions of parameters.

The empirical success of deep learning, especially convolutional neural networks (CNNs) for image-based tasks, presents numerous puzzles to theoreticians. In particular, there are three key factors in deep learning, namely the architectures, regularization techniques and optimization algorithms, which are critical to train well-performing deep networks, and understanding their necessity and interplay is essential if we are to unravel the secrets of their success.
I-A. Approximation, depth, width and invariance properties
An important property in the design of a neural network architecture is its ability to approximate arbitrary functions of the input. But how does this ability depend on parameters of the architecture, such as its depth and width? Earlier work shows that neural networks with a single hidden layer and sigmoidal activations are universal function approximators [5, 6, 7, 8]. However, the capacity of a wide and shallow network can be replicated by a deep network with significant improvements in performance. One possible explanation is that deeper architectures are able to better capture invariant properties of the data compared to their shallow counterparts. In computer vision, for example, the category of an object is invariant to changes in viewpoint, illumination, etc. While a mathematical analysis of why deep networks are able to capture such invariances remains elusive, recent progress has shed some light on this issue for certain subclasses of deep networks. In particular, scattering networks [9] are a class of deep networks whose convolutional filter banks are given by complex, multiresolution wavelet families. As a result of this extra structure, they are provably stable and locally invariant signal representations, and reveal the fundamental role of geometry and stability that underpins the generalization performance of modern deep convolutional network architectures; see Section IV.
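The universal approximation property of a single hidden layer is easy to observe empirically. The sketch below fixes random input-to-hidden weights with sigmoidal (tanh) activations and fits only the output weights by least squares (a minimal NumPy sketch; the target function, number of hidden units, and weight scale are illustrative choices, not taken from the cited works):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target function to approximate on [-pi, pi].
x = np.linspace(-np.pi, np.pi, 200)[:, None]
y = np.sin(x)

# One hidden layer with sigmoidal (tanh) activations: the
# input-to-hidden weights are random and fixed, and only the
# output weights are fitted, by least squares.
n_hidden = 100
W = rng.normal(scale=2.0, size=(1, n_hidden))
b = rng.normal(scale=2.0, size=(n_hidden,))
H = np.tanh(x @ W + b)                      # hidden-layer activations
c, *_ = np.linalg.lstsq(H, y, rcond=None)   # fit output weights

mse = np.mean((H @ c - y) ** 2)             # small: sin is well approximated
```

Even this crude scheme drives the approximation error close to zero, illustrating that width alone suffices for approximation; the depth-related questions discussed above concern efficiency and invariance rather than raw approximation power.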
I-B. Generalization and regularization properties
Another critical property of a neural network architecture is its ability to generalize from a small number of training examples. Traditional results from statistical learning theory
[10] show that the number of training examples needed to achieve good generalization grows polynomially with the size of the network. In practice, however, deep networks are trained with far fewer data than the number of parameters, and yet they can be prevented from overfitting using very simple (and seemingly counterproductive) regularization techniques like dropout [11], which randomly zeroes out a subset of the units at each iteration.

One possible explanation for this conundrum is that deeper architectures produce an embedding of the input data that approximately preserves the distance between data points in the same class, while increasing the separation between classes. This tutorial will overview the recent work of [12], which uses tools from compressed sensing and dictionary learning to prove that deep networks with random Gaussian weights perform a distance-preserving embedding of the data in which similar inputs are likely to have a similar output. These results provide insights into the metric learning properties of the network and lead to bounds on the generalization error that are informed by the structure of the input data.
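Dropout itself can be sketched in a few lines (a minimal NumPy sketch; the batch shape and the "inverted dropout" rescaling convention are illustrative choices, not taken from [11]):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5, training=True):
    # Inverted dropout: zero each activation with probability p and
    # rescale the survivors by 1/(1-p), so the expected activation is
    # unchanged and no rescaling is needed at test time.
    if not training or p == 0.0:
        return h
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)

h = np.ones((4, 8))        # a batch of hidden activations
out = dropout(h, p=0.5)    # entries are either 0 or 2
```

At test time the layer is the identity, which is what makes the training-time rescaling convenient.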
I-C. Information-theoretic properties
Another key property of a network architecture is its ability to produce a good "representation of the data". Roughly speaking, a representation is any function of the input data that is useful for a task. An optimal representation would be one that is "most useful" as quantified, for instance, by information-theoretic, complexity or invariance criteria [13]
. This is akin to the “state” of the system and is what an agent would store in its memory in lieu of the data to predict future observations. For example, the state of a Kalman filter is an optimal representation for predicting data generated by a linear dynamical system with Gaussian noise; in other words, it is a minimal sufficient statistic for prediction. For complex tasks where the data may be corrupted by “nuisances” that do not contain information about the task, one may also wish for this representation to be “invariant” to such nuisances so as not to affect future predictions. In general, optimal representations for a task can be defined as sufficient statistics (of past or “training” data) that are also minimal, and invariant to nuisance variability affecting future (“test”) data
[14]. Despite a large interest in representation learning, a comprehensive theory that explains the performance of deep networks as constructing optimal representations does not yet exist. Indeed, even basic concepts such as sufficiency and invariance have received diverse treatment
[9, 14, 15]. Recent work [16, 17, 18] has begun to establish information-theoretic foundations for representations learned by deep networks. These include the observation that the information bottleneck loss [13], which defines a relaxed notion of minimal sufficiency, can be used to compute optimal representations. The information bottleneck loss can be rewritten as the sum of a cross-entropy term, which is precisely the most commonly used loss in deep learning, and an additional regularization term. The latter can be implemented by introducing noise, similar to adaptive dropout noise, in the learned representation [17]. The resulting form of regularization, called information dropout in [17], shows improved learning under resource constraints, and can be shown to lead to "maximally disentangled" representations, i.e., the (total) correlation among components of the representation is minimal, thereby making the features indicators of independent characteristics of the data. Moreover, similar techniques show improved robustness to adversarial perturbations [18]. Information theory is hence expected to play a key role in formalizing and analyzing the properties of deep representations and suggesting new classes of regularizers.
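To make the decomposition above concrete, the following sketch writes an information-bottleneck-style objective as a cross-entropy term plus a KL regularizer on a Gaussian representation (a minimal NumPy sketch of a variational relaxation; the diagonal-Gaussian encoder and the value of beta are illustrative assumptions, not the exact construction of [13, 17]):

```python
import numpy as np

def ib_loss(logits, labels, mu, log_var, beta=1e-2):
    # Cross-entropy term: the standard deep-learning loss.
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -logp[np.arange(len(labels)), labels].mean()
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ): penalizes information
    # that the representation z carries beyond what the task needs.
    kl = 0.5 * (np.exp(log_var) + mu**2 - 1.0 - log_var).sum(axis=1).mean()
    return ce + beta * kl
```

With beta = 0 this reduces to the usual cross-entropy objective; increasing beta trades task accuracy for a more compressed (more "minimal") representation.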
I-D. Optimization properties
The classical approach to training neural networks is to minimize a (regularized) loss using backpropagation
[19], a gradient descent method specialized to neural networks. Modern versions of backpropagation rely on stochastic gradient descent (SGD) to efficiently approximate the gradient for massive datasets. While SGD has been rigorously analyzed only for convex loss functions
[20], in deep learning the loss is a nonconvex function of the network parameters, hence there are no guarantees that SGD finds the global minimizer.

In practice, there is overwhelming evidence that SGD routinely yields good solutions for deep networks. Recent work on understanding the quality of training argues that critical points are more likely to be saddle points rather than spurious local minima [21] and that local minima concentrate near the global optimum [22]. Recent work has also revealed that local minima discovered by SGD that lead to good generalization error belong to very flat regions of the parameter space [23]
. This motivates algorithms like Entropy-SGD that are specialized to find such regions and draw from similar results in the analysis of binary perceptrons in statistical physics [24]; they have been shown to perform well on deep networks [25]. Surprisingly, these techniques from statistical physics are intimately connected to the regularization properties of partial differential equations (PDEs) [26]. For instance, local entropy, the loss that Entropy-SGD minimizes, is the solution of the Hamilton-Jacobi-Bellman PDE and can therefore be written as a stochastic optimal control problem, which penalizes greedy gradient descent. This direction further leads to connections between variants of SGD with good empirical performance and standard methods in convex optimization, such as inf-convolutions and proximal methods. Researchers are only now beginning to unravel the loss functions of deep networks in terms of their topology, which dictates the complexity of optimization, and their geometry, which seems to be related to generalization properties of the classifiers
[27, 28, 29].

This tutorial will overview recent work showing that the error surface for high-dimensional nonconvex optimization problems such as deep learning has some benign properties. For example, the work of [30, 31] shows that for certain classes of neural networks for which both the loss function and the regularizer are sums of positively homogeneous functions of the same degree, a local optimum such that many of its entries are zero is also a global optimum. These results also provide a possible explanation for the success of ReLUs, which are positively homogeneous functions. Particular cases of this framework include, in addition to deep learning, matrix factorization and tensor factorization [32].

I-E. Paper outline
The remainder of this paper is organized as follows. Section II describes the input-output map of a deep network. Section III studies the problem of training deep networks and establishes conditions for global optimality. Section IV studies invariance and stability properties of scattering networks. Section V studies structural properties of deep networks, such as metric properties of the embedding as well as bounds on the generalization error. Section VI studies information-theoretic properties of deep representations.
II. Preliminaries
A deep network is a hierarchical model in which each layer applies a linear transformation followed by a nonlinearity to the preceding layer. Specifically, let X ∈ R^{N×D} be the input data, where each row of X is a D-dimensional data point (e.g., a grayscale image with D pixels) and N is the number of training examples (for the sake of simplicity, we assume that inputs lie in R^D, but many of the results in this paper apply also to more general inputs such as tensors, e.g., RGB data, in which case the weights become tensors as well). Let W^k be a matrix representing a linear transformation applied to the output X_{k-1} of layer k-1 to obtain a d_k-dimensional representation X_k at layer k. For example, each column of W^k could represent a convolution with some filter (as in convolutional neural networks) or the application of a linear classifier (as in fully connected networks). Let ψ be a nonlinear activation function, e.g., a hyperbolic tangent ψ(x) = tanh(x), a sigmoid ψ(x) = 1/(1 + e^{-x}), or a rectified linear unit (ReLU) ψ(x) = max(x, 0) (more broadly, ψ could be a many-to-one function, such as max pooling). This nonlinearity is applied to each entry of X_{k-1}W^k to generate the kth layer of the network as X_k = ψ(X_{k-1}W^k). The output of a network with K layers is thus given by (see Fig. 1):

Φ(X, W^1, ..., W^K) = ψ(ψ(⋯ψ(X W^1) W^2 ⋯) W^K).   (1)
Note that Φ(X, W^1, ..., W^K) is an N×C matrix, where C is the dimension of the output of the network, which is equal to the number of classes in the case of a classification problem. Notice also that the map Φ can be seen as a function of the network weights {W^k} with a fixed input X, as will be the case when we discuss the training problem in Section III. Alternatively, we can view Φ as a function of the input data X with fixed weights {W^k}, as will be the case when we discuss properties of this map in Sections IV-VI.
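The map in (1) is only a few lines of code. The sketch below evaluates it for a small random network (a minimal NumPy sketch; the layer sizes and the choice of ReLU for the nonlinearity are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(X, weights, psi=relu):
    # Evaluate Phi(X, W^1, ..., W^K) of (1): each layer multiplies by
    # W^k and applies the nonlinearity psi entrywise.
    H = X
    for W in weights:
        H = psi(H @ W)
    return H

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))            # N = 5 data points, D = 16
weights = [rng.normal(size=(16, 32)),   # K = 3 layers
           rng.normal(size=(32, 32)),
           rng.normal(size=(32, 10))]   # C = 10 outputs
Y_hat = forward(X, weights)             # N x C output matrix
```

Swapping `psi` for a sigmoid or tanh recovers the other activation choices discussed above.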
III. Global Optimality in Deep Learning
This section studies the problem of learning the parameters W^1, ..., W^K of a deep network from N training examples (X, Y). In a classification setting, each row of X denotes a data point in R^D and the corresponding row of Y denotes the membership of that data point to one out of C classes, i.e., Y_{jc} = 1 if the jth row of X belongs to class c and Y_{jc} = 0 otherwise. In a regression setting, the rows of Y denote the dependent variables for the rows of X. The problem of learning the network weights is formulated as the following optimization problem:

min_{W^1, ..., W^K} ℓ(Y, Φ(X, W^1, ..., W^K)) + λ Θ(W^1, ..., W^K),   (2)

where ℓ is a loss function that measures the agreement between the true output, Y, and the predicted output, Φ(X, W^1, ..., W^K) in (1), Θ is a regularization function designed to prevent overfitting, e.g., weight decay via ℓ2 regularization Θ = Σ_k ||W^k||_F^2, and λ > 0 is a balancing parameter.
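For instance, with the squared loss and weight decay, the objective in (2) can be evaluated directly (a minimal NumPy sketch; the ReLU network, squared loss, and Frobenius-norm regularizer are illustrative choices):

```python
import numpy as np

def objective(Y, X, weights, lam=1e-3):
    # Evaluate (2) with the squared loss and l2 weight decay:
    # ||Y - Phi(X, W^1..W^K)||_F^2 + lam * sum_k ||W^k||_F^2,
    # where Phi is the ReLU network of (1).
    H = X
    for W in weights:
        H = np.maximum(H @ W, 0.0)
    loss = np.sum((Y - H) ** 2)
    reg = sum(np.sum(W ** 2) for W in weights)
    return loss + lam * reg
```

Training amounts to minimizing this quantity over the weights, which, as discussed next, is a nonconvex problem.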
III-A. The challenge of nonconvexity in neural network training
A key challenge in neural network training is that the optimization problem in (2) is nonconvex: even if the loss ℓ(Y, Z) is typically a convex function of Z, e.g., the squared loss ℓ(Y, Z) = ||Y - Z||_F^2, the map Φ is usually a nonconvex function of the weights, due to the product of the variables W^1, ..., W^K and the nonlinearities in (1). This presents significant challenges to existing optimization algorithms – including (but certainly not limited to) gradient descent, stochastic gradient descent, alternating minimization, block coordinate descent, backpropagation, and quasi-Newton methods – which are typically only guaranteed to converge to a critical point of the objective function [33, 34, 35, 36]. However, for nonconvex problems, the set of critical points includes not only global minima, but also local minima, local maxima, saddle points and saddle plateaus, as illustrated in Fig. 2. As a result, the nonconvexity of the problem leaves the model somewhat ill-posed, in the sense that it is not just the model formulation that is important but also implementation details, such as how the model is initialized and the particulars of the optimization algorithm, which can have a significant impact on the performance of the model.
To address the issue of nonconvexity, a common strategy used in deep learning is to initialize the network weights at random, update these weights using local descent, check if the training error decreases sufficiently fast, and, if not, choose another initialization. In practice, it has been observed that if the size of the network is large enough, this strategy can lead to markedly different solutions for the network weights, which give nearly the same objective values and classification performance [22]. It has also been observed that when the size of the network is large enough and the nonlinearity is chosen to be a ReLU, many weights are zero, a phenomenon known as dead neurons, and the classification performance significantly improves [37, 38, 39, 40]. While this empirically suggests that when the size of the network is large enough and ReLU nonlinearities are used all local minima could be global, there is currently no rigorous theory that provides a precise mathematical explanation for these experimentally observed phenomena.

III-B. Optimality for neural networks with a single hidden layer
Earlier work on global optimality for neural networks [41] showed that for networks with a single hidden layer and linear activations, the squared loss has a unique global minimum and all other critical points are saddle points. However, when the activations are nonlinear, [42] gives examples of networks where the backpropagation algorithm [19] fails even with separable data. The examples are, however, not generic, and [43, 44] show that backpropagation generally finds a global minimizer for linearly separable data.
Later work [45] showed that for neural networks with a single hidden layer, if the number of neurons in the hidden layer is not fixed, but instead fitted to the data through a sparsity-inducing regularization, then the process of training a globally optimal neural network is analogous to selecting a finite number of hidden units from a potentially infinite-dimensional space of all possible hidden units. A weighted sum of the selected hidden units is then taken to produce the output. The specific optimization problem is of the form
min_β ℓ(Y, Σ_{h∈H} h β_h) + λ Σ_{h∈H} |β_h|,   (3)

where each h ∈ H represents one possible hidden unit activation in response to the training data, drawn from an infinite-dimensional space H of all possible hidden unit activations, and β collects the output weights. Clearly (3) is a convex optimization problem in β (assuming ℓ is convex) and straightforward to solve for a finite set of activations. However, since H is an infinite-dimensional space, the primary difficulty lies in how to select the appropriate hidden unit activations. Nonetheless, by using arguments from gradient boosting, it is possible to show that problem (3) can be globally optimized by sequentially adding hidden units to the network until one can no longer find a hidden unit whose addition will decrease the objective function [45, 46, 47].

III-C. Optimality for networks with random inputs and weights
Recent work has analyzed the problem of training neural networks with random inputs and weights. For example, the authors of [48] study random networks with a single hidden layer. Rather than training the network by solving an optimization problem such as (2), the authors use tensor decomposition methods to estimate the network weights from high-order statistical moments of the network mapping. Moreover, the authors show that under certain assumptions on the loss and data distribution, polynomial-time training is possible [48]. Further, the authors of [49] study when a given initialization of a neural network is likely to be within the basin of attraction of a global minimizer, and provide conditions that ensure a random initialization will be within such a basin with high probability.
Several recent works have also analyzed the error surface of multilayer neural networks using tools from random matrix theory and statistical physics. For example, the authors of
[21] argue that, under certain assumptions, it is vastly more likely that a critical point of a high-dimensional optimization problem is a saddle point rather than a local minimizer. Therefore, avoiding saddle points is the key difficulty in high-dimensional, nonconvex optimization. Likewise, the authors of [22] show that, under certain assumptions on the distributions of the training data and the network weight parameters, as the number of hidden units in a network increases, the distribution of local minima becomes increasingly concentrated in a small band of objective function values near the global optimum (and thus all local minima become increasingly close to being global minima).

III-D. Global optimality for positively homogeneous networks
Recent work [30, 31] largely echoes ideas from the above work, but takes a markedly different approach. Specifically, [30, 31] analyzes the optimization problem in (2) using a purely deterministic approach which does not make any assumptions regarding the distribution of the input data, the network weight parameter statistics, or the network initialization. With this approach, [30, 31] shows that saddle points and plateaus are the only critical points that one needs to be concerned with due to the fact that for networks of sufficient size, local minima that require one to climb the objective surface to escape from, such as (f) and (h) in Fig. 2, are guaranteed not to exist.
More specifically, [30] studies conditions under which the optimization landscape for the nonconvex optimization problem in (2) is such that all critical points are either global minimizers or saddle points/plateaus, as shown in Fig. 3. The authors show that if the network size is large enough and both the loss function and the regularizer are sums of positively homogeneous functions of the same degree, any local minimizer such that some of its entries are zero is also a global minimizer. Interestingly, ReLU and max-pooling nonlinearities are positively homogeneous, while sigmoids are not, which could provide a possible explanation for the improved performance of ReLU and max pooling. Furthermore, many state-of-the-art networks are not trained with classical regularization, such as an ℓ1 or ℓ2 norm penalty on the weight parameters, but instead rely on techniques such as dropout [11]. The results of [30, 31] also provide strong guidance on the design of network regularization to ensure the nonexistence of spurious local minima, showing that traditional weight decay is not appropriate for deep networks. However, more recently proposed forms of regularization, such as Path-SGD [50], [51], can be easily incorporated into the analysis framework of [30, 31], and stochastic regularization methods, such as dropout [11], have strong similarities to their framework.

IV. Geometric Stability in Deep Learning
An important question in the path towards understanding deep learning models is to mathematically characterize their inductive bias; i.e., to define the class of regression/classification tasks for which they are predesigned to perform well, or at least better than classical alternatives.

In the particular case of computer vision tasks, convolutional architectures provide a fundamental inductive bias underlying most successful deep learning vision models. As we explain next, the notion of geometric stability provides a possible framework to understand their success.
Let Ω be a compact d-dimensional Euclidean domain on which square-integrable functions x ∈ L²(Ω) are defined (for example, in image analysis applications, images can be thought of as functions on the unit square Ω = [0,1]²). In a supervised learning task, an unknown function f: L²(Ω) → Y is observed on a training set

{(x_i ∈ L²(Ω), y_i = f(x_i))}_{i=1,...,N},   (4)

where the target space Y can be thought of as being discrete in a standard classification setup (with |Y| being the number of classes), or as Y = R in a regression task.
In the vast majority of computer vision and speech analysis tasks, the unknown function f typically satisfies the following crucial assumptions:
1) Stationarity: Consider a translation operator

(T_v x)(u) = x(u - v),   u, v ∈ Ω,   (5)

acting on functions x ∈ L²(Ω) (we assume periodic boundary conditions to ensure that the operation is well-defined over Ω). Depending on the task, we assume that the function f is either invariant or equivariant with respect to translations. In the former case, we have f(T_v x) = f(x) for any x ∈ L²(Ω) and v ∈ Ω. This is typically the case in object classification tasks. In the latter, we have f(T_v x) = T_v f(x), which is well-defined when the output of the model is a space in which translations can act (for example, in problems of object localization, semantic segmentation, or motion estimation). Our definition of invariance should not be confused with the traditional notion of translation-invariant systems in signal processing, which corresponds to translation equivariance in our language (since the output translates whenever the input translates).
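Translation equivariance holds exactly for circular convolution, and can be checked numerically (a minimal NumPy sketch on a 1-D periodic domain; the signal length and random filter are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def shift(x, v):
    # Translation operator T_v of (5), with periodic boundary conditions.
    return np.roll(x, v)

def circ_conv(x, g):
    # Circular convolution: a linear, translation-equivariant map.
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(g)))

x = rng.normal(size=64)   # a 1-D periodic "image"
g = rng.normal(size=64)   # a filter
v = 5

lhs = circ_conv(shift(x, v), g)   # convolve the shifted input
rhs = shift(circ_conv(x, g), v)   # shift the convolved output
# lhs and rhs agree up to numerical precision: equivariance.
```

This is the sense in which a "translation-invariant system" in classical signal processing is translation equivariant in the language used here.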
2) Local deformations and scale separation: Similarly, a deformation L_τ, where τ: Ω → Ω is a smooth vector field, acts on L²(Ω) as (L_τ x)(u) = x(u - τ(u)). Deformations can model local translations, changes in viewpoint, rotations and frequency transpositions [9]. Most tasks studied in computer vision are not only translation invariant/equivariant, but, more importantly, also stable with respect to local deformations [52, 9]. In tasks that are translation invariant we have

|f(L_τ x) - f(x)| ≈ ||∇τ||,   (6)

for all x and τ, where ||∇τ|| measures the smoothness of a given deformation field. In other words, the quantity to be predicted does not change much if the input image is slightly deformed. In tasks that are translation equivariant, we have

|f(L_τ x) - L_τ f(x)| ≈ ||∇τ||.   (7)

This property is much stronger than stationarity, since the space of local deformations has high dimensionality – of the order of n when we discretize images with n pixels – as opposed to the translation group, which has only d dimensions (d = 2 in the case of images).
Assumptions (6)-(7) can be leveraged to approximate f from features that progressively reduce the spatial resolution. Indeed, extracting, demodulating and downsampling localized filter responses creates local summaries that are insensitive to local translations, but this loss of sensitivity does not affect our ability to approximate f, thanks to (6)-(7). To illustrate this principle, denote by

p(x(u), x(u + v))   (8)

the joint distribution of two image pixels at an offset v from each other. In the presence of long-range statistical dependencies, this joint distribution will not be separable for any v. However, the deformation stability prior states that this joint distribution changes little when the offset v is slightly perturbed. In other words, whereas long-range dependencies indeed exist in natural images and are critical to object recognition, they can be captured and downsampled at different scales. Although this principle of stability to local deformations has been exploited in the computer vision community in models other than CNNs, for instance, deformable parts models [53], CNNs strike a good balance in terms of approximation power, optimization, and invariance.

Indeed, stationarity and stability to local translations are both leveraged in convolutional neural networks (CNNs). A CNN consists of several convolutional layers acting on a p-dimensional input x(u) = (x_1(u), ..., x_p(u)) by applying a bank of filters Γ = (γ_{l,l'}), l = 1, ..., q, l' = 1, ..., p, and a pointwise nonlinearity ψ,

y_l(u) = ψ( Σ_{l'=1}^{p} (x_{l'} ∗ γ_{l,l'})(u) ),   l = 1, ..., q,   (9)

producing a q-dimensional output y(u) = (y_1(u), ..., y_q(u)), often referred to as the feature maps. Here,

(x ∗ γ)(u) = ∫_Ω x(u - v) γ(v) dv   (10)

denotes the standard convolution. According to the local deformation prior, the filters γ_{l,l'} have compact spatial support.
Additionally, a downsampling or pooling layer may be used, defined as

y_l(u) = P({x_l(u') : u' ∈ N(u)}),   l = 1, ..., q,   (11)

where N(u) ⊂ Ω is a neighborhood around u and P is a permutation-invariant function (such as an average, energy, or max pooling).
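A discrete analogue of the convolutional layer (9) and the pooling layer (11) can be sketched as follows (a minimal 1-D NumPy sketch; the filter bank size, periodic padding, and the choice of ReLU and average pooling are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_layer(x, filters, psi=lambda z: np.maximum(z, 0.0)):
    # One convolutional layer as in (9): convolve with each compactly
    # supported filter, then apply the pointwise nonlinearity psi.
    pad = len(filters[0]) // 2
    xp = np.concatenate([x[-pad:], x, x[:pad]])       # periodic padding
    maps = [np.convolve(xp, g, mode="valid")[: len(x)] for g in filters]
    return psi(np.stack(maps))

def avg_pool(maps, size=2):
    # Pooling layer as in (11): average over non-overlapping
    # neighborhoods; averaging is permutation-invariant.
    q, n = maps.shape
    n = (n // size) * size
    return maps[:, :n].reshape(q, -1, size).mean(axis=2)

x = rng.normal(size=32)
filters = rng.normal(size=(4, 3))     # bank of q = 4 filters, support 3
y = avg_pool(conv_layer(x, filters))  # feature maps at half resolution
```

Composing several such pairs of layers yields the hierarchical representation described next.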
A convolutional network is constructed by composing several convolutional and optionally pooling layers, obtaining a generic hierarchical representation

Φ(x) = (C_{Γ^{(K)}} ⋯ P ⋯ C_{Γ^{(1)}})(x),   (12)

where C_{Γ^{(k)}} denotes a convolutional layer (9) with filter bank Γ^{(k)}, P denotes a pooling layer (11), and Θ = {Γ^{(1)}, ..., Γ^{(K)}} is the hypervector of the network parameters (all the filter coefficients). The output features enjoy translation invariance/covariance depending on whether spatial resolution is progressively lost by means of pooling or kept fixed. Moreover, if one specifies the convolutional tensors to be complex wavelet decomposition operators and uses complex modulus as pointwise nonlinearities, one can provably obtain stability to local deformations [54]. Although this stability is not rigorously proved for generic compactly supported convolutional tensors, it underpins the empirical success of CNN architectures across a variety of computer vision applications [1].
A key advantage of CNNs explaining their success in numerous tasks is that the geometric priors on which CNNs are based result in a sample complexity that avoids the curse of dimensionality. Thanks to the stationarity and local deformation priors, the linear operators at each layer have a constant number of parameters, independent of the input size n (number of pixels in an image). Moreover, thanks to the multiscale hierarchical property, the number of layers grows at a rate O(log n), resulting in a total learning complexity of O(log n) parameters.

Finally, there has recently been an effort to extend the geometric stability priors to data that is not defined over a Euclidean domain, where the translation group is generally not defined. In particular, researchers are exploiting geometry in general graphs via the spectrum of graph Laplacians and its spatial counterparts; see [55] for a recent survey of those advances.
V. Structure-Based Theory for Deep Learning
V-A. Structure of the data throughout a neural network
An important aspect of better understanding deep learning is the relationship between the structure of the data and the deep network. For a formal analysis, consider the case of a network that has random i.i.d. Gaussian weights, which is a common initialization in training deep networks. Recent work [56] shows that such networks with random weights preserve the metric structure of the data as they propagate along the layers, allowing for stable recovery of the original data from the features calculated by the network – a property that is often encountered in general deep networks [57, 58].
More precisely, the work of [56] shows that the input to the network can be recovered from the network's features at a certain layer if their size is proportional to the intrinsic dimension of the input data. This is similar to data reconstruction from a small number of random projections [59, 60]. However, while random projections preserve the Euclidean distance between two inputs up to a small distortion, each layer of a deep network with random weights distorts this distance proportionally to the angle between the two inputs: the smaller the angle, the stronger the shrinkage of the distance. Therefore, the deeper the network, the stronger the shrinkage achieved. Note that this does not contradict the fact that it is possible to recover the input from the output; even when properties such as lighting, pose and location are removed from an image (up to a certain extent), the resemblance to the original image is still maintained.
As random projection is a universal sampling strategy for lowdimensional data [59, 60, 61], deep networks with random weights are a universal system that separates any data (belonging to a lowdimensional model) according to the angles between the data points, where the general assumption is that there are large angles between different classes [62, 63]. As the training of the projection matrix adapts it to better preserve specific distances over others, the training of a network prioritizes intraclass angles over interclass ones. This relation is alluded to by the proof techniques in [56] and is empirically manifested by observing the angles and Euclidean distances at the output of trained networks.
By using the theory of 1-bit compressed sensing, [64] shows that each layer of a network preserves the metric of its input data in the Gromov-Hausdorff sense up to a small constant, under the assumption that these data reside in a low-dimensional manifold. This allows drawing conclusions on the tessellation of the space created by each layer and the relationship between the operation of these layers and locality-sensitive hashing (LSH). It also implies that it is possible to retrieve the input of a layer, up to a certain accuracy, from its output. This shows that every layer preserves the important information of the data.
An analysis of the behavior of the Euclidean distances and angles in the data along the network reveals an important effect of the ReLU. Without a nonlinearity, each layer is simply a random projection, for which Euclidean distances are approximately preserved. The addition of a ReLU makes the system sensitive to the angles between points. In this case, the network tends to decrease the Euclidean distances between points with small angles between them ("same class") more than the distances between points with large angles between them ("different classes"). Still, low-dimensional data at the input remain such throughout the entire network, i.e., deep networks (almost) do not increase the intrinsic dimension of the data [56]. This is related to the recent work in [65], which claims that deep networks may be viewed as a sparse coding procedure, leading to guarantees on the uniqueness of the representation calculated by the network and its stability.
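The contrast between a plain random projection and a random layer with a ReLU can be observed directly (a minimal NumPy sketch; the dimensions, the sqrt(2) rescaling that keeps output norms comparable to input norms, and the pair of orthogonal unit vectors are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

d, m = 64, 4096
x = rng.normal(size=d); x /= np.linalg.norm(x)
# A unit vector orthogonal to x (angle pi/2).
y = rng.normal(size=d); y -= (y @ x) * x; y /= np.linalg.norm(y)

W = rng.normal(size=(d, m)) / np.sqrt(m)   # random Gaussian layer
relu = lambda z: np.maximum(z, 0.0)

# Linear random projection: Euclidean distances approximately preserved.
r_lin = np.linalg.norm(x @ W - y @ W) / np.linalg.norm(x - y)

# Random layer with a ReLU: the output distance now depends on the
# angle between the inputs, and is contracted relative to the input.
r_relu = (np.sqrt(2.0) * np.linalg.norm(relu(x @ W) - relu(y @ W))
          / np.linalg.norm(x - y))
```

Here `r_lin` is close to 1, while `r_relu` is strictly smaller, illustrating that the ReLU, not the random linear map, is what makes the embedding angle-sensitive.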
As random networks are blind to the data labels, training may discriminatively select the angles that cause the distance deformation. Therefore, it will cause distances between different classes to increase more than the distances within the same class. This is demonstrated in several simulations for different deep networks. It is observed that a potential main goal of the training of the network is to treat the class boundary points while keeping the other distances approximately the same.
V-B. Generalization error
The above suggests that there is a relation between the structure of the data and the error the network achieves in training, which leads us to study the relationship between the generalization error in deep networks and the data structure. The generalization error – the difference between the empirical error and the expected error – is a fundamental concept in statistical learning theory. Generalization error bounds offer insight into why learning from training samples is possible.
Consider a classification problem with data points $x$ and corresponding class labels $y \in \{1, \ldots, N_Y\}$, where $N_Y$ is the number of classes. A training set of $m$ samples drawn from a distribution $P$ is denoted by $S = \{(x_i, y_i)\}_{i=1}^{m}$, and the loss function is denoted by $\ell(g(x), y)$; it measures the discrepancy between the true label $y$ and the label $g(x)$ estimated by the classifier $g$. The empirical loss of a network $g$ associated with the training set $S$ is defined as
$\ell_{\mathrm{emp}}(g) = \frac{1}{m} \sum_{i=1}^{m} \ell(g(x_i), y_i)$   (13)
and the expected loss as
$\ell_{\mathrm{exp}}(g) = \mathbb{E}_{(x,y)\sim P}\left[\ell(g(x), y)\right]$   (14)
The generalization error is then given as:
$\mathrm{GE}(g) = \left|\ell_{\mathrm{exp}}(g) - \ell_{\mathrm{emp}}(g)\right|$   (15)
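Definitions (13)–(15) can be made concrete with a toy experiment: a one-dimensional threshold classifier is fit by empirical risk minimization on a small noisy training set, and the expected loss is approximated by Monte Carlo on a large held-out sample. The data model and sample sizes below are arbitrary illustrative choices, not taken from any of the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(m):
    """1-D toy task: the label is the sign of x, with 10% label noise."""
    x = rng.standard_normal(m)
    y = (x > 0).astype(int)
    flip = rng.random(m) < 0.1
    return x, np.where(flip, 1 - y, y)

def err(t, x, y):
    """Empirical 0-1 loss of the threshold classifier g_t(x) = 1[x > t]."""
    return float(np.mean((x > t).astype(int) != y))

x_tr, y_tr = sample(50)                                 # training set S (m = 50)
t_hat = min(x_tr, key=lambda t: err(t, x_tr, y_tr))     # ERM over thresholds

x_te, y_te = sample(200_000)                            # large i.i.d. sample from P
l_emp = err(t_hat, x_tr, y_tr)                          # empirical loss, eq. (13)
l_exp = err(t_hat, x_te, y_te)                          # Monte-Carlo estimate of eq. (14)
ge = abs(l_exp - l_emp)                                 # generalization error, eq. (15)
```

Because the threshold is tuned on the noisy training set, the empirical loss typically underestimates the expected loss, and the gap is exactly the quantity the bounds below try to control.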
Various measures, such as the VC-dimension [66, 67], the Rademacher or Gaussian complexities [68] and algorithm robustness [69], have been used to bound the generalization error (GE) in the context of deep networks. However, these measures do not explain the good generalization of deep networks in practice, where the number of parameters can often be larger than the number of training samples [70] or the networks are very deep [71]. For example, the VC-dimension of a deep network with the hard-threshold nonlinearity is equal to the number of parameters in the network, which implies that the sample complexity is linear in the number of parameters. On the other hand, the work in [72] bounds the generalization error independently of the number of parameters; yet, its bound for deep networks with ReLUs scales exponentially with the network depth. Similarly, the authors of [69] show that deep networks are robust provided that the norm of the weights in each layer is bounded, but the resulting bounds are exponential in that norm whenever it is greater than 1.
An alternative route, followed by [28, 29, 69, 73], bounds the generalization error in terms of the network's classification margin, which is independent of the depth and size but takes into account the structure of the data (through its covering number), and therefore avoids the above issues. Since the input margin is hard to compute directly, [28] ties it to the Jacobian matrix and the loss of the deep network, showing that bounding the spectral norm of the Jacobian matrix reduces the generalization error. This analysis applies to arbitrary network structures, types of nonlinearities and pooling operations. Furthermore, a relationship between the generalization error, the invariance in the data and the invariance of the network is formally characterized in [29].
Table I: Classification accuracy (%)

# train samples   ResNet   ResNet + Jac. reg.
2500              55.69    62.79
10000             71.79    78.70
50000 + aug.      93.34    94.32
Using the relationship between the generalization error and the network's Jacobian matrix, a new Jacobian-based regularization strategy is developed in [29], and its advantage is demonstrated for several networks and datasets. For example, Table I shows the improvement achieved when using this regularization with the Wide ResNet architecture [74] on CIFAR-10 for different numbers of training examples.
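To make the Jacobian-based penalty concrete, here is a minimal sketch — not the regularizer of [29] as implemented for Wide ResNets — for a hypothetical two-layer ReLU network: the input-output Jacobian is formed in closed form, and its squared spectral norm, the quantity the margin analysis of [28] controls, is added to the cross-entropy loss. All sizes and the penalty weight lam are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, k = 10, 32, 3                       # input dim, hidden units, classes (illustrative)
W1 = 0.1 * rng.standard_normal((h, d))
W2 = 0.1 * rng.standard_normal((k, h))

def forward(x):
    a = W1 @ x                            # pre-activations
    return W2 @ np.maximum(a, 0.0), a

def jacobian(x):
    """Input-output Jacobian df/dx of the two-layer ReLU network (k x d)."""
    _, a = forward(x)
    return (W2 * (a > 0)) @ W1            # W2 diag(1[a > 0]) W1

def jac_penalty(x):
    """Squared spectral norm of the Jacobian at x."""
    return np.linalg.norm(jacobian(x), 2) ** 2

def loss(x, y, lam=0.1):
    """Cross-entropy plus the Jacobian penalty (illustrative regularizer)."""
    logits, _ = forward(x)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return -np.log(p[y]) + lam * jac_penalty(x)
```

Since a ReLU network is piecewise linear, the Jacobian is exact away from the kinks, which makes the closed-form expression above easy to check by finite differences.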
VI Towards an Information-Theoretic Framework
The loss function of choice for training deep networks to solve supervised classification problems is the empirical cross-entropy
$\mathcal{L}(w) = H_{p,q}(\mathcal{D} \mid w) = -\frac{1}{m}\sum_{i=1}^{m} \log q(y_i \mid x_i; w)$   (16)

where $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{m}$ is the training set and $q(y \mid x; w)$ is the output distribution of the network with weights $w$.
This loss function is prone to overfitting, as the network could trivially memorize the training data instead of learning the underlying distribution. This problem is usually addressed by regularization, which can be explicit (e.g., the norm of the weights $w$, known as weight decay) or implicit in stochastic gradient descent. It was suggested by [76], almost a quarter century ago, that better regularization, hence less overfitting, might be achieved by limiting the information stored in the weights, measured by $\mathrm{KL}(q(w \mid \mathcal{D}) \,\|\, p(w))$, where $q(w \mid \mathcal{D})$ is the distribution of the weights after training and $p(w)$ is a prior on the weights. Choosing this regularizer leads to the loss function
$\mathcal{L}(w) = H_{p,q}(\mathcal{D} \mid w) + \beta\, \mathrm{KL}\big(q(w \mid \mathcal{D}) \,\|\, p(w)\big)$   (17)
where the first term denotes the empirical conditional cross-entropy obtained from $q(y \mid x; w)$. For $\beta = 1$, this is the variational lower bound on the likelihood of the observed data, and minimizing it can therefore be seen as a form of Bayesian inference of the weights. More generally, this is equivalent to the information bottleneck Lagrangian. The first term is the same as the empirical cross-entropy and ensures that the information stored in the weights is sufficient for the task, while the second term minimizes the amount of information stored. Thus, the weights learned by minimizing the cross-entropy with a KL regularizer approximate a minimal sufficient statistic of the training set. Computing the KL term and optimizing it was considered a show-stopper until recently, when advances in Stochastic Gradient Variational Bayes [77, 78] made efficient optimization possible.
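As a minimal sketch of the loss in (17), assume a factorized Gaussian posterior $q(w \mid \mathcal{D}) = \mathcal{N}(\mu, \mathrm{diag}(e^{\log\sigma^2}))$ over the weights and a standard normal prior $p(w) = \mathcal{N}(0, I)$; the KL term then has a well-known closed form. The Gaussian factorization is an illustrative assumption, one of several choices used in the variational literature [77, 78].

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * float(np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var))

def regularized_loss(cross_entropy, mu, log_var, beta=1.0):
    """Empirical cross-entropy plus beta times the KL term, as in (17)."""
    return cross_entropy + beta * gaussian_kl(mu, log_var)
```

The KL term vanishes exactly when the posterior equals the prior (no information stored in the weights) and grows as the posterior concentrates away from it, which is the regularization effect described above.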
But for a representation to be useful, it should not just efficiently memorize past (training) data; it should also reduce the effect of nuisance variability affecting future (test) data. Indeed, most of the variability in imaging data can be ascribed to the effects of illumination and viewpoint, quotienting out which leaves a thin set [79]. As we have already pointed out, deep networks are known to reduce the effects of nuisance variability in test data. This can be partly achieved through the architecture, in particular multiscale convolutional structures, and some of it may be ascribed to the optimization procedure, which converges to "flat minima." But the choice of regularizer is also responsible for the network's ability to discount nuisance variability. Denoting by $x$ the input sample, by $y$ the target variable, and by $z$ the (stochastic) representation of $x$ learned by a layer of the network, the trade-off between sufficiency and minimality of $z$ is formalized by the information bottleneck Lagrangian
$\mathcal{L} = H(y \mid z) + \beta\, I(z; x)$   (18)
where the first term ensures that the representation $z$ is sufficient for predicting $y$, while the second term ensures that its information content about $x$ remains minimal, i.e., that nuisances are filtered out.
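For discrete variables, both terms of (18) can be evaluated exactly from a joint distribution table; the sketch below does so for an arbitrary pmf $p(x, y, z)$ using base-2 entropies. In deep networks these quantities must instead be estimated or bounded, so this is purely an illustration of the Lagrangian itself.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (bits) of a pmf given as an array."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def ib_lagrangian(p_xyz, beta):
    """H(y|z) + beta * I(z;x) for a joint pmf indexed as p_xyz[x, y, z] (eq. (18))."""
    p_z = p_xyz.sum(axis=(0, 1))
    p_x = p_xyz.sum(axis=(1, 2))
    p_yz = p_xyz.sum(axis=0).ravel()
    p_xz = p_xyz.sum(axis=1).ravel()
    h_y_given_z = entropy(p_yz) - entropy(p_z)            # H(y,z) - H(z)
    i_zx = entropy(p_x) + entropy(p_z) - entropy(p_xz)    # H(x) + H(z) - H(x,z)
    return h_y_given_z + beta * i_zx
```

A fully informative representation ($z = x = y$) pays only the $\beta I(z;x)$ term, while a representation independent of $x$ pays the full conditional entropy $H(y \mid z) = H(y)$; minimizing (18) trades these off.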
Notice that, while formally equivalent, the losses in (17) and (18) are conceptually different: in the former, the weights are a representation of the training set that is minimal and sufficient; in the latter, the activations are a minimal representation of the test sample. We conjecture that the two are related, and that the relation informs the generalization properties of the network, but specific bounds have yet to be shown.
Different choices of the noise distribution, and different models of the marginal distribution, lead to slightly different analyses. For example, [18] considers additive Gaussian noise and Gaussian marginals, while [17] studies multiplicative noise distributions and considers both a scale-invariant log-uniform marginal (the only one compatible with the fact that networks with ReLU activations are scale invariant) and a log-normal marginal distribution. Interestingly, the special case in which the multiplicative noise is chosen from a Bernoulli distribution reduces to the well-known practice of Dropout [11], while choosing it from a multiplicative Gaussian distribution leads to Gaussian Dropout [77]. In order to compute the mutual-information term $I(z;x)$, it is commonly assumed that the activations are mutually independent, i.e., that the marginal is factorized; [17] shows that making this assumption corresponds to minimizing a modified loss function which also reduces the total correlation of the activations. Therefore, a choice dictated by convenience, made in order to compute the information bottleneck Lagrangian explicitly, yields a disentangled representation, with entanglement measured by total correlation.
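The two multiplicative-noise schemes just mentioned can be sketched side by side. Both are written in the standard "inverted" convention so that the noise is unbiased (mean 1); matching the Gaussian variance to $p/(1-p)$, the second moment of the Bernoulli mask, is an illustrative choice following the variational dropout literature [77], not a prescription from this text.

```python
import numpy as np

rng = np.random.default_rng(0)

def bernoulli_dropout(a, p=0.5):
    """Multiplicative Bernoulli noise: standard (inverted) dropout [11]."""
    mask = (rng.random(a.shape) >= p).astype(float) / (1.0 - p)
    return a * mask

def gaussian_dropout(a, p=0.5):
    """Multiplicative Gaussian noise N(1, p/(1-p)), matching the mask's moments."""
    return a * rng.normal(1.0, np.sqrt(p / (1.0 - p)), a.shape)
```

Because both noises have mean 1, the activations keep their expected value at training time and no rescaling is needed at test time.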
It is remarkable that empirical practice has converged to the use of the cross-entropy loss with dropout, which happens to be what first principles would prescribe: for certain choices of distribution, training is equivalent to minimizing the information bottleneck Lagrangian, which yields an approximation of a minimal sufficient invariant statistic, i.e., an optimal representation. Only recently have developments in theory made this association possible, while developments in optimization and hardware have made deep neural networks a sensational success.
VII Acknowledgments
René Vidal acknowledges grant NSF 1618485. Raja Giryes acknowledges the Global Innovation Fund (GIF). Stefano Soatto acknowledges grants ONR N000141512261, ARO W911NF1510564/66731CS, and AFOSR FA95501510229.
References
 [1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[2] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning, pages 807–814, 2010.
 [3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.

[4]
Jia Deng, Wei Dong, R Socher, LiJia Li, and Li FeiFei.
ImageNet: A largescale hierarchical image database.
In
IEEE Conference on Computer Vision and Pattern Recognition
, pages 248–255. IEEE, June 2009. 
[5]
G. Cybenko.
Approximation by superpositions of a sigmoidal function.
Mathematics of Control Signals and Systems, 2(4):303–314, 1989.  [6] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
 [7] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.
 [8] Andrew R Barron. Approximation and estimation bounds for artificial neural networks. Machine Learning, 14(1):115–133, 1994.
 [9] J. Bruna and S. Mallat. Invariant scattering convolution networks. Trans. PAMI, 35(8):1872–1886, 2013.
 [10] P. Bartlett and W. Maass. Vapnik-Chervonenkis dimension of neural nets. The Handbook of Brain Theory and Neural Networks, pages 1188–1192, 2003.
 [11] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 [12] R. Giryes, G. Sapiro, and A. Bronstein. Deep neural networks with random gaussian weights: A universal classification strategy? IEEE Transactions on Signal Processing, 64(13):3444–3457, 2016.
 [13] N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In Proc. of the Allerton Conf., 2000.
 [14] S. Soatto and A. Chiuso. Visual representations: Defining properties and deep approximations. Proc. of the Intl. Conf. on Learning Representations (ICLR); ArXiv: 1411.7676, May 2016.
 [15] F. Anselmi, L. Rosasco, and T. Poggio. On invariance and selectivity in representation learning. arXiv preprint arXiv:1503.05938, 2015.
 [16] Ravid ShwartzZiv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
 [17] A. Achille and S. Soatto. Information dropout: Learning optimal representations through noisy computation. arXiv:1611.01353, 2016.
 [18] Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.
 [19] PJ Werbos. Beyond regression: New tools for predictions and analysis in the behavioral science. Cambridge, MA, itd. PhD thesis, Harvard University, 1974.
 [20] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, pages 1–30, 2013.
 [21] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in highdimensional nonconvex optimization. In Neural Information Processing Systems, pages 2933–2941, 2014.

[22] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surfaces of multilayer networks. In International Conference on Artificial Intelligence and Statistics, pages 192–204, 2015.
 [23] Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933, 2017.
 [24] C. Baldassi, A. Ingrosso, C. Lucibello, L. Saglietti, and R. Zecchina. Local entropy as a measure for sampling solutions in constraint satisfaction problems. Journal of Statistical Mechanics: Theory and Experiment, 2016(2):023301, 2016.
 [25] P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina. EntropySGD: Biasing gradient descent into wide valleys. International Conference of Learning and Representations, 2017.
 [26] Pratik Chaudhari, Adam Oberman, Stanley Osher, Stefano Soatto, and Guillaume Carlier. Deep relaxation: partial differential equations for optimizing deep neural networks. arXiv:1704.04932, 2017.
 [27] C Daniel Freeman and Joan Bruna. Topology and geometry of halfrectified network optimization. ICLR, 2017.
 [28] J. Sokolić, R. Giryes, G. Sapiro, and M. Rodrigues. Robust large margin deep neural networks. to appear in IEEE Transactions on Signal Processing, 2017.
 [29] J. Sokolić, R. Giryes, G. Sapiro, and M. Rodrigues. Generalization error of invariant classifiers. In AISTATS, 2017.
 [30] B. Haeffele and R. Vidal. Global optimality in neural network training. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 [31] B. Haeffele and R. Vidal. Global optimality in tensor factorization, deep learning, and beyond. arXiv, abs/1506.07540, 2015.
 [32] B. Haeffele, E. Young, and R. Vidal. Structured lowrank matrix factorization: Optimality, algorithm, and applications to image processing. In International Conference on Machine Learning, pages 2007–2015, 2014.
 [33] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 11:19–60, 2010.
 [34] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by backpropagating errors. Cognitive Modeling, 5, 1988.
 [35] Stephen J Wright and Jorge Nocedal. Numerical Optimization, volume 2. Springer New York, 1999.
 [36] Yangyang Xu and Wotao Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758–1789, 2013.
 [37] George E Dahl, Tara N Sainath, and Geoffrey E Hinton. Improving deep neural networks for lvcsr using rectified linear units and dropout. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8609–8613, 2013.
 [38] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In International Conference on Machine Learning, volume 30, 2013.
 [39] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Neural Information Processing Systems, pages 1097–1105, 2012.
 [40] Matthew D. Zeiler, M. Ranzato, Rajat Monga, M. Mao, K. Yang, Quoc Viet Le, Patrick Nguyen, A. Senior, Vincent Vanhoucke, Jeffrey Dean, and Geoffry E. Hinton. On rectified linear units for speech processing. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3517–3521, 2013.

[41] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.
 [42] Martin Brady, Raghu Raghavan, and Joseph Slawny. Back propagation fails to separate where perceptrons succeed. IEEE Transactions on Circuits and Systems, 36(5):665–674, 1989.
 [43] M. Gori and A. Tesi. Backpropagation converges for multilayered networks and linearlyseparable patterns. In International Joint Conference on Neural Networks, volume 2, page 896. IEEE, 1991.
 [44] M. Gori and A. Tesi. On the problem of local minima in backpropagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(1):76–86, 1992.
 [45] Yoshua Bengio, Nicolas L Roux, Pascal Vincent, Olivier Delalleau, and Patrice Marcotte. Convex neural networks. In Neural Information Processing Systems, pages 123–130, 2005.
 [46] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.
 [47] Llew Mason, Jonathan Baxter, Peter L. Bartlett, and Marcus R. Frean. Boosting algorithms as gradient descent. In Neural Information Processing Systems, pages 512–518, 2000.
 [48] Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Beating the perils of nonconvexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.
 [49] Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pages 774–782, 2016.
 [50] B. Neyshabur, R. Salakhutdinov, and N. Srebro. PathSGD: Pathnormalized optimization in deep neural networks. In Neural Information Processing Systems, pages 2422–2430, 2015.
 [51] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
 [52] Stéphane Mallat. Understanding deep convolutional networks. Phil. Trans. R. Soc. A, 374(2065), 2016.
 [53] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained partbased models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
 [54] Stéphane Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331–1398, 2012.
 [55] M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: going beyond Euclidean data. arXiv preprint arXiv:1611.08097, 2016.
 [56] R. Giryes, G. Sapiro, and A. Bronstein. Deep neural networks with random Gaussian weights: a universal classification strategy? IEEE Trans. Sig. Proc., 64(13):3444–3457, July 2016.
 [57] J. Bruna, A. Szlam, and Y. LeCun. Signal recovery from pooling representations. In Int. Conf. on Machine Learning (ICML), pages 307–315, 2014.
 [58] A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5188–5196, 2015.
 [59] E. J. Candès and T. Tao. Nearoptimal signal recovery from random projections: Universal encoding strategies? IEEE Trans. Inf. Theory, 52(12):5406 –5425, Dec. 2006.
 [60] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.
 [61] R. Giryes, Y. Eldar, A. Bronstein, and G. Sapiro. Tradeoffs between convergence speed and reconstruction accuracy in inverse problems. arXiv:1605.09232, 2017.
 [62] L. Wolf and A. Shashua. Learning over sets using kernel principal angles. Journal of Machine Learning Research, 4:913–931, Oct. 2003.
 [63] E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2765–2781, 2013.

[64]
Y. Plan and R. Vershynin.
Dimension reduction by random hyperplane tessellations.
Discrete and Computational Geometry, 51(2):438–461, 2014.  [65] V. Papyan, Y. Romano, and M. Elad. Convolutional Neural Networks Analyzed via Convolutional Sparse Coding. arXiv:1607.08194, 2016.
 [66] Vladimir N Vapnik. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5):988–999, September 1999.
 [67] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
 [68] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
 [69] Huan Xu and Shie Mannor. Robustness and generalization. Machine Learning, 86(3):391–423, 2012.
 [70] C. Zhang, S. Bengio, M. Hardt, and B. Recht. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
 [71] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, December 2016.
 [72] B. Neyshabur, R. Tomioka, and N. Srebro. Normbased capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401, 2015.
 [73] J. Huang, Q. Qiu, G. Sapiro, and R. Calderbank. Discriminative robust transformation learning. In Neural Information Processing Systems, pages 1333–1341, 2015.
 [74] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv:1605.07146, 2016.
 [75] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In Proceedings of the 33rd International Conference on International Conference on Machine Learning  Volume 48, ICML’16, pages 1225–1234. JMLR.org, 2016.

[76]
G. Hinton and D. Van Camp.
Keeping the neural networks simple by minimizing the description
length of the weights.
In
Annual Conference on Computational Learning Theory
, pages 5–13. ACM, 1993.  [77] D. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. In Neural Information Processing Systems, pages 2575–2583, 2015.
 [78] D. Kingma and M. Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 [79] G. Sundaramoorthi, P. Petersen, V. S. Varadarajan, and S. Soatto. On the set of images modulo viewpoint and contrast changes. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.