Deep learning works remarkably well, and has helped dramatically improve the state-of-the-art in areas ranging from speech recognition, translation and visual object recognition to drug discovery, genomics and automatic game playing LeCun et al. (2015); Bengio (2009). However, it is still not fully understood why deep learning works so well. In contrast to GOFAI (“good old-fashioned AI”) algorithms that are hand-crafted and fully understood analytically, many algorithms using artificial neural networks are understood only at a heuristic level, where we empirically know that certain training protocols employing large data sets will result in excellent performance. This is reminiscent of the situation with human brains: we know that if we train a child according to a certain curriculum, she will learn certain skills, but we lack a deep understanding of how her brain accomplishes this.
This makes it timely and interesting to develop new analytic insights on deep learning and its successes, which is the goal of the present paper. Such improved understanding is not only interesting in its own right, and for potentially providing new clues about how brains work, but it may also have practical applications. Better understanding the shortcomings of deep learning may suggest ways of improving it, both to make it more capable and to make it more robust Russell et al. (2015).
I.1 The swindle: why does “cheap learning” work?
Throughout this paper, we will adopt a physics perspective on the problem, to prevent application-specific details from obscuring simple general results related to dynamics, symmetries, renormalization, etc., and to exploit useful similarities between deep learning and statistical mechanics.
The task of approximating functions of many variables is central to most applications of machine learning, including unsupervised learning, classification and prediction, as illustrated in Figure 1. For example, if we are interested in classifying faces, then we may want our neural network to implement a function where we feed in an image represented by a million greyscale pixels and get as output the probability distribution over a set of people that the image might represent.
When investigating the quality of a neural net, there are several important factors to consider:
Expressibility: What class of functions can the neural network express?
Efficiency: How many resources (neurons, parameters, etc.) does the neural network require to approximate a given function?
Learnability: How rapidly can the neural network learn good parameters for approximating a function?
This paper is focused on expressibility and efficiency, and more specifically on the following well-known problem (Herbrich and Williamson (2002); Shawe-Taylor et al. (1998); Poggio et al. (2015)): How can neural networks approximate functions well in practice, when the set of possible functions is exponentially larger than the set of practically possible networks? For example, suppose that we wish to classify megapixel greyscale images into two categories, e.g., cats or dogs. If each pixel can take one of 256 values, then there are $256^{1000000}$ possible images, and for each one, we wish to compute the probability that it depicts a cat. This means that an arbitrary function is defined by a list of $256^{1000000}$ probabilities, i.e., way more numbers than there are atoms in our universe (about $10^{78}$).
Yet neural networks with merely thousands or millions of parameters somehow manage to perform such classification tasks quite well. How can deep learning be so “cheap”, in the sense of requiring so few parameters?
We will see below that neural networks perform a combinatorial swindle, replacing exponentiation by multiplication: if there are, say, $n=10^6$ inputs taking $v=256$ values each, this swindle cuts the number of parameters from $v^n$ to $v\times n$ times some constant factor. We will show that the success of this swindle depends fundamentally on physics: although neural networks only work well for an exponentially tiny fraction of all possible inputs, the laws of physics are such that the data sets we care about for machine learning (natural images, sounds, drawings, text, etc.) are also drawn from an exponentially tiny fraction of all imaginable data sets. Moreover, we will see that these two tiny subsets are remarkably similar, enabling deep learning to work well in practice.
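To get a feel for the scale of this swindle, the two parameter counts can be compared directly. The sketch below uses the megapixel, 256-level greyscale example from the text; the function names are hypothetical, and the unspecified constant factor is simply taken to be 1.

```python
import math

def log10_lookup_table_size(num_inputs: int, num_values: int) -> float:
    """log10 of num_values**num_inputs, the size of a full lookup table."""
    return num_inputs * math.log10(num_values)

def swindle_size(num_inputs: int, num_values: int) -> int:
    """Parameter count once exponentiation is replaced by multiplication."""
    return num_inputs * num_values

n_pixels, n_levels = 10**6, 256  # megapixel greyscale image
print(f"lookup table: about 10^{log10_lookup_table_size(n_pixels, n_levels):.0f} entries")
print(f"swindle:      {swindle_size(n_pixels, n_levels):,} parameters (up to a constant factor)")
```

The lookup-table size is reported as a base-10 logarithm because the number itself has millions of digits.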
The rest of this paper is organized as follows. In Section II, we present results for shallow neural networks with merely a handful of layers, focusing on simplifications due to locality, symmetry and polynomials. In Section III, we study how increasing the depth of a neural network can provide polynomial or exponential efficiency gains even though it adds nothing in terms of expressivity, and we discuss the connections to renormalization, compositionality and complexity. We summarize our conclusions in Section IV.
II Expressibility and efficiency of shallow neural networks
Let us now explore what classes of probability distributions $p$ are the focus of physics and machine learning, and how accurately and efficiently neural networks can approximate them. We will be interested in probability distributions $p(\mathbf{x}|y)$, where $\mathbf{x}$ ranges over some sample space and $y$ will be interpreted either as another variable being conditioned on or as a model parameter. For a machine-learning example, we might interpret $y$ as an element of some set of animals and $\mathbf{x}$ as the vector of pixels in an image depicting such an animal, so that $p(\mathbf{x}|y)$ for $y=\text{cat}$ gives the probability distribution of images of cats with different coloring, size, posture, viewing angle, lighting condition, electronic camera noise, etc. For a physics example, we might interpret $y$ as an element of some set of metals and $\mathbf{x}$ as the vector of magnetization values for different parts of a metal bar. The prediction problem is then to evaluate $p(\mathbf{x}|y)$, whereas the classification problem is to evaluate $p(y|\mathbf{x})$.
Because of the above-mentioned “swindle”, accurate approximations are only possible for a tiny subclass of all probability distributions. Fortunately, as we will explore below, the function $p(\mathbf{x}|y)$ often has many simplifying features enabling accurate approximation, because it follows from some simple physical law or some generative model with relatively few free parameters: for example, its dependence on $\mathbf{x}$ may exhibit symmetry, locality and/or be of a simple form such as the exponential of a low-order polynomial. In contrast, the dependence of $p(y|\mathbf{x})$ on $y$ tends to be more complicated; it makes no sense to speak of symmetries or polynomials involving a categorical variable such as $y=\text{cat}$.
Let us therefore start by tackling the more complicated case of modeling $p(y|\mathbf{x})$. This probability distribution is determined by the hopefully simpler function $p(\mathbf{x}|y)$ via Bayes’ theorem:
$$p(y|\mathbf{x}) = \frac{p(\mathbf{x}|y)\,p(y)}{\sum_{y'} p(\mathbf{x}|y')\,p(y')}, \tag{1}$$
where $p(y)$ is the probability distribution over $y$ (animals or metals, say) a priori, before examining the data vector $\mathbf{x}$.
II.1 Probabilities and Hamiltonians
It is useful to introduce the negative logarithms of two of these probabilities:
$$H_y(\mathbf{x}) \equiv -\ln p(\mathbf{x}|y), \qquad \mu_y \equiv -\ln p(y). \tag{2}$$
Statisticians refer to $-\ln p$ as “self-information” or “surprisal”, and statistical physicists refer to $H_y(\mathbf{x})$ as the Hamiltonian, quantifying the energy of $\mathbf{x}$ (up to an arbitrary and irrelevant additive constant) given the parameter $y$. Table 1 is a brief dictionary translating between physics and machine-learning terminology. These definitions transform equation (1) into the Boltzmann form
$$p(y|\mathbf{x}) = \frac{1}{N(\mathbf{x})}\, e^{-[H_y(\mathbf{x})+\mu_y]}, \tag{3}$$
where
$$N(\mathbf{x}) \equiv \sum_y e^{-[H_y(\mathbf{x})+\mu_y]}. \tag{4}$$
This recasting of equation (1) is useful because the Hamiltonian tends to have properties making it simple to evaluate. We will see in Section III that it also helps understand the relation between deep learning and renormalization Mehta and Schwab (2014).
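Table 1: Dictionary translating between physics and machine-learning terminology.
|Physics|Machine learning|
|Free energy difference|KL-divergence|
|Effective theory|Nearly lossless data distillation|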
II.2 Bayes’ theorem as a softmax
Since the variable $y$ takes one of a discrete set of values, we will often write it as an index instead of as an argument, as $p_y(\mathbf{x}) \equiv p(y|\mathbf{x})$. Moreover, we will often find it convenient to view all values indexed by $y$ as elements of a vector, written in boldface, thus viewing $p_y$, $H_y$ and $\mu_y$ as elements of the vectors $\mathbf{p}$, $\mathbf{H}$ and $\boldsymbol{\mu}$, respectively. Equation (3) thus simplifies to
$$\mathbf{p}(\mathbf{x}) = \frac{1}{N(\mathbf{x})}\, e^{-[\mathbf{H}(\mathbf{x})+\boldsymbol{\mu}]}, \tag{5}$$
using the standard convention that a function (in this case $\exp$) applied to a vector acts on its elements.
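We wish to investigate how well this vector-valued function $\mathbf{p}(\mathbf{x})$ can be approximated by a neural net. A standard $n$-layer feedforward neural network maps vectors to vectors by applying a series of linear and nonlinear transformations in succession. Specifically, it implements vector-valued functions of the form LeCun et al. (2015)
$$\mathbf{f}(\mathbf{x}) = \boldsymbol{\sigma}_n \mathbf{A}_n \cdots \boldsymbol{\sigma}_2 \mathbf{A}_2 \boldsymbol{\sigma}_1 \mathbf{A}_1 \mathbf{x}, \tag{6}$$
where the $\boldsymbol{\sigma}_i$ are relatively simple nonlinear operators on vectors and the $\mathbf{A}_i$ are affine transformations of the form $\mathbf{A}_i\mathbf{x} = \mathbf{W}_i\mathbf{x} + \mathbf{b}_i$ for matrices $\mathbf{W}_i$ and so-called bias vectors $\mathbf{b}_i$. Popular choices for these nonlinear operators $\boldsymbol{\sigma}_i$ include
Local function: apply some nonlinear function $\sigma$ to each vector element,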
Max-pooling: compute the maximum of all vector elements,
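Softmax: exponentiate all vector elements and normalize them so that they sum to unity:
$$\tilde{\boldsymbol{\sigma}}(\mathbf{y}) \equiv \frac{e^{\mathbf{y}}}{\sum_i e^{y_i}}. \tag{7}$$
(We use $\tilde{\boldsymbol{\sigma}}$ to indicate the softmax function and $\sigma$ to indicate an arbitrary non-linearity, optionally with certain regularity requirements.)
This allows us to rewrite equation (5) as
$$\mathbf{p}(\mathbf{x}) = \tilde{\boldsymbol{\sigma}}\left[-\mathbf{H}(\mathbf{x})-\boldsymbol{\mu}\right], \tag{8}$$
so a network that computes the Hamiltonian vector needs only one additional softmax layer to output the classification probabilities, with $\boldsymbol{\mu}$ acting as the bias of that final layer.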
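As a numerical sanity check of the claim in this section's title, the sketch below computes a posterior twice: once directly from Bayes' theorem, and once as a softmax of the negated sum of the negative log-likelihood (the Hamiltonian) and the negative log-prior. The likelihood and prior values, the three-class setup and all function names are made up for illustration.

```python
import math

def softmax(z):
    """Exponentiate and normalize; subtracting max(z) only improves stability."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def posterior_bayes(likelihoods, priors):
    """Bayes' theorem: p(y|x) = p(x|y) p(y) / sum_y' p(x|y') p(y')."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    total = sum(joint)
    return [j / total for j in joint]

def posterior_softmax(likelihoods, priors):
    """The same posterior, computed as softmax(-H - mu)."""
    hamiltonian = [-math.log(l) for l in likelihoods]  # H_y = -ln p(x|y)
    mu = [-math.log(p) for p in priors]                # mu_y = -ln p(y)
    return softmax([-(h + m) for h, m in zip(hamiltonian, mu)])

# Hypothetical values of p(x|y) and p(y) for three classes and one fixed image x.
likelihoods = [0.02, 0.005, 0.001]
priors = [0.5, 0.3, 0.2]
print(posterior_bayes(likelihoods, priors))
print(posterior_softmax(likelihoods, priors))
```

The two printed vectors agree up to floating-point rounding, which is the content of the softmax form of Bayes' theorem.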
II.3 What Hamiltonians can be approximated by feasible neural networks?
It has long been known that neural networks are universal approximators Hornik et al. (1989); Cybenko (1989) (Footnote 2), in the sense that networks with virtually all popular nonlinear activation functions $\sigma(x)$ can approximate any smooth function to any desired accuracy, even using merely a single hidden layer. However, these theorems do not guarantee that this can be accomplished with a network of feasible size, and the following simple example explains why they cannot: There are $2^{2^n}$ different Boolean functions of $n$ variables, so a network implementing a generic function in this class requires at least $2^n$ bits to describe, i.e., more bits than there are atoms in our universe if $n>260$.
Footnote 2: Neurons are universal analog computing modules in much the same way that NAND gates are universal digital computing modules: any computable function can be accurately evaluated by a sufficiently large network of them. Just as NAND gates are not unique (NOR gates are also universal), nor is any particular neuron implementation; indeed, any generic smooth nonlinear activation function is universal Hornik et al. (1989); Cybenko (1989).
The fact that neural networks of feasible size are nonetheless so useful therefore implies that the class of functions we care about approximating is dramatically smaller.
We will see below in Section II.4 that both physics and machine learning tend to favor Hamiltonians that are polynomials, and indeed often ones that are sparse, symmetric and low-order (Footnote 3). Let us therefore focus our initial investigation on Hamiltonians that can be expanded as a power series:
$$H_y(\mathbf{x}) = h + \sum_i h_i x_i + \sum_{i\le j} h_{ij} x_i x_j + \sum_{i\le j\le k} h_{ijk} x_i x_j x_k + \cdots. \tag{9}$$
If the vector $\mathbf{x}$ has $n$ components ($i=1,\ldots,n$), then there are $(n+d)!/(n!\,d!)$ terms of degree up to $d$.
Footnote 3: The class of functions that can be exactly expressed by a neural network must be invariant under composition, since adding more layers corresponds to using the output of one function as the input to another. Important such classes include linear functions, affine functions, piecewise linear functions (generated by the popular Rectified Linear unit “ReLU” activation function $\sigma(x)=\max[0,x]$), polynomials, continuous functions and smooth functions whose derivatives are continuous. According to the Stone-Weierstrass theorem, both polynomials and piecewise linear functions can approximate continuous functions arbitrarily well.
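The number of coefficients in such a truncated power series can be checked with a few lines of code. The sketch below relies on the standard combinatorial fact that the number of monomials of total degree at most d in n variables is the binomial coefficient C(n+d, d) = (n+d)!/(n! d!), and cross-checks it by brute-force enumeration; the function names are hypothetical.

```python
import math
from itertools import product

def num_coefficients(n: int, d: int) -> int:
    """Number of monomials of total degree <= d in n variables: C(n+d, d)."""
    return math.comb(n + d, d)

def num_coefficients_brute_force(n: int, d: int) -> int:
    """Count exponent tuples (a_1, ..., a_n) with a_1 + ... + a_n <= d."""
    return sum(1 for exps in product(range(d + 1), repeat=n) if sum(exps) <= d)

for n, d in [(2, 2), (10, 2), (10, 4), (1000, 4)]:
    print(f"n={n}, d={d}: {num_coefficients(n, d):,} coefficients")
```

Even at low degree the count grows quickly with the number of variables, which is why additional structure such as sparsity and symmetry matters.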
II.3.1 Continuous input variables
If we can accurately approximate multiplication using a small number of neurons, then we can construct a network efficiently approximating any polynomial by repeated multiplication and addition. We will now see that we can, using any smooth but otherwise arbitrary non-linearity $\sigma$ that is applied element-wise. The popular logistic sigmoid activation function $\sigma(x) = 1/(1+e^{-x})$ will do the trick.
Theorem: Let $\mathbf{f}$ be a neural network of the form $\mathbf{f} = \mathbf{A}_2\boldsymbol{\sigma}\mathbf{A}_1$, where $\boldsymbol{\sigma}$ acts elementwise by applying some smooth non-linear function $\sigma$ to each element. Let the input layer, hidden layer and output layer have sizes 2, 4 and 1, respectively. Then $\mathbf{f}$ can approximate a multiplication gate arbitrarily well.
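To see this, let us first Taylor-expand the function $\sigma$ around the origin:
$$\sigma(u) = \sigma_0 + \sigma_1 u + \sigma_2 \frac{u^2}{2} + \mathcal{O}(u^3). \tag{10}$$
Without loss of generality, we can assume that $\sigma_2 \neq 0$: since $\sigma$ is non-linear, it must have a non-zero second derivative at some point, so we can use the biases in $\mathbf{A}_1$ to shift the origin to this point to ensure $\sigma_2 \neq 0$. Equation (10) now implies that
$$m(u,v) \equiv \frac{\sigma(u+v)+\sigma(-u-v)-\sigma(u-v)-\sigma(-u+v)}{4\sigma_2} = uv\left[1+\mathcal{O}(u^2+v^2)\right], \tag{11}$$
where we will term $m(u,v)$ the multiplication approximator. Taylor’s theorem guarantees that $m(u,v)$ is an arbitrarily good approximation of $uv$ for arbitrarily small $|u|$ and $|v|$. However, we can always make $|u|$ and $|v|$ arbitrarily small by scaling $\mathbf{A}_1 \to \lambda\mathbf{A}_1$ and then compensating by scaling $\mathbf{A}_2 \to \lambda^{-2}\mathbf{A}_2$. In the limit that $\lambda \to 0$, this approximation becomes exact. In other words, arbitrarily accurate multiplication can always be achieved using merely 4 neurons. Figure 2 illustrates such a multiplication approximator. (Of course, a practical algorithm like stochastic gradient descent cannot achieve arbitrarily large weights, though a reasonably good approximation can be achieved already for modest values of $\lambda^{-1}$.)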
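The four-neuron multiplication gate can be tried out numerically. The following is a minimal sketch, not the authors' implementation: it assumes the logistic sigmoid as the nonlinearity, and because that function has zero second derivative at the origin, it uses a hidden-layer bias (arbitrarily set to 1.0 here) to move the expansion point to where the second derivative is non-zero. The scale parameter `lam` and all function names are choices made for this illustration.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_second_derivative(z: float) -> float:
    s = sigmoid(z)
    return s * (1.0 - s) * (1.0 - 2.0 * s)

def multiply_gate(u: float, v: float, lam: float = 0.01, bias: float = 1.0) -> float:
    """Approximate u*v with four sigmoid neurons in one hidden layer."""
    a, b = lam * u, lam * v  # first affine layer scales the inputs down
    hidden = [sigmoid(bias + a + b), sigmoid(bias - a - b),
              sigmoid(bias + a - b), sigmoid(bias - a + b)]
    sigma2 = sigmoid_second_derivative(bias)
    # Second affine layer: signed sum, rescaled by 1 / (4 * sigma2 * lam**2).
    return (hidden[0] + hidden[1] - hidden[2] - hidden[3]) / (4.0 * sigma2 * lam ** 2)

for lam in (1.0, 0.1, 0.01):
    print(f"lam={lam}: 3 * (-2) ~ {multiply_gate(3.0, -2.0, lam=lam):.5f}")
```

Shrinking `lam` improves the approximation at the cost of larger output weights, which is the trade-off noted in the parenthetical remark above.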
Corollary: For any given multivariate polynomial and any tolerance $\epsilon > 0$, there exists a neural network of fixed finite size $N$ (independent of $\epsilon$) that approximates the polynomial to accuracy better than $\epsilon$. Furthermore, $N$ is bounded by the complexity of the polynomial, scaling as the number of multiplications required times a factor that is typically slightly larger than 4 (Footnote 4).
This is a stronger statement than the classic universal approximation theorems for neural networks Hornik et al. (1989); Cybenko (1989), which guarantee that for every $\epsilon$ there exists some $N(\epsilon)$, but allow for the possibility that $N(\epsilon) \to \infty$ as $\epsilon \to 0$. An approximation theorem in Pinkus (1999) provides an $\epsilon$-independent bound on the size of the neural network, but at the price of choosing a pathological function $\sigma$.
Footnote 4: In addition to the four neurons required for each multiplication, additional neurons may be deployed to copy variables to higher layers bypassing the nonlinearity in $\sigma$. Such linear “copy gates” implementing the function $u \to u$ are of course trivial to implement using a simpler version of the above procedure: using $\mathbf{A}_1$ to shift and scale down the input to fall in a tiny range where $\sigma'(u) \neq 0$, and then scaling it up and shifting accordingly with $\mathbf{A}_2$.
II.3.2 Discrete input variables
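For the simple but important case where $\mathbf{x}$ is a vector of bits, so that $x_i=0$ or $x_i=1$, the fact that $x_i^2=x_i$ makes things even simpler. This means that only terms where all variables are different need be included, which simplifies equation (9) to
$$H_y(\mathbf{x}) = h + \sum_i h_i x_i + \sum_{i<j} h_{ij} x_i x_j + \sum_{i<j<k} h_{ijk} x_i x_j x_k + \cdots. \tag{12}$$
The infinite series equation (9) thus gets replaced by a finite series with $2^n$ terms, ending with the term $h_{1\ldots n}\,x_1\cdots x_n$. Since there are $2^n$ possible bit strings $\mathbf{x}$, the $2^n$ $h$-parameters in equation (12) suffice to exactly parametrize an arbitrary function $H_y(\mathbf{x})$.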
The efficient multiplication approximator above multiplied only two variables at a time, thus requiring multiple layers to evaluate general polynomials. In contrast, $H(\mathbf{x})$ for a bit vector $\mathbf{x}$ can be implemented using merely three layers as illustrated in Figure 2, where the middle layer evaluates the bit products and the third layer takes a linear combination of them. This is because bits allow an accurate multiplication approximator that takes the product of an arbitrary number of bits at once, exploiting the fact that a product of bits can be trivially determined from their sum: for example, the product $x_1x_2x_3=1$ if and only if the sum $x_1+x_2+x_3=3$. This sum-checking can be implemented using one of the most popular choices for a nonlinear function $\sigma$: the logistic sigmoid $\sigma(x)=1/(1+e^{-x})$, which satisfies $\sigma(x)\approx 0$ for $x\ll 0$ and $\sigma(x)\approx 1$ for $x\gg 0$. To compute the product of some set of $k$ bits described by the set $K$ (for our example above, $K=\{1,2,3\}$), we let $\mathbf{A}_1$ and $\mathbf{A}_2$ shift and stretch the sigmoid to exploit the identity
$$\prod_{i\in K} x_i = \lim_{\beta\to\infty} \sigma\left[-\beta\left(k-\frac{1}{2}-\sum_{i\in K} x_i\right)\right]. \tag{13}$$
Since $\sigma$ decays exponentially fast toward $0$ or $1$ as $\beta$ is increased, modestly large $\beta$-values suffice in practice; if, for example, we want the correct answer to $D=10$ decimal places, we merely need $\beta > 2\ln 10^D \approx 46$. In summary, when $\mathbf{x}$ is a bit string, an arbitrary function $p_y(\mathbf{x})$ can be evaluated by a simple 3-layer neural network: the middle layer uses sigmoid functions to compute the products from equation (12), and the top layer performs the sums from equation (12) and the softmax from equation (8).
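The sum-checking trick for bit products is easy to verify exhaustively on short bit strings. In the sketch below, a single sigmoid neuron is given a threshold halfway between k-1 and k and a steepness called `beta` in the code; the default value of 50 and the function names are assumptions made for this illustration.

```python
import math
from itertools import product

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def bit_product(bits, beta: float = 50.0) -> float:
    """Approximate the product of k bits with a single sigmoid neuron.

    The neuron fires only when the sum of the bits reaches k, which for
    0/1 inputs happens exactly when every bit equals 1.
    """
    k = len(bits)
    return sigmoid(-beta * (k - 0.5 - sum(bits)))

for bits in product((0, 1), repeat=3):
    exact = bits[0] * bits[1] * bits[2]
    print(bits, exact, round(bit_product(bits), 6))
```

Because the neuron only sees the sum of its inputs, the same construction handles any number of bits with one hidden unit per product term.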
II.4 What Hamiltonians do we want to approximate?
We have seen that polynomials can be accurately approximated by neural networks using a number of neurons scaling either as the number of multiplications required (for the continuous case) or as the number of terms (for the binary case). But polynomials per se are no panacea: with binary input, all functions are polynomials, and with continuous input, there are $(n+d)!/(n!\,d!)$ coefficients in a generic polynomial of degree $d$ in $n$ variables, which easily becomes unmanageably large. We will now discuss situations in which exceptionally simple polynomials that are sparse, symmetric and/or low-order play a special role in physics and machine-learning.
II.4.1 Low polynomial order
The Hamiltonians that show up in physics are not random functions, but tend to be polynomials of very low order, typically of degree ranging from 2 to 4. The simplest example is of course the harmonic oscillator, which is described by a Hamiltonian that is quadratic in both position and momentum. There are many reasons why low order polynomials show up in physics. One of the most important is that a phenomenon can sometimes be studied perturbatively, in which case Taylor’s theorem suggests that we can get away with a low order polynomial approximation. A second reason is renormalization: higher order terms in the Hamiltonian of a statistical field theory tend to be negligible if we only observe macroscopic variables.
At a fundamental level, the Hamiltonian of the standard model of particle physics has $d=4$. There are many approximations of this quartic Hamiltonian that are accurate in specific regimes, for example the Maxwell equations governing electromagnetism, the Navier-Stokes equations governing fluid dynamics, the Alfvén equations governing magnetohydrodynamics and various Ising models governing magnetization. All of these approximations have Hamiltonians that are polynomials in the field variables, of degree $d$ ranging from 2 to 4.
There are additional reasons why we might expect low order polynomials. Thanks to the Central Limit Theorem Gnedenko et al. (1954), many probability distributions in machine-learning and statistics can be accurately approximated by multivariate Gaussians, i.e., of the form
$$p(\mathbf{x}) = e^{h + \sum_i h_i x_i - \sum_{ij} h_{ij} x_i x_j}, \tag{14}$$
which means that the Hamiltonian $H = -\ln p$ is a quadratic polynomial. More generally, the maximum-entropy probability distribution subject to constraints on some of the lowest moments, say expectation values of the form $\langle x_1^{\alpha_1} x_2^{\alpha_2}\cdots x_n^{\alpha_n}\rangle$ for some integers $\alpha_i \ge 0$, would lead to a Hamiltonian of degree no greater than $d \equiv \sum_i \alpha_i$ Jaynes (1957).
Image classification tasks often exploit invariance under translation, rotation, and various nonlinear deformations of the image plane that move pixels to new locations. All such spatial transformations are linear functions ($d=1$ polynomials) of the pixel vector $\mathbf{x}$. Functions implementing convolutions and Fourier transforms are also $d=1$ polynomials.
Of course, such arguments do not imply that we should expect to see low order polynomials in every application. If we consider some data set generated by a very simple Hamiltonian (say the Ising Hamiltonian), but then discard some of the random variables, the resulting distribution will in general become quite complicated. Similarly, if we do not observe the random variables directly, but observe some generic functions of the random variables, the result will generally be a mess. These arguments, however, might indicate that the probability of encountering a Hamiltonian described by a low-order polynomial in some application might be significantly higher than what one might expect from some naive prior. For example, a uniform prior on the space of all polynomials of degree at most $N$ would suggest that a randomly chosen polynomial would almost always have degree exactly $N$, but this might be a bad prior for real-world applications.
We should also note that even if a Hamiltonian is described exactly by a low-order polynomial, we would not expect the corresponding neural network to reproduce a low-order polynomial Hamiltonian exactly in any practical scenario, for a host of possible reasons including limited data, the requirement of infinite weights for infinite accuracy, and the failure of practical algorithms such as stochastic gradient descent to find the global minimum of a cost function in many scenarios. So looking at the weights of a neural network trained on actual data may not be a good indicator of whether the underlying Hamiltonian is a polynomial of low degree.
One of the deepest principles of physics is locality: that things directly affect only what is in their immediate vicinity. When physical systems are simulated on a computer by discretizing space onto a rectangular lattice, locality manifests itself by allowing only nearest-neighbor interaction. In other words, almost all coefficients in equation (9) are forced to vanish, and the total number of non-zero coefficients grows only linearly with $n$. For the binary case of equation (9), which applies to magnetizations (spins) that can take one of two values, locality also limits the degree $d$ to be no greater than the number of neighbors that a given spin is coupled to (since all variables in a polynomial term must be different).
Again, the applicability of these considerations to particular machine learning applications must be determined on a case by case basis. Certainly, an arbitrary transformation of a collection of local random variables will result in a non-local collection. (This might ruin locality in certain ensembles of images, for example.) But there are certainly cases in physics where locality is still approximately preserved. For example, in the simple block-spin renormalization group, spins are grouped into blocks, which are then treated as random variables. To a high degree of accuracy, these blocks are only coupled to their nearest neighbors. Such locality is famously exploited by both biological and artificial visual systems, whose first neuronal layer performs merely fairly local operations.
Whenever the Hamiltonian obeys some symmetry (is invariant under some transformation), the number of independent parameters required to describe it is further reduced. For instance, many probability distributions in both physics and machine learning are invariant under translation and rotation. As an example, consider a vector $\mathbf{x}$ of air pressures $x_i$ measured by a microphone at times $i=1,\ldots,n$. Assuming that the Hamiltonian describing it has $d=2$ reduces the number of parameters from $\infty$ to $(n+1)(n+2)/2$. Further assuming locality (nearest-neighbor couplings only) reduces this to a number that grows only linearly with $n$, after which requiring translational symmetry reduces the parameter count to a small constant independent of $n$. Taken together, the constraints on locality, symmetry and polynomial order reduce the number of continuous parameters in the Hamiltonian of the standard model of physics to merely 32 Tegmark et al. (2006).
Symmetry can reduce not merely the parameter count, but also the computational complexity. For example, if a linear vector-valued function $\mathbf{f}(\mathbf{x})$ mapping a set of $n$ variables onto itself happens to satisfy translational symmetry, then it is a convolution (implementable by a convolutional neural net; “convnet”), which means that it can be computed with $n\log_2 n$ rather than $n^2$ multiplications using the Fast Fourier transform.
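The link between translational symmetry and convolution can be illustrated on a small periodic example. The sketch below builds a matrix whose entries depend only on the (cyclic) difference of row and column index, so that it is fixed by a single kernel rather than a full table of matrix entries, and checks that it commutes with shifts and acts as a circular convolution. Periodic boundary conditions, the example numbers and all names are assumptions of this illustration, and the Fast Fourier transform speed-up itself is not implemented here.

```python
def circulant(kernel):
    """Translation-symmetric matrix: entry (i, j) depends only on (i - j) mod n."""
    n = len(kernel)
    return [[kernel[(i - j) % n] for j in range(n)] for i in range(n)]

def matvec(matrix, x):
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def shift(x, steps=1):
    """Cyclically translate the vector by `steps` positions."""
    n = len(x)
    return [x[(i - steps) % n] for i in range(n)]

def circular_convolution(kernel, x):
    n = len(x)
    return [sum(kernel[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]

kernel = [2, -1, 0, 0, -1]   # n numbers specify the whole n-by-n map
x = [1, 4, 2, 0, 3]
W = circulant(kernel)
print(matvec(W, x))
print(circular_convolution(kernel, x))
print(matvec(W, shift(x)) == shift(matvec(W, x)))
```

Integer inputs are used so that the equality checks are exact rather than subject to rounding.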
III Why deep?
Above we investigated how probability distributions from physics and computer science applications lend themselves to “cheap learning”, being accurately and efficiently approximated by neural networks with merely a handful of layers. Let us now turn to the separate question of depth, i.e., the success of deep learning: what properties of real-world probability distributions cause efficiency to further improve when networks are made deeper? This question has been extensively studied from a mathematical point of view Delalleau and Bengio (2011); Mhaskar et al. (2016); Mhaskar and Poggio (2016), but mathematics alone cannot fully answer it, because part of the answer involves physics. We will argue that the answer involves the hierarchical/compositional structure of generative processes together with the inability to efficiently “flatten” neural networks reflecting this structure.
III.1 Hierarchical processes
One of the most striking features of the physical world is its hierarchical structure. Spatially, it is an object hierarchy: elementary particles form atoms which in turn form molecules, cells, organisms, planets, solar systems, galaxies, etc. Causally, complex structures are frequently created through a distinct sequence of simpler steps.
Figure 3 gives two examples of such causal hierarchies generating data vectors $y_0 \mapsto y_1 \mapsto \cdots \mapsto y_n$ that are relevant to physics and image classification, respectively. Both examples involve a Markov chain (Footnote 5) where the probability distribution $p(y_i)$ at the $i^{\rm th}$ level of the hierarchy is determined from its causal predecessor alone:
$$\mathbf{p}_i = \mathbf{M}_i\,\mathbf{p}_{i-1}, \tag{15}$$
where the probability vector $\mathbf{p}_i$ specifies the probability distribution of $p(y_i)$ according to $(\mathbf{p}_i)_y \equiv p(y_i = y)$ and the Markov matrix $\mathbf{M}_i$ specifies the transition probabilities between two neighboring levels, $p(y_i|y_{i-1})$. Iterating equation (15) gives
$$\mathbf{p}_n = \mathbf{M}_n \mathbf{M}_{n-1}\cdots\mathbf{M}_1\,\mathbf{p}_0, \tag{16}$$
so we can write the combined effect of the entire generative process as a matrix product.
Footnote 5: If the next step in the generative hierarchy requires knowledge not merely of the present state but also of the past, the present state can be redefined to include this information as well, thus ensuring that the generative process is a Markov process.
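The matrix-product form of such a generative hierarchy can be sketched with two small Markov matrices. The example below assumes column-stochastic matrices (each column sums to 1, so that a matrix acts on a probability vector from the left); the matrices, the initial distribution and the function names are all invented for illustration.

```python
def matvec(matrix, p):
    return [sum(m * v for m, v in zip(row, p)) for row in matrix]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def generate(markov_matrices, p0):
    """Propagate the top-level distribution down the hierarchy, one step at a time."""
    p = p0
    for matrix in markov_matrices:
        p = matvec(matrix, p)
    return p

# Column j of each matrix holds the transition probabilities out of state j.
M1 = [[0.9, 0.2],
      [0.1, 0.8]]            # level 0 (2 states) -> level 1 (2 states)
M2 = [[0.7, 0.1],
      [0.2, 0.3],
      [0.1, 0.6]]            # level 1 (2 states) -> level 2 (3 states)
p0 = [0.5, 0.5]

step_by_step = generate([M1, M2], p0)
all_at_once = matvec(matmul(M2, M1), p0)
print(step_by_step)
print(all_at_once)
```

Applying the steps one after another and applying the single combined matrix give the same final distribution, up to floating-point rounding.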
In our physics example (Figure 3, left), a set of cosmological parameters (the density of dark matter, etc.) determines the power spectrum of density fluctuations in our universe, which in turn determines the pattern of cosmic microwave background radiation reaching us from our early universe, which gets combined with foreground radio noise from our Galaxy to produce the frequency-dependent sky maps () that are recorded by a satellite-based telescope that measures linear combinations of different sky signals and adds electronic receiver noise. For the recent example of the Planck Satellite Adam et al. (2015), these datasets , contained about , , , and numbers, respectively.
More generally, if a given data set is generated by a (classical) statistical physics process, it must be described by an equation in the form of equation (16), since dynamics in classical physics is fundamentally Markovian: classical equations of motion are always first order differential equations in the Hamiltonian formalism. This technically covers essentially all data of interest in the machine learning community, although the fundamental Markovian nature of the generative process of the data may be an inefficient description.
Our toy image classification example (Figure 3, right) is deliberately contrived and over-simplified for pedagogy: $y_0$ is a single bit signifying “cat or dog”, which determines a set of parameters determining the animal’s coloration, body shape, posture, etc. using approximate probability distributions, which determine a 2D image via ray-tracing, which is scaled and translated by random amounts before a randomly generated background is added.
In both examples, the goal is to reverse this generative hierarchy to learn about the input $y \equiv y_0$ from the output $y_n \equiv \mathbf{x}$, specifically to provide the best possible estimate of the probability distribution $p(y|\mathbf{x}) = p(y_0|y_n)$, i.e., to determine the probability distribution for the cosmological parameters and to determine the probability that the image is a cat, respectively.
III.2 Resolving the swindle
This decomposition of the generative process into a hierarchy of simpler steps helps resolve the “swindle” paradox from the introduction: although the number of parameters required to describe an arbitrary function of the input data is beyond astronomical, the generative process can be specified by a more modest number of parameters, because each of its steps can. Whereas specifying an arbitrary probability distribution over multi-megapixel images requires far more bits than there are atoms in our universe, the information specifying how to compute the probability distribution for a microwave background map fits into a handful of published journal articles or software packages Seljak and Zaldarriaga (1996); Tegmark (1997a); Bond et al. (1998); Tegmark et al. (2003); Ade et al. (2014); Tegmark (1997b); Hinshaw et al. (2003). For a megapixel image of a galaxy, its entire probability distribution is defined by the standard model of particle physics with its 32 parameters Tegmark et al. (2006), which together specify the process transforming primordial hydrogen gas into galaxies.
The same parameter-counting argument can also be applied to all artificial images of interest to machine learning: for example, giving the simple low-information-content instruction “draw a cute kitten” to a random sample of artists will produce a wide variety of images with a complicated probability distribution over colors, postures, etc., as each artist makes random choices at a series of steps. Even the pre-stored information about cat probabilities in these artists’ brains is modest in size.
Note that a random resulting image typically contains much more information than the generative process creating it; for example, the simple instruction “generate a random string of $N$ bits” contains far fewer than $N$ bits when $N$ is large.
Not only are the typical steps in the generative hierarchy specified by a non-astronomical number of parameters, but as discussed in Section II.4, it is plausible that neural networks can implement each of the steps efficiently (Footnote 6).
Footnote 6: Although our discussion is focused on describing probability distributions, which are not random, stochastic neural networks can generate random variables as well. In biology, spiking neurons provide a good random number generator, and in machine learning, stochastic architectures such as restricted Boltzmann machines Hinton (2010) do the same.
A deep neural network stacking these simpler networks on top of one another would then implement the entire generative process efficiently. In summary, the data sets and functions we care about form a minuscule minority, and it is plausible that they can also be efficiently implemented by neural networks reflecting their generative process. So what is the remainder? Which are the data sets and functions that we do not care about?
Almost all images are indistinguishable from random noise, and almost all data sets and functions are indistinguishable from completely random ones. This follows from Borel’s theorem on normal numbers Émile Borel (1909), which states that almost all real numbers have a string of decimals that would pass any randomness test, i.e., are indistinguishable from random noise. Simple parameter counting shows that deep learning (and our human brains, for that matter) would fail to implement almost all such functions, and training would fail to find any useful patterns. To thwart pattern-finding efforts, cryptography therefore aims to produce random-looking patterns. Although we might expect the Hamiltonians describing human-generated data sets such as drawings, text and music to be more complex than those describing simple physical systems, we should nonetheless expect them to resemble the natural data sets that inspired their creation much more than they resemble random functions.
iii.3 Sufficient statistics and hierarchies
The goal of deep learning classifiers is to reverse the hierarchical generative process as well as possible, to make inferences about the input $y$ from the output $x$. Let us now treat this hierarchical problem more rigorously using information theory.
Given $P(y|x)$, a sufficient statistic $T(x)$ is defined by the equation $P(y|x) = P(y|T(x))$ and has played an important role in statistics for almost a century Fisher (1922). All the information about $y$ contained in $x$ is contained in the sufficient statistic. A minimal sufficient statistic Fisher (1922) is some sufficient statistic $T_*$ which is a sufficient statistic for all other sufficient statistics. This means that if $T(x)$ is sufficient, then there exists some function $f$ such that $T_*(x) = f(T(x))$. As illustrated in Figure 3, $T_*$ can be thought of as an information distiller, optimally compressing the data so as to retain all information relevant to determining $y$ and discarding all irrelevant information.
The sufficient statistic formalism enables us to state some simple but important results that apply to any hierarchical generative process cast in the Markov chain form of equation (16).
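Theorem 2: Given a Markov chain $y_0 \to y_1 \to \cdots \to y_n$ described by our notation above, let $T_i$ be a minimal sufficient statistic of $P(y_i|y_n)$. Then there exist functions $f_{i+1}$ such that $T_i = f_{i+1} \circ T_{i+1}$. More casually speaking, the generative hierarchy of Figure 3 can be optimally reversed one step at a time: there are functions $f_i$ that optimally undo each of the steps, distilling out all information about the level above that was not destroyed by the Markov process. Here is the proof. Note that for any $k \ge 1$, the "backwards" Markov property $P(y_i|y_{i+1}, y_{i+k}) = P(y_i|y_{i+1})$ follows from the Markov property via Bayes' theorem:
$$P(y_i|y_{i+k}, y_{i+1}) = \frac{P(y_{i+k}|y_i, y_{i+1})\,P(y_i|y_{i+1})}{P(y_{i+k}|y_{i+1})} = \frac{P(y_{i+k}|y_{i+1})\,P(y_i|y_{i+1})}{P(y_{i+k}|y_{i+1})} = P(y_i|y_{i+1}). \tag{17}$$
Using this fact, we see that
$$P(y_i|y_n) = \sum_{y_{i+1}} P(y_i|y_{i+1}, y_n)\,P(y_{i+1}|y_n) = \sum_{y_{i+1}} P(y_i|y_{i+1})\,P(y_{i+1}|T_{i+1}(y_n)). \tag{18}$$
Since the above equation depends on $y_n$ only through $T_{i+1}(y_n)$, this means that $T_{i+1}$ is a sufficient statistic for $P(y_i|y_n)$. But since $T_i$ is the minimal sufficient statistic, there exists a function $f_{i+1}$ such that $T_i = f_{i+1} \circ T_{i+1}$.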
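Corollary 2: With the same assumptions and notation as theorem 2, define the function $f_0(T_0) = P(y_0|T_0)$ and $f_n = T_{n-1}$. Then
$$P(y_0|y_n) = (f_0 \circ f_1 \circ \cdots \circ f_n)(y_n). \tag{19}$$
The proof is easy. By induction,
$$T_0 = f_1 \circ f_2 \circ \cdots \circ f_{n-1} \circ T_{n-1}, \tag{20}$$
which implies the corollary.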
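The "backwards" Markov property used in the proof can be checked numerically. The following is a minimal sketch, assuming a three-level chain $y_0 \to y_1 \to y_2$ with small discrete state spaces and randomly drawn transition matrices; the helper names (`random_stochastic`, `backward_gap`) and the state-space sizes are illustrative choices, not part of the original argument.

```python
import random

def random_stochastic(rows, cols, rng):
    """Return M with M[j][i] = P(next = j | prev = i); each column sums to 1."""
    columns = []
    for _ in range(cols):
        w = [rng.random() + 0.1 for _ in range(rows)]
        s = sum(w)
        columns.append([v / s for v in w])
    return [[columns[i][j] for i in range(cols)] for j in range(rows)]

rng = random.Random(0)
n0, n1, n2 = 2, 3, 4                      # sizes of the y0, y1, y2 state spaces
p0 = [0.3, 0.7]                           # prior P(y0)
M1 = random_stochastic(n1, n0, rng)       # P(y1 | y0)
M2 = random_stochastic(n2, n1, rng)       # P(y2 | y1)

# Joint distribution of the chain: P(y0, y1, y2) = P(y2|y1) P(y1|y0) P(y0).
joint = {(a, b, c): p0[a] * M1[b][a] * M2[c][b]
         for a in range(n0) for b in range(n1) for c in range(n2)}

def backward_gap(joint):
    """Largest |P(y0 | y1, y2) - P(y0 | y1)| over all states."""
    gap = 0.0
    for b in range(n1):
        p_b = sum(joint[a, bb, c] for (a, bb, c) in joint if bb == b)
        for c in range(n2):
            p_bc = sum(joint[a, b, c] for a in range(n0))
            for a in range(n0):
                p_ab = sum(joint[a, b, cc] for cc in range(n2))
                gap = max(gap, abs(joint[a, b, c] / p_bc - p_ab / p_b))
    return gap

max_gap = backward_gap(joint)   # vanishes up to floating-point rounding
```

Once $y_1$ is known, additionally conditioning on the later level $y_2$ leaves the posterior over $y_0$ unchanged, which is what allows the hierarchy to be reversed one step at a time.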
Roughly speaking, Corollary 2 states that the structure of the inference problem reflects the structure of the generative process. In this case, we see that the neural network trying to approximate $P(y|x)$ must approximate a compositional function. We will argue below in Section III.6 that in many cases, this can only be accomplished efficiently if the neural network has $\gtrsim n$ hidden layers.
In neuroscience parlance, the functions $f_i$ compress the data into forms with ever more invariance Riesenhuber and Poggio (2000), containing features invariant under irrelevant transformations (for example background substitution, scaling and translation).
Let us denote the distilled vectors , where . As summarized by Figure 3, as information flows down the hierarchy , some of it is destroyed by random processes. However, no further information is lost as information flows optimally back up the hierarchy as .
iii.4 Approximate information distillation
Although minimal sufficient statistics are often difficult to calculate in practice, it is frequently possible to come up with statistics which are nearly sufficient in a certain sense which we now explain.
An equivalent characterization of a sufficient statistic is provided by information theory Kullback and Leibler (1951); Cover and Thomas (2012). The data processing inequality Cover and Thomas (2012) states that for any function $f$ and any random variables $x$, $y$,
$$I(x, y) \ge I(f(x), y), \tag{21}$$
where $I$ is the mutual information:
$$I(x, y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}. \tag{22}$$
A sufficient statistic $T(x)$ is a function $f(x)$ for which "$\ge$" gets replaced by "$=$" in equation (21), i.e., a function retaining all the information about $y$.
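The information-theoretic characterization can be illustrated with a small discrete example. The sketch below assumes a hand-picked joint table in which $y$ depends on $x \in \{0,1,2,3\}$ only through `x // 2`; the table values and the helper names (`mutual_information`, `push_forward`) are hypothetical and chosen purely for illustration.

```python
import math

def mutual_information(joint):
    """Mutual information (in nats) of a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(p * math.log(p / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, p in enumerate(row) if p > 0)

def push_forward(joint, f, n_out):
    """Joint table of (f(x), y) obtained from the joint table of (x, y)."""
    out = [[0.0] * len(joint[0]) for _ in range(n_out)]
    for x, row in enumerate(joint):
        for y, p in enumerate(row):
            out[f(x)][y] += p
    return out

# Rows index x, columns index y. P(y|x) is identical for x = 0, 1 and for x = 2, 3.
joint = [[0.20, 0.05],
         [0.20, 0.05],
         [0.05, 0.20],
         [0.05, 0.20]]

i_full = mutual_information(joint)
i_sufficient = mutual_information(push_forward(joint, lambda x: x // 2, 2))
i_lossy = mutual_information(push_forward(joint, lambda x: x % 2, 2))
```

Here `x // 2` retains all information about $y$, so the inequality is saturated, whereas `x % 2` discards exactly the relevant bit and its mutual information with $y$ drops to zero.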
Even information distillation functions that are not strictly sufficient can be very useful as long as they distill out most of the relevant information and are computationally efficient. For example, it may be possible to trade some loss of mutual information for a dramatic reduction in the complexity of the Hamiltonian; e.g., may be considerably easier to implement in a neural network than . Precisely this situation applies to the physical example described in Figure 3, where a hierarchy of efficient near-perfect information distillers has been found, the numerical cost of Tegmark (1997b); Hinshaw et al. (2003), Tegmark et al. (2003); Ade et al. (2014), Tegmark (1997a); Bond et al. (1998) and Adam et al. (2015) scaling with the number of input parameters as , , and , respectively. More abstractly, the procedure of renormalization, ubiquitous in statistical physics, can be viewed as a special case of approximate information distillation, as we will now describe.
iii.5 Distillation and renormalization
The systematic framework for distilling out desired information from unwanted “noise” in physical theories is known as Effective Field Theory Kardar (2007). Typically, the desired information involves relatively large-scale features that can be experimentally measured, whereas the noise involves unobserved microscopic scales. A key part of this framework is known as the renormalization group (RG) transformation Kardar (2007); Cardy (1996). Although the connection between RG and machine learning has been studied or alluded to repeatedly Johnson et al. (2007); Bény (2013); Saremi and Sejnowski (2013); Mehta and Schwab (2014); Miles Stoudenmire and Schwab (2016), there are significant misconceptions in the literature concerning the connection which we will now attempt to clear up.
Let us first review a standard working definition of what renormalization is in the context of statistical physics, involving three ingredients: a vector $x$ of random variables, a coarse-graining operation $R$ and a requirement that this operation leaves the Hamiltonian invariant except for parameter changes. We think of $x$ as the microscopic degrees of freedom, typically physical quantities defined at a lattice of points (pixels or voxels) in space. Its probability distribution is specified by a Hamiltonian $H_y(x)$, with some parameter vector $y$. We interpret the map $R: x \mapsto x$ as implementing a coarse-graining of the system. (Footnote 7: A typical renormalization scheme for a lattice system involves replacing many spins (bits) with a single spin according to some rule. In this case, it might seem that the map $R$ could not possibly map its domain onto itself, since there are fewer degrees of freedom after the coarse-graining. On the other hand, if we let the domain and range of $R$ differ, we cannot easily talk about the Hamiltonian as having the same functional form, since the renormalized Hamiltonian would have a different domain than the original Hamiltonian. Physicists get around this by taking the limit where the lattice is infinitely large, so that $R$ maps an infinite lattice to an infinite lattice.) The random variable $R(x)$ also has a Hamiltonian, denoted $H'(R(x))$, which we require to have the same functional form as the original Hamiltonian $H_y$, although the parameters $y$ may change. In other words, $H'(R(x)) = H_{r(y)}(R(x))$ for some function $r$. Since the domain and the range of $R$ coincide, this map $R$ can be iterated $n$ times, $R^n = R \circ R \circ \cdots \circ R$, giving a Hamiltonian $H_{r^n(y)}(R^n(x))$ for the repeatedly renormalized data. Similar to the case of sufficient statistics, $P(y|R^n(x))$ will then be a compositional function.
Contrary to some claims in the literature, effective field theory and the renormalization group have little to do with the idea of unsupervised learning and pattern-finding. Instead, the standard renormalization procedures in statistical physics are essentially a feature extractor for supervised learning, where the features typically correspond to long-wavelength/macroscopic degrees of freedom. In other words, effective field theory only makes sense if we specify what features we are interested in. For example, if we are given data about the positions and momenta of particles inside a mole of some liquid and are tasked with predicting from this data whether or not Alice will burn her finger when touching the liquid, a (nearly) sufficient statistic is simply the temperature of the liquid, which can in turn be obtained from some very coarse-grained degrees of freedom (for example, one could use the fluid approximation instead of working directly from the positions and momenta of particles). But without specifying that we wish to predict (long-wavelength physics), there is nothing natural about an effective field theory approximation.
To be more explicit about the link between renormalization and deep learning, consider a toy model for natural images. Each image is described by an intensity field $\phi(r)$, where r is a 2-dimensional vector. We assume that an ensemble of images can be described by a quadratic Hamiltonian of the form
$$H_y(\phi) = \int \left[ y_0 \phi^2 + y_1 (\nabla\phi)^2 + y_2 (\nabla^2\phi)^2 + \cdots \right] d^2r. \tag{23}$$
Each parameter vector $y$ defines an ensemble of images; we could imagine that the fictitious classes of images that we are trying to distinguish are all generated by Hamiltonians $H_y$ with the same above form but different parameter vectors $y$. We further assume that the function $\phi(r)$ is specified on pixels that are sufficiently close that derivatives can be well-approximated by differences. Derivatives are linear operations, so they can be implemented in the first layer of a neural network. The translational symmetry of equation (23) allows it to be implemented with a convnet. It can be shown Kardar (2007) that for any coarse-graining operation that replaces each block of $b \times b$ pixels by its average and divides the result by $b^2$, the Hamiltonian retains the form of equation (23) but with the parameters $y_i$ replaced by
$$y_i' = b^{2-2i}\, y_i. \tag{24}$$
This means that all parameters $y_i$ with $i \ge 2$ decay exponentially with $i$ as we repeatedly renormalize and $b$ keeps increasing, so that for modest $b$, one can neglect all but the first few $y_i$'s. What would have taken an arbitrarily large neural network can now be computed on a neural network of finite and bounded size, assuming that we are only interested in classifying the data based on the coarse-grained variables. These insufficient statistics will still have discriminatory power if we are only interested in discriminating Hamiltonians which all differ in their first few $y_i$. In this example, the parameters $y_0$ and $y_1$ correspond to "relevant operators" by physicists and "signal" by machine-learners, whereas the remaining parameters correspond to "irrelevant operators" by physicists and "noise" by machine-learners.
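The block-averaging step itself is easy to sketch. The example below assumes a square image stored as a list of lists and a hypothetical helper `coarse_grain`; it omits any overall rescaling of the field and only illustrates how averaging removes pixel-scale structure while keeping long-wavelength features.

```python
def coarse_grain(image, b):
    """Replace each b x b block of a square image (list of lists) by its average."""
    n = len(image)
    assert n % b == 0, "image side must be a multiple of the block size"
    m = n // b
    return [[sum(image[b * i + di][b * j + dj]
                 for di in range(b) for dj in range(b)) / (b * b)
             for j in range(m)]
            for i in range(m)]

n = 8
# Long-wavelength feature: a horizontal intensity ramp.
ramp = [[float(j) for j in range(n)] for i in range(n)]
# Pixel-scale "noise": a checkerboard of +1 and -1.
checker = [[(-1.0) ** (i + j) for j in range(n)] for i in range(n)]
image = [[ramp[i][j] + checker[i][j] for j in range(n)] for i in range(n)]

coarse = coarse_grain(image, 2)      # 4 x 4 image; the checkerboard averages away
coarser = coarse_grain(coarse, 2)    # 2 x 2 image; iterating composes the map
```

Two successive $2 \times 2$ averagings agree with a single $4 \times 4$ averaging, so the coarse-graining map can be iterated, and only the ramp survives in the coarse-grained variables.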
The fixed point structure of the transformation in this example is very simple, but one can imagine that in more complicated problems the fixed point structure of various transformations might be highly non-trivial. This is certainly the case in statistical mechanics problems where renormalization methods are used to classify various phases of matter; the point here is that the renormalization group flow can be thought of as solving the pattern-recognition problem of classifying the long-range behavior of various statistical systems.
In summary, renormalization can be thought of as a type of supervised learning, where the large scale properties of the system are considered the features. (Footnote 8: A subtlety regarding the above statements is presented by the Multi-scale Entanglement Renormalization Ansatz (MERA) Vidal (2008). MERA can be viewed as a variational class of wave functions whose parameters can be tuned to match a given wave function as closely as possible. From this perspective, MERA is an unsupervised machine learning algorithm, where classical probability distributions over many variables are replaced with quantum wavefunctions. Due to the special tensor network structure found in MERA, the resulting variational approximation of a given wavefunction has an interpretation as generating an RG flow. Hence this is an example of an unsupervised learning problem whose solution gives rise to an RG flow. This is only possible due to the extra mathematical structure in the problem (the specific tensor network found in MERA); a generic variational Ansatz does not give rise to any RG interpretation and vice versa.) If the desired features are not large-scale properties (as in most machine learning cases), one might still expect a generalized formalism of renormalization to provide some intuition to the problem by replacing a scale transformation with some other transformation. But calling some procedure renormalization or not is ultimately a matter of semantics; what remains to be seen is whether or not semantics has teeth, namely, whether the intuition about fixed points of the renormalization group flow can provide concrete insight into machine learning algorithms. In many numerical methods, the purpose of the renormalization group is to efficiently and accurately evaluate the free energy of the system as a function of macroscopic variables of interest such as temperature and pressure. Thus we can only sensibly talk about the accuracy of an RG-scheme once we have specified what macroscopic variables we are interested in.
iii.6 No-flattening theorems
Above we discussed how Markovian generative models cause $p(x|y)$ to be a composition of a number of simpler functions $f_i$. Suppose that we can approximate each function $f_i$ with an efficient neural network for the reasons given in Section II. Then we can simply stack these networks on top of each other, to obtain a deep neural network efficiently approximating $p(x|y)$.
But is this the most efficient way to represent $p(x|y)$? Since we know that there are shallower networks that accurately approximate it, are any of these shallow networks as efficient as the deep one, or does flattening necessarily come at an efficiency cost?
To be precise, for a neural network f defined by equation (6), we will say that the neural network $f_\epsilon^\ell$ is the flattened version of f if its number $\ell$ of hidden layers is smaller and $f_\epsilon^\ell$ approximates f within some error $\epsilon$ (as measured by some reasonable norm). We say that $f_\epsilon^\ell$ is a neuron-efficient flattening if the sum of the dimensions of its hidden layers (sometimes referred to as the number of neurons $N_n$) is less than for f. We say that $f_\epsilon^\ell$ is a synapse-efficient flattening if the number $N_s$ of non-zero entries (sometimes called synapses) in its weight matrices is less than for f. This lets us define the flattening cost of a network f as the two functions
$$C_n(f, \ell, \epsilon) \equiv \min_{f_\epsilon^\ell} \frac{N_n(f_\epsilon^\ell)}{N_n(f)}, \qquad C_s(f, \ell, \epsilon) \equiv \min_{f_\epsilon^\ell} \frac{N_s(f_\epsilon^\ell)}{N_s(f)},$$
specifying the factor by which optimal flattening increases the neuron count and the synapse count, respectively. We refer to results where $C_n > 1$ or $C_s > 1$ for some class of functions f as "no-flattening theorems", since they imply that flattening comes at a cost and efficient flattening is impossible. A complete list of no-flattening theorems would show exactly when deep networks are more efficient than shallow networks.
There has already been very interesting progress in this spirit, but crucial questions remain. On one hand, it has been shown that deep is not always better, at least empirically for some image classification tasks Ba and Caruana (2014). On the other hand, many functions f have been found for which the flattening cost is significant. Certain deep Boolean circuit networks are exponentially costly to flatten Hastad (1986). Two families of multivariate polynomials with an exponential flattening cost are constructed in Delalleau and Bengio (2011). Poggio et al. (2015); Mhaskar et al. (2016); Mhaskar and Poggio (2016) focus on functions that have tree-like hierarchical compositional form, concluding that the flattening cost is exponential for almost all functions in Sobolev space. For the ReLU activation function, Telgarsky (2015) finds a class of functions that exhibit exponential flattening costs; Montufar et al. (2014) study a tailored complexity measure of deep versus shallow ReLU networks. Eldan and Shamir (2015) show that given weak conditions on the activation function, there always exists at least one function that can be implemented in a 3-layer network which has an exponential flattening cost. Finally, Poole et al. (2016); Raghu et al. (2016) study the differential geometry of shallow versus deep networks, and find that flattening is exponentially neuron-inefficient. Further work elucidating the cost of flattening various classes of functions will clearly be highly valuable.
iii.7 Linear no-flattening theorems
In the meantime, we will now see that interesting no-flattening results can be obtained even in the simpler-to-model context of linear neural networks Saxe et al. (2013), where the nonlinear operators $\sigma$ are replaced with the identity and all biases are set to zero, so that the layers are simply linear operators (matrices). Every map is specified by a matrix of real (or complex) numbers, and composition is implemented by matrix multiplication.
One might suspect that such a network is so simple that the questions concerning flattening become entirely trivial: after all, successive multiplication with different matrices is equivalent to multiplying by a single matrix (their product). While the effect of flattening is indeed trivial for expressibility (f can express any linear function, independently of how many layers there are), this is not the case for the learnability, which involves non-linear and complex dynamics despite the linearity of the network Saxe et al. (2013). We will show that the efficiency of such linear networks is also a very rich question.
Neuronal efficiency is trivially attainable for linear networks, since all hidden-layer neurons can be eliminated without accuracy loss by simply multiplying all the weight matrices together. We will instead consider the case of synaptic efficiency and set $\ell = \epsilon = 0$.
Many divide-and-conquer algorithms in numerical linear algebra exploit some factorization of a particular matrix A in order to yield significant reduction in complexity. For example, when A represents the discrete Fourier transform (DFT), the fast Fourier transform (FFT) algorithm makes use of a sparse factorization of A which only contains $O(n \log n)$ non-zero matrix elements instead of the naive single-layer implementation, which contains $n^2$ non-zero matrix elements. As first pointed out in Bengio et al. (2007), this is an example where depth helps and, in our terminology, of a linear no-flattening theorem: fully flattening a network that performs an FFT of $n$ variables increases the synapse count from $O(n \log n)$ to $O(n^2)$, i.e., incurs a flattening cost $C_s = O(n/\log n)$. This argument applies also to many variants and generalizations of the FFT such as the Fast Wavelet Transform and the Fast Walsh-Hadamard Transform.
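To make the synapse counting concrete, the following sketch builds a radix-2 decimation-in-time FFT as a stack of sparse linear layers and compares it with the dense DFT matrix. It assumes the input length is a power of two (at least 2); the function names (`dft_matrix`, `fft_layers`, `apply_layers`) and the dictionary-per-row layer representation are illustrative choices rather than anything prescribed by the text.

```python
import cmath

def dft_matrix(n):
    """Dense DFT matrix: the flattened, single-layer network with n^2 synapses."""
    return [[cmath.exp(-2j * cmath.pi * r * c / n) for c in range(n)]
            for r in range(n)]

def fft_layers(n):
    """Sparse layers, each a list of {input index: weight} rows, composing to the DFT.

    Layer 0 is the bit-reversal permutation (one weight per row); every later
    layer is one butterfly stage with exactly two non-zero weights per row.
    """
    bits = n.bit_length() - 1
    assert n >= 2 and 1 << bits == n, "n must be a power of two"
    layers = [[{int(format(r, f"0{bits}b")[::-1], 2): 1.0} for r in range(n)]]
    size = 2
    while size <= n:
        half = size // 2
        layer = []
        for r in range(n):
            start, k = r - r % size, r % size
            twiddle = cmath.exp(-2j * cmath.pi * (k % half) / size)
            sign = 1 if k < half else -1
            layer.append({start + k % half: 1.0,
                          start + k % half + half: sign * twiddle})
        layers.append(layer)
        size *= 2
    return layers

def apply_layers(layers, x):
    """Feed x through the linear layers in order."""
    for layer in layers:
        x = [sum(w * x[c] for c, w in row.items()) for row in layer]
    return x

n = 64
layers = fft_layers(n)
x = [complex(i % 7, -(i % 3)) for i in range(n)]
dense = dft_matrix(n)
y_flat = [sum(dense[r][c] * x[c] for c in range(n)) for r in range(n)]
y_deep = apply_layers(layers, x)
max_err = max(abs(a - b) for a, b in zip(y_flat, y_deep))
synapses_deep = sum(len(row) for layer in layers for row in layer)  # n + 2 n log2(n)
synapses_flat = n * n
```

For $n = 64$ the layered network uses 832 non-zero weights against 4096 for the flattened one, and the gap widens as $n/\log n$ for larger inputs.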
Another important example illustrating the subtlety of linear networks is matrix multiplication. More specifically, take the input of a neural network to be the entries of a matrix M and the output to be NM, where both M and N have size $n \times n$. Since matrix multiplication is linear, this can be exactly implemented by a 1-layer linear neural network. Amazingly, the naive algorithm for matrix multiplication, which requires $n^3$ multiplications, is not optimal: the Strassen algorithm Strassen (1969) requires only $O(n^\omega)$ multiplications (synapses), where $\omega = \log_2 7 \approx 2.81$, and recent work has cut this scaling exponent down to $\omega \approx 2.37$ Le Gall (2014). This means that fully optimized matrix multiplication on a deep neural network has a flattening cost of at least $C_s = O(n^{0.63})$.
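The saving behind the Strassen exponent comes from multiplying two $2 \times 2$ (block) matrices with 7 rather than 8 multiplications. The sketch below shows the standard seven products for scalar entries, together with hypothetical helper functions counting the multiplications when the scheme is applied recursively to matrices whose size is a power of two.

```python
def strassen_2x2(A, B):
    """Multiply two 2 x 2 matrices using 7 multiplications instead of 8."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]

def naive_multiplications(n):
    """Scalar multiplications used by the schoolbook algorithm."""
    return n ** 3

def strassen_multiplications(n):
    """Scalar multiplications for fully recursive Strassen, n a power of two."""
    count = 1
    while n > 1:
        count *= 7
        n //= 2
    return count

product = strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]])
ratio = naive_multiplications(1024) / strassen_multiplications(1024)
```

For $n = 1024$ the recursion needs $7^{10}$ multiplications instead of $1024^3$, a saving of roughly a factor 3.8 that keeps growing with $n$.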
Low-rank matrix multiplication gives a more elementary no-flattening theorem. If A is a rank-$k$ matrix of size $n \times n$, we can factor it as A = BC where B is an $n \times k$ matrix and C is a $k \times n$ matrix. Hence the number of synapses is $n^2$ for an $\ell = 0$ network and $2nk$ for an $\ell = 1$ network, giving a flattening cost $C_s = n/2k > 1$ as long as the rank $k < n/2$.
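A minimal numerical sketch of the low-rank case, assuming random factors with hypothetical sizes $n = 8$ and $k = 2$: the two-layer network applies C and then B, while the flattened network applies their product directly.

```python
import random

def matvec(M, v):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(X, Y):
    """Matrix-matrix product for list-of-lists matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]

def synapses(M):
    """Number of non-zero weights in a matrix."""
    return sum(1 for row in M for w in row if w != 0)

rng = random.Random(1)
n, k = 8, 2
B = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(n)]   # n x k
C = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(k)]   # k x n
A = matmul(B, C)                                                 # n x n, rank <= k

x = [rng.uniform(-1, 1) for _ in range(n)]
flat_out = matvec(A, x)                  # no hidden layer
deep_out = matvec(B, matvec(C, x))       # one hidden layer with k neurons
max_err = max(abs(a - b) for a, b in zip(flat_out, deep_out))

flat_synapses = synapses(A)              # n^2
deep_synapses = synapses(B) + synapses(C)  # 2 n k
cost = flat_synapses / deep_synapses     # n / (2 k)
```

Both networks compute the same linear map, but the flattened one needs $n^2 = 64$ weights where the factored one needs $2nk = 32$.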
Finally, let us consider flattening a network f = AB, where A and B are random sparse $n \times n$ matrices such that each element is $1$ with probability $p$ and $0$ with probability $1-p$. Flattening the network results in a matrix $F_{ij} = \sum_k A_{ik} B_{kj}$, so the probability that $F_{ij} = 0$ is $(1-p^2)^n$. Hence the number of non-zero components will on average be $[1-(1-p^2)^n]\,n^2$, so
$$C_s = \frac{[1-(1-p^2)^n]\,n^2}{2n^2p} = \frac{1-(1-p^2)^n}{2p}.$$
Note that $C_s \le 1/2p$ and that this bound is asymptotically saturated for $n \gg 1/p^2$. Hence in the limit where $n$ is very large, flattening multiplication by sparse matrices ($p \ll 1$) is horribly inefficient.
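The fill-in caused by multiplying sparse matrices can be checked by simulation. The sketch below assumes $n \times n$ matrices whose entries are independently 1 with probability $p$ and 0 otherwise; the values $n = 60$, $p = 0.1$, the random seed and the function names are arbitrary illustrative choices.

```python
import random

def expected_nonzero_fraction(n, p):
    """Probability that an entry of the product AB is non-zero: 1 - (1 - p^2)^n."""
    return 1.0 - (1.0 - p * p) ** n

def flattening_cost(n, p):
    """Expected synapse ratio of AB to A plus B: [1 - (1 - p^2)^n] / (2p)."""
    return expected_nonzero_fraction(n, p) / (2.0 * p)

def simulated_nonzero_fraction(n, p, rng):
    """Draw random sparse 0/1 matrices A, B and measure the density of AB."""
    A = [[rng.random() < p for _ in range(n)] for _ in range(n)]
    B = [[rng.random() < p for _ in range(n)] for _ in range(n)]
    hits = sum(1 for i in range(n) for j in range(n)
               if any(A[i][k] and B[k][j] for k in range(n)))
    return hits / (n * n)

n, p = 60, 0.1
predicted = expected_nonzero_fraction(n, p)
observed = simulated_nonzero_fraction(n, p, random.Random(0))
cost = flattening_cost(n, p)
large_n_cost = flattening_cost(10 ** 6, p)   # approaches the bound 1 / (2p)
```

With 10% dense factors the product is already roughly 45% dense at $n = 60$, and for very large $n$ the cost approaches the bound $1/2p = 5$.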
iii.8 A polynomial no-flattening theorem
In Section II, we saw that multiplication of two variables could be implemented by a flat neural network with 4 neurons in the hidden layer, using equation (11) as illustrated in Figure 2. In Appendix A, we show that equation (11) is merely the $n = 2$ special case of the formula
$$\prod_{i=1}^n x_i \approx \frac{1}{2^n\, n!\, \sigma_n} \sum_{\{s\}} s_1 \cdots s_n\, \sigma(s_1 x_1 + \cdots + s_n x_n),$$
where $\sigma_n$ is the coefficient of $u^n$ in the Taylor expansion of $\sigma(u)$ and the sum is over all $2^n$ possible configurations of $s_1, \ldots, s_n$ where each $s_i$ can take on values $\pm 1$. In other words, multiplication of $n$ variables can be implemented by a flat network with $2^n$ neurons in the hidden layer. We also prove in Appendix A that this is the best one can do: no neural network with a single hidden layer can implement an $n$-input multiplication gate using fewer than $2^n$ neurons in the hidden layer. This is another powerful no-flattening theorem, telling us that polynomials are exponentially expensive to flatten. For example, if $n$ is a power of two, then the monomial $x_1 x_2 \cdots x_n$ can be evaluated by a deep network using fewer than $4n$ neurons, with $n-1$ copies of the multiplication gate from Figure 2 arranged in a binary tree with $\log_2 n$ layers (the 5th top neuron at the top of Figure 2 need not be counted, as it is the input to whatever computation comes next). In contrast, a functionally equivalent flattened network requires a whopping $2^n$ neurons. For example, a deep neural network can multiply 32 numbers using $4(n-1) = 124$ neurons while a shallow one requires $2^{32} = 4{,}294{,}967{,}296$ neurons. Since a broad class of real-world functions can be well approximated by polynomials, this helps explain why many useful neural networks cannot be efficiently flattened.
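The flat product gate can be sketched directly as a sum over all sign patterns. The example assumes the activation $\sigma(u) = e^u$, whose $n$-th Taylor coefficient is $1/n!$, and a hypothetical scale parameter `lam` that shrinks the inputs before the gate and rescales the output afterwards; the function names are illustrative only.

```python
import itertools
import math

def product_gate(xs, lam=1e-2):
    """Approximate prod(xs) with one hidden layer of 2^n neurons using exp."""
    n = len(xs)
    sigma_n = 1.0 / math.factorial(n)        # n-th Taylor coefficient of exp
    total = 0.0
    for signs in itertools.product((1, -1), repeat=n):   # one neuron per pattern
        weight = math.prod(signs)                        # output weight is +1 or -1
        total += weight * math.exp(lam * sum(s * x for s, x in zip(signs, xs)))
    return total / (2 ** n * math.factorial(n) * sigma_n * lam ** n)

def flat_neurons(n):
    """Hidden neurons used by the single-hidden-layer product gate."""
    return 2 ** n

def deep_neurons(n):
    """Hidden neurons in a binary tree of n - 1 four-neuron gates (n a power of 2)."""
    return 4 * (n - 1)

xs = [0.5, -1.2, 2.0]
approx = product_gate(xs)
exact = math.prod(xs)
```

For the exponential activation the sum factorizes into a product of hyperbolic sines, so the relative error scales as `lam ** 2`. The two neuron-count helpers give 4,294,967,296 versus 124 hidden neurons for 32 inputs.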
We have shown that the success of deep and cheap (low-parameter-count) learning depends not only on mathematics but also on physics, which favors certain classes of exceptionally simple probability distributions that deep learning is uniquely suited to model. We argued that the success of shallow neural networks hinges on symmetry, locality, and polynomial log-probability in data from or inspired by the natural world, which favors sparse low-order polynomial Hamiltonians that can be efficiently approximated. These arguments should be particularly relevant for explaining the success of machine-learning applications to physics, for example using a neural network to approximate a many-body wavefunction Carleo and Troyer (2016). Whereas previous universality theorems guarantee that there exists a neural network that approximates any smooth function to within an error $\epsilon$, they cannot guarantee that the size of the neural network does not grow to infinity with shrinking $\epsilon$ or that the activation function $\sigma$ does not become pathological. We show constructively that given a multivariate polynomial and any generic non-linearity, a neural network with a fixed size and a generic smooth activation function can indeed approximate the polynomial highly efficiently.
Turning to the separate question of depth, we have argued that the success of deep learning depends on the ubiquity of hierarchical and compositional generative processes in physics and other machine-learning applications. By studying the sufficient statistics of the generative process, we showed that the inference problem requires approximating a compositional function of the form $f_1 \circ f_2 \circ f_3 \circ \cdots$ that optimally distills out the information of interest from irrelevant noise in a hierarchical process that mirrors the generative process. Although such compositional functions can be efficiently implemented by a deep neural network as long as their individual steps can, it is generally not possible to retain the efficiency while flattening the network. We extend existing "no-flattening" theorems Delalleau and Bengio (2011); Mhaskar et al. (2016); Mhaskar and Poggio (2016) by showing that efficient flattening is impossible even for many important cases involving linear networks. In particular, we prove that flattening polynomials is exponentially expensive, with $2^n$ neurons required to multiply $n$ numbers using a single hidden layer, a task that a deep network can perform using only $\sim 4n$ neurons.
Strengthening the analytic understanding of deep learning may suggest ways of improving it, both to make it more capable and to make it more robust. One promising area is to prove sharper and more comprehensive no-flattening theorems, placing lower and upper bounds on the cost of flattening networks implementing various classes of functions.
Acknowledgements: This work was supported by the Foundational Questions Institute http://fqxi.org/, the Rothberg Family Fund for Cognitive Science and NSF grant 1122374. We thank Scott Aaronson, Frank Ban, Yoshua Bengio, Rico Jonschkowski, Tomaso Poggio, Bart Selman, Viktoriya Krakovna, Krishanu Sankar and Boya Song for helpful discussions and suggestions, Frank Ban, Fernando Perez, Jared Jolton, and the anonymous referee for helpful corrections and the Center for Brains, Minds, and Machines (CBMM) for hospitality.
Appendix A The polynomial no-flattening theorem
We saw above that a neural network can compute polynomials accurately and efficiently at linear cost, using only about 4 neurons per multiplication. For example, if $n$ is a power of two, then the monomial $x_1 x_2 \cdots x_n$ can be evaluated using $4(n-1)$ neurons arranged in a binary tree network with $\log_2 n$ hidden layers. In this appendix, we will prove a no-flattening theorem demonstrating that flattening polynomials is exponentially expensive:
Theorem: Suppose we are using a generic smooth activation function $\sigma(x) = \sum_{k=0}^\infty \sigma_k x^k$, where $\sigma_k \ne 0$ for $0 \le k \le n$. Then for any desired accuracy $\epsilon > 0$, there exists a neural network that can implement the function $\prod_{i=1}^n x_i$ using a single hidden layer of $2^n$ neurons. Furthermore, this is the smallest possible number of neurons in any such network with only a single hidden layer.
This result may be compared to problems in Boolean circuit complexity, notably the question of whether $TC^0 = TC^1$ Vollmer (2013). Here circuit depth is analogous to number of layers, and the number of gates is analogous to the number of neurons. In both the Boolean circuit model and the neural network model, one is allowed to use neurons/gates which have an unlimited number of inputs. The constraint in the definition of $TC^i$ that each of the gate elements be from a standard universal library (AND, OR, NOT, Majority) is analogous to our constraint to use a particular nonlinear function. Note, however, that our theorem is weaker by applying only to depth 1, while $TC^0$ includes all circuits of depth $O(1)$.
a.1 Proof that $2^n$ neurons are sufficient
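A neural network with a single hidden layer of $m$ neurons that approximates a product gate for $n$ inputs can be formally written as a choice of constants $a_{ij}$ and $w_j$ satisfying
$$\sum_{j=1}^m w_j\, \sigma\!\left(\sum_{i=1}^n a_{ij} x_i\right) \approx \prod_{i=1}^n x_i. \tag{29}$$
Here, we use $\approx$ to denote that the two sides of (29) have identical Taylor expansions up to terms of degree $n$; as we discussed earlier in our construction of a product gate for two inputs, this enables us to achieve arbitrary accuracy by first scaling down the factors $x_i$, then approximately multiplying them and finally scaling up the result.
We may expand (29) using the definition $\sigma(x) \equiv \sum_{k=0}^\infty \sigma_k x^k$ and drop terms of the Taylor expansion with degree greater than $n$, since they do not affect the approximation. Thus, we wish to find the minimal $m$ such that there exist constants $a_{ij}$ and $w_j$ satisfying
$$\sigma_n \sum_{j=1}^m w_j \left(\sum_{i=1}^n a_{ij} x_i\right)^n = \prod_{i=1}^n x_i, \tag{30}$$
$$\sigma_k \sum_{j=1}^m w_j \left(\sum_{i=1}^n a_{ij} x_i\right)^k = 0, \tag{31}$$
for all $0 \le k \le n-1$. Let us set $m = 2^n$, and enumerate the subsets of $\{1, \ldots, n\}$ as $S_1, \ldots, S_m$ in some order. Define a network of $m$ neurons in a single hidden layer by setting $a_{ij}$ equal to the function $s_i(S_j)$ which is $-1$ if $i \in S_j$ and $+1$ otherwise, setting
$$w_j \equiv \frac{1}{2^n\, n!\, \sigma_n} \prod_{i=1}^n a_{ij} = \frac{(-1)^{|S_j|}}{2^n\, n!\, \sigma_n}.$$
In other words, up to an overall normalization constant, all coefficients $a_{ij}$ and $w_j$ equal $\pm 1$, and each weight $w_j$ is simply the product of the corresponding $a_{ij}$.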
We must prove that this network indeed satisfies equations (30) and (31). The essence of our proof will be to expand the left hand side of Equation (29) and show that all monomial terms except come in pairs that cancel. To show this, consider a single monomial where .
If , then we must show that the coefficient of in is 0. Since , there must be some such that . In other words, does not depend on the variable . Since the sum in Equation (29) is over all combinations of signs for all variables, every term will be canceled by another term where the (non-present) has the opposite sign and the weight has the opposite sign:
Observe that the coefficient of is equal in and , since . Therefore, the overall coefficient of in the above expression must vanish, which implies that (31) is satisfied.
If instead , then all terms have the coefficient of in is , because all terms are identical and there is no cancelation. Hence, the coefficient of on the left-hand side of (30) is
completing our proof that this network indeed approximates the desired product gate.
From the standpoint of group theory, our construction involves a representation of the group $G = \mathbb{Z}_2^n$, acting upon the space of polynomials in the variables $x_1, x_2, \ldots, x_n$. The group $G$ is generated by elements $g_i$ such that $g_i$ flips the sign of $x_i$ wherever it occurs. Then, our construction corresponds to the computation
$$f(x_1, \ldots, x_n) = (1 - g_1)(1 - g_2) \cdots (1 - g_n)\, \sigma(x_1 + x_2 + \cdots + x_n).$$
Every monomial of degree at most $n$, with the exception of the product $x_1 \cdots x_n$, is sent to 0 by $(1 - g_i)$ for at least one choice of $i$. Therefore, $f(x_1, \ldots, x_n)$ approximates a product gate (up to a normalizing constant).
a.2 Proof that $2^n$ neurons are necessary
for all . Let A denote the matrix with elements
We will show that A has full row rank. Suppose, towards contradiction, that for some non-zero vector c. Specifically, suppose that there is a linear dependence between rows of A given by
where the are distinct and for every . Let be the maximal cardinality of any . Defining the vector d whose components are
taking the dot product of equation (36) with d gives