Building a scalable foundation for deep learning
We study the connection between the highly non-convex loss function of a simple model of the fully-connected feed-forward neural network and the Hamiltonian of the spherical spin-glass model under the assumptions of: i) variable independence, ii) redundancy in network parametrization, and iii) uniformity. These assumptions enable us to explain the complexity of the fully decoupled neural network through the prism of the results from random matrix theory. We show that for large-size decoupled networks the lowest critical values of the random loss function form a layered structure and they are located in a well-defined band lower-bounded by the global minimum. The number of local minima outside that band diminishes exponentially with the size of the network. We empirically verify that the mathematical model behaves similarly to the computer simulations, despite the presence of high dependencies in real networks. We conjecture that both simulated annealing and SGD converge to the band of low critical points, and that all critical points found there are local minima of high quality measured by the test error. This emphasizes a major difference between large- and small-size networks where for the latter poor quality local minima have non-zero probability of being recovered. Finally, we prove that recovering the global minimum becomes harder as the network size increases and that it is in practice irrelevant as the global minimum often leads to overfitting.
Some of the most popular methods use multi-stage architectures composed of alternating layers of linear transformations and max functions. In a particularly popular version, the max functions are known as ReLUs (Rectified Linear Units) and compute the mapping $y = \max(0, x)$ in a pointwise fashion [Nair and Hinton, 2010]. In other architectures, such as convolutional networks [LeCun et al., 1998a] and maxout networks [Goodfellow et al., 2013], the max operation is performed over a small set of variables within a layer.
The vast majority of practical applications of deep learning use supervised learning with very deep networks. The supervised loss function, generally a cross-entropy or hinge loss, is minimized using some form of stochastic gradient descent (SGD) [Bottou, 1998], in which the gradient is evaluated using the back-propagation procedure [LeCun et al., 1998b].
The general shape of the loss function is very poorly understood. In the early days of neural nets (late 1980s and early 1990s), many researchers and engineers were experimenting with relatively small networks, whose convergence tends to be unreliable, particularly when using batch optimization. Multilayer neural nets earned a reputation of being finicky and unreliable, which in part caused the community to focus on simpler methods with convex loss functions, such as kernel machines and boosting.
However, several researchers experimenting with larger networks and SGD had noticed that, while multilayer nets do have many local minima, the results of multiple experiments consistently give very similar performance. This suggests that, while local minima are numerous, they are relatively easy to find, and they are all more or less equivalent in terms of performance on the test set. The present paper attempts to explain this peculiar property through the use of random matrix theory applied to the analysis of critical points in high degree polynomials on the sphere.
We first establish that the loss function of a typical multilayer net with ReLUs can be expressed as a polynomial function of the weights in the network, whose degree is the number of layers, and whose number of monomials is the number of paths from inputs to output. As the weights (or the inputs) vary, some of the monomials are switched off and others become activated, leading to a piecewise, continuous polynomial whose monomials are switched in and out at the boundaries between pieces.
An important question concerns the distribution of critical points (maxima, minima, and saddle points) of such functions. Results from random matrix theory applied to spherical spin glasses have shown that these functions have a combinatorially large number of saddle points. Loss surfaces for large neural nets have many local minima that are essentially equivalent from the point of view of the test error, and these minima tend to be highly degenerate, with many eigenvalues of the Hessian near zero.
We empirically verify several hypotheses regarding learning with large-size networks:
For large-size networks, most local minima are equivalent and yield similar performance on a test set.
The probability of finding a “bad” (high value) local minimum is non-zero for small-size networks and decreases quickly with network size.
Struggling to find the global minimum on the training set (as opposed to one of the many good local ones) is not useful in practice and may lead to overfitting.
The above hypotheses can be directly justified by our theoretical findings. We conclude the paper with a brief discussion of our results and future research directions in Section 6.
We confirm the intuition and empirical evidence expressed in previous works that the problem of training deep learning systems resides with avoiding saddle points and quickly “breaking the symmetry” by picking sides of saddle points and choosing a suitable attractor [LeCun et al., 1998b, Saxe et al., 2014, Dauphin et al., 2014].
What is new in this paper? To the best of our knowledge, this paper is the first work providing a theoretical description of the optimization paradigm with neural networks in the presence of a large number of parameters. It has to be emphasized, however, that this connection relies on a number of possibly unrealistic assumptions. It is also an attempt to shed light on the puzzling behavior of modern deep learning systems when it comes to optimization and generalization.
In the 1990s, a number of researchers studied the convergence of gradient-based learning for multilayer networks using the methods of statistical physics, e.g. [Saad and Solla, 1995], and the edited works [Saad, 2009]. Recently, Saxe [Saxe et al., 2014] and Dauphin [Dauphin et al., 2014] explored the statistical properties of the error surface in multi-layer architectures, pointing out the importance of saddle points.
Earlier theoretical analyses [Baldi and Hornik, 1989, Wigner, 1958, Fyodorov and Williams, 2007, Bray and Dean, 2007] suggest the existence of a certain structure of critical points of random Gaussian error functions on high dimensional continuous spaces. They imply that critical points whose error is much higher than the global minimum are exponentially likely to be saddle points with many negative and approximate plateau directions whereas all local minima are likely to have an error very close to that of the global minimum (these results are conveniently reviewed in [Dauphin et al., 2014]). The work of [Dauphin et al., 2014] establishes a strong empirical connection between neural networks and the theory of random Gaussian fields by providing experimental evidence that the cost function of neural networks exhibits the same properties as the Gaussian error functions on high dimensional continuous spaces. Nevertheless they provide no theoretical justification for the existence of this connection which instead we provide in this paper.
This work is inspired by the recent advances in random matrix theory and the work of [Auffinger et al., 2010] and [Auffinger and Ben Arous, 2013]. The authors of these works provided an asymptotic evaluation of the complexity of the spherical spin-glass model (the spin-glass model originates from condensed matter physics where it is used to represent a magnet with irregularly aligned spins). They discovered and mathematically proved the existence of a layered structure of the low critical values for the model’s Hamiltonian, which in fact is a Gaussian process. Their results are not discussed in detail here, as this will be done in Section 4 in the context of neural networks. We build the bridge between their findings and neural networks and show that the objective function used by a neural network is analogous to the Hamiltonian of the spin-glass model under the assumptions of: i) variable independence, ii) redundancy in network parametrization, and iii) uniformity, and thus their landscapes share the same properties. We emphasize that the connection between spin-glass models and neural networks was already explored in the past (a summary can be found in [Dotsenko, 1995]). For example, in [Amit et al., 1985] the authors showed that the long-term behavior of certain neural network models is governed by the statistical mechanics of infinite-range Ising spin-glass Hamiltonians. Another work [Nakanishi and Takayama, 1997] examined the nature of the spin-glass transition in the Hopfield neural network model. None of these works, however, attempts to explain the paradigm of optimizing the highly non-convex neural network objective function through the prism of spin-glass theory, and thus in this respect our approach is novel.
For the theoretical analysis, we consider a simple model of the fully-connected feed-forward deep network with a single output and rectified linear units. We call the network $\mathcal{N}$. We focus on a binary classification task. Let $X$ be the random input vector of dimensionality $d$. Let $(H-1)$ denote the number of hidden layers in the network and we will refer to the input layer as the $0^{\text{th}}$ layer and to the output layer as the $H^{\text{th}}$ layer. Let $n_i$ denote the number of units in the $i^{\text{th}}$ layer (note that $n_0 = d$ and $n_H = 1$). Let $W_i$ be the matrix of weights between the $(i-1)^{\text{th}}$ and $i^{\text{th}}$ layers of the network. Also, let $\sigma$ denote the activation function that converts a unit’s weighted input to its output activation. We consider linear rectifiers, thus $\sigma(x) = \max(0, x)$. We can therefore write the (random) network output $Y$ as
$$Y = q\,\sigma\big(W_H^\top \sigma(W_{H-1}^\top \cdots \sigma(W_1^\top X)) \cdots\big),$$
where $q$ is simply a normalization factor. The same expression for the output of the network can be re-expressed in the following way:
$$Y = q \sum_{i=1}^{n_0} \sum_{j=1}^{\gamma} X_{i,j} A_{i,j} \prod_{k=1}^{H} w_{i,j}^{(k)}, \qquad (1)$$
where the first summation is over the network inputs and the second one is over all paths from a given network input to its output, where $\gamma$ is the total number of such paths (note that $\gamma = n_1 n_2 \cdots n_H$). Also, for all $i \in \{1, 2, \ldots, n_0\}$: $X_{i,1} = X_{i,2} = \cdots = X_{i,\gamma}$. Furthermore, $w_{i,j}^{(k)}$ is the weight of the $k^{\text{th}}$ segment of the path indexed with $(i,j)$, which connects layer $(k-1)$ with layer $k$ of the network. Note that each path corresponds to a certain set of $H$ weights, which we refer to as a configuration of weights, which are multiplied by each other. Finally, $A_{i,j}$ denotes whether a path $(i,j)$ is active ($A_{i,j} = 1$) or not ($A_{i,j} = 0$).
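The equivalence between the layered forward pass and the sum over input-output paths can be checked numerically on a tiny network. The following is only a minimal sketch: it assumes a normalization factor of 1, ReLU units in every layer including the output, and weight matrices stored with shape (units in previous layer, units in current layer). The function names `relu_forward` and `path_sum` are illustrative and do not come from the paper.

```python
import itertools
import numpy as np

def relu_forward(weights, x):
    """Forward pass; returns the output vector and the 0/1 activity mask of every layer."""
    h, masks = x, []
    for W in weights:                          # W: (units in previous layer, units in this layer)
        pre = W.T @ h
        masks.append((pre > 0).astype(float))  # 1 where the unit is active, 0 otherwise
        h = np.maximum(pre, 0.0)
    return h, masks

def path_sum(weights, x, masks):
    """Sum over all input-output paths of: input * path activity * product of path weights."""
    sizes = [x.shape[0]] + [W.shape[1] for W in weights]
    total = 0.0
    for path in itertools.product(*(range(n) for n in sizes)):
        weight_product, active = 1.0, 1.0
        for k, W in enumerate(weights):
            weight_product *= W[path[k], path[k + 1]]
            active *= masks[k][path[k + 1]]    # a path is active only if every unit on it is active
        total += x[path[0]] * active * weight_product
    return total

rng = np.random.default_rng(0)
sizes = [3, 4, 3, 1]                           # 3 inputs, two hidden layers, one output
weights = [rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
x = rng.standard_normal(sizes[0])

output, masks = relu_forward(weights, x)
paths = path_sum(weights, x, masks)
print(output[0], paths)                        # the two numbers should agree
```

The brute-force enumeration visits every path once (36 paths here), so it is only practical for toy networks; its purpose is to make the polynomial structure of the output explicit.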
The mass of the network $\Psi$ is the total number of all paths between all network inputs and outputs: $\Psi = \prod_{i=0}^{H} n_i$. Also let $\Lambda = \sqrt[H]{\Psi}$.
The size of the network $N$ is the total number of network parameters: $N = \sum_{i=0}^{H-1} n_i n_{i+1}$.
The mass and the size of the network depend on each other as captured in Theorem 3.1. All proofs in this paper are deferred to the Supplementary material.
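Let $\Psi$ be the mass of the network, $d$ be the number of network inputs and $H$ be the depth of the network. The size of the network is bounded as
$$\Psi^2 H = \Lambda^{2H} H \;\geq\; N \;\geq\; H \frac{\sqrt[H]{\Psi^2}}{\sqrt[H]{d}} \;\geq\; \sqrt[H]{\Psi} = \Lambda.$$
We assume the depth of the network $H$ is bounded. Therefore $N \to \infty$ iff $\Psi \to \infty$, and $N \to \infty$ iff $\Lambda \to \infty$.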
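As a quick sanity check of how mass and size relate, the sketch below computes both quantities for a list of layer widths and verifies the bounds that follow from the arithmetic-geometric mean inequality. It assumes the mass is the product of all layer widths (the number of input-output paths) and the size is the sum of products of consecutive widths (the number of weights), which is how the definitions above read. The layer widths are arbitrary illustrative values.

```python
import math

def mass(sizes):
    """Number of input-output paths: the product of all layer widths."""
    return math.prod(sizes)

def size(sizes):
    """Number of parameters: the sum of products of consecutive layer widths."""
    return sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))

def size_bounds(sizes):
    """Lower (AM-GM) and upper bounds on the size, expressed through the mass."""
    H = len(sizes) - 1                      # depth: number of weight layers
    d = sizes[0]                            # number of inputs
    psi = mass(sizes)
    lower = H * psi ** (2.0 / H) / d ** (1.0 / H)
    upper = H * psi ** 2
    return lower, upper

layer_sizes = [100, 50, 25, 1]              # illustrative widths, single output
psi, n_params = mass(layer_sizes), size(layer_sizes)
lower, upper = size_bounds(layer_sizes)
lam = psi ** (1.0 / (len(layer_sizes) - 1)) # H-th root of the mass
print(psi, n_params, lower, upper, lam)
```

For this example the mass is 125000 and the size is 6275, so the size sits far closer to the lower bound than to the (very loose) upper bound.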
In the rest of this section we will be establishing a connection between the loss function of the neural network and the Hamiltonian of the spin-glass model. We next provide the outline of our approach.
In Subsection 3.3 we introduce randomness to the model by assuming $X$’s and $A$’s are random. We make certain assumptions regarding the neural network model. First, we assume certain distributions and mutual dependencies concerning the random variables $X$’s and $A$’s. We also introduce a spherical constraint on the model weights. We finally make two other assumptions regarding the redundancy of network parameters and their uniformity, both of which are justified by empirical evidence in the literature. These assumptions will allow us to show in Subsection 3.4 that the loss function of the neural network, after re-indexing terms (the terms are re-indexed in Subsection 3.3 to preserve consistency with the notation in [Auffinger et al., 2010], where the proofs of the results of Section 4 can be found), has the form of a centered Gaussian process on the sphere $S = S^{\Lambda-1}(\sqrt{\Lambda})$, which is equivalent to the Hamiltonian of the $H$-spin spherical spin-glass model, given as
$$\mathcal{L}_{\Lambda,H}(\tilde{w}) = \frac{1}{\Lambda^{(H-1)/2}} \sum_{i_1, i_2, \ldots, i_H = 1}^{\Lambda} X_{i_1, i_2, \ldots, i_H}\, \tilde{w}_{i_1} \tilde{w}_{i_2} \cdots \tilde{w}_{i_H}, \qquad (2)$$
with spherical constraint
$$\frac{1}{\Lambda} \sum_{i=1}^{\Lambda} \tilde{w}_i^2 = 1. \qquad (3)$$
The redundancy and uniformity assumptions will be explained in Subsection 3.3 in detail. At a high level, however, the redundancy assumption enables us to skip the superscript $(k)$ appearing next to the weights in Equation 1 (note it does not appear next to the weights in Equation 2) by determining a set of unique network weights of size no larger than $N$, and the uniformity assumption ensures that all ordered products of unique weights appear in Equation 2 the same number of times.
An asymptotic evaluation of the complexity of $H$-spin spherical spin-glass models via random matrix theory was studied in the literature [Auffinger et al., 2010], where a precise description of the energy landscape for the Hamiltonians of these models is provided. In this paper (Section 4) we use these results to explain the optimization problem in neural networks.
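To make the spin-glass Hamiltonian concrete, the sketch below evaluates it for a small random coupling tensor. It assumes the standard form of the $H$-spin spherical model: i.i.d. standard Gaussian couplings, one weight factor per tensor axis, a prefactor of $\Lambda^{-(H-1)/2}$, and weights on the sphere of radius $\sqrt{\Lambda}$. The helper names and the values `dim = 5`, `H = 3` are illustrative choices only.

```python
import numpy as np

def random_point_on_sphere(dim, rng):
    """Uniform point on the sphere of radius sqrt(dim), so that mean(w**2) == 1."""
    w = rng.standard_normal(dim)
    return w * np.sqrt(dim) / np.linalg.norm(w)

def hamiltonian(X, w):
    """H-spin Hamiltonian: contract every axis of the coupling tensor X with w, then rescale."""
    dim, H = w.shape[0], X.ndim
    value = X
    for _ in range(H):
        value = value @ w                   # contracts the last remaining axis with w
    return float(value) / dim ** ((H - 1) / 2)

rng = np.random.default_rng(0)
dim, H = 5, 3                               # small illustrative dimension and number of spins
X = rng.standard_normal((dim,) * H)         # i.i.d. standard Gaussian couplings
w = random_point_on_sphere(dim, rng)

value = hamiltonian(X, w)

# Brute-force evaluation of the same sum, term by term, as a cross-check.
brute = sum(X[i, j, k] * w[i] * w[j] * w[k]
            for i in range(dim) for j in range(dim) for k in range(dim)) / dim ** ((H - 1) / 2)
print(value, brute)
```

Contracting one axis at a time keeps the code independent of $H$; the explicit triple loop is there only to confirm that the contraction computes the intended sum.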
We assume each input $X_{i,j}$ is a normal random variable such that $X_{i,j} \sim N(0, 1)$. Clearly the model contains several dependencies as one input is associated with many paths in the network. That poses a major theoretical problem in analyzing these models as it is unclear how to account for these dependencies. In this paper we instead study a fully decoupled model [De la Peña and Giné, 1999], where $X_{i,j}$’s are assumed to be independent. We allow this simplification as to the best of our knowledge there exists no theoretical description of the optimization paradigm with neural networks in the literature either under the independence assumption or when the dependencies are allowed. Also note that statistical learning theory heavily relies on this assumption [Hastie et al., 2001] even when the model under consideration is much simpler than a neural network. Under the independence assumption we will demonstrate the similarity of this model to the spin-glass model. We emphasize that despite the presence of high dependencies in real neural networks, both models exhibit high similarities as will be empirically demonstrated.
We assume each path in Equation 1 is equally likely to be active, thus $A_{i,j}$’s will be modeled as Bernoulli random variables with the same probability of success $\rho$. By assuming the independence of $X$’s and $A$’s we get the following
$$\mathbb{E}_A[Y] = q \sum_{i=1}^{n_0} \sum_{j=1}^{\gamma} X_{i,j}\, \rho \prod_{k=1}^{H} w_{i,j}^{(k)}. \qquad (4)$$
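Let $\mathcal{W} = \{w_1, w_2, \ldots, w_N\}$ be the set of all weights of the network. Let $\mathcal{A}$ denote the set of all $H$-length configurations of weights chosen from $\mathcal{W}$ (order of the weights in a configuration does matter). Note that the size of $\mathcal{A}$ is therefore $N^H$. Also let $\mathcal{B}$ be a set such that each element corresponds to the single configuration of weights from Equation 4, thus $\mathcal{B} = \{(w_{1,1}^{(1)}, w_{1,1}^{(2)}, \ldots, w_{1,1}^{(H)}), (w_{1,2}^{(1)}, w_{1,2}^{(2)}, \ldots, w_{1,2}^{(H)}), \ldots, (w_{n_0,\gamma}^{(1)}, w_{n_0,\gamma}^{(2)}, \ldots, w_{n_0,\gamma}^{(H)})\}$, where every single weight comes from set $\mathcal{W}$ (note that $\mathcal{B} \subset \mathcal{A}$). Thus Equation 4 can be equivalently written as
$$Y_N := \mathbb{E}_A[Y] = q \sum_{i_1, i_2, \ldots, i_H = 1}^{N} \; \sum_{j=1}^{r_{i_1, i_2, \ldots, i_H}} X^{(j)}_{i_1, i_2, \ldots, i_H}\, \rho \prod_{k=1}^{H} w_{i_k}. \qquad (5)$$
We will now explain the notation. It is over-complicated on purpose, as this notation will be useful later on. $r_{i_1, i_2, \ldots, i_H}$ denotes whether the configuration $(w_{i_1}, w_{i_2}, \ldots, w_{i_H})$ appears in Equation 4 or not, thus $r_{i_1, i_2, \ldots, i_H} \in \{0, 1\}$, and $\{X^{(j)}_{i_1, i_2, \ldots, i_H}\}_{j=1}^{r_{i_1, i_2, \ldots, i_H}}$ denotes a set of random variables corresponding to the same weight configuration (since $r_{i_1, i_2, \ldots, i_H} \in \{0, 1\}$ this set has at most one element; also, $r_{i_1, i_2, \ldots, i_H} = 0$ implies that the summand $X^{(j)}_{i_1, i_2, \ldots, i_H}\, \rho \prod_{k=1}^{H} w_{i_k}$ is zeroed out). Furthermore, the following condition has to be satisfied: $\sum_{i_1, i_2, \ldots, i_H = 1}^{N} r_{i_1, i_2, \ldots, i_H} = \Psi$. In the notation $Y_N$, index $N$ refers to the number of unique weights of a network (this notation will also be helpful later).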
Consider a family of networks which have the same graph of connections as network but different edge weighting such that they only have unique weights and (by notation analogy the expected output of this network will be called ). It was recently shown [Denil et al., 2013, Denton et al., 2014] that for large-size networks large number of network parameters (according to [Denil et al., 2013] even up to ) are redundant and can either be learned from a very small set of unique parameters or even not learned at all with almost no loss in prediction accuracy.
A network $\mathcal{M}$ which has the same graph of connections as $\mathcal{N}$ and $s$ unique weights satisfying $s \leq N$ is called an $(s, \epsilon)$-reduction image of $\mathcal{N}$ for some $\epsilon \in [0, 1]$ if the prediction accuracies of $\mathcal{N}$ and $\mathcal{M}$ differ by no more than $\epsilon$ (thus they classify at most an $\epsilon$ fraction of data points differently).
Let be a neural network giving the output whose expectation is given in Equation 5. Let be its -reduction image for some and . By analogy, let be the expected output of network . Then the following holds
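where $\mathrm{corr}$ denotes the correlation defined as $\mathrm{corr}(A, B) = \frac{\mathbb{E}[(A - \mathbb{E}[A])(B - \mathbb{E}[B])]}{\mathrm{std}(A)\,\mathrm{std}(B)}$, $\mathrm{std}$ is the standard deviation and $\mathrm{sign}(\cdot)$ denotes the sign of prediction (the signs of the two network outputs are both random).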
The redundancy assumption implies that one can preserve to be close to even with .
Consider the network to be a -reduction image of for some and . The output of the image network can in general be expressed as
where is the number of times each configuration repeats in Equation 5 and . We assume that unique weights are close to being evenly distributed on the graph of connections of network . We call this assumption a uniformity assumption. Thus this assumption implies that for all there exists a positive constant such that the following holds
comes from the fact that for the network where every weight is uniformly distributed on the graph of connections (thus with high probability every node is adjacent to an edge with any of the unique weights) it holds that. For simplicity assume and . Consider therefore an expression as follows
The following theorem (Theorem 3.3) captures the connection between and .
We finally assume that for some positive constant weights satisfy the spherical condition
Next we will consider two frequently used loss functions, absolute loss and hinge loss, where we approximate (recall ) with .
Let and be the (random) absolute loss and (random) hinge loss that we define as follows
where is a random variable corresponding to the true data labeling that takes values or in case of the absolute loss, where , and or in case of the hinge loss. Also note that in the case of the hinge loss operator can be modeled as Bernoulli random variable, which we assume is independent of . Given that one can show that after approximating with both losses can be generalized to the following expression
and , are some constants and weights are simply scaled weights satisfying . In case of the absolute loss the term is incorporated into the term , and in case of the hinge loss it vanishes (note that is a symmetric random quantity thus multiplying it by does not change its distribution). We skip the technical details showing this equivalence, and defer them to the Supplementary material. Note that after simplifying the notation by i) dropping the letter accents and simply denoting as , ii) skipping constants and which do not matter when minimizing the loss function, and iii) substituting , we obtain the Hamiltonian of the -spin spherical spin-glass model of Equation 2 with spherical constraint captured in Equation 3.
In this section we use the results of the theoretical analysis of the complexity of spherical spin-glass models of [Auffinger et al., 2010] to gain an understanding of the optimization of strongly non-convex loss functions of neural networks. These results show that for high-dimensional (large $\Lambda$) spherical spin-glass models the lowest critical values of the Hamiltonians of these models form a layered structure and are located in a well-defined band lower-bounded by the global minimum. Simultaneously, the probability of finding them outside the band diminishes exponentially with the dimension of the spin-glass model. We next present the details of these results in the context of neural networks. We first introduce the notation and definitions.
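Let $u \in \mathbb{R}$ and $k$ be an integer such that $0 \leq k < \Lambda$. We will denote as $C_{\Lambda,k}(u)$ a random number of critical values of $\mathcal{L}_{\Lambda,H}(w)$ in the set $\Lambda B = \{\Lambda x : x \in (-\infty, u)\}$ with index equal to $k$ (the number of negative eigenvalues of the Hessian $\nabla^2 \mathcal{L}_{\Lambda,H}$ at $w$ is also called the index of $\nabla^2 \mathcal{L}_{\Lambda,H}$ at $w$). Similarly we will denote as $C_{\Lambda}(u)$ a random total number of critical values of $\mathcal{L}_{\Lambda,H}(w)$ in the same set.
Later in the paper by critical values of the loss function that have non-diverging (fixed) index, or low-index, we mean the ones with index non-diverging with $\Lambda$.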
One can directly use Theorem 2.12 in [Auffinger et al., 2010] to show that for large-size networks (more precisely when $\Lambda \to \infty$, but recall that $\Lambda \to \infty$ iff $N \to \infty$) it is improbable to find a critical value below a certain level $-\Lambda E_0(H)$ (which we call the ground state), where $E_0(H)$ is some real number.
Let us also introduce the number that we will refer to as $E_\infty$. We will refer to this important threshold as the energy barrier and define it as
$$E_\infty = E_\infty(H) = 2\sqrt{\frac{H-1}{H}}.$$
Theorem 2.14 in [Auffinger et al., 2010] implies that for large-size networks all critical values of the loss function that are of non-diverging index must lie below the threshold $-\Lambda E_\infty(H)$. Any critical point that lies above the energy barrier is a high-index saddle point with overwhelming probability. Thus for large-size networks all critical values of the loss function that are of non-diverging index must lie in the band $(-\Lambda E_0(H), -\Lambda E_\infty(H))$.
From Theorem 2.15 in [Auffinger et al., 2010] it follows that for large-size networks finding a critical value with index larger than or equal to $k$ (for any fixed integer $k$) below energy level $-\Lambda E_k(H)$ is improbable, where $-E_k(H) \in [-E_0(H), -E_\infty(H)]$. Furthermore, the sequence $\{E_k(H)\}_{k \in \mathbb{N}}$ is strictly decreasing and converges to $E_\infty$ as $k \to \infty$ [Auffinger et al., 2010].
These results unravel a layered structure for the lowest critical values of the loss function of a large-size network, where with overwhelming probability the critical values just above the global minimum (ground state) of the loss function are local minima exclusively. Above the band $(-\Lambda E_0(H), -\Lambda E_1(H))$ containing only local minima (critical points of index $0$), there is another one, $(-\Lambda E_1(H), -\Lambda E_2(H))$, where one can only find local minima and saddle points of index $1$, and above this band there exists another one, $(-\Lambda E_2(H), -\Lambda E_3(H))$, where one can only find local minima and saddle points of index $1$ and $2$, and so on.
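We will now define two non-decreasing, continuous functions on $\mathbb{R}$, $\Theta_H$ and $\Theta_{k,H}$ (their exemplary plots are captured in Figure 2):
$$\Theta_H(u) = \begin{cases} \frac{1}{2}\log(H-1) - \frac{(H-2)u^2}{4(H-1)} - I(u) & \text{if } u \leq -E_\infty, \\ \frac{1}{2}\log(H-1) - \frac{(H-2)u^2}{4(H-1)} & \text{if } -E_\infty \leq u \leq 0, \\ \frac{1}{2}\log(H-1) & \text{if } 0 \leq u, \end{cases}$$
and for any integer $k \geq 0$:
$$\Theta_{k,H}(u) = \begin{cases} \frac{1}{2}\log(H-1) - \frac{(H-2)u^2}{4(H-1)} - (k+1)\, I(u) & \text{if } u \leq -E_\infty, \\ \frac{1}{2}\log(H-1) - \frac{H-2}{H} & \text{if } u \geq -E_\infty, \end{cases}$$
where $I(u) = -\frac{u}{E_\infty^2}\sqrt{u^2 - E_\infty^2} - \log\!\big(-u + \sqrt{u^2 - E_\infty^2}\big) + \log E_\infty$.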
Also note that the following corollary holds.
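For all $k > 0$ and $u < -E_\infty$, $\Theta_{k,H}(u) < \Theta_{0,H}(u)$.
Next we will show the logarithmic asymptotics of the mean number of critical points (the asymptotics of the mean number of critical points can be found in the Supplementary material).
For all $H \geq 2$,
$$\lim_{\Lambda \to \infty} \frac{1}{\Lambda} \log \mathbb{E}[C_{\Lambda}(u)] = \Theta_H(u),$$
and for all $H \geq 2$ and $k \geq 0$ fixed
$$\lim_{\Lambda \to \infty} \frac{1}{\Lambda} \log \mathbb{E}[C_{\Lambda,k}(u)] = \Theta_{k,H}(u).$$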
From Theorem 4.1 and Corollary 4.1 it follows that the number of critical points in the band $(-\Lambda E_0(H), -\Lambda E_\infty(H))$ increases exponentially as $\Lambda$ grows, and that local minima dominate over saddle points, with this domination also growing exponentially as $\Lambda$ grows. Thus for large-size networks the probability of recovering a saddle point in the band $(-\Lambda E_0(H), -\Lambda E_\infty(H))$, rather than a local minimum, goes to $0$.
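The layered structure can be illustrated numerically by locating the energy levels at which the complexity of index-$k$ critical points changes sign. The sketch below is hedged in several ways: it assumes the complexity functions and the rate function $I(u)$ take the form given in [Auffinger et al., 2010] for the spherical model, that the energy barrier is $2\sqrt{(H-1)/H}$, and that each level $E_k(H)$ is the unique zero of $\Theta_{k,H}(-E)$. The bisection bracket and the function names are illustrative choices.

```python
import numpy as np

def e_inf(H):
    """Energy barrier E_inf(H) = 2 * sqrt((H - 1) / H) (assumed form)."""
    return 2.0 * np.sqrt((H - 1) / H)

def rate_I(u, H):
    """Rate function I(u) for u <= -E_inf(H); it vanishes at u = -E_inf(H)."""
    E = e_inf(H)
    root = np.sqrt(max(u * u - E * E, 0.0))   # clamp guards against rounding at the boundary
    return -u / E**2 * root - np.log(-u + root) + np.log(E)

def theta_k(u, k, H):
    """Complexity of index-k critical points, valid for u <= -E_inf(H)."""
    return 0.5 * np.log(H - 1) - (H - 2) / (4.0 * (H - 1)) * u * u - (k + 1) * rate_I(u, H)

def energy_level(k, H, tol=1e-12):
    """E_k(H): positive number with theta_k(-E_k, k, H) = 0, by bisection (needs H >= 3)."""
    lo, hi = -10.0, -e_inf(H)                 # theta_k is negative at lo and positive at hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if theta_k(mid, k, H) < 0.0:
            lo = mid
        else:
            hi = mid
    return -0.5 * (lo + hi)

H = 3
levels = [energy_level(k, H) for k in range(4)]
print(levels, e_inf(H))
```

Under these assumptions the computed levels decrease with $k$ and stay above the energy barrier, which is the ordering the band picture relies on.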
Figure 1 captures exemplary plots of the distributions of the mean number of critical points, local minima and low-index saddle points. Clearly local minima and low-index saddle points are located in the band $(-\Lambda E_0(H), -\Lambda E_\infty(H))$ whereas high-index saddle points can only be found above the energy barrier $-\Lambda E_\infty(H)$. Figure 1 also reveals the layered structure for the lowest critical values of the loss function (the large mass of saddle points above $-\Lambda E_\infty$ is a consequence of Theorem 4.1 and the properties of the $\Theta$ functions). This 'geometric' structure plays a crucial role in the optimization problem. The optimizer, e.g. SGD, easily avoids the band of high-index critical points, which have many negative curvature directions, and descends to the band of low-index critical points which lie closer to the global minimum. Thus finding a bad-quality solution, i.e. one far away from the global minimum, is highly unlikely for large-size networks.
Note that the energy barrier to cross when starting from any (local) minimum, e.g. the one from the band , in order to reach the global minimum diverges with $\Lambda$ since it is bounded below by . Furthermore, suppose we are at a local minimum with a scaled energy of . In order to find a lower-lying minimum we must pass through a saddle point. Therefore we must go up at least to the level where there is an equal amount of saddle points to have a decent chance of finding a path that might possibly take us to another local minimum. This process takes an exponentially long time, so in practice finding the global minimum is not feasible.
Note that the variance of the loss in Equation 2 is $\Lambda$, which suggests that the extensive quantities should scale with $\Lambda$. In fact this is the reason behind the scaling factor in front of the summation in the loss. The relation to the logarithmic asymptotics is as follows: the number of critical values of the loss below the level $\Lambda u$ is roughly $e^{\Lambda \Theta_H(u)}$. The gradient descent gets trapped roughly at the barrier denoted by $-\Lambda E_\infty$, as will be shown in the experimental section.
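The variance claim can be checked by Monte Carlo: for a fixed point on the sphere, resample the couplings many times and measure the spread of the loss. This sketch assumes $H = 3$, i.i.d. standard Gaussian couplings, the prefactor $\Lambda^{-(H-1)/2}$, and weights normalized so that their squares sum to $\Lambda$. The dimension and the sample count are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_samples = 6, 20000                   # small dimension, many independent coupling tensors

w = rng.standard_normal(dim)
w *= np.sqrt(dim) / np.linalg.norm(w)       # spherical constraint: sum of squares equals dim

X = rng.standard_normal((n_samples, dim, dim, dim))
# For H = 3 the prefactor dim ** (-(H - 1) / 2) is simply 1 / dim.
losses = np.einsum('nijk,i,j,k->n', X, w, w, w) / dim
empirical_variance = losses.var()
print(empirical_variance)                   # should be close to dim
```

For fixed weights the loss is a linear combination of independent Gaussians whose squared coefficients sum to the dimension, which is why the empirical variance concentrates there.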
The theoretical part of the paper considers the problem of training the neural network, whereas the empirical results focus on its generalization properties.
To illustrate the theorems in Section 4, we conducted spin-glass simulations for different dimensions $\Lambda$ from 25 to 500. For each value of $\Lambda$, we obtained an estimate of the distribution of minima by sampling 1000 initial points on the unit sphere and performing stochastic gradient descent (SGD) to find a minimum energy point. Note that throughout this section we will refer to the energy of the Hamiltonian of the spin-glass model as its loss.
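A single run of such a simulation might be sketched as follows. This is not the authors' code: it uses plain (full-gradient) descent with projection onto the tangent space and renormalization instead of SGD, fixes $H = 3$, places the weights on the sphere of radius $\sqrt{\Lambda}$, and uses a hypothetical step size and iteration count chosen only to keep the example small.

```python
import numpy as np

def loss_and_grad(X, w):
    """Loss of the 3-spin model and its Euclidean gradient with respect to w."""
    dim = w.shape[0]
    loss = np.einsum('ijk,i,j,k->', X, w, w, w) / dim
    grad = (np.einsum('ijk,j,k->i', X, w, w)
            + np.einsum('ijk,i,k->j', X, w, w)
            + np.einsum('ijk,i,j->k', X, w, w)) / dim
    return loss, grad

def descend_on_sphere(X, w, lr=0.02, steps=2000):
    """Gradient descent constrained to the sphere of radius sqrt(dim)."""
    radius = np.sqrt(w.shape[0])
    for _ in range(steps):
        _, grad = loss_and_grad(X, w)
        grad -= (grad @ w) / (w @ w) * w    # keep only the component tangent to the sphere
        w = w - lr * grad
        w *= radius / np.linalg.norm(w)     # retract back onto the sphere
    return w

rng = np.random.default_rng(0)
dim = 20
X = rng.standard_normal((dim, dim, dim))
w0 = rng.standard_normal(dim)
w0 *= np.sqrt(dim) / np.linalg.norm(w0)

initial_loss, _ = loss_and_grad(X, w0)
w_final = descend_on_sphere(X, w0)
final_loss, _ = loss_and_grad(X, w_final)
print(initial_loss / dim, final_loss / dim) # losses divided by the dimension
```

Repeating this from many random starting points and collecting the final losses divided by the dimension would give a histogram comparable in spirit to the one described in the text.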
We performed an analogous experiment on a scaled-down version of MNIST, where each image was downsampled to size. Specifically, we trained 1000 networks with one hidden layer and hidden units (in the paper we also refer to the number of hidden units as nhidden), each one starting from a random set of parameters sampled uniformly within the unit cube. All networks were trained for 200 epochs using SGD with learning rate decay.
To verify the validity of our theoretical assumption of parameter redundancy, we also trained a neural network on a subset of MNIST using simulated annealing (SA) where of parameters were assumed to be redundant. Specifically, we allowed the weights to take one of values uniformly spaced in the interval . We obtained less than drop in accuracy, which demonstrates the heavy over-parametrization of neural networks as discussed in Section 3.
It is necessary to verify that our solutions obtained through SGD are low-index critical points rather than high-index saddle points of poor quality. As observed by [Dauphin et al., 2014] certain optimization schemes have the potential to get trapped in the latter. We ran two tests to ensure that this was not the case in our experimental setup. First, for we computed the eigenvalues of the Hessian of the loss function at each solution and computed the index. All eigenvalues less than 0.001 in magnitude were set to 0. Figure 4 captures an exemplary distribution of normalized indices, which is the proportion of negative eigenvalues, for (the results for can be found in the Supplementary material). It can be seen that all solutions are either minima or saddle points of very low normalized index (of the order 0.01). Next, we compared simulated annealing to SGD on a subset of MNIST. Simulated annealing does not compute gradients and thus does not tend to become trapped in high-index saddle points. We found that SGD performed at least as well as simulated annealing, which indicates that becoming trapped in poor saddle points is not a problem in our experiments. The result of this comparison is in the Supplementary material. All figures in this paper should be read in color.
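The index computation described above is straightforward to sketch. The function below assumes the Hessian is available as a dense symmetric matrix; the name `normalized_index` and the toy matrix are illustrative, while the threshold of 0.001 follows the text.

```python
import numpy as np

def normalized_index(hessian, tol=1e-3):
    """Proportion of negative Hessian eigenvalues, after zeroing those smaller than tol in magnitude."""
    eigenvalues = np.linalg.eigvalsh(hessian)     # assumes a symmetric matrix
    eigenvalues[np.abs(eigenvalues) < tol] = 0.0  # treat near-zero curvature as flat
    return np.mean(eigenvalues < 0)

# Toy example: one clearly negative direction, one negligible one, two positive ones.
toy_hessian = np.diag([-1.0, -0.0005, 2.0, 3.0])
index = normalized_index(toy_hessian)
print(index)
```

In the toy example only one of the four eigenvalues counts as negative after thresholding, so the normalized index is one quarter; a minimum would give zero.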
To observe qualitative differences in behavior for different values of or , it is necessary to rescale the loss values to make their expected values approximately equal. For spin-glasses, the expected value of the loss at critical points scales linearly with , therefore we divided the losses by (note that this normalization is in the statement of Theorem 4.1) which gave us the histogram of points at the correct scale. For MNIST experiments, we empirically found that the loss with respect to number of hidden units approximately follows an exponential power law: . We fitted the coefficients and scaled the loss values to .
Figure 3 shows the distributions of the scaled test losses for both sets of experiments. For the spin-glasses (left plot), we see that for small values of $\Lambda$, we obtain poor local minima on many experiments, while for larger values of $\Lambda$ the distribution becomes increasingly concentrated around the energy barrier where local minima have high quality. We observe that the left tails for all $\Lambda$ touch the barrier that is hard to penetrate and as $\Lambda$ increases the values concentrate around $-E_\infty$. In fact this concentration result has long been predicted but not proved until [Auffinger et al., 2010]. We see that qualitatively the distribution of losses for the neural network experiments (right plot) exhibits similar behavior. Even after scaling, the variance decreases with higher network sizes. This is also clearly captured in Figures 8 and 9 in the Supplementary material. This indicates that getting stuck in poor local minima is a major problem for smaller networks but becomes gradually of less importance as the network size increases. This is because critical points of large networks exhibit the layered structure where high-quality low-index critical points lie close to the global minimum.
The theory and experiments thus far indicate that minima lie in a band which gets smaller as the network size increases. This indicates that computable solutions become increasingly equivalent with respect to training error, but how does this relate to error on the test set? To determine this, we computed the correlation between training and test loss for all solutions for each network size. The results are captured in Table 1 and Figure 7 (the latter is in the Supplementary material). The training and test error become increasingly decorrelated as the network size increases. This provides further indication that attempting to find the absolute possible minimum is of limited use with regard to generalization performance.
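The correlation measurement itself amounts to a Pearson correlation over the set of solutions found for one network size. The sketch below uses purely synthetic numbers, not results from the paper, to show the computation in a strongly correlated regime and in a decorrelated one. The helper name and all values are hypothetical.

```python
import numpy as np

def train_test_correlation(train_losses, test_losses):
    """Pearson correlation between training and test losses across solutions."""
    return np.corrcoef(train_losses, test_losses)[0, 1]

rng = np.random.default_rng(0)
# Synthetic stand-ins for the losses of 1000 trained networks (NOT data from the paper).
train = rng.normal(0.10, 0.01, size=1000)
test_correlated = train + rng.normal(0.0, 0.002, size=1000)   # test loss tracks training loss
test_decorrelated = rng.normal(0.12, 0.01, size=1000)         # test loss unrelated to training loss

high = train_test_correlation(train, test_correlated)
low = train_test_correlation(train, test_decorrelated)
print(high, low)
```

A value near one means that pushing the training loss lower also lowers the test loss; a value near zero corresponds to the regime where further reducing the training loss brings no systematic benefit on the test set.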
This paper establishes a connection between the neural network and the spin-glass model. We show that under certain assumptions, the loss function of the fully decoupled large-size neural network of depth $H$ has a landscape similar to that of the Hamiltonian of the $H$-spin spherical spin-glass model. We empirically demonstrate that both models studied here are highly similar in real settings, despite the presence of variable dependencies in real networks. To the best of our knowledge our work is one of the first efforts in the literature to shed light on the theory of neural network optimization.
The authors thank L. Sagun and the referees for valuable feedback.
Baldi, P. and Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2:53–58.
Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In ICML.
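First we will prove the lower-bound on $N$. By the inequality between arithmetic and geometric mean the mass and the size of the network are connected as follows
$$N = \sum_{i=1}^{H} n_{i-1} n_i \;\geq\; H \sqrt[H]{\prod_{i=1}^{H} n_{i-1} n_i} = H \sqrt[H]{\frac{\Psi^2}{n_0 n_H}} = H \frac{\sqrt[H]{\Psi^2}}{\sqrt[H]{d}},$$
and since $\sqrt[H]{\Psi / d} \geq 1$ then
$$N \;\geq\; H \sqrt[H]{\Psi} \;\geq\; \sqrt[H]{\Psi}.$$
Next we show the upper-bound on $N$. Let $\max_i n_i = n$. Then
$$N \;\leq\; H n^2 \;\leq\; H \Psi^2.$$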
We will first prove the following more general lemma.
Let and be the outputs of two arbitrary binary classifiers. Assume that the first classifier predicts with probability where, without loss of generality, we assume and otherwise. Furthermore, let the prediction accuracy of the second classifier differ from the prediction accuracy of the first classifier by no more than . Then the following holds
Consider two random variables and . Let denote the set of data points for which the first classifier predicts and let denote the set of data points for which the first classifier predicts (, where is the entire dataset). Also let . Furthermore, let denote the dataset for which and and denote the dataset for which and , where . Also let and . Therefore
One can compute that , , , , and finally . Thus we obtain
Note that and . Furthermore
where the last inequality is the direct consequence of the uniformity assumption of Equation 6. ∎
We consider two loss functions, (random) absolute loss and (random) hinge loss defined in the main body of the paper. Recall that in case of the hinge loss operator can be modeled as Bernoulli random variable, that we will refer to as , with success () probability for some non-negative constant . We assume is independent of . Therefore we obtain that
Note that both cases can be generalized as
where in case of the absolute loss and in case of the hinge loss . Furthermore, using the fact that ’s are Gaussian random variables we can further generalize both cases as
Let for all . Note that . Thus
Note that the spherical assumption in Equation 9 directly implies that
To simplify the notation in Equation 11 we drop the letter accents and simply denote as . We skip constant and as it does not matter when minimizing the loss function. After substituting we obtain
Below, we provide the asymptotics of the mean number of critical points (Theorem 11.1) and the mean number of local minima (Theorem 11.2), which extend Theorem 4.1. Those results are consequences of Theorem 2.17 and Corollary 2.18 of [Auffinger et al., 2010].
For , the following holds as :
where , ,
where is the Airy function of the first kind.
For and , the following holds as :