No bad local minima: Data independent training error guarantees for multilayer neural networks

05/26/2016 ∙ by Daniel Soudry, et al. ∙ Stanford University 0

We use smoothed analysis techniques to provide guarantees on the training loss of Multilayer Neural Networks (MNNs) at differentiable local minima. Specifically, we examine MNNs with piecewise linear activation functions, quadratic loss and a single output, under mild over-parametrization. We prove that for a MNN with one hidden layer, the training error is zero at every differentiable local minimum, for almost every dataset and dropout-like noise realization. We then extend these results to the case of more than one hidden layer. Our theoretical guarantees assume essentially nothing on the training data, and are verified numerically. These results suggest why the highly non-convex loss of such MNNs can be easily optimized using local updates (e.g., stochastic gradient descent), as observed empirically.


1 Introduction

Multilayer Neural Networks (MNNs) have achieved state-of-the-art performance in many areas of machine learning [20]. This success is typically achieved by training complicated models using a simple stochastic gradient descent (SGD) method, or one of its variants. However, SGD is only guaranteed to converge to critical points in which the gradient of the expected loss is zero [5], and, specifically, to stable local minima [25] (this is also true for regular gradient descent [22]). Since loss functions parametrized by MNN weights are non-convex, it has long been a mystery why SGD works well, rather than converging to “bad” local minima, where the training error is high (and thus also the test error is high).

Previous results (section 2) suggest that the training error at all local minima should be low if the MNNs have extremely wide layers. However, such wide MNNs would also have an extremely large number of parameters, and serious overfitting issues. Moreover, current state-of-the-art results are typically achieved by deep MNNs [19, 13], rather than wide ones. Therefore, we are interested in providing training error guarantees at a more practical number of parameters.

As a common rule of thumb, a multilayer neural network should have at least as many parameters as training samples, and use regularization, such as dropout [15], to reduce overfitting. For example, AlexNet [19] had 60 million parameters and was trained using 1.2 million examples. This over-parametrized regime continues in more recent works, which achieve state-of-the-art performance with very deep networks [13]. These networks are typically under-fitting [13], which suggests that the training error is the main bottleneck in further improving performance.

In this work we focus on MNNs with a single output and leaky rectified linear units. We provide a guarantee that the training error is zero in every differentiable local minimum (DLM), under mild over-parametrization, and essentially for every data set. With one hidden layer (Theorem 4) we show that the training error is zero in all DLMs whenever the number of weights in the first layer is at least the number of samples $N$, i.e., when $N \le d_0 d_1$, where $d_l$ is the width of the $l$-th activation layer. For MNNs with $L \ge 3$ layers we show that, if $N \le d_{L-2} d_{L-1}$, then convergence to potentially bad DLMs (in which the training error is not zero) can be averted by using a small perturbation to the MNN’s weights and then fixing all the weights except the last two weight layers (Corollary 6).

A key aspect of our approach is the presence of a multiplicative dropout-like noise term in our MNN model. We formalize the notion of validity for essentially every dataset by showing that our results hold almost everywhere with respect to the Lebesgue measure over the data and this noise term. This approach is commonly used in smoothed analysis of algorithms, and often affords great improvements over worst-case guarantees (e.g., [30]). Intuitively, there may be some rare cases where our results do not hold, but almost any infinitesimal perturbation of the input and activation functions will fix this. Thus, our results assume essentially no structure on the input data, and are unique in that sense.

2 Related work

At first, it may seem hopeless to find any training error guarantee for MNNs. Since the loss of MNNs is highly non-convex, with multiple local minima [8], it seems reasonable that optimization with SGD would get stuck at some bad local minimum. Moreover, many theoretical hardness results (reviewed in [29]) have been proven for MNNs with one hidden layer.

Despite these results, one can easily achieve zero training error [3, 24] if the MNN’s last hidden layer has more units than training samples ($d_{L-1} \ge N$). This case is not very useful, since it results in a huge number of weights (larger than $d_{L-2} N$), leading to strong over-fitting. However, such wide networks are easy to optimize, since by training the last layer we get to a global minimum (zero training error) from almost every random initialization [16, 23, 11].
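The wide-network case described above can be sketched numerically. The following is a minimal illustration, not the procedure of the cited works: it assumes a single random, untrained hidden layer with a leaky ReLU (slope 0.1 chosen arbitrarily) and fits only the output weights by least squares. The helper names `leaky_relu` and `last_layer_mse` are hypothetical.

```python
import numpy as np

def leaky_relu(u, slope=0.1):
    """Leaky ReLU; the slope value 0.1 is an arbitrary choice for this sketch."""
    return np.where(u >= 0, u, slope * u)

def last_layer_mse(X, y, width, seed=0):
    """Fix a random hidden layer of the given width and fit only the output weights."""
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((width, X.shape[0]))  # random, never trained
    H = leaky_relu(W1 @ X)                         # hidden outputs, shape (width, N)
    w2, *_ = np.linalg.lstsq(H.T, y, rcond=None)   # least-squares output weights
    e = H.T @ w2 - y                               # output errors on the training set
    return 0.5 * np.mean(e ** 2)

rng = np.random.default_rng(1)
N, d0 = 20, 5
X = rng.standard_normal((d0, N))       # random inputs
y = rng.choice([-1.0, 1.0], size=N)    # random binary targets

mse_wide = last_layer_mse(X, y, width=2 * N)     # more hidden units than samples
mse_narrow = last_layer_mse(X, y, width=N // 4)  # far fewer hidden units than samples
print(mse_wide, mse_narrow)
```

With more hidden units than samples the linear system for the output weights is generically solvable exactly, so the training error vanishes up to round-off, while the narrow network cannot fit random targets this way.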

Qualitatively similar training dynamics are also observed in more standard (narrower) MNNs. Specifically, the training error usually descends on a single smooth slope path with no “barriers” [9], and the training error at local minima seems to be similar to the error at the global minimum [7]. The latter was explained in [7] by an analogy with high-dimensional random Gaussian functions, in which any critical point high above the global minimum has a low probability to be a local minimum. A different explanation of the same phenomenon was suggested by [6]. There, a MNN was mapped to a spin-glass Ising model, in which all local minima are limited to a finite band above the global minimum.

However, it is not yet clear how relevant these statistical mechanics results are for actual MNNs and realistic datasets. First, the analogy in [7] is qualitative, and the mapping in [6] requires several implausible assumptions (e.g., independence of inputs and targets). Second, such statistical mechanics results become exact in the limit of infinite parameters, so for a finite number of layers, each layer should be infinitely wide. However, extremely wide networks may have serious over-fitting issues, as we explained before.

Previous works have shown that, given several limiting assumptions on the dataset, it is possible to get a low training error on a MNN with one hidden layer: [10] proved convergence for linearly separable datasets; [27] either required that , or clustering of the classes. Going beyond training error, [2] showed that MNNs with one hidden layer can learn low order polynomials, under a product of Gaussians distributional assumption on the input. Also, [17] devised a tensor method, instead of the standard SGD method, for which MNNs with one hidden layer are guaranteed to approximate arbitrary functions. Note, however, the last two works require a rather large to get good guarantees.

3 Preliminaries

Model.

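We examine a Multilayer Neural Network (MNN) optimized on a finite training set $\{x^{(n)}, y^{(n)}\}_{n=1}^{N}$, where $X \triangleq [x^{(1)}, \dots, x^{(N)}] \in \mathbb{R}^{d_0 \times N}$ are the input patterns, $[y^{(1)}, \dots, y^{(N)}] \in \mathbb{R}^{1 \times N}$ are the target outputs (for simplicity we assume a scalar output), and $N$ is the number of samples. The MNN has $L$ layers, in which the layer inputs $u_l^{(n)} \in \mathbb{R}^{d_l}$ and outputs $v_l^{(n)} \in \mathbb{R}^{d_l}$ (a component of $v_l$ is denoted $v_{i,l}$) are given by

$$\forall n,\ \forall l \ge 1:\quad u_l^{(n)} \triangleq W_l v_{l-1}^{(n)}, \qquad v_l^{(n)} \triangleq \operatorname{diag}\big(a_l^{(n)}\big)\, u_l^{(n)} \tag{3.1}$$

where $v_0^{(n)} = x^{(n)}$ is the input of the network, $W_l \in \mathbb{R}^{d_l \times d_{l-1}}$ are the weight matrices (a component of $W_l$ is denoted $W_{ij,l}$; bias terms are ignored for simplicity), and $a_l^{(n)} = a_l^{(n)}(u_l^{(n)})$ are piecewise constant activation slopes defined below. We set $A_l \triangleq [a_l^{(1)}, \dots, a_l^{(N)}]$.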

Activations.

Many commonly used piecewise-linear activation functions (e.g., rectified linear unit, maxout, max-pooling) can be written in the matrix product form in eq. (3.1). We consider the following relationship:

$$\forall n:\ a_L^{(n)} = 1, \qquad \forall l \le L-1:\ a_{i,l}^{(n)}\big(u_l^{(n)}\big) \triangleq \epsilon_{i,l}^{(n)} \cdot \begin{cases} 1, & u_{i,l}^{(n)} \ge 0 \\ s, & u_{i,l}^{(n)} < 0 \end{cases}$$

When $E_l \triangleq [\epsilon_l^{(1)}, \dots, \epsilon_l^{(N)}]$ has all entries equal to one, we recover the common leaky rectified linear unit (leaky ReLU) nonlinearity, with some fixed slope $s \neq 0$. The matrix $E_l$ can be viewed as a realization of dropout noise: in most implementations $\epsilon_{i,l}^{(n)}$ is distributed on a discrete set (e.g., $\{0, 1\}$), but competitive performance is obtained with continuous distributions (e.g. Gaussian) [31, 32]. Our results apply directly to the latter case. The inclusion of $E_l$ is the innovative part of our model: by performing smoothed analysis jointly on $X$ and $(E_1, \dots, E_{L-1})$ we are able to derive strong training error guarantees. However, our use of dropout is purely a proof strategy; we never expect dropout to reduce the training error in realistic datasets. This is further discussed in sections 6 and 7.
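The layer recursion of eq. (3.1), with slopes given by a leaky ReLU pattern multiplied by a dropout-like noise term, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `forward`, the slope value 0.1, and the toy weights are hypothetical, and the output layer is taken to be linear.

```python
import numpy as np

def forward(weights, noise, x, slope=0.1):
    """Forward pass of eq. (3.1) for one input x.

    weights: list [W_1, ..., W_L], with W_l of shape (d_l, d_{l-1})
    noise:   list of multiplicative noise vectors, one per hidden layer
    slope:   leaky ReLU slope (0.1 is an arbitrary choice for this sketch)
    """
    v = np.asarray(x, dtype=float)
    L = len(weights)
    slopes = []
    for l, W in enumerate(weights, start=1):
        u = W @ v                                            # layer input u_l
        if l < L:
            a = noise[l - 1] * np.where(u >= 0, 1.0, slope)  # hidden slopes a_l
        else:
            a = np.ones_like(u)                              # linear output layer
        v = a * u                                            # same as diag(a_l) @ u_l
        slopes.append(a)
    return v, slopes

W1 = np.array([[1.0, 0.0], [0.0, -1.0]])
W2 = np.array([[1.0, 1.0]])
x = np.array([1.0, 2.0])

# Noise switched off (all ones): a plain leaky ReLU network.
out_plain, slopes_plain = forward([W1, W2], [np.ones(2)], x)
# Multiplicative noise rescales each hidden unit's slope.
out_noisy, slopes_noisy = forward([W1, W2], [np.array([2.0, 0.5])], x)
print(out_plain, out_noisy)
```

Because the slopes depend on the weights only through the signs of the layer inputs, they are piecewise constant in the weights, which is the property the later sections rely on.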

Measure-theoretic terminology

Throughout the paper, we make extensive use of the term $(X, E_1, \dots, E_{L-1})$-almost everywhere, or a.e. for short. This is taken to mean almost everywhere with respect to the Lebesgue measure on all of the entries of $X, E_1, \dots, E_{L-1}$. A property holds a.e. with respect to some measure if the set of objects for which it doesn’t hold has measure 0. In particular, our results hold with probability 1 whenever $E_1, \dots, E_{L-1}$ are taken to have i.i.d. Gaussian entries, and arbitrarily small Gaussian i.i.d. noise is used to smooth the input $X$.

Loss function.

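We denote $e \triangleq v_L - y$ as the output error, where $v_L$ is the output of the neural network with $v_0 = x$, $\mathbf{e} \triangleq [e^{(1)}, \dots, e^{(N)}]^\top$, and $\hat{\mathbb{E}}$ as the empirical expectation over the training samples. We use the mean square error (MSE), which can be written in any of the following forms:

$$\mathrm{MSE} \triangleq \frac{1}{2}\hat{\mathbb{E}}\, e^2 = \frac{1}{2N}\sum_{n=1}^{N} \big(e^{(n)}\big)^2 = \frac{1}{2N}\|\mathbf{e}\|^2 \tag{3.2}$$

The loss function depends on $X$, $(E_1, \dots, E_{L-1})$, and on the entire weight vector $w \triangleq [w_1^\top, \dots, w_L^\top]^\top \in \mathbb{R}^{\omega}$, where $w_l \triangleq \operatorname{vec}(W_l)$ is the flattened weight matrix of layer $l$, and $\omega = \sum_{l=1}^{L} d_{l-1} d_l$ is the total number of weights.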

4 Single Hidden layer

MNNs are typically trained by minimizing the loss over the training set, using Stochastic Gradient Descent (SGD), or one of its variants (e.g., Adam [18]). In this section and the next, we guarantee zero training loss in the common case of an over-parametrized MNN. We do this by analyzing the properties of differentiable local minima (DLMs) of the MSE (eq. (3.2)). We focus on DLMs, since under rather mild conditions [25, 5], SGD asymptotically converges to DLMs of the loss (for finite $N$, a point can be non-differentiable only if some activation input is exactly zero, i.e., $\exists\, i, l, n$ such that $u_{i,l}^{(n)} = 0$).

We first consider a MNN with one hidden layer ($L = 2$). We start by examining the MSE at a DLM

$$\mathrm{MSE} = \frac{1}{2}\hat{\mathbb{E}}\, e^2 = \frac{1}{2}\hat{\mathbb{E}}\left(y - W_2 \operatorname{diag}(a_1) W_1 x\right)^2 \tag{4.1}$$

To simplify notation, we absorb the redundant parameterization of the weights of the second layer into the first, $\tilde{W}_1 \triangleq \operatorname{diag}(W_2^\top)\, W_1$, obtaining

$$\mathrm{MSE} = \frac{1}{2}\hat{\mathbb{E}}\left(y - a_1^\top \tilde{W}_1 x\right)^2 \tag{4.2}$$

Note this is only a simplified notation: we do not actually change the weights of the MNN, so in both equations the activation slopes remain the same, i.e., $a_1^{(n)} = a_1^{(n)}(W_1 x^{(n)})$. If there exists an infinitesimal perturbation which reduces the MSE in eq. (4.2), then there exists a corresponding infinitesimal perturbation which reduces the MSE in eq. (4.1). Therefore, if $(W_1, W_2)$ is a DLM of the MSE in eq. (4.1), then $\tilde{W}_1$ must also be a DLM of the MSE in eq. (4.2). Clearly, both DLMs have the same MSE value. Therefore, we will proceed by assuming that $\tilde{W}_1$ is a DLM of eq. (4.2), and any constraint we derive for the MSE in eq. (4.2) will automatically apply to any DLM of the MSE in eq. (4.1).

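If we are at a DLM of eq. (4.2), then its derivative is equal to zero. To calculate this derivative we rely on two facts. First, we can always switch the order of differentiation and expectation, since we average over a finite training set. Second, at any differentiable point (and in particular, a DLM), the derivative of $a_1$ with respect to the weights is zero. Thus, we find that, at any DLM,

$$\nabla_{\tilde{W}_1} \mathrm{MSE} = \hat{\mathbb{E}}\left[e\, a_1 x^\top\right] = 0 \tag{4.3}$$

To reshape this gradient equation to a more convenient form, we denote Kronecker’s product by $\otimes$, and define the “gradient matrix” (without the error $e$)

$$G_1 \triangleq A_1 \circ X \triangleq \left[a_1^{(1)} \otimes x^{(1)}, \dots, a_1^{(N)} \otimes x^{(N)}\right] \in \mathbb{R}^{d_0 d_1 \times N} \tag{4.4}$$

where $\circ$ denotes the Khatri-Rao product (cf. [1], [4]). Using this notation, and recalling that $\mathbf{e} = [e^{(1)}, \dots, e^{(N)}]^\top$, eq. (4.3) becomes

$$G_1 \mathbf{e} = 0 \tag{4.5}$$

Therefore, $\mathbf{e}$ lies in the right nullspace of $G_1$, which has dimension $N - \operatorname{rank}(G_1)$. Specifically, if $\operatorname{rank}(G_1) = N$, the only solution is $\mathbf{e} = 0$. This immediately implies the following lemma.

Lemma 1.

Suppose we are at some DLM of eq. (4.2). If $\operatorname{rank}(G_1) = N$, then $\mathrm{MSE} = 0$.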
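The step from the gradient of eq. (4.2) to the Khatri-Rao form can be checked numerically. The sketch below is an illustration under stated assumptions: the activation slopes are held fixed (as they are locally at a differentiable point), the weight matrix is flattened row by row, and the names `khatri_rao` and `mse` as well as the sizes and slope values are hypothetical.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: column n equals kron(B[:, n], C[:, n])."""
    return np.einsum('in,jn->ijn', B, C).reshape(B.shape[0] * C.shape[0], -1)

def mse(W_tilde, A, X, y):
    """MSE of eq. (4.2) with the slopes A held fixed."""
    out = np.einsum('in,ij,jn->n', A, W_tilde, X)  # a^T W x for every sample
    return 0.5 * np.mean((out - y) ** 2)

rng = np.random.default_rng(0)
d0, d1, N = 4, 3, 6                                # N <= d0 * d1
X = rng.standard_normal((d0, N))
y = rng.standard_normal(N)
W = rng.standard_normal((d1, d0))
# Slopes: leaky ReLU pattern times continuous multiplicative noise (assumed values).
A = np.where(W @ X >= 0, 1.0, 0.1) * rng.uniform(0.5, 1.5, size=(d1, N))

G = khatri_rao(A, X)                               # gradient matrix, shape (d0 * d1, N)
e = np.einsum('in,ij,jn->n', A, W, X) - y          # output errors
grad = (G @ e / N).reshape(d1, d0)                 # analytic gradient of the MSE

# Central finite differences of the MSE, entry by entry.
h = 1e-6
grad_fd = np.zeros_like(W)
for i in range(d1):
    for j in range(d0):
        step = np.zeros_like(W)
        step[i, j] = h
        grad_fd[i, j] = (mse(W + step, A, X, y) - mse(W - step, A, X, y)) / (2 * h)

rank = np.linalg.matrix_rank(G)
print(rank, np.abs(grad - grad_fd).max())
```

At this random, non-optimal point the error vector is nonzero and the gradient matrix has full column rank, so the gradient cannot vanish. This is the contrapositive of Lemma 1: with full column rank, a zero gradient forces zero error.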

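To show that $G_1$ has, generically, full column rank, we state the following important result, which is a special case of [1, lemma 13],

Fact 2.

For $B \in \mathbb{R}^{d_B \times N}$ and $C \in \mathbb{R}^{d_C \times N}$ with $N \le d_B d_C$, we have, $(B, C)$-almost everywhere,

$$\operatorname{rank}(B \circ C) = N \tag{4.6}$$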
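Fact 2 can be illustrated with random matrices. This is a numerical sanity check under the assumption of Gaussian entries, not a proof; the helper names `khatri_rao` and `generic_rank` and the chosen sizes are hypothetical.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: column n equals kron(B[:, n], C[:, n])."""
    return np.einsum('in,jn->ijn', B, C).reshape(B.shape[0] * C.shape[0], -1)

def generic_rank(dB, dC, N, seed=0):
    """Rank of the Khatri-Rao product of two random Gaussian matrices with N columns."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((dB, N))
    C = rng.standard_normal((dC, N))
    return np.linalg.matrix_rank(khatri_rao(B, C))

# N <= dB * dC: full column rank, although each factor alone has rank at most 3.
rank_ok = generic_rank(dB=2, dC=3, N=6)
# N > dB * dC: only dB * dC rows are available, so the rank cannot reach N.
rank_short = generic_rank(dB=2, dC=3, N=7)
print(rank_ok, rank_short)
```

The second case shows why the condition is needed: the product has only `dB * dC` rows, so its column rank is capped at that number.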

However, since $A_1$ depends on $X$, we cannot apply eq. (4.6) directly to $G_1 = A_1 \circ X$. Instead, we apply eq. (4.6) for all (finitely many) possible activation patterns (appendix A), and obtain

Lemma 3.

For $L = 2$, if $N \le d_1 d_0$, then simultaneously for every $w$, $\operatorname{rank}(G_1) = \operatorname{rank}(A_1 \circ X) = N$, $(X, E_1)$-almost everywhere.

Combining Lemma 1 with Lemma 3, we immediately have

Theorem 4.

If $N \le d_1 d_0$, then all differentiable local minima of eq. (4.1) are global minima with $\mathrm{MSE} = 0$, $(X, E_1)$-almost everywhere.

Note that this result is tight, in the sense that the minimal hidden layer width $d_1 = \lceil N / d_0 \rceil$ is exactly the same minimal width which ensures a MNN can implement any dichotomy [3] for inputs in general position.

5 Multiple Hidden Layers

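We examine the implications of our approach for MNNs with more than one hidden layer. To find the DLMs of a general MNN, we again need to differentiate the MSE and equate it to zero. As in section 4, we exchange the order of expectation and differentiation, and use the fact that $a_1, \dots, a_{L-1}$ are piecewise constant. Differentiating near a DLM with respect to $w_l$, the vectorized version of $W_l$, we obtain

$$\nabla_{w_l} \mathrm{MSE} = \hat{\mathbb{E}}\left[e\, \nabla_{w_l} v_L\right] = 0 \tag{5.1}$$

To calculate $\nabla_{w_l} v_L$ for the $l$-th weight layer, we write its input $v_{l-1}$ and its back-propagated “delta” signal $\delta_l$ (without the error $e$), using the convention $\prod_{k=1}^{K} M_k = M_K M_{K-1} \cdots M_1$ for matrix products:

$$v_{l-1} = \left(\prod_{m=1}^{l-1} \operatorname{diag}(a_m) W_m\right) x, \qquad \delta_l \triangleq \operatorname{diag}(a_l)\, W_{l+1}^\top \operatorname{diag}(a_{l+1}) \cdots W_L^\top \operatorname{diag}(a_L) \tag{5.2}$$

where we keep in mind that $a_l$ are generally functions of the inputs and the weights. Using this notation we find

$$\nabla_{w_l} v_L = \delta_l \otimes v_{l-1} \tag{5.3}$$

Thus, defining

$$G_l \triangleq \Delta_l \circ V_{l-1}, \qquad \Delta_l \triangleq [\delta_l^{(1)}, \dots, \delta_l^{(N)}], \quad V_l \triangleq [v_l^{(1)}, \dots, v_l^{(N)}],$$

we can re-formulate eq. (5.1) as

$$G_l \mathbf{e} = 0 \tag{5.4}$$

similarly to eq. (4.5) in the previous section. Therefore, each weight layer provides as many linear constraints (rows) as the number of its parameters. We can also combine all the constraints and get

$$G \mathbf{e} = 0, \qquad G \triangleq [G_1^\top, \dots, G_L^\top]^\top \in \mathbb{R}^{\omega \times N} \tag{5.5}$$

in which we have $\omega$ constraints (rows) corresponding to all the parameters in the MNN. As in the previous section, if $\omega \ge N$ and $\operatorname{rank}(G) = N$ we must have $\mathbf{e} = 0$. However, it is generally difficult to find the rank of $G$, since we need to find whether different $G_l$ have linearly dependent rows. Therefore, we will focus on the last hidden layer and on the condition $\operatorname{rank}(G_{L-1}) = N$, which ensures $\mathbf{e} = 0$, from eq. (5.4). However, since $v_{L-2}$ depends on the weights, we cannot use our results from the previous section, and it is possible that $\operatorname{rank}(G_{L-1}) < N$. For example, when $w = 0$ and $L \ge 3$, we get $G = 0$, so we are at a differentiable critical point (note $G$ is well defined, even though all the activation inputs are zero), which is generally not a global minimum. Intuitively, such cases seem fragile, since if we give $w$ any random perturbation, one would expect that “typically” we would have $\operatorname{rank}(G_{L-1}) = N$. We establish this idea by first proving the following stronger result (appendix B),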

Theorem 5.

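For $N \le d_{L-2} d_{L-1}$ and fixed values of $W_1, \dots, W_{L-2}$, any differentiable local minimum of the MSE (eq. 3.2) as a function of $W_{L-1}$ and $W_L$ is also a global minimum, with $\mathrm{MSE} = 0$, $(X, E_1, \dots, E_{L-1})$-almost everywhere.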
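The per-layer gradient matrices and the fragile all-zero critical point discussed before Theorem 5 can be sketched numerically. The code below is an illustration under stated assumptions, not the authors' implementation: it uses a three-layer network with a linear output, a leaky ReLU slope of 0.1, uniform multiplicative noise, and row-by-row flattening of each weight matrix; all function names and sizes are hypothetical.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: column n equals kron(B[:, n], C[:, n])."""
    return np.einsum('in,jn->ijn', B, C).reshape(B.shape[0] * C.shape[0], -1)

def forward(weights, noise, X, slope=0.1):
    """Batch forward pass. Returns outputs [V_0, ..., V_L] and slopes [A_1, ..., A_L]."""
    V, A = [X], []
    for k, W in enumerate(weights):
        U = W @ V[-1]
        if k < len(weights) - 1:                  # hidden layer: noisy leaky ReLU slopes
            a = noise[k] * np.where(U >= 0, 1.0, slope)
        else:                                     # linear output layer
            a = np.ones_like(U)
        A.append(a)
        V.append(a * U)
    return V, A

def gradient_matrix(weights, noise, X, l):
    """G_l = Delta_l (Khatri-Rao) V_{l-1} for the 1-based weight layer l."""
    V, A = forward(weights, noise, X)
    delta = A[-1]                                 # delta_L = a_L
    for m in range(len(weights) - 1, l - 1, -1):  # delta_m = a_m * (W_{m+1}^T delta_{m+1})
        delta = A[m - 1] * (weights[m].T @ delta)
    return khatri_rao(delta, V[l - 1])

rng = np.random.default_rng(0)
dims, N = [3, 4, 3, 1], 8                         # L = 3 and N <= d_1 * d_2 = 12
X = rng.standard_normal((dims[0], N))
noise = [rng.uniform(0.5, 1.5, size=(d, N)) for d in dims[1:-1]]
weights = [rng.standard_normal((dims[k + 1], dims[k])) for k in range(3)]

# 1) Check G_2 against central finite differences of the network output.
l, h = 2, 1e-6
G2 = gradient_matrix(weights, noise, X, l)
G2_fd = np.zeros_like(G2)
d_out, d_in = weights[l - 1].shape
for i in range(d_out):
    for j in range(d_in):
        plus = [W.copy() for W in weights]
        minus = [W.copy() for W in weights]
        plus[l - 1][i, j] += h
        minus[l - 1][i, j] -= h
        diff = forward(plus, noise, X)[0][-1] - forward(minus, noise, X)[0][-1]
        G2_fd[i * d_in + j] = diff.ravel() / (2 * h)

# 2) All-zero weights: the gradient matrix of layer L-1 vanishes.
zeros = [np.zeros_like(W) for W in weights]
G_zero = gradient_matrix(zeros, noise, X, l)

# 3) A small random perturbation of the zero point restores full column rank.
perturbed = [1e-3 * rng.standard_normal(W.shape) for W in weights]
rank_perturbed = np.linalg.matrix_rank(gradient_matrix(perturbed, noise, X, l))
print(np.abs(G2 - G2_fd).max(), rank_perturbed)
```

The rank test is scale invariant, so it is unaffected by how small the perturbation is. This matches the intuition that a rank-deficient gradient matrix is a fragile situation.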

Theorem 5 means that for any (Lebesgue measurable) random set of weights of the first $L-2$ layers, every DLM with respect to the weights of the last two layers is also a global minimum with loss 0. Note that the condition $N \le d_{L-2} d_{L-1}$ implies that $W_{L-1}$ has more weights than the number of samples $N$ (a plausible scenario, e.g., [19]). In contrast, if instead we were only allowed to adjust the last layer of a random MNN, then low training error can only be ensured with extremely wide layers ($d_{L-1} \ge N$, as discussed in section 2), which require many more parameters ($d_{L-2} N$).

Theorem 5 can be easily extended to other types of neural networks, beyond the basic formalism introduced in section 3. For example, we can replace the layers below $L-2$ with convolutional layers, or other types of architectures. Additionally, the proof of Theorem 5 holds (with a trivial adjustment) when $E_1, \dots, E_{L-3}$ are fixed to have identical nonzero entries, that is, with dropout turned off except in the last two hidden layers. The result continues to hold even when $E_{L-2}$ is fixed as well, but then the condition $N \le d_{L-2} d_{L-1}$ has to be weakened to .

Next, we formalize our intuition above that DLMs of deep MNNs must have zero loss or be fragile, in the sense of the following immediate corollary of Theorem 5,

Corollary 6.

For $N \le d_{L-2} d_{L-1}$, let $w$ be a differentiable local minimum of the MSE (eq. 3.2). Consider a new weight vector $\tilde{w} = w + \delta w$, where $\delta w$ has i.i.d. Gaussian (or uniform) entries with arbitrarily small variance. Then, $(X, E_1, \dots, E_{L-1})$-almost everywhere and with probability 1 w.r.t. $\delta w$, if $\tilde{W}_1, \dots, \tilde{W}_{L-2}$ are held fixed, all differentiable local minima of the MSE as a function of $W_{L-1}$ and $W_L$ are also global minima, with $\mathrm{MSE} = 0$.

Note that this result is different from the classical notion of linear stability at differentiable critical points, which is based on the analysis of the eigenvalues of the Hessian $H$ of the MSE. The Hessian can be written as a symmetric block matrix, where each of its blocks $H_{ml}$ corresponds to layers $m$ and $l$. Specifically, using eq. (5.3), each block can be written as a sum of two components

(5.6)

where, for

(5.7)

while , and for . Combining all the blocks, we get

If we are at a DLM, then $H$ is positive semi-definite. If we examine again the differentiable critical point $w = 0$ and $L \ge 3$, we see that $H = 0$, so it is not a strict saddle. However, this point is fragile in the sense of Corollary 6.

Interestingly, the positive semi-definite nature of the Hessian at DLMs imposes additional constraints on the error. Note that the matrix is symmetric positive semi-definite of relatively small rank . However, can potentially be of high rank, and thus may have many negative eigenvalues (the trace of is zero, so the sum of all its eigenvalues is also zero). Therefore, intuitively, we expect that for to be positive semi-definite, has to become small, generically (i.e., except at some pathological points such as ). This is indeed observed empirically [7, Fig 1].

6 Numerical Experiments

Figure 6.1: Final training error (mean ± std) in the over-parametrized regime is low, as predicted by our results (right of the dashed black line). We trained standard MNNs with one or two hidden layers (with widths equal to ), a single output, (non-leaky) ReLU activations, MSE loss, and no dropout, on two datasets: (1) a synthetic random dataset in which , was drawn from a normal distribution , and with probability ; (2) binary classification (between digits and ) on sized subsets of the MNIST dataset [21]. The value at a data point is an average of the mean classification error (MCE) over 30 repetitions. In this figure, when the mean MCE reached zero, it was zero for all 30 repetitions.

Figure 6.2: The existence of differentiable local minima. In this representative figure, we trained a MNN with a single hidden layer, as in Fig. 6.1, with , on the synthetic random data ( ) until convergence with gradient descent (so each epoch is a gradient step). Then, starting from epoch 5000 (dashed line), we gradually decreased the learning rate (multiplying it by each epoch) until it was about . We see that the activation inputs converged to values above , while the final MSE was about . The magnitudes of these numbers, and the fact that all the neuronal inputs do not keep decreasing with the learning rate, indicate that we converged to a differentiable local minimum, with MSE equal to 0, as predicted.

In this section we numerically examine the main results of this paper, Theorems 4 and 5, which hold almost everywhere with respect to the Lebesgue measure over the data and dropout realization. However, without dropout, this analysis is not guaranteed to hold. For example, our results do not hold in MNNs where all the weights are negative, so the activation slopes have constant entries and therefore the gradient matrix cannot have full rank.

Nonetheless, if the activations are sufficiently “variable” (formally, the gradient matrix has full rank), then we expect our results to hold even without dropout noise and with the leaky ReLUs replaced with basic ReLUs ($s = 0$). We tested this numerically and present the result in Figure 6.1. We performed a binary classification task on a synthetic random dataset and subsets of the MNIST dataset, and show the mean classification error (MCE, which is the fraction of samples incorrectly classified), commonly used in these tasks. Note that the MNIST dataset, which contains some redundant information between training samples, is much easier (a lower error) than the completely random synthetic data. Thus the performance on the random data is more representative of the “typical worst case” (i.e., hard yet non-pathological input), which our smoothed analysis approach is aimed to uncover.

For one hidden layer, the error goes to zero when the number of non-redundant parameters is at least the number of samples ($d_0 d_1 \ge N$), as predicted by Theorem 4. Theorem 5 predicts a similar behavior when $d_1 d_2 \ge N$ for a MNN with two hidden layers (note we trained all the layers of the MNN). This prediction also seems to hold, but less tightly. This is reasonable, as our analysis in section 5 suggests that typically the error would be zero if the total number of parameters is larger than the number of training samples ($\omega \ge N$), though this was not proven. We note that in all the repetitions in Figure 6.1, for , the matrix always had full rank. However, for smaller MNNs than shown in Figure 6.1 (about ), sometimes did not have full rank.

Recall that Theorems 4 and 5 both give guarantees only on the training error at a DLM. However, for finite $N$, since the loss is non-differentiable at some points, it is not clear that such DLMs actually exist, or that we can converge to them. To check if this is indeed the case, we performed the following experiment. We trained the MNN for many epochs, using batch gradient steps. Then, we started to gradually decrease the learning rate. If we are at a DLM, then all the activation inputs should converge to a distinctly non-zero value, as demonstrated in Figure 6.2. In this figure, we tested a small MNN on synthetic data, and all the neural inputs seem to remain constant on a non-zero value, while the MSE keeps decreasing. This was the typical case in our experiments. However, in some instances, we would see some converge to a very low value (). This may indicate that convergence to non-differentiable points is possible as well.
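The two ingredients of this experiment, a learning rate that is held constant and then decayed geometrically, and a check that all activation inputs stay away from zero, can be sketched as below. This is a hedged illustration only: the function names, the decay factor, the base rate, and the tolerance are hypothetical placeholders, since the values used in the experiment are not recoverable from the text here.

```python
import numpy as np

def learning_rate(epoch, base_lr, decay_start, decay):
    """Constant rate until decay_start, then multiplied by `decay` once per epoch."""
    return base_lr * decay ** max(0, epoch - decay_start)

def min_abs_activation_input(W1, X):
    """Smallest |u| over all hidden units and samples of a one-hidden-layer MNN."""
    return np.min(np.abs(W1 @ X))

def looks_differentiable(W1, X, tol):
    """True if no activation input is within `tol` of the kink at zero."""
    return bool(min_abs_activation_input(W1, X) > tol)

# Toy weights and inputs (hypothetical values, chosen so the result is easy to verify).
W1 = np.array([[1.0, 0.0], [0.0, 2.0]])
X = np.array([[1.0, -3.0], [0.5, 2.0]])

lr_before = learning_rate(10, base_lr=0.1, decay_start=5000, decay=0.5)
lr_after = learning_rate(5002, base_lr=0.1, decay_start=5000, decay=0.5)
margin = min_abs_activation_input(W1, X)
print(lr_before, lr_after, margin, looks_differentiable(W1, X, tol=1e-3))
```

If the margin stays roughly constant while the learning rate shrinks by orders of magnitude, the iterates are not drifting toward a kink, which is the signature of a differentiable local minimum described above.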

Implementation details

Weights were initialized to be uniform with mean zero and variance , as suggested in [14]. In each epoch we randomly permuted the dataset and used the Adam [18] optimization method (a variant of SGD) with . In Figure 6.1 the training was done for no more than epochs (we stopped if was reached). Different learning rates and mini-batch sizes were selected for each dataset and architecture.
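A zero-mean uniform initialization with a prescribed variance can be sketched as follows. The only fact used is that a uniform distribution on $[-a, a]$ has variance $a^2/3$. The function name `uniform_init` and the variance value in the example are hypothetical; the value used in the experiments is not reproduced here.

```python
import numpy as np

def uniform_init(shape, variance, rng):
    """Zero-mean uniform weights on [-a, a], where a = sqrt(3 * variance)."""
    a = np.sqrt(3.0 * variance)
    return rng.uniform(-a, a, size=shape)

rng = np.random.default_rng(0)
target_variance = 0.02          # hypothetical value, for illustration only
W = uniform_init((400, 500), target_variance, rng)
print(W.mean(), W.var())
```

The empirical mean and variance of the sampled matrix should be close to zero and to the requested variance, respectively.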

7 Discussion

In this work we provided training error guarantees for mildly over-parameterized MNNs at all differentiable local minima (DLMs). For a single hidden layer (section 4), the proof is surprisingly simple. We show that the MSE near each DLM is locally similar to that of linear regression (i.e., a single linear neuron). This allows us to prove (Theorem 4) that, almost everywhere, if the number of non-redundant parameters ($d_0 d_1$) is larger than the number of samples $N$, then all DLMs are global minima with $\mathrm{MSE} = 0$, as in linear regression. With more than one hidden layer, Theorem 5 states that if $N \le d_{L-2} d_{L-1}$ (i.e., $W_{L-1}$ has more weights than $N$) then we can always perturb and fix some weights in the MNN so that all the DLMs would again be global minima with $\mathrm{MSE} = 0$.

Note that in a realistic setting, zero training error should not necessarily be the intended objective of training, since it may encourage overfitting. Our main goal here was to show that essentially all DLMs provide good training error (which is not trivial in a non-convex model). However, one can decrease the size of the model or artificially increase the number of samples (e.g., using data augmentation, or re-sampling the dropout noise) to be in a mildly under-parameterized regime, and have relatively small error, as seen in Figure 6.1. For example, in AlexNet [19] the weight layer $W_{L-1}$ has more weights than the number of training samples $N$, as required by Theorem 5. However, without data augmentation or dropout, AlexNet did exhibit severe overfitting.

Our analysis is non-asymptotic, relying on the fact that, near differentiable points, MNNs with piecewise linear activation functions can be differentiated similarly to linear MNNs [28]. We use a smoothed analysis approach, in which we examine the error of the MNN under slight random perturbations of worst-case input and dropout. Our experiments (Figure 6.1) suggest that our results describe the typical performance of MNNs, even without dropout. Note we do not claim that dropout has any merit in reducing the training loss in real datasets: as used in practice, dropout typically trades off the training performance in favor of improved generalization. Thus, the role of dropout in our results is purely theoretical. In particular, dropout ensures that the gradient matrix (eq. (5.4)) has full column rank. It would be an interesting direction for future work to find other sufficient conditions for this matrix to have full column rank.

Many other directions remain for future work. For example, we believe it should be possible to extend this work to multi-output MNNs and/or other convex loss functions besides the quadratic loss. Our results might also be extended to stable non-differentiable critical points (which may exist, see section 6) using the necessary condition that the sub-gradient set contains zero at any critical point [26]. Another important direction is improving the results of Theorem 5, so that they make efficient use of all the parameters of the MNN, and not just the last two weight layers. Such results might be used as a guideline for architecture design, when training error is a major bottleneck [13]. Last, but not least, in this work we focused on the empirical risk (training error) at DLMs. Such guarantees might be combined with generalization guarantees (e.g., [12]), to obtain novel excess risk bounds that go beyond uniform convergence analysis.

Acknowledgments

The authors are grateful to O. Barak, D. Carmon, Y. Han, Y. Harel, R. Meir, E. Meirom, L. Paninski, R. Rubin, M. Stern, U. Sümbül and A. Wolf for helpful discussions. The research was partially supported by the Gruss Lipper Charitable Foundation, and by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/ Interior Business Center (DoI/IBC) contract number D16PC00003. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/IBC, or the U.S. Government.

References

Appendix A Single hidden layer — proof of Lemma 3

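We prove Lemma 3, using the notation and results from section 4.

Lemma.

For $L = 2$, if $N \le d_1 d_0$, then simultaneously for every $w$, $\operatorname{rank}(G_1) = \operatorname{rank}(A_1 \circ X) = N$, $(X, E_1)$-almost everywhere.

Proof.

We fix an activation pattern $\tilde{A}_1 \in \{s, 1\}^{d_1 \times N}$ and set $\tilde{G}_1 \triangleq (E_1 \odot \tilde{A}_1) \circ X$, where $\odot$ denotes the entrywise product. We apply eq. (4.6) to conclude that $\operatorname{rank}(\tilde{G}_1) = N$, $(X, E_1 \odot \tilde{A}_1)$-a.e., and hence also $(X, E_1)$-a.e. We repeat the argument for all $2^{d_1 N}$ values of $\tilde{A}_1$, and use Fact 7. We conclude that $\operatorname{rank}(\tilde{G}_1) = N$ for all $\tilde{A}_1$ simultaneously, $(X, E_1)$-a.e. Since for every set of weights we have $G_1 = \tilde{G}_1$ for some $\tilde{A}_1$, we have $\operatorname{rank}(G_1) = N$, $(X, E_1)$-a.e. ∎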

Appendix B Multiple Hidden Layers — proof of theorem 5

First we prove the following helpful Lemma, using a technique similar to that of [1].

Lemma 8.

Let $M(\theta) \in \mathbb{R}^{a \times b}$ be a matrix with $a \ge b$, with entries that are all polynomial functions of some vector $\theta$. Also, we assume that for some value $\theta_0$, we have $\operatorname{rank}(M(\theta_0)) = b$. Then, for almost every $\theta$, we have $\operatorname{rank}(M(\theta)) = b$.

Proof.

There exists a polynomial mapping $p$ such that $M(\theta)$ does not have full column rank if and only if $p(\theta) = 0$. Since $a \ge b$, we can construct $p$ explicitly as the sum of the squares of the determinants of all possible different subsets of $b$ rows from $M(\theta)$. Since $p(\theta_0) \neq 0$, we find that $p$ is not identically equal to zero. Therefore, the zeros of such a (“proper”) polynomial, in which $p \not\equiv 0$, are a set of measure zero. ∎
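The polynomial used in this proof can be written out directly. The sketch below is an illustration with hypothetical names and toy matrices: it sums the squared determinants of all square submatrices formed by choosing as many rows as there are columns. By the Cauchy-Binet formula this sum equals $\det(M^\top M)$, so it vanishes exactly when the matrix does not have full column rank.

```python
from itertools import combinations
import numpy as np

def rank_polynomial(M):
    """Sum of squared determinants of all square submatrices built from rows of M."""
    rows, cols = M.shape
    return sum(np.linalg.det(M[list(idx), :]) ** 2
               for idx in combinations(range(rows), cols))

full = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])       # full column rank
deficient = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])  # second column = 2 * first

p_full = rank_polynomial(full)
p_deficient = rank_polynomial(deficient)
print(p_full, p_deficient)
```

Since every entry of the matrix is a polynomial in the underlying vector, so is this sum, which is what allows the measure-zero argument to go through.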

Next we prove Theorem 5, using the previous notation and the results from section (5):

Theorem.

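For $N \le d_{L-2} d_{L-1}$ and fixed values of $W_1, \dots, W_{L-2}$, any differentiable local minimum of the MSE (eq. 3.2) as a function of $W_{L-1}$ and $W_L$ is also a global minimum, with $\mathrm{MSE} = 0$, $(X, E_1, \dots, E_{L-1})$-almost everywhere.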

Proof.

Without loss of generality, assume , since we can absorb the weights of the last layer into the weight layer, as we did in single hidden layer case (eq. (4.2)). Fix an activation pattern as defined in the beginning of this appendix. Set

(B.1)

and

(B.2)

Note that, since the activation pattern is fixed, the entries of are polynomials in the entries of , and we may therefore apply Lemma 8 to . Thus, to establish -a.e., we only need to exhibit a single for which . We note that for a fixed activation pattern, we can obtain any value of with some choice of , so we will specify directly. We make the following choices:

(B.3)
(B.4)
(B.5)

where (respectively,

) denotes an all ones (zeros) matrix of dimensions

, denotes the identity matrix, and denotes a matrix composed of the first columns of . It is easy to verify that with this choice, we have for any , and so and

(B.6)

which obviously satisfies . We conclude that , -a.e., and remark this argument proves Fact 2, if we specialize to .

As we did in the proof of Lemma 3, we apply the above argument for all values of , and conclude via Fact 7 that for every , -a.e.. Since for every , for some which depends on , this implies that, -a.e., simultaneously for all values of . Thus in any DLM of the MSE, with all weights except fixed, we can use eq. 5.4 (), and get . ∎