Bad Global Minima Exist and SGD Can Reach Them

by Shengchao Liu et al.

Several recent works have aimed to explain why severely overparameterized models generalize well when trained by Stochastic Gradient Descent (SGD). The emergent consensus explanation has two parts: the first is that there are "no bad local minima", while the second is that SGD performs implicit regularization by having a bias towards low-complexity models. We revisit both of these ideas in the context of image classification with common deep neural network architectures. Our first finding is that there exist bad global minima, i.e., models that fit the training set perfectly, yet have poor generalization. Our second finding is that given only unlabeled training data, we can easily construct initializations that will cause SGD to quickly converge to such bad global minima. For example, on CIFAR, CINIC10, and (Restricted) ImageNet, this can be achieved by starting SGD at a model derived by fitting random labels on the training data: while subsequent SGD training (with the correct labels) will reach zero training error, the resulting model will exhibit a test accuracy degradation of up to 40%. Finally, we show that regularization seems to provide SGD with an escape route: once heuristics such as data augmentation are used, starting from a complex model (adversarial initialization) has no effect on the test accuracy.
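The adversarial-initialization procedure described above can be sketched in a few lines. The following is a minimal, hedged illustration using a toy linear classifier and synthetic data (the paper uses deep networks on CIFAR, CINIC10, and Restricted ImageNet); the two-phase structure is the point: first run SGD against random labels, then continue SGD from that point with the correct labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, linearly separable binary data (a toy stand-in for CIFAR etc.)
X = rng.normal(size=(200, 10))
w_star = rng.normal(size=10)
y_true = (X @ w_star > 0).astype(float)
y_random = rng.integers(0, 2, size=200).astype(float)  # random labels

def sigmoid(z):
    # Clip for numerical stability of exp.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def sgd(w, X, y, lr=0.5, epochs=500):
    """Full-batch gradient descent on the logistic loss (SGD stand-in)."""
    for _ in range(epochs):
        p = sigmoid(X @ w)
        w = w - lr * X.T @ (p - y) / len(y)
    return w

# Phase 1 ("adversarial initialization"): fit the random labels first,
# using only the unlabeled inputs plus labels drawn at random.
w = sgd(np.zeros(10), X, y_random)

# Phase 2: continue SGD from that initialization with the correct labels.
w = sgd(w, X, y_true)

train_acc = ((sigmoid(X @ w) > 0.5) == y_true).mean()
print(f"training accuracy after phase 2: {train_acc:.2f}")
```

This toy model is too simple to exhibit the generalization gap reported in the paper (a linear model cannot memorize random labels the way an overparameterized network can); it only demonstrates the training pipeline: SGD restarted from the random-label solution still drives training error down on the true labels.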


