Entropic gradient descent algorithms and wide flat minima

The properties of flat minima in the empirical risk landscape of neural networks have been debated for some time. Increasing evidence suggests they possess better generalization capabilities with respect to sharp ones. First, we discuss Gaussian mixture classification models and show analytically that there exist Bayes optimal pointwise estimators which correspond to minimizers belonging to wide flat regions. These estimators can be found by applying maximum flatness algorithms either directly on the classifier (which is norm independent) or on the differentiable loss function used in learning. Next, we extend the analysis to the deep learning scenario by extensive numerical validations. Using two algorithms, Entropy-SGD and Replicated-SGD, that explicitly include in the optimization objective a non-local flatness measure known as local entropy, we consistently improve the generalization error for common architectures (e.g. ResNet, EfficientNet). An easy to compute flatness measure shows a clear correlation with test accuracy.


1 Introduction

The geometrical structure of the loss landscape of neural networks has been a key topic of study for several decades hochreiter; keskar. One area of ongoing research is the connection between the flatness of minima found by optimization algorithms like stochastic gradient descent (SGD) and the generalization performance of the network baldassi2019shaping; keskar. There are open conceptual problems in this context. On the one hand, there is accumulating evidence that flatness is a good predictor of generalization jiang2019fantastic. On the other hand, modern deep networks using ReLU activations are invariant in their outputs with respect to rescaling of weights in different layers Dinh2017, which makes the mathematical picture complicated (footnote 1: we note, in passing, that an appropriate framework for theoretical studies would be to consider networks with binary weights, for which most ambiguities are absent). General results are lacking. Some initial progress has been made in connecting PAC-Bayes bounds for the generalization gap with flatness dziugaite2.

The purpose of this work is to shed light on the connection between flatness and generalization by using methods and algorithms from the statistical physics of disordered systems, and to corroborate the results with a performance study on state-of-the-art deep architectures.

Methods from statistical physics have led to several results in recent years. First, wide flat minimizers have been shown to be a structural property of shallow networks. They exist even when training on random data and are accessible by relatively simple algorithms, even though they coexist with exponentially more numerous minima locentfirst; unreasoanable; baldassi2019shaping. We believe this to be an overlooked property of neural networks, which makes them particularly suited for learning. In analytically tractable settings, it has been shown that flatness correlates with the choice of the loss function, with the choice of the activation functions and with generalization baldassi2019shaping; relu_locent.

We employ a notion of flatness referred to as Local Entropy locentfirst; unreasoanable. It measures the low-loss volume in the weight space around a minimizer. This framework has been used to introduce a variety of learning algorithms, which we call entropic algorithms in this paper, that focus their search on flat regions unreasoanable; entropysgd; parle.

In what follows we make a substantial step in connecting flatness and generalization by providing both analytical and state-of-the-art numerical results.

• In the case of a Gaussian mixture classification with shallow networks, we show analytically that the minimum norm condition, which needs to be imposed during the learning phase in order to reach the Bayes optimal performance, corresponds to solutions of maximum local entropy for the classifier (which is norm invariant). These solutions can be found by entropic algorithms acting on the learning loss function.

• We systematically apply two entropic algorithms, Entropy-SGD (eSGD) and Replicated-SGD (rSGD), to state-of-the-art deep architectures. With little or no hyperparameter tuning, we achieve improved generalization performance and show that it is correlated with a computable measure of flatness.

2 Related work

The idea of using the flatness of a minimum of the loss function, also called the fatness of the posterior and the local area estimate of quality, for evaluating different minimizers is several decades old hochreiter; Hinton1993; buntine1991bayesian. These works connect the flatness of a minimum to information theoretical concepts like the minimum description length of its minimizer: flatter minima correspond to minimizers that can be encoded using fewer bits. For neural networks, a recent empirical study keskar shows that large-batch methods find sharp minima while small-batch ones find flatter ones, with a positive effect on generalization performance.

PAC-Bayes bounds can be used for deriving generalization bounds for neural networks orbanz. In dziugaite1, a method for optimizing the PAC-Bayes bound directly is introduced and the authors note similarities between the resulting objective function and an objective function that searches for flat minima. This connection is further analyzed in dziugaite2. In jiang2019fantastic, the authors present a large-scale empirical study of the correlation between different complexity measures of neural networks and their generalization performance. The authors conclude that PAC-Bayes bounds and flatness measures are the most predictive measures of generalization.

The concept of local entropy has been introduced in the context of a statistical mechanics approach to machine learning for discrete neural networks in locentfirst, and subsequently extended to models with continuous weights. The general definition of the local entropy loss $\mathcal{L}_{LE}$ for a system in a given configuration $w$ (a vector of size $N$) can be given in terms of any common (usually, data-dependent) loss $\mathcal{L}$ as:

$$\mathcal{L}_{LE}(w) = -\frac{1}{\beta} \log \int dw' \, e^{-\beta \mathcal{L}(w') - \beta\gamma\, d(w', w)}. \qquad (1)$$

The function $d$ measures a distance and is commonly taken to be the squared norm of the difference of the configurations $w$ and $w'$:

$$d(w', w) = \frac{1}{2} \sum_{i=1}^{N} (w'_i - w_i)^2 \qquad (2)$$

The integral is performed over all possible configurations $w'$; for discrete systems, it can be substituted by a sum. The two parameters $\beta$ and $\gamma$ are Legendre conjugates of the loss and the distance. For large systems, $N \gg 1$, their effect is to jointly restrict the integral to configurations below a certain loss and within a certain distance from the reference configuration $w$. In general, increasing $\beta$ lowers that loss threshold and increasing $\gamma$ shrinks that distance.

The interpretation of this quantity is that it computes the log-volume of the configurations in a region of a given size around $w$ whose loss does not exceed a given threshold. Compared to the original loss $\mathcal{L}$, it can be interpreted as a Gaussian smoothing, or a site-dependent regularization. It also has the interpretation of a non-local measure of flatness, since, when the region is large, configurations with small $\mathcal{L}_{LE}$ must lie in the middle of extensive regions in which a large fraction of the configurations have small loss $\mathcal{L}$.
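As a purely illustrative sketch of eq. (1), the local entropy loss can be evaluated by brute-force numerical integration for a one-dimensional configuration. The toy loss below (two zero-loss minima, one sharp and one wide), the grid and the values of the two parameters are all assumptions made for this example and do not come from the experiments in this paper.

```python
import numpy as np

def local_entropy_loss(loss_fn, w, beta, gamma, grid):
    """Evaluate eq. (1) for a scalar w by summing over a uniform grid of w'."""
    dw = grid[1] - grid[0]
    # Exponent of the integrand, with d(w', w) = 0.5 * (w' - w)^2 as in eq. (2).
    exponent = -beta * loss_fn(grid) - beta * gamma * 0.5 * (grid - w) ** 2
    shift = exponent.max()  # log-sum-exp shift for numerical stability
    log_integral = shift + np.log(np.sum(np.exp(exponent - shift)) * dw)
    return -log_integral / beta

def toy_loss(w):
    # Hypothetical loss: a sharp minimum at w = -2 and a wide one at w = +2,
    # both with loss exactly zero.
    return np.minimum(50.0 * (w + 2.0) ** 2, 0.5 * (w - 2.0) ** 2)

grid = np.linspace(-10.0, 10.0, 20001)
le_sharp = local_entropy_loss(toy_loss, -2.0, beta=1.0, gamma=1.0, grid=grid)
le_wide = local_entropy_loss(toy_loss, 2.0, beta=1.0, gamma=1.0, grid=grid)
print(f"local entropy loss: sharp={le_sharp:.3f}, wide={le_wide:.3f}")
```

Although the plain loss cannot distinguish the two minima, the local entropy loss is lower at the wide one, which is the sense in which it acts as a non-local flatness measure.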

In a neural network that performs a classification task, the most natural choice for $\mathcal{L}$ in eq. (1) is the training error. This is the definition that has been used in detailed analytical studies on relatively tractable shallow networks accompanied by numerical experiments, where indeed the local entropy has been shown to correlate with generalization error and eigenvalues of the Hessian locentfirst; baldassi2019shaping. Another interesting finding is that the cross-entropy loss baldassi2019shaping and ReLU transfer functions relu_locent, which have become the de-facto standard for neural networks, tend to bias the models towards high local entropy regions (computed based on the error loss).

Using the local entropy as an objective is in general computationally intractable. However, it can be approximated to derive general algorithmic schemes. Replicated stochastic gradient descent (rSGD) replaces the local entropy objective by an objective involving several replicas of the model, each one moving in the potential induced by the loss while also attracting each other. The method was introduced in unreasoanable, but only demonstrated on shallow networks. The rSGD algorithm is very closely related to Elastic Averaging SGD (EASGD), presented in elastic, even though the latter was motivated purely by the idea of enabling massively parallel training and had no theoretical basis. The main distinguishing feature of rSGD compared to EASGD when applied to deep networks is the focusing procedure, by which $\gamma$ is gradually increased, discussed in more detail below. Another difference is that in rSGD there is no explicit master replica.

Entropy-SGD (eSGD), introduced in entropysgd, is a method that directly optimizes the local entropy using stochastic gradient Langevin dynamics (SGLD) welling2011bayesian. While the goal of this method is the same as that of rSGD, the optimization technique involves a double loop instead of replicas. Parle parle combines eSGD and EASGD (with added focusing) to obtain a distributed algorithm that also shows excellent generalization performance, consistent with the results obtained in this work.

3 Analytical Results on shallow networks

The relation between local entropy, flatness and generalization properties has been investigated theoretically and numerically for several models in locentfirst; baldassi2019shaping; relu_locent. So far, the theoretical results were limited to the so-called teacher-student scenario in the context of classification: a training set with i.i.d. randomly-generated inputs and labels provided by a (shallow) teacher network is presented to a student network with the same architecture as the teacher. In the under-parametrized regime, in which the training set does not contain sufficient information, several local minima exist, and the ones with high local entropy were shown to have almost-Bayesian generalization capabilities.

The under-parametrized teacher-student scenario considered in the above-mentioned studies is highly non-convex, and using random i.i.d inputs is not particularly realistic. Although it was shown in baldassi2019shaping that the phenomenology is similar with real datasets, the problem of obtaining theoretical insight into other classification tasks with different distributions remains open.

Here, we confirm the general scenario in a very simple model often used in high-dimensional statistical machine learning mai2019high; Lelarge2019; deng2019model; Lesieur2016: Gaussian mixtures. The generative model for this task is as follows: for a given problem size $N$, an $N$-dimensional vector $v^\star$ is randomly generated from a standard multivariate normal $\mathcal{N}(0, I_N)$; then, two classes of patterns are generated, with labels $\sigma = \pm 1$, each class being distributed as $\mathcal{N}(\sigma v^\star/\sqrt{N}, \Delta I_N)$. We call $P$ the size of the training set. Here, for simplicity, we will restrict ourselves to the case in which the two classes are balanced.

The performance of a linear classifier (a single-unit neural network, a.k.a. a perceptron) on this model has been studied recently in Mignacco2020, in particular in the case in which the network is trained using the mean square error (MSE) loss. This system is prone to overfitting, especially when the number of training patterns is close to $N$. In Mignacco2020 it was shown that generalization performance is improved by introducing an $L_2$ regularization controlled via a positive parameter $\lambda$, and that, in the balanced case, optimal performance is achieved in the limit $\lambda \to \infty$. In the large-$N$ limit, adding the regularization term is equivalent to fixing the norm of the weights.
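The effect of the regularization strength can be illustrated with a small numerical sketch. It assumes a bias-free linear classifier, a specific normalization of the regularized MSE objective (stated in the docstring) and arbitrary problem sizes; none of these choices are taken from Mignacco2020. It only checks two linear-algebra facts: for very large regularization the minimizer shrinks and its direction approaches the label-weighted sum of the patterns, and rescaling the weights leaves the predicted labels unchanged.

```python
import numpy as np

def ridge_perceptron(X, labels, lam):
    """Minimizer of 0.5 * sum_mu (labels_mu - w . x_mu)^2 + 0.5 * lam * |w|^2.

    The bias is omitted and the objective normalization is an assumption
    made for this sketch.
    """
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ labels)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Small balanced two-cluster problem (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
N, P = 50, 60
v_star = rng.standard_normal(N)
labels = rng.choice([-1.0, 1.0], size=P)
X = labels[:, None] * v_star[None, :] / np.sqrt(N) + rng.standard_normal((P, N))

w_small = ridge_perceptron(X, labels, lam=1e-3)   # weak regularization
w_large = ridge_perceptron(X, labels, lam=1e6)    # strong regularization
w_sum = X.T @ labels                              # label-weighted sum of patterns

print("cosine(w_large, w_sum) =", cosine(w_large, w_sum))
print("norms:", np.linalg.norm(w_small), np.linalg.norm(w_large))
```

Since only the sign of the output matters for classification, the small norm of the strongly regularized solution is irrelevant at test time; what changes with the regularization is the direction of the weight vector.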

This problem is rather peculiar when compared to typical classification tasks performed with neural networks, since the training loss is convex. Indeed, Bayes-optimal performance can be achieved with a single configuration (instead of requiring a distribution) and can be easily found analytically. Additionally, the task is impossible, in the sense that no classifier can achieve zero test error (in the teacher-student context this would be similar to the case of having a "noisy", unreliable teacher).

It is interesting to consider that the output of the network (and thus the generalization error) is independent of the norm. On one hand, this is also true for most deep neural network models that use ReLU activations in the intermediate layers and an argmax operation to produce the output label, and are therefore invariant to uniform scaling of all their weights and biases. On the other hand, this shows that the norm is only relevant due to the choice of the loss, which is often only used as a continuous relaxation of the classification error. In light of this, the norm cannot affect the generalization capabilities of the network, and it thus seems unlikely that a norm-based regularization could be a valid general strategy (footnote 2: there is a caveat to this statement: for particular choices of the loss, e.g. cross-entropy, it is possible to reparametrize the problem in an invariant way and interpret the norm in terms of a time-evolving parameter of the loss with a similar role to the focusing procedure discussed below, see e.g. baldassi2019shaping).

We have performed a replica-theory calculation (detailed in the SM) for this model, in which we have studied analytically the solutions found by optimizing the regularized MSE loss. In particular, we have explored the normalized local entropy landscape of these configurations (defined below) in the space of the training error. We stress that by using the error instead of the MSE we explore the properties of the model in the regime in which it is applied. Furthermore, we can freely renormalize all the configurations and simplify the analysis.

In this case the normalized local entropy around a given (normalized) configuration $w^*$ measures the logarithm of the fraction of configurations whose training error is smaller than or equal to that of the reference in a volume within a given squared distance $d$ around $w^*$. More precisely, we computed:

$$\Phi_{LE}(\lambda, d) = \lim_{N\to\infty} \mathbb{E}_{v^\star} \mathbb{E}_{\sigma} \mathbb{E}_{\xi} \log \frac{\int_{S^N} dw' \, \Theta\big(\varepsilon(w^*) - \varepsilon(w')\big)\, \Theta\big(d - d(w', w^*)\big)}{\int_{S^N} dw' \, \Theta\big(d - d(w', w^*)\big)} \qquad (3)$$

where $w^*$ is the normalized minimizer of the $\lambda$-regularized MSE loss, $\varepsilon(w)$ is the training error, and $\Theta(x)$ is the Heaviside step function, $\Theta(x) = 1$ if $x \ge 0$ and $0$ otherwise. The distance parameter $d$ takes values in a bounded interval and is essentially the Legendre transform of $\gamma$ of eq. (1). In this definition we also used the training error $\varepsilon$ instead of a generic loss $\mathcal{L}$, and a hard cutoff on the loss instead of the soft constraint set by $\beta$, to make the formula more explicit. The denominator is the volume of configurations within squared distance $d$, and the domain of integration $S^N$ is the unit sphere in $N$ dimensions. The expectations provide the average behavior on the whole distribution of the generative model, which with high probability is the same as the behavior of any one random instance for large $N$.

Due to the $\log$ and the normalization term, $\Phi_{LE}$ is upper-bounded by zero and always zero at $d = 0$. For sharp minima, it is expected to drop rapidly with $d$, whereas for flat regions it is expected to stay very close to zero at least within some range. Some representative results are shown in fig. 1 (left panel), and they confirm that the configurations that generalize better (which for this model are those that have been obtained with the largest regularization parameter $\lambda$) have generally higher local entropy curves, i.e. they lie in the middle of fairly dense regions of good configurations, a.k.a. wide flat minima. The SM contains the full derivation, and additional results that show that the same general scenario holds for different values of the parameters. It also shows that a reasonable alternative choice for the cutoff could be used in the definition and lead to analogous conclusions.

These results thus confirm that the local entropy landscape constructed using the training error is a good predictor of generalization performance. However, when dealing with much more complex architectures, using the training error as the loss function in eq. (1) is not (yet) algorithmically feasible. In particular, the entropic algorithms rSGD and eSGD must still operate on a differentiable loss. This leaves open the question of whether targeting high-local-entropy regions in a differentiable loss landscape can still lead to good generalization results. We have investigated this question analytically on the Gaussian mixture model with a linear classifier and the MSE loss, using the same technique explained in baldassi2019shaping; relu_locent (details in the SM). This amounts to studying the generalization error of the barycenter of a replicated system of $y$ classifiers, each with its own parameters $w^a$ with $a = 1, \dots, y$, each optimizing the MSE under constraints on their norms and on their mutual angles. The barycenter is defined as $\bar{w} = \frac{1}{y} \sum_{a=1}^{y} w^a$. Due to the peculiarities of this model, we are interested in whether it is aligned with the solution of the norm-regularized model with large $\lambda$. In this analysis we used the angle rather than the distance in order to compare situations with different norms (for normalized configurations the angle determines the distance $d$ used previously). Our results indicate that, with sufficiently many replicas (even a small number) and with sufficiently large angles, the generalization performance is nearly optimal and the dependence on the norm is mild, and much less pronounced than at small angles (the limit of zero angles reproduces the results of the norm-regularized analysis without replicas). Some representative results are shown in fig. 1 (right panel) and are confirmed by numerical experiments. The fact that for this model the best results are obtained with widely separated replicas is due to the convex nature of the problem, and we do not expect this phenomenon to carry over to the non-convex landscapes of deep neural networks.

4 Numerical experiments on deep networks

4.1 Entropic algorithms

For our numerical experiments on deep networks we have used two entropic algorithms, rSGD and eSGD, mentioned in the introduction. They both approximately optimize the local entropy as defined in eq. (1), for which an exact evaluation of the integral is intractable. The two algorithms employ different but related approximation strategies, as detailed below. Our aim is to explore the characteristics of these algorithms on difficult datasets and state-of-the-art networks, comparing their performance with each other and with standard SGD. We also investigate the relationship between their generalization properties and the flatness of the minima that they produce.

Entropy-SGD.

Entropy-SGD (eSGD), introduced in entropysgd, minimizes the local entropy loss (1) by approximate evaluations of its gradient. The gradient can be expressed as

$$\nabla \mathcal{L}_{LE}(w) = \gamma \left( w - \langle w' \rangle \right) \qquad (4)$$

where $\langle \cdot \rangle$ denotes the expectation over the measure $Z^{-1} e^{-\beta \mathcal{L}(w') - \beta\gamma\, d(w', w)}$, where $Z$ is a normalization factor. The eSGD strategy is to approximate $\langle w' \rangle$ (which implicitly depends on $w$) using a fixed number of steps of stochastic gradient Langevin dynamics (SGLD). The resulting double-loop algorithm is presented as Algorithm 1. The noise parameter in the algorithm is linked to the inverse temperature $\beta$ by the usual Langevin relation. In practice we always set it to a small value, as in entropysgd. In particular limits of its hyper-parameters, eSGD approximately computes a proximal operator chaudhari2018deep or reduces to the recently introduced Lookahead optimizer zhang2019lookahead.
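The double-loop structure implied by eq. (4) can be sketched as follows. This is a minimal illustration, not a transcription of Algorithm 1: the function name `entropy_sgd_step`, the exponential running average used to estimate the inner expectation, the step sizes and the toy quadratic loss are all assumptions, and full gradients stand in for mini-batch gradients.

```python
import numpy as np

def entropy_sgd_step(grad_fn, w, gamma, lr_outer, lr_inner, n_inner, noise, rng, avg=0.75):
    """One outer step: estimate <w'> with Langevin-like inner steps, then apply eq. (4)."""
    w_prime = w.copy()
    mean_w_prime = w.copy()  # running estimate of <w'>
    for _ in range(n_inner):
        # Gradient of the inner energy L(w') + gamma * d(w', w), with d as in eq. (2).
        drift = grad_fn(w_prime) + gamma * (w_prime - w)
        kick = np.sqrt(lr_inner) * noise * rng.standard_normal(w.shape)
        w_prime = w_prime - lr_inner * drift + kick
        # Exponential moving average (an assumed choice of estimator).
        mean_w_prime = avg * mean_w_prime + (1.0 - avg) * w_prime
    # Outer update along the local entropy gradient gamma * (w - <w'>).
    return w - lr_outer * gamma * (w - mean_w_prime)

# Toy usage on a quadratic loss 0.5 * |w - target|^2 (hypothetical example).
target = np.array([1.0, -2.0, 0.5])
grad_fn = lambda w: w - target
rng = np.random.default_rng(0)
w = np.zeros(3)
for _ in range(50):
    w = entropy_sgd_step(grad_fn, w, gamma=1.0, lr_outer=1.0,
                         lr_inner=0.1, n_inner=20, noise=0.0, rng=rng)
print(w)
```

With the noise switched off, as in this usage example, the inner loop is plain gradient descent on the coupled energy, which makes the run deterministic; a nonzero `noise` turns it into a Langevin sampler.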

Replicated-SGD.

Replicated-SGD (rSGD) consists in a replicated version of the usual stochastic gradient (SGD) method. In rSGD, a number $y$ of replicas of the same system, each with its own parameters $w^a$ where $a = 1, \dots, y$, are trained in parallel for a number of iterations, interacting through an attractive term with their center of mass. As detailed in unreasoanable; baldassi2019shaping, the replicated system trained with a stochastic algorithm such as SGD collectively explores an approximation of the local entropy landscape, and the replication bypasses the need to explicitly estimate the integral in eq. (1). In principle, the larger $y$ the better the approximation, but the effect of the replication is already significant with a few replicas. To summarize, rSGD replaces the local entropy (1) with the replicated loss $\mathcal{L}_R$:

$$\mathcal{L}_R(\{w^a\}_a) = \sum_{a=1}^{y} \mathcal{L}(w^a) + \gamma \sum_{a=1}^{y} d(w^a, \bar{w}) \qquad (5)$$

Here, $\bar{w}$ is a center replica defined as $\bar{w} = \frac{1}{y} \sum_{a=1}^{y} w^a$. The algorithm is presented as Algorithm 2. Any of the replicas or the center can be used after training as the resulting model for inference. This procedure is parallelizable over the replicas, so that wall-clock time for training is comparable to SGD, excluding the communication, which happens periodically after a fixed number of parallel optimization steps. In order to decouple the communication period and the coupling hyperparameter $\gamma$, the coupling strength is rescaled with the communication period. In our experiments, we did not observe any degradation in generalization performance for the communication periods we tested.
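A minimal sketch of the replicated objective of eq. (5) and of one synchronous update is given below. It assumes that the center is the plain average of the replicas, that all replicas communicate at every step, and that full gradients replace mini-batch gradients; the function names and the toy quadratic loss are hypothetical and this is not Algorithm 2 itself.

```python
import numpy as np

def replicated_loss(loss_fn, replicas, gamma):
    """Eq. (5) with d as in eq. (2): sum of losses plus coupling to the center."""
    center = replicas.mean(axis=0)
    data_term = sum(loss_fn(w) for w in replicas)
    coupling = 0.5 * gamma * np.sum((replicas - center) ** 2)
    return data_term + coupling

def rsgd_step(grad_fn, replicas, gamma, lr):
    """One gradient step on eq. (5) for every replica (rows of `replicas`)."""
    center = replicas.mean(axis=0)
    grads = np.stack([grad_fn(w) for w in replicas])
    # The gradient of the coupling term with respect to replica a is gamma * (w^a - center).
    return replicas - lr * (grads + gamma * (replicas - center))

# Toy usage on a quadratic loss (hypothetical example).
target = np.array([1.0, -2.0, 0.5])
loss_fn = lambda w: 0.5 * float(np.sum((w - target) ** 2))
grad_fn = lambda w: w - target
rng = np.random.default_rng(0)
replicas = rng.standard_normal((3, 3))  # 3 replicas, 3 parameters each
for _ in range(300):
    replicas = rsgd_step(grad_fn, replicas, gamma=1.0, lr=0.1)
center = replicas.mean(axis=0)
print(center, replicated_loss(loss_fn, replicas, gamma=1.0))
```

In a distributed setting each replica would run on its own worker and only the center would be exchanged, which is the part this single-process sketch does not model.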

Focusing.

A common feature of both algorithms is that the parameter $\gamma$ in the objective changes during the optimization process. We start with a small $\gamma$ (targeting large regions and allowing a wider exploration of the landscape) and gradually increase it. We call this process focusing. Focusing improves the dynamics by driving the system quickly to wide regions and then, once there, gradually trading off the width in order to get to the minima of the loss within those regions, see baldassi_local_2016; unreasoanable. We adopt an exponential schedule for $\gamma$ as a function of the training epoch. For rSGD, we fix its initial value by balancing the distance and the data term in the objective before training starts. The growth rate is chosen such that $\gamma$ increases by a fixed overall factor during training. For eSGD, we were unable to find a criterion that worked for all experiments and manually tuned it.
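An exponential focusing schedule can be sketched as below. The parameterization $\gamma_\tau = \gamma_0 (1 + \gamma_1)^\tau$, the function name and the numerical values are assumptions chosen for illustration; the growth rate is derived from the desired overall increase factor.

```python
import numpy as np

def focusing_schedule(gamma0, total_factor, n_epochs):
    """Exponential schedule gamma_tau = gamma0 * (1 + gamma1)**tau.

    gamma1 is set so that gamma grows by `total_factor` between the first
    and the last epoch (assumed parameterization, requires n_epochs >= 2).
    """
    gamma1 = total_factor ** (1.0 / (n_epochs - 1)) - 1.0
    epochs = np.arange(n_epochs)
    return gamma0 * (1.0 + gamma1) ** epochs

# Hypothetical values: start at 1e-3 and grow by four orders of magnitude.
gammas = focusing_schedule(gamma0=1e-3, total_factor=1e4, n_epochs=100)
print(gammas[0], gammas[-1])
```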

Optimizers.

Vanilla SGD updates in Algorithms 1 and 2 can be replaced by optimization steps of any commonly used gradient-based optimizer.

4.2 Comparisons across several architectures and datasets

In this section we show that, by optimizing the local entropy with eSGD and rSGD, we are able to systematically improve the generalization performance compared to standard SGD. We perform experiments on image classification tasks, using common benchmark datasets, state-of-the-art deep architectures and the usual cross-entropy loss. The detailed settings of the experiments are reported in the SM. For the experiments with eSGD and rSGD, we use the same settings and hyper-parameters (architecture, dropout, learning rate schedule,…) as for the baseline, unless otherwise stated in the SM and apart from the hyper-parameters specific to these algorithms.

While we do a little hyper-parameter exploration to obtain a reasonable baseline, we do not aim to reproduce the best achievable results with these networks, since we are only interested in comparing different algorithms in similar contexts. For instance, we train PyramidNet+ShakeDrop for 300 epochs, instead of the 1800 used in AA, and we start from random initial conditions for EfficientNet instead of doing transfer learning as done in efficientnet. In the case of the ResNet110 architecture instead, we use the training specification of the original paper resnet.

All combinations of datasets and architectures we tested are reported in Table 1. Blanks correspond to untested combinations. The first 3 columns correspond to experiments with the same number of effective epochs, that is, taking into account that each iteration of the outer loop in Algorithms 1 and 2 samples several mini-batches (one per inner SGLD step for eSGD, one per replica for rSGD). In the last column instead, each replica consumes individually the same amount of data as the baseline. Being a distributable algorithm, rSGD enjoys the same scalability as the related EASGD and Parle elastic; parle.

For rSGD, we use a fixed number of replicas and the scoping (focusing) schedules described in Sec. 4.1. In our explorations, rSGD proved to be quite robust with respect to specific choices of the hyper-parameters. The error reported is that of the center $\bar{w}$. For eSGD, some hyper-parameters are kept fixed in all experiments, and we perform little tuning for the other hyper-parameters. The algorithm is a little more sensitive to hyper-parameters than rSGD, while still being quite robust. Moreover, it lacks an automatic scoping schedule.

Results in Table 1 show that entropic algorithms generally outperform the corresponding baseline with roughly the same amount of parameter tuning and computational resources. In the next section we also show that they land in flatter minima.

4.3 Flatness measures

The local entropy curves are hard to estimate except in simple models. For deep networks, we use an alternative and computationally cheaper measure of flatness to describe the training landscape geometry and relate it to the generalization error. When the comparison between these two measures is possible, the qualitative agreement is generally good.

Given a configuration $w$, we perturb it with a multiplicative noise whose strength is set by a single parameter. We average 1000 perturbations for each selected value of the noise strength to compute the perturbed train error as a function of that strength. Intuitively, the slower the error grows with the noise strength, the flatter the minimum. The multiplicative nature of the perturbation handles irrelevant differences in the norm magnitudes, cf. Sec. 3.
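A sketch of this kind of flatness measure for a linear classifier is shown below. The specific noise model (each weight multiplied by one plus a Gaussian variable of standard deviation `sigma`), the function names and the synthetic data are assumptions made for illustration, not the exact procedure used for the deep networks in the experiments.

```python
import numpy as np

def train_error(w, X, labels):
    """Fraction of training patterns misclassified by sign(X @ w)."""
    return float(np.mean(np.sign(X @ w) != labels))

def perturbed_train_error(w, X, labels, sigma, n_samples=1000, seed=0):
    """Average train error under multiplicative noise w_i -> w_i * (1 + sigma * z_i)."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_samples):
        z = rng.standard_normal(w.shape)  # assumed Gaussian noise model
        errors.append(train_error(w * (1.0 + sigma * z), X, labels))
    return float(np.mean(errors))

# Synthetic two-cluster data and a simple reference configuration (hypothetical).
rng = np.random.default_rng(1)
N, P = 20, 200
v = rng.standard_normal(N)
labels = rng.choice([-1.0, 1.0], size=P)
X = labels[:, None] * v[None, :] / np.sqrt(N) + rng.standard_normal((P, N))
w = X.T @ labels

sigmas = (0.0, 0.1, 0.5)
curve = [perturbed_train_error(w, X, labels, s, n_samples=200) for s in sigmas]
curve_rescaled = [perturbed_train_error(10.0 * w, X, labels, s, n_samples=200) for s in sigmas]
print(curve, curve_rescaled)
```

Because the noise is multiplicative, rescaling the weights rescales the perturbation by the same amount, so the curve for `10.0 * w` coincides with the one for `w`; an additive perturbation would not have this property.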

This sharpness-based measure has been reported as being one of the most reliable predictors of generalization performance jiang2019fantastic. In this section, we use this measure to correlate the improved generalization properties of the minimizers found by entropic algorithms with flatness, i.e. we ask whether rSGD and eSGD generalize better than SGD by finding flatter minima. We trace the evolution of the flatness during training. We stop when the training error and loss have reached stationary values. In our experiments, the final training error is close to 0 (see Fig. 3). We note that eSGD and rSGD curves are below the SGD curve across a large range of noise values, while also achieving better generalization. Similar results are found for different architectures, as reported in the SM.

Another set of experiments was performed on the shallow networks that have been studied analytically in baldassi2019shaping. We found that the generalization performance of rSGD and eSGD is correlated with the local entropy of the minimizers that they find. These results are compared to different implementations of SGD, which display worse generalization and smaller local entropy. Details are reported in the SM.

5 Discussion and conclusions

We studied analytically and numerically the connection between generalization and flatness, as defined by the local entropy measure for the classification error loss function and for its differentiable relaxations. Starting with analytically tractable models, we have discussed new results for Gaussian mixture classification, which show that optimal Bayesian predictors correspond to high local entropy regions of the classifier and of the differentiable loss. These optimal solutions can be found algorithmically either by adding a strong regularization to the learning loss function or by an entropic algorithm (rSGD). Observing that the classifier itself is independent of the norm of the weights and that flatness can be properly defined on any loss, our results give further support to the idea that the flatness of minima plays an important role for generalization. A similar scenario is known to exist in DNNs with ReLU activations and argmax operations for the output labels, which are invariant to weight rescaling. We have performed an extensive numerical study on state-of-the-art deep architectures to verify that the improvement in performance is correlated with estimates of flatness. Our future efforts will be devoted to studying the connection between generalization bounds and the existence of wide flat regions in the landscape of the classifier.

This work has no ethical or societal impact.

Appendix A Local Entropy and Replicated Systems

The analytical framework of Local Entropy was introduced in Ref. [locentfirst], while the connection between Local Entropy and systems of real replicas (as opposed to the "fake" replicas of spin glass theory [mezard1987spin]) was made in Ref. [unreasoanable]. For convenience, we briefly recap here the simple derivation.

We start from the definition of the local entropy loss given in the main text:

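$$\mathcal{L}_{LE}(w) = -\frac{1}{\beta} \log \int dw' \, e^{-\beta \mathcal{L}(w') - \frac{1}{2}\beta\gamma \|w' - w\|^2}. \qquad (6)$$

We then consider the Boltzmann distribution of a system with energy function $\mathcal{L}_{LE}(w)$ and with an inverse temperature $\beta y$, that is

$$p(w) \propto e^{-\beta y \mathcal{L}_{LE}(w)}, \qquad (7)$$

where equivalence is up to a normalization factor. If we restrict $y$ to integer values, we can then use the definition of $\mathcal{L}_{LE}$ to construct an equivalent but enlarged system, containing $y$ replicas $w^a$ in addition to the original configuration $w$. Their joint distribution $p(w, \{w^a\}_a)$ is readily obtained by plugging Eq. (6) into Eq. (7). We can then integrate out the original configuration $w$ and obtain the marginal distribution for the remaining $y$ replicas

$$p(\{w^a\}_a) \propto e^{-\beta \mathcal{L}_R(\{w^a\}_a)}, \qquad (8)$$

where the energy function is now given by

$$\mathcal{L}_R(\{w^a\}_a) = \sum_{a=1}^{y} \mathcal{L}(w^a) + \frac{1}{2}\gamma \sum_{a=1}^{y} \|w^a - \bar{w}\|^2, \qquad (9)$$

with $\bar{w} = \frac{1}{y} \sum_{a=1}^{y} w^a$. We have thus recovered the loss function for the replicated SGD (rSGD) algorithm presented in the main text.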
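The intermediate steps between Eq. (7) and Eq. (8) can be spelled out as follows. This is a worked sketch in the notation of this appendix, assuming that the center is the plain average of the replicas, $\bar{w} = \frac{1}{y}\sum_a w^a$, and that $w$ has $N$ components.

```latex
% Raising the integral in Eq. (6) to the integer power y introduces y copies w^a:
e^{-\beta y \mathcal{L}_{LE}(w)}
  = \int \prod_{a=1}^{y} dw^a \,
    \exp\Big(-\beta \sum_{a=1}^{y}
      \big[\mathcal{L}(w^a) + \tfrac{1}{2}\gamma \|w^a - w\|^2\big]\Big)

% Decompose the coupling around the mean \bar{w}
% (the cross term vanishes because \sum_a (w^a - \bar{w}) = 0):
\sum_{a=1}^{y} \|w^a - w\|^2
  = \sum_{a=1}^{y} \|w^a - \bar{w}\|^2 + y \, \|w - \bar{w}\|^2

% The remaining integral over w is Gaussian and does not depend on the replicas:
\int dw \, e^{-\frac{1}{2}\beta\gamma y \|w - \bar{w}\|^2}
  = \Big(\frac{2\pi}{\beta\gamma y}\Big)^{N/2}
```

The Gaussian integral only contributes a constant that is absorbed in the normalization, which leaves the replicated energy of Eq. (9).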

Appendix B Gaussian Mixtures

In this section we will provide details of the analytical computations performed on a common model considered in high-dimensional statistics: the Gaussian mixture model.

As the first step, an $N$-dimensional vector $v^\star$ is randomly generated from a Gaussian centered at the origin and with covariance matrix equal to the identity matrix. Samples from two classes are then generated in the following way. First we generate a label $\sigma = +1$ or $\sigma = -1$, each with a prescribed probability. Then, we generate a pattern $\xi$ by using a Gaussian distribution centered in $\sigma v^\star / \sqrt{N}$ and with covariance matrix proportional to the identity matrix; the proportionality constant (which we will call $\Delta$) controls the width of the two clusters. We generate $P$ such points; the coordinate $i$ of point $\mu$ is therefore given by

$$\xi^\mu_i = \frac{v^\star_i}{\sqrt{N}} \sigma^\mu + \sqrt{\Delta}\, z^\mu_i \qquad (10)$$

where $z^\mu_i$ are i.i.d. Gaussian random variables with mean zero and unit variance. This results in two clusters, with the label indicating the cluster a pattern belongs to. In the following we will always limit ourselves to the symmetric case, in which the two labels are equally likely, and unit noise $\Delta = 1$.
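The generative process of eq. (10) in the symmetric case can be sketched in a few lines. The function name, argument names and problem sizes are hypothetical; the sketch draws the two labels with equal probability.

```python
import numpy as np

def sample_gaussian_mixture(n_features, n_patterns, delta=1.0, seed=0):
    """Draw the centroid, the labels and the patterns following eq. (10)."""
    rng = np.random.default_rng(seed)
    v_star = rng.standard_normal(n_features)            # centroid direction
    labels = rng.choice([-1.0, 1.0], size=n_patterns)   # balanced labels
    noise = rng.standard_normal((n_patterns, n_features))
    patterns = (labels[:, None] * v_star[None, :] / np.sqrt(n_features)
                + np.sqrt(delta) * noise)
    return v_star, labels, patterns

v_star, labels, patterns = sample_gaussian_mixture(n_features=100, n_patterns=300)
print(patterns.shape, labels[:5])
```

Setting `delta=0.0` removes the noise, so that every pattern collapses onto one of the two cluster centers.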

We consider the case of a linear classifier (i.e. a perceptron). Training this classifier corresponds to the minimization of the overall loss

$$L(w,b) = \sum_{\mu=1}^{P} \ell\left[\sigma^\mu\left(\frac{1}{\sqrt{N}}\sum_{i=1}^{N} w_i \xi^\mu_i + b\right)\right] \tag{11}$$

where $w$ and $b$ are respectively the weights and the bias of the network that need to be learned, and $\ell(\cdot)$ is a generic loss function for a single pattern. We will consider in particular the case of the MSE loss $\ell(x) = \frac{1}{2}(x-1)^2$. As usual in statistical physics, we will consider the high-dimensional limit, where both $N \to \infty$ and $P \to \infty$ with the ratio $\alpha = P/N$ fixed.
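As an illustration of Eq. (11), the sketch below evaluates the total loss of a linear classifier on a list of patterns. It assumes the MSE loss has the form $\ell(x) = \frac{1}{2}(x-1)^2$, which is consistent with the closed-form energetic term of Eq. (38); the function names are hypothetical.

```python
import math


def mse_loss(x):
    """Assumed single-pattern MSE loss, l(x) = (x - 1)^2 / 2."""
    return 0.5 * (x - 1.0) ** 2


def total_loss(w, b, patterns, labels, loss_fn=mse_loss):
    """Eq. (11): sum over patterns of l(sigma * (w . xi / sqrt(N) + b))."""
    n = len(w)
    total = 0.0
    for xi, sigma in zip(patterns, labels):
        preactivation = sum(wi * x for wi, x in zip(w, xi)) / math.sqrt(n) + b
        total += loss_fn(sigma * preactivation)
    return total
```

Passing a different `loss_fn`, for instance a step function counting errors, gives the training error instead of the training loss.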

Recently, this model has been studied in [Mignacco2020] by using Gordon’s inequality. They showed that the MSE loss is severely prone to overfitting, especially when . However, if a parameter $\lambda$ controlling the regularization of the weight norms is introduced, the generalization performance improves. In the limit $\lambda \to \infty$ (corresponding to vanishing values for the norm of the weights), the generalization error of the network is equal to the Bayes-optimal one.

B.1 Typical case analysis

We briefly review how the geometry of the space of typical Gibbs configurations of the model can be studied using statistical physics techniques [gardner1988The, gardner1988optimal].

Denoting by $\beta$ the inverse temperature, we define the partition function as

$$Z = \int \prod_i dw_i\, e^{-\beta L(w,b) + \frac{\lambda}{2}\sum_i w_i^2} \tag{12}$$

We will denote the average over the distribution of patterns, labels and centroids with $\langle\cdot\rangle$. The average of the log-volume is the free entropy of the model, $-\beta f = \frac{1}{N}\langle \ln Z\rangle$, where $f$ is the free energy. We can evaluate it in the large-$N$ limit by using the “replica trick”, i.e. the formula

$$\ln Z = \lim_{n\to 0}\partial_n Z^n.$$

We first compute the average for integer values of $n$ and then analytically continue $n$ to 0. As usual in replica computations, one needs to introduce several order parameters in order to use the saddle point method when $N$ is large. Indicating by $a$, $b$ the replica indexes, those order parameters are: 1) the overlap matrix between two weights, $q_{ab} = \frac{1}{N}\sum_i w^a_i w^b_i$ for $a \neq b$, 2) the squared norm $Q_a = \frac{1}{N}\sum_i (w^a_i)^2$ and 3) the overlap between a weight and the centroid, $M_a = \frac{1}{N}\sum_i w^a_i v^\star_i$. In order to enforce the previous three definitions via Dirac delta functions we also need the corresponding conjugated parameters, which we denote by $\hat q_{ab}$, $\hat Q_a$ and $\hat M_a$ respectively. The order parameters and the bias satisfy saddle point equations that, once solved, allow us to evaluate the free entropy of the model. Note that when $\rho = 1/2$ the bias is always zero.
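The replica identity used above, $\ln Z = \lim_{n\to 0}\partial_n Z^n$, can be checked numerically for a fixed positive number $Z$: since $Z^0 = 1$, the derivative at $n = 0$ is approximated by the finite difference $(Z^n - 1)/n$ at small $n$. The helper name below is hypothetical.

```python
import math


def replica_derivative(z, n=1e-6):
    """Finite-difference estimate of d(z**n)/dn at n = 0, using z**0 = 1."""
    return (z ** n - 1.0) / n


z = 3.7
estimate = replica_derivative(z)
exact = math.log(z)
```

In the actual computation the difficulty is not this identity but evaluating the disorder average of $Z^n$ for integer $n$ and continuing the result to $n \to 0$.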

In the replica-symmetric ansatz we seek solutions to the saddle point equations of the form $q_{ab} = q$ for $a \neq b$, $Q_a = Q$, $M_a = M$, and similarly for the conjugated order parameters. The final expression of the free entropy is given by

$$-\beta f = G_S + \alpha G_E \tag{13}$$

where we have defined the entropic and energetic terms as

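$$G_S = \frac{q\hat q}{2} - Q\hat Q - M\hat M + \frac{1}{2}\ln\frac{2\pi}{\hat q - 2\hat Q + \lambda} + \frac{\hat q + \hat M^2}{2(\hat q - 2\hat Q + \lambda)} \tag{14a}$$
$$G_E = \mathbb{E}_\sigma \int Dy\, \ln\int Dh\, e^{-\beta\,\ell\left(\sqrt{\Delta(Q-q)}\,h + \sqrt{\Delta q}\,y + M + \sigma b\right)} \tag{14b}$$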

and $Dx \equiv \frac{dx}{\sqrt{2\pi}}e^{-x^2/2}$ is the standard Gaussian measure. The training loss is found simply by taking the derivative $\epsilon_\ell = \partial(\beta f)/\partial\beta$.

When $\beta \to \infty$ the interesting regime is found when the regularization parameter $\lambda$ is itself scaled with $\beta$ as $\lambda \to \beta\lambda$. Moreover, if we consider a loss with a unique minimum (such as the MSE), the overlap $q$ between two replicas must tend to the squared norm $Q$. Therefore we must impose a scaling for $q$ of the type

$$q = Q - \frac{\delta q}{\beta}. \tag{15}$$

Correspondingly, one can verify from the saddle point equations that the conjugated order parameters must be scaled as

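$$\hat q = \beta^2\,\delta\hat Q - \beta\,\delta\hat q \tag{16a}$$
$$\hat Q = \frac{\beta^2}{2}\,\delta\hat Q - \beta\,\delta\hat q \tag{16b}$$
$$\hat M = \beta\,\delta\hat M \tag{16c}$$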

All the new order parameters introduced in those scalings must satisfy new saddle point equations obtained by taking derivatives of the free energy ; the entropic and energetic terms (rescaled with ) are now given by where . Calling by the corresponding argmin, the training loss is

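$$\epsilon_\ell = \alpha\,\mathbb{E}_\sigma\int Dy\;\ell\left(\sqrt{\Delta\,\delta q}\,h^*_\sigma(y) + \sqrt{\Delta Q}\,y + M + \sigma b\right) \tag{17}$$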

The training error can be found by plugging $\ell(x) = \Theta(-x)$ inside equation (17), where $\Theta(x) = 1$ if $x > 0$ and 0 otherwise (the Heaviside step function). For the MSE loss, $h^*_\sigma(y)$ is easily found, so that the training error is

$$\epsilon_t = \alpha\,\mathbb{E}_\sigma H\left(\frac{\Delta\,\delta q + M + b\sigma}{\sqrt{\Delta Q}}\right), \tag{18}$$

where $H(x) \equiv \int_x^\infty Dy = \frac{1}{2}\mathrm{erfc}\left(x/\sqrt{2}\right)$. The MSE training loss can also be found easily by explicitly performing the integral in (17). One can verify that when $\lambda$ is increased, not only does the corresponding squared norm $Q$ decrease but also, and more importantly, the training error/loss increases (even below the critical capacity of the model, where a zero training error solution can be found). This means that insisting on searching for zero-error solutions with the MSE loss is counterproductive and leads to overfitting. This is to be expected since the Gaussian mixture model is a particular case of general noisy teacher problems, in which the training set is no longer generated by a rule that can be inferred [engel-vandenbroek].
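The passage from Eq. (17) to Eq. (18) can be reproduced numerically for a single label $\sigma$. The sketch below rests on two assumptions that are not spelled out in the surviving text: the MSE loss is $\ell(x) = \frac{1}{2}(x-1)^2$, and $h^*_\sigma(y)$ is the minimizer of $\frac{h^2}{2} + \ell(\sqrt{\Delta\,\delta q}\,h + \sqrt{\Delta Q}\,y + M + \sigma b)$. Under these assumptions the fraction of misclassified patterns coincides with the argument of $\mathbb{E}_\sigma$ in Eq. (18). All function names are hypothetical, and the prefactor $\alpha$ and the average over $\sigma$ are left out.

```python
import math


def H(x):
    """Gaussian tail probability, H(x) = erfc(x / sqrt(2)) / 2."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def h_star_mse(y, sigma, Q, dq, M, b, delta=1.0):
    """Assumed argmin over h of h^2/2 + (a*h + c - 1)^2/2, with a^2 = delta*dq."""
    a = math.sqrt(delta * dq)
    c = math.sqrt(delta * Q) * y + M + sigma * b
    return -a * (c - 1.0) / (1.0 + a * a)


def misclassified_fraction(sigma, Q, dq, M, b, delta=1.0, step=1e-3, y_max=8.0):
    """Midpoint-rule estimate of the Gaussian weight of the region where the output is negative."""
    a = math.sqrt(delta * dq)
    n_steps = int(2.0 * y_max / step)
    total = 0.0
    for k in range(n_steps):
        y = -y_max + (k + 0.5) * step
        x = a * h_star_mse(y, sigma, Q, dq, M, b, delta) + math.sqrt(delta * Q) * y + M + sigma * b
        if x < 0.0:
            total += math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi) * step
    return total


def closed_form(sigma, Q, dq, M, b, delta=1.0):
    """Argument of the sigma-average in Eq. (18)."""
    return H((delta * dq + M + sigma * b) / math.sqrt(delta * Q))


numeric = misclassified_fraction(sigma=1, Q=1.2, dq=0.5, M=0.8, b=0.1)
analytic = closed_form(sigma=1, Q=1.2, dq=0.5, M=0.8, b=0.1)
```

The agreement between `numeric` and `analytic` follows from the fact that the output at the minimizer is $(c + \Delta\,\delta q)/(1 + \Delta\,\delta q)$, with $c = \sqrt{\Delta Q}\,y + M + \sigma b$, which is negative exactly when $y < -(\Delta\,\delta q + M + \sigma b)/\sqrt{\Delta Q}$.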

B.2 Local entropy around a given typical configuration: Franz-Parisi approach

In order to quantify the local geometrical landscape around a typical configuration $\tilde w$ of the Gibbs measure with loss function $L_r$, regularization parameter $\lambda_r$ and inverse temperature $\beta_r$, we have studied the so-called Franz-Parisi free entropy [franz1995recipes, huang2014origin]. It is defined as

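$$-\beta f_{FP}(S) \equiv \frac{1}{N}\left\langle \frac{\int\prod_i d\tilde w_i\, e^{-\beta_r L_r(\tilde w,\tilde b) + \lambda_r\sum_i \tilde w_i^2}\,\ln\mathcal{N}(\tilde w, S)}{\int\prod_i d\tilde w_i\, e^{-\beta_r L_r(\tilde w,\tilde b) + \lambda_r\sum_i \tilde w_i^2}} \right\rangle \tag{19}$$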

where the quantity

$$\mathcal{N}(\tilde w, S) \equiv \int d\mu_P(w)\, e^{-\beta L(\tilde w,\tilde b)}\,\delta\left(\sum_i w_i\tilde w_i - NS\right) \tag{20}$$

is the volume of configurations $w$ at inverse temperature $\beta$ that have overlap $S$ with the reference configuration $\tilde w$. Here $\mu_P$ is the flat measure over the admissible values of $w$ with a fixed squared norm $P$; in other terms the weights live on the hyper-sphere $\sum_i w_i^2 = PN$. The value of $P$ is chosen to match the squared norm of the reference $\tilde w$, that is $P = Q$. Note that $Q$ is fixed via the soft constraint with the regularization parameter $\lambda_r$.

In order to compute the average over the disorder induced by the patterns, we use two replica tricks, one for the denominator of (19), which is just the partition function and one for the log in the numerator of the same equation . From now on we will use indexes or for replicas in and , . Therefore we get

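$$-\beta f_{FP}(S) = \frac{1}{N}\lim_{n\to 0}\lim_{r\to 0}\partial_n\left\langle \int\prod_{a,i} d\tilde w^a_i\,\prod_a e^{-\beta_r L_r(\tilde w^a,\tilde b) + \lambda_r\sum_i(\tilde w^a_i)^2}\,\mathcal{N}^n(\tilde w^{a=1}, S)\right\rangle \tag{21}$$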

The computation proceeds as usual by averaging over the disorder and introducing some other order parameters (in addition to those involving only the reference ), namely , , , and the corresponding conjugated ones. Note that $P$ is just the squared norm because of the spherical constraint inside the measure $\mu_P$. We obtain that the Franz-Parisi free entropy can be split into the sum of an entropic and an energetic term as $-\beta f_{FP} = G_S + \alpha G_E$. Using an RS ansatz on all the order parameters involved, the entropic term can be written as

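$$G_S = \frac{p\hat p}{2} + t\hat t - P\hat P - O\hat O - S\hat S + \frac{1}{2}\ln\frac{2\pi}{\hat p - 2\hat P} + \frac{1}{\hat p - 2\hat P}\left[\frac{\hat p + \hat O^2}{2} + \frac{(\hat S - \hat t)^2\left(2(\hat q - \hat Q) + \hat M^2 + \lambda_r\right)}{2(\hat q - 2\hat Q + \lambda_r)^2} + \frac{(\hat S - \hat t)(\hat t + \hat M\hat O)}{\hat q - 2\hat Q + \lambda_r}\right] \tag{22}$$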

whereas the energetic one is

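$$G_E = \mathbb{E}_\sigma\int Dx\,\frac{1}{Z(x)}\int Dh\, e^{-\beta_r\ell_r\left(\sigma\tilde b + M + \sqrt{\Delta q}\,x + \sqrt{\Delta(Q-q)}\,h\right)} \times \int Dy\,\ln\int Du\, e^{-\beta\,\ell\left[\sigma b + O + \sqrt{\Delta\gamma - \frac{\Delta(S-t)^2}{Q-q}}\,y + \frac{\Delta t}{\sqrt{\Delta q}}\,x + \frac{\Delta(S-t)}{\sqrt{\Delta(Q-q)}}\,h + \sqrt{\Delta(P-p)}\,u\right]} \tag{23}$$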

In the previous equation we have defined and

$$Z(x) \equiv \int Dh\, e^{-\beta_r\ell_r\left(\sigma\tilde b + M + \sqrt{\Delta q}\,x + \sqrt{\Delta(Q-q)}\,h\right)}. \tag{24}$$

Note that the parameters involving only the reference i.e. , , , and satisfy the same saddle point equations of the previous subsection. We are now interested in sending $\beta_r$ to infinity. To do so, we need to supplement the scalings (16) of the order parameters involving only the reference with the ones for the overlaps between reference and and their conjugated ones. The new scalings are

$$t = S - \frac{\delta t}{\beta_r} \tag{25a}$$
$$\hat t = \beta_r\,\delta\hat t \tag{25b}$$
$$\hat S - \hat t = \delta\hat S. \tag{25c}$$

Using these scalings, the entropic term becomes

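$$G_S = \frac{p\hat p}{2} - \delta t\,\delta\hat t - P\hat P - O\hat O - S\,\delta\hat S + \frac{1}{2}\ln\frac{2\pi}{\hat p - 2\hat P} + \frac{1}{\hat p - 2\hat P}\left[\frac{\hat p + \hat O^2}{2} + \frac{\delta\hat S^2\left(\delta\hat Q + \delta\hat M^2\right)}{2(\delta\hat q + \lambda_r)^2} + \frac{\delta\hat S\left(\delta\hat t + \delta\hat M\,\hat O\right)}{\delta\hat q + \lambda_r}\right] \tag{26}$$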

and the energetic term becomes

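$$G_E = \mathbb{E}_\sigma\int Dx\,Dy\,\ln\int Du\, e^{-\beta\,\ell\left[\sigma b + O + \sqrt{\Delta\gamma}\,y + \frac{\Delta S}{\sqrt{\Delta Q}}\,x + \frac{\Delta\,\delta t}{\sqrt{\Delta\,\delta q}}\,h^*_\sigma(x) + \sqrt{\Delta(P-p)}\,u\right]} \tag{27}$$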

where we have redefined as .

Once $f_{FP}$ is known, we can compute the energy of the configuration with overlap $S$ with the reference as and the local entropy as . The same formulas are valid if we look at the local entropy landscape in the space of the training error, where $\ell(x) = \Theta(-x)$. As in the previous subsection we indicate by $\epsilon_t$ the training error of the configuration, to distinguish it from the training loss. The training error can be written as

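$$\epsilon_t = \frac{\partial(\beta f_{FP})}{\partial\beta} = \alpha\, e^{-\beta}\,\mathbb{E}_\sigma\int Dx\,Dy\,\frac{H\left(\frac{\sigma b + O + \sqrt{\Delta\gamma}\,y + \frac{\Delta S}{\sqrt{\Delta Q}}\,x + \frac{\Delta\,\delta t}{\sqrt{\Delta\,\delta q}}\,h^*_\sigma(x)}{\sqrt{\Delta(P-p)}}\right)}{H_\beta\left(-\frac{\sigma b + O + \sqrt{\Delta\gamma}\,y + \frac{\Delta S}{\sqrt{\Delta Q}}\,x + \frac{\Delta\,\delta t}{\sqrt{\Delta\,\delta q}}\,h^*_\sigma(x)}{\sqrt{\Delta(P-p)}}\right)} \tag{28}$$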

where $H_\beta(x) \equiv e^{-\beta} + (1 - e^{-\beta})H(x)$.
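The function $H_\beta$ appearing in Eq. (28) arises when the innermost Gaussian integral of Eq. (27) is evaluated for the error-counting loss $\ell(x) = \Theta(-x)$. Assuming the definition $H_\beta(x) = e^{-\beta} + (1 - e^{-\beta})H(x)$ with $H(x) = \frac{1}{2}\mathrm{erfc}(x/\sqrt{2})$ (an assumption, since the definition is not legible in the text), one has $\int Du\, e^{-\beta\Theta(-(A + cu))} = H_\beta(-A/c)$ for $c > 0$. The sketch below checks this by direct quadrature; `A` and `c` are hypothetical placeholders for the deterministic part of the argument and for $\sqrt{\Delta(P-p)}$.

```python
import math


def H(x):
    """Gaussian tail probability."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def H_beta(x, beta):
    """Assumed definition: exp(-beta) + (1 - exp(-beta)) * H(x)."""
    return math.exp(-beta) + (1.0 - math.exp(-beta)) * H(x)


def error_loss_integral(A, c, beta, u_max=10.0, n_steps=40000):
    """Midpoint-rule estimate of the Gaussian average of exp(-beta * Theta(-(A + c*u)))."""
    step = 2.0 * u_max / n_steps
    total = 0.0
    for k in range(n_steps):
        u = -u_max + (k + 0.5) * step
        # The loss is 1 when the argument is negative and 0 otherwise.
        loss = 1.0 if A + c * u < 0.0 else 0.0
        total += math.exp(-beta * loss) * math.exp(-0.5 * u * u)
    return total * step / math.sqrt(2.0 * math.pi)


numeric = error_loss_integral(A=0.4, c=0.9, beta=1.5)
analytic = H_beta(-0.4 / 0.9, beta=1.5)
```

Differentiating $-\ln H_\beta$ with respect to $\beta$ then produces the ratio $e^{-\beta}H(\cdot)/H_\beta(-\cdot)$ that appears in the integrand of Eq. (28).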

At $\alpha = 0$ the Franz-Parisi free entropy is

$$-\beta f_{FP}(S, \alpha = 0) = \frac{1}{2}\left[1 + \ln(2\pi) + \ln\left(\frac{1}{\lambda_r} - \lambda_r S^2\right)\right], \tag{29}$$

and gives the total volume of configurations at overlap $S$ with the reference.

As stated in the main text, we are interested in studying the local entropy landscape of configurations found by optimizing the regularized MSE loss in the space of the training error. Therefore we choose and . On the other hand, the parameter has been chosen in such a way that the training error of given in (28) is equal to a certain cutoff .

Notice that (29) gives an upper bound to the local entropy. Therefore, if we normalize the local entropy with respect to (29), it is non-positive and vanishes at zero distance. For sharp minima we expect the normalized local entropy to drop sharply near zero distance, whereas for flat minima it will stay close to zero over some range of distances.
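As a small illustration, the reference volume of Eq. (29) and the resulting normalization can be written as follows. The sketch assumes that "normalizing" means subtracting the $\alpha = 0$ free entropy from the local entropy, which is consistent with the normalized quantity being non-positive; the function names are hypothetical.

```python
import math


def fp_free_entropy_alpha0(S, lam_r):
    """Eq. (29): log-volume of configurations at overlap S when alpha = 0."""
    arg = 1.0 / lam_r - lam_r * S * S
    if arg <= 0.0:
        raise ValueError("overlap S outside the admissible range |S| < 1/lam_r")
    return 0.5 * (1.0 + math.log(2.0 * math.pi) + math.log(arg))


def normalized_local_entropy(local_entropy, S, lam_r):
    """Assumed normalization: local entropy minus the alpha = 0 upper bound."""
    return local_entropy - fp_free_entropy_alpha0(S, lam_r)
```

The reference volume shrinks as $|S|$ approaches $1/\lambda_r$, which is why curves for different references are compared only after this normalization.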

We have studied two different values for the energy :

• in the first case is chosen to be equal to the training error of the reference given in equation (18). This case corresponds to the left panel of the first figure in the main text, where we plot the normalized local entropy as a function of the distance .

• in the second case is equal to the training error of the teacher , which is given by . This case is depicted in figure 4.

In both cases we clearly see that references with better generalization properties (corresponding to larger values of the regularization parameter ) have higher local entropy curves.

B.3 Replicated system in the loss landscape

We now study a system of $y$ real replicas where each one optimizes a loss under constraints on their squared norm and on their mutual angles, namely: . This problem is equivalent to imposing a 1RSB ansatz on the standard equilibrium measure (12) with the Parisi parameter $m$ and the intra-block overlap parameter $q_1$ fixed as external parameters; their physical meaning is identified respectively with the number of replicas and the overlap between replicas (see also [monasson1995structural, baldassi2019shaping]). Therefore the partition function of this system of replicas is

The free entropy of a single replica is and can be evaluated by the usual replica trick

$$-\beta f = \lim_{s\to 0}\frac{1}{Ny}\,\partial_s\overline{Z^s} \tag{31}$$

If we choose , this formalism of the replicated partition function (30) reduces to the 1RSB formalism on the standard equilibrium measure given in (12), with the only difference being that $m$ and $q_1$ are fixed as external parameters. The final result is again $-\beta f = G_S + \alpha G_E$, where

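$$G_S = \frac{q_1\hat q_1}{2} - \frac{m}{2}\left(q_1\hat q_1 - q_0\hat q_0\right) - Q\hat Q - M\hat M + \frac{1}{2}\ln\frac{2\pi}{\hat q_1 - 2\hat Q + \lambda} + \frac{1}{2}\,\frac{\hat q_0 + \hat M^2}{\hat q_1 - 2\hat Q + \lambda - m(\hat q_1 - \hat q_0)} + \frac{1}{2m}\ln\left(\frac{\hat q_1 - 2\hat Q + \lambda}{\hat q_1 - 2\hat Q + \lambda - m(\hat q_1 - \hat q_0)}\right) \tag{32}$$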

and

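$$G_E = \frac{1}{m}\,\mathbb{E}_\sigma\int Dy\,\ln\int Dz\left[\int Dh\, e^{-\beta\,\ell\left(\sqrt{\Delta(Q-q_1)}\,h + \sqrt{\Delta q_0}\,y + \sqrt{\Delta(q_1-q_0)}\,z + M + \sigma b\right)}\right]^m \tag{33}$$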

B.3.1 Computing the barycenter of the replicas

In this subsection we want to evaluate the relevant quantities in the barycenter of the replicas, which is defined as

$$\bar w_i \equiv \frac{1}{m}\sum_a w^a_i. \tag{34}$$

The relevant order parameters that we need to find in order to compute physical quantities are the overlap with the teacher $\bar M$ and the norm of the center $\bar Q$. We can see that $\bar M$ and $\bar Q$ can be expressed in terms of the known replica-overlap quantities: $\bar M$ is simply

$$\bar M = \frac{1}{Ny}\sum_i\sum_a w^a_i v^\star_i = \frac{1}{y}\sum_a M_a \tag{35}$$

whereas $\bar Q$ is

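$$\bar Q = \frac{1}{N}\sum_i \bar w_i^2 = \frac{1}{y^2 N}\sum_{ab}\sum_i w^a_i w^b_i = \frac{1}{y^2}\sum_a Q_a + \frac{1}{y^2}\sum_{a\neq b} q_{ab} \tag{36}$$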

Since all real replicas have the same mutual overlap and the same squared norm, we get

$$\bar M = M \tag{37a}$$
$$\bar Q = \frac{Q - q_1}{m} + q_1 \tag{37b}$$
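Equations (35)-(37) can be checked on explicit vectors. The sketch below first verifies Eqs. (35) and (36) for arbitrary replicas, then builds a hypothetical symmetric configuration (equal squared norm $Q = c^2 + d^2$ and equal mutual overlap $q_1 = c^2$) and compares the norm of its barycenter with Eq. (37b), taking the number of replicas $y$ in the role of $m$. The helper names and the construction are illustrative assumptions.

```python
import math
import random


def overlap(u, v):
    """Normalized scalar product (1/N) * sum_i u_i v_i."""
    return sum(a * b for a, b in zip(u, v)) / len(u)


def barycenter(replicas):
    """Coordinate-wise mean of the replicas."""
    y = len(replicas)
    return [sum(col) / y for col in zip(*replicas)]


def symmetric_replicas(y, c, d):
    """y vectors in N = y + 1 dimensions with squared norm c^2 + d^2 and mutual overlap c^2."""
    n = y + 1
    scale = math.sqrt(n)
    replicas = []
    for a in range(y):
        w = [0.0] * n
        w[0] = scale * c      # shared component
        w[a + 1] = scale * d  # component private to replica a
        replicas.append(w)
    return replicas


# Generic check of Eqs. (35) and (36) on random replicas.
rng = random.Random(1)
y, n = 3, 6
replicas = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(y)]
v_star = [rng.gauss(0.0, 1.0) for _ in range(n)]
w_bar = barycenter(replicas)
M_bar = overlap(w_bar, v_star)
Q_bar = overlap(w_bar, w_bar)
M_from_replicas = sum(overlap(w, v_star) for w in replicas) / y
Q_from_replicas = (
    sum(overlap(w, w) for w in replicas)
    + sum(overlap(replicas[a], replicas[b]) for a in range(y) for b in range(y) if a != b)
) / y ** 2

# Symmetric case of Eq. (37b).
sym = symmetric_replicas(y=4, c=0.6, d=0.8)
Q, q1 = overlap(sym[0], sym[0]), overlap(sym[0], sym[1])
Q_bar_sym = overlap(barycenter(sym), barycenter(sym))
Q_bar_formula = (Q - q1) / 4 + q1
```

The barycenter norm lies between $q_1$ and $Q$ and approaches $q_1$ as the number of replicas grows.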

B.3.2 Large-β limit for the MSE loss

For the MSE loss all the integrals in the energetic term can be solved, giving

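$$G_E = -\frac{1}{2}\ln\left(1 + \beta\Delta(Q - q_1)\right) + \frac{1}{2m}\ln\left(\frac{1 + \beta\Delta(Q - q_1)}{1 + \beta\Delta(Q - q_1) + m\beta\Delta(q_1 - q_0)}\right) - \frac{\beta}{2}\,\frac{\Delta q_0 + \mathbb{E}_\sigma(M + b\sigma - 1)^2}{1 + \beta\Delta(Q - q_1) + \beta m\Delta(q_1 - q_0)} \tag{38}$$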
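The closed form above follows from repeated Gaussian integrations. As a sanity check of the first one, the sketch below compares a numerical quadrature of the innermost $h$-integral of Eq. (33) with the standard identity $\int Dh\, e^{-\frac{\beta}{2}(ah + c)^2} = (1 + \beta a^2)^{-1/2}\exp\left(-\frac{\beta c^2}{2(1 + \beta a^2)}\right)$. It assumes the MSE loss $\ell(x) = \frac{1}{2}(x-1)^2$, with $a^2$ standing for $\Delta(Q - q_1)$ and $c$ collecting all the terms that do not depend on $h$ (including the $-1$); the function names are hypothetical.

```python
import math


def gaussian_average(f, h_max=10.0, n_steps=20000):
    """Midpoint-rule estimate of the average of f(h) over a standard Gaussian h."""
    step = 2.0 * h_max / n_steps
    total = 0.0
    for k in range(n_steps):
        h = -h_max + (k + 0.5) * step
        total += f(h) * math.exp(-0.5 * h * h)
    return total * step / math.sqrt(2.0 * math.pi)


def mse_inner_integral_numeric(beta, a, c):
    """Numerical value of the Gaussian average of exp(-beta * (a*h + c)^2 / 2)."""
    return gaussian_average(lambda h: math.exp(-0.5 * beta * (a * h + c) ** 2))


def mse_inner_integral_closed(beta, a, c):
    """Closed form of the same integral."""
    denom = 1.0 + beta * a * a
    return math.exp(-0.5 * beta * c * c / denom) / math.sqrt(denom)


numeric = mse_inner_integral_numeric(beta=2.0, a=0.7, c=-0.4)
closed = mse_inner_integral_closed(beta=2.0, a=0.7, c=-0.4)
```

The factor $(1 + \beta a^2)^{-1/2}$ is the origin of the term $-\frac{1}{2}\ln(1 + \beta\Delta(Q - q_1))$ in Eq. (38).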

When $\beta$ is large we get the following scaling for $q_0$

$$q_0 = \frac{Q - q_1}{m} + q_1 + \frac{\delta q_0}{\beta}. \tag{39}$$

The other scalings are

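$$\hat q_0 = \beta^2\,\delta\hat q_0 + \frac{\beta}{2}\,\delta\hat q_1 \tag{40a}$$
$$\hat q_1 = \beta^2\,\delta\hat q_0 - \frac{\beta}{2}\,\delta\hat q_1 \tag{40b}$$
$$\hat Q = \frac{\hat q_1}{2} - \frac{1}{2(Q - q_1)} \tag{40c}$$
$$\hat M = \beta\,\delta\hat M \tag{40d}$$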

The new entropic and energetic terms (rescaled with $\beta$) are therefore

 GS ≡limβ→∞GSβ=12(Q−q1)δ^q1+