Deep Networks on Toroids: Removing Symmetries Reveals the Structure of Flat Regions in the Landscape Geometry

02/07/2022
by   Fabrizio Pittorino, et al.

We systematize the approach to the investigation of deep neural network landscapes by basing it on the geometry of the space of implemented functions rather than the space of parameters. Grouping classifiers into equivalence classes, we develop a standardized parameterization in which all symmetries are removed, resulting in a toroidal topology. On this space, we explore the error landscape rather than the loss. This lets us derive a meaningful notion of the flatness of minimizers and of the geodesic paths connecting them. Using different optimization algorithms that sample minimizers with different flatness, we study the mode connectivity and other characteristics. Testing a variety of state-of-the-art architectures and benchmark datasets, we confirm the correlation between flatness and generalization performance; we further show that in function space flatter minima are closer to each other and that the barriers along the geodesics connecting them are small. We also find that minimizers found by variants of gradient descent can be connected by zero-error paths with a single bend. We observe similar qualitative results in neural networks with binary weights and activations, providing one of the first results concerning the connectivity in this setting. Our results hinge on symmetry removal, and are in remarkable agreement with the rich phenomenology described by some recent analytical studies performed on simple shallow models.


1 Introduction

The loss landscape of a typical deep neural network performing a supervised learning task is in general highly non-convex. Moreover, even small networks (by the current standards) have a huge number of configurations of small loss, corresponding to zero or near-zero training error. In this sense, most modern networks operate in a strongly over-parameterized regime. Understanding how simple variants of first-order algorithms are able to escape bad local minima and yet avoid overfitting is a fundamental problem, which has received a lot of attention from several perspectives (Belkin et al., 2019; Rocks & Mehta, 2020).

A natural and promising approach for addressing this issue is to investigate the geometrical properties of the loss landscape. Broadly speaking, there are two related but conceptually distinct main research directions in this area: one is about the dynamics of gradient-based learning algorithms (e.g. Feng & Tu (2021)); the other concerns a static description of the geometry, its overall structure and its relation to the generalization properties of the network on unseen data (e.g. Gotmare et al. (2018)). In this paper, we focus on the latter.

A first basic observation is that (near-)minimizers of the loss, corresponding to (near-)zero training error, can have dramatically different generalization properties (Keskar et al., 2016; Liu et al., 2020; Pittorino et al., 2021). A growing amount of evidence shows a consistent correlation between the flatness of the minima of the loss and the test accuracy, across a large number of models and with several alternative measures of flatness, see e.g. Dziugaite & Roy (2017); Jiang* et al. (2020); Pittorino et al. (2021); Yue et al. (2020). Moreover, several studies indicate that stochastic gradient descent (SGD) and its variants introduce a bias, compared to full-batch gradient descent, towards flatter minima (Keskar et al., 2016; Chaudhari & Soatto, 2018; Feng & Tu, 2021; Pittorino et al., 2021). This effect seems to be amplified by other common practices, e.g. the use of the cross-entropy loss function, dropout, judicious initialization, and ReLU transfer functions (Baldassi et al., 2018, 2019, 2020; Liu et al., 2020; Zhang et al., 2021). Therefore, in practical applications, bad minima are seldom reported or observed, even though they exist in the landscape.

In this paper, we present a coherent empirical exploration of the structure of the minima of the landscape. Our work has two main features that, taken together, set it apart from the majority of the existing literature (see also Sec. 2 on related work): 1) the main object of our study is the error (also called “energy”) landscape, rather than the loss; 2) our underlying geometrical space and topology is that of networks, intended as functional relations, rather than the space of parameters.

The first point is less crucial to our results, although it affects the second one. Providing a detailed description of the dynamics requires studying the loss landscape. However, we argue that the error landscape is a similar but more basic object of study for a static analysis, since it is directly related to the observable behavior of the network, especially the generalization error. This is assuming that the end goal of the training is to obtain a classifier whose output is the argmax over the last layer. The loss function, on the other hand, uses the entire output of the last layer, which is not normally needed after training is complete. The error landscape is also more amenable to theoretical analysis, e.g. Baldassi et al. (2015); Dziugaite & Roy (2017).

The second point is inspired by similar considerations. We posit that, after training, two networks that implement the same input-output relation (not only on the training set, but on any input) must be identified, even if their parameterizations differ, and as such we group them into equivalence classes. In order to define a topology and a metric over this space, we standardize the parameterization of the networks (not during the training, but only for the geometrical description), thereby removing the symmetries that affect the usual parameterizations. In most common networks there are two of them: a continuous one, a scale invariance that allows the weights to be rescaled, and a discrete one, a permutation symmetry reflecting the fact that in a hidden layer the labels of the units (or the labels of the filters in convolutional layers) can be exchanged. The latter in particular means that two networks may appear to be very distant from each other in parameter space even if they are very similar or even identical. The scale symmetry can also have this effect, and moreover it may affect many measures of flatness, making them imprecise at best and misleading at worst (Dinh et al., 2017). These problems obviously affect every other investigation, such as studying the paths that join two configurations.

Our approach is thus as follows: 1) we choose a normalization method that leaves the network behavior unchanged while fixing the norm of each of the hidden units, projecting them onto a hyper-toroidal manifold; 2) when we compare two networks, we normalize and align them first, and consider the geodesic paths between them in normalized space. With respect to the original goal of exploring function space, this approach is approximate, mainly for computational reasons (all our procedures are efficient and scale well with the number of parameters), and because in our characterization of the landscape we neglect the biases and batch-norm parameters. Yet, the approximation appears to be very good and the effect is critical. We have applied these techniques to several networks (continuous and discrete, multi-layer perceptrons and convolutional networks) and datasets (both real-world and synthetic); for the discrete networks that we consider, the rescaling symmetry is replaced by another discrete symmetry, a sign-reversal one. In each case, we used different training protocols aimed at sampling zero-error configurations (also called solutions) with different characteristics. In particular, we compared the kind of solutions found by SGD or its variants with momentum, with the (typically flatter and more accurate) ones found by the Replicated-SGD (RSGD) algorithm (Pittorino et al., 2021), and with some poorly-generalizing solutions found by adversarial initialization. With these, we could explore the local landscape of each solution and the paths connecting any two of them.

Remarkably, our results display some qualitative features that are shared by all networks and datasets, but which are visible and stable only in the space of networks as described above, and not in that of their parameters. Besides confirming that flatter minima generalize better than sharper ones, we found that they are also closer to each other, and that geodesic paths between them encounter lower barriers. Also, with few exceptions, the solutions we find can be connected by paths of (near-)zero error with a single bend. Overall, our results are compatible with the analysis of the geometry of the space of solutions in binary, shallow networks reported in Baldassi et al. (2021a, b), according to which efficient algorithms target large connected structures of solutions with the more robust (flatter) ones at the center, radiating out into progressively sharper ones.

2 Related Work

Early works relating flatness and generalization performance are Hochreiter & Schmidhuber (1997); Hinton & Van Camp (1993). In Keskar et al. (2016), the authors show that minimizers with different geometrical properties can be found by varying algorithmic choices like the batch size. In Jiang* et al. (2020) a large-scale experiment exploring the correlation between generalization and different complexity measures reported some flatness-based measures as the most robust predictors of good generalization performance. Several optimization algorithms explicitly designed for finding flatter minima have been presented in the literature, resulting in improved generalization performance in several settings (Chaudhari et al., 2019; Pittorino et al., 2021; Yue et al., 2020). Some analytical investigations of phenomena related to flatness, their relation to generalization and their algorithmic implications can be found in Baldassi et al. (2015, 2016); Zhou et al. (2018); Dziugaite & Roy (2017).

Several recent works analyze the topic of mode connectivity empirically and analytically. In Draxler et al. (2018), the authors construct low-loss paths between minima of networks trained on image data using a variant of the nudged elastic band algorithm (Jónsson et al., 1998). In the closely related work Garipov et al. (2018), the authors develop a method for finding low-loss paths as simple as a polygonal chain with a single bend and use this insight to create a fast ensembling method. In Gotmare et al. (2018) the authors show that minima found with different training and initialization strategies can be connected by high-accuracy paths. Notably for the present work, the authors note that some of these choices, like the batch size or the optimizer, are expected to have an effect on the flatness of the solutions found. Mode connectivity in the context of adversarial attacks has been studied in Zhao et al. (2020), where the authors also propose to exploit mode connectivity to repair tampered models. In Kuditipudi et al. (2019) it is shown analytically that, given some suitable assumptions on the robustness of the network, a low-loss path can be constructed between the solutions of ReLU networks with multiple layers. A similar result is derived in Shevchenko & Mondelli (2020) for networks trained specifically with SGD.

The importance of symmetries for the question of flatness has been highlighted in Dinh et al. (2017), where symmetries are used to show that simple notions of flatness are problematic. In Brea et al. (2019), low-loss paths between minimizers are constructed that cross 'permutation points', i.e. points where the weights associated with two hidden neurons in the same layer coincide. In Tatro et al. (2020), a method for aligning neurons based on matching activation distributions is introduced, and the authors show analytically and empirically that this method increases mode connectivity. In Singh & Jaggi (2020) a neuron alignment method based on optimal transport is introduced in the context of model fusion, where one tries to merge two or more trained models into a single model. Among other results, the authors show that matching is crucial when averaging the weights of models trained in the same or different settings (for example, trained on different subsets of the labels). In Entezari et al. (2021), the authors test the hypothesis that barriers on the linear path between minima of ReLU networks found by SGD vanish if the permutation symmetry is taken into account. They present evidence in the form of extensive numerical tests, exploring different settings with respect to the width, depth and architectures used.

3 Numerical tools for the study of the energy landscape geometry

As stated in the introduction, our aim is to study the geometry of the error landscape (which we will also call the “energy” landscape) in the space of networks. In particular, after sampling solutions (zero-error configurations) with different characteristics, we study their flatness profiles and two types of paths connecting them: geodesic and optimized paths. The latter are obtained by finding a midpoint between two networks, using it as the starting point of a new optimization process that reaches zero error, and then connecting the new solution to the two original endpoints via two geodesic paths (this is a modified version of the polygonal chains with one bend of Garipov et al. (2018), where the authors use Euclidean geometry and do not account for the permutation symmetry). To make these definitions precise, we first need to describe the tools by which we remove the symmetries in the neural networks (code available in the SM).

We will use the following notation. We denote a whole neural network (NN) by $\mathcal{N}$, and by $L$ the number of its layers. We use the index $l$ for the layers, and denote the number of units in layer $l$ by $n_l$. Each layer has associated parameters that we collectively call $\theta^l$ (which include the batch-norm parameters). The input weights of the $i$-th unit of layer $l$ are $w^l_i$ and its bias is $b^l_i$. For all layers except the last, the activations are $x^l = \mathrm{ReLU}(z^l)$, where $z^l_i = w^l_i \cdot x^{l-1} + b^l_i$ are the pre-activations. In binary networks we replace the $\mathrm{ReLU}$ with a $\mathrm{sign}$. The last layer is linear and the output of the network is given by $\mathrm{argmax}(z^L)$. In binary classification tasks we use a single output unit, and a $\mathrm{sign}$ instead of an $\mathrm{argmax}$.

3.1 Breaking the symmetries

3.1.1 Normalization

Networks with continuous parameters and ReLU activation functions are invariant to the rescaling of the weights of the units in each hidden layer by a positive real value (that may be different for each unit), since the $\mathrm{ReLU}$ function has the property $\mathrm{ReLU}(\lambda z) = \lambda\,\mathrm{ReLU}(z)$ for any $\lambda > 0$. To break this symmetry, we apply the following normalization procedure, starting from the first layer and proceeding upward, see Alg. 1.

  Input: A NN $\mathcal{N}$ with continuous weights, $L$ layers, parameters $\{\theta^l\}$ and ReLU activations.
  for $l = 1$ to $L-1$ do
     compute $c^l_i \leftarrow \lVert w^l_i \rVert$ for each unit $i$ of layer $l$
     multiply the incoming weights, bias and batch-norm parameters of unit $i$ by $1/c^l_i$
     multiply the outgoing weights of unit $i$ (in layer $l+1$) by $c^l_i$
  end for
  rescale the whole last layer to a fixed norm
Algorithm 1 Neural Network Normalization

For each layer $l$, starting from the first and up to $L-1$, we calculate the norm $c^l_i = \lVert w^l_i \rVert$ of each hidden unit $i$ in the layer, and we multiply its incoming weights, bias and batch-norm parameters by $1/c^l_i$, and its outgoing weights by $c^l_i$. As a result, all the units in the layer have the same fixed norm, while the function realized by the network (and also its loss) remains unaffected. The last layer does not have this symmetry, but it can be globally rescaled by a positive factor since the output of the network is invariant (as a consequence of using the $\mathrm{argmax}$ operation). For consistency with the other layers, we normalize the last layer to the same fixed norm. The resulting space is a product of normalized hyper-spheres, inducing a hyper-toroidal topology.

This normalization choice is rather natural, results in simple expressions for the computation of the geodesics (see below), and leads to sensible results; it is certainly not the only possible (or reasonable) one, and other possibilities might be worth exploring.
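As an illustration, the following is a minimal sketch of this normalization for a plain fully-connected ReLU network, assuming the layers are stored as a list of PyTorch nn.Linear modules; convolutional filters and batch-norm parameters are handled analogously but omitted here, and the target norm (unit norm in this sketch) is an assumption, since only the product of the two rescalings matters.

import torch
import torch.nn as nn

@torch.no_grad()
def normalize_mlp(layers):
    """Rescale each hidden unit so that its incoming weight vector has unit norm,
    leaving the function realized by a ReLU network unchanged.

    layers: list of nn.Linear modules applied in order, with ReLU in between."""
    for l in range(len(layers) - 1):
        w = layers[l].weight                        # (n_out, n_in); row i = incoming weights of unit i
        norms = w.norm(dim=1).clamp_min(1e-12)      # one norm per hidden unit
        w.div_(norms.unsqueeze(1))                  # incoming weights divided by the unit norm
        if layers[l].bias is not None:
            layers[l].bias.div_(norms)              # bias rescaled by the same factor
        layers[l + 1].weight.mul_(norms)            # outgoing weights (columns of the next layer) multiplied back
    # last layer: a single global rescaling leaves the argmax output invariant
    scale = layers[-1].weight.norm().clamp_min(1e-12)
    layers[-1].weight.div_(scale)
    if layers[-1].bias is not None:
        layers[-1].bias.div_(scale)
    return layers

# usage sketch:
# layers = [nn.Linear(784, 512), nn.Linear(512, 512), nn.Linear(512, 10)]
# normalize_mlp(layers)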

3.1.2 Alignment

Multi-layer networks (continuous or discrete) also have a discrete permutation symmetry, which allows the units in the hidden layers (neurons in fully-connected layers and filters in convolutional layers) to be exchanged. When comparing two networks $\mathcal{N}_A$ and $\mathcal{N}_B$, we break the symmetry by first normalizing both networks, and then by applying the following alignment procedure, again starting from the first layer and proceeding upward, see Alg. 2.

  Input: Two normalized NNs of the same type, $\mathcal{N}_A$ and $\mathcal{N}_B$, with $L$ layers and parameters $\{\theta^l_A\}$, $\{\theta^l_B\}$.
  for $l = 1$ to $L-1$ do
     $P^l \leftarrow$ Match($\theta^l_A$, $\theta^l_B$)
     $\theta^l_B \leftarrow$ PermutePrev($\theta^l_B$, $P^l$)
     $\theta^{l+1}_B \leftarrow$ PermuteNext($\theta^{l+1}_B$, $P^l$)
  end for
Algorithm 2 Neural Network Alignment

For each layer $l$, we use a matching algorithm to find the permutation $P^l$ of the unit indices of the second network that maximizes the cosine similarity with the units of the first network (we can ignore the norms, thanks to our choice of the normalization and the fact that we do not match the last layer; for the same reasons, this is also equivalent to minimizing a squared distance). The permutation is applied to both the ingoing and the outgoing indices of the weights of layer $l$ of the second network (as well as to the other parameters associated with the units, such as the biases and the batch-norm parameters).

In the case of discrete networks, we use the $\mathrm{sign}$ activation function instead of the $\mathrm{ReLU}$, thus instead of a continuous rescaling symmetry there is a discrete sign-reversal one (which allows flipping the signs of all ingoing and outgoing weights of any hidden unit). We break this symmetry during the alignment step: in the matching step, we use the absolute value of the cosine similarities in the optimization objective, then we apply the permutation, and finally for each unit we either flip its sign or not, based on which choice maximizes the cosine similarity.

This procedure is not guaranteed to realize a global distance minimization between the two networks in a worst-case scenario (and it is not clear whether such a goal would be computationally feasible). It also does not take into account the biases, if present. However, it guarantees that if $\mathcal{N}_A$ and $\mathcal{N}_B$ are the same network with shuffled hidden-unit labels, at the end they will be perfectly matched almost surely. Furthermore, it is simple to implement, computationally efficient, and rather general (it applies to fully-connected and convolutional layers, with continuous or discrete weights). Basing the matching on the weights (as opposed to using e.g. the activations) is also data-independent and consistent with our overall geometrical picture.
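A minimal sketch of the matching step for fully-connected layers, assuming the networks have already been normalized and are stored as lists of PyTorch nn.Linear modules; the sketch uses the Hungarian algorithm from scipy.optimize.linear_sum_assignment (for binary networks one would match on the absolute cosine similarities and then flip the signs, as described above). This is an illustration of the idea, not necessarily the exact implementation.

import torch
from scipy.optimize import linear_sum_assignment

@torch.no_grad()
def align_mlp(layers_a, layers_b):
    """Align network B to network A layer by layer: for each hidden layer, find
    the unit permutation maximizing the total cosine similarity and apply it to
    the ingoing and outgoing weights of B."""
    for l in range(len(layers_a) - 1):
        wa, wb = layers_a[l].weight, layers_b[l].weight          # (n_units, n_in), rows are unit weights
        sim = (wa @ wb.t()).cpu().numpy()                        # cosine similarities (weights are unit-norm)
        _, perm = linear_sum_assignment(sim, maximize=True)      # unit perm[i] of B matches unit i of A
        perm = torch.as_tensor(perm, dtype=torch.long)
        layers_b[l].weight.copy_(wb[perm])                       # permute ingoing weights (rows)
        if layers_b[l].bias is not None:
            layers_b[l].bias.copy_(layers_b[l].bias[perm])
        layers_b[l + 1].weight.copy_(layers_b[l + 1].weight[:, perm])  # permute outgoing weights (columns)
    return layers_b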

3.1.3 Geodesic paths

Given two (non-normalized) continuous networks $\mathcal{N}_A$ and $\mathcal{N}_B$, we want to consider the geodesic paths between the networks in function space. Formally, this is defined as the shortest path between all possible permutations of the networks in the normalized space. We approximate this with the path between the normalized-and-aligned networks, which can be computed easily thanks to our choice of the normalization. Consider first a hyper-sphere of radius $r$, and two vectors $w_A$ and $w_B$ on it. The angle between them is $\theta = \arccos\left(w_A \cdot w_B / r^2\right)$, and their (geodesic) distance is $d = r\theta$. A generic point along the geodesic, located at distance $\gamma d$ from $w_A$, where $\gamma \in [0,1]$, can be expressed as $w(\gamma) = \left(\sin((1-\gamma)\theta)\,w_A + \sin(\gamma\theta)\,w_B\right)/\sin\theta$. This is easily extended to the metric on the full network: given two normalized-and-aligned networks, we can simply apply this formula independently (but with the same $\gamma$) to each hidden unit of the first $L-1$ layers and to the last layer. Similarly, the overall squared geodesic distance is just the sum of the squared geodesic distances within each spherical sub-manifold.
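For concreteness, a minimal sketch of this spherical interpolation for a single pair of equal-norm weight vectors (the same routine is applied, with the same gamma, to every normalized unit and to the last layer); the function names are illustrative only.

import torch

def geodesic_point(w_a, w_b, gamma, eps=1e-7):
    """Point at fraction gamma (in [0, 1]) along the great-circle arc from w_a
    to w_b, two vectors of equal norm lying on the same hyper-sphere."""
    r2 = w_a.norm() * w_b.norm()
    cos_theta = torch.clamp((w_a @ w_b) / r2, -1.0 + eps, 1.0 - eps)
    theta = torch.arccos(cos_theta)
    return (torch.sin((1 - gamma) * theta) * w_a + torch.sin(gamma * theta) * w_b) / torch.sin(theta)

def geodesic_distance(w_a, w_b, eps=1e-7):
    """Arc length r * theta between two vectors of equal norm r."""
    r = w_a.norm()
    cos_theta = torch.clamp((w_a @ w_b) / r**2, -1.0 + eps, 1.0 - eps)
    return r * torch.arccos(cos_theta)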

For discrete binary networks we use the Hamming distance to measure the discrepancy between two (aligned) networks. In this case there is no analog to the geodesic path, since there are multiple shortest paths connecting any two networks: each path corresponds to a choice of the order in which to change the weights that differ between the two networks. In our tests, we simply pick one such path at random.

3.2 Minima with Different Flatness and Generalization

We sample different kinds of solutions by using different algorithms. The standard minima are found by using the Stochastic Gradient Descent (SGD) algorithm with Nesterov momentum. In order to find flatter minima, we use Replicated-SGD (RSGD), see Pittorino et al. (2021), which was designed for this purpose. To find sharper minima, we use the adversarial initialization described in Liu et al. (2020) followed by the SGD algorithm without momentum. We call this method ADV; it was developed to overfit the dataset and produce poorly generalizing solutions, but since there is a known correlation between flatness and generalization error, we expect (and indeed confirm a posteriori) that these solutions are sharp.
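For intuition only, the following is a heavily simplified sketch of the replica-coupling idea behind RSGD, not the exact implementation of Pittorino et al. (2021): each of several replicas takes a gradient step on its own loss plus an elastic pull towards the replica mean, with a coupling gamma that is gradually increased during training. The learning rate, coupling schedule and the use of a shared mini-batch are placeholders.

import torch

def replicated_sgd_step(replicas, loss_fn, x, y, lr=0.05, gamma=0.1):
    """One simplified replicated-SGD step.

    replicas: list of models with identical architecture. Each replica follows
    the gradient of its loss plus an elastic force towards the replica mean."""
    # current "center": mean of each parameter across replicas
    with torch.no_grad():
        per_replica = [list(r.parameters()) for r in replicas]
        center = [torch.mean(torch.stack([p[i] for p in per_replica]), dim=0)
                  for i in range(len(per_replica[0]))]
    for r in replicas:
        loss = loss_fn(r(x), y)                 # in practice each replica sees its own mini-batch
        grads = torch.autograd.grad(loss, list(r.parameters()))
        with torch.no_grad():
            for p, g, c in zip(r.parameters(), grads, center):
                p -= lr * (g + gamma * (p - c))  # gradient step plus elastic coupling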

In all cases, we do not use regularization or data augmentation; the only image pre-processing we use in our experiments is to normalize the images to zero mean and unit variance.

In the case of discrete binary networks we used the same techniques, but built on top of the BinaryNet training scheme (see e.g. Simons & Lee (2019)), which is a variant of SGD tailored to binary weights. Our implementation differs from the original one (Hubara et al., 2016) in that the output layer is also binary (details in Appendix Sec. B).

4 Neural Networks with Continuous Weights

In this section we present results on the energy landscape geometry and connectivity obtained on NNs with continuous weights. We consider 3 different architectures of increasing complexity: a) a multi-layer perceptron (MLP) with fully-connected hidden layers; b) a small LeNet-like architecture with two convolutional layers followed by MaxPool and one fully-connected layer; c) the VGG16 architecture (Zhang et al., 2015) with batch normalization. We train architectures a) and b) on the MNIST, Fashion-MNIST and CIFAR-10 datasets, while architecture c) is trained on the CIFAR-10 dataset.

4.1 Flatness and generalization

We train these 3 networks with the 3 algorithms described in Sec. 3.2 (RSGD, SGD, ADV) for a number of epochs sufficient to find a configuration at zero (or near-zero) training error (Jiang* et al., 2020). In Fig. 1 we show that the flatness of the solutions, measured by the local energy, follows the expected ranking between the algorithms, and the expected correlation with the generalization error, confirming that flatter minima generalize better. The local energy is defined as the average train-error difference obtained by perturbing a given configuration $w$ with a multiplicative Gaussian noise of varying amplitude $\sigma$; in formulas: $\delta E_{\mathrm{train}}(w,\sigma) = \mathbb{E}_z\left[E_{\mathrm{train}}\left(w \odot (1+\sigma z)\right)\right] - E_{\mathrm{train}}(w)$, where $\odot$ is element-wise multiplication and $z \sim \mathcal{N}(0, I)$.
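A sketch of how this quantity can be estimated in practice; error_fn is a placeholder for a routine computing the training error of the network given its flattened parameter vector.

import torch

def local_energy(w, error_fn, sigma, n_samples=10):
    """Average train-error increase under multiplicative Gaussian perturbations
    w -> w * (1 + sigma * z), z ~ N(0, I). w is the flattened parameter vector."""
    e0 = float(error_fn(w))
    deltas = []
    for _ in range(n_samples):
        z = torch.randn_like(w)
        deltas.append(float(error_fn(w * (1 + sigma * z))) - e0)
    return sum(deltas) / len(deltas)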

Figure 1: Flatness and generalization for several algorithms. Green triangles: RSGD; blue circles: SGD; red squares: ADV. Columns 1 and 3: local energies as a function of noise amplitude; columns 2 and 4: corresponding local energies (at maximal reported noise) versus test error. Shades and error bars are standard deviations.

4.2 Landscape geometry and connectivity

Analyzing the connectivity of minima with different flatness levels allows us to explore the fine structure of the energy landscape, shedding light on the geometry of possible connected basins and/or on the presence of isolated regions. To this aim, for each of the six possible pair types among the three kinds of solutions, we choose several distinct pairs of solutions: pairs of the same flatness (RSGD-RSGD, SGD-SGD, ADV-ADV) and pairs of different flatness (RSGD-SGD, RSGD-ADV, SGD-ADV).

Geodesic paths

We show the error along the geodesic paths connecting pairs of minima for the different networks and datasets in Fig. 2. The top row in the figure demonstrates the effect of accounting for the symmetries, by showing three sets of curves, one set per panel, left to right: linear paths between the raw configurations as output by the algorithms; linear paths in which the endpoint networks are aligned but not normalized (we still maximize the layers' cosine similarity when performing the alignment); the geodesic paths between the normalized-and-aligned networks. We can see that: a) taking into account the permutation symmetry, the barriers are lowered; b) following the geodesic removes the distortion in the paths, such that they appear flatter towards flatter solutions, as they should; c) considering these two symmetries, the barrier heights (in particular their maximal values) follow a general overall ranking which is correlated with the flatness level of the corresponding solutions: the RSGD-RSGD one is consistently the lowest, followed by RSGD-SGD and SGD-SGD, then RSGD-ADV and SGD-ADV, and finally ADV-ADV is consistently the highest.

The top row of Fig. 2 is a representative example; analogous figures for all the other networks/datasets considered in the paper are reported in the Appendix, Sec. A.2. In all cases, accounting for the symmetries of the network is indeed critical to reveal these seemingly very general geometrical features.

Notably, by sampling solutions with a high flatness level using the RSGD algorithm, we are able to target connected structures and to find them, even in cases like VGG in which they could otherwise appear to be missing (Entezari et al., 2021), see the black path in the lower-right panel of Fig. 2.

Figure 2: Error landscape along one-dimensional paths connecting minima of different flatness (minima appear on the left/right following the up/down order of the legend). All distances have been rescaled to $[0,1]$. The top row highlights the general effect of the symmetries in a representative example (LeNet on Fashion-MNIST). All other panels report the geodesic-aligned paths representing the final result of our analysis.
Figure 3: Removing the permutation symmetry eliminates the barriers that may appear in the single-bend optimized paths. Left to right: linear path; linear-aligned path; geodesic-aligned path.
Optimized paths

Our procedure to find the optimized paths is as follows: we first align the two endpoint networks, then find their midpoint, optimize it with SGD until it reaches zero training error, and explore the two aligned geodesics connecting the optimized midpoint and the endpoints.

We report a representative example in Fig. 3 (for the other networks/datasets see Appendix Sec. A.2). Again, we contrast three situations: one in which we use straight segments between non-normalized, unaligned configurations, one where we remove the permutation symmetry, and one where we use the geodesic paths. The barriers are much lower than for the unoptimized paths even in the unaligned linear case, but removing the permutation symmetry still has an important effect in lowering the barriers, and in most cases (like the one shown here) it removes them entirely.

Figure 4: Bi-dimensional landscape sections. Top row: LeNet, Fashion-MNIST. Bottom row: VGG16, CIFAR-10. Left column: without normalization, lines are linear paths. Right column: with normalization, lines are distorted geodesics. In each panel the left point is RSGD, the right point is unaligned ADV, the middle/top point is aligned ADV. Aligning the NNs can lower the barriers (though not for VGG in this RSGD-ADV case); normalization reveals the geometry around them.
Bi-dimensional visualizations

Following Garipov et al. (2018), we studied the error landscape on bi-dimensional sections of the parameter space, specified by 3 configurations, using the Gram-Schmidt procedure. In Fig. 4 we show the results on the planes defined by an RSGD solution (left-most point), an unaligned ADV solution (right-most point) and the corresponding aligned ADV solution (top point). We show the non-normalized (left panels) and normalized (right panels) cases. In the normalized case, the plane as a whole does not lie in normalized space (only the three chosen points do) and thus the plot suffers from some (presumably mild) distortion. Nevertheless, we can still see that accounting for the permutation symmetry alone reduces the distance between different minima and can lower the barriers between them, but also that only after normalization are the expected relative sizes of the basins revealed.
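A sketch of the construction of such a plane from three flattened configurations via Gram-Schmidt (evaluation of the error on the resulting grid is omitted); the function names are placeholders.

import torch

def plane_basis(w0, w1, w2):
    """Orthonormal basis (u, v) of the plane through three flattened parameter
    vectors, with w0 taken as the origin (Gram-Schmidt on w1 - w0 and w2 - w0)."""
    u = w1 - w0
    u = u / u.norm()
    v = w2 - w0
    v = v - (v @ u) * u          # remove the component of v along u
    v = v / v.norm()
    return u, v

def plane_point(w0, u, v, a, b):
    """Configuration corresponding to in-plane coordinates (a, b)."""
    return w0 + a * u + b * v

def plane_coords(w, w0, u, v):
    """In-plane coordinates of a configuration w (used to place the three solutions)."""
    d = w - w0
    return float(d @ u), float(d @ v)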

Distances.

We studied the distances between pairs of solutions, categorized according to their flatness. Some representative results are reported in Fig. 5; the full results are in Appendix Sec. A.4. When NNs are normalized and aligned, we consistently find that flatter solutions are closer to each other than sharper ones. In optimized paths, the optimized midpoints end up closer to the flatter solutions. Although our sampling is very limited, these results (together with all the previous ones) are compatible with the octopus-shaped geometrical structures described in Baldassi et al. (2021b).

Figure 5: Distances between configurations, MLP on MNIST. (Left panel) Distance between configurations independently sampled by different algorithms (on the x-axis: R=RSGD; S=SGD; A=ADV). (Middle panel) Distance between the optimized midpoint, initialized as the mean of the two configurations on the x-axis, and the configuration indicated on the left. (Right panel) Same as the middle panel, but the distance is between the optimized midpoint and the configuration indicated on the right on the x-axis.

5 Neural Networks with Binary Weights

In this section we present some results on the error landscape connectivity of binary NNs, in which each weight (and activation) is either $+1$ or $-1$, a topic that is almost absent in the literature.

We considered two main scenarios: shallow networks on synthetic datasets, for which a comparison with some theoretical results is possible, and deep networks, both MLP and convolutional NNs, on real data.

5.1 Over-parameterized Shallow Networks on Synthetic Datasets

Figure 6: Fully connected binary CM trained on HMM data (see text). Top left: local energies for different classes of solutions. Inset: local energy at a fixed distance from a solution (indicated in the main plot) as a function of the number of parameters. Top right: test errors decrease with overparameterization and correlate with local energies. Bottom left: average Hamming distances between different solutions, before and after removal of symmetries. All distances grow with overparameterization. Bottom right: maximum barrier height (train error percentage) along a random shortest path connecting two solutions. Barriers go to zero with overparameterization.

In order to bridge the gap with the theory, we performed numerical experiments on the error landscape connectivity in shallow binary architectures trained on data generated by the so-called Hidden Manifold Model (HMM), also known as the Random Features Model (Goldt et al., 2020; Gerace et al., 2020; Baldassi et al., 2021a).

The HMM is a mismatched student-teacher setting. The teacher generating the labels is a simple one-layer network, whose inputs are $D$-dimensional random i.i.d. patterns. The student does not see these original patterns, but a non-linear random projection onto an $N$-dimensional space. By varying the relative size of the projection, the degree of overparameterization can be controlled. This arrangement aims to provide an analytically tractable model that retains some relevant features of the most common real-world vision datasets (Goldt et al., 2020).
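As an illustration, a minimal sketch of one common formulation of this data-generating process (the precise scalings and non-linearity used in our experiments may differ; all names here are placeholders).

import numpy as np

def hidden_manifold_data(P, D, N, nonlin=np.tanh, seed=0):
    """Generate P samples from a Hidden Manifold / random-features model:
    latent i.i.d. patterns c in R^D, teacher labels y = sign(w_T . c),
    student inputs x = nonlin(F c / sqrt(D)) in R^N, with F a fixed random matrix."""
    rng = np.random.default_rng(seed)
    w_teacher = rng.standard_normal(D)       # teacher weights in the latent space
    F = rng.standard_normal((N, D))          # fixed random projection
    C = rng.standard_normal((P, D))          # latent patterns (one per row)
    y = np.sign(C @ w_teacher)               # teacher labels
    X = nonlin(C @ F.T / np.sqrt(D))         # inputs seen by the student
    return X, y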

We trained both a binary perceptron and a fully connected binary committee machine (CM), i.e. a network with a single hidden layer where the weights of the output layer are fixed to $+1$. Using data from an HMM, we analyzed the error landscape around solutions of different flatness, at varying levels of overparameterization (all implementation details are reported in Appendix Sec. B).

In this regime, the analysis of Baldassi et al. (2021a) suggests a scenario where algorithmically accessible solutions are arranged in a connected zero-error landscape, with flatter solutions surrounded by sharper ones. Overall, the numerical findings we report here are consistent with this scenario.

In Fig. 6 we report the results for the binary CM (similar results were obtained for the binary perceptron, see Appendix Sec. B). As the number of parameters is increased, the flatness of all the solutions increases (while the ranking RSGD-SGD-ADV is preserved), and the generalization error is correlated with the flatness, as expected. As more parameters are added and solutions become flatter, the average maximum error along random linear paths connecting two solutions decreases. We observe a robust correlation between barrier heights and flatness: the flatter the solution, the lower the barrier. However, the barriers are not significantly affected by aligning the networks.

5.2 Deep Architectures on Real-World Datasets

Figure 7: Three-hidden-layer binary MLP trained on the Fashion-MNIST dataset (top row) and binary convolutional NN trained on CIFAR-10 (bottom row). Left: local energies for different types of solutions (test errors are reported in the legend). Right: train error percentage along random linear paths connecting solutions, for raw solutions (light curves) and aligned solutions (darker curves). Insets: errors along the optimized paths. The change in distance due to symmetry removal can be appreciated on the x-axis.

Figure 8: Train errors in the plane spanned by three solutions for a binary MLP trained on Fashion-MNIST, before (left column) and after (right column) symmetry removal. We compare ADV (top row) and RSGD (bottom row) solutions.

We consider two deep binary NNs: an MLP with three hidden layers, trained on the Fashion-MNIST dataset, and a convolutional NN trained on CIFAR-10 (implementation details in Appendix Sec. B).

For both architectures we analyzed the error landscape along random shortest paths connecting pairs of solutions with different flatness. Results are reported in Fig. 7. As expected, the average barrier height correlates with the local energies of the solutions (and in turn with test errors). The barriers are lowered when symmetries are removed, and are lowered further when the paths are optimized (although they remain considerably large, especially for the convolutional NNs).

The effect of removing symmetries on the error landscape connectivity can be appreciated in Fig. 8 (see also Appendix Fig. 23 for analogous results on the convolutional network), where we projected the error landscape onto the plane spanned by the different solutions. We used the internal continuous weights used by BinaryNet (of which the actual binary weights are just the $\mathrm{sign}$) and proceeded as in the continuous case (Sec. 4.2), but at each point we binarized the configurations in order to obtain the errors (details in Appendix Sec. B). The resulting projections are a heavily distorted representation, but the effect of symmetry removal on the barriers is rather striking, especially in the case of the wider minima, revealing the presence of a connected structure.

6 Conclusions and discussion

We investigated numerically several features of the error landscape in a wide variety of neural networks, building on a systematic geometric approach that dictates the removal of all the symmetries in the represented functions. The methods we developed are approximate but simple, rather general and efficient, and proved to be critically important to our findings. By sampling different kinds of minima, investigating the landscape around them and the paths connecting them, we found a number of fairly robust features shared by all models. In particular, besides confirming the known connection between wide minima and generalization, our results support the conjecture of Baldassi et al. (2021a): that, for sufficiently overparameterized networks, wide regions of solutions are organized in a complex, octopus-shaped structure with the denser regions clustered at the center, from which progressively sharper minima branch out. Intriguingly, a similar phenomenon has been recently observed also in the—rather different—context of systems of coupled oscillators (Zhang & Strogatz, 2021).

Our work lies at the intersection of two lines of research that have seen significant interest lately: one on mode connectivity and the structure of neural network landscapes and the other on flat minima and their connection to generalization performance. We believe that further systematic explorations of these topics can produce results of great theoretical interest and significant algorithmic implications.

Acknowledgements

FP acknowledges the European Research Council for Grant No. 834861 SO-ReCoDi.

References

Appendix A Neural Networks with Continuous Weights

We report complete and additional results on the Neural Networks (NNs) with continuous weights studied in the main paper.

A.1 Numerical details and parameters

The training parameters for the algorithms for all the networks/datasets tested in the main paper are as follows. (SGD) SGD with Nesterov momentum and an initial learning rate with cosine annealing. (RSGD) Replicated-SGD with Nesterov momentum and an initial learning rate with cosine annealing, using several replicas with an exponential schedule on the interaction parameter (see Pittorino et al. (2021) for details on this algorithm). (ADV) For the configurations obtained with the adversarial initialization, we use vanilla SGD and an initial learning rate with cosine annealing (see Liu et al. (2020) for details on the generation of this initialization: we replicate the dataset once with random labels and we zero out a random fraction of the pixels in each image). We train all networks long enough to reach (near-)zero training error. Neither data augmentation nor regularization is used in our experiments.

A.2 One-Dimensional Paths

We add here further results on the comparison among Linear, Linear-Aligned and Geodesic-Aligned one-dimensional paths on the networks/datasets with continuous weights studied in the main paper. We report in Fig. 9 the paths without optimizing the midpoint, while Fig. 10 shows the results obtained by optimizing it (the single-bend optimized paths).

Figure 9: Error landscape along one-dimensional paths connecting minima of different flatness (minima appear on the left/right following the up/down order of the legend). For each network/dataset the comparison highlights the general effect of the symmetries. The geodesic-aligned paths represent the final result of our analysis.
Figure 10: Removing the permutation symmetry eliminates the barriers that may appear in the single-bend optimized paths. For each network/dataset, left to right: linear path; linear-aligned path; geodesic-aligned path. For these optimized paths with one bend, the midpoint is optimized with Nesterov momentum and an initial learning rate with cosine annealing in order to reach a training error equal or almost equal to zero.

A.3 Bi-Dimensional Visualization

In this section we report bi-dimensional visualizations for some of the networks/datasets explored in the main paper. In Fig. 11 we compare train and test errors for LeNet on Fashion-MNIST (RSGD-ADV); in Figs. 12 and 13 we report visualizations for all pairs of LeNet on Fashion-MNIST (without and with normalization, respectively); in Figs. 14 and 15 we report visualizations for all pairs of MLP on CIFAR-10 (without and with normalization, respectively); in Fig. 16 we show the effects of permutation symmetry removal on VGG16 on CIFAR-10 for solutions of varying sharpness.

Figure 11: Bi-dimensional sections, LeNet on Fashion-MNIST. Train errors (panels 1 and 3) and test errors (panels 2 and 4). Comparison of RSGD (left points), unaligned ADV (right points), aligned ADV (top points). Panels 1 and 2: without normalization. Panels 3 and 4: with normalization. Dashed lines represent linear (panels 1 and 2) and geodesic paths (panels 3 and 4).
Figure 12: Bi-dimensional sections of the train error, LeNet on Fashion-MNIST, without normalization. Top points are the aligned version of the right point w.r.t. the left point. Top row: (left) left-right points: RSGD-RSGD; (middle) left-right points: SGD-SGD; (right) left-right points: ADV-ADV. Bottom row: (left) left-right points: RSGD-SGD; (middle) left-right points: RSGD-ADV; (right) left-right points: SGD-ADV. Aligning the NNs lowers the barriers and in some cases reveals that solutions lie in closer and connected basins. Dashed lines represent linear paths.
Figure 13: Bi-dimensional sections of the train error, LeNet on Fashion-MNIST, with normalization. Top points are always the aligned version of the right point w.r.t. the left point. Top row: (left) left-right points: RSGD-RSGD; (middle) left-right points: SGD-SGD; (right) left-right points: ADV-ADV. Bottom row: (left) left-right points: RSGD-SGD; (middle) left-right points: RSGD-ADV; (right) left-right points: SGD-ADV. Normalization reveals the geometry around the solutions. Dashed lines represent distorted geodesic paths.
Figure 14: Bi-dimensional sections of the train error, MLP on CIFAR-10, without normalization. Top points are the aligned version of the right point w.r.t. the left point. Top row: (left) left-right points: RSGD-RSGD; (middle) left-right points: SGD-SGD; (right) left-right points: ADV-ADV. Bottom row: (left) left-right points: RSGD-SGD; (middle) left-right points: RSGD-ADV; (right) left-right points: SGD-ADV. Aligning the NNs lowers the barriers and in some cases reveals that solutions lie in closer and connected basins. Dashed lines represent linear paths.
Figure 15: Bi-dimensional sections of the train error, MLP on CIFAR-10, with normalization. Top points are always the aligned version of the right point w.r.t. the left point. Top row: (left) left-right points: RSGD-RSGD; (middle) left-right points: SGD-SGD; (right) left-right points: ADV-ADV. Bottom row: (left) left-right points: RSGD-SGD; (middle) left-right points: RSGD-ADV; (right) left-right points: SGD-ADV. Normalization reveals the geometry around the solutions. Dashed lines represent distorted geodesic paths.
Figure 16: Bi-dimensional sections of the train error, effect of the permutation symmetry on VGG16 trained on CIFAR-10. Top points are the aligned version of the right point w.r.t. the left point. (Left) left-right points: RSGD-RSGD; (Middle) left-right points: SGD-SGD; (right) left-right points: ADV-ADV. Aligning the NNs lowers the barriers and in some cases reveals that solutions lie in closer and connected basins (for RSGD-RSGD and to a lesser extent for SGD-SGD), while it is not sufficient in other cases (ADV-ADV). Dashed lines represent linear paths.

A.4 Distances

We report in Fig. 17 the distances between configurations and optimized midpoints for the networks/datasets studied in the main paper.

Figure 17: Distances between optimized configurations. For all networks/datasets: (Left panel) distance between configurations independently sampled by different algorithms (on the x-axis: R=RSGD; S=SGD; A=ADV). (Middle panel) Distance between the optimized midpoint, initialized as the mean of the two configurations on the x-axis, and the configuration indicated on the left. (Right panel) Same as the middle panel, but the distance is between the optimized midpoint and the configuration indicated on the right.

Appendix B Neural Networks with Binary Weights

Here we provide implementation details for the experiments presented in Section 5, as well as additional numerical tests. In all our experiments we used the BinaryNet training scheme (see e.g. Simons & Lee (2019)), which is a variant of SGD tailored to binary weights. Notice that our implementation differs from the original one (Hubara et al., 2016) because we use binary weights in the whole network, including the output layer. In all cases, for each solution type (ADV, SGD, RSGD) we were able to obtain solutions with zero or close-to-zero train error.

Local Energies

The local energy provides a robust measure of the flatness of a given solution. It has been shown to be highly correlated with generalization errors (Jiang* et al., 2020; Pittorino et al., 2021). It is defined as the average error increase as a function of the distance from a reference solution. In the case of binary-weight neural networks, we perturb a solution $w$ by changing the sign of a random fraction $p$ of its weights, and measure the train error for random choices of the perturbed weights: $\delta E_{\mathrm{train}}(w, p) = \mathbb{E}\left[E_{\mathrm{train}}(w_p)\right] - E_{\mathrm{train}}(w)$, where $w_p$ denotes the perturbed weights and $E_{\mathrm{train}}(w)$ is the train error of the unperturbed solution. By varying $p$ we are able to obtain the profile of the errors as a function of the Hamming distance from the reference solution. In all the experiments, for each value of $p$ we average the errors over independent realizations of the choice of the perturbed weights.
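A sketch of this estimate for a flattened +/-1 weight vector; error_fn is a placeholder returning the train error of the corresponding network.

import numpy as np

def binary_local_energy(w, error_fn, p, n_samples=10, seed=0):
    """Average train-error increase after flipping a random fraction p of the
    signs of a binary (+/-1) weight vector w."""
    rng = np.random.default_rng(seed)
    e0 = error_fn(w)
    n_flip = int(p * w.size)
    deltas = []
    for _ in range(n_samples):
        idx = rng.choice(w.size, size=n_flip, replace=False)
        w_pert = w.copy()
        w_pert[idx] *= -1                    # flip the chosen signs
        deltas.append(error_fn(w_pert) - e0)
    return float(np.mean(deltas))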

Random Linear Paths and Optimized Paths

In order to explore the barriers between different solutions we measured the train error along the shortest paths connecting them. Given a source solution and a destination solution, we simply count the number of weights with different sign (which corresponds to the extensive Hamming distance), progressively change the sign of those weights in the source solution in order to approach the destination solution, and measure the errors along the path. In all the experiments we consider several solutions for each of ADV, SGD, RSGD (see Sec. 3.2). The reported paths are averaged over all the possible paths among solutions of a given type (back and forth), and over realizations of the random path (the weight-flipping order).
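A sketch of one such random shortest path between two flattened +/-1 solutions; error_fn is again a placeholder for the train-error routine, and the number of recorded points along the path is arbitrary.

import numpy as np

def random_hamming_path(w_src, w_dst, error_fn, n_points=50, seed=0):
    """Train error along a random shortest path between two +/-1 weight vectors:
    the coordinates on which they differ are flipped in a random order."""
    rng = np.random.default_rng(seed)
    differing = rng.permutation(np.flatnonzero(w_src != w_dst))
    w = w_src.copy()
    errors = [error_fn(w)]
    for chunk in np.array_split(differing, n_points):   # flip in roughly equal chunks
        w[chunk] = w_dst[chunk]
        errors.append(error_fn(w))
    return errors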

Optimized paths are random linear paths with one bend. We take a weight vector located at equal distance from the two solutions, and use it to initialize SGD. In all the experiments the middle point has been optimized using the same parameters as the SGD-type solutions. We then report the random linear paths between the source and the optimized middle point, and between the optimized middle point and the destination. For each of the possible couples of solutions we averaged over independent choices of the middle point and realizations of the random paths.

Shallow binary networks and the Hidden Manifold Model

We trained both a binary perceptron and a binary fully-connected committee machine (CM) on synthetic data generated by the so-called Hidden Manifold Model (HMM) (Goldt et al., 2020; Gerace et al., 2020; Baldassi et al., 2021a), also known as the Random Features Model (Mei & Montanari, 2019).

In the HMM the data are first generated in a $D$-dimensional manifold, where a perceptron teacher with weights $w_T$ assigns a label to each of $P$ random i.i.d. patterns $c^\mu$ according to $y^\mu = \mathrm{sign}(w_T \cdot c^\mu)$. The student then sees a non-linear random projection of the data into an $N$-dimensional space, $x^\mu = \varphi(F c^\mu)$, where $F$ is a fixed random matrix with i.i.d. elements and $\varphi$ is a non-linear function applied element-wise, and must classify it according to the teacher label. In our experiments we fixed the size $D$ of the perceptron teacher (i.e. the dimension of the data points in their hidden manifold) and the number of patterns $P$, so that we are working at a fixed ratio $P/D$. By increasing the size $N$ of the projection, we are able to study the system in the over-parameterized regime (see also Baldassi et al. (2021a)). In the case of the CM we fixed the hidden layer size and varied the input size $N$.

In Fig. 18 we report the results for the binary perceptron, analogous to the ones presented in the main text in Fig. 6. In the perceptron case there is no redundancy in the function expressed by the student model, so that solutions do not need to be aligned. As for the CM, as the number of parameters grows we can see that solutions get flatter and the maximum error along random paths connecting them approaches zero. Even in this case, we observe a strong correlation among flatness (as measured with the local energy), generalization errors and maximum barrier heights.

We report the training settings for both models. Each model is trained with the binary cross-entropy loss and SGD without momentum. For both the binary perceptron and the CM we used the following protocols: (SGD) plain SGD with a fixed learning rate; (RSGD) several replicas coupled with an elastic constant that grows with the epoch; (ADV) networks are first trained on a modified train set with randomized labels, and the resulting configuration is then used as the initial condition for a further SGD optimization.

Figure 18: Binary perceptron trained on data generated by the Hidden Manifold Model (see also Fig. 6). (Left) Local energy of different solutions. Inset: local energy at a fixed distance as a function of the number of parameters. (Center Left) Generalization errors of different solutions as a function of the number of parameters. (Center Right) Average Hamming distance between different solutions. (Right) Average maximum train error (percentage) along random linear paths connecting different solutions.
Binary Multi-layer Perceptrons

We consider two binary multi-layer perceptron (MLP) architectures: a) a two-hidden-layer MLP trained on a subset of MNIST images (binary classification of the parity of the digits), and b) a three-hidden-layer MLP trained on Fashion-MNIST. For architecture a) we analyzed the effect of removing symmetries as the network size is increased, by increasing the width of the hidden layers, while MLP b) has three hidden layers of fixed size. In case a), for all hidden layer widths, we optimized the binary cross-entropy loss using SGD with no momentum, with a fixed batch size and learning rate. For the ADV solutions we used as a starting point for SGD the result of several epochs of SGD optimization on a train set where the labels have been randomized. For the RSGD solutions we used replicas coupled with an epoch-dependent elastic constant. In case b) we used Adam optimization for both the SGD and RSGD solutions; for RSGD we used the same number of replicas and elastic constant as in case a). The ADV solutions have been obtained by first training with SGD with no momentum on the train set with random labels, and then for additional epochs with the same learning rate and optimization algorithm.

Results for case b) are reported in the main text (see Figs. 7 and 8).

In case a) we performed experiments analogous to the ones presented for shallow binary NNs on the HMM (see Figs. 6 and 18). The results are reported in Fig. 19. As in the simpler synthetic dataset scenario, there is a strong correlation between flatness, generalization errors and barrier heights. However, while in this case taking into account the permutation and sign-reversal symmetries considerably lowers the barriers and the average distances between solutions, the barriers do not seem to approach zero error in the limit of a large number of parameters.

Figure 19: Binary MLP with two hidden layers, trained on a subset of the MNIST dataset (binary classification). (Left) Local energy for the three types of solutions considered. Inset: local energy at a fixed distance as a function of the number of parameters. (Center Left) Test errors for the three types of solutions as a function of the number of parameters. (Center Right) Average Hamming distance among solutions, before (light colors) and after (dark colors) the solutions have been aligned. (Right) Maximum barrier heights (in percentage of train error) along random linear paths connecting solutions, before and after the solutions have been aligned.
Binary Convolutional Network

We trained a convolutional network with binary weights. The first two layers are convolutional layers with no padding, each followed by a maxpool layer and a sign activation function. The two convolutional layers are followed by fully connected layers. We used the following optimization settings: (SGD) we trained the model with Adam optimization, with the learning rate periodically reduced during training. (RSGD) We used replicas coupled with an elastic constant that grows with the epoch; all the other parameters are the same as for SGD, except that we optimized for a larger number of epochs. (ADV) We first initialize the solutions by optimizing on a modified train set where the labels have been randomized, and then train them with Adam with a learning rate that is periodically halved.

Bi-Dimensional Error Landscapes

We describe the procedure used to produce the bi-dimensional error landscape plots reported in the main text (Fig. 8). Given three solutions, we pick the continuous weights associated with the binary ones (see Hubara et al. (2016) for more insights on the relation between continuous and binary weights in the BinaryNet optimization scheme) and use them to construct an orthonormal basis using the Gram-Schmidt procedure (as for the case of continuous NNs, Sec. 4.2). At each point in the plane, we binarize the weights by taking their sign, and report the train error. With this bi-dimensional projection of the error landscape one can graphically appreciate the effect of removing symmetries. In Figs. 20, 21 and 22 we report the error landscape for the three types of solutions considered (ADV, SGD, RSGD) in the case of the two-hidden-layer MLP trained on MNIST, as a function of the number of parameters. Without taking into account the symmetries, the solutions appear to be isolated, even while their flatness increases. However, once the solutions are aligned by removing the symmetries, a different landscape appears, in which they are connected by low-error paths. In Fig. 23 we show the effect of removing symmetries for ADV and RSGD solutions of the binary convolutional network (analogous to Fig. 8).

Figure 20: Bi-dimensional error landscape for ADV solutions of a two-hidden-layer MLP trained on MNIST. From top to bottom the width of the hidden layers is increased. Left column: raw solutions. Right column: top and right solutions have been aligned to the left solution.
Figure 21: Bi-dimensional error landscape for SGD solutions of a two-hidden-layer MLP trained on MNIST. From top to bottom the width of the hidden layers is increased. Left column: raw solutions. Right column: top and right solutions have been aligned to the left solution.
Figure 22: Bi-dimensional error landscape for RSGD solutions of a two-hidden-layer MLP trained on MNIST. From top to bottom the width of the hidden layers is increased. Left column: raw solutions. Right column: top and right solutions have been aligned to the left solution.
Figure 23: Train errors in the plane spanned by three solutions (for both ADV and RSGD solutions) for a binary convolutional NN trained on CIFAR-10 (see also Fig. 8). Going from left to right panels one can appreciate the effect of removing symmetries in the error landscape.