Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs

Timur Garipov et al., 02/27/2018

The loss functions of deep neural networks are complex and their geometric properties are not well understood. We show that the optima of these complex loss functions are in fact connected by a simple polygonal chain with only one bend, over which training and test accuracy are nearly constant. We introduce a training procedure to discover these high-accuracy pathways between modes. Inspired by this new geometric insight, we propose a new ensembling method entitled Fast Geometric Ensembling (FGE). Using FGE we can train high-performing ensembles in the time required to train a single model. We achieve improved performance compared to the recent state-of-the-art Snapshot Ensembles, on CIFAR-10 and CIFAR-100, using state-of-the-art deep residual networks. On ImageNet we improve the top-1 error-rate of a pre-trained ResNet by 0.56%.


1 Introduction

The loss surfaces of deep neural networks (DNNs) are highly non-convex and can depend on millions of parameters. The geometric properties of these loss surfaces are not well understood. Even for simple networks, the number of local optima and saddle points is large and can grow exponentially in the number of parameters (Auer et al., 1996; Choromanska et al., 2015; Dauphin et al., 2014). Moreover, the loss is high along a line segment connecting two optima (e.g., Goodfellow et al., 2015; Keskar et al., 2017). These two observations suggest that the local optima are isolated.

In this paper, we provide a new training procedure which can in fact find paths of near-constant accuracy between the modes of large deep neural networks. Furthermore, we show that for a wide range of architectures we can find these paths in the form of a simple polygonal chain of two line segments. Consider, for example, Figure 1, which illustrates the ResNet-164 $\ell_2$-regularized cross-entropy train loss on CIFAR-100, through three different planes. We form each two-dimensional plane from all affine combinations of three weight vectors.

Footnote 1: Suppose we have three weight vectors $w_1, w_2, w_3$. We set $u = w_2 - w_1$ and $v = (w_3 - w_1) - \frac{\langle w_3 - w_1,\, w_2 - w_1\rangle}{\|w_2 - w_1\|^2}(w_2 - w_1)$. Then the normalized vectors $\hat u = u / \|u\|$ and $\hat v = v / \|v\|$ form an orthonormal basis in the plane containing $w_1, w_2, w_3$. To visualize the loss in this plane, we define a Cartesian grid in the basis $\hat u, \hat v$ and evaluate the networks corresponding to each of the points in the grid. A point $P$ with coordinates $(x, y)$ in the plane is then given by $P = w_1 + x \cdot \hat u + y \cdot \hat v$.
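As a concrete illustration, the plane construction in the footnote can be sketched in a few lines of NumPy. This is not the authors' released code: `loss_at_weights` is an assumed helper that loads a flat weight vector into the network and returns the training loss, and the grid extent and size are arbitrary choices.

```python
import numpy as np

def plane_loss_grid(w1, w2, w3, loss_at_weights, grid_size=21, margin=0.2):
    """Evaluate `loss_at_weights` on a grid in the plane through w1, w2, w3."""
    u = w2 - w1
    v = (w3 - w1) - (np.dot(w3 - w1, u) / np.dot(u, u)) * u      # Gram-Schmidt step
    u_hat, v_hat = u / np.linalg.norm(u), v / np.linalg.norm(v)  # orthonormal basis
    # Cartesian coordinates of the grid in the (u_hat, v_hat) basis
    xs = np.linspace(-margin * np.linalg.norm(u), (1 + margin) * np.linalg.norm(u), grid_size)
    ys = np.linspace(-margin * np.linalg.norm(v), (1 + margin) * np.linalg.norm(v), grid_size)
    losses = np.empty((grid_size, grid_size))
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            p = w1 + x * u_hat + y * v_hat        # P = w1 + x * u_hat + y * v_hat
            losses[i, j] = loss_at_weights(p)
    return xs, ys, losses
```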

The left panel shows a plane defined by three independently trained networks. In this plane, all optima are isolated, which corresponds to the standard intuition. However, the middle and right panels show two different paths of near-constant loss between the modes in weight space, discovered by our proposed training procedure. The endpoints of these paths are the two independently trained DNNs corresponding to the two lower modes on the left panel.

We believe that this geometric discovery has major implications for research into multilayer networks, including (1) improving the efficiency, reliability, and accuracy of training, (2) creating better ensembles, and (3) deriving more effective posterior approximation families in Bayesian deep learning. Indeed, in this paper we are inspired by this geometric insight to propose a new ensembling procedure that can efficiently discover multiple high-performing but diverse deep neural networks.

Figure 1: The $\ell_2$-regularized cross-entropy train loss surface of a ResNet-164 on CIFAR-100, as a function of network weights in a two-dimensional subspace. In each panel, the horizontal axis is fixed and is attached to the optima of two independently trained networks. The vertical axis changes between panels as we change planes (defined in the main text). Left: Three optima for independently trained networks. Middle and Right: A quadratic Bezier curve, and a polygonal chain with one bend, connecting the lower two optima on the left panel along a path of near-constant loss. Notice that in each panel a direct linear path between each mode would incur high loss.

In particular, our contributions include:

  • The discovery that the local optima for modern deep neural networks are connected by very simple curves, such as a polygonal chain with only one bend.

  • A new method that finds such paths between two local optima, such that the train loss and test error remain low along these paths.

  • Using the proposed method we demonstrate that such mode connectivity holds for a wide range of modern deep neural networks, on key benchmarks such as CIFAR-100. We show that these paths correspond to meaningfully different representations that can be efficiently ensembled for increased accuracy.

  • Inspired by these observations, we propose Fast Geometric Ensembling (FGE), which outperforms the recent state-of-the-art Snapshot Ensembles (Huang et al., 2017), on CIFAR-10 and CIFAR-100, using powerful deep neural networks such as VGG-16, Wide ResNet-28-10, and ResNet-164. On ImageNet we achieve a 0.56% top-1 error-rate improvement for a pretrained ResNet-50 model by running FGE for only a few epochs.

  • We release the code for reproducing the results in this paper at
    https://github.com/timgaripov/dnn-mode-connectivity

The rest of the paper is organized as follows. Section 2 discusses existing literature on DNN loss geometry and ensembling techniques. Section 3 introduces the proposed method to find the curves with low train loss and test error between local optima, which we investigate empirically in Section 4. Section 5 then introduces our proposed ensembling technique, FGE, which we empirically compare to the alternatives in Section 6. Finally, in Section 7 we discuss connections to other fields and directions for future work.

Note that we interleave two sections where we make methodological proposals (Sections 3, 5), with two sections where we perform experiments (Sections 4, 6). Our key methodological proposal for ensembling, FGE, is in Section 5.

2 Related Work

Despite the success of deep learning across many application domains, the loss surfaces of deep neural networks are not well understood. These loss surfaces are an active area of research, which falls into two distinct categories.

The first category explores the local structure of minima found by SGD and its modifications. Researchers typically distinguish sharp and wide local minima, which are respectively found by using large and small mini-batch sizes during training. Hochreiter and Schmidhuber (1997) and Keskar et al. (2017), for example, claim that flat minima lead to strong generalization, while sharp minima deliver poor results on the test dataset. However, recently Dinh et al. (2017) argue that most existing notions of flatness cannot directly explain generalization. To better understand the local structure of DNN loss minima, Li et al. (2017) proposed a new visualization method for the loss surface near the minima found by SGD. Applying the method for a variety of different architectures, they showed that the loss surfaces of modern residual networks are seemingly smoother than those of VGG-like models.

The other major category of research considers global loss structure. One of the main questions in this area is how neural networks are able to overcome poor local optima. Choromanska et al. (2015) investigated the link between the loss function of a simple fully-connected network and the Hamiltonian of the spherical spin-glass model. Under strong simplifying assumptions they showed that the values of the investigated loss function at local optima are within a well-defined bound. In other research, Lee et al. (2016a) showed that under mild conditions gradient descent almost surely converges to a local minimizer and not a saddle point, starting from a random initialization.

In recent work, Freeman and Bruna (2017) theoretically show that local minima of a neural network with one hidden layer and ReLU activations can be connected with a curve along which the loss is upper-bounded by a constant that depends on the number of parameters of the network and the “smoothness of the data”. Their theoretical results do not readily generalize to multilayer networks. Using a dynamic programming approach they empirically construct a polygonal chain for a CNN on MNIST and an RNN on PTB next word prediction. However, in more difficult settings such as AlexNet on CIFAR-10, their approach struggles to achieve even modest test accuracy. Moreover, they do not consider ensembling.

By contrast, we propose a much simpler training procedure that can find near-constant accuracy polygonal chains with only one bend between optima, even on a range of modern state-of-the-art architectures. Inspired by properties of the loss function discovered by our procedure, we also propose a new state-of-the-art ensembling method that can be trained in the time required to train a single DNN, with compelling performance on many key benchmarks (e.g., 96.4% accuracy on CIFAR-10).

Xie et al. (2013) proposed a related ensembling approach that gathers outputs of neural networks from different epochs at the end of training to stabilize final predictions. More recently, Huang et al. (2017) proposed snapshot ensembles, which use a cosine cyclical learning rate (Loshchilov and Hutter, 2017) to save “snapshots” of the model during training at times when the learning rate achieves its minimum. In our experiments, we compare our geometrically inspired approach to Huang et al. (2017), showing improved performance.

Draxler et al. (2018) simultaneously and independently discovered the existence of curves connecting local optima in DNN loss landscapes. To find these curves they used a different approach inspired by the Nudged Elastic Band method (Jonsson et al., 1998) from quantum chemistry.

3 Finding Paths between Modes

We describe a new method to minimize the training error along a path that connects two points in the space of DNN weights. Section 3.1 introduces this general procedure for arbitrary parametric curves, and Section 3.2 describes polygonal chains and Bezier curves as two example parametrizations of such curves. In the supplementary material, we discuss the computational complexity of the proposed approach and how to apply batch normalization at test time to points on these curves. We note that after the curve finding experiments in Section 4, we make our key methodological proposal for ensembling in Section 5.

3.1 Connection Procedure

Let $\hat w_1$ and $\hat w_2$ in $\mathbb{R}^{|net|}$ be two sets of weights corresponding to two neural networks independently trained by minimizing any user-specified loss $\mathcal{L}(w)$, such as the cross-entropy loss. Here, $|net|$ is the number of weights of the DNN. Moreover, let $\phi_\theta : [0, 1] \to \mathbb{R}^{|net|}$ be a continuous piecewise smooth parametric curve, with parameters $\theta$, such that $\phi_\theta(0) = \hat w_1$ and $\phi_\theta(1) = \hat w_2$.

To find a path of high accuracy between $\hat w_1$ and $\hat w_2$, we propose to find the parameters $\theta$ that minimize the expectation of the loss over a uniform distribution on the curve, $\hat\ell(\theta)$:

$$\hat\ell(\theta) = \frac{\int_0^1 \mathcal{L}(\phi_\theta(t))\,\|\phi_\theta'(t)\|\,dt}{\int_0^1 \|\phi_\theta'(t)\|\,dt} = \mathbb{E}_{\phi \sim q_\theta(\phi)}\,\mathcal{L}(\phi), \qquad (1)$$

where the distribution $q_\theta(\phi)$ on the curve is defined by sampling $t \in [0, 1]$ with density proportional to $\|\phi_\theta'(t)\|$ and setting $\phi = \phi_\theta(t)$; that is, $q_\theta$ is the uniform distribution on the curve. The numerator of (1) is the line integral of the loss $\mathcal{L}$ on the curve, and the denominator $\int_0^1 \|\phi_\theta'(t)\|\,dt$ is the normalizing constant of the uniform distribution on the curve defined by $\phi_\theta(\cdot)$. Stochastic gradients of $\hat\ell(\theta)$ in Eq. (1) are generally intractable since $q_\theta(\phi)$ depends on $\theta$. Therefore we also propose a more computationally tractable loss

$$\ell(\theta) = \int_0^1 \mathcal{L}(\phi_\theta(t))\,dt = \mathbb{E}_{t \sim U(0, 1)}\,\mathcal{L}(\phi_\theta(t)), \qquad (2)$$

where $U(0, 1)$ is the uniform distribution on $[0, 1]$. The difference between (1) and (2) is that the latter is an expectation of the loss with respect to a uniform distribution on $t \in [0, 1]$, while (1) is an expectation with respect to a uniform distribution on the curve. The two losses coincide, for example, when $\phi_\theta$ defines a polygonal chain with two line segments of equal length and the parametrization of each of the two segments is linear in $t$.

To minimize (2), at each iteration we sample $\tilde t$ from the uniform distribution $U(0, 1)$ and make a gradient step for $\theta$ with respect to the loss $\mathcal{L}(\phi_\theta(\tilde t))$. This way we obtain unbiased estimates of the gradients of $\ell(\theta)$, as

$$\mathbb{E}_{t \sim U(0, 1)}\,\nabla_\theta \mathcal{L}(\phi_\theta(t)) = \nabla_\theta\,\mathbb{E}_{t \sim U(0, 1)}\,\mathcal{L}(\phi_\theta(t)) = \nabla_\theta\,\ell(\theta).$$

We repeat these updates until convergence.
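The procedure above amounts to a standard stochastic optimization loop over $\theta$. Below is a minimal sketch, not the released implementation, assuming a curve function `phi(theta, t)` that returns a flat weight vector (concrete parametrizations follow in Section 3.2) and an assumed helper `loss_at_weights(w_flat, batch)` that loads the flat weights into the network and evaluates the training loss on a mini-batch.

```python
import torch

def curve_finding_step(theta, phi, loss_at_weights, batch, optimizer):
    """One stochastic step on loss (2): sample t ~ U(0, 1) and update theta."""
    t = float(torch.rand(()))              # sample t ~ U(0, 1)
    w_t = phi(theta, t)                    # point phi_theta(t) on the curve (flat weights)
    loss = loss_at_weights(w_t, batch)     # L(phi_theta(t)) on the mini-batch
    optimizer.zero_grad()
    loss.backward()                        # unbiased estimate of the gradient of l(theta)
    optimizer.step()                       # optimizer is over [theta] only; endpoints stay fixed
    return loss.item()
```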

3.2 Example Parametrizations

Polygonal chain

The simplest parametric curve we consider is the polygonal chain (see Figure 1, right). The trained networks $\hat w_1$ and $\hat w_2$ serve as the endpoints of the chain and the bends of the chain are the parameters $\theta$ of the curve parametrization. Consider the simplest case of a chain with one bend $\theta$. Then

$$\phi_\theta(t) = \begin{cases} 2\left(t\,\theta + (0.5 - t)\,\hat w_1\right), & 0 \le t \le 0.5,\\ 2\left((t - 0.5)\,\hat w_2 + (1 - t)\,\theta\right), & 0.5 \le t \le 1. \end{cases}$$

Bezier curve

A Bezier curve (see Figure 1, middle) provides a convenient parametrization of smooth paths with given endpoints. A quadratic Bezier curve with endpoints $\hat w_1$ and $\hat w_2$ and control point $\theta$ is given by

$$\phi_\theta(t) = (1 - t)^2\,\hat w_1 + 2 t (1 - t)\,\theta + t^2\,\hat w_2, \qquad 0 \le t \le 1.$$

These formulas naturally generalize to $k$ bends (see supplement, Section A.3).
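For concreteness, the two parametrizations above can be written as follows, acting on flat weight tensors; `w1` and `w2` are the fixed endpoints and `theta` is the trainable bend or control point. Either function can be plugged into the step sketched in Section 3.1, e.g. `phi = lambda th, t: bezier(th, t, w1, w2)`.

```python
def polychain(theta, t, w1, w2):
    """Polygonal chain with one bend: w1 -> theta for t in [0, 0.5], theta -> w2 after."""
    if t <= 0.5:
        return 2.0 * (t * theta + (0.5 - t) * w1)
    return 2.0 * ((t - 0.5) * w2 + (1.0 - t) * theta)

def bezier(theta, t, w1, w2):
    """Quadratic Bezier curve with endpoints w1, w2 and control point theta."""
    return (1 - t) ** 2 * w1 + 2 * t * (1 - t) * theta + t ** 2 * w2
```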

Figure 2: The $\ell_2$-regularized cross-entropy train loss (left) and test error (middle) as a function of the point $t$ on the curves found by the proposed method (ResNet-164 on CIFAR-100). Right: Error of the two-network ensemble consisting of one endpoint of the curve and the point $\phi(t)$ on the curve (CIFAR-100, ResNet-164). “Segment” is a line segment connecting two modes found by SGD. “Polychain” is a polygonal chain connecting the same endpoints.

4 Curve Finding Experiments

We show that the proposed training procedure in Section 3 does indeed find high accuracy paths connecting different modes, across a range of architectures and datasets. Moreover, we further investigate the properties of these curves, showing that they correspond to meaningfully different representations that can be ensembled for improved accuracy. We use these insights to propose an improved ensembling procedure in Section 5, which we empirically validate in Section 6.

In particular, we test VGG-16 (Simonyan and Zisserman, 2014), a 28-layer Wide ResNet with widening factor 10 (Zagoruyko and Komodakis, 2016) and a 158-layer ResNet (He et al., 2016) on CIFAR-10, and VGG-16 and a 164-layer ResNet-bottleneck (He et al., 2016) on CIFAR-100. For CIFAR-10 and CIFAR-100 we use the same standard data augmentation as Huang et al. (2017). We provide additional results, including detailed experiments for fully connected and recurrent networks, in the supplement.

For each model and dataset we train two networks with different random initializations to find two modes. Then we use the proposed algorithm of Section 3 to find a path connecting these two modes in the weight space with a quadratic Bezier curve and a polygonal chain with one bend. We also connect the two modes with a line segment for comparison. In all experiments we optimize the loss (2), as for Bezier curves the gradient of loss (1) is intractable, and for polygonal chains we found loss (2) to be more stable.

Figures 1 and 2 show the results of the proposed mode connecting procedure for ResNet-164 on CIFAR-100. Here loss refers to the $\ell_2$-regularized cross-entropy loss. For both the Bezier curve and polygonal chain, train loss (Figure 2, left) and test error (Figure 2, middle) are indeed nearly constant. In addition, we provide plots of train error and test loss in the supplementary material. In the supplement, we also include a comprehensive table summarizing all path finding experiments on CIFAR-10 and CIFAR-100 for VGGs, ResNets and Wide ResNets, as well as fully connected networks and recurrent neural networks, which follow the same general trends. In the supplementary material we also show that the connecting curves can be found consistently as we vary the number of parameters in the network, although the ratio of the arclength of the curves to the length of the line segment connecting the same endpoints decreases with increasing parametrization. In the supplement, we also measure the losses (1) and (2) for all the curves we constructed, and find that the values of the two losses are very close, suggesting that the loss (2) is a good practical approximation to the loss (1).

The constant-error curves connecting two given networks discovered by the proposed method are not unique. We trained two different polygonal chains with the same endpoints and different random seeds using VGG-16 on CIFAR-10, and then measured the Euclidean distance between the turning points of these curves. This distance is far from zero, though smaller than the distance between the endpoints, showing that the curves are not unique. In this instance, we expect the distance between turning points to be less than the distance between endpoints, since the locations of the turning points were initialized to the same value (the center of the line segment connecting the endpoints).

Although high accuracy connecting curves can often be very simple, such as a polygonal chain with only one bend, we note that line segments directly connecting two modes generally incur high error. For VGG-16 on CIFAR-10 the test error rises sharply towards the center of the segment. For ResNet-158 and Wide ResNet-28-10 the worst errors along direct line segments are still high, but noticeably lower than for VGG-16. This finding suggests that the loss surfaces of state-of-the-art residual networks are indeed more regular than those of classical models like VGG, in accordance with the observations in Li et al. (2017).

In this paper we focus on connecting pairs of networks trained using the same hyper-parameters, but from different random initializations. Building upon our work, Gotmare et al. (2018) have recently shown that our mode connectivity approach applies to pairs of networks trained with different batch sizes, optimizers, data augmentation strategies, weight decays and learning rate schemes.

To motivate the ensembling procedure proposed in the next section, we now examine how far we need to move along a connecting curve to find a point that produces substantially different, but still useful, predictions. Let $\hat w_1$ and $\hat w_2$ be two distinct sets of weights corresponding to optima obtained by independently training a DNN two times. We have shown that there exists a path connecting $\hat w_1$ and $\hat w_2$ with high test accuracy. Let $\phi(t)$, $t \in [0, 1]$, parametrize this path with $\phi(0) = \hat w_1$ and $\phi(1) = \hat w_2$. We investigate the performance of an ensemble of two networks: the endpoint $\phi(0)$ of the curve and a point $\phi(t)$ on the curve. Figure 2 (right) shows the test error of this ensemble as a function of $t$, for a ResNet-164 on CIFAR-100. The test error starts decreasing already for small values of $t$, and for moderate $t$ the error of the ensemble is as low as the error of an ensemble of the two independently trained networks used as the endpoints of the curve. Thus even by moving away from the endpoint by a relatively small distance along the curve we can find a network that produces meaningfully different predictions from the network at the endpoint. This result also demonstrates that these curves do not exist only due to degenerate parametrizations of the network (such as rescaling on either side of a ReLU); instead, points along the curve correspond to meaningfully different representations of the data that can be ensembled for improved performance. In the supplementary material we show how to create trivially connecting curves that do not have this property.

5 Fast Geometric Ensembling

In this section, we introduce a practical ensembling procedure, Fast Geometric Ensembling (FGE), motivated by our observations about mode connectivity.

Figure 3: Left: Plot of the learning rate (top), test error (middle) and distance from the initial value $\hat w$ (bottom) as a function of iteration for FGE with Preactivation-ResNet-164 on CIFAR-100. Circles indicate the times when we save models for ensembling. Right: Ensemble performance of FGE and SSE (Snapshot Ensembles) as a function of training time, using ResNet-164 on CIFAR-100. Crosses represent the performance of separate “snapshot” models, and diamonds show the performance of the ensembles constructed of all models available by the given time.

In the previous section, we considered ensembling along mode connecting curves. Suppose now we instead only have one set of weights $\hat w$ corresponding to a mode of the loss. We cannot explicitly construct a path as before, but we know that multiple paths passing through $\hat w$ exist, and it is thus possible to move away from $\hat w$ in the weight space without increasing the loss. Further, we know that we can find diverse networks providing meaningfully different predictions by making relatively small steps in the weight space (see Figure 2, right).

Inspired by these observations, we propose the Fast Geometric Ensembling (FGE) method that aims to find diverse networks with relatively small steps in the weight space, without leaving a region that corresponds to low test error.

While inspired by mode connectivity, FGE does not rely on explicitly finding a connecting curve, and thus does not require pre-trained endpoints, and so can be trained in the time required to train a single network.

Let us describe Fast Geometric Ensembling. First, we initialize a copy $w$ of the network with weights set equal to the weights $\hat w$ of the trained network. Now, to force $w$ to move away from $\hat w$ without substantially decreasing the prediction accuracy, we adopt a cyclical learning rate schedule (see Figure 3, left), with the learning rate at iteration $i$ defined as

$$\alpha(i) = \begin{cases} (1 - 2 t(i))\,\alpha_1 + 2 t(i)\,\alpha_2, & 0 < t(i) \le \tfrac{1}{2},\\ (2 - 2 t(i))\,\alpha_2 + (2 t(i) - 1)\,\alpha_1, & \tfrac{1}{2} < t(i) \le 1, \end{cases} \qquad t(i) = \frac{\mathrm{mod}(i - 1, c) + 1}{c},$$

where the learning rates satisfy $\alpha_1 > \alpha_2$, and the number of iterations in one cycle is given by the even number $c$. Here by iteration we mean processing one mini-batch of data. We can train the network using the standard $\ell_2$-regularized cross-entropy loss function (or any other loss that can be used for DNN training) with the proposed learning rate schedule for $n$ iterations. In the middle of each learning rate cycle, when the learning rate reaches its minimum value $\alpha_2$ (which corresponds to $\mathrm{mod}(i, c) = c/2$), we collect a checkpoint of the weights $w$. When the training is finished we ensemble the collected models. An outline of the algorithm is provided in the supplement (Algorithm 1).
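A short Python sketch of this schedule and the checkpoint-collection rule is given below, using the notation above ($c$ is the cycle length in iterations, $\alpha_1 > \alpha_2$ are the learning rate bounds); it is an illustration rather than the released implementation.

```python
def fge_lr(i, c, alpha1, alpha2):
    """Triangular cyclical learning rate for iteration i (1-indexed)."""
    t = ((i - 1) % c + 1) / c                            # position within the cycle, t in (0, 1]
    if t <= 0.5:
        return (1 - 2 * t) * alpha1 + 2 * t * alpha2     # descend from alpha1 to alpha2
    return (2 - 2 * t) * alpha2 + (2 * t - 1) * alpha1   # climb back up to alpha1

def collect_checkpoint(i, c):
    """True in the middle of a cycle, when the learning rate reaches alpha2."""
    return i % c == c // 2
```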

Figure 3 (left) illustrates the adopted learning rate schedule. During the periods when the learning rate is large (close to $\alpha_1$), $w$ explores the weight space, taking larger steps but sacrificing test error. When the learning rate is small (close to $\alpha_2$), $w$ is in the exploitation phase, in which the steps become smaller and the test error goes down. The cycle length is usually a small number of epochs, so that the method efficiently balances exploration and exploitation with relatively small steps in the weight space that are still sufficient to gather diverse and meaningful networks for the ensemble.

To find a good initialization $\hat w$ for the proposed procedure, we first train the network with the standard learning rate schedule (the schedule used to train single DNN models) for most of the time required to train a single model. After this pre-training is finished, we initialize FGE with $w = \hat w$ and run the proposed fast ensembling algorithm for the remaining computational budget. In order to get more diverse samples, one can run the algorithm described above several times for a smaller number of iterations, initializing from different checkpoints saved during the pre-training stage, and then ensemble all of the models gathered across these runs.

Cyclical learning rates have also recently been considered in Smith and Topin (2017) and Huang et al. (2017). Our proposed method is perhaps most closely related to Snapshot Ensembles (Huang et al., 2017), but has several distinctive features, inspired by our geometric insights. In particular, Snapshot Ensembles adopt cyclical learning rates with cycles spanning many epochs from the beginning of training, as they aim to take large steps in the weight space. However, according to our analysis of the curves, relatively small steps in the weight space are sufficient to obtain diverse networks, so we employ cyclical learning rates with a cycle length of only a few epochs in the last stage of training. As illustrated in Figure 3 (left), the steps made by FGE between saving two models (that is, the Euclidean distance between the corresponding sets of weights) are substantially smaller than the distance between two snapshots of Snapshot Ensembles for the same model (Preactivation-ResNet-164 on CIFAR-100). We also use a piecewise linear cyclical learning rate schedule following Smith and Topin (2017), as opposed to the cosine schedule in Snapshot Ensembles.

6 Fast Geometric Ensembling Experiments

Table 1: Error rates (%) on the CIFAR-100 and CIFAR-10 datasets for ensembles of independently trained networks (Ind), Snapshot Ensembles (SSE) and Fast Geometric Ensembling (FGE), for the VGG-16, ResNet-164 and WRN-28-10 architectures at several training budgets. The best results for each dataset, architecture, and budget are bolded.

In this section we compare the proposed Fast Geometric Ensembling (FGE) technique against ensembles of independently trained networks (Ind) and Snapshot Ensembles (SSE) (Huang et al., 2017), a recent state-of-the-art fast ensembling approach.

For the ensembling experiments we use a 164-layer Preactivation-ResNet in addition to the VGG-16 and Wide ResNet-28-10 models. Links to the implementations of these models can be found in the supplement.

We compare the accuracy of each method as a function of computational budget. For each network architecture and dataset we denote the number of epochs required to train a single model as $B$. For a budget of $k B$, we run each of Ind, FGE and SSE $k$ times from random initializations and ensemble the models gathered from the $k$ runs. In our experiments the budget $B$ is larger for the VGG-16 and Wide ResNet-28-10 (WRN-28-10) models than for ResNet-164, since fewer epochs are typically sufficient to train the latter model. We note that the runtime per epoch for FGE, SSE, and Ind is the same, and so the total computation associated with a given budget is the same for all ensembling approaches.

For Ind, we use an initial learning rate of 0.1 for ResNet and Wide ResNet, and 0.05 for VGG. For FGE, the cycle length $c$ and the learning rate bounds $\alpha_1$, $\alpha_2$ are set separately for VGG and for the residual architectures, which together with the budget determines the total number of models in each final ensemble. For SSE, we followed Huang et al. (2017) and varied the initial learning rate and the number of snapshots per run, and we report the best results we achieved for each architecture. The total number of models in the FGE ensemble is constrained by the network choice and the computational budget. Further experimental details are in the supplement.

Table 1 summarizes the results of the experiments. In all conducted experiments FGE outperforms SSE, particularly as we increase the computational budget. The performance improvement against Ind is most noticeable for CIFAR-100. With a large number of classes, any two models are less likely to make the same predictions. Moreover, there will be greater uncertainty over which representation one should use on CIFAR-100, since the number of classes is increased tenfold from CIFAR-10, but the number of training examples is held constant. Thus smart ensembling strategies will be especially important on this dataset. Indeed in all experiments on CIFAR-100, FGE outperformed all other methods. On CIFAR-10, FGE consistently improved upon SSE for all budgets and architectures. FGE also improved against Ind for all training budgets with VGG, but is more similar in performance to Ind on CIFAR-10 when using ResNets.

Figure 3 (right) illustrates the results for Preactivation-ResNet-164 on CIFAR-100 for one and two training budgets. Snapshot Ensembles use a cyclical learning rate from the beginning of training and gather the models for the ensemble throughout training. To find a good initialization, we instead run standard independent training for most of the budget before applying FGE, so that the whole ensemble is gathered over the remaining epochs of each of the two runs. During these epochs FGE is able to gather sufficiently diverse networks to outperform Snapshot Ensembles for both budgets.

Diversity of predictions of the individual networks is crucial for ensembling performance (e.g., Lee et al., 2016b). We note that the diversity of the networks averaged by FGE is lower than that of completely independently trained networks: two independently trained ResNet-164 networks on CIFAR-100 disagree on a larger fraction of test objects than two networks taken from the same FGE run. Further, the performance of the individual networks averaged by FGE is slightly lower than that of fully trained networks. However, for a given computational budget FGE can propose many more high-performing networks than independent training, leading to better ensembling performance (see Table 1).

6.1 ImageNet

ImageNet ILSVRC-2012 (Russakovsky et al., 2012) is a large-scale dataset containing 1.2 million training images and 50,000 validation images divided into 1000 classes.

CIFAR-100 is the primary focus of our ensemble experiments. However, we also include ImageNet results for the proposed FGE procedure, using a ResNet-50 architecture. We used a pretrained model to initialize the FGE procedure and then ran FGE for a few epochs with a short cycle length. The resulting ensemble, which also includes the pretrained model, improved the top-1 test error-rate of the pretrained network by 0.56%. Despite the harder setting of only a few epochs to construct an ensemble, FGE performs comparably to the best result reported by Huang et al. (2017) on ImageNet, which was also achieved using a ResNet-50.

7 Discussion and Future Work

We have shown that the optima of deep neural networks are connected by simple pathways, such as a polygonal chain with a single bend, with near constant accuracy. We introduced a training procedure to find these pathways, with a user-specified parametric curve of choice. We were inspired by these insights to propose a practical new ensembling approach, Fast Geometric Ensembling, which achieves state-of-the-art results on CIFAR-10, CIFAR-100, and ImageNet.

There are so many exciting future directions for this research. At a high level we have shown that even though the loss surfaces of deep neural networks are very complex, there is relatively simple structure connecting different optima. Indeed, we can now move towards thinking about valleys of low loss, rather than isolated modes.

These valleys could inspire new directions for approximate Bayesian inference, such as stochastic MCMC approaches which could now jump along these bridges between modes, rather than getting stuck exploring a single mode. One could similarly derive new proposal distributions for variational inference, exploiting the flatness of these pathways. These geometric insights could also be used to accelerate the convergence, stability and accuracy of optimization procedures like SGD, by helping us understand the trajectories along which the optimizer moves, and making it possible to develop procedures which can now search in more structured spaces of high accuracy. One could also use these paths to construct methods which are more robust to adversarial attacks, by using an arbitrary collection of diverse models described by a high accuracy curve, returning the predictions of a different model for each query from an adversary. We can also use this new property to create better visualizations of DNN loss surfaces. Indeed, using the proposed training procedure, we were able to produce new types of visualizations showing the connectivity of modes, which are normally depicted as isolated. We also could continue to build on the new training procedure we proposed here, to find curves with particularly desirable properties, such as diversity of networks. Indeed, we could start to use entirely new loss functions, such as line and surface integrals of cross-entropy across structured regions of weight space.

Acknowledgements.

Timur Garipov was supported by Ministry of Education and Science of the Russian Federation (grant 14.756.31.0001). Timur Garipov and Dmitrii Podoprikhin were supported by Samsung Research, Samsung Electronics. Andrew Gordon Wilson and Pavel Izmailov were supported by Facebook Research and NSF IIS-1563887.

References

Appendix A Supplementary Material

We organize the supplementary material as follows. Section A.1 discusses the computational complexity of the proposed curve finding method. Section A.2 describes how to apply batch normalization at test time to points on curves connecting pairs of local optima. Section A.3 provides formulas for a polygonal chain and a Bezier curve with an arbitrary number of bends. Section A.4 provides details and results of experiments on curve finding and contains a table summarizing all path finding experiments. Section A.5 provides additional visualizations of the train loss and test accuracy surfaces. Section A.6 contains details on curve ensembling experiments. Section A.7 describes experiments on the relation between mode connectivity and the number of parameters in the networks. Section A.8 discusses a trivial construction of curves connecting two modes, where points on the curve represent reparameterizations of the endpoints, unlike the curves in the main text. Section A.9 provides details of experiments on FGE. Finally, Section A.10 describes pathways traversed by FGE.

A.1 Computational complexity of curve finding

The forward pass of the proposed method consists of two steps: computing the point $\phi_\theta(t)$ and then passing a mini-batch of data through the DNN corresponding to this point. Similarly, the backward pass consists of first computing the gradient of the loss with respect to $\phi_\theta(t)$, and then multiplying the result by the Jacobian $\partial \phi_\theta(t) / \partial \theta$. The second step of the forward pass and the first step of the backward pass are exactly the same as the forward and backward pass in the training of a single DNN model. The additional computational complexity of the procedure compared to single model training comes from the first step of the forward pass and the second step of the backward pass, and in general depends on the parametrization of the curve.

In our experiments we use curve parametrizations of a specific form. The general formula for a curve with one bend is given by

$$\phi_\theta(t) = c_1(t)\,\hat w_1 + c_2(t)\,\theta + c_3(t)\,\hat w_2.$$

Here the parameters of the curve are given by $\theta$, and the scalar coefficients $c_1(t), c_2(t), c_3(t)$ satisfy $c_1(0) = 1$, $c_2(0) = c_3(0) = 0$ and $c_3(1) = 1$, $c_1(1) = c_2(1) = 0$, so that the endpoints are fixed.

For this family of curves the computational complexity of the first step of the method is $O(|net|)$, as we only need to compute a weighted sum of $\hat w_1$, $\theta$ and $\hat w_2$. The Jacobian matrix is $\partial \phi_\theta(t) / \partial \theta = c_2(t) \cdot I$, thus the additional computational complexity of the backward pass is also $O(|net|)$, as we only need to multiply the gradient with respect to $\phi_\theta(t)$ by a scalar. Thus, the total additional computational complexity is $O(|net|)$. In practice we observe that the gap in time-complexity between one epoch of training a single model and one epoch of the proposed method with the same network architecture is small.

A.2 Batch Normalization

Batch normalization (Ioffe and Szegedy [2015]) is essential to modern deep learning architectures. Batch normalization re-parametrizes the output $x$ of each layer as

$$\mathrm{BN}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta,$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the output $x$, $\epsilon$ is a constant for numerical stability, and $\gamma$ and $\beta$ are free parameters. During training, $\mu$ and $\sigma$ are computed separately for each mini-batch, and at test time statistics aggregated during training are used.

When connecting two DNNs that use batch normalization along a curve $\phi_\theta(t)$, we compute $\mu$ and $\sigma$ for any given $t$ over mini-batches during training, as usual. In order to apply batch normalization to a network on the curve at the test stage, we compute these statistics with one additional pass over the data, as running averages for these networks are not collected during training.
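A minimal sketch of this extra pass, assuming a PyTorch-style model (the paper's own experiments use TensorFlow), is shown below; setting `momentum = None` makes the running statistics plain cumulative averages over the pass, so one epoch of forward passes recovers $\mu$ and $\sigma$ for the chosen point on the curve.

```python
import torch

def update_bn_stats(model, loader, device="cpu"):
    """Recompute BatchNorm running statistics with one pass over `loader`."""
    for m in model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            m.momentum = None          # cumulative averaging over the whole pass
    model.train()                      # BN layers use and accumulate batch statistics
    with torch.no_grad():
        for x, _ in loader:            # loader is assumed to yield (inputs, targets)
            model(x.to(device))
    model.eval()
```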

A.3 Formulas for curves with $k$ bends

For $k$ bends $\theta = \{w_1, \ldots, w_k\}$, the parametrization of a polygonal chain connecting the points $w_0 = \hat w_1$ and $w_{k+1} = \hat w_2$ is given by

$$\phi_\theta(t) = (k + 1)\left[\left(t - \tfrac{i}{k+1}\right) w_{i+1} + \left(\tfrac{i+1}{k+1} - t\right) w_i\right],$$

for $t \in \left[\tfrac{i}{k+1}, \tfrac{i+1}{k+1}\right]$ and $i = 0, \ldots, k$.

For $k$ bends $\theta = \{w_1, \ldots, w_k\}$, the parametrization of a Bezier curve connecting the points $w_0 = \hat w_1$ and $w_{k+1} = \hat w_2$ is given by

$$\phi_\theta(t) = \sum_{i=0}^{k+1} \binom{k+1}{i}\, t^i (1 - t)^{k+1-i}\, w_i, \qquad t \in [0, 1].$$
A.4 Curve Finding Experiments

Table 2: The properties of the loss and error values along the found curves for different architectures and tasks. For each dataset (MNIST, CIFAR-10, CIFAR-100), architecture (FC, 3conv3fc, VGG-16, ResNet-158, ResNet-164, WRN-28-10) and curve type (Single, Segment, Bezier, Polychain), the table reports the ratio of the curve length to the length of the line segment between the endpoints, and the minimum (“Min”), integral (“Int”) and maximum (“Max”) of the train loss, train error (%) and test error (%); for the train loss the mean over $t$ (“Mean”) is also reported.
Table 3: Perplexity along the curves found for the RNN on the PTB dataset. For the single endpoint models (“Single”), the line segment connecting them (“Segment”) and the Bezier curve (“Bezier”), the table reports the minimum and maximum perplexity on the train, validation and test sets.
Figure 4: The $\ell_2$-regularized cross-entropy train loss (top) and test error (bottom) surfaces of a deep residual network (ResNet-164) on CIFAR-100. Left: Three optima for independently trained networks. Middle and Right: A quadratic Bezier curve, and a polygonal chain with one bend, connecting the lower two optima on the left panel along a path of near-constant loss. Notice that in each panel, a direct linear path between each mode would incur high loss.
Figure 5: Same as Fig. 4 for VGG-16 on CIFAR-10.

All experiments on curve finding were conducted with TensorFlow (Abadi et al. [2016]), using standard publicly available implementations of the baseline models.

Table 2 summarizes the results of the curve finding experiments for all datasets and architectures. For each of the models we report the properties of the loss and the error on the train and test datasets. For each of these metrics we report three values: “Max” is the maximum value of the metric along the curve; “Int” is a numerical approximation of the integral of the metric over the uniform distribution on the curve (the line integral of the metric divided by the length of the curve), where the metric is the train loss or the error on the train or test dataset; and “Min” is the minimum value of the metric on the curve. “Int” thus represents a mean over a uniform distribution on the curve, and for the train loss it coincides with the loss (1) in the paper. We use an equally-spaced grid of points on $[0, 1]$ to estimate the values of “Min”, “Max” and “Int”. For “Int” we use the trapezoidal rule to estimate the integral. For each dataset and architecture we report the performance of the single models used as the endpoints of the curve as “Single”, the performance of a line segment connecting the two single networks as “Segment”, the performance of a quadratic Bezier curve as “Bezier” and the performance of a polygonal chain with one bend as “Polychain”. Finally, for each curve we report the ratio of its length to the length of a line segment connecting the two modes.

We also examined the quantity “Mean”, defined as the average of the metric over the equally-spaced grid of values of $t$, which coincides with the loss (2) from the paper; in all our experiments it is nearly equal to “Int”.
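For illustration, “Int” and “Mean” could be estimated on such a grid as follows; `metric_values[j]` is the metric at $t_j$ and `curve_points[j]` is the flattened weight vector $\phi_\theta(t_j)$ (the names and grid handling are illustrative, not taken from the paper's code).

```python
import numpy as np

def curve_statistics(metric_values, curve_points):
    """Estimate the curve-uniform mean ("Int") and the t-uniform mean ("Mean")."""
    t = np.linspace(0.0, 1.0, len(metric_values))
    seg = np.linalg.norm(np.diff(curve_points, axis=0), axis=1)   # segment lengths
    arclen = np.concatenate([[0.0], np.cumsum(seg)])              # cumulative arc length
    # "Int": trapezoidal line integral of the metric divided by the curve length
    integral = np.trapz(metric_values, arclen) / arclen[-1]
    # "Mean": trapezoidal average of the metric over the grid of t values
    mean = np.trapz(metric_values, t)
    return integral, mean
```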

Besides convolutional and fully-connected architectures, we also apply our approach to a recurrent architecture on the next-word prediction task on the PTB dataset (Marcus et al. [1993]). As the base model we used the implementation available at https://www.tensorflow.org/tutorials/recurrent. As the main metric we consider perplexity. The results are presented in Table 3.

A.5 Train loss and test accuracy surfaces

In this section we provide additional visualizations. Fig. 4 and Fig. 5 show visualizations of the train loss and test accuracy for ResNet-164 on CIFAR-100 and VGG-16 on CIFAR-10.

A.6 Curve Ensembling

Figure 6: Error as a function of the point $t$ on the curves found by the proposed method, using a ResNet-164 on CIFAR-100. Top left: train error. Bottom left: test error; dashed lines correspond to the quality of the ensemble constructed from curve points before and after logits rescaling. Top right: train loss ($\ell_2$-regularized cross-entropy). Bottom right: cross-entropy before and after logits rescaling for the polygonal chain.

Here we explore ensembles constructed from points sampled from these high accuracy curves. In particular, we train a polygonal chain with one bend connecting two independently trained ResNet-164 networks on CIFAR-100 and construct an ensemble of the networks corresponding to points placed on an equally-spaced grid on the curve. The resulting ensemble achieved a lower test error-rate than the ensemble constructed from the endpoints of the curve alone, implying that the curves found by the proposed method are actually passing through diverse networks that produce predictions different from those produced by the endpoints of the curve. Moreover, the ensemble based on the polygonal chain has the same number of parameters as three independent networks, and performance comparable to an ensemble of three independently trained networks.

Furthermore, we can improve the ensemble on the chain without adding additional parameters or computational expense, by accounting for the pattern of increased training and test loss towards the centres of the linear paths shown in Figure 6. While the training and test accuracy are relatively constant, the pattern of loss, shared across train and test sets, indicates overconfidence away from the three points defining the curve: in this region, networks tend to output probabilities that are too close to 0 or 1, sometimes with the wrong answers. This overconfidence decreases the performance of ensembles constructed from the networks sampled on the curves. In order to correct for this overconfidence and improve the ensembling performance, we apply temperature scaling [Guo et al., 2017] to the logits of the networks on the curve. Figure 6, bottom right, illustrates the test loss of ResNet-164 on CIFAR-100 before and after temperature scaling. After rescaling the predictions of the networks, the test loss along the curve decreases and flattens. Further, the test error-rate of the ensemble constructed from the points on the curve decreased after applying the temperature scaling, outperforming the ensemble of independently trained networks.
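A minimal sketch of such rescaled ensembling is shown below; using a single scalar temperature is an assumption here (it could, for example, be fit on held-out data by minimizing the negative log-likelihood, as in Guo et al., 2017), and the paper's exact rescaling procedure may differ.

```python
import torch
import torch.nn.functional as F

def ensemble_probs(list_of_logits, temperature=1.0):
    """Average softmax probabilities of several networks after rescaling their logits."""
    probs = [F.softmax(z / temperature, dim=-1) for z in list_of_logits]
    return torch.stack(probs, dim=0).mean(dim=0)   # ensemble predictive distribution
```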

However, directly ensembling on the curves requires manual intervention for temperature scaling, and an additional pass over the training data for each of the networks in the ensemble at test time to perform batch normalization as described in Section A.2. Moreover, we also need to train at least two networks for the endpoints of the curve.

A.7 The Effects of Increasing Parametrization

Figure 7: The worst train loss along the curve, maximum of the losses of the endpoints, and the ratio of the length of the curve and the line segment connecting the two modes, as a function of the scaling factor of the sizes of fully-connected layers.

One possible factor that influences the connectedness of the set of local minima is the overparameterization of neural networks. In this section, we investigate the relation between the observed connectedness of the local optima and the number of parameters (weights) in the neural network. We start with a network that has three convolutional layers followed by three fully-connected layers, and scale the number of neurons in each fully-connected layer by a factor $K$. We vary $K$, and for each value of $K$ we train two networks that we connect with a Bezier curve using the proposed procedure.

For each value of $K$, Figure 7 shows the worst training loss along the curve, the maximum of the losses of the endpoints, and the ratio of the length of the curve to the length of the line segment connecting the two modes. Increasing the number of parameters, we are able to reduce the difference between the worst value of the loss along the curve and the loss of the single models used as the endpoints. The ratio of the length of the found curve to the length of the line segment connecting the two modes also decreases monotonically with $K$. This result is intuitive, since greater parametrization allows for more flexibility in how we can navigate the loss surfaces.

A.8 Trivial connecting curves

For convolutional networks with ReLU activations and without batch normalization we can construct a path connecting two points in weight space such that the accuracy of each point on the curve (excluding the origin of the weight space) is at least as good as the minimum of the accuracies of the endpoints. Unlike the paths found by our procedure, these paths are trivial and merely exploit redundancies in the parametrization. Also, the training loss goes up substantially along these curves. Below we give a construction of such paths.

Let $\hat w_1$ and $\hat w_2$ be two sets of weights. The path of interest consists of two parts: the first part connects the point $\hat w_1$ with the origin $0$, and the second one connects the point $0$ with $\hat w_2$. We describe only the first part of the path, $\phi(t)$ with $\phi(1) = \hat w_1$ and $\phi(0) = 0$, as the second part is completely analogous. Let the weights of the network be $w = (W_1, b_1, \ldots, W_n, b_n)$, where $W_i, b_i$ are the weights and biases of the $i$-th layer, and $n$ is the total number of layers. Throughout the derivation we consider the inputs $x$ of the network fixed. The output of the $i$-th layer is $o_i = \mathrm{ReLU}(W_i o_{i-1} + b_i)$, $i = 1, \ldots, n$, where $o_0 = x$ corresponds to the input of the first layer and $o_n$ corresponds to the logits (the outputs of the last layer, computed without the ReLU). We construct $\phi(t)$ in the following way. We set $W_i(t) = t \cdot W_i$ and $b_i(t) = t^i \cdot b_i$. Using the positive homogeneity of ReLU, it is easy to see by induction that the outputs of the $i$-th layer of the network with weights $\phi(t)$ are equal to $t^i \cdot o_i$, and in particular the logits are equal to $t^n \cdot o_n$, for all $t \in (0, 1]$. Note that the predicted labels corresponding to the logits $t^n \cdot o_n$ and $o_n$ are the same, so the accuracy of all networks corresponding to $t \in (0, 1]$ is the same.
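The scaling argument can be checked numerically with a toy ReLU network; the snippet below is only an illustration of the construction, not the authors' code. Scaling the weights of every layer by $t$ and the biases of layer $i$ by $t^i$ scales the logits by $t^n$, leaving the predicted labels unchanged for any $t > 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)                                     # fixed network input
layers = [(rng.normal(size=(4, 5)), rng.normal(size=4)),   # (W_1, b_1)
          (rng.normal(size=(3, 4)), rng.normal(size=3))]   # (W_2, b_2) -> logits

def forward(layers, x):
    h = x
    for i, (W, b) in enumerate(layers):
        z = W @ h + b
        h = np.maximum(z, 0.0) if i < len(layers) - 1 else z   # no ReLU on the logits
    return h

t, n = 0.3, len(layers)
scaled = [(t * W, t ** (i + 1) * b) for i, (W, b) in enumerate(layers)]
# Logits of the rescaled network equal t**n times the original logits
assert np.allclose(forward(scaled, x), t ** n * forward(layers, x))
```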

Require: weights $\hat w$, LR bounds $\alpha_1, \alpha_2$, cycle length $c$ (even), number of iterations $n$
Ensure: ensemble
   $w \leftarrow \hat w$   {Initialize weights with $\hat w$}
   ensemble $\leftarrow [\,]$
   for $i = 1, 2, \ldots, n$ do
      $\alpha \leftarrow \alpha(i)$   {Calculate LR for the iteration}
      $w \leftarrow w - \alpha \nabla_w \mathcal{L}(w)$   {Stochastic gradient update}
      if $\mathrm{mod}(i, c) = c/2$ then
         ensemble $\leftarrow$ ensemble $+\, [w]$   {Collect weights}
      end if
   end for
Algorithm 1 Fast Geometric Ensembling

A.9 Fast geometric ensembling experiments

For the FGE (Fast Geometric Ensembling) strategy on ResNet we run the FGE routine summarized in Algorithm 1 after an initial period of the usual (same as Ind) training, so that the total training time fits in the budget. For the VGG and Wide ResNet models we similarly run the pre-training procedure to initialize FGE, then run FGE starting from two checkpoints saved towards the end of pre-training, and ensemble all of the gathered models; again the total training time fits in the budget. The cycle length (in epochs) is set separately for VGG and for the ResNet and Wide ResNet models, which, together with the budget, determines the total number of models in the final ensemble for each architecture.

A.10 Polygonal chain connecting FGE proposals

In order to better understand the trajectories followed by FGE, we construct a polygonal chain connecting the points that FGE ensembles. Suppose we run FGE for $n$ learning rate cycles, obtaining points $w_1, \ldots, w_n$ in the weight space that correspond to the lowest values of the learning rate. We then consider the polygonal chain consisting of the line segments connecting $w_i$ to $w_{i+1}$ for $i = 1, \ldots, n - 1$. We plot the train loss and test error along this polygonal chain in Figure 8. We observe that along this curve both train loss and test error remain low, agreeing with our intuition that FGE follows paths of low loss and error. Surprisingly, we find that the points on the line segments connecting the weights have lower train loss and test error than $w_i$ and $w_{i+1}$. See Izmailov et al. [2018] for a detailed discussion of this phenomenon.

Figure 8: Train loss and test error along the polygonal chain connecting the sequence of points ensembled in FGE. The plot is generated using PreResNet-164 on CIFAR 100. Circles indicate the bends on the polygonal chain, i.e. the networks ensembled in FGE.