# Asymmetric Valleys: Beyond Sharp and Flat Local Minima

Despite the non-convex nature of their loss functions, deep neural networks are known to generalize well when optimized with stochastic gradient descent (SGD). Recent work conjectures that SGD with proper configuration is able to find wide and flat local minima, which have been proposed to be associated with good generalization performance. In this paper, we observe that local minima of modern deep networks are more than being flat or sharp. Specifically, at a local minimum there exist many asymmetric directions such that the loss increases abruptly along one side, and slowly along the opposite side--we formally define such minima as asymmetric valleys. Under mild assumptions, we prove that for asymmetric valleys, a solution biased towards the flat side generalizes better than the exact minimizer. Further, we show that simply averaging the weights along the SGD trajectory gives rise to such biased solutions implicitly. This provides a theoretical explanation for the intriguing phenomenon observed by Izmailov et al. (2018). In addition, we empirically find that batch normalization (BN) appears to be a major cause for asymmetric valleys.


## 1 Introduction

The loss landscape of neural networks has attracted great research interest in the deep learning community (Choromanska et al., 2015; Cooper, 2018; Keskar et al., 2017; Draxler et al., 2018; Ge et al., 2017; Sagun et al., 2017). It provides the basis for designing better optimization algorithms, and helps to answer the question of when and how a deep network can achieve good generalization performance. One hypothesis that has drawn attention recently is that the local minima of neural networks can be characterized by their flatness, and it is conjectured that sharp minima tend to generalize worse than flat ones (Keskar et al., 2017). A plausible explanation is that a flat minimizer of the training loss can achieve lower generalization error if the test loss is shifted from the training loss due to random perturbations. Figure 1(a) gives an illustration of this argument.

Although supported by plenty of empirical observations (Keskar et al., 2017; Izmailov et al., 2018; Li et al., 2018), the definition of flatness was recently challenged by Dinh et al. (2017), who showed that one can construct arbitrarily sharp minima through weight re-parameterization without changing the generalization performance. In addition, recent evidence suggests that the minima of modern deep networks are connected by simple paths with low generalization error (Draxler et al., 2018; Garipov et al., 2018). Similarly, the minima found by large batch training and small batch training are shown to be connected without any “bumps” (Sagun et al., 2017). This raises several questions: (1) If all the minima are well connected, why do some algorithms keep finding sharp minima and others keep finding flat ones (Keskar et al., 2017)? (2) Does flatness really affect generalization?

In this paper, we address these questions by introducing the concept of asymmetric valleys. We observe that the local geometry of the loss function of neural networks is usually asymmetric. In other words, there exist many directions such that the loss increases abruptly along one side, and grows rather slowly along the opposite side (see Figure 1(b) for an illustration). We formally define this kind of local minimum as an asymmetric valley. As we will show in Section 6, asymmetric valleys bring interesting illusions in high dimensional space. For example, of two solutions located in the same valley, one may appear to be a wider and flatter minimum than the other simply because it is farther away from the sharp side.

For the second question, we argue that flatness does affect generalization. However, we do not simply follow the argument in (Keskar et al., 2017), which states that flat minima tend to generalize better because they are more stable. Instead, we prove that in asymmetric valleys, a solution biased towards the flat side of the valley gives better generalization under mild assumptions. This result has at least two interesting implications: (1) converging to a particular local minimum (if there are many) may not be critical for modern deep networks; however, it matters a lot where the solution is located within that minimum's basin; and (2) the solution with the lowest a priori generalization error is not necessarily the minimizer of the training loss.

Given that a biased solution is preferred for asymmetric valleys, an immediate question is how we can find such solutions in practice. It turns out that simply averaging the weights along the SGD trajectory naturally leads to the desired biased solutions. We give a theoretical analysis to support this argument; see Figure 1(c) for an illustration. Note that our result is in line with the empirical observations recently made by Izmailov et al. (2018).

In addition, we provide empirical analysis to verify our theoretical results and support our claims. For example, we show that asymmetric valleys are indeed prevalent in modern deep networks, and that solutions with lower generalization error are biased towards the flat side of the valley. We also find that batch normalization seems to be a major cause of asymmetric loss surfaces.

## 2 Related Work

Neural network landscape. Analyzing the landscape of deep neural networks is an active and exciting area (Goodfellow & Vinyals, 2014; Li et al., 2018; Ge et al., 2017; Pennington & Bahri, 2017; Wu et al., 2017; Cooper, 2018; Sagun et al., 2017). For example, Draxler et al. (2018) and Garipov et al. (2018) observed that essentially all local minima are connected together by simple paths. Huang et al. (2017) used a cyclic learning rate and took the ensemble of intermediate models to get improved accuracy. There are also appealing visualizations of the neural network landscape (Li et al., 2018).

Sharp and flat minima. The discussion of sharp and flat local minima dates back to Hochreiter & Schmidhuber (1995), and has recently regained popularity. For example, Keskar et al. (2017) proposed that large batch SGD finds sharp minima, which leads to poor generalization. In (Chaudhari et al., 2016), an entropy-regularized SGD was introduced to explicitly search for flat minima. It was later pointed out that large batch SGD can yield comparable performance when the learning rate or the number of training iterations is properly set (Hoffer et al., 2017; Goyal et al., 2017; Smith et al., 2017; Masters & Luschi, 2018; Smith & Le, 2017; Jastrzebski et al., 2017). Moreover, Dinh et al. (2017) showed that from a given flat minimum, one can construct another minimum with arbitrarily sharp directions but equally good performance. In this paper, we argue that describing minima as simply sharp or flat is an oversimplification. There may simultaneously exist steep directions, flat directions, and asymmetric directions for the same minimum.

SGD optimization and generalization. As the de facto optimization tool for deep networks, SGD and its variants are extensively studied in the literature. For example, it is shown that they could escape saddle points or sharp local minima under reasonable assumptions (Ge et al., 2015; Jin et al., 2017, 2018a, 2018b; Xu et al., 2018; Allen-Zhu, 2018a, b; Allen-Zhu & Li, 2018; Kleinberg et al., 2018). For convex functions (Polyak & Juditsky, 1992) or strongly convex but non-smooth functions (Rakhlin et al., 2012), SGD averaging is shown to give better convergence rates. In addition, it can also achieve higher generalization performance for Lipschitz functions in theory (Shalev-Shwartz et al., 2009; Cesa-bianchi et al., 2002), or for deep networks in practice (Huang et al., 2017; Izmailov et al., 2018). Discussions on the generalization bound of neural networks can be found in (Bartlett et al., 2017; Neyshabur et al., 2018, 2017b; Kawaguchi et al., 2017; Neyshabur et al., 2017a; Arora et al., 2018; Zhou et al., 2019).

We show that SGD averaging has implicit bias on the flat sides of the minima. Previously, it was shown that SGD has other kinds of implicit bias as well (Soudry et al., 2017; Ji & Telgarsky, 2018; Gunasekar et al., 2018).

## 3 Asymmetric Valleys

In this section, we give a formal definition of asymmetric valley, and show that it is prevalent in the loss landscape of modern deep neural networks.

#### Preliminaries.

In supervised learning, we seek to optimize

 L(w) = E_{x∼P} [ℓ(w, x)],

where L is the population loss, x is the input drawn from the distribution P, w denotes the model parameters, and ℓ is the loss function.

Since the data distribution P is usually unknown, instead of optimizing L directly, we often use SGD to find the empirical risk minimizer ŵ* of L̂(w) = (1/n) ∑_{i=1}^{n} ℓ(w, xᵢ) for a set of n random samples x₁, …, xₙ from P (a.k.a. the training set).

We use a unit vector u to represent a direction, such that the points along this direction passing through w can be written as w + lu for l ∈ ℝ.

### 3.1 Definition of asymmetric valley

Before formally introducing asymmetric valleys, we first define asymmetric directions.

###### Definition 1 (Asymmetric direction).

Given constants r > 0, p > 0, c > 1, ζ ≥ 0, a direction u is (r, p, c, ζ)-asymmetric with respect to a point w and loss function L, if ∇L(w + lu)⊤u ≤ p and ∇L(w − lu)⊤u ≤ −cp for l ∈ (ζ, r).

To put it simply, an asymmetric direction is a direction along which the loss function grows at different rates in the positive and negative directions. The constant ζ handles the small neighborhood around w where the gradients are very small. With this definition, we now formally define the asymmetric valley.

###### Definition 2 (Asymmetric valley).

Given constants r > 0, p > 0, c > 1, ζ ≥ 0, a local minimum w* of L is a (r, p, c, ζ)-asymmetric valley, if there exists at least one direction u such that u is (r, p, c, ζ)-asymmetric with respect to w* and L.

Notice that here we abuse the name “valley”, since w* is essentially a point at the center of a valley.

### 3.2 Finding asymmetric directions empirically

Empirically, by taking random directions whose entries all have the same magnitude, we can find an asymmetric direction for a given local minimum with decent probability (by contrast, a random direction with entries of varying magnitude is usually not asymmetric). We perform experiments with three widely used deep networks, i.e., ResNet-110, ResNet-164 (He et al., 2016), and DenseNet-100 (Huang et al., 2016), on the CIFAR-10 and CIFAR-100 image classification datasets. For each model on each dataset, we conduct 5 independent runs. In every run we find asymmetric directions satisfying Definition 1 with c > 1, which means all the local minima found are located in asymmetric valleys (empirically, we cannot verify whether an SGD solution is an exact local minimum; see the discussion in Section 6). Figure 2 shows an asymmetric direction for a local minimum of ResNet-110 trained on CIFAR-10; we verified that it is an asymmetric direction in the sense of Definition 1. Asymmetric valleys widely exist in other models as well; see Appendix A.
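The probing procedure above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's actual experiment: a hand-built two-parameter loss stands in for a real network, the helper name `is_asymmetric` is ours, and the constants (r, p, c, ζ) are chosen for the toy example.

```python
import numpy as np

def is_asymmetric(loss, w, u, r=1.0, p=0.05, c=2.0, zeta=0.05, steps=50):
    """Check Definition 1 by finite differences: along +u the loss grows
    slowly (rate <= p), while along -u it grows steeply (rate >= c*p),
    for all offsets l in (zeta, r)."""
    eps = 1e-4
    for l in np.linspace(zeta, r, steps)[1:]:  # skip the tiny neighborhood
        g_pos = (loss(w + (l + eps) * u) - loss(w + l * u)) / eps  # flat side
        g_neg = (loss(w - (l + eps) * u) - loss(w - l * u)) / eps  # sharp side
        if not (g_pos <= p and g_neg >= c * p):
            return False
    return True

# Toy loss: slope 0.01 on the positive (flat) side of u, slope 1.0 on the
# negative (sharp) side, plus a symmetric quadratic in the other coordinate.
u = np.array([1.0, 0.0])
w_star = np.zeros(2)
loss = lambda w: 0.01 * max(w[0], 0.0) + 1.0 * max(-w[0], 0.0) + w[1] ** 2
print(is_asymmetric(loss, w_star, u))  # -> True for this toy valley
```

A symmetric quadratic such as `lambda w: w[0] ** 2 + w[1] ** 2` fails the same check, since its growth rate along +u quickly exceeds p.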

## 4 Bias and Generalization

As we show in the previous section, most local minima in practice are asymmetric, i.e., they may be sharp along one direction but flat along the opposite one. Therefore, it is important to investigate the generalization ability of a solution in this scenario. In this section, we prove that a solution biased towards the flat side of an asymmetric valley yields lower generalization error than the empirical minimizer in that valley.

### 4.1 Theoretical analysis

Before presenting our theorem, we first introduce two mild assumptions. We will show that they empirically hold on modern deep networks in Section 4.2.

The first assumption (Assumption 1) states that there exists a shift between the empirical loss and the true population loss. This is a common assumption in previous work, e.g., (Keskar et al., 2017), but it was usually presented informally; here we define the “shift” formally. Without loss of generality, we will compare the empirical loss L̂ with a vertically translated population loss L′, obtained by subtracting a constant from L, to remove the “vertical difference” between L and L̂. Notice that such vertical offsets are constants and do not affect our generalization guarantee.

###### Definition 3 ((δ,R)-shift gap).

For δ ∈ ℝᵈ, R > 0, and fixed functions L′ and L̂, we define the (δ, R)-shift gap between L′ and L̂ with respect to a point w as

 ξ_δ(w) = max_{v∈B(R)} |L′(w + v + δ) − L̂(w + v)|,

where d is the dimension of w, and B(R) is the d-dimensional ball with radius R centered at 0.

From the above definition, we know that the two functions match well after the shift δ if ξ_δ(w) is very small. For example, ξ_δ(w) = 0 means L′ is locally identical to L̂ after the shift δ. Since L̂ is computed on a set of random samples from P, the actual shift between L′ and L̂ is a random variable, ideally with zero expectation.
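In one dimension, the (δ, R)-shift gap can be approximated by a direct grid search. The sketch below is illustrative only: `shift_gap` is our helper name, and the two losses are toy functions chosen so that one is exactly a horizontal shift of the other.

```python
import numpy as np

def shift_gap(L_prime, L_hat, w, delta, R, grid=201):
    """(delta, R)-shift gap of Definition 3, in 1-D: worst mismatch between
    the shifted population loss and the empirical loss over B(R) around w."""
    vs = np.linspace(-R, R, grid)
    return max(abs(L_prime(w + v + delta) - L_hat(w + v)) for v in vs)

# The "empirical" loss is the "population" loss translated horizontally by
# 0.3, so the gap vanishes at delta = 0.3 and is large at delta = 0.
L_prime = lambda w: (w - 0.3) ** 2
L_hat = lambda w: w ** 2
print(shift_gap(L_prime, L_hat, w=0.0, delta=0.3, R=1.0))  # ~0: perfect match
print(shift_gap(L_prime, L_hat, w=0.0, delta=0.0, R=1.0))  # large: no match
```

Plotting this gap as a function of δ, as in Figure 3(b), amounts to calling `shift_gap` over a range of δ values and locating the minimizer.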

###### Assumption 1 (Random shift assumption).

For a given population loss L and a random empirical loss L̂, constants ξ, R > 0, a vector δ̄ = (δ̄₁, …, δ̄_d) with δ̄ᵢ ≥ 0 for all i ∈ [d], and a minimizer ŵ* of L̂, we assume that there exists a random variable δ correlated with L̂ such that Pr(δᵢ = δ̄ᵢ) = Pr(δᵢ = −δ̄ᵢ) = 1/2 for all i ∈ [d], and the (δ, R)-shift gap between L′ and L̂ with respect to ŵ* is bounded by ξ.

Roughly, the above assumption says that the local landscapes of the empirical loss and the population loss match well after applying a shift vector δ, which has equal probability of being positive or negative in each dimension. Therefore, δ has 2ᵈ possible values for a given magnitude vector δ̄, each with probability 2⁻ᵈ. The second assumption stated below can be seen as an extension of Definition 2.

###### Assumption 2 (Locally asymmetric).

For a given population loss L and a minimizer w*, there exist k orthogonal directions u₁, …, u_k such that uᵢ is (r, pᵢ, cᵢ, ζᵢ)-asymmetric with respect to w* + v and L, for every i ∈ [k] and every v ∈ B(R).

Assumption 2 states that if uᵢ is an asymmetric direction at w*, then uᵢ is also an asymmetric direction at any point that deviates slightly from w*, e.g., along directions perpendicular to uᵢ. In other words, the whole neighborhood around w* is an asymmetric valley.

Under the above assumptions, we are ready to state our theorem, which says the empirical minimizer is not necessarily the optimal solution, while a biased solution leads to better generalization. We defer the proof to Appendix B.

###### Theorem 1 (Bias leads to better generalization).

For any l₁, …, l_k with 0 < lᵢ ≤ δ̄ᵢ − ζᵢ, if Assumption 1 holds for ŵ* with δ̄ᵢ ≤ r for all i ∈ [k], Assumption 2 holds for w*, and ∑_{i=1}^{k} (cᵢ − 1)lᵢpᵢ/2 > 2kξ, then we have

 E_δ L(ŵ*) − E_δ L(ŵ* + ∑_{i=1}^{k} lᵢuᵢ) ≥ ∑_{i=1}^{k} (cᵢ − 1)lᵢpᵢ/2 − 2kξ > 0

#### Remark on Theorem 1.

It is widely known that the empirical minimizer is usually different from the true optimum. However, in practice it is difficult to know how the training loss is shifted from the population loss, so the best we can do is to minimize the empirical loss function (possibly with regularizers). In contrast, Theorem 1 states that in the asymmetric case, we should pick a biased solution to minimize the expected population loss, even though the shift is unknown. Moreover, it is possible to distill our insight into practical algorithms, as we will discuss in Section 5.
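The effect behind Theorem 1 can be checked numerically on a toy 1-D asymmetric valley. The loss and constants below are illustrative, not the paper's: under a random horizontal shift of ±δ̄ (each with probability 1/2), the expected loss at a point biased toward the flat side is lower than at the minimizer itself.

```python
import numpy as np

def L(w):
    """Asymmetric valley at 0: slope 1.0 on the sharp (negative) side,
    slope 0.05 on the flat (positive) side."""
    return float(np.where(w < 0, -1.0 * w, 0.05 * w))

delta_bar = 0.2  # magnitude of the random shift between train and test loss

def expected_shifted_loss(w):
    # the loss is shifted by +delta_bar or -delta_bar, each with prob. 1/2
    return 0.5 * (L(w - delta_bar) + L(w + delta_bar))

at_minimizer = expected_shifted_loss(0.0)  # 0.5 * (0.2 + 0.01) = 0.105
biased = expected_shifted_loss(0.3)        # 0.5 * (0.005 + 0.025) = 0.015
print(biased < at_minimizer)  # -> True: the biased point wins in expectation
```

Intuitively, the minimizer pays the steep sharp-side penalty whenever the shift points that way, while the biased point stays on the cheap flat side under both shifts.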

### 4.2 Verification of assumptions

#### Verification of Assumption 1.

We show that a shift between L̂ and L is quite common in practice, taking a ResNet-110 trained on CIFAR-10 as an example. Since we cannot visualize a shift in a high dimensional space, we randomly sample an asymmetric direction u at the SGD solution ŵ* (more results are shown in Appendix C). The blue and red curves shown in Figure 3(a) are obtained by evaluating the loss at ŵ* + lu over a range of l on the training and test sets, which correspond to the training and test loss, respectively.

We then try different shift values δ to “match” the two curves. As shown in Figure 3(a), after applying a horizontal shift to the test loss, the two curves overlap almost perfectly. Quantitatively, we can use the shift gap defined in Definition 3 to evaluate how well the two curves match each other after shifting; the gap after shifting is much lower than before (δ has only one dimension here). In Figure 3(b), we plot the shift gap as a function of δ. Clearly, there exists a nonzero δ that minimizes the gap, indicating a good match.

We conducted the same experiments for different directions, models and datasets, and similar observations were made. Please refer to Appendix C for more results.

#### Verification of Assumption 2.

This is a mild assumption that can be verified empirically. For example, we take an SGD solution of ResNet-110 on CIFAR-10 as w*, and specify an asymmetric direction u for w*. We then randomly sample different local adjustments v for w*. Based on these adjustments, we present in Figure 4 the mean loss curve, and its standard deviation band, along the asymmetric direction for all the points w* + v. As we can see, the variance of these curves is very small, which means all of them are similar to each other. Moreover, we verified that u is asymmetric with respect to all the neighboring points.

## 5 Averaging Generates Good Bias

In the previous section, we showed that when the loss landscape of a local minimum is asymmetric, a solution biased towards the flat side of the valley has better generalization performance. One immediate question is how we can obtain such a solution via practical algorithms. Below we show that this can be achieved by simply taking the average of SGD iterates during the course of training. We first analyze the one dimensional case in Section 5.1, and then extend the analysis to the high dimensional case in Section 5.2.

### 5.1 One dimensional case

For asymmetric functions, as long as the learning rate is not too small, SGD will oscillate between the flat side and the sharp side. Below we focus on one round of oscillation, and show that the average of the iterates in each round has a bias on the flat side. Consequently, by aggregating all rounds of oscillation, averaging SGD iterates leads to a bias as well.

For each individual round t, we assume that it starts at the iteration where SGD moves from the sharp side to the flat side, and ends exactly before the iteration where SGD moves from the sharp side to the flat side again. Let n_t denote the number of iterations in the t-th round, and w₁⁽ᵗ⁾, …, w_{n_t}⁽ᵗ⁾ the corresponding iterates. The average iterate in the t-th round can be written as w̄⁽ᵗ⁾ = (1/n_t) ∑ᵢ wᵢ⁽ᵗ⁾. For notational simplicity, we will omit the superscript on w̄.

The following theorem shows that the expectation of the average w̄ has a bias on the flat side. To get a formal lower bound on E[w̄], we consider the asymmetric case where ζ = 0, and also assume lower bounds on the gradients of the function. Notice that we made little effort to optimize the constants or the bounds on the parameters; we defer the proof to Appendix E.

###### Theorem 2 (SGD averaging generates a bias).

Assume that a local minimizer w* = 0 is a (r, p, c, 0)-asymmetric valley, where the gradient satisfies p ≤ ∇L̂(w) ≤ ap for w ∈ (0, r), and −bcp ≤ ∇L̂(w) ≤ −cp for w ∈ (−r, 0), with a, b ≥ 1. Assume c ≥ C₀ab for a large constant C₀. The SGD updating rule is w_{t+1} = w_t − η(∇L̂(w_t) + ξ_t), where ξ_t is the noise with E[ξ_t] = 0 and |ξ_t| ≤ M, and assume the learning rate η is small enough that ηM ≤ r. Then we have

 E[w̄] > c₀ > 0,

where c₀ is a constant that only depends on the parameters η, p, a, b, c and M above.

Theorem 2 can be intuitively explained by Figure 5. If we run SGD on this one dimensional function, it will stay at the flat side for more iterations as the magnitude of the gradient on this side is much smaller. Therefore, the average of the locations is biased towards the flat side.

Of course, if the learning rate is sufficiently small, there will be no oscillation along the SGD trajectory, as shown in Figure 6. In this case, an iterate on the sharp side tends to end up closer to the minimizer than an iterate on the flat side, because the gradient on the sharp side is much larger, so SGD converges much faster there. In other words, even if there is no oscillation and Theorem 2 does not apply, SGD averaging creates more bias on the flat side than on the sharp side in expectation. Thus, in all scenarios, taking the average of SGD iterates is beneficial for asymmetric loss functions.

In addition, for symmetric loss functions, averaging SGD iterates may also be helpful in terms of denoising (see Appendix D for concrete examples). Therefore, taking the average of the SGD trajectory may always improve generalization, regardless of whether the loss function is symmetric or not.
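A quick simulation illustrates the mechanism of Theorem 2 under toy assumptions (the gradients, learning rate, and noise below are ours, not the paper's setup): running noisy gradient descent on a 1-D asymmetric valley and averaging the iterates yields a point on the flat side of the minimum.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(w):
    # asymmetric valley at 0: gentle slope +0.05 on the flat (positive) side,
    # steep slope -2.0 on the sharp (negative) side
    return 0.05 if w > 0 else -2.0

eta = 0.05  # learning rate, large enough that iterates cross the minimum
w, iterates = 0.0, []
for _ in range(20000):
    w -= eta * (grad(w) + rng.normal(scale=0.5))  # SGD step with noise
    iterates.append(w)

w_bar = np.mean(iterates)
print(w_bar > 0)  # the average of the iterates sits on the flat side
```

Excursions onto the sharp side are pushed back within a step or two by the large gradient, while the walk lingers on the flat side, so the time-average lands at a positive (flat-side) point.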

### 5.2 High dimensional case

For high dimensional functions, the analysis of averaging SGD iterates would be more complicated than that given in the previous subsection. However, if we only care about the bias along a specific direction u, we can directly apply Theorem 2 with one additional assumption. Specifically, if the projections of the loss function onto u along the SGD trajectory satisfy the assumptions in Theorem 2, i.e., they are asymmetric and the gradients on both sides have upper and lower bounds, then the claim of Theorem 2 directly applies. This is because only the gradient along u affects the SGD trajectory projected onto u, and we can safely omit all other directions.

Empirically, we find that this assumption generally holds. For a given SGD solution, we fix a random asymmetric direction u, and sample the loss surface along u passing through the t-th epoch of the SGD trajectory (denoted as w_t), i.e., we evaluate L̂(w_t + lu) for l in a fixed range and for each epoch t.

As shown in Figure 7, after the first few epochs, the projected loss surfaces become relatively stable. Therefore, we can directly apply Theorem 2 to the direction u.

As we will see in Section 6.1, compared with SGD solutions, SGD averaging indeed creates bias along different asymmetric directions, as predicted by our theory.

## 6 Sharp and Flat Minima Illusion

In this section, we show that where a solution is located within a local minimum's basin is very important, which refines the practice of judging generalization performance by the sharpness or flatness of a local minimum. All of our observations support the theoretical analysis in the previous sections.

First, we remark that rigorously testing whether a point is a local minimum, or even close to one, is extremely hard for deep models; see e.g. (Safran & Shamir, 2017). In fact, the Hessians of most empirical solutions still have plenty of small negative eigenvalues (Chaudhari et al., 2016), so technically these solutions are saddle points. We choose to ignore these technicalities and treat all such points as “local minima”.

### 6.1 Illusion case 1: SWA algorithm

Recently, Izmailov et al. (2018) proposed the stochastic weight averaging (SWA) algorithm, which explicitly takes the average of SGD iterates to achieve better generalization. Inspired by their observation that “SWA leads to solutions corresponding to wider optima than SGD”, we provide a more refined explanation in this subsection. That is, averaging weights leads to “biased” solutions in an asymmetric valley, which correspond to better generalization.

Specifically, we run the SWA algorithm (with decreasing learning rate) with three popular deep networks, i.e., ResNet-110, ResNet-164 and DenseNet-100, on the CIFAR-10 and CIFAR-100 datasets, following the configurations in (Izmailov et al., 2018) (denoted as SWA). Then we run SGD with a small learning rate from the SWA solutions to find a solution located in the same basin (denoted as SGD).

In Figure 8, we draw an interpolation between the solutions obtained by SWA and SGD (Izmailov et al. (2018) have done a similar experiment). One can observe that there is no “bump” between these two solutions, meaning they are located in the same basin. Clearly, the SWA solution is biased towards the flat side, which verifies our theoretical analysis in Section 5. Further, we notice that although the biased SWA solution has higher training loss than the empirical minimizer, it indeed yields lower test loss. This verifies our analysis in Section 4. Similar observations are made on other networks and other datasets, which we present in Appendix F.
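The interpolation used in Figure 8 is straightforward to implement for generic weight vectors; the sketch below uses a toy asymmetric basin in place of an actual network's loss, and the helper name `interpolate_losses` is ours.

```python
import numpy as np

def interpolate_losses(loss, w_a, w_b, steps=11):
    """Evaluate loss((1 - t) * w_a + t * w_b) for t in [0, 1]. If no
    intermediate point rises above both endpoints (no 'bump'), the two
    solutions plausibly lie in the same basin."""
    ts = np.linspace(0.0, 1.0, steps)
    return [loss((1 - t) * w_a + t * w_b) for t in ts]

# Toy asymmetric basin: sharp for w[0] < 0, flat for w[0] > 0.
loss = lambda w: float(np.where(w[0] < 0, -2.0 * w[0], 0.05 * w[0]) + w[1] ** 2)
w_sgd = np.array([0.0, 0.0])  # empirical minimizer of the basin
w_swa = np.array([0.8, 0.0])  # averaged solution, biased to the flat side
curve = interpolate_losses(loss, w_sgd, w_swa)
print(max(curve) <= max(curve[0], curve[-1]))  # -> True: no bump in between
```

For real networks, `w_a` and `w_b` would be the flattened weight vectors of the SGD and SWA solutions, and `loss` a forward pass over the training or test set.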

To further support our claim, we list our results in Table 1, from which we can observe that SGD solutions always have higher training accuracy, but worse test accuracy, compared to SWA solutions. This supports Theorem 1, which states that a bias towards the flat side of an asymmetric valley can improve generalization, even though it yields higher training error.

#### Verifying Theorem 2.

We further verify that averaging SGD solutions could create a bias towards the flat side in expectation for many other asymmetric directions, not just for the specific direction we discussed above.

We take a ResNet-110 trained on CIFAR-100 as an example. Denote by u the unit vector pointing from the SGD solution to the SWA solution, and pick another random unit direction v. We then use a direction that combines u and v to verify our claim.

The results are shown in Figure 9, from which we can observe that the SWA solution has a bias towards the flat side compared with the SGD solution. We create different random vectors v for each network and each dataset, and similar observations can be made (see more examples in Appendix G).

### 6.2 Illusion case 2: large batch SGD

Keskar et al. (2017) observed that training with small batch size using SGD algorithm generalizes better than training with large batch size. They argue that it is because large batch SGD tends to converge to sharp minima, while small batch SGD generally converges to flat minima. Here we show that it may not be the case in practice.

We use a PreResNet-164 trained on CIFAR-100 as an example. We first run SGD with a batch size of 128 for 200 epochs to find a solution (denoted as the large batch solution), and then continue training with a batch size of 32 for another 80 epochs to find a nearby solution (denoted as the small batch solution).

From the results shown in Figure 10, it is clear that the small batch solution has worse training accuracy but better test accuracy. Meanwhile, there is no “bump” between these solutions, which suggests they are in the same basin. Therefore, small batch SGD generalizes better because it finds a better-biased solution in the asymmetric valley, not because it finds a different, wider or flatter minimum.

### 6.3 Illusion on the width of a minimum

We further point out that visualizing the “width” of a local minimum in a low-dimensional space may lead to illusive results. For example, one visualization technique (Izmailov et al., 2018) shows how the loss changes along many random directions drawn from a d-dimensional Gaussian distribution.

We take the large batch and small batch solutions from the previous subsection as our example. Figure 11 visualizes the “width” of the two solutions using the method described above. From the figure, one may draw the conclusion that small batch training leads to a wider minimum than large batch training. However, as discussed in Subsection 6.2, these two solutions are actually from the same basin. In other words, the loss curvature near the two solutions looks different because they are located at different positions in an asymmetric valley, not because they are located at different local minima. A similar observation holds for the SWA and SGD solutions; see Appendix H.

## 7 Batch Norm and Asymmetric Valleys

Previous sections have focused on defining what asymmetric valleys are, and how to leverage them for better generalization. In this section, we take a step further to ask where they originate, by presenting empirical evidence that the batch normalization (BN) layers (Ioffe & Szegedy, 2015) adopted by modern neural networks appear to be a major cause of asymmetric valleys.

#### Directions on BN parameters are more asymmetric.

For a given SGD solution, if we take a random direction in which only the BN parameters have non-zero entries, and compare it with a random direction in which only the non-BN parameters have non-zero entries, we observe that the BN-related directions are usually more asymmetric. The result for ResNet-110 on CIFAR-10 is shown in Figure 12. As we can see, the non-BN direction is sharp on both sides, while the BN direction is flat on one side and sharp on the other. We also conducted trials with different networks and datasets, and obtained similar results (see Appendix I).

#### SGD averaging is more effective on BN parameters.

By Theorems 1 and 2, we know that SGD averaging can lead to biased solutions with better generalization along asymmetric directions. If BN indeed creates many asymmetric directions, can we improve model performance by averaging only the weights of the BN layers?

Note that BN parameters constitute only a small fraction of the total model parameters, e.g., 1.41% in a ResNet-110. In the following experiment with ResNet-110 on CIFAR-10, we perform SGD averaging only on the BN parameters (denoted as SWA-BN), and also on randomly selected non-BN parameters of the same amount (1.41% of the total parameters, denoted as SWA-Non-BN). The results are shown in Figure 13. It can be observed that averaging only BN parameters (blue curve) is more effective than averaging non-BN parameters (green curve), although there is still a gap compared to averaging all the weights (yellow curve).
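Averaging only a subset of parameters can be sketched as a running average keyed by parameter name. The snippet below is a minimal dict-based sketch, not a real framework API; the parameter names and the `"bn"`-substring selection rule are hypothetical conventions.

```python
import numpy as np

def update_partial_average(avg, params, n, select):
    """Running average avg <- (n * avg + params) / (n + 1), applied only to
    parameters whose name passes `select`; the rest track the latest value."""
    for name, value in params.items():
        if select(name):
            avg[name] = (n * avg[name] + value) / (n + 1)
        else:
            avg[name] = value.copy()
    return avg

is_bn = lambda name: "bn" in name  # hypothetical naming convention

# Two consecutive "checkpoints" of a tiny fake model.
params0 = {"conv1.weight": np.array([1.0, 1.0]), "bn1.weight": np.array([2.0])}
params1 = {"conv1.weight": np.array([3.0, 3.0]), "bn1.weight": np.array([4.0])}

avg = {k: v.copy() for k, v in params0.items()}
avg = update_partial_average(avg, params1, n=1, select=is_bn)
print(avg["bn1.weight"], avg["conv1.weight"])  # BN averaged, conv taken as-is
```

Averaging all the weights (as in plain SWA) corresponds to `select = lambda name: True`; the SWA-Non-BN baseline corresponds to selecting a random subset of non-BN names of matching size.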

Moreover, we also conduct experiments with two 8-layer ResNets on CIFAR-10, one with BN layers and one without. We choose shallow networks here because deeper models without BN cannot be effectively trained.

As shown in Figure 14, we start weight averaging at a fixed epoch partway through training. Although we observe an improvement in test accuracy after averaging in both networks, it is clear that the network with BN layers has a larger improvement than the network without them. This again indicates that SGD averaging is more effective on BN parameters.

The results presented above are still quite preliminary. Understanding how the asymmetric valleys are formed in deep networks might be a valuable future research direction.

## 8 Conclusion

The width of solutions has been used to explain generalization. In this paper, we elaborate on these arguments, and show that the width along asymmetric valleys, where the loss may increase at different rates along two opposite directions, is especially important for explaining generalization. Based on a formal definition of asymmetric valleys, we showed that a biased solution lying on the flat side of the valley generalizes better than the empirical minimizer. Further, we proved that averaging the points along the SGD trajectory naturally leads to such biased solutions. We have conducted extensive experiments with state-of-the-art deep models to verify our theorems. We hope this paper will strengthen our understanding of the loss landscape of deep neural networks, and inspire new theories and algorithms that further improve generalization.

## Appendix A Additional Figures for Section 3.2: Asymmetric Directions

See Figure 15, Figure 16, Figure 17, Figure 18, and Figure 19.

## Appendix B Missing Proof for Theorem 1

###### Proof.

Since δ has 2ᵈ possible values for a given δ̄, we can use an integer j ∈ {0, 1, …, 2ᵈ − 1} to represent each value: when writing j in binary, its i-th digit represents whether δᵢ = δ̄ᵢ (digit equal to 1) or δᵢ = −δ̄ᵢ (digit equal to 0). We use j ∧ 2ⁱ to represent the bitwise AND between j and 2ⁱ, which equals 0 if the i-th digit of j is 0.

To prove our theorem, it suffices to show that for any $i \in [d]$,

$$\mathbb{E}_\delta\, L'\Big(\hat w^* + \sum_{i_0=1}^{i-1} l_{i_0} u_{i_0}\Big) \;\ge\; \mathbb{E}_\delta\, L'\Big(\hat w^* + \sum_{i_0=1}^{i} l_{i_0} u_{i_0}\Big) + (c_i - 1)\, l_i p_i / 2 - 2\xi \tag{1}$$

If (1) is true, it suffices to take the summation over $i$ on both sides, and the conclusion follows by telescoping. Therefore, below we prove (1).

$$\begin{aligned} \mathbb{E}_\delta\, L'\Big(\hat w^* + \sum_{i_0=1}^{i-1} l_{i_0} u_{i_0}\Big) \;&\overset{①}{\ge}\; \frac{1}{2^d}\sum_{j=0}^{2^d-1} \hat L\Big(\hat w^* + \sum_{i_0=1}^{i-1} l_{i_0} u_{i_0} + \delta_j\Big) - \xi \\ &=\; \frac{1}{2^d}\sum_{\substack{j=0 \\ j \wedge 2^i = 0}}^{2^d-1} \Bigg[\hat L\Big(\hat w^* + \sum_{i_0=1}^{i-1} l_{i_0} u_{i_0} + \delta_j\Big) + \hat L\Big(\hat w^* + \sum_{i_0=1}^{i-1} l_{i_0} u_{i_0} + \delta_{j+2^i}\Big)\Bigg] - \xi \end{aligned} \tag{2}$$

where ① holds by Assumption 1. For every $j$ such that $j \wedge 2^i = 0$,

$$\begin{aligned} \hat w^* + \sum_{i_0=1}^{i} l_{i_0} u_{i_0} + \delta_j &= \hat w^* + \sum_{i_0=1}^{i} l_{i_0} u_{i_0} + \delta_j + \langle \delta_j, u_i \rangle u_i - \langle \delta_j, u_i \rangle u_i \\ &= \hat w^* + \sum_{i_0=1}^{i-1} l_{i_0} u_{i_0} + \delta_j - \bar\delta_i u_i - \langle \delta_j, u_i \rangle u_i + l_i u_i \\ &= \hat w^* + \sum_{i_0=1}^{i-1} l_{i_0} u_{i_0} + \delta_j - \langle \delta_j, u_i \rangle u_i + (l_i - \bar\delta_i) u_i \end{aligned}$$

Since $j \wedge 2^i = 0$, we have $\langle \delta_j, u_i \rangle = -\bar\delta_i$, so the two points above lie at positions $l_i - \bar\delta_i$ and $-\bar\delta_i$ along $u_i$. By Assumption 2, $u_i$ is asymmetric with respect to $\hat L$. Since $l_i - \bar\delta_i \le 0$, both points lie on the sharp side of $u_i$, where the loss grows at rate at least $c_i p_i$. By the definition of an asymmetric direction, we know

$$\hat L\Big(\hat w^* + \sum_{i_0=1}^{i-1} l_{i_0} u_{i_0} + \delta_j\Big) \ge \hat L\Big(\hat w^* + \sum_{i_0=1}^{i} l_{i_0} u_{i_0} + \delta_j\Big) + c_i l_i p_i \tag{3}$$

Similarly,

$$\begin{aligned} \hat w^* + \sum_{i_0=1}^{i} l_{i_0} u_{i_0} + \delta_{j+2^i} &= \hat w^* + \sum_{i_0=1}^{i-1} l_{i_0} u_{i_0} + \delta_{j+2^i} + \langle \delta_{j+2^i}, u_i \rangle u_i - \langle \delta_{j+2^i}, u_i \rangle u_i + l_i u_i \\ &= \hat w^* + \sum_{i_0=1}^{i-1} l_{i_0} u_{i_0} + \delta_{j+2^i} - \langle \delta_{j+2^i}, u_i \rangle u_i + (\bar\delta_i + l_i) u_i \end{aligned}$$

Since the $i$-th digit of $j + 2^i$ is $1$, we have $\langle \delta_{j+2^i}, u_i \rangle = \bar\delta_i$, so both points lie on the flat side of $u_i$, where the loss grows at rate at most $p_i$. Therefore,

$$\hat L\Big(\hat w^* + \sum_{i_0=1}^{i-1} l_{i_0} u_{i_0} + \delta_{j+2^i}\Big) \ge \hat L\Big(\hat w^* + \sum_{i_0=1}^{i} l_{i_0} u_{i_0} + \delta_{j+2^i}\Big) - l_i p_i \tag{4}$$

Combining (3) and (4), we have,

$$\begin{aligned} (2) \;&\ge\; \frac{1}{2^d}\sum_{\substack{j=0 \\ j \wedge 2^i = 0}}^{2^d-1} \Bigg[\hat L\Big(\hat w^* + \sum_{i_0=1}^{i} l_{i_0} u_{i_0} + \delta_j\Big) + c_i l_i p_i + \hat L\Big(\hat w^* + \sum_{i_0=1}^{i} l_{i_0} u_{i_0} + \delta_{j+2^i}\Big) - l_i p_i\Bigg] - \xi \\ &=\; \frac{1}{2^d}\sum_{j=0}^{2^d-1} \hat L\Big(\hat w^* + \sum_{i_0=1}^{i} l_{i_0} u_{i_0} + \delta_j\Big) + (c_i - 1)\, l_i p_i / 2 - \xi \\ &\overset{②}{\ge}\; \mathbb{E}_\delta\, L'\Big(\hat w^* + \sum_{i_0=1}^{i} l_{i_0} u_{i_0}\Big) + (c_i - 1)\, l_i p_i / 2 - 2\xi \\ &=\; \mathbb{E}_\delta\, L\Big(\hat w^* + \sum_{i_0=1}^{i} l_{i_0} u_{i_0}\Big) - \min_w L(w) + \min_w \hat L(w) + (c_i - 1)\, l_i p_i / 2 - 2\xi \end{aligned}$$

where ② holds by Assumption 1, and the last equality holds by the definition of $L'$. That means inequality (1) holds, and summing it over $i$ completes the proof. ∎

## Appendix C Additional Figures for Section 4.2: Shift Exists Empirically

See Figure 20, Figure 21, and Figure 22.

## Appendix D Additional Figures in Section 5: Averaging Works For Symmetric Case

If the function is symmetric, there are two possible cases, as shown in Figure 23 and Figure 24. On one hand, if the function is flat, SGD is likely to stay on one side of the valley along its trajectory, so the average is biased toward that side. On the other hand, if the function is sharp, SGD is likely to oscillate between the two sides, so the average of the iterates concentrates around the center. In both cases, averaging the SGD iterates helps, either by creating a bias toward the flat side or by denoising.
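This behavior is easy to reproduce in one dimension. The sketch below (constants and loss shapes are ours, chosen only for illustration) runs noisy SGD on a sharp symmetric quadratic and on an asymmetric valley; the trajectory average concentrates near the center in the first case and is biased toward the flat side in the second.

```python
import numpy as np

def run_sgd_1d(grad, steps=20000, lr=0.05, noise=1.0, w0=0.0, seed=1):
    """Noisy SGD on a 1-D loss; returns the full trajectory for averaging."""
    rng = np.random.default_rng(seed)
    w, traj = w0, []
    for _ in range(steps):
        w -= lr * (grad(w) + noise * rng.standard_normal())
        traj.append(w)
    return np.array(traj)

# Sharp symmetric valley L(w) = 5 w^2: iterates oscillate around the center,
# so the average of the trajectory concentrates near w = 0.
sharp = run_sgd_1d(lambda w: 10.0 * w)

# Asymmetric valley: steep slope for w < 0, gentle slope for w > 0, so the
# iterates (and hence their average) are biased toward the flat side w > 0.
asym = run_sgd_1d(lambda w: 10.0 * w if w < 0 else 0.5 * w)
```

Here `sharp.mean()` sits near zero while `asym.mean()` is noticeably positive, mirroring the two cases described above.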

## Appendix E Missing Proof for Theorem 2

To prove Theorem 2, we will need the following concentration bound.

###### Lemma 3 (Azuma’s inequality).

Let $X_1, \dots, X_n$ be independent random variables satisfying $|X_i - \mathbb{E}(X_i)| \le c_i$ for $1 \le i \le n$. We have the following bound for $X = \sum_{i=1}^{n} X_i$:

$$\Pr\big(|X - \mathbb{E}(X)| \ge \lambda\big) \le 2 e^{-\frac{\lambda^2}{2\sum_{i=1}^{n} c_i^2}}$$
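As a sanity check, the bound can be compared against simulation. The sketch below (constants are ours) estimates the deviation probability for a sum of bounded uniform variables and confirms that it stays below the right-hand side:

```python
import math
import random

def azuma_bound(lam, cs):
    """Right-hand side of the inequality for Pr(|X - E X| >= lam)."""
    return 2.0 * math.exp(-lam ** 2 / (2.0 * sum(c * c for c in cs)))

def empirical_tail(lam, n=100, trials=20000, seed=0):
    """Empirical Pr(|X - E X| >= lam) for X a sum of n Uniform(-1, 1) terms,
    so each term deviates from its mean by at most c_i = 1."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = sum(rng.uniform(-1.0, 1.0) for _ in range(n))
        hits += abs(x) >= lam
    return hits / trials

# The simulated tail probability should sit below the analytic bound.
assert empirical_tail(15.0) <= azuma_bound(15.0, [1.0] * 100)
```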

First, we have the following bounds on the first step $w_0$ (the quantities $p_{\min}$ and $p_{\max}$ are defined below).

For every run of SGD, $p_{\min} \le w_0 \le p_{\max}$.

###### Proof.

Since $w_0$ is the first step after SGD jumps from the flat side to the sharp side, denote the previous location as $w_{-1}$. Since $w_{-1}$ is at the sharp side, we know that the gradient there is bounded. Therefore, we have

$$w_0 = w_{-1} - \eta\big(\nabla L(w_{-1}) + \omega_{-1}\big)$$

where $\omega_{-1}$ is the SGD noise, bounded by $\nu$.

At the time when SGD jumps from the flat side to the sharp side, the landing position is bounded as above. Since the gradient on the sharp side is bounded, the next step is lower bounded as well. In other words, SGD stays at the sharp side for only a few iterations (this matches our empirical observation; see e.g. Figure 5).

That means the bounds above can be applied to $w_0$ as well, because they concern the same iterate. By applying the upper and lower bounds, we get:

$$w_0 \ge -\eta(a_+ + \nu) - \eta(a_- + \nu) = p_{\min}$$

and also

$$w_0 \le 0 - \eta(b_- - \nu) = p_{\max}$$

Below we first define

$$T_{\min} \triangleq \Bigg(\frac{-\sqrt{2}\,\nu \log^{1/2}(2\tau) + \sqrt{2\nu^2 \log(2\tau) - 4 a_+ (a_- + a_+ + 2\nu)}}{2 a_+}\Bigg)^{2},$$

where $\tau$ is a constant whose value will be set later. $T_{\min}$ satisfies the following inequality.

For every $t \le T_{\min}$, we have $-\eta(a_- + a_+ + 2\nu) - t \eta a_+ - \sqrt{2t}\,\eta\nu \log^{1/2}(2\tau) \ge 0$.

###### Proof.

By the definition of $T_{\min}$, we have

$$\begin{aligned} & -\eta(a_- + a_+ + 2\nu) - t\eta a_+ - \sqrt{2t}\,\eta\nu \log^{1/2}(2\tau) \ge 0 \\ \Leftarrow\;& (a_- + a_+ + 2\nu) + t a_+ + \sqrt{2t}\,\nu \log^{1/2}(2\tau) \le 0 \\ \Leftarrow\;& (a_- + a_+ + 2\nu) + \Delta^2 a_+ + \sqrt{2}\,\Delta\,\nu \log^{1/2}(2\tau) \le 0 \qquad (\Delta \triangleq \sqrt{t}) \\ \Leftarrow\;& \Delta \in \Bigg[0, \frac{-\sqrt{2}\,\nu \log^{1/2}(2\tau) + \sqrt{2\nu^2 \log(2\tau) - 4 a_+ (a_- + a_+ + 2\nu)}}{2 a_+}\Bigg] \\ \Leftarrow\;& t \le \Bigg(\frac{-\sqrt{2}\,\nu \log^{1/2}(2\tau) + \sqrt{2\nu^2 \log(2\tau) - 4 a_+ (a_- + a_+ + 2\nu)}}{2 a_+}\Bigg)^{2} \qquad \qed \end{aligned}$$
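The quadratic-in-$\sqrt{t}$ computation above can be checked numerically. The sketch below (constants are ours and purely illustrative; in particular we pick $a_- + a_+ + 2\nu < 0$ so that a positive root exists) computes $T_{\min}$ and verifies that the inequality holds for $t \le T_{\min}$ and fails just above it:

```python
import math

def f(t, a_plus, a_minus, nu, tau, eta):
    """-eta*(a_- + a_+ + 2 nu) - t eta a_+ - sqrt(2t) eta nu log^{1/2}(2 tau)."""
    return (-eta * (a_minus + a_plus + 2.0 * nu) - t * eta * a_plus
            - math.sqrt(2.0 * t) * eta * nu * math.sqrt(math.log(2.0 * tau)))

def t_min(a_plus, a_minus, nu, tau):
    """Largest t with f(t) >= 0, from the positive root of the quadratic in
    Delta = sqrt(t) (the factor eta > 0 cancels out)."""
    s = a_minus + a_plus + 2.0 * nu
    disc = 2.0 * nu ** 2 * math.log(2.0 * tau) - 4.0 * a_plus * s
    delta = (-math.sqrt(2.0) * nu * math.sqrt(math.log(2.0 * tau))
             + math.sqrt(disc)) / (2.0 * a_plus)
    return delta ** 2

# Illustrative constants (not from the paper): a_- is negative enough that
# a_- + a_+ + 2 nu < 0, so the quadratic has a positive root.
T = t_min(a_plus=0.1, a_minus=-1.0, nu=0.1, tau=2.0)
assert f(math.floor(T), 0.1, -1.0, 0.1, 2.0, eta=0.01) >= 0.0
assert f(math.ceil(T), 0.1, -1.0, 0.1, 2.0, eta=0.01) < 0.0
```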

Now, we have the following theorem, which says that with decent probability, the number of iterates staying on the flat side in each round is at least $T_{\min}$.

###### Theorem 6.

If we start at $w_0$, then for every fixed $t \le T_{\min}$, with probability at least $1 - 1/\tau$, we have $w_t > 0$, i.e., the iterate remains on the flat side.

###### Proof.

Define the filtration $\mathcal{F}_t = \sigma(\omega_0, \dots, \omega_{t-1})$, where $\sigma(\cdot)$ denotes the sigma field. Define the event $E_t = \{w_s > 0, \forall s \le t\}$, i.e., the iterates stay on the flat side up to step $t$, and define $G_t = w_0 - w_t - t\eta a_+ + M$, where $M \triangleq (T_{\min}+1)(w_0 + \nu + 2\eta a_+)$. Since we only consider the case $t \le T_{\min}$, we have

$$G_t = w_0 - w_t - t\eta a_+ + (T_{\min}+1)(w_0 + \nu + 2\eta a_+) > w_0 - w_t - t\eta a_+ + w_t + t\eta a_+ > 0$$

Therefore, $G_t$ is always positive. By the SGD updating rule, we have

$$\begin{aligned} \mathbb{E}[G_{t+1} 1_{E_t} \mid \mathcal{F}_t] &= \mathbb{E}\big[(w_0 - w_{t+1} - (t+1)\eta a_+ + M)\, 1_{E_t} \mid \mathcal{F}_t\big] \\ &\le \mathbb{E}\big[(w_0 - w_t + \eta\omega_t - t\eta a_+ + M)\, 1_{E_t} \mid \mathcal{F}_t\big] = (w_0 - w_t - t\eta a_+ + M)\, 1_{E_t} = G_t 1_{E_t} \end{aligned} \tag{5}$$

Since $1_{E_t} \le 1_{E_{t-1}}$ and $G_t$ is always positive, we have

$$G_t 1_{E_t} \le G_t 1_{E_{t-1}} \tag{6}$$

Combining (5) and (6), we know that $G_t 1_{E_{t-1}}$ is a supermartingale.
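The supermartingale step rests on the one-step increment $G_{t+1} - G_t = \eta(\nabla L(w_t) + \omega_t - a_+)$ having non-positive conditional mean whenever the flat-side gradient lies in $[0, a_+]$. A quick simulation (constants are ours, purely illustrative) makes this concrete:

```python
import random

# One-step increment of G_t: since w_{t+1} = w_t - eta*(grad + omega), we get
# G_{t+1} - G_t = eta*(grad + omega - a_+). On the flat side grad lies in
# [0, a_+] and E[omega] = 0, so the conditional mean increment is <= 0.
def g_increment(grad_val, a_plus, eta, nu, rng):
    omega = rng.uniform(-nu, nu)      # bounded SGD noise, |omega| <= nu
    return eta * (grad_val + omega - a_plus)

rng = random.Random(0)
a_plus, eta, nu = 0.1, 0.01, 0.5
incs = [g_increment(rng.uniform(0.0, a_plus), a_plus, eta, nu, rng)
        for _ in range(200000)]
mean_inc = sum(incs) / len(incs)      # negative on average: supermartingale
```

The increments are also uniformly bounded (by $\eta(a_+ + \nu)$ here), which is exactly what the Azuma step below needs.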

We can also bound the absolute value of the difference in every iteration:

$$\begin{aligned} \big|G_{t+1} 1_{E_t} - \mathbb{E}[G_{t+1} 1_{E_t} \mid \mathcal{F}_t]\big| &= \big|(w_0 - w_{t+1} - (t+1)\eta a_+ + M) - (w_0 - w_t + \eta \nabla L(w_t) - (t+1)\eta a_+ + M)\big|\, 1_{E_t} \\ &\le \eta\nu \end{aligned}$$

By Azuma’s inequality, we get:

$$\Pr\big(G_t 1_{E_{t-1}} - G_0 \ge \lambda\big) \le 2 e^{-\frac{\lambda^2}{2 t \eta^2 \nu^2}}$$

That gives,

$$\Pr\Big(G_t 1_{E_{t-1}} - G_0 \ge \sqrt{2t}\,\eta\nu \log^{1/2}(2\tau)\Big) \le 1/\tau$$
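The deviation level is chosen precisely so that the Azuma bound collapses to $1/\tau$; a short numerical check (sample constants are ours):

```python
import math

def azuma_rhs(lam, t, eta, nu):
    """Right-hand side 2 exp(-lam^2 / (2 t eta^2 nu^2)) from the Azuma step."""
    return 2.0 * math.exp(-lam ** 2 / (2.0 * t * eta ** 2 * nu ** 2))

def lam_choice(t, eta, nu, tau):
    """The deviation level sqrt(2t) * eta * nu * log^{1/2}(2 tau) used above."""
    return math.sqrt(2.0 * t) * eta * nu * math.sqrt(math.log(2.0 * tau))

# Plugging the chosen lambda into the bound gives 1/tau (up to floating-point
# rounding), independently of the sample values of t, eta, and nu.
t, eta, nu, tau = 100, 0.01, 0.5, 20.0
rhs = azuma_rhs(lam_choice(t, eta, nu, tau), t, eta, nu)
```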

That means, if $E_{t-1}$ holds, then with probability at least $1 - 1/\tau$,

$$w_0 - w_t - t\eta a_+ + M < \sqrt{2t}\,\eta\nu \log^{1/2}(2\tau) + G_0 = \sqrt{2t}\,\eta\nu \log^{1/2}(2\tau) + M$$

which gives

$$w_t > w_0 - t\eta a_+ - \sqrt{2t}\,\eta\nu \log^{1/2}(2\tau) \ge p_{\min} - t\eta a_+ - \sqrt{2t}\,\eta\nu \log^{1/2}(2\tau) \ge 0$$

where the last inequality uses $w_0 \ge p_{\min} = -\eta(a_- + a_+ + 2\nu)$ together with the defining property of $T_{\min}$.