
# Analyzing Lottery Ticket Hypothesis from PAC-Bayesian Theory Perspective

The lottery ticket hypothesis (LTH) has attracted attention because it can explain why over-parameterized models often show high generalization ability. Iterative magnitude pruning (IMP) is an algorithm for finding winning tickets: sparse networks with high generalization ability that can be trained independently from the initial weights. It is known that IMP does not work well with a large initial learning rate in deep neural networks such as ResNet. However, since a large initial learning rate generally helps the optimizer converge to flatter minima, we hypothesize that winning tickets lie in relatively sharp minima, which is considered a disadvantage in terms of generalization ability. In this paper, we confirm this hypothesis and show that PAC-Bayesian theory can provide an explicit understanding of the relationship between LTH and generalization behavior. On the basis of our experimental findings that flatness is useful for improving accuracy and robustness to label noise and that the distance from the initial weights is deeply involved in winning tickets, we offer a PAC-Bayes bound using a spike-and-slab distribution to analyze winning tickets. Finally, we revisit existing algorithms for finding winning tickets from a PAC-Bayesian perspective and provide new insights into these methods.


## 1 Introduction

The high generalization ability of modern neural networks can be attributed to heavy over-parameterization and effective learning algorithms [34]. This increase in the number of parameters leads to high computational cost and high memory usage, and network pruning is one of the effective techniques for addressing these problems [17], [11], [10]. After a significant number of parameters are pruned, the pruned network can often work well with little or no accuracy loss. However, training this sparse subnetwork independently from the initial weights often does not work, and we can only obtain the sparse subnetwork through pruning after training the whole network.

Frankle and Carbin [7] presented the “Lottery Ticket Hypothesis (LTH)”, which states the existence of winning tickets: small but critical subnetworks that can be trained independently from scratch. They proposed an algorithm called iterative magnitude pruning (IMP) to obtain a winning ticket. They pointed out that, for deeper networks such as VGG [29] and ResNet [13], a small learning rate is required to obtain a winning ticket. However, since a large learning rate helps the generalization ability of neural networks [19], the learning rate should be well controlled to find a winning ticket that has a higher test accuracy. Frankle et al. [8] found a correlation between the stability to SGD noise and the ability to find the winning ticket, and empirically showed that a large learning rate moves weights too much under the low-stability learning process.

In this paper, we first empirically show that winning tickets are more vulnerable to label noise than subnetworks created with a large learning rate; that is, the generalization ability of winning tickets is degraded by the learning rate constraint. To explain this, we next apply PAC-Bayesian theory to LTH and show that it can explain the relationship between LTH and generalization behavior. We use the PAC-Bayes bound for a spike-and-slab distribution to analyze winning tickets. This choice is based on our experimental findings that reducing the expected sharpness restricted to the unpruned parameter space and adding a regularization of the distance from the initial weights can enhance the test performance of winning tickets. Finally, we revisit existing algorithms such as IMP and continuous sparsification [28] from the point of view of PAC-Bayes bound optimization. This consideration gives an interpretation of these methods as approximations of bound optimization.

To sum up, our contributions are as follows.

• We experimentally show that suppressing the expected sharpness, considered only on unpruned weights, can improve the generalization accuracy of winning tickets and that the distance from the initial weights is critical for IMP, i.e., balancing the distance and the training error helps to find them.

• On the basis of our findings, we reveal that the PAC-Bayesian formulation for a spike-and-slab distribution effectively captures the behavior of winning tickets.

• We revisit the existing algorithms from the PAC-Bayesian perspective and explain their behavior.

## 2 Related Work

##### Generalization Ability

Li et al. [19] showed that the initial large learning rate helps a model learn the hard-to-generalize pattern faster and better than the small learning rate, leading to a higher generalization accuracy. Lewkowycz et al. [18] also showed that the large learning rate generally helps an optimizer to converge to flatter minima, which is advantageous for generalization ability. It is known that the distance from the learned weights to the initial ones becomes smaller in over-parameterized networks [5], [24], [34]. Norm-based complexity measures such as the spectral norm fail to explain the generalization ability of over-parameterized networks because they depend on the number of hidden units [25]; however, the distance from the initial weights can explain it.

##### Experimental Results on LTH

Frankle et al. [8] provided insights into why IMP with the large learning rate fails to find winning tickets in some problem settings. They proposed IMP with rewinding to an early epoch to avoid the very early training process, because training is not stable to SGD noise in the early training regime.

Liu et al. [21] claimed that the large learning rate performs better in the larger model settings such as ResNet56 and ResNet110, contrary to Frankle and Carbin [7]. From these studies, we suppose that there is no good solution near the initial weights in such a setting; therefore the small learning rate cannot find the winning tickets. We conclude that IMP with rewinding is effective in such a setting because it shifts the initial weights closer to the final trained weights by pretraining a little and searches for the winning ticket in a better lottery.

Zhang et al. [34] state that over-parameterization of neural networks leads them to memorize label noise, which hurts the generalization ability. Xia et al. [31] propose a label-noise learning method that prevents memorization by dividing the parameters into critical and non-critical ones based on LTH.

Some studies have investigated flatness and the distance from the initial weights in the context of LTH. Bain [2] visualized the loss landscape of winning tickets and found that IMP produces a more convex and sharper minimum than random pruning. Bartoldson et al. [3] found that pruning regularizes similarly to noise injection, and they discussed the generalization of pruned networks in terms of flatness, which they measured by using the trace of the Hessian and of the gradient covariance matrix. He et al. [14] refer to the distance from the initial weights to analyze the label noise robustness of winning tickets; they stated that the influence of label noise is lessened if the distance is suppressed. Liu et al. [20] discussed the relationship between winning tickets and the learning rate by considering the similarity between initial and trained weights in terms of pruning mask overlap instead of the distance from the initial weights. Recent studies find a mask that shows good accuracy without training the weights at all [36], [27]. This is an extreme case in terms of the distance from the initial weights; however, it is consistent with our result that IMP finds winning tickets close to the initial weights.

##### Theoretical Results on LTH

Malach et al. [23] demonstrated a subnetwork with a comparable accuracy to the original network in a sufficiently over-parameterized network, without any training. Zhang et al. [35] analyzed the generalization capability of winning tickets based on sample complexity. They provided insight into why the sparsity increases the generalization ability though it is limited to the case of a one-hidden-layer neural network.

##### PAC-Bayesian Theory

Hayou et al. [12] also used a spike-and-slab distribution in their PAC-Bayesian analysis. However, their motivation and purpose are completely different from our work. Their aim was to obtain a sophisticated pruning mask by optimizing the PAC-Bayes bound, and this is mainly in the context of network pruning rather than LTH. Our work differs in that we use it to analyze the generalization behavior of a given winning ticket on the basis of our empirical findings on the flatness and distance from initial weights.

## 3 Background

##### IMP

Iterative pruning is a method of obtaining a subnetwork by repeating training and pruning in stages. Frankle and Carbin [7] showed that iterative pruning (Algorithm 1) can find the winning ticket by adding an operation to restore the weights to the initial weights after pruning.

In Frankle and Carbin [7], the mask criterion is simply to keep the weights with a large final magnitude; this is called Iterative Magnitude Pruning (IMP). This paper follows their setting: is set to and the models are pruned globally.
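The train, prune, and revert loop described above can be illustrated with a minimal sketch on a toy linear regression problem. This is not the paper's Algorithm 1: the function names (`train`, `prune_smallest`, `imp`), the toy data, the number of rounds, and the per-round pruning rate of 0.25 are all assumptions chosen for illustration.

```python
import random

def train(init_weights, mask, data, lr=0.05, steps=300):
    """Full-batch gradient descent on mean squared error for a linear model.
    Pruned coordinates (mask == 0) are held at zero throughout."""
    w = [wi * mi for wi, mi in zip(init_weights, mask)]
    n = len(data)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / n
        w = [(wi - lr * gi) * mi for wi, gi, mi in zip(w, grad, mask)]
    return w

def prune_smallest(trained, mask, prune_rate):
    """Global magnitude criterion: drop the smallest surviving weights."""
    alive = sorted((abs(trained[i]), i) for i, m in enumerate(mask) if m == 1)
    new_mask = list(mask)
    for _, i in alive[:int(len(alive) * prune_rate)]:
        new_mask[i] = 0
    return new_mask

def imp(init_weights, data, rounds, prune_rate):
    """Repeat: train from the initial weights, prune, revert to the initial weights."""
    mask = [1] * len(init_weights)
    for _ in range(rounds):
        trained = train(init_weights, mask, data)  # always restart from init
        mask = prune_smallest(trained, mask, prune_rate)
    return mask, train(init_weights, mask, data)

# Toy problem (assumed for illustration): only features 0 and 3 matter.
rng = random.Random(0)
xs = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(64)]
data = [(x, 3.0 * x[0] - 2.0 * x[3]) for x in xs]
init = [rng.uniform(-0.1, 0.1) for _ in range(8)]
mask, ticket = imp(init, data, rounds=3, prune_rate=0.25)
```

The key design point is that every round calls `train` with the same `init_weights`, so only the mask changes between rounds; this is the "restore to the initial weights" step that distinguishes IMP from plain iterative pruning.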

Frankle et al. [8] proposed iterative pruning with rewinding to avoid the phase of low stability to SGD noise early in training in each IMP step. The algorithm does not return to the exact initial weights; instead, it returns to weights that have been trained slightly in advance and uses them as the initial weights. It can find the winning ticket even with a large initial learning rate or in harder settings such as ImageNet; however, it revises the original LTH setting. Our paper focuses on analyzing the properties of winning tickets rather than improving the accuracy or robustness of winning tickets; thus, we will not consider this problem setting.

##### PAC-Bayesian Theory

First, we provide the notation used in this paper. Denote by $S = \{(x_i, y_i)\}_{i=1}^{m}$ the training sample, which is randomly sampled from an underlying data distribution $\mathcal{D}$. Let $\mathcal{H}$ be a set of hypotheses and $\ell$ be a loss function. Given $f \in \mathcal{H}$, we formulate the empirical risk on $S$ and the generalization error on $\mathcal{D}$ as

$$L_S(f)=\frac{1}{m}\sum_{i=1}^{m}\ell(f,x_i,y_i);\qquad L_{\mathcal{D}}(f)=\mathbb{E}_{(x,y)\sim\mathcal{D}}\,\ell(f,x,y). \tag{1}$$

The PAC-Bayesian framework gives a bound on the generalization error of a posterior distribution $Q$ over the hypothesis set; we denote it $L_{\mathcal{D}}(Q)$. It assumes that we have a prior distribution $P$ on the hypothesis set that does not depend on the training data, and we update it to $Q$ through the learning process. Optimizing the PAC-Bayes bound controls the balance between the empirical risk and the closeness to the prior (small complexity of the model). Although there are several types of PAC-Bayes bound, we consider the following widely used form.

###### Theorem 1 (Alquier et al. bound [1])

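Given a real number $\delta$, a non-negative real number $\lambda$, and a prior distribution $P$ on the hypothesis set defined before seeing any training sample $S$, with probability at least $1-\delta$, for all $Q$ on the hypothesis set,

$$L_{\mathcal{D}}(Q)\le L_S(Q)+\frac{1}{\lambda}\left(\mathrm{KL}[Q\,\|\,P]+\log\frac{1}{\delta}+\Psi(\lambda,m)\right), \tag{2}$$

where

$$\Psi(\lambda,m)\coloneqq\log\mathbb{E}_{h\sim P,\,S\sim\mathcal{D}^m}\left[\exp\left(\lambda\left(L_{\mathcal{D}}(h)-L_S(h)\right)\right)\right]. \tag{3}$$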

It is known that flatness is closely related to the generalization ability of neural networks [16], [15]. As shown in previous studies [5], [25], [26], it can be viewed in the expected sense from a PAC-Bayesian perspective. We decompose the PAC-Bayes bound as follows.

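$$L_{\mathcal{D}}(Q)\le L_S(\bar\theta)+\underbrace{L_S(Q)-L_S(\bar\theta)}_{\text{expected sharpness}}+\frac{1}{\lambda}\left(\mathrm{KL}[Q\,\|\,P]+\log\frac{1}{\delta}+\Psi(\lambda,m)\right), \tag{4}$$

where $\bar\theta$ is a solution obtained by training. The expected sharpness term represents the amount of change in empirical risk around the trained weights, and solutions in flatter minima are expected to have a relatively smaller value of this term. In the PAC-Bayesian framework, the role flatness plays in generalization behavior can be understood in this way.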

For example, consider the case where Gaussian distributions are used for the prior and posterior. Let $\mathcal{N}(\mu,\Sigma)$ be the Gaussian distribution, where $\mu$ is the mean and $\Sigma$ is the covariance matrix, and let $\theta$ be the parameters of a neural network. We set the prior to be $\mathcal{N}(0,\sigma^2 I)$ and the posterior to be $\mathcal{N}(\bar\theta,\sigma^2 I)$, centered at the trained weights $\bar\theta$, where $\sigma>0$, and the KL term is calculated as $\|\bar\theta\|_2^2/(2\sigma^2)$. This is consistent with the conventional understanding: solutions in flat minima obtained with norm regularization can achieve good generalization accuracy.

## 4 Empirical Analysis

We first show some empirical results because our new findings about winning tickets in this empirical analysis motivate the PAC-Bayesian analysis for LTH; thus, we interpret the results on the basis of the PAC-Bayesian perspective in the next section.

We empirically investigated the properties of winning tickets mainly related to the learning rate. First, we show that winning tickets are vulnerable to label noise due to a small learning rate. They are in relatively sharp minima, and generalization ability can be improved by finding the flatter solutions. We also focused on the distance from initial weights and show that the regularization of this distance, rather than the regularization of the norm, is important in the discovery of winning tickets.

### 4.1 Vulnerability to label noise

We examined the test accuracy when some fraction of the labels in the training set is randomly flipped, to see whether the generalization behavior of winning tickets is degraded by the constraint that the learning rate must be small. We followed the experimental setting of Frankle and Carbin [7] and performed IMP with ResNet20 and VGG16 trained on CIFAR10.
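As a concrete illustration of randomly flipping a fraction of training labels, the sketch below implements one common variant, symmetric label noise in which each selected label is replaced by a different class drawn uniformly. Whether the experiments here use exactly this variant is an assumption; the function name `flip_labels` and its parameters are hypothetical.

```python
import random

def flip_labels(labels, noise_rate, num_classes, seed=0):
    """Return a copy of `labels` with a `noise_rate` fraction replaced
    by a different class chosen uniformly at random (assumed noise model)."""
    rng = random.Random(seed)
    n_flip = int(round(noise_rate * len(labels)))
    flip_idx = rng.sample(range(len(labels)), n_flip)
    noisy = list(labels)
    for i in flip_idx:
        # Exclude the original class so every selected label really changes.
        choices = [c for c in range(num_classes) if c != labels[i]]
        noisy[i] = rng.choice(choices)
    return noisy

clean = [i % 10 for i in range(1000)]  # stand-in for CIFAR10-style labels
noisy = flip_labels(clean, noise_rate=0.4, num_classes=10)
```

Only the training labels would be corrupted in this way; test accuracy is still measured against clean labels.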

Figure 1 shows the test accuracy on clean and label-noise datasets of sparse subnetworks produced by IMP with different learning rates. In the setting without label noise (green line), there is an accuracy drop at some point as the learning rate is increased. The subnetwork eventually performs worse than the original unpruned network, which means that IMP ultimately fails to find the winning ticket. In contrast, the test accuracy generally increases as the learning rate increases in the high label noise setting (orange line). The test accuracy continues to improve even when the subnetwork is no longer a winning ticket. To sum up, the large learning rate is not suitable for finding winning tickets on a clean dataset but is advantageous in the high label noise setting.

### 4.2 Flatness

Now that we have seen that winning tickets have undesirable properties due to the small learning rate, we investigated whether this result comes from the difference in flatness around the found solution. There are many previous studies that discuss the relationship between the learning rate and flatness [19], [18]. In this experiment, perturbations are added only to unpruned weights to consider flatness on pruned networks. We will discuss this justification in detail in Section 5.

First, we visualized the loss landscape of subnetworks produced by IMP. Figure 2 shows the 1-d loss landscape of a subnetwork that has a high sparsity. This landscape shows the test accuracy around the trained weights when perturbations restricted to the unpruned parameters are added. We can see that the parameters of the winning ticket (small learning rate) are in a sharper minimum than those obtained with the large learning rate. The large learning rate can find a flatter minimum; however, the result is not a winning ticket. This graph is a loss landscape at a fixed sparsity; therefore, at first glance it seems puzzling that for VGG16 the large learning rate setting has a smaller error than the small learning rate setting, but in fact, there is a test accuracy drop when the sparsity is changed.

Given that IMP requires a small learning rate rather than a large one, these results suggest two possible interpretations: 1) a sharp minimum is essential to find winning tickets, and therefore the large learning rate fails in IMP; 2) a sharp minimum is simply the result of training with a small learning rate, and a flat minimum is better for winning tickets if it can be reached. To investigate these possibilities, we used sharpness-aware minimization (SAM) [6] and neural variable risk minimization (NVRM) [32] to search for parameters that lie in neighborhoods having uniformly low loss. They differ in that SAM minimizes the maximum loss in the neighborhood, whereas NVRM minimizes the expected loss in the neighborhood. We used both of them to compare with normal SGD. These optimizers are based on SGD with the same setting as the original LTH paper, and the noise considered in them is limited to the unpruned parameters. Figure 2 also shows the loss landscape when SAM is used; it can actually reach a flatter minimum and higher accuracy compared to using SGD. The loss landscape is as flat as in the large learning rate setting; however, the minimum loss is much lower than that of the large learning rate, which means that IMP succeeds in finding the winning ticket.
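To make "the noise is limited to the unpruned parameters" concrete, the following is a minimal sketch of a single SAM-style update in which both the adversarial perturbation and the descent step are masked. It uses the standard SAM recipe (perturb along the normalized gradient with radius `rho`, then descend using the gradient at the perturbed point) on a toy quadratic; the function name, the toy loss, and the hyperparameter values are assumptions, not the paper's implementation.

```python
import math

def masked_sam_step(w, mask, grad_fn, lr=0.1, rho=0.05):
    """One SAM-style step whose perturbation lives only on unpruned weights."""
    g = [gi * mi for gi, mi in zip(grad_fn(w), mask)]
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # already stationary on the unpruned subspace
    # Ascent direction restricted to the unpruned coordinates.
    eps = [rho * gi / norm for gi in g]
    w_adv = [wi + ei for wi, ei in zip(w, eps)]
    # Descend using the (masked) gradient evaluated at the perturbed point.
    g_adv = [gi * mi for gi, mi in zip(grad_fn(w_adv), mask)]
    return [(wi - lr * gi) * mi for wi, gi, mi in zip(w, g_adv, mask)]

# Toy quadratic loss (assumed for illustration).
C = [1.0, 1.0, 4.0, 1.0]
T = [0.5, 0.3, 0.2, 0.1]

def loss(w):
    return sum(c * (wi - t) ** 2 for c, wi, t in zip(C, w, T))

def grad(w):
    return [2.0 * c * (wi - t) for c, wi, t in zip(C, w, T)]

mask = [1, 0, 1, 0]
w0 = [1.0, 0.0, -1.0, 0.0]
w = list(w0)
for _ in range(50):
    w = masked_sam_step(w, mask, grad)
```

An NVRM-style variant would replace the worst-case perturbation `eps` with random noise on the same unpruned coordinates, which corresponds to minimizing the expected rather than the maximum loss in the neighborhood.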

Next, we investigated the generalization accuracy of the winning tickets produced by these optimizers. Figure 1 shows that SAM and NVRM can achieve a test accuracy the same as or even better than that of SGD in the small learning rate setting. This improvement in test accuracy is only slight in the clean dataset setting but is remarkable in the high label noise setting, especially on CIFAR10. We found no significant difference in test accuracy between SAM and NVRM, and we found that suppressing the expected sharpness on pruned networks with these optimizers is useful for generalization ability. We also found that the large learning rate still cannot find winning tickets even when we use SAM instead of SGD (Appendix A.2). The large learning rate has already found relatively flat minima; therefore, the results do not change by using SAM. This fact implies that the flatness of solutions found by SGD with a large learning rate and by SAM with a small learning rate is similar, but winning tickets cannot be found with a large learning rate because the solutions differ in some other property. Next, we analyze this difference by focusing on the distance from the initial weights.

### 4.3 Distance from initial weights

As a reason for the small learning rate constraint, we hypothesized that winning tickets can only be found in IMP within a range not far from the initial weights. In order to confirm this, we run IMP while suppressing this distance and compare it with the usual regularization of the parameter norm. In the context of LTH, this distance is also referred to in He et al. [14]; however, they measure it to analyze its correlation with the label noise robustness of winning tickets. We investigated this concept to discuss training and test accuracy, not just robustness to label noise.

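We empirically found that the winning ticket can be obtained via the regularization of this distance even under the large learning rate setting. Let $\lambda$ be a regularization hyperparameter, $\theta$ be network weights, $\theta_{\mathrm{init}}$ be the initial weights, and $m$ be a pruning mask. We designed the loss functions with the distance regularization and the norm regularization, respectively, as follows.

$$L_S(\theta)+\lambda\left\|(\theta-\theta_{\mathrm{init}})\odot m\right\|_2^2;\qquad L_S(\theta)+\lambda\left\|\theta\odot m\right\|_2^2. \tag{5}$$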
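The two penalty terms in Eq. 5 differ only in the point toward which the unpruned weights are pulled. A minimal sketch, with hypothetical function names and example values chosen for illustration:

```python
def distance_penalty(theta, theta_init, mask, lam):
    """lam * ||(theta - theta_init) ⊙ m||_2^2: pulls unpruned weights toward init."""
    return lam * sum(((t - t0) * m) ** 2
                     for t, t0, m in zip(theta, theta_init, mask))

def norm_penalty(theta, mask, lam):
    """lam * ||theta ⊙ m||_2^2: pulls unpruned weights toward zero."""
    return lam * sum((t * m) ** 2 for t, m in zip(theta, mask))

theta = [1.0, -2.0, 0.5]
theta_init = [0.9, -2.1, 3.0]
mask = [1, 1, 0]  # the third weight is pruned and contributes to neither penalty

d_pen = distance_penalty(theta, theta_init, mask, lam=0.1)
n_pen = norm_penalty(theta, mask, lam=0.1)
```

In this example the weights are large but close to their initial values, so the distance penalty is tiny while the norm penalty is substantial, which is the situation in which the two regularizers push training in different directions.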

Figure 3 shows the test accuracy under the two regularizations as the hyperparameter $\lambda$ is varied. We plot subnetworks at two sparsity levels and the original unpruned network as a baseline. If the subnetwork accuracy is close to the whole-network accuracy, IMP is considered to have succeeded in finding the winning ticket. As discussed previously, the large learning rate fails to find winning tickets, unlike the small learning rate setting; however, adding the regularization toward the initial weights changes this trend. In Figure 3, there is a range of $\lambda$ where a winning ticket is found, since the accuracy drop from the whole network is suddenly reduced, and there is no such trend when norm regularization is added. This means that IMP can obtain winning tickets by suppressing the distance from the initial weights even with the large learning rate (other settings are in Appendix A.3). We also obtain interesting results when the learning rate is small. The accuracy drop is small while $\lambda$ is small; however, it becomes large as $\lambda$ increases. With the distance regularization, it is possible that the gap widens because sparse networks are more affected by strong regularization (the same holds for the large learning rate setting), but with the norm regularization, the gap widens significantly even though generalization ability increases because of regularization. Strong norm regularization makes IMP fail to find a winning ticket even if a winning ticket could be found originally.

These results are related to the selection of the prior mean from the PAC-Bayesian perspective. In the training of a normal network, there is a trade-off between suppressing the parameter norm, i.e., reducing the KL term, and reducing the training loss, and it is important to ensure a balance between them. As for IMP, there is a specific situation where suppressing the parameter norm makes the training loss large due to the failure to find a winning ticket. If we take the initial weights as the prior mean instead of zero, the training loss can decrease by suppressing the KL term when the trained weights are far from the initial weights. This experiment corresponds to the question of what to take as the prior mean when minimizing the PAC-Bayes bound, and this result indicates that setting the prior mean to the initial weights seems to be compatible with minimizing the PAC-Bayes bound in the case of winning tickets.

## 5 PAC-Bayesian Analysis on LTH

First, we present some possible definitions of subnetwork flatness and consider a PAC-Bayes bound of a spike-and-slab formulation based on our experiments. Next, we show that this bound captures the generalization behavior of winning tickets and revisit existing algorithms from the perspective of this bound.

### 5.1 Flatness on winning tickets

While we could simply consider the neighborhood of the trained weights in an unpruned neural network, it is not trivial to define the flatness of pruned networks, because it depends on how the pruned weights are taken into consideration. Possible measures of the expected sharpness are as follows.

1. Add noise to the parameters restricted on unpruned parameter space.

2. Add noise to the parameters including the pruned weights.

3. Recover pruned weights and train the whole network to convergence (re-dense training [9]) and measure its flatness.

Measure 2 is the same as the unpruned neural network, but the sparse weights in the whole parameter space can no longer be in the local minimum; thus, it is uncertain whether flatness has any meaning in such a setting. He et al. [14] conducted re-dense training and showed that solutions at high sparsity are no longer minimizers in high dimensions. They also found that the winning ticket has higher sharpness than the original network based on Measure 3 and concluded that highly sparse solutions do not stick around the flat basins of minimizers. However, none of these metrics has any justification.

As discussed in the previous section, flatness can be viewed from a PAC-Bayesian perspective. We use a PAC-Bayes bound of a spike-and-slab formulation, where expected sharpness corresponds to Measure 1. We will also compare this formulation with the normal Gaussian formulation (Measure 2) through numerical experiments and show that the spike-and-slab formulation captures the generalization behavior of winning tickets better. This supports our findings that using SAM and NVRM on pruned networks can enhance the generalization accuracy of winning tickets.
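The difference between Measure 1 and Measure 2 can be illustrated with a Monte Carlo estimate of expected sharpness, i.e., the average increase in loss under Gaussian weight noise. The sketch below is a toy under stated assumptions: the quadratic loss, the noise scale, and the function name `expected_sharpness` are hypothetical, and the pruned coordinates are deliberately placed away from their unconstrained optimum so that the sparse solution is not a minimizer in the full space.

```python
import random

def expected_sharpness(loss_fn, theta, mask, sigma, restrict_to_unpruned,
                       n_samples=2000, seed=0):
    """Monte Carlo estimate of E[loss(theta + noise)] - loss(theta)."""
    rng = random.Random(seed)
    base = loss_fn(theta)
    total = 0.0
    for _ in range(n_samples):
        noisy = []
        for t, m in zip(theta, mask):
            if restrict_to_unpruned and m == 0:
                noisy.append(t)  # Measure 1: pruned weights stay exactly at zero
            else:
                noisy.append(t + rng.gauss(0.0, sigma))
        total += loss_fn(noisy) - base
    return total / n_samples

# Toy loss: steep along the pruned coordinates (assumed for illustration).
C = [1.0, 1.0, 10.0, 10.0]
T = [0.3, -0.2, 0.1, 0.1]

def loss(w):
    return sum(c * (wi - t) ** 2 for c, wi, t in zip(C, w, T))

theta = [0.3, -0.2, 0.0, 0.0]  # trained on the unpruned coordinates only
mask = [1, 1, 0, 0]

measure_1 = expected_sharpness(loss, theta, mask, 0.1, restrict_to_unpruned=True)
measure_2 = expected_sharpness(loss, theta, mask, 0.1, restrict_to_unpruned=False)
```

For this quadratic the exact values are `sigma**2` times the sum of the curvature coefficients over the perturbed coordinates, so Measure 2 is dominated by directions that the pruned network never uses.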

We should note that the prior $P$ has to be defined without depending on the training examples $S$. One might think that if the parameter space is restricted to the unpruned weight subspace, we do not have to treat the pruned network case differently. However, this is not valid because such a prior depends on the structure of the pruning mask, which is found after seeing the training samples $S$. The target sparsity does not depend on $S$, so we can use it in the prior.

### 5.2 Spike-and-slab formulation

There are several problems when it comes to using the Gaussian distribution as prior and posterior for analyzing winning ticket properties. As discussed in Section 4.3, the distance from the initial weights of the winning ticket is expected to be small. We set the prior mean to the initial weights to take advantage of this property; however, the original PAC-Bayesian formulation based on the Gaussian distribution has the disadvantage that the pruned weights have the value $0$ and the norm of the initial weights corresponding to these weights remains in the KL part. This not only results in a large bound but also works against the purpose of obtaining sparse subnetworks when the bound is optimized, because the sparser the subnetwork is, the larger the bound gets. In addition, noise is inevitably added to the pruned weights when considering expected sharpness, since the variance of the pruned part cannot be set to zero. In order to limit the distance from the initial weights and the noise added in expected sharpness to the unpruned weights only, we use a spike-and-slab distribution, which is a mixture of a Gaussian distribution and a Dirac delta distribution, in the PAC-Bayesian bound. Let $\sigma_p$ and $\sigma_q$ be vectors whose $i$-th element is a variance of the Gaussian distribution, and let $\lambda_p$ and $\lambda_q$ be vectors that represent the mixture ratios of the prior and posterior, respectively. $\theta$ represents the network parameters, $\theta_{\mathrm{init}}$ is the initial weights, and $\bar\theta$ is the trained weights.

We design the prior and posterior as follows.

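$$\begin{aligned} P(\theta_i)&=(1-\lambda_{p,i})\,\delta_{\{0\}}+\lambda_{p,i}\,\mathcal{N}(\theta_i\mid\theta_{\mathrm{init},i},\sigma_{p,i}),\\ Q(\theta_i)&=(1-\lambda_{q,i})\,\delta_{\{0\}}+\lambda_{q,i}\,\mathcal{N}(\theta_i\mid\bar\theta_i,\sigma_{q,i}). \end{aligned} \tag{6}$$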

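The KL divergence of the spike-and-slab distribution [30] can be calculated as

$$\mathrm{KL}[Q(\theta)\,\|\,P(\theta)]=\sum_i\left(\lambda_{q,i}\left(\log\frac{\sigma_{p,i}}{\sigma_{q,i}}+\frac{\sigma_{q,i}^2+(\bar\theta_i-\theta_{\mathrm{init},i})^2}{2\sigma_{p,i}^2}-\frac{1}{2}\right)+\mathrm{kl}[\lambda_{q,i}\,\|\,\lambda_{p,i}]\right), \tag{7}$$

where

$$\mathrm{kl}[\lambda_{q,i}\,\|\,\lambda_{p,i}]=\lambda_{q,i}\log\frac{\lambda_{q,i}}{\lambda_{p,i}}+(1-\lambda_{q,i})\log\left(\frac{1-\lambda_{q,i}}{1-\lambda_{p,i}}\right). \tag{8}$$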

Since the structure of the pruned network is given, we set each element of $\lambda_q$ to $1$ or $0$ asymptotically according to the pruning mask and obtain the following divergence. This operation has been conventionally done in the entropy discussion [22].

$$\mathrm{kl}[\lambda_{q,i}\,\|\,\lambda_{p,i}]=\begin{cases}-\log\lambda_{p,i} & \text{(unpruned)}\\ -\log(1-\lambda_{p,i}) & \text{(pruned)}.\end{cases} \tag{9}$$
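Eqs. 7 to 9 can be evaluated directly once the posterior mixture ratios are fixed by the pruning mask. The sketch below is a straightforward transcription under two assumptions: the convention $0\log 0=0$ is used to realize the limit in Eq. 9, and $\sigma$ enters as it does in Eq. 7. The function names are hypothetical.

```python
import math

def bernoulli_kl(q, p):
    """kl[q || p] from Eq. 8, with the convention 0 * log 0 = 0 (giving Eq. 9)."""
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / p)
    if q < 1.0:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return total

def spike_and_slab_kl(theta_bar, theta_init, sigma_q, sigma_p, lam_q, lam_p):
    """KL[Q || P] from Eq. 7, summed over all weights."""
    total = 0.0
    for tb, t0, sq, sp, lq, lp in zip(theta_bar, theta_init,
                                      sigma_q, sigma_p, lam_q, lam_p):
        gauss = (math.log(sp / sq)
                 + (sq ** 2 + (tb - t0) ** 2) / (2.0 * sp ** 2) - 0.5)
        # A pruned weight (lq == 0) contributes no Gaussian term at all.
        total += lq * gauss + bernoulli_kl(lq, lp)
    return total

# One unpruned weight that did not move from init, and one pruned weight.
kl_value = spike_and_slab_kl(
    theta_bar=[0.5, 0.0], theta_init=[0.5, 0.2],
    sigma_q=[1.0, 1.0], sigma_p=[1.0, 1.0],
    lam_q=[1.0, 0.0], lam_p=[0.3, 0.3],
)
```

Note that the initial value `0.2` of the pruned weight does not appear in the result: unlike in the plain Gaussian formulation, pruned coordinates pay only the mixture-ratio cost $-\log(1-\lambda_{p,i})$.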

### 5.3 Numerical Experiment

We conducted numerical experiments to confirm that our PAC-Bayes formulation can adequately explain the behavior of winning tickets. We optimized the posterior variance to minimize the PAC-Bayes bound and plotted the training loss over the posterior against the KL term to investigate whether the bound can capture the test accuracy of the subnetworks produced by IMP. As a comparison, we also experimented with the Gaussian distribution setting using the zero-mean prior. The PAC-Bayesian bound used here is the variational KL bound [4] (Theorem D.2; see also Appendix A.6).

Figure 4 shows the distribution of the training risk term and KL term when optimizing the PAC-Bayes bound, and the actual test accuracy is colored. We show that the bound with the spike-and-slab formulation successfully explains the behavior of winning tickets, dividing the point cloud into three groups: A) winning tickets (moderate learning rate), B) subnetworks that failed to find winning tickets (too high learning rate) and C) not much trained subnetworks (too small learning rate).

In the left figure, as the learning rate is increased, the distance from the initial weights increases and the training risk gradually decreases from C to A; the same trend also appears in the right figure. On the other hand, the distribution of B differs greatly between left and right. The right figure does not capture the test accuracy well, because B should have a higher test accuracy considering its KL term and training risk term. This means that, as the learning rate is increased and winning tickets can no longer be found, the KL term when the prior mean is zero becomes much smaller than that of A. This is because many parameters become close to zero rather than staying near their initial weights when the winning ticket fails to be found (Appendix A.4). In terms of not only the bound optimization but also the analysis of existing winning tickets, it is preferable to use the spike-and-slab distribution and set the prior mean to the initial weights.

### 5.4 Revisiting existing algorithms

We reconsider the existing algorithms for winning ticket searching from the perspective of optimizing the PAC-Bayes bound. Although we focus on IMP and continuous sparsification, this view could be helpful for other methods as well.

#### 5.4.1 IMP

IMP is a heuristic method and does not have an explicit target function. Here, we explain how IMP behaves in the sense of a PAC-Bayes bound with our formulation instead of viewing IMP as a direct bound optimization problem.

The risk term and KL term in the PAC-Bayes bound are basically in a trade-off relationship: choosing a complex model to fit the training data will increase the KL term, while choosing a simple model may not have high accuracy on the training data. We point out that the two steps of IMP: 1) train the subnetwork, 2) prune the subnetwork and revert its weights, optimize the overall bound by alternately reducing one term while suppressing the increase in the other term.

In the first step, IMP trains the subnetwork from the initial weights under a given pruning mask. This process reduces the training risk, and the increase in the KL term is not expected to be large in our formulation, because the prior mean is set to the initial weights and the trained weights of IMP are not far from the initial weights, as shown in Sections 4.3 and 5.3.

In the second step, IMP prunes a certain percentage of the smallest-magnitude weights and reverts the trained weights to the initial state. Reverting $\bar\theta$ to $\theta_{\mathrm{init}}$ and changing part of $\lambda_q$ from $1$ to $0$ make the KL term small. The number of Gaussian KL summands decreases and the distance from initialization goes to $0$, and the KL part is also reduced if the prior mixture ratio is set to the final target sparsity. This KL reduction does not depend on the pruning criterion. The problem here is how to minimize the increase in training loss, which is related to which heuristic pruning criterion we choose and why pruning weights with a small absolute value works well.

Table 2 lists the training accuracy drop when we use three different pruning criteria: the large-magnitude criterion keeps weights with a large absolute value and corresponds to IMP; conversely, the small-magnitude criterion keeps small weights; and the random criterion prunes randomly. This notation of criteria follows that of Zhou et al. [36]. As expected, the large-magnitude criterion has a smaller drop in training accuracy after pruning than the others. Training accuracy gets lower after reverting; if we assume that retraining can reach weights that show the same or better accuracy, because weights achieving good accuracy with the same structure exist, it seems to make sense to use the large-magnitude criterion to decrease the KL term while suppressing the increase in training loss. The results in Table 2 confirm this assumption empirically. We can also discuss this with the following consideration using Taylor expansion (Appendix A.5).

$$|f(w+\Delta w) - f(w)| \le \frac{1}{2}\|\Delta w\|_2^2 \sup_{\gamma\in[0,1]} \lambda^{w+\gamma\Delta w}_{\max}, \tag{10}$$

where $\lambda^{w+\gamma\Delta w}_{\max}$ is the top eigenvalue of the Hessian $H(w+\gamma\Delta w)$.

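This gives a brief insight into why IMP succeeds by pruning small-magnitude weights: pruning sets each removed weight to zero, so $\|\Delta w\|_2^2$ is the sum of the squared removed weights, and removing the smallest weights minimizes the bound, under the assumption that the maximum eigenvalues do not differ much between pruning criteria.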
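The role of $\|\Delta w\|_2^2$ in Eq. 10 can be checked on synthetic weights. The sketch below is an illustration only: the Gaussian weights, the 20% keep ratio, and the helper `perturbation_sq_norm` are assumptions, and it compares the squared perturbation norm for keeping the largest weights, keeping the smallest weights, and keeping a random subset.

```python
import random


def perturbation_sq_norm(weights, keep):
    """Squared L2 norm of delta_w when every weight outside `keep` is zeroed."""
    return sum(w * w for i, w in enumerate(weights) if i not in keep)


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]  # synthetic weights
n_keep = 200  # keep 20% of the weights

order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
keep_large = set(order[-n_keep:])   # magnitude pruning, as in IMP
keep_small = set(order[:n_keep])    # the opposite criterion
keep_random = set(random.sample(range(len(weights)), n_keep))

norm_large = perturbation_sq_norm(weights, keep_large)
norm_small = perturbation_sq_norm(weights, keep_small)
norm_random = perturbation_sq_norm(weights, keep_random)

for name, value in [("keep large", norm_large),
                    ("keep small", norm_small),
                    ("keep random", norm_random)]:
    print(name, round(value, 2))
```

Keeping the largest weights always gives the smallest perturbation norm among masks of the same size, so it gives the smallest right-hand side of Eq. 10 whenever the eigenvalue factor is comparable.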

#### 5.4.2 Continuous Sparsification

Continuous sparsification [28] is a method that finds winning tickets by removing parameters continuously instead of alternating between training and pruning. Its objective function is as follows:

$$\min_{w\in\mathbb{R}^d,\; m\in\{0,1\}^d} L_S(m\odot w) + \eta\cdot\|m\|_1, \tag{11}$$

where $\eta$ is a hyperparameter. Continuous sparsification is formulated as training loss minimization with $\ell_0$ regularization of the weights, and a sigmoid function $\sigma$ is used for the continuous relaxation of the regularization term as follows.

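$$\min_{w\in\mathbb{R}^d,\; s\in\mathbb{R}^d_{\neq 0}} \lim_{\beta\to\infty} L_S(\sigma(\beta s)\odot w) + \eta\cdot\|\sigma(\beta s)\|_1. \tag{12}$$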
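The effect of the limit in Eq. 12 can be seen numerically: as $\beta$ grows, the soft mask $\sigma(\beta s)$ approaches a binary mask whose entries are 1 where $s > 0$ and 0 where $s < 0$. The following minimal sketch uses hypothetical values for $s$ and is not taken from [28].

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_mask(s, beta):
    """Continuous relaxation of the binary mask: sigma(beta * s), elementwise."""
    return [sigmoid(beta * si) for si in s]


s = [0.8, -0.3, 0.05, -1.2]                   # hypothetical mask parameters
hard = [1.0 if si > 0 else 0.0 for si in s]   # limiting binary mask

for beta in (1, 10, 100, 1000):
    m = soft_mask(s, beta)
    gap = max(abs(a - b) for a, b in zip(m, hard))
    print(beta, [round(v, 3) for v in m], "l1 =", round(sum(m), 3), "gap =", gap)
```

The printed `l1` value is the relaxed penalty $\|\sigma(\beta s)\|_1$, which tends to the number of retained weights as the mask hardens.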

We can regard this objective as an approximation of the PAC-Bayes bound in our formulation. Let $\phi_i$ be the Gaussian KL part in Eq. 7; the sum of the training risk and the KL term is then as follows.

$$L_S(Q) + \sum_i \phi_i\lambda_{q,i} + \sum_i \mathrm{kl}[\lambda_{q,i}\,\|\,\lambda_{p,i}]. \tag{13}$$
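As a rough numerical illustration of the two KL sums in Eq. 13, the sketch below makes several assumptions that are not taken from the paper: the Gaussian KL part is modelled as the KL divergence between equal-variance Gaussians centred at the trained and initial weights (with a hypothetical `sigma`), the mixture ratio is read as the probability that a weight is kept, and the prior mixture ratio `lam_p` is a single scalar equal to the target fraction of kept weights.

```python
import math


def kl_bernoulli(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p), with 0*log(0) = 0."""
    total = 0.0
    if q > 0:
        total += q * math.log(q / p)
    if q < 1:
        total += (1 - q) * math.log((1 - q) / (1 - p))
    return total


def kl_terms(w, w0, lam_q, lam_p, sigma=0.1):
    """Return (Gaussian part, mixture part) of the KL sums, under the stated assumptions."""
    # Assumed Gaussian KL part: equal-variance Gaussians centred at w and w0.
    phi = [(a - b) ** 2 / (2 * sigma ** 2) for a, b in zip(w, w0)]
    gaussian_part = sum(f * q for f, q in zip(phi, lam_q))
    mixture_part = sum(kl_bernoulli(q, lam_p) for q in lam_q)
    return gaussian_part, mixture_part


w0 = [0.5, -0.2, 0.1, 0.8]   # hypothetical initial weights
w = [0.6, -0.1, 0.3, 0.7]    # hypothetical trained weights

dense = kl_terms(w, w0, lam_q=[1, 1, 1, 1], lam_p=0.2)     # trained, unpruned
pruned = kl_terms(w0, w0, lam_q=[1, 0, 0, 1], lam_p=0.2)   # pruned and reverted
print("dense :", dense)
print("pruned:", pruned)
```

Under these assumptions both parts shrink after pruning and reverting: the Gaussian part vanishes because the surviving weights sit at the prior mean, and the mixture part decreases because the posterior keep probabilities move toward a sparse prior.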

We make three approximations: 1) replace $L_S(Q)$ with $L_S(\lambda_q \odot w)$, the loss at the mean of the spike-and-slab distribution, by a first-order Taylor expansion of the training risk, 2) simplify the second term to the $\ell_1$ norm of $\lambda_q$, because the second term can be viewed as a weighted sum of the $\lambda_{q,i}$, and 3) remove the third term, which can be regarded as a regularization toward the target sparsity. This yields the following, which is similar to Eq. 12.

$$L_S(\lambda_q \odot w) + \eta\cdot\|\lambda_q\|_1. \tag{14}$$

Since the Gaussian KL is approximated, the distance from the initial weights is not taken into account in this setting. The authors adopt a problem setting in which the ticket search starts from weights trained for a few epochs rather than from the initial weights, following Frankle et al. [8]; their work therefore does not have to consider suppressing the learning rate, i.e., the distance from the initial weights. Note that Hayou et al. [12] proposed PAC-Bayes pruning (PBP), which optimizes the PAC-Bayes bound. A limitation of our analysis, however, is that it does not reveal an explicit relationship between continuous sparsification and PBP.

## 6 Conclusion

In this work, we explored the fact that a small learning rate is required to find winning tickets, and we provided an empirical analysis of flatness and the distance from the initial weights. On the basis of these findings, we used the PAC-Bayesian framework to analyze winning tickets and experimentally showed that it captures the generalization behavior well. Finally, we reconsidered IMP and continuous sparsification from a PAC-Bayesian perspective. This study does not analyze the case where no solution exists near the initial weights, which requires IMP with rewinding to an early epoch.

## References

• [1] P. Alquier, J. Ridgway, and N. Chopin (2016) On the properties of variational approximations of Gibbs posteriors. The Journal of Machine Learning Research 17 (1), pp. 8374–8414. Cited by: Theorem 1.
• [2] R. Bain (2021) Visualizing the loss landscape of winning lottery tickets. arXiv preprint arXiv:2112.08538. Cited by: §2.
• [3] B. Bartoldson, A. Morcos, A. Barbu, and G. Erlebacher (2020) The generalization-stability tradeoff in neural network pruning. Advances in Neural Information Processing Systems 33, pp. 20852–20864. Cited by: §2.
• [4] G. K. Dziugaite, K. Hsu, W. Gharbieh, G. Arpino, and D. Roy (2021) On the role of data in PAC-Bayes. In International Conference on Artificial Intelligence and Statistics, pp. 604–612. Cited by: §A.6, §5.3.
• [5] G. K. Dziugaite and D. M. Roy (2017) Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence. Cited by: §2, §3.
• [6] P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur (2020) Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations, Cited by: §4.2.
• [7] J. Frankle and M. Carbin (2018) The lottery ticket hypothesis: finding sparse, trainable neural networks. In International Conference on Learning Representations, Cited by: §1, §2, §3, §3, §4.1.
• [8] J. Frankle, G. K. Dziugaite, D. Roy, and M. Carbin (2020) Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pp. 3259–3269. Cited by: §1, §2, §3, §5.4.2.
• [9] S. Han, J. Pool, S. Narang, H. Mao, E. Gong, S. Tang, E. Elsen, P. Vajda, M. Paluri, J. Tran, et al. (2017) DSD: dense-sparse-dense training for deep neural networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net. Cited by: item 3.
• [10] S. Han, J. Pool, J. Tran, and W. J. Dally (2015) Learning both weights and connections for efficient neural network. In NIPS, Cited by: §1.
• [11] B. Hassibi, D. G. Stork, and G. J. Wolff (1993) Optimal brain surgeon and general network pruning. In IEEE international conference on neural networks, pp. 293–299. Cited by: §1.
• [12] S. Hayou, B. He, and G. K. Dziugaite (2021) Probabilistic fine-tuning of pruning masks and pac-bayes self-bounded learning. arXiv preprint arXiv:2110.11804. Cited by: §A.6, §2, §5.4.2.
• [13] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
• [14] Z. He, Q. Zhu, and Z. Qin (2022) Can network pruning benefit deep learning under label noise? Cited by: §2, §4.3, §5.1.
• [15] Y. Jiang*, B. Neyshabur*, H. Mobahi, D. Krishnan, and S. Bengio (2020) Fantastic generalization measures and where to find them. In International Conference on Learning Representations, Cited by: §3.
• [16] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang (2017) On large-batch training for deep learning: generalization gap and sharp minima. ICLR. Cited by: §3.
• [17] Y. LeCun, J. Denker, and S. Solla (1989) Optimal brain damage. Advances in neural information processing systems 2. Cited by: §1.
• [18] A. Lewkowycz, Y. Bahri, E. Dyer, J. Sohl-Dickstein, and G. Gur-Ari (2020) The large learning rate phase of deep learning: the catapult mechanism. Cited by: §2, §4.2.
• [19] Y. Li, C. Wei, and T. Ma (2019) Towards explaining the regularization effect of initial large learning rate in training neural networks. Advances in Neural Information Processing Systems 32. Cited by: §1, §2, §4.2.
• [20] N. Liu, G. Yuan, Z. Che, X. Shen, X. Ma, Q. Jin, J. Ren, J. Tang, S. Liu, and Y. Wang (2021) Lottery ticket preserves weight correlation: is it desirable or not?. In International Conference on Machine Learning, pp. 7011–7020. Cited by: §2.
• [21] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell (2018) Rethinking the value of network pruning. In International Conference on Learning Representations, Cited by: §2.
• [22] D. J. C. MacKay (2003) Information theory, inference and learning algorithms. Cambridge University Press. Cited by: §5.2.
• [23] E. Malach, G. Yehudai, S. Shalev-Schwartz, and O. Shamir (2020) Proving the lottery ticket hypothesis: pruning is all you need. In International Conference on Machine Learning, pp. 6682–6691. Cited by: §2.
• [24] V. Nagarajan and J. Z. Kolter (2019) Generalization in deep networks: the role of distance from initialization. In NIPS workshop on Deep Learning: Bridging Theory and Practice, Cited by: §2.
• [25] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro (2017) Exploring generalization in deep learning. Advances in neural information processing systems 30. Cited by: §2, §3.
• [26] B. Neyshabur, S. Bhojanapalli, and N. Srebro (2018) A PAC-bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations, Cited by: §3.
• [27] V. Ramanujan, M. Wortsman, A. Kembhavi, A. Farhadi, and M. Rastegari (2020-06) What’s hidden in a randomly weighted neural network?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
• [28] P. Savarese, H. Silva, and M. Maire (2020) Winning the lottery with continuous sparsification. Advances in Neural Information Processing Systems 33, pp. 11380–11390. Cited by: §1, §5.4.2.
• [29] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. ICLR. Cited by: §1.
• [30] F. Tonolini, B. S. Jensen, and R. Murray-Smith (2020) Variational sparse coding. In Uncertainty in Artificial Intelligence, pp. 690–700. Cited by: §5.2.
• [31] X. Xia, T. Liu, B. Han, C. Gong, N. Wang, Z. Ge, and Y. Chang (2021) Robust early-learning: hindering the memorization of noisy labels. In ICLR, Cited by: §2.
• [32] Z. Xie, F. He, S. Fu, I. Sato, D. Tao, and M. Sugiyama (2021) Artificial neural variability for deep learning: on overfitting, noise memorization, and catastrophic forgetting. Neural computation 33 (8), pp. 2163–2192. Cited by: §4.2.
• [33] Z. Yao, A. Gholami, K. Keutzer, and M. W. Mahoney (2020) Pyhessian: neural networks through the lens of the hessian. In 2020 IEEE International Conference on Big Data (Big Data), pp. 581–590. Cited by: Figure 2.
• [34] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2017) Understanding deep learning requires rethinking generalization. ICLR. Cited by: §1, §2, §2.
• [35] S. Zhang, M. Wang, S. Liu, P. Chen, and J. Xiong (2021) Why lottery ticket wins? a theoretical perspective of sample complexity on sparse neural networks. Advances in Neural Information Processing Systems 34. Cited by: §2.
• [36] H. Zhou, J. Lan, R. Liu, and J. Yosinski (2019) Deconstructing lottery tickets: zeros, signs, and the supermask. Advances in neural information processing systems 32. Cited by: §2, §5.4.1.

## Appendix A Other Experiments

### A.1 Retraining winning tickets with a different learning rate

We conducted the following experiment on the distance from the initial weights. We tested whether only the structure of winning tickets matters; if so, a pruning mask obtained with a small learning rate could then be trained with a large initial learning rate, as in normal network training, which would lead to higher generalization ability. Figure 5 shows the result. Retraining with a large learning rate for a given mask greatly improves test accuracy at low sparsity. In contrast, as sparsity increases, this improvement shrinks or the test accuracy worsens; this is generally the same trend as IMP with a large learning rate. We found that the mask obtained by IMP with a small learning rate has a structure that performs well with weights around the initialization, and that it does not work well when the subnetwork is trained directly with a large learning rate.

### A.2 SAM with a large learning rate

Figure 6 shows the test accuracy when we use SAM with a large learning rate; we experimented only with ResNet20 on CIFAR10 due to limited computational resources. The figure shows that SAM does not improve test accuracy over IMP with a large learning rate and that SAM does not help to find winning tickets. Since IMP with a large learning rate already produces relatively flat solutions, SAM is not expected to change the results. This also confirms that the improved generalization accuracy obtained by using SAM instead of SGD for IMP with a small learning rate comes from finding flatter solutions, not from some other side effect of SAM.

### A.3 Regularization of the distance from the initial weights

Figure 7 shows the test accuracy on CIFAR10 and CIFAR100 when we train ResNet20 and VGG16 with a regularization term on the distance from the initial weights. In the large learning rate setting (orange line), increasing sparsity significantly reduces the test accuracy, which means that IMP cannot find winning tickets. Adding the regularization toward the initialization minimizes this decrease in test accuracy (red line), giving a trend similar to that of the small learning rate setting (green line), which succeeds in finding winning tickets.

### A.4 Parameter distribution under different learning rates

Figure 8 shows the parameter distributions of ResNet20 and VGG16 as the learning rate changes. We plot the unpruned weights in order from the smallest to largest: , , , . At the sparsity where winning tickets can no longer be found in Figure 1, the trend of the distribution changes. Although the difference is not apparent in the L2 norm, we confirm that each parameter, which was near its initial value, is spread over a wider range when the learning rate is increased.

### A.5 Proof of Eq. 10

We estimate the deviation of $f$ when $w$ moves from the trained weights by $\Delta w$, using Taylor's theorem. Let $H$ be the Hessian matrix of $f$; then, for some $\gamma \in [0,1]$, we have

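$$
\begin{aligned}
|f(w+\Delta w) - f(w)| &= \left|\nabla f(w)\cdot\Delta w + \frac{1}{2}\Delta w^\top H(w+\gamma\Delta w)\Delta w\right| \\
&= \frac{1}{2}\left\|\Delta w^\top H(w+\gamma\Delta w)\Delta w\right\|_2 \\
&\le \frac{1}{2}\|\Delta w\|_2^2\cdot\|H(w+\gamma\Delta w)\|_2 \\
&\le \frac{1}{2}\|\Delta w\|_2^2 \sup_{\gamma\in[0,1]}\|H(w+\gamma\Delta w)\|_2 \\
&= \frac{1}{2}\|\Delta w\|_2^2 \sup_{\gamma\in[0,1]}\lambda^{w+\gamma\Delta w}_{\max},
\end{aligned}
\tag{15}
$$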

where $\lambda^{w+\gamma\Delta w}_{\max}$ is the top eigenvalue of $H(w+\gamma\Delta w)$.

The first equality follows from the second-order Taylor's theorem, the second from the fact that $w$ is the trained weights and hence $\nabla f(w) = 0$, and the first inequality holds because of the sub-multiplicativity of the matrix norm.
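The bound can be sanity-checked on a toy problem. The sketch below assumes a two-dimensional quadratic loss with a constant, symmetric positive definite Hessian `A` and a minimizer `w_star` (both hypothetical), so the gradient vanishes at the trained weights as the proof requires; it then compares both sides of the inequality for random perturbations.

```python
import math
import random

A = [[3.0, 1.0], [1.0, 2.0]]   # hypothetical symmetric positive definite Hessian
w_star = [0.7, -0.4]           # minimizer, so the gradient vanishes there


def f(w):
    """Quadratic loss 0.5 * (w - w_star)^T A (w - w_star)."""
    d0, d1 = w[0] - w_star[0], w[1] - w_star[1]
    return 0.5 * (A[0][0] * d0 * d0 + 2.0 * A[0][1] * d0 * d1 + A[1][1] * d1 * d1)


def top_eigenvalue_2x2(m):
    """Largest eigenvalue of a symmetric 2x2 matrix (closed form)."""
    mean = 0.5 * (m[0][0] + m[1][1])
    half_diff = 0.5 * (m[0][0] - m[1][1])
    return mean + math.sqrt(half_diff ** 2 + m[0][1] ** 2)


lam_max = top_eigenvalue_2x2(A)
random.seed(1)
worst_ratio = 0.0
for _ in range(1000):
    dw = [random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0)]
    lhs = abs(f([w_star[0] + dw[0], w_star[1] + dw[1]]) - f(w_star))
    rhs = 0.5 * (dw[0] ** 2 + dw[1] ** 2) * lam_max
    if rhs > 0.0:
        worst_ratio = max(worst_ratio, lhs / rhs)

print("top eigenvalue:", lam_max)
print("largest lhs/rhs ratio:", worst_ratio)
```

A ratio that never exceeds 1 is what the inequality predicts; for a general network the Hessian varies along the path, which is why the bound takes a supremum over $\gamma$.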

### A.6 Variational KL bound

In Section 5.3, we optimize the following variational KL bound [4], [12]. Given a real number $\delta \in (0, 1)$, with probability at least $1 - \delta$ over the training sample $S$, the expected risk of $Q$ is upper-bounded by

$$\min\begin{cases} L_S(Q) + B + \sqrt{B\,(B + 2L_S(Q))} \\ L_S(Q) + \sqrt{\dfrac{B}{2}}, \end{cases} \tag{16}$$

where

$$B = \frac{\mathrm{KL}(Q\,\|\,P) + \log\frac{2\sqrt{|S|}}{\delta}}{|S|}. \tag{17}$$
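For reference, Eqs. 16 and 17 can be evaluated directly once the empirical risk and the KL divergence are known. The sketch below is a plain transcription of the two formulas; the function name, the default `delta`, and the example numbers are assumptions chosen for illustration and are not results from the paper.

```python
import math


def variational_kl_bound(train_risk, kl, n, delta=0.05):
    """Evaluate Eqs. 16-17: the smaller of the two upper bounds.

    train_risk: empirical risk L_S(Q) in [0, 1]
    kl:         KL(Q || P)
    n:          number of training samples |S|
    delta:      confidence parameter in (0, 1)
    """
    b = (kl + math.log(2.0 * math.sqrt(n) / delta)) / n
    first = train_risk + b + math.sqrt(b * (b + 2.0 * train_risk))
    second = train_risk + math.sqrt(b / 2.0)
    return min(first, second)


# Hypothetical inputs: same training risk, two different KL values.
print(variational_kl_bound(train_risk=0.02, kl=5000.0, n=50000))
print(variational_kl_bound(train_risk=0.02, kl=500.0, n=50000))
```

A smaller KL term gives a smaller bound at equal training risk, which is the trade-off that the analysis of IMP in Section 5.4 relies on.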