Winning the Lottery with Continuous Sparsification

12/10/2019 ∙ by Pedro Savarese, et al. ∙ The University of Chicago ∙ Toyota Technological Institute at Chicago

The Lottery Ticket Hypothesis from Frankle and Carbin (2019) conjectures that, for typically-sized neural networks, it is possible to find small sub-networks which train faster and yield performance superior to that of their original counterparts. The proposed algorithm to search for "winning tickets", Iterative Magnitude Pruning, consistently finds sub-networks with 90-95% fewer parameters which train faster and better than the overparameterized models they were extracted from, creating potential applications to problems such as transfer learning. In this paper, we propose Continuous Sparsification, a new algorithm to search for winning tickets which continuously removes parameters from a network during training, and learns the sub-network's structure with gradient-based methods instead of relying on pruning strategies. We show empirically that our method is capable of finding tickets that outperform the ones learned by Iterative Magnitude Pruning, while at the same time providing a faster search, whether measured in number of training epochs or in wall-clock time.


1 Introduction

Although deep neural networks have become ubiquitous in fields such as computer vision and natural language processing, extreme overparameterization is typically required to achieve state-of-the-art results, causing higher training costs and hindering applications where memory or inference time are constrained. Recent theoretical work suggests that overparameterization plays a key role in both the capacity and generalization of a network (Neyshabur et al., 2018), and in training dynamics (Allen-Zhu et al., 2019). However, it remains unclear whether overparameterization is truly necessary to train networks to state-of-the-art performance.

At the same time, empirical approaches have been successful in finding less overparameterized neural networks, either by reducing the network after training (Han et al., 2015, 2016) or through more efficient architectures that can be trained from scratch (Iandola et al., 2016). Recently, the combination of these two approaches has led to new methods which discover efficient architectures through optimization instead of design (Liu et al., 2019; Savarese and Maire, 2019). Nonetheless, parameter efficiency is typically maximized by pruning an already trained network.

The fact that pruned networks are hard to train from scratch (Han et al., 2015) suggests that, while overparameterization is not necessary for a model’s capacity, it might be required for successful network training. Recently, this idea has been called into question by Frankle and Carbin (2019), who show that heavily pruned networks can be trained faster than their original counterparts, often yielding superior performance.

A key finding is that the same parameter initialization should be used when re-training the pruned network. A winning ticket, defined by a sub-network and a setting of randomly-initialized parameters, is quickly trainable and has already found applications in, for example, transfer learning (Morcos et al., 2019; Mehta, 2019; Soelen and Sheppard, 2019), making the search for winning tickets a problem of independent interest.

Currently, the standard algorithm to find winning tickets is Iterative Magnitude Pruning (IMP) (Frankle and Carbin, 2019), which consists of repeating a 2-stage procedure that alternates between parameter optimization and pruning. As a result, IMP relies on a sensible choice of pruning strategy, and is time-consuming: finding a winning ticket with of the original parameters in a 6-layer CNN requires over rounds of training followed by pruning, totalling over epochs. Choosing a parameter’s magnitude as pruning criterion has also been shown to be sub-optimal in some settings (Zhou et al., 2019), leading to the question of whether better winning tickets can be found by different pruning methods. Moreover, at each iteration, IMP resets the parameters of the network back to initialization, hence considerable time is spent on re-training similar networks with different sparsities.

With the goal of speeding up the search for winning tickets in deep neural networks, we design a novel method, Continuous Sparsification (code available at https://github.com/lolemacs/continuous-sparsification), which continuously removes weights from a network during training, instead of following a strategy to prune parameters at discrete time intervals. Unlike IMP, our method approaches the search for sparse networks as an $\ell_0$-regularized optimization problem (Louizos et al., 2018), resulting in a method that can be fully described in the optimization framework. To approximate $\ell_0$ regularization, we propose a smooth re-parameterization, allowing for the sub-network's structure to be directly learned with gradient-based methods. Unlike previous works, our re-parameterization is deterministic, proving more convenient for the tasks of pruning and ticket search, while also yielding faster training times.

Experimentally, our method offers superior performance when pruning VGG (Simonyan and Zisserman, 2015) to extreme regimes, and is capable of finding winning tickets in Residual Networks (He et al., 2016) trained on CIFAR-10 (Krizhevsky, 2009) at a fraction of the time taken by Iterative Magnitude Pruning. In particular, Continuous Sparsification successfully finds tickets in under 5 iterations, compared to the 20 iterations required by Iterative Magnitude Pruning in the same setting. To further speed up the search for sub-networks, our method forgoes parameter rewinding, a key ingredient of Iterative Magnitude Pruning. By showing superior results without rewinding, our experiments offer insights on how ticket search should be performed.

2 Related Work

2.1 Lottery Ticket Hypothesis

The Lottery Ticket Hypothesis (Frankle and Carbin, 2019) states that, for a network $f(x; w)$ with randomly-initialized parameters $w_0 \in \mathbb{R}^d$, there exists a sparse sub-network, defined by a configuration $m \in \{0,1\}^d$, that, when trained from scratch, achieves higher performance than the full network while requiring fewer training iterations. The authors support this conjecture experimentally, showing that such sub-networks indeed exist: in particular, they can be discovered by repeatedly training, pruning, and re-initializing the network, through a procedure named Iterative Magnitude Pruning (IMP; Algorithm 1) (Frankle et al., 2019). More specifically, IMP alternates between: (1) training the weights of the network, (2) removing a fixed fraction of the weights with the smallest magnitude (pruning), and (3) rewinding: setting the remaining weights back to their original initialization $w_0$.
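As a concrete illustration of these three steps, the sketch below applies per-layer magnitude pruning to a PyTorch model; the helper train_fn, the pruning fraction, and the number of rounds are illustrative placeholders rather than the exact settings of Frankle and Carbin (2019).

```python
import torch

def iterative_magnitude_pruning(model, train_fn, w0, prune_frac=0.2, rounds=20):
    # w0: dict mapping parameter names to their initial values (the "ticket" initialization).
    # Masks start all-ones; each round trains, prunes a fraction of the surviving
    # weights by magnitude, then rewinds the survivors to w0.
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for _ in range(rounds):
        train_fn(model, masks)  # (1) train the masked network f(x; m * w)
        with torch.no_grad():
            for n, p in model.named_parameters():
                active = masks[n].bool()
                k = int(prune_frac * int(active.sum()))
                if k == 0:
                    continue
                # (2) prune: zero the mask of the k smallest-magnitude active weights
                thresh = p[active].abs().kthvalue(k).values
                masks[n][active & (p.abs() <= thresh)] = 0.0
                # (3) rewind: surviving weights go back to their initialization
                p.copy_(w0[n] * masks[n])
    return masks  # the ticket: binary masks to pair with the initialization w0
```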

The sub-networks found by IMP, which indeed train faster and outperform their original, dense networks, are called winning tickets, and can generalize across datasets (Mehta, 2019; Soelen and Sheppard, 2019) and training methods (Morcos et al., 2019). In this sense, IMP can be a promising tool in applications that involve knowledge transfer, such as transfer or meta learning.

Zhou et al. (2019) perform extensive experiments to re-evaluate and better understand the Lottery Ticket Hypothesis. Relevant to this work is the fact that the authors propose a method to learn the binary mask in an end-to-end manner through SGD, instead of relying on magnitude-based pruning. The authors show that learning only the binary mask and not the weights is sufficient to achieve competitive performance, confirming that the learned masks are highly dependent on the initialized values $w_0$, and are also capable of encoding substantial information about a problem’s solution.

2.2 Sparse Networks

The core aspect of searching for a winning ticket is finding a sparse sub-network that attains high performance relative to its dense counterpart. One way to achieve this is through pruning methods (LeCun et al., 1990), which follow a strategy to remove weights from a trained network while minimizing negative impacts on its performance. In Han et al. (2015), a network is iteratively trained and pruned using parameter magnitudes as criterion: this iterative, two-stage algorithm is shown to outperform “one-shot pruning”: training and pruning the network only once.

Other methods attempt to approximate $\ell_0$ regularization on the weights of a network, yielding one-stage procedures that can be fully described in the optimization framework. In order to find a sparse setting of a network $f(x; w)$, Srinivas et al. (2016) and Louizos et al. (2018) use a stochastic re-parameterization $m \sim \text{Bern}(\sigma(s))$ with $s \in \mathbb{R}^d$ and the sigmoid $\sigma$ applied element-wise. First-order methods, coupled with gradient estimators, are then used to train both $w$ and $s$ to minimize the expected loss. This approach performs continuous parameter removal during training in an automatic fashion: any component $s_i$ of $s$ that assumes a value during training where $\sigma(s_i) \approx 0$ effectively removes $w_i$ from the network. Moreover, approximating $\ell_0$ regularization has the advantage of not requiring the added complexity of developing a custom pruning strategy.
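A generic sketch of such a stochastic mask trained with a straight-through gradient estimator is shown below; it is illustrative only, not the exact distribution or estimator used by Srinivas et al. (2016) or Louizos et al. (2018).

```python
import torch

def stochastic_mask(s):
    # Sample a binary mask m ~ Bern(sigmoid(s)); the forward pass uses the binary
    # sample, while gradients flow through sigmoid(s) as if m were differentiable
    # (straight-through estimator).
    probs = torch.sigmoid(s)
    m = torch.bernoulli(probs)
    return m.detach() + probs - probs.detach()

# usage: masked_weights = stochastic_mask(s) * w
```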

1: Initialize $m \leftarrow \mathbb{1}$ and $w \leftarrow w_0$
2: Minimize $L(f(x; m \odot w))$ until a trained $w_T$ is produced
3: Set $m_i \leftarrow 0$ for the active weights with the smallest magnitudes ($m_i = 1$ and smallest $|w_{T,i}|$)
4: If satisfied, output ticket $(m, w_0)$
5: Otherwise, set $w \leftarrow w_0$ and go back to step 2
Algorithm 1 Iterative Magnitude Pruning (Frankle et al., 2019)
1: Initialize $w \leftarrow w_0$, $s \leftarrow s_0$
2: Minimize $\mathbb{E}_{m \sim \text{Bern}(\sigma(s))}\left[L(f(x; m \odot w))\right]$ until $w_T$ and $s_T$ are produced
3: If satisfied, output ticket $(m \sim \text{Bern}(\sigma(s_T)), w_0)$
4: Otherwise, set $s_i \leftarrow -\infty$ for components where $s_{T,i} < s_{0,i}$, set $w \leftarrow w_0$, and go back to step 2
Algorithm 2 Iterative Stochastic Sparsification (inspired by Zhou et al. (2019))
1: Initialize $w \leftarrow w_0$, $s \leftarrow s_0$, $\beta \leftarrow 1$
2: Minimize $L(f(x; \sigma(\beta s) \odot w)) + \lambda \|\sigma(\beta s)\|_1$ while increasing $\beta$, producing $w_T$, $s_T$, and $\beta_T$
3: If satisfied, output ticket $(H(s_T), w_0)$
4: Otherwise, set $s \leftarrow \min(\beta_T s_T, s_0)$, $\beta \leftarrow 1$ (optionally, $w \leftarrow w_k$), and go back to step 2
Algorithm 3 Continuous Sparsification

3 Method

Designing a method to quickly find winning tickets requires an efficient way to sparsify networks: ideally, sparsification should be done as early as possible in training, and the number of removed parameters should be maximized without harming the model’s performance. In other words, sparsification must be continuously maximized following a trade-off with the performance of the network. This goal is not met by Iterative Magnitude Pruning: sparsification is done at discrete time steps, only after fully training the network, and optimal pruning rates likely depend on the model’s performance and current sparsity: factors which are typically not accounted for – note that these are inherent characteristics of magnitude-based pruning.

In light of this, we turn to $\ell_0$-regularization methods for learning sparse networks, which consist of optimizing a clear trade-off between sparsity and performance. As we will see, performing sparsification continuously is not only straightforward, but done automatically by the optimizer.

3.1 Continuous Sparsification by Learning Deterministic Masks

[Figure: newfigs/reparam.pdf]

Figure 1: Illustration of our proposed re-parameterization $m = \sigma(\beta s)$, where $\sigma$ is the sigmoid function and $\beta$ acts as a temperature. As $\beta$ increases, $\sigma(\beta s)$ approaches the step function $H(s)$, which can be used to frame an $\ell_0$-regularized problem (Equation 4). Note that the gradients of $\sigma(\beta s)$ vanish as $\beta$ increases, suggesting that $\beta$ should be annealed slowly during training.

We first frame the search for sparse networks as a loss minimization problem with $\ell_0$ regularization (Louizos et al., 2018; Srinivas et al., 2016):

(1)   $\min_{w \in \mathbb{R}^d} \; L(f(x; w)) + \lambda \|w\|_0$

where $\lambda \geq 0$ controls the sparsity of the solution, and, with a slight abuse of notation, $L(f(x; w))$ denotes the loss incurred by the network (e.g., the cross-entropy loss over a training set). As $\ell_0$ regularization is typically intractable, we re-state the above minimization problem as:

(2)   $\min_{m \in \{0,1\}^d,\, w \in \mathbb{R}^d} \; L(f(x; m \odot w)) + \lambda \|m\|_1$

which uses the fact that, for $m \in \{0,1\}^d$, $\|m\|_0 = \|m\|_1$. The penalty term can be minimized with subgradient descent; however, the constraint $m \in \{0,1\}^d$ makes the above problem combinatorial and poorly suited for local search methods like SGD.

We can avoid the binary constraint by re-parameterizing $m$ as a function of a newly-introduced variable $s \in \mathbb{R}^d$. For example, Louizos et al. (2018) propose a stochastic mapping $m \sim \text{Bern}(\sigma(s))$ and use gradient methods to minimize the expected total loss, relying on estimators for the gradients with respect to $s$ (since $m$ is still binary). Having a stochastic mask (or, equivalently, a distribution over sub-networks) poses an immediate challenge for the task of finding tickets, as it is not clear which ticket should be chosen once a distribution over $m$ is learned. Moreover, relying on gradient estimators often causes gradients to have high variance, requiring longer training to reach optimality. Alternatively, we consider a deterministic parameterization $m = H(s)$, where $s \in \mathbb{R}^d$ and the step function $H$ is applied element-wise:

(3)   $H(s_i) = \begin{cases} 1, & \text{if } s_i > 0 \\ 0, & \text{otherwise} \end{cases}$

Applying this re-parameterization to Equation 2 yields:

(4)   $\min_{s \in \mathbb{R}^d,\, w \in \mathbb{R}^d} \; L(f(x; H(s) \odot w)) + \lambda \|H(s)\|_1$

Clearly, the above problem is again intractable, as it is still equivalent to the original problem in Equation 1. More specifically, the step function $H$ is non-convex, and its zero gradients make gradient-based optimization ineffective. Instead, we consider the following smooth relaxation of $H$:

(5)   $\sigma(\beta s) = \frac{1}{1 + e^{-\beta s}}$

where $\beta \geq 1$ and $\sigma$ is the sigmoid function, applied element-wise. By controlling $\beta$, which acts as a temperature parameter, we effectively interpolate between $\sigma(s)$, a smooth function well-suited for SGD, and $H(s)$, our original goal, which brings computational hardness to the problem. Figure 1 illustrates this behavior. Note that, if $L$ is continuous in the mask, then, for every $s$ with no zero components:

(6)   $\lim_{\beta \to \infty} \; L(f(x; \sigma(\beta s) \odot w)) + \lambda \|\sigma(\beta s)\|_1 \;=\; L(f(x; H(s) \odot w)) + \lambda \|H(s)\|_1$

Although gradient methods become ineffective as $\beta \to \infty$ due to the vanishing gradients of $\sigma(\beta s)$, we can increase $\beta$ while optimizing $w$ and $s$ with gradient descent. That is, our loss at each iteration $t$ will be a function of the current temperature $\beta_t$, as follows:

(7)   $L_{\beta_t}(w, s) = L(f(x; \sigma(\beta_t s) \odot w)) + \lambda \|\sigma(\beta_t s)\|_1$

How does the soft mask $\sigma(\beta s)$ behave as we minimize $L_{\beta_t}$ while increasing $\beta_t$? As $\beta \to \infty$, every negative component of $s$ will be mapped to $0$ by $\sigma(\beta \cdot)$, effectively removing its corresponding weight parameter from the network. While analytically the weights will never truly be zeroed-out, limited numerical precision has the fortunate side-effect of causing actual sparsification of the network during training, as long as $\beta$ is increased to a large enough value.

In a nutshell, we learn sparse networks by minimizing $L_{\beta_t}(w, s)$ for $T$ parameter updates with gradient descent while jointly annealing $\beta$: producing $w_T$, $s_T$ and $\beta_T$, which is ideally large enough such that, numerically, $\sigma(\beta_T s_T) \in \{0,1\}^d$. (Footnote 2: We observed in our experiments that a final temperature of is sufficient for iterates of when training with SGD with -bit precision. The required temperature is likely to depend on how the mask is numerically represented, as in reality our method relies on numerical imprecision.) In case the mask is truly required to be binary (as in the task of finding tickets), the dependence on numerical imprecision can be avoided by directly outputting $H(s_T)$ at the end of training.

Finally, note that minimizing $L_{\beta_t}$ while increasing $\beta_t$ is not generally equivalent to minimizing the original $\ell_0$-regularized problem. Informally, the former aims to solve $\lim_{\beta \to \infty} \min_{w, s} L_\beta(w, s)$, while the $\ell_0$ problem corresponds to $\min_{w, s} \lim_{\beta \to \infty} L_\beta(w, s)$.
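To make the re-parameterization concrete, the following is a minimal PyTorch sketch of a linear layer gated by the soft mask $\sigma(\beta s)$; the layer type, initialization scheme, and default value of s_init are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftMaskedLinear(nn.Module):
    """Linear layer gated by the soft mask sigma(beta * s) of Section 3.1 (sketch)."""

    def __init__(self, in_features, out_features, s_init=0.05):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight)
        self.bias = nn.Parameter(torch.zeros(out_features))
        # one soft-mask parameter per weight, trained jointly with the weights
        self.s = nn.Parameter(torch.full((out_features, in_features), s_init))
        # temperature beta is annealed externally, not learned
        self.register_buffer("beta", torch.tensor(1.0))

    def soft_mask(self):
        return torch.sigmoid(self.beta * self.s)

    def forward(self, x):
        return F.linear(x, self.soft_mask() * self.weight, self.bias)

    def mask_penalty(self):
        # the ||sigma(beta s)||_1 term of Equation (7); multiply by lambda in the loss
        return self.soft_mask().sum()

    def hard_mask(self):
        # H(s): the binary ticket reported at the end of training
        return (self.s > 0).float()
```

The temperature buffer would be annealed externally (e.g., grown by a constant factor each epoch), and the training loss would add $\lambda$ times the sum of mask_penalty() over all such layers, matching Equation (7).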

3.2 Ticket Search through Continuous Sparsification

The method presented above offers a direct alternative to magnitude-based pruning when performing ticket search, but a few considerations must follow. Most importantly, when searching for winning tickets, there is a strict constraint that the learned mask be binary: otherwise, one could also learn the magnitude of the weights, defeating the purpose of finding sub-networks that can be trained from scratch. To guarantee that the output mask satisfies this constraint regardless of numerical precision, we always output $H(s_T)$ instead of $\sigma(\beta_T s_T)$.

Additionally, we incorporate two techniques from successful methods for learning sparse networks and searching for winning tickets. First, motivated by Han et al. (2015), where it is shown that iteratively pruning a network yields improved sparsity compared to pruning it only once, we enable “kept” weights – those whose corresponding component of $s$ is positive after many iterations – to be removed from the network at a later stage. More specifically, once $\beta$ becomes large after many gradient descent updates, the gradients of $\sigma(\beta s)$ vanish and weights will no longer be removed from the network. To avoid this, we set $s \leftarrow \min(\beta s, s_0)$, effectively resetting the soft mask parameters for the remaining weights while at the same time not interfering with weights that have been removed. This is followed by a reset of the temperature, $\beta \leftarrow 1$, to allow training of $s$ once again.

Second, we perform parameter rewinding, following Frankle and Carbin (2019), which is a key component of Iterative Magnitude Pruning. More specifically, after $T$ gradient descent steps, we reset the weight values back to an earlier iterate $w_k$, where $k \ll T$. Even though experimental results in Frankle and Carbin (2019) suggest that rewinding is necessary for successful ticket search, we leave rewinding as an optional component of our algorithm: as we will see empirically, it turns out that ticket search is possible without rewinding weights. Our proposed algorithm to find winning tickets is presented as Algorithm 3, and is referred to simply as “Continuous Sparsification”.
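A minimal sketch of this outer loop, written over the SoftMaskedLinear-style layers from the previous section: the train_one_round helper and the exact reset rule are assumptions based on the description above, not a verbatim transcription of the released code.

```python
import torch

def continuous_sparsification(layers, train_one_round, s_init, rounds=5):
    # layers: list of SoftMaskedLinear-style modules exposing s, beta, hard_mask()
    # train_one_round: trains weights and mask parameters while annealing beta (Eq. 7)
    for _ in range(rounds):
        train_one_round(layers)
        with torch.no_grad():
            for layer in layers:
                # removed weights (large negative beta*s) stay removed; the
                # remaining weights have their mask parameters reset for the next round
                layer.s.copy_(torch.minimum(layer.beta * layer.s,
                                            torch.full_like(layer.s, s_init)))
                layer.beta.fill_(1.0)   # reset the temperature
        # optional: rewind the weights w to an early iterate here
    return [layer.hard_mask() for layer in layers]  # ticket masks H(s)
```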

4 Experiments

Our experiments aim at comparing different methods on the task of finding winning tickets in neural networks, hence our evaluation focuses on the generalization performance of each ticket (sub-network) when trained from scratch (or from an iterate in early-training). Additionally, we measure the cost of the search procedure: the number of training epochs to find tickets with varying performance and sparsity.

Besides comparing our proposed method to Iterative Magnitude Pruning (Algorithm 1), we also design a baseline method, Iterative Stochastic Sparsification (ISS, Algorithm 2), motivated by the procedure in Zhou et al. (2019) to find a binary mask with gradient descent in an end-to-end fashion. More specifically, ISS uses a stochastic re-parameterization $m \sim \text{Bern}(\sigma(s))$, and trains $w$ and $s$ jointly with gradient descent and the straight-through estimator (Bengio et al., 2013). When run for multiple iterations, all components of the mask parameters which have decreased in value from initialization are set to $-\infty$, such that the corresponding weight is permanently removed from the network. While this might look arbitrary, we observed empirically that ISS was unable to remove weights quickly without this step unless the penalty $\lambda$ was chosen to be large – in which case the model's performance decreases in exchange for sparsity. The hyperparameters used in this section were chosen based on the analysis presented in Appendix A, where we study how the pruning rate affects IMP, and how $\lambda$, $\beta$, and the mask initialization $s_0$ interact in CS.

4.1 Convolutional Neural Networks

We train a neural network with 6 convolutional layers on the CIFAR-10 dataset (Krizhevsky, 2009), following Frankle and Carbin (2019). The network consists of 3 blocks of 2 resolution-preserving convolutional layers followed by max-pooling, where convolutions in each block have and channels, a kernel, and are immediately followed by ReLU activations. The blocks are followed by fully-connected layers with and neurons, with ReLUs in between. The network is trained with Adam (Kingma and Ba, 2015) with a learning rate of and a batch size of .

Learning a Supermask: As a first baseline, we consider the task of learning a “supermask” (Zhou et al., 2019): a binary mask that, when applied to a network with randomly-initialized weights, yields performance competitive with that of training its weights. This task is equivalent to pruning a randomly-initialized network, or to learning an architecture with a fixed initialization that performs well prior to training. We compare ISS and CS, where each method is run for a single iteration composed of epochs. When run for a single iteration, ISS is equivalent to the algorithm proposed in Zhou et al. (2019) to learn a supermask, referred to here simply as Stochastic Sparsification (SS). We control the sparsity of the learned masks by varying between and for Stochastic Sparsification (which proved more effective than varying ), while for Continuous Sparsification we vary between and (which results in stable and consistent training, unlike varying ). SS uses SGD with a learning rate of to learn its mask parameters, while CS uses Adam.
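In code, the supermask setup amounts to freezing the randomly-initialized weights and optimizing only the mask parameters; a hedged sketch using the SoftMaskedLinear layers from Section 3.1 is shown below (the learning rate is an illustrative choice).

```python
import torch

# layers: list of SoftMaskedLinear modules with randomly-initialized weights.
# Freeze the weights; only the mask parameters s are trained.
for layer in layers:
    layer.weight.requires_grad_(False)
    layer.bias.requires_grad_(False)

mask_params = [layer.s for layer in layers]
optimizer = torch.optim.Adam(mask_params, lr=0.1)  # lr is illustrative

def total_loss(batch_loss, lam):
    # task loss plus the sparsity penalty of Equation (7)
    return batch_loss + lam * sum(layer.mask_penalty() for layer in layers)
```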

Results are presented in Figure 2: CS outperforms SS in terms of both training speed and the quality of the learned mask. In particular, CS finds masks with over sparsity that yield over test accuracy, while the performance of masks found by SS decreases when sparsity is over . Moreover, CS makes faster progress in training, showing that optimizing a deterministic mask is indeed faster than learning a distribution over masks through stochastic re-parameterizations.

Finding Winning Tickets: We run IMP and ISS for a total of 30 iterations, each consisting of 40 epochs. Parameters are trained with Adam (Kingma and Ba, 2015) with a learning rate of , following Frankle and Carbin (2019). For IMP, we use pruning rates of for convolutional/dense layers. We initialize the Bernoulli parameters of ISS with , and train them with SGD and a learning rate of , along with a regularization of . For CS, we anneal the temperature from to following an exponential schedule (), training both the weights and the mask with Adam and a learning rate of .
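For concreteness, one common form of exponential temperature schedule is sketched below; the endpoint values are illustrative assumptions, not the exact numbers used in the paper.

```python
def beta_schedule(epoch, total_epochs, beta_init=1.0, beta_final=200.0):
    # Exponentially interpolate the temperature from beta_init to beta_final,
    # so that beta grows by a constant factor each epoch.
    return beta_init * (beta_final / beta_init) ** (epoch / total_epochs)
```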

To test whether our method is capable of finding winning tickets in a limited amount of time, we limit each run of CS to iterations only, in contrast with IMP and ISS, which are run for 30. We perform 6 runs of CS, each with a different value for the mask initialization $s_0$, keeping $\lambda$ such that sparsification is not enforced during training, but heavily biased at initialization. In order to evaluate how consistent our method is, we repeat each run with 3 different random seeds so that error bars can be computed.

Figure 3 (left) presents the quality of tickets found by each method, measured by their test accuracy when trained from scratch. To illustrate the quality of the tickets that can be found by Continuous Sparsification, we plot the Pareto curve (green) of the tickets found with the 6 different values for $s_0$. With , in only 2 iterations CS finds a ticket with over sparsity (first marker of the purple curve) which outperforms every ticket found by IMP in its 30 iterations. The Pareto curve of CS strictly dominates IMP for tickets with up to sparsity, where ticket performance is superior or similar to the original dense network.

In terms of computational time, the total cost of running CS with the 6 different values for $s_0$ is lower than that of a single run of IMP for 30 iterations, even though CS takes extra time per epoch due to the mask parameters. This shows the potential of our method even in the setting where a specific sparsity is desired for the tickets. When run in parallel, CS takes less wall-clock time to find all tickets in the Pareto curve than IMP takes to complete 5 iterations.

[Figures: newfigs/supermask_acc_per_iteration.pdf, newfigs/supermasks.pdf]

Figure 2: Learning a binary mask with weights frozen at initialization, using Stochastic Sparsification (SS, Algorithm 2 with one iteration) and Continuous Sparsification (CS), on a 6-layer CNN on CIFAR-10. Left: Training curves with hyperparameters for which the masks learned by SS and CS were both approximately sparse. CS learns the mask significantly faster while attaining similar early-stop performance. Right: Sparsity and test accuracy of masks learned with different settings for SS and CS: our method learns sparser masks while maintaining test performance, while SS is unable to successfully learn masks with over sparsity.

[Figures: newfigs/conv6.pdf, newfigs/resnet.pdf]

Figure 3: Test accuracy of tickets found by different methods on CIFAR-10. Error bars depict the variance across 3 runs. Left: Performance of tickets found on a 6-layer CNN, when trained from scratch. Right: Performance of tickets found on a ResNet-20, when rewound to the second training epoch. In both experiments, tickets found by CS outperform the ones found by IMP. In most cases, CS successfully finds winning tickets in 2 iterations (purple curves).

4.2 Finding Winning Tickets in Residual Networks without Rewinding

Searching for tickets in realistic models is not as straightforward as finding tickets in a small CNN, and might require new strategies. Frankle et al. (2019) show that IMP fails at finding winning tickets in ResNets (He et al., 2016) unless the learning rate is smaller than the recommended value, leading to worse overall performance and defeating the purpose of ticket search. However, the authors propose a slight modification to IMP that enables search for winning tickets to be successful on complex networks: instead of training from scratch, tickets are initialized with weights from early training.

With this in mind, we evaluate how Continuous Sparsification performs in the time-consuming task of finding winning tickets in a ResNet-20 (He et al., 2016) trained on CIFAR-10 (Footnote 3: We used the same network as Frankle and Carbin (2019) and Frankle et al. (2019), where it is referred to as a ResNet-18.): a setting where IMP might take over iterations ( epochs) to succeed. We follow the setup in Frankle and Carbin (2019) and Frankle et al. (2019): in each iteration, the network is trained with SGD, a learning rate of , and a momentum of for a total of epochs, using a batch size of . The learning rate is decayed by a factor of at epochs and , and a weight decay of is applied to the weights (for CS, we do not apply weight decay to the mask parameters $s$). The two skip-connections that perform convolutions and the output layer are not removable: for IMP, their parameters are not pruned, while for CS their weights have neither a corresponding mask $m$ nor mask parameters $s$.

When training the returned tickets, in order to evaluate their performance we initialize their weights with the iterates from the end of epoch 2 ( parameter updates), similarly to Frankle et al. (2019). Unlike when searching for winning tickets in the 6-layer CNN, IMP performs global pruning, removing of the remaining parameters with smallest magnitude, ranked across different layers. IMP runs for a total of iterations, while CS is limited to only 5 iterations for each run. The sparsity of the tickets found by CS is controlled by varying the mask initialization . To allow for even faster ticket search, we run CS without parameter rewinding: that is, the weights are transferred from one iteration to another, decreasing the burden of re-training the network at each iteration. For both CS and IMP, each run is repeated with 3 different random seeds.

The results presented in Figure 3 (right) show that CS is able to successfully find winning tickets with varying sparsity in under 5 iterations. The Pareto curve strictly dominates IMP, and the variance across runs is smaller than IMP's. CS is capable of quickly sparsifying the network in a single iteration and typically finds better tickets than IMP after only 2 rounds. Most notably, in all runs (including those with different random seeds), 5 iterations were enough for CS to find better tickets than IMP, regardless of the final sparsity.

In terms of search cost, in a parallel setting CS strictly outperforms IMP by performing concurrent runs with different values for $s_0$. To achieve sparsity, IMP requires 10 iterations while CS finds a better ticket in only 2 (i.e. faster), and the speedup only increases with the ticket's sparsity. In a sequential setting where an approximate sparsity is desired for the found ticket, one can run CS while performing binary search over $s_0$ (in Appendix A.1 we experimentally show that the sparsity varies monotonically with $s_0$). In this case, considering a binary search over the explored values for $s_0$, IMP requires 9 iterations to find a ticket with sparsity, while CS finds a ticket with higher test accuracy in 6 iterations (5 iterations for a run with , plus one iteration with ). The binary search can be made more efficient by, for example, halting a run if the candidate value for $s_0$ yields higher sparsity than desired, or by executing CS for fewer than 5 iterations per run. Even without these modifications, CS is often faster than IMP in the sequential setting. We leave further improvements to CS in the sequential setting as future work.
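A hedged sketch of this sequential strategy is given below: run_cs is an assumed helper that runs Continuous Sparsification with a given mask initialization and returns the achieved sparsity, and the search bounds are illustrative.

```python
def find_s_init(run_cs, target_sparsity, lo=-1.0, hi=1.0, steps=6):
    # Binary search over the mask initialization s_0, relying on the empirical
    # observation (Appendix A.1) that sparsity decreases monotonically as s_0 grows.
    for _ in range(steps):
        mid = (lo + hi) / 2
        if run_cs(mid) > target_sparsity:
            lo = mid   # ticket too sparse: raise s_0
        else:
            hi = mid   # ticket not sparse enough: lower s_0
    return (lo + hi) / 2
```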

4.3 Pruning VGG

[Figure: newfigs/pruning.pdf]

Figure 4: Performance of different methods when performing one-shot pruning on VGG. CS maintains over test accuracy after removing of the weights, while other methods fail to successfully remove more than of the parameters.

Our experiments show that Continuous Sparsification is capable of finding tickets quickly and consistently, and we attribute its success to its deterministic re-parameterization of the binary mask. Here, we evaluate our method as a pruning technique, to better assess whether our proposed re-parameterization is advantageous only in terms of training time, or also with respect to the quality of the learned masks.

For this task, we train VGG (Simonyan and Zisserman, 2015) on the CIFAR-10 dataset, following the protocol in Frankle and Carbin (2019): the network is trained with SGD and an initial learning rate of , which is decayed by a factor of at epochs 80 and 120. After 160 training epochs, the network is sparsified and then fine-tuned for 40 epochs with a learning rate of . We evaluate previously described methods when executed for a single iteration (one-shot pruning): Continuous Sparsification, Magnitude Pruning (IMP with 1 iteration) (Han et al., 2015), and Stochastic Sparsification (ISS with 1 iteration), which is similar to methods in Zhou et al. (2019), Srinivas et al. (2016), and Louizos et al. (2018).

At the sparsification step, IMP performs global pruning, ISS fixes the binary mask to be the maximum-likelihood one under $\text{Bern}(\sigma(s_T))$ (which performed better than sampling from the distribution), and CS changes the parameterization of the mask from $\sigma(\beta s)$ to $H(s)$ (or, equivalently, weights where $s \leq 0$ are removed). We use a momentum of , a weight decay of (not applied to the mask parameters $s$), and a batch size of . Following Frankle and Carbin (2019), sparsification is not applied to batch normalization layers or the final linear layer.

To evaluate each method when finding masks with different sparsity levels, we run IMP with a range of global pruning rates, and ISS and CS with a range of initial mask values. Results are shown in Figure 4: both magnitude pruning and stochastic regularization (Stochastic Sparsification) fail at removing over of the weights without severely degrading the performance of the model. On the other hand, Continuous Sparsification successfully removes of the parameters in the convolutional layers while still yielding over test accuracy. When taken to the extreme, our method is capable of removing of the weights and still yielding over accuracy.

The dramatic performance difference between stochastic and continuous sparsification shows that our proposed deterministic re-parameterization is key to achieving superior results in both network pruning and ticket search. The fact that it outperforms magnitude pruning, a standard technique in the pruning literature, suggests that further exploration of $\ell_0$-based methods could yield significant advances in pruning techniques.

5 Discussion

With Frankle and Carbin (2019), we now realize that sparse sub-networks can indeed be successfully trained from scratch, calling into question the belief that overparameterization is required for proper optimization of neural networks. Such sub-networks, called winning tickets, can potentially be used to significantly decrease the resources required for training deep networks, as they have been shown to transfer between different, but similar, tasks (Mehta, 2019; Soelen and Sheppard, 2019).

Currently, the search for winning tickets is a poorly explored problem, where Iterative Magnitude Pruning (Frankle and Carbin, 2019) stands as the only algorithm suited for the task, and it is unclear whether its key ingredients – post-training magnitude pruning and parameter rewinding – are the correct choices. Here, we approach the problem of finding sparse sub-networks as an $\ell_0$-regularized optimization problem, which we approximate through a smooth, parameterized relaxation of the step function. Our proposed algorithm for finding winning tickets, Continuous Sparsification, removes parameters automatically and continuously during training, and can be fully described by the optimization framework. We show empirically that, indeed, post-training pruning might not be a sensible choice for finding winning tickets, raising questions on how the search for tickets differs from standard network compression. With this work, we hope to further motivate the problem of quickly finding tickets in overparameterized networks, as recent work suggests that the task might be highly relevant to transfer learning and mobile applications.

References

  • Z. Allen-Zhu, Y. Li, and Z. Song (2019) A convergence theory for deep learning via over-parameterization. In ICML.
  • Y. Bengio, N. Léonard, and A. Courville (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv:1308.3432.
  • J. Frankle and M. Carbin (2019) The lottery ticket hypothesis: finding sparse, trainable neural networks. In ICLR.
  • J. Frankle, G. Karolina Dziugaite, D. M. Roy, and M. Carbin (2019) Stabilizing the lottery ticket hypothesis. arXiv:1903.01611.
  • S. Han, H. Mao, and W. J. Dally (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR.
  • S. Han, J. Pool, J. Tran, and W. J. Dally (2015) Learning both weights and connections for efficient neural networks. In NIPS.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR.
  • F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv:1602.07360.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR.
  • A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report.
  • Y. LeCun, J. S. Denker, and S. A. Solla (1990) Optimal brain damage. In NIPS.
  • H. Liu, K. Simonyan, and Y. Yang (2019) DARTS: differentiable architecture search. In ICLR.
  • C. Louizos, M. Welling, and D. P. Kingma (2018) Learning sparse neural networks through $L_0$ regularization. In ICLR.
  • R. Mehta (2019) Sparse transfer learning via winning lottery tickets. arXiv:1905.07785.
  • A. S. Morcos, H. Yu, M. Paganini, and Y. Tian (2019) One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. In NeurIPS.
  • B. Neyshabur, Z. Li, S. Bhojanapalli, Y. LeCun, and N. Srebro (2018) Towards understanding the role of over-parametrization in generalization of neural networks. arXiv:1805.12076.
  • P. Savarese and M. Maire (2019) Learning implicitly recurrent CNNs through parameter sharing. In ICLR.
  • K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR.
  • R. V. Soelen and J. W. Sheppard (2019) Using winning lottery tickets in transfer learning for convolutional neural networks. In IJCNN.
  • S. Srinivas, A. Subramanya, and R. Venkatesh Babu (2016) Training sparse neural networks. arXiv:1611.06694.
  • H. Zhou, J. Lan, R. Liu, and J. Yosinski (2019) Deconstructing lottery tickets: zeros, signs, and the supermask. In NeurIPS.

Appendix

Appendix A Hyperparameter Analysis

a.1 Continuous Sparsification

In this section, we study how the hyperparameters of Continuous Sparsification affect its performance in terms of sparsity and performance of the found tickets. More specifically, we consider the following hyperparameters:

  • Final temperature $\beta_T$: the final value for $\beta$, which controls how smooth the parameterization is.

  • Penalty $\lambda$: the strength of the regularization applied to the soft mask $\sigma(\beta s)$, which promotes sparsity.

  • Mask initial value $s_0$: the value used to initialize all components of the mask parameters $s$, where smaller values promote sparsity.

Our setup is as follows: to analyze how each of the 3 hyperparameters impacts the performance of Continuous Sparsification, we train a ResNet-20 on CIFAR-10 (following the same protocol from Section 4.2), varying one hyperparameter while keeping the other two fixed. To capture how hyperparameters interact with each other, we repeat the described experiment with different settings for the fixed hyperparameters.

Since different hyperparameter settings naturally yield vastly distinct sparsity and performance for the found tickets, we report relative changes in accuracy and in sparsity.

In Figure 5, we vary $\lambda$ for three different fixed settings of the final temperature $\beta_T$ and the mask initialization $s_0$. As we can see, there is little impact on either the performance or the sparsity of the found ticket, except for one of the settings, for which a larger $\lambda$ yields slightly increased sparsity.

[Figure: newfigs/parameter_lmbda.pdf]

Figure 5: Impact on the relative test accuracy and sparsity of tickets found in a ResNet-20 trained on CIFAR-10, for different values of $\lambda$ and fixed settings for $\beta_T$ and $s_0$.

Next, we fix $\lambda$ and $s_0$ at three different settings and proceed to vary the final temperature $\beta_T$ between 50 and 200. Figure 6 shows the results: in all cases, the largest temperature yielded better accuracy. However, it decreased sparsity compared to smaller temperature values for two of the settings, while at the same time increasing sparsity for the third. While larger temperatures appear beneficial and might suggest that even higher values should be used, note that the larger $\beta$ is, the earlier in training the gradients of $\sigma(\beta s)$ will vanish, at which point training of the mask stops. Since the performance for temperatures between 100 and 200 does not change significantly, we recommend values around 150 or 200 when either pruning or performing ticket search.

[Figure: newfigs/parameter_temp.pdf]

Figure 6: Impact on the relative test accuracy and sparsity of tickets found in a ResNet-20 trained on CIFAR-10, for different values of $\beta_T$ and fixed settings for $\lambda$ and $s_0$.

[Figure: newfigs/parameter_i_mask.pdf]

Figure 7: Impact on the relative test accuracy and sparsity of tickets found in a ResNet-20 trained on CIFAR-10, for different values of $s_0$ and fixed settings for $\lambda$ and $\beta_T$.

Lastly, we vary the initial mask value $s_0$ under three different hyperparameter settings for $\lambda$ and $\beta_T$. Results are given in Figure 7: unlike the exploration of $\lambda$ and $\beta_T$, we can see that $s_0$ has a strong and consistent effect on the sparsity of the found tickets. For this reason, we suggest proper tuning of $s_0$ when the goal is to achieve a specific sparsity value. Since the percentage of remaining weights is monotonically increasing with $s_0$, we can perform binary search over values for $s_0$ to achieve any desired sparsity level. In terms of performance, lower values for $s_0$ naturally lead to performance degradation, since sparsity quickly increases as $s_0$ becomes more negative.

a.2 Iterative Magnitude Pruning

Here, we assess whether the running time of Iterative Magnitude Pruning can be improved by increasing the number of parameters pruned at each iteration. The goal of this experiment is to evaluate whether Continuous Sparsification offers faster ticket search only because it prunes the network more aggressively than IMP, or because it is truly more effective in how parameters are chosen to be removed.

Following the same setup as the previous section, we train a ResNet 20 on CIFAR-10. We run IMP for 30 iterations, performing global pruning with different pruning rates at the end of each iteration. Figure 8 shows that the performance of tickets found by IMP decays when the pruning rate is increased to . In particular, the final performance of found tickets is mostly monotonically decreasing with the number of remaining parameters, suggesting that, in order to find tickets which outperform the original network, IMP is not compatible with more aggressive pruning rates.

[Figure: newfigs/imp_vary_pruning.pdf]

Figure 8: Performance of tickets found by Iterative Magnitude Pruning in a ResNet-20 trained on CIFAR-10, for different pruning rates.