1 Introduction
Sparse learning algorithms search for model parameters that minimize training loss while retaining only a small fraction of nonzero entries. Parameter sparsity yields benefits along several axes: reduced model storage costs, greater computational and energy efficiency during training and inference, and potentially improved model generalization [32]. However, sparse training is a computationally challenging task: for the general problem of learning parameters θ subject to the constraint that θ is k-sparse, the feasible set is the union of (d choose k) axis-aligned hyperplanes intersecting at the origin, each of dimension k. This complex, nonconvex geometry of the feasible set compounds the existing difficulty of optimizing deep neural network models.

This work focuses on minimizing generalization error for sparse neural network models. To this end, we introduce Spartan, or Sparsity via Regularized Transportation: a sparse learning algorithm that leverages an optimal transportation-based soft top-k masking operation to induce parameter sparsity during training. Spartan belongs to the family of "dense-to-sparse" algorithms that maintain a dense parameter vector θ throughout training [47; 23], in contrast to "sparse-to-sparse" algorithms that adhere to a memory budget for representing the parameters of a sparse model [3; 30; 31; 13; 15]. While computational cost and memory usage at training time are important considerations, Spartan primarily optimizes for performance at inference time.

Intuitively, Spartan aims to achieve a controlled transition between the exploration and exploitation
of various sparsity patterns during training. In the exploration regime, the goal is for the learner to easily transition between differing sparsity patterns in order to escape those that correspond to poor minima of the loss function. On the other hand, the goal of exploitation is for the learner to optimize model parameters given a fixed sparsity pattern, or a small set of similar patterns. This latter setting avoids frequent oscillations between disparate sparsity patterns, which may be detrimental to the optimization process.
We operationalize this intuition through the use of a differentiable soft top-k masking operation. This function maps the parameters θ to an approximately sparse output that suppresses the low-magnitude entries of θ. We parameterize the soft top-k mask with a sharpness parameter β: at β = 0, the operation simply scales the input by a constant, and as β → ∞, it reduces to hard top-k magnitude-based selection (Figure 1 (a, b)). Soft top-k masking therefore constrains the iterates to be close to the set of exactly k-sparse vectors; we give some geometric intuition for the effect of this mapping in Figure 1 (c) and Figure 2. We implement soft top-k masking using a regularized optimal transportation formulation [12; 41] and demonstrate that this technique scales to networks with tens of millions of parameters.
We evaluate Spartan on ResNet-50 [20] and ViT [14] models trained on the ImageNet-1K dataset. On ResNet-50, we find that sparse models trained with Spartan achieve higher generalization accuracies than those trained with existing methods at sparsity levels of 90% and above; in particular, our ResNet-50 models improve on the previous state-of-the-art top-1 validation accuracy at both 95% and 97.5% sparsity. Our sparse ViT-B/16 models substantially reduce model storage size and inference FLOPs at the cost of a small accuracy reduction relative to DeiT-B [37]. We further demonstrate that Spartan is effective for block-structured pruning, a form of structured sparsity that is more amenable to acceleration on current GPU hardware than unstructured sparsity [17], achieving competitive top-1 accuracy on a ViT-B/16 model at 90% block-structured sparsity.
To summarize, we make the following contributions in this paper:

We present Spartan, a sparsity-inducing training algorithm based on a soft top-k masking operation. We show that Spartan interpolates between two existing sparse learning algorithms: iterative magnitude pruning [47] and Top-K Always Sparse Training (TopKAST) [23].
We empirically evaluate Spartan using ResNet-50 and ViT models trained on the ImageNet-1K dataset, demonstrating consistent improvements over the prior state-of-the-art.

We study the effect of Spartan’s hyperparameters on an exploration-exploitation tradeoff during training and on the final accuracy of the trained models.
Notation. We use ℝ₊ to denote the set of nonnegative real numbers and 𝟏 to denote the all-ones vector. ‖x‖ denotes the Euclidean norm of x, and ‖x‖₀ denotes the number of nonzero entries in x. We write x ⊙ y to denote elementwise multiplication of x and y. The operator P_k denotes Euclidean projection onto the set of k-sparse vectors.
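As a concrete illustration of the projection operator, a minimal NumPy sketch (the function name is ours, not from the paper):

```python
import numpy as np

def project_topk(x: np.ndarray, k: int) -> np.ndarray:
    """Euclidean projection of x onto the set of k-sparse vectors:
    keep the k largest-magnitude entries and zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]  # indices of the k largest magnitudes
    out[idx] = x[idx]
    return out

x = np.array([0.1, -2.0, 0.5, 3.0, -0.2])
print(project_topk(x, 2))  # keeps only -2.0 and 3.0
```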
2 Related Work
Neural network pruning. Our use of a magnitude-based criterion for sparse learning draws on early work in the area of neural network pruning [22]. Magnitude-based pruning is computationally cheap relative to alternative criteria that rely on first- or second-order information [32; 26; 19], and is a perhaps surprisingly performant option despite its simplicity [16]. More generally, Spartan builds on a substantial body of previous work that aims to jointly optimize sparsity patterns and model parameters for deep neural networks [42; 8; 18; 28; 27; 47, inter alia; see, e.g., 21 for a survey].
Spartan as a generalization of existing methods. We highlight two particularly relevant methods in the literature: the iterative magnitude pruning (IMP) method of Zhu and Gupta [47] (Algorithm 1), and TopKAST, or Top-K Always Sparse Training [23] (Algorithm 2). At the extreme points of the sharpness parameter β, Spartan’s parameter update procedure reduces to those of IMP and TopKAST; we can thus view Spartan as a method that generalizes and smoothly interpolates between this pair of algorithms. As β → ∞, Spartan is equivalent to IMP: this approach sparsifies parameters using a top-k magnitude-based binary mask, and entries that were masked in the forward pass receive zero gradient in the backward pass. At β = 0, Spartan reduces to TopKAST: this method again sparsifies parameters by magnitude with a binary mask, but unlike IMP, all entries are updated in the backward pass using the gradient of the loss with respect to the masked parameters. The TopKAST update is thus an application of the straight-through gradient method [5; 9], otherwise known as lazy projection or dual averaging (DA) in optimization [33; 40; 2].
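The contrast between the two update rules can be sketched in simplified form (the function names and the toy quadratic loss are ours; both methods additionally involve schedules and optimizer state not shown here):

```python
import numpy as np

def topk_mask(theta, k):
    """Binary mask keeping the k largest-magnitude entries."""
    m = np.zeros_like(theta)
    m[np.argsort(np.abs(theta))[-k:]] = 1.0
    return m

# Toy loss L(w) = 0.5 * ||w - a||^2, with gradient evaluated at the
# masked parameters w = theta * m.
a = np.array([1.0, 1.0, 1.0])
grad = lambda w: w - a

def imp_step(theta, k, lr):
    """IMP-style update: masked-out entries receive zero gradient."""
    m = topk_mask(theta, k)
    return theta - lr * (grad(theta * m) * m)

def topkast_step(theta, k, lr):
    """TopKAST-style (straight-through / dual averaging) update:
    every entry is updated with the gradient taken at the masked point."""
    m = topk_mask(theta, k)
    return theta - lr * grad(theta * m)

theta = np.array([3.0, 0.1, -2.0])   # k = 2 keeps indices 0 and 2
print(imp_step(theta, 2, 0.1))       # entry 1 stays frozen at 0.1
print(topkast_step(theta, 2, 0.1))   # entry 1 still moves toward a[1]
```

The difference in how the masked entry at index 1 evolves is exactly the gradient-sparsity issue discussed in Section 3.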
Smooth approximations. Our use of the soft top-k operation is related to prior methods that use the logistic sigmoid function as a differentiable approximation to the step function for sparse training [29; 35; 1]. These approaches similarly regulate the sharpness of the approximation with a temperature parameter that scales the input logits to the sigmoid function. A distinctive characteristic of Spartan is that the budget parameter directly controls the degree of sparsity of the mask; this differs from previous approaches involving the logistic sigmoid approximation, which control the degree of sparsity only indirectly through a sparsity-inducing penalty term.

Transformer-specific methods. Several approaches are specialized to transformer-based architectures. These include structured pruning of pretrained transformers for NLP tasks [38; 25; 39] and sparse training methods specifically applied to vision transformers [7; 6]. In contrast, Spartan is a general-purpose sparse training algorithm that is designed to be agnostic to the model architecture.
3 Sparsity via Regularized Transportation
Each iteration of Spartan consists of the following two steps (Algorithm 3): (1) an approximately sparse masking of the model parameters using the soft top-k operator, and (2) a dual-averaging-based parameter update with the set of exactly k-sparse vectors as the feasible set. This update procedure aims to combine the advantages of optimization via dual averaging with those of smooth approximation-based algorithms.
The key advantage of the dual averaging method is that it mitigates the issue of gradient sparsity: in a standard gradient update, only those parameters that were not masked in the forward pass receive a nonzero gradient, resulting in slow progress at high sparsities. Dual averaging iterates are therefore able to more quickly explore the space of possible sparsity patterns.
On the other hand, large variations in the sparsity patterns realized post-projection can lead to instability in optimization, hampering the final performance of the model; for instance, TopKAST empirically benefits from the addition of sparsity-inducing regularizers on its iterates [23; 36]. This issue motivates our use of a soft top-k approximation as a mechanism for controlling the stability of our training iterates.
In the following, we begin by describing the soft top-k masking operation and address the issues of scalability and of applying soft top-k masking to structured sparsity. We subsequently discuss how the update procedure outlined in Algorithm 3 can be incorporated within a complete training loop.
3.1 Soft Top-k Masking
Our soft top-k masking scheme is based on the soft top-k operator described by Xie et al. [41]. The operator takes as input a vector v, a budget parameter B, and a sharpness parameter β, and outputs a mask vector m ∈ [0, 1]^d. By using the magnitudes of the parameter vector as a measure of the “value” of each entry, we obtain a soft top-k magnitude pruning operator that masks the entries of the parameter vector with the output of the soft top-k operator, yielding the approximately sparse parameters θ ⊙ m.
We generalize the soft top-k operator described by Xie et al. [41] by incorporating a strictly positive cost vector c. In particular, we require that the output mask m satisfies the budget constraint c⊤m = B. This abstraction is useful for modeling several notions of heterogeneous costs associated with different sparsity patterns within a model. For instance, parameters that are repeatedly invoked within a network have a higher associated computational cost: Evci et al. [15] observe that the FLOP count of a ResNet-50 model at a fixed sparsity can vary severalfold due to variations in the level of sparsity among convolutional layers with differing output dimensions.
To derive the cost-sensitive soft top-k operator, we begin by expressing the optimal mask as the solution to the following linear program (LP):

(1)  max_{m}  v⊤m   subject to   c⊤m = B,  0 ≤ m ≤ 𝟏.

Deferring the details to the Appendix, we rewrite this LP in the following form with the substitution m̃_i = c_i m_i:

(2)  max_{m̃}  Σ_i (v_i / c_i) m̃_i   subject to   𝟏⊤m̃ = B,  0 ≤ m̃_i ≤ c_i.

We identify this LP as an optimal transportation problem with transported masses given by m̃ and c − m̃, and with a profit matrix defined by the normalized values v_i / c_i. Note that when c = 𝟏 and B = k, this LP reduces to the original soft top-k OT problem from Xie et al. [41].
In order to efficiently approximate the solution to Problem 2, we add entropic regularization and solve the resulting regularized problem using the Sinkhorn-Knopp algorithm [12; 4; 11]. Algorithms 4 and 5 describe the forward and backward passes of the cost-sensitive soft top-k operator, respectively. The expression for the gradient in Algorithm 5 follows from Theorem 3 in Xie et al. [41] after some algebraic simplification. We remark that this closed-form backward pass is approximate in the sense that it assumes that we obtain the optimal dual variables in the forward pass; in practice, we do not encounter issues when using approximate values of the dual variables computed under the convergence tolerance and maximum iteration count used in Algorithm 4.
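As a rough illustration of the uniform-cost case (c = 𝟏, B = k), the following log-domain Sinkhorn sketch computes a soft top-k mask. This is our own simplification, not the paper's implementation: it omits the cost vector, the custom backward pass, and the convergence checks of Algorithm 4.

```python
import numpy as np

def _lse(a, axis):
    """Numerically stable log-sum-exp along an axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def soft_topk_mask(values, k, beta, n_iter=300):
    """Soft top-k mask via entropically regularized OT (uniform costs).

    Each of the d entries holds unit mass; a 'keep' sink has capacity k and
    a 'drop' sink has capacity d - k. Landing in the 'keep' column earns a
    profit proportional to the entry's value. The returned mask is the mass
    each entry sends to the 'keep' column, so m lies in [0, 1]^d and sums
    to k. Larger beta yields a sharper (more nearly binary) mask.
    """
    d = len(values)
    s = values / (np.abs(values).max() + 1e-12)       # normalized values
    logK = beta * np.column_stack([s, np.zeros(d)])   # (d, 2) log-kernel
    log_nu = np.log(np.array([k, d - k], dtype=float))
    g = np.zeros(2)                                   # log-domain duals
    for _ in range(n_iter):
        f = -_lse(logK + g, axis=1)                   # rows sum to 1
        g = log_nu - _lse(logK + f[:, None], axis=0)  # cols sum to (k, d-k)
    return np.exp(f + g[0] + logK[:, 0])              # mass sent to 'keep'

v = np.array([5.0, 1.0, 4.0, 2.0, 3.0])
print(soft_topk_mask(v, k=2, beta=50.0))  # nearly binary: selects 5.0 and 4.0
print(soft_topk_mask(v, k=2, beta=0.0))   # constant scaling: each entry ~ k/d
```

The beta = 0 case reproduces the constant-scaling behavior noted in the introduction, and large beta approaches hard top-k selection.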
Scalability. Since we apply soft top-k masking to high-dimensional parameters in each iteration of training, it is important to minimize the computational overhead of Sinkhorn iteration in the forward pass. We observe that in order to compute the hard top-k projection step in dual averaging (as in Algorithm 3), it is already necessary to find the kth largest entry of the value vector; using this value to initialize the corresponding dual variable accelerates the convergence of Sinkhorn iteration. Concretely, we demonstrate in Sec. 4.3 that our implementation of Spartan incurs a per-iteration runtime overhead of approximately 5% over standard dense ResNet-50 training.
Structured sparsity. To implement block-structured sparsity, we instantiate one mask variable per block and mask all parameters within the block with the same mask variable. To compute the mask, we use the sum of the magnitudes of the entries in each block as the corresponding value. In a standard pruning setting with uniform costs across all parameters, the corresponding cost is simply the total number of entries in the block.
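For a 2-D weight matrix with b × b blocks, the per-block values can be computed with a reshape; a sketch under our own naming:

```python
import numpy as np

def block_values(theta: np.ndarray, b: int) -> np.ndarray:
    """Sum of parameter magnitudes within each b x b block."""
    r, c = theta.shape
    assert r % b == 0 and c % b == 0, "dims must be divisible by block size"
    return np.abs(theta).reshape(r // b, b, c // b, b).sum(axis=(1, 3))

def expand_block_mask(mask: np.ndarray, b: int) -> np.ndarray:
    """Broadcast one mask entry per block back to the parameter shape."""
    return np.kron(mask, np.ones((b, b)))

theta = np.arange(16, dtype=float).reshape(4, 4)
print(block_values(theta, 2))  # 2 x 2 grid: [[10. 18.] [42. 50.]]
```

Under uniform parameter costs, the cost of each block is simply b².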
3.2 Training with Spartan
Training schedule. When training with Spartan, we divide the training process into a warmup phase, an intermediate phase, and a fine-tuning phase. In our experiments, the warmup phase consists of the first 20% of epochs, the intermediate phase the next 60%, and the fine-tuning phase the final 20%. In the warmup phase, we linearly anneal the global sparsity of the model from fully dense at the start of training to the target level of sparsity. Throughout the intermediate phase, we maintain the target sparsity level, and during the fine-tuning phase, we fix the sparsity mask to the one used at the start of fine-tuning. This training schedule is similar to those used in prior work [36].

Exploration vs. exploitation as a function of β. From the start of training up to the start of the fine-tuning phase, we linearly ramp the sharpness parameter β from its initial value to its final value. We interpret the spectrum of updates parameterized by β as characteristic of an exploration-exploitation tradeoff. In Figure 3, we illustrate this phenomenon by plotting Pearson correlation coefficients between sparsity masks at different stages of training. We observe that iterative magnitude pruning converges on its final mask relatively early in training, which is indicative of insufficient exploration of different sparsity patterns; consequently, this model achieves lower validation accuracy than the remaining models. The three models trained with Spartan each ramp β from its initial value at the start of training to its final value at epoch 80. The final value of β correlates well both with the Pearson correlation between intermediate masks and the final mask, and with the Pearson correlation between masks at the ends of consecutive epochs. Spartan thus interpolates between the high-exploration regime of standard dual averaging and the low-exploration regime of iterative magnitude pruning. In this example, an intermediate setting of β achieves the highest top-1 validation accuracy, outperforming both TopKAST and IMP.
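The schedule above can be sketched as follows. The 20/60/20 phase split is from the text; the linear ramps and the fully dense start are our reading, and the endpoint values for β are illustrative placeholders:

```python
def sparsity_at(epoch: float, total: float, target: float,
                warmup_frac: float = 0.2) -> float:
    """Linearly anneal global sparsity from 0 to `target` over the warmup
    phase, then hold it at `target` for the rest of training."""
    warmup = warmup_frac * total
    return target * min(1.0, epoch / warmup)

def beta_at(epoch: float, total: float, beta_init: float, beta_final: float,
            ramp_frac: float = 0.8) -> float:
    """Linearly ramp the sharpness beta up to the start of fine-tuning."""
    t = min(1.0, epoch / (ramp_frac * total))
    return beta_init + t * (beta_final - beta_init)

print(sparsity_at(10, 100, 0.95))    # mid-warmup: 0.475
print(beta_at(80, 100, 1.0, 100.0))  # start of fine-tuning: 100.0
```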
4 Empirical Evaluation
In this section, we report the results of our sparse training experiments on the ImageNet-1K dataset with two standard architectures: ResNet-50 [20] and ViT-B/16 [14], consisting of 25.6M and 86.4M parameters respectively. On ResNet-50, we evaluate only unstructured pruning, whereas on ViT-B/16, we evaluate both unstructured and block-structured pruning. We subsequently present empirical studies of the sensitivity of Spartan to the value of β, the effect of running Spartan without the dual averaging step, and the computational overhead of soft top-k masking.
4.1 ImageNet1K Classification
In all our experiments, we run Spartan with the training schedule described in Section 3.2. We train and evaluate our models on the ImageNet-1K dataset with the standard training-validation split and report means and standard deviations over 3 independent trials. We provide additional details on our experimental setup in the Appendix.
ResNet-50 experimental setup. For consistency with our baselines, we use standard data augmentation with horizontal flips and random crops. For all Spartan runs, we use a single setting of β, selected based on models trained at 95% sparsity. We sparsify the parameters of all linear and convolutional layers under a global sparsity budget, excluding bias parameters and the parameters of batch normalization layers. Our baselines are iterative magnitude pruning [47], RigL with the Erdos-Renyi-Kernel (ERK) sparsity distribution [15], Soft Threshold Weight Reparameterization (STR) [24], probabilistic masking (ProbMask) [46], OptG [45], and TopKAST with Powerpropagation and ERK [23; 36]. We additionally rerun the most performant baseline method, TopKAST, using a reimplementation in our own codebase. For TopKAST, we exclude the first convolutional layer from pruning, following [36; 15]. We use mixed precision training with a batch size of 4096 on 8 NVIDIA A100 GPUs.

ViT experimental setup. We use the ViT-B architecture with 16 × 16 input patches. We augment the training data using RandAugment [10], MixUp [44] and CutMix [43]. Our ViT models are trained from random initialization, without any pretraining. We use one setting of β for Spartan with unstructured sparsity, and larger settings for Spartan with block-structured sparsity, scaled up with the block size: since each block value averages the magnitudes of all entries in the block, we expect the variance of the block values to decrease as the block size grows, and we compensate by scaling up β accordingly. In the block-structured case, we exclude the input convolutional layer and the output classification head from pruning, since their parameter dimensions are not divisible by the block size. We use mixed precision training with a batch size of 4096 on 16 NVIDIA A100 GPUs across 2 nodes.

Results. Table 1 lists the top-1 validation accuracies achieved by fully dense ResNet-50 and ViT-B/16 models. Table 2 reports validation accuracies for ResNet-50 models at 80%, 90%, 95%, 97.5% and 99% sparsity, and Table 3 reports validation accuracies for ViT at 90% sparsity. In the Appendix, we additionally report measurements of inference-time FLOP costs for our sparse models and the results of experiments with FLOP-sensitive pruning.
For ResNet-50 models, we find that Spartan outperforms all our baselines across all training durations at sparsity levels of 90% and above. In particular, Spartan achieves a mean top-1 accuracy close to that of fully dense training at 95% sparsity. We observe that additional epochs of sparse training consistently improve the final generalization accuracy; in contrast, validation accuracy peaks at 200 epochs for dense training. This trend persists at 800 training epochs, where Spartan attains its best top-1 accuracy at 99% sparsity. For the TopKAST baseline, we omit results at 99% sparsity due to training instability.
For ViT-B/16, Spartan outperforms TopKAST for both unstructured and block-structured pruning. We observe a particularly large improvement over TopKAST in the block-structured case, where Spartan improves absolute validation accuracy by a substantial margin. For unstructured pruning, Spartan achieves accuracy comparable to SViTE [7], but at a higher sparsity level. In exchange for a small reduction in accuracy relative to DeiT-B [37], Spartan substantially reduces both model storage cost and the FLOP cost of inference.
4.2 Sensitivity and Ablation Analysis
Figure 4 (left) shows the effect of ablating the dual averaging step in Spartan, i.e., omitting the hard top-k projection in the forward pass, over a range of settings of β for 95% sparse ResNet-50 models trained for 100 epochs. The dashed line shows top-1 accuracy with TopKAST. For Spartan training without dual averaging, we compute a hard top-k mask at the end of epoch 80 and, as with standard Spartan training, fix this mask until the end of training. In the absence of top-k projection, we find that accuracy increases with increasing β up to a point: at low settings of β, the higher approximation error of soft masking is detrimental to the final accuracy of the model. In contrast, the use of top-k projection in the forward pass mitigates this mismatch between training and inference and improves the final accuracy of the sparse model.
4.3 Computational Overhead
We evaluate the computational overhead of Spartan over standard dense training by measuring wall clock training times for a ResNet-50 model (Figure 4, right). Our benchmark measures runtime on a single NVIDIA A100 GPU over 50 iterations of training with a batch size of 256. We use random inputs and labels to avoid incurring data movement costs. We compare three approaches: standard Sinkhorn iteration, Sinkhorn iteration with the dual variable initialized using its final value from the previous training iteration, and Sinkhorn iteration with the dual variable initialized using the kth largest entry of the value vector, found by sorting the vector in each iteration.
Standard Sinkhorn iteration incurs higher overheads as β increases; this is due to the additional iterations required to reach convergence as the regularized OT problem more closely approximates the original LP. We find that dual caching and sorting both prevent this growth in runtime over the range of β values that we tested. In our remaining experiments, we use the simple sorting-based approach, since it achieves the lowest relative overhead over standard dense training (approximately 5%). We note that this relative overhead decreases as the batch size increases, since the cost of computing the mask is independent of the batch size.
5 Discussion
As an alternative to magnitude-based pruning, we may also apply Spartan in conjunction with learned value parameters, as in methods such as Movement Pruning [34]. In this approach, we would compute both the soft top-k mask and the hard projection using a set of auxiliary value parameters instead of the parameter magnitudes. We remark that while this variant requires additional memory during training to store the value parameters, there is no additional cost during inference, since the sparsity mask is fixed at the end of training.
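A minimal sketch of this learned-value variant, shown here with a hard mask for brevity (the names are ours, and the auxiliary scores would be trained alongside the parameters in the full method):

```python
import numpy as np

def topk_mask_from(values: np.ndarray, k: int) -> np.ndarray:
    """Hard top-k mask computed from an arbitrary value vector."""
    m = np.zeros_like(values)
    m[np.argsort(values)[-k:]] = 1.0
    return m

theta = np.array([0.5, -3.0, 2.0, 0.1])
scores = np.array([4.0, 0.2, 1.0, 3.0])   # learned values, not |theta|
print(theta * topk_mask_from(scores, 2))  # keeps indices 0 and 3
```

Note the contrast with magnitude pruning, which would instead retain the entries -3.0 and 2.0 here.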
Limitations. Since Spartan retains a dense parameter vector and computes dense backward passes during training, it incurs higher memory and computational costs in each iteration than sparse-to-sparse methods like RigL [15]. Nevertheless, we note that in terms of total computational or energy cost over a full training run, Spartan may remain a competitive option, as it requires fewer iterations of training to reach a given accuracy threshold relative to sparse-to-sparse methods. However, we do not compare total training FLOP costs in our empirical evaluation.
A further limitation is that cost-sensitive pruning with Spartan is only compatible with relatively crude linear cost models. This restriction arises from the requirements of the regularized OT formulation used to compute the soft top-k mask. In particular, it precludes the use of cost models involving interaction terms, such as those arising from spatially coherent sparsity patterns.
Societal Impacts. Inference with deep neural network models is a computationally intensive process. At present, the total energy footprint associated with serving these models in production is expected to continue growing in tandem with the rising prevalence of large transformerbased architectures in vision and NLP applications. Research towards improving the energy efficiency of deep neural networks is therefore an important counterbalancing force against increasing resource usage by these models. The development of sparse learning algorithms is particularly relevant to these efforts, and we expect that the impact of these approaches will further increase as sparsityaware hardware acceleration becomes more widely available.
6 Conclusions & Future Work
In this work, we describe a sparse learning algorithm that interpolates between two parameter update schemes: standard stochastic gradient updates with hard masking and the dual averaging method. We show that there exists an intermediate regime between these two methods that yields improved generalization accuracy for sparse convolutional and transformer models, particularly at higher levels of sparsity. While we have demonstrated promising empirical results with our proposed method, the learning dynamics of stochastic optimization for deep networks under sparsity constraints remains relatively poorly understood from a theoretical standpoint. There thus exists ample opportunity for further work towards better understanding sparse learning algorithms, which may in turn inspire future algorithmic advances in this area.
References
 [1] (2020) Learned threshold pruning. arXiv preprint arXiv:2003.00075. Cited by: §2.
 [2] (2018) ProxQuant: Quantized Neural Networks via Proximal Operators. In International Conference on Learning Representations, Cited by: §2.
 [3] (2018) Deep Rewiring: Training very sparse deep networks. In International Conference on Learning Representations, Cited by: §1.
 [4] (2015) Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing 37 (2). Cited by: §3.1.
 [5] (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432. Cited by: §2.
 [6] (2022) Vision transformer slimming: multi-dimension searching in continuous optimization space. arXiv preprint arXiv:2201.00814. Cited by: §2.
 [7] (2021) Chasing sparsity in vision transformers: an end-to-end exploration. In Advances in Neural Information Processing Systems, Cited by: §2, §4.1.
 [8] (2014) Memory bounded deep convolutional networks. arXiv preprint arXiv:1412.1442. Cited by: §2.
 [9] (2015) BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, Cited by: §2.

 [10] (2020) RandAugment: practical automated data augmentation with a reduced search space. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Cited by: §4.1.
 [11] (2019) Differentiable ranking and sorting using optimal transport. In Advances in Neural Information Processing Systems, Cited by: §3.1.
 [12] (2013) Sinkhorn distances: lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, Cited by: §1, §3.1.
 [13] (2019) Sparse networks from scratch: faster training without losing performance. arXiv preprint arXiv:1907.04840. Cited by: §1.
 [14] (2021) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations, Cited by: §1, §4.

 [15] (2020) Rigging the lottery: making all tickets winners. In International Conference on Machine Learning, Cited by: §1, §3.1, §4.1, Table 2, §5.
 [16] (2019) The state of sparsity in deep neural networks. arXiv e-prints arXiv:1902.09574. Cited by: §2.
 [17] (2017) GPU kernels for block-sparse weights. arXiv preprint arXiv:1711.09224. Cited by: §1.
 [18] (2015) Learning both weights and connections for efficient neural network. Advances in Neural Information Processing Systems. Cited by: §2.
 [19] (1992) Second order derivatives for network pruning: optimal brain surgeon. Advances in Neural Information Processing Systems. Cited by: §2.
 [20] (2016) Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §4.

 [21] (2021) Sparsity in deep learning: pruning and growth for efficient inference and training in neural networks. Journal of Machine Learning Research. Cited by: §2.
 [22] (1989) Pruning versus clipping in neural networks. Physical Review A. Cited by: §2.
 [23] (2020) Top-KAST: Top-K Always Sparse Training. Advances in Neural Information Processing Systems. Cited by: 1st item, §1, §2, §3, §4.1.
 [24] (2020) Soft threshold weight reparameterization for learnable sparsity. In International Conference on Machine Learning, Cited by: §4.1, Table 2.

 [25] (2021) Block pruning for faster transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Cited by: §2.
 [26] (1989) Optimal brain damage. Advances in Neural Information Processing Systems 2. Cited by: §2.
 [27] (2018) SNIP: Single-shot Network Pruning based on Connection Sensitivity. In International Conference on Learning Representations, Cited by: §2.
 [28] (2018) Learning sparse neural networks through regularization. In International Conference on Learning Representations, Cited by: §2.
 [29] (2020) AutoPruner: an end-to-end trainable filter pruning method for efficient deep model inference. Pattern Recognition. Cited by: §2.
 [30] (2018) Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications. Cited by: §1.

 [31] (2019) Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In International Conference on Machine Learning, Cited by: §1.
 [32] (1988) Skeletonization: a technique for trimming the fat from a network via relevance assessment. In Advances in Neural Information Processing Systems, Cited by: §1, §2.
 [33] (2009) Primal-dual subgradient methods for convex problems. Mathematical Programming. Cited by: §2.
 [34] (2020) Movement pruning: adaptive sparsity by fine-tuning. In Advances in Neural Information Processing Systems, Cited by: §5.
 [35] (2020) Winning the lottery with continuous sparsification. Advances in Neural Information Processing Systems. Cited by: §2.
 [36] (2021) Powerpropagation: a sparsity inducing weight reparameterisation. In Advances in Neural Information Processing Systems, Cited by: §3.2, §3, §4.1, Table 2.
 [37] (2021) Training dataefficient image transformers & distillation through attention. In International Conference on Machine Learning, Cited by: §1, §4.1.

 [38] (2020) Structured pruning of large language models. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §2.
 [39] (2022) Structured pruning learns compact and accurate models. arXiv preprint arXiv:2204.00408. Cited by: §2.
 [40] (2009) Dual averaging method for regularized stochastic learning and online optimization. Advances in Neural Information Processing Systems. Cited by: §2.
 [41] (2020) Differentiable top-k with optimal transport. In Advances in Neural Information Processing Systems, Cited by: §1, §3.1.
 [42] (2012) Exploiting sparseness in deep neural networks for large vocabulary speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: §2.

 [43] (2019) CutMix: regularization strategy to train strong classifiers with localizable features. In IEEE/CVF International Conference on Computer Vision, Cited by: §4.1.
 [44] (2018) mixup: beyond empirical risk minimization. In International Conference on Learning Representations, Cited by: §4.1.
 [45] (2022) Optimizing gradient-driven criteria in network sparsity: gradient is all you need. arXiv preprint arXiv:2201.12826. Cited by: §4.1, Table 2.
 [46] (2021) Effective sparsification of neural networks with global sparsity constraint. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §4.1, Table 2.
 [47] (2018) To Prune, or Not to Prune: Exploring the Efficacy of Pruning for Model Compression. In International Conference on Learning Representations, Cited by: 1st item, §1, §2, §2, §4.1.
Appendix A Derivation of the Optimal Transportation LP
Here, we show how the original top-k LP with costs can be straightforwardly rewritten in the form of an optimal transportation problem. For a given value vector v, cost vector c > 0, and budget B, the top-k LP is:

max_{m}  v⊤m   subject to   c⊤m = B,  0 ≤ m ≤ 𝟏.

Define the variables m̃_i = c_i m_i and substitute to obtain:

max_{m̃}  Σ_i (v_i / c_i) m̃_i   subject to   𝟏⊤m̃ = B,  0 ≤ m̃_i ≤ c_i.

Now eliminate the upper bound constraint by introducing additional slack variables s = c − m̃ to give:

max_{m̃, s}  Σ_i (v_i / c_i) m̃_i   subject to   𝟏⊤m̃ = B,  m̃ + s = c,  m̃ ≥ 0,  s ≥ 0,

which can be recognized as an optimal transportation problem in the variables m̃ and s: each source i carries mass c_i, the first sink receives total mass B, and the second sink receives the remaining mass 𝟏⊤c − B.
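The change of variables can also be checked numerically. The sketch below uses the fractional-knapsack form of the LP optimum, which is valid here because the values are nonnegative and the only constraints are one equality plus box bounds (the function name and the random instance are ours):

```python
import numpy as np

def lp_topk(v, c, B):
    """Greedy optimum of  max v^T m  s.t.  c^T m = B, 0 <= m <= 1:
    fill entries in decreasing order of value-per-cost v_i / c_i."""
    m = np.zeros_like(v)
    budget = B
    for i in np.argsort(-(v / c)):
        take = min(1.0, budget / c[i])
        m[i] = take
        budget -= take * c[i]
        if budget <= 1e-12:
            break
    return m

rng = np.random.default_rng(0)
v = rng.random(8)               # nonnegative values (e.g. magnitudes)
c = 1.0 + rng.random(8)         # strictly positive costs
B = 3.0
m = lp_topk(v, c, B)
mt = c * m                      # substituted variables

assert np.isclose(c @ m, B)             # original budget constraint holds
assert np.isclose(mt.sum(), B)          # ...and becomes a plain sum constraint
assert np.isclose((v / c) @ mt, v @ m)  # the two objectives agree
assert np.all(mt <= c + 1e-12)          # box constraint becomes 0 <= mt <= c
```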
Appendix B Training Details
Hyperparameter  Value 

optimizer  Nesterov accelerated gradient method () 
max. learning rate  
min. learning rate  
learning rate warmup epochs  
learning rate decay schedule  cosine 
batch size  
weight decay  ( for bias and normalization parameters) 
label smoothing  
data augmentation  random crops, random horizontal flips 
input resolution  
sparsity annealing schedule  linear from to target sparsity at epoch fraction 
annealing schedule  linear from to at epoch fraction 
Sinkhorn max. iterations  
Sinkhorn tolerance 
ViT-B/16 experiments:
Hyperparameter  Value 

optimizer  AdamW () 
max. learning rate  
min. learning rate  
learning rate warmup epochs  of total epochs 
learning rate decay schedule  cosine 
batch size  
weight decay  ( for bias and normalization parameters) 
label smoothing  
data augmentation  random crops, random horizontal flips, RandAugment (ops , magnitude ) 
mixup  
CutMix  
gradient norm clip  
input resolution  
exponential moving averaging  false 
sparsity annealing schedule  linear from to target sparsity at epoch fraction 
β annealing schedule  linear from to at epoch fraction 
Sinkhorn max. iterations  
Sinkhorn tolerance 
Appendix C FLOP Measurements
Due to differences in the computational cost associated with individual parameters, the sparsity fraction does not map one-to-one to the fraction of FLOPs required for inference. Tables 6 and 7 give FLOP costs for our sparse models as a percentage of the FLOP cost of the corresponding dense model. We performed our FLOP measurements using the open source tool available at https://github.com/sovrasov/flops-counter.pytorch/. We count multiply and add operations as one FLOP each. There is some inconsistency in the literature regarding this convention, with some prior work using multiply-accumulate (MAC) counts and FLOP counts interchangeably. To convert the base FLOP counts listed below for ResNet-50 and ViT-B/16 to MACs, we simply divide the given counts by 2.
In Table 6, the ResNet-50 FLOP counts for Top-KAST are slightly higher than those for Spartan. This is due to the exclusion of the input convolutional layer from pruning in the case of Top-KAST.
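To make the counting convention concrete, the following sketch counts FLOPs for a single convolutional layer under the multiply-plus-add convention (an illustration of the convention only, not the counting tool's implementation; the example shapes correspond to the 7×7 input convolution of ResNet-50, ignoring bias terms):

```python
def conv2d_flops(c_in, c_out, k, h_out, w_out, sparsity=0.0):
    """FLOPs of a k x k convolution, counting each multiply and each add
    as one FLOP, scaled by the fraction of nonzero weights."""
    macs = c_in * c_out * k * k * h_out * w_out  # multiply-accumulate count
    return int(2 * macs * (1.0 - sparsity))

# ResNet-50 input conv: 7x7 kernel, 3 -> 64 channels, 112x112 output map
dense = conv2d_flops(3, 64, 7, 112, 112)        # FLOP convention
macs = dense // 2                               # MAC convention (divide by 2)
sparse = conv2d_flops(3, 64, 7, 112, 112, 0.9)  # same layer at 90% sparsity
```

Note that a 90% sparse layer costs 10% of the dense FLOPs under this model, which is why per-layer sparsity allocations, not just the global sparsity fraction, determine the overall FLOP percentage.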
[Table 6: FLOP cost as a percentage of the dense ResNet-50 model, for Top-KAST and Spartan at 80%, 90%, 95%, 97.5%, and 99% sparsity across several training epoch budgets.]
[Table 7: FLOP cost as a percentage of the dense model, for Top-KAST and Spartan under unstructured and block structured sparsity across several training epoch budgets.]
Appendix D FLOP-Sensitive Pruning
We demonstrate FLOP-sensitive pruning with Spartan on ResNet-50 using the following cost model: assign a cost of 1 to each parameter of a fully connected layer, and a cost of $HW$ to each parameter of a convolutional layer whose output has size $H \times W$ along its spatial dimensions. We evaluate two valuation functions: $v_i = |\theta_i|$ and $v_i = |\theta_i| / c_i$. The former results in the same pruning order as in standard magnitude pruning, but with a FLOP budget constraint instead of the usual sparsity budget. The latter assigns a lower value to the parameters of convolutional layers, and results in networks where the parameters of convolutional layers are preferentially pruned. We use different settings of $\beta$ for the two valuation functions to compensate for the relatively smaller scale of the normalized values in the soft top-$k$ forward pass (Algorithm 4).
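For intuition about how the cost model changes which parameters survive, a greedy budgeted selection can stand in for the LP: rank parameters by value-to-cost ratio (the same normalized values that appear in the transportation LP of Appendix A) and keep them until the FLOP budget is spent. This is an illustrative fractional-knapsack sketch, not Spartan's soft top-$k$ update, and the helper name is ours:

```python
import numpy as np

def flop_budgeted_mask(theta, c, budget):
    """Greedy selection under a FLOP budget: keep parameters in decreasing
    order of |theta_i| / c_i, skipping any whose cost overflows the budget."""
    ratio = np.abs(theta) / c
    mask = np.zeros_like(theta)
    spent = 0.0
    for i in np.argsort(-ratio):
        if spent + c[i] <= budget:
            mask[i] = 1.0
            spent += c[i]
    return mask

# fully connected parameters cost 1; a conv parameter costs H*W of its output.
# here the second parameter is "convolutional" and expensive despite its
# relatively large magnitude, so it is pruned first.
theta = np.array([0.5, -0.4, 0.3, 0.2])
c = np.array([1.0, 4.0, 1.0, 1.0])
mask = flop_budgeted_mask(theta, c, budget=3.0)
```

The example shows the qualitative effect described above: dividing by cost makes high-cost convolutional parameters less attractive, so fully connected parameters are retained preferentially.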
Table 8 gives the top-1 accuracy, FLOP percentage, and sparsity percentage for each of these valuation functions. Spartan yields models with identical FLOP percentages, slightly higher than the budgeted value; this discrepancy is due to the additional cost of the normalization layers and activation functions in the network. Most notably, there is a substantial difference in the sparsity percentages realized by these valuation functions. As expected, the normalized valuation preferentially sparsifies the parameters of convolutional layers and yields denser fully connected layers, resulting in lower sparsity overall.

[Table 8: top-1 accuracy %, FLOP %, and sparsity % for each valuation function.]

Appendix E Additional Experiments
In Table 9, we compare Spartan against two additional variants of the Top-KAST baseline: Top-KAST with the Erdős-Rényi-Kernel (ERK) sparsity distribution, and Top-KAST with pruning applied to the parameters of all convolutional and fully connected layers with the exception of bias terms (prune all). Top-KAST (excl. input conv.) denotes the Top-KAST variant used in the experiments presented in the main text, where we exclude the input convolutional layer from pruning. We find that there is some small variation in the measured top-1 validation accuracies, but our conclusion that Spartan improves generalization at higher levels of sparsity is unchanged.
[Table 9: top-1 validation accuracy at 90% and 95% sparsity for Top-KAST (ERK), Top-KAST (prune all), Top-KAST (excl. input conv.), and Spartan, across several training epoch budgets.]
Appendix F Learned Sparsity Patterns
We observe a qualitative difference in the distribution of per-layer sparsities between ViT-B/16 models trained with unstructured sparsity and those trained with block structured sparsity (Figure 5). In particular, the output projections of self-attention layers under block structured pruning are significantly more dense in the later blocks of the network relative to unstructured pruning. The reasons for this difference are not immediately clear to us, and we leave further investigation of this phenomenon to future work.
Block structured pruning produces coherent sparsity patterns in ViT-B/16 models. In Figure 6, we visualize the magnitudes of the weight matrices corresponding to the input projection of each self-attention layer in a ViT-B/16 model trained with Spartan using block structured pruning. This matrix maps input vectors to query, key, and value embedding vectors. We observe that the training process yields similar sparsity patterns in the query and key embedding submatrices, which correspond to the left and center panels in the visualization for each layer. This is an intuitively reasonable property, since the self-attention layer computes inner products of the query and key embeddings in order to construct attention maps. We note that this symmetry emerges purely as a result of the optimization process; we did not incorporate any prior knowledge into Spartan regarding the role of particular entries of the weight matrices subject to sparsification.
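The query/key similarity described above can be quantified directly. The sketch below (the IoU metric and helper names are our own choice, not from the paper) splits a fused QKV input projection into its three submatrices and compares their nonzero masks:

```python
import numpy as np

def mask_iou(a, b):
    """Intersection-over-union between two binary sparsity masks."""
    a, b = a.astype(bool), b.astype(bool)
    union = (a | b).sum()
    return (a & b).sum() / union if union else 1.0

def qkv_overlap(w_qkv, d):
    """Split a fused input projection of shape (3d, d) into its query, key,
    and value submatrices and report the overlap of their sparsity patterns."""
    m = w_qkv != 0
    q, k, v = m[:d], m[d:2 * d], m[2 * d:]
    return mask_iou(q, k), mask_iou(q, v)

# toy check: identical query/key patterns, disjoint value pattern
d = 4
q = np.eye(d)
w = np.concatenate([q, q, 1.0 - q])  # fused projection, shape (3d, d)
qk_overlap, qv_overlap = qkv_overlap(w, d)
```

A high query/key IoU relative to the query/value IoU, computed per layer over the matrices shown in Figure 6, would make the visual symmetry we report quantitative.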