
Spartan: Differentiable Sparsity via Regularized Transportation

by Kai Sheng Tai, et al.

We present Spartan, a method for training sparse neural network models with a predetermined level of sparsity. Spartan is based on a combination of two techniques: (1) soft top-k masking of low-magnitude parameters via a regularized optimal transportation problem and (2) dual averaging-based parameter updates with hard sparsification in the forward pass. This scheme realizes an exploration-exploitation tradeoff: early in training, the learner is able to explore various sparsity patterns, and as the soft top-k approximation is gradually sharpened over the course of training, the balance shifts towards parameter optimization with respect to a fixed sparsity mask. Spartan is sufficiently flexible to accommodate a variety of sparsity allocation policies, including both unstructured and block structured sparsity, as well as general cost-sensitive sparsity allocation mediated by linear models of per-parameter costs. On ImageNet-1K classification, Spartan yields 95% sparse ResNet-50 models and 90% block sparse ViT-B/16 models while incurring absolute top-1 accuracy losses of less than 1% compared to fully dense training.



1 Introduction

Sparse learning algorithms search for model parameters that minimize training loss while retaining only a small fraction of non-zero entries. Parameter sparsity yields benefits along several axes: reduced model storage costs, greater computational and energy efficiency during training and inference, and potentially improved model generalization [32]. However, sparse training is a computationally challenging task: for the general optimization problem of learning parameters $\theta \in \mathbb{R}^d$ subject to the constraint that $\theta$ is $k$-sparse, the feasible set is the union of $\binom{d}{k}$ axis-aligned hyperplanes intersecting at the origin, each of dimension $k$. This complex, nonconvex geometry of the feasible set compounds the existing difficulty of optimizing deep neural network models.

This work focuses on minimizing generalization error for sparse neural network models. To this end, we introduce Spartan, or Sparsity via Regularized Transportation: a sparse learning algorithm that leverages an optimal transportation-based top-$k$ masking operation to induce parameter sparsity during training. Spartan belongs to a family of “dense-to-sparse” algorithms that maintain a dense parameter vector $\theta \in \mathbb{R}^d$ throughout training [47; 23], in contrast to “sparse-to-sparse” algorithms that adhere to a memory budget for representing the parameters of a $k$-sparse model [3; 30; 31; 13; 15]. While computational cost and memory usage at training time are important considerations, Spartan primarily optimizes for performance at inference time.

Intuitively, Spartan aims to achieve a controlled transition between the exploration and exploitation of various sparsity patterns during training. In the exploration regime, the goal is for the learner to easily transition between differing sparsity patterns in order to escape those that correspond to poor minima of the loss function. On the other hand, the goal of exploitation is for the learner to optimize model parameters given a fixed sparsity pattern, or a small set of similar patterns. This latter setting avoids frequent oscillations between disparate sparsity patterns, which may be detrimental to the optimization process.

Figure 1: Soft top-$k$ masking. (a, b) The soft top-$k$ masking operator computes approximately $k$-sparse outputs, with the sharpness of the mask controlled by the parameter $\beta$. (c) Small updates to iterates far from the $k$-sparse feasible set can correspond to a large perturbation in parameter space after projection by $\Pi_k$. Updates to iterates in the approximately sparse region (shaded in grey) correspond to smaller post-projection perturbations.

We operationalize this intuition through the use of a differentiable soft top-$k$ masking operation $\theta \mapsto \theta \odot \sigma_k^\beta(|\theta|)$. This function maps parameters $\theta$ to an approximately sparse output that suppresses low-magnitude entries in $\theta$. We parameterize the soft top-$k$ mask with a sharpness parameter $\beta$: at $\beta = 0$, the operation simply scales the input by a constant, and as $\beta \to \infty$, the mapping reduces to hard top-$k$ magnitude-based selection (Figure 1 (a, b)). Soft top-$k$ masking therefore constrains the iterates to be close to the set of exactly $k$-sparse vectors; we give some geometric intuition for the effect of this mapping in Figure 1 (c) and Figure 2. We implement soft top-$k$ masking using a regularized optimal transportation formulation [12; 41] and demonstrate that this technique scales to networks on the order of $10^8$ parameters.

We evaluate Spartan on ResNet-50 [20] and ViT [14] models trained on the ImageNet-1K dataset. On ResNet-50, we find that sparse models trained with Spartan achieve higher generalization accuracies than those trained with existing methods at sparsity levels of 90% and above. In particular, our ResNet-50 models at 95% and 97.5% sparsity both improve on the previous state-of-the-art top-1 validation accuracy. Our sparse ViT-B/16 models reduce model storage size and inference FLOPs at the cost of an accuracy reduction relative to DeiT-B [37]. We further demonstrate that Spartan is effective for block structured pruning, a form of structured sparsity that is more amenable to acceleration on current GPU hardware than unstructured sparsity [17]; we train ViT-B/16 models with block structured pruning at 90% sparsity.

To summarize, we make the following contributions in this paper:

  • We present Spartan, a sparsity-inducing training algorithm based on a soft top-$k$ masking operation. We show that Spartan interpolates between two existing sparse learning algorithms: iterative magnitude pruning [47] and Top-$K$ Always Sparse Training [23].

  • We empirically evaluate Spartan using ResNet-50 and ViT models trained on the ImageNet-1K dataset, demonstrating consistent improvements over the prior state-of-the-art.

  • We study the effect of Spartan’s hyperparameters on an exploration-exploitation tradeoff during training and on the final accuracy of the trained models.

Notation. We use $\mathbb{R}_+$ to denote the set of nonnegative real numbers. $\mathbf{1}$ denotes the all-ones vector. $\|x\|_p$ denotes the $p$-norm of $x$, and $\|x\|_0$ is the number of nonzero entries in $x$. We write $x \odot y$ to denote elementwise multiplication of $x$ and $y$. The operator $\Pi_k$ denotes Euclidean projection onto the set of $k$-sparse vectors.
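As a concrete illustration of the projection operator in the notation above, the following minimal sketch keeps the $k$ largest-magnitude entries of a vector and zeroes the rest, which is the Euclidean projection onto the $k$-sparse set. The function name `project_topk` is our own choice for illustration, and ties are broken arbitrarily.

```python
import numpy as np

def project_topk(x: np.ndarray, k: int) -> np.ndarray:
    """Euclidean projection onto k-sparse vectors: keep the k largest |x_i|."""
    out = np.zeros_like(x)
    if k <= 0:
        return out
    # Indices of the k largest-magnitude entries (order within them is arbitrary).
    keep = np.argpartition(np.abs(x), -k)[-k:]
    out[keep] = x[keep]
    return out

x = np.array([0.5, -2.0, 0.1, 1.5])
print(project_topk(x, 2))  # keeps -2.0 and 1.5, zeroes the other two entries
```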


Figure 2: A 2D example of soft masking. We plot probability densities corresponding to the action of the soft top-1 operator $\sigma_1^\beta$ on the 2D standard Gaussian distribution (darker colors indicate higher density). Specifically, we visualize the densities of $\theta \odot \sigma_1^\beta(|\theta|)$ with $\theta \sim \mathcal{N}(0, I)$ for a range of sharpness parameters $\beta$. Higher values of $\beta$ constrain iterates to be closer to the $k$-sparse feasible set.

2 Related Work

Neural network pruning. Our use of a magnitude-based criterion for sparse learning draws on early work in the area of neural network pruning [22]. Magnitude-based pruning is computationally cheap relative to alternative criteria that rely on first- or second-order information [32; 26; 19], and is a perhaps surprisingly performant option despite its simplicity [16]. More generally, Spartan builds on a substantial body of previous work that aims to jointly optimize sparsity patterns and model parameters for deep neural networks [42; 8; 18; 28; 27; 47, inter alia; see, e.g., [21] for a survey].

Spartan as a generalization of existing methods. We highlight two particularly relevant methods in the literature: the iterative magnitude pruning (IMP) method from Zhu and Gupta [47] (Algorithm 1), and Top-KAST, or Top-$K$ Always Sparse Training [23] (Algorithm 2). At the extreme points of the sharpness parameter $\beta$, Spartan’s parameter update procedure reduces to those of IMP and Top-KAST. We can thus view Spartan as a method that generalizes and smoothly interpolates between this pair of algorithms. As $\beta \to \infty$, Spartan is equivalent to IMP: this approach sparsifies parameters using a top-$k$ magnitude-based binary mask, and entries that were masked in the forward pass receive zero gradient in the backward pass. At $\beta = 0$, Spartan reduces to Top-KAST: this method again sparsifies parameters by magnitude with a binary mask, but unlike IMP, all entries are updated in the backward pass using the gradient of the loss with respect to the masked parameters. The Top-KAST update is thus an application of the straight-through gradient method [5; 9], otherwise known as lazy projection or dual averaging (DA) in optimization [33; 40; 2].

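Algorithm 1 Iterative magnitude pruning update
Input: parameters $\theta_t$, loss function $L$, sparsity budget $k$, step size parameters $\eta_t$
1: $\tilde{\theta}_t = \Pi_k(\theta_t)$
2: $\theta_{t+1} = \theta_t - \eta_t \nabla_{\theta_t} L(\tilde{\theta}_t)$

Algorithm 2 Dual averaging / Top-KAST update
Input: parameters $\theta_t$, loss function $L$, sparsity budget $k$, step size parameters $\eta_t$
1: $\tilde{\theta}_t = \Pi_k(\theta_t)$
2: $\theta_{t+1} = \theta_t - \eta_t \nabla_{\tilde{\theta}_t} L(\tilde{\theta}_t)$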
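The difference between the two updates is easiest to see side by side. The sketch below is an illustration under our own assumptions: `grad_fn` is a hypothetical callable returning the loss gradient at the masked parameters, and the function names are ours. In the IMP-style update the gradient is multiplied by the binary mask, whereas the dual averaging update applies the same gradient to every entry.

```python
import numpy as np

def topk_mask(theta, k):
    """Binary mask selecting the k largest-magnitude entries of theta."""
    mask = np.zeros_like(theta)
    mask[np.argpartition(np.abs(theta), -k)[-k:]] = 1.0
    return mask

def imp_update(theta, grad_fn, k, lr):
    """Hard masking with a standard gradient: masked entries get zero gradient."""
    m = topk_mask(theta, k)
    g = grad_fn(m * theta)          # gradient evaluated at the sparse parameters
    return theta - lr * m * g

def dual_averaging_update(theta, grad_fn, k, lr):
    """Straight-through update: every entry of the dense vector is updated."""
    m = topk_mask(theta, k)
    g = grad_fn(m * theta)
    return theta - lr * g

# Toy quadratic loss 0.5 * ||w - target||^2, chosen only for illustration.
target = np.array([0.0, 5.0, 0.0, 5.0])
grad_fn = lambda w: w - target
theta = np.array([3.0, 0.1, -2.0, 0.2])
print(imp_update(theta, grad_fn, 2, 0.1))             # small entries stay put
print(dual_averaging_update(theta, grad_fn, 2, 0.1))  # small entries move too
```

In this toy run the two masked entries are frozen under the IMP-style update but grow under dual averaging, which is what lets them re-enter the top-$k$ set later.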

Smooth approximations. Our use of the soft top-$k$ operation is related to prior methods that use the logistic sigmoid function as a differentiable approximation to the step function for sparse training [29; 35; 1]. These approaches similarly regulate the sharpness of the approximation with a temperature parameter that scales the input logits to the sigmoid function. A distinctive characteristic of Spartan is that the parameter $k$ directly controls the degree of sparsity of the mask; this differs from previous approaches involving the logistic sigmoid approximation that only indirectly control the degree of sparsity using a sparsity-inducing penalty term.

Transformer-specific methods. Several approaches are specialized to transformer-based architectures. These include structured pruning of pre-trained transformers for NLP tasks [38; 25; 39] and sparse training methods specifically applied to vision transformers [7; 6]. In contrast, Spartan is a general-purpose sparse training algorithm that is designed to be agnostic to the model architecture.

3 Sparsity via Regularized Transportation

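Algorithm 3 Spartan parameter update
Input: parameters $\theta_t$, loss function $L$, sparsity budget $k$, sharpness parameter $\beta$, step size parameters $\eta_t$
1: $\tilde{\theta}_t = \theta_t \odot \sigma_k^\beta(|\theta_t|)$ (apply soft masking)
2: $\hat{\theta}_t = \Pi_k(\tilde{\theta}_t)$ (project onto $k$-sparse set)
3: $\theta_{t+1} = \theta_t - \eta_t \left( \partial \tilde{\theta}_t / \partial \theta_t \right)^\top \nabla_{\hat{\theta}_t} L(\hat{\theta}_t)$ (compute dual averaging update)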

Each iteration of Spartan consists of the following two steps (Algorithm 3): (1) an approximately $k$-sparse masking of the model parameters using the soft top-$k$ operator, and (2) dual averaging-based parameter updates with the set of exactly $k$-sparse vectors as the feasible set. This update procedure aims to combine the advantages of optimization via dual averaging and those of smooth approximation-based algorithms.
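The two steps can be sketched end to end for the uniform-cost case. This is an illustrative sketch under our own assumptions rather than the reference implementation: the soft mask is taken to have the form $m_i = \mathrm{sigmoid}(\beta v_i + \mu)$ with $\mu$ chosen so that the mask sums to $k$, $\mu$ is found by bisection instead of Sinkhorn iteration, the hard projection is treated as the identity in the backward pass, and `grad_fn` and all function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 0.5 * (1.0 + np.tanh(0.5 * x))  # numerically stable logistic function

def soft_topk_mask(v, k, beta, iters=200):
    """Mask m_i = sigmoid(beta * v_i + mu), with mu set by bisection so sum(m) = k."""
    lo, hi = -beta * v.max() - 40.0, -beta * v.min() + 40.0
    for _ in range(iters):
        mu = 0.5 * (lo + hi)
        if sigmoid(beta * v + mu).sum() < k:
            lo = mu
        else:
            hi = mu
    return sigmoid(beta * v + 0.5 * (lo + hi))

def project_topk(x, k):
    out = np.zeros_like(x)
    keep = np.argpartition(np.abs(x), -k)[-k:]
    out[keep] = x[keep]
    return out

def spartan_update(theta, grad_fn, k, beta, lr):
    v = np.abs(theta)
    m = soft_topk_mask(v, k, beta)
    theta_soft = theta * m                     # step 1: soft masking
    theta_hard = project_topk(theta_soft, k)   # step 2: hard top-k projection
    g = grad_fn(theta_hard)                    # loss gradient at the sparse point
    # Step 3: backpropagate g through theta * m(|theta|), skipping the projection.
    w = m * (1.0 - m)
    g_m = g * theta                            # gradient w.r.t. the mask entries
    denom = w.sum()
    if denom > 0:
        g_v = beta * w * (g_m - (g_m * w).sum() / denom)
    else:
        g_v = np.zeros_like(v)                 # fully saturated mask: no mask gradient
    return theta - lr * (g * m + np.sign(theta) * g_v)
```

With `beta = 0` the mask is the constant $k/d$ and every entry is updated, as in dual averaging; with very large `beta` the mask saturates to a binary top-$k$ mask and masked entries stop moving, as in iterative magnitude pruning.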

The key advantage of the dual averaging method is that it mitigates the issue of gradient sparsity: in a standard gradient update, only those parameters that were not masked in the forward pass receive a nonzero gradient, resulting in slow progress at high sparsities. Thus, dual averaging iterates are able to more quickly explore the space of possible sparsity patterns.

On the other hand, large variations in the sparsity patterns realized post-projection can lead to instability in optimization, thus hampering the final performance of the model; for instance, Top-KAST empirically benefits from the addition of sparsity-inducing regularizers on its iterates [23; 36]. This issue motivates our use of a soft top-$k$ approximation as a mechanism for controlling the stability of our training iterates.

In the following, we begin by describing the soft top-$k$ masking operation and address the issues of scalability and of applying soft top-$k$ masking to structured sparsity. We subsequently discuss how the update procedure outlined in Algorithm 3 can be incorporated within a complete training loop.

3.1 Soft Top-$k$ Masking

Our soft top-$k$ masking scheme is based on the soft top-$k$ operator $\sigma_k^\beta$ described by Xie et al. [41]. The operator $\sigma_k^\beta$ takes as input a vector $v \in \mathbb{R}^d$, a budget parameter $k > 0$ and a sharpness parameter $\beta \ge 0$, and outputs a vector $m \in [0, 1]^d$. By using the magnitudes of the parameter vector $|\theta|$ as a measure of the “value” of each entry, we obtain a soft top-$k$ magnitude pruning operator by masking the entries of the parameter vector with the output of the soft top-$k$ operator:

$$\tilde{\theta} = \theta \odot \sigma_k^\beta(|\theta|).$$

We generalize the soft top-$k$ operator described by Xie et al. [41] by incorporating a strictly positive cost vector $c$. In particular, we require that the output mask $m$ satisfies the budget constraint $c^\top m = k$. This abstraction is useful for modeling several notions of heterogeneous costs associated with different sparsity patterns within a model. For instance, parameters that are repeatedly invoked within a network have a higher associated computational cost: Evci et al. [15] observe that the FLOP count of a ResNet-50 model at a fixed sparsity can vary substantially due to variations in the level of sparsity among convolutional layers with differing output dimensions.

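To derive the cost-sensitive soft top-$k$ operator, we begin by expressing the optimal mask as the solution to the following linear program (LP):

$$\max_{m \in \mathbb{R}^d} \; v^\top m \quad \text{subject to} \quad 0 \le m \le \mathbf{1}, \quad c^\top m = k. \tag{1}$$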
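Algorithm 4 Soft top-$k$ forward pass
Input: values $v$, costs $c$, budget $k$, sharpness parameter $\beta$, max. iterations $T$, tolerance $\epsilon$, initial dual variable $\mu_0$
Output: mask $m$, dual variables $\mu$, $\nu$
1: normalize values by costs
2: for up to $T$ iterations:
3:   normalize mask entries
4:   normalize sum to be equal to $k$
5:   compute mask
6:   return $m$, $\mu$, $\nu$ if converged to within tolerance $\epsilon$
7: return $m$, $\mu$, $\nu$

Algorithm 5 Soft top-$k$ backward pass
Input: gradient w.r.t. outputs $\partial L / \partial m$, mask $m$, costs $c$, budget $k$, sharpness parameter $\beta$
Output: gradient w.r.t. inputs $\partial L / \partial v$
return $\partial L / \partial v$ (gradient vanishes when $m_i = 0$ or $m_i = 1$)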

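Deferring the details to the Appendix, we rewrite this LP in the following form with $z := v / c$ (elementwise):

$$\max_{Y \in \mathbb{R}_+^{d \times 2}} \; \sum_{i=1}^{d} z_i Y_{i1} \quad \text{subject to} \quad Y \mathbf{1}_2 = c, \quad Y^\top \mathbf{1}_d = \left(k, \; \mathbf{1}_d^\top c - k\right)^\top. \tag{2}$$

We identify this LP as an optimal transportation problem with the variables $Y$ and profit matrix defined by the normalized values $z$. Note that when $c = \mathbf{1}$, this LP reduces to the original soft top-$k$ OT problem from Xie et al. [41].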

In order to efficiently approximate the solution to Problem 2, we add entropic regularization with weight $1/\beta$ and solve the resulting regularized problem using the Sinkhorn-Knopp algorithm [12; 4; 11]. Algorithms 4 and 5 describe the forward and backward passes of the cost-sensitive soft top-$k$ operator respectively. The expression for the gradient in Algorithm 5 follows from Theorem 3 in Xie et al. [41] with some algebraic simplification. We remark that this closed-form backward pass is approximate in the sense that it assumes that we obtain the optimal dual variables $\mu$, $\nu$ in the forward pass; in practice, we do not encounter issues when using approximate values of the dual variables obtained with a finite tolerance $\epsilon$ and maximum iteration count $T$ in Algorithm 4.
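To make the forward and backward passes concrete, the sketch below works through one way they can be realized. It is our own derivation, not a transcription of Algorithms 4 and 5. We assume the entropy-regularized problem has the solution $m_i = \mathrm{sigmoid}(\beta v_i / c_i + \mu)$ with $\mu$ fixed by $c^\top m = k$, run log-domain Sinkhorn-style updates on the dual variables, and differentiate the constraint implicitly for the backward pass. The function names, the stopping rule on the change in $\mu$, and the default arguments are hypothetical.

```python
import numpy as np

def soft_topk_forward(v, c, k, beta, max_iter=1000, tol=1e-10, mu0=0.0):
    """Cost-sensitive soft top-k mask via alternating dual updates (needs 0 < k < sum(c))."""
    z = beta * v / c                                      # values normalized by costs
    mu = mu0
    for _ in range(max_iter):
        nu = np.log(c) - np.logaddexp(0.0, z + mu)        # row step: enforces m_i <= 1
        mu_new = np.log(k) - np.logaddexp.reduce(z + nu)  # column step: sum_i c_i m_i = k
        converged = abs(mu_new - mu) < tol
        mu = mu_new
        if converged:
            break
    m = np.exp(-np.logaddexp(0.0, -(z + mu)))             # m = sigmoid(z + mu)
    return m, mu

def soft_topk_backward(grad_m, m, c, beta):
    """Gradient w.r.t. v, assuming the forward pass reached the optimal dual variable."""
    w = m * (1.0 - m)                 # sigmoid derivative; zero when m_i is 0 or 1
    denom = (c * w).sum()
    if denom == 0.0:
        return np.zeros_like(m)
    return beta * w * (grad_m / c - (grad_m * w).sum() / denom)

v = np.array([0.9, 0.1, 0.5, 0.7, 0.3])
c = np.array([1.0, 2.0, 1.0, 3.0, 1.0])
m, mu = soft_topk_forward(v, c, k=3.0, beta=4.0)
print(m, (c * m).sum())  # the cost-weighted mask sum should be close to the budget
```

Because the backward pass only uses the mask itself, nothing from the Sinkhorn loop needs to be stored for differentiation; the price is that the gradient is only as accurate as the converged dual variable.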

Scalability. Since we apply soft top-$k$ masking to high-dimensional parameters in each iteration of training, it is important to minimize the computational overhead of Sinkhorn iteration in the forward pass. We observe that in order to compute the hard top-$k$ projection step in dual averaging (as in Algorithm 3), it is necessary to find the index $i_k$ of the $k$th largest entry of the value vector. By using the value at this index to initialize the dual variable $\mu$, we can accelerate the convergence of Sinkhorn iteration. Concretely, we demonstrate in Sec. 4.3 that our implementation of Spartan incurs a per-iteration runtime overhead of approximately 5% over standard dense ResNet-50 training.
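The effect of this initialization can be checked on a small synthetic problem. The sketch below is illustrative only: it assumes unit costs and a mask of the form $\mathrm{sigmoid}(\beta v_i + \mu)$, under which a natural starting point is $\mu_0 = -\beta v_{i_k}$ (our assumption about the exact form of the initial value). The function `sinkhorn_mu` is a hypothetical helper that returns the dual variable and the number of iterations used.

```python
import numpy as np

def sinkhorn_mu(v, k, beta, mu0, tol=1e-8, max_iter=100000):
    """Alternating dual updates for the unit-cost soft top-k; returns (mu, iterations)."""
    z = beta * v
    mu = mu0
    for it in range(1, max_iter + 1):
        nu = -np.logaddexp(0.0, z + mu)
        mu_new = np.log(k) - np.logaddexp.reduce(z + nu)
        if abs(mu_new - mu) < tol:
            return mu_new, it
        mu = mu_new
    return mu, max_iter

v = np.linspace(0.0, 1.0, 1000)   # synthetic values, for illustration only
k, beta = 100, 200.0
kth_value = np.sort(v)[-k]        # k-th largest value, found by sorting

mu_cold, cold_iters = sinkhorn_mu(v, k, beta, mu0=0.0)
mu_warm, warm_iters = sinkhorn_mu(v, k, beta, mu0=-beta * kth_value)
print(cold_iters, warm_iters)     # the sorted initialization needs fewer iterations here
```

Both runs converge to the same dual variable; the sorted initialization simply skips the long approach phase that a generic starting point has to traverse when $\beta$ is large.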

Structured sparsity. To implement block structured sparsity, we instantiate one mask variable per block and mask all parameters within that block with the same mask variable. To compute the mask, we use the sum of the magnitudes of the entries in each block as the corresponding value $v_i$. In a standard pruning setting with uniform costs across all parameters, the corresponding cost $c_i$ is simply the total number of entries in the block.
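For a 2D weight matrix, the per-block values and costs described above can be computed with a reshape. This is a minimal sketch under our own assumptions: the weight dimensions are divisible by the block size, and the helper names `block_values_and_costs` and `expand_block_mask` are hypothetical.

```python
import numpy as np

def block_values_and_costs(weight, block):
    """Per-block value (sum of |w|) and cost (entries per block) for a 2D weight."""
    bh, bw = block
    rows, cols = weight.shape
    assert rows % bh == 0 and cols % bw == 0, "dimensions must be divisible by the block size"
    blocks = np.abs(weight).reshape(rows // bh, bh, cols // bw, bw)
    values = blocks.sum(axis=(1, 3))
    costs = np.full(values.shape, float(bh * bw))
    return values, costs

def expand_block_mask(block_mask, block):
    """Broadcast one mask variable per block to every parameter in that block."""
    return np.kron(block_mask, np.ones(block))

W = np.arange(16, dtype=float).reshape(4, 4) - 8.0
values, costs = block_values_and_costs(W, (2, 2))
print(values)  # one value per 2x2 block
print(expand_block_mask(np.array([[1.0, 0.0], [0.0, 1.0]]), (2, 2)) * W)
```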

3.2 Training with Spartan

Training schedule. When training with Spartan, we divide the training process into a warmup phase, an intermediate phase, and a fine-tuning phase. In our experiments, the warmup phase consists of the first 20% of epochs, the intermediate phase the next 60%, and the fine-tuning phase the final 20%. In the warmup phase, we linearly anneal the global sparsity of the model from its initial value at the start of training to the target level of sparsity. Throughout the intermediate phase, we maintain the target sparsity level, and during the fine-tuning phase, we fix the sparsity mask to that used at the start of fine-tuning. This training schedule is similar to those used in prior work [36].
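The three-phase schedule can be written as a small function of the epoch index. This sketch is illustrative: the function name is ours, and `initial_sparsity` is a free parameter here because the starting sparsity value is not fixed by the description above.

```python
def sparsity_schedule(epoch, total_epochs, target_sparsity, initial_sparsity):
    """Return (sparsity, phase) for a 20% / 60% / 20% warmup, intermediate, fine-tuning split."""
    warmup_end = round(0.2 * total_epochs)
    finetune_start = round(0.8 * total_epochs)
    if epoch < warmup_end:
        # Linear annealing from the initial to the target sparsity.
        frac = epoch / warmup_end
        sparsity = initial_sparsity + frac * (target_sparsity - initial_sparsity)
        return sparsity, "warmup"
    if epoch < finetune_start:
        return target_sparsity, "intermediate"
    # During fine-tuning the mask from the start of this phase is kept fixed.
    return target_sparsity, "fine-tuning"

for e in (0, 10, 20, 80):
    print(e, sparsity_schedule(e, total_epochs=100, target_sparsity=0.95, initial_sparsity=0.5))
```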


Figure 3: Spartan parameterizes an exploration-exploitation tradeoff. For each run of ResNet-50 training on ImageNet-1K, we plot Pearson correlation coefficients between the sparsity mask at the end of each epoch and the mask obtained at the end of training (left), and between sparsity masks at the end of subsequent epochs (right). Kinks in the correlation curves at epochs 20 and 80 are respectively due to the end of sparsity annealing and the start of fine-tuning with fixed masks (see Section 3.2 for details on the training schedule).

Exploration vs. exploitation as a function of $\beta$. From the start of training up to the start of the fine-tuning phase, we linearly ramp the sharpness parameter $\beta$ from its initial value to the final value $\beta_{\max}$. We interpret the spectrum of updates parameterized by $\beta$ as being characteristic of an exploration-exploitation tradeoff. In Figure 3, we illustrate this phenomenon by plotting Pearson correlation coefficients between sparsity masks at different stages of training. We observe that iterative magnitude pruning converges on the final mask relatively early in training, which is indicative of insufficient exploration of different sparsity patterns; consequently, this model achieves lower validation accuracy than the remaining models. The three models trained with Spartan each ramp $\beta$ from its initial value at the start of training to a different final value $\beta_{\max}$ at epoch 80. The value of $\beta_{\max}$ correlates well both with the Pearson correlation with the final mask, and with the Pearson correlation between masks at the end of subsequent epochs. Spartan thus interpolates between the high-exploration regime of standard dual averaging and the low-exploration regime of iterative magnitude pruning. In this example, the intermediate setting of $\beta_{\max}$ achieves the highest top-1 validation accuracy, ahead of both Top-KAST and IMP.
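Both quantities discussed here are simple to compute. The sketch below shows a linear ramp of the sharpness parameter and the Pearson correlation between two binary masks; the function names and the example values of `beta_init` and `beta_max` are hypothetical. Note that the correlation is undefined when either mask is constant.

```python
import numpy as np

def beta_at(epoch, finetune_start_epoch, beta_init, beta_max):
    """Linear ramp of the sharpness parameter up to the start of fine-tuning."""
    frac = min(epoch / finetune_start_epoch, 1.0)
    return beta_init + frac * (beta_max - beta_init)

def mask_correlation(mask_a, mask_b):
    """Pearson correlation coefficient between two flattened binary masks."""
    return float(np.corrcoef(mask_a.ravel(), mask_b.ravel())[0, 1])

a = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])
b = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 1.0])
print(beta_at(40, 80, 1.0, 11.0))
print(mask_correlation(a, a), mask_correlation(a, b))
```

A correlation near 1 between consecutive epochs indicates a stable mask (exploitation), while lower values indicate that entries are still moving in and out of the top-$k$ set (exploration).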

4 Empirical Evaluation

In this section, we report the results of our sparse training experiments on the ImageNet-1K dataset with two standard architectures: ResNet-50 [20] and ViT-B/16 [14], consisting of 25.6M and 86.4M parameters respectively. On ResNet-50, we evaluate only unstructured pruning, whereas on ViT-B/16, we evaluate both unstructured and block structured pruning. We subsequently present empirical studies of the sensitivity of Spartan to the value of $\beta$, the effect of running Spartan without the dual averaging step, and the computational overhead of soft top-$k$ masking.

[Table 1 layout: columns Method, Epochs, Accuracy (%); rows for ResNet-50 and ViT-B/16.]

Table 1: Top-1 accuracies on ImageNet-1K validation set with fully dense training.

4.1 ImageNet-1K Classification

[Table 2 layout: columns Method, Epochs, and top-1 accuracy at 80%, 90%, 95%, 97.5% and 99% sparsity; rows for RigL (ERK) [15], STR [24], ProbMask [46], OptG [45], IMP and Top-KAST (with PP, with ERK, with PP & ERK) [36], Top-KAST (reimplementation), and Spartan. Table notes: reported accuracies for models closest to the listed sparsity level; model trained at 98% sparsity.]

Table 2: ResNet-50 top-1 accuracies on ImageNet-1K validation set at varying levels of sparsity and epochs of training. When available, we report means and standard deviations over 3 trials.

In all our experiments, we run Spartan with the training schedule described in Section 3.2. We train and evaluate our models on the ImageNet-1K dataset with the standard training-validation split and report means and standard deviations over 3 independent trials. We provide additional details on our experimental setup in the Appendix.

ResNet-50 experimental setup. For consistency with our baselines, we use standard data augmentation with horizontal flips and random crops at $224 \times 224$ resolution. For all Spartan runs, we use a single setting of $\beta_{\max}$, which we selected based on models trained at 95% sparsity. We sparsify the parameters of all linear and convolutional layers with a global sparsity budget, excluding bias parameters and the parameters of batch normalization layers. Our baselines are iterative magnitude pruning [47], RigL with the Erdős-Rényi-Kernel (ERK) sparsity distribution [15], Soft Threshold Weight Reparameterization (STR) [24], probabilistic masking (ProbMask) [46], OptG [45], and Top-KAST with Powerpropagation and ERK [23; 36]. We additionally re-run the most performant baseline method, Top-KAST, using a reimplementation in our own codebase. For Top-KAST, we exclude the first convolutional layer from pruning, following [36; 15]. We use mixed precision training with a batch size of 4096 on 8 NVIDIA A100 GPUs.

[Table 3 layout: columns Method, Epochs, and top-1 accuracy under three sparsity structures (unstructured and two block sizes); rows for Top-KAST and Spartan.]

Table 3: ViT-B/16 top-1 accuracies on ImageNet-1K validation set at 90% sparsity with unstructured sparsity and block structured pruning of attention layers.

ViT experimental setup. We use the ViT-B architecture with $16 \times 16$ patches at $224 \times 224$ input resolution. We augment the training data using RandAugment [10], MixUp [44] and CutMix [43]. Our ViT models are trained from random initialization, without any pretraining. We set $\beta_{\max}$ separately for Spartan with unstructured sparsity and for Spartan with each of the two block sizes. The block settings are the value of $\beta_{\max}$ for the unstructured case scaled up by a factor that depends on the block size: since each block averages the magnitudes of $n$ entries, we expect the variance of the block values to correspondingly decrease by a factor of $n$, and we thus compensate by scaling up $\beta_{\max}$ by $\sqrt{n}$. In the block structured case, we exclude the input convolutional layer and the output classification head from pruning since their parameter dimensions are not divisible by the block size. We use mixed precision training with a batch size of 4096 on 16 NVIDIA A100 GPUs across 2 nodes.
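The scaling argument can be spelled out in one line. The following is a sketch under a simplifying assumption of ours, namely that the magnitudes within a block are independent with a common variance $\sigma^2$; here $n$ denotes the number of entries per block, and the superscripts on $\beta_{\max}$ are our own labels.

```latex
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} |\theta_i|\right) = \frac{\sigma^2}{n}
\quad\Longrightarrow\quad
\operatorname{sd}\!\left(\frac{1}{n}\sum_{i=1}^{n} |\theta_i|\right) = \frac{\sigma}{\sqrt{n}},
\qquad
\beta_{\max}^{\text{block}} = \sqrt{n}\,\beta_{\max}^{\text{unstructured}} .
```

Under this reading, multiplying the sharpness by $\sqrt{n}$ keeps the product of $\beta_{\max}$ and the typical spread of the values roughly constant, so the soft mask is about as sharp in the block case as in the unstructured case.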

Results. Table 1 lists the top-1 validation accuracies achieved by fully dense ResNet-50 and ViT-B/16 models. Table 2 reports validation accuracies for ResNet-50 models at 80%, 90%, 95%, 97.5% and 99% sparsity, and Table 3 reports validation accuracies for ViT at 90% sparsity. In the Appendix, we additionally report measurements of inference-time FLOP costs for our sparse models and the results of experiments with FLOP-sensitive pruning.

For ResNet-50 models, we find that Spartan outperforms all our baselines across all training durations at sparsity levels of 90% and above. In particular, Spartan achieves a mean top-1 accuracy within 1% of fully dense training at 95% sparsity. We observe that additional epochs of sparse training consistently improve the final generalization accuracy; in contrast, validation accuracy peaks at 200 epochs for dense training. This trend persists at 800 training epochs, where we also report Spartan's top-1 accuracy at 99% sparsity (Table 2). For the Top-KAST baseline, we omit results at 99% sparsity due to training instability.

For ViT-B/16, Spartan outperforms Top-KAST for both unstructured and block structured pruning (Table 3). We observe a particularly large improvement over Top-KAST in the block structured case. For unstructured pruning, Spartan achieves comparable accuracy to SViTE [7], but with higher sparsity. In exchange for a reduction in accuracy relative to DeiT-B [37], Spartan reduces both the model storage cost and the FLOP cost of inference.

4.2 Sensitivity and Ablation Analysis


Figure 4: Effect of $\beta$ and dual averaging (left). Top-1 ImageNet-1K validation accuracies for Spartan with and without dual averaging (DA) and with varying $\beta$ at 95% sparsity. Computational overhead (right). Percentage increase in per-iteration wall clock runtime over dense training for Spartan with standard Sinkhorn iteration, Sinkhorn with dual caching, and Sinkhorn with sorting.

Figure 4 (left) shows the effect of ablating the dual averaging step in Spartan (i.e., omitting hard top-$k$ projection in the forward pass) over a range of settings of $\beta$ for 95% sparse ResNet-50 models trained for 100 epochs. The dashed line shows top-1 accuracy with Top-KAST. For Spartan training without dual averaging, we compute a hard top-$k$ mask at the end of epoch 80, and as with standard Spartan training, we fix this mask until the end of training. In the absence of top-$k$ projection, we find that accuracy initially increases with increasing $\beta$; at lower settings of $\beta$, the higher approximation error of soft masking is detrimental to the final accuracy of the model. In contrast, the use of top-$k$ projection in the forward pass mitigates this mismatch between training and inference and improves the final accuracy of the sparse model.

4.3 Computational Overhead

We evaluate the computational overhead of Spartan over standard dense training by measuring wall clock training times for a ResNet-50 model (Figure 4, right). Our benchmark measures runtime on a single NVIDIA A100 GPU over 50 iterations of training with a batch size of 256. We use random inputs and labels to avoid incurring data movement costs. We compare three approaches: standard Sinkhorn iteration, Sinkhorn iteration with the dual variable $\mu$ initialized using its final value from the previous iteration, and Sinkhorn iteration with $\mu$ initialized using the value at index $i_k$, where the index $i_k$ of the $k$th largest entry of the value vector is computed by sorting the vector in each iteration.

Standard Sinkhorn iteration incurs higher overheads as $\beta$ increases: this is due to the additional iterations required to reach convergence as the regularized OT problem more closely approximates the original LP. We find that dual caching and sorting both prevent this growth in runtime over the range of $\beta$ values that we tested. In our remaining experiments, we use the simple sorting-based approach since it corresponds to the lowest relative overhead over standard dense training (approximately 5%). We note that this relative overhead decreases as the batch size increases since the cost of computing the mask is independent of the batch size.

5 Discussion

As an alternative to magnitude-based pruning, we may also apply Spartan in conjunction with learned value parameters, as in methods such as Movement Pruning [34]. In this approach, we would compute masks using a set of auxiliary parameters $\phi$ instead of the magnitudes $|\theta|$: $\theta \odot \sigma_k^\beta(\phi)$, and similarly for hard projection. We remark that while this variant requires additional memory during training to store the value parameters, there is no additional cost during inference since the sparsity mask is fixed at the end of training.

Limitations. Since Spartan retains a dense parameter vector and computes dense backward passes during training, it incurs higher memory and computational costs in each iteration than sparse-to-sparse methods like RigL [15]. Nevertheless, we note that in terms of total computational or energy cost over a full training run, Spartan may remain a competitive option as it requires fewer iterations of training to reach a given accuracy threshold relative to sparse-to-sparse methods. However, we do not compare total training FLOP costs in our empirical evaluation.

A further limitation is that cost-sensitive pruning with Spartan is only compatible with relatively crude linear cost models. This restriction arises due to the requirements of the regularized OT formulation used to compute the soft top-$k$ mask. In particular, this limitation precludes the use of cost models involving interaction terms such as those arising from spatially coherent sparsity patterns.

Societal Impacts. Inference with deep neural network models is a computationally intensive process. At present, the total energy footprint associated with serving these models in production is expected to continue growing in tandem with the rising prevalence of large transformer-based architectures in vision and NLP applications. Research towards improving the energy efficiency of deep neural networks is therefore an important counterbalancing force against increasing resource usage by these models. The development of sparse learning algorithms is particularly relevant to these efforts, and we expect that the impact of these approaches will further increase as sparsity-aware hardware acceleration becomes more widely available.

6 Conclusions & Future Work

In this work, we describe a sparse learning algorithm that interpolates between two parameter update schemes: standard stochastic gradient updates with hard masking and the dual averaging method. We show that there exists an intermediate regime between these two methods that yields improved generalization accuracy for sparse convolutional and transformer models, particularly at higher levels of sparsity. While we have demonstrated promising empirical results with our proposed method, the learning dynamics of stochastic optimization for deep networks under sparsity constraints remain relatively poorly understood from a theoretical standpoint. There thus exists ample opportunity for further work towards better understanding sparse learning algorithms, which may in turn inspire future algorithmic advances in this area.


  • [1] K. Azarian, Y. Bhalgat, J. Lee, and T. Blankevoort (2020) Learned threshold pruning. arXiv preprint arXiv:2003.00075. Cited by: §2.
  • [2] Y. Bai, Y. Wang, and E. Liberty (2018) ProxQuant: Quantized Neural Networks via Proximal Operators. In International Conference on Learning Representations, Cited by: §2.
  • [3] G. Bellec, D. Kappel, W. Maass, and R. Legenstein (2018) Deep Rewiring: Training very sparse deep networks. In International Conference on Learning Representations, Cited by: §1.
  • [4] J. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré (2015) Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing 37 (2). Cited by: §3.1.
  • [5] Y. Bengio, N. Léonard, and A. Courville (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432. Cited by: §2.
  • [6] A. Chavan, Z. Shen, Z. Liu, Z. Liu, K. Cheng, and E. Xing (2022) Vision transformer slimming: multi-dimension searching in continuous optimization space. arXiv preprint arXiv:2201.00814. Cited by: §2.
  • [7] T. Chen, Y. Cheng, Z. Gan, L. Yuan, L. Zhang, and Z. Wang (2021) Chasing sparsity in vision transformers: an end-to-end exploration. In Advances in Neural Information Processing Systems, Cited by: §2, §4.1.
  • [8] M. D. Collins and P. Kohli (2014) Memory bounded deep convolutional networks. arXiv preprint arXiv:1412.1442. Cited by: §2.
  • [9] M. Courbariaux, Y. Bengio, and J. David (2015) BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, Cited by: §2.
  • [10] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le (2020) Randaugment: practical automated data augmentation with a reduced search space. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Cited by: §4.1.
  • [11] M. Cuturi, O. Teboul, and J. Vert (2019) Differentiable ranking and sorting using optimal transport. In Advances in Neural Information Processing Systems, Cited by: §3.1.
  • [12] M. Cuturi (2013) Sinkhorn distances: lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, Cited by: §1, §3.1.
  • [13] T. Dettmers and L. Zettlemoyer (2019) Sparse networks from scratch: faster training without losing performance. arXiv preprint arXiv:1907.04840. Cited by: §1.
  • [14] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2021) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations, Cited by: §1, §4.
  • [15] U. Evci, T. Gale, J. Menick, P. S. Castro, and E. Elsen (2020) Rigging the lottery: making all tickets winners. In International Conference on Machine Learning, Cited by: §1, §3.1, §4.1, Table 2, §5.
  • [16] T. Gale, E. Elsen, and S. Hooker (2019) The state of sparsity in deep neural networks. arXiv e-prints arXiv:1902.09574. Cited by: §2.
  • [17] S. Gray, A. Radford, and D. P. Kingma (2017) GPU kernels for block-sparse weights. arXiv preprint arXiv:1711.09224. Cited by: §1.
  • [18] S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. Advances in Neural Information Processing Systems. Cited by: §2.
  • [19] B. Hassibi and D. Stork (1992) Second order derivatives for network pruning: optimal brain surgeon. Advances in Neural Information Processing Systems. Cited by: §2.
  • [20] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §4.
  • [21] T. Hoefler, D. Alistarh, T. Ben-Nun, N. Dryden, and A. Peste (2021) Sparsity in deep learning: pruning and growth for efficient inference and training in neural networks. Journal of Machine Learning Research. Cited by: §2.
  • [22] S. A. Janowsky (1989) Pruning versus clipping in neural networks. Physical Review A. Cited by: §2.
  • [23] S. Jayakumar, R. Pascanu, J. Rae, S. Osindero, and E. Elsen (2020) Top-KAST: Top-K Always Sparse Training. Advances in Neural Information Processing Systems. Cited by: 1st item, §1, §2, §3, §4.1.
  • [24] A. Kusupati, V. Ramanujan, R. Somani, M. Wortsman, P. Jain, S. Kakade, and A. Farhadi (2020) Soft threshold weight reparameterization for learnable sparsity. In International Conference on Machine Learning, Cited by: §4.1, Table 2.
  • [25] F. Lagunas, E. Charlaix, V. Sanh, and A. M. Rush (2021) Block pruning for faster transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Cited by: §2.
  • [26] Y. LeCun, J. Denker, and S. Solla (1989) Optimal brain damage. Advances in Neural Information Processing Systems 2. Cited by: §2.
  • [27] N. Lee, T. Ajanthan, and P. Torr (2018) SNIP: Single-shot Network Pruning based on Connection Sensitivity. In International Conference on Learning Representations, Cited by: §2.
  • [28] C. Louizos, M. Welling, and D. Kingma (2018) Learning sparse neural networks through regularization. In International Conference on Learning Representations, Cited by: §2.
  • [29] J. Luo and J. Wu (2020) AutoPruner: an end-to-end trainable filter pruning method for efficient deep model inference. Pattern Recognition. Cited by: §2.
  • [30] D. C. Mocanu, E. Mocanu, P. Stone, P. H. Nguyen, M. Gibescu, and A. Liotta (2018) Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications. Cited by: §1.
  • [31] H. Mostafa and X. Wang (2019) Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In International Conference on Machine Learning, Cited by: §1.
  • [32] M. C. Mozer and P. Smolensky (1988) Skeletonization: a technique for trimming the fat from a network via relevance assessment. In Advances in Neural Information Processing Systems, Cited by: §1, §2.
  • [33] Y. Nesterov (2009) Primal-dual subgradient methods for convex problems. Mathematical Programming. Cited by: §2.
  • [34] V. Sanh, T. Wolf, and A. Rush (2020) Movement pruning: adaptive sparsity by fine-tuning. In Advances in Neural Information Processing Systems, Cited by: §5.
  • [35] P. Savarese, H. Silva, and M. Maire (2020) Winning the lottery with continuous sparsification. Advances in Neural Information Processing Systems. Cited by: §2.
  • [36] J. Schwarz, S. Jayakumar, R. Pascanu, P. Latham, and Y. Teh (2021) Powerpropagation: a sparsity inducing weight reparameterisation. In Advances in Neural Information Processing Systems, Cited by: §3.2, §3, §4.1, Table 2.
  • [37] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou (2021) Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, Cited by: §1, §4.1.
  • [38] Z. Wang, J. Wohlwend, and T. Lei (2020) Structured pruning of large language models. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §2.
  • [39] M. Xia, Z. Zhong, and D. Chen (2022) Structured pruning learns compact and accurate models. arXiv preprint arXiv:2204.00408. Cited by: §2.
  • [40] L. Xiao (2009) Dual averaging method for regularized stochastic learning and online optimization. Advances in Neural Information Processing Systems. Cited by: §2.
  • [41] Y. Xie, H. Dai, M. Chen, B. Dai, T. Zhao, H. Zha, W. Wei, and T. Pfister (2020) Differentiable top-k with optimal transport. In Advances in Neural Information Processing Systems, Cited by: §1, §3.1, §3.1, §3.1, §3.1.
  • [42] D. Yu, F. Seide, G. Li, and L. Deng (2012) Exploiting sparseness in deep neural networks for large vocabulary speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: §2.
  • [43] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo (2019) Cutmix: regularization strategy to train strong classifiers with localizable features. In IEEE/CVF International Conference on Computer Vision, Cited by: §4.1.
  • [44] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2018) Mixup: beyond empirical risk minimization. In International Conference on Learning Representations, Cited by: §4.1.
  • [45] Y. Zhang, M. Lin, M. Chen, Z. Xu, F. Chao, Y. Shen, K. Li, Y. Wu, and R. Ji (2022) Optimizing gradient-driven criteria in network sparsity: gradient is all you need. arXiv preprint arXiv:2201.12826. Cited by: §4.1, Table 2.
  • [46] X. Zhou, W. Zhang, H. Xu, and T. Zhang (2021) Effective sparsification of neural networks with global sparsity constraint. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §4.1, Table 2.
  • [47] M. Zhu and S. Gupta (2018) To Prune, or Not to Prune: Exploring the Efficacy of Pruning for Model Compression. In International Conference on Learning Representations, Cited by: 1st item, §1, §2, §2, §4.1.

Appendix A Derivation of the Optimal Transportation LP

Here, we show how the original top-k LP with costs can be straightforwardly rewritten in the form of an optimal transportation problem. For a given value vector , cost vector and budget , the top-k LP is:

subject to

Define the variables as and substitute to obtain:

subject to

Now eliminate the upper bound constraint by introducing additional variables to give:

subject to

which can be recognized as an optimal transportation problem in the variables and .
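
The three steps of this derivation can be sketched in symbols as follows. The notation is an assumption chosen for illustration and is not fixed by the text above: $v \in \mathbb{R}^d$ denotes the values, $c \in \mathbb{R}^d$ with $c_i > 0$ the costs, $k$ the budget (with $0 \le k \le \mathbf{1}^\top c$), $m$ the mask, and the budget constraint is taken to hold with equality.

```latex
% Step 1: top-k LP with costs (assumed notation).
\begin{aligned}
\max_{m \in \mathbb{R}^d} \quad & v^\top m \\
\text{subject to} \quad & 0 \preceq m \preceq \mathbf{1}, \qquad c^\top m = k.
\end{aligned}

% Step 2: substitute y_i = c_i m_i and write z_i = v_i / c_i.
\begin{aligned}
\max_{y \in \mathbb{R}^d} \quad & z^\top y \\
\text{subject to} \quad & 0 \preceq y \preceq c, \qquad \mathbf{1}^\top y = k.
\end{aligned}

% Step 3: remove the upper bound with slack variables y' = c - y.
\begin{aligned}
\max_{y,\, y' \in \mathbb{R}^d} \quad & z^\top y \\
\text{subject to} \quad & y + y' = c, \qquad
\mathbf{1}^\top y = k, \qquad
\mathbf{1}^\top y' = \mathbf{1}^\top c - k, \qquad
y \succeq 0, \; y' \succeq 0.
\end{aligned}
```

Read this way, the matrix $[\,y \;\; y'\,]$ is a $d \times 2$ transport plan whose row sums are $c$ and whose column sums are $(k,\; \mathbf{1}^\top c - k)$, with transport cost $-z_i$ on the first column and $0$ on the second.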

Appendix B Training Details

In Tables 4 and 5, we detail the hyperparameters used for our training runs.

Hyperparameter Value
optimizer Nesterov accelerated gradient method ()
max. learning rate
min. learning rate
learning rate warmup epochs
learning rate decay schedule cosine
batch size
weight decay ( for bias and normalization parameters)
label smoothing
data augmentation random crops, random horizontal flips
input resolution
sparsity annealing schedule linear from to target sparsity at epoch fraction
annealing schedule linear from to at epoch fraction
Sinkhorn max. iterations
Sinkhorn tolerance
Table 4: ResNet-50 training hyperparameters

Hyperparameter Value
optimizer AdamW ()
max. learning rate
min. learning rate
learning rate warmup epochs of total epochs
learning rate decay schedule cosine
batch size
weight decay ( for bias and normalization parameters)
label smoothing
data augmentation random crops, random horizontal flips, RandAugment (ops , magnitude )
gradient norm clip
input resolution
exponential moving averaging false
sparsity annealing schedule linear from to target sparsity at epoch fraction
annealing schedule linear from to at epoch fraction
Sinkhorn max. iterations
Sinkhorn tolerance
Table 5: ViT-B/16 training hyperparameters
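
The schedules named in Tables 4 and 5 (cosine learning rate decay with warmup, and linear annealing that reaches its final value at a given fraction of training) can be sketched as below. This is a minimal illustration, not the training code: the function names, the warmup starting from zero, and every numeric value in the usage lines are assumptions, since the tables' values are not reproduced here. `beta` is an assumed name for the sharpness parameter of the soft top-k approximation.

```python
import math


def linear_anneal(step_frac, start, end, end_frac):
    """Linearly interpolate from `start` to `end`, reaching `end` once
    `step_frac` (fraction of training completed) hits `end_frac`."""
    if end_frac <= 0:
        return end
    t = min(max(step_frac / end_frac, 0.0), 1.0)
    return start + t * (end - start)


def cosine_lr(step_frac, max_lr, min_lr, warmup_frac):
    """Linear warmup from 0 to `max_lr` (assumed), then cosine decay to `min_lr`.
    Requires 0 <= warmup_frac < 1."""
    if warmup_frac > 0 and step_frac < warmup_frac:
        return max_lr * step_frac / warmup_frac
    t = (step_frac - warmup_frac) / (1.0 - warmup_frac)
    t = min(max(t, 0.0), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * t))


# Hypothetical values, for illustration only.
schedule = [
    {
        "frac": f,
        "lr": cosine_lr(f, max_lr=1.0, min_lr=0.0, warmup_frac=0.05),
        "sparsity": linear_anneal(f, start=0.5, end=0.95, end_frac=0.2),
        "beta": linear_anneal(f, start=1.0, end=10.0, end_frac=0.8),
    }
    for f in (0.0, 0.25, 0.5, 0.75, 1.0)
]
```

Expressing progress as a fraction of training keeps the same schedule definitions usable across runs with different epoch counts.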

Appendix C FLOP Measurements

Due to differences in the computational cost associated with individual parameters, the sparsity fraction does not map 1-to-1 to the fraction of FLOPs required for inference. Tables 6 and 7 give FLOP costs for our sparse models as a percentage of the FLOP cost of the corresponding dense model. We performed our FLOP measurements using the open source tool available at

We count multiply and add operations as one FLOP each. There is some inconsistency in the literature regarding this convention, with some prior work using multiply-accumulate (MAC) counts and FLOP counts interchangeably. To convert the base FLOP counts listed below for ResNet-50 and ViT-B/16 to MACs, we can simply divide the given counts by 2.
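
The counting convention can be made concrete with a small sketch. The helper names are hypothetical, and the convolution formula is the standard dense count that ignores biases, grouping, and non-convolutional operations; it is not the measurement tool referred to above. The two dense model costs are the ones stated in the captions of Tables 6 and 7.

```python
def conv2d_macs(c_in, c_out, kernel_h, kernel_w, out_h, out_w, density=1.0):
    """Multiply-accumulate count of a 2D convolution.

    Each weight is used once per output spatial position; `density` is the
    fraction of weights that are nonzero (1.0 for a dense layer).
    """
    return density * c_in * c_out * kernel_h * kernel_w * out_h * out_w


def macs_to_flops(macs):
    """One MAC = one multiply + one add = 2 FLOPs under this convention."""
    return 2 * macs


def flops_to_macs(flops):
    return flops / 2


# Dense model costs as given in the table captions, in GFLOPs.
RESNET50_DENSE_GFLOPS = 8.24
VIT_B16_DENSE_GFLOPS = 35.19

resnet50_gmacs = flops_to_macs(RESNET50_DENSE_GFLOPS)
vit_b16_gmacs = flops_to_macs(VIT_B16_DENSE_GFLOPS)
```

When comparing against prior work, it is worth checking which of the two conventions a reported number follows, since the same model can appear with counts that differ by exactly this factor of 2.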

In Table 6, the ResNet-50 FLOP counts for Top-KAST are slightly higher than those for Spartan. This is due to the exclusion of the input convolutional layer from pruning in the case of Top-KAST.

Method Epochs 80% 90% 95% 97.5% 99%
Top-KAST 0.75 0.75 0.75 0.75 -
0.75 0.75 0.75 0.75 -
0.75 0.75 0.75 0.75 -
Spartan 0.75 0.75 0.75 0.75 0.75
0.75 0.75 0.75 0.75 0.75
0.75 0.75 0.75 0.75 0.75
Table 6: Percentage FLOP costs of sparse ResNet-50 models relative to the FLOP cost of a dense ResNet-50. The cost of a dense ResNet-50 model is 8.24 GFLOPs.

Sparsity Structure
Method Epochs Unstructured blocks blocks
Top-KAST 0.75 0.75 0.75
0.75 0.75 0.75
Spartan 0.75 0.75 0.75
0.75 0.75 0.75
Table 7: Percentage FLOP costs of 90% sparse ViT-B/16 models at input resolution relative to the FLOP cost of a dense ViT-B/16 model. The cost of a dense ViT-B/16 model is 35.19 GFLOPs.

Appendix D FLOP-Sensitive Pruning

We demonstrate FLOP-sensitive pruning with Spartan on ResNet-50 using the following cost model: assign a cost of 1 to each parameter of a fully connected layer, and a cost of to each parameter of a convolutional layer where the output has size along its spatial dimensions. We evaluate two valuation functions: and . The first results in the same pruning order as in standard pruning, but with a FLOP budget constraint instead of the usual sparsity budget. The second assigns a lower value to the parameters of convolutional layers, and results in networks where the parameters of convolutional layers are preferentially pruned. We use for and for to compensate for the relatively smaller scale of the normalized values in the soft top-k forward pass (Algorithm 4).

Table 8 gives the top-1 accuracy, FLOP percentage, and sparsity percentage for each of these valuation functions. Spartan yields models with identical FLOP percentages of , which is slightly higher than the budgeted value of ; this discrepancy is due to the additional cost of the normalization layers and activation functions in the network. Most notably, there is a substantial difference in the sparsity percentages realized by these valuation functions. As expected, the second valuation function preferentially sparsifies the parameters of convolutional layers and yields denser fully connected layers, resulting in lower sparsity overall.

Accuracy % FLOP % Sparsity %
Table 8: Results of FLOP-sensitive pruning experiments on ImageNet-1K with ResNet-50 models.
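
A cost model of this kind, together with a hard budgeted selection, can be sketched as follows. Several things here are assumptions made for illustration: the per-parameter cost of a convolutional layer is taken to be the number of output spatial positions (height times width), the layer description format and function names are hypothetical, and the greedy value-per-cost rule is a hard, unregularized stand-in for the soft top-k forward pass rather than the procedure of Algorithm 4.

```python
import numpy as np


def per_parameter_costs(layers):
    """Build a flat cost vector from a list of layer descriptions.

    Each layer is a dict with "kind" ("conv" or "fc") and "num_params";
    conv layers also carry "out_hw", the output (height, width).
    """
    costs = []
    for layer in layers:
        if layer["kind"] == "conv":
            out_h, out_w = layer["out_hw"]
            unit = out_h * out_w  # assumption: one use per output position
        else:
            unit = 1  # fully connected: cost 1 per parameter, as in the text
        costs.append(np.full(layer["num_params"], unit, dtype=np.float64))
    return np.concatenate(costs)


def budgeted_mask(values, costs, budget):
    """Greedy 0/1 mask: keep parameters in decreasing value-per-cost order
    for as long as the running total cost stays within `budget`."""
    values = np.asarray(values, dtype=np.float64)
    costs = np.asarray(costs, dtype=np.float64)
    order = np.argsort(-values / costs, kind="stable")
    running_cost = np.cumsum(costs[order])
    keep = order[running_cost <= budget]
    mask = np.zeros_like(values)
    mask[keep] = 1.0
    return mask


def cost_fraction(mask, costs):
    """Fraction of the total cost retained by `mask`."""
    return float(mask @ costs / costs.sum())
```

Under this greedy rule the retained set depends on the ratio of value to cost, so two valuation functions can meet the same cost budget while keeping very different numbers of parameters, which is the kind of gap between FLOP percentage and sparsity percentage discussed above.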

Appendix E Additional Experiments

In Table 9, we compare Spartan against two additional variants of the Top-KAST baseline: Top-KAST with the Erdős-Rényi-Kernel (ERK) sparsity distribution, and Top-KAST with pruning applied to the parameters of all convolutional and fully connected layers with the exception of bias terms (prune all). Top-KAST (excl. input conv.) denotes the Top-KAST variant used in the experiments presented in the main text, where we exclude the input convolutional layer from pruning. We find that there is some small variation in the measured top-1 validation accuracies, but our conclusion that Spartan improves generalization at higher levels of sparsity is unchanged.

Method Epochs 90% 95%
Top-KAST (prune all)
Top-KAST (excl. input conv.) 0.75 0.75
0.75 0.75
0.75 0.75
Spartan 0.75 0.75
0.75 0.75
0.75 0.75
Table 9: Comparison between Spartan and additional variants of the Top-KAST baseline on ImageNet-1K with ResNet-50 models.

Appendix F Learned Sparsity Patterns

We observe a qualitative difference in the distribution of per-layer sparsities between ViT-B/16 models trained with unstructured sparsity and those trained with block structured sparsity (Figure 5). In particular, the output projections of self-attention layers under block structured pruning are significantly more dense in the later blocks of the network relative to unstructured pruning. The reasons for this difference are not immediately clear to us, and we leave further investigation of this phenomenon to future work.


Figure 5: Per-layer sparsities of ViT-B/16 models trained with Spartan.

Block structured pruning produces coherent sparsity patterns in ViT-B/16 models. In Figure 6, we visualize the magnitudes of the weight matrices corresponding to the input projection of each self-attention layer in a ViT-B/16 model trained with Spartan using block structured pruning. This matrix maps vectors of dimension to query, key, and value embedding vectors, each of dimension . We observe that the training process yields similar sparsity patterns in the query and key embedding submatrices, which correspond to the left and center panels in the visualization for each layer. This is an intuitively reasonable property since the self-attention layer computes inner products of the query and key embeddings in order to construct attention maps. We note that this symmetry emerges purely as a result of the optimization process; we did not incorporate any prior knowledge into Spartan regarding the role of particular entries of the weight matrices subject to sparsification.
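
One way to quantify the similarity between the query and key sparsity patterns is to compare which blocks are nonzero in each submatrix. The sketch below is an illustration under stated assumptions, not the analysis used to produce the figure: the function names are hypothetical, the input projection is assumed to be stored with shape (d_model, 3 * d_model) and columns ordered as query, key, value (matching the left, center, right panel layout described above), and the block size is a free parameter. Frameworks that store the fused projection transposed would need a transpose first.

```python
import numpy as np


def block_mask(weight, block):
    """Boolean grid marking which (block x block) tiles hold any nonzero weight."""
    rows, cols = weight.shape
    assert rows % block == 0 and cols % block == 0, "shape must tile evenly"
    tiles = weight.reshape(rows // block, block, cols // block, block)
    return np.abs(tiles).sum(axis=(1, 3)) > 0


def qk_pattern_overlap(w_in, block):
    """Jaccard overlap between the nonzero-block patterns of the query and
    key submatrices of a fused input projection `w_in`."""
    d_model = w_in.shape[0]
    assert w_in.shape[1] == 3 * d_model, "expected columns [query | key | value]"
    query = w_in[:, :d_model]
    key = w_in[:, d_model:2 * d_model]
    query_blocks = block_mask(query, block)
    key_blocks = block_mask(key, block)
    union = np.logical_or(query_blocks, key_blocks).sum()
    if union == 0:
        return 1.0  # both submatrices are entirely zero
    return float(np.logical_and(query_blocks, key_blocks).sum() / union)
```

An overlap near 1 would indicate that the query and key submatrices retain the same blocks, while a value near the overlap expected from two independent random masks at the same density would indicate no shared structure.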


Figure 6: Weight magnitudes of the input projection matrices of each self-attention layer in a block sparse ViT-B/16 network trained using Spartan.