1 Introduction
Deep neural networks (DNNs) have shown promising performance on various tasks, including computer vision, natural language processing, and speech recognition. However, a DNN usually comes with a large number of learnable parameters, ranging from millions to even billions (e.g., GPT-3 (Brown et al., 2020)), which makes the model burdensome and difficult to deploy in real-world applications. Researchers have therefore investigated how to speed up and compress DNNs via methods such as knowledge distillation (Hinton et al., 2015), quantization (Jacob et al., 2018; Zhou et al., 2017), efficient model architecture design (Howard et al., 2017), and structured sparsity (Wen et al., 2016; Li et al., 2016).

In this paper, we focus on the problem of sparsifying DNNs. Sparsity in DNNs can be categorized into unstructured sparsity and structured sparsity. Unstructured sparsity prunes individual weights at any location; it is fine-grained and can achieve extremely high compression ratios (Han et al., 2015; Guo et al., 2016). However, unstructured sparsity struggles to take advantage of vector-processing architectures such as SIMD and utilizes memory buses poorly, which increases latency due to dependent sequences of reads (Nvidia, 2020). Compared with unstructured sparsity, structured sparsity is friendlier to hardware, especially block pruning (Wang et al., 2019), kernel shape sparsity (Tan et al., 2020), and channel and filter pruning (Li et al., 2016; Wen et al., 2016). Although structured sparsity can speed up DNNs on commodity hardware, it hurts model performance more than unstructured fine-grained sparsity does. For example, a ResNet50 obtained by unstructured pruning can be compressed considerably further at the accuracy of the original network than one obtained with structured sparsity (Renda et al., 2020). How to combine unstructured and structured sparsity to accelerate DNNs on modern hardware (e.g., GPUs) is therefore a challenging yet valuable problem.

The recently released NVIDIA Ampere A100 is equipped with Sparse Tensor Cores that accelerate 2:4 structured fine-grained sparsity. Here, N:M sparsity denotes a sparsity pattern in which only N weights are non-zero in every group of M consecutive weights. To the best of our knowledge, A100 is the first commodity sparse hardware, and its Sparse Tensor Cores support several common operations, including linear layers, convolutional layers, recurrent cells, and transformer blocks. Specifically, consider a typical matrix multiplication $X \times W$ in DNNs, where $X$ and $W$ denote the input tensor and the parameter tensor respectively. The Dense Tensor Cores implement this matrix multiplication in 2 cycles, while the Sparse Tensor Cores need only 1 cycle if the parameter tensor satisfies the 2:4 structured sparse pattern.

Nvidia has proposed ASP (APEX's Automatic Sparsity, https://github.com/NVIDIA/apex/tree/master/apex/contrib/sparsity) (Nvidia, 2020) to sparsify a dense neural network so that it satisfies the 2:4 fine-grained structured sparsity requirement. The recipe contains three steps: (1) training a dense network until convergence; (2) pruning for 2:4 sparsity with magnitude-based single-shot pruning; (3) repeating the original training procedure. However, ASP is computationally expensive, since it requires training the full dense model from scratch and then fine-tuning it again. We therefore still lack a simple recipe for obtaining a structured sparse DNN whose accuracy is consistent with the dense network without extra fine-tuning.
This paper addresses the following question: can we design a simple yet universal recipe to learn N:M sparse neural networks from scratch in an efficient way?
It is difficult to find the optimal sparse architecture (connections) and the optimal parameters simultaneously when training sparse CNNs and Transformers (Evci et al., 2019b), although SET-MLP can easily outperform a dense MLP (Bourgin et al., 2019). There are two schemes for obtaining such sparse models. One is a two-stage scheme, which discovers a sparse neural architecture by pruning a well-trained dense network and then uses the same or even greater computational resources to retrain the sparse model (Nvidia, 2020; Evci et al., 2019b; Han et al., 2015; Frankle and Carbin, 2018). The other is a one-stage scheme, which uses a dynamic method to alternately optimize parameters and prune the network architecture based on different criteria (Bellec et al., 2017; Mocanu et al., 2018; Mostafa and Wang, 2019; Evci et al., 2019b; Kusupati et al., 2020; Dettmers and Zettlemoyer, 2019). Compared with the two-stage scheme, the one-stage scheme saves training time and cost but usually obtains lower performance.
To overcome this trade-off between training cost and performance, we present a simple yet effective framework for training sparse neural networks from scratch. Specifically, we employ magnitude-based pruning (Renda et al., 2020; Gale et al., 2019) during the forward pass. Because the pruning operation is non-differentiable (a dilemma similar to the one in model quantization (Courbariaux et al., 2016)), we extend the Straight-through Estimator (STE) (Bengio et al., 2013), which is widely used in model quantization, to support back-propagation through sparse neural networks. However, STE introduces perturbations during back-propagation (Yin et al., 2019; Bengio et al., 2013). Hence we define Sparse Architecture Divergence (SAD) to analyze N:M sparse networks trained by STE methods, so that we can identify the impact of these perturbations on sparse neural network training. Based on the SAD analysis, we propose a sparse-refined term that mitigates the influence of the approximated gradients.
We also compare the performance of neural networks with different granularities of fine-grained structured sparsity (i.e., 1:4, 2:4, 2:8, 4:8) and conduct thorough experiments on several typical deep neural networks with different N:M sparsity levels, covering image classification, detection, segmentation, optical flow estimation, and machine translation. Experimental results show that models with our proposed structured sparsity incur a negligible performance drop and can sometimes even outperform the dense model.
The main contributions of this paper are threefold. (1) To the best of our knowledge, this is the first systematic study of training N:M structured sparse neural networks from scratch without a performance drop. N:M structured sparsity is a missing yet promising ingredient in model acceleration and can be a valuable complement to various compression methods. (2) We extend STE to tackle the problem of training N:M sparse neural networks. To alleviate the limitations of STE when sparsifying networks, we propose a sparse-refined term that makes training sparse neural networks from scratch more effective. (3) We conduct extensive experiments on various tasks with N:M fine-grained sparse networks and provide benchmarks for N:M sparse network training to facilitate the co-development of related software and hardware design.
2 Related Work
Unstructured and Structured Sparsity.
Sparsity of DNNs is a promising direction for compressing and accelerating deep learning models. Among all sparsity types, unstructured sparsity can achieve significantly high compression ratios (e.g., 13× (Han et al., 2015) and 108× (Guo et al., 2016)) while maintaining decent accuracy. Many pruning criteria and pruning methods have been proposed for unstructured sparsity, e.g., magnitude-based pruning (Han et al., 2015; Frankle and Carbin, 2018), Hessian-based heuristics (LeCun et al., 1990), and pruning with connection sensitivity (Lee et al., 2018). However, the ability of unstructured sparsity to accelerate inference is highly limited, since storing the irregular non-zero index matrix incurs a large overhead. On the other hand, Wen et al. (2016) introduce structural sparsity to speed up deep models on CPUs and GPUs. Existing structural sparsity includes filter-wise sparsity (Li et al., 2016), channel-wise sparsity (Li et al., 2016), filter-shape-wise sparsity, and depth-wise sparsity. Different from these existing sparsity patterns (fine-grained unstructured sparsity and coarse-grained structured sparsity), this paper presents N:M fine-grained structured sparsity, a sparsity type that offers both high efficiency and lossless performance.

One-stage and two-stage methods.
There are mainly two types of techniques for obtaining a sparse neural network: one-stage methods and two-stage methods. A two-stage method first prunes a trained dense neural network and then retrains the fixed sparse network to recover its performance. Typical two-stage methods include single-shot pruning (Lee et al., 2018) and iterative pruning (Han et al., 2015; Guo et al., 2016). Later, the lottery ticket hypothesis (Frankle and Carbin, 2018) showed that sparse sub-networks (winning tickets) can be trained from scratch with the same initialization, although the winning tickets themselves are discovered by dense training. DeepRewiring (Bellec et al., 2017), on the other hand, is a typical one-stage method. It takes a Bayesian perspective and samples sparse network connections from a posterior, but it is computationally expensive and hard to apply to large-scale tasks. Sparse Evolutionary Training (SET) (Mocanu et al., 2018) is a simpler scheme in which weights are pruned according to the standard magnitude criterion and new connections are grown at random locations.
Dettmers and Zettlemoyer (2019) use the momentum of each parameter as the criterion for growing weights and obtain an improvement in test accuracy. GMP (Gale et al., 2019) trains unstructured sparse networks from scratch using variational dropout and regularization, and shows that unstructured sparse architectures learned through pruning cannot be trained from scratch to the same test performance as dense models. The recently proposed state-of-the-art method STR (Kusupati et al., 2020) introduces learnable pruning thresholds to obtain a non-uniform sparse network. RigL (Evci et al., 2019a) uses a magnitude-based method to prune connections and periodic dense gradients to regrow them. However, to reach the same performance as a dense network trained from scratch, RigL needs 5× more training time. The work most closely related to ours is probably DNW (Wortsman et al., 2019), which uses a fully dense gradient in the backward pass to discover the optimal wiring on the fly.
3 Method
3.1 N:M fine-grained structured sparsity
Here we define the problem of training a neural network with N:M fine-grained structured sparsity. A neural network has N:M sparsity if, in each group of M consecutive weights of the network, at most N weights have non-zero values. Fig. 1 illustrates a 2:4 sparse network.
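The definition above can be checked mechanically. The following is a minimal sketch, not the authors' code: the function name `satisfies_nm` is hypothetical, and it assumes groups are formed over the flattened tensor in row-major order, whereas real hardware groups weights along a specific dimension.

```python
import numpy as np

def satisfies_nm(weights, n, m):
    """Return True if every group of m consecutive weights has at most n non-zeros.

    Assumption: groups are taken over the flattened array in row-major order.
    """
    flat = np.asarray(weights).reshape(-1)
    if flat.size % m != 0:
        raise ValueError("number of weights must be divisible by m")
    groups = flat.reshape(-1, m)                      # one row per group of m weights
    nonzeros_per_group = np.count_nonzero(groups, axis=1)
    return bool((nonzeros_per_group <= n).all())

# Two groups of four weights each; both groups have exactly two non-zeros.
w_sparse = np.array([0.0, 1.5, 0.0, -0.3, 0.7, 0.0, 0.0, 0.2])
# The first group here has four non-zeros, so the 2:4 constraint is violated.
w_dense = np.array([0.1, 1.5, 0.4, -0.3, 0.7, 0.0, 0.0, 0.2])

is_sparse_ok = satisfies_nm(w_sparse, 2, 4)
is_dense_ok = satisfies_nm(w_dense, 2, 4)
```

Note that the constraint is an upper bound: a group with fewer than N non-zeros still satisfies N:M sparsity.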
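Generally, our objective is to train an N:M sparse neural network as

$$\min_{S(\mathcal{W},N,M)} \mathcal{L}(\mathcal{W};\mathcal{D}), \qquad (1)$$

where $\mathcal{D}$ denotes the observed data, $\mathcal{L}$ represents the loss function, $\mathcal{W}=\{\mathcal{W}_l : 0 < l \le L\}$ indicates the parameters of an $L$-layer neural network, and $S(\mathcal{W},N,M)$ is the N:M sparse neural network parameters.

3.2 Straight-through Estimator (STE) on training N:M sparse networks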
A straightforward solution for training an N:M sparse network is to extend the Straight-through Estimator (STE) (Bengio et al., 2013) to perform online magnitude-based pruning and sparse parameter updating, as depicted in Fig. 2(a). STE is widely used in model quantization (Rastegari et al., 2016), because the quantization function is non-differentiable and networks optimized with STE achieve decent performance under careful settings (Yin et al., 2019). In STE, a dense network is maintained during the training process. During the forward pass, we project the dense weights $\mathcal{W}$ into sparse weights $\widetilde{\mathcal{W}} = S(\mathcal{W},N,M)$ satisfying N:M sparsity. Let $\mathbf{w} \subset \mathcal{W}$ be a group of $M$ consecutive parameters in $\mathcal{W}$ and $\widetilde{\mathbf{w}} \subset \widetilde{\mathcal{W}}$ be the corresponding group in $\widetilde{\mathcal{W}}$. The projection of $\mathbf{w}$ can be formulated as:

$$\widetilde{w}_i = \begin{cases} w_i & \text{if } |w_i| \ge \xi \\ 0 & \text{if } |w_i| < \xi \end{cases}, \quad i = 1, 2, \dots, M, \qquad (2)$$

where $\xi$ is the $N$-th largest value in $\{|w_1|, |w_2|, \dots, |w_M|\}$. Intuitively, this projection function produces sparse parameters by setting the parameters with the smallest absolute values in each group of consecutive parameters to zero, while keeping the other parameters unchanged. The on-the-fly computation of an N:M sparse sub-network in the forward pass is illustrated in Fig. 1.
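The magnitude-based projection described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: `project_nm` is a hypothetical name, groups are assumed to be taken over the flattened tensor, and ties in magnitude are broken by sort order so that exactly N weights are kept per group (the thresholded form could keep more than N when magnitudes tie).

```python
import numpy as np

def project_nm(weights, n, m):
    """Keep the n largest-magnitude weights in every group of m, zero the rest.

    Returns the sparse weights and the boolean keep-mask (True = not pruned).
    Assumption: the total number of weights is divisible by m.
    """
    w = np.asarray(weights, dtype=float)
    groups = w.reshape(-1, m)
    # Column indices of the (m - n) smallest magnitudes in each group.
    drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]
    keep = np.ones(groups.shape, dtype=bool)
    np.put_along_axis(keep, drop, False, axis=1)
    sparse = np.where(keep, groups, 0.0)
    return sparse.reshape(w.shape), keep.reshape(w.shape)

w = np.array([0.1, -0.8, 0.3, 0.5, -0.2, 0.05, 0.9, -0.4])
w_sparse, mask = project_nm(w, n=2, m=4)
# First group keeps -0.8 and 0.5; second group keeps 0.9 and -0.4.
```

The dense array `w` is left untouched, which matches the STE setting where a dense copy of the weights is maintained throughout training.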
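The projection function $S(\cdot)$, which is non-differentiable during back-propagation, generates the N:M sparse sub-network $\widetilde{\mathcal{W}}$ on the fly. To obtain gradients during back-propagation, STE computes the gradients of the sub-network, $g(\widetilde{\mathcal{W}}) = \nabla_{\widetilde{\mathcal{W}}} \mathcal{L}(\widetilde{\mathcal{W}};\mathcal{D})$, based on the sparse sub-network $\widetilde{\mathcal{W}}$; these gradients are directly back-projected to the dense network as approximated gradients of the dense parameters. The approximated parameter update rule for the dense network (see Fig. 2(a) in Appendix) can be formulated as

$$\mathcal{W}_{t+1} \leftarrow \mathcal{W}_t - \gamma_t\, g(\widetilde{\mathcal{W}}_t), \qquad (3)$$

where $\mathcal{W}_t$ represents the dense parameters at iteration $t$ and $\gamma_t$ indicates the learning rate.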
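To make the STE update concrete, the sketch below applies it to a toy least-squares problem. Everything about the setup is an assumption made for illustration (the linear model, the data, the learning rate, and the helper names `nm_mask` and `ste_step`); only the structure of the update, a sparse forward pass followed by applying the sparse-network gradient to all dense weights, follows the rule stated above.

```python
import numpy as np

def nm_mask(w, n, m):
    """Boolean keep-mask selecting the n largest magnitudes in each group of m."""
    groups = np.abs(w).reshape(-1, m)
    drop = np.argsort(groups, axis=1)[:, : m - n]
    keep = np.ones(groups.shape, dtype=bool)
    np.put_along_axis(keep, drop, False, axis=1)
    return keep.reshape(w.shape)

def ste_step(w_dense, x, y, lr, n=2, m=4):
    """One STE update for the toy loss 0.5 * mean((x @ w_sparse - y) ** 2)."""
    keep = nm_mask(w_dense, n, m)
    w_sparse = np.where(keep, w_dense, 0.0)       # forward pass uses sparse weights
    residual = x @ w_sparse - y
    grad_sparse = x.T @ residual / len(y)         # gradient w.r.t. the sparse weights
    # STE: the sparse-network gradient is applied to every dense weight,
    # including the ones that were pruned in the forward pass.
    return w_dense - lr * grad_sparse

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8))
w_true = np.array([1.0, 0.0, 0.0, -2.0, 0.0, 0.5, 1.5, 0.0])   # toy 2:4 sparse target
y = x @ w_true
w0 = rng.normal(scale=0.1, size=8)

w1 = ste_step(w0, x, y, lr=0.1)                   # a single update
w = w0.copy()
for _ in range(200):                              # a short training loop
    w = ste_step(w, x, y, lr=0.1)
```

Because pruned weights keep receiving gradient updates, their magnitudes can overtake those of kept weights, so the mask returned by `nm_mask` may change from one iteration to the next.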
3.2.1 Analysis of dynamic pruning using STE
To validate the performance of STE on N:M sparse networks, we perform a pilot experiment with STE. The results are shown in Fig. 3(a). As Fig. 3(a) shows, the N:M neural network trained with the above-mentioned STE exhibits a significant performance drop compared with the dense network. We conjecture that this drop results from unstable neural architecture updates caused by the approximated gradients of the dense parameters under the STE-modified chain rule. As Eq. 3 shows, the sparse-network gradient $g(\widetilde{\mathcal{W}}_t)$ is only a rough estimate of the gradients for the dense weights $\mathcal{W}_t$, due to the mismatch between the forward and backward passes. When gradient descent is applied to $\mathcal{W}_t$ with the rough gradients estimated by STE, discrepancies between the accurate gradients and the approximated ones may lead to erroneous parameter updates. These imprecise value updates on the pruned parameters may in turn produce unstable alternations of the architecture of the on-the-fly pruned sparse network in the forward pass, which causes the notable performance drop. To demonstrate the possible relationship between sparse network architecture updates and performance drops, we define SAD (Sparse Architecture Divergence) and measure the network architecture change with this metric.

Before formally defining SAD, we first define the binary parameter mask produced in the magnitude-based pruning process as $\mathcal{E} \in \{0,1\}^d$, where $d$ represents the length of $\mathcal{W}$. Specifically, if the $i$-th parameter of $\mathcal{W}$ survives (is not pruned) in the pruned sub-network $\widetilde{\mathcal{W}}$, we set $\mathcal{E}_i = 1$, and $\mathcal{E}_i = 0$ otherwise. Thus, the sparse sub-network can be represented as $\widetilde{\mathcal{W}} = \mathcal{W} \odot \mathcal{E}$, where $\odot$ represents element-wise multiplication. For convenience, we define $\bar{\mathcal{E}} = \mathbf{1} - \mathcal{E}$.
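For a single training run, we propose Sparse Architecture Divergence (SAD) to measure the change of the binary mask $\mathcal{E}$ from $\mathcal{W}_i$ (the weights after the $i$-th iteration) to $\mathcal{W}_j$ (the weights after the $j$-th iteration). We define $\mathrm{SAD}_{i:j} = \lVert \mathcal{E}_j - \mathcal{E}_i \rVert_1$, where $\mathcal{E}_i$ and $\mathcal{E}_j$ are the binary masks for $\mathcal{W}_i$ and $\mathcal{W}_j$ respectively. This formula counts the connections (weights) that are pruned in $\widetilde{\mathcal{W}}_i$ but not in $\widetilde{\mathcal{W}}_j$, or pruned in $\widetilde{\mathcal{W}}_j$ but not in $\widetilde{\mathcal{W}}_i$. A smaller $\mathrm{SAD}_{i:j}$ indicates less discrepancy between the network architectures of $\widetilde{\mathcal{W}}_i$ and $\widetilde{\mathcal{W}}_j$.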
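Since SAD is the L1 distance between two binary masks, it reduces to counting the positions where the masks disagree. A minimal sketch follows, with hypothetical helper names `nm_mask` and `sad` and toy weight vectors chosen only for illustration:

```python
import numpy as np

def nm_mask(w, n, m):
    """Boolean keep-mask selecting the n largest magnitudes in each group of m."""
    groups = np.abs(np.asarray(w, dtype=float)).reshape(-1, m)
    drop = np.argsort(groups, axis=1)[:, : m - n]
    keep = np.ones(groups.shape, dtype=bool)
    np.put_along_axis(keep, drop, False, axis=1)
    return keep.reshape(np.shape(w))

def sad(mask_i, mask_j):
    """L1 distance between two binary masks: the number of connections whose
    pruned / not-pruned state differs between the two snapshots."""
    return int(np.sum(np.asarray(mask_i) != np.asarray(mask_j)))

# Toy weights at two iterations; only the first weight has changed.
w_i = np.array([0.1, -0.8, 0.3, 0.5])
w_j = np.array([0.6, -0.8, 0.3, 0.5])

mask_i = nm_mask(w_i, 2, 4)    # keeps positions 1 and 3
mask_j = nm_mask(w_j, 2, 4)    # keeps positions 0 and 1
divergence = sad(mask_i, mask_j)   # positions 0 and 3 flipped, so SAD is 2
```

Under a fixed N:M constraint where every group keeps exactly N weights, each newly kept weight displaces another one in the same group, so mask changes come in pairs.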
To reveal the impact of STE on the obtained sparse network architecture, we analyze SAD between the weights after different numbers of training iterations under two training schemes. Our quantitative results are shown in Fig. 3(b). The first scheme applies the forward pass with dense weights and updates the weights by $\mathcal{W}_{t+1} \leftarrow \mathcal{W}_t - \gamma_t\, g(\mathcal{W}_t)$; the corresponding sparse sub-network is then obtained by pruning the trained dense network. The other scheme carries out the forward pass with sparse layers and uses the STE-modified chain rule, $\mathcal{W}_{t+1} \leftarrow \mathcal{W}_t - \gamma_t\, g(\widetilde{\mathcal{W}}_t)$, in the backward pass. We define $D_i$ to be the sparse model pruned from the model trained for $i$ iterations of regular gradient descent, and $S_i$ to be the sparse model pruned from the network trained with $i$ iterations of STE-modified gradient descent. Let $\mathrm{SAD}^l_{0:t}(S)$ denote the SAD between the $l$-th layer of $S_0$ and $S_t$ when trained with the sparse forward pass. Similarly, $\mathrm{SAD}^l_{0:t}(D)$ represents the SAD between the $l$-th layer of $D_0$ and $D_t$ when trained with the dense forward pass. As depicted in Fig. 3(b), for each layer number $l$, $\mathrm{SAD}^l_{0:t}(D) < \mathrm{SAD}^l_{0:t}(S)$ holds both when $t = 1$ and when $t = 10$. Meanwhile, the gap $\mathrm{SAD}^l_{0:t}(S) - \mathrm{SAD}^l_{0:t}(D)$ grows as $t$ increases from 1 to 10. This phenomenon indicates a positive correlation between the poor performance of a sparse neural network and high SAD.
Before we defined SAD, Liu et al. (2020) proposed Neural Network Sparse Topology Distance (NNSTD) to measure topological distances between two sparse neural networks. It is worth noting that NNSTD reorders the neurons in each layer, based on their connections to the previous layer, to maximize the similarity between the topologies of the two compared networks. Hence, NNSTD can only measure general topological differences between two networks and fails to reflect the state transitions of individual connections (pruned or not pruned). When calculating SAD, in contrast, the mask for each connection is computed directly without reordering neurons, so SAD provides a more precise estimate of the actual state changes of network connections (from pruned to not pruned, and vice versa), which is our main concern in this paper. It is also worth noting that SAD has been implicitly adopted in existing published research. In RigL (Evci et al., 2019a), the authors consider the case of $\mathrm{SAD}_{t-1:t} = k$, where $k$ is dynamically calculated for each layer during training. RigL achieves state-of-the-art results on training unstructured sparse networks from scratch. Another example is pruning the network at initialization without updating the connections afterwards, which, according to recent work (Frankle et al., 2020), leads to a significant performance drop. This can be regarded as setting $\mathrm{SAD}_{t-1:t} = 0$ during the whole training phase.

3.3 Sparse-refined Straight-through Estimator (SR-STE) on training N:M sparse networks
Inspired by the above observations from the SAD analysis, we aim to reduce SAD to improve the sparse network's performance. Since the magnitude of a parameter is used as the metric for pruning the network weights, we need to alter the weight updating process in order to prevent high SAD. Two choices are available: (1) restricting the values of the weights pruned in $\widetilde{\mathcal{W}}_t$, or (2) promoting the non-pruned weights in $\widetilde{\mathcal{W}}_t$. It is worth noting that although the gradients calculated by STE are approximations for all parameters, the gradients of the pruned parameters are coarser than those of the non-pruned ones. This is because, for pruned parameters, the values used to compute gradients in $\widetilde{\mathcal{W}}_t$ differ from the values being updated in $\mathcal{W}_t$, whereas for non-pruned parameters the two stages use the same values. These observations make penalizing the weights pruned in $\widetilde{\mathcal{W}}_t$ a natural choice for restraining SAD.

Hence, we propose a sparse-refined term in the STE update formulation and denote the new scheme as SR-STE. Compared with STE, when SR-STE is used in N:M sparse model training, the backward pass is carried out with refined gradients for the pruned weights, as illustrated in Fig. 2(b). The purpose of the regularization term is to decrease the magnitude of the pruned weights, which are determined in the forward pass. Intuitively, we encourage the weights pruned at the current iteration to remain pruned in the following iterations, so that the sparse architecture is stabilized and training becomes more efficient and effective.
Formally, the network parameter update rule changes from Eq. 3 to the following formulation with a sparse-refined regularization term,

$$\mathcal{W}_{t+1} \leftarrow \mathcal{W}_t - \gamma_t \left( g(\widetilde{\mathcal{W}}_t) + \lambda_W \left( \bar{\mathcal{E}}_t \odot \mathcal{W}_t \right) \right), \qquad (4)$$

where $\bar{\mathcal{E}}_t$ denotes the mask for the pruned weights, $\odot$ denotes the Hadamard product, $\lambda_W$ denotes the relative weight of the sparse-refined term, and $\gamma_t$ denotes the learning rate.

When $\lambda_W = 0$, Eq. 4 is equivalent to Eq. 3, the STE update rule. In general, we set $\lambda_W > 0$. The SR-STE term with a positive $\lambda_W$ places a constraint on, and only on, the pruned parameters, to prevent them from (1) being unpruned due to the different levels of mismatch between pruned-parameter gradients and non-pruned-parameter gradients, and (2) ineffectively alternating the pruned network architecture. When fewer sparse connections in the network are alternated, a more stable training process and a higher validation accuracy can be expected, as suggested by the analysis above and confirmed in the following experiments.
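The effect of the sparse-refined term can be seen on a single group of weights. The sketch below is an illustration under stated assumptions, not the authors' implementation: `sr_ste_step` is a hypothetical name, the gradient is set to zero to isolate the regularizer, and the values of `lr` and `lam` are arbitrary and are not the settings used in the experiments.

```python
import numpy as np

def sr_ste_step(w_dense, grad_sparse, keep, lr, lam):
    """One SR-STE update for a single weight tensor.

    keep: boolean mask from the forward pass (True = not pruned).
    lam:  relative weight of the sparse-refined term; lam = 0 recovers plain STE.
    """
    pruned = ~keep                                   # complement mask (pruned weights)
    refine = lam * np.where(pruned, w_dense, 0.0)    # decay applied only to pruned weights
    return w_dense - lr * (grad_sparse + refine)

w = np.array([0.1, -0.8, 0.3, 0.5])
keep = np.array([False, True, False, True])          # 2:4 mask from magnitude pruning
grad = np.zeros_like(w)                              # zero gradient isolates the refinement

w_ste = sr_ste_step(w, grad, keep, lr=0.1, lam=0.0)  # unchanged: [0.1, -0.8, 0.3, 0.5]
w_sr = sr_ste_step(w, grad, keep, lr=0.1, lam=0.5)   # pruned weights shrink by a factor 0.95
```

The refinement acts like weight decay restricted to the pruned positions: kept weights are untouched, while pruned weights are pulled toward zero, which makes it less likely that they re-enter the mask at the next iteration.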
We perform extensive experiments with SR-STE; the results can be found in Fig. 4. The experiments here are conducted with ResNet18 (He et al., 2016) on ImageNet (Deng et al., 2009). In Fig. 4(a), four different settings of $\lambda_W$ are inspected. With a negative $\lambda_W$, the potential negative impact of the coarse gradients of the pruned weights is enlarged, which leads to the poorest top-1 accuracy (68.5%) and the most significant SAD. For the values of $\lambda_W$ corresponding to the standard version of SR-STE, SAD shrinks while the top-1 accuracy clearly increases. Furthermore, we examine the performance of three neural networks: (1) a dense neural network trained with regular SGD; (2) an N:M sparse network optimized with STE; (3) an N:M sparse network trained with SR-STE. Their top-1 accuracy curves over all training epochs are shown in Fig. 4(b). The accuracy curve of STE is consistently below the other two curves and fluctuates more between training epochs. Note that the SAD value is associated with the learning rate $\gamma$. For instance, SAD grows rapidly during the first 5 epochs in Fig. 4(a), since the learning rate increases from 0 to 0.1 in the so-called "warm-up" process. We also present other formulations of the sparse-refined term in Appendix A.5.

4 Experiments
In this section, we demonstrate the effectiveness of our proposed N:M fine-grained structured sparse neural networks on computer vision tasks (e.g., image classification, object detection, instance segmentation, and optical flow prediction) and on machine translation. The implementation details for these experiments, including dataset settings, training schedules, and evaluation metrics, are listed in Appendix A.4. We use a single value of $\lambda_W$ in all experiments, chosen because it gives good performance.

4.1 Image Classification
In this section, we first conduct several experiments to evaluate the effects of different N:M sparse patterns and different training methods on the image classification benchmark ImageNet-1K (Deng et al., 2009) with different backbones. Then, we compare the performance of our proposed N:M fine-grained structured sparse networks with state-of-the-art sparsity methods.

Different N:M Sparse Patterns. To investigate the performance of different N:M sparse patterns, we use the popular ResNet50 (He et al., 2016) model with four different N:M structural sparse patterns: 2:4, 4:8, 1:4, and 2:8. The baseline is the traditional dense model. Among the designed patterns, 2:4 and 4:8 have the same sparsity of 50%, and 1:4 and 2:8 both have a sparsity of 75%. In Table 1, we observe that 4:8 structural sparsity outperforms 2:4 at the same computational cost, and 2:8 also performs better than 1:4. The training curves are shown in Fig. 6(a). They show that, at the same sparsity, a larger M leads to better performance, since it allows more diverse convolution kernel shapes (we visualize and analyze the convolution kernel shapes in Appendix A.2). For a fixed M, we can adjust N to obtain different sparsity levels. With the same M, it is reasonable that a larger N performs better, since it retains more parameters and incurs a higher computational cost. Meanwhile, ResNet50 with 1.25× width reaches 77.5% top-1 accuracy at about 71% sparsity relative to the original dense ResNet50. We also conduct several experiments with RegNetXs (Radosavovic et al., 2020) to evaluate the effectiveness of our proposed N:M fine-grained structural sparse patterns on compact models; see Appendix A.3.
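The sparsity levels quoted above follow from sparsity = 1 - N/M. One simple way to see why a larger M offers more freedom at the same sparsity is to count the admissible masks over a fixed span of weights. This counting argument is our own illustration and is not the kernel-shape analysis of Appendix A.2; the function names and the span of 8 weights are assumptions made for the example.

```python
from math import comb

def pattern_stats(n, m):
    """Sparsity of an N:M pattern and the number of distinct masks per group."""
    sparsity = 1.0 - n / m
    masks_per_group = comb(m, n)        # ways to choose which n of the m weights survive
    return sparsity, masks_per_group

def masks_over_span(n, m, span=8):
    """Number of distinct masks over `span` consecutive weights (span divisible by m)."""
    return comb(m, n) ** (span // m)

for n, m in [(2, 4), (4, 8), (1, 4), (2, 8)]:
    sparsity, _ = pattern_stats(n, m)
    print(f"{n}:{m}  sparsity={sparsity:.0%}  masks over 8 weights={masks_over_span(n, m)}")
```

Over the same 8 weights, 4:8 admits 70 masks against 36 for 2:4, and 2:8 admits 28 against 16 for 1:4, which is consistent with the direction of the accuracy gaps reported in Table 1.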
Different Training Methods. We also verify the effectiveness of the proposed SR-STE for training N:M sparse neural networks. In Table 2, we find that SR-STE achieves better accuracy than both the NVIDIA ASP method and STE, while using fewer training epochs than ASP.
Comparison with the State of the Art. Before the advent of N:M fine-grained structured sparsity, many state-of-the-art methods for generating sparse models already existed, including DSR (Mostafa and Wang, 2019), RigL (Evci et al., 2019a), GMP (Gale et al., 2019), and STR (Kusupati et al., 2020). SR-STE is compared with those methods on ResNet50 at mid-level sparsity (about 80%) and ultra-level sparsity (about 95%). Table 3 shows that SR-STE can outperform all of these state-of-the-art methods, even though the other methods use unstructured sparsity. STR (Kusupati et al., 2020) shows that training a model with non-uniform sparsity can improve performance consistently, so SR-STE could be extended to a non-uniform structural sparsity setting (e.g., mixed N:M fine-grained structural sparsity). We believe that mixed N:M sparsity could further improve the results, and we leave this exploration for future work.
Model  Method  Sparse Pattern  Top-1 Acc(%)  Params(M)  Flops(G)

ResNet50  –  Dense  77.3  25.6  4.09
ResNet50  SR-STE  2:4  77.0  12.8  2.05
ResNet50  SR-STE  4:8  77.4  12.8  2.05
ResNet50  SR-STE  1:4  75.9  6.4  1.02
ResNet50  SR-STE  2:8  76.4  6.4  1.02
ResNet50 x1.25  SR-STE  2:8  77.5  9.9  1.6

Model  Method  Sparse Pattern  Top-1 Acc(%)  Epochs

ResNet18  ASP (Nvidia, 2020)  2:4  70.7  200
ResNet18  STE  2:4  69.9  120
ResNet18  SR-STE  2:4  71.2  120
ResNet50  ASP (Nvidia, 2020)  2:4  76.8  200
ResNet50  STE  2:4  76.4  120
ResNet50  SR-STE  2:4  77.0  120

Method  Top-1 Acc(%)  Sparsity(%)  Params(M)  Flops(G)  Structured  Uniform

ResNet50  77.3  0.0  25.6  4.09  –  –
DSR  71.6  80  5.12  0.82  ✗  ✗
RigL  74.6  80  5.12  0.92  ✗  ✓
GMP  75.6  80  5.12  0.82  ✗  ✓
STR  76.1  81  5.22  0.71  ✗  ✗
STE  76.2  80  5.12  0.82  ✗  ✓
SR-STE  77.0  80  5.12  0.82  ✗  ✓
SR-STE  76.4  75 (2:8)  6.40  1.02  ✓  ✓
RigL  67.5  95  1.28  0.32  ✗  ✓
GMP  70.6  95  1.28  0.20  ✗  ✓
STR  70.2  95  1.24  0.16  ✗  ✗
STE  68.4  95  1.28  0.20  ✗  ✓
SR-STE  72.4  95  1.28  0.20  ✗  ✓
SR-STE  72.2  94 (1:16)  1.60  0.25  ✓  ✓
4.2 Object Detection and Instance Segmentation
We further conduct experiments on the challenging COCO dataset (Lin et al., 2014) to evaluate the effectiveness of the proposed approach on two important computer vision tasks, i.e., object detection and instance segmentation. We use the classical models Faster R-CNN (Ren et al., 2015) for object detection and Mask R-CNN (He et al., 2017) for instance segmentation. All experiments are conducted with MMDetection (Chen et al., 2019). Table 4 and Table 5 show that 2:8 (25%) structured sparsity can achieve results comparable with the dense baseline models. Furthermore, 4:8 (50%) sparsity can provide even better results than the dense models. These results also indicate that the N:M sparse pre-trained model offers similar or better feature transfer ability.

Model  Method  Sparse Pattern  LR Schd  mAP

FRCNNR50  –  Dense  37.4  
FRCNNR50  SRSTE  2:4  38.2  
FRCNNR50  SRSTE  2:8  37.2  
FRCNNR50  –  Dense  38.4  
FRCNNR50  SRSTE  2:4  39.2  
FRCNNR50  SRSTE  2:8  38.9 
Model  Method  Sparse Pattern  LR Schd  Box mAP  Mask mAP 

MRCNNR50  –  Dense  38.2  34.7  
MRCNNR50  SRSTE  2:4  39.0  35.3  
MRCNNR50  SRSTE  2:8  37.6  33.9  
MRCNNR50  –  Dense  39.4  35.4  
MRCNNR50  SRSTE  2:4  39.8  35.9  
MRCNNR50  SRSTE  2:8  39.4  35.4 
4.3 Optical Flow
Optical flow prediction is a representative dense pixel-level prediction task in computer vision. We verify our proposed method with a recent state-of-the-art model, RAFT (Teed and Deng, 2020), on FlyingChairs (Dosovitskiy et al., 2015). A smaller end-point error (EPE) indicates better performance. Compared with the dense model for optical flow prediction, Table 6 shows that our proposed method achieves comparable accuracy with half of the parameters.
Model  Method  Sparse Pattern  EPE  Params(M)  Flops(G)

RAFT  –  Dense  0.86  5.3  134
RAFT  SR-STE  2:4  0.88  2.65  67

Model  Method  Sparse Pattern  BLEU  Params(M)  Flops(G)

Transformer  –  Dense  27.31  63  10.2
Transformer  SR-STE  2:4  27.23  31.5  5.1
4.4 Machine Translation (MT)
Besides the computer vision tasks, we investigate the effectiveness of our method on one of the most common tasks in natural language processing, i.e., machine translation. We conduct our experiments with the Transformer, which employs a number of linear layers. We train our models with the transformer_base configuration adopted by Vaswani et al. (2017), which contains a 6-layer encoder and a 6-layer decoder with 512-dimensional hidden representations. A larger BLEU score indicates better performance. Compared with the dense model, Table 7 shows that our proposed method incurs a negligible accuracy loss.

5 Discussion and Conclusion
In this work, we present SR-STE, the first method for training N:M fine-grained structural sparse networks from scratch. SR-STE extends the Straight-through Estimator with a regularization term to alleviate the ineffective sparse architecture updates caused by the coarse gradients computed with the STE-modified chain rule. We define a metric, Sparse Architecture Divergence (SAD), to analyze these architecture updates. The experimental results show that SAD correlates strongly with the performance of the pruned network. We hope this work can shed light on machine learning acceleration, and that SAD can inspire more theoretical and empirical studies in sparse network training and in other fields such as neural architecture search.
References
 Deep rewiring: training very sparse deep networks. arXiv preprint arXiv:1711.05136. Cited by: §1, §2.
 Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432. Cited by: §1, §3.2.
 Cognitive model priors for predicting human decisions. In International conference on machine learning, pp. 5133–5141. Cited by: §1.
 Language models are fewshot learners. arXiv preprint arXiv:2005.14165. Cited by: §1.
 Mmdetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155. Cited by: §A.4.2, §4.2.
 Binarized neural networks: training deep neural networks with weights and activations constrained to+ 1 or1. arXiv preprint arXiv:1602.02830. Cited by: §1.

Imagenet: a largescale hierarchical image database.
In
2009 IEEE conference on computer vision and pattern recognition
, pp. 248–255. Cited by: §A.4.1, §3.3, §4.1.  Sparse networks from scratch: faster training without losing performance. arXiv preprint arXiv:1907.04840. Cited by: §1, §2.
 Flownet: learning optical flow with convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 2758–2766. Cited by: §A.4.3, §A.4.3, §4.3.
 Rigging the lottery: making all tickets winners. arXiv preprint arXiv:1911.11134. Cited by: §2, §3.2.1, §4.1.
 The difficulty of training sparse neural networks. arXiv preprint arXiv:1906.10732. Cited by: §1.
 The lottery ticket hypothesis: finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635. Cited by: §1, §2, §2.
 Pruning neural networks at initialization: why are we missing the mark?. arXiv preprint arXiv:2009.08576. Cited by: §3.2.1.
 The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574. Cited by: §1, §2, §4.1.
 Dynamic network surgery for efficient dnns. In Advances in neural information processing systems, pp. 1379–1387. Cited by: §1, §2, §2.
 Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems (NIPS), pp. 1135–1143. Cited by: §1, §1, §2, §2.
 Mask rcnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §4.2.
 Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.3, §4.1.

Bag of tricks for image classification with convolutional neural networks
. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 558–567. Cited by: §A.4.1.  Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §1.
 Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §1.
 Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713. Cited by: §1.
 Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §A.4.4.
 Soft threshold weight reparameterization for learnable sparsity. arXiv preprint arXiv:2002.03231. Cited by: §1, §2, §4.1.
 Optimal brain damage. In Advances in neural information processing systems, pp. 598–605. Cited by: §2.
 SNIP: single-shot network pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340. Cited by: §2, §2.
 Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710. Cited by: §1, §1, §2.
 Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §A.4.2, §4.2.
 Topological insights in sparse neural networks. arXiv preprint arXiv:2006.14085. Cited by: §3.2.1.
 Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature communications 9 (1), pp. 1–12. Cited by: §1, §2.
 Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. arXiv preprint arXiv:1902.05967. Cited by: §1, §4.1.
 NVIDIA a100 tensor core gpu architecture. https://www.nvidia.com/content/dam/enzz/Solutions/DataCenter/nvidiaamperearchitecturewhitepaper.pdf. Cited by: §1, §1, §1, Table 2.
 BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318. Cited by: §A.4.4.
 Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10428–10436. Cited by: §A.3, §4.1.
 XNOR-Net: ImageNet classification using binary convolutional neural networks. In European conference on computer vision, pp. 525–542. Cited by: §3.2.
 Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §4.2.
 Comparing rewinding and finetuning in neural network pruning. arXiv preprint arXiv:2003.02389. Cited by: §1, §1.
 Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 1715–1725. Cited by: §A.4.4.
 PCNN: pattern-based fine-grained regular pruning towards optimizing CNN accelerators. arXiv preprint arXiv:2002.04997. Cited by: §1.
 RAFT: recurrent all-pairs field transforms for optical flow. arXiv preprint arXiv:2003.12039. Cited by: §A.4.3, §A.4.3, §4.3.
 Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §4.4.
 Non-structured DNN weight pruning considered harmful. arXiv preprint arXiv:1907.02124. Cited by: §1.
 Learning structured sparsity in deep neural networks. In Advances in neural information processing systems, pp. 2074–2082. Cited by: §1, §1, §2.
 Discovering neural wirings. In Advances in Neural Information Processing Systems, pp. 2684–2694. Cited by: §2.
 Understanding straight-through estimator in training activation quantized neural nets. arXiv preprint arXiv:1903.05662. Cited by: §1, §3.2.
 Incremental network quantization: towards lossless CNNs with low-precision weights. arXiv preprint arXiv:1702.03044. Cited by: §1.
Appendix A Appendix
A.1 Algorithm
A.2 Kernel Shape
Fig. 5 illustrates six learnt convolution kernels picked from the trained ResNet-50 2:8 sparse model. Note that, although 2:8 and 1:4 sparsity keep the same fraction of weights, the shapes formed by the nonzero elements of these six kernels under the 2:8 constraint cannot be acquired or learnt under 1:4 sparsity.
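The difference between the two patterns can be illustrated with a small check. The following is a minimal sketch, assuming N:M sparsity means at most N nonzero weights in every group of M consecutive weights; the helper name `satisfies_n_m` and the example weights are hypothetical and are not taken from the trained model.

```python
def satisfies_n_m(weights, n, m):
    """Return True if every group of m consecutive weights has at most n nonzeros."""
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        nonzeros = sum(1 for w in group if w != 0)
        if nonzeros > n:
            return False
    return True


# Hypothetical group of 8 weights: both nonzeros sit in the first 4 positions.
kernel = [0.7, -0.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

print(satisfies_n_m(kernel, 2, 8))  # True: 2 nonzeros in the group of 8
print(satisfies_n_m(kernel, 1, 4))  # False: the first group of 4 has 2 nonzeros
```

Both patterns keep 25% of the weights, but the 1:4 constraint forbids two nonzeros from falling in the same group of four, so some 2:8 shapes are unreachable under 1:4.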
A.3 RegNetXs on ImageNet-1K
We further verify whether N:M sparsity can benefit compact models. The recent RegNetXs (Radosavovic et al., 2020) are state-of-the-art, hardware-friendly models, which are the best models found in a search space of candidate models. Table 8 shows that SR-STE improves RegNetX-002 significantly over STE, and that 2:4 structured sparse models outperform dense RegNetX models with the same FLOPs. For example, the 2:4 sparse RegNetX-004 reaches 71.4% Top-1 accuracy at 0.2 GFLOPs, compared with 68.3% for the dense RegNetX-002 at the same cost. Therefore, N:M fine-grained structured sparse models can be obtained easily with SR-STE, and we believe that the proposed N:M fine-grained structured sparsity method could become a standard technique for model deployment.
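| Model | Method | Sparse Pattern | Top-1 acc (%) | FLOPs (G) |
| --- | --- | --- | --- | --- |
| RegNetX-002 | SR-STE | 2:4 | 66.7 | 0.1 |
| RegNetX-004 | SR-STE | 2:4 | 71.4 | 0.2 |
| RegNetX-006 | SR-STE | 2:4 | 72.8 | 0.3 |
| RegNetX-008 | SR-STE | 2:4 | 74.1 | 0.4 |
| RegNetX-016 | SR-STE | 2:4 | 76.7 | 0.8 |
| RegNetX-032 | SR-STE | 2:4 | 77.8 | 1.6 |
| RegNetX-002 | – | Dense | 68.3 | 0.2 |
| RegNetX-004 | – | Dense | 72.3 | 0.4 |
| RegNetX-006 | – | Dense | 73.8 | 0.6 |
| RegNetX-008 | – | Dense | 74.9 | 0.8 |
| RegNetX-016 | – | Dense | 76.8 | 1.6 |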
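The same-FLOPs comparison can be read off the table by pairing each 2:4 sparse model with the dense model of equal cost. The sketch below does this pairing; the accuracy and FLOPs values are copied from the table, while the variable and function names are hypothetical.

```python
# (Top-1 accuracy in %, GFLOPs), copied from the table above.
sparse_2_4 = {
    "RegNetX-002": (66.7, 0.1),
    "RegNetX-004": (71.4, 0.2),
    "RegNetX-006": (72.8, 0.3),
    "RegNetX-008": (74.1, 0.4),
    "RegNetX-016": (76.7, 0.8),
    "RegNetX-032": (77.8, 1.6),
}
dense = {
    "RegNetX-002": (68.3, 0.2),
    "RegNetX-004": (72.3, 0.4),
    "RegNetX-006": (73.8, 0.6),
    "RegNetX-008": (74.9, 0.8),
    "RegNetX-016": (76.8, 1.6),
}


def matched_flops_gains(sparse, dense):
    """Pair sparse and dense models with identical FLOPs and report the accuracy gap."""
    rows = []
    for s_name, (s_acc, s_flops) in sparse.items():
        for d_name, (d_acc, d_flops) in dense.items():
            if s_flops == d_flops:
                rows.append((s_name, d_name, s_flops, round(s_acc - d_acc, 1)))
    return rows


for s_name, d_name, flops, gain in matched_flops_gains(sparse_2_4, dense):
    print(f"{s_name} (2:4) vs {d_name} (dense) at {flops} GFLOPs: {gain:+.1f} Top-1")
```

With the values above, the four matched pairs give gains of +3.1, +1.8, +1.8, and +1.0 points in favour of the 2:4 sparse models.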
A.4 Implementation Details
A.4.1 Classification
Dataset. ImageNet-1K (Deng et al., 2009) is a large-scale image classification dataset that is regarded as one of the most challenging image classification benchmarks. It has about 1.2 million training images and 50 thousand validation images. Each image is annotated with one of 1,000 object classes.
Training scheduler. All ImageNet-1K experiments are trained on images of a fixed size, and the dense baselines follow the hyperparameter settings of He et al. (2019). Specifically, all models are trained with a batch size of 256 for 120 epochs. The learning rate increases linearly from 0 to 0.1 over the first 5 epochs and is then annealed from 0.1 to 0 with a cosine scheduler.
Evaluation Metric. All reported results use the standard Top-1 accuracy.
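The learning-rate schedule described in the training scheduler paragraph can be sketched as a function of the epoch. This is a minimal illustration, assuming the cosine decay spans the epochs remaining after warmup and that the rate may be evaluated at fractional epochs; the function name and these two choices are assumptions, since the text does not specify them.

```python
import math


def learning_rate(epoch, base_lr=0.1, warmup_epochs=5, total_epochs=120):
    """Linear warmup from 0 to base_lr, then cosine annealing from base_lr to 0."""
    if epoch < warmup_epochs:
        # Linear increase during the first warmup_epochs epochs.
        return base_lr * epoch / warmup_epochs
    # Assumption: the cosine decay covers the epochs left after warmup.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


for e in (0, 2.5, 5, 62.5, 120):
    print(e, round(learning_rate(e), 4))
```

The rate starts at 0, reaches 0.1 at epoch 5, passes through 0.05 halfway through the decay, and returns to 0 at epoch 120.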
A.4.2 Object Detection and Instance Segmentation
Dataset. All experiments are performed on the challenging MS COCO 2017 dataset (Lin et al., 2014), which has 80 categories. It consists of about 115k images for training (train2017) and a separate set of images for validation (val2017). We train models on train2017 and evaluate them on val2017.
Training scheduler. For object detection and instance segmentation, we conduct the experiments with MMDetection (Chen et al., 2019). The 1x and 2x training schedules follow the default settings in MMDetection (Chen et al., 2019).
Evaluation Metric. All reported results use the standard COCO-style Average Precision (AP) metric, i.e., AP averaged over IoU thresholds from 0.5 to 0.95.
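To make the role of the IoU thresholds concrete, the sketch below computes the IoU of two boxes and lists the thresholds at which a detection would count as a match. It assumes axis-aligned boxes in `(x1, y1, x2, y2)` format and the usual COCO threshold step of 0.05; it is not a full AP implementation, and the function names are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Thresholds 0.50, 0.55, ..., 0.95 (ten values).
IOU_THRESHOLDS = [t / 100 for t in range(50, 100, 5)]


def thresholds_matched(iou):
    """Thresholds at which a detection with this IoU counts as a true positive."""
    return [t for t in IOU_THRESHOLDS if iou >= t]


iou = box_iou((0, 0, 10, 10), (2, 0, 12, 10))  # overlap 80, union 120
print(round(iou, 3), thresholds_matched(iou))
```

A detection with IoU of about 0.667 is a match at the four loosest thresholds only, which is why averaging AP over all ten thresholds rewards tighter localization.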
A.4.3 Optical Flow
Dataset. Optical flow prediction is conducted on the FlyingChairs (Dosovitskiy et al., 2015) dataset, which is a synthetic dataset with optical flow ground truth. The dataset consists of 22,872 image pairs and corresponding flow fields. The training set contains 22,232 samples and the validation set contains 640 samples. We train the RAFT (Teed and Deng, 2020) model on the training set and report the final results on the validation set.
Training scheduler. We use the original open-source framework (https://github.com/princetonvl/RAFT/) to run the RAFT (Teed and Deng, 2020) model. The training settings for the FlyingChairs (Dosovitskiy et al., 2015) dataset follow those listed in Teed and Deng (2020).
Evaluation Metric. We use the end-point error (EPE) to evaluate the predicted results. EPE is the Euclidean distance between the predicted flow vector and the ground truth, averaged over all pixels.
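The EPE definition above can be written out directly. The sketch below is a plain-Python illustration that assumes each flow field is given as a flat list of `(u, v)` vectors, one per pixel; the function name and data layout are hypothetical.

```python
import math


def endpoint_error(pred_flow, gt_flow):
    """Mean Euclidean distance between predicted and ground-truth flow vectors."""
    if len(pred_flow) != len(gt_flow) or not pred_flow:
        raise ValueError("flow fields must be non-empty and of equal length")
    total = 0.0
    for (pu, pv), (gu, gv) in zip(pred_flow, gt_flow):
        total += math.hypot(pu - gu, pv - gv)
    return total / len(pred_flow)


# Two pixels: the first is predicted exactly, the second is off by (3, 4).
pred = [(1.0, 0.0), (3.0, 4.0)]
gt = [(1.0, 0.0), (0.0, 0.0)]
print(endpoint_error(pred, gt))  # 2.5
```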
A.4.4 Machine Translation
Dataset. For English-German translation, the training set consists of about 4.5 million bilingual sentence pairs from WMT 2014. We use newstest2013 as the validation set and newstest2014 as the test set. Sentences are encoded using BPE with a shared vocabulary of about 37,000 tokens.
Training scheduler. We use the subword method (Sennrich et al., 2016) to encode the source-side sentences together with the target-side sentences. The vocabulary size is 37,000 for both sides. Each minibatch on one GPU contains a set of sentence pairs with roughly 4,096 source tokens and 4,096 target tokens. We use the Adam optimizer (Kingma and Ba, 2015). For our model, we train for 300,000 steps. We employ four Titan XP GPUs to train both the baseline and our model.
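One way to form minibatches under a token budget is a greedy grouping of sentence pairs. The sketch below is an assumption about how such batching might be done, not the procedure used in the experiments; the function name `batch_by_tokens` and the greedy strategy are hypothetical.

```python
def batch_by_tokens(pairs, max_src_tokens=4096, max_tgt_tokens=4096):
    """Greedily group (source_tokens, target_tokens) pairs under per-side token budgets."""
    batches, current = [], []
    src_total = tgt_total = 0
    for src, tgt in pairs:
        over_budget = (src_total + len(src) > max_src_tokens
                       or tgt_total + len(tgt) > max_tgt_tokens)
        if current and over_budget:
            # Close the current batch before it would exceed a budget.
            batches.append(current)
            current, src_total, tgt_total = [], 0, 0
        current.append((src, tgt))
        src_total += len(src)
        tgt_total += len(tgt)
    if current:
        batches.append(current)
    return batches


# Toy example with a budget of 6 tokens per side instead of 4,096.
toy_pairs = [(["a"] * 3, ["b"] * 2) for _ in range(5)]
toy_batches = batch_by_tokens(toy_pairs, max_src_tokens=6, max_tgt_tokens=6)
print([len(b) for b in toy_batches])  # [2, 2, 1]
```

In this sketch a single pair longer than the budget still forms its own batch, so no data is dropped.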
Evaluation Metric. All reported results use the standard BLEU metric (Papineni et al., 2002) on tokenized, truecased output.
A.5 Other Refined Formulations
We present another sparse-refined regularization term, i.e., a sign constant, as follows:
(5) 
Besides applying the refined term to the model parameters, we also modify Eq. 4 so that it is applied directly to the approximated gradient, as follows:
(6) 
Fig. 6(b) depicts the SAD curves of Eq. 4, Eq. 5, and Eq. 6. When the refined term is applied to the gradients (Eq. 6), the SAD value is smaller than that of STE in the early stage of training but larger in the late stage, leading to worse performance (69.4%) than STE (69.9%). With the sign constant (Eq. 5), the SAD value converges similarly to that of Eq. 4, which improves the performance by about 0.9%.