Computation on Sparse Neural Networks: an Inspiration for Future Hardware

04/24/2020, by Fei Sun et al.

Neural network models are widely used to solve many challenging problems, such as computer vision, personalized recommendation, and natural language processing. These models are very computationally intensive and reach the hardware limits of existing server and IoT devices. Thus, finding better model architectures with much less computation while maximally preserving accuracy is a popular research topic. Among the various mechanisms that aim to reduce computation complexity, identifying the zero values in the model weights and in the activations to avoid computing them is a promising direction. In this paper, we summarize the current status of research on the computation of sparse neural networks, from the perspectives of the sparse algorithms, the software frameworks, and the hardware accelerators. We observe that the search for sparse structures can be a general methodology for high-quality model exploration, in addition to a strategy for high-efficiency model execution. We discuss how model accuracy is influenced by the number of weight parameters and by the structure of the model; we refer to the corresponding models as being located in the weight dominated and structure dominated regions, respectively. We show that for practically complicated problems, it is more beneficial to search for large and sparse models in the weight dominated region. To achieve this goal, new approaches are required to search for proper sparse structures, and new sparse training hardware needs to be developed to facilitate fast iterations of sparse models.

I Introduction

In the past few years, artificial intelligence (AI) has significantly changed our lives. New technologies such as autonomous driving, personalized recommendation, and language translation are all backed by complicated neural network models. The advancement of AI is mainly driven by three factors: algorithm innovation, the amount of data, and the computing power [66]. Among the three, the computing power supports both searching for larger and better models and processing enormous amounts of data.

With the advancement of the algorithms and the amount of computation, problems previously considered challenging are now perceived as trivial, and newer problems requiring much higher computing power seem to be within reach.

However, the amount of computing resource is limited. Several methods have been proposed to reduce the computation of AI tasks, in particular for the deep learning approach. One of the most important concepts in deep learning based AI is a tensor, which can be seen as a multi-dimensional matrix. Since the computation on tensors grows dramatically with the tensor size, there is a variety of methods to decompose a tensor into several low-rank tensors. Polyadic tensor decomposition [39, 40] is proposed to approximate an N-way tensor by a sum of rank-one tensors. Tucker decomposition [85] is proposed as a higher-order principal component analysis (PCA) to decompose a tensor into a core tensor multiplied (or transformed) by a matrix along each dimension. Other variants [14, 36, 15, 37] impose constraints such as linearity or symmetry onto the Polyadic or Tucker decomposition. Some modern neural network architectures, such as MobileNetV2 [77], take advantage of the decomposition of 4-D convolutional kernels. More recently, Winograd convolution is proposed in [90, 53] to reduce the arithmetic complexity, and [72, 2] use the fast Fourier transform to accelerate convolutional neural networks. Another approach to reduce the computation of deep neural networks is to quantize the model, i.e., use fewer bits to represent the weights and activations [44]. Similarly, after a pioneering study on neural network compression [34], it has become a hot topic to improve the performance of neural networks when a majority of the parameters are set to zero to form a sparse model, such that most of the computation can be avoided through carefully designed software and hardware implementations.
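To make the low-rank idea concrete, the sketch below factorizes a single dense weight matrix with a truncated SVD. This is a minimal two-factor illustration of the decomposition principle rather than the higher-order Polyadic or Tucker algorithms cited above, and the layer size and rank are arbitrary choices for illustration.

```python
import torch

def factorize_linear(weight: torch.Tensor, rank: int):
    """Approximate a dense weight matrix W (out x in) by two low-rank
    factors A (out x rank) and B (rank x in) via truncated SVD, so that
    y = W x is replaced by the cheaper y = A (B x)."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # (out, rank), columns scaled by S
    B = Vh[:rank, :]                    # (rank, in)
    return A, B

# Example: a 1024x1024 layer truncated to rank 64 stores
# 2 * 1024 * 64 parameters instead of 1024 * 1024 (~8x fewer).
W = torch.randn(1024, 1024)
A, B = factorize_linear(W, rank=64)
rel_err = (torch.norm(W - A @ B) / torch.norm(W)).item()
print(f"relative approximation error: {rel_err:.3f}")
```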

In this paper, we survey the recent research progress on sparsity. On the algorithm side, we survey static sparsity, dynamic sparsity, and sparsely activated models. On the software side, we describe different sparse representations and compare some sparse libraries. On the hardware side, we examine the existing hardware accelerators for training and inference.

We also describe several complicated problems that push the current computation to its limit, and explain why sparse computation may be a viable solution. To that end, we introduce the concepts of the weight dominated and structure dominated regions, which a model passes through as its size increases. We illustrate that in the weight dominated region, sparse models are more accurate than the corresponding dense models with the same number of floating point operations (FLOPs). However, it is not possible to apply the conventional pruning algorithms to large and dense models to create the complicated sparse models close to the computing limit, and therefore the sparse models need to be derived from smaller dense models. Thus, the approach to find a reasonable sparse structure is also a model exploration problem.

In addition to the necessary algorithm innovations to create large and sparse models close to the computing limit, we also need to invest in the sparse training platforms and the corresponding training software frameworks, so that the entire training procedure can be done efficiently within the computing limit.

We organize this paper as follows. In Section II, we describe a few complicated problems that may benefit from the computation on sparse models, followed by a survey of the existing pruning algorithms in Section III. In Section IV, we outline our reasons that sparsity may be a viable solution to those complicated problems. We survey the sparse software and hardware solutions in Sections V and VI, respectively. Then we identify some future research directions on the sparse algorithms and those enabled by the sparse training hardware in Section VII. We conclude in Section VIII.

II The Challenges in Solving Real-life Complicated Problems

The introduction of neural networks fundamentally changed researchers’ approach to solving complicated problems. With sufficient data, once a promising neural network model architecture is identified, the solution is to find the most complicated model that is economically viable and practically usable.

In 2012, image classification was considered a complicated problem, as the model quality at the time was far from acceptable. Training an AlexNet on the ImageNet dataset took five to six days on a two-GPU machine [51]. With the recent improvements in computing power, training a ResNet-50 on the ImageNet dataset can be shortened to two minutes [62]. Supported by the computing power, many neural network models have surpassed human-level accuracy [38], and thus the single-image classification problem is no longer considered a complicated problem.

However, this is just the beginning of artificial intelligence. Many new problems limited by the computing resources are still considered complicated. Below are a few examples.

Tasks such as video classification or action recognition are widely used in autonomous driving [61] and game playing [63]. With an additional time dimension, the amount of data to process increases significantly. Given the limited computing power, the requirement of real-time latency also hinders the feasibility of deploying large models for video analysis.

Similarly, 3D tasks based on point cloud data, such as semantic segmentation and classification, raise challenges in both computation and model architecture exploration. Due to the discrete and sparse nature of the inputs, a recent model named PointNet++ [73] feeds the embedded inputs to a hierarchical multi-layer perceptron (MLP), and MinkowskiEngine [17] proposes a generalized sparse convolution. With billions of points in a dataset, the speed of processing them is usually the bottleneck.

Graph neural network (GNN) models [78] are rapidly emerging for use in social networks [20] and risk control [81]. The sparse nature of the inputs, with billions of nodes and edges, puts the existing training and inference systems under enormous stress. Being either computation-bound or bandwidth-bound depending on the phase, model training suffers from sub-linear scaling on massively parallel machines.

The self-attention based transformer model architecture is widely used in natural language processing (NLP) tasks. It is known that the prediction accuracy increases with larger models, and the number of weights of the largest model, GPT-2 8B [75], exceeds 8 billion, which requires 512 GPUs to train.

Since 2012, the amount of computation used to train the largest models has doubled every 3.4 months [66]. The trend of computation scaling is likely to continue technologically, but at a much higher price tag. For example, a single training run of the XLNet model on Google Cloud TPU v3 costs as much as $60K [67]. Thus, the search for better models is affordable only to affluent institutions working on high-reward applications.

The trend of endlessly scaling the computation is unlikely to continue economically. Thus, we believe that searching for highly effective and efficient models using alternative methods is increasingly important. Among various approaches, intentionally inducing zeros in the models and avoiding computing them is a promising direction.

III Sparse Models and Sparse Computation

In this section, we survey the status of research on sparse neural network model computation. The sparsity of neural network execution can generally be categorized into two types: static sparsity and dynamic sparsity. Static sparsity refers to the reduction of non-zero weights in neural network models. Once the positions and values of the non-zero weights are determined, they are fixed for inference. Therefore, the amount of computation is constant for different inputs. On the other hand, dynamic sparsity reduces the computation within a neural network layer dynamically, based on the computational characteristics during inference or training. The recent development of large deep learning models enables them to be sparsely activated for each input example and for each learning task.

III-A Static Sparsity

One of the approaches to increase the representation capability of a neural network is to increase the model size, usually measured by the number of weights (connections between neurons). Therefore, modern neural network designs have gone through a period in which the number of weights increased drastically. On the other hand, the intrinsic redundancy within large models implies that the non-critical weights can be identified and pruned with minimal accuracy loss compared to the dense models. The training process to obtain a pruned neural network in the existing literature usually follows three steps. First, a large and dense neural network is pre-trained. Second, in the pruning process, some non-critical weights are identified and set to zero permanently. Third, the now sparser neural network is re-trained and expected to reach accuracy similar to the dense one. The three steps can be applied repeatedly to gradually increase the sparsity while maintaining an acceptable accuracy. This three-step pruning process requires pre-training a large and dense model in the first step as a superset of the sparse model. In Section IV, we will point out that this approach is infeasible for obtaining large and sparse models for complicated problems.
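As a reference point, the minimal PyTorch sketch below implements the three-step loop with magnitude-based (irregular) pruning; `train_fn` is a placeholder for the user's own fine-tuning routine, and the round count and per-round pruning amount are arbitrary illustrative values rather than settings from any cited work.

```python
import torch
import torch.nn.utils.prune as prune

def iterative_magnitude_prune(model, train_fn, rounds=3, amount_per_round=0.3):
    """Sketch of the three-step pruning loop: the model is assumed to be
    pre-trained already; each round zeroes a fraction of the smallest-magnitude
    weights and then fine-tunes to recover accuracy."""
    prunable = [(m, "weight") for m in model.modules()
                if isinstance(m, (torch.nn.Linear, torch.nn.Conv2d))]
    for _ in range(rounds):
        # Step 2: globally zero the lowest-magnitude fraction of weights.
        prune.global_unstructured(prunable,
                                  pruning_method=prune.L1Unstructured,
                                  amount=amount_per_round)
        # Step 3: re-train the now-sparse network; the masks keep pruned
        # weights at zero during fine-tuning.
        train_fn(model)
    # Make the zeros permanent by folding the masks into the weight tensors.
    for module, name in prunable:
        prune.remove(module, name)
    return model
```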

Based on the location characteristics of the remaining non-zero weights, static sparsity can be achieved via irregular pruning or structured pruning.

Irregular pruning aims to remove non-critical weights without any constraints on their locations, and the key is to define the importance of each weight. [35] proposes a magnitude-based importance metric and iteratively prunes weights with small magnitudes. After that, the pruning ratio is largely improved by different approximation and optimization methods. [57] uses ℓ0 regularization, and [89] manages to solve the non-convexity problem by approximation. [95, 76] utilize the alternating direction method of multipliers (ADMM), an effective technique in optimization theory, to improve the sparsity obtained from regularization-based formulations. Other interesting works [26, 56] explore the feasibility of re-training sparse neural networks without a pre-trained model.

While irregular pruning provides much flexibility in zeroing the weights and thus maximally preserves the accuracy, the overhead of representing the locations of the non-zero weights and interpreting them during inference cannot be overlooked. For example, compressed sparse row (CSR) and compressed sparse column (CSC) [84] are two popular sparse matrix formats, but both require the non-zero elements to be extracted sequentially. This limits the throughput at which the non-zero weights can be read out and computed.
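The sketch below, using SciPy's CSR format, makes the storage overhead tangible: at a moderate 70% sparsity, the index arrays are comparable in size to the stored values themselves. The matrix size and sparsity level are arbitrary illustrative choices.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A roughly 70%-sparse weight matrix: at this moderate sparsity the CSR
# index arrays (indices + indptr) are a large fraction of the stored data,
# which is the representation overhead discussed above.
rng = np.random.default_rng(0)
dense = rng.standard_normal((512, 512)).astype(np.float32)
dense[rng.random((512, 512)) < 0.7] = 0.0

sparse = csr_matrix(dense)
values_bytes = sparse.data.nbytes
index_bytes = sparse.indices.nbytes + sparse.indptr.nbytes
print(f"non-zeros: {sparse.nnz}")
print(f"value storage: {values_bytes} B, index storage: {index_bytes} B")
print(f"dense storage: {dense.nbytes} B")
```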

On the other hand, structured pruning imposes constraints on the locations of non-zero weights and thus reduces the irregularity. It improves the execution efficiency of sparse models on existing devices, but often reduces the sparsity level in order to maintain the same accuracy. Different structures are proposed for different computing devices. For example, filter-wise pruning and shape-wise pruning [89] can remove rows and columns in matrix-matrix multiplications, which translates to computation reduction on GPU platforms. Recent works explore some special dimensions and propose pattern pruning and kernel-wise pruning [59]. These structured prunings have finer granularity than filter- and shape-wise pruning and are useful on edge computing devices such as mobile platforms.
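As an illustration of the structured case, the sketch below scores the filters of a convolution by their L1 norms and keeps only the strongest ones, one common filter-wise criterion. The keep ratio is arbitrary, and the follow-up bookkeeping (shrinking the next layer's input channels and any batch normalization) is omitted.

```python
import torch

def filter_l1_scores(conv: torch.nn.Conv2d) -> torch.Tensor:
    """L1 norm of each output filter; small norms mark candidate filters
    for removal in filter-wise (structured) pruning."""
    return conv.weight.detach().abs().sum(dim=(1, 2, 3))

def keep_top_filters(conv: torch.nn.Conv2d, keep_ratio: float = 0.5):
    """Return a smaller Conv2d that keeps only the highest-scoring filters.
    A real implementation must also shrink the next layer's input channels
    and any following batch-norm; omitted here for brevity."""
    scores = filter_l1_scores(conv)
    n_keep = max(1, int(keep_ratio * conv.out_channels))
    keep = torch.topk(scores, n_keep).indices.sort().values
    new_conv = torch.nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                               stride=conv.stride, padding=conv.padding,
                               bias=conv.bias is not None)
    new_conv.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        new_conv.bias.data = conv.bias.data[keep].clone()
    return new_conv
```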

Depending on the structure of the pre-trained dense model, the sparsity level of the pruned model can reach 90-97% on some heavily over-parameterized models (e.g. VGG, AlexNet, RNN models), and 50-70% on some compact models (e.g. MobileNetV1/V2/V3).

Both irregular and structured pruning are performed during the training phase, before the deployment of the neural networks. Therefore, the amount of computation for the sparse neural network models can be accurately predicted with little variation. However, this benefit comes with a drawback: the models do not differentiate difficult inputs from easy ones that could intuitively be recognized correctly with less computation.

III-B Dynamic Sparsity

Dynamic sparsity refers to neural network models and computation mechanisms that adjust the amount of computation according to different input data and internal signals inside the neural networks during inference. So far, its sole purpose is to improve the computation efficiency without affecting the model performance.

During dynamic inference, computation is reduced by pruning the activations in the neural networks [7], by skipping the calculation of zeros after the ReLU function [49, 69, 50], or by dynamically changing the number of bits in quantization [70]. Besides, ReLU-induced sparsity prediction methods in convolutional neural networks (CNNs) have been proposed to skip computations dynamically [22, 27, 13, 42].
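A quick way to see how much dynamic sparsity is available is to measure the fraction of zeros each ReLU produces at inference time. The sketch below does this with a standard PyTorch forward hook; the `model` and `example_batch` names in the usage comments are placeholders, and the printed numbers are made-up examples.

```python
import torch

def relu_sparsity_hook(name, stats):
    """Forward hook that records the fraction of zero activations produced
    by a ReLU for the current input batch."""
    def hook(module, inputs, output):
        stats[name] = (output == 0).float().mean().item()
    return hook

# Usage sketch on a hypothetical `model`: attach the hook to every ReLU,
# run one batch, and inspect how much work a zero-skipping scheme could save.
# stats = {}
# for name, module in model.named_modules():
#     if isinstance(module, torch.nn.ReLU):
#         module.register_forward_hook(relu_sparsity_hook(name, stats))
# model(example_batch)
# print(stats)   # e.g. {'layer1.relu': 0.55, ...}
```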

Many mechanisms to exploit irregular sparsity in dynamic inference rely on hardware implementation. Other than ReLU-based dynamic computation skipping, special cell structures and temporal input similarity have enabled computation and update skipping in recurrent neural networks (RNNs) [64, 96, 12].

Note that the difficulty of a learning task can depend on the input examples. The computation of neural networks can be reduced if an easy input can be identified and routed through a much simpler data and network flow. [82, 43, 9, 24, 48] propose to output the final learning decision by early exiting when a confidence score inside the neural network exceeds a threshold. [10] creates a directed acyclic graph where each node is a pre-trained deep neural network (DNN), with simpler DNNs at the source and complex DNNs at the sink; an exit policy is then trained to determine which DNN to go through. [86, 87, 91] propose to skip part of the sequential layers based on the dynamics of some gates in the networks, whose decisions are trained by Gumbel sampling [46] or reinforcement learning. Another mechanism that uses a gate to guide the neural network to execute one of several parallel branches is proposed in [83]. Besides dynamic inference, a dynamic sparse graph is proposed to reduce the computational and representational costs of DNN training [55]. Recently, [11] proposes an efficient progressive shrinking method to train a super-network based on some architecture and deploy only a portion of it, which outperforms the base architecture (and its NAS extensions) such as EfficientNet or MobileNetV3.
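To illustrate the early-exit family of methods, the toy module below attaches an intermediate classifier after its first stage and returns early when the softmax confidence clears a threshold. The layer sizes, threshold, and single-example control flow are illustrative assumptions, not the configuration of any cited work.

```python
import torch
import torch.nn.functional as F

class EarlyExitNet(torch.nn.Module):
    """Toy two-stage network with an intermediate classifier: easy inputs
    exit after the first stage when the softmax confidence clears a
    threshold, so they skip the (more expensive) second stage."""
    def __init__(self, num_classes=10, threshold=0.9):
        super().__init__()
        self.stage1 = torch.nn.Sequential(torch.nn.Linear(784, 256), torch.nn.ReLU())
        self.exit1 = torch.nn.Linear(256, num_classes)
        self.stage2 = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU())
        self.exit2 = torch.nn.Linear(256, num_classes)
        self.threshold = threshold

    def forward(self, x):                      # x: (1, 784), one example at a time
        h = self.stage1(x)
        logits1 = self.exit1(h)
        if F.softmax(logits1, dim=-1).max() >= self.threshold:
            return logits1                     # early exit: skip stage 2 entirely
        return self.exit2(self.stage2(h))
```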

The sparsity of the activations can reach 50-70% post ReLU. Much higher sparsity can be reached if larger structures or even entire layers are skipped, at the cost of some accuracy loss.

In some of the most complicated problems, such as graphs and point clouds, the inputs are inherently sparse. This sparsity is currently not sufficiently taken advantage of to explore the model structure, and we believe this may be an interesting area of research.

III-C Sparsely Activated Models

Single-task, single-model deep learning suffers from small model capacity when handling complicated problems. [79] proposes a sparsely-gated mixture-of-experts (MoE) layer to enable only two or three sub-models (out of thousands) for each input example; this sparsely activated neural network is shown to simultaneously achieve higher learning accuracy and lower computation than the state-of-the-art single-model approach. Multi-gate MoEs are also used to explore the relationships among multiple tasks [58]. Multi-modal deep learning [8, 47] has also been developed as progress towards artificial general intelligence. It incorporates features from various modalities (e.g., video, voice, text) to activate part or all of the neural network models to accomplish one or more tasks. In [19], the combination of sparsely activated models with many tasks and modalities is projected as a promising future direction for AI research.

IV Solving Complicated Problems with Sparsity

Fig. 1: Illustration of the relations between model accuracy and the number of weights in a scalable model architecture.

In many modern vision models, such as the MobileNet family and EfficientNet, the number of weights can be controlled by one scaling factor. Thus, the same model architecture can be applied to both edge devices (when the scaling factor is small) and servers (when the scaling factor is large). Fig. 1 illustrates the relationship between the number of weights and the model accuracy in such a typical model. One interesting observation is that the accuracy of the model increases rapidly with the number of weights in a small model, as shown on the left side of Fig. 1. Such models are usually under-parameterized, and we call this region the weight dominated region, as most weights are effectively used and any change in the number of weights affects the model quality. On the other hand, towards the right side of Fig. 1, the model accuracy is insensitive to the number of weights. The model accuracy is limited more by the macro-structure than by the number of weights. We call this region structure dominated, and models in this region are usually over-parameterized.

In most of the model designs for training and inference on servers, one can usually increase the number of weights to improve accuracy and obtain a model in the structure dominated region. This, however, relies on the fact that the model is within the available computing limit, i.e., the model can be trained within a reasonable amount of time using the available computing resources. This is shown by the rightmost line in Fig. 1.

However, this approach is insufficient to solve complicated problems. As described in Section II, complicated problems are limited by the computing resources, and the quality of their results is unsatisfactory. Fig. 2 hypothetically illustrates the relation between the accuracy and the number of weights in a neural network model for a complicated problem. Line 1 is a hypothetical series of models whose structure dominated region is below the existing computing limit. As the computing capability increases in the future, it is reasonable to assume that a much larger model with much higher accuracy can be found. It is also reasonable to assume that this series of models also contains a weight dominated region and a structure dominated region, as shown by Line 2 in Fig. 2. Even though the structure dominated region of this series of models is well beyond the existing computing limit, the weight dominated region may still be partially below it. Thus, Line 2 may potentially produce higher accuracy than Line 1. Searching for a larger model in the weight dominated region under the computing limit may generate fruitful research results.

Fig. 2: Illustration of the relations between model accuracy and the number of weights. Line 1 is a hypothetical series of dense models within the existing computing limit. Line 2 is a hypothetical series of large and dense models well beyond the existing computing limit. Line 3 is a hypothetical series of large and sparse models derived from the largest model in Line 2.

On the other hand, the sparse models described in Section III-A mostly focus on the structure dominated region, as the quality of the sparse model is compared to the original dense model it is generated from, rather than to a dense model with the same FLOPs or number of weights. We hypothesize that in the weight dominated region, sparse models usually deliver much higher accuracy than dense models with the same FLOPs or number of weights. In the structure dominated region, the accuracy difference is not obvious.

In order to test our hypothesis, we have modified the MobileNetV2 0.35 model [77]. In an inverted bottleneck layer of MobileNetV2, the number of channels expands 6 times internally. We have created a series of dense models by keeping the number of expanded channels the same while varying the number of bottleneck channels, from the ratio used in the original MobileNetV2 model up to the point where the bottleneck layer has as many channels as the expanded layer. We have also generated a series of irregular sparse models with the same structure as the largest model above, but with varying sparsity so that the number of weights and FLOPs match those of the dense models. We have trained the models using the ImageNet dataset and present the validation accuracy in Fig. 3.
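The construction of the FLOPs-matched sparse models reduces to a simple counting argument, sketched below for a single 1x1 layer with hypothetical channel counts; the 192 and 32 figures are our own illustrative assumptions, not the exact numbers of the experiment.

```python
def matching_sparsity(dense_bottleneck_ch, expanded_ch):
    """For a 1x1 projection layer with `expanded_ch` inputs, the narrow dense
    model uses dense_bottleneck_ch * expanded_ch weights; the wide sparse model
    keeps all expanded_ch output channels but prunes weights so that the
    non-zero count (and hence the FLOPs) matches the dense one."""
    dense_weights = dense_bottleneck_ch * expanded_ch
    wide_weights = expanded_ch * expanded_ch
    return 1.0 - dense_weights / wide_weights

# Hypothetical numbers only: with 192 expanded channels and a 32-channel
# dense bottleneck, the FLOPs-matched wide model is ~83% sparse.
print(f"required sparsity: {matching_sparsity(32, 192):.2%}")
```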

Fig. 3 confirms our hypothesis that the accuracy difference between the sparse and dense models with the same FLOPs is larger in the weight dominated region than in the structure dominated region. This may be due to the fact that in the dense model, the number of activation channels is reduced along with the number of weight channels in the bottleneck layer. On the other hand, in the sparse model, although the number of FLOPs is reduced by sparsifying the weight channels, the number of activation channels remains the same, which gives the sparse models more freedom to select different activation channels for different neurons.

It is reasonable to extrapolate the same conclusion to large models beyond the existing computing limit: the sparse models pruned from large and dense models achieve much higher accuracy in the weight dominated region than the corresponding dense models with the same number of weights or FLOPs.

Since this line of sparsity research focuses on the weight dominated region and accepts an accuracy drop from the large and dense model, the achievable sparsity level can be much higher than the state of the art described in Section III-A.

In Fig. 2, Line 3 is a hypothetical series of sparse models, and Line 2 is the corresponding series of dense models. Since part of Line 3 lies below the computing limit while reaching higher accuracy than Line 1, it is more beneficial to search for large and sparse models directly.

However, due to the computing limit, the existing pruning mechanism described in Section III-A is no longer applicable, because the pre-trained large and dense model is well beyond the computing limit. Therefore, it is necessary to derive the large and sparse models by growing the small models directly, with all operations under the computing limit. However, there is only limited research in this direction [18, 23].

We consider this an important research area. Complicated problems benefit more than simple problems from using large and sparse models to improve accuracy. This also indicates that searching for sparse models is not only a means of improving computation efficiency, but also a means of model architecture exploration. Thus, finding an optimal sparse structure of a neural network model can be framed as a neural architecture search problem.

Fig. 3: Comparison of the validation accuracy of the sparse and dense models based on MobileNetV2 with the same number of weights and FLOPs.

V Software Frameworks on Sparse Computation

Popular machine learning frameworks such as PyTorch [71] and TensorFlow [1] have integrated sparse computation natively. The deep scalable sparse tensor network engine (DSSTNE) [6] is a sparse ML framework designed for training and inference of recommendation models with sparse inputs. These frameworks rely on sparse kernel implementations to deliver fast sparse inference and training.

There has been a long history of efforts implementing efficient sparse kernels on existing computing platforms such as CPUs and GPUs. There are three levels of sparse kernels: level one is sparse vector with dense vector (spVV) operations; level two is sparse matrix with dense vector (spMV) operations; and level three is sparse matrix with dense matrix (spMM) operations. Sparse-sparse operations are also popular for ultra-sparse matrices. The Intel math kernel library (MKL) [45] provides efficient sparse BLAS on Intel CPUs, and cuSPARSE [65] is the sparse library on NVIDIA GPUs.
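A level-three (spMM) call is sketched below in PyTorch; on NVIDIA GPUs such calls are typically backed by vendor libraries such as cuSPARSE, while the matrix size and sparsity level here are arbitrary illustrative choices.

```python
import torch

# Level-three kernel (spMM): multiply a sparse weight matrix by a dense
# activation matrix using the framework's sparse kernels.
dense_w = torch.randn(1024, 1024)
dense_w[torch.rand_like(dense_w) < 0.9] = 0.0       # roughly 90% sparse
sparse_w = dense_w.to_sparse()                      # COO representation

activations = torch.randn(1024, 64)                 # dense right-hand side
out = torch.sparse.mm(sparse_w, activations)        # spMM result: (1024, 64)

# Sanity check against the dense product (small numerical differences only).
assert torch.allclose(out, dense_w @ activations, atol=1e-3)
```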

The efficiency of the sparse kernels is determined by the sparse matrix representation, the software optimization, and the hardware architecture.

Sparse matrices are often represented using compact formats, since the majority of their values are zero. Some widely used formats are compressed sparse row (CSR), compressed sparse column (CSC), block compressed row (BSR), coordinate list (COO), and list of lists (LIL). The choice of format is application-specific and may impact performance [25]. These compact formats require explicitly or implicitly indexing the coordinates of the non-zero values, which translates to indirect memory accesses and storage overhead. Compared with their dense counterparts, these overheads fundamentally limit the performance improvement at low sparsity levels.

Commercial sparse libraries such as MKL and cuSPARSE are widely used in scientific computation, which often involves extremely sparse matrices where the number of non-zero values is far less than one percent of the total number of values [31], i.e., the sparsity level is much greater than 99%. At this level, the storage overhead of the indices and the indirect memory accesses are negligible compared with the dense matrix computation.

In neural networks, however, the sparsity of the matrices may be much lower than 99%. For example, it is desirable to achieve performance improvement when the sparsity level is around 70%. This brings many challenges to the sparse kernel designs. The block sparse library by OpenAI [30] explores efficient sparse implementations on GPUs at a block granularity of 8×8 or larger. Sparsity at the block level is suitable for computing sparse matrices exhibiting the characteristics of small world networks [88]. SparseTrain [29] leverages the dynamic sparsity introduced by ReLU and obtains speedups on convolution operators at low sparsity levels on Intel architectures with AVX-512 extensions. Such research marginally improves the execution efficiency on existing hardware.
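The sketch below generates a weight mask at the 8×8 block granularity that such block-sparse kernels expect: each block is kept or dropped as a whole, so the surviving blocks map onto dense tiles. The matrix size and block density are illustrative assumptions.

```python
import torch

def block_sparse_mask(rows, cols, block=8, density=0.3, device="cpu"):
    """Generate a mask that zeroes weights in whole block x block tiles:
    a block is either fully kept or fully dropped, so the kept blocks can
    be executed as dense tiles by a block-sparse kernel."""
    block_mask = torch.rand(rows // block, cols // block, device=device) < density
    return block_mask.repeat_interleave(block, dim=0) \
                     .repeat_interleave(block, dim=1).float()

mask = block_sparse_mask(256, 256, block=8, density=0.3)
weights = torch.randn(256, 256) * mask               # block-sparse weights
print(f"weight sparsity: {(weights == 0).float().mean().item():.2%}")
```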

VI Sparse Hardware Accelerators

The state-of-the-art pruning algorithms described in Section III are capable of introducing 70% to 99% sparsity, which translates to a 3× to 100× theoretical speedup. However, the reduced number of weights and MACs may not translate proportionally into wall clock time savings, mainly because the hardware architectures are not optimized for sparsity at this level.
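The quoted 3× to 100× range follows from the idealized relation between sparsity and speedup, assuming every zero multiply-accumulate could be skipped with no indexing or load-balance overhead:

```python
# Ideal speedup if every zero multiply-accumulate were skipped for free:
# speedup = 1 / (1 - sparsity).
for sparsity in (0.70, 0.90, 0.99):
    print(f"{sparsity:.0%} sparse -> {1.0 / (1.0 - sparsity):.1f}x ideal speedup")
```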

CPUs and GPUs are heavily optimized for dense matrix computation. The AVX-512 module in a CPU is capable of applying the same floating point (FP) operation to 16 single-precision data elements per cycle. A Tensor Core in NVIDIA GPUs performs a 4×4 matrix multiply-accumulate per cycle. A CUDA warp is most efficient when executing the same instruction on 32 data elements. The newer TPU uses a 128×128 systolic array as its matrix multiplication engine, which is most efficient when performing large matrix multiplications due to its rigid structure.

However, when performing sparse matrix multiplications, those wide execution units can only utilize a small fraction of the peak performance due to indirect memory access. Thus, many accelerators for sparse matrix multiplications have been proposed.

To speed up the large sparse matrices used in scientific computation, OuterSPACE [68] executes sparse matrix multiplication via outer products to leverage input reuse, instead of the inner products often used on dense matrices. SpArch further obtains output reuse by merging partial matrices on-chip [97]. These approaches, however, are more suitable for ultra-sparse matrices with sparsity levels much higher than 99%.

With a much lower achievable sparsity level in sparse neural networks, early sparse accelerators integrate element-wise weight sparsity through compressed storage and computation skipping of zero weights [33, 32, 94, 69]. However, the high indexing overhead of irregular element-wise sparsity motivates structured weight sparsity. Scalpel [92] proposes SIMD-aware weight sparsity, maintaining non-zero weights in aligned fixed-size groups to fully utilize the SIMD units on low-parallelism platforms for a more regular execution pattern. Taking this one step further, column-wise weight sparsity [52], block-wise weight sparsity [98], and intra-block structured weight sparsity [21, 100] are leveraged for more aggressive performance improvements. Note that the sparse weights remain static after training, which simplifies the architecture design. Since the neuron activations evolve with different inputs, dynamic activation sparsity is more difficult to exploit than static weight sparsity.

On the dynamic sparsity side, many of the existing accelerators leverage the input zeros from the previous layer produced by the ReLU function. Eyeriss [16] and SCNN [69] compress the zero activations for memory reduction and use computation gating for energy saving. EIE [33], Cnvlutin [5], NullHop [3], and others [49, 99] further skip the cycles involving zero inputs for both energy saving and execution acceleration. SparTen [28], with weight sorting, and SNAP [93], with associative index matching, attempt to mitigate the MAC under-utilization due to load imbalance in sparse weight and input processing. Diffy [60] exploits input spatial similarity for computation saving and data compression. Some accelerators also skip the negative outputs that would be rectified to zeros by the ReLU function. Y. Lin et al. [54] and M. Song et al. [80] leverage the high-order bits of the data to approximate the output activations; SnaPEA [4] proposes to reorder the sequence of MACs and early-terminate the execution upon predicting a negative output. Instead of fine-grain bit-level operations [80, 54] and early calculation of predictable outputs [4], dynamic channel gating [41] can also be used to reduce computation.

Most of the above-mentioned accelerators are designed to improve the execution efficiency of inference workloads. Many of them target edge devices with fewer processing elements (PEs).

As described in Section IV, when targeting complicated problems with sparse models close to the computing limit, it is preferable to grow the sparse model from a smaller network. In this process, it is important to be able to train a sparse network directly, so that the entire training can be done below the computing limit. Thus, efficient sparse training hardware is essential to solve the complicated problems.

A few recent works have started to target sparse training on the server side. SIGMA [74] proposes flexible routing and reduction networks in hardware to reduce the indirect memory reference overhead. Its 128×128 Flex-DPE matches the size of a TPU systolic array but delivers higher utilization of the PEs. We consider this an important area of research, and more attention from the research community is warranted.

VII Future Research Directions

The current research on sparsity has been primarily focused on structure dominated models targeting simple problems. The efforts on sparse computation frameworks and sparse hardware accelerators mainly try to improve the computation efficiency, i.e., reducing the amount of computation of a large and dense model.

In order to effectively and efficiently solve complicated problems, we need to consider sparsity as a first-class, generic model exploration methodology, which is necessary for the following reasons:

  • The large and sparse models close to the computing limit can only be grown from the smaller models, rather than being pruned from the even larger models.

  • The sparse algorithms may be universally applied to different application domains, such as vision, NLP, and recommendation, and it may achieve acceptable results.

  • A sparse model structure may be specialized on the training dataset. Thus, exploration is needed to update the sparse model structure when transferring the model to a different application or domain.

Since the existing CPU and GPU implementations may not fulfill the needs of training and inference for sparse models, new hardware accelerators and the corresponding software frameworks need to be developed to speed up the run time. When working with models for complicated problems, the iteration speed is critical. It is essential to design new sparse training accelerators specifically targeting the sparsity levels those models are likely to fall into (e.g., 90-99%).

To make significant progress in sparse model research, we need to make coordinated breakthroughs in algorithms, software, and hardware. These three disciplines are tightly intertwined, and a co-design approach is preferred.

VIII Conclusion

The amount of computation we currently have limits our ability to explore some complicated problems, such as video, point clouds, transformer-based NLP, and graphs. Even though the amount of computation used to train the largest models doubles every 3.4 months, the trend is unlikely to be economically sustainable. Thus, we have expressed our view on alleviating the computation scarcity problem by basing future neural network designs on sparse matrices rather than dense matrices. We have shown that in the weight dominated region, sparse models achieve much higher accuracy than dense models with the same number of FLOPs or weights, and we project that we may observe the same on complicated problems. We need to make breakthroughs in the computation of sparse neural networks across the stack, from algorithms and software to hardware, among which new sparse training hardware is essential to facilitate fast iteration of the sparse algorithms.

References

  • [1] M. Abadi, P. Barham, J. Chen et al., “TensorFlow: A system for large-scale machine learning,” in Proc. Symp. Operating Systems Design & Implementation, 2016, pp. 265–283.
  • [2] T. Abtahi, C. Shea, A. M. Kulkarni, and T. Mohsenin, “Accelerating convolutional neural network with FFT on embedded hardware,” IEEE Trans. VLSI Systems, vol. 26, no. 9, pp. 1737–1749, Sept. 2018.
  • [3] A. Aimar, H. Mostafa, E. Calabrese et al., “Nullhop: A flexible convolutional neural network accelerator based on sparse representations of feature maps,” IEEE Trans. Neural Networks and Learning Systems, no. 99, pp. 1–13, 2018.
  • [4] V. Akhlaghi, A. Yazdanbakhsh, K. Samadi, R. K. Gupta, and H. Esmaeilzadeh, “Snapea: Predictive early activation for reducing computation in deep convolutional neural networks,” in Proc. Int. Symp. Computer Architecture, 2018, pp. 662–673.
  • [5] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, “Cnvlutin: Ineffectual-neuron-free deep neural network computing,” in Proc. Int. Symp. Computer Architecture, 2016, pp. 1–13.
  • [6] Amazon, “Amazon DSSTNE: deep scalable sparse tensor network engine.” [Online]. Available: https://github.com/amzn/amazon-dsstne
  • [7] A. Ardakani, C. Condo, and W. J. Gross, “Activation pruning of deep convolutional neural networks,” in IEEE Global Conf. Signal and Information Processing (GlobalSIP), 2017, pp. 1325–1329.
  • [8] T. Baltrusaitis, C. Ahuja, and L.-P. Morency, “Multimodal machine learning: a survey and taxonomy,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 41, no. 2, p. 423–443, Feb. 2019.
  • [9] K. Berestizshevsky and G. Even, “Sacrificing accuracy for reduced computation: Cascaded inference based on softmax confidence,” arXiv preprint arXiv:1805.10982, 2018.
  • [10] T. Bolukbasi, J. Wang, O. Dekel, and V. Saligrama, “Adaptive neural networks for efficient inference,” in Proc. Int. Conf. Machine Learning, 2017, pp. 527–536.
  • [11] H. Cai, C. Gan, and S. Han, “Once for all: Train one network and specialize it for efficient deployment,” in Proc. Int. Conf. Learning Representations, 2020.
  • [12] V. Campos, B. Jou, X. G. i Nieto, J. Torres, and S.-F. Chang, “Skip RNN: Learning to skip state updates in recurrent neural networks,” in Proc. Int. Conf. Learning Representations, 2018.
  • [13] S. Cao, L. Ma, W. Xiao, C. Zhang, Y. Liu, L. Zhang, L. Nie, and Z. Yang, “SeerNet: Predicting convolutional neural network feature-map sparsity through low-bit quantization,” in Proc. Conf. Computer Vision and Pattern Recognition, 2019, pp. 11216–11225.
  • [14] J. D. Carroll and J.-J. Chang, “Analysis of individual differences in multidimensional scaling via an n-way generalization of “eckart-young” decomposition,” Psychometrika, vol. 35, pp. 283–319, 1970.
  • [15] J. D. Carroll, S. Pruzansky, and J. B. Kruskal, “Candelinc: A general approach to multidimensional analysis of many-way arrays with linear constraints on parameters,” Psychometrika, vol. 45, pp. 3–24, 1980.
  • [16] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.
  • [17] C. Choy, J. Gwak, and S. Savarese, “4D spatio-temporal convnets: Minkowski convolutional neural networks,” in Proc. Conf. Computer Vision and Pattern Recognition, 2019, pp. 3075–3084.
  • [18] X. Dai, H. Yin, and N. K. Jha, “Nest: A neural network synthesis tool based on a grow-and-prune paradigm,” in IEEE Trans. Computers, vol. 68, no. 10, Oct. 2019, pp. 1487–1497.
  • [19] J. Dean, “The deep learning revolution and its implications for computer architecture and chip design,” arXiv preprint arXiv:1911.05289, 2019.
  • [20] A. Degenne and M. Forsé, Introducing social networks.   Sage, 1999.
  • [21] C. Deng, S. Liao, Y. Xie, K. K. Parhi, X. Qian, and B. Yuan, “PermDNN: Efficient compressed DNN architecture with permuted diagonal matrices,” in Proc. Int. Symp. Microarchitecture, 2018, pp. 189–202.
  • [22] X. Dong, J. Huang, Y. Yang, and S. Yan, “More is less: A more complicated network with less inference complexity,” arXiv preprint arXiv:1703.08651, 2017.
  • [23] X. Du, Z. Li, and Y. Cao, “CGaP: Continuous growth and pruning for efficient deep learning,” arXiv preprint arXiv:1905.11533, 2019.
  • [24] M. Figurnov, M. D. Collins, Y. Zhu, L. Zhang, J. Huang, D. P. Vetrov, and R. Salakhutdinov, “Spatially adaptive computation time for residual networks,” Proc. Conf. Computer Vision and Pattern Recognition, pp. 1790–1799, 2017.
  • [25] S. Filippone, V. Cardellini, D. Barbieri, and A. Fanfarillo, “Sparse matrix-vector multiplication on GPGPUs,” ACM trans. Mathematical Software, vol. 43, no. 4, pp. 1–49, Mar. 2017.
  • [26] J. Frankle and M. Carbin, “The lottery ticket hypothesis: Training pruned neural networks,” in Proc. Int. Conf. Learning Representations, 2019.
  • [27] X. Gao, Y. Zhao, Łukasz Dudziak, R. Mullins, and C. zhong Xu, “Dynamic channel pruning: Feature boosting and suppression,” arXiv preprint arXiv:1810.05331, 2018.
  • [28] A. Gondimalla, N. Chesnut, M. Thottethodi, and T. N. Vijaykumar, “Sparten: A sparse tensor accelerator for convolutional neural networks,” in Proc. Int. Symp. Microarchitecture, 2019, p. 151–165.
  • [29] Z. Gong, H. Ji, C. Fletcher, C. Hughes, and J. Torrellas, “SparseTrain: Leveraging dynamic sparsity in training DNNs on general-purpose SIMD processors,” arXiv preprint arXiv:1911.10175, 2019.
  • [30] S. Gray, A. Radford, and D. P. Kingma, “GPU kernels for block-sparse weights,” Technical report, OpenAI, Tech. Rep., 2017.
  • [31] M. Grossman, C. Thiele, M. Araya-Polo, F. Frank, F. O. Alpak, and V. Sarkar, “A survey of sparse matrix-vector multiplication performance on large matrices,” arXiv preprint arXiv:1608.00636, 2016.
  • [32] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, H. Yang, and W. J. Dally, “ESE: Efficient speech recognition engine with sparse LSTM on FPGA,” in Proc. Int. Symp. Field-Programmable Gate Arrays, 2017, pp. 75–84.
  • [33] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: efficient inference engine on compressed deep neural network,” in Proc. Int. Symp. Computer Architecture, 2016, pp. 243–254.
  • [34] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” in Proc. Int. Conf. Learning Representations, 2016.
  • [35] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Proc. Conf. Neural Information Processing Systems, 2015, pp. 1135–1143.
  • [36] R. A. Harshman, “PARAFAC2: Mathematical and technical notes,” UCLA Working Papers in Phonetics, vol. 22, pp. 30–44, 1972b.
  • [37] R. A. Harshman, Margaret, and E. Lundy, “Uniqueness proof for a family of models sharing features of tucker’s three-mode factor analysis and parafac/candecomp,” Psychometrika, 1996.
  • [38] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: surpassing human-level performance on ImageNet classification,” in Proc. Int. Conf. Computer Vision, 2015, pp. 1026–1034.
  • [39] F. L. Hitchcock, “The expression of a tensor or a polyadic as a sum of products,” J. of Mathematics and Physics, vol. 6, no. 1-4, pp. 164–189, 1927.
  • [40] Hitchcock, Frank L., “Multiple invariants and generalized rank of a p-way matrix or tensor,” J. of Mathematics and Physics, vol. 7, no. 1-4, pp. 39–79, 1928.
  • [41] W. Hua, Y. Zhou, C. De Sa, Z. Zhang, and G. E. Suh, “Boosting the performance of CNN accelerators with dynamic fine-grained channel gating,” in Proc. Int. Symp. Microarchitecture, 2019, p. 139–150.
  • [42] W. Hua, Y. Zhou, C. M. De Sa, Z. Zhang, and G. E. Suh, “Channel gating neural networks,” in Proc. Conf. Neural Information Processing Systems, 2019, pp. 1884–1894.
  • [43] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger, “Multi-scale dense networks for resource efficient image classification,” in Proc. Int. Conf. Learning Representations, 2017.
  • [44] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,” J. Machine Learning Research, vol. 18, no. 1, p. 6869–6898, Jan. 2017.
  • [45] Intel, “Intel math kernel library.” [Online]. Available: https://software.intel.com/en-us/mkl
  • [46] E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with gumbel-softmax,” arXiv preprint arXiv:1611.01144, 2017.
  • [47] L. Kaiser, A. N. Gomez, N. Shazeer, A. Vaswani, N. Parmar, L. Jones, and J. Uszkoreit, “One model to learn them all,” arXiv preprint arXiv:1706.05137, 2017.
  • [48] Y. Kaya, S. Hong, and T. Dumitras, “Shallow-deep networks: Understanding and mitigating network overthinking,” in Proc. Int. Conf. Machine Learning, Jun 2019.
  • [49] D. Kim, J. Ahn, and S. Yoo, “ZeNA: Zero-aware neural network accelerator,” IEEE Design & Test, vol. 35, no. 1, pp. 39–46, 2018.
  • [50] D. Kim, S. Kim, and S. Yoo, “FPGA prototyping of low-precision zero-skipping accelerator for neural networks,” in Int. Symp. Rapid System Prototyping (RSP), 2018, pp. 104–110.
  • [51] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Conf. Neural Information Processing Systems, Dec. 2012, pp. 1097–1105.
  • [52] H. Kung, B. McDanel, and S. Q. Zhang, “Packing sparse convolutional neural networks for efficient systolic array implementations: Column combining under joint optimization,” arXiv preprint arXiv:1811.04770, 2018.
  • [53] A. Lavin and S. Gray, “Fast algorithms for convolutional neural networks,” in Proc. Conf. Computer Vision and Pattern Recognition, June 2016, pp. 4013–4021.
  • [54] Y. Lin, C. Sakr, Y. Kim, and N. Shanbhag, “PredictiveNet: an energy-efficient convolutional neural network via zero prediction,” in Proc. Int. Symp. Circuits & Systems, 2017, pp. 1–4.
  • [55] L. Liu, L. Deng, X. Hu, M. Zhu, G. Li, Y. Ding, and Y. Xie, “Dynamic sparse graph for efficient deep learning,” in Proc. Int. Conf. Learning Representations, 2019.
  • [56] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, “Rethinking the value of network pruning,” in Proc. Int. Conf. Learning Representations, 2018.
  • [57] C. Louizos, M. Welling, and D. P. Kingma, “Learning sparse neural networks through l0 regularization,” in Proc. Int. Conf. Learning Representations, 2018.
  • [58] J. Ma, Z. Zhao, X. Yi, J. Chen, L. Hong, and E. H. Chi, “Modeling task relationships in multi-task learning with multi-gate mixture-of-experts,” in Proc. Int. Conf. Knowledge Discovery & Data Mining, 2018, p. 1930–1939.
  • [59] X. Ma, F.-M. Guo, W. Niu, X. Lin, J. Tang, K. Ma, B. Ren, and Y. Wang, “PCONV: The missing but desirable sparsity in DNN weight pruning for real-time execution on mobile devices,” arXiv preprint arXiv:1909.05073, 2019.
  • [60] M. Mahmoud, K. Siu, and A. Moshovos, “Diffy: a déjà vu-free differential deep neural network accelerator,” in Proc. Int. Symp. Microarchitecture, 2018, pp. 134–147.
  • [61] K. Makantasis, K. Karantzalos, A. Doulamis, and N. Doulamis, “Deep supervised learning for hyperspectral data classification through convolutional neural networks,” in Int. Geoscience and Remote Sensing Symp., 2015, pp. 4959–4962.
  • [62] H. Mikami, H. Suganuma, P. U-chupala, Y. Tanaka, and Y. Kageyama, “Massively distributed SGD: imagenet/resnet-50 training in a flash,” arXiv preprint arXiv:1811.05233, 2018.
  • [63] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
  • [64] D. Neil, J. H. Lee, T. Delbruck, and S.-C. Liu, “Delta networks for optimized recurrent network computation,” in Proc. Int. Conf. Machine Learning, 2017, pp. 2584–2593.
  • [65] NVIDIA, “NVIDIA CUDA sparse matrix library.” [Online]. Available: https://developer.nvidia.com/cusparse
  • [66] OpenAI, “AI and compute.” [Online]. Available: https://openai.com/blog/ai-and-compute/
  • [67] OpenAI, “The staggering cost of training SOTA AI models.” [Online]. Available: https://medium.com/syncedreview/the-staggering-cost-of-training-sota-ai-models-e329e80fa82
  • [68] S. Pal, J. Beaumont, D.-H. Park et al., “OuterSPACE: an outer product based sparse matrix multiplication accelerator,” in Proc. Int. Symp. High Performance Computer Architecture, 2018, pp. 724–736.
  • [69] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “SCNN: An accelerator for compressed-sparse convolutional neural networks,” in Proc. Int. Symp. Computer Architecture, 2017, pp. 27–40.
  • [70] E. Park, D. Kim, and S. Yoo, “Energy-efficient neural network accelerator based on outlier-aware low-precision computation,” in Proc. Int. Symp. Computer Architecture, 2018, pp. 688–698.
  • [71] A. Paszke, S. Gross, F. Massa et al., “PyTorch: An imperative style, high-performance deep learning library,” in Proc. Conf. Neural Information Processing Systems, 2019, pp. 8026–8037.
  • [72] H. Pratt, B. Williams, F. Coenen, and Y. Zheng, “FCNN: Fourier convolutional neural networks,” in Machine Learning and Knowledge Discovery in Databases, 2017, pp. 786–798.
  • [73] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” in Proc. Conf. Neural Information Processing Systems, 2017, pp. 5099–5108.
  • [74] E. Qin, A. Samajdar, H. Kwon, V. Nadella, S. Srinivasan, D. Das, B. Kaul, and T. Krishna, “SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training,” in Proc. Int. Symp. High Performance Computer Architecture, 2020.
  • [75] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” OpenAI Blog, p. 9, 2019.
  • [76] A. Ren, T. Zhang, S. Ye, J. Li, W. Xu, X. Qian, X. Lin, and Y. Wang, “ADMM-NN: An algorithm-hardware co-design framework of DNNs using alternating direction methods of multipliers,” in Proc. Int. Conf. Architectural Support for Programming Languages and Operating Systems, 2019, pp. 925–938.
  • [77] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” in Proc. Conf. Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
  • [78] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Trans. Neural Networks, vol. 20, no. 1, pp. 61–80, 2008.
  • [79] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” in Proc. Int. Conf. Learning Representations, 2017.
  • [80] M. Song, J. Zhao, Y. Hu, J. Zhang, and T. Li, “Prediction based execution on deep neural networks,” in Proc. Int. Symp. Computer Architecture.   IEEE, 2018, pp. 752–763.
  • [81] H. Summala, “Risk control is not risk adjustment: The zero-risk theory of driver behaviour and its implications,” Ergonomics, pp. 491–506, 1988.
  • [82] S. Teerapittayanon, B. McDanel, and H.-T. Kung, “BranchyNet: Fast inference via early exiting from deep neural networks,” in Proc. Int. Conf. Pattern Recognition, 2016, pp. 2464–2469.
  • [83] R. Teja Mullapudi, W. R. Mark, N. Shazeer, and K. Fatahalian, “HydraNets: Specialized dynamic architectures for efficient inference,” in Proc. Conf. Computer Vision and Pattern Recognition, 2018, pp. 8080–8089.
  • [84] W. F. Tinney and J. W. Walker, “Direct solutions of sparse network equations by optimally ordered triangular factorization,” Proc. IEEE, vol. 55, no. 11, pp. 1801–1809, 1967.
  • [85] L. R. Tucker, “Implications of factor analysis of three-way matrices for measurement of change,” in Problems in measuring change., C. W. Harris, Ed.   University of Wisconsin Press, 1963, pp. 122–137.
  • [86] A. Veit and S. Belongie, “Convolutional networks with adaptive inference graphs,” in Proc. European Conf. Computer Vision., 2018, pp. 3–18.
  • [87] X. Wang, F. Yu, Z.-Y. Dou, T. Darrell, and J. E. Gonzalez, “SkipNet: Learning dynamic routing in convolutional networks,” in Proc. European Conf. Computer Vision., 2018, pp. 409–424.
  • [88] D. J. Watts, Small Worlds: The Dynamics of Networks between Order and Randomness.   Princeton University Press, 2004, vol. 9.
  • [89] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Proc. Conf. Neural Information Processing Systems, 2016, pp. 2074–2082.
  • [90] S. Winograd, Arithmetic Complexity of Computations.   Siam, 1980, vol. 33.
  • [91] Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. S. Davis, K. Grauman, and R. Feris, “Blockdrop: Dynamic inference paths in residual networks,” in Proc. Conf. Computer Vision and Pattern Recognition, 2018.
  • [92] J. Yu, A. Lukefahr, D. Palframan, G. Dasika, R. Das, and S. Mahlke, “Scalpel: Customizing DNN pruning to the underlying hardware parallelism,” in Proc. Int. Symp. Computer Architecture, 2017, pp. 548–560.
  • [93] J.-F. Zhang, C.-E. Lee, C. Liu, Y. S. Shao, S. W. Keckler, and Z. Zhang, “SNAP: A 1.67—21.55 TOPS/W sparse neural acceleration processor for unstructured sparse deep neural network inference in 16nm CMOS,” in Symp. VLSI Circuits, 2019, pp. C306–C307.
  • [94] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, “Cambricon-X: An accelerator for sparse neural networks,” in Proc. Int. Symp. Microarchitecture, 2016, pp. 1–12.
  • [95] T. Zhang, S. Ye, K. Zhang, J. Tang, W. Wen, M. Fardad, and Y. Wang, “A systematic DNN weight pruning framework using alternating direction method of multipliers,” in Proc. European Conf. Computer Vision., 2018, pp. 184–199.
  • [96] X. Zhang, C. Xie, J. Wang, W. Zhang, and X. Fu, “Towards memory friendly long-short term memory networks (LSTMs) on mobile GPUs,” in Proc. Int. Symp. Microarchitecture, Oct 2018, pp. 162–174.
  • [97] Z. Zhang, H. Wang, S. Han, and W. J. Dally, “SpArch: Efficient architecture for sparse matrix multiplication,” arXiv preprint arXiv:2002.08947, 2020.
  • [98] X. Zhou, Z. Du, Q. Guo, S. Liu, C. Liu, C. Wang, X. Zhou, L. Li, T. Chen, and Y. Chen, “Cambricon-S: Addressing irregularity in sparse neural networks through a cooperative software/hardware approach,” in Proc. Int. Symp. Microarchitecture, 2018, pp. 15–28.
  • [99] J. Zhu, J. Jiang, X. Chen, and C.-Y. Tsui, “SparseNN: An energy-efficient neural network accelerator exploiting input and output sparsity,” in Proc. Design Automation & Test Europe Conf., 2018, pp. 241–244.
  • [100] M. Zhu, T. Zhang, Z. Gu, and Y. Xie, “Sparse tensor core: Algorithm and hardware co-design for vector-wise sparse neural networks on modern GPUs,” in Proc. Int. Symp. Microarchitecture, 2019, pp. 359–371.