1 Introduction
Deep Neural Networks (DNNs) lecun2015deep have achieved impressive progress in a wide spectrum of domains simonyan2014very ; he2016deep ; abdel2014convolutional ; redmon2016yolo9000 ; wu2016google , but the models are extremely memory- and compute-intensive. The high representational and computational cost has motivated many researchers to investigate approaches for improving execution performance, including matrix or tensor decomposition xue2014singular ; novikov2015tensorizing ; garipov2016ultimate ; yang2017tensor ; alvarez2017compression , data quantization courbariaux2016binarized ; zhou2016dorefa ; deng2018gxnor ; leng2017extremely ; wen2017terngrad ; wu2018training ; mckinstry2018discovering , and network pruning ardakani2016sparsely ; han2015learning ; han2015deep ; liu2017learning ; li2016pruning ; he2017channel ; luo2017thinet ; wen2016learning ; molchanov2016pruning ; sun2017meprop ; spring2017scalable ; lin2017predictivenet ; zhang2018adam . However, most previous work targets inference, while the challenges of reducing the representational and computational cost of training are not well studied. Although some work demonstrates acceleration in distributed training lin2017deep ; goyal2017accurate ; you2017imagenet , we target single-node optimization, and our method can also boost training in a distributed fashion.
DNN training, which demands much more hardware resources in terms of both memory capacity and computation volume, is far more challenging than inference. First, activation data in training must be stored for backpropagation, which significantly increases memory consumption. Second, training iteratively updates model parameters using mini-batched stochastic gradient descent (SGD). We almost always expect larger mini-batches for higher throughput (Figure 1(a)), faster convergence, and better accuracy smith2017don . However, memory capacity is often the limiting factor (Figure 1(b)); it may cause performance degradation or even make large models with deep structures or targeting high-resolution vision tasks hard to train he2016deep ; wu2018group .
It is difficult to apply existing inference-oriented sparsity techniques to the training phase for the following reasons: 1) Prior art mainly compresses the pretrained and fixed weight parameters to reduce off-chip memory access in inference han2016eie ; han2017ese , whereas the dynamic neuronal activations turn out to be the crucial bottleneck in training jain2018gist , making the prior inference-oriented methods inefficient. Besides, during training we need to stash a vast batched activation space for the backward gradient calculation. Therefore, neuron activations create a new memory bottleneck (Figure 1(c)). In this paper, we sparsify the neuron activations for training compression. 2) Existing inference acceleration methods usually add extra optimization problems onto the critical path wen2016learning ; molchanov2016pruning ; liu2017learning ; luo2017thinet ; liang2018crossbar ; zhang2018adam , i.e., ‘complicated training, simplified inference’, which further complicates the training phase. 3) Moreover, previous studies reveal that batch normalization (BN) is crucial for improving accuracy and robustness (Figure 1(d)) through activation fusion across different samples within one mini-batch for better representation morcos2018importance ; ioffe2015batch . BN has become almost a standard training configuration; however, inference-oriented methods seldom discuss BN and treat BN parameters as scaling and shift factors in the forward pass. We further find that BN damages the sparsity due to the activation reorganization (Figure 1(e)). Since this work targets both training and inference, the BN compatibility problem must be addressed.
From the view of information representation, the activation of each neuron reflects its selectivity to the current stimulus sample morcos2018importance , and this selectivity dataflow propagates layer by layer, forming different representation levels. Fortunately, there is much representational redundancy; for example, many neuron activations for each stimulus sample are so small that they can be removed (Figure 1(f)). Motivated by the above analysis of memory and compute, we propose to search for critical neurons to construct a sparse graph at every iteration. By activating only a small number of neurons with high selectivity, we can significantly save memory and simplify computation with tolerable accuracy degradation. Because the neuron response changes dynamically under different stimulus samples, the sparse graph is variable. The neuron-aware dynamic and sparse graph (DSG) is fundamentally distinct from the static one in previous work on permanent weight pruning, since we never prune the graph but activate only part of it each time. Therefore, we maintain the model's expressive power as much as possible. A graph selection method, dimension-reduction search (DRS), is designed for both compressible activations with element-wise unstructured sparsity and accelerative vector-matrix multiplication (VMM) with vector-wise structured sparsity. Through a double-mask selection (DMS) design, it is also compatible with BN. We can use the same selection pattern to extend our method to inference. In a nutshell, we propose a compressible and accelerative DSG approach supported by the DRS and DMS methods. It can achieve 1.7-4.5x memory compression and 2.3-4.4x computation reduction with minimal accuracy loss. This work simultaneously pioneers the approach towards efficient online training and offline inference, which can benefit deep learning in both the cloud and the edge.
2 Approach
Our method forms DSGs for different inputs, which are accelerative and compressive, as shown in Figure 2(a). On the one hand, by choosing a small number of critical neurons to participate in computation, DSG reduces the computational cost by eliminating the calculations of noncritical neurons. On the other hand, it further reduces the representational cost via compression of the sparsified activations. Unlike previous methods that use permanent pruning, our approach does not prune any neuron or its associated weights; instead, it activates a sparse graph according to the input sample at each iteration. Therefore, DSG does not compromise the expressive power of the model.
Constructing a DSG requires determining which neurons are critical. A naive approach is to select critical neurons according to the output activations. If the output neurons have a small or negative activation value, i.e., they are not selective to the current input sample, they can be removed to save representational cost. Because these activations will be small or exactly zero after the following ReLU nonlinear function (i.e., ReLU($x$) = max(0, $x$)), it is reasonable to set all of them to zero. However, this naive approach requires computing all VMM operations within each layer before the selection of critical neurons, which is very costly.
2.1 Dimension Reduction Search
To avoid the costly VMM operations in the naive selection above, we propose an efficient method, dimension reduction search (DRS), to estimate the importance of output neurons. As shown in Figure 2(b), we first reduce the dimensions of X and W, and then execute lightweight VMM operations in the low-dimensional space at minimal cost. After that, we estimate the neuron importance according to the virtual output activations. A binary mask can then be produced, in which the zeros represent the noncritical neurons with small activations that are removable. We use a top-k search that keeps only the k largest neurons, where an inter-sample threshold sharing mechanism is leveraged to greatly reduce the search cost (implementation details are given in the Appendices). Note that k is determined by the output size and a preconfigured sparsity parameter. We can then compute the accurate activations of the critical neurons in the original high-dimensional space and avoid calculating the noncritical neurons. Thus, besides the compressive sparse activations, DRS further saves a significant amount of expensive operations in the high-dimensional space.
In this way, a vector-wise structured sparsity can be achieved, as shown in Figure 3(b). The ones in the selection mask (marked as colored blocks) denote the critical neurons, and the noncritical ones bypass the memory access and computation of the corresponding whole column of the weight matrix. Furthermore, the generated sparse activations can be compressed via zero-value compression zhang2000frequent ; vijaykumar2015case ; rhu2018compressing (Figure 3(c)). Consequently, it is critical to reduce the vector dimension while keeping the activations calculated in the low-dimensional space as accurate as possible relative to the ones in the original high-dimensional space.
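The selection flow described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`random_projection`, `project`, `drs_forward`) are hypothetical, a dense Gaussian matrix stands in for the projection, weights are re-projected on every call for brevity, and the inter-sample threshold sharing mechanism is not modeled.

```python
import random


def random_projection(k, d, seed=0):
    """k x d Gaussian projection matrix (a stand-in for any JL-style map)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]


def project(R, v):
    """Map a length-d vector v to k dimensions with the k x d matrix R."""
    return [sum(r * a for r, a in zip(row, v)) for row in R]


def drs_forward(x, W, R, keep):
    """Select `keep` output neurons from low-dimensional estimates, then
    compute exact ReLU activations for the selected neurons only.

    x: input vector (length d); W: d x n weight matrix given as a list of rows.
    Returns (activations, mask), both of length n.
    """
    d, n = len(W), len(W[0])
    columns = [[W[r][c] for r in range(d)] for c in range(n)]
    x_low = project(R, x)
    # Estimated ("virtual") activations from reduced-dimension inner products.
    estimates = [sum(a * b for a, b in zip(x_low, project(R, col)))
                 for col in columns]
    # Top-k search: indices of the `keep` largest estimates.
    selected = sorted(range(n), key=lambda c: estimates[c], reverse=True)[:keep]
    mask = [0] * n
    activations = [0.0] * n
    for c in selected:
        mask[c] = 1
        # Exact high-dimensional inner product, only for critical neurons.
        activations[c] = max(0.0, sum(a * b for a, b in zip(x, columns[c])))
    return activations, mask
```

Each unselected neuron skips a whole weight column, which is the vector-wise structured sparsity referred to above; in a real layer the projected weights would presumably be computed once per weight update rather than once per input.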
2.2 Sparse Random Projection for Efficient DRS
Notations: Each CONV layer has a four-dimensional weight tensor ($n_K$, $n_C$, $n_R$, $n_S$), where $n_K$ is the number of filters, i.e., the number of output feature maps (FMs); $n_C$ is the number of input FMs; and ($n_R$, $n_S$) is the kernel size. Thus, the CONV layer in Figure 3(a) can be converted to many VMM operations, as shown in Figure 3(b). Each row in the matrix of input FMs holds the activations from a sliding window across all input FMs ($n_C n_R n_S$ elements), and after the VMM operation with the weight matrix ($n_C n_R n_S \times n_K$) it generates $n_K$ points at the same location across all output FMs. Further considering the size $n_P \times n_Q$ of each output FM and the mini-batch size $m$, the full set of $m \times n_P \times n_Q$ rows of VMM operations has a computational complexity of $O(m\, n_P n_Q\, n_C n_R n_S\, n_K)$. For the FC layer with $n_C$ input neurons and $n_K$ output neurons, this complexity is $O(m\, n_C n_K)$. Note that here we switch the order of the BN and ReLU layers from ‘CONV/FC-BN-ReLU’ to ‘CONV/FC-ReLU-BN’, because it is hard to determine the activation value of the noncritical neurons if the following layer is BN (this value is zero for ReLU). As shown in previous work, this reorganization can bring better accuracy mishkin2015all .
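The CONV-to-VMM conversion in the notation paragraph can be illustrated as follows. This is a sketch assuming stride 1 and no padding; the helper names `im2col` and `conv_as_vmm` and the lower-case variable names are illustrative choices, not taken from the paper.

```python
def im2col(fmaps, n_r, n_s):
    """Unfold input feature maps into one row per sliding window.

    fmaps: list of n_c feature maps, each an H x W list of lists.
    Returns (H - n_r + 1) * (W - n_s + 1) rows of length n_c * n_r * n_s.
    """
    n_c = len(fmaps)
    h, w = len(fmaps[0]), len(fmaps[0][0])
    rows = []
    for i in range(h - n_r + 1):
        for j in range(w - n_s + 1):
            rows.append([fmaps[c][i + a][j + b]
                         for c in range(n_c)
                         for a in range(n_r)
                         for b in range(n_s)])
    return rows


def conv_as_vmm(fmaps, filters):
    """Convolution expressed as one vector-matrix product per sliding window.

    filters: n_k filters, each indexed as [n_c][n_r][n_s].
    Returns a matrix with one row per window and one column per filter.
    """
    n_r, n_s = len(filters[0][0]), len(filters[0][0][0])
    # Each filter becomes one column of the (n_c * n_r * n_s) x n_k weight matrix.
    columns = [[f[c][a][b]
                for c in range(len(f))
                for a in range(n_r)
                for b in range(n_s)]
               for f in filters]
    return [[sum(x * wt for x, wt in zip(row, col)) for col in columns]
            for row in im2col(fmaps, n_r, n_s)]
```

Every output entry costs one inner product of length n_c * n_r * n_s, which is where the per-sample product of window count, window length, and filter count in the complexity expression comes from.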
For simplicity, we consider the operation for each sliding window in the CONV layer, or the whole FC layer, under one single input sample as a basic optimization problem. The generation of each output activation requires an inner product operation, as follows:
$$y_j = \varphi(\langle X_i, W_j \rangle) \qquad (1)$$
where $X_i$ is the $i$-th row in the matrix of input FMs (for the FC layer, there is only one X vector), $W_j$ is the $j$-th column of the weight matrix $W$, and $\varphi(\cdot)$ is the neuronal transformation (e.g., the ReLU function; here we omit the bias). According to equation (1), preserving the activation is equivalent to preserving the inner product.
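We introduce a dimension-reduction lemma, the Johnson–Lindenstrauss Lemma (JLL) johnson1984extensions , to implement DRS with inner product preservation. This lemma states that a set of points in a high-dimensional space can be embedded into a low-dimensional space in such a way that the Euclidean distances between these points are nearly preserved. Specifically, given $0 < \epsilon < 1$, a set of $N$ points in $\mathbb{R}^d$ (i.e., all $X_i$ and $W_j$), and a number $k > O(\log(N)/\epsilon^2)$, there exists a linear map $f: \mathbb{R}^d \to \mathbb{R}^k$ such that
$$(1-\epsilon)\,\|X_i - W_j\|^2 \le \|f(X_i) - f(W_j)\|^2 \le (1+\epsilon)\,\|X_i - W_j\|^2 \qquad (2)$$
for any given $X_i$ and $W_j$ pair, where $\epsilon$ is a hyperparameter that controls the approximation error, i.e., a larger $\epsilon$ gives a larger error. When $\epsilon$ is sufficiently small, one corollary of JLL is the following norm preservation vu2016random ; Kakade_cmsc35900 :
$$P\left[(1-\epsilon)\,\|Z\|^2 \le \|f(Z)\|^2 \le (1+\epsilon)\,\|Z\|^2\right] \ge 1 - O(\epsilon^2) \qquad (3)$$
where Z could be any $X_i$ or $W_j$, and $P$ denotes a probability. This means the vector norm can be preserved with a high probability controlled by $\epsilon$. Given these basics, we can further obtain the inner product preservation:
$$P\left[\,\left|\langle f(X_i), f(W_j)\rangle - \langle X_i, W_j\rangle\right| \le \frac{\epsilon}{2}\left(\|X_i\|^2 + \|W_j\|^2\right)\right] \ge 1 - O(\epsilon^2) \qquad (4)$$
The detailed proof can be found in the Appendices.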
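As a rough sketch of why norm preservation leads to inner product preservation, the following derivation uses the polarization identity and the parallelogram law. It assumes that the map $f$ is linear and that the norm bound with error $\epsilon$ holds simultaneously for $X_i + W_j$ and $X_i - W_j$; the probability bookkeeping is omitted and is left to the proof in the Appendices.

```latex
\begin{align*}
\langle X_i, W_j \rangle
  &= \tfrac{1}{4}\left(\|X_i + W_j\|^2 - \|X_i - W_j\|^2\right)
  && \text{(polarization identity)} \\
\langle f(X_i), f(W_j) \rangle
  &= \tfrac{1}{4}\left(\|f(X_i + W_j)\|^2 - \|f(X_i - W_j)\|^2\right)
  && \text{(linearity of } f\text{)} \\
  &\ge \tfrac{1}{4}\left((1-\epsilon)\|X_i + W_j\|^2 - (1+\epsilon)\|X_i - W_j\|^2\right) \\
  &= \langle X_i, W_j \rangle
     - \tfrac{\epsilon}{4}\left(\|X_i + W_j\|^2 + \|X_i - W_j\|^2\right) \\
  &= \langle X_i, W_j \rangle
     - \tfrac{\epsilon}{2}\left(\|X_i\|^2 + \|W_j\|^2\right)
  && \text{(parallelogram law)}
\end{align*}
```

Swapping the roles of $(1-\epsilon)$ and $(1+\epsilon)$ gives the matching upper bound, so the estimation error is bounded in absolute value by $\frac{\epsilon}{2}(\|X_i\|^2 + \|W_j\|^2)$.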
Random projection vu2016random ; ailon2009fast ; achlioptas2001database is widely used to construct the linear map $f(\cdot)$. Specifically, the original $d$-dimensional vector is projected to a $k$-dimensional ($k \ll d$) one, using a random $k \times d$ matrix R. Then we can reduce the dimension of all $X_i$ and $W_j$ by
$$f(X_i) = \frac{1}{\sqrt{k}} R X_i \in \mathbb{R}^k, \qquad f(W_j) = \frac{1}{\sqrt{k}} R W_j \in \mathbb{R}^k \qquad (5)$$
The random projection matrix R can be generated from a Gaussian distribution ailon2009fast . In this paper, we adopt a simplified version, termed sparse random projection achlioptas2001database ; bingham2001random ; li2006very , with
$$P\left(R_{pq} = \sqrt{s}\right) = \frac{1}{2s}, \qquad P\left(R_{pq} = 0\right) = 1 - \frac{1}{s}, \qquad P\left(R_{pq} = -\sqrt{s}\right) = \frac{1}{2s} \qquad (6)$$
for all elements in R. This R has only ternary values, which removes the multiplications during projection, and the remaining additions are very sparse. Therefore, the projection overhead is negligible compared to other high-precision operations involving multiplication. Here we set $s = 3$, which gives 67% sparsity in statistics.
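A small standard-library sketch of the sparse random projection may help make the construction concrete. It assumes the standard sparse form with entries $\pm\sqrt{s}$ and 0 and the $1/\sqrt{k}$ scaling; the function names (`sparse_projection`, `reduce_dim`, `dot`) and the seed handling are hypothetical.

```python
import math
import random


def sparse_projection(k, d, s=3, seed=0):
    """k x d matrix whose entries are +sqrt(s), 0, -sqrt(s) with
    probabilities 1/(2s), 1 - 1/s, and 1/(2s) respectively."""
    rng = random.Random(seed)
    scale = math.sqrt(s)
    R = []
    for _ in range(k):
        row = []
        for _ in range(d):
            u = rng.random()
            if u < 1.0 / (2 * s):
                row.append(scale)
            elif u < 1.0 / s:
                row.append(-scale)
            else:
                row.append(0.0)
        R.append(row)
    return R


def reduce_dim(R, v):
    """Project v to k dimensions: f(v) = R v / sqrt(k). Zero entries are skipped."""
    k = len(R)
    return [sum(r * a for r, a in zip(row, v) if r != 0.0) / math.sqrt(k)
            for row in R]


def dot(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))
```

With this in place, `dot(reduce_dim(R, x), reduce_dim(R, w))` serves as the low-dimensional estimate of `dot(x, w)`. Because every nonzero entry has the same magnitude, the projection can in principle be carried out with signed additions followed by a single scaling.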
Equation (4) indicates that the low-dimensional inner product can still approximate the original high-dimensional one in equation (1) if the reduced dimension is sufficiently high. Therefore, it is possible to calculate equation (1) in a low-dimensional space for activation estimation and to select the important neurons. As shown in Figure 3(b), each sliding window dynamically selects its own important neurons for the calculation in high-dimensional space, marked in red and blue as two examples. Figure 4 visualizes two sliding windows in a real network to help understand the dynamic DRS process. Here the neuronal activation vector (of length $n_K$, the number of output FMs) is reshaped to a matrix for clarity. For the CONV layer, the computational complexity of this search is only $O(m\, n_P n_Q\, k\, n_K)$, where $m$ is the mini-batch size, $n_P \times n_Q$ is the output FM size, and $k$ is the reduced dimension. This is less than the original high-dimensional computation with complexity $O(m\, n_P n_Q\, n_C n_R n_S\, n_K)$ because we usually have $k \ll n_C n_R n_S$, the sliding-window length. For the FC layer with $n_C$ input neurons, we also have $k \ll n_C$.
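To make the complexity comparison concrete, the following back-of-the-envelope sketch counts multiply-accumulate operations for one CONV layer. It is an illustration under assumptions: the cost model (low-dimensional search plus exact computation for the kept neurons, ignoring the projection additions) and all layer sizes are hypothetical and do not come from the paper's experiments.

```python
def dense_conv_macs(m, n_p, n_q, n_c, n_r, n_s, n_k):
    """MACs of the original CONV layer expressed as VMM operations."""
    return m * n_p * n_q * n_c * n_r * n_s * n_k


def drs_search_macs(m, n_p, n_q, k, n_k):
    """MACs of the low-dimensional estimation over all n_k output neurons."""
    return m * n_p * n_q * k * n_k


def dsg_conv_macs(m, n_p, n_q, n_c, n_r, n_s, n_k, k, sparsity):
    """Assumed total: low-dimensional search plus exact MACs for kept neurons."""
    kept = round(n_k * (1 - sparsity))
    exact = m * n_p * n_q * n_c * n_r * n_s * kept
    return drs_search_macs(m, n_p, n_q, k, n_k) + exact


# Hypothetical layer: one sample, 32 x 32 outputs, 64 input FMs,
# 3 x 3 kernels, 128 filters, reduced dimension 64, 80% sparsity.
layer = dict(m=1, n_p=32, n_q=32, n_c=64, n_r=3, n_s=3, n_k=128)
dense = dense_conv_macs(**layer)
sparse = dsg_conv_macs(**layer, k=64, sparsity=0.8)
reduction = dense / sparse
```

For these made-up sizes the search operates on vectors of length 64 instead of 576, so its cost is a fraction of the dense layer, and the overall reduction is governed mainly by how many neurons are kept.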
2.3 DMS for BN Compatibility
To deal with the important but intractable BN layer, we propose a double-mask selection (DMS) method, presented in Figure 2(c). After the DRS estimation, we produce a sparsifying mask that removes the unimportant neurons. The ReLU activation function maintains this mask by inhibiting the negative activations (in fact, all the activations from the CONV or FC layer that survive the DRS mask are positive under reasonably large sparsity). However, the BN layer damages this sparsity through inter-sample activation fusion. To address this issue, we copy the same DRS mask and apply it directly to the BN output. This is straightforward but reasonable because we find that although BN causes the zero activations to become nonzero (Figure 1(f)), these nonzero activations are still very small and can also be removed. This is because BN only scales and shifts the activations, which does not change their relative sort order. In this way, we achieve a fully sparse activation dataflow.
3 Experimental Results
3.1 Experiment Setup
The overall training algorithm is presented in the Appendices. Going through the dataflow, where the red color denotes the sparse tensors, widespread sparsity in both the forward and backward passes is demonstrated. Regarding the evaluation network models, we use LeNet lecun1998gradient and a multi-layered perceptron (MLP) on the small-scale FASHION dataset xiao2017fashion , VGG8 courbariaux2016binarized ; deng2018gxnor /ResNet8 (a customized ResNet variant with 3 residual blocks and 2 FC layers)/ResNet20/WRN-8-2 zagoruyko2016wide on the medium-scale CIFAR-10 dataset krizhevsky2009learning , VGG8 and WRN-8-2 on another medium-scale dataset, CIFAR-100 krizhevsky2009learning , and ResNet18 he2016deep /WRN-18-2 zagoruyko2016wide /VGG16 simonyan2014very on the large-scale ImageNet dataset deng2009imagenet as workloads. The programming framework is PyTorch and the training platform is based on an NVIDIA Titan Xp GPU. We adopt the zero-value compression method zhang2000frequent ; vijaykumar2015case ; rhu2018compressing for memory compression and the MKL compute library wang2014intel on an Intel Xeon CPU for the acceleration evaluation.
3.2 Accuracy Analysis
In this section, we provide a comprehensive analysis regarding the influence of sparsity on accuracy and explore the robustness of MLP and CNN, the graph selection strategy, the BN compatibility, and the importance of width and depth.
Accuracy using DSG. Figure 5(a) presents the accuracy curves on small- and medium-scale models using DSG under different sparsity levels. Three conclusions can be drawn: 1) The proposed DSG has little effect on accuracy when the sparsity is 60%, while the accuracy drops abruptly when the sparsity exceeds 80%. 2) Usually, the ResNet model family is more sensitive to increasing sparsity since it has fewer parameters than the VGG family. For VGG8 on the CIFAR-10 dataset, the accuracy loss is still within 0.5% when the sparsity reaches 80%. 3) Compared to MLP, CNN can tolerate more sparsity. Figure 5(b) further shows the results on large-scale ImageNet models. Because training large models is time-consuming, we present only several experimental points. Consistently, VGG16 shows better robustness than ResNet18, and the WRN with wider channels in each layer performs much better than the other two models. We discuss the topic of width and depth later.
Graph Selection Strategy. To investigate the influence of the graph selection strategy, we repeat the sparsity vs. accuracy experiments on the CIFAR-10 dataset under different selection methods. Two baselines are used here: the oracle one, which keeps the neurons with top-k activations after the whole VMM computation at each layer, and the random one, which randomly selects the neurons to keep. The results are shown in Figure 5(c), in which we can see that our DRS and the oracle one perform much better than random selection under high sparsity. Moreover, DRS achieves nearly the same accuracy as the oracle top-k selection, which indicates that the proposed random projection method can find an accurate activation estimation in the low-dimensional space. In detail, Figure 5(d) shows the influence of the parameter $\epsilon$, which reflects the degree of dimension reduction. A lower $\epsilon$ approximates the original inner product more accurately, which brings higher accuracy but at the cost of more computation for graph selection, since the dimension is reduced less. With the $\epsilon$ value chosen in our experiments, the accuracy loss is within 1% even if the sparsity reaches 80%.
BN Compatibility. Figure 5(e) focuses on the BN compatibility issue. Here we use DRS for the graph sparsifying and compare three cases: 1) removing the BN operation and using a single mask; 2) keeping BN and using only a single mask (the first one in Figure 2(c)); 3) keeping BN and using double masks (i.e., DMS). The case without BN is very sensitive to the graph ablation, which indicates the importance of BN for training. Comparing the two cases with BN, DMS even achieves better accuracy owing to its regularization effect. This observation indicates the effectiveness of the proposed DMS method in simultaneously recovering the sparsity damaged by the BN layer and maintaining the accuracy.
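The double-mask case compared above can be sketched in plain Python as follows. This is an assumption-laden illustration rather than the authors' code: the helper names are hypothetical, BN is reduced to per-neuron normalization over the mini-batch without learned scale and shift, and the masks are taken as given rather than produced by DRS.

```python
import math


def relu(row):
    """Element-wise ReLU on one sample's activations."""
    return [max(0.0, a) for a in row]


def batch_norm(batch, eps=1e-5):
    """Normalize each neuron (column) over the mini-batch (rows)."""
    m, n = len(batch), len(batch[0])
    out = [[0.0] * n for _ in range(m)]
    for c in range(n):
        column = [batch[r][c] for r in range(m)]
        mean = sum(column) / m
        var = sum((a - mean) ** 2 for a in column) / m
        for r in range(m):
            out[r][c] = (batch[r][c] - mean) / math.sqrt(var + eps)
    return out


def apply_mask(batch, masks):
    """Zero out the entries whose mask bit is 0."""
    return [[a * b for a, b in zip(row, mask_row)]
            for row, mask_row in zip(batch, masks)]


def dms_block(pre_activations, masks):
    """CONV/FC output -> first mask -> ReLU -> BN -> second (same) mask."""
    sparse = [relu(row) for row in apply_mask(pre_activations, masks)]
    fused = batch_norm(sparse)  # BN turns masked zeros into nonzero values
    return apply_mask(fused, masks)  # reusing the mask restores the sparsity
```

With a single mask (skipping the final `apply_mask`), the entries that were zero before BN generally come out nonzero, which corresponds to case 2 in the comparison.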
Width or Depth. Furthermore, we investigate an interesting comparison regarding network width and depth, as shown in Figure 5(f). On the training set, WRN with fewer but wider layers demonstrates more robustness than the deeper network with more but slimmer layers. On the validation set, the results are a little more complicated. Under small and medium sparsity, the deeper ResNet performs better (1%) than the wider one. However, when the sparsity increases substantially (75%), WRN maintains the accuracy better. This indicates that, in the medium-sparse space, the deeper network has stronger representation ability because of its deep structure; however, in the ultra-high-sparse space, the deeper structure is more likely to collapse because the pruning error accumulates layer by layer. In practice, we can determine which type of model to use according to the sparsity requirement. In Figure 5(b) on ImageNet, WRN-18-2 performs much better because it has wider layers without reducing the depth.
3.3 Representational Cost Reduction
This section presents the benefits of DSG on representational cost. We measure the memory consumption over five CNN benchmarks in both the training and inference phases. For data compression, we use the zero-value compression algorithm zhang2000frequent ; vijaykumar2015case ; rhu2018compressing . Figure 6 shows the memory optimization results, where the model name, mini-batch size, and sparsity are provided. In training, besides the parameters, the activations across all layers must be stashed for the backward computation. Consistent with the observation mentioned above, the neuron activations rather than the weights dominate the memory overhead, which differs from previous work on inference. We can reduce the overall representational cost by 1.7x (2.72 GB), 3.2x (4.51 GB), and 4.2x (5.04 GB) on average under 50%, 80%, and 90% sparsity, respectively. If only the neuronal activations are considered, these ratios can be higher, up to 7.1x. The memory overhead for the selection masks is minimal (2%).
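As a rough illustration of how zero-value compression turns activation sparsity into memory savings, the sketch below stores a one-bit-per-element mask plus the packed nonzero values. The bitmask layout, the 32-bit value width, and the function names are assumptions made for illustration; the cited compression schemes may differ in detail.

```python
def zvc_compress(values):
    """Split a tensor (flat list) into a presence bitmask and its nonzeros."""
    mask = [1 if v != 0 else 0 for v in values]
    nonzeros = [v for v in values if v != 0]
    return mask, nonzeros


def zvc_decompress(mask, nonzeros):
    """Rebuild the dense list from the bitmask and the packed nonzeros."""
    packed = iter(nonzeros)
    return [next(packed) if bit else 0.0 for bit in mask]


def zvc_ratio(values, bits_per_value=32):
    """Dense size divided by compressed size (1 mask bit per element)."""
    mask, nonzeros = zvc_compress(values)
    dense_bits = len(values) * bits_per_value
    packed_bits = len(mask) + len(nonzeros) * bits_per_value
    return dense_bits / packed_bits
```

Under these assumptions the mask costs 1/32 of the dense size, so even a fully dense tensor pays only a small overhead, while at high sparsity the ratio approaches the inverse of the nonzero fraction.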
During inference, memory is required only for the parameters and for the activations of the layer with the largest number of neurons. The benefits in inference are relatively smaller than those in training since the weights dominate the memory. On ResNet152, the extra mask overhead even offsets the compression benefit under 50% sparsity; nevertheless, we can still achieve on average up to 7.1x memory reduction for activations and 1.7x for overall memory. Although the compression is limited for inference, DSG still achieves noticeable acceleration, as shown in the next section. Moreover, reducing costs for both training and inference is our major contribution.
3.4 Computational Cost Reduction
We assess the results on reducing the computational cost of both training and inference. As shown in Figure 7, both the forward and backward passes consume far fewer operations, i.e., multiply-and-accumulate (MAC) operations. On average, 1.4x (5.52 GMACs), 1.7x (9.43 GMACs), and 2.2x (10.74 GMACs) operation reductions are achieved in training under 50%, 80%, and 90% sparsity, respectively. For inference with only the forward pass, the results increase to 1.5x (2.26 GMACs), 2.8x (4.22 GMACs), and 3.9x (4.87 GMACs), respectively. The overhead of the DRS computation in the low-dimensional space is relatively larger (6.5% in training and 19.5% in inference) than the mask overhead in memory cost. Note that training shows less improvement than inference because the acceleration of the backward pass is only partial. The error propagation is accelerative, but the weight gradient generation is not, because its irregular sparsity makes practical acceleration hard to obtain. Although the computation of this part is also very sparse with far fewer operations (see Algorithm 1 in the Appendices), we do not include its GMACs reduction for practical reasons.
Finally, we evaluate the execution time on CPU using Intel MKL kernels wang2014intel . As shown in Figure 8(a), we measure the execution time of the layers after the DRS selection on VGG8. Compared to the VMM baselines, our approach achieves 2.0x, 5.0x, and 8.5x speedup under 50%, 80%, and 90% sparsity, respectively. When the baselines change to GEMM (general matrix multiplication), the speedup decreases to 0.6x, 1.6x, and 2.7x, respectively. The reason is that DSG generates dynamic vector-wise sparsity, which is not well supported by GEMM.
We further compare our approach with smaller dense models, which could be another way to reduce computational cost. As shown in Figure 8(b), compared with the dense baseline, our approach reduces training time with little accuracy loss. Although the equivalent smaller dense models with the same number of effective nodes, i.e., reduced MACs, save more training time, their accuracy is much worse than that of our DSG approach.
4 Related Work
DNN Compression. ardakani2016sparsely achieved up to 90% weight sparsity by randomly removing connections. han2015learning ; han2015deep reduced the weight parameters by pruning the unimportant connections. However, the compression is mainly achieved on FC layers, which makes it ineffective for CONV-layer-dominant networks, e.g., ResNet. Moreover, it is difficult to obtain practical speedup due to the irregularity of the element-wise sparsity. Even when designing an ASIC from scratch han2016eie ; han2017ese , the index overhead is enormous and it only works under high sparsity. These methods usually require a pretrained model, iterative pruning, and fine-tuning, and they target inference optimization.
DNN Acceleration. Unlike compression, acceleration work focuses more on the sparsity pattern. In contrast to fine-grain compression, coarse-grain sparsity was further proposed to optimize the execution speed. Channel-level sparsity was obtained by removing unimportant weight filters he2018soft , training penalty coefficients liu2017learning , or introducing group-lasso optimization luo2017thinet ; he2017channel ; liang2018crossbar . wen2016learning introduced an L2-norm group-lasso optimization for both medium-grain sparsity (row/column) and coarse-grain weight sparsity (channel/filter/layer). molchanov2016pruning introduced the Taylor expansion for neuron pruning. However, these methods benefit only inference acceleration, and solving the extra optimization problem usually makes training more complicated. lin2017predictivenet demonstrated predicting important neurons and then bypassing the unimportant ones via low-precision pre-computation on small networks. spring2017scalable leveraged randomized hashing to predict the important neurons. However, the hashing search aims at finding neurons whose weight bases are similar to the input vector, which cannot estimate the inner product accurately and thus will probably cause significant accuracy loss on large models. sun2017meprop used a straightforward top-k pruning on the back-propagated errors for training acceleration, but they only simplified the backward pass and presented results on tiny FC models. Furthermore, the BN compatibility problem, which is very important for large-model training, remains untouched. lin2017deep pruned the gradients for accelerating distributed training, but the focus is on multi-node communication rather than the computation topic discussed in this paper.
5 Conclusion
In this work, we propose the DSG (dynamic and sparse graph) structure for efficient DNN training and inference. It combines a DRS (dimension reduction search) sparsity forecast for compressive memory and accelerative execution with a DMS (double-mask selection) for BN compatibility, without sacrificing the model's expressive power. It can be easily extended to inference by using the same selection pattern after training. Our experiments over various benchmarks demonstrate significant memory saving (4.5x for training and 1.7x for inference) and computation reduction (2.3x for training and 4.4x for inference). By significantly boosting both the forward and backward passes in training, as well as inference, DSG promises efficient deep learning in both the cloud and the edge.
References
 (1) Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521, no. 7553, p. 436, 2015.
 (2) K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

 (3) K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
 (4) O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional neural networks for speech recognition,” IEEE/ACM Transactions on audio, speech, and language processing, vol. 22, no. 10, pp. 1533–1545, 2014.
 (5) J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” arXiv preprint, vol. 1612, 2016.
 (6) Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016.

 (7) J. Xue, J. Li, D. Yu, M. Seltzer, and Y. Gong, “Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 6359–6363, IEEE, 2014.
 (8) A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov, “Tensorizing neural networks,” in Advances in Neural Information Processing Systems, pp. 442–450, 2015.
 (9) T. Garipov, D. Podoprikhin, A. Novikov, and D. Vetrov, “Ultimate tensorization: compressing convolutional and fc layers alike,” arXiv preprint arXiv:1611.03214, 2016.
 (10) Y. Yang, D. Krompass, and V. Tresp, “Tensor-train recurrent neural networks for video classification,” arXiv preprint arXiv:1707.01786, 2017.
 (11) J. M. Alvarez and M. Salzmann, “Compression-aware training of deep networks,” in Advances in Neural Information Processing Systems, pp. 856–867, 2017.
 (12) M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1,” arXiv preprint arXiv:1602.02830, 2016.
 (13) S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv preprint arXiv:1606.06160, 2016.
 (14) L. Deng, P. Jiao, J. Pei, Z. Wu, and G. Li, “Gxnor-net: Training deep neural networks with ternary weights and activations without full-precision memory under a unified discretization framework,” Neural Networks, vol. 100, pp. 49–58, 2018.
 (15) C. Leng, H. Li, S. Zhu, and R. Jin, “Extremely low bit neural network: Squeeze the last bit out with admm,” arXiv preprint arXiv:1707.09870, 2017.
 (16) W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, “Terngrad: Ternary gradients to reduce communication in distributed deep learning,” in Advances in Neural Information Processing Systems, pp. 1508–1518, 2017.
 (17) S. Wu, G. Li, F. Chen, and L. Shi, “Training and inference with integers in deep neural networks,” arXiv preprint arXiv:1802.04680, 2018.
 (18) J. L. McKinstry, S. K. Esser, R. Appuswamy, D. Bablani, J. V. Arthur, I. B. Yildiz, and D. S. Modha, “Discovering low-precision networks close to full-precision networks for efficient embedded inference,” arXiv preprint arXiv:1809.04191, 2018.
 (19) A. Ardakani, C. Condo, and W. J. Gross, “Sparsely-connected neural networks: towards efficient vlsi implementation of deep neural networks,” arXiv preprint arXiv:1611.01427, 2016.
 (20) S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in neural information processing systems, pp. 1135–1143, 2015.
 (21) S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
 (22) Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning efficient convolutional networks through network slimming,” in 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2755–2763, IEEE, 2017.
 (23) H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient convnets,” arXiv preprint arXiv:1608.08710, 2016.
 (24) Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very deep neural networks,” in International Conference on Computer Vision (ICCV), vol. 2, p. 6, 2017.
 (25) J.-H. Luo, J. Wu, and W. Lin, “Thinet: A filter level pruning method for deep neural network compression,” arXiv preprint arXiv:1707.06342, 2017.
 (26) W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Advances in Neural Information Processing Systems, pp. 2074–2082, 2016.
 (27) P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, “Pruning convolutional neural networks for resource efficient inference,” 2016.
 (28) X. Sun, X. Ren, S. Ma, and H. Wang, “meprop: Sparsified back propagation for accelerated deep learning with reduced overfitting,” arXiv preprint arXiv:1706.06197, 2017.
 (29) R. Spring and A. Shrivastava, “Scalable and sustainable deep learning via randomized hashing,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 445–454, ACM, 2017.
 (30) Y. Lin, C. Sakr, Y. Kim, and N. Shanbhag, “Predictivenet: An energy-efficient convolutional neural network via zero prediction,” in Circuits and Systems (ISCAS), 2017 IEEE International Symposium on, pp. 1–4, IEEE, 2017.
 (31) T. Zhang, K. Zhang, S. Ye, J. Li, J. Tang, W. Wen, X. Lin, M. Fardad, and Y. Wang, “Adam-admm: A unified, systematic framework of structured weight pruning for dnns,” arXiv preprint arXiv:1807.11091, 2018.
 (32) Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep gradient compression: Reducing the communication bandwidth for distributed training,” arXiv preprint arXiv:1712.01887, 2017.
 (33) P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch sgd: training imagenet in 1 hour,” arXiv preprint arXiv:1706.02677, 2017.
 (34) Y. You, Z. Zhang, C. Hsieh, J. Demmel, and K. Keutzer, “Imagenet training in minutes,” CoRR, abs/1709.05011, 2017.
 (35) S. L. Smith, P.-J. Kindermans, and Q. V. Le, “Don’t decay the learning rate, increase the batch size,” arXiv preprint arXiv:1711.00489, 2017.
 (36) Y. Wu and K. He, “Group normalization,” arXiv preprint arXiv:1803.08494, 2018.
 (37) S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “Eie: efficient inference engine on compressed deep neural network,” in Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on, pp. 243–254, IEEE, 2016.
 (38) S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, et al., “Ese: Efficient speech recognition engine with sparse lstm on fpga,” in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 75–84, ACM, 2017.
 (39) A. Jain, A. Phanishayee, J. Mars, L. Tang, and G. Pekhimenko, “Gist: Efficient data encoding for deep neural network training,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 776–789, IEEE, 2018.
 (40) L. Liang, L. Deng, Y. Zeng, X. Hu, Y. Ji, X. Ma, G. Li, and Y. Xie, “Crossbaraware neural network pruning,” arXiv preprint arXiv:1807.10816, 2018.
 (41) A. S. Morcos, D. G. Barrett, N. C. Rabinowitz, and M. Botvinick, “On the importance of single directions for generalization,” arXiv preprint arXiv:1803.06959, 2018.
 (42) S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
 (43) Y. Zhang, J. Yang, and R. Gupta, “Frequent value locality and value-centric data cache design,” ACM SIGPLAN Notices, vol. 35, no. 11, pp. 150–159, 2000.
 (44) N. Vijaykumar, G. Pekhimenko, A. Jog, A. Bhowmick, R. Ausavarungnirun, C. Das, M. Kandemir, T. C. Mowry, and O. Mutlu, “A case for core-assisted bottleneck acceleration in gpus: enabling flexible data compression with assist warps,” in ACM SIGARCH Computer Architecture News, vol. 43, pp. 41–53, ACM, 2015.
 (45) M. Rhu, M. O’Connor, N. Chatterjee, J. Pool, Y. Kwon, and S. W. Keckler, “Compressing dma engine: Leveraging activation sparsity for training deep neural networks,” in High Performance Computer Architecture (HPCA), 2018 IEEE International Symposium on, pp. 78–91, IEEE, 2018.
 (46) D. Mishkin and J. Matas, “All you need is a good init,” arXiv preprint arXiv:1511.06422, 2015.
 (47) W. B. Johnson and J. Lindenstrauss, “Extensions of lipschitz mappings into a hilbert space,” Contemporary mathematics, vol. 26, no. 189206, p. 1, 1984.
 (48) K. K. Vu, Random projection for high-dimensional optimization. PhD thesis, Université Paris-Saclay, 2016.
 (49) I. S. Kakade and G. Shakhnarovich, “Cmsc 35900 (spring 2009) large scale learning lecture: 2 random projections,” 2009.
 (50) N. Ailon and B. Chazelle, “The fast johnson–lindenstrauss transform and approximate nearest neighbors,” SIAM Journal on computing, vol. 39, no. 1, pp. 302–322, 2009.
 (51) D. Achlioptas, “Database-friendly random projections,” in Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pp. 274–281, ACM, 2001.
 (52) E. Bingham and H. Mannila, “Random projection in dimensionality reduction: applications to image and text data,” in Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 245–250, ACM, 2001.
 (53) P. Li, T. J. Hastie, and K. W. Church, “Very sparse random projections,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 287–296, ACM, 2006.
 (54) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 (55) H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017.
 (56) S. Zagoruyko and N. Komodakis, “Wide residual networks,” arXiv preprint arXiv:1605.07146, 2016.
 (57) A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” 2009.
 (58) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255, IEEE, 2009.
 (59) E. Wang, Q. Zhang, B. Shen, G. Zhang, X. Lu, Q. Wu, and Y. Wang, “Intel math kernel library,” in High-Performance Computing on the Intel® Xeon Phi™, pp. 167–188, Springer, 2014.
 (60) Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang, “Soft filter pruning for accelerating deep convolutional neural networks,” arXiv preprint arXiv:1808.06866, 2018.
Appendix A DRS for Inner Product Preservation
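Theorem 1. Given a set of $N$ points in $\mathbb{R}^d$ (i.e. all $X_i$ and $W_j$) and a number $k > O\left(\frac{\log N}{\epsilon^2}\right)$, there exist a linear map $f: \mathbb{R}^d \rightarrow \mathbb{R}^k$ and an $\epsilon_0 \in (0, 1)$ such that, for $\epsilon \in (0, \epsilon_0]$, we have
$$P\left[\,\left|\langle f(X_i), f(W_j)\rangle - \langle X_i, W_j\rangle\right| \le \frac{\epsilon}{2}\left(\|X_i\|^2 + \|W_j\|^2\right)\right] \ge 1 - O(\epsilon^2) \tag{7}$$
for all $X_i$ and $W_j$.
Proof. According to the definition of inner product and vector norm, any two vectors $a$ and $b$ satisfy
$$\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2\langle a, b\rangle, \qquad \|a + b\|^2 = \|a\|^2 + \|b\|^2 + 2\langle a, b\rangle. \tag{8}$$
Subtracting the first identity from the second gives
$$\langle a, b\rangle = \frac{\|a + b\|^2 - \|a - b\|^2}{4}. \tag{9}$$
Therefore, we can transform the target in equation (7) to
$$\begin{aligned}\left|\langle f(X_i), f(W_j)\rangle - \langle X_i, W_j\rangle\right| &= \frac{1}{4}\left|\|f(X_i) + f(W_j)\|^2 - \|f(X_i) - f(W_j)\|^2 - \|X_i + W_j\|^2 + \|X_i - W_j\|^2\right| \\ &\le \frac{1}{4}\left(\left|\|f(X_i) + f(W_j)\|^2 - \|X_i + W_j\|^2\right| + \left|\|f(X_i) - f(W_j)\|^2 - \|X_i - W_j\|^2\right|\right),\end{aligned} \tag{10}$$
which is also based on the fact that $|u - v| \le |u| + |v|$. Now recall the definition of random projection in equation (5) of the main text, with $R \in \mathbb{R}^{k \times d}$ the random projection matrix:
$$f(X_i) = \frac{1}{\sqrt{k}} R X_i \in \mathbb{R}^k, \qquad f(W_j) = \frac{1}{\sqrt{k}} R W_j \in \mathbb{R}^k. \tag{11}$$
Substituting equation (11) into equation (10) and using the linearity of $f$, we have
$$\begin{aligned}\left|\langle f(X_i), f(W_j)\rangle - \langle X_i, W_j\rangle\right| &\le \frac{1}{4}\left(\left|\left\|\tfrac{1}{\sqrt{k}} R (X_i + W_j)\right\|^2 - \|X_i + W_j\|^2\right| + \left|\left\|\tfrac{1}{\sqrt{k}} R (X_i - W_j)\right\|^2 - \|X_i - W_j\|^2\right|\right) \\ &= \frac{1}{4}\left(\left|\|f(X_i + W_j)\|^2 - \|X_i + W_j\|^2\right| + \left|\|f(X_i - W_j)\|^2 - \|X_i - W_j\|^2\right|\right).\end{aligned} \tag{12}$$
Further recall the norm preservation in equation (3) of the main text: there exist a linear map $f: \mathbb{R}^d \rightarrow \mathbb{R}^k$ and an $\epsilon_0 \in (0, 1)$ such that, for $\epsilon \in (0, \epsilon_0]$ and any point $Z$ in the set, we have
$$P\left[(1 - \epsilon)\|Z\|^2 \le \|f(Z)\|^2 \le (1 + \epsilon)\|Z\|^2\right] \ge 1 - O(\epsilon^2). \tag{13}$$
Applying equation (13) to $Z = X_i + W_j$ and $Z = X_i - W_j$ in the right-hand side of equation (12) yields
$$P\left[\left|\|f(X_i + W_j)\|^2 - \|X_i + W_j\|^2\right| + \left|\|f(X_i - W_j)\|^2 - \|X_i - W_j\|^2\right| \le \epsilon\left(\|X_i + W_j\|^2 + \|X_i - W_j\|^2\right) = 2\epsilon\left(\|X_i\|^2 + \|W_j\|^2\right)\right] \ge 1 - O(\epsilon^2), \tag{14}$$
where the equality follows from summing the two identities in equation (8).
Combining equations (12) and (14), finally we have
$$P\left[\,\left|\langle f(X_i), f(W_j)\rangle - \langle X_i, W_j\rangle\right| \le \frac{\epsilon}{2}\left(\|X_i\|^2 + \|W_j\|^2\right)\right] \ge 1 - O(\epsilon^2). \tag{15}$$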
It can be seen that, for any given $X_i$ and $W_j$ pair, the inner product can be preserved if $\epsilon$ is sufficiently small. Previous work achlioptas2001database ; bingham2001random ; vu2016random has discussed random projection extensively for various big data applications; here we reorganize these supporting materials into a systematic proof, which we hope helps readers follow this paper. In practical experiments, there is a tradeoff between the degree of dimension reduction and the recognition accuracy. A smaller $\epsilon$ usually brings more accurate inner product estimation and better recognition accuracy, at the cost of higher computational complexity from a larger $k$, and vice versa. Because $\|X_i\|^2$ and $\|W_j\|^2$ are not strictly bounded, the approximation may suffer from some noise. Nevertheless, the extensive experiments in the main text validate the effectiveness of our approach for training dynamic and sparse neural networks.
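The inner product preservation discussed above can be checked numerically with a minimal sketch. It projects two vectors with $f(v) = \frac{1}{\sqrt{k}} R v$ and compares the estimated and exact inner products against a bound of the form $\frac{\epsilon}{2}(\|x\|^2 + \|w\|^2)$. The dense Gaussian choice of $R$, the function names, and the sizes d = 512 and k = 128 are assumptions made purely for illustration; the projection matrix used in the main text may differ (for example, a sparse one).

```python
import math
import random


def random_projection_matrix(k, d, rng):
    """k x d matrix with i.i.d. N(0, 1) entries (an assumed choice of R)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]


def project(R, v):
    """Compute f(v) = (1 / sqrt(k)) * R v, where k is the number of rows of R."""
    scale = 1.0 / math.sqrt(len(R))
    return [scale * sum(r * x for r, x in zip(row, v)) for row in R]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def inner_product_error(d=512, k=128, seed=0):
    """Return (|estimate - exact|, 0.5 * (||x||^2 + ||w||^2)) for random x, w."""
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    w = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    R = random_projection_matrix(k, d, rng)
    exact = dot(x, w)                                # inner product in R^d
    estimate = dot(project(R, x), project(R, w))     # inner product in R^k
    norm_term = 0.5 * (dot(x, x) + dot(w, w))        # factor multiplying epsilon
    return abs(estimate - exact), norm_term


if __name__ == "__main__":
    err, norm_term = inner_product_error()
    # err / norm_term is the smallest epsilon for which the bound holds here.
    print("absolute error:", err)
    print("implied epsilon:", err / norm_term)
```

Increasing k in this sketch should, on average, shrink the implied epsilon, which mirrors the tradeoff between estimation accuracy and computational cost described above.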
Appendix B Implementation and overhead
The training algorithm for producing DSG is presented in Algorithm 1. Furthermore, the generation procedure of the critical neuron mask, based on the virtual activations estimated in the low-dimensional space, is presented in Figure 9; it is a typical top-k search. The value of k is determined by the activation size and the desired sparsity $\gamma$. To reduce the search cost, we first compute the virtual activations of the first input sample within the current minibatch and then conduct a top-k search over its whole virtual activation matrix to obtain the top-k threshold for this sample. The remaining samples share the top-k threshold of the first sample, which avoids a costly search per sample. Finally, the overall activation mask is generated by setting a mask element to one if the estimated activation is larger than the top-k threshold and to zero otherwise. Note that, for the FC layer, each sample is a vector.
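The shared-threshold mask generation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, each sample is assumed to be a flat list of estimated activations, k is assumed to be the activation size times one minus the sparsity, and the comparison uses "greater than or equal" so that the first sample keeps exactly k entries when there are no ties.

```python
def topk_threshold(values, k):
    """Return the k-th largest value; requires 1 <= k <= len(values)."""
    return sorted(values, reverse=True)[k - 1]


def critical_neuron_mask(virtual_acts, sparsity):
    """Build a 0/1 mask for a minibatch of estimated (virtual) activations.

    virtual_acts: list of samples, each a flat list of estimated activations.
    sparsity: desired fraction of activations to drop (assumed in [0, 1)).
    """
    first = virtual_acts[0]
    # Assumed rule: keep k = size * (1 - sparsity) activations, at least one.
    k = max(1, int(round(len(first) * (1.0 - sparsity))))
    # The top-k search runs on the first sample only.
    threshold = topk_threshold(first, k)
    # Every sample in the minibatch reuses the first sample's threshold.
    return [[1 if a >= threshold else 0 for a in sample]
            for sample in virtual_acts]


if __name__ == "__main__":
    batch = [[0.1, 0.9, 0.5, 0.3],
             [0.6, 0.7, 0.4, 0.95]]
    for row in critical_neuron_mask(batch, sparsity=0.5):
        print(row)
```

Because the threshold comes from the first sample alone, the other samples are not guaranteed to keep exactly k activations; the sketch trades that exactness for a single sort per minibatch.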
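Layers ($n_{PQ}$, $n_{CRS}$, $n_K$) | Dimension: BL | $\epsilon$=0.3 | $\epsilon$=0.5 | $\epsilon$=0.7 | $\epsilon$=0.9 | Operations (MMACs): BL | $\epsilon$=0.3 | $\epsilon$=0.5 | $\epsilon$=0.7 | $\epsilon$=0.9
1024, 1152, 128 | 1152 | 539 | 232 | 148 | 119 | 144 | 67.37 | 29 | 18.5 | 14.88
256, 1152, 256 | 1152 | 616 | 266 | 169 | 136 | 72 | 38.5 | 16.63 | 10.56 | 8.5
256, 2304, 256 | 2304 | 616 | 266 | 169 | 136 | 144 | 38.5 | 16.63 | 10.56 | 8.5
64, 2304, 512 | 2304 | 693 | 299 | 190 | 154 | 72 | 21.65 | 9.34 | 5.94 | 4.81
64, 4608, 512 | 4608 | 693 | 299 | 190 | 154 | 144 | 21.65 | 9.34 | 5.94 | 4.81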
Furthermore, we investigate the influence of $\epsilon$ on the DRS computation cost for importance estimation. We take several layers from VGG8 on CIFAR10 as a case study, as shown in Table 1. With a larger $\epsilon$, the DRS achieves a lower dimension with far fewer operations. The average dimension reduction is 3.6x ($\epsilon = 0.3$), 8.5x ($\epsilon = 0.5$), 13.3x ($\epsilon = 0.7$), and 16.5x ($\epsilon = 0.9$). The resulting operation reduction is 3.1x, 7.1x, 11.1x, and 13.9x, respectively.
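As a worked check of the arithmetic behind Table 1, the operation counts of each row appear to follow from the dimension columns as (first layer parameter) x (dimension) x (third layer parameter), expressed in units of 2^20 multiply-accumulates. This relation and the unit are inferred from the table values rather than stated in the text, and the function and variable names below are hypothetical. The sketch recomputes the first row.

```python
MEBI = 2 ** 20  # assumed unit: 1 MMAC = 2**20 multiply-accumulate operations


def mmacs(n_pq, dim, n_k):
    """MACs of an (n_pq x dim) by (dim x n_k) matrix product, in MMACs."""
    return n_pq * dim * n_k / MEBI


# First row of Table 1: layer parameters (1024, 1152, 128).
n_pq, n_crs, n_k = 1024, 1152, 128
# Baseline (BL) dimension and the reduced dimensions per column of the table.
dims = {"BL": n_crs, "0.3": 539, "0.5": 232, "0.7": 148, "0.9": 119}
ops = {label: mmacs(n_pq, dim, n_k) for label, dim in dims.items()}

if __name__ == "__main__":
    for label, value in ops.items():
        print(label, value)
```

The same function applied to the other rows gives values that agree with the table up to rounding in the second decimal place.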