Dynamic Sparse Graph for Efficient Deep Learning

10/01/2018 ∙ by Liu Liu, et al. ∙ Tsinghua University, The Regents of the University of California

We propose to execute deep neural networks (DNNs) with a dynamic and sparse graph (DSG) structure for compressive memory and accelerative execution during both training and inference. The great success of DNNs motivates the pursuit of lightweight models for deployment on embedded devices. However, most previous studies optimize for inference and either neglect training or even complicate it. Training is far more intractable, because (i) neuron activations, rather than the weights as in inference, dominate the memory cost; (ii) the dynamic activations invalidate previous sparse-acceleration methods built on one-off optimization of fixed weights; (iii) batch normalization (BN) is critical for maintaining accuracy, yet its activation reorganization damages the sparsity. To address these issues, DSG activates only a small fraction of neurons with high selectivity at each iteration via a dimension-reduction search (DRS) and obtains BN compatibility via a double-mask selection (DMS). Experiments show significant memory saving (1.7-4.5x) and operation reduction (2.3-4.4x) with little accuracy loss on various benchmarks.


1 Introduction

Deep Neural Networks (DNNs) lecun2015deep have achieved impressive progress in a wide spectrum of domains simonyan2014very ; he2016deep ; abdel2014convolutional ; redmon2016yolo9000 ; wu2016google , while the models are extremely memory- and compute-intensive. The high representational and computational cost motivates many researchers to investigate approaches to improving execution performance, including matrix or tensor decomposition xue2014singular ; novikov2015tensorizing ; garipov2016ultimate ; yang2017tensor ; alvarez2017compression , data quantization courbariaux2016binarized ; zhou2016dorefa ; deng2018gxnor ; leng2017extremely ; wen2017terngrad ; wu2018training ; mckinstry2018discovering , and network pruning ardakani2016sparsely ; han2015learning ; han2015deep ; liu2017learning ; li2016pruning ; he2017channel ; luo2017thinet ; wen2016learning ; molchanov2016pruning ; sun2017meprop ; spring2017scalable ; lin2017predictivenet ; zhang2018adam . However, most of the previous work aims at inference, while the challenges of reducing the representational and computational cost of training are not well studied. Although some work demonstrates acceleration in distributed training lin2017deep ; goyal2017accurate ; you2017imagenet , we target single-node optimization, and our method can also boost training in a distributed fashion.

DNN training, which demands much more hardware resources in terms of both memory capacity and computation volume, is far more challenging than inference. Firstly, the activation data in training must be stored for backpropagation, significantly increasing the memory consumption. Secondly, training iteratively updates model parameters using mini-batched stochastic gradient descent (SGD). We almost always expect larger mini-batches for higher throughput (Figure 1(a)), faster convergence, and better accuracy smith2017don . However, memory capacity is often the limiting factor (Figure 1(b)); it may cause performance degradation or even make large models with deep structures, or models targeting high-resolution vision tasks, hard to train he2016deep ; wu2018group .

It is difficult to transfer existing sparsity techniques from the inference phase to the training phase for the following reasons. 1) Prior arts mainly compress the pre-trained and fixed weight parameters to reduce the off-chip memory access in inference han2016eie ; han2017ese , whereas the dynamic neuronal activations turn out to be the crucial bottleneck in training jain2018gist , making the prior inference-oriented methods inefficient. Besides, during training we need to stash a vast batched activation space for the backward gradient calculation. Therefore, neuron activations create a new memory bottleneck (Figure 1(c)). In this paper, we sparsify the neuron activations for training compression. 2) The existing inference accelerations usually add extra optimization problems onto the critical path wen2016learning ; molchanov2016pruning ; liu2017learning ; luo2017thinet ; liang2018crossbar ; zhang2018adam , i.e., 'complicated training, simplified inference', which significantly complicates the training phase. 3) Moreover, previous studies reveal that batch normalization (BN) is crucial for improving accuracy and robustness (Figure 1(d)) through activation fusion across different samples within one mini-batch for better representation morcos2018importance ; ioffe2015batch . BN has almost become a standard training configuration; however, inference-oriented methods seldom discuss BN and simply treat the BN parameters as scaling and shift factors in the forward pass. We further find that BN damages the sparsity due to the activation reorganization (Figure 1(e)). Since this work targets both training and inference, the BN compatibility problem must be addressed.

Figure 1: Comprehensive motivation illustration. (a) Using a larger mini-batch size helps improve throughput until the execution becomes compute-bound; (b) limited memory capacity on a single computing node prohibits the use of a large mini-batch size; (c) neuronal activations dominate the representational cost when the mini-batch size becomes large; (d) BN is indispensable for maintaining accuracy; (e) the upper and lower ones are the feature maps before and after BN, respectively; BN damages the sparsity through information fusion; (f) there exists such great representational redundancy that more than 80% of activations are close to zero.

From the viewpoint of information representation, the activation of each neuron reflects its selectivity to the current stimulus sample morcos2018importance , and this selectivity dataflow propagates layer by layer to form different representation levels. Fortunately, there is much representational redundancy: for example, many neuron activations for each stimulus sample are very small and can be removed (Figure 1(f)). Motivated by the above analysis regarding memory and compute, we propose to search for critical neurons to construct a sparse graph at every iteration. By activating only a small fraction of neurons with high selectivity, we can significantly save memory and simplify computation with tolerable accuracy degradation. Because the neuron responses dynamically change under different stimulus samples, the sparse graph is variable. The neuron-aware dynamic and sparse graph (DSG) is fundamentally distinct from the static graphs in previous work on permanent weight pruning, since we never prune the graph but only activate part of it each time. Therefore, we maintain the model's expressive power as much as possible. A graph selection method, dimension-reduction search (DRS), is designed for both compressible activations with element-wise unstructured sparsity and accelerative vector-matrix multiplication (VMM) with vector-wise structured sparsity. Through the double-mask selection (DMS) design, it is also compatible with BN. We can use the same selection pattern to extend our method to inference. In a nutshell, we propose a compressible and accelerative DSG approach supported by the DRS and DMS methods. It can achieve 1.7-4.5x memory compression and 2.3-4.4x computation reduction with minimal accuracy loss. This work simultaneously pioneers the approach towards efficient online training and offline inference, which can benefit deep learning in both the cloud and the edge.

2 Approach

Our method forms DSGs for different inputs, which are accelerative and compressive, as shown in Figure 2(a). On the one hand, by choosing a small number of critical neurons to participate in computation, DSG can reduce the computational cost by eliminating the calculations of non-critical neurons. On the other hand, it can further reduce the representational cost via compression of the sparsified activations. Different from previous methods using permanent pruning, our approach does not prune any neuron or its associated weights; instead, it activates a sparse graph according to the input sample at each iteration. Therefore, DSG does not compromise the expressive power of the model.

Figure 2: (a) Illustration of dynamic and sparse graph (DSG); (b) Dimension reduction search (DRS) for construction of DSG; (c) Double mask selection (DMS) for BN compatibility.

Constructing a DSG requires determining which neurons are critical. A naive approach is to select critical neurons according to the output activations: if an output neuron has a small or negative activation value, i.e., it is not selective to the current input sample, it can be removed to save representational cost. Because these activations will be small or exactly zero after the following ReLU non-linearity (i.e., ReLU(x) = max(0, x)), it is reasonable to set all of them to zero. However, this naive approach requires computing all VMM operations within each layer before the critical neurons can be selected, which is very costly.
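For concreteness, a minimal PyTorch sketch of this naive selection (function and tensor names are hypothetical, not from the paper's code); note that the full, expensive VMM is still computed before the mask is known:

```python
import torch

def naive_select(x, w, sparsity=0.8):
    """Naive critical-neuron selection: compute the full output first,
    then keep only the top-k activations per sample and zero the rest."""
    y = torch.relu(x @ w)                            # full, costly VMM: (batch, n_out)
    k = max(1, int(y.shape[1] * (1.0 - sparsity)))   # number of neurons to keep
    idx = torch.topk(y, k, dim=1).indices
    mask = torch.zeros_like(y).scatter_(1, idx, 1.0)
    return y * mask, mask
```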

2.1 Dimension Reduction Search

To avoid the costly VMM operations in the naive selection mentioned above, we propose an efficient method, dimension-reduction search (DRS), to estimate the importance of the output neurons. As shown in Figure 2(b), we first reduce the dimensions of X and W, and then execute the lightweight VMM operations in the low-dimensional space at minimal cost. After that, we estimate the neuron importance according to the virtual output activations. A binary mask can then be produced, in which the zeros represent the non-critical neurons with small activations that are removable. We use a top-k search that keeps only the largest k neurons, where an inter-sample threshold-sharing mechanism is leveraged to greatly reduce the search cost (implementation details are given in the Appendices). Note that k is determined by the output size and a pre-configured sparsity parameter. We then compute the accurate activations of the critical neurons in the original high-dimensional space and avoid calculating the non-critical neurons. Thus, besides the compressive sparse activations, DRS saves a significant amount of expensive operations in the high-dimensional space.
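A minimal sketch of DRS for one fully-connected layer, assuming a precomputed random projection matrix r (all names are illustrative; the paper's implementation additionally shares the top-k threshold across samples, see the Appendices):

```python
import torch

def drs_forward(x, w, r, sparsity=0.8):
    """Dimension-reduction search: estimate neuron importance in low dimension,
    then compute the selected neurons exactly in high dimension.
    x: (batch, d) inputs, w: (d, n_out) weights, r: (d, k_low) random projection."""
    x_low, w_low = x @ r, w.t() @ r                  # project both operands to k_low dims
    y_virtual = x_low @ w_low.t()                    # cheap virtual activations: (batch, n_out)
    k = max(1, int(w.shape[1] * (1.0 - sparsity)))
    idx = torch.topk(y_virtual, k, dim=1).indices
    mask = torch.zeros_like(y_virtual).scatter_(1, idx, 1.0)
    # Dense compute shown for clarity; a sparse kernel would skip the weight
    # columns whose mask entries are zero (vector-wise structured sparsity).
    y = torch.relu(x @ w) * mask
    return y, mask
```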

Figure 3: Compressive and accelerative DSG. (a) Original dense convolution; (b) converted accelerative VMM operation; (c) zero-value compression.

In this way, a vector-wise structured sparsity can be achieved, as shown in Figure 3(b). The ones in the selection mask (marked as colored blocks) denote the critical neurons, while the non-critical neurons can bypass the memory access and computation of the corresponding whole column of the weight matrix. Furthermore, the generated sparse activations can be compressed via zero-value compression zhang2000frequent ; vijaykumar2015case ; rhu2018compressing (Figure 3(c)). Consequently, the key is to reduce the vector dimension while keeping the activations calculated in the low-dimensional space as close as possible to those in the original high-dimensional space.
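A minimal sketch of the zero-value compression idea, i.e., a presence bitmask plus the packed non-zero values (an illustrative encoding, not necessarily the exact format of the cited schemes):

```python
import torch

def zvc_encode(t):
    """Zero-value compression: a 1-bit-per-element presence mask plus packed non-zeros."""
    mask = t != 0            # stored as 1 bit per element in a real implementation
    return mask, t[mask]     # densely packed non-zero values

def zvc_decode(mask, values):
    out = torch.zeros(mask.shape, dtype=values.dtype)
    out[mask] = values
    return out

x = torch.relu(torch.randn(4, 8)) * (torch.rand(4, 8) > 0.8).float()  # sparse activations
mask, vals = zvc_encode(x)
assert torch.equal(zvc_decode(mask, vals), x)
```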

2.2 Sparse Random Projection for Efficient DRS

Notations: Each CONV layer has a four-dimensional weight tensor ($n$, $c$, $k_h$, $k_w$), where $n$ is the number of filters, i.e., the number of output feature maps (FMs), $c$ is the number of input FMs, and ($k_h$, $k_w$) represents the kernel size. Thus, the CONV layer in Figure 3(a) can be converted to many VMM operations, as shown in Figure 3(b). Each row in the matrix of input FMs gathers the activations of one sliding window across all input FMs (a vector of length $d = c\,k_h k_w$), and after the VMM operation with the weight matrix (of size $d \times n$) it generates the $n$ points at the same location across all output FMs. Further considering the output FM size $H_o \times W_o$ and the mini-batch size $B$, the whole set of VMM operations has a computational complexity of $O(B\,H_o W_o\,d\,n)$. For an FC layer with $m$ input neurons and $n$ output neurons, this complexity is $O(B\,m\,n)$. Note that here we switch the order of the BN and ReLU layers from 'CONV/FC-BN-ReLU' to 'CONV/FC-ReLU-BN', because it is hard to determine the activation value of the non-critical neurons if the following layer is BN (this value is simply zero for ReLU). As shown in previous work, this reorganization can bring better accuracy mishkin2015all .
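A minimal PyTorch sketch of this CONV-to-VMM conversion via im2col (torch.nn.functional.unfold); the sizes are illustrative:

```python
import torch
import torch.nn.functional as F

B, c, H, W = 8, 16, 32, 32                 # mini-batch, input FMs, spatial size
n, kh, kw = 32, 3, 3                       # filters (output FMs) and kernel size
x = torch.randn(B, c, H, W)
weight = torch.randn(n, c, kh, kw)

cols = F.unfold(x, (kh, kw), padding=1)    # (B, c*kh*kw, Ho*Wo): one column per sliding window
w_mat = weight.view(n, -1)                 # (n, d) with d = c*kh*kw
y = (w_mat @ cols).reshape(B, n, H, W)     # batched VMM, equal to the dense convolution
assert torch.allclose(y, F.conv2d(x, weight, padding=1), atol=1e-4)
```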

For the sake of simplicity, we consider the operation for each sliding window in the CONV layer, or the whole FC layer, under a single input sample as the basic optimization problem. The generation of each output activation requires an inner product operation, as follows:

$y_j = \varphi(\mathbf{X}_i \cdot \mathbf{W}_j)$   (1)

where $\mathbf{X}_i$ is the $i$-th row in the matrix of input FMs (for the FC layer there is only one $\mathbf{X}$ vector), $\mathbf{W}_j$ is the $j$-th column of the weight matrix $\mathbf{W}$, and $\varphi(\cdot)$ is the neuronal transformation (e.g., the ReLU function; here we omit the bias). Now, according to equation (1), preserving the activation is equivalent to preserving the inner product.

We introduce a dimension-reduction lemma, the Johnson–Lindenstrauss Lemma (JLL) johnson1984extensions , to implement DRS with inner product preservation. This lemma states that a set of points in a high-dimensional space can be embedded into a low-dimensional space in such a way that the Euclidean distances between these points are nearly preserved. Specifically, given $0<\epsilon<1$, a set of $N$ points in $\mathbb{R}^{d}$ (i.e., all $\mathbf{X}_i$ and $\mathbf{W}_j$), and a number $k > O(\log(N)/\epsilon^{2})$, there exists a linear map $f: \mathbb{R}^{d}\rightarrow\mathbb{R}^{k}$ such that

$(1-\epsilon)\,\|\mathbf{X}_i-\mathbf{W}_j\|^{2} \le \|f(\mathbf{X}_i)-f(\mathbf{W}_j)\|^{2} \le (1+\epsilon)\,\|\mathbf{X}_i-\mathbf{W}_j\|^{2}$   (2)

for any given $\mathbf{X}_i$ and $\mathbf{W}_j$ pair, where $\epsilon$ is a hyper-parameter that controls the approximation error, i.e., a larger $\epsilon$ implies a larger error. When $\epsilon$ is sufficiently small, one corollary of JLL is the following norm preservation vu2016random ; Kakade_cmsc35900 :

$P\left[\,(1-\epsilon)\,\|\mathbf{Z}\|^{2} \le \|f(\mathbf{Z})\|^{2} \le (1+\epsilon)\,\|\mathbf{Z}\|^{2}\,\right] \ge 1-2e^{-(\epsilon^{2}-\epsilon^{3})k/4}$   (3)

where $\mathbf{Z}$ can be any $\mathbf{X}_i$ or $\mathbf{W}_j$, and $P$ denotes a probability. It means that the vector norm can be preserved with a high probability controlled by $\epsilon$. Given these basics, we can further obtain the inner product preservation:

$P\left[\,\left|f(\mathbf{X}_i)\cdot f(\mathbf{W}_j)-\mathbf{X}_i\cdot\mathbf{W}_j\right| \le \frac{\epsilon}{2}\left(\|\mathbf{X}_i\|^{2}+\|\mathbf{W}_j\|^{2}\right)\right] \ge 1-4e^{-(\epsilon^{2}-\epsilon^{3})k/4}$   (4)

The detailed proof can be found in the Appendices.

Random projection vu2016random ; ailon2009fast ; achlioptas2001database is widely used to construct the linear map $f(\cdot)$. Specifically, the original $d$-dimensional vector is projected to a $k$-dimensional one ($k \ll d$) using a random matrix $\mathbf{R}\in\mathbb{R}^{d\times k}$. Then we can reduce the dimension of all $\mathbf{X}_i$ and $\mathbf{W}_j$ by

$f(\mathbf{X}_i)=\frac{1}{\sqrt{k}}\,\mathbf{X}_i\mathbf{R}, \qquad f(\mathbf{W}_j)=\frac{1}{\sqrt{k}}\,\mathbf{W}_j^{T}\mathbf{R}$   (5)

The random projection matrix $\mathbf{R}$ can be generated from a Gaussian distribution ailon2009fast . In this paper, we adopt a simplified version, termed sparse random projection achlioptas2001database ; bingham2001random ; li2006very , with

$P\left(R_{pq}=\sqrt{3}\right)=\frac{1}{6}, \quad P\left(R_{pq}=0\right)=\frac{2}{3}, \quad P\left(R_{pq}=-\sqrt{3}\right)=\frac{1}{6}$   (6)

for all elements in $\mathbf{R}$. This $\mathbf{R}$ has only ternary values, which removes the multiplications during projection, and the remaining additions are very sparse. Therefore, the projection overhead is negligible compared to the other high-precision operations involving multiplication. This setting yields 67% sparsity of $\mathbf{R}$ in statistics.
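A minimal sketch that generates such a ternary projection matrix and empirically checks the inner-product preservation of equation (4); the layer sizes, the chosen $\epsilon$, and all names are illustrative (the division by k accounts for the $1/\sqrt{k}$ scaling in equation (5)):

```python
import torch

def sparse_random_projection(d, k):
    """Ternary sparse random projection (equation (6)): entries are +sqrt(3), 0,
    -sqrt(3) with probabilities 1/6, 2/3, 1/6, i.e., about 67% zeros."""
    u = torch.rand(d, k)
    r = torch.zeros(d, k)
    r[u < 1.0 / 6] = 3.0 ** 0.5
    r[u > 5.0 / 6] = -(3.0 ** 0.5)
    return r

torch.manual_seed(0)
d, k, n_out = 2304, 192, 256                    # original dim, reduced dim, output neurons
x, w = torch.randn(d), torch.randn(d, n_out)    # one sliding-window row and a weight matrix
r = sparse_random_projection(d, k)

exact = x @ w                                   # exact pre-activations: (n_out,)
approx = (x @ r) @ (w.t() @ r).t() / k          # low-dimensional estimate via equation (5)

eps = 0.3
tol = eps * (x.pow(2).sum() + w.pow(2).sum(dim=0)) / 2      # tolerance from equation (4)
frac = ((approx - exact).abs() <= tol).float().mean().item()
print(f"fraction of neurons within the eps={eps} tolerance: {frac:.3f}")
```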

Figure 4: Structured selection via dynamic DRS for producing a sparse pattern of neuronal activations.

Equation (4) indicates that the low-dimensional inner product can still approximate the original high-dimensional one in equation (1) if the reduced dimension is sufficiently high. Therefore, it is possible to calculate equation (1) in a low-dimensional space for activation estimation and then select the important neurons. As shown in Figure 3(b), each sliding window dynamically selects its own important neurons for the calculation in the high-dimensional space, marked in red and blue as two examples. Figure 4 visualizes two sliding windows in a real network to help understand the dynamic DRS process; here the neuronal activation vector (of length $n$) is reshaped to a matrix for clarity. For the CONV layer, the estimation complexity is only $O(B\,H_o W_o\,k\,n)$, which is much lower than the original high-dimensional computation of complexity $O(B\,H_o W_o\,d\,n)$ because we usually have $k \ll d$. For the FC layer, we likewise have $O(B\,k\,n) \ll O(B\,m\,n)$ since $k \ll m$.
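As a worked sizing example with hypothetical numbers (for illustration only, under the notation above), reducing $d = c\,k_h k_w = 256\times 3\times 3 = 2304$ to $k = 170$ shrinks the estimation cost by

$\frac{O(B\,H_o W_o\,d\,n)}{O(B\,H_o W_o\,k\,n)} = \frac{d}{k} = \frac{2304}{170} \approx 13.6\times$

The accurate high-dimensional computation is then performed only for the selected weight columns, so the overall saving additionally scales with the configured sparsity.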

2.3 DMS for BN Compatibility

To deal with the important but intractable BN layer, we propose a double-mask selection (DMS) method, presented in Figure 2(c). After the DRS estimation, we produce a sparsifying mask that removes the unimportant neurons. The ReLU activation function can maintain this mask by inhibiting the negative activations (in fact, under a reasonably large sparsity, all activations that survive the DRS mask after the CONV or FC layer are positive). However, the BN layer damages this sparsity through inter-sample activation fusion. To address this issue, we copy the DRS mask and apply it again directly on the BN output. This is straightforward but reasonable, because we find that although BN turns the zero activations into non-zero ones (Figure 1(f)), these non-zero activations are still very small and can also be removed: BN only scales and shifts the activations, which does not change their relative sort order. In this way, we achieve a fully sparse activation dataflow.
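A minimal sketch of a DSG layer with DMS, reusing the DRS mask after both ReLU and BN (module and tensor names are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn

class DSGLayer(nn.Module):
    """FC-ReLU-BN block with double-mask selection (DMS): the DRS mask is
    applied once after ReLU and once again after BN to restore sparsity."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.fc = nn.Linear(d_in, d_out, bias=False)
        self.bn = nn.BatchNorm1d(d_out)

    def forward(self, x, mask):
        # mask: (batch, d_out) binary importance mask produced by DRS
        y = torch.relu(self.fc(x)) * mask   # first mask: sparse pre-BN activations
        y = self.bn(y) * mask               # second mask: BN fuses samples and breaks
        return y                            # sparsity, so the mask is applied again
```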

3 Experimental Results

3.1 Experiment Setup

The overall training algorithm is presented in the Appendices. Going through the dataflow, where the red color denotes the sparse tensors, a widespread sparsity in both the forward and backward passes is demonstrated. Regarding the evaluation network models, we use LeNet lecun1998gradient and a multi-layered perceptron (MLP) on the small-scale FASHION dataset xiao2017fashion ; VGG8 courbariaux2016binarized ; deng2018gxnor , ResNet8 (a customized ResNet variant with 3 residual blocks and 2 FC layers), ResNet20, and WRN-8-2 zagoruyko2016wide on the medium-scale CIFAR10 dataset krizhevsky2009learning ; VGG8 and WRN-8-2 on the medium-scale CIFAR100 dataset krizhevsky2009learning ; and ResNet18 he2016deep , WRN-18-2 zagoruyko2016wide , and VGG16 simonyan2014very on the large-scale ImageNet dataset deng2009imagenet as workloads. The programming framework is PyTorch and the training platform is based on an NVIDIA Titan Xp GPU. We adopt the zero-value compression method zhang2000frequent ; vijaykumar2015case ; rhu2018compressing for memory compression, and the MKL compute library wang2014intel on an Intel Xeon CPU for the acceleration evaluation.

3.2 Accuracy Analysis

In this section, we provide a comprehensive analysis regarding the influence of sparsity on accuracy and explore the robustness of MLP and CNN, the graph selection strategy, the BN compatibility, and the importance of width and depth.

Accuracy using DSG. Figure 5(a) presents the accuracy curves on small- and medium-scale models using DSG under different sparsity levels. Three observations can be made: 1) The proposed DSG affects accuracy very little at 60% sparsity, while accuracy presents an abrupt descent when the sparsity exceeds 80%. 2) The ResNet model family is usually more sensitive to increasing sparsity since it has fewer parameters than the VGG family; for VGG8 on the CIFAR10 dataset, the accuracy loss is still within 0.5% when the sparsity reaches 80%. 3) Compared to MLP, CNN can tolerate more sparsity. Figure 5(b) further shows the results on large-scale ImageNet models. Because training large models is time-consuming, we only present several experimental points. Consistently, VGG16 shows better robustness than ResNet18, and the WRN with wider channels in each layer performs much better than the other two models. We will discuss the topic of width and depth later.

Graph Selection Strategy. To investigate the influence of the graph selection strategy, we repeat the sparsity-versus-accuracy experiments on the CIFAR10 dataset under different selection methods. Two baselines are used here: an oracle one that keeps the neurons with the top-k activations after the whole VMM computation at each layer, and a random one that randomly selects neurons to keep. The results are shown in Figure 5(c): our DRS and the oracle selection perform much better than the random selection under high-sparsity conditions. Moreover, DRS achieves nearly the same accuracy as the oracle top-k selection, which indicates that the proposed random projection can produce an accurate activation estimation in the low-dimensional space. In detail, Figure 5(d) shows the influence of the parameter $\epsilon$ that reflects the degree of dimension reduction. A lower $\epsilon$ approximates the original inner product more accurately, which brings higher accuracy but at the cost of more computation for graph selection due to less dimension reduction. With a properly chosen $\epsilon$, the accuracy loss is within 1% even when the sparsity reaches 80%.

Figure 5: Comprehensive analysis on sparsity vs. accuracy. (a) & (b) Accuracy using DSG, and the influence of (c) the graph selection strategy, (d) the degree of dimension reduction, (e) the DMS for BN compatibility, (f) the network depth and width.

BN Compatibility. Figure 5(e) focuses on the BN compatibility issue. Here we use DRS for graph sparsification and compare three cases: 1) removing the BN operation and using a single mask; 2) keeping BN and using only a single mask (the first one in Figure 2(c)); 3) keeping BN and using double masks (i.e., DMS). The case without BN is very sensitive to the graph ablation, which indicates the importance of BN for training. Comparing the two cases with BN, DMS even achieves better accuracy owing to its regularization effect. This observation indicates the effectiveness of the proposed DMS method for simultaneously recovering the sparsity damaged by the BN layer and maintaining accuracy.

Width or Depth. Furthermore, we investigate an interesting comparison regarding the network width and depth, as shown in Figure 5(f). On the training set, the WRN with fewer but wider layers demonstrates more robustness than the deeper model with more but slimmer layers. On the validation set, the results are a little more complicated. Under small and medium sparsity, the deeper ResNet performs better (by about 1%) than the wider one, while when the sparsity increases substantially (beyond 75%), the WRN maintains accuracy better. This indicates that, in the medium-sparse regime, the deeper network has stronger representation ability because of its deep structure, whereas in the ultra-high-sparse regime the deeper structure is more likely to collapse since the pruning error accumulates layer by layer. In practice, we can determine which type of model to use according to the sparsity requirement. In Figure 5(b) on ImageNet, WRN-18-2 performs much better because it has wider layers without reducing the depth.

3.3 Representational Cost Reduction

This section presents the benefits of DSG on the representational cost. We measure the memory consumption over five CNN benchmarks in both the training and inference phases. For data compression, we use the zero-value compression algorithm zhang2000frequent ; vijaykumar2015case ; rhu2018compressing . Figure 6 shows the memory optimization results, where the model name, mini-batch size, and sparsity are provided. In training, besides the parameters, the activations across all layers have to be stashed for the backward computation. This is consistent with the observation mentioned above that neuron activations, rather than weights, dominate the memory overhead, which differs from previous work on inference. We can reduce the overall representational cost by on average 1.7x (2.72 GB), 3.2x (4.51 GB), and 4.2x (5.04 GB) under 50%, 80%, and 90% sparsity, respectively. If only the neuronal activations are considered, these ratios can be as high as 7.1x. The memory overhead of the selection masks is minimal (2%).

Figure 6: Memory footprint comparisons for (a) training and (b) inference.

During inference, only the memory space for the parameters and for the activations of the layer with the largest number of neurons is required. The benefits in inference are relatively smaller than those in training since weights are the dominant memory there. On ResNet152, the extra mask overhead even offsets the compression benefit under 50% sparsity; nevertheless, we can still achieve on average up to 7.1x memory reduction for activations and 1.7x for the overall memory. Although the compression is limited for inference, it still enables noticeable acceleration, as shown in the next section. Moreover, reducing the costs of both training and inference is our major contribution.

3.4 Computational Cost Reduction

We assess the results on reducing the computational cost of both training and inference. As shown in Figure 7, both the forward and backward passes consume many fewer operations, i.e., multiply-and-accumulate operations (MACs). On average, 1.4x (5.52 GMACs), 1.7x (9.43 GMACs), and 2.2x (10.74 GMACs) operation reductions are achieved in training under 50%, 80%, and 90% sparsity, respectively. For inference, with only the forward pass, the reductions increase to 1.5x (2.26 GMACs), 2.8x (4.22 GMACs), and 3.9x (4.87 GMACs), respectively. The overhead of the DRS computation in the low-dimensional space is relatively larger (6.5% in training and 19.5% in inference) compared to the mask overhead in the memory cost. Note that training demonstrates less improvement than inference because the acceleration of the backward pass is only partial: the error propagation is accelerated, but the weight-gradient generation is not, since its irregular sparsity makes practical acceleration hard to obtain. Although the computation of this part is also very sparse with many fewer operations (see Algorithm 1 in the Appendices), we do not include its GMAC reduction, out of practical concerns.

Figure 7: Computational complexity comparisons for (a) training and (b) inference.

Finally, we evaluate the execution time on a CPU using Intel MKL kernels wang2014intel . As shown in Figure 8(a), we measure the execution time of the layers after DRS selection on VGG-8. Compared to VMM baselines, our approach achieves 2.0x, 5.0x, and 8.5x speedups under 50%, 80%, and 90% sparsity, respectively. When the baseline changes to GEMM (general matrix multiplication), the speedups decrease to 0.6x, 1.6x, and 2.7x, respectively. The reason is that DSG generates dynamic vector-wise sparsity, which is not well supported by GEMM.

We further compare our approach with smaller dense models, which could be another way to reduce the computational cost. As shown in Figure 8(b), compared with the dense baseline, our approach reduces the training time with little accuracy loss. Even though equivalent smaller dense models with the same number of effective nodes, i.e., the same reduced MACs, save more training time, their accuracy is much worse than that of our DSG approach.

Figure 8: (a) Layer-wise execution time comparison; (b) Validation accuracy vs. training time of different models: large-sparse ones and smaller-dense ones with equivalent MACs.

4 Related Work

DNN Compression ardakani2016sparsely achieved up to 90% weight sparsity by randomly removing connections. han2015learning ; han2015deep reduced the weight parameters by pruning unimportant connections. However, the compression is mainly achieved on FC layers, which makes it ineffective for CONV-layer-dominated networks, e.g., ResNet. Moreover, it is difficult to obtain practical speedup due to the irregularity of the element-wise sparsity. Even when designing ASICs from scratch han2016eie ; han2017ese , the index overhead is enormous and the approach works only under high sparsity. These methods usually require a pre-trained model, iterative pruning, and fine-tuning retraining, and they target inference optimization.

DNN Acceleration Different from compression, the acceleration work considers the sparse pattern more. In contrast to fine-grained compression, coarse-grained sparsity was further proposed to optimize execution speed. Channel-level sparsity was gained by removing unimportant weight filters he2018soft , training penalty coefficients liu2017learning , or introducing group-lasso optimization luo2017thinet ; he2017channel ; liang2018crossbar . wen2016learning introduced an L2-norm group-lasso optimization for both medium-grained sparsity (row/column) and coarse-grained weight sparsity (channel/filter/layer). molchanov2016pruning introduced a Taylor-expansion criterion for neuron pruning. However, these methods only benefit inference acceleration, and the extra optimization problems usually make training more complicated. lin2017predictivenet demonstrated predicting important neurons and then bypassing the unimportant ones via low-precision pre-computation on small networks. spring2017scalable leveraged randomized hashing to predict the important neurons; however, the hashing search aims at finding neurons whose weight bases are similar to the input vector, which cannot estimate the inner product accurately and thus will probably cause significant accuracy loss on large models. sun2017meprop used a straightforward top-k pruning on the back-propagated errors for training acceleration, but it only simplifies the backward pass and presents results on tiny FC models. Furthermore, the BN compatibility problem, which is very important for large-model training, remains untouched. lin2017deep pruned the gradients for accelerating distributed training, but its focus is on multi-node communication rather than the computation discussed in this paper.

5 Conclusion

In this work, we propose the DSG (dynamic and sparse graph) structure for efficient DNN training and inference, through a DRS (dimension-reduction search) sparsity forecast for compressive memory and accelerative execution, and a DMS (double-mask selection) for BN compatibility, without sacrificing the model's expressive power. The method can be easily extended to inference by using the same selection pattern after training. Our experiments over various benchmarks demonstrate significant memory saving (4.5x for training and 1.7x for inference) and computation reduction (2.3x for training and 4.4x for inference). By significantly boosting both the forward and backward passes in training, as well as in inference, DSG promises efficient deep learning in both the cloud and the edge.

References

Appendix A DRS for Inner Product Preservation

Theorem 1. Given a set of $N$ points in $\mathbb{R}^{d}$ (i.e., all $\mathbf{X}_i$ and $\mathbf{W}_j$) and a number $k > O(\log(N)/\epsilon^{2})$, there exist a linear map $f:\mathbb{R}^{d}\rightarrow\mathbb{R}^{k}$ and an $\epsilon_{0}>0$ such that, for any $0<\epsilon\le\epsilon_{0}$, we have

$P\left[\,\left|f(\mathbf{X}_i)\cdot f(\mathbf{W}_j)-\mathbf{X}_i\cdot\mathbf{W}_j\right| \le \frac{\epsilon}{2}\left(\|\mathbf{X}_i\|^{2}+\|\mathbf{W}_j\|^{2}\right)\right] \ge 1-4e^{-(\epsilon^{2}-\epsilon^{3})k/4}$   (7)

for all $\mathbf{X}_i$ and $\mathbf{W}_j$.

Proof. According to the definitions of the inner product and the vector norm, any two vectors a and b satisfy

$\|\mathbf{a}+\mathbf{b}\|^{2}=\|\mathbf{a}\|^{2}+\|\mathbf{b}\|^{2}+2\,\mathbf{a}\cdot\mathbf{b}, \qquad \|\mathbf{a}-\mathbf{b}\|^{2}=\|\mathbf{a}\|^{2}+\|\mathbf{b}\|^{2}-2\,\mathbf{a}\cdot\mathbf{b}$   (8)

It is easy to further get

$\mathbf{a}\cdot\mathbf{b}=\frac{\|\mathbf{a}+\mathbf{b}\|^{2}-\|\mathbf{a}-\mathbf{b}\|^{2}}{4}$   (9)

Therefore, we can transform the target in equation (7) to

$f(\mathbf{X}_i)\cdot f(\mathbf{W}_j)-\mathbf{X}_i\cdot\mathbf{W}_j=\frac{\left(\|f(\mathbf{X}_i+\mathbf{W}_j)\|^{2}-\|\mathbf{X}_i+\mathbf{W}_j\|^{2}\right)-\left(\|f(\mathbf{X}_i-\mathbf{W}_j)\|^{2}-\|\mathbf{X}_i-\mathbf{W}_j\|^{2}\right)}{4}$   (10)

which is also based on the fact that $f(\mathbf{X}_i)\pm f(\mathbf{W}_j)=f(\mathbf{X}_i\pm\mathbf{W}_j)$ for a linear map. Now recall the definition of random projection in equation (5) of the main text

$f(\mathbf{Z})=\frac{1}{\sqrt{k}}\,\mathbf{Z}\mathbf{R}$   (11)

Substituting equation (11) into equation (10), we have

$f(\mathbf{X}_i)\cdot f(\mathbf{W}_j)-\mathbf{X}_i\cdot\mathbf{W}_j=\frac{\left(\frac{1}{k}\|(\mathbf{X}_i+\mathbf{W}_j)\mathbf{R}\|^{2}-\|\mathbf{X}_i+\mathbf{W}_j\|^{2}\right)-\left(\frac{1}{k}\|(\mathbf{X}_i-\mathbf{W}_j)\mathbf{R}\|^{2}-\|\mathbf{X}_i-\mathbf{W}_j\|^{2}\right)}{4}$   (12)

Further recalling the norm preservation in equation (3) of the main text: there exist a linear map $f$ and an $\epsilon_{0}>0$ such that, for $0<\epsilon\le\epsilon_{0}$, we have

$P\left[\,\left|\|f(\mathbf{Z})\|^{2}-\|\mathbf{Z}\|^{2}\right|\le\epsilon\,\|\mathbf{Z}\|^{2}\,\right]\ge 1-2e^{-(\epsilon^{2}-\epsilon^{3})k/4}$   (13)

Substituting equation (13) into equation (12), with $\mathbf{Z}$ taken as $\mathbf{X}_i+\mathbf{W}_j$ and $\mathbf{X}_i-\mathbf{W}_j$, yields

$P\left[\,\left|\frac{1}{k}\|(\mathbf{X}_i\pm\mathbf{W}_j)\mathbf{R}\|^{2}-\|\mathbf{X}_i\pm\mathbf{W}_j\|^{2}\right|\le\epsilon\,\|\mathbf{X}_i\pm\mathbf{W}_j\|^{2}\,\right]\ge 1-2e^{-(\epsilon^{2}-\epsilon^{3})k/4}$   (14)

Combining equations (12) and (14) with a union bound over the two events, and using the parallelogram law $\|\mathbf{X}_i+\mathbf{W}_j\|^{2}+\|\mathbf{X}_i-\mathbf{W}_j\|^{2}=2\left(\|\mathbf{X}_i\|^{2}+\|\mathbf{W}_j\|^{2}\right)$, finally we have

$P\left[\,\left|f(\mathbf{X}_i)\cdot f(\mathbf{W}_j)-\mathbf{X}_i\cdot\mathbf{W}_j\right|\le\frac{\epsilon}{2}\left(\|\mathbf{X}_i\|^{2}+\|\mathbf{W}_j\|^{2}\right)\right]\ge 1-4e^{-(\epsilon^{2}-\epsilon^{3})k/4}$   (15)

It can be seen that, for any given $\mathbf{X}_i$ and $\mathbf{W}_j$ pair, the inner product can be preserved if $\epsilon$ is sufficiently small. Previous work achlioptas2001database ; bingham2001random ; vu2016random has discussed random projection extensively for various big-data applications; here we re-organize these supporting materials into a systematic proof, which we hope helps readers follow this paper. In practical experiments, there exists a trade-off between the degree of dimension reduction and the recognition accuracy. A smaller $\epsilon$ usually brings a more accurate inner-product estimation and better recognition accuracy, but at the cost of higher computational complexity caused by a larger $k$, and vice versa. Because $\epsilon$ and $k$ are not strictly bounded, the approximation may suffer from some noise. Nevertheless, the extensive experiments in the main text validate the effectiveness of our approach for training dynamic and sparse neural networks.

Appendix B Implementation and overhead

Data: a mini-batch of inputs and targets ($\mathbf{X}_0$; $\mathbf{Y}^{*}$), previous weights $\mathbf{W}$, previous BN parameters $\theta$.
Result: updated weights $\mathbf{W}$, updated BN parameters $\theta$.
 
Random projection of weights: $\mathbf{W}_k^{low} = f(\mathbf{W}_k)$ for $k=1,\dots,L-1$;
 
Step 1. Forward Computation;
for k=1 to L do
       if k<L then
              Projection: $\mathbf{X}_{k-1}^{low} = f(\mathbf{X}_{k-1})$;
              Generate the mask $\mathbf{M}_k$ via DRS according to $\mathbf{X}_{k-1}^{low}$ and $\mathbf{W}_k^{low}$;
              $\mathbf{X}_k^{conv} = \varphi(\mathbf{X}_{k-1}\mathbf{W}_k)\odot\mathbf{M}_k$;  (first mask, after ReLU)
              $\mathbf{X}_k = \mathrm{BN}_{\theta_k}(\mathbf{X}_k^{conv})\odot\mathbf{M}_k$;  (second mask, after BN)
       else
              $\mathbf{X}_k = \mathbf{X}_{k-1}\mathbf{W}_k$;
       end if
end for
 
Step 2. Backward Computation;
Compute the gradient of the output layer: $\mathbf{E}_L = \partial\mathcal{L}(\mathbf{X}_L, \mathbf{Y}^{*})/\partial\mathbf{X}_L$;
for k=L to 1 do
       if k==L then
              $\mathbf{G}_{\mathbf{W}_k} = \mathbf{X}_{k-1}^{T}\mathbf{E}_k$;
              $\mathbf{E}_{k-1} = \mathbf{E}_k\mathbf{W}_k^{T}$;
       else
              $(\mathbf{E}_k^{bn}, \mathbf{G}_{\theta_k}) = \mathrm{BN\_backward}_{\theta_k}(\mathbf{E}_k\odot\mathbf{M}_k)$;
              $\mathbf{E}_k^{conv} = \mathbf{E}_k^{bn}\odot\mathbf{M}_k\odot\varphi'(\mathbf{X}_{k-1}\mathbf{W}_k)$;
              $\mathbf{G}_{\mathbf{W}_k} = \mathbf{X}_{k-1}^{T}\mathbf{E}_k^{conv}$;
              if k>1 then
                    $\mathbf{E}_{k-1} = \mathbf{E}_k^{conv}\mathbf{W}_k^{T}$;
              end if
       end if
end for
 
Step 3. Parameter Update;
for k=1 to L do
       $\mathbf{W}_k \leftarrow \mathbf{W}_k - \eta\,\mathbf{G}_{\mathbf{W}_k}$;
       $\theta_k \leftarrow \theta_k - \eta\,\mathbf{G}_{\theta_k}$;
end for
Algorithm 1 DSG training

The training algorithm for producing DSG is presented in Algorithm 1. Furthermore, the generation procedure of the critical-neuron mask, based on the virtual activations estimated in the low-dimensional space, is presented in Figure 9; it is a typical top-k search. The k value is determined by the activation size and the desired sparsity. To reduce the search cost, we calculate the virtual activations of only the first input sample within the current mini-batch and conduct a top-k search over its whole virtual activation matrix to obtain the top-k threshold for this sample. The remaining samples share the top-k threshold from the first sample to avoid costly searching overhead. Finally, the overall activation mask is generated by setting a mask element to one if the corresponding estimated activation is larger than the top-k threshold and to zero otherwise. In this way, we greatly reduce the search cost. Note that, for the FC layer, each sample is a vector. A minimal code sketch of this procedure follows Figure 9.

Figure 9: DRS mask generation: using a top-k search on the first input sample within each mini-batch to obtain a top-k threshold which is shared by the following samples. Then, we apply thresholding on the whole output activation tensor to generate the importance mask for this mini-batch.
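A minimal sketch of this mask generation with inter-sample threshold sharing (function and tensor names are illustrative):

```python
import torch

def drs_mask(y_virtual, sparsity=0.8):
    """DRS importance mask for a mini-batch of virtual activations.
    The top-k search runs only on the first sample; its threshold is then
    shared by the remaining samples to avoid per-sample top-k searches.
    y_virtual: (batch, n_out) activations estimated in the low-dimensional space."""
    n_out = y_virtual.shape[1]
    k = max(1, int(n_out * (1.0 - sparsity)))
    threshold = torch.topk(y_virtual[0], k).values.min()   # k-th largest value of sample 0
    return (y_virtual >= threshold).float()

mask = drs_mask(torch.randn(32, 256), sparsity=0.8)
print(mask.mean().item())   # roughly 0.2 of the neurons are kept on average
```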
Layers ($H_oW_o$, $d$, $n$) | Reduced dimension $k$ (BL / $\epsilon$=0.3 / 0.5 / 0.7 / 0.9) | Operations in MMACs (BL / $\epsilon$=0.3 / 0.5 / 0.7 / 0.9)
1024, 1152, 128 | 1152 / 539 / 232 / 148 / 119 | 144 / 67.37 / 29 / 18.5 / 14.88
256, 1152, 256 | 1152 / 616 / 266 / 169 / 136 | 72 / 38.5 / 16.63 / 10.56 / 8.5
256, 2304, 256 | 2304 / 616 / 266 / 169 / 136 | 144 / 38.5 / 16.63 / 10.56 / 8.5
64, 2304, 512 | 2304 / 693 / 299 / 190 / 154 | 72 / 21.65 / 9.34 / 5.94 / 4.81
64, 4608, 512 | 4608 / 693 / 299 / 190 / 154 | 144 / 21.65 / 9.34 / 5.94 / 4.81
Table 1: DRS computational complexity under different $\epsilon$. $H_oW_o$ is the output FM size, $d$ the original VMM dimension, and $n$ the number of output FMs; MMACs denotes mega-MACs, and BL denotes the baseline without dimension reduction.

Furthermore, we investigate the influence of $\epsilon$ on the DRS computation cost for importance estimation. We take several layers from VGG8 on CIFAR10 as a study case, as shown in Table 1. With a larger $\epsilon$, DRS can achieve a lower dimension with much fewer operations. The average dimension reduction is 3.6x ($\epsilon$=0.3), 8.5x ($\epsilon$=0.5), 13.3x ($\epsilon$=0.7), and 16.5x ($\epsilon$=0.9). The resulting operation reduction is 3.1x, 7.1x, 11.1x, and 13.9x, respectively.