Accelerating CNN Training by Sparsifying Activation Gradients

08/01/2019 · by Xucheng Ye, et al. · Beihang University

Gradients with respect to activations are involved in most of the computation of the back-propagation procedure of Convolutional Neural Network (CNN) training. However, an important and well-known observation is that the majority of these gradients are close to zero and have little impact on the weight update. These gradients can therefore be pruned during CNN training to achieve high gradient sparsity and reduce the computational cost. In particular, we randomly change a gradient to zero or to a threshold value if its magnitude is below the threshold, which is determined from the statistical distribution of the activation gradients. We also prove theoretically that the training convergence of the CNN model is preserved when this activation-gradient sparsification is applied. We evaluate our method on AlexNet, MobileNet, and ResNet-{18, 34, 50, 101, 152} with the CIFAR-10, CIFAR-100, and ImageNet datasets. Experimental results show that our method substantially reduces the computational cost with negligible accuracy loss, and in some cases even an accuracy improvement. Finally, we analyze in detail the benefits introduced by the sparsity of activation gradients.


1 Introduction

Convolutional Neural Networks (CNNs) have been widely applied to many applications and on various devices in recent years. However, network structures are becoming more and more complex, so training them on large-scale datasets is very time-consuming, especially with limited hardware resources. Several works have reported that training can be finished within minutes on high-performance computing platforms goyal2017accurate you2018imagenet jia2018highly , but thousands of GPUs are utilized, which is not feasible for most researchers. Although there are many works on network compression, most of them focus on inference cheng2018recent . Our work aims to reduce the training workload efficiently so that large-scale training can be performed on budget computing platforms.

The kernel optimization step of CNN training is the Stochastic Gradient Descent (SGD) algorithm performed in the backward-propagation procedure. Several data types appear in the training dataflow: weights, weight gradients, activations, and activation gradients. Backward propagation first computes the weight gradients from the activations and then performs the weight update micikevicius2017mixed . Among these steps, activation-gradient back-propagation and weight-gradient computation require intensive convolution operations, so they dominate the total training cost. It is well known that computation cost can be reduced by skipping zero values. Since both of these convolution steps take the activation gradients as input, improving the sparsity of activation gradients should significantly reduce the computation cost and memory footprint of the backward-propagation procedure.

We assume that the activation gradients are normally distributed, so a threshold $\tau$ can be calculated from this hypothesis. Our work applies stochastic pruning to the activation gradients whose magnitudes fall below $\tau$, mapping them randomly to zero or to $\pm\tau$. Note that the gradients that have passed through a ReLU layer are irregularly distributed, so we divide common networks into two categories: networks using Conv-ReLU as the basic block, such as AlexNet krizhevsky2012imagenet and VGGNet simonyan2014very , and networks using the Conv-BN-ReLU structure, such as ResNet he2016deep and MobileNet howard2017mobilenets . Our experiments show that the stochastic pruning method works for both structures in modern networks. A mathematical proof is given to demonstrate that stochastic pruning preserves the convergence of the original CNN model. Our approach can be classified as gradient sparsification and can be combined with gradient quantization methods.

2 Related Works

Pruning

Acceleration of the CNN inference phase by pruning has been studied widely and has achieved remarkable advances. Inference pruning can be divided into five categories cheng2018recent : element-level han2015deep , vector-level mao2017exploring , kernel-level anwar2017structured , group-level lebedev2016fast and filter-level pruning luo2017thinet he2017channel liu2017learning . Inference pruning focuses on raising the parameter sparsity of convolution layers. There are also pruning methods designed for training, e.g., weight-gradient sparsification, which reduces the communication cost of exchanging weight gradients in a distributed learning system. Aji aji2017sparse pruned the weight gradients with the smallest absolute values using a heuristic algorithm. Based on the correlation between filters, Prakash prakash2018repr temporarily prunes filters during training, which improves training efficiency. Different from the works mentioned above, the purpose of our work is to decrease the computational cost of backward propagation by pruning activation gradients.

Quantization

Quantization is another common way to reduce the computational complexity and memory consumption of training. Gupta's work gupta2015deep maintains accuracy by training the model with 16-bit fixed-point numbers and stochastic rounding. DoReFa-Net zhou2016dorefa , derived from AlexNet krizhevsky2012imagenet , uses 1-bit, 2-bit and 6-bit fixed-point numbers to store weights, activations and gradients, respectively, but incurs a visible accuracy drop. TernGrad wen2017terngrad is designed for distributed learning; it requires only three numerical levels for the weight gradients and demonstrates that the model can still converge with ternary weight gradients.

Mixed precision is a newer research direction in quantization. Park park2018value proposed a value-aware quantization method that uses low precision for small values, which can significantly reduce memory consumption when training ResNet-152 he2016deep and Inception-V3 szegedy2016rethinking with activations quantized to 3 bits. Micikevicius micikevicius2017mixed keeps an FP32 copy of the weights for updates and uses FP16 for computation. This method has been shown to work in various fields of deep learning and to speed up training.

3 Methodologies

3.1 Original Dataflow

Fig. 1 illustrates one backward-propagation iteration of a convolutional layer with input activation $x$, weights $W$, and output activation $y$. Activation-gradient back-propagation (AGB) performs a convolution between the output gradient $\Delta y$ and the layer's weights $W$ to obtain the input gradient $\Delta x$, which is propagated to the next layer. Weight-gradient computation (WGC) takes $\Delta y$ and $x$ as input, and the result of the convolution is the weight gradient $\Delta W$. Note that, unless stated otherwise, "next layer" in this paper refers to the next layer in the backward-propagation order.
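To make the two convolution steps concrete, the following PyTorch sketch reproduces AGB and WGC for a single layer with torch.nn.grad; the tensor names and shapes are illustrative assumptions of ours, not the paper's code.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: batch 8, 16 -> 32 channels, 3x3 kernel, 32x32 feature maps.
x = torch.randn(8, 16, 32, 32)                 # input activations
w = torch.randn(32, 16, 3, 3)                  # layer weights W
y = F.conv2d(x, w, padding=1)                  # forward pass
grad_y = torch.randn_like(y)                   # activation gradient arriving at the layer output

# AGB: convolve grad_y with the weights to get the input gradient (propagated onward).
grad_x = torch.nn.grad.conv2d_input(x.shape, w, grad_y, padding=1)

# WGC: convolve the input activations with grad_y to get the weight gradient.
grad_w = torch.nn.grad.conv2d_weight(x, w.shape, grad_y, padding=1)
```

Since grad_y is an input to both calls, making it sparse reduces the work of both AGB and WGC.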

We found that the activation gradients are full of small values, i.e., values extremely close to zero. Obviously, the closer an activation gradient is to 0, the less it affects the weight update, so it is reasonable to assume that pruning those small values will not harm the trained model's performance.

The easiest way to achieve this goal is to sort the absolute values of the gradients to find a threshold and then directly drop the small values whose absolute values fall below it. But this raises two problems. First, sorting incurs serious computational and memory overhead. Second, experimental results show that directly dropping all of these small values significantly hurts convergence and accuracy, because there are so many of them.

To address these problems, we propose two algorithms: Distribution Based Threshold Determination and Stochastic Pruning.

Figure 1: Backward propagation of a convolutional layer.

3.2 Sparsification Algorithms

Distribution Based Threshold Determination (DBTD)

As mentioned above, finding the threshold by sorting is infeasible, so a new method that is easier to implement and has lower overhead is needed. First, we analyze the distribution of the activation gradients during training. According to the structure of modern CNN models, two cases need to be discussed separately.

For networks using the Conv-ReLU structure as a basic block, i.e., where the convolutional layer is followed by a ReLU layer, the output gradient $\Delta y$ arriving at the convolution (after the backward ReLU) is sparse but irregularly distributed. In contrast, the gradient $\Delta x$ that will be propagated to the next layer is full of non-zero values; statistically, its distribution is symmetric about 0 and decreases with increasing absolute value. For networks taking the Conv-BN-ReLU structure as a basic block, $\Delta y$ has the same properties as $\Delta x$ above, since the backward BatchNorm destroys the sparsity introduced by the ReLU.

In the first case, the next block (in backward order) can inherit the sparsity of $\Delta x$ because the ReLU layer does not map zeros to non-zeros. Therefore the $\Delta x$ of the Conv-ReLU structure and the $\Delta y$ of the Conv-BN-ReLU structure are taken as the pruning targets, and we use $g$ to denote both of them. According to these properties of $g$, we hypothesize that $g = \{g_1, \ldots, g_n\}$ is a set of simple random samples drawn from a normal distribution with mean $0$ and unknown variance $\sigma^2$. Suppose the length of $g$ is $n$, and let

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} |g_i|.$$

The expectation of $\hat{\mu}$ is

$$\mathbb{E}[\hat{\mu}] = \mathbb{E}\,|g_i| = \sqrt{\tfrac{2}{\pi}}\,\sigma.$$

Let

$$\hat{\sigma} = \sqrt{\tfrac{\pi}{2}}\,\hat{\mu},$$

then

$$\mathbb{E}[\hat{\sigma}] = \sigma.$$

Clearly, $\hat{\sigma}$ is an unbiased estimator of the parameter $\sigma$.

Here we use the mean of the absolute values because its computational overhead is acceptable. With this approximate distribution of the gradient $g$ in hand, the threshold can be found by computing a percentile of $|g|$: for a preset pruning rate $p$, the threshold $\tau$ is the $p$-th percentile of $|g|$, i.e., $P(|g_i| < \tau) = p$, which under the normal hypothesis gives $\tau = \hat{\sigma}\,\Phi^{-1}\!\big(\tfrac{1+p}{2}\big)$, where $\Phi$ is the standard normal CDF.
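A minimal sketch of how such a threshold could be computed is given below; the function name dbtd_threshold and the use of the inverse normal CDF via erfinv are our own illustrative choices, assuming the normal hypothesis above.

```python
import math
import torch

def dbtd_threshold(g: torch.Tensor, p: float) -> float:
    """Estimate the pruning threshold tau from the mean absolute gradient.

    Under the hypothesis g ~ N(0, sigma^2): E|g| = sigma * sqrt(2 / pi), so
    sigma_hat = sqrt(pi / 2) * mean(|g|), and the p-th percentile of |g| is
    sigma_hat * Phi^{-1}((1 + p) / 2).
    """
    mu_hat = g.abs().mean()                              # O(n), no sorting required
    sigma_hat = math.sqrt(math.pi / 2.0) * mu_hat
    # Phi^{-1}(q) = sqrt(2) * erfinv(2q - 1) for the standard normal CDF Phi.
    q = (1.0 + p) / 2.0
    z = math.sqrt(2.0) * torch.erfinv(torch.tensor(2.0 * q - 1.0))
    return float(sigma_hat * z)
```

For instance, with p = 0.9, roughly 90% of the gradient elements become candidates for stochastic pruning under the normal hypothesis.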

In summary, this algorithm has two advantages:

  1. Less cost: its arithmetic complexity is $O(n)$, lower than sorting, which is at least $O(n \log n)$, and it requires almost no extra storage.

  2. No extra hyper-parameters: only the pruning rate $p$ needs to be set.

Stochastic Pruning

During the experiments we found that there are too many near-zero values. A single small value has little impact on the weight update; however, if all these values are set to 0 at once, the distribution of $g$ changes drastically, which influences the weight update and causes accuracy loss. Inspired by the stochastic rounding in gupta2015deep , we adopt stochastic pruning to solve this problem.

The algorithm treats $g$ as a one-dimensional vector of length $n$; the threshold $\tau$ is determined by DBTD with pruning rate $p$.

Input: original activation gradient $g$ (length $n$), threshold $\tau$
Output: sparse activation gradient $g$
for $i = 1$ to $n$ do
       if $|g_i| < \tau$ then
             Generate a random number $r \sim U(0, 1)$;
             if $r < |g_i| / \tau$ then
                   $g_i$ = ($g_i > 0$) ? $\tau$ : $-\tau$;
             else
                   $g_i$ = 0;
             end if
       end if
end for
Algorithm 1 Stochastic Pruning
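A vectorized PyTorch version of Algorithm 1 might look as follows; stochastic_prune is a hypothetical helper name, and the element-wise logic mirrors the pseudocode above.

```python
import torch

def stochastic_prune(g: torch.Tensor, tau: float) -> torch.Tensor:
    """Map gradients with |g| < tau to 0 or sign(g) * tau, stochastically.

    Each small element survives as sign(g) * tau with probability |g| / tau,
    so the pruned tensor equals the original one in expectation.
    """
    if tau <= 0.0:
        return g
    r = torch.rand_like(g)                      # r ~ U(0, 1) per element
    small = g.abs() < tau                       # elements subject to pruning
    keep = r < g.abs() / tau                    # survive with probability |g| / tau
    promoted = torch.sign(g) * tau              # value assigned to survivors
    pruned = torch.where(keep, promoted, torch.zeros_like(g))
    return torch.where(small, pruned, g)
```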
Figure 2: Effect of stochastic pruning and its placement in different structures.

Fig. 2 shows the effect of stochastic pruning and where it is applied in the different structures mentioned above. The mathematical proof in Section 3.3 shows that applying this gradient sparsification method to a convolutional neural network during training does not affect its convergence.
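One possible way to place the pruning at the positions shown in Fig. 2 is through PyTorch tensor hooks; the sketch below is our own illustration rather than the authors' implementation, and it reuses the hypothetical dbtd_threshold and stochastic_prune helpers from the earlier sketches (the default p = 0.9 is likewise only an example).

```python
import torch.nn as nn
import torch.nn.functional as F

def sparsify_grad(p: float):
    """Build a tensor hook that sparsifies the gradient flowing through it."""
    def hook(grad):
        tau = dbtd_threshold(grad, p)           # DBTD sketch above
        return stochastic_prune(grad, tau)      # Algorithm 1 sketch above
    return hook

class ConvReLU(nn.Module):
    """Conv-ReLU block: prune the gradient w.r.t. the block input (Δx)."""
    def __init__(self, cin, cout, p=0.9):
        super().__init__()
        self.conv = nn.Conv2d(cin, cout, 3, padding=1)
        self.p = p

    def forward(self, x):
        if self.training and x.requires_grad:
            x.register_hook(sparsify_grad(self.p))
        return F.relu(self.conv(x))

class ConvBNReLU(nn.Module):
    """Conv-BN-ReLU block: prune the gradient at the conv output (Δy)."""
    def __init__(self, cin, cout, p=0.9):
        super().__init__()
        self.conv = nn.Conv2d(cin, cout, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(cout)
        self.p = p

    def forward(self, x):
        y = self.conv(x)
        if self.training and y.requires_grad:
            y.register_hook(sparsify_grad(self.p))
        return F.relu(self.bn(y))
```

Registering the hook on the block input (Conv-ReLU case) prunes the gradient that is about to propagate backward out of the block, while registering it on the convolution output (Conv-BN-ReLU case) prunes the gradient after the backward BatchNorm has densified it.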

3.3 Convergence Analysis

In this section, we prove that a model trained with the proposed pruning algorithm has the same convergence ability as the original training schedule under the GOGA (General Online Gradient Algorithm) framework bottou1998online .

In bottou1998online , L. Bottou considers the following learning problem: there is an unknown distribution $P(z)$, and at each iteration $t$ (where $t$ denotes the iteration index) only a batch of samples $z_t$ drawn from $P(z)$ is available. The goal of training is to find the optimal parameters $w$ that minimize the loss function $Q(z, w)$. For convenience, we define the cost function as

$$C(w) = \mathbb{E}_{z}\left[\,Q(z, w)\,\right]. \qquad (1)$$

Under this framework, L. Bottou proved that an online learning system with the update rule

$$w_{t+1} = w_{t} - \gamma_{t}\,H(z_{t}, w_{t})$$

(where $\gamma_t$ is the learning rate) will finally converge as long as the assumptions below are satisfied.

Assumption 1.

The cost function $C(w)$ has a single global minimum $w^{*}$ and satisfies the condition that

$$\forall\,\varepsilon > 0,\quad \inf_{\|w - w^{*}\|^{2} > \varepsilon} \;(w - w^{*})\cdot\nabla_{w} C(w) > 0. \qquad (2)$$

Assumption 2.

The learning rate $\gamma_t$ fulfills that

$$\sum_{t} \gamma_{t} = +\infty, \qquad \sum_{t} \gamma_{t}^{2} < +\infty. \qquad (3)$$

Assumption 3.

The update function $H(z_t, w_t)$ meets that

$$\mathbb{E}_{z}\left[\,H(z, w)\,\right] = \nabla_{w} C(w) \qquad (4)$$

and

$$\mathbb{E}_{z}\left[\,\|H(z, w)\|^{2}\,\right] \le A + B\,\|w - w^{*}\|^{2} \qquad (5)$$

for some constants $A, B \ge 0$.

The only difference between the proposed algorithm and the original one is the update function $H$. In the original algorithm,

$$H(z_{t}, w_{t}) = \nabla_{w} Q(z_{t}, w_{t}), \qquad (6)$$

while in the proposed algorithm the activation gradients used inside back-propagation are stochastically pruned, giving an update function $H'(z_t, w_t)$. In this case, if we assume the original algorithm meets all the assumptions, the proposed algorithm also meets Assumptions 1 and 2, since neither the cost function nor the learning-rate schedule is changed.

To prove Assumption 3, we first give the following lemma:

Lemma 1.

For a stochastic variable $X$, we obtain another stochastic variable $X'$ by applying Algorithm 1 to $X$ with threshold $\tau$, which means $X' = \mathrm{prune}(X, \tau)$. Then $X'$ satisfies

$$\mathbb{E}\left[X' \mid X\right] = X \qquad (7)$$

and

$$\mathbb{E}\left[{X'}^{2} \mid X\right] \le X^{2} + \tau\,|X|. \qquad (8)$$

Then we can prove eq. (4) using Lemma 1.

Proof.
Write the update as the collection of per-layer weight gradients,

$$H(z, w) = \left(\Delta W^{(1)}, \ldots, \Delta W^{(L)}\right), \qquad (9)$$

where $L$ is the total number of layers of the neural network. In the proposed algorithm, the weight gradient of the $l$-th layer, $\Delta W'^{(l)}$, is produced by the same chain of layer operations as $\Delta W^{(l)}$, except that the activation gradients flowing through the chain are replaced by their stochastically pruned versions. Because convolution and its derivatives are linear, the expectation over the pruning randomness can be pushed through the chain layer by layer, and applying eq. (7) at every pruned position gives

$$\mathbb{E}_{\mathrm{prune}}\!\left[\Delta W'^{(l)}\right] = \Delta W^{(l)}, \qquad l = 1, \ldots, L.$$

Combining this with eq. (9), and using the assumption that the original update satisfies eq. (4), we finally obtain

$$\mathbb{E}_{z,\,\mathrm{prune}}\!\left[H'(z, w)\right] = \mathbb{E}_{z}\!\left[H(z, w)\right] = \nabla_{w} C(w),$$

so eq. (4) also holds for the proposed algorithm. ∎

Similarly, it is easy to show that eq. (5) remains satisfied, using eq. (8), as long as the original training method meets this condition. Thus, the proposed pruning algorithm has the same convergence guarantee as the original algorithm under the GOGA framework.
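The unbiasedness stated in eq. (7) can also be checked numerically; the short script below (our own sanity check, reusing the hypothetical stochastic_prune helper from Section 3.2) averages many independent prunings of the same gradient tensor and compares the result with the original.

```python
import torch

torch.manual_seed(0)
g = 0.1 * torch.randn(10_000)       # synthetic "activation gradient"
tau = 0.2                           # threshold larger than most |g_i|

# Average many independent prunings of the same tensor.
runs = torch.stack([stochastic_prune(g, tau) for _ in range(2_000)])
print((runs.mean(dim=0) - g).abs().max())   # close to 0: E[g'_i | g_i] = g_i (eq. 7)
print((runs == 0).float().mean())           # large fraction of exact zeros after pruning
```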

4 Experimental Results

In this section, several experiments are conducted to demonstrate that the proposed approach can reduce the training complexity significantly with negligible loss of model accuracy. The PyTorch paszke2017automatic framework is adopted for all evaluations.

4.1 Datasets and Models

Three datasets are utilized: CIFAR-10, CIFAR-100 krizhevsky2009learning and ImageNet deng2009imagenet . CIFAR-10 and CIFAR-100 consist of 32x32-pixel RGB images with 10 and 100 classes, respectively; each CIFAR dataset contains 50,000 images for training and 10,000 for testing, distributed uniformly over the classes. ImageNet contains RGB images of 1,000 classes, with about 1.28 million images for training and 50,000 for testing. AlexNet krizhevsky2012imagenet , ResNet he2016deep and MobileNet howard2017mobilenets are evaluated, where ResNet covers the ResNet-{18, 34, 50, 101, 152} models.

The last layer of each model is resized in order to adapt it to the CIFAR datasets. Additionally, for AlexNet, the kernels of the first two convolution layers are replaced by smaller kernels with adjusted stride and padding, and the FC-1 and FC-2 layers are resized accordingly. For ResNet, the first-layer kernels are replaced by smaller kernels with adjusted stride and padding, and the pooling layer before FC-1 is set to average pooling with a smaller window. For MobileNet, the kernel stride of the first layer is reduced and the last pooling layer is changed to average pooling with a smaller window.

4.2 Training Settings

All the models mentioned above are trained on the CIFAR-{10, 100} datasets, while for ImageNet only AlexNet, ResNet-{18, 34, 50} and MobileNet are trained, due to our limited computing resources.

Momentum SGD is used for all training runs. The learning rate lr is set separately for AlexNet and for the other models, and an lr-decay schedule is applied for both CIFAR-{10, 100} and ImageNet.

Model         Baseline           Pruned (four settings of the pruning rate p)
              acc%    ρ          acc%   ρ       acc%   ρ       acc%   ρ       acc%   ρ
AlexNet 90.50 0.09 90.34 0.01 90.55 0.01 90.31 0.01 89.66 0.01
ResNet-18 95.04 1 95.23 0.36 95.04 0.35 94.91 0.34 94.86 0.31
ResNet-34 94.90 1 95.13 0.34 95.09 0.32 95.16 0.31 95.02 0.28
ResNet-50 94.94 1 95.36 0.22 95.13 0.20 95.01 0.17 95.28 0.14
ResNet-101 95.60 1 95.61 0.24 95.48 0.22 95.60 0.19 94.77 0.12
ResNet-152 95.70 1 95.13 0.18 95.58 0.18 95.45 0.16 93.84 0.08
MobileNet 92.28 1 92.12 0.26 92.10 0.23 28.71 0.04 67.95 0.13
Table 1: Evaluation results on CIFAR-10, where acc% means the accuracy and ρ means the density of non-zero gradients.
Model         Baseline           Pruned (four settings of the pruning rate p)
              acc%    ρ          acc%   ρ       acc%   ρ       acc%   ρ       acc%   ρ
AlexNet 67.61 0.10 67.49 0.03 68.13 0.03 67.99 0.03 67.93 0.02
ResNet-18 76.47 1 76.89 0.40 77.16 0.39 76.44 0.37 76.66 0.34
ResNet-34 77.51 1 77.72 0.36 78.04 0.35 77.84 0.33 77.40 0.31
ResNet-50 77.74 1 78.83 0.25 78.27 0.22 78.92 0.20 78.52 0.16
ResNet-101 79.70 1 78.22 0.23 79.10 0.21 79.08 0.19 77.13 0.13
ResNet-152 79.25 1 80.51 0.22 79.42 0.19 79.76 0.18 76.40 0.10
MobileNet 68.21 1 8.68 0.02 68.55 0.25 53.45 0.16 9.82 0.03
Table 2: Evaluation results on CIFAR-100, where acc% means the accuracy and ρ means the density of non-zero gradients.
Model         Baseline           Pruned (four settings of the pruning rate p)
              acc%    ρ          acc%   ρ       acc%   ρ       acc%   ρ       acc%   ρ
AlexNet 56.38 0.07 57.10 0.05 56.84 0.04 55.38 0.04 39.58 0.02
ResNet-18 68.73 1 69.02 0.41 68.85 0.40 68.66 0.38 68.74 0.36
ResNet-34 72.93 1 72.92 0.39 72.86 0.38 72.74 0.37 72.42 0.34
MobileNet 70.76 1 70.94 0.32 70.09 0.28 70.23 0.27 0.84 0.01
Table 3: Evaluation results on ImageNet, where acc% means the accuracy and ρ means the density of non-zero gradients.

4.3 Results and Discussion

As discussed previously, the gradient sparsity differs between the Conv-ReLU and Conv-BN-ReLU structures, and both are covered by the three types of models evaluated. The pruning rate p of the proposed method is varied over four settings for comparison with the baseline. All training runs are performed directly, without any fine-tuning. The evaluation results are shown in Table 1, Table 2 and Table 3, where the non-zero gradient density ρ means the percentage of non-zero gradients over all gradients.

(a) AlexNet on CIFAR-10.
(b) ResNet-18 on CIFAR-10.
(c) MobileNet on CIFAR-10.
(d) AlexNet on ImageNet.
(e) ResNet-18 on ImageNet.
(f) MobileNet on ImageNet.
Figure 3: Training loss of AlexNet/ResNet/MobileNet on CIFAR-10/100 and ImageNet.

Accuracy Analysis.

From Table 1, Table 2 and Table 3, there is no accuracy loss in most situations, and for ResNet-50 on CIFAR-100 there is even an accuracy improvement. However, for MobileNet and AlexNet on ImageNet there is a significant accuracy loss when a very aggressive pruning policy is used. Moreover, training does not even converge for MobileNet on ImageNet or CIFAR-100 under certain pruning settings. A possible reason is that MobileNet is a compact neural network with little redundancy in its gradients, which makes it sensitive to gradient pruning. In summary, the accuracy loss is almost negligible when a non-aggressive pruning policy is adopted.

Gradients Sparsity.

The gradient density ρ reported in Table 1, Table 2 and Table 3 is the ratio of non-zero gradients over all gradients, which is a direct measurement of sparsity. Although the basic block of AlexNet is Conv-ReLU, whose gradients are naturally relatively sparse, our method still reduces the gradient density substantially on both CIFAR-10 and CIFAR-100. For ResNet, whose basic block is Conv-BN-ReLU and whose activation gradients are naturally fully dense, our method reduces the gradient density even more strongly. In addition, deeper networks obtain a relatively lower gradient density under our sparsification, which means that it works better for complex networks. The potential benefits brought by gradient sparsification include reductions in both the computation cost and the memory footprint of backward propagation, for example on AlexNet.
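The density ρ used throughout the tables can be measured directly on the pruned gradient tensors; a minimal helper (our own, matching the definition above) is:

```python
import torch

def nonzero_density(grad: torch.Tensor) -> float:
    """Ratio of non-zero elements over all elements (ρ in Tables 1-3)."""
    return grad.count_nonzero().item() / grad.numel()
```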

Convergence Rate.

The training loss curves of AlexNet, ResNet-18 and MobileNet on CIFAR-10 and ImageNet are shown in Fig. 3. Fig. 3(b) and Fig. 3(e) show that ResNet-18 is very robust to gradient pruning. For AlexNet, gradient pruning is still robust on CIFAR-10 but degrades the accuracy on ImageNet. Fig. 3(e) also confirms that sparsification with a larger p results in a slower convergence rate. For MobileNet, training diverges if the pruning policy is too aggressive, as shown in Fig. 3(c) and Fig. 3(f).

5 Conclusions

In this paper, a new algorithm is proposed for dynamically pruning gradients during CNN training. Different from the original training procedure, we assume that the activation gradients of a CNN follow a normal distribution and estimate their variance from their mean absolute value. We then calculate the pruning threshold from the estimated variance and a preset pruning rate p. Gradients below the threshold are pruned stochastically, which is theoretically proven to preserve convergence. Evaluations on state-of-the-art models confirm that the proposed gradient pruning approach significantly reduces the computation cost of backward propagation with negligible accuracy loss.

References