EDCompress: Energy-Aware Model Compression with Dataflow

06/08/2020 · Zhehui Wang et al. · Agency for Science, Technology and Research

Edge devices demand low energy consumption, low cost, and a small form factor. To deploy convolutional neural network (CNN) models efficiently on edge devices, energy-aware model compression becomes extremely important. However, existing work has not studied this problem well because it does not consider the diversity of dataflows in hardware architectures. In this paper, we propose EDCompress, an Energy-aware model compression method that can effectively reduce the energy consumption and area overhead of hardware accelerators with different Dataflows. Considering the very nature of model compression procedures, we recast the optimization process as a multi-step problem and solve it with reinforcement learning algorithms. Experiments show that EDCompress can improve energy efficiency by 20X, 17X, and 37X in the VGG-16, MobileNet, and LeNet-5 networks, respectively, with negligible loss of accuracy. EDCompress can also find the optimal dataflow type for a specific neural network in terms of energy consumption and area overhead, which can guide the deployment of CNN models on hardware systems.


1 Introduction

Convolutional neural networks (CNNs) show good performance in applications such as image classification and object detection. However, traditional CNNs are large, which makes them challenging to deploy on edge devices. For example, the VGG-16 network contains 528 MB of weights Simonyan and Zisserman (2014), and classifying a single image requires billions of multiply–accumulate (MAC) operations. There are two consequences. First, the limited memory of edge devices cannot store the parameters. Second, the edge device becomes power hungry, because the computation and data movement consume a large amount of energy.

Model compression methods such as quantization and pruning are emerging techniques developed in recent years to alleviate this problem. Most model compression methods target the reduction of model size. For example, Han et al. proposed the Deep Compression method Han et al. (2015), which helps fit neural networks into the on-chip memory of hardware accelerators. However, the model size does not directly determine the two key metrics of edge devices, i.e., energy consumption and area overhead.

To demonstrate this, we compare our work EDCompress (EDC) with Deep Compression (DC) in Figure 1. Although EDCompress shows a lower compression rate, it has higher energy and area efficiency than DC. This is because the energy consumption depends not only on the model size, but also on the dataflow design, i.e., the way data are reused. In hardware accelerators, different processing elements may share the same input or output data. By reusing the data, there is no need to load the same data from memory multiple times. Given that a large portion of the energy is spent on data movement (e.g., around 72% in VGG-16), a good dataflow design can effectively improve energy efficiency. In this paper, we propose EDCompress, which has the following two features:

  • Dataflow Awareness: This paper is the first to study the model compression problem with knowledge of the diversity of dataflow designs. We study the impact of different dataflow designs on quantization and pruning, and find the best model compression strategy in terms of energy consumption and area.

  • Automated Approach: We formulate energy-aware model compression as a multi-step optimization problem. At each step, we partially quantize or prune the model and then fine-tune it for a few epochs. We further recast this procedure as a reinforcement learning task.

2 Related work

Many energy-based model compression methods have been proposed in the literature. For example, Wang et al. proposed a hardware-aware quantization method using the Deep Deterministic Policy Gradient (DDPG) algorithm Wang et al. (2019). He et al. proposed a pruning method for mobile devices using the DDPG algorithm He et al. (2018). Yang et al. proposed an energy-aware pruning method for low-power devices Yang et al. (2017). Cai et al. Cai et al. (2018) and Yang et al. Yang et al. (2018b) proposed optimization methods to reduce the latency of neural networks. Several other works also focus on model compression techniques, such as Han et al. (2015) Guo et al. (2016) Xiao et al. (2017) Liu et al. (2018) Chang and Sha (2018) Manessi et al. (2018) Li et al. (2016a) Singh et al. (2019). According to previous work, there are several effective model compression techniques, including pruning and quantization. In pruning Kang (2019) Lemaire et al. (2019) Frickenstein et al. (2019) Fang et al. (2018) Yang et al. (2018a) Hacene et al. (2018), we reduce the model size by replacing weights with small absolute values with zeros. In quantization Ding et al. (2019) Geng et al. (2019) Tung and Mori (2018), we reduce the model size by decreasing the precision of the weights and the activations. These works reduce the size of the parameters stored in memory so that the compressed CNN can be deployed on edge devices. Our proposed work EDCompress differs from them in that it considers the diversity of dataflow designs.


Figure 1: Comparison between our EDCompress (EDC) and Deep Compression (DC)
  for co in range(C_out):                                # output channels
     for ci in range(C_in):                              # input channels
        for x in range(W):                               # feature-map width
           for y in range(H):                            # feature-map height
              for i in range(-(K_w-1)//2, (K_w-1)//2 + 1):     # filter width
                 for j in range(-(K_h-1)//2, (K_h-1)//2 + 1):  # filter height
                    output[co][x][y] += input[ci][x+i][y+j] * weight[co][ci][i][j]
Algorithm 1: Computation of a typical convolutional layer

3 Energy-Aware Quantization/Pruning with Dataflow

Dataflow is an important concept in accelerators. It describes the mapping strategy between the mathematical operations and the processing elements Yang et al. (2018c). Algorithm 1 shows the computation of a typical convolutional layer. The algorithm contains six loops, and each loop corresponds to one dimension of either the filter or the feature map. Here, C_out and C_in denote the number of output and input channels, W and H denote the width and height of the feature map, and K_w and K_h denote the width and height of the filter. In each iteration of the innermost loop, we perform a basic operation called a multiply–accumulate (MAC). Before the MAC operation, we read three elements from memory: one from the input feature map, one from the weights, and one from the output feature map. After the MAC operation, we write the result back to memory. During the whole process, most of the energy is spent on the MAC calculations and the data movement. To compute one convolutional layer, we need to execute C_out × C_in × W × H × K_w × K_h MAC operations in total.
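As a concrete illustration of the loop structure above, the following sketch (ours, not from the paper) counts the MAC operations of one convolutional layer directly from the six loop bounds; the layer shape used in the example is arbitrary.

    # Minimal sketch (our illustration): the total MACs of one convolutional
    # layer equal the product of the six loop bounds in Algorithm 1.
    def conv_layer_macs(c_out: int, c_in: int, width: int, height: int,
                        k_w: int, k_h: int) -> int:
        """Each iteration of the innermost loop performs exactly one MAC."""
        return c_out * c_in * width * height * k_w * k_h

    # Example: a hypothetical 3x3 convolution on a 224x224 feature map
    # with 64 input channels and 128 output channels.
    print(conv_layer_macs(128, 64, 224, 224, 3, 3))   # 3,699,376,128 MACs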

In hardware accelerators using spatial architectures, we have an array or a matrix of processing elements, each of which can execute a MAC operation independently. The strategy for mapping the operations onto these elements becomes a key consideration in the hardware design, and there is a large design space to explore. For example, given an array of processing elements, we can unroll any one of the loops in the algorithm and map each iteration of that loop onto one processing element of the array. By similar rules, we can unroll two loops in the algorithm and map the MAC operations onto a matrix of processing elements. With six loops in total, there are C(6,2) = 15 possibilities, each corresponding to one dataflow design. Here, we introduce four popular dataflows in Table 1. They are denoted as A:B, where A and B stand for the names of the two unrolled loops.
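The 15 two-loop choices can be enumerated mechanically, as in the sketch below (ours); the loop labels are generic placeholders, not the notation used in the paper.

    # Minimal sketch (our illustration): enumerate the C(6,2) = 15 ways of
    # unrolling two of the six convolution loops onto a 2-D array of
    # processing elements.  Loop labels are generic placeholders.
    from itertools import combinations

    LOOPS = ["out_ch", "in_ch", "fmap_x", "fmap_y", "filt_x", "filt_y"]
    dataflows = [f"{a}:{b}" for a, b in combinations(LOOPS, 2)]
    print(len(dataflows))   # 15
    print(dataflows[:3])    # ['out_ch:in_ch', 'out_ch:fmap_x', 'out_ch:fmap_y']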

Different dataflow designs employ different data movement policies, and thus show different energy efficiency and area overhead. Figure 2 (a) shows examples of four popular dataflow designs. To simplify the figure, we only show four processing elements in each example; a real implementation of an A:B dataflow requires as many processing elements as there are iterations in the two unrolled loops. In the first design, MAC results are stored in registers at the output ports of the processing elements, and at each iteration the previous MAC result is read back from these registers. In the second design, weights are stored in registers at the input ports of the processing elements, and the MAC results are summed up at each iteration. In the third design, weights are also stored in registers at the input ports, but each weight is reused over multiple iterations before the MAC results are summed up. In the fourth design, the input feature map is reused over multiple iterations at each step, and the MAC results are summed up.


Figure 2: (a) The hardware accelerators with four popular dataflow designs. In each dataflow, we show four processing elements, and each of them contains a multiplier and an adder; (b) If the weights are quantized from 4 bits to 3 bits, we can skip the first row of adders; (c) If the weights are pruned, we can skip those multipliers whose weights are zero
Table 1: Popular dataflow types
Dataflow 1    Applied by Du et al. (2015); Song et al. (2018)
Dataflow 2    Applied by Qiu et al. (2016)
Dataflow 3    Applied by Chen et al. (2016); Gao et al. (2017); Li et al. (2016b)
Dataflow 4    Applied by Chen et al. (2014); Jouppi et al. (2017); Zhang et al. (2015); Alwani et al. (2016); Shen et al. (2016); Suda et al. (2016)

3.1 Improvement on Energy and Area Efficiency

Quantization and pruning are two popular techniques in model compression. To quantize a model, we lower the precision of the parameters according to the quantization depth (the number of bits representing a parameter). After quantization, the low-precision parameters may still carry enough information for model inference. To prune a model, we set some of the parameters in the model to zero. A well-trained model usually contains many weights with small absolute values; we sort all the weights in the filter and replace those with the smallest absolute values with zeros.
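The sketch below (ours, in NumPy) illustrates the two operations just described, uniform quantization to a given depth and magnitude pruning of a given fraction of the weights; the symmetric per-tensor scaling is an assumption for illustration, not necessarily the scheme used in the paper.

    import numpy as np

    # Minimal sketch (our illustration) of the two compression primitives.
    def quantize(weights: np.ndarray, depth: int) -> np.ndarray:
        """Uniformly quantize weights to `depth` bits (symmetric, per-tensor scale)."""
        levels = 2 ** (depth - 1) - 1              # signed integer range
        scale = np.abs(weights).max() / levels
        return np.round(weights / scale) * scale   # low-precision values

    def prune(weights: np.ndarray, amount: float) -> np.ndarray:
        """Zero out the `amount` fraction of weights with the smallest magnitude."""
        threshold = np.quantile(np.abs(weights), amount)
        return np.where(np.abs(weights) < threshold, 0.0, weights)

    w = np.random.randn(64, 64).astype(np.float32)
    w_c = prune(quantize(w, depth=4), amount=0.6)  # 4-bit weights, 60% pruned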

We can save energy and reduce the area overhead of the logic circuits using quantization and pruning. Figure 2 (b) shows the inner structure of a 4-bit × 4-bit multiplier, which contains 12 adders. If the weights are quantized from 4 bits to 3 bits, we can skip one row of adders, and thus save energy and reduce area. In real applications, a high-precision model with the 32FP data type (32-bit floating point) requires 23-bit × 23-bit multipliers, with 506 adders in total. If both the activations and the weights can be quantized, we can save a large amount of energy and area. For example, if the activations are quantized from 32FP to 16FP and the weights are quantized from 32FP to 8INT (8-bit integer), only 10-bit × 8-bit multipliers are required, with 72 adders in total, which is 86% fewer than the original amount. Figure 2 (c) shows an array of three processing elements, each containing a multiplier and an adder. If the weights are pruned, some processing elements have inputs equal to zero. In this case, we can skip the corresponding multipliers and thus save energy.
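The adder counts quoted above are consistent with an (n−1)×m adder array for an n-bit × m-bit array multiplier; the helper below (ours, written under that assumption) reproduces the numbers in the text.

    # Minimal sketch (our assumption): an n-bit x m-bit array multiplier uses
    # roughly (n - 1) * m single-bit adders, matching the counts in the text.
    def multiplier_adders(n_bits: int, m_bits: int) -> int:
        return (n_bits - 1) * m_bits

    print(multiplier_adders(4, 4))     # 12  adders (Figure 2 (b))
    print(multiplier_adders(23, 23))   # 506 adders (32FP x 32FP)
    print(multiplier_adders(10, 8))    # 72  adders (16FP activation x 8INT weight)
    print(1 - 72 / 506)                # ~0.86, i.e. about 86% fewer adders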

We can also save energy and reduce the area overhead of the memory modules using quantization and pruning. To run inference on a model, we need to store all the weights and keep the intermediate feature map of each layer in memory. The memory can be either on-chip or off-chip. In either case, the data-movement energy and the area overhead of the memory modules are proportional to the total amount of data transferred, in bits. To decrease this value, we can either reduce the size of the parameters by quantization, or reduce the number of parameters by pruning. For example, if we quantize the parameters from 32FP to 16FP and prune half of the parameters, roughly 75% of the energy and area of the memory modules can be saved.
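A short sketch (ours) of the proportionality argument: halving the bit width and pruning half of the parameters each halve the volume of data moved, so together they remove roughly 75% of it.

    # Minimal sketch (our illustration): data-movement energy and memory area
    # are taken to be proportional to the number of bits transferred.
    def relative_memory_cost(bits_after: int, bits_before: int,
                             remaining_fraction: float) -> float:
        return (bits_after / bits_before) * remaining_fraction

    print(relative_memory_cost(16, 32, 0.5))   # 0.25 -> about 75% saved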

3.2 Recasting to the Multi-Step Problem

We recast the model compression process as a multi-step problem. Our goal is to lower the energy consumption and area overhead of edge devices while preserving the accuracy of the model. Instead of quantizing/pruning the model directly in one step, we approach the final target through a sequence of quantization/pruning steps. This is because we cannot alter the parameters too much at once; otherwise, the performance of the model degrades noticeably and it becomes too difficult to restore the model Zhu and Gupta (2017).

We show an example of the multi-step optimization process in Figure 3 (a). In each step, we increase or decrease the quantization depth (the precision of the parameters) or the pruning amount in different layers. For example, in step 1 we prune 40% of the weights and quantize the remaining weights to 7 bits. We then fine-tune the model for a few more epochs and check its accuracy and energy. If the accuracy is greater than the threshold, we change the quantization depth and the pruning amount and repeat the optimization process. In the final step shown, we prune 60% of the weights and quantize the remaining weights to 3 bits. Since the accuracy drops sharply at this step, we abort the optimization process. The quantization depth and the pruning amount can be adjusted independently at each step.
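A high-level sketch (ours) of the multi-step procedure in Figure 3 (a); the step schedule, fine-tuning, and evaluation callbacks are placeholders rather than the paper's exact implementation.

    # Minimal sketch (our illustration) of the multi-step loop in Figure 3 (a).
    # The fine-tuning and evaluation callbacks are stubs standing in for real training.
    from typing import Callable, List, Tuple

    def compress(steps: List[Tuple[int, float]],          # (quantization depth, prune amount)
                 apply_step: Callable[[int, float], None],
                 fine_tune: Callable[[], None],
                 evaluate: Callable[[], float],
                 acc_threshold: float) -> None:
        for depth, prune_amount in steps:
            apply_step(depth, prune_amount)   # e.g. step 1: 7-bit weights, 40% pruned
            fine_tune()                       # a few recovery epochs per step
            if evaluate() < acc_threshold:    # accuracy dropped too far: abort
                break

    # Toy usage with no real network attached:
    compress(steps=[(7, 0.4), (6, 0.5), (3, 0.6)],
             apply_step=lambda d, p: None,
             fine_tune=lambda: None,
             evaluate=lambda: 0.91,
             acc_threshold=0.90)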

The search space of this problem is huge. In general, the number of possible choices for an L-layer model grows exponentially with L, assuming 1% granularity of the pruning amount. Designers face many choices and, in most cases, have to make decisions based on their experience with different dataflows.


Figure 3: (a) The multi-step optimization; (b) The reinforcement learning based optimization model. The agent increases or decreases the quantization depth/pruning remaining amount at each step

3.3 Optimization through Reinforcement Learning

Reinforcement learning is a good candidate for solving this multi-step problem. We propose a method that searches for the best model compression strategy for high energy efficiency and high area efficiency via reinforcement learning, taking the diversity of dataflow designs into account. This mechanism automatically explores the design space and finds the optimal quantization/pruning policy for each dataflow. Figure 3 (b) shows an overview of our reinforcement learning model. In each episode, an agent interacts with the environment (the CNN model) through a sequence of steps. At each step t, the agent generates an action vector a_t based on the state vector s_t of the environment. The environment responds to action a_t by quantizing/pruning the parameters of the model and changing its state to s_(t+1). The model is then fine-tuned for one or a few epochs, and a reward r_t that considers both accuracy and energy consumption is returned. For large datasets such as ImageNet, the model is not fine-tuned in the first few steps. The agent then updates its own parameters to achieve higher rewards in later actions. Each episode starts from 100% pruning remaining amount and 8-bit quantization depth. An episode ends when the number of steps exceeds a limit or the accuracy of the model drops below a predefined threshold; this stopping rule makes the episode terminate around the optimal point.
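The sketch below (ours) outlines one episode of the interaction just described; the agent and environment classes are stand-ins for the SAC agent and the CNN-compression environment, not the authors' actual implementation.

    import random

    class _StubEnv:
        """Stand-in for the CNN-compression environment (not the authors' code)."""
        def reset(self):
            return [8.0, 1.0]                      # [quantization depth, remaining fraction]
        def step(self, action):
            reward, accuracy = random.random(), random.uniform(0.85, 0.95)
            return [8.0, 1.0], reward, accuracy

    class _StubAgent:
        """Stand-in for the SAC agent."""
        def act(self, state):
            return [random.uniform(-0.5, 0.0) for _ in state]
        def update(self, *transition):
            pass

    def run_episode(agent, env, max_steps: int, acc_threshold: float) -> None:
        state = env.reset()                        # 100% remaining weights, 8-bit depth
        for _ in range(max_steps):
            action = agent.act(state)              # per-layer depth / pruning changes
            next_state, reward, accuracy = env.step(action)   # quantize, prune, fine-tune
            agent.update(state, action, reward, next_state)   # learn from the transition
            state = next_state
            if accuracy < acc_threshold:           # stop once accuracy drops below threshold
                break

    run_episode(_StubAgent(), _StubEnv(), max_steps=32, acc_threshold=0.88)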

q_l^t = q_l^(t-1) + γ^t · Δq_l^t,        p_l^t = p_l^(t-1) + γ^t · Δp_l^t        (1)

The quantization depth and the pruning remaining amount can be expressed by Equation 1. Here, q_l^0 and p_l^0 denote the original quantization depth and pruning remaining amount of the l-th layer of the CNN model before optimization, and q_l^t and p_l^t denote the quantization depth and the pruning remaining amount after optimization step t (1 ≤ t ≤ T). To obtain q_l^T and p_l^T, we need T steps of optimization. In step t, the agent changes the values of q_l^t and p_l^t by Δq_l^t and Δp_l^t, respectively. To obtain a better optimization result, we take smaller steps when q_l^t and p_l^t are close to the optimal point. The discount factor γ is used to regulate the variance of Δq_l^t and Δp_l^t. We test different values of γ in our experiments and select the best-performing one.
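The following sketch (ours) applies the discounted update of Equation 1 to a single quantization depth; the exact form of the discount is our reading of the description above, and the step sizes are arbitrary examples.

    # Minimal sketch (our reading of Equation 1): each step's change is scaled
    # by gamma**t, so later steps adjust the depth / remaining amount by less.
    def apply_change(value: float, delta: float, gamma: float, t: int) -> float:
        return value + (gamma ** t) * delta

    depth = 8.0                                   # initial quantization depth
    for t, delta in enumerate([-1.0, -1.0, -0.5], start=1):
        depth = apply_change(depth, delta, gamma=0.9, t=t)
    print(depth)                                  # about 5.93 after three steps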

a_t = { Δq_1^t, Δp_1^t, …, Δq_L^t, Δp_L^t }        (2)

The action a_t can be expressed by Equation 2. Here, a_t is the set containing the changes of q_l^t and p_l^t in all layers. Although the quantization depth is a discrete variable, we use a continuous action space, because we do not want to lose the small changes of the quantization depth accumulated over the optimization steps. When we fine-tune the network, we round the quantization depth to the nearest integer.

s_t = { t } ∪ { q_l^(t-k), p_l^(t-k), r_(t-k) | l = 1, …, L; k = 1, …, K }        (3)

The state s_t can be expressed by Equation 3. It is the set containing the quantization depths q_l, the pruning remaining amounts p_l, and the rewards r from the previous K steps, together with t, the index of the current step. We want the state of the environment to reflect the history of the optimization process well; hence, the state contains the values of q_l, p_l, and r from previous steps. To guarantee that the state set has the same dimension at every optimization step, we use q_l^0, p_l^0, and a reward of zero for entries whose step index t − k is less than zero.

r_t = β · (A_t − A_(t−1)) + (E_(t−1) − E_t)        (4)

The reward r_t can be expressed by Equation 4. Here, A_t and A_(t−1) are the accuracy at the current step t and the previous step t−1, respectively, and E_t and E_(t−1) are the energy consumption at steps t and t−1. The area overhead is not included in this equation because it is highly correlated with energy consumption: low energy consumption comes with low area overhead. In the optimization process, we want to reduce the energy consumption while maintaining the accuracy of the model. Intuitively, decreasing the quantization depth and the pruning remaining amount reduces the energy consumption but also decreases the accuracy. The reinforcement learning algorithm automatically finds the trade-off point between the accuracy A and the energy consumption E. We use a third parameter β to express the importance of accuracy over energy consumption. It is normally greater than one and is fixed during the optimization. We test different values of β in our experiments and select the best-performing one.
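Below is a sketch (ours) of a reward of the form described above, with β weighting the accuracy change against the energy change; the value β = 2 and the normalized energy inputs are illustrative assumptions, not the paper's settings.

    # Minimal sketch (our reading of Equation 4): reward increases when the
    # accuracy goes up and when the (normalized) energy goes down, with beta
    # (> 1, value assumed here) emphasizing accuracy over energy.
    def reward(acc_t: float, acc_prev: float,
               energy_t: float, energy_prev: float, beta: float = 2.0) -> float:
        return beta * (acc_t - acc_prev) + (energy_prev - energy_t)

    print(reward(acc_t=0.912, acc_prev=0.915, energy_t=0.8, energy_prev=1.0))
    # small accuracy drop, 20% energy saving -> positive reward (0.194)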

4 Experiment

Algorithm setup: we use a state-of-the-art reinforcement learning algorithm, SAC (soft actor-critic) Haarnoja et al. (2018), to train our optimization model. Compared with classical large-space problems, the search space of our problem is not large, and SAC approaches the optimal solutions very quickly (less than ONE day on ImageNet using a single Titan Xp graphics card). We test EDCompress on the ImageNet, CIFAR-10, and MNIST datasets using three different neural networks: VGG-16 Simonyan and Zisserman (2014), MobileNet Howard et al. (2017), and LeNet-5 LeCun et al. (1998). VGG is a complex deep neural network, MobileNet is designed for computational efficiency, and LeNet-5 is a simple network with only two convolutional layers. We study the four most commonly used dataflow types. In each episode, we start from a well-trained model. When an episode ends, we restore the weights from a saved checkpoint and reset the quantization depth/pruning remaining amount in each layer.

Hardware setup: we implement the four popular dataflows on the Xilinx Virtex UltraScale FPGA and obtain the energy consumption and area overhead from the Xilinx XPE toolkit Xilinx (2018). The energy can be reported in a few seconds. In the logic part, the multipliers, adders, and registers are implemented on LUTs (lookup tables); the number of LUTs required by a multiplier depends on the widths of its operands Walters (2016). In our experiments, the parameters of the feature maps are quantized to 16 bits, while the weights are quantized to q bits (with q ranging from 0 to 8), so the number of LUTs per multiplier scales with q. In the memory part, the on-chip memory is implemented on RAM (random-access memory) modules. During inference, to save memory space, the input feature map of a layer is not kept after that layer has been computed. Hence, the size of the memory modules must accommodate the weights of all layers plus the largest feature map in the model.
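The memory-sizing rule described above (weights of all layers plus the largest intermediate feature map) can be sketched as follows (ours); the layer shapes are illustrative toy numbers, not the paper's measurements.

    # Minimal sketch (our illustration) of the on-chip memory requirement:
    # all weights plus the single largest intermediate feature map, in bits.
    def memory_bits(weight_counts, fmap_sizes, weight_bits=8, act_bits=16):
        weights = sum(weight_counts) * weight_bits
        largest_fmap = max(fmap_sizes) * act_bits
        return weights + largest_fmap

    # Toy LeNet-5-like shapes (illustrative only):
    print(memory_bits(weight_counts=[150, 2400, 48000, 840],
                      fmap_sizes=[4704, 1176, 120, 10]) / 8 / 1024)  # KiB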

Table 2: Comparison of EDCompress and HAQ Wang et al. (2019) on ImageNet using MobileNet
              Norm. Energy           Norm. Area
Dataflow      HAQ      Ours          HAQ      Ours
              5.44     1.41          26.1     5.27
              6.31     1.81          2.53     1.00
              6.32     1.81          2.53     1.00
              4.48     1.00          505      92.0
Top-1 Acc.    64.8     68.3          64.8     68.3
Top-5 Acc.    85.9     88.3          85.9     88.3

Table 3: Comparison of EDCompress and the previous work Li et al. (2016a), Singh et al. (2019) on CIFAR-10 using VGG-16
              Norm. Energy                          Norm. Area
Dataflow      Li et al.   Singh et al.   Ours       Li et al.   Singh et al.   Ours
              24.41       15.10          1.69       7.78        5.56           1.00
              22.61       14.42          2.31       6.42        4.20           1.27
              22.17       15.10          2.73       6.42        4.20           1.42
              19.68       12.21          1.00       434         431            47.58
Accuracy      93.1        93.4           91.3       93.1        93.4           91.3
Layer-wise Energy (J) and Area (mm²). In each group below, the seven columns correspond to Han et al. (2015), Guo et al. (2016), Xiao et al. (2017), Liu et al. (2018), Chang and Sha (2018), Manessi et al. (2018), and Ours, in that order.

X:Y

Conv1 1.62 3.34 6.29 2.93 3.61 15.76 0.27 0.95 5.77 5.77 5.77 5.77 5.81 0.53
Conv2 0.60 1.47 1.75 1.20 0.92 8.29 0.57 0.15 0.78 0.78 0.78 0.78 0.81 0.09
FC1 0.06 0.07 0.04 0.06 0.02 0.32 0.11 0.02 0.03 0.03 0.03 0.03 0.06 0.02
FC2 0.03 0.09 0.17 0.07 0.08 1.14 0.02 0.08 0.63 0.63 0.62 0.62 0.66 0.07
Total 2.31 4.96 8.25 4.25 4.62 25.5 0.96 0.97 5.81 5.81 5.80 5.80 5.83 0.55

:

Conv1 1.33 3.09 5.67 2.73 3.33 13.91 0.22 0.05 0.20 0.20 0.20 0.20 0.24 0.03
Conv2 0.58 1.58 1.86 1.29 0.99 7.78 0.36 0.06 0.23 0.23 0.23 0.23 0.26 0.04
FC1 0.08 0.08 0.05 0.07 0.02 0.38 0.09 0.04 0.20 0.21 0.20 0.20 0.24 0.03
FC2 0.03 0.09 0.17 0.07 0.08 1.14 0.02 0.08 0.63 0.63 0.62 0.62 0.66 0.06
Total 2.03 4.84 7.75 4.16 4.42 23.21 0.69 0.09 0.66 0.66 0.66 0.66 0.7 0.08

X:

Conv1 1.17 3.44 6.05 3.05 3.70 12.93 0.39 0.18 1.04 1.05 1.04 1.04 1.08 0.11
Conv2 0.71 1.69 2.00 1.37 1.04 8.66 0.53 0.09 0.41 0.41 0.41 0.41 0.45 0.06
FC1 0.10 0.09 0.05 0.07 0.02 0.41 0.20 0.02 0.06 0.06 0.06 0.06 0.09 0.02
FC2 0.03 0.09 0.17 0.07 0.08 1.14 0.02 0.08 0.63 0.63 0.62 0.62 0.66 0.07
Total 2.01 5.31 8.28 4.56 4.84 23.13 1.14 0.2 1.07 1.07 1.07 1.07 1.11 0.12

:

Conv1 2.08 4.07 7.58 3.57 4.40 18.32 0.36 0.02 0.06 0.06 0.06 0.06 0.10 0.02
Conv2 0.73 1.81 2.14 1.47 1.13 8.88 0.63 0.14 0.75 0.75 0.75 0.75 0.78 0.09
FC1 0.06 0.09 0.06 0.08 0.03 0.35 0.08 1.55 14.11 14.11 14.11 14.11 14.15 1.29
FC2 0.03 0.09 0.17 0.07 0.08 1.14 0.02 0.08 0.63 0.63 0.62 0.62 0.66 0.07
Total 2.91 6.05 9.94 5.19 5.64 28.68 1.09 1.56 14.14 14.14 14.14 14.13 14.17 1.3
Accuracy 99.3 99.1 99.1 99.1 99.0 99.1 98.6 99.3 99.1 99.1 99.1 99.0 99.1 98.6
Table 4: Comparison of EDCompress and the previous work Han et al. (2015), Guo et al. (2016), Xiao et al. (2017), Liu et al. (2018), Chang and Sha (2018), Manessi et al. (2018) on MNIST using the LeNet-5 network. The total area is the maximum area required to support the function of each layer

4.1 Comparison with the State-of-the-Art

EDCompress is effective on all kinds of datasets. Table 2, Table 3, and Table 4 compare EDCompress with state-of-the-art work on the ImageNet, CIFAR-10, and MNIST datasets. Compared with HAQ on ImageNet, EDCompress, tested on four dataflow types, achieves on average 3.8X and 3.9X improvements in energy and area efficiency with similar accuracy. We then focus on small datasets because we target edge devices running lightweight applications. Among the four dataflows, EDCompress more effectively reduces the energy consumption and area overhead, with negligible loss of accuracy. Compared with the state-of-the-art work, EDCompress shows a 9X improvement in energy efficiency and an 8X improvement in area efficiency on LeNet-5, averaged over the four dataflows. It also shows an 11X/6X improvement in energy/area efficiency on VGG-16. When the model is optimized by EDCompress, one particular dataflow is the most appropriate choice for LeNet-5 in terms of energy consumption and area overhead, while a different dataflow is the most appropriate one for VGG-16.


Figure 4: Layerwise comparison of energy consumption and area overhead between EDCompress and Deep Compression on LeNet-5. The color bar denotes the breakdown of energy and area, and the red polyline denotes the number of parameters in each layer (right-hand y-axis)

The comparisons also indicate that, rather than compressing the model size, EDCompress is more efficient at reducing energy consumption and area overhead. For example, in Figure 4 we compare the energy and area of EDCompress and Deep Compression (DC) Han et al. (2015), layer by layer. From the figure, EDCompress shows 2.4X higher energy efficiency and 1.4X higher area efficiency than DC. We can see that in the third layer, DC performs better than EDCompress on energy consumption because this layer contains 93% of the total parameters. However, this layer does not contribute most of the energy consumption. In fact, compressing the first layer is more helpful for energy reduction, although it contains only 0.1% of the parameters. Figure 4 and Table 4 show that EDCompress reduces much more energy consumption and area overhead in the first layer than previous work. Another example is the dataflow in which the third layer contributes most of the area overhead: from the figure, EDCompress shows higher area efficiency than DC in that layer. These observations further confirm that EDCompress is more efficient at reducing hardware resources.


Figure 5: Optimization process of EDCompress on CIFAR-10 (VGG-16/MobileNet) and MNIST (LeNet-5). In each episode, we run thirty-two steps. The curves show the energy consumption of four dataflows, and the bars show the accuracy of the model

4.2 Insights on Dataflow

Quantization and pruning have different effects on different dataflow designs. Figure 5 shows the optimization process of the hardware accelerators with the three neural networks in terms of energy consumption and accuracy. We start the optimization from a model with activations in the 16FP data type and weights in the 8INT data type. From the figure, we can see that the reinforcement learning algorithm effectively reduces the energy consumption with negligible loss of accuracy. Figure 6 shows the energy consumption breakdown of each dataflow before EDCompress (a model using 16FP activations and 8INT weights) and after EDCompress. Comparing the optimized result with the original model, the energy efficiency of the VGG-16, MobileNet, and LeNet-5 networks is improved by 20X, 17X, and 37X, respectively. More specifically, around 55% of the energy savings come from the processing elements and the remaining 45% from data movement.

The results also indicate that the optimization can change our choice of dataflow type. Dataflows that do not show good energy efficiency before the optimization may show very high energy efficiency after it. Take VGG-16 as an example: the dataflow that consumes the most energy among the four before the optimization has the second-lowest energy consumption after the optimization. This is because the energy consumption of a hardware accelerator includes the energy of the MAC operations on the processing elements and the energy of data movement. As we can see from Figure 6, given a fixed pruning remaining amount and quantization depth, the energy consumed on the processing elements is almost the same across dataflows; the only way to save energy is to spend less on data movement. With the optimization, the energy consumed on data movement decreases because the amount of delivered data is reduced, and different dataflow designs achieve different amounts of reduction in the delivered data. The dataflow in question is more efficient at reducing data movement, and therefore we can save more energy with this dataflow than with the other dataflow types.


Figure 6: Energy consumption breakdown before and after the optimization of EDCompress. The solid bars and patterned bars represent results before and after EDCompress, respectively

Figure 7: The performance of EDCompress when applying the quantization technique only, the pruning technique only, and both quantization and pruning

4.3 Insights on Quantization/Pruning

The effectiveness of quantization and pruning at reducing energy consumption and area overhead is highly related to the dataflow design. Figure 7 shows their individual contributions. From the figure, we can see that in most cases both quantization and pruning can effectively reduce the energy consumption and area overhead. More specifically, if we apply the quantization technique only, EDCompress achieves a 5.6X improvement in energy efficiency and a 4.3X improvement in area efficiency. If we apply the pruning technique only, EDCompress achieves a 3.8X/1.7X improvement in energy/area efficiency.

We make two observations from Figure 7. First, pruning yields very little improvement in area overhead for one of the dataflow designs. Second, the small-scale model LeNet-5 prefers quantization over pruning. This is because, in these cases, the accelerator spends more area on the processing elements than on the memory modules. Pruning can effectively reduce the area of the memory modules because it reduces the model size, but it is not good at decreasing the area of the processing elements. Quantization, on the other hand, can effectively reduce the area of both the processing elements and the memory modules. Hence, the quantization technique is more useful in these cases.

5 Conclusions

We propose EDCompress, an energy-aware model compression method with dataflow. To the best of our knowledge, this is the first paper to study this problem with knowledge of the dataflow design in accelerators. Considering the very nature of model compression procedures, we recast the optimization as a multi-step problem and solve it with reinforcement learning algorithms. EDCompress can find the optimal dataflow type for a specific neural network, which can guide the deployment of CNN models on hardware systems. However, deciding which dataflow type to use in a hardware accelerator depends on many other constraints, such as the expected computation speed, the thermal design power, and the fabrication budget. Therefore, we leave the final decision to hardware developers.

References

  • M. Alwani, H. Chen, M. Ferdman, and P. Milder (2016) Fused-layer CNN Accelerators. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–12. Cited by: Table 1.
  • H. Cai, L. Zhu, and S. Han (2018) Proxylessnas: Direct Neural Architecture Search on Target Task and Hardware. arXiv preprint arXiv:1812.00332. Cited by: §2.
  • J. Chang and J. Sha (2018) Prune Deep Neural Networks With the Modified {} Penalty. IEEE Access 7, pp. 2273–2280. Cited by: §2, Table 4.
  • T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam (2014) Diannao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning. ACM SIGARCH Computer Architecture News 42 (1), pp. 269–284. Cited by: Table 1.
  • Y. Chen, J. Emer, and V. Sze (2016) Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. ACM SIGARCH Computer Architecture News 44 (3), pp. 367–379. Cited by: Table 1.
  • C. Ding, S. Wang, N. Liu, K. Xu, Y. Wang, and Y. Liang (2019) REQ-YOLO: A Resource-Aware, Efficient Quantization Framework for Object Detection on FPGAs. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 33–42. Cited by: §2.
  • Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam (2015) ShiDianNao: Shifting Vision Processing Closer to the Sensor. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, pp. 92–104. Cited by: Table 1.
  • B. Fang, X. Zeng, and M. Zhang (2018) NestDNN: Resource-aware Multi-tenant On-device Deep Learning for Continuous Mobile Vision. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, pp. 115–127. Cited by: §2.
  • A. Frickenstein, C. Unger, and W. Stechele (2019) Resource-Aware Optimization of DNNs for Embedded Applications. In 2019 16th Conference on Computer and Robot Vision (CRV), Vol. , pp. 17–24. Cited by: §2.
  • M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis (2017) Tetris: Scalable and Efficient Neural Network Acceleration with 3D Memory. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 751–764. Cited by: Table 1.
  • X. Geng, J. Fu, B. Zhao, J. Lin, M. M. S. Aly, C. J. Pal, and V. Chandrasekhar (2019) Dataflow-Based Joint Quantization for Deep Neural Networks. In DCC, pp. 574. Cited by: §2.
  • Y. Guo, A. Yao, and Y. Chen (2016) Dynamic Network Surgery for Efficient DNNs. In Advances in neural information processing systems, pp. 1379–1387. Cited by: §2, Table 4.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft Actor-critic: Off-policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv preprint arXiv:1801.01290. Cited by: §4.
  • G. B. Hacene, V. Gripon, M. Arzel, N. Farrugia, and Y. Bengio (2018) Quantized Guided Pruning for Efficient Hardware Implementations of Convolutional Neural Networks. arXiv preprint arXiv:1812.11337. Cited by: §2.
  • S. Han, H. Mao, and W. J. Dally (2015) Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv preprint arXiv:1510.00149. Cited by: §1, §2, §4.1, Table 4.
  • Y. He, J. Lin, Z. Liu, H. Wang, L. Li, and S. Han (2018) AMC: AutoML for Model Compression and Acceleration on Mobile Devices. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800. Cited by: §2.
  • A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv preprint arXiv:1704.04861. Cited by: §4.
  • N. P. Jouppi et al. (2017) In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. Cited by: Table 1.
  • H. Kang (2019) Accelerator-Aware Pruning for Convolutional Neural Networks. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §2.
  • Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based Learning Applied to Document Recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §4.
  • C. Lemaire, A. Achkar, and P. Jodoin (2019) Structured Pruning of Neural Networks with Budget-Aware Regularization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9108–9116. Cited by: §2.
  • H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf (2016a) Pruning Filters for Efficient Convnets. arXiv preprint arXiv:1608.08710. Cited by: §2, Table 3.
  • H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang (2016b) A High Performance FPGA-based Accelerator for Large-Scale Convolutional Neural Networks. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL), pp. 1–9. Cited by: Table 1.
  • Z. Liu, J. Xu, X. Peng, and R. Xiong (2018) Frequency-domain Dynamic Pruning for Convolutional Neural Networks. In Advances in Neural Information Processing Systems, pp. 1043–1053. Cited by: §2, Table 4.
  • F. Manessi, A. Rozza, S. Bianco, P. Napoletano, and R. Schettini (2018) Automated Pruning for Deep Neural Network Compression. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 657–664. Cited by: §2, Table 4.
  • J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, et al. (2016) Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 26–35. Cited by: Table 1.
  • Y. Shen, M. Ferdman, and P. Milder (2016) Overcoming Resource Underutilization in Spatial CNN Accelerators. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL), pp. 1–4. Cited by: Table 1.
  • K. Simonyan and A. Zisserman (2014) Very Deep Convolutional Networks for Large-scale Image Recognition. arXiv preprint arXiv:1409.1556. Cited by: §1, §4.
  • P. Singh, V. K. Verma, P. Rai, and V. P. Namboodiri (2019) Play and Prune: Adaptive Filter Pruning for Deep Model Compression. arXiv preprint arXiv:1905.04446. Cited by: §2, Table 3.
  • M. Song, J. Zhang, H. Chen, and T. Li (2018) Towards Efficient Microarchitectural Design for Accelerating Unsupervised GAN-based Deep Learning. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 66–77. Cited by: Table 1.
  • N. Suda et al. (2016) Throughput-optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 16–25. Cited by: Table 1.
  • F. Tung and G. Mori (2018) CLIP-Q: Deep Network Compression Learning by In-parallel Pruning-Quantization. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vol. , pp. 7873–7882. Cited by: §2.
  • E. G. Walters (2016) Array Multipliers for High Throughput in Xilinx FPGAs with 6-input LUTs. Computers 5 (4), pp. 20. Cited by: §4.
  • K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han (2019) HAQ: Hardware-Aware Automated Quantization with Mixed Precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8612–8620. Cited by: §2, Table 3.
  • X. Xiao, L. Jin, Y. Yang, W. Yang, J. Sun, and T. Chang (2017) Building Fast and Compact Convolutional Neural Networks for Offline Handwritten Chinese Character Recognition. Pattern Recognition 72, pp. 72–81. Cited by: §2, Table 4.
  • Xilinx (2018) Vivado Design Suite User Guide. Technical Publication. Cited by: §4.
  • H. Yang, Y. Zhu, and J. Liu (2018a) Energy-Constrained Compression for Deep Neural Networks via Weighted Sparse Projection and Layer Input Masking. arXiv preprint arXiv:1806.04321. Cited by: §2.
  • T. Yang, Y. Chen, and V. Sze (2017) Designing Energy-Efficient Convolutional Neural Networks Using Energy-Aware Pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5687–5695. Cited by: §2.
  • T. Yang, A. Howard, B. Chen, X. Zhang, A. Go, M. Sandler, V. Sze, and H. Adam (2018b) Netadapt: Platform-aware Neural Network Adaptation for Mobile Applications. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 285–300. Cited by: §2.
  • X. Yang, M. Gao, J. Pu, A. Nayak, Q. Liu, S. E. Bell, J. O. Setter, K. Cao, H. Ha, C. Kozyrakis, et al. (2018c) DNN Dataflow Choice Is Overrated. arXiv preprint arXiv:1809.04070. Cited by: §3.
  • C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong (2015) Optimizing FPGA-Based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 161–170. Cited by: Table 1.
  • M. Zhu and S. Gupta (2017) To Prune, or Not to Prune: Exploring the Efficacy of Pruning for Model Compression. arXiv preprint arXiv:1710.01878. Cited by: §3.2.