1 Introduction
Convolutional neural networks (CNNs) show good performance in various applications such as image classification and object detection. However, traditional CNNs are large, which makes them challenging to implement on edge devices. For example, the VGG16 network contains 528 MB of weights Simonyan and Zisserman (2014). To classify one image, we need to perform a large number of multiply–accumulate (MAC) operations. There are two consequences. First, the limited memory space of edge devices cannot store the parameters. Second, the edge device becomes power hungry, because the calculation and data-movement operations consume a large amount of energy.

Model compression, including techniques such as quantization and pruning, has emerged in recent years to alleviate this problem. Most model compression methods target the reduction of model size. For example, Han et al. proposed the Deep Compression method Han et al. (2015), which helps fit neural networks into the on-chip memory of hardware accelerators. However, the model size does not directly determine the two most significant metrics of an edge device, i.e., its energy consumption and area overhead.
To demonstrate this, we compare our work EDCompress (EDC) with Deep Compression (DC) in Figure 1. Although EDCompress shows a lower compression rate, it has higher energy and area efficiency than DC. This is because the energy consumption depends not only on the model size, but also on the dataflow design, i.e., the way we reuse data. In hardware accelerators, different processing elements may share the same input or output data. By reusing the data, there is no need to load it from memory multiple times. Given that a large portion of the energy is spent on data movement (around 72% in VGG16), a good dataflow design can effectively improve energy efficiency. In this paper, we propose EDCompress, which has the following two features:

Dataflow Awareness: This paper is the first to study the model compression problem with knowledge of the diversity of dataflow designs. We study the impact of different dataflow designs on quantization and pruning, and search for the best model compression strategy in terms of energy consumption and area.

Automated Approach: We formulate energy-aware model compression as a multi-step optimization problem. At each step, we partially quantize or prune the model, and then fine-tune it for a few epochs. We further recast this process as a reinforcement learning task.
2 Related work
Many energy-based model compression methods have been proposed in the literature. For example, Wang et al. proposed a hardware-aware quantization method using the Deep Deterministic Policy Gradient (DDPG) algorithm Wang et al. (2019). He et al. proposed a pruning method for mobile devices using the DDPG algorithm He et al. (2018). Yang et al. proposed an energy-aware pruning method for low-power devices Yang et al. (2017). Cai et al. Cai et al. (2018) and Yang et al. Yang et al. (2018b) proposed optimization methods to reduce the latency of neural networks. Several other works also focus on model compression techniques, such as Han et al. (2015); Guo et al. (2016); Xiao et al. (2017); Liu et al. (2018); Chang and Sha (2018); Manessi et al. (2018); Li et al. (2016a); Singh et al. (2019). According to previous work, the most effective model compression techniques include pruning and quantization. In pruning Kang (2019); Lemaire et al. (2019); Frickenstein et al. (2019); Fang et al. (2018); Yang et al. (2018a); Hacene et al. (2018), we reduce the model size by replacing weights with small absolute values by zeros. In quantization Ding et al. (2019); Geng et al. (2019); Tung and Mori (2018), we reduce the model size by decreasing the precision of the weights and the activations. These works reduce the size of the parameters stored in memory so that the compressed CNN can be deployed on edge devices. Our proposed EDCompress differs from them because it considers the diversity of dataflow designs.

3 Energy-Aware Quantization/Pruning with Dataflow
Dataflow is an important concept in accelerators. It describes the mapping strategy between the mathematical operations and the processing elements Yang et al. (2018c). Algorithm 1 shows the computation of a typical convolutional layer. The algorithm contains six loops; each loop corresponds to one dimension of either the filter or the feature map. Here, $K$ and $C$ denote the numbers of output and input channels, $X$ and $Y$ denote the width and height of the feature map, and $F_x$ and $F_y$ denote the width and height of the filter. In each iteration of the innermost loop, we perform a basic operation called a multiply–accumulate (MAC). Before the MAC operation, we read three elements from memory: one from the input feature map, one from the weights, and one from the output feature map. After the MAC operation, we write the result back to memory. During the whole process, most of the energy is spent on the MAC calculations and data movement. To compute one convolutional layer, we need to execute $K \cdot C \cdot X \cdot Y \cdot F_x \cdot F_y$ MAC operations in total.
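A minimal Python sketch of Algorithm 1's six loops; the tensor shapes and loop names ($K$, $C$, $X$, $Y$, $F_x$, $F_y$) follow the conventions above and are illustrative assumptions, not the paper's code:

```python
import numpy as np

def conv_layer(ifmap, weights, ofmap):
    """Naive six-loop convolution with unit stride, as in Algorithm 1.
    Hypothetical shapes: ifmap [C, Y+Fy-1, X+Fx-1],
    weights [K, C, Fy, Fx], ofmap [K, Y, X]."""
    K, C, Fy, Fx = weights.shape
    _, Y, X = ofmap.shape
    for k in range(K):                        # output channels
        for c in range(C):                    # input channels
            for y in range(Y):                # feature-map height
                for x in range(X):            # feature-map width
                    for fy in range(Fy):      # filter height
                        for fx in range(Fx):  # filter width
                            # one MAC: three reads, one multiply-accumulate,
                            # one write back to the output feature map
                            ofmap[k, y, x] += (ifmap[c, y + fy, x + fx]
                                               * weights[k, c, fy, fx])
    return ofmap
```

The innermost statement executes exactly $K \cdot C \cdot X \cdot Y \cdot F_x \cdot F_y$ times, matching the MAC count in the text.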
In hardware accelerators using spatial architectures, we have an array or a matrix of processing elements, each of which can execute the MAC operation independently. The strategy to map the operations onto those elements becomes a key consideration in the hardware design, and there is a large design space to explore. For example, given a one-dimensional array of processing elements, we can unroll any one of the six loops in the algorithm and map each iteration of that loop onto one processing element. By similar rules, we can further unroll two loops in the algorithm and map the MAC operations onto a matrix of processing elements. With six loops in total, there are $\binom{6}{2} = 15$ such possibilities, each corresponding to one dataflow design. Here, we introduce four popular dataflows in Table 1. They are denoted A:B, where A and B stand for the names of the two unrolled loops.
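The count of two-loop unrollings can be checked directly; the loop names below follow the conventions assumed above, not necessarily the paper's exact labels:

```python
from itertools import combinations

loops = ["K", "C", "X", "Y", "Fx", "Fy"]
# Each unordered pair of loops unrolled onto a 2-D array of processing
# elements yields one candidate dataflow, written A:B.
dataflows = [f"{a}:{b}" for a, b in combinations(loops, 2)]
print(len(dataflows))  # C(6,2) = 15 candidate dataflow designs
```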
Different dataflow designs employ different data-movement policies, and thus show different energy efficiency and area overhead. Figure 2(a) shows examples of the four popular dataflow designs. To simplify the figure, we only show four processing elements in each example; a real implementation requires a full array of processing elements. In the first design, MAC results are stored in registers at the output ports of the processing elements, and at each iteration we read the previous MAC result back from these registers. In the second, weights are stored in registers at the input ports of the processing elements, and at each iteration we sum up the MAC results. In the third, weights are again stored in input registers, but at each iteration each weight is reused multiple times before the MAC results are summed up. In the fourth, at each iteration the input feature map is reused multiple times, and the MAC results are summed up.
Table 1: Four popular dataflow designs and the accelerators that apply them.

Dataflow 1: Du et al. (2015); Song et al. (2018)
Dataflow 2: Qiu et al. (2016)
Dataflow 3: Chen et al. (2016); Gao et al. (2017); Li et al. (2016b)
Dataflow 4: Chen et al. (2014); Jouppi et al. (2017); Zhang et al. (2015); Alwani et al. (2016); Shen et al. (2016); Suda et al. (2016)
3.1 Improvement on Energy and Area Efficiency
Quantization and pruning are two popular techniques in model compression. To quantize a model, we lower the precision of the parameters according to the quantization depth (the number of bits representing a parameter). After quantization, the low-precision parameters may still carry enough information for model inference. To prune a model, we replace some of the parameters in the model with zero. A well-trained model usually contains many weights with small absolute values; we sort all the weights in the filter and replace those with the smallest absolute values by zeros.
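The two operations can be sketched as follows; the uniform symmetric quantizer and the tie-breaking in the magnitude sort are simplifying assumptions, not the paper's precise scheme:

```python
import numpy as np

def quantize(weights, depth):
    """Round weights to a signed uniform grid with `depth` bits."""
    scale = float(np.abs(weights).max()) or 1.0
    levels = 2 ** (depth - 1) - 1          # e.g. 127 positive levels for 8 bits
    return np.round(weights / scale * levels) / levels * scale

def prune(weights, remaining):
    """Zero out the weights with the smallest absolute values, keeping
    the `remaining` fraction (the pruning remaining amount)."""
    k = int(weights.size * (1.0 - remaining))   # number of weights to drop
    if k == 0:
        return weights.copy()
    threshold = np.sort(np.abs(weights).ravel())[k - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights)
```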
We can save energy and reduce the area overhead of the logic circuits using quantization and pruning. Figure 2(b) shows the inner structure of a 4-bit × 4-bit multiplier, which contains 12 adders. If the weights are quantized from 4 bits to 3 bits, we can skip the last row of adders, and thus save energy and reduce area. In real applications, a high-precision model with the 32FP data type (32-bit floating point) requires 23-bit × 23-bit multipliers, with 506 adders in total. If both the activations and the weights can be quantized, we can save a great deal of energy and area. For example, if the activations are quantized from 32FP to 16FP, and the weights are quantized from 32FP to 8INT (8-bit integer), only 10-bit × 8-bit multipliers are required, with 72 adders in total, which is 86% less than the original amount. Figure 2(c) shows an array of three processing elements, each containing a multiplier and an adder. If the weights are pruned, some processing elements have inputs equal to zero; in this case, we can skip the related multipliers and thus save energy.
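The adder counts quoted above follow the pattern of an array multiplier with one row of adders per partial product after the first. The closed form $(a-1)\cdot b$ is inferred from the numbers in the text; exact counts depend on the multiplier architecture:

```python
def array_multiplier_adders(a_bits, b_bits):
    """Adders in an a-bit x b-bit array multiplier: one row of b adders
    for each of the (a - 1) partial products after the first."""
    return (a_bits - 1) * b_bits

print(array_multiplier_adders(4, 4))    # 12, as in Figure 2(b)
print(array_multiplier_adders(23, 23))  # 506, for 32FP x 32FP mantissas
saving = 1 - array_multiplier_adders(10, 8) / array_multiplier_adders(23, 23)
print(f"{saving:.0%}")                  # ~86% fewer adders with 16FP x 8INT
```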
We can also save energy and reduce the area overhead of the memory modules using quantization and pruning. To run inference, we need to store all the weights and put the intermediate feature map of each layer into memory. The memory can be either on-chip or off-chip. In either case, the data-movement energy and the area overhead of the memory modules are proportional to the total amount of data transmitted, in bits. To decrease this value, we can either reduce the size of the parameters by quantization, or reduce the number of parameters by pruning. For example, if we quantize the parameters from 32FP to 16FP and prune half of the parameters, then roughly 75% of the energy and area of the memory modules can be saved.
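Since memory energy and area scale with the total bits moved, the effects of quantization and pruning multiply; a quick check of the 75% figure:

```python
def memory_cost_ratio(bits_before, bits_after, remaining_fraction):
    """Fraction of memory energy/area left after quantizing parameters from
    bits_before to bits_after and keeping remaining_fraction of them,
    assuming both scale linearly with the total bits transmitted."""
    return (bits_after / bits_before) * remaining_fraction

ratio = memory_cost_ratio(32, 16, 0.5)  # 32FP -> 16FP, half the weights pruned
print(1 - ratio)  # 0.75 of the memory energy/area is saved
</gr>```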
3.2 Recasting to the Multi-Step Problem
We recast the model compression process as a multi-step problem. Our goal is to lower the energy consumption and area overhead of edge devices while maintaining the accuracy of the model. Instead of quantizing/pruning the model directly in one step, we approach the final target through a sequence of quantization/pruning steps. This is because we cannot alter the parameters too much at one time; otherwise, the performance of the model degrades significantly, and it becomes too difficult to restore the model Zhu and Gupta (2017).
We show an example of the multi-step optimization process in Figure 3(a). In each step, we increase or decrease the quantization depth (the precision of the parameters) or the pruning amount in different layers. For example, in step 1, we prune 40% of the weights, and the remaining weights are quantized to 7 bits. We then fine-tune the model for a few more epochs and check its accuracy and energy. If the accuracy is above the threshold, we change the quantization depth and pruning amount and repeat the optimization process. In the final step shown, we prune 60% of the weights and quantize the remaining weights to 3 bits. Since the accuracy drops sharply at this step, we abort the optimization process. The quantization depth and pruning amount can be adjusted independently at each step.
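The loop in Figure 3(a) can be sketched as below; the model interface (prune/quantize/finetune/evaluate) and the schedule values are hypothetical placeholders:

```python
def compress(model, schedule, acc_threshold):
    """Multi-step compression: apply one (depth, prune amount) pair per step,
    fine-tune briefly, and abort once accuracy falls below the threshold."""
    for depth, amount in schedule:       # e.g. (7 bits, 40%), ..., (3 bits, 60%)
        model.prune(amount)
        model.quantize(depth)
        model.finetune(epochs=2)         # fine-tune for a few epochs per step
        if model.evaluate() < acc_threshold:
            break                        # accuracy collapsed at this step
    return model
```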
The search space of this problem is huge: an L-layer model has exponentially many possible choices, assuming 1% pruning-amount granularity. Designers face many choices, and in most cases they have to make decisions based on their experience with different dataflows.
3.3 Optimization through Reinforcement Learning
Reinforcement learning is a good candidate for solving this multi-step problem. We propose a method that searches for the model compression strategy with the highest energy and area efficiency via reinforcement learning, taking the diversity of dataflow designs into account. This mechanism automatically explores the design space and finds the optimal quantization/pruning policy for each dataflow. We show an overview of our reinforcement learning model in Figure 3(b). In each episode, an agent interacts with the environment (the CNN model) via a sequence of steps. In each step $t$, the agent generates an action vector $a^t$ based on the state vector $s^t$ of the environment. The environment responds to action $a^t$ by quantizing/pruning the parameters of the model and changing its state to $s^{t+1}$. The model is then fine-tuned for one or a few epochs, and a reward $r^t$ considering both accuracy and energy consumption is returned. For large datasets such as ImageNet, the model is not fine-tuned in the first few steps. The agent then updates its own parameters to achieve higher rewards in later actions. In each episode, we start from a 100% pruning remaining amount and an 8-bit quantization depth. An episode ends when the number of steps exceeds a limit, or when the accuracy of the model drops below a predefined threshold. This limit makes the episode stop once we reach the optimal point.
$$q_i^T = q_i^0 + \sum_{t=1}^{T} \gamma^t \, \Delta q_i^t, \qquad p_i^T = p_i^0 + \sum_{t=1}^{T} \gamma^t \, \Delta p_i^t \quad (1)$$

The quantization depth and the pruning remaining amount can be expressed by Equation 1. Here, $q_i^0$ and $p_i^0$ denote the original quantization depth and pruning remaining amount of the $i$-th layer in the CNN model before the optimization, and $q_i^T$ and $p_i^T$ denote the quantization depth and the pruning remaining amount after optimization step $T$. To obtain $q_i^T$ and $p_i^T$, we need $T$ steps of optimization. In step $t$, the agent changes the values of $q_i$ and $p_i$ by $\Delta q_i^t$ and $\Delta p_i^t$, respectively. To get a better optimization result, we take smaller steps when $q_i$ and $p_i$ are close to the optimal point. The discount factor $\gamma$ is used to regulate the variance of $\Delta q_i^t$ and $\Delta p_i^t$; we test different values of $\gamma$ in experiments to find an optimal one.

$$a^t = \{\, \Delta q_i^t, \ \Delta p_i^t \mid i = 1, \dots, L \,\} \quad (2)$$

The action can be expressed by Equation 2. Here $a^t$ is the set containing the changes of $q_i$ and $p_i$ in all $L$ layers. Although the quantization depth is a discrete variable, we use a continuous action space, because we do not want to lose the small changes of the quantization depth accumulated over the optimization steps. When we fine-tune the network, we round the quantization depth to the nearest integer.
$$s^t = \{\, q_i^{t-N}, \dots, q_i^t, \ p_i^{t-N}, \dots, p_i^t, \ r^{t-N}, \dots, r^{t-1}, \ t \,\} \quad (3)$$

The state can be expressed by Equation 3. Here $s^t$ is the set containing all the quantization depths $q_i$, the pruning remaining amounts $p_i$, and the rewards $r$ from step $t-N$ to step $t$, as well as $t$, the index of the current step. We want the state of the environment to reflect the history of the optimization process well; hence, the state contains the values of $q_i$, $p_i$ and $r$ in the previous $N$ steps. To guarantee that the state set has the same dimension at any optimization step, we set $q_i^{t-j} = q_i^0$, $p_i^{t-j} = p_i^0$ and $r^{t-j} = 0$ whenever $t-j$ is less than 0.
$$r^t = \beta \, (A^t - A^{t-1}) + (E^{t-1} - E^t) \quad (4)$$

The reward can be expressed by Equation 4. Here, $A^t$ and $A^{t-1}$ are the accuracies at the current step $t$ and the previous step $t-1$, respectively, and $E^t$ and $E^{t-1}$ are the energy consumptions at steps $t$ and $t-1$. The area overhead is not included in this equation because it is highly correlated with the energy consumption: low energy consumption comes with a low area overhead. In the optimization process, we want to reduce the energy consumption while maintaining the accuracy of the model. Intuitively, decreasing the quantization depth and the pruning remaining amount reduces the energy consumption but also decreases the accuracy. The reinforcement learning algorithm can automatically find the trade-off point between the accuracy $A$ and the energy consumption $E$. We use a third parameter $\beta$ to express the importance of accuracy over energy consumption; it is normally greater than 1, and is fixed during the optimization. We test different values of $\beta$ in experiments to find an optimal one.
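With a reward of this shape, a step that holds accuracy flat while cutting energy earns a positive reward, and an accuracy drop is penalized more heavily than the same-sized energy gain. The value beta=2.0 below is an arbitrary placeholder, not the paper's tuned value:

```python
def reward(acc, acc_prev, energy, energy_prev, beta=2.0):
    """Reward of Equation 4: accuracy change weighted by beta (beta > 1
    favours accuracy) plus the energy reduction."""
    return beta * (acc - acc_prev) + (energy_prev - energy)

print(reward(0.90, 0.90, 0.8, 1.0))  # energy saved, accuracy held: positive
print(reward(0.85, 0.90, 0.8, 1.0))  # accuracy dropped: smaller reward
```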
4 Experiment
Algorithm setup: we use a state-of-the-art reinforcement learning algorithm, SAC (soft actor-critic) Haarnoja et al. (2018), to train our optimization model. Compared with classical large-space problems, the search space of our problem is not large, and SAC can approach the optimal solutions very quickly (less than one day on ImageNet using a single Titan Xp graphics card). We test EDCompress on the ImageNet, CIFAR-10 and MNIST datasets using three different neural networks: VGG16 Simonyan and Zisserman (2014), MobileNet Howard et al. (2017) and LeNet5 LeCun et al. (1998). VGG is a complex deep neural network; MobileNet is designed for computational efficiency; LeNet5 is a simple neural network with only two convolutional layers. We study the four dataflow types that are most commonly used. In each episode, we start from a well-trained model. When an episode ends, we restore the weights from a saved checkpoint, and reset the quantization depth and pruning remaining amount in each layer.
Hardware setup: we implement the four popular dataflows on the Xilinx Virtex UltraScale FPGA, and obtain the energy consumption and area overhead from the Xilinx XPE toolkit Xilinx (2018). The energy can be reported within a few seconds. In the logic part, the multipliers, adders and registers are implemented on LUTs (lookup tables); the number of LUTs a multiplier requires grows with its operand widths Walters (2016). In our experiment, the parameters in the feature map are quantized to a fixed number of bits, while the weights are quantized to $n$ bits, with $n$ ranging from 0 to 8, which determines the number of LUTs needed for a single multiplier. In the memory part, the on-chip memory is implemented on RAM (random-access memory) modules. During inference, to save memory space, the input feature map is not kept after the computation of each layer. Hence, the size of the memory modules must accommodate the weights of all layers plus the largest feature map in the model.
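The memory-sizing rule in the last sentence (all weights resident, plus the single largest live feature map) can be sketched as follows; the layer sizes are made-up numbers:

```python
def min_onchip_memory_bits(weight_counts, fmap_elements, w_bits, a_bits):
    """Minimum on-chip memory in bits: weights of every layer stay resident,
    but only the single largest intermediate feature map is live at once.
    weight_counts and fmap_elements are per-layer element counts."""
    return sum(weight_counts) * w_bits + max(fmap_elements) * a_bits

# Hypothetical two-layer model: 8INT weights, 16FP activations
bits = min_onchip_memory_bits([1000, 5000], [4096, 1024], 8, 16)
print(bits)  # 113536
```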
Table 2: Comparison with Wang et al. (2019) on ImageNet (normalized energy and area).

Dataflow      Norm. Energy              Norm. Area
              Wang et al. (2019)  Ours  Wang et al. (2019)  Ours
Dataflow 1    5.44                1.41  26.1                5.27
Dataflow 2    6.31                1.81  2.53                1.00
Dataflow 3    6.32                1.81  2.53                1.00
Dataflow 4    4.48                1.00  505                 92.0
Top-1 Acc.    64.8                68.3  64.8                68.3
Top-5 Acc.    85.9                88.3  85.9                88.3
Table 3: Comparison with Li et al. (2016a) and Singh et al. (2019) on CIFAR-10 (normalized energy and area).

Dataflow      Norm. Energy                                 Norm. Area
              Li et al. (2016a)  Singh et al. (2019)  Ours  Li et al. (2016a)  Singh et al. (2019)  Ours
Dataflow 1    24.41              15.10                1.69  7.78               5.56                 1.00
Dataflow 2    22.61              14.42                2.31  6.42               4.20                 1.27
Dataflow 3    22.17              15.10                2.73  6.42               4.20                 1.42
Dataflow 4    19.68              12.21                1.00  434                431                  47.58
Accuracy      93.1               93.4                 91.3  93.1               93.4                 91.3
Table 4: Layer-wise comparison on MNIST (LeNet5). Energy (J)  Area (mm²)
Han et al. (2015)  Guo et al. (2016)  Xiao et al. (2017)  Liu et al. (2018)  Chang and Sha (2018)  Manessi et al. (2018)  Ours  Han et al. (2015)  Guo et al. (2016)  Xiao et al. (2017)  Liu et al. (2018)  Chang and Sha (2018)  Manessi et al. (2018)  Ours  
X:Y 
Conv1  1.62  3.34  6.29  2.93  3.61  15.76  0.27  0.95  5.77  5.77  5.77  5.77  5.81  0.53 
Conv2  0.60  1.47  1.75  1.20  0.92  8.29  0.57  0.15  0.78  0.78  0.78  0.78  0.81  0.09  
FC1  0.06  0.07  0.04  0.06  0.02  0.32  0.11  0.02  0.03  0.03  0.03  0.03  0.06  0.02  
FC2  0.03  0.09  0.17  0.07  0.08  1.14  0.02  0.08  0.63  0.63  0.62  0.62  0.66  0.07  
Total  2.31  4.96  8.25  4.25  4.62  25.5  0.96  0.97  5.81  5.81  5.80  5.80  5.83  0.55  
: 
Conv1  1.33  3.09  5.67  2.73  3.33  13.91  0.22  0.05  0.20  0.20  0.20  0.20  0.24  0.03 
Conv2  0.58  1.58  1.86  1.29  0.99  7.78  0.36  0.06  0.23  0.23  0.23  0.23  0.26  0.04  
FC1  0.08  0.08  0.05  0.07  0.02  0.38  0.09  0.04  0.20  0.21  0.20  0.20  0.24  0.03  
FC2  0.03  0.09  0.17  0.07  0.08  1.14  0.02  0.08  0.63  0.63  0.62  0.62  0.66  0.06  
Total  2.03  4.84  7.75  4.16  4.42  23.21  0.69  0.09  0.66  0.66  0.66  0.66  0.7  0.08  
X: 
Conv1  1.17  3.44  6.05  3.05  3.70  12.93  0.39  0.18  1.04  1.05  1.04  1.04  1.08  0.11 
Conv2  0.71  1.69  2.00  1.37  1.04  8.66  0.53  0.09  0.41  0.41  0.41  0.41  0.45  0.06  
FC1  0.10  0.09  0.05  0.07  0.02  0.41  0.20  0.02  0.06  0.06  0.06  0.06  0.09  0.02  
FC2  0.03  0.09  0.17  0.07  0.08  1.14  0.02  0.08  0.63  0.63  0.62  0.62  0.66  0.07  
Total  2.01  5.31  8.28  4.56  4.84  23.13  1.14  0.2  1.07  1.07  1.07  1.07  1.11  0.12  
: 
Conv1  2.08  4.07  7.58  3.57  4.40  18.32  0.36  0.02  0.06  0.06  0.06  0.06  0.10  0.02 
Conv2  0.73  1.81  2.14  1.47  1.13  8.88  0.63  0.14  0.75  0.75  0.75  0.75  0.78  0.09  
FC1  0.06  0.09  0.06  0.08  0.03  0.35  0.08  1.55  14.11  14.11  14.11  14.11  14.15  1.29  
FC2  0.03  0.09  0.17  0.07  0.08  1.14  0.02  0.08  0.63  0.63  0.62  0.62  0.66  0.07  
Total  2.91  6.05  9.94  5.19  5.64  28.68  1.09  1.56  14.14  14.14  14.14  14.13  14.17  1.3  
Accuracy  99.3  99.1  99.1  99.1  99.0  99.1  98.6  99.3  99.1  99.1  99.1  99.0  99.1  98.6 
4.1 Comparison with the State-of-the-Art
EDCompress is effective on all kinds of datasets. Tables 2, 3 and 4 compare EDCompress with state-of-the-art work on the ImageNet, CIFAR-10 and MNIST datasets. Compared with HAQ on ImageNet, EDCompress, tested on four dataflow types, achieves on average 3.8X and 3.9X improvements in energy and area efficiency with similar accuracy. We then focus on small datasets because we target edge devices running lite applications. Among the four dataflows, EDCompress effectively reduces the energy consumption and area overhead with negligible loss of accuracy. Compared with state-of-the-art work, EDCompress shows 9X improvement in energy efficiency and 8X improvement in area efficiency on LeNet5, averaged over the four dataflows. It also shows 11X/6X improvement in energy/area efficiency on VGG16. If we optimize the model with EDCompress, one particular dataflow becomes the most appropriate choice for LeNet5 in terms of energy consumption and area overhead, and another is the most appropriate for VGG16.
The comparisons also indicate that EDCompress is more efficient at reducing energy consumption and area overhead than at compressing the model size. For example, in Figure 4 we compare the energy and area of EDCompress and Deep Compression (DC) Han et al. (2015), layer by layer. Overall, EDCompress shows 2.4X higher energy efficiency and 1.4X higher area efficiency than DC. In the third layer, DC outperforms EDCompress on energy consumption, because this layer contains 93% of the total parameters. However, this layer does not contribute most of the energy consumption. In fact, compressing the first layer is more helpful for energy reduction, although it only contains 0.1% of the parameters. Figure 4 and Table 4 show that EDCompress reduces much more energy consumption and area overhead in the first layer than previous work. Another example is the dataflow whose third layer contributes most of the area overhead; from the figure, we can see that EDCompress shows higher area efficiency than DC in that layer. These observations further prove that EDCompress is more efficient in the reduction of hardware resources.
4.2 Insights on Dataflow
Quantization and pruning have different effects on different dataflow designs. Figure 5 shows the optimization process of the hardware accelerators for the three neural networks in terms of energy consumption and accuracy. We start the optimization from a model with activations in the 16FP data type and weights in the 8INT data type. From the figure, we can see that the reinforcement learning algorithm effectively reduces the energy consumption, with negligible loss of accuracy. Figure 6 shows the energy consumption breakdown of each dataflow before EDCompress (a model using 16FP activations and 8INT weights) and after EDCompress. Comparing the optimized result with the original model, the energy efficiency of the VGG16, MobileNet and LeNet5 networks improves by 20X, 17X and 37X, respectively. More specifically, around 55% of the energy saving comes from the processing elements, and the remaining 45% from data movement.
The results also indicate that optimization can change our choice of dataflow type. Dataflows that do not show good energy efficiency before the optimization may show very high energy efficiency after it. Take VGG16 for example: the dataflow that consumes the most energy among the four before the optimization has the second-lowest energy consumption after it. This is because the energy consumption of a hardware accelerator includes both the energy of the MAC operations on the processing elements and the energy of data movement. As we can see from Figure 6, given a fixed pruning remaining amount and quantization depth, the energy consumed by the processing elements is almost the same across dataflows; the only way to save energy is to spend less on data movement. With optimization, the energy consumed by data movement decreases because the amount of delivered data is reduced, and different dataflow designs achieve different amounts of reduction. The dataflow in question is more efficient at data-movement reduction, and therefore we can save more energy on it than on the other dataflow types.
4.3 Insights on Quantization/Pruning
The effectiveness of quantization and pruning in reducing energy consumption and area overhead is highly related to the dataflow design. Figure 7 shows their individual contributions. In most cases, both quantization and pruning effectively reduce the energy consumption and area overhead. More specifically, if we apply the quantization technique alone, EDCompress achieves 5.6X improvement in energy efficiency and 4.3X improvement in area efficiency; if we apply pruning alone, EDCompress achieves 3.8X/1.7X improvement in energy/area efficiency.
We make two observations in Figure 7. First, pruning shows very little improvement in the area overhead of some dataflow designs. Second, the small-scale model LeNet5 prefers quantization over pruning. This is because in these cases the accelerator spends more area on the processing elements than on the memory modules. Pruning effectively reduces the area of the memory modules because it reduces the model size, but it is not good at shrinking the area of the processing elements. Quantization, on the other hand, reduces the area of both the processing elements and the memory modules effectively. Hence, the quantization technique is more useful in these cases.
5 Conclusions
We propose EDCompress, an energy-aware model compression method that takes dataflow into account. To the best of our knowledge, this is the first paper to study this problem with knowledge of the dataflow design in accelerators. Considering the very nature of model compression procedures, we recast the optimization as a multi-step problem and solve it with reinforcement learning algorithms. EDCompress can find the optimal dataflow type for a specific neural network, which can guide the deployment of CNNs on hardware systems. However, deciding which dataflow type to use in a hardware accelerator depends on many other constraints, such as the expected computation speed, the thermal design power, and the fabrication budget. Therefore, we leave the final decision to hardware developers.
References
 Fused-layer CNN Accelerators. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–12. Cited by: Table 1.
 Proxylessnas: Direct Neural Architecture Search on Target Task and Hardware. arXiv preprint arXiv:1812.00332. Cited by: §2.
 Prune Deep Neural Networks With the Modified {} Penalty. IEEE Access 7, pp. 2273–2280. Cited by: §2, Table 4.

 Diannao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine Learning. ACM SIGARCH Computer Architecture News 42 (1), pp. 269–284. Cited by: Table 1.
 Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. ACM SIGARCH Computer Architecture News 44 (3), pp. 367–379. Cited by: Table 1.
 REQ-YOLO: A Resource-Aware, Efficient Quantization Framework for Object Detection on FPGAs. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 33–42. Cited by: §2.
 ShiDianNao: Shifting Vision Processing Closer to the Sensor. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, pp. 92–104. Cited by: Table 1.

 NestDNN: Resource-aware Multi-tenant On-device Deep Learning for Continuous Mobile Vision. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, pp. 115–127. Cited by: §2.
 Resource-Aware Optimization of DNNs for Embedded Applications. In 2019 16th Conference on Computer and Robot Vision (CRV), pp. 17–24. Cited by: §2.
 Tetris: Scalable and Efficient Neural Network Acceleration with 3D Memory. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 751–764. Cited by: Table 1.
 Dataflow-Based Joint Quantization for Deep Neural Networks. In DCC, pp. 574. Cited by: §2.
 Dynamic Network Surgery for Efficient DNNs. In Advances in neural information processing systems, pp. 1379–1387. Cited by: §2, Table 4.
 Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv preprint arXiv:1801.01290. Cited by: §4.
 Quantized Guided Pruning for Efficient Hardware Implementations of Convolutional Neural Networks. arXiv preprint arXiv:1812.11337. Cited by: §2.
 Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv preprint arXiv:1510.00149. Cited by: §1, §2, §4.1, Table 4.

AMC: AutoML for Model Compression and Acceleration on Mobile Devices. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800. Cited by: §2.
 Mobilenets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv preprint arXiv:1704.04861. Cited by: §4.

In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. Cited by: Table 1.
 Accelerator-Aware Pruning for Convolutional Neural Networks. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §2.
 Gradient-based Learning Applied to Document Recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §4.

Structured Pruning of Neural Networks with Budget-Aware Regularization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9108–9116. Cited by: §2.
 Pruning Filters for Efficient Convnets. arXiv preprint arXiv:1608.08710. Cited by: §2, Table 3.
 A High Performance FPGA-based Accelerator for Large-Scale Convolutional Neural Networks. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL), pp. 1–9. Cited by: Table 1.
 Frequency-domain Dynamic Pruning for Convolutional Neural Networks. In Advances in Neural Information Processing Systems, pp. 1043–1053. Cited by: §2, Table 4.
 Automated Pruning for Deep Neural Network Compression. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 657–664. Cited by: §2, Table 4.
 Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 26–35. Cited by: Table 1.
 Overcoming Resource Underutilization in Spatial CNN Accelerators. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL), pp. 1–4. Cited by: Table 1.
 Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556. Cited by: §1, §4.
 Play and Prune: Adaptive Filter Pruning for Deep Model Compression. arXiv preprint arXiv:1905.04446. Cited by: §2, Table 3.
 Towards Efficient Microarchitectural Design for Accelerating Unsupervised GAN-based Deep Learning. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 66–77. Cited by: Table 1.
 Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 16–25. Cited by: Table 1.
 CLIP-Q: Deep Network Compression Learning by In-parallel Pruning-Quantization. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7873–7882. Cited by: §2.
 Array Multipliers for High Throughput in Xilinx FPGAs with 6input LUTs. Computers 5 (4), pp. 20. Cited by: §4.
 HAQ: HardwareAware Automated Quantization with Mixed Precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8612–8620. Cited by: §2, Table 3.
 Building Fast and Compact Convolutional Neural Networks for Offline Handwritten Chinese Character Recognition. Pattern Recognition 72, pp. 72–81. Cited by: §2, Table 4.
 Vivado Design Suite User Guide. Technical Publication. Cited by: §4.
 Energy-Constrained Compression for Deep Neural Networks via Weighted Sparse Projection and Layer Input Masking. arXiv preprint arXiv:1806.04321. Cited by: §2.
 Designing Energy-Efficient Convolutional Neural Networks Using Energy-Aware Pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5687–5695. Cited by: §2.
 Netadapt: Platform-Aware Neural Network Adaptation for Mobile Applications. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 285–300. Cited by: §2.
 DNN Dataflow Choice Is Overrated. arXiv preprint arXiv:1809.04070. Cited by: §3.
 Optimizing FPGA-Based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 161–170. Cited by: Table 1.
 To Prune, or Not to Prune: Exploring the Efficacy of Pruning for Model Compression. arXiv preprint arXiv:1710.01878. Cited by: §3.2.