1 Introduction
In recent years, deep neural networks (DNNs) have achieved remarkable performance across a wide range of applications, including computer vision, natural language processing and speech recognition. These breakthroughs are closely related to the increased amount of training data and the more powerful computing resources now available. For example, one breakthrough in natural image recognition was achieved by AlexNet
[44], which was trained using multiple graphics processing units (GPUs) on about 1.2M images. Since then, the performance of DNNs has continued to improve, and for many tasks DNNs are reported to outperform humans. The problem, however, is that the computational complexity as well as the storage requirements of these DNNs have also increased drastically, as shown in Table 1. Specifically, the widely used VGG16 model [72] involves more than 500MB of storage and over 15B FLOPs to classify a single
image. Thanks to the recent crop of powerful GPUs and CPU clusters equipped with abundant memory resources and computational units, these more powerful DNNs can be trained within a relatively reasonable time. At inference time, however, such long execution times are impractical for real-time applications. Recent years have witnessed great progress in embedded and mobile devices, including unmanned drones, smartphones, smart glasses, etc., and the demand for deploying DNN models on these devices has grown rapidly. However, the resources of these devices (storage, computational units and battery power) remain very limited, which poses a real challenge for accelerating modern DNNs in low-cost settings.
Therefore, a critical problem at present is how to equip specific hardware with efficient deep networks without significantly lowering performance. To deal with this issue, many ideas and methods on the algorithm side have been investigated over the past few years; some of these works focus on model compression, while others focus on acceleration or lowering power consumption. On the hardware side, a wide variety of FPGA/ASIC-based accelerators have been proposed for embedded and mobile applications. In this paper, we present a comprehensive survey of advanced approaches to network compression, acceleration and accelerator design. We present the central ideas behind each approach and explore the similarities and differences between the methods. Finally, we present some future directions in the field.
The rest of this paper is organized as follows. In Section 2, we give some background on network acceleration and compression. From Section 3 to Section 7, we systematically describe a series of hardware-efficient DNN algorithms, including network pruning, low-rank approximation, network quantization, teacher-student networks and compact network design. In Section 8, we introduce the design and implementation of hardware accelerators based on FPGA/ASIC technologies. In Section 9, we discuss some future directions in the field, and Section 10 concludes the paper.
Table 1: Parameter and computation statistics of popular CNN models (Conv: convolutional layers; FC: fully connected layers).

Method | Params (M) | Params in Conv (%) | Params in FC (%) | FLOPs (G) | FLOPs in Conv (%) | FLOPs in FC (%)
AlexNet | 61 | 3.8 | 96.2 | 0.72 | 91.9 | 8.1
VGG-S | 103 | 6.3 | 93.7 | 2.6 | 96.3 | 3.7
VGG16 | 138 | 10.6 | 89.4 | 15.5 | 99.2 | 0.8
NIN | 7.6 | 100 | 0 | 1.1 | 100 | 0
GoogLeNet | 6.9 | 85.1 | 14.9 | 1.6 | 99.9 | 0.1
ResNet-18 | 5.6 | 100 | 0 | 1.8 | 100 | 0
ResNet-50 | 12.2 | 100 | 0 | 3.8 | 100 | 0
ResNet-101 | 21.2 | 100 | 0 | 7.6 | 100 | 0
2 Background
Recently, deep convolutional neural networks (CNNs) have become very popular due to their powerful representational capacity. With the huge success of CNNs, the demand for deploying deep networks in real-world applications has continued to increase. However, large storage consumption and high computational complexity remain two key obstacles to deployment. For the CNN training phase, the computational complexity is not a critical problem thanks to high-performance GPUs or CPU clusters, and the storage consumption also matters less because modern servers have very large disk and memory capacities. Things are quite different, however, for the CNN inference phase, especially on embedded and mobile devices.
The enormous computational complexity introduces two problems in the deployment of CNNs in real-world applications. One is that the CNN inference phase slows down as the computational complexity grows larger. This makes it difficult to deploy CNNs in real-time applications. The other problem is that the dense computation inherent to CNNs consumes substantial battery power, which is limited on mobile devices.
The large number of parameters of CNNs consumes considerable storage and runtime memory, which are quite limited on embedded devices. In addition, it becomes more difficult to download new models online on mobile devices.
To solve these problems, network compression and acceleration methods have been proposed. In general, the computational complexity of CNNs is dominated by the convolutional layers, while the number of parameters is mainly related to the fully connected layers as shown in Table 1. Thus, most network acceleration methods focus on decreasing the computational complexity of the convolutional layers, while the network compression methods mainly try to compress the fully connected layers.
3 Network Pruning
Pruning methods were proposed before deep learning became popular, and they have been widely studied in recent years [47, 24, 22, 23]. Based on the assumption that many parameters in deep networks are unimportant or unnecessary, pruning methods remove these unimportant parameters. In this way, pruning can significantly increase the sparsity of the parameters. The high sparsity of the parameters after pruning brings two benefits for deep neural networks. On the one hand, the sparse parameters require less disk storage, since they can be stored in the compressed sparse row (CSR) or compressed sparse column (CSC) format. On the other hand, computations involving the pruned parameters are omitted, so the computational complexity of deep networks can be reduced. According to the granularity of pruning, pruning methods can be categorized into five groups: fine-grained pruning, vector-level pruning, kernel-level pruning, group-level pruning and filter-level pruning. Figure 1 shows pruning methods at these different granularities. In the following subsections, we describe the different pruning methods in detail.
3.1 Fine-grained Pruning
Fine-grained pruning methods, or vanilla pruning methods, remove parameters in an unstructured way, i.e., any unimportant parameter in the convolutional kernels can be pruned, as shown in Figure 1. Since there are no extra constraints on the pruning patterns, the parameters can be pruned to a high sparsity. Early works on pruning [47, 24] used the approximate second-order derivatives of the loss function w.r.t. the parameters to determine the saliency of the parameters, and then pruned those with low saliency. However, computing second-order derivatives is prohibitively expensive for deep networks. Recently, [22] proposed a deep compression framework that compresses deep neural networks in three steps: pruning, quantization and Huffman encoding. Using this method, AlexNet can be compressed by 35× without loss of accuracy. In [22], the pruned parameters remain permanently removed, so incorrectly pruned parameters can cause accuracy drops. To solve this problem, [17] proposed a dynamic network surgery framework, which consists of two operations: pruning and splicing. The pruning operation removes unimportant parameters, while the splicing operation recovers incorrectly pruned connections. Their method requires fewer training epochs and achieves a better compression ratio than [22].
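To make the granularity concrete, the following is a minimal NumPy sketch of magnitude-based fine-grained pruning (not the implementation of any of the cited works): weights below a magnitude threshold are zeroed, and the surviving weights are stored in CSR format. The layer shape, target sparsity and thresholding rule are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix

def magnitude_prune(weights, sparsity=0.9):
    """Zero out the smallest-magnitude entries so that roughly `sparsity` of them are removed."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    threshold = np.partition(flat, k)[k] if k < flat.size else flat.max() + 1.0
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

# Example: prune a 64x128 fully connected weight matrix to ~90% sparsity.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128)).astype(np.float32)
pruned, mask = magnitude_prune(w, sparsity=0.9)
sparse_w = csr_matrix(pruned)            # store surviving weights in CSR format
print(mask.mean())                       # fraction of weights kept (~0.1)
print(sparse_w.data.nbytes, w.nbytes)    # CSR payload vs. dense storage
```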
3.2 Vector-level and Kernel-level Pruning
Vector-level pruning methods prune vectors in the convolutional kernels, and kernel-level pruning methods prune 2D convolutional kernels in the filters. Since most pruning methods focus on fine-grained pruning or filter-level pruning, there are few works on vector-level and kernel-level pruning. [3] first explored kernel-level pruning, and then proposed an intra-kernel strided pruning method, which prunes sub-vectors with a fixed stride.
[58] explored different pruning granularities and found that vector-level pruning takes up less storage than fine-grained pruning because it requires fewer indices to indicate the pruned parameters. Moreover, vector-level, kernel-level and filter-level pruning are friendlier to hardware implementations since they are more structured pruning methods.
3.3 Group-level Pruning
Group-level pruning methods prune parameters according to the same sparse pattern across filters. As shown in Figure 2, each filter has the same sparsity pattern, so the convolutional filters can be represented as a thinned dense matrix. With group-level pruning, convolutions can be implemented as thinned dense matrix multiplications. As a result, Basic Linear Algebra Subprograms (BLAS) libraries can be utilized to achieve a higher speedup. [46] proposed the group-wise brain damage approach, which prunes the weight matrix in a group-wise fashion. By using group-sparsity regularization, deep networks can be trained easily with group-sparsified parameters. Since group-level pruning can utilize BLAS libraries, the practical speedup is almost linear in the sparsity level; using this method, they achieved a 3.2× speedup over all convolutional layers of AlexNet. Concurrently with [46], [85] proposed using the group Lasso to prune groups of parameters, and explored different levels of structured sparsity in terms of filters, channels, filter shapes and depth. Their method can be regarded as a more general group-regularized pruning method; for AlexNet's convolutional layers, it achieves about 5.1× and 3.1× speedups on a CPU and a GPU, respectively.
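The following sketch illustrates the kind of group-Lasso regularizer that drives such structured sparsity, here applied to "filter shape" groups where all output filters share one group per input-channel/spatial position. The weight layout and grouping axis are illustrative assumptions rather than the exact formulations of [46] or [85].

```python
import numpy as np

def group_lasso_penalty(kernel, group_axis=0):
    """Group-Lasso regularizer: sum over groups of the L2 norm of each group.
    `kernel` has layout (n_out, c_in, kh, kw); collapsing `group_axis` (here the
    output-filter axis) makes each (c_in, kh, kw) position one group, so the
    regularizer pushes whole positions to zero across all filters."""
    group_norms = np.sqrt((kernel ** 2).sum(axis=group_axis))
    return group_norms.sum()

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32, 3, 3)).astype(np.float32)
penalty = group_lasso_penalty(w)   # added (with a weight) to the training loss
print(penalty)
```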
3.4 Filter-level Pruning
Filter-level pruning methods prune convolutional filters or channels, which makes the deep networks thinner. After filter pruning in one layer, the number of input channels of the next layer is also reduced; thus, filter-level pruning is more efficient for accelerating deep networks. [54] proposed a filter-level pruning method named ThiNet, which uses the next layer's feature map to guide filter pruning in the current layer: channels are selected greedily by minimizing the feature map reconstruction error. Similar to [54], [26] proposed an iterative two-step algorithm that prunes filters by minimizing the feature map errors. Specifically, they introduced a selection weight β for each filter and added sparsity constraints on β, so that the channel selection problem can be cast as a LASSO regression problem. To minimize the feature map errors, they iteratively update β and the filter weights. Their method achieved a 5× speedup on the VGG16 network with little drop in accuracy. Instead of using additional selection weights, [53] proposed to leverage the scaling factors of the batch normalization layers to evaluate the importance of the filters. By pruning the channels with near-zero scaling factors, they can prune filters without introducing overhead into the networks.
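As an illustration of filter-level pruning, the sketch below ranks filters by the magnitude of the batch-normalization scaling factors, in the spirit of [53], and shows how removing output filters of one layer also removes the corresponding input channels of the next layer. The layer shapes, the keep ratio and the per-layer (rather than global) threshold are assumptions made for illustration, not the authors' exact procedure.

```python
import numpy as np

def select_filters_by_bn_scale(gamma, keep_ratio=0.5):
    """Return indices of filters to keep, ranked by |gamma| of the following
    batch-normalization layer (network-slimming-style criterion)."""
    n_keep = max(1, int(keep_ratio * gamma.size))
    order = np.argsort(-np.abs(gamma))
    return np.sort(order[:n_keep])

def prune_conv_pair(w_cur, w_next, gamma, keep_ratio=0.5):
    """Prune output filters of the current layer and the corresponding input
    channels of the next layer. Weight layout: (n_out, c_in, kh, kw)."""
    keep = select_filters_by_bn_scale(gamma, keep_ratio)
    return w_cur[keep], w_next[:, keep], keep

rng = np.random.default_rng(0)
w1 = rng.standard_normal((64, 32, 3, 3)).astype(np.float32)    # current layer
w2 = rng.standard_normal((128, 64, 3, 3)).astype(np.float32)   # next layer
gamma = rng.standard_normal(64).astype(np.float32)             # BN scales after layer 1
w1p, w2p, kept = prune_conv_pair(w1, w2, gamma, keep_ratio=0.5)
print(w1p.shape, w2p.shape)   # (32, 32, 3, 3) (128, 32, 3, 3)
```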
4 Low-rank Approximation
The convolutional kernel W of a convolutional layer is a 4D tensor of size w × h × c × n, where the four dimensions correspond to the kernel width, kernel height, and the number of input and output channels, respectively. Note that by merging some of the dimensions, the 4D tensor can be transformed into a lower-dimensional (d < 4) tensor. The motivation behind low-rank decomposition is to find an approximate tensor that is close to W but facilitates more efficient computation. Many low-rank based methods have been proposed by the community; two key differences among them are how the four dimensions are rearranged and on which dimension the low-rank constraint is imposed. Here we roughly divide the low-rank based methods into three categories according to how many components the filters are decomposed into: two-component decomposition, three-component decomposition and four-component decomposition.
4.1 Two-component Decomposition
For two-component decomposition, the weight tensor is divided into two parts and the convolutional layer is replaced by two successive layers. [35] decomposed the w × h spatial dimension into w × 1 and 1 × h filters. They achieved a 4.5× speedup for a CNN trained on a text character recognition dataset, with a 1% accuracy drop.
SVD is a popular low-rank matrix decomposition method. By merging the dimensions w, h and c, the kernel becomes a 2D matrix of size (whc) × n, on which SVD can be conducted. In [11], the authors utilized SVD to reduce network redundancy. SVD decomposition was also investigated in [97], in which the filters were replaced by two filter banks: one consisting of d filters of shape w × h × c and the other composed of n filters of shape 1 × 1 × d. Here, d represents the rank of the decomposition, i.e., the original n filters are linear combinations of d basis filters. They also proposed a nonlinear response reconstruction method based on the low-rank decomposition. On the challenging VGG16 model for the ImageNet classification task, this two-component SVD decomposition achieved a 3× theoretical speedup at a cost of about a 1.66% increase in the top-5 error.

Similarly, another SVD decomposition can be obtained by exploring the low-rank property along the input channel dimension c. In this case, the weight tensor is reshaped into a matrix of size c × (whn). Selecting a rank d, the convolution is decomposed into a 1 × 1 convolution followed by a w × h convolution. These two decompositions are symmetric.
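The following NumPy sketch shows the two-component SVD decomposition along the output-channel dimension described above: the kernel is reshaped into an n × (whc) matrix, truncated to rank d, and folded back into a w × h convolution with d filters followed by a 1 × 1 convolution with n filters. The layer sizes and the chosen rank are illustrative; the cited works additionally reconstruct responses and fine-tune, which is omitted here.

```python
import numpy as np

def svd_decompose_conv(kernel, rank):
    """Two-component decomposition along the output-channel dimension:
    W (n, c, kh, kw) ~= first-layer kernel (rank, c, kh, kw) followed by a
    1x1 kernel (n, rank, 1, 1)."""
    n, c, kh, kw = kernel.shape
    mat = kernel.reshape(n, c * kh * kw)               # n x (c*kh*kw)
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    first = (np.diag(s[:rank]) @ vt[:rank]).reshape(rank, c, kh, kw)
    second = u[:, :rank].reshape(n, rank, 1, 1)
    return first, second

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 128, 3, 3)).astype(np.float32)
a, b = svd_decompose_conv(w, rank=64)
# Relative reconstruction error of the rank-64 approximation:
approx = (b.reshape(256, 64) @ a.reshape(64, -1)).reshape(w.shape)
print(np.linalg.norm(w - approx) / np.linalg.norm(w))
```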
4.2 Three-component Decomposition
Based on the analysis of two-component decomposition methods, a straightforward three-component decomposition can be obtained by two successive two-component decompositions. Note that the SVD decomposition introduces two weight tensors: the first is a w × h × c × d tensor, and the other is a 1 × 1 × d × n tensor (a matrix). The first convolution is still very time consuming due to the large size of the first tensor. We can therefore conduct a further two-component decomposition on the first weight tensor after the SVD decomposition, which turns the whole scheme into a three-component decomposition. This strategy was studied in [97]: after the SVD decomposition, they applied the decomposition method of [35] to the first decomposed tensor, so the final three components were convolutions with spatial sizes of w × 1, 1 × h and 1 × 1, respectively. Using this three-component decomposition, [97] reported only a 0.3% increase in the top-5 error for a 4× theoretical speedup.
If we instead apply the SVD decomposition along the input channel dimension c to the first tensor after the two-component decomposition, we obtain the Tucker decomposition format proposed by [41]. The three components are a 1 × 1 convolution, a w × h convolution, and another 1 × 1 convolution. Note that instead of using a two-step SVD decomposition, [41] utilized the Tucker decomposition directly to obtain these three components. Their method achieved a 4.93× theoretical speedup at a cost of about a 0.5% increase in the top-5 error.
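A quick way to see where the speedup of this three-component (Tucker-2 style) form comes from is to count multiply-accumulate operations, as in the sketch below. The layer dimensions and ranks are illustrative assumptions, and the printed ratio is the theoretical, not measured, speedup.

```python
def conv_macs(h, w, c_in, c_out, kh, kw):
    """Multiply-accumulate count of a conv layer producing an h x w feature map."""
    return h * w * c_in * c_out * kh * kw

def tucker2_macs(h, w, c_in, c_out, kh, kw, r_in, r_out):
    """1x1 (c_in->r_in) + kh x kw (r_in->r_out) + 1x1 (r_out->c_out)."""
    return (conv_macs(h, w, c_in, r_in, 1, 1)
            + conv_macs(h, w, r_in, r_out, kh, kw)
            + conv_macs(h, w, r_out, c_out, 1, 1))

# Illustrative layer: 3x3 kernel, 256 -> 256 channels, 28x28 output map, ranks (64, 64).
full = conv_macs(28, 28, 256, 256, 3, 3)
tucker = tucker2_macs(28, 28, 256, 256, 3, 3, 64, 64)
print(full / tucker)   # theoretical speedup of the three-component form
```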
To further reduce the complexity, [80] proposed a Block Term Decomposition (BTD) method based on low-rank and group-sparse decomposition. Note that in the Tucker decomposition, the second component, corresponding to the w × h convolution, still requires a large number of computations. Because this second tensor is already low-rank along both the input and output channel dimensions, the decomposition methods discussed above cannot be applied to it any longer. Instead, [80] proposed to approximate the original weight tensor by a sum of several smaller sub-tensors, each of which is in the Tucker decomposition format. By rearranging these sub-tensors, the BTD can be seen as a Tucker decomposition in which the second decomposed tensor is a block-diagonal tensor. Using this decomposition, they achieved a 7.4× actual speedup for the VGG16 model at a cost of a 1.3% increase in the top-5 error. Their method also achieved high speedups for object detection and image retrieval tasks, as reported in [82].
4.3 Four-component Decomposition
By exploring the low-rank property along the input/output channel dimensions as well as the spatial dimensions, a four-component decomposition can be obtained. This corresponds to the CP-decomposition acceleration method proposed in [45], where the four components are convolutions of size 1 × 1, w × 1, 1 × h and 1 × 1. The CP-decomposition can achieve a very high speedup ratio; however, due to the approximation error, only the second layer of AlexNet was processed in [45]. They achieved a 4.5× speedup for the second layer of AlexNet at a cost of about a 1% accuracy drop.
Table 2: Fixed-point quantization methods, grouped by which parts of the network are quantized and whether the training and testing phases are accelerated.

Method | Weights | Activations | Gradients | Training acceleration | Testing acceleration
BinaryConnect [10] | Binary | Full | Full | No | Yes
BWN [65] | Binary | Full | Full | No | Yes
BWNH [32] | Binary | Full | Full | No | Yes
TWN [48] | Ternary | Full | Full | No | Yes
FFN [81] | Ternary | Full | Full | No | Yes
INQ [99] | Ternary to 5-bit | Full | Full | No | Yes
BNN [65] | Binary | Binary | Full | No | Yes
XNOR [65] | Binary | Binary | Full | No | Yes
HWGQ [4] | Binary | 2-bit | Full | No | Yes
DoReFa-Net [100] | Binary | 1-4 bit | 6-bit, 8-bit, Full | Yes | Yes
5 Network Quantization
Quantization is widely used in compression and acceleration applications; it has broad uses in image compression, information retrieval, etc. Many quantization methods have also been investigated for network acceleration and compression. We can categorize these methods into two main groups: (1) scalar and vector quantization, which may require a codebook, and (2) fixed-point quantization.
5.1 Scalar and Vector Quantization
Scalar and vector quantization techniques have a long history and were originally used for data compression. With scalar or vector quantization, the original data can be represented by a codebook and a set of quantization codes. The codebook contains a set of quantization centers, and the quantization codes indicate the assignment of each data point to a center. In general, the number of quantization centers is far smaller than the amount of original data, and the quantization codes can be encoded by a lossless method, e.g., Huffman coding, or simply represented as low-bit fixed-point values. Thus, scalar or vector quantization can achieve a high compression ratio. [15] explored scalar and vector quantization techniques for compressing deep networks. For scalar quantization, they used the well-known k-means algorithm to compress the parameters. In addition, the product quantization (PQ) algorithm [36], a special case of vector quantization, was leveraged to compress the fully connected layers. By partitioning the feature space into several disjoint subspaces and then conducting k-means in each subspace, the PQ algorithm can compress the fully connected layers with little loss. While [15] only compressed the fully connected layers, in [86] and [8] the authors proposed to utilize the PQ algorithm to simultaneously accelerate and compress convolutional neural networks. They quantize the convolutional filters layer by layer by minimizing the feature map reconstruction loss. During the inference phase, a lookup table is built by precomputing the inner products between feature map patches and codebooks; the output feature map can then be calculated by simply accessing the lookup table. Using this method, they achieve a 4-6× speedup and a 15-20× compression ratio with little accuracy loss.
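The sketch below shows scalar quantization of a weight matrix with a small k-means codebook, the basic ingredient of the methods above. The codebook size, initialization and number of iterations are illustrative assumptions; product quantization would additionally split the rows into sub-vectors before clustering.

```python
import numpy as np

def kmeans_quantize(weights, n_centers=16, n_iter=20):
    """Scalar quantization of weights with a small k-means codebook.
    Returns the codebook and per-weight assignment codes."""
    flat = weights.ravel()
    # Initialize the centers uniformly over the weight range.
    centers = np.linspace(flat.min(), flat.max(), n_centers)
    for _ in range(n_iter):
        codes = np.argmin(np.abs(flat[:, None] - centers[None, :]), axis=1)
        for k in range(n_centers):
            if np.any(codes == k):
                centers[k] = flat[codes == k].mean()
    return centers, codes.reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32)
codebook, codes = kmeans_quantize(w, n_centers=16)
w_hat = codebook[codes]          # dequantized weights
# 16 centers -> 4-bit codes: roughly 8x smaller than 32-bit floats, plus a tiny codebook.
print(np.abs(w - w_hat).mean())  # average quantization error
```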
5.2 Fixed-point Quantization
Fixed-point quantization is an effective approach for lowering the resource consumption of a network. Based on which part of the network is quantized, these methods fall into two main categories: weight quantization and activation quantization. Some other works also quantize the gradients, which can accelerate the training stage as well. Here, we mainly review weight quantization and activation quantization methods, which accelerate the test-phase computation. Table 2 summarizes these methods according to which part is quantized and whether the training and testing stages can be accelerated.
5.2.1 Fixed-point Quantization of Weights
Fixed-point weight quantization is a fairly mature topic in network acceleration and compression. [19] proposed a VLSI architecture for network acceleration using 8-bit inputs and outputs and a 16-bit internal representation. In [28], the authors provided a theoretical analysis of the error caused by low-bit quantization to determine the bit-width for a multi-layer perceptron, and showed that 8-16 bit quantization was sufficient for training small neural networks. These early works mainly focused on simple multi-layer perceptrons. A more recent work [6] showed that 32-bit fixed-point representation is necessary for the convergence of a convolutional neural network trained on the MNIST dataset. Using stochastic rounding, [18] found that 16-bit fixed-point numbers are sufficient to train a convolutional neural network on MNIST. In addition, 8-bit fixed-point quantization was investigated in [12] to speed up the convergence of deep networks in parallel training, and logarithmic data representation was investigated in [59].

Recently, much lower-bit quantization, and even binary and ternary quantization, has been investigated. Expectation Backpropagation (EBP), introduced in [9], utilized a variational Bayes method to binarize the network. The BinaryConnect method proposed in [10] constrained all weights to be either +1 or -1. Trained from scratch, BinaryConnect can even outperform its floating-point counterpart on the CIFAR10 [43] image classification dataset. With binary quantization, the network can be compressed by about 32× compared to a 32-bit floating-point network, and most of the floating-point multiplications can be eliminated [51]. In [65], the authors proposed the Binary Weight Network (BWN), which was among the earliest works to achieve good results on the large ImageNet [68] dataset. Loss-aware binarization was proposed in [30], which directly minimizes the classification loss with respect to the binarized weights. In [32], the authors proposed BWNH, which trains binary weight networks via hashing and outperformed other weight binarization methods by a large margin. Ternary quantization was also utilized in [33]. In [48], the authors proposed the Ternary Weight Network (TWN), which is similar to BWN but constrains all weights to ternary values in {-1, 0, +1}. TWN outperformed BWN by a large margin on deep models such as ResNet. Trained Ternary Quantization, proposed in [101], learns both the ternary values and the ternary assignments at the same time using backpropagation, and achieved comparable results on the AlexNet model. Different from previous quantization methods, the Incremental Network Quantization (INQ) method proposed in [99] gradually turns all weights into a logarithmic format in a multi-step manner. This incremental quantization strategy lowers the quantization error at each stage and thus makes the quantization problem much easier. All the low-bit quantization methods discussed above directly quantize the full-precision weights into a fixed-point format. In [81], the authors proposed a very different quantization strategy: instead of direct quantization, they use a fixed-point factorized network (FFN) to quantize all weights into ternary values. This fixed-point decomposition method can significantly lower the quantization error. The FFN method achieved comparable results on commonly used deep models such as AlexNet, VGG16 and ResNet.
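To illustrate the two most common weight quantizers discussed above, the sketch below gives per-layer binarization and ternarization with a single scaling factor, in the spirit of BWN and TWN. BWN actually uses per-filter scaling factors and both methods train with the quantizer in the loop, which is omitted here; the 0.7 threshold factor follows the approximate rule given in the TWN paper.

```python
import numpy as np

def binarize_weights(w):
    """BWN-style binarization: W ~= alpha * sign(W), with alpha = mean(|W|)."""
    alpha = np.abs(w).mean()
    return alpha, np.where(w >= 0, 1.0, -1.0)

def ternarize_weights(w, t=0.7):
    """TWN-style ternarization with threshold delta = t * mean(|W|):
    weights with |w| <= delta become 0, the rest become +/-1 scaled by alpha."""
    delta = t * np.abs(w).mean()
    tern = np.where(w > delta, 1.0, np.where(w < -delta, -1.0, 0.0))
    nonzero = np.abs(w) > delta
    alpha = np.abs(w[nonzero]).mean() if nonzero.any() else 0.0
    return alpha, tern

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32, 3, 3)).astype(np.float32)
a_bin, w_bin = binarize_weights(w)
a_ter, w_ter = ternarize_weights(w)
# Ternary quantization typically gives a smaller approximation error than binary.
print(np.linalg.norm(w - a_bin * w_bin), np.linalg.norm(w - a_ter * w_ter))
```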
5.2.2 Fixed-point Quantization of Activations
With only the weights quantized, time-consuming floating-point operations are still required. If the activations are also quantized into fixed-point values, the network can be executed efficiently using only fixed-point operations. Many activation quantization methods have been proposed by the deep learning community. The bitwise neural network was proposed in [40]. Binarized Neural Networks (BNN) were among the first works to quantize both weights and activations into either -1 or +1; BNN achieved accuracy comparable to the full-precision baseline on the CIFAR10 dataset. To extend BNN to the ImageNet classification task, the authors of [75] improved the training strategies of BNN, and much higher accuracy was reported using these strategies. Based on BWN, the authors of [65] further quantized all activations into binary values, turning the network into XNOR-Net. Compared with BNN, XNOR-Net achieves much higher accuracy on the ImageNet dataset. To further understand the effect of bit-width on the training of deep neural networks, DoReFa-Net was proposed in [100]; it investigated the effect of different bit-widths for weights and activations as well as gradients. By making use of batch normalization, [4] presented the Half-wave Gaussian Quantization (HWGQ) method to quantize both weights and activations; high performance was achieved on commonly used CNN models with 2-bit activations and binary weights.
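A k-bit uniform activation quantizer of the kind used in DoReFa-Net (for activations assumed to be bounded in [0, 1], e.g., after a clipped nonlinearity) can be sketched as follows. During training, a straight-through estimator would be used to pass gradients through the rounding, which is not shown here.

```python
import numpy as np

def quantize_activations(x, bits=2):
    """Uniform k-bit quantization of activations assumed to lie in [0, 1]:
    round to 2^k - 1 equally spaced levels."""
    levels = 2 ** bits - 1
    x = np.clip(x, 0.0, 1.0)
    return np.round(x * levels) / levels

rng = np.random.default_rng(0)
act = rng.random((1, 64, 14, 14)).astype(np.float32)   # a fake feature map
q = quantize_activations(act, bits=2)
print(np.unique(q))           # only 4 distinct values: 0, 1/3, 2/3, 1
print(np.abs(act - q).max())  # worst-case quantization error <= 1/6
```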
6 Teacher-student Network
The teacher-student approach differs from the network compression and acceleration methods above, since it trains a student network with the help of a teacher network, and the student network can have a different architecture. Generally speaking, a teacher network is a large neural network or an ensemble of neural networks, while a student network is a compact and efficient neural network. By utilizing the dark knowledge transferred from the teacher network, the student network can achieve higher accuracy than when trained merely with the class labels. [27] proposed the knowledge distillation (KD) method, which trains a student network using the softmax-layer outputs of the teacher network. Following this line of thinking, [67] proposed FitNets to train a deeper and thinner student network. Since the depth of a neural network matters more than its width, a deeper student network can reach higher accuracy. In addition, they utilized both the intermediate layers' feature maps and the soft outputs of the teacher network to train the student network. Rather than mimicking the intermediate layers' feature maps, [91] proposed to train the student network by imitating the attention maps of the teacher network. Their experiments showed that attention maps carry more useful information than raw layer activations, and their method achieves higher accuracy than FitNets.
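The following sketch shows the basic knowledge-distillation loss of [27]: a cross-entropy term against the teacher's temperature-softened outputs combined with the usual cross-entropy against the hard labels. The temperature T, the mixing weight alpha and the batch size are illustrative hyper-parameters.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """KD loss: cross-entropy with the teacher's softened outputs (temperature T)
    plus cross-entropy with the hard labels, mixed with weight alpha."""
    p_teacher = softmax(teacher_logits, T)
    p_student_T = softmax(student_logits, T)
    p_student = softmax(student_logits, 1.0)
    n = labels.shape[0]
    soft_ce = -(p_teacher * np.log(p_student_T + 1e-12)).sum(axis=1).mean()
    hard_ce = -np.log(p_student[np.arange(n), labels] + 1e-12).mean()
    # T**2 rescales the soft-target term, as suggested in the KD paper.
    return alpha * (T ** 2) * soft_ce + (1.0 - alpha) * hard_ce

rng = np.random.default_rng(0)
t_logits = rng.standard_normal((8, 10))   # teacher outputs for a batch of 8
s_logits = rng.standard_normal((8, 10))   # student outputs
labels = rng.integers(0, 10, size=8)
print(distillation_loss(s_logits, t_logits, labels))
```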
7 Compact Network Design
The objective of network acceleration and compression is to optimize the execution and storage of a given deep neural network, leaving the network architecture itself unchanged. A parallel line of work instead designs more efficient, low-cost network architectures directly.
In [50], the authors proposed the Network-In-Network architecture, where 1 × 1 convolutions are utilized to increase the network capacity while keeping the overall computational complexity small. To reduce the storage requirements of CNN models, they also proposed to remove the fully connected layers and use global average pooling instead. These strategies are also used by many state-of-the-art CNN models such as GoogLeNet [74] and ResNet [25].
Branching (multiple group convolution) is another commonly used strategy for lowering network complexity, explored for example in GoogLeNet [74]. By making extensive use of 1 × 1 convolutions and the branching strategy, the SqueezeNet proposed in [34] achieved about 50× compression over AlexNet with comparable accuracy. Through branching, the ResNeXt work of [89] achieves much higher accuracy than ResNet [25] at the same computational budget. The depthwise convolution proposed in MobileNet [31] takes the branching strategy to the extreme, i.e., the number of branches equals the number of input/output channels. The resulting MobileNet is 32× smaller and 27× faster than the VGG16 model, with comparable image classification accuracy on ImageNet. When depthwise convolutions are combined with 1 × 1 convolutions as in MobileNet, most of the computation and parameters reside in the 1 × 1 convolutions. One strategy to further lower the complexity of the 1 × 1 convolutions is to split them into multiple groups. The ShuffleNet proposed in [96] introduced the channel shuffle operation to increase the information exchange between the groups, which prominently increases the representational power of the network. Their method achieved about a 13× actual speedup over AlexNet with comparable accuracy.
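The saving from depthwise separable convolutions can be seen directly by counting multiply-accumulates and parameters, as in the sketch below; the layer dimensions are illustrative assumptions.

```python
def standard_conv_cost(h, w, c_in, c_out, k):
    """MACs and parameters of a standard k x k convolution on an h x w output map."""
    macs = h * w * c_in * c_out * k * k
    params = c_in * c_out * k * k
    return macs, params

def depthwise_separable_cost(h, w, c_in, c_out, k):
    """k x k depthwise convolution followed by a 1 x 1 pointwise convolution."""
    dw_macs = h * w * c_in * k * k
    pw_macs = h * w * c_in * c_out
    return dw_macs + pw_macs, c_in * k * k + c_in * c_out

# Illustrative layer: 3x3, 256 -> 256 channels, 28x28 feature map.
std = standard_conv_cost(28, 28, 256, 256, 3)
sep = depthwise_separable_cost(28, 28, 256, 256, 3)
print(std[0] / sep[0], std[1] / sep[1])   # roughly 8-9x fewer MACs and parameters
```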
8 Hardware Accelerator
8.1 Background
Deep neural networks provide impressive performance on various tasks but suffer from high computational complexity. Traditionally, algorithms based on deep neural networks are executed on general-purpose platforms such as CPUs and GPUs, at the expense of high power consumption and oversized resource utilization for both computing and storage. In recent years, an increasing number of applications have been built on embedded systems, including autonomous vehicles, unmanned drones, security cameras, etc. Considering the demands for high performance, light weight and low power consumption on these devices, CPU/GPU-based solutions are no longer suitable. In this scenario, FPGA/ASIC-based hardware accelerators are gaining popularity as efficient alternatives.
8.2 General Architecture
The deployment of a DNN in a real-world application consists of two phases: training and inference. Network training is known to be expensive in terms of speed and memory, so it is usually carried out offline on GPUs. During the inference phase, the pre-trained network parameters can be loaded either from the cloud or from dedicated off-chip memory. Recently, hardware accelerators for training have also received widespread attention [42, 90, 79], but in this section we mainly focus on the inference phase in embedded settings.
Typically, an accelerator is composed of five parts: data buffers, parameter buffers, processing elements, a global controller and an off-chip transfer manager, as shown in Figure 3. The data buffers are used to cache input images, intermediate data and output predictions, while the weight buffers mainly cache convolutional filters. The processing elements are a collection of basic computing units that execute multiply-adds, nonlinearities and any other functions such as normalization, quantization, etc. The global controller orchestrates the on-chip computing flow, while off-chip transfers of data and instructions are conducted through the transfer manager. This basic architecture can be found in existing accelerators designed for both specific and general tasks.
Heterogeneous computing is widely adopted in hardware acceleration. Computing-intensive operations such as multiply-adds are mapped onto the dedicated hardware for high throughput, while data preprocessing, softmax and other operations can be placed on the CPU/GPU for low-latency processing.
8.3 Processing Elements
Among the components of an accelerator, the biggest differences exist in the processing elements, as they carry out the majority of the computing tasks in deep networks, such as massive multiply-add operations, normalization (batch normalization or local response normalization), and nonlinearities (ReLU, sigmoid and tanh). Typically, the computing engine of an accelerator is composed of many small basic processing elements, as shown in Figure 3; this architecture is mainly designed to fully exploit data reuse and parallelism. However, some accelerators operate with only one processing element in consideration of lower data movement and resource conservation [92, 57].
8.4 Optimizing for High Throughput
Since the majority of the computations in a network are matrix-matrix or matrix-vector multiplications, it is critical to handle the massive nested loops well to achieve high throughput. Loop optimization is one of the most frequently adopted techniques in accelerator design [92, 56, 73, 2, 88, 49], including loop tiling, loop unrolling and loop interchange. Loop tiling divides the data into multiple small blocks in order to alleviate the pressure on on-chip storage [56, 2, 64], while loop unrolling attempts to improve the parallelism of the computing engine for higher speed [56, 64]. Loop interchange determines the order in which the nested loops are computed, since different computation orders can result in significant differences in performance. The well-known systolic array can be seen as a combination of the loop optimization methods listed above, which leverages the data locality and weight sharing in the network to achieve high throughput [37, 84].
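Loop tiling can be illustrated with a blocked matrix multiplication, where one small tile of each operand is processed at a time (mirroring what fits in on-chip buffers), and the per-tile work is what loop unrolling would spread across processing elements. This is a software sketch of the idea, not a hardware design; the tile size is an illustrative assumption.

```python
import numpy as np

def tiled_matmul(a, b, tile=64):
    """Blocked matrix multiply: each tile x tile block of A, B and C is small
    enough to stay in on-chip buffers; the per-block product is what an
    accelerator would unroll across its processing elements."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    c = np.zeros((m, n), dtype=a.dtype)
    for i0 in range(0, m, tile):              # loop tiling over output rows,
        for j0 in range(0, n, tile):          # output columns,
            for p0 in range(0, k, tile):      # and the reduction dimension
                c[i0:i0 + tile, j0:j0 + tile] += (
                    a[i0:i0 + tile, p0:p0 + tile] @ b[p0:p0 + tile, j0:j0 + tile]
                )
    return c

rng = np.random.default_rng(0)
a = rng.standard_normal((256, 192)).astype(np.float32)
b = rng.standard_normal((192, 128)).astype(np.float32)
print(np.allclose(tiled_matmul(a, b), a @ b, atol=1e-3))
```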
8.5 Optimizing for Low Energy Consumption
Existing works attempt to reduce the energy consumption of hardware accelerators from both the computing and I/O perspectives. [29] systematically illustrated the energy cost of arithmetic operations and memory accesses, demonstrating that integer operations are much cheaper than their floating-point counterparts and that lower-bit integers are better still. Therefore, most existing accelerators adopt low-bit or even binary data representations [98, 77, 61] to preserve energy efficiency. Most recently, logarithmic computation, which transfers multiplications into bit-shift operations, has also shown promise for energy savings [13, 16, 76].
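A sketch of the logarithmic idea is given below: weights are rounded to signed powers of two so that each multiplication can be replaced by a bit shift. The exponent range and rounding rule are illustrative assumptions, not the exact schemes of [13], [16] or [76].

```python
import numpy as np

def quantize_to_power_of_two(w, min_exp=-7, max_exp=0):
    """Round each weight to the nearest signed power of two so that a multiply
    becomes a bit shift; exponents are clipped to a small range."""
    sign = np.sign(w)
    mag = np.clip(np.abs(w), 2.0 ** min_exp, 2.0 ** max_exp)
    exp = np.clip(np.round(np.log2(mag)), min_exp, max_exp)
    return sign * (2.0 ** exp), exp.astype(np.int8)

rng = np.random.default_rng(0)
w = (rng.standard_normal((64, 64)) * 0.1).astype(np.float32)
w_q, exponents = quantize_to_power_of_two(w)
print(np.abs(w - w_q).max())   # worst-case rounding error
print(np.unique(exponents))    # the few shift amounts actually used
```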
Sparsity is gaining increased popularity in accelerator design, based on the observation that a great number of arithmetic operations can be discarded to obtain energy efficiency. [21], [20] and [62] designed architectures for image or speech recognition based on network pruning, while [1] and [95] proposed to eliminate ineffectual operations based on the inherent sparsity of networks.
Off-chip data transfers occur frequently in hardware accelerators because both the network parameters and the intermediate data are too large to fit on chip. [29] showed that the power consumed by a DRAM access is several orders of magnitude higher than that of an SRAM access, so reducing off-chip transfers is a critical issue. [70] designed a flexible data buffering scheme to reduce bandwidth requirements, and [2] and [88] proposed fusion-based methods to reduce off-chip traffic. Most recently, [49] presented a block-based convolution that can completely avoid off-chip transfers of intermediate data in VGG16 while maintaining high throughput.
8.6 Design Automation
Recently, design automation frameworks that automatically map deep neural networks onto hardware have received wider attention. [83], [69], [78] and [84] proposed frameworks that automatically generate synthesizable accelerators for a given network. [55] presented an RTL compiler for FPGA implementation of diverse networks. [52] proposed an instruction set for hardware implementation, while [93] proposed a uniformed convolutional matrix-multiplication representation for CNNs.
8.7 Emerging Techniques
In the past few years, many new techniques from both the algorithm side and the circuit side have been adopted to implement fast and energy-efficient accelerators. Stochastic computing, which represents continuous values through streams of random bits, has been investigated for hardware acceleration of deep neural networks [66, 71, 39]. On the hardware side, RRAM-based accelerators [5, 87] and the use of 3D DRAM [38, 14] have received greater attention.
9 Future Trends and Discussion
In this section, we discuss some possible future directions in this field.
Fine-tuning-free or Unsupervised Compression.
Most existing methods, including network pruning, low-rank compression and quantization, need labeled data to retrain the network in order to retain accuracy. The problems are twofold. First, labeled data is sometimes unavailable, for example for medical images. Second, retraining requires considerable human effort as well as professional knowledge. These two problems motivate unsupervised compression or even fine-tuning-free compression methods.
Scalable (Self-adaptive) Compression.
Current compression methods involve many hyper-parameters that need to be determined ahead of time, for example the sparsity in network pruning, the rank in decomposition-based methods, or the bit-width in fixed-point quantization. Selecting these hyper-parameters is tedious work that also requires professional experience. Thus, methods that do not rely on human-designed hyper-parameters are a promising research topic; one direction may be to use annealing methods or reinforcement learning.
Network Acceleration for Object Detection.
Most model acceleration methods are optimized for image classification, and little effort has been devoted to accelerating other computer vision tasks such as object detection. It may seem that acceleration methods designed for image classification can be used directly for detection. However, deep neural networks for object detection or image segmentation are more sensitive to model acceleration, i.e., applying the same acceleration to an object detector leads to larger accuracy drops than for image classification. One reason for this may be that object detection requires more complex feature representations than image classification. Designing model acceleration methods for object detection therefore remains a challenge.
Hardware-software Co-design.
To accelerate deep learning algorithms on dedicated hardware, a straightforward approach is to pick a model and design a corresponding architecture. However, the gap between algorithm modeling and hardware implementation makes this difficult to put into practice. Recent advances in deep learning algorithms and hardware accelerators demonstrate that it is highly desirable to design hardware-efficient algorithms according to the low-level features of specific hardware platforms. This co-design methodology will be a trend in future work.
10 Conclusion
Deep neural networks provide impressive performance while suffering from huge computational complexity and high energy expenditure. In this paper, we provide a survey of recent advances in efficient processing of deep neural networks from both the algorithm and hardware points of view. In addition, we point out a few topics that deserve further investigation in the future.
References

 [1] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos. Cnvlutin: Ineffectual-neuron-free deep neural network computing. In International Symposium on Computer Architecture, pages 1–13, 2016.
 [2] M. Alwani, H. Chen, M. Ferdman, and P. A. Milder. Fused-layer cnn accelerators. In MICRO, 2016.
 [3] S. Anwar, K. Hwang, and W. Sung. Structured pruning of deep convolutional neural networks. ACM Journal on Emerging Technologies in Computing Systems (JETC), 13(3):32, 2017.
 [4] Z. Cai, X. He, J. Sun, and N. Vasconcelos. Deep learning with low precision by halfwave gaussian quantization. July 2017.
 [5] L. Chen, J. Li, Y. Chen, Q. Deng, J. Shen, X. Liang, and L. Jiang. Acceleratorfriendly neuralnetwork training: Learning variations and defects in rram crossbar. In Design, Automation and Test in Europe Conference and Exhibition, pages 19–24, 2017.

 [6] Y. Chen, N. Sun, O. Temam, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, and T. Chen. Dadiannao: A machine-learning supercomputer. In IEEE/ACM International Symposium on Microarchitecture, pages 609–622, 2014.
 [7] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 52(1):127–138, 2017.
 [8] J. Cheng, J. Wu, C. Leng, Y. Wang, and Q. Hu. Quantized cnn: A unified approach to accelerate and compress convolutional networks. IEEE Transactions on Neural Networks and Learning Systems (TNNLS), PP:1–14.
 [9] Z. Cheng, D. Soudry, Z. Mao, and Z. Lan. Training binary multilayer neural networks for image classification using expectation backpropagation. arXiv preprint arXiv:1503.03562, 2015.
 [10] M. Courbariaux, Y. Bengio, and J.P. David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3123–3131, 2015.
 [11] M. Denil, B. Shakibi, L. Dinh, N. de Freitas, et al. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pages 2148–2156, 2013.
 [12] T. Dettmers. 8bit approximations for parallelism in deep learning. arXiv preprint arXiv:1511.04561, 2015.
 [13] Edward. Lognet: Energyefficient neural networks using logarithmic computation. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5900–5904, 2017.
 [14] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis. Tetris: Scalable and efficient neural network acceleration with 3d memory. In International Conference on Architectural Support for Programming Languages and Operating Systems, pages 751–764, 2017.
 [15] Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
 [16] D. Gudovskiy and L. Rigazio. ShiftCNN: Generalized lowprecision architecture for inference of convolutional neural networks. arXiv preprint arXiv:1706.02393, 2017.
 [17] Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery for efficient dnns. In Advances In Neural Information Processing Systems, pages 1379–1387, 2016.
 [18] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning (ICML15), pages 1737–1746, 2015.
 [19] D. Hammerstrom. A vlsi architecture for highperformance, lowcost, onchip learning. In IJCNN International Joint Conference on Neural Networks, pages 537–544 vol.2, 2012.
 [20] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, and Y. Wang. Ese: Efficient speech recognition engine with sparse lstm on fpga. 2017.
 [21] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. Eie: Efficient inference engine on compressed deep neural network. In ACM/IEEE International Symposium on Computer Architecture, pages 243–254, 2016.
 [22] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
 [23] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
 [24] B. Hassibi and D. G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. Morgan Kaufmann, 1993.
 [25] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
 [26] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. arXiv preprint arXiv:1707.06168, 2017.
 [27] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 [28] J. L. Holi and J. N. Hwang. Finite precision error analysis of neural network hardware implementations. In Ijcnn91Seattle International Joint Conference on Neural Networks, pages 519–525 vol.1, 1993.
 [29] M. Horowitz. 1.1 computing’s energy problem (and what we can do about it). In SolidState Circuits Conference Digest of Technical Papers, pages 10–14, 2014.
 [30] L. Hou, Q. Yao, and J. T. Kwok. Lossaware binarization of deep networks. arXiv preprint arXiv:1611.01600, 2016.
 [31] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 [32] Q. Hu, P. Wang, and J. Cheng. From hashing to cnns: Training binary weight networks via hashing. In AAAI, February 2018.
 [33] K. Hwang and W. Sung. Fixedpoint feedforward deep neural network design using weights+ 1, 0, and 1. In 2014 IEEE Workshop on Signal Processing Systems (SiPS), pages 1–6. IEEE, 2014.
 [34] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. Squeezenet: Alexnetlevel accuracy with 50x fewer parameters and¡ 0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016.
 [35] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
 [36] H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence, 33(1):117–128, 2011.
 [37] N. P. Jouppi. Indatacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA ’17, 2017.
 [38] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay. Neurocube: A programmable digital neuromorphic architecture with highdensity 3d memory. In International Symposium on Computer Architecture, pages 380–392, 2016.
 [39] K. Kim, J. Kim, J. Yu, J. Seo, J. Lee, and K. Choi. Dynamic energyaccuracy tradeoff using stochastic computing in deep neural networks. In Design Automation Conference, page 124, 2016.
 [40] M. Kim and P. Smaragdis. Bitwise neural networks. arXiv preprint arXiv:1601.06071, 2016.
 [41] Y.D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530, 2015.
 [42] J. H. Ko, B. Mudassar, T. Na, and S. Mukhopadhyay. Design of an energyefficient accelerator for training of convolutional neural networks using frequencydomain computation. In Design Automation Conference, page 59, 2017.
 [43] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
 [44] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [45] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speedingup convolutional neural networks using finetuned cpdecomposition. arXiv preprint arXiv:1412.6553, 2014.
 [46] V. Lebedev and V. Lempitsky. Fast convnets using groupwise brain damage. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2554–2564, 2016.
 [47] Y. LeCun, J. S. Denker, S. A. Solla, R. E. Howard, and L. D. Jackel. Optimal brain damage. In Advances in Neural Information Processing Systems, volume 89, 1989.
 [48] F. Li, B. Zhang, and B. Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.
 [49] G. Li, F. Li, T. Zhao, and J. Cheng. Block convolution: Towards memoryefficeint inference of largescale cnns on fpga. In Design Automation and Test in Europe, 2018.
 [50] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
 [51] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio. Neural networks with few multiplications. arXiv preprint arXiv:1510.03009, 2015.
 [52] S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen. Cambricon: An instruction set architecture for neural networks. SIGARCH Comput. Archit. News, 44(3), June 2016.
 [53] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang. Learning efficient convolutional networks through network slimming. arxiv preprint, 1708, 2017.
 [54] J.H. Luo, J. Wu, and W. Lin. Thinet: A filter level pruning method for deep neural network compression. arXiv preprint arXiv:1707.06342, 2017.
 [55] Y. Ma, Y. Cao, S. Vrudhula, and J. S. Seo. An automatic rtl compiler for highthroughput fpga implementation of diverse deep convolutional neural networks. In International Conference on Field Programmable Logic and Applications, pages 1–8, 2017.
 [56] Y. Ma, Y. Cao, S. Vrudhula, and J.s. Seo. Optimizing loop operation and dataflow in fpga acceleration of deep convolutional neural networks. In Proceedings of the 2017 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, FPGA ’17, 2017.
 [57] Y. Ma, M. Kim, Y. Cao, S. Vrudhula, J. S. Seo, Y. Ma, M. Kim, Y. Cao, S. Vrudhula, and J. S. Seo. Endtoend scalable fpga accelerator for deep residual networks. In IEEE International Symposium on Circuits and Systems, pages 1–4, 2017.
 [58] H. Mao, S. Han, J. Pool, W. Li, X. Liu, Y. Wang, and W. J. Dally. Exploring the regularity of sparse structure in convolutional neural networks. arXiv preprint arXiv:1705.08922, 2017.
 [59] D. Miyashita, E. H. Lee, and B. Murmann. Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025, 2016.
 [60] D. Nguyen, D. Kim, and J. Lee. Double MAC: doubling the performance of convolutional neural networks on modern fpgas. In Design, Automation and Test in Europe Conference and Exhibition, DATE 2017, Lausanne, Switzerland, March 2731, 2017, pages 890–893, 2017.
 [61] Nurvitadhi. Can fpgas beat gpus in accelerating nextgeneration deep neural networks? In Proceedings of the 2017 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, FPGA ’17, 2017.
 [62] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally. Scnn: An accelerator for compressedsparse convolutional neural networks. pages 27–40, 2017.
 [63] M. Price, J. Glass, and A. P. Chandrakasan. 14.4 a scalable speech recognizer with deepneuralnetwork acoustic models and voiceactivated power gating. In SolidState Circuits Conference, pages 244–245, 2017.
 [64] Qiu. Going deeper with embedded fpga platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, FPGA ’16, 2016.
 [65] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnornet: Imagenet classification using binary convolutional neural networks. In ECCV (4), volume 9908, pages 525–542. Springer, 2016.
 [66] A. Ren, Z. Li, C. Ding, Q. Qiu, Y. Wang, J. Li, X. Qian, and B. Yuan. Scdcnn: Highlyscalable deep convolutional neural network using stochastic computing. Acm Sigops Operating Systems Review, 51(2):405–418, 2017.
 [67] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
 [68] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [69] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh. From highlevel deep neural models to fpgas. In Ieee/acm International Symposium on Microarchitecture, pages 1–12, 2016.
 [70] Y. Shen, M. Ferdman, and P. Milder. Escher: A cnn accelerator with flexible buffering to minimize offchip transfer. In IEEE International Symposium on FieldProgrammable Custom Computing Machines, 2017.
 [71] H. Sim and J. Lee. A new stochastic computing multiplier with application to deep convolutional neural networks. In Design Automation Conference, page 29, 2017.
 [72] K. Simonyan and A. Zisserman. Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [73] Suda. Throughputoptimized openclbased fpga accelerator for largescale convolutional neural networks. In Proceedings of the 2016 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, FPGA ’16, 2016.
 [74] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition, pages 1–9, 2015.
 [75] W. Tang, G. Hua, and L. Wang. How to train a compact binary neural network with high accuracy? In AAAI, pages 2625–2631, 2017.
 [76] H. Tann, S. Hashemi, I. Bahar, and S. Reda. Hardwaresoftware codesign of accurate, multiplierfree deep neural networks. CoRR, abs/1705.04288, 2017.
 [77] Umuroglu. Finn: A framework for fast, scalable binarized neural network inference. In Proceedings of the 2017 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, FPGA ’17, 2017.
 [78] S. I. Venieris and C. S. Bouganis. fpgaconvnet: A framework for mapping convolutional neural networks on fpgas. In IEEE International Symposium on FieldProgrammable Custom Computing Machines, pages 40–47, 2016.
 [79] S. Venkataramani, A. Ranjan, S. Banerjee, D. Das, S. Avancha, A. Jagannathan, A. Durg, D. Nagaraj, B. Kaul, P. Dubey, and A. Raghunathan. Scaledeep: A scalable compute architecture for learning and evaluating deep networks. SIGARCH Comput. Archit. News, 45(2):13–26, June 2017.
 [80] P. Wang and J. Cheng. Accelerating convolutional neural networks for mobile applications. In Proceedings of the 2016 ACM on Multimedia Conference, pages 541–545. ACM, 2016.
 [81] P. Wang and J. Cheng. Fixedpoint factorized networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 [82] P. Wang, Q. Hu, Z. Fang, C. Zhao, and J. Cheng. Deepsearch: A fast image search framework for mobile devices. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 14, 2018.
 [83] Y. Wang, J. Xu, Y. Han, H. Li, and X. Li. Deepburning: automatic generation of fpgabased learning accelerators for the neural network family. In Design Automation Conference, page 110, 2016.
 [84] Wei. Automated systolic array architecture synthesis for high throughput cnn inference on fpgas. In Proceedings of the 54th Annual Design Automation Conference 2017, DAC ’17, 2017.
 [85] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.
 [86] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng. Quantized convolutional neural networks for mobile devices. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
 [87] L. Xia, T. Tang, W. Huangfu, M. Cheng, X. Yin, B. Li, Y. Wang, and H. Yang. Switched by input: Power efficient structure for rrambased convolutional neural network. In Design Automation Conference, page 125, 2016.
 [88] Xiao. Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on fpgas. In Proceedings of the 54th Annual Design Automation Conference 2017, DAC ’17, 2017.
 [89] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 [90] H. Yang. Time: A traininginmemory architecture for memristorbased deep neural networks. In Design Automation Conference, page 26, 2017.
 [91] S. Zagoruyko and N. Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928, 2016.
 [92] Zhang. Optimizing fpgabased accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, FPGA ’15, 2015.
 [93] C. Zhang, Z. Fang, P. Pan, P. Pan, and J. Cong. Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks. In International Conference on ComputerAided Design, page 12, 2016.
 [94] C. Zhang, D. Wu, J. Sun, G. Sun, G. Luo, and J. Cong. Energyefficient cnn implementation on a deeply pipelined fpga cluster. In Proceedings of the 2016 International Symposium on Low Power Electronics and Design, ISLPED ’16, 2016.
 [95] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen. Cambriconx: An accelerator for sparse neural networks. In Ieee/acm International Symposium on Microarchitecture, pages 1–12, 2016.
 [96] X. Zhang, X. Zhou, M. Lin, and J. Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017.
 [97] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2015.
 [98] Zhao. Accelerating binarized convolutional neural networks with softwareprogrammable fpgas. In Proceedings of the 2017 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, FPGA ’17, 2017.
 [99] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen. Incremental network quantization: Towards lossless cnns with lowprecision weights. arXiv preprint arXiv:1702.03044, 2017.
 [100] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. Dorefanet: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
 [101] C. Zhu, S. Han, H. Mao, and W. J. Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.
 [102] J. Zhu, Z. Qian, and C. Y. Tsui. Lradnn: Highthroughput and energyefficient deep neural network accelerator using low rank approximation. In Asia and South Pacific Design Automation Conference, pages 581–586, 2016.