Introduction
Recent deep neural networks (DNNs), especially deep convolutional neural networks (CNNs), have delivered remarkable success in visual recognition tasks ([Deng et al.2009, Taigman et al.2014]) and real-world applications ([Huval et al.2015, Collobert and Weston2008, Burbidge et al.2001]), by leveraging large-scale network sizes and learning from a huge volume of data. Despite the advantage of improved overall accuracy, the deeply layered structure and large model sizes increase the computational complexity and memory requirements. It is projected that the majority of inference tasks will be performed on embedded, IoT, and mobile systems, which have limited power and computational resources. In order to achieve higher scalability, performance, and energy efficiency, two orthogonal research and development trends have attracted enormous interest.
The first is hardware acceleration of deep learning systems/applications, which has been extensively investigated in both industry and academia ([Farabet et al.2009, Suda et al.2016, Qiu et al.2016, Zhang et al.2016a, Zhang et al.2016b, Han et al.2017, Zhao et al.2017, Zhang and Li2017, Umuroglu et al.2016, coma, comb, Chen et al.2017, Han et al.2016, Chen et al.2014]). As a representative technique, FPGA-based accelerators offer the advantages of programmability, a high degree of parallelism, and a short development cycle. Important progress has been reported on FPGA acceleration of original DNNs ([Farabet et al.2009, Suda et al.2016, Zhang et al.2016a, Zhang et al.2016b]), binary neural networks ([Zhao et al.2017, Umuroglu et al.2016]
), and more recently, on DNNs and recurrent neural networks (RNNs) with model compression techniques ([Qiu et al.2016, Han et al.2017]). These prior works mainly focus on the inference phase of DNNs and suffer from frequent access to off-chip memory systems, because the limited on-chip memory can hardly accommodate the large model sizes. Accessing off-chip memory is highly energy-inefficient. As pointed out in ([Han et al.2015, Han, Mao, and Dally2015]), the per-bit access energy of off-chip memory is 200X that of on-chip memory storage, and dominates the whole-system power consumption. Besides, it is also desirable to achieve algorithm-level acceleration to accommodate the further scaling of DNNs, instead of simply adding more and more hardware devices.

The second important trend is model size compression and algorithm-level acceleration of DNNs (with very minor accuracy loss), including weight quantization ([Lin, Talathi, and Annapureddy2016, Lin et al.2015]), sparsity regularization ([Feng and Darrell2015, Wen et al.2016, Li, Park, and Tang2017]), connection pruning ([Han et al.2015, Han, Mao, and Dally2015]), and low-rank approximation ([Denil et al.2013, Denton et al.2014]). These approaches can offer a reasonable amount of parameter reduction (e.g., by 9X to 13X in ([Han et al.2015, Han, Mao, and Dally2015])) and/or a reasonable speedup (e.g., around 50% to 2X in ([Wen et al.2016])). However, they suffer from the following limitations: (i) the sparsity regularization and pruning methods will likely result in an irregular and sparse network structure, thereby undermining the compression ratio and increasing computation time (especially inefficient on GPUs and dedicated hardware with high parallelism capability); (ii) the training complexity is increased by incorporating an additional pruning process ([Han et al.2015, Han, Mao, and Dally2015]), an additional low-rank approximation step ([Denil et al.2013, Denton et al.2014]), or extra trade-off parameters ([Wen et al.2016]
); (iii) the compression or acceleration factors are heuristic numbers that cannot be precisely controlled, let alone supported by a mathematically rigorous proof of the effectiveness of these methods.
To combine these two directions, this paper aims to address the limitations of existing model size compression and acceleration works and to achieve ultra-high energy efficiency and performance for FPGA-based hardware implementations of DNNs, by (i) deriving an algorithm highly suitable for efficient computation and storage reduction without significant accuracy loss, and (ii) deriving the corresponding optimized hardware implementations. We develop an algorithm-hardware co-optimization framework, which is applicable to different DNN types, sizes, and application scenarios. The proposed framework comprises algorithm and hardware parts. The algorithm part extends reference ([Cheng et al.2015]), which applies circulant matrices to the whole fully-connected (FC) layer for model compression, to (i) the adoption of general block-circulant matrices to achieve a fine-grained trade-off between accuracy and compression ratio, (ii) the generalization to convolutional (CONV) layers for significant acceleration, as CONV layers dominate the computation of DNNs ([Krizhevsky, Sutskever, and Hinton2012, He et al.2016]
), (iii) a mathematically rigorous proof that the proposed algorithm asymptotically converges to the same “effectiveness” as DNNs without compression, and (iv) the decoupling of the fast Fourier transform (FFT) and inverse FFT (IFFT) computations in the framework to accelerate computation and facilitate hardware implementations. The proposed algorithm reduces the computational complexity per layer from O(n^2) to O(n log n) and the storage complexity from O(n^2) to O(n), both for training and inference, with negligible degradation in DNN accuracy. The hardware part consists of highly efficient FPGA-based implementations using effective reconfiguration, batch processing, a deep pipelining technique, effective resource re-use, and a hierarchical control framework. The proposed FPGA-based implementation can accommodate the whole DNN model using on-chip block memory, thereby significantly improving the overall energy efficiency. Finally, a comprehensive algorithm-hardware co-optimization is proposed, which comprises (i) model selection and optimization, (ii) hardware optimization, and (iii) variational inference-based Bayesian learning for enhancing accuracy and robustness. In summary, the major contributions of this work include both algorithm and hardware parts. The algorithm part adopts block-circulant matrices for weight representation, which achieves a significant model compression ratio with minor accuracy degradation; it applies to the whole network, both fully-connected and convolutional layers. The hardware part consists of highly efficient FPGA-based implementations with multiple innovative aspects, including reconfiguration, batch processing, deep pipelining, and resource re-use.

Please note that the proposed framework is distinct from the prior work ([Mathieu, Henaff, and LeCun2013]), which applies FFTs to accelerate the computations in the CONV layers. That prior work applies only to a single filter in the CONV layer and achieves no storage reduction (in fact it results in a storage increase), whereas the proposed method applies to both CONV and FC layers and achieves simultaneous acceleration and storage reduction.
Because we target highly energy-efficient FPGA-based implementations for low-power embedded applications, we focus on the inference phase of small to medium-scale DNNs (e.g., for the MNIST, SVHN, and CIFAR datasets) on high-energy-efficiency FPGAs. Compared with the IBM TrueNorth neurosynaptic processor ([Merolla et al.2014]), our FPGA-based implementation achieves at least a 152X speedup in throughput and a 71X energy efficiency gain under the same test accuracy. Similarly, our actual FPGA implementations outperform (in performance) the state-of-the-art analog-based and emerging-device-based implementations. Our framework achieves at least a 31X gain in equivalent energy efficiency compared with the reference FPGA-based work that achieves the best efficiency.
Related Works
FPGA Accelerations of DNNs. FPGA-based acceleration of DNNs has been extensively investigated recently due to the advantages of programmability, a high degree of parallelism, and a short development cycle. Building on the early work of direct FPGA acceleration ([Farabet et al.2009]), researchers have recently investigated energy-efficient implementations using the batch processing technique ([Zhang et al.2016a, Zhang et al.2016b]) or on compressed models using singular value decomposition (SVD) ([Qiu et al.2016]). Research on this topic has recently surged, including acceleration of DNNs with weight pruning ([Han et al.2017]), binary neural networks ([Zhao et al.2017, Umuroglu et al.2016]), and high-level synthesis for fast generation of FPGA implementations ([Zhao et al.2017, Zhang and Li2017]). These works typically suffer from frequent access to off-chip memory systems because their model sizes cannot be effectively reduced for on-chip memory storage, thereby resulting in high energy consumption. The typical (equivalent) energy efficiency ranges from 7 GOPS/W to less than 1 TOPS/W, depending on the testing FPGA platform, implementation details, and compression techniques.

Connection Pruning and Weight Sparsifying. Han et al. ([Han et al.2015, Han, Mao, and Dally2015]) reduced the number of parameters by 9X to 13X using connection pruning. Since most of the reduction is achieved on FC layers, no significant speedups of CONV layers can be observed ([Wen et al.2016]). As CONV layers have become the computational bottleneck, compression and acceleration of CONV layers become essential. Liu et al. achieved a layer-wise 4.59X speedup on the CONV layers of AlexNet with 2% accuracy loss. Recently, ([Wen et al.2016]) adopted a structured sparsity learning method and derived an effective trade-off between acceleration on CPU/GPU and test accuracy for the CONV layers. More specifically, for ResNet-20 on CIFAR-10 and AlexNet on ImageNet benchmarks, more than 50% acceleration can be achieved without any accuracy loss, while around 3X acceleration is achieved with an acceptable accuracy loss of 2%.
FFTs for CONV Layer Accelerations. LeCun et al. proposed using FFTs to accelerate the computations in the CONV layers, applying the technique to a single filter in the CONV layer at a time ([Mathieu, Henaff, and LeCun2013]). It uses FFTs to calculate the traditional inner products of filters and input feature maps, and can achieve a speedup for large filter sizes. The underlying neural network structure remains unchanged. The speedup is due to filter reuse, and the method can achieve neither an asymptotic speedup in Big-O notation nor weight compression.
Structured Matrices in FC Layers for Model Compression. The most relevant work to this paper is ([Cheng et al.2015]), which directly applies circulant matrices to the FC layers for model compression. As an example, an FC layer of a DNN can be represented as a = ψ(Wx), where the vectors x and a represent the outputs of all neurons in the previous layer and the current layer, respectively; W is an n-by-n weight matrix; and ψ is the activation function. When W is a circulant matrix, the fast Fourier transform (FFT)-based fast multiplication method can be utilized, and the computational complexity and weight storage complexity are reduced from O(n^2) to O(n log n) and from O(n^2) to O(n), respectively. Despite the significant reduction in computation and weight storage, this approach has the limitations of (i) producing a huge number of padded zeros when the numbers of inputs and outputs are not equal, (ii) causing certain accuracy degradation for large-scale FC layers because of the aggressive weight reduction, and (iii) being applicable only to the FC layer, whereas the CONV layers are the most computationally intensive in DNNs.
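The FFT-based fast multiplication underlying this line of work can be sketched in a few lines of NumPy (an illustrative example of ours, not code from any of the cited works). Here a circulant block is defined by its first column c; the first-row convention used later in this paper differs only by an index reversal.

```python
import numpy as np

def circulant(c):
    """Build a k-by-k circulant matrix whose first column is c."""
    k = len(c)
    return np.array([[c[(i - j) % k] for j in range(k)] for i in range(k)])

def circ_matvec_fft(c, x):
    """Multiply circulant(c) @ x in O(k log k) via the circulant convolution theorem."""
    return np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real

rng = np.random.default_rng(0)
k = 8
c, x = rng.standard_normal(k), rng.standard_normal(k)

# The O(k^2) dense product and the O(k log k) FFT product agree.
assert np.allclose(circulant(c) @ x, circ_matvec_fft(c, x))
```

Only the length-k defining vector is stored instead of the k-by-k matrix, which is the source of the O(n^2)-to-O(n) storage reduction quoted above.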
Algorithm Development of BlockCirculant MatrixBased DNNs
In this section, we develop the algorithmic framework of block-circulant matrix-based DNNs for simultaneous acceleration and model compression, for both the inference and training phases. The proposed framework accommodates arbitrary sizes and aspect ratios of weight matrices, and achieves a fine-grained trade-off between test accuracy and compression/acceleration ratio ([Ding et al.2017]). Unlike ([Cheng et al.2015]), we develop algorithms for both FC and CONV layers, as shown in the following. We provide a mathematically rigorous proof that the proposed algorithm satisfies the same universal approximation property as uncompressed DNNs. Finally, we develop a decoupling technique for FFT/IFFT pairs for further acceleration and to facilitate hardware (FPGA) implementations.
Inference and Training Algorithms for FC Layers
The key idea of block-circulant matrix-based FC layers is to partition the original arbitrary-size unstructured weight matrix W into 2D blocks of square sub-matrices. Such a partitioning strategy has two advantages: 1) it is suitable for arbitrary-size weight matrices without any requirement on the aspect ratio of W; and 2) it is an adjustable approach that can conveniently control the compression ratio and potential accuracy loss by changing only the size of the sub-matrices.
For formal discussions of the proposed inference and training procedures, let k denote the block size (the size of each sub-matrix), so that there are p × q blocks after partitioning the m-by-n weight matrix W, where p = m/k and q = n/k. Zero padding is required if k does not directly divide m or n, but the amount of zero padding is significantly reduced compared with ([Cheng et al.2015]). Then W = [W_ij], i ∈ {1, …, p}, j ∈ {1, …, q}, where each W_ij is a k-by-k circulant sub-matrix. Correspondingly, the input x is also partitioned as x = [x_1^T, x_2^T, …, x_q^T]^T. Then the forward propagation process in the inference phase is given by:
(1)   a = ψ(Wx) = ψ([ Σ_{j=1..q} W_1j x_j ;  Σ_{j=1..q} W_2j x_j ;  … ;  Σ_{j=1..q} W_pj x_j ])
where a_i = ψ(Σ_{j=1..q} W_ij x_j) ∈ R^k is a column vector. Assume each circulant matrix W_ij is defined by a vector w_ij, i.e., w_ij is the first row vector of W_ij. Then according to the circulant convolution theorem ([Pan2012, Bini, Pan, and Eberly1996]), the calculation of W_ij x_j can be performed as IFFT(FFT(w_ij) ∘ FFT(x_j)), where ∘ denotes element-wise multiplication. The operation procedure is shown in Fig. 1. For the inference phase, the computational complexity of this FC layer is O(pqk log k), which is equivalent to O(n log n) for small p, q values. Similarly, the storage complexity is O(pqk), because we only need to store w_ij or FFT(w_ij) for each sub-matrix, which is equivalent to O(n) for small p, q values. Simultaneous acceleration and model compression compared with the original DNN can thus be achieved.
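The forward computation just described can be prototyped directly. The sketch below is our own NumPy illustration (using the first-column convention for each circulant block): each block-row output a_i is a sum of q "FFT → element-wise multiplication → IFFT" terms, and only the p·q defining vectors are stored instead of the full (pk)-by-(qk) matrix.

```python
import numpy as np

def block_circ_matvec(C, x, k):
    """Compute Wx where W is block-circulant: C[i][j] is the defining
    (first-column) vector of the k-by-k circulant block W_ij."""
    p, q = len(C), len(C[0])
    xs = x.reshape(q, k)
    a = np.empty(p * k)
    for i in range(p):
        a_i = np.zeros(k)
        for j in range(q):
            # W_ij x_j via the circulant convolution theorem
            a_i += np.fft.ifft(np.fft.fft(C[i][j]) * np.fft.fft(xs[j])).real
        a[i*k:(i+1)*k] = a_i
    return a

def dense_from_blocks(C, k):
    """Expand the p*q defining vectors into the full dense matrix (for checking)."""
    circ = lambda c: np.array([[c[(u - v) % k] for v in range(k)] for u in range(k)])
    return np.block([[circ(c) for c in row] for row in C])

rng = np.random.default_rng(1)
p, q, k = 2, 3, 4
C = [[rng.standard_normal(k) for _ in range(q)] for _ in range(p)]
x = rng.standard_normal(q * k)

# Storage: p*q*k numbers instead of (p*k)*(q*k); result matches the dense product.
assert np.allclose(block_circ_matvec(C, x, k), dense_from_blocks(C, k) @ x)
```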
Now consider the backward propagation process in the training phase. Let a_il be the l-th output element in a_i. Then by using the chain rule we can derive the backward propagation process as follows:

(2)   ∂L/∂w_ij = Σ_{l=1..k} (∂L/∂a_il) (∂a_il/∂w_ij) = (∂L/∂a_i) (∂a_i/∂w_ij)

(3)   ∂L/∂x_j = Σ_{i=1..p} Σ_{l=1..k} (∂L/∂a_il) (∂a_il/∂x_j) = Σ_{i=1..p} (∂L/∂a_i) (∂a_i/∂x_j)

We have proved that ∂a_i/∂w_ij and ∂a_i/∂x_j are block-circulant matrices. Therefore, ∂L/∂w_ij and ∂L/∂x_j can be calculated via the same “FFT → element-wise multiplication → IFFT” procedure, which is equivalent to O(n log n) computational complexity per layer. Due to space limitations, the algorithmic descriptions of forward and backward propagation are omitted.
Please note that there is no special need to translate an already-trained W into, or approximate each sub-matrix of W by, a circulant matrix. Instead, as shown in Eqns. (2) and (3), we directly learn the vector w_ij (the first-row vector) of each sub-matrix of W in the training process, under the assumption that the other rows of the sub-matrix follow the circulant formulation. In other words, when following the learning process of Eqns. (2) and (3), the learned weight matrices naturally follow the block-circulant format. In fact, this is a key advantage of the proposed method: there is no need for additional “translation” or “approximation” steps.
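As a sanity check on this point, the gradient with respect to a defining vector can itself be computed by an "FFT → element-wise multiplication → IFFT" step. The sketch below is our own example (one circulant block with the first-column convention and a toy squared-error loss), verifying the frequency-domain gradient against finite differences.

```python
import numpy as np

rng = np.random.default_rng(2)
k = 8
c = rng.standard_normal(k)   # defining vector of the circulant block (learned directly)
x = rng.standard_normal(k)

# Forward: y = circulant(c) @ x computed as IFFT(FFT(c) * FFT(x)).
y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real
L = 0.5 * np.sum(y ** 2)     # toy loss

# Backward: dL/dy = y, and dL/dc is a circular correlation of x and y,
# again an "FFT -> element-wise multiplication -> IFFT" procedure.
grad_c = np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(y)).real

# Check against forward finite differences.
eps = 1e-6
num = np.empty(k)
for m in range(k):
    cp = c.copy(); cp[m] += eps
    yp = np.fft.ifft(np.fft.fft(cp) * np.fft.fft(x)).real
    num[m] = (0.5 * np.sum(yp ** 2) - L) / eps
assert np.allclose(grad_c, num, atol=1e-4)
```

The learned parameter is the length-k vector c itself, so the trained block never leaves the circulant format.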
Inference and Training for CONV Layers
We generalize the inference and training algorithms to CONV layers, which have become the computation bottleneck of the whole DNN. The CONV layers are often associated with multiple input and multiple output feature maps:
(4)   Y(x, y, p) = Σ_{λ=1..C} Σ_{i=1..r} Σ_{j=1..r} F(i, j, λ, p) · X(x+i−1, y+j−1, λ)

where X ∈ R^{W×H×C}, Y ∈ R^{(W−r+1)×(H−r+1)×P}, and F ∈ R^{r×r×C×P} represent the input, output, and weight tensors of the CONV layer, respectively. Here W and H are the spatial dimensions of the input feature maps, C is the number of input feature maps, r is the size of the convolutional kernel, and P is the number of output feature maps.

Software tools such as Caffe provide an efficient methodology for transforming the tensor-based operations in the CONV layer into matrix-based operations ([Jia et al.2014, Vedaldi and Lenc2015]), in order to enhance implementation efficiency (GPUs are optimized for matrix operations). Fig. 2 illustrates the application of this method to reformulate Eqn. (4) as the matrix multiplication Y = XF, where X ∈ R^{(W−r+1)(H−r+1) × Cr^2}, Y ∈ R^{(W−r+1)(H−r+1) × P}, and F ∈ R^{Cr^2 × P}. We generalize the concept of the “block-circulant structure” to the rank-4 tensor F in the CONV layer, i.e., all slices of the form F(i, j, ·, ·) are block-circulant matrices. Then we can prove that the reshaped F is also a block-circulant matrix. Hence the fast multiplication approach for block-circulant matrices, as the “FFT → element-wise multiplication → IFFT” procedure, can be applied to accelerate Y = XF, thereby resulting in the acceleration of (4). The training phase can be derived similarly. The overall degrees of reduction in computational and storage complexities are similar to those in FC layers.
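The reshaping step can be reproduced in a few lines. The NumPy sketch below is ours, with one consistent index ordering chosen among several possibilities: the direct six-loop form of Eqn. (4) and the flattened matrix product Y = XF give identical results.

```python
import numpy as np

def conv_direct(X, F):
    """Direct convolution per Eqn. (4): Y(x, y, p) = sum over (i, j, lam)."""
    W, H, C = X.shape
    r, _, _, P = F.shape
    Y = np.zeros((W - r + 1, H - r + 1, P))
    for x in range(W - r + 1):
        for y in range(H - r + 1):
            for p in range(P):
                for i in range(r):
                    for j in range(r):
                        for lam in range(C):
                            Y[x, y, p] += F[i, j, lam, p] * X[x + i, y + j, lam]
    return Y

def conv_im2col(X, F):
    """Same convolution reformulated as one matrix multiplication Y = X_mat @ F_mat."""
    W, H, C = X.shape
    r, _, _, P = F.shape
    # Each row of X_mat is one r*r*C patch, flattened in (i, j, lam) order.
    rows = [X[x:x+r, y:y+r, :].ravel()
            for x in range(W - r + 1) for y in range(H - r + 1)]
    X_mat = np.array(rows)               # (W-r+1)(H-r+1) x C*r^2
    F_mat = F.reshape(r * r * C, P)      # C*r^2 x P, same (i, j, lam) order
    return (X_mat @ F_mat).reshape(W - r + 1, H - r + 1, P)

rng = np.random.default_rng(3)
X = rng.standard_normal((5, 5, 2))
F = rng.standard_normal((3, 3, 2, 4))
assert np.allclose(conv_direct(X, F), conv_im2col(X, F))
```

Imposing the block-circulant structure on F then makes F_mat a block-circulant matrix, so the FC-layer fast multiplication applies unchanged.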
Theoretical Foundation and Software Results
Alongside the substantial reduction of weight storage and computation, we also prove that the proposed block-circulant matrix-based framework consistently yields similar overall accuracy compared with DNNs without compression. This theoretical proof makes the proposed method mathematically rigorous and distinguishes it from prior work.
In the theory of neural networks, the universal approximation property states that a neural network should be able to approximate any continuous or measurable function with arbitrary accuracy, provided that a large enough number of parameters is available. This property provides the theoretical guarantee for using neural networks to solve machine learning problems, since machine learning tasks can be formulated as finding a proper approximation of an unknown, high-dimensional function. We have proved the universal approximation property of block-circulant matrix-based neural networks and, more generally, of neural networks based on arbitrary structured matrices with low displacement rank. As a result, we can guarantee the universal “effectiveness” of the proposed framework on different DNN types and sizes, application domains, and hardware/software platforms. The detailed proof procedure is provided in the supplementary file ([pro]).

Fig. 3 shows the model compression results on the MNIST, SVHN, CIFAR-10, ImageNet, and TIMIT (speech recognition) benchmarks, among others, using various DNN models. The accuracy degradations are constrained to be 1% to 2% between the original models and the block-circulant matrix-based models. The overall model compression is contributed by both weight parameter reduction and bit quantization. It can be observed that a significant model size compression, and therefore acceleration, can be achieved using the proposed framework.
Accelerating Computation and Facilitating Hardware Implementations
We propose a decoupling technique for FFTs and IFFTs, which applies to both the inference and training phases. We take the inference phase of an FC layer as an illustrative example. First, we observe that the FFT results of x_j, i.e., FFT(x_j), are utilized to calculate all a_i vectors. A similar observation also holds for FFT(w_ij). Hence, we can pre-calculate FFT(x_j) and FFT(w_ij) and store them in memory for effective re-use. The FFT(w_ij) values can even be pre-calculated and stored in memory before the inference phase, because they are fixed after training. By performing such pre-calculation, the total number of FFTs needed to calculate a reduces from 2pq to q (assuming the FFT(w_ij)'s are calculated and stored beforehand), achieving a significant reduction in total computation.

Similarly, each a_i vector to be calculated in Eqn. (1) is given by Σ_{j=1..q} IFFT(FFT(w_ij) ∘ FFT(x_j)), which requires q IFFT calculations. Because FFTs and IFFTs are linear operations ([Oppenheim1999]), we can perform the IFFT as the last step, i.e., calculate a_i as IFFT(Σ_{j=1..q} FFT(w_ij) ∘ FFT(x_j)). In this way the total number of IFFT calculations is reduced by a factor of q.
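Both re-use opportunities can be checked numerically. In our NumPy sketch below (first-column convention, hypothetical variable names), the FFTs of the stored vectors and of the input blocks are computed once, and summing in the frequency domain before a single IFFT gives the same block-row output as performing q separate IFFTs.

```python
import numpy as np

rng = np.random.default_rng(4)
q, k = 4, 8
ws = rng.standard_normal((q, k))   # defining vectors w_ij for one block row i
xs = rng.standard_normal((q, k))   # input blocks x_j

# Pre-computable FFTs: FFT(w_ij) is fixed after training, and FFT(x_j) is
# shared across all block rows.
Fw = np.fft.fft(ws, axis=1)
Fx = np.fft.fft(xs, axis=1)

# Naive: one IFFT per block (q IFFTs for this block row).
a_naive = sum(np.fft.ifft(Fw[j] * Fx[j]).real for j in range(q))

# Decoupled: sum in the frequency domain first, then a single IFFT.
a_fast = np.fft.ifft((Fw * Fx).sum(axis=0)).real

assert np.allclose(a_naive, a_fast)
```

Since w_ij and x_j are real-valued, the same sketch could use np.fft.rfft to store only the non-redundant half of each spectrum, matching the real-valued-FFT optimization discussed later.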
High Energy Efficiency and Performance Implementation in FPGAs
Based on the algorithmic framework, we describe the developed high-efficiency FPGA-based implementation of DNNs. Since the target is low-power embedded applications, we focus on the inference phase of small to medium-scale DNNs, e.g., for the MNIST, SVHN, and CIFAR datasets. We leave large-scale DNNs, e.g., for the ImageNet dataset, for future investigation because they do not target embedded applications. We first describe the proposed FPGA implementations using a set of reconfiguration and performance/efficiency enhancement techniques, and then present the algorithm-hardware co-optimization framework.
FPGA Implementations: Reconfigurability, In-Place Computation, Batch Processing, Deep Pipelining, and Resource Re-Use
Reconfigurability, In-Place Computation, and Batch Processing. In order to accommodate different DNN models, sizes, and application scenarios, the proposed FPGA implementation possesses reconfigurability for different layer sizes and layer types (FC or CONV layers). The reconfigurability is achieved because (i) both FC and CONV layers are formulated as the “FFT → element-wise multiplication → IFFT” procedure; (ii) an IFFT can be implemented using the FFT structure with a simple pre-processing step ([Salehi, Amirfattahi, and Parhi2013]); and (iii) the FFT structure possesses an inherent recursive property, in that small-scale FFTs can be implemented in parallel within larger-scale FFT structures ([Oppenheim1999]). More specifically, the first and second properties enable the implementation of a single FFT structure in a time-multiplexed manner for both FFTs and IFFTs and for both FC and CONV layers. For instance, a 128-input FFT structure can be implemented in the FPGA if a block size of 128 is utilized. The third property ensures that a single FFT structure can be utilized even if we use different block sizes for FC and CONV layers. Finally, in-place computation is utilized such that the same memory space can be used to store the outputs of every layer in the DNN, i.e., the outputs of each layer will replace the inputs (the outputs of the previous layer). In this way, the execution of an overall DNN uses the single FFT structure in a sequential, time-multiplexed manner without extra memory requirements.
The execution of the inference phase of the whole DNN is shown in Fig. 4. The batch processing technique is utilized, in that a batch of input pictures is processed in an interleaved manner in the FPGA. As shown in Fig. 4, we first compute the first layer of all input pictures in the batch, then the second layer, and so on. Different layers of a neural network are time-multiplexed on the basic block. The computations are all based on the implemented FFT structure discussed previously, in a time-multiplexed manner, and all operations are pipelined on the basic computing block. The reason for batch processing is the deep pipelining (to be discussed later) utilized in the hardware implementation: otherwise, pipeline bubbles would have to be injected when computing all layers of one input picture consecutively, which results in timing overheads. A typical batch consists of around 50–100 pictures, because (i) state-of-the-art FPGAs have more than 2MB of on-chip memory storage (e.g., Intel (Altera) Cyclone V 5CEA9, Xilinx Kintex-7 XC7K325T) and (ii) the intermediate results of small to medium-scale DNNs (e.g., DNNs for CIFAR-10) typically take several KBs per picture.
Three-Phase Operations, Deep Pipelining, and Resource Re-Use. As described before, the calculation of a consists of three phases: calculation of FFT(x_j) for each j; calculation of the element-wise multiplications FFT(w_ij) ∘ FFT(x_j) (and the corresponding additions) for each i, j; and IFFTs for each i. For example, if W is 1024-by-1024 and the block size is 128 (i.e., p = q = 8), a total of 8 FFTs, 8 IFFTs, and 64 groups of element-wise multiplications will be performed. As shown in Fig. 4, the three-phase operations are integrated with batch processing. More specifically, an outer loop iterates over all layers of the DNN; within the outer loop are the three calculation phases; and within each phase are the calculations for every i, j in each picture, for all pictures. In this way the timing overheads can be minimized to close to zero.
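The loop nest described above can be sketched as plain control flow (our illustrative Python rendering with toy counts; the real design is an FPGA pipeline): with layers outermost and the three phases grouped within each layer, consecutive pictures keep the pipeline full.

```python
# Order of operations for one batch: layers outermost, then the three phases,
# then all blocks of all pictures within a phase (keeping the pipeline full).
schedule = []
num_layers, batch, p, q = 2, 3, 2, 2
for layer in range(num_layers):
    for pic in range(batch):                      # phase 1: FFT(x_j)
        for j in range(q):
            schedule.append(("FFT", layer, pic, j))
    for pic in range(batch):                      # phase 2: element-wise mult/add
        for i in range(p):
            for j in range(q):
                schedule.append(("MULT", layer, pic, i, j))
    for pic in range(batch):                      # phase 3: IFFT per block row
        for i in range(p):
            schedule.append(("IFFT", layer, pic, i))

# Within each layer the phases are contiguous, so the single FFT structure is
# time-multiplexed without pipeline bubbles between pictures.
phases = [op for (op, l, *rest) in schedule if l == 0]
assert phases == ["FFT"] * 6 + ["MULT"] * 12 + ["IFFT"] * 6
```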
The deep pipelining technique is utilized for FFTs and IFFTs in order to improve throughput and energy efficiency, as illustrated in Fig. 4
. For example, if a 128point FFT is implemented as the basic computing block in FPGA, it needs 7 pipeline stages plus 4 additional stages corresponding to memory reading and writing. When IFFT is implemented on such basic computing block, 2 additional stages are needed corresponding to the preprocessing, and biasing and ReLU activation. The elementwise multiplications and additions in the second phase are also pipelined.
One clear advantage of the FPGA-based hardware implementation is the ability to re-use resources. Besides the effective time-multiplexing of FFTs and IFFTs on the same hardware, the element-wise multiplications in the second phase can also re-use the hardware multipliers in the FFT computing block. This effective resource re-use can be automatically determined in the FPGA synthesis process ([qua]), which improves the area and energy efficiency of the FPGA implementation.
AlgorithmHardware CoOptimizations
Finally, an algorithm-hardware co-optimization framework is developed, which comprises (i) model selection and optimization, (ii) hardware optimization, and (iii) variational inference-based Bayesian learning. The overall objective is to maximize the performance (throughput) and energy efficiency of the FPGA hardware implementation subject to certain accuracy requirements. More specifically, the first aspect determines the proper block size and weight matrix size, in order to facilitate FPGA-based FFT implementations while satisfying the overall accuracy requirement. For state-of-the-art FPGAs, a proper block size ranges from 64 to 256 (preferably a power of 2) for FC layers and may be smaller for CONV layers. The second aspect includes the exploitation of FFTs with real-valued inputs: the FFT result of a real-valued vector is conjugate-symmetric except for the first (DC) component ([Oppenheim1999]). Because both x_j and w_ij are real-valued vectors, we only need to store the first half of the vectors FFT(x_j) and FFT(w_ij), which significantly reduces the storage requirement and the computations required in the element-wise multiplications. The last aspect uses the variational inference process of Bayesian learning ([Blei, Jordan, and others2006]), which is compatible with the proposed framework and can result in accuracy and robustness enhancements. Bayesian training using variational inference ([Blei, Jordan, and others2006]) is an effective training method for enhancing the accuracy and robustness of machine learning systems, including neural networks. During the training phase, it assumes that each weight is a variable that satisfies a certain prior distribution at the beginning. For each training sample, it generates a collection of random weights based on the distribution, and learns both the mean and variance of each weight variable. The inference phase (implemented in hardware) remains the same, using the mean estimate of each weight. Based on our results, Bayesian training is most effective for small-data training and small-to-medium neural networks. The algorithm-hardware co-optimization framework is shown in Fig. 5. Overall, the proposed FPGA-based implementation can accommodate the whole DNN model using on-chip block memory, thereby significantly improving the overall energy efficiency.

Experimental Results
In this section, we provide the experimental results of FPGA implementations of the proposed framework on small to medium-scale DNNs, using the MNIST, SVHN, and CIFAR-10 benchmarks. Our FPGAs for implementation include the low-power Intel (Altera) Cyclone V 5CEA9 and the higher-performance Xilinx Kintex-7 XC7K325T; the former is the default FPGA used in the experiments. We compare the performance (throughput), energy efficiency, and accuracy with the best state-of-the-art alternatives, including the IBM TrueNorth neurosynaptic processor, emerging-device (e.g., memristor crossbar) based neuromorphic systems, analog-based neuromorphic systems, and reference FPGA implementations. IBM TrueNorth ([Esser et al.2015, Esser et al.2016]) is a neuromorphic CMOS chip fabricated in 28nm technology, with 4096 cores, each simulating 256 programmable silicon neurons in a time-multiplexed manner. It implements the spiking neural network, a bio-inspired type of neural network that benefits from the ability of globally asynchronous implementations. It can accommodate the MNIST, SVHN, and CIFAR-10 benchmarks in the experiments (note that ImageNet is not currently supported by IBM TrueNorth due to the high-degree neural connections).
First, we provide the comparison results on accuracy, performance (throughput, in kilo-frames per second (kFPS)), and energy efficiency (in kFPS/W) on the three benchmarks, as shown in Table 1. The baselines include the IBM TrueNorth processor and reference FPGA implementations of these benchmarks. We provide results of the proposed framework on three DNNs for the MNIST dataset with different target accuracies, one for SVHN, and two for the CIFAR-10 dataset. The first two DNNs for the MNIST dataset are multi-layer perceptron (MLP) models that achieve 92.9% and 95.6% accuracy, respectively; prior pooling is applied to reduce the input size to 256 and 128, respectively. The third DNN for the MNIST dataset is a CNN similar to the LeNet-5 structure (
[LeCun et al.1995]). The baseline IBM TrueNorth processor also has different implementations with different accuracy levels for the MNIST dataset. For the CIFAR-10 dataset, the first DNN is a simple CNN structure, whereas the second is a wide ResNet model ([He et al.2016]) that achieves 94.75% accuracy, only 0.75% lower than the best state-of-the-art software implementation. We can observe that, at a similar accuracy level, the speedup and energy efficiency gain compared with IBM TrueNorth are at least 152X and 71X, respectively. At a similar accuracy level, the energy efficiency gain is at least 31X compared with the reference FPGA-based implementation that achieves the highest energy efficiency ([Umuroglu et al.2016]) (using binary neural networks). Besides the reduction in computational complexity, the high suitability of the proposed framework for hardware implementation, and the highly efficient deep-pipelined hardware structure, the reasons for such significant gains also include the need to increase neuron numbers in spiking or binary neural networks to achieve the same accuracy as MLPs or CNNs, and the inherent long latency of spiking neural networks.

Table 1: Comparison on accuracy, performance, and energy efficiency.

Name | Dataset | Platform | Precision (bits) | Accuracy | Performance (kFPS) | Energy eff. (kFPS/W)
Proposed MNIST 1 | MNIST | Cyclone V | 12 | 92.9% | |
Proposed MNIST 2 | MNIST | Cyclone V | 12 | 95.6% | |
Proposed MNIST 3 | MNIST | Cyclone V | 12 | 99.0% | 659.5 |
Proposed SVHN | SVHN | Cyclone V | 12 | 96.2% | 699.7 |
Proposed CIFAR-10 1 | CIFAR-10 | Cyclone V | 12 | 80.3% | 2514 |
Proposed CIFAR-10 2 | CIFAR-10 | Cyclone V | 12 | 94.75% | 25.4 |
TrueNorth ([Esser et al.2015]) | MNIST | TrueNorth | 2 | 99%+ | 1.0 | 9.26
TrueNorth ([Esser et al.2015]) | MNIST | TrueNorth | 2 | 95% | 1.0 | 250
TrueNorth ([Esser et al.2016]) | SVHN | TrueNorth | 2 | 96.7% | 2.53 | 9.85
TrueNorth ([Esser et al.2016]) | CIFAR-10 | TrueNorth | 2 | 83.4% | 1.25 | 6.11
Umuroglu et al. ([Umuroglu et al.2016]) | MNIST | ZC706 | 1 | 95.8% | 1693 |
Umuroglu et al. ([Umuroglu et al.2016]) | SVHN | ZC706 | 1 | 94.9% | 21.9 | 6.08
Umuroglu et al. ([Umuroglu et al.2016]) | CIFAR-10 | ZC706 | 1 | 80.1% | 21.9 | 6.08
Alemdar et al. ([Alemdar et al.2016]) | MNIST | Kintex-7 | 2 | 98.3% | 255.1 | 92.59
Next, we provide sample comparison results with emerging-device and analog-based implementations. Because the neural networks and applications may be different, we use the equivalent performance in giga-operations per second (GOPS) and energy efficiency in GOPS/W for fair comparison. The term “equivalent” is utilized because we normalize the number of (multiplication and addition) operations to the original matrix-vector multiplication format. The proposed framework achieves around 5.14 Tera-OPS/W (TOPS/W) energy efficiency, which outperforms representative recent results using analog computing and emerging devices; for example, ([Shafiee et al.2016, Song et al.2017, Lu et al.2015]) achieve 380.7 GOPS/W, 142.9 GOPS/W, and 1.04 TOPS/W, respectively. The reference works are either manufactured or based on device modeling. Performance-wise, as analyzed in ([Bayat et al.2016, Liu et al.2016, Li et al.2016]), a matrix-vector multiplication takes around 100ns, and it takes around 1μs to perform inference on one MNIST sample (with 90%–94% accuracy). Our achieved highest performance (throughput) for the MNIST dataset, i.e., 11.6ns per image recognition on the Cyclone V FPGA or around 4ns per image on the Kintex-7 FPGA, is difficult to achieve even using emerging devices and technologies.
Finally, we provide comparison results with other FPGA implementations in terms of the equivalent performance (in GOPS) and energy efficiency (in GOPS/W), as shown in Fig. 6. These metrics enable relatively fair comparisons even though the implemented DNNs may differ. The baseline FPGA implementations include high-level-synthesis-based implementations, implementations of compressed models, etc. An energy efficiency gain of at least 84X is achieved compared with the reference FPGA implementations. Besides the reduced computational complexity and the high-efficiency hardware implementation, another key reason for this significant gain is that the proposed FPGA-based implementation accommodates the whole DNN model in on-chip block memory, thereby significantly improving the overall energy efficiency.
Conclusion
This paper presents an algorithm-hardware co-optimization framework to facilitate ultra-high-performance and highly energy-efficient hardware implementations of DNNs on FPGAs. The algorithm part adopts general block-circulant matrices to achieve a fine-grained tradeoff between accuracy and compression ratio. It applies to both FC and CONV layers and is supported by a mathematically rigorous proof. The proposed algorithm reduces the computational complexity per layer from O(n^2) to O(n log n) and the storage complexity from O(n^2) to O(n), for both the training and inference phases. The hardware part consists of highly efficient FPGA-based implementations using effective reconfiguration, batch processing, deep pipelining, resource reuse, and a hierarchical control framework. Experimental results demonstrate that the proposed framework achieves at least a 152X speedup in throughput and a 71X energy efficiency gain compared with the IBM TrueNorth processor under the same test accuracy. It achieves at least a 31X energy efficiency gain compared with the reference FPGA-based work.
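The per-layer complexity reduction from block-circulant weights stems from a standard identity: a circulant matrix-vector product equals the inverse FFT of the elementwise product of two FFTs, costing O(n log n) instead of the O(n^2) dense product. A minimal NumPy sketch (function names are ours, not the paper's implementation):

```python
import numpy as np

def circulant_matvec_fft(c, x):
    """Multiply the circulant matrix defined by first column c with x,
    via FFTs: O(n log n) time, O(n) storage for the weights."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def circulant_from_first_column(c):
    """Build the dense n x n circulant matrix (O(n^2) storage), for checking.
    Entry (i, j) is c[(i - j) mod n]."""
    n = len(c)
    return np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])

# Verify the FFT path matches the dense product on random data.
rng = np.random.default_rng(0)
c = rng.standard_normal(8)
x = rng.standard_normal(8)
dense = circulant_from_first_column(c) @ x
fast = circulant_matvec_fft(c, x)
assert np.allclose(dense, fast)
```

In a block-circulant layer the weight matrix is partitioned into such circulant blocks, so each block needs only its first column (O(n) storage) and one FFT-multiply-IFFT pass.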
Acknowledgement
This work is funded by the National Science Foundation Awards CNS-1650469, CCF-1733701, CNS-1704662, CCF-1657333, CNS-1739748, and CCF-1733834.
References
 [Alemdar et al.2016] Alemdar, H.; Caldwell, N.; Leroy, V.; Prost-Boucle, A.; and Pétrot, F. 2016. Ternary neural networks for resource-efficient ai applications. arXiv preprint arXiv:1609.00222.
 [Bayat et al.2016] Bayat, F. M.; Guo, X.; Klachko, M.; Prezioso, M.; Likharev, K.; and Strukov, D. 2016. Sub-1us, sub-20nj pattern classification in a mixed-signal circuit based on embedded 180nm floating-gate memory cell arrays. arXiv preprint arXiv:1610.02091.
 [Bini, Pan, and Eberly1996] Bini, D.; Pan, V.; and Eberly, W. 1996. Polynomial and matrix computations volume 1: Fundamental algorithms. SIAM Review.
 [Blei, Jordan, and others2006] Blei, D. M.; Jordan, M. I.; et al. 2006. Variational inference for dirichlet process mixtures. Bayesian analysis 1(1):121–143.

 [Burbidge et al.2001] Burbidge, R.; Trotter, M.; Buxton, B.; and Holden, S. 2001. Drug design by machine learning: support vector machines for pharmaceutical data analysis. Computers & chemistry 26(1):5–14.
 [Chen et al.2014] Chen, T.; Du, Z.; Sun, N.; Wang, J.; Wu, C.; Chen, Y.; and Temam, O. 2014. Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In ACM Sigplan Notices.
 [Chen et al.2017] Chen, Y.-H.; Krishna, T.; Emer, J. S.; and Sze, V. 2017. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits 52(1).
 [Cheng et al.2015] Cheng, Y.; Yu, F. X.; Feris, R. S.; Kumar, S.; Choudhary, A.; and Chang, S.-F. 2015. An exploration of parameter redundancy in deep networks with circulant projections. In ICCV, 2857–2865.

 [Collobert and Weston2008] Collobert, R., and Weston, J. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, 160–167. ACM.
 [coma] http://www.techradar.com/news/computingcomponents/processors/googlestensorprocessingunitexplainedthisiswhatthefutureofcomputinglookslike1326915.
 [comb] https://www.sdxcentral.com/articles/news/intelsdeeplearningchipswillarrive2017/2016/11/.
 [Deng et al.2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In CVPR, 248–255. IEEE.
 [Denil et al.2013] Denil, M.; Shakibi, B.; Dinh, L.; de Freitas, N.; et al. 2013. Predicting parameters in deep learning. In NIPS, 2148–2156.
 [Denton et al.2014] Denton, E. L.; Zaremba, W.; Bruna, J.; LeCun, Y.; and Fergus, R. 2014. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 1269–1277.
 [Ding et al.2017] Ding, C.; Liao, S.; Wang, Y.; Li, Z.; Liu, N.; Zhuo, Y.; Wang, C.; Qian, X.; Bai, Y.; Yuan, G.; et al. 2017. Circnn: accelerating and compressing deep neural networks using block-circulant weight matrices. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 395–408. ACM.
 [Esser et al.2015] Esser, S. K.; Appuswamy, R.; Merolla, P.; Arthur, J. V.; and Modha, D. S. 2015. Backpropagation for energy-efficient neuromorphic computing. In NIPS, 1117–1125.
 [Esser et al.2016] Esser, S. K.; Merolla, P. A.; Arthur, J. V.; Cassidy, A. S.; Appuswamy, R.; Andreopoulos, A.; Berg, D. J.; McKinstry, J. L.; Melano, T.; Barch, D. R.; et al. 2016. Convolutional networks for fast, energy-efficient neuromorphic computing. PNAS 201604850.
 [Farabet et al.2009] Farabet, C.; Poulet, C.; Han, J. Y.; and LeCun, Y. 2009. Cnp: An fpga-based processor for convolutional networks. In FPL, 32–37.
 [Feng and Darrell2015] Feng, J., and Darrell, T. 2015. Learning the structure of deep convolutional networks. In ICCV, 2749–2757.
 [Han et al.2015] Han, S.; Pool, J.; Tran, J.; and Dally, W. 2015. Learning both weights and connections for efficient neural network. In NIPS, 1135–1143.
 [Han et al.2016] Han, S.; Liu, X.; Mao, H.; Pu, J.; Pedram, A.; Horowitz, M. A.; and Dally, W. J. 2016. Eie: efficient inference engine on compressed deep neural network. In ISCA, 243–254. IEEE Press.
 [Han et al.2017] Han, S.; Kang, J.; Mao, H.; Hu, Y.; Li, X.; Li, Y.; Xie, D.; Luo, H.; Yao, S.; Wang, Y.; et al. 2017. Ese: Efficient speech recognition engine with sparse lstm on fpga. In FPGA, 75–84. ACM.
 [Han, Mao, and Dally2015] Han, S.; Mao, H.; and Dally, W. J. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149.
 [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770–778.
 [Huval et al.2015] Huval, B.; Wang, T.; Tandon, S.; Kiske, J.; Song, W.; Pazhayampallil, J.; Andriluka, M.; Rajpurkar, P.; Migimatsu, T.; Cheng-Yue, R.; et al. 2015. An empirical evaluation of deep learning on highway driving. arXiv preprint arXiv:1504.01716.
 [Jia et al.2014] Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; and Darrell, T. 2014. Caffe: Convolutional architecture for fast feature embedding. In MM, 675–678. ACM.
 [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In NIPS.
 [LeCun et al.1995] LeCun, Y.; Jackel, L.; Bottou, L.; Brunot, A.; Cortes, C.; Denker, J.; Drucker, H.; Guyon, I.; Muller, U.; Sackinger, E.; et al. 1995. Comparison of learning algorithms for handwritten digit recognition. In ICANN, volume 60, 53–60. Perth, Australia.
 [Li et al.2016] Li, S.; Liu, X.; Mao, M.; Li, H. H.; Chen, Y.; Li, B.; and Wang, Y. 2016. Heterogeneous systems with reconfigurable neuromorphic computing accelerators. In ISCAS, 125–128. IEEE.
 [Li, Park, and Tang2017] Li, S.; Park, J.; and Tang, P. T. P. 2017. Enabling sparse winograd convolution by native pruning. arXiv preprint arXiv:1702.08597.
 [Lin et al.2015] Lin, Z.; Courbariaux, M.; Memisevic, R.; and Bengio, Y. 2015. Neural networks with few multiplications. arXiv preprint arXiv:1510.03009.
 [Lin, Talathi, and Annapureddy2016] Lin, D.; Talathi, S.; and Annapureddy, S. 2016. Fixed point quantization of deep convolutional networks. In ICML, 2849–2858.
 [Liu et al.2016] Liu, X.; Mao, M.; Liu, B.; Li, B.; Wang, Y.; Jiang, H.; Barnell, M.; Wu, Q.; Yang, J.; Li, H.; et al. 2016. Harmonica: A framework of heterogeneous computing systems with memristor-based neuromorphic computing accelerators. IEEE Transactions on Circuits and Systems I: Regular Papers 63(5):617–628.
 [Lu et al.2015] Lu, J.; Young, S.; Arel, I.; and Holleman, J. 2015. A 1 tops/w analog deep machine-learning engine with floating-gate storage in 0.13 μm cmos. IEEE Journal of Solid-State Circuits 50(1).
 [Mathieu, Henaff, and LeCun2013] Mathieu, M.; Henaff, M.; and LeCun, Y. 2013. Fast training of convolutional networks through ffts. arXiv preprint arXiv:1312.5851.
 [Merolla et al.2014] Merolla, P. A.; Arthur, J. V.; Alvarez-Icaza, R.; Cassidy, A. S.; Sawada, J.; Akopyan, F.; Jackson, B. L.; Imam, N.; Guo, C.; Nakamura, Y.; et al. 2014. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science 345(6197):668–673.
 [Oppenheim1999] Oppenheim, A. V. 1999. Discrete-time signal processing. Pearson Education India.
 [Pan2012] Pan, V. 2012. Structured matrices and polynomials: unified superfast algorithms. Springer Science & Business Media.
 [pro] https://drive.google.com/open?id=0B19Xkz1gXlwAYjVjWC1Kc2xSRm8.
 [Qiu et al.2016] Qiu, J.; Wang, J.; Yao, S.; Guo, K.; Li, B.; Zhou, E.; Yu, J.; Tang, T.; Xu, N.; Song, S.; et al. 2016. Going deeper with embedded fpga platform for convolutional neural network. In FPGA, 26–35.
 [qua] https://dl.altera.com.
 [Salehi, Amirfattahi, and Parhi2013] Salehi, S. A.; Amirfattahi, R.; and Parhi, K. K. 2013. Pipelined architectures for real-valued fft and hermitian-symmetric ifft with real datapaths. IEEE Transactions on Circuits and Systems II: Express Briefs 60(8):507–511.
 [Shafiee et al.2016] Shafiee, A.; Nag, A.; Muralimanohar, N.; Balasubramonian, R.; Strachan, J. P.; Hu, M.; Williams, R. S.; and Srikumar, V. 2016. Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. In ISCA, 14–26. IEEE Press.
 [Song et al.2017] Song, L.; Qian, X.; Li, H.; and Chen, Y. 2017. Pipelayer: A pipelined reram-based accelerator for deep learning. In HPCA.
 [Suda et al.2016] Suda, N.; Chandra, V.; Dasika, G.; Mohanty, A.; Ma, Y.; Vrudhula, S.; Seo, J.-s.; and Cao, Y. 2016. Throughput-optimized opencl-based fpga accelerator for large-scale convolutional neural networks. In FPGA, 16–25. ACM.
 [Taigman et al.2014] Taigman, Y.; Yang, M.; Ranzato, M.; and Wolf, L. 2014. Deepface: Closing the gap to human-level performance in face verification. In CVPR, 1701–1708.
 [Umuroglu et al.2016] Umuroglu, Y.; Fraser, N. J.; Gambardella, G.; Blott, M.; Leong, P.; Jahre, M.; and Vissers, K. 2016. Finn: A framework for fast, scalable binarized neural network inference. arXiv preprint arXiv:1612.07119.
 [Vedaldi and Lenc2015] Vedaldi, A., and Lenc, K. 2015. Matconvnet: Convolutional neural networks for matlab. In MM, 689–692. ACM.
 [Wen et al.2016] Wen, W.; Wu, C.; Wang, Y.; Chen, Y.; and Li, H. 2016. Learning structured sparsity in deep neural networks. In NIPS, 2074–2082.
 [Zhang and Li2017] Zhang, J., and Li, J. 2017. Improving the performance of opencl-based fpga accelerator for convolutional neural network. In FPGA.
 [Zhang et al.2016a] Zhang, C.; Fang, Z.; Zhou, P.; Pan, P.; and Cong, J. 2016a. Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks. In ICCAD, 12. ACM.
 [Zhang et al.2016b] Zhang, C.; Wu, D.; Sun, J.; Sun, G.; Luo, G.; and Cong, J. 2016b. Energy-efficient cnn implementation on a deeply pipelined fpga cluster. In ISLPED, 326–331. ACM.

 [Zhao et al.2017] Zhao, R.; Song, W.; Zhang, W.; Xing, T.; Lin, J.-H.; Srivastava, M.; Gupta, R.; and Zhang, Z. 2017. Accelerating binarized convolutional neural networks with software-programmable fpgas. In FPGA, 15–24. ACM.