Structured Weight Matrices-Based Hardware Accelerators in Deep Neural Networks: FPGAs and ASICs

03/28/2018 ∙ by Caiwen Ding, et al. ∙ CUNY Law School Syracuse University

Both industry and academia have extensively investigated hardware acceleration. In this work, to address the increasing demands on computational capability and memory, we propose structured weight matrices (SWM)-based compression techniques for both field programmable gate array (FPGA) and application-specific integrated circuit (ASIC) implementations. On the algorithm side, the SWM-based framework adopts block-circulant matrices to achieve a fine-grained tradeoff between accuracy and compression ratio. The SWM-based technique can reduce computational complexity from O(n^2) to O(n log n) and storage complexity from O(n^2) to O(n) for each layer, in both the training and inference phases. For FPGA implementations of deep convolutional neural networks (DCNNs), we achieve at least 152X and 72X improvement in performance and energy efficiency, respectively, using the SWM-based framework, compared with the baseline IBM TrueNorth processor under the same accuracy constraints on the MNIST, SVHN, and CIFAR-10 data sets. For FPGA implementations of long short term memory (LSTM) networks, the proposed SWM-based LSTM achieves up to 21X improvement in performance and 33.5X improvement in energy efficiency compared with the baseline accelerator. For ASIC implementations, the SWM-based ASIC design exhibits impressive advantages in terms of power, throughput, and energy efficiency. Experimental results indicate that this method is well suited for deploying DNNs on both FPGAs and mobile/IoT devices.




1. Introduction

Deep learning has drawn increasing attention in many research fields, such as speech recognition (Hinton et al., 2012), computer vision (Krizhevsky et al., 2012; He et al., 2016), self-driving cars (Schmidhuber, 2015; Huval et al., 2015), and unmanned aircraft systems (Makantasis et al., 2015). Large-scale deep neural networks (DNNs) typically consist of multiple layers and at least millions of weight parameters for the entire model (Krizhevsky et al., 2012). One major advantage of larger-scale DNNs is that they extract more complex high-level features from the inputs (e.g., images/videos, speech) and, as a result, achieve a significant improvement in model accuracy (Schmidhuber, 2015).

On the other hand, as the size of DNNs grows continuously, the demands on computational capability and memory grow tremendously. Therefore, improving performance and energy efficiency while maintaining the accuracy of DNNs becomes extremely critical. Two trends have characterized the research toward higher performance and energy efficiency. The first trend is hardware acceleration. FPGA-based accelerators have the advantages of friendly programmability and a high degree of parallelism. Stochastic Computing (SC), in which all the inputs and weight values are represented as streams of random bits, has been investigated and successfully applied to hardware acceleration of DNNs (Li et al., 2016, 2017a; Ren et al., 2017; Ren et al., 2016; Yuan et al., 2017; Li et al., 2017c, d; Lin et al., 2017; Li et al., 2017b). Data-path optimization techniques (Gokhale et al., 2014) have also been studied to map a limited number of processing elements (PEs) onto an FPGA and reuse the mapped PEs by iterating data through them. In addition, ASIC-based implementations have been explored to further accelerate DNNs. A substantial number of high-tech companies have announced ASIC chip designs for DNNs, such as Google (Jouppi et al., 2017) and IBM TrueNorth (Esser et al., 2015). In academia, Eyeriss (Chen et al., 2017), EIE (Han et al., 2016), and DaDianNao (Chen et al., 2014) mainly focus on the convolutional layers, the fully-connected layers, and the memory design/organization at the architectural level, respectively.

The second trend is model compression, motivated by the energy efficiency limitations of large DNN models. Weight pruning (Han et al., 2015) and low-rank approximation (Tai et al., 2015) aim to reduce the number of operations involved in DNNs. They achieve a parameter reduction to some extent with inconsequential accuracy degradation. However, they bring new challenges, such as the irregular network structure caused by sparsity regularization (Yu et al., 2017) and the increased training complexity caused by the additional pruning process (Han et al., 2015) or low-rank approximation step (Tai et al., 2015).

In this work, to address the limitations of existing work on model size compression and acceleration, and to achieve ultra-high energy efficiency and performance for FPGA and ASIC-based hardware implementations, we propose the structured weight matrices (SWM)-based compression technique for both FPGA and ASIC implementations. The SWM-based framework adopts general block-circulant matrices to achieve a fine-grained tradeoff between accuracy and compression ratio. For FPGA implementations of DCNNs, we achieve at least 152X and 72X improvement in performance and energy efficiency, respectively, using the SWM-based framework, compared with the baseline IBM TrueNorth processor under the same accuracy constraints on the MNIST, SVHN, and CIFAR-10 data sets. For FPGA implementations of LSTM networks, the proposed SWM-based method achieves up to 21X improvement in performance and 33.5X improvement in energy efficiency compared with ESE. For ASIC implementations, the proposed SWM-based design exhibits impressive advantages in terms of power, throughput, and energy efficiency. These results indicate that the method is well suited for deploying DNNs on both FPGAs and mobile/IoT devices.

2. Background of DNNs

2.1. Deep Convolutional Neural Networks

DNN systems encompass many different architectures, such as DCNNs, recurrent neural networks (RNNs), and deep belief networks (DBNs). Although different network structures target specific applications, they share the same construction principle, i.e., multiple layers connected in series for feature extraction (Lee et al., 2009; Karpathy et al., 2014). DNNs are commonly made up of three types of layers: fully-connected (FC) layers, convolutional (CONV) layers, and pooling (POOL) layers.

FC layer is the most storage-intensive layer in DNNs (Qiu et al., 2016; Han et al., 2016) since its neurons are fully connected with the neurons in the previous layer. The computation of an FC layer consists of matrix-vector arithmetic followed by the activation function, described as $y = \psi(Wx + \theta)$, where $W \in \mathbb{R}^{m \times n}$ is the weight matrix of the synapses between this FC layer (with $m$ neurons) and its previous layer (with $n$ neurons); $\theta \in \mathbb{R}^{m}$ is the bias vector; and $\psi(\cdot)$ is the activation function. The calculation of $Wx$ dominates the computational complexity because the rest has a lower complexity of $O(m)$.

CONV layer performs a multi-dimensional convolution to extract features from its inputs, which are fed into subsequent layers for extracting higher-level features. A CONV layer is associated with a set of learnable filters (or kernels) (LeCun et al., 1998). A filter-sized moving window is applied to the input feature maps, calculating the convolution of the filter and the input feature maps in the moving window. In practical DNN models, the CONV layers are often associated with multiple input and multiple output feature maps. As a result, the CONV layer can be expressed in tensor computations: $\mathcal{Y}(x, y, p) = \sum_{i=1}^{r} \sum_{j=1}^{r} \sum_{c=1}^{C} \mathcal{F}(i, j, c, p)\, \mathcal{X}(x+i-1, y+j-1, c)$, where $\mathcal{X} \in \mathbb{R}^{W \times H \times C}$, $\mathcal{Y} \in \mathbb{R}^{(W-r+1) \times (H-r+1) \times P}$, and $\mathcal{F} \in \mathbb{R}^{r \times r \times C \times P}$ represent the input, output, and weight “tensors” of the CONV layer, respectively. Here, $W$ and $H$ are the spatial dimensions of the input maps, $C$ is the number of input maps, $r$ is the size of the convolutional kernel, and $P$ is the number of output maps.
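The tensor form of the CONV layer can be made concrete with a direct loop nest. The following is a minimal sketch, assuming stride 1, no padding, 0-based indices, and nested Python lists as tensors; the function name `conv_layer` is hypothetical and chosen only for illustration.

```python
def conv_layer(X, F):
    """Direct evaluation of Y(x, y, p) = sum over i, j, c of F(i, j, c, p) * X(x+i, y+j, c).

    X is indexed X[x][y][c] with shape W x H x C.
    F is indexed F[i][j][c][p] with shape r x r x C x P.
    Indices are 0-based; stride 1 and no padding are assumed.
    """
    W, H, C = len(X), len(X[0]), len(X[0][0])
    r, P = len(F), len(F[0][0][0])
    out_w, out_h = W - r + 1, H - r + 1
    Y = [[[0.0] * P for _ in range(out_h)] for _ in range(out_w)]
    for x in range(out_w):
        for y in range(out_h):
            for p in range(P):
                acc = 0.0
                for i in range(r):
                    for j in range(r):
                        for c in range(C):
                            acc += F[i][j][c][p] * X[x + i][y + j][c]
                Y[x][y][p] = acc
    return Y


# 3 x 3 single-channel input and one 2 x 2 filter of ones:
# every output element is the sum of a 2 x 2 window.
X = [[[float(3 * a + b)] for b in range(3)] for a in range(3)]
F = [[[[1.0]] for _ in range(2)] for _ in range(2)]
Y = conv_layer(X, F)
```

The six nested loops make the cost visible: each of the (W-r+1)(H-r+1)P outputs needs r·r·C multiply-accumulate operations, which is why CONV layers dominate the computation.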

POOL layer performs a subsampling operation on the extracted features to reduce the data dimensions and mitigate overfitting issues. Max pooling is the dominant type of pooling strategy in state-of-the-art DCNNs due to its higher overall accuracy and convergence speed (Chen et al., 2014; Chen et al., 2017).

The majority of computations occur in the CONV and FC layers, while the POOL layer has a lower computational complexity of $O(n)$. The storage requirement of DNNs is due to the weight matrices W’s in the FC layers and the convolutional kernels F’s in the CONV layers. As a result, the FC and CONV layers become the major research focus for energy-efficient implementation of DNNs.

2.2. Recurrent Neural Networks

Figure 1. An example of LSTM based RNN architecture.

RNNs have been investigated and have many applications in natural language processing, speech recognition, and machine translation (Sak et al., 2014). As one popular type of RNN, long short term memory (LSTM) has been broadly studied, as shown in Fig. 1 (Sak et al., 2014). An LSTM-based RNN accepts an input sequence $X = (x_1; x_2; \ldots; x_T)$ (each $x_t$ is a vector corresponding to time $t$) together with the output sequence from the last step $Y^{T-1} = (y_0; y_1; \ldots; y_{T-1})$ (each $y_t$ is a vector). It computes an output sequence $Y = (y_1; y_2; \ldots; y_T)$ by using the following equations iteratively from $t = 1$ to $T$:
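For reference, the standard LSTM with peephole connections and a projection layer (Sak et al., 2014) can be written as follows. This is a sketch matched to the symbol definitions in this section; the intermediate cell-input term $g_t$ and its logistic activation are assumptions rather than something the surrounding text confirms.

```latex
\begin{aligned}
i_t &= \sigma(W_{ix} x_t + W_{ir} y_{t-1} + W_{ic} c_{t-1} + b_i), \\
f_t &= \sigma(W_{fx} x_t + W_{fr} y_{t-1} + W_{fc} c_{t-1} + b_f), \\
g_t &= \sigma(W_{cx} x_t + W_{cr} y_{t-1} + b_c), \\
c_t &= f_t \odot c_{t-1} + g_t \odot i_t, \\
o_t &= \sigma(W_{ox} x_t + W_{or} y_{t-1} + W_{oc} c_t + b_o), \\
m_t &= o_t \odot h(c_t), \\
y_t &= W_{ym} m_t.
\end{aligned}
```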


where the symbols $i_t$, $f_t$, $o_t$, $c_t$, $m_t$, and $y_t$ represent the input gate, forget gate, output gate, cell state, cell output, and projected output, respectively. The operation $\odot$ represents element-wise multiplication, and the operation $+$ is matrix addition. The $W$ terms represent weight matrices (for instance, $W_{ix}$ is the weight matrix from the input vector $x_t$ to the input gate), and the $b$ terms are the bias vectors. Additionally, the weight matrices $W_{ic}$, $W_{fc}$, and $W_{oc}$ are diagonal matrices for peephole connections, which can be treated as vectors during matrix-vector multiplication. Therefore, $W_{ic} c_{t-1}$ can be calculated using the $\odot$ operation. $\sigma$ is the logistic activation function and $h$ is a self-defined activation function. In this model we use the hyperbolic tangent (tanh) activation function as $h$.
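One time step of such an LSTM can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the peephole-plus-projection formulation of Sak et al. (2014), a logistic activation on the cell input, and hypothetical dictionary keys such as `"ix"` or `"ym"` for the weight matrices. The peephole weights are stored as vectors because the corresponding matrices are diagonal.

```python
import math


def matvec(M, v):
    """Dense matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, y_prev, c_prev, W, b):
    """One LSTM step with peephole connections and a projection layer.

    W maps names such as "ix" (input vector -> input gate) to matrices.
    The peephole entries "ic", "fc", "oc" are vectors (diagonal matrices).
    """
    def gate(name, peephole_state=None):
        z = [a + r + bias for a, r, bias in
             zip(matvec(W[name + "x"], x_t), matvec(W[name + "r"], y_prev), b[name])]
        if peephole_state is not None:
            # Diagonal matrix times vector reduces to an element-wise product.
            z = [zi + w * s for zi, w, s in zip(z, W[name + "c"], peephole_state)]
        return [sigmoid(zi) for zi in z]

    i_t = gate("i", c_prev)              # input gate (peephole on c_{t-1})
    f_t = gate("f", c_prev)              # forget gate (peephole on c_{t-1})
    g_t = gate("c")                      # cell input, no peephole (assumed form)
    c_t = [f * c + g * i for f, c, g, i in zip(f_t, c_prev, g_t, i_t)]
    o_t = gate("o", c_t)                 # output gate (peephole on c_t)
    m_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    y_t = matvec(W["ym"], m_t)           # projected output
    return y_t, c_t
```

Every `matvec` call on an `x` or `r` matrix is a candidate for the block-circulant replacement described in the next section, since those products dominate the cost of a step.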

3. Structured Weight Matrix

Figure 2. FFT-Based Calculation in SWM-based FC Layer.

This section discusses the inference and training algorithms of SWM-based DNNs (e.g., Ding et al., 2017; Wang et al., 2018a). The advantage is two-fold: 1) it is possible to derive a fine-grained tradeoff between accuracy and compression/acceleration by changing the block size; and 2) the method applies to both FC and CONV layers. The theoretical foundation is derived in (Zhao et al., 2017), which shows that the “effectiveness” of SWM-based DNNs is the same as that of DNNs without compression. Experimental results in (Ding et al., 2017; Wang et al., 2018a) have demonstrated a good model compression ratio (i.e., from 41X to 256X) with small (less than 2%) overall accuracy degradation. In the following, we discuss the inference and training algorithms for the FC layer; details of the CONV layer algorithms are provided in (Ding et al., 2017).

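The key idea of SWM-based FC layers is to partition the original weight matrix $W \in \mathbb{R}^{m \times n}$ into blocks of square sub-matrices, where each sub-matrix is a circulant matrix. The illustrations are shown in Fig. 2. Let $b$ denote the block size (size of each sub-matrix) and assume there are $p \times q$ blocks after partitioning $W$, where $p = m \div b$ and $q = n \div b$. Then $W = [W_{ij}]$, $i \in \{1, \ldots, p\}$, $j \in \{1, \ldots, q\}$. The input $x$ is also partitioned as $x = [x_1^T, x_2^T, \ldots, x_q^T]^T$. Then, the forward propagation of the FC layer in inference is given by (with bias and ReLU omitted for simplicity):

$$a = Wx = \begin{bmatrix} \sum_{j=1}^{q} W_{1j} x_j \\ \sum_{j=1}^{q} W_{2j} x_j \\ \vdots \\ \sum_{j=1}^{q} W_{pj} x_j \end{bmatrix} = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix},$$

where $a_i \in \mathbb{R}^{b}$ is a column vector. Assume each circulant matrix $W_{ij}$ is defined by a vector $w_{ij}$, i.e., $w_{ij}$ is the first row vector of $W_{ij}$. According to the circulant convolution theorem (Pan, 2012), the calculation of $W_{ij} x_j$ can be performed as $\mathrm{IFFT}(\mathrm{FFT}(w_{ij}) \circ \mathrm{FFT}(x_j))$, where $\circ$ denotes element-wise multiplication. The operation procedure is shown on the right of Fig. 2. For the inference phase, the computational complexity of this FC layer is $O(pqb \log b)$, which is equivalent to $O(n \log n)$ for small $p$, $q$ values. Similarly, the storage complexity is $O(pqb)$ because only $w_{ij}$ or $\mathrm{FFT}(w_{ij})$ needs to be stored for each sub-matrix, which is equivalent to $O(n)$ for small $p$, $q$ values. Therefore, simultaneous acceleration and model compression are achieved.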
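The FFT-based product for a block-circulant FC layer can be sketched with the Python standard library alone. This is a minimal illustration under assumptions, not the authors' implementation: each block is taken to be defined by its first column (the convention under which the element-wise product of the two FFTs yields the circulant product directly), the block size is a power of two, and all function names are hypothetical. Partial products are accumulated in the frequency domain, so only one IFFT is needed per output block; this relies on the linearity of the IFFT.

```python
import cmath


def fft(a, inverse=False):
    """Recursive radix-2 transform; len(a) must be a power of two. No 1/n scaling."""
    n = len(a)
    if n == 1:
        return [complex(a[0])]
    sign = 1 if inverse else -1
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(a):
    n = len(a)
    return [v / n for v in fft(a, inverse=True)]


def block_circulant_matvec(w_blocks, x, b):
    """w_blocks[i][j] is the length-b defining vector (first column) of block W_ij."""
    num_col_blocks = len(x) // b
    # FFT of every input block is computed once and reused by all block rows.
    x_freq = [fft(x[j * b:(j + 1) * b]) for j in range(num_col_blocks)]
    out = []
    for row in w_blocks:
        acc = [0j] * b
        for w_ij, xf_j in zip(row, x_freq):
            w_freq = fft(w_ij)  # in hardware this would be precomputed and stored
            acc = [s + wf * xf for s, wf, xf in zip(acc, w_freq, xf_j)]
        out.extend(v.real for v in ifft(acc))
    return out


def dense_reference(w_blocks, x, b):
    """O(n^2) reference: expand each block to W[r][c] = w[(r - c) mod b]."""
    out = []
    for row in w_blocks:
        for r in range(b):
            out.append(sum(w_ij[(r - c) % b] * x[j * b + c]
                           for j, w_ij in enumerate(row) for c in range(b)))
    return out


w_blocks = [[[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 0.0]],
            [[2.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]]
x = [1.0, 0.0, 2.0, 0.0, 3.0, 1.0, 0.0, 2.0]
fast = block_circulant_matvec(w_blocks, x, 4)
slow = dense_reference(w_blocks, x, 4)
```

Here the 8 x 8 weight matrix is stored as four length-4 vectors (16 values instead of 64), and `fast` agrees with `slow` up to floating-point rounding.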


To reduce computational and storage complexity, many researchers have investigated reducing the number of weight parameters or the number of bits used for weight representation. However, such compression techniques cause degradation in model accuracy. In this section, we discuss the trade-off between model compression and accuracy loss for the SWM-based technique.

4.1. Quantization and Weight Reduction

Data quantization on weights and neurons is a commonly used method for model compression. We use low-bit fixed-point data to represent the neurons and weights instead of floating-point data. We design a bit-wise simulator in C++ to verify the total number of bits for both the integer and fractional parts. The structured weight matrix, as a low-rank representation, uses one or several block-circulant matrices to replace the original weight matrix, as discussed in Section 3. As shown in Fig. 2, by partitioning the original weight matrix into $p \times q$ blocks of square sub-matrices, the total number of weights is reduced from $pqb^2$ to $pqb$, where each block is a $b \times b$ matrix. We further investigate the SWM-based DNN models, including DCNNs and LSTMs, regarding the compression ratio (block size) and model accuracy.
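Both ideas in this subsection can be illustrated with a few lines of Python. This is a sketch only and not the C++ bit-wise simulator mentioned above; the function names, the round-to-nearest policy, and the saturating behavior are assumptions made for illustration.

```python
def quantize_fixed_point(value, int_bits, frac_bits):
    """Round to a signed fixed-point grid with saturation.

    The word is assumed to hold 1 sign bit, int_bits integer bits,
    and frac_bits fractional bits.
    """
    scale = 1 << frac_bits
    max_code = (1 << (int_bits + frac_bits)) - 1
    min_code = -(1 << (int_bits + frac_bits))
    code = round(value * scale)
    code = max(min_code, min(max_code, code))  # saturate instead of wrapping
    return code / scale


def weight_counts(m, n, b):
    """Parameter counts of a dense m x n matrix and its block-circulant form."""
    p, q = m // b, n // b
    return m * n, p * q * b


dense, compressed = weight_counts(1024, 1024, 64)
ratio = dense // compressed
```

For a 1024 x 1024 layer with block size 64, the count drops from 1,048,576 to 16,384 weights, so the reduction factor equals the block size; quantizing each stored weight to fewer bits compounds that saving.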

4.2. Accuracy Evaluation

4.2.1. Accuracy Evaluation on DCNNs

The weight storage (model size) reduction and the test accuracy on various image recognition datasets and DCNN models, i.e., MNIST (LeNet-5), CIFAR-10, SVHN, STL-10, and ImageNet (using the AlexNet structure) (Krizhevsky et al., 2012; Netzer et al., 2011; Krizhevsky and Hinton, 2009; Coates et al., 2010; Deng, 2012), are discussed in (Ding et al., 2017). 16-bit data quantization is adopted, and the baselines are the original DCNN models with unstructured weight matrices and 32-bit floating-point representations. The SWM-based compression technique enables a 400X-4000+X reduction in model size in the corresponding FC layers. At the same time, the accuracy is close to that of the original DCNN models, and the accuracy degradation is negligible. Another advantage of the SWM-based technique is that the storage of weight parameters after compression is regular, whereas reference works (Han et al., 2015) introduce irregularity in storing the weight parameters. This irregularity requires an extra index per weight parameter and therefore limits the available degree of parallelism.

4.2.2. Accuracy Evaluation on LSTM

We evaluate the structured-matrix-based compression technique using the TIMIT benchmark, the most commonly used dataset for automatic speech recognition (ASR) applications. The LSTM network is built by stacking multiple LSTM layers. The Google LSTM model (Sak et al., 2014) with unstructured weight matrices is selected as the baseline model. We preprocess the TIMIT audio data using an FFT-based filterbank, as discussed in (Wang et al., 2018b; Li et al., 2018). The input speech data have the same number of features and the same architecture as ESE (Han et al., 2017). Phone Error Rate (PER) is adopted to evaluate the model prediction accuracy.

The block-circulant-matrix-based LSTM model enables comprehensive tuning of the model compression ratio by varying the block size. The PER is close to that of the baseline LSTM when the block size is 2 using the SWM-based compression technique. For the SWM-based LSTM models with block sizes of 8 and 16, 7.6X and 14.6X model size reductions can be achieved compared with the baseline LSTM, respectively. Meanwhile, the computational complexity is reduced by 2.6X and 3.7X, while the PERs are only 0.32% and 1.23% higher than the baseline.


5.1. FPGA

5.1.1. Overall Architecture

Figure 3. Overall system architecture of the proposed SWM-based FPGA compression framework.

The overall SWM-based architecture is shown in Fig. 3. The host CPU is responsible for issuing workloads or instructions to the FPGA logic block and monitoring the working status. The FPGA logic part includes the computing unit (containing the basic computing block and the peripheral computing block), the control subsystem, the BRAM block, and, for certain designs in which the data loaded from external memory requires preprocessing, a preprocessing block. The memory hierarchy of the architecture primarily consists of three blocks: host memory, FPGA DDR, and on-chip block memory (BRAM). The control subsystem coordinates the actual FFT/IFFT operations in the basic computing block and the peripheral computing block. The control subsystem also determines the input size of the FFT/IFFT operations. The twiddle factors of the FFT/IFFT operations, including both real and imaginary parts, are stored in BRAM; the weights, e.g., the FFT results, are also stored in BRAM.

5.1.2. Computing Unit Designs

In the computing unit, the peripheral computing block mainly handles component-wise multiplication, activation (ReLU, Tanh, and Sigmoid), pooling, etc., which need lower computational cost and hardware footprint. The basic computing unit consists of an FFT operation with a given parallelization degree and depth. Fig. 4 shows an example of an 8-point FFT operation in the basic computing block using butterfly units. The IFFT operation can also be implemented using the same basic computing unit, with the addition of a division by the FFT size and two conjugations.

Figure 4. An example of 8-point basic computing block for FFT using butterfly units.
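The butterfly structure and the reuse of the FFT datapath for the IFFT can be sketched in software. This is a behavioral illustration under assumptions (a radix-2 decimation-in-time FFT with bit-reversed input ordering); it is not a description of the actual hardware datapath, and the function names are hypothetical. For 8 points there are 3 stages of 4 butterflies each.

```python
import cmath


def butterfly(a, b, twiddle):
    """One radix-2 butterfly: a single complex multiply, one add, one subtract."""
    t = twiddle * b
    return a + t, a - t


def fft_butterfly(x):
    """Iterative radix-2 FFT; len(x) must be a power of two (at least 2)."""
    n = len(x)
    bits = n.bit_length() - 1
    # Bit-reversal permutation puts the inputs in the order the stages expect.
    data = [complex(x[int(format(i, "0{}b".format(bits))[::-1], 2)]) for i in range(n)]
    size = 2
    while size <= n:                      # log2(n) stages
        half = size // 2
        for start in range(0, n, size):
            for k in range(half):         # n/2 butterflies per stage in total
                w = cmath.exp(-2j * cmath.pi * k / size)
                i, j = start + k, start + k + half
                data[i], data[j] = butterfly(data[i], data[j], w)
        size *= 2
    return data


def ifft_via_fft(X):
    """IFFT built from the forward FFT: conjugate, FFT, conjugate, divide by n."""
    n = len(X)
    return [v.conjugate() / n for v in fft_butterfly([v.conjugate() for v in X])]


x = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, -2.0, -3.0]
X = fft_butterfly(x)
x_back = [v.real for v in ifft_via_fft(X)]
```

Because `ifft_via_fft` calls the same `fft_butterfly` routine, only the two conjugations and the final scaling are extra, which is the property that lets a single FFT block serve both directions.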

5.2. ASIC

To deploy DNNs on mobile/IoT devices, the DNN applications should be implemented in ASICs because of their small hardware volume. The large reduction in both parameter size and computational time complexity makes our SWM-based method suitable for ASIC implementations. Figure 5 shows the architecture of our end-to-end ASIC implementation of SWM-based DNNs. The architecture consists of four main blocks: the input/output interface, the storage system, the processing system, and the global controller.

Figure 5. The architecture of the SWM-based chip.

The input/output interface is in charge of communication between the external environment of the chip and the on-chip storage system. The input interface is composed of an input IO buffer and an input distributor. Similarly, the output interface is composed of an output IO buffer and an output distributor. In terms of data flow, the input IO buffer first receives and buffers data, including input images, weights, and biases, from the external environment. The number of IO pads is usually limited to a small number, whereas the bandwidth of the processing system is rather large in order to achieve high parallelism of computation. This mismatch in bandwidth requires an input distributor to temporarily hold the external data until the size of the data reaches the bandwidth requirement of the storage system. Moreover, since there are three storage modules inside the storage system, which respectively store inputs/intermediate activations, weights, and biases, the global controller decides where the buffered data should flow. In a similar manner, the output distributor receives the final activations from the storage system and is controlled to distribute a portion of the activations into the output IO buffer, which then sends them back to the external system.
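The role of the input distributor can be sketched in software as a small packing buffer. This is a behavioral illustration only; the class name `InputDistributor` and the parameter `words_per_line` are hypothetical and do not describe the actual RTL.

```python
class InputDistributor:
    """Collects narrow IO words until one full storage-width line is available."""

    def __init__(self, words_per_line):
        self.words_per_line = words_per_line
        self._pending = []

    def push(self, word):
        """Buffer one IO word; return a full line when ready, otherwise None."""
        self._pending.append(word)
        if len(self._pending) == self.words_per_line:
            line, self._pending = self._pending, []
            return line
        return None


# Eight words arrive one at a time; the storage system accepts lines of four.
distributor = InputDistributor(words_per_line=4)
lines = [distributor.push(word) for word in range(8)]
```

Only the fourth and eighth pushes release a line, so the wide storage port is written once for every four transfers on the narrow IO side.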

As depicted in Figure 6, the storage system comprises three subsystems: a memory bank for storing weights, a register file for storing biases, and a ping-pong buffer (i.e., two alternating register files) for storing image inputs and intermediate activations.

Figure 6. The architecture of the storage system.
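The ping-pong buffer can be pictured as two banks whose read and write roles swap after every layer. The following is a minimal behavioral sketch; the class and function names are hypothetical, and the layers are represented by plain Python callables.

```python
class PingPongBuffer:
    """Two alternating banks: one is read by the current layer, the other is written."""

    def __init__(self, size):
        self._banks = [[0.0] * size, [0.0] * size]
        self._read = 0

    @property
    def read_bank(self):
        return self._banks[self._read]

    @property
    def write_bank(self):
        return self._banks[1 - self._read]

    def swap(self):
        self._read = 1 - self._read


def run_layers(inputs, layers):
    """Feed inputs through the layers, alternating banks between layers."""
    buf = PingPongBuffer(len(inputs))
    buf.write_bank[:] = inputs          # load the image into the write bank
    buf.swap()
    for layer in layers:
        buf.write_bank[:] = layer(buf.read_bank)  # activations go to the other bank
        buf.swap()
    return list(buf.read_bank)


result = run_layers([1.0, 2.0],
                    [lambda v: [2 * a for a in v], lambda v: [a + 1 for a in v]])
```

Because a layer never writes into the bank it is reading from, the activations of one layer stay intact until the next layer has consumed them, and no third copy is needed.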

The processing system evaluates the following equation for each layer: $a_i = \psi\left(\sum_{j} \mathrm{IFFT}\left(\mathrm{FFT}(w_{ij}) \circ \mathrm{FFT}(x_j)\right) + \theta_i\right)$, where $w_{ij}$ is the vector of weights at the $i$th row and $j$th column of the weight matrix, $x_j$ is the $j$th vector of inputs/activations, $\theta_i$ is the $i$th vector of biases, and $\psi$ is an activation function. According to the above equation, the processing system should contain the modules illustrated in Fig. 7. As the first step in the core computation, the image inputs are loaded from the storage system into the FFT module. Since the weights are used repeatedly without change, the weight memory bank stores the weights in the frequency domain. Thus the inputs of the multiply module are $\mathrm{FFT}(x_j)$ and $\mathrm{FFT}(w_{ij})$. Next, the IFFT module performs the inverse FFT over the element-wise product vector, converting it from the frequency domain to the time domain. Then the summation is performed by the Accumulator module, which generates the dot-products of inputs and weights. Finally, the Bias module adds the biases to the dot-products, and the Activation module produces a vector of activations.

Figure 7. The architecture of the processing system.
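The module sequence of the processing system (FFT, multiply, IFFT, accumulate, bias, activation) can be traced in a short behavioral sketch. The assumptions are stated plainly: a naive O(n^2) DFT stands in for the hardware FFT module, ReLU is assumed as the activation, each weight vector is taken to be the first column of its circulant block, and all names are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Naive DFT standing in for the FFT/IFFT modules (the inverse includes 1/n)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def process_layer(weight_bank_freq, x_blocks, bias_blocks):
    """weight_bank_freq[i][j] holds the stored frequency-domain weights FFT(w_ij)."""
    x_freq = [dft(x_j) for x_j in x_blocks]                        # FFT module
    activations = []
    for row_freq, theta_i in zip(weight_bank_freq, bias_blocks):
        acc = [0.0] * len(theta_i)
        for w_freq, xf in zip(row_freq, x_freq):
            prod = [a * b for a, b in zip(w_freq, xf)]             # Multiply module
            time_domain = dft(prod, inverse=True)                  # IFFT module
            acc = [s + v.real for s, v in zip(acc, time_domain)]   # Accumulator module
        biased = [s + t for s, t in zip(acc, theta_i)]             # Bias module
        activations.append([max(0.0, v) for v in biased])          # Activation (ReLU assumed)
    return activations


# One 2 x 2 circulant block with first column [1, 2], i.e. W = [[1, 2], [2, 1]].
weight_bank = [[dft([1.0, 2.0])]]        # transformed once, then reused for every input
out = process_layer(weight_bank, [[1.0, -1.0]], [[0.5, 0.5]])
```

For this input, Wx = [-1, 1]; adding the bias gives [-0.5, 1.5], and the activation clips the first entry to zero. The weight transform happens outside `process_layer`, mirroring the choice to keep frequency-domain weights in the memory bank.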

Another crucial module in the architecture is the global controller, which generates the control signals that guarantee the whole system functions correctly.

6. Evaluation

6.1. FPGA

DNN Name | Dataset | Platform | Data Quantization | Accuracy | Performance (kFPS) | Energy efficiency (kFPS/W)
Proposed MNIST 1 | MNIST | Cyclone V | 12 bits | 92.9% | |
Proposed MNIST 2 | MNIST | Cyclone V | 12 bits | 95.6% | |
Proposed MNIST 3 | MNIST | Cyclone V | 12 bits | 99.0% | | 659.5
Proposed SVHN | SVHN | Cyclone V | 12 bits | 96.2% | | 699.7
Proposed CIFAR-10 1 | CIFAR-10 | Cyclone V | 12 bits | 80.3% | | 2514
Proposed CIFAR-10 2 | CIFAR-10 | Cyclone V | 12 bits | 94.75% | | 25.4
TrueNorth (Esser et al., 2015) | MNIST | TrueNorth | 2 bits | 99%+ | 1.0 | 9.26
TrueNorth (Esser et al., 2015) | MNIST | TrueNorth | 2 bits | 95% | 1.0 | 250
TrueNorth (Esser et al., 2016) | SVHN | TrueNorth | 2 bits | 96.7% | 2.53 | 9.85
TrueNorth (Esser et al., 2016) | CIFAR-10 | TrueNorth | 2 bits | 83.4% | 1.25 | 6.11

LSTM Name | Dataset | Platform | Data Quantization | PER Degradation | Performance (kFPS) | Energy efficiency (kFPS/W)
Proposed LSTM1 | TIMIT | ADM-7V3 | 16 bits | 1.23% | 330.275 | 14.359
Proposed LSTM1 | TIMIT | KU060 | 16 bits | 1.23% | 371.095 | -
Proposed LSTM2 | TIMIT | ADM-7V3 | 16 bits | 0.32% | 179.687 | 8.168
Proposed LSTM2 | TIMIT | KU060 | 16 bits | 0.32% | 195.312 | -
ESE (Han et al., 2017) | TIMIT | KU060 | 12 bits | 0.30% | 17.544 | 0.428

Table 1. Comparison results on accuracy, performance, and energy efficiency of the proposed SWM-based FPGA designs and baselines.

We implement the proposed framework on small- to medium-scale DNNs using the MNIST, SVHN, and CIFAR-10 benchmarks on the low-power Intel (Altera) Cyclone V 5CEA9 FPGA. We implement the proposed LSTM method on the Xilinx KU060 and Alpha Data ADM-7V3 platforms. The ADM-7V3 board contains a Xilinx Virtex-7 (690t) FPGA and a 16GB DDR3 memory, and the Xilinx KU060 platform contains a Xilinx XCKU060 FPGA and two 4GB DDR3 memories. We connect the ADM-7V3 to the host through a PCI-e 3.0 X8 interface, and the host machine used in the experiment is a server configured with an Intel Core i7-4790 CPU. The proposed FPGA implementations of LSTMs operate at 200MHz on both platforms.

We compare the accuracy, performance (kFPS), and energy efficiency (kFPS/W) of the proposed SWM-based FPGA implementation with the state-of-the-art IBM TrueNorth neurosynaptic processor (Esser et al., 2015) for DCNNs, and with the state-of-the-art ESE accelerator on the Xilinx KU060 platform (Han et al., 2017) for LSTMs. We first present the results of three DNNs on the MNIST dataset targeting different accuracies, one DNN on the SVHN dataset, and two DNNs on the CIFAR-10 dataset targeting different accuracies. The first two MNIST DNNs are multi-layer perceptron (MLP) models that achieve accuracies of 92.9% and 95.6%, respectively. The third MNIST DNN has a CNN structure similar to LeNet-5 (LeCun et al., 1995) and achieves 99.0% accuracy. The first CIFAR-10 DNN has a simple structure, while the second CIFAR-10 DNN adopts a wide ResNet model (He et al., 2016) that achieves 94.75% accuracy. The baseline system (IBM TrueNorth) has two different MNIST DNNs at two accuracy levels. Experimental results show that, under similar accuracy constraints, the gains of the SWM-based framework in performance and energy efficiency are at least 152X and 72X, respectively. For the LSTM implementation, we propose two structures: (i) the proposed LSTM1 adopts a block size of 16 (FFT16), for which the relative PER degradation of the model is 1.23%; (ii) the proposed LSTM2 uses a block size of 8 (FFT8), for which the relative PER degradation of the model is 0.32%. On the KU060 platform, we achieve 21X and 11X performance speedup for the proposed LSTM1 and LSTM2 compared with ESE, respectively. On the ADM-7V3 platform, compared with ESE, we achieve 18.8X and 10.2X performance improvement and 33.5X and 19.1X energy efficiency gains using the proposed LSTM1 and LSTM2, respectively. Since the power consumption of the SWM-based LSTM is only about half that of ESE, the energy efficiency gain is higher than the performance gain. Please note that the manufacturing process of the XCKU060 FPGA is 20nm while that of the Virtex-7 is 28nm, which means the actual energy efficiency gain should be higher than reported here.
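The LSTM speedups quoted above follow directly from the rows of Table 1. The short check below recomputes them; the numbers are copied from the table, and the dictionary layout and variable names are only illustrative.

```python
# Performance (kFPS) and energy efficiency (kFPS/W) taken from Table 1.
ese = {"perf": 17.544, "eff": 0.428}
proposed = {
    ("LSTM1", "KU060"): {"perf": 371.095},
    ("LSTM2", "KU060"): {"perf": 195.312},
    ("LSTM1", "ADM-7V3"): {"perf": 330.275, "eff": 14.359},
    ("LSTM2", "ADM-7V3"): {"perf": 179.687, "eff": 8.168},
}

# Gain of each proposed design over ESE, metric by metric.
gains = {
    key: {metric: value / ese[metric] for metric, value in metrics.items()}
    for key, metrics in proposed.items()
}

for (model, platform), g in sorted(gains.items()):
    summary = ", ".join("{} {:.1f}X".format(m, v) for m, v in sorted(g.items()))
    print(model, platform, summary)
```

The ratios come out at roughly 21.2X and 11.1X in performance on the KU060, and 18.8X and 10.2X in performance with 33.5X and 19.1X in energy efficiency on the ADM-7V3, matching the figures stated in the text.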

6.2. ASIC

In this work, we implement an ASIC design of the SWM-based neural network for the image recognition task and test it with the MNIST dataset. The implemented neural network has the original structure of , and this network is transferred into an SWM-based structure. The FFT module implemented in this work is a 64-point FFT, that is, it takes a vector of 64 real-valued numbers as inputs and generates their frequency-domain representations. Consequently, the weight matrices have the structure of , where represents the weight matrix has rows and columns, and each element is a vector containing weights ( is 64 in this case). Our weight matrix transformation is not applied to the output layer, so the weights in this layer still keep the original structure of .

Our ASIC design is implemented with SMIC 40nm technology (including memories) and synthesized with Synopsys Design Compiler 2016. Table 2 shows the hardware performance of our design. As can be observed from the table, the SWM-based neural network exhibits impressive advantages in terms of power (), throughput (), and energy efficiency (), suggesting that this method is well suited for deploying DNNs on mobile/IoT devices.

Metrics Performance
Clock Frequency (MHz)
Area ()
Power ()
Throughput ()
Energy Efficiency ()
Table 2. Hardware Performance of SWM-Based Neural Network Implemented in ASIC

7. Conclusion

In this work, we propose and evaluate the SWM-based compression technique on both FPGA and ASIC implementations. The SWM-based framework adopts general block-circulant matrices to achieve a fine-grained tradeoff between accuracy and compression ratio; it works for both FC and CONV layers and is supported by a mathematically rigorous proof. For FPGA implementations, we achieve at least 152X and 72X improvement in performance and energy efficiency, respectively, using the SWM-based framework, compared with the baseline IBM TrueNorth processor under the same accuracy constraints on the MNIST, SVHN, and CIFAR-10 data sets. For the LSTM network, the proposed SWM-based LSTM achieves up to 21X improvement in performance and 33.5X improvement in energy efficiency compared with ESE. For ASIC implementations, the proposed SWM-based design exhibits impressive advantages in terms of power, throughput, and energy efficiency. Experimental results indicate that this method is well suited for deploying DNNs on both FPGAs and mobile/IoT devices.

8. Acknowledgement

This work is funded by the National Science Foundation Awards CNS-1650469, CCF-1733701, CNS-1704662, CCF-1657333, CNS-1739748, and CCF-1733834.


  • Chen et al. (2014) Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. 2014. Dadiannao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 609–622.
  • Chen et al. (2017) Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. 2017. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits 52, 1 (2017), 127–138.
  • Coates et al. (2010) Adam Coates, Honglak Lee, and Andrew Y Ng. 2010. An analysis of single-layer networks in unsupervised feature learning. Ann Arbor 1001, 48109 (2010), 2.
  • Deng (2012) Li Deng. 2012. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine 29, 6 (2012), 141–142.
  • Ding et al. (2017) Caiwen Ding, Siyu Liao, Yanzhi Wang, Zhe Li, Ning Liu, Youwei Zhuo, Chao Wang, Xuehai Qian, Yu Bai, Geng Yuan, Xiaolong Ma, et al. 2017. CirCNN: accelerating and compressing deep neural networks using block-circulant weight matrices. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 395–408.
  • Esser et al. (2015) Steve K Esser, Rathinakumar Appuswamy, Paul Merolla, John V Arthur, and Dharmendra S Modha. 2015. Backpropagation for energy-efficient neuromorphic computing. In Advances in Neural Information Processing Systems. 1117–1125.
  • Esser et al. (2016) Steven K Esser, Paul A Merolla, John V Arthur, Andrew S Cassidy, Rathinakumar Appuswamy, Alexander Andreopoulos, David J Berg, Jeffrey L McKinstry, Timothy Melano, Davis R Barch, et al. 2016. Convolutional networks for fast, energy-efficient neuromorphic computing. Proceedings of the National Academy of Sciences (PNAS) (2016), 201604850.
  • Gokhale et al. (2014) Vinayak Gokhale, Jonghoon Jin, Aysegul Dundar, Berin Martini, and Eugenio Culurciello. 2014. A 240 g-ops/s mobile coprocessor for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 682–687.
  • Han et al. (2017) Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, et al. 2017. ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA.. In FPGA. 75–84.
  • Han et al. (2016) Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and William J Dally. 2016. EIE: efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture. IEEE Press, 243–254.
  • Han et al. (2015) Song Han, Huizi Mao, and William J Dally. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 (2015).
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
  • Hinton et al. (2012) Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29, 6 (2012), 82–97.
  • Huval et al. (2015) Brody Huval, Tao Wang, Sameep Tandon, Jeff Kiske, Will Song, Joel Pazhayampallil, Mykhaylo Andriluka, Pranav Rajpurkar, Toki Migimatsu, Royce Cheng-Yue, et al. 2015. An empirical evaluation of deep learning on highway driving. arXiv preprint arXiv:1504.01716 (2015).
  • Jouppi et al. (2017) Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. 2017. In-datacenter performance analysis of a tensor processing unit. arXiv preprint arXiv:1704.04760 (published in ACM ISCA 2017) (2017).
  • Karpathy et al. (2014) Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 1725–1732.
  • Krizhevsky and Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images. (2009).
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
  • LeCun et al. (1995) Yann LeCun, LD Jackel, Leon Bottou, A Brunot, Corinna Cortes, JS Denker, Harris Drucker, I Guyon, UA Muller, Eduard Sackinger, et al. 1995. Comparison of learning algorithms for handwritten digit recognition. In ICANN, Vol. 60. Perth, Australia, 53–60.
  • Lee et al. (2009) Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. 2009. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th annual international conference on machine learning. ACM, 609–616.
  • Li et al. (2017a) Ji Li, Ao Ren, Zhe Li, Caiwen Ding, Bo Yuan, Qinru Qiu, and Yanzhi Wang. 2017a. Towards acceleration of deep convolutional neural networks using stochastic computing. In The 22nd Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE.
  • Li et al. (2017c) Ji Li, Zihao Yuan, Zhe Li, Caiwen Ding, Ao Ren, Qinru Qiu, Jeffrey Draper, and Yanzhi Wang. 2017c. Hardware-driven nonlinear activation for stochastic computing based deep convolutional neural networks. In Neural Networks (IJCNN), 2017 International Joint Conference on. IEEE, 1230–1236.
  • Li et al. (2017d) Ji Li, Zihao Yuan, Zhe Li, Ao Ren, Caiwen Ding, Jeffrey Draper, Shahin Nazarian, Qinru Qiu, Bo Yuan, and Yanzhi Wang. 2017d. Normalization and dropout for stochastic computing-based deep convolutional neural networks. Integration, the VLSI Journal (2017).
  • Li et al. (2016) Zhe Li, Ao Ren, Ji Li, Qinru Qiu, Yanzhi Wang, and Bo Yuan. 2016. DSCNN: Hardware-oriented optimization for stochastic computing based deep convolutional neural networks. In Computer Design (ICCD), 2016 IEEE 34th International Conference on. IEEE, 678–681.
  • Li et al. (2017b) Zhe Li, Ao Ren, Ji Li, Qinru Qiu, Bo Yuan, Jeffrey Draper, and Yanzhi Wang. 2017b. Structural design optimization for deep convolutional neural networks using stochastic computing. In 2017 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 250–253.
  • Li et al. (2018) Zhe Li, Shuo Wang, Caiwen Ding, Qinru Qiu, Yanzhi Wang, and Yun Liang. 2018. Efficient Recurrent Neural Networks using Structured Matrices in FPGAs. arXiv preprint arXiv:1803.07661 (2018).
  • Lin et al. (2017) Sheng Lin, Ning Liu, Mahdi Nazemi, Hongjia Li, Caiwen Ding, Yanzhi Wang, and Massoud Pedram. 2017. FFT-Based Deep Learning Deployment in Embedded Systems. arXiv preprint arXiv:1712.04910 (2017).
  • Makantasis et al. (2015) Konstantinos Makantasis, Konstantinos Karantzalos, Anastasios Doulamis, and Nikolaos Doulamis. 2015. Deep supervised learning for hyperspectral data classification through convolutional neural networks. In Geoscience and Remote Sensing Symposium (IGARSS), 2015 IEEE International. IEEE, 4959–4962.
  • Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. 2011. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, Vol. 2011. 5.
  • Pan (2012) Victor Pan. 2012. Structured matrices and polynomials: unified superfast algorithms. Springer Science & Business Media.
  • Qiu et al. (2016) Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, et al. 2016. Going deeper with embedded FPGA platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 26–35.
  • Ren et al. (2017) Ao Ren, Zhe Li, Caiwen Ding, Qinru Qiu, Yanzhi Wang, Ji Li, Xuehai Qian, and Bo Yuan. 2017. SC-DCNN: Highly-scalable deep convolutional neural network using stochastic computing. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 405–418.
  • Ren et al. (2016) A. Ren, Z. Li, Y. Wang, Q. Qiu, and B. Yuan. 2016. Designing Reconfigurable Large-Scale Deep Learning Systems Using Stochastic Computing. Proc. of ICRC (2016).
  • Sak et al. (2014) Haşim Sak, Andrew Senior, and Françoise Beaufays. 2014. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Fifteenth annual conference of the international speech communication association.
  • Schmidhuber (2015) Jürgen Schmidhuber. 2015. Deep learning in neural networks: An overview. Neural networks 61 (2015), 85–117.
  • Tai et al. (2015) Cheng Tai, Tong Xiao, Yi Zhang, Xiaogang Wang, et al. 2015. Convolutional neural networks with low-rank regularization. arXiv preprint arXiv:1511.06067 (2015).
  • Wang et al. (2018b) Shuo Wang, Zhe Li, Caiwen Ding, Bo Yuan, Qinru Qiu, Yanzhi Wang, and Yun Liang. 2018b. C-LSTM: Enabling Efficient LSTM using Structured Compression Techniques on FPGAs. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). ACM.
  • Wang et al. (2018a) Yanzhi Wang, Caiwen Ding, Geng Yuan, Siyu Liao, Zhe Li, Xiaolong Ma, Bo Yuan, Xuehai Qian, Jian Tang, Qinru Qiu, and Xue Lin. 2018a. Towards ultra-high performance and energy efficiency of deep learning systems: an algorithm-hardware co-optimization framework. In AAAI Conference on Artificial Intelligence (AAAI-18). AAAI.
  • Yu et al. (2017) Jiecao Yu, Andrew Lukefahr, David Palframan, Ganesh Dasika, Reetuparna Das, and Scott Mahlke. 2017. Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism. In Proceedings of the 44th Annual International Symposium on Computer Architecture. ACM, 548–560.
  • Yuan et al. (2017) Zihao Yuan, Ji Li, Zhe Li, Caiwen Ding, Ao Ren, Bo Yuan, Qinru Qiu, Jeffrey Draper, and Yanzhi Wang. 2017. Softmax Regression Design for Stochastic Computing Based Deep Convolutional Neural Networks. In Proceedings of the Great Lakes Symposium on VLSI. ACM, 467–470.
  • Zhao et al. (2017) Liang Zhao, Siyu Liao, Yanzhi Wang, Zhe Li, Jian Tang, and Bo Yuan. 2017. Theoretical Properties for Neural Networks with Weight Matrices of Low Displacement Rank. In International Conference on Machine Learning. 4082–4090.