Due to the growing demand for DNN performance on different tasks, today's DNN models have relatively large parameter sizes. For example, ResNet-18 for image classification has a model size of 45 MB; YOLOv5l from the YOLOv5 family for object detection has a model size of 245 MB; BERT for NLP has 23M embedding weights and 317M neural network weights. For a shallow DNN model, all the weights and intermediate activations can be stored in the FPGA on-chip memory (BRAM and URAM). However, the weights of large DNN models are hard to fit in the FPGA on-chip memory, so external memory must be used to store the weights and activations. The external memory accesses for weights and activations then stall the system's performance.
Using the Xilinx Alveo U200 accelerator card as an illustration, we often need to access off-chip memory since the on-chip memory capacity is only 35 MB. However, off-chip memory is inherently slower to access than on-chip memory in two aspects: ① the off-chip memory (DDR4) total bandwidth is 77 GB/s, while the total on-chip memory bandwidth is 31 TB/s (about 400× higher than off-chip memory); ② the off-chip memory (DDR4) has around 100 ns access latency, while the on-chip memory can be accessed at 500 MHz with only one cycle of latency (about 50× lower latency than off-chip memory). Furthermore, DRAM usually has more than 100× higher power consumption than the FPGA on-chip SRAM. The whole system performance is therefore constrained by off-chip memory access.
In order to fit the model into the FPGA platform's on-chip memory, model compression techniques can be used. These techniques reduce the model size and accelerate DNN inference with an acceptable accuracy loss. Model compression techniques can be divided into two types: weight pruning and weight quantization. We leverage both techniques in this paper to achieve a higher model compression rate.
In this paper, we focus on the acceleration of BCNN based NIN-Net and ResNet models. We further propose a hardware design to achieve high parallelism and high throughput on the FPGA platform. We implement the proposed techniques on the Alveo U280 hardware platform to compare latency and throughput. Experimental results show that the FPGA hardware design achieves 5882 frames/s and 6154 frames/s for the BCNN based NIN-Net and ResNet models, respectively. Our contributions are summarized as follows:
- The basic framework and building blocks for BCNN are given. For the pooling layer, spectral pooling, average pooling, and max pooling are compared in terms of model performance.
- The Surrogate Lagrangian Relaxation (SLR) weight pruning technique is adopted for BCNN weight pruning, and the straight-through estimator (STE) is adopted for BCNN weight quantization; the whole model compression framework achieves a high compression ratio with an acceptable accuracy loss.
- A binarized complex convolution kernel design is proposed to enable a high level of hardware parallelism and a low pipeline initiation interval.
- The hardware resource scheduling for the BCNN model implementation on FPGA is discussed, and a high overall hardware throughput is achieved.
The organization of the work is as follows. Section II gives the basics of DNN model compression and the BCNN model. Section III discusses the model structure, SLR-based compression, and STE-based binary quantization. Section IV presents the hardware design for the BCNN based models. Section V covers the BCNN model training and the hardware implementation results. Section VI concludes the hardware design and experiments.
II Related Work
In this section, we will briefly discuss the current works on DNN model compression techniques, BNNs, and complex neural networks.
II-A Model Compression
In order to reduce DNN model size and inference latency, model compression techniques can be adopted. The main challenge for SOTA model compression techniques is maintaining the model's accuracy while improving inference speed and throughput on hardware platforms. There are two types of model compression techniques: weight pruning and weight quantization.
Two types of pruning methods are widely used: unstructured pruning and structured pruning. The unstructured pruning technique [6, 18, 30] is easier to implement in software and has a high compression ratio with low accuracy loss. However, due to its irregular memory access pattern, unstructured pruning can hardly be accelerated on most hardware platforms. The structured pruning technique [6, 18, 30] constrains the weight matrix to be pruned in a structured and hardware-friendly pattern. For example, block-circulant matrices [12, 24, 21] can be used for weight representation after pruning. Structured pruning-based hardware implementations achieve better performance due to the higher parallelism enabled by regular memory access patterns and the reduced computation burden.
Another source of redundancy in a DNN model is the bit representation of the weights. The DNN baseline model usually adopts a float32 representation for the weight values. In order to compress the bit representation of the data, various works [23, 29] have proposed different DNN model quantization techniques, including fixed bit-length, ternary, and binary weight representations. The truncated bit representation reduces the DNN model size, the computation burden on the hardware platform, and the memory bandwidth consumption. Fixed bit-length representations of DNN model parameters can be further classified into equal-distance quantization and power-of-two quantization. Equal-distance quantization is similar to fixed-point number representation, and it reduces hardware resource utilization while maintaining high accuracy. Power-of-two quantization further improves hardware efficiency owing to bit shift-based multiplication. However, the unequally distributed scale of power-of-two quantization leads to a non-negligible DNN model accuracy degradation. In order to improve the model accuracy while maintaining hardware efficiency, mixed powers-of-two-based quantization is proposed. The mixed powers-of-two-based quantization combines a primary powers-of-two part with a secondary powers-of-two part, and the multiplier requires only a 2-bit shifter and one adder.
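The idea behind power-of-two quantization can be illustrated with a minimal Python sketch: each weight is snapped to a signed power of two, so multiplication in hardware becomes a bit shift. The exponent range here is a hypothetical choice for illustration, not taken from the cited works.

```python
import numpy as np

def pow2_quantize(x, min_exp=-6, max_exp=0):
    """Snap each weight to sign(x) * 2^k, with k the nearest exponent
    (in log scale) inside [min_exp, max_exp]. Multiplying an activation
    by such a weight reduces to a bit shift in hardware."""
    sign = np.sign(x)
    # Clip magnitudes into the representable range before rounding
    mag = np.clip(np.abs(x), 2.0 ** min_exp, 2.0 ** max_exp)
    k = np.clip(np.round(np.log2(mag)), min_exp, max_exp)
    return sign * (2.0 ** k)
```

Equal-distance quantization would round to a uniform grid instead; the power-of-two grid is denser near zero, which is the source of the accuracy degradation mentioned above.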
II-B Binary Complex Neural Networks
The concept of the binary neural network (BNN) originated from the binary weight neural network (BWNN), in which only the bit representation of the weight values is quantized to binary. However, for FPGA devices with small on-chip memory, the intermediate activations of a BWNN are still too large to be stored in the on-chip SRAM, and external memory is required. Later works [16, 9] proposed BNNs and researched the quantization of both activations and weights into a binary representation. Those works illustrated a few key concepts for maintaining BNN model accuracy: the straight-through estimator (STE) for gradient descent, batch normalization after binarized convolution, and keeping full precision for both activations and weights at the first and last layers. By quantizing both activations and weights, multiplication degrades into a binary XOR operation and is highly hardware-friendly for FPGA and GPU platforms. A simple popcnt(xor()) function can be used for binarized convolution or fully connected layers.
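The popcnt(xor()) trick can be sketched in a few lines of Python. The packing convention (bit 1 for +1, bit 0 for −1) is one common choice, assumed here: for two n-bit packed vectors, XOR marks the positions where the signs differ, so the dot product is n minus twice that count.

```python
def pack_bits(values):
    """Pack a list of {-1, +1} values into an integer bit mask
    (bit set -> +1, bit clear -> -1)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word

def binarized_dot(a_word, b_word, n):
    """Dot product of two n-element {-1, +1} vectors from packed words:
    dot = n - 2 * popcount(a XOR b)."""
    diff = bin(a_word ^ b_word).count("1")  # positions where signs differ
    return n - 2 * diff
```

On FPGA the same computation maps to one XOR gate per bit followed by a popcount tree, which is why binarized layers consume so few LUT resources.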
In order to enhance the model's information representation with the same or even smaller parameter size, deep complex neural networks are proposed. The complex neural network has dedicated complex versions of the basic building blocks: convolution, batch normalization, weight initialization strategy, etc. Deep complex neural networks achieve comparable performance to ordinary DNNs with half the model parameters.
The binary complex neural network (BCNN) combines the benefits of both binary neural networks and complex neural networks. The activations and weights of the deep complex neural network are quantized to one bit except for the first and last layers. In order to reduce the computation overhead and improve the model accuracy, Li et al. also proposed a few new concepts: quadrant binarization for forward and backward propagation, complex Gaussian batch normalization (CGBN), and binary complex weight initialization. Those concepts will be discussed in detail in Section III.
II-C Basics of Surrogate Lagrangian Relaxation (SLR)
The surrogate Lagrangian relaxation (SLR) method is an optimization algorithm similar to the alternating direction method of multipliers (ADMM), which breaks an optimization problem into several smaller subproblems that can be solved iteratively. However, SLR also overcomes major convergence difficulties of standard Lagrangian relaxation. As the solutions of the decomposed subproblems are coordinated by updating the Lagrangian multipliers, convergence can be proved when the solutions satisfy the "surrogate" optimality condition with a novel step-sizing formula.
Gurevin et al. was the first work implementing an SLR-based framework for DNN weight pruning. Compared with the ADMM-based weight pruning method at the same compression rate, SLR achieves higher accuracy on the classification task with VGG-16 on the CIFAR-10 dataset, and higher accuracy after hard pruning than ADMM across the YOLO frameworks. Also, when hard-pruning accuracy is checked periodically during training, SLR converges faster and reaches the desired accuracy sooner, both for classification on CIFAR-10 and ImageNet and for object detection on the COCO 2014 benchmark. Experiments also show that the SLR-based weight-pruning optimization approach achieves high accuracy even at the hard-pruning stage. This retrain-free property reduces the traditional three-stage pruning pipeline to two stages and reduces the budget of retraining epochs.
III Training and Compression on BCNN
In this section, we will discuss the BCNN model in detail, which includes model structure, fundamental building blocks and operations for BCNN, weight pruning using SLR, and weight quantization based on quadrant binarization and STE.
III-A Structure of BCNN
The comparison between the BCNN and the original convolutional neural network (CNN) structure is demonstrated in Fig. 1. The mathematical equation for the convolution and bias-add operation can be found in Eq. 1. As shown in Fig. 1(a), the original CNN is composed of a convolution layer, bias addition, a non-linear layer, and a pooling layer. The BCNN structure differs from the original CNN, as shown in Fig. 1(b): the pooling layer and batch normalization layer come after the convolution layer, and the bias can be removed from the network to reduce the computation overhead without accuracy loss. For the BCNN, batch normalization is a mandatory operation for model convergence.
For an image with three channels (RGB) as input, the initial input contains only the real part. In order to generate the complex input, a two-layer residual CNN is designed to learn the imaginary part. The network for generating the complex input is shown in Fig. 2.
III-B Building Blocks and Operations
The basic building blocks of BCNN are slightly different from that of ordinary DNN. The complex version of the convolution layer, pooling layer, batch normalization, and binarize function will be discussed in this section.
III-B1 Complex Convolution
A binary complex number can be defined as x = a + bi, where a and b belong to {−1, +1}. A single binary complex number is represented by 2 bits and has twice the memory occupation of an ordinary binary number. Assume the input activation is x = x_r + x_i·i and the weight is w = w_r + w_i·i. The dot product between the input activation and the weight can be denoted in equation format (Eq. 2) or matrix format (Eq. 3). The mathematical expression of the complex convolution operation can be deduced from Eq. 1 and Eq. 3. The binary complex convolution operation can be divided into bit-wise XOR operations, popcnt operations, and fixed-point additions/subtractions. Compared to a full precision convolution operation on the FPGA platform, only a few LUT resources are needed to compute the binary complex convolution, and the memory bandwidth requirement is reduced by 32×.
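The decomposition can be made concrete with a minimal sketch of the standard complex product (a_r + a_i·i)(w_r + w_i·i) = (a_r·w_r − a_i·w_i) + (a_r·w_i + a_i·w_r)·i. With all operands in {−1, +1}, each of the four scalar products is a sign agreement (an XNOR in hardware), and the remaining work is fixed-point addition/subtraction.

```python
def bcomplex_mul(ar, ai, wr, wi):
    """Product of two binary complex numbers; each part is in {-1, +1}.
    The four scalar products are sign agreements (XNOR in hardware);
    the combination needs only fixed-point add/subtract."""
    real = ar * wr - ai * wi
    imag = ar * wi + ai * wr
    return real, imag

def bcomplex_dot(acts, weights):
    """Accumulate binary complex products over a channel dimension,
    as a binarized complex convolution does at each kernel position."""
    re = im = 0
    for (ar, ai), (wr, wi) in zip(acts, weights):
        r, i = bcomplex_mul(ar, ai, wr, wi)
        re += r
        im += i
    return re, im
```

In the hardware design, the four sign-agreement terms are evaluated with XOR + popcount over packed channel words rather than one element at a time.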
III-B2 Pooling Layer
The pooling layer is optional in a convolutional network. For a deep CNN, pooling layers are used at specific layers to downsample the activations and reduce their sizes. There are two widely used pooling layers in the original CNN: max pooling and average pooling. For BCNN, the activations are represented as complex numbers, enabling another type of pooling method with good preservation of information: spectral pooling.
Spectral pooling conducts a fast Fourier transform (FFT) over the 2D dimensions of the activations and truncates the high-frequency components, leaving the central low-frequency spectrum. An inverse FFT (IFFT) then transforms the spectral information back to the spatial domain.
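The steps above can be sketched with NumPy. The rescaling factor is an illustrative choice that preserves the average intensity after the size reduction; the referenced work's exact normalization may differ.

```python
import numpy as np

def spectral_pool(x, out_h, out_w):
    """Sketch of spectral pooling for a single 2D channel:
    FFT, crop the centered low-frequency window, inverse FFT."""
    f = np.fft.fftshift(np.fft.fft2(x))          # move DC to the center
    h, w = x.shape
    top, left = (h - out_h) // 2, (w - out_w) // 2
    crop = f[top:top + out_h, left:left + out_w]  # keep low frequencies
    out = np.fft.ifft2(np.fft.ifftshift(crop))
    # Rescale so the average intensity survives the size reduction
    return np.real(out) * (out_h * out_w) / (h * w)
```

A constant image is preserved exactly by this sketch, since all its energy sits in the retained DC bin.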
A comparison of different pooling methods is shown in Fig. 3. The pictures are obtained from the ImageNet dataset. For spectral pooling, the pixel values are scaled back to the displayable range to ensure visibility. As shown in the figure, spectral pooling and average pooling preserve more spatial information than max pooling. Max pooling acts more like a 2D whitening function and performs poorly for images with higher brightness.
As for computation complexity, the FFT of spectral pooling has O(N log N) complexity for an activation map with N elements, while average pooling and max pooling both have O(N) complexity. In order to maintain a good trade-off between model accuracy and computation complexity, average pooling is chosen for the hardware design.
III-B3 Batch Normalization
The complex version of batch normalization (CGBN) is an essential operation that helps the BCNN converge. The original form of batch normalization can be found in Eq. 4: y = γ(x − μ)/sqrt(σ² + ε) + β, where x denotes the input activation, μ denotes the mean of the activations in the mini-batch, σ² denotes the variance of the activations in the mini-batch, and ε denotes a small value added to avoid dividing by zero. γ and β are scaling factors that can be learned during the training step. The activation output is represented as y.
For the complex neural network, the original form of complex batch normalization can be found in Eq. 5: the activation is whitened as x̃ = V^(−1/2)(x − μ) and then transformed as y = Γx̃ + β, where V is the 2×2 covariance matrix of the real and imaginary parts and μ is the mean of the complex activations within the mini-batch. Γ is a 2×2 matrix and β is a complex number. Both Γ and β can be learned during backpropagation.
However, as discussed in Li et al., the original form of complex batch normalization requires too much computation, and the CGBN concept is proposed. The mathematical equation for CGBN can be found in Eq. 6. Both γ and β are complex-valued scaling factors and can be learned. CGBN for BCNN has higher accuracy and lower computation complexity, so CGBN is used for both the software and hardware implementations.
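A hedged sketch of CGBN follows, under the assumption that it replaces the 2×2 covariance whitening of full complex batch normalization with a single scalar variance of the complex deviation; the exact formulation is given by Eq. 6 in the cited work, and this simplified form is our reading of it.

```python
import numpy as np

def cgbn(x, gamma, beta, eps=1e-5):
    """Assumed CGBN sketch: normalize a mini-batch of complex activations
    by one scalar variance (circular-Gaussian assumption) instead of a
    2x2 covariance matrix, then scale/shift by complex gamma and beta."""
    mu = x.mean()
    var = np.mean(np.abs(x - mu) ** 2)   # scalar variance of |x - mu|
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```

Compared with full complex batch normalization, this removes the matrix inverse square root per channel, which is the computation saving the paper attributes to CGBN.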
III-B4 Binarization
There are two widely used types of binarization: deterministic binarization and stochastic binarization. The equation for deterministic binarization is given in Eq. 7: an activation is binarized to +1 or −1 according to its sign. The equation for stochastic binarization is given in Eq. 8: the activation is binarized to +1 with probability σ(x) and to −1 otherwise, where σ is a hard clipping function satisfying σ(x) = clip((x + 1)/2, 0, 1).
Stochastic binarization achieves better model accuracy than deterministic binarization. However, implementing stochastic binarization requires a random number generator and is expensive in hardware. So deterministic binarization is used for both the software and hardware experiments.
For a complex number with real and imaginary parts, quadrant binarization is proposed to conduct the binarization and will be used in this paper. The concept of quadrant binarization is simple and straightforward: the real and imaginary parts are binarized individually during forward propagation.
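Quadrant binarization can be sketched directly; mapping zero to +1 is a common tie-breaking convention, assumed here.

```python
import numpy as np

def quadrant_binarize(x):
    """Quadrant binarization: apply sign() independently to the real and
    imaginary parts, mapping each complex value to one of the four
    quadrant corners {+1+1j, +1-1j, -1+1j, -1-1j}. Zero maps to +1."""
    sign = lambda v: np.where(v >= 0, 1.0, -1.0)
    return sign(x.real) + 1j * sign(x.imag)
```

Each output value thus records only which quadrant of the complex plane the activation fell into, which is exactly the 2-bit representation used by the binarized layers.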
III-C Channel-wise Weight Pruning Using SLR
Consider an N-layer DNN indexed by l = 1, …, N. The collection of weights at the l-th convolutional layer is denoted by W_l, the collection of corresponding biases by b_l, and the loss function by f({W_l}, {b_l}). The objective of channel-wise weight pruning is to minimize the loss function subject to the constraint that the number of nonzero channels of the weight in each layer is less than or equal to a predefined number. This can be formulated as Eq. 9, where the constraint requires that the number of zero channels in W_l is at least a predefined per-layer hyper-parameter. This can further be equivalently rewritten in an unconstrained form as Eq. 10.
The problem cannot be solved analytically alone or using stochastic gradient descent alone. In this case, duplicate variables Z_l are introduced as in Eq. 12.
To solve the problem, SLR leverages augmented Lagrangian multipliers and penalizes constraint violations using quadratic penalties, which can be written as Eq. 13.
where Λ_l are the Lagrangian multipliers (dual variables) corresponding to the constraints, with the same dimension as W_l. The scalar ρ is a positive penalty coefficient, and ‖·‖_F denotes the Frobenius norm. The problem can be decomposed into two subproblems that are solved iteratively until convergence.
Step 1: Solve the "loss function" subproblem for {W_l, b_l} using stochastic gradient descent. At iteration k, this subproblem minimizes the Lagrangian function with respect to {W_l, b_l} while keeping Z_l at the previously obtained values, for the given values of the multipliers Λ_l. It can be solved using stochastic gradient descent (SGD). The "surrogate" optimality condition (Eq. 14) needs to be satisfied to ensure the multipliers update in the right direction.
Otherwise, previous stepsizes and multipliers are kept.
Step 2: Solve the channel-wise pruning subproblem for Z_l through projection onto the discrete subspace. This subproblem minimizes the quadratic penalty term plus the indicator function of the channel-sparsity constraint. Its global optimum can be solved analytically as a projection: for the weight tensor in each convolutional layer, the Frobenius norm of each channel is calculated; channels with larger norms are kept and channels with smaller norms are set to zero, so that the number of nonzero channels is less than or equal to the predefined limit. The second "surrogate" optimality condition (Eq. 16) needs to be satisfied to ensure the multipliers update in the right direction.
As in the first step, the previous stepsizes and multipliers are kept if the condition is not satisfied. In both steps, the stepsize-setting parameters are formalized as Eq. 18, where the parameters are predefined hyper-parameters.
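The projection in Step 2 can be sketched as follows; the channel axis and tensor layout are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def project_channels(w, num_keep):
    """Sketch of the channel-wise pruning projection: keep the num_keep
    channels with the largest Frobenius norm, zero out the rest.
    w is assumed to have shape (channels, ...), with pruning along axis 0."""
    norms = np.sqrt((w ** 2).reshape(w.shape[0], -1).sum(axis=1))
    keep = np.argsort(norms)[-num_keep:]   # indices of the strongest channels
    z = np.zeros_like(w)
    z[keep] = w[keep]
    return z
```

Because the projection is exact, the Z_l update is closed-form, and only the SGD step of Step 1 requires iterative optimization.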
III-D Weight Quantization
The pruned model from Section III-C is used for further weight quantization. Binarization is conducted for both activations and weights in the binarized convolution layers, and the deterministic binarization function is used. The binarization function (Eq. 7) is non-differentiable at 0, so direct back-propagation is not feasible for the weight quantization training. The straight-through estimator (STE) was proposed in previous literature [8, 16] for back-propagation. A complex version of STE has also been proposed, and the equation can be found in Eq. 19.
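A minimal STE sketch for the real-valued case follows; the clipping window of [−1, 1] is the common BNN convention, assumed here, and the complex version applies the same rule to the real and imaginary parts separately.

```python
import numpy as np

def ste_forward(x):
    """Forward pass: deterministic sign binarization (zero maps to +1)."""
    return np.where(x >= 0, 1.0, -1.0)

def ste_backward(grad_out, x, clip=1.0):
    """Backward pass: sign() has zero gradient almost everywhere, so the
    STE passes grad_out through unchanged where |x| <= clip and zeroes
    it elsewhere (the saturation region)."""
    return grad_out * (np.abs(x) <= clip)
```

The clipping keeps gradients from flowing into weights that are already saturated, which stabilizes binarized training.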
IV FPGA Hardware Architecture
FPGA is one of the most popular hardware platforms for DNN applications. The FPGA platform features a reconfigurable structure and a high level of parallelism for hardware design. With the growing size of DNN models, the weight matrices and activations are too large to be stored in the FPGA on-chip memory. However, the aforementioned pruning and weight quantization techniques compress both the activation and weight representations, making it possible for the FPGA platform to store all intermediate results in on-chip memory. In this section, the hardware design is conducted with Vivado HLS 2020.1, and we present our FPGA hardware structure for the BCNN model.
IV-A Overall FPGA Structure
Our overall FPGA structure for BCNN is shown in Fig. 4. We present two BCNN model designs in this section: a network in network (NIN) based BCNN model and a ResNet-18 based BCNN model. Both models are composed of three major layers: the complex input generation layer (Fig. 2), the full precision complex convolutional layer, and the binarized complex convolutional layer. A fully connected (FC) layer is used at the end to generate the prediction output.
For the ResNet-18 network, there are two types of residual blocks, and both are binarized blocks in the BCNN model. Residual block 1 is shown in Fig. 5: the input is passed through two binarized complex convolutional layers and is added to the original input to get the final output. Residual block 2 is shown in Fig. 6: one path has two binarized complex convolutional layers and the other path has only one binarized complex convolutional layer; the outputs of the two paths are then added together to generate the final output.
IV-B Hardware Design Details
There are several major building blocks for the hardware design: full precision complex convolutional layer, batch normalization, and pooling layer; binarized complex convolutional layer, batch normalization, and pooling layer. The activation functions used in the models are simple ReLU and Hardtanh functions which have very low computation costs.
The binarized complex convolution computation follows Eq. 1, Eq. 3, and Eq. 7. The weight matrices for the real and imaginary parts can be concatenated to form the final weight matrix input, which directly serves as the general weight matrix input for the convolution. Example HLS code for the binarized convolution is given in Fig. 7, and the hardware mapping is shown in Fig. 8. In the example design, the stride is 1, and the numbers of input and output complex channels are both 128. For both the input and output channels, the first 128 channels are the real parts and the last 128 channels are the imaginary parts. The output-channel and input-channel loops are chosen for two levels of parallelism. The sparse channel pruning for the convolutional layer can be conducted during the binarization operation to avoid pipeline stalls during the convolution operation.
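A simplified software model of the mapped datapath is sketched below in Python for one output pixel of a 1×1 binarized complex convolution; the actual HLS code in Fig. 7 covers the full kernel window and the unrolled channel loops. The packing convention (bit 1 for +1, bit 0 for −1) is an assumption.

```python
def popcnt(v):
    """Population count, modeling the popcount tree in the FPGA fabric."""
    return bin(v).count("1")

def binconv_point(in_real, in_imag, w_real, w_imag, n=128):
    """One output pixel of a 1x1 binarized complex convolution.
    in_*/w_* are n-bit packed channel words (bit=1 -> +1, bit=0 -> -1).
    Each complex-product term is an XOR + popcount dot product, combined
    per (ar*wr - ai*wi) + (ar*wi + ai*wr)i with fixed-point add/subtract."""
    dot = lambda a, b: n - 2 * popcnt(a ^ b)
    real = dot(in_real, w_real) - dot(in_imag, w_imag)
    imag = dot(in_real, w_imag) + dot(in_imag, w_real)
    return real, imag
```

In the HLS design the four `dot` evaluations proceed in parallel over the packed 128-channel words, which is what yields the low pipeline initiation interval.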
V-A Training of BCNN Models
In this section, we apply SLR pruning and STE-based quantization to both the BCNN based NIN-Net and ResNet-18. For the CIFAR-10 dataset, we present training results for both BCNN based NIN-Net and ResNet-18. For the ImageNet dataset, we present only the result for BCNN based ResNet-18. We conduct our experiments on Ubuntu 18.04 with Python 3.7 and PyTorch v1.6.0, using an Nvidia Quadro RTX 6000 GPU with 24 GB of GPU memory for training.
Firstly, in order to finalize the pooling layer for the final model, spectral pooling, average pooling, and max pooling are compared on the BCNN model. The BCNN based NIN-Net is used for demonstration. The three pooling types are compared in terms of achievable accuracy, and the results are given in Table I. Average pooling performs better than the other two pooling methods with an acceptable computation complexity, so average pooling is used for the BCNN model.
|Network|Pooling layer type|Accuracy|
|BCNN based NIN-Net|Average pooling|87.42%|

| |Pruned & quantized|83.17%|
| |Pruned & quantized|85.12%|
| |Pruned & quantized|87.67%|
| |Pruned & quantized|89.34%|
For the complex version of each model, the number of channels is reduced by half to ensure the same model size for the BCNN model. For weight pruning, the same pruning ratio is set for most of the intermediate layers. The accuracy of the four models (NIN-Net, complex NIN-Net, ResNet-18, and complex ResNet-18) on the CIFAR-10 dataset can be found in Table II. For the ImageNet dataset, the BCNN based ResNetE-18 model is used for pruning and quantization, and the results are given in Table III. As shown in the tables, the complex version of each network performs better than its ordinary counterpart. Thus, the binary complex NIN-Net and ResNet-18 are used for the hardware design evaluation.
|Network|Type|Top 5 accuracy|
| |Pruned & Quantized|71.69%|
V-B Hardware Evaluation
The hardware evaluation is conducted with Xilinx SDSoC 2020.1 and Vivado HLS 2020.1.1. The Alveo U280 board is used for the demonstration. The CIFAR-10 dataset is used as input, and each image input has a size of 32×32. The BCNN based NIN-Net model and the BCNN based ResNet-18 model are designed to fit the CIFAR-10 image input.
The resource utilization for a single BCNN based NIN-Net inference kernel can be found in Table IV. For a single kernel, the execution latency is 1.53 ms. The maximum resource utilization is bounded by the LUT resources, and nine kernels can be used simultaneously to achieve a higher level of parallelism. The maximum achievable throughput is 5882 frames/s.
For the BCNN based ResNet-18 model, the resource utilization for a single inference kernel can be found in Table V. The execution latency is 1.62 ms. The resource utilization is also bounded by the LUT resources. In this case, eight kernels can be used simultaneously during the inference step. For the BCNN based ResNet-18, the maximum throughput for the Alveo U280 platform is 4938 frames/s.
A cross-platform throughput comparison for the BCNN models is conducted. The FPGA platform is the Alveo U280, and the GPU platform is a single Nvidia Quadro RTX 6000 GPU with 24 GB of GPU memory. The throughput comparison can be found in Table VI. The proposed FPGA design achieves a 1.51× speedup on the BCNN based NIN-Net model and a 1.58× speedup on the BCNN based ResNet-18 model.
|BCNN based NIN-Net|Alveo U280|5882|
|BCNN based ResNet-18|Alveo U280|4938|
This is the first work to evaluate BCNN on the FPGA platform. BCNN reduces the memory storage as well as the memory bandwidth requirement of the DNN model while maintaining good model accuracy. Furthermore, the resource utilization of the BCNN model is also reduced, since most of the convolution computation degrades into XOR and pop_cnt operations. We use the HLS tool to design our BCNN models, and the proposed hardware design achieves 5882 frames/s and 6154 frames/s for the BCNN based NIN-Net and BCNN based ResNet-18 on the Alveo U280 FPGA platform, which are 1.51× and 1.58× speedups over the GPU platform.
-  (2020-05-05) Alveo u200 and u250 data center accelerator cards data sheet. Xilinx. Note: v1.3.1 Cited by: §I.
-  (2017) The high-dimensional geometry of binary neural networks. arXiv preprint arXiv:1705.07199. Cited by: §III-A.
-  (2010) Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pp. 177–186. Cited by: §III-C.
-  (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Now Publishers Inc. Cited by: §II-C.
-  (2015) Convergence of the surrogate lagrangian relaxation method. Journal of Optimization Theory and applications 164 (1), pp. 173–201. Cited by: §II-C.
-  (2020) YOLObile: real-time object detection on mobile devices via compression-compilation co-design. arXiv preprint arXiv:2009.05697. Cited by: §II-A.
-  (2021) Hbm connect: high-performance hls interconnect for fpga hbm. In The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 116–126. Cited by: §I.
-  (2015) Binaryconnect: training deep neural networks with binary weights during propagations. arXiv preprint arXiv:1511.00363. Cited by: §II-B, §III-B4, §III-D.
-  (2016) Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or −1. arXiv preprint arXiv:1602.02830. Cited by: §II-B.
-  (2009) Imagenet: a large-scale hierarchical image database. In , pp. 248–255. Cited by: §III-B2.
-  (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §I.
-  (2017) Circnn: accelerating and compressing deep neural networks using block-circulant weight matrices. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 395–408. Cited by: §II-A, §II-A.
-  (2019) REQ-yolo: a resource-aware, efficient quantization framework for object detection on fpgas. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 33–42. Cited by: §II-A.
-  (2020) Enabling retrain-free deep neural network pruning using surrogate lagrangian relaxation. arXiv preprint arXiv:2012.10079. Cited by: §II-C.
-  (2014) Energy table for 45nm process. In Stanford VLSI wiki, Cited by: §I.
-  (2016) Binarized neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 4114–4122. Cited by: §II-B, §III-D.
-  (2021) YOLOv5. GitHub. Note: https://github.com/ultralytics/yolov5 Cited by: §I.
-  (2020) Efficient transformer-based large scale language representations using hardware-friendly block structured pruning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 3187–3199. Cited by: §II-A.
-  (2020) FTRANS: energy-efficient acceleration of transformers using fpga. In Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design, pp. 175–180. Cited by: §I, §II-A.
-  (2021) BCNN: binary complex neural network. arXiv preprint arXiv:2104.10044. Cited by: §II-B, §III-B3, §III-B4, §III-D, §V-A.
-  (2017) Energy-efficient, high-performance, highly-compressed deep neural network design using block-circulant matrices. In 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 458–465. Cited by: §II-A.
-  (2013) Network in network. arXiv preprint arXiv:1312.4400. Cited by: §IV-A.
-  (2015) Neural networks with few multiplications. arXiv preprint arXiv:1510.03009. Cited by: §II-A.
-  (2017) Evaluating fast algorithms for convolutional neural networks on fpgas. In 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 101–108. Cited by: §II-A.
-  (2021) Accelerating transformer-based deep learning models on fpgas using column balanced block pruning. In 2021 22nd International Symposium on Quality Electronic Design (ISQED), pp. 142–148. Cited by: §II-A.
-  (2015) Spectral representations for convolutional neural networks. arXiv preprint arXiv:1506.03767. Cited by: §III-B2.
-  (2020) FTDL: an fpga-tailored architecture for deep learning systems. In (FPGA)The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 320–320. Cited by: §I, §I.
-  (2018) Deep complex networks. arXiv preprint arXiv:1705.09792. Cited by: §II-B, §III-B3.
-  (2016) Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4820–4828. Cited by: §II-A.
-  (2019) An ultra-efficient memristor-based dnn framework with structured weight pruning and quantization using admm. In 2019 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), pp. 1–6. Cited by: §II-A.
-  (2018) A systematic dnn weight pruning framework using alternating direction method of multipliers. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 184–199. Cited by: §III-C.