I Introduction
Due to the growing demand for DNN performance on different tasks, today's DNN models have relatively large parameter sizes. For example, ResNet18 for image classification has a model size of 45 MB, YOLOv5l of the YOLOv5 family [17] for object detection has a model size of 245 MB, and BERT [11] for NLP has 23M embedding weights and 317M neural network weights. For a shallow DNN model, all the weights and intermediate activations can be stored in the FPGA on-chip memory [27] (BRAM and URAM). However, the weights of large DNN models are hard to store in the FPGA on-chip memory, so external memory is used to store the weights and activations [19]. The external memory accesses for weights and activations therefore stall the system's performance.
Taking the Xilinx Alveo U200 Accelerator Card [1] as an illustration, we often need to access off-chip memory since the on-chip memory capacity is only 35 MB. However, the off-chip memory is inherently slower to access than on-chip memory in two aspects: ① the off-chip memory (DDR4) total bandwidth is 77 GB/s, while the total on-chip memory bandwidth is 31 TB/s (400× higher than the off-chip memory); ② the off-chip memory (DDR4) has around 100 ns memory access latency [7], while the on-chip memory can be accessed at a 500 MHz frequency [27] with only one cycle of latency (50× lower latency than off-chip memory). Furthermore, DRAM usually has more than 100× higher power consumption [15] than the FPGA on-chip SRAM. The whole system performance is thus constrained by off-chip memory accesses.
To fit the model into the FPGA platform's on-chip memory, model compression techniques can be used. They reduce the model size and accelerate DNN inference with an acceptable accuracy loss. Model compression techniques can be divided into two types: weight pruning and weight quantization. We leverage both techniques in this paper to achieve a higher model compression rate.
In this paper, we focus on the acceleration of BCNN-based NINNet and ResNet models. We further propose a hardware design to achieve high parallelism and high throughput on the FPGA platform. We implement the proposed techniques on the Alveo U280 hardware platform and compare latency and throughput. Experimental results show that the FPGA hardware design achieves 5882 frames/s and 4938 frames/s for the BCNN-based NINNet and ResNet models, respectively. Our contributions are summarized as follows:

The basic framework and building blocks for BCNN are given. For the pooling layer, spectral pooling, average pooling, and max pooling are compared in terms of model performance.

The Surrogate Lagrangian Relaxation (SLR) weight pruning technique is adopted for BCNN weight pruning, and the straight-through estimator (STE) is adopted for BCNN weight quantization; the whole model compression framework achieves a high compression ratio with an acceptable accuracy loss.

The binarized complex convolution kernel design is proposed to enable a high level of hardware parallelism and low pipeline initiation interval.

The hardware resource scheduling for the BCNN model implementation on FPGA is discussed, and we achieve a high overall hardware throughput.
The organization of the work is as follows. Section II gives the basics of DNN model compression and the BCNN model. Section III discusses the model structure, SLR-based compression, and STE-based binary quantization. Section IV gives the hardware design for the BCNN-based models. Section V presents the BCNN models' training and the hardware implementation results. Section VI concludes the hardware design and experiments.
II Related Work
In this section, we briefly discuss current work on DNN model compression techniques, BNNs, and complex neural networks.
II-A Model Compression
To reduce the DNN model size and inference latency, model compression techniques can be adopted. The main challenge for SOTA model compression techniques is maintaining the model's accuracy while improving inference speed and throughput on hardware platforms. There are two types of model compression techniques: weight pruning [19] and weight quantization [12].
Two pruning methods are widely used: the unstructured pruning method and the structured pruning method [25]. Unstructured pruning [6, 18, 30] is easier to implement in software and has a high compression ratio with low accuracy loss. However, due to its irregular memory access pattern, unstructured pruning can hardly be accelerated on most hardware platforms. Structured pruning [6, 18, 30] constrains the weight matrix to be pruned in a structured and hardware-friendly pattern. For example, block-circulant matrices [12, 24, 21] can be used for weight representation after pruning. Structured-pruning-based hardware implementations achieve better performance due to the higher parallelism enabled by regular memory access patterns and the reduced computation burden.
Another source of redundancy in a DNN model is the bit representation of the weights. The DNN baseline model usually adopts a float32 representation for the weight values. To compress the bit representation of the data, various works [23, 29] have proposed different DNN model quantization techniques, including fixed bit-length, ternary, and binary weight representations. The truncated bit representation reduces the DNN model size, the computation burden on the hardware platform, and the memory bandwidth consumption. Fixed bit-length representations of DNN model parameters can be further classified into equal-distance quantization and power-of-two quantization. Equal-distance quantization is similar to fixed-point number representation, and it reduces hardware resource utilization while maintaining high accuracy. Power-of-two quantization further improves hardware efficiency owing to bit-shift-based multiplication. However, the unequally distributed scale of power-of-two quantization leads to a non-negligible DNN model accuracy degradation. To improve model accuracy while maintaining hardware efficiency, mixed powers-of-two-based quantization was proposed [13]. It combines a primary powers-of-two term and a secondary powers-of-two term, so the multiplier requires only two bit-shifters and one adder.
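As an illustration of why mixed powers-of-two quantization maps well to hardware, the sketch below (plain Python; the function names are illustrative and not from the paper) approximates a weight with a primary plus a secondary power-of-two term, so multiplication reduces to two shifts and one add:

```python
import math

def quantize_pow2(w, secondary=True):
    """Quantize nonzero |w| to the nearest power of two (primary term),
    optionally adding a secondary power-of-two term for the residual."""
    sign = -1 if w < 0 else 1
    a = round(math.log2(abs(w)))            # primary exponent
    approx = sign * 2.0 ** a
    if secondary:
        r = w - approx                       # residual error
        if r != 0:
            b = round(math.log2(abs(r)))     # secondary exponent
            approx += (1 if r > 0 else -1) * 2.0 ** b
    return approx

def shift_mul(x, sign_a, a, sign_b, b):
    """Multiply integer x by (sign_a*2^a + sign_b*2^b) with shifts and one add."""
    return sign_a * (x << a) + sign_b * (x << b)
```

For example, 0.3 quantizes to 2^-2 + 2^-4 = 0.3125, and multiplying by 2^2 + 2^0 is just `(x << 2) + x`.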
II-B Binary Complex Neural Networks
The concept of the Binary Neural Network (BNN) originated from the binary weight neural network (BWNN) [8], which quantizes only the weight values into binary. However, for FPGA devices with small on-chip memory, the intermediate activations of a BWNN are still too large to be stored in the on-chip SRAM, and external memory is required. Later works [16, 9] proposed BNNs and researched the quantization of both activations and weights into a binary representation. Those works established a few key techniques to maintain BNN model accuracy: the straight-through estimator (STE) for gradient descent, batch normalization after binarized convolution, and keeping full precision for both activations and weights at the first and last layers. By quantizing both activations and weights, the multiplication degrades into a binary XOR operation and is highly hardware-friendly for FPGA and GPU platforms. A simple popcnt(xor()) function can be used for binarized convolution or fully connected layers.
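The popcnt(xor()) formulation can be sketched in a few lines of Python (illustrative names; bit 1 encodes +1 and bit 0 encodes −1 by assumption):

```python
def popcnt(v):
    """Count set bits (the hardware popcnt primitive)."""
    return bin(v).count("1")

def bin_dot(x_bits, w_bits, n):
    """Dot product of n binarized {-1,+1} values packed one per bit.
    XOR marks mismatching positions, so dot = n - 2*popcnt(x ^ w)."""
    return n - 2 * popcnt((x_bits ^ w_bits) & ((1 << n) - 1))
```

For example, x = (+1, −1, +1) and w = (+1, +1, −1) pack to 0b101 and 0b011, and the dot product 1 − 1 − 1 = −1 falls out of a single XOR and popcount.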
To enhance the model's information representation with the same or even a smaller parameter size, deep complex neural networks were proposed [28]. A complex neural network has dedicated complex versions of the basic building blocks: convolution, batch normalization, weight initialization strategy, etc. Deep complex neural networks achieve performance comparable to ordinary DNNs with half the model parameters.
The binary complex neural network (BCNN) [20] combines the benefits of both binary neural networks and complex neural networks. The activations and weights of the deep complex neural network are quantized to one bit except at the first and last layers. To reduce the computation overhead and improve model accuracy, Li et al. [20] also proposed a few new concepts: quadrant binarization for forward and backward propagation, complex Gaussian batch normalization (CGBN), and binary complex weight initialization. These concepts are discussed in detail in Section III.
II-C Basics of Surrogate Lagrangian Relaxation (SLR)
The surrogate Lagrangian relaxation (SLR) method [5] is an optimization algorithm similar to the alternating direction method of multipliers (ADMM) [4]: it breaks an optimization problem into several smaller subproblems that can be solved iteratively. However, it also overcomes major convergence difficulties of standard Lagrangian relaxation. As the solutions of the decomposed subproblems are coordinated by updating the Lagrangian multipliers, convergence can be proved when the solutions satisfy the "surrogate" optimality condition with a novel step-sizing formula [5].
Gurevin et al. [14] was the first work implementing an SLR-based framework for DNN weight pruning. Compared with the ADMM-based weight pruning method at the same compression rate, SLR achieves higher accuracy on classification tasks, both for VGG16 on the CIFAR10 dataset and for ResNet18 on the ImageNet dataset. For object detection on the COCO 2014 benchmark, models pruned with the SLR method achieve higher accuracy after hard pruning than with the ADMM method across all YOLO frameworks. Also, when hard-pruning accuracy is checked periodically during training, SLR converges faster and reaches the desired accuracy sooner, for classification on CIFAR10 and ImageNet as well as for object detection on COCO 2014. Experiments also show that the SLR-based weight-pruning optimization approach achieves high accuracy even at the hard-pruning stage. This retrain-free property reduces the traditional three-stage pruning pipeline to two stages and reduces the budget of retraining epochs.
III Training and Compression of BCNN
In this section, we discuss the BCNN model in detail, including the model structure, the fundamental building blocks and operations of BCNN, weight pruning using SLR, and weight quantization based on quadrant binarization and STE.
III-A Structure of BCNN
The comparison between the BCNN and the original convolutional neural network (CNN) structure is demonstrated in Fig. 1. The mathematical equation for the convolution and add-bias operation is given in Eq. 1. As shown in Fig. 1(a), the original CNN is composed of a convolution layer, add bias, a nonlinear layer, and a pooling layer. The BCNN structure, shown in Fig. 1(b), is different: the pooling layer and batch normalization layer come after the convolution layer, and the bias can be removed from the network to reduce the computation overhead without accuracy loss. For the BCNN, batch normalization is a mandatory operation for model convergence [2].
y = W ∗ x + b, (1)

where ∗ denotes the convolution operation, W is the weight tensor, x is the input activation, and b is the bias.
For an image with three channels (RGB) as input, the initial input contains only the real part. To generate the complex input, a two-layer residual CNN is designed to learn the imaginary part. The network for generating the complex input is shown in Fig. 2.
III-B Building Blocks and Operations
The basic building blocks of BCNN are slightly different from those of an ordinary DNN. The complex versions of the convolution layer, pooling layer, batch normalization, and binarize function are discussed in this section.
III-B1 Complex Convolution
A binary complex number can be defined as a + bi, where a and b belong to {−1, +1}. A single binary complex number is represented by 2 bits and thus has twice the memory occupation of a binary number. Assume the input activation is x = x_r + x_i·i and the weight is w = w_r + w_i·i. The dot product between the input activation and the weight can be denoted in equation format (Eq. 2) or matrix format (Eq. 3). The mathematical expression of the complex convolution operation can be deduced from Eq. 1 and Eq. 3. The binary complex convolution can be divided into a bitwise XOR operation, a popcnt operation, and fixed-point additions/subtractions. Compared to a full-precision convolution operation on the FPGA platform, only a few LUT resources are needed to compute the binary complex convolution, and the memory bandwidth requirement is reduced by 32×.
w · x = (w_r x_r − w_i x_i) + (w_r x_i + w_i x_r) i, (2)

[Re(w · x); Im(w · x)] = [w_r, −w_i; w_i, w_r] · [x_r; x_i]. (3)
III-B2 Pooling Layer
The pooling layer is optional after a convolutional layer. In a deep CNN, pooling layers are used at specific layers to downsample and reduce the activation sizes. Two pooling layers are widely used in the original CNN: max pooling and average pooling. For BCNN, the activations are represented as complex numbers, enabling another type of pooling method with good information preservation: spectral pooling [26].
Spectral pooling conducts a fast Fourier transform (FFT) over the 2D dimensions of the activations and truncates the high-frequency components to keep the central low-frequency spectrum. An inverse FFT (IFFT) then transforms the spectral information back to the spatial domain.
A comparison of different pooling methods is shown in Fig. 3. The pictures are obtained from the ImageNet dataset [10]. For spectral pooling, the pixel values are scaled back to the displayable range to ensure visibility. As shown in the figure, spectral pooling and average pooling preserve more spatial information than max pooling. Max pooling acts more like a 2D whitening function and performs poorly for images with higher brightness.
As for computation complexity, the FFT of spectral pooling has O(n log n) complexity, while average pooling and max pooling both have O(n) complexity. To maintain a good trade-off between model accuracy and computation complexity, average pooling is chosen for the hardware design.
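For reference, a 2×2 average pooling with stride 2 is a single O(n) pass over the input. A minimal Python sketch (illustrative only, not the hardware implementation; assumes even spatial dimensions):

```python
def avg_pool2x2(img):
    """2x2 average pooling with stride 2 over a 2D list of values.
    Each output pixel is the mean of a non-overlapping 2x2 window."""
    h, w = len(img), len(img[0])
    return [[(img[i][j] + img[i][j + 1] + img[i + 1][j] + img[i + 1][j + 1]) / 4.0
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]
```

Each input pixel is touched exactly once, which is the O(n) cost cited above.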
III-B3 CGBN
The complex version of batch normalization (CGBN) is an essential operation that helps the BCNN converge. The original form of batch normalization is given in Eq. 4, where x denotes the input activation, μ denotes the mean of the activations in the mini-batch, σ² denotes the variance of the activations in the mini-batch, and ε denotes a small value added to avoid division by zero. γ and β are scaling factors learned during the training step. The activation output is represented as y.

y = γ · (x − μ) / sqrt(σ² + ε) + β. (4)
For the complex neural network, the original form of complex batch normalization is given in Eq. 5 [28], where V is the 2×2 covariance matrix of the real and imaginary parts and E[x] is the mean of the complex activations within the mini-batch. Γ is a 2×2 matrix and β is a complex number. Both Γ and β are learned during backpropagation.

y = Γ · V^(−1/2) (x − E[x]) + β. (5)
However, as discussed in Li et al. [20], the original form of complex batch normalization requires too much computation, so the CGBN concept was proposed. The mathematical equation for CGBN is given in Eq. 6, where μ is the complex mean of the mini-batch and σ² is a single real variance estimated over the real and imaginary parts. Both γ and β are complex-valued scaling factors and are learned. CGBN has higher accuracy and lower computation complexity for BCNN, so CGBN is used for both the software and hardware implementations.

y = γ · (x − μ) / sqrt(σ² + ε) + β. (6)
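To make the operation concrete, here is a minimal Python sketch of one plausible reading of CGBN (the exact normalization in [20] may differ; the joint real/imaginary variance estimate and the function name are our assumptions):

```python
def cgbn(batch, gamma, beta, eps=1e-5):
    """Complex Gaussian batch normalization sketch: normalize a mini-batch
    of complex activations by the complex mean and a single real variance
    taken jointly over real and imaginary parts, then scale/shift with
    learned complex parameters gamma and beta (assumed form)."""
    n = len(batch)
    mu = sum(batch) / n                                    # complex mean
    var = sum(abs(x - mu) ** 2 for x in batch) / (2 * n)   # per-component variance
    return [gamma * (x - mu) / (var + eps) ** 0.5 + beta for x in batch]
```

Unlike Eq. 5, no 2×2 matrix inverse square root is needed, which is where the computation saving comes from.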
III-B4 Binarization
There are two widely used types of binarization [8]: deterministic binarization and stochastic binarization. The equation for deterministic binarization is given in Eq. 7: an activation is binarized to +1 or −1 according to its sign. The equation for stochastic binarization is given in Eq. 8, where σ(x) is a hard clipping function satisfying σ(x) = clip((x + 1)/2, 0, 1).

x_b = sign(x) = +1 if x ≥ 0, −1 otherwise. (7)

x_b = +1 with probability p = σ(x), −1 with probability 1 − p. (8)
Stochastic binarization has better model accuracy than deterministic binarization. However, implementing stochastic binarization requires a random number generator and is expensive in hardware. Therefore, deterministic binarization is used for both the software and hardware experiments.
For a complex number with real and imaginary parts, quadrant binarization was proposed in [20] and is used in this paper. The concept of quadrant binarization is simple and straightforward: the real and imaginary parts are binarized individually during forward propagation.
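Quadrant binarization can be sketched directly (illustrative Python; mapping sign(0) to +1 follows Eq. 7 by assumption):

```python
def quadrant_binarize(z):
    """Quadrant binarization: binarize the real and imaginary parts
    independently with the deterministic sign function (Eq. 7)."""
    sgn = lambda v: 1.0 if v >= 0 else -1.0
    return complex(sgn(z.real), sgn(z.imag))
```

The name comes from the fact that the output is determined only by which quadrant of the complex plane the input lies in.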
III-C Channel-wise Weight Pruning Using SLR
Consider an N-layer DNN indexed as i = 1, …, N. The collection of weights at the i-th convolutional layer is denoted by W_i, the collection of corresponding biases by b_i, and the loss function by f({W_i}, {b_i}). The objective of channel-wise weight pruning is to minimize the loss function subject to the constraint that the number of nonzero channels of the weights in each layer is less than or equal to a predefined number l_i. This can be formulated as Eq. 9:

minimize f({W_i}, {b_i}), subject to W_i ∈ S_i, i = 1, …, N, (9)
where S_i is the set of weight tensors whose number of nonzero channels is less than or equal to l_i, with {l_i} a set of predefined hyperparameters. This can equivalently be rewritten in an unconstrained form as Eq. 10:

minimize f({W_i}, {b_i}) + Σ_{i=1}^{N} g_i(W_i). (10)

The first term is the nonlinear loss function, and the second is the non-differentiable penalty term [31], i.e., the indicator function of S_i, shown in Eq. 11:

g_i(W_i) = 0 if W_i ∈ S_i, and g_i(W_i) = +∞ otherwise. (11)
This problem cannot be solved analytically alone or by stochastic gradient descent alone. Therefore, duplicate variables Z_i are introduced as in Eq. 12:

minimize f({W_i}, {b_i}) + Σ_{i=1}^{N} g_i(Z_i), subject to W_i = Z_i, i = 1, …, N. (12)
To solve the problem, SLR leverages augmented Lagrangian multipliers and penalizes constraint violations using quadratic penalties, which can be written as Eq. 13:

L_ρ({W_i}, {b_i}, {Z_i}, {Λ_i}) = f({W_i}, {b_i}) + Σ_i g_i(Z_i) + Σ_i tr[Λ_i^T (W_i − Z_i)] + Σ_i (ρ/2) ‖W_i − Z_i‖_F², (13)

where Λ_i are the Lagrangian multipliers (dual variables) corresponding to the constraints, with the same dimension as W_i. The scalar ρ is a positive penalty coefficient, and ‖·‖_F denotes the Frobenius norm. The problem can be decomposed into two subproblems that are solved iteratively until convergence.
Step 1: Solve the subproblem (loss function) for {W_i} using stochastic gradient descent. At iteration k, the "loss function" subproblem minimizes the Lagrangian function with respect to {W_i, b_i} while keeping {Z_i} at their previously obtained values, for given values of the multipliers {Λ_i^k}: min_{W,b} L_ρ({W_i}, {b_i}, {Z_i^{k−1}}, {Λ_i^k}). This can be solved using stochastic gradient descent (SGD) [3]. The "surrogate" optimality condition (Eq. 14) must be satisfied to ensure the multipliers are updated in the right direction:

L_ρ({W_i^k}, {b_i^k}, {Z_i^{k−1}}, {Λ_i^k}) < L_ρ({W_i^{k−1}}, {b_i^{k−1}}, {Z_i^{k−1}}, {Λ_i^k}). (14)
When Eq. 14 is satisfied, the step sizes and multipliers are updated as in Eq. 15; otherwise, the previous step sizes and multipliers are kept.

Λ_i^{k+1} = Λ_i^k + s^k (W_i^k − Z_i^{k−1}), (15)

where s^k is the step size at iteration k.
Step 2: Solve the subproblem (channel-wise pruning) for {Z_i} by pruning using projections onto the discrete subspace. The channel-wise pruning subproblem can be written as: min_{Z} Σ_i g_i(Z_i) + Σ_i tr[Λ_i^T (W_i^k − Z_i)] + Σ_i (ρ/2) ‖W_i^k − Z_i‖_F², where g_i is the indicator function. The global optimum of this subproblem can be obtained analytically by projection: for the weight tensor in each convolutional layer, the Frobenius norm of each channel is calculated and denoted as ‖W_i^{(j)}‖_F, j = 1, …, C_i, where C_i is the number of channels in the tensor. Channels with larger norms are kept and channels with smaller norms are set to zero, so that the number of nonzero channels is less than or equal to l_i. A second "surrogate" optimality condition (Eq. 16) must be satisfied to ensure the multipliers are updated in the right direction:

L_ρ({W_i^k}, {b_i^k}, {Z_i^k}, {Λ_i^{k+1}}) < L_ρ({W_i^k}, {b_i^k}, {Z_i^{k−1}}, {Λ_i^{k+1}}). (16)
When Eq. 16 is satisfied, the step sizes and multipliers are updated again as in Eq. 17; as in the first step, the previous step sizes and multipliers are kept if the condition is not satisfied.

Λ_i^{k+2} = Λ_i^{k+1} + s^{k+1} (W_i^k − Z_i^k). (17)

In both steps, the step sizes are set as in Eq. 18,

s^k = α^k · s^{k−1} ‖W^{k−1} − Z^{k−1}‖_F / ‖W^k − Z^k‖_F, (18)

where the coefficient α^k ∈ (0, 1) is computed from predefined hyperparameters [5, 14].
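The projection used in Step 2, keeping the channels with the largest Frobenius norms and zeroing the rest, can be sketched as follows (plain Python; the function name and the channels-as-flat-lists layout are illustrative assumptions):

```python
def project_channels(weight, keep):
    """Channel-wise pruning projection: keep the `keep` channels with the
    largest Frobenius norm and set all other channels to zero.
    `weight` is a list of channels, each a flat list of values."""
    norms = [sum(v * v for v in ch) ** 0.5 for ch in weight]     # per-channel ||.||_F
    order = sorted(range(len(weight)), key=lambda i: norms[i], reverse=True)
    kept = set(order[:keep])                                      # indices of survivors
    return [ch if i in kept else [0.0] * len(ch) for i, ch in enumerate(weight)]
```

This is exactly the analytic solution of the Step 2 subproblem up to the Λ/ρ offset terms, which shift which tensor is being projected but not the keep-the-largest-norms rule.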
III-D Weight Quantization
The pruned model from Section III-C is used for further weight quantization. Binarization is conducted for both activations and weights in the binarized convolution layers, using the deterministic binarization function. The binarization function (Eq. 7) is non-differentiable at 0, so direct backpropagation is not feasible for quantization training. The Straight-Through Estimator (STE) was proposed in the prior literature [8, 16] for backpropagation. The complex version of STE was proposed in [20], and the equation is given in Eq. 19.
∂L/∂x_r = ∂L/∂x_r^b · 1_{|x_r| ≤ 1},  ∂L/∂x_i = ∂L/∂x_i^b · 1_{|x_i| ≤ 1}, (19)

where x^b = x_r^b + x_i^b · i is the quadrant-binarized activation.
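A minimal sketch of the per-component STE (illustrative Python; it assumes the hard-tanh clipping window |·| ≤ 1 is applied independently to the real and imaginary parts, matching quadrant binarization):

```python
def ste_grad(z, grad_out):
    """Complex straight-through estimator sketch: pass the upstream gradient
    through the real and imaginary components independently, zeroing it
    where the corresponding input magnitude exceeds 1."""
    gr = grad_out.real if abs(z.real) <= 1.0 else 0.0
    gi = grad_out.imag if abs(z.imag) <= 1.0 else 0.0
    return complex(gr, gi)
```

During training the forward pass uses quadrant binarization, while the backward pass substitutes this clipped identity for the non-differentiable sign.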
IV FPGA Hardware Architecture
FPGA is one of the most popular hardware platforms for DNN applications, featuring a reconfigurable structure and a high level of parallelism for hardware design. With the growing size of DNN models, the weight matrices and activations are too large to be stored in the FPGA on-chip memory. However, the aforementioned pruning and weight quantization techniques compress both the activation and weight representations, making it possible for the FPGA platform to store all the intermediate results within on-chip memory. In this section, the hardware design is conducted with Vivado HLS 2020.1, and we present our FPGA hardware structure for the BCNN model.
IV-A Overall FPGA Structure
Our overall FPGA structure for BCNN is shown in Fig. 4. We present two BCNN model designs in this section: a network-in-network (NIN) [22] based BCNN model and a ResNet18-based BCNN model. Both models are composed of three major layer types: the complex input generation layer (Fig. 2), the full-precision complex convolutional layer (Fig. 4(a)), and the binarized complex convolutional layer (Fig. 4(b)). A fully connected (FC) layer is used at the end to generate the prediction output.
For the ResNet18 network, there are two types of residual blocks, and both are binarized blocks in the BCNN model. Residual block 1 is shown in Fig. 5: the input passes through two binarized complex convolutional layers and is added to the original input to produce the final output. Residual block 2 is shown in Fig. 6: one path has two binarized complex convolutional layers and the other path has one binarized complex convolutional layer; the outputs of the two paths are added together to generate the final output.
IV-B Hardware Design Details
There are several major building blocks in the hardware design: the full-precision complex convolutional layer with its batch normalization and pooling, and the binarized complex convolutional layer with its batch normalization and pooling. The activation functions used in the models are simple ReLU and Hardtanh functions, which have very low computation costs.
The binarized complex convolution computation follows Eq. 1, Eq. 3, and Eq. 7. The weight matrices for the real and imaginary parts can be concatenated to form the final weight matrix input, which directly serves as the general weight matrix input for the convolution. The example HLS code for the binarized convolution is given in Fig. 7, and the hardware mapping is shown in Fig. 8. In the example design, the stride is 1, and the numbers of input and output complex channels are both 128. For both the input and output channels, the first 128 channels are the real parts and the last 128 channels are the imaginary parts. The output channel and input channel dimensions are chosen for two levels of parallelism. The sparse channel pruning for the convolutional layer can be applied during the binarization operation to avoid pipeline stalls during the convolution.
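The datapath of the binarized complex convolution can be mirrored conceptually in Python (this is not the Fig. 7 HLS code; the bit packing, names, and ±1 encoding are our assumptions): each binary dot product is n − 2·popcnt(xor), and four such dot products form one complex multiply-accumulate following Eq. 2.

```python
def popcnt(v):
    """Count set bits (hardware popcnt)."""
    return bin(v).count("1")

def bin_complex_mac(xr, xi, wr, wi, n):
    """One binarized complex multiply-accumulate over n packed channels
    (bit 1 = +1, bit 0 = -1). Four XOR/popcount dot products realize
    Re = wr.xr - wi.xi and Im = wr.xi + wi.xr (Eq. 2)."""
    mask = (1 << n) - 1
    dot = lambda a, b: n - 2 * popcnt((a ^ b) & mask)  # binary dot product
    real = dot(wr, xr) - dot(wi, xi)
    imag = dot(wr, xi) + dot(wi, xr)
    return real, imag
```

In the HLS design these four dot products map to LUTs and run in parallel across the unrolled input/output channel dimensions, which is what keeps the pipeline initiation interval low.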
V Experiment
V-A Training of BCNN Models
In this section, we apply SLR pruning and STE-based quantization to both the BCNN-based NINNet and ResNet18. For the CIFAR10 dataset, we present the training results for both BCNN-based NINNet and ResNet18; for the ImageNet dataset, we only present the result for the BCNN-based ResNet18. We conduct our experiments on Ubuntu 18.04 with Python 3.7 and PyTorch v1.6.0, using an Nvidia Quadro RTX 6000 GPU with 24 GB of GPU memory for training.
First, to finalize the pooling function for the final model, spectral pooling, average pooling, and max pooling are compared on the BCNN model, using the BCNN-based NINNet for demonstration. The three pooling types are compared in terms of achievable accuracy, and the results are given in Table I. Average pooling performs better than the other two pooling methods with an acceptable computation complexity, so average pooling is used for the BCNN model.
Network  Pooling layer type  Accuracy

BCNN based NINNet  Spectral pooling  87.09%
BCNN based NINNet  Average pooling  87.42%
BCNN based NINNet  Max pooling  86.88%
Network  Type  Accuracy

NINNet  Original  89.13%
NINNet  Pruned  84.92%
NINNet  Pruned & quantized  83.17%
Complex NINNet  Original  89.31%
Complex NINNet  Pruned  86.13%
Complex NINNet  Pruned & quantized  85.12%
ResNet18  Original  92.14%
ResNet18  Pruned  88.51%
ResNet18  Pruned & quantized  87.67%
Complex ResNet18  Original  92.51%
Complex ResNet18  Pruned  90.23%
Complex ResNet18  Pruned & quantized  89.34%
For the complex versions of the models, the number of channels is halved to ensure the same model size as the real-valued counterparts. For weight pruning, a fixed pruning ratio is set for most of the intermediate layers. The accuracy of the four models, NINNet, complex NINNet, ResNet18, and complex ResNet18, on the CIFAR10 dataset can be found in Table II. For the ImageNet dataset, the BCNN-based ResNetE18 model [20] is used for pruning and quantization, and the results are given in Table III. As shown in the tables, the complex version of each network performs better than its ordinary counterpart. Thus, the binary complex NINNet and ResNet18 are used for the hardware design evaluation.
Network  Type  Top-5 accuracy

Complex ResNet18  Original  83.46%
Complex ResNet18  Pruned  78.38%
Complex ResNet18  Pruned & quantized  71.69%
V-B Hardware Evaluation
The hardware evaluation is conducted with Xilinx SDSoC 2020.1 and Vivado HLS 2020.1.1, using the Alveo U280 board for the demonstration. The CIFAR10 dataset is used as input, where each input image has a size of 32×32×3. The BCNN-based NINNet model and the BCNN-based ResNet18 model are designed to fit the CIFAR10 image input.
The resource utilization of a single BCNN-based NINNet inference kernel can be found in Table IV. For a single kernel, the execution latency is 1.53 ms. The resource utilization is bounded by the LUT resources, and nine kernels can run simultaneously to achieve a higher level of parallelism, giving a maximum achievable throughput of 5882 frames/s.
Resource  Utilization  Total  Percentage (%)

DSP  575  9024  6.37
FF  88845  2607360  3.41
LUT  137387  1303680  10.54
For the BCNN-based ResNet18 model, the resource utilization of a single inference kernel can be found in Table V. The execution latency is 1.62 ms. The resource utilization is also bounded by the LUT resources; in this case, eight kernels can run simultaneously during inference. For the BCNN-based ResNet18, the maximum throughput on the Alveo U280 platform is 4938 frames/s.
Resource  Utilization  Total  Percentage (%)

DSP  465  9024  5.15
FF  112347  2607360  4.31
LUT  161306  1303680  12.37
A cross-platform throughput comparison for the BCNN models is conducted. The FPGA platform is the Alveo U280, and the GPU platform is a single Nvidia Quadro RTX 6000 GPU with 24 GB of GPU memory. The throughput comparison can be found in Table VI. The proposed FPGA design achieves a 1.51× speedup on the BCNN-based NINNet model and a 1.58× speedup on the BCNN-based ResNet18 model.
Model  Platform  Throughput (frames/s)

BCNN based NINNet  Alveo U280  5882
BCNN based NINNet  RTX 6000  3890
BCNN based ResNet18  Alveo U280  4938
BCNN based ResNet18  RTX 6000  3123
VI Conclusion
This work is the first to evaluate BCNN on the FPGA platform. BCNN reduces the memory storage and memory bandwidth requirements of the DNN model while maintaining good model accuracy. Further, the resource utilization of the BCNN model is also reduced, since most of the convolution computation degrades into XOR and pop_cnt operations. We use the HLS tool to design our BCNN models, and the proposed BCNN hardware design achieves 5882 frames/s and 4938 frames/s for the BCNN-based NINNet and the BCNN-based ResNet18 on the Alveo U280 FPGA platform, which are 1.51× and 1.58× speedups over the GPU platform.
References
 [1] (2020-05-05) Alveo U200 and U250 data center accelerator cards data sheet. Xilinx. Note: v1.3.1. Cited by: §I.
 [2] (2017) The highdimensional geometry of binary neural networks. arXiv preprint arXiv:1705.07199. Cited by: §IIIA.

 [3] (2010) Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pp. 177–186. Cited by: §IIIC.
 [4] (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Now Publishers Inc. Cited by: §IIC.
 [5] (2015) Convergence of the surrogate lagrangian relaxation method. Journal of Optimization Theory and applications 164 (1), pp. 173–201. Cited by: §IIC.
 [6] (2020) YOLObile: real-time object detection on mobile devices via compression-compilation co-design. arXiv preprint arXiv:2009.05697. Cited by: §IIA.
 [7] (2021) HBM Connect: high-performance HLS interconnect for FPGA HBM. In The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 116–126. Cited by: §I.
 [8] (2015) Binaryconnect: training deep neural networks with binary weights during propagations. arXiv preprint arXiv:1511.00363. Cited by: §IIB, §IIIB4, §IIID.
 [9] (2016) Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or −1. arXiv preprint arXiv:1602.02830. Cited by: §IIB.

 [10] (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §IIIB2.
 [11] (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §I.
 [12] (2017) CirCNN: accelerating and compressing deep neural networks using block-circulant weight matrices. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 395–408. Cited by: §IIA, §IIA.
 [13] (2019) REQ-YOLO: a resource-aware, efficient quantization framework for object detection on FPGAs. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 33–42. Cited by: §IIA.
 [14] (2020) Enabling retrainfree deep neural network pruning using surrogate lagrangian relaxation. arXiv preprint arXiv:2012.10079. Cited by: §IIC.
 [15] (2014) Energy table for 45nm process. In Stanford VLSI wiki, Cited by: §I.
 [16] (2016) Binarized neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 4114–4122. Cited by: §IIB, §IIID.
 [17] (2021) YOLOv5. GitHub. Note: https://github.com/ultralytics/yolov5 Cited by: §I.

 [18] (2020) Efficient transformer-based large scale language representations using hardware-friendly block structured pruning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 3187–3199. Cited by: §IIA.
 [19] (2020) FTRANS: energy-efficient acceleration of transformers using FPGA. In Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design, pp. 175–180. Cited by: §I, §IIA.
 [20] (2021) BCNN: binary complex neural network. arXiv preprint arXiv:2104.10044. Cited by: §IIB, §IIIB3, §IIIB4, §IIID, §VA.
 [21] (2017) Energyefficient, highperformance, highlycompressed deep neural network design using blockcirculant matrices. In 2017 IEEE/ACM International Conference on ComputerAided Design (ICCAD), pp. 458–465. Cited by: §IIA.
 [22] (2013) Network in network. arXiv preprint arXiv:1312.4400. Cited by: §IVA.
 [23] (2015) Neural networks with few multiplications. arXiv preprint arXiv:1510.03009. Cited by: §IIA.
 [24] (2017) Evaluating fast algorithms for convolutional neural networks on fpgas. In 2017 IEEE 25th Annual International Symposium on FieldProgrammable Custom Computing Machines (FCCM), pp. 101–108. Cited by: §IIA.

 [25] (2021) Accelerating transformer-based deep learning models on FPGAs using column balanced block pruning. In 2021 22nd International Symposium on Quality Electronic Design (ISQED), pp. 142–148. Cited by: §IIA.
 [26] (2015) Spectral representations for convolutional neural networks. arXiv preprint arXiv:1506.03767. Cited by: §IIIB2.
 [27] (2020) FTDL: an FPGA-tailored architecture for deep learning systems. In (FPGA) The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 320–320. Cited by: §I, §I.
 [28] (2018) Deep complex networks. arXiv preprint arXiv:1705.09792. Cited by: §IIB, §IIIB3.
 [29] (2016) Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4820–4828. Cited by: §IIA.
 [30] (2019) An ultraefficient memristorbased dnn framework with structured weight pruning and quantization using admm. In 2019 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), pp. 1–6. Cited by: §IIA.
 [31] (2018) A systematic dnn weight pruning framework using alternating direction method of multipliers. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 184–199. Cited by: §IIIC.