1. Introduction
Deep Convolutional Neural Networks (CNNs) have demonstrated great success in various machine intelligence areas and enabled advances in applications such as video content understanding [1], face recognition [2], and crowd flow monitoring [3]. To cope with the heavy computational demands of these deep models, researchers have developed custom hardware accelerators including ASICs [4] and FPGAs [5]. FPGAs have already been proven to be efficient devices for implementing traditional computer vision algorithms [6, 7, 8]. More recently, they have also gained popularity for accelerating deep neural networks, mainly because of their flexibility and high energy efficiency. With the help of High-Level Synthesis (HLS) tools, we are able to rapidly map and optimize emerging deep neural network architectures on FPGAs.
Prior FPGA works [9, 10] mostly focus on optimizing general convolution (CONV) layers. However, recent trends show that state-of-the-art neural network architectures [11, 12] tend to contain parallel branches, which are then merged through filter concatenation or summation. It has been observed that jointly optimizing a single acceleration engine for all layers of a deep network leads to dynamic underutilization of resources [13]. This effect is especially acute for these advanced CNN structures with branches, since each branch most likely carries convolutions of different sizes and dimensions. Also, past approaches did not exploit the fact that individual branches are independent, and thus missed the opportunity to execute the parallelizable CONV branches concurrently. To overcome this problem, one solution is to implement multiple CONV engines, each specifically designed for a single layer or a subset of layers. To achieve minimum overall latency, a resource partition solution has been proposed [14]. We extend this resource allocation strategy to exploit the intra-module parallelism of the multi-branch topology and attain better latency.
Besides hardware-specific techniques, researchers have also looked into algorithmic improvements to accelerate convolution. Prior work [15] exploits the equivalence between convolutions in the spatial domain and element-wise multiplications in the frequency domain. This allows the computation complexity to be reduced mathematically with FFT-based convolution. More recently, the Winograd minimal-filtering-based convolution algorithm has been introduced [16]; it is suitable for small kernel sizes and strides. Although work has been done to adopt the FFT-based [17] and Winograd-based [18] algorithms to accelerate convolution on FPGAs, to the best of our knowledge, no work combines the two fast algorithms to adapt to different convolution sizes and thereby obtain better performance. In our work, we analyze and explore the properties of both FFT-based and Winograd-based convolution algorithms, and propose a heuristic methodology for designing a hybrid accelerator for different convolutions. To summarize, this work makes the following contributions:

We analyze the FFT and Winograd convolution algorithms, and explore the design space to find the suitable sizes and depths for applying each algorithm respectively. The analysis is incorporated in a general methodology to design a hybrid convolution algorithm for FPGAs.

We propose a novel resource allocation scheme that considers the intra-module parallelism of recent CNN topologies with branches and minimizes the overall system latency. We implement an algorithm to quickly find optimal resource partition parameters with HLS tools.

We design a template-based reconfigurable HLS IP specifically targeting the Inception module, which features parallel convolution branches [11]. Using the IP, we implement a face recognition system built upon Inception V2. We achieve better performance and energy efficiency than GPUs and a previous implementation of GoogLeNet on FPGAs [17].
The rest of the paper is organized as follows. In Section 2, we introduce the background of the Inception architecture and the two fast convolution algorithms. In Section 3, we perform design space exploration on the two algorithms to identify the sweet spots for adopting each algorithm. Section 4 presents the resource allocation scheme along with an algorithm for fast optimization. Section 5 describes the Inception module IP. The overall implementation on the FPGA and additional optimization techniques, along with the evaluation of results, are presented in Section 6.
2. Background
2.1. Inception Module
In recent years, we have seen a boom in highly effective CNN architectures. One trend is that many networks split convolution layers into several branches, which may contain convolution kernels of different sizes and depths. These branches are often merged through concatenation or summation. Such a topology enhances the model’s expressivity and enables the network to be several times deeper. The Inception module is the first architecture that employs such a forking mechanism. One Inception module is composed of pooling and $1\times1$, $3\times3$, and $5\times5$ convolutions. At the top, the results of the different convolution branches are concatenated together. The Inception module is illustrated in Figure 1. GoogLeNet (Inception V1) comprises nine Inception modules, and it was the best-performing CNN architecture in the 2014 ImageNet competition.
The FaceNet face recognition system [2] is based on Inception V2, which improves on V1 with batch normalization. The model is designed to output a 128-d embedding vector, and a typical face recognition pipeline is shown in Figure 2. The model is trained using the triplet loss function, such that the embeddings of two images of the same person have a small distance between them, while the embeddings of different persons have a large distance.
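The training objective and the verification rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the training code behind the system: the function names, the toy 2-d vectors (standing in for 128-d embeddings), and the margin value are assumptions made for the example, while the same-identity threshold of 1 is the one quoted later in Section 6.1.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on squared distances.

    The margin of 0.2 is an illustrative assumption, not a value
    taken from this paper.
    """
    gap = squared_l2(anchor, positive) - squared_l2(anchor, negative) + margin
    return max(gap, 0.0)


def same_identity(emb_a, emb_b, threshold=1.0):
    """Verification rule: two faces match if their squared distance is small."""
    return squared_l2(emb_a, emb_b) < threshold


# Toy 2-d "embeddings" (hypothetical values).
anchor, positive, negative = [0.0, 0.0], [0.1, 0.0], [1.0, 0.0]
print(triplet_loss(anchor, positive, negative))
print(same_identity(anchor, positive), same_identity(anchor, negative))
```

The loss is zero once the negative sample is farther from the anchor than the positive sample by at least the margin, which is the separation property the paragraph describes.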
2.2. Convolution in Frequency Domain
The well-known convolution theorem states that spatial convolutions are equivalent to pairwise multiplications in the frequency domain. Assuming the convolution input is a feature map, there are kernels with size . The spatial convolution’s computation complexity is then . The 2D real-valued FFT has log-linear complexity in the number of input elements. During inference, weights are usually loaded only once; therefore, the FFT of the weights (CONV kernels) can be done offline. Thus, the computation complexity of FFT-based convolution during inference consists of three parts: the 2D FFT of the feature map, the pairwise complex-number multiplication, and the inverse FFT of the result. The overall computation complexity is given by:
(1) 
And the theoretical speed up is presented as:
(2) 
when the number of feature maps (L) and the number of kernels (K) are both large. The feature map size (M) enters only on a log scale, so its impact is minimal. According to Equation 2, a larger kernel size leads to a more significant speedup.
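The three-step structure (forward FFT, pairwise multiplication, inverse FFT) can be checked numerically with a small sketch. The code below is a 1D, pure-Python illustration under stated assumptions: it uses a textbook recursive radix-2 FFT rather than the hardware engine described in this paper, all function names are hypothetical, and it computes a full linear convolution (the kernel flip that distinguishes convolution from the cross-correlation used in CNN layers is not modelled).

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def direct_conv(signal, kernel):
    """Full linear convolution computed directly in the spatial domain."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, w in enumerate(kernel):
            out[i + j] += s * w
    return out


def fft_conv(signal, kernel):
    """Same result via the convolution theorem, with zero-padding to 2^k."""
    out_len = len(signal) + len(kernel) - 1
    size = 1
    while size < out_len:
        size *= 2
    a = fft(list(signal) + [0.0] * (size - len(signal)))   # transform input
    b = fft(list(kernel) + [0.0] * (size - len(kernel)))   # can be done offline
    prod = [p * q for p, q in zip(a, b)]                   # pairwise multiply
    y = fft(prod, inverse=True)                            # inverse transform
    return [v.real / size for v in y[:out_len]]


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
kernel = [1.0, 0.0, -1.0]
print(direct_conv(signal, kernel))
print(fft_conv(signal, kernel))
```

The pairwise multiplication costs one complex product per padded element regardless of the kernel width, which is why the relative benefit grows with kernel size.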
2.3. Winograd Minimal Filtering Convolution
Another fast convolution method is based on the Winograd minimal filtering algorithm [19]. The algorithm reduces the number of multiplications at the expense of additional additions and constant multiplications. Take $F(2,3)$ as an example. The standard convolution consumes $2\times3=6$ multiplications, whereas the Winograd algorithm uses 4 multiplications. It uses 9 more constant multiplications, but they can be implemented as bit shifts and additions and are thus much cheaper. The 2D Winograd algorithm is obtained by nesting the minimal 1D algorithm. In general, a 2D Winograd algorithm can be represented by the following equations.
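(3) $U = G g G^{T}, \quad V = B^{T} d B, \quad Y = A^{T} \left[ U \odot V \right] A$
where $g$ and $d$ refer to the original weight tile and feature map tile respectively, $G$, $B$, and $A$ are transform matrices generated by the Cook-Toom algorithm, $U$ and $V$ are the transformed weight tile and feature map tile, $\odot$ denotes element-wise multiplication, and $Y$ is the output tile. For example, $F(2\times2, 3\times3)$ consumes 16 multiplications, at the expense of 84 additional operations, yielding a 2.25x reduction compared to the 36 multiplications of standard convolution.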
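As a sanity check on the transform-multiply-transform structure, the following pure-Python sketch evaluates the 2D algorithm for a $2\times2$ output tile and a $3\times3$ kernel and compares it with direct computation. The transform matrices are the commonly published ones for this tile size (as given in the fast-algorithms literature cited above); the helper names are hypothetical, and the code is an illustration rather than the HLS engine of this work.

```python
# Commonly published transform matrices for a 2x2 output tile and 3x3 kernel.
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]   # B^T
G = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]                                # A^T


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def winograd_2x2_3x3(d, g):
    """d: 4x4 feature map tile, g: 3x3 weight tile, returns the 2x2 output."""
    U = matmul(matmul(G, g), transpose(G))        # transformed weights, 4x4
    V = matmul(matmul(BT, d), transpose(BT))      # transformed input, 4x4
    M = [[U[i][j] * V[i][j] for j in range(4)]    # the 16 multiplications
         for i in range(4)]
    return matmul(matmul(AT, M), transpose(AT))   # inverse transform, 2x2


def direct_2x2_3x3(d, g):
    """Reference result: 36 multiplications, sliding-window correlation."""
    return [[sum(d[i + p][j + q] * g[p][q] for p in range(3) for q in range(3))
             for j in range(2)] for i in range(2)]


d = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
g = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
print(winograd_2x2_3x3(d, g))
print(direct_2x2_3x3(d, g))
```

Only the element-wise product involves general multiplications; the surrounding transforms use additions and multiplications by constants such as 0.5, which is what makes them cheap in LUT-based logic.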
3. Systemic Characterization for Fast Convolution Algorithms
The two fast algorithms deliver remarkable speedups over conventional convolution, but theoretical analysis and previous experiments [15, 16] have shown that the two algorithms have different optimal design points. In theory, the FFT-based method provides a greater speedup when the kernel size is larger. On the other hand, prior studies indicate that the speed improvement of the Winograd algorithm diminishes quickly as the kernel size grows, because the transformation overhead increases quadratically and offsets the savings in multiplications.
Recent deep CNN structures contain multiple parallel branches with different kinds of convolutions. Therefore, a single efficient algorithm cannot provide the best optimization. Consequently, we come up with an innovative heuristic to design a hybrid accelerator that incorporates both fast algorithms to cover different workloads for better performance. In order to find a good strategy of using different algorithms, we systematically conduct studies for different implementations of FFT and Winograd CONV with different design parameters, and use latency cycle count as our performance metric. In our experiments we consider kernel sizes, feature map sizes, and input/output dimensions shown in Table 1. The empirical latency model obtained from the study can be used as a guidance to choose different algorithms for different network architectures.
Dimensions  Sizes evaluated 

kernel sizes  3, 5, 7 
feature map sizes  6, 12, 24 
input/output dimensions  16, 32, 64, 128 (combinations) 
For Winograd-based convolution with larger kernels, we evaluate , and . For our implementation of FFT-based convolution, since radix-2 FFT inputs must have power-of-two sizes, we zero-pad the inputs to sizes 8, 16, and 32 for input sizes 6, 12, and 24, respectively.
One observation is that the kernel size does not affect FFT’s absolute performance in general, because the kernel and the input need to be zero-padded to the same size. There is one exception when input size is and kernel size is and . For these particular parameter combinations, we pad it to instead of , to retain higher numerical precision, as observed in our experiments. The padding overhead is significant, leading to similar performance as input, and thus a smaller speedup compared to the other algorithms.
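The default padding rule for the radix-2 FFT can be written as a short helper. This is a sketch with hypothetical function names; it covers only the default rule (round the input width up to the next power of two) and does not model the exception discussed above.

```python
def next_pow2(n):
    """Smallest power of two that is greater than or equal to n."""
    size = 1
    while size < n:
        size *= 2
    return size


def zero_pad(tile, size):
    """Zero-pad a square 2D tile (list of rows) to size x size."""
    rows = [list(row) + [0] * (size - len(row)) for row in tile]
    rows += [[0] * size for _ in range(size - len(tile))]
    return rows


for width in (6, 12, 24):
    print(width, "->", next_pow2(width))

padded = zero_pad([[1] * 6 for _ in range(6)], next_pow2(6))
print(len(padded), len(padded[0]))
```

The same helper applies to kernels, which have to be padded to the size of the padded input before they are transformed.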
We use Vivado HLS to implement the algorithms on the Xilinx VU9P FPGA. The implementations of the FFT and Winograd CONV engines are illustrated in Figures 3 and 4. Each engine has its own transformation matrices, followed by MAC arrays that compute the pairwise multiplications. Our goal here is to evaluate the algorithmic impact, so we aim to eliminate differences in hardware resource usage as much as we can. We notice that both fast algorithms exhibit a similar transform-compute-transform computation pattern, and each transformed feature map can be reused for multiple pairwise multiplications with pre-transformed kernels, generating multiple partial results in parallel. Thus, one simple way to control resource usage is to set the feature map reuse factor through HLS UNROLL. We keep the same reuse factor so that the different algorithms use a similar amount of resources for a fair comparison. However, as shown in Figure 5, which presents the normalized resource usage, different algorithms have different resource preferences. For example, the FFT-based algorithm uses 60% of the DSPs but consumes as much as 2.2x the LUTs of the baseline, which is implemented using a conventional loop-optimization method [14]. Also, BRAM usage is affected by the number of output channels, since the engine buffers intermediate results to avoid unnecessary IFFTs. In general, the fast algorithms prefer LUTs for implementing the transformation operations and save DSPs thanks to the reduced number of multiplications.
The results are shown in Figures 6, 7, and 8. The X-axis shows the input/output channel sizes and the 2D feature map sizes (6 denotes $6\times6$, etc.) of the convolution, and the Y-axis shows the normalized performance of each convolution method, measured in simulation cycle count. Across the three figures, orange curves represent the speedup of Winograd-based convolution over the baseline, and grey curves represent the speedup of the FFT-based method over the baseline. From the figures we learn that for small kernels, the Winograd algorithm dominates in performance. For larger kernel sizes, FFT-based convolution starts to catch up in speed, because the cost of the Winograd transformation starts to dominate. Starting from kernel size and feature map size , FFT starts to gain an advantage over Winograd. When kernel size is and the input/output depth is large, the FFT method outperforms the Winograd method by up to 2x. Considering that FFT generally has a higher resource usage overhead with the same unroll factor, it would be wise to apply FFT when kernel size is at least and input size is larger than 12. We summarize the empirical results in Table 2, which serves as a decision table for selecting a fast algorithm depending on the input configuration.
kernel size \ feature map size  $6\times6$  $12\times12$  $24\times24$
$3\times3$  Winograd  Winograd  Winograd
$5\times5$  Winograd  Winograd  FFT
$7\times7$  Winograd  FFT  FFT
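A decision table of this kind maps naturally onto a lookup. The sketch below is hypothetical in several respects: the row and column labels are an assumption (kernel widths 3, 5, 7 and feature map widths 6, 12, 24, in ascending order, taken from the sizes evaluated in Table 1), the function name is invented for the example, and falling back to conventional convolution for configurations outside the table is a choice made here rather than a rule stated in the text. The stride check reflects the later remark that the fast methods do not support stride 2.

```python
# (kernel width, feature map width) -> algorithm.
# ASSUMPTION: rows are kernel sizes 3/5/7 and columns are feature map
# sizes 6/12/24, both in ascending order.
DECISION_TABLE = {
    (3, 6): "winograd", (3, 12): "winograd", (3, 24): "winograd",
    (5, 6): "winograd", (5, 12): "winograd", (5, 24): "fft",
    (7, 6): "winograd", (7, 12): "fft", (7, 24): "fft",
}


def select_algorithm(kernel, fmap, stride=1):
    """Pick a convolution algorithm for one layer configuration."""
    if stride != 1:
        # Fast algorithms are not applied to strided convolutions.
        return "conventional"
    # Hypothetical fallback for configurations not covered by the table.
    return DECISION_TABLE.get((kernel, fmap), "conventional")


for cfg in [(3, 24), (5, 24), (7, 12)]:
    print(cfg, select_algorithm(*cfg))
print(select_algorithm(5, 24, stride=2))
```

In a design flow, a lookup like this would run once per branch before the resource allocation step, so that each branch is costed with the algorithm it will actually use.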
4. Resource Allocation for Minimal Latency Considering Intra-module Parallelism
With the empirical model, we further develop a resource partition algorithm, which is critical to achieving minimal latency when mapping the entire network onto FPGAs. Zhang’s work [14] uses the Cauchy inequality to prove that, in order to optimize the overall latency of an entire network, we should partition the resources according to the equation , where and represent the computation complexity of layers i and j, and and are the calculated ideal resource allocations for layers i and j. Applying this equation with the constraint , one is able to find the optimal resource partition between layers that achieves minimum latency.
However, such a framework does not consider the properties of Inception-like CNN structures with parallel branches, and does not provide a solution for intra-module resource allocation. In such a structure, latency is constrained by the longest branch. We exploit the fact that the branches in such a topology have no dependencies on one another and can thus execute concurrently. In order to minimize the latency, we prorate the resources according to the computation complexity of each branch: , where and indicate different branches.
To quickly find the most appropriate parallel factors for branched structures, we put forward a resource allocation scheme explorer, as shown in Algorithm 1. We first calculate the ideal allocation for each module by solving the layer-level partition for every layer in the whole network model; then, within each layer, we solve the branch-level partition for each branch. In FPGA implementations, a common way to utilize resources more efficiently is to specify parallel factors that launch multiple computing engines (Winograd or FFT CONV engines in our case) concurrently. To avoid on-chip memory port contention, arrays must be partitioned in proportion to the parallel factors. We restrict our parallel factors to powers of two, such as 4, 8, and 16. This is intended to boost computing efficiency in hardware and to avoid misaligned parallelism between neighbouring layers (in an HLS implementation, array partition factors should be consistent between adjacent layers). The proposed algorithm generates a resource allocation scheme for the different branches that approaches the theoretical optimum. By analyzing the computation demands of the branches, we obtain the normalized computation complexity (lines 1 to 3) and then generate the ideal resource allocation for each branch. Since hardware implementation favors powers of two, we truncate the ideal allocation to a more realistic power-of-two option. From line 7 to line 17, we use a while loop to fine-tune the resources allocated to each branch. If the gap between the ideal and realistic resource allocation for a branch is nonzero and more resources are still available, we double the current resource allocation of that branch (lines 10 and 11) to close the gap, starting with the largest gap first so as to fulfill the computation demand of the most critical branch.
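The branch-level procedure described above can be sketched as follows. This is one reading of the prose, not a transcription of Algorithm 1: the function names, the abstract notion of "resource units", the tie-breaking between equal gaps, and the requirement that every branch's ideal share be at least one unit are assumptions made for the example.

```python
def floor_pow2(x):
    """Largest power of two that does not exceed x (x >= 1)."""
    p = 1
    while p * 2 <= x:
        p *= 2
    return p


def allocate_branches(complexities, total_units):
    """Split total_units among parallel branches in power-of-two chunks."""
    total_c = sum(complexities)
    # Ideal share: proportional to each branch's computation complexity.
    ideal = [c / total_c * total_units for c in complexities]
    if min(ideal) < 1:
        raise ValueError("sketch assumes every ideal share is at least 1 unit")
    # Realistic starting point: truncate each share down to a power of two.
    alloc = [floor_pow2(x) for x in ideal]
    used = sum(alloc)
    while True:
        # Branches still below their ideal share whose doubling still fits.
        candidates = [i for i, (want, have) in enumerate(zip(ideal, alloc))
                      if want > have and have <= total_units - used]
        if not candidates:
            break
        # Serve the branch with the largest remaining gap first.
        i = max(candidates, key=lambda k: ideal[k] - alloc[k])
        used += alloc[i]      # doubling costs as many units as already held
        alloc[i] *= 2
    return alloc


print(allocate_branches([3, 5], 16))
print(allocate_branches([1, 4, 2, 1], 64))
```

In the first call the ideal shares are 6 and 10 units, which truncate to 4 and 8; the four spare units can only fund a doubling of the first branch, so the loop ends with 8 units each. Since the module latency is set by the slowest branch, the estimate drops from max(3/4, 5/8) to max(3/8, 5/8).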
5. Inception Module IP
We use Vivado HLS to implement a C++ template-based reconfigurable Inception module IP, which includes all the optimization techniques discussed in the sections above. Using the Inception engine, a face recognition network based on Inception V2 is mapped onto a Xilinx VU9P FPGA development board. The Inception module IP is implemented as a template function containing sub-functions, such as the $1\times1$ CONV, that represent the parallel branches in the Inception module. Although there are no data read/write dependencies between the sub-functions, by default Vivado HLS will not schedule sub-functions to execute concurrently if they read from or write to the same array (even when they write to different locations of the array). To solve this problem, we explicitly implement functions that copy the input feature maps to different buffers, have each sub-function write to its own isolated buffer, and then concatenate the results at the end. Algorithm 2 describes the implementation, where CONFIG stands for an ensemble of reconfigurable variables such as array sizes, flags for the existence of sub-modules, and unroll factors. When Vivado HLS reads the code, it instantiates sub-modules according to the flags in the if statements. With the Inception module IP, we instantiate Inception modules that fit different input/output and CONV sizes by passing template parameters, and construct the entire FaceNet system.
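The copy, compute, and concatenate structure can be illustrated with a small behavioural model. The sketch is written in Python rather than HLS C++, so it only mirrors the dataflow and says nothing about scheduling; the function name, the branch names, and the dictionary-based configuration are hypothetical stand-ins for the template parameters and CONFIG flags described above.

```python
import copy


def inception_module(input_maps, branches, config):
    """Behavioural model: private input copy per branch, then concatenation.

    input_maps: list of channels, each a 2D list.
    branches:   list of (name, function) pairs; each function maps a list of
                channels to a list of channels.
    config:     dict of flags saying which sub-modules exist.
    """
    outputs = []
    for name, branch_fn in branches:
        if not config.get(name, False):
            continue                                  # sub-module not instantiated
        private_input = copy.deepcopy(input_maps)     # per-branch input buffer
        outputs.append(branch_fn(private_input))      # isolated output buffer
    merged = []
    for out in outputs:                               # concatenate along channels
        merged.extend(out)
    return merged


def double(maps):
    """Toy stand-in for a convolution branch."""
    return [[[2 * v for v in row] for row in ch] for ch in maps]


branches = [("conv1x1", lambda maps: maps), ("conv3x3", double),
            ("pool", lambda maps: maps)]
config = {"conv1x1": True, "conv3x3": True, "pool": False}
print(inception_module([[[1, 2], [3, 4]]], branches, config))
```

Because no branch shares an input or output buffer with another, nothing in the model prevents the loop iterations from running at the same time, which is the property the explicit copy and concatenate functions are meant to expose to the HLS scheduler.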
6. System Implementation and Evaluation
6.1. Data Quantization and Numerical accuracy
In recent years, researchers have shown that neural networks are exceptionally robust to low-precision computation [20]. In this work, we explore both 16-bit and 8-bit fixed-point representations. We measure the numerical error of the network with regular convolutions and with fast algorithms, whose implementation configuration is listed in Table 4. We take the floating-point embedding as the ground truth and measure the squared distance between the output computed with quantized data and the output computed with floating-point data, since this metric is used in the FaceNet system to decide whether two faces belong to the same identity. Results are shown in Table 3. First, we find that 8-bit fixed-point weights are adequate to avoid losing too much accuracy compared with floating-point results. The network with fast algorithms and fixed 16-bit and fixed 8-bit values generates distance errors at the $10^{-4}$ and $10^{-1}$ magnitude, respectively, compared to the floating-point version, which is tolerable in terms of face verification accuracy (the threshold for same identity is set to 1). In our experiment with 100 face pairs at size , 12 pairs are classified differently from the floating-point result with 8-bit fast algorithms, and only one pair is mismatched with the 16-bit fast algorithm. We believe that accuracy can be further restored through retraining, as shown in previous works [21, 22].
FIX16 L2 error regular CONV  FIX8 L2 error regular CONV  FIX16 L2 error fast CONV  FIX8 L2 error fast CONV
7.024e-5  9.989e-2  1.232e-4  2.031e-1
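The error metric can be illustrated with a small fixed-point model. The sketch below only shows how rounding to a fixed-point grid and the squared-distance metric interact on a single vector; it does not reproduce the network-level experiment, and the function names, the sample values, and the split between integer and fractional bits are assumptions made for the example.

```python
def quantize(value, total_bits, frac_bits):
    """Round to signed fixed point with saturation, return as a float."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    q = max(lo, min(hi, round(value * scale)))
    return q / scale


def squared_l2_error(reference, approx):
    """Squared distance between a reference vector and its approximation."""
    return sum((r - a) ** 2 for r, a in zip(reference, approx))


# Hypothetical embedding values; the bit splits (12 and 6 fractional bits)
# are illustrative choices, not the formats used in this work.
embedding = [0.3, -0.12, 0.57, 0.041]
fix16 = [quantize(v, 16, 12) for v in embedding]
fix8 = [quantize(v, 8, 6) for v in embedding]
print(squared_l2_error(embedding, fix16))
print(squared_l2_error(embedding, fix8))
```

The coarser 8-bit grid gives a visibly larger distance error than the 16-bit grid, which matches the ordering of the columns in Table 3.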
6.2. Implementation
We implement our design targeting a Xilinx VU9P FPGA. The detailed implementation scheme of our Inception modules with hybrid algorithms is presented in Table 4, where “-” means the sub-branch does not exist in the module.
Inception #  $3\times3$ branch  $5\times5$ branch
Inception 2  Winograd  -
Inception 3a  Winograd  conventional
Inception 3b  Winograd  FFT-based
Inception 3c  conventional  conventional
Inception 4a  Winograd  FFT-based
Inception 4e  conventional  conventional
Inception 5a  Winograd  -
Inception 5b  Winograd  -
We start with the conventional, unoptimized 6-loop convolution as our baseline. During optimization, we use Algorithm 1 to iteratively optimize the sub-branches and apply fast algorithms when possible. We use different Winograd configurations in earlier and later modules: although the configuration used in the earlier modules theoretically gives more performance gain, the input feature maps in later modules become so small that using it would sample many padded zeros, impairing both accuracy and performance. We adopt FFT-based convolution in one branch of Inception 3b and Inception 4a because that branch becomes the critical path once the other convolution branch is well optimized. The CONVs in Inception 3c and Inception 4e have stride 2, which the fast methods do not support, so we optimize them with conventional methods [14].
Our implementation uses 16-bit fixed point for both weights and feature maps, and the operating frequency is 200 MHz. The resource consumption and simulation performance results are shown in Table 5 and Table 6.
Resource  BRAM  DSP  FF  LUT
Our work  3067  2041  539422  938159
Utilization  71%  32%  23%  79%
We implement our design with Vivado HLS 2017.1 and find that our implementation tends to use more LUTs, for the following reasons. First, both the FFT and Winograd transformations consume LUTs because multiplications are reduced to either additions or constant multiplications, which are implemented using LUTs. Second, the control logic in the Inception engine IP is more complicated than in a conventional convolution implementation, and thus takes up more LUTs.
6.3. Evaluation
We first compare our FPGA implementation with GPU results. We use a cutting-edge Pascal-based NVIDIA GTX 1080 GPU, which has a peak performance of 8.9 TFLOPS. The GPU implementation runs on Torch with CUDA 8.0. We also compare our work with the results reported in Zhang’s work (FPGA2017) [17] and DiCecco’s work (FPT2016) [23], both of which evaluated GoogLeNet (Inception V1). These two works also implement fast convolution algorithms: FPGA2017 implements an Overlap-Add FFT convolver on a CPU + FPGA system, and FPT2016 implements Winograd convolution on a Xilinx Virtex-7 board. Our implementation is in fact Inception V2, which is the original Inception with a batch normalization layer added after each CONV layer, and thus has slightly more computation. The results are shown in Table 6. We use inference latency as the evaluation metric, since facial recognition/verification is a latency-critical task. For works that do not report latency, we calculate the single-image latency by dividing the number of operations of the entire network by the reported GOPS. Our results show that, compared with the GPU, we achieve a 3.75x latency improvement. Among FPGA works, we achieve superior results, with 3.53x and 8.11x latency speedups over FPGA2017 and FPT2016, respectively. We also achieve 4.68x better energy efficiency than FPGA2017.
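The latency conversion used for works that report only throughput is simple arithmetic, sketched below. The function names and the numbers in the example are hypothetical placeholders, not values from Table 6.

```python
def latency_ms(total_ops, gops):
    """Single-image latency in milliseconds from operation count and GOPS."""
    return total_ops / (gops * 1e9) * 1e3


def speedup(baseline_latency, our_latency):
    """How many times faster our design is than the baseline."""
    return baseline_latency / our_latency


# Hypothetical example: a 3 GOP network on a 300 GOPS accelerator,
# compared against a design with 4 ms latency.
other = latency_ms(3e9, 300)
print(other, speedup(other, 4.0))
```

Note that this conversion assumes the reported GOPS is sustained over the whole network, so it gives an estimate rather than a measured latency.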
7. Conclusions
In this paper, we explore different fast convolution algorithms, including the Winograd minimal filtering algorithm and the FFT-based algorithm, and find the best strategy for applying them to different types of convolutions. We implement a configurable IP-based end-to-end CNN accelerator targeting FaceNet (Inception V2) using C-based HLS. Our solution surpasses both the NVIDIA GTX 1080 GPU and previous FPGA results. We envision that such a face recognition system can be paired with multiple low-power video capture systems, with the FPGA deployed in a central server close to the database, for fast real-time multi-face recognition and verification, to satisfy the needs of security, border control, and other related applications.
Acknowledgements.
This work is supported by the IBM-Illinois Center for Cognitive Computing Systems Research (C3SR), a research collaboration as part of the IBM AI Horizons Network. We also thank Kyle Rupnow of Inspirit IoT Inc. for helpful discussions.
References
[1] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
[2] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, 2015.
[3] Yuhong Li, Xiaofan Zhang, and Deming Chen. CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes. In CVPR, 2018.
[4] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In ACM SIGPLAN Notices, 2014.
[5] Jiantao Qiu, Jie Wang, et al. Going deeper with embedded FPGA platform for convolutional neural network. In FPGA, 2016.
[6] Su Liu, Alexandros Papakonstantinou, Hongjun Wang, and Deming Chen. Real-time object tracking system on FPGAs. In SAAHPC, 2011.
[7] Kyle Rupnow, Yun Liang, Yinan Li, Dongbo Min, Minh Do, and Deming Chen. High level synthesis of stereo matching: productivity, performance, and software constraints. In FPT, 2011.
[8] Chun He, Alexandros Papakonstantinou, and Deming Chen. A novel SoC architecture on FPGA for ultra fast face detection. In ICCD, 2009.
[9] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In FPGA, 2015.
[10] Kalin Ovtcharov, Olatunji Ruwase, Joo-Young Kim, Jeremy Fowers, Karin Strauss, and Eric S Chung. Accelerating deep convolutional neural networks using specialized hardware. Microsoft Research Whitepaper, 2015.
[11] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[12] Masoud Abdi and Saeid Nahavandi. Multi-residual networks. arXiv preprint arXiv:1609.05672, 2016.
[13] Yongming Shen, Michael Ferdman, and Peter Milder. Overcoming resource underutilization in spatial CNN accelerators. In FPGA, 2016.
[14] Xiaofan Zhang, Xinheng Liu, Anand Ramachandran, Chuanhao Zhuge, Shibin Tang, Peng Ouyang, Zuofu Cheng, Kyle Rupnow, and Deming Chen. High-performance video content recognition with long-term recurrent convolutional network for FPGA. In FPL, 2017.
[15] Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. Fast convolutional nets with fbfft: A GPU performance evaluation. arXiv preprint arXiv:1412.7580, 2014.
[16] Andrew Lavin and Scott Gray. Fast algorithms for convolutional neural networks. In CVPR, 2016.
[17] Chi Zhang and Viktor K Prasanna. Frequency domain acceleration of convolutional neural networks on CPU-FPGA shared memory system. In FPGA, 2017.
[18] Utku Aydonat, Shane O’Connell, Davor Capalija, Andrew C Ling, and Gordon R Chiu. An OpenCL (tm) deep learning accelerator on Arria 10. arXiv preprint arXiv:1701.03534, 2017.
[19] Shmuel Winograd. Arithmetic complexity of computations. CBMS-NSF Regional Conference Series in Applied Mathematics, vol. 33. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 1980.
[20] Philipp Gysel, Mohammad Motamedi, and Soheil Ghiasi. Hardware-oriented approximation of convolutional neural networks. arXiv preprint arXiv:1604.03168, 2016.
[21] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[22] Xiaofan Zhang, Anand Ramachandran, Chuanhao Zhuge, Di He, Wei Zuo, Zuofu Cheng, Kyle Rupnow, and Deming Chen. Machine learning on FPGAs to face the IoT revolution. In ICCAD, 2017.
[23] Roberto DiCecco, Griffin Lacey, Jasmina Vasiljevic, Paul Chow, Graham Taylor, and Shawki Areibi. Caffeinated FPGAs: FPGA framework for convolutional neural networks. In FPT, 2016.