From recognition to reasoning, convolutional neural networks have attained impressive accuracies in a broad range of applications such as mobile robotics, natural language processing, information retrieval and speech recognition. In 2014, VGG-Net, a network which became very popular, suggested some standards including uniform filters/kernels of size 3X3 across all layers, as stacks of such filters could emulate the effect of larger receptive fields. This reinforced the notion that convolutional neural networks have to be deep in order for the hierarchical representation of visual data to work.
General purpose processors are not able to fully exploit the inherent inter-output and intra-output parallelism of convnets, hence specialized hardware accelerators such as GPUs, FPGAs and ASICs are gaining popularity. In fields like mobile robotics, which usually have stringent energy constraints, the reconfigurability and higher energy efficiency of FPGA-based implementations have made them an attractive alternative.
The major bottleneck while implementing huge networks on an FPGA is meeting the high memory throughput requirement of CNNs with limited on-chip memory. Traditional implementations of CNNs evaluate the network layer by layer and off-load data intermittently to a larger external memory, which significantly decreases throughput because of the limited data transfer bandwidth.
The computation pattern of CNNs is similar to iterative stencil loops (ISLs), for which data dependencies span across multiple layers and iterations. Convnet layers are characterised by uniform spatial dependencies, domain narrowness and uniform inter-iteration dependencies. Prior works have adapted ISL computation techniques to pipeline the dataflow across different convnet layers. Since the spatial data flow across layers depends on very few data values, it is not necessary to wait for the entire intermediate output to be computed before starting to process the next layer. This fact was exploited in Fused layer CNN, which restructured the computation to significantly reduce external memory access.
In our paper, we leverage the fact that the reverse is also true: a particular input influences only a limited neighborhood of the intermediate output layers. So once these outputs are computed, that particular input can be discarded, as shown in Fig. (1). Using techniques like line buffer windowing and depth based concatenation, our architecture is 2.78X faster than Fused layer CNN. Specifically, we make the following contributions:
We propose depth concatenation in both input data and filter weights, i.e. data values across depth are concatenated adjacent to each other so that they can be moved together across buffers. Since most of the computations along depth for each layer are independent and can occur concurrently, depth flattening minimises the lag due to serial data flow along depth.
We have modified the data flow pattern of CNNs for a constrained bandwidth setup by fusing across layers using the architectural pattern of line buffering. Line buffers help maximize data re-use by storing the serial input data stream and intermediate computation results in small on-chip BRAM buffers. The pipelined structure allows a value of the next layer to be computed as soon as the values it depends on have been computed, and discards an input as soon as the corresponding outputs have been computed, thus eliminating recomputation and optimizing memory resources.
These contributions have enabled the design of our pipelined, high throughput DeCoILFNet accelerator, which is very efficient in its utilization of FPGA resources. We have evaluated our accelerator on VGG-like networks, with VGG-16 as the representative. Compared to a state-of-the-art CNN FPGA accelerator, our accelerator performs 2.6X faster on average and reduces external memory access by 11.5X. Compared to Fused CNN, our accelerator performs 2.78X faster with a slight increase in off-chip memory access. We are 30X faster than the CPU-caffe implementation and almost reach the speed of GPU-caffe implementations.
II Related Work and Motivation
There are two major components of computation in Convolutional Neural Networks: the forward pass and the backward pass. While training iteratively, the network performs repeated forward and backward passes to refine weights until the desired accuracy is achieved. Since only the forward pass is required for recognition, many application designers train networks offline and use the trained weights to perform time-sensitive jobs on energy constrained devices.
Recent developments in the deep learning community have shown that the fully connected layers can be removed with no degradation in performance. Under these circumstances, works which focus on accelerating convolution layers have gained prominence. However, as the networks are getting heavier, the on-chip memory of FPGAs is becoming insufficient to store the huge intermediate outputs. Conventional works have focused on designing CNN accelerators which iteratively process the CNN layers and off-load the intermediate data to external memory. This involves extensive and unnecessarily repeated read and write accesses, so the limited external bandwidth is a challenge for designing efficient accelerators. In CNNs, the input and output feature volume is larger for the initial layers and gradually reduces. In the later layers, the memory occupied by weights dominates as the depth increases. Thus redesigning the data flow for the initial layers significantly reduces the overall external memory accesses, which decreases the overall computation latency and power. Inspired by image processing pipelines that minimize memory bandwidth using the architectural pattern of line buffering, our DeCoILFNet uses small on-chip buffers to pipeline the computations within and across the layers, increasing throughput and eliminating unnecessary communication with off-chip DDR. Our architecture has been optimized for a bandwidth constrained setup to the point that the restricted external memory access is no longer the bottleneck.
III DeCoILFNet Architecture
In the following sections we describe in detail the optimizations in different modules of DeCoILFNet accelerator.
Though our accelerator is generic, for ease of explanation we use the following test example: an input image of 5*5*3 (l*b*d) and two fused convolution layers, both with stride s=1, padding p=1, k=3 filters and kernel size 3X3 (wXw), followed by a pooling layer with a 2X2 window and stride 2.
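To make the test example concrete, the following minimal sketch traces the feature-volume shape through the two convolutions and the pooling layer. The helper name `conv_output_size` is hypothetical, and floor rounding for the pooling layer (an odd trailing row or column is dropped) is an assumption that the text does not state.

```python
def conv_output_size(size, kernel, stride, padding):
    """Output size along one spatial axis for a sliding window (floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1


# Test example from the text: 5*5*3 input, two 3X3 convolutions
# (stride 1, padding 1, k = 3 filters each), then 2X2 pooling with stride 2.
l, b, d = 5, 5, 3
shapes = [(l, b, d)]

for _ in range(2):
    l = conv_output_size(l, kernel=3, stride=1, padding=1)
    b = conv_output_size(b, kernel=3, stride=1, padding=1)
    d = 3  # output depth equals the number of filters k
    shapes.append((l, b, d))

# Pooling keeps the depth and shrinks the spatial size (floor rounding assumed).
l = conv_output_size(l, kernel=2, stride=2, padding=0)
b = conv_output_size(b, kernel=2, stride=2, padding=0)
shapes.append((l, b, d))

print(shapes)
```

With padding 1 and stride 1, both 3X3 convolutions preserve the 5*5 spatial size, which is why the intermediate line buffer can have the same width as the input line buffer in this example.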
III-A Line Buffer Windowing Module
The input to the accelerator comes in the form of a serial data stream. The first layer in CNNs is the convolution layer. For the convolution operation, we need input windows similar to the expected window shown in Fig. (2). As the input data arrives serially, a complete valid window requires the 9 values of the sliding window, which arrive sequentially. Waiting to read these values for every window would add a large and unnecessary delay to the overall computation. Therefore our line and window buffer module, shown in Fig. (2), is pipelined in such a way that we get a new window at each clock cycle after a certain initial latency.
Usually before convolution, to maintain the spatial dimensions of the output, we pad the input layer with zeros. As shown in Fig. (3), when we reach the end of a line in the line buffer, we get some invalid windows. Using our line buffer module, we are able to smoothly incorporate the padding to get padded windows, which are the input for the next convolution.
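The windowing behaviour described above can be modelled in software as follows. This is only an illustrative sketch, not the Verilog module: the function names are hypothetical, and padding is modelled by inserting zeros into the stream, whereas the hardware module produces the padded windows inside the line buffer itself.

```python
from collections import deque


def padded_stream(image, pad):
    """Yield the pixels of a 2-D image row by row, with `pad` zeros on every side."""
    width = len(image[0]) + 2 * pad
    for _ in range(pad):
        yield from [0] * width
    for row in image:
        yield from [0] * pad + list(row) + [0] * pad
    for _ in range(pad):
        yield from [0] * width


def sliding_windows(stream, width, w):
    """Yield w x w windows (stride 1) from a serial row-major pixel stream.

    Only the last (w - 1) * width + w pixels are kept, which is what
    w - 1 line buffers plus a window register would hold.
    """
    buf = deque(maxlen=(w - 1) * width + w)
    for index, pixel in enumerate(stream):
        buf.append(pixel)
        col = index % width
        # A window is valid once the buffer is full and the window does not
        # wrap around the end of a row (the invalid windows in the text).
        if len(buf) == buf.maxlen and col >= w - 1:
            yield [[buf[r * width + c] for c in range(w)] for r in range(w)]


image = [[r * 5 + c + 1 for c in range(5)] for r in range(5)]  # 5x5, values 1..25
windows = list(sliding_windows(padded_stream(image, pad=1), width=7, w=3))
print(len(windows))  # one window per output position
print(windows[0])
```

After the buffer has filled once, every new pixel that does not fall in the first w - 1 columns of a row produces a new window, which corresponds to the one-window-per-cycle behaviour after the initial latency.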
III-B Depth concatenation module: Input data and filter data flattening
The line buffer windowing module above has been described for a 2-D window, whereas volume convolution needs a 3-D window. To get a 3-D window in every cycle as well, we flatten along depth, so that the data flow is the same as before but, instead of a window of one particular depth, we get a window flattened along the third dimension. As shown in Fig. (4), the input data, after depth-flattening as a preprocessing step, is sent to DeCoILFNet as a concatenated data stream. This concatenation increases the bandwidth, as instead of reading the three 32-bit depth values in separate cycles, we read them together as one 96-bit word. The concatenated window can simply be split into three independent windows, which are sent in parallel to the convolution block. The data of the convolving filter is flattened similarly, i.e. the values along the depth are concatenated. Before computation, the concatenated data of filter 1 is split into three 2-D filters and sent to the convolution module. We have instantiated d*d = 9 filter BRAMs, with multiple filters kept one after the other, as shown in Fig. (4). The multiple BRAMs allow us to read all 9 values of one 3-D filter in parallel, thus making the filter ready for convolution in one cycle.
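The packing and splitting of depth values can be sketched as below. The function names are hypothetical, and two details are assumptions made for illustration: values are treated as unsigned 32-bit words, and the first depth slice is placed in the most significant bits.

```python
WORD_BITS = 32  # width of one data value, as in the text
MASK = (1 << WORD_BITS) - 1


def concat_depth(values):
    """Pack the values at one (row, col) position across depth into one word."""
    word = 0
    for value in values:
        word = (word << WORD_BITS) | (value & MASK)
    return word


def split_depth(word, depth):
    """Recover the per-depth values from a concatenated word."""
    return [(word >> (WORD_BITS * (depth - 1 - i))) & MASK for i in range(depth)]


word = concat_depth([0x11111111, 0x22222222, 0x33333333])
print(hex(word))             # a single 96-bit word for d = 3
print(split_depth(word, 3))  # the three 32-bit values, in depth order
```

Because the split is a pure bit selection, it costs no computation in hardware: the three windows are just different slices of the same wide word.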
III-C 3-D Convolution pipelined Module
As shown in Fig. (4), the 3-D filter and input window are split into d=3 filters and d=3 windows. We have used DSPs only for multipliers and LUTs for adders so that more computations can be performed in parallel. Both the multiplier and adder modules have an initial latency of 9 cycles, after which, because of their internal pipelining, the outputs for the next k=3 subsequent filters and input windows arrive in every cycle. Thus, the 2-D convolution module is finely pipelined, giving an output in every cycle after a latency of 9*(1 + ceil(2*log2(w))) = 45 cycles because of the cumulative effect of multipliers and adders. The d values of the 2-D convolution of each filter are added again to give the final single scalar value of the 3-D convolution in the output volume. The entire 3-D convolution module is pipelined in such a way that after an initial latency of 9*(1 + ceil(2*log2(w)) + ceil(log2(d))) = 63 cycles we get the output of the convolution of each filter with an image window in every clock cycle.
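The two latency expressions can be checked numerically with a short sketch. The function and parameter names are hypothetical; reading ceil(2*log2(w)) as the depth of an adder tree over the w*w products, and ceil(log2(d)) as the depth of the tree over the d partial sums, is our interpretation of the formula rather than something the text states.

```python
import math


def conv2d_latency(w, stage_latency=9):
    """Initial latency of the 2-D convolution: one multiplier stage plus adder stages."""
    adder_stages = math.ceil(2 * math.log2(w))  # ceil(log2(w * w))
    return stage_latency * (1 + adder_stages)


def conv3d_latency(w, d, stage_latency=9):
    """Initial latency of the 3-D convolution: the 2-D stages plus the depth adders."""
    adder_stages = math.ceil(2 * math.log2(w)) + math.ceil(math.log2(d))
    return stage_latency * (1 + adder_stages)


print(conv2d_latency(w=3))       # 45
print(conv3d_latency(w=3, d=3))  # 63
```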
Usually in CNNs, consecutive convolution layers are followed by pooling. In max pooling, a 2X2 window is slid across the input with a stride of 2. In our DeCoILFNet architecture, we use an intermediate pool line buffer for pipelining. As soon as we get the output of the previous convolution, we redirect it to the pool buffer at the current output column address. We update the output column address at every even step, and at odd steps we replace the current output with the maximum of the old value and the newly computed output. These pooled outputs are read into the next input line buffer for further computation.
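A software model of the pool line buffer might look like the sketch below. It is an approximation of the scheme described above, not the hardware implementation: `max_pool_stream` is a hypothetical name, the input is assumed to arrive as a row-major serial stream, and an odd trailing row or column is assumed to be dropped.

```python
def max_pool_stream(stream, width, window=2):
    """Max-pool a serial row-major stream with a window x window kernel
    and a stride equal to the window size."""
    out_width = width // window
    pool_buf = [None] * out_width  # the intermediate pool line buffer
    for index, value in enumerate(stream):
        row, col = divmod(index, width)
        out_col = col // window  # output column address
        if out_col >= out_width:
            continue  # odd trailing column is dropped (assumption)
        if row % window == 0 and col % window == 0:
            pool_buf[out_col] = value  # first element of a new pooling window
        else:
            pool_buf[out_col] = max(pool_buf[out_col], value)
        # The pooled value is final after the last element of its window.
        if row % window == window - 1 and col % window == window - 1:
            yield pool_buf[out_col]


print(list(max_pool_stream(range(1, 17), width=4)))
```

Pooled values only appear while the second row of each pooling window is streaming in, so there is a gap of one input row between consecutive pooled rows.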
III-E Inter-layer Fusion Pipelining
CNNs follow the pattern of iterative stencil loops, i.e. each particular input influences only a limited neighborhood of the intermediate output layers, as shown in Fig. (1). The line buffer windowing module is based on this idea: once these outputs are computed, that particular input can be replaced by the next input, which comes either from external memory or from the computed output of the previous layer. Hence in our architecture, we start processing the next layer as soon as we get the required valid inputs. As explained above for the 3-D convolution pipelined module, we get the convolution output of the intermediate layer in every cycle, for the filters one after another. As shown in the pipeline in Fig. (5), the first layer has three filters which are computed one after another. Although we get the output of one filter in every cycle, to stream the output data as serial input to the intermediate layer we need to wait for the whole volume of an output value, i.e. the results of all filters, to be computed. While this volume is being computed, the input window is kept constant until all filters have been processed. The output volume is then serially streamed to the intermediate line buffer. Here too, we need to wait for the initial filling of the intermediate line buffer before we get a valid convolvable window. This pipelining can be continued for further convolution layers. The DeCoILFNet accelerator has been pipelined such that even if multiple convolutions are fused together, the only delay comes from the initial latencies, after which we still get one output element in every step. If we fuse the pooling layer in our architecture, as explained above, we need to wait for some more clock cycles before every new pooled row. Hence our architecture works best when we have multiple consecutive convolutions.
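The size of the neighborhood that one input value influences can be worked out with a few lines of arithmetic. This sketch assumes stride-1 convolutions with a wXw kernel in every fused layer; the function name is hypothetical.

```python
def influenced_side(num_layers, w=3):
    """Side length of the square output region that a single input value can
    influence after `num_layers` stacked stride-1 convolutions with a w x w kernel."""
    return 1 + num_layers * (w - 1)


for n in range(1, 5):
    side = influenced_side(n)
    print(f"{n} fused layer(s): one input reaches a {side}x{side} output region")
```

For the two fused 3X3 layers of the test example, an input value can affect at most a 5X5 region of the second layer's output, so it is no longer needed once that region has been computed.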
IV Experimental Evaluation and Results
IV-A Programming using hardware description language: Verilog
Most of the design optimization works have been done using high level synthesis (HLS) tools, as it is easier to port code from software to a hardware implementation. The motivation behind using HLS is to avoid the need for RTL programming; nevertheless, it is still necessary to verify the HLS-generated RTL output, and in cases where verification fails, it is difficult to determine the cause of the problem. Hence, to successfully explore and implement the deep pipelining and parallelism of our design and use resources in an efficient manner, our testing and validation have been done completely in Verilog using the Vivado tool.
IV-B Experimental Setup
FPGA: Our design has been implemented on a Virtex-7 XC7V690T FPGA board (on-chip BRAM of 6.46MB, 3600 DSP slices and 693120 logic cells) with a working frequency of 120MHz. This is the same board as used by the FPGA baselines, so that our comparisons in the next section are fair. We have used the Xilinx Vivado 2017.1 tool for synthesis, placement and routing, and the results are shown in Table (I).
Baselines: We compare our design with the following baselines:
CPU-caffe: We have obtained the baseline CPU-caffe timings with respect to a 3.5GHz hexa-core Intel Xeon E7 caffe-implementation .
GPU-caffe: We have obtained the baseline GPU-caffe timings with respect to GeForce GTX 1070 (1506 MHz graphics clock and 1683MHz processor clock) caffe-implementation .
Fused layer cnn and Optimized convolution accelerator: We have compared the resources and timing of first five layers of VGG-16 for DeCoILFNet accelerator against the Fused layer cnn accelerator  and Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks  by using data from Table of .
Functional verification: We performed layer by layer functional verification of our code by comparing it with our Matlab forward pass implementation using trained weights from caffe.
IV-C Results and Comparison
In this section, we analyze the performance of our accelerator against caffe-cpu, caffe-gpu and state-of-the-art FPGA accelerators for the initial layers of VGG-16. We chose VGG-16 because modern state-of-the-art deep networks for various applications, such as Fully Convolutional Network (FCN-32s) and Segnet (web demo model), are variants of VGG-16. The common feature between them is that most of the convolution layers have kernel size 3X3, padding 1 and stride 1. These networks are also characterized by multiple consecutive convolution layers.
We first evaluate our performance with respect to the CPU-caffe and GPU-caffe implementations for the first seven layers of VGG-Net16 (5 convolution layers and 2 pooling layers). Table (II) compares the timing after every layer of VGG-Net16 for our accelerator against the software implementations running on both CPU and GPU. As visible from the table, DeCoILFNet's performance at 120MHz is comparable to the GPU and outperforms the CPU with a speedup ranging from 4.28X to 39.08X.
The speedup gained by DeCoILFNet over the CPU keeps increasing with the number of layers. This is because inter-layer fusion in the hardware accelerator allows it to start the next convolution without waiting for the whole output, and as the number of layers increases, the amount of fusion increases, resulting in better performance compared to the CPU.
Fusing of a pooling layer with convolution layer takes longer than fusing two convolution layers. Fig. (6) shows the difference in speedup obtained with and without the pooling layer. This is because for computing the pooled layer output, we need to fill up the entire line buffer initially. Thus the initial latency for pooling is higher.
Our design gives the best speedup when we have multiple consecutive convolutions. This is particularly helpful in networks like FCNs and Segnet, which follow this pattern. In order to demonstrate the performance of our hardware accelerator, we have designed our own network consisting of four consecutive convolution layers, each with 64 filters of dimension 3*3 and stride 1, and run it on the CPU, the GPU and DeCoILFNet, comparing the result after each layer. This network pattern is common in the initial layers of modern networks. As shown in Table (III), when we fuse consecutive convolution layers, we are able to attain a speedup of 76.8X with respect to the CPU and even slightly surpass the GPU speed. In general, FPGAs have a much higher performance per watt than GPUs. Modern GPUs use 10-100X more power than FPGAs. Thus, even reaching the GPU computation speed on a resource constrained FPGA increases the performance per watt significantly.
In order to compare our architecture with the current state-of-the-art hardware accelerators, we compared it with the two FPGA baselines described above for the first seven layers of VGG-Net16. Table IV compares the resource utilization of DeCoILFNet with the baseline architectures. The resource utilization and timing for both baseline implementations have been taken directly from the published baseline results. Among the three, our architecture gives the best speed, along with a significant reduction in the data volume transferred compared to the Optimized convolution accelerator. We have been able to utilize the DSPs effectively by eliminating recomputation with the help of line-buffer pipelining. The goal of our architecture is to maximize the speedup under limited external memory accesses. Depth concatenation helped us pipeline the dataflow and perform all independent computations for the first seven layers of VGG in parallel. The pipelining is also very stringent, i.e. there is no stall after the initial latency and we keep getting a continuous stream of output. Keeping these in mind, the results shown in Table IV are the best we could achieve on Virtex-7. We are able to attain more than 2X speedup in terms of clock cycles compared to both accelerators, along with a higher working frequency.
V Discussion and Trade-off
Fig. (7) shows the relation between off-chip memory accesses and the computation units when the five convolution and two pooling layers of VGG-16 are fused in different groups. We have assumed that the depth based parallelism is constant for all the cases considered. Point A represents no fusion, i.e. all intermediate outputs are stored in DDR. In this case, as is visible from the diagram, since we write back to the DDR, the computation unit of a single layer is reused for every layer, i.e. each layer is its own group. Hence the DSP utilization is minimum for this case, at the cost of the highest dataflow (23.54 MB). Point G in the diagram represents the case when all layers have been grouped and fused. Since we are computing all layers concurrently, the DSP utilization is maximum with minimum on-chip memory utilization.
Our high performance in Table IV compared to other accelerators is aided by our parallel computations across depth. Using depth concatenation allows us to perform several computations concurrently. Our depth-concatenation technique is also limited by the compute resources present on the FPGA board. As the concatenation depth increases, we need more resources to perform computations in parallel. We have used iterative decomposition to solve this problem. We divide the depth into multiple groups of parallel computation, and process these groups serially. The number of serial groups decides the factor by which our clock cycles increase, as we need to wait for the result of all groups to complete to get one output. This technique is particularly needed for the later layers of VGG-Net where we need to process inputs of depth 256 or 512.
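The effect of iterative decomposition on the cycle count can be illustrated with a small sketch. The function name and the `group_size` parameter are hypothetical; `products` stands for the per-depth partial results that would be computed in parallel within one group, and the group size of 64 in the example is chosen only for illustration.

```python
def grouped_depth_sum(products, group_size):
    """Accumulate per-depth partial results group by group.

    Each group is assumed to be processed in parallel, and the groups are
    processed serially. Returns (total, number_of_serial_groups).
    """
    total = 0
    groups = 0
    for start in range(0, len(products), group_size):
        total += sum(products[start:start + group_size])  # parallel within a group
        groups += 1
    return total, groups


# Depth 256 processed 64 values at a time: 4 serial groups, so roughly
# 4X the clock cycles of a fully parallel depth of 64.
total, groups = grouped_depth_sum(list(range(256)), group_size=64)
print(total, groups)
```

The result is the same as a single full-depth sum; only the number of serial steps changes, which is the factor by which the clock cycles increase.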
In CNNs, the input and output feature volume is large for the initial layers and gradually reduces. Data-volume considerations aside, the independence in the computation pattern of the later layers is the same as for the initial layers. Though we have demonstrated improvements for the initial layers, we believe our architecture can exploit the same data independence in the later layers to give similar gains over the baselines. For the later layers, weights dominate the memory space and the depth of the convolving filters increases significantly. Since both parallelization due to depth concatenation and layer fusion require the same compute resources, there is a trade-off between them. The number of layers fused should be maximum for the initial layers, because there the intermediate output data is huge and fusing fewer layers would mean moving a huge data volume to and from external memory. For the later layers, in contrast, the depth of the input and convolving filters increases significantly, and the subsampling layers reduce the intermediate data volume. Hence it makes more sense to allocate compute resources to parallel computations across depth for the later layers.
We presented a 'Depth Concatenation and inter-layer fusion based convnet accelerator-DeCoILFNet' which exploits the intra-layer parallelism of CNNs by flattening across the depth and combining it with inter-layer fusion. Our accelerator maximises data re-use and completely eliminates recomputation while fusing multiple convnet layers. We explained in detail the different components of our architecture and evaluated our accelerator on VGG-like networks, with VGG-16 as the representative. We demonstrated that our 120 MHz accelerator is 30X faster than a 3.5GHz hexa-core Intel Xeon E7 caffe-implementation. In addition, our design reduces external memory access by 42X, along with a speedup of more than 2X in the number of clock cycles compared to state-of-the-art FPGA accelerators.
-  Srimat Chakradhar, Murugan Sankaradas, Venkata Jakkula, Srihari Cadambi, A dynamically configurable coprocessor for convolutional neural networks, in ISCA, 2010.
-  Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, Jason Cong, Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks, in FPGA, 2015.
-  Manoj Alwani, Han Chen, Michael Ferdman, Peter Milder, Fused Layer CNN accelerators, in MICRO, 2016.
-  James Hegarty, John Brunhaver, Zachary DeVito, Jonathan Ragan-Kelley, Noy Cohen, Steven Bell, Artem Vasilyev, Mark Horowitz, Pat Hanrahan, Darkroom: compiling high-level image processing code into hardware pipelines, in ACM Transactions on Graphics (TOG) - Proceedings of ACM SIGGRAPH, 2014.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, Deep Residual Learning for Image Recognition, in CVPR, 2015.
-  M. Herbordt, T. VanCourt, Y. Gu, B. Sukhwani, A. Conti, J. Model, D. DiSabello, Achieving high performance with FPGA-based computing, in IEEE Computer, 40(3):50–57, 2007.
-  J. Sanguinetti, Understanding high-level synthesis design's advantages, in EE Times Asia, pages 1–4, 26 April 2010.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional architecture for fast feature embedding, in Proceedings of the 22nd ACM international conference on Multimedia, 2014.
-  Vincenzo Rana, Ivan Beretta, Francesco Bruschi, Alessandro A. Nacci, David Atienza, Donatella Sciuto, Efficient Hardware Design of Iterative Stencil Loops, in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2016.
-  Evan Shelhamer, Jonathan Long, Trevor Darrell, Fully Convolutional Networks for Semantic Segmentation, in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
-  Vijay Badrinarayanan, Alex Kendall, Roberto Cipolla, SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation, in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
-  Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, O. Temam, DaDianNao: A machine-learning supercomputer, in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), IEEE Computer Society, 2014.
-  S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, E. Shelhamer, cuDNN: Efficient primitives for deep learning, in CoRR, 2014.
-  Karen Simonyan, Andrew Zisserman, Very deep convolutional networks for large-scale image recognition.