1 Introduction
The applications of deep learning, such as image recognition, autonomous driving, and smart robots, are significantly changing people’s lives. Since CNNs (convolutional neural networks) were proposed, the accuracy of image recognition and the capability of video classification have improved significantly. Meanwhile, networks are getting bigger and deeper to extract more information from images and videos. From AlexNet in 2012 to ResNet in 2015, the model size has increased by 16 times [1].
To improve the efficiency of deep learning computations, methods such as pruning, weight sharing, quantization and accelerators have been proposed in recent years [1]. Pruning eliminates some of the connections between nodes in a network while keeping the inference accuracy at a similar level, which reduces the computation overhead during training and inference. Weight sharing clusters neighboring weights and generates a corresponding lookup table, reducing the storage overhead of network models. Quantization maps network weights from a higher precision to a lower precision and retrains the whole network so that accuracy is not compromised. Accelerators compute convolutions or multichannel operations in parallel or in batches with parallel algorithms and dedicated hardware, reducing the number of memory accesses and the elapsed time of inference.
GPUs (Graphics Processing Units) are one type of general deep learning accelerator, while the deep learning accelerators discussed in this paper are those deployed on ASICs (Application-Specific Integrated Circuits) or FPGAs (Field-Programmable Gate Arrays).
2 Concurrent Open-Source Projects
With regard to accelerator technology, there are two research focuses: high performance [2, 3, 4, 5, 6, 7] and reconfigurability [8, 9, 10, 11]. ASIC accelerators tend to exploit their performance advantage and higher data bandwidth, while FPGA accelerators tend to exploit their configurability to support more network types. Two large open-source FPGA accelerators currently available are NVDLA and CHaiDNN.
2.1 NVDLA by NVIDIA
NVDLA [12] is an open-source deep learning accelerator released by NVIDIA. Its source code is written in an HDL (hardware description language). It can be deployed on FPGAs, SoCs (systems on chip, e.g., the Xilinx ZYNQ series) or ASICs. When it is deployed on an FPGA, the accelerator communicates with host memory over high-speed buses such as PCIe and CAPI and is controlled by CPUs such as X86 and POWER. When it is deployed on an SoC, it accesses memory through on-chip AXI interfaces and is controlled by CPUs such as ARM. NVDLA’s main design feature is its modular, reconfigurable and scalable architecture. It supports inference precisions of INT8, FP16 and FP32. The architecture is shown in Figure 1 [12]. The red part is the NVDLA hardware. The CSB (Configuration Space Bus) receives commands from the CPU. The DBBIF (Data Backbone Interface) transfers memory data shared with system DRAM. The IRQ (Interrupt Interface) sends interrupt signals to the CPU. In NVDLA large systems, SRAMs cache data to speed up data access.
Figure 2 [12] depicts the inner core of NVDLA. The computation unit consists of the following parts: Convolution buffer and convolution core, Activation engine (SDP, Surface Data Processor), Pooling engine (PDP, Planar Data Processor), LRN engine (CDP, Channel Data Processor), Memory Reshaping Unit (RUBIK) and bridge DMA (Direct Memory Access). The commands required by the core are transferred via the CSB interface. The data is transferred via the bridge DMA interface.
However, the open-source NVDLA project supports only pregenerated loadable files and a restricted set of network types on the software side, while its parser, compiler and optimizer [12] are not open source. Users cannot deploy other networks or self-trained networks on it.
2.2 CHaiDNN by Xilinx
CHaiDNN [13] is an open-source deep learning inference accelerator released by Xilinx. The source code is written in C++, so it can only be deployed on Xilinx ZYNQ MPSoCs or UltraScale MPSoCs. The main feature of CHaiDNN is that it balances computation efficiency and accuracy with 6-bit or 8-bit quantized data. On Xilinx ZU9/ZU7 platforms, a maximum of 1024/512 on-chip DSPs can be utilized. Unlike in NVDLA, the fully connected layer, which contains the most weights in CNNs, is implemented on CPUs. The CHaiDNN project is compiled with Xilinx SDSoC. The SDSoC toolkit analyzes and partitions the C/C++ code. The parts that can be accelerated by the parallel resources of the FPGA are generated as programmable logic, while parts such as memory management, interrupt control and sequential flow control are generated for the ARM CPUs in the SoC to maximize efficiency. However, due to its platform exclusiveness, this project cannot be directly migrated onto ASICs.
3 Convolutional Neural Networks, Parallel Optimizations and Corresponding Hardware Relationships
3.1 Hardware Platform
Because ASIC migration is required, the hardware platform of this paper is an FPGA. The board is an Opal Kelly XEM6310-LX45 and the chip is a 45 nm Xilinx Spartan-6 XC6SLX45-2, as shown in Figure 5 [14]. The available hardware resources are as follows. In terms of computation resources, there are 58 Mult/DSPs and 6822 slices (each slice contains four 6-input LUTs and eight D flip-flops). In terms of storage, there are 2088 Kbit of BRAM (block RAM) in total and a 1 Gbit DDR2 RAM (16 bits wide, 10 Gb/s bandwidth) [footnote 1: Opal Kelly XEM6310 User Manual. Accessed March 6, 2019.].
Another reason to develop on this FPGA board is that it has a high-speed USB 3.0 connection and a corresponding USB 3.0 chip on board. It can achieve a maximum transfer speed of 340 MB/s, which is fast enough to load block data. Meanwhile, Opal Kelly provides FrontPanel HDL APIs so that users can program with C/C++, C#, Ruby, Python and Java [14].
3.2 Characteristics of Convolutional Neural Network Computation
CNN inference normally consists of the following layers: convolution, pooling, activation and local response normalization (LRN). The LRN layer was proposed in AlexNet, but it seldom appears in recent CNN research, and networks without it can achieve the same inference accuracy as those with it. Moreover, versions of AlexNet and GoogLeNet without LRN layers have been proposed, so LRN layers are not implemented in this work.
Figure 5 shows the convolution layer operation. The input matrix has a square surface of side input_side. The input data and the weights inside the sliding window are multiplied element-wise and summed, and the result is the corresponding value in the output matrix. The stride of each sliding operation is s. To achieve an exact division, a padding of size p is applied around the input matrix surface. Thus, the final output surface side is $(\text{input\_side} - \text{kernel} + 2p)/s + 1$, where kernel is the side of the sliding window. To extract different types of information, there are normally multiple (n) weight cubes sliding over the same input data matrix, so the output matrix has n channels. n increases as the network forwarding goes on, which means that the more information is extracted, the deeper the channel dimension becomes. Different weight cubes have different bias values. Networks like AlexNet contain fully connected layers, which are essentially 1x1 convolutions, so fully connected layers are merged into convolutional layers. Fully connected layers require a tremendous number of weights, which account for the main part of the convolution operation. Consequently, convolution layers account for the main part of the network operation.

The activation layer maps each element of the input matrix to another value. It does not change the matrix dimension but improves the representational ability of the network through a nonlinear transformation. Figure 6 shows common activation functions such as sigmoid, tanh and ReLU. ReLU is used most commonly, as its computation overhead is the smallest and it helps the network converge quickly during training. In terms of hardware implementation, ReLU is also the easiest, since it only requires checking the sign bit of a floating-point number.
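The sign-bit observation can be illustrated in software. The following is a minimal NumPy sketch, not the RTL of this design: the function name `relu_fp16_signbit` is hypothetical, and it assumes IEEE 754 half-precision storage, where bit 15 is the sign bit.

```python
import numpy as np

def relu_fp16_signbit(x):
    """ReLU on FP16 values by inspecting only the sign bit (bit 15)."""
    x = np.asarray(x, dtype=np.float16)
    bits = x.view(np.uint16)                 # reinterpret the raw 16-bit patterns
    negative = (bits >> 15) & 1              # 1 when the sign bit is set
    out_bits = np.where(negative == 1, np.uint16(0), bits)
    return out_bits.astype(np.uint16).view(np.float16)

x = np.array([-2.5, -0.0, 0.0, 1.5, 3.0], dtype=np.float16)
y = relu_fp16_signbit(x)                     # negative inputs become +0.0
```

No comparator or arithmetic unit is involved: the output is either the unchanged bit pattern or all zeros, which is why ReLU is cheap in hardware.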
Sigmoid and tanh are more complicated nonlinear functions. Since both functions are single-valued, they can be calculated with a lookup table (LUT). Figures 7 and 8 [12] show two-step lookup tables with interpolation. LUT precision is determined by the total number of lookup points and the slope of the function. The more lookup points there are, the higher the lookup precision. The steeper the function, the harder the LUT is to implement. Since hardware resources are restricted, the first step is a raw table that covers the entire domain, while the second step is a dense table with higher accuracy.
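A two-step lookup with linear interpolation can be sketched as follows. This is only an illustration of the idea: the table ranges and sizes (`RAW_X`, `DENSE_X`) are hypothetical choices, the raw table is clipped to [-8, 8] rather than covering the whole real line, and the sketch does not reproduce the NVDLA table format.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical table parameters, chosen only for illustration.
RAW_X = np.linspace(-8.0, 8.0, 33)       # raw table: coarse, wide range
DENSE_X = np.linspace(-2.0, 2.0, 65)     # dense table: fine, steep region
RAW_Y, DENSE_Y = sigmoid(RAW_X), sigmoid(DENSE_X)

def lut_sigmoid(x):
    """Look up sigmoid(x) in the dense table if possible, else in the raw one."""
    x = np.clip(np.asarray(x, dtype=np.float64), RAW_X[0], RAW_X[-1])
    coarse = np.interp(x, RAW_X, RAW_Y)       # linear interpolation, raw table
    fine = np.interp(x, DENSE_X, DENSE_Y)     # linear interpolation, dense table
    in_dense = (x >= DENSE_X[0]) & (x <= DENSE_X[-1])
    return np.where(in_dense, fine, coarse)

xs = np.linspace(-10.0, 10.0, 2001)
max_err = np.max(np.abs(lut_sigmoid(xs) - sigmoid(xs)))
```

Placing the dense table around zero, where sigmoid is steepest, spends the lookup points where the interpolation error would otherwise be largest.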
Figure 10 shows pooling operations. The input is normally the output of the previous convolution and activation. Pooling layers preserve the main features of the previous layer and reduce the amount of computation during forwarding, which also helps prevent overfitting. Common pooling operations are average-pooling and max-pooling. Average-pooling takes the mean value of the data in the sliding window, while max-pooling takes the maximum. The output surface of pooling is smaller than the input surface, while the channel dimension remains the same.
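As a reference for the two pooling modes, here is a small NumPy sketch. The helper `pool2d` and its (H, W, C) layout are assumptions made for this example, not the hardware data path.

```python
import numpy as np

def pool2d(x, kernel, stride, mode="max"):
    """Pool an (H, W, C) array over its surface; channels are untouched."""
    h, w, c = x.shape
    out_h = (h - kernel) // stride + 1
    out_w = (w - kernel) // stride + 1
    out = np.empty((out_h, out_w, c), dtype=x.dtype)
    reduce = np.max if mode == "max" else np.mean
    for i in range(out_h):
        for j in range(out_w):
            win = x[i*stride:i*stride+kernel, j*stride:j*stride+kernel, :]
            out[i, j, :] = reduce(win, axis=(0, 1))
    return out

x = np.arange(4 * 4 * 2, dtype=np.float32).reshape(4, 4, 2)
max_out = pool2d(x, kernel=2, stride=2, mode="max")   # shape (2, 2, 2)
avg_out = pool2d(x, kernel=2, stride=2, mode="avg")   # shape (2, 2, 2)
```

The surface shrinks from 4x4 to 2x2 while the two channels are kept, matching the description above.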
3.3 Optional Optimization Algorithms
Since LUTs, FFs, DSPs and BRAMs are limited on FPGAs, and area and power consumption are also concerns on ASICs, the acceleration algorithms should be chosen to achieve a hardware utilization as high as possible. The following algorithms deal with memory access and computation in CNN inference:
3.3.1 Im2col + GEMM
Im2col + GEMM is a widely used convolution method in Caffe. Figure 10 [15] shows the process. This method flattens the data matrix and the weight matrix into one dimension and stores them in memory, so that the core can finish the convolution by accessing the addresses one by one and doing multiply-accumulate operations. The matrix multiplication of the NumPy library is backed by BLAS (Basic Linear Algebra Subprograms), which is optimized for linear algebra operations. GEMM is part of Level 3 BLAS. GEMM (General Matrix to Matrix Multiplication) is the operation $C \leftarrow \alpha AB + \beta C$, which can be calculated directly and efficiently by standard computation libraries.
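The following NumPy sketch shows one way to express im2col + GEMM. The function names and the (H, W, C) input and (N, K, K, C) weight layouts are assumptions for illustration; Caffe's own implementation differs in layout and detail.

```python
import numpy as np

def im2col(x, kernel, stride):
    """Unroll every sliding window of an (H, W, C) input into one row."""
    h, w, c = x.shape
    out_h = (h - kernel) // stride + 1
    out_w = (w - kernel) // stride + 1
    cols = np.empty((out_h * out_w, kernel * kernel * c), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            win = x[i*stride:i*stride+kernel, j*stride:j*stride+kernel, :]
            cols[i * out_w + j] = win.reshape(-1)
    return cols, out_h, out_w

def conv_im2col(x, weights, bias, stride=1):
    """weights: (N, K, K, C), bias: (N,); returns (out_h, out_w, N)."""
    n, kernel = weights.shape[0], weights.shape[1]
    cols, out_h, out_w = im2col(x, kernel, stride)
    w_mat = weights.reshape(n, -1).T           # (K*K*C, N)
    out = cols @ w_mat + bias                  # one matrix multiplication
    return out.reshape(out_h, out_w, n)

rng = np.random.default_rng(0)
x = rng.standard_normal((7, 7, 4))
weights = rng.standard_normal((2, 3, 3, 4))
bias = rng.standard_normal(2)
y = conv_im2col(x, weights, bias)              # shape (5, 5, 2)
```

Note that neighboring rows of `cols` contain many repeated input values, which is the redundancy that MEC tries to reduce.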
3.3.2 MEC (Memory Efficient Convolution)
MEC is an extension of im2col + GEMM, as shown in Figure 11 [15]. Im2col cannot avoid the repeated reading of the same data by neighboring matrix multiplications, while MEC uses multiple computation units in a pipeline to solve the problem and further reduces the number of memory accesses. It essentially trades hardware resources for time. For data section A in Figure 11, the input side is 7, the kernel is 3 and the stride is 1. Since the weight values remain unchanged while sliding over the data section, we can simply read out input_side * kernel data sequentially and do pipelined convolution multiply-accumulate operations with a ’STRIDE’ of stride * kernel. This operation generates the first column of the output matrix. MEC’s advantage is more obvious when there are spare hardware resources. Its disadvantage is that the scale of the pipeline increases with the kernel size, and the computation logic depends on the stride. For example, if the stride is 2 in Figure 11, matrix multiplications Q and S will be skipped.
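The lowering described above can be sketched for a single channel as follows. This is an illustrative software model under simplifying assumptions (one channel, no padding, a square kernel); the name `mec_conv2d` is hypothetical and the sketch does not model the pipelined computation units.

```python
import numpy as np

def mec_conv2d(x, k, stride=1):
    """Single-channel convolution with MEC-style lowering.

    x: (H, W) input, k: (K, K) kernel."""
    h, w = x.shape
    kernel = k.shape[0]
    out_h = (h - kernel) // stride + 1
    out_w = (w - kernel) // stride + 1
    # Lowering: one row per output column, holding H * K input values.
    lowered = np.empty((out_w, h * kernel), dtype=x.dtype)
    for j in range(out_w):
        lowered[j] = x[:, j*stride:j*stride+kernel].reshape(-1)
    k_flat = k.reshape(-1)
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        start = i * stride * kernel            # the 'STRIDE' of stride * kernel
        out[i, :] = lowered[:, start:start + kernel * kernel] @ k_flat
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal((7, 7))
k = rng.standard_normal((3, 3))
y = mec_conv2d(x, k, stride=1)                 # shape (5, 5)
```

With a 7x7 input and a 3x3 kernel, each lowered row holds 7 * 3 = 21 values, compared with 5 * 9 = 45 values per output column in im2col, and sliding over one row with a step of stride * kernel yields one column of the output.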
3.3.3 Bitonic Sort
Bitonic sort is a hardware-oriented sorting algorithm, as shown in Figure 12. Its main idea is to recursively divide a sequence into a monotonically ascending half and a monotonically descending half and merge them until the final sequence is sorted. Take the instance in Figure 12: (1) In the second row, every two elements are sorted in alternating ascending and descending order. (2) In the third and fourth rows, the elements are sorted in alternating ascending and descending order by groups of four, then by groups of two. After (1) and (2), the first half of the sequence is ascending and the second half is descending. (3) In the fifth to the eighth rows, the elements are sorted in ascending order by groups of eight, then four, and finally two. The final sequence is then ascending. Descending sort is the opposite.
Since every iteration compares pairs of numbers in a sequence and in its recursive child sequences, the total number of elements must be an integer power of 2 ($n = 2^k$). The advantage of the algorithm is that the comparators can operate in parallel. Since there are $\log_2 n$ steps in total (3 steps in Figure 12), and each step takes at most $\log_2 n$ rounds of group comparisons (there are 3 rounds in the third step in Figure 12), the sequential time complexity with one comparator is $O(n \log^2 n)$. If parallel comparators are used, the time complexity becomes $O(\log^2 n)$. As in Figure 12, 4 comparators can process the 8-number bitonic sort in 6 cycles.
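The cycle count can be checked with a short software model. This is a sketch of the standard bitonic sorting network, with the `cycles` counter added as an assumption to stand for one round in which all n/2 comparators work in parallel.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two; return (sorted, cycles).

    Each pass of the inner while loop is one parallel cycle in which n/2
    independent compare-and-swap operations could run at the same time."""
    a = list(values)
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    cycles = 0
    size = 2
    while size <= n:                     # log2(n) steps
        gap = size // 2
        while gap >= 1:                  # up to log2(n) rounds per step
            for i in range(n):
                partner = i ^ gap
                if partner > i:
                    ascending = (i & size) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            cycles += 1
            gap //= 2
        size *= 2
    return a, cycles

result, cycles = bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5])
```

For 8 elements the rounds per step are 1, 2 and 3, so the model reports 6 cycles with 4 comparators active in each.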
3.3.4 Pipeline Accumulation
Pipeline accumulation is also a hardware-oriented algorithm. It is a simple algorithm that computes a sum with several adders working in groups, trading hardware resources for time, as in Figure 13. In Figure 13, 32 adders calculate the sum of a list of numbers. In the first cycle, the first 64 numbers are added in pairs. In the second cycle, the next 32 numbers are added to the partial sums of the previous cycle (the part with the same color as in the previous cycle), and so forth.
The disadvantage of the algorithm is that the optimal number of computation units is determined by the length of the array. A shortage of adders results in a long calculation time, while a surplus of adders leaves some of them idle after a few cycles. Moreover, it is theoretically impossible to keep all adders operating in all cycles, which means there is always a moment when the computation utilization ratio is less, or significantly less, than 100%.
Another defect of the algorithm concerns data access. If all numbers pending accumulation are readable at once, these parallel data must be stored in caches, and in the computation process they are serialized again: some are accessed in the first cycle, while others are accessed in the following cycles. Such a serialize/deserialize process decreases the data flow efficiency. On the other hand, if only some of the numbers pending accumulation are readable at a time, the number of data read differs from cycle to cycle. In Figure 13, the numbers of data read out in the 10 cycles are 64, 32, 32, 32, 4, 2, 2, 0, 0, 1. Such irregularity would make the accumulation control logic depend on the convolution kernel size. If there are a variety of convolution kernel sizes in the network, it would be very difficult to maximize the parallelism.
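The falling adder utilization can be reproduced with a small simulation. The scheduling policy here is an assumption made for this sketch (partial sums are paired first and the remaining adder inputs are filled with fresh data), so its read pattern is not the one drawn in Figure 13; the function name `pipeline_accumulate` is hypothetical.

```python
def pipeline_accumulate(values, num_adders):
    """Sum `values` with a fixed pool of two-input adders.

    Returns (total, adders_used_per_cycle)."""
    unread = list(values)
    partial = []                          # partial sums carried between cycles
    used_per_cycle = []
    while len(partial) + len(unread) > 1:
        # Fill up to 2 * num_adders operand slots: partial sums first,
        # then fresh inputs read from memory.
        budget = 2 * num_adders
        operands = partial[:budget]
        carried = partial[budget:]
        fresh = budget - len(operands)
        operands += unread[:fresh]
        unread = unread[fresh:]
        pairs = len(operands) // 2
        sums = [operands[2*i] + operands[2*i + 1] for i in range(pairs)]
        leftover = operands[2*pairs:]     # an unpaired operand waits a cycle
        partial = sums + leftover + carried
        used_per_cycle.append(pairs)
    total = (partial + unread)[0] if (partial or unread) else 0
    return total, used_per_cycle

total, used = pipeline_accumulate(range(169), num_adders=32)
```

With 169 inputs, the same count as the reads listed for Figure 13, this policy also finishes in 10 cycles, but only the first four cycles keep all 32 adders busy; the number of active adders then falls to 20, 10, 5, 3, 1 and 1.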
In practice, since the width of the memory interface is limited, it is impossible to read out all data in one cycle, especially when the volume of data pending accumulation is large. In SqueezeNet v1.1, for example, the average-pooling kernel covers the entire feature-map surface [2] and each number takes 16 bits. Reading them all in one cycle would require an interface thousands of bits wide, which is unrealistic on an FPGA. Since memory cannot provide such bandwidth, the better practice is to accumulate while the memory is being read out.
3.4 Algorithm Tradeoff
3.4.1 Bitonic Sort & Pipeline Accumulation
Whether to use bitonic sort and pipeline accumulation is determined by the data layout in the cache. If the cache is W-first or H-first, these two algorithms are practical. But if the cache is channel-first, using them would significantly increase the number of computation units (in Figure 12, the number of computation units would be 4 times that of comparing one by one). In the final design, these two algorithms are not used.
In this project, the stored data format (NHWC) is optimized for the parallelism of the convolution operation, which means the input channel dimension is the lowest, while the output channel dimension is the highest (Caffe uses NCHW and TensorFlow uses NHWC). Data stored this way can be used directly as the input of the next layer, and no extra logic is needed to decide whether to start the parallel units, as in Figure 22.
3.4.2 Generic Accelerator vs. Stream Accelerator
After determining the cache format, we need to determine whether the cache data come from off-chip DRAM or from DRAM shared with the PC. Since the required on-chip space differs, the corresponding accelerator architecture differs as well. Scalability is another thing to consider when choosing the architecture: because the resources of the prototype FPGA are limited, the maximum parallelism supported is relatively low, whereas an ASIC deployment will have many more computation units. A good accelerator architecture is decoupled from the network and does not have to change much after being scaled.
If the data to be calculated come from off-chip DRAM, the input image, network weights and parameters must first be loaded into DRAM. After the calculation begins, DMAs move data from DRAM to the cache. Figure 14 shows this generic accelerator design. Taking the 6 available ports of the MCB (memory controller block) on Spartan-6 as an example, two ports can be configured as read/write ports responsible for block data access. The other four ports are read-only ports, which read data and weights (two-way parallel). Six DMAs generate MCB access timing according to requests from the computation unit, the control signal block or the host, and transfer data between DRAM and these three.
CSB (control signal block) activates the computation core according to the parameters read out from DRAM via DMA and the host commands. In the computation core there are two BRAMs, data cache and weight cache, to cache the block data and weight read out from DMA and provide random access to computation core (since DRAM latency is high). One MUX determines whether data cache and weight cache transfer data to the computation engine or not. P0 write port is shared by two blocks via a MUX. When the system is loading block data, the path from PipeIn FIFO to DMA P0 is selected. When the system is computing, the path from result to DMA P0 is selected.
Note that although block access to DRAM is fast, there is obvious latency. According to the Xilinx datasheet, the typical MCB latency of the chip is 22-32 cycles [footnote 2: Xilinx UG388 Spartan-6 FPGA Memory Controller User Guide. Accessed March 6, 2019.]. Since the im2col + GEMM operation works on small pieces of data, the latency caused by repeated random DDR accesses would empty the pipeline and waste parallel computing resources. This part is also discussed in 4.3.3.
Figure 15 shows the flow of generic accelerator. After resetting, the host enables the system. All block data (32 bits x 512, i.e., 512DWORDs) flows from the host to PIPEIN FIFO, then to DRAM. This process includes two clock domains, host clock domain and DRAM clock domain. The write side of PIPEIN FIFO is driven by host clock, while the read side is driven by DRAM clock.
After all data related to inference are loaded to DRAM, the host notifies the control signal block to start the network computation. The computation core extracts the parameter of a layer from DRAM, then reads out data and weight in order. After the computation is completed, the parallel results are serialized and sequentially written back to DRAM. After the writingback operation is finished, the control signal block sends out an interrupt. Such process loops until all layers are calculated.
The forwarding results are moved from DRAM to PIPEOUT FIFO via DMAs, then to the host via USB3.0. The write side of PIPEOUT FIFO is driven by the DRAM clock, while the read side is driven by the host clock.
To conclude, the entire computation process involves three clock domains: the host clock (100.8 MHz), the DRAM clock (333.3 MHz) and the computation core clock (100 MHz). After placing and routing, the entire design barely meets the timing constraints, takes more logic resources and consumes a lot of power.
If we use the generic accelerator, two further complicated problems are on-chip padding and address management. For input layers that require padding, there are two possible solutions. One is to pad before writing back in the previous layer; the other is to pad while reading the data. The two plans lead to different operations at corners.
Figure 16 shows an example of padding while reading. The parallelism in Figure 16 is 16. When this layer is going to write back the results on the left side and the next layer requires padding = p, the DMA needs to jump access DRAM to reserve the positions of 0 (in this case start writing back from 128). After finishing the writeback of a row, the DMA should jump 2p * BURST_LEN (channel parallelism). Then in the next layer this result can be accessed from address 0.
If padding is applied before the current layer, the core needs to determine whether the current convolution kernel is at a corner or on a side, which leads to 9 situations in Figure 16. When the convolution kernel is at the top-left corner, the core reads only addresses 4, 5, 7 and 8. When the convolution kernel is on the top side but not at a corner, the core reads addresses 3-8. Such a design strategy would introduce many conditional statements and dependencies, which increase LUT utilization after synthesis.
No matter which padding strategy is taken, in terms of address management, the result block written back to DRAM should have the same dimension order as the input block read out, so that the coherency of the flow is preserved. Since the input data cube is stored in NHWC (input channel as the lowest dimension) while the result channel of the convolution is the highest dimension, memory reshaping is required to adjust the output matrix dimensions. This part would be complicated. Meanwhile, when a convolution core fetches a block of data, supposing that the order is NHWC rather than NWHC, after kernel pieces of data are read out in the row direction, the DMA needs to jump to the next row. The jump length is BURST_LEN * (input_side - kernel). After calculating the first column of output, the DMA jumps back to the second input column. Such logic would introduce a lot of registers to calculate the shift value. Moreover, the outputs of parallel convolution layers (e.g., expand1x1 and expand3x3 in SqueezeNet v1.1) are in NWHC, while the concatenation layer merges these two matrices in the lowest channel dimension. This requires some extra logic as well.
If the data come directly from the host, we just need to transfer every piece of cache data to the FPGA via a high-speed interface rather than storing it in extra off-chip DDR. This is a stream architecture that shares memory with the host, like NVDLA. However, USB 3.0 high-speed IO has latency as well. The total IO operation latency is USB latency + OS latency + storage latency.
The final design uses the latter architecture. The main reason is that if the data are first moved to off-chip DDR, the complex logic from DDR to cache must be redesigned as well. On the other hand, such operations are mature on X86/ARM CPUs with C++/NumPy libraries.
3.4.3 Im2Col vs. MEC: Channel-first Parallelism vs. Surface-first Parallelism
For operations like convolution and pooling, there are two possible parallel solutions. Channel-first parallelism means parallelism in the channel dimension. Multiple multiply-accumulators process the parallel channel data, followed by multiple channel accumulators. The advantage is that the parallelism is at its highest in all cycles, while the disadvantage is that one piece of data is read multiple times by neighboring convolutions. If the cache can be accessed faster than the computation proceeds, this method does not add stall time to the computation pipeline.
Surface-first parallelism means parallelism in the H or W dimension of the input matrix to address the problem of data shared between neighbors, which is the idea of MEC. When the convolution starts, the parallel computation units are not all activated. The parallelism is highest when the convolution reaches the intermediate positions, as in Figure 19 and Figure 20. The advantage of this solution is that every piece of data is read from the cache only once and shared by multiple weights, so the data memory is accessed less often.
The disadvantages of this solution are that the computation parallelism varies over the whole operation and the logic is rather complicated. We need to prepare different parallel slots for different strides, since the stride determines the maximum parallelism of a convolution. For every two neighboring convolutions, kernel * (kernel - stride) numbers overlap. Notice that the weights multiplied with these overlapping data differ between the two convolutions, so we need kernel - stride + 1 groups of parallel computation units to process them. When kernel is 3 and stride is 1, three slots are occupied (in Figure 19, sum_enable = 111) and the data in the middle are accessed three times in three convolutions. When kernel is 3 and stride is 2, one slot is always empty, and the data in the middle are accessed only twice.
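The overlap and slot counts can be checked numerically. The helper names below are hypothetical, and the second function simply counts, along one surface dimension, how many sliding windows touch each input column.

```python
def overlap_and_slots(kernel, stride):
    """Overlapping inputs of two neighboring windows and occupied slots."""
    overlap = kernel * (kernel - stride)
    slots = kernel - stride + 1
    return overlap, slots

def reads_per_column(input_side, kernel, stride):
    """How many sliding windows cover each input column."""
    counts = [0] * input_side
    for start in range(0, input_side - kernel + 1, stride):
        for col in range(start, start + kernel):
            counts[col] += 1
    return counts

stride1 = reads_per_column(7, kernel=3, stride=1)   # [1, 2, 3, 3, 3, 2, 1]
stride2 = reads_per_column(7, kernel=3, stride=2)   # [1, 1, 2, 1, 2, 1, 1]
```

The maximum of each list is the number of times a middle column is shared: three for stride 1 and two for stride 2, in line with the slot counts given by `overlap_and_slots`.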
Moreover, if the kernel size increases (e.g., AlexNet has a kernel size of $11 \times 11$), the number of slots required increases proportionally. This is not good practice, since it makes the hardware size constrained by the network size. Unless the number of slots is very large, networks with large convolution kernels are not supported. Nor is it a runtime-configurable design.
This project uses channel-first parallelism, because the data are cached in BRAM, which requires only one cycle for each readout and is significantly faster than the computation units. No empty pipeline stages or idle computation units exist in the whole flow, which leads to a simpler design. If the data were stored in DRAM, we could refer to the Xilinx MCB IP readout sequence in Figure 17. The DMA takes at least 4 cycles to read data out of the Xilinx MCB in Figure 18 [footnote 3: Xilinx UG388 Spartan-6 FPGA Memory Controller User Guide. Accessed March 6, 2019.]. In the first cycle, the DMA sends a read access command to the MCB. In the second cycle, the MCB read enable pX_rd_en is pulled high after the MCB IO type is determined. In the third cycle, the data are read out (if there is only one word). The fourth cycle is idle.
Another reason for choosing channel-first parallelism is that the channel dimensions of convolutional networks are normally integer multiples of 4, 8 or 16, except for the initial input image. This makes the computation units easier to scale, and we do not need to consider zero-padding in the input channel dimension except for the initial layer, whose channel count is 3. In contrast, the surface sizes of intermediate layers vary between networks, which makes scheduling difficult in the surface-first parallelism solution.
4 Hardware Architecture
The FP16 format is used for storage and computation in the final design, because FP16 models do not have to be quantized and retrained from FP32 like INT8 models, and FP16 models save 50% of the storage space and computation units compared to FP32 models. On the other hand, the activation layers and the softmax operation at the end make the forwarding process insensitive to the deviation between FP16 and FP32, so FP16 does not lose much accuracy in inference. FP16 ranges from about $6 \times 10^{-5}$ to $6 \times 10^{4}$ with a precision of 0.05%, while FP32 ranges from about $10^{-38}$ to $10^{38}$ with a precision of 0.000006% [1]. Since the three channels of the input image are remapped to [0, 255] and followed by mean processing, the range of FP16 is sufficient for intermediate computations.
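The range and precision figures can be cross-checked with NumPy's `finfo`. In this sketch, "precision" is interpreted as the worst-case relative rounding error (half of the machine epsilon), which is an assumption about how the quoted percentages were derived.

```python
import numpy as np

formats = {}
for name, dtype in (("FP16", np.float16), ("FP32", np.float32)):
    info = np.finfo(dtype)
    formats[name] = {
        "min_normal": float(info.tiny),              # smallest normal value
        "max": float(info.max),                      # largest finite value
        "precision_pct": float(info.eps) / 2 * 100,  # relative rounding error
    }

for name, f in formats.items():
    print(f"{name}: {f['min_normal']:.3g} .. {f['max']:.3g}, "
          f"precision {f['precision_pct']:.6f}%")

# Every 8-bit pixel value is exactly representable in FP16.
pixels_exact = all(float(np.float16(v)) == v for v in range(256))
```

Under this interpretation the computed precisions are roughly 0.049% for FP16 and 0.000006% for FP32, and every integer in [0, 255] is stored exactly in FP16.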
Figure 22 shows the stream accelerator architecture, which consists of the following blocks: the control signal block, the engine, the USB communication block (Host), several FIFOs and BRAMs. All FIFOs in the design are asynchronous FIFOs with handshake, supporting independent read/write clock domains, as in Figure 23 [footnote 4: Xilinx PG057 FIFO Generator v13.1 LogiCORE IP Product Guide. Accessed March 6, 2019.]. For the command FIFO, the write clock domain is the USB clock while the read clock domain is the engine clock. For the result FIFO, the write clock domain is the engine clock while the read clock domain is the USB clock.
The engine module consists of registers and computation units. The registers are the data, weight, bias and layer registers. The computation units consist of several parallel floating-point computation modules and FIFOs. The FIFOs match the speeds of the different types of floating-point computations.
4.1 Control Signal Block and Parameters
The control signal block parses the parameters required for every layer of the network from the USB 3.0 packets and stores them into the layer registers, which are accessed during the computation process. Table 1 [footnote 5: Wolfram Squeezenet v1.1 Trained on Imagenet Competition Data. Accessed March 6, 2019.] shows the network structure of SqueezeNet v1.1 and the detailed parameters are in Table 2. SqueezeNet v1.1 consists of four types of calculation: $1 \times 1$ convolution, $3 \times 3$ convolution, max-pooling and average-pooling. ReLU operations can be realized directly after convolution. Concatenation layers can be realized by NumPy matrix operations. SqueezeNet v1.1 uses a lot of convolution kernels and the strategy of expanding after squeezing, so the maximum convolution kernel is 3x3 and the network weights are reduced by 50 times compared to AlexNet, while the inference accuracy is roughly the same [2].
| Name | Type | Dimension |
|---|---|---|
| input | Input | 3-tensor (size: 3x227x227) |
| conv1 | Convolution Layer | 3-tensor (size: 64x113x113) |
| relu_conv1 | Ramp | 3-tensor (size: 64x113x113) |
| pool1 | Pooling Layer | 3-tensor (size: 64x56x56) |
| fire2 | Net Graph (7 nodes) | 3-tensor (size: 128x56x56) |
| fire3 | Net Graph (7 nodes) | 3-tensor (size: 128x56x56) |
| pool3_pad | Padding Layer | 3-tensor (size: 128x57x57) |
| pool3 | Pooling Layer | 3-tensor (size: 128x28x28) |
| fire4 | Net Graph (7 nodes) | 3-tensor (size: 256x28x28) |
| fire5 | Net Graph (7 nodes) | 3-tensor (size: 256x28x28) |
| pool5_pad | Padding Layer | 3-tensor (size: 256x29x29) |
| pool5 | Pooling Layer | 3-tensor (size: 256x14x14) |
| fire6 | Net Graph (7 nodes) | 3-tensor (size: 384x14x14) |
| fire7 | Net Graph (7 nodes) | 3-tensor (size: 384x14x14) |
| fire8 | Net Graph (7 nodes) | 3-tensor (size: 512x14x14) |
| fire9 | Net Graph (7 nodes) | 3-tensor (size: 512x14x14) |
| drop9 | Dropout Layer | 3-tensor (size: 512x14x14) |
| conv10 | Convolution Layer | 3-tensor (size: 1000x14x14) |
| relu_conv10 | Ramp | 3-tensor (size: 1000x14x14) |
| pool10 | Aggregation Layer | vector (size: 1000) |
| flatten | Flatten Layer | vector (size: 1000) |
| probabilities | Softmax Layer | vector (size: 1000) |
| Output | class | |
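The surface sizes in the table can be cross-checked with the usual output-side formula. The 3x3 kernels with stride 2 used below for conv1 and the max-pooling layers are an assumption taken from the public SqueezeNet v1.1 definition, not from the table itself, and `out_side` is a hypothetical helper.

```python
def out_side(input_side, kernel, stride, pad=0):
    """Output surface side of a convolution or pooling layer."""
    return (input_side - kernel + 2 * pad) // stride + 1

trace = []
side = 227
side = out_side(side, kernel=3, stride=2)        # conv1
trace.append(side)
side = out_side(side, kernel=3, stride=2)        # pool1
trace.append(side)
side = out_side(side + 1, kernel=3, stride=2)    # pool3 after pool3_pad (56 -> 57)
trace.append(side)
side = out_side(side + 1, kernel=3, stride=2)    # pool5 after pool5_pad (28 -> 29)
trace.append(side)
```

The fire modules keep the surface size unchanged, so only the convolution and pooling layers listed here change it; the trace evaluates to 113, 56, 28 and 14, matching the table.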
4.2 Engine
The engine computes according to the layer registers. The operating flow is shown in Figure 35. There are three registers (data, weight and bias) in the computation units (parallelism = 8) in Figure 22, whose widths are 128 bits, 128 bits and 16 bits respectively. The layer registers store the parsed parameters of a single layer (12 bytes, as in Figure 34). The computation units consist of three sections: convolution, max-pooling and average-pooling. Each section consists of several floating-point units and FIFOs. The parallelism of each section is 8.
At a 100 MHz clock speed, the FP16 multiplier latency is 6 cycles, the FP16 adder latency is 2 cycles, the FP16 comparator latency is 2 cycles and the FP16 divider latency is 6 cycles. Since the adders in the computation units are used as accumulators, and the comparators are also operating continuously, new data should be fed only after the accumulators or comparators have finished rather than in every cycle. Multipliers can be fed in every cycle, as in MULT in Figure 25. Hence, the throughput of the multipliers is larger than that of the accumulators or comparators. FIFOs are required to match the input data update frequency of the accumulators and comparators.
4.2.1 Convolution Units
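Convolution units consist of 8 parallel floating-point multipliers, 8 parallel floating-point adders, 2 FIFOs and an independent floating-point adder. The calculation formula is as follows, in which $x$ stands for the elements of the input matrix (padded), $w$ stands for the weights in the convolution kernels, $y$ stands for the elements in the output matrix, $H_o$, $W_o$, $C_o$ stand for the output dimensions, and $H_i$, $W_i$, $C_i$ stand for the input dimensions, and similarly hereinafter. $K$ is the kernel size, $s$ is the stride and $b$ is the bias.

$$y_{u,v,n} = b_n + \sum_{c=0}^{C_i-1} \sum_{i=0}^{K-1} \sum_{j=0}^{K-1} x_{us+i,\,vs+j,\,c}\; w_{n,i,j,c}, \qquad 0 \le u < H_o,\ 0 \le v < W_o,\ 0 \le n < C_o \qquad (1)$$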
Figure 24 shows the macroscopic im2col convolution (input data access only). The parallelism is 16. The convolution process proceeds in five dimensions from the lowest to the highest. The operation in the lowest dimension is the multiply-accumulation of 16 parallel channel elements at one kernel position ($1 \times 1 \times 16$). It is the smallest operation in the whole process, called an atom. The second dimension is the filter movement in the H or W direction after calculating the partial sum of a filter kernel. The third dimension is the traversal of the entire input channel multiply-accumulation of one column or row. The fourth dimension is the filter movement of one stride in the W or H direction after the channel operations, until the whole surface is finished. The final dimension is the traversal of all output channels, which means applying all filters to the same input matrix.
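One possible reading of this five-level traversal is the loop nest sketched below. The exact loop order in the RTL may differ: the mapping of loops to the five dimensions, the (H, W, C) and (N, K, K, C) layouts and the name `conv_loop_nest` are assumptions for illustration.

```python
import numpy as np

ATOM = 16  # channel parallelism stated for Figure 24

def conv_loop_nest(x, weights, stride=1):
    """x: (H, W, C) padded input, weights: (N, K, K, C); C multiple of ATOM."""
    h, w, c = x.shape
    n, kernel = weights.shape[0], weights.shape[1]
    out_h = (h - kernel) // stride + 1
    out_w = (w - kernel) // stride + 1
    out = np.zeros((out_h, out_w, n), dtype=np.float32)
    for f in range(n):                               # 5: all output channels
        for oy in range(out_h):                      # 4: move the filter by one
            for ox in range(out_w):                  #    stride over the surface
                acc = 0.0
                for c0 in range(0, c, ATOM):         # 3: traverse input channels
                    for ky in range(kernel):         # 2: positions inside the
                        for kx in range(kernel):     #    filter kernel
                            a = x[oy*stride+ky, ox*stride+kx, c0:c0+ATOM]
                            b = weights[f, ky, kx, c0:c0+ATOM]
                            acc += float(np.dot(a, b))   # 1: the atom
                out[oy, ox, f] = acc
    return out
```

Because the atom always consumes a full group of channels, the innermost operation has the same width in every cycle regardless of kernel size or stride.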
Figure 25 shows the microscopic RTL convolution process. The parallelism is 8 and it is a three-stage pipeline. After engine_valid goes high, cmac_enable is pulled high and the computation units are enabled. The computation units read data, weights and biases from BRAM (the bias and internal transfer signals are not shown). Notice that when data_ready is low, the values fed to the floating-point computation units are not calculated, and likewise for the later stages. When data_ready is high, data and weights are continuously fed to 8 parallel multipliers (Figure 25 shows only one). After 6 cycles (in Figure 25, kernel size is ), results are produced continuously. After calculation, ready is pulled high, and the results are stored in the partial-sum FIFO. The write latency is 6 cycles, after which p_fifo_empty is pulled low.
After the multiplication, 8 parallel partial-sum accumulators start. Each one reads kernel pieces of data from P_FIFO and accumulates them. After PSUM accumulation, psum_ready is pulled high, and the results are stored in the full-sum FIFO. The write latency is 6 cycles, after which f_fifo_empty is pulled low.
After PSUM calculation, the final-stage single full-sum accumulator sums the 8 partial results of the previous stage. Notice that the initial value is the bias (0xac88 in Figure 25). After the calculation, fsum_ready is pulled high and the data is stored in fsum. fsum is 16 bits wide. Since the surface sizes of the first feature map after the first convolution are smaller than 128 for networks based on ImageNet, the fsum depth is set to 128. This cache is inferred as a single-port RAM by ISE. In later steps, the initial value in the fsum cache is the previously accumulated result of the FSUM process.
Notice that there are two FIFOs in the three-stage pipeline to match the different speeds of the three computation modules. They also make it possible to adjust the latency of each floating-point IP during design: the lower the latency is, the more combinational logic resources are taken, the harder it is for timing to converge, and the higher the fanout is. The FIFOs do not affect timing but help decouple the stages. In ASIC projects there are floating-point IPs with 1-cycle latency that help achieve better performance. Filled pipelines are not shown in Figure 25. If the accumulator can produce its result in one cycle, the three stages run at the same speed and the pipeline is filled, which gives the best resource utilization.
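As a rough functional model of the three stages (MULT, PSUM, FSUM), the sketch below computes one output element in FP16. It ignores latencies, FIFOs and handshake signals entirely; the function name `conv_output_element` and the array layout (one row per kernel piece, one column per lane) are assumptions for illustration.

```python
import numpy as np


def conv_output_element(data, weights, bias, lanes=8):
    """Functional FP16 model of one output element in three stages.

    data, weights: shape (kernel, lanes), one row per kernel piece
    bias: scalar, used as the initial value of the full-sum accumulator
    """
    data = np.asarray(data, dtype=np.float16)
    weights = np.asarray(weights, dtype=np.float16)
    # Stage 1 (MULT): lane-wise products, as written to the partial-sum FIFO.
    p_fifo = data * weights
    # Stage 2 (PSUM): each lane accumulates its own products.
    psum = np.zeros(lanes, dtype=np.float16)
    for row in p_fifo:
        psum = psum + row
    # Stage 3 (FSUM): a single accumulator sums the lane results,
    # starting from the bias.
    fsum = np.float16(bias)
    for lane_value in psum:
        fsum = np.float16(fsum + lane_value)
    return fsum
```

Because every intermediate value is rounded to FP16, the result can differ slightly from an FP32 dot product of the same inputs, which is the kind of deviation discussed later in the results.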
4.2.2 Maxpooling Units
Maxpooling consists of 8 parallel floating-point comparators and 1 FIFO. Its formula is as follows; it relates the elements of the input matrix to the elements of the output matrix.
(2) 
Maxpooling does not change the number of channels of the input matrix. It only changes the surface size, so the computation flow has one dimension fewer than convolution. Only Kernel, W, H and C are involved (W and H can be exchanged).
Figure 26 shows the microscopic RTL comparison process. The parallelism is 8 and it is a one-stage flow. After engine_valid goes high, maxpool_enable is pulled high and the computation units are enabled. The computation units read data from BRAM (not shown in Figure 26). After 6 cycles, m_fifo_empty is pulled low. 8 parallel comparators (initial value 0x0000) compare the new data from M_FIFO with the value kept from the previous cycle. If the result is high, meaning that the new data in a_cmp is larger than the data in b_cmp, the value in b_cmp is replaced by that in a_cmp. Otherwise b_cmp remains unchanged. Thus b_cmp holds the maximum value until the counter equals kernel. The final result in b_cmp is stored in scmp_result, and the ready signal is then pulled high.
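The running-maximum behaviour of the comparators can be modelled in a few lines. This is a behavioural sketch only: the function name `maxpool_lanes` and the (kernel, lanes) input layout are assumptions, while `a_cmp`, `b_cmp` and the 0x0000 initial value follow the description above.

```python
import numpy as np


def maxpool_lanes(window, lanes=8):
    """Running-maximum model; window has shape (kernel, lanes)."""
    window = np.asarray(window, dtype=np.float16)
    b_cmp = np.zeros(lanes, dtype=np.float16)    # initial value 0x0000
    for a_cmp in window:                         # one new vector per comparison
        greater = a_cmp > b_cmp                  # comparator output per lane
        b_cmp = np.where(greater, a_cmp, b_cmp)  # keep the larger value
    return b_cmp
```

With an initial value of zero, a window containing only negative values yields 0 in this model. That is harmless as long as pooling follows a convolution + ReLU layer, whose outputs are non-negative.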
4.2.3 Averagepooling Units
Averagepooling consists of 8 parallel floating-point adders and 8 parallel floating-point dividers. Its computation formula is as follows, in which D stands for the elements of the input matrix and A stands for the elements of the output matrix.
(3) 
Averagepooling does not change the number of channels of the input matrix. It only changes the surface size, so the computation flow has one dimension fewer than convolution, the same as maxpooling.
Figure 27 shows the microscopic RTL averagepooling process. The parallelism is 8 and it is a two-stage flow. After engine_valid goes high, average_enable is pulled high and the computation units are enabled. The computation units read data from BRAM (not shown in Figure 27). After 6 cycles, s_fifo_empty is pulled low. 8 parallel accumulators read data from S_FIFO and accumulate until the counter equals kernel. Then div_data_ready generates a one-cycle-wide pulse.
When div_data_ready is high, 8 parallel dividers are triggered. a_div stores the accumulated results from the previous stage. b_div stores the int-FP converter output, which is kernel in FP16 format (in Figure 27, it is 0x5948, i.e., 169). After 6 cycles, ready_buf is pulled high and the computation finishes.
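The FP16 bit patterns quoted from the waveforms can be checked with Python's `struct` module, whose `e` format is IEEE 754 half precision. The sketch below also models the accumulate-then-divide flow; the helper names `fp16_from_hex` and `avgpool_lanes` and the (kernel, lanes) input layout are assumptions for illustration.

```python
import struct

import numpy as np


def fp16_from_hex(bits):
    """Decode a 16-bit pattern as an IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def avgpool_lanes(window, lanes=8):
    """Accumulate per lane, then divide by the element count (all in FP16)."""
    window = np.asarray(window, dtype=np.float16)
    acc = np.zeros(lanes, dtype=np.float16)   # stage 1: accumulators
    for row in window:
        acc = acc + row
    b_div = np.float16(window.shape[0])       # int-to-FP16 conversion of the count
    return acc / b_div                        # stage 2: dividers
```

For example, `fp16_from_hex(0x5948)` evaluates to 169.0, which matches a 13 x 13 pooling window.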
4.2.4 Test Files and Scripts
To generate the image data and weights required by CNN forwarding, scripts extract values in FP16 format from images and pretrained models. Preprocess.py moves the image channels to the lowest dimension, swaps the channels from RGB to BGR (as Caffe expects), subtracts the per-channel mean of the ILSVRC_2012 dataset from the image and rescales the difference from [0, 1] to [0, 255] (Figure 28). Extract.py extracts weights and biases from the prototxt and caffemodel files, converts them to FP16 format and packs them into npz format, as shown in Figure 29. The host script loads the npz file during execution.
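A minimal NumPy sketch of this kind of preprocessing is given below. It is not the authors' Preprocess.py: the function name `preprocess`, the argument `mean_bgr`, and the order of scaling and mean subtraction (scale to [0, 255] first, then subtract, as in the usual Caffe convention) are assumptions, and the actual mean values must be supplied by the caller.

```python
import numpy as np


def preprocess(image_rgb, mean_bgr):
    """Convert an RGB image to a Caffe-style FP16 blob.

    image_rgb: (H, W, 3) array with values in [0, 1]
    mean_bgr:  three per-channel means on the [0, 255] scale (caller-supplied)
    """
    img = np.asarray(image_rgb, dtype=np.float32)
    img = img[:, :, ::-1]                                # RGB -> BGR
    img = img * 255.0                                    # [0, 1] -> [0, 255]
    img = img - np.asarray(mean_bgr, dtype=np.float32)   # per-channel mean
    img = img.transpose(2, 0, 1)                         # (H, W, C) -> (C, H, W)
    return img.astype(np.float16)                        # FP16 for the accelerator
```

The resulting array could then be stored alongside the weights, for instance with `np.savez`, so that the host script reads everything from one file.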
Macros are used in the computation testbenches to check whether convolution, averagepooling and maxpooling are correct. Because the data volume in the testbench is huge, Python scripts generate the testbench data from the preprocessed image and network weights, as shown in Figure 30. Such a script generates the testbench code directly from NumPy matrices and avoids manual input.
Apart from these scripts, Caffe on CPU is also required to verify the inference; the verification script is identical to the BVLC sample script (BVLC Classification: Instant Recognition with Caffe; accessed March 6, 2019).
4.3 USB3.0 IO
The USB3.0 IO block loads input commands into the command FIFO and stores input data, weights and biases in the corresponding caches. Meanwhile, it transfers the results to the result FIFO and the parameters to the host, which uses them to calculate the cache positions.
The USB3.0 FrontPanel APIs used (Opal Kelly FrontPanel SDK; accessed March 6, 2019) are as follows. Wire In writes a single 32-bit value to FPGA registers. Wire Out reads a single 32-bit value from FPGA registers. BlockThrottled Pipe In is a block write with handshake, with the timing shown in Figure 31. BlockThrottled Pipe Out is a block read with handshake, with the timing shown in Figure 32. EP_READY is the enable signal for USB communication from the FPGA, and it depends on the available space in CMDFIFO and RESFIFO. EP_WRITE/EP_READ are the write and read signals sent directly by the PC host API. EP_DATAOUT and EP_DATAIN are 32 bits wide.
4.4 FIFOs and BRAMs
Figure 22 shows that the top-level hardware design has CMDFIFO on the input side and RESFIFO on the output side.
The command FIFO is 32 bits wide and 1024 deep. It stores the parameters of each layer. Since each layer requires 12 bytes, as shown in Figure 34, theoretically 341 layers are supported. For deeper networks, the command FIFO depth can simply be increased.
op_type is the computation type of this layer: idle, convolution + ReLU, maxpooling or averagepooling. stride ranges from 0 to 3. Large strides lose image information while small ones make the forwarding structure large, so 4 bits are enough. kernel is the side length of the computation window. kernel_size is the area of the computation window, i.e., kernel_size = kernel * kernel. This extra parameter saves an on-chip integer multiplication, while its resource and communication cost is trivial. input_side_size is the side length of the input matrix; for example, for the first layer of SqueezeNet v1.1 it is 227. output_side_size is the side length of the output surface; for example, for the first layer of SqueezeNet v1.1 it is 113.
input_channel_size is the number of channels of the input matrix. output_channel_size is the number of channels of the output matrix. In a convolution layer, input_channel_size is normally smaller than output_channel_size, because as convolution proceeds the network gets deeper and the surface becomes smaller. For a pooling layer, input_channel_size equals output_channel_size.
padding_size is normally 1, but 4 bits are still reserved for it. stride2 = stride * kernel. This value is used repeatedly in computation; passing it as a network parameter reduces logic utilization and helps timing convergence.
slot determines whether this layer belongs to a group of parallel layers, like expand1x1 and expand3x3 in SqueezeNet v1.1. The input cube of these two layers comes from the output of the same layer, and their outputs are merged into one matrix as the input to the next layer. slot[0] and slot[1] indicate the order of the two parallel layers, and slot[3:2] gives the total number of parallel layers. slot is only transferred to the PC host to help parse the input matrix and is not used by the computation units, because in practice these two parallel layers are computed sequentially.
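The derived command fields and the FIFO capacity above can be reproduced on the host with a short sketch. The class name `LayerCommand` and the constants are hypothetical, the bit layout of Figure 34 is not modelled, and the output-size expression is the standard convolution formula, assumed here rather than taken from the design.

```python
from dataclasses import dataclass


@dataclass
class LayerCommand:
    """Host-side view of a few per-layer parameters (illustrative only)."""
    kernel: int
    stride: int
    input_side_size: int
    padding_size: int = 0

    @property
    def kernel_size(self):
        # Precomputed on the host to save an on-chip integer multiplication.
        return self.kernel * self.kernel

    @property
    def stride2(self):
        return self.stride * self.kernel

    @property
    def output_side_size(self):
        # Standard convolution output size (assumed formula).
        padded = self.input_side_size + 2 * self.padding_size
        return (padded - self.kernel) // self.stride + 1


CMD_FIFO_WORDS = 1024      # 32-bit words in the command FIFO
BYTES_PER_LAYER = 12       # one layer command
MAX_LAYERS = CMD_FIFO_WORDS * 4 // BYTES_PER_LAYER
```

Assuming a 3 x 3 kernel with stride 2 and no padding for the first layer of SqueezeNet v1.1, `LayerCommand(kernel=3, stride=2, input_side_size=227).output_side_size` gives 113, and `MAX_LAYERS` gives 341, in agreement with the figures quoted above.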
The result FIFO is 32 bits wide and 1024 deep. It directly stores the computation results and transfers them via USB3.0 upon host request. It follows that, under channel-first parallelism, this FIFO can support a convolution layer with a maximum input_side_size of 1024 (since a single pass corresponds to one output channel) or a pooling layer whose input_side_size is 128 (since the output channel parallelism is 8). This is enough for current networks. If the side or area of intermediate layers is larger, the result FIFO depth should be increased.
In the top-level hardware design, as shown in Figure 22, three BRAMs serve as caches: the data cache, the weight cache and the bias cache. These BRAM caches are accessed once per cycle to load values into the corresponding registers of the same width.
The data cache is 128 bits wide and 1024 deep. It stores the convolution, averagepooling and maxpooling data received via USB3.0. Since these three computation modules operate only when cmac_enable, avepool_enable and maxpool_enable are high, it is not necessary to add a MUX on the data path. The weight cache is 128 bits wide and 8192 deep. It is accessed by convolution only. Since the input channel parallelism is 8 in convolution and the data format is FP16, the width is 128 bits. If the max kernel size in convolution is , then the max supported input_channel_size * output_channel_size is . Since the output channel parallelism is also 8, the max input channel size is , which meets the requirement of SqueezeNet v1.1 (as in Table 2).
The bias cache is 128 bits wide and 1024 deep. It is accessed by convolution only. In FP16 format, only the lowest 16 bits of each bias cache entry hold valid data, and the remaining bits are all zeros. Since the output channel parallelism is at most 8 in convolution in this project, a bias cache depth of 8 would be enough. But to simplify the project files, the same BRAM as the data cache is used.
As for the data cache and weight cache, since the USB3.0 input is 32 bits wide and only the lower 16 bits are valid in FP16 format, every 8 cycles a group of parallel data is cached by SERDES, as shown in Figure 34. While the count is smaller than BURST_LEN - 1 = 7, 16-bit data is shifted in until the 128-bit cache is filled.
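The serial-to-parallel packing can be illustrated with plain Python integers. The function name `pack_words` is hypothetical, and the shift direction (first value ending up in the most significant bits) is an assumption, since the text does not specify it.

```python
BURST_LEN = 8  # 16-bit values per 128-bit cache word


def pack_words(usb_words):
    """Pack BURST_LEN 32-bit USB words into one 128-bit cache word.

    Only the lower 16 bits of each USB word are treated as valid FP16 data.
    """
    if len(usb_words) != BURST_LEN:
        raise ValueError("expected exactly BURST_LEN words")
    cache_word = 0
    for word in usb_words:
        half = word & 0xFFFF                     # keep the valid FP16 bits
        cache_word = (cache_word << 16) | half   # shift in one value per cycle
    return cache_word
```

One consequence of this format is that half of the USB bandwidth carries padding, which is part of why the transfer time matters in the results below.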
In Figure 35, after reset, all parameters are first loaded into CMDFIFO, while biases, weights and data are transferred to BRAM layer by layer and piece by piece. After engine_valid goes high, the computation unit reads biases, data and weights from BRAM, calculates the result and writes it to RESFIFO. After a piece is finished, the host is notified by the interrupt signal and reads the result from RESFIFO via USB3.0. The next piece and then the next layer follow, until the forwarding of the entire network is finished.
Compared to the generic accelerator workflow in Figure 15, the stream accelerator workflow is simpler: the data stack is thinner and the data path is shorter, which helps improve performance and scale the parallelism.
5 Software, System Architecture and Results
Figure 36 shows the PC host workflow. The execution process in each stage is as follows. In Read Blob, the host loads the network parameters, biases and weights from the packed npz file generated by extract.py, and loads the image data preprocessed by preprocess.py. In Initialize Device, the host connects to and initializes the FPGA and downloads the bitstream file. In Load Commands, the host transfers all parameters of each layer to CMDFIFO on the FPGA. In Load Layer, these prestored parameters are read out; they are used by the computation units and also to slice the data blocks. In Process Weight Bias, the network weights are processed and sliced. In Load Weight & Bias, the biases and weights are transferred to the bias cache and weight cache on the FPGA.
In Process Gemm, the host slices the padded data block. The sliced data is transferred to the FPGA in Load Gemm. Restart Engine resets the computation unit, clears the registers after the calculation of the previous layer, and starts the computation unit. After all results are calculated, the FPGA notifies the host via interrupt, and the host then issues a read request. In Read Output, the data is transferred from RESFIFO. In Concatenate Outputs, all pieces of results are concatenated to form the input of the next layer. Finally, Softmax & Argsort normalizes and sorts the final result. The softmax function is as follows:
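$$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}} \tag{4}$$
Here x_i is the i-th output of the final layer. The function normalizes the output values to (0, 1); each value stands for the probability that the input image is the i-th item inferred by the network.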
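A host-side Softmax & Argsort step can be sketched with NumPy as follows. The function names `softmax` and `top_k` are illustrative, and subtracting the maximum before exponentiation is a standard numerical-stability measure added here, not something stated in the text.

```python
import numpy as np


def softmax(logits):
    """Normalize final-layer outputs to probabilities that sum to 1."""
    x = np.asarray(logits, dtype=np.float32)
    e = np.exp(x - x.max())   # subtracting the max does not change the result
    return e / e.sum()


def top_k(logits, k=5):
    """Return the k most likely class indices with their probabilities."""
    probs = softmax(logits)
    order = np.argsort(probs)[::-1][:k]   # indices sorted by descending probability
    return [(int(i), float(probs[i])) for i in order]
```

Because the exponential is monotonic, the ranking of classes depends only on the ordering of the final-layer outputs, so small FP16 deviations rarely change the top prediction.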
Table 3 shows the resource utilization of the accelerator after synthesis, placement and routing. Registers, LUTs and DSPs are still partly unused. DSPs are not used much because in the Xilinx Floating-Point 5.0 IP (Xilinx DS535 Floating-Point Operator v5.0 LogiCORE Product Guide; accessed March 6, 2019), only the multipliers use DSPs and the rest use flip-flop and LUT resources. The RAMB16BWERs are almost used up, because the design uses many BRAMs and FIFOs. LUT usage is mostly made up of floating-point computation units. LUT utilization is over 70% when the parallelism is 16. A doubled parallelism means doubled width in BRAM and FIFO because of channel-first parallelism. However, the present RAMB16BWER and RAMB8BWER utilization already exceeds 50%, so this chip cannot hold a parallelism of 16.
Figure 37 shows the forwarding result of the first layer of SqueezeNet v1.1 on the accelerator (left) compared to that on Caffe CPU (right). Since the accelerator uses FP16 format, there is a padded 0 between every two results. The results on the two platforms are basically the same, and deviations only start from the second or third decimal place.
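On the host, results read back as 32-bit words can be converted to FP16 values by discarding the padding. The sketch below assumes little-endian words with the valid half in the lower 16 bits, mirroring the input format; the function name `results_from_words` is hypothetical.

```python
import numpy as np


def results_from_words(raw_bytes):
    """Convert 32-bit result words (FP16 in the low half) to FP16 values."""
    words = np.frombuffer(raw_bytes, dtype="<u4")   # one result per 32-bit word
    halves = (words & 0xFFFF).astype("<u2")         # drop the zero padding
    return halves.view("<f2")                       # reinterpret bits as FP16
```

The decoded array can then be compared element-wise against the FP32 reference from Caffe, for example with `np.allclose` and a tolerance that reflects FP16 precision.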
Figures 38 and 39 show that the FPGA results are identical to the Caffe CPU results. FP16 does not introduce errors or obvious deviations in the final classification, because softmax amplifies the result of the final-layer convolution: larger values in the exponent dominate, so small deviations do not compromise the correctness of the result much. Because of hardware resource restrictions (RAMB16BWER utilization is already 88%), the parallelism is only 8. Hence, the computation time is relatively long (computation takes 10.7 s and the whole process 40.9 s), which is significantly slower than Caffe. If USB3.0 were replaced by a PCIe bus, the latency would improve. With more hardware resources to raise the parallelism, the computation time would be reduced proportionally.
6 Conclusion and Future Development
6.1 Conclusion
This paper analyzes the candidate solutions and architectures of CNN accelerators, as well as the tradeoffs between candidate algorithms, in order to preserve the scalability of the hardware project. A stream processing architecture with an im2col + GEMM convolution solution is eventually designed. Channel-first parallelism is used to simplify the computation control logic. Reconfigurable accelerator hardware and the corresponding host software are produced.
In the verification stage, SqueezeNet v1.1 is used to test the inference on FPGA, and the result is identical to that on Caffe CPU. The only deviation is introduced by the precision difference between FP16 and FP32. Since scalability is considered in the design, the selected algorithms maximize the use of parallel computing resources, and the computation resource overhead is not influenced by network types or the number of layers.
Restricted by FPGA resources, the accelerator is slower than CPU and other contemporary accelerators. Better performance can be achieved if FPGAs with more logic resources are used, if the parallelism is increased when this project is migrated to ASIC, or if higher clock speeds and lower-latency high-speed buses are used.
6.2 Configurable Parameters towards ASIC & Optimization
In this project, the computation precision and the parallelism are the two most important configurable parameters. These two parameters determine the number of computation units and the width of the caches and FIFOs. In the current design the precision is FP16 and the parallelism is 8. These parameters can be adjusted by macros, as shown in Figure 40. Such scaling is practical because channel-first parallelism is used in the design and no other logic must be changed while scaling. Apart from the parameters above, MAX_KERNEL and MAX_O_SIDE are also configurable. These two determine the overhead of the RAMs used as result buffers. CMD_BURST_LEN determines how many double words (4 bytes each) are read out from CMDFIFO to CSB. In this case, the number is 3, i.e., 12 bytes.
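How the two main parameters propagate can be sketched as simple arithmetic. The function `derived_parameters` and its dictionary keys are hypothetical names, not the macros of Figure 40; the unit counts follow the description of the convolution units given earlier.

```python
def derived_parameters(precision_bits=16, parallelism=8, cmd_burst_len=3):
    """Derive widths and unit counts from precision and parallelism."""
    return {
        # Data and weight caches hold one value per parallel lane.
        "cache_width_bits": precision_bits * parallelism,
        # One multiplier and one partial-sum adder per lane,
        # plus a single full-sum adder.
        "conv_multipliers": parallelism,
        "conv_adders": parallelism + 1,
        # Each command is cmd_burst_len 32-bit double words.
        "command_bytes": cmd_burst_len * 4,
    }
```

With the defaults this gives a 128-bit cache width and 12-byte commands; setting `parallelism=16` doubles the cache width to 256 bits, which is why BRAM capacity limits the parallelism on the present chip.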
If the project is migrated to ASIC, some FPGA IPs need to be replaced, including the floating-point IP, USB3.0 IP, BRAM IP and FIFO IP. There are practical ASIC solutions for these IPs, and thanks to the handshake protocols, the core logic does not have to change much beyond matching the ASIC IPs.
Since the hardware in this project uses an engine to compute the CNN forwarding rather than storing weights directly in hardware, and the scale of the computation units is not related to the intrinsic parameters of networks, other networks like AlexNet are also supported. Thus, this project is configurable at runtime. On larger FPGAs or on ASICs, more computation units (e.g., the NVDLA full configuration has a total of 2048 parallel convolution multiply-accumulators) and SRAMs can be used to speed up the forwarding process. On the other hand, the host logic can also be migrated to CPUs like ARM or RISC-V.
Moreover, the network parameters are currently extracted manually rather than by script, because the parameter requirements changed during the design process. Once the architecture is fixed, the commands can be extracted from the prototxt by a Python script.
References
 [1] Song Han and B Dally. Efficient methods and hardware for deep learning. University Lecture, 2017.
 [2] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016.
 [3] Vinayak A Gokhale. Nn-x: a hardware accelerator for convolutional neural networks. 2014.

 [4] Amir Gholami, Kiseok Kwon, Bichen Wu, Zizheng Tai, Xiangyu Yue, Peter Jin, Sicheng Zhao, and Kurt Keutzer. Squeezenext: Hardware-aware neural network design. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1638–1647, 2018.
 [5] Bichen Wu, Forrest Iandola, Peter H Jin, and Kurt Keutzer. Squeezedet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 129–137, 2017.
 [6] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and William J Dally. Eie: efficient inference engine on compressed deep neural network. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pages 243–254. IEEE, 2016.
 [7] Lukas Cavigelli and Luca Benini. Origami: A 803gop/s/w convolutional network accelerator. IEEE Transactions on Circuits and Systems for Video Technology, 27(11):2461–2475, 2016.
 [8] Srinidhi Kestur, John D Davis, and Eric S Chung. Towards a universal fpga matrix-vector multiplication architecture. In 2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines, pages 9–16. IEEE, 2012.
 [9] Clément Farabet, Cyril Poulet, Jefferson Y Han, and Yann LeCun. Cnp: An fpga-based processor for convolutional networks. In 2009 International Conference on Field Programmable Logic and Applications, pages 32–37. IEEE, 2009.
 [10] Clément Farabet, Berin Martini, Benoit Corda, Polina Akselrod, Eugenio Culurciello, and Yann LeCun. Neuflow: A runtime reconfigurable dataflow processor for vision. In CVPR Workshops, pages 109–116, 2011.
 [11] Philipp Gysel, Mohammad Motamedi, and Soheil Ghiasi. Hardware-oriented approximation of convolutional neural networks. arXiv preprint arXiv:1604.03168, 2016.
 [12] NVIDIA Corporation. Nvdla primer, nvdla documentation. https://nvdla.org/primer.html. Accessed March 6, 2019.
 [13] Xilinx Inc. Chaidnn, hls based deep neural network accelerator library for xilinx ultrascale+ mpsocs. https://github.com/Xilinx/CHaiDNN. Accessed March 6, 2019.
 [14] Opal Kelly Incorporated. Xem6310. https://opalkelly.com/products/xem6310. Accessed March 6, 2019.

 [15] Minsik Cho and Daniel Brand. Mec: memory-efficient convolution for deep neural network. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 815–824. JMLR. org, 2017.