As mentioned in section Document, we aggressively simplified ShuffleNetV2's operator set. Our modified network is mainly composed of the following operators:
- 1×1 convolution
- 2×2 max-pooling
- shift
- shuffle and concatenation
- fully-connected layer
Our accelerator is tailored to support only the operators above. This allows us to design more specialized compute units with simpler control, which further improves hardware efficiency. The computation of the fully-connected layer is mapped onto our convolution unit. The shuffle operation is not fully supported on the FPGA: a CPU-based memory copy is needed to maintain the memory layout. The remaining average-pooling layer, which is also not supported on the FPGA, is offloaded to the ARM processor on the SoC platform.
The accelerator architecture
Fig. Document shows the overall accelerator architecture. Our accelerator, highlighted in light yellow, can be invoked by the CPU to compute one Conv-Pooling-Shift-Shuffle subgraph at a time. The CPU provides supplementary support to the accelerator; both the FPGA and the CPU are used to run the network.
In our quantized ConvNet, weights are 1-bit, input and output activations are 4-bit, and the largest partial sum is 13-bit. The width of the partial sum is determined by the input feature bit width and the largest channel size. Given that the largest channel size is 464 and the largest 4-bit activation value is 15, the largest partial sum from the convolution is 464 × 15 = 6960, which requires 13 bits to represent.
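The bit-width arithmetic above can be checked with a short sketch, assuming unsigned accumulation of 1-bit {0, 1} weights against unsigned 4-bit activations (the signedness convention is our assumption, chosen to match the 13-bit figure in the text):

```python
import math

def partial_sum_bits(max_channels: int, act_bits: int) -> int:
    """Bits needed to hold the largest unsigned partial sum of a 1x1 conv
    with 1-bit {0,1} weights and unsigned act_bits-wide activations."""
    max_sum = max_channels * (2 ** act_bits - 1)  # 464 * 15 = 6960
    return math.ceil(math.log2(max_sum + 1))      # values 0..6960 -> 13 bits

print(partial_sum_bits(464, 4))  # -> 13
```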
Our hardware design is based on the dataflow architecture template [cheng2016high, vivado2018ug]. As illustrated in Fig. Document, we first extract a few process functions from the major operations, including convolution, pooling, shift, shuffle, and the memory load and store. We then chain them together using FIFOs with blocking reads and non-blocking writes; a write blocks only once the FIFO is full. All the process functions run concurrently, and the execution of each function is triggered by the arrival of data. Therefore, task-level parallelism can be explicitly exposed to the HLS tool in addition to instruction-level parallelism.
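The behavior of this FIFO-chained pipeline can be illustrated with a small software analogue (a sketch, not our HLS source): each process function is a thread that blocks on its input queue, and the bounded queue reproduces the "write blocks only when the FIFO is full" semantics. The stage functions here are hypothetical stand-ins for the conv/pool/shift/shuffle processes.

```python
import threading, queue

def stage(fn, q_in, q_out):
    """One process function: blocking read from q_in, write to q_out.
    The bounded queue makes the put() block only when the FIFO is full."""
    while True:
        x = q_in.get()              # blocking read, like a FIFO pop
        if x is None:               # end-of-stream token
            q_out.put(None)
            break
        q_out.put(fn(x))            # blocks only if q_out is full

# hypothetical per-stage computations standing in for conv and pool
conv = lambda x: x * 2
pool = lambda x: x + 1

q0, q1, q2 = (queue.Queue(maxsize=4) for _ in range(3))
threads = [
    threading.Thread(target=stage, args=(conv, q0, q1)),
    threading.Thread(target=stage, args=(pool, q1, q2)),
]
for t in threads:
    t.start()
for v in [1, 2, 3]:                 # stream input features in
    q0.put(v)
q0.put(None)

out = []
while (v := q2.get()) is not None:  # drain the last FIFO
    out.append(v)
for t in threads:
    t.join()
print(out)  # -> [3, 5, 7]
```

Both stages run concurrently; downstream work starts as soon as the first datum arrives, which is the task-level parallelism the dataflow template exposes.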
The notations used in this section are listed in Table Document. As shown in Fig. Document, given an input feature map of size W × H × IC and a weight kernel of size IC × OC, the generated output feature map is of size W × H × OC in 1×1 convolution. The 1×1 convolution is essentially a matrix-matrix multiplication.
Although [kwon2018co] suggests a weight-stationary dataflow for ConvNets dominated by 1×1 convolutions, we find it not applicable to our design, as the bit width of the weights is much smaller than that of the partial sums (1 bit vs. 13 bits). Transferring the partial sums on and off chip would incur more traffic on the memory bus. Therefore, we adopt an output-stationary dataflow, retaining the partial sums in the local register file until an output feature is produced.
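The output-stationary schedule can be sketched as follows (a reference model in Python, not the hardware loop nest): for each output pixel, the OC-wide accumulator plays the role of the local register file and is written back only once all input channels have been streamed through.

```python
import numpy as np

def conv1x1_output_stationary(x, w):
    """Output-stationary 1x1 convolution reference model.
    x: (H, W, IC) activations; w: (IC, OC) weights (1-bit in hardware).
    Partial sums stay in a local accumulator until the output is done."""
    H, W, IC = x.shape
    _, OC = w.shape
    y = np.zeros((H, W, OC), dtype=np.int32)
    for i in range(H):
        for j in range(W):
            acc = np.zeros(OC, dtype=np.int32)  # the "register file"
            for ic in range(IC):                # stream input channels
                acc += x[i, j, ic] * w[ic]      # partial sums never leave chip
            y[i, j] = acc                       # single writeback per pixel
    return y

x = np.random.randint(0, 16, (2, 2, 8))         # 4-bit activations
w = np.random.randint(0, 2, (8, 4))             # 1-bit weights
assert np.array_equal(conv1x1_output_stationary(x, w),
                      np.tensordot(x, w, axes=([2], [0])))
```

Because only the final 4-bit activations (after the conversion unit) cross the memory bus, the wide 13-bit partial sums never generate DRAM traffic.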
Fig. Document shows how we schedule the workload onto the accelerator. Note that the nested loops starting at lines 17 and 19 are automatically unrolled. Weights are prefetched into on-chip BRAM. We first block our inputs so that the multiplications can be mapped onto the compute units at each iteration (lines 13–21). In every iteration, input features are fetched from DRAM and convolved with the corresponding weights to produce partial sums. Each iteration of the loop nest along the input channel dimension at line 12 takes a fixed number of cycles according to the Vivado HLS report; equivalently, that is the time to finish the corresponding 1-bit × 4-bit multiplications. The partial sums are stored in registers, which can be accessed simultaneously in every cycle. The tiling parameters were tuned for the area-performance tradeoff: increasing them increases overall resource utilization but reduces the total number of execution cycles. Based on the roofline model [williams2009roofline], when the design is bandwidth bound, the attainable throughput is the compute-to-communication (CTC) ratio multiplied by the bandwidth. The CTC ratio of our compute unit for the input features depends on the output channel size (at most 464 in DiracDeltaNet), which is a variable: the deeper the output channel, the higher the CTC ratio. According to our measurements, the maximum bandwidth of the DDR channel is 6 GB/s, which means 12 Giga input features per second (1 byte contains two 4-bit features) can be loaded; the theoretical memory-bound throughput follows from this bandwidth. For compute-bound problems, the attainable throughput depends on the compute capability. Based on this analysis, the convolution unit reaches the bandwidth bound before it hits the computation roofline.
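The roofline reasoning above reduces to one line of arithmetic, sketched here. The 6 GB/s bandwidth is measured and the 418 GOPs peak appears in our results; the sample CTC ratios are hypothetical, chosen only to show that for the CTC ratios our layers reach, throughput stays bandwidth bound.

```python
def attainable_throughput(ctc_ratio, bandwidth_gbs, peak_gops):
    """Roofline model: throughput is bandwidth bound (CTC x BW) until it
    hits the compute roofline (peak_gops)."""
    return min(ctc_ratio * bandwidth_gbs, peak_gops)

# 6 GB/s measured DDR bandwidth; 418 GOPs peak from our implementation.
for ctc in (8, 32, 64):  # hypothetical CTC ratios in GOP per GB
    print(ctc, attainable_throughput(ctc, 6, 418))
```

Only when the CTC ratio exceeds 418 / 6 ≈ 69.7 GOP/GB would the compute roofline take over.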
The high-to-low bit width conversion is performed immediately after the kernel computation. It is a step function with 16 intervals that converts the 13-bit partial sum to a 4-bit activation. The threshold values differ for each layer. All the read-only threshold values are stored in on-chip BRAMs, and the user function specifies an index to select which set of threshold values to use for the current layer. In hardware, this unit is implemented with 16 comparators, mapped onto a binary tree structure to reduce circuit latency.
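Functionally, the conversion unit is a threshold search, as in this sketch; `bisect` performs the same binary-tree comparison order as the hardware comparators. The threshold values here are hypothetical placeholders for the per-layer values learned offline.

```python
import bisect

def quantize(psum: int, thresholds) -> int:
    """Step function mapping a 13-bit partial sum to a 4-bit code.
    15 boundary values delimit 16 intervals; bisect mirrors the
    binary-tree comparator structure used in hardware."""
    return bisect.bisect_right(thresholds, psum)  # result in 0..15

thresholds = [i * 400 for i in range(1, 16)]      # hypothetical per-layer set
print(quantize(0, thresholds), quantize(950, thresholds), quantize(8191, thresholds))
# -> 0 2 15
```

Selecting a layer in hardware corresponds to indexing a different `thresholds` array out of BRAM.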
We adopt the line buffer design described in [zhao2017accelerating] to implement the max-pooling layer. In every iteration, a row of channel-deep pixels is first fetched into the line buffers. Once the next pixel value is fetched, a 2 × 2 sliding window is formed. Every 2 cycles, we compare the values in the sliding window, output the largest one, and fetch the next 2 values. Several such iterations are needed to finish the computation.
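A functional reference for this unit (ignoring the cycle-level line-buffer behavior) is ordinary 2×2, stride-2 max pooling:

```python
import numpy as np

def maxpool2x2(x):
    """2x2, stride-2 max pooling over an (H, W) feature map: the reference
    behavior of the line-buffer unit, which buffers one row so that a 2x2
    window can be formed as soon as the next pixel arrives."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 1, 2, 3],
              [4, 5, 6, 7]])
print(maxpool2x2(x))
```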
The line buffer design is also used for the shift operation. In the shift unit, the input images are first padded with one zero-valued pixel along the width and height dimensions. Rows of pixels are then buffered and a sliding window is formed. The shift direction differs across input channels and is calculated from the input channel index. After initialization, the unit produces one output pixel per cycle.
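The shift operation itself amounts to reading each channel from a padded copy at a per-channel offset, as in this sketch. The channel-index-to-direction mapping below is a hypothetical example; our hardware derives the direction from the channel index in an analogous way.

```python
import numpy as np

def shift_op(x):
    """Per-channel spatial shift: pad by one zero pixel on each border,
    then read each channel at an offset derived from its channel index
    (the modulo mapping here is an illustrative assumption)."""
    H, W, C = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))   # zero padding
    out = np.empty_like(x)
    for c in range(C):
        dy, dx = c % 3 - 1, (c // 3) % 3 - 1       # one of 9 directions
        out[:, :, c] = padded[1 + dy:1 + dy + H, 1 + dx:1 + dx + W, c]
    return out
```

With this mapping, the center direction (dy = dx = 0) leaves its channel unchanged, while the other channels each see their neighborhood shifted by one pixel.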
Shuffle is implemented by changing the address offset of the output features during the writeback phase. Since the shuffle operation still requires concatenating the output of the previous DiracDeltaNet block with the current block's outputs, the CPU is used to copy the output of the previous DiracDeltaNet unit to the shuffled address. This memory copy is performed concurrently with the computation of the current DiracDeltaNet unit.
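The combined effect of the offset writeback and the CPU copy can be modeled as below. The interleaved channel layout is an illustrative assumption; the point is that both producers write disjoint strided slices of one buffer, so no separate shuffle pass is needed.

```python
import numpy as np

def shuffle_concat(prev_out, curr_out):
    """Shuffle-by-address-offset model: the accelerator writes the current
    block's outputs at shuffled offsets, while the CPU memcpy places the
    previous block's outputs into the complementary slots."""
    H, W, C = prev_out.shape
    buf = np.empty((H, W, C + curr_out.shape[2]), dtype=prev_out.dtype)
    buf[:, :, 0::2] = prev_out   # CPU copy into shuffled addresses
    buf[:, :, 1::2] = curr_out   # accelerator writeback with address offset
    return buf
```

Because the two writes target disjoint channel strides, the CPU copy can overlap with the accelerator's computation of the current block.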
Fully Connected Unit
We do not design a dedicated unit to compute the FC layer. Instead, we map its computation onto the existing hardware convolution unit, for which the FC layer's feature map size is 1. Since the convolution unit only supports 1-bit weights, the FC layer's computation is mapped in a bit-serial manner: the convolution unit processes each bit of the FC weights iteratively, and the bit shift is realized by configuring the step function in the conversion unit.
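The bit-serial mapping can be sketched as follows, assuming unsigned multi-bit FC weights for simplicity: each weight bit-plane is a valid 1-bit workload for the convolution unit, and the per-plane shift (folded into the conversion unit's step function in hardware) is made explicit here.

```python
import numpy as np

def fc_bit_serial(x, w, weight_bits=8):
    """FC layer on a 1-bit-weight conv unit: one pass per weight bit-plane.
    x: (IC,) activations; w: (IC, OC) unsigned multi-bit weights."""
    acc = np.zeros(w.shape[1], dtype=np.int64)
    for b in range(weight_bits):
        bit_plane = (w >> b) & 1       # 1-bit weights the conv unit supports
        acc += (x @ bit_plane) << b    # shift, done by the step function in HW
    return acc

x = np.array([3, 1, 2])
w = np.array([[5], [6], [7]])          # hypothetical unsigned weights
print(fc_bit_serial(x, w))  # -> [35] since 3*5 + 1*6 + 2*7 = 35
```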
We use the ARM processor to control the layer-based accelerator and to compute the final average-pooling layer that is not supported by the accelerator. The host application runs on a full Linux system on the ARM CPU and controls the memory-mapped accelerator through the UIO driver interface. The Xilinx Python-based PYNQ APIs [xilinx2018pynq] are used for fast deployment of the host software on the Ultra96 board.
We implement our deep learning accelerator on the Ultra96 development board with a Xilinx Zynq UltraScale+ MPSoC targeted at embedded applications. Table Document shows the overall resource utilization of our implementation. We utilize 34% of the total LUTs on the FPGA, as the bit-level 1-bit × 4-bit multiplications are mapped onto LUTs. BRAMs are mainly used to implement the FIFO channels, and DSPs are used for address calculation in the AXI protocol. Our implementation runs at 250 MHz. Power measurements are obtained via a power monitor: we measured 5.3 W with no workload running on the programmable logic and a maximum of 5.5 W on the Ultra96 power supply line when running our network.
| |LUT|FF|BRAM|DSP|
|Utilization|24130 (34.2%)|29867 (21.2%)|170 (78.7%)|37 (10.3%)|
| |VGG-SVD [qiu2016going]|AlexNet [liang2018fp]|VGG16 [suda2016throughput]|VGG16 [guo2017software]|DoReFa [jiao2017accelerating]|FINN-R [blott2018finnr]|Ours|
|Platform|Zynq XC7Z045|Stratix-V|Stratix-V|Zynq 7Z020|Zynq 7Z020|Zynq ZU3EG|Zynq ZU3EG|
|Frame Rate (fps)|4.5|864.7|3.8|5.7|106.0|200.0|96.5|
|Batch Size|1|2|4|8|10|16|
|Frame Rate (fps)|58.7|72.9|84.1|94.4|95.9|96.5|
We compare our accelerator against previous work in Table Document. As explained before, ConvNets for ImageNet classification are usually orders of magnitude more complex than those for CIFAR10 classification; therefore, we only compare against accelerators targeting ImageNet-scale ConvNets with reasonable accuracy. Our work focuses on achieving competitive accuracy while improving the actual inference speed in terms of frames per second, and our experiments show that we achieve both goals. From the table, we can make the following observations: 1) Our accelerator achieves the highest top-1 and top-5 accuracy on ImageNet. The only previous work that comes close to our accuracy is [guo2017software], but its frame rate is 16.9× slower than ours. 2) Among the embedded accelerators whose top-1 accuracy is higher than 60%, which is a loose constraint, our model achieves the fastest inference speed. 3) Without the accuracy constraint, the accelerators of [liang2018fp, jiao2017accelerating, blott2018finnr] can run as fast as 864.7 frames per second, but their accuracy is rather low. 4) The peak attainable throughput of our accelerator is 418 GOPs, which is close to the theoretical compute roofline. Our average throughput (47.09 GOPs) is currently limited by low hardware utilization. The inefficiency mainly comes from the software shuffle operations and the first convolution layer, whose input channel dimension of 3 is much smaller than the hardware tiling factor. Nevertheless, our accelerator still achieves a competitive frame rate, demonstrating the efficacy of our co-design methodology, and we see opportunities for significant frame rate improvement through further algorithm/hardware co-design. The reported frame rate is achieved with the batch size set to 16. There is a fixed software overhead for invoking the poll-based hardware accelerator. The computation latency of DiracDeltaNet Block1 in Table Document is 0.15 ms when the batch size is 1.
The latency of a single read of the accelerator control register is 0.40 ms, which is greater than the actual compute time. To minimize this software overhead, we increase the batch size to schedule more computation per accelerator invocation. Furthermore, the weights stored in on-chip BRAM are reused more as the batch size increases. The frame rates of implementations with different batch sizes are summarized in Table Document. We break down the runtime of the whole heterogeneous system by bypassing one component at a time and measuring the runtime; the result is shown in Table Document. The whole system runs at 95.9 FPS on ImageNet classification at a batch size of 10, including both hardware PE execution and software execution of average pooling and shuffle. We see from the table that the CPU-based memory copy for the shuffle operation significantly degrades performance, while all other non-conv components affect the overall performance only slightly. To further understand the efficiency of the various operators (1×1 conv, 2×2 max pooling, shift, and shuffle) implemented on the FPGA and the CPU, we measure the runtime of DiracDeltaNet blocks with different configurations; the result is summarized in Table Document. We test 2 blocks with different input feature map and channel sizes. Note that the theoretical OPs of Block1 and Block2 are the same. As shown in the table, pooling and shift incur almost no performance drop, because the process functions performing these operations do not impose new bottlenecks on the dataflow pipeline. The software memory copy latency of shuffle is more significant on Block1 than on Block2, because the memory copy overhead is proportional to the feature map size while the total OPs remain the same: a smaller feature map needs less time for memory copy. The memory copy overhead could be alleviated by running bare-metal C code on the CPU.
| |Runtime (ms)|Frame Rate (fps)|
|w/o sw avg pool|100.3|99.7|
|w/o sw shuffle|70.4|142.1|
| |Block1|Block2|
|feature map size|28|7|
Conclusion and Future Work
In this paper, we adopt an algorithm-hardware co-design approach to develop a ConvNet accelerator called Synetgy and a novel ConvNet model called DiracDeltaNet. Based on ShuffleNetV2, we optimize the network's operators by replacing all the 3×3 convolutions with shift operations and 1×1 convolutions. This allows us to build a compute unit exclusively customized for 1×1 convolutions for better efficiency. We quantize the network's weights to binary and its activations to 4-bit fixed-point numbers with less than 1% accuracy loss; this quantization exploits the nature of FPGA hardware very well. As a result, DiracDeltaNet has a small parameter size, low computational OPs, hardware-friendly skip connections, ultra-low precision, and simplified operators. These features allow us to implement a highly customized and efficient accelerator on an FPGA. We implement the network on the Ultra96 SoC system; the implementation took two people only one month using HLS tools. Our accelerator achieves a top-5 accuracy of 88.2% on ImageNet, the highest among all previously published embedded FPGA accelerators. It also reaches an inference speed of 96.5 FPS, surpassing prior work of similar accuracy by at least 16.9×. While we see many more opportunities for optimization, we believe this demonstrates the efficacy of our co-design methodology. In future work, we will focus on further optimization: for example, we can add more layers to the dataflow architecture to improve the compute-to-communication ratio, and correspondingly adjust the network so that the computation subgraphs are more symmetric.
We would like to thank all of the people who helped us realize this project, especially Kostadin Ilov, Rock Qu, Alessandro Pappalardo, Amir Gholaminejad, Peter Jin, Ravi Krishna, and Alvin Wan. The information, data, or work presented herein was funded in part by the Advanced Research Projects Agency-Energy (ARPA-E), U.S.
Department of Energy, under Award Number DE-AR0000849. Research was partially funded by ADEPT Lab industrial sponsor Intel, and ADEPT Lab affiliates Google, Siemens, and SK Hynix. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.