I Introduction
In recent years, CNNs have been widely used in computer vision applications such as classification [1], object detection [2], and semantic segmentation [3]. However, a CNN usually requires intensive computation, which limits its applicability on embedded devices. To address this issue, FPGA-based accelerators [4] have been proposed so that CNNs can be deployed in real-time embedded systems. As a key operation in many neural networks, deconvolution has been widely used in state-of-the-art CNNs, especially for semantic segmentation [3], image super-resolution [5], and image denoising [6]. Deconvolution extrapolates new information from input feature maps in a learnable way, which outperforms fixed interpolation algorithms such as nearest-neighbor and bicubic interpolation. However, compared with hardware acceleration for convolution, much less attention has been paid to deconvolution. Since deconvolution may become the speed bottleneck if only convolution is accelerated, there is an urgent need to optimize the deconvolution operation on FPGAs.
II Related Work
Lately, tremendous research progress has been made on high-performance and low-power CNN accelerators. In [7], the authors proposed a novel architecture for the process element array, which dramatically reduced the external memory bandwidth requirement through intensive data reuse and outperformed systolic-like structures [8]. A high-throughput CNN accelerator was implemented in [9], where a comprehensive design space exploration on top of accurate models determined the optimal design configuration.
Compared to accelerator designs for convolution, those for deconvolution have not been thoroughly investigated. Liu et al. proposed a CNN architecture where convolution and deconvolution were accelerated by separate modules [10]. This architecture is not efficient enough because, in most CNNs, convolution and deconvolution do not run in parallel. A high-performance deconvolution module in [11] used reverse looping and stride hole skipping techniques, but at the cost of additional hardware resources and latency. A unified systolic accelerator was developed in [12], which divided deconvolution into two steps: it first multiplied one input vector with the kernel and stored the temporary matrices in on-chip memory, then added up the overlaps of the temporary matrices. This method increased on-chip BRAM accesses and introduced unnecessary data storage, so both power consumption and computation latency grew.
To address the issues mentioned above, we analyze the properties of deconvolution and fit it into our proposed process element array, so that both convolution and deconvolution are handled by the same on-chip resources. The contributions of our work are summarized as follows:

A novel process element structure is proposed, so that both convolution and deconvolution are supported without extra circuitry.
II-A Deconvolution
Deconvolution, also called transposed convolution, is a learnable method to perform upsampling. If a convolution unit is directly reused for deconvolution, it consists of the following two steps: 1) padding the input feature map with zeros, and 2) applying convolution on the padded feature map, as indicated in Fig. 1. After padding, the input feature map is expanded, and the output feature map size grows accordingly. In order to get exactly twice the input size, extra padding of the upper row and left column (highlighted in blue in Fig. 1) is needed.
III Optimization
Because of the limited resources on an FPGA, a high-performance CNN accelerator has to be deeply optimized for memory access and data transfer while maximizing resource utilization.
III-A Loop optimization
To efficiently map the convolution loops, three loop optimization techniques (loop unrolling, loop tiling, and loop interchange) have been considered to customize the computation and communication patterns of the accelerator. Loop unrolling is the parallelization strategy for certain convolution loops, and demands more multipliers. Loop tiling determines the partitioning of the feature maps, and consequently the required size of on-chip memory. Loop interchange decides the computation order of the convolution loops [15].
After weighing all three optimization methods, our strategy is to fully unroll loop-1, partially unroll loop-2 and loop-4, and apply loop tiling on the depth of the input feature maps (Fig. 2). Loop-1 is fully unrolled to enable joint optimization with deconvolution (more details are given in the next section). To reduce the number of partial sums and the amount of data transfer, loop-2 should be unrolled as much as possible; however, since this requires a large number of multipliers, loop-2 is only partially unrolled. To minimize on-chip memory accesses, loop-4 is also unrolled, which allows pixel reuse. Since the partial sums of loop-2 are stored in BRAM, no extra off-chip memory overhead is added.
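The strategy above can be sketched in plain Python. We assume the loop numbering of [15] (loop-1: MACs within a kernel window, loop-2: input feature maps, loop-3: output pixels, loop-4: output feature maps); the tile depth `Tc` and the loop-4 unroll factor `Pco` below are illustrative values, not the ones used in the accelerator:

```python
import numpy as np

def conv_reference(ifm, w):
    """Plain convolution, loops numbered as in [15].
    ifm: (C_in, H, W), w: (C_out, C_in, K, K), stride 1, no padding."""
    C_out, C_in, K, _ = w.shape
    H_out = ifm.shape[1] - K + 1
    W_out = ifm.shape[2] - K + 1
    ofm = np.zeros((C_out, H_out, W_out))
    for co in range(C_out):              # loop-4: output feature maps
        for y in range(H_out):           # loop-3: output pixels
            for x in range(W_out):
                for ci in range(C_in):   # loop-2: input feature maps
                    for ky in range(K):  # loop-1: kernel window
                        for kx in range(K):
                            ofm[co, y, x] += ifm[ci, y + ky, x + kx] * w[co, ci, ky, kx]
    return ofm

def conv_tiled(ifm, w, Tc=2, Pco=2):
    """Same computation with loop-2 tiled by Tc (one input-depth tile
    resident on chip at a time) and loop-4 partially unrolled by Pco
    (Pco output channels per pass, reusing the same input pixels).
    Loop-1 is fully unrolled in hardware; here it is a vectorized sum."""
    C_out, C_in, K, _ = w.shape
    H_out = ifm.shape[1] - K + 1
    W_out = ifm.shape[2] - K + 1
    ofm = np.zeros((C_out, H_out, W_out))
    for ci0 in range(0, C_in, Tc):               # tile over input depth
        tile = ifm[ci0:ci0 + Tc]                 # this tile lives in the IF buffer
        for co0 in range(0, C_out, Pco):         # partially unrolled loop-4
            for y in range(H_out):
                for x in range(W_out):
                    win = tile[:, y:y + K, x:x + K]   # pixel window shared by Pco channels
                    for co in range(co0, min(co0 + Pco, C_out)):
                        # loop-1 fully unrolled: all K*K MACs at once
                        ofm[co, y, x] += np.sum(win * w[co, ci0:ci0 + Tc])
    return ofm
```

Both routines compute identical outputs; the tiled version only reorders the accumulation, which is what makes the BRAM-resident partial sums of loop-2 possible.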
III-B Deconvolution
Fig. 1 presents the naive way to implement deconvolution in hardware. As can be seen, many operations are wasted on multiplications by zero.
The mathematical expression of the feature map deconvolution is given in Fig. 3. Based on equations (1)-(4) for deconvolution, we conclude that most of the redundant multiplications can be avoided. The procedure is summarized in three steps: 1) pad the input feature map if size doubling is expected; 2) scan the padded input feature map with a sliding window; 3) apply deconvolution to each patch using the kernel as in equations (1)-(4). Three examples are highlighted by colored squares in the padded feature map and the output feature map in Fig. 3.
(Equations (1)-(4): per-patch deconvolution output expressions, illustrated in Fig. 3.)
According to the TensorFlow convention, the kernel should be rotated by 180° during deconvolution. To make the figure easier to understand, we assume that the kernel has already been rotated.
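As an illustration of the redundancy being removed, the following NumPy sketch contrasts the naive zero-insertion approach of Fig. 1 with a direct per-output computation that skips every multiplication by an inserted zero. A 3x3 kernel and stride 2 are assumed, and the extra upper-row/left-column padding is omitted for simplicity, so the output here is 2N-1 rather than 2N per side:

```python
import numpy as np

def deconv_naive(x, k):
    """Naive transposed convolution, stride 2, 3x3 kernel: insert zeros
    between input pixels, zero-pad by one, then apply an ordinary
    convolution with the 180-degree-rotated kernel (TensorFlow convention)."""
    N = x.shape[0]
    up = np.zeros((2 * N - 1, 2 * N - 1))
    up[::2, ::2] = x                      # zero insertion
    pad = np.pad(up, 1)                   # one ring of zero padding
    kr = k[::-1, ::-1]                    # kernel rotated by 180 degrees
    out = np.zeros((2 * N - 1, 2 * N - 1))
    for y in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[y, c] = np.sum(pad[y:y + 3, c:c + 3] * kr)
    return out

def deconv_direct(x, k):
    """Same result, but each output pixel touches only the few input
    pixels whose zero-inserted positions fall under the 3x3 window,
    skipping every multiplication by an inserted zero."""
    N = x.shape[0]
    kr = k[::-1, ::-1]
    out = np.zeros((2 * N - 1, 2 * N - 1))
    for y in range(2 * N - 1):
        for c in range(2 * N - 1):
            s = 0.0
            for a in range(3):
                for b in range(3):
                    p, q = y + a - 1, c + b - 1   # position in the upsampled map
                    # only even positions hold real input pixels
                    if p % 2 == 0 and q % 2 == 0 and 0 <= p // 2 < N and 0 <= q // 2 < N:
                        s += x[p // 2, q // 2] * kr[a, b]
            out[y, c] = s
    return out
```

In the direct form, at most 4 of the 9 window positions ever hold a real input pixel, which is exactly why the patch fits the 9-multiplier process element with room to spare.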
III-C Quantization method
A fine-tuning-with-quantization-constraint method [16] is employed in our design. It effectively diminishes the negative impact of brute-force quantization while introducing more nonlinearity. Different from the ordinary quantization method [17], we quantize the weights and biases before storage. This quantization method does not require modification of the TensorFlow source code.
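As a sketch of the storage-side step only (the fine-tuning constraint of [16] is not reproduced here, and the actual scale selection may differ), symmetric 8-bit quantization of the weights before storage might look like:

```python
import numpy as np

def quantize_8bit(w):
    """Illustrative symmetric 8-bit quantization: scale weights into the
    int8 range before storing them for the accelerator."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate real-valued weights from stored int8 values."""
    return q.astype(float) * scale
```

Because the scale is chosen from the maximum absolute weight, the round-trip error of every weight is bounded by half a quantization step.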
IV Hardware Architecture
The overview of the hardware architecture is shown in Fig. 4. The line buffer converts convolution into matrix multiplication by reorganizing the input image. The process element array multiplies the input image by the weights. After batch normalization, activation, and pooling, the output feature map is stored in the Output Feature-map (OF) buffer.
IV-A Line buffer
A line buffer is designed to build the expected sliding windows and to perform zero-padding for convolution. In the proposed accelerator, the line buffer bridges the AXI DMA and the Input Feature-map (IF) buffer (Fig. 5). Data and valid signals from the AXI-Stream interface are inserted into cascaded FIFOs. Extra logic generates the status signals (including empty and full) of each FIFO, so that the line buffer can fit feature maps of different sizes. A padding controller decides when to push data into each FIFO and when to output zero padding, according to the preloaded padding mode.
We choose not to buffer an entire feature map in on-chip memory, because the buffer size would be restricted by the limited on-chip memory, and this could result in inefficient buffer usage when the feature map size drops. Therefore, only part of the input feature map is loaded into the IF buffer, which in turn requires different zero-padding modes. Hence, we provide different preloaded work modes for this line buffer.
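The line-buffer behavior can be emulated in software. This sketch keeps only the last three rows of a row-major pixel stream resident, mirroring the cascaded FIFOs, and hard-codes 3x3 windows with one ring of zero padding; the FIFO status signals and the configurable padding modes are omitted:

```python
def _emit(rows, cr, W):
    """All W zero-padded 3x3 windows whose center sits on row cr."""
    zero = [0] * W
    out = []
    for c in range(W):
        win = [[(rows.get(cr + dr, zero)[c + dc] if 0 <= c + dc < W else 0)
                for dc in (-1, 0, 1)]
               for dr in (-1, 0, 1)]
        out.append(win)
    return out

def windows_3x3(pixels, H, W):
    """Line-buffer sketch: only the last three rows of an H x W row-major
    pixel stream are ever resident, yet every 3x3 zero-padded window
    comes out in raster order, without buffering the whole feature map."""
    it = iter(pixels)
    rows = {}                                  # absolute row index -> row of pixels
    out = []
    for r in range(H):
        rows[r] = [next(it) for _ in range(W)]  # fill one "FIFO" line
        rows.pop(r - 3, None)                   # at most 3 rows stay on chip
        if r >= 1:
            out.extend(_emit(rows, r - 1, W))   # windows centered on row r-1
    out.extend(_emit(rows, H - 1, W))           # flush the last row of centers
    return out
```

Missing rows above and below the image resolve to the zero-filled default, which is exactly the zero-padding role of the padding controller.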
IV-B Process element array
The input feature map is stored in the IF buffers in the form of vectors, which reduces BRAM consumption. To rebuild the sliding windows for convolution/deconvolution, a shift register is placed between the IF buffer and the process element arrays. Each process element array consists of multiple process elements, as in Fig. 6, usually a power of 2. This number is scalable, depending on the bandwidth of the platform. Each process element comprises a 9-multiplier array and an adder tree that sums up the products.
IV-B1 Convolution
During convolution, sliding windows and their corresponding weights are transmitted into the process element array. In each process element, they are multiplied and summed up.
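Functionally, one process element pass is a 9-term dot product; the sketch below makes the adder-tree reduction explicit (the pairing order is illustrative, not taken from the actual design):

```python
def process_element(window9, weights9):
    """One process element pass: 9 parallel multiplies, then a binary
    adder-tree reduction (as in Fig. 6). Functionally a dot product;
    the tree shows how hardware sums the products in ~log2(9) stages."""
    products = [a * b for a, b in zip(window9, weights9)]   # 9-multiplier array
    level = products
    while len(level) > 1:                                   # adder-tree stages
        level = [level[i] + (level[i + 1] if i + 1 < len(level) else 0)
                 for i in range(0, len(level), 2)]
    return level[0]
```

A tree of adders keeps the critical path logarithmic in the number of products, which is why it is preferred over a sequential accumulator at high clock rates.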
IV-B2 Deconvolution
As discussed in Section III-B, the per-patch deconvolution consumes a number of multiplications and additions that fits well into the proposed process element. During deconvolution, the process element structure is reused, but with a different data routing mechanism.
Unlike convolution, which outputs one pixel after another, in deconvolution mode one process element generates several pixels in parallel. Considering the bit width of the OF buffer, these output pixels are fed into the buffer serially.
IV-C Pooling
The pooling operation in CNNs is either max pooling or average pooling over sliding windows. Another line buffer and shift register are utilized to generate the sliding windows. The work mode can be determined prior to compilation or configured on the fly.
IV-D Batch normalization and activation
During inference, batch normalization reduces to a multiplication followed by an addition, which we simply absorb into the process element. As for the activation function, our design supports both ReLU and LeakyReLU.
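The folding works because BN(x) = gamma * (x - mean) / sqrt(var + eps) + beta is affine in x at inference time, so it collapses into a single multiply-add y = a*x + b that the process element absorbs:

```python
import numpy as np

def fold_batchnorm(gamma, beta, mean, var, eps=1e-5):
    """Collapse inference-time batch normalization into one multiply-add.
    Returns (a, b) such that a * x + b == gamma*(x-mean)/sqrt(var+eps)+beta."""
    a = gamma / np.sqrt(var + eps)
    b = beta - a * mean
    return a, b
```

The constants a and b are computed once offline per channel, so no division or square root is ever needed on the FPGA datapath.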
IV-E System controller
The system controller determines 1) the data flow, such as when to trigger the process element array, which data are assigned to which buffers, and when to start data transfers to DDR, and 2) the settings of each component mentioned above, such as the padding mode of the line buffer, the work mode of the process element arrays, and whether to bypass the activation or pooling modules. It is implemented as an FSM. The settings are predefined and preloaded into register files, so the FSM only has to read and execute them sequentially.
V Implementation Considerations of SegNet-Basic
The encoder part of SegNet-Basic includes convolution layers and max pooling layers; the decoder part has convolution layers and deconvolution layers. The total parameter size for inference is about 42 Mb, with 8-bit quantization for feature maps and weights. The accelerator is designed using Simulink and the HDL Coder toolbox. Our target platform, the Xilinx ZC706, contains 900 DSP slices and 19.2 Mb of Block RAM.
VI Results and Discussion
The test setup of the SegNet-Basic hardware accelerator is shown in Fig. 7. On the Zynq platform, the hardware accelerator is attached as a peripheral of the ARM Cortex-A9 processor. Two Direct Memory Access (DMA) engines move data between the accelerator and DDR memory. The input images and parameters are preloaded into memory and transferred to the PL by the DMAs. The CNN accelerator runs at a clock frequency of 220 MHz. Its total resource consumption is summarized in Tab. II.
TABLE II: Resource consumption
LUTs | Registers | BRAMs | DSPs
16579 (8%) | 25390 (6%) | 537 (99%) | 576 (64%)
Compared to other implementations (Tab. I), our design achieves better performance for deconvolution. Thanks to the shared architecture, a better balance between performance and resource efficiency for convolution and deconvolution is obtained. However, in order to support both operations, the architecture is not deeply optimized for convolution specifically; therefore, the convolution performance is not as high as that of the deeply optimized implementation in [19].
VI-A Scalability
Scalability is represented by the number of process element arrays in the accelerator, which balances bandwidth against computation capability. For SegNet-Basic, the number of process element arrays is set to 1, which means the input and output data bit width is 64. If higher bandwidth is available, higher performance is possible.
VI-B Latency of operations
To compare latency, we perform convolution and max pooling on a feature map (producing a smaller feature map) followed by deconvolution. We find that the times for convolution and deconvolution are the same. The padding time differs due to the different sizes of the input feature maps, and pooling and ReLU require additional time. Double buffering eliminates the data transfer time difference. Therefore, deconvolution saves about 3.2% of the processing time compared to convolution plus max pooling and ReLU.
VII Conclusions
In this paper, a scalable and configurable CNN accelerator architecture has been proposed that combines convolution and deconvolution in a single process element. The deconvolution operation is completed in one step, and no buffering of intermediate results is needed. In addition, SegNet-Basic has been implemented on the Xilinx Zynq ZC706 FPGA, achieving 151.5 GOPS for convolution and 94.3 GOPS for deconvolution, which outperforms state-of-the-art segmentation CNN implementations.
Acknowledgment
This work was supported by MathWorks Inc.
References
[1] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[2] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems (NIPS), pp. 91-99, 2015.
[3] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431-3440, 2015.
[4] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in Proceedings of the 43rd International Symposium on Computer Architecture (ISCA), pp. 367-379, 2016.
[5] C. Ledig et al., "Photo-realistic single image super-resolution using a generative adversarial network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4681-4690, 2017.
[6] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, "Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising," IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142-3155, 2017.
[7] U. Aydonat, S. O'Connell, D. Capalija, A. C. Ling, and G. R. Chiu, "An OpenCL deep learning accelerator on Arria 10," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 55-64, 2017.
[8] C. Zhang, Z. Fang, P. Zhou, P. Pan, and J. Cong, "Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks," in Proceedings of the 35th International Conference on Computer-Aided Design, p. 12, 2016.
[9] X. Wei, C. H. Yu, P. Zhang, Y. Chen, Y. Wang, H. Hu, Y. Liang, and J. Cong, "Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs," in Proceedings of the 54th Annual Design Automation Conference, p. 29, 2017.
[10] S. Liu, H. Fan, X. Niu, H. Ng, Y. Chu, and W. Luk, "Optimizing CNN-based segmentation with deeply customized convolutional and deconvolutional architectures on FPGA," ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 11, no. 3, 2018.
[11] X. Zhang, S. Das, O. Neopane, and K. Kreutz-Delgado, "A design methodology for efficient implementation of deconvolutional neural networks on an FPGA," arXiv preprint arXiv:1705.02583, 2017.
[12] D. Xu, K. Tu, Y. Wang, C. Liu, B. He, and H. Li, "FCN-engine: Accelerating deconvolutional layers in classic CNN processors," in Proceedings of the International Conference on Computer-Aided Design, p. 22, 2018.
[13] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481-2495, 2017.
[14] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234-241, 2015.
[15] Y. Ma, Y. Cao, S. Vrudhula, and J. Seo, "Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 45-54, 2017.
[16] Y. Lyu, L. Bai, and X. Huang, "ChipNet: Real-time LiDAR processing for drivable region segmentation on an FPGA," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 66, no. 5, pp. 1769-1779, 2019.
[17] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, "Quantization and training of neural networks for efficient integer-arithmetic-only inference," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2704-2713, 2018.
[18] J. Qiu, J. Wang, S. Yao et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), pp. 26-35, 2016.
[19] Q. Xiao, Y. Liang et al., "Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs," in Proceedings of the 54th ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1-6, 2017.
Iii Optimization
Because of the limited resources on an FPGA, a high performance CNN accelerator has to be deeply optimized on memory access and data transfer while maximizing the resource utilization.
Iiia Loop optimization
To efficiently map the convolution loops, three loop optimization techniques, loop unrolling, loop tiling and loop interchange, have been considered to customize the computation and communication patterns of the accelerator. Loop unrolling is the parallelism strategy for certain convolution loops, which demands more multipliers. Loop tiling determines the partition of feature maps, and consequently determines the required size of onchip memory. Loop interchange decides the computation order of the convolution loops [15].
After carefully judging all three optimization methods, our optimization strategy is to unroll loop1, and partially unroll loop2, loop4, and apply loop tiling on depth of input feature maps (Fig. 2). In order to jointly optimize with deconvolution, loop1 is fully unrolled (more details about this will be explained in the next section). In order to reduce the number of partial sums and data transfer, loop2 must be unrolled as much as possible. However, as the large amount of multipliers are required, loop2 is only partially unrolled. Further consideration has been taken for onchip memory access minimization. Therefore, loop4 is unrolled because of pixel reuse. As the partial sum in loop2 are stored in BRAM, no more overhead to the offchip memory is added.
IiiB Deconvolution
Fig. 1 presents the naive way to implement the deconvolution on hardware. As it can be seen, too many operations are wasted in multiplication by zeros.
The mathematical expression of the feature map deconvolution is given in Fig. 3. Based on equation (14) for the deconvolution, we conclude most of the redundant multiplications can be avoided. The procedure is summarized into three steps: 1) padding the input feature map if size doubling is expected; 2) scanning the padded input feature map by a sliding window; 3) applying deconvolution for each patch using kernel as in equation (14). Three examples are highlighted by colored squares in padded feature map and output feature map in Fig. 3.
(1)  
(2)  
(3)  
(4) 
According to TensorFlow, during deconvolution, the kernel should be rotated by 180°. To make the figure easier to understand, we assume that the kernel has been rotated already.
IiiC Quantization method
A finetuning with quantization constraint method [16] is employed in our design. It effectively diminishes the negative impact of bruteforce quantization while introducing more nonlinearity. Different from the ordinary quantization method [17], we quantize the weights and bias before storage. This quantization method does not require modification of the TensorFlow source code.
Iv Hardware Architecture
The overview of hardware architecture is shown in Fig. 4
. Line buffer converts the convolution into matrix multiplication by reorganizing the input image. The process element array multiplies input image by the weights. After batch normalization, activation and pooling, the output feature map is stored in Output Featuremap (OF) buffer.
Iva Line buffer
A line buffer is designed to build the expected sliding window and to perform zeropadding for convolution. In the proposed accelerator, line buffer bridges the AXI DMA and Input Featuremap (IF) buffer (Fig. 5). Data and valid signals from AXI Stream interface are inserted into cascaded FIFOs. Extra logic is added to generate the status signals (including empty and full signals) of each FIFO, so that the line buffer is able to fit feature maps with different sizes. Padding controller decides when to push the data into each FIFO and when to output zero padding according to the preloaded padding mode.
We choose not to buffer one entire feature map in onchip memory, because the buffer size would be restricted by the limited onchip memory and this could result inefficient buffer usage when feature map size drops. Therefore, only part of the input feature map is loaded into IF buffer and consequently this requires different zeropadding modes. Hence we provide different preloaded work modes for this line buffer.
IvB Process element array
The input feature map is stored in the IF buffers in form of vector, which reduces the BRAM consumption. To rebuild the / sliding window for convolution/deconvolution, a shift register is placed between IF buffer and process element arrays. Each process element array consists of multiple process elements as in Fig. 6, usually in a power of 2. This number is scalable, depending on the bandwidth of platform. Each process element comprises of a 9multiplier array and an adder tree to sum up the products.
IvB1 Convolution
During convolution, sliding windows and their corresponding weights are transmitted into process element array. In the process element, they are multiplied and summed up.
IvB2 Deconvolution
As discussed in Section IIIB, the deconvolution for each patch consumes multiplications and additions. This fits well to the proposed process element. During deconvolution, the process element structure is reused but with a different data routing mechanism.
Different from convolution who outputs one pixel after another, in the deconvolution mode, one process element generates pixels in parallel. Considering the bitwidth of OF buffer, these output pixels are fed into buffer in serial.
IvC Pooling
The pooling operation in CNNs is either max pooling or average pooling for
sliding windows. Another line buffer and shift register are utilized to generate sliding windows. Its work mode can be determined prior to compilation or configured onthefly.IvD Batch normalization and activation
During inference, the batch normalization is downgraded into a multiplication with an addition. We simply absorb it into the process element. Concerning to the activation function, our design supports both ReLU and LeakyReLU.
IvE System controller
The system controller determines: 1) the data flow such as when to trigger process element array, which data to be assigned into buffers, when to start data transfer to DDR, and 2) the setting for each component mentioned above, like the padding mode of line buffer module, work mode of process element arrays, and whether to bypass activation or pooling modules. It is implemented as a FSM. These settings are predefined and preloaded into register files so that FSM only has to read and execute sequentially.
V Implementation Considerations of SegNetBasic
The encoder part of SegNetBasic includes convolution layers and max pooling layers. The decoder part has convolution layers and deconvolution layers. The total parameter size for inference is about 42Mb, with 8bit quantization for feature maps and weights. The accelerator is designed using Simulink and the HDL Coder toolbox. Our target platform Xilinx ZC706 contains 900 DSP slices and 19.2Mb Block RAMs.
Vi Results and Discussion
The test setup of SegNetbasic hardware accelerator is demonstrated in Fig. 7. In the Zynq platform, hardware accelerator is loaded as a peripheral of ARM A9 processor. Two Direct Memory Access (DMAs) move the data between accelerator and DDR memory. The input images and parameters are preloaded into memory and transferred to PL by DMAs. This CNN accelerator clock frequency is 220MHz. Its total resource consumption is summarized in Tab. II.
LUTs  Registers  BRAMs  DSPs 
16579 (8%)  25390 (6%)  537 (99%)  576 (64%) 
Comparing to other implementations (in Tab. I), our design achieves better performance in case of deconvolution. Due to the sharing architecture, a better balance on both performance and resource efficiency for convolution and deconvolution is obtained. However, in order to support both operations, the architecture is not deeply optimized for convolution specifically. Therefore, the convolution performance is not as high as that from deeply optimized implementation in [19].
Via Scalability
Scalability is represented by the number of process element arrays in the accelerator. It is balance of bandwidth and computation capability. In SegNetBasic, the number of process element arrays is set to 1. This means the input and output data bitwidth is 64. If higher bandwidth is supported, higher performance is possible.
ViB Latency of operations
In order to compare the latency, we perform convolution and max pooling on a feature map (resulting a feature map) followed by deconvolution. We find the time for convolution and deconvolution are the same. The padding time difference is about due to different sizes of input feature maps. Considering pooling and ReLU, another is needed. Double buffering eliminates the data transfer time difference. Therefore, deconvolution saves about 3.2% processing time if comparing to convolution plus maxpooling and ReLU.
Vii Conclusions
In this paper, a scalable and configurable CNN accelerator architecture has been proposed by combining both convolution and deconvolution into single process element. The deconvolution operation is completed in one step and buffering of intermediate results is not needed. In addition, SegNetBasic has been successfully implemented on Xilinx Zynq ZC706 FPGA that achieves the performance of 151.5 GOPS for convolution and 94.3 GOPS for deconvolution, which outperforms stateoftheart segmentation CNN implementations.
Acknowledgment
This work was supported by the Mathworks Inc.
References
 [1] K. Simonyan and A. Zisserman, ”Very deep convolutional networks for largescale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [2] S. Ren, K. He, R. Girshick, and J. Sun, ”Faster RCNN: Towards realtime object detection with region proposal networks,” In Advances in neural information processing systems (NIPS), pp. 9199. 2015.

[3]
J. Long, E. Shelhamer, and T. Darrell,
”Fully convolutional networks for semantic segmentation,”
In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR)
, pp. 34313440. 2015.  [4] Y.H. Chen, J. Emer, and V. Sze, ”Eyeriss: A spatial architecture for energyefficient dataflow for convolutional neural networks,” In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA), pp. 367379, 2016.
 [5] C. Ledig, et al, ”Photorealistic single image superresolution using a generative adversarial network,” In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), pp. 46814690. 2017.
 [6] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, ”Beyond a gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 31423155, 2017.

[7]
U. Aydonat, S. O’Connell, D. Capalija, A.C. Ling, and G.R. Chiu, ”An opencl™ deep learning accelerator on arria 10,”
In Proceedings of the 2017 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pp. 5564. 2017.  [8] C. Zhang, Z. Fang, P. Zhou, P. Pan, J. Cong, ”Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks,” In Proceedings of the 35th International Conference on ComputerAided Design, p. 12, 2016.
 [9] X. Wei, C.H. Yu, P. Zhang, Y. Chen, Y. Wang, H. Hu, Y. Liang, and J. Cong, ”Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs,” In Proceedings of the 54th Annual Design Automation Conference, p. 29, 2017.
 [10] S. Liu, H. Fan, X. Niu, H. Ng, Y. Chu, and W. Luk, ”Optimizing CNNbased Segmentation with Deeply Customized Convolutional and Deconvolutional Architectures on FPGA,” ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 11, no. 3, 2018.
 [11] X. Zhang, S. Das, O. Neopane, K. KreutzDelgado, ”A Design Methodology for Efficient Implementation of Deconvolutional Neural Networks on an FPGA,” arXiv preprint arXiv:1705.02583, 2017.
 [12] D. Xu, K. Tu, Y. Wang, C. Liu, B. He, and H. Li, ”FCNelement: accelerating deconvolutional layers in classic CNN processors,” In Proceedings of the International Conference on ComputerAided Design, pp. 22, 2018.
 [13] V. Badrinarayanan, A. Kendall, and R. Cipolla, ”Segnet: A deep convolutional encoderdecoder architecture for image segmentation,” IEEE transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 24812495, 2017.
 [14] O. Ronneberger, P. Fischer, and T. Brox, ”Unet: Convolutional networks for biomedical image segmentation,” In International Conference on Medical Image Computing and Computer Assisted Intervention, pp. 234241, 2015.
 [15] Y. Ma, Y. Cao, S. Vrudhula, and J. Seo, ”Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks,” In Proceedings of the 2017 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pp. 4554, 2017.
 [16] Y. Lyu, L. Bai, and X. Huang., ”Chipnet: Realtime LiDAR processing for drivable region segmentation on an FPGA,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 66, no. 5, pp. 1769  1779, 2019.
 [17] B. Jacob, S. Kligys, B. Chen, M. Zhu, M.Tang, A. Howard, H. Adam, and D. Kalenichenko, ”Quantization and training of neural networks for efficient integerarithmeticonly inference,” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 27042713, 2018.
 [18] J. Qiu, J. Wang, S. Yao et al, ”Going deeper with embedded fpga platform for convolutional neural network,” In Proceedings of the 2016 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays(FPGA), pp. 2635, 2016.
 [19] Q. Xiao, Y. Liang et al, ”Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs,” In 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 16, 2017.
Iv Hardware Architecture
The overview of hardware architecture is shown in Fig. 4
. Line buffer converts the convolution into matrix multiplication by reorganizing the input image. The process element array multiplies input image by the weights. After batch normalization, activation and pooling, the output feature map is stored in Output Featuremap (OF) buffer.
Iva Line buffer
A line buffer is designed to build the expected sliding window and to perform zeropadding for convolution. In the proposed accelerator, line buffer bridges the AXI DMA and Input Featuremap (IF) buffer (Fig. 5). Data and valid signals from AXI Stream interface are inserted into cascaded FIFOs. Extra logic is added to generate the status signals (including empty and full signals) of each FIFO, so that the line buffer is able to fit feature maps with different sizes. Padding controller decides when to push the data into each FIFO and when to output zero padding according to the preloaded padding mode.
We choose not to buffer an entire feature map in on-chip memory, because the buffer size would be restricted by the limited on-chip memory, and this could result in inefficient buffer usage when the feature map size drops. Therefore, only part of the input feature map is loaded into the IF buffer, which in turn requires different zero-padding modes. Hence, we provide different preloaded work modes for this line buffer.
IV-B Process Element Array
The input feature map is stored in the IF buffers in the form of vectors, which reduces BRAM consumption. To rebuild the sliding window for convolution/deconvolution, a shift register is placed between the IF buffer and the process element arrays. Each process element array consists of multiple process elements, as shown in Fig. 6, usually a power of 2. This number is scalable, depending on the bandwidth of the platform. Each process element comprises a 9-multiplier array and an adder tree to sum up the products.
IV-B1 Convolution
During convolution, sliding windows and their corresponding weights are transmitted into the process element array. In each process element, they are multiplied and the products are summed up.
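One process-element cycle, i.e. nine parallel multiplies followed by an adder tree, can be sketched as follows (a software model of the dataflow, not the HDL itself):

```python
def pe_convolve(window, weights):
    """One process-element cycle: a 9-multiplier array followed by
    an adder tree that pairwise-sums the products (log2 depth)."""
    assert len(window) == len(weights) == 9
    products = [a * b for a, b in zip(window, weights)]  # multiplier array
    while len(products) > 1:                             # adder tree levels
        products = [sum(products[i:i + 2]) for i in range(0, len(products), 2)]
    return products[0]

result = pe_convolve(list(range(9)), [1] * 9)  # dot product of a 3x3 window
```

The pairwise reduction mirrors the hardware adder tree: four levels of adders rather than a sequential accumulator, which keeps the pipeline depth fixed.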
IV-B2 Deconvolution
As discussed in Section III-B, the deconvolution of each patch consumes multiplications and additions, which fits well with the proposed process element. During deconvolution, the process element structure is reused, but with a different data routing mechanism.
Unlike convolution, which outputs one pixel after another, in deconvolution mode one process element generates pixels in parallel. Considering the bit width of the OF buffer, these output pixels are fed into the buffer serially.
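The arithmetic behind the parallel outputs can be pictured with a plain transposed-convolution sketch in its scatter form: each input pixel contributes a weighted copy of the kernel to an overlapping output patch. (The accelerator instead gathers the contributions per output patch so no intermediate buffering is needed, but the computed values are the same; stride and kernel size below are illustrative.)

```python
import numpy as np

def deconv2d(x, w, stride=2):
    """Transposed convolution, scatter form: each input pixel
    scatters x[i, j] * w into the (overlapping) output patch."""
    h, wd = x.shape
    k = w.shape[0]
    out = np.zeros((stride * (h - 1) + k, stride * (wd - 1) + k), dtype=x.dtype)
    for i in range(h):
        for j in range(wd):
            out[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * w
    return out

x = np.array([[1.0, 2.0], [3.0, 4.0]])   # 2x2 input
w = np.ones((3, 3))                       # 3x3 kernel
y = deconv2d(x, w)                        # 5x5 upsampled output
```

In the gather formulation each output patch depends on only a few input pixels, which is what lets the process element emit a whole patch of pixels in one pass.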
IV-C Pooling
The pooling operation in CNNs is either max pooling or average pooling over sliding windows. Another line buffer and shift register are utilized to generate these sliding windows. The work mode can be determined prior to compilation or configured on-the-fly.
IV-D Batch Normalization and Activation
During inference, batch normalization reduces to a multiplication followed by an addition, which we simply absorb into the process element. As for the activation function, our design supports both ReLU and Leaky ReLU.
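The reduction works because the batch statistics are fixed at inference time, so the whole normalization collapses into one per-channel scale and shift. A minimal sketch (symbol names follow the usual BN formulation):

```python
import math

def fold_batchnorm(gamma, beta, mean, var, eps=1e-5):
    """Fold inference-time batch normalization
    y = gamma * (x - mean) / sqrt(var + eps) + beta
    into a single multiply (scale) and add (shift)."""
    scale = gamma / math.sqrt(var + eps)
    shift = beta - mean * scale
    return scale, shift

scale, shift = fold_batchnorm(gamma=2.0, beta=1.0, mean=0.5, var=4.0)
# for any x: BN(x) == scale * x + shift
```

These two constants are exactly what gets absorbed into the process element's multiply-add datapath.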
IV-E System Controller
The system controller determines: 1) the data flow, such as when to trigger the process element array, which data to assign to the buffers, and when to start data transfers to DDR; and 2) the settings of each component mentioned above, such as the padding mode of the line buffer, the work mode of the process element arrays, and whether to bypass the activation or pooling modules. It is implemented as an FSM. The settings are predefined and preloaded into register files, so the FSM only has to read and execute them sequentially.
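The controller's behavior, stepping through preloaded settings and issuing a configuration per stage, can be sketched as follows (the field names in this register-file model are hypothetical, chosen only to mirror the settings listed above):

```python
def run_controller(register_file):
    """Minimal FSM model: read preloaded per-layer settings
    sequentially and emit a full configuration for each stage."""
    issued = []
    for cfg in register_file:                       # sequential read-and-execute
        issued.append({
            "padding_mode": cfg.get("padding_mode", 0),
            "pe_mode": cfg.get("pe_mode", "conv"),  # "conv" or "deconv"
            "bypass_pool": cfg.get("bypass_pool", False),
        })
    return issued

program = [{"pe_mode": "conv"},
           {"pe_mode": "deconv", "bypass_pool": True}]
steps = run_controller(program)
```

Because the schedule is fully precomputed, the hardware FSM needs no decision logic at run time, only a program counter over the register file.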
V Implementation Considerations of SegNet-Basic
The encoder part of SegNet-Basic consists of convolution layers and max pooling layers; the decoder part consists of convolution layers and deconvolution layers. The total parameter size for inference is about 42 Mb, with 8-bit quantization for feature maps and weights. The accelerator is designed using Simulink and the HDL Coder toolbox. Our target platform, the Xilinx ZC706, contains 900 DSP slices and 19.2 Mb of Block RAM.
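The 8-bit quantization is what makes the 42 Mb parameter set tractable against the 19.2 Mb of on-chip BRAM. As one illustrative scheme (a symmetric per-tensor simplification of the affine quantization in [17]; the paper does not specify its exact mapping):

```python
def quantize_int8(values):
    """Symmetric per-tensor 8-bit quantization: map floats to
    int8 with a single scale factor; dequantize as q * scale."""
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 127.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

q, s = quantize_int8([-1.0, 0.0, 0.5, 1.0])
# q[i] * s approximates the original float value
```

Storing int8 instead of float32 cuts both the memory footprint and the DMA bandwidth by 4x, and lets each DSP-based multiplier operate on narrow operands.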
VI Results and Discussion
The test setup of the SegNet-Basic hardware accelerator is shown in Fig. 7. On the Zynq platform, the hardware accelerator is attached as a peripheral of the ARM A9 processor. Two Direct Memory Access (DMA) engines move data between the accelerator and DDR memory. The input images and parameters are preloaded into memory and transferred to the PL by the DMAs. The CNN accelerator runs at a clock frequency of 220 MHz. Its total resource consumption is summarized in Tab. II.
LUTs       | Registers  | BRAMs     | DSPs
16579 (8%) | 25390 (6%) | 537 (99%) | 576 (64%)
Compared with other implementations (Tab. I), our design achieves better performance for deconvolution. Thanks to the shared architecture, a better balance between performance and resource efficiency is obtained across convolution and deconvolution. However, because the architecture must support both operations, it is not deeply optimized for convolution specifically; therefore, the convolution performance is not as high as that of the deeply optimized implementation in [19].
VI-A Scalability
Scalability is represented by the number of process element arrays in the accelerator, which is a balance between bandwidth and computation capability. In SegNet-Basic, the number of process element arrays is set to 1, which means the input and output data bit width is 64. If higher bandwidth is available, higher performance is possible.
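With 8-bit data, a 64-bit stream carries eight values per cycle, so each additional process element array adds a fixed bandwidth requirement. A rough sketch of this relationship (assuming, as the configuration above suggests, 64 bits per array per cycle; the exact figures for other configurations are not stated in the paper):

```python
def required_bandwidth_bits(num_pe_arrays, bits_per_value=8,
                            values_per_cycle_per_array=8):
    """Input bandwidth (bits/cycle) needed to keep every process
    element array fed: 8 values x 8 bits = 64 bits per array."""
    return num_pe_arrays * bits_per_value * values_per_cycle_per_array

required_bandwidth_bits(1)   # 64 bits/cycle, the SegNet-Basic setting
```

Doubling the arrays doubles both the peak throughput and the stream width the platform must sustain, which is why the array count is the natural scaling knob.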
VI-B Latency of Operations
To compare the latency, we perform convolution and max pooling on a feature map, followed by deconvolution. We find that the times for convolution and deconvolution are the same. The padding times differ because of the different sizes of the input feature maps, and pooling and ReLU require additional time; double buffering eliminates the difference in data transfer time. In total, deconvolution saves about 3.2% of the processing time compared with convolution plus max pooling and ReLU.
VII Conclusions
In this paper, a scalable and configurable CNN accelerator architecture has been proposed that combines convolution and deconvolution in a single process element. The deconvolution operation is completed in one step, and no buffering of intermediate results is needed. In addition, SegNet-Basic has been successfully implemented on the Xilinx Zynq ZC706 FPGA, achieving 151.5 GOPS for convolution and 94.3 GOPS for deconvolution, which outperforms state-of-the-art segmentation CNN implementations.
Acknowledgment
This work was supported by MathWorks Inc.
References
[1] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[2] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems (NIPS), pp. 91-99, 2015.
[3] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431-3440, 2015.
[4] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in Proceedings of the 43rd International Symposium on Computer Architecture (ISCA), pp. 367-379, 2016.
[5] C. Ledig et al., "Photo-realistic single image super-resolution using a generative adversarial network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4681-4690, 2017.
[6] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, "Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising," IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142-3155, 2017.
[7] U. Aydonat, S. O'Connell, D. Capalija, A. C. Ling, and G. R. Chiu, "An OpenCL deep learning accelerator on Arria 10," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), pp. 55-64, 2017.
[8] C. Zhang, Z. Fang, P. Zhou, P. Pan, and J. Cong, "Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks," in Proceedings of the 35th International Conference on Computer-Aided Design (ICCAD), p. 12, 2016.
[9] X. Wei, C. H. Yu, P. Zhang, Y. Chen, Y. Wang, H. Hu, Y. Liang, and J. Cong, "Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs," in Proceedings of the 54th Annual Design Automation Conference (DAC), p. 29, 2017.
[10] S. Liu, H. Fan, X. Niu, H. Ng, Y. Chu, and W. Luk, "Optimizing CNN-based segmentation with deeply customized convolutional and deconvolutional architectures on FPGA," ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 11, no. 3, 2018.
[11] X. Zhang, S. Das, O. Neopane, and K. Kreutz-Delgado, "A design methodology for efficient implementation of deconvolutional neural networks on an FPGA," arXiv preprint arXiv:1705.02583, 2017.
[12] D. Xu, K. Tu, Y. Wang, C. Liu, B. He, and H. Li, "FCN-engine: Accelerating deconvolutional layers in classic CNN processors," in Proceedings of the International Conference on Computer-Aided Design (ICCAD), p. 22, 2018.
[13] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481-2495, 2017.
[14] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234-241, 2015.
[15] Y. Ma, Y. Cao, S. Vrudhula, and J. Seo, "Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), pp. 45-54, 2017.
[16] Y. Lyu, L. Bai, and X. Huang, "ChipNet: Real-time LiDAR processing for drivable region segmentation on an FPGA," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 66, no. 5, pp. 1769-1779, 2019.
[17] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, "Quantization and training of neural networks for efficient integer-arithmetic-only inference," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2704-2713, 2018.
[18] J. Qiu, J. Wang, S. Yao et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), pp. 26-35, 2016.
[19] Q. Xiao, Y. Liang et al., "Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs," in 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1-6, 2017.
V Implementation Considerations of SegNetBasic
The encoder part of SegNetBasic includes convolution layers and max pooling layers. The decoder part has convolution layers and deconvolution layers. The total parameter size for inference is about 42Mb, with 8bit quantization for feature maps and weights. The accelerator is designed using Simulink and the HDL Coder toolbox. Our target platform Xilinx ZC706 contains 900 DSP slices and 19.2Mb Block RAMs.
Vi Results and Discussion
The test setup of SegNetbasic hardware accelerator is demonstrated in Fig. 7. In the Zynq platform, hardware accelerator is loaded as a peripheral of ARM A9 processor. Two Direct Memory Access (DMAs) move the data between accelerator and DDR memory. The input images and parameters are preloaded into memory and transferred to PL by DMAs. This CNN accelerator clock frequency is 220MHz. Its total resource consumption is summarized in Tab. II.
LUTs  Registers  BRAMs  DSPs 
16579 (8%)  25390 (6%)  537 (99%)  576 (64%) 
Comparing to other implementations (in Tab. I), our design achieves better performance in case of deconvolution. Due to the sharing architecture, a better balance on both performance and resource efficiency for convolution and deconvolution is obtained. However, in order to support both operations, the architecture is not deeply optimized for convolution specifically. Therefore, the convolution performance is not as high as that from deeply optimized implementation in [19].
Via Scalability
Scalability is represented by the number of process element arrays in the accelerator. It is balance of bandwidth and computation capability. In SegNetBasic, the number of process element arrays is set to 1. This means the input and output data bitwidth is 64. If higher bandwidth is supported, higher performance is possible.
ViB Latency of operations
In order to compare the latency, we perform convolution and max pooling on a feature map (resulting a feature map) followed by deconvolution. We find the time for convolution and deconvolution are the same. The padding time difference is about due to different sizes of input feature maps. Considering pooling and ReLU, another is needed. Double buffering eliminates the data transfer time difference. Therefore, deconvolution saves about 3.2% processing time if comparing to convolution plus maxpooling and ReLU.
Vii Conclusions
In this paper, a scalable and configurable CNN accelerator architecture has been proposed by combining both convolution and deconvolution into single process element. The deconvolution operation is completed in one step and buffering of intermediate results is not needed. In addition, SegNetBasic has been successfully implemented on Xilinx Zynq ZC706 FPGA that achieves the performance of 151.5 GOPS for convolution and 94.3 GOPS for deconvolution, which outperforms stateoftheart segmentation CNN implementations.
Acknowledgment
This work was supported by the Mathworks Inc.
References
 [1] K. Simonyan and A. Zisserman, ”Very deep convolutional networks for largescale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [2] S. Ren, K. He, R. Girshick, and J. Sun, ”Faster RCNN: Towards realtime object detection with region proposal networks,” In Advances in neural information processing systems (NIPS), pp. 9199. 2015.

[3]
J. Long, E. Shelhamer, and T. Darrell,
”Fully convolutional networks for semantic segmentation,”
In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR)
, pp. 34313440. 2015.  [4] Y.H. Chen, J. Emer, and V. Sze, ”Eyeriss: A spatial architecture for energyefficient dataflow for convolutional neural networks,” In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA), pp. 367379, 2016.
 [5] C. Ledig, et al, ”Photorealistic single image superresolution using a generative adversarial network,” In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), pp. 46814690. 2017.
 [6] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, ”Beyond a gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 31423155, 2017.

[7]
U. Aydonat, S. O’Connell, D. Capalija, A.C. Ling, and G.R. Chiu, ”An opencl™ deep learning accelerator on arria 10,”
In Proceedings of the 2017 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pp. 5564. 2017.  [8] C. Zhang, Z. Fang, P. Zhou, P. Pan, J. Cong, ”Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks,” In Proceedings of the 35th International Conference on ComputerAided Design, p. 12, 2016.
 [9] X. Wei, C.H. Yu, P. Zhang, Y. Chen, Y. Wang, H. Hu, Y. Liang, and J. Cong, ”Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs,” In Proceedings of the 54th Annual Design Automation Conference, p. 29, 2017.
 [10] S. Liu, H. Fan, X. Niu, H. Ng, Y. Chu, and W. Luk, ”Optimizing CNNbased Segmentation with Deeply Customized Convolutional and Deconvolutional Architectures on FPGA,” ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 11, no. 3, 2018.
 [11] X. Zhang, S. Das, O. Neopane, K. KreutzDelgado, ”A Design Methodology for Efficient Implementation of Deconvolutional Neural Networks on an FPGA,” arXiv preprint arXiv:1705.02583, 2017.
 [12] D. Xu, K. Tu, Y. Wang, C. Liu, B. He, and H. Li, ”FCNelement: accelerating deconvolutional layers in classic CNN processors,” In Proceedings of the International Conference on ComputerAided Design, pp. 22, 2018.
 [13] V. Badrinarayanan, A. Kendall, and R. Cipolla, ”Segnet: A deep convolutional encoderdecoder architecture for image segmentation,” IEEE transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 24812495, 2017.
 [14] O. Ronneberger, P. Fischer, and T. Brox, ”Unet: Convolutional networks for biomedical image segmentation,” In International Conference on Medical Image Computing and Computer Assisted Intervention, pp. 234241, 2015.
 [15] Y. Ma, Y. Cao, S. Vrudhula, and J. Seo, ”Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks,” In Proceedings of the 2017 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pp. 4554, 2017.
 [16] Y. Lyu, L. Bai, and X. Huang., ”Chipnet: Realtime LiDAR processing for drivable region segmentation on an FPGA,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 66, no. 5, pp. 1769  1779, 2019.
 [17] B. Jacob, S. Kligys, B. Chen, M. Zhu, M.Tang, A. Howard, H. Adam, and D. Kalenichenko, ”Quantization and training of neural networks for efficient integerarithmeticonly inference,” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 27042713, 2018.
 [18] J. Qiu, J. Wang, S. Yao et al, ”Going deeper with embedded fpga platform for convolutional neural network,” In Proceedings of the 2016 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays(FPGA), pp. 2635, 2016.
 [19] Q. Xiao, Y. Liang et al, ”Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs,” In 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 16, 2017.
Vi Results and Discussion
The test setup of SegNetbasic hardware accelerator is demonstrated in Fig. 7. In the Zynq platform, hardware accelerator is loaded as a peripheral of ARM A9 processor. Two Direct Memory Access (DMAs) move the data between accelerator and DDR memory. The input images and parameters are preloaded into memory and transferred to PL by DMAs. This CNN accelerator clock frequency is 220MHz. Its total resource consumption is summarized in Tab. II.
LUTs  Registers  BRAMs  DSPs 
16579 (8%)  25390 (6%)  537 (99%)  576 (64%) 
Comparing to other implementations (in Tab. I), our design achieves better performance in case of deconvolution. Due to the sharing architecture, a better balance on both performance and resource efficiency for convolution and deconvolution is obtained. However, in order to support both operations, the architecture is not deeply optimized for convolution specifically. Therefore, the convolution performance is not as high as that from deeply optimized implementation in [19].
Via Scalability
Scalability is represented by the number of process element arrays in the accelerator. It is balance of bandwidth and computation capability. In SegNetBasic, the number of process element arrays is set to 1. This means the input and output data bitwidth is 64. If higher bandwidth is supported, higher performance is possible.
ViB Latency of operations
In order to compare the latency, we perform convolution and max pooling on a feature map (resulting a feature map) followed by deconvolution. We find the time for convolution and deconvolution are the same. The padding time difference is about due to different sizes of input feature maps. Considering pooling and ReLU, another is needed. Double buffering eliminates the data transfer time difference. Therefore, deconvolution saves about 3.2% processing time if comparing to convolution plus maxpooling and ReLU.
Vii Conclusions
In this paper, a scalable and configurable CNN accelerator architecture has been proposed by combining both convolution and deconvolution into single process element. The deconvolution operation is completed in one step and buffering of intermediate results is not needed. In addition, SegNetBasic has been successfully implemented on Xilinx Zynq ZC706 FPGA that achieves the performance of 151.5 GOPS for convolution and 94.3 GOPS for deconvolution, which outperforms stateoftheart segmentation CNN implementations.
Acknowledgment
This work was supported by the Mathworks Inc.
References
 [1] K. Simonyan and A. Zisserman, ”Very deep convolutional networks for largescale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [2] S. Ren, K. He, R. Girshick, and J. Sun, ”Faster RCNN: Towards realtime object detection with region proposal networks,” In Advances in neural information processing systems (NIPS), pp. 9199. 2015.

[3]
J. Long, E. Shelhamer, and T. Darrell,
”Fully convolutional networks for semantic segmentation,”
In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR)
, pp. 34313440. 2015.  [4] Y.H. Chen, J. Emer, and V. Sze, ”Eyeriss: A spatial architecture for energyefficient dataflow for convolutional neural networks,” In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA), pp. 367379, 2016.
 [5] C. Ledig, et al, ”Photorealistic single image superresolution using a generative adversarial network,” In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), pp. 46814690. 2017.
 [6] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, ”Beyond a gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 31423155, 2017.

[7]
U. Aydonat, S. O’Connell, D. Capalija, A.C. Ling, and G.R. Chiu, ”An opencl™ deep learning accelerator on arria 10,”
In Proceedings of the 2017 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pp. 5564. 2017.  [8] C. Zhang, Z. Fang, P. Zhou, P. Pan, J. Cong, ”Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks,” In Proceedings of the 35th International Conference on ComputerAided Design, p. 12, 2016.
 [9] X. Wei, C.H. Yu, P. Zhang, Y. Chen, Y. Wang, H. Hu, Y. Liang, and J. Cong, ”Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs,” In Proceedings of the 54th Annual Design Automation Conference, p. 29, 2017.
 [10] S. Liu, H. Fan, X. Niu, H. Ng, Y. Chu, and W. Luk, ”Optimizing CNNbased Segmentation with Deeply Customized Convolutional and Deconvolutional Architectures on FPGA,” ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 11, no. 3, 2018.
 [11] X. Zhang, S. Das, O. Neopane, K. KreutzDelgado, ”A Design Methodology for Efficient Implementation of Deconvolutional Neural Networks on an FPGA,” arXiv preprint arXiv:1705.02583, 2017.
 [12] D. Xu, K. Tu, Y. Wang, C. Liu, B. He, and H. Li, ”FCNelement: accelerating deconvolutional layers in classic CNN processors,” In Proceedings of the International Conference on ComputerAided Design, pp. 22, 2018.
 [13] V. Badrinarayanan, A. Kendall, and R. Cipolla, ”Segnet: A deep convolutional encoderdecoder architecture for image segmentation,” IEEE transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 24812495, 2017.
 [14] O. Ronneberger, P. Fischer, and T. Brox, ”Unet: Convolutional networks for biomedical image segmentation,” In International Conference on Medical Image Computing and Computer Assisted Intervention, pp. 234241, 2015.
 [15] Y. Ma, Y. Cao, S. Vrudhula, and J. Seo, ”Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks,” In Proceedings of the 2017 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pp. 4554, 2017.
 [16] Y. Lyu, L. Bai, and X. Huang., ”Chipnet: Realtime LiDAR processing for drivable region segmentation on an FPGA,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 66, no. 5, pp. 1769  1779, 2019.
 [17] B. Jacob, S. Kligys, B. Chen, M. Zhu, M.Tang, A. Howard, H. Adam, and D. Kalenichenko, ”Quantization and training of neural networks for efficient integerarithmeticonly inference,” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 27042713, 2018.
 [18] J. Qiu, J. Wang, S. Yao et al, ”Going deeper with embedded fpga platform for convolutional neural network,” In Proceedings of the 2016 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays(FPGA), pp. 2635, 2016.
 [19] Q. Xiao, Y. Liang et al, ”Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs,” In 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 16, 2017.
Vii Conclusions
In this paper, a scalable and configurable CNN accelerator architecture has been proposed by combining both convolution and deconvolution into single process element. The deconvolution operation is completed in one step and buffering of intermediate results is not needed. In addition, SegNetBasic has been successfully implemented on Xilinx Zynq ZC706 FPGA that achieves the performance of 151.5 GOPS for convolution and 94.3 GOPS for deconvolution, which outperforms stateoftheart segmentation CNN implementations.
Acknowledgment
This work was supported by the Mathworks Inc.
References
 [1] K. Simonyan and A. Zisserman, ”Very deep convolutional networks for largescale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [2] S. Ren, K. He, R. Girshick, and J. Sun, ”Faster RCNN: Towards realtime object detection with region proposal networks,” In Advances in neural information processing systems (NIPS), pp. 9199. 2015.

[3]
J. Long, E. Shelhamer, and T. Darrell,
”Fully convolutional networks for semantic segmentation,”
In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR)
, pp. 34313440. 2015.  [4] Y.H. Chen, J. Emer, and V. Sze, ”Eyeriss: A spatial architecture for energyefficient dataflow for convolutional neural networks,” In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA), pp. 367379, 2016.
 [5] C. Ledig, et al, ”Photorealistic single image superresolution using a generative adversarial network,” In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), pp. 46814690. 2017.
 [6] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, ”Beyond a gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 31423155, 2017.

[7]
U. Aydonat, S. O’Connell, D. Capalija, A.C. Ling, and G.R. Chiu, ”An opencl™ deep learning accelerator on arria 10,”
In Proceedings of the 2017 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pp. 5564. 2017.  [8] C. Zhang, Z. Fang, P. Zhou, P. Pan, J. Cong, ”Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks,” In Proceedings of the 35th International Conference on ComputerAided Design, p. 12, 2016.
 [9] X. Wei, C.H. Yu, P. Zhang, Y. Chen, Y. Wang, H. Hu, Y. Liang, and J. Cong, ”Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs,” In Proceedings of the 54th Annual Design Automation Conference, p. 29, 2017.
 [10] S. Liu, H. Fan, X. Niu, H. Ng, Y. Chu, and W. Luk, ”Optimizing CNNbased Segmentation with Deeply Customized Convolutional and Deconvolutional Architectures on FPGA,” ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 11, no. 3, 2018.
 [11] X. Zhang, S. Das, O. Neopane, K. KreutzDelgado, ”A Design Methodology for Efficient Implementation of Deconvolutional Neural Networks on an FPGA,” arXiv preprint arXiv:1705.02583, 2017.
 [12] D. Xu, K. Tu, Y. Wang, C. Liu, B. He, and H. Li, ”FCNelement: accelerating deconvolutional layers in classic CNN processors,” In Proceedings of the International Conference on ComputerAided Design, pp. 22, 2018.
 [13] V. Badrinarayanan, A. Kendall, and R. Cipolla, ”Segnet: A deep convolutional encoderdecoder architecture for image segmentation,” IEEE transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 24812495, 2017.
 [14] O. Ronneberger, P. Fischer, and T. Brox, ”Unet: Convolutional networks for biomedical image segmentation,” In International Conference on Medical Image Computing and Computer Assisted Intervention, pp. 234241, 2015.
 [15] Y. Ma, Y. Cao, S. Vrudhula, and J. Seo, ”Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks,” In Proceedings of the 2017 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pp. 4554, 2017.
 [16] Y. Lyu, L. Bai, and X. Huang., ”Chipnet: Realtime LiDAR processing for drivable region segmentation on an FPGA,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 66, no. 5, pp. 1769  1779, 2019.
 [17] B. Jacob, S. Kligys, B. Chen, M. Zhu, M.Tang, A. Howard, H. Adam, and D. Kalenichenko, ”Quantization and training of neural networks for efficient integerarithmeticonly inference,” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 27042713, 2018.
 [18] J. Qiu, J. Wang, S. Yao et al, ”Going deeper with embedded fpga platform for convolutional neural network,” In Proceedings of the 2016 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays(FPGA), pp. 2635, 2016.
 [19] Q. Xiao, Y. Liang et al, ”Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs,” In 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 16, 2017.
Acknowledgment
This work was supported by the Mathworks Inc.
References
 [1] K. Simonyan and A. Zisserman, ”Very deep convolutional networks for largescale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [2] S. Ren, K. He, R. Girshick, and J. Sun, ”Faster RCNN: Towards realtime object detection with region proposal networks,” In Advances in neural information processing systems (NIPS), pp. 9199. 2015.

[3]
J. Long, E. Shelhamer, and T. Darrell,
”Fully convolutional networks for semantic segmentation,”
In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR)
, pp. 34313440. 2015.  [4] Y.H. Chen, J. Emer, and V. Sze, ”Eyeriss: A spatial architecture for energyefficient dataflow for convolutional neural networks,” In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA), pp. 367379, 2016.
 [5] C. Ledig, et al, ”Photorealistic single image superresolution using a generative adversarial network,” In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), pp. 46814690. 2017.
References
 [1] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
 [2] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," In Advances in Neural Information Processing Systems (NIPS), pp. 91-99, 2015.
 [3] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431-3440, 2015.
 [4] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA), pp. 367-379, 2016.
 [5] C. Ledig et al., "Photo-realistic single image super-resolution using a generative adversarial network," In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4681-4690, 2017.
 [6] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, "Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising," IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142-3155, 2017.
 [7] U. Aydonat, S. O'Connell, D. Capalija, A. C. Ling, and G. R. Chiu, "An OpenCL™ deep learning accelerator on Arria 10," In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), pp. 55-64, 2017.
 [8] C. Zhang, Z. Fang, P. Zhou, P. Pan, and J. Cong, "Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks," In Proceedings of the 35th International Conference on Computer-Aided Design (ICCAD), p. 12, 2016.
 [9] X. Wei, C. H. Yu, P. Zhang, Y. Chen, Y. Wang, H. Hu, Y. Liang, and J. Cong, "Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs," In Proceedings of the 54th Annual Design Automation Conference (DAC), p. 29, 2017.
 [10] S. Liu, H. Fan, X. Niu, H. Ng, Y. Chu, and W. Luk, "Optimizing CNN-based segmentation with deeply customized convolutional and deconvolutional architectures on FPGA," ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 11, no. 3, 2018.
 [11] X. Zhang, S. Das, O. Neopane, and K. Kreutz-Delgado, "A design methodology for efficient implementation of deconvolutional neural networks on an FPGA," arXiv preprint arXiv:1705.02583, 2017.
 [12] D. Xu, K. Tu, Y. Wang, C. Liu, B. He, and H. Li, "FCN-engine: Accelerating deconvolutional layers in classic CNN processors," In Proceedings of the International Conference on Computer-Aided Design (ICCAD), p. 22, 2018.
 [13] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481-2495, 2017.
 [14] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234-241, 2015.
 [15] Y. Ma, Y. Cao, S. Vrudhula, and J. Seo, "Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks," In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), pp. 45-54, 2017.
 [16] Y. Lyu, L. Bai, and X. Huang, "ChipNet: Real-time LiDAR processing for drivable region segmentation on an FPGA," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 66, no. 5, pp. 1769-1779, 2019.
 [17] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, "Quantization and training of neural networks for efficient integer-arithmetic-only inference," In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2704-2713, 2018.
 [18] J. Qiu, J. Wang, S. Yao et al., "Going deeper with embedded FPGA platform for convolutional neural network," In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), pp. 26-35, 2016.
 [19] Q. Xiao, Y. Liang et al., "Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs," In 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1-6, 2017.