Block Convolution: Towards Memory-Efficient Inference of Large-Scale CNNs on FPGA

05/19/2021
by Gang Li, et al.

Deep convolutional neural networks have achieved remarkable progress in recent years. However, the large volume of intermediate results generated during inference poses a significant challenge to accelerator design for resource-constrained FPGAs. Due to limited on-chip storage, partial results of intermediate layers are frequently transferred back and forth between on-chip memory and off-chip DRAM, leading to a non-negligible increase in latency and energy consumption. In this paper, we propose block convolution, a hardware-friendly, simple, yet efficient convolution operation that completely avoids off-chip transfer of intermediate feature maps at runtime. The fundamental idea of block convolution is to eliminate the dependency between feature-map tiles in the spatial dimensions when spatial tiling is used: a feature map is split into independent blocks so that convolution can be performed separately on each block. We conduct extensive experiments to demonstrate the efficacy of block convolution on both the algorithm side and the hardware side. Specifically, we evaluate block convolution on 1) VGG-16, ResNet-18, ResNet-50, and MobileNet-V1 for the ImageNet classification task; 2) SSD and FPN for the COCO object detection task; and 3) VDSR for the Set5 single-image super-resolution task. Experimental results show that comparable or higher accuracy can be achieved with block convolution. We also showcase two CNN accelerators built via algorithm/hardware co-design based on block convolution on memory-limited FPGAs; evaluation shows that both accelerators substantially outperform the baselines while requiring no off-chip transfer of intermediate feature maps.
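To make the idea concrete, the sketch below illustrates block convolution in PyTorch: the input feature map is partitioned into independent spatial blocks, each block is zero-padded and convolved on its own, and the per-block outputs are stitched back together. The block size, layer shapes, and the zero-padding choice here are illustrative assumptions, not the exact configuration used in the paper.

```python
# Minimal sketch of block convolution (illustrative, not the authors'
# reference implementation). Each spatial block is padded and convolved
# independently, so no data from neighbouring blocks is ever needed --
# this removes the inter-tile dependency that otherwise forces off-chip
# transfers when spatial tiling is used on a memory-limited accelerator.
import torch
import torch.nn.functional as F

def block_conv2d(x, weight, bias=None, block_size=56):
    """x: (N, C_in, H, W); weight: (C_out, C_in, kH, kW).
    Assumes stride 1 and H, W divisible by block_size for simplicity."""
    n, c, h, w = x.shape
    kh, kw = weight.shape[-2:]
    pad = (kh // 2, kw // 2)  # "same" padding applied per block
    out_rows = []
    for i in range(0, h, block_size):
        out_cols = []
        for j in range(0, w, block_size):
            blk = x[:, :, i:i + block_size, j:j + block_size]
            # Zero-pad and convolve this block in isolation; pixels near
            # block borders differ from standard convolution because
            # cross-block context is intentionally dropped.
            out_cols.append(F.conv2d(blk, weight, bias, stride=1, padding=pad))
        out_rows.append(torch.cat(out_cols, dim=3))  # stitch along width
    return torch.cat(out_rows, dim=2)                # stitch along height

# Example: a 3x3 convolution on a 224x224 feature map split into 56x56 blocks.
x = torch.randn(1, 64, 224, 224)
w = torch.randn(128, 64, 3, 3)
y = block_conv2d(x, w, block_size=56)
print(y.shape)  # torch.Size([1, 128, 224, 224])
```

Because each block depends only on its own pixels, every block can be processed end to end through consecutive layers using on-chip buffers alone, which is the property the accelerator designs in the paper exploit.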


