Im2win: An Efficient Convolution Paradigm on GPU

06/25/2023
by Shuai Lu, et al.

Convolution is the most time-consuming operation in deep neural networks, so its performance is critical to the overall performance of the network. The two commonly used methods for convolution on GPU are general matrix multiplication (GEMM)-based convolution and direct convolution. GEMM-based convolution relies on the im2col algorithm, which results in a large memory footprint and reduced performance. Direct convolution avoids the large memory footprint, but its performance is not on par with the GEMM-based approach because of discontinuous memory accesses. This paper proposes a window-order-based convolution paradigm on GPU, called im2win, which not only reduces the memory footprint but also offers continuous memory accesses, resulting in improved performance. Furthermore, we apply a range of optimization techniques to the convolution CUDA kernel, including shared memory, tiling, micro-kernel, double buffering, and prefetching. We compare our implementation against direct convolution, PyTorch's GEMM-based convolution with cuBLAS, and six cuDNN-based convolution implementations, on twelve state-of-the-art DNN benchmarks. The experimental results show that our implementation 1) uses 23.1% less memory and achieves 3.5× TFLOPS compared with cuBLAS, 2) uses 32.8% less memory and achieves up to 1.8× TFLOPS compared with the best-performing convolutions in cuDNN, and 3) achieves up to 155× TFLOPS compared with direct convolution. We further perform an ablation study of the applied optimization techniques and find that the micro-kernel has the greatest positive impact on performance.
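To make the layout difference concrete, below is a minimal single-channel NumPy sketch of the two input transformations (stride 1, no padding; the function names and the exact interleaving are illustrative assumptions, not the paper's CUDA implementation):

    import numpy as np

    def im2col(x, kh, kw):
        # x: (H, W) single-channel input; stride 1, no padding.
        # Every kh x kw window is fully materialized as one column,
        # so the footprint grows by a factor of roughly kh * kw.
        H, W = x.shape
        oh, ow = H - kh + 1, W - kw + 1
        cols = np.empty((kh * kw, oh * ow), dtype=x.dtype)
        for i in range(oh):
            for j in range(ow):
                cols[:, i * ow + j] = x[i:i + kh, j:j + kw].ravel()
        return cols

    def im2win(x, kh):
        # Window-order layout (illustrative): for each output row, the
        # kh input rows it needs are copied once and interleaved
        # column-wise, so each kh x kw window (for any kernel width kw)
        # becomes the contiguous slice win[i, j*kh : j*kh + kh*kw].
        # The footprint is oh * kh * W -- up to kw times smaller
        # than im2col for large inputs.
        H, W = x.shape
        oh = H - kh + 1
        win = np.empty((oh, kh * W), dtype=x.dtype)
        for i in range(oh):
            # The transpose places the kh values of each input column
            # next to each other.
            win[i] = x[i:i + kh, :].T.ravel()
        return win

    x = np.arange(36, dtype=np.float32).reshape(6, 6)
    print(im2col(x, 3, 3).size)  # 144 elements for a 36-element input
    print(im2win(x, 3).size)     # 72 elements, and every window is
                                 # still one contiguous slice

Because each window occupies one contiguous slice of the im2win tensor, threads computing neighboring output elements read from nearby addresses, which is what enables the continuous (coalescable) memory accesses described above.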


