TDC: Towards Extremely Efficient CNNs on GPUs via Hardware-Aware Tucker Decomposition

11/07/2022
by   Lizhi Xiang, et al.

Tucker decomposition is one of the state-of-the-art (SOTA) CNN model compression techniques. However, unlike the FLOPs reduction, we observe very limited inference-time reduction with Tucker-compressed models using existing GPU software such as cuDNN. To this end, we propose an efficient end-to-end framework that can generate highly accurate and compact CNN models via Tucker decomposition, together with optimized inference code on GPUs. Specifically, we propose an ADMM-based training algorithm that can achieve highly accurate Tucker-format models. We also develop a high-performance kernel for Tucker-format convolutions and analytical performance models to guide the selection of execution parameters. We further propose a co-design framework to determine the proper Tucker ranks driven by practical inference time (rather than FLOPs). Our evaluation on five modern CNNs with A100 demonstrates that our compressed models with our optimized code achieve up to 3.14X speedup over cuDNN, 1.45X speedup over TVM, and 4.57X over the original models using cuDNN, with up to 0.05


