Full Stack Optimization of Transformer Inference: a Survey

02/27/2023
by Sehoon Kim, et al.

Recent advances in state-of-the-art DNN architecture design have been moving toward Transformer models. These models achieve superior accuracy across a wide range of applications, a trend that has been consistent over the several years since Transformer models were originally introduced. However, the amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate, which has made their deployment in latency-sensitive applications challenging. As such, there has been an increased focus on making Transformer models more efficient, with methods that range from changing the architecture design all the way to developing dedicated domain-specific accelerators. In this work, we survey different approaches for efficient Transformer inference, including: (i) analysis and profiling of the bottlenecks in existing Transformer architectures and their similarities to and differences from previous convolutional models; (ii) implications of the Transformer architecture for hardware, including the impact of non-linear operations such as Layer Normalization, Softmax, and GELU, as well as linear operations, on hardware design; (iii) approaches for optimizing a fixed Transformer architecture; (iv) challenges in finding the right mapping and scheduling of operations for Transformer models; and (v) approaches for optimizing Transformer models by adapting the architecture using neural architecture search. Finally, we perform a case study in which we apply the surveyed optimizations on Gemmini, the open-source, full-stack DNN accelerator generator, and we show how each of these approaches can yield improvements over previous benchmark results on Gemmini. Among other things, we find that a full-stack co-design approach with the aforementioned methods can result in up to 88.7x speedup with minimal performance degradation for Transformer inference.
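To make point (ii) concrete, consider GELU: computing it exactly requires evaluating the Gaussian error function, which is costly to support as a dedicated hardware unit. A common workaround in the efficient-inference literature is to substitute a low-order polynomial. The sketch below (Python/NumPy; illustrative only, not code from the survey or from Gemmini) compares exact GELU against a second-order polynomial approximation using the coefficients reported in the I-BERT paper; the approximation needs only multiplies, adds, and a clamp, so it maps directly onto an integer MAC datapath.

    import math
    import numpy as np

    def gelu_exact(x):
        # Reference GELU, computed via the Gaussian error function.
        return 0.5 * x * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))

    def gelu_poly(x, a=-0.2888, b=-1.769):
        # erf approximated by a second-order polynomial (I-BERT coefficients):
        #   erf(t) ~= sign(t) * (a * (min(|t|, -b) + b)^2 + 1)
        # Only multiply/add/clip operations -- no transcendental-function unit.
        t = np.clip(np.abs(x) / math.sqrt(2.0), None, -b)
        erf_approx = np.sign(x) * (a * (t + b) ** 2 + 1.0)
        return 0.5 * x * (1.0 + erf_approx)

    x = np.linspace(-4.0, 4.0, 161)
    print(np.max(np.abs(gelu_exact(x) - gelu_poly(x))))  # ~0.018 worst case

The worst-case error on this range is roughly 0.018, which is why such substitutions typically cost little accuracy. Softmax and Layer Normalization admit similar integer-friendly treatments (e.g., polynomial approximations of the exponential and iterative square-root computation), which is what makes these non-linear operations tractable on accelerators without floating-point units.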


