EdgeTran: Co-designing Transformers for Efficient Inference on Mobile Edge Platforms

03/24/2023
by Shikhar Tuli, et al.

Automated design of efficient transformer models has recently attracted significant attention from industry and academia. However, most works focus only on certain metrics while searching for the best-performing transformer architecture. Furthermore, running traditional, complex, and large transformer models on low-compute edge platforms is a challenging problem. In this work, we propose a framework, called ProTran, to profile the hardware performance measures for a design space of transformer architectures and a diverse set of edge devices. We use this profiler in conjunction with the proposed co-design technique to obtain the best-performing models that have high accuracy on the given task and minimize latency, energy consumption, and peak power draw to enable edge deployment. We refer to our framework for co-optimizing accuracy and hardware performance measures as EdgeTran. It searches for the best transformer model and edge device pair. Finally, we propose GPTran, a multi-stage block-level grow-and-prune post-processing step that further improves accuracy in a hardware-aware manner. The obtained transformer model is 2.8× smaller and has a 0.8% higher accuracy than the baseline (BERT-Base). Inference with it on the selected edge device enables 15.0% lower latency, 10.0× lower energy, and 10.8× lower peak power draw compared to an off-the-shelf GPU.
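To make the co-design idea concrete, the minimal Python sketch below scores profiled (architecture, device) pairs with a simple weighted objective that rewards accuracy and penalizes latency, energy, and peak power, then picks the best pair by exhaustive search. This is only an illustration: the device names, measurement values, weights, and normalization constants are hypothetical, and the paper's actual pipeline (ProTran profiling, EdgeTran search, GPTran grow-and-prune) is considerably more involved.

from dataclasses import dataclass

@dataclass
class Profile:
    accuracy: float       # task accuracy on the target task (0..1)
    latency_ms: float     # measured inference latency on the device (ms)
    energy_mj: float      # energy per inference (mJ)
    peak_power_w: float   # peak power draw during inference (W)

def score(p: Profile, w_acc=1.0, w_lat=0.3, w_en=0.3, w_pow=0.3) -> float:
    # Weighted co-design objective: reward accuracy, penalize hardware costs.
    # Weights and normalization constants are arbitrary placeholders.
    return (w_acc * p.accuracy
            - w_lat * p.latency_ms / 100.0
            - w_en * p.energy_mj / 1000.0
            - w_pow * p.peak_power_w / 10.0)

# Hypothetical profiler output: one Profile per (architecture, device) pair.
profiles = {
    ("tiny_transformer", "raspberry_pi_4"): Profile(0.78, 120.0, 900.0, 4.2),
    ("tiny_transformer", "jetson_nano"):    Profile(0.78, 45.0, 600.0, 7.5),
    ("base_transformer", "jetson_nano"):    Profile(0.83, 160.0, 2400.0, 9.8),
}

best_pair = max(profiles, key=lambda pair: score(profiles[pair]))
print("Selected (model, device) pair:", best_pair)

A scalarized objective like this is only one way to combine the metrics; an alternative is to treat latency, energy, and peak power as hard budgets, filter out pairs that violate them, and then maximize accuracy over the remaining candidates.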


