HAT: Hardware-Aware Transformers for Efficient Natural Language Processing

05/28/2020
by Hanrui Wang, et al.

Transformers are ubiquitous in Natural Language Processing (NLP) tasks, but they are difficult to deploy on hardware due to their intensive computation. To enable low-latency inference on resource-constrained hardware platforms, we propose to design Hardware-Aware Transformers (HAT) with neural architecture search. We first construct a large design space with arbitrary encoder-decoder attention and heterogeneous layers. Then we train a SuperTransformer that covers all candidates in the design space and efficiently produces many SubTransformers with weight sharing. Finally, we perform an evolutionary search under a hardware latency constraint to find a specialized SubTransformer dedicated to running fast on the target hardware. Extensive experiments on four machine translation tasks demonstrate that HAT can discover efficient models for different hardware (CPU, GPU, IoT device). Running the WMT'14 translation task on a Raspberry Pi-4, HAT achieves a 3× speedup and a 3.7× smaller size over the baseline Transformer, and a 2.7× speedup and a 3.6× smaller size over the Evolved Transformer, with 12,041× less search cost and no performance loss. The HAT code is available at https://github.com/mit-han-lab/hardware-aware-transformers.git
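As a rough illustration of the search step described above, the sketch below runs a latency-constrained evolutionary search over SubTransformer configurations. It is a minimal sketch, not the released HAT code: the design-space values and the predict_latency() and evaluate_loss() helpers are hypothetical placeholders. A real implementation would measure or predict latency on the target hardware and evaluate each candidate with weights inherited from the trained SuperTransformer.

```python
# Minimal sketch (not the official HAT implementation) of a latency-constrained
# evolutionary search over SubTransformer configurations.
import random

# Hypothetical design-space choices per architectural dimension.
DESIGN_SPACE = {
    "encoder_layers": [4, 5, 6],
    "decoder_layers": [2, 3, 4, 5, 6],
    "embed_dim": [512, 640],
    "ffn_dim": [1024, 2048, 3072],
    "attention_heads": [4, 8],
}

def sample_subtransformer():
    """Randomly pick one value per dimension to form a SubTransformer config."""
    return {k: random.choice(v) for k, v in DESIGN_SPACE.items()}

def predict_latency(config):
    """Placeholder latency proxy (ms); a real system measures or predicts
    latency on the target hardware platform."""
    return (0.5 * config["encoder_layers"] + 2.0 * config["decoder_layers"]
            + 0.01 * config["embed_dim"] + 0.005 * config["ffn_dim"])

def evaluate_loss(config):
    """Placeholder quality proxy; HAT evaluates SubTransformers using weights
    shared from the SuperTransformer instead of training from scratch."""
    return 1000.0 / (config["embed_dim"] + config["ffn_dim"]) + random.uniform(0, 0.05)

def mutate(config, prob=0.3):
    """Resample each dimension with some probability."""
    return {k: (random.choice(DESIGN_SPACE[k]) if random.random() < prob else v)
            for k, v in config.items()}

def crossover(a, b):
    """Take each dimension from one of the two parents at random."""
    return {k: random.choice([a[k], b[k]]) for k in a}

def evolutionary_search(latency_limit_ms, population=20, generations=10, parents=5):
    """Evolve SubTransformer configs, keeping only those under the latency budget."""
    pop = []
    while len(pop) < population:
        cand = sample_subtransformer()
        if predict_latency(cand) <= latency_limit_ms:
            pop.append(cand)
    for _ in range(generations):
        scored = sorted(pop, key=evaluate_loss)[:parents]
        children = []
        while len(children) < population - parents:
            child = mutate(crossover(*random.sample(scored, 2)))
            if predict_latency(child) <= latency_limit_ms:
                children.append(child)
        pop = scored + children
    return min(pop, key=evaluate_loss)

if __name__ == "__main__":
    best = evolutionary_search(latency_limit_ms=25.0)
    print("Best SubTransformer under the latency budget:", best)
```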


Related research

10/14/2022 | AutoMoE: Neural Architecture Search for Efficient Sparsely Activated Transformers
10/23/2020 | LightSeq: A High Performance Inference Library for Transformers
10/02/2022 | Wide Attention Is The Way Forward For Transformers
03/24/2023 | EdgeTran: Co-designing Transformers for Efficient Inference on Mobile Edge Platforms
10/18/2021 | Energon: Towards Efficient Acceleration of Transformers Using Dynamic Sparse Attention
07/17/2021 | Dynamic Transformer for Efficient Machine Translation on Embedded Devices
09/23/2022 | Faith: An Efficient Framework for Transformer Verification on GPUs
