A Length Adaptive Algorithm-Hardware Co-design of Transformer on FPGA Through Sparse Attention and Dynamic Pipelining

08/07/2022
by Hongwu Peng, et al.

Transformers have been considered among the most important deep learning models since 2018, in part because they establish state-of-the-art (SOTA) records and could potentially replace existing Deep Neural Networks (DNNs). Despite these remarkable triumphs, the long turnaround time of Transformer models is a widely recognized roadblock. The variety of sequence lengths imposes additional computing overhead, since inputs must be zero-padded to the maximum sentence length in a batch to accommodate parallel computing platforms. This paper targets field-programmable gate arrays (FPGAs) and proposes a coherent sequence-length-adaptive algorithm-hardware co-design for Transformer acceleration. In particular, we develop a hardware-friendly sparse attention operator and a length-aware hardware resource scheduling algorithm. The proposed sparse attention operator reduces the complexity of attention-based models to linear and alleviates off-chip memory traffic. The proposed length-aware hardware resource scheduling algorithm dynamically allocates hardware resources to fill the pipeline slots and eliminate bubbles in NLP tasks. Experiments show that our design incurs very small accuracy loss, achieves 80.2× and 2.6× speedups over CPU and GPU implementations respectively, and delivers 4× higher energy efficiency than a state-of-the-art GPU accelerator optimized via cuBLAS GEMM.
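The sparse attention operator and the length-aware scheduler described in the abstract are realized in FPGA hardware; the Python sketch below is only a rough software analogue under assumed details (a fixed attention window, simple length bucketing, illustrative function names) and is not the authors' implementation. It shows why a banded attention pattern scales linearly with sequence length, and how grouping inputs of similar length removes most zero-padding.

```python
import numpy as np

def local_window_attention(q, k, v, window=16):
    """Hypothetical banded/local sparse attention for one head.
    Each query attends only to keys within +/- `window`, so the cost
    grows as O(n * window) instead of O(n^2).
    q, k, v: (n, d) arrays."""
    n, d = q.shape
    out = np.zeros((n, d))
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = q[i] @ k[lo:hi].T / np.sqrt(d)   # (hi - lo,)
        weights = np.exp(scores - scores.max())   # numerically stable softmax
        weights /= weights.sum()
        out[i] = weights @ v[lo:hi]               # weighted sum of values
    return out

def length_bucketed_batches(seq_lens, batch_size=8):
    """Hypothetical length-aware scheduler: batch sequences of similar
    length together so each batch is padded only to its own maximum
    length, not to the global maximum."""
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
```

For example, `length_bucketed_batches([6, 120, 7, 118, 5, 122], batch_size=3)` groups the three short and the three long sequences into separate batches, the software analogue of filling pipeline slots with similarly sized work instead of padding everything to length 122.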


research 07/16/2020
FTRANS: Energy-Efficient Acceleration of Transformers using FPGA
In natural language processing (NLP), the "Transformer" architecture was...

research 09/20/2022
Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design
Attention-based neural networks have become pervasive in many AI tasks. ...

research 10/09/2020
TurboTransformers: An Efficient GPU Serving System For Transformer Models
The transformer is the most critical algorithm innovation of the Nature ...

research 08/26/2023
An Efficient FPGA-Based Accelerator for Swin Transformer
Since introduced, Swin Transformer has achieved remarkable results in th...

research 10/18/2021
Energon: Towards Efficient Acceleration of Transformers Using Dynamic Sparse Attention
In recent years, transformer models have revolutionized Natural Language...

research 06/05/2018
Combining Multiple Optimised FPGA-based Pulsar Search Modules Using OpenCL
Field-Programmable Gate Arrays (FPGAs) are widely used in the central si...

research 09/18/2018
Shouji: Fast and Efficient Computation of Banded Sequence Alignment
Motivation: The ability to generate massive amounts of sequencing data c...
