I Introduction
Recent years have witnessed the great success of attention-based neural networks (NNs) on many AI tasks [brauwers2021general]. The attention mechanism [vaswani2017attention], which captures long-range information from sequences of data, has demonstrated excellent algorithmic performance in various natural language processing [devlin2018bert, radford2019language] and computer vision [dosovitskiy2020image] applications. However, the advances of attention-based NNs come at a cost: the use of attention and linear layers significantly increases the computational load, resulting in a large overhead on their speed and power consumption [wang2020spatten]. Figure 1 shows an operation breakdown of four mainstream attention-based models. For short input sequences, linear layers occupy over % of the operation counts. As the input sequence grows, the computation becomes gradually dominated by the attention layers. Since both attention and linear layers are memory- and compute-intensive, it is challenging to achieve high hardware performance on attention-based NNs across input sequences of various lengths.

So far, various approaches and designs have been introduced to accelerate attention-based DNNs. On the algorithmic level, several efficient sparse variants have attempted to reduce the computational complexity [choromanski2020rethinking, wang2020linformer, kitaev2020reformer, tay2020sparse, beltagy2020longformer, zaheer2020big, child2019generating]. However, most of these approaches focus only on reducing the number of parameters and operations without considering the real hardware performance, such as end-to-end latency. Furthermore, the hardware efficiency of implementing these sparsity patterns on real hardware designs is often overlooked. On the hardware level, although various highly-optimized accelerators (Table I) have been proposed [ham2020, wang2020spatten, sanger201micro, zhou2021energon, elsa2021isca, dota2022asplos, li2020ftrans, edgebert2021micro], several issues still remain unresolved:


Most current accelerators focus only on optimizing either the FFNs [li2020ftrans] or the attention mechanism [ham2020, wang2020spatten, sanger201micro, zhou2021energon, elsa2021isca, dota2022asplos]. Without jointly optimizing both parts, these hardware designs lack scalability when accelerating end-to-end attention-based NNs with different input lengths.

While optimizing the attention mechanism, most existing designs dynamically detect and prune redundant computation at runtime to achieve high sparsity on specific datasets and networks. However, the generality of these dynamic approaches needs further testing, as their performance gain may vary across different datasets and network architectures.

The sparsity patterns introduced by these dynamic approaches are often unstructured, requiring dynamic hardware controllers to exploit the sparsity. Such complicated controllers often contain large numbers of clocking elements, and their hardware overhead increases as the transistor size shrinks [jang2021sparsity]. As such, the performance or energy gains of these dynamic methods may be diminished.
To address the aforementioned issues, this paper adopts butterfly sparsity to accelerate attention-based models with three novel aspects (Table I): i) fine-grained structured regularity, which possesses regular data accesses to optimize both memory and compute efficiency; ii) a static sparsity pattern, which avoids the need for a dynamic controller in hardware; and iii) sparsity exploitation on both attention and linear layers, which allows scalable end-to-end acceleration of attention-based NNs. We therefore propose FABNet, a hardware-friendly model for FFT, Attention and Butterfly-Net. To fully exploit the sparsity in hardware, we propose an adaptable butterfly accelerator that can be configured at runtime via dedicated hardware control to accelerate different layers using a single unified engine, significantly improving hardware efficiency. To push the performance limit, we jointly optimize the model and hardware via a co-design approach. Overall, this work makes the following contributions:


A hardware-friendly attention-based model, FABNet, that adopts the butterfly sparsity pattern in both attention and linear layers for end-to-end acceleration (Section III).

A novel adaptable butterfly accelerator configurable at runtime via dedicated hardware control to accelerate different layers using a single unified engine (Section IV).

Several hardware optimizations that improve hardware efficiency, and a co-design approach to jointly optimize both algorithmic and hardware parameters (Section V).

A comprehensive evaluation on different datasets that demonstrates the advantages of our approach over CPU, GPU and state-of-the-art accelerators (Section VI).
Accelerators | Pattern Regularity | Sparsity Pattern | Sparsity Location
[ham2020] | unstructured | dynamic | attention
SpAtten [wang2020spatten] | coarse-grained structured | dynamic | attention
Sanger [sanger201micro] | load-balanced unstructured | dynamic | attention
Energon [zhou2021energon] | unstructured | dynamic | attention
ELSA [elsa2021isca] | unstructured | dynamic | attention
DOTA [dota2022asplos] | unstructured | dynamic | attention
FTRANS [li2020ftrans] | None | static | FFN
EdgeBERT [edgebert2021micro] | None | dynamic | layer
Our work | fine-grained structured | static | attention & FFN
II Background and Motivation
II-A Attention-Based Neural Networks
Based on their network structure, attention-based NNs can be classified into three categories: i) encoder-decoder, ii) encoder-only, and iii) decoder-only networks. The encoder-decoder NNs are mainly designed for sequence-to-sequence tasks, such as machine translation [vaswani2017attention]. One of the most widely used encoder-decoder networks is the Transformer, which is constructed from a stack of encoder and decoder blocks. Figure 2 illustrates the structure, annotated with the input length, hidden size and FFN expand ratio. Each encoder starts with a multi-head attention module, followed by a feed-forward network (FFN) consisting of two linear (fully-connected) layers. Finally, residual addition [he2016deep] and layer normalization (LN) [ba2016layer] are applied after the FFN. Within each multi-head attention, the inputs are first mapped to query ($Q$), key ($K$) and value ($V$) matrices through three different linear layers. The query matrix is then multiplied with the key matrix, followed by a softmax operation, to obtain the score matrix ($S$). The generated $S$ is multiplied with $V$, and the resultant matrix flows into another linear layer, which generates the final output matrix of the multi-head attention. Similar to the encoder, the decoder employs two multi-head attention modules and one FFN; the difference is that the key and value inputs of the second attention module come from the last encoder.

Based on the original encoder-decoder structure of the Transformer, different variants have been proposed. The encoder-only networks, such as BERT [devlin2018bert] and XLM [lample2019cross], are auto-encoding models that have been widely applied to NLP tasks, such as sequence classification [wang2018glue]. The Vision Transformer (ViT) [dosovitskiy2020image] also lies in this category: an extra linear projection layer is introduced at the beginning, while its encoder layers correspond to the encoder part of the original Transformer. Finally, the decoder-only networks represent the auto-regressive models designed for NLP tasks, such as language modeling [ma2019tensorized]. GPT [radford2019language] is a typical decoder-only model that corresponds to the decoder part of the original Transformer. Although we focus on encoder-only networks in this work, our hardware design is flexible and applicable to decoders too.

II-B Butterfly Matrices and FFT
Despite the impressive accuracy attained by attention-based NNs, these models are expensive and not scalable; e.g. the self-attention mechanism in the Transformer scales quadratically in computation and memory with the input sequence length. As a result, numerous works [choromanski2020rethinking, wang2020linformer, child2019generating, chen2021scatterbrain] adopt structured linear mappings, such as sparse and low-rank matrices, to approximate the attention matrices and/or the weight matrices in the feed-forward layers. Choosing an appropriate structure for each linear mapping, however, is application-dependent, often requiring domain expertise and entailing an arduous process of hand-picking solutions, as different structures have different trade-offs in accuracy and speed.
To counteract this, recent work has utilized butterfly matrices [parker1995random, dao2019learning], which are universal representations of structured matrices with a simple recursive structure. Specifically, each butterfly matrix $B$ of size $n \times n$ encodes the recursive divide-and-conquer structure with butterfly patterns and hence can be expressed as the product of $\log_2 n$ sparse butterfly factor matrices [de2018two] as follows:
$$B = B_n \, B_{n/2} \cdots B_2,$$
where each $B_k$, a butterfly factor matrix, is block-diagonal with blocks of the form
$$\begin{bmatrix} D_1 & D_2 \\ D_3 & D_4 \end{bmatrix},$$
i.e. a $2 \times 2$ block matrix of diagonal matrices $D_1, \ldots, D_4$ of size $k/2$, whose entries can be trained via gradient-based methods.
Due to their expressiveness in representing structured matrices and approximating unstructured data, butterfly matrices and their variants [chen2021pixelated, dao2020kaleidoscope] have found success in compressing attention and weight matrices, considerably improving the accuracy and efficiency of attention-based NNs. For instance, applying butterfly factorization to a linear layer with an $n \times n$ weight matrix reduces the computational and memory complexity from $\mathcal{O}(n^2)$ to $\mathcal{O}(n \log n)$.
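The complexity reduction above can be made concrete with a small NumPy sketch (illustrative only; all function names are ours, not from any released codebase). It multiplies a vector by a butterfly matrix stage by stage, using four multiplications per butterfly, i.e. roughly $2n \log_2 n$ multiplications instead of $n^2$, and checks the result against the explicit dense product of the factors:

```python
import numpy as np

def butterfly_matvec(factors, x):
    """Multiply a butterfly matrix (given by its factors) with a vector in
    O(n log n) operations. factors[s] has shape (n//2, 4): the diagonal
    entries (d1, d2, d3, d4) of every 2x2 butterfly in stage s (block size
    2**(s+1)), applied right to left: y = B_n ... B_4 B_2 x."""
    n = x.size
    y = x.astype(float).copy()
    for s, w in enumerate(factors):
        k = 2 ** (s + 1)          # block size of this butterfly factor matrix
        half, b = k // 2, 0
        for blk in range(0, n, k):
            for j in range(half):
                i0, i1 = blk + j, blk + j + half
                d1, d2, d3, d4 = w[b]
                y[i0], y[i1] = d1 * y[i0] + d2 * y[i1], d3 * y[i0] + d4 * y[i1]
                b += 1
    return y

def factor_matrix(n, k, w):
    """Materialize one sparse butterfly factor matrix B_k (only 2n nonzeros)."""
    B = np.zeros((n, n))
    half, b = k // 2, 0
    for blk in range(0, n, k):
        for j in range(half):
            i0, i1 = blk + j, blk + j + half
            B[i0, i0], B[i0, i1] = w[b][0], w[b][1]
            B[i1, i0], B[i1, i1] = w[b][2], w[b][3]
            b += 1
    return B

# Sanity check: the fast O(n log n) product matches the dense O(n^2) one.
rng = np.random.default_rng(0)
n = 8
factors = [rng.standard_normal((n // 2, 4)) for _ in range(3)]
x = rng.standard_normal(n)
dense = np.eye(n)
for s, w in enumerate(factors):
    dense = factor_matrix(n, 2 ** (s + 1), w) @ dense
assert np.allclose(dense @ x, butterfly_matvec(factors, x))
```

Each of the $\log_2 n$ factor matrices carries only $2n$ nonzero entries, which is the source of both the memory and the compute savings.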
Besides attention and weight matrices, some designs have explored replacing the entire attention mechanism with more efficient counterparts [tolstikhin2021mlp]. A prominent example is FNet [lee2021fnet], in which the self-attention modules are replaced by 2D Discrete Fourier Transform (DFT) operations. Specifically, for each input, a 1D DFT is applied along the sequence and the hidden dimension independently, keeping only the real component of the resulting outputs. To reduce the DFT computation time, the Cooley-Tukey Fast Fourier Transform (FFT) algorithm [cooley1965algorithm] is used. As the use of the DFT facilitates information flow across all embeddings, it achieves similar algorithmic performance to vanilla self-attention layers, but at a significant reduction in latency and memory.

On the algorithmic front, our proposed FABNet utilizes a mixture of these techniques (FFT and butterfly matrices) to outperform relevant approximation approaches in terms of accuracy. Notably, since FFT matrices can be considered a special case of butterfly matrices, with $D_1$, $D_3$ being identity matrices and $D_2$, $D_4$ acting as twiddle factors, both the FFT and butterfly matrices possess the recursive butterfly structure. Therefore, it is possible to use a unified computational and data access pattern and devise a single hardware engine to accelerate both FFT and butterfly-based operations with high hardware efficiency.
II-C Latency Breakdown and Motivation
The operation counts in Figure 1 reveal that the computation of attention-based NNs is dominated by different components as the length of the input sequences changes. To further investigate the real hardware performance of each sub-component, we profile the execution time of the BERT-Large model on an Nvidia V100 GPU and an Intel Xeon Gold 6154 CPU. The length of input sequences is set to , and on both devices, and the batch size for GPU and CPU is and , respectively. Figure 3 shows the latency breakdown. We split the latency into three main sub-components: attention layers, linear layers, and other operations, e.g. layer normalization, residual connections, matrix transformations and IO operations. Notably, on both CPU and GPU, linear layers take up a significant portion of the execution time, and respectively, when the input length is small. As the input length grows, the execution time of the attention layers increases gradually and becomes dominant. As such, the latency is dominated by different components depending on the length of the input sequence. According to Amdahl's law [amdahl1967validity], to achieve high hardware performance across different input lengths, it is necessary to optimize both attention and linear layers.

The majority of previous accelerators for attention-based NNs focus on optimizing a single component of the entire model (either attention or FFN, as shown in Table I), leading to suboptimal end-to-end performance gains. The execution time of these accelerators is heavily dependent on the input length, which varies across different applications, reducing the scalability of these hardware designs and thus narrowing their deployability in real-world scenarios. Naively combining previous works that optimize the linear layers [li2020ftrans] and the attention layers [ham2020, wang2020spatten, sanger201micro, zhou2021energon, elsa2021isca, dota2022asplos], however, would result in low hardware efficiency, as they adopt different sparsity patterns. As a result, designing an end-to-end accelerator for scalable attention-based NNs remains an open problem. In this work, we address this challenge through an algorithm-hardware co-design approach. On the algorithmic level, we propose a hardware-friendly model called FABNet, which adopts a unified butterfly sparsity pattern to compress both attention and linear layers. On the hardware level, we propose an adaptable butterfly design that can be configured at runtime to accelerate different layers in FABNet using one unified hardware engine.
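The Amdahl's-law argument above can be illustrated with a short calculation (the 80%/20% split below is a hypothetical example, not a measured profile):

```python
def amdahl_speedup(fraction, component_speedup):
    """Overall speedup when only `fraction` of the runtime is accelerated
    by `component_speedup` (Amdahl's law)."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)

# Suppose linear layers take 80% of the time at a short input length.
# An (effectively) infinitely fast attention engine then caps the overall
# speedup at 1.25x, while a 10x-faster linear-layer engine already
# yields about 3.6x:
attention_only = amdahl_speedup(0.2, 1e12)  # accelerate only the 20%
linear_only = amdahl_speedup(0.8, 10.0)     # accelerate only the 80%
print(attention_only, linear_only)
```

The roles flip for long sequences, where attention dominates, which is why both components must be optimized jointly.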
III Algorithm Optimization
III-A Computational Analysis of Sparsity Patterns
Various pruning schemes have been proposed to reduce the computational complexity of attention-based NNs, leading to different efficient models [choromanski2020rethinking, wang2020linformer, kitaev2020reformer, tay2020sparse, beltagy2020longformer, zaheer2020big, child2019generating, chen2021pixelated, lee2021fnet, dao2020kaleidoscope, dao2022monarch]. By analysing the computational and data access patterns of these variants, we define five basic sparsity patterns, shown in Figure 4: i) low-rank, ii) sliding-window, iii) butterfly, iv) random, and v) block-wise. As the low-rank approximation of an attention matrix requires both sequential row and column reads, while the data are usually stored in either row-major or column-major order only, the hardware efficiency of low-rank sparsity is inherently diminished. Random sparsity also exhibits low hardware efficiency due to its random read pattern. Furthermore, we observe that the sparsity in various sparse variants can be expressed as different combinations of the basic sparsity patterns, as summarized in Table II. As some basic sparsity patterns can only capture either long-range global or short-range local information (Figure 4), the rationale behind using multiple sparsity patterns within each variant is mainly to compensate for the underlying accuracy loss. For example, Pixelfly [chen2021pixelated] introduces an additional low-rank sparsity pattern to increase the expressiveness of its flat block-wise butterfly pattern and improve accuracy.
Model | Sparsity Pattern | Att. | FFN | Unified Sparsity | Co-Design
Performer [choromanski2020rethinking] | Low-Rank | ✔ | ✗ | ✗ | ✗
Linformer [wang2020linformer] | Low-Rank (extra kernels) | ✔ | ✗ | ✗ | ✗
Reformer [kitaev2020reformer] | Blockwise (extra kernels) | ✔ | ✗ | ✗ | ✗
Sparse Sinkhorn [tay2020sparse] | Blockwise + Random | ✔ | ✗ | ✗ | ✗
Longformer [beltagy2020longformer] | Sliding-Window + Low-Rank | ✔ | ✗ | ✗ | ✗
BigBird [zaheer2020big] | Random + Sliding-Window + Low-Rank | ✔ | ✗ | ✗ | ✗
FNet [lee2021fnet] | Butterfly | ✔ | ✗ | ✗ | ✗
Kaleidoscope [dao2020kaleidoscope] | Butterfly | ✗ | ✔ | ✔ | ✗
Sparse Trans. [child2019generating] | Low-Rank + Butterfly + Sliding-Window | ✔ | ✗ | ✗ | ✗
Pixelfly [chen2021pixelated] / Monarch [dao2022monarch] | Butterfly + Block-Wise + Low-Rank | ✔ | ✔ | ✗ | ✗
Our work | Butterfly | ✔ | ✔ | ✔ | ✔
Different sparsity patterns exhibit diverse data access patterns, which calls for custom hardware support. However, supporting multiple sparsity patterns may complicate the hardware design. For instance, in order to fully utilize the sparsity of the random pattern, complex dynamic controllers are required to achieve load-balanced execution on different hardware engines [sanger201micro, geng2020awb]. The extra overhead of such controllers may counteract the improvement brought by skipping redundant operations [jang2021sparsity].
In this work, we aim to find a hardware-friendly sparsity pattern that: 1) has structured data access patterns to simplify the memory design, 2) captures both local- and global-range information with a single sparsity pattern, and 3) is applicable to both the attention mechanism and the FFNs to sustain its performance improvement across both long and short input sequences. To meet these requirements, we adopt butterfly sparsity as the basis for constructing our efficient algorithm.
Compared to other sparsity patterns, butterfly sparsity provides a number of favorable properties. As shown in Figure 4, although random sparsity is able to capture both local and global information, it has two drawbacks compared to butterfly sparsity: 1) it requires complicated controllers with excessive hardware overhead [jang2021sparsity], and 2) its performance gain cannot be guaranteed, as the sparsity may vary substantially among different datasets and tasks. Compared with random sparsity, the sliding-window pattern is more hardware-friendly. However, Table II shows that it often requires low-rank sparsity to compensate for the accuracy loss, as sliding-window sparsity only captures the local relationship within each window. Moreover, although some variants adopt a single low-rank or block-wise sparsity pattern with satisfactory algorithmic performance, they require extra algorithmic operations and dedicated computational kernels during inference (e.g. the locality-sensitive hashing (LSH) in Reformer [kitaev2020reformer]), resulting in large hardware overhead. In contrast, this paper treats butterfly sparsity as a promising method due to its regular data access pattern and its ability to capture both global and local information.
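The "global and local" property of butterfly sparsity can be checked numerically: each butterfly factor matrix holds only $2n$ nonzeros (a local, structured access pattern), yet composing the $\log_2 n$ stages makes every output depend on every input. A small illustrative script (our own construction, not code from any released implementation):

```python
import numpy as np

def butterfly_support(n, k):
    """0/1 support of the butterfly factor matrix with block size k:
    each row touches itself and its partner at distance k // 2."""
    S = np.zeros((n, n), dtype=int)
    half = k // 2
    for blk in range(0, n, k):
        for j in range(half):
            i0, i1 = blk + j, blk + j + half
            S[i0, i0] = S[i0, i1] = S[i1, i0] = S[i1, i1] = 1
    return S

n = 16
reach = np.eye(n, dtype=int)
k = 2
while k <= n:
    # Compose the stages: which inputs can influence which outputs so far.
    reach = (butterfly_support(n, k) @ reach > 0).astype(int)
    k *= 2

# Each factor has only 2n nonzeros, yet after log2(n) stages every output
# depends on every input: local structure, global receptive field.
print(int(reach.sum()))  # 256 = 16 * 16: fully connected
```

This is exactly the mixing behavior that lets a single, static, regular pattern replace combinations of low-rank, sliding-window and random sparsity.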
III-B Unified Butterfly Pattern for Attention and Linear Layers
The butterfly pattern has demonstrated its effectiveness and generality in approximating linear transformations [dao2020kaleidoscope]. Furthermore, Lee-Thorp et al. [lee2021fnet] have shown the potential of simplifying the computation by replacing the entire attention layer with a Fourier transform, which effectively mixes tokens without explicitly approximating the attention mechanism. To maximize the reduction in computation while maintaining acceptable algorithmic performance, we start by proposing two basic building blocks for scalable inference: 1) the Attention Butterfly (ABfly) block, and 2) the Fourier Butterfly (FBfly) block.

In the ABfly block, we retain the backbone of the attention module and compress all the linear layers using butterfly factorization. Specifically, the ABfly block starts with three butterfly linear layers that generate the $Q$, $K$ and $V$ matrices. The results are fed into a vanilla multi-head attention layer and another butterfly linear layer to obtain the relationships among different tokens. A butterfly FFN consisting of two butterfly linear layers is placed at the end of the ABfly block for additional processing. To further reduce the amount of computation and the number of parameters, we replace the attention module with a 2D Fourier transform layer, implemented using FFT, resulting in a more compute-efficient block called FBfly. The use of FFT effectively mixes different input tokens, which allows the following butterfly FFN to process a longer sequence. More importantly, all computation in the FBfly block, which involves the FFT's twiddle factors and the butterfly linear layers' weights, is performed using a unified butterfly pattern, resulting in higher hardware efficiency over previous works.
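As a rough illustration of the FBfly dataflow, the following NumPy sketch mimics the block's forward pass: a 2D FFT for token mixing (keeping only the real part) followed by a butterfly-factorized FFN. It is a simplified stand-in for the actual PyTorch implementation: residual additions and layer normalization are omitted, and the butterfly layers here are square, whereas the real FFN expands the hidden dimension by the expand ratio.

```python
import numpy as np

rng = np.random.default_rng(0)

def butterfly_linear(x, factors):
    """Apply a butterfly-factorized 'linear layer' along the last axis.
    factors[s]: (d//2, 4) diagonal entries for stage block size 2**(s+1)."""
    y = x.copy()
    d = x.shape[-1]
    for s, w in enumerate(factors):
        k = 2 ** (s + 1)
        half, b = k // 2, 0
        for blk in range(0, d, k):
            for j in range(half):
                i0, i1 = blk + j, blk + j + half
                d1, d2, d3, d4 = w[b]
                y[..., i0], y[..., i1] = (d1 * y[..., i0] + d2 * y[..., i1],
                                          d3 * y[..., i0] + d4 * y[..., i1])
                b += 1
    return y

def fbfly_block(x, ffn_factors_1, ffn_factors_2):
    # 1) Token mixing: 2D FFT over (sequence, hidden), keep the real part.
    mixed = np.fft.fft2(x).real
    # 2) Butterfly FFN: two butterfly linear layers with a ReLU in between.
    h = np.maximum(butterfly_linear(mixed, ffn_factors_1), 0.0)
    return butterfly_linear(h, ffn_factors_2)

seq_len, hidden = 8, 16
x = rng.standard_normal((seq_len, hidden))
stages = int(np.log2(hidden))
f1 = [rng.standard_normal((hidden // 2, 4)) for _ in range(stages)]
f2 = [rng.standard_normal((hidden // 2, 4)) for _ in range(stages)]
out = fbfly_block(x, f1, f2)
print(out.shape)  # (8, 16)
```

Note that both the FFT and the butterfly linear layers walk the same stage-by-stage butterfly pattern, which is what allows a single hardware engine to serve both.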
Although FBfly is less compute- and memory-intensive than ABfly, the use of the Fourier transform layer may degrade accuracy [lee2021fnet]. To preserve high accuracy, we propose a novel butterfly-based network called FABNet that uses a hybrid of the ABfly and FBfly blocks, as depicted in Figure 5: a number of FBfly blocks at the beginning, with ABfly blocks stacked on top. Both block counts are exposed as hyperparameters, enabling a trade-off between algorithmic and hardware performance. To optimize this trade-off, we develop a co-design method (Section V-C) that explores the design space of both the neural architecture and the hardware design.

IV Hardware Accelerator
IV-A Architecture Overview
Figure 6 shows the proposed hardware accelerator, consisting of a Butterfly Processor (BP), an Attention Processor (AP), a Post-processing Processor (PostP), the off-chip memory, and several on-chip buffers. BP consists of multiple Butterfly Engines (BEs), which accelerate the computations that involve butterfly patterns, including both FFT and butterfly linear transformations. AP contains multiple Attention Engines (AEs), each composed of one QK unit and one SV unit. The QK unit implements the softmax and the matrix multiplication between queries and keys. The SV unit receives the outputs of the QK unit and multiplies them with the value vectors to generate the final results of the attention layer. The PostP module is responsible for executing the layer normalization and shortcut (SC) addition. To ease the on-chip memory consumption, the intermediate results between different FFT and butterfly operations are transferred back to the off-chip memory. Although doing so increases the bandwidth requirement, it ensures our accelerator is scalable on hardware platforms with limited on-chip memory. To improve the overall hardware performance, all the on-chip buffers utilize double-buffering to overlap the data transfer with the computation.

IV-B Adaptable Butterfly Engine
Figure 6b shows the hardware architecture of the BE. Each BE is mainly composed of a butterfly memory system and multiple adaptable Butterfly Units (BUs). To improve hardware efficiency and enable the use of a single unified engine, the BE module is designed with a focus on adaptability. As such, it can be configured at runtime via programmable multiplexers and demultiplexers to execute either an FFT or a butterfly linear transformation.
IV-B1 Adaptable Butterfly Unit
Figure 7a depicts the architecture of the proposed adaptable BU. Each adaptable BU consists of four real-number multipliers and two real-number adders, followed by two complex-number adders. The inputs and twiddle factors of both the FFT and the butterfly linear transformation are connected to the multipliers, with eight multiplexers used to select the correct inputs for each operation. Two demultiplexers are placed after the real-number adders to control the output flow.
When performing the butterfly linear transformation (Figure 7b), the twiddle factors are non-symmetric real numbers. Hence, the output of each twiddle multiply can be computed as:
$$y_0 = w_1 x_0 + w_2 x_1, \qquad y_1 = w_3 x_0 + w_4 x_1,$$
where $x$ and $w$ represent the inputs and twiddle factors, respectively. To perform the butterfly linear transformation, the four multipliers in each BU are configured to execute the four real-number multiplications in the equation above. The values $x_0$, $x_1$ and $w_1, \ldots, w_4$ are selected via multiplexers as the operands of the multipliers. At the same time, the results generated from the real-number adders/subtractors are output directly through the demultiplexers.
For FFT (Figure 7c), since the twiddle factors of the FFT are complex and symmetric, only one complex-number multiplication is required per twiddle multiply. Thus, by selecting the complex inputs $x_0$, $x_1$ and the twiddle factor $w$, we reuse the four real-number multipliers in each BU to perform the required complex-number multiplication. The demultiplexers are then configured to route the results to the complex-number adders/subtractors to obtain the final results $y_0 = x_0 + w x_1$ and $y_1 = x_0 - w x_1$. The control signals for the multiplexers and demultiplexers are set before running each layer. As such, the proposed adaptable BE can accelerate both FFTs and butterfly linear transformations by reusing the multipliers, adders and subtractors.
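The multiplier-reuse scheme of the adaptable BU can be sketched in software as two modes sharing the same four multipliers (the function names and operand ordering below are ours, chosen to mirror Figure 7; complex numbers are passed as (real, imag) pairs):

```python
def bu_linear_mode(x0, x1, w1, w2, w3, w4):
    """Butterfly linear transform: four independent real multiplications,
    then two real adders: y0 = w1*x0 + w2*x1, y1 = w3*x0 + w4*x1."""
    m1, m2, m3, m4 = w1 * x0, w2 * x1, w3 * x0, w4 * x1  # the 4 multipliers
    return m1 + m2, m3 + m4                              # the 2 real adders

def bu_fft_mode(x0, x1, w):
    """FFT butterfly: the same four multipliers compute one complex product
    w * x1; the complex adders/subtractors then form x0 +/- w*x1."""
    (a, b), (c, d) = x1, w                     # x1 = a + bi, w = c + di
    m1, m2, m3, m4 = a * c, b * d, a * d, b * c  # reuse the 4 multipliers
    tr, ti = m1 - m2, m3 + m4                    # Re/Im of w * x1
    r0, i0 = x0
    return (r0 + tr, i0 + ti), (r0 - tr, i0 - ti)
```

In hardware, the mode is selected by configuring the multiplexers and demultiplexers before each layer; here it is simply a choice of function.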
IV-B2 Butterfly Memory System
Our butterfly memory system comprises an input manager, a serial-to-parallel (S2P) module, a parallel-to-serial (P2S) module and butterfly buffers. As shown in Figure 8a, the butterfly pattern requires different data accesses at different stages. The conventional column-major or row-major order will cause bank conflicts while reading the data. For instance, accessing index pair and of the first stage causes a read conflict in the column-major order shown in Figure 8b, in which each row represents a memory bank. The row-major order suffers from the same issue while reading and in the third stage.
To avoid such bank conflicts, we introduce a custom data layout strategy and implement it using the S2P module shown in Figure 9. We permute each column using a starting position, which indicates how many rows the first element in the current column should be shifted down, and define the starting position using the following formula:
For each group of columns, the starting position is obtained by shifting one position down, as shown in Figure 9a. The starting positions are generated using a counter together with bit-count and addition operations (Figure 9b). After packing the serial data together, the S2P module permutes them based on the starting positions.
Figure 10 presents an example with 16 inputs, where the data required by the first and second stages of the butterfly pattern are read from the buffers without bank conflicts. However, as the butterfly units receive data in pairs, an extra pairing step is required after the S2P module; an example is the second output column of the first stage in Figure 10b. To pair indices, we design an index coalescing module placed before the butterfly units (Figure 11). Based on the index of each input, a bit-count and addition operation is used to calculate the corresponding shift position. Then, a crossbar coalesces the index pairs based on the indices and shift positions. To ensure that the outputs generated from the butterfly units preserve the original order, a recover module is used before the data is written back.
V Optimizations and Co-Design
V-A Memory Sharing in Butterfly Buffers
We employ butterfly buffers to allow the overlap between data transfer and computation. To reduce the memory consumption and improve the hardware efficiency, the butterfly buffers are shared between both FFT and butterfly linear transformation. Nonetheless, as the data width of FFT is twice that of the butterfly linear transformation, different address mapping and overlapping strategies are required.
Figure 12 shows the proposed address mapping strategies for the butterfly linear transformation and the FFT. Assuming the bitwidth of real numbers is 16 bits, each input buffer is 16-bit wide. While processing butterfly linear transformations, input buffers A and B are used as two independent ping-pong banks with separate read and write ports (top right in Figure 12). In this manner, when input buffer A is used for computation, buffer B can start the input data transfer for the next batch, leading to the overlapping strategy shown in Figure 13a. While processing FFT, since the data include both real and imaginary parts, which require 32-bit read and write ports, we concatenate the lower parts of input buffers A and B as the first ping-pong bank for the storage of complex numbers. To improve the hardware efficiency, we further reuse the higher parts of both buffers as the second ping-pong bank. As the computation requires both read and write accesses, we adopt a different overlapping strategy that pipelines the output data transfer only with the input data load of the next batch (Figure 13b). By employing different address mapping and overlapping strategies for FFT and the butterfly linear transformation, we maximise the hardware efficiency and performance.
V-B Fine-Grained Pipelining between BP and AP
While executing the ABfly block, BP and AP are both in use, performing the butterfly linear transformations and the attention matrix multiplications, respectively. To further improve performance when executing the ABfly block, we employ fine-grained pipelining between BP and AP.
Figure 14 illustrates the dataflow of BP and AP. In the naive implementation, the key ($K$), value ($V$) and query ($Q$) matrices are generated sequentially by BP. After $K$, $V$ and $Q$ are computed, AP starts the computation of $QK$ and $SV$. To optimize this process, we reorder the execution sequence of the linear layers such that BP computes $K$ and $V$ at the beginning (Figure 14b). As $QK$ can be decomposed into multiple vector-matrix multiplications that multiply different rows of $Q$ with the entire matrix $K$, we can start the computation of $QK$ once the first few rows of $Q$ become available. As such, the $QK$ in AP can be pipelined with the computation of $Q$ in BP. At the same time, since $S$ is generated by the QK unit in a row-by-row fashion, we can further pipeline $SV$ with $QK$, as the computation of $SV$ can start once the first few rows of $S$ are generated by the QK unit. Assuming there are and rows in the and matrices, it takes and to compute one row in the SV and QK units, respectively. As such, the total latency reduction achieved is compared to the unoptimized non-pipelined implementation.
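A toy latency model illustrates why the fine-grained pipelining helps (all per-row latencies below are hypothetical, and the score matrix is assumed to have as many rows as Q; this is our own simplification, not the paper's analytical model):

```python
def naive_latency(n_rows, t_lin_row, t_qk_row, t_sv_row):
    """Q, K, V fully computed first, then QK, then SV, strictly in order."""
    return n_rows * (3 * t_lin_row + t_qk_row + t_sv_row)

def pipelined_latency(n_rows, t_lin_row, t_qk_row, t_sv_row):
    """K and V first; then the QK unit consumes Q rows as BP emits them,
    and the SV unit consumes score rows as the QK unit emits them."""
    t = 2 * n_rows * t_lin_row                  # K and V produced up front
    qk_free = sv_free = 0.0
    for i in range(n_rows):
        q_ready = t + (i + 1) * t_lin_row           # row i of Q leaves BP
        qk_free = max(qk_free, q_ready) + t_qk_row  # row i of S is ready
        sv_free = max(sv_free, qk_free) + t_sv_row  # row i of S*V is ready
    return sv_free

print(naive_latency(64, 1.0, 0.5, 0.5))      # 256.0
print(pipelined_latency(64, 1.0, 0.5, 0.5))  # 193.0
```

Once the pipeline fills, the attention matrix multiplications are almost entirely hidden behind the generation of Q, which is the effect the reordering in Figure 14b is designed to achieve.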
V-C Algorithm and Hardware Co-Design
The overall design space of our end-to-end system is formed by FABNet's hyperparameters and the butterfly accelerator's hardware parameters. Specifically, the joint design space consists of: 1) the algorithm parameters, i.e. the hidden size, the FFN expand ratio, the total number of blocks and the number of ABfly blocks in FABNet; and 2) the hardware parameters, i.e. the parallelism of the BUs and BEs in BP, and the parallelism of the QK and SV units in AP.
To assess the trade-off provided by each design point, we need to evaluate its algorithmic performance (e.g. an accuracy metric), its latency and its resource consumption. During the search, the algorithmic performance is obtained by training and evaluating FABNet, while the latency is estimated by a custom simulator built for our butterfly accelerator. To verify whether a design can be accommodated by the target FPGA device, we developed an analytical model that estimates the consumption of DSP blocks and on-chip memory (BRAMs). As DSPs are mainly consumed by the multipliers in AP and BP, we formulate their usage as a function of the parallelism parameters, with a constant factor reflecting the number of multipliers in each BU. The BRAM consumption is mainly due to the shortcut buffer, query buffer, key buffer and the different buffers in each BU, including the butterfly buffer and the weight buffers.
The proposed analytical resource model is only used during the design space exploration stage. At the end of the co-design process, the final performance is obtained by running synthesis and place-and-route on our design with the optimized configurations.
Figure 15 illustrates the proposed co-design approach. Given a target dataset, an FPGA device, and both algorithmic and hardware performance constraints, we employ an exhaustive grid search to traverse the joint design space and find the Pareto-optimal set of algorithmic and hardware parameters. Each design point corresponds to a different compression ratio of FABNet and level of parallelism of the butterfly accelerator, and provides different accuracy, latency and resource consumption. The final output is the Pareto front of parameters for both FABNet and our butterfly accelerator that satisfies the given set of constraints.
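The grid search over the joint space can be sketched as follows. The evaluation function below is a stand-in for FABNet training, the cycle-accurate simulator and the analytical resource model; all parameter ranges, constants and the DSP budget are made up for illustration:

```python
import itertools

def evaluate(cfg):
    """Stand-in evaluation: in the real flow, accuracy comes from training
    FABNet, latency from the simulator, and DSP usage from the analytical
    resource model. The formulas here are illustrative only."""
    hidden, ratio, blocks, ab_blocks, bus, bes = cfg
    accuracy = 0.70 + 0.01 * ab_blocks + 0.0001 * hidden
    latency = blocks * hidden * ratio / (bus * bes)
    dsps = 4 * bus * bes                 # four multipliers per BU
    return accuracy, latency, dsps

space = list(itertools.product(
    [128, 256],   # hidden size
    [2, 4],       # FFN expand ratio
    [4, 8],       # total number of blocks
    [0, 2],       # number of ABfly blocks
    [2, 4],       # BUs per BE
    [2, 4]))      # BEs

DSP_BUDGET = 48   # hypothetical device constraint
feasible = [(evaluate(c), c) for c in space if evaluate(c)[2] <= DSP_BUDGET]

def dominates(a, b):
    """a dominates b if it is no worse on accuracy (max) and latency (min),
    and strictly better on at least one of them."""
    return (a[0] >= b[0] and a[1] <= b[1]) and (a[0] > b[0] or a[1] < b[1])

pareto = [(m, c) for m, c in feasible
          if not any(dominates(m2, m) for m2, _ in feasible)]
print(len(pareto), "Pareto-optimal design points")
```

In the real flow, each accuracy evaluation involves a training run, so the grid is kept coarse and infeasible points are pruned by the resource model before any training is launched.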
VI Evaluation
VI-A Experimental Setup
Benchmarks. To evaluate the algorithmic and hardware performance of our approach on workloads with long sequences, we choose five tasks from Long Range Arena [tay2020long]: hierarchical data classification (ListOPs), byte-level text classification (Text), byte-level document retrieval (Retrieval), image classification on sequences of pixels (Image), and classification of long-range spatial dependencies (Pathfinder). The input sequence lengths of these datasets range from to .
Software Implementation. We implement the vanilla Transformer [devlin2018bert], FNet [lee2021fnet] and our FABNet models using PyTorch (v) [pytorch]. The pretrained models are obtained from Huggingface [wolf2019huggingface]. During training, the batch size is  for both the Image and Pathfinder tasks, and  for the rest of the datasets. The learning rate is set to , except for the Image and Pathfinder tasks where we use  and , respectively. Multiple Nvidia A100 and V100 GPUs are used for training. To use the FFT cores on Nvidia GPUs, the PyTorch API "rfft2" is used to implement the FFT operation required in both FNet and FABNet. The high-performance CUDA implementation [dao2020kaleidoscope] of the butterfly linear transformation is adopted to accelerate both GPU training and inference. We define two models with different default settings: FABNet-Base (, , , ) and FABNet-Large (, , , ).

Hardware Implementation. We implement our hardware accelerators in Verilog. To evaluate performance in different scenarios, two Xilinx FPGA boards are used in our experiments: a VCU128 for cloud/server scenarios and a Zynq 7045 for edge/mobile settings. Xilinx Vivado 2019.1 is used for synthesis and implementation. While the maximum clock frequency of each design depends on the particular FPGA board and resource consumption, all the FPGA designs are clocked at 200 MHz, which is below the maximum. We obtain power consumption values using the Xilinx Power Estimator (XPE) tool and develop a cycle-accurate performance model to evaluate speed, which is cross-validated against our RTL simulation results generated by Vivado (we cross-validate the functionality and correctness of our RTL design against the ground-truth results generated from PyTorch; see Appendix A-C for details). Accesses to external memory are also modeled. We use 16-bit half-precision floating-point arithmetic in our hardware designs and deploy four multipliers in each BU. As the hidden dimension is usually at most , we set the depth of the butterfly, query and key buffers to . Finally, the size of the shortcut buffers is the same as that of the butterfly buffers.
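As a minimal illustration of the Fourier mixing that "rfft2" implements, the sketch below uses NumPy's equivalent API (the PyTorch call `torch.fft.rfft2` behaves analogously); the toy tensor shapes are assumptions.

```python
import numpy as np

def fourier_mixing(x):
    """FNet-style token mixing: a 2D FFT over the (sequence, hidden) axes,
    keeping only the real part of the spectrum."""
    return np.fft.fft2(x).real

x = np.random.default_rng(0).standard_normal((8, 6))  # (seq_len, hidden)
y = fourier_mixing(x)

# For real inputs, rfft2 returns just the non-redundant half-spectrum,
# which is why the rfft2 API suffices for both FNet and FABNet.
half = np.fft.rfft2(x)
```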
VI-B Algorithmic Performance
The FBfly introduced in Section III-B is an efficient alternative to the vanilla attention block. To evaluate its algorithmic impact on end-to-end models, we take a six-layer Transformer as an example (other models, such as GPT and BERT, follow the same Transformer architecture with only the encoder or decoder kept; to eliminate the effect of different training strategies and evaluate the quality of the architecture itself, we choose the vanilla Transformer for demonstration) and compress it with different numbers of FBfly blocks, starting from the last block and moving toward the first. Figure 16 shows the accuracy results on LRA-Text and LRA-Image. Although the accuracy fluctuates with the number of compressed layers, FBfly shows higher accuracy than the non-compressed Transformer with  and  compressed layers on LRA-Text and LRA-Image, respectively, demonstrating the improved algorithmic performance of our approach on end-to-end models.
To obtain the best possible algorithmic performance from each model, we use the optimized configuration specified in [xiong2021nystromformer] for both the vanilla Transformer and FNet (as the vanilla FNet suffers a significant accuracy loss on the Retrieval task, we increase its hidden size to ). We perform a simple grid search to optimize the hyperparameters of our FABNet. Table III presents the optimized accuracy of the different models. FABNet achieves higher accuracy than both Transformer and FNet on three out of five tasks: ListOPs, Retrieval and Image. On average, FABNet matches the accuracy of the Transformer. To investigate the efficiency of FABNet, Figure 17 shows the compression rate of our optimized FABNet over the vanilla Transformer and FNet in terms of floating-point operations (FLOPs) and model size (number of parameters). Compared with the vanilla Transformer, FABNet achieves around  reduction in FLOPs and  reduction in model size, depending on the target task. Furthermore, compared with FNet, FABNet reduces the FLOPs by  and the model size by .
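The compression in the linear layers comes from the parameter counts of the two layer types; a back-of-the-envelope comparison using the standard counts for a radix-2 butterfly factorization (not figures from the paper):

```python
import math

def dense_params(n: int) -> int:
    """A dense n x n linear layer stores n^2 weights."""
    return n * n

def butterfly_params(n: int) -> int:
    """A butterfly factorization of an n x n matrix uses log2(n) factors,
    each with 2 nonzeros per row, i.e. 2 * n * log2(n) weights in total."""
    return 2 * n * int(math.log2(n))
```

For a hidden dimension of 1024, this gives 1,048,576 versus 20,480 weights, a 51.2x reduction before any FFT savings in the attention part.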
VI-C Effectiveness of Co-design
We evaluate the effectiveness of our co-design approach in finding the optimal algorithm and hardware designs. For demonstration, we use LRA-Text as the target dataset and the VCU128 FPGA as the target device. We select , ,  and  from {, , , , }, {, , }, {, } and {, }, respectively. The hardware parallelism parameters (, ,  and ) are chosen from {, , , , , , }. Figure 18 shows the points in the accuracy-latency design space. The orange line represents the accuracy-loss constraint, which requires less than % loss compared with the vanilla Transformer. The Pareto front is indicated by the brown line, and the remaining blue points represent designs with less optimized software hyperparameters (Figure 16) or hardware design parameters. Among the design points that satisfy the accuracy constraint, we choose the point with the lowest latency on the Pareto front as our point of comparison. Within our design space, the selected point is up to % more accurate than points in the same latency range and up to  faster than points in the same accuracy range, underlining the advantages of our co-design approach. The runtime of the co-design process is around  hours on our GPU server. To obtain the configurations for the rest of the datasets in LRA, we constrain the overall accuracy loss to be less than % compared to the vanilla Transformer. The final models and designs are chosen as the configurations with the highest hardware performance that do not violate the accuracy constraints. Unless mentioned otherwise, the remaining sections report the algorithmic and hardware performance using these optimized configurations.
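The final selection step, lowest latency among Pareto points meeting the accuracy constraint, can be sketched as follows (the tuple layout is an assumption):

```python
def select_design(pareto_points, min_accuracy):
    """Pick the lowest-latency (accuracy, latency) point whose accuracy
    satisfies the constraint; None if the constraint is infeasible."""
    feasible = [p for p in pareto_points if p[0] >= min_accuracy]
    return min(feasible, key=lambda p: p[1], default=None)
```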
ListOps  Text  Retrieval  Image  Pathfinder  Avg.  

Vanilla Transformer  0.373  0.637  0.783  0.379  0.709  0.576 
Vanilla FNet  0.365  0.630  0.779  0.288  0.66  0.544 
FABNet  0.374  0.626  0.801  0.398  0.679  0.576 
VI-D Comparison with Baseline Design
To evaluate the speedup brought by our algorithm (FABNet) and hardware (butterfly accelerator), we use a baseline design for comparison [devlin2018bert]. The baseline hardware uses multiple multiply-accumulate (MAC) units to accelerate the linear transforms and the matrix multiplications between query, key and value vectors. Each MAC is composed of a multiplier array followed by an adder tree. Fine-grained intra- and inter-layer pipelining techniques [song2019hypar, alwani2016fused] are used to optimize the hardware performance. We allocate the parallelism of each MAC unit according to its workload in order to achieve load-balanced execution between different pipeline stages. For a fair comparison, we implement both the baseline and the butterfly accelerator on a VCU128 FPGA using  multipliers. High-bandwidth memory (HBM) is used as the external memory. Both designs are clocked at 200 MHz. We evaluate both the base ( layers) and large ( layers) versions of each model using four different input sequence lengths (, ,  and ).
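The workload-proportional allocation used to balance the baseline's pipeline stages can be sketched as follows (a hypothetical helper, not the paper's code):

```python
def allocate_parallelism(stage_workloads, total_multipliers):
    """Assign each pipeline stage a multiplier budget proportional to its
    MAC workload, so all stages take roughly the same number of cycles."""
    total = sum(stage_workloads)
    return [
        max(1, round(total_multipliers * w / total)) for w in stage_workloads
    ]
```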
A speedup breakdown is shown in Figure 19. To demonstrate the improvement brought by our algorithm, we first evaluate both BERT-Base and FABNet on the baseline design. As the FFT is not supported by the baseline design, we implement the Fourier layers as linear layers by multiplying the input sequences with DFT matrices. Since the operation reduction brought by the algorithm is not fully exploited by the baseline design, FABNet achieves a  speedup compared to BERT. To further evaluate the improvement brought by the hardware optimizations, we evaluate FABNet on our butterfly accelerator, which shows a  speedup when compared to the baseline design. Combining both algorithm and hardware optimizations, the overall speedup of our approach over the baseline design is .
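Realizing a Fourier layer as a dense matrix multiply, which is what the baseline has to do, means applying the n x n DFT matrix at O(n^2) cost instead of the FFT's O(n log n). A NumPy check of the equivalence:

```python
import numpy as np

def dft_matrix(n: int) -> np.ndarray:
    """The dense n x n DFT matrix: entry (j, k) = exp(-2*pi*i*j*k / n)."""
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n)

x = np.random.default_rng(1).standard_normal(8)
via_matmul = dft_matrix(8) @ x   # O(n^2): how the baseline runs a Fourier layer
via_fft = np.fft.fft(x)          # O(n log n): what the butterfly engine exploits
```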
VI-E Comparison with GPU and CPU
We compare our butterfly accelerator against GPUs and CPUs in both edge and server scenarios. In the edge scenario, our butterfly accelerator is implemented on a Xilinx Zynq 7045 FPGA; DDR4 is used as the external memory and  multipliers are used for computation. An Nvidia Jetson Nano GPU and a Raspberry Pi 4 are used as the GPU and CPU platforms, respectively. In the server scenario, the butterfly accelerator is implemented on a Xilinx VCU128 FPGA; HBM is used as the external memory and the design consumes  multipliers. We use Nvidia V100 and TITAN Xp GPUs for comparison, with highly optimized CUDA implementations [dao2020kaleidoscope]. The FPGA designs are clocked at 200 MHz.
We evaluate both FABNet-Base and FABNet-Large using input sequences of length , ,  and . Figure 20 shows the results in terms of speedup and energy efficiency, the latter expressed in Giga operations per second per Watt (GOPS/Watt). In the edge scenario, our design on the Zynq 7045 FPGA achieves a  speedup over the Jetson Nano GPU and a  speedup over the Raspberry Pi 4 (on FABNet-Large with input sequences longer than , the Raspberry Pi 4 suffers from out-of-memory (OOM) issues). At the same time, our design yields  and  higher energy efficiency than the Jetson Nano and the Raspberry Pi 4, respectively. In the server scenario, our design on the VCU128 is up to  and  faster, and up to  and  more energy-efficient, than the V100 and TITAN Xp GPUs, respectively. In summary, the end-to-end speedup and energy-efficiency gains in both edge and server scenarios under different input sequences highlight the scalability of our butterfly accelerator.
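The GOPS/Watt metric used in Figure 20 is simply operations per second, scaled to Giga, divided by power; the numbers below are illustrative, not measurements:

```python
def gops_per_watt(total_ops: float, seconds: float, watts: float) -> float:
    """Energy efficiency: Giga operations per second per Watt."""
    return total_ops / seconds / 1e9 / watts
```

For example, 128 GOP executed in one second at 10 W gives 12.8 GOPS/Watt.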
Platform  # cores  Compiler  Frequency  Technology

CPU   Raspberry Pi 4      4
GPU   Nvidia V100         5,120  PyTorch
      Nvidia TITAN Xp     3,840  1.10.2
      Nvidia Jetson Nano  128
FPGA  Xilinx VCU128       -      Vivado
      Xilinx Zynq 7045    -      2019.2
VI-F Comparison with SOTA Accelerators
Accelerators  [ham2020]  SpAtten [wang2020spatten]  Sanger [sanger201micro]  Energon [zhou2021energon]  ELSA [elsa2021isca]  DOTA [dota2022asplos]  FTRANS [li2020ftrans]  Our work 
(HPCA’20)  (HPCA’21)  (MICRO’21)  (TCAD’21)  (ISCA’21)  (ASPLOS’22)  (ISLPED’20)  
Technology  ASIC (40nm)  ASIC (40nm)  ASIC (55nm)  ASIC (45nm)  ASIC (40nm)  ASIC (22nm)  FPGA (16nm)  FPGA (16nm) 
Frequency  1 GHz  170 MHz  200 MHz  
# of Multipliers  128  6531  640  
Latency (ms)  56.0  48.8  45.2  44.2  34.7  34.1  61.6  2.4 
Throughput (Pred./s)  17.86  20.49  22.12  22.62  28.82  29.32  16.23  416.66 
Power (W)  1.217  1.060  0.801  2.633  0.976  0.858  25.130  11.355 
Energy Eff. (Pred./J)  14.67  19.33  27.62  8.59  29.52  34.18  0.65  36.69 
Table V compares our butterfly accelerator with existing state-of-the-art (SOTA) accelerators in terms of speed and power consumption. Instead of comparing the effective throughput [wang2020spatten, sanger201micro], we use the end-to-end latency to represent the actual execution speed of the hardware. The energy efficiency is represented by the number of predictions per Joule (Pred./J). Following the experimental setting of [dota2022asplos], we compare against all other SOTA accelerators on the LRA-Image dataset with a one-layer vanilla Transformer. Among these accelerators, only SpAtten [wang2020spatten] and DOTA [dota2022asplos] report end-to-end performance. For the rest, which only support attention, we estimate their end-to-end performance by reusing their available multipliers to accelerate the FFN. Furthermore, in both [wang2020spatten] and [sanger201micro], the authors compare different ASIC and FPGA designs under the assumption that all the ASIC designs are clocked at 1 GHz with 128 multipliers. For a fair comparison, we follow the same assumption in our experiments. For designs with more than 128 multipliers, we follow the scaling approach of [wang2020spatten, sanger201micro] and linearly scale down their throughput to obtain their end-to-end performance. For instance, DOTA [dota2022asplos] (we assume their design is compute-bound) achieves a  speedup over an Nvidia V100 using  multipliers with  TOPS throughput; we scale down its throughput by , which leads to a  speedup over the V100. To obtain the power consumption, we use the same linear scaling approach. For instance, Sanger [sanger201micro] reports the power consumption of a design with  multipliers; we divide the power consumption of their systolic array ( mW) by , which leads to  mW. Together with the power of other modules such as preprocessing and memory, their total power consumption is  W. To match the computational capacity of the ASIC designs, we use 640 DSPs on the VCU128 FPGA.
As our FPGA-based design is clocked at 200 MHz, this ensures that we have the same 200 MHz x 640 = 128 GOPS theoretical peak performance as the ASIC designs (1 GHz x 128 = 128 GOPS). While this is a simple approximation, it allows us to compare different hardware architectures regardless of their underlying target platforms.
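The matched-capacity arithmetic works out as follows, taking peak throughput as clock frequency times multiplier count, as in the text:

```python
# Our FPGA design: 200 MHz with 640 DSP multipliers.
fpga_peak_ops = 200e6 * 640
# Scaled ASIC baseline (per [wang2020spatten, sanger201micro]): 1 GHz, 128 multipliers.
asic_peak_ops = 1e9 * 128
# Both come to 128 GOPS, so the comparison is capacity-matched.
```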
As shown in Table V, our butterfly accelerator achieves a  speedup over the FPGA-based FTRANS [li2020ftrans] while using nearly  fewer DSPs, and at the same time achieves  higher energy efficiency. Compared with the ASIC designs, our accelerator achieves a  speedup under the same computational capacity. Although our FPGA-based butterfly design consumes more power than the ASIC designs, it yields  higher energy efficiency than the other SOTA ASIC accelerators. We expect further speedup and energy-efficiency improvements when our design is implemented as an ASIC.
We attribute the performance gain of our approach over the ASIC designs to three main factors: 1) the use of FFT and butterfly factorization, which significantly reduces the computational complexity at the algorithmic level; 2) the adaptable butterfly design, which adopts a single unified hardware engine to accelerate both the FFT and the butterfly linear transformation, significantly improving hardware efficiency; and 3) the co-design process, which jointly optimizes both algorithm and hardware parameters.
VI-G Off-Chip Memory Bandwidth Analysis
To investigate the sensitivity of our design to off-chip memory bandwidth, we vary the bandwidth across , , , ,  and  GB/s, and evaluate the resulting latency using our performance model. For these experiments, we use five different designs with , ,  and  BEs executing FABNet-Large with  layers. To understand the bandwidth requirements under both short and long input lengths, we evaluate each design using three input sequence lengths (,  and ). The results are shown in Figure 21. For a small-scale design with  BEs, a bandwidth of  GB/s is enough for the design to reach its peak performance across the different input sequences. For the largest design with  BEs, the achieved performance saturates once the bandwidth reaches  GB/s.
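The saturation behavior in Figure 21 follows a roofline-style argument: latency is bound by the larger of compute time and transfer time, so raising bandwidth past the compute bound buys nothing. A minimal sketch of such a model, with all numbers illustrative:

```python
def layer_latency(flop_count, bytes_moved, peak_flops, bandwidth):
    """Roofline-style estimate: the slower of compute and memory transfer
    determines the latency (assuming the two overlap perfectly)."""
    return max(flop_count / peak_flops, bytes_moved / bandwidth)
```

Smaller designs have lower peak compute, hence longer compute time, which is why they saturate at a lower bandwidth than the largest design.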
VI-H Power and Resource Analysis
Table VI shows the power consumption breakdown (power of I/O is not included as it occupies less than % of the total power) based on the report generated by the Vivado XPE tool. We implement two designs with 120 BEs (BE120) and 40 BEs (BE40) on a VCU128 FPGA, which were used in Section VI-E and Section VI-F, respectively. In both designs, the dynamic power accounts for more than % of the total power consumption. The memory resources, including both BRAM and HBM, consume more than % of the dynamic power. Furthermore, when the number of BEs scales from 40 to 120, the power of clocking, logic & signal and DSPs increases from 2.668 W, 2.381 W and 0.338 W to 6.882 W, 7.732 W and 1.437 W, respectively.
Table VII presents the resource consumption of both the BE40 and BE120 designs on the same VCU128 FPGA. Due to the use of FFT and butterfly matrices, our FABNet is less memory-intensive than vanilla attention-based NNs. Since the theoretical memory bandwidth of a single HBM ( GB/s) already satisfies the requirement of our accelerator (Section VI-G), we use one HBM in both designs to reduce resource and power consumption. When the number of BEs decreases from 120 to 40, the BRAM usage is reduced from 978 to 338. A similar reduction can be observed in the LUT and register resources.
Dynamic (W)  Static (W)

Design  Clocking  Logic & Signal  DSP  Memory (BRAM & HBM)

BE40   used  2.668  2.381  0.338  5.325  3.368
       pct.  18.8%  16.7%  2.3%   37.5%  23.7%
BE120  used  6.882  7.732  1.437  6.142  3.665
       pct.  26.4%  29.7%  5.5%   23.6%  14.1%
LUTs  Registers  DSP48s  BRAMs  HBMs  
Available  1,303,680  2,607,360  9,024  2,016  2  
BE40  used  358,609  536,810  640  338  1 
pct.  27.5%  20.6%  7.1%  16.8%  50.0%  
BE120  used  1,034,610  1,648,695  2,880  978  1 
pct.  79.3%  63.2%  31.9%  48.5%  50.0% 
VII Related Work
Efficient Approaches for Attention. As the algorithmic complexity of the self-attention mechanism scales quadratically with the input sequence length, many sparse variants have been introduced to approximate attention-based NNs [tay2020efficient]. The sparsity patterns in these approaches are determined either dynamically [wang2020linformer, zaheer2020big, choromanski2020rethinking, tay2020sparse] or statically [beltagy2020longformer, lee2021fnet, chen2021pixelated]. Although these methods achieve high compression rates in the number of operations and parameters, the hardware cost and efficiency of mapping them onto real hardware designs are not considered in these works.
Domain-Specific Accelerators for Attention-based NNs. To better utilize existing efficient attention-based algorithmic approaches on hardware, various domain-specific hardware designs have been introduced. Ham et al. [ham2020] propose a hardware architecture called , which dynamically prunes entries based on their softmax importance. By leveraging sparsity at both the head and token levels, Wang et al. [wang2020spatten] propose SpAtten, which dynamically prunes entire rows and columns from the attention matrix. EdgeBERT [edgebert2021micro] explores layer sparsity via an entropy-based early-exit approach, which significantly reduces the computation and memory footprint. DOTA [dota2022asplos] uses low-rank linear transformations to detect and omit the weak connections in attention. ELSA [elsa2021isca] approximates the attention mechanism using a sign random projection approach. To further exploit the sparsity in attention-based NNs, Energon [zhou2021energon] adopts a low-precision NN to predict the sparsity of the attention matrix. However, the sparsity patterns generated by these approaches are unstructured, which may lead to hardware inefficiency. Sanger [sanger201micro] proposes pack-and-split modules that distribute the nonzero computation to each computation engine, achieving load-balanced execution. Although these accelerators achieve notable speedups over GPUs and CPUs when executing the attention mechanism, their end-to-end hardware performance is limited, as the approximation and acceleration of the FFN part are not considered in their designs.
Comparison to Previous Work.
As spatial-domain convolution corresponds to frequency-domain multiplication, various FFT-based hardware accelerators have been introduced to accelerate CNNs
[zhang2017frequency, abtahi2018fft], LSTMs [wang2018c, li2019rnn] and attention-based NNs [li2020ftrans]. However, these designs only use the FFT for domain transfer, while the main computation is still performed in a separate processing engine. In contrast, this paper adopts butterfly sparsity, a generalized pattern of the FFT, for the main computation of the network. Based on the proposed method, all the computations are performed in a single unified butterfly accelerator, resulting in higher hardware efficiency than previous designs.

Although butterfly sparsity has been explored in recent literature, most approaches only focus on algorithm-level optimizations. Dao et al. [dao2020kaleidoscope] demonstrate the potential of the butterfly matrix in approximating linear transformations, but its efficiency on attention matrices is not explored in their work. On the other hand, Pixelated Butterfly [chen2021pixelated] and Sparse Transformer [child2019generating] focus on adopting the butterfly pattern for attention matrices, neglecting the linear layers. Moreover, their designs require multiple sparsity patterns to compensate for the accuracy loss, which significantly complicates the hardware design. Lee-Thorp et al. [lee2021fnet] show the effectiveness of Fourier transforms in accelerating attention layers, but do not consider the linear layers, leading to scalability issues (Section II-C). Different from all these efforts, this paper exploits butterfly sparsity for both attention and linear layers via an algorithm-hardware co-design approach, in which a novel butterfly-based algorithm and an adaptable hardware accelerator are jointly designed to overcome existing limitations and push the performance limit.
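To make the "generalized pattern of FFT" concrete: each radix-2 butterfly factor mixes entry i with entry i XOR stride, so an n x n matrix is represented by log2(n) factors with 2n nonzeros each, and their product still (generically) has full support. A small NumPy sketch with random weights; the FFT is the special case where the weights are twiddle factors:

```python
import numpy as np

def butterfly_factor(n: int, stride: int, rng) -> np.ndarray:
    """One random radix-2 butterfly factor: row i is nonzero only at
    columns i and i ^ stride (exactly two nonzeros per row)."""
    B = np.zeros((n, n))
    for i in range(n):
        B[i, i] = rng.standard_normal()
        B[i, i ^ stride] = rng.standard_normal()
    return B

rng = np.random.default_rng(0)
n = 8
factors = [butterfly_factor(n, 2 ** k, rng) for k in range(3)]  # strides 1, 2, 4
product_matrix = factors[2] @ factors[1] @ factors[0]
```

Each factor holds 2n = 16 weights, 48 in total versus 64 for a dense 8 x 8 matrix, yet the product can express an all-to-all mixing of the inputs.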
VIII Conclusion
This paper proposes the end-to-end acceleration of attention-based NNs via algorithm and hardware co-design. On the algorithmic level, we propose FABNet, a hardware-friendly attention-based NN in which both the attention and linear layers are compressed using a unified butterfly sparsity pattern, allowing for scalable end-to-end acceleration. On the hardware level, we propose an adaptable butterfly accelerator that can be configured at runtime to accelerate different layers on a single unified hardware engine, achieving high hardware efficiency. Both algorithm and hardware design parameters are jointly optimized to push the performance limit. Our experiments demonstrate that our co-design approach yields up to  speedup over state-of-the-art accelerators. Furthermore, our design achieves up to  higher energy efficiency compared to optimized GPU implementations.
Acknowledgement
The support of the UK EPSRC (grant numbers EP/V028251/1, EP/L016796/1, EP/S030069/1 and EP/N031768/1), AMD and Intel is gratefully acknowledged. We also thank Alexander Mathiasen for insightful discussions on model compression.
Appendix A Artifact Appendix
A-A Abstract
This appendix summarizes the necessary information and instructions to evaluate our artifacts. The functionality of our hardware accelerator can be evaluated by running the Verilog HDL designs and SystemVerilog testbenches in the Vivado design suite. The accuracy results can be obtained by running our PyTorch programs and the associated Bash scripts. The power and resource utilization can be obtained by running synthesis and implementation using our RTL code and constraint files. The latency can be obtained by running our custom Python-based performance model. We also provide all our training log files and Vivado design reports at: https://drive.google.com/drive/folders/1jaR8gDXzO1Hu83xFg_IJOwRgoBMPnjY?usp=sharing.
A-B Artifact checklist (meta-information)

Algorithm: FABNet, an efficient model that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.

Program: Python, PyTorch, Verilog HDL

Model: We evaluate three models for comparison, including Transformer, FNet and FABNet.

Data set: the Long-Range Arena (LRA) dataset, a well-known long-sequence natural language processing (NLP) benchmark. The zip file can be downloaded from: https://storage.googleapis.com/longrangearena/lra_release.gz. The required disk space of the unzipped files is around 33 GB.

Runtime environment: Ubuntu 20.04, CUDA SDK 11.3 or higher.

Hardware: Nvidia V100 GPU, Nvidia TITAN Xp GPU, Nvidia Jetson Nano GPU, Intel Xeon Gold 6154 CPU, Raspberry Pi 4.

Metrics: Accuracy, simulated latency, resource and power consumption.

Experiments: Bash scripts and detailed instructions are provided to run experiments.

How much disk space is required (approximately)?

How much time is needed to prepare the workflow (approximately)?  hours.

How much time is needed to complete experiments (approximately)? Accuracy results: hundreds of GPU hours to obtain. Power and resource consumption: around 70 hours. Functionality of Verilog design: around 5 hours.

Publicly available? Yes.

Code licenses (if publicly available)? Yes.

Archived (provide DOI)? We will update this later.
A-C Description
A-C1 How to access
You can access our codebase from the link: https://zenodo.org/record/7010800#.YwQKCOzMJhF or https://github.com/oshxfan/Butterfly_Acc.git.
A-C2 Hardware dependencies
A GPU server is required to train our models. A CPU server is needed to run simulation, synthesis and place-and-route. Different GPUs and CPUs, such as the Nvidia Jetson Nano GPU and Raspberry Pi 4, are also required to evaluate the hardware performance of the different models.
A-C3 Software dependencies
Vivado Design Suite , PyTorch , CUDA SDK or higher, Python or higher. Other dependencies are listed in requirements.txt.
A-C4 Data sets
Five tasks in the LRA dataset: ListOPs for hierarchical data classification, Text for byte-level text classification, Retrieval for byte-level document retrieval, Image for image classification on sequences of pixels, and Pathfinder for classification of long-range spatial dependency.
A-D Installation
We provide a detailed installation guide in the README.md of the root directory.
A-E Experiment workflow
To evaluate the functionality of our hardware, perform the following steps:

Follow the experimental-setup instructions in the root directory to install the software dependencies. Install Vivado 2019.2.

Generate the test data using our Python programs.

Create a Vivado project for our hardware design.

Import all the Verilog source code and System Verilog testbenches.

Include all the necessary IPs from Vivado IP library.
We provide Vivado Tcl scripts and step-by-step instructions in ./hardware/npu_design/verilog/README.md to automate the whole process.
To reproduce the algorithmic and hardware performance, we provide all the scripts under the directory ./script_figs to generate figures and tables. The detailed instructions are provided in ./script_figs/README.md.
A-F Evaluation and expected results
We provide scripts under ./script_figs to generate all the figures and tables related to accuracy and hardware performance, including power consumption, resource utilization and simulated latency. As running all the experiments requires a few hundred GPU/CPU hours, to facilitate the artifact evaluation we refer to the following key results, which can be obtained within a reasonable time:

Vivado simulation to run different layers, such as the fast Fourier transform, butterfly matrix multiplication and layer normalization, on our Verilog hardware design. We provide SystemVerilog testbenches under ./hardware/npu_design/verilog/functionality/testbench/, and a detailed workflow in the first paragraph of Section A-E.

Power breakdown in Table VI and resource utilization in Table VII. We provide detailed instructions and Vivado Tcl scripts under ./hardware/npu_design/verilog/ to run synthesis and place&route on both VCU128 and Zynq 7045 FPGAs.
Although it takes longer to run other experiments, all the results are reproducible using our provided scripts. We provide all the GPU training log files and Vivado design reports in the link: https://drive.google.com/drive/folders/1jaR8gDXzO1Hu83xFg_IJOwRgoBMPnjY?usp=sharing.
A-G Methodology
Submission, reviewing and badging methodology: