
Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design

by   Hongxiang Fan, et al.
University of Cambridge
Cornell University
Imperial College London

Attention-based neural networks have become pervasive in many AI tasks. Despite their excellent algorithmic performance, the use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources, which often compromises their hardware performance. Although various sparse variants have been introduced, most approaches only focus on mitigating the quadratic scaling of attention on the algorithm level, without explicitly considering the efficiency of mapping their methods onto real hardware designs. Furthermore, most efforts focus on either the attention mechanism or the FFNs without jointly optimizing both parts, causing most of the current designs to lack scalability when dealing with different input lengths. This paper systematically considers the sparsity patterns in different variants from a hardware perspective. On the algorithmic level, we propose FABNet, a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs. On the hardware level, a novel adaptable butterfly accelerator is proposed that can be configured at runtime via dedicated hardware control to accelerate different butterfly layers using a single unified hardware engine. On the Long-Range-Arena dataset, FABNet achieves the same accuracy as the vanilla Transformer while reducing the amount of computation by 10 to 66 times and the number of parameters by 2 to 22 times. By jointly optimizing the algorithm and hardware, our FPGA-based butterfly accelerator achieves 14.2 to 23.2 times speedup over state-of-the-art accelerators normalized to the same computational budget. Compared with optimized CPU and GPU designs on Raspberry Pi 4 and Jetson Nano, our system is up to 273.8 and 15.1 times faster under the same power budget.



I Introduction

Recent years have witnessed a great success of attention-based neural networks (NNs) on many AI tasks [brauwers2021general]. The attention mechanism [vaswani2017attention], which captures long-range information from sequences of data, has demonstrated excellent algorithmic performance in various natural language processing [devlin2018bert, radford2019language] and computer vision [dosovitskiy2020image] applications. However, the advances of attention-based NNs come at a cost: the use of attention and linear layers significantly increases the computational load, resulting in a large overhead on their speed and power consumption [wang2020spatten]. Figure 1 shows an operation breakdown on four mainstream attention-based models. For short input sequences, linear layers dominate the operation count. As the input sequence lengthens, the computation is gradually dominated by the attention layer. Since both attention and linear layers are memory- and compute-intensive, it is challenging to achieve high hardware performance on attention-based NNs across input sequences of various lengths.

Fig. 1: FLOPs percentage of attention and linear layers with different input sequence length.
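The shift from linear-dominated to attention-dominated computation can be illustrated with a rough operation count. The sketch below is not taken from the paper: it assumes a standard encoder layer with four projection layers, a two-layer FFN with expand ratio 4, and a hidden size of 768, and it ignores softmax, layer normalization and other minor operations. The function name and parameters are illustrative.

```python
def attention_flop_share(seq_len: int, hidden: int, ffn_ratio: int = 4) -> float:
    """Fraction of per-layer multiply-accumulates spent in the attention products."""
    # Linear layers: Q, K, V and output projections (4 * L * d^2)
    # plus the two FFN layers (2 * r * L * d^2).
    linear_macs = (4 + 2 * ffn_ratio) * seq_len * hidden * hidden
    # Attention: Q K^T and S V each cost L^2 * d.
    attention_macs = 2 * seq_len * seq_len * hidden
    return attention_macs / (attention_macs + linear_macs)


# Assumed hidden size of 768; the sequence lengths are arbitrary examples.
for seq_len in (128, 1024, 8192):
    share = attention_flop_share(seq_len, hidden=768)
    print(f"L={seq_len:5d}: attention {share:6.1%}, linear {1 - share:6.1%}")
```

Under these assumptions the attention share simplifies to L / (L + 6d), so attention only overtakes the linear layers once the sequence length exceeds roughly six times the hidden size.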

So far, various approaches and designs have been introduced to accelerate attention-based DNNs. On the algorithmic level, several efficient sparse variants have attempted to reduce the computational complexity [choromanski2020rethinking, wang2020linformer, kitaev2020reformer, tay2020sparse, beltagy2020longformer, zaheer2020big, child2019generating]. However, most of these approaches only focus on reducing the number of parameters and operations without considering the real hardware performance, such as end-to-end latency. Furthermore, the hardware efficiency of implementing these sparsity patterns on real hardware designs is often overlooked. On the hardware level, although various highly-optimized accelerators (Table I) have been proposed [ham2020, wang2020spatten, sanger201micro, zhou2021energon, elsa2021isca, dota2022asplos, li2020ftrans, edgebert2021micro], several issues still remain unresolved:


  • Most current accelerators only focus on optimizing either FFNs [li2020ftrans] or the attention mechanism [ham2020, wang2020spatten, sanger201micro, zhou2021energon, elsa2021isca, dota2022asplos]. Without jointly optimizing both parts, these hardware designs lack scalability when accelerating end-to-end attention-based NNs with different input lengths.

  • While optimizing the attention mechanism, most of the existing designs dynamically detect and prune the redundant computation at runtime to achieve high sparsity on specific datasets and networks. However, the generality of these dynamic approaches needs to be further tested as their performance gain may vary across different datasets and network architectures.

  • The sparsity patterns introduced by these dynamic approaches are often unstructured, requiring dynamic hardware controllers to exploit sparsity. Such complicated controllers often contain larger numbers of clocking elements, and their hardware overhead increases as the transistor size reduces [jang2021sparsity]. As such, the performance or energy gain of these dynamic methods may be diminished.

To address the aforementioned issues, this paper adopts butterfly sparsity to accelerate attention-based models with three novel aspects (Table I): i) fine-grained structured regularity, which yields regular data accesses that improve both memory and compute efficiency; ii) a static sparsity pattern, which avoids the need to design a dynamic controller in hardware; iii) sparsity exploitation on both attention and linear layers, which allows scalable end-to-end acceleration of attention-based NNs. We therefore propose FABNet, a hardware-friendly model for FFT, Attention and Butterfly-Net. To fully exploit the sparsity in hardware, we propose an adaptable butterfly accelerator that can be configured at runtime via dedicated hardware control to accelerate different layers using a single unified engine, significantly improving hardware efficiency. To push the performance limit, we jointly optimize the model and hardware via a co-design approach. Overall, this work makes the following contributions:


  • A hardware-friendly attention-based model, FABNet, that adopts the butterfly sparsity pattern on both attention and linear layers for end-to-end acceleration (Section III).

  • A novel adaptable butterfly accelerator configurable at runtime via dedicated hardware control to accelerate different layers using a single unified engine (Section IV).

  • Several hardware optimizations to improve the hardware efficiency and a co-design approach to jointly optimize both algorithmic and hardware parameters (Section V).

  • A comprehensive evaluation on different datasets that demonstrates the advantages of our approach over CPU, GPU and state-of-the-art accelerators (Section VI).

| Accelerators | Pattern Regularity | Sparsity Pattern | Sparsity Location |
|---|---|---|---|
| [ham2020] | unstructured | dynamic | attention |
| SpAtten [wang2020spatten] | coarse-grained structured | dynamic | attention |
| Sanger [sanger201micro] | load-balanced unstructured | dynamic | attention |
| Energon [zhou2021energon] | unstructured | dynamic | attention |
| ELSA [elsa2021isca] | unstructured | dynamic | attention |
| DOTA [dota2022asplos] | unstructured | dynamic | attention |
| FTRANS [li2020ftrans] | None | static | FFN |
| EdgeBERT [edgebert2021micro] | None | dynamic | layer |
| Our work | fine-grained structured | static | attention & FFN |

TABLE I: Comparison of existing accelerators for attention-based NNs in terms of sparsity regularity, pattern and location.

II Background and Motivation

II-A Attention-Based Neural Networks

Based on their network structure, attention-based NNs can be classified into three categories: i) encoder-decoder, ii) encoder-only, and iii) decoder-only networks. The encoder-decoder NNs are mainly designed for sequence-to-sequence tasks, such as machine translation [vaswani2017attention]. One of the most widely used encoder-decoder networks is the Transformer, which is constructed from a stack of encoder and decoder blocks. Figure 2 illustrates the structure, which is parameterized by the input length, hidden size and FFN expand ratio. Each encoder starts with a multi-head attention module, followed by a feed-forward network (FFN) consisting of two linear (fully-connected) layers. Finally, residual addition [he2016deep] and layer normalization (LN) [ba2016layer] are used after FFN. Within each multi-head attention, the inputs are first mapped to query ($Q$), key ($K$) and value ($V$) matrices through three different linear layers. The query matrix is then multiplied with $K^T$, followed by a softmax operation to get the score matrix ($S$). The generated $S$ is multiplied with $V$, and the resultant matrix flows into another linear layer, which generates the final output matrix of the multi-head attention. Similar to the encoder, the decoder employs two multi-head attention modules and one FFN, where the difference is that the inputs of the key and value matrices in the second attention module come from the last encoder.
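The attention dataflow described above can be sketched for a single head in NumPy. This is a minimal illustration rather than the paper's implementation: the weight names are hypothetical, the multi-head split is omitted, and the scaling by the square root of the head dimension follows the scaled dot-product attention of [vaswani2017attention], which the text above does not spell out.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def single_head_attention(x, w_q, w_k, w_v, w_o):
    """Map inputs to Q, K, V, form the score matrix S, then project S V."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # three linear layers
    s = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # score matrix, one row per query
    return (s @ v) @ w_o                         # final output linear layer


rng = np.random.default_rng(0)
seq_len, hidden = 8, 16                          # illustrative sizes
x = rng.standard_normal((seq_len, hidden))
w_q, w_k, w_v, w_o = (rng.standard_normal((hidden, hidden)) for _ in range(4))
out = single_head_attention(x, w_q, w_k, w_v, w_o)
print(out.shape)
```

The score matrix has one row and one column per input token, which is the source of the quadratic cost in sequence length discussed in the following sections.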

Fig. 2: The structure of a Transformer. Shortcut addition and layer normalization are omitted for simplicity.

Based on the original encoder-decoder structure of Transformer, different variants have been proposed. The encoder-only networks, such as BERT [devlin2018bert] and XLM [lample2019cross], are autoencoding models that have been widely applied to NLP tasks, such as sequence classification [wang2018glue]. The Vision Transformer (ViT) [dosovitskiy2020image] also lies in this category. An extra linear projection layer is introduced at the beginning, while its encoder layers correspond to the encoder part of the original Transformer. Finally, the decoder-only networks represent the autoregressive models designed for NLP tasks, such as language modeling [ma2019tensorized]. GPT [radford2019language] is a typical decoder-only model that corresponds to the decoder part of the original Transformer. Although we focus on encoder-only networks in this work, our hardware design is flexible and applicable to decoders too.

II-B Butterfly Matrices and FFT

Despite the impressive accuracy attained using attention-based NNs, these models are expensive and not scalable, e.g. the self-attention mechanism in the Transformer scales quadratically in terms of computation and memory as a function of the input sequence length. As a result, numerous works [choromanski2020rethinking, wang2020linformer, child2019generating, chen2021scatterbrain] adopt structured linear mappings, such as sparse and low-rank matrices, to approximate the attention matrices and/or the weight matrices in the feed-forward layers. Choosing an appropriate structure for each linear mapping, however, is application-dependent, often requiring domain expertise and entailing an arduous process of hand-picking solutions as different structures have different trade-offs in accuracy and speed.

To counteract this, recent work has utilized butterfly matrices [parker1995random, dao2019learning], which are universal representations of structured matrices that have a simple recursive structure. Specifically, following the notation of [dao2019learning], each butterfly matrix $\mathbf{B}^{(N)}$ of size $N \times N$ encodes the recursive divide-and-conquer structure with butterfly patterns and hence can be expressed as the product of $\log_2 N$ sparse butterfly factor matrices [de2018two] as follows:

$$\mathbf{B}^{(N)} = \mathbf{B}_N^{(N)} \, \mathbf{B}_{N/2}^{(N)} \cdots \mathbf{B}_2^{(N)},$$

where each $\mathbf{B}_k^{(N)}$ is a block-diagonal matrix of $N/k$ blocks, and each block $\mathbf{B}_k$, a butterfly factor, is a $2 \times 2$ block matrix of diagonal matrices $\mathbf{D}_i$, with size $k/2 \times k/2$, whose entries can be trained via gradient-based methods:

$$\mathbf{B}_k = \begin{bmatrix} \mathbf{D}_1 & \mathbf{D}_2 \\ \mathbf{D}_3 & \mathbf{D}_4 \end{bmatrix}.$$

Due to their expressiveness in representing structured matrices and approximating unstructured data, butterfly matrices and their variants [chen2021pixelated, dao2020kaleidoscope] have found success in compressing attention and weight matrices, considerably improving the accuracy and efficiency of attention-based NNs. For instance, applying butterfly factorization to a linear layer with an $N \times N$ weight matrix can reduce the computational and memory complexity from $O(N^2)$ to $O(N \log N)$.
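The complexity reduction can be checked numerically. The following NumPy sketch applies a product of butterfly factor matrices to a vector one factor at a time and compares the result against the equivalent dense matrix. It is an illustration under assumptions: the storage format (four diagonals per factor, held as a `(4, N/2)` array) and the function names are hypothetical, and the input length is assumed to be a power of two.

```python
import numpy as np


def apply_factor(x, diags, k):
    """Multiply x by one butterfly factor matrix with block size k in O(N)."""
    n = x.shape[0]
    d1, d2, d3, d4 = (d.reshape(n // k, k // 2) for d in diags)
    blocks = x.reshape(n // k, 2, k // 2)        # (block, top/bottom half, offset)
    top, bottom = blocks[:, 0, :], blocks[:, 1, :]
    out = np.stack([d1 * top + d2 * bottom, d3 * top + d4 * bottom], axis=1)
    return out.reshape(n)


def dense_factor(diags, k, n):
    """Materialise the same factor as an N x N matrix (for checking only)."""
    d1, d2, d3, d4 = (d.reshape(n // k, k // 2) for d in diags)
    m = np.zeros((n, n))
    h = k // 2
    for j in range(n // k):
        o = j * k
        m[o:o + h, o:o + h] = np.diag(d1[j])
        m[o:o + h, o + h:o + k] = np.diag(d2[j])
        m[o + h:o + k, o:o + h] = np.diag(d3[j])
        m[o + h:o + k, o + h:o + k] = np.diag(d4[j])
    return m


rng = np.random.default_rng(0)
n = 16
sizes = [2 ** s for s in range(1, int(np.log2(n)) + 1)]   # block sizes 2, 4, ..., N
factors = {k: rng.standard_normal((4, n // 2)) for k in sizes}

x = rng.standard_normal(n)
y = x
for k in sizes:                                  # smallest block size is applied first
    y = apply_factor(y, factors[k], k)

dense = np.eye(n)
for k in sizes:
    dense = dense_factor(factors[k], k, n) @ dense
assert np.allclose(y, dense @ x)

num_params = sum(f.size for f in factors.values())
print(num_params, "butterfly parameters vs", n * n, "dense parameters")
```

Each factor stores 2N values and costs O(N) operations to apply, so the log2(N) factors together give the O(N log N) parameter and operation count, compared with N^2 for the dense matrix.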

Besides attention and weight matrices, some designs have explored replacing the entire attention mechanism with more efficient counterparts [tolstikhin2021mlp]. A prominent example is FNet [lee2021fnet], in which the self-attention modules are replaced by 2D Discrete Fourier Transform (DFT) operations. Specifically, for each input, 1D DFT is applied along the sequence and the hidden dimension independently, keeping only the real component of the resulting outputs. To reduce DFT computation time, the Cooley-Tukey Fast Fourier Transform (FFT) algorithm [cooley1965algorithm] is used. As the use of DFT facilitates information flow across all embeddings, it results in a similar performance compared to the use of vanilla self-attention layers, but at a significant reduction in latency and memory.
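The Fourier mixing step described above is compact enough to sketch directly with NumPy's FFT routines. The function name is illustrative, and the input is assumed to be a single sequence of shape (sequence length, hidden size).

```python
import numpy as np


def fourier_mixing(x):
    """Apply a 1D DFT along the hidden axis, then along the sequence axis; keep the real part."""
    return np.fft.fft(np.fft.fft(x, axis=-1), axis=-2).real


x = np.random.default_rng(0).standard_normal((8, 16))    # (sequence, hidden), illustrative
mixed = fourier_mixing(x)
assert mixed.shape == x.shape
assert np.allclose(mixed, np.fft.fft2(x).real)            # same as one 2D DFT
```

The layer has no trainable parameters, which is one reason it is cheaper than attention.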

On the algorithmic front, our proposed FABNet utilizes a mixture of these techniques (FFT and butterfly matrices) to outperform relevant approximation approaches in terms of accuracy. Notably, FFT matrices can be considered a special case of butterfly matrices in which two of the four diagonal blocks of each butterfly factor are identity matrices and the other two act as twiddle factors, so both the FFT and butterfly matrices possess the recursive butterfly structure. Therefore, it is possible to use a unified computational and data access pattern and then devise a single hardware engine to accelerate both FFT and butterfly-based operations with high hardware efficiency.
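The claim that the FFT shares the butterfly structure can be made concrete with a radix-2 decimation-in-time FFT written as a sequence of butterfly stages. The sketch below is illustrative (function names are hypothetical, and the input length is assumed to be a power of two, at least 2). In each stage, the top half of every block is passed through unchanged (identity blocks) while the bottom half is scaled by plus or minus the twiddle factors.

```python
import numpy as np


def bit_reverse_permute(x):
    """Reorder x so that index i moves to the position given by reversing its bits."""
    n = len(x)
    bits = n.bit_length() - 1
    idx = [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]
    return x[idx]


def fft_butterfly(x):
    """Radix-2 FFT expressed as log2(N) butterfly stages with block sizes 2, 4, ..., N."""
    x = bit_reverse_permute(np.asarray(x, dtype=complex))
    n = len(x)
    k = 2
    while k <= n:
        h = k // 2
        twiddle = np.exp(-2j * np.pi * np.arange(h) / k)
        blocks = x.reshape(n // k, 2, h)
        top, bottom = blocks[:, 0, :], blocks[:, 1, :] * twiddle
        # Per block: [[I, W], [I, -W]] applied to (top, bottom).
        x = np.stack([top + bottom, top - bottom], axis=1).reshape(n)
        k *= 2
    return x


signal = np.random.default_rng(0).standard_normal(16)
assert np.allclose(fft_butterfly(signal), np.fft.fft(signal))
```

The access pattern per stage is the same as for a trainable butterfly factor; only the values of the four diagonals differ.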

II-C Latency Breakdown and Motivation

The operation counts in Figure 1 reveal that the computation of attention-based NNs is dominated by different components when the length of input sequences changes. To further investigate the real hardware performance of each subcomponent, we profile the execution time of the BERT-Large model on the Nvidia V100 GPU and Intel Xeon Gold 6154 CPU. Three input sequence lengths are evaluated on both devices, with separate batch sizes for the GPU and the CPU. Figure 3 shows the latency breakdown. We split the latency into three main subcomponents: attention layers, linear layers, and other operations, e.g. layer normalization, residual connections, matrix transformations and IO operations. Notably, on both CPU and GPU, linear layers take up a significant portion of execution time when the input length is small. As the input length becomes larger, the execution time of attention layers increases gradually and becomes dominant. As such, the latency is dominated by different components depending on the length of the input sequence. According to Amdahl's law [amdahl1967validity], to achieve high hardware performance across different input lengths, it is necessary to optimize both attention and linear layers.
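Amdahl's law makes the argument quantitative: accelerating only one component bounds the end-to-end speedup by the share of time spent in everything else. The snippet below is a small sketch; the 30% share and 20x factor are hypothetical numbers chosen for illustration, not measurements from the paper.

```python
def amdahl_speedup(fraction: float, component_speedup: float) -> float:
    """End-to-end speedup when `fraction` of the runtime is accelerated by `component_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)


# Hypothetical case: attention takes 30% of the runtime and is accelerated 20x.
only_attention = amdahl_speedup(0.3, 20.0)
# Upper bound if attention became free while linear layers stay untouched.
upper_bound = 1.0 / (1.0 - 0.3)
print(f"{only_attention:.2f}x achieved, {upper_bound:.2f}x at best")
```

In this hypothetical case the end-to-end gain stays below 1.43x no matter how fast the attention engine is, which is why both attention and linear layers need to be optimized.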

The majority of previous accelerators for attention-based NNs focused on optimizing a single component of the entire model (either attention or FFN as shown in Table I), leading to suboptimal end-to-end performance gains. The execution time of these accelerators is heavily dependent on the input length which varies across different applications, reducing the scalability of these hardware designs and thus narrowing their deployability in real-world scenarios. Naively adopting a combination of previous works on optimizing the linear layers [li2020ftrans] and attention layers [ham2020, wang2020spatten, sanger201micro, zhou2021energon, elsa2021isca, dota2022asplos], however, would result in low hardware efficiency as they adopt different sparsity patterns. As a result, designing an end-to-end accelerator for scalable attention-based NNs remains an open problem. In this work, we address this challenge by adopting an algorithm and hardware co-design approach. On the algorithmic level, a hardware-friendly model called FABNet is proposed, which adopts a unified butterfly sparsity pattern to compress both attention and linear layers. On the hardware level, we propose an adaptable butterfly design that can be configured at runtime to accelerate different layers in FABNet using one unified hardware engine.

Fig. 3: Execution time breakdown of Transformer with different input lengths on GPU and CPU.

III Algorithm Optimization

III-A Computational Analysis of Sparsity Patterns

Various pruning schemes have been proposed to reduce the computational complexity of attention-based NNs, leading to different efficient models [choromanski2020rethinking, wang2020linformer, kitaev2020reformer, tay2020sparse, beltagy2020longformer, zaheer2020big, child2019generating, chen2021pixelated, lee2021fnet, dao2020kaleidoscope, dao2022monarch]. By analysing the computational and data access patterns of these variants, we define five basic sparsity patterns shown in Figure 4: i) low rank, ii) sliding window, iii) butterfly, iv) random, and v) block-wise pattern. As low-rank approximation of an attention matrix requires both sequential row and column reads, but the data are usually stored in only one of row-major or column-major order, the hardware efficiency of low-rank sparsity is inherently diminished. Random sparsity also demonstrates low hardware efficiency due to its random read pattern. Furthermore, we observe that the sparsity in various sparse variants can be expressed as different combinations of the basic sparsity patterns, as summarized in Table II. As some basic sparsity patterns can only capture either long-range global or short-range local information (Figure 4), the rationale behind using multiple sparsity patterns within each variant is mainly to compensate for the underlying accuracy loss. For example, Pixelfly [chen2021pixelated] introduces an additional low-rank sparsity pattern to increase the expressiveness of its flat block-wise butterfly pattern and improve accuracy.

Fig. 4: Basic sparsity patterns in existing variants.
| Model | Sparsity pattern | Att. | FFN | Unified Sparsity | Co-Design |
|---|---|---|---|---|---|
| Performer [choromanski2020rethinking] | Low-Rank (Extra kernels) | | | | |
| Linformer [wang2020linformer] | Low-Rank (Extra kernels) | | | | |
| Reformer [kitaev2020reformer] | Block-wise (Extra kernels) | | | | |
| Sparse Sinkhorn [tay2020sparse] | Block-wise + Random | | | | |
| Longformer [beltagy2020longformer] | Sliding-Window + Low-Rank | | | | |
| BigBird [zaheer2020big] | Random + Sliding-Window + Low-Rank | | | | |
| FNet [lee2021fnet] | Butterfly | | | | |
| Kaleidoscope [dao2020kaleidoscope] | Butterfly | | | | |
| Sparse Trans. [child2019generating] | Low-Rank + Butterfly + Sliding-Window | | | | |
| Pixelfly [chen2021pixelated] | Butterfly + Block-Wise + Low-Rank | | | | |
| Monarch [dao2022monarch] | Butterfly + Block-Wise + Low-Rank | | | | |
| Our work | Butterfly | | | | |

TABLE II: Combination of sparsities in different variants.

Different sparsity patterns exhibit diverse data access patterns, which calls for custom hardware support. However, supporting multiple sparsity patterns may complicate the hardware design. For instance, in order to fully utilize the sparsity in the random pattern, complex dynamic controllers are required to achieve a load-balanced execution on different hardware engines [sanger201micro, geng2020awb]. The extra overhead of such controllers may counteract the improvement brought by skipping sparse operations [jang2021sparsity].

In this work, we aim to find a hardware-friendly sparsity pattern that: 1) has structured data access patterns to simplify the memory design, 2) captures both local and global range information with a single sparsity pattern, and 3) is applicable to both the attention mechanism and FFNs to sustain its performance improvement across both long and short input sequences. To meet these requirements, we adopt the butterfly sparsity as a basis for constructing our efficient algorithm.

Compared to other sparsity patterns, the butterfly sparsity provides a number of favorable properties. As shown in Figure 4, although random sparsity is able to capture both local and global information, it has two drawbacks compared to butterfly sparsity: 1) it requires complicated controllers with excessive hardware overhead [jang2021sparsity], and 2) its performance gain cannot be guaranteed as the sparsity may vary substantially among different datasets and tasks. Compared with random sparsity, the sliding-window pattern is more hardware-friendly. However, Table II shows that it often requires low-rank sparsity to compensate for the accuracy loss, as sliding-window sparsity only captures the local relationship within each window. Moreover, although some variants adopt a single low-rank or block-wise sparsity pattern with satisfactory algorithmic performance, they require extra algorithmic operations and dedicated computational kernels during inference (e.g. the locality-sensitive hashing (LSH) in Reformer [kitaev2020reformer]), resulting in large hardware overhead. In contrast, this paper treats the butterfly sparsity as a promising method due to its regular data access pattern and its ability to capture both global and local information.

III-B Unified Butterfly Pattern for Attention and Linear Layers

Fig. 5: Network structure of FABNet. (SC: shortcut addition, LN: layer normalization)
Fig. 6: Hardware overview of the adaptable butterfly accelerator.

The butterfly pattern has demonstrated its effectiveness and generality in approximating linear transformations [dao2020kaleidoscope]. Furthermore, Lee-Thorp et al. [lee2021fnet] have shown the potential of simplifying the computation by replacing the entire attention layer with Fourier transform, which effectively mixes tokens without explicitly approximating the attention mechanism. To maximize the ability to reduce the computation with acceptable algorithmic performance, we start by proposing two basic building blocks for scalable inference: 1) the Attention Butterfly (ABfly), and 2) Fourier Butterfly (FBfly) blocks.

In the ABfly block, we retain the backbone of the attention module and compress all the linear layers using butterfly factorization. Specifically, the ABfly block starts with three butterfly linear layers to generate the $Q$, $K$ and $V$ matrices. The results are fed into a vanilla multi-head attention layer and another butterfly linear layer to obtain the relationships among different tokens. A butterfly FFN that consists of two butterfly linear layers is placed at the end of the ABfly block for additional processing. To further reduce the amount of computation and number of parameters, we replace the attention module with a 2D Fourier transform layer, implemented using FFT, resulting in a more compute-efficient block called FBfly. The use of FFT effectively mixes different input tokens, which allows the following butterfly FFN to process a longer sequence. More importantly, all computation in the FBfly block, which uses the FFT's twiddle factors and the butterfly linear layers' weights, is performed using a unified butterfly pattern, resulting in higher hardware efficiency over previous works.

Although FBfly is less compute- and memory-intensive than ABfly, the use of the Fourier transform layer may degrade accuracy [lee2021fnet]. To preserve high accuracy, we propose a novel butterfly-based network called FABNet that introduces a hybrid of the ABfly and FBfly blocks, as depicted in Figure 5. FBfly blocks are placed at the beginning, with ABfly blocks stacked on top. With this setup, we expose the numbers of FBfly and ABfly blocks as hyperparameters, enabling a trade-off between algorithmic and hardware performance. To optimize this trade-off, we develop a co-design method (Section V-C) that explores the design space of both neural architecture and hardware design.

IV Hardware Accelerator

IV-A Architecture Overview

Figure 6 shows the proposed hardware accelerator consisting of: a Butterfly Processor (BP), an Attention Processor (AP), a Post-processing Processor (PostP), the off-chip memory, and several on-chip buffers. BP consists of a configurable number of Butterfly Engines (BEs), which are used to accelerate the computations that involve butterfly patterns, including both FFT and butterfly linear transformations. AP contains a configurable number of Attention Engines (AEs), and each AE is composed of one QK unit and one SV unit. The QK unit is designed to implement the softmax and the matrix multiplication between queries and keys. The SV unit receives the outputs from the QK unit and multiplies the results with value vectors to generate the final results of the attention layer. The PostP module is responsible for executing the layer normalization and shortcut (SC) addition. To ease the on-chip memory consumption, the intermediate results between different FFT and butterfly operations are transferred back to the off-chip memory. Although doing so increases the bandwidth requirement, this ensures our accelerator is scalable on hardware platforms with limited on-chip memory. To improve the overall hardware performance, all the on-chip buffers utilize double-buffering to overlap the data transfer with the computation.

Fig. 7: Microarchitecture and dataflow of the adaptable butterfly unit.

IV-B Adaptable Butterfly Engine

Figure 6b shows the hardware architecture of BE. Each BE is mainly composed of a butterfly memory system and a configurable number of adaptable Butterfly Units (BUs). To improve the hardware efficiency and enable the use of a single unified engine, the BE module is designed with a focus on adaptability. As such, it can be configured via programmable multiplexers and de-multiplexers at runtime to execute either an FFT or a butterfly linear transformation.

IV-B1 Adaptable Butterfly Unit

Figure 7a depicts the architecture of the proposed adaptable BU. Each adaptable BU consists of four real-number multipliers and two real-number adders, followed by two complex-number adders. The inputs and twiddle factors of both FFT and butterfly linear transformation are connected to the multipliers, with eight multiplexers used to select the correct inputs for each operation. Two de-multiplexers are placed after the real-number adders to control the output flow.

When performing the butterfly linear transformation (Figure 7b), the twiddle factors are non-symmetric real numbers. Hence, the outputs of each twiddle multiply can be computed as:

$$y_1 = w_1 x_1 + w_2 x_2, \qquad y_2 = w_3 x_1 + w_4 x_2,$$

where $x_1, x_2$ and $w_1, \ldots, w_4$ represent the inputs and twiddle factors, respectively. To perform the butterfly linear transformation, the four multipliers in each BU are configured to execute the four real-number multiplications in the equation above. The inputs and twiddle factors are selected via multiplexers as the operands of the multipliers. At the same time, the results generated from the real-number adders/subtractors are output directly from the de-multiplexers.

For FFT (Figure 7c), since the twiddle factors of FFT are complex and symmetric, it only requires one complex-number multiplication per twiddle multiplication. Thus, by selecting the complex input and the complex twiddle factor, we reuse the four real-number multipliers in each BU to perform the required complex-number multiplication. The de-multiplexers are then configured to output the results to the complex-number adders/subtractors to get the final results. The control signals for the multiplexers and de-multiplexers are set before running each layer. As such, the proposed adaptable BE can be used to accelerate both FFTs and butterfly linear transformations by reusing the multipliers, adders and subtractors.
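The resource sharing between the two modes can be mimicked in software. The sketch below is a behavioural model only, not the paper's RTL: the function name, the `mode` strings and the operand layout are assumptions. It shows that both modes issue exactly four real multiplications, and that only the selection of operands and the routing of the sums differ.

```python
def butterfly_unit(mode, x1, x2, w):
    """Behavioural model of one adaptable butterfly unit with four shared real multipliers."""
    if mode == "linear":
        # Butterfly linear transformation: four independent real weights.
        w1, w2, w3, w4 = w
        p = (w1 * x1, w2 * x2, w3 * x1, w4 * x2)      # four real multiplications
        return p[0] + p[1], p[2] + p[3]               # outputs leave after the real adders
    if mode == "fft":
        # FFT: one complex twiddle multiply, t = w * x2, built from four real products.
        a, b = x2.real, x2.imag
        c, d = w.real, w.imag
        p = (a * c, b * d, a * d, b * c)              # the same four multipliers
        t = complex(p[0] - p[1], p[2] + p[3])         # real subtractor and adder
        return x1 + t, x1 - t                         # complex adder and subtractor
    raise ValueError(f"unknown mode: {mode}")


print(butterfly_unit("linear", 2.0, 3.0, (1.0, 2.0, 3.0, 4.0)))
print(butterfly_unit("fft", 1 + 1j, 2 - 1j, 1j))
```

In hardware, the `mode` argument corresponds to the multiplexer and de-multiplexer control signals that are set once before each layer runs.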

IV-B2 Butterfly Memory System

Our butterfly memory system comprises an input manager, a serial-to-parallel (S2P) module, a parallel-to-serial (P2S) module and butterfly buffers. As shown in Figure 8a, the butterfly pattern requires different data accesses at different stages. The conventional column-major or row-major order will cause bank conflicts while reading the data. For instance, accessing an index pair of the first stage causes a read conflict in the column-major order as shown in Figure 8b, in which each row represents a memory bank. The row-major order suffers from the same issue while reading an index pair in the third stage.

Fig. 8: Bank conflicts in column and row-major orders.
Fig. 9: Data layout and hardware design of S2P.

To avoid such bank conflict, we introduce a custom data layout strategy and implement it using the S2P module shown in Figure 9. We permute each column using a starting position which indicates how many rows the first element in the current column should be shifted down. We define the starting position using the following formula:

For each columns, the starting positions is obtained by shifting one position down, as shown in Figure 9a. The starting positions are generated using a counter, and a bit-count and addition operations (Figure 9b). After packing the serial data together, S2P permutes them based on the starting positions.
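Why a bit-count helps can be seen in a deliberately simplified setting. The sketch below is not the paper's layout: it assumes only two memory banks and a butterfly whose stage s pairs index i with index i XOR 2^s. Flipping one bit of an index always changes the parity of its bit-count, so assigning banks by bit-count parity places the two elements of every pair in different banks at every stage, whereas a plain modulo assignment does not.

```python
def popcount(i: int) -> int:
    """Number of set bits in i."""
    return bin(i).count("1")


def count_conflicts(bank_of, n: int) -> int:
    """Count butterfly pairs (i, i XOR 2^s) whose two elements fall in the same bank."""
    conflicts = 0
    stage = 0
    while (1 << stage) < n:
        for i in range(n):
            j = i ^ (1 << stage)
            if i < j and bank_of(i) == bank_of(j):
                conflicts += 1
        stage += 1
    return conflicts


n = 16                                                   # 16-input butterfly, 4 stages
naive = count_conflicts(lambda i: i % 2, n)              # bank chosen by the lowest index bit
skewed = count_conflicts(lambda i: popcount(i) % 2, n)   # bank chosen by bit-count parity
print(naive, skewed)
```

With the modulo assignment only the first stage is conflict-free; the parity assignment removes the conflicts in all four stages. A layout for more banks and wider parallel reads needs a richer permutation, which is the role of the starting positions in the S2P module.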

Fig. 10: An example of 16-input butterfly.

Figure 10 presents an example with 16 inputs, where the data required by the first and second stage of the butterfly pattern are read from the buffers without bank conflicts. However, as the butterfly units receive data in pairs, an extra pairing is required after the S2P module. An example is the second output column of the first stage in Figure 10b. To pair indices, we design an index coalescing module before the butterfly units (Figure 11). Based on the index of each input, a bit-count and addition operation is used to calculate the corresponding shift position. Then, a crossbar coalesces the index pairs based on the indices and shift positions. To ensure the outputs generated from the butterfly units preserve the original order, a recover module is used before the data is written back.

Fig. 11: Hardware design of Index Coalescing module.

V Optimizations and Co-Design

V-A Memory Sharing in Butterfly Buffers

We employ butterfly buffers to allow the overlap between data transfer and computation. To reduce the memory consumption and improve the hardware efficiency, the butterfly buffers are shared between both FFT and butterfly linear transformation. Nonetheless, as the data width of FFT is twice that of the butterfly linear transformation, different address mapping and overlapping strategies are required.

Fig. 12: Different address mapping strategies.

Figure 12 shows the proposed address mapping strategies for butterfly linear transformation and FFT. Assuming the bitwidth of real numbers is 16 bits, each input buffer is 16-bit wide. While processing butterfly linear transformations, input buffers A and B are used as two independent ping-pong banks with separate read and write ports (top right in Figure 12). In this manner, when input buffer A is used for computation, buffer B can start the input data transfer for the next batch, leading to the overlapping strategy shown in Figure 13a. While processing FFT, since the data include both real and imaginary parts which require 32-bit read and write ports, we concatenate the lower parts of input buffer A and B as the first ping-pong bank for the storage of complex numbers. To improve the hardware efficiency, we further reuse the higher parts of both buffers as the second ping-pong bank. As the computation requires both read and write accesses, we adopt a different overlapping strategy that pipelines the output data transfer only with the input data load of the next batch (Figure 13b). By employing different address mapping and overlapping strategies for FFT and butterfly linear transformation, we maximise the hardware efficiency and performance.

Fig. 13: Different overlapping strategies.

V-B Fine-Grained Pipelining between BP and AP

While executing the ABfly block, BP and AP are in use, performing butterfly linear transformation and attention matrix multiplication, respectively. To further improve performance when executing the ABfly block, we employ fine-grained pipelining between BP and AP.

Fig. 14: Fine-grained pipelining between BP and AP.

Figure 14 illustrates the dataflow of BP and AP. In the naive implementation, the key (), value () and query () matrices are generated sequentially from BP. After , and are computed, AP starts the computation of and . To optimize this process, we reorder the execution sequence of linear layers such that BP computes and at the beginning (Figure 14b). As can be decomposed into multiple vector matrix multiplications that multiply different rows of with the entire matrix , we can actually start the computation of once the first few rows of become available. As such, the in AP can be pipelined with the computation of in BP. At the same time, since is generated from the QK unit in a row-by-row fashion, we can further pipeline the with , as the computation of can start once the first few rows of are generated from the QK unit. Assuming there are and rows in and matrices, it takes and to compute one row in the SV and QK units, respectively. As such, the total latency reduction achieved is compared to the unoptimized non-pipelined implementation.
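The property that makes this pipelining possible is that each output row depends on a single query row. The sketch below checks it on a toy example: the symbols `Q`, `K`, `V`, the row-wise softmax and the omission of any scaling factor are assumptions made for illustration, and the code models only the data dependency, not the timing of BP and AP.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_batch(Q, K, V):
    """Reference order: compute all score rows first, then all output rows."""
    S = [softmax([dot(q, k) for k in K]) for q in Q]
    return [[dot(s, col) for col in zip(*V)] for s in S]


def attention_streamed(q_rows, K, V):
    """Consume query rows one at a time, as they would arrive from an upstream producer."""
    for q in q_rows:
        s = softmax([dot(q, k) for k in K])       # one row of the score matrix
        yield [dot(s, col) for col in zip(*V)]    # one row of the output


Q = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
K = [[1.0, 1.0], [0.0, 1.0]]
V = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
streamed = list(attention_streamed(iter(Q), K, V))
```

Because `K` and `V` are needed in full before the first query row can be processed, producing them first and streaming the query rows afterwards is the order that allows the overlap.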

Fig. 15: Flow of the algorithm-hardware co-design process.

V-C Algorithm and Hardware Co-Design

The overall design space of our end-to-end system is formed by FABNet’s hyperparameters and the butterfly accelerator’s hardware parameters. Specifically, the joint design space consists of: 1) the algorithm parameters, i.e. the hidden size (), the expand ratio of FFN (), the total number of blocks () and the number of ABfly blocks () in FABNet, and 2) the hardware parameters, i.e. the parallelism of BU () and BE () in BP, and the parallelism of the QK () and SV () units in AP.

To assess the trade-off provided by each design point, we need to evaluate its algorithmic performance (e.g. an accuracy metric), its latency and its resource consumption. During search, the algorithmic performance is obtained by training and evaluating FABNet, while the latency is estimated using a custom simulator built for our butterfly accelerator. To verify whether the design can be accommodated by the target FPGA device, we developed an analytical model to estimate the consumption of DSP blocks and on-chip memory (BRAMs). As DSPs are mainly consumed by the multipliers in AP and BP, we formulate their resource usage as:

where the value of reflects the number of multipliers in each BU. The consumption of BRAM is mainly occupied by the shortcut buffer, query buffer, key buffer and different buffers in BU including butterfly buffer and weight buffers, which can be formulated as:

The proposed analytical resource model is only used during the design space exploration stage. At the end of the co-design process, the final performance is obtained by running synthesis and place-&-route on our design with the optimized configurations.

Figure 15 illustrates the proposed co-design approach. Given a target dataset, FPGA device and both algorithmic and hardware performance constraints, we employ exhaustive grid search to traverse the joint design space and find the Pareto-optimal set of algorithmic and hardware parameters. Each individual design point corresponds to a different compression ratio of FABNet and level of parallelism of the butterfly accelerator, and provides different accuracy, latency and resource consumption. The final output is the Pareto front of parameters for both FABNet and our butterfly accelerator that satisfies a given set of constraints.
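The search loop described above can be sketched as an exhaustive grid search followed by Pareto filtering. All names here (`grid_search`, `pareto_front`, `toy_evaluate`, `toy_fits`, and the parameter names `n_blocks` and `parallelism`) are hypothetical; in the actual flow the accuracy comes from training FABNet, the latency from the simulator and the feasibility check from the analytical resource model.

```python
from itertools import product


def pareto_front(points):
    """Keep points that no other point beats in both accuracy and latency."""
    front = []
    for p in points:
        dominated = any(
            q["accuracy"] >= p["accuracy"] and q["latency"] <= p["latency"]
            and (q["accuracy"] > p["accuracy"] or q["latency"] < p["latency"])
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front, key=lambda p: p["latency"])


def grid_search(algo_grid, hw_grid, evaluate, fits_device, min_accuracy):
    """Traverse the joint design space and return the feasible Pareto front."""
    points = []
    for algo_values in product(*algo_grid.values()):
        algo = dict(zip(algo_grid, algo_values))
        for hw_values in product(*hw_grid.values()):
            hw = dict(zip(hw_grid, hw_values))
            if not fits_device(algo, hw):
                continue  # rejected by the resource model
            accuracy, latency = evaluate(algo, hw)
            if accuracy >= min_accuracy:
                points.append({"algo": algo, "hw": hw,
                               "accuracy": accuracy, "latency": latency})
    return pareto_front(points)


# Toy stand-ins (hypothetical) for training, simulation and the resource model.
def toy_evaluate(algo, hw):
    return 0.60 + 0.01 * algo["n_blocks"], 10.0 * algo["n_blocks"] / hw["parallelism"]


def toy_fits(algo, hw):
    return hw["parallelism"] <= 8


front = grid_search({"n_blocks": [1, 2, 4]}, {"parallelism": [4, 8, 16]},
                    toy_evaluate, toy_fits, min_accuracy=0.615)
```

Exhaustive search is practical only while the grids stay small, since every algorithmic point requires a training run.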

VI Evaluation

VI-A Experimental Setup

Benchmarks. To evaluate the algorithmic and hardware performance of our approach on workloads with long sequences, we choose five tasks from Long-Range-Arena [tay2020long]: hierarchical data classification (ListOPs), byte-level text classification (Text), byte-level document retrieval (Retrieval), image classification for sequences of pixels (Image), and classification of long-range spatial dependency (Pathfinder). The input sequences of these datasets range from to .

Software Implementation. We implement the vanilla Transformer [devlin2018bert], FNet [lee2021fnet] and our FABNet models using PyTorch [pytorch]. The pretrained models are obtained from Huggingface [wolf2019huggingface]. The batch size is for both Image and Pathfinder tasks, and for the rest of datasets during training. The learning rate is set to , except for the Image and Pathfinder tasks where we use and respectively. Multiple Nvidia A100 and V100 GPUs are used for training. To use FFT cores on Nvidia GPUs, the PyTorch API “rfft2” is used to implement the FFT operation required in both FNet and FABNet. The high-performance CUDA implementation [dao2020kaleidoscope] of butterfly linear transformation is adopted to accelerate both GPU training and inference. We define two models with different default settings: FABNet-Base (, , , ) and FABNet-Large (, , , ).

Hardware Implementation. We implement our hardware accelerators using Verilog. To evaluate performance in different scenarios, two Xilinx FPGA boards are used in our experiments: VCU128 for cloud/server scenarios and Zynq 7045 for edge/mobile settings. Xilinx Vivado 2019.1 is used for synthesis and implementation. While the maximum clock frequencies of our designs depend on the particular FPGA board and resource consumption, all the FPGA designs are clocked at 200 MHz which is below the maximum. We obtain power consumption values using the Xilinx Power Estimator (XPE) tool and develop a cycle-accurate performance model to evaluate the speed performance, which is cross-validated with our RTL simulation results generated by Vivado. (We also cross-validate the functionality and correctness of our RTL design with the ground-truth results generated from PyTorch; please refer to Appendix A-C for details.) The memory accesses to external memory are also considered. We use 16-bit half-precision floating-point in our hardware designs. We deploy four multipliers in each BU. As the hidden dimension is usually at most , we set the depth of butterfly, query and key buffers as . Finally, the size of shortcut buffers is the same as butterfly buffers.

VI-B Algorithmic Performance

The FBfly introduced in Section III-B is an efficient alternative to the vanilla attention block. To evaluate its algorithmic impact on end-to-end models, we take a six-layer Transformer as an example and compress it with different numbers of FBfly blocks, starting from the last block to the first block. (Other models, such as GPT and BERT, follow the same network architecture as the Transformer with the encoder or decoder kept. To eliminate the effect of different training strategies and evaluate the quality of the architecture, we choose the vanilla Transformer for demonstration.) Figure 16 shows the accuracy results on LRA-Text and LRA-Image. Although the accuracy fluctuates with different numbers of compressed layers, FBfly shows higher accuracy than the non-compressed Transformer with and compressed layers on LRA-Text and LRA-Image, respectively, demonstrating the improved algorithmic performance of our approach on end-to-end models.

Fig. 16: Accuracy with different number of compressed layers.

To obtain the best possible algorithmic performance of each model, we use the optimized configuration specified in [xiong2021nystromformer] for both vanilla Transformer and FNet. (As the vanilla FNet on the Retrieval task suffers significant accuracy loss, we increase its hidden size to .) We perform a simple grid search to optimize the hyperparameters of our FABNet. Table III presents the optimized accuracy of different models. FABNet achieves higher accuracy than both Transformer and FNet on three out of five tasks, including ListOPs, Retrieval and Image. On average, FABNet achieves the same accuracy as Transformer. To investigate the efficiency of FABNet, Figure 17 shows the compression rate of our optimized FABNet over the vanilla Transformer and FNet in terms of floating-point operations (FLOPs) and model size (number of parameters). Compared with the vanilla Transformer, FABNet achieves around reduction in FLOPs and reduction in model size, depending on the target task. Furthermore, compared with FNet, FABNet reduces FLOPs by and model size by .

Fig. 17: Reduction in FLOPs and model sizes.
Fig. 18: Co-design on LRA-Text dataset.
Fig. 19: Speedup breakdown of algorithm and hardware optimizations.
Fig. 20: Performance comparison against (a) high-end GPUs, and (b) edge GPU and CPU.

VI-C Effectiveness of Co-design

We evaluate the effectiveness of our co-design approach in finding the optimal algorithm and hardware designs. For demonstration, we use LRA-Text as the target dataset and VCU128 FPGA as the target device. We select , , and from {, , , , }, {, , }, {, } and {, } respectively. Parameters for hardware parallelism (, , and ) are chosen from {, , , , , , }. Figure 18 shows the points in the accuracy-latency design space. The orange line represents the accuracy loss, which is constrained to be less than % compared with the vanilla Transformer. The Pareto front is indicated by the brown line and the other blue points represent designs with less optimized software-related hyperparameters (Figure 16) or hardware design parameters. Among the design points that satisfy the accuracy constraint, we choose the point with the lowest latency in the Pareto front as our point of comparison. Within our design space, the selected point is up to % more accurate than the points in the same latency range and up to faster than points in the same accuracy range, underlining the advantages of our co-design approach. The runtime of the co-design process is around hours on our GPU server. To get the configurations for the rest of the datasets in LRA, we constrain the overall accuracy loss to be less than % compared to the vanilla Transformer. The final models and designs are chosen as the configurations with the highest hardware performance without violating the accuracy constraints. Unless mentioned otherwise, the remaining sections report the algorithmic and hardware performance using these optimized configurations.

Model                ListOps  Text   Retrieval  Image  Pathfinder  Avg.
Vanilla Transformer  0.373    0.637  0.783      0.379  0.709       0.576
Vanilla FNet         0.365    0.630  0.779      0.288  0.66        0.544
FABNet               0.374    0.626  0.801      0.398  0.679       0.576
TABLE III: Accuracy of different models on LRA.
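As a quick consistency check, the averages and the "three out of five tasks" statement can be recomputed from the per-task values in Table III. The variable names are illustrative; the numbers are copied from the table.

```python
# Per-task accuracy from Table III, in the order:
# ListOps, Text, Retrieval, Image, Pathfinder.
table_iii = {
    "Vanilla Transformer": [0.373, 0.637, 0.783, 0.379, 0.709],
    "Vanilla FNet": [0.365, 0.630, 0.779, 0.288, 0.66],
    "FABNet": [0.374, 0.626, 0.801, 0.398, 0.679],
}

# Mean accuracy per model, rounded to three decimals as in the table.
averages = {model: round(sum(acc) / len(acc), 3) for model, acc in table_iii.items()}

# Number of tasks on which FABNet beats both baselines.
wins = sum(
    1
    for i in range(5)
    if table_iii["FABNet"][i] > max(table_iii["Vanilla Transformer"][i],
                                    table_iii["Vanilla FNet"][i])
)

print(averages, wins)
```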

VI-D Comparison with Baseline Design

To evaluate the speedup brought by our algorithm (FABNet) and hardware (butterfly accelerator), we use a baseline design for comparison [devlin2018bert]. The baseline hardware is designed with multiple multiply-accumulate (MAC) units to accelerate the linear transform and the matrix multiplications between query, key and value vectors. Each MAC is composed of a multiplier array followed by an adder tree. The fine-grained intra- and inter-layer pipeline techniques [song2019hypar, alwani2016fused] are used to optimize the hardware performance. We allocate the parallelism of each MAC unit according to its workload in order to achieve load-balanced execution between different pipeline stages. For a fair comparison, we implement both baseline and butterfly accelerators on a VCU128 FPGA using multipliers. The high bandwidth memory (HBM) is used as the external memory. Both designs are clocked at MHz. We evaluate both base ( layers) and large ( layers) versions of each model using four different input sequences (, , and ).

A speedup breakdown is shown in Figure 19. To demonstrate the improvement brought by our algorithm, we first evaluate both BERT-Base and FABNet on the baseline design. As the FFT is not supported in the baseline design, we implement the Fourier layers as linear layers by multiplying the input sequences with DFT matrices. Since the operation reduction brought by the algorithm is not fully utilized by the baseline design, FABNet results in a speedup compared to BERT. To further evaluate the improvement brought by hardware optimizations, we evaluate FABNet on our butterfly accelerator, showing speedup when compared to the baseline design. By combining both algorithm and hardware optimizations, the overall speedup of our approach is over the baseline design.
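The fallback used on the baseline design, computing a Fourier layer as a dense multiplication with a DFT matrix, can be sketched in a few lines and compared against a radix-2 FFT. This is a generic illustration with assumed function names, not the baseline's implementation; the FFT shown requires a power-of-two length.

```python
import cmath


def dft_matrix(n):
    """Dense n x n DFT matrix with entries exp(-2*pi*i*j*k/n)."""
    return [[cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n)] for j in range(n)]


def dft_as_linear_layer(x):
    """DFT computed as a matrix-vector product: n*n complex multiplications."""
    n = len(x)
    W = dft_matrix(n)
    return [sum(W[j][k] * x[k] for k in range(n)) for j in range(n)]


def fft(x):
    """Recursive radix-2 FFT: (n/2)*log2(n) butterfly multiplications."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


x = [1, 2, 3, 4, 0, -1, -2, -3]
max_error = max(abs(a - b) for a, b in zip(dft_as_linear_layer(x), fft(x)))
```

Both routines produce the same result up to rounding, but the dense version performs quadratically many multiplications, which is why the baseline cannot exploit the operation reduction of the Fourier layers.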

VI-E Comparison with GPU and CPU

We compare our butterfly accelerator against GPU and CPU in both edge and server scenarios. In the edge scenario, our butterfly accelerator is implemented on a Xilinx Zynq 7045 FPGA. DDR4 is used as external memory and multipliers are used for computation. Nvidia Jetson Nano GPU and Raspberry Pi4 are used as the GPU and CPU platforms, respectively. In the server scenario, the butterfly accelerator is implemented on a Xilinx VCU128 FPGA. HBM is used as external memory and the design consumes multipliers. We use Nvidia V100 and TITAN Xp GPUs for comparison, with highly-optimized CUDA implementations [dao2020kaleidoscope]. FPGA designs are clocked at MHz.

We evaluate both FABNet-Base and FABNet-Large using , , and input sequences. Figure 20 shows the results in terms of speedup and energy efficiency. We represent energy efficiency using Giga operations per second per Watt (GOPS/Watt). In the edge scenario, our design on Zynq 7045 FPGA achieves speedup over Jetson Nano GPU and speedup over Raspberry Pi 4. (On FABNet-Large with long input sequences greater than , Raspberry Pi 4 suffers from out-of-memory (OOM) issues.) At the same time, our design yields and higher energy efficiency than Jetson Nano and Raspberry Pi 4, respectively. In the server scenario, our design on VCU128 is up to and faster and up to and more energy-efficient than the V100 and TITAN Xp GPU, respectively. In summary, the end-to-end speedup and energy efficiency gains on both edge and server scenarios under different input sequences highlight the scalability of our butterfly accelerator.

Platform                   # cores  Compiler        Frequency  Technology
CPU   Raspberry Pi 4       4
GPU   Nvidia V100          5,120    PyTorch 1.10.2
GPU   Nvidia TITAN Xp      3,840    PyTorch 1.10.2
GPU   Nvidia Jetson Nano   128      PyTorch 1.10.2
FPGA  Xilinx VCU128        -        Vivado 2019.2
FPGA  Xilinx Zynq 7045     -        Vivado 2019.2
TABLE IV: Hardware specification of CPU, GPU and FPGA.

VI-F Comparison with SOTA Accelerators

Accelerator                Venue       Technology   Frequency  # of Multipliers  Latency (ms)  Throughput (Pred./s)  Power (W)  Energy Eff. (Pred./J)
[ham2020]                  HPCA’20     ASIC (40nm)  1 GHz      128               56.0          17.86                 1.217      14.67
SpAtten [wang2020spatten]  HPCA’21     ASIC (40nm)  1 GHz      128               48.8          20.49                 1.060      19.33
Sanger [sanger201micro]    MICRO’21    ASIC (55nm)  1 GHz      128               45.2          22.12                 0.801      27.62
Energon [zhou2021energon]  TCAD’21     ASIC (45nm)  1 GHz      128               44.2          22.62                 2.633      8.59
ELSA [elsa2021isca]        ISCA’21     ASIC (40nm)  1 GHz      128               34.7          28.82                 0.976      29.52
DOTA [dota2022asplos]      ASPLOS’22   ASIC (22nm)  1 GHz      128               34.1          29.32                 0.858      34.18
FTRANS [li2020ftrans]      ISLPED’20   FPGA (16nm)  170 MHz    6531              61.6          16.23                 25.130     0.65
Our work                   -           FPGA (16nm)  200 MHz    640               2.4           416.66                11.355     36.69
TABLE V: Comparison with existing Transformer accelerators in terms of latency, power and energy efficiency.
Fig. 21: Latency for different input sequence lengths (a-c) when varying the available off-chip memory bandwidth.

Table V compares our butterfly accelerator with existing state-of-the-art (SOTA) accelerators in terms of speed and power consumption. Instead of comparing the effective throughput [wang2020spatten, sanger201micro], we use the end-to-end latency to represent the actual execution speed of the hardware. The energy efficiency is represented by the number of predictions per Joule (Pred./J). Following the experimental setting of [dota2022asplos], we compare all other SOTA accelerators on the LRA-Image dataset with a one-layer vanilla Transformer. Among these accelerators, only SpAtten [wang2020spatten] and DOTA [dota2022asplos] report the end-to-end performance. For the rest of the accelerators that only support attention, we estimate their performance by reusing their available multipliers to accelerate FFN. Furthermore, in both [wang2020spatten] and [sanger201micro], the authors compare different ASIC and FPGA designs based on the assumption that all the ASIC designs are clocked at GHz with multipliers. For a fair comparison, we follow the same assumption in our experiments. For designs with more than multipliers, we follow the scaling approach of [wang2020spatten, sanger201micro] to linearly scale down their throughput to get their end-to-end performance. For instance, DOTA [dota2022asplos] achieves speedup over Nvidia V100 using multipliers with TOPS throughput (we assume their design is compute-bound). We scale down its throughput by , which leads to speedup over V100. To obtain the power consumption, we use the same linear scaling approach. For instance, Sanger [sanger201micro] reports the power consumption of a design with multipliers. We divide the power consumption of their systolic array ( mW) by , which leads to mW. Together with the power of other modules such as pre-processing and memory, their total power consumption is W. To match the computational capacity of ASIC designs, we use DSPs in the VCU128 FPGA.
As our FPGA-based design is clocked at 200 MHz, this ensures that we have the same theoretical peak performance (200 MHz × 640 multipliers = 128 GOPS) as the ASIC designs (1 GHz × 128 multipliers = 128 GOPS), using the frequencies and multiplier counts listed in Table V. While this is a simple approximation, it allows us to compare different hardware architectures regardless of their underlying target platforms.

As shown in Table V, our butterfly accelerator achieves speedup over the FPGA-based FTRANS [li2020ftrans] while using nearly fewer DSPs. At the same time, we achieve higher energy efficiency than FTRANS. Compared with ASIC designs, our accelerator achieves speedup under the same computational capacity. Although our FPGA-based butterfly design consumes more power than ASIC designs, it yields higher energy efficiency than the other SOTA ASIC accelerators. We expect further speedup and energy efficiency improvements when our design is implemented as an ASIC.
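The derived columns of Table V follow from the latency and power rows, assuming one prediction per reported latency. The sketch below recomputes them; the dictionary keys and variable names are illustrative, and the first accelerator is identified only by its citation key.

```python
# (latency in ms, power in W) copied from Table V.
table_v = {
    "[ham2020]": (56.0, 1.217),
    "SpAtten": (48.8, 1.060),
    "Sanger": (45.2, 0.801),
    "Energon": (44.2, 2.633),
    "ELSA": (34.7, 0.976),
    "DOTA": (34.1, 0.858),
    "FTRANS": (61.6, 25.130),
    "Our work": (2.4, 11.355),
}

# Throughput in predictions per second: one prediction per latency interval.
throughput = {name: 1000.0 / lat for name, (lat, _) in table_v.items()}

# Energy efficiency in predictions per Joule: throughput divided by power.
energy_eff = {name: throughput[name] / power for name, (_, power) in table_v.items()}

# Latency ratio of every other design to ours.
ours = table_v["Our work"][0]
speedup = {name: lat / ours for name, (lat, _) in table_v.items() if name != "Our work"}

for name in table_v:
    print(f"{name:10s} {throughput[name]:8.2f} Pred./s {energy_eff[name]:6.2f} Pred./J")
```

The recomputed values agree with the table to within rounding in the last digit.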

We attribute the performance gain of our approach over ASIC designs to three main factors: 1) the use of FFT and butterfly factorization, which significantly reduces the computational complexity at the algorithmic level; 2) the adaptable butterfly design that adopts a single unified hardware engine to accelerate both FFT and butterfly linear transformation, which significantly improves the hardware efficiency; and 3) the co-design process, which jointly optimizes both algorithm and hardware parameters.

VI-G Off-Chip Memory Bandwidth Analysis

In order to investigate the sensitivity of our design to off-chip memory bandwidth, we vary the bandwidth from , , , , and GB/s, and evaluate its latency based on our performance model. For these experiments, we use five different designs with , , and BEs executing FABNet-Large with layers. To understand the bandwidth requirements under both short and long input lengths, we evaluate each design using three input sequences (, and ). The results are shown in Figure 21. For a small-scale design of BEs, a bandwidth of GB/s is enough for the design to reach its peak performance under different input sequences. For the largest design of BEs, the achieved performance saturates once the bandwidth reaches GB/s.
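The saturation behaviour can be pictured with a simple roofline-style bound in which latency is limited by whichever of compute and off-chip transfer takes longer. This is only an illustrative model under the assumption that the two overlap fully; it is not the performance model used in the paper, and the function names and the numbers in the example are hypothetical.

```python
def latency_ms(compute_ms, traffic_gb, bandwidth_gbps):
    """Lower bound on latency when compute and memory transfer fully overlap."""
    memory_ms = traffic_gb / bandwidth_gbps * 1000.0
    return max(compute_ms, memory_ms)


def saturation_bandwidth(compute_ms, traffic_gb):
    """Bandwidth (GB/s) above which the bound becomes compute-limited."""
    return traffic_gb / (compute_ms / 1000.0)


# Hypothetical design: 2 ms of pure compute and 0.1 GB of off-chip traffic.
for bw in (25, 50, 100, 200):
    print(bw, latency_ms(2.0, 0.1, bw))
```

Under this bound, a design with more BEs has a shorter compute time and therefore a higher saturation bandwidth, which matches the trend described above.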

VI-H Power and Resource Analysis

Table VI shows the power consumption breakdown based on the report generated from the Vivado XPE tool. (Power of I/O is not included as it occupies less than % of the total power.) We implement two designs with 120 BEs (BE-120) and 40 BEs (BE-40) on a VCU128 FPGA, which have been used in Section VI-E and Section VI-F, respectively. In both designs, the dynamic power accounts for more than % of the total power consumption. The memory resources, including both BRAM and HBM, consume more than % of the dynamic power. Furthermore, when the number of BEs scales from 40 to 120, the power of clocking, logic & signal and DSPs increases from 2.668 W, 2.381 W and 0.338 W to 6.882 W, 7.732 W and 1.437 W, respectively (Table VI).

Table VII presents the resource consumption of both BE-40 and BE-120 designs on the same VCU128 FPGA. Due to the use of FFT and butterfly matrices, our FABNet becomes less memory-intensive than the vanilla attention-based NNs. Since the theoretical memory bandwidth of a single HBM ( GB/s) can already satisfy the requirement of our accelerator (Section VI-G), we use one HBM in both designs to reduce the resource and power consumption. When the number of BEs decreases from 120 to 40, the BRAM usage is reduced from 978 to 338 (Table VII). This reduction can also be observed on the LUT and register resources.

                 Dynamic (W)                                          Static (W)
Design           Clocking  Logic & Signal  DSP    Memory (BRAM & HBM)
BE-40   used     2.668     2.381           0.338  5.325                3.368
        pct.     18.8%     16.7%           2.3%   37.5%                23.7%
BE-120  used     6.882     7.732           1.437  6.142                3.665
        pct.     26.4%     29.7%           5.5%   23.6%                14.1%
TABLE VI: Power breakdown of our designs on VCU128.
                 LUTs       Registers  DSP48s  BRAMs  HBMs
Available        1,303,680  2,607,360  9,024   2,016  2
BE-40   used     358,609    536,810    640     338    1
        pct.     27.5%      20.6%      7.1%    16.8%  50.0%
BE-120  used     1,034,610  1,648,695  2,880   978    1
        pct.     79.3%      63.2%      31.9%   48.5%  50.0%
TABLE VII: Resource usage of our designs on VCU128.
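The utilization percentages in Table VII can be recomputed from the used and available counts. The variable names below are illustrative; the numbers are copied from the table.

```python
# Resource counts from Table VII, in the order:
# LUTs, Registers, DSP48s, BRAMs, HBMs.
available = [1_303_680, 2_607_360, 9_024, 2_016, 2]
used = {
    "BE-40": [358_609, 536_810, 640, 338, 1],
    "BE-120": [1_034_610, 1_648_695, 2_880, 978, 1],
}

# Utilization in percent for each design and resource type.
utilization = {
    design: [100.0 * u / a for u, a in zip(counts, available)]
    for design, counts in used.items()
}

for design, pcts in utilization.items():
    print(design, [f"{p:.1f}%" for p in pcts])
```

The recomputed values match the table to within 0.1 percentage points.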

VII Related Work

Efficient Approaches for Attention. As the algorithmic complexity of the self-attention mechanism scales quadratically with respect to the input sequence length, many sparse variants have been introduced to approximate the attention-based NNs [tay2020efficient]. The sparsity patterns in these approaches are determined either dynamically [wang2020linformer, zaheer2020big, choromanski2020rethinking, tay2020sparse] or statically [beltagy2020longformer, lee2021fnet, chen2021pixelated]. Although these methods achieve a high compression rate on the number of operations and parameters, they do not consider the hardware cost and efficiency of mapping them onto real hardware designs.

Domain-Specific Accelerators for Attention-based NNs. To better utilize existing efficient attention-based algorithmic approaches on hardware, various domain-specific hardware designs have been introduced. Ham et al. [ham2020] propose a hardware architecture called , which dynamically prunes entries based on their softmax importance. By leveraging the sparsity in both head and token levels, Wang et al. [wang2020spatten] propose SpAtten that dynamically prunes entire rows and columns from the attention matrix. EdgeBERT [edgebert2021micro] explores the layer sparsity via an entropy-based early-exit approach, which significantly reduces the computation and memory footprint. To detect the weak relationship in attention, DOTA [dota2022asplos] uses low-rank linear transformations to detect and omit the weak connections. ELSA [elsa2021isca] approximates the attention mechanism using a sign random projection approach. To further exploit the sparsity in attention-based NNs, Energon [zhou2021energon] adopts a low-precision NN to predict the sparsity in the attention matrix. However, the generated sparsity patterns from these approaches are always unstructured, which may lead to hardware inefficiency. Sanger [sanger201micro] proposes pack-and-split modules to distribute the non-zero computation to each computation engine, achieving a load-balanced execution. Although these accelerators achieve notable speedup over GPUs and CPUs when executing the attention mechanism, their end-to-end hardware performance is limited as the approximation and acceleration of the FFN part are not considered in their design.

Comparison to Previous Work. As spatial-domain convolution corresponds to frequency-domain multiplication, various FFT-based hardware accelerators have been introduced to accelerate CNNs [zhang2017frequency, abtahi2018fft], LSTMs [wang2018c, li2019rnn] and attention-based NNs [li2020ftrans]. However, these designs only use FFT as a domain transfer approach, while the main computation is still performed in another processing engine. In contrast, this paper adopts the butterfly sparsity, a generalized pattern of FFT, for the main computation of the network. Based on the proposed method, all the computations are performed in a single unified butterfly accelerator, resulting in a higher hardware efficiency over previous designs.

Although butterfly sparsity has been explored in recent literature, most approaches only focus on algorithm-level optimizations. Dao et al. [dao2020kaleidoscope] demonstrate the potential of the butterfly matrix in approximating linear transformations, but its efficiency on attention matrices is not explored in their work. On the other hand, Pixelated Butterfly [chen2021pixelated] and Sparse Transformer [child2019generating] focus on adopting the butterfly pattern for attention matrices, neglecting the linear layers. Moreover, their designs require the use of multiple sparsity patterns to compensate for the accuracy loss, which significantly complicates the hardware design. Lee-Thorp et al. [lee2021fnet] show the effectiveness of Fourier transforms in accelerating attention layers, but did not consider the linear layers, leading to scalability issues (Section II-C). Different from all these efforts, this paper exploits the use of butterfly sparsity for both attention and linear layers via an algorithm-hardware co-design approach. A novel butterfly-based algorithm and an adaptable hardware accelerator are jointly designed to overcome existing limitations to push the performance limit.

VIII Conclusion

This paper proposes the end-to-end acceleration of attention-based NNs via algorithm and hardware co-design. On the algorithmic level, we propose FABNet, a hardware-friendly attention-based NN. Both attention and linear layers are compressed using a unified butterfly sparsity pattern allowing for scalable end-to-end acceleration. On the hardware level, an adaptable butterfly accelerator is proposed that can be configured at runtime to accelerate different layers based on a unified hardware engine to achieve high hardware efficiency. Both algorithm and hardware design parameters are jointly optimized to push the performance limit. Our experiments demonstrate that our co-design approach yields up to speedup over state-of-the-art accelerators. Furthermore, our design achieves up to higher energy efficiency compared to optimized GPU implementations.


The support of the UK EPSRC grants (UK EPSRC grant number EP/V028251/1, EP/L016796/1, EP/S030069/1 and EP/N031768/1), AMD and Intel is gratefully acknowledged. We also thank Alexander Mathiasen for insightful discussions on model compression.

Appendix A Artifact Appendix

A-A Abstract

This Appendix summarizes the necessary information and instructions to evaluate our artifacts. The functionality of our hardware accelerator can be evaluated by running Verilog HDL designs and System Verilog testbenches on Vivado design suite. The accuracy results can be obtained by running our PyTorch programs and the associated Bash scripts. The power and resource utilization can be obtained by running Synthesis and Implementation using our RTL code and constraint files. The latency can be obtained by running our custom Python-based performance model. We also provide all our training log files and Vivado design reports in the link:

A-B Artifact check-list (meta-information)

  • Algorithm: FABNet, an efficient model that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.

  • Program: Python, PyTorch, Verilog HDL

  • Model: We evaluate three models for comparison, including Transformer, FNet and FABNet.

  • Data set: Long-Range-Arena (LRA) dataset, which is a well-known long sequence natural language processing (NLP) dataset. The zip file can be downloaded from the link: The required disk space of the unzipped files is around 33 GB.

  • Run-time environment: Ubuntu 20.04, CUDA SDK 11.3 or higher.

  • Hardware: Nvidia V100 GPU, Nvidia TITAN Xp GPU, Nvidia Jetson Nano GPU, Intel Xeon Gold 6154 CPU, Raspberry Pi 4.

  • Metrics: Accuracy, simulated latency, resource and power consumption.

  • Experiments: Bash scripts and detailed instructions are provided to run experiments.

  • How much disk space required (approximately)?

  • How much time is needed to prepare workflow (approximately)? hours.

  • How much time is needed to complete experiments (approximately)? Accuracy results: hundreds of GPU hours to obtain. Power and resource consumption: around 70 hours. Functionality of Verilog design: around 5 hours.

  • Publicly available? Yes.

  • Code licenses (if publicly available)? Yes.

  • Archived (provide DOI)? We will update this later.

A-C Description

A-C1 How to access

A-C2 Hardware dependencies

A GPU server is required to run the training of our models. A CPU server is needed to run simulation, synthesis and place&route. Different GPUs and CPUs, such as Nvidia Jetson Nano GPU and Raspberry Pi 4, are also required to evaluate the hardware performance of different models.

A-C3 Software dependencies

Vivado Design Suite , PyTorch , CUDA SDK or higher, Python or higher. Other dependencies are listed in requirements.txt.

A-C4 Data sets

Five tasks in the LRA dataset including ListOPs for hierarchical data classification, Text for byte-level text classification, Retrieval for byte-level document retrieval, Image for image classification for sequences of pixels and Pathfinder for classification of long-range spatial dependency.

A-D Installation

We provide a detailed installation guide in the of the root directory.

A-E Experiment workflow

To evaluate the functionality of our hardware, perform the following steps:

  • Follow the instruction of experimental setup in the root directory to install software dependencies. Install Vivado 2019.2.

  • Generate the test data using our Python programs.

  • Create a Vivado project for our hardware design.

  • Import all the Verilog source code and System Verilog testbenches.

  • Include all the necessary IPs from Vivado IP library.

We provide Vivado Tcl scripts and step-by-step instructions in ./hardware/npu_design/verilog/ to automate the whole process.

To reproduce the algorithmic and hardware performance, we provide all the scripts under the directory ./script_figs to generate figures and tables. The detailed instructions are provided in ./script_figs/

A-F Evaluation and expected results

We provide scripts under ./script_figs to generate all the figures and tables related to accuracy and hardware performance, including power consumption, resource utilization and simulated latency. As running all the experiments requires a few hundred GPU/CPU hours, to facilitate the artifact evaluation, we refer to the following key results that can be obtained within a reasonable time:

  • Vivado simulation to run different layers, such as fast Fourier transform, butterfly matrix multiplication and layer normalization, on our Verilog hardware design. We provide System Verilog testbenches under ./hardware/npu_design/verilog/functionality/testbench/, and a detailed workflow in the first paragraph of Section A-E.

  • Power breakdown in Table VI and resource utilization in Table VII. We provide detailed instructions and Vivado Tcl scripts under ./hardware/npu_design/verilog/ to run synthesis and place&route on both VCU128 and Zynq 7045 FPGAs.

Although it takes longer to run other experiments, all the results are reproducible using our provided scripts. We provide all the GPU training log files and Vivado design reports in the link:

A-G Methodology

Submission, reviewing and badging methodology: