I Introduction
Recent years have witnessed the great success of attention-based neural networks (NNs) on many AI tasks [brauwers2021general]. The attention mechanism [vaswani2017attention], which captures long-range information from sequences of data, has demonstrated excellent algorithmic performance in various natural language processing [devlin2018bert, radford2019language] and computer vision [dosovitskiy2020image] applications. However, the advances of attention-based NNs come at a cost: the use of attention and linear layers significantly increases the computational load, resulting in a large overhead on their speed and power consumption [wang2020spatten]. Figure 1 shows an operation breakdown of four mainstream attention-based models. For short input sequences, linear layers occupy over % of the operation counts. As the input sequence grows, the computation becomes gradually dominated by the attention layers. Since both attention and linear layers are memory- and compute-intensive, it is challenging to achieve high hardware performance on attention-based NNs across input sequences of various lengths.

So far, various approaches and designs have been introduced to accelerate attention-based DNNs. On the algorithmic level, several efficient sparse variants have attempted to reduce the computational complexity [choromanski2020rethinking, wang2020linformer, kitaev2020reformer, tay2020sparse, beltagy2020longformer, zaheer2020big, child2019generating]. However, most of these approaches focus only on reducing the number of parameters and operations without considering the real hardware performance, such as end-to-end latency. Furthermore, the hardware efficiency of implementing these sparsity patterns on real hardware designs is often overlooked. On the hardware level, although various highly-optimized accelerators (Table I) have been proposed [ham2020, wang2020spatten, sanger201micro, zhou2021energon, elsa2021isca, dota2022asplos, li2020ftrans, edgebert2021micro], several issues still remain unresolved:


Most current accelerators focus only on optimizing either the FFNs [li2020ftrans] or the attention mechanism [ham2020, wang2020spatten, sanger201micro, zhou2021energon, elsa2021isca, dota2022asplos]. Without jointly optimizing both parts, these hardware designs lack scalability when accelerating end-to-end attention-based NNs with different input lengths.

While optimizing the attention mechanism, most existing designs dynamically detect and prune redundant computation at runtime to achieve high sparsity on specific datasets and networks. However, the generality of these dynamic approaches needs further testing, as their performance gain may vary across different datasets and network architectures.

The sparsity patterns introduced by these dynamic approaches are often unstructured, requiring dynamic hardware controllers to exploit the sparsity. Such complicated controllers often contain large numbers of clocking elements, and their hardware overhead increases as the transistor size shrinks [jang2021sparsity]. As such, the performance or energy gains of these dynamic methods may be diminished.
To address the aforementioned issues, this paper adopts butterfly sparsity to accelerate attention-based models with three novel aspects (Table I): i) fine-grained structured regularity, which possesses regular data accesses to optimize both memory and compute efficiency; ii) a static sparsity pattern, which avoids the need for a dynamic controller in hardware; and iii) sparsity exploitation on both attention and linear layers, which allows scalable end-to-end acceleration of attention-based NNs. We therefore propose FABNet, a hardware-friendly model for FFT, Attention and Butterfly-Net. To fully exploit the sparsity in hardware, we propose an adaptable butterfly accelerator that can be configured at runtime via dedicated hardware control to accelerate different layers using a single unified engine, significantly improving hardware efficiency. To push the performance limit, we jointly optimize the model and hardware via a co-design approach. Overall, this work makes the following contributions:


A hardware-friendly attention-based model, FABNet, that adopts the butterfly sparsity pattern in both attention and linear layers for end-to-end acceleration (Section III).

A novel adaptable butterfly accelerator configurable at runtime via dedicated hardware control to accelerate different layers using a single unified engine (Section IV).

Several hardware optimizations that improve hardware efficiency, and a co-design approach to jointly optimize both algorithmic and hardware parameters (Section V).

A comprehensive evaluation on different datasets that demonstrates the advantages of our approach over CPU, GPU and state-of-the-art accelerators (Section VI).
Accelerators | Pattern Regularity | Sparsity Pattern | Sparsity Location
[ham2020] | unstructured | dynamic | attention
SpAtten [wang2020spatten] | coarse-grained structured | dynamic | attention
Sanger [sanger201micro] | load-balanced unstructured | dynamic | attention
Energon [zhou2021energon] | unstructured | dynamic | attention
ELSA [elsa2021isca] | unstructured | dynamic | attention
DOTA [dota2022asplos] | unstructured | dynamic | attention
FTRANS [li2020ftrans] | None | static | FFN
EdgeBERT [edgebert2021micro] | None | dynamic | layer
Our work | fine-grained structured | static | attention & FFN
II Background and Motivation
II-A Attention-Based Neural Networks
Based on their network structure, attention-based NNs can be classified into three categories: i) encoder-decoder, ii) encoder-only, and iii) decoder-only networks. The encoder-decoder NNs are mainly designed for sequence-to-sequence tasks, such as machine translation [vaswani2017attention]. One of the most widely used encoder-decoder networks is the Transformer, which is constructed from a stack of encoder and decoder blocks. Figure 2 illustrates the structure, annotated with the input length, hidden size and FFN expand ratio. Each encoder starts with a multi-head attention module, followed by a feed-forward network (FFN) consisting of two linear (fully-connected) layers. Finally, residual addition [he2016deep] and layer normalization (LN) [ba2016layer] are applied after the FFN. Within each multi-head attention, the inputs are first mapped to query ($Q$), key ($K$) and value ($V$) matrices through three different linear layers. The query matrix is then multiplied with the key matrix, followed by a softmax operation, to obtain the score matrix ($S$). The generated $S$ is multiplied with $V$, and the resultant matrix flows into another linear layer, which generates the final output matrix of the multi-head attention. Similar to the encoder, the decoder employs two multi-head attention modules and one FFN; the difference is that the key and value inputs of the second attention module come from the last encoder.

Based on the original encoder-decoder structure of the Transformer, different variants have been proposed. The encoder-only networks, such as BERT [devlin2018bert] and XLM [lample2019cross], are auto-encoding models that have been widely applied to NLP tasks, such as sequence classification [wang2018glue]. The Vision Transformer (ViT) [dosovitskiy2020image] also lies in this category: an extra linear projection layer is introduced at the beginning, while its encoder layers correspond to the encoder part of the original Transformer. Finally, the decoder-only networks represent the auto-regressive models designed for NLP tasks, such as language modeling [ma2019tensorized]. GPT [radford2019language] is a typical decoder-only model that corresponds to the decoder part of the original Transformer. Although we focus on encoder-only networks in this work, our hardware design is flexible and applicable to decoders too.

II-B Butterfly Matrices and FFT
Despite the impressive accuracy attained by attention-based NNs, these models are expensive and not scalable; e.g. the self-attention mechanism in the Transformer scales quadratically in computation and memory with the input sequence length. As a result, numerous works [choromanski2020rethinking, wang2020linformer, child2019generating, chen2021scatterbrain] adopt structured linear mappings, such as sparse and low-rank matrices, to approximate the attention matrices and/or the weight matrices in the feed-forward layers. Choosing an appropriate structure for each linear mapping, however, is application-dependent, often requiring domain expertise and entailing an arduous process of hand-picking solutions, as different structures have different trade-offs in accuracy and speed.
To counteract this, recent work has utilized butterfly matrices [parker1995random, dao2019learning], which are universal representations of structured matrices with a simple recursive structure. Specifically, each butterfly matrix $B$ of size $n \times n$ encodes the recursive divide-and-conquer structure with butterfly patterns and hence can be expressed as the product of $\log_2 n$ sparse butterfly factor matrices [de2018two] as follows:
$$B = B_n \, B_{n/2} \cdots B_2,$$
where each $B_k$, a butterfly factor matrix, is block-diagonal with blocks of the form
$$\begin{bmatrix} D_1 & D_2 \\ D_3 & D_4 \end{bmatrix},$$
i.e. a $2 \times 2$ block matrix of diagonal matrices $D_1, \ldots, D_4$ of size $k/2$, whose entries can be trained via gradient-based methods.
Due to their expressiveness in representing structured matrices and approximating unstructured data, butterfly matrices and their variants [chen2021pixelated, dao2020kaleidoscope] have found success in compressing attention and weight matrices, considerably improving the accuracy and efficiency of attention-based NNs. For instance, applying butterfly factorization to a linear layer with an $n \times n$ weight matrix reduces the computational and memory complexity from $\mathcal{O}(n^2)$ to $\mathcal{O}(n \log n)$.
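The complexity reduction above can be made concrete with a small NumPy sketch (illustrative only; all function names are ours, not from any released codebase). It multiplies a vector by a butterfly matrix stage by stage, using four multiplications per butterfly, i.e. roughly $2n \log_2 n$ multiplications instead of $n^2$, and checks the result against the explicit dense product of the factors:

```python
import numpy as np

def butterfly_matvec(factors, x):
    """Multiply a butterfly matrix (given by its factors) with a vector in
    O(n log n) operations. factors[s] has shape (n//2, 4): the diagonal
    entries (d1, d2, d3, d4) of every 2x2 butterfly in stage s (block size
    2**(s+1)), applied right to left: y = B_n ... B_4 B_2 x."""
    n = x.size
    y = x.astype(float).copy()
    for s, w in enumerate(factors):
        k = 2 ** (s + 1)          # block size of this butterfly factor matrix
        half, b = k // 2, 0
        for blk in range(0, n, k):
            for j in range(half):
                i0, i1 = blk + j, blk + j + half
                d1, d2, d3, d4 = w[b]
                y[i0], y[i1] = d1 * y[i0] + d2 * y[i1], d3 * y[i0] + d4 * y[i1]
                b += 1
    return y

def factor_matrix(n, k, w):
    """Materialize one sparse butterfly factor matrix B_k (only 2n nonzeros)."""
    B = np.zeros((n, n))
    half, b = k // 2, 0
    for blk in range(0, n, k):
        for j in range(half):
            i0, i1 = blk + j, blk + j + half
            B[i0, i0], B[i0, i1] = w[b][0], w[b][1]
            B[i1, i0], B[i1, i1] = w[b][2], w[b][3]
            b += 1
    return B

# Sanity check: the fast O(n log n) product matches the dense O(n^2) one.
rng = np.random.default_rng(0)
n = 8
factors = [rng.standard_normal((n // 2, 4)) for _ in range(3)]
x = rng.standard_normal(n)
dense = np.eye(n)
for s, w in enumerate(factors):
    dense = factor_matrix(n, 2 ** (s + 1), w) @ dense
assert np.allclose(dense @ x, butterfly_matvec(factors, x))
```

Each of the $\log_2 n$ factor matrices carries only $2n$ nonzero entries, which is the source of both the memory and the compute savings.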
Besides attention and weight matrices, some designs have explored replacing the entire attention mechanism with more efficient counterparts [tolstikhin2021mlp]. A prominent example is FNet [lee2021fnet], in which the self-attention modules are replaced by 2D Discrete Fourier Transform (DFT) operations. Specifically, for each input, a 1D DFT is applied along the sequence and the hidden dimension independently, keeping only the real component of the resulting outputs. To reduce the DFT computation time, the Cooley-Tukey Fast Fourier Transform (FFT) algorithm [cooley1965algorithm] is used. As the use of the DFT facilitates information flow across all embeddings, it achieves similar algorithmic performance to vanilla self-attention layers, but at a significant reduction in latency and memory.

On the algorithmic front, our proposed FABNet utilizes a mixture of these techniques (FFT and butterfly matrices) to outperform relevant approximation approaches in terms of accuracy. Notably, since FFT matrices can be considered a special case of butterfly matrices, with $D_1$, $D_3$ being identity matrices and $D_2$, $D_4$ acting as twiddle factors, both the FFT and butterfly matrices possess the recursive butterfly structure. Therefore, it is possible to use a unified computational and data access pattern and devise a single hardware engine to accelerate both FFT and butterfly-based operations with high hardware efficiency.
II-C Latency Breakdown and Motivation
The operation counts in Figure 1 reveal that the computation of attention-based NNs is dominated by different components as the length of the input sequences changes. To further investigate the real hardware performance of each sub-component, we profile the execution time of the BERT-Large model on an Nvidia V100 GPU and an Intel Xeon Gold 6154 CPU. The length of input sequences is set to , and on both devices, and the batch size for GPU and CPU is and , respectively. Figure 3 shows the latency breakdown. We split the latency into three main sub-components: attention layers, linear layers, and other operations, e.g. layer normalization, residual connections, matrix transformations and IO operations. Notably, on both CPU and GPU, linear layers take up a significant portion of the execution time, and respectively, when the input length is small. As the input length grows, the execution time of the attention layers increases gradually and becomes dominant. As such, the latency is dominated by different components depending on the length of the input sequence. According to Amdahl's law [amdahl1967validity], to achieve high hardware performance across different input lengths, it is necessary to optimize both attention and linear layers.

The majority of previous accelerators for attention-based NNs focus on optimizing a single component of the entire model (either attention or FFN, as shown in Table I), leading to suboptimal end-to-end performance gains. The execution time of these accelerators is heavily dependent on the input length, which varies across different applications, reducing the scalability of these hardware designs and thus narrowing their deployability in real-world scenarios. Naively combining previous works that optimize the linear layers [li2020ftrans] and the attention layers [ham2020, wang2020spatten, sanger201micro, zhou2021energon, elsa2021isca, dota2022asplos], however, would result in low hardware efficiency, as they adopt different sparsity patterns. As a result, designing an end-to-end accelerator for scalable attention-based NNs remains an open problem. In this work, we address this challenge through an algorithm-hardware co-design approach. On the algorithmic level, we propose a hardware-friendly model called FABNet, which adopts a unified butterfly sparsity pattern to compress both attention and linear layers. On the hardware level, we propose an adaptable butterfly design that can be configured at runtime to accelerate different layers in FABNet using one unified hardware engine.
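The Amdahl's-law argument above can be illustrated with a short calculation (the 80%/20% split below is a hypothetical example, not a measured profile):

```python
def amdahl_speedup(fraction, component_speedup):
    """Overall speedup when only `fraction` of the runtime is accelerated
    by `component_speedup` (Amdahl's law)."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)

# Suppose linear layers take 80% of the time at a short input length.
# An (effectively) infinitely fast attention engine then caps the overall
# speedup at 1.25x, while a 10x-faster linear-layer engine already
# yields about 3.6x:
attention_only = amdahl_speedup(0.2, 1e12)  # accelerate only the 20%
linear_only = amdahl_speedup(0.8, 10.0)     # accelerate only the 80%
print(attention_only, linear_only)
```

The roles flip for long sequences, where attention dominates, which is why both components must be optimized jointly.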
III Algorithm Optimization
III-A Computational Analysis of Sparsity Patterns
Various pruning schemes have been proposed to reduce the computational complexity of attention-based NNs, leading to different efficient models [choromanski2020rethinking, wang2020linformer, kitaev2020reformer, tay2020sparse, beltagy2020longformer, zaheer2020big, child2019generating, chen2021pixelated, lee2021fnet, dao2020kaleidoscope, dao2022monarch]. By analysing the computational and data access patterns of these variants, we define five basic sparsity patterns, shown in Figure 4: i) low-rank, ii) sliding-window, iii) butterfly, iv) random, and v) block-wise. As the low-rank approximation of an attention matrix requires both sequential row and column reads, while the data are usually stored in either row-major or column-major order only, the hardware efficiency of low-rank sparsity is inherently diminished. Random sparsity also exhibits low hardware efficiency due to its random read pattern. Furthermore, we observe that the sparsity in various sparse variants can be expressed as different combinations of the basic sparsity patterns, as summarized in Table II. As some basic sparsity patterns can only capture either long-range global or short-range local information (Figure 4), the rationale behind using multiple sparsity patterns within each variant is mainly to compensate for the underlying accuracy loss. For example, Pixelfly [chen2021pixelated] introduces an additional low-rank sparsity pattern to increase the expressiveness of its flat block-wise butterfly pattern and improve accuracy.
Model | Sparsity Pattern | Att. | FFN | Unified Sparsity | Co-Design
Performer [choromanski2020rethinking] | Low-Rank | ✔ | ✗ | ✗ | ✗
Linformer [wang2020linformer] | Low-Rank (extra kernels) | ✔ | ✗ | ✗ | ✗
Reformer [kitaev2020reformer] | Blockwise (extra kernels) | ✔ | ✗ | ✗ | ✗
Sparse Sinkhorn [tay2020sparse] | Blockwise + Random | ✔ | ✗ | ✗ | ✗
Longformer [beltagy2020longformer] | Sliding-Window + Low-Rank | ✔ | ✗ | ✗ | ✗
BigBird [zaheer2020big] | Random + Sliding-Window + Low-Rank | ✔ | ✗ | ✗ | ✗
FNet [lee2021fnet] | Butterfly | ✔ | ✗ | ✗ | ✗
Kaleidoscope [dao2020kaleidoscope] | Butterfly | ✗ | ✔ | ✔ | ✗
Sparse Trans. [child2019generating] | Low-Rank + Butterfly + Sliding-Window | ✔ | ✗ | ✗ | ✗
Pixelfly [chen2021pixelated] / Monarch [dao2022monarch] | Butterfly + Block-Wise + Low-Rank | ✔ | ✔ | ✗ | ✗
Our work | Butterfly | ✔ | ✔ | ✔ | ✔
Different sparsity patterns exhibit diverse data access patterns, which calls for custom hardware support. However, supporting multiple sparsity patterns may complicate the hardware design. For instance, in order to fully utilize the sparsity of the random pattern, complex dynamic controllers are required to achieve load-balanced execution on different hardware engines [sanger201micro, geng2020awb]. The extra overhead of such controllers may counteract the improvement brought by skipping redundant operations [jang2021sparsity].
In this work, we aim to find a hardware-friendly sparsity pattern that: 1) has structured data access patterns to simplify the memory design, 2) captures both local- and global-range information with a single sparsity pattern, and 3) is applicable to both the attention mechanism and the FFNs to sustain its performance improvement across both long and short input sequences. To meet these requirements, we adopt butterfly sparsity as the basis for constructing our efficient algorithm.
Compared to other sparsity patterns, butterfly sparsity provides a number of favorable properties. As shown in Figure 4, although random sparsity is able to capture both local and global information, it has two drawbacks compared to butterfly sparsity: 1) it requires complicated controllers with excessive hardware overhead [jang2021sparsity], and 2) its performance gain cannot be guaranteed, as the sparsity may vary substantially among different datasets and tasks. Compared with random sparsity, the sliding-window pattern is more hardware-friendly. However, Table II shows that it often requires low-rank sparsity to compensate for the accuracy loss, as sliding-window sparsity only captures the local relationship within each window. Moreover, although some variants adopt a single low-rank or block-wise sparsity pattern with satisfactory algorithmic performance, they require extra algorithmic operations and dedicated computational kernels during inference (e.g. the locality-sensitive hashing (LSH) in Reformer [kitaev2020reformer]), resulting in large hardware overhead. In contrast, this paper treats butterfly sparsity as a promising method due to its regular data access pattern and its ability to capture both global and local information.
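The "global and local" property of butterfly sparsity can be checked numerically: each butterfly factor matrix holds only $2n$ nonzeros (a local, structured access pattern), yet composing the $\log_2 n$ stages makes every output depend on every input. A small illustrative script (our own construction, not code from any released implementation):

```python
import numpy as np

def butterfly_support(n, k):
    """0/1 support of the butterfly factor matrix with block size k:
    each row touches itself and its partner at distance k // 2."""
    S = np.zeros((n, n), dtype=int)
    half = k // 2
    for blk in range(0, n, k):
        for j in range(half):
            i0, i1 = blk + j, blk + j + half
            S[i0, i0] = S[i0, i1] = S[i1, i0] = S[i1, i1] = 1
    return S

n = 16
reach = np.eye(n, dtype=int)
k = 2
while k <= n:
    # Compose the stages: which inputs can influence which outputs so far.
    reach = (butterfly_support(n, k) @ reach > 0).astype(int)
    k *= 2

# Each factor has only 2n nonzeros, yet after log2(n) stages every output
# depends on every input: local structure, global receptive field.
print(int(reach.sum()))  # 256 = 16 * 16: fully connected
```

This is exactly the mixing behavior that lets a single, static, regular pattern replace combinations of low-rank, sliding-window and random sparsity.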
III-B Unified Butterfly Pattern for Attention and Linear Layers
The butterfly pattern has demonstrated its effectiveness and generality in approximating linear transformations [dao2020kaleidoscope]. Furthermore, Lee-Thorp et al. [lee2021fnet] have shown the potential of simplifying the computation by replacing the entire attention layer with a Fourier transform, which effectively mixes tokens without explicitly approximating the attention mechanism. To maximize the reduction in computation while maintaining acceptable algorithmic performance, we start by proposing two basic building blocks for scalable inference: 1) the Attention Butterfly (ABfly) block, and 2) the Fourier Butterfly (FBfly) block.

In the ABfly block, we retain the backbone of the attention module and compress all the linear layers using butterfly factorization. Specifically, the ABfly block starts with three butterfly linear layers that generate the $Q$, $K$ and $V$ matrices. The results are fed into a vanilla multi-head attention layer and another butterfly linear layer to obtain the relationships among different tokens. A butterfly FFN consisting of two butterfly linear layers is placed at the end of the ABfly block for additional processing. To further reduce the amount of computation and the number of parameters, we replace the attention module with a 2D Fourier transform layer, implemented using FFT, resulting in a more compute-efficient block called FBfly. The use of FFT effectively mixes different input tokens, which allows the following butterfly FFN to process a longer sequence. More importantly, all computation in the FBfly block, which involves the FFT's twiddle factors and the butterfly linear layers' weights, is performed using a unified butterfly pattern, resulting in higher hardware efficiency over previous works.
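As a rough illustration of the FBfly dataflow, the following NumPy sketch mimics the block's forward pass: a 2D FFT for token mixing (keeping only the real part) followed by a butterfly-factorized FFN. It is a simplified stand-in for the actual PyTorch implementation: residual additions and layer normalization are omitted, and the butterfly layers here are square, whereas the real FFN expands the hidden dimension by the expand ratio.

```python
import numpy as np

rng = np.random.default_rng(0)

def butterfly_linear(x, factors):
    """Apply a butterfly-factorized 'linear layer' along the last axis.
    factors[s]: (d//2, 4) diagonal entries for stage block size 2**(s+1)."""
    y = x.copy()
    d = x.shape[-1]
    for s, w in enumerate(factors):
        k = 2 ** (s + 1)
        half, b = k // 2, 0
        for blk in range(0, d, k):
            for j in range(half):
                i0, i1 = blk + j, blk + j + half
                d1, d2, d3, d4 = w[b]
                y[..., i0], y[..., i1] = (d1 * y[..., i0] + d2 * y[..., i1],
                                          d3 * y[..., i0] + d4 * y[..., i1])
                b += 1
    return y

def fbfly_block(x, ffn_factors_1, ffn_factors_2):
    # 1) Token mixing: 2D FFT over (sequence, hidden), keep the real part.
    mixed = np.fft.fft2(x).real
    # 2) Butterfly FFN: two butterfly linear layers with a ReLU in between.
    h = np.maximum(butterfly_linear(mixed, ffn_factors_1), 0.0)
    return butterfly_linear(h, ffn_factors_2)

seq_len, hidden = 8, 16
x = rng.standard_normal((seq_len, hidden))
stages = int(np.log2(hidden))
f1 = [rng.standard_normal((hidden // 2, 4)) for _ in range(stages)]
f2 = [rng.standard_normal((hidden // 2, 4)) for _ in range(stages)]
out = fbfly_block(x, f1, f2)
print(out.shape)  # (8, 16)
```

Note that both the FFT and the butterfly linear layers walk the same stage-by-stage butterfly pattern, which is what allows a single hardware engine to serve both.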
Although FBfly is less compute- and memory-intensive than ABfly, the use of the Fourier transform layer may degrade accuracy [lee2021fnet]. To preserve high accuracy, we propose a novel butterfly-based network called FABNet that uses a hybrid of the ABfly and FBfly blocks, as depicted in Figure 5: a number of FBfly blocks at the beginning, with ABfly blocks stacked on top. Both block counts are exposed as hyperparameters, enabling a trade-off between algorithmic and hardware performance. To optimize this trade-off, we develop a co-design method (Section V-C) that explores the design space of both the neural architecture and the hardware design.

IV Hardware Accelerator
IV-A Architecture Overview
Figure 6 shows the proposed hardware accelerator, consisting of a Butterfly Processor (BP), an Attention Processor (AP), a Post-processing Processor (PostP), the off-chip memory, and several on-chip buffers. BP consists of multiple Butterfly Engines (BEs), which accelerate the computations that involve butterfly patterns, including both FFT and butterfly linear transformations. AP contains multiple Attention Engines (AEs), each composed of one QK unit and one SV unit. The QK unit implements the softmax and the matrix multiplication between queries and keys. The SV unit receives the outputs of the QK unit and multiplies them with the value vectors to generate the final results of the attention layer. The PostP module is responsible for executing the layer normalization and shortcut (SC) addition. To ease the on-chip memory consumption, the intermediate results between different FFT and butterfly operations are transferred back to the off-chip memory. Although doing so increases the bandwidth requirement, it ensures our accelerator is scalable on hardware platforms with limited on-chip memory. To improve the overall hardware performance, all the on-chip buffers utilize double-buffering to overlap the data transfer with the computation.

IV-B Adaptable Butterfly Engine
Figure 6b shows the hardware architecture of the BE. Each BE is mainly composed of a butterfly memory system and multiple adaptable Butterfly Units (BUs). To improve hardware efficiency and enable the use of a single unified engine, the BE module is designed with a focus on adaptability. As such, it can be configured at runtime via programmable multiplexers and demultiplexers to execute either an FFT or a butterfly linear transformation.
IV-B1 Adaptable Butterfly Unit
Figure 7a depicts the architecture of the proposed adaptable BU. Each adaptable BU consists of four real-number multipliers and two real-number adders, followed by two complex-number adders. The inputs and twiddle factors of both the FFT and the butterfly linear transformation are connected to the multipliers, with eight multiplexers used to select the correct inputs for each operation. Two demultiplexers are placed after the real-number adders to control the output flow.
When performing the butterfly linear transformation (Figure 7b), the twiddle factors are non-symmetric real numbers. Hence, the output of each twiddle multiply can be computed as:
$$y_0 = w_1 x_0 + w_2 x_1, \qquad y_1 = w_3 x_0 + w_4 x_1,$$
where $x$ and $w$ represent the inputs and twiddle factors, respectively. To perform the butterfly linear transformation, the four multipliers in each BU are configured to execute the four real-number multiplications in the equation above. The values $x_0$, $x_1$ and $w_1, \ldots, w_4$ are selected via multiplexers as the operands of the multipliers. At the same time, the results generated from the real-number adders/subtractors are output directly through the demultiplexers.
For FFT (Figure 7c), since the twiddle factors of the FFT are complex and symmetric, only one complex-number multiplication is required per twiddle multiply. Thus, by selecting the complex inputs $x_0$, $x_1$ and the twiddle factor $w$, we reuse the four real-number multipliers in each BU to perform the required complex-number multiplication. The demultiplexers are then configured to route the results to the complex-number adders/subtractors to obtain the final results $y_0 = x_0 + w x_1$ and $y_1 = x_0 - w x_1$. The control signals for the multiplexers and demultiplexers are set before running each layer. As such, the proposed adaptable BE can accelerate both FFTs and butterfly linear transformations by reusing the multipliers, adders and subtractors.
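The multiplier-reuse scheme of the adaptable BU can be sketched in software as two modes sharing the same four multipliers (the function names and operand ordering below are ours, chosen to mirror Figure 7; complex numbers are passed as (real, imag) pairs):

```python
def bu_linear_mode(x0, x1, w1, w2, w3, w4):
    """Butterfly linear transform: four independent real multiplications,
    then two real adders: y0 = w1*x0 + w2*x1, y1 = w3*x0 + w4*x1."""
    m1, m2, m3, m4 = w1 * x0, w2 * x1, w3 * x0, w4 * x1  # the 4 multipliers
    return m1 + m2, m3 + m4                              # the 2 real adders

def bu_fft_mode(x0, x1, w):
    """FFT butterfly: the same four multipliers compute one complex product
    w * x1; the complex adders/subtractors then form x0 +/- w*x1."""
    (a, b), (c, d) = x1, w                     # x1 = a + bi, w = c + di
    m1, m2, m3, m4 = a * c, b * d, a * d, b * c  # reuse the 4 multipliers
    tr, ti = m1 - m2, m3 + m4                    # Re/Im of w * x1
    r0, i0 = x0
    return (r0 + tr, i0 + ti), (r0 - tr, i0 - ti)
```

In hardware, the mode is selected by configuring the multiplexers and demultiplexers before each layer; here it is simply a choice of function.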
IV-B2 Butterfly Memory System
Our butterfly memory system comprises an input manager, a serial-to-parallel (S2P) module, a parallel-to-serial (P2S) module and butterfly buffers. As shown in Figure 8a, the butterfly pattern requires different data accesses at different stages. The conventional column-major or row-major order will cause bank conflicts while reading the data. For instance, accessing index pair and of the first stage causes a read conflict in the column-major order shown in Figure 8b, in which each row represents a memory bank. The row-major order suffers from the same issue while reading and in the third stage.
To avoid such bank conflicts, we introduce a custom data layout strategy and implement it using the S2P module shown in Figure 9. We permute each column using a starting position, which indicates how many rows the first element in the current column should be shifted down, and define the starting position using the following formula:
For each group of columns, the starting position is obtained by shifting one position down, as shown in Figure 9a. The starting positions are generated using a counter together with bit-count and addition operations (Figure 9b). After packing the serial data together, the S2P module permutes them based on the starting positions.
Figure 10 presents an example with 16 inputs, where the data required by the first and second stages of the butterfly pattern are read from the buffers without bank conflicts. However, as the butterfly units receive data in pairs, an extra pairing step is required after the S2P module; an example is the second output column of the first stage in Figure 10b. To pair indices, we design an index coalescing module placed before the butterfly units (Figure 11). Based on the index of each input, a bit-count and addition operation is used to calculate the corresponding shift position. Then, a crossbar coalesces the index pairs based on the indices and shift positions. To ensure that the outputs generated from the butterfly units preserve the original order, a recover module is used before the data is written back.
V Optimizations and Co-Design
V-A Memory Sharing in Butterfly Buffers
We employ butterfly buffers to allow the overlap between data transfer and computation. To reduce the memory consumption and improve the hardware efficiency, the butterfly buffers are shared between both FFT and butterfly linear transformation. Nonetheless, as the data width of FFT is twice that of the butterfly linear transformation, different address mapping and overlapping strategies are required.
Figure 12 shows the proposed address mapping strategies for the butterfly linear transformation and the FFT. Assuming the bitwidth of real numbers is 16 bits, each input buffer is 16-bit wide. While processing butterfly linear transformations, input buffers A and B are used as two independent ping-pong banks with separate read and write ports (top right in Figure 12). In this manner, when input buffer A is used for computation, buffer B can start the input data transfer for the next batch, leading to the overlapping strategy shown in Figure 13a. While processing FFT, since the data include both real and imaginary parts, which require 32-bit read and write ports, we concatenate the lower parts of input buffers A and B as the first ping-pong bank for the storage of complex numbers. To improve the hardware efficiency, we further reuse the higher parts of both buffers as the second ping-pong bank. As the computation requires both read and write accesses, we adopt a different overlapping strategy that pipelines the output data transfer only with the input data load of the next batch (Figure 13b). By employing different address mapping and overlapping strategies for FFT and the butterfly linear transformation, we maximise the hardware efficiency and performance.
V-B Fine-Grained Pipelining between BP and AP
While executing the ABfly block, BP and AP are both in use, performing the butterfly linear transformations and the attention matrix multiplications, respectively. To further improve performance when executing the ABfly block, we employ fine-grained pipelining between BP and AP.
Figure 14 illustrates the dataflow of BP and AP. In the naive implementation, the key ($K$), value ($V$) and query ($Q$) matrices are generated sequentially by BP. After $K$, $V$ and $Q$ are computed, AP starts the computation of $QK$ and $SV$. To optimize this process, we reorder the execution sequence of the linear layers such that BP computes $K$ and $V$ at the beginning (Figure 14b). As $QK$ can be decomposed into multiple vector-matrix multiplications that multiply different rows of $Q$ with the entire matrix $K$, we can start the computation of $QK$ once the first few rows of $Q$ become available. As such, the $QK$ in AP can be pipelined with the computation of $Q$ in BP. At the same time, since $S$ is generated by the QK unit in a row-by-row fashion, we can further pipeline $SV$ with $QK$, as the computation of $SV$ can start once the first few rows of $S$ are generated by the QK unit. Assuming there are and rows in the and matrices, it takes and to compute one row in the SV and QK units, respectively. As such, the total latency reduction achieved is compared to the unoptimized non-pipelined implementation.
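A toy latency model illustrates why the fine-grained pipelining helps (all per-row latencies below are hypothetical, and the score matrix is assumed to have as many rows as Q; this is our own simplification, not the paper's analytical model):

```python
def naive_latency(n_rows, t_lin_row, t_qk_row, t_sv_row):
    """Q, K, V fully computed first, then QK, then SV, strictly in order."""
    return n_rows * (3 * t_lin_row + t_qk_row + t_sv_row)

def pipelined_latency(n_rows, t_lin_row, t_qk_row, t_sv_row):
    """K and V first; then the QK unit consumes Q rows as BP emits them,
    and the SV unit consumes score rows as the QK unit emits them."""
    t = 2 * n_rows * t_lin_row                  # K and V produced up front
    qk_free = sv_free = 0.0
    for i in range(n_rows):
        q_ready = t + (i + 1) * t_lin_row           # row i of Q leaves BP
        qk_free = max(qk_free, q_ready) + t_qk_row  # row i of S is ready
        sv_free = max(sv_free, qk_free) + t_sv_row  # row i of S*V is ready
    return sv_free

print(naive_latency(64, 1.0, 0.5, 0.5))      # 256.0
print(pipelined_latency(64, 1.0, 0.5, 0.5))  # 193.0
```

Once the pipeline fills, the attention matrix multiplications are almost entirely hidden behind the generation of Q, which is the effect the reordering in Figure 14b is designed to achieve.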
V-C Algorithm and Hardware Co-Design
The overall design space of our end-to-end system is formed by FABNet's hyperparameters and the butterfly accelerator's hardware parameters. Specifically, the joint design space consists of: 1) the algorithm parameters, i.e. the hidden size, the FFN expand ratio, the total number of blocks and the number of ABfly blocks in FABNet; and 2) the hardware parameters, i.e. the parallelism of the BUs and BEs in BP, and the parallelism of the QK and SV units in AP.
To assess the trade-off provided by each design point, we need to evaluate its algorithmic performance (e.g. an accuracy metric), its latency and its resource consumption. During the search, the algorithmic performance is obtained by training and evaluating FABNet, while the latency is estimated by a custom simulator built for our butterfly accelerator. To verify whether a design can be accommodated by the target FPGA device, we developed an analytical model that estimates the consumption of DSP blocks and on-chip memory (BRAMs). As DSPs are mainly consumed by the multipliers in AP and BP, we formulate their usage as a function of the parallelism parameters, with a constant factor reflecting the number of multipliers in each BU. The BRAM consumption is mainly due to the shortcut buffer, query buffer, key buffer and the different buffers in each BU, including the butterfly buffer and the weight buffers.
The proposed analytical resource model is only used during the design space exploration stage. At the end of the co-design process, the final performance is obtained by running synthesis and place-and-route on our design with the optimized configurations.
Figure 15 illustrates the proposed co-design approach. Given a target dataset, an FPGA device, and both algorithmic and hardware performance constraints, we employ an exhaustive grid search to traverse the joint design space and find the Pareto-optimal set of algorithmic and hardware parameters. Each design point corresponds to a different compression ratio of FABNet and level of parallelism of the butterfly accelerator, and provides different accuracy, latency and resource consumption. The final output is the Pareto front of parameters for both FABNet and our butterfly accelerator that satisfies the given set of constraints.
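The grid search over the joint space can be sketched as follows. The evaluation function below is a stand-in for FABNet training, the cycle-accurate simulator and the analytical resource model; all parameter ranges, constants and the DSP budget are made up for illustration:

```python
import itertools

def evaluate(cfg):
    """Stand-in evaluation: in the real flow, accuracy comes from training
    FABNet, latency from the simulator, and DSP usage from the analytical
    resource model. The formulas here are illustrative only."""
    hidden, ratio, blocks, ab_blocks, bus, bes = cfg
    accuracy = 0.70 + 0.01 * ab_blocks + 0.0001 * hidden
    latency = blocks * hidden * ratio / (bus * bes)
    dsps = 4 * bus * bes                 # four multipliers per BU
    return accuracy, latency, dsps

space = list(itertools.product(
    [128, 256],   # hidden size
    [2, 4],       # FFN expand ratio
    [4, 8],       # total number of blocks
    [0, 2],       # number of ABfly blocks
    [2, 4],       # BUs per BE
    [2, 4]))      # BEs

DSP_BUDGET = 48   # hypothetical device constraint
feasible = [(evaluate(c), c) for c in space if evaluate(c)[2] <= DSP_BUDGET]

def dominates(a, b):
    """a dominates b if it is no worse on accuracy (max) and latency (min),
    and strictly better on at least one of them."""
    return (a[0] >= b[0] and a[1] <= b[1]) and (a[0] > b[0] or a[1] < b[1])

pareto = [(m, c) for m, c in feasible
          if not any(dominates(m2, m) for m2, _ in feasible)]
print(len(pareto), "Pareto-optimal design points")
```

In the real flow, each accuracy evaluation involves a training run, so the grid is kept coarse and infeasible points are pruned by the resource model before any training is launched.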
VI Evaluation
VI-A Experimental Setup
Benchmarks. To evaluate the algorithmic and hardware performance of our approach on workloads with long sequences, we choose five tasks from Long Range Arena [tay2020long]: hierarchical data classification (ListOPs), byte-level text classification (Text), byte-level document retrieval (Retrieval), image classification on sequences of pixels (Image), and classification of long-range spatial dependencies (Pathfinder). The input sequence lengths of these datasets range from to .
Software Implementation. We implement the vanilla Transformer [devlin2018bert], FNet [lee2021fnet] and our FABNet models using PyTorch (v) [pytorch]. The pretrained models are obtained from Huggingface [wolf2019huggingface]. During training, the batch size is  for both the Image and Pathfinder tasks, and  for the rest of the datasets. The learning rate is set to , except for the Image and Pathfinder tasks where we use  and , respectively. Multiple Nvidia A100 and V100 GPUs are used for training. To use the FFT cores on Nvidia GPUs, the PyTorch API "rfft2" is used to implement the FFT operation required in both FNet and FABNet. The high-performance CUDA implementation [dao2020kaleidoscope] of the butterfly linear transformation is adopted to accelerate both GPU training and inference. We define two models with different default settings: FABNet-Base (, , , ) and FABNet-Large (, , , ).

Hardware Implementation. We implement our hardware accelerators in Verilog. To evaluate performance in different scenarios, two Xilinx FPGA boards are used in our experiments: a VCU128 for cloud/server scenarios and a Zynq 7045 for edge/mobile settings. Xilinx Vivado 2019.1 is used for synthesis and implementation. While the maximum clock frequency of each design depends on the particular FPGA board and resource consumption, all the FPGA designs are clocked at 200 MHz, which is below the maximum. We obtain power consumption values using the Xilinx Power Estimator (XPE) tool and develop a cycle-accurate performance model to evaluate speed, which is cross-validated against our RTL simulation results generated by Vivado (we cross-validate the functionality and correctness of our RTL design against the ground-truth results generated from PyTorch; see Appendix A-C for details). Accesses to external memory are also modeled. We use 16-bit half-precision floating-point arithmetic in our hardware designs and deploy four multipliers in each BU. As the hidden dimension is usually at most , we set the depth of the butterfly, query and key buffers to . Finally, the size of the shortcut buffers is the same as that of the butterfly buffers.
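As a minimal illustration of the Fourier mixing that "rfft2" implements, the sketch below uses NumPy's equivalent API (the PyTorch call `torch.fft.rfft2` behaves analogously); the toy tensor shapes are assumptions.

```python
import numpy as np

def fourier_mixing(x):
    """FNet-style token mixing: a 2D FFT over the (sequence, hidden) axes,
    keeping only the real part of the spectrum."""
    return np.fft.fft2(x).real

x = np.random.default_rng(0).standard_normal((8, 6))  # (seq_len, hidden)
y = fourier_mixing(x)

# For real inputs, rfft2 returns just the non-redundant half-spectrum,
# which is why the rfft2 API suffices for both FNet and FABNet.
half = np.fft.rfft2(x)
```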
VI-B Algorithmic Performance
The FBfly introduced in Section III-B is an efficient alternative to the vanilla attention block. To evaluate its algorithmic impact on end-to-end models, we take a six-layer Transformer as an example (other models, such as GPT and BERT, follow the same Transformer architecture with only the encoder or decoder kept; to eliminate the effect of different training strategies and evaluate the quality of the architecture itself, we choose the vanilla Transformer for demonstration) and compress it with different numbers of FBfly blocks, starting from the last block and moving toward the first. Figure 16 shows the accuracy results on LRA-Text and LRA-Image. Although the accuracy fluctuates with the number of compressed layers, FBfly shows higher accuracy than the non-compressed Transformer with  and  compressed layers on LRA-Text and LRA-Image, respectively, demonstrating the improved algorithmic performance of our approach on end-to-end models.
To obtain the best possible algorithmic performance from each model, we use the optimized configuration specified in [xiong2021nystromformer] for both the vanilla Transformer and FNet (as the vanilla FNet suffers a significant accuracy loss on the Retrieval task, we increase its hidden size to ). We perform a simple grid search to optimize the hyperparameters of our FABNet. Table III presents the optimized accuracy of the different models. FABNet achieves higher accuracy than both Transformer and FNet on three out of five tasks: ListOPs, Retrieval and Image. On average, FABNet matches the accuracy of the Transformer. To investigate the efficiency of FABNet, Figure 17 shows the compression rate of our optimized FABNet over the vanilla Transformer and FNet in terms of floating-point operations (FLOPs) and model size (number of parameters). Compared with the vanilla Transformer, FABNet achieves around  reduction in FLOPs and  reduction in model size, depending on the target task. Furthermore, compared with FNet, FABNet reduces the FLOPs by  and the model size by .
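The compression in the linear layers comes from the parameter counts of the two layer types; a back-of-the-envelope comparison using the standard counts for a radix-2 butterfly factorization (not figures from the paper):

```python
import math

def dense_params(n: int) -> int:
    """A dense n x n linear layer stores n^2 weights."""
    return n * n

def butterfly_params(n: int) -> int:
    """A butterfly factorization of an n x n matrix uses log2(n) factors,
    each with 2 nonzeros per row, i.e. 2 * n * log2(n) weights in total."""
    return 2 * n * int(math.log2(n))
```

For a hidden dimension of 1024, this gives 1,048,576 versus 20,480 weights, a 51.2x reduction before any FFT savings in the attention part.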
VI-C Effectiveness of Co-design
We evaluate the effectiveness of our co-design approach in finding the optimal algorithm and hardware designs. For demonstration, we use LRA-Text as the target dataset and the VCU128 FPGA as the target device. We select , ,  and  from {, , , , }, {, , }, {, } and {, }, respectively. The hardware parallelism parameters (, ,  and ) are chosen from {, , , , , , }. Figure 18 shows the points in the accuracy-latency design space. The orange line represents the accuracy-loss constraint, which requires less than % loss compared with the vanilla Transformer. The Pareto front is indicated by the brown line, and the remaining blue points represent designs with less optimized software hyperparameters (Figure 16) or hardware design parameters. Among the design points that satisfy the accuracy constraint, we choose the point with the lowest latency on the Pareto front as our point of comparison. Within our design space, the selected point is up to % more accurate than points in the same latency range and up to  faster than points in the same accuracy range, underlining the advantages of our co-design approach. The runtime of the co-design process is around  hours on our GPU server. To obtain the configurations for the rest of the datasets in LRA, we constrain the overall accuracy loss to be less than % compared to the vanilla Transformer. The final models and designs are chosen as the configurations with the highest hardware performance that do not violate the accuracy constraints. Unless mentioned otherwise, the remaining sections report the algorithmic and hardware performance using these optimized configurations.
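The final selection step, lowest latency among Pareto points meeting the accuracy constraint, can be sketched as follows (the tuple layout is an assumption):

```python
def select_design(pareto_points, min_accuracy):
    """Pick the lowest-latency (accuracy, latency) point whose accuracy
    satisfies the constraint; None if the constraint is infeasible."""
    feasible = [p for p in pareto_points if p[0] >= min_accuracy]
    return min(feasible, key=lambda p: p[1], default=None)
```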
ListOps  Text  Retrieval  Image  Pathfinder  Avg.  

Vanilla Transformer  0.373  0.637  0.783  0.379  0.709  0.576 
Vanilla FNet  0.365  0.630  0.779  0.288  0.66  0.544 
FABNet  0.374  0.626  0.801  0.398  0.679  0.576 
VI-D Comparison with Baseline Design
To evaluate the speedup brought by our algorithm (FABNet) and hardware (butterfly accelerator), we use a baseline design for comparison [devlin2018bert]. The baseline hardware uses multiple multiply-accumulate (MAC) units to accelerate the linear transforms and the matrix multiplications between query, key and value vectors. Each MAC is composed of a multiplier array followed by an adder tree. Fine-grained intra- and inter-layer pipelining techniques [song2019hypar, alwani2016fused] are used to optimize the hardware performance. We allocate the parallelism of each MAC unit according to its workload in order to achieve load-balanced execution between different pipeline stages. For a fair comparison, we implement both the baseline and the butterfly accelerator on a VCU128 FPGA using  multipliers. High-bandwidth memory (HBM) is used as the external memory. Both designs are clocked at 200 MHz. We evaluate both the base ( layers) and large ( layers) versions of each model using four different input sequence lengths (, ,  and ).
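The workload-proportional allocation used to balance the baseline's pipeline stages can be sketched as follows (a hypothetical helper, not the paper's code):

```python
def allocate_parallelism(stage_workloads, total_multipliers):
    """Assign each pipeline stage a multiplier budget proportional to its
    MAC workload, so all stages take roughly the same number of cycles."""
    total = sum(stage_workloads)
    return [
        max(1, round(total_multipliers * w / total)) for w in stage_workloads
    ]
```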
A speedup breakdown is shown in Figure 19. To demonstrate the improvement brought by our algorithm, we first evaluate both BERT-Base and FABNet on the baseline design. As the FFT is not supported by the baseline design, we implement the Fourier layers as linear layers by multiplying the input sequences with DFT matrices. Since the operation reduction brought by the algorithm is not fully exploited by the baseline design, FABNet achieves a  speedup compared to BERT. To further evaluate the improvement brought by the hardware optimizations, we evaluate FABNet on our butterfly accelerator, which shows a  speedup when compared to the baseline design. Combining both algorithm and hardware optimizations, the overall speedup of our approach over the baseline design is .
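Realizing a Fourier layer as a dense matrix multiply, which is what the baseline has to do, means applying the n x n DFT matrix at O(n^2) cost instead of the FFT's O(n log n). A NumPy check of the equivalence:

```python
import numpy as np

def dft_matrix(n: int) -> np.ndarray:
    """The dense n x n DFT matrix: entry (j, k) = exp(-2*pi*i*j*k / n)."""
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n)

x = np.random.default_rng(1).standard_normal(8)
via_matmul = dft_matrix(8) @ x   # O(n^2): how the baseline runs a Fourier layer
via_fft = np.fft.fft(x)          # O(n log n): what the butterfly engine exploits
```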
VI-E Comparison with GPU and CPU
We compare our butterfly accelerator against GPUs and CPUs in both edge and server scenarios. In the edge scenario, our butterfly accelerator is implemented on a Xilinx Zynq 7045 FPGA; DDR4 is used as the external memory and  multipliers are used for computation. An Nvidia Jetson Nano GPU and a Raspberry Pi 4 are used as the GPU and CPU platforms, respectively. In the server scenario, the butterfly accelerator is implemented on a Xilinx VCU128 FPGA; HBM is used as the external memory and the design consumes  multipliers. We use Nvidia V100 and TITAN Xp GPUs for comparison, with highly optimized CUDA implementations [dao2020kaleidoscope]. The FPGA designs are clocked at 200 MHz.
We evaluate both FABNet-Base and FABNet-Large using input sequences of length , ,  and . Figure 20 shows the results in terms of speedup and energy efficiency, the latter expressed in Giga operations per second per Watt (GOPS/Watt). In the edge scenario, our design on the Zynq 7045 FPGA achieves a  speedup over the Jetson Nano GPU and a  speedup over the Raspberry Pi 4 (on FABNet-Large with input sequences longer than , the Raspberry Pi 4 suffers from out-of-memory (OOM) issues). At the same time, our design yields  and  higher energy efficiency than the Jetson Nano and the Raspberry Pi 4, respectively. In the server scenario, our design on the VCU128 is up to  and  faster, and up to  and  more energy-efficient, than the V100 and TITAN Xp GPUs, respectively. In summary, the end-to-end speedup and energy-efficiency gains in both edge and server scenarios under different input sequences highlight the scalability of our butterfly accelerator.
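The GOPS/Watt metric used in Figure 20 is simply operations per second, scaled to Giga, divided by power; the numbers below are illustrative, not measurements:

```python
def gops_per_watt(total_ops: float, seconds: float, watts: float) -> float:
    """Energy efficiency: Giga operations per second per Watt."""
    return total_ops / seconds / 1e9 / watts
```

For example, 128 GOP executed in one second at 10 W gives 12.8 GOPS/Watt.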
Platform  # cores  Compiler  Frequency  Technology

CPU   Raspberry Pi 4      4
GPU   Nvidia V100         5,120  PyTorch
      Nvidia TITAN Xp     3,840  1.10.2
      Nvidia Jetson Nano  128
FPGA  Xilinx VCU128       -      Vivado
      Xilinx Zynq 7045    -      2019.2
VI-F Comparison with SOTA Accelerators
Accelerators  [ham2020]  SpAtten [wang2020spatten]  Sanger [sanger201micro]  Energon [zhou2021energon]  ELSA [elsa2021isca]  DOTA [dota2022asplos]  FTRANS [li2020ftrans]  Our work 
(HPCA’20)  (HPCA’21)  (MICRO’21)  (TCAD’21)  (ISCA’21)  (ASPLOS’22)  (ISLPED’20)  
Technology  ASIC (40nm)  ASIC (40nm)  ASIC (55nm)  ASIC (45nm)  ASIC (40nm)  ASIC (22nm)  FPGA (16nm)  FPGA (16nm) 
Frequency  1 GHz  170 MHz  200 MHz  
# of Multipliers  128  6531  640  
Latency (ms)  56.0  48.8  45.2  44.2  34.7  34.1  61.6  2.4 
Throughput (Pred./s)  17.86  20.49  22.12  22.62  28.82  29.32  16.23  416.66 
Power (W)  1.217  1.060  0.801  2.633  0.976  0.858  25.130  11.355 
Energy Eff. (Pred./J)  14.67  19.33  27.62  8.59  29.52  34.18  0.65  36.69 
Table V compares our butterfly accelerator with existing state-of-the-art (SOTA) accelerators in terms of speed and power consumption. Instead of comparing the effective throughput [wang2020spatten, sanger201micro], we use the end-to-end latency to represent the actual execution speed of the hardware. The energy efficiency is represented by the number of predictions per Joule (Pred./J). Following the experimental setting of [dota2022asplos], we compare against all other SOTA accelerators on the LRA-Image dataset with a one-layer vanilla Transformer. Among these accelerators, only SpAtten [wang2020spatten] and DOTA [dota2022asplos] report end-to-end performance. For the rest, which only support attention, we estimate their end-to-end performance by reusing their available multipliers to accelerate the FFN. Furthermore, in both [wang2020spatten] and [sanger201micro], the authors compare different ASIC and FPGA designs under the assumption that all the ASIC designs are clocked at 1 GHz with 128 multipliers. For a fair comparison, we follow the same assumption in our experiments. For designs with more than 128 multipliers, we follow the scaling approach of [wang2020spatten, sanger201micro] and linearly scale down their throughput to obtain their end-to-end performance. For instance, DOTA [dota2022asplos] (we assume their design is compute-bound) achieves a  speedup over an Nvidia V100 using  multipliers with  TOPS throughput; we scale down its throughput by , which leads to a  speedup over the V100. To obtain the power consumption, we use the same linear scaling approach. For instance, Sanger [sanger201micro] reports the power consumption of a design with  multipliers; we divide the power consumption of their systolic array ( mW) by , which leads to  mW. Together with the power of other modules such as preprocessing and memory, their total power consumption is  W. To match the computational capacity of the ASIC designs, we use 640 DSPs on the VCU128 FPGA.
As our FPGA-based design is clocked at 200 MHz, this ensures that we have the same 200 MHz x 640 = 128 GOPS theoretical peak performance as the ASIC designs (1 GHz x 128 = 128 GOPS). While this is a simple approximation, it allows us to compare different hardware architectures regardless of their underlying target platforms.
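The matched-capacity arithmetic works out as follows, taking peak throughput as clock frequency times multiplier count, as in the text:

```python
# Our FPGA design: 200 MHz with 640 DSP multipliers.
fpga_peak_ops = 200e6 * 640
# Scaled ASIC baseline (per [wang2020spatten, sanger201micro]): 1 GHz, 128 multipliers.
asic_peak_ops = 1e9 * 128
# Both come to 128 GOPS, so the comparison is capacity-matched.
```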
As shown in Table V, our butterfly accelerator achieves a  speedup over the FPGA-based FTRANS [li2020ftrans] while using nearly  fewer DSPs, and at the same time achieves  higher energy efficiency. Compared with the ASIC designs, our accelerator achieves a  speedup under the same computational capacity. Although our FPGA-based butterfly design consumes more power than the ASIC designs, it yields  higher energy efficiency than the other SOTA ASIC accelerators. We expect further speedup and energy-efficiency improvements when our design is implemented as an ASIC.
We attribute the performance gain of our approach over the ASIC designs to three main factors: 1) the use of FFT and butterfly factorization, which significantly reduces the computational complexity at the algorithmic level; 2) the adaptable butterfly design, which adopts a single unified hardware engine to accelerate both the FFT and the butterfly linear transformation, significantly improving hardware efficiency; and 3) the co-design process, which jointly optimizes both algorithm and hardware parameters.
VI-G Off-Chip Memory Bandwidth Analysis
To investigate the sensitivity of our design to off-chip memory bandwidth, we vary the bandwidth across , , , ,  and  GB/s, and evaluate the resulting latency using our performance model. For these experiments, we use five different designs with , ,  and  BEs executing FABNet-Large with  layers. To understand the bandwidth requirements under both short and long input lengths, we evaluate each design using three input sequence lengths (,  and ). The results are shown in Figure 21. For a small-scale design with  BEs, a bandwidth of  GB/s is enough for the design to reach its peak performance across the different input sequences. For the largest design with  BEs, the achieved performance saturates once the bandwidth reaches  GB/s.
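The saturation behavior in Figure 21 follows a roofline-style argument: latency is bound by the larger of compute time and transfer time, so raising bandwidth past the compute bound buys nothing. A minimal sketch of such a model, with all numbers illustrative:

```python
def layer_latency(flop_count, bytes_moved, peak_flops, bandwidth):
    """Roofline-style estimate: the slower of compute and memory transfer
    determines the latency (assuming the two overlap perfectly)."""
    return max(flop_count / peak_flops, bytes_moved / bandwidth)
```

Smaller designs have lower peak compute, hence longer compute time, which is why they saturate at a lower bandwidth than the largest design.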
VI-H Power and Resource Analysis
Table VI shows the power consumption breakdown (power of I/O is not included as it occupies less than % of the total power) based on the report generated by the Vivado XPE tool. We implement two designs with 120 BEs (BE120) and 40 BEs (BE40) on a VCU128 FPGA, which were used in Section VI-E and Section VI-F, respectively. In both designs, the dynamic power accounts for more than % of the total power consumption. The memory resources, including both BRAM and HBM, consume more than % of the dynamic power. Furthermore, when the number of BEs scales from 40 to 120, the power of clocking, logic & signal and DSPs increases from 2.668 W, 2.381 W and 0.338 W to 6.882 W, 7.732 W and 1.437 W, respectively.
Table VII presents the resource consumption of both the BE40 and BE120 designs on the same VCU128 FPGA. Due to the use of FFT and butterfly matrices, our FABNet is less memory-intensive than vanilla attention-based NNs. Since the theoretical memory bandwidth of a single HBM ( GB/s) already satisfies the requirement of our accelerator (Section VI-G), we use one HBM in both designs to reduce resource and power consumption. When the number of BEs decreases from 120 to 40, the BRAM usage is reduced from 978 to 338. A similar reduction can be observed in the LUT and register resources.
Dynamic (W)  Static (W)

Design  Clocking  Logic & Signal  DSP  Memory (BRAM & HBM)

BE40   used  2.668  2.381  0.338  5.325  3.368
       pct.  18.8%  16.7%  2.3%   37.5%  23.7%
BE120  used  6.882  7.732  1.437  6.142  3.665
       pct.  26.4%  29.7%  5.5%   23.6%  14.1%
LUTs  Registers  DSP48s  BRAMs  HBMs  
Available  1,303,680  2,607,360  9,024  2,016  2  
BE40  used  358,609  536,810  640  338  1 
pct.  27.5%  20.6%  7.1%  16.8%  50.0%  
BE120  used  1,034,610  1,648,695  2,880  978  1 
pct.  79.3%  63.2%  31.9%  48.5%  50.0% 
VII Related Work
Efficient Approaches for Attention. As the algorithmic complexity of the self-attention mechanism scales quadratically with the input sequence length, many sparse variants have been introduced to approximate attention-based NNs [tay2020efficient]. The sparsity patterns in these approaches are determined either dynamically [wang2020linformer, zaheer2020big, choromanski2020rethinking, tay2020sparse] or statically [beltagy2020longformer, lee2021fnet, chen2021pixelated]. Although these methods achieve high compression rates in the number of operations and parameters, the hardware cost and efficiency of mapping them onto real hardware designs are not considered in these works.
Domain-Specific Accelerators for Attention-based NNs. To better utilize existing efficient attention-based algorithmic approaches on hardware, various domain-specific hardware designs have been introduced. Ham et al. [ham2020] propose a hardware architecture called , which dynamically prunes entries based on their softmax importance. By leveraging sparsity at both the head and token levels, Wang et al. [wang2020spatten] propose SpAtten, which dynamically prunes entire rows and columns from the attention matrix. EdgeBERT [edgebert2021micro] explores layer sparsity via an entropy-based early-exit approach, which significantly reduces the computation and memory footprint. DOTA [dota2022asplos] uses low-rank linear transformations to detect and omit the weak connections in attention. ELSA [elsa2021isca] approximates the attention mechanism using a sign random projection approach. To further exploit the sparsity in attention-based NNs, Energon [zhou2021energon] adopts a low-precision NN to predict the sparsity of the attention matrix. However, the sparsity patterns generated by these approaches are unstructured, which may lead to hardware inefficiency. Sanger [sanger201micro] proposes pack-and-split modules that distribute the nonzero computation to each computation engine, achieving load-balanced execution. Although these accelerators achieve notable speedups over GPUs and CPUs when executing the attention mechanism, their end-to-end hardware performance is limited, as the approximation and acceleration of the FFN part are not considered in their designs.
Comparison to Previous Work.
As spatial-domain convolution corresponds to frequency-domain multiplication, various FFT-based hardware accelerators have been introduced to accelerate CNNs
[zhang2017frequency, abtahi2018fft], LSTMs [wang2018c, li2019rnn] and attention-based NNs [li2020ftrans]. However, these designs only use the FFT for domain transfer, while the main computation is still performed in a separate processing engine. In contrast, this paper adopts butterfly sparsity, a generalized pattern of the FFT, for the main computation of the network. Based on the proposed method, all the computations are performed in a single unified butterfly accelerator, resulting in higher hardware efficiency than previous designs.

Although butterfly sparsity has been explored in recent literature, most approaches only focus on algorithm-level optimizations. Dao et al. [dao2020kaleidoscope] demonstrate the potential of the butterfly matrix in approximating linear transformations, but its efficiency on attention matrices is not explored in their work. On the other hand, Pixelated Butterfly [chen2021pixelated] and Sparse Transformer [child2019generating] focus on adopting the butterfly pattern for attention matrices, neglecting the linear layers. Moreover, their designs require multiple sparsity patterns to compensate for the accuracy loss, which significantly complicates the hardware design. Lee-Thorp et al. [lee2021fnet] show the effectiveness of Fourier transforms in accelerating attention layers, but do not consider the linear layers, leading to scalability issues (Section II-C). Different from all these efforts, this paper exploits butterfly sparsity for both attention and linear layers via an algorithm-hardware co-design approach, in which a novel butterfly-based algorithm and an adaptable hardware accelerator are jointly designed to overcome existing limitations and push the performance limit.
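To make the "generalized pattern of FFT" concrete: each radix-2 butterfly factor mixes entry i with entry i XOR stride, so an n x n matrix is represented by log2(n) factors with 2n nonzeros each, and their product still (generically) has full support. A small NumPy sketch with random weights; the FFT is the special case where the weights are twiddle factors:

```python
import numpy as np

def butterfly_factor(n: int, stride: int, rng) -> np.ndarray:
    """One random radix-2 butterfly factor: row i is nonzero only at
    columns i and i ^ stride (exactly two nonzeros per row)."""
    B = np.zeros((n, n))
    for i in range(n):
        B[i, i] = rng.standard_normal()
        B[i, i ^ stride] = rng.standard_normal()
    return B

rng = np.random.default_rng(0)
n = 8
factors = [butterfly_factor(n, 2 ** k, rng) for k in range(3)]  # strides 1, 2, 4
product_matrix = factors[2] @ factors[1] @ factors[0]
```

Each factor holds 2n = 16 weights, 48 in total versus 64 for a dense 8 x 8 matrix, yet the product can express an all-to-all mixing of the inputs.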
VIII Conclusion
This paper proposes the end-to-end acceleration of attention-based NNs via algorithm and hardware co-design. On the algorithmic level, we propose FABNet, a hardware-friendly attention-based NN in which both the attention and linear layers are compressed using a unified butterfly sparsity pattern, allowing for scalable end-to-end acceleration. On the hardware level, we propose an adaptable butterfly accelerator that can be configured at runtime to accelerate different layers on a single unified hardware engine, achieving high hardware efficiency. Both algorithm and hardware design parameters are jointly optimized to push the performance limit. Our experiments demonstrate that our co-design approach yields up to  speedup over state-of-the-art accelerators. Furthermore, our design achieves up to  higher energy efficiency compared to optimized GPU implementations.
Acknowledgement
The support of the UK EPSRC (grant numbers EP/V028251/1, EP/L016796/1, EP/S030069/1 and EP/N031768/1), AMD and Intel is gratefully acknowledged. We also thank Alexander Mathiasen for insightful discussions on model compression.
Appendix A Artifact Appendix
A-A Abstract
This appendix summarizes the necessary information and instructions to evaluate our artifacts. The functionality of our hardware accelerator can be evaluated by running the Verilog HDL designs and SystemVerilog testbenches in the Vivado design suite. The accuracy results can be obtained by running our PyTorch programs and the associated Bash scripts. The power and resource utilization can be obtained by running synthesis and implementation using our RTL code and constraint files. The latency can be obtained by running our custom Python-based performance model. We also provide all our training log files and Vivado design reports at: https://drive.google.com/drive/folders/1jaR8gDXzO1Hu83xFg_IJOwRgoBMPnjY?usp=sharing.
A-B Artifact checklist (meta-information)

Algorithm: FABNet, an efficient model that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.

Program: Python, PyTorch, Verilog HDL

Model: We evaluate three models for comparison, including Transformer, FNet and FABNet.

Data set: the Long-Range Arena (LRA) dataset, a well-known long-sequence natural language processing (NLP) benchmark. The zip file can be downloaded from: https://storage.googleapis.com/longrangearena/lra_release.gz. The required disk space of the unzipped files is around 33 GB.

Runtime environment: Ubuntu 20.04, CUDA SDK 11.3 or higher.

Hardware: Nvidia V100 GPU, Nvidia TITAN Xp GPU, Nvidia Jetson Nano GPU, Intel Xeon Gold 6154 CPU, Raspberry Pi 4.

Metrics: Accuracy, simulated latency, resource and power consumption.

Experiments: Bash scripts and detailed instructions are provided to run experiments.

How much disk space is required (approximately)?

How much time is needed to prepare the workflow (approximately)?  hours.

How much time is needed to complete experiments (approximately)? Accuracy results: hundreds of GPU hours to obtain. Power and resource consumption: around 70 hours. Functionality of Verilog design: around 5 hours.

Publicly available? Yes.

Code licenses (if publicly available)? Yes.

Archived (provide DOI)? We will update this later.
A-C Description
A-C1 How to access
You can access our codebase from the link: https://zenodo.org/record/7010800#.YwQKCOzMJhF or https://github.com/oshxfan/Butterfly_Acc.git.
A-C2 Hardware dependencies
A GPU server is required to train our models. A CPU server is needed to run simulation, synthesis and place-and-route. Different GPUs and CPUs, such as the Nvidia Jetson Nano GPU and Raspberry Pi 4, are also required to evaluate the hardware performance of the different models.
A-C3 Software dependencies
Vivado Design Suite , PyTorch , CUDA SDK or higher, Python or higher. Other dependencies are listed in requirements.txt.
A-C4 Data sets
Five tasks in the LRA dataset: ListOPs for hierarchical data classification, Text for byte-level text classification, Retrieval for byte-level document retrieval, Image for image classification on sequences of pixels, and Pathfinder for classification of long-range spatial dependency.
A-D Installation
We provide a detailed installation guide in the README.md of the root directory.
A-E Experiment workflow
To evaluate the functionality of our hardware, perform the following steps:

Follow the experimental-setup instructions in the root directory to install the software dependencies. Install Vivado 2019.2.

Generate the test data using our Python programs.

Create a Vivado project for our hardware design.

Import all the Verilog source code and System Verilog testbenches.

Include all the necessary IPs from Vivado IP library.
We provide Vivado Tcl scripts and step-by-step instructions in ./hardware/npu_design/verilog/README.md to automate the whole process.
To reproduce the algorithmic and hardware performance, we provide all the scripts under the directory ./script_figs to generate figures and tables. The detailed instructions are provided in ./script_figs/README.md.
A-F Evaluation and expected results
We provide scripts under ./script_figs to generate all the figures and tables related to accuracy and hardware performance, including power consumption, resource utilization and simulated latency. As running all the experiments requires a few hundred GPU/CPU hours, to facilitate the artifact evaluation we refer to the following key results, which can be obtained within a reasonable time:

Vivado simulation to run different layers, such as the fast Fourier transform, butterfly matrix multiplication and layer normalization, on our Verilog hardware design. We provide SystemVerilog testbenches under ./hardware/npu_design/verilog/functionality/testbench/, and a detailed workflow in the first paragraph of Section A-E.

Power breakdown in Table VI and resource utilization in Table VII. We provide detailed instructions and Vivado Tcl scripts under ./hardware/npu_design/verilog/ to run synthesis and place&route on both VCU128 and Zynq 7045 FPGAs.
Although it takes longer to run other experiments, all the results are reproducible using our provided scripts. We provide all the GPU training log files and Vivado design reports in the link: https://drive.google.com/drive/folders/1jaR8gDXzO1Hu83xFg_IJOwRgoBMPnjY?usp=sharing.
A-G Methodology
Submission, reviewing and badging methodology: