1. Introduction
One of the most common Natural Language Processing (NLP) tasks is sequence transduction, translating an input sequence to an output sequence. Traditionally, convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been used for this task
(graves2012sequence; gehring2017convolutional). The Transformer architecture (vaswani2017attention) removes all recurrent and convolutional components and instead relies on self-attention mechanisms, typically showing better computation time and performance. A transformer is made up of two types of components: encoders and decoders. The Bidirectional Encoder Representations from Transformers (BERT) model (devlin2018bert) incorporates the encoders from transformers to generate a state-of-the-art language representation model. When it was first introduced, BERT broke records on eleven different NLP tasks. Since then, variations of BERT such as RoBERTa (liu2019roberta) and DistilBERT (sanh2019distilbert) have shown even better performance and accuracy. Implementing efficient accelerators for inference of these transformer models has proven a challenging task.

FPGA-based overlay processors have provided effective solutions for CNN and other network inference on the edge, allowing for flexibility across networks without reconfiguring the FPGA (yu2019opu; yu2020light; yu2020uni). CNNs consist mostly of linear matrix operations like convolution and pooling, and have consistently shown resilience to extreme quantization. However, BERT and its successors (liu2019roberta; sanh2019distilbert; jiao2019tinybert; lan2019albert) cannot be efficiently accelerated using existing CNN FPGA accelerators. Although they compute many matrix multiplications, transformer models also introduce complex nonlinear operations that require higher precision and are invoked with higher frequency. For instance, softmax and layer normalization (ba2016layer) are performed several times in each encoder of BERT, and block subsequent computation until they finish processing. As such, it is essential to compute them efficiently while maintaining high throughput.
These nonlinearities must be computed on-device for performance-sensitive applications, as offloading them to a CPU causes significant latency overhead and is not practical at the edge.
Most existing accelerators (mlsys2020_143; ham20203) include specialized units for computing each type of nonlinearity. For instance, FTRANS (li2020ftrans), the only previously published FPGA accelerator for transformers, includes separate softmax and layer normalization modules. Since NLP is a constantly evolving field that may introduce different types of nonlinearities, this specialized approach means that an FPGA design may need reconfiguration for additional NLP networks. It also leads to unnecessary area overhead and underutilized resources across nonlinear operations.
In this paper we propose NPE, an FPGA-based overlay processor for NLP model inference at the edge. As shown in Figure 1, unlike most other accelerators, NPE employs a common method for approximating different nonlinear functions efficiently and accurately without added overhead. The main contributions of our work are as follows:

We design a software-programmable domain-specific overlay processor with a matrix multiply unit and a multi-precision vector unit for NLP processing.

We employ a unified piecewise polynomial approach for nonlinear function approximation to allow extensibility to future nonlinear functions that may be required.

We demonstrate that our proposed accelerator can meet the real-time latency constraints for conversational AI while maintaining lower power than both GPUs and CPUs. Our design utilizes fewer FPGA resources relative to a comparable network-specific transformer FPGA accelerator.
2. Related Work
While BERT has been highly optimized at the software level for CPUs and GPUs, little work has been published on custom hardware acceleration of transformer-based networks, particularly on FPGAs. Two ASICs have recently been proposed, OPTIMUS (mlsys2020_143) and A3 (ham20203), that each accelerate different parts of transformer inference. OPTIMUS optimizes matrix multiplication for transformers by exploiting sparsity. It has dedicated exponential, divider, and square root components for nonlinearities, leading to wasted area since each is only used a small fraction of the time. A3 accelerates attention mechanisms using various approximation methods. It has a very deep pipeline that is specialized for attention, which is inefficient from an overall design point of view because the multipliers and function approximation units cannot be reused even for other matrix multiplications. In particular, A3 would need to be paired with a conventional matrix multiply accelerator to implement BERT. At any one time, either the A3 unit or the main accelerator would be computing, leading to many idle resources and wasted performance potential.
To the best of our knowledge, FTRANS (li2020ftrans)
is the only currently published FPGA accelerator for BERT and related transformer networks. FTRANS takes a very specialized approach for implementing transformers, in which it has dedicated encoder and decoder modules. Each of these modules has many specialized components. For example, the attention head module contains components for softmax and layer normalization, as well as five unique PE banks to perform each matrix multiply subroutine required (see Table
1 for the computation in an attention layer). While it is sufficient to implement transformers, FTRANS is not flexible enough to handle any non-transformer network. As the NLP state-of-the-art evolves and new network variants emerge, the FTRANS architecture may have to be extensively redesigned to adapt.

3. Background
3.1. Conversational AI
Low-power and low-latency NLP is a prerequisite for conversational AI at the network edge. Conversational AI allows people to interact naturally with machines in a dialogue. For instance, a user may ask a question to a smart speaker and expect a human-like response almost immediately. For this response to seem natural, it must be returned within 300 milliseconds. In these 300 ms, the device must perform several complex steps including speech-to-text, user authentication, and natural language processing. As a result, any single model's inference should complete within 10-15 milliseconds. Recently, many efforts have been made by the GPU community to optimize GPU implementations of BERT to reach this threshold.
3.2. The BERT Model
BERT adopts the structure of the encoders from transformers. While there are many BERT variants, the particular structure can be described by three parameters: the number of encoders L, the number of attention heads A, and the hidden layer size H. We focus on BERT-Base, which is composed of 12 encoders, each with 12 attention heads and a hidden size of 768 (L = 12, A = 12, H = 768). Figure 1 shows the general structure for BERT.
The model starts with an embedding layer that converts each input language sequence into features. For instance, an input sequence with 512 tokens would be converted to a 512 × 768 matrix, where each token is replaced by a 768-length feature vector. Here, a token refers to a few adjacent characters, where a word is made up of one or more tokens. The embedding step has negligible computation but requires substantial memory. Therefore, we assume this initial step is performed off-chip and focus on accelerating the computationally-intensive encoders.
Embedding is followed by L = 12 encoders. Each encoder performs four back-to-back operations: multi-headed self-attention, layer normalization, feed-forward layers, then layer normalization again. The encoder calculation can be decomposed into matrix-matrix multiplies followed by one of three primitives: softmax, layer normalization, and the GELU activation (hendrycks2016gaussian). Table 1 describes the computation in further detail.
Embedding
    X = Embedding(input_sequence)
Multi-Headed Self-Attention
    for (i = 0 to A - 1)
        Q_i = X·W_i^Q;  K_i = X·W_i^K;  V_i = X·W_i^V
        S_i = softmax(Q_i·K_i^T / c)·V_i, where c = sqrt(H/A) is a known constant
    MH = [S_0, ..., S_{A-1}]·W^O
Layer Normalization A
    Z = LayerNorm(X + MH)
Feed-Forward
    F = GELU(Z·W_1 + b_1)
    G = F·W_2 + b_2
Layer Normalization B
    Z' = LayerNorm(Z + G)
The BERT model is typically trained to handle sequences of up to 512 tokens. For any particular task we pick a sequence length S, padding any input sequences shorter than S and truncating any sequences longer than S. This parameter determines the complexity of the operations performed and the inference speed. Sequence lengths less than 32 are usually too small for practical applications. Typical benchmarks use sequence lengths of 64 or 128.

3.3. BERT Nonlinear Operations
3.3.1. GELU Activation
The GELU activation is defined by the following equation:

GELU(x) = x·Φ(x) = (x/2)·(1 + erf(x/√2))    (1)

It is commonly approximated using the tanh function as in Equation 2 and can also be approximated directly using a lookup table.

GELU(x) ≈ (x/2)·(1 + tanh(√(2/π)·(x + 0.044715·x³)))    (2)
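The two forms above can be checked numerically. The sketch below, in plain Python, renders Equation 1 via the error function and the tanh approximation of Equation 2; the constant 0.044715 is the standard published value.

```python
import math

def gelu_exact(x):
    # Equation 1: GELU(x) = x * Phi(x), with Phi the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Equation 2: the widely used tanh-based approximation
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))
```

Over [-5, 5] the two forms agree to within roughly 10^-3, which is why a lookup-table or piecewise rendering of either form suffices at inference precision.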
3.3.2. Layer Normalization
Layer normalization first requires computing the mean and variance of a matrix across rows. Given a matrix X of dimension N × M, we compute the mean μ_i and variance σ_i² for row i:

μ_i = (1/M)·Σ_{j=1}^{M} X_{ij},    σ_i² = (1/M)·Σ_{j=1}^{M} (X_{ij} − μ_i)²    (3)

Then, the mean and variance are applied to each element using Equation 4 to get the normalized output X̂_{ij}. Finally, each X̂_{ij} is scaled and shifted by trained vectors γ and β to get the layer normalization output Y_{ij}, as shown in Equation 5.

X̂_{ij} = (X_{ij} − μ_i) / √(σ_i² + ε)    (4)

Y_{ij} = γ_j·X̂_{ij} + β_j    (5)
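As a reference for Equations 3-5, per-row layer normalization can be sketched in a few lines of Python; the epsilon term here is an assumption for numerical safety and is not part of the equations above.

```python
import math

def layer_norm_row(row, gamma, beta, eps=1e-12):
    # Equation 3: mean and variance of one row
    m = len(row)
    mu = sum(row) / m
    var = sum((x - mu) ** 2 for x in row) / m
    # Equation 4 (normalize), then Equation 5 (scale and shift)
    return [g * ((x - mu) / math.sqrt(var + eps)) + b
            for x, g, b in zip(row, gamma, beta)]
```

With γ = 1 and β = 0, the output row has zero mean and unit variance, as expected from Equation 4.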
3.3.3. Softmax
The definition of softmax is shown in Equation 6. Softmax can be difficult to implement in hardware because of the exponential and division operations. It can be directly realized using dedicated exponential and division units, at the cost of underutilized resources and extra area overhead.
softmax(x_i) = e^{x_i} / Σ_{j} e^{x_j}    (6)
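A direct software rendering of Equation 6 follows; subtracting the maximum before exponentiation is a standard stabilization step that leaves the result mathematically unchanged.

```python
import math

def softmax(xs):
    # subtract max(xs) for numerical stability:
    # e^(x-c) / sum(e^(x-c)) equals e^x / sum(e^x)
    c = max(xs)
    exps = [math.exp(x - c) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]
```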
3.4. Throughput Requirements of Nonlinear Operations
Since matrix multiply operations depend on the results from preceding nonlinear operations, nonlinear processing needs to have high enough throughput to not add significant latency overhead. The throughput requirement for nonlinear operations can be determined by the number of elements we need to process and the cycles available for processing. We define the cycle budget for a nonlinear operation as the number of cycles that the preceding matrix multiply takes to process, given the matrix multiply dimensions and the number of multiplies per cycle. For BERT-Base, given 2048 multiplies per cycle and a sequence length of 512, we show the throughput requirements for each nonlinearity in Table 2. The layer normalization after attention and the one after GELU are shown separately, since they have different throughput requirements.
| Nonlinearity | N | M | Cycle Budget | Throughput (elems/cycle) | % of Overall Cycles |
| Softmax | 512 | 512 | 8,192 | 32 | 5 |
| Layer Norm A | 512 | 768 | 147,456 | 2.7 | 7.5 |
| GELU | 512 | 3072 | 589,824 | 2.7 | 30 |
| Layer Norm B | 512 | 768 | 589,824 | 0.7 | 30 |
The final column of Table 2 indicates the percentage of overall cycles that depend on each nonlinear computation. Specifically, this value shows the percentage of overall matrix multiply cycles that are followed by the nonlinearity. For instance, we see that 30% of the overall cycle time is spent computing the matrix multiply inputs to GELU operations.
From Table 2, we see that layer normalization and GELU both require an average throughput of less than three elements per cycle, while softmax requires 32 elements per cycle to throughput-match the matrix multiply operations. To put this in perspective, this means that without additional optimizations we would need to perform softmax on a vector of 512 elements in just 16 cycles.
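The Table 2 figures can be reproduced from the encoder dimensions. The sketch below assumes the preceding matrix multiply shapes implied by the BERT-Base encoder (per-head Q_i·K_i^T before softmax, the H×H output projection before Layer Norm A, the H×4H feed-forward expansion before GELU, and the 4H×H contraction before Layer Norm B).

```python
def cycle_budget(n, m, k, mults_per_cycle=2048):
    # cycles the preceding (n x k) @ (k x m) matrix multiply occupies
    return n * m * k // mults_per_cycle

def required_throughput(n, m, budget):
    # elements/cycle the nonlinearity must sustain to avoid stalling
    return n * m / budget

S, H, A = 512, 768, 12                    # sequence length, hidden size, heads
b_softmax = cycle_budget(S, S, H // A)    # per-head Q_i @ K_i^T
b_ln_a    = cycle_budget(S, H, H)         # attention output projection
b_gelu    = cycle_budget(S, 4 * H, H)     # feed-forward expansion
b_ln_b    = cycle_budget(S, H, 4 * H)     # feed-forward contraction
```

Running this yields the cycle budgets 8,192 / 147,456 / 589,824 / 589,824 and throughputs 32 / 2.7 / 2.7 / 0.7, matching Table 2.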
4. Nonlinearity Processing
4.1. Nonlinear Function Approximation
There are dozens of ways to approximate nonlinear functions. We first discuss several specialized approaches that are commonly used for different nonlinear functions. We then discuss our efficient uniform approach to approximate several types of nonlinearities.
4.1.1. Specialized Approaches
Approximating complex nonlinear functions has been a field of interest for many decades. Common functions like softmax and square roots have been implemented in many different ways over the years using mathematical models with varying computational complexities and accuracies. The square root function, for example, can be approximated using a Taylor Series Expansion (kwon2009floating), the Chebyshev approximation (sawada2002mechanical), the NewtonRaphson algorithm (cornea1999correctness), and the CORDIC algorithm (volder1959cordic; andraka1998survey). It can also be approximated directly using a lookup table (LUT). Softmax is also often implemented using one or more LUTs for exponential calculation (yuan2016efficient; du2019efficient).
For reasonable accuracy with the lookupbased approaches, the LUTs typically are large and require a lot of memory. In contrast, piecewise linear approaches have shown good accuracy while maintaining very small lookup tables (frenzen2010number).
4.1.2. Unified Nonlinearity Processing
As previously shown, each type of nonlinearity can be approximated in several ways. However, taking a separate approach for each function can lead to significantly more area and more underutilized resources. For instance, most transformer accelerators have dedicated exponential units and dividers for softmax and dedicated square root units for layer normalization.
We take a more unified approach that uses only piecewise polynomial approximation along with simple vector operations to process various nonlinearities. Some functions, like GELU, can directly be approximated using piecewise approximation. Others, like softmax, may use piecewise approximation for intermediate functions like exponentials or square roots and then use adders, multipliers, etc. for the remainder of the computation. To implement this unified approach, our hardware is optimized for piecewise function approximation as well as other operations like multiplication, addition/subtraction, and vector reduction (e.g., max, sum).
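To make the unified approach concrete, here is a sketch of softmax built only from simple vector primitives (max, subtract, sum, divide) plus a piecewise-linear exponential. The knot positions below are illustrative placeholders, not NPE's actual segmentation.

```python
import math
import bisect

# hypothetical knot samples for e^x on [-8, 0]; a real design would
# choose these with an optimal-segmentation algorithm
KNOTS = [-8.0, -4.0, -2.0, -1.0, -0.5, 0.0]
NODALS = [math.exp(k) for k in KNOTS]

def pwl_exp(x):
    # piecewise-linear e^x for x <= 0 (inputs are max-subtracted)
    if x <= KNOTS[0]:
        return 0.0
    i = bisect.bisect_right(KNOTS, x) - 1
    if i >= len(KNOTS) - 1:
        return NODALS[-1]
    x0, x1 = KNOTS[i], KNOTS[i + 1]
    y0, y1 = NODALS[i], NODALS[i + 1]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

def softmax_pwl(xs):
    # only simple vector ops remain: max, subtract, PWL lookup, sum, divide
    c = max(xs)
    exps = [pwl_exp(x - c) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]
```

Even with only six knots, the result tracks exact softmax to within a few percent; a hardware design would use more (or better-placed) segments to drive the error below quantization noise.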
4.1.3. Multiprecision Computation
Unlike matrix multiplies and other linear operations, nonlinear computation cannot be fully quantized to lowprecision data types (such as 8bit fixed point numbers) while maintaining network accuracy. Instead, different steps of the nonlinear computation require varying levels of precision. For instance, layer normalization may take in a 16bit fixed point number as an input but will need to process the variance calculations using 32 or even 64bit fixed point. As such, all supported operations (arithmetic, reduction, etc.) need to be flexible enough to operate on several different data types.
4.2. Piecewise Function Approximation
4.2.1. General Piecewise Approximations
Piecewise function approximation involves breaking up a nonlinear function of interest into smaller regions along a chosen interval, each approximated by a simpler function such as a line or polynomial. Continuous Piecewise Linear (CPWL) approximations, in particular, use lines for approximation where the end point of each region is the same as the start point of the next region. Figure 2 shows an approximation of a nonlinear function with just three segments. The starting x and y values for each segment are called knot samples and nodal values respectively.
Typically, approximation regions are chosen to be either uniform width or nonuniform width. Uniform-width segments tend to be easier to evaluate but require significantly more segments. Nonuniform-width segments can better approximate functions with fewer segments but require more complex evaluation. The advantage of nonuniform segments in terms of the total number of segments required becomes even more significant when dealing with functions that have large mostly-linear regions. In particular, consider functions like GELU(x), which is nearly linear except for a very small nonlinear region near zero. Uniform segmentation would require orders of magnitude more segments than nonuniform segmentation to approximate the linear regions, leading to significantly more memory usage for the same maximum approximation error.
We can also use a piecewise polynomial approach to approximate some functions, which takes more cycles to compute but gives higher accuracy. For most functions, including those needed for BERT, piecewise linear is enough. We discuss our nonuniform continuous piecewise linear segmentation in the following section.
4.2.2. Piecewise Linear Approach
The main considerations for piecewise linear approximations are the number of segments, segment widths, and the best approximation for each segment. Frenzen et al. (frenzen2010number) explored the number of segments needed for piecewise linear approximation of various common functions using different segmentation techniques. Based on their results, we see that it is possible to maintain high accuracy with very few segments (even fewer than 10, depending on accuracy constraints). We use a segmentation approach based on (berjon2015optimal), which describes one method for finding an optimal partition with nonuniform segmentation. We find that even suboptimal segmentation can result in no accuracy loss for BERT inference on the test set. Algorithm 1 gives the general computation for evaluating a continuous piecewise linear function.
Finding the subinterval from Step 1 of Algorithm 1 adds additional complexity. If using uniform segmentation, the segment number can be found simply by using the upper bits of the input. Since we use nonuniform segmentation, we implement more complex segment address calculation like that used in (lee2003non). In software, this segment address calculation could be performed using Algorithm 2. On a CPU or GPU, these calculations would take tens of instructions. Meanwhile, specialized hardware such as a priority encoder can perform this task on an FPGA or ASIC within a single clock cycle.
By itself, piecewise linear approximation is not always accurate enough to be used without a large number of segments. However, with normalization and range limiting of the fixed point input and subsequent denormalization of the output, this approximation can maintain high accuracy with only a few segments.
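A software sketch of Algorithms 1 and 2 combined: slopes are precomputed per segment, and the nonuniform segment index is found by a sequential scan, which stands in for the single-cycle priority encoder mentioned above. The example knots approximating √x are illustrative only.

```python
def cpwl_make(knots, nodals):
    # precompute one slope per segment (part of Algorithm 1)
    slopes = [(nodals[i + 1] - nodals[i]) / (knots[i + 1] - knots[i])
              for i in range(len(knots) - 1)]

    def evaluate(x):
        # range-limit the input to the approximation interval
        x = min(max(x, knots[0]), knots[-1])
        # segment address calculation (Algorithm 2); in hardware this
        # is a priority encoder over parallel comparators
        i = 0
        while i < len(knots) - 2 and x >= knots[i + 1]:
            i += 1
        # Algorithm 1: y = y_i + slope_i * (x - x_i)
        return nodals[i] + slopes[i] * (x - knots[i])

    return evaluate

# illustrative nonuniform 3-segment approximation of sqrt(x) on [0, 4]
sqrt_pwl = cpwl_make([0.0, 0.25, 1.0, 4.0], [0.0, 0.5, 1.0, 2.0])
```

Note how the segments are packed tightly near zero, where √x curves sharply, and stretched over the nearly linear tail, which is exactly the payoff of nonuniform segmentation described above.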
5. Accelerator Architecture
In this section, we present the overall architecture of NPE, our FPGA overlay processor for NLP. NPE adopts several components from OPU (yu2019opu) including the matrix multiply unit (MMU). Figure 3 shows the architecture of the accelerator.
5.1. Accelerator Modules
The NPE architecture consists of the following: instruction control unit (ICU), memory read unit (MRU), memory write unit (MWU), matrix multiply unit (MMU), and the nonlinear vector unit (NVU). The ICU sends instructions to each of the modules. Data is exchanged between the functional units through a series of memory buffers.
5.2. Data Flow
The memory read unit (MRU) reads data from external memory and writes it to the MMU’s input buffer (MIB). The MMU reads from the MIB and uses its scratchpad memory (MMEM) both for intermediate computations and for its final output. The NVU uses its scratchpad memory (NMEM) for its intermediate operations. The NVU accesses MMEM for its input data and deposits the final results either to the MIB (MMU input buffer) or to the NMEM (NVU scratchpad) for subsequent retrieval by the MWU, which writes these results to the external memory. The operations of the MRU, MWU, MMU and the NVU are pipelined to hide offchip memory latency and to allow for maximum concurrency between these units.
5.3. Matrix Multiply Unit (MMU)
The MMU computation consists of five stages: data selection, inner product, adder tree reduction, accumulation, and quantization. Data selection loads the necessary matrix operands from the input buffers and rearranges them as needed. The matrix multiplication is implemented using an array of PEs, where each PE performs an inner product using an array of multipliers followed by an adder tree. Then, the PE outputs can be further summed up by another adder tree, with the number of final outputs dependent on the matrix multiplication dimensions. Finally, the inner product outputs are accumulated and then quantized.
The MMU implementation has 128 PEs with 16 multiply-accumulate units each (for a total of 2048 multipliers). These multiply-accumulate units map to the FPGA's DSP slices in our implementation. There are two versions of the NPE design, one supporting 8-bit matrix multiplies and the other supporting 16-bit matrix multiplies. The 16-bit version uses each DSP slice for a single element multiply, for a total throughput of 2048 multiplies per cycle. The 8-bit version decomposes each DSP slice into two 8-bit multipliers with one input in common (due to DSP slice resource constraints). On the same board, we can get a throughput of 4096 8-bit multiplies per cycle with 2048 DSP slices.
5.4. Matrix Multiply Quantization
The matrix multiply for transformer networks can be quantized to 16-bit fixed point with no perceptible loss in accuracy. Several works (bhandare2019efficient; zafrir2019q8bert) have also shown the feasibility of 8-bit quantization of transformers. As shown in (zafrir2019q8bert), BERT can be implemented with 8-bit matrix multiplies with minimal accuracy loss. For this reason, while we support both 16- and 8-bit matrix multiplies, we plan to use 8-bit matrix multiply for our implementation. For both the 16-bit and the 8-bit matrix multiplies, the output of the MMU is written out to the MMEM (MMU scratchpad memory) as 16-bit fixed point values. Consequently, the NVU always consumes 16-bit fixed point values and generates either 8-bit or 16-bit results for the subsequent matrix multiplies.
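For illustration, symmetric per-tensor quantization, one common scheme for 8-bit transformer inference (the source does not specify NPE's exact calibration method), can be sketched as:

```python
def quantize(vals, bits=8):
    # symmetric quantization: the scale maps the largest magnitude
    # to the top of the signed `bits`-bit integer range
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in vals) / qmax
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in vals]
    return q, scale

def dequantize(q, scale):
    # recover approximate real values from the integers
    return [v * scale for v in q]
```

Round-trip error is bounded by half the scale step, which is the sense in which 8-bit matrix multiplies can preserve accuracy when activation ranges are well-behaved.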
5.5. Software Simulation
We simulate the architecture in software to validate the accuracy constraints of endtoend BERT model inference. In order to model the overall accuracy loss, our simulations take into account the NPE modules used as well as the data quantization at each intermediate step. In particular, we model the fixedpoint quantization effects of matrix multiplication and the approximation errors from our unified nonlinear processing approach, including piecewise linear approximations for various nonlinear operations.
6. Nonlinear Vector Unit (NVU) Architecture
The key novel component of NPE’s design is the nonlinear vector unit (NVU) that handles highthroughput nonlinear function computation with minimal resource overhead. The NVU is a dataparallel vector load/store architecture that performs arithmetic and logical operations on multiple elements in each clock cycle. The arithmetic performance of the NVU is defined by the width of the vector registers and the degree of parallelism in the arithmetic datapath.
The NVU, shown in Figure 4, comprises a vector load/store unit (LSU), a vector register file (VRF), a vector compute unit (VCU), a scalar register file (SRF), a scalar compute unit (SCU), and a microprogram controller (MPC). The VRF, the VCU, and the LSU all operate on fixed-length vectors. The LSU accesses vectors resident in either the NVU memory (NMEM), the MMU memory (MMEM), or the MMU input buffer (MIB). The vector compute unit (VCU) and scalar compute unit (SCU) are designed to handle 16-, 32-, and 64-bit operations to allow for higher bit precision in intermediate calculations when required.
6.1. Microprogram Controller (MPC)
The instruction control unit (ICU) of NPE sends instructions to all NPE functional units, including the MMU and the NVU. These instructions are designed to operate over many clock cycles. The NVU interprets these highlevel instructions to implement nonlinear functions like softmax, layer normalization, GELU, etc.
The microprogram controller (MPC) of the NVU is responsible for breaking down ICU instructions into a sequence of VLIW instruction bundles, which are used to control the LSU, the VCU, and the SCU. Since the VCU can execute up to three operations concurrently, as many as five instructions can be dispatched by the MPC in each microinstruction cycle. The microprogram corresponding to each instruction is stored in the microprogram memory.
6.2. Vector Register File (VRF)
The vector register file (VRF) provides a high-throughput scratchpad for the vector compute unit (VCU). It allows the VCU to exploit locality of data across operations. The NVU requires a register file with 8 logical read and write ports in order to maintain concurrency of the various VCU functional units. However, we implement the VRF using dual-port (1R1W) BRAMs with a combination of time sharing and data duplication. The resulting register file operates at the desired frequency of 200 MHz while requiring less than 5% of overall BRAM resources.
6.3. NVU Memory (NMEM)
The NVU memory subsystem (NMEM) serves primarily as a scratchpad for the NVU. The memory write unit (MWU) can also read results from the NMEM through a dedicated read port. The NMEM uses multiple banks of single-port memories, which are designed to support loads and stores of entire vectors in a single clock cycle. The NMEM can be implemented efficiently using single-port BRAMs on the FPGA. It has arbitration logic to allow access from both the NVU and the MWU, and it also contains logic for data permutation, which is required for strided and indexed accesses.
6.4. Vector Load/Store Unit (LSU)
The vector load/store unit performs fixedlength vector transfers between the vector register file and either the NMEM, the MMEM, or the MMU input buffer (MIB). The LSU supports nonstrided, strided, and indexed loads and stores from the NMEM.
6.5. Vector Compute Unit (VCU)
The vector compute unit (VCU) supports a full complement of vector and intra-vector arithmetic (add, subtract, multiply, shift, sum, dot product, etc.), logical, compare (<, >, =, etc.), min/max, and permute operations. In addition, specialized capabilities for piecewise polynomial approximation have been added, allowing the NVU to approximate nonlinear functions many times faster than traditional SIMD processors. The VCU reads vectors from the VRF or the scalar register file (SRF) and, depending on the operation, writes results back to the VRF or SRF. Vector reduction results are written to the SRF, and vector-scalar operations fetch scalar operands from the SRF.
The VCU implements multi-precision arithmetic while sharing resources between 8-, 16-, 32-, and 64-bit fixed point data types. While the MMU only needs to handle a single data type (either 8- or 16-bit fixed point), NVU operations typically involve mixed precision, including 32- and 64-bit representations for intermediate calculations.
The VCU can execute up to three operations concurrently.
6.6. Scalar Compute Unit (SCU)
The NVU also includes a scalar compute unit (SCU), which operates out of a scalar register file (SRF). Like the VCU, the SCU can handle 8-, 16-, 32-, and 64-bit operations. The concurrent operation of vector and scalar functional units allows for the computation of nonlinear functions of vector reduce operations. For example, the NVU is capable of performing an inner product followed by the reciprocal square-root operation needed for layer normalization variance calculations while maintaining full throughput.
6.7. Scalable Architecture of the NVU
All operations of the NVU are performed on fixed bit-width vector registers which make up the VRF. There are 32 vector registers in the VRF. All NVU microinstructions reference source and destination vector registers. The overall performance of the NVU can be described by a single parameter, the vector register width (W). The number of elements processed per microinstruction depends on the element size (8, 16, 32, or 64 bits). For example, a W of 256 can hold 32 elements of 8 bits, 16 elements of 16 bits, etc. The NVU's area and performance depend on the number of elements that can be processed per microinstruction.
7. Throughput Analysis
For an initial analysis of NPE's architecture, we examine the throughput requirements for BERT on our architecture. While the MMU has both 8-bit and 16-bit variations, we focus on NPE with the 16-bit MMU and pair it with NVUs of different widths. For the remainder of this work, we describe the NVU variants based on the vector register width W: NVU-W refers to the NVU with W-bit vector registers. For instance, NVU256 means that a vector register is 256 bits. We focus on four NVU sizes (NVU256, NVU512, NVU1024, and NVU2048), comparing the throughput of each NVU variant to the required BERT throughputs.
7.1. NVU Throughput
Table 3 shows the individual NVU performance results for each nonlinear function required by BERT. To normalize the results across NVU vector register widths, we give the number of cycles needed to process a 512-element array of 16-bit values and the corresponding throughput in elements per cycle.
| NVU Width | Softmax | Layer Norm | GELU |
| NVU256 | 1.64 (312) | 0.64 (804) | 4 (128) |
| NVU512 | 3.05 (168) | 1.29 (396) | 8 (64) |
| NVU1024 | 4.74 (108) | 2.42 (212) | 16 (32) |
| NVU2048 | 6.40 (80) | 4.13 (124) | 32 (16) |
7.2. BERT Throughput Requirements
We analyze the effective throughput requirement for each nonlinearity in BERT. This builds on the analysis in Table 2, where we established that the worst-case throughput requirement for softmax is 32 elements per cycle to keep up with the MMU. However, here we demonstrate that we can relax the worst-case requirement by overlapping independent computations. Then, we show the final requirements after optimization.
7.2.1. Overlapping Computation
In most cases, each stage of the transformer network computation depends on the results of the preceding stage. For instance, the feed-forward layer has a matrix multiply with GELU followed by a matrix multiply with Layer Normalization. The GELU computation must be finished before the next matrix multiply is started. This holds true for all Layer Normalization and GELU operations in BERT. This means that GELU and Layer Normalization must be rate-matched with the MMU in order to avoid stalling the MMU.
Fortunately, this is not the case with softmax and parts of the attention mechanism. We can reduce the throughput requirements of softmax by overlapping it with independent matrix computations in multi-headed self-attention. For example, the softmax computation in Table 1 can be overlapped with the independent matrix multiplication V_i = X·W_i^V. Since computation for each attention head is independent, we can also overlap softmax for head i with some part of the computation for the next head i+1. In this way, the throughput requirements of the softmax computation can be relaxed severalfold.
7.2.2. Optimized BERT Throughput Requirements
Taking overlapping computations into account, we see the throughput requirements shown in Table 4. In general, the throughput requirements of matrix multiplies in BERT do not depend on BERT network sequence length. However, for some of the attention computation, there is a dependence on sequence length. This only affects nonlinearity throughput when we overlap independent matrix multiplies with softmax, which is why we see a throughput dependence for softmax on sequence length in Table 4.
Although softmax tends to have very high throughput requirements for higher sequence lengths, it only accounts for a small percentage of overall computation (see Table 2). Layer Normalization and GELU are needed for approximately two thirds of the total computation time but have relatively lower throughput requirements. If the NVU’s softmax computation cannot match MMU throughput, we may still only get a small inference time overhead. Meanwhile, if layer normalization or GELU cannot be throughputmatched, we would expect a more noticeable inference time overhead.
| Sequence Length | Softmax | Layer Norm A | Layer Norm B | GELU |
| 64 | 0.92 | 2.6 | 0.6 | 2.6 |
| 128 | 1.79 | 2.6 | 0.6 | 2.6 |
| 256 | 3.39 | 2.6 | 0.6 | 2.6 |
| 512 | 6.29 | 2.6 | 0.6 | 2.6 |
By comparing results from Tables 3 and 4, we see that NVU2048 is more than capable of keeping up with the 16-bit MMU. In fact, NVU1024 can approach or exceed most of the requirements, except softmax with sequence length 512. Given that softmax only takes up a few percent of computation, and a sequence length of 512 is not needed for most applications, there would be only marginal benefit to using NVU2048 over NVU1024. For this reason, we only analyze NVU2048 for its inference time, as a comparison metric indicating ideal NVU performance (where the MMU never stalls). Similarly, with the 8-bit MMU, NVU2048 also nearly matches the matrix multiplier throughput and can be used as a reference point.
8. Evaluation
We implement NPE at 200 MHz on the Xilinx Zynq Z7100 FPGA, which has 2,020 DSP slices, 26.5 Mb of RAM, and 277k LUTs. We examine several NVU variants (NVU256, NVU512, and NVU1024), each of which can be paired with either the 8-bit or 16-bit MMU. We report FPGA utilization for each NVU variant separately, then show the overall FPGA utilization of each of the six resulting NPE configurations. We calculate software-simulated BERT inference times for these six configurations and compare them to the corresponding NVU2048 reference inference time. Finally, we evaluate NPE's performance on BERT inference relative to other implementations.
8.1. FPGA Utilization
Table 5. FPGA utilization of the NVU components for each NVU variant.

Module  | Variant | LUT            | FF             | DSP Slices | BRAM      | F7 Mux       | F8 Mux
NMEM    | NVU256  | 776 (0.28%)    | 1234 (0.22%)   | 0          | 4 (0.53%)  | 0            | 0
VRF     | NVU256  | 156 (0.06%)    | 513 (0.09%)    | 0          | 4 (0.53%)  | 0            | 0
VCU+SCU | NVU256  | 10328 (3.72%)  | 1753 (0.32%)   | 8 (0.4%)   | 0          | 3 (<0.01%)   | 0
Total   | NVU256  | 11260 (4.06%)  | 3500 (0.63%)   | 8 (0.4%)   | 8 (1.06%)  | 3 (<0.01%)   | 0
NMEM    | NVU512  | 1330 (0.48%)   | 2268 (0.41%)   | 0          | 8 (1.06%)  | 0            | 0
VRF     | NVU512  | 306 (0.11%)    | 1025 (0.18%)   | 0          | 8 (1.06%)  | 0            | 0
VCU+SCU | NVU512  | 19549 (7.05%)  | 3441 (0.62%)   | 16 (0.79%) | 0          | 12 (<0.01%)  | 5 (<0.01%)
Total   | NVU512  | 21185 (7.64%)  | 6734 (1.21%)   | 16 (0.79%) | 16 (2.1%)  | 12 (<0.01%)  | 5 (<0.01%)
NMEM    | NVU1024 | 2902 (1.05%)   | 4377 (0.79%)   | 0          | 16 (2.1%)  | 350 (0.25%)  | 0
VRF     | NVU1024 | 607 (0.22%)    | 2049 (0.37%)   | 0          | 16 (2.1%)  | 0            | 0
VCU+SCU | NVU1024 | 34423 (12.41%) | 6984 (1.26%)   | 32 (1.58%) | 0          | 37 (0.03%)   | 5 (<0.01%)
Total   | NVU1024 | 37932 (13.67%) | 13410 (2.42%)  | 32 (1.58%) | 32 (4.2%)  | 387 (0.28%)  | 5 (<0.01%)
In Table 5, we show FPGA utilization results for the individual components of the NVU: the NVU memory (NMEM), the vector register file (VRF), and the compute units (VCU and SCU). Then, in Table 6, we give the cumulative FPGA resource utilization of NPE with each NVU variant, for both the 8-bit and 16-bit MMUs.
Table 6. Overall NPE FPGA utilization for each MMU precision and NVU variant.

MMU    | NVU     | LUT             | FF              | DSP Slices     | BRAM
8-bit  | NVU256  | 165776 (59.76%) | 341151 (61.49%) | 1994 (98.71%)  | 345 (45.70%)
8-bit  | NVU512  | 175701 (63.33%) | 344385 (62.07%) | 2002 (99.10%)  | 353 (46.75%)
8-bit  | NVU1024 | 192448 (69.37%) | 351061 (63.28%) | 2018 (99.90%)  | 369 (48.87%)
16-bit | NVU256  | 129231 (46.59%) | 250738 (45.19%) | 1995 (98.76%)  | 502.5 (66.56%)
16-bit | NVU512  | 139156 (50.16%) | 253972 (45.78%) | 2003 (99.16%)  | 510.5 (67.61%)
16-bit | NVU1024 | 155903 (56.20%) | 260648 (46.98%) | 2019 (99.95%)  | 526.5 (69.73%)
From these results, we see that all the NVU variants are small relative to the overall NPE design. Even NVU1024 uses less than three percent each of the overall flip-flop, DSP slice, and BRAM resources. The larger NVUs do use 7–15% of the overall LUT resources, much of which is due to the muxes required for shifting. Even so, the overall design still has many LUTs to spare.
8.2. Inference Time
The system simulation gives a cycle-count estimate for a single inference of BERT, which can be used to determine inference time given the operating clock speed. We compare the inference times of NPE with the 16-bit MMU and NVU256, NVU512, and NVU1024 to the inference time with NVU2048. For NPE with the 16-bit MMU, NVU2048 gives the ideal inference time because it always exceeds the MMU throughput.

Figure 5 shows the percent inference-time overhead of the different NVU sizes for NPE with the 16-bit MMU. GELU adds no latency overhead for any sequence length. Overall, NVU1024 has very little overhead compared to the baseline; the small difference arises because its layer normalization throughput is slightly lower than what is needed to match the MMU. For smaller sequence lengths, NVU1024 adds less than 1% latency overhead, NVU512 adds around 10%, and NVU256 adds about 30%. Depending on the use case, these overheads may be acceptable given the reduced area costs. For higher sequence lengths, however, NVU256 begins to show very large overheads of 53% and 97%.
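The conversion from a simulated cycle count to the reported numbers is straightforward. The cycle counts below are hypothetical placeholders for illustration, not our measured values:

```python
def inference_time_ms(cycles, clock_hz=200e6):
    """Inference time in milliseconds at NPE's 200 MHz clock."""
    return cycles / clock_hz * 1e3

def overhead_percent(cycles, baseline_cycles):
    """Percent inference-time overhead relative to a baseline (e.g. NVU2048)."""
    return 100.0 * (cycles - baseline_cycles) / baseline_cycles

# Hypothetical cycle counts, for illustration only:
print(inference_time_ms(2_000_000))            # 10.0 ms
print(overhead_percent(2_600_000, 2_000_000))  # 30.0 (% overhead)
```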
Note that inference-time overhead alone is not the only criterion for evaluating these options. Even larger overheads may be acceptable, as long as the overall inference time, including overhead, stays within the target for conversational AI. For this reason, we compare actual inference times below.
Figure 6 shows the BERT inference time for NPE with the 16-bit and 8-bit MMUs and each NVU variant. NPE with the 8-bit MMU achieves sub-10 ms inference at sequence length 64 even with NVU512, though inference time increases proportionally with sequence length. For typical applications, a sequence length of 64 is sufficient. Conversational AI requires inference within 10–15 ms, a target we clearly meet with NVU512 and NVU1024 for both the 8-bit and 16-bit MMUs.
8.3. Comparison with CPU, GPU, and FPGA
The authors of the FTRANS transformer FPGA accelerator (li2020ftrans) provide inference benchmarks using RoBERTa, an optimized version of BERT with the same model architecture but more thorough training. Since BERT and RoBERTa share the same architecture, we can compare our BERT accelerator's inference times with their RoBERTa benchmarks. We benchmark NPE with the 16-bit and 8-bit MMUs, each with NVU1024, on the Zynq Z7100. The devices used in the comparison are an i7-8700k CPU, an RTX 5000 GPU, and an Ultrascale+ VCU118 FPGA (for FTRANS). The RTX 5000 has more compute units than our Zynq FPGA and runs at a higher clock frequency. The VCU118 has 6,840 DSP slices and 2,586k logic cells, about 3.4× the DSP slices of our board and many times its logic cells. The inference times and relative speedups are shown in Table 7, along with the approximate power consumption of each device.
Table 7. BERT inference comparison across devices.

                      | i7-8700k | RTX 5000 | FTRANS     | NPE (16-bit) | NPE (8-bit)
Throughput            | 3.76     | 57.46    | 101.79     | 73.69        | 135.14
Relative Speedup      | —        | —        | (baseline) | 0.72×        | 1.33×
DSP Slices Utilized   | —        | —        | 6,840      | 2,020        | 2,020
Throughput per DSP    | —        | —        | 0.0148     | 0.0365       | 0.0669
Approximate Power (W) | 80       | 120      | 25         | 20           | 20
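The throughput-per-DSP row in Table 7 follows directly from dividing each design's throughput by the DSP slices it uses (the results match the table up to rounding):

```python
# Throughput and DSP slice counts taken from Table 7.
configs = {
    "FTRANS": (101.79, 6840),
    "NPE (16-bit)": (73.69, 2020),
    "NPE (8-bit)": (135.14, 2020),
}

per_dsp = {name: tp / dsp for name, (tp, dsp) in configs.items()}
for name, value in per_dsp.items():
    print(f"{name}: {value:.4f} throughput per DSP slice")
```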
From the results, we see that the CPU is far too slow for conversational AI. The RTX 5000 GPU comes close but does not meet the conversational AI latency targets; a larger or more optimized GPU could meet them, albeit with much higher power consumption. Both the FTRANS and NPE implementations stay within the range needed for conversational AI.
8.4. Benchmarks Discussion
The biggest benefit of an FPGA implementation of BERT over CPU and GPU is power consumption. From Table 7, we see about a 4× power benefit over the CPU and a 6× benefit over the GPU. This difference is especially important for NLP processing on edge devices. While FTRANS and NPE have comparable performance and power, FTRANS uses considerably more resources than NPE, including over 3× the DSP slices, since it targets a much larger FPGA. We attribute some of the difference in resource consumption to the fact that FTRANS uses specialized modules for each transformer component and each nonlinearity, which leads to additional area and underutilized components.
9. Conclusion
In this paper we propose NPE, an FPGA-based overlay processor domain-specialized for Natural Language Processing. NPE offers software-like programmability and provides a unified framework for processing arbitrarily complex nonlinear functions. If a new state-of-the-art NLP model were to surpass transformers in the coming years, NPE is likely flexible enough to adapt to it without reconfiguring the FPGA or adding specialized processing modules. NPE meets the inference latency requirements of conversational AI for the BERT language model. Relative to the CPU and GPU, NPE has about 4× and 6× lower power consumption, respectively. Our accelerator shows comparable performance to a transformer-specialized FPGA accelerator while using far fewer FPGA resources. Overall, we find that NPE is a promising solution for low-cost, low-power NLP inference at the edge.