1 Introduction
Recurrent Neural Networks (RNNs) are a class of sequence models that play a key role in low-latency, AI-powered services in datacenters Fowers et al. (2018); Jouppi et al. (2017). In these services, the platform assumes that user requests arrive as individual samples and must be served within a very stringent latency window for real-time human-computer interaction. An example of such a workload is Google Translate, where inference happens concurrently as the user types.
Despite their popularity, RNN models are hard to serve efficiently. Modern software and hardware platforms support optimized BLAS routines. To serve RNNs on these platforms, a compiler tends to stitch multiple optimized BLAS kernels into a single computation graph. While a hardware accelerator might execute each individual kernel efficiently, it misses the opportunity for global cross-kernel optimization that can dramatically improve performance and energy efficiency. This approach leads to two issues.
First, communication between BLAS kernels creates large intermediate results, which can lead to poor memory performance when the blocking size is not properly tuned for the target system. Missing the opportunity for cross-kernel fusion can incur a large performance loss due to the different access latencies at each level of the memory hierarchy in a processor-based architecture. On a spatial architecture, while the first two levels of the memory hierarchy, i.e. registers and on-chip scratchpads, tend to have single-cycle access latency, the energy required to access these two types of memory differs widely. Therefore, the lack of cross-kernel fusion can lead to inefficient allocation of scratchpad resources and low energy efficiency.
Second, hardware accelerators tend to use wide vectorization in compute and memory access to boost compute density when accelerating BLAS kernels. However, they suffer from resource underutilization when the workload size is not a multiple of the vector size. The utilization is worse for RNN applications, which are composed of sequences of small matrix multiplications due to small hidden unit sizes and many time steps.
Moreover, many accelerator platforms are optimized for BLAS level-3 (matrix-matrix) operations, e.g. the NVBLAS Library for GPU, TPU Jouppi et al. (2017), EIE Han et al. (2016a), EyeRiss Chen et al. (2017), and DaDianNao Chen et al. (2014). These platforms suffer from low resource utilization when serving single-batch, real-time RNN applications, which consist largely of matrix-vector multiplication (MVM) executions.
To address these issues, we propose the following strategies. First, we fuse all the gate functions with the element-wise, non-linear functions in the same time step. This way, all of our intermediate results are buffered in registers as opposed to scratchpads. Second, we spatially parallelize and pipeline the computation graph. We vectorize the inner loop of the tiled dot product to exploit SIMD parallelism and fine-grained pipelining. We also explore tiled parallelism and coarse-grained pipelining by unrolling the outer loop nests based on the amount of available compute resources. These strategies exploit the gate-level parallelism in RNN cells, balance the pipelines of MVM and element-wise non-linear functions, and maximize resource utilization when serving RNN models of different problem sizes. In addition, the entire pipeline is dataflow-driven, with no dynamic scheduling overhead.
We evaluate the proposed strategies by serving the RNN tasks in DeepBench Narang and Diamos (2017) on the target spatial architecture. We implement the designs in Spatial Koeplinger et al. (2018), a Domain-Specific Language (DSL) that describes applications with nested loops and explicit hardware memory hierarchies. We choose Plasticine Prabhakar et al. (2017), a coarse-grained reconfigurable architecture (CGRA), as the target spatial architecture. Furthermore, we propose augmentations to the Plasticine microarchitecture to support the mixed-precision operations that are critical for serving RNNs in real time.
Finally, we compare our results to those obtained by serving DeepBench tasks on state-of-the-art RNN serving platforms. We show that our implementation delivers consistently high FLOPS utilization across tasks of various sizes. We also demonstrate the energy-efficiency advantage of spatial architectures over processor-based architectures.
The key contributions of this paper are:

We analyze the computation and memory layout of RNN cell implementations on commercially available platforms. We find that the BLAS abstraction leads to expensive inter-kernel data movement and resource underutilization.

We address these issues by describing RNN applications using abstractions with more general loop constructs that enable cross-kernel optimization, spatial parallelization, and pipelining of arbitrary loop nests. To achieve low-latency inference for RNN applications, we propose microarchitectural co-designs to a spatial architecture that enable low-precision operations.

Finally, we thoroughly evaluate the CPU, the general-purpose graphics processing unit (GPGPU), the field-programmable gate array (FPGA), and a previously proposed CGRA as serving platforms for RNN applications.
The rest of the paper is organized as follows. Section 2 provides background on RNN algorithms and on the DSL and hardware platform used in this paper. Section 3 discusses the available RNN implementations on commercially available platforms, and then the optimization strategies implemented in this work that address their inefficiencies. Section 4 discusses the architectural changes for supporting efficient RNN inference on the target spatial architecture. Section 5 details our evaluation methodology and experimental results. Section 6 discusses related work on software and hardware optimization strategies for serving RNN applications. Section 7 offers concluding remarks.
2 Background
RNNs are widely used to model arbitrary sequential tasks. An RNN contains a cell unit that iteratively consumes a T-step input sequence to generate an output sequence. Long Short-Term Memory (LSTM) Hochreiter and Schmidhuber (1997) and the Gated Recurrent Unit (GRU) Chung et al. (2014) are popular RNN cell units. In this paper, we use LSTM as an example; nevertheless, our optimization techniques generalize to other types of RNN cells. In Section 5, we also provide evaluations of GRU implemented using our techniques.
2.1 LSTM Cell
At step $t$, an LSTM generates an output and the next memory cell states $c_t$ and $h_t$ as follows:
(1) $f_t = \sigma(W_f h_{t-1} + U_f x_t + b_f)$
(2) $i_t = \sigma(W_i h_{t-1} + U_i x_t + b_i)$
(3) $o_t = \sigma(W_o h_{t-1} + U_o x_t + b_o)$
(4) $\tilde{c}_t = \tanh(W_c h_{t-1} + U_c x_t + b_c)$
(5) $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$
(6) $h_t = o_t \circ \tanh(c_t)$
$H$ and $D$ are the dimensions of the hidden state and input features, respectively. $R = H + D$ is the sum of the hidden state and input feature dimensions. $\circ$ is the Hadamard product. Table 1 shows the specifications for each matrix and vector in an LSTM cell.
Name  Shape  Specification

$x_t$  $\mathbb{R}^{D}$  LSTM cell's input vector
$f_t$  $\mathbb{R}^{H}$  Forget gate's activation vector
$i_t$  $\mathbb{R}^{H}$  Input gate's activation vector
$o_t$  $\mathbb{R}^{H}$  Output gate's activation vector
$\tilde{c}_t$  $\mathbb{R}^{H}$  Candidate of memory gate's activation vector
$c_t$  $\mathbb{R}^{H}$  Memory gate's vector
$W_g$  $\mathbb{R}^{H \times H}$  Hidden state's weight matrices at gate $g$
$U_g$  $\mathbb{R}^{H \times D}$  Input vector's weight matrices at gate $g$
$b_g$  $\mathbb{R}^{H}$  Bias vector at gate $g$, $g \in \{f, i, o, c\}$
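Equations 1-6 can be sketched as a minimal NumPy reference (the variable names and gate ordering below are our own; this illustrates the mathematical definition, not the accelerated design):

```python
import numpy as np

def lstm_cell(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step (Equations 1-6).
    W: (4, H, H) hidden-state weights, U: (4, H, D) input weights,
    b: (4, H) biases; gate order assumed here is f, i, o, c-tilde."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    f = sigmoid(W[0] @ h_prev + U[0] @ x_t + b[0])   # Eq. 1: forget gate
    i = sigmoid(W[1] @ h_prev + U[1] @ x_t + b[1])   # Eq. 2: input gate
    o = sigmoid(W[2] @ h_prev + U[2] @ x_t + b[2])   # Eq. 3: output gate
    g = np.tanh(W[3] @ h_prev + U[3] @ x_t + b[3])   # Eq. 4: candidate
    c_t = f * c_prev + i * g                         # Eq. 5: Hadamard products
    h_t = o * np.tanh(c_t)                           # Eq. 6: next hidden state
    return h_t, c_t
```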
2.2 Spatial Reconfigurable Architectures
Spatial reconfigurable architectures, such as FPGAs and CGRAs, are gaining traction as datacenter accelerators for their energy efficiency Amazon (2017); Putnam et al. (2014); Ouyang et al. (2014). Compared to processor-based architectures, spatial architectures can reach high resource utilization by reconfiguring memory and compute based on each application's computation requirements. In addition to exploiting parallelism, pipelining the dataflow graph on a spatial architecture provides high compute throughput. Nonetheless, the traditional low-level programming interface and long synthesis time of FPGAs are major obstacles to their becoming mainstream accelerators. As opposed to the bit-level, flat interconnection in FPGAs, CGRAs are configured at a coarser granularity and contain a hierarchical interconnection network. In exchange, the reduced hardware flexibility translates to lower routing overhead and higher clock frequency. The reduced routing overhead provides higher compute density and memory capacity, which makes CGRAs an attractive platform for accelerating deep learning workloads. Due to the flexibility in mapping applications, spatial architectures often require design space exploration (DSE) to achieve good resource utilization and performance Koeplinger et al. (2016); Liu and Schafer (2016).
2.3 Spatial
Spatial is a hardware-centric DSL that targets FPGAs and a previously proposed CGRA, Plasticine. A user describes applications as unparallelized pattern-based loops with explicit memory hierarchies. Spatial automatically schedules, parallelizes, and pipelines arbitrary loop nests. To scale memory bandwidth with parallelism, Spatial banks the scratchpad memories. To sustain the throughput of pipelining, Spatial also buffers the intermediate memories. Spatial exposes important design parameters such as the blocking size and unrolling factor. Using these parameters, users can easily tune their designs, either manually or with an external DSE engine, to balance the pipeline stages and saturate resources for different tasks on different hardware targets.
2.4 Plasticine
Plasticine is a CGRA that accelerates the general nested loops expressed in Spatial. It consists primarily of two types of units: a pattern compute unit (PCU) containing a single instruction multiple data (SIMD) pipeline optimized for accelerating vectorized map and reduction loops, and a pattern memory unit (PMU) containing configurable memory that supports banking and buffering schemes for various access patterns. Plasticine supports parallelizing and pipelining arbitrarily nested loops from Spatial. More architectural details are explained in Section 4.
3 RNN Computation Analysis
In this section, we first discuss the limitations of BLAS-based LSTM on processor and spatial architectures. Next, we discuss our implementation of loop-based LSTM on spatial architectures. Table 2 contains specifications for the symbols and parameters used in this section.
Symbol  Processor  Reconfigurable Hardware 
Kernel  Inner Loop  
Memory Hierarchy  On-chip Scratchpad
Register File  Register  
Unrolling factor using multiple hardware compute blocks  
Elementwise Operation  
Outer Loop  
Vectorization parameter for AVX or SIMD instructions  
Parameter  Specification  
Vectorization parameter on H  
Unrolling factor on H  
Vectorization parameter on R  
Unrolling factor on R  
Number of gates in an RNN. For LSTM, G=4 
3.1 BLASbased LSTM on Processor Architecture
Modern machine learning frameworks, e.g. TensorFlow
Abadi et al. (2016), divide the computation graph of an LSTM cell into BLAS kernels. Each BLAS kernel is then accelerated by calling low-level optimized BLAS subroutines, such as the Intel BLAS Library on CPU and the NVBLAS Library on GPU. Figure 1 (a) shows the computation graph of a BasicLSTM cell in TensorFlow. This implementation can lead to a large memory footprint, since all the intermediate results are materialized in memory. A common strategy to tackle this issue is fusing blocked kernels. With TensorFlow's abstraction, this can only be achieved by expressing the entire RNN cell as one optimized kernel. For example, TensorFlow provides the LSTMBlockFusedCell and GRUBlockCell modules, which are the fastest TensorFlow implementations of RNN cells for CPU. In practice, such implementations can provide significant performance improvement over the BasicLSTM implementation. However, it is still very hard to saturate the CPU's compute capacity, potentially due to the high synchronization overhead across threads.
Figure 1 (b) shows the computation layout of TensorFlow with the cuDNN library Chetlur et al. (2014) on GPU. cuDNN is an NVIDIA GPU library for accelerating deep neural networks. To minimize data movement, cuDNN fuses all the vector-vector (VV) operations after the MVM. Specifically, the bias adds in Equations 1, 2, 3, 4 and all the operations in Equations 5, 6 are fused into one kernel. Nevertheless, there are still intermediate buffers between the MVM kernel and the element-wise operations. Compared to the BasicLSTM implementation, CudnnLSTM eliminates most of the large intermediate memories. However, the MVMs of Equations 1, 2, 3, 4 are all accelerated with BLAS-3 kernels, which perform only matrix-matrix operations. This turns the MVM and VV bias add into matrix-matrix multiplication (MMM) and matrix-matrix addition (MMA), which leads to serious underutilization of the GPU.
Moreover, a processor-based architecture introduces the large energy overhead of instruction decoding and scheduling. GPUs especially suffer from their power-hungry, high-throughput memory hierarchy. For these reasons, neither CPU nor GPU architectures are suitable as energy-efficient, low-latency RNN serving platforms.
3.2 BLASbased LSTM on Spatial Architecture
Previous work has studied using an FPGA as a low-latency serving platform. An FPGA has the flexibility to resize MVM and VV units based on the application size. In addition, MVM and VV units can be implemented as hardware pipelines, which removes the instruction scheduling and control overhead of a processor-based architecture. The latest Intel Stratix 10 FPGA further boosts FPGA compute power with an increasing number of hardened digital signal processing (DSP) blocks and larger on-chip memory capacity. Microsoft Brainwave (BW) Fowers et al. (2018) is a state-of-the-art FPGA-based deep learning framework.
Figure 2 shows BW's compute and memory layout. In contrast to the CPU and GPU implementations, BW blocks the MVM along both the row and column dimensions. It then fuses the inner tiled MVM with the element-wise non-linear functions. Specifically, BW parallelizes the compute of multiple column tiles of the matrix (# MV Tiles in the original paper) across multiple tile engines, as shown in Figure 4 (a). Each tile engine contains a native-dimension number of dot-product engines, each vectorized by the number of lanes, and achieves a throughput of one tile per cycle. Parallel tiles along the row dimension are then fed into a pipelined reduction and accumulation unit. Immediately after the accumulation, the multifunction units (MFUs) execute the element-wise operations on the vector chunk produced by the accumulator. Although BW's implementation still keeps vectorized intermediate results, their size is much smaller than in the BasicLSTM cell. Nonetheless, with tile-level parallelization, BW allocates many vectorized intermediate buffers that can still lead to energy inefficiency. BW performs one MVM operation over multiple iterations.
The MVM operations are executed on each gate of the LSTM sequentially. Similarly, the element-wise operations for the non-linear operators are scheduled to execute on the vectorized multifunction units, as shown with the arrow in time in Figure 2. To avoid DRAM communication overhead and improve compute density, Brainwave stores the MVM weights in a blocked floating-point format, where the values of a vector share a single 5-bit exponent and have distinct signs and 2-5-bit mantissas. As a result, BW achieves very dense low-precision compute and storage, with a single shared-exponent adder and narrow multipliers per vector. The remaining operations are performed in 16-bit precision.
When the matrix dimensions are not divisible by the native dimension and lane count, Brainwave suffers from underutilization of the compute FLOPS, as shown in Figure 4 (a). The underutilization is worse for small problem sizes. In addition, BW computes the hidden-state and input MVMs separately rather than as a single concatenated larger matrix, which can further aggravate the problem. This might be because BW's abstraction does not allow partial updates of the hidden-state vector, which is only updated at the end of the step.
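The two fragmentation patterns can be quantified with short back-of-the-envelope helpers (the function names are ours; 400 and 40 are the BW native dimension and lane count quoted in Section 5.2):

```python
import math

def tiled_mvm_utilization(H, R, tile_rows, tile_cols):
    """Useful fraction of FLOPs when an H-by-R MVM is padded to tile
    boundaries in both dimensions (2-D fragmentation)."""
    padded = (math.ceil(H / tile_rows) * tile_rows *
              math.ceil(R / tile_cols) * tile_cols)
    return (H * R) / padded

def dot_product_utilization(R, V):
    """1-D fragmentation only: a per-element dot product padded to
    vector width V."""
    return R / (math.ceil(R / V) * V)

# A small LSTM (H=256, R=512) on 400x40 tiles vs. a 16-wide dot product:
print(tiled_mvm_utilization(256, 512, 400, 40))   # ~0.63: heavy 2-D padding
print(dot_product_utilization(512, 16))           # 1.0: R divides evenly
```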
3.3 Loopbased LSTM
We have made the following observations from analyzing the BLAS-based LSTM implementations:

Constructing an LSTM cell's computation graph from BLAS subroutines introduces large intermediate buffers, even when the kernels themselves are blocked. Each element along the RNN cell's non-reduction dimension of the MVM can be computed completely independently within one time step. This exposes the opportunity for fine-grained loop tiling and fusion across the entire LSTM kernel.

MVM is the computation bottleneck when serving RNN cells. A spatial architecture allows us to devote most of the compute resources to MVM by parallelizing the MVM and pipelining it with the element-wise operations.

Using low-precision operations can boost compute density and keep RNN weights on-chip to avoid high-latency DRAM communication. We need to introduce efficient low-precision support in the target spatial architecture without introducing too much overhead.
To address the issue of large intermediate buffers, we tile at fine granularity and fuse the MVM with the non-linear functions. We refer to the computation that generates a single element of $c_t$ and $h_t$ as the LSTM-1 operation, which can be computed independently in a single step. LSTM-1 is composed of four independent dot products of the weight matrices' row vectors with the input vector, immediately followed by the element-wise operations on the dot products' outputs. The resulting $c_t$ and $h_t$ vectors are produced by computing LSTM-1 operations for H iterations.
As shown in Figure 3, each MVM unit is replaced by a MapReduce unit that computes the tiled dot product. Each MapReduce unit is vectorized, with a pipelined map function followed by a pipelined reduction tree, and multiple MapReduce units run in parallel. Results of the MapReduce blocks are reduced and accumulated with another reduction tree (not shown in the figure). Next, the dot-product result is passed through a chain of function units that execute the bias add and non-linear functions. The dot products, bias adds, and non-linear functions of the four gates can also be parallelized. Finally, the results of the four gates are pipelined through a set of function units for the element-wise operations in the LSTM cell. At the outer loop, LSTM-1 runs for H iterations divided across the parallel LSTM-1 units.
In the loop-based design, all intermediate buffers are scalars as opposed to vectors. Regarding utilization, the loop-based LSTM design suffers less underutilization from unaligned problem sizes than the tiled-MVM approach in BW. Figure 4 shows the sources of such underutilization. An MVM-based design suffers from 2-D fragmentation in both dimensions (Figure 4 (a)), whereas the loop-based design only suffers from 1-D fragmentation along the reduction dimension R (Figure 4 (b)).
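A software sketch of the LSTM-1 operation (the names are ours; the per-gate weights are assumed concatenated into H x R matrices, following the definition of R = H + D) makes the scalar-intermediate property concrete:

```python
import numpy as np

def lstm1(j, W, b, z, c_prev):
    """Element j of c_t and h_t, computed independently (LSTM-1).
    W: (4, H, R) concatenated weights, z: (R,) = [h_{t-1}; x_t].
    Every intermediate value here is a scalar, not a vector."""
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    f = sigmoid(np.dot(W[0, j], z) + b[0, j])   # forget gate, element j
    i = sigmoid(np.dot(W[1, j], z) + b[1, j])   # input gate
    o = sigmoid(np.dot(W[2, j], z) + b[2, j])   # output gate
    g = np.tanh(np.dot(W[3, j], z) + b[3, j])   # candidate
    c_j = f * c_prev[j] + i * g
    return c_j, o * np.tanh(c_j)

# The full c_t and h_t vectors are H independent LSTM-1 invocations,
# which is exactly what the spatial design parallelizes and pipelines.
```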
Figure 5 shows a loop-based LSTM design implemented in Spatial. Foreach is a loop construct with a lambda that takes the loop iterator as input. Reduce is a construct that executes a MapReduce by taking a map function followed by a reduction function. Users declare explicit on-chip scratchpads and registers with SRAM and Reg. To enable fine-tuning of an RNN application, we expose the loop vectorization and unrolling factors.
4 Plasticine Specialization for RNN Serving
To show efficient execution of the loop and parallel pattern constructs, we map our implementation onto a spatial architecture, Plasticine. The Foreach and Reduce constructs in Figure 5 are mapped to PCUs on Plasticine. When the application size is small, these constructs execute using the pipelined SIMD lanes within a single PCU. When the application size is large, multiple PCUs can be used to parallelize and pipeline the dot product across PCUs. Element-wise operations execute in a deep pipeline formed by chaining multiple PCUs.
To fit an RNN's weights on-chip, we execute our application with low-precision arithmetic. In this section, we propose the necessary microarchitectural changes to support low-precision arithmetic on Plasticine. We also discuss architectural parameter selection for Plasticine to serve RNN applications efficiently.
4.1 MixedPrecision Support
Previous works Fowers et al. (2018); Jouppi et al. (2017) have shown that low-precision inference can deliver promising performance improvements without sacrificing accuracy. In the context of reconfigurable architectures such as FPGAs, low-precision inference not only increases compute density but also reduces the on-chip capacity required for storing weights and intermediate data.
To support low-precision arithmetic without sacrificing coarse-grained reconfigurability, we introduce two low-precision struct types in Spatial: a tuple of four 8-bit floating-point numbers (4float8) and a tuple of two 16-bit floating-point numbers (2float16). Both types pack multiple low-precision values into a single-precision storage word. We support only 8- and 16-bit precisions, which are commonly seen in deep learning inference hardware. Users can only access values that are 32-bit aligned. This constraint guarantees that the microarchitectural change is local to the PCU; the banking and DRAM access granularity remain intact from the original design.
Figure 6 (a) shows the original SIMD pipeline in a Plasticine PCU. Each FU supports both floating-point and fixed-point operations. When mapping applications onto Plasticine, the innermost loop body is vectorized across the lanes of the SIMD pipeline, and different operations of the loop body are mapped to different stages. Each pipeline stage contains a few pipeline registers (PRs) that allow live variables to propagate across stages. Special cross-lane connections, shown in red in Figure 6, enable reduction operations. To support 8-bit element-wise multiplication and 16-bit reduction, we add four opcodes to the FU, shown in Figure 6 (b). The first two opcodes are element-wise, low-precision operations that multiply four 8-bit values and add two 16-bit values, respectively. The next two opcodes rearrange the low-precision values into two registers and pad them to higher precision. A final stage reduces the two 32-bit values to a single 32-bit value using the existing add operation. From here, we can use the original reduction network shown in Figure 6 (a) to complete the remaining reduction and accumulation over 32-bit connections.
With 4 lanes and 5 stages, a PCU first reads 16 8-bit values, performs 8-bit multiplication followed by rearrangement and padding, and produces 16 16-bit values after the second stage. The intermediate values are stored in 2 PRs per lane. Next, the 16 16-bit values are reduced to 8 16-bit values and rearranged into 8 32-bit values in 2 PRs per lane. Then, an element-wise 32-bit addition reduces the two registers in each lane into 4 32-bit values. These values are fed through the reduction network that completes the remaining reduction and accumulation in two plus one stages.
In a more aggressive specialization, we fuse the multiply and rearrange into the same stage. We also fuse the first low-precision reduction with the next rearrange, as shown in Figure 6 (d). In this way, we can perform the entire low-precision map-reduce in 2 stages in addition to the original full-precision reduction. To maximize hardware reuse, we assume that a full-precision FU can be constructed from low-precision FUs. In addition, we observe that the original reduction network in the SIMD lanes can lead to low FU utilization. To improve FU utilization, we fold the entire tree structure into a single stage. Figure 6 (c) shows the folded reduction-accumulation structure. Specifically, later reductions in the tree are mapped to earlier stages in the pipeline. In this setup, the entire reduction plus accumulation remains fully pipelined with no structural hazard. With fused reduced-precision multiplication and reduction, and a folded reduction tree, a PCU is able to perform the whole map-reduce that accumulates 8-bit values using 4 stages, and all the operations complete in a fixed number of pipeline cycles.
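The pack, multiply, widen, and reduce flow can be mimicked in software. This sketch packs signed 8-bit integers into 32-bit words, since standard Python has no 8-bit floating-point type; the real PCU operates on packed low-precision floats:

```python
import struct

def pack4x8(vals):
    """Pack four signed 8-bit values into one 32-bit word
    (four-per-word storage, analogous to 4float8)."""
    return struct.unpack("<i", struct.pack("<4b", *vals))[0]

def unpack4x8(word):
    """Recover the four signed 8-bit lanes from one 32-bit word."""
    return list(struct.unpack("<4b", struct.pack("<i", word)))

def mixed_precision_dot(a_words, b_words):
    """Multiply packed 8-bit lanes element-wise, widen, then reduce and
    accumulate at full precision, mirroring the multiply -> rearrange/pad
    -> reduce stages of the modified PCU pipeline."""
    acc = 0
    for wa, wb in zip(a_words, b_words):
        for x, y in zip(unpack4x8(wa), unpack4x8(wb)):
            acc += int(x) * int(y)    # products widened before accumulation
    return acc
```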
4.2 Sizing Plasticine for Serving RNN
Evaluating an RNN cell with H hidden units and D input features requires roughly $2HR$ computations and $HR$ weight reads per gate. With large H, the compute-to-memory ratio is 2:1. The original Plasticine architecture uses a checkerboard layout with a 1:1 ratio between PCUs and PMUs. A PCU has 6 stages and 16 lanes, and a PMU has 16 banks. This provides a 6:1 ratio between compute resources and on-chip memory read bandwidth. As a result of this layout, on-chip memory read bandwidth becomes the bottleneck for accelerating RNN serving applications. Given that RNNs cover a wide range of important applications, we select a Plasticine configuration tailored for RNN serving. Specifically, we choose a 2:1 PMU-to-PCU ratio with 4 stages in each PCU. Figure 7 shows the layout of this Plasticine variant.
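The 2:1 ratio follows from counting operations per weight element; a back-of-the-envelope sketch (the helper name is ours, and G is the gate count from Table 2):

```python
def rnn_compute_memory_ratio(H, D, G=4):
    """Each of the G gates performs an H x (H + D) MVM: every weight
    element is read once and used in one multiply and one add."""
    R = H + D
    flops = G * 2 * H * R     # one multiply + one accumulate per weight
    reads = G * H * R         # one read per weight element
    return flops / reads

# The ratio is exactly 2 regardless of problem size, since every weight
# read feeds a fused multiply-accumulate.
assert rnn_compute_memory_ratio(1024, 1024) == 2.0
```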
5 Evaluation
In this section, we evaluate real-time RNN serving tasks on various platforms. We start with the methodology of our experiments, followed by a discussion of performance and power comparisons across these platforms.
5.1 Methodology
To evaluate RNN serving, we use the LSTM and GRU tasks from Baidu DeepBench as our benchmarks. We evaluate the benchmarks across processor-based architectures, including CPU and GPU, and spatial architectures, including FPGA and CGRA. Table 4 shows the detailed specifications of the target hardware, which includes state-of-the-art high-performance platforms in each of the commercial categories. Table 5 summarizes the application configuration for each platform.
CPU
We implement the applications in TensorFlow 1.10 and evaluate our implementations on an Intel Xeon Scalable Processor (Skylake) CPU. We use the LSTMBlockFusedCell and GRUBlockCell kernels in TensorFlow. We further enable AVX2 vector instructions for the CPU evaluation. Due to the lack of low-precision support in both the tool chain and the platform, we use single precision for our implementation.
GPU
We use TensorFlow with the cuDNN library to target an NVIDIA Tesla V100 GPU from Google Cloud. cuDNN is a GPU-accelerated library from NVIDIA specialized for deep learning. We use 16-bit precision for our GPU implementation. On both the CPU and GPU platforms, we run the TensorFlow profilers and collect only the time spent evaluating the RNN cells.
Plasticine
We implement the applications in Spatial, which targets Plasticine. Although Spatial has FPGA backend support, Stratix 10 was not commercially available at the time of submission of this work, and the FPGA targets that Spatial currently supports are not comparable to Stratix 10 in either memory or compute capacity. Therefore, we only use Spatial to target Plasticine for this evaluation. However, our approach should generally benefit an implementation on a high-performance FPGA like Stratix 10. We choose a Plasticine configuration that matches the peak 8-bit FLOPS and on-chip scratchpad capacity of a Stratix 10 FPGA. The exact configuration is shown in Table 3. To minimize the overhead of low-precision support, Plasticine only supports 8-bit, 16-bit, and 32-bit element-wise operations and a mixed-precision reduction operation. For our evaluation, the element-wise operations are performed in 8-bit precision, the first stage of the reduction is performed in 16-bit, and the rest of the reduction and accumulation are performed in 32-bit operations.
To measure performance, we use a cycle-accurate simulator for Plasticine. We modify the simulator to model the proposed microarchitectural changes that support low-precision operations. We use the area and power of individual CUs and network switches from the original Plasticine paper, and compute the total area of the configuration shown in Table 3. As discussed in Section 4, we reduce the number of stages in the PCU from 6 to 4 with fused low-precision operations and a folded reduction tree. Low-precision function units can be used to compose full-precision units. We conservatively estimate that the area and power of a PCU stay the same under our proposed changes with two fewer stages. We also increase the PMU-to-PCU ratio to better match the compute-to-memory ratio of RNN inference applications. To match the memory capacity of Stratix 10, we shrink the scratchpad capacity of each PMU from 256 kB to 84 kB. For power calculations, we generate activity traces of the CUs from simulation and integrate them with the characterized power of an individual PCU to compute the total power. The power and area characterizations are based on synthesis at 28 nm technology with a 1 GHz clock frequency.
Brainwave
Finally, we also compare our results to the Microsoft Brainwave framework, implemented on top of an Intel Stratix 10 FPGA. Brainwave is synthesized at 250 MHz, and all operations are performed in the blocked low-precision floating-point format described in Section 3.2.
# Row  24  # Column  24 
# PCU  192  # PMU  384 
# Lanes in PCU  16  # Stages in PCU  4 
Scratchpad capacity per PMU  84kB 
Specification  Intel Xeon Skylake (Dual core)  Tesla V100 SXM2  Stratix 10 280 FPGA  Plasticine 

Max Clock Rate (GHz)  2.0/2.8*  1.38/1.53*  1  1 
On-chip memory** (MB)  55  20  30.5  31.5 
Peak 32bit TFLOPS  –  15.7  10  12.5 
Peak 8bit TFLOPS  –  –  48  49 
Technology (nm)  14  12  14  28 
Die Area (mm²)  64.4  815  1200  494.37 
TDP (W)  15  300  148  160 
* Base/Boosted Frequency ** Capacity of L3 cache for CPU, register file for GPU, and onchip scratchpad for reconfigurable architectures.
Platform  Intel Xeon Skylake  Tesla V100 SXM2  Stratix 10 280 FPGA  Plasticine 

Software Framework  TF+AVX2  TF+cuDNN  Brainwave  Spatial 
Achieved Clock Frequency (GHz)  2  1.38  0.25  1 
Precision  f32  f16  blocked precision  mix f8+16+32 
Benchmarks  Latency (ms)  Effective TFLOPS  Plasticine Speedup (x)  Power (W)  
H  T  Xeon Skylake  Tesla V100  BW  Plasticine  Xeon Skylake  Tesla V100  BW  Plasticine  Xeon Skylake  Tesla V100  BW  Plasticine  
LSTM  256  150  15.75  1.69  0.425  0.0419  0.010  0.09  0.37  3.8  376.3  40.4  10.2  28.5 
512  25  11.50  0.60  0.077  0.0139  0.009  0.18  1.37  7.6  830.3  43.2  5.6  53.7  
1024  25  107.65  0.71  0.074  0.0292  0.004  0.59  5.68  14.4  3,686.6  24.3  2.5  97.2  
1536  50  411.00  4.38  0.145  0.1224  0.005  0.43  13.01  15.4  3,357.8  35.8  1.2  102.7  
2048  25  429.36  1.55  0.074  0.1060  0.004  1.08  22.62  15.8  4,050.6  14.6  0.7  104.5  
GRU  512  1  0.91  0.39  0.013  0.0004  0.003  0.01  0.25  7.6  2,182.3  942.4  31.2  61.9 
1024  1500  3,810.00  33.77  3.792  1.4430  0.005  0.56  4.98  13.1  2,640.3  23.4  2.6  109.1  
1536  375  2,730.00  13.12  0.951  0.7463  0.004  0.81  11.17  14.2  3,658.3  17.6  1.3  114.6  
2048  375  5,040.00  17.70  0.954  1.2833  0.004  1.07  19.79  14.7  3,927.5  13.8  0.7  101.2  
2560  375  7,590.00  23.57  0.993  1.9733  0.004  1.25  29.69  15.0  3,846.4  11.9  0.5  117.2  
Geometric Mean  2,529.3  29.8  2.0 
Benchmarks  Stratix 10 BW  Plasticine  
H  T  
LSTM  256  150  6  400  40  6  1  4  64 
512  25  4  8  
1024  25  
1536  50  
2048  25  
GRU  512  1  2  
1024  1500  
1536  375  
2048  375  
2560  375  
2816  750 
5.2 RNN Performance Analysis
Table 6 shows the performance comparison of LSTM and GRU with various numbers of hidden units (H) and step sizes (T) over the four platforms. Overall, both CPU and GPU significantly underutilize the available compute FLOPS. In addition, they cannot meet the latency requirement for real-time serving across all problem sizes. Both BW and Plasticine deliver promising latencies within 5 ms for all problem sizes. When serving very large RNNs, BW provides better performance, up to 2x over Plasticine on the largest GRU (H=2816). When serving small and medium-size RNNs, Plasticine performs better than BW, with up to 30x better performance on the small GRU (H=512). We also observe that Plasticine delivers consistent FLOPS across all problem sizes.
ProcessorBased Architectures
For the CPU experiments, the RNN kernels from TensorFlow are not themselves multi-threaded. Since we focus on real-time serving of RNN applications, we use a batch size of 1 for all of our benchmarks, which exposes no parallelism beyond the kernel level. Consequently, the machine is still very underutilized even with AVX2 instructions. Although one could implement the RNN directly in C++, the MVM sizes in RNNs are too small to benefit from multi-threading due to the synchronization overhead. The V100 with the cuDNN library provides significant acceleration compared to the CPU. Nevertheless, the latency is still high. This is because GPUs are designed for throughput-oriented rather than latency-sensitive workloads. Given that the library is based on BLAS-3 routines, which are matrix-matrix operations, the MVMs in RNN serving suffer from significant resource underutilization. In Table 6, the V100 shows very poor performance on GRU (H=512). This is likely due to initialization overhead that should not have been timed. From our evaluation, neither processor-based architecture is suitable for low-latency serving of RNN applications.
Spatial Architectures
Table 7 shows the selected design parameters for each problem size for BW and Plasticine. On Stratix 10, BW uses 6 matrix-vector tile engines with a native dimension of 400 and 40 lanes. A large native dimension and lane count improve the data-to-control ratio by amortizing the scheduling overhead over a large vectorized instruction. However, this design choice aggravates underutilization for small RNN feature sizes of 256 and 512. Our implementation effectively uses a tile size of 1 by performing dot products instead of MVMs, which prevents fragmentation along the native dimension. With this mapping, all intermediate buffers are stored in registers, whereas BW stores them in larger register files. In addition, our proposed implementation captures additional gate-level, X, and H parallelism as well as pipelining across the elementwise functions. In contrast, BW schedules these operations in time and dispatches corresponding instructions to drive the compute units.
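The fragmentation effect can be seen with a back-of-the-envelope utilization model (illustrative only, not BW's actual scheduler): a feature dimension gets padded up to a multiple of the tile engine's native dimension, and the padding is wasted work.

```python
import math

# BW on Stratix 10 uses matrix-vector tile engines with a native
# dimension of 400 (parameter taken from the text above).
NATIVE_DIM = 400

def tile_utilization(feature_size, native_dim=NATIVE_DIM):
    """Fraction of useful work when the feature dimension is padded up
    to the next multiple of the native dimension."""
    padded = math.ceil(feature_size / native_dim) * native_dim
    return feature_size / padded

# H = 256 pads to 400 and H = 512 pads to 800: both land at 64%
# utilization, wasting 36% of each padded tile on zeros.
util_256 = tile_utilization(256)
util_512 = tile_utilization(512)
```

A tile size of 1 makes the padded size equal the feature size, so this ratio stays at 1.0 for every H.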
A CGRA is less flexible than an FPGA when performing arbitrary low-precision operations. In this example, we increase the memory density of Plasticine by supporting quantized precisions as described in Section 4.1. All weights are stored in an 8-bit format, as are the multiplication operations of the MVM. The reduction and accumulation operations are implemented in a mix of 16- and 32-bit precisions. Hence, the peak FLOPS when performing mixed-precision map-reduce is much lower than the peak FLOPS of the blocked low-precision format in BW. As a result, Plasticine performs worse than BW on the large RNNs. In addition, Plasticine delivers very consistent FLOPS across problem sizes. For small problem sizes, the dot product can be fully unrolled, so we can explore additional parallelism across the hidden units. For large problem sizes, the dot product becomes the bottleneck of the pipeline; hence, we shift parallelism from the hidden units back to the dot product to balance the throughput between the dot product and the elementwise operations. In this example, BW uses a single set of parameters for all problem sizes. Although one could tune the parameters for each problem size, doing so would incur re-synthesis and place-and-route on an FPGA, which takes an order of magnitude longer than the compilation time needed for a CGRA design. In addition, to exhaust hardware resources with a smaller native dimension, one would have to increase the number of matrix-vector tile engines in BW. As a result, the decoders and schedulers associated with these units would drive up the control-to-data overhead and deliver lower FLOPS for larger problem sizes.
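The precision scheme can be sketched as follows. This is an illustrative NumPy model of 8-bit storage with wider accumulation, not Plasticine's actual datapath; the symmetric per-tensor scaling used here is an assumption made for the example.

```python
import numpy as np

def quantized_dot(w_q, x_q, w_scale, x_scale):
    """Dot product with 8-bit operands: int8 values are widened to
    int16 for the multiplies (|product| <= 127*127 fits in int16),
    accumulated in 32-bit, and dequantized once at the end -- a
    software model of a mixed 16/32-bit map-reduce."""
    prods = w_q.astype(np.int16) * x_q.astype(np.int16)  # 8-bit multiply
    acc = prods.astype(np.int32).sum()                   # wide accumulation
    return float(acc) * w_scale * x_scale

np.random.seed(1)
w = np.random.randn(2048).astype(np.float32)
x = np.random.randn(2048).astype(np.float32)
# hypothetical symmetric per-tensor quantization to int8
w_scale = float(np.abs(w).max()) / 127.0
x_scale = float(np.abs(x).max()) / 127.0
w_q = np.round(w / w_scale).astype(np.int8)
x_q = np.round(x / x_scale).astype(np.int8)

approx = quantized_dot(w_q, x_q, w_scale, x_scale)
exact = float(w.astype(np.float64) @ x.astype(np.float64))
```

Because each lane does one 8-bit multiply but the reduction tree runs at 16/32 bits, the effective peak FLOPS is set by the wide reduction, which is the gap relative to BW's blocked low-precision format noted above.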
5.3 Area and Power Analysis
Table 4 shows the die area comparison of the platforms. While the GPU has a publicly reported die area Markidis et al. (2018), Xeon Skylake and Stratix 10 only have die areas estimated from their transistor counts Cutress (2017). With these rough estimates, we can see that although the CPU has the smallest area, its performance gap is too large even after scaling up to a 28-core server. The GPU also delivers poor performance per area, mostly due to low utilization of its compute FLOPS. Stratix 10 delivers the best performance for the large RNNs, but with the largest die area estimate, at 30 billion transistors Gazettabyte (2015). Plasticine's die area is based on synthesis results at 28nm, one technology generation older than all the other platforms. With technology scaling, Plasticine should possess double the compute and memory resources at 14nm for the same die area, which would roughly match Stratix 10's performance on all the RNN problem sizes. At the same time, Plasticine is more than 2x smaller than Stratix 10, which could contribute an additional 2x-60x performance-per-area improvement across problem sizes. Table 4 also shows the thermal design power (TDP) of the four platforms, which is the peak power achievable for any workload Intel (Technical report, 2018); Durant et al. (2017). BW also reports a measured peak power of 125W for the given set of benchmarks. Table 6 shows the simulated power of Plasticine for each benchmark. Overall, the peak power among benchmarks for Plasticine is 118W, slightly lower than that of BW.
6 Related Work
Previously proposed serving platforms focus on exploiting data locality by mapping RNN cells onto spatial architectures. For example, Chang et al. presented an FPGA-based implementation of an LSTM network Chang et al. (2015). This approach works well for small RNNs. For a large RNN, however, the weights would be too large to fit on-chip, and the serving latency would be dominated by loading data from DRAM. To address the issue of fitting RNN weights on-chip, several previous works Han et al. (2016b); Wang et al. (2018); See et al. (2016); Narang et al. (2017) have studied approaches for compressing RNN weights. For example, Han et al. presented a compression scheme called DSD Han et al. (2016b). It iteratively removes parameters in the weight matrices and retrains the sparse model to minimize the accuracy loss introduced by sparsity. With this compression scheme, Han et al. were able to deploy an LSTM network containing 3.2 million parameters onto a modern FPGA without sacrificing accuracy. Compared to serving on CPU and GPU platforms, serving a sparse LSTM network on an FPGA provides much lower latency and higher energy efficiency. However, we find that it could be hard to generalize this compression scheme to all RNN tasks. RNNs are very flexible in terms of their model structures, and applying a DSD-like compression scheme to all RNN models would require hand-tuning the compression heuristics for every model. To avoid hand-tuning, He et al. proposed an approach that uses reinforcement learning for automatic compression tuning He et al. (2018). However, their approach focuses on compressing CNN tasks for edge devices, which may not transfer to serving RNN tasks in the datacenter. Observing that sparsity-based compression schemes are still under active development, we choose to support compression schemes that represent RNN weights in low-precision data formats. Commercially available platforms such as the Google TPU Jouppi et al. (2017) and Microsoft BrainWave Fowers et al. (2018) support these schemes.
7 Conclusion
In this paper, we describe a set of techniques for performing cross-kernel optimization within RNN cells. We show that by moving away from the BLAS abstraction and focusing on optimizing loop-level constructs, we can achieve consistent hardware utilization when serving RNN cells of different sizes. We achieve a 10-20x performance improvement on a less advanced technology node compared to the state-of-the-art GPU platform, and a geometric-mean speedup of 2x compared to the state-of-the-art FPGA-based platform.
8 Acknowledgement
We appreciate the anonymous reviewers for their feedback. We thank Matthew Feldman for compiler support and his constructive suggestions on the manuscript of this paper, and Raghu Prabhakar for providing insights and feedback on the architecture section of this paper. We also thank Google for the cloud credits. This material is based on research sponsored by Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) under agreement number FA86501827865. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) or the U.S. Government. This research is also supported in part by affiliate members and other supporters of the Stanford DAWN project: Ant Financial, Facebook, Google, Infosys, Intel, Microsoft, NEC, Teradata, SAP, and VMware.
References
 TensorFlow: a system for large-scale machine learning. In OSDI, Vol. 16, pp. 265–283. Cited by: §3.1.
 EC2 f1 instances with fpgas now generally available. Technical report Amazon. Note: https://aws.amazon.com/blogs/aws/ec2f1instanceswithfpgasnowgenerallyavailable/ Cited by: §2.2.
 Recurrent neural networks hardware implementation on fpga. arXiv preprint arXiv:1511.05552. Cited by: §6.

 Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits 52 (1), pp. 127–138. Cited by: §1.
 DaDianNao: a machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 609–622. Cited by: §1.
 Cudnn: efficient primitives for deep learning. arXiv preprint arXiv:1410.0759. Cited by: §3.1.
 Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §2.
 The intel skylakex review: core i9 7900x, i7 7820x and i7 7800x tested. Technical report AnandTech. Note: https://www.anandtech.com/show/11550/theintelskylakexreviewcorei97900xi77820xandi77800xtested/ Cited by: §5.3.
 Dense linear algebra on GPUs. Note: https://developer.nvidia.com/cublas Cited by: §1.
 Inside volta: the world’s most advanced data center gpu. Technical report NVIDIA. Note: https://devblogs.nvidia.com/insidevolta/ Cited by: §5.3.
 A configurable cloudscale dnn processor for realtime ai. In Proceedings of the 45th Annual International Symposium on Computer Architecture, pp. 1–14. Cited by: §1, §3.2, §4.1, §6.
 Altera’s 30 billion transistor fpga. Note: http://www.gazettabyte.com/home/2015/6/28/alteras30billiontransistorfpga.html Cited by: §5.3.
 EIE: efficient inference engine on compressed deep neural network. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on, pp. 243–254. Cited by: §1.
 Dsd: densesparsedense training for deep neural networks. arXiv preprint arXiv:1607.04381. Cited by: §6.

 AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800. Cited by: §6.
 Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §2.
 AN 787: intel stratix 10 thermal modeling and management. Technical report Intel. Note: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/an/an787.pdf Cited by: §5.3.
 Product specification. Technical report Intel. Note: https://ark.intel.com/products/codename/37572/Skylake Cited by: §5.3.

 In-datacenter performance analysis of a tensor processing unit. In Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on, pp. 1–12. Cited by: §1, §4.1, §6.
 Automatic generation of efficient accelerators for reconfigurable hardware. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 115–127. External Links: Document, ISSN 10636897 Cited by: §2.2.
 Spatial: a language and compiler for application accelerators. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2018, New York, NY, USA, pp. 296–311. External Links: ISBN 9781450356985, Link, Document Cited by: §1.
 Efficient and reliable highlevel synthesis design space explorer for fpgas. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL), Vol. , pp. 1–8. External Links: Document, ISSN 19461488 Cited by: §2.2.
 NVIDIA tensor core programmability, performance & precision. arXiv preprint arXiv:1803.04014. Cited by: §5.3.
 Baidu deepbench. GitHub Repository. Note: https://github.com/baiduresearch/DeepBench Cited by: §1.
 Exploring sparsity in recurrent neural networks. arXiv preprint arXiv:1704.05119. Cited by: §6.
 SDA: softwaredefined accelerator for largescale dnn systems. Hot Chips 26. Cited by: §2.2.
 Plasticine: a reconfigurable architecture for parallel patterns. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 389–402. Cited by: §1.
 A reconfigurable fabric for accelerating large-scale datacenter services. In Proceedings of the 41st Annual International Symposium on Computer Architecture, ISCA '14, Piscataway, NJ, USA, pp. 13–24. External Links: ISBN 9781479943944, Link Cited by: §2.2.

 Compression of neural machine translation models via pruning. arXiv preprint arXiv:1606.09274. Cited by: §6.
 C-LSTM: enabling efficient LSTM using structured compression techniques on FPGAs. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 11–20. Cited by: §6.