In these services, the platforms assume that user requests arrive as individual samples and must be served within a very stringent latency window for real-time human-computer interaction. An example of such a workload is Google Translate, where inference happens concurrently as a user types. Despite its popularity, RNN model serving is hard to accelerate efficiently. Modern software and hardware platforms support optimized BLAS routines. To serve RNNs on these platforms, a compiler tends to stitch multiple optimized BLAS kernels into a single computation graph. While a hardware accelerator might execute each individual kernel efficiently, it misses the opportunity for global cross-kernel optimization that can dramatically improve performance and energy-efficiency. This approach leads to two issues. First, communication between BLAS kernels creates large intermediate results, which can lead to poor memory performance when the blocking size is not properly tuned for the target system. On a processor-based architecture, missing cross-kernel fusion can cause a huge performance loss due to the different access latencies at each level of the memory hierarchy. On a spatial architecture, while the first two levels of the memory hierarchy, i.e. registers and on-chip scratchpads, tend to have single-cycle access latency, the energy required to access these two types of memory differs widely. Therefore, lack of cross-kernel fusion can lead to inefficient allocation of scratchpad resources and low energy-efficiency. Second, hardware accelerators tend to use wide vectorization in compute and memory access to boost compute density when accelerating BLAS kernels. However, they suffer from resource underutilization when the workload size is not a multiple of the vector size. The utilization is worse for RNN applications, which are composed of sequences of small matrix multiplications due to small hidden unit sizes and many time steps.
Moreover, many accelerator platforms are optimized for BLAS level-3 (matrix-matrix) operations, e.g. the NVBLAS library for GPUs, TPU Jouppi et al. (2017), EIE Han et al. (2016a), EyeRiss Chen et al. (2017), and DaDianNao Chen et al. (2014). These platforms suffer from low resource utilization when serving single-batch, real-time RNN applications that contain many matrix-vector multiplication (MVM) executions.
To address these issues, we propose the following strategies. First, we fuse all the gate functions with the element-wise, non-linear functions in the same time step. This way, all of our intermediate results are buffered in the registers as opposed to the scratchpads. Second, we spatially parallelize and pipeline the computation graph. We vectorize the inner-loop of the tiled dot product to explore SIMD parallelism and fine-grain pipelining. We also explore tiled parallelism and coarse-grain pipelining by unrolling the outer loop nests based on the amount of available compute resources. These strategies exploit the gate-level parallelism in RNN cells, balance the pipelines of MVM and element-wise non-linear functions, and maximize the resource utilization when serving RNN models on different problem sizes. In addition, the entire pipeline is data-flow driven with no dynamic scheduling overhead.
We evaluate the proposed strategies by serving RNN tasks in DeepBench Narang and Diamos (2017) on the target spatial architecture. We implement the designs in Spatial Koeplinger et al. (2018), a Domain-Specific-Language (DSL) that describes applications with nested loops and explicit hardware memory hierarchy. We choose Plasticine Prabhakar et al. (2017), a coarse-grained reconfigurable architecture (CGRA), as the target spatial architecture. Furthermore, we propose augmentations to the Plasticine microarchitecture in order to support the mix-precision operations, which is critical for serving RNNs in real-time.
Finally, we compare the results to those obtained by serving DeepBench tasks on state-of-the-art RNN serving platforms. We show that our implementation delivers consistently high FLOPS utilization across tasks of various sizes. We also demonstrate the energy-efficiency advantage of spatial architectures over processor-based architectures.
The key contributions of this paper are:
We analyze the computation and memory layout of RNN cell implementations on commercially available platforms. We find that the BLAS abstraction leads to expensive inter-kernel data movement and resource underutilization.
We address these issues by describing RNN applications using abstractions with more general loop constructs, which enable cross-kernel optimization, spatial parallelization, and pipelining of arbitrary loop nests. To achieve low-latency inference for RNN applications, we propose a micro-architectural co-design of a spatial architecture to enable low-precision operations.
Finally, we thoroughly evaluate CPU, general purpose graphics processing unit (GPGPU), field-programmable gate array (FPGA), and a previously-proposed CGRA, as serving platforms for RNN applications.
The rest of the paper is organized as follows. Section 2 provides background on RNN algorithms and on the DSL and hardware platform used in this paper. Section 3 discusses the available RNN implementations on commercially available platforms. We then discuss the optimization strategies implemented in this work that address the inefficiencies in these implementations. Section 4 discusses the architectural changes for supporting efficient RNN inference on the target spatial architecture. Section 5 details our evaluation methodology and experimental results. Section 6 discusses related work on software and hardware optimization strategies for serving RNN applications. Section 7 offers concluding remarks.
RNNs are widely used to model arbitrary sequential tasks. An RNN contains a cell unit that iteratively consumes a T-step input sequence in order to generate an output sequence. Long Short-Term Memory (LSTM) Hochreiter and Schmidhuber (1997) and Gated Recurrent Unit (GRU) Chung et al. (2014) are popular RNN cell units. In this paper, we use LSTM as an example. Nevertheless, our optimization techniques can be generalized to other types of RNN cells. In Section 5, we also provide evaluations of GRU implemented using our techniques.
2.1 LSTM Cell
At step t, an LSTM generates an output h_t and the next memory cell state c_t as follows:

f_t = σ(W_f h_{t-1} + U_f x_t + b_f)    (1)
i_t = σ(W_i h_{t-1} + U_i x_t + b_i)    (2)
o_t = σ(W_o h_{t-1} + U_o x_t + b_o)    (3)
c̃_t = tanh(W_c h_{t-1} + U_c x_t + b_c)    (4)
c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t    (5)
h_t = o_t ∘ tanh(c_t)    (6)

H and X are the dimensions of the hidden state and input features, respectively. R = H + X is the sum of the hidden state and input feature dimensions. ∘ is the Hadamard product. Table 1 shows the specifications for each matrix and vector in an LSTM cell.
|x_t||LSTM cell’s input vector|
|f_t||Forget gate’s activation vector|
|i_t||Input gate’s activation vector|
|o_t||Output gate’s activation vector|
|c̃_t||Candidate of memory gate’s activation vector|
|c_t||Memory gate’s vector|
|W_g||Hidden state’s weight matrix at gate g|
|U_g||Input vector’s weight matrix at gate g|
|b_g||Bias vector at gate g ∈ {f, i, o, c}|
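As a concrete reference for Equations 1-6, the following is a minimal NumPy sketch of one LSTM step. The dict-based weight layout and variable names are our own illustration, not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step following Eqs. (1)-(6). W, U, b are dicts keyed by
    gate in {'f','i','o','c'}, with shapes (H, H), (H, X), and (H,)."""
    f = sigmoid(W['f'] @ h_prev + U['f'] @ x_t + b['f'])  # Eq. (1)
    i = sigmoid(W['i'] @ h_prev + U['i'] @ x_t + b['i'])  # Eq. (2)
    o = sigmoid(W['o'] @ h_prev + U['o'] @ x_t + b['o'])  # Eq. (3)
    g = np.tanh(W['c'] @ h_prev + U['c'] @ x_t + b['c'])  # Eq. (4), candidate
    c = f * c_prev + i * g                                # Eq. (5), Hadamard products
    h = o * np.tanh(c)                                    # Eq. (6)
    return h, c
```

Each gate performs two MVMs over H- and X-dimensional inputs (or, equivalently, one MVM over the concatenated R-dimensional vector), followed by element-wise operations.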
2.2 Spatial Reconfigurable Architectures
Compared to processor-based architectures, spatial architectures can reach high resource utilization by reconfiguring memory and compute based on the application's computation requirements. In addition to exploiting parallelism, pipelining of the data-flow graph in a spatial architecture provides high compute throughput. Nonetheless, the traditional low-level programming interface and long synthesis time of FPGAs are the major obstacles to their becoming mainstream accelerators. As opposed to the bit-level, flat interconnection in FPGAs, CGRAs are usually configured at a higher level of granularity and contain a hierarchical interconnection network. In exchange, the reduced flexibility in hardware translates to lower routing overhead and higher clock frequency. The reduced routing overhead provides higher compute density and memory capacity, which makes CGRAs attractive platforms for accelerating deep learning workloads. Due to the flexibility in mapping applications, spatial architectures often require design space exploration (DSE) in order to achieve good resource utilization and performance Koeplinger et al. (2016); Liu and Schafer (2016).
Spatial is a hardware-centric DSL that targets FPGAs and a previously proposed CGRA, Plasticine. A user describes applications as un-parallelized, pattern-based loops with explicit memory hierarchies. Spatial automatically schedules, parallelizes, and pipelines arbitrary loop nests. To scale the memory bandwidth with parallelism, Spatial banks the scratchpad memories. To sustain the throughput of pipelining, Spatial also buffers the intermediate memories. Spatial exposes important design parameters such as blocking sizes and unrolling factors. Using the exposed parameters, users can easily tune their design, either manually or with an external DSE engine, to balance the pipeline stages and saturate resources for different tasks on different hardware targets.
Plasticine is a CGRA that accelerates general nested loops in Spatial. It consists of primarily two types of units: a pattern compute unit (PCU) containing a single instruction multiple data (SIMD) pipeline optimized for accelerating vectorized map and reduction loops, and a pattern memory unit (PMU) containing configurable memory that supports banking and buffering schemes for various access patterns. Plasticine supports parallelizing and pipelining arbitrarily nested loops from Spatial. More architectural details are explained in Section 4.
3 RNN Computation Analysis
In this section, we first discuss the limitation of BLAS-based LSTM on processor and spatial architectures. Next, we discuss our implementation of loop-based LSTM on spatial architectures. Table 2 contains specifications for symbols and parameters used in this section.
|Memory Hierarchy||On-chip Scratchpad|
|P||Unrolling factor using multiple hardware compute blocks|
|V||Vectorization parameter for AVX or SIMD instructions|
|V_H||Vectorization parameter on H|
|P_H||Unrolling factor on H|
|V_R||Vectorization parameter on R|
|P_R||Unrolling factor on R|
|G||Number of gates in an RNN. For LSTM, G = 4|
3.1 BLAS-based LSTM on Processor Architecture
Modern machine learning frameworks, e.g. TensorFlow Abadi et al. (2016), divide the computation graph of an LSTM cell into BLAS kernels. Each kernel is then accelerated by calling low-level optimized BLAS subroutines, such as the Intel BLAS library on CPUs and the NVBLAS library on GPUs. Figure 1 (a) shows the computation graph of a BasicLSTM cell in TensorFlow. This implementation can lead to a large memory footprint, since all the intermediate results are materialized in memory. A common strategy to tackle this issue is to fuse blocked kernels. With TensorFlow’s abstraction, this can only be achieved by expressing the entire RNN cell as a single optimized kernel. For example, TensorFlow provides the LSTMBlockFusedCell and GRUBlockCell modules, which are the fastest TensorFlow implementations of RNN cells for CPUs. In practice, such implementations provide significant performance improvement over the BasicLSTM implementation. However, it is still very hard to saturate the CPU’s compute capacity, potentially due to the high synchronization overhead across threads. Figure 1 (b) shows the computation layout of TensorFlow with the cuDNN library Chetlur et al. (2014) on a GPU. cuDNN is an NVIDIA GPU library for accelerating deep neural networks. To minimize data movement, cuDNN fuses all the vector-vector (VV) operations after the MVM. Specifically, the bias adds in Equations 1, 2, 3, 4, and all the operations in Equations 5, 6, are fused into one kernel. Nevertheless, there are still intermediate buffers of size 4H between the MVM kernel and the element-wise operations.
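To make the intermediate-buffer issue concrete, here is a NumPy sketch of a BasicLSTM-style graph stitched from kernel calls, with the weights concatenated into a single 4H x R matrix. The function and buffer names are our own illustration; it simply counts the buffers materialized between kernels:

```python
import numpy as np

def basic_lstm_blas(x_t, h_prev, c_prev, Wz, bz):
    """BasicLSTM-style kernel graph: a gemv over the concatenated
    weights, then splits and element-wise kernels. Every entry in
    `intermediates` is a full buffer that lives in memory between
    kernels rather than staying in registers."""
    intermediates = {}
    z = np.concatenate([h_prev, x_t])       # R = H + X input vector
    gates = Wz @ z + bz                     # BLAS gemv: 4H-element result
    intermediates['gates'] = gates
    f, i, o, g = np.split(gates, 4)
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    for name, v in [('f', sig(f)), ('i', sig(i)),
                    ('o', sig(o)), ('g', np.tanh(g))]:
        intermediates[name] = v             # four more H-element buffers
    c = intermediates['f'] * c_prev + intermediates['i'] * intermediates['g']
    h = intermediates['o'] * np.tanh(c)
    total = sum(v.size for v in intermediates.values())
    return h, c, total                      # total = 8H intermediate elements
```

Even before counting c and h themselves, 8H elements of intermediate state cross kernel boundaries per step; the fused designs discussed later keep these as scalars.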
Compared to the BasicLSTM implementation, CudnnLSTM eliminates most of the large intermediate buffers. However, the MVMs of Equations 1, 2, 3, 4 are all accelerated with BLAS3 kernels, which perform only matrix-matrix level operations. This turns the MVM and VV bias add into Matrix-Matrix Multiplication (MMM) and Matrix-Matrix Addition (MMA), which leads to serious underutilization of the GPU.
Moreover, a processor-based architecture introduces a large energy overhead for instruction decoding and scheduling. The GPU especially suffers from its power-hungry, high-throughput memory hierarchy. For these reasons, neither the CPU nor the GPU architecture is suitable as an energy-efficient, low-latency RNN serving platform.
3.2 BLAS-based LSTM on Spatial Architecture
Previous work has studied the capability of an FPGA as a low-latency serving platform. An FPGA has the flexibility to resize MVM and VV units based on the application size. In addition, MVM and VV units can be implemented as hardware pipelines, which removes the instruction scheduling and control overhead of a processor-based architecture. The latest Intel Stratix 10 FPGA further boosts the compute power of FPGAs with an increasing number of hardened digital signal processing (DSP) blocks and greater on-chip memory capacity. Microsoft Brainwave (BW) Fowers et al. (2018) is a state-of-the-art FPGA-based deep learning framework.
Figure 2 shows BW’s compute and memory layout. In contrast to the CPU and GPU implementations, BW blocks the MVM along both the row and column dimensions. It then fuses the inner tiled MVM with the element-wise non-linear functions. Specifically, BW parallelizes the compute of multiple column tiles of the weight matrix (# MV Tiles in the original paper) with multiple tile engines, as shown in Figure 4 (a). Each tile engine contains a native-dimension number of dot-product engines, each vectorized by the number of lanes, and achieves a throughput of one tile per cycle. Parallel tiles along the row dimension are then fed into a pipelined reduction and accumulation unit. Immediately after the accumulation, the multi-function units (MFUs) execute the element-wise operations on the vector chunk produced by the accumulator. Although BW’s implementation still keeps vectorized intermediate results, their size is much smaller than in the BasicLSTM cell. Nonetheless, with parallelization across tiles, BW allocates many vectorized intermediate buffers that can still lead to energy inefficiency. BW performs one MVM operation over multiple iterations of column tiles.
The MVM operations are executed on each gate of the LSTM sequentially. Similarly, the element-wise operations for the non-linear operators are scheduled to execute on the vectorized multi-function units, as shown with the arrow in time in Figure 2. To avoid DRAM communication overhead and improve compute density, Brainwave stores the MVM weights in a blocked floating-point format, where the values in a block share a single 5-bit exponent and have distinct signs and 2-5 bit mantissas. As a result, BW achieves very dense low-precision compute and storage, with narrow multipliers and adders operating on the mantissas of each vector. The remaining operations are performed in 16-bit precision.
When the matrix dimensions are not divisible by the native dimension and the number of lanes, Brainwave suffers from underutilization of the compute FLOPS, as shown in Figure 4 (a). The underutilization is worse for small problem sizes. In addition, BW computes the hidden-state and input-vector MVMs separately rather than computing them as one concatenated larger matrix, which further aggravates the problem. This might be because BW’s abstraction does not allow partial updates of a vector: a vector is only updated at the end of the step.
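The fragmentation effect can be quantified with a small utilization model. The tile sizes below are illustrative stand-ins (BW's actual engine configuration is discussed in Section 5), and the function names are ours:

```python
from math import ceil

def tiled_mvm_utilization(rows, cols, native_dim, lanes):
    """Fraction of useful work when an (rows x cols) MVM is padded up
    to tile-engine granularity: rows pad to a multiple of `native_dim`,
    cols to a multiple of `lanes` (2-D fragmentation, Figure 4 (a))."""
    padded = ceil(rows / native_dim) * native_dim * ceil(cols / lanes) * lanes
    return rows * cols / padded

def dot_product_utilization(cols, lanes):
    """Loop-based design: only the reduction (R) dimension is padded,
    so fragmentation is 1-D (Figure 4 (b))."""
    return cols / (ceil(cols / lanes) * lanes)
```

For example, a 256 x 512 MVM padded to a 400-row, 40-lane grid wastes over a third of the FLOPS, while a 512-element dot product on 40 lanes wastes under 2%.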
3.3 Loop-based LSTM
We have made the following observations from analyzing BLAS-based LSTM implementations:
Constructing an LSTM cell’s computation graph using BLAS subroutines introduces large intermediate buffers even when the kernels themselves are blocked. Each element on an RNN cell’s non-reduction dimension (H) of the MVM can be computed completely independently within one time step. This exposes the opportunity for fine-grained loop tiling and fusion across the entire LSTM kernel.
MVM is the computation bottleneck in serving RNN cells. A spatial architecture allows us to devote most of the compute resources to MVM by parallelizing and pipelining MVM with the element-wise operations.
Using low-precision operations can boost compute density and keep RNN weights on-chip to avoid high-latency DRAM communication. We need efficient low-precision support in the target spatial architecture without introducing too much overhead.
To address the issue of large intermediate buffers, we fine-grain tile and fuse the MVM with the non-linear functions. We refer to the computation that generates a single element of c_t and h_t as an LSTM-1 operation, which can be computed independently in a single step. LSTM-1 is composed of four independent dot products of the row vectors of the weight matrices with the input vector, immediately followed by the element-wise operations on the outputs of the dot products. The resulting c_t and h_t vectors are produced by computing LSTM-1 operations for H iterations.
As shown in Figure 3, each MVM unit is replaced by a MapReduce unit that computes the tiled dot product. Each MapReduce is vectorized by V_R, with a pipelined map function followed by a pipelined reduction tree. P_R is the number of parallel MapReduce units. Results of the MapReduce blocks are reduced and accumulated with another reduction tree (not shown in the figure). Next, the dot-product result is passed through a chain of function units that execute the bias add and non-linear functions. The dot products, bias adds, and non-linear functions of the four gates can also be parallelized. Finally, the results of the four gates are pipelined through a set of function units for the element-wise operations in the LSTM cell. At the outer loop, LSTM-1 runs for H/P_H iterations, where P_H is the number of parallel LSTM-1 implementations.
In the loop-based design, all intermediate buffers are scalars as opposed to vectors. Regarding utilization, the loop-based LSTM design suffers less underutilization from unaligned problem sizes than the tiled MVM approach in BW. Figure 4 shows the sources of such underutilization. An MVM-based design suffers from 2-D fragmentation on both the R and H dimensions (Figure 4 (a)), whereas the loop-based design only suffers from 1-D fragmentation on the R dimension (Figure 4 (b)).
Figure 5 shows a loop-based LSTM design implemented in Spatial. Foreach is a loop construct with a lambda that takes the loop iterator as input. Reduce is a construct that executes MapReduce by taking a map function followed by a reduction function. Users declare explicit on-chip scratchpads and registers with SRAM and Reg. To enable fine-tuning of an RNN application, we expose the loop vectorization and unrolling factors (e.g. V_R, P_R, P_H).
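Since Figure 5 is not reproduced here, the following Python stand-in sketches the same loop structure: an outer Foreach over H (unrollable by P_H), an inner tiled Reduce over R (vectorized by V_R), and a fused element-wise tail. The parameter names follow the text; the code itself is our approximation, not the Spatial source:

```python
import numpy as np

def loop_lstm_step(W, b, z, c_prev, VR=4):
    """Loop-based LSTM step. W has shape (G, H, R) with gate order
    (f, i, o, candidate); z is the concatenated (h_prev, x_t) vector.
    Intermediates are scalars, mirroring register-resident values."""
    G, H, R = W.shape
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    c, h = np.empty(H), np.empty(H)
    for j in range(H):                    # Foreach over H (unroll by P_H)
        gates = np.empty(G)
        for g in range(G):                # gate-level parallelism
            acc = 0.0
            for r0 in range(0, R, VR):    # Reduce over R, tiled by V_R
                acc += np.dot(W[g, j, r0:r0 + VR], z[r0:r0 + VR])
            gates[g] = acc + b[g, j]      # fused bias add
        f, i, o = sig(gates[0]), sig(gates[1]), sig(gates[2])
        cand = np.tanh(gates[3])
        c[j] = f * c_prev[j] + i * cand   # fused element-wise tail
        h[j] = o * np.tanh(c[j])
    return c, h
```

On Plasticine, the j-loop body maps to parallel pipelines and the r0-loop to SIMD lanes; here the loops simply make the tiling and fusion structure explicit.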
4 Plasticine Specialization for RNN Serving
To show efficient execution of the loop and parallel pattern constructs, we map our implementation onto a spatial architecture, Plasticine. The Foreach and Reduce constructs in Figure 5 are mapped to PCUs on Plasticine. When the application size is small, these constructs are executed using pipelined SIMD lanes within a single PCU. When the application size is large, multiple PCUs can be used to parallelize and pipeline the dot product across PCUs. Element-wise operations can be executed in a deep pipeline formed by chaining multiple PCUs.
To fit an RNN’s weights on-chip, we execute our application with low-precision arithmetics. In this section, we propose the necessary micro-architectural changes to support low-precision arithmetics on Plasticine. We also discuss architectural parameter selection for Plasticine to serve RNN applications efficiently.
4.1 Mixed-Precision Support
Previous works Fowers et al. (2018); Jouppi et al. (2017) have shown that low-precision inference can deliver promising performance improvements without sacrificing accuracy. In the context of reconfigurable architectures such as FPGAs, low-precision inference not only increases compute density, but also reduces required on-chip capacity for storing weights and intermediate data.
To support low-precision arithmetic without sacrificing coarse-grained reconfigurability, we introduce two low-precision struct types in Spatial: a tuple of 4 8-bit and a tuple of 2 16-bit floating-point numbers, 4-float8 and 2-float16 respectively. Both types pack multiple low-precision values into a single 32-bit word. We support only 8 and 16-bit precisions, which are commonly seen in deep learning inference hardware. Users can only access values that are 32-bit aligned. This constraint guarantees that the microarchitectural change is local to the PCU. Banking and DRAM access granularity remain intact from the original design.
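The packing contract can be illustrated in a few lines. We use integer bytes in place of real 8-bit floats, and the function names are our own, but the point is the same: software only ever touches whole 32-bit words, so banking and DRAM granularity are untouched:

```python
import struct

def pack_4_float8(vals):
    """Pack four 8-bit values into one 32-bit word, modelling the
    4-float8 struct type (little-endian byte order is an assumption)."""
    assert len(vals) == 4
    return struct.unpack('<I', bytes(vals))[0]

def unpack_4_float8(word):
    """Recover the four 8-bit values from a packed 32-bit word."""
    return list(struct.pack('<I', word))
```

The same pattern applies to 2-float16 with two 16-bit halves per word.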
Figure 6 (a) shows the original SIMD pipeline in a Plasticine PCU. Each FU supports both floating-point and fixed-point operations. When mapping applications onto Plasticine, the innermost loop body is vectorized across the lanes of the SIMD pipeline, and different operations of the loop body are mapped to different stages. Each pipeline stage contains a few pipeline registers (PRs) that allow propagation of live variables across stages. Special cross-lane connections, shown in red in Figure 6, enable reduction operations. To support 8-bit element-wise multiplication and 16-bit reduction, we add 4 opcodes to the FU, shown in Figure 6 (b). The first two new opcodes perform element-wise, low-precision operations that multiply 4 8-bit values and add 2 16-bit values, respectively. The other two rearrange the low-precision values into two registers and pad them to higher precisions. A final stage reduces the two 32-bit values to a single 32-bit value using the existing add operation. From here, we can use the original reduction network shown in Figure 6 (a) to complete the remaining reduction and accumulation over 32-bit connections.
With 4 lanes and 5 stages, a PCU first reads 16 8-bit values, performs 8-bit multiplication followed by rearrangement and padding, and produces 16 16-bit values after the second stage. The intermediate values are stored in 2 PRs per lane. Next, the 16 16-bit values are reduced to 8 16-bit values and then rearranged into 8 32-bit values in 2 PRs per lane. Then, an element-wise 32-bit addition reduces the two registers in each lane into 4 32-bit values. These values are fed through the reduction network that completes the remaining reduction and accumulation in two plus one stages.
In a more aggressive specialization, we can fuse the multiply and rearrange into the same stage. We also fuse the first low-precision reduction with the next rearrange, as shown in Figure 6 (d). In this way, we can perform the entire low-precision map-reduce in 2 stages in addition to the original full-precision reduction. In order to maximize hardware reuse, we assume that it is possible to construct a full-precision FU from low-precision FUs. In addition, we observe that the original reduction network in the SIMD lanes could lead to low FU utilization. To improve FU utilization, we fold the entire tree structure into a single stage. Figure 6 (c) shows the folded reduction-accumulation structure. Specifically, later reductions in the tree are mapped to earlier stages in the pipeline. In this setup, the entire reduction plus accumulation is still fully pipelined with no structural hazard. With fused reduced-precision multiplication and reduction, and a folded reduction tree, a PCU can perform an entire map-reduce that accumulates 8-bit values using 4 stages.
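The resulting mixed-precision datapath can be modelled stage by stage. Python integers stand in for the float8/float16/float32 formats here, so no rounding behavior is modelled; only the widening map-reduce structure is:

```python
def pcu_map_reduce(a8, b8, acc=0):
    """Stage model of the specialized PCU map-reduce: 8-bit multiplies
    widen to 16 bit, a pairwise 16-bit add widens to 32 bit, and the
    existing 32-bit reduction tree finishes the sum."""
    assert len(a8) == len(b8) and len(a8) % 2 == 0
    # Stage 1: element-wise 8-bit multiply, fused with rearrange/pad to 16-bit
    prods16 = [x * y for x, y in zip(a8, b8)]
    # Stage 2: pairwise 16-bit add, fused with rearrange/pad to 32-bit
    sums32 = [prods16[i] + prods16[i + 1] for i in range(0, len(prods16), 2)]
    # Remaining stages: folded 32-bit reduction tree plus accumulation
    while len(sums32) > 1:
        sums32 = [sums32[i] + sums32[i + 1] for i in range(0, len(sums32), 2)]
    return acc + sums32[0]
```

With 16 inputs (4 lanes of packed 4-float8 values), stages 1 and 2 do the low-precision work and the folded tree handles the rest, matching the 4-stage schedule described above.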
4.2 Sizing Plasticine for Serving RNN
Evaluating an RNN cell with H hidden units and X input features requires roughly 2GHR compute operations and GHR weight reads, where R = H + X. With large R, the compute-to-memory ratio approaches 2:1. The original Plasticine architecture uses a checkerboard layout with a 1:1 ratio between PCUs and PMUs. A PCU has 6 stages and 16 lanes, and a PMU has 16 banks. This provides a 6:1 ratio between compute resources and on-chip memory read bandwidth. As a result of this layout, on-chip memory read bandwidth becomes the bottleneck when accelerating RNN serving applications. Given that RNNs cover a wide range of important applications, we select a Plasticine configuration tailored for RNN serving. Specifically, we choose a 2:1 PMU-to-PCU ratio with 4 stages in each PCU. Figure 7 shows the layout of this Plasticine variant.
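The 2:1 ratio follows from a simple cost model over the weight MVMs (G = 4 gates, R = H + X; bias adds and element-wise operations are ignored, and the function name is ours):

```python
def rnn_cell_cost(H, X, G=4):
    """Per-step cost of an RNN cell's weight MVMs: each of the G*H*R
    weights is read once and contributes one multiply and one add."""
    R = H + X
    flops = 2 * G * H * R     # multiply + add per weight
    reads = G * H * R         # every weight is read once per step
    return flops, reads
```

Since every weight read feeds exactly two FLOPs, the ratio is 2:1 independent of size; a 6:1 compute-to-bandwidth provisioning therefore starves the PCUs, motivating the 2:1 PMU-to-PCU layout.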
In this section, we evaluate the real-time RNN serving tasks on various platforms. We start with the methodology of our experiments, followed by a discussion of performance and power comparisons across these platforms.
To evaluate RNN serving, we use the LSTM and GRU tasks from Baidu DeepBench as our benchmarks. We evaluate the benchmarks across processor-based architectures, including CPU and GPU, and spatial architectures, including FPGA and CGRA. Table 4 shows the detailed specifications of the target hardware, which includes state-of-the-art high-performance platforms in each commercialized category. Table 5 summarizes the application configurations on each platform.
We implement the applications in TensorFlow 1.10, and evaluate our implementations on an Intel Xeon Scalable Processor (Skylake) CPU. We use the LSTMBlockFusedCell and GRUBlockCell kernels in TensorFlow. We further enable AVX2 vector instructions for the CPU evaluation. Due to the lack of low-precision support in both the tool chain and the platform, we use single precision for our CPU implementation.
We use TensorFlow with the cuDNN library to target an NVIDIA Tesla V100 GPU from Google Cloud. cuDNN is a GPU-accelerated library from NVIDIA specialized for deep learning. We use 16-bit precision for our GPU implementation. On both the CPU and GPU platforms, we run TensorFlow profilers and collect only the time spent evaluating the RNN cells.
We implement the applications in Spatial, targeting Plasticine. Although Spatial has FPGA back-end support, Stratix 10 was not commercially available at the time of submission of this work, and the current FPGA targets that Spatial supports are not comparable to Stratix 10 in either memory or compute capacity. Therefore, we only use Spatial to target Plasticine for this evaluation. However, our approach should generally benefit an implementation on a high-performance FPGA like Stratix 10. We choose a Plasticine configuration that matches the peak 8-bit FLOPS and on-chip scratchpad capacity of a Stratix 10 FPGA; the exact configuration is shown in Table 3. To minimize the overhead of low-precision support, Plasticine only supports 8-bit, 16-bit, and 32-bit element-wise operations, and a mixed-precision reduction operation. For our evaluation, the element-wise operations are performed in 8-bit precision, the first stage of the reduction is performed in 16-bit, and the remainder of the reduction and accumulation is performed in 32-bit operations.
To measure the performance, we use a cycle-accurate simulator for Plasticine. We modified the simulator to model the proposed micro-architectural changes that support low-precision operations. We use the area and power of individual CUs and network switches from the original Plasticine paper, and compute the total area of the configuration shown in Table 3. As discussed in Section 4, we reduce the number of stages in a PCU from 6 to 4 with fused low-precision operations and a folded reduction tree. Low-precision function units can be used to compose full-precision units. We conservatively estimate that the area and power of a PCU stay the same with our proposed changes and two fewer stages. We also increase the PMU-to-PCU ratio to better match the compute-to-memory ratio of RNN inference applications. To match the memory capacity of Stratix 10, we shrink the scratchpad capacity of each PMU from 256kB to 84kB. For power calculations, we generate activity traces of the CUs from simulation, and then integrate them with the characterized power of individual PCUs to compute the total power. The power and area characterizations are based on synthesis at 28nm technology with a 1GHz clock frequency.
Finally, we also compare our results to the Microsoft Brainwave framework. For this evaluation, we compare to Brainwave implemented on top of an Intel Stratix 10 FPGA. Brainwave is synthesized at 250MHz, and all operations are performed in the blocked low-precision floating-point format described in Section 3.2.
|# Row||24||# Column||24|
|# PCU||192||# PMU||384|
|# Lanes in PCU||16||# Stages in PCU||4|
|Scratchpad capacity per PMU||84kB|
|Specification||Intel Xeon Skylake (Dual core)||Tesla V100 SXM2||Stratix 10 280 FPGA||Plasticine|
|Max Clock Rate (GHz)||2.0/2.8*||1.38/1.53*||1||1|
|On-chip memory** (MB)||55||20||30.5||31.5|
|Peak 32-bit TFLOPS||–||15.7||10||12.5|
|Peak 8-bit TFLOPS||–||–||48||49|
|Die Area (mm²)||64.4||815||1200||494.37|
* Base/Boosted Frequency ** Capacity of L3 cache for CPU, register file for GPU, and on-chip scratchpad for reconfigurable architectures.
|Platform||Intel Xeon Skylake||Tesla V100 SXM2||Stratix 10 280 FPGA||Plasticine|
|Achieved Clock Frequency (GHz)||2||1.38||0.25||1|
|Precision||f32||f16||blocked precision||mix f8+16+32|
|Benchmarks||Latency (ms)||Effective TFLOPS||Plasticine Speedup (x)||Power (W)|
|H||T||Xeon Skylake||Tesla V100||BW||Plasticine||Xeon Skylake||Tesla V100||BW||Plasticine||Xeon Skylake||Tesla V100||BW||Plasticine|
|Benchmarks||Stratix 10 BW||Plasticine|
5.2 RNN Performance Analysis
Table 6 shows the performance comparison of LSTM and GRU with various numbers of hidden units (H) and step sizes (T) over the four platforms. Overall, both the CPU and GPU significantly underutilize the available compute FLOPS. In addition, they cannot meet the latency requirement for real-time serving at any problem size. Both BW and Plasticine deliver promising latencies within 5ms for all problem sizes. When serving very large RNNs, BW provides better performance, up to 2x over Plasticine on the largest GRU (H=2816). When serving small and medium-size RNNs, Plasticine performs better, with up to 30x higher performance on the small GRU (H=512). We also observe that Plasticine delivers consistent FLOPS across all the problem sizes.
For the CPU experiments, the RNN kernels from TensorFlow are not multi-threaded. Since we focus on real-time serving of RNN applications, we use a batch size of 1 for all of our benchmarks, which exposes no parallelism above the kernel level. Consequently, the machine is still very underutilized even with AVX2 instructions. Although one could implement an RNN directly in C++, the MVM sizes in RNNs are too small to benefit from multi-threading due to the synchronization overhead. V100 with the cuDNN library provides significant acceleration compared to the CPU. Nevertheless, the latency is still high. This is because GPUs are designed for throughput-oriented rather than latency-sensitive workloads. Given that the library is based on BLAS3 routines, which are matrix-matrix operations, the MVMs in RNN serving suffer from significant resource underutilization. In Table 6, V100 shows very poor performance on GRU (H=512). This is likely due to initialization overhead that should not have been timed. From our evaluation, neither processor-based architecture is suitable as a low-latency serving platform for RNN applications.
Table 7 shows the selected design parameters for each problem size for BW and Plasticine. On Stratix 10, BW uses 6 tile engines, a native dimension of 400, and 40 lanes. A large native dimension and lane count improve the data-to-control ratio by amortizing the scheduling overhead over a large vectorized instruction. However, this design choice aggravates the underutilization for small RNN feature sizes of 256 and 512. Our implementation effectively uses tiles of size 1 by performing dot products instead of MVM, which prevents fragmentation in the H dimension. In addition, all of our intermediate buffers are stored in registers, whereas BW uses vector register files. Our proposed implementation also captures additional gate-level, X, and H parallelism, as well as pipelining of the element-wise functions. In contrast, BW schedules these operations in time and dispatches corresponding instructions to drive the compute units.
A CGRA is less flexible than an FPGA at performing arbitrary low-precision operations. In this example, we increase the memory density of Plasticine by supporting quantized precisions as described in Section 4.1. All weights are stored in 8-bit format, as are the multiplication operations of the MVMs. The reduction and accumulation operations are implemented in a mix of 16- and 32-bit precisions. Hence, the peak FLOPS when performing this mixed-precision map-reduce is much lower than the peak FLOPS of BW's blocked low-precision format. As a result, Plasticine performs worse than BW on the large RNNs.
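A minimal sketch of the mixed-precision arithmetic described above, using plain Python integers to stand in for hardware registers. The function name and structure are illustrative, not the Plasticine implementation:

```python
def mvm_mixed_precision(weights_i8, x_i8):
    """MVM with 8-bit multiplies and wider accumulation: each int8*int8
    product fits in 16 bits, while partial sums are carried in a wider
    (32-bit in hardware) accumulator to avoid overflow."""
    out = []
    for row in weights_i8:
        acc = 0  # stands in for a 32-bit accumulator
        for w, x in zip(row, x_i8):
            p = w * x  # int8 * int8 product, fits in int16
            assert -(1 << 15) <= p < (1 << 15)
            acc += p
        out.append(acc)
    return out

W = [[127, -128, 3], [0, 1, -2]]  # 8-bit weights
x = [100, 50, -7]                 # 8-bit activations
print(mvm_mixed_precision(W, x))  # -> [6279, 64]
```

The widening from 8-bit products to 16/32-bit accumulation is exactly why the mixed-precision map-reduce cannot match the raw FLOPS of a fully blocked low-precision datapath: part of the datapath must run at the wider width.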
Nevertheless, Plasticine delivers very consistent FLOPS across problem sizes. For small problem sizes, the dot product can be fully unrolled, so we can increase parallelism across the hidden units. For large problem sizes, the dot product becomes the bottleneck of the pipeline, so we shift parallelism back toward the dot product to balance throughput between the dot product and the element-wise operations. BW, in contrast, uses a single set of parameters for all problem sizes. Although one could potentially tune parameters for different problem sizes, doing so incurs re-synthesis and place-and-route on an FPGA, which takes an order of magnitude longer than the compilation needed for a CGRA design. In addition, to exhaust hardware resources with a smaller native dimension, one would have to increase the number of matrix-vector tile engines in BW. The decoders and schedulers associated with these units would then drive up the control-to-data overhead and deliver lower FLOPS for larger problem sizes.
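The rebalancing argument can be sketched with a toy throughput model: in a pipeline, the slowest stage sets the rate, so moving parallelism from an underloaded stage to the bottleneck lowers the overall cycle count. The cost expressions and parallelism values below are hypothetical, not the parameters in Table 7:

```python
import math

def pipeline_cycles(hidden, dot_par, ew_par):
    """Cycles per time step of a two-stage pipeline: an O(hidden^2)
    dot-product (MVM) stage and an O(hidden) element-wise stage,
    each divided by its assigned parallelism. Throughput is set by
    the slower stage."""
    dot = math.ceil(hidden * hidden / dot_par)  # MVM work
    ew = math.ceil(hidden / ew_par)             # element-wise work
    return max(dot, ew)

# For a large hidden size, shifting parallelism from the (underloaded)
# element-wise stage toward the dot product lowers the cycle count:
print(pipeline_cycles(1536, 512, 512))   # 4608, dot-product bound
print(pipeline_cycles(1536, 1008, 16))   # 2341, rebalanced
```

The same total parallelism budget yields very different throughput depending on how it is split, which is why per-problem-size tuning (cheap on a CGRA, expensive on an FPGA) matters.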
5.3 Area and Power Analysis
Table 4 shows the die area comparison of the different platforms. While the GPU has a publicly reported die area measurement Markidis et al. (2018), the Xeon Skylake and Stratix 10 only have die areas estimated from their transistor counts Cutress (2017). Even with these rough estimates, we can see that although the CPU has the smallest area, its performance gap is too large even after scaling up to a 28-core server. The GPU also delivers poor performance per area, mostly due to low utilization of its compute FLOPS. Stratix 10 delivers the best performance on the large RNNs, but with the largest estimated die area, at 30 billion transistors Gazettabyte (2015). Plasticine’s die area is based on synthesis results at 28nm, one generation older than all the other platforms. With technology scaling, Plasticine should possess double the compute and memory resources at 14nm for the same die area, which would roughly match Stratix 10’s performance on all the RNN problem sizes. At the same time, Plasticine is more than 2x smaller than Stratix 10, which would contribute a further 2-60x performance-per-area improvement across problem sizes. Table 4 also shows the thermal design power (TDP) of the four platforms, which is the peak power achievable for any workload Intel (Technical report, 2018); Durant et al. (2017). BW also reports a measured peak power of 125W for the given set of benchmarks. Table 6 shows the simulated power of Plasticine for each benchmark. Overall, the peak power across benchmarks for Plasticine is 118W, slightly below BW’s measured peak power.
6 Related Work
Previously proposed serving platforms focus on exploiting data locality by mapping RNN cells onto spatial architectures. For example, Chang et al. presented an FPGA-based implementation of an LSTM network Chang et al. (2015). This approach works well for small RNNs. For a large RNN, however, the weights are too large to fit on-chip, so serving latency is dominated by loading data from DRAM. To address the issue of fitting RNN weights on-chip, several previous works Han et al. (2016b); Wang et al. (2018); See et al. (2016); Narang et al. (2017) have studied approaches for compressing RNN weights. For example, Han et al. presented a compression scheme called DSD that iteratively removes parameters in the weight matrices and retrains the sparse model to minimize the accuracy loss introduced by sparsity Han et al. (2016b). With this compression scheme, Han et al. were able to deploy an LSTM network containing 3.2 million parameters onto a modern FPGA without sacrificing accuracy. Compared to serving on CPU and GPU platforms, serving a sparse LSTM network on an FPGA provides much lower latency and higher energy efficiency. However, we find that it can be hard to generalize this compression scheme to all RNN tasks. RNNs are very flexible in their model structures, and applying a DSD-like compression scheme to every RNN model requires hand-tuning the compression heuristics for each one. To avoid hand-tuning, He et al. proposed an approach that uses reinforcement learning for automatic compression tuning He et al. (2018). However, their approach focuses on compressing CNN tasks on edge devices, and may not transfer to serving RNN tasks in the datacenter. Observing that sparsity-based compression schemes are still under active development, we instead choose to support compression schemes that represent RNN weights in low-precision data formats. Commercially available platforms such as the Google TPU Jouppi et al. (2017) and Microsoft BrainWave Fowers et al. (2018) support these schemes.
In this paper, we describe a set of techniques for performing cross-kernel optimization within RNN cells. We show that by moving away from the BLAS abstraction and focusing on optimizing loop-level constructs, we achieve consistent hardware utilization when serving RNN cells of different sizes. We achieve a 10-20x performance improvement over the state-of-the-art GPU platform despite using a less advanced technology node, and a 2x geometric-mean speedup over the state-of-the-art FPGA-based platform.
We appreciate the anonymous reviewers for their feedback. We thank Matthew Feldman for compiler support and his constructive suggestions on the manuscript of this paper, and Raghu Prabhakar for providing insights and feedback on the architecture section of this paper. We also thank Google for the cloud credits. This material is based on research sponsored by Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) under agreement number FA8650-18-2-7865. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) or the U.S. Government. This research is also supported in part by affiliate members and other supporters of the Stanford DAWN project - Ant Financial, Facebook, Google, Infosys, Intel, Microsoft, NEC, Teradata, SAP and VMware.
- Tensorflow: a system for large-scale machine learning.. In OSDI, Vol. 16, pp. 265–283. Cited by: §3.1.
- EC2 f1 instances with fpgas now generally available. Technical report Amazon. Note: https://aws.amazon.com/blogs/aws/ec2-f1-instances-with-fpgas-now-generally-available/ Cited by: §2.2.
- Recurrent neural networks hardware implementation on fpga. arXiv preprint arXiv:1511.05552. Cited by: §6.
- Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits 52 (1), pp. 127–138. Cited by: §1.
- Dadiannao: a machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 609–622. Cited by: §1.
- Cudnn: efficient primitives for deep learning. arXiv preprint arXiv:1410.0759. Cited by: §3.1.
- Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §2.
- The intel skylake-x review: core i9 7900x, i7 7820x and i7 7800x tested. Technical report AnandTech. Note: https://www.anandtech.com/show/11550/the-intel-skylakex-review-core-i9-7900x-i7-7820x-and-i7-7800x-tested/ Cited by: §5.3.
- Dense linear algebra on gpus. Note: https://developer.nvidia.com/cublas Cited by: §1.
- Inside volta: the world’s most advanced data center gpu. Technical report NVIDIA. Note: https://devblogs.nvidia.com/inside-volta/ Cited by: §5.3.
- A configurable cloud-scale dnn processor for real-time ai. In Proceedings of the 45th Annual International Symposium on Computer Architecture, pp. 1–14. Cited by: §1, §3.2, §4.1, §6.
- Altera’s 30 billion transistor fpga. Note: http://www.gazettabyte.com/home/2015/6/28/alteras-30-billion-transistor-fpga.html Cited by: §5.3.
- EIE: efficient inference engine on compressed deep neural network. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on, pp. 243–254. Cited by: §1.
- Dsd: dense-sparse-dense training for deep neural networks. arXiv preprint arXiv:1607.04381. Cited by: §6.
- AMC: automl for model compression and acceleration on mobile devices. Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800. Cited by: §6.
- Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.
- AN 787: intel stratix 10 thermal modeling and management. Technical report Intel. Note: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/an/an787.pdf Cited by: §5.3.
- Product specification. Technical report Intel. Note: https://ark.intel.com/products/codename/37572/Skylake Cited by: §5.3.
- In-datacenter performance analysis of a tensor processing unit. In Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on, pp. 1–12. Cited by: §1, §4.1, §6.
- Automatic generation of efficient accelerators for reconfigurable hardware. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Vol. , pp. 115–127. External Links: Cited by: §2.2.
- Spatial: a language and compiler for application accelerators. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2018, New York, NY, USA, pp. 296–311. External Links: Cited by: §1.
- Efficient and reliable high-level synthesis design space explorer for fpgas. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL), Vol. , pp. 1–8. External Links: Cited by: §2.2.
- NVIDIA tensor core programmability, performance & precision. arXiv preprint arXiv:1803.04014. Cited by: §5.3.
- Baidu deepbench. GitHub Repository. Note: https://github.com/baidu-research/DeepBench Cited by: §1.
- Exploring sparsity in recurrent neural networks. arXiv preprint arXiv:1704.05119. Cited by: §6.
- SDA: software-defined accelerator for largescale dnn systems. Hot Chips 26. Cited by: §2.2.
- Plasticine: a reconfigurable architecture for parallel patterns. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 389–402. Cited by: §1.
- A reconfigurable fabric for accelerating large-scale datacenter services. In Proceeding of the 41st Annual International Symposium on Computer Architecuture, ISCA ’14, Piscataway, NJ, USA, pp. 13–24. External Links: Cited by: §2.2.
- Compression of neural machine translation models via pruning. arXiv preprint arXiv:1606.09274. Cited by: §6.
- C-lstm: enabling efficient lstm using structured compression techniques on fpgas. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 11–20. Cited by: §6.