1 Introduction
Recurrent Neural Networks (RNNs) are a state-of-the-art machine learning approach that has achieved tremendous success for a wide variety of sequence-to-sequence application domains [1, 2, 3, 4, 5]. Unlike a feedforward Deep Neural Network (DNN), an RNN remembers information from previous inputs to improve accuracy. Long Short Term Memory (LSTM) [6] networks represent the preferred RNN implementation nowadays. LSTM cells can remember useful information over a long period of time, whereas such information vanishes over time in other RNN approaches. LSTM networks are currently used for many sequence processing problems such as speech recognition [5], machine translation [1] or language modeling [7]. This type of application is of special interest for mobile devices such as tablets, smartphones or smartwatches. For example, voice-based interfaces represent a more natural human-computer interface than touchscreens and keyboards. Unfortunately, there are several challenges that hinder the deployment of LSTM networks in mobile devices. First, accurate LSTM networks are typically quite large and, therefore, they require substantial memory storage and computational resources. Real-time LSTM evaluation comes at a high energy cost that may not be acceptable for many low-power devices. Second, due to its recurrent nature, LSTM network inference exhibits a significant amount of sequential processing and limited parallelism, and thus it cannot be efficiently executed on multi-core CPUs or GPUs. Not surprisingly, our measurements on a recent Tegra X1 mobile SoC show that the CPU and the GPU do not achieve real-time performance for EESEN [5] and RLDRADSPR [8], two state-of-the-art LSTM networks for speech recognition.
A few FPGA-based LSTM network accelerators targeted to the mobile segment have been presented in recent years [9, 10]. In these designs, high energy efficiency is achieved by storing all the synaptic weights in local memory, since accesses to external DRAM memory consume more than two orders of magnitude more energy than accessing a small on-chip buffer [11]. Due to the requirement of storing the entire LSTM network on-chip, the aforementioned accelerators are restricted to small LSTM models. Supporting larger LSTM networks, which provide state-of-the-art accuracy, would require a significant increase in local storage and main memory bandwidth usage, which would incur a high energy overhead.
In this paper, we present E-PUR, a processing unit for recurrent neural networks that supports large LSTM models and provides real-time performance with an energy consumption amenable to mobile devices. E-PUR efficiently implements in hardware the different components of an LSTM cell, providing enough flexibility to support LSTM networks for different applications. A main challenge for E-PUR is fetching the weights from memory in an energy-efficient manner. Storing them in local memory is not feasible due to the large size of modern LSTM models, which is in the order of tens or even hundreds of Mbytes, but accessing off-chip memory is extremely expensive from an energy point of view [11]. E-PUR makes a trade-off between local memory capacity and external memory bandwidth to achieve low power, providing local storage for just one LSTM layer. Figure 1 shows an LSTM network consisting of multiple LSTM cells, arranged in several layers, which are recurrently executed to process the different elements of the input sequence.
In E-PUR, the weights of one LSTM layer are fetched from external DRAM and stored in on-chip local memory. Next, each cell of the layer is evaluated for the entire input sequence, reusing the weights stored in local memory for every element of the input sequence. The cost of accessing main memory is amortized due to the large size of typical input sequences, which is in the order of thousands of elements (e.g. audio frames). Due to the current trend towards deeper neural networks, E-PUR offers good scalability, as the size of the on-chip local memory is independent of the number of layers.
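The layer-by-layer schedule described above can be sketched in a few lines (a simplified software illustration, not the actual hardware control logic; `layer_weights`, `cell_step` and the fetch counter are hypothetical names). The point is that off-chip weight traffic grows with the number of layers, not with the sequence length:

```python
def evaluate_network(layer_weights, sequence, cell_step):
    """Evaluate a stack of recurrent layers, 'fetching' each layer's
    weights from DRAM once and reusing them for the whole sequence."""
    fetches = 0
    outputs = list(sequence)
    for W in layer_weights:
        fetches += 1                 # one off-chip weight fetch per layer
        h = 0.0                      # recurrent state, reset for each layer
        next_outputs = []
        for x in outputs:            # reuse W for every sequence element
            h = cell_step(W, x, h)
            next_outputs.append(h)
        outputs = next_outputs
    return outputs, fetches
```

For a sequence of thousands of elements, the single fetch per layer is amortized over thousands of reuses of the same weights.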
To further improve the energy efficiency of weight fetching, we observe that an LSTM cell has two types of connections: self-recurrent, a.k.a. recurrent, connections and forward connections from the previous layer (see Figure 1). Data dependencies impose a strict sequential order for processing recurrent connections. However, forward connections can be processed in any order, since the results from the previous layer are available when the current layer starts execution. In this paper, we introduce Maximizing Weight Locality (MWL), a technique that modifies the order in which forward connections are processed to maximize temporal locality. When leveraging MWL, E-PUR requires modest local storage capacity and memory bandwidth, even for large LSTM networks. For example, for EESEN [5], a speech recognition LSTM network that has a size of 42 Mbytes, E-PUR only requires 1.5 Mbytes of local storage. Furthermore, the main memory bandwidth usage for real-time performance is as small as 4.2 Mbytes/s, only 0.02% of the available memory bandwidth of conventional low-power systems such as the Tegra X1.
To summarize, this paper focuses on implementing energy-efficient, real-time LSTM networks. Its main contributions are the following:

We propose E-PUR, a processing unit for recurrent neural networks that improves energy efficiency with respect to CPU and GPU by orders of magnitude.

We introduce Maximizing Weight Locality (MWL), a technique that dramatically improves temporal locality of weight fetching, providing huge energy savings.

We evaluate E-PUR for large, representative LSTM networks from different application domains, including speech recognition, machine translation and video classification.

E-PUR achieves real-time performance while reducing energy by 92x on average when compared with a contemporary low-power mobile SoC. Its peak power is 975 mW and its area is 46.3 mm², which is reasonable for most mobile devices.
The rest of the paper is organized as follows. Section 2 provides background on LSTM networks. Section 3 presents E-PUR, our processing unit for recurrent neural networks. Section 4 describes our evaluation methodology and Section 5 details the experimental results. Section 6 reviews related work and, finally, Section 7 sums up the main conclusions.
2 Recurrent Neural Networks
Feedforward Deep Neural Networks (DNNs), such as Convolutional Neural Networks (CNNs), have been shown to be very successful for classification problems. However, they fail to provide an effective framework for sequence-to-sequence machine learning applications (e.g. machine translation) for several reasons. First, the input/output dimensionality of a feedforward DNN is fixed, whereas sequence processing problems require variable-length input/output. Second, DNNs use a fairly constrained amount of context information to make a prediction, typically a few frames of audio/video or a few words, but some problems require taking distant past or future information into account to be accurate. Not surprisingly, sequence processing tasks such as machine translation or audio/video description cannot be accurately accomplished with the sole use of a feedforward DNN [12]. Note that a DNN can be used for a specific subtask of a sequence processing problem, like acoustic scoring in speech recognition, but a very expensive post-processing stage is still required to generate the output sequence [13, 14].
In order to overcome the aforementioned limitations of feedforward DNNs, Recurrent Neural Networks (RNNs) [15] have been proposed. RNNs include loops or recurrent connections, allowing information to persist from one step, i.e. execution, of the network to the next. Therefore, RNNs can potentially employ an unbounded amount of context information to make predictions. In addition, RNNs are recurrently executed for every element of the input sequence and, hence, they can handle variable-length input/output, which is a requirement for sequence processing problems.
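The recurrence that lets an RNN carry context from one step to the next can be captured in a minimal sketch (generic step function; not tied to any particular RNN variant):

```python
def simple_rnn(step, xs, h0):
    """Apply the same step function to every input element, threading a
    hidden state h that carries context across the sequence. The output
    length matches the input length, so variable-length sequences are
    handled naturally."""
    h, ys = h0, []
    for x in xs:
        h = step(x, h)   # recurrent connection: h feeds back into itself
        ys.append(h)
    return ys
```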
Simple RNN architectures can capture and exploit short term dependencies. However, exploiting long term dependencies is challenging and, typically, useful information is diluted over time in many RNN approaches. To overcome this issue, Long Short Term Memory (LSTM) networks were proposed [6], which represent the most successful and widely used RNN implementation, with applications in speech recognition [5], machine translation [1] and language modeling [7]. In this section, we explain in detail the structure and behavior of LSTM networks.
2.1 LSTM RNN
An LSTM RNN consists of multiple layers that are stacked together to form a deep RNN, including an input layer and multiple hidden layers formed by LSTM cells. These layers can be unidirectional or bidirectional. Unidirectional layers only use past information to perform inference for the current execution step, whereas bidirectional layers exploit both past and future context information and, typically, they provide higher accuracy. Therefore, Deep Bidirectional LSTM (BiLSTM) networks deliver state-of-the-art accuracy for multiple sequence processing problems [16, 15, 17].
Figure 2 shows an unrolled BiLSTM network with one hidden layer. The bidirectional layer consists of two LSTM cells: the first one processes the information in the forward direction, i.e. from x_1 to x_N, while the second one processes the input sequence in the backward direction, i.e. from x_N to x_1. Figure 2 shows multiple instances of these two cells for each layer, which correspond to multiple recurrent uses of the same two cells, one for each element of the input sequence. In this logical view of the network, a.k.a. unrolled, recurrent connections are shown as horizontal connections, either left-to-right or vice versa, and they correspond in fact to connections from the output of one cell to the input of the same cell. In a given layer, the outputs of the LSTM cells in the forward and backward directions are concatenated, forming the input (x_t) for the next layer. Finally, a BiLSTM network includes a feedforward (non-recurrent) softmax output layer that produces the final output of the network. For example, for speech or text applications, the outputs represent the likelihoods of the different characters, phonemes or words at each step.
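The unrolled bidirectional layer of Figure 2 can be sketched as follows (a simplified illustration with generic step functions; tuples stand in for the concatenation of the forward and backward output vectors):

```python
def bilstm_layer(step_fw, step_bw, xs):
    """One bidirectional layer: the forward cell scans x_1..x_N, the
    backward cell scans x_N..x_1, and their per-step outputs are
    concatenated to form the input of the next layer."""
    h, hs_fw = 0.0, []
    for x in xs:                   # forward direction
        h = step_fw(x, h)
        hs_fw.append(h)
    h, hs_bw = 0.0, []
    for x in reversed(xs):         # backward direction
        h = step_bw(x, h)
        hs_bw.append(h)
    hs_bw.reverse()                # realign with the forward time axis
    return [(f, b) for f, b in zip(hs_fw, hs_bw)]  # concatenated outputs
```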
2.2 LSTM Cell
Figure 3 shows the basic structure of an LSTM cell. A key component is the cell state (c_t), which represents the memory storage of the cell. On each cell, the state is updated by four components, commonly named gates, which also perform the computation of the cell output (h_t). Each of these gates consists of two fully-connected networks: one taking as input the output of the previous LSTM layer (x_t) and one taking as input the output of the LSTM cell in the previous time step (h_{t-1}). The former uses the forward connections, whereas the latter includes the recurrent or feedback connections.
Figure 4 shows the computations performed within an LSTM cell. For each new element (x_t) of the input sequence, the following actions are taken. First, the cell updater gate (g_t) modulates the amount of input information that is considered a candidate to update the cell state. Then, the input gate (i_t) decides how much of the candidate information will be stored into the cell state. On the other hand, the forget gate (f_t) determines how much information will be removed from the current cell state (c_{t-1}), i.e. which information is no longer useful for future predictions. Finally, the output gate (o_t) decides the amount of information that will be emitted from the cell.
(1)  i_t = σ(W_ix · x_t + W_ih · h_{t-1} + W_ic ⊙ c_{t-1} + b_i)
(2)  f_t = σ(W_fx · x_t + W_fh · h_{t-1} + W_fc ⊙ c_{t-1} + b_f)
(3)  g_t = φ(W_gx · x_t + W_gh · h_{t-1} + b_g)
(4)  c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
(5)  o_t = σ(W_ox · x_t + W_oh · h_{t-1} + W_oc ⊙ c_t + b_o)
(6)  h_t = o_t ⊙ φ(c_t)
⊙, φ and σ denote element-wise multiplication, the hyperbolic tangent and the sigmoid function, respectively.
In other words, information that is no longer useful is removed from the cell state using the mask generated by the forget gate. New information is added to the cell state by applying the mask generated in the input gate to the candidate information produced by the cell updater gate. Then, to compute the cell output, a hyperbolic tangent is applied to the current cell state and the resulting value is multiplied by the mask generated in the output gate. Therefore, the cell output (h_t) is a filtered version of the cell state.
The mathematical computations performed in the four gates are very similar, as can be seen in equations 1, 2, 3 and 5 in Figure 4. Note that conceptually each of the four gates is composed of multiple neurons and, as shown in Figure 4, each of them consists of two independent feedforward fully-connected networks, which are implemented as two matrix-vector multiplications. Therefore, for each neuron in the four gates and in all cells, two dot-product operations are performed: one for the forward connections and one for the recurrent connections. Then, the outputs of these connections are added to a bias (b) and to a peephole connection. Note that peephole connections are a masked version of the cell state and are used to link the cell state to the gates. Therefore, they allow the cell state to have control over which information is added, removed or emitted, improving prediction accuracy for machine learning applications that require precise timing [18]. These connections are shown as dotted lines in Figure 3. Finally, an activation function is applied to the result to obtain the output value of each neuron.
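The cell computation can be written out directly as a NumPy sketch. The weight names in the dictionary W are our own, chosen to mirror equations (1)–(6); peepholes are element-wise (diagonal) products on the cell state, as described above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(W, x_t, h_prev, c_prev):
    """One LSTM cell step with peephole connections. Each gate performs
    two matrix-vector products: one on the forward input x_t and one on
    the recurrent input h_prev, plus bias and peephole terms."""
    i = sigmoid(W['Wix'] @ x_t + W['Wih'] @ h_prev + W['Wic'] * c_prev + W['bi'])
    f = sigmoid(W['Wfx'] @ x_t + W['Wfh'] @ h_prev + W['Wfc'] * c_prev + W['bf'])
    g = np.tanh(W['Wgx'] @ x_t + W['Wgh'] @ h_prev + W['bg'])  # cell updater
    c = f * c_prev + i * g                                     # new cell state
    o = sigmoid(W['Wox'] @ x_t + W['Woh'] @ h_prev + W['Woc'] * c + W['bo'])
    h = o * np.tanh(c)                                         # cell output
    return h, c
```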
3 E-PUR Processing Unit
In this section, we present E-PUR, an energy-efficient processing unit for large LSTM networks. First, we describe the main drawbacks of state-of-the-art solutions for LSTM inference. Next, we present the architecture of E-PUR, which is an energy-efficient hardware implementation of an LSTM cell. We detail the main parameters and trade-offs made during the design of E-PUR. Finally, we present Maximizing Weight Locality (MWL), a technique that largely improves the temporal locality of the memory accesses for fetching the synaptic weights.
3.1 Motivation
State-of-the-art hardware implementations [19, 20] for LSTM networks rely on storing all synaptic weights on-chip in order to avoid expensive off-chip memory accesses. As we can see in Figure 5, this approach is infeasible for many LSTM applications, due to the large memory requirements to achieve high accuracy. For example, the GMAT [21] LSTM network for machine translation requires more than 250 Mbytes of memory.
Based on the recurrent nature of LSTM networks, we propose a cost-effective trade-off between main memory accesses and on-chip memory storage. It is based on the observation that the input sequences of LSTM networks tend to contain a large number of elements and that, for evaluating a single pass (backward or forward) of a given layer, only the weights of that particular layer are used to evaluate the whole input sequence. We exploit this characteristic of RNNs to design the memory system of E-PUR, providing on-chip memory capacity to store only the weights of a single LSTM layer. Note that, as seen in Figure 5, the storage requirements are reduced by 7x on average. This comes at the expense of higher off-chip memory traffic; nonetheless, the trade-off is necessary to support larger and deeper models, since keeping them on-chip is infeasible due to their large memory footprint.
3.2 Overview
Figure 6 shows the main components of the E-PUR processing unit. E-PUR is composed of four computation units (CUs), which have several communication links among them. Each of these four hardware units is tailored to the computation of one of the four LSTM gates (i.e., forget gate, input gate, cell updater gate and output gate). The reason for this one-to-one gate-to-CU mapping is that exchanging information between LSTM gates is not needed for most of the cell state computation.
The computation in a gate is dominated by the matrix-vector multiplications detailed in Section 2.2. Note that each gate performs exactly two matrix-vector multiplications (i.e. two dot products for each neuron) per element of the input sequence and, therefore, the total computation is well balanced among the four gates. However, a minimal amount of information is shared among CUs at the end of the cell state calculation, in order to gather the data necessary for its update. As shown in Figure 6, both the input and forget gates send their results to the cell updater gate, whereas the result produced in the cell updater gate is consumed by the output gate. Moreover, after the cell state is updated by the cell updater gate, it is sent to the input and forget gates.
On the other hand, because of multiple data dependencies, the intermediate results produced by one layer for an entire input sequence must be saved in memory. There are two main alternatives to store this information: a dedicated on-chip memory (OM) or main memory. In Figure 7, we show the normalized energy consumption and the reduction in accesses to main memory for some of the most common LSTM applications using both approaches. As we can observe, using a dedicated on-chip memory consumes on average 2.4x less energy than continuously storing/loading the intermediate results to/from main memory, since, on average, 77% of the accesses to main memory are avoided. Therefore, this is the solution adopted in E-PUR. This dedicated on-chip memory is divided into two parts of equal size. One part is used to store the output results produced in the current layer, and the other one is used to read the results produced in the previous layer. The reason for this double buffering is that no result from the previous layer can be overwritten until the complete input sequence has been evaluated.
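The double-buffering scheme for intermediate results can be sketched as follows (a hypothetical software model, not the actual hardware interface; swapping the two halves replaces any copy when advancing to the next layer):

```python
class IntermediateMemory:
    """Double-buffered on-chip memory for inter-layer results: one half
    holds the previous layer's outputs (read-only while the current
    layer runs), the other collects the current layer's outputs."""
    def __init__(self, seq_len):
        self.read_buf = [None] * seq_len    # outputs of the previous layer
        self.write_buf = [None] * seq_len   # outputs being produced now
    def load(self, t):
        return self.read_buf[t]
    def store(self, t, value):
        self.write_buf[t] = value
    def next_layer(self):
        # Previous-layer results may only be discarded once the whole
        # input sequence has been evaluated; swapping enforces this.
        self.read_buf, self.write_buf = self.write_buf, self.read_buf
```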
3.3 Computation Unit
The Computation Unit is the hardware structure that implements the formal model of an LSTM cell, described in Figure 4. It is composed of two main components: the Dot Product Unit (DPU) and the Multifunctional Unit (MU). The DPU, shown at the top of Figure 8, performs the necessary dot product operations in a gate, which is the most time-consuming part. Note that our design employs dot products instead of matrix-matrix multiplications to simplify the hardware. The MU, shown at the bottom of Figure 8, performs the rest of the operations, such as activation functions or peephole calculations. In addition to these components, two memory buffers are used to store the input sequence and the synaptic weights of each gate in the LSTM cell. Note that the same weights are reused for each recurrent execution of an LSTM cell.
3.3.1 The Dot Product Unit
The DPU performs a floating point (FP) dot product between two vectors of length M by splitting them into K subvectors of size N. On each cycle, this unit executes the following steps. First, two size-N subvectors are loaded from two different on-chip scratchpad memories: the Weight Buffer and the Input Buffer. The former keeps all the synaptic weights of a given layer. The latter stores either the input vector or the previous output vector of the layer being evaluated. Next, the N-element FP Multiplier performs an element-wise multiplication of the two subvectors. Then, the resulting vector is sent to the N-element FP Reduction Adder in order to sum together all its elements, which takes log2(N) cycles. Finally, the resulting value is added to the value stored in a register called the Accumulator, which accumulates the partial dot product until the results of all K subvectors are added together.
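The DPU's subvector schedule can be mimicked in software as follows (a sketch; in the actual hardware the multiply, reduction and accumulation stages are pipelined rather than executed sequentially):

```python
import numpy as np

def dpu_dot(weights, inputs, N):
    """Dot product of two length-M vectors computed as K = M/N subvector
    steps: element-wise multiply of two N-element subvectors, a reduction
    add over the products, then accumulation of the partial result."""
    assert len(weights) == len(inputs) and len(weights) % N == 0
    acc = 0.0                                   # Accumulator register
    for k in range(0, len(weights), N):
        prod = weights[k:k+N] * inputs[k:k+N]   # N-element FP multiplier
        acc += prod.sum()                       # N-element reduction adder
    return acc
```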
As shown in Figure 4, evaluating a neuron in a given gate requires two dot product operations: one takes x_t as input vector and the other one takes h_{t-1}. The resulting output values of these two operations are added. In the Computation Unit, these two dot product operations are computed sequentially for each neuron, so that the latter is automatically added to the result of the former in the Accumulator register. Then, the resulting value is sent to the Multifunctional Unit (MU), which performs the remaining operations depending on the gate. Note that when a value is sent to the MU, the DPU does not wait until the MU finishes. Instead, it proceeds with the evaluation of the remaining neurons, since they do not depend on the previous ones.
3.3.2 The Multifunctional Unit
The Multifunctional Unit (MU) is a configurable hardware component whose activity depends on the Computation Unit (i.e., the gate) where it is located and on the configuration provided by the user. One input to the MU is the DPU output value, which corresponds to a neuron's evaluation for the forward and recurrent connections. On the other hand, some of the operations performed in a particular MU may require values produced in other Computation Units, as explained in Section 3.2.
As shown in Figure 8, an MU is composed of a register file, an interconnection network and several floating point units that implement basic operations: multiplication, addition, division, comparison and exponential. Also, each MU receives the required synaptic information, i.e. the weights for peephole connections and the biases, through the Weight Buffer. Moreover, the previous cell state (c_{t-1}), i.e. for the previous element of the input sequence, comes through the Input Buffer.
In Table 1, we detail the basic steps performed by the four MUs once the output data from the DPUs is available. For the sake of simplicity, we assume a single cycle per operation and data transfer in Table 1. Note that for the evaluation we use Synopsys Design Compiler to set realistic latencies for the different operations and data transfers, as reported in Table 4. MUs are not in the critical path, since the DPU operations are more time consuming and, thus, there is slack to accommodate multicycle latencies for MU operations.
The MUs of the input and forget gates perform very similar operations: they perform the multiplications for the peephole connections and add the bias. Next, they apply the sigmoid function to the result. After this, the resulting value is sent to the MU of the cell updater gate, which uses this information to proceed with the computation of the cell state, i.e. c_t, and then applies the hyperbolic tangent function to this value. Once this information is computed, it is sent to the MU of the output gate, which computes the element of the output vector, i.e. h_t, corresponding to the current element of the input sequence (i.e. x_t). Finally, this value is sent to the Input Buffer of all the Computation Units. In addition, it is sent to the dedicated on-chip memory, where it is stored to be consumed by the next layer, as described in Section 3.2. Communication between MUs is performed through dedicated links, as shown in Figure 6.
3.4 MWL: Maximizing Weight Locality
As shown in Figure 9, the on-chip memory requirements to store the synaptic weights are still quite significant for some applications (i.e. GMAT), despite the optimizations proposed in Section 3.1. In order to further improve energy consumption and reduce on-chip memory requirements, we propose a technique that maximizes the temporal locality of the accesses to the weights, which are performed for each layer. We call this technique Maximizing Weight Locality (MWL). The key observation is that forward connections (i.e. those whose inputs come from the previous layer) can be processed in any order, since all the output results from the previous layer are available. Therefore, E-PUR processes forward connections in an order that improves temporal locality. The idea is that, in a given gate, instead of completely evaluating all the neurons for a single element (x_t) of the input sequence, the evaluation of all the neurons is split in two steps. In the first step, all the neurons are evaluated using as input the forward connections for the whole input sequence (i.e., x_1, ..., x_N) and the intermediate results are saved. In the second step, MWL proceeds with the computation of all neurons for the recurrent connections (i.e., h_1, ..., h_N). Note that in this case the evaluation must be done in sequence, since data dependencies in the recurrent connections impose a strict sequential order.
With this approach, E-PUR reuses a small subset of the weights, those corresponding to a particular neuron, at extremely short distances. Note that for a given neuron, once it is partially computed for all elements of the input sequence, its corresponding weights will no longer be required and, thus, they can be evicted from on-chip memory. Therefore, while processing forward connections, E-PUR only requires on-chip storage for the forward weights of a single neuron at a time, significantly reducing on-chip storage requirements and energy consumption. As shown in Figure 9, the storage requirements for the weights are reduced by approximately 50% on average. Note that recurrent connections are evaluated as usual and, hence, all the associated weights of a given layer must be stored on-chip to avoid excessive accesses to off-chip memory.
The drawback of MWL is that it requires additional memory to store the partial evaluations of all the neurons in a given layer. In the design of E-PUR, presented in Section 3.3, the neurons in a cell are completely evaluated for an element of the input sequence before proceeding to the next input element. Therefore, only the final output vector of a cell, h_t, has to be stored in a memory buffer. On the other hand, with MWL, the neurons are first partially evaluated for all the elements in the input sequence, by operating exclusively on the forward connections. In this case, the partial evaluations of the neurons in each of the four gates must be stored, since they later have to be merged with the result of evaluating the recurrent connections in order to produce the final output. This requires an increase in on-chip storage for intermediate results, but this overhead is minimized by applying linear quantization to the partial output results. The next subsections provide further details on the implementation and trade-offs of MWL.
3.4.1 Prioritize Forward Connections
The conventional way to evaluate the input sequence in a layer is to perform all the necessary computations for the current element of the input sequence before starting with the next one. This implies evaluating both forward and recurrent connections in each layer. However, following this order, the temporal locality of the accesses to the weights of each gate is suboptimal. As we can see in the left part of Figure 10, the reuse distance of a weight access is equal to the sum of the sizes of the two weight matrices, i.e. the forward and the recurrent weight matrix. This has a direct impact on storage requirements, since a longer reuse distance requires a larger on-chip memory to hold the weights in order to avoid expensive off-chip memory accesses.
MWL improves the temporal locality of the weight accesses by changing the evaluation order of the two feedforward networks across the entire input sequence in a given layer. It is based on the observation that the feedforward networks that take x_t as input, i.e. those that contain the forward connections, do not depend on the previous output of the layer, as we can see in Figure 4. Therefore, we improve temporal locality by partially evaluating all the neurons in a layer for the entire input sequence and then proceeding with the recurrent connections, instead of sequentially evaluating the neurons in the layer for x_t and h_{t-1} and then proceeding with x_{t+1} and h_t. This reduces the storage requirements to the size of a single feedforward network, as seen in Figure 9.
Note that for a given neuron in a cell, its computations use the same subset of weights (i.e., a single row of the weight matrix of the feedforward network); therefore, the reuse distance is reduced to a single row of the feedforward matrix, as we can see in the middle part of Figure 10. Hence, we store this row in a small buffer (i.e., 4 KB), avoiding accesses to the weight buffer for the forward connections. As a result, as shown in Figure 9, the accesses to the weight buffer are reduced by 50% on average.
Finally, after the partial evaluation of the forward connections for all the neurons in a layer, the evaluation for recurrent connections is performed as explained in Section 3.2, i.e. the next input is not evaluated until the results of the current input are computed, to respect data dependencies (right part of Figure 10).
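The two-phase evaluation order of MWL, simplified to a single gate without bias or peephole terms, can be sketched as follows. Phase 1 walks the neurons and applies each neuron's single row of forward weights to the whole input sequence, so only that row needs to stay on-chip; phase 2 adds the recurrent terms in strict sequential order:

```python
import numpy as np

def mwl_layer(Wx, Wh, xs, act=np.tanh):
    """MWL evaluation order for one gate (sketch). Wx: forward weights,
    Wh: recurrent weights, xs: input sequence. Produces the same outputs
    as the conventional element-by-element order."""
    T, n = len(xs), Wx.shape[0]
    partial = np.empty((T, n))         # quantized in hardware; FP here
    for j in range(n):                 # phase 1: forward connections
        row = Wx[j]                    # only one weight row live at a time
        for t in range(T):
            partial[t, j] = row @ xs[t]
    hs, h = [], np.zeros(n)
    for t in range(T):                 # phase 2: recurrent connections
        h = act(partial[t] + Wh @ h)   # data dependence forces this order
        hs.append(h)
    return hs
```

The test below checks that the reordered schedule matches the conventional per-element evaluation.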
3.4.2 Storage of the Intermediate Results
The dedicated on-chip memory for intermediate results (see Section 3.2) is dimensioned to hold the final outputs (i.e. h_t) of a given layer, which are produced by the output gates of each cell. When using MWL, the temporary values produced by each gate while evaluating the forward connections must be saved for the entire input sequence, since the MUs will need these values to compute the final outputs, as explained above. Therefore, the main drawback of this technique is the extra storage required for these intermediate values, which is equal to four times the memory needed to store the outputs, because intermediate values are produced in all four gates. In order to deal with this issue, E-PUR applies a well-known technique, linear quantization, which reduces the number of bits needed to represent these values at the expense of potentially some loss in accuracy. More specifically, we apply linear quantization using 8 bits per element, introducing negligible accuracy loss in our set of neural networks. Empirically, we found that for the networks EESEN and RLDRADSPR the WER degrades by less than 1%. For the other three networks (BYSDNE, LDLRNN, GMAT), we observed an accuracy loss of less than 0.5%. Note that previous work reported similar results [21, 22].
When using linear quantization, for a given neuron with partial output y produced in MWL, its quantized value ŷ is computed using the following equations:
(7)  scale = (2^(n-1) − 1) / y_max
(8)  ŷ = round(y · scale)
where n is the number of bits of the quantized value (represented as an integer), i.e. 8 bits, and y_max is the maximum value of y. Theoretically, the value of y is unbounded; however, we empirically found that its absolute value is normally less than 20 for recurrent neural networks. Note that the constant scale is computed offline.
In order to compute the previous equations, we extended the MU with functional units supporting AND, OR and SHIFT operations. We implemented the rounding operation by adding one to the product, followed by a sequence of AND, OR, additions and multiplications. These operations are performed in parallel with the dot product computations done by the DPU. Once the casting is completed, the value is stored in the on-chip memory for intermediate results.
After the partial outputs (ŷ) for all the neurons are computed, the recurrent connections are evaluated as explained in Section 3.4.1. However, before computing the final output of a given gate in a cell, the previously quantized values must be converted back to floating point numbers and added to the result of evaluating the recurrent connections. We implemented this conversion through a look-up table that maps each integer quantized value to its floating point representation. Note that the size of this table is small, since n is small (i.e. 8 bits in our experiments), and it is computed offline.
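The quantization and table-based dequantization can be sketched as follows, following equations (7)–(8) and assuming the n = 8 bits and |y| < 20 bound mentioned above; the symmetric range and the explicit clipping are our own simplifications:

```python
def make_quantizer(n_bits=8, y_max=20.0):
    """Linear quantization of MWL partial outputs (sketch). Returns a
    quantize function and a dequantization table that stands in for the
    hardware look-up table mapping integer codes back to floating point."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = q_max / y_max                   # constant, computed offline
    def quantize(y):
        q = int(round(y * scale))
        return max(-q_max, min(q_max, q))   # clip to the integer range
    lut = {q: q / scale for q in range(-q_max, q_max + 1)}
    return quantize, lut
```

With 8 bits the table has only 255 entries, so its storage cost is negligible next to the intermediate-result buffers it helps shrink.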
4 Evaluation Methodology
As our set of benchmarks, we use five recent LSTM networks, which are described in Table 2. Our selection includes RNNs for popular applications such as speech recognition, machine translation and video classification. Each of these networks has a different number of internal layers and outputs, i.e. number of cells. Moreover, some networks perform a single pass for inference, i.e. they are unidirectional, whereas two of them, EESEN and GMAT, are bidirectional. On the other hand, we include networks with and without peephole connections. Therefore, our selection covers a wide range of LSTM designs with different sizes, from small RNNs of one Mbyte to large RNNs of hundreds of Mbytes. For each network, we use the accuracy metric listed in Table 2 and the test set provided in the corresponding work.
Table 2: LSTM networks used as benchmarks.

| Network | App Domain | Layers | Neurons | Passes | Peephole | Size (MB) | Accuracy |
|---|---|---|---|---|---|---|---|
| BYSDNE [23] | Video Classification | 5 | 512 | 1 | Yes | 40 | 88.6% |
| RLDRADSPR [8] | Speech Recognition | 10 | 1024 | 1 | Yes | 118 | 39.3 WER |
| EESEN [5] | Speech Recognition | 5 | 320 | 2 | Yes | 42 | 23.8 WER |
| LDLRNN [24] | Time Series | 2 | 128 | 1 | No | 1 | 85% |
| GMAT [21] | Machine Translation | 17 | 1024 | 1 | No | 272 | 24.1 BLEU |
As our baseline platform, we use an NVIDIA Tegra X1 SoC [25], whose parameters are shown in Table 3. Its energy consumption is measured by reading the registers of the Texas Instruments INA3221 power monitor included in the Jetson TX1 development board [25]. Regarding the software implementation of the networks, we implemented them using Keras [26], a high-level neural networks API, with the Theano [27] backend to run the LSTM networks. Theano relies on cuBLAS, a high-performance CUDA library, to perform matrix operations. Finally, we also implemented MWL in software for the Tegra X1 (Tegra X1+MWL) to analyze the benefits of a software-only implementation. We used CUDA to implement this version and employed kernel fusion [28] to merge the processing of the different gates into one kernel, avoiding an excessive number of API calls, which represent a significant overhead on this platform.

Table 3: Parameters of the Tegra X1 platform.

| Parameter | Value |
|---|---|
| CPU | 4-core ARM A57 |
| GPU | 256-core Maxwell GPU |
| Streaming Multiprocessors | 2 (2048 threads/proc) |
| Technology | 20 nm |
| Frequency | 1.0 GHz |
| CPU L2 Cache | 2 MB |
| GPU L2 Cache | 256 KB |
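The kernel-fusion idea used for Tegra X1+MWL can be sketched in NumPy: instead of launching one matrix product per LSTM gate, the four per-gate weight matrices are stacked so that a single larger product (one kernel launch) computes all gates at once. The sizes and names below are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, inputs = 64, 32   # illustrative layer sizes, not from the paper

# One weight matrix per LSTM gate: input, forget, cell-update, output.
W = {g: rng.standard_normal((hidden, inputs)).astype(np.float32)
     for g in ("i", "f", "g", "o")}
x = rng.standard_normal(inputs).astype(np.float32)

# Unfused: one matrix-vector product (one "kernel launch") per gate.
unfused = {g: W[g] @ x for g in W}

# Fused: stack the weights once, then launch a single larger product.
W_fused = np.concatenate([W[g] for g in ("i", "f", "g", "o")], axis=0)
i_g, f_g, g_g, o_g = np.split(W_fused @ x, 4)

# Same results, one launch instead of four.
assert np.allclose(i_g, unfused["i"]) and np.allclose(o_g, unfused["o"])
```

On a GPU, the fused version replaces four small kernel launches (and their API-call overhead) with one, which is precisely why it helps on a platform where launch overhead is significant.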
To evaluate our accelerator, we have developed a cycle-accurate simulator of EPUR. This simulator estimates the total energy consumption (static and dynamic) and execution time of LSTM networks running on top of EPUR. We used Verilog to implement the different pipeline components of EPUR and synthesized them with the Synopsys Design Compiler to obtain their delay and energy consumption. We use a typical process corner with a voltage of 0.78 V, and average switching activity to estimate dynamic power. We used CACTI [29] to estimate the delay and energy (static and dynamic) of the on-chip memories. Finally, to estimate the timing and energy consumption of main memory we used MICRON models [30], modeling 4 GB of LPDDR4 DRAM.

Regarding the clock frequency, we used the delays reported by the Synopsys Design Compiler and CACTI to set the frequency such that most hardware structures operate in one clock cycle. In addition, we evaluated alternative frequency values in order to minimize energy consumption. Note that many hardware components, such as the floating-point multipliers, are pipelined and have a latency larger than one clock cycle, as shown in Table 4.
The remaining configuration parameters of EPUR used in our experiments are shown in Table 4. We select an energy-efficient configuration that achieves real-time performance for all the neural networks in Table 2. Note that EPUR is designed to accommodate large LSTM networks and, thus, its on-chip storage might be oversized for the small models used in some applications. In that case, the unused memory banks for weights and intermediate results are power-gated to reduce static power.
Table 4: EPUR configuration parameters.

| Parameter | EPUR | EPUR+MWL |
|---|---|---|
| Technology | 28 nm | 28 nm |
| Frequency | 500 MHz | 500 MHz |
| Intermediate Memory | 6 MB | 6 MB |
| Weights Memory | 4 MB per CU | 2 MB per CU |
| Inputs Memory | 8 KB per CU | 4 KB per CU |
| DPU Width | 16 operations | 16 operations |
| MU Operation Latency | 2 (ADD), 4 (MUL), 5 (EXP) cycles | 2 (ADD), 4 (MUL), 5 (EXP) cycles |
| MU Communication | 2 cycles | 2 cycles |
| Peak Bandwidth | 30 GB/s | 30 GB/s |
5 Experimental Results
In this section, we present the evaluation of EPUR, our processing unit for RNNs. The baseline configuration used for comparison is a Theano implementation running on a mobile NVIDIA Tegra X1 platform. The configuration labeled EPUR throughout this section corresponds to the first design presented in Section 3.2, whereas EPUR+MWL additionally includes the technique for improving the temporal locality of the weights described in Section 3.4. First, we present the energy reduction achieved by these two configurations with respect to the Tegra X1. Second, the performance improvement over the baseline is analyzed. Third, the power consumption of each configuration is shown. Fourth, we present the total area required by EPUR. Finally, we analyze the performance of a software-only implementation of MWL.
Figure 11 shows the energy reduction. On average, EPUR and EPUR+MWL achieve 61x and 92x energy reduction respectively, and all the LSTM networks show large improvements of at least 28x. A remarkable case is LDLRNN, for which EPUR and EPUR+MWL reduce total energy by 352.4x and 496.1x respectively. The reason for this large reduction is that LDLRNN has fewer outputs per layer, i.e., a smaller number of neurons, so the matrix-vector multiplications require fewer operations and fewer memory accesses to fetch weights and intermediate results. This penalizes the Tegra X1 platform, because the ratio between computation on the GPU and other related tasks (e.g., GPU synchronization, CPU work) is smaller. Note that for EPUR most of the energy savings come from avoiding main memory accesses to load/store intermediate results and weights. For EPUR+MWL, additional savings come from reducing accesses to the on-chip weight memory by 50% on average.
Figure 12 shows the energy breakdown for the two configurations of EPUR. The components of EPUR are grouped into "scratchpad memories", which includes all the on-chip memories, and "operations", which includes the pipeline components such as the functional units. Since the on-chip memory requirements and the number of memory accesses are significant, the overall energy consumption is dominated by dynamic accesses to on-chip memories, which account for around 80%. Because MWL reduces the dynamic accesses to the weight buffer by 50% on average, the dynamic energy of the on-chip memories is reduced by 31% on average for EPUR+MWL. Note that the scratchpad-memory energy is not reduced by the full 50%, since accesses to the on-chip memory for intermediate results increase. Leakage in the on-chip memories is reduced by more than 50% on average after applying MWL; this saving comes from the reduced storage required for the weights of the forward connections. Overall, the savings in leakage and dynamic energy result in a 35% reduction of total energy consumption. The energy consumed by the operations ranges between 10% and 20% of the total for both configurations.
Figure 13 shows the speedups for the different LSTM networks. On average, EPUR achieves an 18.7x speedup over the Tegra X1. EPUR's performance improvements come from hiding memory latency (i.e., loads/stores are overlapped with computation), reducing off-chip memory accesses, and featuring a custom pipeline tailored to LSTM computation. Note that for EPUR, once the weights and input frames are loaded from the main system, there is no extra overhead from the main application. The Tegra X1, by contrast, targets a broader range of applications, so its performance is impacted by the overhead of related tasks (e.g., GPU synchronization, CPU work). Regarding EPUR+MWL, there is no additional performance improvement, since the order in which MWL evaluates the neurons does not change the final execution time: the number of operations to evaluate a given neuron is the same as in the conventional order. Because the evaluation of the recurrent connections of a given neuron is postponed until all forward connections have been evaluated, the latency to evaluate a single neuron increases, but the latency to produce the final output sequence does not change. Finally, for the speech recognition applications, EPUR achieves real-time performance by a large margin, running 30x and 5x faster than real time for EESEN and RLDRADSPR respectively.
Power dissipation is shown in Figure 14, which includes the total power for the Tegra X1 and the two configurations of EPUR. As can be seen, EPUR+MWL dissipates 5x less power than the Tegra X1 on average.
Regarding area, EPUR requires a total of 64.6 mm², whereas EPUR+MWL requires 46.3 mm². As depicted in Figure 15, the largest contributor to total area is the on-chip memory for the synaptic weights, which is reduced by 50% when MWL is applied.
Finally, Figure 16 shows the speedup and energy reduction of Tegra X1+MWL, i.e., MWL implemented in software, with respect to the baseline. On average, it provides a 2x energy reduction and a 2.3x speedup. EESEN and LDLRNN exhibit large improvements in performance and energy: these RNNs have fewer neurons than the others (see Table 2), so the synaptic weights can be kept in the on-chip storage of the mobile GPU and reused for the entire layer evaluation, i.e., for the whole input sequence. On the other hand, the benefits are significantly smaller for BYSDNE, RLDRADSPR and GMAT. These networks have a larger number of neurons, so the synaptic weights of one LSTM cell cannot be stored on-chip in the Tegra X1, increasing off-chip memory traffic by a large extent. Note that the on-chip memories of the Tegra X1 are considerably smaller than those of EPUR, as illustrated in Table 3 and Table 4. This lack of on-chip storage constrains the effectiveness of Tegra X1+MWL for RNNs with large cell dimensionality.
6 Related Work
Improving the energy-efficiency of LSTM networks has attracted the attention of the architecture community in the last few years. Proposals for LSTM network acceleration have been presented in [20, 19, 31]. Although these accelerators achieve higher performance per watt than CPUs and GPUs, they are not designed for low-power mobile devices, since their power dissipation ranges from 19 W to 41 W. In contrast, EPUR dissipates a peak power of 970 mW, which is amenable to low-power mobile devices.
Chang et al. [9] present a low-power accelerator targeting the mobile segment. It implements a small LSTM network (2 layers, 128 neurons) and dissipates 1.9 W. In that work, arithmetic operations use the fixed-point Q8.8 data format, which incurs an accuracy loss of 7.1%. In contrast, EPUR uses floating-point operations (either FP16 or FP32) and supports larger network models for a wide variety of application domains. Note that scaling up the accelerator presented in [9] to support larger LSTM networks would require a significant increase in local storage capacity or in main memory traffic, and either alternative would come at a high cost in energy consumption.
Another low-power LSTM accelerator is presented in [10]; this system consumes 9 W and supports larger models by using aggressive weight quantization. External DRAM traffic is completely avoided by storing the quantized weights in a local on-chip memory of 2 Mbytes. However, this quantization comes at the expense of a non-negligible accuracy loss: for speech recognition, the Word Error Rate increases from 13.5% with 32-bit floating point to 15.1% and 20.2% with 6-bit and 4-bit quantization respectively. Furthermore, larger and more accurate models cannot be stored in its local memory even with 4-bit quantization; for example, EESEN requires more than 5 Mbytes at 4 bits per weight. Our work is different, since EPUR+MWL uses 8-bit quantization to reduce the size of intermediate results, with a negligible impact on accuracy.
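The EESEN storage figure can be checked with back-of-the-envelope arithmetic (our own calculation, not a number reported in [10]): a ~42 MB FP32 model corresponds to roughly 11M weights, which still exceed the 2 MB on-chip budget at 4 bits per weight.

```python
# Back-of-the-envelope check: EESEN is ~42 MB in FP32 (Table 2).
fp32_size_mb = 42
num_weights = fp32_size_mb * 1024 * 1024 // 4       # 4 bytes per FP32 weight
size_4bit_mb = num_weights * 4 / 8 / (1024 * 1024)  # 4 bits per weight
print(size_4bit_mb)  # 5.25 -> still above the 2 MB on-chip memory of [10]
```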
The LSTM accelerator ESE [20] achieves high performance and energy-efficiency by exploiting linear quantization and aggressive pruning. Its main application is speech recognition and its main target is high-end systems. In contrast, EPUR targets mobile devices and achieves high energy-efficiency by improving the temporal locality of the memory accesses that fetch synaptic weights. Moreover, EPUR supports a large variety of applications. We leave the use of pruned models in EPUR as future work.
Regarding the work in [32], EPUR without MWL is similar to a weight-stationary architecture applied to LSTMs, since it loads all the weights of a given layer into on-chip memory and holds them until all associated computations have been performed. MWL, however, aims at further reducing the reuse distances. Unlike traditional weight-stationary architectures, MWL splits the synaptic weights into two types, forward and recurrent, based on the observation that forward connections can be processed in any order, whereas recurrent connections impose sequential processing due to data dependencies. MWL therefore evaluates the forward connections in the order that maximizes temporal locality, requiring a small amount of extra on-chip storage for this stage, and processes all the recurrent connections in a second stage, as shown in Figure 10. MWL greatly reduces the energy consumption of the baseline accelerator.
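The two-stage reordering can be sketched with a simplified vanilla-RNN cell (the paper's cells are LSTMs; this toy NumPy model only illustrates why the reordering preserves the result): stage one evaluates all forward connections for the whole input sequence, so each forward-weight row is reused back to back, and stage two evaluates the recurrent connections sequentially.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n_in, n_h = 6, 8, 4                                    # toy sizes only
Wx = rng.standard_normal((n_h, n_in)).astype(np.float32)  # forward weights
Wh = rng.standard_normal((n_h, n_h)).astype(np.float32)   # recurrent weights
xs = rng.standard_normal((T, n_in)).astype(np.float32)    # input sequence

def conventional(xs):
    # Every time step touches both Wx and Wh: long reuse distance per weight.
    h, out = np.zeros(n_h, np.float32), []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h)
        out.append(h)
    return np.stack(out)

def mwl(xs):
    # Stage 1: all forward connections for the whole sequence, back to back.
    fwd = xs @ Wx.T
    # Stage 2: only the recurrent weights remain live.
    h, out = np.zeros(n_h, np.float32), []
    for t in range(T):
        h = np.tanh(fwd[t] + Wh @ h)
        out.append(h)
    return np.stack(out)

# The reordering changes the memory-access order, not the computed outputs.
assert np.allclose(conventional(xs), mwl(xs), atol=1e-5)
```

The per-neuron latency grows (stage 1 must finish before stage 2 consumes its results), but the final output sequence and total operation count are unchanged, matching the behavior described in Section 5.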
Finally, cuDNN [33] has recently been extended to efficiently support RNN training. The design of EPUR is significantly different in multiple ways. First, cuDNN focuses on RNN training with large batch sizes, whereas EPUR focuses on RNN inference with a batch size of one, i.e., one input sequence at a time. We measured cuDNN performance for RNN inference with a batch size of one and found it to be 1.5x faster than cuBLAS, whereas EPUR achieves an 18.7x speedup; cuDNN's effectiveness is reduced by the small batch sizes commonly used for RNN inference. Furthermore, cuDNN's optimizations to execute multiple layers in parallel cannot be applied to bidirectional LSTMs due to data dependencies.
7 Conclusions
In this paper, we present EPUR, a processing unit for RNNs that supports large LSTM networks while dissipating low power, motivated by the increasingly important role of LSTM networks in applications such as speech recognition, machine translation and video classification. Unlike previous proposals that attempt to accommodate the entire RNN on-chip, EPUR only provides storage for one LSTM layer, whose weights are fetched once from main memory and reused across multiple recurrent executions. To further improve the memory efficiency of EPUR, we introduce Maximizing Weight Locality (MWL), a novel technique that improves the temporal locality of the synaptic weights. The proposed design supports large LSTM networks of hundreds of Megabytes while using small on-chip storage and low memory bandwidth. Our results show that EPUR reduces energy consumption by 92x on average with respect to a modern mobile GPU, while providing an 18.7x speedup.
References
 [1] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, pp. 3104–3112, 2014.

 [2] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2625–2634, 2015.
 [3] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3156–3164, 2015.
 [4] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, “Sequence to sequencevideo to text,” in Proceedings of the IEEE international conference on computer vision, pp. 4534–4542, 2015.
 [5] Y. Miao, M. Gowayyed, and F. Metze, “EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding,” in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on, pp. 167–174, IEEE, 2015.
 [6] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
 [7] M. Sundermeyer, R. Schlüter, and H. Ney, “Lstm neural networks for language modeling,” in Thirteenth Annual Conference of the International Speech Communication Association, 2012.
 [8] J. Kim, M. El-Khamy, and J. Lee, “Residual LSTM: Design of a deep recurrent architecture for distant speech recognition,” arXiv preprint arXiv:1701.03360, 2017.
 [9] A. X. M. Chang, B. Martini, and E. Culurciello, “Recurrent neural networks hardware implementation on fpga,” arXiv preprint arXiv:1511.05552, 2015.
 [10] M. Lee, K. Hwang, J. Park, S. Choi, S. Shin, and W. Sung, “FPGA-based low-power speech recognition with recurrent neural networks,” in Signal Processing Systems (SiPS), 2016 IEEE International Workshop on, pp. 230–235, IEEE, 2016.
 [11] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in Neural Information Processing Systems, pp. 1135–1143, 2015.
 [12] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, “Lstm: A search space odyssey,” IEEE transactions on neural networks and learning systems, 2016.
 [13] R. Yazdani, A. Segura, J.-M. Arnau, and A. Gonzalez, “An ultra low-power hardware accelerator for automatic speech recognition,” in Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on, pp. 1–12, IEEE, 2016.
 [14] H. Tabani, J.-M. Arnau, J. Tubella, and A. Gonzalez, “An ultra low-power hardware accelerator for acoustic scoring in speech recognition,” in Parallel Architecture and Compilation Techniques (PACT), 26th International Conference on, IEEE/ACM, 2017.

 [15] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
 [16] P. Baldi, S. Brunak, P. Frasconi, G. Pollastri, and G. Soda, “Bidirectional dynamics for protein secondary structure prediction,” in Sequence Learning, pp. 80–104, Springer, 2001.
 [17] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional lstm networks,” in Neural Networks, 2005. IJCNN’05. Proceedings. 2005 IEEE International Joint Conference on, vol. 4, pp. 2047–2052, IEEE, 2005.
 [18] F. A. Gers and J. Schmidhuber, “Recurrent nets that time and count,” in Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on, vol. 3, pp. 189–194, IEEE, 2000.
 [19] Y. Guan, Z. Yuan, G. Sun, and J. Cong, “FPGA-based accelerator for long short-term memory recurrent neural networks,” in Design Automation Conference (ASP-DAC), 2017 22nd Asia and South Pacific, pp. 629–634, IEEE, 2017.
 [20] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, et al., “ESE: Efficient speech recognition engine with sparse LSTM on FPGA,” in FPGA, pp. 75–84, 2017.
 [21] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016.
 [22] D. D. Lin, S. S. Talathi, and V. S. Annapureddy, “Fixed point quantization of deep convolutional networks,” CoRR, vol. abs/1511.06393, 2015.
 [23] J. YueHei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, “Beyond short snippets: Deep networks for video classification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4694–4702, 2015.
 [24] Z. C. Lipton, D. C. Kale, C. Elkan, and R. Wetzell, “Learning to diagnose with lstm recurrent neural networks,” arXiv preprint arXiv:1511.03677, 2015.
 [25] NVIDIA, “NVIDIA TEGRA X1 new mobile superchip.” http://international.download.nvidia.com/pdf/tegra/TegraX1whitepaperv1.0.pdf.
 [26] F. Chollet et al., “Keras.”
 [27] R. AlRfou, G. Alain, A. Almahairi, C. Angermueller, D. Bahdanau, N. Ballas, F. Bastien, J. Bayer, A. Belikov, A. Belopolsky, Y. Bengio, A. Bergeron, J. Bergstra, V. Bisson, J. Bleecher Snyder, N. Bouchard, N. BoulangerLewandowski, X. Bouthillier, A. de Brébisson, O. Breuleux, P.L. Carrier, K. Cho, J. Chorowski, P. Christiano, T. Cooijmans, M.A. Côté, M. Côté, A. Courville, Y. N. Dauphin, O. Delalleau, J. Demouth, G. Desjardins, S. Dieleman, L. Dinh, M. Ducoffe, V. Dumoulin, S. Ebrahimi Kahou, D. Erhan, Z. Fan, O. Firat, M. Germain, X. Glorot, I. Goodfellow, M. Graham, C. Gulcehre, P. Hamel, I. Harlouchet, J.P. Heng, B. Hidasi, S. Honari, A. Jain, S. Jean, K. Jia, M. Korobov, V. Kulkarni, A. Lamb, P. Lamblin, E. Larsen, C. Laurent, S. Lee, S. Lefrancois, S. Lemieux, N. Léonard, Z. Lin, J. A. Livezey, C. Lorenz, J. Lowin, Q. Ma, P.A. Manzagol, O. Mastropietro, R. T. McGibbon, R. Memisevic, B. van Merriënboer, V. Michalski, M. Mirza, A. Orlandi, C. Pal, R. Pascanu, M. Pezeshki, C. Raffel, D. Renshaw, M. Rocklin, A. Romero, M. Roth, P. Sadowski, J. Salvatier, F. Savard, J. Schlüter, J. Schulman, G. Schwartz, I. V. Serban, D. Serdyuk, S. Shabanian, E. Simon, S. Spieckermann, S. R. Subramanyam, J. Sygnowski, J. Tanguay, G. van Tulder, J. Turian, S. Urban, P. Vincent, F. Visin, H. de Vries, D. WardeFarley, D. J. Webb, M. Willson, K. Xu, L. Xue, L. Yao, S. Zhang, and Y. Zhang, “Theano: A Python framework for fast computation of mathematical expressions,” arXiv eprints, vol. abs/1605.02688, May 2016.
 [28] G. Wang, Y. Lin, and W. Yi, “Kernel fusion: An effective method for better power efficiency on multithreaded gpu,” in Proceedings of the 2010 IEEE/ACM Int’l Conference on Green Computing and Communications & Int’l Conference on Cyber, Physical and Social Computing, pp. 344–350, IEEE Computer Society, 2010.
 [29] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, “Cacti 6.0: A tool to model large caches,” HP Laboratories, pp. 22–31, 2009.
 [30] Micron Inc., “TN5301: LPDDR4 System Power Calculator.” https://www.micron.com/support/toolsandutilities/powercalc.
 [31] S. Li, C. Wu, H. Li, B. Li, Y. Wang, and Q. Qiu, “FPGA acceleration of recurrent neural network based language model,” in Field-Programmable Custom Computing Machines (FCCM), 2015 IEEE 23rd Annual International Symposium on, pp. 111–118, IEEE, 2015.
 [32] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks,” in Proceedings of the 43rd International Symposium on Computer Architecture, ISCA ’16, (Piscataway, NJ, USA), pp. 367–379, IEEE Press, 2016.
 [33] J. Appleyard, T. Kociský, and P. Blunsom, “Optimizing performance of recurrent neural networks on gpus,” CoRR, vol. abs/1604.01946, 2016.