Recurrent Neural Networks (RNNs) are a state-of-the-art machine learning approach that has achieved tremendous success in a wide variety of sequence-to-sequence application domains [1, 2, 3, 4, 5]. Unlike a feed-forward Deep Neural Network (DNN), an RNN remembers information from previous inputs to improve accuracy. Long Short Term Memory (LSTM) networks represent the preferred RNN implementation nowadays. LSTM cells can remember useful information over a long period of time, whereas this information vanishes over time in other RNN approaches. LSTM networks are currently used for many sequence processing problems such as speech recognition, machine translation or language modeling.
These applications are of special interest for mobile devices such as tablets, smartphones or smartwatches. For example, voice-based interfaces represent a more natural human-computer interface than touchscreens and keyboards. Unfortunately, there are several challenges that hinder the deployment of LSTM networks in mobile devices. First, accurate LSTM networks are typically quite large and, therefore, they require substantial memory storage and computational resources. Real-time LSTM evaluation comes at a high energy cost that may not be acceptable for many low-power devices. Second, due to its recurrent nature, LSTM network inference exhibits a significant amount of sequential processing and limited parallelism, and thus it cannot be efficiently executed on multicore CPUs or GPUs. Not surprisingly, our measurements on a recent Tegra X1 mobile SoC show that the CPU and the GPU do not achieve real-time performance for EESEN and RLDRADSPR, two state-of-the-art LSTM networks for speech recognition.
A few FPGA-based LSTM network accelerators targeted to the mobile segment have been presented in recent years [9, 10]. In these designs, high energy-efficiency is achieved by storing all the synaptic weights in local memory, since accesses to external DRAM memory consume more than two orders of magnitude more energy than accesses to a small on-chip buffer. Due to the requirement of storing the entire LSTM network on-chip, the aforementioned accelerators are restricted to small LSTM models. Supporting larger LSTM networks, which provide state-of-the-art accuracy, would require a significant increase in local storage and main memory bandwidth usage, which would incur a high energy overhead.
In this paper, we present E-PUR, a processing unit for recurrent neural networks that supports large LSTM models and provides real-time performance with an energy consumption amenable for mobile devices. E-PUR efficiently implements in hardware the different components of an LSTM cell, providing enough flexibility to support LSTM networks for different applications. A main challenge for E-PUR is fetching the weights from memory in an energy-efficient manner. Storing them in local memory is not feasible due to the large size of modern LSTM models, which is in the order of tens or even hundreds of Mbytes, but accessing off-chip memory is extremely expensive from an energy point of view. E-PUR makes a trade-off between local memory capacity and external memory bandwidth to achieve low power, providing local storage for just one LSTM layer. Figure 1 shows an LSTM network consisting of multiple LSTM cells, arranged in several layers, which are recurrently executed for processing the different elements in the input sequence.
In E-PUR, weights for one LSTM layer are fetched from external DRAM and stored in on-chip local memory. Next, each cell on the layer is evaluated for the entire input sequence, reusing the weights stored in local memory for every element in the input sequence. The cost of accessing main memory is amortized due to the large size of typical input sequences, which is in the order of thousands of elements (e.g. audio frames). Due to the current trend towards deeper neural networks, E-PUR offers good scalability as the size of the on-chip local memory is independent of the number of layers.
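The amortization argument above can be illustrated with a small sketch. It compares the weight traffic from DRAM for two schedules when the on-chip buffer holds a single layer: evaluating one layer over the whole sequence, versus evaluating all layers for each element. The function name, layer sizes and sequence length are hypothetical values chosen for illustration, not figures from the evaluation.

```python
def dram_weight_traffic(layer_weight_bytes, seq_len, layer_at_a_time=True):
    """Bytes of weights fetched from DRAM to process one input sequence.

    layer_weight_bytes: weight size of each layer, in bytes.
    seq_len: number of elements in the input sequence.
    Assumes on-chip storage for the weights of exactly one layer.
    """
    if layer_at_a_time:
        # Each layer's weights are fetched once and reused for every element.
        return sum(layer_weight_bytes)
    # Element-at-a-time order: the single-layer buffer is overwritten when
    # moving to the next layer, so every layer is re-fetched per element.
    return sum(layer_weight_bytes) * seq_len


layers = [8 * 2**20] * 5   # hypothetical: 5 layers of 8 MiB each
seq_len = 1000             # hypothetical: 1000 audio frames

per_layer = dram_weight_traffic(layers, seq_len, layer_at_a_time=True)
per_element = dram_weight_traffic(layers, seq_len, layer_at_a_time=False)
print(per_element // per_layer)  # ratio equals the sequence length
```

Under these assumptions the traffic ratio grows linearly with the sequence length, which is why long input sequences make the layer-at-a-time schedule attractive.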
To further improve the energy-efficiency of weight fetching, we observe that an LSTM cell has two types of connections: self-recurrent, a.k.a. recurrent, and forward connections from the previous layer (see Figure 1). Data dependencies impose strict sequential order for processing recurrent connections. However, forward connections can be processed in any order, since the results from the previous layer are available when the current layer starts execution. In this paper, we introduce Maximizing Weight Locality (MWL), a technique that modifies the order in which forward connections are processed to maximize temporal locality. When leveraging MWL, E-PUR requires modest local storage capacity and memory bandwidth, even for large LSTM networks. For example, for EESEN, a speech recognition LSTM network that has a size of 42 Mbytes, E-PUR only requires 1.5 Mbytes of local storage. Furthermore, main memory bandwidth usage for real-time performance is as small as 4.2 Mbytes/s, only 0.02% of the available memory bandwidth of conventional low power systems such as Tegra X1.
To summarize, this paper focuses on implementing energy-efficient, real-time LSTM networks. Its main contributions are the following:
We propose E-PUR, a processing unit for recurrent neural networks that improves energy efficiency with respect to CPU and GPU by orders of magnitude.
We introduce Maximizing Weight Locality (MWL), a technique that dramatically improves temporal locality of weight fetching, providing huge energy savings.
We evaluate E-PUR for large, representative LSTM networks from different application domains, including speech recognition, machine translation and video classification.
E-PUR achieves real-time performance while reducing energy by 92x on average when compared with a contemporary low power mobile SoC. Its peak power is 975 mW and its area is 46.3 mm², which is reasonable for most mobile devices.
The rest of the paper is organized as follows. Section 2 provides some background on LSTM networks. Section 3 presents E-PUR, our processing unit for recurrent neural networks. Section 4 describes our evaluation methodology and Section 5 details the experimental results. Section 6 reviews some related work and, finally, Section 7 sums up the main conclusions.
2 Recurrent Neural Networks
Feed-forward Deep Neural Networks (DNNs), such as Convolutional Neural Networks (CNNs), have been shown to be very successful for classification problems. However, they fail to provide an effective framework for sequence-to-sequence machine learning applications (e.g. machine translation) for several reasons. First, the input/output dimensionality of a feed-forward DNN is fixed, whereas sequence processing problems require variable length input/output. Second, DNNs use a fairly constrained amount of context information to make a prediction, typically a few frames of audio/video or a few words, but some problems require taking into account distant past or future information to be accurate. Not surprisingly, sequence processing tasks such as machine translation or audio/video description cannot be accurately accomplished with the sole use of a feed-forward DNN. Note that a DNN can be used for a specific subtask of a sequence processing problem, like acoustic scoring in speech recognition, but a very expensive post processing stage is still required to generate the output sequence [13, 14].
In order to overcome the aforementioned limitations of feed-forward DNNs, Recurrent Neural Networks (RNNs) have been proposed. RNNs include loops or recurrent connections, allowing information to persist from one step, i.e. execution, of the network to the next. Therefore, RNNs can potentially employ an unbounded amount of context information to make predictions. In addition, RNNs are recurrently executed for every element in the input sequence and, hence, they can handle variable length input/output, which is a requirement for sequence processing problems.
Simple RNN architectures can capture and exploit short term dependencies. However, exploiting long term dependencies is challenging and, typically, useful information is diluted over time in many RNN approaches. To overcome this issue, Long Short Term Memory (LSTM) networks were proposed, which represent the most successful and widely used RNN implementation, with applications in speech recognition, machine translation and language modeling. In this section, we explain in detail the structure and behavior of LSTM networks.
2.1 LSTM RNN
An LSTM RNN consists of multiple layers that are stacked together to form a deep RNN, including an input layer and multiple hidden layers formed by LSTM cells. These layers can be unidirectional or bidirectional. Unidirectional layers only use past information to perform inference for the current execution step, whereas bidirectional layers exploit both past and future context information and, typically, they provide higher accuracy. Therefore, Deep Bidirectional LSTM (BiLSTM) networks deliver state-of-the-art accuracy for multiple sequence processing problems [16, 15, 17].
Figure 2 shows an unrolled BiLSTM network with 1 hidden layer. The bidirectional layer consists of two LSTM cells: the first one processes the information in the forward direction, i.e. $x_1$ to $x_N$, while the second one processes the input sequence in the backward direction, i.e. $x_N$ to $x_1$. Figure 2 shows multiple instances of these two cells for each layer, which correspond to multiple recurrent uses of the same two cells, one for each element in the input sequence. In this logical view of the network, a.k.a. unrolled, recurrent connections are shown as horizontal connections, either left-to-right or vice versa, and they correspond in fact to connections from the output of one cell to the input of the same cell. In a given layer, the outputs of the LSTM cells in both forward and backward directions are concatenated, forming the input ($x_t$) for the next layer. Finally, a BiLSTM network includes a feed-forward (non-recurrent) softmax output layer that produces the final output of the network. For example, for speech or text applications, the outputs represent the likelihoods of the different characters, phonemes or words at each step.
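As a rough illustration of the bidirectional evaluation just described, the sketch below runs one cell over the sequence from first to last element, a second cell from last to first, and concatenates the two outputs per element. The callables `step_fwd` and `step_bwd` are hypothetical stand-ins for the forward and backward LSTM cells; any function mapping an input and a previous output to a new output works here.

```python
def bidirectional_layer(step_fwd, step_bwd, xs, h0_fwd, h0_bwd):
    """Evaluate one bidirectional layer over the input sequence xs.

    step_fwd / step_bwd: hypothetical cell functions (x, h_prev) -> h.
    Returns one concatenated output vector per input element.
    """
    # Forward direction: first element to last element.
    fwd, h = [], h0_fwd
    for x in xs:
        h = step_fwd(x, h)
        fwd.append(h)

    # Backward direction: last element to first element.
    bwd, h = [None] * len(xs), h0_bwd
    for t in range(len(xs) - 1, -1, -1):
        h = step_bwd(xs[t], h)
        bwd[t] = h

    # Concatenate both directions to form the input of the next layer.
    return [f + b for f, b in zip(fwd, bwd)]


# Toy cells that accumulate the inputs, only to make the data flow visible.
toy_step = lambda x, h: [x[0] + h[0]]
outputs = bidirectional_layer(toy_step, toy_step, [[1], [2], [3]], [0], [0])
print(outputs)
```

Note that the backward pass cannot start until the whole input sequence is available, which is why bidirectional layers need the complete sequence before producing any output.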
2.2 LSTM Cell
Figure 3 shows the basic structure of an LSTM cell. A key component is the cell state ($c_t$), which represents the memory storage of the cell. On each cell, the state is updated by four components, commonly referred to as gates, which also perform the computation of the cell output ($h_t$). Each of these gates consists of two fully-connected networks: one taking as input the output of the previous LSTM layer ($x_t$) and one taking as input the output of the LSTM cell in the previous time step ($h_{t-1}$). The former is the one using forward connections, whereas the latter includes the recurrent or feedback connections.
Figure 4 shows the computations performed within an LSTM cell. For each new element ($x_t$) of the input sequence, the following actions are taken. First, the cell updater gate ($g_t$) modulates the amount of input information that is considered a candidate to update the cell state. Then, the input gate ($i_t$) decides how much of the candidate information will be stored into the cell state. On the other hand, the forget gate ($f_t$) determines how much information will be removed from the current cell state ($c_{t-1}$), i.e. which information is no longer useful for future predictions. Finally, the output gate ($o_t$) decides the amount of information that will be emitted from the cell.
In Figure 4, $\odot$, $\phi$ and $\sigma$ denote element-wise multiplication, hyperbolic tangent and sigmoid function respectively.
In other words, information that is no longer useful is removed from the cell state using the mask generated by the forget gate. New information is added to the cell state by applying the mask generated in the input gate to the candidate information produced by the cell updater gate. Then, to compute the cell output, a hyperbolic tangent is applied to the current cell state and the resulting value is multiplied by the mask generated in the output gate. Therefore, the cell output ($h_t$) is a filtered version of the cell state.
Note that conceptually each of the four gates is composed of multiple neurons and, as shown in Figure 4, each of them consists of two independent feed-forward fully connected networks, which are implemented as two matrix-vector multiplications. Therefore, for each neuron in the four gates and in all cells, two dot-product operations are performed: one for forward connections and one for recurrent connections. Then, the outputs of these connections are added to a bias ($b$) and to a peephole connection. Note that peephole connections are a masked version of the cell state and they are used to link the cell state to the gates. Therefore, they allow the cell state to have control over which information is added, removed or outputted, improving prediction accuracy for machine learning applications that require precise timing. These connections are shown as the dotted lines in Figure 3. Finally, an activation function is applied to the result to obtain the output value of each neuron.
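The computation described in this subsection can be sketched in plain Python as follows. This is a minimal illustration that assumes the commonly used peephole LSTM formulation (input and forget gates peek at the previous cell state, the output gate at the updated one); the parameter names in the dictionary `p` are hypothetical and the exact equations of Figure 4 may differ in detail.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def dot(row, vec):
    return sum(w * v for w, v in zip(row, vec))


def gate_preact(Wx, Wh, b, x, h_prev):
    # Per neuron: forward dot product + recurrent dot product + bias.
    return [dot(wx, x) + dot(wh, h_prev) + bi for wx, wh, bi in zip(Wx, Wh, b)]


def lstm_step(p, x, h_prev, c_prev):
    """One step of an LSTM cell with peephole connections (assumed form)."""
    zi = gate_preact(p["Wix"], p["Wih"], p["bi"], x, h_prev)
    zf = gate_preact(p["Wfx"], p["Wfh"], p["bf"], x, h_prev)
    zg = gate_preact(p["Wgx"], p["Wgh"], p["bg"], x, h_prev)
    zo = gate_preact(p["Wox"], p["Woh"], p["bo"], x, h_prev)
    # Peepholes: element-wise product of a weight vector and the cell state.
    i = [sigmoid(z + w * c) for z, w, c in zip(zi, p["wic"], c_prev)]
    f = [sigmoid(z + w * c) for z, w, c in zip(zf, p["wfc"], c_prev)]
    g = [math.tanh(z) for z in zg]                      # candidate information
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    o = [sigmoid(z + w * ct) for z, w, ct in zip(zo, p["woc"], c)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]    # filtered cell state
    return h, c


# Single-neuron example: only the cell updater gate has a non-zero weight.
zero = [[0.0]]
params = {
    "Wix": zero, "Wih": zero, "bi": [0.0], "wic": [0.0],
    "Wfx": zero, "Wfh": zero, "bf": [0.0], "wfc": [0.0],
    "Wgx": [[1.0]], "Wgh": zero, "bg": [0.0],
    "Wox": zero, "Woh": zero, "bo": [0.0], "woc": [0.0],
}
h, c = lstm_step(params, x=[1.0], h_prev=[0.0], c_prev=[0.0])
```

In this example the input, forget and output masks are all 0.5, so the new cell state is half of the candidate value and the output is half of its hyperbolic tangent.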
3 E-PUR Processing Unit
In this section, we present E-PUR, an energy-efficient processing unit for large LSTM networks. First, we describe the main drawbacks of state-of-the-art solutions for LSTM inference. Next, we present the architecture of E-PUR, which is an energy-efficient hardware implementation of an LSTM cell. We detail the main parameters and trade-offs made during the design of E-PUR. Finally, we present Maximizing Weight Locality (MWL), a technique that largely improves the temporal locality of the memory accesses for fetching the synaptic weights.
State-of-the-art hardware implementations [19, 20] for LSTM networks rely on storing all synaptic weights on-chip in order to avoid expensive off-chip memory accesses. As we can see in Figure 5, this approach is infeasible for many LSTM applications, due to their large memory requirements to achieve high accuracy. For example, the GMAT LSTM network for machine translation requires more than 250 Mbytes of memory.
Based on the recurrent nature of LSTM networks, we propose a cost-effective trade-off between main memory accesses and on-chip memory storage. It is based on the observation that the input sequences of LSTM networks tend to contain a large number of elements and that, for evaluating a single pass (backward or forward) of a given layer, only the weights for that particular layer are used to evaluate the whole input sequence. We exploit this characteristic of RNNs to design the memory system of E-PUR, providing on-chip memory capacity to store only the weights of a single LSTM layer. Note that, as seen in Figure 5, the storage requirements are reduced by 7x on average, although this comes at the expense of higher off-chip memory traffic. Nonetheless, this trade-off is necessary in order to support larger and deeper models, since keeping them on-chip is infeasible due to their large memory footprint.
Figure 6 shows the main components of the E-PUR processing unit. E-PUR is composed of four computation units (CUs), which have several communication links among them. Each of these four hardware units is tailored to the computation of one of the four LSTM gates (i.e., forget gate, input gate, cell updater gate and output gate). The reason for this one-to-one gate-to-CU mapping is that exchanging information between LSTM gates is not needed for most of the cell state computation.
The computation on a gate is mainly dominated by the calculation of the matrix-vector multiplications detailed in Section 2.2. Note that each gate performs exactly two matrix-vector multiplications (i.e. two dot products for each neuron) per element of the input sequence and, therefore, the total computation is well balanced among the four gates. However, a minimal amount of information is shared among CUs at the end of the cell state calculation, in order to gather the necessary data for its update. As shown in Figure 6, both input and forget gates send their result to the cell updater gate, whereas the result produced in the cell updater gate is consumed by the output gate. Moreover, after the cell state is updated by the cell updater gate, it is sent to the input and forget gates.
On the other hand, because of multiple data dependencies, the intermediate results produced by one layer for an entire input sequence must be saved in memory. There are two main alternatives to store this information: a dedicated on-chip memory (OM) or main memory. In Figure 7, we show the normalized energy consumption and the reduction in accesses to main memory for some of the most common LSTM applications using both approaches. As we can observe, using a dedicated on-chip memory consumes on average 2.4x less energy than storing/loading continuously the intermediate results to/from main memory since, on average, 77% of the accesses to main memory are avoided. Therefore, this is the adopted solution in E-PUR. This dedicated on-chip memory is divided in two parts of equal size. One part is used to store the output results produced in the current layer and the other one is used to read the results produced in the previous layer. The reason for this double buffering is that any result from the previous layer cannot be overwritten until the complete input sequence has been evaluated.
3.3 Computation Unit
The Computation Unit is the hardware structure that implements the formal model of an LSTM cell, described in Figure 4. It is composed of two main components: the Dot Product Unit (DPU) and the Multifunctional Unit (MU). The DPU, shown at the top of Figure 8, performs the necessary dot product operations in a gate, which is the most time-consuming part. Note that our design employs dot products rather than matrix-matrix multiplications to simplify the hardware. The MU, shown at the bottom of Figure 8, performs the rest of the operations, such as activation functions or peephole calculations. In addition to these components, two memory buffers are used to store the input sequence and the synaptic weights for each gate in the LSTM cell. Note that the same weights are reused for each recurrent execution of an LSTM cell.
3.3.1 The Dot Product Unit
The DPU performs a floating point (FP) dot product between two vectors of length M by splitting them into K sub-vectors of size N. On each cycle, this unit executes the following steps. First, two size N sub-vectors are loaded from two different on-chip scratchpad memories: the Weight Buffer and the Input Buffer. The former keeps all the synaptic weights of a given layer. The latter stores either the input vector or the previous output vector of the layer being evaluated. Next, the N-element FP Multiplier performs an element-wise multiplication of the two sub-vectors. Then, the resulting vector is sent to the N-element FP Reduction Adder, in order to sum together all its elements, which takes $\log_2 N$ cycles. Finally, the resulting value is added to the value stored in a register called Accumulator, which accumulates the partial dot product until the results of all K sub-vectors are added together.
As shown in Figure 4, to evaluate a neuron in a given gate, two dot product operations are required: one takes $x_t$ as input vector and the other one takes $h_{t-1}$. The resulting output values of these two operations are added. In the Computation Unit, these two dot product operations are computed sequentially for each neuron, so that the latter is automatically added to the result of the former in the Accumulator register. Then, the resulting value is sent to the Multifunctional Unit (MU), which performs the remaining operations depending on the gate. Note that when a value is sent to the MU, the DPU does not wait until the MU finishes. Instead, it proceeds with the evaluation of the remaining neurons since they do not depend on the previous ones.
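The DPU data flow described above can be modeled functionally as follows. This is a behavioral sketch only: it assumes a pairwise (tree) reduction adder and ignores pipelining and timing. The function names are hypothetical, and the accumulator is modeled as an argument so that the forward and recurrent dot products of a neuron can be chained.

```python
def reduction_tree(values):
    """Sum values pairwise, level by level; returns (sum, number_of_levels)."""
    vals = list(values)
    levels = 0
    while len(vals) > 1:
        if len(vals) % 2:
            vals.append(0.0)  # pad odd-sized levels with a neutral element
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
        levels += 1
    return vals[0], levels


def dpu_dot(weights, inputs, n, acc=0.0):
    """Dot product of two length-M vectors, processed as sub-vectors of size n.

    acc models the Accumulator register and may carry a previous result.
    """
    assert len(weights) == len(inputs)
    for start in range(0, len(weights), n):
        w = weights[start:start + n]            # from the Weight Buffer
        x = inputs[start:start + n]             # from the Input Buffer
        products = [wi * xi for wi, xi in zip(w, x)]
        partial, _ = reduction_tree(products)   # N-element reduction
        acc += partial
    return acc


def neuron_preactivation(wx_row, x, wh_row, h_prev, n):
    # The two dot products run back to back on the same accumulator.
    acc = dpu_dot(wx_row, x, n)
    return dpu_dot(wh_row, h_prev, n, acc=acc)


print(dpu_dot([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8, n=4))
```

With N = 4, the tree in `reduction_tree` has two levels, matching the logarithmic depth of a pairwise adder tree.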
3.3.2 The Multifunctional Unit
The Multifunctional Unit (MU) is a configurable hardware component whose activity depends on the Computation Unit (e.g. input gate) where it is located, and on the configuration provided by the user. One input to the MU is the DPU output value, which corresponds to the neuron's evaluation for forward and recurrent connections. On the other hand, some of the operations performed in a particular MU may require values produced in other Computation Units, as explained in Section 3.2.
As shown in Figure 8, an MU is composed of a register file, an interconnection network and several floating point units that implement basic operations: multiplication, addition, division, comparison and exponential. Also, each MU receives the required synaptic information, weights for peephole connections and biases, through the Weight Buffer. Moreover, the previous cell state (i.e. for the previous element in the input sequence) comes through the Input Buffer.
In Table 1, we detail the basic steps performed by the four MUs once the output data from the DPUs is available. For the sake of simplicity, we assume a single cycle per operation and data transfer in Table 1. Note that for the evaluation we use Synopsys Design Compiler to set realistic latencies for the different operations and data transfers, as reported in Table 4. MUs are not in the critical path, since the DPU operations are more time consuming and, thus, there is slack to accommodate multi-cycle latencies for MU operations.
The MUs for the input and forget gates perform very similar operations: they perform the multiplications for peephole connections and add the bias. Next, they apply the sigmoid function to the result. After this, the resulting value is sent to the MU of the cell updater gate, which uses this information to proceed with the computation of the cell state, i.e. $c_t$, and, then, it applies the hyperbolic tangent function to this value. Once this information is computed, it is sent to the MU of the output gate, which computes the element of the output vector, i.e. $h_t$, corresponding to the current element of the input sequence (i.e. $x_t$). Finally, this value is sent to the Input Buffer of all the Computation Units. In addition, it is sent to the dedicated on-chip memory where it is stored to be consumed by the next layer, as described in Section 3.2. Communication between MUs is performed by dedicated links, as shown in Figure 6.
3.4 MWL: Maximizing Weight Locality
As shown in Figure 9, on-chip memory requirements to store the synaptic weights are still quite significant for some applications (e.g. GMAT), despite the optimizations proposed in Section 3.1. In order to further improve energy consumption and reduce on-chip memory requirements, we propose a technique that maximizes temporal locality of the accesses to the weights, which are performed for each layer. We call this technique Maximizing Weight Locality (MWL). The key observation is that forward connections (i.e. those whose inputs come from the previous layer) can be processed in any order, since all the output results from the previous layer are available. Therefore, E-PUR processes forward connections in an order that improves temporal locality. The idea is that in a given gate, instead of completely evaluating all the neurons for a single element ($x_t$) of the input sequence, the evaluation for all the neurons is split in two steps. In the first step, all the neurons are evaluated using as input the forward connections for the whole input sequence (i.e. $x_1, \ldots, x_N$) and the intermediate results are saved. For the second step, MWL proceeds with the computation of all neurons for the recurrent connections (i.e. $h_1, \ldots, h_N$). Note that in this case, the evaluation must be done in sequence since data dependencies in the recurrent connections impose strict sequential order.
With this approach, E-PUR reuses a small subset of the weights, those corresponding to a particular neuron, at extremely short distances. Note that for a given neuron, once it is partially computed for all elements of the input sequence, its corresponding weights will no longer be required and, thus, they can be evicted from on-chip memory. Therefore, while processing forward connections, E-PUR only requires on-chip storage for the forward weights of a single neuron at a time, significantly reducing on-chip storage requirements and energy consumption. As shown in Figure 9, the storage requirements for the weights are reduced by approximately 50% on average. Note that recurrent connections are evaluated as usual and, hence, all the associated weights for a given layer must be stored on-chip to avoid excessive accesses to off-chip memory.
The drawback of MWL is that it requires additional memory to store the partial evaluations of all neurons on a given layer. In the design of E-PUR, presented in Section 3.3, neurons in a cell are completely evaluated for an element in the input sequence before proceeding to the next input element. Therefore, only the final output vector of a cell, $h_t$, has to be stored in a memory buffer. On the other hand, with MWL, the neurons are first partially evaluated for all the elements in the input sequence, by operating exclusively on the forward connections. In this case, the partial evaluations for the neurons in each of the four gates must be stored, since later they have to be merged with the result of evaluating the recurrent connections, in order to produce the final output. This requires an increase in on-chip storage requirements for intermediate results, but this overhead is minimized by applying linear quantization to the partial output results. The next subsections provide further details on the implementation and trade-offs of MWL.
3.4.1 Prioritize Forward Connections
The conventional way to evaluate the input sequence in a layer is by performing all the necessary computations of the current element in the input sequence before starting with the next one. It implies the evaluation of both forward and recurrent connections in each layer. However, by following this order, the temporal locality to access the weights from each gate is suboptimal. As we can see in the left part of Figure 10, the reuse distance of a weight access is equal to the sum of the sizes of the two weight matrices, i.e. $W_x$ and $W_h$. This has a direct impact on storage requirements, since a longer reuse distance requires a larger on-chip memory to hold the weights in order to avoid expensive off-chip memory accesses.
MWL improves temporal locality in the weight accesses by changing the evaluation order of the two feed-forward networks across the entire input sequence in a given layer. It is based on the observation that all feed-forward networks that take $x_t$ as input vector, i.e. those that contain forward connections, do not depend on the previous output of the layer, as we can see in Figure 4. Therefore, we improve temporal locality by partially evaluating all the neurons in a layer for the entire input sequence and then proceeding with the recurrent connections ($h_{t-1}$), instead of sequentially evaluating the neurons in the layer for $x_t$ and $h_{t-1}$ and then proceeding with $x_{t+1}$ and $h_t$. This reduces the storage requirements to the size of a single feed-forward network, as seen in Figure 9.
Note that for a given neuron in a cell, its computations use the same subset of weights (i.e. a single row from the weight matrix of the feed-forward network). Therefore, the reuse distance is reduced to a single row of the feed-forward matrix, as we can see in the middle part of Figure 10. Hence, we store these weights in a small buffer (4 KB), thus avoiding accesses to the weight buffer for the forward connections. As a result, as shown in Figure 9, the accesses to the weight buffer are reduced by 50% on average.
Finally, after the partial evaluation of the forward connections for all the neurons in a layer, the evaluation for recurrent connections is performed as explained in Section 3.2, i.e. the next input is not evaluated until the results of the current input are computed, to respect data dependencies (right part of Figure 10).
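The reordering can be sketched on a simplified recurrent layer. The example below assumes a single gate of the form h_t = tanh(W_x x_t + W_h h_{t-1} + b) rather than a full LSTM cell, purely to keep the code short; the function names are hypothetical. It shows that evaluating the forward connections first, one neuron (weight row) at a time over the whole sequence, yields the same outputs as the conventional element-by-element order.

```python
import math


def dot(row, vec):
    return sum(w * v for w, v in zip(row, vec))


def conventional(Wx, Wh, b, xs):
    """Evaluate forward and recurrent connections element by element."""
    h = [0.0] * len(Wh)
    outputs = []
    for x in xs:
        h = [math.tanh(dot(Wx[j], x) + dot(Wh[j], h) + b[j])
             for j in range(len(Wh))]
        outputs.append(h)
    return outputs


def mwl(Wx, Wh, b, xs):
    """Two-step evaluation: forward connections first, then recurrent ones."""
    # Step 1: each row of Wx is used for the entire sequence before moving
    # on, so only one row of forward weights is live at any time.
    partial = [[0.0] * len(Wx) for _ in xs]
    for j, row in enumerate(Wx):
        for t, x in enumerate(xs):
            partial[t][j] = dot(row, x)
    # Step 2: recurrent connections, strictly in sequence order.
    h = [0.0] * len(Wh)
    outputs = []
    for t in range(len(xs)):
        h = [math.tanh(partial[t][j] + dot(Wh[j], h) + b[j])
             for j in range(len(Wh))]
        outputs.append(h)
    return outputs


Wx = [[0.5, -0.25], [0.125, 0.75]]
Wh = [[0.1, 0.2], [-0.3, 0.4]]
b = [0.0, 0.5]
xs = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.25]]
same = conventional(Wx, Wh, b, xs) == mwl(Wx, Wh, b, xs)
print(same)
```

The `partial` array corresponds to the intermediate results that MWL must keep on-chip, which is the storage overhead discussed in the next subsection.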
3.4.2 Storage of the Intermediate Results
The dedicated on-chip memory for intermediate results (see Section 3.2) is dimensioned to hold the final outputs (i.e. $h_t$) for a given layer, which are produced by the output gates in each cell. When using MWL, the temporary values produced by each gate while evaluating forward connections must be saved for the entire input sequence, since the MUs will need these values to compute the final outputs, as explained above. Therefore, the main drawback of this technique is the extra storage required for these intermediate values, which is equal to four times the memory needed to store the outputs, because intermediate values are produced in the four gates. In order to deal with this issue, E-PUR applies a well-known technique, linear quantization, which reduces the number of bits needed to represent these values, at the expense of potentially some loss in accuracy. More specifically, we apply linear quantization using 8 bits per element, introducing negligible accuracy loss in our set of neural networks. Empirically we found that for the networks EESEN and RLDRADSPR the WER degrades by less than 1%. For the other three networks (BYSDNE, LDLRNN, GMAT), we observed an accuracy loss of less than 0.5%. Note that previous work reported similar results [21, 22].
When using linear quantization, for a given neuron with partial output (i.e. $p_t$) produced in MWL, its quantized value (i.e. $q_t$) is computed using the following equations:

$$\beta = \frac{2^{n-1} - 1}{\alpha}$$

$$q_t = \mathrm{round}(\beta \cdot p_t)$$

where $n$ is the number of bits of the quantized value (represented as an integer), i.e. 8 bits, and $\alpha$ is the maximum value of $p_t$. Theoretically, the value of $\alpha$ is unbounded; however, we empirically found that its absolute value is normally less than 20 for recurrent neural networks. Note that the constant $\beta$ is computed offline.
In order to compute the previous equations, we extended the MU with functional units to support AND, OR and SHIFT operations. We implemented the rounding operation by adding one to the product followed by a sequence of AND, OR, additions and multiplications. These operations are performed in parallel with the computation of $p_{t+1}$ done by the DPU. Once the casting is completed, the value is stored in the on-chip memory for intermediate results.
After all the partial outputs ($p_t$) for all the neurons are computed, recurrent connections are evaluated as explained in Section 3.4.1. However, before computing the final output for a given gate in a cell, the previous quantized values must be converted back to floating point numbers and added to the result of evaluating the recurrent connections. We implemented this value conversion through a look up table that maps the integer quantized value to its floating point representation. Note that the size of this table is small since $n$ is small (i.e. 8 bits in our experiments) and it is computed offline.
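A minimal software sketch of this quantize and look-up scheme is shown below. It assumes a symmetric linear quantizer with a scale derived from the number of bits and the maximum absolute value, plus saturation for out-of-range inputs; these choices, the function names and the use of Python's built-in `round` are assumptions for illustration and do not describe the hardware rounding sequence.

```python
def make_quantizer(n_bits=8, alpha=20.0):
    """Build quantize/dequantize functions for values in [-alpha, alpha]."""
    q_max = 2 ** (n_bits - 1) - 1
    beta = q_max / alpha                      # scale constant, computed offline
    # Look-up table from integer code to floating point value, built offline.
    lut = {q: q / beta for q in range(-q_max, q_max + 1)}

    def quantize(value):
        q = int(round(beta * value))
        return max(-q_max, min(q_max, q))     # saturate out-of-range values

    def dequantize(q):
        return lut[q]

    return quantize, dequantize


quantize, dequantize = make_quantizer(n_bits=8, alpha=20.0)
code = quantize(3.3)
restored = dequantize(code)
print(code, restored)
```

With 8 bits and a maximum absolute value of 20, the quantization step is 20/127, so the round-trip error of any in-range value is at most half of that step.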
4 Evaluation Methodology
As our set of benchmarks, we use five recent LSTM networks which are described in Table 2. Our selection includes RNNs for popular applications such as speech recognition, machine translation or video classification. Each of these networks has a different number of internal layers and outputs, i.e. number of cells. Moreover, there are some networks that only perform a single pass for inference computation, i.e. they are unidirectional, whereas two of them, EESEN and GMAT, are bidirectional. On the other hand, we include networks with and without peephole connections. Therefore, our selection covers a wide range of LSTM designs with different sizes, from small RNNs of one Mbyte to large RNNs of hundreds of Mbytes. For each network we used the accuracy metric listed in Table 2 and the test set provided in each work.
|Network||App Domain||Layers||Neurons||Passes||Peephole||Size (MB)||Accuracy|
|BYSDNE ||Video Classification||5||512||1||Yes||40||88.6%|
|RLDRADSPR ||Speech Recognition||10||1024||1||Yes||118||39.3 WER|
|EESEN ||Speech Recognition||5||320||2||Yes||42||23.8 WER|
|LDLRNN ||Time Series||2||128||1||No||1||85%|
|GMAT ||Machine Translation||17||1024||1||No||272||24.1 BLEU|
As our baseline platform, we use an NVIDIA Tegra X1 SoC  whose parameters are shown in Table 3. Its energy consumption has been measured by reading the registers of the Texas Instruments INA3221 power monitor included in the Jetson TX1 development board. Regarding the software implementation of the networks, we implemented them using Keras, a high-level neural networks API. We use the Theano backend to run the LSTM networks. Theano relies on cuBLAS, a high-performance CUDA library, to perform matrix operations. Finally, we also implemented MWL in software for the Tegra X1 (Tegra X1+MWL) to analyze the benefits of a software-only implementation. We used CUDA to implement this version and employed kernel fusion  to merge the processing of different gates in one kernel, avoiding an excessive number of API calls, which represent a significant overhead on this platform.
|CPU||4-core ARM A-57|
|GPU||256-core Maxwell GPU|
|Streaming Multiprocessors||2 (2048 threads/proc)|
|CPU L2 Cache||2 MB|
|GPU L2 Cache||256 KB|
To evaluate our accelerator, we have developed a cycle-accurate simulator of E-PUR. This simulator estimates the total energy consumption (static and dynamic) and execution time of LSTM networks running on top of E-PUR. We used Verilog to implement the different pipeline components of E-PUR, and we synthesized them using the Synopsys Design Compiler to obtain their delay and energy consumption. We use a typical process corner with a voltage of 0.78 V, and we assume average switching activity to estimate dynamic power. We used CACTI to estimate the delay and energy (static and dynamic) of on-chip memories. Finally, to estimate the timing and energy consumption of main memory we used Micron models . We model a 4 GB LPDDR4 DRAM.
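As a rough illustration of how such a simulator can combine static and dynamic contributions, the following Python sketch adds the static energy (static power times execution time) to the dynamic energy (event counts times energy per event). The function name, the component names and all numeric values are placeholders chosen for the example; they are not measurements or parameters from our simulator, except for the 500 MHz clock frequency.

```python
def total_energy_joules(cycles, frequency_hz, static_power_w, dynamic_events):
    """Total energy = static power * execution time + sum of dynamic energies.

    dynamic_events maps a component name to a tuple
    (event_count, energy_per_event_in_joules).
    """
    exec_time_s = cycles / frequency_hz
    static_energy = static_power_w * exec_time_s
    dynamic_energy = sum(count * joules for count, joules in dynamic_events.values())
    return static_energy + dynamic_energy


# Placeholder event counts and per-event energies (NOT values from the paper).
events = {
    "weight_buffer_read": (1_000_000, 20e-12),
    "intermediate_buffer_access": (200_000, 15e-12),
    "fp_multiply_add": (16_000_000, 1e-12),
}

energy = total_energy_joules(
    cycles=1_000_000,
    frequency_hz=500e6,      # clock frequency used in our experiments
    static_power_w=0.1,      # placeholder leakage power
    dynamic_events=events,
)
print(f"Estimated energy: {energy * 1e6:.1f} uJ")
```

In an actual flow, the per-event energies would come from the synthesis, CACTI and DRAM models, and the event counts from the cycle-accurate simulation.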
Regarding the clock frequency, we used the delays reported by Synopsys Design Compiler and CACTI to set the frequency such that most hardware structures complete their operation in one clock cycle. In addition, we evaluated alternative frequency values in order to minimize energy consumption. Note that many hardware components, such as floating point multipliers, are pipelined and have a latency larger than one clock cycle, as shown in Table 4.
The remaining configuration parameters of E-PUR used for our experiments are shown in Table 4. We select an energy-efficient configuration that achieves real-time performance for all the neural networks in Table 2. Note that E-PUR is designed to accommodate large LSTM networks and, thus, its on-chip storage might be over-sized for the small models used in some applications. In this case, unused memory banks to store weights and intermediate results are power gated to reduce static power.
|Parameter||E-PUR||E-PUR+MWL|
|Technology||28 nm||28 nm|
|Frequency||500 MHz||500 MHz|
|Intermediate Memory||6 MB||6 MB|
|Weights Memory||4 MB per CU||2 MB per CU|
|Inputs Memory||8 KB per CU||4 KB per CU|
|DPU Width||16 operations||16 operations|
|MU Operations||cycles: 2 (ADD), 4 (MUL), 5 (EXP)|
|MU Communication||2 cycles||2 cycles|
|Peak Bandwidth||30 GB/s||30 GB/s|
5 Experimental Results
In this section, we present the evaluation of E-PUR, our processing unit for RNNs. The baseline configuration used for comparison purposes is a Theano implementation running on a mobile NVIDIA Tegra X1 platform. The configuration labeled as E-PUR throughout this section consists of our first design presented in Section 3.2, whereas the configuration E-PUR+MWL includes our technique for improving the temporal locality of the weights described in Section 3.4. First, we present the energy reduction achieved by these two configurations with respect to the Tegra X1. Second, the performance improvement over the baseline is analyzed. Third, the power consumption for each of these configurations is shown. Fourth, we present the total area required by E-PUR. Finally, we analyze the performance of a software-only implementation of MWL.
Figure 11 shows the energy reduction. On average, E-PUR and E-PUR+MWL achieve 61x and 92x energy reduction respectively. All the LSTM networks show large improvements of at least 28x reduction in energy consumption. A remarkable case is LDLRNN, for which E-PUR and E-PUR+MWL reduce the total energy by 352.4x and 496.1x, respectively. The reason for this large energy reduction is that LDLRNN has fewer outputs per layer, i.e. a smaller number of neurons, which means that the matrix-vector multiplications require fewer operations and fewer memory accesses are needed to fetch the weights or intermediate results. This penalizes the Tegra X1 platform because the ratio between computations in the GPU and other related tasks (e.g., GPU synchronization, CPU work, etc.) is smaller. Note that for E-PUR most of the energy savings come from avoiding accesses to main memory to load/store intermediate results and weights. In the case of E-PUR+MWL, additional energy savings come from reducing the accesses to the on-chip memory for weights by 50% on average.
Figure 12 shows the energy breakdown for the two configurations of E-PUR. The different components of E-PUR are grouped into “scratchpad memories”, which includes all the on-chip memories, and “operations”, which includes the pipeline components, such as the functional units. Since on-chip memory requirements and the number of memory accesses are significant, the overall energy consumption is dominated by the dynamic accesses to on-chip memories, which consume around 80%. Because MWL reduces the dynamic accesses to the weight buffer by 50% on average, the dynamic energy due to on-chip memories is reduced by 31% on average for E-PUR+MWL. Note that the energy consumption due to scratchpad memories is not reduced by 50% since there is an increase in memory accesses to the on-chip memory for intermediate results. In the case of the leakage due to on-chip memories, after applying MWL, it is reduced by more than 50% on average. This saving comes from the reduction in storage requirements to store the weights for the forward connections. Hence, the savings in leakage and dynamic energy result in a 35% reduction of the total energy consumption. Regarding the energy consumption due to the operations, it ranges between 10% and 20% of the total energy for both configurations.
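As a sanity check, the reported 35% saving can be compared with the saving implied by the average energy reduction factors of Figure 11. The short Python calculation below is only a back-of-the-envelope sketch: averages of per-network ratios do not compose exactly, so the two figures are expected to be close rather than identical.

```python
epur_reduction = 61.0       # average energy reduction of E-PUR vs. Tegra X1
epur_mwl_reduction = 92.0   # average energy reduction of E-PUR+MWL vs. Tegra X1

# Energy of each configuration relative to the Tegra X1 baseline (= 1.0).
epur_energy = 1.0 / epur_reduction
epur_mwl_energy = 1.0 / epur_mwl_reduction

# Fraction of the E-PUR energy removed by MWL, as implied by the two averages.
implied_saving = 1.0 - epur_mwl_energy / epur_energy
print(f"Implied saving of MWL over E-PUR: {implied_saving:.1%}")
```

The implied saving is about 34%, in line with the 35% average reduction reported above.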
Figure 13 shows the speedups for different LSTM networks. On average, the speedup achieved by E-PUR over Tegra X1 is 18.7x. E-PUR performance improvements come from hiding memory latency (i.e., loading/storing is overlapped with computations), reducing off-chip memory accesses, and featuring a custom pipeline tailored to LSTM computation. Note that, for E-PUR, once the weights and input frames are loaded from the main system, there is no extra overhead from the main application. However, since the Tegra X1 is tailored to a broader range of applications, its performance is impacted by the overhead due to related tasks (e.g., GPU synchronization, CPU work, etc.). Regarding E-PUR+MWL, there is no further performance improvement over E-PUR, since the order in which MWL evaluates the neurons does not change the final execution time. Note that in MWL, the number of operations to evaluate a given neuron is equal to the number of operations for the conventional order. However, because the evaluation of the recurrent connections for a given neuron is postponed until all forward connections are evaluated, the latency to evaluate a single neuron increases but the latency to produce the final output sequence does not change. Finally, for speech recognition applications, E-PUR achieves real-time performance by a large margin, running 30x and 5x faster than real-time for EESEN and RLDRADSPR respectively.
Power dissipation is shown in Figure 14, which includes the total power for Tegra X1 and the two configurations of E-PUR. As can be seen, E-PUR+MWL dissipates 5x lower power than Tegra X1 on average.
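Since energy is the product of power and execution time, the average power reduction and the average speedup can be combined to cross-check the average energy reduction. The following Python snippet is a rough consistency check only: the 5x power figure is rounded, and averages across networks do not multiply exactly.

```python
speedup = 18.7          # average speedup over Tegra X1 (same for E-PUR+MWL)
power_reduction = 5.0   # average power reduction of E-PUR+MWL vs. Tegra X1

# energy = power * time, so the two reduction factors multiply.
implied_energy_reduction = power_reduction * speedup
print(f"Implied energy reduction: {implied_energy_reduction:.1f}x")
```

The result, roughly 93.5x, is close to the 92x average energy reduction reported for E-PUR+MWL.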
Regarding area, E-PUR requires a total area of 64.6 , whereas the total area of E-PUR+MWL is 46.3 . As depicted in Figure 15, the component with the largest contribution to the total area is the on-chip memory for the synaptic weights, which is reduced by 50% when MWL is applied.
Finally, Figure 16 shows the speedup and energy reduction of the Tegra X1+MWL, i.e. MWL implemented in software, with respect to the baseline. On average, it provides a 2x energy reduction and a 2.3x speedup. EESEN and LDLRNN exhibit large improvements in performance and energy. These RNNs have a smaller number of neurons than the others (see Table 2) and, hence, the synaptic weights can be stored in the on-chip storage of the mobile GPU and reused for the entire layer evaluation, i.e. for the whole input sequence. On the other hand, the benefits are significantly smaller for BYSDNE, RLDRADSPR and GMAT. These networks feature a larger number of neurons and, hence, the synaptic weights of one LSTM cell cannot be stored on-chip in Tegra X1, increasing off-chip memory traffic to a large extent. Note that the on-chip memories of Tegra X1 are considerably smaller than the ones included in E-PUR, as illustrated in Table 3 and Table 4. This lack of on-chip storage constrains the effectiveness of Tegra X1+MWL for RNNs with large cell dimensionality.
6 Related Work
Improving the energy-efficiency of LSTM networks has attracted the attention of the architectural community in the last few years. Proposals for LSTM network acceleration have been presented in [19, 20, 31]. Although these accelerators achieve higher performance per watt than CPUs and GPUs, they are not designed for low-power mobile devices since their power dissipation ranges from 19 W to 41 W. In contrast, E-PUR dissipates a peak power of 970 mW, which is suitable for low-power mobile devices.
Chang et al.  present a low-power accelerator targeting the mobile segment. It implements a small LSTM network (2 layers, 128 neurons) and dissipates 1.9 W. In this work arithmetic operations are done using the fixed-point Q8.8 data format, and thus an accuracy loss of 7.1% is incurred. On the contrary, E-PUR uses floating point operations (either FP16 or FP32) and supports larger network models for a wide variety of application domains. Note that scaling up the aforementioned accelerator to support larger LSTM networks would require a significant increase in local storage capacity or in main memory traffic, and both alternatives would come at a high overhead in energy consumption.
Another low-power LSTM accelerator is presented in . This system consumes 9 W and supports larger models by using aggressive weight quantization. External DRAM traffic is completely avoided by storing the quantized weights in a local on-chip memory of 2 Mbytes. However, this quantization comes at the expense of a non-negligible accuracy loss. For speech recognition, Word Error Rate increases from 13.5%, using 32-bit floating point, to 15.1% and 20.2% when using 6-bit and 4-bit quantization respectively. Furthermore, larger and more accurate models cannot be stored in its local memory even with the 4-bit quantization. For example, EESEN requires more than 5 Mbytes when using 4 bits per weight. Our work is different since E-PUR+MWL uses 8-bit quantization to reduce the size of intermediate results with a negligible impact on accuracy.
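The EESEN figure quoted above can be reproduced with a simple scaling of the model size in Table 2. The Python sketch below assumes that the 42 MB listed for EESEN corresponds to 32-bit weights and that the model size scales linearly with the number of bits per weight; the helper `quantized_size_mb` is introduced here only for illustration.

```python
def quantized_size_mb(size_mb, original_bits, quantized_bits):
    """Model size after re-encoding every weight with fewer bits."""
    return size_mb * quantized_bits / original_bits


eesen_size_mb = 42.0     # EESEN size from Table 2 (assumed to use 32-bit weights)
local_memory_mb = 2.0    # on-chip memory of the accelerator discussed above

sizes = {bits: quantized_size_mb(eesen_size_mb, 32, bits) for bits in (6, 4)}
for bits, size in sizes.items():
    verdict = "fits" if size <= local_memory_mb else "does not fit"
    print(f"{bits}-bit EESEN: {size:.2f} MB, {verdict} in {local_memory_mb:.0f} MB")
```

Under these assumptions, 4-bit weights still require 5.25 MB, well above the 2 Mbytes available on-chip in that design.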
The LSTM accelerator ESE  achieves high performance and energy-efficiency by exploiting linear quantization and aggressive pruning. The main application for this work is speech recognition and its main targets are high-end systems. On the contrary, E-PUR targets mobile devices and achieves high energy-efficiency by improving the temporal locality of the memory accesses that fetch synaptic weights. Moreover, E-PUR supports a large variety of applications. We leave the use of pruned models in E-PUR as future work.
Regarding the work in , E-PUR without MWL is similar to a weight stationary architecture applied to LSTMs since it loads all the weights for a given layer in on-chip memory, holding them until all associated computations are performed. However, MWL is different since it aims at further reducing the reuse distances. Unlike traditional weight stationary architectures, MWL splits synaptic weights in two types, forward and recurrent, based on the observation that forward connections can be processed in any order, whereas recurrent connections impose sequential processing due to data dependencies. Therefore, MWL evaluates forward connections in the order that maximizes temporal locality, requiring a small amount of extra on-chip storage for this stage, whereas it processes all recurrent connections in a second stage as shown in Figure 10. MWL greatly reduces the energy consumption of the baseline accelerator.
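The two-stage evaluation order can be illustrated with a small Python sketch. To keep it short, it assumes a simplified recurrent layer, h_t = tanh(W_x x_t + W_h h_{t-1}), instead of a full LSTM cell, and it omits the quantization of the partial outputs; the function names and the toy weights are illustrative only. The first stage visits each neuron once and reuses its forward weights across all time steps, while the second stage processes the recurrent connections sequentially.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def conventional_order(Wx, Wh, xs):
    """Evaluate forward and recurrent connections together, step by step."""
    h = [0.0] * len(Wh)
    outputs = []
    for x in xs:
        h = [math.tanh(dot(Wx[n], x) + dot(Wh[n], h)) for n in range(len(Wh))]
        outputs.append(h)
    return outputs


def two_stage_order(Wx, Wh, xs):
    """Forward connections first (neuron-major), recurrent ones afterwards."""
    # Stage 1: the forward weights of neuron n are fetched once and reused
    # for every time step, which maximizes their temporal locality.
    partial = [[0.0] * len(Wx) for _ in xs]
    for n, weights in enumerate(Wx):
        for t, x in enumerate(xs):
            partial[t][n] = dot(weights, x)
    # Stage 2: recurrent connections impose sequential processing in time.
    h = [0.0] * len(Wh)
    outputs = []
    for t in range(len(xs)):
        h = [math.tanh(partial[t][n] + dot(Wh[n], h)) for n in range(len(Wh))]
        outputs.append(h)
    return outputs


# Toy layer: 2 neurons, 3 inputs, 4 time steps (arbitrary values).
Wx = [[0.1, -0.2, 0.3], [0.05, 0.4, -0.1]]
Wh = [[0.2, -0.3], [0.1, 0.25]]
xs = [[1.0, 0.5, -1.0], [0.0, 1.0, 0.5], [-0.5, 0.2, 0.1], [0.3, -0.7, 0.9]]

same_result = conventional_order(Wx, Wh, xs) == two_stage_order(Wx, Wh, xs)
print("Both orders produce the same outputs:", same_result)
```

Both functions perform the same dot products, so the number of operations is unchanged; only the order in which the forward weights are accessed differs, at the cost of buffering the partial outputs of the first stage.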
Finally, cuDNN  has been recently extended to efficiently support RNN training. The E-PUR design is significantly different in multiple ways. First, cuDNN focuses on RNN training with large batch sizes, whereas E-PUR focuses on RNN inference with a batch size of one, i.e. one input sequence at a time. We measured cuDNN performance for RNN inference with a batch size of one and found it is 1.5x faster than cuBLAS, whereas E-PUR achieves an 18.7x speedup. The effectiveness of cuDNN is reduced due to the small batch size commonly used for RNN inference. Furthermore, cuDNN’s optimizations to execute multiple layers in parallel cannot be applied to bidirectional LSTMs due to data dependencies.
7 Conclusions
In this paper, we present E-PUR, a processing unit for RNNs that supports large LSTM networks while dissipating low power, motivated by the increasingly important role of LSTM networks in applications such as speech recognition, machine translation and video classification. Unlike previous proposals that attempt to accommodate the entire RNN on-chip, E-PUR only provides storage for one LSTM layer, whose weights are fetched once from main memory and reused for multiple recurrent executions. To further improve the memory efficiency of E-PUR, we introduce Maximizing Weight Locality (MWL), a novel technique that improves the temporal locality of the synaptic weights. The proposed design supports large LSTM networks of hundreds of Megabytes, while using small on-chip storage and low memory bandwidth. Our results show that E-PUR reduces energy consumption by 92x on average with respect to a modern mobile GPU, while providing an 18.7x speedup.
References
-  I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, pp. 3104–3112, 2014.
-  J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in , pp. 2625–2634, 2015.
-  O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3156–3164, 2015.
-  S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, “Sequence to sequence-video to text,” in Proceedings of the IEEE international conference on computer vision, pp. 4534–4542, 2015.
-  Y. Miao, M. Gowayyed, and F. Metze, “EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding,” in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on, pp. 167–174, IEEE, 2015.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  M. Sundermeyer, R. Schlüter, and H. Ney, “LSTM neural networks for language modeling,” in Thirteenth Annual Conference of the International Speech Communication Association, 2012.
-  J. Kim, M. El-Khamy, and J. Lee, “Residual LSTM: Design of a deep recurrent architecture for distant speech recognition,” arXiv preprint arXiv:1701.03360, 2017.
-  A. X. M. Chang, B. Martini, and E. Culurciello, “Recurrent neural networks hardware implementation on FPGA,” arXiv preprint arXiv:1511.05552, 2015.
-  M. Lee, K. Hwang, J. Park, S. Choi, S. Shin, and W. Sung, “FPGA-based low-power speech recognition with recurrent neural networks,” in Signal Processing Systems (SiPS), 2016 IEEE International Workshop on, pp. 230–235, IEEE, 2016.
-  S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in Neural Information Processing Systems, pp. 1135–1143, 2015.
-  K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, “LSTM: A search space odyssey,” IEEE transactions on neural networks and learning systems, 2016.
-  R. Yazdani, A. Segura, J.-M. Arnau, and A. Gonzalez, “An ultra low-power hardware accelerator for automatic speech recognition,” in Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on, pp. 1–12, IEEE, 2016.
-  H. Tabani, J.-M. Arnau, J. Tubella, and A. Gonzalez, “An ultra low-power hardware accelerator for acoustic scoring in speech recognition,” in Parallel Architecture and Compilation Techniques (PACT), 26th International Conference on, IEEE/ACM, 2017.
-  M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
-  P. Baldi, S. Brunak, P. Frasconi, G. Pollastri, and G. Soda, “Bidirectional dynamics for protein secondary structure prediction,” in Sequence Learning, pp. 80–104, Springer, 2001.
-  A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional LSTM networks,” in Neural Networks, 2005. IJCNN’05. Proceedings. 2005 IEEE International Joint Conference on, vol. 4, pp. 2047–2052, IEEE, 2005.
-  F. A. Gers and J. Schmidhuber, “Recurrent nets that time and count,” in Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on, vol. 3, pp. 189–194, IEEE, 2000.
-  Y. Guan, Z. Yuan, G. Sun, and J. Cong, “FPGA-based accelerator for long short-term memory recurrent neural networks,” in Design Automation Conference (ASP-DAC), 2017 22nd Asia and South Pacific, pp. 629–634, IEEE, 2017.
-  S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, et al., “ESE: Efficient speech recognition engine with sparse LSTM on FPGA,” in FPGA, pp. 75–84, 2017.
-  Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016.
-  D. D. Lin, S. S. Talathi, and V. S. Annapureddy, “Fixed point quantization of deep convolutional networks,” CoRR, vol. abs/1511.06393, 2015.
-  J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, “Beyond short snippets: Deep networks for video classification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4694–4702, 2015.
-  Z. C. Lipton, D. C. Kale, C. Elkan, and R. Wetzell, “Learning to diagnose with LSTM recurrent neural networks,” arXiv preprint arXiv:1511.03677, 2015.
-  NVIDIA, “NVIDIA TEGRA X1 new mobile superchip.” http://international.download.nvidia.com/pdf/tegra/Tegra-X1-whitepaper-v1.0.pdf.
-  F. Chollet and Others, “Keras.”
-  R. Al-Rfou, G. Alain, A. Almahairi, C. Angermueller, D. Bahdanau, N. Ballas, F. Bastien, J. Bayer, A. Belikov, A. Belopolsky, Y. Bengio, A. Bergeron, J. Bergstra, V. Bisson, J. Bleecher Snyder, N. Bouchard, N. Boulanger-Lewandowski, X. Bouthillier, A. de Brébisson, O. Breuleux, P.-L. Carrier, K. Cho, J. Chorowski, P. Christiano, T. Cooijmans, M.-A. Côté, M. Côté, A. Courville, Y. N. Dauphin, O. Delalleau, J. Demouth, G. Desjardins, S. Dieleman, L. Dinh, M. Ducoffe, V. Dumoulin, S. Ebrahimi Kahou, D. Erhan, Z. Fan, O. Firat, M. Germain, X. Glorot, I. Goodfellow, M. Graham, C. Gulcehre, P. Hamel, I. Harlouchet, J.-P. Heng, B. Hidasi, S. Honari, A. Jain, S. Jean, K. Jia, M. Korobov, V. Kulkarni, A. Lamb, P. Lamblin, E. Larsen, C. Laurent, S. Lee, S. Lefrancois, S. Lemieux, N. Léonard, Z. Lin, J. A. Livezey, C. Lorenz, J. Lowin, Q. Ma, P.-A. Manzagol, O. Mastropietro, R. T. McGibbon, R. Memisevic, B. van Merriënboer, V. Michalski, M. Mirza, A. Orlandi, C. Pal, R. Pascanu, M. Pezeshki, C. Raffel, D. Renshaw, M. Rocklin, A. Romero, M. Roth, P. Sadowski, J. Salvatier, F. Savard, J. Schlüter, J. Schulman, G. Schwartz, I. V. Serban, D. Serdyuk, S. Shabanian, E. Simon, S. Spieckermann, S. R. Subramanyam, J. Sygnowski, J. Tanguay, G. van Tulder, J. Turian, S. Urban, P. Vincent, F. Visin, H. de Vries, D. Warde-Farley, D. J. Webb, M. Willson, K. Xu, L. Xue, L. Yao, S. Zhang, and Y. Zhang, “Theano: A Python framework for fast computation of mathematical expressions,” arXiv e-prints, vol. abs/1605.02688, May 2016.
-  G. Wang, Y. Lin, and W. Yi, “Kernel fusion: An effective method for better power efficiency on multithreaded GPU,” in Proceedings of the 2010 IEEE/ACM Int’l Conference on Green Computing and Communications & Int’l Conference on Cyber, Physical and Social Computing, pp. 344–350, IEEE Computer Society, 2010.
-  N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, “CACTI 6.0: A tool to model large caches,” HP Laboratories, pp. 22–31, 2009.
-  Micron Inc., “TN-53-01: LPDDR4 System Power Calculator.” https://www.micron.com/support/tools-and-utilities/power-calc.
-  S. Li, C. Wu, H. Li, B. Li, Y. Wang, and Q. Qiu, “FPGA acceleration of recurrent neural network based language model,” in Field-Programmable Custom Computing Machines (FCCM), 2015 IEEE 23rd Annual International Symposium on, pp. 111–118, IEEE, 2015.
-  Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks,” in Proceedings of the 43rd International Symposium on Computer Architecture, ISCA ’16, (Piscataway, NJ, USA), pp. 367–379, IEEE Press, 2016.
-  J. Appleyard, T. Kociský, and P. Blunsom, “Optimizing performance of recurrent neural networks on GPUs,” CoRR, vol. abs/1604.01946, 2016.