As humanity progresses into the digital era, more and more data is produced and distributed across the world. Deep Neural Networks (DNNs) provide a method for computers to learn from this mass of data. This unlocks a new set of possibilities in computer vision, speech recognition, natural language processing and more. However, DNNs are computationally expensive, so much so that general-purpose processors consume large amounts of power to deliver the desired performance. This limits the application of DNNs in the embedded world. Thus, a custom architecture optimized for DNNs provides superior performance per unit power and brings us a step closer to self-learning mobile devices.
Recurrent Neural Networks (RNNs) are becoming an increasingly popular way to learn sequences of data [sutskever2014sequence, cho2014learning, zaremba2014recurrent, graves2013speech], and they have been shown to be successful in various applications, such as speech recognition [graves2013speech], machine translation [sutskever2014sequence] and scene analysis [byeon2015scene]. A combination of a Convolutional Neural Network (CNN) with an RNN can lead to fascinating results such as image caption generation [vinyals2014show, mao2014explain, fang2014captions].
Due to the recurrent nature of RNNs, it is sometimes hard to parallelize all their computations on conventional hardware. General-purpose CPUs do not currently offer large parallelism, while small RNN models do not benefit fully from GPUs. Thus, an optimized hardware architecture is necessary for executing RNN models on embedded systems.
Long Short Term Memory, or LSTM [hochreiter1997long], is a specific RNN architecture that implements a learned memory controller to avoid vanishing or exploding gradients [bengio1994learning]. The purpose of this paper is to present an LSTM hardware module implemented on the Zynq 7020 FPGA from Xilinx [xilinx:zynq7000]. Figure 1 shows an overview of the system. As a proof of concept, the hardware was tested with a character level language model built from LSTM layers and hidden units. The following sections present the background on LSTM, related work, implementation details of the hardware and driver software, the experimental setup and the obtained results.
II. LSTM Background
One main feature of RNNs is that they can learn from previous information. But the questions are how far back a model should remember, and what it should remember. A standard RNN can retain and use recent past information [schmidhuber2015deep], but it fails to learn long-term dependencies: vanilla RNNs are hard to train on long sequences due to vanishing or exploding gradients [bengio1994learning]. This is where LSTM comes into play. LSTM is an RNN architecture that explicitly adds memory controllers to decide when to remember, forget and output. This makes the training procedure much more stable and allows the model to learn long-term dependencies [hochreiter1997long].
There are several variations of the LSTM architecture. One variant is the LSTM with peephole connections introduced by [gers2000recurrent]. In this variation, the cell memory influences the input, forget and output gates. Conceptually, the model peeks into the memory cell before deciding whether to memorize or forget. In [cho2014learning], the input and forget gates are merged into a single gate. There are many other variations, such as those presented in [sak2014long] and [otte2014dynamic]. All these variations show similar performance, as reported in [greff2015lstm].
The LSTM hardware module that was implemented focuses on the LSTM version that does not have peepholes, which is shown in figure 2. This is the vanilla LSTM [graves2005framewise], which is characterized by the following equations:

i_t = σ(W_xi x_t + W_hi h_{t-1} + b_i)
f_t = σ(W_xf x_t + W_hf h_{t-1} + b_f)
o_t = σ(W_xo x_t + W_ho h_{t-1} + b_o)
g_t = tanh(W_xg x_t + W_hg h_{t-1} + b_g)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)

where σ is the logistic sigmoid function, ⊙ is element-wise multiplication, x_t is the input vector of the layer, the W and b terms are the model parameters, c_t is the memory cell activation, g_t is the candidate memory cell gate, and h_t is the layer output vector. The subscript t-1 denotes results from the previous time step. i_t, f_t and o_t are respectively the input, forget and output gates. Conceptually, these gates decide when to remember or forget an input sequence, and when to respond with an output. The combination of two matrix-vector multiplications and a non-linear function, f(W_x x_t + W_h h_{t-1} + b), extracts information from the input and previous output vectors. This operation is referred to as a gate.
One needs to train the model to obtain the parameters that will give the desired output. In simple terms, training is an iterative process in which training data is fed in and the output is compared with a target. The model then backpropagates the error derivatives to update the parameters so as to minimize the error. This cycle repeats until the error is small enough [bishop2006pattern]. Models can become fairly complex as more layers and functions are added. In the LSTM case, each module has four gates and some element-wise operations. A deep LSTM network has multiple LSTM modules cascaded so that the output of one layer is the input of the following layer.
III. Related Work
Co-processors for accelerating computer vision algorithms and CNNs have been implemented on FPGAs. A system that can perform recognition on mega-pixel images in real time is presented in [farabet2010hardware]. A similar architecture for general purpose vision algorithms, called neuFlow, is described in [farabet2011neuflow]. neuFlow is a scalable architecture composed of a grid of operation modules connected with an optimized data streaming network. This system can achieve speedups up to in end-to-end applications.
An accelerator called nn-X for deep neural networks is described in [dundar2013accelerating, jin2014efficient, dundar2014memory, gokhale2014240]. nn-X is a high-performance co-processor implemented on an FPGA. The design is based on computational elements called collections that are capable of performing convolution, non-linear functions and pooling. The accelerator efficiently pipelines the collections, achieving up to G-op/s.
RNNs differ from CNNs in that they require a different arrangement of computation modules, which allows different hardware optimization strategies to be exploited. An LSTM learning algorithm using Simultaneous Perturbation Stochastic Approximation (SPSA) for a hardware-friendly implementation is described in [tavcar2013transforming]. That paper focuses on transforming the learning phase of LSTM for FPGA.
Another FPGA implementation, which focuses on the standard RNN, is described in [lifpga]. Their approach was to unfold the RNN model into a fixed number of timesteps and compute them in parallel. The hardware architecture is composed of a hidden layer module and duplicated output layer modules. First, the hidden layer serially processes the input for timesteps. Then, with the results of the hidden layer, the duplicated logic computes the outputs for timesteps in parallel.
This work presents a different approach to implementing RNNs on FPGA, focusing on the LSTM architecture. It differs from [lifpga] in that it uses a single module that consumes the input and previous output simultaneously.
The main operations to be implemented in hardware are matrix-vector multiplications and non-linear functions (hyperbolic tangent and logistic sigmoid). Both are modifications of the modules presented in [gokhale2014240]. For this design, the number format of choice is Q8.8 fixed point. The matrix-vector multiplication is computed by a Multiply ACcumulate (MAC) unit, which takes two streams: a vector stream and a weight matrix row stream. The same vector stream is multiplied and accumulated with each weight matrix row to produce an output vector whose size equals the weight matrix height. The MAC is reset after computing each output element to avoid accumulating the computations of previous matrix rows. The bias can be folded into the multiply-accumulate by appending the bias vector as the last column of the weight matrix and appending an extra vector element set to unity. This way there is no need for extra input ports for the bias, nor for an extra pre-configuration step in the MAC unit. The results from the MAC units are added together. The adder's output goes to an element-wise non-linear function, which is implemented with linear mapping.
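The bias-folding trick can be sketched as follows, using integer arithmetic for simplicity; the row and column counts are illustrative only:

```c
#define ROWS 2
#define COLS 3

/* Streamed MAC with the bias folded in: the weight matrix carries an extra
 * (COLS+1)-th column holding the bias, matched by a trailing unity element
 * appended to the input vector. The accumulator is reset for every output
 * element, mirroring the hardware MAC reset between matrix rows. */
void mac_with_folded_bias(int w[ROWS][COLS + 1], int x[COLS + 1], int y[ROWS])
{
    for (int r = 0; r < ROWS; r++) {
        int acc = 0;                   /* reset per output element */
        for (int c = 0; c < COLS + 1; c++)
            acc += w[r][c] * x[c];     /* last term is bias * 1 */
        y[r] = acc;
    }
}
```

Because the bias rides along as an ordinary matrix column, the streaming datapath needs no special-case logic for it.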
The non-linear function is segmented into line segments y = ax + b, with x limited to a particular range. The values of a, b and the x range are stored in configuration registers during the configuration stage. Each line segment is implemented with a MAC unit and a comparator. The MAC multiplies the input by a and accumulates with b. The comparison between the input value and the line range decides whether to process the input or pass it on to the next line segment module. The non-linear functions were segmented into 13 lines; thus the non-linear module contains 13 pipelined line segment modules. The main building block of the implemented design is the gate module, as shown in figure 3.
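A software sketch of the segmented linear mapping is shown below. The three hard-sigmoid-style segments are illustrative stand-ins, not the actual configured coefficients; the real design uses 13 segments per function:

```c
/* One line segment of the piecewise-linear mapping: y = a*x + b, valid for
 * inputs in [lo, hi). In hardware each segment is a MAC plus a comparator. */
typedef struct { float lo, hi, a, b; } segment_t;

/* Walk the pipeline of segments; the first one whose range contains x
 * computes the output, the others pass the value along. */
float pwl_eval(const segment_t *seg, int nseg, float x)
{
    for (int i = 0; i < nseg; i++)
        if (x >= seg[i].lo && x < seg[i].hi)
            return seg[i].a * x + seg[i].b;
    /* outside the configured range: saturate at the nearest boundary */
    return (x < seg[0].lo)
        ? seg[0].a * seg[0].lo + seg[0].b
        : seg[nseg - 1].a * seg[nseg - 1].hi + seg[nseg - 1].b;
}
```

With more segments, the approximation error of sigmoid and tanh shrinks at the cost of more pipelined comparator/MAC stages.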
The implemented module uses Direct Memory Access (DMA) ports to stream data in and out. The DMA ports use a valid/ready handshake. Because the DMA ports are independent, the input streams are not synchronized even when the module activates the ports at the same time. Therefore, a stream synchronizing module is needed. The sync block is a buffer that caches some streaming data until all ports are streaming. When the last port starts streaming, the sync block starts to output synchronized streams. This ensures that the vector and matrix row elements that go to the MAC units are aligned.
The gate module in figure 3 also contains a rescale block that converts bit values to bit values. The MAC units perform bit multiplication that results in bit values. The addition is performed on bit values to preserve accuracy.
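A minimal sketch of Q8.8 arithmetic and the rescale step, assuming the common convention of 8 integer and 8 fractional bits in a 16-bit word:

```c
#include <stdint.h>

/* Q8.8 fixed point: a 16-bit word with 8 integer and 8 fractional bits. */
typedef int16_t q8_8;

static q8_8  to_q8_8(float v)  { return (q8_8)(v * 256.0f); }
static float from_q8_8(q8_8 v) { return v / 256.0f; }

/* A 16x16-bit multiply yields a 32-bit product with 16 fractional bits;
 * the rescale step shifts right by 8 to return to Q8.8. Additions are done
 * on the wider intermediate value first to preserve accuracy. */
static q8_8 q_mul(q8_8 a, q8_8 b)
{
    int32_t wide = (int32_t)a * (int32_t)b;  /* Q16.16 intermediate */
    return (q8_8)(wide >> 8);                /* rescale to Q8.8 */
}
```

Keeping the accumulation in the wide format and rescaling only once at the end mirrors what the hardware rescale block does after the adder.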
The internal blocks are controlled by a state machine that performs a sequence of operations. The implemented design uses four bit DMA ports. Since the operations are done in bit, each DMA port can transmit two bit streams. The weight matrices are concatenated in main memory to exploit this feature. The streams are then routed to different modules depending on the operation to be performed. With this setup, the LSTM computation was separated into three sequential stages:
Compute i_t and g_t.
Compute f_t and o_t.
Compute c_t and h_t.
In the first and second stages, two gate modules (four MAC units) run in parallel to generate two internal vectors each (i_t, g_t, f_t and o_t), which are stored into a First In First Out (FIFO) buffer for the next stages. The ewise module consumes the FIFO vectors to output c_t and h_t back to main memory. After that, the module waits for new weights and new vectors, which can belong to the next layer or the next time step. The hardware also implements an extra matrix-vector multiplication to generate the final output. This is used only after the last LSTM layer has finished its computation.
This architecture was implemented on the Zedboard [Avnet:zedboard], which contains the Zynq-7000 SoC XC7Z020. The chip contains a dual ARM Cortex-A9 MPCore, which is used for running the LSTM driver C code and for timing comparisons. The hardware utilization is shown in table I. The module runs at MHz and the total on-chip power is W.
Table I: Hardware utilization per component (used / available, and percentage).
IV-B. Driver Software
The control and testing software was implemented in C. The software populates main memory with weight values and input vectors, and it controls the hardware module through a set of configuration registers.
The weight matrix has an extra element containing the bias value at the end of each row. The input vector contains an extra unity value so that the matrix-vector multiplication simply adds the last element of the matrix row (the bias addition). Usually the input vector size differs from the output vector size. Zero padding was used to match both the matrix row size and the vector size, which makes stream synchronization easier.
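The zero padding step can be sketched as follows (the lengths are illustrative):

```c
#include <string.h>

/* Pad a length-n vector to a common length m (m >= n) with zeros, so that
 * all streams share one row length and stay aligned during synchronization. */
void zero_pad(const float *src, int n, float *dst, int m)
{
    memcpy(dst, src, (size_t)n * sizeof *src);
    memset(dst + n, 0, (size_t)(m - n) * sizeof *dst);
}
```

The padded zeros contribute nothing to the multiply-accumulate, so they change only the stream length, not the result.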
Due to the recurrent nature of LSTM, c_t and h_t become the c_{t-1} and h_{t-1} for the next time step. Therefore, the input memory locations for c_{t-1} and h_{t-1} are the same as the output locations for c_t and h_t, and each time step c and h are overwritten. This is done to minimize the number of memory copies performed by the CPU. To implement a multi-layer LSTM, the output of the previous layer is copied to the input location of the next layer, so that each layer's output is preserved between layers for error measurements. This feature was removed when profiling time. The control software also needs to change the weights for different layers by setting different memory locations in the control registers.
The training script for the character level language model was written by Andrej Karpathy in Torch7. The code can be downloaded from GitHub (https://github.com/karpathy/char-rnn). Additional functions were written to transfer the trained parameters from the Torch7 code to the control software.
The Torch7 code implements a character level language model, which predicts the next character given a previous character. Character by character, the model generates text that looks like the training data set, which can be a book or a large internet corpus with more than MB of words. For this experiment, the model was trained on a subset of Shakespeare's work. The batch size was , the training sequence length was and the learning rate was . The model is expected to output Shakespeare-like text.
The Torch7 code implements a layer LSTM with hidden layer size (the weight matrix height). The character input and output is a one-hot encoded vector; the character it represents is the index of the vector's only unity element. The predicted character from the last layer is fed back to the input of the first layer for the following time step.
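The one-hot encoding and decoding described above can be sketched as follows; the vocabulary size of 65 is an assumption for illustration:

```c
#define VOCAB 65  /* hypothetical vocabulary size */

/* Encode a character index as a one-hot vector: all zeros except a single
 * unity element at the character's index. */
void one_hot_encode(int idx, float v[VOCAB])
{
    for (int i = 0; i < VOCAB; i++) v[i] = 0.0f;
    v[idx] = 1.0f;
}

/* Decode by locating the largest element (the unity element for a one-hot
 * input, or the argmax of the network's output scores). */
int one_hot_decode(const float v[VOCAB])
{
    int best = 0;
    for (int i = 1; i < VOCAB; i++)
        if (v[i] > v[best]) best = i;
    return best;
}
```

Decoding by argmax also covers the network's real-valued output vector, whose largest element selects the predicted character.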
For profiling, the Torch7 code was run on other embedded platforms to compare execution times. One platform is the Tegra K1 development board, which contains a quad-core ARM Cortex-A15 CPU and a 192-core Kepler GPU. The Tegra's CPU was clocked at a maximum frequency of MHz, the GPU at a maximum of MHz, and the GPU memory ran at MHz.
Another platform used was the Odroid XU4, which has the Exynos5422 with four high-performance Cortex-A15 cores and four low-power Cortex-A7 cores (ARM big.LITTLE technology). The low-power Cortex-A7 cores were clocked at MHz and the high-performance Cortex-A15 cores at MHz.
The C code LSTM implementation was run on the Zedboard's dual ARM Cortex-A9 processor clocked at MHz. Finally, the hardware was run on the Zedboard's FPGA clocked at MHz.
The number of weights in some models can be very large. Even our small model used almost KB of weights. Thus it makes sense to compress those weights into different number formats for a throughput versus accuracy trade-off. The use of the fixed point Q8.8 data format certainly introduces rounding errors. One may then ask how far these errors propagate to the final output. Comparing the results from the Torch7 code with the LSTM module's output for the same input sequence, the average percentage error for c_t was and for h_t was . Those values are averages over all time steps. The best case was and the worst was . The recurrent nature of LSTM did not accumulate the errors, and on average they stabilized at a low percentage.
The text generated by sampling characters (timesteps to ) is shown in figure 6. On the left is the text output from the FPGA, and on the right the text from the CPU implementation. The results show that the LSTM model was able to generate dialogue between characters, just like in one of Shakespeare's plays. The two implementations produced different texts, but the same behavior.
VI-B. Memory Bandwidth
The Zedboard Zynq ZC7020 platform has Advanced eXtensible Interface (AXI) DMA ports available. Each runs at MHz and sends packets of bits. This allows an aggregate bandwidth of up to GB/s for full-duplex transfers between the FPGA and the external DDR3 memory.
At MHz, one LSTM module is capable of computing M-ops/s and simultaneously uses AXI DMA ports for streaming weight and vector values. During peak memory usage, the module requires GB/s of memory bandwidth. This high memory bandwidth requirement limits the number of LSTM modules that can run in parallel. Replicating the LSTM module requires either higher memory bandwidth or the introduction of internal memory to lower the demand on the external DDR3 memory.
Figure 7 shows the timing results. One can observe that the implemented LSTM hardware was significantly faster than the other platforms, even while running at a lower clock frequency of MHz (the Zynq ZC7020 CPU runs at MHz). Scaling the implemented design by replicating the LSTM modules running in parallel will provide further speedup. Using 2 LSTM cells in parallel can be faster than the Exynos5422 running on the quad-core ARM Cortex-A7.
In one LSTM layer of size , there are K-ops. Multiplying this by the number of samples and the number of layers in the experimental application () gives a total of M-ops. The execution time divided by the number of operations gives the performance (ops/s). The power consumption of each platform was measured. Figure 8 shows the performance per unit power.
The GPU performance was lower for the following reason: the model is too small to benefit from the GPU, since the software needs to perform memory copies. This is confirmed by running the same Torch7 code on a MacBook Pro 2016, whose CPU executed the character level language model in s, whereas its GPU executed the same test in s.
Recurrent Neural Networks have recently gained popularity due to the success of the Long Short Term Memory architecture in many applications, such as speech recognition, machine translation, scene analysis and image caption generation.
This work presented a hardware implementation of an LSTM module. The hardware successfully produced Shakespeare-like text using a character level model. Furthermore, the implemented hardware proved significantly faster than other mobile platforms. This work can potentially evolve into an RNN co-processor for future devices, although further work needs to be done. The main future work is to optimize the design to allow parallel computation of the gates. This involves designing a parallel MAC unit configuration to perform the matrix-vector multiplication.
This work is supported by Office of Naval Research (ONR) grants 14PR02106-01 P00004 and MURI N000141010278, and by the National Council for the Improvement of Higher Education (CAPES) through the Brazil Scientific Mobility Program (BSMP). We would like to thank Vinayak Gokhale for the discussions on implementation and hardware architecture, and also thank Alfredo Canziani, Aysegul Dundar and Jonghoon Jin for their support. We gratefully appreciate the support of NVIDIA Corporation with the donation of GPUs used for this research.