I Introduction
Deep neural networks (DNNs) have been widely applied to solve various practical problems with state-of-the-art performance. Recurrent neural networks (RNNs), a subset of DNN architectures, are particularly useful in applications involving time series inputs, such as speech recognition [1, 2] and dynamical system control [3, 4]. In contrast to Convolutional Neural Networks (CNNs), which use filter kernels, RNNs are fully-connected networks: they take a 1D vector as input and produce a vector of output. The feature vectors generated by CNNs can be fed into an RNN for further processing. In this way, RNNs can connect high-dimensional input features over time, which is useful for complex sequential classification or regression tasks. Gated RNNs modify a “vanilla” RNN by adding nonlinear operations to the units that allow them to memorize and gate their output. Long Short-Term Memory (LSTM) units [5] and Gated Recurrent Units (GRU) [6] are used to overcome the vanishing gradient problem frequently encountered during vanilla RNN training with backpropagation through time (BPTT), where the sequential operations of the RNN are unrolled to compute the weight updates based on output error. By using BPTT with labeled training data, GRU and LSTM RNNs can be trained to high accuracy for tasks involving time series such as continuous speech recognition.

Edge computing devices that embed some intelligence implemented through trained DNNs have gained interest in recent years. An advantage of edge computing is that computations are done locally on end-user devices to reduce latency and protect privacy [8]. Most of the literature reports on the use of CNNs for edge devices. Less has been reported on devices that use RNNs, particularly on embedded low-latency, high-energy-efficiency platforms that use FPGAs. RNNs also have larger memory footprints, and memory access of the fully-connected weight matrices dominates power consumption. RNNs are usually computed on the cloud, which introduces large and variable latency, making it hard to guarantee real-time performance for edge applications such as human-computer interaction devices and mobile robotic applications.
Optimization methods have been applied to RNNs for embedded hardware implementations (e.g. weight pruning in ESE [9] and structured pruning in BBS [10]). We also previously reported the DeltaRNN accelerator [11], which uses the delta network algorithm [12]. Our first DeltaRNN implementation [11] stored the large weight matrices in FPGA block RAM and thus needed expensive FPGA boards with greater than 15 W power consumption. That work focused only on pushing the limit of high batch-1 throughput without considering the memory and power constraints of extreme edge devices. Typical chips for edge applications, such as microcontrollers and small-footprint FPGAs, have only a few hundred kilobytes (kB) of on-chip SRAM, but large RNNs usually have megabytes (MB) of parameters, making them difficult to buffer on-chip even after compression. In this case, storing RNN parameters in off-chip memory such as flash or DDR memory is an inevitable choice for edge devices. Therefore, the reported hardware RNN implementations [9, 10] cannot be easily scaled down for edge platforms.
This work is on EdgeDRNN, a hardware accelerator for RNN computation on the edge [13]. Because of our interest in real-time edge applications, our focus is on supporting low-latency batch-1 inference of large RNNs for real-time performance on low-cost, low-power, but very constrained edge platforms. We show that our implementation can run large-scale multi-layer RNNs using a small number of processing elements, with the large weight matrices stored in off-chip DRAM. Besides accelerating RNN inference, it leaves most cycles of the CPU in the system-on-chip (SoC) available for other tasks, such as feature extraction and I/O management. EdgeDRNN can be implemented on a small-footprint FPGA with 19X fewer logic cells and 15X less on-chip memory than the FPGA used for DeltaRNN [11]. Thus, EdgeDRNN is suitable for embedded system applications such as robotics (Fig. 1B).

Moreover, previous work using the delta network algorithm applied a global threshold to both the inputs and the hidden unit activations of every layer of the network when sparsifying the activation vector. In this work, we study how different threshold values on the inputs and on the activations of hidden units affect the trade-off between the accuracy of the network on a regression task and the sparsity levels of the change in the activation vectors. A higher sparsity level implies reduced weight memory access and fewer computes.
This paper makes the following contributions:

We describe a flexible, low-cost, high-throughput edge FPGA RNN accelerator that uses a spiking-neural-network-inspired delta RNN principle to provide state-of-the-art latency and power efficiency for a wide range of gated RNN network sizes with a batch size of 1.

We report the first study of a delta network that uses different delta thresholds for the input and the activations of the hidden units. On our tested regression task, this modification increases the temporal sparsity of hidden delta states by 16% compared to using a global threshold.

We compare the usability and throughput performance of two different EdgeDRNN implementations on the SoC FPGA: bare-metal and embedded Linux. The latter enables faster development, and we report the correct FPGA memory bus port configuration that minimizes the performance loss due to CPU contention for the memory controller.

We report benchmark latency and throughput numbers of RNN inference on state-of-the-art commercial chips for edge applications. To the best of our knowledge, these numbers have not been reported before.
The rest of this paper is organized as follows. Section II describes the background of the gated recurrent unit (GRU) based RNN and the algorithm of the delta network based GRU-RNN, which is called DeltaGRU. Section III describes the architectural design of the accelerator and its implementation on MiniZed. Section IV discusses experimental results, including the experiments using different delta thresholds for the network. Section V compares the proposed accelerator with prior work and commercial edge devices. Section VI concludes the paper.
II Background
II-A DNN hardware optimization methods
Various methods have been proposed to reduce the dominant RNN matrix-vector (MxV) operations. Static approximation methods (i.e. constructed during training) include quantization, arithmetic optimization, and weight pruning.
Quantization: Quantizing floating-point weights or activations to fixed-point numbers with shorter bit width reduces the memory footprint of networks and makes it possible to use fixed-point MAC units instead of expensive floating-point MAC units [14, 15, 16, 9]. Chip area can be further reduced by replacing conventional fixed-point multipliers with lookup-table-based [17] or multiplexer-based [18] multipliers on low-bit-precision networks with 2-4 bit weights. By including quantization during training (e.g. by using an approach like dual-copy rounding [19]), it is possible to reduce weight precision to 8 bits without accuracy loss.
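As a concrete illustration of the fixed-point quantization above, the following sketch (function name and defaults are ours, not from any cited accelerator) rounds a floating-point weight to a signed Q-format value and clamps it to the representable range:

```python
def quantize(w, frac_bits=7, bit_width=8):
    """Round-to-nearest quantization of a float weight to signed fixed point.

    bit_width=8 with frac_bits=7 gives Q1.7: resolution 1/128,
    range [-1.0, 127/128].
    """
    scale = 1 << frac_bits
    lo, hi = -(1 << (bit_width - 1)), (1 << (bit_width - 1)) - 1
    q = max(lo, min(hi, round(w * scale)))  # clamp to representable range
    return q / scale                        # dequantized value used downstream

print(quantize(0.4872))  # 0.484375, the nearest multiple of 1/128
```

Quantization-aware training keeps FP32 master weights and applies such a rounding step in the forward pass, so the network learns to tolerate the reduced precision.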
Weight pruning: Pruning removes unimportant neuron connections, resulting in sparse weight matrices [20]. Sparse matrices can be encoded into a sparse matrix format such as Compressed Sparse Column (CSC) or Compressed Sparse Row (CSR). With an accelerator that can decode the sparse matrix format on-chip, the sparse matrix-vector (SpMV) multiplication can be accelerated by executing multiply-and-accumulate (MAC) operations only on nonzero weights. This approach was adopted by the Efficient Speech Recognition Engine (ESE) [9]. Because unstructured pruning results in computation that is hard to balance across processing elements, structured pruning methods have been proposed to improve load balancing during the SpMV computation [10, 21]. This approach was used by the LSTM FPGA accelerator using Bank Balanced Sparsity (BBS) [10]. It is also used by the custom digital IC of [21], where it is called Hierarchical Coarse Grain Sparsity (HCGS). Structured pruning is a popular approach for improving RNN hardware performance; both BBS and HCGS use it to increase effective MAC efficiency, but large increases in efficiency result in significantly worse inference accuracy [21]. For example, a 16X compression increases the error rate by a factor of about 1.2X. Structured pruning allows static (compile-time) optimization, but training is fairly complicated, since exploration of the additional structure hyperparameter values is needed to find optimum values matched to the particular hardware.
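The CSR encode-and-multiply flow can be sketched in a few lines; this is a generic software model with our own naming, not the ESE or BBS hardware datapath:

```python
def to_csr(mat):
    """Encode a dense matrix (list of rows) into CSR arrays."""
    vals, cols, row_ptr = [], [], [0]
    for row in mat:
        for j, w in enumerate(row):
            if w != 0:
                vals.append(w)   # nonzero weight value
                cols.append(j)   # its column index
        row_ptr.append(len(vals))  # end of this row's nonzeros
    return vals, cols, row_ptr

def spmv(vals, cols, row_ptr, x):
    """y = W @ x, executing MACs only on the stored nonzero weights."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[k] * x[cols[k]]
        y.append(acc)
    return y

W = [[0, 2, 0], [1, 0, 3]]
print(spmv(*to_csr(W), [1.0, 2.0, 3.0]))  # [4.0, 10.0]
```

The load-balancing problem mentioned above arises because `row_ptr[r+1] - row_ptr[r]` (nonzeros per row) varies between rows, so parallel PEs assigned different rows finish at different times.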
Arithmetic: Bit-serial NN accelerators such as [22, 23] utilize flexible bit-serial MACs to support various precisions of network parameters and are smaller in area than conventional fixed-point MAC units. However, since a bit-serial MAC unit requires more cycles to finish a multiplication between high-bit-precision operands, more bit-serial MAC units are required to achieve higher throughput than conventional MAC units, and larger adder trees are needed for accumulating partial sums. Thus, the average speedup using this method is only around 2X, and it comes with extra area overhead. The C-LSTM accelerator used Toeplitz-like weight matrices in the form of blocked circulant matrices to reduce RNN memory requirements [24], since multiple rows in each circulant matrix block can be generated from a single vector. The method also enables the use of the Fast Fourier Transform (FFT) to reduce the MxV cost from O(N^2) to O(N log N) [25]. However, forcing weight matrices to be blocked circulant is coarse-grained and leads to higher accuracy degradation than weight pruning [9, 10]. Moreover, the method incurs the hardware overhead of computing the FFTs of activations and weights.

Temporal sparsity: The delta network algorithm [12] capitalizes on the temporal sparsity of activation state vectors in a network. Setting a finite threshold that is greater than zero has the effect of zeroing out below-threshold changes of the activation vector, which results in sparse delta vectors. Since zero delta elements have no downstream influence, these MACs can be skipped; entire columns of the weight matrix can be skipped. Thus delta networks marry the temporal sparsity of spiking networks with the synchronous update and analog state transmission of conventional deep networks. Combining these principles provides the benefits of sparse computing and efficient communication of precise analog information, with reduced and predictable access to inexpensive DRAM, which is crucial for storing the weights.
A set of studies [12, 13, 7] showed, across a variety of networks, that by applying the delta principle during training, the accuracy loss is minimal even with a 5-10X improvement of RNN throughput and latency. For example, [12] used a 4-layer, 320-units-per-layer GRU RNN for continuous speech recognition on the Wall Street Journal dataset. The Word Error Rate (WER) increased by only a factor of 1.08X, but with a 6.2X reduction in memory access.
II-B Gated Recurrent Unit
The update equations for a GRU layer of H neurons and D-dimensional input are as follows:

(1)
r_t = σ(W_xr x_t + W_hr h_{t-1} + b_r)
u_t = σ(W_xu x_t + W_hu h_{t-1} + b_u)
c_t = tanh(W_xc x_t + r_t ⊙ (W_hc h_{t-1}) + b_c)
h_t = (1 - u_t) ⊙ c_t + u_t ⊙ h_{t-1}

where r_t, u_t and c_t are the reset gate, the update gate and the cell state respectively. W_xr, W_xu, W_xc, W_hr, W_hu, W_hc are weight matrices and b_r, b_u, b_c are bias vectors. σ denotes the logistic sigmoid function and ⊙ denotes element-wise multiplication. Each GRU input is the vector x_t and its output is the vector h_t. Fig. 2A illustrates the update of the normal GRU reset gate as a flow diagram.
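A direct software model of the GRU update in Eq. 1 can be written as follows. This is a pure-Python sketch: the parameter-dictionary keys (`W_xr`, `b_r`, etc.) and helper names are ours, chosen to mirror the equation symbols.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def add(*vs):
    return [sum(t) for t in zip(*vs)]

def gru_step(x, h, p):
    """One GRU timestep following Eq. 1; p holds the six weight
    matrices and three bias vectors."""
    r = [sigmoid(v) for v in add(matvec(p['W_xr'], x), matvec(p['W_hr'], h), p['b_r'])]
    u = [sigmoid(v) for v in add(matvec(p['W_xu'], x), matvec(p['W_hu'], h), p['b_u'])]
    hc = matvec(p['W_hc'], h)  # reset gate scales the recurrent term
    c = [math.tanh(a + ri * b + bi)
         for a, ri, b, bi in zip(matvec(p['W_xc'], x), r, hc, p['b_c'])]
    return [(1 - ui) * ci + ui * hi for ui, ci, hi in zip(u, c, h)]

# sanity check: with all-zero weights and biases, r = u = 0.5 and c = 0,
# so the new state is 0.5 * h_{t-1}
zeros = {k: [[0.0]] for k in ('W_xr', 'W_hr', 'W_xu', 'W_hu', 'W_xc', 'W_hc')}
zeros.update(b_r=[0.0], b_u=[0.0], b_c=[0.0])
print(gru_step([0.0], [1.0], zeros))  # [0.5]
```

Note that the reset gate multiplies the recurrent term (W_hc h) before the tanh, matching the variant whose gate flow Fig. 2A depicts.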
II-C DeltaGRU
The delta network method applied to the GRU-RNN architecture gives what we call DeltaGRU. Assume an input vector sequence with sequence length T; we first declare the following variables:

(2)
x̂_{i,t} = x_{i,t} if |x_{i,t} - x̂_{i,t-1}| ≥ Θ_x, else x̂_{i,t-1}
ĥ_{i,t-1} = h_{i,t-1} if |h_{i,t-1} - ĥ_{i,t-2}| ≥ Θ_h, else ĥ_{i,t-2}
Δx_{i,t} = x_{i,t} - x̂_{i,t-1} if |x_{i,t} - x̂_{i,t-1}| ≥ Θ_x, else 0
Δh_{i,t-1} = h_{i,t-1} - ĥ_{i,t-2} if |h_{i,t-1} - ĥ_{i,t-2}| ≥ Θ_h, else 0

where x̂_{i,t} is the i-th element of the input state memory vector x̂_t at timestep t, ĥ_{i,t-1} is the i-th element of the hidden state memory vector ĥ_{t-1}, Δx_{i,t} is the i-th element of the delta input state vector Δx_t, and Δh_{i,t-1} is the i-th element of the delta hidden state vector Δh_{t-1}. Θ_x and Θ_h are respectively the delta thresholds of inputs and hidden states for each layer. In the initial timestep (t = 1), the state memories x̂_0 and ĥ_0 and the previous hidden state h_0 are all initialized to zeros.
The update equations for the DeltaGRU are now:

(3)
M_{r,t} = W_xr Δx_t + W_hr Δh_{t-1} + M_{r,t-1}
M_{u,t} = W_xu Δx_t + W_hu Δh_{t-1} + M_{u,t-1}
M_{xc,t} = W_xc Δx_t + M_{xc,t-1}
M_{hc,t} = W_hc Δh_{t-1} + M_{hc,t-1}
r_t = σ(M_{r,t})
u_t = σ(M_{u,t})
c_t = tanh(M_{xc,t} + r_t ⊙ M_{hc,t})
h_t = (1 - u_t) ⊙ c_t + u_t ⊙ h_{t-1}

where M_{r,t}, M_{u,t}, M_{xc,t} and M_{hc,t} are delta memory vectors, initialized to the corresponding biases (M_{r,0} = b_r, M_{u,0} = b_u, M_{xc,0} = b_c, M_{hc,0} = 0). σ and ⊙ indicate the sigmoid function and element-wise multiplication of vectors respectively.
Fig. 2B illustrates these operations for the DeltaGRU reset gate. The input vector x_t and the hidden state vector h_{t-1} are respectively replaced by the delta input state vector Δx_t and the delta hidden state vector Δh_{t-1}. Values of the previous state memory vectors x̂_{t-1}, ĥ_{t-2} are updated using Eq. 2 to generate the new state memory vectors x̂_t, ĥ_{t-1}. The previous delta memory vector M_{r,t-1} holds the previous step's partial sum-product, and the resulting new delta memory vector M_{r,t} is stored. Otherwise the operations are the same as for the original GRU reset gate shown in Fig. 2A. The other gates have similar flow diagrams. The state and delta memories are 1D vectors and easily fit into on-chip SRAM buffers.
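The delta encoding of Eq. 2 that feeds this flow can be modeled as below; a sketch with our own function names, performing the same element-by-element comparison that the hardware Delta Unit does:

```python
def delta_encode(x, x_mem, theta):
    """Eq. 2: threshold each element's change against its state memory.

    Returns the (sparse) delta vector and the updated state memory.
    Elements whose change is below theta produce a zero delta and
    keep their old memory value, so small drifts accumulate until
    they cross the threshold and are transmitted at full precision.
    """
    dx, new_mem = [], []
    for xi, mi in zip(x, x_mem):
        if abs(xi - mi) >= theta:
            dx.append(xi - mi)
            new_mem.append(xi)   # memory tracks the last *transmitted* value
        else:
            dx.append(0.0)
            new_mem.append(mi)
    return dx, new_mem

dx, mem = delta_encode([1.0, 0.52, -0.3], [1.0, 0.5, 0.0], theta=0.1)
print(dx)   # [0.0, 0.0, -0.3]
print(mem)  # [1.0, 0.5, -0.3]
```

Because the memory stores the last transmitted value rather than the last input, no error accumulates over time: once a change finally crosses the threshold, the full deviation is sent.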
II-D Temporal Sparsity
The temporal sparsity of an L-layer DeltaGRU network with an input sequence length of T is defined as the fraction of zeros in the Δx and Δh vectors, signified by Γ_Δx and Γ_Δh respectively. The effective temporal sparsity Γ_eff is the weighted average of Γ_Δx and Γ_Δh according to the number of network parameters they correspond to. The definition of temporal sparsity is given by Eq. 4:

(4)
Γ_Δx = (Σ_{t=1}^{T} Σ_{l=1}^{L} n_{Δx,l,t}) / (T Σ_{l=1}^{L} N_{x,l})
Γ_Δh = (Σ_{t=1}^{T} Σ_{l=1}^{L} n_{Δh,l,t}) / (T Σ_{l=1}^{L} N_{h,l})
Γ_eff = (P_x Γ_Δx + P_h Γ_Δh) / (P_x + P_h)

where n_{Δx,l,t} and n_{Δh,l,t} are the numbers of zero elements in the delta vectors Δx and Δh respectively in layer l at timestep t, N_{x,l} and N_{h,l} are the input and hidden dimensions of layer l, and P_x and P_h are the numbers of parameters in the input and hidden weight matrices. Because operations on biases are negligible, they are ignored in Eq. 4.
By skipping zero elements in delta vectors, whole columns of matrix-vector MAC operations can be skipped. If the delta network is properly trained (by including the delta operation during training), [12, 7, 26] showed that the number of operations can be reduced by 5X to 100X with negligible loss of accuracy, depending on the temporal evolution of the states of the input and hidden units.
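The sparsity measure of Eq. 4 can be computed directly from recorded delta vectors. The sketch below is a simplified single-quantity version (our own naming) that ignores the per-layer dimensions and parameter-count weighting of Γ_eff:

```python
def temporal_sparsity(delta_seq):
    """Fraction of zero elements over a sequence of delta vectors."""
    zeros = sum(v == 0 for vec in delta_seq for v in vec)
    total = sum(len(vec) for vec in delta_seq)
    return zeros / total

# two timesteps of a 3-element delta vector: 4 zeros out of 6 elements
print(temporal_sparsity([[0, 0.2, 0], [0, 0, 0.1]]))  # 0.666...
```

In the full definition, Γ_Δx and Γ_Δh are computed separately and then combined, weighted by how many weight-matrix parameters each delta vector multiplies.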
II-E Datasets
Two datasets are used in this paper: the TIDIGITS [27] dataset for the classification task and the SensorsGas dataset for the regression task. The TIDIGITS speech dataset has more than 25k digit sequences spoken by over 300 men, women, and children. The entire training and test sets are used in our experiments. The SensorsGas dataset consists of recordings of metal-oxide sensors in response to various concentrations of carbon monoxide gas over 14 days [28, 29]. This dataset was used in [30] to evaluate the performance of a gated RNN in predicting the concentration of carbon monoxide. The dataset used here is the 70/30 split variant: 70% of the sequences are randomly selected to form the training set, while the remaining sequences form the test set.
III EdgeDRNN Accelerator
III-A Overview
Due to limited weight reuse, it is difficult to compute RNNs efficiently for real-time applications, which usually work best with a batch size of 1. Therefore, the big challenges of RNN inference on the edge are the scarce off-chip memory bandwidth available on portable platforms and the limited amount of on-chip block RAM on small FPGAs. EdgeDRNN uses cheap off-chip DRAM for weight storage and reduces memory bandwidth by exploiting temporal sparsity in RNN updates.
Fig. 3 shows the architecture of the EdgeDRNN accelerator. The main modules are: the Delta Unit, for encoding delta vectors and generating weight column pointers (pcol); the Processing Element (PE) Array, for matrix-sparse-vector multiplications; and the control module (CTRL), which contains finite state machines (FSMs) and encodes instructions to control the AXI Datamover. Other modules include the configuration module (CFG), composed of configuration registers; the output buffer (OBUF), for buffering and redirecting outputs back to the Delta Unit; and the WFIFO, for buffering weights.
III-B Delta Unit & CTRL
The Delta Unit stores state memory for delta state vector encoding in a block random access memory (BRAM; the standard SRAM memory block on FPGAs — on Xilinx Zynq FPGAs a single BRAM block has 18-bit words and a capacity of 18 kb). The FSM addresses the BRAM according to the valid signal of the input state vectors x_t and h_{t-1}, one of which is selected for processing at a time depending on the FSM state. The Delta Unit encodes one element of a delta state vector in each clock cycle after the valid signal is asserted, until the whole vector is processed.

The vector sizes are provided by the config signal from the CFG module. Delta state vector elements whose change is greater than or equal to the threshold Θ_x or Θ_h, together with their corresponding physical weight column address pointers (pcol), are respectively dispatched to the DFIFO and CTRL. The corresponding state element is written into the BRAM to update the state memory. Otherwise, elements are discarded without being written into the DFIFO. Using only one Delta Unit, the latency in clock cycles to process a vector is exactly the length of that vector. It is possible to reduce the latency by searching for nonzero elements in subsections of a vector simultaneously, realized by using K Delta Unit blocks in parallel to fill at most one nonzero value into the DFIFO on every clock cycle. Assuming that nonzero elements are uniformly distributed in a delta state vector and that K Delta Unit blocks run in parallel, the latency in clock cycles to process a whole vector is

(5)
T_ΔU = max(N/K, (1 - Γ) N)

where N is the length of the vector, N/K the length of each subsection of the vector (the lookahead window size of the Delta Unit), and Γ the temporal sparsity defined in Eq. 4.

Although T_ΔU can be hidden under T_MxV, the latency of computing the MxV, it becomes a bottleneck of total latency when T_ΔU > T_MxV, which could happen when an accelerator uses a large number of MAC units to compute small networks. However, in this work we aim to run large network inference with a small number of MAC units for edge applications, making T_MxV ≫ T_ΔU; thus, K = 1 is used in EdgeDRNN. The MAC utilization results shown in Section IV-D confirm that this choice did not create a latency bottleneck.
The CTRL module contains the FSMs that control the PE array. This module generates 80-bit instructions for the Xilinx AXI Datamover IP [31] to fetch RNN parameters. Each instruction contains pcol and the burst length calculated from the dimensions of the network stored in the configuration registers.
III-C Processing Element Array
Two-dimensional arithmetic unit arrays such as systolic arrays are difficult to utilize fully in portable edge devices due to scarce on-chip memory resources, the low external memory bandwidth of the system, and the limited weight reuse of RNNs. In order to fully utilize every PE, a vector PE array is used in EdgeDRNN. Fig. 5 shows the internal structure of a PE.
The PE has a 16-bit multiplier (MUL) and two adders, the 32-bit ADD0 and the 16-bit ADD1. Multiplexers are placed before the operands of MUL so that the PE can be reused for both MxV and vector dot products. The multiplexer below ADD0 selects between '0' and the BRAM data; '0' is chosen when an initialization of the BRAM is needed, as shown in Fig. 5. ADD1 is responsible for element-wise vector additions. All units are parameterized in the SystemVerilog RTL and configurable at compile time to support any fixed-point precision within their designed bit width. The PE supports tanh and sigmoid functions by using lookup tables (LUTs). The input bit width of the LUTs is fixed to 16 bits, while the output bit width can be set anywhere between 5 (Q1.4) and 9 (Q1.8) bits.
Fig. 6 shows the sparse MxV flow. The weight matrices of the GRU-RNN are concatenated following the arrangement shown on the right half of the figure. Biases are appended to the concatenated weight matrix as the first column, and a constant 1 is appended to each input state vector as the first element. The PE array multiplies only nonzero delta state elements with the corresponding valid columns. Products are accumulated in the Accumulation Memory (ACC Mem) to compute the delta memory vectors M_r, M_u, M_xc, M_hc. Products involving W_xr, W_hr, b_r are accumulated into M_r; W_xu, W_hu, b_u into M_u; W_xc, b_c into M_xc; and W_hc into M_hc. According to the delta update scheme defined by Eq. 2, the appended 1 in the delta state vector becomes 0 after the initial timestep, which means that the biases b_r, b_u, b_c are accumulated into the ACC Mem only once and are skipped by the Delta Unit after the initial timestep.
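The column-skipping MxV above can be modeled as follows. This sketch (names `delta_matvec`, `acc` are ours) stores the matrix column-wise, mirroring how EdgeDRNN streams weight columns from DRAM, and the accumulator plays the role of the ACC Mem:

```python
def delta_matvec(W_cols, dx, acc):
    """Accumulate W @ dx into acc, fetching only columns with nonzero dx.

    W_cols: weight matrix stored column-major (one list per column).
    acc: running delta-memory accumulator, updated in place and returned.
    """
    for j, d in enumerate(dx):
        if d == 0:
            continue           # entire weight column j skipped: no fetch, no MACs
        for i, w in enumerate(W_cols[j]):
            acc[i] += w * d
    return acc

cols = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # a 2x3 matrix, column-major
print(delta_matvec(cols, [0.0, 2.0, 0.0], [0.0, 0.0]))  # [6.0, 8.0]
```

Only one of the three columns is touched here, so both the DRAM weight fetches and the MACs scale with the number of nonzero delta elements, not with the full input dimension.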
The calculation of the activation after the MxV is also done by the PE array; the stages of this process are shown in Fig. 7. The PE array fetches the delta memory vectors from the ACC Mem to calculate h_t in 8 pipeline stages. Paths without any operator in a stage are buffered for 1 clock cycle using flip-flops. During execution of the activation generation, stages S0-S2 are executed simultaneously with S5-S7 to reuse the arithmetic units through time-division multiplexing.
Finally, assuming that the DRAM interface can deliver b_DRAM bits per RNN clock cycle for weight fetch, the optimum number of PEs in the array, N_PE, is determined by the weight precision bit width b_W. The definitions of N_PE and the corresponding theoretical peak throughput θ_peak are:

(6)
N_PE = b_DRAM / b_W
θ_peak = 2 · N_PE · f_clk

where f_clk is the clock frequency of the programmable logic. For example, the FPGA used in this paper has a 64-bit DRAM interface, so with 16-bit weights, N_PE = 4 is optimal.
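Eq. 6 can be evaluated directly. The sketch below (our own helper name) reproduces the trade-off for a 64-bit bus: halving the weight precision doubles the number of PEs the bandwidth can feed, and each MAC counts as 2 operations:

```python
def peak_throughput(bus_bits, weight_bits, f_clk_hz):
    """Eq. 6: PE count matched to DRAM bandwidth, and peak GOp/s."""
    n_pe = bus_bits // weight_bits        # one weight per PE per cycle
    gops = 2 * n_pe * f_clk_hz / 1e9      # 2 ops (mul + add) per MAC
    return n_pe, gops

print(peak_throughput(64, 8, 125e6))   # (8, 2.0)  <- 8-bit-weight configuration
print(peak_throughput(64, 16, 125e6))  # (4, 1.0)  <- 16-bit-weight example
```

With the 8-bit weight configuration used later in the paper, this gives 8 PEs and a 2 GOp/s theoretical peak at 125 MHz.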
III-D Implementation on MiniZed
Fig. 8 shows the latest implementation of EdgeDRNN on the $89 MiniZed development board [32], which has a Zynq-7007S SoC. EdgeDRNN is implemented in the programmable logic (PL). The SoC also has a programmable CPU in a module called the Processing System (PS). Data is transferred between PS and PL through high-performance (HP) slave ports, while control signals generated by the PS are transferred through general-purpose (GP) master ports. The hard IP block AXI Datamover is controlled by the PS to fetch weights from DDR3L memory to the PL. Another hard IP block, AXI DMA, is controlled by the PS to transfer inputs and outputs of the accelerator. Compared to our previous work [13], we reduced resource utilization by replacing the AXI SmartConnect IP with the AXI Interconnect IP while preserving the same throughput and latency. To further reduce on-chip power, we used the power optimization strategy during implementation in Xilinx Vivado and lowered the ARM CPU clock frequency from 667 MHz to 400 MHz for the bare-metal version.
The peak DRAM read memory bandwidth is 1 GB/s at the 125 MHz clock frequency (64 bits × 125 MHz / 8 bits per byte). EdgeDRNN can be configured to support 1, 2, 4, 8, or 16-bit fixed-point weights and 16-bit fixed-point activations. In this paper, EdgeDRNN is configured for 16-bit activations and 8-bit weights. To fully exploit this HP port bandwidth, we implement 8 PEs following Eq. 6. Adding more PEs would only leave them idle, since weight fetches are limited by the DRAM bandwidth.
The AXI-Lite general-purpose (GP) master port is used by the single-core ARM Cortex-A9 CPU to control the AXI DMA and to write the configuration to the accelerator. Configurations include the physical start address of the concatenated weights, the delta thresholds, and the network dimensions.
|           | LUT   | LUTRAM | FF    | BRAM | DSP   |
|-----------|-------|--------|-------|------|-------|
| Available | 14400 | 6000   | 28800 | 50   | 66    |
| EdgeDRNN  | 30.8% | 0.4%   | 9.3%  | 32%  | 13.6% |
| Total     | 65.2% | 4.4%   | 34.1% | 66%  | 13.6% |
The PL is driven by a single 125 MHz clock domain generated by the PS. Table I shows the resource utilization solely for EdgeDRNN (with 5-bit (Q1.4) LUTs) and for the whole PL after synthesis and implementation. BRAMs are used to store the previous state memory in the Delta Unit, the accumulation memory in the PEs, and the FIFOs. 8 DSPs are used for the MAC units in the 8 PEs, while the remaining DSP in CTRL produces weight column addresses. The most consumed resource is LUTs; this entry-level XC7Z007S FPGA has only 14.4k of them. By comparison, the top-level XC7Z100 has 19X more LUTs and 11X more BRAM.
III-E PetaLinux OS Integration
Xilinx's Zynq chips are hosted on heterogeneous embedded platforms with a variety of peripherals and communication interfaces. To work with this type of system there are two workflows: bare-metal and embedded OS.
The bare-metal workflow is similar to that of conventional microcontrollers. Bare-metal provides a set of libraries that establish a very thin software layer over the hardware resources available in the system, which helps somewhat during development of the software that will be deployed on the system; however, detailed knowledge of the hardware is still necessary to ensure correct functionality. The resulting software runs on the PS processor with all of its computing power, since it is the only software running on the core. Bare-metal allows a more dedicated use of the system resources to achieve high-performance execution, but it offers little flexibility and versatility.
The second option is to use an embedded Linux OS provided by Xilinx called PetaLinux. This OS establishes several software layers over the system hardware that simplify its use and the development of applications that use the system's peripherals, such as USB, Bluetooth, and WiFi. Linux is a preemptive multitasking operating system that can make application development much faster. Since running Linux slightly slows down inference (Sec. IV), users can decide to pay the throughput price of Linux in exchange for faster development time and easier maintenance. For EdgeDRNN, we implemented both systems to meet our various application requirements.
IV Experimental Results
We previously developed two EdgeDRNN system-level demonstrations: continuous spoken digit recognition [26] and real-time control of a powered leg prosthesis [7]. Here we report the results of new experiments measuring accuracy, throughput, and power efficiency on the spoken digit task and on a new regression task of gas concentration estimation. We also report measurements of the embedded Linux implementation of EdgeDRNN.
IV-A Experimental Setup: Training
We evaluate the accuracy of DeltaGRU, and the hardware performance of EdgeDRNN running DeltaGRU networks, on both a classification task using the TIDIGITS [27] dataset and a regression task using the SensorsGas [30] dataset.
IV-A1 Classification
For the classification task, we trained 6 different DeltaGRU network sizes and compared their WER on the TIDIGITS audio digit dataset, evaluated using the greedy decoder. Inputs to the networks consist of 40-dimensional log filter bank features extracted from audio sampled at 20 kHz using a frame size of 25 ms and a frame stride of 10 ms. We use the Connectionist Temporal Classification (CTC) loss [33] to handle variable input sequence lengths. The DeltaGRU networks were trained for 50 epochs using a learning rate of 3e-4 and a batch size of 32. Following a procedure similar to [26], a continuous spoken digit recognition demo was built using EdgeDRNN to prove the system functionality (https://www.youtube.com/watch?v=XyNjh5yiMI).

IV-A2 Regression
For the SensorsGas regression task, the input dimension of the network is 14, corresponding to data from the 14 sensors. We adopt a 2-step pretrain and retrain scheme developed for [7]: 1) We pretrain a cuDNN GRU model on the training set for 100 epochs with a learning rate of 5e-4 and a batch size of 64. 2) We load these parameters into a DeltaGRU network of the same size as the cuDNN GRU and retrain for another 10 epochs with a learning rate of 3e-3 and a batch size of 256. In this step we optimize the deltas for the visible and hidden units. Because the cuDNN GRU model is highly optimized for NVIDIA GPUs, the pretrain step helps to train the network to high accuracy in 5X less time.
All networks are trained using the Adam optimizer and quantization-aware training, with a quantization scheme similar to [19]. To improve accuracy, we use nonlinear functions with the same input and output bit precision as the LUTs in the forward phase of training. In the backward phase, the gradient of the nonlinear function is calculated using the original nonlinear function in FP32 precision. Training was done with PyTorch 1.2.0 on NVIDIA GPUs running CUDA 10 and cuDNN 7.6.
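The quantized-forward, full-precision-backward scheme can be sketched for tanh as below. Function names are ours; in a real training setup this pair would be wrapped in a custom autograd function rather than called directly:

```python
import math

def lut_tanh(x, out_frac=4):
    """Forward pass: tanh with LUT-like output quantization (Q1.4 here,
    matching the 5-bit LUT configuration used in this paper)."""
    scale = 1 << out_frac
    return round(math.tanh(x) * scale) / scale

def tanh_grad(x):
    """Backward pass: gradient of the *original* FP32 tanh, so the
    non-differentiable rounding step is bypassed (straight-through)."""
    return 1.0 - math.tanh(x) ** 2

print(lut_tanh(0.3))  # 0.3125 (nearest multiple of 1/16 to tanh(0.3))
```

This way the network sees exactly the LUT outputs the hardware will produce, while gradients still flow smoothly during BPTT.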
IV-B Experimental Setup: Network Implementation
After the quantized DeltaGRU is trained for a particular task, a Python script converts the PyTorch network modules into C/C++ hardware network header files. These files contain the network parameters and configuration register values for EdgeDRNN. By including the header files, bare-metal or PetaLinux applications are compiled using the standard cross compiler. The resulting system image is transferred to the QSPI flash (bare-metal) or eMMC storage (PetaLinux) on the MiniZed. In each timestep of the RNN, a feature vector is transferred from the PS to the accelerator using the AXI DMA. For measuring the performance of the accelerator, features are calculated offline on a computer and stored in a header file. When using the accelerator in real-world applications, features such as log filter bank and spike count features for audio are calculated online by the ARM core in the PS. A flag connected to a PS hardware register is raised at the end of each timestep.
IV-C Accuracy & Throughput
IV-C1 Classification
Fig. 9 shows the EdgeDRNN throughput and WER on the TIDIGITS test set versus the delta threshold Θ used in training and testing of a 2L-768H-DeltaGRU network. Θ is the same for both Θ_x and Θ_h. With N_PE = 8 PEs and a PL frequency of f_clk = 125 MHz, EdgeDRNN has a theoretical peak throughput of 2 GOp/s. At Θ = 0, there is still a speedup of about 2X from the natural sparsity of the delta vectors. Higher Θ leads to better effective throughput, but with gradually increasing WER. The optimal point is at Θ = 0.25, just before a dramatic increase of WER, where EdgeDRNN achieves an effective throughput of around 20.2 GOp/s with 1.3% WER. WER and throughput of the smaller tested networks are shown in Table II. The 5-bit (Q1.4) LUT was used for this task and did not lead to accuracy loss compared to the network running on a CPU with FP32 nonlinear functions.
Table II: Bare-metal benchmark results.

| Network | # Params | WER | ΔWER | Min Lat. (µs) | Max Lat. (µs) | Mean Lat. (µs) | Est. Lat. (µs) | Mean Thr. (GOp/s) | Est. Thr. (GOp/s) | MAC Util. | Γ_Δx | Γ_Δh |
|---------|----------|-------|--------|------|--------|-------|-------|------|------|-------|-------|-------|
| 1L-256H | 0.5 M    | 1.83% | +1.36% | 16.7 | 142.2  | 46.2  | 43.3  | 9.9  | 10.5 | 495%  | 25.6% | 90.0% |
| 2L-256H | 1.2 M    | 1.13% | +0.69% | 29.1 | 258.9  | 90.7  | 91.6  | 13.7 | 13.6 | 685%  | 78.9% | 89.1% |
| 1L-512H | 1.7 M    | 1.04% | +0.44% | 40.6 | 331.2  | 130.6 | 129.8 | 13.0 | 13.1 | 650%  | 25.6% | 89.5% |
| 2L-512H | 4.9 M    | 0.89% | +0.75% | 57.3 | 656.8  | 252.6 | 262.9 | 19.2 | 18.4 | 960%  | 85.5% | 91.2% |
| 1L-768H | 3.7 M    | 1.27% | +0.11% | 64.1 | 616.7  | 224.3 | 224.8 | 16.6 | 16.6 | 830%  | 25.6% | 91.3% |
| 2L-768H | 10.8 M   | 0.77% | +0.53% | 96.5 | 1344.5 | 535.6 | 541.6 | 20.2 | 19.9 | 1010% | 87.0% | 91.6% |
Table III: PetaLinux benchmark results.

| Network | # Params | Min Lat. (µs) | Max Lat. (µs) | Mean Lat. (µs) | Mean Thr. (GOp/s) | MAC Util. |
|---------|----------|------|--------|-------|------|------|
| 1L-256H | 0.5 M    | 17.0 | 311.0  | 48.2  | 9.5  | 475% |
| 2L-256H | 1.2 M    | 30.0 | 461.0  | 93.1  | 13.4 | 670% |
| 1L-512H | 1.7 M    | 42.0 | 603.0  | 133.6 | 12.7 | 635% |
| 2L-512H | 4.9 M    | 59.0 | 923.0  | 257.5 | 18.8 | 940% |
| 1L-768H | 3.7 M    | 66.0 | 627.0  | 228.5 | 16.3 | 815% |
| 2L-768H | 10.8 M   | 99.0 | 1366.0 | 544.9 | 19.8 | 990% |
IV-C2 Regression
In this regression task, we evaluate the impact of using different delta thresholds Θ_x and Θ_h on the accuracy of a 2L-256H-DeltaGRU model evaluated on the SensorsGas test set. Fig. 10 and Fig. 11 show respectively the regression accuracy and the temporal sparsity versus Θ_x and Θ_h. The pretrained 2L-256H-GRU network, without using a delta threshold, achieves a root-mean-square error (RMSE) of 0.995 and a coefficient of determination (R²) of 0.976. This accuracy was achieved using 5-bit (Q1.4) LUTs, which gave the lowest RMSE of all tested LUT bit precision values.
Similar to the results for the classification task, Fig. 10 shows that the accuracy degrades when larger delta thresholds are used. Fig. 11 shows that the sparsity levels of Δx and Δh are heavily influenced by their corresponding delta thresholds. The accuracy degrades faster with increasing Θ_x for a fixed Θ_h than with increasing Θ_h for a fixed Θ_x. Θ_x has a minor impact on Γ_Δh and vice versa. The results from this regression task indicate that propagating changes in the input states often is more important than propagating changes in the hidden states. By exploiting this phenomenon, we obtain an optimal point where the RMSE and R² are 1.078 and 0.972 respectively. With Θ_x and Θ_h at this optimal point, the latency of the optimal model is 206 µs. In comparison, a Jetson TX2 runs a 4.8X smaller 1L-200H-GRU network in 271 µs [30].
IV-D Theoretical & Measured Performance
Eq. 7 gives the estimated mean effective throughput of EdgeDRNN running a DeltaGRU layer:

    v_eff = N_op / (T_MxV + T_act),  with  T_MxV = (1 − Γ_eff) · N_op / v_peak    (7)

where N_op is the number of operations in a DeltaGRU layer per timestep, T_MxV is the estimated latency of the sparse MxV, T_act is the latency of the remaining operations that produce the activations, and the other variables (Γ_eff, v_peak) are defined as in Eqs. 4 and 6.
Table II compares the Eq. 7 predictions with benchmark results for different DeltaGRU network sizes running on EdgeDRNN. The estimated results calculated from Eq. 7 are close to the measured results; the maximum relative error between them is smaller than 7.1%. Thus Eq. 7 can be used to estimate EdgeDRNN performance for a particular RNN network size. EdgeDRNN runs all tested networks with a mean latency under 0.54 ms, corresponding to 20.2 GOp/s effective throughput for the 2L-768H-DeltaGRU.
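The Eq. 7 estimate is easy to evaluate numerically. The sketch below uses illustrative inputs under our reading of the equation (a 10.8 MOp timestep at 90% effective sparsity on a 2 GOp/s-peak design, with an assumed 5 µs activation latency); the function and variable names are ours.

```python
def est_mean_throughput_gops(n_op, gamma_eff, v_peak_gops, t_act_us):
    """Eq. 7 sketch: the sparse MxV latency is the surviving fraction of
    ops divided by peak throughput; total ops over total latency gives
    the effective throughput. Latencies in microseconds, result in GOp/s."""
    t_mxv_us = (1.0 - gamma_eff) * n_op / (v_peak_gops * 1e3)  # us
    return n_op / ((t_mxv_us + t_act_us) * 1e3)

# Illustrative numbers, not measured figures from the paper:
gops = est_mean_throughput_gops(10.8e6, 0.90, 2.0, 5.0)
```

With T_act set to zero, the estimate collapses to v_peak / (1 − Γ_eff), i.e. sparsity directly multiplies the peak throughput.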
IV-E Performance in PetaLinux
For performance measurements on the PetaLinux-based system, we implemented an application that performs the same operations as the bare-metal software but uses the AXI DMA driver included in the OS.
Table III shows the latency and performance results for the 6 networks used in this work. The minimum and mean latency numbers of the PetaLinux version are respectively up to 3.4% and 11.3% higher than those of the bare-metal version. Because the minimum PetaLinux latency is nearly the same as the bare-metal latency, the large difference in maximum latency between the two versions is attributed to contention from other tasks running on the PS, which lock the single PS DDR controller. EdgeDRNN fetches weights through HP ports (Fig. 8) that are routed through the PS DDR controller. (The FPGA's ACP interface should not be used to access DRAM under PetaLinux because it is connected directly to the L2 cache of the ARM core on which the OS runs; this configuration creates conflicts and seriously compromises system performance.) Under PetaLinux, the HP interface should be used to connect any module on the PL that requires direct access to DRAM.
To understand the impact of CPU load and CPU DRAM access on the RNN inference time, we wrote a small program that loops over a memory array and is designed to trigger L2 cache misses. We used two different memory array sizes, since the larger array causes more L2 cache misses. Table IV shows that the impact on RNN latency is modest: the small network takes about 50% longer to run with either memory array size, while the large RNN is slowed down by only about 10%.
Configuration | Mean Latency (µs)
1L-256H-TH64 (small network, few parameters):
  EdgeDRNN only | 48
  EdgeDRNN + small memory workload | 72
  EdgeDRNN + large memory workload | 82
2L-768H-TH64 (large network, many parameters):
  EdgeDRNN only | 545
  EdgeDRNN + small memory workload | 613
  EdgeDRNN + large memory workload | 595
The RNN inference time varies between 50 µs and 0.5 ms across the different network sizes. During this inference time, the PS is free for other tasks (e.g. computing features) and only needs to check whether the RNN update has finished once these tasks are completed.
IV-F Power Measurement
Part | Wall Power (mW) | Percentage
PS: DDR3L | 534 | 23.3%
PS: PLLs | 388 | 16.9%
PS: ARM Cortex-A9 | 166 | 7.2%
PS: Peripherals | 21 | 0.9%
Regulator/etc. | 942 | 41.1%
Static | 119 | 5.2%
EdgeDRNN | 66 | 2.9%
DMA/Interconnect | 54 | 2.4%
Total | 2290 | 100%
Table V shows the power breakdown of the MiniZed system. The total power is measured by a USB power meter; the PS, PL and static power are estimated by the Xilinx Power Analyzer. The whole system burns at most 2.3 W, of which EdgeDRNN consumes only 66 mW. It is interesting to note that the DRAM power is about 8X that of the RNN logic. This result clearly shows that the RNN computation is memory dominated.
V Comparison
V-A Comparison with FPGA RNN Accelerators
Platform | This Work | BBS [10] | DeltaRNN [11] | ESE [9] | DeepRnn [14]
FPGA | XC7Z007S | Arria 10 GX1150 | XC7Z100 | XCKU060 | XC7Z045
Dev. Kit Cost | $89 | $4,495 | $2,295 | $3,295 | $2,495
Weight Storage | Off-chip | On-chip & off-chip | On-chip | Off-chip | On-chip & off-chip
Bit Precision (Act./Weight/Index) | INT 16/8/0 | INT 16/16/4 | INT 16/16/0 | INT 16/12/4 | INT 16/16/0
Sparsity Type | Temporal | Weight | Temporal | Weight | -
¹Effective Sparsity | 90.0% | 87.5% | 88.2% | 88.7% | -
Frequency (MHz) | 125 | 200 | 125 | 200 | 142
DRAM Interface Width for Weights (bit) | 64 | - | - | 512 | -
Number of MACs (Batch-1) | 8 | 4096 | 768 | 32 | 256
²Peak Throughput (GOp/s) | 2 | 1638.4 | 192 | 12.8 | 4.5
Batch-1 Eff. Throughput (GOp/s) | 20.2 | 2432.8 | 1198.3 | 78.8 | 0.7
³MAC Utilization | 1010% | 150% | 630% | 620% | 15%
⁴Memory-Bounded Peak Throughput (GOp/s) | 2.0 | 1.3 | 2.0 | 1.3 | 2.0
⁵Normalized Batch-1 Throughput (GOp/s) | 20.2 | 10.7 | 17.0 | 11.5 | 2
Wall Plug Power (W) | 2.3 | 19.1 | 7.3 | 41.0+PC | -
Batch-1 Power Efficiency (GOp/s/W) | 8.8 | 127.4 | 164.2 | 1.9 | -
⁵Normalized Power Efficiency (GOp/s/W) | 8.8 | 4.7 | 7.4 | 5.0 | -

¹ The effective sparsity of EdgeDRNN & DeltaRNN is calculated by Eq. 4.
² Peak throughput is calculated by Eq. 6.
³ MAC utilization is the ratio between batch-1 throughput and the peak throughput of the accelerator.
⁴ Memory-bounded peak throughput is calculated by Eq. 8.
⁵ Normalized to the same frequency, DRAM interface bit width for weight fetch, number of MACs and activation & weight bit precision as EdgeDRNN. We assume the normalized designs are implemented on the same MiniZed board and have the same wall-plug power consumption as EdgeDRNN. A detailed discussion is in Section V-A.
Table VI compares EdgeDRNN with other state-of-the-art FPGA RNN accelerators. Both BBS [10] and DeltaRNN were optimized for batch-1 inference by using all MACs for a single input sample. BBS can use DRAM to support large networks and has the highest batch-1 throughput among all accelerators; however, the reported throughput was obtained by buffering the whole network in expensive on-chip memory. After compression, the network has around 0.8 MB of parameters, which can be buffered on large FPGAs like the GX1150 used by BBS, but this is still too expensive for edge hardware platforms (e.g. the MiniZed has only 0.2 MB of on-chip memory). ESE [9] reuses weights fetched from off-chip memory to feed 1024 MACs for batched inference and achieved 2520 GOp/s total throughput; however, only 32 of the 1024 MACs were used per input sample, limiting its batch-1 throughput. Except for EdgeDRNN and DeepRnn [14], these platforms are not designed for edge applications. BBS, DeltaRNN and ESE provide much higher throughput, but their power consumption is around 3X-18X larger than EdgeDRNN and they require expensive FPGA development systems that are not very portable. By contrast, the small number of processing elements in EdgeDRNN is intentionally chosen to match the available memory bandwidth of the DRAM interface, since there is no point in having idle PEs.
To fairly compare architectures without the influence of different FPGA platform specifications, it makes sense to normalize the batch-1 throughput and other corresponding numbers of the accelerators to the same number of PEs (8)³, clock frequency (125 MHz), DRAM interface bit width for weight fetch (64-bit) and bit precision of weights (INT8) & activations (INT16) as used by EdgeDRNN. (³Each PE has a single MAC unit.) We also assume that the normalized platforms are implemented on the MiniZed and have the same power consumption as EdgeDRNN. The normalized batch-1 throughput is defined below:
    v_norm = v_mem / (1 − Γ_eff),  with  v_mem = 2 · f · b_DRAM / (b_W + b_idx)    (8)

where v_mem is the memory-bounded peak throughput and b_idx is the bit width of the nonzero-element index. To exploit weight sparsity by skipping zero elements in the weights, indices of the nonzero weight elements have to be stored, which introduces off-chip memory overhead. Both BBS and ESE use b_idx = 4 for their tested networks. EdgeDRNN and DeltaRNN only need the indices of valid columns corresponding to nonzero delta state vector elements, which are calculated on-chip without introducing off-chip memory overhead; thus b_idx = 0 for EdgeDRNN and DeltaRNN. In this normalization process, we assume the ideal case, in which the normalized platforms reach the memory-bounded peak throughput and can fully utilize sparsity. Thus, Eq. 8 gives an upper bound on the throughput of the normalized platform.
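The normalization of Eq. 8 can be checked numerically. The sketch below (function and variable names are ours) evaluates it with EdgeDRNN's specifications: a 64-bit DRAM interface at 125 MHz streaming INT8 weights with no index bits, each fetched weight feeding one MAC (2 Op), scaled by 90% effective sparsity.

```python
def mem_bound_peak_gops(f_mhz, dram_bits, w_bits, idx_bits):
    """Memory-bounded peak: weights stream once from DRAM; index bits
    for nonzero-weight bookkeeping eat into the weight bandwidth."""
    weights_per_cycle = dram_bits / (w_bits + idx_bits)
    return 2.0 * weights_per_cycle * f_mhz / 1e3     # MAC = 2 Op; GOp/s

def normalized_batch1_gops(f_mhz, dram_bits, w_bits, idx_bits, gamma_eff):
    """Eq. 8 sketch: skipping a fraction gamma_eff of the work scales
    the memory-bounded peak by 1/(1 - gamma_eff)."""
    return mem_bound_peak_gops(f_mhz, dram_bits, w_bits, idx_bits) / (1.0 - gamma_eff)

peak = mem_bound_peak_gops(125, 64, 8, 0)            # EdgeDRNN settings
norm = normalized_batch1_gops(125, 64, 8, 0, 0.90)   # 90% effective sparsity
```

This reproduces the 2.0 GOp/s memory-bounded peak and the roughly 20 GOp/s normalized batch-1 throughput of Table VI; with b_idx = 4, the same 64-bit interface carries fewer weights per cycle, which is why the index-based designs normalize lower.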
Table VI shows that EdgeDRNN achieves the highest normalized throughput, even higher than our previous BRAM-based DeltaRNN, because of its improved pipeline and the higher sparsity achieved. Compared with BBS, EdgeDRNN achieves only a small fraction of the total batch-1 throughput, but the normalization makes it clear that BBS achieves its high throughput by using on-chip BRAM, a huge number of MACs, and a higher clock frequency. Among all the accelerators, EdgeDRNN also shows the highest effective MAC utilization and the lowest wall-plug power. Finally, the EdgeDRNN FPGA development kit is at least 25X cheaper than the other FPGA RNN platforms, and its cost is comparable to the cheapest edge AI accelerators.
V-B Architectural Comparison
Fig. 12 compares the architecture and MxV dataflow of EdgeDRNN with those of BBS and ESE. We compare against ESE and BBS because they are also FPGA RNN accelerators that use DRAM and report high throughput. Both ESE and BBS exploit weight sparsity with load-balancing techniques.
V-B1 ESE
In ESE, interleaved rows of matrix elements are assigned to the MAC units in the PE array and the MxV is computed column by column. To balance the workload and better exploit weight sparsity, the weight matrix is pruned so that the number of nonzero elements assigned to each MAC unit is the same across the whole matrix. One activation buffer (BUF) is required per MAC unit; each MAC executes an operation as soon as a nonzero activation and its weight are available.
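The row-interleaved, column-by-column schedule can be captured in a small functional model (our sketch, not ESE's pipelined hardware): row r of the matrix is owned by MAC (r mod n_mac), each MAC holds the partial sums of its own rows, and columns with zero activations issue no work.

```python
import numpy as np

def interleaved_row_mxv(W, x, n_mac=4):
    """Functional model of row-interleaved MxV computed column by
    column, skipping zero activations."""
    n_rows = W.shape[0]
    acc = np.zeros(n_rows)                 # partial sum per row, held in MACs
    for col in range(W.shape[1]):          # column-by-column schedule
        if x[col] == 0.0:
            continue                       # zero activation: no work issued
        for r in range(n_rows):
            # MAC (r % n_mac) performs this multiply-accumulate
            acc[r] += W[r, col] * x[col]
    return acc

W = np.arange(12.0).reshape(4, 3)
x = np.array([1.0, 0.0, 2.0])
y = interleaved_row_mxv(W, x)
```

The balanced pruning described above guarantees that each MAC's rows contain the same number of nonzeros, so no MAC idles while the others drain their work.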
V-B2 BBS
BBS balances the workload using structured pruning. The rows of a weight matrix are split into banks of equal length, and the pruning method forces the number of nonzero values to be the same across all banks. By assigning the same number of row banks to each MAC unit in the PE array, the workload is balanced. As shown on the right side of Fig. 12, each row of the matrix is equally divided into two banks, assigned respectively to MAC 0 and MAC 1, and the computation is done row by row. In this case each MAC receives different activation elements, so one BUF is required per MAC unit. BBS also supports buffering a partial weight matrix on-chip to enhance throughput, which is useful for large FPGA platforms. The reported batch-1 throughput of BBS in Table VI is obtained with all network parameters on-chip, which is not practical on a small FPGA platform like the MiniZed with only 0.2 MB of on-chip memory.
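Bank-balanced pruning itself is simple to sketch (our illustration of the scheme, not BBS's exact implementation): split each row into equal-length banks and keep only the top-k largest-magnitude weights in every bank, so all banks, and hence all MACs, carry identical workloads.

```python
import numpy as np

def bank_balanced_prune(W, banks_per_row=2, keep_per_bank=2):
    """Keep the keep_per_bank largest-magnitude weights in each
    equal-length bank of every row; zero the rest."""
    Wp = np.zeros_like(W)
    bank_len = W.shape[1] // banks_per_row
    for r in range(W.shape[0]):
        for b in range(banks_per_row):
            s = slice(b * bank_len, (b + 1) * bank_len)
            keep = np.argsort(np.abs(W[r, s]))[-keep_per_bank:]
            Wp[r, s][keep] = W[r, s][keep]   # top-k of this bank survives
    return Wp

W = np.arange(1.0, 17.0).reshape(2, 8)       # all-nonzero toy matrix
Wp = bank_balanced_prune(W)                  # exactly 2 nonzeros per 4-wide bank
```

The trade-off is that the sparsity pattern is constrained per bank rather than global, which can cost some accuracy relative to unstructured pruning but makes the hardware schedule trivially balanced.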
V-B3 EdgeDRNN
Platform | This Work | NCS2 [34] | Jetson Nano [35] | Jetson TX2 [36] | GTX 1080 [37]
Chip | XC7Z007S | Myriad X | Tegra X1 | Tegra X2 | GP104
Dev. Kit Cost | $89 | $69 | $99 | $411 | $500+PC
DRAM Type (Bus Width) | DDR3L (16-bit) | - | LPDDR4 (64-bit) | LPDDR4 (128-bit) | GDDR5X (256-bit)
DRAM Bandwidth (GB/s) | 1.0 | - | 25.6 | 59.7 | 320
Test Network | 2L-768H-DeltaGRU | 2L-664H-LSTM | 2L-768H-GRU | 2L-768H-GRU | 2L-768H-GRU
#Parameters | 5.4 M | 5.4 M | 5.4 M | 5.4 M | 5.4 M
Bit Precision (A/W) | INT16/8 | FP16 | FP32 / FP16 | FP32 / FP16 | FP32 / FP16
WER on TIDIGITS | 1.1% | 0.8% | 0.7% | 0.8% | 1.3%
Latency (µs) | 2633 / 1673 / 536 | 3,588 | 5,757 / 4,356 | 3,124 / 2,693 | 527 / 484
Eff. Throughput (GOp/s) | 4.1 / 6.5 / 20.2 | 3.0 | 1.9 / 2.5 | 3.5 / 4.0 | 20.5 / 22.3
¹Wall Plug Power (W) | 2.3 | 1.7 | 7.2 / 7.1 | 8.2 / 8.1 | 96.6+PC / 82.2+PC
Power Efficiency (GOp/s/W) | 1.8 / 2.8 / 8.8 | 1.8 | 0.3 / 0.4 | 0.4 / 0.5 | 0.2 / 0.3

¹ EdgeDRNN power was measured by a USB power meter. Power numbers of the Jetson Nano and Jetson TX2 boards were measured by a Voltcraft 4500ADVANCED Energy Monitor. Power of the GTX 1080 was measured by the nvidia-smi utility.
Unlike ESE and BBS, EdgeDRNN includes an extra unit to compute the delta state vectors. Similar to ESE, EdgeDRNN assigns interleaved rows to MAC units and computes the MxV column by column; however, all MAC units share the same delta state vector element, so only one BUF (the DFIFO) is required.
Both ESE and BBS require indices of the nonzero weight elements to realize zero-skipping. These indices add memory-access overhead, reducing the effective memory bandwidth. EdgeDRNN instead skips whole columns of the computation, and the indices of valid columns are calculated on-the-fly to avoid this memory overhead.
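The column-skipping dataflow can be sketched functionally (our model of the idea, not the RTL): the valid-column list is derived on the fly from the delta state vector itself, so nothing has to be stored alongside the weights, and every skipped column saves an entire weight-column fetch.

```python
import numpy as np

def delta_mxv(W, delta_x):
    """Only columns with a nonzero delta-state element are fetched and
    computed; the column's delta value is shared by all MACs."""
    valid_cols = np.flatnonzero(delta_x)    # computed on the fly
    y = np.zeros(W.shape[0])
    for col in valid_cols:                  # whole zero columns are skipped
        y += W[:, col] * delta_x[col]       # one shared activation per column
    return y

W = np.arange(12.0).reshape(3, 4)
dx = np.array([0.0, 2.0, 0.0, -1.0])        # sparse delta state vector
y = delta_mxv(W, dx)
```

Because the skipped unit is a whole column rather than scattered scalars, the surviving memory traffic stays in long contiguous bursts, which is what keeps the DRAM controller efficient.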
Moreover, ESE and BBS require extra element-wise (Ewise) multiplication units to generate the RNN activations after the MxV. EdgeDRNN reuses the multipliers in the PE array by time-division multiplexing to save DSP and LUT resources. Element-wise additions are done by reusing the adders in the PE array plus a single 16-bit adder per PE, as shown in Fig. 5.
Our previous work, DeltaRNN [11], achieved high batch-1 throughput and MAC utilization with temporal sparsity, but it stored all network parameters on-chip, making it unscalable. EdgeDRNN, by contrast, is designed to match the external memory bandwidth available on any FPGA platform with external DRAM. The small number of MAC units operating along tall concatenated weight matrix columns, as shown in Fig. 6, makes the burst length long enough to maintain high DRAM controller efficiency for large networks.
V-C Comparison with an SNN Processor
We compare the performance metrics of TrueNorth [38], an application-specific integrated circuit (ASIC) SNN processor, on the TIDIGITS dataset. The system is reported to dissipate 38.6 mW using a feature extraction method that can be implemented directly on TrueNorth [39]. We cannot easily compare the power of this ASIC processor with the power dissipated by an FPGA, which is a more general-purpose platform. Moreover, running TrueNorth requires an interfacing FPGA that burns several watts, so the system power would be much higher. The reported accuracy of that work is only 95%, which is lower than the 99% accuracy achieved by the quantized delta network reported in our previous work [12].
V-D Comparison with Commercial Platforms
Table VII compares EdgeDRNN with popular commercial platforms, including their cost and memory system bandwidth. All platforms are benchmarked on the same spoken digit recognition task (the first 10,000 timesteps of the TIDIGITS test set) using networks of the same size, except that the Intel Neural Compute Stick 2 (NCS2) does not support GRUs and was benchmarked with an LSTM network with a similar parameter count, trained with the same hyperparameters. The latency requirement of the recognition task is 10 ms, determined by the frame stride. To meet this requirement, frames cannot be concatenated into a single tensor; the RNN computation is executed whenever a new frame arrives.
For the GPU benchmarks, we used GRUs because we found that the latency of both the FP32 and FP16 cuDNN GRU implementations is 3X lower than that of running the DeltaGRU algorithm with the NVIDIA cuSPARSE library. In addition, we removed peripheral devices from the Jetson boards, except for the Ethernet cable to the PC. Because GPUs also need time to boost their clock frequency and to allocate memory, the first 50 timesteps of the test sequence are excluded. The power efficiency results show that EdgeDRNN still achieves over 5X higher system power efficiency than the commercial ASIC and GPU products.
GPUs are throughput-oriented architectures suitable for neural network training with large batch sizes; however, they are not optimal for edge inference, where batch-1 throughput is critical for achieving low latency. The claimed peak FP32 throughputs of the Jetson Nano [35], Jetson TX2 [36] and GTX 1080 [37] are respectively 0.5 TOp/s, 0.8 TOp/s and 9 TOp/s, while the measured batch-1 throughputs are only 1.9 GOp/s, 3.5 GOp/s and 20.5 GOp/s. The low batch-1 throughput of GPUs arises because weights fetched from off-chip DRAM cannot be reused enough to fully utilize the GPU cores. Fig. 13A shows that the throughput of the GTX 1080 approaches its claimed peak with large batch sizes due to more weight data reuse; however, increasing the batch size also worsens latency, as shown in Fig. 13B. FP16 outperforms FP32 because of the smaller memory bottleneck.
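The effect of batching on GPU throughput can be illustrated with a hedged roofline-style model (our illustration with GTX 1080-like specifications, not measured figures): the weight matrix is fetched from DRAM once per batch, so arithmetic intensity and throughput grow with batch size until the compute peak caps them. The model ignores activation traffic and kernel-launch cost, so it is an upper bound.

```python
def batchn_throughput_gops(n_params, batch, mem_bw_gbs, peak_gops,
                           bytes_per_weight=4):
    """Roofline sketch: time is the max of weight-fetch time (paid once
    per batch) and compute time; throughput = total ops / time."""
    ops = 2.0 * n_params * batch                       # one MAC = 2 Op
    t_mem = n_params * bytes_per_weight / (mem_bw_gbs * 1e9)
    t_cmp = ops / (peak_gops * 1e9)
    return ops / max(t_mem, t_cmp) / 1e9               # GOp/s

# Illustrative FP32 numbers: 5.4 M params, 320 GB/s DRAM, 9 TOp/s peak
b1 = batchn_throughput_gops(5.4e6, 1, 320, 9000)       # memory bound
b256 = batchn_throughput_gops(5.4e6, 256, 320, 9000)   # compute bound
```

Even this optimistic model leaves batch-1 far below the compute peak, which is consistent with the gap between claimed and measured batch-1 throughput discussed above.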
Fig. 14 compares the latency per frame on a test set sample. EdgeDRNN latency is lower during silent or quieter periods (e.g. between 60 s and 80 s) when the input changes slowly. EdgeDRNN is as quick as the desktop GTX 1080 GPU and 5X quicker than the other platforms, despite having a DRAM bandwidth that is orders of magnitude lower.
In [7], we reported that EdgeDRNN ran the RNN for robotic control about 51X faster than the embedded BeagleBone Black platform with an ARM Cortex-A8 CPU, while burning about the same total power of 2 W. Moreover, to compare the performance of EdgeDRNN and the ARM Cortex-A9 CPU on the PS side of the MiniZed, we took the same 2L-128H-DeltaGRU network used in our previous real-time control demonstration [7] and measured the latency per frame on 1 minute of test data (sample rate = 200 Hz), comprising 1000 frames of motor encoder readings. Fig. 15 shows the latency of the ARM CPU and EdgeDRNN. The mean latency of the ARM CPU is 1281 µs without sparsity and 428 µs with sparsity. The mean latency of EdgeDRNN with sparsity is 16 µs; EdgeDRNN is therefore 27X faster than the ARM CPU exploiting temporal sparsity in the same network. For the robotic task, EdgeDRNN runs the network 300X faster than the required maximum latency of 5 ms.
VI Conclusion
The 2 W EdgeDRNN runs batch-1 RNNs as fast as a 200 W GPU+PC, and its power efficiency is at least 4X higher than any of the commercial edge AI platforms in the benchmark. We found that the batch-1 RNN throughput of commercial GPUs is more than 100X lower than their claimed peak throughput. Using the delta network to exploit temporal sparsity allows a modest number of 8 PEs to achieve an effective 162 Op per clock cycle, equivalent to a MAC utilization efficiency of over 1000%. EdgeDRNN uses a standard AXI4 interface for weight fetches; thus it can be scaled up to larger FPGA platforms by simply increasing the number of PEs to match the available memory bandwidth provided by on-chip BRAM or off-chip DRAM. Using temporal sparsity in the delta activation vectors thus allows the arithmetic units to effectively compute 10X more operations for the same amount of memory access.
The delta threshold allows an instantaneous trade-off of accuracy versus latency. Future work could exploit this dynamic trade-off to quickly converge onto optimal threshold values in a guided search.
References
 [1] A. Graves, “Sequence transduction with recurrent neural networks,” ICML Representation Learning Workshop, 2012.
 [2] A. Graves, A. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 6645–6649.
 [3] K.-i. Funahashi and Y. Nakamura, "Approximation of dynamical systems by continuous time recurrent neural networks," Neural Networks, vol. 6, no. 6, pp. 801–806, 1993. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S089360800580125X
 [4] T. Chow and Y. Fang, "A recurrent neural-network-based real-time learning control strategy applying to nonlinear systems with unknown dynamics," IEEE Transactions on Industrial Electronics, vol. 45, no. 1, pp. 151–161, Feb. 1998.
 [5] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997. [Online]. Available: http://dx.doi.org/10.1162/neco.1997.9.8.1735

 [6] K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Oct. 2014, pp. 1724–1734. [Online]. Available: http://www.aclweb.org/anthology/D14-1179
 [7] C. Gao, R. Gehlhar, A. D. Ames, S.-C. Liu, and T. Delbruck, "Recurrent neural network control of a hybrid dynamical transfemoral prosthesis with EdgeDRNN accelerator," in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 5460–5466.
 [8] J. Chen and X. Ran, “Deep Learning With Edge Computing: A Review,” Proceedings of the IEEE, vol. 107, no. 8, pp. 1655–1674, Aug. 2019.
 [9] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang et al., "ESE: Efficient speech recognition engine with sparse LSTM on FPGA," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2017, pp. 75–84.
 [10] S. Cao, C. Zhang, Z. Yao, W. Xiao, L. Nie, D. Zhan, Y. Liu, M. Wu, and L. Zhang, "Efficient and effective sparse LSTM on FPGA with bank-balanced sparsity," in Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 63–72. [Online]. Available: https://doi.org/10.1145/3289602.3293898
 [11] C. Gao, D. Neil, E. Ceolini, S.-C. Liu, and T. Delbruck, "DeltaRNN: A power-efficient recurrent neural network accelerator," in Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 21–30. [Online]. Available: https://doi.org/10.1145/3174243.3174261

 [12] D. Neil, J. Lee, T. Delbrück, and S.-C. Liu, "Delta networks for optimized recurrent network computation," in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017, 2017, pp. 2584–2593. [Online]. Available: http://proceedings.mlr.press/v70/neil17a.html
 [13] C. Gao, A. Rios-Navarro, X. Chen, T. Delbruck, and S.-C. Liu, "EdgeDRNN: Enabling low-latency recurrent neural network edge inference," in 2020 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), 2020, pp. 41–45.
 [14] A. X. M. Chang and E. Culurciello, "Hardware accelerators for recurrent neural networks on FPGA," in 2017 IEEE International Symposium on Circuits and Systems (ISCAS), May 2017, pp. 1–4.
 [15] M. Lee, K. Hwang, J. Park, S. Choi, S. Shin, and W. Sung, "FPGA-based low-power speech recognition with recurrent neural networks," in 2016 IEEE International Workshop on Signal Processing Systems (SiPS), 2016, pp. 230–235.
 [16] Y. Guan, H. Liang, N. Xu, W. Wang, S. Shi, X. Chen, G. Sun, W. Zhang, and J. Cong, "FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates," in 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), April 2017, pp. 152–159.
 [17] D. Shin, J. Lee, J. Lee, and H. Yoo, "14.2 DNPU: An 8.1TOPS/W reconfigurable CNN-RNN processor for general-purpose deep neural networks," in 2017 IEEE International Solid-State Circuits Conference (ISSCC), Feb. 2017, pp. 240–241.
 [18] M. Yang, C. Yeh, Y. Zhou, J. P. Cerqueira, A. A. Lazar, and M. Seok, "A 1 µW voice activity detector using analog feature extraction and digital deep neural network," in 2018 IEEE International Solid-State Circuits Conference (ISSCC), Feb. 2018, pp. 346–348.

 [19] E. Stromatias, D. Neil, M. Pfeiffer, F. Galluppi, S. B. Furber, and S.-C. Liu, "Robustness of spiking deep belief networks to noise and reduced bit precision of neuro-inspired hardware platforms," Frontiers in Neuroscience, vol. 9, p. 222, 2015.
 [20] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2–4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2016. [Online]. Available: http://arxiv.org/abs/1510.00149
 [21] D. Kadetotad, S. Yin, V. Berisha, C. Chakrabarti, and J. Seo, "An 8.93 TOPS/W LSTM recurrent neural network accelerator featuring hierarchical coarse-grain sparsity for on-device speech recognition," IEEE Journal of Solid-State Circuits, pp. 1–1, 2020.
 [22] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos, “Stripes: Bitserial deep neural network computing,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016, pp. 1–12.
 [23] O. Bilaniuk, S. Wagner, Y. Savaria, and J. David, “Bitslicing FPGA accelerator for quantized neural networks,” in 2019 IEEE International Symposium on Circuits and Systems (ISCAS), 2019, pp. 1–5.
 [24] Z. Wang, J. Lin, and Z. Wang, "Accelerating recurrent neural networks: A memory-efficient approach," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 10, pp. 2763–2775, Oct. 2017.
 [25] S. Wang, Z. Li, C. Ding, B. Yuan, Q. Qiu, Y. Wang, and Y. Liang, "C-LSTM: Enabling efficient LSTM using structured compression techniques on FPGAs," in Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '18. New York, NY, USA: ACM, 2018, pp. 11–20. [Online]. Available: http://doi.acm.org/10.1145/3174243.3174253
 [26] C. Gao, S. Braun, I. Kiselev, J. Anumula, T. Delbruck, and S.C. Liu, “Realtime speech recognition for IoT purpose using a delta recurrent neural network accelerator,” in 2019 IEEE International Symposium on Circuits and Systems (ISCAS), 2019, pp. 1–5.
 [27] R. G. Leonard and G. Doddington, “TIDIGITS speech corpus,” Texas Instruments, Inc, 1993.
 [28] J. Burgués, J. M. JiménezSoto, and S. Marco, “Estimation of the limit of detection in semiconductor gas sensors through linearized calibration models,” Analytica Chimica Acta, vol. 1013, pp. 13 – 25, 2 2018.
 [29] J. Burgués and S. Marco, “Multivariate estimation of the limit of detection by orthogonal partial least squares in temperaturemodulated MOX sensors,” Analytica Chimica Acta, vol. 1019, pp. 49 – 64, 2018.
 [30] S. Wang, Y. Hu, J. Burgués, S. Marco, and S.C. Liu, “Prediction of gas concentration using gated recurrent neural networks,” in 2020 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), 2020, pp. 178–182.
 [31] Xilinx, “AXI datamover.” [Online]. Available: https://www.xilinx.com/products/intellectualproperty/axi_datamover.html
 [32] AVNET, “MiniZed.” [Online]. Available: http://zedboard.org/product/minized
 [33] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning, ser. ICML ’06. New York, NY, USA: ACM, 2006, pp. 369–376. [Online]. Available: http://doi.acm.org/10.1145/1143844.1143891
 [34] “Intel® neural compute stick 2 product specifications.” [Online]. Available: https://ark.intel.com/content/www/us/en/ark/products/140109/intelneuralcomputestick2.html
 [35] “Jetson Nano Developer Kit,” Mar. 2019. [Online]. Available: https://developer.nvidia.com/embedded/jetsonnanodeveloperkit
 [36] “Harness AI at the Edge with the Jetson TX2 Developer Kit,” Aug 2019. [Online]. Available: https://developer.nvidia.com/embedded/jetsontx2developerkit
 [37] “GeForce GTX 1080 Graphics Cards from NVIDIA GeForce.” [Online]. Available: https://www.nvidia.com/enus/geforce/products/10series/geforcegtx1080/
 [38] F. Akopyan, J. Sawada, A. Cassidy, R. AlvarezIcaza, J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta, G. Nam, B. Taba, M. Beakes, B. Brezzo, J. B. Kuang, R. Manohar, W. P. Risk, B. Jackson, and D. S. Modha, “TrueNorth: Design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip,” IEEE Transactions on ComputerAided Design of Integrated Circuits and Systems, vol. 34, no. 10, pp. 1537–1557, 2015.
 [39] W. Tsai, D. R. Barch, A. S. Cassidy, M. V. DeBole, A. Andreopoulos, B. L. Jackson, M. D. Flickner, J. V. Arthur, D. S. Modha, J. Sampson, and V. Narayanan, “Alwayson speech recognition using TrueNorth, a reconfigurable, neurosynaptic processor,” IEEE Transactions on Computers, vol. 66, no. 6, pp. 996–1007, 2017.