1. Introduction
The development and successful training of deep neural networks (DNNs) has led to breakthrough results in application areas such as computer vision and machine learning (LeCun et al., 2015; Krizhevsky et al., 2012; Zhang et al., 2015). Although neural networks are inspired by neurons in the nervous system, it is known that learning and computation in the nervous system are mainly based on event-based spiking computational units (Deco et al., 2008). Accordingly, spiking neural networks (SNNs) have been proposed to better mimic the capabilities of biological neural networks. Although SNNs can represent the underlying spatio-temporal behavior of biological neural networks, they have received much less attention because they are difficult to train: spikes are in general not differentiable, so gradient-based methods cannot be used directly for training.
SNNs, like DNNs, are formed of multiple layers with several neurons per layer. They differ in functionality, however, in that SNNs pass spikes rather than floating-point values between neurons.
In general, DNNs and SNNs can be reduced to optimized ASICs and/or parallelized using GPUs. Due to temporal sparsity, ASIC implementations of SNNs are found to be far more energy and resource efficient, and neuromorphic chips with high energy efficiency are emerging, including Loihi (Davies et al., 2018), SpiNNaker (Furber et al., 2012) and others (Schemmel et al., 2010; Qiao et al., 2015). This energy efficiency, along with their relative simplicity in inference, makes SNNs attractive, so long as they can be trained efficiently and perform in a manner similar to DNNs.
Throughout this paper, we focus on recurrent SNNs. Similar to recurrent DNNs, recurrent SNNs are a special class of SNNs that are equipped with an internal memory which is managed by the network itself. This additional storage gives them the power to process sequential datasets. Hence, they are popular for different tasks including speech recognition and language modeling.
Despite the substantial literature on training SNNs, the domain, especially recurrent SNNs, is still in its infancy when compared to our understanding of training mechanisms for DNNs. A significant portion of the SNN-training literature has focused on training feedforward SNNs with one-layer networks (Gütig and Sompolinsky, 2006; Memmesheimer and others, 2014). Recently, some developments have enabled training multi-layer SNNs (Shrestha and Orchard, 2018); nonetheless, training recurrent SNNs is still at an incipient stage.
Recently, (Shrestha and Orchard, 2018) utilized spike responses based on kernel functions for every neuron to capture the temporal dependencies of spike trains. Although this method successfully captures the temporal dependency between spikes, kernel-based computations are costly. Moreover, the need for a convolution operation over time makes them inefficient when applied to recurrent SNNs.
Our contributions. We present a new framework for designing and training recurrent SNNs based on long short-term memory (LSTM) units. Each LSTM unit includes three different gates: a forget gate that helps to dismiss useless information, an input gate that monitors the information entering the unit, and an output gate that forms the outcome of the unit. Indeed, LSTM (Hochreiter and Schmidhuber, 1997) and its variants (Greff and others, 2017) are special cases of recurrent neural networks (RNNs) that, in part, help address the vanishing gradient problem. LSTMs are considered particularly well-suited for time series and sequential datasets. In this paper, we leverage this capability within SNNs to propose LSTM-based SNNs that are capable of sequential learning. We propose a novel backpropagation mechanism and architecture that make it possible to achieve better performance than existing recurrent SNNs, and performance that is comparable with conventional LSTMs. In addition, our approach does not require a convolutional mechanism over time, resulting in a lower-complexity training mechanism for recurrent SNNs compared to the kernel-based approaches for feedforward neural networks.
We study the performance and dynamics of our proposed architecture through empirical evaluations on various datasets. First, we start with a toy dataset, and then continue with benchmark language modeling and speech recognition datasets, which provide more structured temporal dependencies. Additionally, our approach achieves better test accuracy than the existing literature using a simple model and network. Further, we also show that such an LSTM SNN performs well on the larger and more complex sequential EMNIST dataset (Cohen et al., 2017). Finally, we evaluate the capability of the proposed recurrent SNNs in natural-language generation, which reveals one of many interesting applications of SNNs.
2. Related Work
In general, existing approaches for training SNNs can be subdivided into indirect training and direct training categories. Indirect training of SNNs refers to those approaches that train a conventional DNN using existing approaches and then associate/map the trained output to the desired SNN. Such a mechanism can be fairly general and powerful, but it can be limiting as the SNN obtained depends heavily on the associated DNN. In particular, (Esser and others, 2015) presents a framework in which the probability of spiking is optimized on a DNN and the optimized parameters are then transferred into the SNN. Further literature has been developed on this framework by adding noise to the associated activation function (Liu et al., 2017), constraining the synaptic strengths (the network’s weights and biases) (Diehl et al., 2015), and utilizing alternate transfer functions (O’Connor et al., 2013).
To enable direct training of SNNs, SpikeProp (Bohte et al., 2002) presents a pioneering supervised temporal learning algorithm. Here, the authors simulate the dynamics of neurons by leveraging an associated spike response model (SRM) (Gerstner and Kistler, 2002). In particular, SpikeProp and its associated extensions (Booij and tat Nguyen, 2005; Schrauwen and Campenhout, 2004) update the weights in accordance with the actual and target spiking times using gradient descent. However, the approach is challenging to apply to benchmark tasks. To partially address this, improvements on SpikeProp have been developed, including MuSpiNN (Ghosh-Dastidar and Adeli, 2009) and resilient propagation (McKennoch et al., 2006). More recently, (Jin et al., 2018) presents a two-level backpropagation algorithm for training SNNs, and (Shrestha and Orchard, 2018) presents a framework for training SNNs where both weights and delays are optimized simultaneously. Additionally, these frameworks apply a kernel function for every neuron, which might be a memory-intensive and time-consuming operation, especially for recurrent SNNs.
Perhaps the most related to our work is the recent work in (Bellec and others, 2018). Similarly, the authors propose using LSTM units, in combination with the algorithm in (Hubara et al., 2016), to ensure that the neurons in LSTM units output either or . For training, they approximate the gradient of the spike activation with the piecewise linear function , where is the output of the neuron before the activation (the so-called membrane potential of the neuron). In this paper, however, we relax the gradient of the spike activation with a probability distribution. This relaxation provides more precise updates for the network at each iteration. In addition, the authors of (Costa et al., 2017) remodel the LSTM architecture so that it is admissible to cortical circuits similar to those found in the nervous system; they use the sigmoid function for all activations in the LSTM. Further, (Shrestha et al., 2017) is an indirect training approach in which a conventional LSTM is first run and then mapped into a spiking version.
There are also bio-inspired approaches for direct training of SNNs, including methods such as spike-timing-dependent plasticity (STDP) (Song et al., 2000). STDP is an unsupervised learning mechanism which mimics the human visual cortex. Although such biologically inspired training mechanisms are of interest, they are also challenging to benchmark, and therefore we focus on alternative direct training approaches in this paper.
3. Our Methodology
3.1. LSTM Spiking Neural Networks
LSTM and its variants, a special class of RNNs, are popular due to their remarkable results in different sequential processing tasks involving long-range structures, such as natural language modeling and speech recognition. Indeed, LSTMs, and RNNs in general, are capable of capturing the temporal dependence of their input, while also addressing the vanishing gradient issue faced by other architectures.
Therefore, LSTM networks constitute a natural candidate to capture the temporal dependence that an SNN models. The output value of a neuron before applying the activation is called its membrane potential, denoted as for neuron at time ; see Figure (a).
We outline the main elements of the LSTM spiking unit in Figure 1. An LSTM spiking unit has three interacting gates and associated “spike” functions. Generally, spike activations and are applied to each of their associated neurons individually. These functions take the neurons’ membrane potential and output either a spike or null at each time step.
Like conventional LSTMs, the core idea behind such an LSTM spiking unit is the unit state, , which is a pipeline and manager of information flow between units. This is done through the collaboration of different gates and layers. The forget gate, denoted by , decides what information should be dismissed. The input gate, , controls the information entering the unit, together with an assisting layer on the input, , which is modulated by another spike activation . Eventually, the output of the unit is formed based on the output gate, , and the unit state. More specifically, given a set of spiking inputs , the gates and states are characterized as follows:
(1) 
where represents the Hadamard product, and are spike activations that map the membrane potential of a neuron, , to a spike if it exceeds the threshold value and , respectively. Throughout this paper, we use two terms: wake mode, which refers to the case in which the neuron generates a spike, meaning that the neuron’s value is ; and sleep mode, in which the neuron’s value is . Also, and denote the associated weights and biases for the network, respectively. Notice that can take the values , , or . Since the gradients around are not as informative, we threshold this output to output when it is or . We approximate the gradients of this step function with , which takes two values, or . Note that we can employ a Gaussian approximation at this step similar to our approach in the next section, and we observe that this relaxation does not affect the performance in practice, which is what we employed in the experiments.
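The gate structure described above can be sketched as follows. This is a minimal sketch rather than the authors' implementation: since equation (1) is not reproduced here, it assumes the standard LSTM gate layout with the sigmoid and tanh activations replaced by step ("spike") activations. The names `spike`, `init_params`, `lstm_spiking_step`, `theta1`, `theta2`, the threshold values, the layer sizes, and the choice of clipping the unit state to at most one are all assumptions made for illustration. Weights are standard normal and biases zero, as stated in the experimental settings.

```python
import numpy as np

def spike(u, threshold):
    """Return 1.0 (wake mode) where the membrane potential exceeds the threshold, else 0.0 (sleep mode)."""
    return (u > threshold).astype(np.float64)

def init_params(n_in, n_hidden, rng):
    """Standard-normal weights and zero biases for the four gates/layers (hypothetical layout)."""
    params = {}
    for gate in ("f", "i", "g", "o"):
        params["W_" + gate] = rng.standard_normal((n_hidden, n_hidden + n_in))
        params["b_" + gate] = np.zeros(n_hidden)
    return params

def lstm_spiking_step(x, h_prev, c_prev, params, theta1, theta2):
    """One time step of an assumed LSTM spiking unit; all inputs and outputs are 0/1 vectors."""
    z = np.concatenate([h_prev, x])                        # recurrent and external input spikes
    f = spike(params["W_f"] @ z + params["b_f"], theta1)   # forget gate
    i = spike(params["W_i"] @ z + params["b_i"], theta1)   # input gate
    g = spike(params["W_g"] @ z + params["b_g"], theta2)   # assisting layer on the input
    o = spike(params["W_o"] @ z + params["b_o"], theta1)   # output gate
    c = np.minimum(f * c_prev + i * g, 1.0)                # assumed: clip the unit state to {0, 1}
    h = o * c                                              # Hadamard product forms the unit output
    return h, c

rng = np.random.default_rng(0)
params = init_params(n_in=4, n_hidden=3, rng=rng)
h, c = np.zeros(3), np.zeros(3)
for x in rng.integers(0, 2, size=(5, 4)).astype(np.float64):
    h, c = lstm_spiking_step(x, h, c, params, theta1=0.5, theta2=0.5)
```

Because every gate emits only zeros and ones, the Hadamard products reduce to element-wise masking, which is what makes the unit state and output stay spike-valued in this sketch.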
3.2. Enabling Backpropagation in LSTM SNNs
Backpropagation is a major, if not the only, problem in SNNs. In this section, we proceed with an example. Regardless of the activations ( or ), assume that we perturb the membrane potential of a neuron, , with an arbitrary random value . Given , the neuron can be either in the wake mode or sleep mode. Based on the activation’s threshold (see Figure (b)), this perturbation could switch the neuron’s mode. For instance, in the wake mode if and also ( is the threshold that can be either or based on the activation), the neuron will be forced to the sleep mode. With this, we can say that the change in the neuron’s mode is a function of the membrane potential and the threshold given by . Therefore, if the mode switches, the derivative of the output w.r.t. is proportional to , otherwise, . Nevertheless, there is still a problem for small values of at which the mode switches (which equivalently means that is close to the threshold). Indeed, this gradient will blow up the backpropagation of error.
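The blow-up can be illustrated numerically. The sketch below is one reading of the argument above, under the assumption that the output jumps by one when the mode switches, so the finite-difference slope for the smallest perturbation that reaches the threshold is the reciprocal of the distance to the threshold. The function name `flip_slope` and the threshold value are hypothetical.

```python
def flip_slope(u, threshold):
    """Finite-difference slope of a 0/1 step output for the smallest perturbation that reaches the threshold."""
    delta = threshold - u      # smallest perturbation that moves u onto the threshold
    return 1.0 / abs(delta)    # |change in output| / |perturbation|, with an output change of 1

threshold = 0.5  # hypothetical threshold value
for u in (0.0, 0.4, 0.49, 0.499):
    # The slope grows without bound as the membrane potential approaches the threshold.
    print(u, flip_slope(u, threshold))
```

The slope is finite only while the membrane potential stays away from the threshold, which is the motivation for replacing it with a bounded relaxation.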
To tackle this issue, we suggest an alternative approximation. Consider the probability density function (pdf) which corresponds to the pdf of changing mode with as the random variable. Given a small random perturbation , the probability of switching mode is and the probability of staying at the same mode is . As such, we can capture the expected value of as follows:
(2) 
It can be seen that the activation’s derivative can be relaxed with an appropriate symmetric (about the threshold ) distribution, whose random variable is proportional to the difference between the neuron’s membrane potential and the threshold, .
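As a concrete instance of such a relaxation, the sketch below uses a Gaussian density as one symmetric choice. It is a minimal sketch under stated assumptions: the function names, the threshold, and the standard deviation `sigma` are hypothetical, and the forward pass is kept as a hard step while only the backward pass uses the density.

```python
import numpy as np

def spike_forward(u, threshold):
    """Hard step used in the forward pass: 1.0 above the threshold, 0.0 otherwise."""
    return (u > threshold).astype(np.float64)

def spike_backward_gaussian(u, threshold, sigma):
    """Surrogate derivative: Gaussian pdf of (u - threshold) with standard deviation sigma."""
    z = (u - threshold) / sigma
    return np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2.0 * np.pi))

u = np.linspace(-2.0, 3.0, 11)   # membrane potentials; the grid contains the threshold 0.5
threshold, sigma = 0.5, 1.0      # hypothetical values
out = spike_forward(u, threshold)
grad = spike_backward_gaussian(u, threshold, sigma)
```

Unlike the reciprocal slope, this surrogate is bounded by its peak value at the threshold and decays smoothly as the membrane potential moves away from it.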
We empirically observed that a good candidate for this distribution is the Gaussian distribution with suitable variance (see Figure (c)). Moreover, the smoothness of the Gaussian distribution makes it a better candidate than other well-known symmetric distributions, e.g., the Laplace distribution. Interestingly, another attribute that makes it unique is its curve, which has, in spirit, an impact on backpropagation analogous to that of the activations in traditional LSTM. In other words, the Gaussian distribution has the same shape as the derivatives of the sigmoid and tanh activations. In addition, we can easily tune the variances corresponding to and to have the same shape as their counterpart activations in traditional LSTM (see Figure (d)).
3.3. Loss Function Derivative and Associated Parameter Updates
Next, we develop the update expressions for the parameters of LSTM spiking units. In order to do so, consider that the output layer is softmax, , and the loss function is defined to be the cross-entropy loss. Therefore, the derivative of the loss function w.r.t. the output of LSTM SNNs at can be characterized as follows:
(3) 
where is the true signal or label. Similarly, for networks with linear output layers and least-squares loss functions, we obtain the same gradient. Given this and the expressions in (1), the derivatives of the loss function w.r.t. the outputs of each gate and layer can be derived as in (4). All other derivatives, with details, are provided in Appendix A.
(4) 
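The statement that a softmax output with cross-entropy loss and a linear output with least-squares loss yield the same gradient form can be checked numerically. The sketch below relies on the standard result that the derivative of the cross-entropy loss w.r.t. the softmax input is the prediction minus the label; treating this as the content of (3) is an assumption, since the equation body is not reproduced here. All function names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(z, y):
    """Cross-entropy between a one-hot label y and softmax(z)."""
    return -np.sum(y * np.log(softmax(z)))

def grad_closed_form(z, y):
    """Closed-form derivative w.r.t. the softmax input: prediction minus label."""
    return softmax(z) - y

def grad_numeric(loss, z, y, eps=1e-6):
    """Central finite-difference gradient of loss(z, y) w.r.t. z."""
    g = np.zeros_like(z)
    for k in range(z.size):
        d = np.zeros_like(z)
        d[k] = eps
        g[k] = (loss(z + d, y) - loss(z - d, y)) / (2.0 * eps)
    return g

def least_squares(z, y):
    """Least-squares loss with a linear (identity) output layer."""
    return 0.5 * np.sum((z - y) ** 2)

z = np.array([0.3, -1.2, 2.0])   # hypothetical pre-activation outputs
y = np.array([0.0, 0.0, 1.0])    # one-hot label
ce_match = np.allclose(grad_closed_form(z, y), grad_numeric(cross_entropy, z, y), atol=1e-6)
ls_match = np.allclose(z - y, grad_numeric(least_squares, z, y), atol=1e-6)
```

In both cases the error signal entering backpropagation is the difference between the output and the target, which is why the same update expressions apply.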
4. Experiments
4.1. Settings and Datasets
We test our proposed method on different datasets. For all experiments, we initialize all weights from a standard normal distribution, and all biases are initialized to zero. Additionally, the networks are trained using the Adam optimizer (Kingma and Ba, 2014), with the learning rates of , and as the original paper. The thresholds for the spike activations have been set on , which is optimized empirically. and are set to be and , respectively. More details about this selection are provided in Appendix B.
4.2. Toy Dataset
We first illustrate the performance of the proposed method on a periodic sinusoidal signal. Our objective is to show that the proposed architecture can learn the temporal dependence using spikes as the input. Hence, we set our original input and target output to be . In this case, the task is generating a prediction from a sequence of input spikes. To obtain this input spike train, after sampling the signal, we convert the samples into ON- and OFF-event-based values using a Poisson process, where the value of each input shows the probability that it emits a spike, as shown in Figure 3.
Next, we used the proposed deep LSTM spiking unit composed of one hidden layer of spiking neurons and input size of . The output is passed through a linear layer of size one. The loss function is , where and denote the actual and predicted outputs, respectively, and . Accordingly, we backpropagate the error using the proposed method. Also, we empirically optimize and and set them to and , respectively (more insight about the impact of these parameters on the convergence rate and accuracy is provided for the sequential MNIST dataset). The generated sequences and their convergence to the true signal for different numbers of iterations are shown in Figure 4. As it shows, the network has learned the dependencies of the samples in a few iterations.
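The spike encoding used for the toy signal can be sketched as follows. This is a minimal sketch under stated assumptions: the exact signal is not reproduced here, so a sinusoid rescaled to the range [0, 1] is used as a stand-in, and the Poisson process is approximated by drawing one independent Bernoulli sample per time step with the sample value as the spike probability. The names `poisson_encode`, `signal`, and the number of samples are hypothetical.

```python
import numpy as np

def poisson_encode(prob, rng):
    """Emit a spike (1.0) at each position with probability given by prob, else 0.0."""
    return (rng.random(prob.shape) < prob).astype(np.float64)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 4.0 * np.pi, 200)   # sampling instants (hypothetical)
signal = 0.5 * (1.0 + np.sin(t))         # stand-in sinusoid rescaled to [0, 1]
spikes = poisson_encode(signal, rng)     # ON/OFF events fed to the network

# Averaging many independent encodings recovers the underlying signal.
trials = np.stack([poisson_encode(signal, rng) for _ in range(2000)])
mean_rate = trials.mean(axis=0)
```

A single spike train is a noisy, binary view of the signal, so the network has to integrate information over time steps to predict the target.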
4.3. LSTM Spiking Network for Classification
Sequential MNIST (Lamb et al., 2016) is a standard and popular dataset among machine learning researchers. The dataset consists of handwritten digits corresponding to k images for training and k test images. Each image is a grayscale pixels coming from different classes. The main difference of sequential MNIST is that the network cannot get the whole image at once (see Appendix Figure 5). To convert each image to ON- and OFF-event-based training samples, we again use Poisson sampling, where the intensity of each pixel shows the probability that it emits a spike.
To treat MNIST as a sequential dataset, we train the proposed LSTM spiking network over time steps and input size of for each time step (see Appendix Figure 5), and run the optimization for epochs. Test accuracy and associated error bars are presented in Table 1. In addition, we list the results of other state-of-the-art recurrent SNN approaches for sequential MNIST, and of feedforward SNNs for MNIST, in the same table.
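The conversion of one image into a spike sequence can be sketched as follows. This is a minimal sketch, assuming that an MNIST image is 28 by 28 pixels with intensities scaled to [0, 1] and that the network reads one image row per time step; the row-wise order and the function name are assumptions, and a synthetic image is used so that the example is self-contained.

```python
import numpy as np

def image_to_spike_sequence(image, rng):
    """Map a (28, 28) image with intensities in [0, 1] to a (28, 28) binary array.

    Row t of the result is the spike vector presented to the network at time step t.
    """
    prob = np.clip(image, 0.0, 1.0)   # pixel intensity used as spike probability
    return (rng.random(prob.shape) < prob).astype(np.float64)

rng = np.random.default_rng(0)
# Synthetic stand-in image: intensity rises from 0 (left column) to 1 (right column).
image = np.tile(np.linspace(0.0, 1.0, 28), (28, 1))
sequence = image_to_spike_sequence(image, rng)
for step_input in sequence:
    pass  # each step_input has size 28 and would be fed to the recurrent network in order
```

Dark pixels never spike and saturated pixels always spike, while intermediate intensities spike at a rate proportional to their value.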
Method  Architecture  Accuracy  Best 

Converted FF SNN (Diehl et al., 2015)  
FF SNN(Jin et al., 2018)  
LSTM SNN(Costa et al., 2017)  ( LSTM units)  
LSTM SNN(Bellec and others, 2018)  ( LSTM units)  
this work  ( LSTM units) 
(A) refers to indirect feedforward SNNs training, MNIST dataset,
(B) MNIST dataset,
As can be seen in Table 1, we achieve test accuracy for sequential MNIST which is better than other LSTM-based SNNs, and this result is also comparable to what was obtained by the feedforward SNN proposed in (Jin et al., 2018). It should be noted that in (Jin et al., 2018) neurons are followed by time-based kernels and the network gets the whole image at once. We first note that kernel-based SNNs are not instantaneous. Usually, these networks are modeled continuously over time , and then are sampled with a proper sampling time . For every time instance, each neuron goes through a convolution operation, and finally the outputs are transferred to the next layer via matrix multiplication. This procedure is repeated for every time instance . Even though our proposed algorithm operates in discrete time steps, one should note that the number of time steps in our model is much smaller than in the kernel-based methods. Indeed, for kernel-based approaches one should prefer a small sampling time to guarantee appropriate sampling, which, on the other hand, increases the number of time steps and consequently incurs more computation cost. For the MNIST dataset, for example, the number of time steps required by our algorithm is 28 (see in Table 1), while the kernel-based method in (Shrestha and Orchard, 2018) requires 350. Furthermore, in power-limited regimes the computational complexity of kernel-based approaches makes them less favorable candidates. In our proposed method, however, we eliminate the need for these kernels by drawing connections between LSTM and SNNs in order to model the dynamics of neurons. More information about the selection of and is provided in Appendix B.
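The cost argument can be made concrete with a small sketch. The step counts 28 and 350 are taken from the text; everything else is an assumption for illustration: the exponential kernel, its time constant, the duration and sampling-time values, and the function names are hypothetical and are not the kernels used in the cited work.

```python
import numpy as np

def num_time_steps(duration, sampling_time):
    """Number of discrete time instances obtained by sampling a continuous-time model."""
    return int(round(duration / sampling_time))

def filter_spike_train(spikes, sampling_time, tau):
    """Convolve a binary spike train with a causal exponential kernel exp(-t / tau)."""
    t = np.arange(spikes.size) * sampling_time
    kernel = np.exp(-t / tau)                       # hypothetical kernel choice
    return np.convolve(spikes, kernel)[: spikes.size]

steps_lstm_snn = 28                                 # from the text
steps_kernel = 350                                  # from the text
step_ratio = steps_kernel / steps_lstm_snn

# Halving the sampling time doubles the number of time instances to process.
coarse = num_time_steps(duration=350.0, sampling_time=1.0)
fine = num_time_steps(duration=350.0, sampling_time=0.5)

single_spike = np.zeros(10)
single_spike[0] = 1.0
response = filter_spike_train(single_spike, sampling_time=1.0, tau=2.0)
```

In a kernel-based network a filtering step of this kind is applied per neuron, in addition to the matrix multiplication between layers, and it is repeated at every time instance.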
Sequential EMNIST is another standard and relatively new benchmark for classification algorithms. It is an extended version of MNIST, but more challenging in the sense that it includes both letters and digits. It has almost K training, and about K test samples from distinct classes. Using the same framework as in the sequential MNIST section, we convert each image into an ON- and OFF-event-based sequential array. Similarly, we train the network for iterations. The resulting test accuracy and the associated error bars are presented in Table 2. The results of some other methods are also listed in the same table. Although this dataset has not been tested by other recurrent SNN approaches, we obtain results comparable with feedforward SNNs.
We believe there are several reasons why FF SNNs perform better in image classification tasks. Among them are getting the image at once, equipping each neuron with a time-based kernel, and sampling the input multiple times (see (Jin et al., 2018) and (Shrestha and Orchard, 2018)). However, RNNs in general and LSTM in particular have shown tremendous success in sequential learning tasks, which can be attributed to their equipping each neuron with an internal memory to manage the information flow from the sequential inputs. This feature makes RNNs and their derivatives the preferred method in many sequential modeling tasks, especially in language modeling. FF networks, however, are not designed to learn the dependencies of a sequential input. While the work proposed in (Jin et al., 2018) performs better in image classification, it is not obvious how its architecture can be modified for sequential learning tasks; see the following experiments.
Method  Architecture  Accuracy  Best 

Converted FF SNN (Neftci et al., 2017)  
FF SNN (Neftci et al., 2017)  
FF SNN (Jin et al., 2018)  
this work (Sequential EMNIST)  ( units) 
Dataset  Characters  LSTM SNN  LSTM  Words  LSTM SNN  LSTM 

Alice’s Adventure  K  
Wikitext2  K 
Generated Text (characterlevel)  Close Text 

she is such a cring  she is such a nice 
andone had no very clear notion all over with william  Alice had no very clear notion how long ago anything had happened 
she began again: ’of hatting ’  she began again: ’Ou est ma chatte?’ 
she was very like as thump!  she was not quite sure 
Generated Text (wordlevel) 
Close Text 
alice began to get rather sleepy and went on  Alice began to feel very uneasy 
the rabbit was no longer to be lost  there was not a moment to be lost 
however on the second time round she could if i only knew how to begin  however on the second time round she came upon a low curtain 
4.4. Language Modeling
The goal of this section is to demonstrate that the proposed LSTM SNN is also capable of learning high-quality language modeling tasks. By showing this, we can verify the network’s capability to learn long-term dependencies. In particular, we first train our network for prediction and then extend it to be a generative language model, at both the character and word level, using the proposed LSTM SNN. Indeed, the proposed recurrent SNN will learn the dependencies in the strings of inputs and the conditional probabilities of each character (word) given a sequence of characters (words). For both models, we use an LSTM spiking unit with one hidden layer of size . We also use the same initialization and parameters as mentioned before.
Character-level: each dataset that we used for this part is a string of characters, including letters, digits, and punctuation. The network is a series of LSTM SNN units, and the input of each, , is a character in one-hot encoded form, represented by the vector , where denotes the total number of characters. Therefore, the input vector for each unit is a one-hot vector, which also favors a spike-based representation. Given the training sequence , the network utilizes it to return the predictive sequence, denoted by , where . It should be noted that the last layer of each spiking LSTM module is a softmax.
The datasets that we employ are Alice’s Adventures in Wonderland and Wikitext-2. We first shrink these datasets and remove capital letters by replacing them with lowercase ones. After this preprocessing, Alice’s Adventures in Wonderland and Wikitext-2 include and distinct characters, respectively, and they both have total number of characters. The test set for each is a different text with the same distinct characters but a total size of . We used an LSTM with input size of characters, one hidden layer of size and output size of characters. To evaluate the model, the averaged perplexity () after iterations is reported in Table 3. We also report the results of a conventional LSTM on the same datasets. As can be seen, the proposed LSTM spiking unit achieves comparable results, while its advantage is that it is far more energy and resource efficient. After learning long-term dependencies successfully, the trained model can also be employed to generate text. Hence, to give a better view of the quality and richness of the generated sequences, some samples are presented in Table 4.
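The character-level input encoding and the evaluation metric can be sketched as follows. This is a minimal sketch: the helper names and the sample string are hypothetical, and because the perplexity formula is not reproduced above, the standard definition (the exponential of the mean negative log-likelihood of the target characters) is assumed.

```python
import numpy as np

def build_vocab(text):
    """Map each distinct character to an index."""
    return {ch: k for k, ch in enumerate(sorted(set(text)))}

def one_hot_encode(text, vocab):
    """Return an array of shape (len(text), len(vocab)) with a single 1.0 per row."""
    X = np.zeros((len(text), len(vocab)))
    X[np.arange(len(text)), [vocab[ch] for ch in text]] = 1.0
    return X

def perplexity(probs, targets):
    """Exponential of the mean negative log-probability assigned to the target indices."""
    p = probs[np.arange(len(targets)), targets]
    return float(np.exp(-np.mean(np.log(p))))

text = "Alice was beginning to get very tired".lower()   # capital letters replaced by lowercase
vocab = build_vocab(text)
X = one_hot_encode(text, vocab)          # each row is already a binary, spike-like input vector
targets = X[1:].argmax(axis=1)           # next-character prediction targets
uniform = np.full((len(targets), len(vocab)), 1.0 / len(vocab))
baseline = perplexity(uniform, targets)  # a uniform predictor scores the vocabulary size
```

Since a one-hot vector contains only zeros and a single one, no additional Poisson sampling is needed to present it to a spiking network.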
Word-level: similar to the character-level case, we start by removing capital letters and then extract the distinct words. However, compared to the character-level case, one-hot encoding of each word would be exhaustive. To tackle this problem, we start by encoding each word as a representative vector. Based on word-to-vector embedding, we use a window size of ( words behind and word ahead), and train a feedforward neural network of one hidden layer with units followed by a softmax layer. Hence, each word is represented by a vector of size , where different words with similar context are close to each other. In this representation, each vector carries critical information from the words, and we expect a significant loss of information if we convert the vectors into spike-based representations. Therefore, we keep the input vectors in their original format without any conversion to ON- and OFF-event-based form. The same datasets as in the previous task have been used. However, here we have an LSTM spiking unit with input size of , one hidden layer with neurons, and output size of . Similar to the previous part, the results are provided in Table 3 and Table 4. For the word-level language modeling task, the results are also comparable with conventional LSTM.
4.5. Speech Classification
The goal of the speech recognition task is to evaluate the ability of our architecture to learn speech sequences for classification. To do so, we leverage a speech dataset recorded at kHz, FSDD, consisting of recordings of digits spoken by four different speakers, with a total size of ( of each per speaker). To effectively represent each sample for training, we first transform the samples using the 1D wavelet scattering transform. After applying this preprocessing, each sample becomes a 1D vector of size coming from different classes. The proposed network for this task is a series of LSTM spiking units, with input size of for each, and the output is taken from the last unit, where it is followed by a softmax layer. To evaluate the model, the dataset is divided into training and test samples. Based on this methodology, we achieved accuracy on the training set and on the test set. Employing the same architecture, the training and test accuracies for conventional LSTM are and , respectively. It can be inferred that LSTM SNNs can obtain results comparable to conventional LSTM, but in a more energy- and resource-efficient manner.
5. Conclusion
In this work, we introduce a framework for direct training of recurrent SNNs. In particular, we developed a class of LSTM-based SNNs that leverage the inherent LSTM capability of learning temporal dependencies. We then develop a backpropagation framework for such networks. We evaluate the performance of such LSTM SNNs on toy examples and then on classification tasks. The results show that the proposed network achieves better performance than the existing recurrent SNNs. The results are also comparable with feedforward SNNs, while the proposed model is computationally less intensive. Finally, we test our method on a language modeling task to evaluate the ability of our network to learn long-term dependencies.
Acknowledgements.
This work was supported by ONR under grant N000141912590, and by ONR under grants 1731754 and 1564167.
References
Bellec and others (2018). Long short-term memory and learning-to-learn in networks of spiking neurons. In Advances in Neural Information Processing Systems 31, pp. 787–797.
Bohte et al. (2002). Error-backpropagation in temporally encoded networks of spiking neurons. Neurocomputing 48 (1–4), pp. 17–37.
Booij and tat Nguyen (2005). A gradient descent rule for spiking neurons emitting multiple spikes. Information Processing Letters 95 (6), pp. 552–558.
Cohen et al. (2017). EMNIST: an extension of MNIST to handwritten letters. arXiv preprint arXiv:1702.05373.
Costa et al. (2017). Cortical microcircuits as gated-recurrent neural networks. In Advances in Neural Information Processing Systems 30, pp. 272–283.
Davies et al. (2018). Loihi: a neuromorphic manycore processor with on-chip learning. IEEE Micro 38 (1), pp. 82–99.
Deco et al. (2008). The dynamic brain: from spiking neurons to neural masses and cortical fields. PLoS Computational Biology 4 (8), pp. e1000092.
Diehl et al. (2015). Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing. In 2015 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
Esser and others (2015). Backpropagation for energy-efficient neuromorphic computing. In Advances in Neural Information Processing Systems 28, pp. 1117–1125.
Furber et al. (2012). Overview of the SpiNNaker system architecture. IEEE Transactions on Computers 62 (12), pp. 2454–2467.
Gerstner and Kistler (2002). Spiking neuron models: single neurons, populations, plasticity. Cambridge University Press.
Ghosh-Dastidar and Adeli (2009). A new supervised learning algorithm for multiple spiking neural networks with application in epilepsy and seizure detection. Neural Networks 22 (10), pp. 1419–1431.
Greff and others (2017). LSTM: a search space odyssey. IEEE Transactions on Neural Networks and Learning Systems 28 (10), pp. 2222–2232.
Gütig and Sompolinsky (2006). The tempotron: a neuron that learns spike timing–based decisions. Nature Neuroscience 9 (3), pp. 420.
Hochreiter and Schmidhuber (1997). Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
Hubara et al. (2016). Binarized neural networks. In Advances in Neural Information Processing Systems 29, pp. 4107–4115.
Jin et al. (2018). Hybrid macro/micro level backpropagation for training deep spiking neural networks. In Advances in Neural Information Processing Systems 31, pp. 7005–7015.
Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Krizhevsky et al. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
Lamb et al. (2016). Professor forcing: a new algorithm for training recurrent networks. In Advances in Neural Information Processing Systems 29, pp. 4601–4609.
LeCun et al. (2015). Deep learning. Nature 521 (7553), pp. 436.
Liu et al. (2017). Noisy softplus: an activation function that enables SNNs to be trained as ANNs. arXiv preprint arXiv:1706.03609.
McKennoch et al. (2006). Fast modifications of the SpikeProp algorithm. In IJCNN, Vol. 6, pp. 3970–3977.
Memmesheimer and others (2014). Learning precisely timed spikes. Neuron 82 (4), pp. 925–938.
Neftci et al. (2017). Event-driven random backpropagation: enabling neuromorphic deep learning machines. Frontiers in Neuroscience 11, pp. 324.
O’Connor et al. (2013). Real-time classification and sensor fusion with a spiking deep belief network. Frontiers in Neuroscience 7, pp. 178.
Qiao et al. (2015). A reconfigurable on-line learning spiking neuromorphic processor comprising 256 neurons and 128k synapses. Frontiers in Neuroscience 9, pp. 141.
Schemmel et al. (2010). A wafer-scale neuromorphic hardware system for large-scale neural modeling. In Proceedings of 2010 IEEE International Symposium on Circuits and Systems, pp. 1947–1950.
Schrauwen and Campenhout (2004). Extending SpikeProp. In 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No.04CH37541), Vol. 1, pp. 471–475.
Shrestha et al. (2017). A spike-based long short-term memory on a neurosynaptic processor. In 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 631–637.
Shrestha and Orchard (2018). SLAYER: spike layer error reassignment in time. In Advances in Neural Information Processing Systems 31, pp. 1419–1428.
Song et al. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neuroscience 3 (9), pp. 919.
Zhang et al. (2015). Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pp. 649–657.
Appendix A Backpropagation
We develop the update expressions for the parameters of LSTM spiking units. In order to do so, consider that the output layer is softmax, , and the loss function is defined to be the cross-entropy loss. Therefore, the derivative of the loss function w.r.t. the output of LSTM SNNs at can be characterized as follows:
(5) 
Similarly, for networks with linear output layers and least-squares loss functions, we obtain the same gradient. Given this and the derivatives of the loss function w.r.t. the outputs of each gate in (4), we can now update the weights based on the derivative of the loss function for each of them:
where and , and is one or a positive number less than it (based on the value of , explained in Section 3.1 of the paper). Taking into account these partial derivatives at each time step , we can now update the weights and biases based on the partial derivatives of the loss function with respect to them. With the same approach, we can express the derivatives of the loss function for the biases.
Appendix B & impacts
Figure 6 shows the strong effects of and on tuning the gradients. Indeed, these two parameters control the flow of error during backpropagation for different parts of the LSTM spiking unit. An interesting point is that with and , the LSTM SNN becomes similar to a conventional LSTM during backpropagation. We ran these experiments on the MNIST dataset and observed the same outcomes for the other datasets as well.
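The similarity to a conventional LSTM during backpropagation can be explored numerically. The sketch below is not the authors' parameter selection: it assumes that the surrogate is a Gaussian bump with a free width and area, and it picks the width so that the bump matches the peak and the area of the sigmoid derivative (peak 0.25, area 1) and of the tanh derivative (peak 1, area 2). The function names and the matching rule are hypothetical.

```python
import numpy as np

def gaussian_bump(x, sigma, area):
    """Zero-mean Gaussian with standard deviation sigma, scaled to integrate to area."""
    return area * np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def sigmoid_grad(x):
    """Derivative of the logistic sigmoid."""
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh."""
    return 1.0 - np.tanh(x) ** 2

def sigma_matching(peak, area):
    """Width for which a Gaussian bump of the given area has the given peak height."""
    return area / (peak * np.sqrt(2.0 * np.pi))

x = np.linspace(-6.0, 6.0, 1201)
sigma_sigmoid = sigma_matching(peak=0.25, area=1.0)   # roughly 1.6
sigma_tanh = sigma_matching(peak=1.0, area=2.0)       # roughly 0.8
gap_sigmoid = np.max(np.abs(gaussian_bump(x, sigma_sigmoid, 1.0) - sigmoid_grad(x)))
gap_tanh = np.max(np.abs(gaussian_bump(x, sigma_tanh, 2.0) - tanh_grad(x)))
```

With these widths the bumps stay close to the corresponding activation derivatives over the whole range, which is consistent with the observation that suitable parameter values make the backward pass resemble that of a conventional LSTM.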