I Introduction
Research interest in neural networks has grown considerably in recent years owing to their remarkable success in many applications. Record accuracy was obtained with deep networks applied to image classification krizhevsky_imagenet_2012 ; szegedy_going_2015 ; he_deep_2016 , and new architectures enabled solving cognitively-challenging tasks, such as multiple object detection trained end-to-end redmon_you_2016 , pixel-level segmentation of images he_mask_20171 , or even the playing of computer games based on raw screen pixels mnih_humanlevel_2015 . Moreover, sequence-to-sequence models enabled language translation sutskever_sequence_2014 , and the combination of convolutional layers with recurrent layers has led to language-independent end-to-end speech recognition amodei_deep_2016 . These architectures surpassed the performance of traditional domain-specific models and established neural networks as the standard approach in the industry.
Although the term neural networks elicits associations with the sophisticated functioning of the brain, the advances in the field were obtained by extending the original simple artificial neural network (ANN) paradigm of the 1950s to complex deep neural networks trained with backpropagation. ANNs take only high-level inspiration from the structure of the brain, comprising neurons interconnected with synapses, which results in human-like performance, albeit at a much higher power budget than the 20 W required by the human brain. At the same time, the neuroscientific community – whose focus is understanding the brain dynamics – has been exploring architectures with more biologically-realistic dynamics, such as spiking neural networks (SNNs) that in their simplest form consist of Leaky Integrate-and-Fire (LIF) neurons dayan_theoretical_2005 ; eliasmith_how_2013 ; gerstner_neuronal_2014 . The SNN paradigm encompasses rich temporal dynamics, due to the brain-inspired LIF neurons, and promises low power consumption, due to the use of sparse asynchronous voltage pulses, called spikes, to compute and propagate information. Thus, SNNs are considered to be the next generation of neural networks beyond ANNs maass_networks_1997 , with advantages stemming from their efficient implementation, rich dynamics and novel learning capabilities.
From an implementation perspective, the inherent characteristics of SNNs have led to highly-efficient computing architectures with collocation of memory and processing units. For example, incorporation of the design principles of the SNN paradigm in the field of neuromorphic computing mead_neuromorphic_1990 has led to the development of non-von Neumann systems with significantly increased parallelism and reduced energy consumption, demonstrated in chips such as FACETS/BrainScaleS meier_mixedsignal_2015 , Neurogrid benjamin_neurogrid:_2014 , IBM’s TrueNorth cassidy_realtime_2014 and Intel’s Loihi davies_loihi:_2018 . Moreover, recent breakthroughs in the area of memristive nanoscale devices have enabled further improvements in area and energy efficiency of mixed digital-analog implementations of synapses and spiking neurons kuzum_synaptic_2013 ; tuma_stochastic_2016 ; wozniak_learning_2016 ; pantazi_allmemristive_2016 ; tuma_detecting_2016 .
From a neural and synaptic dynamics perspective, large-scale simulations following neuroscientific insights were performed to explore the activity patterns of SNNs markram_blue_2006 ; izhikevich_largescale_2008 ; ananthanarayanan_cat_2009 ; markram_human_2012 , or to address concrete cognitive problems eliasmith_largescale_2012 ; rasmussen_spiking_2014 . However, the most appealing models eliasmith_largescale_2012 involved complex task-specific architectures, and the execution of their dynamics could take up to 2.5 hours and 24 GB of RAM to compute one second of response. Owing to the lack of a holistic understanding of large-scale activity patterns and the complex design of cognitive simulations, further bottom-up research on simpler architectures is needed. Specifically, approaches such as that in which the competitive dynamics of inhibitory circuits were abstracted to Winner-Take-All (WTA) architectures maass_computational_2000 , or that in which the rich recurrent dynamics were exploited in Liquid State Machines maass_realtime_2002 , provide direct examples of cases where SNN dynamics can enhance the computational capabilities.
Finally, from a learning perspective, biologically-inspired unsupervised Hebbian learning rules, such as Spike-Timing-Dependent Plasticity (STDP) markram_regulation_1997 ; song_competitive_2000 and its extension to Fatiguing STDP (FSTDP) moraitis_fatiguing_2017 , were applied for correlation detection song_competitive_2000 ; gutig_learning_2003 ; tuma_stochastic_2016 ; wozniak_learning_2016 ; pantazi_allmemristive_2016 ; tuma_detecting_2016 ; moraitis_fatiguing_2017 or for sampling high-frequency signals tuma_stochastic_2016 . STDP in WTA networks was applied for handwritten digit recognition querlioz_simulation_2011 ; diehl_unsupervised_2015 ; sidler_unsupervised_2017 , but yielded limited accuracy. The reason is an insufficient generalization of the internal representation of knowledge in WTA architectures, which effectively implement a $k$-NN algorithm wozniak_THESIS_2017 . Further improvements in the internal representation were obtained through feature learning bichler_extraction_2012 ; burbank_mirrored_2015 ; wozniak_IJCNN_2017 .
However, despite the great promise of SNNs in terms of efficient implementation, rich dynamics and learning capabilities, it has been unclear how to effectively train large generic networks of spiking neurons to reach the accuracy of ANNs for common machine learning tasks. The main reason was attributed to the lack of a scalable supervised SNN learning algorithm, such as backpropagation (BP) in ANNs. It was shown that porting the weights from ANNs trained with BP to SNNs enhances their performance oconnor_realtime_2013 ; diehl_conversion_2016 ; rueckauer_conversion_2018 . Moreover, there have been multiple attempts to develop supervised learning approaches inspired by the idea of BP bohte_errorbackpropagation_2002 ; anwani_normad_2015 , and also to explore how STDP could perform BP bengio_stdpcompatible_2017 ; tavanaei_bpstdp:_2017 , or how to directly implement BP. Examples include: using a differentiable approximation of the LIF response function hunsberger_spiking_2015 , applying BP instead of STDP when output spikes are emitted esser_convolutional_2016 , deriving BP formulas on low-pass-filtered neuronal activities lee_training_2016 , or using the concept of backpropagation through time with differentiable approximations of temporal network dynamics huh_gradient_2017 ; bellec_long_2018 . Despite many performance improvements, these state-of-the-art architectures usually do not surpass the accuracy of ANNs on common datasets and involve sophisticated implementation.
In this paper, we take a different perspective on the relationship between ANNs and SNNs. We reflect on the nature of the spiking neural dynamics and propose to unify the SNN model with its ANN counterpart. In the first part of the paper, we focus on the LIF dynamics and provide a constructive proof that a spiking neuron can be transformed into a simple novel recurrent ANN unit called the Spiking Neural Unit (SNU), which shares many similarities with the Long Short-Term Memory (LSTM) unit and the Gated Recurrent Unit (GRU). Moreover, we generalize the LIF dynamics to a non-spiking case by introducing a so-called soft variant of the SNU (sSNU). In the second part, we address the learning challenges. We show that for the proposed units the existing ANN frameworks can be naturally reused for training with backpropagation through time, enabling a very easy implementation and successful learning of any SNN architecture. Finally, we demonstrate the efficacy of our approach by training deep SNNs with up to seven layers and analyzing their performance on two tasks: handwritten digit recognition and polyphonic music prediction. The results obtained from the SNU family-based networks show competitive performance compared to RNN-, LSTM- and GRU-based networks – the state-of-the-art models commonly used in tasks involving temporal data.
II Neuron models and temporal data
Conventional ANNs are feedforward architectures implementing layers of neurons described by the equation

$$\mathbf{y} = f(\mathbf{W}\mathbf{x} + \mathbf{b}), \tag{1}$$

where $\mathbf{x}$ and $\mathbf{y}$ are the input and output vectors, $\mathbf{W}$ is the weight matrix, $\mathbf{b}$ is a vector of biases and $f$ is the activation function, such as a sigmoid $\sigma$ or the rectified linear function used in Rectified Linear Units (ReLUs) nair_rectified_2010 .
Such ANNs are stateless and have no inherent notion of time, yet through an appropriate transformation of a window of the most recent temporal inputs into a large spatial input vector waibel_modular_1989 , it is possible to utilize these neuronal models with temporal data. However, this solution is computationally inefficient, because a larger model, with as many times more inputs as there are time steps in the window, needs to be evaluated during each time step, and the system needs to implement shift buffering or delay lines to provide the required past inputs.
A Recurrent Neural Network (RNN) is an extension of an ANN that is capable of operating directly on temporal data. This is achieved through the introduction of recurrent connections between the neurons within each layer

$$\mathbf{y}_t = f(\mathbf{W}\mathbf{x}_t + \mathbf{H}\mathbf{y}_{t-1} + \mathbf{b}), \tag{2}$$

where $t$ indicates the discrete time and $\mathbf{H}$ denotes the matrix of recurrent weights. As a result, streams of temporal data may be directly fed into such networks, which maintain the temporal context in the transient activation values circulating through the recurrent connections.
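As a minimal sketch of the recurrent layer update in Eq. 2, the following pure-Python snippet applies one layer to a short input stream. The names `rnn_step` and `run_rnn`, the list-of-lists matrix layout and the zero initial output are illustrative assumptions, not part of the original work.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(M, v):
    # Matrix-vector product for a matrix stored as a list of rows.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_step(W, H, b, x_t, y_prev, f=sigmoid):
    """One time step of a recurrent layer: y_t = f(W x_t + H y_{t-1} + b)."""
    wx = matvec(W, x_t)
    hy = matvec(H, y_prev)
    return [f(p + q + r) for p, q, r in zip(wx, hy, b)]

def run_rnn(W, H, b, xs):
    # Assumption: the layer output starts at zero before the first input.
    y = [0.0] * len(b)
    outputs = []
    for x_t in xs:
        y = rnn_step(W, H, b, x_t, y)
        outputs.append(y)
    return outputs
```

With the recurrent matrix set to zero, the update reduces to the stateless feedforward layer of Eq. 1, which makes the role of the recurrent term easy to inspect.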
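However, the rapid transient dynamics of RNNs posed significant challenges for training these models on long sequences. For each time step, the temporal context of the entire RNN layer is combined with the new inputs and transformed by the nonlinear activation function. This leads to a rapid saturation of the neuronal activations that negatively impacts the learning – known as the vanishing gradients problem. The solution was to provide a long-term temporal context that was unaffected by the nonlinear interactions outside the neuronal cells. This was achieved by introducing into the neurons an internal state variable called carry $\mathbf{c}_t$. Its dynamics is controlled by surrounding trainable gates that together form a stateful Long Short-Term Memory (LSTM) unit gers_learning_1999

$$\begin{aligned}
\mathbf{y}_{i,t} &= \sigma(\mathbf{W}_i\mathbf{x}_t + \mathbf{H}_i\mathbf{y}_{t-1} + \mathbf{b}_i),\\
\mathbf{y}_{o,t} &= \sigma(\mathbf{W}_o\mathbf{x}_t + \mathbf{H}_o\mathbf{y}_{t-1} + \mathbf{b}_o),\\
\mathbf{y}_{f,t} &= \sigma(\mathbf{W}_f\mathbf{x}_t + \mathbf{H}_f\mathbf{y}_{t-1} + \mathbf{b}_f),\\
\mathbf{c}_t &= g(\mathbf{W}_c\mathbf{x}_t + \mathbf{H}_c\mathbf{y}_{t-1} + \mathbf{b}_c)\odot\mathbf{y}_{i,t} + \mathbf{c}_{t-1}\odot\mathbf{y}_{f,t},\\
\mathbf{y}_t &= h(\mathbf{c}_t)\odot\mathbf{y}_{o,t},
\end{aligned} \tag{3}$$

where multiple RNN units with inputs $\mathbf{x}_t$ and outputs $\mathbf{y}_t$ are combined and indexed with $i$, $o$, $f$ to denote the input, the output, and the forget gate, respectively. Moreover, $g$ and $h$ are input and output activation functions, and $\odot$ denotes the element-wise product. This approach became the state-of-the-art in recurrent networks.
Recently, Gated Recurrent Units (GRUs) cho_properties_2014 became a popular alternative to LSTM units, achieving similar performance with fewer gates chung_empirical_2014 . They are formulated as

$$\begin{aligned}
\mathbf{y}_{z,t} &= \sigma(\mathbf{W}_z\mathbf{x}_t + \mathbf{H}_z\mathbf{y}_{t-1} + \mathbf{b}_z),\\
\mathbf{y}_{r,t} &= \sigma(\mathbf{W}_r\mathbf{x}_t + \mathbf{H}_r\mathbf{y}_{t-1} + \mathbf{b}_r),\\
\mathbf{y}_t &= (1-\mathbf{y}_{z,t})\odot\mathbf{y}_{t-1} + \mathbf{y}_{z,t}\odot g(\mathbf{W}\mathbf{x}_t + \mathbf{H}(\mathbf{y}_{r,t}\odot\mathbf{y}_{t-1}) + \mathbf{b}),
\end{aligned} \tag{4}$$

where $\mathbf{y}_{z,t}$ is called an update gate, $\mathbf{y}_{r,t}$ is called a reset gate and $g$ is the activation function.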
Meanwhile, SNNs have been developing almost independently from ANNs. The common basic Leaky Integrate-and-Fire (LIF) spiking neuron model is inherently temporal, comprising a state variable $V_m$, called the membrane potential, with the dynamics described by the differential equation gerstner_neuronal_2014

$$\tau\frac{dV_m(t)}{dt} = -V_m(t) + R\,I(t), \tag{5}$$

where $\tau = RC$ is the time constant of the neuron, $R$ and $C$ represent the resistance and the capacitance of the neuronal cell soma, and $I(t)$ is the incoming current from the synapses. The synapses of a neuron receive spikes and modulate them by the synaptic weights to provide the input current to the neuronal cell soma. The input current is integrated into the membrane potential $V_m$. When $V_m$ crosses a firing threshold $V_{th}$ at time $t$, an output spike is emitted: $y(t) = 1$, and the membrane potential is reset to the resting state $V_{rest}$, often defined to be equal to $0$.
It is common to describe the membrane potential dynamics using a discrete-time approximation that is obtained from Eq. 5 assuming a discretization step $\Delta T$

$$V_{m,t} = \left(1 - \frac{\Delta T}{\tau}\right)V_{m,t-1} + \frac{\Delta T}{C}\,I_t. \tag{6}$$

Assuming that we do not consider the temporal dynamics of biologically-realistic models of synapses and dendrites gerstner_neuronal_2014 , the input current may be defined as $I_t = \mathbf{W}\mathbf{x}_t$. This formulation provides a simple framework for the analysis of the LIF dynamics, commonly explored in SNN research. However, understanding how to take advantage of these temporal dynamics to build large-scale generic deep spiking networks that would achieve high accuracy on common machine learning tasks has remained an open question.
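The discrete-time LIF update can be illustrated with a short simulation. This is a minimal sketch under stated assumptions: a forward-Euler step of the leaky integration, a resting potential of zero, an immediate reset on threshold crossing, and arbitrary default parameter values chosen only for readability (the function name `simulate_lif` is hypothetical).

```python
def simulate_lif(currents, tau=10.0, C=1.0, v_th=1.0, dt=1.0):
    """Forward-Euler LIF neuron with resting potential 0.

    Update: V_t = (1 - dt/tau) * V_{t-1} + (dt/C) * I_t, followed by a
    reset to 0 whenever the threshold v_th is crossed.
    """
    v = 0.0
    potentials, spikes = [], []
    for i_t in currents:
        v = (1.0 - dt / tau) * v + (dt / C) * i_t
        if v > v_th:
            spikes.append(1)   # output spike
            v = 0.0            # reset to the resting state
        else:
            spikes.append(0)
        potentials.append(v)
    return potentials, spikes

potentials, spikes = simulate_lif([0.6, 0.0, 0.6, 0.0])
```

In this example the first input charges the membrane, the potential then leaks during the silent step, and the second input pushes it over the threshold, producing a single spike followed by a reset.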
III LIF dynamics in an ANN framework
Here, we introduce a novel way of looking at SNNs that makes their temporal dynamics easier to understand and enables them to be incorporated in deep learning architectures. In particular, we propose a succinct high-level model of a spiking LIF neuron, which we call a Spiking Neural Unit (SNU). The SNU comprises two ANN neurons as subunits: $N_1$, which models the membrane potential accumulation dynamics, and $N_2$, which implements the spike emission, as illustrated in Fig. 1. The integration dynamics of the membrane-potential state variable is realized through a single self-looping connection to $N_1$ in the accumulation stage. The spike emission is realized through a neuron $N_2$ with a step activation function. Simultaneously, an activation of $N_2$ controls the resetting of the state variable by gating the self-looping connection at $N_1$. Thus, the SNU – a discrete-time abstraction of a LIF neuron – represents a construct that is directly implementable as a neural unit in ANN frameworks.
Following the standard ANN convention, the formulas that govern the computation occurring in a layer of SNUs are as follows

$$\begin{aligned}
\mathbf{s}_t &= g\big(\mathbf{W}\mathbf{x}_t + l(\tau)\odot\mathbf{s}_{t-1}\odot(1-\mathbf{y}_{t-1})\big),\\
\mathbf{y}_t &= h(\mathbf{s}_t + \mathbf{b}),
\end{aligned} \tag{7}$$

where $\mathbf{s}_t$ is the vector of internal state variables calculated by the $N_1$ subunits, $\mathbf{y}_t$ is the output vector calculated by the $N_2$ subunits, $g$ is the accumulation stage activation function, $h$ is the output activation function, and $\odot$ denotes the element-wise product.
$N_1$ is a ReLU, i.e., $g$ is the rectified linear activation function, based on the assumption that the membrane potential value is bounded by the resting state $V_{rest} = 0$. The inputs are weighted by the synaptic weights in matrix $\mathbf{W}$ and there is no bias term. The self-looping weight $l(\tau)$ applied to the previous state value $\mathbf{s}_{t-1}$ performs a discrete-time approximation of the membrane potential decay that occurred in the time period $\Delta T$. The last term $(1-\mathbf{y}_{t-1})$ relies on the binary output values of the spiking output to either retain the state, or reset it after spike emission. $N_2$ is a thresholding neuron, i.e., it has a step activation function $h$, with $h(a) = 1$, corresponding to an output spike, if $a > 0$, and $h(a) = 0$ otherwise. There is no weight on the connection from $N_1$, but it is biased with $\mathbf{b}$ to implement the spiking threshold.
The parameters of the SNUs are $\mathbf{W}$, $l(\tau)$ and $\mathbf{b}$. If $l(\tau)$ is fixed to 1, the state does not decay and the SNU corresponds to the Integrate-and-Fire (IF) neuron without the leak term. Otherwise, these parameters correspond to the parameters of the LIF neuron introduced in Eq. 6, i.e., to its discretization step $\Delta T$, time constant $\tau$, capacitance $C$, input current $I_t$ and firing threshold $V_{th}$ through

$$\mathbf{W}\mathbf{x}_t = \frac{\Delta T}{C}\,I_t,\qquad l(\tau) = 1 - \frac{\Delta T}{\tau},\qquad \mathbf{b} = -V_{th}. \tag{8}$$
Thus, the same set of parameter values can be used in an SNU-based network, implemented by utilizing standard ANN frameworks, as well as in a native LIF-based implementation, utilizing standard SNN frameworks, or even in neuromorphic hardware. To demonstrate this, we have used TensorFlow (http://www.tensorflow.org) to produce sample plots of the spiking dynamics for a single SNU in Fig. 2. As can be seen, the state variable of the SNU increases each time an input spike arrives at the neuron, and decreases following the exponential decay dynamics. When the spiking threshold is reached, an output spike is emitted and the membrane potential is reset. These dynamics are aligned with the reference LIF dynamics, which we obtained for the corresponding parameters by running a simulation in the well-known Brian2 (http://brian2.readthedocs.io) SNN framework.
III.1 Relaxing the SNN constraints
In SNNs, information is transmitted throughout the network with all-or-none spikes, typically modeled as binary values. As a result, the input data is binarized, and the step function is used to determine the binary neuronal outputs. However, the proposed SNU implementation within the ANN framework allows the all-or-none constraint to be relaxed, which makes it possible to explore the benefits of a variant of the SNU called the soft SNU (sSNU). The sSNU is a member of the family of SNUs characterized by the dynamics in Eq. 7. It generalizes these dynamics to non-spiking ANNs, in which the input data does not have to be binarized and the output activation function $h$ is set to a sigmoid function. This formulation has the additional interesting property of an analog proportional reset, i.e., the magnitude of the output determines what fraction of the membrane-potential state variable is retained. Exploiting the intermediate values at all stages of processing, viz., input, reset and output, facilitates an on-par performance comparison of LIF-like dynamics with other ANN models, eliminating any potential performance loss stemming from the limited value resolution of the standard SNU.
The sSNU concept has another interesting intuitive interpretation. In a sense, the 0 or 1 binarized output of a neuron represents its confidence in a certain hypothesis concerning the information presented at its inputs. In handling static data, all relevant information is presented simultaneously at the inputs and, as a consequence, an artificial neuron would directly output its confidence regarding this input information. However, in cases with temporal data, the information is spread over time and the LIF neurons collect it in the membrane potential wozniak_THESIS_2017 . When enough information aligned with the hypothesis has been collected, an output spike transmits this fact to the downstream neurons and restarts the process through the state reset. However, with sSNU, a floating point output is always transmitted. To avoid repeated retransmission of the same information to the downstream neurons, the value of the membrane potential has to be reduced. On the other hand, in order to retain certain memory, the neuron should not be fully reset at each time step. Thus, the solution provided by sSNU is to attenuate the state variable proportionally to the output value transmitted to the downstream neurons.
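The SNU and sSNU dynamics described above can be sketched in a few lines of pure Python. This is an illustrative sketch, not the authors' TensorFlow implementation: the function names, the scalar `decay` standing in for the self-looping weight, and the zero initial state and output are assumptions. The same step function serves both variants; only the output activation changes.

```python
import math

def relu(a):
    return max(0.0, a)

def step(a):
    return 1.0 if a > 0.0 else 0.0

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def snu_layer_step(W, decay, b, x_t, s_prev, y_prev, h=step):
    """One time step of a layer of SNUs (h=step) or sSNUs (h=sigmoid).

    State:  s_t = relu(W x_t + decay * s_{t-1} * (1 - y_{t-1}))
    Output: y_t = h(s_t + b), where b plays the role of a negative threshold.
    """
    s_t, y_t = [], []
    for row, b_k, s_k, y_k in zip(W, b, s_prev, y_prev):
        drive = sum(w * x for w, x in zip(row, x_t))
        s_new = relu(drive + decay * s_k * (1.0 - y_k))
        s_t.append(s_new)
        y_t.append(h(s_new + b_k))
    return s_t, y_t

def run_snu(W, decay, b, xs, h=step):
    # Assumption: state and output start at zero.
    s = [0.0] * len(b)
    y = [0.0] * len(b)
    outputs = []
    for x_t in xs:
        s, y = snu_layer_step(W, decay, b, x_t, s, y, h)
        outputs.append(y)
    return outputs

spikes = run_snu([[0.6]], 0.9, [-1.0], [[1.0], [0.0], [1.0], [0.0]])
soft = run_snu([[0.6]], 0.9, [-1.0], [[1.0], [0.0], [1.0], [0.0]], h=sigmoid)
```

With the step output the unit emits a single spike at the third time step and its state is fully reset afterwards; with the sigmoid output the state is attenuated in proportion to the analog output at every step.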
III.2 Structural comparison
The temporal context of the units from the SNU family is captured through the internal state corresponding to the membrane potential of the LIF neuron. In this sense, the structure of the SNUs is similar to that of LSTM or GRU units, in that they also rely on their internal state as a means of storing temporal context. This structural similarity is visible in Figs. 3a-c, in which all the aforementioned units have an internal state that is maintained through a recurrent loop within the units’ boundaries, drawn in gray. Besides similarities in the structure, the SNUs possess unique features not present in the other models, viz., a nonlinear transformation $g$ within the internal state loop, a parametrized state loop connection, a bias of the state output connection to the output activation function $h$, indicated by bold arrows in Fig. 3c, and a direct reset gate controlled by the output $\mathbf{y}_t$.

| Unit type | Feedforward network structure | Recurrent network structure |
|---|---|---|
| Stateless units | ANN | RNN |
| Stateful units | SNU | LSTM/GRU |

The SNUs can be optionally interconnected through recurrent connections, as is indicated in Fig. 3c. Thus, similar to the Liquid State Machine SNN model maass_realtime_2002 and the learning-to-learn architecture bellec_long_2018 , it might be beneficial for certain tasks to extend the SNU-based networks to include the recurrent connections matrix $\mathbf{H}$. However, the most typical architecture with SNUs is feedforward, which creates a novel category of ANN architectures for temporal processing, as summarized in Tab. 1. Note that processing temporal data without the use of a recurrent neural network structure, but rather using only the internal state, has long been the standard approach in the SNN community wozniak_learning_2016 ; pantazi_allmemristive_2016 ; maass_computational_2000 ; song_competitive_2000 ; moraitis_fatiguing_2017 ; gutig_learning_2003 ; querlioz_simulation_2011 ; diehl_unsupervised_2015 ; sidler_unsupervised_2017 ; bichler_extraction_2012 ; wozniak_IJCNN_2017 . Thus, in the rest of the paper we will focus on the classic feedforward SNN network architectures.
The use of feedforward stateful architectures for temporal problems has a series of profound advantages. From an implementation perspective, all-to-all connectivity between the neuronal outputs and the neuronal inputs within the same layer is not required. This may lead to highly-parallel software implementations or neuromorphic hardware designs. From a theoretical standpoint, owing to the inherent temporal neural dynamics, a feedforward network of SNUs is the simplest temporal neural network architecture, with a lower number of parameters than RNN-, LSTM- or GRU-based networks, which may result in faster training and reduced overfitting.
In addition to the synaptic weights, the trainable parameters in SNUs may include the membrane time constant and the neuronal threshold. For a layer of $n$ neurons with $m$ inputs, the number of parameters for an SNU with trainable synaptic weights and firing thresholds, but constant $\tau$, is $nm + n$, which is equivalent to the number of parameters in the simplest feedforward ANN. Moreover, as summarized in Tab. 2, even if a trainable time constant is considered, the number of parameters in an SNU architecture is smaller than that of an RNN, an LSTM with four fully-parametrized gates or a GRU with three fully-parametrized gates.

| Network model | # of parameters |
|---|---|
| ANN | $nm + n$ |
| RNN | $nm + n^2 + n$ |
| LSTM | $4(nm + n^2 + n)$ |
| GRU | $3(nm + n^2 + n)$ |
| SNU | $nm + n$ |
| SNU with trainable $\tau$ | $nm + 2n$ |
| recurrent SNU | $nm + n^2 + n$ |

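The parameter counts discussed above follow from simple bookkeeping, sketched below for a single layer. The counting conventions are assumptions made for illustration: one bias (or threshold) per neuron, one full input matrix and one full recurrent matrix per gate, and one extra value per neuron when the time constant is trainable; `parameter_counts` is a hypothetical helper name.

```python
def parameter_counts(n, m):
    """Per-layer parameter counts for n neurons with m inputs.

    Assumes fully-parametrized gates: each gate has an n x m input matrix,
    an n x n recurrent matrix and n biases.
    """
    dense = n * m + n            # input weights + biases (or thresholds)
    recurrent = dense + n * n    # plus the recurrent weight matrix
    return {
        "ANN": dense,
        "RNN": recurrent,
        "LSTM": 4 * recurrent,   # input, output, forget gates + cell input
        "GRU": 3 * recurrent,    # update, reset gates + candidate output
        "SNU": dense,            # same as a feedforward ANN layer
        "SNU with trainable tau": dense + n,
        "recurrent SNU": recurrent,
    }

counts = parameter_counts(n=256, m=784)
```

For a layer of 256 units with 784 inputs this gives 200960 parameters for the SNU, compared with 266496 for a plain RNN layer and 1065984 for an LSTM layer under the same conventions.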
IV Training
SNUs provide a mapping of the spiking neural dynamics into the ANN frameworks, which naturally enables the reuse of existing backpropagation training procedures. However, backpropagation requires that all parts of the network are differentiable, which is the case for the sSNU variant, but not for the step function in the standard SNU. Nevertheless, in particular cases it is possible to train non-differentiable neural networks by providing a pseudo-derivative for the non-differentiable functions bengio_estimating_2013 . In the remaining part of the paper, we follow this approach and use the derivative of $\tanh$ as the pseudo-derivative of the step function.
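The idea of a pseudo-derivative can be sketched as follows: the forward pass uses the true step function, while the backward pass substitutes the derivative of a smooth function. The choice of the hyperbolic tangent as that smooth function is an assumption made here for illustration, as are the function names.

```python
import math

def step(a):
    # Forward pass: the non-differentiable spiking nonlinearity.
    return 1.0 if a > 0.0 else 0.0

def step_pseudo_derivative(a):
    """Backward pass stand-in: d/da tanh(a) = 1 - tanh(a)**2 (assumed choice)."""
    return 1.0 - math.tanh(a) ** 2

def output_and_pseudo_grad(s, b):
    """Spike output y = step(s + b) and the pseudo-gradient dy/ds used in training."""
    a = s + b
    return step(a), step_pseudo_derivative(a)

y, dy_ds = output_and_pseudo_grad(s=1.2, b=-1.0)
```

The pseudo-gradient is largest when the state is close to the threshold and decays smoothly away from it, so neurons that are near firing receive the strongest learning signal.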
Even though an SNU-based network is a feedforward architecture, the state within the units is implemented using self-looping recurrent connections. Therefore, to train such deep networks, we follow the idea of using backpropagation through time (BPTT) werbos_generalization_1988 in SNNs huh_gradient_2017 ; bellec_long_2018 . This implies that the SNU structure is unfolded over time, i.e., the computational graph and its parameters are replicated for each time step, as illustrated in Fig. 4, and then the standard backpropagation algorithm is applied. The unfolding involves only the local state of the neuron, which is different from the common RNN architectures that require unfolding of the activations of all units in a layer through recurrent connection matrices $\mathbf{H}$. In practice, these details do not matter for the ANN frameworks, which generate a computational graph and use automatic differentiation for the training, so that the entire training code is created dynamically.
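To make the unfolding concrete, the sketch below applies BPTT by hand to a single scalar sSNU and computes the gradient of a mean-output squared error with respect to the input weight. It is a hand-written illustration under stated assumptions (scalar unit, sigmoid output, zero initial state, hypothetical function names); in practice an ANN framework derives the same gradient through automatic differentiation.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(w, decay, b, xs):
    """Unrolled scalar sSNU; returns pre-activations, states and outputs per step."""
    a_hist, s_hist, y_hist = [], [], []
    s, y = 0.0, 0.0
    for x in xs:
        a = w * x + decay * s * (1.0 - y)
        s = max(0.0, a)
        y = sigmoid(s + b)
        a_hist.append(a)
        s_hist.append(s)
        y_hist.append(y)
    return a_hist, s_hist, y_hist

def loss(w, decay, b, xs, target):
    _, _, ys = forward(w, decay, b, xs)
    return (sum(ys) / len(ys) - target) ** 2

def grad_w_bptt(w, decay, b, xs, target):
    """Gradient of the loss w.r.t. w, propagated backwards through the unrolled steps."""
    a_hist, s_hist, y_hist = forward(w, decay, b, xs)
    T = len(xs)
    d_mean = 2.0 * (sum(y_hist) / T - target)
    grad_w, carry_s, carry_y = 0.0, 0.0, 0.0
    for t in reversed(range(T)):
        g_y = d_mean / T + carry_y                       # direct + via next step's reset
        g_s = g_y * y_hist[t] * (1.0 - y_hist[t]) + carry_s
        g_a = g_s if a_hist[t] > 0.0 else 0.0            # ReLU derivative
        grad_w += g_a * xs[t]
        s_prev = s_hist[t - 1] if t > 0 else 0.0
        y_prev = y_hist[t - 1] if t > 0 else 0.0
        carry_s = g_a * decay * (1.0 - y_prev)           # gradient into s_{t-1}
        carry_y = -g_a * decay * s_prev                  # gradient into y_{t-1}
    return grad_w
```

Only the unit's own previous state and output appear in the backward recursion, which reflects the local nature of the unfolding described above.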
In the case that standard SNUs are used for the output layer of a network, we propose to adapt the learning loss to reflect the differences in how SNNs and ANNs are assessed. For RNNs, LSTMs and GRUs, it is common to consider only the last output after presentation of the entire sequence (see Fig. 4). In the case of SNNs, it is common to assess the output spiking rate of the neurons over a range of time, such as the entire output sequence in Fig. 4. To reflect this, we define the SNU spiking rate loss as the mean squared error (MSE) between the mean spiking rate of the output $\mathbf{y}_t$, calculated over an assessment period of $T$ time steps, and the target firing rate $\hat{\mathbf{r}}$

$$L_{\mathrm{rate}} = \mathrm{MSE}\left(\frac{1}{T\,\Delta T}\sum_{t=1}^{T}\mathbf{y}_t,\ \hat{\mathbf{r}}\right). \tag{9}$$

In deep learning, normalized target values in the range between 0 and 1 are used by convention. If we normalize the mean spiking rate by the maximum spiking rate of $1/\Delta T$, we obtain the normalized spiking rate loss, or simply the mean output loss for normalized targets $\hat{\mathbf{y}}$

$$L = \mathrm{MSE}\left(\frac{1}{T}\sum_{t=1}^{T}\mathbf{y}_t,\ \hat{\mathbf{y}}\right), \tag{10}$$

which does not depend on the sampling time $\Delta T$ from the SNN discretization, and is also suitable for use with any ANN model.
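The mean output loss can be sketched as follows; the helper names and the nested-list layout (time steps by output neurons) are assumptions made for illustration.

```python
def mean_outputs(outputs):
    """Time-average of the outputs; `outputs` is a list over time steps of per-neuron lists."""
    T = len(outputs)
    n = len(outputs[0])
    return [sum(step[k] for step in outputs) / T for k in range(n)]

def mean_output_loss(outputs, targets):
    """MSE between time-averaged outputs and normalized targets in [0, 1]."""
    means = mean_outputs(outputs)
    return sum((m - t) ** 2 for m, t in zip(means, targets)) / len(targets)

def spiking_rate_loss(outputs, target_rates, dt):
    """MSE between mean spiking rates (mean output / dt) and target rates."""
    rates = [m / dt for m in mean_outputs(outputs)]
    return sum((r - t) ** 2 for r, t in zip(rates, target_rates)) / len(target_rates)

# Two output neurons observed over four time steps.
outputs = [[1.0, 0.0], [1.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
loss_value = mean_output_loss(outputs, [1.0, 0.0])
```

Dividing both the mean outputs and the targets by the sampling time only rescales the rate loss by a constant factor, which is why the normalized form is independent of the discretization.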
IV.1 Handwritten digit recognition
We evaluated the performance of deep SNU-based networks in comparison to other popular temporal ANN models. Here, we propose a temporal variation of the MNIST handwritten digit classification task, in which we assume that the inputs are spikes from an asynchronous camera. We assume that for each input image pixel belonging to the digit, defined as a pixel having a positive intensity value, the camera sends a spike at a random time instance. Thus, as illustrated in Fig. 5, the generated spikes convey jittered information about the digits. For repeatability of the results and to limit the regularization effects, the transformation from standard MNIST to the jittered MNIST was performed upfront for the entire dataset using five time steps per digit and a random seed equal to 0, so that the same inputs are presented at each epoch to all models.
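A minimal sketch of such a jittered encoding is given below. It assumes a flattened image given as a list of intensities, exactly one spike per positive pixel, and Python's `random` module as the random number generator; the generator and the exact sampling procedure used for the reported experiments are not specified in the text, so these are illustrative choices.

```python
import random

def jitter_encode(image, num_steps=5, seed=0):
    """Convert a flattened image into `num_steps` binary spike frames.

    Each pixel with positive intensity emits exactly one spike at a
    uniformly random time step; zero-intensity pixels stay silent.
    """
    rng = random.Random(seed)
    frames = [[0] * len(image) for _ in range(num_steps)]
    for idx, intensity in enumerate(image):
        if intensity > 0:
            frames[rng.randrange(num_steps)][idx] = 1
    return frames

frames = jitter_encode([0, 12, 255, 0, 7])
```

Because the seed is fixed, repeated calls produce identical spike frames, which corresponds to generating the jittered dataset once upfront rather than resampling it at every epoch.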
We aimed to develop a training setting similar to the SNN convention querlioz_simulation_2011 ; sidler_unsupervised_2017 , in which the patterns from the dataset form a continuous stream, i.e., the spikes representing the current training digit come directly after the spikes of the preceding digit. Thus, the network operates with a non-zero initial state and has to identify consecutive digits without receiving explicit information about when the digit at the input has changed. However, a direct implementation of this approach would require applying BPTT to a continuous stream of 60000 training digits forming a single training input stream for the entire dataset. Instead, to apply BPTT to the individual training digits and benefit from parallel training with batching, we trained the networks on sequences formed by first feeding a dummy random digit to initialize the network state and then presenting the training digit. The neuronal outputs of all the models were then evaluated during the training digit presentation period using the mean output loss defined in Eq. 10.
We trained 3-, 4- and 7-layer network architectures with 784-250-10, 784-256-256-10 and 784-256-256-256-256-256-10 neurons, respectively. The networks were homogeneous – all the neurons were of the same kind: SNUs, or sSNUs, with trainable , and fixed ; or RNNs with sigmoidal neurons; or LSTMs with sigmoidal activation functions; or GRUs with sigmoidal activation functions. For each epoch, the training set of 60000 MNIST images was presented with a batch size of 15. The learning was performed using Stochastic Gradient Descent (SGD) with learning rate . To assess the consistency of the results, the networks were executed for 10 different random weight initializations and the mean test accuracy was calculated.
Firstly, we analyzed the impact of the network depth on the performance of an SNN implemented with SNUs. The evolution of the test accuracy is plotted in Fig. 6. A 3-layer SNU-based network achieved a mean test accuracy of 96.8%, whereas with a 4-layer architecture the accuracy increased to 97.41%. A further slight improvement was achieved with a 7-layer architecture, which increased the test accuracy to 97.46%. These results indicate that deep networks incorporating spiking neural dynamics have high potential for addressing machine learning tasks that involve temporal data efficiently and with high accuracy.
Secondly, we compared the performance of the units from the SNU family with the state-of-the-art temporal ANNs. The learning curves of the best performing RNN-, GRU- and LSTM-based networks are depicted in Fig. 7. The best test accuracy was obtained with 3- or 4-layer networks and was similar for all of them. The exact accuracy values and numbers of parameters of the networks are reported in Tab. 3. As can be seen, the 4- or 7-layer SNU-based network architectures have surpassed these state-of-the-art temporal ANN models in test accuracy. Moreover, a soft variant of the SNU in a 4-layer configuration has achieved the highest accuracy.

| Network | Total # of parameters | Mean accuracy | Maximum accuracy |
|---|---|---|---|
| GRU 3-layer | | 0.9694 | 0.9708 |
| LSTM 4-layer | | 0.9699 | 0.9719 |
| RNN 4-layer | | 0.9708 | 0.9718 |
| SNU 4-layer | | 0.9741 | 0.9754 |
| sSNU 4-layer | | 0.9796 | 0.9802 |

IV.2 Polyphonic music prediction
To further validate the performance of SNUs, we considered the task of polyphonic music prediction. We used the dataset of Johann Sebastian Bach’s (JSB) chorales that comprises over seven hours of music in 382 pieces, in the form provided by Boulanger-Lewandowski et al. boulangerlewandowski_modeling_2012 . The piano notes ranging from A0 to C8 were coded as 88-dimensional binary vectors, in which ones correspond to a note being played. They were sequentially fed into a network, which had to predict at each time step the set of notes to be played in the following time step. The goal was to minimize the negative log-probability of the note predictions.
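The evaluation metric can be sketched as below. This sketch assumes that each note is treated as an independent Bernoulli variable, that the per-step negative log-probabilities are summed over notes and averaged over time steps, and that predictions are clipped away from 0 and 1 for numerical safety; these conventions are common for this benchmark but are assumptions here rather than details stated in the text.

```python
import math

def negative_log_probability(predictions, targets, eps=1e-12):
    """Average over time steps of the summed per-note binary cross-entropy.

    predictions: list over time steps of per-note probabilities in (0, 1).
    targets:     list over time steps of per-note binary labels (0 or 1).
    """
    total = 0.0
    for p_t, y_t in zip(predictions, targets):
        for p, y in zip(p_t, y_t):
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(predictions)

# Toy example with two notes and a single time step.
nll = negative_log_probability([[0.5, 0.5]], [[1, 0]])
```

An uninformed prediction of 0.5 for every one of the 88 notes would give a value of about 61 under these conventions, which puts the single-digit losses reported for this task into perspective.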
We compared SNU-based networks against the state-of-the-art ANN results chung_empirical_2014 ; greff_lstm:_2017 obtained with RNNs, GRUs and LSTMs, including a No Input Activation Function (NIAF) variant of an LSTM with $g(x) = x$, which nevertheless do not surpass the performance of the best task-specific model boulangerlewandowski_modeling_2012 , which obtains a loss of 5.56. The standard ANN assessment architecture comprises: an input layer receiving a set of input notes; followed by a hidden layer of analyzed units; followed by a softmax output layer predicting the next set of notes.
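
| Network | # of hidden units | # of hidden layer param. | Total # of parameters |
|---|---|---|---|
| RNN tanh chung_empirical_2014 | | | |
| GRU chung_empirical_2014 | | | |
| LSTM chung_empirical_2014 | | | |
| LSTM greff_lstm:_2017 | | | |
| NIAF-LSTM greff_lstm:_2017 | | | |
| SNU | | | |
| sSNU | | | |
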
We trained a feedforward network with 150 SNUs, or sSNUs, with trainable , and . The various network architectures and details are summarized in Tab. 4. Learning was performed using SGD for 2000, or 500 epochs, with the learning rate , or , respectively. The parameter adjustments were applied after presentation of each music piece, following the state-of-the-art convention. We executed the learning for 10 different random initializations and reported the mean of the lowest test negative log-probabilities.
An architecture with a single feedforward layer of spiking SNUs performed better than a standard recurrent layer of units, as illustrated in Fig. 8. However, it performed worse than more sophisticated neural units for temporal data processing. The results improved when we used soft SNUs, which can transmit intermediate values and make full use of the output softmax layer. The average negative log-probability of sSNUs was lower than that of GRUs chung_empirical_2014 with a similar number of parameters, and lower than the average for the best 10% of 200 trials executed for LSTMs and NIAF versions of LSTMs greff_lstm:_2017 , which require significantly more parameters. However, the best single NIAF-LSTM trial was able to achieve 8.38 greff_lstm:_2017 . In our case, the sSNUs performed consistently, with a minimum of 8.47 close to the mean of 8.49.
V Conclusion
For a long time, SNN and ANN research and applications have been developing separately. There has been significant effort to understand the SNN dynamics and to take advantage of their unique capabilities, albeit with limited success compared to the spectacular progress witnessed with ANNs. In this paper, we have tried to unify these neural network architectures by proposing the SNU that incorporates the spiking neural dynamics in a common ANN framework.
The transformation of the spiking model to an SNU allows SNNs to benefit from the advances in the ANN frameworks and also enables direct comparison of the dynamics of the spiking neural units with the state-of-the-art recurrent units. Moreover, with the sSNU variant, we have generalized the neural dynamics to the non-spiking case. Using this methodology, deep networks consisting of SNUs or sSNUs can be efficiently trained with BPTT. The benchmark results demonstrated that a feedforward sSNU-based network outperforms conventional RNN-, LSTM- or GRU-based networks in temporal tasks. Therefore, the sSNU offers an alternative stateful model with the lowest number of parameters among the existing models for temporal data processing. Even though the SNU-based networks demonstrated slightly inferior performance compared to the sSNU-based networks, the binary input-output characteristics of the spiking communication provide an interesting implementation alternative for low-power AI applications.
The proposed SNU family opens many new avenues for future work. It enables us to explore the capabilities of biologically-inspired neural models and benefit from their low computational requirements as well as their simplicity. It also provides an easy approach to training spiking networks that could increase their adoption for practical applications and would enable power-efficient neuromorphic hardware implementations. Finally, the compatibility of the SNU family with the ANN frameworks and models enables the use of existing or forthcoming ANN accelerators for SNN implementation and deployment.
References
 (1) Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–1105 (2012).
 (2) Szegedy, C. et al. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1–9 (2015).
 (3) He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016).
 (4) Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You Only Look Once: Unified, real-time object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016).
 (5) He, K., Gkioxari, G., Dollár, P. & Girshick, R. Mask R-CNN. In 2017 IEEE International Conference on Computer Vision (ICCV), 2980–2988 (2017).
 (6) Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
 (7) Sutskever, I., Vinyals, O. & Le, Q. V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 3104–3112 (2014).
 (8) Amodei, D. et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML’16, 173–182 (JMLR.org, New York, NY, USA, 2016).
 (9) Dayan, P. & Abbott, L. F. Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems (The MIT Press, 2005).
 (10) Eliasmith, C. How to Build a Brain: A Neural Architecture for Biological Cognition (Oxford University Press, 2013).
 (11) Gerstner, W., Kistler, W. M., Naud, R. & Paninski, L. Neuronal Dynamics: From Single Neurons to Networks and Models of Cognition (Cambridge University Press, 2014).
 (12) Maass, W. Networks of spiking neurons: The third generation of neural network models. Neural Networks 10, 1659–1671 (1997).
 (13) Mead, C. Neuromorphic electronic systems. Proceedings of the IEEE 78, 1629–1636 (1990).
 (14) Meier, K. A mixed-signal universal neuromorphic computing system. In 2015 IEEE International Electron Devices Meeting (IEDM), 4.6.1–4.6.4 (IEEE, Washington, DC, 2015).
 (15) Benjamin, B. V. et al. Neurogrid: A mixed-analog-digital multichip system for large-scale neural simulations. Proceedings of the IEEE 102, 699–716 (2014).
 (16) Cassidy, A. S. et al. Real-time scalable cortical computing at 46 giga-synaptic OPS/Watt with ~100x speedup in time-to-solution and ~100,000x reduction in energy-to-solution. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 27–38 (IEEE, Piscataway, NJ, USA, 2014).
 (17) Davies, M. et al. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro 38, 82–99 (2018).
 (18) Kuzum, D., Yu, S. & Philip Wong, H.-S. Synaptic electronics: Materials, devices and applications. Nanotechnology 24, 382001 (2013).
 (19) Tuma, T., Pantazi, A., Le Gallo, M., Sebastian, A. & Eleftheriou, E. Stochastic phase-change neurons. Nature Nanotechnology 11, 693–699 (2016).
 (20) Woźniak, S., Tuma, T., Pantazi, A. & Eleftheriou, E. Learning spatio-temporal patterns in the presence of input noise using phase-change memristors. In 2016 IEEE International Symposium on Circuits and Systems (ISCAS), 365–368 (IEEE, 2016).
 (21) Pantazi, A., Woźniak, S., Tuma, T. & Eleftheriou, E. All-memristive neuromorphic computing with level-tuned neurons. Nanotechnology 27, 355205 (2016).
 (22) Tuma, T., Le Gallo, M., Sebastian, A. & Eleftheriou, E. Detecting correlations using phase-change neurons and synapses. IEEE Electron Device Letters 37, 1238–1241 (2016).
 (23) Markram, H. The Blue Brain Project. Nature Reviews Neuroscience 7, 153 (2006).
 (24) Izhikevich, E. M. & Edelman, G. M. Large-scale model of mammalian thalamocortical systems. Proceedings of the National Academy of Sciences 105, 3593–3598 (2008).
 (25) Ananthanarayanan, R., Esser, S. K., Simon, H. D. & Modha, D. S. The cat is out of the bag. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 1–12 (IEEE, 2009).
 (26) Markram, H. The Human Brain Project. Scientific American 306, 50–55 (2012).
 (27) Eliasmith, C. et al. A large-scale model of the functioning brain. Science 338, 1202–1205 (2012).
 (28) Rasmussen, D. & Eliasmith, C. A spiking neural model applied to the study of human performance and cognitive decline on Raven’s Advanced Progressive Matrices. Intelligence 42, 53–82 (2014).
 (29) Maass, W. On the computational power of Winner-Take-All. Neural Computation 12, 2519–2535 (2000).
 (30) Maass, W., Natschläger, T. & Markram, H. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation 14, 2531–2560 (2002).
 (31) Markram, H., Lübke, J., Frotscher, M. & Sakmann, B. Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science 275, 213–215 (1997).
 (32) Song, S., Miller, K. D. & Abbott, L. F. Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neuroscience 3, 919–926 (2000).
 (33) Moraitis, T. et al. Fatiguing STDP: Learning from spike-timing codes in the presence of rate codes. In 2017 International Joint Conference on Neural Networks (IJCNN) (IEEE, 2017).
 (34) Gütig, R., Aharonov, R., Rotter, S. & Sompolinsky, H. Learning input correlations through nonlinear temporally asymmetric Hebbian plasticity. The Journal of Neuroscience 23, 3697–3714 (2003).
 (35) Querlioz, D., Bichler, O. & Gamrat, C. Simulation of a memristor-based spiking neural network immune to device variations. In 2011 International Joint Conference on Neural Networks (IJCNN), 1775–1781 (IEEE, 2011).
 (36) Diehl, P. U. & Cook, M. Unsupervised learning of digit recognition using spike-timing-dependent plasticity. Frontiers in Computational Neuroscience 9 (2015).
 (37) Sidler, S., Pantazi, A., Woźniak, S., Leblebici, Y. & Eleftheriou, E. Unsupervised learning using phase-change synapses and complementary patterns. In 2017 ENNS International Conference on Artificial Neural Networks (ICANN) (2017).
 (38) Woźniak, S. Unsupervised Learning of Phase-Change-Based Neuromorphic Systems. Doctoral dissertation, EPFL (2017).
 (39) Bichler, O., Querlioz, D., Thorpe, S. J., Bourgoin, J.-P. & Gamrat, C. Extraction of temporally correlated features from dynamic vision sensors with spike-timing-dependent plasticity. Neural Networks 32, 339–348 (2012).
 (40) Burbank, K. S. Mirrored STDP implements autoencoder learning in a network of spiking neurons. PLOS Computational Biology 11, 1–25 (2015).
 (41) Woźniak, S., Pantazi, A., Leblebici, Y. & Eleftheriou, E. Neuromorphic system with phase-change synapses for pattern learning and feature extraction. In 2017 International Joint Conference on Neural Networks (IJCNN) (IEEE, 2017).
 (42) O’Connor, P., Neil, D., Liu, S.-C., Delbruck, T. & Pfeiffer, M. Real-time classification and sensor fusion with a spiking deep belief network. Frontiers in Neuroscience 7 (2013).
 (43) Diehl, P. U., Zarrella, G., Cassidy, A., Pedroni, B. U. & Neftci, E. Conversion of artificial recurrent neural networks to spiking neural networks for low-power neuromorphic hardware. In Rebooting Computing (ICRC), IEEE International Conference on, 1–8 (IEEE, 2016).
 (44) Rueckauer, B. & Liu, S.-C. Conversion of analog to spiking neural networks using sparse temporal coding. In 2018 IEEE International Symposium on Circuits and Systems (ISCAS), 1–5 (IEEE, Florence, Italy, 2018).
 (45) Bohte, S. M., Kok, J. N. & La Poutré, H. Error-backpropagation in temporally encoded networks of spiking neurons. Neurocomputing 48, 17–37 (2002).
 (46) Anwani, N. & Rajendran, B. NormAD - normalized approximate descent based supervised learning rule for spiking neurons. In 2015 International Joint Conference on Neural Networks (IJCNN), 1–8 (IEEE, 2015).
 (47) Bengio, Y., Mesnard, T., Fischer, A., Zhang, S. & Wu, Y. STDP-compatible approximation of backpropagation in an energy-based model. Neural Computation 29, 555–577 (2017).
 (48) Tavanaei, A. & Maida, A. S. BP-STDP: Approximating backpropagation using spike timing dependent plasticity. arXiv:1711.04214 [cs] (2017). ArXiv: 1711.04214.
 (49) Hunsberger, E. & Eliasmith, C. Spiking deep networks with LIF neurons. arXiv:1510.08829 [cs] (2015). ArXiv: 1510.08829.
 (50) Esser, S. K. et al. Convolutional networks for fast, energy-efficient neuromorphic computing. Proceedings of the National Academy of Sciences 201604850 (2016).
 (51) Lee, J. H., Delbruck, T. & Pfeiffer, M. Training deep spiking neural networks using backpropagation. Frontiers in Neuroscience 10 (2016).
 (52) Huh, D. & Sejnowski, T. J. Gradient descent for spiking neural networks. arXiv preprint arXiv:1706.04698 (2017).
 (53) Bellec, G., Salaj, D., Subramoney, A., Legenstein, R. & Maass, W. Long short-term memory and learning-to-learn in networks of spiking neurons. arXiv:1803.09574 [cs, q-bio] (2018). ArXiv: 1803.09574.
 (54) Nair, V. & Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, ICML’10, 807–814 (Omnipress, USA, 2010).
 (55) Waibel, A. Modular construction of Time-Delay Neural Networks for speech recognition. Neural Computation 1, 39–46 (1989).
 (56) Gers, F. A., Schmidhuber, J. & Cummins, F. Learning to forget: Continual prediction with LSTM. Neural Computation 12, 2451–2471 (1999).
 (57) Cho, K., van Merrienboer, B., Bahdanau, D. & Bengio, Y. On the properties of neural machine translation: Encoder-decoder approaches. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8), 2014 (2014).
 (58) Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014 (2014).
 (59) http://www.tensorflow.org.
 (60) http://brian2.readthedocs.io.
 (61) Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R. & Schmidhuber, J. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems 28, 2222–2232 (2017).
 (62) Bengio, Y., Léonard, N. & Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv:1308.3432 [cs] (2013). ArXiv: 1308.3432.
 (63) Werbos, P. J. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks 1, 339–356 (1988).
 (64) Boulanger-Lewandowski, N., Bengio, Y. & Vincent, P. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In Proceedings of the 29th International Conference on Machine Learning, ICML’12, 1881–1888 (Omnipress, USA, 2012).