I Introduction
Biological snn are evolution’s highly efficient solution to the problem of sensory signal processing. Therefore, taking inspiration from the brain is a natural approach to engineering more efficient computing architectures. In the area of machine learning, rnn, a class of stateful neural networks whose internal states evolve with time
(Box. II), have proven highly effective at solving realtime pattern recognition and noisy time series prediction problems
[1]. rnn and biological neural networks share several properties, such as a similar general architecture, temporal dynamics and learning through weight adjustments. Based on these similarities, a growing body of work is now establishing formal equivalences between rnn and networks of lif neurons which are widely used in computational neuroscience
[2, 3, 4, 5].rnn are typically trained using an optimization procedure in which the parameters or weights are adjusted to minimize a given objective function. Efficiently training largescale rnn is challenging due to a variety of extrinsic factors, such as noise and nonstationary of the data, but also due to the inherent difficulties of optimizing functions with longrange temporal and spatial dependencies. In snn and binary rnn, these difficulties are compounded by the nondifferentiable dynamics implied by the binary nature of their outputs. While a considerable body of work has successfully demonstrated training of twolayer snn [6, 7, 8], i.e. networks without hidden units, the ability to train deeper snn with hidden layers has remained a major obstacle. Because hidden units and depth are crucial to efficiently solve many realworld problems, overcoming this obstacle is vital.
As network models grow larger and make their way into embedded and automotive applications, their power efficiency becomes increasingly important. Simplified neural network architectures that can run natively and efficiently on dedicated hardware are now being devised. This includes, for instance, networks of binary neurons which can dispense with energetically costly floatingpoint multiplications, or neuromorphic hardware that emulate the dynamics of snn [9].
These new hardware developments have created an imminent need for tools and strategies enabling efficient inference and learning in snn and binary rnn. In this article we discuss and address the inherent difficulties in training snn with hidden layers, and introduce various strategies and approximations used to successfully implement them.
Ii Understanding snn as rnn
We start with a brief walkthrough on how to formally map a snn to a rnn. Formulating snn as rnn will allow us to directly transfer and apply existing methods to train rnn and will serve as the conceptual framework for the rest of this article. To this end, we will first introduce the lif neuron model with currentbased synapses which has wide use in computational neuroscience
[10]. Next, we will reformulate this model in discrete time and show its formal equivalence to a rnn with binary activation functions. Readers familiar with the lif neuron model can skip the following steps up to Equation eq:mem_discrete_time.
[tbhp] [backgroundcolor=black!10]
rnn are networks of interconnected units, or neurons, in which
the network state at any point in time, is a function of
both external input and the network’s state at the previous
time point .
One popular rnn structure arranges neurons in multiple layers where every layer is recurrently connected and also receives input from the previous layer. More precisely, the dynamics of a network with layers is given by:
y^l[n] = & σ(a^l[n]) for
a^l [n] = & V^l y^l [n1] + W^l y^l1 [n1] for
y^0[n] ≡& x[n]
where
is the state vector of the neurons at layer
, is an activation function, and and are the recurrent and feedforward weight matrices of layer , respectively. External inputs , typically arrive at the first layer.A lif neuron with index can formally be described in differential form as
(1) 
where is the membrane potential, is the resting potential, is the membrane time constant, is the input resistance, and is the input current [10]. Equation eq:lif_basic shows that acts as a leaky integrator of the input current . Neurons emit spikes to communicate their output to other neurons when their membrane voltage reaches the firing threshold . After each spike, the membrane voltage is reset to the resting potential (Fig. 1). Due to this reset, Equation eq:lif_basic only describes the subthreshold dynamics of a lif neuron, i.e. the dynamics in absence of spiking output of the neuron.
In snn the input current is typically generated by synaptic currents triggered by the arrival of presynaptic spikes . When working with differential equations, it is convenient to denote a spike train as a sum of Dirac delta functions with the firing times of neuron .
Synaptic currents follow specific temporal dynamics themselves. A common firstorder approximation is to model their time course as an exponential current following each presynaptic spike. Moreover, we assume that synaptic currents sum linearly. The dynamics of these operations are given by
(2) 
where the sum runs over all presynaptic neurons and are the corresponding afferent weights. Neuron has a self connection if the index appears in this sum. Because of this property we can simulate a single lif neuron with two linear differential equations whose initial conditions change instantaneously whenever a spike occurs.
We can incorporate the reset term in Equation eq:lif_basic through an extra term that instantaneously pulls the membrane potential down by an amount whenever the neuron emits a spike:
(3) 
It is customary to solve Equations eq:cuba_syn and eq:lif numerically in discrete time which allows us to interpret the spike train of neuron as a nonlinear function of the membrane voltage where denotes the Heaviside step function. Without loss of generality, we set , , and . When using a small simulation time step , Equation eq:cuba_syn is well approximated by
(4) 
with the decay strength . Note that for finite and positive . Moreover, . We use to denote the time step to emphasize the discrete dynamics. We can now express Equation eq:lif as
(5) 
with . The output spikes of neuron are simply given by .
Equations eq:syn_discrete_time and eq:mem_discrete_time characterize the dynamics of a rnn. Specifically, the state of neuron is given by the instantaneous synaptic currents and the membrane voltage (Box. II). The computations necessary to update the cell state can be unrolled in time as illustrated in the computational graph (Figure 2).
We have now seen that snn constitute a special case of rnn. However, so far we have not explained how their parameters are set to implement a specific computational function. This is the focus of the rest of this tutorial, in which we present a variety of learning algorithms that systematically change the parameters towards implementing a specific function.
Iii Methods for training rnn
Powerful machine learning methods are able to train rnn for a variety of tasks ranging from time series prediction, to language translation, to automatic speech recognition. In the following, we discuss the most common methods before analyzing their applicability to snn.
There are several stereotypical ingredients that define the training process. The first ingredient is a cost or loss function which is minimized when the network’s response corresponds to the desired behavior. In time series prediction, for example, this loss could be the squared difference between the predicted and the true value. The second ingredient is a mechanism that updates the network’s weights to minimize the loss. One of the simplest and most powerful mechanisms to achieve this is to perform gradient descent on the loss function. In network architectures with
hidden units (i.e. units whose activity affect the loss indirectly through other units) the parameter updates contain terms relating to the activity and weights of the units they project to (downstream units). Gradientdescent learning can solve this credit assignment problemby providing explicit expressions for these updates through the chain rule of derivatives. As we will now see, the learning of hidden unit parameters depends on an efficient method to compute these gradients. When discussing these methods, we distinguish between solving the spatial credit assignment problem which affects mlp and rnn in the same way and the temporal credit assignment problem which only occurs in rnn.
Iiia Spatial credit assignment
In mlp, gradient descent can be implemented efficiently using the backpropagation of error algorithm (Box. IIIA). In its simplest form, this algorithm propagates errors “backwards” from the output of the network to earlier (upstream) neurons.
Using backpropagation to adjust hidden layer weights ensures that the weight update will reduce the cost function for the current training example, provided the learning rate is small enough. While this theoretical guarantee is desirable, it comes at the cost of certain communication requirements — namely that gradients have to be communicated back through the network — and increased memory requirements as the neuron states need to be kept in memory until the errors become available.
[tbhp] [backgroundcolor=black!10]
The task of learning is to minimize a cost function over the entire dataset. In a neural network, this can be achieved by gradient descent, which modifies the network parameters in the direction opposite to the gradient:
(6) 
with the total input to the neuron and a small learning rate. The first term is the error of neuron and the second term reflects the sensitivity of the neuron output to changes in the parameter. In multilayer networks, gradient descent is expressed as the backpropagation of the errors starting from the prediction (output) layer to the inputs. Using superscripts to denote the layer ( is input, is output):
(7) 
where the is the error of output neuron and and indicates the transpose.
This update rule is ubiquitous in deep learning and known as the gradient backpropagation algorithm
[1]. Learning is typically carried out in forward passes (evaluation of the neural network activities) and backward passes (evaluation of s).The same rule can be applied to rnn. In this case the recurrence is “unrolled” meaning that an auxiliary network is created by making copies of the network for each time step.
The unrolled network is simply a deep network with shared feedforward weights and recurrent weights , on which the standard bp applies:
(8) 
Applying backpropagation to an unrolled network is referred to as bptt.
IiiB Temporal credit assignment
When training rnn, we also have to consider temporal dependencies which requires solving a temporal credit assignment problem (Fig. 2). In the following we review two common methods to achieve this:

The “backward” method: This method applies the same strategies as with spatial credit assignment by “unrolling” the network in time (Box. IIIA). bptt solves the temporal credit assignment problem by back propagating errors through the unrolled network. This method works backward through time after completing a forward update. The use of standard backpropagation on the unrolled network directly enables the use of autodifferentiation tools offered in modern machine learning toolkits [3, 11].

The forward method: In some situations it is beneficial to propagate all necessary information for gradient computation forward in time [12]. This formulation is achieved by computing the gradient of a cost function and maintaining the recursive structure of the rnn. For example, the “forward gradient” of the feedforward weight becomes:
(9) Gradients with respect to recurrent weights can be computed in a similar fashion [12].
The backward optimization method is generally more efficient in terms of computation, but requires maintaining all the inputs and activations for each time step. Thus, its space complexity for each layer is , where is the number of neurons per layer and is the number of time steps. On the other hand, the forward method requires maintaining variables , resulting in a space complexity per layer. This method is more appealing from a biological point of view, since the learning rule can be made consistent with synaptic plasticity in the brain and “threefactor” rules, as discussed in section VB. In summary, efficient algorithms to train rnn exist. However, to apply them to snn we first need to overcome several key challenges which we will discuss next.
Iv Credit assignment with spiking neurons: Challenges and solutions
Before the solutions introduced above can be applied to snn, two key challenges need to be overcome. The first challenge is the nondifferentiability of the spiking nonlinearity. Equations eq:bptt and eq:forward_mode_differentiation reveal that the expressions for both the forward and the backward learning methods contain the derivative of the neural activation function as a multiplicative factor. For a spiking neuron, however, we have , whose derivative is zero everywhere except at , where it is ill defined (Fig. 3). This allornothing behavior of the binary spiking nonlinearity stops gradients from “flowing” and makes lif neurons unsuitable for gradient based optimization. The same issue occurs in binary neurons and some of the solutions proposed here are inspired by the methods first developed in binary networks [13, 14].
The second challenge concerns the implementation of the optimization algorithm itself. Standard backpropagation can be expensive in terms of computation, memory and communication, and may be poorly suited to the constraints dictated by the hardware that implements it (e.g. a computer, a brain, or a neuromorphic device). As memory is ample in modern computers, the backward method is often preferred. On the other hand, processing in dedicated neuromorphic hardware and, more generally, nonvon Neumann computers, may have specific locality requirements (Box. VA) that can complicate matters. On such hardware, the forward approach may therefore be preferable. In practice, however, the scaling of both methods has proven unsuitable for many snn models and additional simplifying approximations are common. In the following sections, we describe solutions to these challenges and approximations that make learning in snn more tractable.
To overcome the first challenge in training snn, which is concerned with the “hard” spiking nonlinearity, several approaches have been devised with varying degrees of success. The most common approaches can be coarsely classified into the following categories: i) resorting to entirely biologically inspired local learning rules for the hidden units, ii) translating conventionally trained “ratebased” neural networks to snn, iii) smoothing the network model to be continuously differentiable, or iv) defining a surrogate gradient as a continuous relaxation of the real gradients. Approaches pertaining biologically motivated local learning rules and network translation have been reviewed extensively in
Tavanaei et al. [5]. In this tutorial we therefore focus on the latter two supervised approaches which we will refer to as the “smoothed” and the sg approach. First, we review existing literature on common “smoothing” approaches before turning to an indepth discussion of how to build functional snn using sg methods.Iva Smoothed spiking neural networks
The defining characteristic of smoothed snn is that their formulation ensures wellbehaved gradients which are directly suitable for optimization. Smooth models can be further categorized into (i) soft nonlinearity models, (ii) probabilistic models, for which gradients are only well defined in expectation, or models which either rely entirely on (iii) rate or (iv) singlespike temporal codes.
IvA1 Gradients in soft nonlinearity models
This approach can in principle be applied directly to all spiking neuron models which explicitly include a smooth spike generating process. This includes, for instance, the HodgkinHuxley, MorrisLecar, and FitzHughNagumo models [10]. In practice this approach has only been applied successfully by Huh and Sejnowski [17] using an augmented integrateandfire model in which the binary spiking nonlinearity was replaced by a continuousvalued gating function. The resulting network constitutes a rnn which can be optimized using standard methods of bptt or rtrl. Importantly, the soft threshold models compromise on one of the key features of snn, namely the binary activation function.
IvA2 Gradients in probabilistic models
Another example for smooth models are binary probabilistic models. In simple terms, stochasticity effectively smooths out the hard binary nonlinearity which makes it possible to define a gradient on expectation values. Binary probabilistic models have been objects of extensive study in the machine learning literature mainly in the context of (restricted) Boltzmann machines
[18]. Similarly, the propagation of gradients has been studied for binary stochastic models [14].Probabilistic models are practically useful because the loglikelihood of a spike train is a smooth quantity which can be optimized using gradient descent [19]. Although this insight was first discovered in networks without hidden units, the same ideas were later extended to multilayer networks [20]. Similarly, Guerguiev et al. [21] used probabilistic neurons to study biologically plausible ways of propagating error or target signals using segregated dendrites (see Section VA). In a similar vein, variational learning approaches were shown to be capable of learning useful hidden layer representations in snn [22, 23, 24]. However, the injected noise necessary to smooth out the effect of binary nonlinearities often poses a challenge for optimization [23].
IvA3 Gradients in ratecoding networks
Another common approach to obtain gradients in snn is to assume a ratebased coding scheme. The main idea is that spike rate is the underlying informationcarrying quantity. For many plausible neuron models, the suprathreshold firing rate depends smoothly on the neuron input. This inputoutput dependence is captured by the socalled fI curve of a neuron. In such cases, the derivative of the fI curves is suitable for gradientbased optimization.
There are several examples of this approach. For instance Hunsberger and Eliasmith [25], Neftci et al. [26] used an effectively ratecoded input scheme to demonstrate competitive performance on standard machine learning benchmarks such as CIFAR10 and MNIST. Similarly Lee et al. [27] demonstrated deep learning in snn by defining partial derivatives on lowpass filtered spike trains.
Ratebased approaches can offer good performance, but they may be inefficient. On the one hand, precise estimation of firing rates requires averaging over a number of spikes. Such averaging requires either relatively high firing rates or long averaging times because several repeats are needed to average out discretization noise. This problem can be partially addressed by spatially averaging over large populations of spiking neurons, though this requires the use of more neurons.
IvA4 Gradients in singlespiketimingcoding networks
In an effort to optimize snn without potentially harmful noise injection and without reverting to a ratebased coding scheme, several studies have formulated snn as a set of firing times. In such a temporal coding setting, individual spikes could carry significantly more information than ratebased schemes.
The idea behind training temporal coding networks was pioneered in SpikeProp [28]. In this work the analytic expressions of firing times for hidden units were linearized, allowing to analytically compute approximate hidden layer gradients. More recently, a similar approach without the need of linearization was used in [29] where the author computed the spike timing gradients explicitly for nonleaky integrateandfire neurons. Intriguingly, the work showed competitive performance on conventional networks and benchmarks.
Although the spike timing formulation does in some cases yield welldefined gradients, it may suffer from certain limitations. For instance, the formulation of SpikeProp [28] required each hidden unit to emit exactly one spike per trial. Moreover, it is difficult to define firing time for quiescent units, although population sparseness is presumably crucial for power efficiency.
IvB Surrogate gradients
sg methods provide an alternative approach for overcoming the difficulties associated with the “hard” nonlinearity when training snn. Their defining characteristic is that instead of changing the model definition as in the smoothed approaches, a sg is introduced as a continuous relaxation of the nonsmooth spiking nonlinearity for purposes of numerical optimization (Fig. 4). Importantly, this approximation is only used in the parameter updates and does not affect the forward pass of the network directly. Moreover, the use of sg allows to efficiently train snn endtoend without the need to specify which coding scheme is to be used in the hidden layers.
Like standard gradientdescent, sg learning can deal with the spatial and temporal credit assignment problem by either bptt or by forward methods, e.g. through the use of eligibility traces (see Section IIIB for details). Alternatively, additional approximations can be introduced which may offer advantages specifically for hardware implementations. In the following, we briefly review existing work relying on sg methods before turning to a more indepth treatment of the underlying principles and capabilities.
IvB1 Surrogate derivatives for spiking nonlinearity
A set of works have used sg to specifically overcome the challenge of the “hard” nonlinearity. In these works typically a standard algorithm such as bptt is used with one minor modification: within the algorithm each occurrence of the derivative of the spiking nonlinearity is replaced by the derivative of a smooth function. Implementing these approaches is possible in most autodifferentiationenabled machine learning toolkits.
One of the first uses of such a sg is described in Bohte [16]
where the derivative of a spiking neuron nonlinearity was approximated by the derivative of a truncated quadratic function, thus resulting in a relu as surrogate derivative
(Fig. 3). This is similar in flavor to the solution proposed to optimize binary neural networks [13]. The same idea underlies the training of largescale convolutional networks with binary activations on classification problems using neuromorphic hardware [15]. Zenke and Ganguli [2] proposed a three factor online learning rule using the negative part of a fast sigmoid to construct a sg. Recently, Bellec et al. [3] successfully trained networks of neurons with slow temporal dynamics using the same surrogate derivative. Encouragingly, the authors found that such networks can perform on par with conventional lstm networks. Finally, Shrestha and Orchard [11] used an exponential function and reported competitive performance on a range of neuromorphic benchmark problems.In summary, a plethora of studies have constructed sg using different nonlinearities and trained a diversity of snn architectures. These nonlinearties, however, have a common underlying theme. All functions are nonlinear and monotonically increasing towards the firing threshold (Fig. 3). While a more systematic comparison of different surrogate nonlinearities is still pending, overall the diversity found in the present literature suggests that the success of the method is not crucially dependent on the details of the surrogate used to approximate the derivative.
IvB2 Surrogate gradients affecting locality of the update rules
The majority of studies discussed in the previous section introduced a surrogate nonlinearity to prevent gradients from vanishing (or exploding), but by relying on methods such as bptt, they did not explicitly affect the structural properties of the learning rules. There are, however, training approaches for snn which introduce more farreaching modifications which may completely alter the way error signals or target signals are propagated (or generated) within the network. Such approaches are typically used in conjunction with the aforementioned surrogate derivatives. There are two main motivations for such modifications which are typically linked to physical constraints that make it impossible to implement the exact “correct” gradient descent algorithm. For instance, in neurobiology biophysical constraints make it impossible to implement bptt without further approximations. Studies interested in how the brain could solve the credit assignment problem focus on how simplified “local” algorithms could achieve similar performance while adhering to the biophysical constraints (Box. VA). Similarly, neuromorphic hardware may pose certain constraints with regard to memory or communications which impede the use of bptt and call for simpler and often more local methods for training on such devices. In the following Applications Section, we will review several promising sg approaches which introduce far larger deviations from the “true gradients” and often allow learning at a greatly reduced complexity and computational cost.
V Applications
In this section, we present a number of illustrative applications of smooth or surrogate gradients to snn which exploit a number of the unique features of snn, namely, their internal continuoustime dynamics that allow them to use time as a coding dimension; and their inputdriven nature where the network stays quiescent until incoming inputs/spikes trigger activity in the network.
Va Feedback alignment and random error bp
One family of algorithms that relaxes some of the requirements of bp are feedback alignment or, more generally, random bp algorithms [30, 32, 31]. These are approximations to the gradient bp rule that sidestep the nonlocality problem by replacing weights in the bp rule with random ones (Fig. 5b): where
is a fixed, random matrix with the same dimensions as
. The replacement of with a random matrix breaks the dependency of the backward phase on , enabling the rule to be more local. One common variation is to replace the entire backward propagation by a random propagation of the errors to each layer (Fig. 5c) [31]: where is a fixed, random matrix with appropriate dimensions.Random bp approaches lead to remarkably little loss in classification performance on some benchmark tasks. Although a general theoretical understanding of random bp is still a subject of intense research, extended simulations of linear networks show that, during learning, the network adjusts its feedforward weights such that they partially align with the (random) feedback weights, thus permitting them to convey useful error information [30]. Building on these findings, an asynchronous spikedriven adaptation of random bp using local synaptic plasticity rules with the dynamics of spiking neurons was demonstrated in [26]. To obtain the surrogate gradients, the authors approximated the derivative of the neural activation function using a boxcar function. Networks using this learning rule performed remarkably well, and were shown to operate continuously and asynchronously without the alternation between forward and backward passes that is necessary in bp.
[tbhp] [backgroundcolor=black!10]
Locality of computations is characterized by the set variables available to the physical processing elements, and depends on the computational substrate. To illustrate the concept of locality, we assume two neurons, and , and would like Neuron to implement a function on domain defined as:
Here, refers to the output of neuron seconds ago, , are the respective membrane potentials, and is the synaptic weight from to . Variables under are directly available to Neuron A and are thus local to it.
On the other hand, variable is temporally nonlocal and is spatially nonlocal to neuron . Nonlocal information can be transmitted through special structures, for example dedicated encoders and decoders for and a form of working memory (WM) for . Although locality in a model of computation can make its use challenging, it enables massively parallel computations with dynamical interprocess communications.
VB Supervised learning with local three factor learning rules
SuperSpike is a biologically plausible three factor learning rule which makes use of several of the approximations introduced above [2]. Although the underlying motivation of the study is geared toward a deeper understanding of learning in biological neural networks, the learning rule may prove interesting for hardware implementations because it does not rely on bptt. Specifically, the rule uses synaptic eligibility traces to solve the temporal credit assignment problem.
We now provide a short account on why SuperSpike can be seen as one of the forwardintime optimization procedures. SuperSpike was derived for temporal supervised learning tasks in which a given output neuron learns to spike at predefined times. To that end, SuperSpike minimizes the van Rossum distance with kernel
between a set of output spike train and their corresponding target spike trains(10) 
where the last approximation corresponds to transitioning to discrete time. To perform online gradient descent, we need to compute the gradients of . Here we first encounter the derivative . Because the (discrete) convolution is a linear operator, this expression simplifies to . In SuperSpike is implemented as a dynamical system (see [2] for details). To compute derivatives of the neuron’s output spiketrain of the form we differentiate the network dynamics (Equations eq:syn_discrete_time and eq:mem_discrete_time) and obtain
(11)  
(12)  
(13) 
The above equations define a dynamical system, which, given the starting conditions , can be simulated online and forward in time to produce all relevant derivatives. Crucially, to arrive at useful surrogate gradients, SuperSpike makes two approximations. First, is replaced by a smooth surrogate derivative (cf. Fig. 3). Second, the reset term with the negative sign in Equation eq:deriv_mem_update is dropped, which empirically leads to better results. With these definitions in hand, the final weight updates are given by
(14) 
where . These weight updates depend only on local quantities (Box. VA).
Above, we have considered a simple two layer network (cf. Fig. 2). If we were to apply the same strategy to compute updates in a network with a hidden layer or with recurrent connections, the equations would become more complicated and nonlocal. To avoid introducing nonlocality, SuperSpike propagates error signals from the output layer directly to the hidden units like in random bp or direct feedback alignment (cf. Section VA; Fig. 5c; [30, 32, 31]). With these approximations SuperSpike achieves temporal credit assignment by propagating all relevant quantities forward in time, while it relies on random bp to perform spatial credit assignment.
VC Encoding information in spike times
There are various ways of encoding information in spike streams. One particularly powerful encoding uses spike times as the information carrying quantities. This capitalizes on the continuoustime nature of SNNs and results in highly sparse network activity as the emission time of even a single spike can encode significant information. For this example, we use nonleaky integrate and fire neurons described by: dUidt = I_i with I_i = ∑_j W_ij∑_r Θ(t  t_i^r)exp((tt_i^r)) where is the time of the spike from neuron , and is the Heaviside step function.
Consider the simple exclusive or (XOR) problem in the temporal domain: A network receives two spikes, one from each of two different sources. Each spike can either be “early” or “late”. The network has to learn to distinguish between the case in which the spikes are either both early or both late, and the case where one spike is early and the other is late (Fig. (a)a). When designing a SNN, there is significant freedom in how the network input and output are encoded. In this case, we use a firsttospike code in which we have two output neurons and the binary classification result is represented by the output neuron that spikes first. Figure (b)b shows the network’s response after training (see [29] for details on the training process). For the first input class (early/late or late/early), one output neuron spikes first and for the other class (early/early or late/late), the other output neuron spikes first.
VD Learning using local errors
Multilayer neural networks are hierarchical feature extractors. Through successive linear projections and pointwise nonlinearities, neurons become tuned (respond most strongly) to particular spatiotemporal features in the input. While the best features are those that take into account the subsequent processing stages and which are learned to minimize the final error (as the features learned using backpropagation do), highquality features can also be obtained by more local methods. The nonlocal component of the weight update equation (Eq. eq:bp_deep) is the error term . Instead of obtaining this error term through backpropagation, we require that it be generated using information local to the layer. One way of achieving this is to define a layerwise loss and use this local loss to obtain the errors. In such a local learning setting, the local errors becomes: δ_i^l [n] = σ’(a_i^l[n] ) ddyil[n]L^l( y^l[n]) where L^l( y^l[n]) ≡L( W^l_r y^l[n],^y^l[n]) with a pseudotarget for layer , and a fixed random matrix that projects the activity vector at layer to a vector having the same dimension as the pseudotarget. In essence, this formulation assumes that an auxiliary random layer is attached to layer and the goal is to modify so as to minimize the discrepancy between the auxiliary random layer’s output and the pseudotarget. The simplest choice for the pseudotarget is to use the toplayer target. This forces each layer to learn a set of features that are able to match the toplayer target after undergoing a fixed random linear projection. Each layer builds on the features learned by the layer below it, and we empirically observe that higher layers are able to learn higherquality features that allow their random and fixed auxiliary layers to better match the target [33].
This approach was recently used in snn in combination with the SuperSpike (cf. Section VB) forward method to overcome the temporal credit assignment problem [4]. As in SuperSpike, the snn model is simplified by using a feedforward structure, and omitting the refractory dynamics in the optimization. However, the cost function was defined to operate locally on the instantaneous rates of each layer. This simplification results in a forward method whose space complexity scales as (instead of for the forward method or for the backward method), while still making use of spiking neural dynamics. Thus the method constitutes a highly efficient synaptic plasticity rule for multilayer snn. Furthermore, the simplifications enable the use of existing autodifferentation methods in machine learning frameworks to systematically derive synaptic plasticity rules from taskrelevant cost functions and neural dynamics. This approach was benchmarked on the DVS Gestures dataset (Fig. 7), and performs on par with standard bp or bptt rules.
Vi Discussion
We have outlined how snn can be studied within the framework of rnn and discussed successful approaches for training them. We have specifically focused on sg approaches for two reasons: sg approaches are able to train snn to unprecedented performance levels on a range of realworld problems. This transition marks the beginning of an exciting time in which snn will become increasingly interesting for applications which were previously dominated by rnn; sg provide a framework that ties together ideas from machine learning, computational neurosciences, and neuromorphic computing. From the viewpoint of computational neuroscience, the approaches presented in this paper are appealing because several of them are related to “threefactor” plasticity rules which are an important class of rules believed to underlie synaptic plasticity in the brain. Finally, for the neuromorphic community, sg methods provide a way to learn under various constraints on communication and storage which makes sg methods highly relevant for learning on custom lowpower neuromorphic devices.
The spectacular successes of modern ann were enabled by algorithmic and hardware advances that made it possible to efficiently train large ann on vast amounts of data. With temporal coding, snn are universal function approximators that are potentially far more powerful than ann with sigmoidal nonlinearities . Unlike largescale ann, which had to wait for several decades until the necessary computational resources were available for training them, we currently have the necessary resources, whether in the form of mainstream compute devices such as CPUs or GPUs, or custom neuromorphic devices, to train and deploy large snn. The fact that snn are less widely used than ann is thus primarily due to the algorithmic issue of trainability. In this tutorial, we provided an overview of various exciting developments that are gradually addressing the issues encountered when training snn. Fully addressing these issues would have immediate and wideranging implications, both technologically, and in relation to learning in biological brains.
Acknowledgments
This work was supported by the Intel Corporation (EN); the National Science Foundation under grant 1640081 (EN); the swiss national science foundation early postdoc mobility grant P2ZHP2_164960 (HM) ; the Wellcome Trust [110124/Z/15/Z] (FZ).
References
 Goodfellow et al. [2016] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press, 2016.
 Zenke and Ganguli [2018] F. Zenke and S. Ganguli, “SuperSpike: Supervised Learning in Multilayer Spiking Neural Networks,” Neural Computation, vol. 30, no. 6, pp. 1514–1541, Apr. 2018.

Bellec et al. [2018]
G. Bellec, D. Salaj, A. Subramoney, R. Legenstein, and W. Maass, “Long shortterm memory and Learningtolearn in networks of spiking neurons,” in
Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett, Eds. Curran Associates, Inc., 2018, pp. 795–805.  Kaiser et al. [2018] J. Kaiser, H. Mostafa, and E. Neftci, “Synaptic plasticity for deep continuous local learning,” arXiv preprint arXiv:1812.10766, 2018.
 Tavanaei et al. [2018] A. Tavanaei, M. Ghodrati, S. R. Kheradpisheh, T. Masquelier, and A. Maida, “Deep learning in spiking neural networks,” Neural Networks, Dec. 2018.
 Gütig [2014] R. Gütig, “To spike, or when to spike?” Current Opinion in Neurobiology, vol. 25, pp. 134–139, Apr. 2014.
 Memmesheimer et al. [2014] R.M. Memmesheimer, R. Rubin, B. Ölveczky, and H. Sompolinsky, “Learning Precisely Timed Spikes,” Neuron, vol. 82, no. 4, pp. 925–938, May 2014.
 Anwani and Rajendran [2015] N. Anwani and B. Rajendran, “NormADnormalized approximate descent based supervised learning rule for spiking neurons,” in Neural Networks (IJCNN), 2015 International Joint Conference on. IEEE, 2015, pp. 1–8.
 Boahen [2017] K. Boahen, “A neuromorph’s prospectus,” Computing in Science Engineering, vol. 19, no. 2, pp. 14–28, Mar. 2017.
 Gerstner et al. [2014] W. Gerstner, W. M. Kistler, R. Naud, and L. Paninski, Neuronal dynamics: From single neurons to networks and models of cognition. Cambridge University Press, 2014.
 Shrestha and Orchard [2018] S. B. Shrestha and G. Orchard, “SLAYER: Spike Layer Error Reassignment in Time,” in Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett, Eds. Curran Associates, Inc., 2018, pp. 1419–1428.
 Williams and Zipser [1989] R. J. Williams and D. Zipser, “A learning algorithm for continually running fully recurrent neural networks,” Neural computation, vol. 1, no. 2, pp. 270–280, 1989.
 Courbariaux et al. [2016] M. Courbariaux, I. Hubara, D. Soudry, R. ElYaniv, and Y. Bengio, “Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or 1,” arXiv:1602.02830 [cs], Feb. 2016, arXiv: 1602.02830.
 Bengio et al. [2013] Y. Bengio, N. Léonard, and A. Courville, “Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation,” arXiv:1308.3432 [cs], Aug. 2013, arXiv: 1308.3432.
 Esser et al. [2016] S. K. Esser, P. A. Merolla, J. V. Arthur, A. S. Cassidy, R. Appuswamy, A. Andreopoulos, D. J. Berg, J. L. McKinstry, T. Melano, D. R. Barch, C. di Nolfo, P. Datta, A. Amir, B. Taba, M. D. Flickner, and D. S. Modha, “Convolutional networks for fast, energyefficient neuromorphic computing,” Proc Natl Acad Sci U S A, vol. 113, no. 41, pp. 11 441–11 446, Oct. 2016.
 Bohte [2011] S. M. Bohte, “ErrorBackpropagation in Networks of Fractionally Predictive Spiking Neurons,” in Artificial Neural Networks and Machine Learning – ICANN 2011, ser. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, Jun. 2011, pp. 60–68.
 Huh and Sejnowski [2018] D. Huh and T. J. Sejnowski, “Gradient Descent for Spiking Neural Networks,” in Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett, Eds. Curran Associates, Inc., 2018, pp. 1440–1450.
 Ackley et al. [1985] D. Ackley, G. Hinton, and T. Sejnowski, “A learning algorithm for Boltzmann machines,” Cognitive Science: A Multidisciplinary Journal, vol. 9, no. 1, pp. 147–169, 1985.
 Pfister et al. [2006] J.P. Pfister, T. Toyoizumi, D. Barber, and W. Gerstner, “Optimal SpikeTimingDependent Plasticity for Precise Action Potential Firing in Supervised Learning,” Neural Computation, vol. 18, no. 6, pp. 1318–1348, Apr. 2006.
 Gardner et al. [2015] B. Gardner, I. Sporea, and A. Grüning, “Learning Spatiotemporally Encoded Pattern Transformations in Structured Spiking Neural Networks,” Neural Comput, vol. 27, no. 12, pp. 2548–2586, Oct. 2015.
 Guerguiev et al. [2017] J. Guerguiev, T. P. Lillicrap, and B. A. Richards, “Towards deep learning with segregated dendrites,” eLife Sciences, vol. 6, p. e22901, Dec. 2017.
 Brea et al. [2013] J. Brea, W. Senn, and J.P. Pfister, “Matching Recall and Storage in Sequence Learning with Spiking Neural Networks,” J. Neurosci., vol. 33, no. 23, pp. 9565–9575, Jun. 2013.
 Rezende and Gerstner [2014] D. J. Rezende and W. Gerstner, “Stochastic variational learning in recurrent spiking networks,” Front. Comput. Neurosci, vol. 8, p. 38, 2014.
 Mostafa and Cauwenberghs [2018] H. Mostafa and G. Cauwenberghs, “A learning framework for winnertakeall networks with stochastic synapses,” Neural computation, vol. 30, no. 6, pp. 1542–1572, 2018.
 Hunsberger and Eliasmith [2015] E. Hunsberger and C. Eliasmith, “Spiking deep networks with lif neurons,” arXiv preprint arXiv:1510.08829, 2015.
 Neftci et al. [2017] E. O. Neftci, C. Augustine, S. Paul, and G. Detorakis, “Eventdriven random backpropagation: Enabling neuromorphic deep learning machines,” Frontiers in Neuroscience, vol. 11, p. 324, 2017.
 Lee et al. [2016] J. H. Lee, T. Delbruck, and M. Pfeiffer, “Training deep spiking neural networks using backpropagation,” Frontiers in Neuroscience, vol. 10, 2016.
 Bohte et al. [2002] S. M. Bohte, J. N. Kok, and H. La Poutre, “Errorbackpropagation in temporally encoded networks of spiking neurons,” Neurocomputing, vol. 48, no. 1, pp. 17–37, 2002.
 Mostafa [2018] H. Mostafa, “Supervised learning based on temporal coding in spiking neural networks,” IEEE transactions on neural networks and learning systems, vol. 29, no. 7, pp. 3227–3235, 2018.
 Lillicrap et al. [2016] T. P. Lillicrap, D. Cownden, D. B. Tweed, and C. J. Akerman, “Random synaptic feedback weights support error backpropagation for deep learning,” Nature Communications, vol. 7, 2016.
 Nøkland [2016] A. Nøkland, “Direct feedback alignment provides learning in deep neural networks,” in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Curran Associates, Inc., 2016, pp. 1037–1045.
 Baldi and Sadowski [2016] P. Baldi and P. Sadowski, “A theory of local learning, the learning channel, and the optimality of backpropagation,” Neural Networks, vol. 83, pp. 51–74, 2016.
 Mostafa et al. [2018] H. Mostafa, V. Ramesh, and G. Cauwenberghs, “Deep supervised learning using local errors,” Frontiers in neuroscience, vol. 12, p. 608, 2018.