1 Introduction
Understanding how the plasticity dynamics in multilayer biological neural networks are organized for efficient data-driven learning is a long-standing question in computational neuroscience (Zenke and Ganguli, 2017; Sussillo and Abbott, 2009; Clopath et al., 2010). The generally unmatched success of deep learning in a wide variety of data-driven tasks prompts the question of whether the ingredients of its success are compatible with biological neural networks, namely spiking neural networks. The answer to this question is largely positive (Neftci, 2018). However, biological neural networks differ from the assumptions made in artificial neural networks in their continuous-time dynamics, the locality of their operations (Baldi et al., 2017), and their spike-based communication. Taking these properties into account in a neural network is challenging. The spiking nature of the neurons’ nonlinearity makes it non-differentiable. The continuous-time dynamics involve temporal dependencies that create a challenging credit assignment problem. The assumption of local computations at the neuron disqualifies the use of backpropagation through time. Because of the failure to take all these properties into account, learning in spiking neural networks requires very large networks, a lot of time, or most often both, compared to artificial neural networks. Improving the learning performance of spiking neural networks is not only one step in the quest to understand the adaptive capabilities of the brain, but also a critical endeavor to build brain-inspired, neuromorphic computing technologies that emulate the dynamics of neural circuits (Neftci, 2018).
In this article, we describe Deep Continuous Local Learning (DCLL), a spiking neural network model with plasticity dynamics that is compatible with the properties of biological neural networks mentioned above, and learns at proficiencies comparable to those of small deep neural networks (Fig. 1). DCLL builds on recent work in training spiking neural networks using the strategies to train deep neural networks. Using layerwise local classifiers (Mostafa et al., 2017), the gradients are computed locally using pseudo-targets (usually the labels themselves). To take the temporal dynamics of the neurons into account, we use a Spike Response Model (SRM) and a soft threshold function for computing a surrogate gradient, similarly to SuperSpike (Zenke and Ganguli, 2017). The information needed to compute the gradient is propagated forward (as opposed to being stored, as in backpropagation through time, for example), making the plasticity rule temporally local. While SuperSpike scales at least quadratically with the number of neurons, our model scales linearly. To achieve this, we use a local rate-based cost function reminiscent of readout neurons in liquid state machines (Maass et al., 2002), but where the readout is performed over a fixed random combination of the neuron outputs. The rate-based readout does not have a temporal convolution term in the cost function, the absence of which enables linear scaling. Furthermore, the rate-based readout does not prevent learning precise temporal spike trains.
The local classifier in DCLL acts like an encoder-decoder layer reminiscent of the learning mechanism in reservoir-type networks, such as the neural engineering framework (Eliasmith and Anderson, 2004), liquid state machines (Maass et al., 2002) and FORCE learning (Sussillo and Abbott, 2009). In reservoir networks, the encoder is typically random and fixed and the decoder is trained. Just like DCLL, they use a rate-based cost function over a linear combination of spike-driven basis functions. The key difference is that in DCLL the encoder weights are trained, whereas the decoder (readout) weights are random and fixed. The training of the encoder weights allows the network to learn representations that are suitable as inputs for subsequent layers.
[Figure 1, caption (partial): Errors in the local classifiers are propagated through the random connections to train weights coming in to the spiking layer, but no further (curvy, dashed line). To simplify the learning rule and enable linear scaling of the computations, the cost function is formulated using a rate code. The states of the spiking neurons (membrane potential, synaptic states, refractory state) are carried forward in time. Consequently, even in the absence of recurrent connections, the neurons are stateful in the sense of recurrent neural networks. (Right) Snapshot of the neural states illustrating the DCLL learning rule in the top layer. In this example, the network is trained to produce three time-varying pseudo-targets.]
Our approach can be viewed as a type of synthetic gradient. Synthetic gradients were initially proposed to decouple one or more layers from the rest of the network so as to prevent layerwise locking, similar to DCLL. However, synthetic gradients usually involve an outer loop that is equivalent to a full backpropagation through the network, which cannot be done locally in spiking neural networks. Instead, DCLL relies on an initialization of the local random classifier weights and forgoes the outer loop.
One appeal of our model is the scalability of the learning rule: its formulation allows for the convenient use of the automatic differentiation mechanisms in existing machine learning frameworks, in this case PyTorch. Its linear scalability enables the training of hundreds of thousands of neurons on a single GPU, and learning on extremely fine time scales even on long sequences. In our case, we trained with ms sequences on a ms time step. In contrast, due to its memory requirements, BackPropagation-Through-Time (BPTT) is typically truncated to about 10 steps.
We demonstrate our approach on the classification of gestures from the IBM DVS Gestures dataset (Amir et al., 2017), recorded using an event-based neuromorphic sensor, and report performance comparable to deep neural networks and even networks trained with BPTT. We also perform an “ablation” study on the neuron model which reveals that the spiking nature of the neuron plays no significant role in the final accuracies. These results are consistent with the idea that it is the continuous-time dynamics and the locality of computations that are the most important distinguishing properties of spiking neural networks.
2 Related Work in Multilayer SpikeBased Learning
In (Neftci et al., 2017), the authors demonstrated Event-Driven Random BackPropagation (eRBP), a form of approximate gradient backpropagation in spiking neural networks that translates into a three-factor rule reminiscent of an error-modulated Hebb rule. The error was mediated by a random top-down (spike-based) feedback and accumulated at a second compartment in each neuron. Upon every presynaptic (input) spike, the weights are updated in the direction opposite to the value stored in the second compartment. On classical MNIST digit recognition tasks, eRBP performed nearly as well as an equivalent deep neural network. However, a shortcoming of eRBP is that the model does not take continuous dynamics into account. This is problematic because the “loop duration”, i.e. “the duration necessary from the input onset to a stable response in the error neurons”, scales with the number of layers. In deep networks, the errors can be strongly delayed when the time constants are long, or when the inputs have fast components. This reduces the quality of the computed gradients.
SuperSpike employs a surrogate gradient descent to train networks of Linear Integrate & Fire (LI&F) neurons on a spike train distance measure. Because the LI&F neuron output is non-differentiable, SuperSpike uses a surrogate network with differentiable activation functions to compute an approximate gradient. The authors show that this learning rule is equivalent to a forward-propagation of errors using eligibility traces, and is capable of efficient learning in hidden layers of feedforward multilayer networks. Unfortunately, the approximations in SuperSpike prevent efficient learning in deep layers, and the algorithm scales as $\mathcal{O}(N^2)$, where $N$ is the number of neurons. While quadratic scaling is biologically plausible, it prevents an efficient implementation in digital hardware. Like SuperSpike, DCLL uses surrogate gradients to perform weight updates, but the cost function is rate-based, such that the algorithm scales as $\mathcal{O}(N)$. Rate-based cost functions are used in similar scenarios in liquid state machines and FORCE learning, and do not prevent the learning of fine temporal dynamics as in SuperSpike.
Spiking neural networks can be viewed as a subclass of binary, recurrent artificial neural networks. Spiking neurons are recurrent in the Artificial Neural Network (ANN) sense even if all the connections are feedforward, because the neurons have a state that is propagated forward at every time step. Binary neural networks, where both activations and weights are binary, were studied in deep learning as a way to decrease model complexity during inference (Courbariaux et al., 2016; Rastegari et al., 2016). We are not aware of work on recurrent variations of binary nets, however.
Surrogate-gradient descent with forward propagation of the eligibility functions is the flip side of backpropagation-through-time, where gradients are computed using past activities. The BPTT-like approach for spiking neural networks was investigated in (Bohte et al., 2000; Lee et al., 2016; Shrestha and Orchard, 2018). While these approaches provide unbiased estimation of the gradients, we show that DCLL can perform as well as or better than these techniques using fewer computational resources. This is because the computational and memory demands are higher for BPTT, which requires truncation and limits the size of the networks that can be simulated. Furthermore, forward-propagation techniques such as DCLL can be formulated as local synaptic plasticity rules, and are thus amenable to implementation in dedicated, event-based (neuromorphic) hardware (Neftci, 2018).
Hierarchy of Time Surfaces (HOTS) is a model for event-based pattern recognition using time surfaces (Lagorce et al., 2015). Time surfaces describe the recent history of events in the spatial neighborhood of an event. Synaptic dynamics in the spiking neuron models, which exponentially filter their input events, play the role of time surfaces. In the case of convolutional neural networks (as used in this work) and continuous-time operation, the DCLL forward dynamics are identical to those of HOTS. While the deep weights there are trained using unsupervised learning, all the layers in DCLL are trained using gradient-based supervised learning updates, making it more efficient when targets or pseudo-targets are available.
Decoupled Neural Interfaces (DNI) were proposed to mitigate layerwise locking in training deep neural networks (Jaderberg et al., 2016). Layerwise locking occurs when the computations in one layer are locked until the errors necessary for the weight update become available. In DNI, this decoupling is achieved using a synthetic gradient, a neural network that estimates the gradients for a portion of the network. In an inner loop, the network parameters are trained using the synthetic gradients, and in an outer loop the synthetic gradient network parameters are trained using a full BP step. The gradients computed using local errors in DCLL, described below, can be viewed as a type of synthetic gradient that ignores the outer loop to avoid a full BP step (Mostafa et al., 2017). Although ignoring the outer loop limits DCLL’s cross-layer feature adaptation, we find that the network performs strikingly well.
This work builds on a combination of SuperSpike, local errors and random backpropagation, with the realization that a ratebased cost function combined with a differentiable spiketorate decoder can still exploit temporal dynamics of the spiking neurons.
3 Methods
3.1 Neuron and Synapse Model
The neuron model used for DCLL can be compactly described as follows:
where is the unit step function, and are kernels that reflect neural and synaptic dynamics, e.g. refractoriness, reset and postsynaptic potentials, and denotes a (temporal) convolution. Consistently with currentbased Integrate & Fire (I&F) model, the second and first order kernels for and are, respectively:
(1) 
This model is consistent with a deterministic SRM (Gerstner and Kistler, 2002).
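As a purely illustrative sketch, the following Python snippet simulates a single deterministic neuron of this kind in discrete time. The recursive updates stand in for exponential synaptic, membrane and refractory kernels; the decay factors `alpha`, `beta`, `gamma`, the refractory strength, the threshold and the function name `simulate_neuron` are assumptions made for the example and are not taken from the model definition.

```python
def simulate_neuron(input_spikes, weights, alpha=0.9, beta=0.8, gamma=0.7,
                    ref_strength=2.0, threshold=1.0):
    """Simulate one deterministic spiking neuron in discrete time.

    input_spikes: list over time steps; each entry is a list of 0/1 inputs.
    weights: one synaptic weight per input.
    Returns (potentials, spikes), one value per time step.
    All constants are illustrative assumptions, not values from the paper.
    """
    n_in = len(weights)
    q = [0.0] * n_in  # synaptic traces (first-order filter of input spikes)
    p = [0.0] * n_in  # membrane traces (second-order filter of input spikes)
    r = 0.0           # refractory trace (filter of the neuron's own spikes)
    potentials, spikes = [], []
    for s_in in input_spikes:
        for j in range(n_in):
            p[j] = alpha * p[j] + q[j]    # membrane integrates the synaptic trace
            q[j] = beta * q[j] + s_in[j]  # synapse integrates incoming spikes
        u = sum(w * pj for w, pj in zip(weights, p)) - ref_strength * r
        s_out = 1 if u >= threshold else 0  # unit step nonlinearity
        r = gamma * r + s_out               # refractoriness follows output spikes
        potentials.append(u)
        spikes.append(s_out)
    return potentials, spikes


# Example: a silent input produces no spikes, a driven input does.
_, spikes_silent = simulate_neuron([[0, 0]] * 20, [1.0, 1.0])
_, spikes_driven = simulate_neuron([[1, 1]] * 50, [1.0, 1.0])
print(sum(spikes_silent), sum(spikes_driven))
```

Because the traces are updated recursively, the state needed at each step is carried forward in time and no history of past inputs has to be stored.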
3.2 Surrogate Gradients of the Neuron and Synapse
Generally, the SRM
output is stochastic, such that the conditional probability of an output spike (
) given the input spike vector
is:
where is interpreted as a stochastic intensity. The use of the stochastic intensity provides a good description for biological neurons, and provides the means to compute the gradient with respect to the neuron parameters (Williams, 1992). However, the training of stochastic neurons is notoriously slow due to the high variance of the gradient estimator. Recent work showed that a deterministic approach using a hard threshold during inference combined with a differentiable activation function during training yields very good results (Zenke and Ganguli, 2017; Neftci et al., 2017). This approach is called surrogate gradient-based learning, since parameter updates are based on a differentiable but different version of the task-performing network. For clarity, the variables are used in the forward (non-learning) computations and the variables are used in the backward (learning) computations, where is a sigmoidal function. In practice, we choose a symmetric centered on , i.e. such that . We emphasize here again that the neuron model remains deterministic and is used as a differentiable approximation of the step function .
The surrogate network can be differentiated with respect to the neuron parameters. This enables a gradient-based optimization of a target loss as a function of :
(2) 
The gradient of the neuron with respect to the parameter is
(3) 
Due to the dependence of the refractory term on its own history, the gradient cannot be computed in closed form. Dropping the remaining derivative in the equation above is not a valid option in the general case, as there is no guarantee that the term is small. One possibility is to use a weak refractory kernel or none (i.e. = 0). However, in the absence of refractory mechanism or firing rate regularization, the neurons fire at very high rates. The solution we use is to enforce a low firing rate through an activitydependent regularizer strong enough that the contribution of the refractory term is negligible. A similar approach was used in SuperSpike (Zenke and Ganguli, 2017).
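To make the surrogate gradient idea concrete, the sketch below pairs a hard threshold for the forward pass with the derivative of a logistic sigmoid for the learning pass, and combines it with an error signal and a presynaptic trace in a three-factor update. It assumes the refractory contribution is negligible, as argued above; the logistic function, the zero threshold and the names `surrogate_grad` and `three_factor_update` are illustrative assumptions rather than the exact choices of this work.

```python
import math


def step(u):
    """Hard threshold used in the forward (non-learning) computation."""
    return 1.0 if u >= 0.0 else 0.0


def sigmoid(u):
    """Smooth stand-in for the step function (assumed logistic)."""
    return 1.0 / (1.0 + math.exp(-u))


def surrogate_grad(u):
    """Derivative of the sigmoid, used where the step has no usable derivative."""
    s = sigmoid(u)
    return s * (1.0 - s)


def three_factor_update(error, u, presyn_trace, lr=0.1):
    """Weight change = -lr * modulatory * postsynaptic * presynaptic factor."""
    return -lr * error * surrogate_grad(u) * presyn_trace


print(step(-0.2), step(0.3))
print(three_factor_update(error=1.0, u=0.0, presyn_trace=2.0))
```

The surrogate derivative is symmetric and peaks at the threshold, so weights change most when the membrane potential is close to firing.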
Since our simulations are based on dense operations (GPU-based for the most part), the high firing rate does not have any impact on performance. Note, however, that in the case of event-based neuromorphic accelerators, the performance of a network scales directly with the number of synaptic events (Merolla et al., 2014). When strong refractory kernels are used, we use a regularizer to prevent sustained firing (i.e. keeping below the firing threshold), and an activity regularizer to maintain a minimum firing rate in each layer.
) suggest a synaptic plasticity rule that is tailored to the neuron and synapse model through its dependence on
and . Furthermore, we now see that Eq. (2) consists of three types of factors: one modulatory (), one postsynaptic () and one presynaptic (). Rules of this type are often termed three-factor rules, which have been shown to be consistent with biology (Pfister et al., 2006) and compatible with a wide range of unsupervised, supervised and reinforcement learning paradigms (Urbanczik and Senn, 2014).
3.3 Local Synaptic Plasticity Rules and Auxiliary Cost Functions
In Eq. (3), both presynaptic and postsynaptic terms are local, meaning that all the variables needed to compute them are available at the neuron and synapse. The third factor plays the role of the backpropagated errors in the gradient backpropagation rule, and generally involves non-local terms, including the activity of other neurons, the targets, and their history. While an increasing body of work shows that approximations to the backpropagated errors are possible, for example in feedback alignment (Lillicrap et al., 2014; Nøkland, 2016; Neftci et al., 2017), how to maintain their history efficiently remains a challenging problem. SuperSpike (Zenke and Ganguli, 2017) deals with it by explicitly computing this history at the synapse. In the exact form, this results in nested convolutions for a network of layers, which is computationally inefficient. To approximate this, Zenke and Ganguli (2017) use a straight-through estimator, whereby the activation function derivatives of the other layers are ignored (they are all treated as equal to 1). In this case, the nested convolutions can be combined into a single one, such that only convolutions remain necessary. This approach, however, has limited power in cases where two or more layers are used, and the 2 nested convolutions involve a quadratic scaling of the number of state variables.
One other approach is to enforce locality by using local gradients, or equivalently, local classifiers. One difficulty in defining a local error signal at a neuron in a deep layer is that the cost function is almost always defined using the network output at the top layer. Thus, using local information only, a neuron in a deep layer can not infer how a change in its activity will affect the toplayer cost. To address this conundrum, ref. (Mostafa et al., 2017) attaches random local classifiers to deep layers and defines auxiliary cost functions using their output. These auxiliary cost functions provide a taskrelevant source of error for neurons in deep layers. Surprisingly, training deep layers using auxiliary local errors that minimize the cost at the local classifiers still allows the network as a whole to reach a small toplayer cost. That is because minimizing the local classifiers’ cost puts pressure on deep layers to learn useful taskrelevant features that will allow the random local classifiers to solve the task. Moreover, each layer builds on the features of the previous layer to learn even better features for its local random classifier. Thus, even though no error information propagates downwards through the layer stack, the layers indirectly learn useful hierarchical features that end up minimizing the cost at the top layer.
3.4 The Deep Continuous Local Learning rule
The DCLL rule combines SuperSpike with the deep local learning described above to solve the temporal and spatial credit assignment problem in continuous (spiking) neural networks. Hence our approach is called Deep Continuous Local Learning (DCLL). To achieve this, we organize layers of such neurons, and train each layer to predict a pseudo-target using a random local classifier , where indexes the layer, and are fixed, random matrices (one for each layer ). The loss function is the sum of the layerwise loss functions, i.e. , where is the pseudo-target for layer . In the special case of an MSE loss, the layerwise loss is:
where is the pseudo-target for layer . The gradient of the loss becomes:
(4) 
where for MSE loss . This update rule prescribes that the weight updates should be executed after the presentation of the sequence of duration . In a discretetime simulation, empirical results show that learning works well even if the updates are made at every time step of the simulation (Neftci, 2018). Replacing Eq. (3) in Eq. (4):
(5) 
where is a learning rate. We note that the gradient of the loss at the top layer, , is only used to update the weights in layer and does not backpropagate further through the network. In all our experiments, updates are made at each time step of the simulation. Furthermore, the neural and synaptic time constants were drawn randomly from a uniform distribution.
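The per-time-step update of a single layer can be sketched as follows. This is a minimal illustration under stated assumptions: a linear readout through a fixed matrix `G`, an MSE loss, a zero firing threshold, a logistic surrogate derivative, and no refractory term. The function name `dcll_step` and all variable names are hypothetical and do not come from the authors' implementation.

```python
import math


def sigmoid_prime(u):
    """Surrogate derivative of the spiking nonlinearity (assumed logistic)."""
    s = 1.0 / (1.0 + math.exp(-u))
    return s * (1.0 - s)


def dcll_step(W, G, p, target, lr=0.1):
    """One local update for one layer; W is modified in place, G stays fixed.

    W: trainable weights, shape [n_neurons][n_inputs]
    G: fixed random readout, shape [n_outputs][n_neurons]
    p: presynaptic traces at the current time step
    target: pseudo-target for the local classifier at this time step
    """
    n = len(W)
    u = [sum(W[i][j] * p[j] for j in range(len(p))) for i in range(n)]
    s = [1.0 if ui >= 0.0 else 0.0 for ui in u]          # forward: hard threshold
    y = [sum(G[k][i] * s[i] for i in range(n)) for k in range(len(G))]
    err = [y[k] - target[k] for k in range(len(G))]       # local classifier error
    loss = 0.5 * sum(e * e for e in err)
    for i in range(n):
        # The error reaches neuron i through the fixed random weights only.
        d_i = sum(G[k][i] * err[k] for k in range(len(G)))
        g = d_i * sigmoid_prime(u[i])
        for j in range(len(p)):
            W[i][j] -= lr * g * p[j]
    return s, loss


# Toy example: one neuron learns to fire so that the readout matches the target.
W, G = [[-0.5]], [[1.0]]
for _ in range(10):
    s, loss = dcll_step(W, G, p=[1.0], target=[1.0], lr=1.0)
print(s, loss, W)
```

Nothing in `dcll_step` depends on other layers, which is what allows each layer to be updated independently at every time step.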
Implementation using Automatic Differentiation:
Temporal convolutions can make implementation in machine learning frameworks difficult. With DCLL, however, the function to be differentiated is linear in the parameters (note that it is not the case in the Van Rossum Distance). In the case of no refractory kernel (), the loss function involves derivatives of . This equation involves no temporal convolutions in the trainable parameters . The linear property enables the cost function to be differentiated using automatic differentiation tools provided by machine learning frameworks outofthebox. The surrogate gradient approach guarantees that the computed gradients of are the same as the DCLL rule. This enables the integration of DCLL with machine learning frameworks, in our case PyTorch without backpropagating through time since the only timedependent term, is propagated forward. The integration is significant in that one can build large convolutional neural networks, as well as leverage any type of layer, operation, optimizer and cost function provided by the software. We leverage this integration in all our experiments under the Results section.
Local synaptic plasticity rule:
Eq. (5) requires some information that is non-local to the error neurons and the hidden neuron. This includes (not local to the error neurons), the targets and the weights (not local to the hidden neurons). We assume that a dedicated channel communicates these targets to each unit . Furthermore, because the are fixed, we can duplicate them on both ends of the connection. The term can be approximated by as in the Van Rossum Distance but without taking it into account in the gradient so as to avoid the nested convolution, or simply . In both cases can be efficiently delivered to the error neuron through a conventional connection.
Finally, computed at the error neuron must be communicated to the hidden neuron. To transmit this, the error neuron output can be separated in two, one positive and one negative , each of them spiking when the error increases by a fixed positive threshold and negative threshold, respectively. This approach is particularly interesting, as it prescribes an error-triggered learning rule (Neftci, 2018),
(6) 
which can improve learning efficiency when implemented on an eventbased processor, as Eq. (6) is triggered only when the local error exceeds some threshold. Given that our simulations are currently GPU based, we use Eq. (5) for updates. In future work, we will evaluate the effectiveness of Eq. (6) on largescale neural networks (but see (Neftci, 2018) for a simple example).
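One possible reading of the error-triggered scheme is a delta encoder that emits a positive or negative error event whenever the error has moved by a fixed threshold since the last event. The sketch below illustrates that reading only; the threshold value, the reference-tracking mechanism and the name `encode_error` are assumptions, and the actual rule in Eq. (6) may differ.

```python
def encode_error(errors, theta=0.5):
    """Turn an analog error signal into positive and negative event trains.

    An event is emitted when the error differs from a running reference by at
    least theta; the reference then moves by theta toward the error.
    """
    ref = 0.0
    pos, neg = [], []
    for e in errors:
        up = down = 0
        if e - ref >= theta:
            up = 1
            ref += theta
        elif e - ref <= -theta:
            down = 1
            ref -= theta
        pos.append(up)
        neg.append(down)
    return pos, neg


pos, neg = encode_error([0.0, 0.6, 0.6, 1.2, 0.0, 0.0])
print(pos, neg)
```

Weight updates would then be executed only at time steps where `pos` or `neg` is 1, so a layer whose error is stable triggers no updates.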
Relation to Van Rossum distance and SuperSpike and Linear versus Quadratic Scaling
The SuperSpike learning rule is a surrogate gradient descent on spike distance, i.e. Van Rossum distance:
where is an exponential filter similar to above. When an MSE loss is used with DCLL, it can be viewed as a simplification of the SuperSpike rule where spike distance is replaced by instantaneous spike count distance. In fact, the target in the Van Rossum Distance, , can be interpreted as and the prediction fulfills a role similar to the local classifier activation function. These dynamics and loss function avoid nested convolutions and enable the linear $\mathcal{O}(N)$ scaling of DCLL, where $N$ is the number of neurons.
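The difference between the two distances can be illustrated on discrete spike trains. The sketch below assumes a recursive exponential filter with a per-step decay factor and a unit time step; both functions and their names are illustrative and are not the definitions used in SuperSpike or in this work.

```python
def exp_filter(spikes, decay):
    """Recursive exponential filter of a binary spike train."""
    trace, out = 0.0, []
    for s in spikes:
        trace = decay * trace + s
        out.append(trace)
    return out


def van_rossum_distance(spikes_a, spikes_b, decay=0.5, dt=1.0):
    """Squared distance between exponentially filtered spike trains."""
    fa = exp_filter(spikes_a, decay)
    fb = exp_filter(spikes_b, decay)
    return 0.5 * dt * sum((a - b) ** 2 for a, b in zip(fa, fb))


def instantaneous_distance(spikes_a, spikes_b):
    """Squared distance on the unfiltered spikes, one term per time step."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(spikes_a, spikes_b))


a, b = [1, 0, 0], [0, 0, 0]
print(van_rossum_distance(a, b), instantaneous_distance(a, b))
```

In the filtered distance a single spike keeps contributing to the loss at later time steps through the filter, which is the source of the temporal convolution in the gradient; the instantaneous version has no such term.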
4 Experiments
4.1 Regression with Poisson Spike Trains
To illustrate the inner workings of DCLL, we first demonstrate it on a regression task. A three-layer fully connected network is stimulated with a frozen ms Poisson spike train. The pseudo-targets are a ramp function, a cosine function and a sine function for each layer, respectively. Fig. 1 illustrates the states of the neurons. For each local classifier, dropout was used to prevent over-reliance on one particular neuron. For illustration purposes, the recordings of the neural states were made in the absence of parameter updates (i.e. the learning rate is 0). We use the Adamax optimizer (Kingma et al., 2014) and a smooth L1 loss. In this experiment, the refractory period was non-zero, as observed by the resetting effect in the membrane potential (). As discussed in the methods, we use a regularizer to keep the neurons from sustaining high firing rates, together with an activity regularizer. Updates to the weights can occur at each time step when the derivative of the activation function and the input state are non-zero. The magnitude and direction of the update are determined by the error. Note that, in effect, the error is randomized as a consequence of the random local classifier. Because the input neurons have a fixed mean firing rate, the network learned to use the input spike times to reliably produce the targets.
4.2 Poisson MNIST
We first show the results of our method on the MNIST dataset compared to a conventional convolutional neural network. For DCLL, each digit is converted into a 500 ms Poisson spike train, where the mean firing rates vary from 0 to Hz depending on the pixel intensity. Gradient updates are performed at every simulation step after a 50 ms burn-in (450 gradient steps per minibatch). Each minibatch contains 64 samples. The test set consists of 1024 unseen samples converted into spike trains and presented for 1 s to the network. We rely on a simple network architecture consisting of three convolutional layers of 16, 24 and 32 channels respectively with kernels interleaved with max pooling layers. In total, the network has 3528 tunable weights and 72 biases, and was trained using Adamax with a smooth L1 loss.
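The conversion of a pixel intensity into a spike train can be sketched as below. The snippet uses a per-step Bernoulli draw as a discrete-time approximation of a Poisson process. The function name, the 1 ms time step and the `max_rate_hz` argument are assumptions for illustration; in particular, the maximum rate passed in the example is a placeholder and not the value used in the experiments.

```python
import random


def poisson_spike_train(intensity, max_rate_hz, duration_ms, dt_ms=1.0, rng=None):
    """Generate a binary spike train whose mean rate scales with intensity.

    intensity: normalized pixel value in [0, 1]
    max_rate_hz: firing rate assigned to intensity 1.0 (placeholder parameter)
    """
    if rng is None:
        rng = random.Random()
    p_spike = intensity * max_rate_hz * dt_ms / 1000.0  # spike probability per step
    n_steps = int(duration_ms / dt_ms)
    return [1 if rng.random() < p_spike else 0 for _ in range(n_steps)]


rng = random.Random(0)  # seeded for reproducibility
train = poisson_spike_train(0.5, max_rate_hz=100.0, duration_ms=500.0, rng=rng)
print(len(train), sum(train))
```

Since a new random train is drawn at every presentation, the same digit never produces exactly the same input twice.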
The results are compared with a reference network of the same architecture and optimizer, trained in the same regime by performing 500 backpropagation steps for each batch. The pseudotargets used for the local classifiers are class labels. In order to match the number of trainable parameters with the conventional convnet architecture, we used one additional fully connected layer without spiking neuron dynamics.
The results of the method on the MNIST dataset are shown in Figure 2. The final accuracy after 50 000 training samples is 98.73% for the third layer of the spiking DCLL version against 98% for the analog backpropagation network. Note that the accuracy was computed sample by sample. When few samples are presented, the spiking DCLL version seems to generalize better, as the performance is below 90% after the first 2000 samples. This could be due to the random noise added to the samples when converting them to spike trains with the Poisson distribution, which helps generalization.
4.3 DVS Gestures
We test DCLL on the more challenging task of learning gestures recorded using a Dynamic Vision Sensor (DVS) (Lichtsteiner et al., 2008). Amir et al. recorded the DvsGesture dataset using a DVS, comprising 1342 instances of a set of 11 hand and arm gestures, collected from 29 subjects under 3 different lighting conditions (Amir et al., 2017). Unlike standard imagers, the DVS records streams of events that signal the temporal intensity changes at each of its pixels. The unique features of each gesture are embedded in the stream of events. The event streams were downsized to and binned in frames of , the effective time step of the GPU-based simulation (Fig. 3). During training, random ms long sequences were selected in batches of 72. Testing sequences were ms long, and were selected starting from the beginning of each recording (288 testing sequences). Note that the shortest recording in the test set is ms, and this duration was selected to simplify and speed up the classification evaluation. The classification is obtained by counting spikes at the output starting from a “burn-in period” of ms and selecting as output class the neuron that spiked the most. Contrary to (Amir et al., 2017), we did not use stochastic decay, and the neural network structure is an all-convolutional neural network, loosely adapted from (Springenberg et al., 2014). We find that kernels provide much better results than or kernels commonly used in image recognition. Furthermore, we did not observe significant improvement by adding more than 3 convolutional layers. The optimal hyperparameters were found by a combination of manual and grid search. We used a weak refractory kernel ( and ). These settings provided the best accuracy, at the cost of allowing short bursts of neural activity (Fig. 3).
Overall, our performance is equal to or better than that of other published spiking neural network implementations that use backpropagation for training (Tab. 1, Fig. 5). DCLL reached the reported accuracies after a much smaller number of iterations compared to the IBM EEDN case (Amir et al., 2017). Furthermore, our network achieved these results using a much smaller network compared to the other reported results.
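The spike-count readout used for classification can be written in a few lines. The sketch below assumes the output activity is available as a list of per-time-step binary vectors; the function name and the tie-breaking behavior (the lowest class index wins) are illustrative choices, not details given in the text.

```python
def classify_by_spike_count(output_spikes, burn_in_steps):
    """Return the index of the output neuron that spiked most after burn-in.

    output_spikes: list over time steps; each entry holds one 0/1 value per class.
    burn_in_steps: number of initial time steps to ignore.
    """
    counts = [0] * len(output_spikes[0])
    for spikes_t in output_spikes[burn_in_steps:]:
        for k, s in enumerate(spikes_t):
            counts[k] += s
    # max() returns the first maximal index, so ties go to the lowest class.
    return max(range(len(counts)), key=counts.__getitem__)


activity = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]]
print(classify_by_spike_count(activity, burn_in_steps=2))
```

Discarding the burn-in steps keeps the transient response at stimulus onset, when the neural states have not yet settled, from influencing the decision.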
The spiking activation function performed better than the sigmoid and ReLU cases (not shown). However, we note that the parameter settings were tuned for the spiking neuron and transferred without further tuning to the networks with ReLU and logistic activation functions.
Model             Error    Training Samples
IBM EEDN          5.51%    Offline samples
Slayer            6.36%    Offline, # samples not reported
DCLL (This Work)  5.819%   Online samples

Layer Type        #     Dimensions
Input (ON, OFF)   2
Conv              64
MaxPool           64
Dropout(p=.5)
Dense             11
Conv              128
Dropout(p=.5)
Dense             11
Conv              128
MaxPool           128
Dropout(p=.5)
Dense             11
5 Discussion
Understanding and deriving neural and synaptic plasticity rules that can enable hidden weights to learn is an ongoing quest in neuroscience and neuromorphic engineering. From a machine learning perspective, locality and differentiability are key issues of the spiking neuron model operations. While the latter problem is now being tackled with surrogate gradient approaches, how to achieve this in deep networks in a scalable and local fashion is still an open question.
We presented a novel synaptic plasticity rule, DCLL, derived from a surrogate gradient approach with linear computational scalability in the number of neurons. The rule draws on recent work in surrogate gradient descent in spiking neurons and local learning with layerwise classifiers. The linear scalability is obtained through a rate-based cost function on the local classifier. To motivate this from both a biological and a neuromorphic implementation point of view, we discussed a fully event-driven variant of the rule using error-triggered updates. The simplicity of the DCLL rule equation makes it amenable to direct exploitation of existing machine learning software libraries. Thanks to the surrogate gradient approach, the updates computed through automatic differentiation are equal to the DCLL update.
The DCLL rule can exploit the temporal dynamics of the spiking neuron to learn classification and regression on spike trains, similarly to simple recurrent units in artificial neural networks. This is because the state maintained by the neuron makes it operationally equivalent to a recurrent neuron. The neuron can learn sequences that have temporal dependencies on the scale of the neural and synaptic time constants, i.e. the kernel. The near state-of-the-art classification accuracy on the DVS gestures task demonstrates the scalability of the approach. The surrogate gradient approach also enables DCLL to seamlessly exploit random time constants (e.g. see DVS gestures experiments). Random time constants are interesting given that a multiplicity of time constants in a recurrent neural network can confer on it a long-term memory that rivals that of long short-term memory units (Koutnik et al., 2014).
Researchers using spiking neural networks in data-driven machine learning tasks often emphasize the spiking nature of their units as a key difference compared to artificial neural networks. Our experience is different, however. We find that their temporal dynamics, and their local computations when implemented on dedicated hardware, are the key differences compared to artificial neural networks. Thanks to our machine-learning-framework-driven experimentation, we could easily replace the threshold neuron with a ReLU function or a sigmoid. In general, our initial results are often better with the ReLU. However, after a close inspection of the dynamics, we observe that this is caused by activations reaching their saturation values. Through proper initialization and refractory periods, we could close the gap between ReLU and spiking activation functions.
One limitation of the DCLL rule as currently presented is the lack of connections within a single layer. Up to now, the learning of recurrent connections has been studied in the rate domain (Sussillo and Abbott, 2009) or using backpropagation through time (Bellec et al., 2018). The key challenge in using a forward surrogate gradient approach is the dependency of the synaptic update on the neuron’s own history (e.g. see the discussion on the refractory period in the methods). We speculate that an approach that bootstraps the gradient computation in a fashion similar to synthetic gradients is a promising direction.
A direct consequence of the local classifiers is the lack of cross-layer adaptation. To tackle this problem, one could use meta-learning to adapt the random matrix in the classifier. In effect, the meta-learning loop would act as the outer loop in the synthetic gradients approach (Jaderberg et al., 2016). From a neuroscience perspective, the notion that a “layer” of neurons is specialized for certain problems and sensory modalities is natural.
Funding
EN was supported by the Intel Corporation, the National Science Foundation under grant 1640081, and by the Korean Institute of Science and Technology. JK was supported by a fellowship within the FITweltweit programme of the German Academic Exchange Service (DAAD). HM was supported by the Swiss National Fund. JK, HM, and EN conceived the experiments and wrote the paper. JK and EN ran the experiments and analyzed the data.
References

Amir et al. (2017) Amir Arnon, Taba Brian, Berg David, Melano Timothy, McKinstry Jeffrey, Di Nolfo Carmelo, Nayak Tapan, Andreopoulos Alexander, Garreau Guillaume, Mendoza Marcela, and others. A low power, fully event-based gesture recognition system. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7243–7252, 2017.
 Baldi et al. (2017) Baldi Pierre, Sadowski Peter, and Lu Zhiqin. Learning in the machine: The symmetries of the deep learning channel. Neural Networks, 95:110–133, 2017.
 Bellec et al. (2018) Bellec Guillaume, Salaj Darjan, Subramoney Anand, Legenstein Robert, and Maass Wolfgang. Long short-term memory and learning-to-learn in networks of spiking neurons. arXiv preprint arXiv:1803.09574, 2018.
 Bohte et al. (2000) Bohte Sander M, Kok Joost N, and La Poutré Johannes A. SpikeProp: backpropagation for networks of spiking neurons. In ESANN, pages 419–424, 2000.
 Clopath et al. (2010) Clopath C., Büsing L., Vasilaki E., and Gerstner W. Connectivity reflects coding: a model of voltage-based STDP with homeostasis. Nature Neuroscience, 13(3):344–352, 2010.
 Courbariaux et al. (2016) Courbariaux Matthieu, Hubara Itay, Soudry Daniel, El-Yaniv Ran, and Bengio Yoshua. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
 Eliasmith and Anderson (2004) Eliasmith C. and Anderson C.H. Neural engineering: Computation, representation, and dynamics in neurobiological systems. MIT Press, 2004.
 Gerstner and Kistler (2002) Gerstner W. and Kistler W. Spiking Neuron Models. Single Neurons, Populations, Plasticity. Cambridge University Press, 2002.
 Jaderberg et al. (2016) Jaderberg Max, Czarnecki Wojciech Marian, Osindero Simon, Vinyals Oriol, Graves Alex, and Kavukcuoglu Koray. Decoupled neural interfaces using synthetic gradients. arXiv preprint arXiv:1608.05343, 2016.
 Kingma et al. (2014) Kingma Diederik P, Mohamed Shakir, Rezende Danilo Jimenez, and Welling Max. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
 Koutnik et al. (2014) Koutnik Jan, Greff Klaus, Gomez Faustino, and Schmidhuber Juergen. A clockwork RNN. arXiv preprint arXiv:1402.3511, 2014.
 Lagorce et al. (2015) Lagorce Xavier, Ieng Sio Hoi, Clady Xavier, Pfeiffer Michael, and Benosman Ryad Benjamin. Spatiotemporal features for asynchronous event-based data. Frontiers in Neuroscience, 9(46), 2015. ISSN 1662-453X. doi: 10.3389/fnins.2015.00046.
 Lee et al. (2016) Lee Jun Haeng, Delbruck Tobi, and Pfeiffer Michael. Training deep spiking neural networks using backpropagation. Frontiers in Neuroscience, 10, 2016.
 Lichtsteiner et al. (2008) Lichtsteiner P., Posch C., and Delbruck T. A 128x128 120dB 15μs latency asynchronous temporal contrast vision sensor. IEEE J. Solid State Circuits, 43(2):566–576, 2008.
 Lillicrap et al. (2014) Lillicrap Timothy P, Cownden Daniel, Tweed Douglas B, and Akerman Colin J. Random feedback weights support learning in deep neural networks. arXiv preprint arXiv:1411.0247, 2014.
 Maass et al. (2002) Maass W., Natschläger T., and Markram H. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11):2531–2560, 2002.
 Merolla et al. (2014) Merolla Paul A, Arthur John V, Alvarez-Icaza Rodrigo, Cassidy Andrew S, Sawada Jun, Akopyan Filipp, Jackson Bryan L, Imam Nabil, Guo Chen, Nakamura Yutaka, and others. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science, 345(6197):668–673, 2014.
 Mostafa et al. (2017) Mostafa Hesham, Ramesh Vishwajith, and Cauwenberghs Gert. Deep supervised learning using local errors. arXiv preprint arXiv:1711.06756, 2017.
 Neftci (2018) Neftci Emre O. Data and power efficient intelligence with neuromorphic learning machines. iScience, 5:52–68, 2018. ISSN 2589-0042. doi: 10.1016/j.isci.2018.06.010. URL http://www.sciencedirect.com/science/article/pii/S2589004218300865.
 Neftci et al. (2017) Neftci Emre O., Augustine Charles, Paul Somnath, and Detorakis Georgios. Event-driven random back-propagation: Enabling neuromorphic deep learning machines. Frontiers in Neuroscience, 11:324, 2017. ISSN 1662-453X. doi: 10.3389/fnins.2017.00324.
 Nøkland (2016) Nøkland Arild. Direct feedback alignment provides learning in deep neural networks. In Lee D. D., Sugiyama M., Luxburg U. V., Guyon I., and Garnett R., editors, Advances in Neural Information Processing Systems 29, pages 1037–1045. Curran Associates, Inc., 2016.
 Pfister et al. (2006) Pfister Jean-Pascal, Toyoizumi Taro, Barber David, and Gerstner Wulfram. Optimal spike-timing-dependent plasticity for precise action potential firing in supervised learning. Neural Computation, 18(6):1318–1348, 2006.
 Rastegari et al. (2016) Rastegari Mohammad, Ordonez Vicente, Redmon Joseph, and Farhadi Ali. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
 Shrestha and Orchard (2018) Shrestha Sumit Bam and Orchard Garrick. Slayer: Spike layer error reassignment in time. arXiv preprint arXiv:1810.08646, 2018.
 Springenberg et al. (2014) Springenberg Jost Tobias, Dosovitskiy Alexey, Brox Thomas, and Riedmiller Martin. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
 Sussillo and Abbott (2009) Sussillo David and Abbott Larry F. Generating coherent patterns of activity from chaotic neural networks. Neuron, 63(4):544–557, 2009.
 Urbanczik and Senn (2014) Urbanczik Robert and Senn Walter. Learning by the dendritic prediction of somatic spiking. Neuron, 81(3):521–528, 2014.
 Williams (1992) Williams Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
 Zenke and Ganguli (2017) Zenke Friedemann and Ganguli Surya. SuperSpike: Supervised learning in multilayer spiking neural networks. arXiv preprint arXiv:1705.11146, 2017.