For a biological agent operating under environmental pressure, energy consumption and reaction times are of critical importance. Similarly, engineered systems also strive for short time-to-solution and low energy-to-solution characteristics. At the level of neuronal implementation, this implies achieving the desired results with as few and as early spikes as possible. In the time-to-first-spike coding framework, both of these goals are inherently emerging features of learning. Here, we describe a rigorous derivation of error-backpropagation-based learning for hierarchical networks of leaky integrate-and-fire neurons. We explicitly address two issues that are relevant for both biological plausibility and applicability to neuromorphic substrates by incorporating dynamics with finite time constants and by optimizing the backward pass with respect to substrate variability. This narrows the gap between previous models of first-spike-time learning and biological neuronal dynamics, thereby also enabling fast and energy-efficient inference on analog neuromorphic devices that inherit these dynamics from their biological archetypes, which we demonstrate on two generations of the BrainScaleS analog neuromorphic architecture.
In recent years, the machine learning landscape has been dominated by deep learning methods. Among the benchmark problems they managed to crack, some had been thought to remain elusive for a long time lecun2015deep, krizhevsky2012imagenet, goodfellow2014generative, silver2017mastering, vaswani2017attention. It is thus no exaggeration to say that deep learning has transformed our understanding and the future role of “artificial intelligence” brooks2012brain, ng2016artificial, hassabis2017neuroscience, sejnowski2018deep, richards2019deep.
However, compared to abstract neural networks used in deep learning, their more biological archetypes – spiking neural networks – still lag behind in performance and scalability pfeiffer2018deep. Reasons for this difference in success are numerous; for instance, unlike abstract neurons, even an individual biological neuron represents a complex system, with finite response times, membrane dynamics and spike-based communication gerstner2001different, izhikevich2004model, making it more challenging to find reliable coding and computation paradigms gerstner1998spiking, maass2016searching, davies2019benchmarks. Furthermore, one of the major driving forces behind the success of deep learning, the backpropagation of errors algorithm Rumelhart1986, remained incompatible with spiking neural networks for a long time esser2015backpropagation, schmitt2017neuromorphic, tavanaei2018deep, neftci2019surrogate.
Despite these challenges, spiking neural networks promise to hold some intriguing advantages. The asynchronous nature of spike-based communication allows a coding scheme that utilizes both spatial and temporal dimensions gutig2006tempotron, unlike rate-based or spike-count-based approaches cao2015spiking, diehl2016conversion, schmitt2017neuromorphic, wu2019spikecount, where the information of spike times is lost due to temporal or population averaging. Due to the inherent parallelism of all biological, as well as many biologically-inspired, neuromorphic systems, this promises fast, sparse and energy-efficient information processing, which might hold the key to novel computing architectures that could one day rival the efficiency of the brain itself mead1990neuromorphic, indiveri2011neuromorphic, roy2019towards. This makes spiking neural networks potentially more powerful than the “conventional”, simple models currently used in machine learning maass1997networks, even though this potential still remains mostly unexploited pfeiffer2018deep.
Many attempts have been made to reconcile spiking neural networks with their abstract counterparts in terms of functionality, e.g., featuring spike-based stochastic inference models petrovici2013stochastic, neftci2014event, petrovici2016stochastic, neftci2016stochastic, leng2018spiking, kungl2019accelerated, dold2019stochasticity, jordan2019deterministic and deep models trained on target spike times by shallow learning rules kheradpisheh2018stdp, illing2019biologically or using spike-compatible versions of the error backpropagation algorithm bohte2000spikeprop, lee2016training, o2016deep, zenke2018superspike, huh2018gradient, jin2018hybrid, tavanaei2018deep, kulkarni2018spiking, wu2018spatio, wu2019spikecount, bellec2019eprop. A particularly elegant way of utilizing the temporal aspect of exact spike times is the time-to-first-spike (TTFS) coding scheme thorpe2001spike. Here, a neuron encodes a continuous variable as the time elapsed before its first spike. Such single-spike coding enables fast information processing by inherently encouraging as few and as early spikes as possible, which meets physiological constraints and reaction times observed in humans and animals thorpe1996speed,decharms1996primary,wehr1996odour,johansson2004first,gollisch2008rapid,saal2016importance,portelli2016rank. Apart from biological plausibility, such a coding scheme is a natural fit for neuromorphic systems that offer energy-efficient and fast emulation of spiking neural networks schemmel2010wafer,akopyan2015truenorth,friedmann2016hybridlearning,davies2018loihi,mayr2019spinnaker,pei2019towards.
For hierarchical TTFS networks, a gradient-descent-based learning rule was proposed in mostafa2017supervised,kheradpisheh2019s4nn, using error backpropagation on a continuous function of output spike times. However, this approach is limited to a neuron model without leak, which is neither biologically plausible, nor compatible with most analog VLSI neuron dynamics thakur2018large. We extend the aforementioned approach to the leaky integrate-and-fire (LIF) model with current-based (CuBa) synapses, which represents an analytically treatable dynamical model of spiking neurons with realistic integration behavior rauch2003neocortical,gerstner2009good,teeter2018generalized, i.e., with finite membrane (τ_m) and synaptic (τ_s) time constants. For three special cases (τ_s → ∞, τ_s = τ_m and τ_m = 2τ_s), both the times-to-first-spike as well as the gradients of the loss function are analytically calculable.
The closed and exact analytical form of the proposed model, especially for the gradients used in weight updates, enables a robust implementation on neuromorphic physical-model systems. We demonstrate such an implementation on the BrainScaleS-2 friedmann2016hybridlearning,billaudelle2019versatile and BrainScaleS-1 schemmel2010wafer,schmitt2017neuromorphic,kungl2019accelerated accelerated, analog spiking neuromorphic systems. By incorporating information generated on the hardware for updates during training, the algorithm can adapt to the imperfections of the analog circuits. This allows a natural transfer from theory and simulation to a neuromorphic physical-model system, demonstrating that the proposed model deals well with various drawbacks of physical-model systems such as fixed-pattern noise and limited parameter precision and control. Such robustness of coding and learning under imperfections of the underlying neuronal substrate represents a quintessentially desirable property for every model claiming biological plausibility and for every application geared towards physical computing systems prodromakis2010review,esser2015backpropagation,van2018organic,wunderlich2019demonstrating,kungl2019accelerated,dold2019stochasticity,feldmann2019all.
In the following, we first introduce the CuBa LIF model and the TTFS coding scheme (Section 2.1), before we demonstrate how both inference and training via error backpropagation can be performed analytically with such dynamics (Section 2.2). Finally, the presented model is evaluated both in software simulations (Section 3.1) and neuromorphic emulations (Section 3.2).
The dynamics of an LIF neuron with CuBa synapses are given by

    C_m du(t)/dt = g_l [E_l − u(t)] + Σ_i w_i Σ_k θ(t − t_i^(k)) exp[−(t − t_i^(k))/τ_s] ,   (1)

with membrane capacitance C_m, leak conductance g_l, presynaptic weights w_i and spike times t_i^(k), synaptic time constant τ_s and the Heaviside step function θ. The first sum runs over all presynaptic neurons i, while the second sum runs over all spikes k of each presynaptic neuron. The neuron elicits a spike at time t_spike when the presynaptic input pushes the membrane potential above a threshold ϑ. After spiking, a neuron becomes refractory for a time period τ_ref, which is modeled by clamping its membrane potential to a reset value ϱ: u(t) = ϱ for t ∈ (t_spike, t_spike + τ_ref]. For convenience and without loss of generality, we set the leak potential E_l = 0. Eqn. 1 can be solved analytically, which yields the subthreshold dynamics
with membrane time constant τ_m = C_m/g_l and the PSP kernel given by a difference of exponentials, exp[−(t − t_i)/τ_m] − exp[−(t − t_i)/τ_s]. Here we already assumed the TTFS use case with only one relevant spike for each neuron, for which the second sum in Eqn. 1 reduces to a single term. The ratio of τ_s and τ_m ultimately influences the shape of a PSP, starting from a simple exponential (τ_s → 0), to a difference of exponentials (with an alpha function for the special case of τ_s = τ_m), to a graded step function (τ_s → ∞) (Fig. 1C).
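The kernel shapes discussed above can be illustrated with a short numerical sketch; the following minimal example drops the C_m and weight prefactors (a hypothetical unit normalization), so only the PSP shape for the different time constant regimes is shown:

```python
import numpy as np

def psp_kernel(t, tau_m, tau_s):
    """Difference-of-exponentials PSP kernel (shape only, prefactors dropped).

    Reduces to an alpha function for tau_m == tau_s (via l'Hopital's rule)
    and approaches a graded step as tau_s grows large.
    """
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    mask = t > 0.0  # causality: no effect before the presynaptic spike
    if np.isclose(tau_m, tau_s):
        # limit case: alpha function (t / tau) * exp(-t / tau)
        out[mask] = (t[mask] / tau_m) * np.exp(-t[mask] / tau_m)
    else:
        pref = tau_m * tau_s / (tau_m - tau_s)
        out[mask] = pref * (np.exp(-t[mask] / tau_m) - np.exp(-t[mask] / tau_s))
    return out
```

The separate alpha-function branch implements the l'Hôpital limit of the general expression, mirroring the special-case treatment in Appendix A.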
The first two cases are markedly different from the last one, which is also known as either the nLIF or simply integrate-and-fire (IF) model and was used in previous work mostafa2017supervised. In the nLIF model, input to the membrane is never forgotten, as opposed to the LIF model, where the PSP reaches a peak after finite time and subsequently decays back to its baseline. In other words, presynaptic spikes in the LIF model have a purely local effect in time, unlike in the nLIF model, where only the onset of a PSP is localized in time, but the postsynaptic effect remains forever, or until the postsynaptic neuron spikes. A finite τ_m thus assigns much more importance to the time differences between input spikes and introduces discontinuities in the neuronal output that make an analytical treatment more difficult (Fig. 1D).
Our spike-based neural code follows an idea first proposed in mostafa2017supervised. Unlike the coding in ANNs and different from rate-based codes in SNNs, this scheme explicitly uses the timing of individual spikes to encode information. In TTFS coding, the presence of a feature is reflected by the timing of a neuron’s first spike, with earlier spikes representing a more strongly manifested feature. This has the effect that important information inherently propagates faster through the network, with potentially only a few spikes needed for the network to process an input. Consequently, this scheme enables a more efficient processing of inputs, both in terms of time-to-solution and energy-to-solution (assuming the latter depends on the total number of spikes and the time required for the network to solve, e.g., an input classification problem).
In order to formulate the optimization of first-spike times as a gradient-descent problem, we need to derive closed-form expressions for these spike times t_k. This is equivalent to finding the time of the first threshold crossing by solving u(t_k) = ϑ for t_k. Even though a general closed-form solution does not exist, we show analytical solutions for three specific cases: (i) τ_s → ∞, (ii) τ_s = τ_m and (iii) τ_m = 2τ_s:
These expressions involve the Lambert W function W as well as shorthand sums a_k and b_k over the set of causal presynaptic spikes C_k (i.e., those arriving before the output spike). All three equations are differentiable with respect to synaptic weights and presynaptic spike times. For a detailed derivation of these results, we refer to Appendix A.
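For the τ_s = τ_m = τ case, the closed-form spike time can be evaluated directly with the Lambert W function. The sketch below is a minimal illustration under simplifying assumptions: alpha-shaped PSPs of the form w_i (t − t_i)/τ · exp(−(t − t_i)/τ), scaling constants absorbed into the threshold, and a fixed causal set containing all listed inputs:

```python
import numpy as np
from scipy.special import lambertw

def first_spike_time(weights, times, tau, theta):
    """Closed-form first-spike time for tau_s == tau_m == tau.

    `weights`, `times`: causal presynaptic weights and spike times.
    `theta`: threshold in reduced units (C_m etc. absorbed, an assumption
    of this sketch). Returns np.inf if no threshold crossing exists.
    """
    w = np.asarray(weights, float)
    t = np.asarray(times, float)
    A = np.sum(w * np.exp(t / tau))          # shorthand sum a_k
    B = np.sum(w * t * np.exp(t / tau))      # shorthand sum b_k
    if A <= 0.0:
        return np.inf
    arg = -(theta / A) * np.exp(B / (A * tau))
    if arg < -np.exp(-1.0):                  # below the branch point: no spike
        return np.inf
    # principal branch W_0 yields the earlier of the two threshold crossings
    return float(B / A - tau * np.real(lambertw(arg, 0)))
```

The principal branch W_0 selects the earlier of the two threshold crossings, in line with the branch discussion in Appendix A.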
The implicit assumption that only the first spike emitted by each neuron is relevant for downstream processing can effectively be enforced by using a long enough refractory period. Since the only information-carrying signal that is not reset upon firing is the synaptic memory, which is forgotten on the time scale of τ_s, we found that, in practice, setting τ_ref large compared to τ_s leads to most neurons eliciting only one spike before the classification of a given input pattern.
The case τ_s → ∞ has already been discussed in mostafa2017supervised and was reproduced here for completeness and comparison. Due to the symmetry in τ_m and τ_s of the PSP (Eqn. A7), the case τ_m = 2τ_s describes the case τ_s = 2τ_m as well. Using Eqns. 4, 5 and 6, we can treat the TTFS network much like an ANN, where spike times are propagated instead of rates. In a layered feed-forward network, Eqns. 4, 5 and 6 can be used iteratively, i.e., one can calculate the spike times of the first layer, use these to calculate the spike times of the second layer, etc., until the label neurons are reached.
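This iterative, layer-by-layer evaluation can be sketched as follows for the τ_s = τ_m case (in the same reduced units as above); the brute-force causal-set search used here is a simplified stand-in for the bookkeeping described in Appendix A:

```python
import numpy as np
from scipy.special import lambertw

def layer_spike_times(t_in, W, tau, theta):
    """One TTFS layer: input spike times -> output spike times (tau_s == tau_m).

    W[i, j] projects input neuron i onto output neuron j. Only causal inputs
    (those arriving before the output spike) may contribute; candidate causal
    sets are tried in order of input spike time.
    """
    order = np.argsort(t_in)
    t_out = np.full(W.shape[1], np.inf)
    for j in range(W.shape[1]):
        for k in range(1, len(order) + 1):
            idx = order[:k]                      # earliest k inputs as causal set
            w, t = W[idx, j], t_in[idx]
            A = np.sum(w * np.exp(t / tau))
            B = np.sum(w * t * np.exp(t / tau))
            if A <= 0.0:
                continue
            arg = -(theta / A) * np.exp(B / (A * tau))
            if arg < -np.exp(-1.0):              # no crossing for this set
                continue
            ts = B / A - tau * np.real(lambertw(arg, 0))
            # accept only if the crossing lies after the last causal input
            # and before the next one (otherwise the causal set is inconsistent)
            nxt = t_in[order[k]] if k < len(order) else np.inf
            if t.max() <= ts <= nxt:
                t_out[j] = ts
                break
    return t_out
```

Chaining such calls layer by layer propagates spike times through the network, analogously to propagating rates through an ANN.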
While we found both rules for finite time constants to work well in practice, we focus on τ_s = τ_m in the following, as the remaining cases can be treated analogously. Equations for all cases are derived in Appendix A.
As depicted in Fig. 1A, we consider feed-forward networks of CuBa LIF neurons. The input uses the same coding scheme as all other neurons, with input neurons spiking earlier for darker pixels. In particular, no external time reference is required: the network effectively processes contrast information and is essentially agnostic with respect to specific absolute input spike times. The output of the network is defined by the identity of the label neuron that spikes first (Fig. 1B).
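As a concrete illustration of the input coding, pixel values can be mapped linearly onto spike times within some input window; the window boundaries below are hypothetical placeholders, and only relative times matter to the network:

```python
import numpy as np

def encode_ttfs(brightness, t_early, t_late):
    """Map pixel brightness in [0, 1] to input spike times: darker pixels
    spike earlier, as for the network input described in the text. The time
    window [t_early, t_late] and the linear mapping are assumptions of this
    sketch; the network itself only processes contrast (relative times).
    """
    b = np.clip(np.asarray(brightness, float), 0.0, 1.0)
    return t_early + b * (t_late - t_early)
```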
We denote by t_i^(n) the output spike time of the ith neuron in the nth layer; e.g., for a network with N layers, t_i^(N) is the spike time of the ith neuron in the label layer. The weight projecting to the ith neuron of layer n from the jth neuron of layer n−1 is denoted by w_ij^(n).
To apply a variant of the error backpropagation algorithm Rumelhart1986, we choose a loss function that is differentiable with respect to synaptic weights and spike times. During learning, the objective is to maximize the temporal difference between the correct and all other label spikes while minimizing the time-to-correct-solution. The following loss function fulfills the above requirements:

    L = −ln[ exp(−t_{k*}/(ξ τ_s)) / Σ_k exp(−t_k/(ξ τ_s)) ] + α [exp(t_{k*}/(β τ_s)) − 1] ,   (9)

where t denotes the vector of label spike times, k* the index of the correct label, and ξ, α and β represent scaling parameters. Because the softmax-scaled spike times can be viewed as assigning a probability to the different labels, the first term represents a cross-entropy of this distribution relative to the true label distribution (which is 1 for the correct label and 0 otherwise). Reducing this term therefore increases the temporal difference between the output spike of the correct label neuron and all other label neurons. Notably, it only depends on spike time differences and is invariant under absolute time shifts, making it independent of artificial outside clocking. The second term is a regularizer that favors solutions where the correct label neuron spikes as early as possible.
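A numerical sketch of this loss (with hypothetical parameter names xi, alpha, beta standing in for the scaling parameters above) makes the two terms and the shift invariance of the first term explicit:

```python
import numpy as np

def ttfs_loss(t_label, true_idx, tau_s, xi=1.0, alpha=0.01, beta=1.0):
    """Cross-entropy over softmax-scaled (negative) spike times plus an
    early-spiking regularizer. Default parameter values are placeholders.
    """
    t = np.asarray(t_label, float)
    logits = -t / (xi * tau_s)                # earlier spike -> larger logit
    logits -= logits.max()                    # numerical stabilization
    log_softmax = logits - np.log(np.sum(np.exp(logits)))
    cross_entropy = -log_softmax[true_idx]    # invariant to absolute time shifts
    regularizer = alpha * (np.exp(t[true_idx] / (beta * tau_s)) - 1.0)
    return cross_entropy + regularizer
```

With alpha = 0, shifting all label spike times by a constant leaves the loss unchanged, reflecting the independence of absolute clocking noted above.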
Weights are updated such that they minimize the loss L. For weights projecting into the label layer, updates are calculated via the gradient

    Δw_ij^(N) = −η ∂L/∂w_ij^(N) = −η (∂L/∂t_i^(N)) (∂t_i^(N)/∂w_ij^(N)) ,

with learning rate η. In the derivative of the spike time, the actually emitted (true) output spike time can be reinserted in place of its analytical expression (cf. Appendix A). Using this additional information optimizes learning in scenarios where the inferred output spike and the true output spike differ (Section 3.1).
The weight updates of deeper layers can be calculated iteratively by application of the chain rule:

    Δw_ij^(n) = −η (∂L/∂t_i^(n)) (∂t_i^(n)/∂w_ij^(n)) ,

where the second term is local to the layer and the first term is a propagated error that can be calculated recursively with a sum over the neurons k in layer (n+1):

    ∂L/∂t_i^(n) = Σ_k (∂L/∂t_k^(n+1)) (∂t_k^(n+1)/∂t_i^(n)) .

The latter derivative amounts to, once the output spike time is reinserted as above, an expression that depends only on presynaptic weights, input spike times and the emitted output spike time (cf. Eqns. A17 and A24). The learning rule can be rewritten in layer-wise form to resemble the standard error backpropagation algorithm for abstract neurons (see Appendix B for the standard backpropagation equation):

    δ^(n) = (∂t^(n+1)/∂t^(n))^T δ^(n+1) ,    Δw^(n)_{·j} = −η δ^(n) ⊙ (∂t^(n)/∂w^(n)_{·j}) ,

where ⊙ is the element-wise product, the T-superscript denotes the transpose of a matrix and δ^(n) is a vector containing the backpropagated errors of layer n. The individual elements of the tensors above are given by the spike-time derivatives derived in Appendix A.
In this form, it becomes apparent that for training, only the label layer error and the neuron spike times are required, which can either be calculated using Eqn. 5 or obtained by simulating (or emulating) the LIF dynamics (Eqn. 1).
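For the τ_s = τ_m = τ case, the spike-time derivatives with the output spike time reinserted can be sketched via the implicit function theorem applied to the threshold condition (same reduced PSP units and absorbed scaling constants as assumed in the forward sketches above):

```python
import numpy as np

def spike_time_grads(w, t_in, t_out, tau, theta):
    """Analytic derivatives dt_out/dw_i and dt_out/dt_i for tau_s == tau_m,
    with the emitted output spike time t_out reinserted (the 'true spike'
    form that stabilizes learning). Scaling constants are absorbed into
    theta, an assumption of this sketch.
    """
    w = np.asarray(w, float)
    t_in = np.asarray(t_in, float)
    A = np.sum(w * np.exp(t_in / tau))                 # shorthand sum a_k
    denom = A - theta * np.exp(t_out / tau)            # implicit-function denominator
    dt_dw = -(t_out - t_in) * np.exp(t_in / tau) / denom
    dt_dt = -w * np.exp(t_in / tau) * ((t_out - t_in) / tau - 1.0) / denom
    return dt_dw, dt_dt
```

For a single input spike, dt_out/dt_in evaluates to exactly 1, reflecting the time-translation invariance of the coding scheme.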
As mentioned above, the treatment of the other two special cases is analogous to the above. Thus, for CuBa LIF neurons with finite time constants τ_m and τ_s, both the forward pathway (spike times) and the backward pathway (backpropagation) can be calculated analytically, using a loss that is differentiable with respect to both synaptic weights and neuronal spike times.
We demonstrate the above framework in a pattern classification task (Fig. 2A), with the spiking network (Fig. 1A) simulated in NEST Gewaltig:NEST. To assist learning, mini-batch training was used, and the weight updates calculated as described in Section 2.2 were L1-normalized layer-wise to be smaller than 10. Furthermore, for layers with more than 35% silent neurons averaged over the mini-batches, all afferent weights were increased by a fixed amount in order to ensure sufficient activity. In case this applied to multiple layers, only the first layer with insufficient spikes was boosted.
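The silent-neuron heuristic can be sketched as follows; the 35% threshold matches the text, while the boost `amount` is a hypothetical placeholder value:

```python
import numpy as np

def boost_silent_layer(weights, spiked_per_layer, frac=0.35, amount=0.05):
    """If more than `frac` of a layer's neurons stayed silent (averaged over
    the mini-batch), increase all afferent weights by a fixed `amount`;
    only the first such layer is boosted.

    `weights`: list of afferent weight matrices, one per non-input layer.
    `spiked_per_layer`: list of boolean arrays (batch x neurons), True = spiked.
    """
    for W, spiked in zip(weights, spiked_per_layer):
        silent_frac = 1.0 - np.mean(spiked)   # fraction of silent neurons
        if silent_frac > frac:
            W += amount                       # boost afferents, in place
            break                             # only the first deficient layer
    return weights
```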
Fig. 2B-K shows results from training a network with 49 visible, 80 hidden and 4 label neurons on this data set. While not used during training, the temporal evolution of the membrane potentials helps illustrate the learning process. Fig. 2B-D shows voltages in the label layer for one class (orange) before, during and after training, illustrating how the trained weights make the correct neuron spike earliest by a large margin (see also Fig. 2E).
The spike times, including the input spikes (vertical lines) and the ones in the hidden layer, are shown in a raster plot in Fig. 2E. Due to the finite membrane and synaptic time constants, output spikes can only happen within a finite time window after their inputs. In the particular training scenario described in Fig. 2, this renders output spikes occurring immediately before the late input spikes extremely unlikely. This effective gap can also be seen in the evolution of the spike times during training, where a small change in synaptic weights can bring a spike from the early into the late group and vice versa, leading to sudden jumps in both spike timing and loss (Fig. 2F-I,K; see also Fig. 1D).
The evolution of the label layer spike times for all four classes is shown in Fig. 2F-I, including the steps at which the voltage plots were recorded. The spike times for the different classes together decide both the accuracy (the proportion of correct classifications) and the loss (Eqn. 9). The evolution of both during training is shown in Fig. 2J,K.
It is important to note that we have chosen a simple dataset in order to make it amenable to emulation on our neuromorphic systems (Section 3.2). In particular, it is linearly separable and would thus not require backpropagation for perfect classification. Therefore, to demonstrate that error backpropagation is working as intended, we performed an additional simulation with frozen weights between the hidden and label layer, training only the ones between input and hidden layer. As expected and shown in Fig. 2L, training was successful in this setup, but took considerably longer for the same initialization and hyperparameters as used in Fig. 2B-K.
As mentioned above, our loss function consists of two parts (Eqn. 9), the first relating to TTFS coding and the second representing a regularizer that pushes correct neurons to spike early and stabilizes the training. Figure 3 shows the effect of regularization: it shifts correct label spikes to earlier times, which in turn causes the afferent active hidden neurons to spike earlier as well.
As noted in the introduction, TTFS coding is a natural fit for neuromorphic hardware due to its emphasis on speed (early spikes), especially for accelerated devices, which can additionally profit from their intrinsic speed-up (Section 3.2). However, this speed comes at the price of reduced control over certain neuron and synapse parameters. This implies in particular that the ratio of the membrane and synaptic time constants differs from the ideal values of τ_s = τ_m or τ_m = 2τ_s used in the derivations. It is therefore crucial that the learning rule still works under such parameter variability in order for it to be applicable to such neuromorphic substrates. The question at hand is whether our learning rules also work for parameter values other than the specific ones for which they were derived, and if so, how well. We study the robustness of learning for different time constant ratios by sweeping a range of membrane time constants, using the NEST Gewaltig:NEST simulator for the forward pass and the different learning rules for the backward pass.
One detail is important in this context. Eqns. 4, 5 and 6 for the spike times depend only on neuron parameters, presynaptic spike times and weights; thus, the derivatives needed for the weight update initially depend on those ’natural’ variables as well (Eqns. A15 and A22). With some manipulations, the equation for the actual output spike time can be inserted (Eqns. A17 and A24), producing a version of the learning rule that profits from more information from the forward pass and is thus significantly more stable. The two versions are identical only if the forward and backward pass assume exactly the same τ_s/τ_m ratios. The effect of this disparity is shown in Fig. 4. For both update rules, including information about the true spiking activity significantly improves learning over a wide range of τ_s/τ_m ratios.
In this framework, classification speed is a function of the network depth and the time constants τ_m and τ_s. Assuming typical biological timescales, most input patterns in the above scenario are classified within several milliseconds. By leveraging the speedup of neuromorphic systems such as BrainScaleS schmitt2017neuromorphic,billaudelle2019versatile, with intrinsic acceleration factors of 10^3-10^4, the same computation can be achieved within microseconds.
However, the speed advantages of such analog systems compared to software simulations come at the cost of reduced control, and training needs to cope with phenomena such as spike time jitter and neuron parameter variability. In particular, this implies τ_s/τ_m ratios that deviate from the ideal values assumed in the derivation, so the derived learning rule is only an approximation of true gradient descent on these systems, as discussed above.
We ported the network architecture and training scheme to the BrainScaleS-2 system, a mixed-signal accelerated neuromorphic platform. The ASIC is built around an analog neuromorphic core which emulates the dynamics of neurons and synapses. All state variables, such as membrane potentials and synaptic currents, are physically represented in their respective circuits and evolve continuously in time. Considering the natural time constants of such integrated analog circuits, this emulation takes place at 1000-fold accelerated time scales compared to the biological nervous system. One BrainScaleS-2 core features 512 AdEx neuron circuits, which can be freely configured; these circuits can be restricted to LIF dynamics as required by our training framework aamir2018lifarray,aamir2018adex. Both the membrane and synaptic time constants were calibrated to a common target value.
Each neuron circuit integrates stimuli from a column of 256 current-based synapses friedmann2016hybridlearning. Each synapse holds a weight value; its sign is shared with all other synapses located on the same row in the synapse matrix. The presented training scheme, however, allows weights to continuously transition between excitation and inhibition. We therefore allocated pairs of synapse rows to convey the activity of single presynaptic partners, one configured for excitation, the other one for inhibition.
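The mapping of signed software weights onto such row pairs can be sketched as:

```python
import numpy as np

def split_signed_weights(W):
    """Map a signed weight matrix onto an (excitatory, inhibitory) pair of
    synapse rows per presynaptic partner, as on BrainScaleS-2 where the sign
    is shared along a synapse row. Both returned matrices hold non-negative
    magnitudes; their difference recovers the signed weights.
    """
    exc = np.maximum(W, 0.0)    # excitatory row: positive part
    inh = np.maximum(-W, 0.0)   # inhibitory row: magnitude of negative part
    return exc, inh
```

This split lets a weight transition continuously between excitation and inhibition during training, as required by the learning scheme.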
Synapses receive their inputs from an event routing module that allows connecting neurons within a chip as well as injecting stimuli from external sources. Events emitted by the neuron circuits are annotated with a time stamp and then sent off-chip. The neuromorphic ASIC is accompanied by an FPGA to handle the communication with the host computer. It also provides mechanisms for low-latency experiment control, including the timed release of spike trains into the neuromorphic core. The FPGA is furthermore used to record events and digitized membrane traces originating from the ASIC.
We used an in-the-loop training approach, where emulation runs on the neuromorphic substrate were interleaved with host-based weight update calculations schmitt2017neuromorphic. For the emulation of the forward pass, the data set was broken down into mini-batches, converted into input spike trains and then injected into the neuromorphic system via the FPGA. The latter was also used to record the spikes emitted by the hidden and label layers. Based on these output spike trains, weight updates were calculated on the host computer and then written back to the synapse memory. This backward pass shared its implementation with the previously described simulation framework.
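Schematically, the in-the-loop scheme alternates hardware runs and host-side updates; `hw` and its methods below are hypothetical placeholders, not the actual BrainScaleS API:

```python
# Sketch of the in-the-loop training scheme described above.
def train_in_the_loop(hw, batches, weights, compute_updates, epochs=10):
    """Alternate hardware forward passes with host-based weight updates.

    `hw`: hypothetical hardware interface (write_weights / run methods).
    `compute_updates`: host-side backward pass returning per-layer updates.
    """
    for _ in range(epochs):
        for input_spikes, targets in batches:
            hw.write_weights(weights)            # sync synapse memory
            spikes = hw.run(input_spikes)        # emulated forward pass
            # backward pass on the host, using recorded hardware spike times
            weights = [w - dw for w, dw in
                       zip(weights, compute_updates(spikes, targets, weights))]
    return weights
```

Because the updates are computed from the recorded hardware spikes, the algorithm can adapt to fixed-pattern noise and other analog imperfections, as described in the text.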
We were able to successfully and reliably train the network emulated on BrainScaleS-2 on the discussed data set (Fig. 5). The system quickly learned to fully discriminate between the presented patterns, with clear separation between label spike times. Learning performance in terms of convergence speed is difficult to compare, because the hyperparameters are not easily transferable, but appears similar to the numerical simulations of the same network. After training, due to the interplay of the system’s intrinsic acceleration and the nature of the learning algorithm itself, each pattern was classified within microseconds.
To demonstrate the amenability of our approach to different neuromorphic substrates, we also tested it on the BrainScaleS-1 system schemmel2010wafer. This version of BrainScaleS has a very similar architecture to BrainScaleS-2, but its component chips are interconnected through postprocessing on their common wafer (wafer-scale integration). More importantly for our coding scheme and learning rules, its circuits emulate CoBa instead of CuBa neurons. Furthermore, due to the different fabrication technology and design choices (in particular, the floating-gate parameter memory, see srowig2007analog, schemmel2010wafer, Koke2017), the parameter variability and spike time jitter are significantly higher than on BrainScaleS-2 schmitt2017neuromorphic.
The training procedure was analogous to the one used on BrainScaleS-2. To accommodate the CoBa synapse dynamics, we introduced global weight scale factors that model the distance between reversal and leak potentials and the total conductance; these were multiplied with the synaptic weights to achieve a CuBa approximation for which our learning rules apply. Despite this approximation and the considerable substrate variability (compare, e.g., Fig. 6C-F with Fig. 5C-F), our framework was able to compensate well, almost matching the performance achieved on BrainScaleS-2 (Fig. 6).
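The CuBa approximation of the CoBa synapses amounts to a global rescaling; the function below is a sketch with hypothetical parameter names, based on linearizing the synaptic current near the leak potential:

```python
def coba_to_cuba_weight(w_coba, g_scale, e_rev, e_leak):
    """Approximate a conductance-based (CoBa) synapse by an effective
    current-based (CuBa) weight: near the leak potential, the synaptic
    current g(t) * (E_rev - u) is approximately g(t) * (E_rev - E_leak).
    All parameter names are placeholders for the global scale factors
    described in the text.
    """
    return w_coba * g_scale * (e_rev - e_leak)
```

Note that the sign of the effective weight follows from the position of the reversal potential relative to the leak potential.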
In this manuscript, we proposed a model of deep time-to-first-spike learning that builds on a principled view of neuro-synaptic dynamics with finite time constants and comes with exact learning rules for optimizing first-spike times; an early version of this work was presented in goeltz2019mastersthesis. In this quintessentially spike-based learning framework, only single spike times are required for calculating the weight updates, thus reducing the memory (bandwidth and capacity) requirements of synaptic updates in comparison to, e.g., rate coding approaches (see, e.g., schmitt2017neuromorphic for an example of deep but rate-based learning that was also applied to a BrainScaleS system).
Our work builds on earlier results by mostafa2017supervised, which we extended to accommodate leaky integrate-and-fire neurons, thereby including more biologically plausible and neuromorphic-hardware-compatible neuro-synaptic dynamics. Additionally, we introduced a regularizing loss term that favors early classification, thereby significantly improving the time-to-solution of the network. To account for substrate variability, we further incorporated output spike times directly into the backward pass, which extends the applicability of our derived learning rules to a wide range of parameters, thus allowing us to demonstrate the framework on two different neuromorphic platforms (two generations of the BrainScaleS architecture) that exhibit varying degrees of parameter noise in their analog components. Unlike other approaches mostafa2017supervised,comsa2019temporal,kheradpisheh2019s4nn, we do not use any kind of clocking or bias spikes, and are thereby independent of any absolute time reference or global clock signal.
The complexity of the learned dataset was mostly limited by the size of the used substrate, but we expect the framework to scale to significantly more challenging problems, as suggested by the FPGA-based experiments in mostafa2017fpga. After learning, the network needed less than one spike per neuron to produce a correct classification on all used substrates. With these few spikes, we achieved a time-to-solution of less than one synaptic time constant. Since the dynamical timescales directly affect the duration of the network emulation between synaptic updates, this inherently leads to a significant reduction of the total training time. Taking into consideration relaxation times between patterns, our setup was able to sustain a high concatenated pattern throughput, independently of the emulated network size. These results promise an efficient exploitation of such accelerated neuromorphic substrates for high-throughput inference on spiking input data.
The authors wish to thank Sebastian Schmitt for BrainScaleS-1 support as well as Jakob Jordan and Nico Gürtler for valuable discussions.
The authors gratefully acknowledge funding from the European Union under grant agreements 604102, 720270, 785907 (HBP) and the Manfred Stärk Foundation.
Simulations were performed on the bwForCluster NEMO, supported by the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grant no INST 39/963-1 FUGG.
For each of the special cases, a solution for the spike time t_k, defined by the threshold condition u(t_k) = ϑ (Eqn. A1), has to be found, given the LIF dynamics of Eqn. 1 (Eqn. A2). For convenience, we use shorthand definitions a_k and b_k (Eqn. A4), with summation over the set of causal presynaptic spikes C_k.
With this choice of time constants, the first term in Eqn. A2 simplifies and we recover the nLIF case discussed in mostafa2017supervised. Given the existence of an output spike, the spike time appears in Eqn. A1 in only one place, and simple reordering yields the closed-form solution, where we used Eqn. A4 for b_k and a_k, the latter being the sum over the causal weights.
According to l’Hôpital’s rule, in the limit τ_s → τ_m, Eqn. A2 becomes a sum over alpha functions of the form (t − t_i) exp[−(t − t_i)/τ_m]. The variable z is introduced to bring the equation into the form

    z e^z = c ,

which can be solved with the differentiable Lambert W function, z = W(c). The goal is now to bring Eqn. A8 into this form; this is achieved by a reformulation in terms of a_k and b_k. With the definition of the Lambert W function, the spike time can be written in the closed form of Eqn. A12.
Given that a spike happens, there will be two threshold crossings: one from below at the actual spike time, and one from above when the voltage decays back to the leak potential (Fig. A1A,B). Correspondingly, the Lambert W function (Fig. A1C,D) has two real branches (in addition to infinitely many complex ones), and we need to choose the branch that returns the earlier solution. In case the voltage is only tangent to the threshold at its maximum, the Lambert W function has only one solution.
For choosing the branch in the other cases, we need to look at the argument of W from the definition, i.e., the term inside the Lambert W function in Eqn. A12. In a setting with only one strong enough input spike, the summations in a_k and b_k reduce to single terms. Because the maximum of the PSP for τ_s = τ_m occurs one membrane time constant after the input spike, we know that the spike must occur no later than this maximum, and therefore W ≥ −1. This corresponds to the branch point of the Lambert W function, meaning we must choose the branch with W ≥ −1. For a general setting, if we know a spike exists, we expect a_k and b_k to be positive. In order to get the earlier threshold crossing, we need the branch that returns the larger value of W (Fig. A1D), that is, the principal branch W_0.
The derivatives with respect to w_i and t_i for i in the causal set come down to applications of the chain rule. A crucial step is to reinsert the definition of the spike time where possible (cf. Section 3.1). For this we need the derivative of the Lambert W function, dW(c)/dc = W(c)/[c (1 + W(c))], which follows from differentiating its definition (Eqn. A9) with respect to c. With this derivative, one can calculate the derivatives of Eqn. A12 with respect to incoming weights and spike times as functions of presynaptic weights, input spike times and the output spike time (Eqns. A15 and A17).
The quadratic equation has two solutions that correspond to the voltage crossing the threshold at the spike time and relaxing back towards the leak potential later; again, we want the earlier of the two solutions. It follows from the monotonicity of the logarithm that the earlier time is the one with the larger denominator. Because an output spike requires an excess of recent, positively weighted input spikes, a_k and b_k are positive, and the solution with the larger denominator is the correct one.
Using a shorthand definition for brevity, the derivatives of Eqn. A21 follow analogously (Eqn. A22). Again, inserting the output spike time yields the stabilized form (Eqn. A24).
The standard error backpropagation formula for artificial (rate-based) neural networks Rumelhart1986 with rates r^(n) is given by

    δ^(n) = [ (W^(n+1))^T δ^(n+1) ] ⊙ φ′(a^(n)) ,    ΔW^(n) = −η δ^(n) (r^(n−1))^T ,

with activation function φ, pre-activations a^(n) and learning rate η. Traditionally, in artificial neural networks, the last layer is a linear classifier, but here, to highlight the resemblance to rate-based neurons, we define the loss function on the activations of the neurons in the last layer, L = L(r^(N), r*), where r* is the target label in one-hot coding.