Fast and deep neuromorphic learning with time-to-first-spike coding

12/24/2019 · by Julian Göltz, et al.

For a biological agent operating under environmental pressure, energy consumption and reaction times are of critical importance. Similarly, engineered systems also strive for short time-to-solution and low energy-to-solution characteristics. At the level of neuronal implementation, this implies achieving the desired results with as few and as early spikes as possible. In the time-to-first-spike coding framework, both of these goals are inherently emerging features of learning. Here, we describe a rigorous derivation of error-backpropagation-based learning for hierarchical networks of leaky integrate-and-fire neurons. We explicitly address two issues that are relevant for both biological plausibility and applicability to neuromorphic substrates by incorporating dynamics with finite time constants and by optimizing the backward pass with respect to substrate variability. This narrows the gap between previous models of first-spike-time learning and biological neuronal dynamics, thereby also enabling fast and energy-efficient inference on analog neuromorphic devices that inherit these dynamics from their biological archetypes, which we demonstrate on two generations of the BrainScaleS analog neuromorphic architecture.


1 Introduction

In recent years, the machine learning landscape has been dominated by deep learning methods. Among the benchmark problems they managed to crack, some were thought to remain elusive for a long time lecun2015deep, krizhevsky2012imagenet, goodfellow2014generative, silver2017mastering, vaswani2017attention. It is thus not exaggerated to say that deep learning has reshaped our understanding and the future role of "artificial intelligence" brooks2012brain, ng2016artificial, hassabis2017neuroscience, sejnowski2018deep, richards2019deep.

However, compared to abstract neural networks used in deep learning, their more biological archetypes – spiking neural networks – still lag behind in performance and scalability pfeiffer2018deep. Reasons for this difference in success are numerous; for instance, unlike abstract neurons, even an individual biological neuron represents a complex system, with finite response times, membrane dynamics and spike-based communication gerstner2001different, izhikevich2004model, making it more challenging to find reliable coding and computation paradigms gerstner1998spiking, maass2016searching, davies2019benchmarks. Furthermore, one of the major driving forces behind the success of deep learning, the backpropagation of errors algorithm Rumelhart1986, remained incompatible with spiking neural networks for a long time esser2015backpropagation, schmitt2017neuromorphic, tavanaei2018deep, neftci2019surrogate.

Despite these challenges, spiking neural networks promise to hold some intriguing advantages. The asynchronous nature of spike-based communication allows a coding scheme that utilizes both spatial and temporal dimensions gutig2006tempotron, unlike rate-based or spike-count-based approaches cao2015spiking, diehl2016conversion, schmitt2017neuromorphic, wu2019spikecount, where the information contained in spike times is lost due to temporal or population averaging. Due to the inherent parallelism of all biological, as well as many biologically inspired neuromorphic systems, this promises fast, sparse and energy-efficient information processing, which might hold the key to novel computing architectures that could one day rival the efficiency of the brain itself mead1990neuromorphic, indiveri2011neuromorphic, roy2019towards. This makes spiking neural networks potentially more powerful than the "conventional", simple models currently used in machine learning maass1997networks, even though this potential still remains mostly unexploited pfeiffer2018deep.

Many attempts have been made to reconcile spiking neural networks with their abstract counterparts in terms of functionality, e.g., featuring spike-based stochastic inference models petrovici2013stochastic, neftci2014event, petrovici2016stochastic, neftci2016stochastic, leng2018spiking, kungl2019accelerated, dold2019stochasticity, jordan2019deterministic and deep models trained on target spike times by shallow learning rules kheradpisheh2018stdp, illing2019biologically or using spike-compatible versions of the error backpropagation algorithm bohte2000spikeprop, lee2016training, o2016deep, zenke2018superspike, huh2018gradient, jin2018hybrid, tavanaei2018deep, kulkarni2018spiking, wu2018spatio, wu2019spikecount, bellec2019eprop. A particularly elegant way of utilizing the temporal aspect of exact spike times is the TTFS (time-to-first-spike) coding scheme thorpe2001spike. Here, a neuron encodes a continuous variable as the time elapsed before its first spike. Such single-spike coding enables fast information processing by inherently encouraging as few and as early spikes as possible, which is consistent with physiological constraints and reaction times observed in humans and animals thorpe1996speed, decharms1996primary, wehr1996odour, johansson2004first, gollisch2008rapid, saal2016importance, portelli2016rank. Apart from biological plausibility, such a coding scheme is a natural fit for neuromorphic systems that offer energy-efficient and fast emulation of spiking neural networks schemmel2010wafer, akopyan2015truenorth, friedmann2016hybridlearning, davies2018loihi, mayr2019spinnaker, pei2019towards.

For hierarchical TTFS networks, a gradient-descent-based learning rule was proposed in mostafa2017supervised, kheradpisheh2019s4nn, using error backpropagation on a continuous function of output spike times. However, this approach is limited to a neuron model without leak, which is neither biologically plausible, nor compatible with most analog VLSI neuron dynamics thakur2018large. We extend the aforementioned approach to the LIF model with CuBa (current-based) synapses, which represents an analytically treatable dynamical model of spiking neurons with realistic integration behavior rauch2003neocortical, gerstner2009good, teeter2018generalized, i.e., with finite membrane (τ_m) and synaptic (τ_s) time constants. For three special cases (τ_m → ∞, τ_m = τ_s and τ_m = 2τ_s), both the times-to-first-spike as well as the gradients of the loss function are analytically calculable.

The closed and exact analytical form of the proposed model, especially for the gradients used in weight updates, enables a robust implementation on neuromorphic physical-model systems. We demonstrate such an implementation on the BrainScaleS-2 friedmann2016hybridlearning, billaudelle2019versatile and BrainScaleS-1 schemmel2010wafer, schmitt2017neuromorphic, kungl2019accelerated accelerated, analog spiking neuromorphic systems. By incorporating information generated on the hardware into the updates during training, the algorithm can adapt to the imperfections of the analog circuits. This allows a natural transfer from theory and simulation to a neuromorphic physical-model system, demonstrating that the proposed model deals well with various drawbacks of physical-model systems such as fixed-pattern noise and limited parameter precision and control. Such robustness of coding and learning under imperfections of the underlying neuronal substrate represents a quintessentially desirable property for any model claiming biological plausibility and for any application geared towards physical computing systems prodromakis2010review, esser2015backpropagation, van2018organic, wunderlich2019demonstrating, kungl2019accelerated, dold2019stochasticity, feldmann2019all.

In the following, we first introduce the CuBa LIF model and the TTFS coding scheme (Section 2.1), before demonstrating how both inference and training via error backpropagation can be performed analytically with such dynamics (Section 2.2). Finally, the presented model is evaluated both in software simulations (Section 3.1) and in neuromorphic emulations (Section 3.2).

2 Mathematical results

Figure 1: Pattern recognition with time-to-first-spike coding. (A) Hierarchical feed-forward network structure. Colors encode labels throughout all figures. (B) Input, hidden and label spike times in a raster plot. The TTFS coding amounts to converting black/white pixels to early/late spikes; the first label neuron to spike determines the inferred class. (C) PSP shapes for different ratios of τ_m and τ_s. For finite τ_m the membrane 'forgets' prior input, making it fundamentally different from the case where τ_m is infinite. (D) Illustration of a key challenge posed by finite membrane time constants: small variations of synaptic weights (not shown) or input spike times (upward arrows) can result in an appearing/disappearing output spike and a corresponding discontinuity in the function describing its timing.

2.1 Preliminaries

Leaky integrate-and-fire dynamics

The dynamics of a LIF neuron with CuBa synapses are given by

C_m du(t)/dt = g_l [E_l − u(t)] + Σ_i w_i Σ_k θ(t − t_i^(k)) exp(−(t − t_i^(k))/τ_s)   (1)

with membrane capacitance C_m, leak conductance g_l, presynaptic weights w_i and spike times t_i^(k), synaptic time constant τ_s and the Heaviside step function θ. The first sum runs over all presynaptic neurons, while the second sum runs over all spikes of each presynaptic neuron. The neuron elicits a spike at time t_spike when the presynaptic input pushes the membrane potential above a threshold ϑ. After spiking, a neuron becomes refractory for a time period τ_ref, which is modeled by clamping its membrane potential to a reset value ρ: u(t) = ρ for t_spike < t ≤ t_spike + τ_ref. For convenience and without loss of generality, we set the leak potential E_l = 0. Eqn. 1 can be solved analytically, which yields the subthreshold dynamics

u(t) = (1/C_m) Σ_i w_i κ(t − t_i)   (2)
κ(t) = θ(t) · [τ_m τ_s / (τ_m − τ_s)] · ( e^{−t/τ_m} − e^{−t/τ_s} )   (3)

with membrane time constant τ_m = C_m/g_l and the PSP kernel κ given by a difference of exponentials. Here, we have already assumed the TTFS use case with only one relevant spike per neuron, for which the second sum in Eqn. 1 reduces to a single term. The ratio τ_m/τ_s ultimately determines the shape of a PSP, ranging from a simple exponential (τ_m → 0), over a difference of exponentials (with an alpha function for the special case τ_m = τ_s), to a graded step function (τ_m → ∞) (Fig. 1C).

The first two cases are markedly different from the last one, which is also known as the nLIF or simply integrate-and-fire (IF) model and was used in previous work mostafa2017supervised. In the nLIF model, input to the membrane is never forgotten, as opposed to the LIF model, where the PSP reaches a peak after a finite time and subsequently decays back to its baseline. In other words, presynaptic spikes in the LIF model have a purely local effect in time, unlike in the nLIF model, where only the onset of a PSP is localized in time, but the postsynaptic effect remains forever, or until the postsynaptic neuron spikes. A finite τ_m thus assigns much more importance to the time differences between input spikes and introduces discontinuities in the neuronal output that make an analytical treatment more difficult (Fig. 1D).
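The kernel of Eqn. 3 and its limiting cases can be sketched in a few lines of Python (function and parameter names are our own; the special cases are handled explicitly to avoid the singular prefactor at τ_m = τ_s):

```python
import numpy as np

def psp_kernel(t, tau_m, tau_s):
    """PSP kernel kappa(t) of the CuBa LIF model (Eqn. 3) with its limiting cases.

    kappa(t) = theta(t) * tau_m*tau_s/(tau_m - tau_s) * (exp(-t/tau_m) - exp(-t/tau_s))
    """
    t = np.asarray(t, dtype=float)
    step = (t > 0).astype(float)          # Heaviside: nothing happens before the input spike
    tp = np.maximum(t, 0.0)               # clipped time, avoids overflow for t << 0
    if np.isinf(tau_m):                   # nLIF limit: graded step, input is never forgotten
        return step * tau_s * (1.0 - np.exp(-tp / tau_s))
    if np.isclose(tau_m, tau_s):          # alpha function: the tau_m = tau_s special case
        return step * tp * np.exp(-tp / tau_s)
    prefactor = tau_m * tau_s / (tau_m - tau_s)
    return step * prefactor * (np.exp(-tp / tau_m) - np.exp(-tp / tau_s))
```

For finite τ_m the kernel rises and decays back to zero (local in time), while the nLIF branch saturates at τ_s, mirroring the discussion above.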

TTFS coding

Our spike-based neural code follows an idea first proposed in mostafa2017supervised. Unlike the activation-based coding of ANNs and the rate-based codes of many SNNs, this scheme explicitly uses the timing of individual spikes to encode information. In TTFS coding, the presence of a feature is reflected by the timing of a neuron's first spike, with earlier spikes representing a more strongly manifested feature. As a result, important information inherently propagates faster through the network, with potentially only a few spikes needed for the network to process an input. Consequently, this scheme enables a more efficient processing of inputs, both in terms of time-to-solution and energy-to-solution (assuming the latter depends on the total number of spikes and the time required for the network to solve, e.g., an input classification problem).
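A minimal sketch of such an encoder, assuming pixel intensities in [0, 1] and hypothetical bounds t_early and t_late for the resulting spike times (intensity 1, i.e., a black pixel, maps to the earliest spike, matching Fig. 1B):

```python
import numpy as np

def ttfs_encode(intensity, t_early, t_late):
    """Map pixel intensities in [0, 1] to first-spike times:
    strongly expressed features (intensity 1) spike at t_early, weak ones at t_late."""
    intensity = np.clip(np.asarray(intensity, dtype=float), 0.0, 1.0)
    return t_late - intensity * (t_late - t_early)
```

Only relative spike times matter for the network (see Section 2.2), so the absolute values of t_early and t_late are free parameters.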

2.2 Learning rules

In order to formulate the optimization of first-spike times as a gradient-descent problem, we need to derive closed-form expressions for these spike times. This is equivalent to finding the time of the first threshold crossing by solving u(t) = ϑ for t. Even though a general closed-form solution does not exist, we show analytical solutions for three specific cases: (i) τ_m → ∞, (ii) τ_m = τ_s and (iii) τ_m = 2τ_s:

t = τ_s ln( a_C / ( Σ_{i∈C} w_i − ϑ C_m/τ_s ) )   (4)
t = τ_s [ b_C/a_C − W( −(ϑ C_m)/(τ_s a_C) · e^{b_C/a_C} ) ]   (5)
t = 2τ_s ln( 2 a_C / ( ã_C + sqrt( ã_C² − 2 a_C ϑ C_m/τ_s ) ) ),  with ã_C = Σ_{i∈C} w_i e^{t_i/(2τ_s)}   (6)

where W is the Lambert W function and we use

a_C = Σ_{i∈C} w_i e^{t_i/τ_s}   (7)
b_C = Σ_{i∈C} w_i (t_i/τ_s) e^{t_i/τ_s}   (8)

as shorthand for sums over the set C of causal presynaptic spikes, i.e., those arriving before the postsynaptic spike. All three equations are differentiable with respect to synaptic weights and presynaptic spike times. For a detailed derivation of these results, we refer to Appendix A.

The implicit assumption that only the first spike emitted by every neuron is relevant for downstream processing can effectively be ensured by using a long enough refractory period. Since the only information-carrying signal that is not reset upon firing is the synaptic memory, which is forgotten on the time scale of τ_s, we found that, in practice, setting τ_ref to a sufficiently large multiple of τ_s leads to most neurons eliciting only one spike before the classification of a given input pattern.

The case τ_m → ∞ has already been discussed in mostafa2017supervised and is reproduced here for completeness and comparison. Due to the symmetry of the PSP in τ_m and τ_s (Eqn. A7), the case τ_m = 2τ_s covers the case τ_s = 2τ_m as well. Using Eqns. 4-6, we can treat the TTFS network much like an ANN, where spike times instead of rates are propagated. In a layered feed-forward network, Eqns. 4-6 can be applied iteratively, i.e., one can calculate the spike times of the first layer, use these to calculate the spike times of the second layer, etc., until the label neurons are reached.
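As an illustration, the Lambert-W-based spike time for the τ_m = τ_s case (Eqn. 5) can be evaluated with SciPy. The sketch below (interface, default parameters and the causal-set search are our own) grows the candidate causal set input by input and accepts a solution only if the predicted crossing falls between the last included and the next excluded input spike:

```python
import numpy as np
from scipy.special import lambertw

def first_spike_time(weights, t_in, theta=1.0, tau_s=1.0, c_m=1.0):
    """First threshold crossing of a CuBa LIF neuron with tau_m = tau_s (alpha PSPs),
    following Eqn. 5: t = tau_s * (b/a - W(-theta*C_m/(tau_s*a) * exp(b/a))).
    Returns infinity if the threshold is never reached."""
    order = np.argsort(t_in)
    w = np.asarray(weights, dtype=float)[order]
    t = np.asarray(t_in, dtype=float)[order]
    for n in range(1, len(w) + 1):                  # grow the causal set input by input
        a = np.sum(w[:n] * np.exp(t[:n] / tau_s))   # a_C (Eqn. 7)
        b = np.sum(w[:n] * (t[:n] / tau_s) * np.exp(t[:n] / tau_s))  # b_C (Eqn. 8)
        if a <= 0:
            continue
        z = -theta * c_m / (tau_s * a) * np.exp(b / a)
        if z < -np.exp(-1.0):                       # outside the domain of W: no crossing
            continue
        t_out = tau_s * (b / a - np.real(lambertw(z)))  # principal branch: rising crossing
        # accept only if consistent with the assumed causal set
        if t_out >= t[n - 1] and (n == len(w) or t_out <= t[n]):
            return float(t_out)
    return float('inf')
```

For a single input with weight w at time 0 (τ_s = ϑ = C_m = 1), the returned time t satisfies w·t·e^{−t} = ϑ by construction, and a subthreshold input correctly yields no spike.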

While we found both rules for finite τ_m to work well in practice, we focus on τ_m = τ_s in the following, as the other cases can be treated analogously. Equations for all cases are derived in Appendix A.

Exact error backpropagation with spikes

As depicted in Fig. 1A, we consider feed-forward networks of CuBa LIF neurons. The input uses the same coding scheme as all other neurons, with input neurons spiking earlier for darker pixels. In particular, no external time reference is required: the network effectively processes contrast information and is essentially agnostic with respect to specific absolute input spike times. The output of the network is defined by the identity of the label neuron that spikes first (Fig. 1B).

We denote by t_k^(l) the output spike time of the kth neuron in the lth layer; e.g., for a network with n layers, t_k^(n) is the spike time of the kth neuron in the label layer. The weight projecting to the kth neuron of layer l from the ith neuron of layer l−1 is denoted by w_ki^(l).

To apply a variant of the error backpropagation algorithm Rumelhart1986, we choose a loss function that is differentiable with respect to synaptic weights and spike times. During learning, the objective is to maximize the temporal difference between the correct and all other label spikes while minimizing the time-to-correct-solution. The following loss function fulfills the above requirements:

L = −ln[ e^{−t_{k*}^(n)/(ξ τ_s)} / Σ_k e^{−t_k^(n)/(ξ τ_s)} ] + α [ e^{t_{k*}^(n)/(β τ_s)} − 1 ]   (9)

where t^(n) denotes the vector of label spike times t_k^(n), k* the index of the correct label, and ξ, α and β

represent scaling parameters. Because the softmax-scaled spike times can be viewed as assigning a probability to the different labels, the first term represents the cross-entropy of this distribution relative to the true label distribution (which is 1 for the correct label and 0 otherwise). Reducing this term therefore increases the temporal difference between the output spike of the correct label neuron and those of all other label neurons. Notably, it only depends on spike time differences and is invariant under absolute time shifts, making it independent of artificial outside clocking. The second term is a regularizer that favors solutions where the correct label neuron spikes as early as possible.
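A direct transcription of this loss (argument names and default parameter values are placeholders) makes both properties easy to verify numerically: the cross-entropy term is invariant under a common shift of all label spike times, while the regularizer penalizes late correct spikes:

```python
import numpy as np

def ttfs_loss(t_label, correct, xi=1.0, alpha=0.01, beta=1.0, tau_s=1.0):
    """Loss of Eqn. 9: cross-entropy over softmax-scaled negative spike times
    (earlier spike = higher assigned probability) plus an early-spike regularizer."""
    t = np.asarray(t_label, dtype=float)
    logits = -t / (xi * tau_s)            # earlier spikes get larger logits
    # numerically stable log-softmax
    log_softmax = logits - logits.max() - np.log(np.sum(np.exp(logits - logits.max())))
    cross_entropy = -log_softmax[correct]
    regularizer = alpha * (np.exp(t[correct] / (beta * tau_s)) - 1.0)
    return float(cross_entropy + regularizer)
```

With α = 0, shifting all spike times by a constant leaves the loss unchanged, and swapping the correct and an earlier incorrect spike increases it, as described above.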

Weights are updated such that they minimize the loss L. For weights projecting into the label layer, updates are calculated via

Δw_ki^(n) = −η ∂L/∂w_ki^(n) = −η · [∂t_k^(n)/∂w_ki^(n)] · [∂L/∂t_k^(n)]   (10)

where η denotes the learning rate and the second factor can be obtained straightforwardly from Eqn. 9. The first factor depends on the PSP shape; the corresponding differentiation of Eqn. 5 results in

∂t_k^(l)/∂w_ki^(l) = [ τ_s e^{t_i/τ_s} / ( a_C (1 + W(z_k)) ) ] · [ t_i/τ_s − b_C/a_C + W(z_k) ],  z_k = −(ϑ C_m)/(τ_s a_C) e^{b_C/a_C}   (11)

for an arbitrary layer l, where a_C and b_C are given in Eqns. 7 and 8 and z_k is the argument of the Lambert W function in Eqn. 5. Using a relation for the derivative of W, the equation can be simplified and made to depend on the output spike time t_k^(l):

∂t_k^(l)/∂w_ki^(l) = e^{t_i/τ_s} ( t_i − t_k^(l) ) / ( a_C (1 − t_k^(l)/τ_s) + b_C )   (12)

Using this additional information optimizes learning in scenarios where the inferred output spike and the true output spike differ (Section 3.1).

The weight updates of deeper layers can be calculated iteratively by application of the chain rule:

Δw_ki^(l) = −η · [∂t_k^(l)/∂w_ki^(l)] · [∂L/∂t_k^(l)]   (13)

where the second factor is a propagated error that can be calculated recursively via a sum over the neurons j in layer l+1:

∂L/∂t_k^(l) = Σ_j [∂t_j^(l+1)/∂t_k^(l)] · [∂L/∂t_j^(l+1)]   (14)

Once the output spike time is reinserted as above, the latter derivative amounts to

∂t_j^(l+1)/∂t_k^(l) = w_jk^(l+1) e^{t_k^(l)/τ_s} [ 1 − (t_j^(l+1) − t_k^(l))/τ_s ] / ( a_C (1 − t_j^(l+1)/τ_s) + b_C )   (15)

The learning rule can be rewritten in layer-wise form to resemble the standard error backpropagation algorithm for abstract neurons (see Appendix B for the standard backpropagation equations):

ΔW^(l) = −η ( δ^(l) 1^T ) ⊙ P^(l)   (16)
δ^(l) = ( B^(l+1) )^T δ^(l+1)   (17)
δ^(n) = ∂L/∂t^(n)   (18)

where ⊙ denotes the element-wise product, the superscript T the transpose of a matrix, and δ^(l) is a vector containing the backpropagated errors of layer l. The individual elements of the tensors above are given by

[P^(l)]_ki = ∂t_k^(l)/∂w_ki^(l) = e^{t_i^(l−1)/τ_s} ( t_i^(l−1) − t_k^(l) ) / ( a_C (1 − t_k^(l)/τ_s) + b_C )   (19)
[B^(l)]_ki = ∂t_k^(l)/∂t_i^(l−1) = w_ki^(l) e^{t_i^(l−1)/τ_s} [ 1 − (t_k^(l) − t_i^(l−1))/τ_s ] / ( a_C (1 − t_k^(l)/τ_s) + b_C )   (20)
[δ^(n)]_k = ∂L/∂t_k^(n)   (21)

In this form, it becomes apparent that for training, only the label layer error and the neuron spike times are required; the latter can either be calculated using Eqn. 5 or by simulating (or emulating) the LIF dynamics (Eqn. 1).

As mentioned above, the treatment of the other two special cases is analogous. Thus, for CuBa LIF neurons with finite time constants (τ_m = τ_s, τ_m = 2τ_s and τ_s = 2τ_m), both the forward pathway (spike times) and the backward pathway (backpropagation) can be calculated analytically using a loss that is differentiable with respect to both synaptic weights and neuronal spike times.
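Under the assumption τ_m = τ_s, the layer-wise backward pass can be sketched as follows (array layout and the function signature are our own; the denominators are the implicit-differentiation form with the true output spike times reinserted, and non-causal entries are masked out):

```python
import numpy as np

def backward_pass(spike_times, weights, delta_label, tau_s=1.0, eta=0.05):
    """Layer-wise spike-time backpropagation for the tau_m = tau_s case.

    spike_times: [t^(0), ..., t^(n)], one spike-time vector per layer;
    weights: [W^(1), ..., W^(n)], W^(l) maps layer l-1 to layer l;
    delta_label: dL/dt for the label layer. Returns the weight updates per layer.
    """
    updates = []
    delta = np.asarray(delta_label, dtype=float)
    for l in range(len(weights), 0, -1):
        t_pre = np.asarray(spike_times[l - 1], dtype=float)
        t_post = np.asarray(spike_times[l], dtype=float)
        W = np.asarray(weights[l - 1], dtype=float)
        causal = (t_pre[None, :] < t_post[:, None]).astype(float)   # causal sets C
        e_pre = np.exp(t_pre / tau_s)[None, :]
        a = np.sum(causal * W * e_pre, axis=1)                      # a_C per postsynaptic neuron
        b = np.sum(causal * W * (t_pre / tau_s)[None, :] * e_pre, axis=1)  # b_C
        denom = (a * (1.0 - t_post / tau_s) + b)[:, None]
        dt_dw = causal * (t_pre[None, :] - t_post[:, None]) * e_pre / denom
        dt_dt = causal * W * e_pre * (1.0 - (t_post[:, None] - t_pre[None, :]) / tau_s) / denom
        updates.append(-eta * delta[:, None] * dt_dw)
        delta = dt_dt.T @ delta                                     # propagate error backwards
    return updates[::-1]
```

For a single excitatory input driving a single output neuron, dt/dw is negative (a stronger weight makes the neuron fire earlier), so a positive error leads to a positive weight update, as expected.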

Figure 2: Pattern recognition with time-to-first-spike coding for a simple data set. (A) Input pattern set consisting of four classes. (B-D) Voltage dynamics in the label layer (colored traces) and output spikes (downward arrows) induced by spikes from the hidden layer (upward arrows) for one pattern (plots of data from the same pattern are marked by a gray background) at different stages of training. As intended, the learning rule decreases the spike time of the correct label neuron while increasing the spike times of the other neurons. The colors of the traces correspond to the four different label neurons (Fig. 1A,B), with the correct one shown in orange. All times are given in units of the synaptic time constant τ_s. (E) Raster plot after training for the same sample, including spikes in the hidden layer (gray marks) and early and late input times (vertical lines). The classification (first label spike, orange) happens prior to a significant fraction of the hidden neuron spikes. (F-I) Evolution of label neuron spike times during training for all four classes, with colors marking the different label neurons as above, and incorrect labels shown lighter. The correct neuron's spike time decreases while all others are pushed back, producing a distinct gap. In (G), the snapshots from B, C and D are marked. (J) Evolution of accuracy during training. (K) Loss (green: first term of Eqn. 9; blue: α-weighted second term), which only reaches small values once the accuracy (J) is already at 100%. (L) To show that backpropagation is working, we trained only the weights from input to hidden neurons, keeping those from the hidden to label neurons fixed.

3 Demonstrating learning on various spiking substrates

3.1 Simulation results

Classification task
Figure 3: Spike time distributions over all four classes before (A,B) and after training without (α = 0; C,D) and with (α > 0; E,F) regularization. Left column (gray): hidden layer; right column: label layer, with correct spikes marked in green and false ones in orange. Here, we used noised inputs, with five examples per class, i.e., 20 patterns in total. The separation of the distribution into two distinct modes is a direct consequence of the black/white input and its encoding (see also Fig. 2). While the network trains to 100% accuracy for both values of α (data for α = 0 not shown), a nonzero α leads to significantly earlier spikes in the label layer, as well as to the correct label neurons never spiking during the second volley.

We demonstrate the above framework in a pattern classification task (Fig. 2A), with the spiking network (Fig. 1A) simulated in NEST Gewaltig:NEST. To assist learning, mini-batch training was used, and the weight updates calculated as described in Section 2.2 were L1-normalized layer-wise to be smaller than 10. Furthermore, for layers with more than 35% silent neurons averaged over the mini-batches, all afferent weights were increased by a fixed amount in order to ensure sufficient activity. In case multiple layers fulfilled this condition, only the first layer with insufficient spikes was boosted.
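These two heuristics can be summarized in a short helper (the boost increment is an assumed placeholder, as its value is not specified in the text; the interface is our own):

```python
import numpy as np

def condition_training_step(updates, silent_fractions, weights,
                            max_l1=10.0, boost=0.1, threshold=0.35):
    """Training heuristics described in the text: (i) rescale each layer's weight
    update so its L1 norm stays below max_l1; (ii) boost all afferent weights of
    the FIRST layer whose fraction of silent neurons exceeds the threshold.
    `boost` is an assumed placeholder value."""
    conditioned = []
    for du in updates:
        l1 = np.abs(du).sum()
        conditioned.append(du * (max_l1 / l1) if l1 > max_l1 else du)
    for l, frac in enumerate(silent_fractions):
        if frac > threshold:
            weights[l] = weights[l] + boost   # only the first silent layer is boosted
            break
    return conditioned, weights
```

The break after the first boosted layer reflects the statement that, when several layers lack activity, only the first one is corrected per step.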

Fig. 2B-K shows results from training a network with 49 visible, 80 hidden and 4 label neurons on this data set. While not used during training, the temporal evolution of the membrane potentials helps illustrate the learning process. Fig. 2B-D shows voltages in the label layer for one class (orange) before, during and after training, illustrating how the trained weights make the correct neuron spike earliest by a large margin (see also Fig. 2E).

The spike times including the input spikes (vertical lines) and the ones in the hidden layer are shown in a raster plot in Fig. 2E. Due to the finite membrane and synaptic time constants, output spikes can only happen within a finite time window after their inputs. In the particular training scenario described in Fig. 2, this renders output spikes happening immediately before the late input spikes extremely unlikely. This effective gap can also be seen in the evolution of the spike times during training, where a small change in synaptic weights can bring a spike from the early into the late group and vice-versa, leading to sudden jumps in both spike timing and loss (Fig. 2F-I,K; see also Fig. 1D).

The evolution of the label layer spike times for all four classes is shown in Fig. 2F-I, including the steps at which the voltage plots were recorded. The spike times for the different classes together decide both the accuracy (the proportion of correct classifications) and the loss (Eqn. 9). The evolution of both during training is shown in Fig. 2J,K.

It is important to note that we have chosen a simple dataset in order to make it amenable to emulation on our neuromorphic systems (Section 3.2). In particular, it is linearly separable and would thus not require backpropagation for perfect classification. Therefore, to demonstrate that error backpropagation is working as intended, we performed an additional simulation with frozen weights between the hidden and label layer, training only the ones between input and hidden layer. As expected, and as shown in Fig. 2L, training was successful in this setup, but took considerably longer for the same initialization and hyperparameters as used in Fig. 2B-K.

As mentioned above, our loss function consists of two parts (Eqn. 9), the first relating to TTFS coding and the second representing a regularizer that pushes correct neurons to spike early and stabilizes training. Figure 3 shows the effect of regularization: it shifts correct label spikes to earlier times, which in turn causes the afferent active hidden neurons to spike earlier as well.

Inserting substrate-specific information into the backward pass
Figure 4: Fitness of learning rules for varying τ_m/τ_s. We simulated the forward pass for different ratios τ_m/τ_s by varying the τ_m of the neurons in the network. Here, we show the median cross-entropy over the last 300 training steps after a total of 3000 training steps, averaged over 30 different random initializations. This allows us to compare the efficacy of learning in two scenarios for the backward pass. Solid lines: weight updates depending only on neuron parameters, presynaptic spike times and weights (Eqn. 11). Dashed lines: weight updates including the true output spike times (Eqns. 12 and 15). Dotted lines: ideal ratios (τ_m/τ_s = 1 and τ_m/τ_s = 2). Note how including output spike times significantly improves training for both learning rules across a wide range of parameters.

As noted in the introduction, TTFS coding is a natural fit for neuromorphic hardware due to its emphasis on speed (early spikes), especially for accelerated devices, which profit additionally from their intrinsic speed-up (Section 3.2). However, this speed comes at the price of reduced control over certain neuron and synapse parameters. In particular, this implies that the ratio τ_m/τ_s of the membrane and synaptic time constants differs from the ideal values of 1 or 2 used in our derivations. It is therefore crucial that the learning rule still works under such parameter variability in order for it to be applicable to such neuromorphic substrates. The question at hand is whether our learning rules also work for parameter values other than the specific ones for which they were derived, and if so, how well. We study the robustness of learning for different time constant ratios by sweeping a range of membrane time constants, using the NEST Gewaltig:NEST simulator for the forward pass and the different learning rules for the backward pass.

One detail is important in this context. Eqns. 4-6 for the spike times depend only on neuron parameters, presynaptic spike times and weights, so the derivatives needed for the weight update initially depend on those 'natural' variables as well (Eqns. A15 and A22). With some manipulation, the equation for the actual output spike time can be inserted (Eqns. A17 and A24), producing a version of the learning rule that profits from more information from the forward pass and is thus significantly more stable. The two versions are identical only if the forward and backward pass assume exactly the same ratio τ_m/τ_s. The effect of this disparity is shown in Fig. 4. For both update rules, including information about the true spiking activity significantly improves learning over a wide range of ratios.

3.2 Fast neuromorphic classification

In this framework, classification speed is a function of the network depth and the time constants τ_m and τ_s. Assuming typical biological timescales, most input patterns in the above scenario are classified within several tens of milliseconds. By leveraging the speedup of neuromorphic systems such as BrainScaleS schmitt2017neuromorphic, billaudelle2019versatile, with intrinsic acceleration factors of 10^3-10^4, the same computation can be achieved within microseconds.

However, the speed advantages of such analog systems compared to software simulations come at the cost of reduced control, and training needs to cope with phenomena such as spike time jitter and neuron parameter variability. In particular, this implies that the ratio τ_m/τ_s deviates from its ideal value, so the derived learning rule is only an approximation of true gradient descent on these systems, as discussed above.

3.2.1 Learning with TTFS on BrainScaleS-2

Figure 5: Neuromorphic pattern recognition with time-to-first-spike coding on BrainScaleS-2. (A) Accuracy and (B) loss (green: cross-entropy, blue: α-weighted regularizer in Eqn. 9) during training. (C-F) Evolution of label neuron spike times, displayed separately for the four classes. For each pattern, the spike time of the neuron representing this class is shown as a solid line in full color. (G) Raster plot of hidden neurons (gray) and the four label neurons (colored) after training, shown for a stimulus representing the second pattern. (H-K) Membrane voltage traces of the label neurons for the four classes, respectively. These analog membrane traces were digitized on the neuromorphic substrate after 100 training steps.

We ported the network architecture and training scheme to the BrainScaleS-2 system, a mixed-signal accelerated neuromorphic platform. The ASIC is built around an analog neuromorphic core which emulates the dynamics of neurons and synapses. All state variables, such as membrane potentials and synaptic currents, are physically represented in their respective circuits and evolve continuously in time. Considering the natural time constants of such integrated analog circuits, this emulation takes place at 1000-fold accelerated time scales compared to the biological nervous system. One BrainScaleS-2 core features 512 freely configurable AdEx neurons; these circuits can be restricted to LIF dynamics as required by our training framework aamir2018lifarray, aamir2018adex. Both the membrane and synaptic time constants were calibrated to the same target value, corresponding to the τ_m = τ_s case treated above.

Each neuron circuit integrates stimuli from a column of 256 current-based synapses friedmann2016hybridlearning. Each synapse holds a weight value; its sign is shared with all other synapses located on the same row in the synapse matrix. The presented training scheme, however, allows weights to continuously transition between excitation and inhibition. We therefore allocated pairs of synapse rows to convey the activity of single presynaptic partners, one configured for excitation, the other one for inhibition.

Synapses receive their inputs from an event routing module that allows connecting neurons within a chip as well as injecting stimuli from external sources. Events emitted by the neuron circuits are annotated with a time stamp and sent off-chip. The neuromorphic ASIC is accompanied by an FPGA that handles the communication with the host computer. It also provides mechanisms for low-latency experiment control, including the timed release of spike trains into the neuromorphic core. The FPGA is furthermore used to record events and digitized membrane traces originating from the ASIC.

We used an in-the-loop training approach, in which emulation runs on the neuromorphic substrate were interleaved with host-based weight update calculations schmitt2017neuromorphic. For the emulation of the forward pass, the data set was broken down into mini-batches, converted into input spike trains and injected into the neuromorphic system via the FPGA. The latter was also used to record the spikes emitted by the hidden and label layers. Based on these recorded spike trains, weight updates were calculated on the host computer and then written back to the synapse memory. This backward pass shared its implementation with the previously described simulation framework.

We were able to successfully and reliably train the network emulated on BrainScaleS-2 on the discussed data set (Fig. 5). The system quickly learned to fully discriminate between the presented patterns, with a clear separation between label spike times. Learning performance in terms of convergence speed is difficult to compare, since the hyperparameters are not easily transferable, but appears similar to that of numerical simulations of the same network. After training, due to the interplay of the system's intrinsic acceleration and the nature of the learning algorithm itself, each pattern was classified within microseconds.

3.2.2 Learning with TTFS on BrainScaleS-1

Figure 6: Training a spiking network on the wafer-scale BrainScaleS-1 system. Accuracy (A) and loss (B) during training on the four-pattern data set (Fig. 2A). Green: cross-entropy, blue: α-weighted regularizer in Eqn. 9. (C-F) Evolution of the spike times in the label layer for the four different patterns. In each panel, the neuron coding for the correct class is shown with a solid line in full color. (G) Raster plot after training for the second pattern.

To demonstrate the amenability of our approach to different neuromorphic substrates, we also tested it on the BrainScaleS-1 system schemmel2010wafer. This version of BrainScaleS has an architecture very similar to BrainScaleS-2, but its component chips are interconnected through postprocessing on their common wafer (wafer-scale integration). More importantly for our coding scheme and learning rules, its circuits emulate CoBa (conductance-based) instead of CuBa neurons. Furthermore, due to the different fabrication technology and design choices (in particular, the floating-gate parameter memory; see srowig2007analog, schemmel2010wafer, Koke2017), the parameter variability and spike time jitter are significantly higher than on BrainScaleS-2 schmitt2017neuromorphic.

The training procedure was analogous to the one used on BrainScaleS-2. To accommodate the COBA synapse dynamics, we introduced global weight scale factors, modeling the distance between reversal and leak potentials as well as the total conductance, which were multiplied with the synaptic weights to obtain a CUBA approximation for which our learning rules apply. Despite this approximation and the considerable substrate variability (compare, e.g., Fig. 6C-F with Fig. 5C-F), our framework was able to compensate well, almost matching the performance achieved on BrainScaleS-2 (Fig. 6).
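One possible form of such a scale factor, shown purely as an illustration (the symbols and values are assumptions, not the actual hardware calibration):

```python
def cuba_weight(w_coba, e_rev, e_leak, g_total):
    """Approximate a conductance-based (COBA) synaptic weight by a
    current-based (CUBA) one via a global scale factor: the driving force
    (distance of the reversal potential from the leak potential) divided
    by the total conductance. Illustrative sketch only; the hardware
    calibration on BrainScaleS-1 differs."""
    return w_coba * (e_rev - e_leak) / g_total
```

For example, with an excitatory reversal potential of 0 mV, a leak potential of -65 mV and a total conductance of 10 nS, a unit COBA weight maps to an effective CUBA weight of 6.5 in these units.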

4 Discussion

In this manuscript, we proposed a model of deep time-to-first-spike learning that builds on a principled view of neuro-synaptic dynamics with finite time constants and comes with exact learning rules for optimizing first-spike times; an early version of this work was presented in goeltz2019mastersthesis. In this quintessentially spike-based learning framework, only single spike times are required for calculating the weight updates, thus reducing the memory (bandwidth and capacity) requirements of synaptic updates compared to, e.g., rate coding approaches (see, e.g., schmitt2017neuromorphic for an example of deep but rate-based learning that was also applied to a BrainScaleS system).

Our work builds on earlier results by mostafa2017supervised, which we extended to accommodate leaky integrate-and-fire neurons, thereby including more biologically plausible and neuromorphic-hardware-compatible neuro-synaptic dynamics. Additionally, we introduced a regularizing loss term that favors early classification, thereby significantly improving the time-to-solution of the network. To account for substrate variability, we further incorporated output spike times directly into the backward pass, which extends the applicability of our derived learning rules to a wide range of parameters, allowing us to demonstrate the framework on two different neuromorphic platforms (two generations of the BrainScaleS architecture) that exhibit varying degrees of parameter noise in their analog components. Unlike other approaches mostafa2017supervised, comsa2019temporal, kheradpisheh2019s4nn, we do not use any kind of clocking or bias spikes, and are therefore independent of any absolute time reference or global clock signal.

The complexity of the learned dataset was mostly limited by the size of the substrate used, but we expect the framework to scale to significantly more challenging problems, as suggested by the FPGA-based experiments in mostafa2017fpga. After learning, the network needed less than one spike per neuron to produce a correct classification on all substrates used. With these few spikes, we achieved a time-to-solution of less than one synaptic time constant. Since the dynamical timescales directly affect the duration of the network emulation between synaptic updates, this inherently leads to a significant reduction of the total training time. Taking into consideration relaxation times between patterns, our setup was able to handle a concatenated pattern throughput of at least , independently of emulated network size. These results promise an efficient exploitation of such accelerated neuromorphic substrates for high-throughput inference on spiking input data.

Acknowledgment

The authors wish to thank Sebastian Schmitt for BrainScaleS-1 support as well as Jakob Jordan and Nico Gürtler for valuable discussions.

Funding

The authors gratefully acknowledge funding from the European Union under grant agreements 604102, 720270, 785907 (HBP) and the Manfred Stärk Foundation.

Simulation

Simulations were performed on the bwForCluster NEMO, supported by the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grant no INST 39/963-1 FUGG.

Appendix

Appendix A Derivation of main results

In this section we derive the equations in the main manuscript, starting with the learning rule for , Eqn. 4, then , Eqn. 5 and finally , Eqn. 6.

For each, a solution for the spike time , defined by the threshold condition

$u(t) = \vartheta$ (A1)

given LIF dynamics

$\tau_\mathrm{m} \, \dot{u}(t) = [E_\mathrm{l} - u(t)] + I(t)/g_\mathrm{l}$ (A2)
$I(t) = \sum_i w_i \sum_{t_i} \theta(t - t_i) \, e^{-(t - t_i)/\tau_\mathrm{s}}$ (A3)

has to be found. For convenience, we use the following definitions

(A4)
(A5)

with summation over the set of causal presynaptic spikes .
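For reference, the first spike time defined this way can also be obtained by direct numerical integration of LIF dynamics with exponential synaptic kernels. This sketch uses illustrative parameters (leak potential set to zero, unit-free weights), not the hardware's calibrated values:

```python
def first_spike_time(weights, in_times, tau_m=0.02, tau_s=0.01,
                     theta=1.0, dt=1e-5, t_max=0.1):
    """Forward-Euler integration of CUBA LIF dynamics driven by
    exponential synaptic kernels; returns the first threshold
    crossing, or None if the voltage never reaches threshold."""
    u, i_syn = 0.0, 0.0
    spikes = sorted(zip(in_times, weights))
    idx = 0
    for step in range(int(round(t_max / dt))):
        t = step * dt
        # Inject all presynaptic spikes that have arrived by time t.
        while idx < len(spikes) and spikes[idx][0] <= t:
            i_syn += spikes[idx][1]
            idx += 1
        # Membrane leak plus synaptic current, then synaptic decay.
        u += dt * (-u / tau_m + i_syn)
        i_syn -= dt * i_syn / tau_s
        if u >= theta:
            return t
    return None
```

A sufficiently strong input produces a crossing shortly before the PSP maximum; a weak one produces none, matching the two cases distinguished in the derivations below.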

A.1 nLIF learning rule for

With this choice of , the first term in Eqn. A2 becomes and we recover the nLIF case discussed in mostafa2017supervised. Given the existence of an output spike, the spike time appears in only one place in Eqn. A1, and simple reordering yields

(A6)

where we used Eqn. A4 for and , the latter being the sum over the weights.
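The resulting closed-form spike time can be checked numerically. The following sketch assumes the standard nLIF membrane $u(t) = \sum_i w_i (1 - e^{-(t - t_i)/\tau_\mathrm{s}})$ from mostafa2017supervised and, for simplicity, treats all given inputs as causal:

```python
import numpy as np

def nlif_spike_time(weights, times, theta=1.0, tau_s=1.0):
    """Closed-form first-spike time of a non-leaky (nLIF) neuron with
    exponential synaptic kernels: solving
    sum_i w_i * (1 - exp(-(t - t_i)/tau_s)) = theta for t,
    assuming the causal set contains all given inputs."""
    w = np.asarray(weights, dtype=float)
    t = np.asarray(times, dtype=float)
    a = np.sum(w * np.exp(t / tau_s))          # weighted exponential sum
    w_sum = np.sum(w)                          # sum over the weights
    if w_sum <= theta:
        return None  # this causal set cannot drive the neuron to threshold
    return tau_s * np.log(a / (w_sum - theta))
```

For a single input of weight 2 at time 0 with threshold 1, the membrane reaches threshold at $t = \tau_\mathrm{s} \ln 2$, which the formula reproduces.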

A.2 Learning rule for

Spike time

According to l’Hôpital’s rule, in the limit Eqn. A2 becomes a sum over α-functions of the form

(A7)

Using these voltage dynamics for the equation of the spike time Eqn. A1, together with the definition Eqn. A5 and , we get the equation

(A8)

The variable is introduced to bring the equation into the form

$y \, e^{y} = x$ (A9)

which can be solved with the differentiable Lambert W function . The goal is now to bring Eqn. A8 into this form, which is achieved by a reformulation in terms of

(A10)
(A11)

With the definition of the Lambert W function the spike time can be written as

(A12)
Figure A1: (A) Membrane dynamics for one strong input spike at (upward arrow) with two threshold crossings due to leak pullback (earlier violet, later brown). The change induced by a reduction of the input weight is shown in red. (B) Edge case without crossing and exactly one time where . (C) Defining relation for the Lambert W function , evidently not an injective map. (D) Distinguishing between allows us to define the inverse function of (C), the Lambert W function .
Branch choice

Given that a spike happens, there will be two threshold crossings: one from below at the actual spike time, and one from above when the voltage decays back to the leak potential (Fig. A1A,B). Correspondingly, the Lambert W function (Fig. A1C,D) has two real branches (in addition to infinitely many complex ones), and we need to choose the branch that returns the earlier solution. In case the voltage is only tangent to the threshold at its maximum, the Lambert W function has only one solution.

For choosing the branch in the other cases we need to look at from the definition, i.e.

(A13)

In a setting with only one sufficiently strong input spike, the summations in and reduce to yield . Because the maximum of the PSP for occurs at , we know that the spike must occur at and therefore

(A14)

This corresponds to the branch cut of the Lambert W function, meaning we must choose the branch with . For a general setting, if we know that a spike exists, we expect and to be positive. In order to get the earlier threshold crossing, we need the branch that returns the larger (Fig. A1D), that is, where .
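This branch choice can be inspected numerically with SciPy's `lambertw`, whose branch index `k=0` is the principal (larger-valued) real branch:

```python
import numpy as np
from scipy.special import lambertw

# For arguments in [-1/e, 0) the Lambert W function has two real branches.
x = -0.2
w0 = lambertw(x, k=0).real    # principal branch, W >= -1
wm1 = lambertw(x, k=-1).real  # lower branch, W <= -1

# Both satisfy the defining relation W * exp(W) = x ...
assert abs(w0 * np.exp(w0) - x) < 1e-10
assert abs(wm1 * np.exp(wm1) - x) < 1e-10
# ... and the principal branch returns the larger value, which here
# corresponds to the earlier threshold crossing.
assert w0 > wm1
```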

Derivatives

The derivatives for in the causal set come down to

(A15)
(A16)

A crucial step is to reinsert the definition of the spike time where possible (cf. Section 3.1). For this we need the derivative of the Lambert W function, which follows from differentiating its definition Eqn. A9 w.r.t. . With this derivative, one can calculate the derivatives of Eqn. A12 with respect to incoming weights and times, as functions of presynaptic weights, input spike times and output spike time:

(A17)
(A18)
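The derivative of the Lambert W function used here, $W'(x) = W(x)/[x(1+W(x))]$, follows from implicitly differentiating $W e^{W} = x$ and can be cross-checked numerically:

```python
import numpy as np
from scipy.special import lambertw

def lambertw_prime(x):
    """Derivative of the principal branch of the Lambert W function,
    obtained by implicitly differentiating W(x) * exp(W(x)) = x:
    W'(x) = W(x) / (x * (1 + W(x))) for x != 0."""
    w = lambertw(x).real
    return w / (x * (1.0 + w))

# Cross-check against a central finite difference.
x0, h = 0.5, 1e-5
fd = (lambertw(x0 + h).real - lambertw(x0 - h).real) / (2 * h)
assert abs(lambertw_prime(x0) - fd) < 1e-6
```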

A.3 Learning rule for

Spike time

Inserting the voltage dynamics (Eqn. A2) into the spike time condition (Eqn. A1) yields

(A19)

Reordering and rewriting this in terms of , , and (with ) we get

(A20)

This is written such that its quadratic nature becomes apparent, making it possible to solve for and thus

(A21)
Branch choice

The quadratic equation has two solutions, corresponding to the voltage crossing the threshold at the spike time and during its later relaxation towards the leak potential; again, we want the earlier of the two. It follows from the monotonicity of the logarithm that the earlier time is the one with the larger denominator. Since an output spike requires an excess of recent, positively weighted input spikes, are positive, and the solution is the correct one.
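As an illustration (using the unnormalized PSP kernel $e^{-\Delta t/(2\tau_\mathrm{s})} - e^{-\Delta t/\tau_\mathrm{s}}$ for $\tau_\mathrm{m} = 2\tau_\mathrm{s}$, not the manuscript's exact constants), the quadratic solution and its branch choice can be verified against the implicit threshold condition:

```python
import numpy as np

def spike_time_quadratic(weights, times, theta=1.0, tau_s=1.0):
    """Illustrative closed-form first-spike time for tau_m = 2 * tau_s.
    With u(t) = sum_i w_i * (exp(-(t-t_i)/(2 tau_s)) - exp(-(t-t_i)/tau_s)),
    the threshold condition is quadratic in x = exp(-t / (2 tau_s))."""
    w = np.asarray(weights, dtype=float)
    t = np.asarray(times, dtype=float)
    a1 = np.sum(w * np.exp(t / (2 * tau_s)))
    a2 = np.sum(w * np.exp(t / tau_s))
    disc = a1**2 - 4 * a2 * theta
    if disc < 0:
        return None  # the voltage never reaches threshold
    # The '+' root gives the larger x, i.e. the earlier crossing.
    x = (a1 + np.sqrt(disc)) / (2 * a2)
    return -2 * tau_s * np.log(x)
```

Since $x = e^{-t/(2\tau_\mathrm{s})}$ decreases monotonically in $t$, the larger root of the quadratic indeed corresponds to the earlier of the two crossings.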

Derivatives

Using the definition for brevity, the derivatives of Eqn. A21 are

(A22)
(A23)

Again, inserting the output spike time yields

(A24)
(A25)

Appendix B Standard error backpropagation

The standard error backpropagation formula for artificial (rate-based) neural networks Rumelhart1986 with rates is given by

$y_j^{(l)} = \varphi\big(a_j^{(l)}\big), \quad a_j^{(l)} = \sum_i w_{ij}^{(l)} \, y_i^{(l-1)}$ (A26)
$\delta_j^{(l)} = \varphi'\big(a_j^{(l)}\big) \sum_k w_{jk}^{(l+1)} \, \delta_k^{(l+1)}$ (A27)
$\Delta w_{ij}^{(l)} \propto - y_i^{(l-1)} \, \delta_j^{(l)}$ (A28)

Traditionally, in artificial neural networks, the last layer is a linear classifier; here, however, to highlight the resemblance to rate-based neurons, we define the loss function on the activations of the neurons in the last layer , where is the target label in one-hot coding.
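A generic NumPy instance of standard error backpropagation, with logistic activations and a squared-error loss as illustrative choices (the manuscript's actual loss differs), verified against a finite-difference gradient:

```python
import numpy as np

def forward(x, w1, w2):
    """Two-layer rate-based network with logistic activations."""
    y1 = 1.0 / (1.0 + np.exp(-(x @ w1)))
    y2 = 1.0 / (1.0 + np.exp(-(y1 @ w2)))
    return y1, y2

def backprop(x, target, w1, w2):
    """Standard error backpropagation (Rumelhart et al., 1986) for a
    squared-error loss L = 0.5 * ||y2 - target||^2."""
    y1, y2 = forward(x, w1, w2)
    delta2 = (y2 - target) * y2 * (1 - y2)    # output-layer error
    delta1 = (delta2 @ w2.T) * y1 * (1 - y1)  # backpropagated error
    return x.T @ delta1, y1.T @ delta2        # dL/dw1, dL/dw2

# Finite-difference check of one weight gradient.
rng = np.random.default_rng(1)
x = rng.normal(size=(1, 3))
tgt = np.array([[0.0, 1.0]])
w1 = rng.normal(size=(3, 4))
w2 = rng.normal(size=(4, 2))
g1, g2 = backprop(x, tgt, w1, w2)

eps = 1e-6
w1p = w1.copy(); w1p[0, 0] += eps
w1m = w1.copy(); w1m[0, 0] -= eps
loss = lambda a, b: 0.5 * np.sum((forward(x, a, b)[1] - tgt) ** 2)
fd = (loss(w1p, w2) - loss(w1m, w2)) / (2 * eps)
assert abs(g1[0, 0] - fd) < 1e-6
```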