I Introduction
I.1 Dynamical Coordination in Cognitive Processes
Spike-based learning in plastic neuronal networks plays an increasingly important role in both theoretical neuroscience and neuromorphic computing. The brain learns in part by modifying the synaptic strengths between neurons and neuronal populations. While specific synaptic plasticity or neuromodulatory mechanisms may vary in different brain regions, it is becoming clear that a significant level of dynamical coordination between disparate neuronal populations must exist, even within an individual neural circuit
[1]. Classically, backpropagation (BP) [2, 3, 4], among other learning algorithms, has been essential for supervised learning in artificial neural networks (ANNs). Although the question of whether or not BP operates in the brain is still an outstanding issue
[5], BP does solve the problem of how a global objective function can be related to local synaptic modification in a network. It seems clear, however, that if BP is implemented in the brain, or if one wishes to implement BP in a neuromorphic circuit, some amount of dynamical information coordination is necessary to propagate the correct information to the correct location so that appropriate local synaptic modification can take place to enable learning.

I.2 The Grand Challenge of Spiking Backpropagation (sBP)
There has been rapid growth of interest in the reformulation of classical algorithms for learning, optimization, and control using event-based information-processing mechanisms. Such spiking neural networks (SNNs) are inspired by the function of biophysiological neural systems [6], an approach known as neuromorphic computing [7]. The trend is driven by the advent of flexible computing architectures, such as Intel's neuromorphic research processor, code-named Loihi, that enable experimentation with such algorithms in hardware [8]. There is particular interest in deep learning, which is a central tool in modern machine learning. Deep learning relies on a layered, feedforward network similar to the early layers of the visual cortex, with threshold nonlinearities at each layer that resemble mean-field approximations of neuronal integrate-and-fire models. While feedforward networks are readily translated to neuromorphic hardware [9, 10, 11], the far more computationally intensive training of these networks 'on chip' has proven elusive, as the structure of backpropagation makes the algorithm notoriously difficult to implement in a neural circuit [12, 13]. A feasible neural implementation of the backpropagation algorithm has gained renewed scrutiny with the rise of new neuromorphic computational architectures that feature local synaptic plasticity [8, 14, 15, 16]. Because of these well-known difficulties, neuromorphic systems have to date relied almost entirely on conventional off-chip learning and used on-chip computing only for inference [9, 10, 11]. It has been a long-standing challenge to develop learning systems whose function is realized exclusively using neuromorphic mechanisms.
Backpropagation has been claimed to be biologically implausible or difficult to implement on spiking chips because of several issues: (a) Weight transport
– Usually, synapses in biology and on neuromorphic hardware cannot be used bidirectionally; therefore, separate synapses for the forward and backward passes are employed. However, correct credit assignment, i.e., knowing how a weight change affects the error, requires the feedback weights to be the same as the feedforward weights
[17, 18]; (b) Backwards computation – Forward and backward passes implement different computations [18]. The forward pass requires only weighted summation of the inputs, while the backward pass operates in the opposite direction and additionally takes into account the derivative of the activation function; (c)
Gradient storage – Error gradients must be computed and stored separately from activations; (d) Differentiability – For spiking networks, the issue of non-differentiability of spikes has been discussed and solutions have been proposed [19, 20, 21, 22]; and (e) Hardware constraints – Neuromorphic hardware often constrains the plasticity mechanisms that allow for adaptation of synaptic weights. Some hardware offers no plasticity at all, while in other cases only specific spike-timing-dependent plasticity (STDP) rules are allowed. Additionally, in almost all available neuromorphic architectures, information must be local, i.e., information is only shared between neurons that are synaptically connected, in particular to facilitate parallelization.

I.3 Previous Attempts at sBP
The most commonly used approach to avoiding the above issues is to use neuromorphic hardware only for inference with fixed weights, obtained by training an identical network offline and off-chip [9, 10, 23, 24, 25, 26, 27]. This has recently achieved state-of-the-art performance [11], but it does not exploit neuromorphic hardware's full potential, because offline training consumes significant power. Moreover, to function in most field applications, an inference algorithm should be able to learn adaptively after deployment, e.g., to adjust to a particular speaker in speech recognition, which would enable autonomy and privacy for edge computing devices. So far, only last-layer training, without backpropagation and using variants of the delta rule, has been achieved on spiking hardware [28, 29, 30, 31, 32, 33, 34]. Other on-chip learning approaches use alternatives to backpropagation [35, 36], bio-inspired non-gradient-based methods [37], or hybrid systems with a conventional computer in the loop [38, 39]. Several promising alternative approaches for actual on-chip spiking backpropagation have been proposed recently [40, 41, 42, 43], but have not yet been implemented in hardware.
To avoid the backwards computation issue (b), and because neuromorphic synapses are not bidirectional, a separate feedback network for the backpropagation of errors has been proposed [44, 45] (see Fig. 1). This leads to the weight transport problem (a), which has been solved by using symmetric learning rules to maintain weight symmetry [46, 45, 47] or with the Kolen-Pollack algorithm [47], which leads to symmetric weights automatically. It has also been found that weights do not have to be perfectly symmetric, because backpropagation can still be effective with random feedback weights (random feedback alignment) [48], although symmetry in the sign between forward and backward weights matters [49].
The backwards computation issue (b) and the gradient storage issue (c) have been addressed by approaches that separate the function of the neuron into different compartments and use structures that resemble neuronal dendrites for the computation of backward-propagating errors [50, 41, 43, 5]. The differentiability issue (d) has been circumvented by spiking rate-based approaches [51, 10, 24, 52]
that use the ReLU activation function as done in ANNs. The differentiability issue has also been addressed more generally using surrogate gradient methods
[19, 20, 9, 21, 22, 25, 28, 53] and methods that use biologically inspired STDP and reward-modulated STDP mechanisms [54, 55, 56, 57]. For a review of SNN-based deep learning, see [58]. For a review of backpropagation in the brain, see [5].

I.4 Our Contribution
In this study, we describe a hardware implementation of the backpropagation algorithm that addresses each of the issues (a)–(e) introduced above, using a set of mechanisms that have been developed and tested in simulation by the authors during the past decade, synthesized in our recent study [59], and simplified and adapted here to the features and constraints of the Loihi chip. These neuronal and network features include propagation of graded information in a circuit composed of neural populations using synfire-gated synfire chains (SGSCs) [60, 61, 62, 63], control flow based on the interaction of synfire chains [61], and regulation of Hebbian learning using synfire gating [64, 65]. We simplify our previously proposed network architecture [59] and streamline its function. We demonstrate our approach with a proof-of-principle implementation on Loihi [8] and examine the performance of the algorithm for learning and inference on the MNIST test data set [66]. The sBP implementation is shown to be competitive in clock time, sparsity, and power consumption with state-of-the-art algorithms for the same tasks.
II Results
II.1 The Binarized sBP Model
We extend our previous architecture [59] with several new algorithmic and hardware-specific mechanisms. Each unit of the neural network is implemented as a single spiking neuron, using the current-based leaky integrate-and-fire (CUBA) model (see Eq. (19) in Section IV) that is built into Loihi. The time constants of the CUBA model are set to 1, so that the neurons are memoryless. Rather than using rate coding, whereby spikes are counted over time, we consider neuron spikes at every algorithmic time step, so we can regard our implementation as a binary neural network. The feedforward component of our network is a classic multilayer perceptron (MLP) with three layers, a binary activation function, and discrete (8-bit) weights. Our approach may, however, be extended to deeper networks and different encodings. In the following equations, each lowercase letter corresponds to a Boolean vector that represents a layer of spiking neurons on the chip (a spike corresponds to a 1). The inference (forward) pass through the network is computed as:
$x^1 = f(W^1 x^0)$  (1)
$x^2 = f(W^2 x^1)$  (2)
$f(a) = H(a - 0.5)$  (3)

where $W^l$ is the weight matrix of the respective layer, $f$ is a binary activation function with a threshold of 0.5, and $H$ denotes the Heaviside function. The forward pass thus occurs in 3 time steps as spikes are propagated through the layers. The degree to which the feedforward network's output deviates from a target value $t$ is quantified by the squared error, $L = \frac{1}{2}\|t - x^2\|^2$, which we would like to minimize. Performing backpropagation to achieve this requires the calculation of weight updates, which depend on the forward activations and on backward-propagated local gradients $\delta^l$, which represent the amount by which the loss changes when the activity of the corresponding neuron changes:
$e = t - x^2$  (4)
$\delta^2 = e \circ \tilde{f}'(a^2)$  (5)
$\delta^1 = \left((W^2)^T \delta^2\right) \circ \tilde{f}'(a^1)$  (6)
$\Delta W^l = \eta \, \mathrm{sgn}(\delta^l)\,(x^{l-1})^T$  (7)

Here, $\circ$ denotes a Hadamard product, i.e., the element-wise product of two vectors, $^T$ denotes the matrix transpose, $\mathrm{sgn}$ is the sign function, and $a^l = W^l x^{l-1}$ denotes the activation of the $l$-th layer, with $l \in \{1, 2\}$. The learning rate $\eta$ is the only hyperparameter of the model apart from the weight initialization. The symbol $\tilde{f}'$ denotes the derivative of a surrogate of $f$: because $f$ is a binary thresholding function (Heaviside), its derivative would be the Dirac delta function, which is zero everywhere apart from at the threshold. Therefore, we use a common method [67, 68] and represent the thresholding function by a ReLU truncated between 0 and 1 (Eq. (8)) as a surrogate or straight-through estimator when backpropagating the error. The derivative of the surrogate is a box function (Eq. (9)):

$\tilde{f}(a) = \min(\max(a, 0), 1)$  (8)
$\tilde{f}'(a) = H(a) - H(a - 1)$  (9)
The three functions (Equations (2), (8) and (9)) are plotted in the inset in Fig. 2.
When performed for each target $t$ in the training set, the model may be thought of as a stochastic gradient descent algorithm with a fixed step-size update for each weight in the direction of the sign of the gradient.
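The computation described by Equations (1)-(9) can be restated as a short NumPy sketch of one training step. The layer sizes match the 400-400-10 network, but the random weights, input, and learning rate are illustrative assumptions; this is not the on-chip implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(a):
    # Binary activation: Heaviside step with threshold 0.5
    return (a >= 0.5).astype(float)

def box(a):
    # Box-function derivative of the truncated-ReLU surrogate: 1 on [0, 1]
    return ((a >= 0.0) & (a <= 1.0)).astype(float)

def train_step(W1, W2, x0, t, eta):
    # Forward pass: three time steps on chip, one matrix-vector product each
    a1 = W1 @ x0
    x1 = f(a1)
    a2 = W2 @ x1
    x2 = f(a2)
    # Backward pass: error and local gradients
    e = t - x2
    d2 = e * box(a2)
    d1 = (W2.T @ d2) * box(a1)
    # Fixed-step update in the direction of the sign of the gradient
    W2 += eta * np.sign(np.outer(d2, x1))
    W1 += eta * np.sign(np.outer(d1, x0))
    return x2

# Toy usage with 400-400-10 shapes and a random binary input
W1 = rng.normal(0.0, 0.05, (400, 400))
W2 = rng.normal(0.0, 0.05, (10, 400))
x0 = (rng.random(400) < 0.25).astype(float)
t = np.zeros(10); t[3] = 1.0
out = train_step(W1, W2, x0, t, eta=0.1)
```

Because the activations are binary, the sign of the gradient factorizes into a per-synapse product of pre- and postsynaptic quantities, which is what makes the Hebbian realization below possible.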
II.2 sBP on Neuromorphic Hardware
On the computational level, Equations (1)–(9) fully describe our model as it is implemented on Loihi, excluding the handling of bit-precision constraints (integer discreteness and limited value ranges). In the following, we describe how these equations are translated from the computational level to the algorithmic neural-circuit level, thereby enabling implementation on neuromorphic hardware. Further details on the implementation can be found in Section IV.
Hebbian weight update
Equation (7) results in the following weight update for a single synapse from presynaptic index $j$ in layer $l-1$ to postsynaptic index $i$ in layer $l$:

$\Delta w^l_{ij} = \eta \, \mathrm{sgn}(\delta^l_i) \, x^{l-1}_j$  (10)

where $\eta$ is the constant learning rate. To accomplish this update, we use a Hebbian learning rule [69] implementable on the on-chip microcode learning engine (for the exact implementation on Loihi, see Section IV.2). Hebbian learning means that neurons that fire together, wire together, i.e., the weight update is proportional to the product of the simultaneous activity of the presynaptic (source) neuron and the postsynaptic (target) neuron. In our case, this means that the values of the two factors of Equation (10) have to be propagated simultaneously, in the same time step, to the presynaptic ($x^{l-1}_j$) and postsynaptic ($\mathrm{sgn}(\delta^l_i)$) neurons, while the pre- and postsynaptic neurons are not allowed to fire simultaneously at any other time. For this purpose, a mechanism to control the information flow through the network is needed.
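The per-synapse update can be read as a coincidence rule, sketched below in plain Python with illustrative names. Here the signed postsynaptic factor is written directly; on chip, the sign is handled by the two learning phases described in the following paragraphs.

```python
# Sketch of the coincidence-gated Hebbian rule behind Eq. (10): a synapse
# W[i][j] changes only in a time step where the presynaptic value x_j and
# the postsynaptic value sgn(delta_i) arrive simultaneously.

def hebbian_update(W, pre, post, eta):
    """Apply dW_ij = eta * post_i * pre_j for one gated time step.

    pre:  binary presynaptic activity (0 or 1 per neuron)
    post: signed postsynaptic factor in {-1, 0, +1}
    """
    for i, p_i in enumerate(post):
        for j, x_j in enumerate(pre):
            if p_i != 0 and x_j != 0:   # fire together -> wire together
                W[i][j] += eta * p_i * x_j
    return W

W = [[0.0, 0.0], [0.0, 0.0]]
W = hebbian_update(W, pre=[1, 0], post=[+1, -1], eta=0.5)
# Only synapses with a presynaptic spike (j = 0) change:
# W[0][0] is potentiated by 0.5 and W[1][0] is depressed by 0.5.
```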
Gating controls the information flow
As our information control mechanism, we use synfire gating [70, 60, 61, 62]. The backbone of this flow-control mechanism, which we call the gating chain, is a closed chain of 12 neurons around which a single spike is perpetually passed. The gating chain controls information flow through the controlled network by selectively boosting layers to bring their neurons closer to threshold, thereby making them receptive to input. By connecting particular layers to the gating neurons that fire in the respective time steps, we lay out a path that activity is allowed to take through the network. For example, to create the feedforward pass, the input layer is connected to the first gating neuron and therefore gated 'on' in time step 1, the hidden layer is connected to the second gating neuron and gated 'on' in time step 2, and the output layer is connected to the third gating neuron and gated 'on' in time step 3. A schematic of this path of activity can be found in Fig. 2. In neuroscience terms, we are using synfire gating to impose a functional connectivity on the network anatomy shown in Supplementary Fig. 4. Using synfire gating, the local gradient is brought to the postsynaptic neuron at the same time as the activity is brought back to the presynaptic neuron, effecting a weight update. However, in addition to bringing activity to the right place at the right time for Hebbian learning, the gating chain also makes it possible to calculate and backpropagate the local gradient.
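A minimal sketch of the gating principle (not the Loihi circuit): a single spike circulates through the chain, and each layer is receptive only in the time step of the gating neuron it is connected to. The schedule below covers only the feedforward portion of the 12-step cycle; the remaining steps would route the learning phases.

```python
# Sketch of a synfire-gating chain: one spike circulates through n_steps
# gating neurons, and each layer listens only to "its" gating neuron, so
# it is receptive to input only in its assigned time step.

def run_gating_chain(n_steps, schedule):
    """schedule maps gating-neuron index -> layer name gated 'on' then."""
    path = []
    gate = 0                      # position of the single circulating spike
    for _ in range(n_steps):
        layer = schedule.get(gate)
        if layer is not None:
            path.append(layer)    # this layer is boosted toward threshold
        gate = (gate + 1) % n_steps
    return path

# Gating neuron 0 fires in time step 1, neuron 1 in time step 2, etc.
schedule = {0: "input", 1: "hidden", 2: "output"}
print(run_gating_chain(12, schedule))   # -> ['input', 'hidden', 'output']
```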
Local gradient calculation
For the local gradient calculation, according to Equation (5), the error and the box-function derivative of the surrogate activation function (Equation (9)) are needed. Because there are no negative (signed) spikes, the local gradient is calculated and propagated back twice, in two phases with different signs, for the Hebbian weight update. The error is calculated in time step 4 in a layer that receives excitatory (positive) input from the output layer and inhibitory (negative) input from the target layer, and vice versa for a second layer that represents the opposite sign.
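The split of the signed error into two non-negative phases can be sketched as follows; the rectification stands in for spiking neurons that fire only on net positive input, and the variable names are illustrative.

```python
# Sketch of the two-phase representation of the signed error: spikes cannot
# be negative, so the error layer is duplicated, one copy wired
# output-excitatory/target-inhibitory and the other the reverse.

def split_signed_error(x2, t):
    err_pos = [max(o - g, 0) for o, g in zip(x2, t)]   # output minus target
    err_neg = [max(g - o, 0) for o, g in zip(x2, t)]   # target minus output
    return err_pos, err_neg

x2 = [1, 0, 0]          # output spikes
t = [0, 1, 0]           # target spikes
ep, en = split_signed_error(x2, t)
# Both phases are non-negative spike counts; their difference en - ep
# recovers the signed error t - x2 used in Eq. (4).
```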
The box function has the role of initiating learning when the presynaptic neuron receives a non-negative input and of terminating learning when the input exceeds 1, which is why we call the two conditions 'start' and 'stop' learning (inspired by the nomenclature of [71]). This inherent feature of backpropagation avoids weight updates that would have no effect on the current output because the neuron is saturated by the nonlinearity for the current input. It thereby regulates learning by protecting weights that are already trained and used for inference on different inputs.
To implement these two terms of the box function (9), we use two copies of the output layer that receive the same input as the output layer. Using the above-described gating mechanism, one of the copies ('start learning') is brought exactly to its firing threshold when it receives the input, which means that it fires for any input greater than 0, i.e., whenever the input is not in the lower saturation region of the ReLU. The other copy ('stop learning') is brought less far toward the threshold, so that it fires only when the input exceeds 1, which means that the upper saturation region of the ReLU has been reached and learning should cease.
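The logic of the two gated copies can be sketched as below; the numeric thresholds are illustrative, not the chip's gating biases.

```python
# Sketch of the 'start'/'stop' learning conditions implementing the box
# function with two copies of the output layer. Both copies see the same
# input a; gating biases one copy exactly to threshold ('start') and the
# other further from it ('stop').

def start_fires(a):
    return a >= 0.0    # gated to threshold: fires for any non-negative input

def stop_fires(a):
    return a > 1.0     # gated further away: fires only past saturation

def learning_allowed(a):
    # Learning proceeds where 'start' fires and 'stop' is silent,
    # i.e. on the box 0 <= a <= 1 (the support of Eq. (9)).
    return start_fires(a) and not stop_fires(a)

assert [learning_allowed(a) for a in (-0.5, 0.0, 0.7, 1.5)] == [False, True, True, False]
```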
Error backpropagation
Once the local gradient is calculated as described in the previous paragraph, it is sent to the output layer as well as to its copies to bring about the weight update of $W^2$ and its four copies in time steps 5 and 9. From there, the local gradient is propagated back through two transposed copies of $W^2$ that are connected in the opposite direction, one of them with opposite sign. Once propagated backwards, the backpropagated error is again combined with the 'start' and 'stop' learning conditions, and it is then sent to the hidden layer and its copies in time steps 7 and 11.
II.3 Algorithm Performance
Our implementation of the sBP algorithm on Loihi achieves an inference accuracy of 95.7% after 60 epochs (best: 96.2%) on the MNIST test data set, which is comparable with other shallow, stochastic gradient descent (SGD) trained MLP models without additional allowances. In these computational experiments, the sBP model is distributed over 81 neuromorphic cores. Processing of a single sample takes 1.48 ms (0.169 ms for inference only) on the neuromorphic cores, including the time required to send the input spikes from the embedded processor, and consumes 0.653 mJ of energy on the neuromorphic cores (0.592 mJ of which is dynamic energy, i.e., energy used by our neural circuit in addition to the fixed background energy), resulting in an energy-delay product of 0.878 µJs. All measurements were obtained using NxSDK version 0.9.9 on the Nahuku32 board nclghrd01. Table 1 shows a comparison of our results with published performance metrics for other neuromorphic learning architectures that were also tested on MNIST. Table 3 in the Supplementary Material shows a breakdown of energy consumption and a comparison across conditions and against a GPU. Switching off the learning engine after training reduces the dynamic energy per inference to 0.0204 mJ, which reveals that the on-chip learning engine is responsible for most of the power consumption. Because the learning circuit is not necessary for performing inference, we also tested a reduced architecture that performs inference within four time steps using the previously trained weights. This architecture uses 0.00249 mJ of dynamic energy and 0.169 ms per inference.
The sBP algorithm trains the network without explicit sparsity constraints, and yet it exhibits sparsity because of its binary (spiking) nature. After applying the binary threshold of 0.5 to the MNIST images, one image is encoded using on average 100 spikes in the input layer, which corresponds to a sparsity of 0.25 spikes per neuron per inference. This leads to a typical number of 110 spikes in the hidden layer (0.28 spikes per neuron per inference) and 1 spike in the output layer (0.1 spikes per neuron per inference). The spikes of the input and hidden layers are repeated in the two learning phases (see Fig. 2), independent of the training progress. However, the error-induced activity from the local gradient layer starts at 0.7 spikes per neuron per sample (during the first 1000 samples) and decreases to approximately 0 spikes in the trained network as the error goes toward 0.
Publication | Hardware | Learning Mode | Network Structure | Energy per Sample (mJ) | Latency per Sample (ms) | Test Accuracy (%)

On-chip backpropagation:
This study | Loihi | on-chip sBP | 400-400-10 ¹ | 0.592 | 1.48 | 96.2

On-chip single-layer training or BP alternatives:
[36] Shrestha et al. (2021) | Loihi | EMSTDP FA/DFA | CNN-CNN-100-10 | 8.4 | 20 | 94.7
[35] Frenkel et al. (2020) | SPOON | DRTP | CNN-10 | 0.000366 ² | 0.12 | 95.3
[33] Park et al. (2019) | unnamed | mod. SD | 784-200-200-10 | 0.000253 ² | 0.01 | 98.1
[72] Chen et al. (2018) | unnamed | SSTDP | 236-20 ³ | 0.017 | 0.16 | 89
[30] Frenkel et al. (2018) | ODIN | SDSP | 256-10 | 0.000015 | – | 84.5
[73] Lin et al. (2018) | Loihi | SSTDP | 1920-10 ³ | 0.553 | – | 96.4
[32] Buhler et al. (2017) | unnamed | LCA features | 256-10 | 0.000050 | 0.001 ² | 88

On-chip inference only:
This study | Loihi | inference | 400-400-10 ¹ | 0.00249 | 0.169 | 96.2
[36] Shrestha et al. (2021) | Loihi | inference | CNN-CNN-100-10 | 2.47 | 10 | 94.7
[35] Frenkel et al. (2020) | SPOON | inference | CNN-10 | 0.000313 | 0.12 | 97.5
[74] Göltz et al. (2019) | BrainScaleS-2 | inference | 256-246-10 | 0.0084 | 0.048 | 96.9
[73] Lin et al. (2018) | Loihi | inference | 1920-10 ³ | 0.0128 ⁴ | – | 96.4
[72] Chen et al. (2018) | unnamed | inference | 784-1024-512-10 | 0.0017 | – | 97.9
[76] Esser et al. (2015) | TrueNorth | inference | CNN (512 neurons) | 0.00027 | 1 | 92.7
[76] Esser et al. (2015) | TrueNorth | inference | CNN (3840 neurons) | 0.108 | 1 | 99.4
[77] Stromatias et al. (2015) | SpiNNaker | inference | 784-500-500-10 | 3.3 | 11 | 95

Neuromorphic sBP in simulated SNNs:
[78] Jin et al. (2018) | Simulation | BP | 784-800-10 | – | – | 98.8
[79] Neftci et al. (2017) | Simulation | BP | 784-500-10 | – | – | 97.7
[80] Shrestha et al. (2019) | Simulation | EMSTDP | 784-500-10 | – | – | 97
[81] Tavanaei and Maida (2019) | Simulation | BP-STDP | 784-500-150-10 | – | – | 97.2
[82] Mostafa (2017) | Simulation | BP | 784-800-10 | – | – | 97.55
[83] Lee et al. (2016) | Simulation | BP | 784-800-10 | – | – | 98.64
[84] O'Connor and Welling (2016) | Simulation | BP | 784-300-300-10 | – | – | 96.4
[85] Diehl and Cook (2015) | Simulation | STDP | 784-1600-10 | – | – | 95

¹ 400 (20x20) corresponds to 784 (28x28) after cropping of the empty 4-pixel image margin.
² Calculated from given values.
³ Off-chip preprocessing.
⁴ Dynamic energy, reported in the Supplementary Material of [75].
III Discussion
III.1 Summary
As we have demonstrated here, by using a well-defined set of neuronal and neural circuit mechanisms, it is possible to implement a spiking backpropagation algorithm on contemporary neuromorphic hardware. Previously proposed methods to address the issues outlined in Section I were not on their own able to offer a straightforward path to implementing a variant of the sBP algorithm on current hardware. In this study, we avoided or solved these previously encountered issues with spiking backpropagation by combining known solutions with synfire-gated synfire chains (SGSCs) as a dynamical information coordination scheme, and we evaluated the result on the MNIST test data set on the Loihi VLSI hardware.
III.2 Solutions to Implementation Issues
The five issues (a)–(e) listed in Section I were addressed with the following solutions: (a) The weight transport issue was avoided via the use of a deterministic, symmetric learning rule for the parts of the network that implement inference (feedforward) and error propagation (feedback), as described by [86]. This approach is not biologically plausible because of a lack of developmental mechanisms to assure the equality of corresponding weights [87]. It would, however, be feasible, without modifications to the architecture, to employ weight decay as described by Kolen and Pollack [87] to achieve self-alignment of the backward weights to the forward weights, or to use feedback alignment to approximately align the feedforward weights to random feedback weights [48]; (b) The backwards computation issue was solved by using a separate error propagation network through which activation is routed using an SGSC; (c) The gradient storage issue was solved by routing activity through the inference and error propagation circuits within the network in separate stages, thereby preventing the mixing of inference and error information. There are alternatives that would not require synfire-gated routing, but they are more challenging to implement on hardware [41, 43], as also described in a more comprehensive review [5]; (d) The differentiability issue was solved by representing the step activation function by a surrogate in the form of a (truncated) ReLU activation function with an easily implementable box-function derivative; and (e) The hardware constraint issue was solved by the proposed mechanism's straightforward implementation on Loihi, which requires only basic integrate-and-fire neurons and Hebbian learning modulated by a single factor that is the same for all synapses.
III.3 Encoding and Sparsity
While neural network algorithms on GPUs usually use operations on dense activation vectors and weight matrices, and therefore do not profit from sparsity, spiking neuromorphic hardware performs an addition only when a spike event occurs, i.e., adding the weight to the input current as in Equation (19). This means that the power consumption depends directly on the number of spikes. Therefore sparsity, i.e., the property of a vector that most of its elements are zero, is important for neuromorphic algorithms [75, 88], and it is also observed in biology [89]. Consequently, significant effort has been made to make SNNs sparse, moving beyond the classical rate-based approach of counting spikes [88, 74, 90, 91]. The binary encoding used here can be seen as a limit case of the rate-based approach that allows only 0 or 1 spike. Even without regularization to promote sparse activity, it yields very sparse activation vectors that are even sparser than most timing codes. The achievable encoded information per spike is unquestionably lower, however. In a sense, we already employ spike timing to route spikes through the network, because the location of a spike in time within the 12 time steps determines if and where it is sent and whether the weights are potentiated or depressed. However, a timing code for activations could be enabled by having more than one Loihi time step per algorithm time step. The use of SGSCs is therefore not limited to this particular binary encoding; in fact, SGSCs were originally designed for a population rate code.
Similarly, the routing method we use in this work is not limited to backpropagation, but it could serve as a general method to route information in SNNs where autonomous activity (without interference from outside the chip) is needed. That is, our proposed architecture can act in a similar way as or even in combination with neural state machines [92, 93].
III.4 Algorithmically Informed Neurophysiology
Although our implementation of sBP here was focused primarily on a particular hardware environment, we point out that the synfire-gated synfire chains and other network and neuronal structures that we employ could all potentially have relevance for understanding computation in neurophysiological systems. Many of the concepts we use, such as spike coincidence, were originally inspired by neurophysiological experiments [61, 60, 94]. Experimental studies have shown recurring sequences of stereotypical neuronal activation in several species and brain regions [95, 96, 97], and particularly replay in the hippocampus [98]. Recent studies also hypothesize [99, 100] and show [101] that a mechanism like gating by synfire chains may play a role in memory formation. Additional evidence [102] shows that large-scale cortical activity has a stereotypical, packet-like character that can convey information about the nature of a stimulus, or be ongoing or spontaneous. This type of apparently algorithmically related activity has a very similar form to the SGSC-controlled information flow found previously [60, 61, 62, 63]. This type of sequential activation of populations is likewise evoked by the sBP learning architecture, as seen in the raster plot in Fig. 5 in the Supplementary Material.
Other algorithmic spiking features, such as the backpropagated local gradient layer activity decreasing from 0.7 spikes per neuron to 0 by the end of training, could be identified and used to generate qualitative and quantitative hypotheses concerning network activity in biological neural systems.
III.5 Future Directions
Although the accuracy we achieve is similar to that of early implementations of binary neural networks in software [68], subsequent approaches now reach 98.5% [103] and generally also include binarized weights. However, networks that achieve such accuracy typically employ either a convolutional structure or multiple, larger hidden layers. Additional features such as dropout, softmax final layers, and gain terms could in principle be included in spiking hardware and may also account for this 3% gap. So, while we show that it is possible to efficiently implement backpropagation on neuromorphic hardware, several nontrivial steps are still required to make it usable in practical applications:
1) The algorithm needs to be scaled to deeper networks. While the present structure is in principle scalable to more layers without major adjustments, investigation is needed to determine whether gradient information remains intact over many layers, and to what extent additional features such as alternatives to batch normalization may need to be developed.
2) Generalization to convolutional networks is compelling, in particular for application to image processing. The Loihi hardware presents an advantage in this setting because of its weightsharing mechanisms.
3) Although our current implementation demonstrates onchip learning, we train on the MNIST images in an offline fashion by iterating over the training set in epochs. Further research is required to develop truly continual learning mechanisms such that additional samples and classes can actually be learned without losing previously trained synaptic weights and without retraining on the whole dataset.
Additionally, the proposed algorithmic methodology can be used to inform hardware adjustments to promote efficiency for learning applications. Although our algorithm is highly efficient in terms of power usage, in particular for binary encoding, the Loihi hardware is not specifically designed for implementing standard deep learning models, but rather as generalpurpose hardware for exploring different SNN applications [75].
This leads to a significant computational overhead for functions that are not needed in our model (e.g. neuronal dynamics), or that could have been realized more efficiently if integrated directly on the chip instead of using network mechanisms. Our results provide a potential framework to guide future hardware modifications to facilitate more efficient learning algorithm implementations. For example, in an upgraded version of the algorithm, it would be preferable to replace relay neurons with presynaptic (eligibility) traces to keep a memory of the presynaptic activity for the weight update.
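For illustration, a presynaptic eligibility trace of the kind suggested here could look like the following sketch; the decay factor is an assumption, not a Loihi parameter.

```python
# Sketch of a presynaptic eligibility trace, the suggested alternative to
# relay neurons: each presynaptic spike bumps the trace, which then decays,
# keeping a short memory of recent activity available at the later time
# step when the weight update is applied.

def update_trace(trace, spike, decay=0.5):
    return decay * trace + (1.0 if spike else 0.0)

trace = 0.0
for spike in [1, 0, 0, 1]:   # presynaptic spike train
    trace = update_trace(trace, spike)
# The trace now reflects the most recent spike strongly and the older
# spike faintly, so a delayed update can still read the presynaptic history.
```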
III.6 Significance
To our knowledge, this work is the first to show an SNN implementation of the backpropagation algorithm that is fully on-chip, without a computer in the loop. Other on-chip learning approaches so far either use feedback alignment [36], forward propagation of errors [35], or single-layer training [28, 30, 73, 72, 32]. Compared to an equivalent implementation on a GPU, there is no loss in accuracy, but there are about two orders of magnitude of power savings in the case of small batch sizes, which are more realistic for edge computing settings. This implementation thus shows a path toward using in-memory, massively parallel neuromorphic processors for low-power, low-latency implementations of modern deep learning applications. The network model we propose offers opportunities as a building block that can, e.g., be integrated into larger SNN architectures that could profit from a trainable on-chip processing stage.
IV Methods
In this section, we describe our system on three different levels [104]. First, we describe the computational level by fully stating the equations that result in the intended computation. Second, we describe the spiking neural network (SNN) algorithm. Third, we describe the details of our hardware implementation that are necessary for exact reproducibility.
IV.1 The Binarized Backpropagation Model
Network Model.
Backpropagation is a means of optimizing synaptic weights in a multilayer neural network. It solves the problem of credit assignment, i.e., attributing changes in the error to changes in upstream weights, by recursively using the chain rule. The inference (forward) pass through the network is computed as
(11) $\mathbf{y}^{(l)} = f\bigl(W^{(l)}\,\mathbf{y}^{(l-1)}\bigr)$
where $f$ is an elementwise nonlinearity and $W^{(l)}$ is the weight matrix of the respective layer. The degree to which the network's output ($\mathbf{y}^{(N)}$) deviates from the target values ($\mathbf{t}$) is quantified by the squared error, $E = \tfrac{1}{2}\lVert\mathbf{t} - \mathbf{y}^{(N)}\rVert^{2}$, which we aim to minimize. The weight updates for each layer are computed recursively by
(12) $\boldsymbol{\delta}^{(N)} = \bigl(\mathbf{t} - \mathbf{y}^{(N)}\bigr) \odot f'\bigl(\mathbf{x}^{(N)}\bigr)$
(13) $\boldsymbol{\delta}^{(l)} = \bigl(W^{(l+1)\top}\,\boldsymbol{\delta}^{(l+1)}\bigr) \odot f'\bigl(\mathbf{x}^{(l)}\bigr)$
(14) $\Delta W^{(l)} = \eta\,\boldsymbol{\delta}^{(l)}\,\mathbf{y}^{(l-1)\top}$
where $\mathbf{y}^{(l)}$ is the network activity at layer $l$ (i.e., $\mathbf{y}^{(0)}$ is the input). Here, $f'$ denotes the derivative of the activation function, $\odot$ denotes a Hadamard product, i.e., the elementwise product of two vectors, $\top$ denotes the matrix transpose, and $\mathbf{x}^{(l)}$ denotes the layer input $W^{(l)}\,\mathbf{y}^{(l-1)}$, with $l = 1, \dots, N$. The parameter $\eta$ denotes the learning rate. These general equations (12)-(14) are realized for two layers in our implementation as given by (5)-(7) in Section II. Below, we relate these equations to the neural Hebbian learning mechanism used in the neuromorphic implementation of sBP.
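As a concrete reference point, the recursion (11)-(14) for the two-layer case can be sketched in a few lines of NumPy. All names here are illustrative, and a differentiable nonlinearity stands in for $f$; the spiking, surrogate-gradient realization is described below:

```python
import numpy as np

def bp_step(W1, W2, x, t, f, df, eta):
    """One step of Eqs. (11)-(14) for a two-layer network."""
    x1 = W1 @ x                      # layer inputs, Eq. (11)
    y1 = f(x1)
    x2 = W2 @ y1
    y2 = f(x2)
    d2 = (t - y2) * df(x2)           # output local gradient, Eq. (12)
    d1 = (W2.T @ d2) * df(x1)        # hidden local gradient, Eq. (13)
    W1 = W1 + eta * np.outer(d1, x)  # weight updates, Eq. (14)
    W2 = W2 + eta * np.outer(d2, y1)
    return W1, W2, y2
```

Iterating this update on a fixed input with, e.g., a sigmoid for `f` decreases the squared error, which is the behavior the spiking circuit below has to reproduce.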
Although in theory the derivative of the activation function is applied in (12), in the case that $f$ is a binary thresholding (Heaviside) function, the derivative is the Dirac delta function, which is zero everywhere apart from at the threshold. We use a common approach [68] and represent the activation function by a surrogate (or straight-through estimator [67]) in the form of a rectified linear map (ReLU) truncated between 0 and 1 (Eq. (16)) in the part of the circuit that affects error backpropagation. The derivative of this surrogate ($s'$) has the form of a box function, i.e.
(15) $f(x) = H(x - 0.5)$
(16) $s(x) = \min\bigl(1, \max(0, x)\bigr)$
(17) $b(x) := s'(x) = H(x) - H(x - 1)$
where $H$ denotes the Heaviside step function:
(18) $H(x) = \begin{cases} 1 & \text{if } x > 0, \\ 0 & \text{otherwise.} \end{cases}$
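The four functions above, in the form reconstructed here, can be written compactly and checked against each other (vectorized, with $H(0) = 0$ by convention):

```python
import numpy as np

def H(x):   # Heaviside step, Eq. (18)
    return (np.asarray(x) > 0).astype(float)

def f(x):   # binary (Heaviside) activation with threshold 0.5, Eq. (15)
    return H(np.asarray(x) - 0.5)

def s(x):   # surrogate: ReLU truncated between 0 and 1, Eq. (16)
    return np.clip(x, 0.0, 1.0)

def b(x):   # box function, the derivative of the surrogate, Eq. (17)
    return H(x) - H(np.asarray(x) - 1.0)
```

The box function `b` is 1 exactly on the interval where the surrogate `s` has slope 1, and 0 where `s` is saturated.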
In the following section, we describe how we implement Equations (11)-(16) in a spiking neural network.
Weight Initialization.
Plastic weights are initialized by sampling from a Gaussian distribution with a mean of 0 and a standard deviation of $\sqrt{2/n_{\mathrm{pre}}}$ (He initialization [105]). Here, $n_{\mathrm{pre}}$ denotes the number of neurons of the presynaptic layer and $n_{\mathrm{post}}$ the number of neurons of the postsynaptic layer.
Input Data. The images of the MNIST dataset [66] were cropped by a margin of 4 pixels on each side to remove pixels that are never active and thus avoid unused neurons and synapses on the chip. The pixel values were thresholded at 0.5 to obtain a black-and-white image for use as input to the network. In the case of the smaller architecture, the input images were additionally downsampled by a factor of 2. The dataset was presented in a different random order in each epoch.
Accuracy Calculation. The reported accuracies are calculated on the full MNIST test set. A sample was counted as correct when the index of the spiking neuron of the output layer in the output phase (time step 3 in Fig. 2) was equal to the label index of the presented image. In fewer than 1% of cases, there was more than one spike in the output layer; in that case, the lowest spiking neuron index was compared.
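The decision rule, including the tie-breaking by lowest index, amounts to the following (our naming):

```python
def predicted_label(output_spikes):
    """Index of the spiking output neuron in the output phase; on the
    rare multi-spike trials, the lowest spiking index wins; -1 if none."""
    for i, spiked in enumerate(output_spikes):
        if spiked:
            return i
    return -1
```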
IV.2 The Spiking Backpropagation Algorithm
Spiking Neuron Model. For a generic spiking neuron element, we use the current-based (CUBA) linear leaky integrate-and-fire model. This model is implemented on Intel's neuromorphic research processor, codenamed Loihi [8]. Time evolution in the CUBA model as implemented on Loihi is described by the discrete-time dynamics, with decay time constants $\tau_u$ and $\tau_v$ and time increment $\Delta t = 1$:
(19) $u_i(t) = u_i(t-1)\Bigl(1 - \tfrac{1}{\tau_u}\Bigr) + \sum_j w_{ij}\, s_j(t), \qquad v_i(t) = v_i(t-1)\Bigl(1 - \tfrac{1}{\tau_v}\Bigr) + u_i(t) + b_i$
where $i$ identifies the neuron.
The membrane potential $v_i$ is reset to 0 upon exceeding the threshold $\vartheta$ and remains at its reset value for a refractory period, $\tau_{\mathrm{ref}}$. Upon reset, a spike is sent to all connecting synapses. Here, $u_i$ represents a neuronal current and $s_j(t)$ represents time-dependent spiking input, weighted by the synaptic weights $w_{ij}$. $b_i$ is a constant input current.
Parameters and Mapping. In our implementation of the backpropagation algorithm, we take $\tau_u = 1$, $\tau_v = 1$, and a strongly negative bias $b_i$ (except in gating neurons, where $b_i = 0$). This leads to a memoryless point neuron that spikes whenever its input in the respective time step exceeds $\vartheta$. This happens when the neuron receives synaptic input larger than $\vartheta/2$ and, in the same time step, a gating input overcomes the strong inhibition of the bias $b_i$, i.e., it is gated 'on'. This is how the Heaviside function in Equation (15) is implemented. For the other activation functions, a different gating input is applied.
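With these settings, one discrete update of Eq. (19) reduces to a memoryless threshold test. The following sketch (our naming; $\tau_u = \tau_v = 1$ and an illustrative threshold) makes this explicit:

```python
def cuba_step(v, u, syn_in, b, theta, tau_u=1.0, tau_v=1.0):
    """One discrete CUBA update as in Eq. (19); returns (v, u, spiked)."""
    u = u * (1.0 - 1.0 / tau_u) + syn_in   # current decays, then integrates input
    v = v * (1.0 - 1.0 / tau_v) + u + b    # membrane integrates current and bias
    if v > theta:
        return 0.0, u, True                # reset to 0 upon spiking
    return v, u, False
```

With `tau_u = tau_v = 1`, the previous state is fully discarded each step, so the neuron spikes exactly when the current step's input exceeds `theta`; a gating input worth half the threshold then lets any synaptic input above `theta / 2` trigger a spike.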
There is a straightforward mapping between the weights and activations in the spiking neural network (SNN) described in this section and the corresponding artificial neural network (ANN) described in Section IV.1:
(20) $w^{\mathrm{ANN}} = \dfrac{w^{\mathrm{Loihi}}}{\vartheta}$
So, a value of $\vartheta = 1024$ allows for a maximal ANN weight of 0.25, because the allowed weights on Loihi are the even integers from −256 to 254.
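Under this scaling (as reconstructed here, with the assumed $\vartheta = 1024$ consistent with the stated maximal ANN weight of 0.25), the mapping and its limits can be checked numerically; the function name is ours:

```python
def loihi_to_ann(w_loihi, theta=1024):
    """Assumed mapping of Eq. (20): w_ann = w_loihi / theta, where
    w_loihi is an even integer in [-256, 254]."""
    assert w_loihi % 2 == 0 and -256 <= w_loihi <= 254
    return w_loihi / theta
```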
Feedforward Pass. The feedforward pass can be seen as an independent circuit module that consists of 3 layers: an input layer with 400 (20×20) neurons that spike according to the binarized MNIST dataset, a hidden layer of 400 neurons, and an output layer of 10 neurons. The 3 layers are sequentially gated 'on' by the gating chain, so that activity travels from the input layer to the hidden layer through the plastic weight matrix $W^{(1)}$ and then from the hidden to the output layer through the plastic weight matrix $W^{(2)}$.
Learning Rule. The plastic weights follow a standard Hebbian learning rule with a global third factor that controls the sign of the update. Note, however, that here, unlike in other work with Hebbian learning rules, the particular activity routed to the layers makes the learning rule implement a supervised mechanism (backpropagation). Here we give the discrete update equation as implemented on Loihi:
(22) $\Delta w_{ij}(t) = 2\, x_j(t)\, y_i(t)\, \bigl(2\, r(t) - 1\bigr)$
Above, $x_j(t)$, $y_i(t)$, and $r(t)$ represent time series that are available at the synapse on the chip. The signals $x_j$ and $y_i$ are the pre- and postsynaptic neuron's spike trains, i.e., they are equal to 1 in time steps when the respective neuron fires, and 0 otherwise. The signal $r$ is a third factor that is provided to all synapses globally and determines which phase (potentiation or depression) the algorithm is in. $T$ denotes the number of phases per trial, which is 12 in this case. So, $r$ is 0 in all time steps apart from the 5th and 7th of each iteration, where the potentiation of the weights happens. This regularly occurring signal could thus be generated autonomously using the gating chain. On Loihi, $r$ is provided using a so-called "reinforcement channel". Note that the reinforcement channel can only provide a global modulatory signal that is the same for all synapses.
The above learning rule produces a positive weight update in time steps in which all three factors are 1, i.e., when both the pre- and postsynaptic neurons fire and the reinforcement channel is active. It produces a negative update when only the pre- and postsynaptic neurons fire, and the weight stays the same in all other cases.
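In code, Eq. (22) and the case analysis above read:

```python
def dw(x_pre, y_post, r):
    """Three-factor update of Eq. (22): dw = 2 * x * y * (2r - 1)."""
    return 2 * x_pre * y_post * (2 * r - 1)
```

The update is +2 when pre, post, and reinforcement are all active, −2 when only pre and post fire, and 0 whenever either neuron is silent.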
To achieve the correct weight update according to the backpropagation algorithm (see (10)), the spiking network has to be designed such that the presynaptic activity $\mathbf{y}^{(l-1)}$ and the local gradient $\boldsymbol{\delta}^{(l)}$ are present in neighboring neurons at the same time step. Furthermore, the sign of the local gradient has to determine whether the simultaneous activity happens in a time step with an active third factor or not.
This requires control over the flow of spikes in the network, which is achieved by a mechanism called synfire gating [64, 65], which we adapt and simplify here.
Gating Chain. Gating neurons are a separate structure within the backpropagation circuit and are connected in a ring. That is, the gating chain is a circular chain of neurons that, once activated, controls the timing of information processing in the rest of the network. This allows information routing throughout the network to be autonomous, realizing the benefits of neuromorphic hardware without the need for control by a classical sequential processor. Specifically, the neurons in the gating chain are connected to the relevant layers of the network, which allows them to control when and where information is propagated. All layers are inhibited far below their firing threshold by default, as described in Section IV.2, and can only transmit information, i.e., generate spikes, if their inhibition is neutralized via activation by the gating chain. Because a neuron only fires if it is gated 'on' AND gets sufficient input, such gating corresponds to a logical AND, or coincidence detection, with the input.
In our implementation, the gating chain consists of 12 neurons, which induce the 12 algorithmic (Loihi) time steps that are needed to process one sample. Each gating neuron is connected to all layers that must be active in the respective time step. The network layers are connected in the manner shown in Supplementary Fig. 4, but the circuit structure, which is controlled by the gating chain, results in a functionally connected network as presented in Fig. 2, where the layers are shown according to the timing of when they become active during one iteration.
The weight of the standard gating synapses (from one of the gating neurons to each neuron in a target layer) is chosen such that each neuron that is gated 'on' is brought to half of its firing threshold, which effectively implements Eq. (15). In four cases, i.e., for the synapses to the 'start learning' layers in time steps 2 and 3 and to the backpropagated local gradient layer in time steps 6 and 10, the gating weight is larger, bringing the gated neurons exactly to their firing threshold. In two cases, i.e., for the synapses to the 'stop learning' layers in time steps 2 and 3, the gating weight only neutralizes the default inhibition, leaving the gated neurons without an additional offset. These different gating inputs lead to step activation functions with different thresholds, as required for the computations explained below, in Section IV.2.3.
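A minimal sketch of the ring structure of the gating chain follows; the layer names and the three-entry schedule are illustrative only, not the actual 12-step schedule of Fig. 2:

```python
def gating_schedule(schedule, n_steps, chain_len=12):
    """Ring of gating neurons: in time step t, gating neuron t % chain_len
    fires and gates 'on' the layers assigned to it in `schedule`."""
    return [sorted(schedule.get(t % chain_len, [])) for t in range(n_steps)]
```

Because the chain is a ring, the schedule repeats autonomously every `chain_len` steps, i.e., once per sample, without any external controller.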
Backpropagation Network Modules. In the previous sections, we have explained how the weight update happens and how to bring the relevant values ($\mathbf{y}^{(l-1)}$ and $\boldsymbol{\delta}^{(l)}$, according to (10)) to the correct place at the correct time. In this section, we discuss how these values are actually calculated. The presynaptic activity $\mathbf{y}^{(l-1)}$ from the feedforward pass does not need to be calculated but only remembered. This is done using a relay layer with synaptic delays, as explained in Section IV.2.1. The last layer local gradient $\boldsymbol{\delta}^{(2)}$ consists of two parts according to (5): the difference between the output and the target (see Section IV.2.2) and the box function must be calculated. We factorize the latter into two terms, a 'start' and a 'stop' learning signal (see Section IV.2.3). The backpropagated local gradient $\boldsymbol{\delta}^{(1)}$ also consists of two parts, according to Eq. (6). In addition to another 'start' and 'stop' learning signal, we need the backpropagated error $W^{(2)\top}\boldsymbol{\delta}^{(2)}$, whose computation is explained in Section IV.2.4.
In the following equations, each part of the weight update is annotated with the number of the subsection in which the calculating module is described:
(23) $\Delta W^{(2)} = \eta\,\bigl[\underbrace{(\mathbf{t} - \mathbf{y}^{(2)})}_{\text{IV.2.2}} \odot \underbrace{b(\mathbf{x}^{(2)})}_{\text{IV.2.3}}\bigr]\;\underbrace{\mathbf{y}^{(1)\top}}_{\text{IV.2.1}}$
(24) $\Delta W^{(1)} = \eta\,\bigl[\underbrace{(W^{(2)\top}\boldsymbol{\delta}^{(2)})}_{\text{IV.2.4}} \odot \underbrace{b(\mathbf{x}^{(1)})}_{\text{IV.2.3}}\bigr]\;\underbrace{\mathbf{y}^{(0)\top}}_{\text{IV.2.1}}$
IV.2.1 Relay Neurons
The memory used to store the activity of the input and the hidden layer is, in each case, a single relay layer that is connected both from and to the respective layer in a one-to-one manner with the proper delays. The input layer sends its activity to its relay layer so that the activity can be restored in the update phases in time steps 7 and 11. The hidden layer sends its activity to its relay layer so that the activity can be restored in the update phases in time steps 5 and 9.
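The relay mechanism is essentially a delay line: activity enters, waits a fixed number of time steps, and is re-emitted into the originating layer. A sketch with our naming:

```python
from collections import deque

class RelayLayer:
    """One-to-one relay: activity stored at time t is restored at t + delay."""
    def __init__(self, delay):
        self.buffer = deque([None] * delay, maxlen=delay)

    def step(self, activity):
        restored = self.buffer[0]      # oldest entry reaches the target layer
        self.buffer.append(activity)   # newest entry enters the delay line
        return restored
```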
IV.2.2 Error Calculation
The error calculation requires a representation of signed quantities, which is not directly possible in a spiking network because there are no negative spikes. This is achieved here by splitting the error evaluation into two parts, $\mathbf{e}^{+} = (\mathbf{t} - \mathbf{y}^{(2)})^{+}$ and $\mathbf{e}^{-} = (\mathbf{t} - \mathbf{y}^{(2)})^{-}$, which yield the positive and negative components separately. Similarly, the calculation of the backpropagated local gradient $\boldsymbol{\delta}^{(1)}$ is performed using a negative copy of the transposed weight matrix, and it is done in two phases for the positive and negative local gradient, respectively. In the spiking neural network, $\mathbf{e}^{+}$ is implemented by an excitatory synapse from the target and an inhibitory synapse of the same strength from the output, and vice versa for $\mathbf{e}^{-}$. Like almost all other non-plastic synapses in the network, the absolute weight of these synapses is just above the value that makes the target neuron fire when gated 'on'. The difference between the output and the target is, however, just one part of the local gradient $\boldsymbol{\delta}^{(2)}$. The other part is the derivative of the activation function (the box function).
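With binary target and output vectors, this two-population error code can be sketched as follows; `e_plus` is driven excitatorily by the target and inhibitorily by the output, and vice versa for `e_minus` (our naming):

```python
def split_error(target, output):
    """Signed error t - y as two non-negative spike populations."""
    e_plus  = [int(t - y > 0) for t, y in zip(target, output)]   # t=1, y=0
    e_minus = [int(y - t > 0) for t, y in zip(target, output)]   # t=0, y=1
    return e_plus, e_minus
```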
IV.2.3 Start and Stop Learning Conditions
The box function (17) can be split into two conditions: a 'start' learning and a 'stop' learning condition. These two conditions are calculated in parallel with the feedforward pass. The feedforward activation corresponding to Eq. (1) is an application of the spiking threshold to the layer's input with an offset of half the firing threshold, which is given by the additional input from the gating neurons. The first term of the box function (9) is also an application of the spiking threshold, but this time with an offset equal to the firing threshold, so that any input larger than 0 elicits a spike. We call this first term the 'start' learning condition; it is represented in dedicated layers for the hidden and the output layer. The second term of the box function in Eq. (9) is also an application of the spiking threshold, but this time without an offset, so that only an input larger than the firing threshold elicits a spike. We call this second term the 'stop' learning condition; it is likewise represented in dedicated layers for the hidden and output layers. For the $\boldsymbol{\delta}^{(2)}$ update, the two conditions are combined in a box function layer that then gates the local gradient layer. For the $\boldsymbol{\delta}^{(1)}$ update, the two conditions are applied directly to the layers, because an intermediate layer would waste one time step. The function is, however, the same: the 'stop' learning signal inhibits the layers and the 'start' learning signal gates them. In our implementation, the two conditions are obtained in parallel with the feedforward pass, which requires two additional copies of each of the two weight matrices. An alternative method that avoids these copies, but takes more time steps, would do this computation sequentially, using the synapses and layers of the feedforward pass three times per layer with different offsets and then routing the outputs to their respective destinations.
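The factorization of the box function into the two conditions can be checked directly, in normalized units where the surrogate saturates at 1 (our naming):

```python
def box_via_start_stop(x):
    """Box function as 'start' AND NOT 'stop'."""
    start = 1 if x > 0 else 0   # offset = full threshold: any input > 0 spikes
    stop = 1 if x > 1 else 0    # no offset: only input above threshold spikes
    return start * (1 - stop)
```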
IV.2.4 Error Backpropagation
Error calculation, gating by the 'start' learning signal, and inhibition by the 'stop' learning signal are combined in time step 4 to calculate the last layer local gradients $\boldsymbol{\delta}^{(2)+}$ and $\boldsymbol{\delta}^{(2)-}$. From there, the local gradients are routed into the output layer and its copies for the last layer weight update. This happens in two phases: the positive local gradient is connected without delay, so that it leads to potentiation of the forward and backward last layer weight matrices in time step 5. The negative local gradient is connected with a delay of 4 time steps, so that it arrives in the depression phase in time step 9. For the connections to the layer that is connected to the negative copy of the transposed weight matrix, the opposite delays are used to get a weight update with the opposite sign. See Fig. 2 for a visualization of this mechanism. Effectively, this leads to the last layer weight update
(25) $\Delta W^{(2)} = \eta\,\Bigl[f\bigl(\boldsymbol{\delta}^{(2)}\bigr) - f\bigl(-\boldsymbol{\delta}^{(2)}\bigr)\Bigr]\,\mathbf{y}^{(1)\top}, \qquad \boldsymbol{\delta}^{(2)} = \bigl(\mathbf{t} - \mathbf{y}^{(2)}\bigr)\odot b\bigl(\mathbf{x}^{(2)}\bigr)$
where the first term on the right-hand side is nonzero when the local gradient is positive, corresponding to an update happening in the potentiation phase, and the second term is nonzero when the local gradient is negative, corresponding to an update happening in the depression phase. The functions $f$ and $b$ are as described in Equations (15) and (9).
The local gradient activation in the output layer does not only serve the purpose of updating the last layer weights; it is also directly used to backpropagate the local gradients. Propagating the signed local gradients backwards through the network layers requires a positive and a negative copy of the transposed weights, $W^{(2)\top}$ and $-W^{(2)\top}$, which are the weight matrices of the synapses between the output layer and the backpropagated local gradient layer, and between an intermediate layer and the backpropagated local gradient layer, respectively. This intermediate layer is created exclusively for this purpose. The local gradient is propagated backwards in two phases: the part of the local gradient that leads to potentiation is propagated back in time steps 5 to 7, and the part of the local gradient that leads to depression of the weights is propagated back in time steps 9 to 11. In time step 6, the potentiating local gradient is calculated as
(26) $\boldsymbol{\delta}^{(1)+} = H\bigl(W^{(2)\top}\boldsymbol{\delta}^{(2)+} + \bigl(-W^{(2)\top}\bigr)\boldsymbol{\delta}^{(2)-}\bigr) = H\bigl(W^{(2)\top}\boldsymbol{\delta}^{(2)}\bigr)$
and in time step 10 the depressing local gradient is calculated as
(27) $\boldsymbol{\delta}^{(1)-} = H\bigl(\bigl(-W^{(2)\top}\bigr)\boldsymbol{\delta}^{(2)+} + W^{(2)\top}\boldsymbol{\delta}^{(2)-}\bigr) = H\bigl(-W^{(2)\top}\boldsymbol{\delta}^{(2)}\bigr)$
Critically, this procedure does not simply separate the weights by sign, but rather maintains a complete copy of the weights that is used to associate the appropriate sign information to the backpropagated local gradient values. Note that here the Heaviside activation function $H$ is used rather than the binary activation function $f$, so that any positive gradient will induce an update of the weights. Any positive threshold in this activation would lead to poor performance by making the learning insensitive to small gradient values. The transposed weight copies must be kept in sync with their forward counterparts, so the updates in the potentiation and depression phases are applied to the forward and backward weights concurrently.
So in total, after each trial, the actual first layer weight update is the sum of four different parts:
(28) $\Delta W^{(1)} = \eta\,\bigl(\boldsymbol{\delta}^{(1)+} - \boldsymbol{\delta}^{(1)-}\bigr)\,\mathbf{y}^{(0)\top} = \eta\,\Bigl[H\bigl(W^{(2)\top}\boldsymbol{\delta}^{(2)+} - W^{(2)\top}\boldsymbol{\delta}^{(2)-}\bigr) - H\bigl(W^{(2)\top}\boldsymbol{\delta}^{(2)-} - W^{(2)\top}\boldsymbol{\delta}^{(2)+}\bigr)\Bigr]\,\mathbf{y}^{(0)\top}$
These four terms are necessary because, e.g., a positive error can also lead to depression if backpropagated through a negative weight matrix, and the other way round.
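The sign bookkeeping of this mechanism, with the negative weight copy written explicitly, can be verified numerically; the following is a sketch with illustrative names:

```python
import numpy as np

def backprop_signed(WT, d2_plus, d2_minus):
    """Split the backpropagated gradient into potentiating and
    depressing parts using a positive and a negative copy of W^T."""
    H = lambda z: (z > 0).astype(float)
    WT_neg = -WT                                    # negative weight copy
    d1_plus = H(WT @ d2_plus + WT_neg @ d2_minus)   # potentiation path
    d1_minus = H(WT_neg @ d2_plus + WT @ d2_minus)  # depression path
    return d1_plus, d1_minus
```

Here `d1_plus - d1_minus` recovers the sign of $W^\top(\boldsymbol{\delta}^{+} - \boldsymbol{\delta}^{-})$, so a positive output error reaching a hidden neuron through a negative weight correctly lands in the depression phase.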
IV.3 The sBP Implementation on Loihi
Partitioning on the Chip. To distribute the spike load over cores, the neurons of each layer are approximately equally distributed over 96 cores of a single chip. This distribution is advantageous because only a few layers are active at each time step, and Loihi processes the spikes within a core in a sequential manner. In total, the network as presented here needs $2n_{\mathrm{in}} + 6n_{\mathrm{hid}} + 7n_{\mathrm{out}} + 12$ neurons. With $n_{\mathrm{in}} = 400$, $n_{\mathrm{hid}} = 400$, and $n_{\mathrm{out}} = 10$, these are 3282 neurons and about 200k synapses, most of which are synapses of the 3 plastic all-to-all connections between the input and the hidden layer.
Learning Implementation. The learning rule on Loihi is implemented as given in Eq. (22). Because the precision of the plastic weights on Loihi is maximally 8 bits, with a weight range from −256 to 254, we can only change the weight in steps of 2 without making the update nondeterministic. This is necessary for keeping the various copies of the weights in sync (hence the factor of 2 in Eq. (22)). Given the mapping (20), this corresponds to a learning rate of $\eta = 2/\vartheta$. The learning rate can be changed by increasing the factor in the learning rule, which leads to a reduction in usable weight precision, or by changing $\vartheta$, which changes the range of possible weights according to Eq. (20). Several learning rates (settings of $\vartheta$) were tested, with the result that the final accuracy is not very sensitive to small changes. The learning rate that yielded the best accuracy is reported here. In the NxSDK API, the neuron traces that are used for the learning rule are x0, y0, and r1, corresponding to $x_j$, $y_i$, and $r$ in Eq. (22), respectively. r1 was used with a decay time constant of 1, so that it is only active in the respective time step, effectively corresponding to r0. To provide the $r$ signal, a single reinforcement channel was used, activated by a regularly firing spike generator in time steps 5 and 7.
Weight Initialization. After the He initialization described in Section IV.1, the weights are mapped to Loihi weights according to (20). Then, the weights are truncated to the allowed range of [−256, 254] and discretized to 8-bit resolution, i.e., steps of 2, by rounding them towards 0 to the next valid number.
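The truncation and rounding-toward-zero step can be sketched as follows (our naming):

```python
def discretize_weight(w):
    """Clip to the Loihi range [-256, 254] and round toward 0 to a
    multiple of 2 (8-bit resolution in steps of 2)."""
    w = max(-256.0, min(254.0, w))
    return int(w / 2) * 2   # int() truncates toward zero
```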
Power Measurements. All Loihi power measurements were obtained using NxSDK version 0.9.9 on the Nahuku32 board nclghrd01. Both the software API and the hardware were provided by Intel Labs. All other probes, including the output probes, were deactivated. For the inference measurements, we use a network that consists only of the three feedforward layers with non-plastic weights and a gating chain of four neurons. The power was measured on the first 10000 samples of the training set for the training measurements and on all 10000 samples of the test set for the inference measurements.
References
 [1] Roelfsema, P. R. & Holtmaat, A. Control of synaptic plasticity in deep cortical networks. Nat Rev Neurosci 19, 166–180 (2018).
 [2] Linnainmaa, S. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 6–7 (1970).
 [3] Werbos, P. Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. dissertation, Harvard University (1974).
 [4] Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning internal representations by error propagation. Tech. Rep. (1985). URL http://www.dtic.mil/docs/citations/ADA164453.
 [5] Lillicrap, T. P., Santoro, A., Marris, L., Akerman, C. J. & Hinton, G. Backpropagation and the brain. Nature Reviews Neuroscience 1–12 (2020).
 [6] Yamins, D. L. & DiCarlo, J. J. Using goal-driven deep learning models to understand sensory cortex. Nat. Neurosci. 19, 356–365 (2016).
 [7] Mead, C. Neuromorphic electronic systems. Proceedings of the IEEE 78, 1629–1636 (1990).
 [8] Davies, M. et al. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro 38, 82–99 (2018). URL https://doi.org/10.1109/MM.2018.112130359.
 [9] Esser, S. et al. Convolutional networks for fast, energy-efficient neuromorphic computing. PNAS 113, 11441–11446 (2016).
 [10] Rueckauer, B., Lungu, I.-A., Hu, Y., Pfeiffer, M. & Liu, S.-C. Conversion of continuous-valued deep networks to efficient event-driven networks for image classification. Frontiers in Neuroscience 11, 682 (2017).
 [11] Severa, W., Vineyard, C. M., Dellana, R., Verzi, S. J. & Aimone, J. B. Training deep neural networks for binary communication with the whetstone method. Nature Machine Intelligence 1, 86–94 (2019).
 [12] Grossberg, S. Competitive learning: From interactive activation to adaptive resonance. In Waltz, D. & Feldman, J. A. (eds.) Connectionist Models and Their Implications: Readings from Cognitive Science, 243–283 (Ablex Publishing Corp., Norwood, NJ, USA, 1988). URL http://dl.acm.org/citation.cfm?id=54455.54464.
 [13] Crick, F. The recent excitement about neural networks. Nature 337, 129–132 (1989).
 [14] Painkras, E. et al. SpiNNaker: A multi-core system-on-chip for massively-parallel neural net simulation. In Proceedings of the IEEE 2012 Custom Integrated Circuits Conference (2012).
 [15] Schemmel, J. et al. A wafer-scale neuromorphic hardware system for large-scale neural modeling. In Proceedings of the 2010 IEEE International Symposium on Circuits and Systems (2010).
 [16] Qiao, N. et al. A reconfigurable online learning spiking neuromorphic processor comprising 256 neurons and 128k synapses. Frontiers in neuroscience 9, 141 (2015).
 [17] Grossberg, S. Competitive learning: From interactive activation to adaptive resonance. Cognitive science 11, 23–63 (1987).

 [18] Liao, Q., Leibo, J. & Poggio, T. How important is weight symmetry in backpropagation? In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30 (2016).
 [19] Bohte, S. M., Kok, J. N. & La Poutré, H. Error-backpropagation in temporally encoded networks of spiking neurons. Neurocomputing 48, 17–37 (2002).
 [20] Pfister, J.P., Toyoizumi, T., Barber, D. & Gerstner, W. Optimal spiketimingdependent plasticity for precise action potential firing in supervised learning. Neural Computation 18, 1318–1348 (2006). URL http://dx.doi.org/10.1162/neco.2006.18.6.1318.
 [21] Zenke, F. & Ganguli, S. Superspike: Supervised learning in multilayer spiking neural networks. Neural computation 30, 1514–1541 (2018).
 [22] Huh, D. & Sejnowski, T. J. Gradient descent for spiking neural networks. In Advances in Neural Information Processing Systems, 1433–1443 (2018).
 [23] Rasmussen, D. NengoDL: Combining deep learning and neuromorphic modelling methods. Neuroinformatics 17, 611–628 (2019).
 [24] Sengupta, A., Ye, Y., Wang, R., Liu, C. & Roy, K. Going deeper in spiking neural networks: VGG and residual architectures. Frontiers in Neuroscience 13, 95 (2019). URL https://www.frontiersin.org/article/10.3389/fnins.2019.00095/full.
 [25] Shrestha, S. B. & Orchard, G. Slayer: Spike layer error reassignment in time. In Advances in Neural Information Processing Systems, 1412–1421 (2018).
 [26] Nair, M. V. & Indiveri, G. Mapping high-performance RNNs to in-memory neuromorphic chips. arXiv preprint arXiv:1905.10692 (2019).
 [27] Rueckauer, B. et al. NxTF: An API and compiler for deep spiking neural networks on Intel Loihi. arXiv preprint arXiv:2101.04261 (2021).
 [28] Stewart, K., Orchard, G., Shrestha, S. B. & Neftci, E. On-chip few-shot learning with surrogate gradient descent on a neuromorphic processor. In 2020 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), 223–227 (IEEE, 2020).
 [29] DeWolf, T., Jaworski, P. & Eliasmith, C. Nengo and lowpower ai hardware for robust, embedded neurorobotics. Frontiers in Neurorobotics 14 (2020).
 [30] Frenkel, C., Lefebvre, M., Legat, J.-D. & Bol, D. A 0.086-mm² 12.7-pJ/SOP 64k-synapse 256-neuron online-learning digital spiking neuromorphic processor in 28-nm CMOS. IEEE Transactions on Biomedical Circuits and Systems 13, 145–158 (2018).
 [31] Kim, J. K., Knag, P., Chen, T. & Zhang, Z. A 640M pixel/s 3.65 mW sparse event-driven neuromorphic object recognition processor with on-chip learning. In 2015 Symposium on VLSI Circuits (VLSI Circuits), C50–C51 (IEEE, 2015).
 [32] Buhler, F. N. et al. A 3.43 TOPS/W 48.9 pJ/pixel 50.1 nJ/classification 512 analog neuron sparse coding neural network with on-chip learning and classification in 40-nm CMOS. In 2017 Symposium on VLSI Circuits, C30–C31 (IEEE, 2017).
 [33] Park, J., Lee, J. & Jeon, D. 7.6 A 65-nm 236.5 nJ/classification neuromorphic processor with 7.5% energy overhead on-chip learning using direct spike-only feedback. In 2019 IEEE International Solid-State Circuits Conference (ISSCC), 140–142 (IEEE, 2019).
 [34] Nandakumar, S. et al. Experimental demonstration of supervised learning in spiking neural networks with phase-change memory synapses. Scientific Reports 10, 1–11 (2020).
 [35] Frenkel, C., Legat, J.-D. & Bol, D. A 28-nm convolutional neuromorphic processor enabling online learning with spike-based retinas. In 2020 IEEE International Symposium on Circuits and Systems (ISCAS), 1–5 (IEEE, 2020).
 [36] Shrestha, A., Fang, H., Rider, D., Mei, Z. & Qui, Q. In-hardware learning of multilayer spiking neural networks on a neuromorphic processor. In To appear in 2021 58th ACM/ESDA/IEEE Design Automation Conference (DAC) (IEEE, 2021).
 [37] Imam, N. & Cleland, T. A. Rapid online learning and robust recall in a neuromorphic olfactory circuit. Nature Machine Intelligence 2, 181–191 (2020).
 [38] Friedmann, S. et al. Demonstrating hybrid learning in a flexible neuromorphic hardware system. IEEE transactions on biomedical circuits and systems 11, 128–142 (2016).
 [39] Nandakumar, S. et al. Mixed-precision deep learning based on computational memory. Frontiers in Neuroscience 14, 406 (2020).
 [40] Payvand, M., Fouda, M. E., Kurdahi, F., Eltawil, A. M. & Neftci, E. O. On-chip error-triggered learning of multi-layer memristive spiking neural networks. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 10, 522–535 (2020).
 [41] Payeur, A., Guerguiev, J., Zenke, F., Richards, B. & Naud, R. Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits. bioRxiv (2020). URL https://doi.org/10.1101/2020.03.30.015511.
 [42] Bellec, G. et al. A solution to the learning dilemma for recurrent networks of spiking neurons. bioRxiv 738385 (2020).
 [43] Sacramento, J., Costa, R. P., Bengio, Y. & Senn, W. Dendritic cortical microcircuits approximate the backpropagation algorithm. In Advances in neural information processing systems, 8721–8732 (2018).
 [44] Stork, D. G. Is backpropagation biologically plausible. In International Joint Conference on Neural Networks, vol. 2, 241–246 (IEEE Washington, DC, 1989).
 [45] Zipser, D. & Rumelhart, D. The neurobiological significance of the new learning models. In Schwartz, E. L. (ed.) Computational Neuroscience, 192––200 (The MIT Press, 1990).
 [46] O’Reilly, R. C. Biologically plausible error-driven learning using local activation differences: The generalized recirculation algorithm. Neural Computation 8, 895–938 (1996). URL http://www.mitpressjournals.org/doi/10.1162/neco.1996.8.5.895.
 [47] Kolen, J. & Pollack, J. Backpropagation without weight transport. In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN’94), 1375–1380 (IEEE, 1994). URL http://ieeexplore.ieee.org/document/374486/.
 [48] Lillicrap, T. P., Cownden, D., Tweed, D. B. & Akerman, C. J. Random synaptic feedback weights support error backpropagation for deep learning. Nature communications 7, 1–10 (2016).
 [49] Liao, Q., Leibo, J. Z. & Poggio, T. How important is weight symmetry in backpropagation? In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, 1837–1844 (AAAI Press, 2016). URL http://dl.acm.org/citation.cfm?id=3016100.3016156.
 [50] Richards, B. A. & Lillicrap, T. P. Dendritic solutions to the credit assignment problem. Current Opinion in Neurobiology 54, 28–36 (2019). URL http://dx.doi.org/10.1016/j.conb.2018.08.003.

 [51] O’Connor, P., Neil, D., Liu, S.-C., Delbruck, T. & Pfeiffer, M. Real-time classification and sensor fusion with a spiking deep belief network. Frontiers in Neuroscience 7, 178 (2013). URL http://dx.doi.org/10.3389/fnins.2013.00178.
 [52] Kim, R., Li, Y. & Sejnowski, T. J. Simple framework for constructing functional spiking recurrent neural networks. Proceedings of the National Academy of Sciences 116, 22811–22820 (2019).
 [53] Neftci, E. O., Mostafa, H. & Zenke, F. Surrogate gradient learning in spiking neural networks. IEEE Signal Processing Magazine 36, 61–63 (2019).
 [54] Izhikevich, E. M. Solving the distal reward problem through linkage of STDP and dopamine signaling. Cerebral Cortex 17, 2443–2452 (2007). URL http://dx.doi.org/10.1093/cercor/bhl152.
 [55] Sporea, I. & Grüning, A. Supervised learning in multilayer spiking neural networks. Neural Computation 25, 473–509 (2013). URL http://dx.doi.org/10.1162/NECO_a_00396.
 [56] Legenstein, R., Pecevski, D. & Maass, W. A learning theory for reward-modulated spike-timing-dependent plasticity with application to biofeedback. PLoS Computational Biology 4, e1000180 (2008). URL http://dx.doi.org/10.1371/journal.pcbi.1000180.
 [57] Frémaux, N. & Gerstner, W. Neuromodulated spike-timing-dependent plasticity, and theory of three-factor learning rules. Frontiers in Neural Circuits 9, 85 (2015). URL http://dx.doi.org/10.3389/fncir.2015.00085.
 [58] Tavanaei, A., Ghodrati, M., Kheradpisheh, S. R., Masquelier, T. & Maida, A. Deep learning in spiking neural networks. Neural Networks 111, 47–63 (2019).
 [59] Sornborger, A., Tao, L., Snyder, J. & Zlotnik, A. A pulse-gated, neural implementation of the backpropagation algorithm. In Proceedings of the 7th Annual Neuro-Inspired Computational Elements Workshop, 10 (ACM, 2019).
 [60] Sornborger, A., Wang, Z. & Tao, L. A mechanism for graded, dynamically routable current propagation in pulse-gated synfire chains and implications for information coding. J. Comput. Neurosci. (2015).
 [61] Wang, Z., Sornborger, A. & Tao, L. Graded, dynamically routable information processing with synfire-gated synfire chains. PLoS Comput Biol 12, 6 (2016).
 [62] Wang, C., Xiao, Z., Wang, Z., Sornborger, A. T. & Tao, L. A Fokker-Planck approach to graded information propagation in pulse-gated feedforward neuronal networks. arXiv preprint arXiv:1512.00520 (2015).
 [63] Xiao, Z., Wang, B., Sornborger, A. & Tao, L. Mutual information and information gating in synfire chains. Entropy 20, 102 (2018).
 [64] Shao, Y., Sornborger, A. & Tao, L. A pulse-gated, predictive neural circuit. Proceedings of the 50th Asilomar Conference on Signals, Systems and Computers (2016).
 [65] Shao, Y., Wang, B., Sornborger, A. & Tao, L. A mechanism for synaptic copy between neural circuits. bioRxiv (2018). URL https://www.biorxiv.org/content/early/2018/06/20/351114.
 [66] LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324 (1998).
 [67] Bengio, Y., Léonard, N. & Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432 (2013).
 [68] Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R. & Bengio, Y. Binarized neural networks. In Advances in neural information processing systems, 4107–4115 (2016).
 [69] Hebb, D. The Organization of Behavior: A Neuropsychological Approach (John Wiley & Sons, 1949).
 [70] Sornborger, A. & Tao, L. Exact, dynamically routable current propagation in pulse-gated synfire chains. arXiv preprint arXiv:1410.1115 (2014).
 [71] Senn, W. & Fusi, S. Learning only when necessary: better memories of correlated patterns in networks with bounded synapses. Neural Computation 17, 2106–2138 (2005).
 [72] Chen, G. K., Kumar, R., Sumbul, H. E., Knag, P. C. & Krishnamurthy, R. K. A 4096-neuron 1M-synapse 3.8pJ/SOP spiking neural network with on-chip STDP learning and sparse weights in 10nm FinFET CMOS. IEEE Journal of Solid-State Circuits 54, 992–1002 (2018).
 [73] Lin, C.-K. et al. Programming spiking neural networks on Intel's Loihi. Computer 51, 52–61 (2018).
 [74] Göltz, J. et al. Fast and deep: energy-efficient neuromorphic learning with first-spike times. arXiv preprint arXiv:1912.11443v4 (2019).
 [75] Davies, M. et al. Advancing neuromorphic computing with Loihi: A survey of results and outlook. Proceedings of the IEEE 1–24 (2021).
 [76] Esser, S. K., Appuswamy, R., Merolla, P., Arthur, J. V. & Modha, D. S. Backpropagation for energy-efficient neuromorphic computing. Advances in neural information processing systems 28, 1117–1125 (2015).
 [77] Stromatias, E. et al. Scalable energy-efficient, low-latency implementations of trained spiking deep belief networks on SpiNNaker. In 2015 International Joint Conference on Neural Networks (IJCNN), 1–8 (IEEE, 2015).
 [78] Jin, Y., Zhang, W. & Li, P. Hybrid macro/micro level backpropagation for training deep spiking neural networks. arXiv preprint arXiv:1805.07866 (2018).
 [79] Neftci, E. O., Augustine, C., Paul, S. & Detorakis, G. Event-driven random backpropagation: Enabling neuromorphic deep learning machines. Frontiers in neuroscience 11, 324 (2017).
 [80] Shrestha, A., Fang, H., Wu, Q. & Qiu, Q. Approximating backpropagation for a biologically plausible local learning rule in spiking neural networks. In Proceedings of the International Conference on Neuromorphic Systems, 1–8 (2019).
 [81] Tavanaei, A. & Maida, A. BP-STDP: Approximating backpropagation using spike timing dependent plasticity. Neurocomputing 330, 39–47 (2019).
 [82] Mostafa, H. Supervised learning based on temporal coding in spiking neural networks. IEEE transactions on neural networks and learning systems 29, 3227–3235 (2017).
 [83] Lee, J. H., Delbruck, T. & Pfeiffer, M. Training deep spiking neural networks using backpropagation. Frontiers in neuroscience 10, 508 (2016).
 [84] O’Connor, P. & Welling, M. Deep spiking networks. arXiv preprint arXiv:1602.08323 (2016).
 [85] Diehl, P. U. & Cook, M. Unsupervised learning of digit recognition using spike-timing-dependent plasticity. Frontiers in computational neuroscience 9, 99 (2015).
 [86] Zipser, D. & Rumelhart, D. Neurobiological significance of new learning models. In Computational Neuroscience (1990).
 [87] Kolen, J. F. & Pollack, J. B. Backpropagation without weight transport. In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN’94), vol. 3, 1375–1380 (IEEE, 1994).
 [88] Stöckl, C. & Maass, W. Optimized spiking neurons can classify images with high accuracy through temporal coding with two spikes. Nature Machine Intelligence 1–9 (2021).
 [89] Baddeley, R. et al. Responses of neurons in primary and inferior temporal visual cortices to natural scenes. Proceedings of the Royal Society of London. Series B: Biological Sciences 264, 1775–1783 (1997).
 [90] Comsa, I. M. et al. Temporal coding in spiking neural networks with alpha synaptic function. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 8529–8533 (IEEE, 2020).
 [91] Rueckauer, B. & Liu, S.C. Conversion of analog to spiking neural networks using sparse temporal coding. In 2018 IEEE International Symposium on Circuits and Systems (ISCAS), 1–5 (IEEE, 2018).

 [92] Baumgartner, S. et al. Visual pattern recognition with on-chip learning: towards a fully neuromorphic approach. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS) (IEEE, 2020).
 [93] Liang, D. et al. Neural state machines for robust learning and control of neuromorphic agents. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9, 679–689 (2019).
 [94] Riehle, A., Grün, S., Diesmann, M. & Aertsen, A. Spike synchronization and rate modulation differentially involved in motor cortical function. Science 278, 1950–1953 (1997).
 [95] Abeles, M., Bergman, H., Margalit, E. & Vaadia, E. Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. Journal of neurophysiology 70, 1629–1638 (1993).
 [96] Hahnloser, R. H., Kozhevnikov, A. A. & Fee, M. S. An ultra-sparse code underlies the generation of neural sequences in a songbird. Nature 419, 65–70 (2002).
 [97] Ikegaya, Y. et al. Synfire chains and cortical songs: temporal modules of cortical activity. Science 304, 559–564 (2004).
 [98] Foster, D. J. & Wilson, M. A. Reverse replay of behavioural sequences in hippocampal place cells during the awake state. Nature 440, 680–683 (2006).
 [99] Rajan, K., Harvey, C. D. & Tank, D. W. Recurrent Network Models of Sequence Generation and Memory. Neuron 90, 128–142 (2016).
 [100] Pang, R. & Fairhall, A. L. Fast and flexible sequence induction in spiking neural networks via rapid excitability changes. Elife 8, e44324 (2019).
 [101] Malvache, A., Reichinnek, S., Villette, V., Haimerl, C. & Cossart, R. Awake hippocampal reactivations project onto orthogonal neuronal assemblies. Science 353, 1280–1283 (2016).
 [102] Luczak, A., McNaughton, B. L. & Harris, K. D. Packetbased communication in the cortex. Nature Reviews Neuroscience 16, 745–755 (2015).
 [103] Simons, T. & Lee, D.J. A review of binarized neural networks. Electronics 8, 661 (2019).
 [104] Marr, D. & Poggio, T. From understanding computation to understanding neural circuitry (1976).

 [105] He, K., Zhang, X., Ren, S. & Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE international conference on computer vision, 1026–1034 (2015).
Acknowledgements
This work was carried out at Los Alamos National Laboratory under the auspices of the National Nuclear Security Administration of the U.S. Department of Energy under Contract No. 89233218CNA000001 (SEEK: Scoping neuromorphic architecture impact enabling advanced sensing capabilities). Additional funding was provided by the LANL ASC Beyond Moore's Law program and by LDRD Reserve Project 20180737ER (Neuromorphic Implementation of the Backpropagation Algorithm). L.T. acknowledges support from the Natural Science Foundation of China grants 31771147 and 91232715, and thanks the Los Alamos National Laboratory for its hospitality. A.R. acknowledges support from the Swiss National Science Foundation (SNSF) grant Ambizione PZOOP2 168183 ELMA. F.S. carried out this work under the kind support of the Center for Nonlinear Studies fellowship and completed final preparations at the London Institute for Mathematical Sciences. The authors thank Intel Labs for providing support and access to the Loihi API and hardware.
Author Contributions
The authors contributed equally to the methodology presented here. Alpha Renner adapted and implemented the algorithm on Intel’s Loihi chip. Alpha Renner, Forrest Sheldon, and Anatoly Zlotnik formalized neuromorphic informationprocessing mechanisms, implemented the algorithm in simulation and hardware, and developed figures. All authors wrote the manuscript with Renner, Sheldon, and Zlotnik doing the bulk of the writing. Andrew Sornborger and Anatoly Zlotnik supervised the research and Sornborger and Louis Tao developed the concepts and algorithmic basis of the neuromorphic backpropagation algorithm and circuit structure.
Competing Financial Interests
The authors declare that they have no competing financial interests.
Supplementary Material
Step  | 1     | 2      | 3      | 4     | 5
Phase | ff in | ff hid | ff out | error | potentiation

Step  | 6               | 7            | 8      | 9          | 10              | 11
Phase | backpropagation | potentiation | (none) | depression | backpropagation | depression
Power, latency, and energy per sample. The two rightmost columns give the dynamic energy per sample (dynamic power x latency) and the energy-delay product (EDP = energy x latency); latency is measured for the whole system.

hardware | Static (mW) | Dynamic (mW) | Total (mW) | Latency (ms) | Dyn. energy (mJ) | EDP (mJ ms)

Loihi training ds100-300-10
  x86 cores    | 0.119 | 19.0 | 19.1 |       | 0.0139  | 0.0101
  neuron cores | 35.7  | 193  | 229  |       | 0.141   | 0.103
  whole board  | 916   | 212  | 1128 | 0.731 | 0.155   | 0.113

Loihi training 400-300-10
  x86 cores    | 0.133 | 18.9 | 19.0 |       | 0.0223  | 0.0264
  neuron cores | 40.0  | 399  | 439  |       | 0.471   | 0.557
  whole board  | 914   | 418  | 1332 | 1.18  | 0.494   | 0.583

Loihi training 400-400-10 start
  x86 cores    | 0.137 | 18.5 | 18.6 |       | 0.0275  | 0.0409
  neuron cores | 37.6  | 398  | 435  |       | 0.591   | 0.88
  whole board  | 912   | 416  | 1328 | 1.49  | 0.619   | 0.92

Loihi training 400-400-10 end
  x86 cores    | 0.15  | 18.5 | 18.7 |       | 0.0275  | 0.0408
  neuron cores | 41.1  | 399  | 440  |       | 0.592   | 0.878
  whole board  | 914   | 418  | 1332 | 1.48  | 0.62    | 0.919

Loihi 400-400-10 end (learning engine off)
  x86 cores    | 0.125 | 22.2 | 22.3 |       | 0.00822 | 0.00304
  neuron cores | 34.1  | 55.1 | 89.2 |       | 0.0204  | 0.00755
  whole board  | 954   | 77.3 | 1031 | 0.37  | 0.0286  | 0.0106

Loihi inference net 400-400-10
  x86 cores    | 0.143 | 23.4 | 23.6 |       | 0.00396 | 0.000669
  neuron cores | 39.2  | 14.7 | 53.9 |       | 0.00249 | 0.000421
  whole board  | 980   | 38.2 | 1019 | 0.169 | 0.00645 | 0.00109

GPU (Nvidia GTX 1660S) training
  batch size 1   | 15000 | 19000 | 34000 | 1.0  | 19.0 | 19.0
  batch size 10  | 15000 | 19000 | 34000 | 0.1  | 1.9  | 0.19
  batch size 100 | 15000 | 19000 | 34000 | 0.01 | 0.19 | 0.0019
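The energy and EDP figures are consistent with dynamic power times latency; a minimal sketch of the cross-check (the helper names are ours; note that mW x ms = uJ, hence the division by 1000 to obtain mJ):

```python
def dynamic_energy_mj(dynamic_power_mw: float, latency_ms: float) -> float:
    """Dynamic energy per sample in mJ: mW * ms = uJ, so divide by 1000."""
    return dynamic_power_mw * latency_ms / 1000.0

def edp_mj_ms(dynamic_power_mw: float, latency_ms: float) -> float:
    """Energy-delay product in mJ*ms."""
    return dynamic_energy_mj(dynamic_power_mw, latency_ms) * latency_ms

# Example: whole-board Loihi training of the 100-300-10 network
# (212 mW dynamic power at 0.731 ms per sample).
energy = dynamic_energy_mj(212, 0.731)
```

The same relations reproduce the GPU rows, e.g. 19000 mW at 1.0 ms per sample gives 19 mJ.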

iv.3.1 Power Measurements on GPU
The power on the GPU was measured using nvidia-smi while running the TensorFlow implementation of our algorithm. The dynamic power was calculated as the difference between the power reading taken before the start of the run and the reading taken during the run. The system ran TensorFlow version 2.3 on Windows 10 with an Intel i5-9600K processor and 16 GB of RAM. The GPU was an NVIDIA GeForce GTX 1660S (driver version: 461.92, CUDA version: 11.2).
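A measurement of this kind can be scripted; the sketch below assumes only the standard nvidia-smi query flags, while the helper names and the idle-versus-run bookkeeping are ours:

```python
import subprocess

def parse_power_w(line: str) -> float:
    """Parse one nvidia-smi power reading such as '19.3 W' (or bare '19.3')."""
    return float(line.strip().rstrip("W").strip())

def read_gpu_power_w() -> float:
    """Current board power draw in watts for GPU 0, via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader"]
    )
    return parse_power_w(out.decode().splitlines()[0])

# Dynamic power is then the reading taken during the run minus the idle
# reading taken just before the run starts:
#   idle_w = read_gpu_power_w()     # before launching training
#   run_w  = read_gpu_power_w()     # while training is underway
#   dynamic_w = run_w - idle_w
```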
iv.3.2 Detailed Description of the Algorithm
Here we go through all steps of the algorithm as shown in Fig. 2:

Step 1: MNIST images are sent to the x layer in binarized form.

Step 2: The input-to-hidden weight matrix is applied three times with different offsets (received from the gating chain) before application of the nonlinear activation function. This yields the hidden layer activity as well as the start and stop learning conditions for the update according to Eq. (9).

Step 3: The MNIST labels are sent as a one-hot vector to the target layer, and the hidden-to-output weight matrix is applied three times with different offsets. This yields the output layer activity as well as the start and stop learning conditions for the update according to Eq. (9). Furthermore, the input and hidden layer activities are stored in their relay layers for later use in the learning phases. In the box function layer, the start and stop learning conditions are combined using an excitatory and an inhibitory connection. This corresponds to the multiplication in Eq. (9), as in this case multiplication is the same as taking the difference with the inverted term.

Step 4: The positive and negative errors are calculated from the output and target layers via paired excitatory and inhibitory connections (one ordering for the positive error, the reverse for the negative error). The local gradient layers are not gated on by the gating chain but by the start learning condition, and they are additionally inhibited by the stop learning condition. This way, the Hadamard product from Eq. (5) is calculated.

Step 5: The positive local gradient is now sent to the output layer (and its two siblings) while, at the same time, the hidden layer activity from time step 2 is reactivated from its relay layer. This leads to a positive update of the hidden-to-output weights and, because a sibling layer is also activated, to updates of the corresponding weight copies according to the learning rule in Eq. (IV.2). Those copies are, however, potentiated oppositely, because their layer receives input from the opposite local gradient layer in this time step. This time step implements the first term of Eq. (IV.2.4).

Step 6: The actual backpropagation happens in this time step. By way of the transposed weight matrices, the hidden local gradient layer receives activity from the output layer and from the layers that contain the positive and negative local gradients. It is gated by the box function layer, which realizes the Hadamard product from Eq. (6). This gating includes an offset so that even the smallest possible nonzero input sum leads to a spike in the neurons of the layer.

Step 7: This time step realizes the potentiation of the input-to-hidden weights. The hidden layer and its two siblings receive the local gradient that was propagated back in the previous time step, and the input layer restores its previous activity from its relay layer. Because the third factor is switched on in this time step, potentiation happens according to Eq. (IV.2). This time step implements the first term of Eq. (IV.2.4).

Step 8: To avoid interference of the input to the output layer from the previously active hidden layer (which is still connected to the layers downstream), no layer is gated on in this time step.

Steps 9 to 11: These are the same as steps 5 to 7, but with the opposite local gradient as input and without the third factor active, and therefore with the opposite sign of learning (depression).
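The eleven-step schedule can be summarized, and the gating chain that drives it sketched, as follows. The phase labels follow the text, and the one-hot shift model is our simplification of the actual spiking chain:

```python
import numpy as np

# Phase of each of the 11 gated time steps; the gap at step 8 is the
# deliberate pause in which no layer is gated on.
PHASES = ["ff in", "ff hid", "ff out", "error",
          "potentiation", "backpropagation", "potentiation",
          "pause",
          "depression", "backpropagation", "depression"]

# A gating chain can be modeled as a one-hot pulse advanced by a shift
# matrix: gating neuron t excites gating neuron t+1 on the next time step.
n = len(PHASES)
shift = np.eye(n, k=-1)          # subdiagonal: t -> t+1
shift[0, -1] = 1.0               # wrap around to start the next example
gate = np.zeros(n)
gate[0] = 1.0                    # pulse sits at time step 1 ("ff in")

for _ in range(3):               # advance three time steps
    gate = shift @ gate
assert PHASES[int(gate.argmax())] == "error"
```

Note that steps 9-11 mirror steps 5-7 with depression in place of potentiation, as the listing makes explicit.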
iv.3.3 Synapse Types
There are five types of synapses (see the different colors in Fig. 2 and Supplementary Fig. 4):

1. Plastic all-to-all synapses (red) that change their weights according to a learning rule (Eq. (IV.2)) and can be either positive or negative.

2. Excitatory one-to-one synapses (black) with a fixed weight that copy the firing pattern of the source layer to the target layer.

3. Inhibitory one-to-one synapses (blue) with a fixed weight that always inhibit the target neurons.

4. Excitatory one-to-all gating synapses (green) from a single gating chain neuron to all neurons of a layer, with a fixed weight. The usual gating synapses are used to gate 'on' a layer that is supposed to apply the feedforward activation function. Different weights are used to gate 'on' the start and stop learning conditions to obtain the two step functions from which the box function is built.

5. Excitatory one-to-one gating synapses (yellow) from the start learning layer and the box function layer to the respective local gradient layers. One weight is used for gating of the local gradient layers in time step 4, and another is used for gating in time steps 6 and 10.

Within the precision limits of the chip, the scaling of the weights is to some extent arbitrary. All plastic synapses and most other synapses have a synaptic delay of 0 time steps. Some synapses, specifically those originating from the relay layers and a few other layers, have delays of up to 6 time steps, i.e., their output spikes affect the target neuron several time steps later. The delays, corresponding to the horizontal arrow lengths, can be read from Fig. 2.
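The multiplicative effect of the one-to-all gating synapses can be illustrated with a toy threshold model. The weights and threshold below are illustrative, not the chip's values:

```python
import numpy as np

theta = 1.5                      # spiking threshold (illustrative)
w_copy, w_gate = 1.0, 1.0        # one-to-one copy and one-to-all gating weights

def gated_copy(src_spikes, gate_on):
    """A layer driven by one-to-one copy synapses plus a one-to-all gating
    synapse. A neuron crosses threshold only when it receives BOTH the
    copied spike and the gating pulse, so the gate acts multiplicatively."""
    drive = w_copy * np.asarray(src_spikes) + (w_gate if gate_on else 0.0)
    return (drive > theta).astype(int)

x = np.array([1, 0, 1, 1])
assert gated_copy(x, gate_on=False).sum() == 0          # gated off: silent
assert np.array_equal(gated_copy(x, gate_on=True), x)   # gated on: copy of x
```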
iv.3.4 Downsampled MNIST
Table III above refers to network structures with input sizes of 400 and 100. The 400-400-10 network used a set of trimmed images in which whitespace was removed to reduce the required size of the input layer; for this, a border of width 4 pixels was removed from the MNIST images, giving 20x20 images. For the smaller 100-300-10 network, these images were then downsampled by averaging over 2x2 pixel blocks and binarized by thresholding at 0.5. This gave a set of 10x10 binary images.
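The preprocessing can be sketched in a few lines of NumPy; the 2x2 block size is inferred from the 400-to-100 reduction in input size:

```python
import numpy as np

def preprocess_mnist(img28):
    """Trim the width-4 border (28x28 -> 20x20), average over 2x2 blocks
    (20x20 -> 10x10), and binarize by thresholding at 0.5."""
    trimmed = np.asarray(img28, dtype=float)[4:-4, 4:-4]
    pooled = trimmed.reshape(10, 2, 10, 2).mean(axis=(1, 3))
    return (pooled >= 0.5).astype(int)

binary = preprocess_mnist(np.ones((28, 28)))
assert binary.shape == (10, 10) and binary.sum() == 100
```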
iv.3.5 Tensorflow Implementation
For the feedforward network, our GPU TensorFlow implementation is similar to the binary network presented in [68], but with continuous weights, and simplified to correspond with the neuromorphic design. The loss function was changed to mean squared error and minimized with stochastic gradient descent at a fixed learning rate. Dropout and batch normalization were removed, and the batch size was set to 1.
The network was initialized with Glorot parameters $r_\ell = \sqrt{6/(n_\ell + n_{\ell+1})}$ for each layer $\ell$, where $n_\ell$ is the number of neurons in layer $\ell$. For each layer, weights were generated uniformly on the interval $[-r_\ell, r_\ell]$ and clipped to these values during learning. While these initializations do not take into account the binary activation functions of our network, tests with alternative initializations showed little or no improvement, likely due to the shallow depth of the networks considered.
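A minimal sketch of this initialization, assuming the standard Glorot-uniform bound r = sqrt(6 / (n_in + n_out)):

```python
import numpy as np

def glorot_init(n_in, n_out, rng):
    """Glorot-uniform initialization: draw weights from U[-r, r] with
    r = sqrt(6 / (n_in + n_out)); the same r is reused for clipping."""
    r = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-r, r, size=(n_out, n_in)), r

rng = np.random.default_rng(0)
W, r = glorot_init(400, 300, rng)
W = np.clip(W - 0.1, -r, r)   # e.g. after a gradient step, re-clip to [-r, r]
assert np.all(np.abs(W) <= r)
```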
Gradients on the binary network in [68] were calculated with a 'straight-through' estimator, in which the derivative of the binary activation function
$$a(x) = \begin{cases} 1, & x > 0, \\ 0, & x \le 0, \end{cases} \qquad (29)$$
was approximated as
$$\frac{\partial a}{\partial x} \approx \mathbf{1}_{|x| \le 1}. \qquad (30)$$
This same gradient calculation is maintained, but to mimic the binary gradient signal in the neuromorphic implementation, the final gradient with respect to each weight is thresholded: if the gradient calculated with respect to a weight is $g$, the applied gradient in the learning step is $\operatorname{sign}(g)\,\mathbf{1}_{|g| > \vartheta}$, for a threshold $\vartheta$ representing the threshold of signal propagation in the neuromorphic implementation and $\mathbf{1}$ the indicator function as defined above.
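This gradient handling can be sketched as follows, assuming a 0/1 threshold activation; the threshold value used here is illustrative:

```python
import numpy as np

def binary_act(x):
    """Forward pass: binary threshold activation."""
    return (np.asarray(x) > 0).astype(float)

def ste_grad(x):
    """Straight-through surrogate derivative: 1 where |x| <= 1, else 0."""
    return (np.abs(np.asarray(x)) <= 1).astype(float)

def applied_gradient(g, vartheta=0.05):
    """Binarized gradient actually applied in the learning step:
    sign(g) where |g| exceeds the propagation threshold, zero elsewhere."""
    g = np.asarray(g)
    return np.sign(g) * (np.abs(g) > vartheta)

g = np.array([0.2, -0.01, -0.4, 0.03])
assert np.array_equal(applied_gradient(g), [1.0, 0.0, -1.0, 0.0])
```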