1 Introduction
In recent years the success of Deep Learning has shown that many problems in machine learning can be successfully attacked by applying backpropagation to learn multiple layers of representation. Most of the recent breakthroughs have been achieved through purely supervised learning.
In the standard application of a deep network to a supervised learning task, we feed some input vector through multiple hidden layers to produce a prediction, which is in turn compared to some target value to find a scalar cost. Parameters of the network are then updated in proportion to their derivatives with respect to that cost. This approach requires that all modules within the network be differentiable. If they are not, no gradient can flow through them, and backpropagation will not work.
An alternative class of artificial neural networks is Spiking Neural Networks. These networks, inspired by biology, consist of neurons that have some persistent “potential”, which we refer to as $\phi$, and that alter each other’s potentials by sending “spikes” to one another. When unit $i$ sends a spike, it increments the potential of each downstream unit $j$ in proportion to the synaptic weight $W_{ij}$ connecting the units. If this increment brings unit $j$’s potential past some threshold, unit $j$ sends a spike to its downstream units, triggering the same computation in the next layer. Such systems therefore have the interesting property that the amount of computation done depends on the contents of the data, since a neuron may be tuned to produce more spikes in response to some pattern of inputs than another.
In our flavour of spiking networks, a single forward pass is decomposed into a series of small computations that provide successively closer approximations to the true output. This is a useful feature for real-time, low-latency applications, as in robotics, where we may want to act on data quickly, before it is fully processed. If an input spike, on average, causes one spike in each downstream layer of the network, the average number of additions required per input spike will be $\mathcal{O}(\sum_{l=1}^{L} N_l)$, where $N_l$ is the number of units in layer $l$. Compare this to a standard network, where the basic messaging entity is a vector. When a vector arrives at the input, a full forward pass will require $\mathcal{O}(\sum_{l=1}^{L} N_{l-1} N_l)$ multiply-adds, and will yield no “preview” of the network output.
Spiking networks are well-adapted to handle data from event-based sensors, such as the Dynamic Vision Sensor (a.k.a. Silicon Retina, a vision sensor) Lichtsteiner et al. (2008) and the Silicon Cochlea (an audio sensor) Chan et al. (2007). Instead of sending out samples at a regular rate, as most sensors do, these sensors asynchronously output events when there is a change in the input. They can thus react with very low latency to sensory events, and produce very sparse data. These events could be fed directly into our spiking network (whereas they would have to be binned over time and turned into a vector to be used with a conventional deep network).
In this paper, we formulate a deep spiking network whose function is equivalent to a deep network of Rectified Linear (ReLU) units. We then introduce a spiking version of backpropagation to train this network. Compared to a traditional deep network, our Deep Spiking Network has the following advantageous properties:

Early Guessing. Our network can make an “early guess” about the class associated with a stream of input events, before all the data has been presented to the network.

No multiplications. Our training procedure consists only of addition, comparison, and indexing operations, which potentially makes it very amenable to efficient hardware implementation.

Data-dependent computation. The amount of computation that our network does is a function of the data, rather than the network size. This is especially useful given that our network tends to learn sparse representations.
The remainder of this paper is structured as follows: In Section 2 we discuss past work in combining spiking neural networks and deep learning. In Section 3 we describe a Spiking Multi-Layer Perceptron. In Section 4 we show experimental results demonstrating that our network behaves similarly to a conventional deep network in a classification setting. In Section 5 we discuss the implications of this research and our next steps.
2 Related Work
There has been little work on combining the fields of Deep Learning and Spiking Neural Networks. The main reason for this is that there is not an obvious way to backpropagate an error signal through a spiking network, since its output is a stream of discrete events, rather than a smoothly differentiable function of the input. Bohte et al. (2000) propose a spiking deep learning algorithm, but it involves simulating a dynamical system, is specific to learning temporal spike patterns, and has not yet been applied at any scale. Buesing et al. (2011) show how a somewhat biologically plausible spiking network can be interpreted as an MCMC sampler of a high-dimensional probability distribution.
Diehl et al. perform classification on MNIST with a deep event-based network, but training is done with a regular deep network which is then converted to the spiking domain. A similar approach was used by Hunsberger and Eliasmith (2015): they came up with a continuous unit which smoothly approximated the firing rate of a spiking neuron, did backpropagation on that, and then transferred the learned parameters to a spiking network. Neftci et al. (2013) came up with an event-based version of the contrastive-divergence algorithm, which can be used to train a Restricted Boltzmann Machine, but it was never applied in a Deep Belief Net to learn multiple layers of representation.
O’Connor et al. (2013) did create an event-based spiking Deep Belief Net and fed it inputs from event-based sensors, but the network was trained offline in a vector-based system before being converted to run as a spiking network.
Spiking isn’t the only form of discretization. Courbariaux et al. (2015) achieved impressive results by devising a scheme for sending back an approximate error gradient in a deep neural network using only low-precision (discrete) values, and additionally found that the discretization served as a good regularizer. Our approach (and spiking approaches in general) differs from this in that it sequentially computes the inputs over time, so that it is not necessary to have finished processing all the information in a given input to make a prediction.
3 Methods
In Sections 3.1 to 3.3 we describe the components used in our model. In Section 3.5 we will use these components to put together a Spiking Multi-Layer Perceptron.
3.1 Spiking Vector Quantization
The neurons in the input layer of our network use an algorithm that we refer to as Spiking Vector Quantization (Algorithm 1) to generate “signed spikes”, that is, spikes with an associated positive or negative value. Given a real vector $\vec v$, representing the input to an array of neurons, and some number of time steps $T$, the algorithm generates a series of signed spikes $\langle (i_n, s_n) : n \in [1..N] \rangle$, where $N$ is the total number of spikes generated from running for $T$ steps, $i_n$ is the index of the neuron from which the $n$’th spike fires (note that zero or more spikes can fire from a neuron within one time step), and $s_n \in \{-1, +1\}$ is the sign of the $n$’th spike.
In Algorithm 1, we maintain an internal vector of “neuron potentials” $\vec\phi$. Every time we emit a spike from neuron $i$ we subtract its sign $s$ from the potential $\phi_i$, and we keep emitting spikes until $\vec\phi$ is in the interval bounded by $(-\frac{1}{2}, \frac{1}{2})$. We can show that as we run the algorithm for a longer time (as $T \to \infty$), we observe the following limit:
$$\lim_{T \to \infty} \frac{1}{T} \sum_{n=1}^{N} s_n \vec e_{i_n} = \vec v \tag{1}$$
where $\vec e_{i_n}$ is a one-hot encoded vector with index $i_n$ set to 1. The proof is in the supplementary material.
Our algorithm is simply doing a discrete-time, bidirectional version of Delta-Sigma modulation, in which we encode floating-point elements of our vector $\vec v$ as a stream of signed events. We can see this as doing a sort of “deterministic sampling” or “herding” Welling (2009) of the vector $\vec v$. Figure 1 shows how the cumulative vector from our stream of events approaches the true value of $\vec v$ at a rate of $1/T$. We can compare this to another approach in which we stochastically sample spikes from the vector $\vec v$ with probabilities proportional to the magnitude of elements of $\vec v$ (see the “Stochastic Sampling” section of the supplementary material), which has a convergence of $1/\sqrt{T}$.
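As a rough illustration of the procedure described in this section, the following Python sketch quantizes a vector into signed spikes and decodes it again. It is not Algorithm 1 verbatim: the function names, the threshold of 1/2, and the choice to fire the unit with the largest absolute potential first are assumptions made for the example.

```python
def spiking_vector_quantization(v, n_steps, threshold=0.5):
    """Encode the real vector v as a list of signed spikes (index, sign).

    Assumed rule: on every time step the potentials phi are incremented by v,
    and units fire (+1 or -1) until every |phi_i| is at most `threshold`.
    """
    phi = [0.0] * len(v)          # persistent neuron potentials
    spikes = []
    for _ in range(n_steps):
        for i, v_i in enumerate(v):
            phi[i] += v_i
        while phi:
            i = max(range(len(phi)), key=lambda k: abs(phi[k]))
            if abs(phi[i]) <= threshold:
                break
            sign = 1 if phi[i] > 0 else -1
            phi[i] -= sign        # firing removes one unit of potential
            spikes.append((i, sign))
    return spikes


def decode(spikes, n_units, n_steps):
    """Sum the signed one-hot spikes and divide by the number of time steps."""
    totals = [0] * n_units
    for i, sign in spikes:
        totals[i] += sign
    return [c / n_steps for c in totals]


v = [0.3, -1.7, 0.0]
for n_steps in (10, 100, 1000):
    estimate = decode(spiking_vector_quantization(v, n_steps), len(v), n_steps)
    worst = max(abs(a - b) for a, b in zip(estimate, v))
    print(n_steps, worst)         # the error is bounded by 0.5 / n_steps
```

Because every potential ends each time step within the threshold, the decoded estimate in this sketch can differ from the input by at most half a spike per unit, which is where the 1/T behaviour comes from.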
3.2 Spiking Stream Quantization
A small modification to the above method allows us to turn a stream of vectors into a stream of signed spikes.
If instead of a fixed vector $\vec v$ we take a stream of vectors $\langle \vec v_t : t \in [1..T] \rangle$, we can modify the quantization algorithm to increment $\vec\phi$ by $\vec v_t$ on time step $t$. This modifies Equation 8 to:
$$\lim_{T \to \infty} \frac{1}{T} \sum_{n=1}^{N} s_n \vec e_{i_n} = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \vec v_t \tag{2}$$
So we end up approximating the running mean of $\vec v_t$. See “Spiking Stream Quantization” in the supplementary material for the full algorithm and explanation. When we apply this to implement a neural network in Section 3.5, this stream of vectors will be the rows of the weight matrix indexed by the incoming spikes.
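The stream variant can be sketched in the same style. The generator below is an illustrative assumption rather than the algorithm from the supplementary material: it adds each incoming vector to the potentials and yields the signed spikes fired on that step.

```python
def spiking_stream_quantization(vector_stream, threshold=0.5):
    """Yield, for each incoming vector, the list of (index, sign) spikes fired."""
    phi = None
    for v_t in vector_stream:
        if phi is None:
            phi = [0.0] * len(v_t)        # potentials persist across steps
        for i, x in enumerate(v_t):
            phi[i] += x
        fired = []
        while phi:
            i = max(range(len(phi)), key=lambda k: abs(phi[k]))
            if abs(phi[i]) <= threshold:
                break
            sign = 1 if phi[i] > 0 else -1
            phi[i] -= sign
            fired.append((i, sign))
        yield fired


stream = [[0.2, 1.0], [0.9, -0.4], [0.1, 0.3], [-0.6, 0.5]]
totals = [0, 0]
for fired in spiking_stream_quantization(stream):
    for i, sign in fired:
        totals[i] += sign
running_mean = [sum(col) / len(stream) for col in zip(*stream)]
spike_mean = [c / len(stream) for c in totals]
print(running_mean, spike_mean)   # spike_mean tracks running_mean within 0.5 / T
```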
3.3 Rectifying Spiking Stream Quantization
We can add a slight tweak to our Spiking Stream Quantization algorithm to create a spiking version of a rectified-linear (ReLU) unit. To do this, we only fire events on positive threshold crossings, resulting in Algorithm 7.
We can show that if we draw spikes from a stream of vectors $\vec v_t$ in the manner described in Algorithm 7, and sum up our spikes, we approach the behaviour of a ReLU layer:
$$\lim_{T \to \infty} \frac{1}{T} \sum_{n=1}^{N} \vec e_{i_n} = \max\left(0, \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \vec v_t\right) \tag{3}$$
See “Rectified Stream Quantization” in the supplementary material for a more detailed explanation.
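A minimal sketch of the rectifying variant follows, under the assumption that each unit fires whenever its own potential exceeds 1/2 and that negative potential is simply left to accumulate. The per-unit firing loop and all names are illustrative and may order spikes differently from Algorithm 7, although the spike counts are the same.

```python
def rectified_stream_quantization(vector_stream, n_units, threshold=0.5):
    """Count positive spikes per unit; negative potential is never discharged."""
    phi = [0.0] * n_units
    counts = [0] * n_units
    for v_t in vector_stream:
        for i in range(n_units):
            phi[i] += v_t[i]
            while phi[i] > threshold:   # only positive crossings fire
                phi[i] -= 1.0
                counts[i] += 1
    return counts


def relu(x):
    return x if x > 0 else 0.0


n_steps = 500
means = (0.3, -0.4, 1.2)
stream = [list(means)] * n_steps        # constant stream, so the mean is known
counts = rectified_stream_quantization(stream, n_units=3)
rates = [c / n_steps for c in counts]
print(rates, [relu(m) for m in means])  # spike rates approach ReLU of the mean
```

A constant stream is used here only so that the target value is known exactly; for noisy streams the agreement is asymptotic.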
3.4 Incremental Dot-Product
Thus far, we’ve shown that our quantization method transforms a vector into a stream of events. Here we will show that this can be used to incrementally approximate the dot product of a vector and a matrix. Suppose we define a vector $\vec u = \vec v \cdot W$, where $W$ is a matrix of parameters. Given a vector $\vec v$, and using Equation 8, we see that we can approximate the dot product with a sequence of additions:
$$\vec u = \vec v \cdot W \approx \left(\frac{1}{T} \sum_{n=1}^{N} s_n \vec e_{i_n}\right) \cdot W = \frac{1}{T} \sum_{n=1}^{N} s_n W_{i_n, \cdot} \tag{4}$$
where $W_{i, \cdot}$ is the $i$’th row of matrix $W$.
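To make the row-addition idea concrete, here is a small sketch that consumes a list of signed spikes and accumulates rows of the weight matrix. The function name, the example matrix, and the hand-written spike list (one valid signed-spike encoding of the vector [0.5, -0.25] over 4 time steps) are assumptions for illustration.

```python
def incremental_dot(spikes, weights, n_steps):
    """Approximate v . W from signed spikes using only row additions."""
    n_out = len(weights[0])
    acc = [0.0] * n_out
    for i, sign in spikes:
        row = weights[i]
        for j in range(n_out):
            # add or subtract the i'th row; the input is never multiplied
            acc[j] = acc[j] + row[j] if sign > 0 else acc[j] - row[j]
    # a single rescaling by the number of time steps at the end
    return [a / n_steps for a in acc]


weights = [[1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0]]
spikes = [(0, 1), (1, -1), (0, 1)]      # encodes v = [0.5, -0.25] with T = 4
print(incremental_dot(spikes, weights, n_steps=4))   # [-0.5, -0.25, 0.0]
```

In this example the result coincides with the exact product because the spike counts divided by 4 reproduce the vector exactly; in general the two agree only up to the quantization error.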
3.5 Forward Pass of a Neural Network
Using the parts we’ve described so far, Algorithm 8 describes the forward pass of a neural network. The InputLayer procedure demonstrates how Spiking Vector Quantization, shown in Algorithm 1, transforms the input vector into a stream of events. The HiddenLayer procedure shows how we can combine the Incremental Dot-Product (Equation 4) and Rectifying Spiking Stream Quantization (Equation 3) to approximate a fully-connected ReLU layer of a neural network. The figure in the “MLP Convergence” section of the supplementary material shows that our spiking network, if run for a long time, converges to the function of the ReLU network.
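The composition of these parts can be sketched for a network with one hidden layer. This is a simplified illustration under stated assumptions, not Algorithm 8: hidden units are allowed to fire immediately after each input spike, the output layer only accumulates, and the weights and input are made-up values.

```python
def spiking_forward(x, w1, w2, n_steps, threshold=0.5):
    """Run a 1-hidden-layer spiking MLP on input x; return the decoded output."""
    n_in, n_hid, n_out = len(x), len(w1[0]), len(w2[0])
    phi_in = [0.0] * n_in      # input quantizer potentials
    phi_hid = [0.0] * n_hid    # hidden (rectifying) quantizer potentials
    out = [0.0] * n_out        # running sum of output-layer increments
    for _ in range(n_steps):
        for i in range(n_in):
            phi_in[i] += x[i]
        while True:
            i = max(range(n_in), key=lambda k: abs(phi_in[k]))
            if abs(phi_in[i]) <= threshold:
                break
            sign = 1 if phi_in[i] > 0 else -1
            phi_in[i] -= sign
            # incremental dot product: add or subtract row i of w1
            for j in range(n_hid):
                phi_hid[j] += w1[i][j] if sign > 0 else -w1[i][j]
            # rectifying quantizer: only positive threshold crossings fire
            for j in range(n_hid):
                while phi_hid[j] > threshold:
                    phi_hid[j] -= 1.0
                    for k in range(n_out):
                        out[k] += w2[j][k]
    return [o / n_steps for o in out]


def relu_forward(x, w1, w2):
    """Reference (non-spiking) forward pass with a ReLU hidden layer."""
    hidden = [max(0.0, sum(x[i] * w1[i][j] for i in range(len(x))))
              for j in range(len(w1[0]))]
    return [sum(hidden[j] * w2[j][k] for j in range(len(hidden)))
            for k in range(len(w2[0]))]


x = [0.7, 0.4]
w1 = [[0.6, -0.4], [0.2, 0.5]]
w2 = [[1.0], [-2.0]]
print(relu_forward(x, w1, w2))
for n_steps in (10, 100, 1000):
    print(n_steps, spiking_forward(x, w1, w2, n_steps))
```

Each intermediate value of the running output is a usable preview of the final answer, which is the property the early-guessing experiments rely on.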
3.6 Backward Pass
In the backward pass we propagate error spikes backwards, in the same manner as we propagated the signal forwards, so that the error spikes approach the true gradients of the ReLU network as $T \to \infty$. Pseudocode explaining the procedure is provided in the “Training Iteration” section of the supplementary material, and a diagram explaining the flow of signals is in the “Network Diagram” section.
A ReLU unit has the function and derivative:
$$f(x) = [x > 0] \cdot x, \qquad f'(x) = [x > 0] \tag{5}$$
where $[\cdot]$ denotes a step function (1 if $x > 0$, otherwise 0).
In the spiking domain, we express this simply by blocking error spikes on units for which the cumulative sum of inputs into that unit is below 0 (see the "filter" modules in the “Network Diagram” section of the supplementary material).
The signed spikes that represent the backpropagating error gradient at a given layer are used to index columns of that layer’s weight matrix, and negate them if the sign of the spike is negative. The resulting vector is then quantized, and the resulting spikes are sent back to previous layers.
One problem with the scheme described so far is that, when errors are small, it is possible that the error-quantizing neurons never accumulate enough potential to send a spike before the training iteration is over. If this is the case, we will never be able to learn when error gradients are sufficiently small. Indeed, when initial weights are too low, and therefore the initial magnitude of the backpropagated error signal is too small, the network does not learn at all. This is not a problem in traditional deep networks, because no matter how small the magnitude, some error signal will always get through (unless all hidden units are in their inactive regime) and the network will learn to increase the size of its weights. We found that a surprisingly effective solution to this problem is to simply not reset the $\vec\phi$ of our error quantizers between training iterations. This way, after some burn-in period, the quantizer’s $\vec\phi$ starts each new training iteration at some random point in the interval $[-\frac{1}{2}, \frac{1}{2}]$, and the unit always has a chance to spike.
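The effect of not resetting the error quantizers can be seen in a toy example. The sketch below follows a single error-quantizing unit that receives a small, constant error on every time step; the error value, the iteration counts, and the function name are hypothetical.

```python
def run_iterations(error, n_iterations, n_steps, reset, threshold=0.5):
    """Count error spikes from one quantizer unit across training iterations.

    `error` is a constant per-step error signal; `reset` selects whether the
    potential is zeroed at the start of each training iteration.
    """
    phi = 0.0
    n_spikes = 0
    for _ in range(n_iterations):
        if reset:
            phi = 0.0
        for _ in range(n_steps):
            phi += error
            while abs(phi) > threshold:
                sign = 1 if phi > 0 else -1
                phi -= sign
                n_spikes += 1
    return n_spikes


# With a reset, the potential only ever reaches about 0.2, so nothing fires.
print(run_iterations(0.02, n_iterations=100, n_steps=10, reset=True))
# Without a reset, the potential carries over and spikes are eventually sent.
print(run_iterations(0.02, n_iterations=100, n_steps=10, reset=False))
```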
A further issue that comes up when designing the backward pass is the order in which we process events. Since an event can move a ReLU unit out of its active range, which blocks the transmission of itself or future events on the backward pass, we need to think about the order in which we process these events. The topic of event routing is explained in the “Event Routing” section of the supplementary material.
3.7 Weight Updates
We can collect spike statistics and generate weight updates. There are two methods by which we can update the weights. These are as follows:
Stochastic Gradient Descent
The most obvious method of training is to approximate stochastic gradient descent. In this case, we accumulate two spike-count vectors, $\vec c_{in}$ and $\vec c_{error}$, and take their outer product at the end of a training iteration to compute the weight update:
$$\Delta W = -\frac{\eta}{T^2} \, \vec c_{in} \otimes \vec c_{error} \tag{6}$$
Fractional Stochastic Gradient Descent (FSGD)
We can also try something new. Our spiking network introduces a new feature: if a data point is decomposed as a stream of events, we can do parameter updates even before a single data point has been fully observed. If we do updates whenever an error event comes back, we update each weight based on only the input data that has been seen so far. This is described by the rule:
$$\Delta W_{\cdot, i} = -\frac{\eta}{T^2} \, s \, \vec c_{in} \tag{7}$$
where $\vec c_{in}$ is an integer vector of counted input spikes, $\Delta W_{\cdot, i}$ is the change to the $i$’th column of the weight matrix, $s \in \{-1, +1\}$ is the sign of the error event, $i$ is the index of the unit that produced that error event, $\eta$ is the learning rate, and $T$ is the number of time steps per training iteration. Early input events will contribute to more weight updates than those seen near the end of a training iteration. Experimentally (see Section 4.2), we see that this works quite well. It may be that the additional influence given to early inputs causes the network to learn to make better predictions earlier on, compensating for the approximation caused by the finite runtime of the network.
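The difference between the two update rules can be illustrated on a short, made-up event sequence. In the sketch below, the event encoding, the function names, and the single `scale` factor (standing in for whatever learning-rate and time-step normalization is used) are assumptions; the point is only the timing of the updates.

```python
def sgd_update(events, n_in, n_out, scale):
    """Outer product of final input and error spike counts (end of iteration)."""
    c_in = [0] * n_in
    c_err = [0] * n_out
    for kind, index, sign in events:
        if kind == "in":
            c_in[index] += sign
        else:
            c_err[index] += sign
    return [[-scale * c_in[i] * c_err[j] for j in range(n_out)]
            for i in range(n_in)]


def fsgd_update(events, n_in, n_out, scale):
    """On each error event, update one column using the inputs seen so far."""
    c_in = [0] * n_in
    delta = [[0.0] * n_out for _ in range(n_in)]
    for kind, index, sign in events:
        if kind == "in":
            c_in[index] += sign
        else:
            for i in range(n_in):
                # sign is +1 or -1, so this adds or subtracts the count vector
                delta[i][index] -= scale * sign * c_in[i]
    return delta


# Time-ordered events: ("in" | "err", unit index, sign)
events = [("in", 0, 1), ("err", 1, 1), ("in", 1, 1), ("err", 1, -1), ("err", 0, 1)]
print(sgd_update(events, n_in=2, n_out=2, scale=1.0))
print(fsgd_update(events, n_in=2, n_out=2, scale=1.0))
```

In this example the two error events on output unit 1 cancel under the end-of-iteration rule, but under the fractional rule they leave a net update on the weight from input 1, because that input spike arrived between them.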
3.8 Training
We chose to train the network with one sample at a time, although in principle it is possible to do minibatch training. We select a number of time steps $T$ to run the network for each iteration of training. At the beginning of a round of training, we reset the state of the forward neurons (all $\vec\phi$’s and the state of the running-sum modules), and leave the state of the error-quantizing neurons untouched (as described in Section 3.6). On each time step $t$, we feed the input vector to the input quantizer, and propagate the resulting spikes through the network. We then propagate an error spike back from the unit corresponding to the correct class label, and update the parameters by one of the two methods described in Section 3.7. See the “Network Diagram” section of the supplementary material to get an idea of the layout of all the modules.
4 Experiments
4.1 Simple Regression
We first test our network as a simple regressor (with no hidden layers) on a binarized version of the newsgroups-20 dataset, where we do a 2-way classification between the electronics and medical newsgroups based on word-count vectors. We split the dataset with a 7:1 training-test ratio (as in Crammer et al. (2009)) but do not do cross-validation. Table 1 shows that the spiking regressor reaches the same test error as its non-spiking counterpart.

Table 1: Scores on 20 newsgroups, 2-way classification between the ’med’ and ’electronic’ newsgroups.

| Network | Test error (%) | Training error (%) |
|---|---|---|
| 1 Layer NN | 2.278 | 0.127 |
| Spiking Regressor | 2.278 | 0.82 |
| SVM | 4.82 | 0 |

We see that, somewhat surprisingly, our approach outperforms the SVM. This is probably because, being trained through SGD and tested at the end of each epoch, our classifier had more recently learned on samples at the end of the training set, which are closer in distribution to the test set than those at the beginning.
4.2 Comparison to ReLU Network on MNIST
We ran both the spiking network and the equivalent ReLU network on MNIST, using an architecture with 2 fully-connected hidden layers, each consisting of 300 units. Refer to the “Hyperparameters” section of the supplementary material for a full description of hyperparameters.
Table 2: Results on MNIST after 50 epochs of training.

| Network | Test error (%) | Training error (%) |
|---|---|---|
| Spiking SGD | 3.6 | 2.484 |
| Spiking FSGD | 2.07 | 0.37 |
| Vector ReLU MLP | 1.63 | 0.426 |
| Spiking with ReLU Weights | 1.66 | 0.426 |
| ReLU with Spiking FSGD weights | 2.03 | 0.34 |

Table 2 shows the results of our experiment, after 50 epochs of training. We find that the conventional ReLU network outperforms our spiking network, but only marginally. In order to determine how much of that difference was due to the fact that the Spiking network has a discrete forward pass, we mapped the learned parameters from the ReLU network onto the Spiking network (“Spiking with ReLU Weights”), and used the Spiking network for classification. The performance of the spiking network improved nearly to that of the ReLU network, indicating that the difference was due mainly to the parameters learned in training rather than to the discretization of the forward pass. We also did the inverse (“ReLU with Spiking FSGD weights”), mapping the parameters of the trained Spiking Net onto the ReLU net, and found that the performance became very similar to that of the original Spiking (FSGD-trained) Network. This tells us that most of the difference in score is due to the approximations in training, rather than the forward pass.
Interestingly, our Spiking-FSGD approach outperforms the Spiking-SGD: it seems that by putting more emphasis on early events, we compensate for the finite runtime of the Spiking Network. Figure 2 shows the learning curves over the first 20 epochs of training. We see that the gap between training and test performance is much smaller in our Spiking network than in the ReLU network, and speculate that this may be due to the regularization effect of the spiking. To confirm this, we would have to show that on a larger network, our regularization actually helps to prevent overfitting.
4.3 Early Guessing
We evaluated the “early guess” hypothesis from Section 1 using MNIST. The hypothesis was that our spiking network should be able to make computationally cheap “early guesses” about the class of the input, before actually seeing all the data. A related hypothesis was that under the “Fractional” update scheme discussed in Section 3.7, our networks should learn to make early guesses more effectively than networks trained under regular Stochastic Gradient Descent, because early input events contribute to more weight updates than later ones. Figure 3 shows the results of this experiment. We find, unfortunately, that our first hypothesis does not hold. The early guesses we get with the spiking network cost more than a single (sparse) forward pass of the input vector would. The second hypothesis, however, is supported by the right side of Figure 5. Our networks trained with Fractional Stochastic Gradient Descent make better early guesses than those trained on regular SGD.
5 Discussion
We implemented a Spiking Multi-Layer Perceptron and showed that our network behaves very similarly to a conventional MLP with rectified-linear units. However, our model has some advantages over a regular MLP, most of which have yet to be explored in full. Our network needs neither multiplication nor floating-point numbers to work. If we use Fractional Stochastic Gradient Descent, and scale all parameters in the network (initial weights, thresholds, and the learning rate) by the inverse of the learning rate, the only operations used are integer addition, indexing, and comparison. This makes our system very amenable to efficient hardware implementation.
The Spiking MLP brings us one step closer to making a connection between the types of neural networks we observe in biology and the type we use in deep learning. Like biological neurons, our units maintain an internal potential, and only communicate when this potential crosses some firing threshold. We believe that the main value of this approach is that it is a stepping stone towards a new type of deep learning. The way that deep learning is done now takes no advantage of the huge temporal redundancy in natural data. In the future we would like to adapt the methods developed here to work with non-stationary data. Such a network could pass spikes to keep the output distribution in “sync” with an ever-changing input distribution. This property, efficiently keeping online track of latent variables in the environment, could bring deep learning into the world of robotics.
References
Bohte et al. [2000] Sander M Bohte, Joost N Kok, and Johannes A La Poutré. SpikeProp: backpropagation for networks of spiking neurons. In ESANN, pages 419–424, 2000.
Buesing et al. [2011] Lars Buesing, Johannes Bill, Bernhard Nessler, and Wolfgang Maass. Neural dynamics as sampling: a model for stochastic computation in recurrent networks of spiking neurons. PLoS Comput Biol, 7(11):e1002211, 2011.
Chan et al. [2007] Vincent Chan, Shih-Chii Liu, and André Van Schaik. AER EAR: A matched silicon cochlea pair with address event representation interface. Circuits and Systems I: Regular Papers, IEEE Transactions on, 54(1):48–59, 2007.
Courbariaux et al. [2015] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. CoRR, abs/1511.00363, 2015. URL http://arxiv.org/abs/1511.00363.
Crammer et al. [2009] Koby Crammer, Alex Kulesza, and Mark Dredze. Adaptive regularization of weight vectors. In Advances in neural information processing systems, pages 414–422, 2009.
Diehl et al. Peter U Diehl, Daniel Neil, Jonathan Binas, Matthew Cook, Shih-Chii Liu, and Michael Pfeiffer. Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing.
Fan et al. [2012] Xiequan Fan, Ion Grama, and Quansheng Liu. Hoeffding’s inequality for supermartingales. Stochastic Processes and their Applications, 122(10):3545–3559, 2012.
Hunsberger and Eliasmith [2015] Eric Hunsberger and Chris Eliasmith. Spiking deep networks with LIF neurons. arXiv preprint arXiv:1510.08829, 2015.
Lichtsteiner et al. [2008] Patrick Lichtsteiner, Christoph Posch, and Tobi Delbruck. A 128×128 120 dB 15 μs latency asynchronous temporal contrast vision sensor. Solid-State Circuits, IEEE Journal of, 43(2):566–576, 2008.
Neftci et al. [2013] Emre Neftci, Srinjoy Das, Bruno Pedroni, Kenneth Kreutz-Delgado, and Gert Cauwenberghs. Event-driven contrastive divergence for spiking neuromorphic systems. Frontiers in neuroscience, 7, 2013.
O’Connor et al. [2013] Peter O’Connor, Daniel Neil, Shih-Chii Liu, Tobi Delbruck, and Michael Pfeiffer. Real-time classification and sensor fusion with a spiking deep belief network. Frontiers in neuroscience, 7, 2013.
Welling [2009] Max Welling. Herding dynamical weights to learn. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1121–1128. ACM, 2009.
Appendix A Algorithms
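A.1 Proof of convergence of Spiking Vector Quantization
Here we show that if we obtain events $\langle (i_n, s_n) : n \in [1..N] \rangle$ given a vector $\vec v$ and a time $T$ from the Spiking Vector Quantization Algorithm then:
$$\lim_{T \to \infty} \frac{1}{T} \sum_{n=1}^{N} s_n \vec e_{i_n} = \vec v \tag{8}$$
Since the L1 norm is bounded by:
$$\left\| \vec v \, T - \sum_{n=1}^{N} s_n \vec e_{i_n} \right\|_{L1} \le \frac{N_v}{2} \tag{9}$$
where $N_v$ is the number of elements in vector $\vec v$. We can take the limit of infinite time, and show that our spikes converge to form an approximation of $\vec v$:
$$\lim_{T \to \infty} \left\| \vec v - \frac{1}{T} \sum_{n=1}^{N} s_n \vec e_{i_n} \right\|_{L1} \le \lim_{T \to \infty} \frac{N_v}{2T} = 0 \tag{10}$$
A.2 Stochastic Sampling
A.3 Spiking Stream Quantization
In our modification to Spiking Vector Quantization, we instead feed in a stream of vectors, as in Algorithm 6.
If we simply replace the term $\vec v$ in Equation 10 with $\frac{1}{T} \sum_{t=1}^{T} \vec v_t$, and follow the same reasoning, we find that we converge to the running mean of the vector stream.
$$\lim_{T \to \infty} \left\| \frac{1}{T} \sum_{t=1}^{T} \vec v_t - \frac{1}{T} \sum_{n=1}^{N} s_n \vec e_{i_n} \right\|_{L1} \le \lim_{T \to \infty} \frac{N_v}{2T} = 0 \tag{11}$$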
A.4 Rectified Stream Quantization
We can further make a small modification where we only send positive spikes (so our $\vec\phi$ can get unboundedly negative).
To see why this construction approximates a ReLU unit, first observe that the total number of spikes emitted can be computed by considering the total cumulative sum $\sum_{\tau=1}^{t} v_{i,\tau}$. More precisely:
$$n_{i,T} = \max\left(0, \left\lfloor \max_{t \le T} \sum_{\tau=1}^{t} v_{i,\tau} + \frac{1}{2} \right\rfloor\right) \tag{12}$$
where $n_{i,T}$ indicates the number of spikes emitted from unit $i$ by time $T$ and $\lfloor \cdot \rfloor$ indicates the integer floor of a real number.
Assume the $v_{i,t}$ are IID sampled from some process with mean $\mu_i$ and finite standard deviation $\sigma_i$. Define $\alpha_{i,t} = v_{i,t} - \mu_i$, which has zero mean, and the cumulative sum $S_{i,T} = \sum_{t=1}^{T} \alpha_{i,t}$, which is a martingale. There are a number of concentration inequalities, such as the Bernstein concentration inequalities Fan et al. [2012], that bound the sum or the maximum of the sequence under various conditions. What is important for us is the fact that in the limit $T \to \infty$ the sums $S_{i,T}/T$ concentrate to a delta peak at zero in probability, and that we can therefore conclude $\frac{1}{T} \sum_{t=1}^{T} v_{i,t} \to \mu_i$, from which we can also conclude that the maximum, and thus the number of spikes, will grow in the same way. From this we finally conclude that $n_{i,T}/T \to \max(0, \mu_i)$, which is the ReLU nonlinearity. Thus the mean spiking rate approaches the ReLU function of the mean input.
Appendix B MLP Convergence
Appendix C A Training Iteration
Appendix D Network Diagram
Appendix E Hyperparameters
Our spiking architecture introduces a number of new hyperparameters and settings that are unfamiliar to those used to regular neural networks. We chose to evaluate these empirically by modifying them one by one as compared to a baseline.

Fractional Updates.

False (Baseline): We use the standard stochastic gradient descent method.

True: We use our new Fractional Stochastic Gradient Descent method, described in Section 3.7.

Depth-First.

False (Baseline): Events are propagated “breadth-first”, meaning that, at a given time step, all events are collected from the output of one module before any of their child events are processed.

True: If an event from module A creates child events from module B, those are processed immediately, before any more events from module A are processed.

Smooth Weight Updates.

False (Baseline): The weight-update modules take in a count of spikes from the previous layer as their input.

True: The weight-update modules take the rectified cumulative sum of the pre-quantized vectors from the previous layer, resulting in a smoother estimate of the input.

Backwards-Quantization.

No-Reset-Quantization (Baseline): The backwards quantization modules do not reset their $\vec\phi$’s with each training iteration.

Random: Each element of $\vec\phi$ is randomly selected from the interval $[-\frac{1}{2}, \frac{1}{2}]$ at the start of each training iteration.

Zero-Reset: The backwards quantizers reset their $\vec\phi$’s to zero at the start of each training iteration.

Number of time steps: How many time steps to run the training procedure for each sample (Baseline is 10).
Since none of these hyperparameters have obvious values, we tested them empirically with a network with layer sizes [784-200-200-10], trained on MNIST. Table 3 shows the effects of these hyperparameters.

Table 3: Effects of the hyperparameter variants on a network trained on MNIST.

| Variant | % Error |
|---|---|
| Baseline | 3.38 |
| Fractional Updates | 3.10 |
| Depth-First Propagation | 81.47 |
| Smooth Gradients | 2.85 |
| Smooth & Fractional | 3.07 |
| Back-Quantization = Zero-Reset | 87.87 |
| Back-Quantization = Random | 3.15 |
| 5 Time Steps | 4.41 |
| 20 Time Steps | 2.65 |
Most of the hyperparameter settings appear to make a small difference. A notable exception is the Zero-Reset rule for our backwards-quantizing units: the network learns almost nothing throughout training. The reason for this is that the initial weights are too small to allow any error spikes to be sent back (the backward-pass quantizers never reach their firing thresholds). As a result, the network fails to learn. We found two ways to deal with this: “Back-Quantization = Random” initializes the $\vec\phi$ for the backwards quantizers randomly at the beginning of each round of training. “Back-Quantization = No-Reset” simply does not reset $\vec\phi$ in between training iterations. In both cases, the backward-pass quantizers always have some chance of sending a spike, and so the network is able to train. It is also interesting that using Fractional Updates (FSGD) gives us a slight advantage over regular SGD (Baseline). This is quite promising, because it means we have no need for multiplication in our network. As Section 3.7 explains, we simply add a column to the weight matrix every time an error spike arrives. We also observe that using the rectified running sum of the pre-quantization vector from the previous layer as our input to the weight-update module (Smooth Gradients) gives us a slight advantage. This is expected, because it is simply a less noisy version of the count of the input spikes that we would use otherwise.
Appendix F Event Routing
Since each event can result in a variable number of downstream events, we have to think about the order in which we want to process these events. There are two issues:

In situations where one event is sent to multiple modules, we need to ensure that it is being sent to its downstream modules in the right order. In the case of the SMLP, we need to ensure that, for a given input, its child events reach the filters in the backward pass before its other child events make their way around and do the backward pass. Otherwise we are not implementing backpropagation correctly.

In situations where one event results in multiple child events, we need to decide in which order to process these child events and their child events. For this, there are two routing schemes that we can use: breadth-first and depth-first. We will outline those with the example shown in Figure 6. Here we have a module A that responds to some input event by generating two events: $a_1$ and $a_2$. Event $a_1$ is sent to module B and triggers events $b_1$ and $b_2$. Event $a_2$ is sent and triggers event $b_3$. Table 4 shows how a breadth-first vs depth-first router will handle these events.

Table 4: Order in which events are processed under each routing scheme.

| Breadth-First | Depth-First |
|---|---|
| $a_1$ | $a_1$ |
| $a_2$ | $b_1$ |
| $b_1$ | $b_2$ |
| $b_2$ | $a_2$ |
| $b_3$ | $b_3$ |
Experimentally, we found that breadth-first routing performed better on our MNIST task, but we should keep an open mind on both methods until we understand why.
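The two routing schemes discussed in this appendix can be sketched with a double-ended queue. The event names and the `children` mapping below are hypothetical stand-ins for the events in the example; the sketch only illustrates processing order and is not the router used in our experiments.

```python
from collections import deque


def route(root_events, children, depth_first):
    """Return the order in which events are processed.

    `children` maps an event to the child events it triggers. Breadth-first
    appends children to the back of the queue; depth-first pushes them to the
    front so they are handled before any remaining sibling events.
    """
    pending = deque(root_events)
    order = []
    while pending:
        event = pending.popleft()
        order.append(event)
        kids = children.get(event, [])
        if depth_first:
            pending.extendleft(reversed(kids))   # children are processed next
        else:
            pending.extend(kids)                 # children wait their turn
    return order


children = {"a1": ["b1", "b2"], "a2": ["b3"]}
print(route(["a1", "a2"], children, depth_first=False))
# ['a1', 'a2', 'b1', 'b2', 'b3']
print(route(["a1", "a2"], children, depth_first=True))
# ['a1', 'b1', 'b2', 'a2', 'b3']
```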