Suppose we are trying to track objects in a scene. A typical system used today would consist of sending camera-frames into a convolutional network which predicts bounding boxes. Such a system may be trained by going over many hours of video with manually annotated bounding boxes, and learning to predict their locations. This system has to execute a forward pass of a convolutional network at each iteration. If we double the frame rate, we double the amount of computation, even if the contents of the video are mostly static. Intuitively, it does not feel that this should be necessary. Given the similarity between neighbouring frames of video, could we not reuse some of the computation from the last frame to update the bounding box inferences for the current frame? Is it really necessary to recompute the entire network on each frame?
Many robotic systems combine many sensors operating at wildly different frame rates. Some “neuromorphic” sensors, such as the Dynamic Vision Sensor (Lichtsteiner et al., 2008), have done away with the concept of frames altogether and instead send asynchronous “events” whenever the value of a pixel changes beyond some threshold. It’s not obvious, using current methods in deep learning, how we can efficiently integrate asynchronous sensory signals into a unified, trainable, latent representation, without recomputing the function of the network every time a new signal arrives.
There has been a lot of work on increasing the computational efficiency of neural networks by quantizing neural weights or activations (see Section 4), but comparatively little work on exploiting redundancies in the data to reduce the amount of computation. O’Connor and Welling (2016b) set out to exploit the temporal redundancy in video by having neurons send only their quantized changes in activation to downstream neurons, and having the downstream neurons integrate these changes. This approach works for efficiently approximating the function of the network, but fails for training: when the weights are changing with time, the procedure (take the temporal difference, multiply by weights, temporally integrate) fails to reconstruct the correct activation for the next layer. In other words, $\sum_{\tau=0}^{t} (x_\tau - x_{\tau-1}) \cdot w_\tau \neq x_t \cdot w_t$ unless $w_t$ is constant in time. Figure 2 describes the problem visually. In this paper, we correct for this by instead encoding a mixture of two components of the layer’s activation $x_t$: the proportional component $k_p x_t$, and the derivative component $k_d (x_t - x_{t-1})$. When we invert this encoding scheme, we get a decoding scheme which corresponds to taking an exponentially decaying temporal average of past inputs.
Biological neurons tend to respond to a newly presented stimulus with a burst of spiking, which then decays to a slower baseline firing rate as the stimulus persists, and neural membrane potentials can approximately be modeled as an exponentially decaying temporal average of past inputs.
We propose a coding scheme where neurons can represent their activations as a temporally sparse series of impulses. The impulses from a given neuron encode a combination of the value and the rate of change of the neuron’s activation.
While our algorithm is designed to work efficiently with temporal data, we do not aim to learn temporal sequences in this work. We aim to efficiently approximate a function $y_t = f(x_t)$, where the current target $y_t$ is solely a function of the current input $x_t$, and not previous inputs $x_0, \dots, x_{t-1}$. The temporal redundancy between neighbouring inputs will however be used to make our approximate computation of this function more efficient.
Throughout this paper we will use the notation $(f_3 \circ f_2 \circ f_1)(x) = f_3(f_2(f_1(x)))$ to denote function composition. We slightly abuse the notion of functions by allowing them to have an internal state which persists between calls. For example, we define the $\Delta$ function in Equation 1 as being the difference between the inputs in two consecutive calls (where the persistent variable $x_{last}$ is initialized to 0). The $\Sigma$ function, defined in Equation 2, returns a running sum of the inputs over calls. So we can write, for example, that when our composition of functions $(\Sigma \circ \Delta)$ is called with a sequence of input variables $x_1, \dots, x_t$, then $(\Sigma \circ \Delta)(x_t) = x_t$, because $(x_1 - 0) + (x_2 - x_1) + \dots + (x_t - x_{t-1}) = x_t$.
In general, when we write $y_t = f(x_t)$, where $f$ is a function with persistent state, it will be implied that we have previously called $f(x_\tau)$ for $\tau \in [1, \dots, t-1]$ in sequence. Variable definitions that are used later will be highlighted in blue.
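The stateful-function convention above can be illustrated with a minimal Python sketch. The class names `TemporalDifference` and `TemporalSum` are chosen here for illustration and are not taken from the paper's code; they are meant to mirror the temporal-difference and running-sum functions of Equations 1 and 2, each with its persistent state initialized to 0.

```python
class TemporalDifference:
    """Stateful function: returns the current input minus the previous input."""

    def __init__(self):
        self.x_last = 0.0  # persistent state, initialized to 0

    def __call__(self, x):
        y = x - self.x_last
        self.x_last = x
        return y


class TemporalSum:
    """Stateful function: returns the running sum of all inputs seen so far."""

    def __init__(self):
        self.total = 0.0  # persistent state, initialized to 0

    def __call__(self, x):
        self.total = self.total + x
        return self.total


delta, sigma = TemporalDifference(), TemporalSum()
xs = [3.0, 1.0, 4.0, 1.0, 5.0]

# Calling the composition (sum after difference) in sequence returns each input
# unchanged, because the differences telescope.
ys = [sigma(delta(x)) for x in xs]
```

Because both objects keep state between calls, the composition only behaves as the identity when it is called on the inputs in order, which is the convention assumed throughout.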
2.2 PD Encoding
Suppose a neuron has time-varying activation $x_t$. Taking inspiration from Proportional-Integral-Derivative (PID) controllers, we can “encode” this activation at each time step as a combination of its current activation and change in activation as $a_t \triangleq \mathrm{enc}(x_t) = k_p x_t + k_d (x_t - x_{t-1})$ (see Equation 4). The parameters $k_p$ and $k_d$ determine what portion of our encoding represents the value of the activation and the rate of change of that value, respectively. In Section 2.8, we will discuss the effect our choices for these parameters have on the network.
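To get our decoding formula, we can simply solve for $x_t$ as $x_t = \frac{a_t + k_d x_{t-1}}{k_p + k_d}$ (Equation 5), such that $(\mathrm{dec} \circ \mathrm{enc})(x_t) = x_t$. Notice that Equation 5 corresponds to decaying the previous decoder state by the constant $k_d / (k_p + k_d)$ and then adding the input $a_t / (k_p + k_d)$. We can expand this recursively to see that this corresponds to a temporal convolution $a * \kappa$, where $\kappa$ is a causal exponential kernel $\kappa_\tau = \frac{1}{k_p + k_d} \left( \frac{k_d}{k_p + k_d} \right)^{\tau}$ for $\tau \geq 0$, and $0$ otherwise.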
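As a concrete check of the encoder-decoder pair, the following sketch implements both as stateful objects and verifies that decoding inverts encoding, and that the decoder is equivalent to a causal exponential convolution. The names `PDEncoder`, `PDDecoder`, `kp` and `kd` are chosen for this sketch, and the coefficient values are arbitrary.

```python
import numpy as np


class PDEncoder:
    """a_t = kp * x_t + kd * (x_t - x_{t-1}), with x_{-1} = 0."""

    def __init__(self, kp, kd):
        self.kp, self.kd = kp, kd
        self.x_last = 0.0

    def __call__(self, x):
        a = self.kp * x + self.kd * (x - self.x_last)
        self.x_last = x
        return a


class PDDecoder:
    """x_t = (a_t + kd * x_{t-1}) / (kp + kd): a leaky integrator."""

    def __init__(self, kp, kd):
        self.kp, self.kd = kp, kd
        self.x_hat = 0.0

    def __call__(self, a):
        self.x_hat = (a + self.kd * self.x_hat) / (self.kp + self.kd)
        return self.x_hat


kp, kd = 0.1, 2.0  # arbitrary illustrative coefficients
rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=50))  # a random-walk test signal

enc, dec = PDEncoder(kp, kd), PDDecoder(kp, kd)
a = np.array([enc(xt) for xt in x])
x_rec = np.array([dec(at) for at in a])

# The same decoding, written as a convolution with a causal exponential kernel.
tau = np.arange(len(x))
kernel = (kd / (kp + kd)) ** tau / (kp + kd)
x_conv = np.convolve(a, kernel)[: len(x)]
```

Without quantization, both `x_rec` and `x_conv` reproduce `x` up to floating-point error.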
Our motivation for the aforementioned encoding scheme is that we now want to quantize our signal into a sparse representation, which will later be used to reduce computation. We can quantize our signal $a_t$ into a sparse, integer signal $s_t \triangleq Q(a_t)$, where the quantizer Q is defined in Equation 3. Equation 3 implements a form of Sigma-Delta modulation, a method widely used in signal processing to approximately communicate signals at low bit-rates (Candy and Temes, 1962). We can show that $Q(x_t) = (\Delta \circ R \circ \Sigma)(x_t)$ (see Supplementary Material Section A), where $\Delta \circ R \circ \Sigma$ indicates applying a temporal summation, a rounding, and a temporal difference, in series. When $|a_t| \ll 1$ for all $t$, we can expect $s_t$ to consist of mostly zeros with a few 1’s and -1’s.
We can now approximately reconstruct our original signal $x_t$ as $\hat{x}_t \triangleq \mathrm{dec}(s_t)$ by applying our decoder, as defined in Equation 5. As our coefficients $k_p$, $k_d$ become larger, our reconstructed signal $\hat{x}_t$ should become closer to the original signal $x_t$. We illustrate examples of encoded signals and their reconstructions for different $k_p$, $k_d$ in Figure 1.
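The full encode-quantize-decode pipeline can be sketched as below. This is an illustration under stated assumptions rather than the paper's implementation: the quantizer is written in the usual Sigma-Delta form (accumulate the input in a residual, emit the rounded residual, subtract what was emitted), which is assumed to match Equation 3, and the function name `pd_quantize_reconstruct` and the test signal are hypothetical.

```python
import numpy as np


def pd_quantize_reconstruct(x, kp, kd):
    """Encode x with PD coefficients, quantize to integers, and decode."""
    x_last = 0.0  # encoder state
    phi = 0.0     # quantizer residual
    x_hat = 0.0   # decoder state
    spikes = np.zeros(len(x), dtype=int)
    recon = np.zeros(len(x))
    for t, xt in enumerate(x):
        a = kp * xt + kd * (xt - x_last)          # PD encoding
        x_last = xt
        phi += a                                  # Sigma-Delta quantization
        s = int(np.round(phi))
        phi -= s
        x_hat = (s + kd * x_hat) / (kp + kd)      # leaky-integrator decoding
        spikes[t], recon[t] = s, x_hat
    return spikes, recon


x = np.sin(np.linspace(0, 4 * np.pi, 400))  # smooth test signal

# Scale both coefficients together and record the worst-case reconstruction error.
errors = {}
for scale in (1.0, 4.0, 16.0):
    kp, kd = 0.01 * scale, 2.0 * scale
    spikes, recon = pd_quantize_reconstruct(x, kp, kd)
    errors[scale] = (np.max(np.abs(recon - x)), 1.0 / (kp + kd))
```

In this sketch each entry of `errors` pairs the observed worst-case error with the quantity `1 / (kp + kd)`, which bounds it because the quantizer residual never exceeds one half in magnitude. Larger coefficients therefore give a tighter reconstruction at the cost of more non-zero entries in `spikes`.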
2.3.1 Special cases
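We can write compactly the entire reconstruction function as $\hat{x}_t = (\mathrm{dec}_{k_p k_d} \circ \Delta \circ R \circ \Sigma \circ \mathrm{enc}_{k_p k_d})(x_t)$.
$k_p = 0$: When $k_p = 0$, we get $\mathrm{dec}_{0 k_d}(x_t) = (k_d^{-1} \circ \Sigma)(x_t)$ and $\mathrm{enc}_{0 k_d}(x_t) = (k_d \circ \Delta)(x_t)$, so our reconstruction reduces to $\hat{x}_t = (k_d^{-1} \circ \Sigma \circ \Delta \circ R \circ \Sigma \circ k_d \circ \Delta)(x_t)$. Because $k_d$, $\Delta$, $\Sigma$ all commute with one another, we can simplify this to $\hat{x}_t = (k_d^{-1} \circ R \circ k_d)(x_t)$, so our decoded signal is $\hat{x}_t = \mathrm{round}(x_t \cdot k_d) / k_d$, with no dependence on $x_{t-1}$. This is visible in the bottom row of Figure 1. This was the encoding scheme used in O’Connor and Welling (2016b).
$k_d = 0$: In this case, $\mathrm{dec}_{k_p 0}(x_t) = k_p^{-1} x_t$ and $\mathrm{enc}_{k_p 0}(x_t) = k_p x_t$, so our encoding-decoding process becomes $\hat{x}_t = (k_p^{-1} \circ \Delta \circ R \circ \Sigma \circ k_p)(x_t)$. In this case neither our encoder nor our decoder has any memory, and we take no advantage of temporal redundancy.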
2.4 Sparse Communication Between Layers
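The purpose of our encoding scheme is to reduce computation by sparsifying communication between layers of a neural network. Suppose we are trying to compute the pre-nonlinearity activation of the first hidden layer, $z_t$, given the input activation, $x_t$. We approximate $z_t$ as:
$$z_t \triangleq x_t \cdot w_t \approx \mathrm{dec}(Q(\mathrm{enc}(x_t))) \cdot w_t \approx \mathrm{dec}(Q(\mathrm{enc}(x_t)) \cdot w_t) \triangleq \hat{z}_t \qquad (7)$$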
The first approximation comes from the quantization (Q) of the encoded signal, and the second from the fact that the weights change over time, as explained in Figure 2. The effects of these approximations are further explored in Section B of the Supplementary Material.
Computing takes multiplications and additions. The cost of computing , on the other hand, depends on the contents of . If the data is temporally redundant, should be sparse, with total magnitude .
can be decomposed into a sum of one-hot vectors where is a one-hot vector with element . The matrix product can then be decomposed into a series of row additions:
If we include the encoding, quantization, and decoding operations, our matrix product takes a total of multiplications, and additions. Assuming the term dominates, we can say that the relative cost of computing vs is:
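The decomposition of the sparse matrix product into row additions can be sketched as follows. The function name `sparse_dot` and the layer sizes are illustrative assumptions. Each unit of magnitude in the quantized vector contributes one signed addition of the corresponding weight row, so no multiplications by weights are needed.

```python
import numpy as np


def sparse_dot(s, w):
    """Compute s @ w for a sparse integer vector s using only row additions.

    Returns the result and the number of scalar additions performed.
    """
    out = np.zeros(w.shape[1])
    n_additions = 0
    for i in np.flatnonzero(s):
        sign = 1.0 if s[i] > 0 else -1.0
        for _ in range(abs(int(s[i]))):  # one row addition per unit of magnitude
            out += sign * w[i]
            n_additions += w.shape[1]
    return out, n_additions


rng = np.random.default_rng(0)
n_in, n_out = 100, 20  # illustrative layer sizes
w = rng.normal(size=(n_in, n_out))
s = rng.choice([-1, 0, 1], size=n_in, p=[0.05, 0.9, 0.05])  # mostly-zero spike vector

z_sparse, n_additions = sparse_dot(s, w)
z_dense = s @ w
total_magnitude = int(np.abs(s).sum())
```

Here `n_additions` equals `total_magnitude * n_out`, so the cost scales with the total magnitude of the quantized signal rather than with the input dimension.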
2.5 A Neural Network
We can implement this encoding scheme on every layer of a neural network. Given a standard neural net consisting of alternating linear () and nonlinear () operations, our network function can then be written as:
We can use the same approach to approximately calculate our gradients to use in training. If we define our layer activations as , and , where is some loss function and is a target, we can backpropagate the approximate gradients as:
On every layer of the forward and backward pass, our quantization scheme corrupts the signals that are being sent between layers. Nevertheless we find experimentally that this does not matter much to the performance of the network.
2.6 Parameter Updates
There’s no use having an efficient backward pass if the parameter updates aren’t also efficient. In a normal neural network trained with backpropagation and simple stochastic gradient descent, the parameter update for weight matrix has the form where is the learning rate. If connects layer to layer , we can write where is the presynaptic activation, is the postsynaptic (pre-nonlinearity) activation and is the outer product. So we pay multiplications to update the parameters for each sample.
We want a more efficient way to compute this product, which takes advantage of the sparsity of our encoded signals to reduce computation. We can start by applying our encoding-quantizing-decoding scheme to our input and error signals as and , and approximate our true update as where and . This doesn’t do any good by itself, because the update rule still is not sparse. But we can exactly compute the sum of this value over time using one of two sparse update schemes, past updates and future updates, which are depicted in Figure 3.
Past Updates: For a given synapse, if either the presynaptic neuron spikes () or the postsynaptic neuron spikes (), we increment the by the total area under since the last spike. We can do this efficiently because between the current time and the time of the previous spike, is a geometric sequence. Given a known initial value , final value , and decay rate , a geometric sequence sums to . The area calculated is shown in pink on the bottom row of Figure 3, and one algorithm to calculate it is in Equation 13.
Future Updates: Another approach is to calculate the Present Value of the future area under the integral from the current spike. This is depicted in the blue-gray area in Figure 3, and the formula is in Equation 14.
To simplify our expressions in the update algorithms, we re-parametrize our coefficients as , .
[Equation 13: past-updates algorithm]
[Equation 14: future-updates algorithm]
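The idea behind the future-updates scheme can be checked numerically for a single synapse. The sketch below is not a transcription of Equation 14; it is one way to realize the "present value" idea under stated assumptions. Between spikes, both reconstructed signals decay by the factor `k_alpha = kd / (kp + kd)`, so their product decays by `k_alpha ** 2` and the remaining future area is the current product divided by `1 - k_alpha ** 2`. At each spike we add the change in that present value. The function name, the spike statistics, and the coefficient values are all hypothetical.

```python
import numpy as np


def dense_and_future_updates(x_spikes, e_spikes, kp, kd):
    """Compare the dense sum of x_hat * e_hat with a sparse, spike-driven sum."""
    k_alpha = kd / (kp + kd)   # per-step decay of the decoder state
    k_beta = 1.0 / (kp + kd)   # scale applied to each incoming spike
    x_hat = e_hat = 0.0
    dense_total = sparse_total = 0.0
    n_sparse_updates = 0
    for xs, es in zip(x_spikes, e_spikes):
        if xs != 0 or es != 0:
            # Change in the present value of the future area caused by this spike,
            # computed from the decoder states just before the spike is applied.
            change = k_alpha * k_beta * (x_hat * es + xs * e_hat) + k_beta**2 * xs * es
            sparse_total += change / (1.0 - k_alpha**2)
            n_sparse_updates += 1
        # Dense reference: decode both signals and accumulate their product.
        x_hat = k_alpha * x_hat + k_beta * xs
        e_hat = k_alpha * e_hat + k_beta * es
        dense_total += x_hat * e_hat
    return dense_total, sparse_total, n_sparse_updates


rng = np.random.default_rng(0)
n_active, n_tail = 200, 400
# Sparse spike trains followed by a silent tail so the decaying product dies out.
x_spikes = np.concatenate([rng.choice([-1, 0, 1], n_active, p=[0.05, 0.9, 0.05]), np.zeros(n_tail, dtype=int)])
e_spikes = np.concatenate([rng.choice([-1, 0, 1], n_active, p=[0.05, 0.9, 0.05]), np.zeros(n_tail, dtype=int)])

dense_total, sparse_total, n_sparse_updates = dense_and_future_updates(x_spikes, e_spikes, kp=0.2, kd=2.0)
```

The two totals agree once the signals have decayed, while the sparse rule only does work on time steps where a spike arrives. For a full weight matrix the same update touches only the rows and columns of the neurons that spiked.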
2.7 Relation to STDP
An extremely attentive reader might have noted that Equation 14 has the form of an online implementation of Spike-Timing Dependent Plasticity (STDP). STDP (Markram et al., 2012) emerged from neuroscience, where it was observed that synaptic weight changes appeared to be functions of the relative timing of pre- and post-synaptic spikes. The empirically observed function usually has the double-exponential form seen on the rightmost plot of Figure 4.
Using the quantized input signal and error signal , and their reconstructions and as defined in the last section, we define a causal convolutional kernel and where . The middle plot of Figure 4 is a plot of . We define our STDP update rule as:
We note that while our version of STDP has the same double-exponential form as the classic STDP rule observed in neuroscience (Markram et al., 2012), we do not have the property that the sign of the weight change depends on whether the presynaptic spike preceded the postsynaptic spike.
In Section C in the supplementary material we show experimentally that while Equations , , , may all result in different updates at different times, the rules are equivalent in that for a given set of pre/post-synaptic spikes , the cumulative sum of their updates over time converges exactly.
2.8 Tuning $k_p$, $k_d$
The smaller the magnitude of a signal, the more severely distorted it is by our quantization-reconstruction scheme. We can see that scaling a signal by K has the same effect on the quantized version of the signal, , as scaling and by K:
The fact that the reconstruction quality depends on the signal magnitude presents a problem when training our network, because the error gradients tend to change in magnitude throughout training (they start large, and become smaller as the network learns). To keep our signal within the useful dynamic range of the quantizer, we apply a simple scheme to heuristically adjust $k_p$ and $k_d$ for the forward and backward passes separately, for each layer of the network. Instead of directly setting $k_p$, $k_d$ as hyperparameters, we fix the ratio, and adapt the scale to the magnitude of the signal. Our update rule for the scale is:
Where is the scale-adaptation learning rate, is a rolling average of the magnitude of signal , and defines how coarse our quantization should be relative to the signal magnitude (higher means coarser). We can recover for use in the encoders and decoders as and . In our experiments, we choose , and initialize .
To evaluate our network’s ability to learn, we run it on the standard MNIST dataset, as well as a variant we created called “Temporal MNIST”. Temporal MNIST is simply a reshuffling of the MNIST dataset so that similar inputs (in terms of L2-pixel distance) are put together. Figure 6 shows several snippets of consecutive frames in the temporal MNIST dataset. We compare our Proportional-Derivative Net against a conventional Multi-Layer Perceptron with the same architecture (one hidden layer of 200 ReLU hidden units and a softmax output). The results are shown in Figure 6. Somewhat surprisingly, our predictor slightly outperformed the MLP, getting 98.36% on the test set vs 98.25% for the MLP. We assume this improvement is due to the regularizing effect of the quantization. On Temporal MNIST, our network was able to converge with less computation than it required for MNIST (It used operations for MNIST vs for Temporal MNIST), but ended up with a slightly worse test score when compared with the MLP (the PDNN got 97.99% vs 98.28% for the MLP). It’s not clear why our network appeared to achieve a slightly worse score on temporal data. This will be a subject for future investigation.
4 Related Work
There has been sparse but interesting work on merging the notions of spiking neural networks and deep learning. Diehl et al. (2015) found a way to efficiently map a trained neural network onto a spiking network. Lee et al. (2016) devised a method for training networks of integrate-and-fire spiking neurons with backpropagation - though their neurons did not send a temporal difference of their activations. O’Connor and Welling (2016a) created a method for training event-based neural networks - but their method took no advantage of temporal redundancy in the data. Binas et al. (2016) and O’Connor and Welling (2016b) both took the approach of sending quantized temporal changes to reduce computation on temporally redundant data, but their schemes could not be used to train a neural network. Bohte et al. (2000) showed how one could apply backpropagation for training spiking neural networks, but it was not obvious how to apply the method to non-spiking data. Zambrano and Bohte (2016) developed a spiking network with an adaptive scale of quantization (which bears some resemblance to our tuning scheme described in Section 2.8), and showed that the spiking mechanism is a form of Sigma-Delta modulation, which we also use here. Courbariaux et al. (2015) showed that neural networks could be trained with binary weights and activations (we just quantize activations). Bengio et al. (2015) found a connection between the classic STDP rule (Figure 4, right) and optimizing a dynamical neural network, although the way they arrived at an STDP-like rule was quite different from ours.
We set out with the objective of reducing the computation in deep networks by taking advantage of temporal redundancy in data. We described a simple rule (Equation 4) for sparsifying the communication between layers of a neural network by having our neurons communicate a combination of their temporal change in activation, and the current value of their activation. We show that it follows from this scheme that neurons should behave as leaky integrators (Equation 5). When we quantize our neural activations with Sigma-Delta modulation, a common quantization scheme in signal processing, we get something resembling a leaky integrate-and-fire neuron. We derive efficient update rules for the weights of our network, and show these to be equivalent to a form of STDP - a learning rule first observed in neuroscience. Finally, we train our network, verify that it does indeed compute more efficiently on temporal data, and show that it performs about as well as a traditional deep network of the same architecture, but with significantly reduced computation.
Code is available at github.com/petered/pdnn.
This work was supported by Qualcomm, who we’d like to thank for sharing their past work with us. In addition, we’d like to thank our colleagues, especially Matthias Reisser and Changyong Oh, for some very useful discussions which contributed to this work.
- Bengio et al.  Yoshua Bengio, Thomas Mesnard, Asja Fischer, Saizheng Zhang, and Yuhai Wu. An objective function for stdp. arXiv preprint arXiv:1509.05936, 2015.
- Binas et al.  Jonathan Binas, Giacomo Indiveri, and Michael Pfeiffer. Deep counter networks for asynchronous event-based processing. CoRR, abs/1611.00710, 2016. URL http://arxiv.org/abs/1611.00710.
- Bohte et al.  Sander M Bohte, Joost N Kok, and Johannes A La Poutré. Spikeprop: backpropagation for networks of spiking neurons. In ESANN, pages 419–424, 2000.
- Candy and Temes  James C Candy and Gabor C Temes. Oversampling delta-sigma data converters: theory, design, and simulation. University of Texas Press, 1962.
- Courbariaux et al.  Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. CoRR, abs/1511.00363, 2015. URL http://arxiv.org/abs/1511.00363.
- Diehl et al.  Peter U Diehl, Daniel Neil, Jonathan Binas, Matthew Cook, Shih-Chii Liu, and Michael Pfeiffer. Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing. In 2015 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2015.
- Horowitz  Mark Horowitz. 1.1 computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 10–14. IEEE, 2014.
- Lee et al.  Jun Haeng Lee, Tobi Delbruck, and Michael Pfeiffer. Training deep spiking neural networks using backpropagation. arXiv preprint arXiv:1608.08782, 2016.
- Lichtsteiner et al.  Patrick Lichtsteiner, Christoph Posch, and Tobi Delbruck. A 128×128 120 dB 15 μs latency asynchronous temporal contrast vision sensor. Solid-State Circuits, IEEE Journal of, 43(2):566–576, 2008.
- Markram et al.  Henry Markram, Wulfram Gerstner, and Per Jesper Sjöström. Spike-timing-dependent plasticity: a comprehensive overview. Frontiers in synaptic neuroscience, 4, 2012.
- O’Connor and Welling [2016a] Peter O’Connor and Max Welling. Deep spiking networks. arXiv preprint arXiv:1602.08323, 2016a.
- O’Connor and Welling [2016b] Peter O’Connor and Max Welling. Sigma delta quantized networks. arXiv preprint arXiv:1611.02024, 2016b.
- Zambrano and Bohte  Davide Zambrano and Sander M Bohte. Fast and efficient asynchronous neural computation with adapting spiking neural networks. arXiv preprint arXiv:1609.02053, 2016.
Appendix A Sigma-Delta Unwrapping
From Equation 3 (Q) we can see that
Now we can unroll and use the fact that if $n$ is an integer then $\mathrm{round}(a + n) = \mathrm{round}(a) + n$, to say:
At which point it is clear that Q is identical to a successive application of a temporal summation, a rounding, and a temporal difference. That is why we say $Q = \Delta \circ R \circ \Sigma$.
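The equivalence can also be checked numerically. The sketch below compares a residual-based Sigma-Delta quantizer (assumed to match Equation 3) against the unwrapped form: cumulative sum, rounding, temporal difference. The function names are illustrative. Rounding is implemented as `floor(v + 0.5)` so that adding an integer before rounding is the same as adding it after, which is the property the derivation relies on.

```python
import numpy as np


def round_half_up(v):
    """Rounding for which round(a + n) = round(a) + n holds for every integer n."""
    return np.floor(v + 0.5)


def sigma_delta_quantize(a):
    """Residual form: accumulate the input, emit the rounded residual, subtract it."""
    phi = 0.0
    s = np.zeros(len(a))
    for t, at in enumerate(a):
        phi += at
        s[t] = round_half_up(phi)
        phi -= s[t]
    return s


def unwrapped_quantize(a):
    """Unwrapped form: temporal sum, then rounding, then temporal difference."""
    return np.diff(round_half_up(np.cumsum(a)), prepend=0.0)


rng = np.random.default_rng(0)
a = 0.3 * rng.normal(size=200)  # small-magnitude test signal

s_residual = sigma_delta_quantize(a)
s_unwrapped = unwrapped_quantize(a)
```

Both forms produce the same integer sequence on this input. The residual form is the one suited to online use, since it never needs the unbounded cumulative sum.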
Appendix B Scanning the K-space
Equation 7 shows how we make two approximations when approximating $z_t = x_t \cdot w_t$ with $\hat{z}_t$. The first is the “nonstationary weight” approximation, arising from the fact that w changes in time; the second is the “quantization” approximation, arising from the quantization of x. Here we do a small experiment in which we multiply a time-varying scalar signal with a time-varying weight for many different values of $k_p$, $k_d$ to understand the effects of $k_p$, $k_d$ on our approximation error.
Appendix C All roads lead to Rome
In Sections 2.6 and 2.7, we described four different update rules, and stated that while they do not necessarily produce the same updates at the same times, they produce the same result in the end. Here we demonstrate this empirically. We generate two random spike-trains representing the input and the error signal to a single synapse. The plot on the bottom shows our weight as a function of time as it drifts from its initial value.