Temporally Efficient Deep Learning with Spikes

06/13/2017
by   Peter O'Connor, et al.
University of Amsterdam

The vast majority of natural sensory data is temporally redundant. Video frames or audio samples which are sampled at nearby points in time tend to have similar values. Typically, deep learning algorithms take no advantage of this redundancy to reduce computation. This can be an obscene waste of energy. We present a variant on backpropagation for neural networks in which computation scales with the rate of change of the data - not the rate at which we process the data. We do this by having neurons communicate a combination of their state, and their temporal change in state. Intriguingly, this simple communication rule gives rise to units that resemble biologically-inspired leaky integrate-and-fire neurons, and to a weight-update rule that is equivalent to a form of Spike-Timing Dependent Plasticity (STDP), a synaptic learning rule observed in the brain. We demonstrate that on MNIST and a temporal variant of MNIST, our algorithm performs about as well as a Multilayer Perceptron trained with backpropagation, despite only communicating discrete values between layers.


1 Introduction

Suppose we are trying to track objects in a scene. A typical system used today would consist of sending camera-frames into a convolutional network which predicts bounding boxes. Such a system may be trained by going over many hours of video with manually annotated bounding boxes, and learning to predict their locations. This system has to execute a forward pass of a convolutional network at each iteration. If we double the frame rate, we double the amount of computation, even if the contents of the video are mostly static. Intuitively, it does not feel that this should be necessary. Given the similarity between neighbouring frames of video, could we not reuse some of the computation from the last frame to update the bounding box inferences for the current frame? Is it really necessary to recompute the entire network on each frame?

Robotic systems often consist of many sensors operating at wildly different frame rates. Some "neuromorphic" sensors, such as the Dynamic Vision Sensor (Lichtsteiner et al., 2008), have done away with the concept of frames altogether and instead send asynchronous "events" whenever the value of a pixel changes beyond some threshold. It is not obvious, using current methods in deep learning, how we can efficiently integrate asynchronous sensory signals into a unified, trainable, latent representation without recomputing the function of the network every time a new signal arrives.

There has been a lot of work on increasing the computational efficiency of neural networks by quantizing neural weights or activations (see Section 4), but comparatively little work on exploiting redundancies in the data to reduce the amount of computation. O'Connor and Welling (2016b) set out to exploit the temporal redundancy in video by having neurons send only their quantized changes in activation to downstream neurons, and having the downstream neurons integrate these changes. This approach works for efficiently approximating the function of the network, but fails for training, because when the weights are changing with time, this approach (take the temporal difference, multiply by weights, temporally integrate) fails to reconstruct the correct activation for the next layer. In other words, $\sum_{\tau=1}^{t}(x_\tau - x_{\tau-1}) \cdot w_\tau \neq x_t \cdot w_t$ when $w_t$ varies with $t$. Figure 2 describes the problem visually. In this paper, we correct for this by instead encoding a mixture of two components of the layer's activation $x_t$: the proportional component $k_p x_t$, and the derivative component $k_d (x_t - x_{t-1})$. When we invert this encoding scheme, we get a decoding scheme which corresponds to taking an exponentially decaying temporal average of past inputs.

Biological neurons tend to respond to a newly presented stimulus with a burst of spiking, which then decays to a slower baseline firing rate as the stimulus persists, and neural membrane potentials can approximately be modeled as an exponentially decaying temporal average of past inputs.

The operations used throughout this paper are defined below; persistent state variables ($x_{\mathrm{last}}$, $\phi$, $y$) are initialized to 0:

$$\Delta(x_t) \triangleq x_t - x_{t-1} \qquad (1)$$

$$\Sigma(x_t) \triangleq \sum_{\tau=1}^{t} x_\tau \qquad (2)$$

$$Q(x_t) \triangleq s_t, \quad \text{where}\;\; s_t = \mathrm{round}(\phi_{t-1} + x_t), \;\; \phi_t = \phi_{t-1} + x_t - s_t \qquad (3)$$

$$\mathrm{enc}(x_t) \triangleq k_p x_t + k_d (x_t - x_{t-1}) \qquad (4)$$

$$\mathrm{dec}(a_t) \triangleq y_t, \quad \text{where}\;\; y_t = \frac{a_t + k_d\, y_{t-1}}{k_p + k_d} \qquad (5)$$

$$R(x_t) \triangleq \mathrm{round}(x_t) \qquad (6)$$

2 Methods

We propose a coding scheme where neurons can represent their activations as a temporally sparse series of impulses. The impulses from a given neuron encode a combination of the value and the rate of change of the neuron’s activation.

While our algorithm is designed to work efficiently with temporal data, we do not aim to learn temporal sequences in this work. We aim to efficiently approximate a function $f: x_t \mapsto y_t$, where the current target $y_t$ is solely a function of the current input $x_t$, and not of previous inputs $x_{t-\tau}$. The temporal redundancy between neighbouring inputs will however be used to make our approximate computation of this function more efficient.

2.1 Preliminary

Throughout this paper we will use the notation $(f \circ g)(x) \triangleq f(g(x))$ to denote function composition. We slightly abuse the notion of functions by allowing them to have an internal state which persists between calls. For example, we define the function $\Delta$ in Equation 1 as being the difference between the inputs in two consecutive calls (where the persistent variable $x_{\mathrm{last}}$ is initialized to 0). The function $\Sigma$, defined in Equation 2, returns a running sum of the inputs over calls. So we can write, for example, that when our composition of functions $\Sigma \circ \Delta$ is called with a sequence of input variables $x_1, \dots, x_t$, then $\Sigma \circ \Delta (x_t) = x_t$, because $\sum_{\tau=1}^{t} (x_\tau - x_{\tau-1}) = x_t$ (with $x_0 = 0$).

In general, when we write $y_t = f(x_t)$, where $f$ is a function with persistent state, it will be implied that we have previously called $f(x_\tau)$ for $\tau \in \{1, \dots, t-1\}$ in sequence.
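To make the stateful-function notation concrete, here is a minimal sketch in Python (ours, not the authors' released code) of $\Delta$ and $\Sigma$ as objects with persistent state, checking that $\Sigma \circ \Delta$ is the identity:

```python
# A minimal sketch of the stateful "functions" of Section 2.1:
# Delta (Equation 1) remembers the previous input, Sigma (Equation 2)
# keeps a running sum. Composing them recovers the identity.

class Delta:
    """Temporal difference with persistent state (Equation 1)."""
    def __init__(self):
        self.x_last = 0.0
    def __call__(self, x):
        out = x - self.x_last
        self.x_last = x
        return out

class Sigma:
    """Running sum over calls (Equation 2)."""
    def __init__(self):
        self.total = 0.0
    def __call__(self, x):
        self.total += x
        return self.total

if __name__ == "__main__":
    delta, sigma = Delta(), Sigma()
    xs = [0.3, 0.5, 0.4, 0.9]
    ys = [sigma(delta(x)) for x in xs]
    assert all(abs(x - y) < 1e-12 for x, y in zip(xs, ys))  # Sigma ∘ Delta = identity
```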

2.2 PD Encoding

Suppose a neuron has time-varying activation $x_t$. Taking inspiration from Proportional-Integral-Derivative (PID) controllers, we can "encode" this activation at each time step as a combination of its current activation and change in activation: $a_t \triangleq \mathrm{enc}(x_t) = k_p x_t + k_d (x_t - x_{t-1})$ (see Equation 4). The parameters $k_p$ and $k_d$ determine what portion of our encoding represents the value of the activation and the rate of change of that value, respectively. In Section 2.8, we will discuss the effect our choices for these parameters have on the network.

To get our decoding formula, we can simply solve for $x_t$ as $x_t = \frac{a_t + k_d\, x_{t-1}}{k_p + k_d}$ (Equation 5), such that $\mathrm{dec}(\mathrm{enc}(x_t)) = x_t$. Notice that Equation 5 corresponds to decaying the previous decoder state by the constant $\frac{k_d}{k_p + k_d}$ and then adding the input $\frac{a_t}{k_p + k_d}$. We can expand this recursively to see that this corresponds to a temporal convolution $x_t = (\kappa \ast a)_t$, where $\kappa$ is a causal exponential kernel: $\kappa_\tau = \frac{1}{k_p + k_d}\left(\frac{k_d}{k_p + k_d}\right)^{\tau}$ for $\tau \geq 0$ and $\kappa_\tau = 0$ otherwise.
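The encoder-decoder pair is easy to verify numerically. Below is a small sketch (our own, with arbitrary coefficient values) of Equations 4 and 5, checking that, absent quantization, $\mathrm{dec}(\mathrm{enc}(x_t)) = x_t$:

```python
# A sketch of the PD encoder (Equation 4) and its inverse decoder
# (Equation 5). Without quantization the decoder exactly inverts the
# encoder; the decoder is a leaky integrator with decay kd / (kp + kd).

class PDEncoder:
    def __init__(self, kp, kd):
        self.kp, self.kd, self.x_last = kp, kd, 0.0
    def __call__(self, x):
        a = self.kp * x + self.kd * (x - self.x_last)      # Equation 4
        self.x_last = x
        return a

class PDDecoder:
    def __init__(self, kp, kd):
        self.kp, self.kd, self.y_last = kp, kd, 0.0
    def __call__(self, a):
        y = (a + self.kd * self.y_last) / (self.kp + self.kd)  # Equation 5
        self.y_last = y
        return y

if __name__ == "__main__":
    enc, dec = PDEncoder(kp=0.1, kd=1.0), PDDecoder(kp=0.1, kd=1.0)
    xs = [0.2, 0.8, 0.7, 0.7, 0.1]
    recon = [dec(enc(x)) for x in xs]
    assert all(abs(x - r) < 1e-12 for x, r in zip(xs, recon))  # dec ∘ enc = identity
```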

2.3 Quantization

Our motivation for the aforementioned encoding scheme is that we now want to quantize our signal into a sparse representation, which will later be used to reduce computation. We quantize our encoded signal $a_t$ into a sparse, integer signal $s_t \triangleq Q(a_t)$, where the quantizer $Q$ is defined in Equation 3. Equation 3 implements a form of Sigma-Delta modulation, a method widely used in signal processing to approximately communicate signals at low bit-rates (Candy and Temes, 1962). We can show that $Q = \Delta \circ R \circ \Sigma$ (see Supplementary Material Section A), i.e. that $Q$ is equivalent to applying a temporal summation, a rounding, and a temporal difference, in series. When the magnitude of $a_t$ is small, we can expect $s_t$ to consist of mostly zeros with a few 1's and -1's.

We can now approximately reconstruct our original signal as $\hat{x}_t \triangleq \mathrm{dec}(s_t)$ by applying our decoder, as defined in Equation 5. As our coefficients $k_p$, $k_d$ become larger, our reconstructed signal $\hat{x}_t$ should become closer to the original signal $x_t$. We illustrate examples of encoded signals and their reconstructions for different $k_p$, $k_d$ in Figure 1.
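Putting Equations 3-5 together, the following sketch (ours; the input sequence and coefficient values are arbitrary) runs the full encode-quantize-decode round trip on a temporally redundant input, showing the sparse integer spike train alongside its reconstruction:

```python
# A sketch of the full encode -> quantize -> decode round trip of
# Section 2.3. Q (Equation 3) is a Sigma-Delta quantizer: it carries its
# rounding residual phi forward so the integer spike train tracks the
# encoded signal on average. Larger kp, kd give finer, less sparse output.

def pd_pipeline(xs, kp, kd):
    x_last = phi = y_last = 0.0
    spikes, recon = [], []
    for x in xs:
        a = kp * x + kd * (x - x_last); x_last = x      # enc (Equation 4)
        phi += a
        s = round(phi); phi -= s                        # Q   (Equation 3)
        y_last = (s + kd * y_last) / (kp + kd)          # dec (Equation 5)
        spikes.append(s); recon.append(y_last)
    return spikes, recon

if __name__ == "__main__":
    xs = [0.5, 0.5, 0.52, 0.5, 1.5, 1.5, 1.5, 0.2]      # temporally redundant
    spikes, recon = pd_pipeline(xs, kp=0.05, kd=2.0)
    print(spikes)                        # mostly zeros with a few +/- events
    print([round(r, 2) for r in recon])  # roughly tracks xs
```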

2.3.1 Special cases

We can write the entire reconstruction function compactly as $\hat{x}_t = \mathrm{dec} \circ \Delta \circ R \circ \Sigma \circ \mathrm{enc}\,(x_t)$.

$k_p = 0$: When $k_p = 0$, we get $\mathrm{enc}(x_t) = k_d\,\Delta(x_t)$ and $\mathrm{dec}(s_t) = \frac{1}{k_d}\Sigma(s_t)$, so our reconstruction reduces to $\hat{x}_t = \frac{1}{k_d}\,\Sigma \circ \Delta \circ R \circ \Sigma \circ \Delta\,(k_d\, x_t)$. Because $\Sigma$, $\Delta$, and scalar multiplication all commute with one another, and $\Sigma \circ \Delta$ is the identity, we can simplify this to $\hat{x}_t = \frac{1}{k_d} R(k_d\, x_t) = \mathrm{round}(k_d\, x_t)/k_d$, so our decoded signal is a rounded version of the input, with no dependence on past inputs. This is visible in the bottom row of Figure 1. This was the encoding scheme used in O'Connor and Welling (2016b).

$k_d = 0$: In this case, $\mathrm{enc}(x_t) = k_p x_t$ and $\mathrm{dec}(s_t) = s_t / k_p$, so our encoding-decoding process becomes $\hat{x}_t = \frac{1}{k_p}\,\Delta \circ R \circ \Sigma\,(k_p\, x_t)$. In this case neither our encoder nor our decoder has any memory, and we take no advantage of temporal redundancy.

Figure 1: An example signal $x_t$ (blue), encoded with the coefficients $k_p$, $k_d$ varied across rows and columns. $s_t$ (black) is the quantized signal produced by the successive application of encoding (Equation 4) and quantization (Equation 3). $\hat{x}_t$ (orange) is the reconstruction of $x_t$ produced by applying Equation 5 to $s_t$. One might, after a careful look at this figure, ask why we bother with the proportional ($k_p$) term at all. Figure 2 anticipates this question and answers it visually.
Figure 2: The problem with only sending changes in activation (i.e. $k_p = 0$) is that during training, weights change over time. In this example we generate random signals for a single scalar activation $x_t$ and scalar weight $w_t$. We efficiently approximate $z_t = x_t \cdot w_t$ with $\hat{z}_t$, as described in Section 2.4. As the weight $w_t$ changes over time, our estimate $\hat{z}_t$ diverges from the correct value. Introducing $k_p > 0$ allows us to bring our reconstruction back in line with the correct signal.

2.4 Sparse Communication Between Layers

The purpose of our encoding scheme is to reduce computation by sparsifying communication between layers of a neural network. Suppose we are trying to compute the pre-nonlinearity activation of the first hidden layer, $z_t \triangleq x_t \cdot w_t$, given the input activation $x_t$. We approximate $z_t$ as:

$$z_t \triangleq x_t \cdot w_t \;\approx\; \mathrm{dec}\big(Q(\mathrm{enc}(x_t))\big) \cdot w_t \;\approx\; \mathrm{dec}\big(Q(\mathrm{enc}(x_t)) \cdot w_t\big) \triangleq \hat{z}_t \qquad (7)$$

The first approximation comes from the quantization (Q) of the encoded signal, and the second from the fact that the weights change over time, as explained in Figure 2. The effects of these approximations are further explored in Section B of the Supplementary Material.

Computing $z_t$ directly takes $n_{\mathrm{in}} \cdot n_{\mathrm{out}}$ multiplications and $(n_{\mathrm{in}} - 1) \cdot n_{\mathrm{out}}$ additions. The cost of computing $\hat{z}_t$, on the other hand, depends on the contents of $s_t \triangleq Q(\mathrm{enc}(x_t))$. If the data is temporally redundant, $s_t$ should be sparse, with total magnitude $N \triangleq \sum_i |s_{t,i}|$.

$s_t$ can be decomposed into a sum of one-hot vectors

$$s_t = \sum_{n=1}^{N} \alpha_n\, \iota_{i_n}, \qquad \alpha_n \in \{-1, +1\},$$

where $\iota_{i_n}$ is a one-hot vector with element $i_n$ equal to 1. The matrix product $s_t \cdot w$ can then be decomposed into a series of row additions:

$$s_t \cdot w = \sum_{n=1}^{N} \alpha_n\, w_{i_n, \cdot} \qquad (8)$$

If we include the encoding, quantization, and decoding operations, our matrix product takes on the order of $n_{\mathrm{in}} + n_{\mathrm{out}}$ multiplications (in the encoder and decoder) and $N \cdot n_{\mathrm{out}}$ additions. Assuming the $N \cdot n_{\mathrm{out}}$ term dominates, we can say that the relative cost of computing $\hat{z}_t$ vs $z_t$ is:

$$\frac{\mathrm{cost}(\hat{z}_t)}{\mathrm{cost}(z_t)} \approx \frac{N}{n_{\mathrm{in}}} \qquad (9)$$
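To make Equation 8 concrete, here is a minimal sketch (ours, not the released pdnn code; `sparse_matvec` is a name of our own choosing) of the row-addition product, checked against the dense matrix product:

```python
import numpy as np

# A sketch of the sparse layer computation of Section 2.4: rather than a
# dense product x_t . w (n_in * n_out multiplications), we send the integer
# spike vector s_t = Q(enc(x_t)) and accumulate one row of w per unit of
# spike magnitude - additions only.

def sparse_matvec(s, w):
    """Compute s . w using row additions only (Equation 8)."""
    z = np.zeros(w.shape[1])
    for i in np.flatnonzero(s):
        for _ in range(abs(int(s[i]))):    # decompose s into unit impulses
            z += np.sign(s[i]) * w[i, :]   # one row addition per impulse
    return z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(4, 3))
    s = np.array([0, 2, 0, -1])            # a sparse integer spike vector
    assert np.allclose(sparse_matvec(s, w), s @ w)
```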

2.5 A Neural Network

We can implement this encoding scheme on every layer of a neural network. Given a standard neural net consisting of alternating linear ($w_l$) and nonlinear ($h_l$) operations, our network function can then be written as:

$$f(x) \triangleq (h_L \circ w_L \circ \cdots \circ h_1 \circ w_1)(x) \qquad (10)$$

Our approximate, spiking version of this network inserts the encoder, quantizer, and decoder around every weight multiplication (as in Equation 7):

$$\hat{f}(x) \triangleq (h_L \circ \mathrm{dec} \circ w_L \circ Q \circ \mathrm{enc} \circ \cdots \circ h_1 \circ \mathrm{dec} \circ w_1 \circ Q \circ \mathrm{enc})(x) \qquad (11)$$

We can use the same approach to approximately calculate the gradients used in training. If we define our layer activations as $z_l \triangleq \mathrm{dec}\big(Q(\mathrm{enc}(h_{l-1}(z_{l-1}))) \cdot w_l\big)$, with $h_0(z_0) \triangleq x$, and a loss $\mathcal{L}(f(x), y)$, where $\mathcal{L}$ is some loss function and $y$ is a target, we can backpropagate the approximate gradients as:

$$\frac{\partial \mathcal{L}}{\partial z_{l-1}} \approx \mathrm{dec}\!\left(Q\!\left(\mathrm{enc}\!\left(\frac{\partial \mathcal{L}}{\partial z_l}\right)\right) \cdot w_l^\top\right) \odot h_{l-1}'(z_{l-1}) \qquad (12)$$

On every layer of the forward and backward pass, our quantization scheme corrupts the signals that are being sent between layers. Nevertheless we find experimentally that this does not matter much to the performance of the network.

2.6 Parameter Updates

Figure 3: A visualization of our efficient update schemes from Section 2.6. Top: a scalar signal representing a presynaptic neuron activation $x_t$, its quantized version $\bar{x}_t$, and its reconstruction $\hat{x}_t$. Middle: another signal, representing the postsynaptic gradient of the error $e_t$, along with its quantized ($\bar{e}_t$) and reconstructed ($\hat{e}_t$) variants. Bottom: the true weight gradient $x_t \cdot e_t$ and the reconstruction gradient $\hat{x}_t \cdot \hat{e}_t$. At the time of a spike in $\bar{x}_t$ or $\bar{e}_t$, we have two schemes for efficiently computing the weight gradient that will be used to increment the weight (see Section 2.6). The past scheme computes the area under $\hat{x}_\tau \hat{e}_\tau$ since the last spike, and the future scheme computes the total future additional area due to the current spike.

There’s no use having an efficient backward pass if the parameter updates aren’t also efficient. In a normal neural network trained with backpropagation and simple stochastic gradient descent, the parameter update for weight matrix

has the form where is the learning rate. If connects layer to layer , we can write where is the presynaptic activation, is the postsynaptic (pre-nonlinearity) activation and is the outer product. So we pay multiplications to update the parameters for each sample.

We want a more efficient way to compute this product, which takes advantage of the sparsity of our encoded signals to reduce computation. We can start by applying our encoding-quantizing-decoding scheme to our input and error signals as and , and approximate our true update update as where and . This doesn’t do any good by itself, because the update rule still is not sparse. But, we can exactly compute the sum of this value over time using one of two sparse update schemes - past updates and future updates - which are depicted in Figure 3.

Past Updates: For a given synapse $w_{ij}$, if either the presynaptic neuron spikes ($\bar{x}_{i,t} \neq 0$) or the postsynaptic neuron spikes ($\bar{e}_{j,t} \neq 0$), we increment $w_{ij}$ by the total area under $\hat{x}_{i,\tau}\hat{e}_{j,\tau}$ since the last spike. We can do this efficiently because between the current time and the time of the previous spike, $\hat{x}_{i,\tau}\hat{e}_{j,\tau}$ is a geometric sequence. Given a known initial value $a$, final value $b$, and decay rate $r$, a geometric sequence sums to $\frac{a - b \cdot r}{1 - r}$. The area calculated is shown in pink on the bottom row of Figure 3, and one algorithm to calculate it is in Equation 13.
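As a quick sanity check of this identity (our own example, not from the paper): the three-term sequence $1, \tfrac{1}{2}, \tfrac{1}{4}$ has $a = 1$, $b = \tfrac{1}{4}$, $r = \tfrac{1}{2}$, and indeed

$$\frac{a - b \cdot r}{1 - r} = \frac{1 - \frac{1}{8}}{\frac{1}{2}} = \frac{7}{4} = 1 + \frac{1}{2} + \frac{1}{4}.$$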

Future Updates: Another approach is to calculate the present value of the future area under $\hat{x}_\tau \hat{e}_\tau$ arising from the current spike. This is depicted in the blue-gray area in Figure 3, and the formula is in Equation 14.

To simplify our expressions in the update algorithms, we re-parametrize our coefficients as $k_\alpha \triangleq \frac{k_d}{k_p + k_d}$ and $k_\beta \triangleq \frac{1}{k_p + k_d}$, so that the decoder of Equation 5 becomes $\hat{x}_t = k_\alpha \hat{x}_{t-1} + k_\beta \bar{x}_t$.

$$w_{ij} \leftarrow w_{ij} - \eta \sum_{\tau = t_{\mathrm{last}}+1}^{t} \hat{x}_{i,\tau}\,\hat{e}_{j,\tau} \qquad (13)$$

where $t_{\mathrm{last}}$ is the time of the previous spike, and the sum is evaluated in closed form with the geometric-sum identity above, since between spikes the product $\hat{x}\hat{e}$ decays geometrically with rate $k_\alpha^2$.

$$w_{ij} \leftarrow w_{ij} - \eta\, \frac{\hat{x}_{i,t}\,\hat{e}_{j,t} - \hat{x}^{-}_{i,t}\,\hat{e}^{-}_{j,t}}{1 - k_\alpha^2} \qquad (14)$$

where $\hat{x}^{-}_{i,t}$ and $\hat{e}^{-}_{j,t}$ denote the reconstructions just before the current spike is incorporated.
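To make the bookkeeping concrete, here is a minimal sketch of the past scheme (the indexing conventions are our own and not necessarily those of the released pdnn code): the dense accumulation of $\hat{x}_t \hat{e}_t$ is reproduced exactly by closed-form geometric sums applied only at spike times.

```python
import numpy as np

def dense_total(sx, se, ka, kb):
    """Accumulate xhat * ehat at every time step (the reference)."""
    xh = eh = total = 0.0
    for bx, be in zip(sx, se):
        xh = ka * xh + kb * bx
        eh = ka * eh + kb * be
        total += xh * eh
    return total

def lazy_total(sx, se, ka, kb):
    """Accumulate the same area, touching the synapse only at spike times."""
    r = ka ** 2                      # decay rate of the product between spikes
    xh = eh = total = 0.0
    t_last = -1
    for t in range(len(sx)):
        if sx[t] == 0 and se[t] == 0:
            continue
        gap = t - t_last - 1                        # silent steps since last event
        p = xh * eh
        total += p * r * (1 - r ** gap) / (1 - r)   # closed-form geometric area
        decay = ka ** (t - t_last)
        xh = decay * xh + kb * sx[t]                # catch up, then apply spikes
        eh = decay * eh + kb * se[t]
        total += xh * eh
        t_last = t
    gap = len(sx) - t_last - 1                      # flush the tail after the last event
    total += xh * eh * r * (1 - r ** gap) / (1 - r)
    return total

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    sx = rng.choice([0, 0, 0, 0, 1, -1], size=200)
    se = rng.choice([0, 0, 0, 0, 1, -1], size=200)
    assert abs(dense_total(sx, se, 0.9, 0.5) - lazy_total(sx, se, 0.9, 0.5)) < 1e-9
```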

2.7 Relation to STDP

An extremely attentive reader might have noted that Equation 14 has the form of an online implementation of Spike-Timing Dependent Plasticity (STDP). STDP (Markram et al., 2012) emerged from neuroscience, where it was observed that synaptic weight changes appeared to be functions of the relative timing of pre- and post-synaptic spikes. The empirically observed function usually has the double-exponential form seen on the rightmost plot of Figure 4.

Using the quantized input signal $\bar{x}_t$ and error signal $\bar{e}_t$, and their reconstructions $\hat{x}_t$ and $\hat{e}_t$ as defined in the last section, we define a causal convolutional kernel $\kappa_\tau \triangleq k_\beta\, k_\alpha^{\tau}$ for $\tau \geq 0$ (and $\kappa_\tau \triangleq 0$ for $\tau < 0$), so that $\hat{x}_t = (\kappa \ast \bar{x})_t$, and a pairwise timing kernel $\psi(\delta) \triangleq \frac{k_\beta^2}{1 - k_\alpha^2}\, k_\alpha^{|\delta|}$. The middle plot of Figure 4 is a plot of $-\psi$. We define our STDP update rule as:

$$w_{ij} \leftarrow w_{ij} - \eta \sum_{\tau} \bar{x}_{i,t-\tau}\;\bar{e}_{j,t}\;\psi(\tau) \qquad (15)$$

We note that while our version of STDP has the same double-exponential form as the classic STDP rule observed in neuroscience (Markram et al., 2012), we do not have the property that the sign of the weight change depends on whether the presynaptic spike preceded the postsynaptic spike.

In Section C in the supplementary material we show experimentally that while Equations 13, 14, and 15, along with the dense reconstruction update $\hat{x}_t \otimes \hat{e}_t$, may all result in different updates at different times, the rules are equivalent in that for a given set of pre/post-synaptic spikes $\bar{x}_t$, $\bar{e}_t$, the cumulative sums of their updates over time converge exactly.
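As a numerical check of this equivalence (under our reconstruction of the kernel $\psi$ above; the spike trains and constants here are arbitrary), summing $\hat{x}_t \hat{e}_t$ over time matches the pairwise double-exponential sum over spike pairs:

```python
import numpy as np

# Check: the time-summed reconstruction update equals a sum, over every
# pre/post spike pair, of a double-exponential function of their relative
# timing. A quiet tail lets the geometric series effectively run out.

T, ka, kb = 500, 0.9, 0.5
rng = np.random.default_rng(2)
sx = rng.choice([0, 0, 0, 1, -1], size=T).astype(float)
se = rng.choice([0, 0, 0, 1, -1], size=T).astype(float)
sx[300:] = 0
se[300:] = 0

xh = eh = 0.0
recon_total = 0.0
for t in range(T):                       # xhat, ehat: leaky integrations of spikes
    xh = ka * xh + kb * sx[t]
    eh = ka * eh + kb * se[t]
    recon_total += xh * eh

psi = lambda dt: kb ** 2 * ka ** abs(dt) / (1 - ka ** 2)   # double exponential
pre, post = np.flatnonzero(sx), np.flatnonzero(se)
stdp_total = sum(sx[a] * se[b] * psi(a - b) for a in pre for b in post)

assert abs(recon_total - stdp_total) < 1e-8
```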

Figure 4: Left: our STDP rule, when both the input and error spikes have the same sign. Middle: our STDP rule, when the input and error spikes have opposite signs. Right: the classic STDP rule (Markram et al., 2012), where the weight update is positive when a presynaptic spike precedes a postsynaptic spike, and negative otherwise.

2.8 Tuning $k_p$, $k_d$

The smaller the magnitude of a signal, the more severely it is distorted by our quantization-reconstruction scheme. We can see that scaling a signal $x_t$ by $K$ has the same effect on the quantized version of the signal as scaling $k_p$ and $k_d$ by $K$:

$$Q\big(\mathrm{enc}_{k_p, k_d}(K \cdot x_t)\big) = Q\big(\mathrm{enc}_{K k_p,\, K k_d}(x_t)\big).$$

The fact that the reconstruction quality depends on the signal magnitude presents a problem when training our network, because the error gradients tend to change in magnitude throughout training (they start large, and become smaller as the network learns). To keep our signals within the useful dynamic range of the quantizer, we apply a simple scheme to heuristically adjust $k_p$ and $k_d$ for the forward and backward passes separately, for each layer of the network. Instead of directly setting $k_p$, $k_d$ as hyperparameters, we fix their ratio and adapt their overall scale to the magnitude of the signal. Our update rule for the scale is:

$$\mu_t = (1 - \eta_k)\,\mu_{t-1} + \eta_k\,\overline{|x_t|}, \qquad k_t = \frac{1}{k_{\mathrm{rel}}\,\mu_t} \qquad (16)$$

where $\eta_k$ is the scale-adaptation learning rate, $\mu_t$ is a rolling average of the magnitude of signal $x_t$, and $k_{\mathrm{rel}}$ defines how coarse our quantization should be relative to the signal magnitude (higher means coarser). We can recover $k_p$ and $k_d$ for use in the encoders and decoders from the adapted scale $k_t$ and the fixed ratio $k_d / k_p$. In our experiments, the scale-adaptation hyperparameters are held fixed throughout training.
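As an illustration, the following sketch implements the adaptation behaviour described here (rolling magnitude average, fixed ratio, scale inversely proportional to magnitude); the class, parameter names, and default values are ours, not the paper's:

```python
import numpy as np

class KAdapter:
    """Adapt the overall scale of (kp, kd) to a signal's magnitude."""
    def __init__(self, k_ratio, k_rel, eta_k=0.001, mu0=1.0):
        self.k_ratio, self.k_rel = k_ratio, k_rel   # kd/kp ratio, coarseness
        self.eta_k, self.mu = eta_k, mu0            # adaptation rate, rolling |x|
    def step(self, x):
        self.mu = (1 - self.eta_k) * self.mu + self.eta_k * np.mean(np.abs(x))
        scale = 1.0 / (self.k_rel * self.mu)        # smaller signals -> larger k
        kp = scale / (1 + self.k_ratio)             # split scale by the fixed ratio
        kd = kp * self.k_ratio
        return kp, kd

if __name__ == "__main__":
    adapter = KAdapter(k_ratio=10.0, k_rel=0.1)
    kp, kd = adapter.step(np.array([0.3, -1.2, 0.7]))
```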

3 Experiments

To evaluate our network's ability to learn, we run it on the standard MNIST dataset, as well as a variant we created called "Temporal MNIST". Temporal MNIST is simply a reshuffling of the MNIST dataset so that similar inputs (in terms of L2 pixel distance) are put together. Figure 6 shows several snippets of consecutive frames in the Temporal MNIST dataset. We compare our Proportional-Derivative Net against a conventional Multi-Layer Perceptron with the same architecture (one hidden layer of 200 ReLU hidden units and a softmax output). The results are shown in Figure 5. Somewhat surprisingly, our predictor slightly outperformed the MLP, getting 98.36% on the test set vs 98.25% for the MLP. We assume this improvement is due to the regularizing effect of the quantization. On Temporal MNIST, our network was able to converge with less computation than it required for MNIST, but ended up with a slightly worse test score than the MLP (the PDNN got 97.99% vs 98.28% for the MLP). It is not clear why our network appeared to achieve a slightly worse score on temporal data; this will be a subject for future investigation.

Figure 5: Top row: results on MNIST. Bottom row: results on Temporal MNIST. Left column: the training and test scores as a function of epoch. Middle: the same scores with the number of computational operations on the x-axis; measured this way, our PDNN shifts to the left. Right: because our network computes primarily with additions rather than multiplications, when we weight our operation counts by the estimates of Horowitz (2014) for the energy costs of arithmetic operations (0.1pJ for 32-bit fixed-point addition vs 3.2pJ for multiplication), we see that our algorithm would be at an advantage on any hardware where arithmetic operations were the computational bottleneck.
Figure 6: Some samples from the Temporal-MNIST dataset. Each column shows a snippet of adjacent frames.

4 Related Work

There has been sparse but interesting work on merging the notions of spiking neural networks and deep learning. Diehl et al. (2015) found a way to efficiently map a trained neural network onto a spiking network. Lee et al. (2016) devised a method for training networks of integrate-and-fire spiking neurons with backpropagation - though their neurons did not send a temporal difference of their activations. O'Connor and Welling (2016a) created a method for training event-based neural networks - but their method took no advantage of temporal redundancy in the data. Binas et al. (2016) and O'Connor and Welling (2016b) both took the approach of sending quantized temporal changes to reduce computation on temporally redundant data, but their schemes could not be used to train a neural network. Bohte et al. (2000) showed how one could apply backpropagation for training spiking neural networks, but it was not obvious how to apply the method to non-spiking data. Zambrano and Bohte (2016) developed a spiking network with an adaptive scale of quantization (which bears some resemblance to our tuning scheme described in Section 2.8), and showed that the spiking mechanism is a form of Sigma-Delta modulation, which we also use here. Courbariaux et al. (2015) showed that neural networks could be trained with binary weights and activations (we only quantize activations). Bengio et al. (2015) found a connection between the classic STDP rule (Figure 4, right) and optimizing a dynamical neural network, although the way they arrived at an STDP-like rule was quite different from ours.

5 Discussion

We set out with the objective of reducing the computation in deep networks by taking advantage of temporal redundancy in data. We described a simple rule (Equation 4) for sparsifying the communication between layers of a neural network, in which our neurons communicate a combination of their temporal change in activation and the current value of their activation. We showed that it follows from this scheme that neurons should behave as leaky integrators (Equation 5). When we quantize our neural activations with Sigma-Delta modulation, a common quantization scheme in signal processing, we get something resembling a leaky integrate-and-fire neuron. We derived efficient update rules for the weights of our network, and showed these to be equivalent to a form of STDP - a learning rule first observed in neuroscience. Finally, we trained our network, verified that it does indeed compute more efficiently on temporal data, and showed that it performs about as well as a traditional deep network of the same architecture, but with significantly reduced computation.

Code is available at github.com/petered/pdnn.

Acknowledgments

This work was supported by Qualcomm, who we’d like to thank for sharing their past work with us. In addition, we’d like to thank our colleagues, especially Matthias Reisser and Changyong Oh, for some very useful discussions which contributed to this work.

References

Appendix A Sigma-Delta Unwrapping

Here we show that $Q = \Delta \circ R \circ \Sigma$, where $Q$, $\Sigma$, $R$, $\Delta$ are defined in Equations 3, 2, 6, 1, respectively.

From Equation 3 ($Q$) we can see that

$$s_t = \mathrm{round}(\phi_{t-1} + x_t), \qquad \phi_t = \phi_{t-1} + x_t - s_t.$$

Now we can unroll the recursion to get $\phi_{t-1} = \sum_{\tau=1}^{t-1} x_\tau - \sum_{\tau=1}^{t-1} s_\tau$, and use the fact that if $n$ is an integer then $\mathrm{round}(x + n) = \mathrm{round}(x) + n$, to say:

$$s_t = \mathrm{round}\!\left(\sum_{\tau=1}^{t} x_\tau - \sum_{\tau=1}^{t-1} s_\tau\right) = \mathrm{round}\!\left(\sum_{\tau=1}^{t} x_\tau\right) - \sum_{\tau=1}^{t-1} s_\tau \qquad (17)$$

By induction, $\sum_{\tau=1}^{t} s_\tau = \mathrm{round}\left(\sum_{\tau=1}^{t} x_\tau\right)$, at which point it is clear that $Q$ is identical to a successive application of a temporal summation, a rounding, and a temporal difference. That is why we say $Q = \Delta \circ R \circ \Sigma$.
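A quick numerical check of this unwrapping (our own sketch, plain Python): the stateful quantizer and the $\Delta \circ R \circ \Sigma$ composition produce identical spike trains:

```python
# The stateful Q of Equation 3 versus summing, rounding, and differencing
# in series (Delta o R o Sigma). Both are run on the same random input.

def q_stateful(xs):
    phi, out = 0.0, []
    for x in xs:
        phi += x
        s = round(phi)        # emit an integer, keep the residual
        phi -= s
        out.append(s)
    return out

def q_unwrapped(xs):
    running, prev, out = 0.0, 0, []
    for x in xs:
        running += x          # Sigma
        r = round(running)    # R
        out.append(r - prev)  # Delta
        prev = r
    return out

if __name__ == "__main__":
    import random
    random.seed(0)
    xs = [random.uniform(-1.5, 1.5) for _ in range(1000)]
    assert q_stateful(xs) == q_unwrapped(xs)
```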

Appendix B Scanning the K-space

Equation 7 shows how we make two approximations when approximating $z_t$ with $\hat{z}_t$. The first is the "nonstationary weight" approximation, arising from the fact that $w$ changes in time; the second is the "quantization" approximation, arising from the quantization of $x$. Here we do a small experiment in which we multiply a time-varying scalar signal $x_t$ with a time-varying weight $w_t$ for many different values of $k_p$, $k_d$, to understand the effects of these coefficients on our approximation error.

Figure 7: Top left: a time-varying signal $x_t$, the quantized signal $\bar{x}_t$, and the time-varying "weight" $w_t$. Bottom left: the true product of these signals compared with $\hat{x}_t \cdot w_t$, which shows the effects of the nonstationary-weight approximation, and with $\hat{z}_t$, which shows both approximations. Top middle: the cosine distance between the "true" signal and the approximation due to the nonstationary $w$, scanned over a grid of $k_p$, $k_d$ values. Top right: the cosine distance between the "true" signal and the approximation due to the quantization of $x$. Bottom middle: the cosine distance between the "true" signal and the full approximation described in Equation 7; this shows why we need both $k_p$ and $k_d$ to be nonzero. Bottom right: the number of weight lookups required to compute the full approximation.

Appendix C All roads lead to Rome

In Sections 2.6 and 2.7, we described four different update rules, and stated that while they do not necessarily produce the same updates at the same times, they produce the same result in the end. Here we demonstrate this empirically. We generate two random spike trains representing the input and the error signal to a single synapse. The bottom plot of Figure 8 shows our weight as a function of time as it drifts from its initial value.

Figure 8: A comparison of our different update methods. Top: a randomly generated presynaptic quantized signal $\bar{x}_t$, along with its reconstruction $\hat{x}_t$. Middle: a randomly generated postsynaptic quantized error signal $\bar{e}_t$, along with its reconstruction $\hat{e}_t$. Bottom: the cumulative weight update arising from our four update methods. "recon" is the dense update $\hat{x}_t \cdot \hat{e}_t$; "past" and "future" are described in Section 2.6, and "STDP" is described in Section 2.7.