We train spiking deep networks using leaky integrate-and-fire (LIF) neurons, and achieve state-of-the-art results for spiking networks on the CIFAR-10 and MNIST datasets. This demonstrates that biologically-plausible spiking LIF neurons can be integrated into deep networks and can perform as well as other spiking models (e.g. integrate-and-fire). We achieved this result by softening the LIF response function, such that its derivative remains bounded, and by training the network with noise to provide robustness against the variability introduced by spikes. Our method is general and could be applied to other neuron types, including those used on modern neuromorphic hardware. Our work brings more biological realism into modern image classification models, with the hope that these models can inform how the brain performs this difficult task. It also provides new methods for training deep networks to run on neuromorphic hardware, with the aim of fast, power-efficient image classification for robotics applications.
Deep artificial neural networks (ANNs) have recently been very successful at solving image categorization problems. Early successes with the MNIST database [1, 2] were expanded to the more difficult but similarly sized CIFAR-10 dataset [3] and Street-view house numbers dataset [4]. More recently, many groups have achieved better results on these small datasets (e.g. [5]), as well as success on larger datasets (e.g. [6]). This work culminated with the application of deep neural networks to ImageNet [7], a very large and challenging dataset.
The relative success of deep ANNs in general, and convolutional neural networks in particular, on these datasets has put them well ahead of other methods in terms of image categorization by machines. The fact that deep ANNs are approaching human performance on some datasets (or even passing it, for example on MNIST) suggests that these models may be able to shed light on how the human visual system solves these same tasks.
There has recently been considerable effort to take deep ANNs and make them more biologically plausible by introducing neural “spiking” [8, 9, 10, 11, 12, 13], such that connected nodes in the network transmit information via instantaneous single bits (spikes), rather than transmitting real-valued activities. While one goal of this work is to better understand the brain by trying to reverse engineer it [9], another goal is to build energy-efficient neuromorphic systems that use a similar communication method for image categorization [12, 13].
We first train a network on static images using traditional deep learning techniques; we call this the static network. We then take the parameters (weights and biases) from the static network and use them to connect spiking neurons, forming the dynamic network (or spiking network). The challenge is to train the static network in such a way that a) it can be transferred into a spiking network, and b) the classification error of the dynamic network is as close to that of the static network as possible (this means the error rate is as low as possible, since we do not expect the dynamic network to perform better than the static one).
We base our network on that of Krizhevsky et al. [7], which achieved 11% error on the CIFAR-10 dataset (a larger variant of the model won the ImageNet 2012 competition). The original network consists of five layers: two generalized convolutional layers, followed by two locally-connected non-convolutional layers, followed by a fully-connected softmax classifier. A generalized convolutional layer consists of a set of convolutional weights followed by a neural nonlinearity, then a pooling layer, and finally a local response normalization layer. The locally-connected non-convolutional layers are also followed by a neural nonlinearity. In the case of the original network, the nonlinearity is a rectified linear (ReLU) function, and both pooling layers perform overlapping max-pooling. Code for the original network and details of the network architecture and training can be found at https://code.google.com/p/cuda-convnet2/.
To make the static network transferable to spiking neurons, a number of modifications are necessary. First, we remove the local response normalization layers. This computation would likely require some sort of lateral connections between neurons, which are difficult to add in the current framework since the resulting network would not be feedforward.
Second, we changed the pooling layers from max pooling to average pooling. Again, computing max pooling would likely require lateral connections between neurons, making it difficult to implement without significant changes to the training software. While the Neural Engineering Framework can be used to compute a max function in a feedforward manner [14], this method requires prohibitively many neurons to achieve reasonable accuracy. Average pooling, on the other hand, is very easy to compute in spiking neurons, since it is simply a weighted sum.
The other modifications—using leaky integrate-and-fire neurons and training with noise—are the main focus of this paper, and are described in detail below.
Our network uses a modified leaky integrate-and-fire (LIF) neuron nonlinearity instead of the rectified linear nonlinearity. Past work has kept the rectified linear nonlinearity for the static network and substituted in the spiking integrate-and-fire (IF) neuron model in the dynamic network [12, 13], since the static firing curve of the IF neuron model is a rectified line. Our motivations for using the LIF neuron model are that a) it is more biologically realistic than the IF neuron model [15, p. 338], and b) it demonstrates that alternative models can be used in such networks. The methods applied here are transferable to other neuron types, and could be used to train a network for the idiosyncratic neuron types employed by some neuromorphic hardware (e.g. [16]).
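The LIF neuron dynamics are given by the equation
$$\tau_{RC} \frac{dv(t)}{dt} = -v(t) + J(t) \qquad (1)$$
where $v(t)$ is the membrane voltage, $J(t)$ is the input current, and $\tau_{RC}$ is the membrane time constant. When the voltage reaches $V_{th} = 1$, the neuron fires a spike, and the voltage is held at zero for a refractory period of $\tau_{ref}$. Once the refractory period is finished, the neuron obeys Equation 1 until another spike occurs.
Given a constant input current $J(t) = j$, we can solve Equation 1 for the time it takes the voltage to rise from zero to one, and thereby find the steady-state firing rate
$$r(j) = \begin{cases} \left[ \tau_{ref} - \tau_{RC} \log\left( 1 - \frac{V_{th}}{j} \right) \right]^{-1} & \text{if } j > V_{th} \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$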
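As a rough numerical check of Equations 1 and 2, the sketch below integrates the LIF dynamics with a forward-Euler step and compares the measured spike rate with the analytic steady-state rate. The function names, the step size, and the parameter values (`tau_rc`, `tau_ref`) are illustrative assumptions rather than values taken from this paper; the threshold is set to one, matching the text.

```python
import math


def lif_rate(j, tau_rc=0.02, tau_ref=0.002, v_th=1.0):
    """Analytic steady-state LIF rate (Equation 2) for a constant current j."""
    if j <= v_th:
        return 0.0
    return 1.0 / (tau_ref - tau_rc * math.log(1.0 - v_th / j))


def simulate_lif(j, t_end=1.0, dt=1e-5, tau_rc=0.02, tau_ref=0.002, v_th=1.0):
    """Forward-Euler simulation of Equation 1; returns spikes per second."""
    v = 0.0
    refractory_left = 0.0
    spikes = 0
    for _ in range(int(round(t_end / dt))):
        if refractory_left > 0.0:
            # During the refractory period the voltage is held at zero.
            refractory_left -= dt
            continue
        v += dt * (-v + j) / tau_rc
        if v >= v_th:
            spikes += 1
            v = 0.0
            refractory_left = tau_ref
    return spikes / t_end
```

With these assumed parameters, calling `simulate_lif(2.0)` and `lif_rate(2.0)` should give rates that agree to within a few percent; the residual gap comes from the finite time step and the finite simulation length.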
Theoretically, we should be able to train a deep neural network using Equation 2 as the static nonlinearity and make a reasonable approximation of the network in spiking neurons, assuming that the spiking network has a synaptic filter that sufficiently smooths a spike train to give a good approximation of the firing rate. The LIF steady-state firing rate has the particular problem that its derivative approaches infinity as the input current approaches the firing threshold from above ($j \to V_{th}^{+}$), which causes problems when employing backpropagation. To address this, we added smoothing to the LIF rate equation.
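The unbounded derivative can be seen numerically. The sketch below differentiates the steady-state rate of Equation 2 in closed form and evaluates it at currents ever closer to the threshold. The derivative formula is worked out here from Equation 2, not quoted from the paper, and the parameter values are illustrative assumptions.

```python
import math


def lif_rate(j, tau_rc=0.02, tau_ref=0.002, v_th=1.0):
    """Analytic steady-state LIF rate (Equation 2)."""
    if j <= v_th:
        return 0.0
    return 1.0 / (tau_ref - tau_rc * math.log(1.0 - v_th / j))


def lif_rate_slope(j, tau_rc=0.02, tau_ref=0.002, v_th=1.0):
    """dr/dj for j > v_th, obtained by differentiating Equation 2.

    With r = 1 / (tau_ref - tau_rc * log(1 - v_th / j)), the chain rule gives
    dr/dj = r**2 * tau_rc * v_th / (j * (j - v_th)).
    Below threshold the rate is constant at zero, so the slope is zero.
    """
    if j <= v_th:
        return 0.0
    r = lif_rate(j, tau_rc, tau_ref, v_th)
    return r * r * tau_rc * v_th / (j * (j - v_th))


if __name__ == "__main__":
    for eps in (1e-2, 1e-4, 1e-6):
        print(eps, lif_rate_slope(1.0 + eps))
```

The slope grows without bound as `eps` shrinks, even though the rate itself goes to zero; this is the behaviour that makes plain gradient descent through the hard LIF curve unstable near threshold.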
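Equation 2 can be rewritten as
$$r(j) = \left[ \tau_{ref} + \tau_{RC} \log\left( 1 + \frac{V_{th}}{\rho(j - V_{th})} \right) \right]^{-1} \qquad (3)$$
where $\rho(x) = \max(x, 0)$. If we replace this hard maximum with a softer maximum $\rho_1(x) = \log(1 + e^{x})$, then the LIF neuron loses its hard threshold and the derivative becomes bounded. Further, we can use the substitution
$$\rho_2(x) = \gamma \log\left[ 1 + e^{x / \gamma} \right] \qquad (4)$$
to allow us control over the amount of smoothing, where $\rho_2(x) \to \max(x, 0)$ as $\gamma \to 0$. Figure 1 shows the result of this substitution.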
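A minimal sketch of the softened rate function follows, assuming the softer maximum of Equation 4 is the scaled softplus $\gamma \log(1 + e^{x/\gamma})$ substituted into Equation 3. The function names, the default `gamma`, and the time constants are hypothetical choices for illustration.

```python
import math


def softplus(x, gamma):
    """Scaled softplus gamma * log(1 + exp(x / gamma)), overflow-safe.

    Uses log(1 + e^z) = max(z, 0) + log1p(exp(-|z|)).
    """
    z = x / gamma
    return gamma * (max(z, 0.0) + math.log1p(math.exp(-abs(z))))


def soft_lif_rate(j, gamma=0.02, tau_rc=0.02, tau_ref=0.002, v_th=1.0):
    """Equation 3 with the hard maximum replaced by the scaled softplus."""
    p = softplus(j - v_th, gamma)
    if p == 0.0:
        # The softplus underflowed far below threshold; the rate is zero.
        return 0.0
    return 1.0 / (tau_ref + tau_rc * math.log1p(v_th / p))
```

Two properties are worth checking by hand: for small `gamma` the value at an above-threshold current approaches the hard LIF rate of Equation 2, and at the threshold itself the soft rate is positive with a finite slope.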
Training neural networks with various types of noise on the inputs is not a new idea. Denoising autoencoders [17] have been successfully applied to datasets like MNIST, learning more robust solutions with lower generalization error than their non-noisy counterparts.
In a spiking neural network, the neuron receiving spikes in a connection (called the post-synaptic neuron) actually receives a filtered version of each spike. This filtered spike is called a post-synaptic current (or potential), and the shape of this signal is determined by the combined dynamics of the pre-synaptic neuron (e.g. how much neurotransmitter is released) and the post-synaptic neuron (e.g. how many ion channels are activated by the neurotransmitter and how they affect the current going into the neuron). These post-synaptic current dynamics can be characterized relatively well as a linear system with the impulse response given by the $\alpha$-function [18]:
$$\alpha(t) = \frac{t}{\tau_s^{2}} e^{-t / \tau_s} \qquad (5)$$
The filtered spike train can be viewed as an estimate of the neuron activity. For example, if the neuron is firing regularly at 200 Hz, filtering the spike train will result in a signal fluctuating around 200 Hz. We can view the neuron output as being 200 Hz, with some additional “noise” around this value. By training our static network with some random noise added to the output of each neuron for each training example, we can simulate the effects of using spikes on the signal received by the post-synaptic neuron.
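The 200 Hz example can be reproduced with a short simulation. The sketch below assumes a unit-area alpha filter, implemented as two cascaded first-order low-pass stages (their convolution is the alpha-shaped impulse response), and feeds it a perfectly regular spike train. The time constant `tau_s`, the step size, and the function name are illustrative assumptions, and the discrete cascade only approximates the continuous filter.

```python
import math


def alpha_filtered_rate(rate_hz, tau_s=0.005, t_end=1.0, dt=1e-4, t_skip=0.1):
    """Filter a regular spike train; return post-transient samples in spikes/s."""
    decay = math.exp(-dt / tau_s)
    x1 = 0.0  # output of the first low-pass stage
    x2 = 0.0  # output of the second stage (the alpha-filtered signal)
    phase = 0.0
    samples = []
    n_steps = int(round(t_end / dt))
    n_skip = int(round(t_skip / dt))
    for n in range(n_steps):
        # Emit one unit-area spike each time the phase wraps around.
        phase += rate_hz * dt
        if phase >= 1.0:
            phase -= 1.0
            u = 1.0 / dt
        else:
            u = 0.0
        # Each stage has unit DC gain, so the mean output equals the mean rate.
        x1 = decay * x1 + (1.0 - decay) * u
        x2 = decay * x2 + (1.0 - decay) * x1
        if n >= n_skip:
            samples.append(x2)
    return samples
```

For `alpha_filtered_rate(200.0)` the mean of the samples sits close to 200 while individual samples swing above and below it; that swing is the "noise" the training procedure tries to imitate.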
Figure 2 shows how the variability of filtered spike trains depends on input current for the LIF neuron. Since the impulse response of the $\alpha$-filter has an integral of one, the mean of the filtered spike trains is equal to the analytical rate of Equation 2. However, the statistics of the filtered signal vary significantly across the range of input currents. Just above the firing threshold, the distribution is skewed towards higher firing rates (i.e. the median is below the mean), since spikes are infrequent so the filtered signal has time to return to near zero between spikes. At higher input currents, on the other hand, the distribution is skewed towards lower firing rates (i.e. the median is above the mean). In spite of this, we used a Gaussian distribution to generate the additive noise during training, for simplicity. We estimated the average standard deviation of the filtered signal, $\sigma$, across all positive input currents for the $\alpha$-filter. The final steady-state soft LIF curve used in training is given by
$$r(j) = \left[ \tau_{ref} + \tau_{RC} \log\left( 1 + \frac{V_{th}}{\rho(j - V_{th})} \right) \right]^{-1} + \eta(j) \qquad (6)$$
where
$$\eta(j) \sim \begin{cases} G(0, \sigma) & \text{if } j > V_{th} \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$
with $G(0, \sigma)$ denoting a zero-mean Gaussian with standard deviation $\sigma$, and $\rho(\cdot)$ is given by Equation 4.
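The training-time nonlinearity can be sketched as the soft LIF rate plus additive Gaussian noise. The version below assumes the noise is zero-mean with standard deviation `sigma` and is applied only to above-threshold inputs; that reading, the function names, and every default parameter value are assumptions made for illustration.

```python
import math
import random


def softplus(x, gamma):
    """Scaled softplus gamma * log(1 + exp(x / gamma)), overflow-safe."""
    z = x / gamma
    return gamma * (max(z, 0.0) + math.log1p(math.exp(-abs(z))))


def soft_lif_rate(j, gamma=0.02, tau_rc=0.02, tau_ref=0.002, v_th=1.0):
    """Noise-free soft LIF rate (Equation 3 with the softened maximum)."""
    p = softplus(j - v_th, gamma)
    if p == 0.0:
        return 0.0
    return 1.0 / (tau_ref + tau_rc * math.log1p(v_th / p))


def noisy_soft_lif_rate(j, sigma, rng, v_th=1.0):
    """Soft LIF rate with additive Gaussian noise for above-threshold inputs.

    Assumption: sub-threshold inputs receive no noise.
    """
    noise = rng.gauss(0.0, sigma) if j > v_th else 0.0
    return soft_lif_rate(j, v_th=v_th) + noise
```

A call such as `noisy_soft_lif_rate(2.0, 10.0, random.Random(0))` draws one noisy sample; averaged over many draws the output returns to the noise-free rate. Note that individual samples can dip below zero near threshold, which a real training loop would need to tolerate.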
Finally, we convert the trained static network to a dynamic spiking network. The parameters in the spiking network (i.e. weights and biases) are all identical to those of the static network. The convolution operation also remains the same, since convolution can be rewritten as simple connection weights (synapses) $w_{ij}$ between pre-synaptic neuron $i$ and post-synaptic neuron $j$. (How the brain might learn connection weight patterns, i.e. filters, that are repeated at various points in space, is a much more difficult problem that we will not address here.) Similarly, the average pooling operation can be written as a simple connection weight matrix, and this matrix can be multiplied by the convolutional weight matrix of the following layer to get direct connection weights between neurons. (For computational efficiency, we actually compute the convolution and pooling separately.)
The only component of the network that actually changes, then, when moving from the static to the dynamic network, is the neurons themselves. The most significant change is that we replace the soft LIF rate model (Equation 6) with the LIF spiking model (Equation 1). We also remove the additive Gaussian noise used in training.
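The claim that average pooling folds into the next layer's weights is just associativity of matrix multiplication. The toy sketch below uses a one-dimensional, non-overlapping pooling layer and a small dense weight matrix standing in for the following layer; the sizes, values, and helper names are made up for illustration.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(a, x):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in a]


def average_pool_matrix(n_in, width):
    """Weight matrix whose rows average `width` adjacent inputs."""
    n_out = n_in // width
    return [[1.0 / width if i * width <= k < (i + 1) * width else 0.0
             for k in range(n_in)]
            for i in range(n_out)]


if __name__ == "__main__":
    pool = average_pool_matrix(4, 2)
    weights = [[1.0, -2.0], [0.5, 3.0]]  # stand-in for the next layer
    x = [2.0, 4.0, 1.0, 3.0]             # activities of four input neurons
    print(matvec(weights, matvec(pool, x)))   # pool, then apply weights
    print(matvec(matmul(weights, pool), x))   # single combined weight matrix
```

Both print statements produce the same vector, so the pooling units never need to exist as separate neurons; the combined matrix gives direct neuron-to-neuron weights.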
Additionally, we add post-synaptic filters to the neurons, which filter the incoming spikes before passing the resulting currents to the LIF neuron equation. As stated previously, we use the $\alpha$-filter for our synapse model, since it both has strong biological support [18] and removes a significant portion of the high-frequency variation produced by spikes. We pick a decay time constant $\tau_s$ on the order of milliseconds, typical for excitatory AMPA receptors in the brain [19].
We tested our network on the CIFAR-10 dataset. This dataset is composed of 60000 labelled images of $32 \times 32$ pixels from ten categories. We used the first 50000 images for training and the last 10000 for testing, and augmented the dataset by taking random patches from the training images and then testing on the center patches from the testing images. This methodology is similar to Krizhevsky et al. [7], except that they also used multiview testing where the classifier output is the average output of the classifier run on nine random patches from each testing image (increasing the accuracy by about 2%).
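The patch-based augmentation can be sketched with plain nested lists. The patch size used in the usage note below (24) is a hypothetical value chosen for illustration, since the size is not given in the text above; the helper names are likewise assumptions.

```python
import random


def crop(image, top, left, size):
    """Return the size-by-size window whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def random_patch(image, size, rng):
    """Training-time augmentation: a uniformly random window of the image."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)   # randint is inclusive at both ends
    left = rng.randint(0, width - size)
    return crop(image, top, left, size)


def center_patch(image, size):
    """Test-time crop: the window centred in the image."""
    height, width = len(image), len(image[0])
    return crop(image, (height - size) // 2, (width - size) // 2, size)
```

For a 32-by-32 image, `random_patch(image, 24, random.Random(0))` yields one of 81 possible windows, while `center_patch(image, 24)` always starts four pixels in from the top and left edges.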
Table 1 shows the effect of each modification on the network classification error. Our original static network based on the methods of [7] achieved 14.63% error, which is higher than the 11% achieved by the original paper since a) we are not using multiview testing, and b) we used a shorter training time (160 epochs versus 520 epochs).
Rows 1-5 in Table 1 show that each successive modification to make the network amenable to running in spiking neurons adds about 1-2% more error. Despite the fact that training with noise adds additional error to the static network, rows 6-8 of the table show that in the spiking network, training with noise pays off, though training with too much noise is not advantageous. Specifically, though training with the higher noise level rather than the lower one decreased the error introduced when switching to spiking neurons (1% versus 2%), it introduced an additional 2.5% error to the static network, making the final spiking network perform worse. In the interest of time, these spiking networks were all run on the same 1000-image random subset of the testing data. The last two rows of the table show the network with the optimal amount of noise trained for additional epochs (a total of 520 as opposed to 160), and run on the entire test set. Our spiking network achieves an error of 17.05% on the full CIFAR-10 test set, which is the best published result of a spiking network on this dataset.
Comparing spiking networks is difficult, since the results depend highly on the characteristics of the neurons used. For example, neurons with very high firing rates, when filtered, will result in spiking networks that behave almost identically to their static counterparts. Neurons with lower firing rates have much more variability in their filtered spike trains, resulting in noisier and less accurate dynamic networks. Nevertheless, we find it worthwhile to compare our results with those of Cao et al. [12], who achieved 22.57% error on the CIFAR-10 dataset (as far as we know, the only other spiking network with published results on CIFAR-10). Our approach is in many ways similar to theirs, with the notable differences that we used the LIF neuron instead of the IF neuron, and that we used noise during training. The fact that we achieved marginally better results suggests that LIF neuron spiking networks can be trained to state-of-the-art accuracy and that adding noise during training helps improve accuracy.
We also measured the spike rates for our network on the CIFAR-10 dataset. The average firing rate across all neurons in the network was 148 spikes/s, estimated from 20 test examples (30 seconds of simulated time). The firing rate was quite different between different layers of the network, with the first two convolutional layers having average firing rates of 172 spikes/s and 104 spikes/s respectively, and the locally connected layers having rates of 10.6 spikes/s and 7.6 spikes/s respectively. Cao et al. [12] reported the number of post-synaptic spikes for their network on the Tower dataset. We used this, along with their simulation time (100 ms) and number of neurons (57606), to estimate the average firing rate of their network at 86.8 spikes/s. However, their firing rates on the CIFAR-10 dataset could be significantly different, since firing rates are heavily dependent on learned model parameters (i.e. weights and biases) which can vary significantly between datasets.
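The firing-rate estimate for the network of Cao et al. is a single division: total spikes over the product of neuron count and simulated time. Because the spike count itself does not appear in the text above, the sketch below works backwards from the stated rate, neuron count, and simulation time to the count they imply; the function names are illustrative.

```python
def average_firing_rate(total_spikes, n_neurons, sim_time_s):
    """Mean firing rate per neuron, in spikes per second."""
    return total_spikes / (n_neurons * sim_time_s)


def implied_spike_count(rate_hz, n_neurons, sim_time_s):
    """Total spike count consistent with a given mean rate."""
    return rate_hz * n_neurons * sim_time_s


if __name__ == "__main__":
    # Figures quoted in the text: 86.8 spikes/s, 57606 neurons, 100 ms.
    count = implied_spike_count(86.8, 57606, 0.1)
    print(count)
    print(average_firing_rate(count, 57606, 0.1))
```

The implied count comes out at roughly half a million spikes, which is only a consistency check on the quoted figures and not a value reported by either paper.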
Most spiking deep networks to date have been tested on the MNIST dataset. The MNIST dataset is composed of 70000 labelled hand-written digits, with 60000 used for training and 10000 reserved for testing. While this dataset is quickly becoming obsolete as deep networks become more and more powerful, it is only recently that spiking networks are beginning to achieve human-level accuracy on the dataset.
We trained an earlier version of our network on the MNIST dataset. This version used layer-wise pretraining of non-convolutional denoising autoencoders, stacked and trained as a classifier. This network had two hidden layers of 500 and 200 nodes each, and was trained on the unaugmented dataset. Despite the significant differences between this network and the network used on the CIFAR-10 dataset, both networks use spiking LIF neurons and are trained with noise to minimize the error caused by the filtered spike train variation. Table 2 shows a comparison between our network and the best published results on MNIST. Our network significantly outperforms the best results using LIF neurons, and is on par with those of IF neurons. This demonstrates that state-of-the-art networks can be trained with LIF neurons. The average firing rate of this network is 25.7 spikes/s, with the hidden layers averaging 23.0 spikes/s and 32.5 spikes/s, respectively.
Our results demonstrate that it is possible to train accurate deep convolutional networks for image classification using more biologically accurate leaky integrate-and-fire (LIF) neurons, as opposed to the traditional rectified-linear or sigmoid neurons. Such a network can be run in spiking neurons, and training with noise decreases the amount of error introduced when running in spiking versus rate neurons.
The first main contribution of this paper is to demonstrate that state-of-the-art deep spiking networks can be trained with LIF neurons. Other state-of-the-art methods use integrate-and-fire (IF) neurons [12, 13], which are easier to fit to the rectified linear units commonly used in deep convolutional networks, but are biologically implausible. By smoothing the LIF response function so that its derivative remains bounded, we are able to use this more biologically plausible neuron with a standard convolutional network trained by backpropagation.
This idea of smoothing the neuron response function is applicable to other neuron types, too. Many other neuron types have discontinuous response functions (e.g. the FitzHugh-Nagumo neuron), and our smoothing method allows such neurons to be used in deep convolutional networks. We found that there was very little error introduced by switching from the soft response function to the hard response function with LIF neurons for the amount of smoothing that we used. However, for neurons with harsh discontinuities that require more smoothing, it may be necessary to slowly relax the smoothing over the course of the training so that, by the end of the training, the smooth response function is arbitrarily close to the hard response function.
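One way to "slowly relax the smoothing" would be to shrink the smoothing parameter on a fixed schedule as training proceeds. The geometric schedule below is a hypothetical illustration of that idea, not a procedure used in this paper; the function name and the start and end values are assumptions.

```python
def gamma_schedule(epoch, n_epochs, gamma_start=0.1, gamma_end=0.001):
    """Geometrically interpolate the smoothing parameter across epochs.

    Epoch 0 uses gamma_start and the final epoch uses gamma_end, so the
    soft response function approaches the hard one by the end of training.
    """
    if n_epochs <= 1:
        return gamma_end
    frac = epoch / (n_epochs - 1)
    return gamma_start * (gamma_end / gamma_start) ** frac
```

A geometric rather than linear decay is chosen here so that the parameter changes by the same factor every epoch, which keeps the relative change in the response function roughly even across training.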
The other main contribution of this paper is to demonstrate that training with noise on neuron outputs can decrease the error introduced when transitioning to spiking neurons. Training with noise on neuron outputs improved the performance of the spiking network considerably (the error decreased by 3.4%). This is because noise on the output of the neuron simulates the variability that a spiking network encounters when filtering a spike train. There is a tradeoff between too little training noise, where the resultant dynamic network is not robust enough against spiking variability, and too much noise, where the accuracy of the static network is decreased. Since the variability produced by spiking neurons is not Gaussian (Figure 2), our additive Gaussian noise is a rough approximation of the variability that the spiking network will encounter. Future work includes training with noise that is more representative of the variability seen in spiking networks, to accommodate both the non-Gaussian statistics at any particular input current, and the changing statistics across input currents.
Direct comparison with other spiking neural networks is difficult, since the amount of error introduced when converting from a static to a spiking network is heavily dependent on the firing rates of the neurons. Nevertheless, we found our network to compare favourably with other spiking networks, achieving the best published result for a spiking network on CIFAR-10, and the best result for a LIF neuron spiking network on MNIST. We also report our average firing rates for each layer and for the entire network, to facilitate comparison with future networks. The firing rates for the convolutional layers of our network are higher than typical in visual cortex [21]. Future work includes looking at methods to lower firing rates, though this may involve sparsification of neural firing (having fewer neurons fire for a particular stimulus), which can be difficult in convolutional networks.
Other future work includes implementing max-pooling and local contrast normalization layers in spiking networks. Networks could also be trained offline as described here and then fine-tuned online using an STDP rule, such as the one described in [22], to help further reduce errors associated with converting from rate-based to spike-based networks, while avoiding difficulties with training a network in spiking neurons from scratch.
A. Krizhevsky, “Convolutional deep belief networks on CIFAR-10,” Tech. Rep., 2010.
International Conference on Pattern Recognition (ICPR), 2012.
International Conference on Artificial Intelligence and Statistics (AISTATS), vol. 38, 2015, pp. 562–570.
E. Neftci, S. Das, B. Pedroni, K. Kreutz-Delgado, and G. Cauwenberghs, “Event-driven contrastive divergence for spiking neuromorphic systems,” Frontiers in Neuroscience, vol. 7, no. 272, 2013.
International Journal of Computer Vision, vol. 113, no. 1, pp. 54–66, Nov. 2014.
International Conference on Machine Learning (ICML), 2008, pp. 1096–1103.