Spiking neural networks (SNNs) are among the leading candidates for solving one of the major impediments to a more widespread use of modern AI: the energy consumption of the very large artificial neural networks (ANNs) that are needed. These ANNs have to be large, since they need a sufficiently large number of parameters in order to absorb enough information from the huge data sets on which they are trained, such as the 1.2 million images of ImageNet2012. Inference with these large ANNs is power hungry (Garcia-Martin2019), which impedes their deployment in mobile devices or autonomous vehicles. Spiking neurons have been in the focus of the development of novel computing hardware for AI with a drastically reduced energy budget, because the giant SNN of the brain, consisting of about 100 billion neurons, consumes just 20W (LingJ2001). Most spiking neuron models that are considered in neuromorphic hardware are in fact inspired by neurons in the brain. Their output is a train of stereotypical electrical pulses, called spikes. Hence their output is very different from the analog numbers that an ANN neuron produces as output.
But whereas large ANNs, trained with ever more sophisticated learning algorithms on giant data sets, approach, and sometimes exceed, human performance in several categories of intelligence, the performance of SNNs is lagging behind. There is some hope that this gap can be closed for the case of recurrent neural networks, since e-prop applied to recurrent SNNs with adapting neurons appears to capture most of the performance of BPTT applied to recurrent ANNs (Bellec2019). Furthermore, state-of-the-art performance is usually achieved in AI with relatively small LSTM networks or other recurrent ANNs. But the situation is different for feedforward networks. CNNs that achieve really good image classification performance tend to be quite deep and very large, and training corresponding deep and large feedforward SNNs directly does not appear to achieve a similar performance level. One attractive option for providing state-of-the-art image preprocessing for SNNs whose main goal may be to classify videos or control movements (for which the recurrent part of the SNN can be trained directly via e-prop) is to simply take a well-performing trained CNN for image classification, and to convert it into an SNN, using the same connections and weights. The most common, and so far best performing, conversion method is based on the idea of (firing-)rate coding, where the analog output of an ANN unit is emulated by the firing rate of a spiking neuron (Rueckauer2017). This method can readily be used for ANN units that are based on the ReLU (rectified linear) activation function. It has produced impressive results for professional benchmark tasks such as ImageNet, but a significant gap to the accuracy, latency, and throughput of ANN solutions has thwarted its practical application. Problems with the timing and precision of resulting firing rates on higher levels of the resulting SNNs have been cited as possible reasons for the loss in accuracy of the SNN.
In addition, the transmission of an analog value through a firing rate requires a fairly large number of time steps, which increases latency and reduces throughput for inference.
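The cost of rate coding in time steps can be illustrated with a deliberately simplified model. The sketch below is not the conversion scheme of (Rueckauer2017); it assumes a non-leaky integrate-and-fire neuron with threshold 1, a constant input x between 0 and 1, and reset by subtraction. The function name `rate_code` and its parameters are chosen for this illustration only.

```python
from fractions import Fraction


def rate_code(x, num_steps):
    """Emulate an analog value x in [0, 1] by a firing rate.

    Assumed toy model: a non-leaky integrate-and-fire neuron with
    threshold 1 receives the constant input x at every time step and
    is reset by subtracting the threshold after each spike.
    Exact rational arithmetic avoids floating-point threshold effects.
    """
    v = Fraction(0)
    spikes = 0
    for _ in range(num_steps):
        v += x                # integrate the constant input
        if v >= 1:
            v -= 1            # reset by subtraction
            spikes += 1
    return Fraction(spikes, num_steps)


x = Fraction(3, 10)
coarse = rate_code(x, 4)      # only multiples of 1/4 can be represented
fine = rate_code(x, 10)       # multiples of 1/10 can be represented
```

In this toy model the firing rate over `num_steps` steps can only take multiples of `1/num_steps`, so resolving N distinct values takes on the order of N time steps.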
We introduce a new ANN-to-SNN conversion method that we call FS-conversion because it requires a spiking neuron to spike just a few times (FS = Few Spikes). This method is very different from rate-based conversions, and structurally more similar to temporal coding, where the timing of a spike transmits extra information. However, most forms of temporal coding have turned out to be difficult to implement in a noise-robust and efficient manner in neuromorphic hardware. This arises from the difficulty of implementing delays with sufficiently high precision without sacrificing latency or throughput of the SNN, and the difficulty of designing spiking neurons that can efficiently process such a temporal code (maass1998; Thorpe2001; Rueckauer2017; Kheradpisheh2019). In contrast, FS-coding requires just unit delays for all connections and only on the order of log N spikes for transmitting integers between 1 and N. However, spikes at different times transmit different weights, which are collected in the non-leaky membranes of neurons in the next layer.
We will describe in the first subsection of Results the design of an FS-unit that emulates a ReLU gate of an ANN in this new ANN-to-SNN conversion. We then demonstrate the performance of the resulting SNN on the ImageNet2012 and CIFAR10 datasets.
1.1 FS-neuron model
The FS-conversion from ANNs to SNNs requires a variation of the standard spiking neuron model, to which we refer as FS-neuron. Assume that we want to emulate a ReLU neuron as depicted in Fig. A. Let us further assume for simplicity that its gate input $x$ takes values from $0, 1, \dots, 2^K-1$. We emulate this ReLU gate by a spiking neuron as shown in Fig. B. It has a membrane voltage $v(t)$ without leak. But this membrane voltage is reset to $v(t) - T(t)$ after a spike at time $t$, where $T(t)$ is its firing threshold at time $t$. We denote the spike train that this neuron produces by $z(t)$, i.e., $z(t) = 1$ if the neuron fires at step $t$, else $z(t) = 0$. Each time step of a non-spiking ReLU neuron in the given feedforward CNN is simulated by $K$ subsequent time steps of this FS-neuron. Its firing threshold decays exponentially during these $K$ time steps, $T(t) = 2^{K-t}$, while its initial value and its later value for $t > K$ are so large that it can only fire during these time steps $t = 1, \dots, K$. Expressed in formulas, the membrane potential evolves, starting from $v(1) = x$, according to

$$v(t+1) = v(t) - T(t)\, z(t),$$

and a spike at time $t$ is sent with weight $w\, T(t)$ to the next gate, if $w$ is the weight of the corresponding connection in the ANN. Thus its spike output $z(t)$ for gate input $x$ can be defined compactly by

$$z(t) = \Theta\bigl(v(t) - T(t)\bigr) = \Theta\Bigl(x - \sum_{j=1}^{t-1} T(j)\, z(j) - T(t)\Bigr), \qquad t = 1, \dots, K,$$

where $\Theta$ denotes the Heaviside step function, with $\Theta(0) = 1$. Hence the FS-neuron reproduces without error the output ReLU($x$) of the ReLU gate for any $x$ from $0, 1, \dots, 2^K-1$ in its spike output $z(1), \dots, z(K)$:

$$\sum_{t=1}^{K} T(t)\, z(t) = \mathrm{ReLU}(x).$$

Thus an FS-neuron on the next layer just has to collect these weighted spikes, multiplied with the weight of the corresponding synaptic connection in the CNN, in its non-leaking membrane potential until its firing threshold is lowered so that it can become active during the next $K$ time steps. In order to be able to transmit also non-integer values between $0$ and some arbitrary positive constant $\alpha$, one simply multiplies $T(t)$ with $\alpha 2^{-K}$. Then the FS-neuron reproduces ReLU($x$) for any non-negative $x$ less than $\alpha$ that is a multiple of $\alpha 2^{-K}$ without any error, and ReLU($x$) for values in between is rounded down to the next multiple of $\alpha 2^{-K}$. Thus the output of the FS-neuron deviates for $x$ in the range from $-\infty$ to $\alpha$ by at most $\alpha 2^{-K}$ from the output of the ReLU gate.
The resulting approximation is plotted for in Fig. 2. Note that the number of neurons in the CNN is not increased through the FS-conversion, nor the number of connections.
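The behaviour described in this subsection can be checked with a minimal simulation. The sketch assumes that the firing threshold halves at every step, so that the spike at step t of an active period of K steps carries the weight alpha * 2**(-t); this specific schedule, the function name `fs_relu`, and the parameter names `K` and `alpha` are assumptions of the illustration.

```python
def fs_relu(x, K, alpha):
    """Simulate one active period of a sketch FS-neuron.

    Assumptions of this sketch: the threshold at step t = 1..K is
    alpha * 2**(-t), the membrane has no leak, and a spike resets the
    membrane by subtracting the current threshold. Returns the spike
    train and the value that the weighted spikes transmit.
    """
    v = float(x)                  # membrane potential after input collection
    spikes = []
    output = 0.0
    for t in range(1, K + 1):
        threshold = alpha * 2.0 ** (-t)
        z = 1 if v >= threshold else 0
        v -= threshold * z        # subtractive reset, no leak
        output += threshold * z   # spike at step t carries weight `threshold`
        spikes.append(z)
    return spikes, output


# Integer inputs 0..7 with K = 3 and alpha = 2**K are reproduced exactly.
exact = [fs_relu(x, K=3, alpha=8)[1] for x in range(8)]

# A non-integer input is rounded down to a multiple of alpha * 2**(-K).
spikes, approx = fs_relu(0.3, K=4, alpha=1.0)
```

With these settings at most K spikes are emitted per neuron and input, negative inputs produce no spike at all, and the rounding error for inputs below `alpha` stays below `alpha * 2**(-K)`.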
The resulting SNN can be used in a pipelined manner, processing a new network input every time step. As soon as a layer has finished its active period, consisting of time steps, for one network input, it can collect inputs from the preceding layer for the next network input during the following time steps, and then actively process that next network input during the next time steps. Hence its throughput is much better than that of SNNs that result from rate-based ANN-to-SNN conversions, such as for example (Rueckauer2017; Sengupta2019). The Inception-v3 model in (Rueckauer2017) reports that the network needs 550 time steps to classify an image. Under the assumption that rate-based models profit only very little from pipelining, it is reasonable to estimate that the throughput of an SNN that results from FS-conversion with is roughly times higher.
The SNN resulting from the rate-based conversion of the ResNet34 model discussed in (Sengupta2019) has been reported to use time steps for a classification. Therefore we estimate that the throughput is increased here by a factor around through FS-conversion.
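The pipelined operation can be made concrete with a small scheduling sketch. It assumes that the active period and the input collection period of every layer both last K time steps, and that layers are indexed from 0; the function names and the parameter `K` are hypothetical and serve only this illustration.

```python
def active_window(layer, sample, K):
    """Time steps [start, end) in which `layer` actively processes `sample`.

    Assumed schedule: every layer alternates between K steps of
    collecting inputs and K steps of being active, and layer l
    collects exactly while layer l - 1 is active.
    """
    start = (2 * sample + layer) * K
    return start, start + K


def latency(num_layers, K):
    """Time steps until the last layer has finished the first input."""
    return active_window(num_layers - 1, 0, K)[1]


def steps_per_input(K):
    """Spacing between consecutive network inputs in this schedule."""
    return active_window(0, 1, K)[0] - active_window(0, 0, K)[0]


# Example with 3 layers and K = 4: layer 1 is active while layer 0
# already collects the pixels of the next input.
schedule = [active_window(layer, 0, 4) for layer in range(3)]
```

Under these assumptions the latency grows with the number of layers, while the spacing between consecutive inputs depends only on K. A rate-based network that is not pipelined and needs S time steps per input would then be slower by a factor of about S divided by `steps_per_input(K)`.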
1.2 Application of the FS-conversion to the classification of images from ImageNet
The ImageNet data set (Russakovsky2015) has become the most popular benchmark for image classification in machine learning (we are using here the ImageNet2012 version). This data set consists of training images and test images (both RGB images of different sizes), which are labeled with 1000 different categories. Classifying images from ImageNet is a nontrivial task even for a human, since this data set contains for example 59 categories for birds of different species and gender (van2015building). This may explain why the accuracy under a relaxed performance measure, where one records whether the target class is among the top 5 classifications that are proposed by the neural network ("Top5"), is typically much higher.
| Model | # params | ANN | SNN | # layers | # neurons | # spikes |
Note that the resulting SNN has a very high throughput, since it can classify a new image every time steps.
The accuracy of 75.22% for the ANN version of ResNet50 in Table 1 resulted from training a variant of ResNet50 where max pooling was replaced by average pooling, using the hyperparameters given in the TensorFlow repository. This accuracy is close to the best published performance of 76% for ResNet50 ANNs (Tan2019, Table 2). Note that the application of the FS-conversion to ResNet yields an SNN whose Top1 and Top5 performance is almost indistinguishable from the ANN version.
The best previous performance of an SNN on ImageNet was achieved by converting an Inception-v3 model (Szegedy2016) with a rate-based conversion scheme (Rueckauer2017). The reported test accuracy of the resulting SNN was 74.6%, where 550 time steps were used to simulate the model. Thus it can classify a new image only every 550 time steps, whereas the SNN that results from FS-conversion can classify a new image every time steps. Hence the application of FS-conversion to ResNet50 improves this result somewhat with regard to accuracy and substantially with regard to throughput.
1.3 Results for the classification of images from the CIFAR10 data set
The results for the ANN versions of ResNet that are given in Table 2 are the outcome of training them with the hyperparameters given in the TensorFlow models repository. They are very close to the best results reported in the literature. The best ResNet on CIFAR10 is ResNet110, where a test accuracy of 93.57% has been reported (He2016). Our ResNet50 achieves 92.99%, which is very close to the performance of the ResNet56 with 93.03%.
Spiking versions of ResNet20 have already been explored (Sengupta2019). Using a rate-based conversion scheme, an accuracy of 87.46% was reported. Compared to these results, FS-conversion yields a substantially higher accuracy while using only 80 to 500 time steps, depending on the model depth, instead of 2000, thereby reducing latency for inference. In addition, the throughput is drastically improved.
| Model | ANN | SNN | # neurons | # spikes |
Note that the number of spikes used to classify one image decreases significantly compared to rate-based conversion schemes. A converted ResNet11 has been reported to use more than 8 million spikes to classify a single test example (Lee2019). Comparing this to an FS-converted ResNet14 we find that the latter uses times fewer spikes despite being a slightly larger model.
Direct training of SNNs instead of a conversion scheme has been reported to result in fewer spikes being needed to perform a single classification. However, even a directly trained SNN version of ResNet11 uses times more spikes than an FS-conversion of ResNet14 (Lee2019, Table 8).
We have introduced a new method for converting ANNs to SNNs. Since the resulting SNN uses few spikes (FS) per neuron for inference, FS-conversion provides an interesting alternative to rate-based conversion. It requires a somewhat different spiking neuron model, but improves accuracy, latency, and especially the throughput of the resulting SNN. In fact, it arguably reaches the information-theoretic optimum for spike-based communication. As the number of spikes required for inference by an SNN is directly related to its energy consumption in spike-based neuromorphic hardware, the energy consumption of FS-converted SNNs appears to be close to the theoretical optimum of SNNs.
Since FS-conversion provides a tight bound on the number of time steps during which a spiking neuron is occupied, it can also be used for converting recurrently connected ANNs to SNNs. This requires, however, that a separate accumulator collects for each neuron its spike inputs from the simulation of the preceding time step of the recurrent ANN. Furthermore, similarly to the mathematically closely related AMOS conversion (stoeckl2019), one can extend the FS-conversion to handle also CNNs with more complex activation functions such as the Swish function (Tan2019), which assumes positive and negative values and can therefore not be converted to SNNs via rate coding.
The proposed method for generating highly performant SNNs for image classification offers an opportunity to combine the computationally more efficient and functionally more powerful training of ANNs with the superior energy efficiency of SNNs for inference. Note that one can also use the FS-converted SNN as initialization for subsequent direct training of the SNN for a more specific task. Altogether, our results suggest that spike-based hardware may gain an edge in the competition for the development of drastically more energy-efficient hardware for AI, since it promises to combine high energy efficiency and competitive performance with a versatility that hardware optimized for specific ANNs, such as a specific type of convolutional neural network, cannot offer.
We would like to thank Franz Scherr for helpful discussions. This research was partially supported by the Human Brain Project of the European Union (Grant agreement number 785907).