Comparing SNNs and RNNs on Neuromorphic Vision Datasets: Similarities and Differences
Neuromorphic data, which record frameless spike events, have attracted considerable attention for their spatiotemporal information components and event-driven processing fashion. Spiking neural networks (SNNs) represent a family of event-driven models with spatiotemporal dynamics for neuromorphic computing, and they are widely benchmarked on neuromorphic data. Interestingly, researchers in the machine learning community may argue that recurrent (artificial) neural networks (RNNs) also have the capability to extract spatiotemporal features, although they are not event-driven. Thus, the question of what will happen if we benchmark these two kinds of models together on neuromorphic data arises but remains open. In this work, we make a systematic study comparing SNNs and RNNs on neuromorphic data, taking the vision datasets as a case study. First, we identify the similarities and differences between SNNs and RNNs (including the vanilla RNNs and LSTM) from the modeling and learning perspectives. To improve comparability and fairness, we unify the supervised learning algorithm based on backpropagation through time (BPTT), the loss function exploiting the outputs at all timesteps, the network structure with stacked fully-connected or convolutional layers, and the hyper-parameters during training. Especially, given the mainstream loss function used in RNNs, we modify it, inspired by the rate coding scheme, to approach that of SNNs. Furthermore, we tune the temporal resolution of the datasets to test model robustness and generalization. Finally, a series of contrast experiments are conducted on two types of neuromorphic datasets: DVS-converted (N-MNIST) and DVS-captured (DVS Gesture).
Neuromorphic vision datasets [1, 2, 3] sense the dynamic change of pixel intensity and record the resulting spike events using dynamic vision sensors (DVS) [4, 5, 6, 7]. Compared to conventional frame-based vision datasets, frameless neuromorphic vision datasets have rich spatiotemporal components produced by the interaction of spatial and temporal information, and they follow an event-driven processing fashion triggered by binary spikes. Owing to these unique features, neuromorphic data have attracted considerable attention in many applications such as visual recognition [8, 9, 10, 11, 12, 13, 14], motion segmentation, tracking control [16, 17, 18], and robotics. Currently, there are two types of neuromorphic vision datasets: one is converted from static datasets by scanning each image in front of DVS cameras, e.g. N-MNIST and CIFAR10-DVS; the other is directly captured by DVS cameras from moving objects, e.g. DVS Gesture.
Spiking neural networks (SNNs), inspired by brain circuits, represent a family of models for neuromorphic computing. Each neuron in an SNN model updates its membrane potential based on its memorized state and current inputs, and fires a spike when the membrane potential crosses a threshold. Spiking neurons communicate with each other using binary spike events rather than the continuous activations of artificial neural networks (ANNs), so an SNN model carries both spatial and temporal information. The rich spatiotemporal dynamics and event-driven paradigm of SNNs hold great potential for efficiently handling complex tasks such as spike pattern recognition [8, 21], optical flow estimation, and simultaneous localization and mapping (SLAM), which motivates their wide deployment on low-power neuromorphic devices [24, 25, 26]. Since the behaviors of SNNs naturally match the characteristics of neuromorphic data, a considerable amount of literature benchmarks the performance of SNNs on neuromorphic datasets [12, 27, 28, 29].
Originally, neuromorphic computing and machine learning were two domains developing in parallel, usually independent of each other. This situation is changing as more interdisciplinary research emerges [26, 30, 29]. In this context, researchers in the machine learning community may argue that SNNs are not unique in the processing of neuromorphic data. The reason is that recurrent (artificial) neural networks (RNNs) can also memorize previous states and exhibit spatiotemporal dynamics, even though they are not event-driven. By treating the spike events as normal binary values in {0, 1}, RNNs are able to process neuromorphic datasets too. In essence, RNNs have been widely applied in many tasks with timing sequences such as language modeling, speech recognition, and machine translation; whereas few studies evaluate the performance of RNNs on neuromorphic data, so the mentioned debate remains open.
In this work, we try to answer what will happen when benchmarking SNNs and RNNs together on neuromorphic data, taking the vision datasets as a case study. First, we identify the similarities and differences between SNNs and RNNs from the modeling and learning perspectives. For comparability and fairness, we unify several things: i) the supervised learning algorithm, based on backpropagation through time (BPTT); ii) the loss function, inspired by the SNN-oriented rate coding scheme; iii) the network structure, based on stacked fully-connected (FC) or convolutional (Conv) layers; iv) the hyper-parameters during training. Moreover, we tune the temporal resolution of the neuromorphic vision datasets to test model robustness and generalization. Finally, we conduct a series of contrast experiments on typical neuromorphic vision datasets and provide extensive insights. Our work holds potential for guiding model selection on different workloads and stimulating the invention of novel neural models. For clarity, we summarize our contributions as follows:
We present the first work that systematically compares SNNs and RNNs on neuromorphic datasets.
We identify the similarities and differences between SNNs and RNNs, and unify the learning algorithm, loss function, network structure, and training hyper-parameters to ensure the comparability and fairness. Especially, we modify the mainstream loss function of RNNs to approach that of SNNs and tune the temporal resolution of neuromorphic vision datasets to test model robustness and generalization.
On two kinds of typical neuromorphic vision datasets: DVS-converted (N-MNIST) and DVS-captured (DVS Gesture), we conduct a series of contrast experiments that yield extensive insights regarding recognition accuracy, feature extraction, temporal resolution and contrast, learning generalization, computational complexity and parameter volume (detailed in Section 4 and summarized in Section 5), which are beneficial for future model selection and construction.
The rest of this paper is organized as follows: Section 2 introduces preliminaries of neuromorphic vision datasets, SNNs, and RNNs; Section 3 details our methodology for making SNNs and RNNs comparable and ensuring fairness; Section 4 presents the experimental results and our insights; finally, Section 5 concludes the paper with a discussion.
A neuromorphic vision dataset consists of many spike events, which are triggered by the intensity change (increase or decrease) of each pixel in the sensing field of the DVS camera [4, 5, 18]. A DVS camera records the spike events in two channels according to the change direction, e.g. the On channel for intensity increase and the Off channel for intensity decrease. The whole spike train in a neuromorphic vision dataset can be represented as an $H \times W \times 2 \times T$ sized spike pattern, where $H$ and $W$ are the height and width of the sensing field, respectively, $T$ stands for the length of recording time, and "2" indicates the two channels. As mentioned in the Introduction, there are currently two types of neuromorphic vision datasets, DVS-converted and DVS-captured, which are detailed below.
DVS-Converted Dataset. Generally, DVS-converted datasets are converted from frame-based static image datasets. The spike events in a DVS-converted dataset are acquired by scanning each image with repeated closed-loop smooth (RCLS) movement in front of a DVS camera [34, 3], where the movement incurs pixel intensity changes. Figure 1 illustrates a DVS-converted dataset named N-MNIST. The original MNIST dataset includes 60,000 static images of handwritten grayscale digits for training and an extra 10,000 for testing; accordingly, the DVS camera converts each image in MNIST into a spike pattern in N-MNIST. The slightly larger sensing field is caused by the RCLS movement. Compared to the original frame-based static image dataset, the converted frameless dataset contains additional temporal information while retaining similar spatial information. Nevertheless, the extra temporal information cannot become dominant due to the static information source, and some works even point out that DVS-converted datasets might not be good enough to benchmark SNNs [35, 29].
DVS-Captured Dataset. In contrast, DVS-captured datasets generate spike events via natural motion rather than the simulated movement used in the generation of DVS-converted datasets. Figure 2 depicts a DVS-captured dataset named DVS Gesture. There are 11 hand and arm gestures performed by one subject in each trial, and there are 122 trials in total in the dataset. Three lighting conditions, including natural light, fluorescent light, and LED light, are selected to control the effects of shadow and flicker on the DVS camera, broadening the data distribution. Different from DVS-converted datasets, both the temporal and spatial information in DVS-captured datasets are essential components.
There are several different spiking neuron models such as leaky integrate-and-fire (LIF), Izhikevich, and Hodgkin-Huxley, among which LIF is the most widely used in practice due to its lower complexity. In this work, we adopt the mainstream solution, i.e. using LIF for neuron simulation. By connecting many spiking neurons through synapses, we can construct an SNN model. In this paper, for simplicity, we consider only feedforward SNNs with stacked FC or Conv layers.
There are two state variables in a LIF neuron: the membrane potential ($u$) and the output activity ($o$). $u$ is a continuous value while $o$ can only be a binary value, i.e. firing a spike or not. The behaviors of an SNN layer can be described as

$$\tau \frac{\mathrm{d}u_i^l(t)}{\mathrm{d}t} = -u_i^l(t) + \sum_j w_{ij}^l\, o_j^{l-1}(t), \quad (1)$$

where $t$ denotes time, $l$ and $i$ are indices of layer and neuron, respectively, $\tau$ is a time constant, and $w^l$ is the synaptic weight matrix between two adjacent layers. The neuron fires a spike and resets $u_i^l$ to zero only when $u_i^l$ exceeds a firing threshold ($u_{th}$); otherwise, the membrane potential just leaks. Notice that $o^0$ denotes the network input.
In this work, the term RNNs mainly refers to recurrent ANNs rather than SNNs. We select two kinds of RNN models: one is the vanilla RNN and the other is the modern RNN named long short-term memory (LSTM).
Vanilla RNN. RNNs introduce temporal dynamics via recurrent connections. There is only one continuous state variable in a vanilla RNN neuron, called the hidden state ($h$). The behaviors of a vanilla RNN layer follow

$$h^{t,l} = \varphi\!\left(W^l h^{t,l-1} + U^l h^{t-1,l} + b^l\right), \quad (2)$$

where $t$ and $l$ denote the indices of timestep and layer, respectively, $W^l$ is the weight matrix between two adjacent layers, $U^l$ is the intra-layer recurrent weight matrix, and $b^l$ is a bias vector. $\varphi$ is an activation function, which is generally the $\tanh$ function for vanilla RNNs. Similar to the $o^0$ for SNNs, $h^{t,0}$ also denotes the network input of RNNs, i.e. $h^{t,0} = x^t$.
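The layer update above can be sketched in a few lines of NumPy; the toy sizes, weight scales, and sparse binary inputs are illustrative assumptions, not settings from this work:

```python
import numpy as np

def rnn_step(h_prev, x_in, W, U, b):
    """One timestep of a vanilla RNN layer:
    h^t = tanh(W x^t + U h^{t-1} + b),
    with W the feedforward weights and U the trainable recurrent weights."""
    return np.tanh(W @ x_in + U @ h_prev + b)

# Toy usage: a 3-unit layer driven by 4 binary inputs over 5 timesteps.
rng = np.random.default_rng(1)
W = rng.normal(0.0, 0.5, (3, 4))
U = rng.normal(0.0, 0.5, (3, 3))
b = np.zeros(3)
h = np.zeros(3)
for t in range(5):
    x = (rng.random(4) < 0.3).astype(float)  # sparse binary events
    h = rnn_step(h, x, W, U, b)
```

Note that the recurrent term `U @ h_prev` mixes all neurons of the layer, i.e. the cross-recurrence discussed later.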
Long Short-Term Memory (LSTM). LSTM is designed to improve the long-term temporal dependence over vanilla RNNs by introducing complex gates to alleviate gradient vanishing [40, 41]. An LSTM layer can be described as

$$\begin{aligned} f^{t,l} &= \sigma\!\left(W_f^l h^{t,l-1} + U_f^l h^{t-1,l} + b_f^l\right),\\ i^{t,l} &= \sigma\!\left(W_i^l h^{t,l-1} + U_i^l h^{t-1,l} + b_i^l\right),\\ o^{t,l} &= \sigma\!\left(W_o^l h^{t,l-1} + U_o^l h^{t-1,l} + b_o^l\right),\\ g^{t,l} &= \tanh\!\left(W_g^l h^{t,l-1} + U_g^l h^{t-1,l} + b_g^l\right),\\ c^{t,l} &= f^{t,l} \odot c^{t-1,l} + i^{t,l} \odot g^{t,l},\\ h^{t,l} &= o^{t,l} \odot \tanh\!\left(c^{t,l}\right), \end{aligned} \quad (3)$$

where $t$ and $l$ denote the indices of timestep and layer, respectively; $f^{t,l}$, $i^{t,l}$, and $o^{t,l}$ are the states of the forget, input, and output gates, respectively, and $g^{t,l}$ is the input activation. Each gate has its own weight matrices and bias vector. $c^{t,l}$ and $h^{t,l}$ are the cellular and hidden states, respectively. $\sigma$ and $\tanh$ are the sigmoid and hyperbolic tangent functions, respectively, and $\odot$ is the Hadamard product.
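For reference, the standard gate equations above can be sketched as follows; the parameter packing and toy dimensions are our own illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x_in, params):
    """One timestep of an LSTM layer; params maps each gate name to (W, U, b)."""
    def gate(name, act):
        W, U, b = params[name]
        return act(W @ x_in + U @ h_prev + b)
    f = gate("f", sigmoid)   # forget gate
    i = gate("i", sigmoid)   # input gate
    o = gate("o", sigmoid)   # output gate
    g = gate("g", np.tanh)   # input activation
    c = f * c_prev + i * g   # cellular state (Hadamard products)
    h = o * np.tanh(c)       # hidden state
    return h, c

# Toy usage: 2 units, 3 binary inputs, 4 timesteps.
rng = np.random.default_rng(2)
params = {k: (rng.normal(0, 0.5, (2, 3)),
              rng.normal(0, 0.5, (2, 2)),
              np.zeros(2)) for k in ("f", "i", "o", "g")}
h, c = np.zeros(2), np.zeros(2)
for t in range(4):
    x = (rng.random(3) < 0.5).astype(float)
    h, c = lstm_step(h, c, x, params)
```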
To avoid ambiguity, we would like to emphasize again that our “SNNs vs. RNNs” in this work is defined as “feedforward SNNs vs. recurrent ANNs”. For simplicity, we only select two representatives from the RNN family, i.e. vanilla RNNs and LSTM. In this section, we first rethink the similarities and differences between SNNs and RNNs from the modeling and learning perspectives, and discuss how to ensure the comparability and fairness.
Before analysis, we first convert Equation (1) to its iterative version to make it compatible with the format of Equations (2)-(3). This can be achieved by solving the first-order differential equation in Equation (1), which yields

$$\begin{aligned} u_i^{t,l} &= \alpha\, u_i^{t-1,l}\left(1 - o_i^{t-1,l}\right) + \sum_j w_{ij}^l\, o_j^{t,l-1},\\ o_i^{t,l} &= f\!\left(u_i^{t,l} - u_{th}\right), \end{aligned} \quad (4)$$

where $\alpha = e^{-dt/\tau} \in (0, 1)$ reflects the leakage effect of the membrane potential, and $f(\cdot)$ is a step function that satisfies $f(x) = 1$ when $x \ge 0$, otherwise $f(x) = 0$. This iterative LIF model incorporates all behaviors of a spiking neuron, including integration, fire, and reset.
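The iterative LIF update can be sketched directly in NumPy; the leakage factor, threshold, and toy sizes below are illustrative assumptions:

```python
import numpy as np

def lif_step(u_prev, o_prev, x_in, W, alpha=0.3, u_th=0.5):
    """One discrete timestep of a LIF layer in its iterative form.

    u_prev -- membrane potentials at t-1, shape (n,)
    o_prev -- binary spike outputs at t-1, shape (n,)
    x_in   -- binary spike inputs from the previous layer at t, shape (m,)
    W      -- synaptic weight matrix, shape (n, m)
    alpha  -- leakage factor e^(-dt/tau), restricted to (0, 1)
    u_th   -- firing threshold
    """
    # Leak the potential; neurons that fired at t-1 are reset to zero first.
    u = alpha * u_prev * (1.0 - o_prev) + W @ x_in
    o = (u >= u_th).astype(float)  # fire a binary spike when crossing the threshold
    return u, o

# Toy usage: 3 neurons, 4 inputs, 5 timesteps of sparse binary events.
rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.5, (3, 4))
u, o = np.zeros(3), np.zeros(3)
for t in range(5):
    x = (rng.random(4) < 0.3).astype(float)
    u, o = lif_step(u, o, x, W)
```

Setting `o_prev` to ones zeroes the leaked potential, which is exactly the reset behavior; this self-recurrent leak path is the only intra-layer recurrence a LIF neuron has.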
Now, from Equations (2)-(4), it can be seen that the models of SNNs and RNNs are quite similar, involving both temporal and spatial dimensions. Figure 3 further visualizes the information propagation paths of SNNs, vanilla RNNs, and LSTM in both the forward and backward passes. Here we additionally mark the pre-activation hidden state in vanilla RNNs and the pre-activation gate states in LSTM.
Spatiotemporal Dynamics in the Forward Pass. First, the forward propagation paths of SNNs and vanilla RNNs are similar if the membrane potential $u$ and output activity $o$ of an SNN are regarded as the pre-activation state and hidden state $h$ of a vanilla RNN, respectively. Second, for LSTM, there are more intermediate states inside a neuron, including the gate states and the cellular state. Although the neuron becomes complicated, the overall spatiotemporal path is still similar if we just pay attention to the propagation of the hidden state $h$. Interestingly, the internal membrane potential of each spiking neuron can directly affect the neuronal state at the next timestep, which distinguishes spiking neurons from vanilla RNNs but acts similarly to the forget gate of LSTM.
For SNNs, the learning algorithms vary significantly in the literature, including, for example, unsupervised learning, ANN-to-SNN conversion, and supervised learning [44, 12, 21]. Since RNNs are usually trained by gradient-descent-based supervised algorithms in the machine learning domain, we select a recent BPTT-inspired spatiotemporal backpropagation algorithm [12, 21] for SNNs to make our comparison fair.
Also from Figure 3, it can be seen that the gradient propagation paths of SNNs, vanilla RNNs, and LSTM follow a similar spatiotemporal fashion. Moreover, we detail the backpropagation formula of each model for better understanding. Notice that $\delta$ denotes the gradient of a variable, for example, $\delta u = \frac{\partial L}{\partial u}$, where $L$ is the loss function of the network. For SNNs, following the iterative model in Equation (4), we have

$$\begin{aligned} \delta u_i^{t,l} &= \delta o_i^{t,l}\, \frac{\partial o_i^{t,l}}{\partial u_i^{t,l}} + \alpha \left(1 - o_i^{t,l}\right) \delta u_i^{t+1,l},\\ \delta o_i^{t,l} &= \sum_j w_{ji}^{l+1}\, \delta u_j^{t,l+1} - \alpha\, u_i^{t,l}\, \delta u_i^{t+1,l}, \end{aligned} \quad (5)$$

where the firing function is non-differentiable. To this end, a Dirac-like function is introduced to approximate its derivative $\frac{\partial o}{\partial u}$. Specifically, it can be calculated by

$$\frac{\partial o_i^{t,l}}{\partial u_i^{t,l}} \approx \frac{1}{a}\, \mathrm{sign}\!\left(\left|u_i^{t,l} - u_{th}\right| \le \frac{a}{2}\right), \quad (6)$$

where $a$ is a hyper-parameter that controls the gradient width when passing the firing function during backpropagation. For vanilla RNNs, we have a similar format as follows

$$\delta h^{t,l} = \left(W^{l+1}\right)^{\top}\!\left(\delta h^{t,l+1} \odot \varphi'\right) + \left(U^{l}\right)^{\top}\!\left(\delta h^{t+1,l} \odot \varphi'\right), \quad (7)$$

where $\varphi'$ denotes the derivative of the activation function at the corresponding pre-activation state.
For LSTM, the situation becomes complicated due to the interaction between gates. Specifically, we can similarly have

$$\delta h^{t,l} = \left(\frac{\partial h^{t,l+1}}{\partial h^{t,l}}\right)^{\!\top} \delta h^{t,l+1} + \left(\frac{\partial h^{t+1,l}}{\partial h^{t,l}}\right)^{\!\top} \delta h^{t+1,l}, \quad (8)$$

where the two items on the right side represent the spatial gradient backpropagation and the temporal gradient backpropagation, respectively. Moreover, applying the chain rule to Equation (3), we can yield

$$\frac{\partial h^{t,l}}{\partial h^{t-1,l}} = \mathrm{diag}\!\left(\tanh\!\left(c^{t,l}\right)\right) \frac{\partial o^{t,l}}{\partial h^{t-1,l}} + \mathrm{diag}\!\left(o^{t,l} \odot \left(1 - \tanh^2\!\left(c^{t,l}\right)\right)\right) \frac{\partial c^{t,l}}{\partial h^{t-1,l}}, \quad (9)$$

where $\mathrm{diag}(\cdot)$ converts a vector into a diagonal matrix.
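The Dirac-like rectangular approximation used for the SNN firing function can be sketched as below; the threshold and width values are illustrative defaults, not the paper's settings:

```python
import numpy as np

def fire(u, u_th=0.5):
    """Non-differentiable step (firing) function of a spiking neuron."""
    return (u >= u_th).astype(float)

def fire_surrogate_grad(u, u_th=0.5, a=1.0):
    """Rectangular, Dirac-like surrogate for d(fire)/du:
    equals 1/a inside a window of width a centered at the threshold,
    and 0 elsewhere, so the gradient width is controlled by a."""
    return (np.abs(u - u_th) < a / 2.0).astype(float) / a
```

During BPTT, `fire` is used in the forward pass while `fire_surrogate_grad` replaces its derivative in the backward pass, which is what makes the threshold crossing trainable at all.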
Although SNNs, vanilla RNNs, and LSTM are similar in terms of information propagation paths, they are still quite different. In this subsection, we give our rethinking on their differences.
Connection Pattern. From Equations (2)-(4), it can be seen that the connection patterns of these models are different. First, for the neurons in the same layer, SNNs only have self-recurrence within each neuron, while RNNs have cross-recurrence among neurons. Specifically, self-recurrence means there are only intra-neuron recurrent connections; by contrast, cross-recurrence allows inter-neuron recurrent connections within each layer. Second, the recurrent weight of an SNN is determined by the leakage factor of the membrane potential, which is restricted to (0, 1); in RNNs, by contrast, the recurrent weights are trainable parameters. To make this clear, we use Figure LABEL:fig:connection to visualize the connection patterns of SNNs and RNNs and Figure LABEL:fig:recurrent_weight to show the distribution of recurrent weights collected from practical models, which reflects the theoretical analysis.
Neuron Model. Besides the connection pattern, we discuss the modeling details inside each neuron unit. As depicted in Figure LABEL:fig:neuron, there are apparently no gates in vanilla RNNs, unlike the complex gates in LSTM. For SNNs, as aforementioned, the extra membrane potential path is similar to the forget gate of LSTM; however, the reset mechanism bounds the membrane potential, unlike the unbounded cellular state in LSTM. In addition, as Figure LABEL:fig:act_fun shows, the activation function of SNNs is a firing function, essentially a step function with binary outputs, while the activation functions in vanilla RNNs and LSTM are continuous functions such as $\tanh$ and sigmoid.
|Model||Spatiotemporal Dynamics||Recurrence||Recurrent Weight||Gate Structure||Activation Function||Loss Function|
|SNNs||Yes||Self-Neuron||Restricted (leakage factor)||Forget-Gate-like||Binary: step (firing)||Rate Coding + MSE|
|Vanilla RNNs||Yes||Cross-Neuron||Trainable||None||Continuous: tanh||Flexible|
|LSTM||Yes||Cross-Neuron||Trainable||Multiple Gates||Continuous: sigmoid & tanh||Flexible|
Loss Function. Under the framework of gradient-descent-based supervised learning, a loss function is critical for the overall optimization. The loss function formats for SNNs and RNNs are different. Specifically, for SNNs, the spike rate coding scheme is usually combined with the mean square error (MSE) to form a loss function, which can be abstracted as

$$L = \left\| y - \frac{1}{T} \sum_{t=1}^{T} o^{t,N} \right\|_2^2, \quad (10)$$

where $y$ is the label, $o^{t,N}$ is the output of the last layer (indexed $N$) at timestep $t$, and $T$ is the number of simulation timesteps during training. This loss function takes the output spikes fired at all timesteps into account, and thus the neuron that fires the most determines the recognition result. Different from Equation (10), the mainstream loss function for RNNs usually obeys

$$L = \left\| y - W_y h^{T,N} \right\|_2^2, \quad (11)$$

where $h^{T,N}$ is the hidden state of the last layer at the last timestep and $W_y$ is a trainable weight matrix.
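The two loss formats can be contrasted in a small NumPy sketch; the MSE readout, the toy shapes, and the name `W_y` follow the abstractions above and are illustrative:

```python
import numpy as np

def snn_rate_loss(spikes, y):
    """Rate-coding + MSE loss for SNNs: compare the label vector against the
    firing rate of the output layer averaged over all T timesteps.
    spikes: (T, n_classes) binary array of output spikes."""
    rate = spikes.mean(axis=0)
    return float(np.sum((y - rate) ** 2))

def rnn_last_step_loss(h_seq, W_y, y):
    """Mainstream RNN loss: read out only the hidden state at the last timestep."""
    pred = W_y @ h_seq[-1]
    return float(np.sum((y - pred) ** 2))

# Toy usage: 4 timesteps, 2 output classes.
spikes = np.array([[1., 0.], [1., 0.], [0., 0.], [1., 0.]])
y = np.array([1.0, 0.0])
loss_snn = snn_rate_loss(spikes, y)  # penalizes the gap to a 100% firing rate
```

The SNN form receives gradient feedback at every timestep, whereas the last-step form backpropagates from a single point in time.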
Based on the above analysis, we summarize the similarities and differences among SNNs, vanilla RNNs, and LSTM in Table 1. Owing to the similar spatiotemporal dynamics, it is possible to benchmark all these models on neuromorphic datasets. Moreover, facing the differences, we appropriately unify the following aspects to ensure comparability and fairness in our evaluation.
We benchmark all models on two neuromorphic vision datasets: one is a DVS-converted dataset named N-MNIST, and the other is a DVS-captured dataset named DVS Gesture, both introduced in Section 2.1. Detailed information on the two selected datasets is provided in Table 2. For SNNs, the processing of neuromorphic data is natural due to the same spatiotemporal components and event-driven fashion; for RNNs, the spike data are simply treated as binary values in {0, 1}.
| ||N-MNIST||DVS Gesture|
|Description||Handwritten Digits||Human Gestures|
Usually, the original recording time of each spike pattern is very long. The underlying reason is the fine-grained temporal resolution of DVS cameras, originally at the microsecond level. However, the number of simulation timesteps of neural networks cannot be too large; otherwise, the time and memory costs during training are unaffordable. To this end, we consider flexibility in tuning the temporal resolution. Specifically, the successive slices of spike events in the original dataset within each temporal resolution unit are collapsed along the temporal dimension into one slice. Here the temporal collapse means there will be a spike at the resulting pixel if there exists a spike at the same location in any original slice within the collapse window. We describe the collapse process as

$$S'^{\,k} = \Theta\!\left(\sum_{t=(k-1)r+1}^{kr} S^t\right), \quad (12)$$

where $\{S^t\}$ denotes the original slice sequence, $t$ is the original recording timestep index, $\{S'^{\,k}\}$ denotes the new slice sequence after collapse, $k$ is the new recording timestep index, and $r$ is the temporal resolution factor. $\Theta(\cdot)$ is an element-wise step function defined as: $\Theta(x) = 1$ if $x \ge 1$; $\Theta(x) = 0$ otherwise. After collapse, the original slice sequence becomes a new slice sequence that is $r$ times shorter. Apparently, the actual temporal resolution ($dt$) satisfies

$$dt = r \cdot dt_0, \quad (13)$$

where $dt_0$ is the original temporal resolution. Figure LABEL:fig:collapse illustrates an example of temporal collapse.
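The OR-style collapse described above can be sketched as follows; the array layout (T, H, W, 2) and the handling of trailing slices are our own illustrative choices:

```python
import numpy as np

def collapse_events(slices, r):
    """OR-collapse every r successive binary slices into one.

    slices -- (T, H, W, 2) binary event array (2 = On/Off channels)
    r      -- temporal resolution factor
    Returns a (T // r, H, W, 2) array: a pixel spikes in a new slice if any
    of its r source slices spiked at the same location. Trailing slices that
    do not fill a whole window are dropped.
    """
    T = (slices.shape[0] // r) * r
    grouped = slices[:T].reshape(-1, r, *slices.shape[1:])
    return (grouped.sum(axis=1) >= 1).astype(slices.dtype)

# Toy usage: 4 original slices of a single pixel, collapsed with r = 2.
s = np.zeros((4, 1, 1, 2), dtype=int)
s[0, 0, 0, 0] = 1  # On event in the first window
s[3, 0, 0, 1] = 1  # Off event in the second window
c = collapse_events(s, 2)
```

A larger `r` densifies each new slice (more surviving spikes per slice) but discards the fine timing within each window, which is exactly the temporal precision loss discussed later.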
A large temporal resolution will increase the spike rate of new slices, as demonstrated in Figure 9. In addition, at each simulation timestep in Equation (2)-(4), the neural network processes one slice after the temporal collapse. Therefore, if the simulation timestep number remains fixed, a larger temporal resolution could extend the actual simulation time, which is able to capture more temporal dependence in the neuromorphic dataset. By tuning the temporal resolution, we create opportunities to extract more insights from the change of model performance.
Since RNNs are normally trained by the supervised BPTT algorithm, to make the comparison fair, we select a recent BPTT-inspired learning algorithm with spatiotemporal gradient propagation for SNNs. Regarding the loss function required by gradient descent, the one in Equation (10), based on the rate coding scheme and MSE, is widely used for SNNs. Under this loss function, there is gradient feedback at every timestep, which can alleviate the gradient vanishing problem to some extent.
In contrast, the existing loss functions for RNNs are flexible, including the mainstream one shown in Equation (11), which considers only the output at the last timestep, and others that consider the outputs at all timesteps [45, 46, 47], such as

$$L = \sum_{t=1}^{T} \left\| y - W_y h^{t,N} \right\|_2^2. \quad (14)$$

However, even when using the above loss function that considers the outputs at all timesteps, it is still slightly different from the one in Equation (10) for SNNs. To make the comparison fair, we provide two kinds of loss function configurations for RNNs. One is the mainstream loss function as in Equation (11); the other is a modified version of Equation (14), i.e.,

$$L = \left\| y - \frac{1}{T} \sum_{t=1}^{T} W_y h^{t,N} \right\|_2^2. \quad (15)$$
For clarity, we term the above format in Equation (15) for RNNs as the rate-coding-inspired loss function.
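A minimal sketch of the rate-coding-inspired format, assuming the same MSE readout and hypothetical `W_y` as before:

```python
import numpy as np

def rnn_rate_loss(h_seq, W_y, y):
    """Rate-coding-inspired loss for RNNs: average the linear readout
    W_y h^t over all T timesteps before taking the MSE, mirroring how the
    SNN loss averages output spikes over time."""
    pred = np.mean([W_y @ h for h in h_seq], axis=0)
    return float(np.sum((y - pred) ** 2))

# Toy usage: two timesteps whose averaged readout hits the target exactly.
h_seq = [np.array([1.0]), np.array([0.0])]
W_y = np.array([[1.0]])
loss = rnn_rate_loss(h_seq, W_y, np.array([0.5]))
```

Averaging inside the norm (rather than summing per-timestep losses as in Equation (14)) is what makes this format structurally match the SNN rate-coding loss.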
The FC layer based structure is widely used in SNNs and RNNs, and is termed the multilayer perceptron (MLP) based structure in this work. However, the learning performance of MLP-based structures is usually poor, especially on visual tasks. To this end, the Conv layer based structure has been introduced into SNNs to improve learning performance, which is termed the convolutional neural network (CNN) based structure in this work. Accordingly, besides the basic MLP structure, we also implement the CNN structure for RNNs, including both vanilla RNNs and LSTM. In this way, the comparison between different models is restricted to the same network structure, which is fairer. Table 3 provides the network structure configuration on different datasets. Since N-MNIST is a simpler task, we only use the MLP structure; for DVS Gesture, we adopt both MLP and CNN structures.
|Neuromorphic Vision Dataset||Network Structure|
|DVS Gesture||MLP: Input-MP4-512FC-11|
Note: C3 denotes a Conv layer with a 3x3 weight kernel, MP4 denotes max pooling with a 4x4 pooling kernel, and AP2 denotes average pooling with a 2x2 pooling kernel.
Besides the network structure, the training process needs some hyper-parameters such as the number of epochs, number of timesteps, batch size, and learning rate. To ensure fairness, we unify the training hyper-parameters of the different models. Specifically, as listed in Table 4, except for the hyper-parameters unique to SNNs, all others are shared by all models.
In summary, with the above rethinking on the similarities and differences as well as the proposed solutions, we successfully unify several aspects involving testing datasets, temporal resolution, learning algorithm, loss function, network structure, hyper-parameter, etc., which are listed in Table 5. This unification ensures the comparability between SNNs and RNNs, and further makes the comparison fair, which lays the foundation of this work.
|Neuromorphic Vision Dataset||N-MNIST & DVS Gesture|
|Temporal Resolution||Tunable ($dt = r \cdot dt_0$)|
|Learning Algorithm||BPTT-inspired (SNNs); BPTT (RNNs)|
|Loss Function||Rate Coding (SNNs); Mainstream or Rate-Coding-Inspired (RNNs)|
|Network Structure||MLP & CNN|
|Hyper-parameter||SNN-Specialized & Shared|
With the unification mentioned in Table 5, we conduct a series of contrast experiments and extract some insights in this section.
In all experiments, the leakage factor ($\alpha$) is fixed at 0.3. In addition, the Adam (adaptive moment estimation) optimizer with the default parameter setting (learning rate $10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) is used for the adjustment of network parameters.
|SNNs||Vanilla RNNs||LSTM||Vanilla RNNs||LSTM|
Tables 6-8 list the accuracy results of a series of contrast experiments on both the N-MNIST and DVS Gesture datasets. On N-MNIST, SNNs achieve the best accuracy among the common models. Interestingly, when we apply the rate-coding-inspired loss function (Equation (15)), RNNs can achieve comparable or even better accuracy than SNNs. A similar trend is also found on DVS Gesture. However, it seems that the vanilla RNNs cannot outperform SNNs on DVS Gesture, especially in the MLP-based cases, even when the rate-coding-inspired loss function is used. The underlying reason might be the gradient problem. As is well known, compared to vanilla RNNs, LSTM can alleviate the gradient vanishing issue via its complex gate structure, thus achieving much longer temporal dependence [40, 41]. For SNNs, the membrane potential can directly impact the neuronal state at the next timestep, yielding one more information propagation path than vanilla RNNs in both the forward and backward passes (see Figure 3). This extra path acts similarly to the LSTM's forget gate (i.e. the most important gate of LSTM), so it can also memorize longer-term dependence than vanilla RNNs and improve accuracy.
Figure 10 further presents the training curves of these models on N-MNIST. It can be observed that the common RNNs converge poorly on neuromorphic datasets, while the RNNs with the rate-coding-inspired loss function shift the training curves upward, which demonstrates the effectiveness of the rate-coding-inspired loss function. Moreover, we find that SNNs and LSTM converge faster than vanilla RNNs. All these observations are consistent with the results in Tables 6-8.
Besides the above analysis, we further visualize the feature maps of the Conv layers on DVS Gesture to see what happens, as shown in Figure 11. For simplicity, we only visualize one temporal resolution setting; the RNN models are improved by the rate-coding-inspired loss function. Among the three models, the vanilla RNN has the clearest feature maps, close to the input slices at the corresponding timesteps. However, the feature maps of the SNN and LSTM models clearly include an integration of the current timestep and traces of previous timesteps, owing to the extra membrane potential path of SNNs and the complex gate structure of LSTM. In the feature maps of SNNs and LSTM, the entire area passed by the dynamic gesture is lit up, making them look like comets. This feature integration strengthens the temporal dependence, which further shifts the later layers from learning temporal features to learning spatial features to some extent. On DVS-captured datasets like DVS Gesture, the different input slices across timesteps jointly constitute the final pattern to recognize, while on DVS-converted datasets like N-MNIST, the slices at different timesteps are similar. This, along with the longer-term memory of SNNs and LSTM, explains why the accuracy gap between vanilla RNNs and SNNs/LSTM is larger on DVS Gesture than on N-MNIST.
Furthermore, we conduct an extra experiment to investigate the influence of the membrane potential leakage and reset mechanisms in SNNs, testing on N-MNIST. As presented in Table 9, removing these mechanisms degrades the accuracy. In fact, both leakage and reset reduce the membrane potential, thus lowering the spike rate to some extent, which helps improve neuronal selectivity. Interestingly, we find the joint impact of the two mechanisms is larger than the impact of either one alone.
|Dataset||Work||Model||Accuracy|
|N-MNIST||Lee et al.||Input-10000FC-10||98.66%|
| ||DART||DART Feature Descriptor||97.95%|
| ||Wu et al.||SNN (CNN-based 8 layers)||99.53%|
|DVS Gesture||TrueNorth||SNN (CNN-based 16 layers)||91.77%|
| ||SLAYER||SNN (CNN-based 8 layers)||93.64%|
| ||Ours||SNN (CNN-based 8 layers)||93.40%|
Finally, we provide the accuracy results of several prior works that applied SNNs to the two neuromorphic datasets. Note that we do not provide results involving RNNs since few works have tested them on neuromorphic datasets. As depicted in Table 10, our SNN models achieve acceptable results, although not the best. Since our focus is the comparison between SNNs and RNNs rather than beating prior work, we do not adopt the large models and complex optimization strategies used in prior work to improve accuracy.
Also from Tables 6-8, we find that as the temporal resolution factor $r$ grows larger, the accuracy improves. The reasons are two-fold: on the one hand, the spike events become dense at large $r$ values (see Figure 9), which usually forms more effectual features in each slice; on the other hand, with the same number of simulation timesteps during training, a larger temporal resolution includes more slices of the original dataset, which provides more information about the moving object. Furthermore, we find that SNNs achieve significant accuracy superiority on DVS-captured datasets like DVS Gesture when the temporal resolution is small. This indicates that, unlike the continuous RNNs, the event-driven SNNs are more suitable for extracting sparse features, which has also been pointed out in prior work. On DVS-converted datasets like N-MNIST, the sparsity gap of spike events under different temporal resolutions is usually smaller than on DVS-captured datasets, so the accuracy superiority of SNNs is reduced.
In essence, the influence of increasing the temporal resolution is not always positive. As illustrated in Figure LABEL:fig:collapse, the temporal collapse as $r$ grows also loses some spikes, leading to temporal precision loss. To investigate the negative effect, we conduct experiments on N-MNIST with large $r$ values. To eliminate the impact of different simulation temporal lengths when $r$ varies, we adapt the number of simulation timesteps to ensure the same overall simulation temporal length. The results are given in Table 11. As $r$ increases excessively, the accuracy degrades due to the temporal precision loss.
|SNN (Adaptive Leakage)||98.19%||97.83%||97.04%|
|Vanilla RNN (Rate-Coding-Inspired)||98.15%||97.09%||78.33%|
Next, we conduct a simple experiment to test the generalization ability of the models under different temporal resolutions. We train an SNN model, a vanilla RNN model (rate-coding-inspired), and an LSTM model (rate-coding-inspired) on N-MNIST under a large temporal resolution, and then test the trained models under smaller ones. Again, we keep the overall simulation temporal length identical, as above. Unless otherwise specified, the leakage factor equals 0.3. Table 12 reports the accuracy results, and the training curves can be found in Figure 12. We have two observations: (1) testing under reduced temporal resolutions loses accuracy, and the degradation increases significantly as the test resolution becomes much smaller; (2) the SNN model presents better generalization ability. Specifically, at the smallest test resolution, the SNN model only loses 2.18% accuracy, while the vanilla RNN model and the LSTM model lose 19.82% and 20.87% accuracy, respectively, much more than the SNN model.
We explain the above robustness of SNNs as follows. First, as mentioned earlier, SNNs are naturally suited to processing sparse features under smaller temporal resolutions owing to the event-driven paradigm. Second, different from the trainable cross-neuron recurrent weights in RNNs, SNNs use self-neuron recurrence with restricted weights (i.e. the leakage factor). This recurrence restriction stabilizes the SNN model, leading to improved generalization. To evidence the latter prediction, we additionally test the performance of an SNN model with trainable cross-neuron recurrent weights and present the results in Table 12. As expected, its generalization ability dramatically degrades, as with RNNs. This might be caused by the increased number of parameters and the more complex dynamics after introducing the trainable cross-neuron recurrence. Additionally, we try to identify whether the leakage factor affects the generalization ability of SNNs. In all previous experiments, the leakage factor is fixed at 0.3; by contrast, we further test an SNN model with adaptive (trainable) leakage factors. Also from Table 12, it can be seen that the adaptive leakage factor only slightly improves the robustness.
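The contrast between self-neuron and cross-neuron recurrence can be sketched as follows. The function names, the reset-by-spike form of the update, and the threshold value are our illustrative assumptions, not the paper's exact equations; the key structural point is that the SNN's only "recurrent weight" is the scalar leakage factor, while the RNN carries a full trainable recurrent matrix.

```python
import numpy as np

def lif_step(u, s, x, W, lam=0.3, u_th=0.5):
    """One timestep of a LIF-style layer: self-neuron recurrence.

    The membrane potential u decays by the fixed scalar leakage lam
    (the restricted recurrent weight), is reset where the neuron spiked,
    and integrates the weighted input; spikes fire at the threshold.
    """
    u = lam * u * (1.0 - s) + x @ W.T   # leaky integration + spike reset
    s = (u >= u_th).astype(u.dtype)     # binary spike output
    return u, s

def rnn_step(h, x, W, U):
    """One timestep of a vanilla RNN layer: cross-neuron recurrence.

    The full trainable matrix U couples every neuron to every other,
    adding both parameters and more complex dynamics.
    """
    return np.tanh(x @ W.T + h @ U.T)
```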
In Section 4.2, we observed that the rate-coding-inspired loss function yields a larger accuracy boost on DVS-captured datasets. In this subsection, we analyze this phenomenon more deeply. We define the temporal contrast of a neuromorphic vision dataset as the cross-entropy between slices at different timesteps. Specifically, we denote $s_t$ as the slice accumulating the spike events between the $(t-1)$-th timestep and the $t$-th timestep. Thus, there exists a cross-entropy value between $s_i$ and $s_j$, where $i$ and $j$ can be any two given timesteps. Here we define the cross-entropy value as
$$\mathrm{CE}(s_i, s_j) = -\frac{1}{n}\sum_{k=1}^{n} s_i(k)\log\big(s_j(k)\big),$$
where $k$ and $n$ are the index and number of elements in a slice, respectively. Note that $\log(\cdot)$ is taken with a numerical stabilization: we set
$$\log(x) \leftarrow \log\big(\max(x,\,\epsilon)\big),$$
where $\epsilon$ is a small constant. We apply this stabilization because the elements in a slice can only take binary values within $\{0, 1\}$, which would otherwise cause zero or negative-infinity results when passing through the $\log(\cdot)$ function. Then, we visualize the cross-entropy matrices of the two neuromorphic vision datasets we use, the DVS-converted N-MNIST and the DVS-captured DVS Gesture, as presented in Figure 13.
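The temporal-contrast computation can be sketched directly: clamp each slice at a small constant before the logarithm, then evaluate the cross-entropy for every pair of timesteps. The function names and the exact clamp form are our assumptions.

```python
import numpy as np

def cross_entropy(si, sj, eps=1e-6):
    """Cross-entropy between two binary slices, with the logarithm
    clamped at eps to avoid log(0) on all-or-nothing spike maps."""
    sj = np.clip(sj, eps, 1.0)
    return -np.mean(si * np.log(sj))

def temporal_contrast(slices, eps=1e-6):
    """Pairwise cross-entropy matrix over a list of slices; large
    off-diagonal values mean the slices differ strongly over time."""
    T = len(slices)
    M = np.zeros((T, T))
    for i in range(T):
        for j in range(T):
            M[i, j] = cross_entropy(slices[i].ravel(), slices[j].ravel(), eps)
    return M
```

Identical slices produce near-zero entries, while slices whose active pixels disagree produce large ones, which is what distinguishes DVS-captured from DVS-converted data in Figure 13.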
Apparently, the temporal contrast of DVS-captured datasets is much larger than that of DVS-converted datasets. This indicates that there are more temporal components in DVS-captured datasets, while the slices at different timesteps in DVS-converted datasets are close to each other. Furthermore, we provide the statistics, including the mean and variance, of the cross-entropy matrices derived from the above two datasets. The calculation of mean and variance follows the standard definitions in statistics. As shown in Table 13, the variance on DVS-captured datasets is much larger than that on DVS-converted datasets, which is consistent with the conclusion from Figure 13. By taking the outputs at different timesteps into account, the rate-coding-inspired loss function in Equation (15) is able to provide error feedback at all timesteps, thus optimizing the final recognition performance. The above quantitative analysis explains why the rate-coding-inspired loss function gains a larger accuracy boost on DVS Gesture than on N-MNIST. We should note that, when the temporal contrast is too large, the effectiveness of the rate-coding-inspired loss function might be degraded due to divergent gradient directions at different timesteps, which needs more investigation in the future.
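As a rough sketch of how a rate-coding-inspired loss spreads error feedback over timesteps, one can average the per-timestep outputs before a softmax cross-entropy. This mirrors the spirit of Equation (15) but is our own simplified form, not necessarily the paper's exact definition.

```python
import numpy as np

def rate_coding_loss(outputs, label):
    """outputs: (T, C) per-timestep logits; label: target class index.

    Averaging over time (rate coding) before the softmax means every
    timestep's output contributes to the gradient, unlike a
    last-timestep-only loss.
    """
    o = outputs.mean(axis=0)            # temporal average of logits
    p = np.exp(o - o.max())             # numerically stable softmax
    p /= p.sum()
    return -np.log(p[label])
```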
Besides the accuracy analysis, we further consider the memory and compute costs during model execution. For the computational complexity, we take one layer with $n$ neurons as an example. In the forward pass, we count the operations when it propagates its activities to the next timestep and to the next layer with $m$ neurons; in the backward pass, we count the operations when it receives the gradients from the next timestep and the next layer. Notice that we only count the matrix operations, because they dominate the complexity compared with the vector and scalar ones. The comparison is presented in Table 14, which is mainly derived from Equations (2)-(9). Apparently, the SNN model consumes fewer operations owing to the self-neuron recurrence, and avoids costly multiplications in the forward pass owing to the spike format. Furthermore, the event-driven computation in the forward pass can further reduce the required operations in proportion to the normalized spike rate.
| Data Path | SNN | Vanilla RNN | LSTM |
| Backward | MACs | MACs | MULs + MACs |
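A back-of-the-envelope version of this forward-pass counting can be written as follows, under our simplifying assumptions: only matrix operations are counted, the SNN's self-neuron recurrence is a vector operation and therefore excluded, spike inputs turn multiply-accumulates into accumulates, and event-driven execution scales the count by the normalized spike rate. The coefficients are our sketch, not the exact entries of Table 14.

```python
def forward_ops(n, m, model, spike_rate=1.0):
    """Rough per-timestep forward operation counts for one layer with
    n neurons feeding a next layer with m neurons."""
    if model == "snn":
        # Binary spikes: accumulations only, gated by event-driven sparsity.
        return {"ACs": int(spike_rate * n * m)}
    if model == "rnn":
        # Feedforward matrix plus cross-neuron recurrent matrix.
        return {"MACs": n * m + n * n}
    if model == "lstm":
        # Four gates, each with feedforward and recurrent matrices.
        return {"MACs": 4 * (n * m + n * n)}
    raise ValueError(model)
```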
On the other hand, the memory overhead is mainly determined by the number of parameters, especially when performing inference with only the forward pass on edge devices. Figure 14 compares the number of model parameters of SNNs, vanilla RNNs, and LSTM. Here we take the models used on the DVS Gesture dataset for illustration. We find that the parameter amount of SNNs is much smaller than those of RNNs. Overall, SNNs occupy only about 80% and 20% of the parameters of the vanilla RNNs and LSTM, respectively.
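The parameter gap can be reproduced with simple per-layer counts (weights only, biases omitted; this is our own sketch): an SNN fully-connected layer needs no recurrent weight matrix, a vanilla RNN adds one, and LSTM quadruples both for its four gates.

```python
def fc_params(n_in, n_out, model):
    """Per-layer weight counts for a fully-connected layer.

    SNN: feedforward weights only (self-neuron recurrence uses the
    scalar leakage factor, adding no matrix). Vanilla RNN: plus a
    recurrent matrix. LSTM: four gates, each with both matrices.
    """
    if model == "snn":
        return n_in * n_out
    if model == "rnn":
        return n_in * n_out + n_out * n_out
    if model == "lstm":
        return 4 * (n_in * n_out + n_out * n_out)
    raise ValueError(model)
```

The exact ratios depend on layer widths: the SNN-to-RNN ratio is $n_{in}/(n_{in}+n_{out})$, which approaches one when the input dimension dominates, consistent with the roughly 80% figure above, while the factor-of-four gates drive the roughly 20% figure against LSTM.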
Interestingly, despite the fewer operations and parameters of SNNs, the extra membrane potential path helps them achieve comparable (under large temporal resolution) or even better (under small temporal resolution) recognition accuracy than LSTM with complex gates; in the meantime, the self-neuron recurrence and the restricted recurrent weights make them more lightweight and robust.
The preceding sections focus on vision datasets. Another important branch of data sources with spatiotemporal components is audio data, which has also been used in neuromorphic computing [52, 53]. To gently extend the scope of this work, we provide an extra experiment on an audio dataset in this subsection.
| $dt$ during Testing | SNNs | Vanilla RNNs | LSTM |
We select the Spoken-Digits dataset for testing. The network structure is “Input-512FC-10”. The hyper-parameter setting is the same as that on N-MNIST except for the number of simulation timesteps $T$. We fix the temporal resolution during training, while varying it during testing to explore generalization. The results are listed in Table 15. It can be seen that the vanilla RNNs perform the worst while SNNs perform the best. Furthermore, SNNs show better generalization ability on this dataset, which is consistent with the observation in Section 4.3.
In this work, we conduct a systematic investigation of SNNs and RNNs on neuromorphic vision datasets, comparing their performance and complexity. To make SNNs and RNNs comparable and to improve fairness, we first identify several similarities and differences between them from the modeling and learning perspectives, and then unify the dataset selection, temporal resolution, learning algorithm, loss function, network structure, and training hyper-parameters. In particular, inspired by the rate coding scheme of SNNs, we modify the mainstream loss function of RNNs to approach that of SNNs; to test model robustness and generalization, we propose to tune the temporal resolution of neuromorphic vision datasets. Based on a series of contrast experiments on N-MNIST (a DVS-converted dataset) and DVS Gesture (a DVS-captured dataset), we obtain extensive insights in terms of recognition accuracy, feature extraction, temporal resolution and contrast, learning generalization, computational complexity, and parameter volume. For better readability, we summarize our findings as follows:
SNNs are usually able to achieve better accuracy than common RNNs. However, the rate-coding-inspired loss function can boost the accuracy of RNNs, especially LSTM, to be comparable to or even slightly better than that of SNNs.
The event-driven paradigm of SNNs makes them more suitable for processing sparse features. Therefore, in cases of small temporal resolution with sparse spike events, SNNs hold an obvious accuracy superiority.
On the one hand, LSTM can memorize long-term dependencies via its complex gates, while the extra membrane potential path of SNNs also brings longer-term memory than vanilla RNNs; on the other hand, the temporal contrast of slices in DVS-captured datasets is much larger than that in DVS-converted datasets, so processing DVS-captured datasets depends more on the long-term memorization ability. These two aspects explain why SNNs and LSTM significantly outperform vanilla RNNs on DVS Gesture, while the gap is small on N-MNIST.
The self-neuron recurrence pattern and restricted recurrent weights of SNNs greatly reduce the number of parameters and operations, which improves both running efficiency and model generalization.
We believe that the above conclusions can benefit neural network selection and design for different workloads in the future. We briefly discuss several examples. On DVS-converted datasets, the accuracy gap between different models is small, so any model selection is acceptable. On DVS-captured datasets, we do not recommend vanilla RNNs due to their low accuracy. When the temporal resolution is large, we recommend LSTM with the rate-coding-inspired loss function; when the temporal resolution is small, we recommend SNNs. If a compact model size is needed, we always recommend SNNs, which have significantly fewer parameters and operations. Moreover, it might be possible to improve these models by borrowing ideas from each other. For instance, vanilla RNNs could be enhanced by introducing more information propagation paths like the membrane potential path in SNNs; LSTM could be made more compact and robust by introducing the recurrence restriction; SNNs could be improved by introducing more gates like LSTM. It is even possible to build a hybrid neural network model by combining multiple kinds of neurons, thus taking advantage of different models while alleviating their respective defects. In addition, we mainly focus on vision datasets and provide only a limited exploration of audio data in this work. More extensive experiments across a wide spectrum of tasks are highly expected.
The work was partially supported by National Science Foundation (Grant No. 1725447), Beijing Academy of Artificial Intelligence (BAAI), Tsinghua University Initiative Scientific Research Program, and a grant from the Institute for Guo Qiang, Tsinghua University.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7243–7252, 2017.
2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 167–174, IEEE, 2015.
P. U. Diehl, D. Neil, J. Binas, M. Cook, S.-C. Liu, and M. Pfeiffer, “Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing,” in 2015 International Joint Conference on Neural Networks (IJCNN), pp. 1–8, IEEE, 2015.
, “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems, pp. 8024–8035, 2019.