SiamSNN: Spike-based Siamese Network for Energy-Efficient and Real-time Object Tracking

Although deep neural networks (DNNs) have achieved remarkable success in various scenarios, they are difficult to deploy on many resource-limited systems because of their high energy consumption. Spiking neural networks (SNNs) are attracting growing attention for their capability of energy-efficient computing. Recently, many works have focused on converting DNNs into SNNs with little accuracy degradation in image classification on MNIST and CIFAR-10/100. However, few studies address shortening latency or designing spike-based modules for more challenging tasks on complex datasets. In this paper, we focus on the similarity matching method of deep spike features and present the first spike-based Siamese network for object tracking, called SiamSNN. Specifically, we propose a hybrid spiking similarity matching method that uses membrane potential and time step to evaluate the response map between exemplar and candidate images, with the same function as the correlation layer in SiamFC. We then present a coding scheme that utilizes the temporal information of spike trains and implement it in the output spiking layer to improve performance and shorten latency. Our experiments show that SiamSNN achieves short latency and low precision loss relative to the original SiamFC on the tracking datasets OTB-2013, OTB-2015 and VOT2016. Moreover, SiamSNN achieves real-time speed (50 FPS) and extremely low energy consumption on TrueNorth.


I Introduction

Nowadays Deep Neural Networks (DNNs) have shown remarkable performance in various scenarios [1, 2, 3]. However, due to the high computation cost and heavy energy consumption, it’s difficult to employ DNNs on embedded systems such as mobile devices. To this end, many tiny but efficient networks [4, 5] are proposed and achieve promising performance. However, the problem of insufficient computing power still exists in resource-constrained systems.

As an alternative, spiking neural networks (SNNs) have realized ultra-low power consumption on neuromorphic hardware (such as TrueNorth [6] and Loihi [7]) by transmitting information in the form of event-driven binary spike trains rather than continuous values as in DNNs [8]. Furthermore, spiking neurons in SNNs fire an output spike only when their membrane potentials exceed a certain threshold [9], which sparsely activates neurons in the network to save energy. SNNs are regarded as the third generation of artificial neural networks [10] because they are bio-inspired and closer to a real human brain than common DNNs.

SNNs have shown their effectiveness in some scenarios, yet there is still a gap to DNNs owing to the difficulty of training [11]. One major reason is that training algorithms such as STDP [12] and backpropagation [13] can only train shallow SNNs with two or three layers. To avoid this issue, an alternative is to convert trained DNNs to SNNs [14, 15, 16]: the well-trained models are transferred with weight rescaling and normalization methods, with only a small loss of accuracy compared to the original DNNs. Recently, [17] proposed a hybrid coding scheme that achieves near-lossless conversion in image classification.

However, SNN approaches provide competitive results mostly in classification, with few tasks beyond perception and inference on static images [18]. These methods are mainly suitable for convolutional neural networks (CNNs) with base layers such as ReLU, pooling and batch normalization (BN). It is necessary to design more spike-based modules for various models. The competitive results only demonstrate that the information useful for image classification has not been lost; how well other information is transmitted remains unclear. As far as we know, only Spiking-YOLO [19] successfully performs object detection, by solving the regression problem through channel-wise normalization and converting leaky-ReLU to a spiking form. In addition, SNNs with remarkable performance usually require long latency, which prevents real-time operation and limits the applications of SNNs. Consequently, shortening latency and exploiting various spike-based modules for more advanced tasks are significant to the popularization of SNNs.

Object tracking, one of the most important problems in computer vision with wide applications, is a more challenging task than image classification due to occlusions, deformation, background cluttering and other distractors. An object bounding box is provided only for the first frame of a sequence, and the algorithm should estimate the target position in each remaining frame. Recently, trackers based on Siamese DNNs have obtained state-of-the-art results [21], while SNNs have only been attempted in a few simple tracking scenarios [22, 23].

In this paper, we focus on the similarity matching method of deep spike trains and present a spike-based Siamese network for object tracking called SiamSNN, based on SiamFC [3]. Specifically, we first implement a simple matching method through membrane potential or time step. After an in-depth analysis, we optimize them and propose a hybrid similarity matching method that utilizes the temporal information of spike trains. Meanwhile, we present a coding scheme in the output spiking layer that optimizes the temporal distribution of spike trains to shorten the latency.

The proposed coding scheme is inspired by the neural phenomenon of long-term depression (LTD) [24]. To the best of our knowledge, SiamSNN is the first deep SNN for object tracking that achieves short latency and low precision loss relative to the original SiamFC on the tracking datasets OTB-2013 [25], OTB-2015 [26] and VOT2016 [27]. The contributions of this paper can be summarized as follows:

  • We design SiamSNN for object tracking in deep SNN with short latency and low precision loss. This is the first successful attempt to apply deep SNN to object tracking on the datasets with complex scenes.

  • We propose a hybrid similarity matching method to construct a spike-based correlation layer that evaluates the response map between exemplar and candidate images. The proposed method exploits temporal information to reduce accuracy degradation.

  • We propose a novel coding scheme in the output spiking layer to optimize the temporal distribution of spike trains that improves the performance and shortens the latency.

II Related Work

II-A Deep SNNs through conversion

Many methods have been proposed for efficient SNN training [12, 13] and achieve certain effects. However, they usually suffer accuracy degradation compared to DNNs. One major reason is that the training algorithms only fit shallow SNNs with two or three layers, whereas DNNs are much deeper, with great feature extraction and information processing capability. To construct deep SNNs, an alternative to training them directly is to convert trained DNNs to SNNs with the same topology.

In early work, Cao et al. [14] convert DNNs to SNNs and achieve excellent performance. They propose a comprehensive conversion scheme that can convert most modules in CNN-based networks except bias, batch normalization (BN) and max-pooling layers. Diehl et al. [15] propose a data-based normalization method to enable a sufficient firing rate and achieve nearly lossless accuracy on MNIST [28] compared to DNNs.

In subsequent work, Rueckauer et al. [16] demonstrate that merging a BN layer into the preceding layer is lossless and introduce spike max-pooling. Kim et al. [19] propose channel-wise normalization to raise the firing rate and are the first to apply a deep SNN to object detection successfully. Sengupta et al. [29] introduce a novel spike-normalization method for generating SNNs with deeper architectures and experiment on the complex standard vision dataset ImageNet [30]. These methods have achieved performance comparable to DNNs in image classification and object detection, while there are few attempts to apply SNNs to matching problems such as object tracking.

II-B Spike coding

Spike coding is the method of representing information with spike trains. Rate and temporal coding are the most frequently used coding schemes in SNNs.

Rate coding is based on the spike firing rate, which serves as the signal intensity and is measured by the number of spikes occurring during a period of time. It has been widely used in converted SNNs [14, 16, 31] for classification by comparing the firing rates of the categories. However, rate coding requires a higher latency for more accurate information [32]. For example, 512 time steps are needed to represent the positive integer 512, and 6 spikes over 1000 time steps are required to represent 0.006. As a result, rate coding is inefficient in terms of information transmission because of its redundancy, which leads to long latency and high energy consumption in deep SNNs.
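The latency cost of rate coding in the two examples above can be checked with a little arithmetic. The following is only an illustrative sketch: the helper name `rate_coding_cost` is hypothetical, and it assumes at most one spike per time step, with integers encoded as a spike count and values below 1 encoded as a firing rate at their decimal precision.

```python
from decimal import Decimal


def rate_coding_cost(value: str):
    """Return (spikes, time_steps) needed to represent `value` with rate coding.

    Assumption for illustration: a neuron emits at most one spike per time step.
    `value` is passed as a string so that its decimal precision is preserved.
    """
    d = Decimal(value)
    if d <= 0:
        raise ValueError("only positive values are handled in this sketch")
    if d == d.to_integral_value():
        # A positive integer n is encoded as n spikes, one per time step.
        n = int(d)
        return n, n
    if d < 1:
        # A fraction with k decimal digits needs a window of 10**k time steps.
        digits = -d.as_tuple().exponent
        window = 10 ** digits
        return int(d * window), window
    raise ValueError("non-integer values above 1 are outside this sketch")


print(rate_coding_cost("512"))    # spikes and time steps for the integer 512
print(rate_coding_cost("0.006"))  # spikes and time steps for 0.006
```

The number of time steps grows with the precision of the value, which is the redundancy that temporal coding schemes try to avoid.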

Temporal coding uses timing information to mitigate long latency. Time-to-first-spike [33] utilizes only the arrival time of the first spike during a certain time window. Many other coding schemes also exploit temporal information, such as rank-order coding [34] and phase coding [35]. Building on these, Kim et al. [36] propose spikes weighted by a phase function to make the most of temporal information. Park et al. [17] introduce a burst coding scheme inspired by neuroscience research and investigate a hybrid coding scheme to exploit the different features of different layers in SNNs.

II-C Object tracking

Object tracking aims to estimate the position of an arbitrary target in a video sequence while its location is initialized only in the first frame by a bounding box. Trackers usually learn a model of the object’s appearance and match it against the candidate image in the current frame, taking the point of maximum response as the location. Tracking scenarios often involve occlusions, out-of-view targets, deformation, background cluttering and other variations, which makes object tracking more challenging [37].

Two main branches of recent trackers are based on correlation filters (CF) [38, 39] and DNNs [40, 41, 42]. CF trackers train regressors in the Fourier domain and update the filter weights to track online. Trackers based on DNNs are trained end-to-end and off-line, which enhances the richness of the model through large datasets. Motivated by CF, trackers based on Siamese networks with a similarity comparison strategy have drawn great attention due to their high performance. SiamFC [3] introduces the correlation layer rather than fully connected layers to evaluate the regional feature similarity between two frames, which greatly improves the accuracy.

Many studies focus on improving the architecture of Siamese networks and obtain state-of-the-art results. SiamRPN [43] adopts a region proposal network (RPN [44]) after a Siamese network to enhance tracking performance; the object tracking task is decomposed into classification and regression problems with the correlation operation. SiamRPN++ [21] proposes a new model architecture to perform layer-wise and depth-wise aggregations and successfully trains a Siamese tracker with deeper CNNs. Li et al. [45] investigate a method to enhance tracking robustness and accuracy by learning target-aware features.

Fig. 1: Framework of SiamSNN. This framework consists of a Siamese spiking convolutional neural network backbone, two-status coding, and hybrid spiking similarity matching (temporal and potential correlation). Two-status coding optimizes the temporal distribution of output spike trains. Hybrid spiking similarity matching calculates the response map which denotes the similarity score between the template and the search region and the maximum of the response map indicates the target position.

III Proposed Methods

III-A Model Overview

SiamFC [3] first introduces the correlation layer to calculate the response map; it is simple but efficient. As our base model, SiamFC can be reconstructed from existing spiking modules except for the correlation layer, which aims to find the instance most similar to the exemplar $z$ in the candidate frame $x$ through an embedding function $\varphi$:

$$f(z, x) = \varphi(z) * \varphi(x) + b \cdot \mathbf{1} \tag{1}$$

where $b \cdot \mathbf{1}$ denotes a bias equated in every location.

The converted SiamFC is called SiamSNN, in which the max-pooling and BN layers are implemented according to [16]. We adopt layer-wise normalization (abbreviated to layer-norm) [15, 16] to prevent the neuron from over- or under-activation. We detail the SiamSNN as shown in Fig. 1.

Different from DNNs, SNNs use event-driven binary spike trains rather than continuous values between neurons. The integrate-and-fire (IF) neuron model adopts a membrane potential to simulate the post-synaptic potential. The membrane potential $V_j^l(t)$ in the $l$th layer for each neuron $j$ is described as

$$V_j^l(t) = V_j^l(t-1) + z_j^l(t) - V_{th}\,\Theta_j^l(t) \tag{2}$$

where $\Theta_j^l(t)$ is a spike in the $j$th neuron in the $l$th layer, $V_{th}$ is a certain threshold, and $z_j^l(t)$ is the input of the $j$th neuron in the $l$th layer, which is described as

$$z_j^l(t) = \sum_i w_{ji}^l\,\Theta_i^{l-1}(t) + b_j^l \tag{3}$$

where $w^l$ is the weight and $b^l$ is the bias in the $l$th layer. Each neuron generates a spike only if its membrane potential exceeds a certain threshold $V_{th}$. The process of spike generation can be generalized as

$$\Theta_j^l(t) = U\left(V_j^l(t-1) + z_j^l(t) - V_{th}\right) \tag{4}$$

where $U(\cdot)$ is a unit step function. Once a spike is generated, the membrane potential is reset to the resting potential. As shown in Eq. 2, we adopt the method of resetting by subtraction rather than resetting to zero, which increases the firing rate and reduces the loss of information. The firing rate is defined as

$$\text{firing rate} = \frac{N}{T} \tag{5}$$

where N is the total number of spikes during a given period $T$. The maximum firing rate is 100%, reached when a neuron generates a spike at every time step. Spike coding is the method of representing information with spike trains, and different coding schemes lead to different firing rates. A typical coding scheme uses real coding (real values) for the input layer and rate coding for the hidden layers [17].
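The difference between resetting by subtraction and resetting to zero can be seen on a single neuron. The sketch below is a minimal illustration, not the authors' implementation: the function names, the constant input of 0.75 and the convention that a neuron fires when its potential reaches the threshold (unit step equal to 1 at 0) are all assumptions made for the example.

```python
def simulate_if_neuron(inputs, v_th=1.0, reset="subtract"):
    """Simulate one integrate-and-fire neuron over a sequence of inputs.

    reset="subtract" removes v_th from the potential after a spike;
    reset="zero" clears the potential entirely (residual charge is lost).
    """
    v = 0.0
    spikes = []
    for z in inputs:
        v += z                          # integrate the input
        fired = 1 if v >= v_th else 0   # assumed convention: fire at v >= v_th
        if fired:
            v = v - v_th if reset == "subtract" else 0.0
        spikes.append(fired)
    return spikes


def firing_rate(spikes):
    """Number of spikes divided by the number of time steps."""
    return sum(spikes) / len(spikes)


inputs = [0.75] * 8
by_subtraction = simulate_if_neuron(inputs, reset="subtract")
to_zero = simulate_if_neuron(inputs, reset="zero")
print(by_subtraction, firing_rate(by_subtraction))
print(to_zero, firing_rate(to_zero))
```

With reset by subtraction the firing rate tracks the input value, whereas reset to zero discards the residual potential after each spike and fires less often.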

III-B Analysis

Like the softmax layer, we can simply infer the response map output from the correlation computed on the membrane potentials of the last layer over the entire time. Hence, Eq. 1 can be changed into a spiking form:


where and are the resulting spike trains encoded through input image for each pixel, is the latency. In the rest of the paper, we will refer to this simple method as potential spiking matching (PSM).

In a preliminary test of PSM with layer-norm on OTB-2013, it suffers from severe performance degradation. To prevent the under-activation caused by layer-norm, we adopt channel-wise data-based normalization (abbreviated to channel-norm) [19], which was first introduced in object detection, instead of layer-norm to gain more activations in short latency. Additionally, we implement the burst coding scheme [17] in our hidden layers to raise the firing rate by generating a group of short-ISI spikes. Surprisingly, the result turns out to be lower, as Fig. 2(a) shows. Although these two methods produce high firing rates, they do not improve accuracy.

A reduction of firing rates in higher layers is considered the main reason for a drop in performance [31], so many studies aim at increasing the firing rate to enhance the efficiency of information transmission, although a larger number of spikes incurs more dynamic energy consumption and significantly diminishes the low-energy merit of deep SNNs [36]. A high firing rate ensures the transmission of small activations, which is beneficial to image classification or detection, especially in the deep layers for regression. But it can also bring in more background information that obstructs similarity matching.

It has been indicated that only a few convolutional filters are active in describing the target in tracking, and the others contain irrelevant and redundant information, because inter-class differences are mainly related to a few feature channels [45]. For this reason, channel-wise normalization brings more irrelevant information into tracking by enhancing the activations of all channels. Moreover, the burst coding scheme also strongly activates neurons in all channels by generating a group of short inter-spike interval (ISI) spikes. In addition, the small values of the short-ISI spikes result in small response values in correlation operations, which narrows the similarity gap between foreground and background.

Our analysis suggests latent reasons why increasing the firing rates of all channels degrades performance. Therefore, we still utilize layer-norm to balance the weights and threshold voltage. However, as Fig. 2(b) shows, the firing rate of most neurons with layer-norm is below 10%, which requires long latency to gain more activations. In this case, we seek more similarity information to accelerate matching and improve accuracy. The temporal information of spike trains is the main difference between DNNs and SNNs, and it has an important impact on information transmission in SNNs [36, 18]. Motivated by this, we exploit temporal information in the similarity matching method and the coding scheme.

(a) AUC on OTB-2013
(b) Average firing rates
Fig. 2: (a) The area under the curve (AUC) of success plots on the OTB-2013 dataset. (b) Average firing rates of the three methods in conv1-4.

III-C Hybrid Spiking Similarity Matching

Similar to PSM, we can match the temporal similarity between the spiking features of exemplar and candidate images (abbreviated to TSM). Since the correlation operation is essentially the same as convolution, it can be calculated as a convolution per time step in SNNs as follows:


However, it causes strict time consistency of correlation, which can also lead to a large drop of accuracy in actual measurement. For instance in Fig. 3, if spike train A of a pixel has spikes at , and spike train B has spikes at , their temporal correlation is 0 during , but intuitively they are highly similar . The measures of spike train synchrony in [46] also show high similarity of the above two spike trains. Therefore, we introduce a response period with time weights during matching to solve this issue. It’s defined as:


where is the response period threshold. As shown in Fig. 3, Eq. 8 lets spike train A respond to spike train B during . The response value attenuates gradually as the time step moves away from . There is a trade-off between computation cost and the period threshold, and a longer period contributes little to the response value. Thus, is usually set to a small number such as 2.
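The effect of a response period can be illustrated on two single-pixel spike trains that are shifted by one time step. The sketch below is not the weighting used in Eq. 8: the function names, the parameter `tau` standing for the response period threshold, and the time weight that halves with each step of distance are assumptions chosen only to show the idea of a tolerance window with attenuation.

```python
def strict_temporal_correlation(a, b):
    """Correlation that only counts spikes occurring at the same time step."""
    return sum(x * y for x, y in zip(a, b))


def windowed_temporal_correlation(a, b, tau=2, weight=lambda d: 0.5 ** abs(d)):
    """Correlation with a response period of +/- tau time steps.

    `weight(d)` is an assumed attenuation for a time offset d; it is 1 for
    coincident spikes and decays as the offset grows.
    """
    total = 0.0
    for t, x in enumerate(a):
        if not x:
            continue
        for d in range(-tau, tau + 1):
            s = t + d
            if 0 <= s < len(b):
                total += weight(d) * x * b[s]
    return total


# Spike train A fires at steps 1, 3, 5 and spike train B at steps 2, 4, 6.
train_a = [0, 1, 0, 1, 0, 1, 0]
train_b = [0, 0, 1, 0, 1, 0, 1]
print(strict_temporal_correlation(train_a, train_b))
print(windowed_temporal_correlation(train_a, train_b, tau=1))
```

The strict correlation sees no similarity between the two trains, while the windowed version gives a nonzero response because every spike in A has a neighbor in B within one time step.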

Fig. 3: Illustration of temporal correlation between spike train A and B.

Considering PSM performs better than TSM on portions of sequences in OTB-2013, we calculate the final response map as follows:


where and are the weights to do trade-off and normalization, denotes the same bias in SiamFC. Noted that in SiamFC, compares an exemplar image to a candidate image in current frame of the same size. It generates high value if the two images describe the same object and low otherwise. And just simulates the function, not strictly calculates similarity between two spike trains (the similarity equals 1 between two zero spike trains). It’s the reason for using instead of spike numbers as the normalization parameter to get response value.

It is worth noting that almost all tracking models based on Siamese networks have the correlation layer, and many of them obtain state-of-the-art results [21, 45]. The proposed method makes it possible to convert these trackers to deep SNNs. Meanwhile, Siamese networks have been used in various fields, such as person re-identification and face recognition. Our approach provides a reference for similarity matching of spike trains in deep SNNs. In the rest of the paper, we will refer to hybrid spiking similarity matching as HSM.

III-D Two-Status Coding Scheme

Fig. 4: Example of spiking value and temporal distribution in constant threshold, weighted spike, burst coding and two-status coding.

Although Eq. 9 enhances the temporal correlation between two spike trains, the temporal distributions of spikes are random and disorganized, which makes a large portion of spike values underutilized. We expect that the output spike of each time step will contribute to the final response value, which shortens latency for reducing energy consumption.

Weighted spikes [36] are implemented by changing the voltage threshold depending on the phase of a global reference clock, which cyclically produces spikes weighted by the phase value. The phase function is given by

$$\Pi(t) = 2^{-\left(1 + \operatorname{mod}(t-1,\,K)\right)} \tag{10}$$

where K is the period of the phase. Weighted spikes need only K time steps to represent K-bit data, whereas rate coding would require $2^K$ time steps to represent the same data.
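The K-versus-$2^K$ comparison can be made concrete with a small example. The sketch below assumes that the phase weight at time step t is $2^{-(1 + \operatorname{mod}(t-1, K))}$, as in the weighted-spike scheme of [36]; the function names and the greedy encoder are hypothetical and only illustrate how K spikes can carry a K-bit value.

```python
def phase_weight(t, K):
    """Assumed phase weight at 1-indexed time step t with period K."""
    return 2.0 ** -(1 + (t - 1) % K)


def encode_weighted_spikes(value, K):
    """Greedily encode a value in [0, 1) as K weighted spikes (a binary expansion)."""
    spikes = []
    residual = value
    for t in range(1, K + 1):
        w = phase_weight(t, K)
        fired = 1 if residual >= w else 0
        residual -= w * fired
        spikes.append(fired)
    return spikes


def decode_weighted_spikes(spikes, K):
    """Sum the phase weights of the time steps at which a spike occurred."""
    return sum(s * phase_weight(t, K) for t, s in enumerate(spikes, start=1))


K = 8
spikes = encode_weighted_spikes(0.40625, K)
print(spikes, decode_weighted_spikes(spikes, K))
print("time steps with weighted spikes:", K)
print("time steps with rate coding at the same precision:", 2 ** K)
```

For K = 8, the value is recovered from 8 time steps, while rate coding at the same resolution would need 256.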

Similar to the exponential decay of the threshold, burst coding [17] generates a group of short-ISI spikes as follows:


where is a burst constant. The threshold in burst coding changes when the neuron generates a spike, instead of changing periodically as in weighted spikes. These two methods can transmit more information with spikes than rate coding in the same number of time steps, which results in shortened latency and higher energy efficiency in classification. However, as analyzed before, small spike values result in small response values in correlation operations. To this end, we need periodic spikes with the same value.

Motivated by the neural phenomenon that repetitive electrical activity can induce a persistent potentiation or depression of synaptic efficacy in various parts of the nervous system (LTP and LTD)  [24], we propose two-status coding scheme (hereinafter abbreviated as TS coding), which represents the voltage threshold of potentiation status and depression status. We define the function as follows:


where to prevent generating spikes and is often smaller than 1, is a constant that controls the period of neuron state change, is the current time step.

We illustrate the spike values and temporal distributions of the different methods in Fig. 4. A constant threshold generates spikes of stable value at irregular times; weighted spikes take different values and usually emerge at low thresholds; burst coding fires a group of short-ISI spikes with exponentially decreasing values. Our approach makes neurons fire only at the same value in potentiation status and accumulate membrane potential in depression status. We constrain equivalent spikes to a fixed periodic distribution and increase the density of spikes in potentiation status to enhance the response value. The proposed method can also save energy by preventing neurons from generating spikes that are useless for matching. Moreover, spiking neurons in the next layer do not consume energy during depression status because their input is zero.
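The behavior described above, firing at one value in potentiation status and only accumulating potential in depression status, can be sketched with a threshold that alternates between two levels. This is an illustrative reading, not the paper's exact function: the names `two_status_threshold` and `simulate_ts_neuron`, the choice to start in potentiation status, the infinite depression threshold and the example values are all assumptions.

```python
def two_status_threshold(t, period, v_pot, v_dep):
    """Threshold at 1-indexed time step t.

    Assumed schedule: the neuron spends `period` steps in potentiation status
    (threshold v_pot), then `period` steps in depression status (threshold
    v_dep, large enough to prevent spikes), and repeats.
    """
    status_index = (t - 1) // period
    return v_pot if status_index % 2 == 0 else v_dep


def simulate_ts_neuron(inputs, period=2, v_pot=0.5, v_dep=float("inf")):
    """IF neuron with reset by subtraction under the two-level threshold."""
    v = 0.0
    spikes = []
    for t, z in enumerate(inputs, start=1):
        v += z
        th = two_status_threshold(t, period, v_pot, v_dep)
        fired = 1 if v >= th else 0
        if fired:
            v -= th   # every emitted spike carries the same value v_pot
        spikes.append(fired)
    return spikes


print(simulate_ts_neuron([0.25] * 8, period=2, v_pot=0.5))
```

In this toy run the potential accumulated during depression status is released as denser spikes in the following potentiation window, and no spikes are emitted in between.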

Park et al. [17] propose a hybrid coding scheme on input and hidden layers, motivated by the observation that neurons occasionally use different neural coding schemes depending on their roles and locations in the brain. Inspired by this idea, we use the TS coding scheme only on the last (output) spiking layer, which is the input for correlation, because our method only optimizes the temporal distribution for better matching and offers no advantage for transmission in hidden layers. We use real values for the input layer because they are fast and accurate.

IV Experimental Results

IV-A Experimental Settings

We implement SiamSNN with weights and biases converted from SiamFC, which has a simple architecture but excellent performance. We evaluate our methods on OTB-2013, OTB-2015 and VOT2016. The simulation and implementation are based on TensorFlow.

OTB and VOT are widely used tracking benchmarks, which contain a variety of tracking challenges such as out-of-view, variation, occlusion, deformation, fast motion, rotation, and background cluttering. OTB datasets adopt overlap score and center location error as their base metrics. The success plot shows the ratio of successful frames whose overlap ratio is larger than a given threshold. The precision plot measures the percentage of frames whose center location error is within a certain threshold. The area under the curve (AUC) of the success plot is usually used to rank tracking algorithms. As for VOT, the performance is evaluated in terms of accuracy (average overlap) and robustness (failure times). The overall performance is evaluated through Expected Average Overlap (EAO), which takes account of both accuracy and robustness.

In the evaluations, we find that when the response period threshold is set to 1, the ratio of is about 1:20, the neuron state period is set to 10, and is set between 0.6 and 0.7 , the tracker is relatively effective. Moreover, when is enough to prevent neurons generating spikes.

Method           All   BC    DEF   FM    IPR   IV    LR    MB    OCC   OPR   OV    SV
SiamFC           67.2  65.4  62.2  67.4  66.4  65.6  69.3  69    65.5  65.5  66.6  66.4
SiamSNN+PSM      59.6  56.3  54.9  59.2  60.2  56.1  62    63.4  58.7  57.1  56.2  57.3
SiamSNN+TSM      60.4  56.7  56.7  61.4  61.4  56.9  57.4  63.8  59.6  57.9  54.2  57.7
SiamSNN+HSM      60.5  56.8  57.6  60.8  60.7  56.5  57.3  63.6  60.4  57.8  53.9  58
SiamSNN+HSM+TS   61.7  59.9  58.4  61.3  61.6  59.8  59.5  64    61.1  59.8  54.8  59.1
TABLE I: The average overlap ratio (%) across all the videos (All) and for each challenge on OTB100, for the four configurations of SiamSNN and SiamFC. The challenges are abbreviated as follows: background clutters (BC), deformation (DEF), fast motion (FM), in-plane rotation (IPR), illumination variation (IV), low resolution (LR), motion blur (MB), occlusion (OCC), out-of-plane rotation (OPR), out-of-view (OV), scale variation (SV).

IV-B Tracking Results

(a) Results on the OTB-2013 dataset
(b) Results on the OTB-2015 dataset
Fig. 5: Success and precision plots on the OTB-2013 and OTB-2015 datasets.
Fig. 6: Tracking results on 10 sequences of OTB-2015 (from left to right and top to down are basketball, bolt2, car1, david3, human7, jogging1, lemming, skating2-1, surfer, tiger1). The indexes are shown in the top left of each frame.

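To assess and verify the proposed methods, we first use the overlap ratio (OR) as the basic metric. OR is the Intersection over Union (IoU) between the tracked bounding box and the ground truth bounding box, which is calculated as in the object detection challenge VOC [47]:

$$\mathrm{OR} = \frac{\operatorname{area}(B_T \cap B_G)}{\operatorname{area}(B_T \cup B_G)}$$

where $B_T$ is the tracked bounding box and $B_G$ is the ground truth bounding box.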
No spike-based method has previously been used to convert the correlation layer, and although PySpike [46] can measure the distance between two single spike trains, it is hard to implement in deep SNNs. Thus, we only compare the four configurations of SiamSNN with SiamFC and compute the average OR across all the videos and for each challenge on OTB100, as shown in Table I. SiamSNN with HSM and TS coding achieves the least accuracy degradation (5.5%) compared to SiamFC (61.7% vs. 67.2%). For further evaluation, we measure the precision and success plots of OPE on OTB-2013 and OTB-2015. As depicted in Fig. 5, the AUC of the precision and success plots for SiamFC are 79% and 59.19% (OTB-2013) and 77% and 57.78% (OTB-2015), while SiamSNN+HSM+TS achieves 72% and 52.66% (OTB-2013) and 68% and 50.72% (OTB-2015). SiamSNN+HSM+TS outperforms the other SiamSNN configurations.
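The overlap ratio used as the basic metric above is straightforward to compute for axis-aligned boxes. The sketch below assumes boxes given as `(x, y, width, height)` tuples, a common layout for tracking annotations; the function name `overlap_ratio` is hypothetical.

```python
def overlap_ratio(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the intersection rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


tracked = (5, 0, 10, 10)
ground_truth = (0, 0, 10, 10)
print(overlap_ratio(tracked, ground_truth))
```

A frame is then counted as successful in the success plot when this value exceeds the chosen threshold.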

We find that SiamSNN loses some accuracy in every type of tracking challenge attribute, especially in low resolution (LR) and out-of-view (OV). We choose some videos for in-depth analysis. Fig. 6 depicts the tracking results of three trackers (SiamFC, SiamSNN+PSM and SiamSNN+HSM+TS) in several challenging sequences. SiamSNN has almost the same performance as SiamFC in most simple tracking scenarios. However, the converted spike-based model has weaker discrimination ability and is likely to make wrong predictions in challenging scenarios.

In occlusion, motion blur and scale variation, the performance of SiamSNN+HSM+TS is closer to SiamFC than that of SiamSNN+PSM. PSM often drifts off the target when heavy occlusions and scale variations occur in car1, david3, human7 and jogging1.

In background clutters, plane rotation and fast motion, both SiamSNN and SiamFC drift off the target when encountering challenging circumstances. As a result, SiamSNN unexpectedly performs better than SiamFC in the basketball sequence, and they all drift away in the bolt2 and skating2-1 sequences.

In low resolution and out-of-view, although SiamSNN can barely track the target in the lemming, surfer and tiger1 sequences, the bounding boxes always deviate from the ground truth.

On OTB-2015, the overlap ratio drops sharply once SiamSNN drifts off the target, since the tracker is not reset on the OTB benchmark. Thus, we compare SiamFC and SiamSNN+HSM+TS on VOT2016. As shown in Table II, the accuracy drops by only 3% but robustness becomes poor, and the degradation of EAO is 2%. This shows that the converted spike-based model loses some discriminative features, which causes SiamSNN to drift off the target when encountering complex circumstances.

Tracker   Accuracy  Robustness  EAO
SiamFC    0.53      0.49        0.23
SiamSNN   0.50      0.63        0.21
TABLE II: Comparison of SiamFC and SiamSNN on VOT2016.
Fig. 7: Results of SiamSNN on different configurations evaluated by latency and the AUC of success plots on OTB-2015.

IV-C Energy Consumption and Speed Evaluation

Method    Operations  Power (W)  Time (ms)  Energy (J)
SiamFC    5.44E+09    250        11.63      2.9
SiamSNN   1.10E+09    2.75E-03   20         5.5E-05
TABLE III: Comparison of speed and energy consumption of SiamFC and SiamSNN

The latency requirements are presented in Fig. 7. The AUC of SiamSNN rises rapidly within the first 10 time steps and then increases slowly, so longer latency contributes little to accuracy. To reach the maximum AUC, SiamSNN+HSM+TS requires approximately 20 time steps. In other tasks, burst coding demands 3000 time steps for image classification on CIFAR-100 [17] and Spiking-YOLO needs 3500 time steps for object detection [19]. HSM+TS achieves remarkably short latency, which makes real-time operation on neuromorphic chips possible.

We estimate energy consumption and speed on a neuromorphic chip (TrueNorth [6]) to investigate the effect of our methods, and compare them to SiamFC. Bertinetto et al. [3] run SiamFC(3s) on a single NVIDIA GeForce GTX Titan X (250 Watts) and reach 86 frames per second (11.63 ms per frame). TrueNorth measures computation in synaptic operations per second (SOPS) and can deliver 400 billion SOPS per Watt, whereas modern supercomputers measure it in floating-point operations per second (FLOPS). The time step is nominally 1 ms, set by a global 1 kHz clock [6]. We count the operations with the formula in [16].

The calculation results of processing one frame are presented in Table III. The energy consumption of SiamSNN on TrueNorth is far lower than that of SiamFC on GPU. It is worth noting that our proposed methods reach 50 FPS on tracking, while SNNs require long latency in image classification and object detection. We can expect that SiamSNN can be implemented on many embedded systems and applied widely in CV scenarios.
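The per-frame figures in Table III can be cross-checked with simple arithmetic: energy is power multiplied by processing time, and frame rate is the inverse of the time per frame. The sketch below uses only the numbers reported above; the function names are hypothetical, and reading the SiamSNN power figure as the operation count divided by 400 billion SOPS per Watt is an interpretation of the table, not a statement from the text.

```python
def energy_joules(power_w, time_ms):
    """Energy per frame in Joules from power in Watts and time in milliseconds."""
    return power_w * time_ms / 1000.0


def frames_per_second(time_ms):
    """Frame rate implied by the processing time of one frame."""
    return 1000.0 / time_ms


# SiamFC on GPU: 250 W, 11.63 ms per frame.
siamfc_energy = energy_joules(250, 11.63)
# SiamSNN on TrueNorth: 2.75E-03 W, 20 time steps of 1 ms each.
siamsnn_energy = energy_joules(2.75e-3, 20)

print(round(siamfc_energy, 2), siamsnn_energy)
print(round(frames_per_second(11.63)), frames_per_second(20))
print("energy ratio:", round(siamfc_energy / siamsnn_energy))

# Interpretation (assumption): 1.10E+09 operations at 400 billion SOPS per Watt.
print(1.10e9 / 400e9)
```

The 20 ms per frame follows from the roughly 20 time steps needed by SiamSNN+HSM+TS at a nominal 1 ms per time step, which gives the 50 FPS quoted above.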

V Conclusion

In this paper, we design SiamSNN, the first deep SNN model for object tracking with short latency and low precision loss compared to SiamFC on the OTB-2013, OTB-2015 and VOT2016 datasets. Our analysis indicates that raising the firing rates of all neurons does not help tracking. Consequently, we propose a hybrid spiking similarity matching method and a two-status coding scheme that take full advantage of the temporal information in spike trains and achieve real-time speed on TrueNorth. We believe that our methods can be applied to more Siamese networks for tracking, person re-identification and face recognition.


  • [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [2] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
  • [3] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr, “Fully-convolutional siamese networks for object tracking,” in ECCV. Springer, 2016, pp. 850–865.
  • [4] X. Yu, T. Liu, X. Wang, and D. Tao, “On compressing deep models by low rank and sparse decomposition,” in CVPR, 2017, pp. 7370–7379.
  • [5] Y. Wei, X. Pan, H. Qin, W. Ouyang, and J. Yan, “Quantization mimic: Towards very tiny cnn for object detection,” in ECCV, 2018, pp. 267–283.
  • [6] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura et al., “A million spiking-neuron integrated circuit with a scalable communication network and interface,” Science, vol. 345, no. 6197, pp. 668–673, 2014.
  • [7] M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain et al., “Loihi: A neuromorphic manycore processor with on-chip learning,” IEEE Micro, vol. 38, no. 1, pp. 82–99, 2018.
  • [8] N. K. Kasabov, “Neucube: A spiking neural network architecture for mapping, learning and understanding of spatio-temporal brain data,” Neural Networks, vol. 52, pp. 62–76, 2014.
  • [9] S. Ghosh-Dastidar and H. Adeli, “Spiking neural networks,” International journal of neural systems, vol. 19, no. 04, pp. 295–308, 2009.
  • [10] W. Maass, “Networks of spiking neurons: the third generation of neural network models,” Neural networks, vol. 10, no. 9, pp. 1659–1671, 1997.
  • [11] A. Tavanaei, M. Ghodrati, S. R. Kheradpisheh, T. Masquelier, and A. Maida, “Deep learning in spiking neural networks,” Neural Networks, 2018.
  • [12] N. Caporale and Y. Dan, “Spike timing–dependent plasticity: a hebbian learning rule,” Annu. Rev. Neurosci., vol. 31, pp. 25–46, 2008.
  • [13] Y. Jin, W. Zhang, and P. Li, “Hybrid macro/micro level backpropagation for training deep spiking neural networks,” in Advances in Neural Information Processing Systems, 2018, pp. 7005–7015.
  • [14] Y. Cao, Y. Chen, and D. Khosla, “Spiking deep convolutional neural networks for energy-efficient object recognition,” International Journal of Computer Vision, vol. 113, no. 1, pp. 54–66, 2015.
  • [15] P. U. Diehl, D. Neil, J. Binas, M. Cook, S.-C. Liu, and M. Pfeiffer, “Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing,” in IJCNN. IEEE, 2015, pp. 1–8.
  • [16] B. Rueckauer, I.-A. Lungu, Y. Hu, M. Pfeiffer, and S.-C. Liu, “Conversion of continuous-valued deep networks to efficient event-driven networks for image classification,” Frontiers in neuroscience, vol. 11, p. 682, 2017.
  • [17] S. Park, S. Kim, H. Choe, and S. Yoon, “Fast and efficient information transmission with burst spikes in deep spiking neural networks,” in Design Automation Conference. ACM, 2019, p. 53.
  • [18] K. Roy, A. Jaiswal, and P. Panda, “Towards spike-based machine intelligence with neuromorphic computing,” Nature, vol. 575, no. 7784, pp. 607–617, 2019.
  • [19] S. Kim, S. Park, B. Na, and S. Yoon, “Spiking-yolo: Spiking neural network for real-time object detection,” arXiv preprint arXiv:1903.06530, 2019.
  • [20] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu, “Distractor-aware siamese networks for visual object tracking,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 101–117.
  • [21] B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan, “Siamrpn++: Evolution of siamese visual tracking with very deep networks,” in CVPR, 2019, pp. 4282–4291.
  • [22] G. Pedretti, V. Milo, S. Ambrogio, R. Carboni, S. Bianchi, A. Calderoni, N. Ramaswamy, A. Spinelli, and D. Ielmini, “Memristive neural network for on-line learning and tracking with brain-inspired spike timing dependent plasticity,” Scientific reports, vol. 7, no. 1, p. 5288, 2017.
  • [23] Z. Cao, L. Cheng, C. Zhou, N. Gu, X. Wang, and M. Tan, “Spiking neural network-based target tracking control for autonomous mobile robots,” Neural Computing and Applications, vol. 26, no. 8, pp. 1839–1847, 2015.
  • [24] G.-q. Bi and M.-m. Poo, “Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type,” Journal of neuroscience, vol. 18, no. 24, pp. 10 464–10 472, 1998.
  • [25] Y. Wu, J. Lim, and M.-H. Yang, “Online object tracking: A benchmark,” in CVPR, 2013, pp. 2411–2418.
  • [26] Y. Wu, J. Lim, and M.-H. Yang, “Object tracking benchmark,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1834–1848, 2015.
  • [27] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder et al., “The visual object tracking VOT2016 challenge results,” in ECCV 2016 Workshops, 2016, pp. 777–823.
  • [28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [29] A. Sengupta, Y. Ye, R. Wang, C. Liu, and K. Roy, “Going deeper in spiking neural networks: Vgg and residual architectures,” Frontiers in neuroscience, vol. 13, 2019.
  • [30] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International journal of computer vision, vol. 115, no. 3, pp. 211–252, 2015.
  • [31] B. Rueckauer, I.-A. Lungu, Y. Hu, and M. Pfeiffer, “Theory and tools for the conversion of analog to spiking convolutional neural networks,” arXiv preprint arXiv:1612.04052, 2016.
  • [32] J. Gautrais and S. Thorpe, “Rate coding versus temporal order coding: a theoretical approach,” Biosystems, vol. 48, no. 1-3, pp. 57–65, 1998.
  • [33] S. Thorpe, A. Delorme, and R. Van Rullen, “Spike-based strategies for rapid processing,” Neural networks, vol. 14, no. 6-7, pp. 715–725, 2001.
  • [34] S. J. Thorpe, “Spike arrival times: A highly efficient coding scheme for neural networks,” Parallel processing in neural systems, pp. 91–94, 1990.
  • [35] C. Kayser, M. A. Montemurro, N. K. Logothetis, and S. Panzeri, “Spike-phase coding boosts and stabilizes information carried by spatial and temporal spike patterns,” Neuron, vol. 61, no. 4, pp. 597–608, 2009.
  • [36] J. Kim, H. Kim, S. Huh, J. Lee, and K. Choi, “Deep neural networks with weighted spikes,” Neurocomputing, vol. 311, pp. 373–386, 2018.
  • [37] M. Fiaz, A. Mahmood, S. Javed, and S. K. Jung, “Handcrafted and deep trackers: A review of recent object tracking approaches,” ACM Surveys, 2018.
  • [38] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. Torr, “Staple: Complementary learners for real-time tracking,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1401–1409.
  • [39] W. Ruan, J. Chen, Y. Wu, J. Wang, C. Liang, R. Hu, and J. Jiang, “Multi-correlation filters with triangle-structure constraints for object tracking,” IEEE Transactions on Multimedia, vol. 21, no. 5, pp. 1122–1134, 2018.
  • [40] D. Held, S. Thrun, and S. Savarese, “Learning to track at 100 fps with deep regression networks,” in ECCV. Springer, 2016, pp. 749–765.
  • [41] Q. Wang, C. Yuan, J. Wang, and W. Zeng, “Learning attentional recurrent neural network for visual tracking,” IEEE Transactions on Multimedia, vol. 21, no. 4, pp. 930–942, 2018.
  • [42] H. Hu, B. Ma, J. Shen, H. Sun, L. Shao, and F. Porikli, “Robust object tracking using manifold regularized convolutional neural networks,” IEEE Transactions on Multimedia, vol. 21, no. 2, pp. 510–521, 2018.
  • [43] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu, “High performance visual tracking with siamese region proposal network,” in CVPR, 2018, pp. 8971–8980.
  • [44] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
  • [45] X. Li, C. Ma, B. Wu, Z. He, and M.-H. Yang, “Target-aware deep tracking,” in CVPR, 2019, pp. 1369–1378.
  • [46] M. Mulansky and T. Kreuz, “Pyspike-a python library for analyzing spike train synchrony,” SoftwareX, vol. 5, pp. 183–189, 2016.
  • [47] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.