## 1 Introduction

Deep learning, i.e., the use of deep convolutional neural networks (DCNNs), is a powerful tool for pattern recognition (image classification) and natural language (speech) processing [52] [44]. Deep convolutional networks use multiple convolution layers to learn from the input data [26] [53] [12]. They have been used to classify the large ImageNet data set [25] with an accuracy of 96.6% [4]. In this work deep spiking networks are considered [47]. This is a new paradigm for implementing artificial neural networks using mechanisms that incorporate spike-timing dependent plasticity, a learning algorithm discovered by neuroscientists [16] [36]. Advances in deep learning have opened up a multitude of new avenues that were once limited to science fiction [61]. The promise of spiking networks is that they are less computationally intensive and much more energy efficient, as the spiking algorithms can be implemented on a neuromorphic chip such as Intel’s Loihi chip [6], which operates at low power because it runs asynchronously using spikes. Our work is based on the work of Masquelier and Thorpe [38] [37] and Kheradpisheh et al. [23] [22]. In particular, a study is done of how such networks classify MNIST image data [28] and N-MNIST spiking data [45]. The networks used in [23] [22] consist of multiple convolution/pooling layers of spiking neurons trained using spike timing dependent plasticity (STDP) [54] and a final classification layer implemented with a support vector machine (SVM) [17].

### 1.1 Spike Timing Dependent Plasticity (STDP)

Spike timing dependent plasticity (STDP) [35] has been shown to be able to detect patterns hidden in noise in spiking data [37]. Figure 1 shows a simple 2 layer fully connected network with multiple input (pre-synaptic) neurons and 1 output neuron. The spike signals are modelled as being either 0 or 1 in one millisecond increments. That is, a 1 msec pulse of unit amplitude represents a spike while a value of 0 represents no spike present (see the left side of Figure 1). Each spike signal has a weight (synapse) associated with it which multiplies the signal to obtain what is called the *post synaptic potential* due to that input neuron. These potentials are then summed over the input neurons, and the sum is called the *membrane potential* of the output neuron. At any time, if the membrane potential is greater than a specified threshold, then the output neuron spikes. By this we mean that the output neuron produces a 1 msec pulse of unit amplitude (see the right side of Figure 1).
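The neuron model just described can be illustrated with a minimal sketch. The function name, the example weights, and the threshold value are hypothetical. The sketch also assumes that the potential accumulates across time steps and is reset to zero after an output spike; neither detail is stated in the text above.

```python
def run_neuron(spike_trains, weights, threshold):
    """Simulate one output neuron driven by binary input spike trains.

    spike_trains[i][t] is 1 if input neuron i spikes at time step t, else 0.
    Returns the list of time steps at which the output neuron spikes.
    """
    n_steps = len(spike_trains[0])
    potential = 0.0
    output_spike_times = []
    for t in range(n_steps):
        # Post synaptic potentials: each spike signal multiplied by its weight.
        potential += sum(w * s[t] for w, s in zip(weights, spike_trains))
        if potential > threshold:
            output_spike_times.append(t)
            potential = 0.0  # assumed reset after an output spike
    return output_spike_times


# Three input neurons, four 1 msec time steps (illustrative values).
trains = [[1, 0, 1, 0],
          [0, 1, 1, 0],
          [0, 1, 0, 0]]
print(run_neuron(trains, weights=[0.5, 0.5, 0.5], threshold=1.2))
```

With these values the potential first exceeds the threshold at the second time step, when two inputs spike together on top of the potential left by the first spike.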

Suppose there is a sequence of times at which the input spike pattern is fixed, i.e., the same pattern of spikes is presented each time, while at all other times the spike values are *random*. The idea here is that the weights can be updated according to an unsupervised learning rule that results in the output spiking if and only if the fixed pattern is present. The learning rule used here is called spike timing dependent plasticity or STDP. Specifically, we used a simplified STDP model as in [23]. The rule compares the spike times of the pre-synaptic (input) and the post-synaptic (output) neuron: if the input neuron spikes before the output neuron spikes then the weight is increased; otherwise the weight is decreased. (In the latter case the input neuron is assumed to have spiked *after* the output neuron spiked.) Learning refers to the change in the (synaptic) weights, with one learning rate constant governing the increases and another governing the decreases. These rate constants are initialized with low values and are typically increased as learning progresses. This STDP rule is considered simplified because the amount of weight change doesn’t depend on the time duration between pre-synaptic and post-synaptic spikes.

To summarize, if the pre-synaptic (input) neuron spikes before the post-synaptic (output) neuron, then the synapse is strengthened. If the pre-synaptic neuron doesn’t spike before the post-synaptic neuron, then it is assumed that the pre-synaptic neuron will spike later and the synapse is weakened.
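A minimal sketch of this simplified rule follows. It assumes the multiplicative form in which the weight change is proportional to w(1 - w), as commonly used with the simplified STDP rule of [23]; this form keeps each weight between 0 and 1. The function name, the default learning rate values, and the choice to treat an input spike in the same 1 msec step as "before" the output spike are assumptions of the sketch.

```python
def stdp_update(weights, pre_spike_times, post_spike_time,
                a_plus=0.004, a_minus=0.003):
    """Return updated weights after one output (post-synaptic) spike.

    pre_spike_times[i] is the spike time of input neuron i, or None if it
    has not spiked yet (it is then assumed to spike later).
    a_plus and a_minus are illustrative learning rate constants.
    """
    new_weights = []
    for w, t_pre in zip(weights, pre_spike_times):
        if t_pre is not None and t_pre <= post_spike_time:
            dw = a_plus * w * (1.0 - w)    # input spiked first: strengthen
        else:
            dw = -a_minus * w * (1.0 - w)  # otherwise: weaken
        new_weights.append(w + dw)
    return new_weights


w = stdp_update([0.5, 0.5], pre_spike_times=[2, None], post_spike_time=3)
print(w)
```

Note that the size of the change depends only on the current weight and on which neuron spiked first, not on the time difference between the two spikes.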

The membrane potential profile of the type of output neuron considered here is shown in Figure 2. In Figure 2 the output neuron receives a spike at 1 msec, two spikes at 2 msec and another two spikes at 3 msec. The output neuron spikes at time 3 msec as its membrane potential then exceeds the threshold.

Figure 3 shows a raster plot of the input neurons versus their spike times for the first 54 msecs. A dot denotes a spike at that time while an empty space denotes no spike. Red dots in the plot indicate spikes that are part of a fixed pattern of spikes. In Figure 3 the pattern presented to the output neuron is 5 msec in duration. The blue dots of Figure 3 denote random spikes produced by the input neurons (noise). On close observation of Figure 3 one can see that the fixed spike pattern in red is presented at time 0, time 13, and time 38.

Using only the above STDP learning rule, the output neuron learns to spike only when the fixed pattern is produced by the input neurons. With the weights initialized randomly from a normal distribution, Figure 4 shows the output spiking for the first 50 msecs. However, after about 2000 msec, Figure 5 shows that the output neuron starts to spike selectively, though it still incorrectly spikes at times when the pattern is not present. Finally, after about 3000 msec, Figure 6 shows that the output neuron spikes only when the pattern is present.

### 1.2 Convolution operation

In this work spiking convolutional neural networks (SCNN) are used for feature extraction. A short explanation of convolution is now presented. Figure 7 shows a convolution operation on an input image. The convolution weight kernel (filter) is indicated by the red square in Figure 7. With the kernel centered on a given location of the input image, the value of the output image at that location is the sum of the products of the kernel weights with the input pixels they overlap.

Note that the shape of the output image is the same as that of the input image; such convolutions are called *same* mode convolutions.
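The same mode convolution described above can be sketched as follows. The sketch uses the correlation form (no kernel flip) and pads with zeros outside the image; both choices, as well as the function name and the small example arrays, are assumptions made for illustration.

```python
def conv2d_same(image, kernel):
    """Same-mode 2D convolution (correlation form) with zero padding."""
    rows, cols = len(image), len(image[0])
    k = len(kernel)          # square kernel with odd side length
    half = k // 2
    out = [[0.0] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            total = 0.0
            for i in range(k):
                for j in range(k):
                    r, c = u + i - half, v + j - half
                    if 0 <= r < rows and 0 <= c < cols:  # zero outside image
                        total += kernel[i][j] * image[r][c]
            out[u][v] = total
    return out


# A kernel with a vertical line of ones responds most strongly where
# spikes line up vertically (illustrative 3 x 3 kernel and 4 x 4 image).
vertical = [[0, 1, 0],
            [0, 1, 0],
            [0, 1, 0]]
spikes = [[0, 0, 1, 0],
          [0, 0, 1, 0],
          [0, 0, 1, 0],
          [0, 0, 0, 0]]
result = conv2d_same(spikes, vertical)
print(result[1][2])  # response at the centre of the vertical line of spikes
```

The output has the same number of rows and columns as the input, which is what distinguishes same mode from valid mode.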

Convolution networks are used to detect features in images. To explain, consider the convolution kernel shown in Figure 8. This kernel is used to find vertical lines of spikes at any location of the spiking input image. At each location and time step the kernel is convolved with the spiking image. If there is a vertical line of spikes in the spiking image that matches up with the kernel, then this result will be a maximum (maximum correlation of the kernel with the image). The membrane potential for the neuron at that location of map 1 of the Conv1 layer is obtained by accumulating these convolution results over time. The neuron then spikes at a given time if its accumulated membrane potential at that time exceeds the threshold. If the neuron at a given location in map 1 of Conv1 spikes, then a vertical line of spikes has been detected in the spiking image centered at that location.

Figure 9 shows that map 2 (second feature map) of Conv1 is used to detect a line of spikes at 45 degrees. The third feature map (map 3) is used to detect a line of spikes at 135 degrees and the fourth feature map (map 4) is used to detect a horizontal line of spikes.

A typical SCNN has multiple layers. Each layer will have multiple feature maps (simply, maps).

## 2 Literature survey

Hubel and Wiesel [18] showed that neurons in a cat’s primary visual cortex are tuned to simple features and that the inner regions of the cortex combine these simple features to represent complex features. The neocognitron model was proposed by Fukushima to explain this behavior [11]. This model didn’t require a “teacher” (i.e., it was unsupervised) to learn the inherent features in the input, akin to the brain. The neocognitron model is a forerunner to the spiking convolutional neural networks considered in this work. The convolutional layers are stacked to extract features in the input data. The terminology “deep” CNNs refers to a network with many such layers. However, the deep CNNs used in industry (Google, Facebook, etc.) are fundamentally different in that they are trained using supervision (back propagation of a cost function). Here our interest is to return to the neocognitron model using spiking convolutional layers in which all but the output layer are trained without supervision.

### 2.1 Unsupervised networks

A network equipped with STDP [35] and lateral inhibition was shown to develop orientation selectivity similar to the visual frontal cortex in a cat’s brain [7] [63]. STDP was shown to facilitate approximate Bayesian computation in the visual cortex using expectation-maximization [43]. STDP is used for feature extraction in multi-layer spiking CNNs. It has been shown that deeper layers combine the features learned in the earlier layers in order to represent advanced features, while at the same time the sparsity of the network spiking activity is maintained [46] [23] [22] [9] [38] [60] [59] [62] [57]. In [8] a fully connected network trained using unsupervised STDP and homeostasis achieved a 95.6% classification accuracy on the MNIST data set.

### 2.2 Reward modulated STDP

Mozafari et al. [40] [41] proposed reward modulated STDP (R-STDP) to avoid using a support vector machine (SVM) as a classifier. It has been shown that the STDP learning rule can find spiking patterns embedded in noise [37]. That is, after unsupervised training, the output neuron spikes if the spiking pattern is input to it. A problem with this unsupervised STDP approach is that, as training proceeds, the output neuron will spike when just the first few milliseconds of the pattern have been presented. For example, the pattern in Figure 3 is 5 msecs long and the output starts to spike when only (say) the first 2 msecs of the pattern have been presented to it, though it should only spike after the full 5 msec pattern has been presented. Mozafari et al. showed in [41] that R-STDP helps to alleviate this problem.

When unsupervised training methods are used, the features learned in the last layer are used as input to an SVM classifier [22] [23] or a simple two or three layer back propagation classifier [56]. In contrast, R-STDP uses a reward or punishment signal (depending upon whether the prediction is correct or not) to update the weights in the final layer of a multi-layer (deep) network. Spiking convolutional networks are successful in extracting features [41] [22] [23]. Because R-STDP is a supervised learning rule, the extracted features (reconstructed weights) more closely resemble the object they detect and thus can, e.g., more easily differentiate between a digit “1” and a digit “7” compared to STDP. That is, reward modulated STDP seems to compensate for the inability of STDP to differentiate between features that closely resemble each other [10] [31] [40] [55]. It is also reported in [40] that R-STDP is more computationally efficient. However, R-STDP is prone to overfitting, which is alleviated to some degree by scaling the rewards and punishments (e.g., receiving a higher punishment for a false positive and a lower reward for a true positive) [40] [41].

In more detail, the reward modulated STDP learning rule applies one weight update if a reward signal is generated and a different weight update if a punishment signal is generated. Both updates depend on the pre- and post-synaptic spike times, and both are scaled using the numbers of misclassified and correctly classified samples counted over every batch of input images. Note that these two counts sum to the number of input images if the decision of the network is based on the maximum potential, whereas their sum may be smaller if the decision of the network is based on the earliest spike, because there might be no spikes for some inputs.

### 2.3 Spiking networks with back propagation

Lee et al. [29] used two unsupervised spiking CNNs for feature extraction. Then, initializing with these weights, they used a type of softmax cost function for classification with the error back propagated through all layers. They were able to obtain a classification accuracy of 99.1% on the MNIST data set. A similar approach with comparable accuracy was carried out in [58]. Other methods, such as computing the weights on conventional (non-spiking) CNNs trained using the back propagation algorithm and then converting them to work on spiking networks, have been shown to achieve an accuracy of 99.4% on the MNIST data set and 91.35% on the CIFAR10 data set [51]. An approximate back propagation algorithm for spiking neural networks was proposed in [2] [30]. In [20] a spiking CNN with 15C5-P2-40C5-P2-300-10 layers using error back propagation through all the layers reported an accuracy of 99.49% on the MNIST data set. The authors in [20] also classified the N-MNIST data set using a fully connected three-layer network with 800 neurons in the hidden layer and reported an accuracy of 98.84%.

Another approach to back propagation in spiking networks is the *random back propagation* approach. First, the standard back propagation equations for (non-spiking) neural networks are summarized [44]. The gradient of a quadratic cost $C = \frac{1}{2}\lVert a^{L} - y \rVert^{2}$ gives the error from the last layer as

$$\delta^{L} = (a^{L} - y) \odot \sigma'(z^{L}) \tag{1}$$

where $a^{L} = \sigma(z^{L})$ is the activation of the neurons in the output layer, $y$ is the target output, $\sigma$ is the activation function, and $z^{L}$ is the net input to the output layer. This error on the last layer is back propagated according to

$$\delta^{l} = \left( (W^{l+1})^{T} \delta^{l+1} \right) \odot \sigma'(z^{l}) \tag{2}$$

where $W^{l+1}$ are the weights connecting the $l^{th}$ and $(l+1)^{th}$ layers. With learning rate $\eta$, the weights and biases are updated as follows:

$$W^{l} \leftarrow W^{l} - \eta\, \delta^{l} (a^{l-1})^{T} \tag{3}$$

$$b^{l} \leftarrow b^{l} - \eta\, \delta^{l} \tag{4}$$

In equation (2), the weight matrix $W^{l+1}$ connecting the $l^{th}$ and $(l+1)^{th}$ layers is the same as the weight matrix used in forward propagation to calculate the activations of the $(l+1)^{th}$ layer. This is bothersome to the neuroscience community as it is not biologically plausible [32] [14] [50]. This is referred to as the *weight transport problem*. Lillicrap et al. [33] showed that the back propagation algorithm works well even if $(W^{l+1})^{T}$ in equation (2) is replaced with a fixed *random* matrix $B^{l+1}$. This eliminates the requirement of weight symmetry, i.e., the same weights for forward and backward propagation. A neuromorphic hardware specific adaptation of random error back propagation that solves the weight transport problem was introduced by [42] and was shown to achieve an error rate of 1.96% for the MNIST data set. The cost function in [42] is defined as

(5) |

where is the error of the output neuron and and are the firing rates of the prediction neuron and the label neuron.

(6) |

In equation (6), was approximated as

(7) |

Where is the current entering into post-synaptic neuron and indicates the presence of a pre-synaptic spike. For more details see [42]. The weight update for the last layer is then

(8) |

The weight update for hidden layers is

(9) |

where denotes the error term of the neuron in the output layer and is a fixed random number as suggested by the random back propagation algorithm. In the work to be reported below, random back propagation is not used. Specifically, when back propagation is used below, it is only between the penultimate and output layer making random back propagation unnecessary.
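To make the difference between standard and random back propagation concrete, the following minimal sketch computes the error of one hidden layer twice: once with the forward weights and once with a fixed random feedback matrix in their place. It is a non-spiking illustration only and is not the neuromorphic adaptation of [42]. The function name, the matrix values, and the use of a sigmoid activation (whose derivative is a(1 - a)) are assumptions.

```python
import random


def hidden_error(feedback_matrix, delta_next, activations):
    """Error of a hidden layer given the error of the next layer.

    feedback_matrix has shape (n_next, n_hidden); its transpose is applied
    to delta_next. A sigmoid activation is assumed, so the derivative of
    the activation is a * (1 - a).
    """
    delta = []
    for j, a in enumerate(activations):
        back = sum(feedback_matrix[k][j] * delta_next[k]
                   for k in range(len(delta_next)))
        delta.append(back * a * (1.0 - a))
    return delta


W = [[0.2, -0.4, 0.1],
     [0.7, 0.3, -0.5]]            # forward weights: 3 hidden -> 2 outputs
rng = random.Random(0)
B = [[rng.uniform(-1, 1) for _ in range(3)]
     for _ in range(2)]           # fixed random feedback, never trained

delta_out = [0.1, -0.2]           # error of the output layer
a_hidden = [0.5, 0.9, 0.1]        # hidden activations

delta_bp = hidden_error(W, delta_out, a_hidden)    # standard back propagation
delta_rbp = hidden_error(B, delta_out, a_hidden)   # random back propagation
print(delta_bp, delta_rbp)
```

Only the feedback path changes between the two calls; the forward pass and the weight update rule are unaffected, which is why weight symmetry is no longer required.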

### 2.4 Spike encoding

Spikes are either rate coded or latency coded [13] [24] [49] [3]. Rate coding refers to the information being encoded by the number of spikes per second (more spikes per unit time carry more information). In this case the spike rate is determined by the mean rate of a Poisson process. Latency encoding refers to the information being encoded in the time of arrival of a spike (earlier spikes carry more information). The raster plot of Figure 3 shows that spatiotemporal information is provided by the input spikes to the output neuron. That is, the output neuron receives both which input neuron is spiking (spatial) and the time at which a neuron spikes (temporal). The spiking networks use this spatiotemporal information to extract features (e.g., detect the pattern in Figure 3) in the input data [15] [39].
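The two encodings can be contrasted with a short sketch. Here an input intensity between 0 and 1 is assumed; rate coding is approximated by one Bernoulli draw per time step (a common discrete-time stand-in for a Poisson process), and latency coding emits a single spike that arrives earlier for larger intensities. The function names and the linear intensity-to-latency mapping are assumptions of this sketch.

```python
import random


def rate_code(intensity, n_steps, rng):
    """Bernoulli approximation of a Poisson spike train: one draw per step."""
    return [1 if rng.random() < intensity else 0 for _ in range(n_steps)]


def latency_code(intensity, n_steps):
    """Single spike whose time is earlier for larger intensities in (0, 1]."""
    train = [0] * n_steps
    if intensity > 0:
        t = min(n_steps - 1, int((1.0 - intensity) * n_steps))
        train[t] = 1
    return train


rng = random.Random(1)
strong = sum(rate_code(0.9, 1000, rng))   # many spikes for a strong input
weak = sum(rate_code(0.1, 1000, rng))     # few spikes for a weak input
print(strong, weak)
print(latency_code(1.0, 10), latency_code(0.25, 10))
```

Rate coding needs many time steps to convey an intensity reliably, whereas latency coding conveys it with one spike per neuron.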

### 2.5 Realtime spikes

Image sensors (silicon retinas) such as ATIS [48] and eDVS [5] provide (latency encoded) spikes as their output. These sensors detect changes in pixel intensities. If the pixel value at a given location increases then an ON-center spike is produced, while if the pixel value decreases an OFF-center spike is produced. Finally, if the pixel value does not change, no spike is produced. The spike data from an image sensor is packed using an address event representation (AER [19]) protocol and can be accessed using serial communication ports. A data set of spikes recorded with the eDVS was introduced in [34], and a similar data set of MNIST images recorded with the ATIS was introduced in [45].

## 3 Background

### 3.1 Spiking Images

We have considered the standard grey-scale MNIST images [28] and the spiking N-MNIST data files [45] for our experiments. (We removed the outermost pixels of the images in the MNIST data set [28].) In the case of the MNIST images we needed to convert them to spikes. This was done by first using both an on-center and an off-center Difference of Gaussian (DoG) convolution filter for edge detection. Each filter is the difference of two Gaussians with different standard deviations, with one choice of standard deviations for the on-center filter and another for the off-center filter.

With the input image, the output of each of the two DoG filters is computed using a *same* mode convolution.

These two resulting “images” were then converted to an on and an off spiking image as follows. At each location of the output image a unit spike is produced if and only if the filter output at that location exceeds a threshold [21].

The spike signal is temporally coded (rank order coding [7]) by delaying its “leaving” the Difference of Gaussian image by an amount that decreases with the filter output. That is, the more the filter output exceeds the threshold the sooner the spike leaves or, equivalently, the value of the filter output is encoded in the delay of the spike.

For all experiments the arrival times of the spikes were sorted in ascending order and then (approximately) equally divided into 10 bins (10 times in Figure 14). The raster plot shows which neurons (pixels of the filtered image) spiked to make up bin 1 (time 0), bin 2 (time 1), etc. Figure 14 shows an example for ON center cell spikes. In all the experiments each image is encoded into 10 msec (10 bins) and there is a 2 msec silent period between every image.
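The encoding steps above can be sketched as follows. The DoG kernel uses the standard two-dimensional Gaussian; the kernel size and standard deviations in the usage lines are illustrative values, not values taken from this work. The binning function relies on the fact that sorting delays in ascending order is the same as sorting filter outputs in descending order. All names are hypothetical.

```python
import math


def dog_kernel(size, sigma1, sigma2):
    """Difference of Gaussians kernel of odd side `size`, centred at zero."""
    half = size // 2

    def gauss(i, j, s):
        return math.exp(-(i * i + j * j) / (2 * s * s)) / (2 * math.pi * s * s)

    return [[gauss(i, j, sigma1) - gauss(i, j, sigma2)
             for j in range(-half, half + 1)]
            for i in range(-half, half + 1)]


def bin_spike_times(filtered, threshold, n_bins=10):
    """Rank-order code: larger filter outputs spike in earlier bins.

    Returns {(row, col): bin_index} for pixels whose output exceeds threshold.
    """
    above = [(value, (r, c))
             for r, row in enumerate(filtered)
             for c, value in enumerate(row) if value > threshold]
    above.sort(key=lambda item: -item[0])         # strongest output first
    bins = {}
    for rank, (_, loc) in enumerate(above):
        bins[loc] = rank * n_bins // len(above)   # roughly equal-sized bins
    return bins


on_center = dog_kernel(7, 1.0, 2.0)    # illustrative parameters
off_center = dog_kernel(7, 2.0, 1.0)   # illustrative parameters
bins = bin_spike_times([[0.9, 0.1], [0.5, 0.7]], threshold=0.2, n_bins=3)
print(bins)
```

Pixels at or below the threshold never appear in the result, so they produce no spike in any bin.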

### 3.2 Network Description

We have a similar network as in [23] [22], as illustrated in Figure 15. The input layer L1 consists of two spiking images, one from the ON center filter and one from the OFF center filter. The L2 layer consists of 30 maps with each map having its own convolution kernel (weights). The “membrane potential” of a neuron of a given map of L2 at a given time is computed by a *valid* mode convolution of that map’s kernel with the input spikes. If at any time the potential exceeds the threshold, then the neuron at that location emits a unit spike.

#### 3.2.1 Convolution Layers and STDP

At any time *all* of the potentials, for every location and every map, are computed (in theory this can all be done in parallel), with the result that neurons in different locations within a map and in different maps may have spiked. In particular, at any given location there can be multiple spikes (up to 30) produced by different maps. The desire is to have different maps learn different features of an image. To enforce this learning, *lateral inhibition* and *STDP competition* are used [23].

##### Lateral Inhibition

To explain lateral inhibition, suppose that at some location the potentials in several different maps exceeded the threshold at the same time. Then the neuron in the map with the highest potential at that location inhibits the neurons in all the other maps at the same location from spiking until the end of the present image (even if their potential exceeded the threshold). Figure 16 shows the accumulated spikes (from an MNIST image of “5”) from all 30 maps at each location with lateral inhibition *not* being imposed. For example, at location (19,14) in Figure 16 the color code is yellow, indicating in excess of 20 spikes, i.e., more than 20 of the maps produced a spike at that location.

Figure 17 shows the accumulation of spikes from all 30 maps, but now with lateral inhibition imposed. Note that at each location there is either 1 spike or no spike as indicated by the color code.
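Lateral inhibition at a single time step can be sketched as below. The sketch only selects the winning map at each location; the bookkeeping needed to keep the losing maps silent until the end of the image is omitted. The function name and the example potentials are hypothetical.

```python
def lateral_inhibition(potentials, threshold):
    """potentials[m][r][c] is the potential of map m at location (r, c).

    Returns spikes[m][r][c] with at most one spike per location: only the
    map with the highest potential may spike, and only if that potential
    exceeds the threshold.
    """
    n_maps = len(potentials)
    rows, cols = len(potentials[0]), len(potentials[0][0])
    spikes = [[[0] * cols for _ in range(rows)] for _ in range(n_maps)]
    for r in range(rows):
        for c in range(cols):
            winner = max(range(n_maps), key=lambda m: potentials[m][r][c])
            if potentials[winner][r][c] > threshold:
                spikes[winner][r][c] = 1
    return spikes


# Three maps, one row, two columns (illustrative values).
pots = [[[1.0, 0.2]],
        [[3.0, 0.1]],
        [[2.0, 0.4]]]
print(lateral_inhibition(pots, threshold=0.5))
```

At the first location all three maps exceed the threshold but only the second map spikes; at the second location no map exceeds the threshold, so there is no spike.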

##### STDP Competition

After lateral inhibition, we consider each map that had one or more neurons whose potential exceeded the threshold. (The other maps did not have any neurons whose membrane potential crossed the threshold and therefore cannot spike.) In each of these maps we locate the neuron that has the maximum potential value. This maximum potential neuron inhibits all other neurons in its own map from spiking for the remainder of the time steps of that spiking image. Further, these maximum potential neurons can inhibit each other depending on their relative location, as we now explain. Suppose the neuron of one map has the highest potential of all these maximum potential neurons. Then, in a fixed-size area centered about its location, this neuron inhibits all neurons of all the other maps in the same area. Next, consider the neuron with the second highest potential among the remaining maximum potential neurons. If the location of this neuron is within the area centered on the first neuron, then it is inhibited. Otherwise, this neuron inhibits all neurons of all the other maps in an area of the same size centered on it. This process is continued for the remaining neurons. In summary, there can be no more than one neuron that spikes in the same area of all the maps.

Figure 18 shows the spike accumulation after both lateral inhibition and STDP competition have been imposed. The figure shows that there is at most one spike from all the maps in any area.
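The selection procedure of STDP competition can be sketched as a greedy pass over the per-map maximum potential neurons, from highest to lowest potential. The half-width `radius` of the square inhibition area is an assumed parameter (the area size is not restated here), and the function name and candidate tuples are hypothetical.

```python
def stdp_competition(candidates, radius):
    """Select which neurons may spike and update their map's weights.

    candidates: list of (potential, map_index, row, col), one entry per map,
    giving that map's maximum-potential neuron. A candidate is dropped if it
    lies within `radius` (in both rows and columns) of an already selected
    winner; otherwise it becomes a winner itself.
    """
    winners = []
    for cand in sorted(candidates, key=lambda c: -c[0]):  # highest first
        _, _, row, col = cand
        blocked = any(abs(row - w[2]) <= radius and abs(col - w[3]) <= radius
                      for w in winners)
        if not blocked:
            winners.append(cand)
    return winners


cands = [(5.0, 0, 10, 10), (4.0, 1, 12, 11), (3.0, 2, 20, 20)]
print(stdp_competition(cands, radius=5))
```

In this example the second candidate lies inside the area around the first and is inhibited, while the third is far enough away to be selected.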

Together, lateral inhibition and STDP competition enforce sparse spiking activity in the network.

##### Spike Timing Dependent Plasticity (STDP)

Only those maps that produced a spike (with lateral inhibition and STDP competition imposed) have their weights (convolution kernels) updated using spike timing dependent plasticity. Consider the weight connecting a pre-synaptic neuron in the L1 layer to a post-synaptic neuron in the L2 layer. When the post-synaptic neuron spikes, the weight is updated according to the simplified STDP rule [7], which depends only on whether the pre-synaptic neuron spiked before the post-synaptic neuron.

The two parameters of this rule are referred to as learning rate constants. Both are initialized to small values and are increased by a factor of 2 after every 1000 spiking images. STDP has been shown to detect a hidden pattern in the incoming spike data [37]. In all of our experiments we used the above simplified STDP model as in [23] (simplified STDP refers to the weight update not depending on the exact time difference between pre-synaptic and post-synaptic spikes). If the pre-synaptic neuron spikes before the post-synaptic neuron then the synapse is strengthened. If the pre-synaptic neuron doesn’t spike before the post-synaptic neuron, then it is assumed that the pre-synaptic neuron will spike later and the synapse is weakened.

Figure 19 is a plot of the weights (convolution kernels) for each of the 30 maps. Following [23], each column corresponds to a map and each row presents the weights after every 500 images. For example, the right-most column of Figure 19 shows the weights for the ON (green) and OFF (red) filters of one map, with the ON and OFF weights superimposed on the same plot. It turned out that there were approximately 17 spikes per image in this layer (L2). At the end of the training most of the synapses will be saturated either at 0 or 1.

##### Homeostasis

Homeostasis refers to the convolution kernels (weights) for all maps being updated approximately the same number of times during training. With homeostasis each kernel gets approximately the same number of opportunities to learn its unique feature. Some maps tend to update their weights more than others and, if this continues, these maps can take over the learning. That is, only the features (weights of the convolution filter) of those maps that get updated often will be of value, with the rest of the maps not learning any useful feature (as their weights are not updated). Homeostasis was enforced by simply decreasing the weights of a map if it tries to update more than twice for every 5 input images.

#### 3.2.2 Pooling Layers

A pooling layer is a way to down sample the spikes from the previous convolution layer to reduce the computational effort.

##### Max Pooling

After the synapses (convolution kernels or weights) from L1 to L2 have been learned, i.e., unsupervised STDP learning is over (and therefore STDP competition is no longer enforced), they are fixed, but lateral inhibition continues to be enforced in L2. Spikes from the maps of the convolution layer L2 are now passed on to layer L3 using max pooling. First of all, we ignored the last row and last column of each of the maps of L2. Next, consider the first map of the convolution layer L2. This map is divided into non-overlapping areas of neurons. In each of these sets of neurons, at most one spike is allowed through. If there is more than one spike coming from an area, then one compares the membrane potentials of the spiking neurons and passes the spike with the highest membrane potential. Each set of neurons in the first map of L2 then corresponds to a single neuron in the first map of the L3 layer, so each map of L3 is a down sampled version of the corresponding L2 map. This process is repeated for all the maps of L2 to obtain the corresponding maps of L3. Lateral inhibition is not applied in a pooling layer. There is no learning done in the pooling layer; it is just a way to decrease the amount of data to reduce the computational effort.

After training the L2 convolution layer, we then passed 60,000 MNIST digits through the network and recorded the spikes from the L3 pooling layer. This is shown in Figure 20. For example, the upper left-hand corner of Figure 20 shows the number of spikes coming out of the first map of the pooling layer L3 for each of the 10 MNIST digits. It shows that the digit “3” produced over 100,000 spikes when the 60,000 MNIST digits were passed through the network while the digit “1” produced almost no spikes. That is, the spikes coming from digit “1” do not correlate with the convolution kernel (see the inset) to produce a spike. On the other hand, the digit “3” almost certainly causes a spike in the first map of the L3 pooling layer. In the bar graphs of Figure 20 the red bars are the five MNIST digits that produced the most spikes in the L3 pooling layer while the blue bars are the five MNIST digits that produced the least.
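The max pooling step described in this subsection can be sketched for a single map as follows. The block size of 2 is an assumption made for illustration, as are the function name and the example arrays. Leftover rows and columns that do not fill a complete block are dropped, which mirrors ignoring the last row and column of a map with an odd side length.

```python
def max_pool_spikes(spikes, potentials, size=2):
    """Pass at most one spike per non-overlapping size x size block.

    spikes[r][c] is 0 or 1 and potentials[r][c] is the membrane potential.
    Returns (pooled_spikes, winner_locations); a winner is None when the
    block contains no spike.
    """
    rows, cols = len(spikes) // size, len(spikes[0]) // size  # drop leftover edge
    pooled = [[0] * cols for _ in range(rows)]
    winners = [[None] * cols for _ in range(rows)]
    for br in range(rows):
        for bc in range(cols):
            block = [(potentials[r][c], (r, c))
                     for r in range(br * size, (br + 1) * size)
                     for c in range(bc * size, (bc + 1) * size)
                     if spikes[r][c] == 1]
            if block:
                pooled[br][bc] = 1
                winners[br][bc] = max(block)[1]  # spike with highest potential
    return pooled, winners


spk = [[1, 1, 0],
       [0, 0, 0],
       [0, 0, 1]]
pot = [[2.0, 3.0, 0.0],
       [0.0, 0.0, 0.0],
       [0.0, 0.0, 9.0]]
print(max_pool_spikes(spk, pot))
```

In this example the 3 x 3 map reduces to a single pooled neuron, and the spike in the dropped last row and column is ignored.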

Figure 21 shows the convolution kernels between the L3 pooling layer and the L4 convolution layer. We chose to have 500 maps in L4, each with its own kernel.

The spikes from the L3 pooling layer are then used to train these weights (convolutional kernels) in the same manner as the weights of the L2 convolution layer.

In some of our experiments we simply did a type of global pooling to go to the output layer L5. Specifically, at each time step, we convolve the spikes from L3 to compute the potential for each of the neurons of L4. The maximum potential for each map in L4 was then found and stored (with 500 maps in L4, this is a vector in $\mathbb{R}^{500}$). The potentials in L4 were then reset to 0 and the process repeated for each of the remaining time steps of the current image. This procedure results in ten vectors for each image. The sum of these vectors then encodes the current image in L5, i.e., as a single vector in $\mathbb{R}^{500}$. The motivation for taking the maximum potential of each map at each time step is that all the neurons in a given map of L4 are looking for the *same* feature in the current image.
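The global pooling just described reduces to two nested reductions: a maximum over the locations of each map at each time step, followed by a sum over the time steps. A minimal sketch, with a hypothetical function name and tiny illustrative inputs, is:

```python
def encode_image(potentials_per_step):
    """potentials_per_step[t][m] is the 2D potential grid of L4 map m at step t.

    For every time step take the maximum potential of each map, then sum
    these per-step vectors into a single feature vector (one entry per map).
    """
    n_maps = len(potentials_per_step[0])
    feature = [0.0] * n_maps
    for step in potentials_per_step:
        for m, grid in enumerate(step):
            feature[m] += max(max(row) for row in grid)
    return feature


# Two time steps and two maps (illustrative values).
steps = [
    [[[1, 2], [0, 3]], [[0.5]]],
    [[[4, 0]], [[0.5, 1.5]]],
]
print(encode_image(steps))
```

The length of the returned vector equals the number of maps, independent of the spatial size of each map.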

Unsupervised STDP training is done in the convolution layers with both STDP competition and lateral inhibition applied to the maps of the convolution layer being trained. Once a convolution layer is trained, its weights are fixed and the spikes are passed through it with only lateral inhibition imposed.

## 4 Classification of MNIST data set

In the following subsections we considered two different network architectures along with different classifiers for the MNIST data set.

### 4.1 Classification with Two Convolution/Pool Layers

In this first experiment the architecture shown in Figure 22 was used. The max pooled “membrane potentials”, i.e., the L5 layer of Figure 22, were used to transform each training image into a new “image” in $\mathbb{R}^{500}$. Using these images along with their labels, a support vector machine [17] was then used to find the hyperplanes that optimally separate the training digits into 10 classes (optimal in the sense that a Lagrangian was minimized). A regularization term on the SVM weights was added to the SVM Lagrangian. Both linear and radial basis function (RBF) kernels were used in the SVM. We used 20,000 MNIST images for the (unsupervised) training of the two convolution/pool layers (Layers L2-L5). Then we used 50,000 images to train the SVM with another 10,000 images used for validation (to determine the choice of the regularization parameter). The SVM gives the hyperplanes that optimally separate the 10 classes of digits. Table 1 shows classification accuracies when 500 maps were used in L4. The first two rows of Table 2 give the test accuracy on 10,000 MNIST test images. In particular, note a 98.01% accuracy for the RBF SVM and a 97.8% accuracy for a linear SVM. Using a similar network with a linear SVM, Kheradpisheh et al. [23] reported an accuracy of 98.3%.

Classifier | Test Acc | Valid Acc | Training Time | Epochs | ||
---|---|---|---|---|---|---|
RBF SVM | 97.92 % | 97.98 % | 8 minutes | 1/3.6 | - | - |
Linear SVM | 97.27 % | 97.30 % | 4 minutes | 1/0.012 | - | - |
2 Layer FCN (backprop) | 96.90 % | 97.02 % | 15 minutes | 1.0 | 30 | |
3 layer FCN (backprop) | 97.8 % | 97.91 % | 50 minutes | 6.0 | 30 |

For comparison purposes with the SVM, we also considered feeding the L5 neurons (i.e., the L5 feature vectors) into both a conventional two and three layer fully connected network (FCN). Using a two layer FCN (see Figure 23) with sigmoidal outputs and a cross-entropy cost function, we obtained a 97.97% classification accuracy. Similarly, with a three layer FCN (see Figure 24) under the same conditions, an accuracy of 98.01% was obtained.

Separability of the MNIST Set

If then the 50,000 training and 10,000 validation images converted to “images” turn out to be completely separable into the 10 digit classes! However, the accuracy on the remaining 10,000 test images drops to 97.01%. The original 60,000 MNIST (training & validation) images in are not separable by a linear SVM (The SVM code was run for 16 hours with without achieving separability).

Increasing the Number of Output Maps

If the number of maps in the L4 layer are increased to 1000 with the L5 maps correspondingly increased to 1000, then there is a slight increase in test accuracy as shown in Table 2. With the 50,000 training and 10,000 validation images converted to “images” also turn out to be completely separable into the 10 digit classes. However, with the test accuracy decreases to 97.61.

Classifier | Test Acc | Valid Acc | Training Time | Epochs | ||
---|---|---|---|---|---|---|
RBF SVM | 98.01 % | 98.20 % | 8 minutes | 1/3.6 | - | - |
Linear SVM | 97.80 % | 98.02 % | 4 minutes | 1/0.012 | - | - |
2 Layer FCN (backprop) | 97.71 % | 98.74 % | 15 minutes | 1.0 | 30 | |
3 layer FCN (backprop) | 98.01 % | 98.10 % | 50 minutes | 6.0 | 30 |

### 4.2 Classification with a Single Convolution/Pool Layer

The architecture shown in Figure 25 has a single convolutional/pooling layer with pooled neurons in L3. These neurons are fully connected to the L4 layer of neurons. However, the neurons in L4 are in 1-1 correspondence with the L3 neurons (flatten). Further, each neuron in L4 simply sums the spikes coming into it from its corresponding neuron in L3. The L4 neurons are fully connected (with trainable weights) to 10 output neurons. This final layer of weights is then trained using backprop only on this output layer, i.e., the error is only back propagated to L4. (See Lee et al. [29], where the error is back propagated through all the layers and an accuracy of 99.3% is reported.) Inhibition settings are the same as in the above experiment.

The first row of Table 3 shows a 98.4% test accuracy using back propagation on the output layer (2 Layer FCN). The second and third rows give the classification accuracy using an SVM trained on the L4 neurons (their spike counts). The feature extraction that takes place in the L2 layer (and passed through the pooling layer) results in greater than 98% accuracy with a two layer conventional FCNN output classifier. A conventional FC two layer NN (i.e., no hidden layer) with the images of the MNIST data set as input has only been reported to achieve 88% accuracy and 91.6% with preprocessed data [27]. This result strengthens our view that the unsupervised STDP appears to convert the MNIST classes into classes in a higher space that are separable.

We also counted the spikes in the network with two convolution/pool layers (see Figure 22), but found that the accuracy decreased (see Table 2). This decrease may be due to the reduced number of spikes in the output neurons compared to having only one convolution/pool layer.

Classifier | Test Acc | Valid Acc | Training Time | Epochs | ||
---|---|---|---|---|---|---|

2 Layer FCN | 98.4% | 98.5% | 10mins | 20 | ||

RBF SVM | 98.8% | 98.87% | 150 minutes | - | - | |

Linear SVM | 98.41% | 98.31% | 100 minutes | - | - |

## 5 Reward Modulated STDP

Reward modulated STDP is a way to use the accumulated spikes at the output to do the final classification (in contrast to SVM and a two layer backprop mentioned above). Figure 26 shows the network architecture where the reward modulated STDP is carried out between the (flattened) L5 layer and the ten output neurons of the L6 layer. The weights between the fully connected neurons of Layer 5 and Layer 6 are then trained as follows: For any input image the spikes through the network arrive between and time steps. The final () membrane potential of the output neuron for is then

Denote by and the number of correctly classified and incorrectly classified images for every (e.g., etc.) input images so . If the output potential is maximum (i.e., for ) and the input image has label then the weights going into the output neuron are rewarded in the sense that

(10) |

If is the maximum potential, but the label of the image is then the weights going into output neuron are punished in the sense that

(11) |

Note that only the weights of those neurons connected to the output neuron with the maximum potential are updated. The term “modulated” in reward modulated STDP refers to the factors and which multiply (modulate) the learning rule. Equation (10) refers to the case where the k-th output neuron also has the highest membrane potential of the ten outputs. If is small then the network is performing well in terms of accuracy and the change in weights is small (as the weights are thought to have already learned to classify correctly). On the other hand, equation (11) refers to the case where the k-th output has the highest membrane potential, but the label is Then, if is small, it follows that is large and the weights of the neurons going into the k-th neuron have their values changed by a relatively large amount to (hopefully) correct the misclassification.
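As an illustration of how the modulating factors act, the following is a minimal sketch, assuming an update of the form used by Mozafari et al. [41][40]: a weight change proportional to w(1 - w), scaled by the fraction of misclassified images on a reward and by the fraction of correctly classified images on a punishment. The function name `rstdp_update`, the `pre_active` flag, and the signs attached to the four amplitudes are assumptions made for illustration and are not taken from this work.

```python
import numpy as np

def rstdp_update(w, pre_active, correct, n_hit, n_miss,
                 a_r_plus=0.004, a_r_minus=-0.003,
                 a_p_plus=0.0005, a_p_minus=-0.004):
    """Update the weights going into the winning output neuron (assumed rule).

    w          : weights in [0, 1] feeding the output neuron with maximum potential
    pre_active : boolean array, True where the pre-synaptic neuron contributed
    correct    : True if the winning neuron matches the image label
    n_hit, n_miss : counts of correct / incorrect classifications over N images
    """
    n = n_hit + n_miss
    stabilizer = w * (1.0 - w)           # keeps weights inside [0, 1]
    if correct:
        scale = n_miss / n               # reward: small change when accuracy is high
        amplitude = np.where(pre_active, a_r_plus, a_r_minus)
    else:
        scale = n_hit / n                # punishment: large change when accuracy is high
        amplitude = np.where(pre_active, a_p_minus, a_p_plus)
    return np.clip(w + scale * amplitude * stabilizer, 0.0, 1.0)
```

With 90 hits and 10 misses, a reward moves a weight of 0.5 by about 0.0001 while a punishment moves it by about 0.0009, which mirrors the asymmetry described above.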

In this experiment with R-STDP, only 20,000 MNIST digits were used for training, 10,000 digits for validation (used to choose the number of training epochs), and the 40,000 remaining digits were used for testing. The R-STDP synaptic weights between L5 and L6 were initialized from the normal distribution . Table 4 shows that a test accuracy of only 90.1% was obtained.

Maps in L4 | Valid acc % | Test Acc % | Epochs |
---|---|---|---|

750 | 91.2 | 90.1 | 150 |

For comparison, we replaced the R-STDP classifier (from L5 to L6) with a simple 2 layer neural network (from L5 to L6) which used error back propagation. These weights for back propagation were initialized from the normal distribution as in [44]. Table 5 shows that R-STDP performed poorly compared to the simple two layer backprop, which ran for only 20 epochs.

Classifier | Test Acc | Valid Acc | Epochs | ||
---|---|---|---|---|---|

2 Layer FCN | 97.5% | 97.6% | 20 |

Mozafari et al. [41][40] got around this poor performance by having 250 neurons in the output layer and assigning 25 output neurons per class. They reported 97.2% test accuracy while training on 60,000 images and testing on 10,000 images. We also considered multiple neurons per class in the output layer. As Table 6 shows, we considered 300 output neurons (30 per class) and we also considered dropout. For example, P = 0.4 means that 0.4 × 300 = 120 output neurons were prevented from updating their weights for the particular training image. For each input image a different set of 120 randomly chosen neurons did not have their weights updated. Table 6 shows that the best performance of 95.91% test accuracy was obtained with P = 0.4.

Maps in L4 | #Output Neurons | P | Valid acc % | Test acc % | Epochs |
---|---|---|---|---|---|

750 | 300 | 0.3 | 95.81 | 95.84 | 400 |

750 | 300 | 0.4 | 96.01 | 95.91 | 400 |

750 | 300 | 0.5 | 95.76 | 95.63 | 400 |
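The dropout on weight updates and the use of several output neurons per class can be sketched as below. This is only an illustration: the helper names are hypothetical, and the assumption that the 30 neurons of each class occupy contiguous indices is made for the example and is not stated in the text.

```python
import numpy as np

def update_mask(n_outputs, p_drop, rng):
    """Return a boolean mask, False for neurons whose weights are frozen for this image."""
    n_drop = int(round(p_drop * n_outputs))
    dropped = rng.choice(n_outputs, size=n_drop, replace=False)
    mask = np.ones(n_outputs, dtype=bool)
    mask[dropped] = False
    return mask

def predict_class(potentials, per_class):
    """Map the output neuron with maximum potential to its class (contiguous blocks assumed)."""
    return int(np.argmax(potentials)) // per_class

rng = np.random.default_rng(0)
mask = update_mask(300, 0.4, rng)   # a fresh mask would be drawn for every training image
```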

### 5.1 R-STDP as a Classification Criterion

We experimented with the R-STDP learning rule applied to the L5-L6 synapses of the network in Figure 26, using two different kinds of weight initialization and also varying the initialization of parameters like and .

#### 5.1.1 Backprop Initialized Weights for R-STDP

We were concerned with the poor performance using R-STDP as a classifier as given in Table 6. In particular, perhaps the weight initialization plays a role in that the R-STDP rule can get stuck in a local minimum. To study this in more detail, the network in Figure 26 was initialized with a set of weights that are known to give a high accuracy. To explain, the final weights used in the 2 Layer FCN reported in Table 5 were used as a starting point. As these weights are both positive and negative, they were shifted to be all positive. This was done by first finding the minimum value of these weights and simply adding its magnitude to them so that they are all positive. Then this new set of weights was re-scaled to be between 0 and 1 by dividing them all by their maximum value (positive). These shifted and scaled weights were then used to initialize the weights of the R-STDP classifier. The parameters were initialized to be 0.004, 0.003, 0.0005, 0.004 respectively. With the network in Figure 26 initialized by these weights, the validation images were fed through the network and the neuron with the maximum potential was taken as the predicted output. The validation accuracy was found to be 97.1%.
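The shift and re-scale step can be written in a couple of lines. The sketch below assumes the shift is exactly the magnitude of the minimum weight, so the smallest weight maps to 0 and the largest to 1; the function name is hypothetical, and the weights are assumed not to be all equal.

```python
import numpy as np

def shift_and_scale(w):
    """Shift weights so the minimum becomes 0, then divide by the new maximum."""
    shifted = w - w.min()            # adds |min| when the minimum is negative
    return shifted / shifted.max()   # assumes the weights are not all identical

w_backprop = np.array([[-2.0, 0.0], [1.0, 2.0]])
w_init = shift_and_scale(w_backprop)
```

Both operations are monotone, so the ordering of the weights is preserved, but the class scores are not guaranteed to keep the same ranking, which is why the validation accuracy is re-checked after the transformation.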

With the weights of the fully connected layer of Figure 26 initialized as just described, the R-STDP rule was used to train the network further for various numbers of epochs and with two different ways of updating and

##### Batch Update of and

The first set of experiments was done with the and ratios updated after every *batch* of images for As the weights of the fully connected layer of Figure 26 were initialized with the backprop trained values, we expect to be a low fraction or equivalently to be high. Consequently, they were initialized as With this initialization, Table 7 shows that the accuracy on the validation set did not decrease significantly for not too large. However, using larger values of (the value of N depends on the initialization of and ) the accuracy goes down significantly. For example, for the cases where and the accuracy didn’t significantly decrease until the batch size was In the case with and the accuracy didn’t decrease at all. This is because the best performing weights for validation accuracy were used, but these same weights also gave 100% accuracy on the training data.

Incorrect fraction | Correct fraction | N | Acc. at start | Acc. at end |
---|---|---|---|---|

0.1 | 0.9 | 100 | 97.1% | 96.91% |

0.1 | 0.9 | 500 | 97.1% | 96.96% |

0.1 | 0.9 | 1500 | 97.1% | 96.82% |

0.1 | 0.9 | 2500 | 97.1% | 90.76% |

0.035 | 0.965 | 2500 | 97.1% | 96.69% |

0.035 | 0.965 | 3000 | 97.1% | 96.58% |

0.035 | 0.965 | 3500 | 97.1% | 91.05% |

0.035 | 0.965 | 4000 | 97.1% | 90.98% |

0.0 | 1.0 | 100 | 97.1% | 96.93% |

0.0 | 1.0 | 500 | 97.1% | 96.93% |

0.0 | 1.0 | 1500 | 97.1% | 96.94% |

0.0 | 1.0 | 2500 | 97.1% | 96.94% |

0.0 | 1.0 | 3000 | 97.1% | 96.94% |

0.0 | 1.0 | 3500 | 97.1% | 96.94% |

0.0 | 1.0 | 4000 | 97.1% | 96.93% |

Table 8 shows the classification accuracy with “poor” initialization and If the weights had been randomly initialized then the initialization and would be appropriate. However, Table 8 shows that R-STDP isn’t able to recover from this poor initialization.

Incorrect fraction | Correct fraction | N | Acc. at start | Acc. at end |
---|---|---|---|---|

0.9 | 0.1 | 100 | 97.1% | 91.52% |

0.9 | 0.1 | 500 | 97.1% | 90.67% |

0.9 | 0.1 | 1500 | 97.1% | 90.47% |

0.9 | 0.1 | 2500 | 97.1% | 90.45% |

##### Update of and after each image

Next, and were updated after every image using the most recent images. Even with and initialized incorrectly, the validation accuracies in Table 9 did not decrease significantly. Though the accuracy still goes down slightly, the table indicates that updating and after every image mitigates this problem.
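The per-image update can be illustrated with a sliding window over the most recent classification outcomes. This is a sketch under assumptions: the class name is hypothetical, and keeping the initial ratios until the window has filled is a design choice made here, not something the text specifies.

```python
from collections import deque

class HitMissTracker:
    """Track the hit and miss ratios over the most recent `window` images."""

    def __init__(self, window, hit_ratio, miss_ratio):
        self.window = window
        self.outcomes = deque(maxlen=window)  # oldest outcome is dropped automatically
        self.hit_ratio = hit_ratio
        self.miss_ratio = miss_ratio

    def record(self, correct):
        """Record one classification outcome and refresh the ratios."""
        self.outcomes.append(bool(correct))
        if len(self.outcomes) == self.window:  # assumed: keep initial values until full
            self.hit_ratio = sum(self.outcomes) / self.window
            self.miss_ratio = 1.0 - self.hit_ratio

tracker = HitMissTracker(window=4, hit_ratio=0.1, miss_ratio=0.9)
for outcome in (True, True, False, True):
    tracker.record(outcome)
```

With such a tracker, a poor initial guess of the ratios is overwritten as soon as one full window of images has been seen, instead of persisting for a whole batch.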

Incorrect fraction | Correct fraction | N | Acc. at start | Acc. at end |
---|---|---|---|---|

0.9 | 0.1 | 100 | 97.1% | 96.93% |

0.9 | 0.1 | 500 | 97.1% | 96.94% |

0.9 | 0.1 | 1500 | 97.1% | 96.93% |

0.9 | 0.1 | 2500 | 97.1% | 96.94% |

Still updating and after each image, it was found that R-STDP accuracy was very sensitive to the initialized weights. Specifically the L5-L6 R-STDP weights were initialized using the backprop trained weights (as explained above) by doing the backprop for just 10 epochs (instead of 20) and (regularization parameter) which gave 99.6% training and 96.8% validation accuracies. Table 10 gives the validation accuracies using R-STDP for 100 epochs. Surprisingly, even with a good initialization of the weights and the ratios and , the validation accuracy suffers.

Incorrect fraction | Correct fraction | N | Acc. at start | Acc. at end |
---|---|---|---|---|

0.0 | 1.0 | 100 | 96.8% | 90.75% |

0.0 | 1.0 | 4000 | 96.8% | 90.67% |

For the same cases as in Table 10, the R-STDP algorithm was run for 1000 epochs with the training and validation accuracies versus epoch plotted in Figure 27. Notice that the validation accuracy drops to ~90%. It seems that R-STDP is not a valid cost function as far as accuracy is concerned (at least when using one output neuron per class). Interestingly, as shown next, when training with R-STDP from randomly initialized weights, the validation accuracy only goes up to ~90% (see Figure 28).

#### 5.1.2 Randomly Initialized Weights for R-STDP

In this set of experiments, the weights trained with R-STDP were *randomly* initialized from the normal distribution and the parameters initialized with the values given in Table 11. Validation accuracies are shown at the end of 100 epochs and were updated after every image.

Incorrect fraction | Correct fraction | N | Acc. at start | Acc. at end |
---|---|---|---|---|

0.9 | 0.1 | 100 | 10.3 | 90.22 |

0.9 | 0.1 | 500 | 10.1 | 90.13 |

0.9 | 0.1 | 1500 | 10.2 | 90.12 |

0.9 | 0.1 | 2500 | 10.6 | 90.16 |

For the same cases as in Table 11, the R-STDP algorithm was run for 1000 epochs with the training and validation accuracies versus epoch plotted in Figure 28. The validation accuracy only goes up to ~90%.

## 6 Classification of N-MNIST data set

In the above we artificially constructed spiking images using a DoG filter on the standard MNIST data set as in [23][22]. However, the ATIS (silicon retina) camera [48] works by producing spikes. We also considered classification directly on recorded output from the ATIS camera given in the N-MNIST data set [45]. A silicon retina detects changes in pixel intensity, and thus the MNIST digits are recorded with the camera moving slightly (saccades). Figure 30 shows the raw accumulated spikes of the N-MNIST data set as given in [45].

Figure 31 is the same as Figure 30, but corrected for saccades (camera motion) using the algorithm given in [45].

Figure 29 shows the network we used for classification of the N-MNIST data. We first hard wired the weights of the convolution kernel from L1 to L2 of Figure 29 to the values already trained above in subsection 4.2 (see Figure 25). Only the weights from L4 to L5 were trained for classification by simply back propagating the errors from L5 to L4. This result is given in the first row of Table 12. We also trained an SVM on the L4 neuron outputs with the results given in row 2 (RBF) and row 3 (linear) of Table 12. All the results in Table 12 were obtained on the raw spiking inputs from [45] (i.e., not corrected for saccades) with training done on 50,000 (spiking) images and validation and testing done on 10,000 images each.

Classifier | Test Acc | Valid Acc | Training Time | Epochs | ||
---|---|---|---|---|---|---|

2 Layer FCN | 97.45% | 97.62% | 5 minutes | 20 | ||

RBF SVM | 98.32% | 98.40% | 200 minutes | - | - | |

Linear SVM | 97.64% | 97.71% | 100 minutes | - | - |

Classification accuracies on the N-MNIST data set with one convolution/pool layer for transfer learning.

In Table 13 we show the results for the case where the weights of the convolution kernel from L1 to L2 of Figure 29 were trained (unsupervised) using the N-MNIST data set. In this instance we used N-MNIST data corrected for saccades since this gave better results than the uncorrected data. All the results in Table 13 were produced by training on 50,000 (spiking) images with validation and testing done using 10,000 images each.

Classifier | Test Acc | Valid Acc | Training Time | Epochs | ||
---|---|---|---|---|---|---|

1 Layer FCN | 97.21% | 97.46% | 5 minutes | 20 | ||

RBF SVM | 98.16% | 98.2% | 150 minutes | - | - | |

Linear SVM | 97.38% | 97.44% | 100 minutes | - | - |

We also added an extra convolution layer, but found that the classification accuracy decreased. Jin et al. reported an accuracy of 98.84% by using a modification of the error back propagation (all layers) algorithm [20]. Stromatias et al. reported an accuracy of 97.23% by using artificially generated features for the kernels of the first convolutional layer and training a 3 layer fully connected neural network classifier on spikes collected at the first pooling layer [56].

## 7 Catastrophic Forgetting

Catastrophic forgetting is a problematic issue in deep convolutional neural networks. In the context of the MNIST data set this refers to training the network to learn the digits 0,1,2,3,4 and, after this is done, carrying out training on the digits 5,6,7,8,9. The catastrophic part refers to the problem that the network is no longer able to classify the first set of digits 0,1,2,3,4. In more detail, Figure 32 shows a conventional (non-spiking) neural network with one convolution layer and one pool layer followed by a fully connected softmax output.

This network has 10 outputs but was first trained only on the digits 0,1,2,3,4, back propagating the error (computed from all 10 outputs) to the input (convolution) layer. This training used approximately 2000 digits per class and was done for 75 epochs. Before training the network on the digits 5,6,7,8,9 we initialized the weights and biases of the convolution and fully connected layers with the saved weights of the previous training. For the training with the digits 5,6,7,8,9 we *fixed* the weights and biases of the convolution layer at their initial values. The network was then trained, but only the weights of the fully connected layer were updated (i.e., the error was only back propagated from the 10 output neurons to the previous layer of flattened pooled neurons). This training also used approximately 2000 digits per class and was done for 75 epochs.

While the network was being trained on the second set of digits, we computed the validation accuracy on all 10 digits at the end of each epoch. We plotted these accuracies in Figure 33. The solid red line in Figure 33 gives the accuracies versus epoch on the first set of digits {0,1,2,3,4} while the solid blue line gives the accuracies on the second set of digits {5,6,7,8,9} versus epoch. Figure 34 is a zoomed in picture of Figure 33 for better resolution of the accuracies above 90%.

These plots also show the validation accuracy results when the second set of training data was modified to include a fraction of data from the first set of training digits {0,1,2,3,4}. For example, the dashed red line is the validation accuracy on the first set of digits when the network was trained with 2000 digits per class of {5,6,7,8,9} *along with* 200 (10%) digits per class of {0,1,2,3,4}. The blue dashed line is the corresponding validation accuracy on the second set of digits after each epoch. Similarly this was done with 15%, 25%, 27.5%, and 30% of the first set of digits included in the training set of the second set of digits.

The solid red line shows that after training with the second set of digits for a single epoch the validation accuracy on the first set goes down to 10% (random accuracy). The solid blue line shows a validation accuracy of over 97% on the second set of digits after the first epoch. Thus the network has now learned the second set of digits but has catastrophically forgotten the first set of digits, as shown by the solid red line.
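The mixed training sets used for the dashed curves can be assembled as in the sketch below. It is a minimal illustration, assuming the labels are available as a NumPy integer array and that the digits are sampled uniformly at random without replacement; the function name and arguments are hypothetical.

```python
import numpy as np

def mixed_training_indices(labels, new_classes, old_classes,
                           per_class, old_fraction, rng):
    """Pick `per_class` samples of each new class plus a fraction of that for each old class."""
    picked = []
    for c in new_classes:
        idx = np.flatnonzero(labels == c)
        picked.append(rng.choice(idx, size=per_class, replace=False))
    n_old = int(round(old_fraction * per_class))   # e.g. 10% of 2000 gives 200
    for c in old_classes:
        idx = np.flatnonzero(labels == c)
        picked.append(rng.choice(idx, size=n_old, replace=False))
    return rng.permutation(np.concatenate(picked))  # shuffle old and new digits together

# Toy labels standing in for MNIST: 50 samples of each digit.
labels = np.repeat(np.arange(10), 50)
rng = np.random.default_rng(0)
train_idx = mixed_training_indices(labels, range(5, 10), range(0, 5),
                                   per_class=20, old_fraction=0.1, rng=rng)
```

Setting `old_fraction` to 0 reproduces the disjoint case shown by the solid lines.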

### 7.1 Forgetting In Spiking Networks

For comparison we tested forgetting in our spiking network of Section 4.2 (see Figure 25). The network was first trained only on the digits 0,1,2,3,4 with STDP on the convolution layer and back propagating the error (computed from all 10 outputs) just to the previous layer (the flattened pool layer). This training used approximately 2000 digits per class and was done for 75 epochs. Then, before training the network on the set of digits {5,6,7,8,9}, we initialized the weights of the convolution and fully connected layers with the saved weights of the previous training. For the training with the digits 5,6,7,8,9 we *fixed* the weights of the convolution layer at their initial values. The network was then trained, but only the weights of the fully connected layer were updated (i.e., the error was only back propagated from the 10 output neurons to the previous layer of flattened pooled neurons). This training also used approximately 2000 digits per class and was done for 75 epochs.

While the network was being trained on the second set of digits, we computed the validation accuracy on all 10 digits at the end of each epoch. We plotted these accuracies in Figure 35. The solid red line in Figure 35 gives the accuracies versus epoch on the first set of digits {0,1,2,3,4} while the solid blue line gives the accuracies on the second set of digits {5,6,7,8,9} versus epoch. Figure 36 is a zoomed in picture of Figure 35 for better resolution of the accuracies above 90%.

These plots also show the validation accuracy results when the second set of training data was modified to include a fraction of data from the first set of training digits {0,1,2,3,4}. For example, the dashed red line is the validation accuracy on the first set of digits when the network was trained with 2000 digits per class of {5,6,7,8,9} *along with* 200 (10%) digits per class of {0,1,2,3,4}. The blue dashed line is the corresponding validation accuracy on the second set of digits after each epoch. Similarly this was done with 15%, 25%, 27.5%, and 30% of the first set of digits included in the training set of the second set of digits.

The solid red line shows that after training with the second set of digits for a single epoch the validation accuracy on the first set goes down to 77% (compared to the 10% accuracy of a non-spiking CNN). The solid blue line shows a validation accuracy of about 95% on the second set of digits after the first epoch. Thus the network has now learned the second set of digits but has not catastrophically forgotten the first set of digits, as shown by the solid red line.

As another approach we first trained on the set {0,1,2,3,4} exactly as just described above. However, we then took a different approach to training on the set {5,6,7,8,9}. Specifically, we trained on 500 random digits chosen from {5,6,7,8,9} (approximately 50 from each class) and then computed the validation accuracy on all ten digits. We repeated this for every additional 250 images with the results shown in Figure 37. Interestingly, this shows that if we stop after training on 1000 digits from {5,6,7,8,9} we retain a validation accuracy of 91.1% and a test accuracy of 90.71% on all 10 digits.

# images (classes 5-9) | # images (classes 0-4) | Validation | Test | Epochs |
---|---|---|---|---|

10,000 | 1000(10%) | 95.235% | 95.1% | 75 |

10,000 | 1500(15%) | 95.95% | 95.9% | 75 |

10,000 | 2500(25%) | 96.83% | 96.81% | 75 |

10,000 | 2750(27.5%) | 96.98% | 96.92% | 75 |

10,000 | 3000(30%) | 97.1% | 97.043% | 75 |

Jason et al. reported an accuracy of 93.88% for completely disjoint data sets [1].

## 8 Feature Reconstruction

We have already presented Figure 19, which is a reconstruction of the convolution kernels (weights) from Layer L1 to Layer L2 into features. Each of the 30 maps of L2 has a convolution kernel in associated with it which maps L1 to L2.

We now want to reconstruct (visualize) the features learned by the second convolution layer. Each of the 500 maps of L4 (see Figure 21) has a convolutional kernel in associated with it which maps L3 to L4, i.e., for we have

A area of pooled layer L3 receives spikes from area of neurons in L2. Thus, for , the kernels are reconstructed to be features

connecting L2 to L4. How is this done? Consider the kernel and for the slice of the value of the element is mapped to the element of the slice of All other values of the slice are set to zero. This is done for

Now recall that there are 30 kernels in . Specifically, for

is for ON center kernels and is for OFF center kernels. These kernels map spikes from area of neurons in L1 to a area of layer L2. Thus the feature must be reconstructed to be a feature in . That is, for

(Each neuron in L4 has a field of view of neurons in L1). How is this done?

Let the matrix on the left-hand side of Figure 38 denote an ON center kernel for some In particular, let it be the second kernel so Now the feature denoted by can be visualized as being made up of slices for To go with the second kernel we take the second slice (k=1) of the feature denoted as which we take to be the matrix on the right-hand side of Figure 38. In practice these slices are sparse and we show the particular slice in Figure 38 to have only two non zero elements, the and the elements.

To carry out the reconstruction we compute and center it on of as indicated in Figure 39. We then repeat this process for all non zero elements of which in this example is just .

Filling in with zeros we end up with the matrix shown in Figure 40.

Each of the 500 reconstructed features of which 150 are shown in Figure 44 is the sum of 30 matrices of the type shown in Figure 40.
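The stamping and summing steps can be sketched as follows. This is only an illustration of the idea: the array sizes are hypothetical, the kernels are assumed to be odd-sized, and the canvas is assumed to be zero padded by half a kernel on each side so that a kernel centered on a border element still fits.

```python
import numpy as np

def stamp_slice(slice_2d, kernel):
    """Add a scaled copy of `kernel`, centered on each non zero element of `slice_2d`."""
    kh, kw = kernel.shape            # assumed odd, so the kernel has a center element
    sh, sw = slice_2d.shape
    canvas = np.zeros((sh + kh - 1, sw + kw - 1))   # padded by kh//2 and kw//2 per side
    for i, j in zip(*np.nonzero(slice_2d)):
        # Element (i, j) sits at (i + kh//2, j + kw//2) on the padded canvas,
        # so the kernel centered there covers rows i..i+kh-1 and columns j..j+kw-1.
        canvas[i:i + kh, j:j + kw] += slice_2d[i, j] * kernel
    return canvas

def reconstruct_feature(slices, kernels):
    """Sum the stamped matrices over all slices, one kernel per slice."""
    return sum(stamp_slice(s, k) for s, k in zip(slices, kernels))

kernel = np.arange(9.0).reshape(3, 3)   # toy 3x3 kernel
sl = np.zeros((4, 4))
sl[1, 2] = 2.0                          # a sparse slice with one non zero element
stamped = stamp_slice(sl, kernel)
```

Overlapping stamps simply add, which is the same behavior as a transposed convolution of the slice with the kernel.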

To reconstruct the third matrix we use the third kernel () taken to be the matrix on the left-side of Figure 41 and the third slice () of the feature denoted as which we take to be the matrix on the right-hand side of Figure 41.

Here the only non zero components are and . We compute and center it on
