The human brain is a computational marvel compared to man-made systems, both in its ability to learn to execute highly complex cognitive tasks and in its energy efficiency. The computational efficiency of the brain stems from its use of sparsely issued binary signals, or spikes, to encode and process information. Inspired by this, spiking neural networks (SNNs) have been proposed as a computational framework for learning and inference. General-purpose graphics processing units (GP-GPUs) have become an ideal platform for accelerated implementation of large-scale machine learning algorithms. There have been multiple GPU-based implementations for simulating large SNNs [3, 4, 5, 6, 7, 8], most of which target the forward communication of spikes through large networks of spiking neurons and/or local weight updates based on spike timing differences. In contrast, we demonstrate a highly optimized real-time implementation scheme for spike-based supervised learning on GPU platforms, and use the framework for real-time inference on digits captured from different users through a touch-screen interface.
Previous efforts to develop deep convolutional spiking networks started by training second-generation artificial neural networks (ANNs) with back-propagation of errors and thereafter converting them into spiking versions [9, 10, 11, 12]. Several supervised learning algorithms have been proposed to train SNNs directly, by explicitly using the spike times of neurons to encode information and deriving weight update rules that minimize the distance between desired and observed spike times in a network [13, 14, 15, 16, 17]. We use the Normalized Approximate Descent (NormAD) algorithm to design a system that identifies handwritten digits. The NormAD algorithm has shown superior convergence speed compared to other methods such as the Remote Supervised Method (ReSuMe).
We benchmark against prior spiking networks for MNIST, where a two-stage convolutional neural network achieved a high accuracy on the test set. Our network, in contrast, has just three layers with far fewer learning synapses, and achieves a competitive accuracy on the MNIST test dataset.
The paper is organized as follows. The computational units of the SNN and the network architecture are described in section II. Section III details how the network simulation is divided among different CUDA kernels. The user-interface system and the image pre-processing steps are explained in Section IV. We present the results of our network simulation and speed related optimizations in Section V. Section VI concludes our GPU based system implementation study.
II. Spiking Neural Network
The basic units of an SNN are spiking neurons and the synapses interconnecting them. For computational tractability, we use the leaky integrate-and-fire (LIF) model of neurons, where the evolution of the membrane potential $V(t)$ is described by:

$$C\,\frac{dV(t)}{dt} = -g_L\big(V(t) - E_L\big) + I(t) \qquad (1)$$

Here $I(t)$ is the total input current, $E_L$ is the resting potential, and $C$ (in pF) and $g_L$ (in nS) model the membrane capacitance and leak conductance, respectively. Once the membrane potential crosses a threshold $V_T$, it is reset to its resting value and remains at that value until the neuron comes out of its refractory period (a few ms). The synapse with weight $w_{ij}$, connecting input neuron $j$ to output neuron $i$, transforms the incoming spikes (arriving at times $t_j^s$) into a post-synaptic current $I_i(t)$, based on the following transformation:

$$\alpha(t) = \big(e^{-t/\tau_1} - e^{-t/\tau_2}\big)\,u(t) \qquad (2)$$
$$I_i(t) = \sum_j w_{ij} \sum_s \alpha(t - t_j^s) \qquad (3)$$
Here, the summed delta functions represent the incoming spike train, $u(t)$ is the unit step, and the difference of two decaying exponentials, with time constants $\tau_1$ and $\tau_2$ of a few ms, represents the synaptic kernel. These values closely match biological time constants.
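For illustration, the synaptic kernel and current of equations 2 and 3 can be sketched as follows. The time constants here are placeholder values, since the paper's exact constants are not reproduced in this text:

```python
import math

# Placeholder time constants in ms (assumed values, not the paper's exact ones)
TAU1 = 5.0    # slow decay
TAU2 = 1.25   # fast decay

def synaptic_kernel(t):
    """Double decaying exponential kernel of equation 2; zero for t < 0."""
    if t < 0:
        return 0.0
    return math.exp(-t / TAU1) - math.exp(-t / TAU2)

def synaptic_current(w, spike_times, t):
    """Post-synaptic current: weight times the sum of kernels over past spikes."""
    return w * sum(synaptic_kernel(t - ts) for ts in spike_times)
```

Since TAU1 > TAU2, the kernel rises from zero, peaks, and decays, mimicking the shape of a biological post-synaptic current.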
II-A. Network architecture & spike encoding
We use a three-layer network in which the hidden layer performs feature extraction and the output layer performs classification (see Fig. 1). The network is designed to take as input a 28×28 pixel MNIST digit image. We translate the pixel values into a set of spike streams by passing the pixels as currents to a layer of neurons (the first layer). The current applied to the neuron corresponding to a pixel of value $p$, in the range $[0, 255]$, is obtained by the following linear relation:

$$I(p) = I_0 + \beta\,p \qquad (4)$$

where $\beta$ (in pA) is a scaling factor and $I_0$ (in pA) is the minimum current above which an LIF neuron can generate a spike (for the parameters chosen in equation 1). These spike streams are then weighted with twelve synaptic weight maps (or filters), with a priori chosen values, to generate equivalent current streams using equations 2 and 3. These spatial filter maps are chosen to detect various edges and corners in the image.
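A minimal sketch of this encoding follows; the two constants are assumed placeholders, since the paper's exact pA values are not reproduced in this text:

```python
# Assumed constants (not the paper's exact values)
I_MIN = 600.0   # pA, minimum current at which the LIF neuron can spike
BETA  = 10.0    # pA per unit of pixel intensity, scaling factor

def pixel_to_current(p):
    """Linearly map a pixel value p in [0, 255] to an input current in pA."""
    return I_MIN + BETA * p
```

With this mapping, even a zero-intensity pixel receives the minimum suprathreshold current, so every input neuron spikes at a rate that grows with pixel intensity.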
The output layer consists of ten neurons, one for each of the ten digits. We train the network so that the correct neuron in the output layer generates a spike train at a frequency close to a chosen target rate, while the other output neurons issue no spikes during the presentation duration $T$. $T$ is also a hyper-parameter of our network, and its effect on the network's classification ability is discussed in Section V. This layer also has lateral inhibitory connections that help prevent the non-label neurons from spiking for a given input. The output neuron with the highest number of spikes is declared the winner of the classification.
II-B. Learning layer
The synapses connecting the hidden layer neurons to the output layer neurons are modified during the course of training using the NormAD rule. The weights are adjusted based on the error $e(t)$ between the desired and observed spike streams and the term $d_j(t)$, denoting the effect of incoming spike kernels on the neuron's membrane potential, according to the relation:

$$\Delta w_j = r \int_0^T e(t)\,\frac{d_j(t)}{\lVert \mathbf{d}(t) \rVert}\,dt, \qquad d_j(t) = (c_j * h)(t) \qquad (5)$$

where $c_j(t)$ is the synaptic kernel of the spike train from hidden neuron $j$, $h(t)$ represents the neuron's impulse response with a time constant of a few ms, and $r$ is the learning rate.
III. CUDA implementation
The SNN is implemented on a GPU platform using the CUDA-C programming framework. A GPU is divided into streaming multiprocessors (SM), each of which consists of stream processors (SP) that are optimized to execute math operations. The CUDA-C programming framework exploits the hardware parallelism of GPUs and launches jobs on the GPU in a grid of blocks each mapped to an SM. The blocks are further divided into multiple threads, each of which is scheduled to run on an SP, also called a CUDA core. Since memory transfer between CPU and GPU local memory is one of the main bottlenecks, all network variables (i.e., neuron membrane potentials and synaptic currents) are declared in the global GPU memory in our implementation. The simulation equations (1), (2) and (3) are evaluated numerically in an iterative manner at each time step.
Fig. 2 shows the forward pass and the backward pass for weight update during the training phase. Image pixels read into GPU memory are passed as currents to layer-one neurons for the presentation duration $T$. The filtering process involves 2D convolution of the incoming spike kernels with the weight matrices. The computation is parallelized across CUDA kernels, each with a 2D grid of threads. Each thread computes the current to one hidden layer neuron, indexed as a 2D array, at time step $n$, based on the following spatial convolution relation:

$$I_{x,y}[n] = \sum_{a}\sum_{b} W_{a,b}\; c_{x+a,\,y+b}[n]$$

where $c_{i,j}[n]$ represents the synaptic kernel (equation 2) accumulated from the spike train of pixel $(i,j)$ and $W_{a,b}$ represents each of the weights from the filter matrix.
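This per-neuron computation can be sketched as a 'valid' 2D convolution (the boundary handling is an assumption; the paper does not state its padding choice). On the GPU, each output element is computed by one thread; here the nested comprehensions play that role:

```python
def hidden_layer_current(s, w_filter):
    """'Valid' 2D spatial convolution (correlation) of the per-pixel synaptic
    kernel values s with one weight filter, giving the input current to each
    hidden-layer neuron at the current time step."""
    kh, kw = len(w_filter), len(w_filter[0])
    h, w = len(s), len(s[0])
    return [[sum(s[i + a][j + b] * w_filter[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(w - kw + 1)]
            for i in range(h - kh + 1)]
```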
The membrane potential of an array of LIF neurons, for applied current $I[n]$ (as described in equation 1), is evaluated using the second-order Runge-Kutta method as:

$$k_1 = f(V[n], I[n]), \quad k_2 = f(V[n] + \Delta t\,k_1, I[n]), \quad V[n+1] = V[n] + \frac{\Delta t}{2}(k_1 + k_2)$$

where $f(V, I) = \big(-g_L (V - E_L) + I\big)/C$.
Each thread independently checks whether the membrane potential has exceeded the threshold and, if so, artificially resets it. The refractory period is implemented by storing the latest spike issue time of each neuron in a vector; the membrane potential of a neuron is updated only when the current time $n\Delta t$ is beyond the refractory window of its latest spike.
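The per-neuron update performed by each thread can be sketched as follows; all parameter values are assumed placeholders, since the paper's exact constants are not reproduced in this text:

```python
# Assumed LIF parameters (not the paper's exact values)
C_M   = 300.0   # pF, membrane capacitance
G_L   = 30.0    # nS, leak conductance
E_L   = -70.0   # mV, resting potential
V_T   = -50.0   # mV, spiking threshold
T_REF = 3.0     # ms, refractory period

def dvdt(v, i_in):
    """LIF membrane equation: C dV/dt = -g_L (V - E_L) + I."""
    return (-G_L * (v - E_L) + i_in) / C_M

def lif_step(v, i_in, t, t_last_spike, dt=0.1):
    """One second-order Runge-Kutta (Heun) step with reset and refractory hold.

    Returns the new potential, the latest spike time, and a spike flag."""
    if t - t_last_spike < T_REF:           # inside refractory period: hold at rest
        return E_L, t_last_spike, False
    k1 = dvdt(v, i_in)
    k2 = dvdt(v + dt * k1, i_in)
    v_new = v + 0.5 * dt * (k1 + k2)
    if v_new >= V_T:                       # threshold crossed: spike and reset
        return E_L, t, True
    return v_new, t_last_spike, False
```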
The synaptic current from neuron $j$ in the hidden layer to neuron $i$ in the output layer, as given in equation 3, can be re-written to be evaluated in an iterative manner, thereby avoiding the evaluation of expensive exponentials of the differences between the current time and all previous spike times. The synaptic current computation at time step $n$ for each synapse is spawned in CUDA across kernels as:

$$I^{(1)}_{ij}[n] = e^{-\Delta t/\tau_1}\, I^{(1)}_{ij}[n-1] + w_{ij}\, s_j[n], \qquad I^{(2)}_{ij}[n] = e^{-\Delta t/\tau_2}\, I^{(2)}_{ij}[n-1] + w_{ij}\, s_j[n], \qquad I_{ij}[n] = I^{(1)}_{ij}[n] - I^{(2)}_{ij}[n]$$

where $I^{(1)}$ and $I^{(2)}$ represent the slowly and rapidly decaying components of the double-exponential synaptic kernel, and $s_j[n]$ is 1 if neuron $j$ spiked at step $n$ and 0 otherwise. The strength of the synapses between the hidden and output layers is initialized to zero before training. At every time step, the error function for each output neuron is calculated based on the difference between the desired and observed spikes. Next, $d_j(t)$ (equation 5) for the spikes originating from neuron $j$ is computed by convolving the synaptic kernel of neuron $j$'s spike train with the neuron's impulse response, again evaluated iteratively.
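A per-synapse sketch of this iterative evaluation follows; the time constants and time step are assumed placeholders. Each step needs only two multiply-adds, regardless of how many spikes have arrived so far:

```python
import math

TAU1, TAU2 = 5.0, 1.25           # ms, assumed kernel time constants
DT = 0.1                          # ms, assumed simulation time step
D1 = math.exp(-DT / TAU1)         # per-step decay factors, precomputed once
D2 = math.exp(-DT / TAU2)

class Synapse:
    """Iterative evaluation of the double-exponential synaptic current.

    Keeps two state variables (slow and fast exponentials) so that each
    time step is O(1), instead of summing exponentials over the full
    spike history as in the direct form of equation 3."""
    def __init__(self, w):
        self.w = w
        self.s_slow = 0.0
        self.s_fast = 0.0

    def step(self, spiked):
        inc = self.w if spiked else 0.0
        self.s_slow = self.s_slow * D1 + inc
        self.s_fast = self.s_fast * D2 + inc
        return self.s_slow - self.s_fast
```

After a single spike at step 0, the returned current traces out exactly the weighted double-exponential kernel sampled every DT.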
Once $d_j(t)$ is evaluated, we compute its norm across all neurons and determine the instantaneous weight update for all synapses in parallel, if there is a spike error. At the end of the presentation, the accumulated $\Delta w$ is used to update the synaptic weights in parallel. The evaluation of the total synaptic current and the norm is performed using parallel reduction in CUDA. During the inference or testing phase, we calculate the synaptic currents and membrane potentials of neurons in both layers to determine spike times, but do not evaluate the $d_j(t)$ term or the weight update.
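The accumulation of the NormAD update over a presentation (equation 5, in discrete time) can be sketched as below. The error encoding (+1, -1, 0 per time step) is an assumption about the sign convention:

```python
import math

def normad_update(errors, d_vectors, lr):
    """Accumulate NormAD-style weight updates over one presentation.

    errors[t]    : spike error e(t) of the output neuron (+1, -1, or 0)
    d_vectors[t] : list of per-synapse terms d_j(t) at time step t
    The instantaneous update lr * e(t) * d(t) / ||d(t)|| is accumulated
    over all time steps with a spike error and applied once at the end."""
    n_syn = len(d_vectors[0])
    dw = [0.0] * n_syn
    for e, d in zip(errors, d_vectors):
        if e == 0:
            continue                         # no spike error: no update
        norm = math.sqrt(sum(x * x for x in d))
        if norm == 0.0:
            continue
        for j in range(n_syn):
            dw[j] += lr * e * d[j] / norm
    return dw
```

Normalizing by the norm of the full $d$ vector is what makes the step size insensitive to the overall magnitude of the incoming spike kernels.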
IV. Real-time inference on user data
We used the CUDA-based SNN described in the previous section to design a user interface that can capture and identify images of digits written by users in real time on a touch-screen interface. The drawing application that captures the digit drawn by the user is built using OpenCV, an image processing library. The captured image from the touch screen is pre-processed using standard methods similar to those used to generate the MNIST dataset images; we convert the user-drawn images to the required format, a 28×28 pixel grayscale image. The network is implemented on an NVIDIA GTX 860M GPU. The preprocessing phase takes a few milliseconds, and the image is then passed to the trained SNN for inference. The CUDA process takes a short, fixed time to initialize the network in GPU memory, after which the network simulation time depends on the presentation time $T$ and the time step interval $\Delta t$.
IV-A. Image Preprocessing
Fig. 3(b) shows some sample pre-processed images. The image captured from the user is first binarized by thresholding and cropped to remove excess background. The image is resized along its longer dimension, while maintaining its aspect ratio. Thereafter, the resized image is placed in a bounding box such that the image's center of mass coincides with the center of the bounding box. Finally, the image is passed through a blurring filter to create grayscale images similar to those in the MNIST dataset.
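The centering step can be sketched as follows; the 28×28 box size is taken from the MNIST format, and the rounding of the offset is an implementation assumption:

```python
def center_of_mass(img):
    """Return (row, col) center of mass of a 2D grayscale image (list of lists)."""
    total = sum(v for row in img for v in row)
    r = sum(i * v for i, row in enumerate(img) for v in row) / total
    c = sum(j * v for row in img for j, v in enumerate(row)) / total
    return r, c

def place_in_box(img, box=28):
    """Paste img into a box x box canvas so its center of mass lands at the
    canvas center, as in the MNIST-style preprocessing pipeline."""
    cr, cc = center_of_mass(img)
    top = round((box - 1) / 2 - cr)
    left = round((box - 1) / 2 - cc)
    out = [[0.0] * box for _ in range(box)]
    for i, row in enumerate(img):
        for j, v in enumerate(row):
            ti, tj = top + i, left + j
            if 0 <= ti < box and 0 <= tj < box:
                out[ti][tj] = v
    return out
```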
V. Results

We trained the network on the MNIST training dataset, consisting of 60,000 images, for several epochs. Our network achieves a low error on the training set and on the test set when simulated with a fine time step for the full presentation duration. Table I lists state-of-the-art networks (ANN and SNN) for the MNIST classification problem. Although these networks achieve higher classification accuracies, they use many times the number of parameters of our network, which is designed to minimize the computational load of a real-time system.
|Network and learning algorithm|Learning synapses|Accuracy|
|Deep Learning| | |
|ANN converted to SNN| | |
|4-layer convolutional SNN| | |
|SNN with NormAD (this work)| | |
If a coarser integration time step is used during inference than in training (i.e., approximating the neuronal integration), the MNIST test error increases only marginally (see Fig. 4(a)), while the processing time is substantially reduced. Hence, for our touch-screen interface system we simulate the SNN with the coarser time step to infer the users' digits. The network can then be simulated in an average wall-clock time shorter than the digit presentation time, making real-time processing possible (Fig. 4(b)).
We tested the network's accuracy on a set of handwritten digits collected from various users through our user-interface system, and measured an accuracy slightly below that obtained on the MNIST test set. The slight loss in performance is attributed to deviations in the statistical characteristics of the captured images relative to the MNIST dataset.
VI. Conclusion

We developed a simple three-layer spiking neural network that performs spike encoding, feature extraction, and classification. All information processing and learning within the network is performed entirely in the spike domain. With a many-fold smaller number of synaptic weight parameters than state-of-the-art spiking networks, our approach achieves high classification accuracy on both the training and test sets of the MNIST database. The trained network, implemented on the CUDA parallel computing platform, is also able to successfully identify digits written by users in real time, demonstrating its generalization capability.
We have also demonstrated a general framework for implementing spike-based neural networks and supervised learning with non-local weight update rules on a GPU platform. At each time step, the neuronal spike transmission, synaptic current computation, and weight update calculation for the network are all executed in parallel in this framework. Using this GPU implementation, we demonstrated a touch-screen based platform for real-time classification of user-generated images.
-  W. Maass, “Networks of spiking neurons: The third generation of neural network models,” Neural networks, vol. 10, no. 9, pp. 1659–1671, 1997.
-  A. Coates et al., “Deep learning with COTS HPC systems,” in 30th International Conference on Machine Learning, 2013.
-  A. K. Fidjeland and M. P. Shanahan, “Accelerated simulation of spiking neural networks using GPUs,” in The 2010 International Joint Conference on Neural Networks (IJCNN). IEEE, 2010, pp. 1–8.
-  J. M. Nageswaran et al., “A configurable simulation environment for the efficient simulation of large-scale spiking neural networks on graphics processors,” Neural networks, vol. 22, no. 5, pp. 791–800, 2009.
-  E. Yavuz, J. Turner, and T. Nowotny, “GeNN: a code generation framework for accelerated brain simulations,” Nature Scientific Reports, vol. 6, no. 18854, 2016.
-  D. Yudanov et al., “GPU-based simulation of spiking neural networks with real-time performance and high accuracy,” in International Joint Conference on Neural Networks, July 2010.
-  F. Naveros et al., “Event-and time-driven techniques using parallel CPU-GPU co-processing for Spiking Neural Networks,” Front. in Neuroinformatics, vol. 11, 2017.
-  J. L. Krichmar, P. Coussy, and N. Dutt, “Large-scale spiking neural networks using neuromorphic hardware compatible models,” ACM Journal on Emerging Technologies in Computing Systems (JETC), vol. 11, no. 4, p. 36, 2015.
-  P. U. Diehl et al., “Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing,” in 2015 International Joint Conference on Neural Networks (IJCNN), July 2015, pp. 1–8.
-  Y. Cao, Y. Chen, and D. Khosla, “Spiking deep convolutional neural networks for energy-efficient object recognition,” International Journal of Computer Vision, vol. 113, no. 1, pp. 54–66, 2015.
-  B. Rueckauer et al., “Theory and tools for the conversion of analog to spiking convolutional neural networks,” arXiv preprint arXiv:1612.04052, 2016.
-  E. Hunsberger and C. Eliasmith, “Training Spiking Deep Networks for Neuromorphic Hardware,” arXiv preprint arXiv:1611.05141, 2016.
-  N. Anwani and B. Rajendran, “NormAD - Normalized Approximate Descent based supervised learning rule for spiking neurons,” in International Joint Conference on Neural Networks, July 2015, pp. 1–8.
-  F. Ponulak and A. Kasinski, “Supervised learning in spiking neural networks with ReSuMe: sequence learning, classification, and spike shifting,” Neural Computation, vol. 22, no. 2, pp. 467–510, 2010.
-  A. Mohemmed et al., “SPAN: Spike Pattern Association Neuron for Learning Spatio-Temporal Spike Patterns,” International Journal of Neural Systems, vol. 22, no. 04, 2012.
-  J. H. Lee, T. Delbruck, and M. Pfeiffer, “Training Deep Spiking Neural Networks using Backpropagation,” Frontiers in Neuroscience, vol. 10, p. 508, 2016.
-  W. W. Lee, S. L. Kukreja, and N. V. Thakor, “CONE: Convex-Optimized-Synaptic Efficacies for Temporally Precise Spike Mapping,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 4, pp. 849–861, April 2017.
-  “The MNIST database of handwritten digits,” available at http://yann.lecun.com/exdb/mnist/.
-  P. Dayan, L. Abbott et al., “Theoretical neuroscience: computational and mathematical modeling of neural systems,” Journal of Cognitive Neuroscience, vol. 15, no. 1, pp. 154–155, 2003.
-  M. Harris, “Optimizing parallel reduction in CUDA,” available at http://bit.ly/2gd7fSb.
-  I. Culjak, D. Abram, T. Pribanic, H. Dzapo, and M. Cifrek, “A brief introduction to OpenCV,” in 2012 Proceedings of the 35th International Convention MIPRO, May 2012, pp. 1725–1730.
-  L. Wan et al., “Regularization of Neural Networks using DropConnect,” in Proceedings of the 30th International Conference on Machine Learning, 2013, pp. 1058–1066.