## 1 Introduction

Machine learning is a paradigm according to which algorithms perform tasks by learning from data instead of exploiting domain knowledge. Recently, machine-learning algorithms that employ artificial neural networks (ANNs) have been attracting much interest from researchers and engineers in diverse fields owing to their astonishing performance on various tasks such as image recognition, natural language processing, and time series forecasting

[23, 25]. These machine-learning algorithms are known as deep learningbecause of the characteristic multilayer structures of ANNs. In deep learning,

stochastic gradient descent (SGD) and back propagation (BP) are important techniques that enable multilayer ANNs to perform effective learning [1, 51]. In addition, recent advances in algorithms such as convolutional neural networks (CNNs) [26] and skip connections[19] have led to the success of deep learning. Moreover, the evolution of very large-scale integration (VLSI) technologies, such as Moore’s law [36] and Dennard scaling[14], have enabled deep learning to be implemented in real-world applications.However, the computational burden of deep learning often hinders its application in the real world. Therefore, researchers and engineers often use high-performance processors such as graphical processing units, tensor processing units

[22], and application-specific integrated circuits (ASICs)[56] to compensate for this shortcoming. Despite these developments, it remains difficult to enable the learning to converge in an acceptable time, or to perform the tasks in real time. This is even more difficult when performing tasks with processors placed near the end-users (edge computing) owing to the limited battery capacity. Therefore, for edge computing, high computational energy efficiency is required.In this regard, finding a solution for the energy-efficiency problem by looking toward the human brain would only seem natural, because the human brain is capable of processing complex information in real time by consuming only 20-30 W of energy. An ANN is mostly a mathematical model inspired by the physiological observation of spike frequencies in biological neurons, whereas a spiking neural network (SNN) is a mathematical model that is directly concerned with the spike generation [32]. Therefore, SNNs would be expected to perform tasks in a way similar to the brain with low energy consumption. Until recently, SNNs did not perform as well as ANNs on tasks such as image recognition. However, it has been shown that the performance of SNNs could reach that of ANNs by exploiting the techniques used in deep learning (multilayer structures, SGD, BP, etc.)[57, 47]. Thus, it seems reasonable to find a way to really improve the energy efficiency of SNNs. In fact, not only have SNNs demonstrated their high algorithmic performance but also their advantages in terms of energy efficiency especially when implemented in ASICs by exploiting the asynchronous property and binary (spike) communications [30]. Many research groups have manufactured various types of ASICs for SNNs, which include, for example, mixed-signal ASICs[49, 37, 52, 17], and fully digital ASICs [34, 13]. Other ASICs were also reported [53, 8]. These ASICs are able to execute SNNs with complex and flexible networks in a highly energy efficient manner. However, it is still unclear whether ASICs for SNNs are more energy efficient per operation or per task than ASICs for ANNs. This comparison is blurred by a common bottleneck preventing computational energy efficiency, which is caused by data movement between processors and memory [20]. Because of this common bottleneck, the advantages of employing SNNs in a hardware implementation such as the asynchronous property and spike communications are hardly visible in terms of the computational energy efficiency of ASICs. One promising approach to overcome this bottleneck is in-memory computing.

In-memory computing is a concept that involves performing the computing where the memorized data are located [59]. Especially, in-memory computing schemes that use analog resistive memory have being attracting hardware researchers’ attention [58, 21, 64]

. Analog resistive memory stores information in the form of resistance. The application of voltage across the two terminals of an analog resistive memory cell results in an electric current, which is proportional to the voltage, and which flows through the memory cell by obeying Ohm’s law. Currents flowing through memory cells can easily be summed up on the basis of Kirchhoff’s current law. Note that this computing scheme does not require information to be stored elsewhere (e.g., static random access memory or dynamic random access memory). Based on the above principle, computing such as matrix-vector multiplications (MVMs) can be performed in a highly energy efficient way.

This motivated us to propose a novel supervised learning algorithm for multilayer SNNs to achieve high algorithmic performance and high energy efficiency. It is also possible to implement our algorithm in VLSI circuits with analog resistive memory. Our algorithm is based on the following ideas: (i) SNNs are composed of spiking neuron models which facilitate VLSI implementations; (ii) SNNs are trained with a temporal coding scheme, thereby enabling SNNs to perform tasks with fewer spikes than SNNs based on rate coding. We also propose novel learning techniques to improve the learning performance and evaluate the performance on the MNIST dataset. Furthermore, for real VLSI implementations, robustness against manufacturing variations in devices is crucial. Therefore, we investigated the effects of these manufacturing variations on the learning performance of the proposed algorithm. In addition, we propose a technique to suppress the effect of manufacturing variations.

This paper consists of the following sections: section 2 describes related work; section 3 proposes an SNN and its learning algorithm; section 4 introduces the experimental setup for evaluating the algorithm on the MNIST dataset, and for investigating the effects of manufacturing variations; section 5 presents the experimental results; section 6 concludes this paper.

## 2 Related Work

### 2.1 Supervised learning algorithms for multilayer SNNs

A number of studies have shown that multilayer SNNs can be trained with the BP technique [57, 47, 42]. The application of the BP to SNNs is not straightforward as in the case of ANNs because SNNs have non-differentiable processes such as spike generations. Furthermore, the dynamics and the learning performance depend on the coding schemes of SNNs. Fig. 1 shows SNNs based on two different coding schemes: rate coding and temporal coding. In rate coding, information is contained in the spike frequency of neurons, whereas in temporal coding, information is contained in the spike timing of neurons [31].

In rate coding, BP algorithms can be obtained by approximations such as linear neuron approximation [48], and by considering spike generation as a probabilistic process [41]. In [27, 66, 42], the BP algorithms were obtained by using the derivative of the membrane potential with respect to time. In addition, it has been shown that trained ANNs can be converted into rate-coding-based SNNs [15, 16]. However, these algorithms need to average a number of spikes to propagate information forward or backward. Because the energy consumption of ASICs increases as the number of spikes increases, performing tasks with fewer spikes is preferable.

Temporal coding enables SNNs to perform tasks with fewer spikes compared to rate coding. SpikeProp is the first algorithm that enabled multilayer SNNs to learn by using temporal coding[7]. SpikeProp obtains the BP algorithm by directly differentiating spike timing with respect to weights. Mostafa [38] improved the learning performance of SNNs in temporal coding on recognition tasks by using non-leaky spiking neurons. Comsa et al. [12]

used an alpha synaptic function and employed an evolutionary algorithm to search for the hyperparameter to improve the learning performance.

In this paper, we propose a supervised learning algorithm for SNNs based on temporal coding, which is similar to those mentioned above [38, 12]. The novelties of our work are as follows: we use a different neuron model to simplify the hardware implementation and modify the learning algorithm to improve the learning performance.

### 2.2 Computing with analog resistive memory

Analog resistive memory such as resistive RAM and phase change memory is considered important building blocks for next-generation artificial-intelligence processors

[58, 21, 64]. Many research groups have already designed circuits with analog resistive memory to efficiently compute MVMs [9, 11, 4, 28, 10]. Among various VLSI implementations with analog resistive memory, circuits that process information by using the timing of electronic pulses are expected to be advantageous in terms of energy efficiency because of their simple implementation [10, 50, 54, 33, 2, 62, 61, 3, 65]. For example, circuits using the timing difference between two pulses [62, 61] and those using the pulse-width difference between two pulses [3, 65] are able to compute MVMs efficiently. Nandakumar et al. [40] developed circuits with phase change memory that is capable of computing single-layer SNNs based on rate coding.In this paper, we show that SNNs based on temporal coding can be straightforwardly mapped to the above VLSI implementation schemes.

## 3 Methods

This section introduces our proposed SNNs and their learning algorithms. The SNNs described here are feedforward networks of spiking neurons as shown in Fig. 1(b), where spikes propagate from one layer to the next layer without skipping layers. The SNNs are to be trained such that the network can convert a given input spike pattern into a specific output spike pattern. We also present several examples of VLSI implementations of the SNNs where analog resistive memory is used as network weights.

### 3.1 Models

The time evolution of the membrane potential of a spiking neuron that we employed is given by

(1) | ||||

(2) |

where we define as the membrane potential of the th spiking neuron in the th layer, and define as the spike timing generated by the same neuron to the present input pattern. We define as the weight of the connection from the th neuron in the th layer to the th neuron in the th layer. is the number of neurons constituting the th layer. The zeroth layer and the th layer are referred to as the input and output layers, respectively. The other layers are hidden layers. This network is an -layer SNN.

The analytical solution of (1) is given by

(3) |

The neuron generates (fires) a spike when its membrane potential reaches the firing threshold . Then, the membrane potential is fixed to the reset voltage (), which prevents the neuron from firing twice for a given input spike pattern. Fig. 2(b) shows a schematic diagram to indicate the dependence of the evolution of the membrane potential on the timing of received spikes and the connection weights.

The timing of spikes can be obtained as follows. The spike timing satisfies the equation

(4) |

where is a set of indices of neurons in the th layer that sends spikes before the th neuron in the th layer fires. With (4), we obtain the spike timing as follows:

(5) |

### 3.2 Supervised learning of SNNs

For the supervised learning of SNNs, we define the following functions

(6) | ||||

(7) | ||||

(8) | ||||

(9) |

where

are the cost function, loss function, softmax function, and temporal penalty term, respectively. Teacher labels are represented as a vector

such that is 1 when the th label is given, and 0, otherwise. The weights of the entire network are represented as a vector . The temporal penalty term is defined by the total difference between the spike timing of the output neurons and the timing of a reference spike . Note that is the vector of spike timing of the neurons in the output layer. As explained later, the temporal penalty is intended to circumvent learning difficulties. The coefficient controls the effect of the temporal penalty term.The network is trained by updating the weights by using SGD on a dataset as follows:

(10) |

where is the update of and is the learning rate. The derivative of the cost function with respect to weights is given by

(11) |

where we define the propagation error as

(12) |

By using the following derivatives

(13) | ||||

(14) | ||||

(15) |

the updating difference (10) can be obtained.

The derivative of the cost function is rigorous unless a neuron ceases to fire because of the update. However, we found that the network experience two difficulties during training. These difficulties are related to destructively large weight updates and non-unique spike timing, respectively. The destructively large weight updates are caused by infinitesimal values in the denominator of (13) or (14). Our solution to prevent these unacceptable large weight updates is to redefine (13) and (14) as

(16) | ||||

(17) |

where is a positive constant.

Next, the loss function shown in (7) is invariant against an arbitrary value such that

(18) |

where is a vector of which the elements are all unity. Owing to its invariant nature, the loss function does not uniquely specify the timing of output spikes, which forces the timing of the output spike to become large, resulting in the disappearance of spikes. The temporal penalty term, which is introduced to solve this problem, enables the cost function to uniquely specify the timing of the output spike to approximate of the reference spike.

Here, we summarize the difference between our algorithm and the algorithms reported before [38, 12]. These previous studies were also concerned with the supervised learning of SNNs based on temporal coding. They used almost the same loss function like an exponential form of spike timing in [38]. Their approach to avoid the destructively large weight updates was to normalize the gradient when it became excessively large [38], or to clip the gradient to a fixed value [12]. We use a simple method (16) and (17), which can prevent the magnitude of the updated weights from becoming too large. They also reported the spike disappearance problem as explained above. Mostafa [38] avoided this problem by introducing a penalty term that increased the sum of values of weights and the regularization term, but the two terms need to be balanced appropriately. Comsa et al. [12] also avoided this problem by adding a small penalty that increased the sum of weights when a neuron did not fire. On the other hand, we avoid this problem by introducing the temporal penalty term, which is a more direct way to stabilize the timing of an output spike. Our approach enhances the performance compared to that of Mostafa [38], and is slightly superior to that of Comsa et al. [12].

### 3.3 Deep learning with SNNs

Because the important properties of deep learning are sequential nonlinear transformations over layers and end-to-end learning, the proposed algorithm can be considered to be a form of deep learning. The proposed SNN processes information by converting the pattern of an input spike in a layer-by-layer manner as follows:

(19) |

where . Note that is considered to be a sufficiently large constant if the th neuron in the th layer does not fire at the end of the simulations. It is worth noting that the layer-to-layer nonlinear transformation in SNNs is different from that in ANNs. According to (5), the spike timing is a linear function of the presynaptic spike timing if , whereas it does not depend on if

. This function seems to be similar to a rectified linear unit (ReLU)

[39], which is commonly used as an activation function for ANNs. However, unlike ReLUs, the turning point,

, is determined by other spikes and their weights. In other words, the information is processed by using the timing order of spikes, which is the most unique characteristic of SNNs based on temporal coding.### 3.4 Consideration of VLSI implementation

For analog VLSI implementations, the dynamics of SNNs must be physically realized in circuits that are carefully designed using either the nonlinear or linear properties of transistors. Here, we show that the time evolution of the membrane potential of neurons as expressed by (1) can be mapped to circuits with analog resistive memory in a simple and straightforward manner.

Fig. 3(a) shows an example of such implementations including the analog resistive memory, operational amplifiers, capacitors, and comparators. The membrane potential is represented as the voltage at the capacitor , which is determined by the charge at the node . A spike is generated by a comparator when the membrane voltage exceeds the firing threshold voltage . The existence of the spike is represented with the output voltage and as follows:

(20) |

where are constants and is assumed.

In Fig. 3, and represent the conductance (the inverse of the resistance) from the th neuron in the th layer to the th neuron in the th layer. This combination can represent positive weights and negative weights as explained later. Currents flow through the conductance to the th neuron in the th layer, which causes the membrane potential to change. It should be noted that, in Fig. 3(a), the voltage at node Q is virtually grounded because of the operational amplifiers. The time evolution of the membrane potential is obtained by Kirchhoff’s current law and Ohm’s law as follows:

(21) |

We introduce another simpler circuit configuration, as shown in Fig. 3(b), to allow us to investigate the effect of an approximation of the neuron model (1) on the learning performance. This circuit implementation is similar to that in [61], which was proposed for calculating vector-matrix operations. Because this circuit (Fig. 3(b)) uses no operational amplifiers, which often consume the largest amount of energy in analog circuits, this implementation is expected to be more energy efficient. Similar to (21), the time evolution of the membrane potential is performed by using Kirchhoff’s current law and the Ohm’s law as follows:

(22) |

where we define

(23) | ||||

(24) |

Note that the following relationships

(25) |

were used. By setting and letting , (22) converges to (1), where is realized by . However, smaller values of are more preferable in terms of energy efficiency. The effects of the finite values of on the learning performance were not previously considered [61]. Thus, we investigated the effects by numerical simulations. Except when explicitly pointed out, simulations were conducted with the neuron model expressed by (1). Details of the simulation method and the learning methods for the neuron model expressed by (22) are provided in the appendix. We note that if current sources (field-effect transistors) are used instead of resistors, as employed by others [65], the time evolution equation becomes similar to (21) corresponding to the case of Fig. 3(a).

Finally, we explain the advantages of SNNs over time-domain circuits of ANNs[33, 3, 65, 61] in terms of the energy efficiency of circuits. First, the circuits of Marinella et al. [33] need Analog-to-Digital Converters (ADCs) to convert the signal for each layer of ANNs, which are known as energy-hungry processes. Actually, it is reported that ADCs were the energy bottleneck in the systems [33]. Circuit implementations requiring no ADCs were proposed [3, 65, 61]. These circuits compute the positive and negative weights independently, and subsequently take the difference of the two. As a result of the above principles, these circuits require two circuits to compute the positive and negative weights, and the dynamic range decreases significantly because the difference is taken (corresponding to digit loss in digital computing). The advantage of using SNNs over these circuits is that our circuits shown in Fig. 3 require no full ADCs because the input and output of the circuits consist of binary spikes, and that the circuits can compute the positive and negative weights at the same time.

## 4 Experimental setup

In this section, we describe the experimental setup for evaluating the learning performance of the proposed algorithm with the MNIST dataset[24]. We also present the experimental setup for evaluating the robustness of the algorithms against manufacturing variations in the devices, which are unavoidable in analog VLSI implementations. In all the experiments, we set . We again note that neuron model (1) was used unless otherwise noted.

### 4.1 Model evaluation with MNIST dataset

The MNIST dataset [24] consists of 70,000 images of handwritten digits, each of which is represented as a -dimensional vector with a label of a corresponding integer (from 0 to 9). In our experiments, 60,000 images were used for training, and the remaining 10,000 images were used for testing, as in [24].

To process data in SNNs, the vector data are converted into spike patterns as follows: we normalize all the elements (pixels) of the vector as . Then, the corresponding input spikes are generated at

(26) |

where is the maximum time window of the input spikes, and was set at ms in our experiments. A spike is not generated when

. At the training phase, to improve the generalization performance, noise was introduced according to a Gaussian distribution

with zero mean and standard deviation

.### 4.2 Robustness against device manufacturing variations

In analog VLSI implementations, variations in the manufactured devices are unavoidable, and must be taken into account during circuit design. In this experiment, we investigated the effect of manufacturing variations on the learning performance, which has yet not been reported for SNNs based on temporal coding, to the best of our knowledge. To suppress these effects, we employed a learning technique that is described later.

In the VLSI implementations of Fig. 3(a) and Fig. 3(b), various parameters, such as weights, capacitance, spike delays, and firing threshold, need to be taken into account. Owing to the computational complexity, we investigated the effects of variations on the learning performance of only the firing threshold and spike delays in this study. We chose these two parameters because they affect the network dynamics in a different way: the firing threshold changes the generation timing of spikes, whereas the spike delays shift the arrival timing of spikes. The other parameters can be investigated in the same way we explained below. The remainder of this section describes the mathematical representations of the variations and then the evaluations of the effects of the variations.

First, we mathematically model the variations of the firing threshold . We define the firing threshold of the th neuron in the th layer as

. For simplicity, we assume that the probability density distribution of the variation is a clipped Gaussian distribution as follows:

(27) | ||||

(28) |

where represents a Gaussian distribution with mean and standard deviation and the operation ‘’ means drawing a value from the distribution.

Second, we mathematically model the variation of spike delays. As shown in Fig. 4(a), spike delays from the th neuron in the th layer to the th neuron in the th layer are defined as . We assume a Gaussian probability density distribution of the spike delays

(29) |

We also assume

(30) |

such that the condition is always satisfied as shown in Fig. 4 (b). The offset can be removed from the simulation as follows. Consider the modification of in the SNNs. If we add a constant to for all and , all the spike timings of the neurons in the th-layer are delayed by , e.g., for the first layer the delays are and for the second layer the delays are . Therefore, the spike timing in the output layer is delayed as . We denote the delayed spike timing by for the th layer. Note that the loss function shown in (7) is invariant against the delays as

(31) |

because the effects of the difference between and are cancelled out in (8). Moreover, in (9), the temporal penalty term is invariant against the delays if we add the delay to the reference spike timing as

(32) |

Therefore, the cost function shown in (6) becomes invariant against the delays . This invariance indicates that a constant shift of the delay does not affect the learning results. Thus, our numerical simulations were conducted with . Note that a situation in which the spike timing causes the spike to arrive earlier than the arrival time may occur in the numerical simulation. However, information processing in feedforward SNNs is valid because the differences among the arrival times within each layer are maintained.

Using the variation models for the firing threshold and for spike delays , we investigated the effects of the variations on the learning performance in the following two cases: (i) the exact values of the parameters ( and ) are not known, but those distributions are known; (ii) the exact values of the variation parameters are known.

In case (i), the dynamics in the SNNs is black boxed, thus direct training is not possible. To overcome this problem, we employed a learning technique in the training phase. We extract the values of these parameters from (27) or (29

) with estimated standard deviations

. The same techniques were reported for training binarized neural networks

[35]. In case (ii), training the SNNs can be carried out in a straightforward manner.In the test phase, we evaluated the performance of the SNNs with various standard variations . Note that the experimental setup described in section 4.1 corresponds to the case of in case (i).

## 5 Results

### 5.1 Performance on the MNIST dataset

Fig. 5 shows the spike timing pattern (a raster plot) of a 2-layer SNN (784-800-10) after training for the input, the hidden, and output layers, from top to bottom, respectively. The horizontal axis represents the time in milliseconds and the vertical axis represents the index of neurons for each layer. These results show that the broad distribution of the spike timing in the hidden layer is such that a large portion of neurons in the hidden layer fire after the output neurons fire. In the output layer, the ninth output neuron fires the earliest, which corresponds to the correct label . Fig. 6 shows the time evolution of the membrane potential of the neurons in each layer. The membrane potentials evolve as (1) such that their behavior can be expressed by a piecewise linear function. For the hidden layer, because the neurons receive no spikes from the input layer after 5 ms, their membrane potentials evolve linearly after the time 5 ms. For the output neurons, their membrane potential varies nonlinearly because they receive many spikes from the hidden layer. The membrane potential of the output neurons was observed to be suppressed by spikes from the hidden layer at approximately ms except that of the output neuron corresponding to the correct label. One neuron fired at approximately 12 ms and the others at approximately . These results indicate that our proposed temporal penalty term is effective.

Similar but slightly different network dynamics is observed for 3-layer SNNs (784-400-400-10). Fig. 7 shows the raster plots of the spikes in the SNN. The distribution of spikes in the first hidden layer was broad, whereas that in the second hidden layer was relatively narrow. A large portion of neurons in the second hidden layer are observed not to have fired. This reduction in the number of firing neurons may imply that the neurons needed for information processing are selected for each data. Fig. 8 shows the time evolution of the membrane potential of the neurons in each layer. The results show that the membrane potentials of neurons in the second and output layers evolve nonlinearly, whereas those in the first layer evolve linearly after 5 ms.

Network | Coding | Accuracy |
---|---|---|

784-800-10 [55] | ANN | 98.4 % [55] |

784-800-800-10 [60] | ANN | 98.8 % [60] |

784-800-10 [27] | SNN (rate coding) | 98.64 % [27] |

784-300-300-10 [27] | SNN (rate coding) | 98.77 % [27] |

784-800-10 [38] | SNN (temporal coding) | 97.55 % [38] |

784-400-400-10 [38] | SNN (temporal coding) | 97.14 % [38] |

784-340-10 [12] | SNN (temporal coding) | 97.96% [12] |

784-500-10 (This work) | SNN (temporal coding) | 97.830.07 % |

784-800-10 (This work) | SNN (temporal coding) | 98.040.05 % |

784-300-300-10 (This work) | SNN (temporal coding) | 97.730.08 % |

784-400-400-10 (This work) | SNN (temporal coding) | 97.900.12 % |

Table 1

shows the results of the classification accuracy on the MNIST dataset. The standard errors ‘

’ for our results were obtained from 10 trials with different initial weights. The SNNs based on rate-coding [27] performed as good as ANNs[55, 60], whereas the SNNs based on temporal coding [38] performed slightly worse than ANNs. However, the improvements in our learning algorithm increased the performance of our algorithm compared with previous research [38], and the performance was slightly higher than [12]. Our best result is highlighted in bold in Table 1.Next, we show the results of the learning performance when we replace the original neuron model (1) with the alternative neuron model (22). We again note that the model (22) is considered to be more energy efficient. To reduce the computational costs, we focused on a small-scale network (169-300-10) and used a shrunk version of the MNIST dataset. A shrunk MNIST image was obtained by convoluting a kernel of which the elements are all onto an original image

with a stride of 2. The

pixel of the shrunk image is obtained by(33) |

Then, we obtained spike patterns by using (26) with the flatted shrunk two-dimensional data . In the experiment, we experienced difficulties in achieving learning convergence because of destructively large weight updates which had their origins in the temporal penalty term (9). We circumvented this difficulty by replacing the temporal penalty term (9) with

(34) |

to soften the effect of its large weight updates when the timing of the output spike is far from .

Fig. 9 shows the time evolution of the membrane potentials of the neurons for different values of (). When (Fig. 9(a)), the membrane potentials of the neurons in the hidden layer evolve linearly after 5 ms and those in the output layer evolve nonlinearly, which is similar to the case of Fig. 6. On the other hand, when (Fig. 9(b)), the membrane potentials of the neurons in the hidden evolve nonlinearly, which is caused by the term on the right-hand side of (22). Moreover, the membrane potentials of the neurons are stabilized at some intermediate value, because (22) has an equilibrium point with

(35) |

Fig. 10 shows the results of the classification accuracy for various values of (). As the value of decreases, the classification accuracy decreases monotonically. As shown in Fig. 9, there exists an equilibrium point in the membrane potential lower than when the value of is small, which may complicate the training.

### 5.2 Effects of manufacturing variations on learning performance

We present the results of the numerical simulation of the experiments described in 4.2. Fig. 11 shows the results of the classification accuracy for various values of the standard deviations of the firing threshold in the test phase . When no variation is assumed at the training phase (), the classification accuracy decreases conspicuously in the test phase as the standard deviation increases. When variations are assumed in the training phase () corresponding to case (i) in 4.2, the degradation of the classification accuracy can be suppressed to an extent. Especially, the suppression is effective when the variation in the test phase is almost the same as that assumed in the training phase (). When the values of all are known corresponding to case (ii) in 4.2, the degradation of the classification accuracy can be well suppressed.

Next, the results of classification accuracy for various standard deviations of spike delays at the test phase are shown in Fig. 12. As in the case of variations of firing threshold in Fig. 11, when no variation is assumed at the training phase (), the classification accuracy decreases as the standard deviation at the test phase increases. While, when variations are assumed at the training phase () corresponding to case (ii) in 4.2, the degradation of classification accuracy can be suppressed to some extent. Especially, the suppression efficiently works when the variation at the test phase is almost the same as that assumed at the training phase (). When values of all are known corresponding to case (ii) in 4.2, the degradation of classification accuracy can be well suppressed.

The series of numerical simulations shown above indicates that our proposed algorithm is robust against the device manufacturing variations. This fact paved a way to manufacturing VLSIs implementing the proposed algorithm.

## 6 Concluding remarks

In this study, we pursued the goal of achieving both high algorithmic performance and high energy efficiency by proposing a novel supervised learning algorithm for multilayer SNNs which can be implemented in VLSI circuits with analog resistive memory. We also proposed novel learning techniques such as adding a temporal penalty term to improve the learning performance.

Although the performance of the proposed algorithm on the MNIST dataset was slightly worse than SNNs based on rate coding and ANNs, the performance was confirmed to be as high as the state-of-the-art SNNs based on temporal coding. We showed that VLSI circuits can be designed with analog resistive memory in a simple and straightforward manner. We used numerical simulation to investigate the extent to which the learning performance depended on the circuit design. We also showed that the effects of manufacturing variations, which are unavoidable for real VLSI implementations, can be suppressed by estimating the magnitude of variations or by observing the exact values of these parameters. Therefore, we achieved our goal to a certain extent. The next step will be the fabrication of VLSI chips based on the circuits and algorithms proposed in this paper.

In this study, we focused on the development of energy-efficient algorithms for feedforward networks as a first step. Recently, it was shown that SNNs with recurrent connections can process time-series data with high performance [43, 45, 5]. For various applications of edge computing, the development of energy-efficient and hardware-friendly learning algorithms for SNN with recurrent connections is strongly anticipated. Thus, the development of these algorithms should be the next goal. Finally, we would like to remark that recent advances in biologically plausible and spike-based BP [29, 44, 18, 41, 6, 63] are foreseen to be beneficial to construct energy-efficient systems including the training process [46].

## Appendix: Back propagation for case (22)

Here, we redefine the indices of the neurons to satisfy their indices to be sorted in the order of their spike time as

(36) |

Then, we obtain the corresponding sorted weights

(37) |

The time evolutions of the membrane potentials are piecewise such that they can be represented as the corresponding equations. For the th neuron in the

th layer, from the moment when it receives the

th earliest spike to when it receives the th earliest spike, the time evolution of its membrane potential is described as(38) | ||||

for | (39) |

The terminal values of are denoted by as follows:

(40) |

where represents the number of received spikes before the th neuron fires, which can take different values for different neurons. Using the above definitions, we can obtain an analytical form of the membrane potential as follows:

(41) |

where

(42) | ||||

(43) |

Furthermore, the spike timing of the th neuron in the th layer is obtained as follows:

(44) |

Fig. 13 shows a schematic diagram of the time evolution of the membrane potential .