A Long Short-Term Memory for AI Applications in Spike-based Neuromorphic Hardware

07/08/2021, by Philipp Plank et al.

In spite of intensive efforts it has remained an open problem to what extent current Artificial Intelligence (AI) methods that employ Deep Neural Networks (DNNs) can be implemented more energy-efficiently on spike-based neuromorphic hardware. This holds in particular for AI methods that solve sequence processing tasks, a primary application target for spike-based neuromorphic hardware. One difficulty is that DNNs for such tasks typically employ Long Short-Term Memory (LSTM) units. Yet an efficient emulation of these units in spike-based hardware has been missing. We present a biologically inspired solution that solves this problem. This solution enables us to implement a major class of DNNs for sequence processing tasks such as time series classification and question answering with substantial energy savings on neuromorphic hardware. In fact, the Relational Network for reasoning about relations between objects that we use for question answering is the first example of a large DNN that carries out a sequence processing task with substantial energy-saving on neuromorphic hardware.


Implementing a long short-term memory in spike-based neuromorphic hardware

Figure 1: Schematics and dynamics of LIF neurons with and without AHP currents. A) Schematic of the implementation of spike frequency adaptation on LIF neurons. B) Response of the LIF neuron model without AHP currents (red compartment in panel A) to a synthetic constant input current. The input postsynaptic current (PSC) is leaky-integrated into the membrane voltage. Spikes are emitted and the voltage is reset each time it crosses the firing threshold. In response to a piecewise constant input PSC, the neuron fires at a constant rate. C) Response to a piecewise constant input PSC of a LIF neuron with AHP currents that shows spike frequency adaptation. The adaptation is implemented by means of an after-hyperpolarizing (AHP) current triggered by the spiking of the neuron. Each output spike decreases (makes more negative) the AHP current, thus reducing the total current that is integrated. This weakens subsequent spiking, and we see that even with a constant input PSC, the spike rate decreases over time, i.e., we obtain spike frequency adaptation (SFA). The decay of the AHP current is usually much slower than the decay of the membrane voltage. Thus even after an extended gap of 700 ms, the neuron retains memory of its previous input spikes and shows weaker spiking in response to the input PSC.

Working memory is maintained in an LSTM unit in a special memory cell, to which read- and write-access is gated by trained neurons with sigmoidal activation function [Hochreiter1997]. Such an LSTM unit is difficult to realize efficiently in spike-based hardware. However, it turns out that by simply adding a standard feature of some biological neurons, slow after-hyperpolarizing (AHP) currents, a spiking neural network (SNN) acquires working memory capabilities similar to those of LSTM units over the time scale of the AHP currents. These AHP currents lower the membrane potential of a neuron after each of its spikes (see Fig. 1). Furthermore, these AHP currents can easily be implemented on Loihi, with the desirable side benefit of reducing firing activity, and therefore energy consumption. In addition, neurons with slow AHP currents capture another essential feature of LSTM units: gradients can pass iteratively through the content of a memory cell of an LSTM unit without being subject to exponential growth or decay, because the content of the memory cell is effectively connected to itself with a weight of size 1. The amplitude of the AHP current can be viewed as a replacement of the content of the memory cell of an LSTM unit, and because this amplitude decays slowly, gradients that go backwards in time through this hidden variable are also protected from exponential growth and decay. Therefore SNNs that contain neurons with slowly changing AHP currents can be trained very well with backpropagation through time (BPTT).
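To make the last point quantitative: with the discrete-time dynamics defined below (1 ms per compute step), the AHP current carries itself from one step to the next with a decay factor that is nearly 1, whereas the membrane voltage does not. Assuming, purely for illustration, an AHP time constant of 2 s (the text only specifies a time scale of seconds) and the 7 ms membrane time constant used later:

$\dfrac{\partial I_{\mathrm{AHP}}(t+1)}{\partial I_{\mathrm{AHP}}(t)} = e^{-1\,\mathrm{ms}/2000\,\mathrm{ms}} \approx 0.9995 \approx 1, \qquad \dfrac{\partial v(t+1)}{\partial v(t)} = e^{-1\,\mathrm{ms}/7\,\mathrm{ms}} \approx 0.87 .$

Over the 700 ms gap of Fig. 1 C, a gradient routed through the AHP variable is attenuated only by a factor $e^{-700/2000} \approx 0.7$, whereas a gradient routed solely through the membrane voltage would be attenuated by roughly $e^{-100}$, i.e., it vanishes.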

We refer to an SNN that contains these LIF neurons with slowly changing AHP currents as a long short-term memory SNN (LSNN), borrowing the terminology of [bellec2018]. There, the slow dynamics of a different hidden variable of a spiking neuron model, a time-varying firing threshold, was used to provide a longer short-term memory. But this hidden variable cannot be readily implemented on Loihi. Note that neurons with AHP currents can also participate in the spike-based computations. Hence working memory function and computational processing need not be allocated to spatially separated units in the resulting LSNN. This is important because shuffling of information between processing and memory is commonly viewed as an important factor in the high energy consumption of standard computing hardware.

Fig. 1 B shows the dynamics of a LIF neuron without AHP currents, where the neuron performs leaky integration of the input postsynaptic current (PSC) to calculate the membrane voltage. The neuron emits spikes when this voltage exceeds a threshold and then resets its value to zero. Fig. 1 C shows the dynamics of a LIF neuron with spike-induced AHP currents that hyperpolarize the membrane voltage. The AHP current becomes more negative by a fixed amount whenever the neuron spikes, i.e., $z(t) = 1$, and decays between spikes with a large time constant. This current, along with the input current, is leaky-integrated to calculate the membrane voltage. Upon each output spike, the increased negative value of the AHP current reduces the total input current into the neuron, and thus inhibits subsequent spikes. The large value of the AHP time constant is what enables the recurrent network to retain memory over larger time spans. LIF neurons with AHP currents are precisely defined as follows:

(1)   $v_i(t+1) = \alpha_v\, v_i(t) + g_m \big( I_i(t) + I_{\mathrm{AHP},i}(t) \big)$, with $v_i$ reset to $0$ whenever it crosses the threshold $v_{\mathrm{th}}$ (an output spike $z_i(t) = 1$)
(2)   $I_{\mathrm{AHP},i}(t+1) = \alpha_{\mathrm{AHP}}\, I_{\mathrm{AHP},i}(t) + \beta\, z_i(t)$

where $v_i(t)$, $I_i(t)$, and $I_{\mathrm{AHP},i}(t)$ denote the membrane voltage, input PSC, and AHP current of neuron $i$, and $\alpha_v = e^{-\Delta t / \tau_v}$ and $\alpha_{\mathrm{AHP}} = e^{-\Delta t / \tau_{\mathrm{AHP}}}$; $\tau_v$ and $\tau_{\mathrm{AHP}}$ are the time constants of exponential decay of the membrane voltage and the AHP current respectively, with $\tau_{\mathrm{AHP}} \gg \tau_v$. $g_m$ is the membrane conductance, and $\beta < 0$ is the change of the AHP current caused by an output spike. The definition of the input PSC $I_i(t)$ as a function of input spikes, and a more detailed model, are given in Methods. For the purpose of this paper, the membrane voltage and currents are unitless quantities whose values are those used on Loihi. The multi-compartment feature of Loihi allows the AHP current to be maintained within the same neuron; hence LIF neurons with AHP currents can be implemented very efficiently on Loihi. Fig. 1 A shows the schematic of this multi-compartment neuron.
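To make Eqs. 1–2 concrete, the following is a minimal NumPy sketch of a single LIF neuron with an AHP current, driven by a piecewise constant input PSC as in Fig. 1 C. All parameter values (threshold, time constants, AHP increment) are illustrative choices, not the fixed-point values used on Loihi.

import numpy as np

def simulate_lif_ahp(I_in, tau_v=7.0, tau_ahp=2000.0, beta=-0.05, g_m=1.0, v_th=1.0, dt=1.0):
    """Discrete-time LIF neuron with an after-hyperpolarizing (AHP) current (Eqs. 1-2).
    I_in: 1D array of input PSC values, one per 1 ms time step. Parameters are illustrative."""
    alpha_v = np.exp(-dt / tau_v)          # decay factor of the membrane voltage
    alpha_ahp = np.exp(-dt / tau_ahp)      # much slower decay of the AHP current
    v, i_ahp = 0.0, 0.0
    spikes = np.zeros_like(I_in)
    for t, i_in in enumerate(I_in):
        v = alpha_v * v + g_m * (i_in + i_ahp)   # Eq. 1: leaky integration of input PSC + AHP current
        if v >= v_th:                            # threshold crossing: emit a spike and reset the voltage
            spikes[t] = 1.0
            v = 0.0
        i_ahp = alpha_ahp * i_ahp + beta * spikes[t]   # Eq. 2: each spike makes the AHP current more negative
    return spikes

# Piecewise constant input as in Fig. 1 C: a stimulus, a 700 ms gap, then the same stimulus again.
I_in = np.concatenate([0.3 * np.ones(300), np.zeros(700), 0.3 * np.ones(300)])
spikes = simulate_lif_ahp(I_in)
print("spikes during first stimulus: ", int(spikes[:300].sum()))
print("spikes during second stimulus:", int(spikes[1000:].sum()))

Because the AHP current decays slowly, the response to the second stimulus is weaker than to the first, reproducing the spike frequency adaptation and the memory across the gap illustrated in Fig. 1 C.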

These networks of adaptive LIF neurons can be programmed onto the massively parallel Loihi multi-core architecture. Each of its neuro-cores consists of multiple independent SRAMs holding neural and synaptic parameters. In addition, each neuro-core computes the dynamics of up to 1024 single-compartment (no AHP) or up to 512 two-compartment (with AHP) neurons locally in memory and thereby avoids expensive data movement between processing elements and external memories. 128 such interconnected neuro-cores form a Loihi chip. Systems like the 32-chip Nahuku platform finally allow us to execute large-scale models such as our Spiking RelNet.

It had been shown already in [bellec2018] that a related mechanism for spike-frequency adaptation, which is more difficult to realize in neuromorphic hardware, enables networks of spiking neurons to achieve a similar performance level as LSTM networks for many temporal processing tasks in current AI. We show that the previously discussed mechanism with AHP currents provides similarly good performance for SNNs.

Comparing the energy consumption of spiking and non-spiking RNNs with Long Short-Term Memory for a standard time-series classification benchmark task

In order to test the energy efficiency of the proposed emulation of LSTM units with LIF neurons with AHP currents, we use a classical time series classification task: sequential MNIST (sMNIST). Here the pixel values of handwritten digits from the MNIST dataset [lecun2010mnist] are presented sequentially in a fixed order, pixel by pixel, and the task is to identify the underlying digit. The gray values of pixels are encoded by spikes through a population of spiking input neurons that fire when the gray value crosses some threshold, where each neuron in the population has a different threshold (see Fig. 2 A). We trained a recurrent network consisting of 240 LIF neurons for this task and implemented it on Loihi. A random subset of 100 of them were equipped with AHP currents. Using the technique DEEP-R [bellec2018deep], we train the network to be sparsely connected, with 20% of the recurrent connections enabled. Details on the network structure and parameters can be found in Methods.

Figure 2: Illustration of the sMNIST task and comparison of performance and energy consumption on spiking and non-spiking hardware. A) The input pixels get encoded by spikes based on a threshold-crossing method for a sequence of pixel values. 80 thresholds were used, represented by 80 input neurons, which send spikes depending on the change of the pixel value with respect to the previous pixel value. B) The network consists of an input layer sending spikes, a recurrently connected layer of LIF neurons with and without AHP currents, and a linear readout layer. C) The classification accuracy of the network running on Loihi was compared to the full-precision LSNN, a network of LIF neurons without AHP currents, an artificial RNN, and an LSTM network, as in [bellec2018]. D) The energy-delay product (EDP) was used to compare the time and energy performance of the spiking network running on Loihi with a corresponding LSTM network running on the GPU Nvidia RTX 2070 Super, utilizing parallel evaluation of 100 samples at the same time (batched) and one sample at a time, as well as on the CPU Intel Core i5-7440HQ (hardware configurations listed below).

Hardware configurations: Loihi: Nahuku board (ncl-ghrd-01), CPU: Intel Core i9-7920X, RAM: 128GB, OS: Ubuntu 16.04.6 LTS, NxSDK: 0.95. Nvidia RTX 2070: Nvidia RTX 2070 Super, GPU-RAM: 8GB, CPU: Intel Core i7-9700K, RAM: 32GB, OS: Ubuntu 16.04.6 LTS, Python 3.6.5, TensorFlow-GPU: 1.14.0, CUDA: 10.0. Intel Core i5-7440HQ: RAM: 16GB, OS: Windows 10 (build 18362), Python 3.6.7, TensorFlow: 1.14.1. Performance results are based on testing as of July 9, 2021 and may not reflect all publicly available security updates. Results may vary.

In order to compare accuracy, execution time and energy consumption with conventional hardware, we also implemented an LSTM network for solving the same task on CPUs and GPUs. The test accuracy of the spiking network on Loihi was 96.0%, which is competitive with the full-precision artificial networks as well as with the best reported full-precision LSNN from [bellec2018] (see Fig. 2 C). We focus on delay (execution time) and energy consumption as the main metrics in our benchmarks. In integrated circuits built with CMOS technology, which is used in modern CPUs, GPUs and also Loihi, there is typically a trade-off between energy consumption and delay: e.g., an increased supply voltage (and hence energy) will decrease the delay. Therefore the product of the consumed energy and the measured delay, the energy-delay product (EDP), is well suited for comparing applications across different hardware architectures, provided these applications have a clear delay metric, e.g., time per classification. The EDP of the SNN running on Loihi is 4 orders of magnitude lower than that of the network on CPU or GPU in the batch size 1 regime (see Fig. 2 D), with Loihi outperforming them by more than 2x on execution time and more than 1000x on energy consumption per inference. Details regarding the benchmark procedure can be found in the Supplement.
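As a concrete illustration of the metric, the EDP is simply the product of the energy consumed per inference and the time needed per inference. The numbers below are placeholders for two hypothetical platforms, not the measured values of Fig. 2 D.

def energy_delay_product(energy_per_inference, delay_per_inference):
    """Energy-delay product (EDP): energy per inference times time per inference.
    Lower is better; it penalizes solutions that are either slow or power-hungry."""
    return energy_per_inference * delay_per_inference

# Placeholder numbers (joules, seconds), not measured values:
edp_neuromorphic = energy_delay_product(1e-3, 5e-3)
edp_gpu = energy_delay_product(1.0, 1e-2)
print(f"EDP ratio (GPU / neuromorphic): {edp_gpu / edp_neuromorphic:.0f}x")  # ~2000x for these made-up numbers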

A reason for this significant improvement in execution time and energy consumption compared to LSTMs on conventional hardware is the relatively small network of a few hundred neurons which is sufficient to solve this task. Thus, the network fits on a single Loihi chip and uses only one or two neuro-cores. Being able to keep spike traffic within a neuro-core is the fastest and most efficient way to process spikes on Loihi. Another reason is the task itself, as processing the input pixel by pixel in a time series manner suits the LSNN architecture of the network. The information processed is sparse over time, meaning firstly that the amount of information in a single pixel value is low and we require only a single time step to transfer it to the network, and secondly that the large number of time steps (pixels) allows sufficient time for the neural states to evolve and assimilate information from the different pixels. Therefore this network architecture on neuromorphic hardware is most effective at processing time series data. Another aspect of performance on conventional hardware is the use of parallel processing of batches of data to increase throughput. Even with a batch size of 100 on the GPU, the spiking network on Loihi operating in the batch size 1 regime is still more efficient. Furthermore, multiple instances of the spiking network could be run in parallel on Loihi, although there is a bottleneck for the input data transfer on current Loihi boards.

Energy-efficient implementation of a large DNN for relational reasoning in neuromorphic hardware

We wondered whether this implementation of working memory in spiking neurons could also be used to implement large DNNs for more demanding sequence processing tasks in an energy-efficient manner in spike-based neuromorphic hardware. Therefore we implemented and tested a spiking variant of the relational network (RelNet) of [santoro2017] on Loihi, which we refer to as the Spiking RelNet. The question of whether this can be done in an energy-efficient manner is quite non-trivial, since the Spiking RelNet consists primarily of feed-forward networks. The Spiking RelNet takes as input a set of objects $o_1, \ldots, o_N$ and a single question $q$, encoded by input spike trains, see Fig. 4 B. As indicated in Fig. 4 A, it computes the function

(3)   $\mathrm{answer} = f\Big( \sigma\Big( \textstyle\sum_{i,j} g(o_i, o_j, q) \Big) \Big)$

with the output given through one-hot encoding of words by readout units, where $g$ is the relational function, $\sigma$ an element-wise aggregation function, and $f$ the readout function. The only recurrent network modules are the ones indicated as module B of Fig. 4, which transform each input sequence (a sentence of words in natural language) into spiking activity of 200 neurons within a compressed time span of 37 ms (all time intervals and time constants are specified in terms of Loihi computation steps, where we use the convention that one step corresponds to 1 ms of time; see Methods), see Fig. 4 B. This input embedding of sentences was carried out by LSTM networks in the RelNet of [santoro2017], and is carried out by LSNNs in the Spiking RelNet. In the next processing step (panel C of Fig. 4), the resulting compressed spike codes for each pair of sentences in the story and for the question are processed in parallel by copies of a feed-forward LIF network that implements the relational function $g$, which extracts salient relational information for the question from the two sentences. The outputs of these network modules are superimposed and connected one-to-one to a LIF layer, which implements the element-wise function $\sigma$ (aggregation function in panel D). The readout function $f$ processes the output of $\sigma$ through another feed-forward LIF network. The feed-forward networks do not use the AHP current. The answer to the question is then given by an application of soft-max to one-hot readout neurons that each favor a particular word as the answer (see panel E in Fig. 4). For more details, see Methods and Supplement.
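The composition in Eq. 3 can be summarized by the following schematic Python sketch. The helpers embed_lsnn, g_relational, sigma_aggregate, and f_readout are placeholders standing in for the trained spiking modules B-E (they operate here on dense vectors rather than spike sequences), and the enumeration of sentence pairs is only one possible convention.

import numpy as np

EMB = 200   # each sentence is compressed into the activity of 200 neurons (placeholder dimensionality here)

def embed_lsnn(sentence):            # module B: LSNN embedding (placeholder)
    rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
    return rng.random(EMB)

def g_relational(o_i, o_j, q):       # module C: relational function g (placeholder feed-forward net)
    return np.tanh(o_i + o_j + q)

def sigma_aggregate(x):              # module D: element-wise aggregation function sigma (placeholder)
    return np.maximum(x, 0.0)

def f_readout(x):                    # module E: readout function f followed by a softmax over candidate words
    logits = x[:10]                  # pretend the first 10 dimensions are word logits
    e = np.exp(logits - logits.max())
    return e / e.sum()

def spiking_relnet(story, question):
    objects = [embed_lsnn(s) for s in story]   # one embedding per sentence
    q = embed_lsnn(question)                   # question embedding
    # One instance of g per pair of sentences; its outputs are superimposed (summed) before aggregation.
    total = sum(g_relational(objects[i], objects[j], q)
                for i in range(len(objects)) for j in range(i, len(objects)))
    return f_readout(sigma_aggregate(total))   # probability over candidate answer words

answer_probs = spiking_relnet(["Mary went to the garden.", "John took the apple."], "Where is Mary?")
print(answer_probs.argmax(), round(float(answer_probs.max()), 3))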

In the above description, we observe that the Spiking RelNet uses both recurrent networks (part B of Fig. 4 A) and feed-forward networks (parts C, D, E of Fig. 4 A) to perform the computation. When scaling up to tasks with a larger number of objects, the fraction of these feed-forward components of RelNet increases (see Fig. 3 B). The reason is that the number of recurrent network modules scales linearly with the number of objects, whereas the number of feed-forward modules that compute the relational function $g$ increases quadratically, since we have an instance of $g$ for each pair of objects. Consequently, the numerous instances of the relational function $g$ occupy the majority of the hardware resources (see Fig. 5 C). This increasing fraction of feed-forward network modules is problematic from the perspective of energy efficiency, since prior emulations of feed-forward networks in spike-based hardware demonstrated that their advantage regarding energy consumption gets lost for larger networks when high classification accuracy is required [davies2021]. The reason is that these prior implementations had to use spike rate coding instead of event-based processing in order to achieve high classification accuracy. However, rate coding uses many spikes per neuron, thereby moving the network out of an energy-efficient working regime, and also increases the computation time of the network, i.e., reduces its throughput.

We show that this obstacle, which was largely based on experience with CNNs, can be overcome in the case of RelNet, since these networks can be implemented with high accuracy in a more event-based working regime. An important underlying difference to CNNs is that even for large problem instances, i.e., stories with many sentences, the number of relations that are relevant for answering a question tends to increase only linearly with the length of the story. Hence, with an aggressive spike-rate regularization during training (described in Methods and Supplement), one can force the network to focus its spiking activity on those events where potentially relevant relations are extracted from pairs of sentences. However, such strong spike rate regularization tends to affect the network performance in a substantial manner, since it drives many neurons into an ineffective state where their membrane potential is far away from the firing threshold, see the upper part of Fig. 3 B. We counter this by adding an additional regularization term to the loss function, called the voltage regularization loss, that penalizes the occurrence of these neuron states, see Fig. 3 A and Methods.

Figure 3: Illustration of voltage regularization and its capability to enforce, in conjunction with spike rate regularization, a sparse firing regime. A) The voltage regularization penalty as a function of the value taken by the scaled membrane voltage at a particular time step. The scaled membrane voltage is defined in Eq. 13. A value of 0 corresponds to the spiking threshold, and a value of -1 corresponds to the voltage reached with a zero input PSC. The membrane voltage is thus penalized if the scaled voltage falls outside a fixed range (shown in the panel). B) The distribution of the scaled voltage values across different batches, neurons, and time steps with and without regularization. C) The spikes used per neuron in relation to the network size (which varies for different story sizes). One observes that larger networks use fewer spikes per neuron as a result of spike rate regularization combined with voltage regularization, which results in savings in energy when run on hardware.
Figure 4: Spiking RelNet architecture and the spike-coding schemes that it uses. A) The top-level Spiking RelNet architecture. We embed each sentence and the question into spike sequence objects $o_i$ and $q$ respectively via an LSNN. For each pair of sentence objects $(o_i, o_j)$, we apply the relational function $g$ to the triplet $(o_i, o_j, q)$. The outputs of the relational function are aggregated in a LIF layer $\sigma$ and then passed to the final readout function $f$. B) The embedding scheme, where each word is presented with one-hot coded spikes, aligned so that the first word is provided at the very end of the duration. The spikes in the last part of the activity are padded to the compressed embedding length (red box) to form the time-compressed sentence embeddings $o_i$ and $q$. C) An instance of the spiking relational function $g$ operating on a sample triplet $(o_i, o_j, q)$. D) The aggregation layer is a layer of LIF neurons that receives one-to-one connections from each relational function instance. It aggregates the spike trains from across the relational function instances and outputs a spike sequence for the readout network. E) The final readout function $f$ consists of a three-layer feed-forward LIF network followed by a linear readout (with one neuron per word in the dictionary) that integrates synaptic inputs only during the last 10 ms (marked as a yellow bar). The value of the readout at the final time step provides the input to the softmax, whose output produces the final answer through one-hot encoding of words.

In addition, we introduced another method for encouraging the network to work in an event-based processing regime: we forced the network to encode its output in the membrane potential of the readout neurons at a particular point in time (marked in Fig. 4 E). This compression of the time window for producing the network output induced the upstream feed-forward parts C, D, E of RelNet to constrain their firing activity to rather short time windows, see Fig. 4. In addition, both the readout neurons and all network neurons used a rather short membrane time constant of 7 ms, which makes it difficult to integrate information from firing rates of upstream neurons. As a result, the spike rate regularization managed to keep the average firing rates very low, in spite of the theoretically possible maximal firing rate of 1000 Hz caused by the absence of a refractory period in the neuron model (which was employed to enhance the backwards propagation of gradients in BPTT). Consequently, we see in Fig. 3 C that most neurons fired at most one spike during a network computation. Furthermore, the number of spikes per neuron decreased when RelNet was scaled up to larger instances that can answer questions about longer stories.

We tested the performance of the resulting spike-based RelNet implementation on Loihi on a standard benchmark dataset for question answering: the bAbI dataset introduced by [weston2015], which was also used for testing RelNet in [santoro2017]. This dataset consists of 20 different types of tasks that each probe different challenges in reasoning about relational information contained in a set of sentences, i.e., a story. For example, tasks 4 and 5 require reasoning about a set of facts that are provided in the form of sentences with 2 arguments ("The office is north of the bedroom.") or 3 arguments ("Mary gave the cake to Bill."). Task 14 requires reasoning about temporal relationships between events, task 15 requires basic deduction, task 18 requires reasoning about relative sizes of objects, task 19 requires planning of a path, and task 20 requires reasoning about the likely motivations of an agent (see Supplement for examples from tasks 15, 18, 19, 20). The questions are formulated in such a way that an answer can be given with a single word via one-hot encoding in the output (or with a sequence of two words in the case of path planning in task 19; one has here an output line for each such possible sequence). According to the convention of [weston2015] and [santoro2017], a task is considered solved if the network has an error rate of at most 5% on instances of the task that had not been used for training. When applying a RelNet to solve this task, each sentence (question) forms an object that is embedded via an LSNN into a spiking representation. Thus, the difficulty of a particular instance of a bAbI task, and the required size of the RelNet, grows quadratically with the number of sentences in the story, since the number of potential relations between the contents of sentences (in the context of the question) grows quadratically.

The whole SNN implementation of RelNet was largely trained end-to-end via BPTT for 17 of the bAbI tasks, with some extra measures to speed up training (see Methods). We excluded 3 of the 20 bAbI tasks, 'Task 2: Two Supporting Facts', 'Task 3: Three Supporting Facts', and 'Task 18: Basic Induction', because the ANN implementation of RelNet from [santoro2017] was also unable to solve these 3 tasks. The network is able to solve 16 of the 17 tasks that it was trained on with errors under 5%. The performance of the network was unsatisfactory on task 17, "Positional Reasoning", as a result of its complex sentences needing more time steps to process (see Supplement).

Figure 5: Spiking RelNet placement and optimization on Loihi. A) The different parts of the Spiking RelNet, highlighted with the color code used in C and D. B) The Spiking RelNet was configured on a Nahuku board with 32 Loihi chips. On each side of the board, 16 chips are placed in a checkerboard pattern. Each chip has 128 neuromorphic computing units, or neuro-cores. C) The full-scale Spiking RelNet, which can solve tasks with up to 20 sentences, utilizes 2308 neuro-cores on 22 chips. The detailed mapping of the different layers is shown. In order to minimize cross-chip spikes, some chips were not fully utilized. D) Straightforward assignment of relay neuro-cores on the same chip as the source neuro-cores (left side) and optimized assignment of the relay neuro-cores on the chips of the target neuro-cores, to minimize inter-chip spike traffic (right side). Note that the target neuro-cores must be assigned carefully to do this efficiently. E) The benefit of the optimized assignment in terms of energy-delay product as a function of network size, measured by the number of bAbI sentences, which roughly corresponds to the number of Loihi chips used. Thus the placement of the Spiking RelNet is optimized in terms of resource utilization as well as network performance on the hardware.

Optimizing the performance of large RelNets in spike-based hardware.

For the longest stories, which contain 20 sentences, the network contains 238604 neurons. When placing the densely connected recurrent and feed-forward layers of the Spiking RelNet onto Loihi, the hardware constraints on network connectivity (see Methods, Supplement) mean that we can place at most 128 neurons per neuro-core (less than the maximum possible 1024). Our most resource-efficient placement hence requires 2308 neuro-cores spread across 22 chips. Placing a network of this size onto Loihi brings with it the challenge of minimizing spike congestion. Congestion arises primarily when we route spikes from the LSNNs that perform the embedding (module B, Fig. 4) to the various instances of the relational function $g$ (module C, Fig. 4). A straightforward placement leads to excessive cross-chip spike transmission, which causes significant delays in spike transmission and slows down the computation. Therefore separate relay neuro-cores (marked green in Fig. 5 C and D), and an optimized allocation of instances of $g$ onto chips, were introduced to reduce cross-chip spike transmission. This resulted in significant improvements of the EDP, see Fig. 5 E. The final optimized layout of the network over the chips can be seen in Fig. 5 C and the Supplement.

Another aspect of optimization concerns the number of time steps used, called the compute time, which affects not only the energy consumed and delay on Loihi, but also the training speed. We found that using spiking neurons without refractory period and membrane time constants of just 7 ms significantly reduced the required number of time steps, while causing only a mild decrease in accuracy.

Energy-efficiency of RelNet in spike-based neuromorphic hardware

We compared the energy consumption and delay of the spike-based implementation of RelNet on Loihi with GPU implementations of the ANN RelNet from [santoro2017] (see the energy and delay results table). One sees that the spike-based implementation consumes between 4 and 16 times less energy than the GPU implementation. The energy savings are lower for longer story sizes, apparently because these require the use of substantially more Loihi chips, and inter-chip communication appears to be less energy-efficient in this spike-based hardware. One should also note that the average length of a story for the 16 datasets that we consider is just 6.5 sentences. The computation time on Loihi was slightly larger than on the GPU, but the resulting EDP nevertheless remained lower for Loihi. For the longest and therefore slowest story size, the average computation time per sample is 6.54 ms wall-clock time, which would still be sufficient for online applications like voice control or virtual assistants.

Discussion

We have shown that a key tool for sequence processing with recurrent neural networks in machine learning and AI, the LSTM unit, can be replaced in spike-based neuromorphic hardware by neurons with a biologically inspired mechanism for spike frequency adaptation (SFA). SFA was achieved, similarly as in the brain, through spike-triggered hyperpolarizing currents on the time scale of seconds. Since neurons with SFA can also be used for generic network computations, this solution does not require a separation of units for computing and working memory; hence it can be viewed as an in-memory computing solution for the case of working memory. Like other in-memory computing solutions it comes with the benefit of avoiding the latencies and energy consumption that generally arise from traffic between computing and memory units. The resulting spike-based solution for a benchmark time series classification task such as sMNIST turns out to be three orders of magnitude more energy-efficient than state-of-the-art implementations of LSTM networks on CPUs and GPUs, while achieving virtually the same performance. This property could be especially interesting for low-latency processing of real-time workloads.

We have also shown that this method enables us to port large ANNs that involve LSTM units into spike-based hardware. We have demonstrated this for the example of relational networks, since these enable a qualitative jump in AI capabilities by supporting reasoning about relationships between objects in a story or image. An essential feature of our spike-based emulation of LSTM networks is that these networks can be trained very effectively through BPTT, like LSTM networks. In particular, the implementation of the RelNet on the neuromorphic chip Loihi achieved almost the same performance as its ANN counterpart. The resulting reduction of energy consumption for relational reasoning is less drastic than for the time series classification task sMNIST, because the relational network also contains a large fraction of feed-forward neural network modules. But we have shown that the feed-forward network modules can be organized, through suitable output encoding and regularization mechanisms, so that they not only interact seamlessly with the recurrent neural network modules, but also compute with very few spikes per neuron, thus in an event-based rather than rate-coding regime. We believe that the energy efficiency of the resulting spike-based feed-forward modules can be increased further by more dedicated hardware. In that respect, RelNets appear to represent a more suitable target for implementing large AI networks in energy-efficient neuromorphic hardware than CNNs. Similar to [santoro2017], we expect that relational networks in neuromorphic hardware can be used not only for solving question-answering tasks in natural language, but also for reasoning about relations between objects in an image or in an auditory scene. This would provide a qualitative jump in the AI capabilities of energy-efficient neuromorphic hardware.

Another interesting next step will be to enable on-chip training of these spike-based emulations of LSTM networks by using e-prop instead of BPTT, which has already been shown to work very well for networks of spiking neurons with SFA [Bellec2020]. Also one-shot learning capability has been demonstrated for these spiking networks [Scherr2020], and it is likely that the required method will also enable one-shot on-chip training of these networks.

Finally, spiking neurons with SFA are a first step towards state-of-the-art point-neuron models for neurons in the neocortex [Billeh2020]. Hence, if our emulation of neurons with SFA can be expanded towards these more general GLIF (generalized leaky integrate-and-fire) neuron models, it will become possible to emulate state-of-the-art models for parts of the neocortex in large energy-efficient neuromorphic systems, thereby providing a new avenue for simulating large neural networks of the brain at substantially reduced energy cost. This would be an important breakthrough for the scientific analysis of these data-driven brain models, which is currently starting. These perspectives point to a significant advantage of neuromorphic hardware such as Loihi or SpiNNaker [Furber2014] that supports the implementation of variations of the standard spiking neuron model as they arise in further work towards spike-based AI or neuromorphic implementations of large-scale data-driven models of neural networks of the brain.

Methods

LIF neuron model with after-hyperpolarizing (AHP) currents

The dynamical behavior of a LIF neuron with after-hyperpolarizing (AHP) currents (indexed by $i$), as implemented in Loihi, is given by Eqs. 4–8. They show the dynamic interaction between the incoming spikes $z_j(t)$ at time $t$, the resultant input postsynaptic current (PSC) $I_i(t)$, the internal AHP current $I_{\mathrm{AHP},i}(t)$, the membrane voltage $v_i(t)$, and the output spikes $z_i(t)$. The equations are explained subsequently.

(4)   $I_i(t+1) = \alpha_I\, I_i(t) + \sum_j w_{ij}\, z_j(t - d_{ij})$
(5)   $I_{\mathrm{AHP},i}(t+1) = \alpha_{\mathrm{AHP}}\, I_{\mathrm{AHP},i}(t) + \beta\, z_i(t)$
(6)   $v_i(t+1) = \alpha_v\, v_i(t) + g_m\, \big( I_i(t) + I_{\mathrm{AHP},i}(t) \big)$
(7)   $z_i(t) = H\big( v_i(t) - v_{\mathrm{th}} \big)$
(8)   $v_i(t) \leftarrow 0$ whenever $z_i(t) = 1$ (reset; the voltage is then held at zero for the duration of the refractory period)

Eqs. 4–6 represent temporal convolutions with exponentially decaying kernels. Here $\alpha_I = e^{-\Delta t / \tau_I}$, $\alpha_{\mathrm{AHP}} = e^{-\Delta t / \tau_{\mathrm{AHP}}}$, and $\alpha_v = e^{-\Delta t / \tau_v}$, where $\tau_I$, $\tau_{\mathrm{AHP}}$, and $\tau_v$ are the decay constants of the corresponding exponentials. $\beta$ is the update to the AHP current in response to an output spike. Since the LIF state transition is computed once per algorithmic time step on Loihi, we associate a single compute step with 1 ms of biological time; correspondingly $\Delta t = 1\,$ms and all time constants are specified in ms.

Eq. 4 defines the PSC $I_i(t)$ as a function of the input spikes $z_j(t)$ arriving through incoming synapses with weights $w_{ij}$ and delays of $d_{ij}$ steps. The LIF neuron without AHP currents corresponds to the case $\beta = 0$, where the neuron simply performs a leaky integration of $I_i(t)$ to get the membrane voltage $v_i(t)$. When this voltage exceeds the threshold, it is reset to zero and an output spike is generated. In this case, the memory of the neuron is limited by the voltage and PSC decay time constants $\tau_v$ and $\tau_I$ respectively, which are short. This means that, even when connected in a recurrent fashion, the memory capacity of the network is typically limited to a similarly short time scale.

Eq. 5 defines the AHP current. With $\beta < 0$, each output spike (i.e., $z_i(t) = 1$) causes $I_{\mathrm{AHP},i}$ to become more negative by $|\beta|$. When leaky-integrated into the membrane voltage (Eq. 6), this increased negative value of $I_{\mathrm{AHP},i}$ lowers the rate of subsequent spikes, leading to spike frequency adaptation. The decay time constant $\tau_{\mathrm{AHP}}$ of the AHP current is much longer than $\tau_v$, typically on the time scale of seconds. This slow decay means that the inhibition persists over a much longer duration, thus functioning as a longer-term memory cell. This longer-lasting memory proves invaluable for solving the complex tasks demonstrated in this work.

Details of LIF network training

In this section we describe important details pertaining to the training of networks of LIF neurons with and without AHP currents. In all equations below, we drop the neuron index $i$ for brevity.

The scaled voltage

For the subsequent details regarding LIF network training, we find it useful to define a normalized version of the membrane voltage, i.e., a scaled voltage $\hat{v}(t)$.

We first notice that the membrane voltage $v(t)$ is the sum of two voltage components, $v_I(t)$ and $v_{\mathrm{AHP}}(t)$, which result from leaky-integrating $I(t)$ and $I_{\mathrm{AHP}}(t)$ respectively. Correspondingly, we can rewrite Eqs. 6 and 8 describing the voltage evolution as follows:

(9)    $v(t) = v_I(t) + v_{\mathrm{AHP}}(t)$
(10)   $v_I(t+1) = \alpha_v\, v_I(t) + g_m\, I(t)$
(11)   $v_{\mathrm{AHP}}(t+1) = \alpha_v\, v_{\mathrm{AHP}}(t) + g_m\, I_{\mathrm{AHP}}(t)$
(12)   $v_I(t), v_{\mathrm{AHP}}(t) \leftarrow 0$ whenever $z(t) = 1$ (reset of both components upon a spike)

The scaled voltage $\hat{v}(t)$ is defined below:

(13)   $\hat{v}(t) = \dfrac{v(t) - v_{\mathrm{th}}}{v_{\mathrm{th}} - v_{\mathrm{AHP}}(t)}$

$\hat{v}(t)$ takes the value 0 when $v(t) = v_{\mathrm{th}}$ and the value $-1$ when $v(t) = v_{\mathrm{AHP}}(t)$. This is motivated by the fact that $v_{\mathrm{AHP}}(t)$ is the value that $v(t)$ would take if there were no input PSC.

The surrogate gradient

The generation of spikes from the membrane voltage (Eq. 7), involves the use of a step function centered at the neuron threshold. This function is non-differentiable at the neuron threshold and provides a non-informative gradient of zero at all other points. Thus in order to use gradient back-propagation to train networks of LIF Neurons, we consider a surrogate gradient for the step function similar to methods used in previous works [bellec2018, zenke2021remarkable, esser2016convolutional, shreshtha2018slayer, neftci2019surrogate, zenke2018superspike, zhu2021efficient].

We rewrite the thresholding equation (Eq. 7) in terms of the scaled voltage $\hat{v}(t)$:

(14)   $z(t) = H\big( \hat{v}(t) \big)$

where $H(\cdot)$ is the unit step function and $\hat{v}(t)$ is the scaled voltage defined in Eq. 13.

We then use the following piece-wise linear function $\psi$ to serve as a pseudo-derivative of the step function $H$.

(15)   $\psi(\hat{v}) = \gamma \left(1 - \hat{v}/\hat{v}_{\min}\right)$ for $\hat{v}_{\min} \le \hat{v} < 0$;   $\psi(\hat{v}) = \gamma \left(1 - \hat{v}/\hat{v}_{\max}\right)$ for $0 \le \hat{v} \le \hat{v}_{\max}$;   $\psi(\hat{v}) = 0$ otherwise

where $\hat{v}_{\min} < 0$ and $\hat{v}_{\max} > 0$ define the support of the surrogate gradient, and $\gamma$ is a dampening factor that affects the magnitude of the derivative.

Thus $\psi$ peaks at the value $\gamma$ for $\hat{v} = 0$ and decays linearly to zero at $\hat{v}_{\min}$ and $\hat{v}_{\max}$.
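The following is a minimal NumPy sketch of the scaled voltage and the piecewise linear pseudo-derivative as reconstructed in Eqs. 13–15. The support bounds and dampening factor used below are illustrative values, not the ones used for training in this work.

import numpy as np

def scaled_voltage(v, v_ahp, v_th):
    """Scaled voltage (Eq. 13): 0 at the firing threshold, -1 at the voltage reached with zero input PSC."""
    return (v - v_th) / (v_th - v_ahp)

def pseudo_derivative(v_hat, v_min=-2.0, v_max=0.5, gamma=0.3):
    """Piecewise linear surrogate gradient (Eq. 15): peaks at gamma for v_hat = 0,
    decays linearly to zero at v_min and v_max, and is zero outside [v_min, v_max]."""
    v_hat = np.asarray(v_hat, dtype=float)
    left = gamma * (1.0 - v_hat / v_min)    # rising branch on [v_min, 0)
    right = gamma * (1.0 - v_hat / v_max)   # falling branch on [0, v_max]
    psi = np.where(v_hat < 0.0, left, right)
    return np.where((v_hat >= v_min) & (v_hat <= v_max), psi, 0.0)

# The forward pass still uses the hard step function (Eq. 14); only its derivative is replaced during BPTT.
print(pseudo_derivative([-3.0, -2.0, -1.0, 0.0, 0.25, 0.5, 1.0]))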

Spike rate regularization

For each neuron $j$, we calculate the mean rate $\bar{r}_j$ across the batch. We then add the following regularization loss

(16)   $\mathcal{L}_{\mathrm{rate}} = \lambda_{\mathrm{rate}} \Big( \sum_j \big( \bar{r}_j - r_{\mathrm{target}} \big)^2 \Big)^2$

where $r_{\mathrm{target}}$ is a target rate and $\lambda_{\mathrm{rate}}$ is the parameter that controls the strength of the regularization. This loss encourages the mean spike rate of each neuron across a random batch to be as close as possible to the target rate $r_{\mathrm{target}}$. It ensures that the network activity does not die out and that the spike rate stays sparse, owing to the low value of $r_{\mathrm{target}}$. The outermost square serves to dynamically reduce the regularization strength as the loss becomes smaller.

When training the Spiking RelNet, we use a more aggressive spike rate regularization that limits the total spike rate summed across all instances of the relational function $g$. This is described below in the section on the training of the Spiking RelNet.

Voltage regularization

The spike rate regularization has a tendency to push the synaptic weights low enough that the membrane voltages become very negative. This leads to a large number of time steps where the voltage values fall outside the support of the surrogate gradient, so that no gradient information can be propagated through them, which impedes gradient back-propagation. We are thus motivated to add a loss that penalizes voltages that fall significantly outside the support of the surrogate gradient function defined in Eq. 15. Since the surrogate gradient is defined in terms of the scaled voltage (Eq. 13), we define the voltage regularization loss in terms of it as well.

For each neuron $j$ and time step $t$, we calculate the loss component

(17)   $\ell^{V}_{j,t} = \big( \max(0,\, \hat{v}_j(t) - \hat{v}^{+}) \big)^2 + \big( \max(0,\, \hat{v}^{-} - \hat{v}_j(t)) \big)^2$

The total voltage regularization loss is given by

(18)   $\mathcal{L}_{\mathrm{volt}} = \lambda_{\mathrm{volt}} \sum_{j,t} \ell^{V}_{j,t}$

The above penalizes all neurons at all time instants at which the scaled voltage goes outside the range $[\hat{v}^{-}, \hat{v}^{+}]$, a fixed interval chosen around the support of the surrogate gradient. This prevents the network from using voltages that are excessively negative and increases the proportion of voltage values that lie within the support of the surrogate gradient. Moreover, limiting the range of the voltage values is also crucial in order to fit them into the range offered by the fixed-precision registers on Loihi.
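The two regularizers can be sketched as follows in NumPy; the functional forms follow the reconstruction in Eqs. 16–18, and the target rate, penalty bounds, and coefficients are illustrative values rather than the ones used by the trained networks.

import numpy as np

def spike_rate_loss(spikes, r_target=0.01, lam=1.0):
    """Spike rate regularization (Eq. 16). spikes: (batch, time, neurons) array of 0/1 values.
    Rates are expressed in spikes per time step; r_target and lam are illustrative."""
    rates = spikes.mean(axis=(0, 1))                     # mean rate of each neuron across batch and time
    return lam * np.sum((rates - r_target) ** 2) ** 2    # outermost square softens the penalty when it is small

def voltage_loss(v_hat, v_lo=-3.0, v_hi=1.0, lam=1.0):
    """Voltage regularization (Eqs. 17-18). v_hat: (batch, time, neurons) array of scaled voltages.
    Penalizes scaled voltages outside [v_lo, v_hi]; the bounds here are illustrative."""
    above = np.maximum(0.0, v_hat - v_hi)
    below = np.maximum(0.0, v_lo - v_hat)
    return lam * np.sum(above ** 2 + below ** 2)

rng = np.random.default_rng(0)
spikes = (rng.random((4, 100, 16)) < 0.02).astype(float)
v_hat = rng.normal(-1.0, 2.0, size=(4, 100, 16))
print(spike_rate_loss(spikes), voltage_loss(v_hat))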

Use of PSC kernels

For the LIF neuron model, the membrane voltage resets to zero upon spiking and stays at zero for the duration of the refractory period. This means that the gradient cannot propagate through the membrane voltage beyond the last spike. For a LIF neuron with AHP currents this issue is alleviated by the slow decay of the AHP current, through which gradients can be propagated much further into the past. However, for the feed-forward layers used in the relational network, which do not use the AHP current, we need to make use of the PSC to propagate gradients, as it is unaffected by the spiking of the neuron. We thus find that the use of a non-zero PSC decay time constant $\tau_I$, i.e., an exponentially decaying PSC, offers improved performance after training compared to using a delta PSC. The use of a non-delta PSC means that a change in the weight of an input synapse changes the rate at which the membrane potential rises, and therefore has the capacity to smoothly modify the spike time.

Details for the application to sMNIST

Input encoding

The gray values of the pixels from an MNIST image were encoded into spikes. 80 input neurons were used, and each input neuron was associated with a particular threshold for the gray value: there were 79 linearly spaced thresholds between 0 and 256. Every second threshold neuron referred to an increasing gray value, while the others referred to a decreasing gray value. If the gray value increases when transitioning from one pixel to the next, every second input neuron between the last threshold and the new one generates a spike (and analogously for decreasing gray values). The pseudo-code for the input encoding can be found in the Supplement. The last (80th) input neuron becomes active after the presentation of all 784 pixels and generates a spike at every subsequent time step, indicating the end of a sample; thus the presentation of one sample takes 840 time steps, and the classification happens at the last time step, i.e., time step 840. Each of the 10 output neurons denotes a digit, and the neuron with the highest membrane potential at the last time step defines the predicted class. The network was implemented on the Intel Loihi chip using the NxNet API from NxSDK v0.95.
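The following sketch implements the threshold-crossing encoding described above. The authoritative pseudo-code is in the Supplement; details such as which parity of neuron handles upward versus downward crossings are assumptions of this sketch.

import numpy as np

N_THRESH = 79                                  # threshold neurons; neuron 79 signals the end of the sample
THRESHOLDS = np.linspace(0, 256, N_THRESH)     # linearly spaced gray-value thresholds
T_TOTAL = 840                                  # 784 pixel steps plus trailing steps up to the classification step

def encode_smnist(pixels):
    """Encode a flattened 28x28 MNIST image (values 0..255, length 784) into spikes of 80 input neurons.
    Even-indexed threshold neurons respond to upward crossings, odd-indexed ones to downward crossings
    (assigning this parity is an assumption of this sketch)."""
    spikes = np.zeros((T_TOTAL, N_THRESH + 1), dtype=np.uint8)
    prev = 0.0
    for t, curr in enumerate(pixels):
        if curr > prev:     # upward transition: fire the "increasing" neurons whose threshold was crossed
            crossed = np.where((THRESHOLDS > prev) & (THRESHOLDS <= curr))[0]
            spikes[t, crossed[crossed % 2 == 0]] = 1
        elif curr < prev:   # downward transition: fire the "decreasing" neurons whose threshold was crossed
            crossed = np.where((THRESHOLDS >= curr) & (THRESHOLDS < prev))[0]
            spikes[t, crossed[crossed % 2 == 1]] = 1
        prev = curr
    spikes[len(pixels):, N_THRESH] = 1          # the last input neuron spikes at every step after the image
    return spikes

pixels = np.random.randint(0, 256, size=784)    # stand-in for a real MNIST digit
print(encode_smnist(pixels).shape)              # (840, 80)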

Network structure

An LSNN was used consisting of 240 neurons, 180 excitatory and 60 inhibitory. A random subset of 100 of the excitatory neurons was equipped with AHP currents. Additionally, 80 input neurons were used to perform the input spike encoding of the images, and 10 output neurons were used corresponding to the 10 classes of the MNIST dataset. The overall connectivity of the network, including the input and output connectivity, was kept at 20%, meaning that only 20% of the possible synapses between the neurons were used. This was achieved by using the rewiring technique DEEP-R [bellec2018deep] during training. The hyper-parameters used to train the network for Loihi included a baseline threshold of 127, together with fixed time constants, refractory period, and synaptic delays.

Details for the Spiking RelNet

In this section, we describe in detail the structure of the Spiking RelNet as applied to the bAbI tasks.

High-level network outline

Building on the general architecture proposed in [santoro2017], the Spiking RelNet takes as input objects $o_1, \ldots, o_N$ and a question object $q$, and implements the following function to compute its output.

(19)   $\mathrm{answer} = f\Big( \sigma\Big( \textstyle\sum_{i,j} g(o_i, o_j, q) \Big) \Big)$

Fig. 4 A shows the basic outline of this network. When applied to the bAbI task, the sentences of the story and the question are embedded into the spike sequences $o_i$ and $q$ respectively by means of LSNNs, which are recurrent networks consisting of LIF neurons both with and without AHP currents. To provide the input to the LSNN, we assign an input neuron to each distinct word used in the bAbI dataset. The words in a sentence or question are then presented in sequence, with each word being presented for a fixed duration during which only the corresponding input neuron fires continuously. We then take the spike activity of the LSNN over a final time window and pad it to the compressed embedding length of 37 ms to form the embedding spike sequences $o_i$ and $q$ (see Fig. 4 B).
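A sketch of how a sentence could be turned into the input spike raster for the embedding LSNN: one input neuron per word in the vocabulary, each word presented for a fixed number of steps during which only its neuron fires. The presentation duration below is a placeholder, since it is not specified here.

import numpy as np

def sentence_to_spikes(sentence, vocab, steps_per_word=20):
    """One-hot word presentation: only the input neuron of the current word fires, for steps_per_word steps.
    steps_per_word is a placeholder value."""
    words = sentence.lower().rstrip(".?").split()
    spikes = np.zeros((len(words) * steps_per_word, len(vocab)), dtype=np.uint8)
    for k, w in enumerate(words):
        spikes[k * steps_per_word:(k + 1) * steps_per_word, vocab[w]] = 1
    return spikes

vocab = {w: i for i, w in enumerate(["mary", "went", "to", "the", "garden", "where", "is"])}
raster = sentence_to_spikes("Mary went to the garden.", vocab)
print(raster.shape)   # (number of words * steps_per_word, vocabulary size)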

The function $g$ is the relational function. It receives as input a triplet of spike sequences $(o_i, o_j, q)$, corresponding to a pair of sentences and the question, and produces a spike sequence as output. It is implemented as a four-layer feed-forward spiking neural network with LIF neurons. We have an instance of $g$ for each pair of sentences $(i, j)$, so that the ordering of the sentences in the story is made available to the network.

The function $\sigma$ is an element-wise function. It is implemented by means of a LIF layer to which each instance of $g$ is connected one-to-one, where the set of input weights from an instance of $g$ to this layer is shared across all instances. This is an addition to the architecture proposed in [santoro2017] and plays an essential role in enabling the implementation of Spiking RelNets on neuromorphic hardware.

The function $f$ is the readout function. It is implemented as a three-layer feed-forward LIF network followed by a linear readout (see the section below) and a softmax layer. For each unique word present in the bAbI dataset, $f$ outputs the probability of that word being the answer to the question. This probability is used to compute a cross-entropy loss that is used to train the network via gradient back-propagation.

The LIF neurons used in $g$, $\sigma$, and $f$ do not use AHP currents. For more detailed parameters pertaining to the layers, see the Supplement.

The linear readout

The design of the linear readout is crucial to the performance of the relational network. The linear readout consists of a network of specialized readout neurons, with one neuron for each word in the database of words used in the bAbI task.

The readout neuron is a variant of the LIF neuron without AHP currents, with no voltage leak, no spiking (the threshold is effectively infinite), and a PSC that decays with a readout time constant. This corresponds to a neuron which does not spike, but which performs a (non-leaky) integration of its PSC to calculate the membrane potential. However, we chose to enable the integration of the PSC into the voltage only during a short window prior to the final step. The value of the membrane voltage at the final step is scaled by a fixed scalar and forms the input to the softmax (see Fig. 4 E). This design incentivizes the spike activity of the final layers to occur in a confined time window close to the final time step, while allowing the precise timing of the spikes to influence the final output, leading to a high information capacity in a short time window.
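A sketch of the readout neuron described above: a non-spiking, non-leaky integrator of its PSC that only accumulates during a final window, whose last value (scaled) feeds a softmax over words. The window length follows the 10 ms marked in Fig. 4 E; the scale factor is a placeholder.

import numpy as np

def linear_readout(psc, window=10, scale=0.1):
    """psc: array of shape (time, words), the readout neurons' input PSC over the whole computation.
    Each readout neuron integrates its PSC without leak, but only during the last `window` steps;
    the membrane value at the final step is scaled and passed through a softmax."""
    v = psc[-window:].sum(axis=0)      # non-leaky integration restricted to the final window
    logits = scale * v
    e = np.exp(logits - logits.max())
    return e / e.sum()                 # probability of each word being the answer

psc = np.random.rand(150, 20)          # placeholder PSC trace for a 20-word dictionary
probs = linear_readout(psc)
print(probs.argmax(), round(float(probs.sum()), 3))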

Training the relational network

The Spiking RelNet requires many time steps of compute time compared to the non-spiking RelNet, making the loss computation and gradient back-propagation through time many times more expensive in the spiking case. Simply training the network end-to-end with the cross-entropy loss requires an impractically long time for the network to converge, and also leads to pathological spike rates and low performance. The solutions to these issues are:

In order to speed up convergence, we first train a non-spiking relational network to solve the bAbI tasks, where LSTMs are used to embed the sentences and questions. We then train the LSNNs to reproduce the outputs of the LSTMs for the various input sentences in the dataset. The weights of these pre-trained LSNNs are fixed, and they are used to perform the embedding while we train the relational function $g$ and the readout function $f$. This helps the network converge in far fewer training epochs than the end-to-end trained non-spiking relational network, and makes the training feasible for a Spiking RelNet.

The emergence of pathological spike rates and membrane voltage values is addressed by the spike rate regularization and membrane voltage regularization described above. We use a more aggressive regularization for the spike rates in the instances of the relational function $g$, where the regularization forces the total spike rate summed over all instances of $g$ towards a low target rate. This minimizes the number of spikes transmitted to the aggregation layer $\sigma$, thus reducing the cross-chip transfer of spikes. It also forces the network to only generate spikes corresponding to those sentence pairs that are relevant to the question. The resulting low spike rate, seen in Fig. 3 B, leads to a very power- and delay-efficient implementation of feed-forward spiking networks on neuromorphic hardware.

Placement of the Spiking RelNet onto Loihi

The Loihi Nahuku board consists of 32 interconnected Loihi chips, each of which contains 128 neuro-cores. The neuro-core is the fundamental computational unit that computes the dynamics of the LIF neurons with and without AHP currents. Loihi allows one to connect any neuro-core on any chip to any other neuro-core on any other chip, thus enabling large networks to be placed on the board. However, due to hardware limitations, the number of connections and the connectivity are constrained as described below. Additionally, transporting a large number of spikes across different chips incurs significant latency. We discuss here the strategies used to place the Spiking RelNet within these constraints.

The LSNN that solves the sMNIST task contains 240 neurons with 20% of the recurrent connections enabled. This network is small enough to fit on a single chip and occupy only one neuro-core. The Spiking RelNet is a much larger network. Considering a maximum of 20 sentences in a story, the Spiking RelNet has 20 instances of the LSNNs that embed sentences, plus one for the question. Additionally, there exists an instance of the relational function $g$ for each pair of sentences, making the number of instances quadratic in the story length. Each of these instances is implemented as a separate network on Loihi, leading to a total network size of 238604 neurons. The placement of this network needs to take into consideration many constraints regarding connectivity, memory, and the latency of spike transport. The associated challenges and solutions are outlined in this section.

Synaptic memory limit

Each neuro-core has a limited amount of SRAM memory for storing synaptic parameters. This limits the number of incoming synapses to a particular neuro-core. The precise number depends on the synaptic parameters, and we have found an empirical limit of around 40000 synapses per neuro-core. Except for the aggregation layer, all layers in the network have dense input and recurrent synaptic connections. Thus each layer needs to be spread over multiple neuro-cores in order to store its input and recurrent connections.
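As a rough illustration of how the synaptic memory limit dictates the number of neuro-cores per layer, using the approximate 40000-synapse figure quoted above and the 128-neurons-per-core cap mentioned earlier (the real limit depends on the synaptic parameters):

import math

def cores_needed(n_neurons, fan_in, synapse_limit=40_000, neurons_per_core_cap=128):
    """Estimate how many neuro-cores a densely connected layer needs.
    Each neuron stores `fan_in` incoming synapses; a core holds at most ~synapse_limit synapses
    and (for this network) at most neurons_per_core_cap neurons."""
    neurons_per_core = min(neurons_per_core_cap, max(1, synapse_limit // fan_in))
    return math.ceil(n_neurons / neurons_per_core)

# Hypothetical layer: 256 neurons, each receiving 1000 incoming synapses.
print(cores_needed(256, fan_in=1000))   # 40 synapse-limited neurons per core -> 7 cores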

Fanout limits – LSNN relay layer

The total number of neuro-cores to which the neurons of a given neuro-core can connect is limited to 2048 (4096 for intra-chip connections). This plays a role when connecting the LSNNs to the large number of instances of the relational function $g$. One could split the neurons across multiple neuro-cores to reduce the number of output connections per neuro-core, but splitting a recurrent LSNN network across too many neuro-cores increases latency. Instead, we use relay layers. A relay layer, as the name suggests, simply reproduces the spiking activity of the layer that forms its input. Each LSNN is thus connected to multiple relays, which then each fan out to a smaller number of instances of $g$.

Limits pertaining to fanin – The aggregation layer

For any neuro-core, Loihi limits the number of neurons that can be connected to it (its fanin). Unlike the two constraints above, this constraint on the fanin of a neuro-core introduces a fundamental restriction on the network architectures that can be implemented on Loihi.

The layer that receives the output from the instances of the relational function $g$ receives input from all of these instances. For this layer not to violate the fanin constraint, the connections from the outputs of $g$ to this layer must be sparse. Thus, we introduce an aggregation layer to which each instance of $g$ is connected in a sparse one-to-one manner, with weights shared across instances. The sparse connectivity enables the aggregation layer to be implemented within the fanin constraints.

For a more detailed treatment of the constraints, as well as the number of neuro-cores required to place each layer, see Supplement.

Optimizing network placement to minimize congestion in cross chip spike transport

When the LSNNs, relays, and relational networks are placed taking only the connectivity constraints into consideration, we notice significant delays caused by transporting spikes from the LSNN and relay networks to the instances of $g$. This is owing to the large number of spikes that need to be transferred across different Loihi chips. Thus, we additionally need to optimize the placement of the instances of $g$ and the relay networks in a manner that minimizes cross-chip spike transport. We break down this general objective into the following constraints.

  • All relay networks must be connected only to relational function instances that are placed on the same Loihi chip. We thus choose to place the initial layer of the relational function instances on the same chip as the relay networks that give them their input.

  • We aim to minimize the number of relay networks required. Each chip has a limit of 128 neuro-cores and thus a limit on the number of instances of $g$ that can be placed on it. This means that, for each chip, we must choose the set of $g$ instances in such a manner that the number of distinct sentences needed as input is minimized.

  • For each instance, all layers after the first one are to be placed on the same chip.

The layout that we arrive at with the above principles is described in the Supplement. The resulting improvement in delay and the corresponding energy-delay product are shown in Fig. 5 E.
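To illustrate the kind of grouping implied by the constraints above, here is a simple greedy heuristic that packs relational-function instances (sentence pairs) onto chips while trying to keep the number of distinct sentences needed per chip small. It is only a sketch of the idea, not the placement algorithm actually used; the per-chip capacity and the use of unordered pairs are placeholders.

from itertools import combinations

def greedy_chip_assignment(n_sentences, pairs_per_chip=12):
    """Greedy sketch: place each sentence pair (one instance of the relational function) on the chip
    that already holds the most of its two sentences, opening a new chip when all existing ones are full.
    The goal is to keep the number of distinct sentences (and hence relay cores) per chip small."""
    chips = []                                   # each entry: {"pairs": [...], "sentences": set()}
    for pair in combinations(range(n_sentences), 2):
        candidates = [c for c in chips if len(c["pairs"]) < pairs_per_chip]
        if candidates:
            chip = max(candidates, key=lambda c: len(set(pair) & c["sentences"]))
        else:
            chip = {"pairs": [], "sentences": set()}
            chips.append(chip)
        chip["pairs"].append(pair)
        chip["sentences"].update(pair)
    return chips

chips = greedy_chip_assignment(n_sentences=20)
print(len(chips), "chips; average distinct sentences per chip:",
      round(sum(len(c["sentences"]) for c in chips) / len(chips), 1))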

Data availability

The MNIST dataset [lecun2010mnist] is freely available at http://yann.lecun.com/exdb/mnist/. The bAbI dataset [weston2015] is freely available at https://research.fb.com/downloads/babi/.

Code availability

References

Acknowledgements

This research/project was supported by the Human Brain Project (Grant Agreement number 785907 and 945539) of the European Union and a grant from Intel. Special thanks go to Guillaume Bellec and Darjan Salaj for their insightful comments and ideas when carrying out this work.

Author contributions statement

A.R., P.P. and W.M. contributed to the design and planning of the experiments. A.R. and P.P. carried out the experiments. A.R., P.P., A.W. and W.M. participated in the analysis of the experimental data. A.R., P.P., A.W. and W.M. wrote the manuscript.

Competing interests

The authors declare competing interests as follows. P.P. and A.W. are currently employed by Intel Labs, developers of the Loihi neuromorphic system. W.M. and A.R. are members of the Intel Neuromorphic Research Community and W.M. has received research funding from Intel for related work.

Additional information

Supplementary information is available.