I Introduction
Neurophysiological data suggest that brain networks are sparsely connected, highly dynamic and noisy [1, 2]. A single neuron is only connected to a fraction of potential postsynaptic partners, and this sparse connectivity changes even in the adult brain on the timescale of hours to days [3, 4]. The dynamics that underlie the process of synaptic rewiring were found to be dominated by noise [5]. It has further been suggested that the permanently ongoing dynamics of synapses lead to a random walk that is well described by a stochastic drift-diffusion process, which gives rise to a stationary distribution over synaptic strengths. Therefore, synapses are permanently changing and randomly rewiring while the overall statistics of the connectivity remain stable [6, 7, 8, 9]. Theoretical considerations suggest that the brain does not suppress these noise sources, since they can be exploited as a computational resource to drive exploration of parameter spaces, and several models have been proposed to capture this feature of brain circuits (see [10] and [11] for reviews). The synaptic sampling model proposed in [12, 13] employs this approach for rewiring and synaptic plasticity. The noisy learning rules drive a sampling process which mimics the drift-diffusion dynamics of synapses in the brain. Although the network is permanently rewired, this process provably leads to a stationary distribution of the connectivity. This distribution over the network connectivity can be shaped by reward signals to incorporate reinforcement learning, and can be constrained to enforce sparsity [14]. The synaptic sampling model reproduces a number of experimental observations, such as the dynamics of synaptic decay under stimulus deprivation or the long-tailed distribution over synaptic weights [12, 14].
Furthermore, when equipped with standard error backpropagation, this method was found to perform on a par with classical fully connected machine learning networks at a fraction of the memory requirement [15].

However, the gain in efficiency of biology-inspired algorithms such as synaptic sampling can often not be fully realized on either dedicated neuromorphic hardware or standard digital compute hardware, since these models require complex operations such as random number generation or exponential functions. The former hardware usually has only narrowly configurable plasticity functions, unsuitable for this kind of exploration [16, 17, 18, 19]. Thus, synaptic weights that undergo complex plasticity functions are usually precomputed in software and then run statically on mixed-signal [20, 21] or digital neuromorphic hardware [22]. On the other hand, standard digital compute hardware is in principle flexible enough, but the functions required by the plasticity models are expensive to compute on standard hardware, which significantly narrows the gain in efficiency. Despite recent efforts to simulate spiking neural networks on GPUs [23], there is, to the best of our knowledge, no hardware support available in GPUs for random number generation, especially true random number generation, or for the exponential function. A common workaround on digital hardware is to store massive amounts of random numbers and lookup tables for the exponential function before the simulation starts [24]. This reduces computation time at the cost of increasing the requirements on the already limited memory of embedded applications. The 2nd generation SpiNNaker system strives to break this trade-off between computation time and memory by employing dedicated hardware components for these time- and energy-consuming operations. Standard ARM processors are augmented with hardware accelerators for random numbers [25] and exponential functions [26]. We show that this allows us to implement complex learning algorithms in a compact, power-efficient package. In addition, by fitting the model into the local SRAM, DRAM can be switched off, further reducing the power consumption. This potentially offers a new compute substrate, especially for mobile and biomedical applications such as neural implants, which are strictly limited by the power budget, computation speed and memory capacity of the silicon chip on which they are executed.

In this article we present the main features of the prototype chip of the 2nd generation SpiNNaker system in detail and showcase the benefits of the architecture for experiments on reward-based synaptic sampling [14]. We show that the architecture allows us to exploit the advantages of the synaptic sampling algorithm. The model is implemented efficiently thanks to the hardware accelerators, the software optimizations and the floating point unit available in the ARM M4F. We show a speedup of more than 2 due to the use of hardware accelerators. Our hardware-software system optimizes the implementation of reward-based synaptic sampling with respect to memory footprint, computation, and power and energy consumption.
We built a scalable distributed real-time online learning system and demonstrate its usability in a closed-loop reinforcement learning task. Furthermore, we study a modified rewiring scheme called random reallocation that recycles the memory of synapses by immediately reconnecting them to a new postsynaptic target. We show that this more efficient version of synaptic sampling also leads to faster learning.
In Section II we give an overview of the prototype chip, focusing on the random number generator and the exponential function accelerator. Section III describes the reward-based synaptic sampling model implemented in this work. Section IV presents the software implementation, and experimental results are presented in Section V.
II Hardware
II-A System Overview
SpiNNaker [27] is a digital neuromorphic hardware system based on low-power ARM processors, built for the real-time simulation of spiking neural networks (SNNs). On the basis of the first-generation SpiNNaker architecture and our previous work on power-efficient multiprocessor systems on chip [28, 29], the second-generation SpiNNaker system (SpiNNaker 2) is currently being developed in the Human Brain Project [30]. By employing a state-of-the-art CMOS technology and advanced features such as per-core power management, more processors can be integrated per chip at significantly increased energy efficiency. In this article we use the first SpiNNaker 2 prototype chip, whose architecture is shown in Fig. 1. Table I provides a brief summary of the new hardware features relevant for this work, in contrast to the first-generation SpiNNaker [31] system. Furthermore, the table includes an outlook on the final SpiNNaker 2 chip (tape-out 2020).
                      SpiNNaker 1   SpiNNaker 2 Prototype     SpiNNaker 2
                                    (used in this work)       (current plan, cf. [32])
Microarchitecture     ARMv5TE       ARMv7-M                   ARMv7-M
Max. Clock Frequency  200 MHz       500 MHz                   500 MHz
Floating Point        —             single precision          single precision
HW Accelerators       —             EXP, PRNG, TRNG           EXP, LOG, PRNG, TRNG
Technology node       130 nm        28 nm                     22 nm
ARM cores / chip      18            4                         144
The processing element (PE) is based on an ARM M4F processor core with 128 KB local SRAM, an exponential function accelerator [26], neuromorphic power management [33] and a hardware pseudo random number generator (PRNG). The SpiNNaker router [34] handles on-chip and off-chip spike communication. Furthermore, the chip provides a dedicated true random number generator (TRNG). The various components are interconnected via a Network-on-Chip (NoC). The chip has been fabricated in 28 nm SLP CMOS technology by Globalfoundries (Fig. 2).
II-B Random Number Generator
The hardware PRNG is a specific implementation of Marsaglia’s KISS [35] random number generator. The generated sequence depends only on the initial seed. The provided 32-bit integer values are uniformly distributed and accessible within a delay of one clock cycle. An equivalent software implementation is considerably slower.¹

¹ All clock cycle numbers in this paper are measured on the ARM core of the prototype chip. The model in this work uses uniformly distributed floating-point numbers in the range from 0 to 1. The conversion to floating point and the range scaling add further clock cycles, resulting in 42 clock cycles in total for the software implementation.

The main advantage of a PRNG over a TRNG is reproducibility, which simplifies debugging. However, due to the properties of a PRNG, not all effects of the randomness might be seen, since the entropy of the sequence is reduced to the seed of the generator. In order to make it possible to run an experiment with different random inputs and higher entropy, the prototype offers the possibility to scramble the seed of the PRNG with a value generated by the TRNG. From a software point of view, only the initial configuration differs and no further changes to the code are necessary. The entropy source of the TRNG is the jitter of the different clock generators of the chip [36]. In conventional clock generators, this unwanted noise would be cancelled by the control loop [37]. However, in this case the noise provides us with an entropy source at minimal cost in terms of power and area, since the clock generators have to run anyway, for the PE itself as well as for the SpiNNaker links. The principle is described in detail in [25] and has been submitted as a patent [38]. The entropy of each single clock generator is combined into a true random bus, which is sampled by the PRNG in order to realize the scrambling.
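To illustrate the kind of generator involved, the following sketch implements Marsaglia's KISS algorithm in its well-known 1999 formulation, together with the conversion to a uniform floating-point number in [0, 1) used by the model. The exact variant, seed values and word combination used in the hardware may differ; this only shows the principle.

```python
# Sketch of Marsaglia's KISS PRNG (KISS99-style variant) with conversion
# to a uniform float in [0, 1). Illustrative, not the exact chip circuit.
MASK32 = 0xFFFFFFFF

class Kiss:
    def __init__(self, seed=(12345, 65435, 34221, 12345)):
        # Four words of state; the generated sequence depends only on the seed.
        self.z, self.w, self.jsr, self.jcong = seed

    def next_u32(self):
        # Two 16-bit multiply-with-carry generators ...
        self.z = (36969 * (self.z & 0xFFFF) + (self.z >> 16)) & MASK32
        self.w = (18000 * (self.w & 0xFFFF) + (self.w >> 16)) & MASK32
        mwc = ((self.z << 16) + self.w) & MASK32
        # ... a 3-shift xorshift register ...
        self.jsr ^= (self.jsr << 17) & MASK32
        self.jsr ^= self.jsr >> 13
        self.jsr ^= (self.jsr << 5) & MASK32
        # ... and a linear congruential generator, combined into one output.
        self.jcong = (69069 * self.jcong + 1234567) & MASK32
        return ((mwc ^ self.jcong) + self.jsr) & MASK32

    def next_float(self):
        # Range scaling to [0, 1), as needed by the synaptic sampling model.
        return self.next_u32() / 2.0**32

rng = Kiss()
samples = [rng.next_float() for _ in range(10000)]
```

Two instances seeded identically produce identical sequences, which is exactly the reproducibility property discussed above; scrambling the seed with the TRNG trades this reproducibility for higher entropy.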
II-C Exponential Function Accelerator
The exponential function accelerator calculates the exponential function for the signed fixed-point s16.15 data type. In the implementation, the operand x is divided into three parts:

    x = x_int + x_up + x_low ,    (1)

where x_int is the integer part, and x_up and x_low are the upper and lower fractional parts, respectively, so that exp(x) = exp(x_int) · exp(x_up) · exp(x_low). The factors exp(x_int) and exp(x_up) are calculated with two separate lookup tables (LUTs), and exp(x_low) is approximated by a polynomial. The split into two separate LUTs considerably reduces the memory size, and thus the silicon area, compared to one combined LUT, by taking advantage of the properties of the exponential function. The split of the evaluation of the fractional part into a LUT and a polynomial reduces the computational complexity of the polynomial with minimal memory overhead. The overall implementation achieves single-LSB precision in the employed fixed-point format [26]. The exponential accelerator is included in each PE and makes up approximately 2% of the silicon area of each PE. The lookup and the polynomial calculation are parallelized, resulting in a latency of four clock cycles for each exponential function. Writing the operand to the accelerator and reading the result from it via the AHB bus adds another two clock cycles, resulting in 6 clock cycles in total. In pipelined operation the processor writes one operand in one clock cycle and reads the result of a previous exponential function in another clock cycle, resulting in two clock cycles per exponential function [26].
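The split evaluation of Eq. (1) can be sketched in software as follows. The choice of 8 upper and 7 lower fractional bits and the second-order polynomial are illustrative assumptions, not the exact contents of the accelerator; the small residual x_low < 2⁻⁸ is what makes a short polynomial sufficient.

```python
import math

FRAC_BITS = 15          # s16.15 fixed point: 15 fractional bits
UPPER_BITS = 8          # upper fractional bits served by a LUT (illustrative)
LOWER_BITS = FRAC_BITS - UPPER_BITS

# LUT for the integer part exp(n); s16.15 covers integer parts in [-16, 15].
INT_LUT = {n: math.exp(n) for n in range(-16, 16)}
# LUT for the upper fractional part exp(u / 2^UPPER_BITS).
UPPER_LUT = [math.exp(u / 2.0**UPPER_BITS) for u in range(2**UPPER_BITS)]

def exp_s16_15(x_fixed):
    """Approximate exp() of an s16.15 fixed-point operand, returned as float."""
    x_int = x_fixed >> FRAC_BITS                    # integer part (arithmetic shift)
    frac = x_fixed & ((1 << FRAC_BITS) - 1)         # fractional part
    upper = frac >> LOWER_BITS                      # upper fractional bits -> LUT
    lower = (frac & ((1 << LOWER_BITS) - 1)) / 2.0**FRAC_BITS  # residual < 2^-8
    # exp(x) = exp(x_int) * exp(x_up) * exp(x_low); the residual is tiny,
    # so a short polynomial suffices for exp(x_low).
    poly = 1.0 + lower + 0.5 * lower * lower
    return INT_LUT[x_int] * UPPER_LUT[upper] * poly

x = int(round(1.5 * 2**FRAC_BITS))   # encode 1.5 as s16.15
```

With this split, the combined table size is 32 + 256 entries instead of the 2³¹ entries a single LUT over the full operand would need, which is the area saving described above.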
III Spiking Network Model
To demonstrate the performance gain of the SpiNNaker 2 hardware for simulations of spiking neural networks, we implemented the synaptic sampling model introduced in [14]. In this section we briefly review this model for stochastic synaptic plasticity and rewiring. The model combines insights from experimental results on synaptic rewiring in the brain with a model for online reward maximization through policy gradient (see Section III-C for details). The network has a large number of potential synaptic connections, only a fraction of which is functional at any moment in time, whereas most others are non-functional (disconnected). The network connectivity is permanently modified through rewiring. Synaptic weight changes and rewiring are guided by stochastic learning rules that probe different network configurations. Hence, unlike the deterministic learning rules usually considered, which converge to some (local) optimum of the parameters, learning in our synaptic sampling framework converges to a target distribution p*(θ) over the synaptic parameters θ. The learning rules are designed in such a way that the maxima of this distribution coincide with maxima of the expected reward. We first summarize the general synaptic sampling framework in Sections III-A and III-B and then provide additional details on its application to reinforcement learning in Section III-C. All parameter values are summarized in Table II. In Section III-D we discuss random reallocation of synapses, a modified rewiring scheme that is more memory efficient.

III-A Synapse Model
In our model for synaptic rewiring we consider a neural network scaffold with a large number of potential synaptic connections between neurons. For each functional synaptic connection i, we introduce a real-valued parameter θ_i that determines the strength w_i of the connection through the exponential mapping

    w_i(t) = exp(θ_i(t) − θ_0) ,    (2)

with a positive offset parameter θ_0 that scales the minimum strength of functional synaptic connections. The mapping in Eq. (2) accounts for the experimentally found multiplicative synaptic dynamics in the cortex (cf. [39, 7, 8]; see [14] for details). For simplicity we assume that only excitatory connections (with w_i > 0) are plastic, but the model can be easily generalized to inhibitory synapses.
The functional goal of network learning is determined by the dynamics of the synaptic parameters θ. It was shown in [14] that for a target distribution p*(θ) over synaptic parameters, with partial derivative ∂/∂θ_i log p*(θ) of the log-distribution with respect to parameter θ_i evaluated at time t, the stochastic drift-diffusion processes

    dθ_i = β (∂/∂θ_i) log p*(θ) dt + √(2Tβ) dW_i    (3)

give rise to a stationary distribution over θ that is proportional to p*(θ)^{1/T}. In Eq. (3), β plays the role of a learning rate and the dW_i are stochastic increments and decrements of independent Wiener processes, which are scaled by the temperature parameter T.
This result suggests that a rule for reward-based synaptic plasticity should be designed in such a way that p*(θ) has most of its mass on highly rewarded parameter vectors θ. We use target distributions of the form p*(θ) ∝ p_S(θ) × V(θ), where ∝ denotes proportionality up to a positive normalizing constant. The prior p_S(θ) can encode structural priors of the network scaffold, e.g. to enforce sparsity; this happens when p_S(θ) has most of its mass near zero. In our experiments we used a Gaussian distribution with mean μ and variance σ² for the prior p_S(θ). The function V(θ) denotes the expected discounted reward associated with a given parameter vector θ. In Section III-C we will discuss in detail how the term (∂/∂θ_i) log V(θ) can be computed using reward-modulated plasticity rules.
Synaptic rewiring is included in this model by interpreting each synapse for which θ_i ≤ 0 as disconnected. To reconnect synapses we tested two approaches. In the first approach we continue to simulate the dynamics under the prior distribution, i.e. a process of the form (3) with p*(θ) replaced by p_S(θ), until the synapse reconnects (θ_i > 0). This is the algorithm that was proposed in [14]. In Section III-D we introduce another approach for rewiring, called random reallocation of synapses, that makes more effective use of memory resources. The two approaches are compared in the results below.
III-B Neuron Model
We consider a general network of stochastic spiking neurons and denote the output spike train of neuron k by z_k(t), defined as the sum of Dirac delta pulses positioned at the spike times. We denote by pre_i and post_i the index of the pre- and postsynaptic neuron of synapse i, respectively, which unambiguously specifies the connectivity in the network. Further, we define syn_k to be the index set of synapses that project to neuron k. Note that this indexing scheme allows us to include multiple (potential) synaptic connections between a given pair of neurons. In all simulations we allow multiple synapses between neuron pairs.
Network neurons were modeled by a standard stochastic variant of the spike response model [40]. We denote by w_i(t) the synaptic efficacy of the i-th synapse in the network at time t, determined by Eq. (2). The membrane potential of neuron k at time t is then given by

    u_k(t) = ∑_{i ∈ syn_k} w_i(t) y_{pre_i}(t) + b_k(t) ,    (4)

where b_k(t) denotes the slowly adapting bias potential of neuron k, and y_{pre_i}(t) denotes the trace of the (unweighted) postsynaptic potentials (PSPs) that neuron pre_i leaves at its postsynaptic synapses at time t. It is given by the presynaptic spike train filtered with a PSP kernel of the form ε(t) ∝ (e^{−t/τ_f} − e^{−t/τ_r}) Θ(t), i.e. y_{pre_i}(t) = (z_{pre_i} ∗ ε)(t), with rise and fall time constants τ_r and τ_f. Here ∗ denotes convolution and Θ(·) is the Heaviside step function, i.e. Θ(x) = 1 for x ≥ 0 and 0 otherwise.
Spike trains were generated using the following method. We used an exponential dependence between the membrane potential and the firing rate, such that the instantaneous rate f_k(t) of neuron k at time t is proportional to exp(u_k(t)). Spike events were drawn from a Poisson process with rate f_k(t). After each spike, neurons were refractory for a fixed time window.
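In discrete time with step dt, the spiking mechanism above can be sketched as follows. The rate scale and the 5 ms refractory period are illustrative values chosen for this sketch, not taken from the model's parameter table.

```python
import math, random

random.seed(1)

dt = 0.001          # 1 ms time step, as in the implementation
t_ref = 0.005       # illustrative refractory period of 5 ms
rate_scale = 1.0    # illustrative scale of the exponential rate function

def simulate(u, duration):
    """Simulate a neuron with constant membrane potential u; return spike count."""
    refractory = 0.0
    spikes = 0
    for _ in range(int(duration / dt)):
        if refractory > 0.0:
            refractory -= dt            # no spiking while refractory
            continue
        rate = rate_scale * math.exp(u)  # exponential rate function of u
        if random.random() < rate * dt:  # Poisson spiking in this time step
            spikes += 1
            refractory = t_ref
    return spikes

# With u = log(5) the nominal rate is 5 Hz, matching the target output rate.
count = simulate(math.log(5.0), duration=200.0)
```

Over 200 s the measured rate settles slightly below 5 Hz because the refractory window removes a small fraction of the available time.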
The bias potential b_k(t) in Eq. (4) implements a slow rate adaptation mechanism, which was updated according to

    d b_k(t)/dt = (1/τ_b) (ν_0 − z_k(t)) ,    (5)

where τ_b is the time constant of the adaptation mechanism and ν_0 is the desired output rate of the neuron. In our simulations, the bias potential was initialized at 3 and then followed the dynamics given in Eq. (5) (see [14] for details).
III-C Reward-Based Synaptic Sampling
In a reward-based learning framework we assume that the network is exposed to a real-valued scalar function r(t) that denotes the reward at any moment in time in response to the network behavior. The value function V(θ) is the expectation of r(t) over all possible network states while discounting future rewards, i.e. V(θ) = E[ ∫₀^∞ e^{−t/τ_d} r(t) dt ], where τ_d is the discounting time constant and E denotes the expectation over all possible network responses.
The gradient (∂/∂θ_i) log V(θ) can be estimated for the network model outlined above using standard reward-modulated learning rules with an eligibility trace (see [14] for details):

    d e_i(t)/dt = −(1/τ_e) e_i(t) + w_i(t) y_{pre_i}(t) (z_{post_i}(t) − f_{post_i}(t)) ,    (6)

where τ_e is the time constant of the eligibility trace. Recall that pre_i denotes the index of the presynaptic neuron and post_i the index of the postsynaptic neuron of synapse i. In Eq. (6), z_{post_i}(t) denotes the postsynaptic spike train, f_{post_i}(t) denotes the instantaneous firing rate of the postsynaptic neuron and y_{pre_i}(t) denotes the postsynaptic potential under synapse i.
This eligibility trace of Eq. (6) is multiplied by the reward and integrated in each synapse using a second dynamic variable g_i:

    d g_i(t)/dt = −(1/τ_g) g_i(t) + (r(t)/r̂(t) + c) e_i(t) ,    (7)

where r̂(t) is a low-pass filtered version of r(t) with time constant τ_g. The variable g_i(t) combines the eligibility trace and the reward in a temporal average. c is a constant offset on the reward signal. This parameter can be set to an arbitrary value without changing the stationary dynamics of the model [14]. In our simulations, this offset was chosen slightly above zero (c = 0.02) such that small parameter changes were also present without any reward. The variable g_i(t) realizes an online estimator for (∂/∂θ_i) log V(θ) [14].
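Per synapse and time step, the updates of Eqs. (6) and (7) amount to a few multiply-accumulate operations plus the trace decays. A sketch of one such update is given below; the first-order exponential-Euler decay, the factor names and the normalized reward argument are illustrative assumptions of this sketch.

```python
import math

dt = 0.001        # 1 ms time step
tau_e = 1.0       # eligibility trace time constant (Table II)
tau_g = 50.0      # gradient estimator time constant (Table II)
c = 0.02          # reward offset (Table II)

e_decay = math.exp(-dt / tau_e)   # exact decay factor per time step
g_decay = math.exp(-dt / tau_g)

def plasticity_step(e, g, w, y_pre, z_post, f_post, r_norm):
    """One per-synapse update of eligibility trace e and gradient estimate g.

    y_pre: presynaptic PSP trace; z_post: postsynaptic spike indicator
    (1/dt on a spike, else 0); f_post: instantaneous postsynaptic rate;
    r_norm: normalized reward signal. Exact factors are illustrative.
    """
    e = e * e_decay + dt * w * y_pre * (z_post - f_post)
    g = g * g_decay + dt * (r_norm + c) * e
    return e, g

# With no pre/post activity and no reward, the eligibility trace just decays:
e, g = 1.0, 0.0
for _ in range(1000):                 # 1 s of simulated time
    e, g = plasticity_step(e, g, w=0.5, y_pre=0.0, z_post=0.0,
                           f_post=0.0, r_norm=0.0)
```

This inner loop is the code path whose random number and exponential function costs are analyzed in Sections IV and V.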
symbol     value    description
τ_r        2 ms     time constant of EPSP kernel (rising edge)
τ_f        20 ms    time constant of EPSP kernel (falling edge)
τ_e        1 s      time constant of eligibility trace
τ_b, τ_g   50 s     time constants for Eq. (5) and Eq. (7)
ν_0        5 Hz     desired output rate
t_ref               refractory time
T          0.1      temperature
c          0.02     offset to reward signals
β                   learning rate
μ          0        mean of prior
σ          2        std of prior
III-D Random Reallocation of Synapse Memory
In the original synaptic sampling model outlined above, whenever a synapse is disconnected (when θ_i ≤ 0), it undergoes a random walk according to Eq. (3) until θ_i again becomes larger than zero and the synapse reappears. The dynamics of disconnected synapses are independent of the network activity and are therefore not influenced by the pre- and postsynaptic spike trains, since the eligibility trace of Eq. (6) vanishes. Nevertheless, these synapses need to be updated even though they are not used, which wastes memory and CPU time. In a typical simulation of synaptic sampling, where the majority of synapses is non-functional most of the time, this overhead may even dominate the simulation. Here, we discuss a more efficient approach for synaptic rewiring called random reallocation of synapse memory.
It has been previously noted that the synaptic sampling dynamics can be replaced by a more efficient approach for online rewiring of neural networks [15]. The theoretical analysis there has shown that the original synaptic sampling formulation, with convergence to a stationary distribution, can be combined with a hard constraint on the network connectivity such that at any moment in time a fixed number of connections is functional. In this modified version of network rewiring, whenever a connection becomes non-functional, another synapse is randomly reintroduced to keep the total number of synapses constant. Thus, non-functional synapses do not need to be simulated and therefore do not waste memory or CPU time. It has been shown that this more efficient rewiring approach also leads to a stationary distribution of network configurations, namely the original posterior confined to the manifold of the parameter space that fulfills the constraint (see [15] for details). This rewiring strategy has already been successfully applied to deep learning [15] and implemented on the SpiNNaker 2 prototype chip [41].

Here, we used a rewiring approach similar to the one in [15]. However, an additional limitation on the rewiring scheme comes from the memory model of the software framework. In our implementation, each neuron maintains a table of its postsynaptic targets (see Section IV-C for details). Therefore, the free space of synapses that become disconnected can most efficiently be reassigned to another postsynaptic target of the same presynaptic neuron. Consequently, we decided to use a connectivity constraint that ensures that the fan-out of each neuron is constant throughout the simulation. This is simply achieved by immediately reconnecting each synapse that becomes non-functional to a new, randomly chosen postsynaptic target. Since drawing random numbers is efficient thanks to the hardware random number generator (Section II-B), this approach has little computational overhead.
Our results from the prototype chip presented in Section V-C suggest that random reallocation increases the effective usage of the hardware and the number of active synapses in the network, and also accelerates the exploration of the parameter space, leading to faster convergence to the stationary distribution. Interestingly, the connectivity constraint used here is somewhat similar to analog neuromorphic systems, which contain synaptic matrices fixedly assigned to postsynaptic neurons, with only the presynaptic sources flexible to some degree [42]. Rewiring in such a setup has to operate ‘postsynaptic-centric’ and, similar to our approach, has a fixed number of synapses per postsynaptic neuron [43].
IV Implementation of Synaptic Sampling on the SpiNNaker 2 Prototype
The software implementation of this model is optimized with respect to computation time, memory, power consumption and scalability, in order to bridge the gap between state-of-the-art biologically plausible neural models and efficient execution of the model in hardware. This is explained in more detail in the following.
IV-A Numerical Optimizations
Reducing computation time with hardware-generated uniform random numbers
The synaptic sampling model draws one random number for each synapse in each simulation time step (1 ms). Since thousands of synapses are simulated on each core, random number generation could dominate the computation time. As described in Section III, the Wiener process requires Gaussian random numbers. However, as described in Section II-B, the accelerator only generates uniform random numbers. As shown in Table III, the generation of a Gaussian random number with the Box-Muller transform [44] in software requires 172 clock cycles. One option would be to convert the hardware-generated uniform random number into a Gaussian random number with the inverse CDF method [45] and a lookup table, which reduces the computation time to 21 clock cycles. However, analytical and numerical studies have found that for the simulation of a Wiener process, Gaussian random numbers can be replaced by uniform random numbers without affecting model performance [46]. The generation of a uniform random number in software with the Marsaglia RNG [35, 47] requires 42 clock cycles, whereas with the hardware it takes only 5 clock cycles, including fetching the integer random number from the accelerator and converting it to floating point in the range of 0 to 1.
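The variance-matching argument behind this replacement can be sketched numerically: a zero-mean uniform increment scaled to the same variance as the Gaussian increment drives the same diffusion, and sums of many such increments become Gaussian by the central limit theorem. The scaling below is the standard one for this argument, not code taken from the chip software.

```python
import math, random

random.seed(3)

sigma = 1.0                           # target standard deviation per increment
half_width = math.sqrt(3.0) * sigma   # uniform on [-a, a] has variance a^2 / 3

def uniform_increment():
    # One uniform random number, rescaled to zero mean and variance sigma^2.
    return (2.0 * random.random() - 1.0) * half_width

# Accumulated over many time steps, the trajectory statistics match a Wiener
# process driven by Gaussian increments of the same variance.
n = 100000
increments = [uniform_increment() for _ in range(n)]
mean = sum(increments) / n
var = sum(x * x for x in increments) / n
```

Only the first two moments of the increment matter for the diffusion limit, which is why the cheaper uniform numbers suffice here.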
Reducing computation time with the exponential function accelerator
In the synapse model, the parameter θ_i of each synapse accumulates small changes in each time step. The exponential function accelerator, which calculates the exponential function within 6 clock cycles (Section II-C), uses a fixed-point data type whose precision is not sufficient for this model, because the changes of θ_i would be rounded to zero. Calculating a floating point exponential function with software libraries like Newlib takes 163 clock cycles. Since high precision is only necessary for storing the small changes of θ_i, but not for calculating intermediate variables like w_i, θ_i can be stored as floating point in memory and, whenever the exponential function is evaluated, converted to fixed point and passed to the exponential function accelerator. The result is then converted back to floating point. Simulations show that the performance of the model is not affected. This reduces the computation time to 15 cycles, with 6 cycles required by the hardware accelerator and 9 additional cycles for the conversion of the data type. For comparison, emulating the exponential accelerator in software takes 95 cycles instead of 6 [26]; with the data type conversion, this approach would take 104 cycles in software (Table III).
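The reason for keeping the synaptic parameter in floating point can be demonstrated directly: per-step increments well below the s16.15 resolution of 2⁻¹⁵ vanish if the parameter itself is rounded to fixed point after every step, but survive when it is accumulated in floating point and only converted once per exponential function call. The increment size below is illustrative.

```python
FRAC_BITS = 15
STEP = 1e-6                      # illustrative per-step change, below 2^-15

def to_fixed(x):
    """Round a float to the s16.15 fixed-point grid."""
    return int(round(x * 2**FRAC_BITS))

# Variant 1: parameter kept in s16.15 and rounded after every step.
# Each increment rounds back to zero, so the parameter never moves.
theta_fixed = 0
for _ in range(100000):
    theta_fixed = to_fixed(theta_fixed / 2**FRAC_BITS + STEP)

# Variant 2 (used here): parameter accumulated in floating point; the
# fixed-point conversion happens only when the accelerator is invoked.
theta_float = 0.0
for _ in range(100000):
    theta_float += STEP
operand = to_fixed(theta_float)   # conversion just before exp()
```

After 100 000 steps the floating-point accumulator has moved by 0.1 while the per-step-rounded fixed-point value is still zero, which is exactly the failure mode avoided by the mixed-precision scheme.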
Reducing memory footprint with a 16-bit floating point data type
In order to simulate more synapses with the limited memory, which matters when the synapse parameters are stored in SRAM (see Section IV-B), the 32-bit single precision floating point format can be replaced by the 16-bit half precision format. For each synapse i, three parameters need to be stored in memory: the eligibility trace e_i, the estimated gradient g_i and the synaptic parameter θ_i. Simulations show that converting e_i and g_i to half precision does not affect the model performance.
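Python's struct module supports the IEEE 754 half-precision format ('e'), so the resulting 8-byte per-synapse storage can be sketched as follows; the parameter values are illustrative.

```python
import struct

def to_half_bytes(x):
    """Store a value as IEEE 754 half precision (2 bytes)."""
    return struct.pack('<e', x)

def from_half_bytes(b):
    return struct.unpack('<e', b)[0]

# Per synapse: e and g in half precision (2 bytes each), theta in single
# precision (4 bytes) -> 8 bytes total, matching the synapse row layout.
e, g, theta = 0.125, -0.03125, 1.7
block = to_half_bytes(e) + to_half_bytes(g) + struct.pack('<f', theta)

e2 = from_half_bytes(block[0:2])
g2 = from_half_bytes(block[2:4])
theta2 = struct.unpack('<f', block[4:8])[0]
```

Keeping θ in single precision while halving e and g reflects the asymmetry discussed above: only θ accumulates increments small enough to need the extra mantissa bits.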
IV-B Local Computation
By avoiding external DRAM access and instead storing all parameters and state variables of the model locally in SRAM, both energy and computation time can be saved.
To read (write) data from (to) the off-chip DRAM, the core sends a read (write) request, which is first stored in a DMA (Direct Memory Access) queue in software, then sent to the DMA unit, and finally sent to the DRAM. When the read (write) process is complete, an interrupt is triggered and an interrupt handler is called, which, in the case of a read request, processes the data from DRAM. Then the next read/write request in the queue is sent to the DMA unit (Fig. 3). Since DRAM access is time-consuming, the software can let the DMA run in the background and continue with other tasks. When the read/write process is complete, the core stops the current task, handles the interrupt and then resumes the stopped task after the interrupt handler is complete. Although this saves computation time compared to waiting for the read/write process to complete, it still has the following drawbacks:
- The energy consumption of a DRAM access can be two orders of magnitude higher than that of an SRAM access [49].
- This only works if the other tasks are independent of the data being fetched.
- Managing the DMA queue and calling the interrupt handler still consumes computation time, which becomes a problem when memory is frequently accessed.
The drawback of not using external DRAM is the limited memory space available in SRAM. This is not a problem for this model, since on the one hand the required memory is reduced with the 16-bit floating point format (Section IV-A), and on the other hand, due to the complexity of the model, the number of synapses per core is limited by computation anyway, as shown in Section V-B.
IV-C Memory Model
The memory model (Fig. 4) of this work is based on the software for the first-generation SpiNNaker system [50]. The spike packet contains the ID of the presynaptic neuron. The master population table contains keys, which are presynaptic neuron IDs. Each key is 4 bytes long and is stored together with the 4-byte starting address of the synapse parameters for that presynaptic neuron. These synapse parameters are stored in a contiguous memory block called a synapse row. Each row is composed of 4-byte words. For each presynaptic neuron, the first word is the length of the plastic synapse region. In our implementation, the plastic synapse region consists of 8-byte blocks with 2 bytes for e_i, 2 bytes for g_i and 4 bytes for θ_i. After the plastic synapse region there is one word for the length of the fixed synapse region. The next word is the length of the plastic control region, which stores special parameters needed by the plasticity rules. In this work, this region is used to store the parameters of the PSP kernel of the input spikes, i.e. the decay factors corresponding to the time constants τ_r and τ_f. Since the PSP kernel of the incoming spikes is the same for all synapses of the same presynaptic neuron, the parameters of the PSP kernel are shared in order to reduce the memory footprint. After the word for the length of the plastic control region follow the parameters of the fixed synapses.
The synapse parameters should also include the index of the postsynaptic neuron. One way to implement this is to add a 4-byte word for each postsynaptic neuron in addition to the 8 bytes for e_i, g_i and θ_i, which is the case in the original SpiNNaker software framework. Alternatively, since in this network all input neurons have the same fan-out, the indexes are stored in a 2D array (Postsyn. Neuron ID in Fig. 4), where the column index stands for the presynaptic neuron ID and the entries represent the postsynaptic neuron IDs. Each entry represents a synapse and occupies one byte, supporting a maximum of 256 target neurons per core. Since multiple synapses are allowed between a pair of neurons, the ID of a postsynaptic neuron can appear multiple times in each column of the 2D array. In general, depending on the application, one of the two approaches can be chosen.
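The row layout described above can be sketched with struct packing. Field order follows the description in the text; the control parameters shown (PSP decay factors) and the helper name are illustrative.

```python
import struct

def pack_synapse_row(plastic, control_params, fixed_words):
    """Pack one synapse row for a presynaptic neuron.

    plastic: list of (e, g, theta) tuples -> 8-byte blocks
             (2 B half-precision e, 2 B half-precision g, 4 B float theta)
    control_params: shared PSP-kernel parameters, one float per 4-byte word
    fixed_words: parameters of non-plastic synapses, one per 4-byte word
    """
    row = struct.pack('<I', len(plastic))          # length of plastic region
    for e, g, theta in plastic:
        row += struct.pack('<eef', e, g, theta)    # one 8-byte plastic block
    row += struct.pack('<I', len(fixed_words))     # length of fixed region
    row += struct.pack('<I', len(control_params))  # length of control region
    for p in control_params:
        row += struct.pack('<f', p)                # shared PSP-kernel parameters
    for word in fixed_words:
        row += struct.pack('<I', word)             # fixed synapse parameters
    return row

row = pack_synapse_row(
    plastic=[(0.5, 0.0, 1.2), (0.25, 0.0, 0.8)],
    control_params=[0.9995, 0.9512],   # illustrative PSP decay factors
    fixed_words=[])
```

Sharing the two kernel parameters per row, rather than per synapse, is what keeps the per-synapse cost at 8 bytes (plus the 1-byte target ID in the separate 2D array).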
The master population table, synapse rows and postsynaptic neuron IDs are arrays generated by each core after the network configuration is specified. Each core generates its own data in a distributed way, instead of a centralized host PC generating the data for all cores. This, combined with local computation (Section IV-B), drastically reduces the time for data generation and for the transmission of data from the host PC to the chip, which can make up a significant amount of the total simulation time, especially for large systems [51, 52].
IV-D Program Flow and SpiNNaker Software Framework Integration
The SpiNNaker system employs parallel computation to run large scale neural simulations in real time. Although the prototype chip consists of only 4 cores, the software implementation of the synaptic sampling model is integrated into the SpiNNaker software framework allowing for scaling up onto larger systems. The design of the program flow is based on [50].
The timer tick signal of the ARM core is used to trigger each time step in real time. The length of a time step can be arbitrarily chosen. For this implementation, one time step is one millisecond. The timer tick signal triggers an interrupt. Then the handler of the interrupt is called and processes the incoming spikes from the last time step, which are stored in a hardware buffer in SRAM. In this step, for each incoming spike, first the starting memory address of its corresponding synapse parameters is found in the master population table, then the synaptic weights of the activated synapses in the synapse row are added to the synaptic input buffers of the target neurons.
For the network model implemented in this work (Section VB), one of the cores, the “master core”, then simulates the environment that computes the global reward signal. All cores continue with the synapse update and neuron update, which integrate the synaptic weight onto the membrane potential of the postsynaptic neuron. Next, the synaptic plasticity update is performed, as now all required information is available, i.e. incoming spikes, neuron states and global reward.
Finally, the spikes of the neurons in each core are sent to the SpiNNaker router, which then multicasts the spikes to the cores containing the corresponding postsynaptic neurons. The SpiNNaker router [34] allows for fast multicast of small packets, which is key to efficient spike communication for many-core neuromorphic systems like SpiNNaker. The distributed computation, synchronization with the timer tick and communication via the SpiNNaker router allow for scaling up the model implementation onto large systems consisting of millions of cores.
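The overall per-time-step program flow described in this subsection can be summarized in a small runnable outline. All names here are hypothetical placeholders, not SpiNNaker API calls; the trace list only records the phase ordering for illustration.

```python
# Outline of one 1 ms time step: spike processing, environment simulation
# (master core only), synapse/neuron updates, plasticity update, spike
# multicast. Phases are recorded in `trace` to make the ordering explicit.

class Core:
    def __init__(self, is_master):
        self.is_master = is_master
        self.reward = 0.0
        self.trace = []

    def on_timer_tick(self):
        self.trace.append("process_spikes")     # weights -> input buffers
        if self.is_master:
            self.reward = self.simulate_environment()
        self.trace.append("synapse_update")     # input buffers -> neurons
        self.trace.append("neuron_update")      # membrane integration
        self.trace.append("plasticity_update")  # needs spikes, states, reward
        self.trace.append("send_spikes")        # multicast via router

    def simulate_environment(self):
        self.trace.append("environment")        # computes global reward
        return 1.0                              # placeholder reward value

master = Core(is_master=True)
master.on_timer_tick()
```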
V Results
In the following we show how the hardware accelerators and numerical optimizations reduce the computation time for one plasticity update of the synaptic sampling model. Then, we implement a network model that performs reward-based synaptic sampling on the SpiNNaker 2 prototype, for which we also provide power and energy measurements.
V-A Computation Time of Plasticity Update
TABLE IV: Clock cycles for one plasticity update

                            HW Accelerator   Software Only
  Random number generation         5               42
  Exponential function            15              104
  Rest                            90               90
  Total                          110              236
  (RNG + EXP) / Total            18%              62%
As shown in Section IV-A, the generation of a uniformly distributed random number takes 5 clock cycles with the hardware accelerator and 42 clock cycles in software. The floating-point exponential function, including data type conversion, takes 15 clock cycles with the exponential accelerator, whereas the same algorithm in software takes 104 clock cycles. The rest of the plasticity update of a synapse takes 90 clock cycles. In total, the plasticity update takes 110 clock cycles with hardware accelerators, while the equivalent software-only implementation takes 236 clock cycles (Table IV). For this application, the hardware accelerators thus yield a speedup factor of about 2 in terms of clock cycles. Combined with the increase in clock frequency from 200 MHz in SpiNNaker 1 to 500 MHz in the current prototype chip, a total speedup factor of about 5 is achieved. Within the plasticity update, the share of computation time spent on random number generation and the exponential function is reduced from 62% to 18%.
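The cycle counts and speedup factors above can be checked with a few lines of arithmetic:

```python
# Reproduce the cycle counts and speedup factors from Table IV.
rng_hw, exp_hw, rest = 5, 15, 90     # cycles with hardware accelerators
rng_sw, exp_sw = 42, 104             # cycles in software

total_hw = rng_hw + exp_hw + rest    # 110 cycles with accelerators
total_sw = rng_sw + exp_sw + rest    # 236 cycles software-only

cycle_speedup = total_sw / total_hw  # ~2.15x in clock cycles alone
clock_speedup = 500e6 / 200e6        # 500 MHz vs. 200 MHz -> 2.5x
overall = cycle_speedup * clock_speedup   # ~5.4x, i.e. "about 5"

share_hw = (rng_hw + exp_hw) / total_hw   # 0.18 with accelerators
share_sw = (rng_sw + exp_sw) / total_sw   # 0.62 software-only
```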
V-B Network Description
Fig. 6 illustrates the network topology and the mapping to the prototype chip. The network consists of 200 input neurons which are all-to-all connected to 20 neurons with plastic synapses. Multiple synapses between each pair of neurons are allowed. In this implementation, 3 synapses between each pair of neurons are initialized, resulting in 200 × 20 × 3 = 12 000 plastic synapses. Two spike patterns are encoded in the spike rates of the input neurons and are sent to the hidden neurons (see Fig. 7). The 20 hidden neurons are divided into two populations (A and B). The output spikes of the hidden neurons are sent to the environment (Env), which evaluates the global reward. A high reward is obtained if input pattern 1 (2) is present and the mean firing rate of population A (B) is higher than that of population B (A). The global reward is sent back to the network and shapes the plastic synapses between the input neurons and the two populations. The goal is to let the two populations ‘know’ which spike pattern they represent and to signal this with a high firing rate when their pattern is present. In addition to the feedforward input, the hidden neurons receive lateral inhibitory synapses, initialized with fixed random weights, between each pair of hidden neurons.
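The environment's reward rule described above can be sketched as follows. The binary high/low reward values are illustrative assumptions; the text here only specifies that reward is high when the population matching the presented pattern fires at a higher mean rate.

```python
# Hypothetical sketch of the environment's reward evaluation: reward is
# high when the population matching the presented pattern (A for pattern
# 1, B for pattern 2) has the higher mean firing rate.

def compute_reward(pattern_id, rate_pop_a, rate_pop_b, high=1.0, low=0.0):
    if pattern_id == 1:
        return high if rate_pop_a > rate_pop_b else low
    else:  # pattern 2
        return high if rate_pop_b > rate_pop_a else low

# Pattern 1 present, population A dominant -> high reward.
r1 = compute_reward(1, rate_pop_a=12.0, rate_pop_b=4.0)
# Pattern 2 present, population A dominant -> low reward.
r2 = compute_reward(2, rate_pop_a=12.0, rate_pop_b=4.0)
```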
The network is mapped to the prototype chip with each core simulating 5 neurons from the two populations (see Fig. 6). The first core (the “master core”) also generates the input spikes and evaluates the reward. With 200 input neurons, each core thus holds 200 × 5 = 1 000 pairs of neurons.
The profiling results in Section V-A determine the computational constraint when assigning the number of synapses to be simulated on each core. The ARM Cortex M4F core used in this prototype chip is configured to run at 500 MHz, which means 500 000 clock cycles are available in each 1 ms time step. The computation for one time step without the plasticity update takes ca. 45 000 clock cycles on core 0 and ca. 40 000 clock cycles on the other cores. Since each plasticity update takes 110 cycles with hardware accelerators and 236 cycles without, the theoretical upper limit for the number of synapses per core is ca. 4 100 with hardware accelerators and ca. 1 900 without.
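The per-core synapse limits quoted above follow directly from the cycle budget (shown here for core 0, which has the larger overhead):

```python
# Real-time budget per core: cycles per 1 ms step minus the fixed
# non-plasticity overhead, divided by the cost of one plasticity update.
cycles_per_step = 500_000    # 500 MHz * 1 ms
overhead_core0 = 45_000      # spike processing, neuron update, etc.

budget = cycles_per_step - overhead_core0

max_syn_hw = budget // 110   # with accelerators  -> 4136, "ca. 4 100"
max_syn_sw = budget // 236   # software only      -> 1927, "ca. 1 900"
```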
In terms of memory, the prototype chip has 64 kB of Data Tightly Coupled Memory (DTCM) per core, holding all initialized data, uninitialized data, heap and stack. By checking the binary file size after compilation, the maximum number of synapses is estimated at 4 700. Thus, this model is limited by computation rather than memory (see Table V).
TABLE V: Maximum number of synapses per core

                          Memory Constraint   Real-Time Constraint
  With Accelerators             4 700                4 100
  Without Accelerators          4 700                1 900
In the implementation, 3 000 plastic synapses per core are simulated in order to ensure the stability of the software. Since 3 000 plastic synapses are simulated in each core, each pair of neurons initially has 3 plastic synapses. Note that this is only the initial configuration: due to random reallocation of synapse memory, the postsynaptic neuron of a synapse can change, so that not every pair of neurons retains exactly 3 plastic synapses.
V-C Implementation Results
The usability of the network is demonstrated in a closed-loop reinforcement learning task implemented with 4 ARM cores. The generation of input spikes and the evaluation of output spikes are also implemented on chip.
As shown in Fig. 7, the 200 input neurons send two spike patterns in random order. Each spike pattern lasts for 500 ms. Resting periods of 500 ms are inserted between two pattern presentations, during which the input neurons only send random spikes at a low firing rate, representing background noise. The numbers at the top of Fig. 7 and the shaded colored areas indicate which pattern is present. As discussed above, the 20 neurons are divided into 2 populations (A and B), each representing one of the two patterns. Neurons 1 to 10 belong to population A; neurons 11 to 20 belong to population B. In the second row of Fig. 7, the blue and green curves represent the population firing rates of A and B, respectively. The firing rates were obtained with a Gaussian filter applied to the raw spike trains. The goal of learning is to let population A fire at a higher rate when pattern 1 is present and let population B fire at a higher rate when pattern 2 is present.
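The population rate curves can be obtained by binning the raw spike trains and convolving with a Gaussian kernel. The following is a minimal sketch; the kernel width used for Fig. 7 is not stated in the text, so the sigma here is an arbitrary illustrative choice.

```python
# Smooth per-bin spike counts with a truncated (3-sigma) Gaussian kernel
# to obtain a firing-rate estimate, as done for the curves in Fig. 7.
import math

def gaussian_rate(spike_counts, sigma_bins=3.0):
    half = int(3 * sigma_bins)
    kernel = [math.exp(-0.5 * (i / sigma_bins) ** 2)
              for i in range(-half, half + 1)]
    norm = sum(kernel)
    kernel = [k / norm for k in kernel]          # normalize to sum 1
    n = len(spike_counts)
    out = []
    for t in range(n):
        acc = 0.0
        for j, k in enumerate(kernel):
            idx = t + j - half
            if 0 <= idx < n:                     # truncate at the edges
                acc += k * spike_counts[idx]
        out.append(acc)
    return out

# A single burst of 5 spikes is spread into a smooth bump; the total
# count is preserved up to edge truncation.
rates = gaussian_rate([0, 0, 5, 0, 0], sigma_bins=1.0)
```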
Fig. 8 shows the evolution of the mean reward with and without random reallocation of synapse memory (see Section III-D). The mean reward in each minute is low-pass filtered with a Gaussian kernel. Averages over 5 independent trial runs using the true random number generator are shown as solid lines; shaded areas indicate standard deviations. The reward is normalized to the theoretical maximum reachable reward. At learning onset the two populations respond randomly to the input spike patterns and the reward is low. The synaptic weights explore the parameter space through the random process guided by the global reward, as described in Section III-A. Over time, the network learns the desired input/output mapping and the reward increases. After ca. 10 minutes of training, the two populations have learned to respond correctly to the two spike patterns, with the firing rate of one population higher than that of the other when the corresponding spike pattern is present, and the reward becomes high. Our results show that the reward increases much faster with reallocation, due to the accelerated exploration of the parameter space. After the reward reaches a high value, the network continues to explore, and the reward may fluctuate while the network searches among equally good network configurations.

V-D Power and Energy Measurement Results
TABLE VI: Power and energy consumption for one time step

                       With DRAM,       No DRAM,         No DRAM,
                       No Accelerator   No Accelerator   With Accelerator
  Power (mW)               285              225              225
  Time (ms)                1.58             1.58             0.76
  Energy (µJ)              450.3            355.5            171
  Reduction of Energy       0%               21%              62%
The optimizations described in Section IV result in a considerable reduction of power and energy consumption. To show the benefit of the optimizations, power and energy consumption is measured in three cases. In the first case, the synapse rows are stored in the external DRAM, and the exponential function and random number generation are computed purely in software on the ARM core. In the second case, the synapse rows are stored in the local SRAM, while the exponential function and random number generation are still computed in software. In the third case, the synapse rows are stored in the local SRAM, and the exponential function and random number generation are performed by the hardware accelerators. For this measurement, the software is run without random reallocation of synapse memory. As summarized in Table VI, the power and energy consumption is reduced by local computation without external DRAM and by the reduced computation time.
First, the memory footprint is optimized by employing the 16-bit floating-point data type and the compact memory arrangement described in Sections IV-A and IV-C. The random reallocation described in Section III-D increases the effective number of synapses, which would otherwise only be achievable with external memory like DRAM. The reduced memory footprint allows for local computation with SRAM, as described in Section IV-B. Switching off the DRAM reduces power consumption by 21%, from 285 mW to 225 mW.
In addition, as summarized in Section V-A, the computation time for each plasticity update is reduced by 53.4%. Without the hardware accelerators, simulating the network with 3 000 plastic synapses per core for one time step (1 ms) takes 1.58 ms, losing real-time capability. With the hardware accelerators, the simulation of one time step finishes within 0.76 ms. To measure the energy consumption, the length of the time step is chosen as the minimum required for each time step to finish, i.e. 1.58 ms without accelerators and 0.76 ms with accelerators. The reduction of computation time for the plasticity update reduces the energy consumption for one time step by 51.9%, from 355.5 µJ to 171 µJ.
In total, the energy consumption for the simulation of the network for one time step is reduced by 62%, from 450.3 µJ to 171 µJ, making the system attractive for mobile and embedded applications.
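The energy figures in Table VI follow directly from E = P · t for the three measured configurations:

```python
# Energy per time step for the three measured configurations of Table VI.
configs = {
    "DRAM, no accel":    (285e-3, 1.58e-3),   # (power in W, step time in s)
    "no DRAM, no accel": (225e-3, 1.58e-3),
    "no DRAM, accel":    (225e-3, 0.76e-3),
}
# E = P * t, converted to microjoules.
energy = {name: p * t * 1e6 for name, (p, t) in configs.items()}

baseline = energy["DRAM, no accel"]            # 450.3 uJ
saving_sram = 1 - energy["no DRAM, no accel"] / baseline   # 0.21
saving_total = 1 - energy["no DRAM, accel"] / baseline     # 0.62
```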
VI Discussion
In the following we discuss how the implementation of the reward-based synaptic sampling model would scale to larger networks on the final SpiNNaker 2 system. Finally, we discuss the possibility of realizing this network model on SpiNNaker 1 and on other neuromorphic platforms with learning capabilities.
VI-A Scalability
The SpiNNaker architecture was designed for the scalable real-time simulation of spiking neural networks with up to a million cores [27]. SpiNNaker’s scalability is based on the multicast network for routing of spike events [34] and a software framework for mapping network models onto the system, which has been shown to support the simulation of large-scale neural networks [52]. Building on this, the reward-based synaptic sampling model can be scaled to future SpiNNaker 2 systems without major restrictions, i.e. as our implementation is integrated into the SpiNNaker software framework, the automatic mapping of larger networks onto many cores and the configuration of routing tables come for free. In principle, with more than 100 cores per chip in SpiNNaker 2 (cf. Table I), DRAM bandwidth may become a bottleneck for some applications, but not in our case, as synapse variables are stored and processed locally in each core and DRAM is not used. Furthermore, a many-chip implementation should not be limited by the communication bandwidth for spike packets between chips, as the reward-based synaptic sampling model is mainly limited by the computation of the synapse updates and has rather moderate spike rates (Section V-B). Still, we remark that, as in any large-scale neuromorphic hardware system, the fraction of energy consumed for communication will increase with network size [53], demanding optimized routing architectures [54].
Future work will include simulating larger networks of this type on the full-scale SpiNNaker 2 system with many cores. Such a scaled-up, real-time version of the synaptic sampling framework will enable us to explore reward-based learning on high-dimensional input such as dynamic vision sensors [55] or conventional high-density image sensors [56].
VI-B Comparison with SpiNNaker 1
Reward-based learning and structural plasticity have been implemented on the SpiNNaker system before [48, 57]. The reward-based synaptic sampling model implemented in this work is more complex, because it requires random number generation and an exponential function for each plastic synapse in each time step. In addition, due to the lack of floating-point arithmetic, this synapse model would be very hard, if not impossible, to implement on the first-generation SpiNNaker system, since the change of synaptic weight is very small in each time step and cannot be captured by the precision of the fixed-point format.
VI-C Comparison with Other Neuromorphic Platforms
To the best of our knowledge, no neuromorphic hardware platform today, except SpiNNaker 2, is able to directly simulate complex learning rules such as synaptic sampling. Most other approaches have traded off accessible model complexity for a more direct implementation of the neuron dynamics. We discuss here how synaptic sampling could still be emulated on other architectures.
Clearly, since synaptic sampling is inherently an online learning model, it cannot be directly implemented on neuromorphic hardware with only static synapses, such as TrueNorth [58], Neurogrid [59], HiAER-IFAT [54], DYNAPs [60] and DeepSouth [61]. However, the network dynamics could be approximated by alternating short time windows of network simulation with reprogramming of the synaptic weights by an external device.
Architectures that do support synaptic plasticity on chip, such as Loihi [62] and the BrainScaleS 2 system [63], have so far quite limited weight resolutions (9-bit signed integer on Loihi and 12-bit on BrainScaleS 2). Since the 32-bit fixed-point format was found to be insufficient for this model (cf. Section IV-A), it is questionable, even with stochastic rounding, whether synaptic sampling can be implemented with such low weight resolution, and at what cost in performance. Also, in the case of Loihi, the size of the microcode that is allowed for computing synaptic updates is quite limited (e.g. 16 32-bit words). Besides, hardware accelerators for complex functions like the exponential function are not available on these two platforms, which makes the implementation more challenging, especially in the case of BrainScaleS 2, because the high data rate caused by the accelerated operation requires fast execution of learning rules. These restrictions cast some doubt on whether complex learning mechanisms, such as the one considered here, can be implemented exactly. Likewise, an exact implementation of the synaptic sampling model seems infeasible on neuromorphic hardware with configurable (but not programmable) plasticity, like ROLLS [64], ODIN [65] and TITAN [66] (see [67] and [68] for reviews). However, it might be possible to realize simplified, approximate versions of synaptic sampling on these neuromorphic platforms.
VII Conclusion
In this work, a reward-based synaptic sampling model is implemented on the prototype chip of the second-generation SpiNNaker system. This real-time online learning system is demonstrated in a closed-loop reinforcement learning task. While hardware features of the future SpiNNaker 2 and its prototypes have already been published, this is the first time that learning in plastic spiking synapses has been demonstrated on SpiNNaker 2. As argued in Sections I and VI-C, this is also one of the most complex synaptic learning models ever implemented in neuromorphic hardware. The hardware accelerators and the software optimizations allow for efficient neural simulation with regard to computation time, memory, and power and energy consumption, while the SpiNNaker 2 system keeps the full flexibility of being processor-based. For this application, we show slightly more than a factor of 2 speedup of the algorithm compared to a pure software implementation. Coupled with the 2.5-fold increase in clock frequency, we can theoretically simulate 5 times as many synapses of this type on SpiNNaker 2 as on SpiNNaker 1 in the same time span. In addition, we show a reduction of energy consumption by 62% compared to an implementation without hardware accelerators and with external DRAM.
Acknowledgements
The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7) under grant agreement no. 604102 and the EU’s Horizon 2020 research and innovation programme under grant agreements no. 720270 and 785907 (Human Brain Project, HBP). In addition, this work was supported by the Center for Advancing Electronics Dresden (cfaed) and the H2020-FETPROACT project Plan4Act (#732266) [DK]. Furthermore, this work was supported by the Austrian Science Fund (FWF): I 3251-N33. The authors thank Andrew Rowley, Luis Plana, Alan Stokes and Michael Hopkins for providing the source code of the SpiNNaker 1 software. In addition, the authors thank ARM and Synopsys for IP and the Vodafone chair at Technische Universität Dresden for contributions to RTL design.
References
 [1] A. A. Faisal et al., “Noise in the nervous system,” Nature reviews neuroscience, vol. 9, no. 4, p. 292, 2008.
 [2] P. G. Clarke, “The limits of brain determinacy,” Proceedings of the Royal Society of London B: Biological Sciences, vol. 279, no. 1734, pp. 1665–1674, 2012.
 [3] A. J. Holtmaat et al., “Transient and persistent dendritic spines in the neocortex in vivo,” Neuron, vol. 45, no. 2, pp. 279–291, 2005.
 [4] S. Rumpel and J. Triesch, “The dynamic connectome,” eNeuroforum, vol. 7, no. 3, pp. 48–53, 2016.
 [5] R. Dvorkin and N. E. Ziv, “Relative contributions of specific activity histories and spontaneous processes to size remodeling of glutamatergic synapses,” PLoS biology, vol. 14, no. 10, p. e1002572, 2016.
 [6] U. Rokni et al., “Motor learning with unstable neural representations,” Neuron, vol. 54, no. 4, pp. 653–666, 2007.
 [7] N. Yasumatsu et al., “Principles of longterm dynamics of dendritic spines,” The Journal of Neuroscience, vol. 28, no. 50, pp. 13 592–13 608, 2008.

 [8] Y. Loewenstein et al., “Multiplicative dynamics underlie the emergence of the lognormal distribution of spine sizes in the neocortex in vivo,” The Journal of Neuroscience, vol. 31, no. 26, pp. 9481–9488, 2011.
 [9] A. Statman et al., “Synaptic size dynamics as an effectively stochastic process,” PLoS computational biology, vol. 10, no. 10, p. e1003846, 2014.
 [10] M. D. McDonnell and L. M. Ward, “The benefits of noise in neural systems: bridging theory and experiment,” Nature Reviews Neuroscience, vol. 12, no. 7, p. 415, 2011.
 [11] W. Maass, “Noise as a resource for computation and learning in networks of spiking neurons,” Proceedings of the IEEE, vol. 102, no. 5, pp. 860–880, 2014.

 [12] D. Kappel et al., “Network plasticity as Bayesian inference,” PLoS computational biology, vol. 11, no. 11, p. e1004485, 2015.
 [13] D. Kappel et al., “Synaptic sampling: a Bayesian approach to neural network plasticity and rewiring,” in Advances in Neural Information Processing Systems, 2015, pp. 370–378.
 [14] D. Kappel et al., “A dynamic connectome supports the emergence of stable computational function of neural circuits through rewardbased learning,” eNeuro, vol. 5, no. 2, 2018. [Online]. Available: http://europepmc.org/articles/PMC5913731
 [15] G. Bellec et al., “Deep rewiring: Training very sparse deep networks,” ICLR, 2018.
 [16] G. Indiveri et al., “Neuromorphic architectures for spiking deep neural networks,” in Electron Devices Meeting (IEDM), 2015 IEEE International. IEEE, 2015, pp. 4–2.
 [17] M. Noack et al., “Switchedcapacitor realization of presynaptic shorttermplasticity and stoplearning synapses in 28 nm cmos,” Frontiers in neuroscience, vol. 9, p. 10, 2015.
 [18] N. Du et al., “Single pairing spike-timing dependent plasticity in BiFeO3 memristors with a time window of 25 ms to 125 µs,” Frontiers in neuroscience, vol. 9, p. 227, 2015.

 [19] T. Levi et al., “Development and applications of biomimetic neuronal networks toward brainmorphic artificial intelligence,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 65, no. 5, pp. 577–581, 2018.
 [20] S. Schmitt et al., “Neuromorphic hardware in the loop: Training a deep spiking network on the BrainScaleS wafer-scale system,” Proceedings of the 2017 IEEE International Joint Conference on Neural Networks, pp. 2227–2234, 2017. [Online]. Available: http://ieeexplore.ieee.org/document/7966125/
 [21] M. A. Petrovici et al., “Pattern representation and recognition with accelerated analog neuromorphic systems,” in Circuits and Systems (ISCAS), 2017 IEEE International Symposium on. IEEE, 2017, pp. 1–4.
 [22] P. A. Merolla et al., “A million spikingneuron integrated circuit with a scalable communication network and interface,” Science, vol. 345, no. 6197, pp. 668–673, 2014.
 [23] J. C. Knight and T. Nowotny, “GPUs outperform current HPC and neuromorphic solutions in terms of speed and energy when simulating a highly-connected cortical model,” Frontiers in Neuroscience, vol. 12, p. 941, 2018. [Online]. Available: https://www.frontiersin.org/article/10.3389/fnins.2018.00941
 [24] B. Vogginger et al., “Reducing the computational footprint for realtime bcpnn learning,” Frontiers in Neuroscience, vol. 9, p. 2, 2015. [Online]. Available: https://www.frontiersin.org/article/10.3389/fnins.2015.00002
 [25] F. Neumärker et al., “True random number generation from bangbang adpll jitter,” in 2016 IEEE Nordic Circuits and Systems Conference (NORCAS), Nov 2016, pp. 1–5.
 [26] J. Partzsch et al., “A fixed point exponential function accelerator for a neuromorphic manycore system,” in 2017 IEEE International Symposium on Circuits and Systems (ISCAS), May 2017, pp. 1–4.
 [27] S. B. Furber et al., “The SpiNNaker project,” Proceedings of the IEEE, vol. 102, no. 5, pp. 652–665, May 2014.
 [28] S. Haas et al., “An MPSoC for energy-efficient database query processing,” in Design Automation Conference (DAC), 2016 53rd ACM/EDAC/IEEE. IEEE, 2016, pp. 1–6.
 [29] S. Haas et al., “A heterogeneous sdr mpsoc in 28 nm cmos for lowlatency wireless applications,” in Proceedings of the 54th Annual Design Automation Conference 2017. ACM, 2017, p. 47.
 [30] K. Amunts et al., “The human brain project: creating a european research infrastructure to decode the human brain,” Neuron, vol. 92, no. 3, pp. 574–581, 2016.
 [31] E. Painkras et al., “SpiNNaker: A 1w 18core systemonchip for massivelyparallel neural network simulation,” IEEE Journal of SolidState Circuits, vol. 48, no. 8, pp. 1943–1953, Aug 2013.
 [32] S. Höppner and C. Mayr, “SpiNNaker 2  towards extremely efficient digital neuromorphics and multiscale brain emulation,” in Neuro Inspired Computational Elements Workshop (NICE). NICE Workshop Foundation, 2018. [Online]. Available: http://niceworkshop.org/wpcontent/uploads/2018/05/227SHoppnerSpiNNaker2.pdf
 [33] S. Höppner et al., “Dynamic voltage and frequency scaling for neuromorphic manycore systems,” in 2017 IEEE International Symposium on Circuits and Systems (ISCAS), May 2017, pp. 1–4.
 [34] J. Navaridas et al., “SpiNNaker: Enhanced multicast routing,” Parallel Computing, vol. 45, pp. 49 – 66, 2015, computing Frontiers 2014: Best Papers. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167819115000095
 [35] G. Marsaglia, “Xorshift rngs,” Journal of Statistical Software, Articles, vol. 8, no. 14, pp. 1–6, 2003. [Online]. Available: https://www.jstatsoft.org/v008/i14
 [36] S. Höppner et al., “A fastlocking adpll with instantaneous restart capability in 28nm cmos technology,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 60, no. 11, pp. 741–745, 2013.
 [37] H. Eisenreich et al., “A novel adpll design using successive approximation frequency control,” Microelectronics Journal, vol. 40, no. 11, pp. 1613–1622, 2009.
 [38] S. Höppner et al., “Method for generating true random numbers on a multiprocessor system and the same,” 2018, European Patent Register, EP3147775.
 [39] A. Holtmaat et al., “Experiencedependent and celltypespecific spine growth in the neocortex,” Nature, vol. 441, no. 7096, pp. 979–983, 2006.
 [40] W. Gerstner et al., Neuronal dynamics: From single neurons to networks and models of cognition. Cambridge University Press, 2014. [Online]. Available: http://neuronaldynamics.epfl.ch
 [41] C. Liu et al., “Memoryefficient deep learning on a SpiNNaker 2 prototype,” Frontiers in Neuroscience, vol. 12, p. 840, 2018.
 [42] M. Noack et al., “Biologyderived synaptic dynamics and optimized system architecture for neuromorphic hardware,” in Mixed Design of Integrated Circuits and Systems (MIXDES), 2010 Proceedings of the 17th International Conference. IEEE, 2010, pp. 219–224.
 [43] R. George et al., “Eventbased softcore processor in a biohybrid setup applied to structural plasticity,” in Eventbased Control, Communication, and Signal Processing (EBCCSP), 2015 International Conference on. IEEE, 2015, pp. 1–4.
 [44] G. E. P. Box and M. E. Muller, “A note on the generation of random normal deviates,” Ann. Math. Statist., vol. 29, no. 2, pp. 610–611, 06 1958. [Online]. Available: https://doi.org/10.1214/aoms/1177706645
 [45] W. Hörmann and J. Leydold, “Continuous random variate generation by fast numerical inversion,” ACM Trans. Model. Comput. Simul., vol. 13, no. 4, pp. 347–362, Oct. 2003. [Online]. Available: http://doi.acm.org/10.1145/945511.945517
 [46] B. Dünweg and W. Paul, “Brownian dynamics simulations without gaussian random numbers,” International Journal of Modern Physics C, vol. 2, no. 3, pp. 817–827, 1991.
 [47] M. Hopkins, “random.c (source code),” 2014. [Online]. Available: https://github.com/SpiNNakerManchester/spinn_common/blob/master/src/random.c
 [48] M. Mikaitis et al., “Neuromodulated synaptic plasticity on the SpiNNaker neuromorphic system,” Frontiers in Neuroscience, vol. 12, p. 105, 2018. [Online]. Available: https://www.frontiersin.org/article/10.3389/fnins.2018.00105
 [49] S. Han et al., “Learning both weights and connections for efficient neural networks,” in Proceedings of the 28th International Conference on Neural Information Processing Systems  Volume 1, ser. NIPS’15. Cambridge, MA, USA: MIT Press, 2015, pp. 1135–1143. [Online]. Available: http://dl.acm.org/citation.cfm?id=2969239.2969366
 [50] O. Rhodes et al., “spynnaker: A software package for running pynn simulations on SpiNNaker,” Frontiers in Neuroscience, vol. 12, p. 816, 2018. [Online]. Available: https://www.frontiersin.org/article/10.3389/fnins.2018.00816
 [51] T. Sharp and S. Furber, “Correctness and performance of the SpiNNaker architecture,” in The 2013 International Joint Conference on Neural Networks (IJCNN), Aug 2013, pp. 1–8.
 [52] S. J. van Albada et al., “Performance comparison of the digital neuromorphic hardware SpiNNaker and the neural network simulation software nest for a fullscale cortical microcircuit model,” Frontiers in neuroscience, vol. 12, 2018.
 [53] J. Hasler and H. B. Marr, “Finding a roadmap to achieve large neuromorphic hardware systems,” Frontiers in neuroscience, vol. 7, p. 118, 2013.
 [54] J. Park et al., “Hierarchical address event routing for reconfigurable largescale neuromorphic systems,” IEEE transactions on neural networks and learning systems, vol. 28, no. 10, pp. 2408–2422, 2017.
 [55] P. Lichtsteiner et al., “A 128×128 120 dB 15 µs latency asynchronous temporal contrast vision sensor,” IEEE journal of solid-state circuits, vol. 43, no. 2, pp. 566–576, 2008.
 [56] S. Henker et al., “Active pixel sensor arrays in 90/65nm cmostechnologies with vertically stacked photodiodes,” in Proc. IEEE International Image Sensor Workshop IIS07, 2007, pp. 16–19.
 [57] P. A. Bogdan et al., “Structural plasticity on the SpiNNaker manycore neuromorphic system,” Frontiers in Neuroscience, vol. 12, p. 434, 2018.
 [58] P. A. Merolla et al., “A million spikingneuron integrated circuit with a scalable communication network and interface,” Science, vol. 345, no. 6197, pp. 668–673, 2014. [Online]. Available: http://science.sciencemag.org/content/345/6197/668
 [59] B. Varkey Benjamin et al., “Neurogrid: A mixedanalogdigital multichip system for largescale neural simulations,” Proceedings of the IEEE, vol. 102, pp. 1–18, 05 2014.
 [60] S. Moradi et al., “A scalable multicore architecture with heterogeneous memory structures for dynamic neuromorphic asynchronous processors (dynaps),” IEEE Transactions on Biomedical Circuits and Systems, vol. 12, no. 1, pp. 106–122, Feb 2018.
 [61] R. M. Wang et al., “An fpgabased massively parallel neuromorphic cortex simulator,” Frontiers in Neuroscience, vol. 12, p. 213, 2018. [Online]. Available: https://www.frontiersin.org/article/10.3389/fnins.2018.00213
 [62] M. Davies et al., “Loihi: A neuromorphic manycore processor with onchip learning,” IEEE Micro, vol. 38, no. 1, pp. 82–99, January 2018.
 [63] S. Friedmann et al., “Demonstrating hybrid learning in a flexible neuromorphic hardware system,” IEEE Transactions on Biomedical Circuits and Systems, vol. 11, no. 1, pp. 128–142, Feb 2017.
 [64] N. Qiao et al., “A reconfigurable online learning spiking neuromorphic processor comprising 256 neurons and 128k synapses,” Frontiers in Neuroscience, vol. 9, p. 141, 2015. [Online]. Available: https://www.frontiersin.org/article/10.3389/fnins.2015.00141
 [65] C. Frenkel et al., “A 0.086mm 12.7pj/sop 64ksynapse 256neuron onlinelearning digital spiking neuromorphic processor in 28nm cmos,” IEEE Transactions on Biomedical Circuits and Systems, vol. 13, no. 1, pp. 145–158, Feb 2019.
 [66] C. Mayr et al., “A biologicalrealtime neuromorphic system in 28 nm cmos using lowleakage switched capacitor circuits,” IEEE Transactions on Biomedical Circuits and Systems, vol. 10, no. 1, pp. 243–254, Feb 2016.
 [67] C. S. Thakur et al., “Largescale neuromorphic spiking array processors: A quest to mimic the brain,” Frontiers in Neuroscience, vol. 12, p. 891, 2018. [Online]. Available: https://www.frontiersin.org/article/10.3389/fnins.2018.00891
 [68] M. Rahimi Azghadi et al., “Spikebased synaptic plasticity in silicon: Design, implementation, application, and challenges,” Proceedings of the IEEE, vol. 102, no. 5, pp. 717–737, May 2014.