1 Introduction
The growing amount of energy consumed by Artificial Neural Networks (ANNs) has been identified as an important problem in the context of mobile, IoT, and edge applications (Moloney, 2016; Zhang et al., 2017; McKinstry et al., 2018; Wang et al., 2019; Situnayake and Warden, 2019). The vast majority of an ANN's time and energy is consumed by the multiply-accumulate (MAC) operations implementing the weighting of activities between layers (Sze et al., 2017). Thus, many ANN accelerators focus almost entirely on optimizing MACs (e.g., Ginsburg et al., 2017; Jouppi et al., 2017), while practitioners prune (Zhu and Gupta, 2017; Liu et al., 2018) and quantize (Gupta et al., 2015; Courbariaux et al., 2015; McKinstry et al., 2018; Nayak et al., 2019) weights to reduce the use and size of MAC arrays.
While these strategies focus on the weight matrix, the Spiking Neural Network (SNN) community has taken a different but complementary approach that instead focuses on temporal processing. The operations of an SNN are temporally sparsified, such that an accumulate only occurs whenever a "spike" arrives at its destination. These sparse, one-bit activities (i.e., "spikes") not only reduce the volume of data communicated between nodes in the network (Furber et al., 2014), but also replace the multipliers in the MAC arrays with adders – together providing orders of magnitude gains in energy efficiency (Davies et al., 2018; Blouw et al., 2019).
The conventional method of training an SNN is to first train an ANN, replace the activation functions with spiking neurons that have identical firing rates (Hunsberger and Eliasmith, 2015), and then optionally retrain with spikes on the forward pass and a differentiable proxy on the backward pass (Huh and Sejnowski, 2018; Bellec et al., 2018; Zhang and Li, 2019). However, converting an ANN into an SNN often degrades model accuracy – especially for recurrent networks. Thus, multiple hardware groups have started building hybrid architectures that support ANNs, SNNs, and mixtures thereof (Liu et al., 2018; Pei et al., 2019; Moreira et al., 2020) – for instance by supporting event-based activities, fixed-point representations, and a variety of multi-bit coding schemes. These hybrid platforms present the alluring possibility of trading accuracy for energy in task-dependent ways (Blouw and Eliasmith, 2020). However, principled methods of leveraging such trade-offs are lacking.
In this work, we propose, to our knowledge, the first method of training hybrid-spiking networks (hSNNs) by smoothly interpolating between ANN (i.e., 32-bit activities) and SNN (i.e., 1-bit activities) regimes. The key idea is to interpret spiking neurons as one-bit quantizers that diffuse their quantization error across future timesteps – similar to Floyd and Steinberg (1976) dithering. This idea can be readily applied to any activation function at little additional cost, generalizes to quantizers with arbitrary bit-widths (even fractional), provides strong bounds on the quantization errors, and relaxes in the limit to the ideal ANN.
Our methods enable the training procedure to balance the accuracy of ANNs with the energy efficiency of SNNs by evaluating the continuum of networks in between these two extremes. Furthermore, we show that this method can train hSNNs with superior accuracy to ANNs and SNNs trained via conventional methods. In a sense, we show that it is useful to think of spiking and non-spiking networks as extremes on a continuum. As a result, the set of hSNNs captures networks with any mixture of activity quantization throughout the architecture.
2 Related Work
Related work has investigated the quantization of activation functions in the context of energy-efficient hardware (e.g., Jacob et al., 2018; McKinstry et al., 2018). Likewise, Hopkins et al. (2019) consider stochastic rounding and dithering as a means of improving the accuracy of spiking neuron models on low-precision hardware relative to their ideal ordinary differential equations (ODEs). Neither of these approaches accounts for the quantization errors that accumulate over time, whereas our proposed method keeps these errors bounded.
Some have viewed spiking neurons as one-bit quantizers, or analog-to-digital converters (ADCs), including Chklovskii and Soudry (2012); Yoon (2016); Ando et al. (2018); Neckar et al. (2018); Yousefzadeh et al. (2019a, b). But these methods are neither generalized to multi-bit or hybrid networks, nor leveraged to interpolate between non-spiking and spiking networks during training.
There also exist other methods that introduce temporal sparsity into ANNs. One such example is channel gating (Hua et al., 2019), whereby the channels in a CNN are dynamically pruned over time. Another example is dropout (Srivastava et al., 2014) – a form of regularization that randomly drops out activities during training. The gating mechanisms in both cases are analogous to spiking neurons.
Neurons that can output multi-bit spikes have been considered in the context of packets that bundle together neighbouring spikes (Krithivasan et al., 2019). In contrast, this work directly computes the number of spikes in constant time and memory per neuron, and varies the temporal resolution during training to interpolate between non-spiking and spiking regimes, allowing optimization across the full set of hSNNs.
Our methods are motivated by some of the recent successes in training SNNs to compete with ANNs on standard machine learning benchmarks (Bellec et al., 2018; Stöckl and Maass, 2019; Pei et al., 2019). To our knowledge, this work is the first to parameterize the activation function in a manner that places ANNs and SNNs on opposite ends of the same spectrum. We show that this idea can be used to convert ANNs to SNNs, and to train hSNNs with improved accuracy relative to pure (i.e., 1-bit) SNNs and improved energy efficiency relative to pure (i.e., 32-bit) ANNs.
3 Methods
3.1 Quantized Activation Functions
We now formalize our method of quantizing any activation function. In short, the algorithm quantizes the activity level and then pushes the quantization error onto the next timestep – analogous to the concept of using error diffusion to dither a one-dimensional time-series (Floyd and Steinberg, 1976). The outcome is a neuron model that interpolates an arbitrary activation function, $f$, between non-spiking and spiking regimes through choice of the parameter $\gamma > 0$, which acts like a timestep.
3.1.1 Temporally-Diffused Quantizer
Let $x_t$ be the input to the activation function at a discrete timestep, $t \in \mathbb{N}$, such that the ideal output (i.e., with unlimited precision) is $a_t = f(x_t)$. The algorithm maintains one scalar state-variable across time, $v_t$, that tracks the total amount of quantization error that the neuron has accumulated over time. We recommend initializing $v_0 \sim \mathcal{U}[0, 1)$ independently for each neuron. The output of the neuron, $\bar{a}_t$, is determined by Algorithm 1.
The ideal activation, $f$, may be any conventional nonlinearity (e.g., ReLU, sigmoid, etc.), or the time-averaged response curve corresponding to a biological neuron model (e.g., leaky integrate-and-fire), including those with multiple internal state-variables (Koch, 2004). Adaptation may also be modelled by including a recurrent connection from $\bar{a}_t$ to $x_{t+1}$ (Voelker, 2019, section 5.2.1).
To help understand the relationship between this algorithm and spiking neuron models, it is useful to interpret $\bar{a}_t$ as the number of spikes ($\gamma \bar{a}_t$) that occur across a window of time, normalized by the length of this window ($\gamma$). Then $\gamma f(x_t)$ represents the expected number of spikes across the window, and $v_t$ tracks progress towards the next spike.
We note that Algorithm 1 is equivalent to Ando et al. (2018, Algorithm 1) where $f$ is the rectified linear (ReLU) activation function and $\gamma = 1$. Yousefzadeh et al. (2019a, Algorithm 1) extend this to represent changes in activation levels, and allow negative spikes. Still considering the ReLU activation, Algorithm 1 is again equivalent to the spiking integrate-and-fire (IF) neuron model, without a refractory period, with a membrane voltage ($v_t$) normalized to $[0, 1)$, a firing rate of $f(x_t)$ Hz, and the ODE discretized to a timestep of $\gamma$ s using zero-order hold (ZOH). The parameter $\gamma$ essentially generalizes the spiking model to allow multiple spikes per timestep, and the IF restriction is lifted to allow arbitrary activation functions (including leaky neurons, and those with negative spikes such as $\tanh$).
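Since Algorithm 1 itself is not reproduced in this excerpt, the following is a minimal sketch of one plausible reading of the temporally-diffused quantizer: the state accumulates $\gamma f(x_t)$ each step, the integer part is emitted as spikes, and the residual is carried forward. The update order and variable names here are assumptions, not the paper's exact listing.

```python
import numpy as np

def diffused_quantize(x, f, gamma, v0=0.5):
    """Sketch of a temporally-diffused quantizer (assumed reading of Algorithm 1).

    v accumulates gamma * f(x_t) each step; floor(v) spikes are emitted and
    subtracted, diffusing the residual quantization error to the next step.
    """
    v, out = v0, []
    for x_t in x:
        v += gamma * f(x_t)    # expected spikes over a window of length gamma
        s = np.floor(v)        # spikes emitted this timestep
        v -= s                 # carry the residual error forward
        out.append(s / gamma)  # normalized (multi-bit) activity
    return np.array(out)

relu = lambda z: np.maximum(z, 0.0)
x = np.full(200, 0.3)                      # constant input, f(x) = 0.3
for gamma in (1, 4, 64):
    a = diffused_quantize(x, relu, gamma)
    print(gamma, a.mean())                 # time-average approaches f(x) = 0.3
```

With $\gamma = 1$ the output is a conventional spike train in $\{0, 1\}$; larger $\gamma$ yields finer activity levels whose time-average converges to the ideal activation.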
3.1.2 Scaling Properties
We now state several important properties of this quantization algorithm (see supplementary for proofs). For convenience, we assume the range of $f$ is scaled such that $\lvert f(x_t) \rvert \le 1$ over the domain of valid inputs (this can also be achieved via clipping or saturation).
Zero-Mean Error
Supposing $v_0 \sim \mathcal{U}[0, 1)$, the expected quantization error is $\mathbb{E}\left[ f(x_t) - \bar{a}_t \right] = 0$.
Bounded Error
The total quantization error is bounded by $\gamma^{-1}$ across any consecutive slice of timesteps: $\left| \sum_{t=j}^{k} \left( f(x_t) - \bar{a}_t \right) \right| < \gamma^{-1}$ for all $j \le k$. As a corollary, the signal-to-noise ratio (SNR) of $\bar{a}_t$ scales as $\Omega(\gamma)$, and this SNR may be further scaled by the time-constant of a lowpass filter (see section 3.3).
Bit-Width
The number of bits required to represent $\bar{a}_t$ in binary is at most $\lceil \log_2(\gamma + 1) \rceil$ if $f$ is nonnegative (plus a sign bit if $f$ can be negative).
ANN Regime
As $\gamma \to \infty$, $\bar{a}_t \to f(x_t)$, hence the activation function becomes equivalent to the ideal $f$.
SNN Regime
When $\gamma = 1$, the activation function becomes a conventional spiking neuron, since it outputs either zero or a spike ($\bar{a}_t = 1$) if $f$ is nonnegative (and optionally a negative spike if $f$ is allowed to be negative).
Temporal Sparsity
The number of spikes per timestep, $\gamma \bar{a}_t$, scales as $\mathcal{O}(\gamma)$.
To summarize, the choice of $\gamma$ results in activities that require $\mathcal{O}(\log \gamma)$ bits to represent, while achieving an SNR of $\Omega(\gamma)$ relative to the ideal function. The effect of the algorithm is depicted in Figure 1 for various $\gamma$.
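The zero-mean, bounded-error, and bit-width properties can be checked numerically. The sketch below assumes the same update as our reading of Algorithm 1 (accumulate $\gamma f(x_t)$, emit the floor, carry the remainder), with $f(x) = x$ clipped to $[0, 1]$:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_series(x, gamma, v0):
    # Assumed quantizer update: v += gamma*f(x); emit floor(v); carry remainder.
    v, out = v0, []
    for x_t in x:
        v += gamma * np.clip(x_t, 0.0, 1.0)
        s = np.floor(v)
        v -= s
        out.append(s / gamma)
    return np.array(out)

x = rng.uniform(0, 1, size=500)
for gamma in (1, 3, 16):
    a = quantize_series(x, gamma, v0=rng.uniform())
    # Bounded error: the summed error over ANY consecutive window < 1/gamma.
    c = np.concatenate(([0.0], np.cumsum(x - a)))
    window_err = np.abs(c[:, None] - c[None, :]).max()
    # Bit-width: spike counts lie in {0, ..., gamma}.
    n_bits = int(np.ceil(np.log2(gamma + 1)))
    print(gamma, window_err, n_bits)
```

The windowed error never reaches $\gamma^{-1}$ because it equals the difference of two state values, each confined to $[0, 1)$, divided by $\gamma$.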
3.1.3 Backpropagation Training
To train the network via backpropagation, we make the simplifying assumption that the quantization errors are i.i.d. random variables (see supplementary), which implies that $\bar{a}_t = f(x_t) + \eta_t$, where $\eta_t$ is uncorrelated zero-mean noise. This justifies assigning a gradient of zero to $\eta_t$. The forward pass uses the quantized activation function to compute the true error for the current $\gamma$, while the backward pass uses the gradient of $f$ (independently of $\gamma$). Therefore, the training method takes into account the temporal mechanisms of spike generation, but allows the gradient to skip over the sequence of operations that are responsible for keeping the total spike noise bounded by $\gamma^{-1}$.
3.2 Legendre Memory Unit
As an example application of these methods we use the Legendre Memory Unit (LMU; Voelker et al., 2019) – a new type of Recurrent Neural Network (RNN) that efficiently orthogonalizes the continuous-time history of some signal, $u(t)$, across a sliding window of length $\theta$. The network is characterized by the following coupled ODEs:
(1) $\theta \dot{m}(t) = A m(t) + B u(t)$
where $m(t)$ is a $d$-dimensional memory vector, and $(A, B)$ have a closed-form solution (Voelker, 2019):
(2) $A = [a]_{ij} \in \mathbb{R}^{d \times d}, \quad a_{ij} = (2i + 1) \begin{cases} -1 & i < j \\ (-1)^{i - j + 1} & i \ge j \end{cases}$
$\quad\;\, B = [b]_i \in \mathbb{R}^{d \times 1}, \quad b_i = (2i + 1)(-1)^i, \quad i, j \in [0, d - 1]$
The key property of this dynamical system is that $m(t)$ represents sliding windows of $u$ via the Legendre (1782) polynomials up to degree $d - 1$:
(3) $u(t - \theta') \approx \sum_{i=0}^{d-1} \tilde{P}_i\!\left(\tfrac{\theta'}{\theta}\right) m_i(t), \quad 0 \le \theta' \le \theta$
where $\tilde{P}_i$ is the $i$-th shifted Legendre polynomial (Rodrigues, 1816). Thus, nonlinear functions of $m(t)$ correspond to functions across windows of $u$ of length $\theta$, projected onto the Legendre basis.
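Equations 1–3 can be checked numerically. The sketch below builds the closed-form $(A, B)$ matrices from equation 2, integrates the ODE with simple Euler steps (an illustrative stand-in for a proper solver), and reconstructs the input delayed by the full window $\theta$, for which the shifted Legendre coefficients all equal one:

```python
import numpy as np

def lmu_matrices(d):
    # Closed-form (A, B) from equation 2.
    A = np.zeros((d, d))
    B = np.zeros((d, 1))
    for i in range(d):
        B[i, 0] = (2 * i + 1) * (-1) ** i
        for j in range(d):
            A[i, j] = (2 * i + 1) * (-1.0 if i < j else (-1.0) ** (i - j + 1))
    return A, B

theta, d, dt = 1.0, 8, 1e-4
A, B = lmu_matrices(d)
m = np.zeros(d)
ts = np.arange(0.0, 4.0, dt)
u = np.sin(2 * np.pi * 0.5 * ts)          # slow signal relative to theta
recon, target = [], []
for k, t in enumerate(ts):
    m = m + (dt / theta) * (A @ m + B.ravel() * u[k])   # Euler step of eq. 1
    if t >= 2.0:                          # after transients have settled
        recon.append(m.sum())             # full-delay readout (all-ones coeffs)
        target.append(np.sin(2 * np.pi * 0.5 * (t - theta)))
err = np.max(np.abs(np.array(recon) - np.array(target)))
print(err)                                # small for d = 8 and theta*freq = 0.5
```

For $d = 2$ this state space reduces exactly to the [1/2] Padé approximant of the delay $e^{-\theta s}$, which is where the all-ones full-delay readout comes from.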
Discretization
We map these equations onto the state of an RNN, $m_t$, given some input $u_t$, indexed at discrete moments in time, $t \in \mathbb{N}$:
(4) $m_t = \bar{f}\left( \bar{A} m_{t-1} + \bar{B} u_t \right)$
where $(\bar{A}, \bar{B})$ are the ZOH-discretized matrices from equation 2 for a timestep of $\bar{\theta}^{-1}$, such that $\bar{\theta}$ is the desired memory length expressed in discrete timesteps. In the ideal case, $\bar{f}$ should be the identity function. For our hSNNs, we clip and quantize $\bar{f}$ using Algorithm 1.
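The ZOH discretization for a timestep of $\bar{\theta}^{-1}$ can be computed by exponentiating an augmented matrix. The helper below uses a plain scaling-and-squaring Taylor series so it stays dependency-free; in practice one would call `scipy.linalg.expm` instead:

```python
import numpy as np

def expm(M, terms=24, squarings=10):
    # Simple scaling-and-squaring Taylor approximation of the matrix exponential.
    M = M / (2.0 ** squarings)
    S, T = np.eye(len(M)), np.eye(len(M))
    for k in range(1, terms):
        T = T @ M / k
        S = S + T
    for _ in range(squarings):
        S = S @ S
    return S

def zoh(A, B, dt):
    # expm of [[A, B], [0, 0]] * dt gives [[A_bar, B_bar], [0, I]].
    d, k = A.shape[0], B.shape[1]
    M = np.zeros((d + k, d + k))
    M[:d, :d], M[:d, d:] = A * dt, B * dt
    E = expm(M)
    return E[:d, :d], E[:d, d:]

def lmu_matrices(d):
    # Closed-form (A, B) from equation 2.
    A = np.zeros((d, d)); B = np.zeros((d, 1))
    for i in range(d):
        B[i, 0] = (2 * i + 1) * (-1) ** i
        for j in range(d):
            A[i, j] = (2 * i + 1) * (-1.0 if i < j else (-1.0) ** (i - j + 1))
    return A, B

theta_bar = 784.0                    # memory length in discrete timesteps
A, B = lmu_matrices(8)
A_bar, B_bar = zoh(A, B, 1.0 / theta_bar)
print(np.abs(np.linalg.eigvals(A_bar)).max())  # < 1: discrete system is stable
```

Because the continuous system is stable (eigenvalues of $A$ have negative real parts), all eigenvalues of $\bar{A}$ lie strictly inside the unit circle.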
Architecture
The LMU takes an input vector, $x_t$, and generates a hidden state. The hidden state, $h_t$, and memory vector, $m_t$, correspond to the activities of two neural populations that we will refer to as the hidden neurons and memory neurons, respectively. The hidden neurons mutually interact with the memory neurons in order to compute nonlinear functions across time, while dynamically writing to memory. The state is a function of the input, previous state, and current memory:
(5) $h_t = f\left( W_x x_t + W_h h_{t-1} + W_m m_t \right)$
where $f$ is some chosen nonlinearity – to be quantized using Algorithm 1 – and $W_x$, $W_h$, $W_m$ are learned weights. The input to the memory is:
(6) $u_t = e_x^{\mathsf{T}} x_t + e_h^{\mathsf{T}} h_{t-1} + e_m^{\mathsf{T}} m_{t-1}$
where $e_x$, $e_h$, $e_m$ are learned vectors.
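Equations 4–6 compose into a single recurrent step. The sketch below is illustrative only: the weight names follow the equations, the initial scales are arbitrary, and a first-order Euler $(\bar{A}, \bar{B})$ stands in for the ZOH-discretized matrices.

```python
import numpy as np

class LMUCell:
    """Sketch of one LMU step per equations 4-6 (shapes/initializers illustrative)."""

    def __init__(self, n_in, n_hidden, d, theta_bar, seed=0):
        rng = np.random.default_rng(seed)
        A = np.zeros((d, d)); B = np.zeros((d, 1))
        for i in range(d):                      # equation 2
            B[i, 0] = (2 * i + 1) * (-1) ** i
            for j in range(d):
                A[i, j] = (2 * i + 1) * (-1.0 if i < j else (-1.0) ** (i - j + 1))
        self.A_bar = np.eye(d) + A / theta_bar  # Euler stand-in for ZOH
        self.B_bar = (B / theta_bar).ravel()
        self.Wx = rng.normal(size=(n_hidden, n_in)) * 0.1
        self.Wh = rng.normal(size=(n_hidden, n_hidden)) * 0.1
        self.Wm = rng.normal(size=(n_hidden, d)) * 0.1
        self.ex = np.ones(n_in)                 # write the raw input into memory
        self.eh = np.zeros(n_hidden)
        self.em = np.zeros(d)

    def step(self, x, h, m):
        u = self.ex @ x + self.eh @ h + self.em @ m           # equation 6
        m = self.A_bar @ m + self.B_bar * u                   # equation 4, f-bar = identity
        h = np.tanh(self.Wx @ x + self.Wh @ h + self.Wm @ m)  # equation 5
        return h, m

cell = LMUCell(n_in=1, n_hidden=16, d=8, theta_bar=100)
h, m = np.zeros(16), np.zeros(8)
for t in range(50):
    h, m = cell.step(np.array([np.sin(0.1 * t)]), h, m)
print(h.shape, m.shape)
```

Note the update order: equation 6 reads the previous state, equation 4 writes the new memory, and equation 5 then reads the current memory.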
3.3 Synaptic Filtering
SNNs commonly apply a synapse model to the weighted summation of spiketrains. This filters the input to each neuron over time to reduce the amount of spike noise
(Dayan and Abbott, 2001). The synapse is most commonly modelled as a lowpass filter, with some chosen time-constant $\tau$, whose effect is equivalent to replacing each spike with an exponentially decaying kernel ($\tau^{-1} e^{-t/\tau}$). By lowpass filtering the activities, the SNR of Algorithm 1 is effectively boosted by a factor of $\bar{\tau}$ (the filter's time-constant in timesteps) relative to the filtered ideal, since the filtered error becomes a weighted time-average of the quantization errors (see supplementary). Thus, we lowpass filter the inputs into both $h_t$ and $m_t$.
To account for the temporal dynamics introduced by the application of a lowpass filter, Voelker and Eliasmith (2018, equation 4.7) prove that the LMU's discretized state-space matrices, $(\bar{A}, \bar{B})$, should be exchanged with $(\bar{A}', \bar{B}')$:
(7) $\bar{A}' = \dfrac{\bar{A} - a I}{1 - a}, \quad \bar{B}' = \dfrac{\bar{B}}{1 - a}, \quad a = e^{-1/\bar{\tau}}$
where $\bar{\tau}$ is the time-constant (in discrete timesteps) of the ZOH-discretized lowpass that is filtering the input to $m_t$.
To summarize, the architecture that we train includes a nonlinear layer ($h_t$) and a linear layer ($m_t$), each of which has synaptic filters. The recurrent and input weights to $m_t$ are fixed to $\bar{A}'$ and $\bar{B}'$, and are not trained. All other connections are trained.
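The exchange in equation 7 can be sanity-checked: routing the exchanged matrices through a discrete lowpass must reproduce the original recurrence exactly. The sketch below assumes a ZOH-discretized lowpass $y_t = a\, y_{t-1} + (1 - a)\, w_t$ with decay $a = e^{-1/\bar{\tau}}$, and uses an arbitrary stable system in place of the LMU matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
d, tau_bar = 5, 10.0
A_bar = 0.7 * np.eye(d) + 0.04 * rng.normal(size=(d, d))  # stand-in dynamics
B_bar = rng.normal(size=d) * 0.1

a = np.exp(-1.0 / tau_bar)                 # ZOH lowpass decay per timestep
A_p = (A_bar - a * np.eye(d)) / (1.0 - a)  # exchanged matrices (equation 7)
B_p = B_bar / (1.0 - a)

m_ref = np.zeros(d)   # ideal recurrence: m_t = A_bar m_{t-1} + B_bar u_t
m_syn = np.zeros(d)   # same dynamics, but routed through the synapse
for u_t in rng.normal(size=300):
    m_ref = A_bar @ m_ref + B_bar * u_t
    w = A_p @ m_syn + B_p * u_t            # recurrent + input weights
    m_syn = a * m_syn + (1.0 - a) * w      # synaptic lowpass filter
print(np.max(np.abs(m_syn - m_ref)))       # ~0: the two trajectories agree
```

Substituting the filter update confirms the algebra: $a\, m + (1 - a)\left(\frac{\bar{A} - aI}{1 - a} m + \frac{\bar{B}}{1 - a} u\right) = \bar{A} m + \bar{B} u$.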
3.4 SNR Scheduling
To interpolate between ANN and SNN regimes, we set $\gamma$ differently from one training epoch to the next, in a manner analogous to scheduling learning rates. Since $\gamma$ is exponential in bit-precision, we vary $\gamma$ on a logarithmic scale across the interval $[\gamma_{\text{start}}, \gamma_{\text{end}}]$, where $\gamma_{\text{start}}$ is set to achieve rapid convergence during the initial stages of training, and $\gamma_{\text{end}}$ depends on the hardware and application. Once $\gamma = \gamma_{\text{end}}$, training is continued until validation error stops improving, and only the model with the lowest validation loss during this fine-tuning phase is saved.
We found that this method of scheduling typically results in faster convergence versus the alternative of starting $\gamma$ at its final value. We also observe that the SNR of $m_t$ is far more critical than the SNR of $h_t$, and thus schedule the two differently (explained below).
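A logarithmic schedule of this kind is straightforward to generate; the endpoints below are illustrative, not the values used in the experiments:

```python
import numpy as np

def gamma_schedule(gamma_start, gamma_end, epochs):
    # Vary gamma on a logarithmic scale from high precision to the target.
    return np.geomspace(gamma_start, gamma_end, epochs)

sched = gamma_schedule(1024.0, 1.0, 11)
print(np.round(sched, 2))  # halves gamma (i.e., drops ~1 bit) each epoch
```

Each step of this particular schedule removes roughly one bit of activity precision, ending in the 1-bit spiking regime.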
Network | Trainable | Weights | Nonlinearities         | State | Levels | Steps | Test (%)
------- | --------- | ------- | ---------------------- | ----- | ------ | ----- | --------
LSTM    | 67850     | 67850   | 384 sigmoid, 128 tanh  | 256   |        | 784   | 98.5
LMU     | 34571     | 51083   | 128 sigmoid            | 256   |        | 784   | 98.26
hsLMU   | 34571     | 51083   | 128 LIF, 128 IF        | 522   | 2–5    | 784   | 97.26
LSNN    | 68210     | 8185    | 120 LIF, 100 Adaptive  |       | 2      | 1680  | 96.4
4 Experiments
To facilitate comparison between the “Long ShortTerm Memory Spiking Neural Network” (LSNN) from
Bellec et al. (2018), and both spiking and non-spiking LMUs (Voelker et al., 2019), we consider the sequential MNIST (sMNIST) task and its permuted variant (psMNIST; Le et al., 2015). For sMNIST, the 28×28 pixels are supplied sequentially in a time-series of length 784. Thus, the network must maintain a memory of the relevant features while simultaneously computing across them in time. For psMNIST, all of the sequences are also permuted by an unknown fixed permutation matrix, which distorts the temporal structure in the sequences and significantly increases the difficulty of the task. In either case, the network outputs a classification at the end of each input sequence. For the output classification, we take the argmax over a dense layer with 10 units, with incoming weights initialized using the Xavier uniform distribution
(Glorot and Bengio, 2010). The network is trained using the categorical crossentropy loss function (fused with softmax).
All of our LMU networks are built in Nengo (Bekolay et al., 2014) and trained using NengoDL (Rasmussen, 2019). The 50k "lost MNIST digits" (Yadav and Bottou, 2019), which do not overlap with MNIST's train or test sets, are used as validation data to select the best model. All sequences are normalized via a fixed linear transformation. We train with minibatches using the Adam optimizer (Kingma and Ba, 2014) with all of its default hyperparameters ($\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$).
To quantize the hidden activations, we use the leaky integrate-and-fire (LIF) neuron model with a refractory period of 1 timestep and a leak of 10 timesteps (corresponding to Nengo's defaults given a timestep of 2 ms), such that its response curve is normalized to $[0, 1]$. The input to each LIF neuron is biased and scaled so that it operates within this normalized range (see supplementary). During training, the $\gamma$ for $h_t$ is interpolated down to a final value of 1. Thus, the hidden neurons in the fully trained networks are conventional (1-bit) spiking neurons.
To quantize the memory activations, we clip and quantize the identity function (see section 3.2), which is analogous to using IF neurons that can generate both positive and negative spikes. To maintain accuracy, the $\gamma$ for $m_t$ is interpolated down to a larger final value for psMNIST than for sMNIST. We provide details regarding the effect of these choices on the number of possible activity levels for the memory neurons, and discuss the impact this has on MAC operations as well as the consequences for energy-efficient neural networks.
The synaptic lowpass filters have a time-constant of 200 timesteps for the activities projecting into $m_t$, and 10 timesteps for the activities projecting into $h_t$. The output layer also uses a 10 timestep lowpass filter. We did not experiment with any other choice of time-constants.
All weights are initialized to zero, except: $e_x$, which is initialized to one; $W_m$, which is initialized using the Xavier normal distribution (Glorot and Bengio, 2010); and $(\bar{A}', \bar{B}')$, which are initialized according to equation 7 and left untrained. L2 regularization is added to the output vector. We did not experiment with batch normalization, layer normalization, dropout, or any other regularization techniques.
Network | Trainable | Weights | Nonlinearities         | Bit-Width | Significant Bits | Test (%)
------- | --------- | ------- | ---------------------- | --------- | ---------------- | --------
LSTM    | 163610    | 163610  | 600 sigmoid, 200 tanh  | 32        | N/A              | 89.86
LMU     | 102027    | 167819  | 212 tanh               | 32        | N/A              | 97.15
hsLMU   | 102239    | 168031  | 212 LIF, 256 IF        | 3.74      | 1.26             | 96.83
4.1 Sequential MNIST
4.1.1 State-of-the-Art
The LSTM and LSNN results shown in Table 1 have been extended from Bellec et al. (2018, Tables S1 and S2). We note that these two results (98.5% and 96.4%) represent the best test accuracy among 12 separately trained models, without any validation. The mean test performance across the same 12 runs is 79.8% for the LSTM and 93.8% for the LSNN.
The LSTM consists of only 128 "units," but is computationally and energetically intensive since it maintains a 256-dimensional vector of 32-bit activities that are multiplied with over 67k weights. The LSNN improves this in two important ways. First, the activities of its 220 neurons are all one bit (i.e., spikes). Second, the number of parameters is pruned down to just over 8k weights. Thus, each timestep consists of at most 8k synaptic operations that simply add a weight to the synaptic state of each neuron, followed by local updates to each synapse and neuron model.
However, the LSNN suffers from half the throughput (each input pixel is presented for two timesteps rather than one), a latency of 112 additional timesteps to accumulate the classification after the image has been presented (resulting in a total of 1680 steps), and an absolute 2.1% decrease in test accuracy. In addition, at least 550 state-variables (220 membrane voltages, 100 adaptive thresholds, 220 lowpass filter states, 10 output filter states, plus state for an optional delay buffer attached to each synapse) are required to maintain memory between timesteps. The authors state that the input to the LSNN is preprocessed using 80 more neurons that fire whenever the pixel value crosses over a fixed threshold associated with each neuron, to obtain "somewhat better performance."
4.1.2 Non-Spiking LMU
The non-spiking LMU is the Nengo implementation from Voelker et al. (2019) with a $d = 128$ dimensional memory, the sigmoid activation chosen for $f$, and a trainable bias vector added to the hidden neurons.
This network obtains a test accuracy of 98.26%, while using only 128 nonlinearities, and training nearly half as many weights as the LSTM or LSNN. However, the MAC operations are still a bottleneck, since each timestep requires multiplying a 256-dimensional vector of 32-bit activities with approximately 51k weights (including $\bar{A}$ and $\bar{B}$).
4.1.3 Hybrid-Spiking LMU
To simplify the MAC operations, we quantize the activation functions and filter their inputs (see section 3). We refer to this as a "hybrid-spiking LMU" (hsLMU), since the hidden neurons are conventional (i.e., one-bit) spiking LIF neurons, but the memory neurons can assume a multitude of distinct activation levels by generating multiple spikes per timestep.
By training until $\gamma = 2$ for $m_t$, each memory neuron can spike at 5 different activity levels (see Figure 2; Top). We remark that the distribution is symmetric about zero, and "prefers" the zero state (51.23% of the time), since equation 1 has exactly one stable point: $m = 0$. As well, the hidden neurons spike only 36.05% of the time. As a result, the majority of weights are not needed on any given timestep. Furthermore, when a weight is accessed, it is simply added for the hidden activities, or multiplied by a small integer spike count for the memory activities.
These performance benefits come at the cost of a 1% decrease in test accuracy, and additional state and computation – local to each neuron – to implement the lowpass filters and Algorithm 1. Specifically, this hsLMU requires 522 state-variables (256 membrane voltages, 256 lowpass filters, and 10 output filters). This network outperforms the LSNN, does not sacrifice throughput or latency, and does not require special preprocessing of the input data.
4.2 Permuted Sequential MNIST
4.2.1 State-of-the-Art
4.2.2 Hybrid-Spiking LMU
We consider the same network from section 4.1.3, scaled up to $d = 256$ memory neurons and 212 hidden neurons. Consistent with the previous hsLMU, the hidden neurons are spiking LIF, and the memory neurons are multi-bit IF neurons that can generate multiple positive or negative spikes per step. In particular, after training $\gamma$ for $m_t$ down to its final value, each memory neuron can spike between −24 and +26 times (inclusive) per step, for a total of 50 distinct activity levels, which requires 6 bits to represent.
Again, the distribution of memory activities is symmetric about zero, and 17.71% of the time the neurons are silent. The 1-bit hidden neurons spike 40.24% of the time. We note that the hsLMU uses 212 more parameters than the LMU from Voelker et al. (2019), as the latter does not include a bias on the hidden nonlinearities.
To quantify the performance benefits of low-precision activities, we propose the following two metrics. The first is the worst-case number of bits required to communicate the activity of each neuron – in this case 1 bit for the hidden neurons and 6 bits for the memory neurons – which has a weighted average of approximately 3.74 bits. The second is the number of bits that are significant (i.e., after removing all of the trailing zero bits, and including a sign bit for negative activities), which has a weighted average of approximately 1.26 bits.
The "bit-width" metric is useful for determining the worst-case volume of spike traffic on hardware where the size of the activity vectors is user-configurable (Furber et al., 2014; Liu et al., 2018), and for hardware where the quantization of activities leads to quadratic improvements in silicon area and energy requirements (McKinstry et al., 2018). The "significant bits" metric reflects how many significant bits are multiplied with each weight, which is important for hardware where bit-flips in the datapath correlate with energy costs (Li et al., 2019), or hardware that is optimized for integer operands close to zero. For instance, a value of 1 for this metric would imply that each MAC, on average, only needs to accumulate its weight (i.e., no multiply is required). These performance benefits come at the cost of a 0.32% decrease in test accuracy, which still outperforms all other RNNs considered by Chandar et al. (2019) and Voelker et al. (2019) apart from the LMU, while using comparable resources and parameter counts.
Interestingly, for the sMNIST network in section 4.1.3, the bit-width metric is exactly 2 (as there are an equal number of hidden (1-bit) and memory (3-bit) neurons). The significant bits metric for that network is 0.58, because a majority of the neurons are inactive on each timestep.
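Both metrics are simple weighted averages over the neuron populations; the sketch below reproduces the psMNIST numbers from Table 2 (212 hidden neurons at 1 bit, 256 memory neurons at 6 bits), with a hypothetical helper implementing the "significant bits" definition:

```python
import numpy as np

# Bit-width metric: weighted average of worst-case bits per neuron.
counts = np.array([212, 256])   # hidden, memory neurons (psMNIST hsLMU)
bits = np.array([1, 6])
bit_width = (counts * bits).sum() / counts.sum()
print(round(bit_width, 2))      # -> 3.74

def significant_bits(values):
    """Bits after stripping trailing zero bits, plus a sign bit if negative."""
    out = []
    for v in values:
        v = int(v)
        if v == 0:
            out.append(0)       # silent neurons contribute zero bits
            continue
        n = abs(v)
        while n % 2 == 0:       # drop trailing zero bits
            n //= 2
        out.append(n.bit_length() + (1 if v < 0 else 0))
    return np.array(out)

# e.g. 6 = 0b110 -> strip one trailing zero -> 0b11 -> 2 significant bits
print(significant_bits([0, 1, 2, 4, -3, 6]))
```

Averaging `significant_bits` over the actual spike counts observed per timestep (not shown here) is what yields the reported 1.26 bits.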
5 Discussion
Although the biological plausibility of a neuron that can output more than one spike "at once" is questionable, it is in fact mathematically equivalent to simulating the neuron with a timestep of $\gamma^{-1}$ and bundling the resulting spikes together (Krithivasan et al., 2019). Consequently, all of the networks we consider here can be implemented in one-bit spiking networks, although with an added time cost. This is similar to the LSNN's approach of simulating the network for two timesteps per image pixel, but does not incur the same cost in throughput. Alternatively, a space cost can be paid by replicating the neuron $\gamma$ times and uniformly spacing the initial $v_0$ (not shown). Likewise, negative spikes are a more compact and efficient alternative to duplicating the neurons and mirroring their activation functions.
Our methods are convenient to apply to the LMU because equation 7 accounts for the dynamics of the lowpass filter, and the memory vector naturally prefers the zero (i.e., silent) state. At the same time, it is a challenging test for the theory, since we do not train the LMU matrices – which are primarily responsible for accuracy on psMNIST (Voelker et al., 2019) – and RNNs tend to accumulate and propagate their errors over time. Notably, the method of Algorithm 1 can be applied to other neural network architectures, including feedforward networks.
6 Conclusions
We have presented a new algorithm and accompanying methods that allow interpolation between spiking and nonspiking networks. This allows the training of hSNNs, which can have mixtures of activity quantization, leading to computationally efficient neural network implementations. We have also shown how to incorporate standard SNN assumptions, such as the presence of a synaptic filter.
We demonstrated the technique on the recently proposed LMU, achieving better results on sMNIST than the state-of-the-art spiking network. Additionally, on the more challenging psMNIST task, the reported accuracy of our spiking network is better than that of any non-spiking RNN apart from the original LMU (Chandar et al., 2019; Voelker et al., 2019).
However, our focus here is not on accuracy per se, but efficient computation. In this context, the training procedure enables us to balance the accuracy of ANNs against the energy efficiency of SNNs by scheduling training to evaluate a series of networks in between these two extremes. In the cases we considered, we reduced the activities to 2–6 bits on average, saving at least 26 bits per activity over the standard LMU implementation with minimal impact on accuracy. While it is difficult to convert these metrics to energy savings in a hardware-agnostic manner, such optimizations can benefit both spiking and non-spiking architectures.
We anticipate that techniques like those we have outlined here will become more widely used as the demands of edge computing continue to grow. In such powerconstrained contexts, extracting as much efficiency as possible, while retaining sufficient accuracy, is central to the efforts involved in codesigning both algorithms and hardware for neural network workloads.
References
 Dither NN: an accurate neural network with dithering for low bit-precision hardware. In 2018 International Conference on Field-Programmable Technology (FPT), pp. 6–13. Cited by: §2, §3.1.1.
 Nengo: a Python tool for building large-scale functional brain models. Frontiers in Neuroinformatics 7, pp. 48. Cited by: §4.
 Long short-term memory and learning-to-learn in networks of spiking neurons. In Advances in Neural Information Processing Systems, pp. 787–797. Cited by: §1, §2, §4.1.1, §4.
 Event-driven signal processing with neuromorphic computing systems. In 45th International Conference on Acoustics, Speech, and Signal Processing, Cited by: §1.
 Benchmarking keyword spotting efficiency on neuromorphic hardware. In Proceedings of the 7th Annual Neuro-inspired Computational Elements Workshop, pp. 1–8. Cited by: §1.
 Towards non-saturating recurrent units for modelling long-term dependencies. arXiv preprint arXiv:1902.06704. Cited by: §4.2.1, §4.2.2, §6.
 Neuronal spike generation mechanism as an oversampling, noise-shaping A-to-D converter. In Advances in Neural Information Processing Systems, pp. 503–511. Cited by: §2.
 BinaryConnect: training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131. Cited by: §1.
 Loihi: a neuromorphic many-core processor with on-chip learning. IEEE Micro 38 (1), pp. 82–99. Cited by: §1.
 Theoretical neuroscience: computational and mathematical modeling of neural systems. MIT Press. Cited by: §3.3.
 An adaptive technique for spatial grayscale. In Proceedings of the Society of Information Display, Vol. 17, pp. 75–77. Cited by: §1, §3.1.
 The SpiNNaker project. Proceedings of the IEEE 102 (5), pp. 652–665. Cited by: §1, §4.2.2.
 Tensor processing using low precision format. Google Patents. Note: US Patent App. 15/624,577 Cited by: §1.

 Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256. Cited by: §4, §4.
 Deep learning with limited numerical precision. In International Conference on Machine Learning, pp. 1737–1746. Cited by: §1.
 Stochastic rounding and reduced-precision fixed-point arithmetic for solving neural ODEs. arXiv preprint arXiv:1904.11263. Cited by: §2.

 Channel gating neural networks. In Advances in Neural Information Processing Systems, pp. 1884–1894. Cited by: §2.
 Gradient descent for spiking neural networks. In Advances in Neural Information Processing Systems, pp. 1433–1443. Cited by: §1.
 Spiking deep networks with LIF neurons. arXiv preprint arXiv:1510.08829. Cited by: §1.

 Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713. Cited by: §2.
 In-datacenter performance analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. Cited by: §1.
 Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.
 Biophysics of computation: information processing in single neurons. Oxford university press. Cited by: §3.1.1.
 Dynamic spike bundling for energy-efficient spiking neural networks. In International Symposium on Low Power Electronics and Design, pp. 1–6. Cited by: §2, §5.

 A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941. Cited by: §4.
 Recherches sur l'attraction des sphéroïdes homogènes. Mémoires de Mathématiques et de Physique, présentés à l'Académie Royale des Sciences, pp. 411–435. Cited by: §3.2.
 Improving efficiency in neural network accelerator using operands Hamming distance optimization. In The 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing, Cited by: §4.2.2.
 Memory-efficient deep learning on a SpiNNaker 2 prototype. Frontiers in Neuroscience 12, pp. 840. Cited by: §1, §1, §4.2.2.
 Discovering low-precision networks close to full-precision networks for efficient embedded inference. arXiv preprint arXiv:1809.04191. Cited by: §1, §2, §4.2.2.
 Embedded deep neural networks: “The cost of everything and the value of nothing”. In Hot Chips 28 Symposium, pp. 1–20. Cited by: §1.
 NeuronFlow: a hybrid neuromorphic – dataflow processor architecture for AI workloads. In 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems, Cited by: §1.
 Bit efficient quantization for deep neural networks. arXiv preprint arXiv:1910.04877. Cited by: §1.
 Braindrop: a mixed-signal neuromorphic architecture with a dynamical-systems-based programming model. Proceedings of the IEEE 107 (1), pp. 144–164. Cited by: §2.
 Towards artificial general intelligence with hybrid Tianjic chip architecture. Nature 572 (7767), pp. 106–111. Cited by: §1, §2.
 NengoDL: combining deep learning and neuromorphic modelling methods. Neuroinformatics 17 (4), pp. 611–628. Cited by: §4.
 De l’attraction des sphéroïdes, correspondence sur l’École impériale polytechnique. Ph.D. Thesis, Thesis for the Faculty of Science of the University of Paris. Cited by: §3.2.

 TinyML: machine learning with TensorFlow Lite on Arduino and ultra-low-power microcontrollers. O'Reilly Media, Inc. External Links: ISBN 9781492052036. Cited by: §1.
 Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §2.
 Recognizing images with at most one spike per neuron. arXiv preprint arXiv:2001.01682. Cited by: §2.
 Efficient processing of deep neural networks: a tutorial and survey. Proceedings of the IEEE 105 (12), pp. 2295–2329. Cited by: §1.
 Legendre Memory Units: continuous-time representation in recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 15544–15553. Cited by: §3.2, §4.1.2, §4.2.1, §4.2.2, §4.2.2, §4, §5, §6.
 Improving spiking dynamical networks: accurate delays, higher-order synapses, and time cells. Neural Computation 30 (3), pp. 569–609. Cited by: §3.3.
 Dynamical systems in spiking neuromorphic hardware. PhD Thesis, University of Waterloo. Cited by: §3.1.1, §3.2.
 Energy-aware neural architecture optimization with fast splitting steepest descent. arXiv preprint arXiv:1910.03103. Cited by: §1.
 Cold case: the lost MNIST digits. In Advances in Neural Information Processing Systems, pp. 13443–13452. Cited by: §4.
 LIF and simplified SRM neurons encode signals into spikes via a form of asynchronous pulse sigma–delta modulation. IEEE transactions on neural networks and learning systems 28 (5), pp. 1192–1205. Cited by: §2.
 Conversion of synchronous artificial neural network to asynchronous spiking neural network using sigma-delta quantization. In 1st IEEE International Conference on Artificial Intelligence Circuits and Systems, pp. 81–85. Cited by: §2, §3.1.1.
 Asynchronous spiking neurons, the natural key to exploit temporal sparsity. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9 (4), pp. 668–678. Cited by: §2.
 Spiketrain level backpropagation for training deep recurrent spiking neural networks. In Advances in Neural Information Processing Systems, pp. 7800–7811. Cited by: §1.
 Hello edge: keyword spotting on microcontrollers. arXiv preprint arXiv:1711.07128. Cited by: §1.
 To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878. Cited by: §1.