A Spike in Performance: Training Hybrid-Spiking Neural Networks with Quantized Activation Functions

02/10/2020, by Aaron R. Voelker, et al.

The machine learning community has become increasingly interested in the energy efficiency of neural networks. The Spiking Neural Network (SNN) is a promising approach to energy-efficient computing, since its activation levels are quantized into temporally sparse, one-bit values (i.e., "spike" events), which additionally converts the sum over weight-activity products into a simple addition of weights (one weight for each spike). However, the goal of maintaining state-of-the-art (SotA) accuracy when converting a non-spiking network into an SNN has remained an elusive challenge, primarily due to spikes having only a single bit of precision. Adopting tools from signal processing, we cast neural activation functions as quantizers with temporally-diffused error, and then train networks while smoothly interpolating between the non-spiking and spiking regimes. We apply this technique to the Legendre Memory Unit (LMU) to obtain the first known example of a hybrid SNN outperforming SotA recurrent architectures—including the LSTM, GRU, and NRU—in accuracy, while reducing activities to at most 3.74 bits on average with 1.26 significant bits multiplying each weight. We discuss how these methods can significantly improve the energy efficiency of neural networks.


1 Introduction

The growing amount of energy consumed by Artificial Neural Networks (ANNs) has been identified as an important problem in the context of mobile, IoT, and edge applications (Moloney, 2016; Zhang et al., 2017; McKinstry et al., 2018; Wang et al., 2019; Situnayake and Warden, 2019). The vast majority of an ANN’s time and energy is consumed by the multiply-accumulate (MAC) operations implementing the weighting of activities between layers (Sze et al., 2017). Thus, many ANN accelerators focus almost entirely on optimizing MACs (e.g. Ginsburg et al., 2017; Jouppi et al., 2017), while practitioners prune (Zhu and Gupta, 2017; Liu et al., 2018) and quantize (Gupta et al., 2015; Courbariaux et al., 2015; McKinstry et al., 2018; Nayak et al., 2019) weights to reduce the use and size of MAC arrays.

While these strategies focus on the weight matrix, the Spiking Neural Network (SNN) community has taken a different but complementary approach that instead focuses on temporal processing. The operations of an SNN are temporally sparsified, such that an accumulate only occurs whenever a “spike” arrives at its destination. These sparse, one-bit activities (i.e., “spikes”) not only reduce the volume of data communicated between nodes in the network (Furber et al., 2014), but also replace the multipliers in the MAC arrays with adders – together providing orders of magnitude gains in energy efficiency (Davies et al., 2018; Blouw et al., 2019).

The conventional method of training an SNN is to first train an ANN, replace the activation functions with spiking neurons that have identical firing rates (Hunsberger and Eliasmith, 2015), and then optionally retrain with spikes on the forward pass and a differentiable proxy on the backward pass (Huh and Sejnowski, 2018; Bellec et al., 2018; Zhang and Li, 2019). However, converting an ANN into an SNN often degrades model accuracy – especially for recurrent networks. Thus, multiple hardware groups have started building hybrid architectures that support ANNs, SNNs, and mixtures thereof (Liu et al., 2018; Pei et al., 2019; Moreira et al., 2020) – for instance, by supporting event-based activities, fixed-point representations, and a variety of multi-bit coding schemes. These hybrid platforms present the alluring possibility of trading accuracy for energy in task-dependent ways (Blouw and Eliasmith, 2020). However, principled methods of leveraging such trade-offs are lacking.

In this work, we propose to our knowledge the first method of training hybrid-spiking networks (hSNNs) by smoothly interpolating between ANN (i.e., 32-bit activities) and SNN (i.e., 1-bit activities) regimes. The key idea is to interpret spiking neurons as one-bit quantizers that diffuse their quantization error across future time-steps – similar to Floyd and Steinberg (1976) dithering. This idea can be readily applied to any activation function at little additional cost, generalizes to quantizers with arbitrary bit-widths (even fractional), provides strong bounds on the quantization errors, and relaxes in the limit to the ideal ANN.

Our methods enable the training procedure to balance the accuracy of ANNs with the energy efficiency of SNNs by evaluating the continuum of networks in between these two extremes. Furthermore, we show that this method can train hSNNs with superior accuracy to ANNs and SNNs trained via conventional methods. In a sense, we show that it is useful to think of spiking and non-spiking networks as extremes in a continuum. As a result, the set of hSNNs captures networks with any mixture of activity quantization throughout the architecture.

2 Related Work

Related work has investigated the quantization of activation functions in the context of energy-efficient hardware (e.g., Jacob et al., 2018; McKinstry et al., 2018). Likewise, Hopkins et al. (2019) consider stochastic rounding and dithering as a means of improving the accuracy of spiking neuron models on low-precision hardware relative to their ideal ordinary differential equations (ODEs). Neither of these approaches accounts for the quantization errors that accumulate over time, whereas our proposed method keeps these errors bounded.

Some have viewed spiking neurons as one-bit quantizers, or analog-to-digital converters (ADCs), including Chklovskii and Soudry (2012); Yoon (2016); Ando et al. (2018); Neckar et al. (2018); Yousefzadeh et al. (2019a, b). However, these methods are neither generalized to multi-bit or hybrid networks, nor leveraged to interpolate between non-spiking and spiking networks during training.

There also exist other methods that introduce temporal sparsity into ANNs. One such example is channel gating (Hua et al., 2019), whereby the channels in a CNN are dynamically pruned over time. Another example is dropout (Srivastava et al., 2014) – a form of regularization that randomly drops out activities during training. The gating mechanisms in both cases are analogous to spiking neurons.

Neurons that can output multi-bit spikes have been considered in the context of packets that bundle together neighbouring spikes (Krithivasan et al., 2019). In contrast, this work directly computes the number of spikes, in constant time and memory per neuron, and varies the temporal resolution during training to interpolate between the non-spiking and spiking regimes, allowing optimization across the full set of hSNNs.

Our methods are motivated by some of the recent successes in training SNNs to compete with ANNs on standard machine learning benchmarks (Bellec et al., 2018; Stöckl and Maass, 2019; Pei et al., 2019). To our knowledge, this work is the first to parameterize the activation function in a manner that places ANNs and SNNs on opposite ends of the same spectrum. We show that this idea can be used to convert ANNs to SNNs, and to train hSNNs with improved accuracy relative to pure (i.e., 1-bit) SNNs and energy efficiency relative to pure (i.e., 32-bit) ANNs.

Figure 1: Visualizing the output of Algorithm 1, with the ReLU as the ideal activation function, given an MNIST digit as input. The bit-width of the quantizer is varied: the finest setting corresponds to the activities of a 32-bit ANN, whereas the coarsest (one-bit) setting corresponds to those of an SNN.

3 Methods

3.1 Quantized Activation Functions

We now formalize our method of quantizing any activation function. In short, the algorithm quantizes the activity level and then pushes the quantization error onto the next time-step – analogous to the concept of using error diffusion to dither a one-dimensional time-series (Floyd and Steinberg, 1976). The outcome is a neuron model that interpolates an arbitrary activation function between the non-spiking and spiking regimes through the choice of a single parameter, which acts like a time-step.

3.1.1 Temporally-Diffused Quantizer

Let the input to the activation function at each discrete time-step be given, such that the ideal output (i.e., with unlimited precision) is the activation applied to that input. The algorithm maintains one scalar state-variable across time that tracks the total quantization error the neuron has accumulated. We recommend initializing this state independently for each neuron. The output of the neuron at each time-step is then determined by Algorithm 1.

Algorithm 1 Temporally-Diffused Quantizer. Input: the current input to the activation function and the time-step parameter. State: the quantization error carried over from the previous time-step. Output: the quantized activity level for the current time-step.
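Because the algorithm listing does not survive in this rendering, the following NumPy sketch illustrates the mechanism as described in the surrounding text; the names used here (f for the activation, omega for the window-length/time-step parameter, v for the carried-over error) are our own and are not necessarily the symbols of the original listing.

```python
import numpy as np

def temporally_diffused_quantizer(f, x_seq, omega=1.0, v0=0.0):
    """Quantize the activation f along a time-series x_seq.

    Each step adds the expected spike count f(x) * omega to the carried-over
    error v, emits the integer part as spikes, and diffuses the fractional
    remainder onto the next time-step.
    """
    v, outputs = v0, []
    for x in x_seq:
        expected = f(x) * omega           # expected spikes within the window
        spikes = np.floor(v + expected)   # whole spikes emitted this step
        v = v + expected - spikes         # residual error carried forward
        outputs.append(spikes / omega)    # activity normalized by window length
    return np.array(outputs)

# A constant input of 0.3 through a ReLU: with omega = 1 the output is a
# one-bit spike train whose running mean approaches 0.3; with a large omega
# the output is nearly the ideal activation at every step.
relu = lambda x: np.maximum(x, 0.0)
print(temporally_diffused_quantizer(relu, [0.3] * 10, omega=1.0))
print(temporally_diffused_quantizer(relu, [0.3] * 10, omega=256.0))
```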

The ideal activation may be any conventional nonlinearity (e.g., the ReLU or sigmoid), or the time-averaged response curve corresponding to a biological neuron model (e.g., leaky integrate-and-fire), including those with multiple internal state-variables (Koch, 2004). Adaptation may also be modelled by including a recurrent connection from the neuron's output back onto its input (Voelker, 2019, section 5.2.1).

To help understand the relationship between this algorithm and spiking neuron models, it is useful to interpret the output as the number of spikes that occur across a window of time, normalized by the length of this window. The ideal activation multiplied by the window length then represents the expected number of spikes across the window, and the accumulated error tracks progress towards the next spike.

We note that Algorithm 1 is equivalent to Ando et al. (2018, Algorithm 1) when the ideal activation is the rectified linear (ReLU) function and the window spans a single time-step. Yousefzadeh et al. (2019a, Algorithm 1) extend this to represent changes in activation levels and to allow negative spikes. Still considering the ReLU activation, Algorithm 1 is again equivalent to the spiking integrate-and-fire (IF) neuron model without a refractory period, with the membrane voltage and firing rate appropriately normalized, and the ODE discretized using zero-order hold (ZOH). The time-step parameter essentially generalizes the spiking model to allow multiple spikes per time-step, and the IF restriction is lifted to allow arbitrary activation functions (including leaky neurons, and those with negative spikes).
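As a concrete check of this equivalence (in the notation of our sketch above, not the paper's), the snippet below compares the quantizer restricted to one spike per step against a textbook integrate-and-fire neuron with its membrane voltage normalized to the unit interval; the two produce identical spike trains for non-negative inputs.

```python
import numpy as np

def quantize_1bit(f, xs, v=0.0):
    """Algorithm 1 with a one-step window: emits 0 or 1 spike per step."""
    out = []
    for x in xs:
        a = f(x)                  # expected spikes this step, assumed in [0, 1)
        s = np.floor(v + a)       # 0 or 1
        v = v + a - s             # diffuse the remainder
        out.append(s)
    return np.array(out)

def if_neuron(rates, v=0.0):
    """Integrate-and-fire: integrate the normalized rate, spike at threshold 1."""
    out = []
    for r in rates:
        v += r
        s = 1.0 if v >= 1.0 else 0.0
        v -= s
        out.append(s)
    return np.array(out)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
relu = lambda z: np.maximum(z, 0.0)
assert np.allclose(quantize_1bit(relu, x), if_neuron(relu(x)))  # same spikes
```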

3.1.2 Scaling Properties

We now state several important properties of this quantization algorithm (see the supplementary material for proofs). For convenience, we assume the range of the ideal activation is scaled to lie within the unit interval over the domain of valid inputs (this can also be achieved via clipping or saturation).

Zero-Mean Error

Under the recommended initialization of the error state, the expected quantization error is zero.

Bounded Error

The total quantization error is bounded across any consecutive slice of time-steps, independent of the length of the slice. As a corollary, the signal-to-noise ratio (SNR) of the quantized output grows with the window length, and this SNR may be further scaled by the time-constant of a lowpass filter (see section 3.3).

Bit-Width

The number of bits required to represent the output in binary grows only logarithmically with the window length if the ideal activation is non-negative (plus a sign bit if it can be negative).

ANN Regime

As the window length grows arbitrarily large, the quantization error vanishes, and hence the activation function becomes equivalent to the ideal activation.

SNN Regime

When the window spans a single time-step, the activation function becomes a conventional spiking neuron, since it outputs either zero or a spike if the ideal activation is non-negative (and optionally a negative spike if the activation is allowed to be negative).

Temporal Sparsity

The number of spikes emitted scales in proportion to the ideal activation level, so smaller activities are temporally sparser.

To summarize, the choice of window length results in activities that require a number of bits logarithmic in that length to represent, while achieving an SNR that grows with the window length relative to the ideal function. The effect of the algorithm is depicted in Figure 1 for various settings of this parameter.
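The bounded-error property can be checked numerically with the quantizer from our earlier sketch; the constants below (a mid-range initial state, uniform inputs, and omega as the window length) are illustrative assumptions rather than the paper's experimental settings.

```python
import numpy as np

def quantizer_errors(f, xs, omega, v=0.5):
    """Per-step error between the quantized and ideal activities."""
    errs = []
    for x in xs:
        a = f(x) * omega
        s = np.floor(v + a)
        v = v + a - s
        errs.append(s / omega - f(x))
    return np.array(errs)

rng = np.random.default_rng(1)
xs = rng.uniform(0, 1, size=10_000)
relu = lambda z: np.maximum(z, 0.0)

for omega in (1, 4, 16, 64):
    errs = quantizer_errors(relu, xs, omega)
    # The error accumulated over any prefix of time-steps never exceeds the
    # value of one spike (1 / omega), no matter how many steps are summed.
    print(omega, float(np.abs(np.cumsum(errs)).max()), "<=", 1 / omega)
```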

3.1.3 Backpropagation Training

To train the network via backpropagation, we make the simplifying assumption that the carried-over quantization errors are i.i.d. random variables (see supplementary), which implies that the quantized output is the ideal activation plus uncorrelated zero-mean noise. This justifies assigning a gradient of zero to the noise term. The forward pass uses the quantized activation function to compute the true error for the current window length, while the backward pass uses the gradient of the ideal activation (independently of the window length). Therefore, the training method takes into account the temporal mechanisms of spike generation, but allows the gradient to skip over the sequence of operations that keep the total spike noise bounded.
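One way to realize this forward/backward split in practice is the straight-through construction sketched below in TensorFlow (chosen because the experiments are built on NengoDL, which runs on TensorFlow); this is our illustration of the idea, not the authors' implementation, and the symbols follow the notation of our earlier sketches.

```python
import tensorflow as tf

def quantized_activation(x, v, f=tf.nn.relu, omega=1.0):
    """Forward pass: temporally-diffused quantization of f(x), carrying the
    error state v. Backward pass: gradient of the ideal activation f only."""
    a = f(x)
    expected = a * omega
    spikes = tf.floor(v + expected)
    new_v = v + expected - spikes
    quantized = spikes / omega
    # Straight-through trick: the value of `quantized`, the gradient of `a`.
    out = a + tf.stop_gradient(quantized - a)
    return out, tf.stop_gradient(new_v)

# The loss is computed from the quantized activities, while gradients flow
# through the underlying ReLU as if no quantization had occurred.
x = tf.Variable([[0.3, -0.2, 0.9]])
v = tf.zeros_like(x)
with tf.GradientTape() as tape:
    y, v = quantized_activation(x, v, omega=4.0)
    loss = tf.reduce_sum(y ** 2)
print(y.numpy(), tape.gradient(loss, x).numpy())
```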

3.2 Legendre Memory Unit

As an example application of these methods we use the Legendre Memory Unit (LMU; Voelker et al., 2019) – a new type of Recurrent Neural Network (RNN) that efficiently orthogonalizes the continuous-time history of some signal, $u(t)$, across a sliding window of length $\theta$. The network is characterized by the following coupled ODEs:

$$\theta \dot{m}(t) = A m(t) + B u(t) \qquad (1)$$

where $m(t)$ is a $d$-dimensional memory vector, and the matrices $(A, B)$ have a closed-form solution (Voelker, 2019):

$$A = [a]_{ij} \in \mathbb{R}^{d \times d}, \quad a_{ij} = (2i + 1)\begin{cases} -1 & i < j \\ (-1)^{i-j+1} & i \ge j \end{cases}, \qquad B = [b]_{i} \in \mathbb{R}^{d \times 1}, \quad b_{i} = (2i + 1)(-1)^{i}, \qquad i, j \in [0, d-1]. \qquad (2)$$

The key property of this dynamical system is that $m(t)$ represents sliding windows of $u(t)$ via the Legendre (1782) polynomials up to degree $d - 1$:

$$u(t - \theta') \approx \sum_{i=0}^{d-1} \mathcal{P}_i\!\left(\frac{\theta'}{\theta}\right) m_i(t), \qquad 0 \le \theta' \le \theta, \qquad (3)$$

where $\mathcal{P}_i$ is the $i$-th shifted Legendre polynomial (Rodrigues, 1816). Thus, nonlinear functions of $m(t)$ correspond to functions across windows of $u(t)$ of length $\theta$, projected onto the Legendre basis.

Discretization

We map these equations onto the state of an RNN, $m_t$, given some input $u_t$, indexed at discrete moments in time $t \in \mathbb{N}$:

$$m_t = f_m\!\left(\bar{A} m_{t-1} + \bar{B} u_t\right) \qquad (4)$$

where $(\bar{A}, \bar{B})$ are the ZOH-discretized matrices from equation 2 for a time-step of $1 / \bar{\theta}$, such that $\bar{\theta}$ is the desired memory length expressed in discrete time-steps. In the ideal case, $f_m$ should be the identity function. For our hSNNs, we clip and quantize $f_m$ using Algorithm 1.
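A minimal sketch of this construction, assuming the closed form in equation 2 and SciPy's standard ZOH discretization routine, is given below; the sizes in the usage line (d = 128, theta_bar = 784) are merely illustrative values matching the sMNIST configuration of Table 1.

```python
import numpy as np
from scipy.signal import cont2discrete

def lmu_matrices(d, theta_bar):
    """Build the LMU's (A, B) from equation 2 and ZOH-discretize them with a
    time-step of 1 / theta_bar, as described in the Discretization paragraph."""
    q = np.arange(d)
    r = 2 * q + 1
    A = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            A[i, j] = r[i] * (-1.0 if i < j else (-1.0) ** (i - j + 1))
    B = (r * (-1.0) ** q)[:, None]
    C, D = np.ones((1, d)), np.zeros((1, 1))     # placeholders for cont2discrete
    Abar, Bbar, *_ = cont2discrete((A, B, C, D), dt=1.0 / theta_bar, method="zoh")
    return Abar, Bbar

Abar, Bbar = lmu_matrices(d=128, theta_bar=784)
print(Abar.shape, Bbar.shape)   # (128, 128) (128, 1)
```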

Architecture

The LMU takes an input vector, $x_t$, and generates a hidden state, $h_t$. The hidden state and the memory vector, $m_t$, correspond to the activities of two neural populations that we will refer to as the hidden neurons and memory neurons, respectively. The hidden neurons mutually interact with the memory neurons in order to compute nonlinear functions across time, while dynamically writing to memory. The hidden state is a function of the input, previous state, and current memory:

$$h_t = f\!\left(W_x x_t + W_h h_{t-1} + W_m m_t\right) \qquad (5)$$

where $f$ is some chosen nonlinearity – to be quantized using Algorithm 1 – and $W_x$, $W_h$, $W_m$ are learned weight matrices. The scalar input to the memory, $u_t$, is:

$$u_t = e_x^{\mathsf{T}} x_t + e_h^{\mathsf{T}} h_{t-1} + e_m^{\mathsf{T}} m_{t-1} \qquad (6)$$

where $e_x$, $e_h$, $e_m$ are learned encoding vectors.
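Putting equations 4–6 together, one (non-spiking) LMU step can be sketched as the function below; the shapes and parameter ordering are our own conventions, the matrices Abar and Bbar are the discretized pair from the sketch above, and the quantized variant would additionally wrap the two activity vectors with Algorithm 1 and the synaptic filters of section 3.3.

```python
import numpy as np

def lmu_step(x, h, m, params, Abar, Bbar, f=np.tanh):
    """One ideal (non-spiking) discrete LMU update following equations 4-6.

    x: (k,) input vector; h: (n,) hidden state; m: (d,) memory vector.
    params = (W_x, W_h, W_m, e_x, e_h, e_m) with shapes
    (n, k), (n, n), (n, d), (k,), (n,), (d,).
    """
    W_x, W_h, W_m, e_x, e_h, e_m = params
    u = e_x @ x + e_h @ h + e_m @ m              # eq. 6: scalar write signal
    m_new = Abar @ m + Bbar[:, 0] * u            # eq. 4 with f_m = identity
    h_new = f(W_x @ x + W_h @ h + W_m @ m_new)   # eq. 5: hidden nonlinearity
    return h_new, m_new
```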

3.3 Synaptic Filtering

SNNs commonly apply a synapse model to the weighted summation of spike-trains. This filters the input to each neuron over time to reduce the amount of spike noise (Dayan and Abbott, 2001). The synapse is most commonly modelled as a lowpass filter, with some chosen time-constant, whose effect is equivalent to replacing each spike with an exponentially decaying kernel.

By lowpass filtering the activities, the SNR of Algorithm 1 is effectively boosted by a factor proportional to the filter's time-constant relative to the filtered ideal, since the filtered error becomes a weighted time-average of the quantization errors (see supplementary). Thus, we lowpass filter the inputs into both the hidden and the memory neurons.
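For reference, a discrete lowpass synapse of this kind reduces to a one-line recurrence; the decay coefficient below follows a standard ZOH discretization of a first-order lowpass with time-constant tau (in time-steps), which is our reading of the model rather than the exact implementation used in the experiments.

```python
import numpy as np

def lowpass_filter(spikes, tau):
    """Filter a spike train with a ZOH-discretized first-order lowpass.

    Each incoming spike is effectively replaced by an exponentially decaying
    kernel with time-constant `tau`, expressed in discrete time-steps.
    """
    a = np.exp(-1.0 / tau)        # per-step decay factor
    y, out = 0.0, []
    for s in spikes:
        y = a * y + (1.0 - a) * s
        out.append(y)
    return np.array(out)
```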

To account for the temporal dynamics introduced by the application of a lowpass filter, Voelker and Eliasmith (2018, equation 4.7) prove that the LMU's discretized state-space matrices, $(\bar{A}, \bar{B})$, should be exchanged with $(\bar{A}', \bar{B}')$:

$$\bar{A}' = \frac{\bar{A} - \bar{a} I}{1 - \bar{a}}, \qquad \bar{B}' = \frac{\bar{B}}{1 - \bar{a}}, \qquad \bar{a} = e^{-1/\bar{\tau}}, \qquad (7)$$

where $\bar{\tau}$ is the time-constant (in discrete time-steps) of the ZOH-discretized lowpass that is filtering the input to $m_t$.

To summarize, the architecture that we train includes a nonlinear layer (the hidden neurons, $h_t$) and a linear layer (the memory neurons, $m_t$), each of which has synaptic filters. The recurrent and input weights to the memory are fixed to $\bar{A}'$ and $\bar{B}'$ and are not trained. All other connections are trained.

3.4 SNR Scheduling

To interpolate between ANN and SNN regimes, we set the quantizer's window-length parameter differently from one training epoch to the next, in a manner analogous to scheduling learning rates. Since this parameter is exponential in bit-precision, we vary it on a logarithmic scale, starting from a value chosen to achieve rapid convergence during the initial stages of training and ending at a value that depends on the hardware and application. Once the parameter reaches its final value, training is continued until validation error stops improving, and only the model with the lowest validation loss during this fine-tuning phase is saved.

We found that this method of scheduling typically results in faster convergence versus the alternative of starting the parameter at its final value. We also observe that the SNR of the memory neurons is far more critical than the SNR of the hidden neurons, and thus we schedule the two differently (explained below).
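A minimal version of such a schedule, using our omega parameterization from section 3.1 and hypothetical start and end values, interpolates the exponent linearly across epochs:

```python
import numpy as np

def omega_schedule(omega_start, omega_end, n_epochs):
    """Logarithmically interpolate the quantizer's window-length parameter,
    returning one value per training epoch."""
    return np.logspace(np.log10(omega_start), np.log10(omega_end), n_epochs)

# Hypothetical example: anneal the hidden neurons toward the 1-bit spiking
# regime while holding the memory neurons at a coarser multi-bit endpoint.
print(omega_schedule(256.0, 1.0, n_epochs=10))
print(omega_schedule(256.0, 4.0, n_epochs=10))
```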


Network | Trainable | Weights | Nonlinearities | State | Levels | Steps | Test (%)
LSTM | 67850 | 67850 | 384 sigmoid, 128 tanh | 256 | 2^32 | 784 | 98.5
LMU | 34571 | 51083 | 128 sigmoid | 256 | 2^32 | 784 | 98.26
hsLMU | 34571 | 51083 | 128 LIF, 128 IF | 522 | 2–5 | 784 | 97.26
LSNN | 68210 | 8185 | 120 LIF, 100 Adaptive | ≥ 550 | 2 | 1680 | 96.4

Table 1: Performance of RNNs on the sequential MNIST task.

4 Experiments

To facilitate comparison between the “Long Short-Term Memory Spiking Neural Network” (LSNN) from Bellec et al. (2018), and both spiking and non-spiking LMUs (Voelker et al., 2019), we consider the sequential MNIST (sMNIST) task and its permuted variant (psMNIST; Le et al., 2015). For sMNIST, the pixels are supplied sequentially in a time-series of length 784. Thus, the network must maintain a memory of the relevant features while simultaneously computing across them in time. For psMNIST, all of the sequences are also permuted by an unknown fixed permutation matrix, which distorts the temporal structure in the sequences and significantly increases the difficulty of the task. In either case, the network outputs a classification at the end of each input sequence. For the output classification, we take the argmax over a dense layer with 10 units, with incoming weights initialized using the Xavier uniform distribution (Glorot and Bengio, 2010). The network is trained using the categorical crossentropy loss function (fused with softmax).

All of our LMU networks are built in Nengo (Bekolay et al., 2014) and trained using NengoDL (Rasmussen, 2019). The 50k “lost MNIST digits” (Yadav and Bottou, 2019), which do not overlap with MNIST's train or test sets, are used as validation data to select the best model. All sequences are normalized via a fixed linear transformation of the raw pixel values. We train with minibatches and the Adam optimizer (Kingma and Ba, 2014), using all of its default hyperparameters (learning rate 0.001, β₁ = 0.9, β₂ = 0.999).

To quantize the hidden activations, we use the leaky integrate-and-fire (LIF) neuron model with a refractory period of 1 time-step and a leak of 10 time-steps (corresponding to Nengo's defaults given a time-step of 2 ms), such that its response curve is normalized to the unit interval. The input to each LIF neuron is biased and scaled to map onto this normalized response curve (see supplementary). During training, the window length for the hidden neurons is interpolated down to a single time-step. Thus, the hidden neurons in the fully trained networks are conventional (1-bit) spiking neurons.

To quantize the memory activations, we use a clipped linear (identity) activation, which is analogous to using IF neurons that can generate both positive and negative spikes. To maintain accuracy, the window length for the memory neurons is interpolated to a larger final value than that of the hidden neurons, and to a larger final value for psMNIST than for sMNIST. We provide details below regarding the effect of these choices on the number of possible activity levels for the memory neurons, and discuss the impact this has on MAC operations as well as the consequences for energy-efficient neural networks.

The synaptic lowpass filters have a time-constant of 200 time-steps for the activities projecting into the memory neurons, and 10 time-steps for the activities projecting into the hidden neurons. The output layer also uses a 10 time-step lowpass filter. We did not experiment with any other choice of time-constants.

All weights are initialized to zero, except: the input encoder $e_x$ is initialized to one, $W_m$ is initialized using the Xavier normal distribution (Glorot and Bengio, 2010), and $(\bar{A}', \bar{B}')$ are initialized according to equation 7 and left untrained. L2-regularization is added to the output vector. We did not experiment with batch normalization, layer normalization, dropout, or any other regularization techniques.


Network | Trainable | Weights | Nonlinearities | Bit-Width | Significant Bits | Test (%)
LSTM | 163610 | 163610 | 600 sigmoid, 200 tanh | 32 | N/A | 89.86
LMU | 102027 | 167819 | 212 tanh | 32 | N/A | 97.15
hsLMU | 102239 | 168031 | 212 LIF, 256 IF | 3.74 | 1.26 | 96.83

Table 2: Performance of RNNs on the permuted sequential MNIST task.

4.1 Sequential MNIST

4.1.1 State-of-the-Art

The LSTM and LSNN results shown in Table 1 have been extended from Bellec et al. (2018, Tables S1 and S2). We note that these two results (98.5% and 96.4%) represent the best test accuracy among 12 separately trained models, without any validation. The mean test performances across the same 12 runs are 79.8% and 93.8% for the LSTM and LSNN, respectively.

The LSTM consists of only 128 “units,” but is computationally and energetically intensive since it maintains a 256-dimensional vector of 32-bit activities that are multiplied with over 67k weights. The LSNN improves this in two important ways. First, the activities of its 220 neurons are all one bit (i.e., spikes). Second, the number of parameters is pruned down to just over 8k weights. Thus, each time-step consists of at most 8k synaptic operations that simply add a weight to the synaptic state of each neuron, followed by local updates to each synapse and neuron model.

However, the LSNN suffers from half the throughput (each input pixel is presented for two time-steps rather than one), a latency of 112 additional time-steps to accumulate the classification after the image has been presented (resulting in a total of 1680 steps), and an absolute 2.1% decrease in test accuracy. In addition, at least 550 state-variables (220 membrane voltages, 100 adaptive thresholds, 220 lowpass filter states, 10 output filter states, plus state for an optional delay buffer attached to each synapse) are required to maintain memory between time-steps. The authors state that the input to the LSNN is preprocessed using 80 more neurons that fire whenever the pixel value crosses over a fixed value associated with each neuron, to obtain “somewhat better performance.”

Figure 2: Distribution of activity levels for the memory neurons, $m_t$, in the hsLMU network solving the sMNIST task (Top; see Table 1) and the psMNIST task (Bottom; see Table 2).

4.1.2 Non-Spiking LMU

The non-spiking LMU is the Nengo implementation from Voelker et al. (2019) with a 128-dimensional hidden state and a 128-dimensional memory (see Table 1), the sigmoid activation chosen for the hidden nonlinearity, and a trainable bias vector added to the hidden neurons.

This network obtains a test accuracy of 98.26%, while using only 128 nonlinearities, and training nearly half as many weights as the LSTM or LSNN. However, the MAC operations are still a bottleneck, since each time-step requires multiplying a 256-dimensional vector of 32-bit activities with approximately 51k weights (including the fixed $\bar{A}$ and $\bar{B}$ matrices).

4.1.3 Hybrid-Spiking LMU

To simplify the MAC operations, we quantize the activity functions and filter their inputs (see section 3). We refer to this as a “hybrid-spiking LMU” (hsLMU) since the hidden neurons are conventional (i.e., one-bit) spiking LIF neurons, but the memory neurons can assume a multitude of distinct activation levels by generating multiple spikes per time-step.

By training until the memory neurons reach their final window length, each memory neuron can spike at 5 different activity levels (see Figure 2; Top). We remark that the distribution is symmetric about zero, and “prefers” the zero state (51.23%), since equation 1 has exactly one stable point: the origin. As well, the hidden neurons spike only 36.05% of the time. As a result, the majority of weights are not needed on any given time-step. Furthermore, when a weight is accessed, it is simply added for the hidden activities, or multiplied by a small integer spike count for the memory activities.

These performance benefits come at the cost of a 1% decrease in test accuracy, and additional state and computation—local to each neuron—to implement the lowpass filters and Algorithm 1. Specifically, this hsLMU requires 522 state-variables (256 membrane voltages, 256 lowpass filters, and 10 output filters). This network outperforms the LSNN, does not sacrifice throughput nor latency, and does not require special preprocessing of the input data.

4.2 Permuted Sequential MNIST

4.2.1 State-of-the-Art

Several RNN models have been used to solve the psMNIST task (Chandar et al., 2019), with the highest accuracy of 97.15% being achieved by an LMU (Voelker et al., 2019) with 212 hidden units and a 256-dimensional memory. The LMU result, and the LSTM result from Chandar et al. (2019), are reproduced in Table 2.

Figure 3: An example of each hybrid-spiking Legendre Memory Unit (hsLMU) network producing the correct classification given a test digit for the sMNIST task (Top; see Table 1) and the psMNIST task (Bottom; see Table 2). The recurrent network consists of one-bit spiking LIF neurons (representing $h_t$) coupled with multi-bit spiking IF neurons (representing $m_t$). Classifications are obtained by taking an argmax of the output layer on the final time-step of each sequence.

4.2.2 Hybrid-Spiking LMU

We consider the same network from section 4.1.3, scaled up to 212 hidden neurons and a 256-dimensional memory. Consistent with the previous hsLMU, the hidden neurons are spiking LIF, and the memory neurons are multi-bit IF neurons that can generate multiple positive or negative spikes per step. In particular, by training until the memory neurons reach their final window length, each memory neuron can spike between -24 and +26 times (inclusive) per step for a total of 50 distinct activity levels, which requires 6 bits to represent.

Again, the distribution of memory activities is symmetric about zero, and 17.71% of the time the neurons are silent. The 1-bit hidden neurons spike 40.24% of the time. We note that the hsLMU uses 212 more parameters than the LMU from Voelker et al. (2019), as the latter does not include a bias on the hidden nonlinearities.

To quantify the performance benefits of low-precision activities, we propose the following two metrics. The first is the worst-case number of bits required to communicate the activity of each neuron, in this case 1 bit for the hidden neurons and 6 bits for the memory neurons, which has a weighted average of approximately 3.74 bits. The second is the number of bits that are significant (i.e., after removing all of the trailing zero bits, and including a sign bit for negative activities), which has a weighted average of approximately 1.26 bits.

The “bit-width” metric is useful for determining the worst-case volume of spike traffic on hardware where the size of the activity vectors are user-configurable (Furber et al., 2014; Liu et al., 2018), and for hardware where the quantization of activities leads to quadratic improvements in silicon area and energy requirements (McKinstry et al., 2018). The “significant bits” metric reflects how many significant bits are multiplied with each weight, which is important for hardware where bit-flips in the datapath correlate with energy costs (Li et al., 2019), or hardware that is optimized for integer operands close to zero. For instance, a value of 1 for this metric would imply that each MAC, on average, only needs to accumulate its weight (i.e., no multiply is required). These performance benefits come at the cost of a 0.32% decrease in test accuracy, which still outperforms all other RNNs considered by Chandar et al. (2019); Voelker et al. (2019) apart from the LMU, while using comparable resources and parameter counts.

Interestingly, for the sMNIST network in section 4.1.3, the bit-width metric is exactly 2 (as there are an equal number of hidden (1-bit) and memory (3-bit) neurons). The significant bits for that network is 0.58, because a majority of the neurons are inactive on each time step.
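The two metrics can be computed directly from the integer spike counts emitted by each neuron on every time-step. The sketch below implements our reading of their definitions (worst-case bits per neuron, and bits remaining after stripping trailing zeros, plus a sign bit for negative counts); the exact accounting behind Tables 1 and 2 may differ in detail, and the random counts in the usage example are purely illustrative.

```python
import numpy as np

def bit_width(counts):
    """Worst-case bits needed to communicate each neuron's activity, averaged
    over neurons. `counts` holds integer spike counts, shaped (steps, neurons)."""
    maxima = np.abs(counts).max(axis=0)
    bits = np.ceil(np.log2(maxima + 1)) + (counts.min(axis=0) < 0)  # + sign bit
    return float(bits.mean())

def _sig_bits(c):
    """Significant bits of a single spike count (plus a sign bit if negative)."""
    if c == 0:
        return 0
    m = abs(int(c))
    while m % 2 == 0:       # strip trailing zero bits
        m //= 2
    return m.bit_length() + (c < 0)

def significant_bits(counts):
    """Average number of significant bits multiplied with each weight."""
    return float(np.mean([_sig_bits(c) for c in np.ravel(counts)]))

# Illustrative check: one neuron spiking in {0, 1} and one in {-24, ..., 26}
# yields a bit-width of about (1 + 6) / 2 = 3.5 bits.
rng = np.random.default_rng(0)
counts = np.stack([rng.integers(0, 2, 1000), rng.integers(-24, 27, 1000)], axis=1)
print(bit_width(counts), significant_bits(counts))
```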

5 Discussion

Although the biological plausibility of a neuron that can output more than one spike “at once” is questionable, it is in fact mathematically equivalent to simulating the neuron with a proportionally smaller time-step and bundling the resulting spikes together (Krithivasan et al., 2019). Consequently, all of the networks we consider here can be implemented in 1-bit spiking networks, although with an added time cost. This is similar to the LSNN's approach of simulating the network for two time-steps per image pixel, but does not incur the same cost in throughput. Alternatively, a space cost can be paid by replicating each neuron several times and uniformly spacing their initial error states (not shown). Likewise, negative spikes are a more compact and efficient alternative to duplicating the neurons and mirroring their activation functions.

Our methods are convenient to apply to the LMU because equation 7 accounts for the dynamics of the lowpass filter, and the memory vector naturally prefers the zero (i.e., silent) state. At the same time, it is a challenging test for the theory, since we do not train the LMU matrices, which are primarily responsible for accuracy on psMNIST (Voelker et al., 2019), and RNNs tend to accumulate and propagate their errors over time. Notably, the method of Algorithm 1 can be applied to other neural network architectures, including feed-forward networks.

6 Conclusions

We have presented a new algorithm and accompanying methods that allow interpolation between spiking and non-spiking networks. This allows the training of hSNNs, which can have mixtures of activity quantization, leading to computationally efficient neural network implementations. We have also shown how to incorporate standard SNN assumptions, such as the presence of a synaptic filter.

We demonstrated the technique on the recently proposed LMU, and achieved better results on sMNIST than the state-of-the-art spiking network. Additionally, on the more challenging psMNIST task, the reported accuracy of our hybrid-spiking network is better than that of any non-spiking RNN apart from the original LMU (Chandar et al., 2019; Voelker et al., 2019).

However, our focus here is not on accuracy per se, but efficient computation. In this context, the training procedure enables us to balance the accuracy of ANNs against the energy efficiency of SNNs by scheduling training to evaluate a series of networks in between these two extremes. In the cases we considered, we reduced the activities to 2–6 bits on average, saving at least 26 bits over the standard LMU implementation with minimal impact on accuracy. While it is difficult to convert these metrics into energy savings in a hardware-agnostic manner, such optimizations can benefit both spiking and non-spiking architectures.

We anticipate that techniques like those we have outlined here will become more widely used as the demands of edge computing continue to grow. In such power-constrained contexts, extracting as much efficiency as possible, while retaining sufficient accuracy, is central to the efforts involved in co-designing both algorithms and hardware for neural network workloads.

References

  • K. Ando, K. Ueyoshi, Y. Oba, K. Hirose, R. Uematsu, T. Kudo, M. Ikebe, T. Asai, S. Takamaeda-Yamazaki, and M. Motomura (2018) Dither NN: an accurate neural network with dithering for low bit-precision hardware. In 2018 International Conference on Field-Programmable Technology (FPT), pp. 6–13. Cited by: §2, §3.1.1.
  • T. Bekolay, J. Bergstra, E. Hunsberger, T. DeWolf, T. C. Stewart, D. Rasmussen, X. Choo, A. Voelker, and C. Eliasmith (2014) Nengo: a Python tool for building large-scale functional brain models. Frontiers in neuroinformatics 7, pp. 48. Cited by: §4.
  • G. Bellec, D. Salaj, A. Subramoney, R. Legenstein, and W. Maass (2018) Long short-term memory and learning-to-learn in networks of spiking neurons. In Advances in Neural Information Processing Systems, pp. 787–797. Cited by: §1, §2, §4.1.1, §4.
  • P. Blouw and C. Eliasmith (2020) Event-driven signal processing with neuromorphic computing systems. In 45th International Conference on Acoustics, Speech, and Signal Processing, Cited by: §1.
  • P. Blouw, X. Choo, E. Hunsberger, and C. Eliasmith (2019) Benchmarking keyword spotting efficiency on neuromorphic hardware. In Proceedings of the 7th Annual Neuro-inspired Computational Elements Workshop, pp. 1–8. Cited by: §1.
  • S. Chandar, C. Sankar, E. Vorontsov, S. E. Kahou, and Y. Bengio (2019) Towards non-saturating recurrent units for modelling long-term dependencies. arXiv preprint arXiv:1902.06704. Cited by: §4.2.1, §4.2.2, §6.
  • D. B. Chklovskii and D. Soudry (2012) Neuronal spike generation mechanism as an oversampling, noise-shaping A-to-D converter. In Advances in Neural Information Processing Systems, pp. 503–511. Cited by: §2.
  • M. Courbariaux, Y. Bengio, and J. David (2015) BinaryConnect: training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131. Cited by: §1.
  • M. Davies, N. Srinivasa, T. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain, et al. (2018) Loihi: a neuromorphic manycore processor with on-chip learning. IEEE Micro 38 (1), pp. 82–99. Cited by: §1.
  • P. Dayan and L. F. Abbott (2001) Theoretical neuroscience: computational and mathematical modeling of neural systems. MIT press. Cited by: §3.3.
  • R. Floyd and L. Steinberg (1976) An adaptive technique for spatial grayscale. In Proceedings of the Society of Information Display, Vol. 17, pp. 75–77. Cited by: §1, §3.1.
  • S. B. Furber, F. Galluppi, S. Temple, and L. A. Plana (2014) The SpiNNaker project. Proceedings of the IEEE 102 (5), pp. 652–665. Cited by: §1, §4.2.2.
  • B. Ginsburg, S. Nikolaev, A. Kiswani, H. Wu, A. Gholaminejad, S. Kierat, M. Houston, and A. Fit-Florea (2017) Tensor processing using low precision format. Google Patents. Note: US Patent App. 15/624,577 Cited by: §1.
  • X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256. Cited by: §4, §4.
  • S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan (2015) Deep learning with limited numerical precision. In International Conference on Machine Learning, pp. 1737–1746. Cited by: §1.
  • M. Hopkins, M. Mikaitis, D. R. Lester, and S. Furber (2019) Stochastic rounding and reduced-precision fixed-point arithmetic for solving neural ODEs. arXiv preprint arXiv:1904.11263. Cited by: §2.
  • W. Hua, Y. Zhou, C. M. De Sa, Z. Zhang, and G. E. Suh (2019) Channel gating neural networks. In Advances in Neural Information Processing Systems, pp. 1884–1894. Cited by: §2.
  • D. Huh and T. J. Sejnowski (2018) Gradient descent for spiking neural networks. In Advances in Neural Information Processing Systems, pp. 1433–1443. Cited by: §1.
  • E. Hunsberger and C. Eliasmith (2015) Spiking deep networks with LIF neurons. arXiv preprint arXiv:1510.08829. Cited by: §1.
  • B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713. Cited by: §2.
  • N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al. (2017) In-datacenter performance analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. Cited by: §1.
  • D. P. Kingma and J. Ba (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.
  • C. Koch (2004) Biophysics of computation: information processing in single neurons. Oxford university press. Cited by: §3.1.1.
  • S. Krithivasan, S. Sen, S. Venkataramani, and A. Raghunathan (2019) Dynamic spike bundling for energy-efficient spiking neural networks. In International Symposium on Low Power Electronics and Design, pp. 1–6. Cited by: §2, §5.
  • Q. V. Le, N. Jaitly, and G. E. Hinton (2015) A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941. Cited by: §4.
  • A. Legendre (1782) Recherches sur l’attraction des sphéroïdes homogènes. Mémoires de Mathématiques et de Physique, présentés à l’Académie Royale des Sciences, pp. 411–435. Cited by: §3.2.
  • M. Li, Y. Li, P. Chuang, L. Lai, and V. Chandra (2019) Improving efficiency in neural network accelerator using operands hamming distance optimization. In The 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing, Cited by: §4.2.2.
  • C. Liu, G. Bellec, B. Vogginger, D. Kappel, J. Partzsch, F. Neumärker, S. Höppner, W. Maass, S. B. Furber, R. Legenstein, and C. G. Mayr (2018) Memory-efficient deep learning on a SpiNNaker 2 prototype. Frontiers in neuroscience 12, pp. 840. Cited by: §1, §1, §4.2.2.
  • J. L. McKinstry, S. K. Esser, R. Appuswamy, D. Bablani, J. V. Arthur, I. B. Yildiz, and D. S. Modha (2018) Discovering low-precision networks close to full-precision networks for efficient embedded inference. arXiv preprint arXiv:1809.04191. Cited by: §1, §2, §4.2.2.
  • D. Moloney (2016) Embedded deep neural networks: “The cost of everything and the value of nothing”. In Hot Chips 28 Symposium, pp. 1–20. Cited by: §1.
  • O. Moreira, A. Yousefzadeh, F. Chersi, A. Kapoor, R.-J. Zwartenkot, P. Qiao, M. Lindwer, and J. Tapson (2020) NeuronFlow: a hybrid neuromorphic – dataflow processor architecture for AI workloads. In 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems, Cited by: §1.
  • P. Nayak, D. Zhang, and S. Chai (2019) Bit efficient quantization for deep neural networks. arXiv preprint arXiv:1910.04877. Cited by: §1.
  • A. Neckar, S. Fok, B. V. Benjamin, T. C. Stewart, N. N. Oza, A. R. Voelker, C. Eliasmith, R. Manohar, and K. Boahen (2018) Braindrop: a mixed-signal neuromorphic architecture with a dynamical systems-based programming model. Proceedings of the IEEE 107 (1), pp. 144–164. Cited by: §2.
  • J. Pei, L. Deng, S. Song, M. Zhao, Y. Zhang, S. Wu, G. Wang, Z. Zou, Z. Wu, W. He, et al. (2019) Towards artificial general intelligence with hybrid Tianjic chip architecture. Nature 572 (7767), pp. 106–111. Cited by: §1, §2.
  • D. Rasmussen (2019) NengoDL: combining deep learning and neuromorphic modelling methods. Neuroinformatics 17 (4), pp. 611–628. Cited by: §4.
  • O. Rodrigues (1816) De l'attraction des sphéroïdes, correspondance sur l'École impériale polytechnique. Ph.D. Thesis, Faculty of Science of the University of Paris. Cited by: §3.2.
  • D. Situnayake and P. Warden (2019) TinyML: machine learning with TensorFlow Lite on Arduino and ultra-low power microcontrollers. O'Reilly Media. ISBN 9781492052036. Cited by: §1.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §2.
  • C. Stöckl and W. Maass (2019) Recognizing images with at most one spike per neuron. arXiv preprint arXiv:2001.01682. Cited by: §2.
  • V. Sze, Y. Chen, T. Yang, and J. S. Emer (2017) Efficient processing of deep neural networks: a tutorial and survey. Proceedings of the IEEE 105 (12), pp. 2295–2329. Cited by: §1.
  • A. Voelker, I. Kajić, and C. Eliasmith (2019) Legendre Memory Units: continuous-time representation in recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 15544–15553. Cited by: §3.2, §4.1.2, §4.2.1, §4.2.2, §4.2.2, §4, §5, §6.
  • A. R. Voelker and C. Eliasmith (2018) Improving spiking dynamical networks: accurate delays, higher-order synapses, and time cells. Neural computation 30 (3), pp. 569–609. Cited by: §3.3.
  • A. R. Voelker (2019) Dynamical systems in spiking neuromorphic hardware. PhD Thesis, University of Waterloo. Cited by: §3.1.1, §3.2.
  • D. Wang, M. Li, L. Wu, V. Chandra, and Q. Liu (2019) Energy-aware neural architecture optimization with fast splitting steepest descent. arXiv preprint arXiv:1910.03103. Cited by: §1.
  • C. Yadav and L. Bottou (2019) Cold case: the lost MNIST digits. In Advances in Neural Information Processing Systems, pp. 13443–13452. Cited by: §4.
  • Y. C. Yoon (2016) LIF and simplified SRM neurons encode signals into spikes via a form of asynchronous pulse sigma–delta modulation. IEEE transactions on neural networks and learning systems 28 (5), pp. 1192–1205. Cited by: §2.
  • A. Yousefzadeh, S. Hosseini, P. Holanda, S. Leroux, T. Werner, T. Serrano-Gotarredona, B. L. Barranco, B. Dhoedt, and P. Simoens (2019a) Conversion of synchronous artificial neural network to asynchronous spiking neural network using sigma-delta quantization. In 1st IEEE International Conference on Artificial Intelligence Circuits and Systems, pp. 81–85. Cited by: §2, §3.1.1.
  • A. Yousefzadeh, M. A. Khoei, S. Hosseini, P. Holanda, S. Leroux, O. Moreira, J. Tapson, B. Dhoedt, P. Simoens, T. Serrano-Gotarredona, et al. (2019b) Asynchronous spiking neurons, the natural key to exploit temporal sparsity. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9 (4), pp. 668–678. Cited by: §2.
  • W. Zhang and P. Li (2019) Spike-train level backpropagation for training deep recurrent spiking neural networks. In Advances in Neural Information Processing Systems, pp. 7800–7811. Cited by: §1.
  • Y. Zhang, N. Suda, L. Lai, and V. Chandra (2017) Hello edge: keyword spotting on microcontrollers. arXiv preprint arXiv:1711.07128. Cited by: §1.
  • M. Zhu and S. Gupta (2017) To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878. Cited by: §1.