
Meta-learning Spiking Neural Networks with Surrogate Gradient Descent

Adaptive "life-long" learning at the edge and during online task performance is an aspirational goal of AI research. Neuromorphic hardware implementing Spiking Neural Networks (SNNs) are particularly attractive in this regard, as their real-time, event-based, local computing paradigm makes them suitable for edge implementations and fast learning. However, the long and iterative learning that characterizes state-of-the-art SNN training is incompatible with the physical nature and real-time operation of neuromorphic hardware. Bi-level learning, such as meta-learning is increasingly used in deep learning to overcome these limitations. In this work, we demonstrate gradient-based meta-learning in SNNs using the surrogate gradient method that approximates the spiking threshold function for gradient estimations. Because surrogate gradients can be made twice differentiable, well-established, and effective second-order gradient meta-learning methods such as Model Agnostic Meta Learning (MAML) can be used. We show that SNNs meta-trained using MAML match or exceed the performance of conventional ANNs meta-trained with MAML on event-based meta-datasets. Furthermore, we demonstrate the specific advantages that accrue from meta-learning: fast learning without the requirement of high precision weights or gradients. Our results emphasize how meta-learning techniques can become instrumental for deploying neuromorphic learning technologies on real-world problems.


1 Introduction

Rapid adaptation to unfamiliar and ambiguous tasks is a hallmark of cognitive function and a long-standing goal of Artificial Intelligence (AI). Neuromorphic electronic systems inspired by the brain’s dynamics and architecture strive to capture its key properties to enable low-power, versatile, and fast information processing (Mead90_neurelec; Indiveri_etal11_neursili; Davies19_bencprog). Several recent neuromorphic systems are now equipped with on-chip local synaptic plasticity dynamics (Chicca_etal13_neurelec; Pfeil_etal12_4-bisyna; Davies_etal18_loihneur). Such neuromorphic learning machines hold promise for building fast and power-efficient life-long learning machines (Neftci18_datapowe).

SNNs can be modeled as a special case of artificial Recurrent Neural Networks with internal states akin to the Long Short-Term Memory (LSTM) (Neftci_etal19_surrgrad). Using a surrogate gradient approach that approximates the spiking threshold function for gradient estimations, SNNs can be trained to match or exceed the accuracy of conventional neural networks on event-based vision, audio, and reinforcement learning tasks (Kaiser_etal19_synaplas; Cramer_etal20_traispik; Bellec_etal19_biolinsp; Bohnstingl_etal20_onlispat; Zenke_Ganguli17_supesupe; Neftci_etal17_evenranda). Although these methods achieve state-of-the-art accuracy in SNNs, they are not practically realizable on neuromorphic hardware or any other online learning system, for several reasons. Firstly, learning via (stochastic) gradient descent requires data to be sampled in an independent and identically distributed fashion (Vapnik13_natustat). However, when sensory data is acquired and processed during task performance, data samples are generally correlated, leading to many convergence problems, including catastrophic forgetting (McClelland_etal95_whyther). Secondly, many networks use batch sizes larger than one. While networks with a batch size of one eventually converge (LeCun_Bottou04_largscal), a smaller batch size requires a smaller learning rate as well. Smaller learning rates result in smaller weight updates, which in turn require memories or buffers that store the weights or weight updates at higher precision, and hence more hardware area. Thirdly, surrogate gradient-based SNN training inherits other fundamental issues of deep learning, namely that very large datasets and a large number of iterations are necessary for convergence. The combination of these three problems, i.e. correlated data samples, data inefficiency, and memory requirements, hampers the successful deployment of neuromorphic hardware to solve real-world learning problems.

In this article, we demonstrate that gradient-based meta-learning on SNNs can solve these problems in practical cases of technological interest, and is particularly well suited to the constraints of neuromorphic hardware and online learning (Fig. 1). To do so, we combine MAML, a second-order gradient-based method that optimizes the network meta-parameters, and the surrogate gradient method (Neftci_etal19_surrgrad). Two ingredients were key to the results of our work. First, the surrogate functions used to estimate SNN gradients can be made twice differentiable, and hence are suitable for second-order learning as in MAML. Second, we defined suitable event-based datasets to demonstrate meta-learning on SNNs. While MAML had been previously applied to SNNs, prior work focused on meta-training Hebbian Spike Time Dependent Plasticity (STDP) dynamics on non-event-based datasets, which do not take any advantage of the event-based nature of SNNs. Furthermore, surrogate gradient learning implementing stochastic gradient descent can be implemented as a form of three-factor learning (Gerstner_etal18_eligtrac; Zenke_Ganguli17_supesupe; Kaiser_etal20_synaplas; Bellec_etal19_biolinsp) that vastly exceeds the performance of classical STDP, while being compatible with neuromorphic hardware implementations (Payvand_etal20_errothre; Cramer_etal20_traispik). The meta-training of SNNs using the surrogate gradient method can thus be seen as a tool to adapt and tune synaptic plasticity circuits.

We study the SNN MAML approach in the context of few-shot learning, whereby a model is trained on a set of labeled tasks drawn from a given task domain so that it can adapt to unseen tasks of the same domain using a small number of samples and iterations. Examples of few-shot learning are learning novel hand or body gestures, agents learning to take new goal-driven actions in a new maze, or optimizing automatic speech recognition to the individual pronunciation of the subject.

One important obstacle to meta-learning research in neuromorphic engineering is the lack of suitable datasets. Neuromorphic hardware implementing SNNs is most suitable for processing event-based datasets and loses most of its salient features when applied to static data (Davies19_bencprog). The Omniglot (Lake_etal17_builmach) and MiniImagenet (Vinyals_etal16_matcnetw) datasets have been pivotal in pushing the field of meta-learning ahead. However, there exists no comparable event-based dataset in which modeling dynamics is crucial to solving the problem. Taking inspiration from existing meta-learning benchmarks that fuse multiple datasets (mulitdigitmnist), we define new benchmarks consisting of combinations of event-based datasets recorded using neuromorphic vision sensors. We demonstrate performances that match, and in some cases exceed, those of conventional neural networks trained on these datasets.

Finally, we analyze the weight update statistics, which reveal that MAML not only results in fast learning but does so using few, large weight updates. These results hold promise for online learning using low-precision weight memory.

Specific Contributions

This work provides 1) a method of parameter initialization that enables neuromorphic hardware to few-shot learn new tasks; 2) a method to construct meta-datasets using data from the DVS neuromorphic sensor, with two examples made publicly available: Double NMNIST and Double ASL-DVS; and 3) a demonstration of the effectiveness of second-order meta-training of SNNs. We present this work as a stepping stone towards implementing MAML with SNNs in neuromorphic hardware for fast adaptation to streamed event-based sensor data.

Figure 1: Meta-Learning for SNNs using Surrogate Gradients. In the first phase, an SNN or a functional simulator of a neuromorphic hardware’s SNNs is meta-trained using surrogate gradient methods on a class of tasks stored on a computer. The goal of meta-training is to learn an initial parameter set $\Theta^{init}$ such that out-of-sample tasks can be learned quickly. In the envisioned application, $\Theta^{init}$ would be learned offline on a conventional computer and learning/adaptation would take place at the edge, using neuromorphic sensing and processing.

2 Methods

2.1 Model Agnostic Meta-Learning

Define a neural network model $f(x; \Theta)$ that produces an output batch $\hat{y}$ given its parameters $\Theta$ and an input batch $x$. For simplicity, we focus here on classification problems, such that $\hat{y}$ represents logits and $y$ is a class, although any supervised learning problem would be suitable. In the classification case, each batch consists of $N$ samples of each class. The parameters $\Theta$ are trained by minimizing a task-relevant loss function $\mathcal{L}(\hat{y}, y)$, such as cross-entropy, where $y$ is a batch of targets.

The goal of meta-learning is to optimize the meta-parameters of $f$, such as the initialization parameters, noted $\Theta^{init}$. This work makes use of the standard second-order MAML algorithm to meta-train the SNN. The standard MAML workflow is designed to optimize the parameters of a neural network model across multiple tasks in a few-shot setting. MAML achieves this using two nested optimization loops, an “inner” loop and an “outer” loop. The inner loop consists of standard Stochastic Gradient Descent (SGD) updates, where the gradient operations are traced for auto-differentiation (Griewank_Walther08_evalderi). In the outer loop, an update is made using gradient descent on the meta-parameters.

To make use of MAML, it is essential to set up the experimental framework accordingly. Define three sets of tasks: meta-training $\mathcal{T}^{tr}$, meta-validation $\mathcal{T}^{val}$, and meta-testing $\mathcal{T}^{test}$. Each task $i$ consists of a training dataset $D_i^{train}$ and a validation dataset $D_i^{valid}$, each of the form $D = \{(x^m, y^m)\}_{m=1}^{M}$. Here $x^m$ denotes the input data, $y^m$ the target (label), and $M$ is the number of samples. In general, the datasets corresponding to different tasks can have different sizes, but we omit this in the notation to avoid clutter. During learning, a task $i$ is sampled from $\mathcal{T}^{tr}$ and inner loop updates are made using batches of data sampled from $D_i^{train}$. The resulting parameters are then used to make the outer loop update using the matching validation dataset $D_i^{valid}$. During each inner loop update, one or more SGD update steps are performed over a task-relevant loss function $\mathcal{L}^{in}$:

$$\Theta_i^{k} = \Theta_i^{k-1} - \alpha \nabla_{\Theta_i^{k-1}} \mathcal{L}^{in}\!\left(\Theta_i^{k-1}, D_i^{train}\right), \qquad k = 1, \dots, n, \qquad \Theta_i^{0} = \Theta^{init}. \tag{1}$$

Here $n$ is the number of inner loop adaptation steps and $\alpha$ is the inner loop learning rate. Note the dependence of $\Theta_i^{n}$ on the initial parameter set $\Theta^{init}$ at the beginning of the inner loop through the recursion. $\nabla$ here indicates the gradient of the inner loop loss computed on the network using parameters $\Theta_i^{k-1}$. The outer loop loss is defined as:

$$\mathcal{L}^{out}\!\left(\Theta^{init}\right) = \sum_{i \in \mathcal{T}^{tr}} \mathcal{L}\!\left(\Theta_i^{n}, D_i^{valid}\right), \tag{2}$$

where $\Theta_i^{n}$ is the result of the $n$ inner loop updates of Eq. (1), starting from $\Theta^{init}$. Note that in practice the above expression is generally computed over a random subset of tasks rather than the full set $\mathcal{T}^{tr}$. Notice also that the outer loop loss is computed over the validation dataset $D_i^{valid}$, whereas $\mathcal{L}^{in}$ is computed using the training dataset $D_i^{train}$, which is argued to improve generalization. The goal is to find the optimal $\Theta^{init}$, denoted $\Theta^{init,*}$, such that:

$$\Theta^{init,*} = \arg\min_{\Theta^{init}} \mathcal{L}^{out}\!\left(\Theta^{init}\right). \tag{3}$$

Provided the inner loop loss is at least twice differentiable with respect to $\Theta$, the optimization can be performed via gradient descent over the initial parameters $\Theta^{init}$, using a standard gradient-based optimizer with gradients $\nabla_{\Theta^{init}} \mathcal{L}^{out}$. Successive applications of the chain rule in the expression above result in second-order gradients of the form $\nabla^2_{\Theta} \mathcal{L}^{in}$. If these second-order terms are ignored, it is still possible to meta-learn (Finn_etal17_modemeta), using the method called first-order MAML (FOMAML).

In our experiments, we use the ADAM optimizer for the outer loop loss function and vanilla SGD for the inner loop loss. This choice is motivated by a hybrid learning framework whereby the outer loop training can occur offline with large memory and compute resources (e.g. ADAM, which requires more memory and compute), whereas the inner loop is constrained by hardware at the edge. The model is validated and tested on $\mathcal{T}^{val}$ and $\mathcal{T}^{test}$, respectively. In the following, we describe the SNN model used with MAML.
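
To make the nested optimization concrete, the sketch below shows how one second-order MAML meta-update could be written in PyTorch. It is a minimal illustration under our own assumptions, not the code used in this work: `model(params, x)` is assumed to be a functional wrapper that runs the SNN with an explicit parameter list (so that gradients can flow back to the initialization), `loss_fn` is a cross-entropy loss, and `meta_opt` is an ADAM optimizer over `params_init`.

```python
import torch

def maml_meta_step(model, params_init, tasks, loss_fn, meta_opt,
                   inner_lr=0.1, inner_steps=1):
    """One (illustrative) second-order MAML meta-update over a batch of tasks.

    Each element of `tasks` is ((x_train, y_train), (x_valid, y_valid)),
    i.e. data sampled from D_i^train and D_i^valid of a task i in T^tr.
    """
    meta_opt.zero_grad()
    outer_loss = 0.0
    for (x_train, y_train), (x_valid, y_valid) in tasks:
        # Inner loop (Eq. (1)): a few SGD steps starting from the shared initialization.
        params = list(params_init)
        for _ in range(inner_steps):
            inner_loss = loss_fn(model(params, x_train), y_train)
            # create_graph=True keeps the graph so the second-order terms survive.
            grads = torch.autograd.grad(inner_loss, params, create_graph=True)
            params = [p - inner_lr * g for p, g in zip(params, grads)]
        # Outer loop (Eq. (2)): evaluate the adapted parameters on the validation split.
        outer_loss = outer_loss + loss_fn(model(params, x_valid), y_valid)
    # Backpropagate through the inner updates to the initialization (Eq. (3)).
    outer_loss.backward()
    meta_opt.step()
    return float(outer_loss)
```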

2.2 Maml-compatible Spiking Neuron Model

The neuron model used in the SNNs in our work follows Leaky Integrate & Fire (LIF) dynamics as described in (Kaiser_etal20_synaplas). For completeness, we summarize the dynamics of the neuron model here:

$$\begin{aligned} U_i[t] &= \sum_j W_{ij} P_j[t] - R_i[t] + b_i, \qquad S_i[t] = \Theta\!\left(U_i[t]\right), \\ P_j[t+\Delta t] &= \alpha\, P_j[t] + Q_j[t], \qquad Q_j[t+\Delta t] = \beta\, Q_j[t] + S_j[t], \\ R_i[t+\Delta t] &= \gamma\, R_i[t] + S_i[t], \end{aligned} \tag{4}$$

where $U_i$ is the membrane potential, $W_{ij}$ are the synaptic weights between pre-synaptic neuron $j$ and post-synaptic neuron $i$, and $\Delta t$ is the timestep. Neurons emit a spike $S_i$ at time $t$ when the threshold of their membrane potential is reached. $\Theta$ is the unit step function, with $\Theta(x) = 1$ if $x \geq 0$ and $0$ otherwise. $P_j$ and $Q_j$ describe the traces of the membrane potential of the neuron and the current of the synapse, respectively. For each incoming spike to a neuron, each trace undergoes a jump of height 1 and decays exponentially if no spikes are received. The constants

$$\alpha = \exp\!\left(-\frac{\Delta t}{\tau_{mem}}\right), \qquad \beta = \exp\!\left(-\frac{\Delta t}{\tau_{syn}}\right), \qquad \gamma = \exp\!\left(-\frac{\Delta t}{\tau_{ref}}\right) \tag{5}$$

reflect the time constants of the membrane ($\tau_{mem}$), synaptic ($\tau_{syn}$), and refractory ($\tau_{ref}$) dynamics. Weighting the trace $P_j$ with the synaptic weight $W_{ij}$ results in the Post-Synaptic Potential (PSP) of post-synaptic neuron $i$ caused by pre-synaptic neuron $j$. The constant $b_i$ is a bias current representing the intrinsic excitability of the neuron. The reset mechanism is captured by the dynamics of $R_i$, and the factors $\alpha$, $\beta$ and $\gamma$ set the decay of the membrane, synapse, and reset dynamics respectively. Note that Eq. (4) is equivalent to a discrete-time version of the Spike Response Model (SRM) with linear filters (Gerstner_Kistler02_spikneur).
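
As a concrete illustration of Eqs. (4)-(5), the following is a minimal discrete-time simulation of a single fully-connected LIF layer. It is our own sketch (written in PyTorch, with the spiking threshold placed at zero and arbitrary placeholder time constants), not the implementation of (Kaiser_etal20_synaplas).

```python
import torch

def lif_layer(spikes_in, W, b, tau_mem=20e-3, tau_syn=5e-3, tau_ref=10e-3, dt=1e-3):
    """Simulate one LIF layer for a [T, n_in] binary input spike train.

    Traces P and Q decay exponentially and jump on incoming spikes, R implements
    the reset/refractory mechanism; compare Eqs. (4)-(5).
    """
    alpha = torch.exp(torch.tensor(-dt / tau_mem))
    beta = torch.exp(torch.tensor(-dt / tau_syn))
    gamma = torch.exp(torch.tensor(-dt / tau_ref))

    T, n_in = spikes_in.shape
    n_out = W.shape[0]
    P = torch.zeros(n_in)            # membrane trace driven by the synaptic trace
    Q = torch.zeros(n_in)            # synaptic current trace
    R = torch.zeros(n_out)           # reset / refractory state
    out = torch.zeros(T, n_out)

    for t in range(T):
        Q = beta * Q + spikes_in[t]  # jump of height 1 for each incoming spike
        P = alpha * P + Q
        U = W @ P - R + b            # membrane potential, Eq. (4)
        S = (U > 0).float()          # spike when the (zero) threshold is crossed
        R = gamma * R + S            # reset driven by the neuron's own spikes
        out[t] = S
    return out
```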

To compute the second-order gradients, MAML requires the SNN to be twice differentiable. However, the spiking function $\Theta$ is non-differentiable. The surrogate gradient approach, where $\Theta$ is replaced by a differentiable surrogate function $\sigma$ for computing gradients, has been used to successfully side-step this problem (Neftci_etal19_surrgrad). For MAML, the surrogate function can be chosen to be twice differentiable. Although many suitable surrogate gradient functions exist, the fast sigmoid function described in (Zenke_Vogels20_remarobu) strikes a good trade-off between simplicity and effectiveness in learning and is twice differentiable: $\sigma(x) = \frac{x}{1 + |x|}$. All simulations in this work use the fast sigmoid as the surrogate function.

Because SNNs are a special case of recurrent neural networks, it is possible to apply Automatic Differentiation tools for implementing the gradient computations (Baydin_etal17_autodiff; Paszke_etal17_autodiff). This also applies to the calculation of the second-order gradients needed for backpropagating the gradient of the inner loss.
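
One simple way to obtain such a spike function in PyTorch, sketched here for illustration (not necessarily the implementation used in this work), is the "detach" construction: the forward value is the hard step, while the backward pass sees only the smooth fast sigmoid, so autograd can differentiate the surrogate path twice, as MAML requires.

```python
import torch

def fast_sigmoid(u, slope=10.0):
    # Fast sigmoid x / (1 + |x|), with an adjustable slope (our choice of parameterization).
    return slope * u / (1.0 + slope * u.abs())

def surrogate_spike(u):
    """Heaviside step in the forward pass, fast-sigmoid gradient in the backward pass."""
    soft = fast_sigmoid(u)
    hard = (u > 0).float()
    # The returned value equals `hard`; gradients (and higher derivatives) come from `soft`.
    return soft + (hard - soft).detach()

# The surrogate path supports double backward, as needed for second-order MAML:
u = torch.randn(5, requires_grad=True)
(g,) = torch.autograd.grad(surrogate_spike(u).sum(), u, create_graph=True)
(h,) = torch.autograd.grad(g.sum(), u)   # second-order gradient of the surrogate
```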

2.3 Datasets

Figure 2: Top: Examples of Double N-MNIST tasks. Each sample contains a combination of two N-MNIST digits to make two-digit numbers. Bottom: Examples of Double ASL-DVS tasks. Each sample contains a combination of two ASL-DVS letters. In all examples, the images are DVS events summed over 100 ms into frames.

We benchmark our models on modifications of datasets collected using event-based vision sensors (Lichtsteiner_etal08_128x120; Posch_etal11_qvga143): the Neuromorphic MNIST (N-MNIST) (Orchard_etal15_convstat) and the American Sign Language Dynamic Vision Sensor (ASL-DVS) (bi2019graph) datasets. N-MNIST consists of 34x34 pixel event data streams of MNIST digits recorded with an ATIS camera (Posch_etal11_qvga143). The dataset contains 60,000 training event streams and 10,000 test event streams. From the N-MNIST dataset, we create Double N-MNIST datasets. Each event stream of Double N-MNIST is a combination of two N-MNIST event streams placed side by side to form a double-digit number, which is then spatially downsampled to the 32x16 input resolution listed in Table 1. Because there are ten digits in the original N-MNIST dataset, 100 different double-digit numbers can be created. These 100 different numbers can be used to create a meta-dataset with K=100 tasks, where each double-digit number represents one task. We create N-shot K-way meta-training, meta-validation, and meta-test Double N-MNIST datasets from the training and test N-MNIST datasets. Each meta dataset consists of a subset of the 100 total possible tasks: the meta-training dataset contains 64 tasks, the meta-validation dataset contains 16 tasks, and the meta-test dataset contains 20 tasks.

The ASL-DVS dataset contains 24 classes corresponding to letters A-Y, excluding J, in American Sign Language, recorded using a DAVIS 240C event-based sensor (Brandli_etal14_240180). Data recording was performed in an office environment under constant illumination. The dataset contains 4,200 event data streams of approximately 100 ms for each letter, for a total of 100,800 samples. As for Double N-MNIST, each event stream of Double ASL-DVS is a combination of two ASL-DVS event data streams placed side by side to form a two-letter stream, which is then spatially downsampled to the 80x30 input resolution listed in Table 1. Out of the 24 classes, 576 different double-letter tasks can be created and used to build N-shot K-way meta datasets, with each double ASL letter pair representing a task. From the ASL-DVS dataset, we create Double ASL-DVS N-shot K-way meta-training, validation, and test datasets. The meta-training dataset contains 369 tasks, the meta-validation dataset contains 92 tasks, and the meta-test dataset contains 115 tasks. Example images from the Double N-MNIST and Double ASL-DVS datasets are shown in Fig. 2.
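
For illustration, a double-class sample can be assembled roughly as in the sketch below. This is a hypothetical numpy snippet under our own assumptions about the event format (arrays of (t, x, y, p) events); the actual preprocessing is provided in the accompanying torchneuromorphic repository.

```python
import numpy as np

def make_double_sample(events_a, events_b, width, ds_factor=2):
    """Place two event streams side by side and spatially downsample.

    events_a, events_b: float arrays of shape [N, 4] with columns (t, x, y, p).
    width: horizontal resolution of a single-class stream (e.g. 34 for N-MNIST).
    """
    shifted_b = events_b.copy()
    shifted_b[:, 1] += width                   # place the second stream to the right
    merged = np.concatenate([events_a, shifted_b], axis=0)
    merged = merged[np.argsort(merged[:, 0])]  # keep events ordered in time
    merged[:, 1:3] //= ds_factor               # coarse spatial downsampling
    return merged
```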

For all datasets, gradients were computed over only the final portion of each sequence to reduce the memory footprint.

2.4 Model Architecture

Layer | Kernel | NMNIST Output | ASL-DVS Output
input | - | 32x16x2 | 80x30x2
1 | 32c5p0s1 | 32x16x32 | 80x30x32
3 | 2a | 16x8x32 | 40x15x32
4 | 64c5p0s1 | 16x8x64 | 40x15x64
5 | 2a | 8x4x64 | 20x7x64
8 | 128c5p0s1 | 8x4x128 | 20x7x128
9 | 2a | 4x2x128 | 10x3x128
output | - | K = 5 | K = 5

Notation: Ya represents YxY max pooling; XcYpZsS represents X convolution filters (YxY) with padding Z and stride S.

Table 1: SNN MAML Network Architecture

The architecture, shown in Table 1, is the same for all trained models: three convolutional layers followed by a linear output layer. The SNN output membrane potentials are decoded into classes by taking the output neuron with the highest membrane potential as the classification.
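
Read as a PyTorch module, Table 1 corresponds to three convolution-plus-pooling stages followed by a linear readout. The sketch below is only indicative: it uses standard layers as stand-ins for the spiking layers, and the padding needed to reproduce the listed feature-map sizes is our assumption.

```python
import torch.nn as nn

class ConvSNNBackbone(nn.Module):
    """Layer structure of Table 1 (spiking dynamics omitted for brevity)."""
    def __init__(self, in_channels=2, n_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=5, padding=2), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=5, padding=2), nn.MaxPool2d(2),
        )
        self.readout = nn.LazyLinear(n_classes)   # linear output layer (K = 5 classes)

    def forward(self, x):
        h = self.features(x)
        return self.readout(h.flatten(1))
```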

3 Results

3.1 Few-shot Learning Performance on Double NMNIST and Double ASL Tasks

For each dataset, we ran 1-shot 5-way learning experiments on models trained using MAML. The MAML models were meta-trained on the meta-training tasks ($\mathcal{T}^{tr}$), with the meta-validation tasks ($\mathcal{T}^{val}$) used to compute the loss gradient in the outer loop. The meta-trained models were then tested on the meta-test tasks ($\mathcal{T}^{test}$). A summary of our results for each dataset is shown in Table 2. The results in Table 2 are obtained by averaging the inference performance of a meta-trained model over 10 trials on held-out data of the meta-validation and meta-test tasks. Each trial uses different random batches of data sampled from the training and validation tasks, respectively. All experiments used a single inner loop gradient step (i.e. $n$ in Eq. (1) was set to 1). Additionally, we compare the results of the SNNs to equivalent non-spiking models meta-trained with MAML, as well as to SNNs trained with first-order MAML. First-order MAML ignores all terms involving the second-order gradients and is thus similar to joint training of the tasks. This has the advantage of reducing the memory footprint required for learning, but is known to reduce the accuracy of the meta-trained model (Finn_etal17_modemeta). For the non-spiking models, the input data is first converted from the address event representation to static images by summing the events over the time dimension.
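
The evaluation protocol can be summarized by the following sketch (ours, reusing the hypothetical functional `model(params, x)` wrapper from the sketch in Section 2.1): for each held-out task, a single inner-loop SGD step is taken on one shot of data, and accuracy is then measured on unseen samples of the same task.

```python
import torch

def evaluate_one_shot(model, params_init, test_tasks, loss_fn, inner_lr=0.1):
    """1-shot 5-way evaluation: one adaptation step per meta-test task (n = 1)."""
    accuracies = []
    for (x_shot, y_shot), (x_query, y_query) in test_tasks:
        loss = loss_fn(model(params_init, x_shot), y_shot)
        grads = torch.autograd.grad(loss, params_init)   # no second-order terms at test time
        adapted = [p - inner_lr * g for p, g in zip(params_init, grads)]
        with torch.no_grad():
            pred = model(adapted, x_query).argmax(dim=-1)
            accuracies.append((pred == y_query).float().mean().item())
    return sum(accuracies) / len(accuracies)
```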

The results show that both spiking and non-spiking MAML achieve 1-shot learning performance on these datasets comparable to the state-of-the-art performance of non-meta models. The state-of-the-art test accuracy for standard, non-meta training of SNNs on the NMNIST dataset is 99.2% (Shrestha_Orchard18_slayspik). In a one-shot learning scenario, the SNN achieves 98.23% test accuracy on Double NMNIST and 96.04% test accuracy on Double ASL-DVS. For both datasets, first-order MAML performed significantly worse, highlighting the importance of the second-order gradient terms enabled by twice-differentiable surrogate gradients for successful meta-training.

On average, MAML on SNNs tends to match or outperform non-spiking MAML on event-based datasets. This is likely because the dynamics of the SNN neurons are well suited to processing and learning the spatio-temporal patterns of the event-based data streams these datasets are composed of.

Task | Algorithm | Train Accuracy | Test Accuracy
Double N-MNIST | MAML (SNN) | 98.76±1.05% | 98.23±1.12%
Double N-MNIST | MAML (CNN) | 99.09±0.53% | 98.35±1.26%
Double N-MNIST | FOMAML (SNN) | 92.59±1.0% | 92.63±0.74%
Double ASL-DVS | MAML (SNN) | 95.77±0.88% | 96.04±2.31%
Double ASL-DVS | MAML (CNN) | 94.93±0.92% | 94.97±1.63%
Double ASL-DVS | FOMAML (SNN) | 94.97±1.12% | 94.27±0.6%

Table 2: 1-Shot 5-Way Accuracy Results. Train Accuracy indicates accuracy over held-out data of the meta-validation tasks ($\mathcal{T}^{val}$) and Test Accuracy indicates accuracy on the meta-test tasks ($\mathcal{T}^{test}$).

3.2 Generalization of Learning Performance

Figure 3: Example of how changing the number of inner loop training steps during meta-testing affects the error of the meta-trained model. Accuracy increases as the number of gradient steps increases, and without adapting to a new task the model has very high error. Left: Double NMNIST. Right: Double ASL-DVS.
Figure 4: Example of how freezing layers of the network during inner loop adaptation does not greatly impact learning performance. This supports the claim that the network relies on feature reuse, as in (Raghu2020Rapid). Left: Double NMNIST. Right: Double ASL-DVS.

MAML requires selecting hyper-parameters such as the number of update steps and the learning rates. In real-world scenarios, the input conditions cannot be tightly controlled, leading to potential mismatches with the MAML hyper-parameters. For example, in a real-time gesture learning scenario, the parameter update schedule may not be tightly linked to the time at which the gesture is presented. Here we study the ability of MAML-trained SNNs to generalize across different input conditions.

The ability of MAML to generalize learning performance across its settings, such as the number of update steps, has been previously documented for conventional ANNs (finn2018metalearning). Here, we demonstrate that this feature extends to our SNN. Using an SNN MAML network meta-trained on the Double NMNIST dataset, we varied the number of gradient steps during inner loop adaptation on test data. Fig. 3 shows how changing the number of gradient steps during inner loop adaptation affects the one-shot 5-way learning performance on each dataset. On both datasets, the performance increases as the number of inner loop gradient steps increases. There is therefore a trade-off between accuracy and the computational overhead of performing multiple gradient steps during one-shot learning.

We also show how the learning performance is affected when layers of the network are frozen during few-shot learning. Using a network meta-trained on the Double NMNIST dataset, we progressively froze layers of the network and observed the impact on performance, shown in Fig. 4. Even when all convolutional layers of the network (in this case, three) were frozen, there was no significant impact on performance. This gives further evidence to the claim that MAML learns a representation suitable for few-shot learning rather than performing rapid learning (Raghu2020Rapid). This is interesting from an engineering perspective, as a network with a meta-learned initialization can achieve high performance on new tasks with only one gradient update applied only at the final layer. This is well suited for real-time adaptation in neuromorphic hardware, as demonstrated in previous work (Stewart_etal20_onlifew-).
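
In the sketch below (our own illustration, reusing the hypothetical module from Section 2.4), the convolutional feature layers are excluded from the inner-loop update so that only the readout adapts, mirroring the frozen-layer experiment.

```python
import torch

def adapt_last_layer_only(net, x_shot, y_shot, loss_fn, inner_lr=0.1):
    """Single adaptation step in which only the output layer receives an update."""
    for p in net.features.parameters():
        p.requires_grad_(False)        # freeze the meta-learned feature layers
    loss = loss_fn(net(x_shot), y_shot)
    loss.backward()
    with torch.no_grad():
        for p in net.readout.parameters():
            p -= inner_lr * p.grad     # update the readout only
            p.grad = None
    return net
```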

3.3 MAML Few-shot Learning Relies on Few, Large Magnitude Updates

To obtain adequate generalization, conventional deep learning relies on many, small magnitude updates across a large dataset. This is achieved using a relatively small learning rate. The usage of small learning rates is challenging on a physical substrate, as it requires high precision memory to accumulate the gradients across updates. This problem is further compounded by the fact that learning on a physical substrate cannot be easily performed using batches of samples.

Interestingly, few-shot learning has the opposite requirements: few but large magnitude updates. This result is extremely relevant for neuromorphic hardware which uses low precision parameters.

Algorithm | Avg Magnitude | Sum of Magnitudes | Max Magnitude
MAML Outer Loop | 7.72e-05±0.0002 | 0.296±0.599 | 0.002±0.0017
MAML Inner Loop | 0.0044±0.0007 | 17.03±2.69 | 0.1548±0.16
Non-Meta | 0.0005±0.0002 | 0.5499±0.2412 | 0.0011±0.0002

Table 3: Comparison of the Magnitude of Updates Between MAML and non-MAML Learning
Figure 5: A comparison of the weight update magnitudes on Double NMNIST data, shown on a log scale: (left) an inner loop update and an equivalent update of a model that was not meta-trained; (right) an inner loop update that is thresholded. MAML makes fewer non-zero weight updates that are large in magnitude compared to non-meta models. Additionally, when thresholded, MAML makes even fewer non-zero weight updates that are larger in magnitude.

Likewise, we observe that the SNN MAML model only needs a few, large magnitude parameter updates for few-shot learning. Table 3 shows the truncated values of the update magnitude between two training iterations of the output layer for a meta model and an equivalent non-meta model, both trained on Double NMNIST. Fig. 5 gives a more detailed picture by showing histograms of the weight updates. Comparing the MAML inner loop and the non-meta model, the average update of the inner loop is an order of magnitude larger than the equivalent non-meta model's update. To summarize, we find, first, that meta-trained models only need one adaptation step to achieve high accuracy when learning a new task (see Table 2), and second, that these models only need a few updates of large magnitude to perform few-shot learning (see Table 3, Fig. 5).
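
The statistics in Table 3 correspond to simple reductions over the per-parameter update magnitudes, for example (our sketch):

```python
import torch

def update_stats(w_before, w_after):
    """Average, summed, and maximum magnitude of a weight update (as in Table 3)."""
    delta = (w_after - w_before).abs()
    return delta.mean().item(), delta.sum().item(), delta.max().item()
```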

Additionally, the magnitude of the weight updates in the inner loop during meta-training and adaptation can be thresholded to use even fewer and larger magnitude updates. During the inner loop adaptation, instead of always updating the parameters, the update is gated by a threshold, as described in Eq. (6).

$$\Theta \leftarrow \begin{cases} \Theta - \Delta\Theta & \text{if } |\Delta\Theta| > \theta_{thr} \\ \Theta & \text{otherwise,} \end{cases} \tag{6}$$

where $\Theta$ are the parameters of the model, $|\Delta\Theta|$ is the magnitude of the update $\Delta\Theta$, and $\theta_{thr}$ is the threshold, applied element-wise. Thresholding forces the updates to be fewer and larger in magnitude, as shown in Fig. 5. The threshold used in Fig. 5 was set to a fixed fraction of the range of update magnitudes.
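
A minimal sketch of the gating of Eq. (6), under our reading of it: proposed updates whose magnitude falls below the threshold are simply dropped, element-wise.

```python
import torch

def thresholded_update(params, grads, inner_lr, theta_thr):
    """Apply an SGD step only where the proposed update exceeds the threshold."""
    new_params = []
    for p, g in zip(params, grads):
        delta = inner_lr * g
        mask = (delta.abs() > theta_thr).float()   # gate small-magnitude updates
        new_params.append(p - mask * delta)
    return new_params
```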

3.4 Comparison to Transfer Learning

Figure 6: The error of a non-meta model, pre-trained on the Double NMNIST training classes, using transfer learning to learn the test set of classes. The model requires more shots than MAML to achieve comparable performance.

A generalization problem involves learning a function, or model, whose behavior is constrained by a dataset such that it can make predictions about (i.e. learn features that transfer to) other samples. A task domain consists of datasets that are related by a common domain, for example, datasets that all consist of double-digit numbers. Learning on one task in the domain to improve performance on another task is commonly referred to as transfer learning. Meta-learning can cast transfer learning as a generalization problem: each task in a given task domain is a generalization problem instance, which means that generalization in meta-learning corresponds to the ability to transfer knowledge between the different problem instances (Andrychowicz_etal16_learto; Vanschoren2019).

We compare meta-learning to conventional transfer learning for few-shot learning, where a model is pre-trained on a subset of classes within a task domain and the pre-trained features are then transferred to another model that learns to classify new classes within the task domain. For comparison with the Double NMNIST SNN MAML model, we first pre-trained an SNN model on the 64 classes of the training dataset. We then transferred the features to a new model with an untrained last layer. The model was trained and tested on 20 of the remaining classes, with all layers except the last layer frozen. The model was trained on one shot of data at a time and then tested on 20 unseen shots of data. The few-shot transfer learning results are shown in Fig. 6, averaged over 10 trials. Only after the model trains on about 9 or 10 shots of data does it achieve accuracy comparable to the SNN MAML model shown in Table 2. Comparing this to the high accuracy of the SNN MAML models on new tasks after using only one shot of data (see Fig. 3, Table 2), we conclude that SNN MAML can adapt to a new task using fewer shots of data.

4 Discussion

Neuromorphic hardware is particularly well suited for online learning at the edge. Here, we demonstrated how to pre-train SNNs to perform one-shot learning using MAML. The SNN MAML models used the surrogate gradient method to overcome the non-differentiable spiking nonlinearity in gradient-based training of SNNs. We demonstrated our results on combinations of event-based datasets recorded using a neuromorphic vision sensor.

The effective batch size (=1), the precision required for learning from scratch, and the potential correlation between data samples are serious obstacles to deploying neuromorphic learning in practical scenarios. Fortunately, learning from scratch on the device is generally not even desirable due to robustness and time-to-convergence issues, especially if the devices are intended for edge applications. Some form of offline pre-training can alleviate these issues, and MAML is an excellent tool to automate this pre-training.

Our results showed that a meta-trained SNN MAML model can learn new event-based tasks within a task domain in one or a few shots. This enables learning in real-world scenarios where data is streaming, online, and observed only once. Additionally, the model can relearn previously learned tasks in one or a few shots, which greatly reduces the impact of catastrophic forgetting because the model does not need to retrain for many iterations.

In hardware for training neural networks, weight updates must be rounded to values that can be resolved at the available resolution, thereby placing a lower bound on the learning rate. Conveniently, in few-shot learning, updates are of large magnitude (Fig. 5) and a single update must be sufficient to make a change in the output, effectively implying that the learning rate is large.
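
For instance, on hardware with a fixed weight resolution, an update only takes effect if it survives rounding to the nearest representable step, as in this illustrative sketch (ours, with an arbitrary 6-bit fractional resolution):

```python
import torch

def quantized_update(w, delta, step=2 ** -6):
    """Apply an update to a weight memory with resolution `step`.

    Updates much smaller than the step size round away to nothing, which is why the
    few large-magnitude updates produced by MAML adaptation are hardware friendly.
    """
    return torch.round((w + delta) / step) * step
```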

Meta-training SNN MAML models requires considerable computing power and memory. The tasks are stored in memory as sequences, with the network computations unrolled over the full sequence length. Our method re-initializes the network dynamics at each training iteration by feeding a partial sequence of a data sample through the network, which runs the dynamics but does not update the network. Additionally, gradients must be computed, stored, and backpropagated through the network for both the inner loop and the outer loop in order to meta-train. This largely prevents using a large number of inner loop steps. While first-order MAML requires less memory and compute per update, it performed significantly worse than MAML. Other first-order methods such as REPTILE (Nichol_etal18_firsmeta) are an alternative that can reduce memory usage at the cost of more compute time and some accuracy.

4.1 Non-gradient based meta-learning

MAML is known to learn representations that are general across datasets rather than “learning to learn” (Raghu2020Rapid). The results of our layer-freezing experiments showed that this is the case for SNN MAML as well, indicating that the layers already contain good features at meta-initialization. Another meta-learning approach, used with artificial recurrent neural networks, is to train the optimizer itself, modeled using an LSTM (Andrychowicz_etal16_learto). The underlying mechanism relies on the recurrent cell states that capture knowledge that is common across the domain of tasks.

The SNN-based work (Scherr_etal20_one-lear) falls into this category. There, meta-learning was applied to SNNs trained using e-prop on arm-reaching and Omniglot tasks. The approach combined a Learning Network (LN) that carries out inner loop adaptation and a Learning Signal Generator (LSG) that carries out outer loop generalization, both modeled by recurrent SNNs.

4.2 Synaptic Plasticity and Meta-Training

Several recent methods for training SNNs using gradient descent have been introduced. Assuming a global cost function $\mathcal{L}$ defined on the spikes of the top layer, the gradients with respect to the weights are:

$$\frac{\partial \mathcal{L}}{\partial W_{ij}} = \frac{\partial \mathcal{L}}{\partial S_i}\, \sigma'\!\left(U_i\right) P_j. \tag{7}$$

The above equation describes a three-factor learning rule (Gerstner_etal18_eligtrac). The factor $\frac{\partial \mathcal{L}}{\partial S_i}$ describes how changing the output of neuron $i$ modifies the global loss and captures the credit assignment problem. Interestingly, this learning rule is compatible with synaptic plasticity in the brain, so long as there exists a process that computes (or approximates) this factor and communicates it to neuron $i$. Whether and how this can be achieved in a local and biologically plausible fashion is under debate, and there exist several methods to approximate this term on a variety of problems (Zenke_Neftci21_brailear). For instance, direct feedback alignment is an important candidate method to overcome this problem in SNNs (Neftci_etal17_evenranda). (lindsey2020learning) builds on direct feedback alignment by meta-training the random feedback parameters used for feedback alignment.
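
In code, the three factors of Eq. (7) can be combined as in the schematic sketch below (our own variable names: `err` stands for the error term ∂L/∂S_i, `U_post` for the post-synaptic membrane potentials, and `P_pre` for the pre-synaptic traces).

```python
import torch

def three_factor_update(err, U_post, P_pre, lr=1e-3, slope=10.0):
    """Weight update dW[i, j] = -lr * err[i] * sigma'(U[i]) * P[j], following Eq. (7)."""
    surr_deriv = slope / (1.0 + slope * U_post.abs()) ** 2  # derivative of the fast sigmoid
    post_factor = err * surr_deriv                          # top-down error x local surrogate factor
    return -lr * torch.outer(post_factor, P_pre)            # outer product with pre-synaptic traces
```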

The relationship of gradient descent with synaptic plasticity means that meta-training is a form of programming of synaptic plasticity. Programming synaptic plasticity is important for mixed-signal neuromorphic implementations of synaptic plasticity because such circuits are prone to mismatch (Chicca_etal13_neurelec; Prezioso_etal18_spikplas). Even in digital neuromorphic technologies, programming synaptic plasticity can be useful to approximate an optimal learning rule that would otherwise be impossible or too expensive to implement (Davies_etal18_loihneur).

4.3 Meta-learning for Neuromorphic Learning Machines

(miconi2018plasticnets) showed that plasticity can be optimized by gradient descent, or meta-learned, in large artificial networks with Hebbian plastic connections, an approach called differentiable plasticity. Each synapse stores both a fixed component, a traditional connection weight, and a plastic component, a Hebbian trace that keeps a running average of the product of pre- and post-synaptic activity, scaled by a coefficient that controls how plastic the synapse is. Networks using these plastic weights were demonstrated to achieve performance similar to MAML and Matching Networks. A Hebbian-based hybrid global and local plasticity learning rule similar to the differentiable plasticity presented in (miconi2018plasticnets) was applied with SNNs in the Tianjic hybrid neuromorphic chip (wu2020braininspired).

That work meta-optimized Hebbian-based STDP learning rules and local meta-parameters on the Omniglot dataset to examine the few-shot learning performance of the model on the Tianjic neuromorphic hardware. Our work extends (wu2020braininspired) by directly optimizing a putative three-factor learning rule on event-based data using surrogate gradients, which could be used for few-shot learning in neuromorphic hardware.

Meta-learning and transfer learning techniques are already presenting themselves as key tools for neuromorphic learning machines. For example, transfer learning on the Intel Loihi Neuromorphic Research Chip was used to enable few-shot learning (Stewart_etal20_on-cfew-). There, a gesture classification network was pre-trained using a functional simulator of the Loihi cores, and the trained parameters were transferred onto the chip. Using local synaptic plasticity processors, the hardware was able to learn 5 novel gestures without catastrophic forgetting, achieving 60% accuracy after a single, 1 second-long presentation of each class. While encouraging, we believe this performance can be improved using our approach.

4.4 Conclusion

We argued that successful meta-learning on SNNs holds promise to reduce the training data and iterations required at the edge, to mitigate catastrophic forgetting, and to enable learning with low-precision plasticity mechanisms, making it instrumental for designing and deploying neuromorphic learning machines on real-world problems. As a bi-level learning mechanism, our results point towards a hybrid framework whereby SNNs are pre-trained offline for online learning. As a result, we expect a strong redefinition of synaptic plasticity requirements and exciting new learning applications at the edge.

Data Availability Statement

The data that supported the findings of this study are found at https://github.com/nmi-lab/torchneuromorphic.

Acknowledgements

This research was supported by the Intel Corporation (KS, EN), the National Science Foundation (NSF) under grant 1652159 (EN), and the Telluride Neuromorphic Cognition Workshop 2020 (NSF OISE 2020624). We would like to thank Jan Finkbeiner for his useful comments.

References