Synaptic Plasticity Dynamics for Deep Continuous Local Learning

11/27/2018 ∙ by Jacques Kaiser, et al. ∙ FZI Forschungszentrum Informatik ∙ University of California, Irvine ∙ University of California, San Diego

A growing body of work underlines striking similarities between spiking neural networks modeling biological networks and recurrent, binary neural networks. A relatively smaller body of work, however, discusses similarities between the learning dynamics employed in deep artificial neural networks and synaptic plasticity in spiking neural networks. The challenge preventing this is largely due to the discrepancy between the dynamical properties of synaptic plasticity and the requirements for gradient backpropagation. Here, we demonstrate that deep learning algorithms that locally approximate the gradient backpropagation updates using locally synthesized gradients overcome this challenge. Locally synthesized gradients were initially proposed to decouple one or more layers from the rest of the network so as to improve parallelism. Here, we exploit these properties to derive gradient-based learning rules in spiking neural networks. Our approach results in highly efficient spiking neural networks and synaptic plasticity capable of training deep neural networks. Furthermore, our method utilizes existing autodifferentiation methods in machine learning frameworks to systematically derive synaptic plasticity rules from task-relevant cost functions and neural dynamics. We benchmark our approach on the MNIST and DVS Gestures datasets, and report state-of-the-art results on the latter. Our results provide continuously learning machines that are not only relevant to biology, but suggestive of a brain-inspired computer architecture that matches the performance of GPUs on target tasks.

1 Introduction

Understanding how the plasticity dynamics in multilayer biological neural networks are organized for efficient data-driven learning is a long-standing question in computational neuroscience (Zenke and Ganguli, 2017; Sussillo and Abbott, 2009; Clopath et al., 2010). The generally unmatched success of deep learning in a wide variety of data-driven tasks prompts the question whether the ingredients of its success are compatible with biological neural networks, namely spiking neural networks. The response to this question is largely positive (Neftci, 2018). However, biological neural networks distinguish themselves from the assumptions made in artificial neural networks by their continuous-time dynamics, the locality of their operations (Baldi et al., 2017), and their spike-based communication. Taking these properties into account in a neural network is challenging. The spiking nature of the neurons' nonlinearity makes it non-differentiable. The continuous-time dynamics involve temporal dependencies that create a challenging credit assignment problem. The assumption of local computations at the neuron disqualifies the use of backpropagation through time. The failure to take all these properties into account causes learning in spiking neural networks to require very large neural networks, a lot of time, or most often both, compared to artificial neural networks. Improving the learning performance of spiking neural networks is not only one step in the quest to understand the adaptive capabilities of the brain, but also a critical endeavor to build brain-inspired, neuromorphic computing technologies that emulate the dynamics of neural circuits (Neftci, 2018).

In this article, we describe Deep Continuous Local Learning (DCLL), a spiking neural network model with plasticity dynamics that is compatible with the properties of biological neural networks mentioned above, and learns at proficiencies comparable to those of small deep neural networks (Fig. 1). DCLL builds on recent work in training spiking neural networks using the strategies developed to train deep neural networks. Using layerwise local classifiers (Mostafa et al., 2017), the gradients are computed locally using pseudo-targets (usually the labels themselves). To take the temporal dynamics of the neurons into account, we use a Spike Response Model (SRM) and a soft threshold function for computing a surrogate gradient, similarly to SuperSpike (Zenke and Ganguli, 2017). The information needed to compute the gradient is propagated forward (as opposed to being stored, as in backpropagation through time, for example), making the plasticity rule temporally local. While SuperSpike scales at least quadratically with the number of neurons, our model scales linearly. To achieve this, we use a local rate-based cost function reminiscent of readout neurons in liquid state machines (Maass et al., 2002), but where the readout is performed over a fixed random combination of the neuron outputs. The rate-based readout does not have a temporal convolution term in the cost function, the absence of which enables linear scaling. Furthermore, the rate-based readout does not prevent learning precise temporal spike trains.

The local classifier in DCLL acts like an encoder-decoder layer reminiscent of the learning mechanism in reservoir type networks, such as the neural engineering framework (Eliasmith and Anderson, 2004), liquid state machines (Maass et al., 2002) and FORCE learning (Sussillo and Abbott, 2009). In reservoir networks, the encoder is typically random and fixed and the decoder is trained. Just like in DCLL, they use a rate-based cost function over a linear combination of spike-driven basis functions. The key difference with DCLL is that the encoder weights are trained, whereas the decoder (readout) weights are random and fixed. The training of the encoder weights allows the network to learn representations that are amenable as inputs for subsequent layers.

Figure 1: Deep Continuous Local Learning. (Left) Each layer consists of spiking neurons with continuous dynamics. Each layer feeds into a local classifier through fixed, random connections (diamond-shaped). The classifier is trained to produce auxiliary targets. Errors in the local classifiers are propagated through the random connections to train weights coming in to the spiking layer, but no further (curvy, dashed line). To simplify the learning rule and enable linear scaling of the computations, the cost function is formulated using a rate code. The states of the spiking neurons (membrane potential, synaptic states, refractory state) are carried forward in time. Consequently, even in the absence of recurrent connections, the neurons are stateful in the sense of recurrent neural networks. (Right) Snapshot of the neural states illustrating the DCLL learning rule in the top layer. In this example, the network is trained to produce three time-varying pseudotargets.

Our approach can be viewed as a type of synthetic gradient. Synthetic gradients were initially proposed to decouple one or more layers from the rest of the network so as to prevent layerwise locking, similar to DCLL. However, synthetic gradients usually involve an outer loop that is equivalent to a full backpropagation through the network, which cannot be done locally in spiking neural networks. Instead, DCLL relies on the initialization of the local random classifier weights and forgoes the outer loop.

One appeal of our model is the scalability of the learning rule: its formulation allows for the convenient use of the autodifferentiation mechanisms in existing machine learning frameworks, in this case PyTorch. Its linear scalability enables the training of hundreds of thousands of neurons on a single GPU, and learning on extremely fine time scales even on long sequences. In our case, we trained with ms sequences on a ms time step. In contrast, due to its memory requirements, Back-Propagation-Through-Time (BPTT) is typically truncated to about 10 steps.

We demonstrate our approach on the classification of gestures from the IBM DVS Gestures dataset (Amir et al., 2017), recorded using an event-based neuromorphic sensor, and report performance comparable to deep neural networks and even networks trained with BPTT. We also perform an "ablation" study on the neuron model which reveals that the spiking nature of the neuron plays no significant role in the final accuracies. These results are consistent with the idea that it is the continuous-time dynamics and the locality of computations that are the most important distinguishing properties of spiking neural networks.

2 Related Work in Multilayer Spike-Based Learning

In (Neftci et al., 2017), the authors demonstrated Event-Driven Random Back-Propagation (eRBP), a form of approximate gradient backpropagation in spiking neural networks that translates into a three-factor rule reminiscent of an error-modulated Hebb rule. The error was mediated by a random top-down (spike-based) feedback and accumulated at a second compartment in each neuron. Upon every pre-synaptic (input) spike, the weights are updated in the direction opposite to the value stored in the second compartment. At classical MNIST digit recognition tasks, eRBP performed nearly as well as an equivalent deep neural network. However, a shortcoming of eRBP is that the model does not take continuous dynamics into account. This is problematic because the "loop duration", i.e. "the duration necessary from the input onset to a stable response in the error neurons", scales with the number of layers. In deep networks, the errors can be strongly delayed when the time constants are long, or when the inputs have fast components. This reduces the quality of the computed gradients.

SuperSpike employs a surrogate gradient descent to train networks of Linear Integrate & Fire (LI&F) neurons on a spike train distance measure. Because the LI&F neuron output is non-differentiable, SuperSpike uses a surrogate network with differentiable activation functions to compute an approximate gradient. The authors show that this learning rule is equivalent to a forward-propagation of errors using eligibility traces, and is capable of efficient learning in hidden layers of feedforward multilayer networks. Unfortunately, the approximations in SuperSpike prevent efficient learning in deep layers, and the algorithm scales as $O(N^2)$, where $N$ is the number of neurons. While quadratic scaling is biologically plausible, it prevents an efficient implementation in digital hardware. Like SuperSpike, DCLL uses surrogate gradients to perform weight updates, but the cost function is rate-based, such that the algorithm scales as $O(N)$. Rate-based cost functions are used in similar scenarios in liquid state machines and FORCE learning, and do not prevent the learning of fine temporal dynamics as in SuperSpike.

Spiking neural networks can be viewed as a subclass of binary, recurrent artificial neural networks. Spiking neurons are recurrent in the Artificial Neural Network (ANN) sense even if all the connections are feed-forward, because the neurons have a state that is propagated forward at every time step. Binary neural networks, where both activations and weights are binary were studied in deep learning as a way to decrease model complexity during inference (Courbariaux et al., 2016; Rastegari et al., 2016). We are not aware of work on recurrent variations of binary nets, however.

Surrogate-gradient descent with forward propagation of the eligibility functions is the flipside of backpropagation-through-time, where gradients are computed using past activities. The BPTT-like approach for spiking neural networks was investigated in (Bohte et al., 2000; Lee et al., 2016; Shrestha and Orchard, 2018). While these approaches provide unbiased estimation of the gradients, we show that DCLL can perform as well as or better than these techniques using lower computational resources. This is because the computational and memory demands are higher for BPTT, which requires truncation and limits the size of the networks that can be simulated. Furthermore, forward-propagation techniques such as DCLL can be formulated as local synaptic plasticity rules, and are thus amenable to implementation in dedicated, event-based (neuromorphic) hardware (Neftci, 2018).

Hierarchy of Time Surfaces (HOTS) is a model for event-based pattern recognition using time surfaces (Lagorce et al., 2015). Time surfaces describe the recent history of events in the spatial neighborhood of an event. Synaptic dynamics in spiking neuron models, which exponentially filter their input events, play the role of time surfaces. In the case of convolutional neural networks (as used in this work) and continuous-time operation, the DCLL forward dynamics are identical to those of HOTS. While the deep weights in HOTS are trained using unsupervised learning, all the layers in DCLL are trained using gradient-based supervised learning updates, making it more efficient when targets or pseudo-targets are available.

Decoupled Neural Interfaces (DNI) were proposed to mitigate layerwise locking in training deep neural networks (Jaderberg et al., 2016). Layerwise locking occurs when the computations in one layer are locked until the errors necessary for the weight update become available. In DNI, this decoupling is achieved using a synthetic gradient, a neural network that estimates the gradients for a portion of the network. In an inner loop, the network parameters are trained using the synthetic gradients, and in an outer loop the synthetic gradient network parameters are trained using a full BP step. The gradient computed using local errors in DCLL, described below, can be viewed as a type of synthetic gradient that ignores the outer loop to avoid a full BP step (Mostafa et al., 2017). Although ignoring the outer loop limits DCLL's cross-layer feature adaptation, we find that the network performs strikingly well.

This work builds on a combination of SuperSpike, local errors and random backpropagation, with the realization that a rate-based cost function combined with a differentiable spike-to-rate decoder can still exploit temporal dynamics of the spiking neurons.

3 Methods

3.1 Neuron and Synapse Model

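The neuron model used for DCLL can be compactly described as follows:

$$U_i(t) = \sum_j W_{ij}\,(\epsilon * S_j)(t) + (\eta * S_i)(t), \qquad S_i(t) = \Theta(U_i(t)),$$

where $U_i$ is the membrane potential of neuron $i$, $S_j$ are the input spike trains weighted by the synaptic weights $W_{ij}$, $S_i$ is the output spike train, $\Theta$ is the unit step function, $\epsilon$ and $\eta$ are kernels that reflect neural and synaptic dynamics, e.g. refractoriness, reset and postsynaptic potentials, and $*$ denotes a (temporal) convolution. Consistent with the current-based Integrate & Fire (I&F) model, the second and first order kernels for $\epsilon$ and $\eta$ are, respectively: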

(1)

This model is consistent with a deterministic SRM (Gerstner and Kistler, 2002).
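As a rough illustration of a deterministic SRM of this kind, the sketch below simulates a single neuron in discrete time. It is a minimal sketch under stated assumptions, not the authors' implementation: the second-order synaptic kernel is assumed to be a cascade of two exponential filters, the refractory kernel a single exponential, and the names `simulate_srm_neuron`, `tau_mem`, `tau_syn`, `tau_ref`, `rho` and `bias`, together with all default values, are hypothetical.

```python
import math


def simulate_srm_neuron(input_spikes, weights, dt=1.0, tau_mem=20.0,
                        tau_syn=5.0, tau_ref=10.0, rho=1.0, bias=-1.0):
    """Simulate one neuron; input_spikes[t][j] is 0 or 1.

    All parameter names and defaults are illustrative assumptions.
    Returns (membrane potentials, output spikes), one entry per time step.
    """
    a_mem = math.exp(-dt / tau_mem)
    a_syn = math.exp(-dt / tau_syn)
    a_ref = math.exp(-dt / tau_ref)
    n_in = len(weights)
    q = [0.0] * n_in  # first filtering stage (synaptic trace)
    p = [0.0] * n_in  # second filtering stage (membrane trace)
    r = 0.0           # refractory trace driven by the neuron's own spikes
    potentials, spikes = [], []
    for s_in in input_spikes:
        # Membrane potential: weighted filtered inputs minus refractory term.
        u = sum(w * pj for w, pj in zip(weights, p)) + bias - rho * r
        s_out = 1 if u > 0.0 else 0  # unit step function as spike condition
        potentials.append(u)
        spikes.append(s_out)
        # States are carried forward in time; nothing is stored for later.
        for j in range(n_in):
            p[j] = a_mem * p[j] + (1.0 - a_mem) * q[j]
            q[j] = a_syn * q[j] + s_in[j]
        r = a_ref * r + s_out
    return potentials, spikes
```

Calling `simulate_srm_neuron([[1]] * 50, [2.0])` drives one synapse with a spike at every step; setting `rho=0.0` removes the refractory contribution, which in this sketch lets the neuron fire at every step once its potential has crossed zero.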

3.2 Surrogate Gradients of the Neuron and Synapse

Generally, the SRM output is stochastic, such that the conditional probability of an output spike ($S_i(t) = 1$) given the input spike vector $\mathbf{S}(t)$ is:

$$P(S_i(t) = 1 \mid \mathbf{S}(t)) = \rho(U_i(t)),$$

where $U_i$ is the membrane potential and $\rho$ is interpreted as a stochastic intensity. The use of the stochastic intensity provides a good description for biological neurons, and provides the means to compute the gradient with respect to the neuron parameters (Williams, 1992). However, the training of stochastic neurons is notoriously slow due to the high variance of the gradient estimator. Recent work showed that a deterministic approach using a hard threshold during inference combined with a differentiable activation function during training yields very good results (Zenke and Ganguli, 2017; Neftci et al., 2017). This approach is called surrogate gradient-based learning, since parameter updates are based on a differentiable but different version of the task-performing network. For clarity, the spikes $S_i = \Theta(U_i)$ are used in the forward (non-learning) computations and the variables $\sigma(U_i)$ are used in the backward (learning) computations, where $\sigma$ is a sigmoidal function. In practice, we choose a symmetric $\sigma$ centered on the firing threshold. We emphasize here again that the neuron model remains deterministic and that $\sigma$ is used as a differentiable approximation of the step function $\Theta$.
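The split between forward and backward computations can be sketched as follows. This is only an illustration under assumptions: the logistic function is used as the sigmoidal surrogate, the firing threshold is placed at zero, and the function names and the `slope` parameter are hypothetical.

```python
import math


def step(u):
    """Hard threshold used in the forward (non-learning) computations."""
    return 1.0 if u > 0.0 else 0.0


def sigmoid(u, slope=1.0):
    """Assumed surrogate: a logistic function centered on zero."""
    return 1.0 / (1.0 + math.exp(-slope * u))


def sigmoid_prime(u, slope=1.0):
    """Derivative of the surrogate, used in the backward (learning) computations."""
    s = sigmoid(u, slope)
    return slope * s * (1.0 - s)


def forward_and_surrogate(u, slope=1.0):
    """Return the spike seen by downstream neurons and the surrogate derivative."""
    return step(u), sigmoid_prime(u, slope)
```

The neuron stays deterministic: `step` decides whether a spike is emitted, while `sigmoid_prime` only appears in the weight update, where it is largest for potentials close to the threshold.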

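The surrogate network can be differentiated with respect to the neuron parameters. This enables a gradient-based optimization of a target loss $\mathcal{L}$ as a function of the synaptic weights $W_{ij}$:

$$\Delta W_{ij} \propto -\frac{\partial \mathcal{L}}{\partial W_{ij}} = -\frac{\partial \mathcal{L}}{\partial S_i}\,\frac{\partial S_i}{\partial U_i}\,\frac{\partial U_i}{\partial W_{ij}}. \qquad (2)$$

The gradient of the neuron output with respect to the parameter $W_{ij}$ is

$$\frac{\partial S_i}{\partial W_{ij}} = \sigma'(U_i)\left(\epsilon * S_j + \eta * \frac{\partial S_i}{\partial W_{ij}}\right), \qquad (3)$$

where $U_i$ is the membrane potential of neuron $i$, $\sigma'$ is the derivative of the surrogate activation function, $\epsilon * S_j$ is the input spike train filtered by the synaptic kernel, and $\eta$ is the refractory kernel.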

Due to the dependence of the refractory term on its own history, the gradient cannot be computed in closed form. Dropping the remaining derivative in the equation above is not a valid option in the general case, as there is no guarantee that the term is small. One possibility is to use a weak refractory kernel or none (i.e. a refractory kernel equal to zero). However, in the absence of a refractory mechanism or firing rate regularization, the neurons fire at very high rates. The solution we use is to enforce a low firing rate through an activity-dependent regularizer strong enough that the contribution of the refractory term is negligible. A similar approach was used in SuperSpike (Zenke and Ganguli, 2017).

Because our simulations are based on dense operations (GPU-based for the most part), a high firing rate does not have any impact on performance. Note that in the case of event-based neuromorphic accelerators, the performance of a network scales directly with the number of synaptic events (Merolla et al., 2014). When strong refractory kernels are used, we use a regularizer to prevent sustained firing (i.e. keeping the membrane potential below the firing threshold), and an activity regularizer to maintain a minimum firing rate in each layer.

Eq. (2) and Eq. (3) suggest a synaptic plasticity rule that is tailored to the neuron and synapse model through its dependence on the neural and synaptic kernels. Furthermore, we now see that Eq. (2) consists of three types of factors: one modulatory (the gradient of the loss with respect to the neuron output), one post-synaptic (the derivative of the activation function) and one pre-synaptic (the filtered input spikes). These types of rules are often termed three-factor rules, which have been shown to be consistent with biology (Pfister et al., 2006) and compatible with a wide number of unsupervised, supervised and reinforcement learning paradigms (Urbanczik and Senn, 2014).

3.3 Local Synaptic Plasticity Rules and Auxiliary Cost Functions

In Eq. (3), both pre-synaptic and post-synaptic terms are local, meaning that all the variables needed to compute them are available at the neuron and synapse. The third factor plays the role of the back-propagated errors in the gradient backpropagation rule, and generally involves non-local terms, including the activity of other neurons and the targets, and their history. While an increasing body of work is showing that approximations to the back-propagated errors are possible, for example in feedback alignment (Lillicrap et al., 2014; Nøkland, 2016; Neftci et al., 2017), how to maintain their history efficiently remains a challenging problem. SuperSpike (Zenke and Ganguli, 2017) deals with it by explicitly computing this history at the synapse. In the exact form, this results in as many nested convolutions as the network has layers, which is computationally inefficient. To approximate this, (Zenke and Ganguli, 2017) uses a straight-through estimator, whereby the activation function derivatives of the other layers are ignored (they are all set equal to 1). In this case, the nested convolutions can be combined into a single one, such that only two convolutions remain necessary. This approach however has limited power in cases where two or more layers are used, and the two nested convolutions involve a quadratic scaling of the number of state variables.

One other approach is to enforce locality by using local gradients, or equivalently, local classifiers. One difficulty in defining a local error signal at a neuron in a deep layer is that the cost function is almost always defined using the network output at the top layer. Thus, using local information only, a neuron in a deep layer can not infer how a change in its activity will affect the top-layer cost. To address this conundrum, ref. (Mostafa et al., 2017) attaches random local classifiers to deep layers and defines auxiliary cost functions using their output. These auxiliary cost functions provide a task-relevant source of error for neurons in deep layers. Surprisingly, training deep layers using auxiliary local errors that minimize the cost at the local classifiers still allows the network as a whole to reach a small top-layer cost. That is because minimizing the local classifiers’ cost puts pressure on deep layers to learn useful task-relevant features that will allow the random local classifiers to solve the task. Moreover, each layer builds on the features of the previous layer to learn even better features for its local random classifier. Thus, even though no error information propagates downwards through the layer stack, the layers indirectly learn useful hierarchical features that end up minimizing the cost at the top layer.

3.4 The Deep Continuous Local Learning rule

The DCLL rule combines SuperSpike with the deep local learning described above to solve the temporal and spatial credit assignment problem in continuous (spiking) neural networks. Hence our approach is called Deep Continuous Local Learning (DCLL). To achieve this, we organize layers of such neurons, and train each layer to predict a pseudotarget $\hat{Y}^l$ using a random local classifier $Y^l = G^l S^l$, where $l$ indexes the layer, $S^l$ is the output of layer $l$, and $G^l$ are fixed, random matrices (one for each layer $l$). The loss function is the sum of the layerwise loss functions, i.e. $\mathcal{L} = \sum_l \mathcal{L}^l$, where $\hat{Y}^l$ is the pseudotarget for layer $l$. In the special case of an MSE loss, the layerwise loss is:

where is the pseudotarget for layer . The gradient of the loss becomes:

(4)

where for MSE loss . This update rule prescribes that the weight updates should be executed after the presentation of the sequence of duration . In a discrete-time simulation, empirical results show that learning works well even if the updates are made at every time step of the simulation (Neftci, 2018). Replacing Eq. (3) in Eq. (4):

(5)

where is a learning rate. We note that the gradient of the loss at the top layer, , is only used to update the weights in layer

and does not backpropagate further through the network. In all our experiments, updates are made for each time step of the simulation. Furthermore, the neural and synaptic time constants were drawn randomly from a uniform distribution.
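The structure of a single per-time-step update for one layer can be sketched as below. This is a hedged illustration rather than the rule as published: it assumes an MSE loss, a linear readout of the layer's spikes through the fixed random matrix, a logistic surrogate, and no refractory contribution. The names `dcll_layer_update`, `P` (filtered pre-synaptic inputs) and `lr` are hypothetical.

```python
import math


def sigmoid_prime(u):
    """Derivative of an assumed logistic surrogate activation."""
    s = 1.0 / (1.0 + math.exp(-u))
    return s * (1.0 - s)


def dcll_layer_update(W, G, P, U, S, target, lr=0.1):
    """One local update for a single layer (illustrative sketch).

    W: trainable weights, shape [n_out][n_in]
    G: fixed random readout matrix, shape [n_cls][n_out] (never updated)
    P: filtered pre-synaptic inputs, length n_in
    U: membrane potentials, length n_out
    S: output spikes, length n_out
    target: pseudotarget for this layer, length n_cls
    """
    n_out, n_in, n_cls = len(W), len(W[0]), len(G)
    # Local classifier output and its error against the pseudotarget.
    y = [sum(G[k][i] * S[i] for i in range(n_out)) for k in range(n_cls)]
    err = [y[k] - target[k] for k in range(n_cls)]
    new_W = []
    for i in range(n_out):
        # Modulatory factor: error sent back through the random connections.
        modulator = sum(G[k][i] * err[k] for k in range(n_cls))
        post = sigmoid_prime(U[i])  # post-synaptic factor
        # P[j] is the pre-synaptic factor.
        new_W.append([W[i][j] - lr * modulator * post * P[j]
                      for j in range(n_in)])
    return new_W, err
```

Every quantity used here belongs to the layer being trained or to its local classifier, so stacking several such layers requires no error signal to travel between them.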

Implementation using Automatic Differentiation:

Temporal convolutions can make implementation in machine learning frameworks difficult. With DCLL, however, the function to be differentiated is linear in the parameters (note that it is not the case in the Van Rossum Distance). In the case of no refractory kernel (), the loss function involves derivatives of . This equation involves no temporal convolutions in the trainable parameters . The linear property enables the cost function to be differentiated using automatic differentiation tools provided by machine learning frameworks out-of-the-box. The surrogate gradient approach guarantees that the computed gradients of are the same as the DCLL rule. This enables the integration of DCLL with machine learning frameworks, in our case PyTorch without backpropagating through time since the only time-dependent term, is propagated forward. The integration is significant in that one can build large convolutional neural networks, as well as leverage any type of layer, operation, optimizer and cost function provided by the software. We leverage this integration in all our experiments under the Results section.
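The claim that differentiating the surrogate loss reproduces the hand-written rule can be checked numerically on a toy layer. The sketch below substitutes central finite differences for a framework's automatic differentiation so that it runs with the standard library alone; the MSE loss, the logistic surrogate, the absence of a refractory kernel and all function names are assumptions made for illustration.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def activations(W, P):
    """Surrogate activations; U is linear in W because P is precomputed."""
    return [sigmoid(sum(w * p for w, p in zip(row, P))) for row in W]


def surrogate_loss(W, G, P, target):
    """Assumed MSE loss on the random local readout of the surrogate outputs."""
    A = activations(W, P)
    y = [sum(g * a for g, a in zip(g_row, A)) for g_row in G]
    return 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, target))


def rule_gradient(W, G, P, target):
    """Gradient written as modulatory x post-synaptic x pre-synaptic factors."""
    A = activations(W, P)
    err = [sum(g * a for g, a in zip(g_row, A)) - tk
           for g_row, tk in zip(G, target)]
    grad = []
    for i, a in enumerate(A):
        modulator = sum(G[k][i] * err[k] for k in range(len(G)))
        grad.append([modulator * a * (1.0 - a) * p for p in P])
    return grad


def numerical_gradient(W, G, P, target, eps=1e-6):
    """Central finite differences, standing in for automatic differentiation."""
    grad = []
    for i in range(len(W)):
        row = []
        for j in range(len(W[0])):
            W_plus = [r[:] for r in W]
            W_minus = [r[:] for r in W]
            W_plus[i][j] += eps
            W_minus[i][j] -= eps
            diff = (surrogate_loss(W_plus, G, P, target)
                    - surrogate_loss(W_minus, G, P, target))
            row.append(diff / (2.0 * eps))
        grad.append(row)
    return grad
```

Because the filtered input `P` is propagated forward and treated as a constant at each step, no temporal convolution enters the differentiated function, which is why a generic differentiation tool suffices in this setting.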

Local synaptic plasticity rule:

Eq. (5) requires some information that is non-local to the error neurons and the hidden neuron. This includes (not local to the error neurons), the targets and the weights (not local to the hidden neurons). We assume that a dedicated channel communicates these targets to each unit . Furthermore, because the are fixed, we can duplicate them on both ends of the connection. The term can be approximated by as in the Van Rossum Distance but without taking it into account in the gradient so as to avoid the nested convolution, or simply . In both cases can be efficiently delivered to the error neuron through a conventional connection.

Finally, the error computed at the error neuron must be communicated to the hidden neuron. To transmit it, the error neuron output can be separated into two channels, one positive and one negative, each of them spiking when the error changes by a fixed positive or negative threshold, respectively. This approach is particularly interesting, as it prescribes an error-triggered learning rule (Neftci, 2018),

(6)

which can improve learning efficiency when implemented on an event-based processor, as Eq. (6) is triggered only when the local error exceeds some threshold. Given that our simulations are currently GPU based, we use Eq. (5) for updates. In future work, we will evaluate the effectiveness of Eq. (6) on large-scale neural networks (but see (Neftci, 2018) for a simple example).
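The error-triggered signalling described above can be pictured with a small sketch. It assumes a simple delta-style encoder in which a positive or negative event is emitted whenever the error has moved by at least a threshold since the last event; the function name, the reset-to-current-value behaviour and the default threshold are illustrative assumptions, not details taken from the paper.

```python
def error_triggered_events(errors, threshold=0.1):
    """Turn a sequence of error values into sparse +1 / -1 / 0 events.

    +1: the error rose by at least `threshold` since the last event.
    -1: the error fell by at least `threshold` since the last event.
     0: no event, so no weight update would be triggered.
    """
    events = []
    reference = 0.0  # error value at the time of the last emitted event
    for e in errors:
        if e - reference >= threshold:
            events.append(1)
            reference = e
        elif reference - e >= threshold:
            events.append(-1)
            reference = e
        else:
            events.append(0)
    return events
```

On an event-based processor, only the non-zero entries would cause a weight update, so a slowly varying error produces few updates.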

Relation to Van Rossum distance and SuperSpike and Linear versus Quadratic Scaling

The SuperSpike learning rule is a surrogate gradient descent on spike distance, i.e. Van Rossum distance:

where is an exponential filter similar to above. When an MSE loss is used with DCLL, it can be viewed as a simplification of the SuperSpike rule where spike distance is replaced by instantaneous spike count distance. In fact, the target in the Van Rossum Distance, , can be interpreted as and the prediction fulfills a role similar to the local classifier activation function. These dynamics and loss function avoid nested convolutions and enable the linear, $O(N)$ scaling of DCLL, where $N$ is the number of neurons.
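To make the contrast concrete, the sketch below compares a discretized Van Rossum-style distance, where both spike trains are passed through an exponential filter before the squared difference is taken, with an instantaneous count distance that compares the trains step by step. The single-exponential filter, the time constant and the function names are assumptions chosen for illustration.

```python
import math


def exp_filter(spikes, tau=10.0, dt=1.0):
    """Causal exponential filter applied to a binary spike train."""
    decay = math.exp(-dt / tau)
    trace, out = 0.0, []
    for s in spikes:
        trace = decay * trace + s
        out.append(trace)
    return out


def van_rossum_distance(a, b, tau=10.0, dt=1.0):
    """Squared distance between the filtered spike trains."""
    fa, fb = exp_filter(a, tau, dt), exp_filter(b, tau, dt)
    return 0.5 * dt * sum((x - y) ** 2 for x, y in zip(fa, fb))


def instantaneous_count_distance(a, b, dt=1.0):
    """Squared distance between the unfiltered trains, step by step."""
    return 0.5 * dt * sum((x - y) ** 2 for x, y in zip(a, b))
```

The filtered distance contains a temporal convolution of the output spikes inside the loss, which is the term that leads to nested convolutions once differentiated; the instantaneous version contains none.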

4 Experiments

4.1 Regression with Poisson Spike Trains

To illustrate the inner workings of DCLL, we first apply it to a regression task. A three layer fully connected network is stimulated with a frozen ms Poisson spike train. The pseudotargets are a ramp function, a cosine function and a sine function for each layer, respectively. Fig. 1 illustrates the states of the neuron. For each local classifier, dropout was used to prevent over-reliance on one particular neuron. For illustration purposes, the recording of the neural states was made in the absence of parameter updates (i.e. the learning rate is 0). We use the Adamax optimizer (Kingma et al., 2014) and a smooth L1 loss. In this experiment, the refractory period was non-zero, as observed by the resetting effect in the membrane potential. As discussed in the methods, we use regularization to keep the neurons from sustaining high firing rates, together with an activity regularizer. Updates to the weights can occur at each time step when the derivative of the activation function and the input state are non-zero. The magnitude and direction of the update are determined by the error. Note that, in effect, the error is randomized as a consequence of the random local classifier. Because the input neurons have a fixed mean firing rate, the network learned to use the input spike times to reliably produce the targets.

4.2 Poisson MNIST

We first show the results of our method on the MNIST dataset compared to a conventional convolutional neural network. For DCLL, each digit is converted into a 500ms Poisson spiketrain, where the mean firing rates vary from 0 to Hz depending on the pixel intensity. Gradient updates are performed at every simulation step after a 50ms burn-in (450 gradient steps per minibatch). Each minibatch contains 64 samples. The test set consists of 1024 unseen samples converted into spiketrains and presented for 1s to the network. We rely on a simple network architecture consisting of three convolutional layers of 16, 24 and 32 channels respectively with kernels interleaved with max pooling layers. In total, the network has 3528 tunable weights and 72 biases, and was trained using Adamax on a smooth L1 loss.
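The conversion of a pixel into a Poisson spike train can be sketched as follows. This is an assumption-laden illustration: the maximum firing rate is not recoverable from the text, so `max_rate_hz` is a placeholder value, the Poisson process is approximated by one Bernoulli draw per time step, and the function name and the 1 ms step are hypothetical.

```python
import random


def poisson_spike_train(intensity, duration_ms=500, max_rate_hz=100.0,
                        dt_ms=1.0, rng=None):
    """Encode a pixel intensity in [0, 1] as a binary spike train.

    max_rate_hz is a placeholder assumption; the mean rate scales
    linearly with the intensity, from 0 up to max_rate_hz.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    # Probability of a spike in one time step (Bernoulli approximation).
    p_spike = intensity * max_rate_hz * dt_ms / 1000.0
    n_steps = int(duration_ms / dt_ms)
    return [1 if rng.random() < p_spike else 0 for _ in range(n_steps)]
```

With these assumed values a 500 ms presentation yields 500 time steps, so dropping a 50 ms burn-in leaves 450 steps, in line with the 450 gradient steps per minibatch quoted above.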

The results are compared with a reference network of the same architecture and optimizer, trained in the same regime by performing 500 backpropagation steps for each batch. The pseudo-targets used for the local classifiers are class labels. In order to match the number of trainable parameters with the conventional convnet architecture, we used one additional fully connected layer without spiking neuron dynamics.

The results of the method on the MNIST dataset are shown in Figure 2. The final accuracy after 50 000 training samples is 98.73% for the third layer of the spiking DCLL version against 98% for the analog backpropagation network. Note that the accuracy was computed sample by sample. When few samples are present, the spiking DCLL version seems to learn a better generalization, as the performance is below 90% after the first 2000 samples. This could be due to the random noise added to the samples when converting them to spiketrains with the Poisson distribution, which may help generalization.

Figure 2: Classification results on the MNIST dataset for the three layers of our network against a reference analog network. (Left) Accuracy for the whole training procedure. (Right) accuracy for the first 3000 training samples.

4.3 DVS Gestures

We test DCLL on the more challenging task of learning gestures recorded using a Dynamic Vision Sensor (DVS) (Lichtsteiner et al., 2008). Amir et al. recorded the DvsGesture dataset using a DVS, comprising 1342 instances of a set of 11 hand and arm gestures, collected from 29 subjects under 3 different lighting conditions (Amir et al., 2017). Unlike standard imagers, the DVS records streams of events that signal the temporal intensity changes at each of its pixels. The unique features of each gesture are embedded in the stream of events. The event streams were downsized to and binned in frames of , the effective time step of the GPU-based simulation (Fig. 3). During training, random ms long sequences were selected in batches of 72. Testing sequences were ms long, and were selected starting from the beginning of each recording (288 testing sequences). Note that the shortest recording in the test set is ms, and this duration was selected to simplify and speed up the classification evaluation. The classification is obtained by counting spikes at the output starting from a "burn-in period" of ms and selecting as output class the neuron that spiked the most. Contrary to (Amir et al., 2017), we did not use stochastic decay and the neural network structure is an all convolutional neural network, loosely adapted from (Springenberg et al., 2014). We find that kernels provide much better results than or kernels commonly used in image recognition. Furthermore, we did not observe significant improvement by adding more than 3 convolutional layers. The optimal hyperparameters were found by a combination of manual and grid search. We used a weak refractory kernel ( and ). These settings provided the best accuracy, at the cost of allowing short bursts of neural activity (Fig. 4).
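The two data-handling steps described above, binning the event stream into frames and reading out the class by spike count after a burn-in period, can be sketched as below. The event tuple layout `(t_ms, x, y, polarity)`, the sparse dictionary frames, the bin width and both function names are assumptions made for illustration.

```python
def bin_events(events, n_bins, bin_ms=1.0):
    """Accumulate events into time bins.

    events: iterable of (t_ms, x, y, polarity) tuples (assumed layout).
    Returns one dict per bin mapping (polarity, y, x) to an event count.
    """
    frames = [dict() for _ in range(n_bins)]
    for t_ms, x, y, polarity in events:
        b = int(t_ms // bin_ms)
        if 0 <= b < n_bins:
            key = (polarity, y, x)
            frames[b][key] = frames[b].get(key, 0) + 1
    return frames


def classify_by_spike_count(output_spikes, burn_in_steps):
    """Pick the output neuron that spiked the most after the burn-in.

    output_spikes[t][c] is 0 or 1 for class neuron c at time step t.
    """
    n_classes = len(output_spikes[0])
    counts = [0] * n_classes
    for row in output_spikes[burn_in_steps:]:
        for c, s in enumerate(row):
            counts[c] += s
    return max(range(n_classes), key=lambda c: counts[c])
```

Discarding the burn-in steps keeps the transient at sequence onset, before the neural states have settled, from influencing the vote.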

Overall, our performance is equal to or better than other published spiking neural network implementations that use backpropagation for training (Tab. 1, Fig. 5). DCLL reached the reported accuracies after a much smaller number of iterations compared to the IBM EEDN case (Amir et al., 2017). Furthermore, our network achieved these results using a much smaller network compared to the other reported results.

The spiking activation function performed better than the sigmoid and ReLU cases (not shown). However, we note that the parameter settings were tuned for the spiking neuron and transferred without further tuning to the networks with ReLU and logistic activation functions.

Model               Error     Training   Samples
IBM EEDN            5.51%     Offline    samples
Slayer              6.36%     Offline    # samples not reported
DCLL (This Work)    5.819%    Online     samples

Table 1: Classification error at the DVS Gestures task. The number of layers reported includes pooling layers. The number of training samples indicates the number of sample iterations of the algorithm during training, as opposed to the number of distinct samples in the dataset.
Layer Type         #      Dimensions
Input (ON, OFF)    2
Conv               64
MaxPool            64
Dropout(p=.5)
Dense              11
Conv               128
Dropout(p=.5)
Dense              11
Conv               128
MaxPool            128
Dropout(p=.5)
Dense              11
Table 2: All-convolutional neural network used for the DvsGesture dataset. Note that the dense layers are used for the local classifiers only, and their outputs are not fed to the subsequent convolutional layers.
Figure 3: Example of processed DVS Gestures data used for DCLL. Red pixels correspond to OFF events, green pixels correspond to ON events. Note that DCLL is fed with ms frames, and the images shown here aggregate 10 bins (ms) for visualization purposes only.
Figure 4: Raster plot for 5 representative neurons in layer 1 of the DVS Gestures DCLL network. Here, a weak refractory kernel and neural activity regularization were used, as this combination provided the best results.
Figure 5: Classification Error for the DVS Gestures task during learning

5 Discussion

Understanding and deriving neural and synaptic plasticity rules that can enable hidden weights to learn is an ongoing quest in neuroscience and neuromorphic engineering. From a machine learning perspective, locality and differentiability are key issues of the spiking neuron model operations. While the latter problem is now being tackled with surrogate gradient approaches, how to achieve this in deep networks in a scalable and local fashion is still an open question.

We presented a novel synaptic plasticity rule, DCLL, derived from a surrogate gradient approach with linear computational scalability in the number of neurons. The rule draws on recent work in surrogate gradient descent in spiking neurons and local learning with layerwise classifiers. The linear scalability is obtained through a rate-based cost function on the local classifier. To motivate this from both a biological and a neuromorphic implementation point of view, we discussed a fully event-driven variant of the rule using error-triggered updates. The simplicity of the DCLL rule equation makes it amenable to direct exploitation of existing machine learning software libraries. Thanks to the surrogate gradient approach, the updates computed through automatic differentiation are equal to the DCLL update.
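To make the structure of such a local rule concrete, the following NumPy sketch trains a single layer using a fixed random readout and a surrogate gradient. It is an illustration under simplifying assumptions, not the exact DCLL equations: the neuron has no refractory term, the surrogate is the derivative of a sigmoid, the local cost is a squared error on the readout, and all names and values (`alpha`, `lr`, `run_epoch`, the layer sizes) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 20, 30, 5
alpha = 0.9   # decay of the presynaptic trace (illustrative value)
lr = 1e-3     # learning rate (illustrative value)

W = rng.normal(0.0, 0.3, size=(n_hid, n_in))   # trainable layer weights
G = rng.normal(0.0, 0.3, size=(n_out, n_hid))  # fixed random readout, never trained

def surrogate_grad(u):
    """Sigmoid derivative, standing in for the derivative of the spike threshold."""
    s = 1.0 / (1.0 + np.exp(-u))
    return s * (1.0 - s)

def run_epoch(W, in_spikes, target, learn=True):
    """Present one spike sequence; update W in place at every time step if learn=True."""
    P = np.zeros(n_in)   # presynaptic traces, one per input
    total_loss = 0.0
    for s_in in in_spikes:
        P = alpha * P + s_in        # low-pass filter the input spikes
        U = W @ P                   # membrane potential (no refractory term here)
        S = (U > 0).astype(float)   # threshold non-linearity emits spikes
        Y = G @ S                   # local classifier output
        err = Y - target
        total_loss += 0.5 * np.sum(err ** 2)
        if learn:
            # The local error is projected back through the fixed readout and
            # gated by the surrogate gradient; only layer-local quantities are used.
            delta = (G.T @ err) * surrogate_grad(U)
            W -= lr * np.outer(delta, P)
    return total_loss / len(in_spikes)

# Toy usage: random input spikes and a one-hot target for the local classifier.
in_spikes = (rng.random((200, n_in)) < 0.2).astype(float)
target = np.zeros(n_out)
target[2] = 1.0
for epoch in range(5):
    loss = run_epoch(W, in_spikes, target)
    print(f"epoch {epoch}: mean local loss = {loss:.4f}")
```

In this sketch the state kept during learning is one trace per input and one error term per neuron, so no per-synapse history is stored. Replacing the hand-written update with a framework's automatic differentiation, with the surrogate defined as the gradient of the threshold, would be one way to obtain the same kind of update.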

The DCLL rule can exploit the temporal dynamics of the spiking neuron to learn classification or regression on spike trains, similarly to simple recurrent units in artificial neural networks. This is because the state maintained by the neuron is operationally equivalent to that of a recurrent unit. The neuron can learn sequences that have temporal dependencies on the scale of the neural and synaptic time constants, i.e. the kernel. The near state-of-the-art classification accuracy on the DVS Gestures task demonstrates the scalability of the approach. The surrogate gradient approach also enables DCLL to seamlessly exploit random time constants (e.g. see the DVS Gestures experiments). Random time constants are interesting given that a multiplicity of time constants in a recurrent neural network can endow it with a long-term memory that rivals that of long short-term memory units (Koutnik et al., 2014).
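The effect of heterogeneous time constants can be illustrated with a small NumPy sketch in which a single input spike is filtered by exponential kernels with randomly drawn time constants. The function `filter_spikes`, the range of time constants, and the sequence length are assumptions chosen for illustration only.

```python
import numpy as np

def filter_spikes(spikes, tau, dt=1.0):
    """Low-pass filter a spike train with one exponential kernel per time constant.

    spikes: array of shape (time_steps,) with 0/1 entries.
    tau:    array of shape (n_units,) of time constants, in units of dt.
    Returns an array of shape (time_steps, n_units) holding the traces.
    """
    decay = np.exp(-dt / tau)               # per-unit decay factor per time step
    trace = np.zeros_like(tau, dtype=float)
    out = np.empty((len(spikes), len(tau)))
    for t, s in enumerate(spikes):
        trace = decay * trace + s           # leaky integration of the input
        out[t] = trace
    return out

rng = np.random.default_rng(0)
tau = rng.uniform(5.0, 100.0, size=8)   # random time constants (illustrative range)
spikes = np.zeros(200)
spikes[0] = 1.0                          # a single input spike at t = 0

traces = filter_spikes(spikes, tau)
# Units with larger time constants retain more of the spike at the end of the sequence.
for tc, remaining in sorted(zip(tau, traces[-1])):
    print(f"tau = {tc:6.1f}  remaining trace = {remaining:.3e}")
```

Units with short time constants forget the input quickly, while units with long time constants still carry information about it much later, so a population with mixed time constants covers several temporal scales at once.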

Researchers using spiking neural networks in data-driven machine learning tasks often emphasize the spiking nature of their units as a key difference compared to artificial neural networks. Our experience is different, however. We find that their temporal dynamics and their local computations, when implemented on dedicated hardware, are the key differences compared to artificial neural networks. Because our experiments were driven by a machine learning framework, we could easily replace the threshold neuron with a ReLU or a sigmoid function. In general, our initial results are often better with the ReLU. However, after a close inspection of the dynamics, we observe that this is caused by activations reaching their saturation values. Through proper initialization and refractory periods, we could close the gap between ReLU and spiking activation functions.

One limitation of the DCLL rule currently presented is the lack of connections within a single layer. Up to now the learning of recurrent connections has been studied in the rate domain (Sussillo and Abbott, 2009), or using back-propagation through time (Bellec et al., 2018). The key challenge using a forward surrogate gradient approach is the dependency of the synaptic update on the neuron’s own history (e.g. see the discussion on the refractory period in the methods). We speculate that an approach that bootstraps the gradient computation in a fashion similar to synthetic gradients is a promising track to follow.

A direct consequence of the local classifiers is the lack of cross-layer adaptation. To tackle this problem, one could use meta-learning to adapt the random matrix in the classifier. In effect, the meta-learning loop would act as the outer loop in the synthetic gradients approach (Jaderberg et al., 2016). From a neuroscience perspective, the notion that a “layer” of neurons is specialized to solving certain problems and sensory modalities is natural.

Funding

EN was supported by the Intel Corporation, the National Science Foundation under grant 1640081, and by the Korean Institute of Science and Technology. JK was supported by a fellowship within the FITweltweit programme of the German Academic Exchange Service (DAAD). HM was supported by the Swiss National Fund. JK, HM, and EN wrote the paper and conceived the experiments. JK and EN ran the experiments and analyzed the data.

References