Reservoirs learn to learn

09/16/2019, by Anand Subramoney et al.

We consider reservoirs in the form of liquid state machines, i.e., recurrently connected networks of spiking neurons with randomly chosen weights. So far only the weights of a linear readout were adapted for a specific task. We wondered whether the performance of liquid state machines can be improved if the recurrent weights are chosen with a purpose, rather than randomly. After all, weights of recurrent connections in the brain are also not assumed to be randomly chosen. Rather, these weights were probably optimized during evolution, development, and prior learning experiences for specific task domains. In order to examine the benefits of choosing recurrent weights within a liquid with a purpose, we applied the Learning-to-Learn (L2L) paradigm to our model: We optimized the weights of the recurrent connections -- and hence the dynamics of the liquid state machine -- for a large family of potential learning tasks, which the network might have to learn later through modification of the weights of readout neurons. We found that this two-tiered process substantially improves the learning speed of liquid state machines for specific tasks. In fact, this learning speed increases further if one does not train the weights of linear readouts at all, and relies instead on the internal dynamics and fading memory of the network for remembering salient information that it could extract from preceding examples for the current learning task. This second type of learning has recently been proposed to underlie fast learning in the prefrontal cortex and motor cortex, and hence it is of interest to explore its performance also in models. Since liquid state machines share many properties with other types of reservoirs, our results raise the question whether L2L conveys similar benefits also to these other reservoirs.


1 Introduction

One motivation for the introduction of the liquid computing model [Maass et al., 2002] was to understand how complex neural circuits in the brain, such as cortical columns, are able to support the diverse computing and learning tasks which the brain has to solve. It was shown that recurrent networks of spiking neurons with randomly chosen weights, including models for cortical columns with given connection probabilities between laminae and neural populations, could in fact support a large number of different learning tasks, where only the synaptic weights to readout neurons were adapted for a specific task [Maass et al., 2004, Haeusler and Maass, 2006]. However, it is fair to assume that synaptic weights of neural networks in the brain are not just randomly chosen, but shaped through a host of processes – from evolution through development to preceding learning experiences. These processes are likely to aim at improving the learning and computing capability of the network. Hence we asked whether the performance of liquids can also be improved by optimizing the weights of recurrent connections within the recurrent network for a large range of learning tasks.

The Learning-to-Learn (L2L) setup offers a suitable framework for examining this question. This framework builds on a long tradition of investigating L2L, also referred to as meta-learning, in cognitive science, neuroscience, and machine learning [Abraham and Bear, 1996, Wang et al., 2018, Hochreiter et al., 2001, Wang et al., 2016]. The formal model from [Hochreiter et al., 2001, Wang et al., 2016] and related recent work in machine learning assumes that learning (or optimization) takes place in two interacting loops (see Fig. 1A). The outer loop aims at capturing the impact of adaptation on a larger time scale (such as evolution, development, and prior learning in the case of brains). It optimizes a set of hyperparameters Θ for a – in general infinitely large – family F of learning tasks. Any learning or optimization method can be used for that. For learning a particular task from F, the neural network can adapt those of its parameters which do not belong to the hyperparameters Θ that are controlled by the outer loop. In our first demo (section 2) these are the weights of readout neurons. In our second demo in section 3 we assume that – as in [Wang et al., 2018, Wang et al., 2016, Hochreiter et al., 2001] – ALL weights from, to, and within the neural network, in particular also the weights of readout neurons, are controlled by the outer loop. In this case only the dynamics of the network can be used to maintain information from preceding examples of the current learning task in order to produce a desirable output for the current network input. One exciting feature of this L2L approach is that all synaptic weights of the network can be used to encode a really efficient network learning algorithm. It was recently shown in [Bellec et al., 2018b] that this form of L2L can also be applied to networks of spiking neurons. We show in our second demo that this second form of L2L enables recurrent networks of spiking neurons to learn tasks from a given family even faster than in the paradigm where readouts are trained for each particular task. We also discuss in section 3 the interesting fact that L2L induces priors and internal models into liquid state machines.

The structure of this article is as follows. We address in section 2 the first form of L2L, where synaptic weights to readout neurons can be trained for each learning task, exactly as in the standard liquid computing paradigm. We discuss in section 3 the more extreme form of L2L, where ALL synaptic weights are determined by the outer loop of L2L, so that no synaptic plasticity is needed for learning in the inner loop. In section 4 we give full technical details for the demos given in sections 2 and 3. Finally, in section 5 we discuss implications of these results and list a number of related open problems.

2 Optimizing liquids to learn

Figure 1: Learning-to-Learn setup: A) Schematic of the nested optimization that is carried out in Learning-to-Learn (L2L). B) Learning architecture that is used to obtain trained liquids.

In the typical workflow of solving a task with liquid computing, we have to address two main issues: 1) a suitable liquid has to be generated, and 2) a readout function has to be determined that maps the states of the liquid to the target outputs. In the following, we address the first issue by investigating closely how the process of obtaining suitable liquids can be improved. For the remainder of this investigation, we consider recurrently connected networks of spiking neurons as liquids. Usually, in order to obtain a network of such neurons that enables computations with liquid states, a particular network architecture is specified and the corresponding synaptic weights are generated at random; these weights then remain fixed throughout learning a particular task. Clearly, one can tune the random creation process to better suit the needs of the considered task. For example, one can adapt the probability distribution from which weights are drawn. However, it is likely that a liquid that was generated according to such a coarse random scheme is far from perfect at producing liquid states that are really useful for the readout. It is therefore conceivable that a liquid exists that is better suited for the task, or for the range of tasks, at hand.

In order to test this hypothesis, we optimized liquids to produce informative liquid states from which a readout can easily extract the information it needs. This optimization is carried out within the Learning-to-Learn framework.

Description of trained liquids:

The main characteristic of our approach is to view the weight of each synaptic connection in the liquid, i.e., both the recurrent and the input weights, as hyperparameters Θ. This viewpoint allows us to adapt the dynamics of the liquid such that useful liquid states emerge. Learning of a particular task is carried out according to the standard liquid computing paradigm, where only the parameters of a readout are trained for the task at hand.

In order to optimize the large number of hyperparameters and at the same time to prevent overfitting to a particular task, we optimize the learning performance of the liquid for a multitude of tasks in a family F. As a result, a nested optimization procedure is introduced that consists of an inner loop and an outer loop, as shown in Figure 1A. In the inner loop, we consider a particular task which consists of an input time series x(t) and a target y(t) (see Figure 2A). The input x(t) is passed as a stream of inputs to the liquid, gets processed by the dynamics of the recurrently connected neurons, and results in liquid states h(t). A linear readout is used to project the emerging features to a target prediction ŷ(t) = W_out h(t). On the level of the inner loop, only the readout weights W_out are modified to improve task performance. Specifically, the plasticity rule that acts upon these weights is described by gradient descent:

d W_out / dt = −η ∇_{W_out} E(t),    E(t) = (ŷ(t) − y(t))²                (1)

with η representing a learning rate.
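To make the inner-loop plasticity concrete, the following sketch shows one possible discrete-time version of the readout update of Eqn. (1). The array shapes, step size, and learning rate are illustrative assumptions, not values from the paper.

```python
import numpy as np

def readout_update(W_out, liquid_states, targets, eta=1e-3):
    """One pass of gradient-descent readout learning (discrete-time sketch of Eqn. (1)).

    liquid_states: array of shape (T, N) -- filtered spike trains per time step
    targets:       array of shape (T,)   -- target signal y(t)
    W_out:         array of shape (N,)   -- linear readout weights
    """
    for h_t, y_t in zip(liquid_states, targets):
        y_hat = W_out @ h_t                    # readout prediction
        error = y_hat - y_t                    # gradient factor of the squared error
        W_out = W_out - eta * error * h_t      # gradient step on the readout only
    return W_out
```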

The outer loop is concerned with improving the learning process in the inner loop for an entire family of tasks F. Thus, on the level of the outer loop, we consider an optimization objective that acts upon the hyperparameters Θ:

min_Θ  E_{C∼F} [ L_C(Θ, W_out^C) ]                (2)
subject to  W_out^C resulting from inner-loop learning of task C according to (1),                (3)

where L_C denotes the loss of the liquid with hyperparameters Θ and readout weights W_out^C on task C.
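The nested structure of Eqns. (2)-(3) can be summarized by the following structural sketch. The callables (sample_task, run_liquid, readout_step, outer_update) are placeholders for illustration only; in the paper the outer-loop gradient is computed with BPTT through the whole inner loop rather than with the simple per-iteration update shown here.

```python
def learning_to_learn(theta, sample_task, run_liquid, readout_step, outer_update,
                      n_outer_iters=1000, inner_steps=400):
    """Nested L2L optimization (Fig. 1A); a structural sketch under assumed interfaces.

    sample_task()          -> (stream, W_out_init): iterable of (x_t, y_t) pairs for one
                              task C from the family F, plus initial readout weights
    run_liquid(theta, x_t) -> liquid state vector h_t under hyperparameters theta
    readout_step(W, h, y)  -> updated readout weights (inner loop, Eqn. (1))
    outer_update(theta, L) -> updated hyperparameters (outer loop, Eqn. (2))
    """
    for _ in range(n_outer_iters):
        stream, W_out = sample_task()
        outer_loss = 0.0
        for step, (x_t, y_t) in enumerate(stream):
            if step >= inner_steps:
                break
            h_t = run_liquid(theta, x_t)                 # liquid dynamics, fixed within a task
            outer_loss += float((W_out @ h_t - y_t) ** 2)
            W_out = readout_step(W_out, h_t, y_t)        # only the readout adapts in the inner loop
        theta = outer_update(theta, outer_loss)          # e.g. an Adam/BPTT step on Θ
    return theta
```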
Figure 2: Learning to learn a nonlinear transformation of a time series: A) Different tasks arise by sampling second-order Volterra kernels according to a random procedure. Each Volterra kernel is applied to the same input time series x(t). B) Learning performance in the inner loop using the learning rule (1), both for a liquid with random weights and for a liquid that was trained in the outer loop by L2L. Panel C refers to the time point indicated by the crosses (C left, C right). C) Sample performance of a random liquid and of a trained liquid after readouts have been trained up to the time point marked in B. Network activity shows 40 neurons out of 800.
Regressing Volterra filters:

Models of liquid computing typically get applied to tasks that exhibit nontrivial temporal relationships in the mapping from the input signal x(t) to the target y(t), because the property of fading memory allows the liquid to keep track of recent history. Theory guarantees that a large enough liquid can retain all relevant information. In practice, one is bound to a dynamical system of limited size. Hence, it is likely that a liquid that is optimized for the memory requirements and time scales of the specific task family at hand will perform better than a liquid which was generated at random.

We consider a task family where the target time series y(t) for a task from the family is generated by applying a randomly chosen second-order Volterra filter [Volterra, 2005] to a fixed input signal x(t). The kernel used in the Volterra filter is sampled at random according to a predefined procedure for each task and exhibits a characteristic temporal scale.

Details of the family of tasks:  The input signal x(t) is given as a sum of three sinusoids of different frequencies with random phase and amplitude, and is kept constant throughout all tasks. The target y(t) for a task arises by applying a randomly chosen second-order Volterra filter to the input:

y(t) = ∫ k₁(τ) x(t−τ) dτ + ∫∫ k₂(τ₁, τ₂) x(t−τ₁) x(t−τ₂) dτ₁ dτ₂                (4)

The kernels k₁ and k₂ are truncated after a fixed maximum time lag. The procedure that creates the kernels at random is detailed in the methods section 4.3.
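A minimal sketch of applying such a truncated second-order Volterra filter in discrete time (the discretization of Eqn. (4)) is given below; the kernel shapes and truncation length are whatever the caller supplies.

```python
import numpy as np

def volterra_target(x, k1, k2):
    """Apply a truncated second-order Volterra filter (discrete version of Eqn. (4)).

    x:  input signal, shape (T,)
    k1: first-order kernel, shape (L,)
    k2: second-order kernel, shape (L, L)
    """
    L, T = len(k1), len(x)
    y = np.zeros(T)
    for t in range(T):
        # window of past inputs x(t), x(t-1), ..., x(t-L+1), zero-padded at the start
        window = np.zeros(L)
        n = min(L, t + 1)
        window[:n] = x[t::-1][:n]
        y[t] = k1 @ window + window @ k2 @ window
    return y
```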

The task of the liquid is to provide suitable features such that a linear readout with weights W_out can project the liquid states onto a good prediction ŷ(t) of the target y(t).

Implementation:  In practice, the simulations were carried out in discretized time. We used a network of 800 recurrently connected neurons. The liquid state was implemented as a concatenation of the exponentially filtered spike trains of all neurons. Learning of the linear readout weights in the inner loop was implemented using gradient descent as outlined in Eqn. (1); however, weight changes of the readout were accumulated in chunks and applied only once every second. The objective for the outer loop in Eqn. (2) was optimized using backpropagation through time (BPTT), an algorithm for computing gradients of the weights in recurrent neural networks. Observe that this is possible because the dynamics of the readout plasticity is itself differentiable, and can therefore be optimized by gradient descent. In addition, one needs to backpropagate through the spiking network dynamics, the details of which can be found in the methods section 4.2.
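The liquid-state features themselves are easy to compute once the spike trains are given. The sketch below shows exponentially filtered spike trains as the liquid state; the step size and filter time constant are illustrative assumptions, not the paper's values.

```python
import numpy as np

def filtered_spike_trains(spikes, dt=1e-3, tau=0.02):
    """Liquid state as exponentially filtered spike trains (one trace per neuron).

    spikes: binary array of shape (T, N), spikes[t, j] = 1 if neuron j fired at step t
    dt, tau: simulation step and filter time constant -- illustrative values only
    Returns an array of shape (T, N) of low-pass filtered traces.
    """
    decay = np.exp(-dt / tau)
    traces = np.zeros_like(spikes, dtype=float)
    trace = np.zeros(spikes.shape[1])
    for t in range(spikes.shape[0]):
        trace = decay * trace + spikes[t]   # exponential filter of the spike train
        traces[t] = trace
    return traces
```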

Results:  The liquid that emerged from outer-loop training was compared against a reference baseline, whose weights were not optimized for the task family, but had otherwise exactly the same structure and readout learning rule. In Figure 2B we report the learning performance on unseen task instances from the family , averaged over 200 different tasks. We find that the learning performance of the optimized liquid is drastically improved as compared to the random baseline.

This becomes even more obvious when one compares the quality of the fit on a concrete example, as shown in Figure 2C. Whereas the random liquid fails to make consistent predictions about the desired output signal based on the liquid state, the trained liquid is able to capture all important aspects of the target signal. This was achieved after only a short period of learning the specific task, because the trained liquid had already been confronted with tasks of a similar structure, and could capture, through the outer loop optimization, the smoothness of the Volterra kernels and the relevant time dependencies in its recurrent weights.

3 Liquids learn by using their internal dynamics instead of synaptic plasticity for readouts

Figure 3: L2L setup with liquids that learn using their internal dynamics A) Learning architecture for RSNN liquids. All the weights are only updated in the outer-loop training using BPTT. B) Supervised regression tasks are implemented as neural networks with randomly sampled weights: target networks (TN). C) Sample input/output curves of TNs on a 1D subset of the 2D input space, for different weight and bias values.
Figure 4: Learning to learn a nonlinear function that is defined by an unknown target network (TN): A) Performance of the liquid in learning a new TN during training in the outer loop of L2L. B) Performance of the trained liquid during testing, compared to a random liquid and the linear baseline. C) Learning performance of the liquid within a single inner-loop episode for 1000 new TNs (mean and one standard deviation). Performance is compared to that of a random liquid. D) Performance for a single sample TN; a red cross marks the step after which output predictions became very good for this TN. The spike raster for this learning process is the one depicted in F. E) The internal model of the liquid (as described in the text) is shown for the first few steps of inner-loop learning. The liquid starts by predicting a smooth function, and updates its internal model within a few steps to correctly predict the target function. F) Network input (top row, only 100 of 300 neurons shown), internal spike-based processing with low firing rates in the neuron populations (middle row), and network output (bottom row) for 25 steps of 20 ms each. G) Learning performance of backpropagation for the same 1000 TNs as in C, working directly on the ANN from Fig. 3B, with a prior for small weights and with the best hyperparameters from a grid search. H) Performance comparison, for a single inner-loop episode (mean), between a liquid where all weights are fixed in the inner loop and a liquid where the readout weights are trained in the inner loop, for the Volterra series task family described in Section 2.

We next asked whether liquids where the readouts are also fixed can learn, using only their internal dynamics. It was shown in [Hochreiter et al., 2001] that LSTM networks can learn nonlinear functions from a teacher without modifying their recurrent or readout weights. It has recently been argued in [Wang et al., 2018] that the prefrontal cortex (PFC) accumulates knowledge during fast reward-based learning in its short-term memory, without using synaptic plasticity, see the text to Suppl. Fig. 3 in [Wang et al., 2018]. The experimental results of [Perich et al., 2018] also suggest a prominent role of network dynamics and short-term memory for fast learning in the motor cortex. Inspired by these results from biology and machine learning, we explored the extent to which recurrent networks of spiking neurons can learn using just their internal dynamics, without synaptic plasticity.

In this section, we show that one can generate liquids through L2L that are able to learn with fixed weights, provided that the liquid receives feedback about the prediction target as input. In addition, relying on the internal dynamics of the liquid to learn allows the liquid to learn as fast as possible for a given task, i.e., the learning speed is not limited by any predetermined learning rate.

Target networks as the task family F: We chose this task family to demonstrate that liquids can use their internal dynamics to regress complex non-linear functions, and are not limited to generating or predicting temporal patterns. This task family also allows us to illustrate and analyse the learning process in the inner loop more explicitly. We defined the family of tasks F using a family of non-linear functions that are each defined by a feed-forward target network (TN), as illustrated in Fig. 3B. Specifically, we chose a class of continuous functions of two real-valued variables as the family of tasks. This class was defined as the family of all functions that can be computed by a 2-layer artificial neural network of sigmoidal neurons with 10 neurons in the hidden layer, and weights and biases in the range [-1, 1]. Thus overall, each such target network (TN) from F was defined through 40 parameters in the range [-1, 1]: 30 weights and 10 biases. A random instance of a target network was generated for each episode by randomly sampling the 40 parameters in the above range. Most of the functions that are computed by TNs from this class are nonlinear, as illustrated in Fig. 3C for inputs restricted to a 1D subset of the 2D input space.
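The following sketch samples one such target network (2 inputs, 10 sigmoidal hidden units, 1 output, 30 weights and 10 biases in [-1, 1]). Whether the output unit is linear or sigmoidal is not specified here; the sketch assumes a linear output.

```python
import numpy as np

def sample_target_network(rng):
    """Sample one target network (TN): 2 inputs -> 10 sigmoidal hidden units -> 1 output,
    with all 30 weights and 10 biases drawn uniformly from [-1, 1]."""
    W1 = rng.uniform(-1, 1, size=(10, 2))   # 20 input-to-hidden weights
    b1 = rng.uniform(-1, 1, size=10)        # 10 hidden biases
    w2 = rng.uniform(-1, 1, size=10)        # 10 hidden-to-output weights

    def tn(x1, x2):
        h = 1.0 / (1.0 + np.exp(-(W1 @ np.array([x1, x2]) + b1)))  # sigmoidal hidden layer
        return float(w2 @ h)                                       # (assumed) linear output
    return tn

# Usage: f = sample_target_network(np.random.default_rng(0)); y = f(0.3, -0.7)
```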

Learning setup: In an inner-loop learning episode, the liquid was shown a sequence of pairs of inputs and (delayed) targets sampled from the non-linear function defined by one random instance of the target network. After each such input pair was presented, the liquid was trained to produce a prediction of the corresponding target value. In other words, the task of the liquid was to perform non-linear regression on the presented pairs of inputs and targets, and to produce predictions with low error on new inputs. The liquid was optimized in the outer loop to learn this fast and well.

When giving an input for which the liquid had to produce a prediction, we could not also give it the target for that same input at the same time, because the liquid could then “cheat” by simply copying this value as its prediction. Therefore, we gave the target value to the liquid with a delay, after it had produced its prediction. Giving the target value as input to the liquid is necessary, since otherwise the liquid has no way of figuring out the specific underlying non-linear function for which it needs to make predictions.

Learning is carried out simultaneously in two loops as before (see Fig. 1A). As in [Hochreiter et al., 2001, Wang et al., 2016, Duan et al., 2016], we let all synaptic weights of the liquid, including the recurrent, input, and readout weights, belong to the set of hyperparameters Θ that are optimized in the outer loop. Hence the network is forced to encode all results from learning the current task in its internal state, in particular in its firing activity. Thus the synaptic weights of the neural network are free to encode an efficient algorithm for learning arbitrary tasks from F.

Implementation: We considered a liquid of recurrently connected leaky integrate-and-fire neurons with full connectivity. All neurons in the liquid received input from a population of 300 external input neurons. A linear readout receiving inputs from all neurons in the liquid was used for the output predictions. The liquid received a stream of 3 types of external inputs (see top row of Fig. 4F): the values of the two input variables and the output of the TN for the preceding input pair (set to 0 at the first trial), each represented through population coding in an external population of 100 spiking neurons. It produced outputs in the form of weighted spike counts during the 20 ms window of each step (see bottom row of Fig. 4F). The weights for this linear readout were trained, like all weights inside the liquid, in the outer loop, and remained fixed during learning of a particular TN.

The training procedure in the outer loop of L2L was as follows: Network training was divided into training episodes. At the start of each training episode, a new target network TN was randomly chosen and used to generate target values for randomly chosen input pairs. 400 of these input pairs and targets were used as training data and presented one per step to the liquid during the episode, where each step lasted 20 ms. The liquid parameters were updated using BPTT to minimize the mean squared error between the liquid output and the target in the training set, using gradients computed over batches of such episodes, which formed one iteration of the outer loop. In other words, each weight update included gradients calculated on the input/target pairs from several different TNs. This training procedure forced the liquid to adapt its parameters in a way that supported learning of many different TNs, rather than specializing on predicting the output of a single TN. After training, the weights of the liquid remained fixed, and it was required to learn the input/output behaviour of TNs from F that it had never seen before in an online manner, just using its fading memory and dynamics. See the Methods (Section 4.4) for further details.
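A sketch of how the input stream of one inner-loop episode with delayed target feedback can be constructed is given below. The input range [-1, 1] is an assumption, and `tn` stands for any function mapping (x1, x2) to a target value (e.g. a TN as sketched earlier).

```python
import numpy as np

def make_episode(tn, n_steps, rng):
    """Build the input stream for one inner-loop episode: at each step the liquid
    receives (x1, x2) plus the target of the *previous* step (0 at the first step),
    and has to predict the target for the current (x1, x2)."""
    xs = rng.uniform(-1, 1, size=(n_steps, 2))      # input pairs (range is an assumption)
    ys = np.array([tn(x1, x2) for x1, x2 in xs])    # targets from the current TN
    prev_y = np.concatenate(([0.0], ys[:-1]))       # delayed target feedback channel
    inputs = np.column_stack([xs, prev_y])          # 3 input channels per step
    return inputs, ys                               # ys are the per-step prediction targets
```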

Results: The liquid achieves low mean-squared error (MSE) for learning new TNs from the family , significantly surpassing the performance of an optimal linear approximator (linear regression) that was trained on all 400 pairs of inputs and target outputs, see grey bar in Fig. 4B. One sample of a generic learning process is shown in Fig. 4D.

Each sequence of examples evokes an “internal model” of the current target function in the internal dynamics of the liquid. We make the current internal model of the liquid visible by probing its prediction for hypothetical new inputs at evenly spaced points in the entire domain, without allowing these probe inputs to modify its internal state (otherwise, inputs advance the network state according to the dynamics of the network). Fig. 4E shows the fast evolution of the internal model of the liquid for a TN during the first few trials (visualized for a 1D subset of the 2D input space). One sees that the internal model of the liquid is from the beginning a smooth function, of the same type as the ones defined by the TNs in F. Within a few trials this smooth function approximated the TN quite well. Hence the liquid had acquired, during the training in the outer loop of L2L, a prior for the types of functions that are to be learnt, encoded in its synaptic weights. This prior was in fact quite efficient, as Figs. 4C,D,E show, compared to that of a random liquid. The liquid was able to learn a TN with substantially fewer trials than a generic learning algorithm for learning the TN directly in an artificial neural network, as shown in Fig. 4G: backpropagation with a prior that favored small weights and biases. In this comparison, the target input was given as feedback to the liquid throughout the episode, and we compare the training error achieved by the liquid with that of a feed-forward (FF) network trained using backpropagation. A liquid with a long short-term memory mechanism, in which we could freeze the memory after low error was achieved, allowed us to stop giving the target input after the memory was frozen. This long short-term memory mechanism was in the form of neurons with adapting thresholds, as described in [Bellec et al., 2018b, Bellec et al., 2018a]. These results suggest that L2L is able to install some form of prior knowledge about the task in the liquid. We conjectured that the liquid fits internal models for smooth functions to the examples it received.

We tested this conjecture in a second, much simpler, L2L scenario. Here the family F consisted of all sine functions with arbitrary phase and amplitudes between 0.1 and 5. The liquid also acquired an internal model for sine functions in this setup from training in the outer loop, as shown in [Bellec et al., 2018b]. Even when we selected examples in an adversarial manner, such that they happened to lie on a straight line, this did not disturb the prior knowledge of the liquid.

Altogether the network learning that was induced through L2L in the liquid is of particular interest from the perspective of the design of learning algorithms, since we are not aware of previously documented methods for installing structural priors for online learning of a recurrent network of spiking neurons.

We then compared the learning performance of liquids trained using the two forms of L2L – one where all the weights, including the readout weights, are fixed in the inner loop, and the other where the readouts are trained for each particular task in the inner loop. Both types of liquids were trained and then tested on the same task – the Volterra series task used in Section 2. We found that the second form of L2L, where all the weights of the liquid were fixed and the target was given as feedback to the liquid with a delay, learned significantly faster (see Fig. 4H) than the liquid trained using the paradigm from Section 2.

4 Methods

4.1 Leaky integrate and fire neurons

We used leaky integrate-and-fire (LIF) models of spiking neurons, where the membrane potential V_j(t) of neuron j evolves according to:

τ_m dV_j(t)/dt = −V_j(t) + R_m I_j(t)                (5)

where R_m is the membrane resistance, τ_m the membrane time constant, and I_j(t) the synaptic input current. A neuron spikes as soon as its membrane potential rises above its firing threshold. At each spike time, the membrane potential is reset by subtracting the current threshold value, and the neuron enters a strict refractory period during which it cannot spike again.
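A minimal discrete-time sketch of this update rule, with reset by subtraction and a refractory period, is given below. The parameter values and the exact handling of the refractory period are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def lif_step(V, I, refrac_count, dt=1e-3, tau_m=0.02, R_m=1.0, v_th=1.0, n_refrac=5):
    """One discrete-time update of a population of LIF neurons (sketch of Eqn. (5)).

    V, I, refrac_count: arrays of shape (N,) -- membrane potentials, input currents,
    and remaining refractory steps. All parameter values here are placeholders.
    """
    alpha = np.exp(-dt / tau_m)                    # membrane decay per step
    V = alpha * V + (1 - alpha) * R_m * I          # leaky integration of the input current
    can_spike = refrac_count == 0
    z = (V > v_th) & can_spike                     # spikes where the threshold is crossed
    V = np.where(z, V - v_th, V)                   # reset by subtracting the threshold
    refrac_count = np.where(z, n_refrac, np.maximum(refrac_count - 1, 0))
    return V, z.astype(float), refrac_count
```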

4.2 Backpropagation through time

We introduced a version of backpropagation through time (BPTT) in [Bellec et al., 2018b] which allows us to back-propagate the gradient through the discontinuous firing events of spiking neurons. The firing is formalized through a binary step function applied to the scaled membrane voltage. The gradient is propagated through this step function with a pseudo-derivative, as in [Courbariaux et al., 2016, Esser et al., 2016], but with a dampened amplitude at each spike.

Specifically, the derivative of the spike output z_j(t) w.r.t. the normalized membrane potential v_j(t) is defined as:

dz_j(t) / dv_j(t) := γ max(0, 1 − |v_j(t)|)                (6)

where γ < 1 is a dampening factor.

In this way, the architecture and parameters of an RSNN can be optimized for a given computational task.
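The pseudo-derivative of Eqn. (6) is a one-line function; during backpropagation it replaces the (everywhere zero) true derivative of the spike function. The default dampening factor of 0.3 below follows [Bellec et al., 2018b] and is an assumption, not necessarily the value used here.

```python
import numpy as np

def spike_pseudo_derivative(v_scaled, gamma=0.3):
    """Dampened pseudo-derivative of the spike function w.r.t. the normalized
    membrane potential (sketch of Eqn. (6)); gamma=0.3 is an assumed default."""
    return gamma * np.maximum(0.0, 1.0 - np.abs(v_scaled))
```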

4.3 Optimizing liquids to learn

Liquid model:  Our liquid consists of 800 recurrently connected leaky integrate-and-fire (LIF) neurons with the dynamics defined above. The network simulation is carried out in discrete time steps. The membrane voltage decay was uniform across all neurons and was computed to correspond to a fixed membrane time constant. A fixed normalized spike threshold and a refractory period were used, and synaptic delays were uniform across all synapses. At the beginning of the experiment, the input and recurrent weights were initialized according to Gaussian distributions with zero mean and separate standard deviations for input and recurrent weights. In addition to the input and recurrent weights, the initial values of the readout weights were also optimized in the outer loop; at the beginning of the experiment they were randomly initialized according to a uniform distribution, as proposed in [Glorot and Bengio, 2010].
Readout learning:  The readout was iteratively adapted according to equation (1). Weight changes were computed at each time step and accumulated. After every second, these accumulated changes were used to actually modify the readout weights. Thus, formulated in discrete time, the plasticity of the readout weights for a task took the following form:

W_out(t + 1 s) = W_out(t) − η Σ_{t < t' ≤ t + 1 s} ∇_{W_out} E(t')                (7)

where η is a learning rate.
Outer loop optimization:  To optimize the input and recurrent weights of the liquid in the outer loop, we simulated the learning procedure described above for a number of different tasks in parallel. After every few seconds of simulated time, the simulation was paused and the outer loop objective was evaluated, based on truncated chunks that each include 3 readout weight updates:

L_outer(Θ) = E_{C∼F} [ (1/|T₂|) Σ_{t∈T₂} (ŷ_C(t) − y_C(t))² ]                (8)

where T₂ denotes the time steps of the last two seconds of the chunk; observe that we include in the objective only the last two seconds of simulation. This cost function was then minimized using a variant of gradient descent (Adam [Kingma and Ba, 2014]) with a fixed learning rate. The required gradient was computed with backpropagation through time over these chunks of simulation, and was clipped if its norm exceeded a fixed value.

Regularization:  In order to encourage the model to settle into a regime of plausible firing rates, we add to the outer loop cost function a term that penalizes excessive firing rates:

L_rate = λ Σ_j max(0, f_j − f₀)²                (9)

with a hyperparameter λ and a target firing rate f₀. We compute the firing rate f_j of a neuron j from the number of spikes it emitted over a window of the recent past.
Task details:  We describe here the procedure according to which the input time series x(t) and target time series y(t) were generated. The input signal x(t) was set to be the same for each task and was generated once at the beginning of the experiment. It was composed of a sum of three sines with random phases φ_i and amplitudes A_i, both sampled uniformly from fixed intervals:

x(t) = Σ_{i=1}^{3} A_i sin(2π t / T_i + φ_i)                (10)

with three different fixed periods T_1, T_2, T_3.
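The sketch below generates such an input signal as in Eqn. (10). The period values, amplitude range, and phase range are illustrative placeholders, since the paper's specific values are not reproduced here.

```python
import numpy as np

def make_input_signal(n_steps, dt=1e-3, periods=(0.2, 0.5, 1.0), rng=None):
    """Sum of three sinusoids with random amplitude and phase (sketch of Eqn. (10)).
    The periods, amplitude range, and phase range here are illustrative choices."""
    rng = rng or np.random.default_rng()
    t = np.arange(n_steps) * dt
    x = np.zeros(n_steps)
    for T_i in periods:
        A = rng.uniform(0.5, 1.5)              # random amplitude (assumed range)
        phi = rng.uniform(0.0, 2 * np.pi)      # random phase
        x += A * np.sin(2 * np.pi * t / T_i + phi)
    return x
```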

The corresponding target function y(t) was then computed by applying a random second-order Volterra filter to x(t) according to equation (4). Each task uses a different kernel in the Volterra filter, and we describe here the process by which we generate the kernels k₁ and k₂. Recall that we truncate the kernels after a fixed maximum time lag. Together with the fact that we simulate in discrete time steps, this allows us to represent k₁ as a vector with a finite number of entries, and k₂ as a matrix of finite dimension.

Sampling k₁: We parametrize k₁ as a normalized sum of two different exponential filters with random properties:

k̃₁(τ) = a₁ exp(−τ/τ₁) + a₂ exp(−τ/τ₂)                (11)
k₁(τ) = k̃₁(τ) / Z₁                (12)

with the time constants τ₁, τ₂ sampled uniformly from a fixed interval, and the coefficients a₁, a₂ drawn at random from a fixed range. For the normalization constant Z₁, we use the sum of all entries of the filter in its discrete representation.

Sampling k₂: We construct k₂ to resemble a Gaussian bell shape centered at a point μ in the (τ₁, τ₂) plane, with a randomized “covariance” matrix Σ, which we parametrize such that we always obtain a positive definite matrix:

Σ = A Aᵀ + ε I, with A a random 2×2 matrix                (13)

where the entries of A are sampled uniformly from a fixed interval and ε > 0 is a small constant. With this we defined the kernel according to:

k̃₂(τ₁, τ₂) = exp( −½ (τ − μ)ᵀ Σ⁻¹ (τ − μ) ), with τ = (τ₁, τ₂)ᵀ                (14)
k₂(τ₁, τ₂) = k̃₂(τ₁, τ₂) / Z₂                (15)

The normalization term Z₂ is again given by the sum of all entries of the matrix in its discrete time representation.

4.4 Liquids learn by using their internal dynamics instead of synaptic plasticity for readouts

Liquid model:  The liquid model used here was the same as that in section 4.3, but with a different number of neurons.

Input encoding: Analog values were transformed into spike trains to serve as inputs to the liquid as follows: For each input component, the 100 input neurons of the corresponding population are assigned values evenly distributed between the minimum and maximum possible value of the input. Each input neuron has a Gaussian response field with a particular mean and standard deviation, where the means are uniformly distributed between the minimum and maximum values to be encoded, and the standard deviation is constant. More precisely, the firing rate r_i (in Hz) of input neuron i is given by r_i = r_max exp( −(v − μ_i)² / (2σ²) ), where r_max is a fixed maximum rate (in Hz), μ_i is the value assigned to that neuron, v is the analog value to be encoded, and σ is proportional to the range v_max − v_min, with v_min and v_max being the minimum and maximum values to be encoded.
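A sketch of this Gaussian response-field population code is given below. The maximum rate, tuning width, and the Poisson spike generation in the usage comment are assumptions for illustration, not the paper's values.

```python
import numpy as np

def population_rates(value, v_min, v_max, n_neurons=100, r_max=200.0, sigma_frac=0.05):
    """Firing rates (Hz) of a population encoding one analog value with Gaussian
    tuning curves; r_max and sigma_frac are illustrative placeholders."""
    mus = np.linspace(v_min, v_max, n_neurons)       # preferred values, evenly spaced
    sigma = sigma_frac * (v_max - v_min)             # constant tuning width
    return r_max * np.exp(-(value - mus) ** 2 / (2 * sigma ** 2))

# Example: Poisson spike counts for one 20 ms step (step length is an assumption)
# spikes = np.random.poisson(population_rates(0.3, -1.0, 1.0) * 0.02)
```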

Setup and training schedule: The output of the liquid was a linear readout that received as input the mean firing rate of each of the neurons per step, i.e., the number of spikes divided by the duration of the 20 ms time window that the step consists of.

The network training proceeded as follows: A new target function was randomly chosen for each episode of training, i.e., the parameters of the target function were chosen uniformly at random from within the ranges given above.

Each episode consisted of a sequence of steps, each lasting 20 ms. In each step, one training example from the current function to be learned was presented to the liquid. In such a step, the input to the liquid consisted of a randomly chosen input vector, with its dimensionality and range determined by the target function being used (a two-dimensional input in the case of TNs). In addition, at each step, the liquid also received the target value from the previous step, i.e., the value of the target calculated using the target function for the inputs given at the previous step (in the first step, this value was set to 0). This previous-target input was provided to the liquid during all steps of the episode.

All the weights of the liquid were updated using our variant of BPTT, once per iteration, where an iteration consists of a batch of episodes, and the weight updates are accumulated across the episodes of an iteration. The Adam [Kingma and Ba, 2014] variant of gradient descent was used with standard parameters and a fixed learning rate. The loss function for training was the mean squared error (MSE) of the predictions over an iteration (i.e., over all the steps in an episode, and over the entire batch of episodes in an iteration). In addition, a regularization term was used to maintain a low firing rate. Specifically, the regularization term was defined as the mean squared difference between the average neuron firing rate in the liquid and the target firing rate. The total loss was given by the MSE plus the weighted regularization term. In this way, we induce the liquid to use sparse firing. The liquid was trained for a fixed number of iterations.

Parameter values: All leaky integrate-and-fire neurons used the same refractory period, membrane time constant, and baseline threshold voltage, while synaptic delays were spread uniformly over a fixed range. The dampening factor γ of the pseudo-derivative was held fixed during training.

Comparison with the linear baseline: The linear baseline was calculated using linear regression with L2 regularization, with the regularization factor determined using a grid search, on the mean spiking trace of all the neurons. The mean spiking trace was calculated as follows: First, the neuron traces were computed by filtering the spike trains with an exponential kernel of fixed width and time constant. Then, for every step, the mean value of this trace was calculated to obtain the mean spiking trace. In Fig. 4B, for each episode, the mean spiking trace from a subset of the steps was used to train the linear regressor, and the mean spiking trace from the remaining steps was used to calculate the test error. The reported baseline is the mean of the test error over one batch of episodes, with error bars of one standard deviation.
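A minimal sketch of this baseline fit, assuming the mean spiking traces have already been computed, is given below; the regularization factor is a placeholder, since the paper's value came from a grid search.

```python
import numpy as np

def linear_baseline_error(traces_train, y_train, traces_test, y_test, reg=1.0):
    """Ridge-regression baseline on mean spiking traces (one feature per neuron).

    traces_*: arrays of shape (steps, N) of per-step mean spiking traces
    reg: L2 regularization factor -- a placeholder value
    """
    N = traces_train.shape[1]
    A = traces_train.T @ traces_train + reg * np.eye(N)
    w = np.linalg.solve(A, traces_train.T @ y_train)        # closed-form ridge solution
    return float(np.mean((traces_test @ w - y_test) ** 2))  # test MSE
```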

For the case where target networks defined the function family, the total test MSE of the trained liquid was well below that of the linear baseline (compare Fig. 4B).

Comparison with random liquid: In Fig. 4B,C, a liquid with randomly initialized input, recurrent and readout weights was tested in the same way as the trained liquid – with the same sets of inputs, and without any synaptic plasticity in the inner loop. The plotted curves are the average over 8000 different TNs.

Comparison with backprop: The comparison was done for the case where the liquid was trained on the function family defined by target networks. A feed-forward (FF) network with 10 hidden neurons and one output, i.e., the ANN architecture of Fig. 3B, was constructed. The input to this FF network were the analog values that were used to generate the spiking inputs and targets for the liquid. Therefore the FF network had two inputs, one for each of the two input variables. The error reported in Fig. 4G is the mean training error over 1000 TNs, with error bars of one standard deviation.

The FF network was initialized with Xavier normal initialization [Glorot and Bengio, 2010] (which had the best performance, compared to Xavier uniform and plain uniform initialization). Adam [Kingma and Ba, 2014] with AMSGrad [Reddi et al., 2018] was used, with its parameters set to the optimal values determined by a grid search. Together with the Xavier normal initialization and the weight regularization, the training of the FF network favoured small weights and biases.

Comparison on the Volterra series task: The task family was defined exactly as described in section 4.3. A liquid of the same type as above was used. Two inputs were given to this liquid – the input signal x(t) and the target y(t), delayed by a fixed amount – and it produced predictions ŷ(t). The training setup was the same as for the target networks described above, but with episodes lasting several seconds. We compare the mean error over 40 episodes in Fig. 4H for liquids with and without readout plasticity, over the initial part of an inner-loop episode.

5 Discussion

We have investigated potential benefits of a deviation from the standard reservoir computing paradigm, where only the synaptic weights of readout neurons are adapted. We have optimized the synaptic weights of recurrent connections and of connections from input neurons in order to speed up this adaptation process for the synaptic weights of readout neurons, simultaneously for a large family of potential tasks. A suitable framework for this two-tiered adaptation process is provided by the common L2L paradigm that we have illustrated in Fig. 1A. In Fig. 2 we have demonstrated the qualitative jump in learning performance that is achieved if one uses a liquid that is not randomly chosen, but optimized in the outer loop of L2L. This result raises the question of whether a similar boost in learning performance can also be achieved for other types of reservoirs, such as recurrent networks of artificial neurons. This approach might also be applicable to physically implemented reservoirs such as those reviewed in [Tanaka et al., 2019]. If one does not have a differentiable computer model of such a physically implemented reservoir, one would have to use a gradient-free optimization method for the outer loop, such as simulated annealing or stochastic search; see [Bohnstingl et al., 2019] for a first step in that direction.

We have explored in section 3 a new variation of the reservoir computing paradigm, where not even the weights to readout neurons need to be adapted for learning a specific task. Instead, the weights of recurrent connections within the reservoir can be optimized so that the reservoir can learn a task from a given family of tasks by maintaining learnt information about the current task in its working memory, i.e., in its network state. This state may include values of hidden variables such as the current values of adaptive thresholds, as in the case of LSNNs [Bellec et al., 2018b]. It turns out that L2L without any synaptic plasticity in the inner loop enables a spike-based reservoir to learn faster than with plastic synapses to readout neurons, see Fig. 4H. In fact, it enables the reservoir to learn faster than the optimal learning method from machine learning for the same task: backpropagation applied directly to the target network architecture which generated the nonlinear transformation, compare panels C and G of Fig. 4. We have also demonstrated in Fig. 4E that the L2L method can be viewed as installing a prior in the reservoir. This observation raises the question of what types of priors or rules can be installed in reservoirs with this approach. For neurorobotics applications it would be especially important to be able to install safety rules in a neural network controller that cannot be overridden by subsequent learning. We believe that L2L methods could provide valuable tools for that.

Another open question is whether biologically more plausible and computationally more efficient approximations to BPTT, such as e-prop [Bellec et al., 2019], can be used instead of BPTT for optimizing a reservoir in the outer loop of L2L.

Acknowledgements

This research/project was supported by the HBP Joint Platform, funded from the European Union’s Horizon 2020 Framework Programme for Research and Innovation under the Specific Grant Agreement No. 720270 (Human Brain Project SGA1) and under the Specific Grant Agreement No. 785907 (Human Brain Project SGA2). Research leading to these results has in parts been carried out on the Human Brain Project PCP Pilot Systems at the Jülich Supercomputing Centre, which received co-funding from the European Union (Grant Agreement No. 604102). We gratefully acknowledge Sandra Diaz, Alexander Peyser and Wouter Klijn from the Simulation Laboratory Neuroscience of the Jülich Supercomputing Centre for their support.

References

  • [Abraham and Bear, 1996] Abraham, W. C. and Bear, M. F. (1996). Metaplasticity: the plasticity of synaptic plasticity. Trends in neurosciences, 19(4):126–130.
  • [Bellec et al., 2018a] Bellec, G., Salaj, D., Subramoney, A., Kraisnikovic, C., Legenstein, R., and Maass, W. (2018a). Slow dynamic processes in spiking neurons substantially enhance their computing capability; in preparation.
  • [Bellec et al., 2018b] Bellec, G., Salaj, D., Subramoney, A., Legenstein, R., and Maass, W. (2018b). Long short-term memory and learning-to-learn in networks of spiking neurons. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems 31, pages 795–805. Curran Associates, Inc.
  • [Bellec et al., 2019] Bellec, G., Scherr, F., Subramoney, A., Hajek, E., Salaj, D., Legenstein, R., and Maass, W. (2019). A solution to the learning dilemma for recurrent networks of spiking neurons. bioRxiv, page 738385.
  • [Bohnstingl et al., 2019] Bohnstingl, T., Scherr, F., Pehle, C., Meier, K., and Maass, W. (2019). Neuromorphic hardware learns to learn. Frontiers in Neuroscience, 13:483.
  • [Courbariaux et al., 2016] Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., and Bengio, Y. (2016). Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830.
  • [Duan et al., 2016] Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., and Abbeel, P. (2016). RL²: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779.
  • [Esser et al., 2016] Esser, S. K., Merolla, P. A., Arthur, J. V., Cassidy, A. S., Appuswamy, R., Andreopoulos, A., Berg, D. J., McKinstry, J. L., Melano, T., Barch, D. R., Nolfo, C. d., Datta, P., Amir, A., Taba, B., Flickner, M. D., and Modha, D. S. (2016). Convolutional networks for fast, energy-efficient neuromorphic computing. Proceedings of the National Academy of Sciences, 113(41):11441–11446.
  • [Glorot and Bengio, 2010] Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256.
  • [Haeusler and Maass, 2006] Haeusler, S. and Maass, W. (2006). A statistical analysis of information-processing properties of lamina-specific cortical microcircuit models. Cerebral cortex, 17(1):149–162.
  • [Hochreiter et al., 2001] Hochreiter, S., Younger, A. S., and Conwell, P. R. (2001). Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pages 87–94. Springer.
  • [Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [Maass et al., 2004] Maass, W., Natschläger, T., and Markram, H. (2004). Fading memory and kernel properties of generic cortical microcircuit models. Journal of Physiology-Paris, 98(4-6):315–330.
  • [Maass et al., 2002] Maass, W., Natschläger, T., and Markram, H. (2002). Real-Time Computing Without Stable States: A New Framework for Neural Computation Based on Perturbations. Neural Computation, 14(11):2531–2560.
  • [Perich et al., 2018] Perich, M. G., Gallego, J. A., and Miller, L. E. (2018). A neural population mechanism for rapid learning. Neuron.
  • [Reddi et al., 2018] Reddi, S. J., Kale, S., and Kumar, S. (2018). On the convergence of adam and beyond. In International Conference on Learning Representations.
  • [Tanaka et al., 2019] Tanaka, G., Yamane, T., Héroux, J. B., Nakane, R., Kanazawa, N., Takeda, S., Numata, H., Nakano, D., and Hirose, A. (2019). Recent advances in physical reservoir computing: a review. Neural Networks.
  • [Volterra, 2005] Volterra, V. (2005). Theory of functionals and of integral and integro-differential equations. Courier Corporation.
  • [Wang et al., 2018] Wang, J. X., Kurth-Nelson, Z., Kumaran, D., Tirumala, D., Soyer, H., Leibo, J. Z., Hassabis, D., and Botvinick, M. (2018). Prefrontal cortex as a meta-reinforcement learning system. Nature Neuroscience.
  • [Wang et al., 2016] Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. (2016). Learning to reinforcement learn. arXiv preprint arXiv:1611.05763.