Learning sparsity in reservoir computing through a novel bio-inspired algorithm

07/19/2019
by   Luca Manneschi, et al.
The University of Sheffield
0

The mushroom body is the key network for the representation of learned olfactory stimuli in Drosophila and insects. The sparse activity of Kenyon cells, the principal neurons in the mushroom body, plays a key role in the learned classification of different odours. In the specific case of the fruit fly, the sparseness of the network is enforced by an inhibitory feedback neuron called APL, and by an intrinsic high firing threshold of the Kenyon cells. In this work we took inspiration from the fruit fly brain to formulate a novel machine learning algorithm that is able to optimize the sparsity level of a reservoir by changing the firing thresholds of the nodes. The sparsity is only applied on the readout layer so as not to change the timescales of the reservoir and to allow the derivation of a one-layer update rule for the firing thresholds. The proposed algorithm is a combination of learning a neuron-specific sparsity threshold via gradient descent and a global sparsity threshold via a Markov chain Monte Carlo method. The proposed model outperforms the standard gradient descent, which is limited to the readout weights of the reservoir, on two example tasks. It demonstrates how the learnt sparse representation can lead to better classification performance, memorization ability and convergence time.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

12/04/2019

SpaRCe: Sparse reservoir computing

"Sparse" neural networks, in which relatively few neurons or connections...
06/04/2020

Sparsity in Reservoir Computing Neural Networks

Reservoir Computing (RC) is a well-known strategy for designing Recurren...
06/11/2020

Model-Size Reduction for Reservoir Computing by Concatenating Internal States Through Time

Reservoir computing (RC) is a machine learning algorithm that can learn ...
11/21/2021

CD4 T follicular helper cells in HIV

HIV infects millions of individuals worldwide, and new things still emer...
01/12/2015

Photonic Delay Systems as Machine Learning Implementations

Nonlinear photonic delay systems present interesting implementation plat...
10/08/2018

Optimizing Waiting Thresholds Within A State Machine

Azure (the cloud service provided by Microsoft) is composed of physical ...
01/11/2021

Exploiting Multiple Timescales in Hierarchical Echo State Networks

Echo state networks (ESNs) are a powerful form of reservoir computing th...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Sparsity is a well known concept in neuroscience, observed from the high selectivity of the neurons, ranging from the sensory cortex of mammalian brain [1] to the Kenyon cells (KCs) in the mushroom body [2]. In particular, this work is inspired from the low coding level that is observed in the KCs of Drosophila, where the high intrinsic thresholds of KCs permits such neurons to sparsely and selectively encode external stimuli [3]

. Analogously, the model proposed exploits the concept of learnable thresholds to optimize the level of sparsity inside the network. The learning is performed by optimization of a distance measure between the output of the neural network and the desired outcome without exploiting any normalization term. The novelty of the proposed approach lies on the fact that a sparsity level is reached due to the presence of firing thresholds, rather than to regularization

[4] [5] [6]. From the machine learning perspective, adopting sparse representations can lead to more interpretable model [5], to a reduced computational cost [7], and can help solve overfitting problems [8]. In this regard, the work in [7]

demonstrated how structured sparsity can have benefits in term of computational speed and accuracy in a convolutional neural network. Rasmussen et al.

[9]

showed how the choice of regularization parameters of the model can impact the interpretability and the reproducibility of a classifier of neuroimaging data, and showed the existence of a trade-off between pure classification accuracy and reproducibility. The Dropout technique

[8], which selects random subset of units in a neural network during training, can prevent overfitting by diminishing the codependence of the units in deep neural networks.
The network under consideration in this work is a reservoir of leaky integrators [10]. The connectivity between the nodes is represented through a random sparse fixed adjacency matrix that enables the associated dynamical system to exhibit a multitude of characteristic timescales. This complex connectivity is consistent with experimental reports of chemical [11] and electrical [12]synapses between Kenyon cells in Drosophila, although the physiological function of KC-KC synapses has not yet been discovered.

2 Methods

The reservoir under consideration is a network of leaky integrators described by the following equation

(1)

where defines the temporal scale of the neuron and

is the activity vector of the integrators

111It is called V to resemble the voltage of a neuron.

. The activation function

chosen is a rectified linear unit.

is the input adjacency matrix,

is the fixed sparse random matrix that describes the recurrency of the reservoir, and

s is the signal. The rescaling factor

is chosen in order to constrain the eigenvalues of the associated dynamic system inside the imaginary plane and guarantee the Echo State property. The number of connections of the input matrix

from the input neurons to the reservoir follow a lognormal distribution where each node in the reservoir is connected to six input nodes on average. Furthermore, a specific weight of a postsynaptic neuron is inversely proportional to the number of connections to such neuron. This choice of is inspired by experimental evidence [13]. However, other forms of are possible and the results of this work do not depend considerably on the specific form of the distribution of the input connections. Nevertheless, the magnitude of the weights of plays an important role and must be chosen appropriately [10].

In contrast to previous models [14] [15] [10] that define the output of the neural network through a read-out of the V vector, we introduced another variable , defined as follows

(2)

where stands for rectified linear unit, and is a vector of thresholds that enables x to be sparse. Thus, the ‘measurable’ variable is zero if the ‘hidden’ variable is lower than the corresponding threshold . The training procedure minimizes a measure of the distance between the output of the neural network and the desired value . Mathematically, the cost function is

(3)

in which and the thresholds are the learnable parameters. Thanks to eq.2 and the introduction of the sparse measurable variable x it is possible to change the thresholds without affecting the temporal timescales of the reservoir. This specific formulation allow us to compute the gradient of with respect to

without incurring backpropagation through time and to preserve the idea behind reservoir computing as a fixed, dynamically rich, representation. The rest of this methodological part is organized as follows: subsection

describes the tasks analysed, in subsection we consider two possible training procedure to learn the thresholds values, and subsection describes the best performing algorithm.

2.1 Tasks

We considered two classification tasks where the model must make a decision after a prefixed time interval . To make the task more challenging and more biologically plausible, the model selects a class 222

Or an action in the case of Reinforcement Learning that will be analysed later.

with probability that corresponds to a softmax function applied on the output layer. It follows that choosing the class that corresponds to the highest output would only make the task easier for all the models that will be analysed, and that the relative differences among the various algorithms and our conclusions would not change.


In the first paradigm, the external input s(t) is derived from the simulated response of 24 projection neurons (PNs, second-order neurons in the fly olfactory system) to 110 different odors, based on physiological recordings of olfactory receptor neurons (ORNs) and known characteristics of the ORN-PN synapse [16] [17]. This simulated activity, which we call (HO for Hallem-Olsen), has previously been used in computational analyses of fly olfaction [18] [19] [20]. If is the dimensional vector describing the activities of the input neurons, the i-th dimension of the input is , where

is a Gaussian distributed random variable with zero mean and unitary variance. Thus, the temporal dependence of the external stimulus is due to the presence of noise only and the stimulus, without carrying any relevant temporal information, is practically static

333Indeed, a simple network of unconnected integrators can solve this first experiment successfully as long as the characteristic times of the nodes is big enough to smooth the fluctuation of the signal.. Each stimulus is then associated to a random chosen class, which is the desired outcome of the classification. Given the random nature of the pairing between stimuli and corresponding correct outputs, the model cannot exploit correlations among different signals and the performance achieved is a measure of pure memorization ability. A scheme of the task is shown in fig.1

In the second paradigm considered we evaluated the performance of the models in classifying sequences of three successive stimuli. The procedure for building different sequences is described in the third panel of fig.1. Given a base sequence of randomly selected stimuli 444The improbable choice of the stimuli is for illustrative purposes., we substituted the last signal with random stimuli (if , and for instance) and associated each new sequence to a random different class (following the considered example, to class one and to class two). From now on, we will call “perturbations” the new substituted elements of the sequence and ”context” the remaining elements. In the case of , which is derived from the base , is the perturbation while represents the context. Then, the same procedure is applied to the previous elements and of as illustrated in the left column of the third panel of fig.1. The purpose of this procedure of defining sequences through perturbations of single elements is to test the temporal memory, or echo, of the reservoir. For a given sequence , the network has to remember the association between a perturbation and the desired output class. However, if perturbations are not repeated, the model does not have to take into account the relationships among various elements, but the memorization of the perturbation is sufficient to obtain a correct classification. In order to make the task more challenging, we changed the context of a perturbed sequence by replacing the two context elements with new random stimuli. The new succession is then associated to a different class than the original . For instance, if is the sequence obtained by perturbing the base , a contextual change can be obtained by replacing with and by defining the new succession . Contextual variations are finally illustrated in the right column of the the third panel of fig.1 The overall procedure defines a systematic way to test the temporal and contextual memory capacity of the network while preserving the random association between desired class and successions of elements. Apart from possible similarities among sequences of the same class due to a poor statistics, there are no correlations between inputs that could help the classification procedure, and each sequence has to be classified independently. Thus, given successions, this methodology creates a total number of sequences that equals to , where the factor is due to the number of elements in a succession, a factor to perturbations, and a factor to contextual changes.

Figure 1: Scheme of the network and of the tasks considered. 1. The network is composed by a reservoir of integrators. The activities of the nodes is divided into two variables: a hidden variable containing all the complex dynamic of the reservoir, and a measurable variable containing the thresholds values and defining a sparse output representations. 2.

Illustration of the static task considered. The network has to classify experimental stimuli perturbed with white noise and randomly associated to different classes. The model has to classify at the end of the presentation of the external stimulus.

3. Scheme of the classification task and of the procedure adopted to define sequences from single stimuli (Methods for more details). As before, the network is asked to classify at the end of the successions which are assigned to random classes.

2.2 Algorithms

The objective of this section is to review two possible algorithms to optimize the vector of thresholds of equation 2 capable of learning sparse representations of input stimuli and to introduce a benchmark model. The first one is based on the derivation of the gradient of the error function with respect to the individual neuronal thresholds while the second one is the well known Metropolis algorithm. By optimizing a sparsity level, we envisage that the desired learning rule would also separate the representation of stimuli belonging to different classes in order to facilitate the classification, as non-specific neurons, i.e. neurons that fire to all stimuli, are likely unhelpful for the learning process. The decrease of the overlap between antagonist representations will be quantified through a specificity measure that is introduced in the next section.

Benchmark ()

The benchmark model exploits gradient descent on the output weights without the introduction of the thresholds and the additive layer . The output activity corresponds to a readout of the V variable (eq.1).
Also all the other algorithms analysed will learn through gradient descent, but their output activity will be a readout of the x (eq.2) variable and they will have the additional complexity of learning the thresholds .

Gradient Descent on ()

In addition to learning the ’output’ weights of the reservoir with gradient descent, we choose to learn the local firing thresholds. The derivative of the cost function with respect the leads to

(4)

where is the learning rate and is the Heaviside function. While this is a supervised scenario, we are adopting a Reinforcement Learning terminology: eq.4 can be viewed as the sum of prediction errors where each reinforcement signal belongs to a different class and is modulated by the corresponding output synapses. In order to interpret eq.4, let us consider the case in which so that we have a qualitative idea of what the meaning of eq.4 is. In this specific scenario eq.4 becomes:

where we consider that desired output is a positive quantity , usually set as one in a classification task for the correct class and zero otherwise. Let us consider the extreme assumption where and a first learning phase where the relations and can hold by construction. In this ideal case the learning rule is driven by the difference , that leads to a decrease of the i-th firing threshold when and the i-th specific node is helping to achieve the right classification, and an increase of the threshold value when and the node is contributing to reach the wrong output. The general case where is now understandable by considering that the local factor are modulated through a feedback signal that gives the priority to nodes that are far from the desired output.
The problem with this gradient based rule is that it changes the activation of the nodes unidirectionally. While eq.4 deactivates nodes that are not useful for the classification task, the learning rule cannot reactivate neurons that were silent because the gradient of a non active node is zero by definition. Furthermore, we will later see that the mean of the distribution of thresholds found by this algorithm is suboptimal. 555This last consideration will be supported by the results of fig.3. The limitations of a gradient based approach led us to consider the following algorithm.

Metropolis algorithm ()

While the previous algorithm optimizes a separate threshold for each neuron, the algorithm analysed in this section exploits a global threshold for the whole network (). Indeed, a desired sparsity level is reachable with one single value of only, and preliminary simulations have demonstrated how adopting diverse fixed values of the global threshold can lead to different performance. We want to analyse the consequences of optimizing such parameter with an algorithm that, instead of using the derivative of the cost function, performs a stochastic search by randomly perturbing the value of and accepts or declines the new value with some probability. In the proposed implementation such a probability is given by the Metropolis algorithm [21] where the energy of the system corresponds to the cost function . The procedure adopted is summarized in the following steps:

  • Starting from a value of , propose a new threshold value , where

  • Repeat for steps:

         Compute the cost      for the two thresholds values

         Update the output weights through      gradient descent on

         Compute an average of

  • Accept the network corresponding to with probability

Practically, the algorithm proposes a new reservoir with a global threshold , changes the output weights through standard gradient descent, and accepts the new network by applying the Metropolis rule on a running exponential average of the cost function. In the above scheme, is the number of steps used to compute where the values of are fixed, and defines the memory of the running average .

2.3 Proposed algorithm, a unified approach

The proposed algorithm exploits a mixture of the Metropolis and the gradient descent updating rules to change the values of the thresholds. Each single node has a threshold

(5)

defined as the sum of a global factor and a local factor . The proposed model optimizes the global part through stochastic perturbations and the local one via a gradient descent approach. The algorithm shares the same steps of the Metropolis procedure explained above (section Metropolis), with the following differences: the stochastic perturbation on step is applied to the global factor , and the gradient descent on step is also applied to the local thresholds (instead of the output weights only). For clarity, the scheme of the final algorithm is

  • Starting from a value of , propose a new threshold value , where

  • Repeat for steps:

         Compute the cost      for the two thresholds values

         Update the output weights and      through gradient descent on

         Compute an average of

  • Accept the network corresponding to with probability

Since the algorithm is now learning three sets of variables, , and , one important aspect to take into account is the fact that the effective learning rate depends on the sparsity level. In particular, changing the initial value of can strongly affect the effective learning rate. Because the optimization of the weights and of the local thresholds depends on the step size and consequently on , the initial value of can become important even if the stochastic variations of the Metropolis are optimizing it. In order to choose the starting condition, the algorithm exploits a small prelearning phase (about steps) by applying the same steps described in the scheme above, but resetting and after each Metropolis update to their initial values and trying different initial values of . In such a way, we guarantee choosing a good starting condition. In practice, this procedure was done only one time for each task considered without trying to fine tune the initial for all the simulations performed.

2.4 Specificity

Since a sparsity level is obtained through direct minimization of eq.3 by gradient descent and/or stochastic search, we expect that the optimized thresholds would decrease the overlap in the representations among stimuli corresponding to different classes. Indeed, this separation would make the classification task easier. Thus, a measure of specificity is formulated to quantify and to understand how the proposed learning can lead to better representations. Let us consider two classes and and a neuron . The node is specific to the class with respect to the other if it is more active when stimuli belonging to class

are presented to the network. Generalizing this idea it is possible to build a tensor

defined as

(6)

where () are the number of times the neuron was active after the presentation of a stimulus of class () and is the total number of episodes taken into account to compute eq.6. There are also other possibilities analogous to eq. 666 could be substituted by and define a relative specificity that is rescaled by the number of times a neuron was active. However, a decrease of causes an undesirable increase in the specificity measure. Another choice is to use the sum of the activities of a neuron for all the stimuli in class instead of , but this leads to comparable results., but we considered the above equation the most easily interpretable. Given it is possible to compute a measure of specificity for each single neuron as

(7)

where we considered only the upper triangular part of because of the symmetry of the latter tensor.

Parameters Values, task
Tasks
Network
Model

3 Results

We will now compare the four algorithms: gradient descent on the weights (), gradient descent on the weights and thresholds (), the Metropolis algorithm and the composite model on the two tasks (section Tasks).

Task 1, static input

In this scenario the model has to classify, after a fixed time interval of , a noisy dimensional stimulus. Each input is defined as described in the methodological section and it is randomly assigned to a class in order to test the pure memorization ability of the network. The performance is shown in fig.2, where the left panel reports a training example of a classification of

stimuli. The parameters defining the task and the hyperparameters of the composed algorithm are reported in the table above.


Figure 2: Task 1, performance. Left. Performance of the model during training. The Composed model has the highest speed of convergence, while the model without thresholds and sparse activity has lowest accuracy. Right Fraction of correct classifications after episodes. The difference between the algorithms increases with the difficulty of the task.

Figure 3: Task 1, Mean and variance of the optimized distribution. Left The average of the distribution shows an upward trend and the sparseness in the network rises with the number of input stimuli. Right The need to differentiate the values of the thresholds is reflected in the of the optimized distribution and it increases as the task becomes more demanding.

The black line refers to the standard model called 777for gradient descent on the output weights , where the threshold and the x variable are not introduced; the output of the neural network is a readout of V. Since the algorithms proposed are more complex and can exploit an exponential running average of the cost function to update the global parameter, we optimised the batch size of to improve its performance and to make the comparison to the disadvantage of the proposed algorithm. The batch size chosen for is . It is clear how the three models that exploit the additional complexity due to the thresholds outperform the standard model in terms of convergence speed and accuracy. The performance is also reported in the right panel after episodes 888One episode corresponds to a single presentation of a stimulus that ends when the network makes a decision as the number of stimuli to be classified changes and the task becomes more demanding. The difference between the models become more evident as the difficulty increases, and the composite model robustly reports the best classification accuracy.

After learning and for the case where , the average percentage of active nodes is about fifty percent for the composed algorithm and the Metropolis, while it is about seventy percent for . This difference in the sparsity levels found by the three algorithms is visible in the left panel of fig.3, where the variance and the mean of the optimized distribution of are reported for all the cases. The Metropolis and the composed model report higher values of the mean of the thresholds distribution (blue and red diamonds) in comparison to , which leads to a corresponding higher coding level. However, the increasing trend of for all the learning rules demonstrate how sparsity is needed when the memorization ability required by the network is high. The importance of a local threshold is reflected on the spread of the distribution, reported as . Indeed, as the performance between the Metropolis and the composed model diverge going from left to right (right panel of fig.2), the variance of the optimized distribution raises (fig.3).
Thus, even if learning a single global parameter leads to remarkable improvements compared to , learning a distribution of and optimizing local parameters can become relevant when the model has to memorize a large number of inputs. The surprising results obtained by optimizing a global factor through stochastic changes are due to the static nature of the signal used for this task, and the importance of the local thresholds will become dominant for sequence classification.

Task 2, sequence classification

Each input corresponds to a noisy succession of three stimuli that is defined as described in Methods. A single element of the succession lasts for and the network is asked to classify after . Sequences are randomly associated to classes as in the previous task, but in this case the model has to temporally remember the past of the signal and to take into account relationships among elements of the sequence to classify correctly. For this specific case, we used a batch size of for , Composed model and Metropolis, while a batch size of for the standard algorithm . The batch size for was chosen to optimize its performance in terms of speed and accuracy.

Figure 4: Task 1, performance. Left. Performance of the model during training. The performance of the composed model are the highest. The most remarkable difference in comparison to the results of Task 1 is the low accuracy reported by the Metropolis, which is the model that optimizes a global threshold. Since the task considered is dynamic, this result is expected (see text).

Figure 6: Change in the level of specificity after training. Left. Distribution of before learning. Right. Distribution of after learning.

Figure 7: Performance of the proposed algorithm in a Reinforcement Learning framework.
Figure 5: Task 2, Mean and variance of the optimized distribution. Even if there is no evident trend in the average or the variance of the distribution, the results showed in fig.3 are robust with respect to the variations of the values showed. reports the highest mean, which is suboptimal if the performance of fig.2 are considered. Thus, the presence of the global threshold in the Composed model helps the model to reach a better