1 Introduction
Sparsity is a well-known concept in neuroscience, observed in the high selectivity of neurons ranging from the sensory cortex of the mammalian brain [1] to the Kenyon cells (KCs) of the mushroom body [2]. In particular, this work is inspired by the low coding level observed in the KCs of Drosophila, where the high intrinsic thresholds of KCs permit these neurons to sparsely and selectively encode external stimuli [3]. Analogously, the proposed model exploits the concept of learnable thresholds to optimize the level of sparsity inside the network. Learning is performed by minimizing a distance measure between the output of the neural network and the desired outcome, without exploiting any normalization term. The novelty of the proposed approach lies in the fact that a sparse representation emerges from the presence of firing thresholds rather than from regularization
[4] [5] [6]. From the machine learning perspective, adopting sparse representations can lead to more interpretable models [5], reduce computational cost [7], and help prevent overfitting [8]. In this regard, the work in [7] demonstrated how structured sparsity can benefit the computational speed and accuracy of a convolutional neural network. Rasmussen et al.
[9] showed how the choice of the regularization parameters of a model can impact the interpretability and reproducibility of a classifier of neuroimaging data, and demonstrated the existence of a tradeoff between pure classification accuracy and reproducibility. The Dropout technique [8], which silences a random subset of units in a neural network during training, can prevent overfitting by diminishing the co-dependence of the units in deep neural networks.

The network under consideration in this work is a reservoir of leaky integrators [10]. The connectivity between the nodes is represented through a random, sparse, fixed adjacency matrix that enables the associated dynamical system to exhibit a multitude of characteristic timescales. This complex connectivity is consistent with experimental reports of chemical [11] and electrical [12] synapses between Kenyon cells in Drosophila, although the physiological function of KC-KC synapses has not yet been discovered.
2 Methods
The reservoir under consideration is a network of leaky integrators described by the following equation
τ dV/dt = −V + f(g W V + W_in s)    (1)

where τ defines the temporal scale of the neurons and V is the activity vector of the integrators (it is called V to resemble the voltage of a neuron). The activation function f chosen is a rectified linear unit. W_in is the input adjacency matrix, W is the fixed sparse random matrix that describes the recurrency of the reservoir, and s is the input signal. The rescaling factor g is chosen in order to constrain the eigenvalues of the associated dynamical system to the left half of the complex plane and guarantee the Echo State property. The number of connections of the input matrix W_in
from the input neurons to the reservoir follows a lognormal distribution, where each node in the reservoir is connected to six input nodes on average. Furthermore, each weight onto a postsynaptic neuron is inversely proportional to the number of connections to that neuron. This choice of W_in is inspired by experimental evidence [13]. However, other forms of W_in are possible, and the results of this work do not depend considerably on the specific form of the distribution of the input connections. Nevertheless, the magnitude of the weights of W_in plays an important role and must be chosen appropriately [10]. In contrast to previous models [14] [15] [10] that define the output of the neural network through a readout of the V vector, we introduced another variable x, defined as follows
x = R(V − θ)    (2)

where R stands for the rectified linear unit and θ is a vector of thresholds that enables x to be sparse. Thus, the 'measurable' variable x_i is zero if the 'hidden' variable V_i is lower than the corresponding threshold θ_i. The training procedure minimizes a measure of the distance between the output y of the neural network and the desired value ỹ. Mathematically, the cost function is
E = (1/2) Σ_k (ỹ_k − y_k)²,    with y = W_out x    (3)
in which W_out and the thresholds θ are the learnable parameters. Thanks to eq. 2 and the introduction of the sparse measurable variable x, it is possible to change the thresholds without affecting the temporal timescales of the reservoir. This specific formulation allows us to compute the gradient of E with respect to θ without incurring backpropagation through time, and it preserves the idea behind reservoir computing of a fixed, dynamically rich representation. The rest of this methodological part is organized as follows: subsection
2.1 describes the tasks analysed, subsection 2.2 considers two possible training procedures to learn the threshold values, and subsection 2.3 describes the best performing algorithm.

2.1 Tasks
We considered two classification tasks where the model must make a decision after a prefixed time interval. To make the task more challenging and more biologically plausible, the model selects a class (or an action, in the case of the Reinforcement Learning scenario analysed later) with a probability given by a softmax function applied to the output layer. It follows that choosing the class corresponding to the highest output would only make the task easier for all the models analysed, and that the relative differences among the various algorithms, and our conclusions, would not change.
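The pieces introduced so far, the leaky integrators of eq. 1, the thresholded layer of eq. 2, and the softmax decision rule, can be sketched together as follows. The Euler discretization, all parameter values, the Gaussian recurrent weights, and the rescaling by the spectral radius are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, n_in, n_class = 200, 24, 10

# Input adjacency matrix W_in: lognormal in-degrees (six inputs per reservoir
# node on average) with weights inversely proportional to the in-degree.
sigma = 0.5
mu = np.log(6.0) - 0.5 * sigma**2            # so the lognormal mean is 6
W_in = np.zeros((n_res, n_in))
degrees = np.clip(rng.lognormal(mu, sigma, n_res).round().astype(int), 1, n_in)
for i, k in enumerate(degrees):
    W_in[i, rng.choice(n_in, size=k, replace=False)] = 1.0 / k

# Fixed sparse random recurrent matrix W, rescaled so the eigenvalues of the
# linearized dynamics stay in a stable region (illustrative choice of g).
W = rng.normal(0.0, 1.0, (n_res, n_res)) * (rng.random((n_res, n_res)) < 0.05)
g = 0.9 / max(np.abs(np.linalg.eigvals(W)).max(), 1e-9)

relu = lambda z: np.maximum(z, 0.0)

def step(V, s, tau=1.0, dt=0.1):
    """One Euler step of eq. (1): tau dV/dt = -V + relu(g W V + W_in s)."""
    return V + dt * (-V + relu(g * W @ V + W_in @ s)) / tau

def decide(V, theta, W_out, rng):
    """Eq. (2) plus the decision rule: x = relu(V - theta), then a class is
    sampled with softmax probabilities over the readout y = W_out x."""
    x = relu(V - theta)
    y = W_out @ x
    p = np.exp(y - y.max())
    p /= p.sum()
    return rng.choice(n_class, p=p), x

# Usage: drive the reservoir with a static noisy stimulus, then decide.
theta = np.full(n_res, 0.1)
W_out = rng.normal(0.0, 0.1, (n_class, n_res))
s = rng.random(n_in)
V = np.zeros(n_res)
for _ in range(50):
    V = step(V, s + 0.1 * rng.standard_normal(n_in))
choice, x = decide(V, theta, W_out, rng)
```

Sampling the class (rather than taking the argmax of y) reproduces the stochastic decision rule described above.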
In the first paradigm, the external input s(t) is derived from the simulated response of 24 projection neurons (PNs, second-order neurons in the fly olfactory system) to 110 different odors, based on physiological recordings of olfactory receptor neurons (ORNs) and known characteristics of the ORN-PN synapse [16] [17]. This simulated activity, which we call HO (for Hallem-Olsen), has previously been used in computational analyses of fly olfaction [18] [19] [20]. If s^HO is the 24-dimensional vector describing the activities of the input neurons, the ith dimension of the input is the corresponding component of s^HO corrupted by a noise term, where the noise is a Gaussian distributed random variable with zero mean and unit variance. Thus, the temporal dependence of the external stimulus is due to the presence of noise only and the stimulus, without carrying any relevant temporal information, is practically static
(indeed, a simple network of unconnected integrators can solve this first experiment successfully as long as the characteristic times of the nodes are large enough to smooth out the fluctuations of the signal). Each stimulus is then associated to a randomly chosen class, which is the desired outcome of the classification. Given the random nature of the pairing between stimuli and corresponding correct outputs, the model cannot exploit correlations among different signals, and the performance achieved is a measure of pure memorization ability. A scheme of the task is shown in fig. 1.

In the second paradigm we evaluated the performance of the models in classifying sequences of three successive stimuli. The procedure for building different sequences is described in the third panel of fig. 1. Given a base sequence of randomly selected stimuli, we substituted the last element with new random stimuli and associated each new sequence to a different random class. From now on, we will call 'perturbations' the new substituted elements of a sequence and 'context' the remaining elements. The same procedure is then applied to the previous elements of the base sequence, as illustrated in the left column of the third panel of fig. 1. The purpose of this procedure of defining sequences through perturbations of single elements is to test the temporal memory, or echo, of the reservoir. For a given sequence, the network has to remember the association between a perturbation and the desired output class. However, if perturbations are not repeated, the model does not have to take into account the relationships among the various elements: memorization of the perturbation alone is sufficient to obtain a correct classification.
In order to make the task more challenging, we also changed the context of a perturbed sequence by replacing the two context elements with new random stimuli; the new succession is then associated to a different class than the original one. Contextual variations are illustrated in the right column of the third panel of fig. 1. The overall procedure defines a systematic way to test the temporal and contextual memory capacity of the network while preserving the random association between desired classes and successions of elements. Apart from possible similarities among sequences of the same class due to poor statistics, there are no correlations between inputs that could help the classification procedure, and each sequence has to be classified independently. Thus, this methodology creates a total number of sequences proportional to the number of base successions, with one factor due to the number of elements in a succession, one to the number of perturbations per element, and one to the contextual changes.
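The sequence-construction procedure above can be sketched as follows. The number of perturbations per position and the exact pairing scheme are not fully specified in the text, so the counts below are illustrative assumptions.

```python
import numpy as np

def build_sequences(n_stimuli, n_base, n_pert, rng):
    """For each base triplet: perturb each of the three positions with
    n_pert new random stimuli, and pair every perturbed triplet with a
    contextual variant in which the two context elements are replaced.
    Every resulting sequence gets its own random class label."""
    sequences = []
    for _ in range(n_base):
        base = list(rng.choice(n_stimuli, size=3, replace=False))
        for pos in range(3):                      # perturb each element in turn
            for _ in range(n_pert):
                pert = list(base)
                pert[pos] = int(rng.integers(n_stimuli))
                sequences.append(pert)            # perturbation, old context
                ctx = [int(rng.integers(n_stimuli)) if j != pos else pert[pos]
                       for j in range(3)]
                sequences.append(ctx)             # same perturbation, new context
    labels = list(rng.permutation(len(sequences)))  # random class per sequence
    return sequences, labels

seqs, labels = build_sequences(n_stimuli=110, n_base=5, n_pert=2,
                               rng=np.random.default_rng(1))
```

With these counts, each base triplet yields 3 positions × n_pert perturbations × 2 (original context plus contextual change) sequences, matching the multiplicative structure described in the text.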
2.2 Algorithms
The objective of this section is to review two possible algorithms, capable of learning sparse representations of input stimuli by optimizing the vector of thresholds θ of equation 2, and to introduce a benchmark model. The first algorithm is based on the derivation of the gradient of the error function with respect to the individual neuronal thresholds, while the second is the well-known Metropolis algorithm. By optimizing the sparsity level, we envisage that the desired learning rule would also separate the representations of stimuli belonging to different classes in order to facilitate the classification, as non-specific neurons, i.e. neurons that fire in response to all stimuli, are likely unhelpful for the learning process. The decrease of the overlap between antagonist representations will be quantified through a specificity measure introduced in the next section.
Benchmark
The benchmark model exploits gradient descent on the output weights W_out without the introduction of the thresholds and the additional layer x. The output activity corresponds to a readout of the V variable (eq. 1). All the other algorithms analysed will also learn through gradient descent, but their output activity will be a readout of the x variable (eq. 2), and they will have the additional complexity of learning the thresholds θ.
Gradient Descent on θ
In addition to learning the 'output' weights W_out of the reservoir with gradient descent, we choose to learn the local firing thresholds. The derivative of the cost function with respect to the threshold θ_i leads to

Δθ_i = −η Σ_k (ỹ_k − y_k) W_out,ki H(V_i − θ_i)    (4)

where η is the learning rate and H is the Heaviside function. While this is a supervised scenario, we adopt a Reinforcement Learning terminology: eq. 4 can be viewed as a sum of prediction errors (ỹ_k − y_k), where each reinforcement signal belongs to a different class k and is modulated by the corresponding output synapses. In order to interpret eq. 4, let us consider the case in which a single output class k dominates the sum, so that we get a qualitative idea of what eq. 4 means. In this specific scenario eq. 4 becomes

Δθ_i = −η (ỹ_k − y_k) W_out,ki H(V_i − θ_i),

where the desired output ỹ_k is a positive quantity, usually set to one for the correct class and to zero otherwise in a classification task. In this ideal case the learning rule is driven by the difference ỹ_k − y_k: it leads to a decrease of the ith firing threshold when the ith node, being active, is helping to achieve the right classification, and to an increase of the threshold value when the node is contributing to reaching the wrong output. The general case is now understandable by considering that the local factors W_out,ki H(V_i − θ_i) are modulated through feedback signals that give priority to the outputs that are farthest from the desired values.
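Assuming the quadratic cost of eq. 3 with a linear readout, the threshold update of eq. 4 can be sketched as below; the learning rate and the array shapes are illustrative.

```python
import numpy as np

def threshold_step(theta, V, W_out, y_des, y, lr=0.01):
    """Eq. (4): delta theta_i = -lr * sum_k (y_des_k - y_k) * W_out[k, i]
    * H(V_i - theta_i). Active nodes that help the correct output get lower
    thresholds; active nodes that push toward the wrong output get higher
    ones. Silent nodes (V_i <= theta_i) receive no update at all."""
    H = (V > theta).astype(float)          # Heaviside factor: active nodes only
    feedback = (y_des - y) @ W_out         # per-node sum of prediction errors
    return theta - lr * feedback * H

theta = np.array([0.5, 0.5])
V = np.array([1.0, 0.0])                   # node 0 active, node 1 silent
W_out = np.array([[1.0, 1.0]])             # single output class
new_theta = threshold_step(theta, V, W_out,
                           y_des=np.array([1.0]), y=np.array([0.4]))
# node 0 helped the under-shooting output, so its threshold decreases;
# node 1 was silent, so its gradient is zero and its threshold is unchanged.
```

The zero gradient of the silent node in this example is exactly the unidirectionality problem discussed next.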
The problem with this gradient-based rule is that it changes the activation of the nodes unidirectionally: while eq. 4 can deactivate nodes that are not useful for the classification task, it cannot reactivate neurons that have become silent, because the gradient of a non-active node is zero by definition.
Furthermore, we will later see that the mean of the distribution of thresholds found by this algorithm is suboptimal (a consideration supported by the results of fig. 3). The limitations of a gradient-based approach led us to consider the following algorithm.
Metropolis algorithm
While the previous algorithm optimizes a separate threshold for each neuron, the algorithm analysed in this section exploits a single global threshold θ for the whole network (θ_i = θ for all i). Indeed, a desired sparsity level is reachable with a single value of θ, and preliminary simulations demonstrated that adopting different fixed values of the global threshold can lead to different performance. We want to analyse the consequences of optimizing this parameter with an algorithm that, instead of using the derivative of the cost function, performs a stochastic search: it randomly perturbs the value of θ and accepts or rejects the new value with some probability. In the proposed implementation this probability is given by the Metropolis algorithm [21], where the energy of the system corresponds to the cost function E. The procedure adopted is summarized in the following steps:

Starting from a value θ of the global threshold, propose a new value θ′ = θ + ε, where ε is a random perturbation.

Repeat for T steps:
Compute the cost E for the two threshold values.
Update the output weights through gradient descent on E.
Compute a running exponential average Ē of E.

Accept the network corresponding to θ′ with probability min{1, exp[−(Ē(θ′) − Ē(θ))/T_M]}, where T_M is a temperature parameter.
Practically, the algorithm proposes a new reservoir with global threshold θ′, changes the output weights through standard gradient descent, and accepts the new network by applying the Metropolis rule to a running exponential average of the cost function. In the above scheme, T is the number of steps used to compute Ē while the threshold values are kept fixed, and the decay factor of the exponential average defines the memory of Ē.
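The acceptance step and the running average can be sketched as follows; the temperature and the decay factor are free parameters of this sketch, not values from the text.

```python
import numpy as np

def metropolis_accept(E_avg_old, E_avg_new, temperature, rng):
    """Metropolis rule on the running cost averages: always accept a
    proposal that lowers the averaged cost, otherwise accept with
    probability exp(-(E_new - E_old) / temperature)."""
    dE = E_avg_new - E_avg_old
    return dE <= 0 or rng.random() < np.exp(-dE / temperature)

def running_average(E_avg, E, decay=0.9):
    """Exponential running average of the cost; `decay` sets its memory."""
    return decay * E_avg + (1.0 - decay) * E
```

A proposal that lowers the averaged cost is always kept, while a worsening proposal is occasionally kept, which is what allows the stochastic search to escape poor threshold values.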
2.3 Proposed algorithm, a unified approach
The proposed algorithm exploits a mixture of the Metropolis and gradient-descent updating rules to change the values of the thresholds. Each single node i has a threshold

θ_i = θ_g + θ_l,i    (5)

defined as the sum of a global factor θ_g and a local factor θ_l,i. The proposed model optimizes the global part through stochastic perturbations and the local one via gradient descent. The algorithm shares the same steps as the Metropolis procedure explained above (section Metropolis), with the following differences: the stochastic perturbation of the first step is applied to the global factor θ_g, and the gradient descent of the update step is applied to the local thresholds θ_l as well (instead of the output weights only). For clarity, the scheme of the final algorithm is

Starting from a value θ_g of the global factor, propose a new value θ_g′ = θ_g + ε, where ε is a random perturbation.

Repeat for T steps:
Compute the cost E for the two threshold values.
Update the output weights and the local thresholds θ_l through gradient descent on E.
Compute a running exponential average Ē of E.

Accept the network corresponding to θ_g′ with probability min{1, exp[−(Ē(θ_g′) − Ē(θ_g))/T_M]}.
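The steps above can be sketched as one outer round of the composite algorithm. The `train` callback stands in for the inner gradient-descent loop on W_out and θ_l and returns the running average of the cost; `sigma` and `temperature` are illustrative placeholders.

```python
import numpy as np

def composite_round(state, train, rng, sigma=0.05, temperature=0.1):
    """Perturb the global threshold, evaluate both candidates through the
    caller-supplied inner training loop, then apply the Metropolis rule
    to decide which network to keep."""
    proposal = dict(state)
    proposal["theta_g"] = state["theta_g"] + sigma * rng.standard_normal()
    E_old = train(state)                   # averaged cost, current network
    E_new = train(proposal)                # averaged cost, perturbed network
    if E_new <= E_old or rng.random() < np.exp(-(E_new - E_old) / temperature):
        return proposal
    return state

# Toy usage: with a cost depending only on theta_g, the stochastic search
# drives the global threshold toward the optimum (here 1.0, illustrative).
rng = np.random.default_rng(0)
state = {"theta_g": 0.0}
cost = lambda s: abs(s["theta_g"] - 1.0)
for _ in range(300):
    state = composite_round(state, cost, rng, sigma=0.2, temperature=0.02)
```

In the full algorithm the inner loop would also carry the gradient updates of W_out and θ_l between cost evaluations, as listed in the scheme.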
Since the algorithm is now learning three sets of variables, W_out, θ_g, and θ_l, one important aspect to take into account is that the effective learning rate depends on the sparsity level. In particular, changing the initial value of θ_g can strongly affect the effective learning rate: because the optimization of the weights and of the local thresholds depends on the step size, and consequently on the sparsity of x, the initial value of θ_g can be important even though the stochastic variations of the Metropolis rule are optimizing it. In order to choose the starting condition, the algorithm exploits a short pre-learning phase that applies the same steps described in the scheme above, but resets W_out and θ_l to their initial values after each Metropolis update while trying different initial values of θ_g. In this way, we guarantee a good starting condition. In practice, this procedure was performed only once for each task considered, without fine-tuning the initial θ_g for every simulation.
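The pre-learning scan can be sketched as below; `evaluate` stands in for a short run of the scheme above from a given starting threshold, and all names are illustrative.

```python
import numpy as np

def prelearning_scan(candidates, evaluate, init_params):
    """Try several initial values of the global threshold, resetting W_out
    and the local thresholds to their initial values before each trial,
    and keep the best-performing starting value."""
    best_theta, best_cost = None, np.inf
    for theta_g in candidates:
        params = dict(init_params)        # reset W_out and theta_l each trial
        trial_cost = evaluate(theta_g, params)
        if trial_cost < best_cost:
            best_theta, best_cost = theta_g, trial_cost
    return best_theta

# Toy usage with a cost minimized at theta_g = 0.4 (illustrative only).
best = prelearning_scan(
    candidates=[0.0, 0.2, 0.4, 0.8],
    evaluate=lambda t, p: (t - 0.4) ** 2,
    init_params={"W_out": None, "theta_l": None},
)
```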
2.4 Specificity
Since a sparsity level is obtained through direct minimization of eq. 3 by gradient descent and/or stochastic search, we expect the optimized thresholds to decrease the overlap between the representations of stimuli corresponding to different classes. Indeed, this separation would make the classification task easier. Thus, a measure of specificity is formulated to quantify and to understand how the proposed learning can lead to better representations. Let us consider two classes k and l and a neuron i. The node is specific to class k with respect to class l if it is more active when stimuli belonging to class k are presented to the network. Generalizing this idea, it is possible to build a tensor S defined as

S^i_kl = (n^i_k − n^i_l) / N    (6)

where n^i_k (n^i_l) is the number of times neuron i was active after the presentation of a stimulus of class k (l), and N is the total number of episodes taken into account to compute eq. 6. There are other possibilities analogous to eq. 6: N could be substituted by n^i_k + n^i_l to define a relative specificity rescaled by the number of times a neuron was active, although a decrease of that denominator causes an undesirable increase of the specificity measure; another choice is to use the sum of the activities of a neuron for all the stimuli of a class instead of the activation counts, but this leads to comparable results. We considered the above equation the most easily interpretable. Given S, it is possible to compute a measure of specificity for each single neuron as
s_i = Σ_{k<l} |S^i_kl|    (7)

where we considered only the upper triangular part of S^i because of the symmetry of |S^i_kl| in the class indices.
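Eqs. 6 and 7 can be computed in a few lines; the activation counts are assumed to be collected per neuron and per class, and no normalization beyond the text is assumed.

```python
import numpy as np

def specificity(active_counts, n_episodes):
    """Eqs. (6)-(7): active_counts[i, k] counts the episodes of class k in
    which neuron i was active. S[i, k, l] = (n_ik - n_il) / N, and each
    neuron's specificity sums |S| over the upper triangle of the class
    pair indices (k < l)."""
    S = (active_counts[:, :, None] - active_counts[:, None, :]) / n_episodes
    k, l = np.triu_indices(active_counts.shape[1], k=1)
    return np.abs(S[:, k, l]).sum(axis=1)

# A class-specific neuron (row 0) scores higher than an unspecific one (row 1).
counts = np.array([[10.0, 0.0], [5.0, 5.0]])
spec = specificity(counts, n_episodes=20)
```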
Table: Parameters and values for the tasks, the network, and the model.
3 Results
We will now compare the four algorithms on the two tasks (section Tasks): gradient descent on the output weights, gradient descent on the weights and thresholds, the Metropolis algorithm, and the composite model.
Task 1, static input
In this scenario the model has to classify, after a fixed time interval, a noisy 24-dimensional stimulus. Each input is defined as described in the methodological section and is randomly assigned to a class in order to test the pure memorization ability of the network. The performance is shown in fig. 2, where the left panel reports a training example. The parameters defining the task and the hyperparameters of the composed algorithm are reported in the table above.
The black line refers to the standard model (gradient descent on the output weights), where the thresholds and the x variable are not introduced and the output of the neural network is a readout of V. Since the proposed algorithms are more complex and can exploit an exponential running average of the cost function to update the global parameter, we optimised the batch size of the standard model to improve its performance and to make the comparison to the disadvantage of the proposed algorithms. It is clear how the three models that exploit the additional complexity of the thresholds outperform the standard model in terms of convergence speed and accuracy. The right panel reports the performance after learning (one episode corresponds to a single presentation of a stimulus that ends when the network makes a decision) as the number of stimuli to be classified changes and the task becomes more demanding. The differences between the models become more evident as the difficulty increases, and the composite model robustly reports the best classification accuracy.
After learning, the average percentage of active nodes is about fifty percent for the composed algorithm and the Metropolis, while it is about seventy percent for the gradient-based rule. This difference in the sparsity levels found by the three algorithms is visible in the left panel of fig. 3, where the variance and the mean of the optimized distribution of θ are reported for all the cases. The Metropolis and the composed model report higher values of the mean of the thresholds distribution (blue and red diamonds) in comparison to the gradient-based rule, whose lower thresholds lead to a correspondingly higher coding level. However, the increasing trend of the mean threshold for all the learning rules demonstrates that sparsity is needed when the memorization demand on the network is high. The importance of a local threshold is reflected in the spread of the distribution: as the performance of the Metropolis and the composed model diverge going from left to right (right panel of fig. 2), the variance of the optimized distribution rises (fig. 3).
Thus, even if learning a single global parameter leads to remarkable improvements over the standard model, learning a distribution of thresholds by optimizing local parameters becomes relevant when the model has to memorize a large number of inputs. The surprisingly good results obtained by optimizing a global factor through stochastic changes are due to the static nature of the signal used for this task; the importance of the local thresholds will become dominant for sequence classification.
Task 2, sequence classification
Each input corresponds to a noisy succession of three stimuli, defined as described in Methods. A single element of the succession lasts for a fixed interval and the network is asked to classify at the end of the sequence. Sequences are randomly associated to classes as in the previous task, but in this case the model has to remember the past of the signal and to take into account the relationships among the elements of the sequence in order to classify correctly. For this specific case, we used one batch size for the threshold-learning models (gradient descent on the thresholds, composed model, and Metropolis) and a different one for the standard algorithm; the batch size of the latter was chosen to optimize its performance in terms of speed and accuracy.