A biological gradient descent for prediction through a combination of STDP and homeostatic plasticity

06/21/2012 · by Mathieu Galtier, et al.

Abstract

Identifying, formalizing and combining biological mechanisms which implement known brain functions, such as prediction, is a main aspect of current research in theoretical neuroscience. In this letter, the mechanisms of Spike Timing Dependent Plasticity (STDP) and homeostatic plasticity, combined in an original mathematical formalism, are shown to shape recurrent neural networks into predictors. Following a rigorous mathematical treatment, we prove that they implement the online gradient descent of a distance between the network activity and its stimuli. The convergence to an equilibrium, where the network can spontaneously reproduce or predict its stimuli, does not suffer from bifurcation issues usually encountered in learning in recurrent neural networks.

1 Introduction

One of the main functions of the brain is prediction [Bar, 2009]. This function is generally thought to rely on the idea that cortical regions learn a model of the world and simulate it to generate predictions of future events [Gilbert and Wilson, 2007, Schacter et al., 2008]. Several recent experimental findings support this view, showing in particular that the spontaneous neuronal activity after presentation of a stimulus is correlated with the evoked activity [Kenet et al., 2003], and that this similarity increases over the course of development and learning [Berkes et al., 2011]. Moreover, at the scale of neuronal networks, prediction can also be seen as a general organization principle: it has been argued [Rao et al., 1999, Clark, 2012] that the brain may contain a hierarchy of predictive units which are able to predict their direct stimuli or inputs through the modification of their synaptic connections.

Understanding the mechanisms and principles underlying this prediction function is a key challenge, not only from a neuroscience perspective but also for machine learning, where applications requiring prediction abound.

In the field of machine learning, recurrent neural networks have been successfully proposed as candidates for these predictive units [Williams and Zipser, 1989, Williams and Zipser, 1995, Pearlmutter, 1995, Jaeger and Haas, 2004, Sussillo and Abbott, 2009]. In most cases, these algorithms aim at creating a neural network that autonomously and spontaneously reproduces a given time series. The Bayesian approach is also useful in designing predictors [Dayan et al., 1995, George and Hawkins, 2009] and has also been mapped onto neural networks [Deneve, 2008, Friston, 2010, Bitzer and Kiebel, 2012]. However, apart from a rough conceptual equivalence, this letter avoids Bayesian terminology and focuses directly on neural networks. In this framework, prediction is often achieved by minimizing a distance between the activity of the neural network and the target time series. Although neural networks were originally studied in a feedforward framework [Rosenblatt, 1958], the most efficient networks for prediction involve recurrent connections, which give the network memory properties. So-called gradient descent algorithms for recurrent neural networks [Mandic and Chambers, 2001] involve learning the entire connectivity matrix; they minimize the distance between a target trajectory and the trajectory of the network. On the other hand, researchers in the field of reservoir computing [Lukosevicius and Jaeger, 2009] optimize only some connections in the network, whereas the others are randomly drawn and fixed. To do prediction, they minimize the "one-step-ahead" error, corresponding to the distance between the network predictions and the next time step of the target time series. Thus, these algorithms are derived to optimize an accuracy criterion, with learning rules generally favoring prediction efficiency over biological meaning.

In the field of neuroscience, recent years have seen many discoveries in the study of synaptic plasticity, in particular providing experimental evidence and possible mechanisms for two major concepts in the current biology of learning.

The first is the discovery of spike-timing dependent plasticity (STDP) [Markram et al., 1997, Bi and Poo, 1998, Caporale and Dan, 2008, Sjöström and Gerstner, 2010]. It is a temporally asymmetric form of Hebbian learning induced by temporal correlations between the spikes of pre- and post-synaptic neurons. The general principle is that if a neuron fires before (resp. after) another, then the strength of the connection from the former to the latter is increased (resp. decreased). The summation of all these modifications leads to the strengthening of causal links between neurons. Although STDP was originally defined for spiking networks, it has several extensions or analogs for rate-based networks (those used in machine learning) [Kempter et al., 1999, Izhikevich and Desai, 2003, Pfister and Gerstner, 2006]. The functional role of STDP is still debated; proposed roles include reducing latency [Song et al., 2000], optimizing information transfer [Hennequin et al., 2010], invariant recognition [Sprekeler et al., 2007] and even learning temporal patterns [Gerstner et al., 1993, Rao and Sejnowski, 2001, Yoshioka et al., 2007] (a non-exhaustive list).

Second, the notion of homeostatic plasticity [Miller, 1996, Abbott and Nelson, 2000, Turrigiano and Nelson, 2004], including mechanisms such as synaptic scaling, has proved important in moderating the growth of connection strengths. In contrast with earlier, theory-motivated normalizations of the connectivity [Miller and MacKay, 1994, Oja, 1982], a biologically plausible mechanism is needed to prevent the connectivity from exploding under the influence of shaping mechanisms such as Hebbian learning or STDP.

From a theoretical viewpoint, STDP and homeostatic plasticity are almost always studied independently. An extensive bottom-up numerical analysis of the combination of such learning mechanisms, carried out by Triesch and colleagues, has already led to biologically relevant behaviors [Lazar et al., 2007, Lazar et al., 2009, Zheng et al., 2013]. However, the mathematical understanding of their combination in terms of functionality remains an open challenge.

This letter aims at bridging the gap between biological mechanisms and machine learning regarding the issue of predictive neural networks. We rigorously show how a biologically inspired learning rule, made of an original combination of an STDP mechanism and homeostatic plasticity, mimics the gradient descent of a distance between the activity of the neural network and its direct stimuli. This amounts to capturing the underlying dynamical behavior of the stimuli in a recurrent neural network, and therefore to designing a biologically plausible predictive network.

The letter is organized as follows. In section 2, we construct a theoretical learning rule designed for prediction, based on an appropriate gradient descent method. Then, in section 3, we introduce a biologically inspired learning rule, combining the concepts of STDP and homeostatic plasticity, whose purpose is to mimic the theoretical learning rule. We discuss the various biological mechanisms which may be involved in this new learning rule. Finally, in section 4, we provide a mathematical justification of the link between the theoretical and the biologically inspired learning rules, based on the key idea that STDP can be seen as a differential operator.

2 Theoretical learning rule for prediction

In a machine learning approach, we introduce here a procedure to design a neural network which autonomously replays a target time series.

2.1 Set-up

We consider a recurrent neural network made of n neurons which is exposed to a time-dependent input u(t) of the same dimension. Our aim is to construct a learning rule which enables the network to reproduce the input's behavior.

Our approach is focused on learning the underlying dynamics of the input. Therefore, we assume that u is generated by an arbitrary dynamical system:

du/dt = F(u)     (1)

with F a smooth vector field from R^n to R^n. We also assume that the trajectory of the inputs or stimuli is T-periodic. The key mathematical assumption on the input is in fact ergodicity, but we restrict our study to periodic inputs for simplicity. In particular, periodic inputs can be constructed by the repetition of a given finite-time sample.
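To make this set-up concrete, here is a minimal Python sketch (all function names and parameter values are ours, not the letter's) that samples a T-periodic input which can stand in for equation (1) in the illustrations below.

```python
import numpy as np

def make_periodic_input(n=3, T=1.0, dt=1e-3, seed=0):
    """Sample one period of a smooth T-periodic input u(t) in R^n.

    A minimal stand-in for the dynamical system (1): each component is a
    random mixture of the first three harmonics of the period T, which is
    smooth, T-periodic and rich enough to exercise the learning rules.
    Returns times of shape (n_steps,) and u of shape (n_steps, n).
    """
    rng = np.random.default_rng(seed)
    t = np.arange(0.0, T, dt)
    u = np.zeros((t.size, n))
    for k in range(1, 4):
        amp = rng.uniform(0.2, 1.0, size=n)
        phase = rng.uniform(0.0, 2.0 * np.pi, size=n)
        u += amp * np.sin(2.0 * np.pi * k * t[:, None] / T + phase)
    return t, u
```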

Although the following method virtually works with any network equation, we focus on a neural network composed of n neurons and governed by

dv/dt = −l v + W·S(v)     (2)

where v(t) ∈ R^n is a vector representing neuronal activity, W is the n × n connectivity matrix, l > 0 is a decay constant and S is an entry-wise sigmoid function.
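A corresponding sketch of the network (2) follows, using Euler integration and tanh as the entry-wise sigmoid — both our choices, since the letter does not prescribe them. The optional input term anticipates the driven equation used in section 3.

```python
import numpy as np

def simulate_network(W, v0, l, dt, n_steps, u=None):
    """Euler integration of the rate network (2): dv/dt = -l v + W S(v),
    optionally driven by an additive input u(t) as in system (6).

    W: (n, n) connectivity, v0: (n,) initial state, u: (n_steps, n) or None.
    Returns the activity trajectory, shape (n_steps, n).
    """
    S = np.tanh  # entry-wise sigmoid; the letter only requires some sigmoid
    v = np.array(v0, dtype=float)
    traj = np.empty((n_steps, v.size))
    for k in range(n_steps):
        drive = u[k] if u is not None else 0.0
        v = v + dt * (-l * v + W @ S(v) + drive)
        traj[k] = v
    return traj
```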

2.2 Gradient descent learning rule

The idea behind our learning rule is to find the best connectivity matrix W, i.e., the one minimizing a distance between the two vector fields defined by (1) and (2). In this perspective, we define the following quantity:

H(W) = (1/2T) ∫₀^T ‖ du/dt(t) + l u(t) − W·S(u(t)) ‖² dt     (3)

When H(W) = 0, the vector fields of systems (1) and (2) are equal on the trajectories of the inputs. This quantity may be viewed as a distance between the two vector fields defining the dynamics of the inputs and of the neuronal network along the trajectories of the inputs. One may notice that it is similar to classical gradient methods [Pearlmutter, 1995, Williams and Zipser, 1995, Mandic and Chambers, 2001], except that the norm is applied to the vector fields of the inputs and of the neural network instead of their activities. Thus, it focuses more specifically on the dynamical structure of the inputs. Moreover, it is possible to show, using Girsanov's theorem, that this definition coincides with the concept of relative entropy between two diffusion processes, namely the ones obtained by adding a standard Gaussian perturbation to both equations. Therefore, we will call H the relative entropy.

In order to capture the dynamics of the inputs into the network, it is natural to look for a learning rule minimizing this quantity. To this end, we consider the gradient of this measure with respect to the connectivity matrix:

dW/ds = −∇_W H(W) = ⟨ (du/dt + l u − W·S(u))·S(u)^T ⟩     (4)

where the component (i, j) of ∇_W H(W) is ∂H/∂W_ij, and ⟨a·b^T⟩ := (1/T) ∫₀^T a(t) b(t)^T dt for functions a, b from R to R^n. Equivalently, these functions can be seen as semi-continuous matrices with n rows and a continuum of columns indexed by time, and b^T is the transpose of b.
Thus, an algorithm implementing the gradient descent (4) is a good candidate to minimize the relative entropy between inputs and spontaneous activity. Since H is quadratic in W, it follows that H is a convex function, thus excluding situations with multiple local minima. Moreover, if ⟨S(u)·S(u)^T⟩ is invertible, one can compute directly the minimizing connectivity as

W* = ⟨ (du/dt + l u)·S(u)^T ⟩ · ⟨ S(u)·S(u)^T ⟩^{-1}     (5)

Implementing a gradient descent based on equation (4) does not immediately lead to a biologically relevant mechanism. First, it requires direct access to the inputs u, whereas synaptic plasticity mechanisms can only rely on the network activity v. Second, it is a batch learning algorithm which requires access to the entire history of the inputs. Third, it requires the ability to compute the time derivative du/dt. Therefore, we will see in section 3 how to overcome these issues by combining biologically inspired synaptic plasticity mechanisms.
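Under the reconstruction of (4) and (5) given above, and with du/dt approximated by finite differences, a batch implementation might look as follows (a sketch; the helper names are ours).

```python
import numpy as np

def gradient_descent_batch(u, l, dt, lr=0.1, n_iter=500):
    """Batch gradient descent (4) on the relative entropy H(W).

    u: (n_steps, n) array holding one period of the input.
    The update is dW/ds = <(du/dt + l u - W S(u)) S(u)^T>, with <.> the
    temporal average over the period.
    """
    S = np.tanh
    du = np.gradient(u, dt, axis=0)          # finite-difference estimate of du/dt
    Su = S(u)
    W = np.zeros((u.shape[1], u.shape[1]))
    for _ in range(n_iter):
        err = du + l * u - Su @ W.T          # residual along the trajectory
        W = W + lr * (err.T @ Su) / u.shape[0]
    return W

def closed_form_connectivity(u, l, dt):
    """Direct minimizer (5), assuming <S(u) S(u)^T> is invertible."""
    S = np.tanh
    du = np.gradient(u, dt, axis=0)
    Su = S(u)
    A = (du + l * u).T @ Su                  # ~ <(du/dt + l u) S(u)^T>
    B = Su.T @ Su                            # ~ <S(u) S(u)^T>
    return A @ np.linalg.inv(B)
```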

2.3 Example

Figure 1: Learning how to write the letter A. Time evolution of the input movie (top row) and of the network activity after learning (bottom row). Each pixel corresponds to a neuron.

In order to illustrate the idea that the learning rule (4) enables the network to learn dynamical features of the input, we have constructed the following experiment. We present to the network an input movie displaying sequentially the writing of the letter A (Figure 1, top row). To each pixel we assign one neuron, so that the input and the network share the same dimension. This input movie is repeated periodically until the connectivity matrix of the network, evolving under rule (4), stabilizes. Then the input is turned off and we set the initial state of the network to a priming image showing the bottom left part of the letter A. The network, evolving without input, strikingly reproduces the dynamical writing of the letter A, as displayed in Figure 1. This example illustrates the ability of the learning rule, derived from a theoretical principle, to capture a dynamic input into a connectivity matrix.
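A hypothetical end-to-end run of the sketches above — learning on one period (using the closed form (5) in place of iterating (4)) and then replaying from a priming state — could read:

```python
# Hypothetical usage of the sketches above; sizes and constants are ours.
t, u = make_periodic_input(n=25, T=1.0, dt=1e-3)      # a 5x5-pixel "movie"
W = closed_form_connectivity(u, l=1.0, dt=1e-3)       # learn the dynamics
priming = 0.5 * u[0]                                  # partial initial state
frames = simulate_network(W, priming, l=1.0, dt=1e-3, n_steps=2000)  # replay
```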

3 A biological learning rule

We now introduce a biological learning rule made of the combination of STDP and homeostatic plasticity. Later, in section 4, we show that this learning rule minimizes the relative entropy H. Here, we first give a mathematical description of this learning rule and, second, relate the different terms to biological mechanisms.

3.1 Mathematical description

Learning corresponds to a modification of the connectivity simultaneous with the evolution of the network activity. The result is a coupled system of equations. The learning rate ε is chosen to be small, ε ≪ 1, so that learning can be considered slow compared to the evolution of the activity. The full online learning system is

dv/dt = −l v + W·S(v) + u
dW/dt = ε [ ( (1/τ)(Γ_τ ⋆ x) + l x )·S(x)^T − W·S(x)·S(x)^T ]     (6)

where Γ_τ := γ̃_{1/τ} − γ_{1/τ}, with γ̃_a(t) := γ_a(−t) the time-reversed kernel, and where x is the input estimate

x := l v − W·(γ_l ⋆ S(v))     (7)

where the notation ⋆ denotes the convolution operator. The function γ_a is defined as γ_a(t) := a e^{−a t} H(t), with H the Heaviside function, for any positive number a, as shown in the left picture of Figure 2. As illustrated in the right picture of Figure 2, the kernel Γ_τ roughly corresponds to the classical STDP window [Bi and Poo, 1998] (taking into account a y-axis symmetry corresponding to the symmetric formalism we are using). The constant τ is a time constant corresponding to the width of the STDP window used for learning.

Figure 2: (left) Plot of the function γ_a. (right) Plot of the function Γ_τ = γ̃_{1/τ} − γ_{1/τ}; this function corresponds to the operator appearing in the averaged learning rule of section 4.1.2.
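A discretized version of these kernels (under our reconstructed definitions of γ_a and Γ_τ, which are assumptions of this sketch) can be written as:

```python
import numpy as np

def stdp_window(tau, dt, width=8.0):
    """Discretize gamma_a(t) = a e^{-a t} H(t) with a = 1/tau, its mirror
    gamma~_a(t) = gamma_a(-t), and the STDP window Gamma_tau = gamma~ - gamma.

    Returns the lag axis and the window; the window is positive for
    negative lags and negative for positive lags, as in Figure 2 (right).
    """
    a = 1.0 / tau
    lags = np.arange(-width * tau, width * tau + dt, dt)
    gamma = np.where(lags > 0, a * np.exp(-a * lags), 0.0)
    gamma_tilde = gamma[::-1]                # time-reversed (anticausal) kernel
    return lags, gamma_tilde - gamma
```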

3.2 Biological mechanisms

3.2.1 An input estimate

The variable x can be seen as a spatio-temporal differential variable which approximates the inputs u. Although unsupervised learning rules are often algebraic combinations of element-wise functions applied to the activity of the network [Gerstner and Kistler, 2002], this is not precisely the case here. Indeed, learning is based on the variable x, which corresponds to the subtraction of the temporally integrated synaptic drive W·(γ_l ⋆ S(v)) from the (rescaled) activity of the neurons, l v. For each neuron, this variable takes into account the past of all the neurons, which is spatially averaged through the connectivity and subtracted from the current activity. This gives a differential flavor to this variable, reminiscent of former learning rules: [Bienenstock et al., 1982, Sejnowski, 1977] for the temporal aspect and [Miller and MacKay, 1994] for the spatial aspect. Note that this variable is not strictly speaking local (i.e., updating the connection from neuron j to neuron i requires the value of x_j), yet it is biologically plausible since the term W·(γ_l ⋆ S(v)) is accessible for each neuron on its dendritic tree, which is a form of locality in a broader sense.

3.2.2 An STDP mechanism

The first term in (6) can be related to STDP. The antisymmetric part of this term is responsible for retrieving the drift du/dt in equation (4). The symmetric part (corresponding to Hebbian learning) is responsible for retrieving the second term, l u, in (4). Thus, it captures the causality structure of the inputs, a task generally attributed to STDP [Sjöström and Gerstner, 2010]. Beyond this similarity of functional role, we believe a simplification of this term may shed light on its deep link with STDP. The main difference between our setup and STDP is that the former is based on a rate-based dynamics, whereas the latter is based on a spiking dynamics. In a pure spike framework, i.e., when the activity is a sum of Diracs, STDP can be seen as the simple learning rule dW_ij/dt = v_i (γ ⋆ v_j) − (γ ⋆ v_i) v_j. Indeed, the term v_i (γ ⋆ v_j) is non-null only when the post-synaptic neuron i is firing and then, via the factor γ ⋆ v_j, it counts the number of preceding pre-synaptic spikes that might have caused i's excitation, weighting them by a decreasing exponential. Thus, this term exactly accounts for the positive part of the STDP curve. The negative term −(γ ⋆ v_i) v_j takes the opposite perspective and accounts for the negative part of the STDP curve. A loose extension of this rule to the case where the activity is smoothly evolving leads to identifying the function (x, y) ↦ x·(γ ⋆ y)^T − (γ ⋆ x)·y^T with the STDP mechanism for rate-based networks [Galtier, 2012, Izhikevich and Desai, 2003].

3.2.3 Homeostatic plasticity

The second term in (6) accounts for what is usually presented as homeostatic plasticity mechanisms. The previous STDP term seems to be a powerful mechanism to shape the response of the network. However, a regulatory process is needed to prevent uncontrolled growth of the network connectivity [Abbott and Nelson, 2000, Turrigiano and Nelson, 2004, Miller, 1996]. It has been argued that STDP could be self-regulatory [Van Rossum et al., 2000, Song et al., 2000], but this is not the case in our framework, and an explicit balancing mechanism is necessary to avoid the divergence of the system. This last term is the only one with a negative sign, and it is multiplicative with respect to the connectivity. Thus, according to [Abbott and Nelson, 2000], it is a reasonable candidate for homeostasis. It has been argued [Turrigiano et al., 1998, Kim et al., 2012] that homeostatic plasticity might preserve the relative synaptic weights by dividing the connectivity by a common scaling factor, theoretically preventing a possible information loss. In contrast with the ad hoc re-normalizations often introduced in other learning rules [Miller and MacKay, 1994, Oja, 1982], our relative entropy minimizing learning rule thus naturally introduces an original form of homeostatic plasticity.

Although we have separated the description of the various terms in (6), our approach suggests that homeostasis may be seen, not necessarily as a scaling term, but as a constitutive part of a learning principle, deeply entangled [Turrigiano, 1999] with the other learning mechanisms.

3.3 Numerical application

Although the focus of this letter is on theory, we introduce a simple numerical example to illustrate the predictive properties of the biological learning rule. More precisely, we investigate the question of retrieving the connectivity of a neural network from the observation of the time series of its activity. This inverse problem is a challenging topic in computational neuroscience [Friston et al., 2003, Galán, 2008, Potthast and beim Graben, 2009], since it may give access to large-scale effective connectivities simply from the observation of neuronal activity. Here we address it in an elementary framework. The network generating the activity patterns, referred to as the input network, evolves according to an equation of the form (2) with a fixed connectivity matrix W₀. For this example, the input network is made of three neurons and its connectivity W₀ is shown in Figure 3.a). These parameters were chosen so that the activity is periodic, as shown by the dashed curves in Figure 3.c). Then, we simulate the entire system (6) driven by this activity and observe that its connectivity converges towards W₀.

Figure 3: Retrieving the connectivity. a) Connectivity matrix W₀ of the input network. b) Evolution of the difference between the current connectivity W and the input connectivity W₀ through learning. The black dot-dashed curve corresponds to the online learning rule (6) in the homogeneous case, i.e., with the same decay constant in both equations of (6). The grey dashed curve corresponds to the online learning system (6) in the hybrid framework, i.e., with different decay constants for the learning equation and for the network equation. The black plain curve (superposed on the grey dashed curve) corresponds to the batch relative entropy minimization (4). c) The dashed curves correspond to the activity of the inputs. It is a three-dimensional input (i.e., n = 3), and the three different grey levels correspond to the different dimensions. The plain curves correspond to the simulation of the network (top equation in (6)) after learning, in the hybrid framework, and without inputs.

As shown in the next section, a separation of time scales between the inputs and the network is necessary for the online learning rule (6) to approximate the inputs' dynamics accurately. Given that the time constant of the inputs is governed by the decay constant of the input network, the previous approximation holds only if the inputs are slow compared to the learning network. If this assumption is broken, e.g., when the two networks share the same decay constant, then the final connectivity matrix differs from W₀; see the black dot-dashed curve in Figure 3.b). Indeed, in this case, the network only learns to replay the slow variations of the inputs.

A method to recover the precise time course of the inputs consists in artificially changing the time constants at the different steps of the following algorithm. First, simulate the network equation (top equation in (6)) with a decay constant large enough that the network is fast compared to the inputs, while the time constant of the learning equation (bottom equation in (6)) is kept at its original value, thus introducing a hybrid model. In this framework, the connectivity converges exactly to W₀, as shown by the grey dashed curve in Figure 3.b). After learning, simulate the network equation with the learned connectivity and with the time constant switched back to its original value. This gives an activity as shown by the plain curves in Figure 3.c).
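The following sketch integrates the coupled system (6) as reconstructed above. The convolutions with γ are implemented online as leaky integrators; since the anticausal kernel γ̃ cannot be evaluated online, the STDP term uses the surrogate x·(γ ⋆ S(x))^T − (γ ⋆ x)·S(x)^T, which has the same temporal average by the adjoint identity (10) of section 4. The function name, the hybrid-constant handling and all numerical choices are our assumptions.

```python
import numpy as np

def online_learning(u, l, tau, eps, dt, l_net=None):
    """Integrate the coupled online system (6); returns the learned W.

    u: (n_steps, n) input samples (repeat the period several times).
    l_net: decay constant of the network equation; passing a value larger
    than l reproduces the "hybrid" framework of the text, where the
    network is made fast while the learning equation keeps the constant l.
    """
    S = np.tanh
    if l_net is None:
        l_net = l                          # homogeneous case
    n = u.shape[1]
    W = np.zeros((n, n))
    v = np.zeros(n)
    gS = np.zeros(n)                       # gamma_{l_net} * S(v)
    gx = np.zeros(n)                       # gamma_{1/tau} * x
    gSx = np.zeros(n)                      # gamma_{1/tau} * S(x)
    a = 1.0 / tau
    for uk in u:
        v += dt * (-l_net * v + W @ S(v) + uk)        # fast equation
        gS += dt * l_net * (S(v) - gS)
        x = l_net * v - W @ gS                        # input estimate (7)
        Sx = S(x)
        gx += dt * a * (x - gx)
        gSx += dt * a * (Sx - gSx)
        stdp = (np.outer(x, gSx) - np.outer(gx, Sx)) / tau  # ~ (1/tau)(Gamma*x) S(x)^T
        hebb = l * np.outer(x, Sx)                          # symmetric (Hebbian) part
        homeo = W @ np.outer(Sx, Sx)                        # homeostatic decay
        W += dt * eps * (stdp + hebb - homeo)
    return W
```

In the inverse-problem experiment of Figure 3, u would be the activity generated by the input network and the returned matrix would be compared with W₀.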

4 Link between theoretical and biological learning rules

In this part, we show how the biological learning rule (6) implements the gradient descent based on equation (4). First, we introduce three technical results which are the mathematical cornerstones of the paper and then combine them to obtain the desired result.

4.1 Three technical results

As mentioned previously, equation (4) is not biologically plausible for three main reasons: (i) it requires direct access to the inputs u, whereas synaptic plasticity mechanisms can only rely on the network activity v; (ii) it is a batch learning algorithm which requires access to the entire history of the inputs; (iii) it requires the ability to compute du/dt (equal to F(u), according to equation (1)).

On the contrary, the biological learning rule (6) does not have these problems. First, it uses an estimate of the inputs, noted x, which is based on the activity variable v only. Second, it is an online learning rule, i.e., it takes inputs on the fly, and relies on a slow-fast mechanism to temporally average the variables. Third, it approximates the computation of du/dt with a convolution inspired from STDP. In the following, we mathematically address these three points successively.

4.1.1 x is an input estimate

The online learning rule (6) is expressed by means of the activity v of the neural network, governed by the top equation in (6). However, to be comparable to the gradient of the relative entropy (4), we first need to make explicit the dependency on the inputs u. Therefore, we show that the network dynamics induces a simple relation between x and the inputs u: a simple computation in the Fourier domain shows that the convolution of γ_l with the temporal operator y ↦ dy/dt + l y results in l y, i.e., γ_l ⋆ (dy/dt + l y) = l y. Applying this result to the neural dynamics leads to reformulating the top equation in (6) as l v = γ_l ⋆ (W·S(v) + u). We recognize the definition (7) of the variable x (which was originally defined according to this relation), such that the network's dynamics of the fast equation in (6) is equivalent to

x = γ_l ⋆ u     (8)
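Equation (8) can be checked numerically; the sketch below (assuming the reconstruction above) integrates the driven network with a fixed W, forms x as in (7), and compares it with the leaky average γ_l ⋆ u.

```python
import numpy as np

def check_input_estimate(u, W, l, dt):
    """Numerical check of (8): after a transient, x(t) = (gamma_l * u)(t).

    Returns the maximal deviation between x = l v - W (gamma_l * S(v))
    and gamma_l * u over the second half of the simulation.
    """
    S = np.tanh
    v = np.zeros(u.shape[1])
    gS = np.zeros_like(v)              # gamma_l * S(v)
    gu = np.zeros_like(v)              # gamma_l * u
    dev = np.empty(u.shape[0])
    for k, uk in enumerate(u):
        v += dt * (-l * v + W @ S(v) + uk)
        gS += dt * l * (S(v) - gS)
        gu += dt * l * (uk - gu)
        dev[k] = np.abs(l * v - W @ gS - gu).max()
    return dev[dev.size // 2:].max()
```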

4.1.2 Temporal averaging of the learning rule

To prove that (6) implements the gradient descent of the relative entropy, we need to use a time-scale separation assumption, enabling the input to reveal its dynamical structure through ergodicity. Indeed, the fact that learning is very slow compared to the activity corresponds to the assumption ε ≪ 1. In this case, we can apply classical results of periodic averaging [Sanders and Verhulst, 1985] (see also [Galtier and Wainrib, 2012] in the context of neural networks) to show that the evolution of W is well approximated by

dW/dt = ε ⟨ ( (1/τ)(Γ_τ ⋆ x) + l x )·S(x)^T − W·S(x)·S(x)^T ⟩     (9)

where ⟨f⟩ := (1/T) ∫₀^T f(t) dt denotes the temporal average over one period, and Γ_τ = γ̃_{1/τ} − γ_{1/τ} as defined in section 3.1. The right picture of Figure 2 shows a plot of the function Γ_τ.

4.1.3 STDP as a differential operator

Here, we prove the two following equalities:

⟨ (γ̃_a ⋆ f)·g^T ⟩ = ⟨ f·(γ_a ⋆ g)^T ⟩     (10)

and

Γ_τ ⋆ x = 2τ (γ_{1/τ} ⋆ γ̃_{1/τ}) ⋆ (dx/dt)     (11)

This second equation is the key mathematical mechanism that makes STDP a good approximation of the temporal derivative of the inputs: the kernel γ_{1/τ} ⋆ γ̃_{1/τ} has unit mass and a width of order τ, so that Γ_τ ⋆ x ≈ 2τ dx/dt for signals x which are slow compared to τ.

Both proofs consist in going to the Fourier domain, where convolutions are turned into multiplications. We use the unitary, ordinary-frequency convention for the Fourier transform. Observe that the Fourier transform of γ_a is F[γ_a](ξ) = a/(a + 2πiξ). And we define¹ γ̃_a(t) := γ_a(−t), whose Fourier transform is a/(a − 2πiξ), the complex conjugate of F[γ_a](ξ).

¹This notation is motivated by the following: the convolution with γ_a can be written as a matrix multiplication with a continuous Toeplitz matrix whose (s, t) component is γ_a(s − t). In this framework, the convolution with γ̃_a corresponds to the multiplication with the transpose of the previous Toeplitz operator.

  1. Let us show equality (10). We proceed in two steps:

    a. The Fourier transforms of γ_a and γ̃_a are complex conjugates of each other: a/(a − 2πiξ) = conj(a/(a + 2πiξ)).

    b. By Parseval's theorem applied to the T-periodic functions f and g, the average ⟨(γ̃_a ⋆ f)·g^T⟩ equals the sum over the frequencies ξ = k/T of F[γ̃_a](ξ) f̂(ξ) ĝ(ξ)*. By (a), this equals the sum of f̂(ξ) (F[γ_a](ξ) ĝ(ξ))*, which is the frequency-domain expression of ⟨f·(γ_a ⋆ g)^T⟩.

    Combining (a) and (b) leads to the result.

  2. To prove equation (11), let us first show that γ_a ⋆ γ̃_a = (a/2) e^{−a|t|}. The Fourier transform of this convolution is the product F[γ_a](ξ)·conj(F[γ_a](ξ)) = a²/(a² + 4π²ξ²). From the usual Fourier tables, we observe that the right-hand side is the Fourier transform of (a/2) e^{−a|t|}, which gives the intermediary result. Besides,

    F[γ̃_a](ξ) − F[γ_a](ξ) = a [(a + 2πiξ) − (a − 2πiξ)] / (a² + 4π²ξ²) = (2/a)·(2πiξ)·a²/(a² + 4π²ξ²).

    Because 2πiξ is the Fourier symbol of the time derivative, taking the inverse Fourier transform of this identity gives (γ̃_a − γ_a) ⋆ x = (2/a)·(γ_a ⋆ γ̃_a) ⋆ (dx/dt). Applying it with a = 1/τ leads to equation (11).
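Identity (11) can also be verified numerically. The sketch below discretizes the convolutions with np.convolve (Riemann sums, hence the dt factors) and compares both sides of (11) away from the boundaries of a smooth test signal; the discrepancy should vanish with dt.

```python
import numpy as np

def verify_identity_11(tau=0.1, dt=1e-3, width=6.0):
    """Numerical check of (11): Gamma_tau * x = 2 tau (gamma * gamma~) * dx/dt.

    Returns the maximal discrepancy on the central half of a smooth test
    signal (boundary effects are excluded).
    """
    a = 1.0 / tau
    lags = np.arange(-width * tau, width * tau + dt, dt)
    gamma = np.where(lags > 0, a * np.exp(-a * lags), 0.0)
    gamma_tilde = gamma[::-1]
    ts = np.arange(0.0, 4.0, dt)
    x = np.sin(2 * np.pi * ts) + 0.5 * np.cos(4 * np.pi * ts)  # smooth test signal
    dx = np.gradient(x, dt)
    lhs = dt * np.convolve(x, gamma_tilde - gamma, mode="same")
    smoothing = dt * np.convolve(gamma, gamma_tilde, mode="same")  # gamma * gamma~
    rhs = 2 * tau * dt * np.convolve(dx, smoothing, mode="same")
    mid = slice(ts.size // 4, 3 * ts.size // 4)
    return np.abs(lhs[mid] - rhs[mid]).max()
```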

4.2 Main result

We prove here that the biological learning rule (6) implements the gradient descent based on (4).

The temporally averaged version of the biological learning rule is given by (9). We simultaneously use equations (8) and (11) on this formula to show that the solution of the biological learning rule is well approximated by the solution of

dW/dt = ε ⟨ ( 2 (γ_{1/τ} ⋆ γ̃_{1/τ}) ⋆ (dx/dt) + l x − W·S(x) )·S(x)^T ⟩,   with x = γ_l ⋆ u.

If u is slow enough, i.e., if its time scale is large compared with both τ and 1/l, then x ≈ u, the smoothing kernel γ_{1/τ} ⋆ γ̃_{1/τ} acts as the identity, and, since dx/dt + l x = l u exactly, the drive 2 dx/dt + l x ≈ du/dt + l u; this equation is then precisely the gradient descent of H. If u is too fast, then the network will only learn to predict the slow variations of the inputs. Some fast-varying information is lost in the averaging process, mainly because the network equation in (6) acts as a relaxation equation which filters the activity. Note that this loss of information is not necessarily a problem for the brain, since it may be extracting and treating information at different time scales [Kiebel et al., 2008]. Actually, the choice of the parameter l specifies the time scale at which the inputs are observed.

4.3 Stability

Both the theoretical and the biological learning rules are assured to make the connectivity converge to an equilibrium point whatever the initial condition. In particular, this method does not suffer from the bifurcation issues often encountered in recurrent neural network learning [Doya, 1992]. In most cases, problems arise when learning leads the network activity to a bifurcation. The bifurcation suddenly changes the value of the quantity being minimized, and this intricate coupling leads to very slow convergence rates. The reason why this method has such an unproblematic convergence property is that the relative entropy (as opposed to other energy functions traditionally used) is independent of the network activity v. Besides, for any network pattern, equation (8) ensures that x is always equal to the convolved inputs. These two arguments break the pathological coupling which prevents getting interesting convergence results.

From a mathematical perspective, the Krasovskii-LaSalle invariance principle [Khalil and Grizzle, 1992] ensures that the theoretical learning rule converges to an equilibrium point. Indeed, the relative entropy acts as an energy or Lyapunov function. If ⟨S(u)·S(u)^T⟩ is invertible, then the equilibrium point is unique and defined by equation (5). If it is not invertible, then the equilibrium depends on the initial condition.

If the inputs are slow enough, we have shown that the biological learning rule is well approximated by the theoretical gradient descent. Therefore, the former exhibits the same stable convergence behavior as the latter. Thus, the biological learning rule (6) is stable. In practice, we have never encountered a diverging situation, even when the inputs are not slow.

5 Discussion and conclusion

We have shown that a biological learning rule shapes a network into a predictor of its stimuli. This learning rule is made of a combination of an STDP mechanism and homeostatic plasticity. After learning, the network spontaneously predicts and replays the stimuli it has previously been exposed to. This was achieved by showing that this learning rule minimizes a quantity analogous to the relative entropy (or Kullback-Leibler divergence) between the stimuli and the network activity, as for other well-known algorithms [Ackley et al., 1985, Hinton, 2009].

We believe this letter brings interesting arguments to the debate concerning the functional role of STDP. We have shown that the antisymmetric part of STDP can be seen as a differential operator. When its effect is moderated by an appropriate scaling term, we argue that it could correspond to a generic mechanism shaping neural networks into predictive units. This idea is not new, but this letter may give it a stronger theoretical basis.

This study also gives central importance to the time constants characterizing the network. Indeed, the fact that the biological learning rule (6) implements a gradient descent is rigorously true only for slow inputs. Inputs are slow if they are significantly slower than both the time window τ of the STDP and the decay time 1/l of the network. This means that sub-networks in the brain could select the frequency of the information they are predicting. Given that the brain may process information at different time scales [Kiebel et al., 2008], this may be an interesting feature. Besides, note that the proposed biological learning rule (6) is partly characterized by the activity decay parameter l. This is surprising because it links the intrinsic dynamics of the neurons to the learning processes occurring at the synapses. Therefore, it may provide a direction in which to experimentally test this theory: is the symmetric part of STDP (i.e., Hebbian learning) stronger between faster neurons?

One of the characteristic features of this research is the combination of rate-based networks with the concept of spike-timing dependent plasticity. Obviously, this makes it impossible to use the rigorous, spike-based definition of STDP. However, we have argued that the operator Γ_τ in (6) is an alternative formulation which is equivalent to the others in the spiking framework and which can be trivially extended to analog networks. Besides, it does capture the functional mechanism of STDP in the rate-based framework: when the activation of a neuron precedes (resp. follows) the activation of another, the strength of the synapse from the former to the latter is increased (resp. decreased). Finally, we believe the theory presented in this letter gives support to this rate-based STDP, since it appears to fill a gap between machine learning and biology by implementing a differential operator.

This approach can be applied to other forms of network equations than (2), such as the Wilson-Cowan or Kuramoto models, leading to different learning rules. In this perspective, learning can be seen as a projection of a given arbitrary dynamical system onto a versatile neuronal network model, and the learning rule will depend on the chosen model. However, we shall remark that any network equation with an additive structure – intrinsic dynamics + communication with other neurons – as in (2) will lead to a similar structure for the learning rule, with terms that may share similar biological interpretations as developed above. A special case is the linear network, for which various statistical methods to estimate the connectivity matrix have been applied, e.g., in climate modeling [Penland and Sardeshmukh, 1995], gene regulatory networks [Yeung et al., 2002] and spontaneous neuronal activity [Galán, 2008]. In this simpler case, the biological learning rule in equation (6) keeps the same form with the sigmoid S replaced by the identity. The method developed in this article can be used to extend the inverse modeling approach previously developed in the linear case to models with non-linear interactions.
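For this linear special case, a regression-style sketch of the connectivity estimate (the specialization of (5) with S the identity; the function name is ours) is:

```python
import numpy as np

def linear_connectivity_estimate(u, l, dt):
    """Estimate W for a linear network du/dt = -l u + W u from a trajectory.

    Specializes the minimizer (5) with S = identity:
    W* = <(du/dt + l u) u^T> <u u^T>^{-1}.
    """
    du = np.gradient(u, dt, axis=0)
    A = (du + l * u).T @ u
    B = u.T @ u
    return A @ np.linalg.inv(B)
```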

One of the main restrictions of this approach is that the dimensionality of the stimuli and that of the network have to be identical. Therefore, as such, the accuracy of this biological mechanism does not match state-of-the-art machine learning algorithms. This is not necessarily a serious issue, since a high-dimensional projection of the inputs may be used as preprocessing. From a biological viewpoint, taking the example of vision, this would correspond to the retino-cortical pathway, which is not one-to-one and is highly redundant. But this limitation also puts forward the necessity of studying networks containing hidden neurons in a similar fashion. Ongoing research suggests that the mathematical formalism is well suited to extending this approach to the field of reservoir computing [Jaeger and Haas, 2004].

Acknowledgments

The authors thank Herbert Jaeger for his helpful comments on the manuscript.

MNG was partially funded by the Amarsi project (FP7-ICT #24833), the ERC advanced grant NerVi #227747, the région PACA, France and the IP project BrainScaleS #269921.

References

  • [Abbott and Nelson, 2000] Abbott, L. and Nelson, S. (2000). Synaptic plasticity: taming the beast. Nature neuroscience, 3:1178–1183.
  • [Ackley et al., 1985] Ackley, D., Hinton, G., and Sejnowski, T. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169.
  • [Bar, 2009] Bar, M. (2009). Predictions: a universal principle in the operation of the human brain. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1521):1181–1182.
  • [Berkes et al., 2011] Berkes, P., Orbán, G., Lengyel, M., and Fiser, J. (2011). Spontaneous cortical activity reveals hallmarks of an optimal internal model of the environment. Science, 331(6013):83.
  • [Bi and Poo, 1998] Bi, G. and Poo, M. (1998). Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type. The Journal of Neuroscience, 18(24):10464–10472.
  • [Bienenstock et al., 1982] Bienenstock, E., Cooper, L., and Munro, P. (1982). Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex. The Journal of Neuroscience, 2(1):32–48.
  • [Bitzer and Kiebel, 2012] Bitzer, S. and Kiebel, S. (2012). Recognizing recurrent neural networks (rRNN): Bayesian inference for recurrent neural networks. Biological Cybernetics, pages 1–17.
  • [Caporale and Dan, 2008] Caporale, N. and Dan, Y. (2008). Spike timing-dependent plasticity: a Hebbian learning rule. Annu. Rev. Neurosci., 31:25–46.
  • [Clark, 2012] Clark, A. (2012). Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behav. Brain Sci.
  • [Dayan et al., 1995] Dayan, P., Hinton, G., Neal, R., and Zemel, R. (1995). The Helmholtz machine. Neural Computation, 7(5):889–904.
  • [Deneve, 2008] Deneve, S. (2008). Bayesian spiking neurons I: Inference. Neural Computation, 20(1):91–117.
  • [Doya, 1992] Doya, K. (1992). Bifurcations in the learning of recurrent neural networks. In Circuits and Systems, 1992. ISCAS’92. Proceedings., 1992 IEEE International Symposium on, volume 6, pages 2777–2780. IEEE.
  • [Friston, 2010] Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2):127–138.
  • [Friston et al., 2003] Friston, K., Harrison, L., and Penny, W. (2003). Dynamic causal modelling. Neuroimage, 19(4):1273–1302.
  • [Galán, 2008] Galán, R. (2008). On how network architecture determines the dominant patterns of spontaneous neural activity. PLoS One, 3(5):e2148.
  • [Galtier, 2012] Galtier, M. (2012). A mathematical approach to unsupervised learning in recurrent neural networks. PhD thesis, Mines Paristech / INRIA.
  • [Galtier and Wainrib, 2012] Galtier, M. and Wainrib, G. (2012). Multiscale analysis of slow-fast neuronal learning models with noise. The Journal of Mathematical Neuroscience, 2(1):13.
  • [George and Hawkins, 2009] George, D. and Hawkins, J. (2009). Towards a mathematical theory of cortical micro-circuits. PLoS computational biology, 5(10):e1000532.
  • [Gerstner and Kistler, 2002] Gerstner, W. and Kistler, W. (2002). Spiking neuron models: Single neurons, populations, plasticity. Cambridge Univ Pr.
  • [Gerstner et al., 1993] Gerstner, W., Ritz, R., and Van Hemmen, J. (1993). Why spikes? Hebbian learning and retrieval of time-resolved excitation patterns. Biological Cybernetics, 69(5):503–515.
  • [Gilbert and Wilson, 2007] Gilbert, D. and Wilson, T. (2007). Prospection: experiencing the future. Science, 317(5843):1351–1354.
  • [Hennequin et al., 2010] Hennequin, G., Gerstner, W., and Pfister, J. (2010). STDP in adaptive neurons gives close-to-optimal information transmission. Frontiers in Computational Neuroscience, 4.
  • [Hinton, 2009] Hinton, G. E. (2009). Deep belief networks. Scholarpedia, 4(4):5947.
  • [Izhikevich and Desai, 2003] Izhikevich, E. and Desai, N. (2003). Relating STDP to BCM. Neural Computation, 15(7):1511–1523.
  • [Jaeger and Haas, 2004] Jaeger, H. and Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80.
  • [Kempter et al., 1999] Kempter, R., Gerstner, W., and Van Hemmen, J. (1999). Hebbian learning and spiking neurons. Physical Review E, 59(4):4498.
  • [Kenet et al., 2003] Kenet, T., Bibitchkov, D., Tsodyks, M., Grinvald, A., and Arieli, A. (2003). Spontaneously emerging cortical representations of visual attributes. Nature, 425(6961):954–956.
  • [Khalil and Grizzle, 1992] Khalil, H. and Grizzle, J. (1992). Nonlinear systems. Macmillan Publishing Company New York.
  • [Kiebel et al., 2008] Kiebel, S., Daunizeau, J., and Friston, K. (2008). A hierarchy of time-scales and the brain. PLoS computational biology, 4(11):e1000209.
  • [Kim et al., 2012] Kim, J., Tsien, R., and Alger, B. (2012). An improved test for detecting multiplicative homeostatic synaptic scaling. PloS one, 7(5):e37364.
  • [Lazar et al., 2007] Lazar, A., Pipa, G., and Triesch, J. (2007). Fading memory and time series prediction in recurrent networks with different forms of plasticity. Neural Networks, 20(3):312–322.
  • [Lazar et al., 2009] Lazar, A., Pipa, G., and Triesch, J. (2009). SORN: a self-organizing recurrent neural network. Frontiers in Computational Neuroscience, 3.
  • [Lukosevicius and Jaeger, 2009] Lukosevicius, M. and Jaeger, H. (2009). Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3):127–149.
  • [Mandic and Chambers, 2001] Mandic, D. and Chambers, J. (2001). Recurrent neural networks for prediction: Learning algorithms, architectures and stability. John Wiley & Sons, Inc.
  • [Markram et al., 1997] Markram, H., Lübke, J., Frotscher, M., and Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275(5297):213–215.
  • [Miller, 1996] Miller, K. (1996). Synaptic economics: Competition and cooperation in correlation-based synaptic plasticity. Neuron, 17:371–374.
  • [Miller and MacKay, 1994] Miller, K. and MacKay, D. (1994). The role of constraints in hebbian learning. Neural Computation, 6(1):100–126.
  • [Oja, 1982] Oja, E. (1982). Simplified neuron model as a principal component analyzer. Journal of mathematical biology, 15(3):267–273.
  • [Pearlmutter, 1995] Pearlmutter, B. (1995). Gradient calculations for dynamic recurrent neural networks: A survey. Neural Networks, IEEE Transactions on, 6(5):1212–1228.
  • [Penland and Sardeshmukh, 1995] Penland, C. and Sardeshmukh, P. (1995). The optimal growth of tropical sea surface temperature anomalies. Journal of climate, 8(8):1999–2024.
  • [Pfister and Gerstner, 2006] Pfister, J. and Gerstner, W. (2006). Triplets of spikes in a model of spike timing-dependent plasticity. The Journal of neuroscience, 26(38):9673–9682.
  • [Potthast and beim Graben, 2009] Potthast, R. and beim Graben, P. (2009). Inverse problems in neural field theory. SIAM Journal on Applied Dynamical Systems, 8:1405.
  • [Rao et al., 1999] Rao, R., Ballard, D., et al. (1999). Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature neuroscience, 2:79–87.
  • [Rao and Sejnowski, 2001] Rao, R. and Sejnowski, T. (2001). Spike-timing-dependent hebbian plasticity as temporal difference learning. Neural computation, 13(10):2221–2237.
  • [Rosenblatt, 1958] Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386.
  • [Sanders and Verhulst, 1985] Sanders, J. and Verhulst, F. (1985). Averaging methods in nonlinear dynamical systems, volume 59. Springer.
  • [Schacter et al., 2008] Schacter, D., Addis, D., and Buckner, R. (2008). Episodic simulation of future events. Annals of the New York Academy of Sciences, 1124(1):39–60.
  • [Sejnowski, 1977] Sejnowski, T. (1977). Statistical constraints on synaptic plasticity. Journal of theoretical biology, 69(2):385.
  • [Sjöström and Gerstner, 2010] Sjöström, J. and Gerstner, W. (2010). Spike-timing dependent plasticity. Scholarpedia, 5(2):1362.
  • [Song et al., 2000] Song, S., Miller, K., and Abbott, L. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neuroscience, 3:919–926.
  • [Sprekeler et al., 2007] Sprekeler, H., Michaelis, C., and Wiskott, L. (2007). Slowness: An objective for spike-timing–dependent plasticity? PLoS Computational Biology, 3(6):e112.
  • [Sussillo and Abbott, 2009] Sussillo, D. and Abbott, L. (2009). Generating coherent patterns of activity from chaotic neural networks. Neuron, 63(4):544–557.
  • [Turrigiano, 1999] Turrigiano, G. (1999). Homeostatic plasticity in neuronal networks: the more things change, the more they stay the same. Trends in neurosciences, 22(5):221–227.
  • [Turrigiano et al., 1998] Turrigiano, G., Leslie, K., Desai, N., Rutherford, L., and Nelson, S. (1998). Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature, 391:893.
  • [Turrigiano and Nelson, 2004] Turrigiano, G. and Nelson, S. (2004). Homeostatic plasticity in the developing nervous system. Nature Reviews Neuroscience, 5(2):97–107.
  • [Van Rossum et al., 2000] Van Rossum, M., Bi, G., and Turrigiano, G. (2000). Stable Hebbian learning from spike timing-dependent plasticity. The Journal of Neuroscience, 20(23):8812–8821.
  • [Williams and Zipser, 1989] Williams, R. and Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280.
  • [Williams and Zipser, 1995] Williams, R. and Zipser, D. (1995). Gradient-based learning algorithms for recurrent networks and their computational complexity. Back-propagation: Theory, architectures and applications, pages 433–486.
  • [Yeung et al., 2002] Yeung, M., Tegnér, J., and Collins, J. (2002). Reverse engineering gene networks using singular value decomposition and robust regression. Proceedings of the National Academy of Sciences, 99(9):6163.
  • [Yoshioka et al., 2007] Yoshioka, M., Scarpetta, S., and Marinaro, M. (2007). Spatiotemporal learning in analog neural networks using spike-timing-dependent synaptic plasticity. Physical Review E, 75(5):051917.
  • [Zheng et al., 2013] Zheng, P., Dimitrakakis, C., and Triesch, J. (2013). Network self-organization explains the statistics and dynamics of synaptic connection strengths in cortex. PLoS Computational Biology, 9(1).