Introduction
The connectome is dynamic: Networks of neurons in the brain rewire themselves on a time scale of hours to days [Holtmaat et al., 2005, Stettler et al., 2006, Yang et al., 2009, Holtmaat and Svoboda, 2009, Ziv and Ahissar, 2009, Minerbi et al., 2009, Kasai et al., 2010, Loewenstein et al., 2011, Loewenstein et al., 2015, Rumpel and Triesch, 2016, Chambers and Rumpel, 2017, van Ooyen and ButzOstendorf, 2017]. This synaptic rewiring manifests in the emergence and vanishing of dendritic spines [Holtmaat and Svoboda, 2009]. Additional structural changes of established synapses are observable as a growth and shrinking of spine heads which take place even in the absence of neural activity [Yasumatsu et al., 2008]. The recent study of [Dvorkin and Ziv, 2016], which includes in Fig. 8 a reanalysis of mouse brain data from [Kasthuri et al., 2015], showed that this spontaneous component is surprisingly large, at least as large as the impact of pre and postsynaptic neural activity. In addition, Nagaoka and colleagues provide direct evidence in vivo that the baseline turnover of dendritic spines is mediated by activityindependent intrinsic dynamics [Nagaoka et al., 2016]. Furthermore, experimental data also suggest that taskdependent selfconfiguration of neural circuits is mediated by reward signals in [Yagishita et al., 2014].
Other experimental data show that not only the connectome, but also the dynamics and function of neural circuits is subject to continuously ongoing changes. Continuously ongoing drifts of neural codes were reported in [Ziv et al., 2013, Driscoll et al., 2017]
. Further data show that the mapping of inputs to outputs by neural networks that plan and control motor behavior are subject to a random walk on a slow timescale of minutes to days, that is conjectured to be related to stochastic synaptic rewiring and plasticity
[van Beers et al., 2013, Chaisanguanthum et al., 2014, Rokni et al., 2007].We address two questions that are raised by these data:

[label=)]

How can stable network performance be achieved in spite of the experimentally found continuously ongoing rewiring and activityindependent synaptic plasticity in neural circuits?

What could be a functional role of these processes?
Similar as [Statman et al., 2014, Loewenstein et al., 2015, Rokni et al., 2007] we model spontaneous synapseautonomous spine dynamics of each potential synaptic connection through a stochastic process that modulates a corresponding parameter
. We provide in this article a rigorous mathematical framework for such stochastic spine dynamics and rewiring processes. Our analysis shows that one can describe the network configuration, i.e., the current state of the dynamic connectome and the strengths of all currently functional synapses, at any time point by a vector
that encodes the current values for all potential synaptic connections . The stochastic dynamics of this highdimensional vectordefines a Markov chain whose stationary distribution (illustrated in Fig.
1d) provides insight into questions that address the relation between properties of local synaptic processes and the computational function of a neural network.Based on the wellstudied paradigm for rewardbased learning in neural networks, we propose the following answer to question i): As long as most of the mass of this stationary distribution lies in regions or lowdimensional manifolds of the parameter space that produce good performance, stable network performance can be assured in spite of continuously ongoing movement of [Loewenstein et al., 2015]. Our experimental results suggest that when a computational task has been learnt, most of the subsequent dynamics of takes place in taskirrelevant dimensions.
The same model also provides an answer to question ii): Synapseautonomous stochastic dynamics of the parameter vector enables the network not only to find in a highdimensional regions with good network performance, but also to rewire the network in order to compensate for changes in the task. We analyze how the strength of the stochastic component of synaptic dynamics affects this compensation capability. We arrive at the conclusion that compensation works best for the task considered here if the stochastic component is as large as in experimental data [Dvorkin and Ziv, 2016].
On the more abstract level of reinforcement learning, our theoretical framework for rewarddriven network plasticity suggests a new algorithmic paradigm for network learning: policy sampling. Compared with the familiar policy gradient learning [Williams, 1992, Baxter and Bartlett, 2000, Peters and Schaal, 2006] this paradigm is more consistent with experimental data that suggest a continuously ongoing drift of network parameters.
The resulting model for rewardgated network plasticity builds on the approach from [Kappel et al., 2015]
for unsupervised learning, that was only applicable to a specific neuron model and a specific STDPrule. Since the new approach can be applied to arbitrary neuron models, in particular also to large databased models of neural circuits and systems, it can be used to explore how databased models for neural circuits and brain areas can attain and maintain a computational function. (668 words)
Results
We first address the design of a suitable theoretical framework for investigating the selforganization of neural circuits for specific computational tasks in the presence of spontaneous synapseautonomous processes and rewards. There exist wellestablished models for rewardmodulated synaptic plasticity, see e.g. [Frémaux et al., 2010], where reward signals gate common rules for synaptic plasticity, such as STDP. But these rules are lacking two components that we need here:

an integration of rewiring with plasticity rules that govern the modulation of the strengths of already existing synaptic connections

a term that reflects the spontaneous synapseautonomous component of synaptic plasticity and rewiring.
In order to illustrate our approach we consider a neural network scaffold (see Fig. 1a) with a large number of potential synaptic connections between excitatory neurons. Only a subset of these potential connections is assumed to be functional at any point in time.
If one allows rewiring then the concept of a neural network becomes problematic, since the definition of a neural network typically includes its synaptic connections. Hence we refer to the set of neurons of a network, its set of potential synaptic connections, and its set of definite synaptic connections – such as in our case connections from and to inhibitory neurons (see Fig. 1a) – as a network scaffold. A network scaffold together with a parameter vector that specifies a particular selection of functional synaptic connections out of the set of potential connections and particular synaptic weights for these defines a concrete neural network, to which we also refer as network configuration.
For simplicity we assume that only excitatory connections are plastic, but the model can be easily extended to also reflect plasticity of inhibitory synapses. For each potential synaptic connection , we introduce a parameter that describes its state both for the case when this potential connection is currently not functional (this is the case when ) and when it is functional (i.e., ). More precisely, encodes the current strength or weight of this synaptic connection through the formula
(1) 
with a positive offset parameter that regulates the initial strength of new functional synaptic connections (we set in our simulations).
The exponential function in Eq. (1) turns out to be useful for relating the dynamics of to experimental data on the dynamics of synaptic weights. The volume – or image brightness in Caimaging – of a dendritic spine is commonly assumed to be proportional to the strength of a synapse [Holtmaat et al., 2005]
. The logarithm of this estimate for
was shown in Fig. 2i of [Holtmaat et al., 2006] and also in [Yasumatsu et al., 2008, Loewenstein et al., 2011] to exhibit a dynamics similar to that of an OrnsteinUhlenbeck process, i.e., a random walk in conjunction with a force that draws the random walk back to its initial state. Hence if is chosen to be proportional to the logarithm of , it is justified to model the spontaneous dynamics of as an OrnsteinUhlenbeck process. This is done in our model, as we will explain after Eq. (5) and demonstrate in Fig. 2c. The logarithmic transformation also ensures that additive increments of yield multiplicative updates of , which have been observed experimentally [Loewenstein et al., 2011].Altogether our model needs to create a dynamics for that is not only consistent with experimental data on spontaneous spine dynamics, but is for the case also consistent with rules for rewardmodulated synaptic plasticity as in [Frémaux et al., 2010]. This suggests to look for plasticity rules of the form
(2) 
where the deterministic plasticity rule could for example be a standard rewardbased plasticity rule. We will argue below that it makes sense to include also an activityindependent prior in this deterministic component of rule (2), both for functional reasons and in order to fit data on spontaneous spine dynamics. We will further see that when the activityindependent prior dominates, we obtain the OrnsteinUhlenbeck process mentioned above. The stochastic term in Eq. (2) is an infinitesimal step of a random walk, more precisely for a Wiener process . A Wiener process is a standard model for Brownian motion in one dimension [Gardiner, 2004]. The term scales the strength of this stochastic component in terms of a “temperature” and a learning rate , and is chosen to be of a form that supports analogies to statistical physics. The presence of this stochastic term makes it unrealistic to expect that converges to a particular value under the dynamics defined by Eq. (2). In fact, in contrast to many standard differential equations, the stochastic differential equation or SDE (2) does not have a single trajectory of as solution, but an infinite family of trajectories that result from different random walks.
We propose to focus – instead of the common analysis of the convergence of weights to specific values as invariants – on the most prominent invariant that a stochastic process can offer: the longterm stationary distribution of synaptic connections and weights. The stationary distribution of the vector of all synaptic parameters informs us about the statistics of the infinitely many different solutions of a stochastic differential equation of the form (2). In particular, it informs us about the fraction of time at which particular values of will be visited by these solutions (see Methods for details). We show that a large class of rewardbased plasticity rules produce in the context of an equation of the form (2) a stationary distribution of that can be clearly related to reward expectation for the neural network, and hence to its computational function.
We want to address the question whether rewardbased plasticity rules achieve in the context with other terms in Eq. (2) that the resulting stationary distribution of network configurations has most of its mass on highly rewarded network configurations. A key observation is that if the first term on the righthandside of (2) can be written for all potential synaptic connections in the form , where is some arbitrary given distribution and denotes the partial derivative with respect to parameter , then these stochastic processes
(3) 
give rise to a stationary distribution that is proportional to . Hence, a rule for rewardbased synaptic plasticity that can be written in the form , where has most of its mass on highly rewarded network configurations , achieves that the network will spend most of its time in highly rewarded network configurations. This will hold even if the network does not converge to or stay in any particular network configuration (see Fig. 1d,f for an illustration). Furthermore the role of the temperature in (3) becomes clearly visible in this result: if is large the resulting stationary distribution flattens the distribution , whereas for the network will remain for larger fractions of the time in those regions of the parameter space where achieves its largest values. In fact, if the temperature converges to 0, the resulting stationary distribution degenerates to one that has all of its mass on the network configuration for which reaches its global maximum, as in simulated annealing [Kirkpatrick et al., 1983].
We will focus on target distributions of the form
(4) 
where denotes proportionality up to a positive normalizing constant. can encode structural priors of the network scaffold . For example, it can encode a preference for sparsely connected networks. This happens when has most of its mass near , see Fig. 1c for an illustration. But it could also convey genetically encoded or previously learnt information, such as a preference for having strong synaptic connections between two specific populations of neurons. The term in Eq. (4) denotes the expected discounted reward associated with a given parameter vector (see Fig. 1b). Eq. (3) for the stochastic dynamics of parameters takes then the form
(5) 
When the term vanishes, this equation models spontaneous spine dynamics. We will make sure that this term vanishes for all potential synaptic connections that are currently not functional, i.e., where
. If one chooses a Gaussian distribution as prior
, the dynamics of (5) amounts in the case to an OrnsteinUhlenbeck process. There is currently no generally accepted description of spine dynamics. OrnsteinUhlenbeck dynamics has previously been proposed as a simple model for experimentally observed spontaneous spine dynamics [Yasumatsu et al., 2008, Loewenstein et al., 2011, Loewenstein et al., 2015]. Another proposed model uses a combination of multiplicative and additive stochastic dynamics [Statman et al., 2014, Rubinski and Ziv, 2015]. We used in our simulations a Gaussian distribution that prefers small but nonzero weights for the prior . Hence, our model (5) is consistent with previous OrnsteinUhlenbeck models for spontaneous spine dynamics.Thus altogether we arrive at a model for the interaction of stochastic spine dynamics with reward where the usually considered deterministic convergence to network configurations that represent local maxima of expected reward (e.g. to the local maxima in Fig. 1b) is replaced by a stochastic model. If the stochastic dynamics of is defined by local stochastic processes of the form (5), as indicated in Fig. 1e, the resulting stochastic model for network plasticity will spend most of its time in network configurations where the posterior , illustrated in Fig. 1d, approximately reaches its maximal value. This provides on the statistical level a guarantee of task performance, in spite of ongoing stochastic dynamics of all the parameters .
Rewardbased rewiring and synaptic plasticity as policy sampling
We assume that all synapses and neurons in the network scaffold receives reward signals at certain times , corresponding for example to dopamine signals in the brain (see [Collins and Frank, 2016] for a recent discussion of related experimental data). The expected discounted reward that occurs in the second term of Eq. (5) is the expectation of the time integral over all future rewards , while discounting more remote rewards exponentially, see Eq. (6) in Methods. Fig. 1b shows a hypothetical landscape over two parameters . The posterior shown in Fig. 1d is then proportional to the product of (panel b) and the prior (panel c).
The computational behavior of the network configuration, i.e., the mapping of network inputs to network outputs that is encoded by the parameter vector , is referred to as a policy in the context of reinforcement learning theory. The parameters (and therefore the policy) are gradually changed through Eq. (5) such that the expected discounted reward is increased: The parameter dynamics follows the gradient of , i.e., , where is a small learning rate. When the parameter dynamics is given solely by the second term in the parenthesis of Eq. (5), , we recover for the case deterministic policy gradient learning [Williams, 1992, Baxter and Bartlett, 2000, Peters and Schaal, 2006].
For a network scaffold of spiking neurons, the derivative gives rise to synaptic updates at a synapse that are essentially given by the product of the current reward signal and an eligibility trace that depends on pre or postsynaptic firing times, see Synaptic dynamics for the rewardbased synaptic sampling model in Methods. Such plasticity rules have previously been proposed by [Seung, 2003, Xie and Seung, 2004, Izhikevich, 2007, Pfister et al., 2006, Florian, 2007, Legenstein et al., 2008, Urbanczik and Senn, 2009]. For nonspiking neural networks, a similar update rule was first introduced by Williams and termed the REINFORCE rule [Williams, 1992].
In contrast to policy gradient, reinforcement learning in the presence of the stochastic last term in Eq. (5) cannot converge to any network configuration. Instead, the dynamics of Eq. (5) produces continuously changing network configurations, with a preference for configurations that both satisfy constraints from the prior and provide a large expected reward , see Fig. 1d,f. Hence this type of reinforcement learning samples continuously from a posterior distribution of network configurations. This is rigorously proven in Theorem 1 of Methods. We refer to this reinforcement learning model as policy sampling, and to the family of rewardbased plasticity rules that are defined by Eq. (5) as rewardbased synaptic sampling.
Another key difference to previous models for rewardgated synaptic plasticity and policy gradient learning is, apart from the stochastic last term of Eq. (5), that the deterministic first term of Eq. (5) also contains a rewardindependent component that arises from a prior for network configurations. In our simulations we consider a simple Gaussian prior with mean that encodes a preference for sparse connectivity (see Eq. (17)).
It is important that the dynamics of disconnected synapses, i.e., of synapses with or equivalently , does not depend on pre/postsynaptic neural activity or reward since nonfunctional synapses do not have access to such information. This is automatically achieved through our ansatz for the rewarddependent component in Eq. (5), since a simple derivation shows that it entails that the factor appears in front of the term that depends on pre and postsynaptic activity, see Eq. (15). Instead, the dynamics of depends for only on the prior and the stochastic term . This results in a distribution over waiting times between downwards and upwards crossing of the threshold that was found to be similar to the distribution of interevent times of a Poisson point process, see [Ding and Rangarajan, 2004] for a detailed analysis. This theoretical result suggest a simple approximation of the dynamics of Eq. (5) for currently nonfunctional synaptic connections, where the process (5) is suspended whenever becomes negative, and continued with
after a waiting time that is drawn from an exponential distribution. As in
[Deger et al., 2016]this can be realized by letting a nonfunctional synapse become functional at any discrete time step with some fixed probability (Poisson process). We have compared in Fig.
3c the resulting learning dynamics of the network for this simple approximation with that of the process defined by Eq. (5).Taskdependent routing of information through the interaction of stochastic spine dynamics with rewards
Experimental evidence about gating of spine dynamics by reward signals in the form of dopamine is available for the synaptic connections from the cortex to the entrance stage of the basal ganglia, the medium spiny neurons (MSNs) in the striatum [Yagishita et al., 2014]. They report that the volumes of their dendritic spines show significant changes only when pre and postsynaptic activity is paired with precisely timed delivery of dopamine (see [Yagishita et al., 2014], Fig. 1 EG, O). More precisely, an STDP pairing protocol followed by dopamine uncaging induced strong LTP in synapses onto MSNs, whereas the same protocol without dopamine uncaging lead only to a minor increase of synaptic efficacies.
MSNs can be viewed as readouts from a large number of cortical areas, that become specialized for particular motor functions, e.g. movements of the hand or leg. We asked whether reward gating of spine dynamics according to the experimental data of [Yagishita et al., 2014] can explain such task dependent specialization of MSNs. More concretely, we asked whether it can achieve that two different distributed activity patterns , of upstream neurons in the cortex get routed to two different ensembles of MSNs, and thereby to two different downstream targets and of these MSNs (see Fig. 2a,h,i). We assumed that for each upstream activity pattern a particular subset of upstream neurons is most active, . Hence this routing task amounted to routing synaptic input from to those MSNs that project to downstream neuron .
We applied to all potential synaptic connections from upstream neurons to MSNs a learning rule according to Eq. (5), more precisely, the rule for rewardgated STDP (Eq. (15), Eq. (16) and Eq. (18)) that results from this general framework. The parameters of the model were adapted to qualitatively reproduce the results from Figures 1F,G of [Yagishita et al., 2014] when the same STDP protocol was applied to our model (see Fig. 2d,e). The parameter values are reported in Tab. 1 in Methods. If not stated otherwise, we applied these parameters in all following experiments. In Role of the prior distribution in Methods, we further analyze the impact of different prior distributions on task performance and network connectivity.
Our simple model consisted of 20 inhibitory model MSNs with lateral recurrent connections. These received excitatory input from 200 input neurons. The synapses from input neurons to model MSNs were subject to our plasticity rule. Multiple connections were allowed between each pair of input neuron and MSN (see Methods). The MSNs were randomly divided into two assemblies, each projecting exclusively to one of two downstream target areas and . Cortical input was modeled as Poisson spike trains from the 200 input neurons with instantaneous rates defined by two prototype rate patterns and , see Fig. 2h. The task was to learn to activate projecting neurons and to silence projecting neurons whenever pattern was presented as cortical input. For pattern , the activation should be reversed: activate projecting neurons and silence those projecting to . This desired function was defined through a reward signal that was proportional to the ratio between the mean firing rate of MSNs projecting to the desired target and that of MSNs projecting to the nondesired target area (see Methods).
Fig. 2h shows the firing activity and reward signal of the network during segments of one simulation run. After about 80 minutes of simulated biological time, each group of MSNs had learned to increase its firing rate when the activity pattern associated with its projection target was presented. Fig. 2f shows the average reward throughout learning. After 3 hours of learning about of the maximum reward was acquired on average, and this level was maintained during prolonged learning.
Fig. 2g shows that the parameter vector kept moving at almost the same speed even after a high plateau of rewards had been reached. Hence these ongoing parameter changes took place in dimensions that were irrelevant for the rewardlevel.
Fig. 2i provides snapshots of the underlying “dynamic connectome” [Rumpel and Triesch, 2016] at different points of time. New synaptic connections that were not present at the preceding snapshot are colored green. One sees that the bulk of the connections maintained a solution of the task to route inputs from to target area and inputs from to target area . But the identity of these connections, a taskirrelevant dimension, kept changing. In addition the network always maintained some connections to the currently undesired target area, thereby providing the basis for a swift builtup of these connections if these connections would suddenly also become rewarded.
We further examine the exploration along taskirrelevant dimensions in Fig. 2
j. Here, the highdimensional parameter vector over a training experiment of 24 h projected to the first two components of the demixed principal component analysis (dPCA) that best explain the variance of the average reward is shown (see Methods and
[Kobak et al., 2016]). The first component (dpc1) explains of the variance. Movement of the parameter vector mainly takes place along this dimensions during the first 4 hours of learning. After the performance has converged to a high value, exploration continues along other components (dpc2, and higher components) that explain less than of the average reward variance.This simulation experiment showed that rewardgated spine dynamics as analyzed in [Yagishita et al., 2014] is sufficiently powerful from the functional perspective to rewire networks so that each signal is delivered to its intended target.
A model for taskdependent selfconfiguration of a recurrent network of excitatory and inhibitory spiking neurons
We next asked, whether our simple integrated model for rewardmodulated rewiring and synaptic plasticity of neural circuits according to Eq. (5) could also explain the emergence of specific computations in recurrent networks of spiking neurons. As paradigm for a specific computational task we took a simplified version of the task that mice learned to carry out in the experimental setup of [Peters et al., 2014]. There a reward was given whenever a lever was pressed within a given time window indicated by an auditory cue. This task is particular suitable for our context, since spine turnover and changes of network activity were continuously monitored in [Peters et al., 2014] while the animals learned this task.
We adapted the learning task of [Peters et al., 2014] in the following way for our model (see Fig. 3a). The beginning of a trial was indicated through the presentation of a cue input pattern : a fixed, randomly generated rate pattern for all 200 input neurons that lasted until the task was completed, but at most 10s. As network scaffold we took a generic recurrent network of excitatory and inhibitory spiking neurons with connectivity parameters for connections between excitatory and inhibitory neurons according to data from layer 2/3 in mouse cortex [Avermann et al., 2012]. The network consisted of 60 excitatory and 20 inhibitory neurons (see Fig. 3a). Half of the excitatory neurons could potentially receive synaptic connections from the 200 excitatory input neurons. From the remaining 30 neurons we randomly selected one pool D of 10 excitatory neurons to cause downwards movements of the lever, and another pool U of 10 neurons for upwards movements. We refer to the 40 excitatory neurons that were not members of D or U as hidden neurons. All excitatory synaptic connections from the external input (cue) and between the 60 excitatory neurons (including those in the pools D and U) in the network were subjected to rewardbased synaptic sampling.
To decode the lever position we filtered the population spikes of D and U with a smoothing kernel. The filtered population spikes of D were then subtracted from those of U to determine the lever position (seem Methods for details). When the lever position crossed the threshold +5 after first crossing a lower threshold 5 (black horizontal lines in Fig. 3a,b) within 10 s after cue onset a 400 ms reward window was initiated during which was set to (red vertical bars in Fig. 3b). Unsuccessful trials were aborted after 10 seconds and no reward was delivered. After each trial a brief holding phase of random length was inserted, during which input neurons were set to a background input rate of 2 Hz.
Thus, the network had to learn without any guidance, except for the reward in response to good performance, to create after the onset of the cue first higher firing in pool D, and then higher firing in pool U. This task was challenging, since the network had no information which neurons belonged to pools D and U. Moreover, the synapses did not “know” whether they connected to hidden neurons, neurons within a pool, hidden neurons and poolneurons, or input neurons with other neurons. The plasticity of all these different synapses was gated by the same global reward signal. Since the pools D and U were not able to receive direct synaptic connections from the input neurons, the network also had to learn to communicate the presence of the cue pattern via disynaptic connections from the input neurons to these pools.
Network responses before and after learning are shown in Fig. 3b. Initially, the rewarded goal was only reached occasionally, while the turnover of synaptic connections (number of synaptic connections that became functional or became nonfunctional in a time window of 2 hours) remained very high (see Fig. 3e). After about 3 h, performance improved drastically (Fig. 3c), and simultaneously the turnover of synaptic connections slowed down (Fig. 3e). After learning for 8 hours, the network was able to solve the task in most of the trials, and the average trial duration (movement completion time) had decreased to less than 1 second (, Fig. 3c). Improved performance was accompanied by more stereotyped network activity and lever movement patterns as in the experimental data of [Peters et al., 2014]: compare our Fig. 3d with Fig. 1b and Fig. 2j of [Peters et al., 2014]. In Fig. 3d we show the trialaveraged activity of the 60 excitatory neurons before and after learning for 22 hours. The neurons are sorted in the first two plots of Fig. 3d by the time of maximum activity after movement onset times before learning, and in the 3rd plot resorted according to times of maximum activity after 22 hours of learning (see Methods). These plots show that rewardbased learning led to a restructuring of the network activity: an assembly of neurons emerged that controlled a sharp upwards movement. Also, less background activity was observed after 22 hours of learning, in particular for neurons with early activity peaks. Lower panels in Fig. 3d show the average lever movement and 10 individual movement traces at the beginning and after 22 hours of learning. Similar as in [Peters et al., 2014] the lever movements became more stereotyped during learning, featuring a sharp upwards movement at cue onset followed by a slower downwards movement in preparation for the next trial.
The synaptic parameter drifts due to SDE (5) inherently lead to forgetting. In Fig. 3f we tested this effect by running a very long experiment over 14 simulated days. After 24 h, when the network had learned to reliably solve the task, we stopped the application of the reward but continued the synaptic dynamics. We found that the task could be reliably recalled for more than 5 days. Onset of forgetting was observed after day 6. We wondered whether a simple consolidation mechanism could prevent forgetting in our model. To test this we used the prior distribution to stabilize the synaptic parameters. After 4 simulated days we set the mean of the prior to the current value of the synaptic parameters and reduced the variance, while continuing the synaptic dynamics with the same temperature. A similar mechanism for synaptic consolidation has been recently suggested in [Kirkpatrick et al., 2017]. This mechanism reliably prevents forgetting in our model throughout the simulation time of 14 days. We conclude that the effect of forgetting is quite mild in our model and can be further suppressed by a consolidation mechanism that stabilizes synapses on longer timescales.
Next we tested whether similar results could be achieved with a simplified version of the stochastic synapse dynamics while a potential synaptic connection is nonfunctional, i.e., . Eq. (5) defines for such nonfunctional synapses an OrnsteinUhlenbeck process, which yields a heavytailed distribution for the waiting time until reappearance (Fig. 3g, left). We tested whether similar learning performance can be achieved if one approximates the distribution by an exponential distribution, for which we chose a mean of 12 h. The small distance between the blue and green curve in Fig. 3c shows that this is in fact the case for the overall computational task that includes a task switch at 24 h that we describe below. Compensation for the task switch was slightly slower when the approximating exponential distribution was used, but the task performance converged to the same result as for the exact rule. This holds in spite of the fact that the approximating exponential distribution is less heavytailed (Fig. 3g, right). Altogether these results show that rewiring and synaptic plasticity according to Eq. (5) yields selforganization of a generic recurrent network of spiking neurons so that it can control an arbitrarily chosen motor control task.
Compensation for network perturbations
We wondered whether this model for the task of [Peters et al., 2014] would in addition be able to compensate for a drastic change in the task, an extra challenge that had not been considered in the experiments of [Peters et al., 2014]. To test this we suddenly interchanged after 24 h the actions that were triggered by the pools D and U. D now caused upwards and U downwards lever movement.
We found that our model compensated immediately (see the faster movement in the parameter space depicted in Fig. 3h) for this perturbation and reached after about 8 h a similar performance level as before (Fig. 3c). This compensation phase was accompanied by a substantial increase in the turnover of synaptic connections similar as in experiments for learning of a new task, see e.g. [Xu et al., 2009] (Fig. 3e). The turnover rate also remained slightly elevated during the subsequent learning period. Furthermore, a new assembly of neurons emerged that now triggered a sharp onset of activity in the pool D (compare the activity neural traces at h = 22 and h = 46 in Fig. 3d). Another experimentally observed phenomenon that occured in our model were drifts of neural codes, which happened also during phases of the experiment without perturbations. Despite these drifts, the task performance stayed constant, similar to experimental data in [Driscoll et al., 2017] (see Fig. 6 in Methods).
If rewiring was disabled after the task change at 24 h the compensation was significantly delayed and overall performance declined (see black curve in Fig. 3c). Here, we disallowed any turnover of potential synaptic connections such that the connectivity remained the same after 24 h. This result suggests that rewiring is necessary for adapting to the task change. In Fig. 3h we further analyzed the profile of synaptic turnover for the different populations of the network scaffold in Fig. 3a. The synaptic parameters were measured immediately before the task change at 24 h and compared to the connectivity after compensation at 48 h for the experiment shown in Fig. 3c (blue). Most synapses (6675%) were nonfunctional before and after the task change (stable nonfunctional). About 20% of the synapses changed their behavior and either became functional or nonfunctional. Most prominently a large fraction (21.9%) of the synapses from hidden neurons to U became nonfunctional while only few (5.9%) new connections were introduced. The connections from hidden to D showed the opposite behavior. This modification of the network connectome reflects the requirement to reliably route information about the presence of the cue pattern encoded in the activity of hidden neurons to the pool D (and not to U) to initiate the lever movement after the task change.
We then asked whether rewiring is also necessary for the initial learning of the task. To answer this question, we performed a simulation where the network connectivity was fixed from the beginning. We found that inital task performance was not significantly worse compared to the setup with rewiring. This indicates that at least for this task, rewiring is necessary for compensating task switches, but not for initial task learning. We expect however that this is not the case for more complex tasks, as indicated by a recent study that used artificial neural networks [Bellec et al., 2017].
A structural difference between stochastic learning models such as policy sampling and learning models that focus on convergence of parameters to a (locally) optimal setting becomes apparent when one tracks the temporal evolution of the network parameters over larger periods of time during the previously discussed learning process (Fig. 3i). Although performance no longer improved after 5 hours, both network connectivity and parameters kept changing in taskirrelevant dimensions. For Fig. 3i we randomly selected 5 of the roughly 47000 parameters and plotted the first 3 principal components of their dynamics. The task change after 24 hours caused the parameter vector to migrate to a new region within about 8 hours of continuing learning (see Fig. 7 where the projected parameter dynamics is further analyzed). Again we observe that policy sampling keeps exploring different equally good solutions after the learning process has reached stable performance.
To further investigate the impact of the temperature parameter on the exploration speed in the parameter space we measured the amplitudes of parameter changes for different values of . We recorded the synaptic parameters every 20 minutes and estimated the exploration speed by measuring the average Euclidean distance between successive snapshots of the parameter vectors. We found that a temperature of resulted in an increase in exploration speed by around 150% compared to the case of . A temperature of resulted in an increase in exploration speed by around 400%.
Relative contributions of spontaneous and activitydependent synaptic processes
[Dvorkin and Ziv, 2016] analyzed the correlation of sizes of postsynaptic densities and spine volumes for synapses that shared the same pre and postsynaptic neuron, called commonly innervated (CI) synapses, and also for synapses that shared in addition the same dendrite (). Activitydependent rules for synaptic plasticity, such as Hebbian or STDP rules – on which previous models for network plasticity relied – suggest that the strength of CI and especially synapses should be highly correlated. But both data from exvivo [Kasthuri et al., 2015] and neural circuits in culture [Dvorkin and Ziv, 2016] show that postsynaptic density sizes and spine volumes of synapses are only weakly correlated, with correlation coefficients between 0.23 and 0.34. Thus even with a conservative estimate that corrects for possible influences of their experimental procedure, more than 50% of the observed synaptic strength appears to result from activityindependent stochastic processes (Fig. 8E of [Dvorkin and Ziv, 2016]). [Bartol Jr et al., 2015] had previously found larger correlations of synaptic strengths of synapses for a smaller data set (based on 17 pairs instead of the 72 pairs, 10 triplets, and 2 quadruplets in the exvivo data from [Kasthuri et al., 2015]). But the spine volumes differed in these pairs also on average by a factor of around 2.
We asked how such a strong contribution of activityindependent synaptic dynamics affects network learning capabilities, such as the ones that were examined in Fig. 3
. We were able to carry out this test because many synaptic connections between neurons that were formed in our model consisted of more than one synapse. We classified pairs of synapses that had the same pre and postsynaptic neuron as CI synapses (one could also call them
synapses, since the neuron model did not have different dendrites), and pairs with the same postsynaptic but different presynaptic neurons as nonCI synapses. Example traces of synaptic weights for CI and nonCI synapse pairs of our network model from Fig. 3 are shown in Fig. 4a,b. CI pairs were found to be more strongly correlated than nonCI pairs (Fig. 4c). However also the correlation of CI pairs was quite low, and varied with the temperature parameter in Eq. (5), see Fig. 4d. The correlation was measured in terms of the Pearson correlation (covariance of synapse pairs normalized between 1 and 1).Since the correlation of CI pairs in our model depends on the temperate , we analyzed the model of Fig. 3 for different temperatures (the temperature had been fixed at T=0.1 throughout the experiments for Fig. 3). In Fig. 4d the Pearson correlation coefficient for CI synapses is plotted together with the average performance achieved on the task of Fig. 3dh ( hours after the task switch) for networks with different temperatures . The best performing temperature region for the task () roughly coincided with the region of experimentally measured values of Pearson’s correlation for CIsynapses. Fig. 4e shows the correlation of 100 CI synapse pairs that emerged from a run with . We found a value of in this case. This value is in the order of the lowest experimentally found correlation coefficients in [Dvorkin and Ziv, 2016] (both in culture and exvivo, see Fig. 8AD in [Dvorkin and Ziv, 2016]). The speed of compensation and the overall replenishing of synapses was strongly dependent on the temperature (see Fig. 4g). For , a complete compensation for the task changes was prevented (performance converged to during a longer run of 96 hours). In other words, the temperature region – that is consistent with experimentally measured Pearson’s correlation for CIsynapses – leads to fastest task relearning, allowing for a compensation within approximately hours of exposure. For we found the best compensation capabilities and the closest match to experimentally measured correlations when the results of [Dvorkin and Ziv, 2016] were corrected for measurement limitations: A correlation coefficient of for CI synapses and for nonCI synapse pairs (mean s.e.m. over 5 trials, CI synapses were significantly stronger correlated than nonCI, pvalue below 0.005 in all trials. Statistical significance values based on twotailed MannWhitney U test.).
[Dvorkin and Ziv, 2016] further analyzed the ratio of contributions from different processes to the measured synaptic dynamics. They analyzed the contribution of neural activity history dependent processes, which amount for 36% of synapse dynamics in their data, and that of neuronwide processes that were not specific to presynaptic activity, but specific to the activity of the postsynaptic neuron (8%). Spontaneous synapseautonomous processes were found to explain 56% of the observed dynamics (see Fig. 8E in [Dvorkin and Ziv, 2016]). The results from our model, that are plotted in Fig. 4f, match these experimentally found values quite well. Altogether we found that the results of [Dvorkin and Ziv, 2016] are best explained by our model for a temperature parameter between (corresponding to their lowest measured correlation coefficient) and (corresponding to their most conservative estimate). This range of parameters coincided with wellfunctioning learning behavior in our model, which included a test of compensation capability for a change of the task after 24 h (Fig. 4d). Hence our model suggests that a large component of stochastic synapseautonomous processes, as it occurs in the data, supports efficient network learning and compensation for changes in the task.
Discussion
Recent experimental data ([Dvorkin and Ziv, 2016], where in Fig. 8 also mouse brain data from [Kasthuri et al., 2015] were reanalyzed, and [Nagaoka et al., 2016]) suggest that common models for learning in neural networks of the brain need to be revised, since synapses are subject to powerful processes that do not depend on pre and postsynaptic neural activity. In addition, experimentally found network rewiring has so far not been integrated into models for rewardgated network plasticity. We have presented a theoretical framework that enables us to investigate and understand rewardbased network rewiring and synaptic plasticity in the context of the experimentally found high level of activityindependent fluctuations of synaptic connectivity and synaptic strength. We have shown that the analysis of the stationary distribution of network configurations, in particular the FokkerPlanck equation from theoretical physics, allows us to understand how large numbers of local stochastic processes at different synapses can orchestrate global goaldirected network learning. This approach provides a new normative model for rewardgated network plasticity.
We have shown in Fig. 2 that the resulting model is consistent with experimental data on dopaminedependent spine dynamics reported in [Yagishita et al., 2014], and that it provides an understanding how these local stochastic processes can produce functionoriented corticalstriatal connectivity. We have shown in Fig. 3 that this model also elucidates rewardbased selforganization of generic recurrent neural networks for a given computational task. We chose as benchmark task the production of a specific motor output in response to a cue, like in the experiments of [Peters et al., 2014]. Similarly as reported in [Peters et al., 2014], the network connectivity and dynamics reorganized itself in our model, just driven by stochastic processes and rewards for successful task completion, and reached a high level of performance. Furthermore it maintained this computational function in spite of continuously ongoing further rewiring and network plasticity. A quantitative analysis of the impact of stochasticity on this process has shown in Fig. 4 that the network learns best when the component of synaptic plasticity that does not depend on neural activity is fairly large, as large as reported in the experimental data of [Kasthuri et al., 2015, Dvorkin and Ziv, 2016].
Our approach is based on experimental data for the biological implementation level of network plasticity, i.e., for the lowest level of the Marr hierarchy of models [Marr and Poggio, 1976]. However, we have shown that these experimental data have significant implications for understanding network plasticity on the top level (”what is the functional goal?”) and the intermediate algorithmic level (”what is the underlying algorithm?”) of the Marr hierarchy. They suggest for the top level that the goal of network plasticity is to sample from a posterior distribution of network configurations. This posterior integrates functional demands formalized by the expected discounted reward with a prior in a multiplicative manner . Priors can represent structural constraints as well as results of preceding learning experiences and innate programs. Since our model samples from a distribution proportional to , for
, our model suggests to view rewardgated network plasticity as Bayesian inference over network configurations on a slow time scale, see
Probabilistic framework for rewardmodulated learning in Methods for details. For a temperature parameter , the model samples from a tempered version of the posterior, which generalizes the basic Bayesian approach. This Bayesian perspective also creates a link to previous work on Bayesian reinforcement learning [Vlassis et al., 2012, Rawlik et al., 2013]. We note however that we do not consider parameter adaptation in our framework to implement full Bayesian learning, as there is no integration over the posterior paramter settings to obtain network outputs (or actions in a reinforcement learning context). Even if one would do that, it would be of little practical use, since the sampling would be much too slow in any but the simplest networks. The experimental data suggest for the intermediate algorithmic level of the Marr hierarchy a strong reliance on stochastic search (”synaptic sampling”). The essence of the resulting model for rewardgated network learning is illustrated in Fig. 1: The traditional view of deterministic gradient ascent (policy gradient) in the landscape (panel b) of reward expectation is first modified through the integration of a prior (panel c), and then through the replacement of gradient ascent by continuously ongoing stochastic sampling (policy sampling) from the posterior distribution of panel d, which is illustrated in panels e and f.This model explains a number of experimental data that had not been addressed by previous models. Continuously ongoing stochastic sampling of network configurations suggests that synaptic connectivity does not converge to a fixed point solution but rather undergoes permanent modifications (Fig. 3h,i). This behavior is compatible with reports of continuously ongoing spine dynamics and axonal sprouting even in the adult brain [Holtmaat and Svoboda, 2009, Yasumatsu et al., 2008, Stettler et al., 2006, Yamahachi et al., 2009, Loewenstein et al., 2011, Holtmaat et al., 2005, Loewenstein et al., 2015]. Recently proposed models to maintain stable network function in the presence of highly volatile spine dynamics suggest that subsets of connections are selectively stabilized to support network function [Berry and Nedivi, 2017, Mongillo et al., 2017]. Our result shows that high task performance can be reached in spiking neural networks in the presence of high volatility of all synapses. Still our model can be extended with a process that selectively stabilizes synapses on longer timescales as demonstrated in Fig. 3f. In addition, our model predicts that not only synaptic spine dynamics but also changes of synaptic efficacies show a large stochastic component on all timescales.
The continuously ongoing parameter changes induce continuously ongoing changes in the assembly sequences that accompany and control a motor response (see Fig. 3d). These changes do not impair the performance of the network, but rather enable the network to explore different but equally good solutions when exposed for many hours to the same task (see Fig. 3i). Such continuously ongoing drifts of neural codes in functionally less relevant dimensions have already been observed experimentally in some brain areas [Ziv et al., 2013, Driscoll et al., 2017]. Our model also suggests that the same computational function is realized by the same neural circuit in different individuals with drastically different parameters, a feature which has already been addressed in [Tang et al., 2010, Grashow et al., 2010, Marder, 2011, Prinz et al., 2004]. In fact, this degeneracy of neural circuits is thought to be an important property of biological neural networks [Marder, 2011, Prinz et al., 2004, Marder and Goaillard, 2006]. Our model networks automatically compensate for disturbances by moving their continuously ongoing sampling of network configurations to a new region of the parameter space, as illustrated by the response to the disturbance marked by in Fig. 3i.
Our theoretical framework is consistent with experimental data that showed drifts of neural representations in motor learning [Rokni et al., 2007]. In that article, a stochastic plasticity model was proposed that is structurally similar to our model. It was shown in computer simulations that a simple feed forward ratebased neural network is able to retain stable functionality despite of such stochastic parameter changes. The authors hypothesized that this is the case because network parameters move on a submanifold in parameter space with constant performance. Our theoretical framework provides a mathematical justification for their hypothesis in general, but also refines these statements. It shows that the network samples network configurations (including the rewiring of connections that was not considered in [Rokni et al., 2007]) from a welldefined distribution. The manifold that is visited during the learning process is given by the highprobability regions of this distribution, but in principle, also suboptimal regions could be visited. Such suboptimal regions are however highly unlikely if the parameter space is overcomplete, i.e., if large volumes of the parameter space lead to good performance. Hence, in comparison with [Rokni et al., 2007], this work provides the following features: (a) It provides a quantitative mathematical framework for the qualitative descriptions in [Rokni et al., 2007] that allows a rigorous understanding of the plasticity processes, (b) it includes synaptic rewiring, reproducing experimental data on this topic and providing a hypothesis on its computational role, and (c), it is able to tackle the case of recurrent spiking neural networks as compared to feed forward rate models.
We have shown in Fig. 3f that despite these permanent parameter drifts, the task performance in our model remains stable for many simulated days if reward delivery is stopped. At the same time, the model is also able to continuously adapt to changes in the task Fig. 3ce. These results suggest that our model keeps a quite good balance between stability and plasticity [Abraham and Robins, 2005], which has already been shown to be one important functional aspect of network rewiring [Fauth et al., 2015]. Furthermore, we have shown in Fig. 3f, that the structural priors over synaptic parameters can be utilized to stabilize synaptic parameters similar to previous models of synaptic consolidation [Fusi et al., 2005, Kirkpatrick et al., 2017]. In addition, more complex prior distributions over multiple synapses could be utilized to model homeostatic processes and clustering of synapses. The latter has been suggested as a mechanism to tackle the stabilityplasticity dilemma [Fares and Stepanyants, 2009].
In conclusion the mathematical framework presented in this article provides a principled way of understanding the complex interplay of deterministic and stochastic processes that underlie the implementation of goaldirected learning in neural circuits of the brain. It also offers a solution to the problem how reliable network computations can be achieved with a “dynamic connectome” [Rumpel and Triesch, 2016]. We have argued that the stationary distribution of the highdimensional parameter vector that results from large numbers of local stochastic processes at the synapses provides a timeinvariant perspective of salient properties of a network. Standard rewardgated plasticity rules can achieve that this stationary distribution has most of its mass on regions in the parameter space that provide good network performance. The stochastic component of synaptic dynamics can flatten or sharpen the resulting stationary distribution, depending on whether the scaling parameter (“temperature”) of the stochastic component is larger or smaller than 1. A functional benefit of this stochastic component is that the network keeps exploring its parameter space even after a wellperforming region has been found, providing one mechanism to tackle the explorationexploitation dilemma (see Fig. 2j). This enables the network to migrate quickly and automatically to a better performing region when the network or task changes. We found in the case of the motor learning task of Fig. 3 that a temperature around 0.15, which lies in the same range as related experimental data (see Fig. 4d), suffices to provide this functionally important compensation capability. The same mathematical framework can also be applied to artificial neural networks, leading to a novel braininspired learning algorithm that uses rewiring to train deep networks under the constraint of very sparse connectivity [Bellec et al., 2017]. (1657 words)
Methods
Probabilistic framework for rewardmodulated learning
The classical goal of reinforcement learning is to maximize the expected future discounted reward given by
(6) 
In Eq. (6) we integrate over all future rewards , while discounting more remote rewards exponentially with a discount rate , which for simplicity was set equal to in this paper. We find (see Eq. (15)) that this time constant is immediately related to the experimentally studied time window or eligibility trace for the influence of dopamine on synaptic plasticity [Yagishita et al., 2014]. This property is true in general for rewardbased learning rules that make use of eligibility traces and is not unique to our model. The expectation in Eq. (6) is taken with respect to the distribution over sequences of future rewards that result from the given set of synaptic parameters . The stochasticity of the reward sequence arises from stochastic network inputs, stochastic network responses, and stochastic reward delivery. The resulting distribution of reward sequences for the given parameters can also include influences of network initial conditions by assuming some distribution over these initial conditions. Network initial conditions include for example initial values of neuron membrane voltages and refractory states of neurons. The role of initial conditions on network learning is further discussed below when we consider the online learning scenario in Rewardmodulated synaptic plasticity approximates gradient ascent on the expected discounted reward.
There exists a close relationship between reinforcement learning and Bayesian inference [Vlassis et al., 2012, Rawlik et al., 2013, Botvinick and Toussaint, 2012]
. To make this relationship apparent, we define our model for rewardgated network plasticity by introducing a binary random variable
that represents the currently expected future discounted reward in a probabilistic manner. The likelihood is determined in this theoretical framework by the expected future discounted reward Eq. (6) that is achieved by a network with parameter set (see e.g., [Rawlik et al., 2013]):(7) 
where denotes a constant, that assures that Eq. (7
) is a correctly normalized probability distribution. Thus rewardbased network optimization can be formalized as maximizing the likelihood
with respect to the network configuration . Structural constraints can be integrated into a stochastic model for network plasticity through a prior over network configurations. Hence rewardgated network optimization amounts from a theoretical perspective to learning of the posterior distribution , which by Bayes’ rule is defined (up to normalization) by . Therefore, the learning goal can be formalized in a compact form as evaluating the posterior distribution of network parameters under the constraint that the abstract learning goal is achieved.More generally, one is often interested in a tempered version of the posterior
(8) 
where is a suitable normalization constant and is the temperature parameter that controls the “sharpness” of . For , is given by the original posterior, emphasizes parameter values with high probability in the posterior, while leads to parameter distributions
which are more uniformly distributed than the posterior.
Analysis of policy sampling
Here we prove that the stochastic parameter dynamics Eq. (5) samples from the tempered posterior distribution given in Eq. (8). In Results we suppressed timedependencies in order to simplify notation. We reiterate Eq. (3) with explicit timedependencies of parameters:
(9) 
where the notation denotes the derivative of with respect to evaluated at the current parameter values . By Bayes’ rule, the derivative of the log posterior is the sum of the derivatives of the prior and the likelihood:
which allows us to rewrite Eq. (9) as
(10) 
which is identical to the form Eq. (5), where the contributions of and are given explicitly.
The fundamental property of the synaptic sampling dynamics Eq. (9) is formalized in Theorem 1 and proven below. Before we state the theorem, we briefly discuss its statement in simple terms. Consider some initial parameter setting . Over time, the parameters change according to the dynamics (9). Since the dynamics include a noise term, the exact value of the parameters at some time cannot be determined. However, it is possible to describe the exact distribution of parameters for each time . We denote this distribution by , where the “FP” subscript stands for “FokkerPlanck” since the evolution of this distribution is described by the FokkerPlanck equation (11) given below. Note that we make the dependence of this distribution on time explicit in this notation. It can be shown that for the dynamics (11), converges to a welldefined and unique stationary distribution in the limit of large . Of practical relevance is the socalled burnin time after which the distribution of parameters is very close to the stationary distribution. Note that the parameters will continue to change. Nevertheless, at any time after the burn in, we can expect the parameter vector to be situated at a particular value with the probability (density) given by the stationary distribution, see Fig. 1d,f. Any distribution that is invariant under the parameter dynamics is a stationary distribution. Here, invariance means: when one starts with an invariant distribution over parameters in the FokkerPlanck equation, the dynamics are such that this distribution will be kept forever (we will use this below in the proof of Theorem 1). Theorem 1 states that the parameter dynamics leaves given in Eq. (8) invariant, i.e., it is a stationary distribution of the network parameters. Note that in general, the stationary distribution may not be uniquely defined. That is, it could happen that for two different initial parameter values, the network reaches two different stationary distributions. Theorem 1 further states that for the synaptic sampling dynamics, the stationary distribution is unique, i.e., the distribution is reached from any initial parameter setting when the conditions of the theorem apply. We now state Theorem 1 formally. To simplify notation we drop in the following the explicit time dependence of the synaptic parameters .
Theorem 1.
Let be a strictly positive, continuous probability distribution over parameters , twice continuously differentiable with respect to , and let . Then the set of stochastic differential equations Eq. (9) leaves the distribution (8) invariant. Furthermore, is the unique stationary distribution of the sampling dynamics.
Proof.
The proof is analogous to the one provided in [Kappel et al., 2015]. The stochastic differential equation Eq. (9) translates into a FokkerPlanck equation [Gardiner, 2004] that describes the evolution of the distribution over parameters
(11) 
where denotes the distribution over network parameters at time . To show that leaves the distribution invariant, we have to show that (i.e., does not change) if we set to on the right hand side of Eq. (11). Plugging in the presumed stationary distribution for on the right hand side of Eq. (11), one obtains
which by inserting , with normalizing constant , becomes
This proves that is a stationary distribution of the parameter sampling dynamics Eq. (9). Since is strictly positive, this stationary distribution is also unique (see Section 3.7.2 in [Gardiner, 2004]).
The unique stationary distribution of Eq. (11) is given by , i.e. is the only solution for which becomes , which completes the proof. ∎
The uniqueness of the stationary distribution follows because each parameter setting can be reached from any other parameter setting with nonzero probability (ergodicity). The stochastic process can therefore not get trapped in cycles or absorbed into a subregion of the parameter space. The time spent in a certain region of the parameter space is therefore directly proportional to the probability of that parameter region under the posterior distribution. The proof requires that the posterior distribution is smooth and differentiable with respect to the synaptic parameters. This is not true in general for a spiking neural networks. In our simulations we used a stochastic neuron model (defined in the next section). As the reward landscape in our case is defined by the expected discounted reward (see below), a probabilistic network tends to smoothen this landscape and therefore the posterior distribution.
Network model
Plasticity rules for this general framework were derived based on a specific spiking neural network model, which we describe in the following. All reported computer simulations were performed with this network model. We considered a general network scaffold of neurons with potentially asymmetric recurrent connections. Neurons are indexed in an arbitrary order by integers between and . We denote the output spike train of a neuron by . It is defined as the sum of Dirac delta pulses positioned at the spike times , i.e., . Potential synaptic connections are also indexed in an arbitrary order by integers between and , where denotes the number of potential synaptic connections in the network. We denote by and the index of the pre and postsynaptic neuron of synapse , respectively, which unambiguously specifies the connectivity in the network. Further, we define to be the index set of synapses that project to neuron . Note that this indexing scheme allows us to include multiple (potential) synaptic connections between a given pair of neurons. We included this experimentally observed feature of biological neuronal networks in all our simulations. We denote by the synaptic efficacy of the th synapse in the network at time .
Network neurons were modeled by a standard stochastic variant of the spike response model [Gerstner et al., 2014]. In this model, the membrane potential of a neuron at time is given by
(12) 
where denotes the slowly changing bias potential of neuron , and denotes the trace of the (unweighted) postsynaptic potentials (PSPs) that neuron leaves in its postsynaptic synapses at time . More precisely, it is defined as given by spike trains filtered with a PSP kernel of the form , with time constants and , if not stated otherwise. Here denotes convolution and is the Heaviside step function, i.e. for and otherwise.
The synaptic weights in Eq. (12) were determined by the synaptic parameters through the mapping Eq. (1) for . Synaptic connections with were interpreted as not functional (disconnected) and was therefore set to 0 in that case.
The bias potential in Eq. (12) implements a slow adaptation mechanism of the intrinsic excitability, which ensures that the output rate of each neuron stays near the firing threshold and the neuron maintains responsiveness [Desai et al., 1999, Fan et al., 2005]. We used a simple adaptation mechanism which was updated according to
(13) 
where is the time constant of the adaptation mechanism and is the desired output rate of the neuron. In our simulations, the bias potential was initialized at 3 and then followed the dynamics given in Eq. (13). We found that this regularization significantly increased the performance and learning speed of our network model. In [Remme and Wadman, 2012] a similar mechanism was proposed to balance activity in networks of excitatory and inhibitory neurons. The regularization used here can be seen as a simplified version of this mechanism that regulates the mean firing rate of each excitatory neuron using a simple linear control loop and thereby stabilizes the output behavior of the network.
We used a simple refractory mechanism for our neuron model. The firing rate, or intensity, of neuron at time is defined by the function , where denotes a refractory variable that measures the time elapsed since the last spike of neuron . We used an exponential dependence between membrane potential and firing rate, such that the instantaneous firing rate of the neuron at time can be written as
(14) 
Furthermore, we denote by the firing rate of the neuron postsynaptic to synapse . If not stated otherwise we set the refractory time to 5 ms. In addition, a subset of neurons was clamped to some given firing rates (input neurons), such that of these input neurons was given by an arbitrary function. We denote the spike train from these neurons by , the network input.
Synaptic dynamics for the rewardbased synaptic sampling model
Here, we provide additional details on how the synaptic parameter dynamics Eq. (5) was computed. We will first provide an intuitive interpretation of the equations and then provide a detailed derivation in the next section. The second term of Eq. (5) denotes the gradient of the expected future discounted reward Eq. (6). In general, optimizing this function has to account for the case where rewards are provided after some delay period. It is well known that this distal reward problem can be solved using plasticity mechanisms that make use of eligibility traces in the synapses that are triggered by near coincident spike patterns, but their consolidation into the synaptic weights is delayed and gated by the reward signal [Sutton and Barto, 1998, Izhikevich, 2007]. The theoretically optimal shape for these eligibility traces can be derived using the reinforcement learning theory and depends on the choice of network model. For the spiking neural network model described above, the gradient can be estimated through a plasticity mechanism that uses an eligibility trace in each synapse which gets updated according to
(15) 
where is the time constant of the eligibility trace. Recall that denotes the index of the presynaptic neuron and the index of the postsynaptic neuron for synapse . In Eq. (15) denotes the postsynaptic spike train, denotes the instantaneous firing rate (Eq. (14)) of the postsynaptic neuron and denotes the postsynaptic potential under synapse .
The last term of Eq. (15) shares salient properties with standard STDP learning rules, since plasticity is enabled by the presynaptic term and gated by the postsynaptic term (see [Pfister et al., 2006]). The latter term also regularizes the plasticity mechanism such that synapses stop growing if the firing probability of the postsynaptic neuron is already close to one.
The eligibility trace Eq. (15) is multiplied by the reward and integrated in each synapse using a second dynamic variable
(16) 
where is a lowpass filtered version of (described below). The variable combines the eligibility trace and the reward, and averages over the time scale . is a constant offset on the reward signal. This parameter can be set to an arbitrary value without changing the stationary dynamics of the model (see next section). In our simulations, this offset was chosen slightly above () such that small parameter changes were also present without any reward, as observed in [Yagishita et al., 2014]. Furthermore, does not have to be chosen constant. E.g. this term can be used to incorporate predictions about the reward outcome by setting to the negative of output of a critic network that learns to predict future reward. This approach has been previously studied in [Frémaux et al., 2013] to model experimental data of [Schultz et al., 1997, Schultz, 2002].
In the next section we show that approximates the gradient of the expected future reward with respect to the synaptic parameter. In our simulations we found that incorporating the lowpass filtered eligibility traces (Eq. (16)) into the synaptic parameters works significantly better than using the eligibility traces directly for weight updates, although the latter approach was taken in a number of previous studies (see e.g. [Pfister et al., 2006, Legenstein et al., 2008, Urbanczik and Senn, 2009]). Eq. (16) essentially combines the eligibility trace with the reward and smoothens the resulting trace with a lowpass filter with time constant . This time constant has been chosen to be in the order spontaneous decay of disinhibited CaMKII in the synapse which is closely related to spine enlargement in the dopaminegated STDP protocol of [Yagishita et al., 2014] (c.f. their Fig. 3F and Fig. 4C).
in Eq. (16) is a lowpass filtered version of that scales the synaptic updates. It was implemented through , with . The value of has been chosen to equal based on theoretical considerations (see Online learning). This scaling of the reward signal has the following effect. If the current reward exceeds the average reward , the effect of the neuromodulatory signal will be greater than . On the other hand, if the current reward is below average synaptic updates will be weighted by a term significantly lower than . Therefore, parameter updates are preferred for which the current reward signal exceeds the average.
Similar plasticity rules with eligibility traces in spiking neural networks have previously been proposed by several authors [Seung, 2003, Xie and Seung, 2004, Izhikevich, 2007, Pfister et al., 2006, Florian, 2007, Legenstein et al., 2008, Urbanczik and Senn, 2009, Frémaux et al., 2010, Frémaux et al., 2013]. In [Frémaux et al., 2013] also a method to estimate the neural firing rate from backpropagating action potentials in the synapses has been proposed. The main difference to these previous approaches is that the activitydependent last term in Eq. (15) is scaled by the current synaptic weight . This weightdependence of the update equations induces multiplicative synaptic dynamics and is a consequence of the exponential mapping Eq. (1) (see derivation in the next section). This is an important property for a network model that includes rewiring. Note, that for retracted synapses (), both and decay to zero (within few minutes in our simulations). Therefore, we find that the dynamics of retracted synapses is only driven by the first (prior) and last (random fluctuations) term of Eq. (5) and are independent from the network activity. Thus, retracted synapses spontaneously reappear also in the absence of reward after a random amount of time.
The first term in Eq. (5) is the gradient of the prior distribution. We used a prior distribution that pulls the synaptic parameters towards such that unused synapses tend to disappear and new synapses are permanently formed. Throughout all simulations we used independent Gaussian priors for the synaptic parameters
where
is the standard deviation of the prior distribution. Using this, we find that the contribution of the prior to the online parameter update equation is given by
(17) 
Finally by plugging Eq. (17) and (16) into Eq. (5) the synaptic parameter changes at time are given by
(18) 
If not stated otherwise we used and , and a learning rate of . By inspecting Eq. (18) it becomes immediately clear that the parameter dynamics follow an OrnsteinUhlenbeck process it the activitydependent second term is inactive (in the absence of reward), i.e. if . In this case the dynamics are given by the deterministic drift towards the mean value and the stochastic diffusion fueled by the Wiener process . The temperature and the standard deviation scale the contribution of these two forces.
Rewardmodulated synaptic plasticity approximates gradient ascent on the expected discounted reward
We first consider a theoretical setup where the network is operated in arbitrarily long episodes such that in each episode a reward sequence is encountered. The reward sequence can be any discrete or realvalued function that is positive and bounded. The episodic scenario is useful to derive exact batch parameter update rules, from which we will then deduce online learning rules. Due to stochastic network inputs, stochastic network responses, and stochastic reward delivery, the reward sequence is stochastic.
The classical goal of reinforcement learning is to maximize the function of discounted expected rewards Eq. (6). Policy gradient algorithms perform gradient ascent on by changing each parameter in the direction of the gradient . Here, we show that the parameter dynamics Eq. (15), (16) approximate this gradient, i.e., .
It is natural to assume that the reward signal only depends indirectly on the parameters , through the history of network spikes up to time , which we write as , i.e., . We can first expand the expectation in Eq. (6
) to be taken over the joint distribution
over reward sequences and network trajectories . The derivative(19) 
can be evaluated using the wellknown identity :
(20)  
Here, is the probability of observing the spike train in the time interval 0 to . For the definition of the network given above, the gradient of this distribution can be directly evaluated. Using Eq. (12) and (1) we get [Pfister et al., 2006]
(21)  
where we have used that by construction only the rate function depends on the parameter . If one discretizes time and assumes that rewards and parameter updates are only realized at the end of each episode, the REINFORCE rule is recovered [Williams, 1992].
In Eq. (21) we used the approximation . This expression ignores the discontinuity of Eq. (1) at , where the function is not differentiable. In practice we found that this approximation is quite accurate if is large enough such that is close to zero (which is the case for in our simulation). In control experiments we also used a smooth function (without the jump at ) for which Eq. (21) is exact, and found that this yields results that are not significantly different from the ones that use the mapping Eq. (1).
Online learning
Eq. (20) defines a batch learning rule with an average taken over learning episodes where in each episode network responses and rewards are drawn according to the distribution . In a biological setting, there are typically no clear episodes but rather a continuous stream of network inputs and rewards and parameter updates are performed continuously (i.e., learning is online). The analysis of online policy gradient learning is far more complicated than the batch scenario, and typically only approximate results can be obtained that however perform well in practice, see e.g., [Seung, 2003, Xie and Seung, 2004] for discussions.
In order to arrive at an online learning rule for this scenario, we consider an estimator of Eq. (20) that approximates its value at each time based on the recent network activity and rewards during time for some suitable . We denote the estimator at time by
Comments
There are no comments yet.