Spiking neurons with short-term synaptic plasticity form superior generative networks

09/24/2017 ∙ by Luziwei Leng, et al.


Significance statement

Neural networks have long been used to solve various problems in machine learning, but apart from a conceptual similarity to cortical structure they stray away from their biological archetypes. Their recent success has prompted many efforts to implement them with more biologically plausible components, but the computational advantages thereof have so far proven elusive. In this work, we focus on two well-established biological facts: spike-based communication between neurons and a limited pool of synaptic resources (neurotransmitters). We argue that, in combination, these two mechanisms can endow networks with computational capabilities that are otherwise difficult to achieve. In particular, in the context of probabilistic inference, we show how plastic synapses bolster the generative capabilities of spiking networks while requiring only a small, local computational overhead, as opposed to the classical tempering solutions for their conventional counterparts. Our work thereby highlights important computational consequences of biological features that might otherwise appear as mere engineering limitations or artifacts of evolution.

Abstract

Spiking networks that perform probabilistic inference have been proposed both as models of cortical computation and as candidates for solving problems in machine learning. However, the evidence for spike-based computation being in any way superior to non-spiking alternatives remains scarce. We propose that short-term plasticity can provide spiking networks with distinct computational advantages compared to their classical counterparts. In this work, we use networks of leaky integrate-and-fire neurons that are trained to perform both discriminative and generative tasks in their forward and backward information processing paths, respectively. During training, the energy landscape associated with their dynamics becomes highly diverse, with deep attractor basins separated by high barriers. Classical algorithms solve this problem by employing various tempering techniques, which are both computationally demanding and require global state updates. We demonstrate how similar results can be achieved in spiking networks endowed with local short-term synaptic plasticity. Additionally, we discuss how these networks can even outperform tempering-based approaches when the training data is imbalanced. We thereby show how biologically inspired, local, spike-triggered synaptic dynamics based simply on a limited pool of synaptic resources can allow spiking networks to outperform their non-spiking relatives.

Introduction

Neural networks are, once again, in the focus of both the artificial and the biological intelligence communities. Originally inspired by the dynamics and architecture of cortical networks [1, 2], they have increasingly strayed away from their biological archetypes, prompting questions about their relevance for understanding the brain [3, 4]. However, their recent hardware-fueled dominance [5] has motivated renewed efforts to align them with biologically more plausible models [6, 7, 8, 9]. Moreover, neural networks have been used to explain some aspects of in-vivo cortical dynamics [10, 11].

Two questions are immanent to these efforts: From a machine learning perspective, how useful are spike-based versions of deep neural networks? And from a biological perspective, how much can we learn about the brain from artificial neural networks? Much of the recent work on neural networks has focused on the "forward" computation pathway, i.e., learning pattern classification through error backpropagation [12]. However, the "backward" pathway required for generative models has also made significant progress [13, 14]. A key aspect of the success of a generative model is its capability to mix, i.e., to travel through the probability landscape that it needs to represent. The performance gain of recently proposed models is to a large extent due to refined mixing algorithms, most of which are based on a form of simulated tempering [15, 16, 17].

The discriminative capacity of the neocortex is well-established, as evidenced by the difficulty of artificial systems to achieve superhuman classification performance [12]. Simultaneously however, the brain also appears to learn a generative model of its sensory environment [18, 19, 20]. How these capabilities are achieved remains an open question, but it is unlikely that complex tempering schedules are at work.

One mechanism that is capable of modulating synaptic weights and thereby shaping the probability landscape of a neural network is short-term synaptic plasticity. In this work, we investigate the ability of this biologically ubiquitous mechanism to improve the mixing capabilities of generative neural networks. Furthermore, we show how hierarchical spiking networks endowed with short-term plasticity can simultaneously become good discriminative and generative models, a feature that is difficult to achieve due to the conflicting nature of these two tasks. We thereby offer a potential explanation for the generative capabilities of cortical networks, while at the same time proposing a simple but efficient mechanism to bolster the usefulness of spiking networks for machine learning applications. This can be of particular interest in combination with spiking neuromorphic systems which, compared to conventional simulation platforms, implement fast and energy-efficient physical models of neuro-synaptic dynamics [21, 22].

Methods

Figure 1: (A) Structure of a hierarchical sampling spiking network. Its classical counterpart is a restricted Boltzmann machine with a visible (v), hidden (h) and label (l) layer. (B) Interpretation of states as samples in a spiking network. A neuron with a freely evolving membrane potential is said to be in the state z = 0 and switches to the state z = 1 upon firing, where it stays for the duration of the refractory period. (C) Sketch of the membrane potential evolution for three relevant scenarios: static (black), renewing (green) and modulated (blue) synapses. Bottom right: envelope of the PSP height for the three corresponding parameter sets from the manuscript (black, green and blue). Note how the latter only weakly modulates the PSP height. (D) In order to correctly sample from a posterior distribution, a network needs to be able to mix, i.e., traverse barriers between low-energy basins. To facilitate mixing, tempering methods globally rescale the energy landscape with an inverse temperature β (top). In contrast, STP can be viewed as only modulating the energy landscape locally, thereby only affecting the currently active attractor (bottom).

We start with a brief introduction of Boltzmann machines as generative models and their spike-based implementation. We then describe the problem of mixing and outline the essential elements of tempering-based solutions. Finally, we discuss the model of short-term plasticity that we later use in our spiking networks. Supplementary information (SI) is provided at the end of the manuscript.

Boltzmann machines and spiking networks

Among the neural networks proposed as generative models for high-dimensional input, Boltzmann machines (BMs) [23] are arguably the most prominent [24, 25, 26, 27]. Neurons in BMs are binary units with states z_k ∈ {0, 1}. These states are typically updated in a sequential schedule in a way that implements Gibbs sampling from a target Boltzmann distribution

p(z) = \frac{1}{Z} \exp[-\beta E(z)] , \quad E(z) = -\tfrac{1}{2} z^\top W z - b^\top z , (1)

with the inverse temperature β, the partition function Z = \sum_z \exp[-\beta E(z)] and the energy function E(z) parametrized by the weight matrix W and bias vector b. This is achieved by having each neuron compute a local "membrane potential" as the log-odds of its conditional firing probability, which for the Boltzmann distribution (at β = 1) is equal to a weighted sum over input activities:

u_k = \log \frac{p(z_k = 1 \mid z_{\setminus k})}{p(z_k = 0 \mid z_{\setminus k})} = \sum_j W_{kj} z_j + b_k . (2)

Consequently, state updates are computed using a logistic activation function p(z_k = 1 \mid z_{\setminus k}) = \sigma(u_k) = [1 + \exp(-u_k)]^{-1}.

In a restricted Boltzmann machine (RBM), the units are subdivided into a visible and a hidden layer, with no within-layer connections (Fig. 1A). To enable classification, an additional label layer can be added, which for training purposes can be treated as part of the visible layer. During training, weights and biases are iteratively updated such that the marginal distribution over the visible layer approximates the distribution underlying the set of training samples.
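To make the sampling scheme concrete, the following minimal NumPy sketch implements block Gibbs sampling in an RBM according to Eqns. 1 and 2. The layer sizes roughly follow the MNIST network used later, but the weight scale, biases, initial state and number of steps are arbitrary placeholders rather than values from this work.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_vis, n_hid = 784, 600                            # layer sizes as in the MNIST example
W = 0.01 * rng.standard_normal((n_vis, n_hid))     # placeholder weight matrix
b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)    # placeholder biases

def gibbs_step(v):
    # One block Gibbs update: hidden given visible, then visible given hidden (Eqn. 2).
    h = (rng.random(n_hid) < sigmoid(v @ W + b_hid)).astype(float)
    v = (rng.random(n_vis) < sigmoid(h @ W.T + b_vis)).astype(float)
    return v

v = rng.integers(0, 2, n_vis).astype(float)        # random initial visible state
samples = []
for _ in range(1000):
    v = gibbs_step(v)
    samples.append(v.copy())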

Recently, it has been shown how networks of spiking neurons can perform equivalent computations [28], which we briefly outline in the following. The building blocks for our spiking networks are conductance-based leaky integrate-and-fire (LIF) neurons, with membrane potential dynamics governed by

C_m \frac{du}{dt} = g_l (E_l - u) + I^{syn}(t) , (3)

where C_m is the membrane capacitance, g_l and E_l the leak conductance and potential, and I^{syn} the synaptic current. Neurons fire upon reaching a threshold voltage, which causes the membrane to be clamped to a reset potential for a duration equal to the refractory period τ_ref. The synaptic current is modeled as a sum of exponential kernels triggered by presynaptic spikes with a synaptic time constant τ_syn and weighted by synaptic weights w_k and reversal potentials E^{rev}_k:

I^{syn}(t) = \sum_k \sum_{t_s} w_k(t) \, (E^{rev}_k - u) \, e^{-(t - t_s)/\tau_{syn}} \, \Theta(t - t_s) , (4)

where the outer sum runs over presynaptic partners k, the inner sum over their spike times t_s and Θ denotes the Heaviside step function. The temporal dependence of the synaptic weights accounts for the STP mechanism we discuss later.
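As an illustration of Eqns. 3 and 4, a minimal forward-Euler integration of a single conductance-based LIF neuron with an exponential synaptic kernel could look as follows; all numerical values are illustrative placeholders and not the simulation parameters of this work.

import numpy as np

C_m, g_l, E_l = 0.2, 0.01, -65.0                 # nF, uS, mV (placeholders)
E_rev, w_syn, tau_syn = 0.0, 0.002, 10.0         # mV, uS, ms
v_thresh, v_reset, tau_ref = -52.0, -65.0, 10.0  # mV, mV, ms
dt, T = 0.1, 200.0                               # ms
pre_spikes = {20.0, 30.0, 40.0, 120.0}           # presynaptic spike times (ms)

u, g, refrac_until, out_spikes = E_l, 0.0, -1.0, []
for step in range(int(T / dt)):
    t = round(step * dt, 10)
    g *= np.exp(-dt / tau_syn)                   # exponential synaptic kernel (Eqn. 4)
    if t in pre_spikes:
        g += w_syn                               # conductance jump on a presynaptic spike
    if t < refrac_until:
        u = v_reset                              # clamped during the refractory period
        continue
    I_syn = g * (E_rev - u)                      # conductance-based synaptic current
    u += dt / C_m * (g_l * (E_l - u) + I_syn)    # membrane dynamics (Eqn. 3)
    if u >= v_thresh:                            # threshold crossing: spike and reset
        out_spikes.append(t)
        u, refrac_until = v_reset, t + tau_ref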

Each neuron receives both functional synaptic input from other neurons within the network and diffuse background input from external neurons that can be modeled as Poisson spike trains. The latter causes the neuron to fire stochastically. Since, at the level of spikes, the output of a neuron can be considered binary, we associate a binary random variable z_k with each neuron. As a neuron never fires within the refractory period, it is natural to set z_k = 1 for the duration of the refractory period following each spike and z_k = 0 otherwise (Fig. 1B).

For constant functional synaptic input, the mean firing rate of such a neuron is proportional to its activation function p(z_k = 1). By applying strong background input, we lift neurons into a high-conductance state [HCS, 29], which molds their activation function into an approximately logistic shape [30]:

p(z_k = 1) = \sigma\!\left( \frac{\bar u_k - \bar u_k^0}{\alpha} \right) , (5)

with scaling parameters α and \bar u_k^0, where \bar u_k represents the functional, i.e., background-free, membrane potential. Similarly to Gibbs sampling, the functional membrane potential thereby fulfills the local computability condition (Eqn. 2), which is a sufficient computational prerequisite for sampling in neural networks [23, 31]. The scaling parameters can be derived analytically and allow a direct translation of the BM parameters W and b to the corresponding parameters in the biological domain (SI, Sec. 1).

Tempering vs. short-term plasticity

When trained from data, the energy landscape is shaped in a way that assigns low energy values (modes) to the samples in the training data. If this dataset is composed of very dissimilar classes, training algorithms tend to separate them by high energy barriers. As their height grows during training, Gibbs sampling becomes increasingly ineffective at covering the entire relevant state space, as reflected by a high correlation between consecutive samples caused by the component-wise update of states. Consequently, a BM would need longer to converge towards its underlying distribution. This problem becomes particularly inconvenient when dealing with complex, real-world data, or when an agent must rely on the prediction of the network to make a fast decision.

The ability of a sampling-based generative model to jump across energy barriers, also known as mixing, has therefore received significant attention [32, 33, 16, 17]. Many of these methods rely on some version of simulated tempering, which modifies the temperature parameter in order to globally flatten the network's energy landscape (Fig. 1D). Therefore, in addition to conventional Gibbs sampling, we use the adaptive simulated tempering algorithm (AST) [16] as a benchmark for our spiking networks (SI, Sec. 2).

While greatly increasing the mixing capabilities of generative networks, all tempering schedules come with a cost of their own, both because they require additional computations and because they only gather valid samples at the base inverse temperature (β = 1), thereby effectively slowing down the sampling process. Furthermore, they require parameter changes that assume knowledge about the global state of the network, which is difficult to reconcile with biology. This motivates the search for a local update rule that has biological relevance, improves mixing and can be embedded in spiking networks.

In biological neural networks, the momentary synaptic interaction strength is reflected in the size of the elicited postsynaptic potential (PSP). In dynamic synapses, this value may change over time depending on the presynaptic activity. To model this dependence, we use the Tsodyks-Markram model of short-term plasticity [STP, 34]:

w(t) = w \, U_{SE}(t) \, R(t) (6)

\frac{dR}{dt} = \frac{1 - R}{\tau_{rec}} - U_{SE} \, R \sum_{t_s} \delta(t - t_s) (7)

\frac{dU_{SE}}{dt} = -\frac{U_{SE}}{\tau_{facil}} + U_0 \, (1 - U_{SE}) \sum_{t_s} \delta(t - t_s) (8)

Here, w represents the (static) synaptic weight and U_{SE} the utilized fraction of the available synaptic resources R. Upon arrival of a presynaptic spike at time t_s, the synapse is depressed by subtracting U_{SE} R from R, which recovers exponentially with the time constant τ_rec. Facilitation is modeled by a simultaneous increase of U_{SE} by U_0 (1 − U_{SE}), followed by an exponential decay with time constant τ_facil.
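The spike-triggered dynamics of Eqns. 6-8 can be emulated event by event, decaying R and U_SE analytically between presynaptic spikes. The following sketch is a simplified, facilitation-first variant of the Tsodyks-Markram update with hypothetical parameter values.

import numpy as np

def stp_weights(spike_times, w_static, U0, tau_rec, tau_facil):
    # Effective weight w * U_SE * R at each presynaptic spike (Eqn. 6).
    R, U, t_last, weights = 1.0, 0.0, None, []
    for t in spike_times:
        if t_last is not None:
            dt = t - t_last
            R = 1.0 - (1.0 - R) * np.exp(-dt / tau_rec)   # resource recovery towards 1 (Eqn. 7)
            U = U * np.exp(-dt / tau_facil)               # decay of facilitation (Eqn. 8)
        U += U0 * (1.0 - U)                               # spike-triggered increase of U_SE
        weights.append(w_static * U * R)                  # momentary synaptic efficacy
        R -= U * R                                        # depression: subtract U_SE * R (Eqn. 7)
        t_last = t
    return weights

# Hypothetical burst at the maximum rate 1/tau_ref = 100 Hz:
print(stp_weights(np.arange(0.0, 100.0, 10.0), w_static=1.0, U0=0.2,
                  tau_rec=10.0, tau_facil=50.0))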

Since both tempering and STP effectively modify the energy landscape by changing network parameters during sampling, they clearly bear some conceptual resemblance. However, while tempering simultaneously affects all synaptic weights, STP only affects the efferent connections of those neurons that are simultaneously active at a given moment in time. Therefore, in contrast to the global modifications of the energy landscape incurred by tempering, STP has a more local effect, as sketched in Fig. 1D. Note that this effect is not equivalent to neuronal adaptation, because it does not prohibit neurons from remaining active over extended periods, which is essential for generating consecutive patterns with significant overlap.

Results

We study the effects of STP on the performance of spiking networks trained for different tasks. We start by discussing how STP can improve the sampling accuracy of small networks configured to sample from a fully specified target distribution, even when the energy landscape is shallow enough to not cause mixing problems. This is no longer the case for hierarchical networks trained directly on data, for which we study the influence of STP on both their generative and their discriminative properties. Furthermore, we show how STP can aid pattern completion in a network trained on a highly imbalanced dataset. For all spiking network simulations, we used NEST [35] with PyNN [36] as a front-end.
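As a sketch of how such a simulation can be set up, the following PyNN fragment connects two small LIF populations with Tsodyks-Markram synapses and drives them with Poisson background noise. It assumes a PyNN ≥ 0.8 installation with the NEST backend; population sizes, rates and all parameter values are placeholders and not the settings used for the experiments reported here.

import pyNN.nest as sim

sim.setup(timestep=0.1)  # ms

cell = sim.IF_cond_exp(tau_refrac=10.0, tau_syn_E=10.0, tau_syn_I=10.0)
visible = sim.Population(20, cell)      # placeholder layer sizes
hidden = sim.Population(10, cell)

# Diffuse Poisson background to lift the neurons into a high-conductance state.
noise = sim.Population(50, sim.SpikeSourcePoisson(rate=2000.0))
for layer in (visible, hidden):
    sim.Projection(noise, layer, sim.FixedProbabilityConnector(0.5),
                   synapse_type=sim.StaticSynapse(weight=0.001))

# Functional connections endowed with short-term plasticity (Eqns. 6-8).
stp = sim.TsodyksMarkramSynapse(U=0.2, tau_rec=10.0, tau_facil=50.0, weight=0.002)
sim.Projection(visible, hidden, sim.AllToAllConnector(), synapse_type=stp)
sim.Projection(hidden, visible, sim.AllToAllConnector(), synapse_type=stp)

hidden.record('spikes')
sim.run(1000.0)          # ms
spikes = hidden.get_data('spikes')
sim.end()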

Sampling from a fully specified target distribution

By modulating synaptic interactions, STP shapes the sampled distribution. This can be helpful when a spiking network needs to approximate a distribution that is otherwise incompatible with biological neuro-synaptic dynamics, as we discuss in the following.

Consider the case where the target distribution of the spiking network is a Boltzmann distribution. When a neuron needs to continuously represent the state z = 1 for an extended period, it fires a sequence of spikes at the maximum frequency 1/τ_ref. Following Eqn. 2, the resulting PSPs should increase a postsynaptic neuron's membrane potential by a constant amount proportional to the corresponding weight, which implies a rectangular PSP shape. This is, however, not a realistic shape for a more biologically plausible scenario, where PSPs have an exponentially shaped decay. This causes them to accumulate (Fig. 1C), such that the average increment becomes a function of the burst length, thereby distorting the sampled distribution.

Synaptic depression can mitigate this effect (Fig. 2B) by causing a gradual decrease in the amplitude of consecutive PSPs. Indeed, when sweeping over the parameter space (Fig. 2A), we find that an optimal reproduction of the target distribution is achieved when the recovery time constant τ_rec is close to the synaptic time constant τ_syn. This affords an intuitive explanation: In the HCS, the effective membrane time constant becomes small and τ_syn dominates the PSP decay. If the recovery of synaptic resources (Eqn. 7) happens at the same speed as the PSP decay, the STP mechanism essentially emulates a renewing synapse with an approximately constant running average (Fig. 1C). The slightly larger optimal recovery time constant further compensates for the long tails of exponential PSPs, which potentiate interaction strengths compared to the ideal case of rectangular PSPs (SI, Sec. 1). Note that the manifold for which the target distribution is close-to-optimally reproduced contains many different STP configurations, including the range of biologically observed parameters [37, 38], but not the parameter triplet corresponding to static synapses (Fig. 2A).

For this example, we used a fully specified target distribution; training was not needed, as the synaptic weights can be computed directly from the parameters W and b (SI, Sec. 1). Here, we used a target Boltzmann distribution with randomly drawn parameters that produce a diverse energy landscape, but not so rough as to create problems with mixing. This changes when the network parameters are learned from data, as we discuss in the following.

Figure 2: (A) Kullback-Leibler divergence between the sampled and target distribution of a spiking network with 10 neurons (5 hidden, 5 visible) for different STP parameters. Note that many different parameter combinations lead to close-to-optimal (white cross) sampling, but static synapses (black cross) are not among them. (B) Distribution sampled by the spiking network for two different configurations of synaptic parameters. Depressing synapses (bottom) allow the network to come much closer to the target distribution (blue) than non-plastic ones (top). (C) Left: Training data for the easy (top) and hard (bottom) learning scenario (individual images are overlapped). Right: Sequence of images generated by a Gibbs sampler and an STP-endowed spiking network with an equivalent set of weights and biases.

Mixing in a simple learning scenario

Borrowing from observations in the early visual system, we generated images of oriented bars. The bars were positioned in a way that gave rise to an "easy" (overlapping) and a "hard" (non-overlapping) dataset (Fig. 2C). We then trained a two-layer hierarchical network (400 visible, 30 hidden units) on each of these datasets using a version of the wake-sleep algorithm [16] (SI, Sec. 2). Intuitively, the difficulty of learning a generative model of this data increases when the bars have little or no overlap: in this case, training gives rise to three nearly disjoint populations that have, on average, excitatory connections within and inhibitory connections between them. The emergence of such a population-based winner-take-all structure can be characterized by the mean interaction strength between two population activity vectors p_i and p_j, which represent the average network activity during the presentation of the i-th and j-th input pattern, respectively. For the easy dataset, learning gave rise to a positive mean within-population interaction strength and a negative mean between-population interaction strength; for the hard dataset, this contrast became considerably more pronounced, reflecting the increased competition and disjointedness between the three emerging populations. STP, however, can weaken active synapses, temporarily reducing the effective interaction within the active attractor and thereby enabling switches between attractors.
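For illustration, the mean interaction strength between two such populations can be read off the learned weight matrix as a weighted average of pairwise couplings; the normalization by the summed activities below is our own simplifying assumption, not necessarily the exact measure used for the numbers reported above.

import numpy as np

def mean_interaction(W, p_i, p_j):
    # Average coupling between the units active in patterns i and j,
    # weighted by the population activity vectors p_i and p_j.
    return (p_i @ W @ p_j) / (p_i.sum() * p_j.sum())

# Hypothetical usage, with W the full network weight matrix and
# p[k] the average network activity during presentation of pattern k:
# within = mean_interaction(W, p[0], p[0])
# between = mean_interaction(W, p[0], p[1])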

The learned parameter set was used to compare the performance of classical Gibbs sampling and STP-endowed spiking networks (Fig. 2C). For the easy dataset, both the Gibbs sampler and the spiking network were able to mix, although the former spent on average 100 times longer in the same mode before switching, thereby requiring more time to converge to the target distribution. For the hard dataset, the spiking networks retained their ability to mix, whereas Gibbs sampling was unable to leave the (randomly initialized) local mode. These observations mirror those found in studies of cortical attractor networks [39]. While this simple experimental setup was specifically designed to illustrate the potential problems of sampling-based generative models and the ability of STP-endowed spiking networks to circumvent them, we show in the following that these properties are preserved in more complex scenarios.

Generation and classification of handwritten digits

Figure 3: Superior generative performance of an STP-endowed spiking network compared to an equivalent Gibbs sampler. (A) 2D parameter scans of the STP parameters, with multiple configurations leading to good generative performance. (B) Log-likelihood (ISL) of the test set calculated from an increasing number of samples. Each sampling method was simulated with 10 different random seeds. The ISLs of an optimal sampler with the same parameters (OPT, gray) and of the product of marginals (POM, brown) are shown for comparison (see SI, Sec. 3). (C) Direct comparison between the two sampling methods for a fixed number of samples, equivalent to a fixed sampling duration in the biological domain. ISL histogram generated from 100 random seeds. (D) Histogram of times spent within the same mode when no visible units are clamped. (E) tSNE plots of images produced by the two methods over 1800 consecutive samples. For every 6th of these samples, an output image is shown. Consecutive images are connected by gray lines. Different colors represent different image classes, defined by the label unit that showed the highest activity at the time the sample was generated. Note that tSNE inherently normalizes the area of the 2D projection; the volume of phase space covered by the Gibbs chain is, in fact, much smaller than the one covered by the spiking network.

The problem of mixing becomes even more pronounced when dealing with larger, more complex datasets. Here, we trained a hierarchical 3-layer network with 784 visible, 600 hidden and 10 label units on handwritten digits from the MNIST dataset [40]. By treating the label units as part of the visible layer during training, we simultaneously trained a generative and a discriminative model of the data. This objective is particularly challenging, because mechanisms that improve mixing tend to disrupt classification and vice-versa.

To evaluate the quality of generated samples, we computed a log-likelihood estimate for 2000 test images (not used during training) using the indirect sampling likelihood (ISL) method [41, 42, see also SI]. Due to the size of the network, a full scan of the parameter space for finding optimal STP parameters was no longer feasible. Therefore, starting from a good parameter set found by trial and error, we performed two 2D scans of the parameter space (Fig. 3A). As in the previous examples, we found short-term depression to be essential for achieving high ISL values. Furthermore, a small utilization parameter combined with short-term facilitation was also beneficial, allowing an initial strengthening followed by a weakening of the active attractor, as sketched in Fig. 1C,D. Similar observations have been made in cortex, where STP can promote the enhancement of transients [43].

We used one of the optimal STP parameter sets to compare the generative performance of spiking networks to classical Gibbs sampling. Due to its improved mixing capability, the spiking network was able to quickly cover a large portion of the relevant state space, as reflected by a faster ISL gain during sampling (Fig. 3B). This is a systematic effect that depends only weakly on initial conditions, as can be seen in Fig. 3C, which shows a histogram over 100 random seeds. For this comparison, we chose a sampling duration that represents a conservative estimate of the maximum duration for which a biological agent experiences stable stimulus conditions and can therefore sample from a stable target distribution. The faster mixing results from the spiking network's ability to jump out of local attractors, which is reflected in a much shorter average time spent within the same mode (Fig. 3D). Here, we defined a mode as the dominant class of the currently represented image; a mode was therefore defined by the identity of the neuron in the label layer with the highest firing rate.
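The mode dwell times underlying Fig. 3D can be extracted with a few lines of array manipulation. The sketch below assumes a hypothetical (n_samples x 10) array of label-unit firing rates, one row per sampling window; this input format is our own assumption for illustration.

import numpy as np

def mode_dwell_times(label_activity):
    # Dwell times (in samples) of the dominant label class between mode switches.
    modes = np.argmax(label_activity, axis=1)            # dominant class per sample
    switches = np.flatnonzero(np.diff(modes) != 0) + 1   # indices where the mode changes
    bounds = np.concatenate(([0], switches, [len(modes)]))
    return np.diff(bounds)                               # lengths of constant-mode runs

# Example with random label activities for 1000 samples and 10 label units:
rng = np.random.default_rng(1)
dwell = mode_dwell_times(rng.random((1000, 10)))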

It is important to note that, due to the STP-modulated interaction, the spiking network does not sample from exactly the same distribution as the Gibbs sampler, despite using an equivalent parameter set. However, for a very large number of samples, the two methods converge towards the same ISL (Fig. 3B), indicating that the discrepancy in performance for shorter sampling durations is not due to a fundamental difference in their respective ground truths.

While the ISL, as an abstract quantity, provides a useful numerical gauge of the quality of a generative model, a direct depiction of the produced images is particularly instructive. Here, we used the t-distributed stochastic neighbor embedding (tSNE) method [44, see also SI] for a 2D-embedding of the high-dimensional sampled distribution. The similarity between samples is largely reflected by their 2D distance and a large jump can be interpreted as a switch between attractors. As seen in Fig. 3E, the spiking network produces a significantly more diverse set of samples compared to the Gibbs sampler.

When the visible layer is clamped to a particular input, the same network can be used as a discriminative model of the learned data. Using the same parameters as for the generative task, the benchmark Gibbs sampler obtained a classification accuracy of 93.4% on the MNIST test data. The spiking network with STP performed only slightly worse, at 93.2%. The additional generative capabilities gained by the spiking networks through STP were therefore not strongly detrimental to their classification accuracy.

Modeling an imbalanced dataset

Figure 4: Comparison of Gibbs and AST samplers with STP-endowed spiking networks for imbalanced training data. (A) Histogram of the relative time spent in different modes, calculated from 16,000 samples. (B) Mode evolution over 8,000 consecutive samples. (C) tSNE plot of images generated by the spiking network, with a fixed interval between consecutive images shown. (D) Ambiguous input to the visible layer. The upper half is not clamped and free to complete the pattern. (E) Histogram of the relative time spent in different modes during the pattern completion task, measured over 20,000 consecutive samples. (F) Comparison of the sequence of images generated by the different methods over 5000 samples (only every 500th is shown).

In many real-world scenarios, the available data is imbalanced, with much of the data belonging to one class and significantly fewer samples being distributed over the others. It is well known that imbalanced data can cause severe problems for data mining and classification [45, 46]. One solution is to create a more balanced dataset from the imbalanced one, which can be achieved by methods such as under- or over-sampling [46, 47]. However, such an a priori modification of the input data does not seem biologically plausible. Still, cognitive biological agents appear to overcome this problem with ease: humans have little difficulty imagining a platypus from seeing only its bill, despite having likely seen many more ducks throughout their lifetime. Spiking networks with STP provide a simple solution to the problem of imbalanced training data, without any need for preprocessing.

We generated an imbalanced dataset of 1000 images by randomly selecting 820 digits of class "1" and 45 from each of the "0", "2", "3" and "8" classes. After training, we compared the generative output of a Gibbs sampler, an AST sampler and a spiking network with STP. Note that the effective sampling speed of AST is roughly 20 times lower than that of Gibbs sampling, since most of the produced samples are not considered valid. In this scenario, it becomes particularly useful that the spiking network transiently modifies the learned data distribution (Fig. 4A). The STP-induced weakening of active attractors balances out their activity, thereby negating the inherent imbalance induced by the training data. Furthermore, as observed before, the spiking network switches faster between modes (Fig. 4B,C).

These abilities become particularly useful in a scenario of inference based on incomplete information, for which pattern completion is a prime example. Here, we used a training set with 6 majority classes ("0", "1", "2", "3", "4", "6", 800 samples each) and one minority class (200 samples of "5"). We generated an ambiguous image by clamping the lower half of the visible layer to a configuration compatible with both a "3" and a "5" (Fig. 4D). While Gibbs and AST strongly undersample the minority class, the spiking network produces a much more balanced set of images, with swift transitions between modes (Fig. 4E,F). The estimate of the possible realities underlying the incomplete observation is therefore improved on both long and short time scales. This can be particularly useful for an agent in need of a quick reaction, as, for example, often required in nature in a fight-or-flight scenario.

Discussion

We have shown how a combination of event-based communication and short-term plasticity can enhance the ability of neural networks to perform probabilistic inference in high-dimensional data spaces. Here, a spike-triggered plasticity rule played a similar role to the simulated tempering methods used for classical neural networks, but without requiring complex computations on the global network state or long waiting times between valid samples. The spiking networks outperformed their classical counterparts as generative models of real-world data, with little disturbance to their classification capability, which we expect to be largely remediable by additional fine-tuning of the network parameters. Furthermore, they were also able to cope with imbalanced training data, as demonstrated by their superior performance in a pattern completion task on ambiguous input. Intriguingly, the synaptic parameters used to achieve this performance are compatible with experimental data [37, 38].

In a physical system such as a biological brain, the studied plasticity mechanism essentially comes for free, as it only requires a limited pool of synaptic resources. Together with other activity-modulating mechanisms such as neuronal adaptation, it could be a key contributor to the ability of the brain to navigate efficiently in a very-high-dimensional stimulus space. Importantly, these networks provide immediate computational advantages for spike-based neuromorphic devices, facilitating the development of efficient artificial agents that replicate the inferential capabilities of their biological archetypes.

Acknowledgments

We thank Johannes Bill for valuable discussions and comments. This research was supported by EU grants #269921 (BrainScaleS), #604102 and #720270 (Human Brain Project), the Heidelberg Graduate School of Fundamental Physics and the Manfred Stärk Foundation.

References

  • McCulloch and Pitts [1943] Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5(4):115–133, 1943.
  • Rosenblatt [1958] Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.
  • Crick [1989] Francis Crick. The recent excitement about neural networks. Nature, 337(6203):129–132, 1989.
  • Stork [1989] David G Stork. Is backpropagation biologically plausible. In International Joint Conference on Neural Networks, volume 2, pages 241–246. IEEE Washington, DC, 1989.
  • LeCun et al. [2015] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • Lillicrap et al. [2016] Timothy P Lillicrap, Daniel Cownden, Douglas B Tweed, and Colin J Akerman. Random synaptic feedback weights support error backpropagation for deep learning. Nature communications, 7, 2016.
  • Lee et al. [2016] Jun Haeng Lee, Tobi Delbruck, and Michael Pfeiffer. Training deep spiking neural networks using backpropagation. Frontiers in Neuroscience, 10, 2016.
  • Neftci et al. [2017] Emre O Neftci, Charles Augustine, Somnath Paul, and Georgios Detorakis. Event-driven random back-propagation: Enabling neuromorphic deep learning machines. Frontiers in Neuroscience, 11, 2017.
  • Petrovici et al. [2017] Mihai A Petrovici, Sebastian Schmitt, Johann Klähn, David Stöckel, Anna Schroeder, Guillaume Bellec, Johannes Bill, Oliver Breitwieser, Ilja Bytschok, Andreas Grübl, et al. Pattern representation and recognition with accelerated analog neuromorphic systems. Proceedings of the 2017 IEEE International Symposium on Circuits and Systems, 2017. URL https://arxiv.org/abs/1703.06043.
  • Zipser and Andersen [1988] David Zipser and Richard A Andersen. A back-propagation programmed network that simulates response properties of a subset of posterior parietal neurons. Nature, 331(6158):679–684, 1988.
  • Kriegeskorte [2015] Nikolaus Kriegeskorte. Deep neural networks: a new framework for modeling biological vision and brain information processing. Annual Review of Vision Science, 1:417–446, 2015.
  • Schmidhuber [2015] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural networks, 61:85–117, 2015.
  • Hinton et al. [2012] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • Desjardins et al. [2010a] Guillaume Desjardins, Aaron Courville, Yoshua Bengio, Pascal Vincent, and Olivier Delalleau. Parallel tempering for training of restricted Boltzmann machines. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 145–152. MIT Press, Cambridge, MA, 2010a.
  • Salakhutdinov [2010] Ruslan Salakhutdinov. Learning deep boltzmann machines using adaptive mcmc. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 943–950, 2010.
  • Bengio et al. [2013] Yoshua Bengio, Grégoire Mesnil, Yann Dauphin, and Salah Rifai. Better mixing via deep representations. In ICML (1), pages 552–560, 2013.
  • Fiser et al. [2010] József Fiser, Pietro Berkes, Gergő Orbán, and Máté Lengyel. Statistically optimal perception and learning: from behavior to neural representations. Trends in cognitive sciences, 14(3):119–130, 2010.
  • Jezek et al. [2011] Karel Jezek, Espen J Henriksen, Alessandro Treves, Edvard I Moser, and May-Britt Moser. Theta-paced flickering between place-cell maps in the hippocampus. Nature, 478(7368):246, 2011.
  • Hindy et al. [2016] Nicholas C Hindy, Felicia Y Ng, and Nicholas B Turk-Browne. Linking pattern completion in the hippocampus to predictive coding in visual cortex. Nature neuroscience, 19(5):665, 2016.
  • Pfeil et al. [2013] Thomas Pfeil, Andreas Grübl, Sebastian Jeltsch, Eric Müller, Paul Müller, Mihai A Petrovici, Michael Schmuker, Daniel Brüderle, Johannes Schemmel, and Karlheinz Meier. Six networks on a universal neuromorphic computing substrate. Frontiers in neuroscience, 7, 2013.
  • Schemmel et al. [2010] Johannes Schemmel, Daniel Brüderle, Andreas Grübl, Matthias Hock, Karlheinz Meier, and Sebastian Millner. A wafer-scale neuromorphic hardware system for large-scale neural modeling. In Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, pages 1947–1950. IEEE, 2010.
  • Smolensky [1986] Paul Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, DTIC Document, 1986.
  • Larochelle and Bengio [2008] Hugo Larochelle and Yoshua Bengio. Classification using discriminative restricted boltzmann machines. In Proceedings of the 25th international conference on Machine learning, pages 536–543. ACM, 2008.
  • Salakhutdinov and Hinton [2009] Ruslan Salakhutdinov and Geoffrey E Hinton. Deep boltzmann machines. In AISTATS, volume 1, page 3, 2009.
  • Dahl et al. [2010] George Dahl, Abdel-rahman Mohamed, Geoffrey E Hinton, et al. Phone recognition with the mean-covariance restricted boltzmann machine. In Advances in neural information processing systems, pages 469–477, 2010.
  • Srivastava and Salakhutdinov [2012] Nitish Srivastava and Ruslan R Salakhutdinov. Multimodal learning with deep boltzmann machines. In Advances in neural information processing systems, pages 2222–2230, 2012.
  • Petrovici et al. [2016] Mihai A Petrovici, Johannes Bill, Ilja Bytschok, Johannes Schemmel, and Karlheinz Meier. Stochastic inference with spiking neurons in the high-conductance state. Physical Review E, 94(4):042312, 2016.
  • Destexhe et al. [2003] Alain Destexhe, Michael Rudolph, and Denis Pare. The high-conductance state of neocortical neurons in vivo. Nature Reviews Neuroscience, 4:739–751, 2003.
  • Petrovici et al. [2015] Mihai A Petrovici, Ilja Bytschok, Johannes Bill, Johannes Schemmel, and Karlheinz Meier. The high-conductance state enables neural sampling in networks of lif neurons. BMC Neuroscience, 16(1):O2, 2015.
  • Buesing et al. [2011] Lars Buesing et al. Neural dynamics as sampling: a model for stochastic computation in recurrent networks of spiking neurons. PLoS Comput Biol, 7(11):e1002211, 2011.
  • Marinari and Parisi [1992] Enzo Marinari and Giorgio Parisi. Simulated tempering: a new monte carlo scheme. EPL (Europhysics Letters), 19(6):451, 1992.
  • Wang and Landau [2001] Fugao Wang and DP Landau. Efficient, multiple-range random walk algorithm to calculate the density of states. Physical review letters, 86(10):2050, 2001.
  • Tsodyks et al. [1998] Misha Tsodyks, Klaus Pawelzik, and Henry Markram. Neural networks with dynamic synapses. Neural computation, 10(4):821–835, 1998.
  • Diesmann and Gewaltig [2001] Markus Diesmann and Marc-Oliver Gewaltig. NEST: An environment for neural systems simulations. Forschung und wissenschaftliches Rechnen, Beiträge zum Heinz-Billing-Preis, 58:43–70, 2001.
  • Davison et al. [2008] Andrew P Davison, Daniel Brüderle, Jochen Eppler, Jens Kremkow, Eilif Muller, Dejan Pecevski, Laurent Perrinet, and Pierre Yger. Pynn: a common interface for neuronal network simulators. Frontiers in neuroinformatics, 2, 2008.
  • Wang et al. [2006] Yun Wang, Henry Markram, Philip H Goodman, Thomas K Berger, Junying Ma, and Patricia S Goldman-Rakic. Heterogeneity in the pyramidal network of the medial prefrontal cortex. Nature neuroscience, 9(4), 2006.
  • Costa et al. [2013] Rui P Costa, P Jesper Sjöström, and Mark CW Van Rossum. Probabilistic inference of short-term synaptic plasticity in neocortical microcircuits. Frontiers in computational neuroscience, 7, 2013.
  • Lundqvist et al. [2006] Mikael Lundqvist, Martin Rehn, Mikael Djurfeldt, and Anders Lansner. Attractor dynamics in a modular network model of neocortex. Network: Computation in Neural Systems, 17(3):253–276, 2006.
  • LeCun [1998] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
  • Breuleux et al. [2010] Olivier Breuleux, Yoshua Bengio, and Pascal Vincent. Unlearning for better mixing. Universite de Montreal/DIRO, 2010.
  • Desjardins et al. [2010b] Guillaume Desjardins, Aaron Courville, Yoshua Bengio, Pascal Vincent, and Olivier Delalleau. Tempered Markov chain Monte Carlo for training of restricted Boltzmann machines. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9, pages 145–152, 2010b.
  • Abbott and Regehr [2004] LF Abbott and Wade G Regehr. Synaptic computation. Nature, 431(7010):796, 2004.
  • Maaten and Hinton [2008] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
  • Chawla [2005] Nitesh V Chawla. Data mining for imbalanced datasets: An overview. In Data mining and knowledge discovery handbook, pages 853–867. Springer, 2005.
  • García and Herrera [2009] Salvador García and Francisco Herrera. Evolutionary undersampling for classification with imbalanced datasets: Proposals and taxonomy. Evolutionary computation, 17(3):275–306, 2009.
  • Chawla et al. [2002] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357, 2002.
  • Hinton [2010] Geoffrey Hinton. A practical guide to training restricted boltzmann machines. Momentum, 9(1):926, 2010.

Supplementary Information (SI)

1. Spiking networks

To generate our spiking sampling networks, we follow [28]. We emulate an HCS by stimulating the LIF neurons with balanced excitatory and inhibitory Poisson noise. This produces an approximately logistic activation function (Fig. 5A), parametrized by a shift \bar u_k^0 and a scaling parameter α (Eqn. 5). These parameters can be used to translate synaptic interaction strengths from the Boltzmann domain to synaptic conductances:

(9)

where w_{kj} denotes the peak synaptic conductance (see Eqn. 4), C_m the membrane capacitance, W_{kj} the abstract Boltzmann weight, E^{rev}_{kj} the corresponding reversal potential, μ the mean free membrane potential, τ_syn the synaptic time constant and τ_eff the (mean) effective membrane time constant. This corresponds to a match of the average interaction during the refractory period of the presynaptic neuron (Fig. 5B). This setup allows accurate sampling from target Boltzmann distributions (Fig. 5C,D).

To speed up simulations, we used an effective current-based (CUBA) model to replace the conductance-based (COBA) one. Fig. 5E shows a comparison between the two models. Under appropriate parametrization, the required background input rates could be reduced considerably.

Table 1: Neuron parameters for the COBA and CUBA models: membrane capacitance, membrane time constant, refractory time constant, synaptic time constant, threshold voltage, reset potential, and (COBA only) excitatory and inhibitory reversal potentials.
Figure 5: (A) Activation function of an LIF neuron in the HCS and logistic fit. (B) Sketch of the synaptic weight translation (Eqn. 9). (C) Sampled distribution of a fully connected 4-neuron LIF network vs. target distribution. (D) Evolution of the Kullback-Leibler divergence between sampled and target distribution for 5 different random seeds. (E) Free membrane potential of a biologically plausible COBA LIF neuron in the HCS compared to an equivalent CUBA LIF neuron (parameters given in Tab. 1).

2. Training

To speed up training, we used RBMs with binary units, followed by a mapping of the resulting parameters to the spiking-network domain as described above. As a learning algorithm, we used the coupled adaptive simulated tempering (CAST) method [16]. In CAST, two instances of the RBM are simulated in parallel, with one of them staying at the base inverse temperature β = 1 and the other one using AST for mixing. In AST, states are updated by Gibbs sampling from the tempered distribution at the current inverse temperature. After each state update, the temperature is itself updated by an adaptive rule that ensures the algorithm spends a roughly equal amount of time at each temperature (Tab. 2).

1: Given adaptive weights g_k and an initial state z_1 at inverse temperature β_1 = 1 (k_1 = 1):
2: for t = 1 to T (number of iterations) do
3:  Given k_t, sample a new state z_{t+1} from the tempered distribution at β_{k_t} by Gibbs sampling.
4:  Given k_t, sample a candidate index k' from the proposal distribution q(k' | k_t). Accept k_{t+1} = k' with the Metropolis probability for the jointly sampled state and temperature index, weighted by the adaptive factors g_k.
5:  Update the adaptive adjusting factors g_k such that rarely visited temperatures become more likely to be accepted.
6: end for
7: Collect data: obtain (dependent) samples from the target distribution by keeping only those z_t with β_{k_t} = 1.
Table 2: Adaptive simulated tempering
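A compact Python sketch of the AST loop in Table 2 is given below. The energy function, the Gibbs update and all constants are placeholders, and the acceptance rule and weight update follow the standard Wang-Landau-style formulation, which may differ in detail from the implementation of [16].

import numpy as np

rng = np.random.default_rng(0)

def ast(energy, gibbs_step, z0, betas, n_iter, gamma=10.0):
    # Adaptive simulated tempering; betas[0] = 1 is the base inverse temperature.
    K, log_g = len(betas), np.zeros(len(betas))     # adaptive weights in the log domain
    z, k, samples = z0, 0, []
    for t in range(1, n_iter + 1):
        z = gibbs_step(z, betas[k])                 # Gibbs update at the current temperature
        k_prop = min(max(k + rng.choice([-1, 1]), 0), K - 1)   # propose a neighboring index
        log_a = (betas[k] - betas[k_prop]) * energy(z) + log_g[k] - log_g[k_prop]
        if np.log(rng.random()) < log_a:            # Metropolis acceptance for the swap
            k = k_prop
        log_g[k] += np.log1p(gamma / t)             # adaptive weight update with decaying rate
        if k == 0:                                  # keep only samples gathered at beta = 1
            samples.append(np.copy(z))
    return samples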

The training hyperparameters (number of epochs, batch size, learning rate) were based on suggestions from previous work [48] and empirical experience, and were chosen individually for the bar example (Fig. 2), the full MNIST example (Fig. 3), the first example of an imbalanced dataset (Fig. 4A-C) and the example of pattern completion from an imbalanced dataset (Fig. 4D-E). For all datasets, we used 20 equidistant inverse temperatures. The adaptive weights were initialized to 1 for all temperatures; the adjusting factors were gradually decreased so that the adaptive weights converge. For both imbalanced-dataset experiments, we used networks with 784 visible, 10 label and 400 hidden units.

3. Indirect sampling likelihood

To obtain a quantitative comparison of mixing between different sampling procedures, we used the indirect sampling likelihood (ISL) method [41, 42]. The method constructs a non-parametric density estimator to evaluate how close each test example is to any of the generated examples. The likelihood of a test sample x given a series of generated samples s_1, …, s_N is defined as

p(x \mid s_1, \dots, s_N) = \frac{1}{N} \sum_{i=1}^{N} \prod_{j=1}^{d} \beta^{\,\mathbb{1}[x_j = s_{i,j}]} \, (1 - \beta)^{\,\mathbb{1}[x_j \neq s_{i,j}]} , (10)

where N is the number of generated samples, d is the dimension of x and s_i, and β is a hyperparameter which controls the gain (β) and punishment (1 − β) to the likelihood when a component of the test sample matches or differs from the corresponding component of the generated sample. We set β to optimize the likelihood; other values of β would rescale the likelihood without causing qualitative differences.
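The estimator of Eqn. 10 translates directly into a few lines of NumPy for binary samples; the log-sum-exp is used for numerical stability, and the value beta = 0.95 is an arbitrary placeholder rather than the setting used for Fig. 3.

import numpy as np
from scipy.special import logsumexp

def isl_log_likelihood(test, generated, beta=0.95):
    # Mean ISL log-likelihood of binary test samples given generated samples (Eqn. 10).
    test = np.asarray(test, dtype=float)               # shape (n_test, d)
    generated = np.asarray(generated, dtype=float)     # shape (N, d)
    d = test.shape[1]
    matches = test @ generated.T + (1 - test) @ (1 - generated).T   # matching components per pair
    log_p = matches * np.log(beta) + (d - matches) * np.log(1.0 - beta)
    return np.mean(logsumexp(log_p, axis=1) - np.log(generated.shape[0]))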

In Fig. 3B, we plot the mean log-likelihood of 2000 samples from the test set against the number of generated samples. The faster increase of the ISL curve for the spiking network is due to better mixing, as the generated samples cover the main modes of the test samples faster (Fig. 3D,E). To provide a frame of reference, we also plotted two additional ISL curves. The POM (product of marginals) sampler generated images by sampling each pixel individually from its intensity distribution over the entire training set. This sampler preserves the marginal probability distribution of each pixel, but discards any further structure of the image (encoded in correlations between pixel intensities). The OPT (optimal) sampler started out with a base set of images generated with AST, from which it randomly picked images sequentially. This guarantees optimal mixing for the underlying model, because the base set covers all main modes of the state space and consecutive samples are uncorrelated.

4. t-distributed stochastic neighbor embedding

The tSNE method [44] finds a low-dimensional map for a high-dimensional data set, in which the similarity between samples is reflected by their distances in the low-dimensional map. Here, we projected the generated digits to a plane to provide an intuitive understanding of the network dynamics and the mixing between different modes (digit classes).

The Euclidean distances between high-dimensional samples x_i are converted into symmetric pairwise similarities

p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n} , (11)

where n is the number of samples and p_{j|i} is a conditional probability:

p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)} , (12)

with variance σ_i^2, which is determined by first defining a so-called perplexity value as the effective number of neighbors of a data point, and then running a binary search. For the low-dimensional points y_i and y_j mapped from the high-dimensional data points x_i and x_j, the similarity q_{ij} is defined using a t-distribution with one degree of freedom:

q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}} . (13)

If the mapped points correctly model the similarity between the high-dimensional data points, the similarities p_{ij} and q_{ij} will be equal.

With this motivation, tSNE minimizes the sum of Kullback-Leibler divergences over all data points using a gradient descent method. The cost function is given by

C = \mathrm{KL}(P \,\Vert\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} . (14)

Its gradient with respect to the map point y_i can then be derived to provide an update of the mapping:

\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij}) (y_i - y_j) \left(1 + \lVert y_i - y_j \rVert^2\right)^{-1} . (15)
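For reference, the low-dimensional similarities of Eqn. 13 and the gradient of Eqn. 15 can be written compactly in NumPy as sketched below; the perplexity-based computation of the p_{ij} (Eqns. 11 and 12) is omitted, and in practice an off-the-shelf implementation such as scikit-learn's TSNE class can be used instead of this bare-bones update.

import numpy as np

def tsne_gradient(P, Y):
    # Gradient of the KL cost (Eqn. 14) with respect to the map points Y (Eqn. 15).
    diff = Y[:, None, :] - Y[None, :, :]                 # pairwise differences y_i - y_j
    inv_dist = 1.0 / (1.0 + np.sum(diff ** 2, axis=2))   # (1 + ||y_i - y_j||^2)^(-1)
    np.fill_diagonal(inv_dist, 0.0)
    Q = inv_dist / inv_dist.sum()                        # low-dimensional similarities (Eqn. 13)
    return 4.0 * np.einsum('ij,ijk->ik', (P - Q) * inv_dist, diff)

# Hypothetical usage: plain gradient descent on randomly initialized 2D map points.
rng = np.random.default_rng(0)
n = 50
P = rng.random((n, n))
P = P + P.T
np.fill_diagonal(P, 0.0)
P /= P.sum()                                             # placeholder joint similarities
Y = 1e-2 * rng.standard_normal((n, 2))
for _ in range(100):
    Y -= 100.0 * tsne_gradient(P, Y)                     # placeholder learning rate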