An adversarial algorithm for variational inference with a new role for acetylcholine

by   Ari S. Benjamin, et al.
University of Pennsylvania

Sensory learning in the mammalian cortex has long been hypothesized to involve the objective of variational inference (VI). Likely the most well-known algorithm for cortical VI is the Wake-Sleep algorithm (Hinton et al. 1995). However, Wake-Sleep problematically assumes that neural activities are independent given lower layers during generation. Here, we construct a VI system that is both compatible with neurobiology and avoids this assumption. The core of the system is a wake-sleep discriminator that classifies network states as inferred or self-generated. Inference connections learn by opposing this discriminator. This adversarial dynamic solves a core problem within VI, which is to match the distribution of stimulus-evoked (inference) activity to that of self-generated activity. Meanwhile, generative connections learn to predict lower-level activity as in standard VI. We implement this algorithm and show that it can successfully train the approximate inference network for generative models. Our proposed algorithm makes several biological predictions that can be tested. Most importantly, it predicts a teaching signal that is remarkably similar to known properties of the cholinergic system.




1 Introduction

Variational inference is an objective for unsupervised representation learning and, for much of its long history, a hypothesis for how the sensory cortex might learn (Hinton and Sejnowski, 1983; Hinton and Ghahramani, 1997; Dayan et al., 2001). This objective specifies that feedback connections should model lower-layer activity, while feedforward connections should infer the posterior distribution of higher layers that could have generated lower layers. The appeal of the VI hypothesis stems from several distinct perspectives, each of which is supported by an extensive literature. Principal among these are analysis-by-synthesis theories of perception, efficient coding, and predictive processing.

Analysis-by-synthesis holds that human perceptions reflect inferences of the causes that could have led to sensory data, rather than the data itself (Yuille and Kersten, 2006; Mumford, 1994). This perception-as-inference view has experimental support from both perceptual studies (e.g. Körding et al. (2007); Kleinschmidt and Jaeger (2015); Dasgupta et al. (2020)) and neurophysiological studies (Berkes et al., 2011). Instead of defining causes as objective things in the world, analysis-by-synthesis goes further to state that causes and the way they relate to inputs must be learned as well through some internal generative model. Variational inference is a strategy to train both the inference and generative models of an analysis-by-synthesis system (Hinton and Sejnowski, 1983).

Variational inference is also a way to create a representational code for inputs in which the average number of bits required to communicate the neural state is as small as possible (Hinton and Zemel, 1994). The idea of efficient neural coding predated variational inference (Barlow and others, 1961), but since the 1990s a number of studies have argued that neural representations are indeed efficient in this sense (Rieke et al., 1995; Olshausen and Field, 1996; Bell and Sejnowski, 1997; Vinje, 2002; Weliky et al., 2003; Harper and McAlpine, 2004). The way VI achieves efficient representations is with the generative connections; only the bits representing error between feedback predictions and stimulus-driven, bottom-up activities need be communicated (Hinton et al., 1995).

Finally, there is now great interest in the idea that feedback connections represent predictions of lower layers (Keller and Mrsic-Flogel, 2018). Compelling instances of predictive processing are those situations in which neurons that respond classically to one sensory domain, such as sound, develop dependencies upon cross-modal signals that are generally predictive, such as muscle activity (Fiser et al., 2016; Schneider et al., 2018; Attinger et al., 2017). Variational inference provides one strategy by which predictive feedback connections could be learned.

Despite the interest from these interrelated perspectives, it is still unclear whether the cortex actually learns via VI. What is missing is a mechanistic explanation of how this objective is attained through a biologically implemented algorithm.

The Wake-Sleep (WS) algorithm is perhaps the best-known example of an algorithm for variational inference in multilayer networks that is biologically plausible (Hinton et al., 1995). To sample from the generated distribution, WS introduces an offline, ‘fantasizing’ phase. During this phase the inference connections learn by trying to predict the higher-layer activity that generated lower layer activity. However, this strategy assumes that neurons fire independently given the lower-layer activity they generated. This assumption is not in general correct and ignores ‘explaining away’ phenomena. While many other algorithms for variational inference have been introduced that address problems with WS (Mnih and Gregor, 2014; Paisley et al., 2012; Bornschein and Bengio, 2014; Kingma and Welling, 2013; Rezende et al., 2014), it is controversial how the brain might implement these algorithms.

Since VI requires that neural activity during the inference phase should have the same distribution as in the generating phase, we suggest in this work that these distributions could be matched adversarially. Like the Wake-Sleep algorithm, this algorithm would operate in alternating online and offline phases. In addition to a network with bottom-up and top-down connections, there would be a second population whose role is to classify activity as inferred or generated. The inference connections change to trick this discriminator, while the generative connections maximize the log-likelihood of lower layers given higher layers during inference. We call this the Adversarial Wake-Sleep algorithm.

We then argue that a biological implementation of the wake-sleep discriminator would look similar to the cholinergic system. In addition to being anatomically similar to what would be required, projecting across the sensory cortex (Liu et al., 2015), acetylcholine has profound effects on representation learning in the critical period of development (Kilgard and Merzenich, 1998). Like acetylcholine, the wake-sleep discriminator’s output broadly resembles unfamiliarity and uncertainty about causes (Yu and Dayan, 2005). This theory does not explain all of acetylcholine’s varied effects, which range from attention to stress, but we find it plausible enough to merit closer scrutiny. We suggest an experiment that could test the interpretation.

2 Preliminaries

2.1 Setting and notation

We consider a noisy multilayer neural network whose hidden state at each layer $\ell$ we denote as the vector $\mathbf{h}_\ell$, with the input $\mathbf{x} = \mathbf{h}_0$. In the current work we only consider networks without skip connections. We define the inference network $F$ as a set of feedforward edges from lower layers to higher layers. The edges in $F$ define a conditional probability distribution over hidden states,

$$q(\mathbf{h}_1, \ldots, \mathbf{h}_L \mid \mathbf{x}) = \prod_{\ell=1}^{L} q(\mathbf{h}_\ell \mid \mathbf{h}_{\ell-1}).$$

We also define a set of feedback edges from higher layers to lower layers. These define the generative network $G$. When the top layer $\mathbf{h}_L$ is set to a particular value (a sample from some fixed prior $p(\mathbf{h}_L)$), the edges in $G$ define the conditional distribution over the lower layers and input $\mathbf{x}$,

$$p(\mathbf{x}, \mathbf{h}_1, \ldots, \mathbf{h}_{L-1} \mid \mathbf{h}_L) = \prod_{\ell=1}^{L} p(\mathbf{h}_{\ell-1} \mid \mathbf{h}_\ell).$$

Note that while $F$ and $G$ are different networks, they are both directed sets of edges on the same nodes.

The algorithm is general to discrete and continuous $\mathbf{h}_\ell$, as well as to the form of noise injected into all $\mathbf{h}_\ell$ in both inference and generation. In applications where backpropagation will be used, we are restricted to continuous latent variables and reparameterizable noise families.
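The setting above can be sketched concretely. The following toy numpy model (illustrative only; the layer widths, ReLU-Gaussian noise model, and function names are our assumptions, not the paper's code) builds a small noisy network with separate feedforward ($F$) and feedback ($G$) edge sets on the same nodes:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [8, 6, 4]  # widths of x = h_0, hidden h_1, and top layer h_2
W_f = [rng.normal(0, 0.1, (sizes[l + 1], sizes[l])) for l in range(2)]  # feedforward edges F
W_g = [rng.normal(0, 0.1, (sizes[l], sizes[l + 1])) for l in range(2)]  # feedback edges G
sigma = 0.1  # isotropic Gaussian noise injected at every layer

def infer(x):
    """Sample (h_1, ..., h_L) from q(h | x): propagate up through F, noising each layer."""
    hs = [x]
    for W in W_f:
        mean = np.maximum(W @ hs[-1], 0.0)  # noisy ReLU layer
        hs.append(mean + sigma * rng.normal(size=mean.shape))
    return hs

def generate():
    """Sample (x, h) from p: draw the top layer from a fixed prior, propagate down through G."""
    hs = [rng.normal(size=sizes[-1])]  # standard-normal prior on the top layer
    for W in reversed(W_g):
        mean = np.maximum(W @ hs[0], 0.0)
        hs.insert(0, mean + sigma * rng.normal(size=mean.shape))
    return hs
```

Both functions return a full network state, the object the wake-sleep discriminator will later classify.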

2.2 Background: variational inference

The goal of VI is to obtain a generative model for which observations are ‘minimally surprising’, in the sense that the model maximizes the average log-likelihood of data drawn from the data distribution $r(\mathbf{x})$:

$$\max_{G} \; \mathbb{E}_{\mathbf{x} \sim r(\mathbf{x})} \left[ \log p(\mathbf{x}) \right].$$

This is equivalent to obtaining a generative model that minimizes $D_{KL}\!\left( r(\mathbf{x}) \,\|\, p(\mathbf{x}) \right)$, the Kullback–Leibler divergence between the actual and generated distributions over inputs. Directly evaluating this objective problematically requires marginalizing over all hidden states: $p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{h}) \, d\mathbf{h}$.
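The marginalization above can be made concrete: even for a small binary model, evaluating $\log p(\mathbf{x})$ exactly requires summing over every hidden configuration. A minimal sketch (toy model, not from the paper) of that exponential-cost sum:

```python
import numpy as np
from itertools import product

def log_px(x, n_hidden, log_joint):
    """Exact log p(x) = log sum_h p(x, h) for binary h. The sum has 2^n_hidden
    terms, which is exactly the marginalization VI is designed to sidestep."""
    logs = [log_joint(x, np.array(h)) for h in product([0, 1], repeat=n_hidden)]
    m = max(logs)
    return m + np.log(sum(np.exp(l - m) for l in logs))  # stable log-sum-exp
```

For even a modest 100-unit hidden layer this sum has $2^{100}$ terms, which is why an approximate inference network is needed instead.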

VI uses the inference model to train $G$ despite this central difficulty (see Dayan et al. (2001); Bishop (2006); MacKay (2003) for an introduction). The inference model is meant to approximate the "inverse" of $G$, in the sense that $q(\mathbf{h} \mid \mathbf{x})$ should approximate the posterior $p(\mathbf{h} \mid \mathbf{x})$ for any $\mathbf{x}$ in the data. The combined objective that $F$ and $G$ must maximize is:

$$\mathcal{L}(\mathbf{x}) = \mathbb{E}_{\mathbf{h} \sim q(\mathbf{h} \mid \mathbf{x})} \left[ \log p(\mathbf{x}, \mathbf{h}) - \log q(\mathbf{h} \mid \mathbf{x}) \right]. \tag{1}$$
This expression is often called the evidence lower-bound (ELBO) because it can be written as

$$\mathcal{L}(\mathbf{x}) = \log p(\mathbf{x}) - D_{KL}\!\left( q(\mathbf{h} \mid \mathbf{x}) \,\|\, p(\mathbf{h} \mid \mathbf{x}) \right),$$

which, since the KL divergence is nonnegative, is a lower bound to $\log p(\mathbf{x})$. The negative of the ELBO is also sometimes called the ‘variational free energy’ due to its alternative expression as the inference distribution’s energy (under the generative model) minus its entropy:

$$-\mathcal{L}(\mathbf{x}) = \mathbb{E}_{\mathbf{h} \sim q(\mathbf{h} \mid \mathbf{x})} \left[ -\log p(\mathbf{x}, \mathbf{h}) \right] - H\!\left[ q(\mathbf{h} \mid \mathbf{x}) \right].$$

Training $G$ to maximize $\mathcal{L}$ is straightforward (given $F$) in a multilayer network. At each layer the generative connections change to maximize the log-likelihood of lower layers given upper layers. For later reference we define this layerwise objective as $\mathcal{L}^{G}_{\ell}$:

$$\mathcal{L}^{G}_{\ell} = \mathbb{E}_{(\mathbf{x}, \mathbf{h}) \sim q} \left[ \log p(\mathbf{h}_{\ell-1} \mid \mathbf{h}_{\ell}) \right]. \tag{2}$$
As we discuss further in Section 6, this update rule is local and biologically plausible.
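For a linear-Gaussian generative layer, one ascent step on this layerwise log-likelihood reduces to a local delta rule on the prediction error, which is what makes the update biologically plausible. A numpy sketch under that assumption (illustrative, not the paper's code):

```python
import numpy as np

def generative_update(W, h_upper, h_lower, lr=0.1):
    """One ascent step on the layerwise objective log p(h_lower | h_upper) for a
    linear-Gaussian generative layer. Maximizing the Gaussian log-likelihood is
    minimizing ||h_lower - W @ h_upper||^2, giving a local delta rule."""
    err = h_lower - W @ h_upper  # dendritic prediction error at the lower layer
    return W + lr * np.outer(err, h_upper)
```

Each weight change depends only on the presynaptic (upper-layer) activity and the local prediction error, consistent with the dendritic-prediction interpretation discussed in Section 6.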

3 Adversarial Wake-Sleep

3.1 Variational inference as distribution matching

Relative to the ease of training $G$, it is a much harder subproblem to train $F$ given $G$. There are two main approaches in the VI literature. The first is to try to calculate the gradient of $\mathcal{L}$ directly with respect to the parameters of $F$. This is possible but requires reducing the variance through control-variate techniques (Paisley et al., 2012; Mnih and Gregor, 2014) or reparameterization of the expectations over $\mathbf{h}$ (Kingma and Welling, 2013; Rezende et al., 2014). The key characteristic of this strategy is that it is performed online and does not require an offline, purely generative "Sleep" phase. This is one possible strategy as to how the cortex might implement VI (Rezende and Gerstner, 2014).

An alternative approach to VI uses alternating online and offline phases. As noted by the Wake-Sleep algorithm, one can easily ‘fantasize’ samples from the joint distribution $p(\mathbf{x}, \mathbf{h})$ by choosing the top layer from the prior $p(\mathbf{h}_L)$ and propagating through the rest of the network. Instead of matching the generative posterior for a particular sample, we can instead match the joint distributions $q(\mathbf{x}, \mathbf{h})$ and $p(\mathbf{x}, \mathbf{h})$. If these can be matched, all marginals and conditionals will be matched as well.

To see the equivalence of these objectives, consider the expectation of the ELBO over the data distribution $r(\mathbf{x})$:

$$\mathbb{E}_{\mathbf{x} \sim r} \left[ \mathcal{L}(\mathbf{x}) \right] = \mathbb{E}_{\mathbf{x} \sim r} \, \mathbb{E}_{\mathbf{h} \sim q(\mathbf{h} \mid \mathbf{x})} \left[ \log p(\mathbf{x}, \mathbf{h}) - \log q(\mathbf{h} \mid \mathbf{x}) \right].$$

To this expression let us add the entropy of the data, $H[r(\mathbf{x})]$, which does not depend on $F$ or $G$. Incorporating this into the expectations, and for convenience denoting the joint distribution of real $\mathbf{x}$ and inferred $\mathbf{h}$ as $q(\mathbf{x}, \mathbf{h}) = r(\mathbf{x}) \, q(\mathbf{h} \mid \mathbf{x})$, we obtain a new objective with the same stationary points for $F$ and $G$ that is the negative KL divergence between the joint distributions:

$$\mathbb{E}_{\mathbf{x} \sim r} \left[ \mathcal{L}(\mathbf{x}) \right] + H[r(\mathbf{x})] = -D_{KL}\!\left( q(\mathbf{x}, \mathbf{h}) \,\|\, p(\mathbf{x}, \mathbf{h}) \right).$$

3.2 An adversarial wake-sleep strategy for matching the joint distributions

Our proposal is to match $q(\mathbf{x}, \mathbf{h})$ and $p(\mathbf{x}, \mathbf{h})$ adversarially. We introduce a discriminator, or series of discriminators, that sees the entire network state. The discriminator tries to classify the state as inference-like or generative-like, and the inference network (but not the generative network) changes to trick it. Generative connections maximize the log-likelihood of lower layers (Eq. 2).

The discriminator, which we denote as $D$, maps the state $(\mathbf{x}, \mathbf{h})$ to a single value. The objective upon $D$ depends on which notion of distance between distributions one would like to minimize. VI requires minimizing the KL divergence between $q(\mathbf{x}, \mathbf{h})$ and $p(\mathbf{x}, \mathbf{h})$. However, given the fragility of adversarial learning, we argue one should choose a metric that better facilitates training. Here we take the Wasserstein GAN formulation and minimize the Wasserstein-1 distance between the inference and generative joint distributions (Arjovsky et al., 2017). In this case, $D$ simply tries to increase its output during inference and decrease it during generation. Its overall objective is:

$$\max_{D} \; \mathbb{E}_{(\mathbf{x}, \mathbf{h}) \sim q} \left[ D(\mathbf{x}, \mathbf{h}) \right] - \mathbb{E}_{(\mathbf{x}, \mathbf{h}) \sim p} \left[ D(\mathbf{x}, \mathbf{h}) \right].$$

An additional constraint upon $D$ in a Wasserstein GAN (WGAN) is that it is 1-Lipschitz continuous; in experiments we apply a gradient penalty throughout learning (Gulrajani et al., 2017).

While the discriminator maximizes this objective, the inference connections minimize it. This is to say that $F$ minimizes the expected output of $D$ during inference, $\mathbb{E}_{(\mathbf{x}, \mathbf{h}) \sim q}\left[ D(\mathbf{x}, \mathbf{h}) \right]$.

As in the original Wake-Sleep algorithm, the above objectives can be calculated and optimized in two separate phases. In the online phase, a batch of examples is run through the inference network to obtain samples of $\mathbf{h}$ from $q$; $G$ is updated to predict lower layers, $F$ is updated to decrease the output of $D$, and $D$ is updated to increase its own output. In the offline phase, samples are ‘fantasized’ from $p$, and $D$ learns to decrease its output. We present the step-by-step algorithm in the Appendix.

3.2.1 Reducing the input dimension for $D$

In the formulation above, the discriminator has an input dimension equal to the entire space of activations of the network. This can be decreased by taking into account the structure of the network on which $F$ and $G$ are defined. If this network has no skip connections, $F$ and $G$ define a Markov chain of layer transformations. The joint probability distribution can be factored by layer:

$$q(\mathbf{x}, \mathbf{h}) = r(\mathbf{x}) \prod_{\ell=1}^{L} q(\mathbf{h}_\ell \mid \mathbf{h}_{\ell-1}), \qquad p(\mathbf{x}, \mathbf{h}) = p(\mathbf{h}_L) \prod_{\ell=1}^{L} p(\mathbf{h}_{\ell-1} \mid \mathbf{h}_\ell).$$

If the joint distributions between every two adjacent layers are matched between inference and generation, each factored conditional will be matched as well. Thus, as displayed in Figure 1, a separate sub-discriminator can be used for each pair of adjacent layers. The dimensionality of the inputs of each scales with the width of the network but not its depth.
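A minimal sketch of this factorized discriminator, assuming linear sub-discriminators (our simplification; the paper's experiments use dense 2-layer discriminators):

```python
import numpy as np

def make_sub_discriminators(sizes, rng):
    """One linear sub-discriminator per adjacent pair of layers. Each input
    dimension scales with the two layers' widths, not with network depth."""
    return [rng.normal(0, 0.1, sizes[l] + sizes[l + 1]) for l in range(len(sizes) - 1)]

def total_D(ws, hs):
    """Full discriminator output: the sum of sub-discriminator outputs over
    every adjacent pair of layers of the network state hs."""
    return sum(float(w @ np.concatenate([hs[l], hs[l + 1]])) for l, w in enumerate(ws))
```

Summing the sub-discriminator outputs preserves the single-scalar interface of Section 3.2 while each sub-discriminator sees only two layers.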

3.2.2 Application to discrete hidden states

Suppose the units in the neural network can take one of a few discrete values, as for the binary on-or-off neurons in a Helmholtz machine and the original Wake-Sleep algorithm. In this case one cannot take the derivative of $D$'s output with respect to the inference weights. Instead, we can apply the trick of REINFORCE (Williams, 1992). The gradient of the objective with respect to the inference parameters $\phi_\ell$ at layer $\ell$ becomes:

$$\nabla_{\phi_\ell} \, \mathbb{E}_{q} \left[ D(\mathbf{x}, \mathbf{h}) \right] = \mathbb{E}_{q} \left[ D(\mathbf{x}, \mathbf{h}) \, \nabla_{\phi_\ell} \log q(\mathbf{h}_\ell \mid \mathbf{h}_{\ell-1}) \right].$$
This is exactly analogous to REINFORCE but with the discriminator replacing reward. To be usable in large networks, however, the variance of this estimator would have to be reduced considerably.
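A sketch of this estimator for a single layer of Bernoulli units, with the discriminator output standing in for the REINFORCE reward (the factorial-Bernoulli parameterization and function names are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_grad(x, W, D):
    """Score-function (REINFORCE) estimate of the gradient of E_q[D] for one
    layer of binary stochastic units, where q(h | x) is factorial Bernoulli
    with logits W @ x and the discriminator output D(x, h) acts as the reward."""
    p = 1.0 / (1.0 + np.exp(-(W @ x)))            # firing probabilities
    h = (rng.random(p.shape) < p).astype(float)   # sampled binary state
    # grad_W log q(h | x) = (h - p) x^T for Bernoulli units with logits W @ x
    return D(x, h) * np.outer(h - p, x), h
```

Averaging this single-sample estimate over a batch gives the update; as noted above, its variance would need substantial reduction (e.g. with baselines) to be usable in large networks.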

4 Experiments

We tested several implementations to explore when Adversarial Wake-Sleep works and when it could be beneficial in generative modeling. Overall, we found that when applied as the sole learning objective, the algorithm works but is quite fragile to optimization and stabilization methods. However, it is easily stabilized by adding other unsupervised objectives beyond VI.

Figure 1: Generating images on a DCGAN architecture with Adversarial Wake-Sleep. A) Basic architecture. Images are passed up, or noise is passed down, the convolutional architecture. Dense 2-layer wake-sleep discriminators read out from pairs of layers in either stage. Optionally, generated images can be passed up and a readout discriminator estimates whether the top layer was induced by a real or generated image (Section 4.2). B) MNIST digits generated by standard Adversarial Wake-Sleep, in which generative connections learn only by maximizing the likelihood of lower-level inference activations. C) Adding the readout discriminator allows much better generation; here the GAN/VI interpolation parameter was set to an intermediate value (Section 4.2.1). D) CIFAR-10 generation with an intermediate interpolation parameter.

4.1 Adversarial wake-sleep as the sole objective

We tested the algorithm at training a DCGAN (Radford et al., 2015) to generate 32x32 MNIST digits and CIFAR-10 images (Figure 1). This is a fully-convolutional, 5-layer network with ReLU activations to which we add isotropic Gaussian noise. Additional training details can be found in the Appendix.

As a point of comparison, the original Wake-Sleep algorithm did not converge to produce digits in this continuous setting with Gaussian noise models (32x32 MNIST digits, continuous-valued pixels, and ReLU activations), despite extensive experimentation (detailed in the Appendix). This was not because of the convolutional operators, as WS did produce digit-like images on a stochastic binary Helmholtz machine in a DCGAN-like configuration.

Adversarial Wake-Sleep does produce digits on 32x32 MNIST (Fig. 1B). This serves as a basic proof of concept for its feasibility. However, we found the algorithm to be quite unstable and as a result the generated images are far from state-of-the-art in diversity and quality. The most common type of failure was mode collapse in the inference network, which then prevented the generator from learning successfully (see Appendix for an example). We found that popular stabilization tricks were essential. The local divisive normalization operator of Karras et al. (2017), for example, greatly helped stabilize training. Incidentally, divisive normalization has long been known to be implemented in the sensory cortex (Heeger, 1992). If this algorithm is implemented in the cortex, it is likely that other properties of cortical networks act to stabilize training, as well.

4.2 Reusing the inference network as a discriminator

Adversarial Wake-Sleep can be stabilized with a small architectural addition: a linear readout from the top layer. The role of this readout discriminator is to determine whether the inference network has processed a real image or a generated image.

This addition preserves the architecture and generative goal but modifies the approach. In addition to maximizing the log-likelihood of lower inference layers, the generator now also tries to produce images that the readout discriminator classifies as real. This requires modifying the Sleep phase such that generated images are passed back up through the inference network. The inference network, in addition to its Adversarial Wake-Sleep objective, also now helps the readout discriminator by trying to map real and generated images to linearly separable subspaces.

The idea that an inference network can also serve as a discriminator on images precedes this paper (Ulyanov et al., 2018; Brock et al., 2016). However, here the inference network maps to the entire generative network state that might have led to an image, not just the top-level latent state $\mathbf{h}_L$. The idea can be imagined as a GAN in which the discriminator also specifies, approximately, the posterior distribution over generative activities.

One perspective that can help understand why mixing GAN and VI objectives might work is an energy-based perspective, in analogy to learning with Contrastive Divergence (Hinton, 2002). By attempting to treat generated and real samples as differently as possible (in the eyes of the readout discriminator) while also lowering the energy of real samples, the network effectively raises the energy of generated samples.

This approach is precisely a simple addition of a VI and a GAN objective, and might be called Doubly Adversarial Wake-Sleep. We will write the GAN objective in the WGAN form:

$$\mathcal{L}_{GAN} = \mathbb{E}_{\mathbf{x} \sim r} \left[ d(\mathbf{x}) \right] - \mathbb{E}_{\mathbf{x} \sim p} \left[ d(\mathbf{x}) \right],$$

where $d$ denotes the readout discriminator applied to the top layer of the inference network. Let $\gamma \in [0, 1]$ be a hyperparameter that controls the relative influence of the VI and GAN objectives, with a value of 1 being pure Adversarial Wake-Sleep and 0 being a GAN. Then the overall objective is the convex combination of the two, weighted by $\gamma$ and $1 - \gamma$, with each network ascending or descending it according to its adversarial role.

This addition stabilized Adversarial Wake-Sleep to competitive levels while still preserving a biologically plausible architecture and wake-sleep dynamic. When trained on MNIST and CIFAR-10, the algorithm generates images of good quality (Fig. 1C, D).

4.2.1 The effect of varying $\gamma$

In Figure 2 we train a DCGAN to generate CIFAR-10 images while interpolating between the GAN and VI objectives using $\gamma$. Relative to a standard WGAN-GP ($\gamma = 0$), adding the inference requirement to the discriminator does not harm generative performance. As measured by the FID score (Heusel et al., 2017), the quality of generated images remains high (Fig. 2B). Furthermore, increasing $\gamma$ progressively introduces the required layerwise autoencoding relationship, as can be seen by passing test images up one layer of $F$ and back down one layer of $G$ and calculating the reconstruction error (Fig. 2A). These results show that a single feedforward network can learn to be a (variational) inference network while still being useful as a discriminator.

Figure 2: Interpolating between a GAN and a VI objective, and in doing so incrementally turning the discriminator into an approximate inference network. The network is a DCGAN architecture trained to generate CIFAR-10 images, and all runs are with the random seed set to 0. A) Increasing $\gamma$ makes the discriminator more like an inference network, here measured by the error between each inference layer and the generative prediction after passing up one layer, for CIFAR-10 test images. B) Increasing $\gamma$ does not harm the generated images' quality and diversity as measured by the FID score. C) The linear decodability of class labels from layer 4 of the inference network, as measured with a linear SVM trained with 10-fold cross-validation on the test set. The envelope represents the standard deviation across folds.

5 Related work in generative modeling

Several generative modeling papers have approached representation learning by adversarially matching joint distributions of latent vectors and inputs. Both the Adversarially Learned Inference (ALI) and BiGAN algorithms propose a third network that discriminates pairs of data and inferred latents from pairs of latents and the data they generated, and train the generator and inference network to trick this discriminator (Donahue et al., 2016; Dumoulin et al., 2016). This approach has been extended in various ways (Pu et al., 2017; Srivastava et al., 2017; Donahue and Simonyan, 2019). This setting differs from ours in that the latent vector is not the vector of activations of the entire generator/inference network, but only the noise vector that the generator takes as input. This means that generative connections cannot be trained to predict lower-layer activations, as in VI; in ALI and BiGAN all weight changes are due to gradients that pass through the discriminator.

A step toward our algorithm is taken by the hierarchical approaches of Belghazi et al. (2018) and Huang et al. (2017). Unlike in ALI and BiGAN, the latent vector is multileveled; separate discriminators match the latent vector's distribution at every level of the hierarchy. However, in these algorithms the generator network trains against the discriminator and not in a layerwise maximum-likelihood fashion.

The idea of re-using the discriminator in a GAN as the inference network has also been proposed in several variants (Brock et al., 2016; Ulyanov et al., 2018; Huang et al., 2018; Munjal et al., 2019; Bang et al., 2020). These studies show that a GAN can be made into an autoencoder by pairing the discriminator and generator end-to-end and applying reconstruction costs. Our result is again different in that the inference network maps to the posterior of the entire generative network state rather than the top-level noise vector, allowing the maximum-likelihood update.

6 Potential biological implementation

The overall architecture of our biological model is shown schematically in Figure 3. There are two key elements: the discriminator, and how the generative model learns.

6.1 Learning the feedback connections

In VI, the generative connections change to better predict lower-level activity given upper layers. A biological neuron implementing this rule would need an additional compartment to integrate feedback activity so it could compare that prediction to the feedforward-driven somatic activity. This ‘dendritic prediction of somatic activity’ has in fact been proposed as an explanation of how spike-timing dependent plasticity depends on postsynaptic voltage (Urbanczik and Senn, 2014). It should be remarked that this rule is not enough to learn to model lower layers; for this the feedforward activity needs to match the generative posterior. This is the purpose of the discriminator.

Figure 3:

A biological implementation. During ‘Wake’, feedforward connections drive somatic activity. Feedback connections synapse on a segregated dendritic compartment, and their synapses change so the compartment better predicts somatic activity. Connections to the wake/sleep discriminator (Neuromod. #1, putatively ACh) change to increase its activity, and Neuromod. #1 is projected back to gate feedforward plasticity. During ‘Sleep’, feedback connections drive somatic activity, and the wake/sleep discriminator tries to decrease its activity. The scheme requires a second neuromodulator (#2), released in ‘Sleep’, that controls whether feedforward or feedback connections drive somatic activity.

6.2 Is the cholinergic system a wake-sleep discriminator?

Our algorithm requires a non-cortical region that projects across the sensory cortex that has the ability to gate or modulate plasticity, especially within the critical period of development. This led us to speculate that the cholinergic system could play this role. We found that a number of other features of acetylcholine are consistent with this interpretation, as well.

One of the many interpretations of acetylcholine (ACh) is that it signals unfamiliarity and uncertainty. In the work of Yu and Dayan, ACh signals the uncertainty about top-down, contextual information in sensory processing tasks (Dayan and Yu, 2002; Yu and Dayan, 2002). This hypothesis was later narrowed to a signal of expected, or learned, uncertainty (in contrast to unexpected uncertainty, or surprise) (Yu and Dayan, 2005). This is very close to how a discriminator would appear to respond in the ‘Wake’ phase. The discriminator is a learned estimate of whether a network state could have been self-generated. This means that the discriminator has high activity when it estimates that the inference network failed to produce high-level activity that could explain away the lower-level activity via the generative connections.

Another canonical feature of ACh is its control over cortical plasticity (Gu, 2002; Rasmusson, 2000). During the critical period in which sensory representations are formed, the cortex may irreversibly learn to respond to only one eye if the other is closed. However, if ACh is prevented from being released (Bear and Singer, 1986), or if its effect upon somatostatin-positive interneurons is blocked (Yaeger et al., 2019), cortical remapping is impaired. Conversely, and even after the critical period has ended, one can artificially induce profound changes in cortical representations by pairing ACh release with sensory stimuli (Kilgard and Merzenich, 1998). These findings have been replicated across many sensory domains (Gu, 2002) and indicate that ACh has a crucial role in the cortex’s strategy for learning sensory representations.

Acetylcholine is largely released during waking experience and in comparable amounts during REM sleep, but in low amounts in other stages of sleep (Kametani and Kawamura, 1990). This would be expected if ACh played a role as a wake-(REM)sleep discriminator, but is harder to explain with interpretations that stop at attention and unfamiliarity. It is worth noting that ACh is implicated in the control of sleep as well (Ozen Irmak and de Lecea, 2014; Hobson et al., 1975).

Acetylcholine has an extraordinary number of functions within the nervous system, and we do not imagine that this new interpretation should consolidate them all. Its role in attention, for example, appears difficult to explain in any rigorous way as relating to a wake-sleep discriminator (Sarter et al., 2005). Nevertheless, we believe this conjunction of sensory uncertainty, representation learning, and sleep is remarkably consistent with what would be required of a wake-sleep discriminator for VI.

6.3 Predictions

Adversarial Wake-Sleep could be tested for in the following manner. During the critical period of sensory learning, one could selectively silence activity during the stage of sleep most likely to correspond to the offline, generative phase in this algorithm (most plausibly REM sleep). This affects the generative distribution. One could then observe if waking activity changes, via apparently experience-dependent plasticity, to match that perturbed distribution. If so, one could further ask whether ACh mediates that change.

7 Discussion

If variational inference acts as the sensory cortex’s learning objective, or at least some part of it, the cortex could learn the inference connections with an adversarial strategy. It requires a wake-sleep discriminator, which has the simple objective of increasing its output during a stage of sleep and decreasing its output during wake. This objective is easier to learn for a biological area than directly estimating the variational free energy of the entire sensory cortex, as would be required if this strategy were not used (Mnih and Gregor, 2014; Rezende and Gerstner, 2014). This adversarial concept may help to understand the role of acetylcholine in learning, pending further experiments.

Our experiments showed that Adversarial Wake-Sleep works to some degree but still falls far short of benchmarks in generative modeling. Other features of real cortical networks beyond divisive normalization may be required to stabilize the algorithm to competitive levels. We found one way to improve Adversarial Wake-Sleep was to re-use the inference network as a discriminator on inputs. We are agnostic as to whether biology takes this particular strategy, but note that it at least requires no additional architecture beyond a linear readout. This change is also consistent with a recent proposal that human perception corresponds to the processing of a discriminator (Gershman, 2019). However, we are not aware of any biological evidence that this fix is one that the cortex uses.

Our model of sensory learning is abstract. In addition to the usual differences between ANNs and biological neurons, we have not attempted to include any of the great amount that is known about sleep, the role of hippocampus in sensory learning, or a number of other potentially relevant systems.

Some of these details may answer important computational questions. For example, how do spiking neurons calculate their gradient with respect to the discriminator when given only its output? This problem is equivalent to the credit assignment problem in, for example, reinforcement learning. In our case the answer may lie in local microcircuits, and in the fact that acetylcholine mediates learning largely through disinhibition (Yaeger et al., 2019). Alternatively, due to a connection between backpropagation and variational autoencoders, there is the possibility that the feedforward and feedback connections themselves could be used for credit assignment (Bengio, 2014).

8 Code availability

The PyTorch code implementing this algorithm and used for the figures in this manuscript can be found at


References

  • M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein GAN. arXiv preprint arXiv:1701.07875. Cited by: §3.2.
  • A. Attinger, B. Wang, and G. B. Keller (2017) Visuomotor Coupling Shapes the Functional Development of Mouse Visual Cortex. Cell. External Links: Document, ISSN 10974172 Cited by: §1.
  • D. Bang, S. Kang, and H. Shim (2020) Discriminator feature-based inference by recycling the discriminator of GANs. International Journal of Computer Vision, pp. 1–23. Cited by: §5.
  • H. B. Barlow et al. (1961) Possible principles underlying the transformation of sensory messages. Sensory communication 1, pp. 217–234. Cited by: §1.
  • M. F. Bear and W. Singer (1986) Modulation of visual cortical plasticity by acetylcholine and noradrenaline. Nature 320 (6058), pp. 172–176. External Links: Document, ISSN 00280836 Cited by: §6.2.
  • M. I. Belghazi, S. Rajeswar, O. Mastropietro, N. Rostamzadeh, J. Mitrovic, and A. Courville (2018) Hierarchical adversarially learned inference. arXiv preprint arXiv:1802.01071. Cited by: §5.
  • A. J. Bell and T. J. Sejnowski (1997) The “independent components” of natural scenes are edge filters. Vision research 37 (23), pp. 3327–3338. Cited by: §1.
  • Y. Bengio (2014) How auto-encoders could provide credit assignment in deep networks via target propagation. arXiv preprint arXiv:1407.7906. Cited by: §7.
  • P. Berkes, G. Orbán, M. Lengyel, and J. Fiser (2011) Spontaneous cortical activity reveals hallmarks of an optimal internal model of the environment. Science 331 (6013), pp. 83–87. External Links: Document, ISSN 00368075 Cited by: §1.
  • C. M. Bishop (2006) Pattern recognition and machine learning. springer. Cited by: §2.2.
  • J. Bornschein and Y. Bengio (2014) Reweighted wake-sleep. arXiv preprint arXiv:1406.2751. Cited by: §1.
  • A. Brock, T. Lim, J. M. Ritchie, and N. Weston (2016) Neural photo editing with introspective adversarial networks. arXiv preprint arXiv:1609.07093. Cited by: §4.2, §5.
  • I. Dasgupta, E. Schulz, J. B. Tenenbaum, and S. J. Gershman (2020) A theory of learning to infer.. Psychological Review 127 (3), pp. 412. Cited by: §1.
  • P. Dayan, L. F. Abbott, et al. (2001) Theoretical neuroscience. Vol. 806, Cambridge, MA: MIT Press. Cited by: §1, §2.2.
  • P. Dayan and A. Yu (2002) ACh, uncertainty, and cortical inference. In Advances in neural information processing systems, pp. 189–196. Cited by: §6.2.
  • J. Donahue, P. Krähenbühl, and T. Darrell (2016) Adversarial feature learning. arXiv preprint arXiv:1605.09782. Cited by: §5.
  • J. Donahue and K. Simonyan (2019) Large scale adversarial representation learning. In Advances in Neural Information Processing Systems, pp. 10541–10551. Cited by: §5.
  • V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville (2016) Adversarially learned inference. arXiv preprint arXiv:1606.00704. Cited by: §5.
  • A. Fiser, D. Mahringer, H. K. Oyibo, A. V. Petersen, M. Leinweber, and G. B. Keller (2016) Experience-dependent spatial expectations in mouse visual cortex. Nature Neuroscience 19 (12), pp. 1658–1664. External Links: Document, ISSN 15461726 Cited by: §1.
  • S. J. Gershman (2019) The generative adversarial brain. Frontiers in Artificial Intelligence 2, pp. 18. Cited by: §7.
  • Q. Gu (2002) Neuromodulatory transmitter systems in the cortex and their role in cortical plasticity. Neuroscience 111 (4), pp. 815–835. External Links: Document, ISSN 03064522 Cited by: §6.2.
  • I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In Advances in neural information processing systems, pp. 5767–5777. Cited by: §3.2, §8.2.2, §8.2.3.
  • N. S. Harper and D. McAlpine (2004) Optimal neural population coding of an auditory spatial cue. Nature 430 (7000), pp. 682. Cited by: §1.
  • D. J. Heeger (1992) Normalization of cell responses in cat striate cortex. Visual neuroscience 9 (2), pp. 181–197. Cited by: §4.1.
  • M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pp. 6626–6637. Cited by: §4.2.1.
  • G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal (1995) The "wake-sleep" algorithm for unsupervised neural networks. Science 268 (5214), pp. 1158–1161. Cited by: §1, §1, §8.2.4.
  • G. E. Hinton and Z. Ghahramani (1997) Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences 352 (1358), pp. 1177–1190. Cited by: §1.
  • G. E. Hinton and T. J. Sejnowski (1983) Optimal perceptual inference. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Vol. 448. Cited by: §1, §1.
  • G. E. Hinton and R. S. Zemel (1994) Autoencoders, minimum description length and helmholtz free energy. In Advances in neural information processing systems, pp. 3–10. Cited by: §1.
  • G. E. Hinton (2002) Training products of experts by minimizing contrastive divergence. Neural computation 14 (8), pp. 1771–1800. Cited by: §4.2.
  • J. A. Hobson, R. W. McCarley, and P. W. Wyzinski (1975) Sleep cycle oscillation: reciprocal discharge by two brainstem neuronal groups. Science 189 (4196), pp. 55–58. Cited by: §6.2.
  • H. Huang, R. He, Z. Sun, T. Tan, et al. (2018) Introvae: introspective variational autoencoders for photographic image synthesis. In Advances in neural information processing systems, pp. 52–63. Cited by: §5.
  • X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie (2017) Stacked generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5077–5086. Cited by: §5.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §8.2.2.
  • H. Kametani and H. Kawamura (1990) Alterations in acetylcholine release in the rat hippocampus during sleep-wakefulness detected by intracerebral dialysis. Life sciences 47 (5), pp. 421–426. Cited by: §6.2.
  • T. Karras, T. Aila, S. Laine, and J. Lehtinen (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196. Cited by: §4.1, 6th item, §8.2.2, §8.2.2.
  • G. B. Keller and T. D. Mrsic-Flogel (2018) Predictive Processing: A Canonical Cortical Computation. External Links: Document, ISSN 10974199 Cited by: §1.
  • M. P. Kilgard and M. M. Merzenich (1998) Cortical map reorganization enabled by nucleus basalis activity. Science 279 (5357), pp. 1714–1718. External Links: Document, ISSN 00368075 Cited by: §1, §6.2.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, §3.1, §8.2.4.
  • D. F. Kleinschmidt and T. F. Jaeger (2015) Robust speech perception: recognize the familiar, generalize to the similar, and adapt to the novel.. Psychological review 122 (2), pp. 148. Cited by: §1.
  • K. P. Körding, U. Beierholm, W. J. Ma, S. Quartz, J. B. Tenenbaum, and L. Shams (2007) Causal inference in multisensory perception. PLoS one 2 (9), pp. e943. Cited by: §1.
  • A. K. L. Liu, R. C. C. Chang, R. K.B. Pearce, and S. M. Gentleman (2015) Nucleus basalis of Meynert revisited: anatomy, history and differential involvement in Alzheimer’s and Parkinson’s disease. Acta Neuropathologica 129 (4), pp. 527–540. External Links: Document, ISSN 14320533 Cited by: §1.
  • D. J. MacKay (2003) Information theory, inference and learning algorithms. Cambridge university press. Cited by: §2.2.
  • T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957. Cited by: 5th item, §8.2.2.
  • A. Mnih and K. Gregor (2014) Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030. Cited by: §1, §3.1, §7.
  • D. Mumford (1994) Neuronal architectures for pattern-theoretic problems. Large-scale neuronal theories of the brain, pp. 125–152. Cited by: §1.
  • P. Munjal, A. Paul, and N. C. Krishnan (2019) Implicit discriminator in variational autoencoder. arXiv preprint arXiv:1909.13062. Cited by: §5.
  • B. A. Olshausen and D. J. Field (1996) Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381 (6583), pp. 607. Cited by: §1.
  • S. Ozen Irmak and L. de Lecea (2014) Basal Forebrain Cholinergic Modulation of Sleep Transitions. Sleep 37 (12), pp. 1941–1951. External Links: Document, ISSN 0161-8105 Cited by: §6.2.
  • J. Paisley, D. Blei, and M. Jordan (2012) Variational Bayesian inference with stochastic search. arXiv preprint arXiv:1206.6430. Cited by: §1, §3.1.
  • Y. Pu, W. Wang, R. Henao, L. Chen, Z. Gan, C. Li, and L. Carin (2017) Adversarial symmetric variational autoencoder. In Advances in neural information processing systems, pp. 4330–4339. Cited by: §5.
  • A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §4.1, §8.2.1.
  • D. D. Rasmusson (2000) The role of acetylcholine in cortical synaptic plasticity. Behavioural Brain Research 115 (2), pp. 205–218. External Links: Document, ISBN 1902494652, ISSN 01664328 Cited by: §6.2.
  • D. Rezende and W. Gerstner (2014) Stochastic variational learning in recurrent spiking networks. Frontiers in Computational Neuroscience 8, pp. 38. External Links: Link, Document, ISSN 1662-5188 Cited by: §3.1, §7.
  • D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082. Cited by: §1, §3.1.
  • F. Rieke, D. Bodnar, and W. Bialek (1995) Naturalistic stimuli increase the rate and efficiency of information transmission by primary auditory afferents. Proceedings of the Royal Society of London. Series B: Biological Sciences 262 (1365), pp. 259–265. Cited by: §1.
  • M. Sarter, M. E. Hasselmo, J. P. Bruno, and B. Givens (2005) Unraveling the attentional functions of cortical cholinergic inputs: Interactions between signal-driven and cognitive modulation of signal detection. Brain Research Reviews 48 (1), pp. 98–111. External Links: Document, ISSN 01650173 Cited by: §6.2.
  • D. M. Schneider, J. Sundararajan, and R. Mooney (2018) A cortical filter that learns to suppress the acoustic consequences of movement. Nature 561 (7723), pp. 391–395. Cited by: §1.
  • A. Srivastava, L. Valkov, C. Russell, M. U. Gutmann, and C. Sutton (2017) Veegan: reducing mode collapse in gans using implicit variational learning. In Advances in Neural Information Processing Systems, pp. 3308–3318. Cited by: §5.
  • D. Ulyanov, A. Vedaldi, and V. Lempitsky (2018) It takes (only) two: adversarial generator-encoder networks. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §4.2, §5.
  • R. Urbanczik and W. Senn (2014) Learning by the dendritic prediction of somatic spiking. Neuron 81 (3), pp. 521–528. Cited by: §6.1.
  • W. E. Vinje (2002) Sparse Coding and Decorrelation in Primary Visual Cortex During Natural Vision. Science 287 (5456), pp. 1273–1276. External Links: Document, ISSN 00368075 Cited by: §1.
  • M. Weliky, J. O. Fiser, R. H. Hunt, and D. N. Wagner (2003) Coding of natural scenes in primary visual cortex. Neuron. External Links: Document, ISSN 08966273 Cited by: §1.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §3.2.2.
  • C. E. Yaeger, D. L. Ringach, and J. T. Trachtenberg (2019) Neuromodulatory control of localized dendritic spiking in critical period cortex. Nature 56. External Links: Document, ISSN 0028-0836, Link Cited by: §6.2, §7.
  • A. Yu and P. Dayan (2002) Acetylcholine in cortical inference. Neural Networks 15 (4-6), pp. 719–730. Cited by: §6.2.
  • A. J. Yu and P. Dayan (2005) Uncertainty, neuromodulation, and attention. Neuron. External Links: Document, ISSN 08966273 Cited by: §1, §6.2.
  • A. Yuille and D. Kersten (2006) Vision as Bayesian inference: analysis by synthesis?. Trends in Cognitive Sciences 10 (7), pp. 301–308. External Links: Document, ISSN 13646613 Cited by: §1.


8.1 Algorithm

1: η_F, η_G, η_D — learning rates
2: λ — gradient penalty size
3: procedure AdversarialWakeSleep(F, G, D)
4:      Initialize parameters in F, G, and D.
5:     while not converged do
6:         Wake
7:          Observe batch of data x
8:          Infer hidden states z ∼ F(x)
9:          Calculate discriminator’s value D(x, z)
10:          Calculate gradient penalty GP
11:          D attempts to increase output
12:          F attempts to decrease output
13:          Adjust G to maximize variational log-likelihood (Eq. 2)
14:         Sleep
15:          Sample top-level hidden state z ∼ N(0, I)
16:          Propagate through G
17:          Calculate discriminator’s value D
18:          Calculate gradient penalty GP
19:          D attempts to decrease output
20:     end while
21:     return F and G
22: end procedure
Algorithm 1 Adversarial Wake-Sleep

Here shown with stochastic gradient descent and the WGAN-GP distribution-matching objective. F denotes the inference network, G the generator, and D the wake-sleep discriminator.
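One wake/sleep cycle of Algorithm 1 can be sketched in PyTorch. The names F, G, and D and the toy linear layers below are illustrative stand-ins for the inference network, generator, and wake-sleep discriminator, and the gradient penalty term is omitted for brevity; this is a sketch under those assumptions, not the paper's implementation.

```python
# Toy sketch of one Adversarial Wake-Sleep cycle (Algorithm 1). F, G, and D
# are illustrative linear stand-ins; the gradient penalty is omitted.
import torch
import torch.nn as nn

x_dim, z_dim = 16, 4
F = nn.Linear(x_dim, z_dim)         # inference: data -> hidden state
G = nn.Linear(z_dim, x_dim)         # generation: hidden state -> data
D = nn.Linear(x_dim + z_dim, 1)     # discriminator on the joint state

opt_F = torch.optim.Adam(F.parameters(), lr=1e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)

def wake_step(x):
    z = F(x)
    # D ascends its output on inferred (wake) states
    d_loss = -D(torch.cat([x, z], dim=1).detach()).mean()
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()
    # F opposes D by descending D's output
    f_loss = D(torch.cat([x, F(x)], dim=1)).mean()
    opt_F.zero_grad(); f_loss.backward(); opt_F.step()
    # G maximizes the variational log-likelihood (Eq. 2); with Gaussian
    # injected noise this amounts to a squared prediction error
    g_loss = ((G(F(x).detach()) - x) ** 2).mean()
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()

def sleep_step(batch_size):
    z = torch.randn(batch_size, z_dim)   # top-level sample from the prior
    x_gen = G(z)
    # D descends its output on self-generated (sleep) states
    d_loss = D(torch.cat([x_gen, z], dim=1).detach()).mean()
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()
```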

1: η_F, η_G, η_D, η_r — learning rates
2: γ — interpolation parameter
3: procedure DoublyAdversarialWakeSleep(F, G, D)
4:      Initialize parameters in F, G, and D.
5:     while not converged do
6:         Wake
7:          Observe batch of data x
8:          Infer hidden states z ∼ F(x)
9:          Calculate discriminator’s value D(x, z)
10:          Calculate gradient penalty
11:          Calculate readout discriminator’s value D_r(x)
12:          D attempts to increase output
13:          F attempts to decrease D and increase D_r
14:          Adjust G to maximize variational log-likelihood (Eq. 2).
15:          D_r attempts to increase output
16:         Sleep
17:          Sample top-level hidden state z ∼ N(0, I)
18:          Propagate through G
19:          Calculate discriminator’s value D
20:          Pass generated samples back up through F
21:          Calculate readout discriminator D_r
22:          Calculate gradient penalty
23:          D attempts to decrease output
24:          F attempts to decrease D_r
25:          D_r attempts to decrease output
26:          G attempts to increase D_r
27:     end while
28:     return F and G
29: end procedure
Algorithm 2 Doubly Adversarial Wake-Sleep

8.2 Training details

8.2.1 Software and architecture

The experiments presented in this paper employed a DCGAN architecture (Radford et al., 2015) and were coded in PyTorch v1.3. The DCGAN is an all-convolutional network with 5 layers. The first layer above the inputs has 128 channels, and the channel count doubles every layer, except for the highest layer, which has 100 channels (or 40 for MNIST). In order for the inference and generator networks to have the same architecture, the generator uses transposed convolutions where the inference network uses convolutions. This is the standard DCGAN architecture, but the standard discriminator is used instead as an approximate inference network.

The 2D spatial dimension of the hidden layers is halved at every layer upwards. For this reason the input dimension must be a power of 2. For MNIST this requires resizing the 28x28 images to 32x32, which we performed using PyTorch's built-in resize transform.

Stochasticity and activations

Our architecture uses ReLU activations in the inference (or discriminator) and generator networks. As a source of stochasticity, we added Gaussian noise of fixed and isotropic variance to all nodes before applying the ReLU activations. We also experimented with Laplacian-distributed noise and found similar performance.
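As a concrete illustration, the stochastic activation described above could be written as a small module (a sketch; the parameter name `sigma` is ours):

```python
# Sketch of the stochastic activation: fixed, isotropic Gaussian noise is
# added to every pre-activation before the ReLU is applied.
import torch
import torch.nn as nn

class NoisyReLU(nn.Module):
    def __init__(self, sigma=0.1):
        super().__init__()
        self.sigma = sigma  # fixed noise scale

    def forward(self, pre_activation):
        noise = self.sigma * torch.randn_like(pre_activation)
        return torch.relu(pre_activation + noise)
```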

Stochastic ‘inverse’ ReLU

The inference network has the purpose of estimating the posterior distribution over the activity of the generative network at layer ℓ that might have generated the activity at layer ℓ−1. Because the generator applies the ReLU function, any element of a hidden state that is 0 could have had a negative value before truncation. That is, 0 values could have been generated by quite a large subspace of pre-activations. To mitigate this source of mismatch between the inferred and true posteriors, we add a simple operator to the inference network that adds negative noise to a hidden state wherever its elements are 0. We applied negative, exponentially distributed noise with an initial scale that decayed by a factor of 10 every 30 epochs.
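This 'inverse' ReLU and its decay schedule might be sketched as follows (the function names and the initial scale value are illustrative):

```python
# Sketch of the stochastic 'inverse' ReLU: wherever an element of the hidden
# state is exactly 0 (possibly truncated by the generator's ReLU), subtract
# exponentially distributed noise. The decay schedule follows the text.
import torch

def inverse_relu(z, scale):
    neg = -torch.distributions.Exponential(1.0 / scale).sample(z.shape)
    return torch.where(z == 0, neg, z)

def noise_scale(initial_scale, epoch):
    # decays by a factor of 10 every 30 epochs
    return initial_scale * 10.0 ** (-(epoch // 30))
```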

8.2.2 Adversarial Wake-Sleep

Here we describe the training details we found to work well for Adversarial Wake-Sleep on the above architecture. After settling upon the general configuration, we fine-tuned the hyperparameters with a random search with 100 runs. As a performance criterion, we used the classification accuracy of the linear readout from the inference network (Fig. 2C). We settled on this as it was a good indicator of stable convergence that also measured an objective aspect of performance. The configuration is summarized in Table 1.


We applied the Adam optimizer, with the momentum parameters β1 and β2 chosen by the random search. Each network (inference, generator, and wake-sleep discriminator) was allowed a different learning rate. Training was very sensitive to the relative learning rates; the final values are listed in Table 1.

Divisive normalization

It was noted in Karras et al. (2017) that a source of instability in adversarial training is a battle for scale between the discriminator and generator. This can be mitigated by forcing the feature vector at any spatial location to have unit norm. After the ReLU is applied, the features (i.e. channels) at each spatial location in each layer are divided by their norm.
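This divisive normalization can be sketched as a small function (the root-mean-square form follows the 'pixelwise feature normalization' of Karras et al. (2017); the epsilon for numerical stability is our addition):

```python
# Sketch of the divisive ('pixelwise feature') normalization: the channel
# vector at each spatial location is divided by its RMS norm after the ReLU.
import torch

def pixelwise_norm(h, eps=1e-8):
    # h: (batch, channels, height, width)
    rms = h.pow(2).mean(dim=1, keepdim=True).add(eps).sqrt()
    return h / rms
```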

Minibatch standard deviation

We applied another method from Karras et al. (2017), which was to calculate the standard deviation of the features in the discriminator and supply that as input in the penultimate discriminative layer. As our wake-sleep discriminator has only one hidden layer, we calculated the standard deviation of the inference/generative features and included that as input to the wake-sleep discriminator.
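A minimal sketch of this minibatch standard-deviation feature, for a flattened feature vector (the function name is ours):

```python
# Sketch of the minibatch standard-deviation feature: the standard deviation
# of each feature across the batch is averaged into a single scalar and
# appended as an extra input to the wake-sleep discriminator.
import torch

def append_minibatch_std(features):
    # features: (batch, n_features)
    std = features.std(dim=0).mean()
    extra = std.expand(features.shape[0], 1)
    return torch.cat([features, extra], dim=1)
```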


KL regularization of the top-level state

The generative distribution over the top-level hidden state is fixed and known: it is a multivariate standard normal distribution. We found slightly better performance when explicitly regularizing the KL divergence of the inference distribution over the top-level hidden state towards the standard normal.
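One simple way to implement such a regularizer (an assumption on our part; the exact form used is not specified in the text) is to fit a diagonal Gaussian to the batch of inferred top-level states and penalize its closed-form KL divergence from N(0, I):

```python
# Closed-form KL divergence from a diagonal Gaussian, moment-matched to a
# batch of inferred top-level states, to the standard normal prior.
import torch

def kl_to_standard_normal(z):
    mu = z.mean(dim=0)
    var = z.var(dim=0)
    # KL( N(mu, var) || N(0, I) ) summed over dimensions
    return 0.5 * (var + mu ** 2 - 1.0 - var.log()).sum()
```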

Prioritized replay

Sensory learning is known to involve replay events triggered by the hippocampus during sleep. Inspired by this, we tested 'replaying' some waking states during the Sleep phase by saving the inferred top-level states and inserting them into the batch of top-level states otherwise sampled from the standard normal distribution. We selected the half of each batch that had the highest wake-sleep discriminator outputs, as these network states were estimated as most inference-like (the 'most surprising' samples). Overall this strategy was not necessary for good convergence but was selected by the random search.
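The selection step of this replay heuristic can be sketched as follows (the function name is illustrative):

```python
# Sketch of the replay heuristic: the half of the wake batch with the highest
# wake-sleep discriminator outputs (the 'most surprising' states) is replayed
# in the next sleep batch alongside fresh samples from the prior.
import torch

def build_sleep_batch(wake_z, d_outputs):
    batch, z_dim = wake_z.shape
    k = batch // 2
    top = d_outputs.topk(k).indices          # most inference-like states
    fresh = torch.randn(batch - k, z_dim)    # ordinary sleep samples
    return torch.cat([wake_z[top], fresh], dim=0)
```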

Gradient penalty with an alternative norm

Since our wake-sleep discriminator approximates the Wasserstein-1 distance, it must be 1-Lipschitz with respect to its inputs (the joint network state of the inference and generator networks). We applied the gradient penalty GP of Gulrajani et al. (2017) and penalized the deviation of the input gradients from unit norm. Whereas the typical GP penalizes the squared deviation, we found much better performance with a different exponent on this penalty. Parameter searches selected the penalty size reported in Table 1.
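A sketch of the WGAN-GP penalty with the exponent exposed as a parameter (the standard penalty of Gulrajani et al. (2017) uses exponent 2; the exact alternative norm used in the paper is not recoverable here):

```python
# Sketch of the WGAN-GP gradient penalty on the wake-sleep discriminator:
# the gradient at random interpolations between wake and sleep states is
# penalized for deviating from unit norm.
import torch

def gradient_penalty(D, wake_state, sleep_state, exponent=2):
    eps = torch.rand(wake_state.shape[0], 1)
    interp = (eps * wake_state + (1 - eps) * sleep_state).requires_grad_(True)
    grads, = torch.autograd.grad(D(interp).sum(), interp, create_graph=True)
    return (grads.norm(dim=1) - 1.0).abs().pow(exponent).mean()
```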

Spectral normalization in the inference network

We included the option of applying spectral normalization on the inference network in our parameter searches (Miyato et al., 2018). Its usage was selected in the randomly found best configuration, but did not appear essential for stable training.

Batch normalization in the inference network

The final configuration setting in the random search included batch normalization (Ioffe and Szegedy, 2015) applied to the inference network but not the generator network.

Hyperparameter choices

The hyperparameter values chosen by random search for each algorithm (Wake-Sleep with binary units, Adversarial Wake-Sleep, and Doubly Adversarial Wake-Sleep) are listed in Table 1.

Table 1: Hyperparameters chosen by random search. Key: η: learning rates for the inference network F, generator G, wake-sleep discriminator D, and readout discriminator D_r. β1 and β2 are the momentum parameters of the Adam optimizer. λ is the gradient penalty upon the wake-sleep discriminator. λ_F is the gradient penalty with respect to the input images of the output of F. σ² is the initial variance of the Gaussian noise applied to activations of F and G.

8.2.3 Doubly Adversarial Wake-Sleep

For our experiments in which the inference network was reused as a discriminator on inputs, we used the same setup but allowed hyperparameters to change. Once again, we performed a random search over the parameter space and selected the configuration with the best classification accuracy from a linear SVM reading out from the top inference layer. The configuration is shown in Table 1.

Since the inference network is now also a discriminator, the WGAN formulation requires a gradient penalty on its inputs (here, the images). We used the standard WGAN-GP penalty (Gulrajani et al., 2017). To distinguish it from the gradient penalty on the wake-sleep discriminator, we denote the strength of this penalty as λ_F.

8.2.4 Wake-Sleep

As a comparison, we implemented Wake-Sleep in a DCGAN-like architecture with continuous latent variables and ReLU activations. Ours is not the first application of the Wake-Sleep algorithm, which was originally formulated for stochastic binary networks, to continuous latent variables (Kingma and Welling, 2013).

Wake-Sleep trains the inference connections in the Sleep phase by maximizing, at each layer ℓ, the log-likelihood log q(z_ℓ | z_{ℓ−1}) of the layer's state given the layer below, where z is a self-generated state. Similarly, during the Wake phase the generative connections maximize log p(z_{ℓ−1} | z_ℓ), where z is an inferred state. More details can be found in Hinton et al. (1995).
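With Gaussian injected noise, these layer-local objectives reduce to squared prediction errors, which can be sketched as follows (the names q and p and the linear layers are illustrative stand-ins for one layer's inference and generative connections):

```python
# Toy sketch of the layer-local Wake-Sleep updates with Gaussian injected
# noise, where maximizing the layerwise log-likelihood reduces to minimizing
# a squared prediction error.
import torch
import torch.nn as nn

q = nn.Linear(8, 4)   # inference connections, layer l-1 -> layer l
p = nn.Linear(4, 8)   # generative connections, layer l -> layer l-1
opt_q = torch.optim.SGD(q.parameters(), lr=1e-3)
opt_p = torch.optim.SGD(p.parameters(), lr=1e-3)

def sleep_update(z_lower, z_upper):
    # z are self-generated states; q learns to predict the layer above
    loss = ((q(z_lower) - z_upper) ** 2).mean()
    opt_q.zero_grad(); loss.backward(); opt_q.step()
    return loss.item()

def wake_update(z_lower, z_upper):
    # z are inferred states; p learns to predict the layer below
    loss = ((p(z_upper) - z_lower) ** 2).mean()
    opt_p.zero_grad(); loss.backward(); opt_p.step()
    return loss.item()
```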

Stochastic Binary Convolutional Networks

First, as a control, we noted that Wake-Sleep does produce digit-like images on binarized MNIST when the DCGAN architecture is a stochastic binary network (Figure 4). In a stochastic binary network, the activations and outputs are Bernoulli random variables with a probability of firing equal to a sigmoidal function of the convolved filter output. This is the original setting of the Wake-Sleep algorithm.
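The stochastic binary unit described above amounts to a Bernoulli sample with a sigmoidal firing probability:

```python
# Sketch of a stochastic binary unit: the output is a Bernoulli sample with
# firing probability given by a sigmoid of the convolved filter output.
import torch

def stochastic_binary(pre_activation):
    prob = torch.sigmoid(pre_activation)
    return torch.bernoulli(prob)
```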

Figure 4: Binary MNIST digits generated by a stochastic binary network with convolutional filters in a DCGAN architecture, trained with the Wake-Sleep algorithm. Digits were rescaled to 32x32 to be compatible with the DCGAN architecture, then binarized by rounding. All parameters were adjusted with learning rates of 0.001 and the Adam optimizer.
ReLU Convolutional Networks

We wished to compare the Wake-Sleep algorithm in our setting of continuous latent variable networks with ReLU activations and injected noise. We explored a wide parameter space. However, the Wake-Sleep algorithm was unstable for all tested configurations. We employed a random search over the space with 250 runs, plus a number of choices selected by hand. The choices of configuration included the following:

  • Gaussian or Laplacian injected noise. In addition to affecting the dynamics, the form of this noise affects the Wake-Sleep update rule: for Gaussian noise, maximizing the layerwise log-likelihood amounts to minimizing the squared error between the generated state and its prediction from the layer above, while for Laplacian noise it amounts to minimizing the absolute error.

  • The scale of the (isotropic) injected noise.

  • The learning rates of the Adam optimizer for the inference and generative connections, which we allowed to differ.

  • The β1 and β2 momentum parameters of the Adam optimizer.

  • Spectral normalization on the inference network (Miyato et al., 2018).

  • Divisive normalization on inference and generative networks, in which the channels in each layer are divided by the norm of the channels at that spatial location. This is the ‘pixelwise feature normalization’ of Karras et al. (2017).

  • ReLU or SELU activation functions.

  • Dropout on the activations, if applied.

All parameter configurations were unstable with the Wake-Sleep algorithm. Training produced either images of pure black, pure white, or random featureless noise.