The variational autoencoder (VAE) Kingma2014vae; Rezende2014 has been used to train deep latent-variable generative models, which model a distribution $p_\theta(x)$ over observations $x$ by latent variables $z$ such that
$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz,$$
using a deep neural network which transforms samples from $p(z)$ into samples from $p_\theta(x)$. The VAE trains this generative model using approximate posterior samples from a simultaneously trained recognition network, or inference network, to maximize the evidence lower bound (ELBO).
There are two ways to improve the quality of the learned deep generative model. First, the multi-sample objective used by the importance weighted autoencoder (IWAE) Burda2015 yields a tighter lower bound on the model evidence, leading to superior generative models. Optimizing this objective corresponds to implicitly reweighting the samples from the approximate posterior. A second way to improve the quality of the generative model is to explicitly improve the approximate posterior samples generated by the recognition network.
In the VAE framework, the recognition network is restricted to approximate posterior distributions under which the log probability of a sample and its derivatives can be evaluated in closed form. The adversarial autoencoder (AAE) Makhzani2016 and adversarial variational Bayes (AVB) Mescheder2017 show how this constraint can be relaxed, leading to more flexible posterior distributions which are implicitly represented by the recognition network. In this paper, we derive importance weighted adversarial autoencoders, IW-AVB and IW-AAE, thus combining adversarial and importance weighting techniques for improving probabilistic modeling.
Spike inference is an important Bayesian inference problem in neuroscience berens2018community. Calcium imaging methods enable the indirect measurement of neural activity of large populations of neurons in the living brain in a minimally invasive manner. The intracellular calcium concentration measured by fluorescence microscopy of a genetically encoded calcium sensor such as GCaMP6 chen2013ultrasensitive is an indirect measure of the spiking activity of the neuron. VAEs have previously been used speiser2017fast; Aitchison:2017wz to perform Bayesian inference of spiking activity by training inference networks to invert the known, biophysically described generative process which converts unobserved spikes into observed fluorescence time series.
The accuracy of a VAE-based spike inference method depends strongly on the quality of the posterior approximation used by the inference network. The posterior distribution over the binary latent spike train given the fluorescence time series has previously been approximated speiser2017fast using either a factorized Bernoulli distribution (VIMCO-FACT), in which the spikes at different time steps are conditionally independent given the fluorescence trace, or an autoregressive Bernoulli distribution (VIMCO-CORR). As we show, the correlated autoregressive posterior is more accurate, but slow to sample from. In contrast, the factorized posterior allows for fast parallel sampling, especially on a GPU, but ignores correlations in the posterior. Fast inference networks which sample from correlated posteriors over discrete binary spike trains would be a significant advance for VAE-based spike inference.
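The difference between the two sampling schemes can be sketched in a few lines. This is an illustrative toy, not the parameterization used in speiser2017fast: the per-step `logits` and the scalar `coupling` to the previous spike are hypothetical stand-ins for the outputs of an inference network.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def sample_factorized(logits, rng):
    """Each time step is drawn independently of the others,
    so on parallel hardware all steps could be sampled at once."""
    return [1 if rng.random() < sigmoid(a) else 0 for a in logits]


def sample_autoregressive(logits, coupling, rng):
    """The spike probability at step t depends on the sample at step t-1
    (hypothetical first-order dependence), so sampling is sequential."""
    spikes = []
    prev = 0
    for a in logits:
        p = sigmoid(a + coupling * prev)
        prev = 1 if rng.random() < p else 0
        spikes.append(prev)
    return spikes


rng = random.Random(0)
logits = [-2.0] * 1000
factorized_spikes = sample_factorized(logits, rng)
correlated_spikes = sample_autoregressive(logits, coupling=3.0, rng=rng)
```

The loop in `sample_autoregressive` cannot be vectorized over time because each step needs the previous sample, which is the source of the speed gap discussed above.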
Fast correlated distributions over time series can be constructed using normalizing flows for continuous random variables Rezende2015, but this is considerably harder for discrete random variables Aitchison:2018vq. Thus an adversarial approach, in which an inference network that transforms noise samples into samples from the posterior can be trained without the need to evaluate the posterior likelihood, is particularly appealing for modeling correlated distributions over discrete random variables. Here, we show that our adversarially trained inference networks produce correlated samples which outperform the factorized posterior trained in the conventional way as in speiser2017fast.
In addition to these practical advances, we derive theoretical results connecting the objective functions optimized by the importance weighted variants of the AVB, AAE, and VAE. The relationship between the AAE objective and data log likelihood is not fully understood. The AAE has been shown to be a special case of the Wasserstein autoencoder (WAE) under certain restricted conditions Bousquet2017 . However, we also do not understand the tradeoffs between the standard log-likelihood and penalized optimal transport objectives, and thus further theoretical insight is necessary to fully understand the tradeoffs between the VAE and AAE.
The main contributions of the paper are as follows:
- We propose IW-AVB and IW-AAE, which yield tighter lower bounds on the log-likelihood than AVB and AAE, and whose global solution maximizes the likelihood.
- We provide theoretical insights into the importance weighted adversarial objective functions. In particular, we relate the AAE and IW-AAE objectives to log-likelihoods and Wasserstein autoencoder objectives.
- We develop standard and importance weighted adversarial neural spike inference for calcium imaging data, and show that adversarially trained inference networks outperform existing VAEs using factorized posteriors.
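Maximum likelihood estimation of the parameter $\theta$ of a model defined as $p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$, where $z$ is a latent variable, is in general intractable. Variational methods maximize a lower bound on the log-likelihood. This lower bound is based on approximating the intractable posterior distribution $p_\theta(z \mid x)$ by a tractable distribution $q_\phi(z \mid x)$ parameterized by variational parameters $\phi$. VAEs maximize the following lower bound of $\log p_\theta(x)$:
$$\mathcal{L}_{\text{VAE}} = E_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left(q_\phi(z \mid x)\,\|\,p(z)\right).$$
To make the relationship with the proposed methods clear, we write this as
$$\mathcal{L}_{\text{VAE}} = E_{z \sim q_\phi(z \mid x)}\left[\log \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right].$$
We do this for all criteria going forward.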
To efficiently optimize this criterion with gradient descent, VAEs Kingma2014vae; Rezende2014 define the approximate posterior $q_\phi(z \mid x)$ such that $z$ is a differentiable transformation $g_\phi(x, \epsilon)$ of a noise variable $\epsilon$. It is common to assume $\epsilon \sim \mathcal{N}(0, I)$ and a Gaussian $q_\phi(z \mid x)$, and for $g_\phi$ to be a deep network with weights $\phi$.
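A reparameterized Monte Carlo estimate of the variational lower bound can be sketched as follows. The linear-Gaussian toy model (prior $\mathcal{N}(0,1)$, likelihood $\mathcal{N}(z,1)$, so that the marginal is $\mathcal{N}(0,2)$) and all function names are assumptions chosen so that the exact log-likelihood is available for comparison; they are not the models used in this paper.

```python
import math
import random


def log_normal(v, mean, var):
    """Log density of N(mean, var) evaluated at v."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (v - mean) ** 2 / var)


def elbo_estimate(x, mu, sigma, n_samples, rng):
    """Monte Carlo ELBO with q(z|x) = N(mu, sigma^2), sampled by reparameterization."""
    total = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)          # noise variable
        z = mu + sigma * eps               # differentiable transformation of the noise
        total += (log_normal(x, z, 1.0)    # log p(x | z)
                  + log_normal(z, 0.0, 1.0)        # log p(z)
                  - log_normal(z, mu, sigma ** 2))  # log q(z | x)
    return total / n_samples


rng = random.Random(0)
x = 1.5
log_px = log_normal(x, 0.0, 2.0)                                  # exact log-likelihood
elbo_prior = elbo_estimate(x, 0.0, 1.0, 20000, rng)               # q set to the prior
elbo_exact = elbo_estimate(x, x / 2.0, math.sqrt(0.5), 10, rng)   # q set to the true posterior
```

With the true posterior as $q$, the bound is tight and every sample returns the exact log-likelihood; with a mismatched $q$, the estimate falls strictly below it.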
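Requiring that $q_\phi(z \mid x)$ can be evaluated analytically restricts the class of approximate posteriors and is a limitation of such approaches. Adversarial variational Bayes (AVB) Mescheder2017 maximizes the variational lower bound by implicitly approximating the KL divergence between the approximate posterior and the prior distribution through a third neural network, $T(x, z)$. This neural network, known as the discriminator, implicitly estimates $\log q_\phi(z \mid x) - \log p(z)$. The optimal discriminator is
$$T^*(x, z) = \log q_\phi(z \mid x) - \log p(z),$$
and the AVB objective is
$$\mathcal{L}_{\text{AVB}} = E_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x \mid z) - T^*(x, z)\right].$$
The three parametric models, $p_\theta(x \mid z)$, $q_\phi(z \mid x)$, and $T(x, z)$, are jointly optimized using adversarial training. Unlike in the VAE and IWAE, in this framework we can use arbitrarily flexible approximate distributions $q_\phi(z \mid x)$.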
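The adversarial autoencoder (AAE) Makhzani2016 is similar, except that the discriminative network depends only on $z$, instead of on $x$ and $z$. With $T^*(z) = \log q_\phi(z) - \log p(z)$, where $q_\phi(z) = E_{x \sim p_D(x)}\left[q_\phi(z \mid x)\right]$ is the aggregated posterior under the data distribution $p_D(x)$, the AAE optimizes the following objective:
$$\mathcal{L}_{\text{AAE}} = E_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x \mid z) - T^*(z)\right].$$
AAE replaces the KL divergence between the approximate posterior and the prior distribution in the VAE lower bound with an adversarial loss that tries to minimize the divergence between the aggregated posterior $q_\phi(z)$ and the prior distribution $p(z)$.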
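To see why a discriminator can stand in for a log density ratio, consider a one-dimensional toy in which samples from $q = \mathcal{N}(1, 1)$ are labeled positive and samples from $p = \mathcal{N}(0, 1)$ negative. In this assumed setup the true log ratio is $\log q(z) - \log p(z) = z - 0.5$, so a linear logit $T(z) = wz + b$ trained with the logistic loss should approach $w \approx 1$, $b \approx -0.5$. The distributions, the linear form, and the function names are illustrative assumptions, not the networks used in the paper.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def fit_discriminator(pos, neg, steps=300, lr=0.5):
    """Fit T(z) = w*z + b by gradient ascent on the logistic discriminator objective."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        gw = gb = 0.0
        for z in pos:                      # samples from q: push sigmoid(T) toward 1
            r = 1.0 - sigmoid(w * z + b)
            gw += r * z / len(pos)
            gb += r / len(pos)
        for z in neg:                      # samples from p: push sigmoid(T) toward 0
            r = sigmoid(w * z + b)
            gw -= r * z / len(neg)
            gb -= r / len(neg)
        w += lr * gw
        b += lr * gb
    return w, b


rng = random.Random(0)
q_samples = [rng.gauss(1.0, 1.0) for _ in range(2000)]
p_samples = [rng.gauss(0.0, 1.0) for _ in range(2000)]
w, b = fit_discriminator(q_samples, p_samples)
```

The fitted logit recovers the log ratio without ever evaluating the density of $q$, which is what allows implicit posteriors to be used.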
3 Importance Weighted Adversarial Training
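The importance weighted autoencoder (IWAE) Burda2015 provides a tighter lower bound to $\log p_\theta(x)$,
$$\mathcal{L}_{\text{IWAE}}^{k} = E_{z_1, \dots, z_k \sim q_\phi(z \mid x)}\left[\log \frac{1}{k} \sum_{i=1}^{k} \frac{p_\theta(x, z_i)}{q_\phi(z_i \mid x)}\right].$$
Burda et al. Burda2015 show that $\log p_\theta(x) \geq \mathcal{L}_{\text{IWAE}}^{k+1} \geq \mathcal{L}_{\text{IWAE}}^{k}$, and that $\mathcal{L}_{\text{IWAE}}^{k}$ approaches $\log p_\theta(x)$ as $k \to \infty$.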
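The tightening with the number of samples can be checked numerically. The sketch below uses an assumed linear-Gaussian toy model (prior $\mathcal{N}(0,1)$, likelihood $\mathcal{N}(z,1)$, marginal $\mathcal{N}(0,2)$) with a deliberately poor proposal equal to the prior; the names and the model are illustrative only.

```python
import math
import random


def log_normal(v, mean, var):
    """Log density of N(mean, var) evaluated at v."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (v - mean) ** 2 / var)


def iwae_estimate(x, mu, sigma, k, n_outer, rng):
    """Average of log((1/k) * sum_i w_i) over n_outer independent draws of k samples."""
    total = 0.0
    for _ in range(n_outer):
        log_w = []
        for _ in range(k):
            z = rng.gauss(mu, sigma)
            log_w.append(log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0)
                         - log_normal(z, mu, sigma ** 2))
        m = max(log_w)  # subtract the max for a numerically stable log-mean-exp
        total += m + math.log(sum(math.exp(v - m) for v in log_w) / k)
    return total / n_outer


rng = random.Random(0)
x = 1.5
log_px = log_normal(x, 0.0, 2.0)
bound_k1 = iwae_estimate(x, 0.0, 1.0, k=1, n_outer=2000, rng=rng)
bound_k50 = iwae_estimate(x, 0.0, 1.0, k=50, n_outer=2000, rng=rng)
```

With $k = 1$ the estimate reduces to the ordinary variational lower bound; with $k = 50$ it moves much closer to the exact log-likelihood while remaining below it in expectation.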
3.1 IW-AVB and IW-AAE
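In AVB, generative adversarial training on joint distributions between data and latent variables is applied to the variational lower bound. In this work, we propose applying it to the importance weighted lower bound of $\log p_\theta(x)$,
$$\mathcal{L}_{\text{IW-AVB}} = E_{z_1, \dots, z_k \sim q_\phi(z \mid x)}\left[\log \frac{1}{k} \sum_{i=1}^{k} \frac{p_\theta(x \mid z_i)}{\exp\left(T^*(x, z_i)\right)}\right],$$
where $T^*(x, z)$ is defined as in Equation 3. We call this the Importance Weighted Adversarial Variational Bayes (IW-AVB) bound. As with the IWAE bound, the IW-AVB bound approaches $\log p_\theta(x)$ as $k \to \infty$.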
The main advantage of IW-AVB over AVB is that, when the true posterior distribution is not in the class of approximate posterior functions (as is generally the case), IW-AVB uses a tighter lower bound than AVB Burda2015 .
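Similarly, we can apply importance weighting to improve AAE:
$$\mathcal{L}_{\text{IW-AAE}} = E_{z_1, \dots, z_k \sim q_\phi(z \mid x)}\left[\log \frac{1}{k} \sum_{i=1}^{k} \frac{p_\theta(x \mid z_i)}{\exp\left(T^*(z_i)\right)}\right],$$
where $T^*(z)$ is defined as in Equation 5.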
The IW-AVB and IW-AAE objectives can be described as a minimax adversarial game between three neural networks: the generative network $p_\theta(x \mid z)$, the inference network $q_\phi(z \mid x)$, and the discriminative network $T$. The inference network maps an input $x$ to the latent space, and the generative network maps latent samples $z$ to the data space. Both the inference and generative networks are jointly trained to minimize the reconstruction error and the KL divergence term in the bound. The discriminator network differentiates samples from the joint distribution between data and the approximate posterior distribution (positive samples) from samples drawn from the joint over data and the prior latent distribution (negative samples).
Recent work Rainforth2018 has shown that optimizing the importance weighted bound can degrade the overall learning process of the inference network, because the signal-to-noise ratio (SNR) of the gradient estimates of the bound scales as $O(\sqrt{k})$ and $O(1/\sqrt{k})$ for the generative and inference networks, respectively. The SNR converges to 0 for the inference network as $k \to \infty$, and the gradient estimates for $\phi$ become completely random. To mitigate this, we apply the importance weighted bound for updating the parameters $\theta$ of the generative network and the variational lower bound for updating the parameters $\phi$ of the inference network. Hence, we maximize the following:
$$\max_{\theta}\; E_{z_1, \dots, z_k \sim q_\phi(z \mid x)}\left[\log \frac{1}{k} \sum_{i=1}^{k} \frac{p_\theta(x \mid z_i)}{\exp\left(T^*(x, z_i)\right)}\right], \qquad \max_{\phi}\; E_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x \mid z) - T^*(x, z)\right].$$
We do this for IW-AAE as well.
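The split between the two update rules can be sketched as a function that returns both objectives from the same set of samples. The inputs are assumed to be precomputed values of $\log p_\theta(x \mid z_i)$ and of the discriminator output at each $z_i$; the function name and this interface are hypothetical, and in a real implementation each returned value would be differentiated only with respect to its own parameter set.

```python
import math


def split_objectives(log_lik, t_values):
    """Return (importance weighted bound for theta, single-sample-average bound for phi).

    log_lik[i]  : log p(x | z_i) for the i-th posterior sample (assumed given)
    t_values[i] : discriminator output T at z_i (assumed given)
    """
    log_w = [l - t for l, t in zip(log_lik, t_values)]
    k = len(log_w)
    m = max(log_w)  # stable log-mean-exp
    iw_bound = m + math.log(sum(math.exp(v - m) for v in log_w) / k)
    elbo = sum(log_w) / k
    return iw_bound, elbo


iw_equal, elbo_equal = split_objectives([-1.0, -1.0], [0.5, 0.5])
iw_mixed, elbo_mixed = split_objectives([-1.0, -3.0], [0.0, 0.0])
```

By Jensen's inequality the first value is never smaller than the second, and the two coincide when all log-weights are equal.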
An important reason to maximize the variational lower bound in Equation 9 with respect to $\phi$ is that it guarantees $E_{q_\phi(z \mid x)}\left[\nabla_\phi T^*(x, z)\right] = 0$ for the optimal discriminator network Mescheder2017. Since deriving $T^*$ indirectly depends on $\phi$, we want the gradient with respect to $\phi$ to be disentangled from calculating the gradients of Equation 9. Thus, we use the importance weighted bound only for the generative model. Empirically, we find that this still improves performance (Section 5).
The following proposition shows that the global Nash equilibria of IW-AVB's adversarial game yield global optima of the importance weighted objective.
Proposition 1. Assume that $T$ can represent any function of two variables. If $(\theta^*, \phi^*, T^*)$ is a Nash equilibrium of the two-player game for IW-AVB, then $T^*(x, z) = \log q_{\phi^*}(z \mid x) - \log p(z)$ and $(\theta^*, \phi^*)$ is a global optimum of the importance weighted lower bound in Equation 9.
See the Appendix for the proof. This proposition tells us that the solution to Equation 9 gives the solution to the importance weighted bound, in which $\theta^*$ becomes the maximum likelihood assignment.
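A similar property holds for AAE and IW-AAE with the discriminator $T(z)$.
Proposition 2. Assume that $T$ can represent any function of two variables. If $(\theta^*, \phi^*, T^*)$ is a Nash equilibrium of the two-player game for IW-AAE, then $T^*(z) = \log q_{\phi^*}(z) - \log p(z)$ and $(\theta^*, \phi^*)$ is the global optimum of the following objective,
$$E_{z_1, \dots, z_k \sim q_\phi(z \mid x)}\left[\log \frac{1}{k} \sum_{i=1}^{k} \frac{p_\theta(x \mid z_i)\, p(z_i)}{q_\phi(z_i)}\right].$$
The steps of the proof are the same as for Proposition 1.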
4 Relationship of IW-AVB and IW-AAE to other objectives
Bousquet et al. Bousquet2017 showed adversarial objectives with equivalent solutions to and . In a similar manner, we show that the adversarial objective with equivalent solutions to is
where is the generative adversarial network objective Goodfellow2014 with discriminative network , and and are data and model distributions. can be viewed as (pseudo-) divergence between the data and model distribution, where for all .
Similarly, the adversarial objective for IW-AAE becomes
Bousquet et al. also show that minimizing is a special case of minimizing a penalized optimal transport (POT) with -Wasserstein distance.
These adversarial objectives become tighter upper bounds as the number of samples increases:
For any distribution and , and for samples:
The proof follows the steps from Theorem 1 in Burda2015 .
The relationships between and , , and are
For any distribution and :
The proof is shown in the Appendix. The is tighter than (Proposition 3), and the is tighter than due to tighter adversarial approximation (i.e., since is convex). However, the relationship between and is unknown, because the trade-off between importance weighting bound versus the more flexible adversarial objective is unclear.
4.1 Relationship between Wasserstein Autoencoders and log-likelihood
We would like to understand the relationship between AAE (IW-AAE) and the log-likelihood. Previously, it was shown that the AAE objective converges to the Wasserstein autoencoder objective function under certain circumstances Bousquet2017. We observe that the IW-AAE objective converges to a new Wasserstein autoencoder objective which gives a tighter bound on the autoencoder log-likelihood. This autoencoder log-likelihood can be understood as the likelihood of reconstructed data under the probabilistic encoding model. Further, in a Corollary in the Appendix, we relate the log-likelihood of the autoencoder to that of the generator in a special case.
The Wasserstein distance is a distance function defined between probability distributions on a metric space. Bousquet et al. Bousquet2017 showed that the penalized optimal transport (POT) objective is a relaxed version of the Wasserstein autoencoder objective (this form assumes that the generative network is a probabilistic function), where the cost is a distance function and the GAN divergence is used as the choice of convex divergence between the prior and the aggregated posterior Bousquet2017. As the penalty coefficient tends to infinity, the POT objective converges to the Wasserstein autoencoder objective. It turns out that the AAE objective is a special case of the POT objective. This happens when the cost function is the squared Euclidean distance and the generative model $p_\theta(x \mid z)$ is Gaussian.
We can also observe that converges to :
Assume that , . Then, where and are
Moreover, converges to and converges to as .
where is the log-likelihood of an autoencoder (we abuse the notation by writing as a ). The bound is derived by applying Jensen's inequality (see the proof in the Appendix). We observe that is the lower bound of under the condition that . A tighter bound is achieved using compared to . Lastly, we observe that approximates and approximates .
The following theorem shows the relationship between AAE objective and .
Maximizing AAE objective is equivalent to jointly maximizing , mutual information with respect to , and the negative of KL divergence between joint distribution and ,
The proof is in the Appendix. This illustrates the trade-off between the mutual information and the relative information between and . In order for the gap between and to be small, needs to become close to .
We conducted our experiments with two main objectives: (i) to compare the performance of AVB, IW-AVB, AAE, and IW-AAE; and (ii) to check whether adversarial training objectives can benefit neural spike inference in general. To this end, we measure performance in two experimental setups. First, we experiment on a generative modeling task on the MNIST dataset. Second, we apply adversarial training to a neural spike inference dataset in both amortized and non-amortized inference settings.
5.1 Generative modeling
We follow the same experimental procedure as Mescheder2017 for learning generative models on the binarized MNIST dataset. We trained AVB, IW-AVB, AAE, and IW-AAE on 50,000 training examples with 10,000 validation examples, and measured log-likelihood on 10,000 test examples. We applied the same architecture as Mescheder2017 (we followed the experiment and the code from https://github.com/LMescheder/AdversarialVariationalBayes). See the details of Mescheder2017 in the Supplementary Materials.
We considered the following three metrics. The log-likelihood was computed using Annealed Importance Sampling (AIS) Neal2001; Wu2016 with 1000 intermediate distributions and 5 parallel chains. We also applied the Fréchet Inception Distance (FID) Heusel2017. It compares the mean $\mu_m$ and covariance $\Sigma_m$ of the Inception-based representation of samples generated by the model to the mean $\mu_d$ and covariance $\Sigma_d$ of the same representation for training samples:
$$\text{FID} = \|\mu_d - \mu_m\|_2^2 + \mathrm{Tr}\left(\Sigma_d + \Sigma_m - 2\left(\Sigma_d \Sigma_m\right)^{1/2}\right).$$
Lastly, we also considered the GAN metric proposed by Im2018, which measures the quality of the generator by estimating the divergence between the true data distribution and the model distribution for different choices of divergence measure. In our setting, we considered the least-squares measure (LS).
Table 1 presents the results. We observe that IW-AVB gets the best test log-likelihood for both the MNIST and FashionMNIST datasets (the results for FashionMNIST are shown in the Appendix). Note that our results are slightly lower than those reported in Mescheder2017; however, we used the same codebase for all models. On the other hand, IW-AAE gets the best FID and LS metrics. We speculate that this is because AVB and IW-AVB directly maximize the lower bound of the log-likelihood, whereas AAE and IW-AAE do not. AAE and IW-AAE instead minimize the distance between the data and model distributions directly. The MNIST and FashionMNIST samples are shown in Figure 1 and in Figure 7 in the Appendix.
5.2 Neural activity inference from calcium imaging data
We consider a challenging and important problem in neuroscience – spike inference from calcium imaging. Here, the unobserved binary spike train is the latent variable which is transformed by a generative model whose functional form is derived from biophysics into fluorescence measurements of the intracellular calcium concentration.
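The direction of this generative process, from spikes to fluorescence, can be illustrated with a deliberately simple stand-in. The first-order calcium decay, the linear readout, and every parameter value below are assumptions made for illustration; the biophysical model actually used is the one described in the Appendix.

```python
import random


def simulate_fluorescence(spikes, decay=0.95, amplitude=1.0, baseline=0.0,
                          noise_std=0.1, rng=None):
    """Toy spikes-to-fluorescence model (hypothetical, not the paper's model).

    calcium[t] = decay * calcium[t-1] + spikes[t]
    trace[t]   = baseline + amplitude * calcium[t] + Gaussian noise
    """
    rng = rng or random.Random(0)
    calcium = 0.0
    trace = []
    for s in spikes:
        calcium = decay * calcium + s
        trace.append(baseline + amplitude * calcium + rng.gauss(0.0, noise_std))
    return trace


spike_train = [0, 0, 1, 0, 0, 0, 1, 1, 0, 0]
fluorescence = simulate_fluorescence(spike_train)
noiseless = simulate_fluorescence([1, 0, 0], decay=0.5, noise_std=0.0)
```

Each spike produces a jump that decays over subsequent frames, so nearby spikes overlap in the trace, which is why the posterior over spike trains is correlated in time.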
We use a publicly available spike inference dataset, cai-1 (the dataset is available at https://crcns.org/data-sets/methods/cai-1/). We use the data from five layer 2/3 pyramidal neurons in mouse visual cortex (we excluded neurons that have clear artifacts and mislabels in the dataset). The neurons are imaged at 60 Hz using GCaMP6f, a genetically encoded calcium indicator chen2013ultrasensitive. The ground truth spikes were measured electrophysiologically using cell-attached recordings.
When we train AVB, AAE, IW-AVB, and IW-AAE to model fluorescence data, we use a biophysical generative model and a convolutional neural network as our inference network. Thus, the process is to generate (reconstruct) the fluorescence traces from the spikes inferred by the encoders. We ran five folds for every experiment on the neural spike inference dataset. The details of the architectures, biophysical model, and datasets can be found in the Appendix (Neural Spike Modeling).
We experimented under two settings: Non-amortized spike inference, and amortized spike inference settings. Non-amortized spike inference corresponds to training a new inference network for each neuron. This is expensive but it provides an estimate of the best possible performance achievable. Amortized spike inference setup corresponds to the more useful setting where a “training” dataset of neurons is used to train an inference network (without ground truth), and the trained inference network is tested on a new “test” neuron. This is the more practically useful setting for spike inference – once the inference network is trained, spike inference is extremely fast and only requires prediction by the inference network.
We use two variants of VIMCO Mnih2016 as baselines, VIMCO-FACT and VIMCO-CORR speiser2017fast. VIMCO-FACT uses a fast factorized posterior distribution which can be sampled in parallel over time, the same as the adversarially trained networks. VIMCO-CORR uses an autoregressive posterior that produces correlated samples which must be sampled sequentially in time (see the details in the Appendix, Neural Spike Modeling).
Following the neuroscience community, we evaluated the quality of our posterior inference networks by computing the correlation between predicted spikes and labels as the performance metric. We used a paired t-test goulden1956 to compare the improvement of all pairs of inference networks across five neurons (see Figure 2). The full table of correlation scores for all neurons and methods in both amortized and non-amortized settings is shown in Appendix Table 3. We observe that the performances of AVB, AAE, IW-AAE, and IW-AVB lie between those of VIMCO-FACT and VIMCO-CORR. Overall, we observe that IW-AVB, AAE, and IW-AAE perform similarly across the given neuron datasets. Figure 3 illustrates the VIMCO-FACT and IW-AVB posterior approximations on the neuron 1 dataset. From the figure, we observe that VIMCO-FACT tends to have many false negatives, while IW-AVB tends to have many false positives. The results are similar for the amortized experiments, as shown in Table 4. Interestingly, the performance of IW-AVB, AAE, and IW-AAE was better than in the non-amortized experiments. This suggests that neural spike inference can generalize over multiple neurons. Note that this is the first time that adversarial training has been applied to neural spike inference.
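The evaluation described above can be sketched with two small helpers: a Pearson correlation between a predicted and a ground-truth spike sequence, and a paired t statistic over per-neuron scores of two methods. The function names and the example scores are made up for illustration, and any preprocessing such as temporal binning is not shown because it is not specified here.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the paired differences, with n - 1 degrees of freedom."""
    d = [u - v for u, v in zip(scores_a, scores_b)]
    n = len(d)
    mean_d = sum(d) / n
    sd_d = math.sqrt(sum((v - mean_d) ** 2 for v in d) / (n - 1))
    return mean_d / (sd_d / math.sqrt(n))


# Hypothetical per-neuron correlation scores for two methods.
method_a = [0.5, 0.6, 0.7]
method_b = [0.4, 0.45, 0.5]
t_stat = paired_t_statistic(method_a, method_b)
corr = pearson([0, 1, 0, 1], [0.1, 0.9, 0.2, 0.8])
```

Pairing by neuron removes between-neuron variability from the comparison, which matters when only a handful of neurons are available.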
Moreover, VIMCO-CORR generates correlated posterior samples, whereas the samples from VIMCO-FACT are independent. Nevertheless, its inference is slower at test time than that of VIMCO-FACT, since the spike inferences are done sequentially rather than in parallel. This is a huge disadvantage for VIMCO-CORR, because spike recordings can be hours long. We emphasize adversarial training, such as IW-AVB and IW-AAE, because it generates correlated posterior samples in parallel. Figure 2 demonstrates the time advantage of adversarial training over VIMCO-CORR. The total fluorescence data duration was 1 hour at a 60 Hz sampling rate, and the experiments were run on an NVIDIA GeForce RTX 2080 Ti.
Motivated by two ways of improving the variational bound: importance weighting Burda2015 and better posterior approximation Rezende:2015vu ; Mescheder2017 ; Makhzani2016 , we propose importance weighted adversarial variational Bayes (IW-AVB) and importance weighted adversarial autoencoder (IW-AAE). Our theoretical analysis provides better understanding of adversarial autoencoder objectives, and bridges the gap between log-likelihood of an autoencoder and generator.
Adversarially trained inference networks are particularly effective at learning correlated posterior distributions over discrete latent variables which can be sampled efficiently in parallel. We exploit this finding to apply both standard and importance weighted variants of AVB and AAE to the important yet challenging problem of inferring spiking neural activity from calcium imaging data. We have empirically shown that the correlated posteriors trained adversarially in general outperform existing VAEs with factorized posteriors. Moreover, we obtain a large speed gain during spike inference compared to existing VAE work with autoregressive correlated posteriors speiser2017fast.
- (1) Laurence Aitchison, Vincent Adam, and Srinivas C Turaga. Discrete flow posteriors for variational inference in discrete dynamical systems. arXiv, May 2018.
- (2) Laurence Aitchison, Lloyd Russell, Adam M Packer, Jinyao Yan, Philippe Castonguay, Michael Häusser, and Srinivas C Turaga. Model-based Bayesian inference of neural activity and connectivity from all-optical interrogation of a neural circuit. In I Guyon, U V Luxburg, S Bengio, H Wallach, R Fergus, S Vishwanathan, and R Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3486–3495. Curran Associates, Inc., 2017.
- (3) Philipp Berens, Jeremy Freeman, Thomas Deneux, Nikolay Chenkov, Thomas McColgan, Artur Speiser, Jakob H Macke, Srinivas C Turaga, Patrick Mineault, Peter Rupprecht, et al. Community-based benchmarking improves spike rate inference from two-photon calcium imaging data. PLoS computational biology, 14(5):e1006157, 2018.
- (4) Olivier Bousquet, Sylvain Gelly, Ilya Tolstikhin, Carl-Johann Simon-Gabriel, and Bernhard Schölkopf. From optimal transport to generative modeling: the VEGAN cookbook. In arXiv preprint arXiv:1705.07642, 2017.
- (5) Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In arXiv preprint arXiv:1509.00519, 2015.
- (6) Tsai-Wen Chen, Trevor J Wardill, Yi Sun, Stefan R Pulver, Sabine L Renninger, Amy Baohan, Eric R Schreiter, Rex A Kerr, Michael B Orger, Vivek Jayaraman, et al. Ultrasensitive fluorescent proteins for imaging neuronal activity. Nature, 499(7458):295, 2013.
- (7) Chris Cremer, Quaid Morris, and David Duvenaud. Reinterpreting importance-weighted autoencoders. In arXiv preprint arXiv:1704.02916, 2017.
- (8) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Neural Information Processing Systems, 2014.
- (9) C. H. Goulden. -. In Methods of Statistical Analysis, 2nd ed., pages 50–55. New York: Wiley, 1956.
- (10) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In arXiv preprint arXiv:1706.08500, 2017.
- (11) Daniel Jiwoong Im, He Ma, Graham Taylor, and Kristin Branson. Quantitatively evaluating GANs with divergences proposed for training. In International Conference on Learning Representations, 2018.
- (12) Leonid Kantorovich. On the transfer of masses (in russian). In Doklady Akademii Nauk, 1942.
- (13) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.
- (14) Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. In arXiv preprint arXiv:1511.05644, 2015.
- (15) Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In arXiv preprint arXiv:1701.04722, 2017.
- (16) Andriy Mnih and Danilo J. Rezende. Variational inference for monte carlo objectives. In arXiv preprint arXiv:1602.06725, 2016.
- (17) Andriy Mnih and Danilo J Rezende. Variational inference for monte carlo objectives. arXiv preprint arXiv:1602.06725, 2016.
- (18) Radford M. Neal. Annealed importance sampling. Statistics and Computing, 11:125–139, 2001.
- (19) Tom Rainforth, Adam R. Kosiorek, Tuan Anh Le, Chris J. Maddison, Maximilian Igl, Frank Wood, and Yee Whye Teh. Tighter variational bounds are not necessarily better. In arXiv preprint arXiv:1802.04537, 2018.
- (20) Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, 2015.
- (21) Danilo Jimenez Rezende and Shakir Mohamed. Variational Inference with Normalizing Flows. arXiv, May 2015.
- (22) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In arXiv preprint arXiv:1401.4082, 2014.
- (23) Artur Speiser, Jinyao Yan, Evan W Archer, Lars Buesing, Srinivas C Turaga, and Jakob H Macke. Fast amortized inference of neural activity from calcium imaging data with variational autoencoders. In Advances in Neural Information Processing Systems, pages 4024–4034, 2017.
- (24) Yuhuai Wu, Yuri Burda, Ruslan Salakhutdinov, and Roger Grosse. On the quantitative analysis of decoder-based generative models. In arXiv preprint arXiv:1611.04273, 2016.
Proof of Things
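Proposition 1. Assume that $T$ can represent any function of two variables. If $(\theta^*, \phi^*, T^*)$ is a Nash equilibrium of the two-player game, then $T^*(x, z) = \log q_{\phi^*}(z \mid x) - \log p(z)$ and $(\theta^*, \phi^*)$ is a global optimum of the importance weighted lower bound.
Proof. Suppose that $(\theta^*, \phi^*, T^*)$ is a Nash equilibrium. It was previously shown by Mescheder2017 that
$$T^*(x, z) = \log q_{\phi^*}(z \mid x) - \log p(z).$$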
Now, we substitute into Equation 7 and show that maximizes the following formula as a function of and :
Define the implicit distribution :
is an importance weight.
Now, following the steps of turning in terms of with implicit , we have
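We proceed by contradiction. Suppose that $(\theta^*, \phi^*)$ does not maximize the variational lower bound. Then there exists $(\theta', \phi')$ such that
However, substituting $(\theta', \phi')$ yields a value greater than that attained at $(\theta^*, \phi^*)$, which contradicts the assumption. Hence, $(\theta^*, \phi^*)$ is a global optimum of the bound.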
Since we can express in terms of with implicit distribution ,
() is also a global optimum of the importance weighted lower bound.
For any distribution and :
First, we show .