1 Introduction
The variational autoencoder (VAE) Kingma2014vae ; Rezende2014 has been used to train deep latent-variable generative models, which model a distribution over observations $x$ via latent variables $z$ such that $p_\theta(x) = \int p_\theta(x|z) p(z)\, dz$, using a deep neural network which transforms samples from the prior $p(z)$ into samples from $p_\theta(x|z)$. This model trains the latent-variable generative model using approximate posterior samples from a simultaneously trained recognition network, or inference network, to maximize the evidence lower bound (ELBO). There are two ways to improve the quality of the learned deep generative model. The multisample objective used by the importance weighted autoencoder (IWAE) Burda2015 yields a tighter lower bound on the model evidence $\log p_\theta(x)$, leading to superior generative models. Optimizing this objective corresponds to implicitly reweighting the samples from the approximate posterior. A second way to improve the quality of the generative model is to explicitly improve the approximate posterior samples generated by the recognition network.
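To make the multisample objective concrete, the following is a minimal sketch of the $K$-sample bound estimate $\mathcal{L}_K = \log \frac{1}{K}\sum_k p(x, z_k)/q(z_k|x)$ for a toy one-dimensional Gaussian model; the model, proposal, and all function names are illustrative choices, not the paper's implementation:

```python
import numpy as np

def iwae_bound(x, mu_q, sigma_q, K, rng):
    """K-sample IWAE bound estimate L_K = log( (1/K) sum_k p(x, z_k)/q(z_k|x) )
    for the toy model p(z) = N(0,1), p(x|z) = N(z,1), q(z|x) = N(mu_q, sigma_q^2)."""
    z = mu_q + sigma_q * rng.standard_normal(K)                 # K proposal samples
    log_joint = (-0.5 * z**2 - 0.5 * np.log(2 * np.pi)          # log p(z)
                 - 0.5 * (x - z)**2 - 0.5 * np.log(2 * np.pi))  # + log p(x|z)
    log_q = (-0.5 * ((z - mu_q) / sigma_q)**2
             - np.log(sigma_q) - 0.5 * np.log(2 * np.pi))       # log q(z|x)
    log_w = log_joint - log_q                                   # log importance weights
    m = log_w.max()                                             # stable log-mean-exp
    return m + np.log(np.mean(np.exp(log_w - m)))
```

In this conjugate toy model the exact posterior is $q(z|x) = \mathcal{N}(x/2, 1/2)$, in which case every importance weight equals $p(x)$ and the bound is exact for any $K$, with $\log p(x) = \log \mathcal{N}(x; 0, 2)$.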
In the VAE framework, the recognition network is restricted to approximate posterior distributions under which the log probability of a sample and its derivatives can be evaluated in closed form. The adversarial autoencoder (AAE) Makhzani2016 and adversarial variational Bayes (AVB) Mescheder2017 show how this constraint can be relaxed, leading to more flexible posterior distributions which are implicitly represented by the recognition network. In this paper, we derive importance weighted adversarial autoencoders, IWAVB and IWAAE, combining adversarial and importance weighting techniques to improve probabilistic modeling. Spike inference is an important Bayesian inference problem in neuroscience
berens2018community. Calcium imaging methods enable the indirect measurement of neural activity of large populations of neurons in the living brain in a minimally invasive manner. The intracellular calcium concentration measured by fluorescence microscopy of a genetically encoded calcium sensor such as GCaMP6
chen2013ultrasensitive is an indirect measure of the spiking activity of the neuron. VAEs have previously been used speiser2017fast ; Aitchison:2017wz to perform Bayesian inference of spiking activity by training inference networks to invert the known, biophysically described generative process which converts unobserved spikes into observed fluorescence time series. The accuracy of a VAE-based spike inference method depends strongly on the quality of the posterior approximation used by the inference network. The posterior distribution over the binary latent spike train given the fluorescence time series has previously been approximated speiser2017fast
using either a factorized Bernoulli distribution (VIMCO-FACT), in which each time bin is sampled independently, or an autoregressive Bernoulli distribution (VIMCO-CORR). As we show, the correlated autoregressive posterior is more accurate, but slow to sample from. In contrast, the factorized posterior allows for fast parallel sampling, especially on a GPU, but ignores correlations in the posterior. Fast inference networks which sample from correlated posteriors over discrete binary spike trains would be a significant advance for VAE-based spike inference. Fast correlated distributions over time series can be constructed using normalizing flows for continuous random variables
Rezende2015, but this is considerably harder for discrete random variables
Aitchison:2018vq . Thus an adversarial approach, in which an inference network transforming noise samples into samples from the posterior can be trained without the need to evaluate the posterior likelihood, is particularly appealing for modeling correlated distributions over discrete random variables. Here, we show that our adversarially trained inference networks produce correlated samples which outperform the factorized posterior trained in the conventional way as in speiser2017fast . In addition to these practical advances, we derive theoretical results connecting the objective functions optimized by the importance weighted variants of the AVB, AAE, and VAE. The relationship between the AAE objective and the data log-likelihood is not fully understood. The AAE has been shown to be a special case of the Wasserstein autoencoder (WAE) under certain restricted conditions Bousquet2017 . However, the tradeoffs between the standard log-likelihood and penalized optimal transport objectives are also not well understood, so further theoretical insight is necessary to fully understand the tradeoffs between the VAE and AAE.
The main contributions of this paper are the following:

We propose IWAVB and IWAAE, which yield tighter lower bounds on the log-likelihood than AVB and AAE, and whose global solution maximizes the likelihood.

We provide theoretical insights into the importance weighted adversarial objective functions. In particular, we relate the AAE and IWAAE objectives to log-likelihoods and Wasserstein autoencoder objectives.

We develop standard and importance weighted adversarial neural spike inference for calcium imaging data, and show that adversarially trained inference networks outperform existing VAEs using factorized posteriors.
2 Background
The maximum likelihood estimation of the parameter $\theta$
of a model defined as $p_\theta(x) = \int p_\theta(x|z) p(z)\, dz$, where $z$ is a latent variable, is in general intractable. Variational methods maximize a lower bound of the log-likelihood. This lower bound is based on approximating the intractable posterior by a tractable distribution $q_\phi(z|x)$ parameterized by variational parameters $\phi$. VAEs maximize the following lower bound of $\log p_\theta(x)$. To make the relationship with the proposed methods clear, we write this as (1)
We do this for all criteria going forward.
To efficiently optimize this criterion with gradient descent, VAEs Kingma2014vae ; Rezende2014 define the approximate posterior such that a sample $z \sim q_\phi(z|x)$ is a differentiable transformation $z = f_\phi(x, \epsilon)$ of a noise variable $\epsilon$. It is common to assume $\epsilon \sim \mathcal{N}(0, I)$ and $q_\phi(z|x)$ to be Gaussian, with $f_\phi$ a deep network with weights $\phi$.
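As a concrete illustration of this reparameterization, a diagonal-Gaussian posterior sample can be drawn as follows; this is a minimal sketch, and in a VAE the inputs `mu` and `log_sigma` would be outputs of the encoder network rather than fixed arrays:

```python
import numpy as np

def reparam_sample(mu, log_sigma, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    so z is a differentiable function of (mu, log_sigma) given the noise."""
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(log_sigma) * eps
```

Because the randomness enters only through `eps`, gradients with respect to `mu` and `log_sigma` can flow through the sample, which is what makes gradient descent on the ELBO possible.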
Requiring that $q_\phi(z|x)$ can be analytically evaluated restricts the class of approximate posteriors and is a limitation of such approaches. Adversarial variational Bayes (AVB) Mescheder2017 maximizes the variational lower bound while implicitly approximating the KL divergence between the approximate posterior and the prior distribution by introducing a third neural network, $T_\psi(x, z)$. This neural network, known as the discriminator, implicitly estimates $\log q_\phi(z|x) - \log p(z)$.
(2)  
(3) 
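The discriminator's role as a density-ratio estimator can be illustrated with a minimal one-dimensional sketch (not the paper's implementation): a logistic classifier trained to separate samples of $q$ (label 1) from samples of $p$ (label 0) recovers $\log q - \log p$ as its logit at the optimum. Here $q$ and $p$ are Gaussians chosen so the true log-ratio is known; the training loop is plain full-batch gradient descent.

```python
import numpy as np

def train_ratio_discriminator(z_q, z_p, steps=3000, lr=0.5):
    """Logistic regression T(z) = w*z + b trained to separate samples of q
    (label 1) from samples of p (label 0). At the optimum, T(z) approximates
    log q(z) - log p(z)."""
    z = np.concatenate([z_q, z_p])
    y = np.concatenate([np.ones(len(z_q)), np.zeros(len(z_p))])
    w, b = 0.0, 0.0
    for _ in range(steps):
        logits = w * z + b
        p = 1.0 / (1.0 + np.exp(-logits))   # predicted probability of label 1
        grad = p - y                         # gradient of the logistic loss
        w -= lr * np.mean(grad * z)
        b -= lr * np.mean(grad)
    return w, b
```

For $q = \mathcal{N}(1, 1)$ and $p = \mathcal{N}(0, 1)$, the true log-ratio is $z - 1/2$, so the learned weights should approach $w \to 1$, $b \to -1/2$. AVB applies the same principle with a deep discriminator over $(x, z)$ pairs.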
The three parametric models, the generative network $p_\theta$, the inference network $q_\phi$, and the discriminator $T_\psi$, are jointly optimized using adversarial training. Unlike the VAE and IWAE, in this framework we can use arbitrarily flexible approximate distributions $q_\phi(z|x)$. The adversarial autoencoder (AAE) Makhzani2016 is similar, except that the discriminative network depends only on $z$, instead of on both $x$ and $z$. The AAE minimizes the following objective:
(4)  
(5) 
The AAE replaces the KL divergence between the approximate posterior and the prior distribution with an adversarial loss that minimizes the divergence between the aggregated posterior $q_\phi(z)$ and the prior distribution $p(z)$.
3 Importance Weighted Adversarial Training
The importance weighted autoencoder (IWAE) Burda2015 provides a tighter lower bound on $\log p_\theta(x)$,
(6) 
Burda et al. Burda2015 show that $\mathcal{L}_K \leq \log p_\theta(x)$, and that $\mathcal{L}_K$ approaches $\log p_\theta(x)$ as $K \to \infty$.
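This monotone tightening can be checked numerically on a toy Gaussian model with a deliberately over-dispersed proposal; the model and proposal here are our illustrative choices, not the paper's:

```python
import numpy as np

def avg_iwae_bound(x, K, reps, rng):
    """Average of `reps` independent K-sample IWAE estimates for the toy model
    p(z) = N(0,1), p(x|z) = N(z,1), with proposal q(z|x) = N(x/2, 1)
    (over-dispersed: the true posterior is N(x/2, 1/2))."""
    z = x / 2 + rng.standard_normal((reps, K))
    log_w = (-0.5 * z**2 - 0.5 * (x - z)**2 - np.log(2 * np.pi)   # log p(x, z)
             + 0.5 * (z - x / 2)**2 + 0.5 * np.log(2 * np.pi))    # - log q(z|x)
    m = log_w.max(axis=1, keepdims=True)                          # stable log-mean-exp
    l_k = m[:, 0] + np.log(np.mean(np.exp(log_w - m), axis=1))
    return l_k.mean()
```

Averaged over many repetitions, the estimates satisfy $\mathcal{L}_1 < \mathcal{L}_{10} < \log p(x)$, with $\log p(x) = \log \mathcal{N}(x; 0, 2)$ available in closed form for this model.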
3.1 IWAVB and IWAAE
In AVB, generative adversarial training on joint distributions over data and latent variables is applied to the variational lower bound. In this work, we propose applying it to the importance weighted lower bound of $\log p_\theta(x)$, (7)
where the discriminator is defined as in Equation 3. We call this the Importance Weighted Adversarial Variational Bayes (IWAVB) bound. As $K \to \infty$, the bound approaches $\log p_\theta(x)$.
The main advantage of IWAVB over AVB is that, when the true posterior distribution is not in the class of approximate posterior functions (as is generally the case), IWAVB uses a tighter lower bound than AVB Burda2015 .
The IWAVB and IWAAE objectives can be described as a minimax adversarial game between three neural networks: the generative network, the inference network, and the discriminative network. The inference network maps input $x$ to the latent space, and the generative network maps latent samples $z$ to the data space. Both the inference and generative networks are jointly trained to minimize the reconstruction error and the KL divergence term. The discriminator network differentiates samples from the joint distribution of data and the approximate posterior (positive samples) from samples of the joint distribution of data and the prior over latents (negative samples).
Recent work Rainforth2018 has shown that optimizing the importance weighted bound can degrade the overall learning of the inference network, because the signal-to-noise ratio (SNR) of the gradient estimates converges at rates of $O(\sqrt{K})$ and $O(1/\sqrt{K})$ for the generative and inference networks, respectively. The SNR of the inference network's gradient estimates thus converges to 0 as $K \to \infty$, and those gradient estimates become completely random. To mitigate this, we apply the importance weighted bound when updating the parameters $\theta$ of the generative network, and the variational lower bound when updating the parameters $\phi$ of the inference network. Hence, we maximize the following:
(9) 
We do this for IWAAE as well.
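The SNR degradation motivating this split can be demonstrated on a toy conjugate-Gaussian model; this is an illustrative sketch under our own model choices, not the paper's setup. The reparameterized gradient of the $K$-sample bound with respect to the proposal mean has an expectation that shrinks faster than its standard deviation as $K$ grows, so its SNR falls with $K$:

```python
import numpy as np

def grad_estimates(x, mu, K, reps, rng):
    """Reparameterized gradient of the K-sample IWAE bound w.r.t. the proposal
    mean mu, for p(z) = N(0,1), p(x|z) = N(z,1), q(z|x) = N(mu, 1).
    Per sample, d log w_k / d mu = x - 2 z_k; the bound's gradient is the
    importance-weight-normalized average of these per-sample terms."""
    z = mu + rng.standard_normal((reps, K))
    log_w = -0.5 * z**2 - 0.5 * (x - z)**2 + 0.5 * (z - mu)**2  # up to constants
    w = np.exp(log_w - log_w.max(axis=1, keepdims=True))        # stable weights
    w_norm = w / w.sum(axis=1, keepdims=True)
    return np.sum(w_norm * (x - 2 * z), axis=1)                 # one gradient per rep

def snr(g):
    """Signal-to-noise ratio of a set of gradient estimates."""
    return abs(g.mean()) / g.std()
```

Evaluated away from the optimum (e.g. `mu=0` with `x=2`), the SNR at $K=100$ is a fraction of the SNR at $K=1$, matching the $O(1/\sqrt{K})$ prediction for the inference-network gradient.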
3.2 Analysis
An important reason to update the inference network using the variational lower bound in Equation 9 is that this guarantees that the optimal discriminator network estimates $\log q_\phi(z|x) - \log p(z)$ Mescheder2017 . Since the discriminator indirectly depends on $\phi$, we want the gradient with respect to $\phi$ to be disentangled from the gradients of Equation 9. Thus, we use the importance weighted bound only for the generative model. Empirically, we find that this still improves performance (Section 5).
The following proposition shows that the global Nash equilibria of IWAVB's adversarial game yield global optima of the objective function.
Proposition 1.
Assume $T$ can represent any function of two variables. If $(\theta^*, \phi^*, T^*)$ is a Nash equilibrium of the two-player game for IWAVB, then $T^*$ is the optimal discriminator and $(\theta^*, \phi^*)$ is a global optimum of the importance weighted lower bound in Equation 9.
See the Appendix for the proof. This proposition tells us that the solution to Equation 9 gives the solution to the importance weighted bound, at which the generative model attains the maximum likelihood assignment.
A similar property holds for AAE and IWAAE with the discriminator .
Proposition 2.
Assume $T$ can represent any function of two variables. If $(\theta^*, \phi^*, T^*)$ is a Nash equilibrium of the two-player game for IWAAE, then $T^*$ is the optimal discriminator and $(\theta^*, \phi^*)$ is the global optimum of the following objective,
(10) 
where .
The steps of the proof are the same as for Proposition 1.
4 Relationship of IWAVB and IWAAE to other objectives
Bousquet et al. Bousquet2017 derived adversarial objectives with solutions equivalent to those of the VAE and AAE objectives. In a similar manner, we show that an adversarial objective with solutions equivalent to IWAVB's is
(11) 
where $\mathrm{GAN}(p_{\mathcal{D}}, p_\theta)$ is the generative adversarial network objective Goodfellow2014 with discriminative network $T$, and $p_{\mathcal{D}}$ and $p_\theta$ are the data and model distributions. The GAN objective can be viewed as a (pseudo-)divergence between the data and model distributions, which is non-negative for all distributions.
Similarly, the adversarial objective for IWAAE becomes
(12) 
Bousquet et al. also show that minimizing this objective is a special case of minimizing a penalized optimal transport (POT) objective with the Wasserstein distance.
These adversarial objectives become tighter upper bounds as the number of samples $K$ increases:
Proposition 3.
For any distributions $p_{\mathcal{D}}$ and $p_\theta$, and for $K$ samples:
The proof follows the steps from Theorem 1 in Burda2015 .
The relationships between the AVB, IWAVB, AAE, and IWAAE objectives are:
Proposition 4.
For any distributions $p_{\mathcal{D}}$ and $p_\theta$:
The proof is shown in the Appendix. The IWAVB bound is tighter than the AVB bound (Proposition 3), and each adversarial bound is tighter than its variational counterpart due to the tighter adversarial approximation (since $-\log$ is convex). However, the relationship between the IWAVB and IWAAE objectives is unknown, because the trade-off between the importance weighting bound and the more flexible adversarial objective is unclear.
4.1 Relationship between Wasserstein Autoencoders and loglikelihood
We would like to understand the relationship between the AAE (IWAAE) objective and the log-likelihood. Previously, it was shown that the AAE objective converges to the Wasserstein autoencoder objective under certain circumstances Bousquet2017 . We observe that the IWAAE objective converges to a new Wasserstein autoencoder objective which gives a tighter bound on the autoencoder log-likelihood. This quantity can be understood as the likelihood of data reconstructed by the probabilistic encoding model. Further, in a Corollary in the Appendix, we relate the two objectives in a special case.
The Wasserstein distance is a distance function defined between probability distributions on a metric space. Bousquet et al. Bousquet2017 showed that the penalized optimal transport objective is a relaxed version of the Wasserstein autoencoder objective^{1}^{1}1Given that the generative network is a probabilistic function, we have where(13)
and $c$ is a distance function. A convex divergence is used between the prior and the aggregated posterior Bousquet2017 . In the limit, the POT objective converges to the WAE objective. It turns out that the AAE objective is a special case of the POT objective. This happens when the cost function $c$ is the squared Euclidean distance and the decoder is Gaussian.
We can also observe that the IWAAE objective converges to a Wasserstein autoencoder objective:
Proposition 5.
Assume that , . Then, where and are
Moreover, converges to and converges to as .
where the left-hand side is the log-likelihood of an autoencoder^{2}^{2}2We abuse notation here.. The bound is derived by applying Jensen's inequality (see the proof in the Appendix). We observe that the Wasserstein objective is a lower bound of the autoencoder log-likelihood under the stated condition, and that the importance weighted objective achieves a tighter bound. Lastly, we observe that the AAE and IWAAE objectives approximate these respective bounds.
The following theorem shows the relationship between AAE objective and .
Theorem 1.
Maximizing the AAE objective is equivalent to jointly maximizing the autoencoder log-likelihood, the mutual information between data and latents, and the negative of the KL divergence between the corresponding joint distributions,
(14) 
The proof is in the Appendix. This illustrates the trade-off between the mutual information and the relative information between the two joint distributions. In order for the gap between the AAE objective and the log-likelihood to be small, the two joint distributions need to become close to each other.
5 Experiments
We conducted our experiments with two main objectives: (i) to compare the performance of AVB, IWAVB, AAE, and IWAAE; and (ii) to determine whether adversarial training objectives can benefit neural spike inference. We therefore measure their performance in two experimental setups. First, we experiment on a generative modeling task on the MNIST dataset. Second, we apply adversarial training to a neural spike inference dataset, in both amortized and non-amortized inference settings.
5.1 Generative modeling
We follow the same experimental procedure as Mescheder2017
for learning generative models on the binarized MNIST dataset. We trained AVB, IWAVB, AAE, and IWAAE on 50,000 training examples with 10,000 validation examples, and measured the log-likelihood on 10,000 test examples. We applied the same architecture as
Mescheder2017 ^{3}^{3}3We followed the experiment and the code from https://github.com/LMescheder/AdversarialVariationalBayes.. See the details in the Supplementary Materials. We considered the following three metrics. The log-likelihood was computed using Annealed Importance Sampling (AIS) Neal2001 ; Wu2016 with 1000 intermediate distributions and 5 parallel chains. We also applied the Frechet Inception Distance (FID) Heusel2017 . It compares the mean and covariance of the Inception-based representation of samples generated by the model to the mean and covariance of the same representation for training samples:
(15) 
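After fitting Gaussians to the two sets of representations, the FID reduces to the closed-form Frechet distance between Gaussians. The following is a minimal numpy sketch of that closed form only (not the Inception feature-extraction pipeline):

```python
import numpy as np

def frechet_distance(mu1, cov1, mu2, cov2):
    """Frechet (2-Wasserstein) distance between N(mu1, cov1) and N(mu2, cov2):
    ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov1^{1/2} cov2 cov1^{1/2})^{1/2})."""
    def sqrt_psd(m):
        # matrix square root of a symmetric PSD matrix via eigendecomposition
        vals, vecs = np.linalg.eigh(m)
        return (vecs * np.sqrt(np.clip(vals, 0, None))) @ vecs.T
    a = sqrt_psd(cov1)
    covmean = sqrt_psd(a @ cov2 @ a)   # symmetric form of (cov1 cov2)^{1/2}
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1 + cov2 - 2 * covmean))
```

The symmetrized product $\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2}$ has the same trace of square root as $\Sigma_1\Sigma_2$, which keeps the computation within real symmetric eigendecompositions.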
Lastly, we also considered the GAN metric proposed by Im2018 , which measures the quality of the generator by estimating a divergence between the true data distribution and the model distribution for different choices of divergence measure. In our setting we considered the least-squares measure (LS).
Model | Log-Likelihood | FID | LS
VAE | 90.69 ± 0.88 | 259.87 | 3.8e-5
IWAE | 91.64 ± 0.71 | 255.513 | 3.6e-5
AVB | 90.42 ± 0.78 | 256.13 | 4.1e-5
IWAVB | 85.12 ± 0.20 | 251.20 | 3.3e-5
AAE | 101.78 ± 0.62 | 266.76 | 3.8e-5
IWAAE | 101.38 ± 0.19 | 249.12 | 3.2e-5
Table 1 presents the results. We observe that IWAVB achieves the best test log-likelihood for both the MNIST and Fashion-MNIST datasets^{4}^{4}4Note that our results are slightly lower than those reported in Mescheder2017 ; however, we used the same codebase for all models (the results for Fashion-MNIST are shown in the Appendix). On the other hand, IWAAE achieves the best FID and LS metrics. We speculate that this is because AVB and IWAVB directly maximize a lower bound on the log-likelihood, whereas AAE and IWAAE do not; AAE and IWAAE instead directly minimize a distance between the data and model distributions. The MNIST and Fashion-MNIST samples are shown in Figures 1 and 7 in the Appendix.
Pairwise performance comparison using a paired t-test
5.2 Neural activity inference from calcium imaging data
We consider a challenging and important problem in neuroscience – spike inference from calcium imaging. Here, the unobserved binary spike train is the latent variable which is transformed by a generative model whose functional form is derived from biophysics into fluorescence measurements of the intracellular calcium concentration.
We use a publicly available spike inference dataset, cai-1^{5}^{5}5The dataset is available at https://crcns.org/datasets/methods/cai-1/. We use the data from five layer 2/3 pyramidal neurons in mouse visual cortex^{6}^{6}6We excluded neurons that have clear artifacts and mislabels in the dataset.. The neurons were imaged at 60 Hz using GCaMP6f, a genetically encoded calcium indicator chen2013ultrasensitive . The ground truth spikes were measured electrophysiologically using cell-attached recordings.
When we train AVB, AAE, IWAVB, and IWAAE to model fluorescence data, we use a biophysical generative model and a convolutional neural network as our inference network. Thus, fluorescence traces are reconstructed from spikes inferred by the encoder. We ran five folds for every experiment on the neural spike inference dataset. The details of the architectures, biophysical model, and datasets can be found in the Appendix.
Neural Spike Modeling. We experimented under two settings: non-amortized spike inference and amortized spike inference. Non-amortized spike inference corresponds to training a new inference network for each neuron. This is expensive, but it provides an estimate of the best achievable performance. The amortized setup corresponds to the more useful setting where a “training” dataset of neurons is used to train an inference network (without ground truth), and the trained inference network is tested on a new “test” neuron. This is the more practically useful setting for spike inference: once the inference network is trained, spike inference is extremely fast and only requires a forward pass of the inference network.
We use two variants of VIMCO Mnih2016 as baselines, VIMCO-FACT and VIMCO-CORR speiser2017fast . VIMCO-FACT uses a fast factorized posterior distribution which can be sampled in parallel over time, the same as the adversarially trained networks. VIMCO-CORR uses an autoregressive posterior that produces correlated samples which must be sampled sequentially in time (see the details in the Appendix, Neural Spike Modeling).
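The structural difference between the two posterior families can be sketched as follows; the spike probabilities and the refractory-suppression rule are illustrative toy choices, not the parameterization used by VIMCO-CORR:

```python
import numpy as np

def sample_factorized(p, n, rng):
    """Factorized Bernoulli posterior: all T time bins are sampled in one
    vectorized (parallelizable) call; samples are independent across time."""
    return (rng.random((n, len(p))) < p).astype(np.int8)

def sample_autoregressive(T, n, rng, p_base=0.3, p_refractory=0.05):
    """Toy autoregressive Bernoulli posterior: the spike probability at bin t
    depends on the previous bin (here, suppressed right after a spike), so
    sampling must proceed sequentially in time."""
    s = np.zeros((n, T), dtype=np.int8)
    for t in range(T):
        prev = s[:, t - 1] if t > 0 else np.zeros(n, dtype=np.int8)
        p_t = np.where(prev == 1, p_refractory, p_base)
        s[:, t] = rng.random(n) < p_t
    return s
```

The factorized sampler costs one vectorized draw regardless of the trace length, while the autoregressive sampler's loop over time bins is what makes sequential posteriors slow on hour-long recordings; its samples, however, carry temporal correlations (here, a spike suppresses the next bin).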
Following the neuroscience community, we evaluated the quality of our posterior inference networks by computing the correlation between predicted spikes and labels as the performance metric. We used a paired t-test goulden1956 to compare the improvement of all pairs of inference networks across five neurons (see Figure 2). The full table of correlation scores for all neurons and methods in both amortized and non-amortized settings is shown in Appendix Table 3. We observe that the AVB, AAE, IWAAE, and IWAVB performances lie between VIMCO-FACT and VIMCO-CORR. Overall, we observe that IWAVB, AAE, and IWAAE perform similarly across the neuron datasets. Figure 3 illustrates the VIMCO-FACT and IWAVB posterior approximations on the neuron 1 dataset. From the figure, we observe that VIMCO-FACT tends to have more false negatives while IWAVB tends to have more false positives. The results are similar for the amortized experiments, as shown in Table 4. Interestingly, the performance of IWAVB, AAE, and IWAAE was better than in the non-amortized experiments. This suggests that neural spike inference can be generalized over multiple neurons. Note that this is the first time that adversarial training has been applied to neural spike inference.
Moreover, VIMCO-CORR generates correlated posterior samples, whereas the samples from VIMCO-FACT are independent across time. However, VIMCO-CORR's inference is slower at test time than VIMCO-FACT's, since the spikes are inferred sequentially rather than in parallel. This is a huge disadvantage for VIMCO-CORR, because spike recordings can be hours long. We emphasize adversarial training, such as IWAVB and IWAAE, because these methods generate correlated posterior samples in parallel. Figure 2 demonstrates the time advantage of adversarial training over VIMCO-CORR. The total fluorescence data duration was 1 hour at a 60 Hz sampling rate, run on an NVIDIA GeForce RTX 2080 Ti.
6 Conclusions
Motivated by two ways of improving the variational bound, importance weighting Burda2015 and better posterior approximation Rezende:2015vu ; Mescheder2017 ; Makhzani2016 , we propose importance weighted adversarial variational Bayes (IWAVB) and the importance weighted adversarial autoencoder (IWAAE). Our theoretical analysis provides a better understanding of adversarial autoencoder objectives, and bridges the gap between the log-likelihood of an autoencoder and that of the generator.
Adversarially trained inference networks are particularly effective at learning correlated posterior distributions over discrete latent variables which can be sampled efficiently in parallel. We exploit this finding to apply both standard and importance weighted variants of AVB and AAE to the important yet challenging problem of inferring spiking neural activity from calcium imaging data. We have empirically shown that the adversarially trained correlated posteriors in general outperform existing VAEs with factorized posteriors. Moreover, we obtain a tremendous speed gain during spike inference compared to existing VAE approaches with autoregressive correlated posteriors speiser2017fast .
References
 (1) Laurence Aitchison, Vincent Adam, and Srinivas C Turaga. Discrete flow posteriors for variational inference in discrete dynamical systems. arXiv preprint, May 2018.
 (2) Laurence Aitchison, Lloyd Russell, Adam M Packer, Jinyao Yan, Philippe Castonguay, Michael Häusser, and Srinivas C Turaga. Model-based Bayesian inference of neural activity and connectivity from all-optical interrogation of a neural circuit. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3486–3495. Curran Associates, Inc., 2017.
 (3) Philipp Berens, Jeremy Freeman, Thomas Deneux, Nikolay Chenkov, Thomas McColgan, Artur Speiser, Jakob H Macke, Srinivas C Turaga, Patrick Mineault, Peter Rupprecht, et al. Community-based benchmarking improves spike rate inference from two-photon calcium imaging data. PLoS Computational Biology, 14(5):e1006157, 2018.
 (4) Olivier Bousquet, Sylvain Gelly, Ilya Tolstikhin, Carl-Johann Simon-Gabriel, and Bernhard Schölkopf. From optimal transport to generative modeling: the VEGAN cookbook. arXiv preprint arXiv:1705.07642, 2017.
 (5) Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
 (6) Tsai-Wen Chen, Trevor J Wardill, Yi Sun, Stefan R Pulver, Sabine L Renninger, Amy Baohan, Eric R Schreiter, Rex A Kerr, Michael B Orger, Vivek Jayaraman, et al. Ultrasensitive fluorescent proteins for imaging neuronal activity. Nature, 499(7458):295, 2013.
 (7) Chris Cremer, Quaid Morris, and David Duvenaud. Reinterpreting importance-weighted autoencoders. arXiv preprint arXiv:1704.02916, 2017.
 (8) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. In Neural Information Processing Systems, 2014.
 (9) C. H. Goulden. Methods of Statistical Analysis, 2nd ed., pages 50–55. New York: Wiley, 1956.
 (10) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. arXiv preprint arXiv:1706.08500, 2017.
 (11) Daniel Jiwoong Im, He Ma, Graham Taylor, and Kristin Branson. Quantitatively evaluating GANs with divergences proposed for training. In International Conference on Learning Representations, 2018.
 (12) Leonid Kantorovich. On the transfer of masses (in Russian). Doklady Akademii Nauk, 1942.
 (13) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
 (14) Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
 (15) Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. arXiv preprint arXiv:1701.04722, 2017.
 (16) Andriy Mnih and Danilo J Rezende. Variational inference for Monte Carlo objectives. arXiv preprint arXiv:1602.06725, 2016.
 (17) Radford M. Neal. Annealed importance sampling. Statistics and Computing, 11:125–139, 2001.
 (18) Tom Rainforth, Adam R. Kosiorek, Tuan Anh Le, Chris J. Maddison, Maximilian Igl, Frank Wood, and Yee Whye Teh. Tighter variational bounds are not necessarily better. arXiv preprint arXiv:1802.04537, 2018.
 (19) Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, 2015.
 (20) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 (21) Artur Speiser, Jinyao Yan, Evan W Archer, Lars Buesing, Srinivas C Turaga, and Jakob H Macke. Fast amortized inference of neural activity from calcium imaging data with variational autoencoders. In Advances in Neural Information Processing Systems, pages 4024–4034, 2017.
 (22) Yuhuai Wu, Yuri Burda, and Ruslan Salakhutdinov. On the quantitative analysis of decoder-based generative models. arXiv preprint arXiv:1611.04273, 2016.
Appendix
Proofs
Proposition 1.
Assume that $T$ can represent any function of two variables. If $(\theta^*, \phi^*, T^*)$ is a Nash equilibrium of the two-player game, then $T^*$ is the optimal discriminator and $(\theta^*, \phi^*)$ is a global optimum of the importance weighted lower bound.
Proof.
Suppose that $(\theta^*, \phi^*, T^*)$ is a Nash equilibrium. It was previously shown in [8] that
Now, we substitute $T^*$ into Equation 7 and show that $(\theta^*, \phi^*)$ maximizes the following formula as a function of $\theta$ and $\phi$:
Define the implicit distribution :
where
is an importance weight.
Now, following the steps of rewriting the importance weighted bound as a standard variational bound with an implicit distribution [7], we have
Thus, maximizes
Suppose, for contradiction, that $(\theta^*, \phi^*)$ does not maximize the variational lower bound. Then there exists $(\theta', \phi')$ such that
However, substituting $(\theta', \phi')$ gives a greater value than $(\theta^*, \phi^*)$, which contradicts the assumption. Hence, $(\theta^*, \phi^*)$ is a global optimum.
Since we can express the importance weighted bound in terms of a standard variational bound with an implicit distribution [7], $(\theta^*, \phi^*)$ is also a global optimum of the importance weighted lower bound.
∎
Proposition 4.
For any distributions $p_{\mathcal{D}}$ and $p_\theta$:
Proof.
First, we show .