1 Introduction
Consider a model of coin tosses. With probabilistic models, one typically posits a latent probability and supposes each toss is a Bernoulli outcome given this probability [38, 16]. After observing a collection of coin tosses, Bayesian analysis lets us describe our inferences about the probability. However, we know from the laws of physics that the outcome of a coin toss is fully determined by its initial conditions (say, the impulse and angle of the flip) [27, 9]. Therefore a coin toss's randomness originates not from a latent probability but from noisy initial conditions. This alternative model incorporates the physical system, better capturing the generative process. Furthermore, the model is implicit, also known as a simulator: we can sample data from its generative process, but we may not have access to calculate its density [11, 22].
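To make the distinction concrete, here is a toy coin-toss simulator in this spirit. The "physics" is a crude caricature; the noise scales and the rotation rule are invented purely for illustration. The point is that we can sample outcomes freely, yet no tractable density over outcomes is ever written down.

```python
import random

def toss(impulse_noise=0.1, seed=None):
    # Implicit model of a coin toss: sample noisy initial conditions
    # (impulse and angle), then deterministically compute the outcome.
    rng = random.Random(seed)
    impulse = rng.gauss(1.0, impulse_noise)   # noisy flip strength
    angle = rng.uniform(0.0, 1.0)             # noisy release angle
    # Deterministic "physics": the number of half-rotations decides the face.
    rotations = int(impulse * 10 + angle * 2)
    return "heads" if rotations % 2 == 0 else "tails"

samples = [toss(seed=s) for s in range(1000)]
# We can sample data, but there is no tractable density over outcomes.
```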
Coin tosses are simple, but they serve as a building block for complex implicit models. These models, which capture the laws and theories of real-world physical systems, pervade fields such as population genetics [42], statistical physics [1], and ecology [3]; they underlie structural equation models in economics and causality [41]; and they connect deeply to generative adversarial networks (GANs) [19], which use neural networks to specify a flexible implicit density [37].
Unfortunately, implicit models, including GANs, have seen limited success outside specific domains. There are two reasons. First, it is unknown how to design implicit models for more general applications, exposing rich latent structure such as priors, hierarchies, and sequences. Second, existing methods for inferring latent structure in implicit models do not sufficiently scale to high-dimensional or large data sets. In this paper, we design a new class of implicit models and we develop a new algorithm for accurate and scalable inference.
For modeling, § 2 describes hierarchical implicit models (HIMs), a class of Bayesian hierarchical models which only assume a process that generates samples. This class encompasses both simulators in the classical literature and those employed in GANs. For example, we specify a Bayesian GAN, where we place a prior on its parameters. The Bayesian perspective allows GANs to quantify uncertainty and improve data efficiency. We can also apply them to discrete data; this setting is not possible with traditional estimation algorithms for GANs [29].
For inference, § 3 develops likelihood-free variational inference (LFVI), which combines variational inference with density ratio estimation [51, 37]. Variational inference posits a family of distributions over latent variables and then optimizes to find the member closest to the posterior [25]. Traditional approaches require a likelihood-based model and use crude approximations, employing a simple approximating family for fast computation. LFVI expands variational inference to implicit models and enables accurate variational approximations with implicit variational families: LFVI does not require the variational density to be tractable. Further, unlike previous Bayesian methods for implicit models, LFVI scales to millions of data points with stochastic optimization.
This work has diverse applications. First, we analyze a classical problem from the ABC literature, where the model simulates an ecological system [3]. We analyze 100,000 time series, which is not possible with traditional methods. Second, we analyze a Bayesian GAN, which is a GAN with a prior over its weights. Bayesian GANs outperform corresponding Bayesian neural networks with known likelihoods on several classification tasks. Third, we show how injecting noise into the hidden units of recurrent neural networks (RNNs) corresponds to a deep implicit model for flexible sequence generation.
Related Work.
This paper connects closely to three lines of work. The first is Bayesian inference for implicit models, known in the statistics literature as approximate Bayesian computation (ABC) [3, 35]. ABC steps around the intractable likelihood by applying summary statistics to measure the closeness of simulated samples to real observations. While successful in many domains, ABC has shortcomings. First, the results generated by ABC depend heavily on the chosen summary statistics and the closeness measure. Second, as the dimensionality grows, closeness becomes harder to achieve. This is the classic curse of dimensionality.
The second is GANs [19]. GANs have seen much interest since their conception, providing an efficient method for estimation in neural network-based simulators. Larsen et al. [30] propose a hybrid of variational methods and GANs for improved reconstruction. Chen et al. [7] apply information penalties to disentangle factors of variation. Donahue et al. [12] and Dumoulin et al. [13] propose to match on an augmented space, simultaneously training the model and an inverse mapping from data to noise. Unlike any of the above, we develop models with explicit priors on latent variables, hierarchies, and sequences, and we generalize GANs to perform Bayesian inference.
The final thread is variational inference with expressive approximations [47, 50, 54]. The idea of casting the design of variational families as a modeling problem was proposed in Ranganath et al. [46]. Further advances have analyzed variational programs [44], a family of approximations which only requires a process returning samples, and which has seen further interest [32]. Implicit-like variational approximations have also appeared in autoencoder frameworks [34, 36] and message passing [26]. We build on variational programs for inferring implicit models.
2 Hierarchical Implicit Models
Hierarchical models play an important role in sharing statistical strength across examples [17]. For a broad class of hierarchical Bayesian models, the joint distribution of the hidden and observed variables is

p(x, z, β) = p(β) ∏_{n=1}^{N} p(x_n | z_n, β) p(z_n | β), (1)

where x_n is an observation, z_n are latent variables associated to that observation (local variables), and β are latent variables shared across observations (global variables). See Fig. 1 (left).
With hierarchical models, local variables can be used for clustering in mixture models, mixed memberships in topic models [4], and factors in probabilistic matrix factorization [49]. Global variables can be used to pool information across data points for hierarchical regression [17], topic models [4], and Bayesian nonparametrics [52].
Hierarchical models typically use a tractable likelihood p(x_n | z_n, β). But many models of interest, such as simulator-based models [22] and GANs [19], attain high fidelity to the true data generating process precisely while not admitting a tractable likelihood. To overcome this limitation, we develop hierarchical implicit models (HIMs).
HIMs have the same joint factorization as Eq. 1 but only assume that one can sample from the likelihood. Rather than define p(x_n | z_n, β) explicitly, HIMs define a function g that takes in random noise ε_n and outputs x_n given z_n and β,

x_n = g(ε_n | z_n, β),  ε_n ∼ s(·).

The induced, implicit likelihood of x_n ∈ A given z_n and β is

P(x_n ∈ A | z_n, β) = ∫_{{ε : g(ε | z_n, β) ∈ A}} s(ε) dε.

This integral is typically intractable. It is difficult to find the set to integrate over, and the integration itself may be expensive for arbitrary noise distributions s(·) and functions g.
Fig. 1 (right) displays the graphical model for HIMs. Noise variables (ε_n) are denoted by triangles; deterministic computations (g) are denoted by squares. We illustrate two examples.
Example: Physical Simulators. Given initial conditions, simulators describe a stochastic process that generates data. For example, in population ecology, the Lotka-Volterra model simulates predator-prey populations over time via a stochastic differential equation [57]. For prey and predator populations x and y respectively, one process is

dx/dt = β₁ x − β₂ x y + ε_x,
dy/dt = −β₃ y + β₄ x y + ε_y,

where Gaussian noises ε_x, ε_y are added at each full time step. The simulator runs for T time steps given initial population sizes. Log-normal priors are placed over the rates β. The Lotka-Volterra model is grounded by theory but features an intractable likelihood. We study it in § 4.
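A minimal sketch of such a simulator, using a forward Euler discretization. The rates, noise scale, step size, and prior hyperparameters below are illustrative choices, not the paper's experimental settings:

```python
import math
import random

def lotka_volterra(beta, x0=1.0, y0=1.0, T=100, dt=0.1, sigma=0.01, seed=0):
    """Simulate noisy predator-prey dynamics by Euler discretization.

    beta = (b1, b2, b3, b4): prey growth, predation, predator death,
    and predator growth rates. Gaussian noise is added at each step.
    """
    rng = random.Random(seed)
    b1, b2, b3, b4 = beta
    x, y = x0, y0
    traj = [(x, y)]
    for _ in range(T):
        dx = (b1 * x - b2 * x * y) * dt + rng.gauss(0.0, sigma)
        dy = (-b3 * y + b4 * x * y) * dt + rng.gauss(0.0, sigma)
        # Clip at zero so populations stay non-negative.
        x, y = max(x + dx, 0.0), max(y + dy, 0.0)
        traj.append((x, y))
    return traj

# Log-normal prior draw over the rates (illustrative hyperparameters).
rng = random.Random(1)
beta = tuple(math.exp(rng.gauss(-1.0, 0.5)) for _ in range(4))
series = lotka_volterra(beta)
```

Sampling a trajectory is cheap, but the density of the trajectory given β is the intractable integral above.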
Example: Bayesian Generative Adversarial Network.
Generative adversarial networks (GANs) define an implicit model and a method for parameter estimation [19]. They are known to perform well on image generation [43]. Formally, the implicit model for a GAN is

x_n = g(ε_n | θ),  ε_n ∼ s(·), (2)

where g is a neural network with parameters θ, and s is a standard normal or uniform distribution. The neural network is typically not invertible; this makes the likelihood intractable.
The parameters θ in GANs are estimated by divergence minimization between the generated and real data. We make GANs amenable to Bayesian analysis by placing a prior on the parameters θ. We call this a Bayesian GAN. Bayesian GANs enable modeling of parameter uncertainty and are inspired by Bayesian neural networks, which have been shown to improve the uncertainty and data efficiency of standard neural networks [33, 39]. We study Bayesian GANs in § 4; Appendix B provides example implementations in the Edward probabilistic programming language [55].
3 Likelihood-Free Variational Inference
We described hierarchical implicit models, a rich class of latent variable models with local and global structure alongside an implicit density. Given data, we aim to calculate the model's posterior p(z, β | x). This is difficult, as the normalizing constant p(x) is typically intractable. With implicit models, the lack of a likelihood function introduces an additional source of intractability.
We use variational inference [25]. It posits an approximating family q ∈ Q and optimizes to find the member closest to the posterior p(z, β | x). There are many choices of variational objectives that measure closeness [44, 31, 10]. To choose an objective, we lay out desiderata for a variational inference algorithm for implicit models:

1. Scalability. Machine learning hinges on stochastic optimization to scale to massive data [6]. The variational objective should admit unbiased subsampling with the standard technique,

∑_{n=1}^{N} f(x_n) ≈ (N/M) ∑_{m=1}^{M} f(x_m),

where some computation f(·) over the full data is approximated with a minibatch of data {x_m} of size M.

2. Implicit Local Approximations. Implicit models specify flexible densities; this induces very complex posterior distributions. Thus we would like a rich approximating family for the per-data point approximations q(z_n | x_n, β). This means the variational objective should only require that one can sample z_n ∼ q(z_n | x_n, β) and not evaluate its density.
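The subsampling desideratum can be checked with a minimal sketch (pure Python; the function f, data, and sizes are our illustrative choices): scaling a minibatch sum by N/M gives an unbiased estimate of the full-data sum.

```python
import random

def full_sum(data, f):
    # Exact computation over all N data points.
    return sum(f(x) for x in data)

def minibatch_estimate(data, f, M, rng):
    # Unbiased estimate of sum_n f(x_n): scale a size-M minibatch sum by N/M.
    batch = rng.sample(data, M)
    return len(data) / M * sum(f(x) for x in batch)

data = list(range(1, 101))
f = lambda x: x * x
exact = full_sum(data, f)

rng = random.Random(0)
# Averaging many minibatch estimates should recover the exact sum.
est = sum(minibatch_estimate(data, f, 10, rng) for _ in range(2000)) / 2000
```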
One variational objective meeting our desiderata is based on the classical minimization of the KL divergence. (Surprisingly, Appendix C details how the KL is the only possible objective among a broad class.)
3.1 KL Variational Objective
Classical variational inference minimizes the KL divergence from the variational approximation q to the posterior. This is equivalent to maximizing the evidence lower bound (ELBO),

L = E_{q(β, z | x)}[log p(x, z, β) − log q(β, z | x)]. (3)

Let q factorize in the same way as the posterior,

q(β, z | x) = q(β) ∏_{n=1}^{N} q(z_n | x_n, β),

where q(z_n | x_n, β) is an intractable density and, since the data x is constant during inference, we drop conditioning for the global q(β). Substituting the factorizations of p and q yields

L = E_{q(β)}[log p(β) − log q(β)] + ∑_{n=1}^{N} E_{q(β) q(z_n | x_n, β)}[log p(x_n, z_n | β) − log q(z_n | x_n, β)].

This objective presents difficulties: the local densities p(x_n, z_n | β) and q(z_n | x_n, β) are both intractable. To solve this, we consider ratio estimation.
3.2 Ratio Estimation for the KL Objective
Let q(x_n) be the empirical distribution on the observations x and consider using it in a "variational joint," q(x_n, z_n | β) = q(x_n) q(z_n | x_n, β). Now subtract the log empirical log q(x_n) from the ELBO above. The ELBO reduces to

L ∝ E_{q(β)}[log p(β) − log q(β)] + ∑_{n=1}^{N} E_{q(β) q(z_n | x_n, β)}[log p(x_n, z_n | β) − log q(x_n, z_n | β)]. (4)

(Here the proportionality symbol means equality up to additive constants.) Thus the ELBO is a function of the ratio of two intractable densities. If we can form an estimator of this ratio, we can proceed with optimizing the ELBO.
We apply techniques for ratio estimation [51]. It is a key idea in GANs [37, 56], and similar ideas have re-arisen in statistics and physics [21, 8]. In particular, we use class probability estimation: given a sample from p or q, we aim to estimate the probability that it belongs to p. We model this using σ(r(·; φ)), where r is a parameterized function (e.g., a neural network) taking sample inputs and outputting a real value, and σ is the logistic function, outputting the probability.
We train r(·; φ) by minimizing a loss function known as a proper scoring rule [18]. For example, in experiments we use the log loss,

D = −E_{p(x_n, z_n | β)}[log σ(r(x_n, z_n, β; φ))] − E_{q(x_n, z_n | β)}[log(1 − σ(r(x_n, z_n, β; φ)))]. (5)

The loss approaches zero if σ(r) returns 1 when a sample is from p and 0 when a sample is from q. (We also experiment with the hinge loss; see § 4.) If r is sufficiently expressive, minimizing the loss returns the optimal function [37],

r*(x_n, z_n, β) = log p(x_n, z_n | β) − log q(x_n, z_n | β).
As we minimize Eq. 5, we use r(·; φ) as a proxy to the log ratio in Eq. 4. Note r estimates the log ratio; it is of direct interest and more numerically stable than the ratio itself.
The gradient of D with respect to φ is

∇_φ D = −E_{p(x_n, z_n | β)}[∇_φ log σ(r(x_n, z_n, β; φ))] − E_{q(x_n, z_n | β)}[∇_φ log(1 − σ(r(x_n, z_n, β; φ)))]. (6)

We compute unbiased gradients with Monte Carlo.
3.3 Stochastic Gradients of the KL Objective
To optimize the ELBO, we substitute the ratio estimator,

L ≈ E_{q(β)}[log p(β) − log q(β)] + ∑_{n=1}^{N} E_{q(β) q(z_n | x_n, β)}[r(x_n, z_n, β; φ)]. (7)

All terms are now tractable. We can calculate gradients to optimize the variational family q. Below we assume the priors are differentiable. (We discuss methods to handle discrete global variables in the next section.)
We focus on reparameterizable variational approximations [28, 48]. They enable sampling via a differentiable transformation of random noise. Due to Eq. 7, we require the global approximation q(β; λ) to admit a tractable density. With reparameterization, its sample is

ε ∼ s(·),  β = t(ε; λ),

for a choice of transformation t(·; λ) and noise distribution s(·). For example, setting s to be standard normal and t(ε; λ) = μ + σε with λ = (μ, σ) induces a normal distribution Normal(μ, σ²). Similarly, for the local variables z_n we specify

ε_n ∼ s(·),  z_n = t(ε_n; x_n, γ).

Unlike the global approximation, the local variational density q(z_n | x_n, β) need not be tractable: the ratio estimator relaxes this requirement. It lets us leverage implicit models not only for data but also for approximate posteriors. In practice, we also amortize computation with inference networks, sharing parameters γ across the per-data point approximate posteriors.
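A minimal sketch of the global reparameterization, using the transformation t(ε; λ) = μ + σε from the text (the particular values of λ are illustrative):

```python
import random
import statistics

def sample_global(lam, rng):
    # Reparameterization: beta = t(eps; lambda) with eps ~ N(0, 1).
    # Here t(eps; (mu, sigma)) = mu + sigma * eps induces Normal(mu, sigma^2),
    # so gradients can flow through the sample into lambda.
    mu, sigma = lam
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps

rng = random.Random(0)
lam = (2.0, 0.5)   # illustrative variational parameters (mu, sigma)
draws = [sample_global(lam, rng) for _ in range(20000)]
m = statistics.fmean(draws)   # should be near mu = 2.0
s = statistics.stdev(draws)   # should be near sigma = 0.5
```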
The gradient with respect to the global parameters λ under this approximating family is

∇_λ L = E_{s(ε)}[∇_λ (log p(β) − log q(β))] + ∑_{n=1}^{N} E_{s(ε) q(z_n | x_n, β)}[∇_λ r(x_n, z_n, β; φ)], (8)

where β = t(ε; λ). The gradient backpropagates through the local sampling z_n ∼ q(z_n | x_n, β) and the global reparameterization β = t(ε; λ). We compute unbiased gradients with Monte Carlo. The gradient with respect to the local parameters γ is

∇_γ L = ∑_{n=1}^{N} E_{q(β) s(ε_n)}[∇_γ r(x_n, z_n, β; φ)], (9)

where the gradient backpropagates through z_n = t(ε_n; x_n, γ).¹

¹The ratio estimator indirectly depends on the variational parameters, but its gradient with respect to them disappears. This is derived via the score function identity and the product rule (see, e.g., Ranganath et al. [45, Appendix]).
3.4 Algorithm
Algorithm 1 outlines the procedure. We call it likelihood-free variational inference (LFVI). LFVI is black box: it applies to models in which one can simulate data and local variables, and calculate densities for the global variables. LFVI first updates the ratio estimator r. Then it uses r to update the parameters of the variational approximation q. We optimize both simultaneously. The algorithm is available in Edward [55].
LFVI is scalable: we can unbiasedly estimate the gradient over the full data set with minibatches [24]. The algorithm can also handle models of either continuous or discrete data. The requirement for differentiable global variables and reparameterizable global approximations can be relaxed using score function gradients [45].
Point estimates of the global parameters suffice for many applications [19, 48]. Algorithm 1 can find point estimates: place a point mass approximation on the parameters . This simplifies gradients and corresponds to variational EM.
4 Experiments
We developed new models and inference methods. For experiments, we study three applications: a large-scale physical simulator for predator-prey populations in ecology; a Bayesian GAN for supervised classification; and a deep implicit model for symbol generation. In addition, Appendix F provides practical advice on how to address the stability of the ratio estimator by analyzing a toy experiment. We initialize parameters from a standard normal and apply gradient descent with ADAM.
Lotka-Volterra Predator-Prey Simulator. We analyze the Lotka-Volterra simulator of § 2 and follow the same setup and hyperparameters of Papamakarios and Murray [40]. Its global variables β govern rates of change in a simulation of predator-prey populations. To infer them, we posit a mean-field normal approximation (reparameterized to be on the same support) and run Algorithm 1 with both a log loss and a hinge loss for the ratio estimation problem; Appendix D details the hinge loss. We compare to rejection ABC, MCMC-ABC, and SMC-ABC [35]. MCMC-ABC uses a spherical Gaussian proposal; SMC-ABC is manually tuned with a decaying epsilon schedule; all ABC methods are tuned to use the best performing hyperparameters, such as the tolerance error.

Fig. 2 displays results on two data sets. In the top figures and bottom left, we analyze data consisting of a single simulated time series, with population values recorded at regular intervals. The bottom left figure calculates the negative log probability of the true parameters over the tolerance error for ABC methods; smaller tolerances result in more accuracy but slower runtime. The top figures compare the marginal posteriors for two parameters using the smallest tolerance for the ABC methods. Rejection ABC, MCMC-ABC, and SMC-ABC all contain the true parameters in their 95% credible interval but are less confident than our methods. Further, they required many more simulations from the model, with low acceptance rates for rejection ABC and MCMC-ABC.

The bottom right figure analyzes data consisting of 100,000 time series, each of the same size as the single time series analyzed in the previous figures. This scale is not possible with traditional methods. Further, we see that with our methods, the posterior concentrates near the truth. We also experienced little difference in accuracy between using the log loss or the hinge loss for ratio estimation.
Table 1: Test set error.

Model + Inference  | Crabs | Pima  | Covertype | MNIST
Bayesian GAN + VI  | 0.03  | 0.232 | 0.154     | 0.0136
Bayesian GAN + MAP | 0.12  | 0.240 | 0.185     | 0.0283
Bayesian NN + VI   | 0.02  | 0.242 | 0.164     | 0.0311
Bayesian NN + MAP  | 0.05  | 0.320 | 0.188     | 0.0623
Bayesian Generative Adversarial Networks. We analyze Bayesian GANs, described in § 2. Mimicking a use case of Bayesian neural networks [5, 23], we apply Bayesian GANs for classification on small to medium-size data. The GAN defines a conditional p(y_n | x_n), taking a feature x_n as input and generating a label y_n, via the process

ε_n ∼ s(·),  y_n = g([x_n, ε_n] | θ), (10)

where g is a 2-layer multilayer perceptron with ReLU activations and batch normalization, parameterized by weights and biases θ. We place normal priors, θ ∼ Normal(0, I).

We analyze two choices of the variational model: one with a mean-field normal approximation for q(θ), and another with a point mass approximation (equivalent to maximum a posteriori). We compare to a Bayesian neural network, which uses the same generative process as Eq. 10 but draws y_n from a Categorical distribution rather than feeding noise into the neural net. We fit it separately using a mean-field normal approximation and maximum a posteriori. Table 1 shows that Bayesian GANs generally outperform their Bayesian neural net counterpart.
Note that Bayesian GANs can analyze discrete data such as a generated classification label. Traditional GANs for discrete data are an open challenge [29]. In Appendix E, we compare Bayesian GANs with point estimation to typical GANs. Bayesian GANs are also able to leverage parameter uncertainty for analyzing these small to medium-size data sets.
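To illustrate why discrete data poses no problem at sampling time, here is a toy sketch of the generative process: a feature is concatenated with noise, pushed through a small MLP, and the sign of the output gives the label (as in Appendix B's description). Network sizes and feature values are illustrative, and the weights are drawn from a standard normal prior:

```python
import random

def mlp(inp, W1, b1, W2, b2):
    # 2-layer MLP with ReLU hidden units, returning a real-valued score.
    h = [max(0.0, sum(w * v for w, v in zip(row, inp)) + b)
         for row, b in zip(W1, b1)]
    return sum(w * v for w, v in zip(W2, h)) + b2

def generate_label(x, theta, rng):
    # Concatenate the feature with noise; the sign of the output gives a
    # discrete label. Sampling stays trivial even though the implied
    # likelihood p(y | x, theta) is intractable.
    W1, b1, W2, b2 = theta
    eps = rng.gauss(0.0, 1.0)
    return 1 if mlp(list(x) + [eps], W1, b1, W2, b2) >= 0.0 else 0

rng = random.Random(0)
D, H = 3, 8   # feature dimension and hidden width (illustrative)
theta = ([[rng.gauss(0, 1) for _ in range(D + 1)] for _ in range(H)],
         [rng.gauss(0, 1) for _ in range(H)],
         [rng.gauss(0, 1) for _ in range(H)],
         rng.gauss(0, 1))   # weights drawn from a standard normal prior
labels = [generate_label([0.1, -0.2, 0.3], theta, rng) for _ in range(100)]
```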
One problem with Bayesian GANs is that they cannot work with very large neural networks: the ratio estimator is a function of the global parameters, and thus the input size grows with the size of the neural network. One approach is to make the ratio estimator not a function of the global parameters: instead of optimizing model parameters via variational EM, we can train the model parameters by backpropagating through the ratio objective rather than the variational objective. An alternative is to use the hidden units as input, which are much lower dimensional [53, Appendix C].
Injecting Noise into Hidden Units. In this section, we show how to build a hierarchical implicit model by simply injecting randomness into hidden units. We model sequences x = (x_1, …, x_T) with a recurrent neural network. For t = 1, …, T,

ε_t ∼ s(·),  h_t = g_h([x_{t−1}, h_{t−1}, ε_t]),  x_t = g_x([h_t, ε_t]),

where g_h and g_x are both 1-layer multilayer perceptrons with ReLU activation and layer normalization. We place standard normal priors over all weights and biases. See Fig. 2(a).
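A toy sketch of noise injection into an RNN's hidden units. The model in the text uses 1-layer MLPs with layer normalization; this sketch simplifies to diagonal recurrent weights, a tanh nonlinearity, and a mean emission fed back as the next input, all for brevity:

```python
import math
import random

def step(h, x, eps, Wh, Wx, We, b):
    # Hidden-state update with injected noise eps: a nonlinear combination
    # of state, input, and noise makes the induced transition density implicit.
    return [math.tanh(wh * hv + wx * x + we * eps + bb)
            for wh, wx, we, bb, hv in zip(Wh, Wx, We, b, h)]

rng = random.Random(0)
H = 4   # hidden size (illustrative)
Wh = [rng.gauss(0, 0.5) for _ in range(H)]   # diagonal recurrent weights
Wx = [rng.gauss(0, 0.5) for _ in range(H)]
We = [rng.gauss(0, 0.5) for _ in range(H)]   # noise-injection weights
b = [0.0] * H

h = [0.0] * H
states = []
x = 0.0
for t in range(10):
    eps = rng.gauss(0.0, 1.0)   # fresh noise at every time step
    h = step(h, x, eps, Wh, Wx, We, b)
    x = sum(h) / H              # toy emission fed back as the next input
    states.append(list(h))
```

Because the noise enters the state nonlinearly, the transition density has no closed form, yet sampling whole sequences remains trivial.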
x+x/x**x*//x*x+
x/x*x+x*x/x+x+x+
/+x*x+x*x/x/x+x+
/x+*x+x*x/x+xx+
x/x*x/x*x+x+x+x
x+x+x/x*x*x+x/x+

Figure: Generated symbols from the implicit model. Good samples place arithmetic operators between the variable x. The implicit model learned to follow rules from the context-free grammar up to some multiple operator repeats.
If the injected noise combines linearly with the output of g_h, the induced distribution p(h_t | h_{t−1}, x_{t−1}) is Gaussian parameterized by that output. This defines a stochastic RNN (Bayer and Osendorfer [2]; Fraccaro et al.), which generalizes its deterministic connection. With nonlinear combinations, the implicit density is more flexible (and intractable), making previous methods for inference not applicable. In our method, we perform variational inference and specify q to be implicit; we use the same architecture as the probability model's implicit priors.

We follow the same setup and hyperparameters as Kusner and Hernández-Lobato and generate simple one-variable arithmetic sequences following a context-free grammar,

S → x | S + S | S − S | S ∗ S | S / S,

where '|' divides possible productions of the grammar. We concatenate the inputs and point estimate the global variables (model parameters) using variational EM. The figure above displays samples from the inferred model, trained on sequences with a maximum of 15 symbols. It achieves sequences which roughly follow the context-free grammar.
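For reference, the ground-truth generative process for these data can be sketched as a sampler for the context-free grammar above. The depth cutoff is our addition to keep sequences within the 15-symbol range; with `max_depth=3`, sequences have at most 15 characters:

```python
import random

# Productions of S in the grammar: a terminal or a binary operation.
PRODUCTIONS = ["x", "S+S", "S-S", "S*S", "S/S"]

def sample_expr(rng, depth=0, max_depth=3):
    # Expand S by a random production; force the terminal "x"
    # once the depth limit is reached so sequences stay short.
    prod = "x" if depth >= max_depth else rng.choice(PRODUCTIONS)
    return "".join(sample_expr(rng, depth + 1, max_depth) if c == "S" else c
                   for c in prod)

rng = random.Random(0)
seqs = [sample_expr(rng) for _ in range(5)]
```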
Appendix A Noise versus Latent Variables

HIMs have two sources of randomness for each data point: the latent variable z_n and the noise ε_n; these sources of randomness get transformed to produce x_n. Bayesian analysis infers posteriors on latent variables. A natural question is whether one should also infer the posterior of the noise. The posterior's shape, and ultimately whether it is meaningful, is determined by the dimensionality of the noise and the transformation. For example, consider the GAN model, which has no local latent variable. The conditional of x_n given the noise is a point mass, fully determined by g. When g is injective, the posterior over the noise is also a point mass at the preimage of x_n under the left inverse of g. This means that for injective functions of the randomness (both noise and latent variables), the "posterior" may be worth analysis as a deterministic hidden representation [12], but it is not random.

The point mass posterior can be found via nonlinear least squares. Nonlinear least squares yields the iterative algorithm ε ← ε − ρ_t ∇_ε ||x_n − g(ε)||² for some step size sequence ρ_t. Note the updates will get stuck when the gradient of g is zero. However, the injective property of g allows the iteration to be checked for correctness (simply check whether g(ε) = x_n).

Appendix B Implicit Model Examples in Edward

We demonstrate implicit models via example implementations in Edward [55]. Fig. 3 implements a 2-layer deep implicit model. It uses tf.layers to define neural networks: tf.layers.dense(x, 256) applies a fully connected layer with 256 hidden units and input x; weight and bias parameters are abstracted from the user. The program generates data points using two layers of implicit latent variables and with an implicit likelihood. Fig. 4 implements a Bayesian GAN for classification. It manually defines a 2-layer neural network, where for each data index, it takes features concatenated with noise as input. The output is a label, given by the sign of the last layer. We place a standard normal prior over all weights and biases.
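The nonlinear least-squares iteration from Appendix A can be checked on a toy injective transformation; g, the step size, and the starting point below are our illustrative choices:

```python
def g(eps):
    # An injective, differentiable transformation of the noise
    # (strictly increasing, so a left inverse exists).
    return eps ** 3 + eps

def invert(x, steps=500, lr=0.01):
    # Recover the noise by nonlinear least squares:
    # eps <- eps - lr * d/d_eps (x - g(eps))^2.
    eps = 0.0
    for _ in range(steps):
        grad = -2.0 * (x - g(eps)) * (3.0 * eps ** 2 + 1.0)
        eps -= lr * grad
    return eps

true_eps = 0.7
x = g(true_eps)
eps_hat = invert(x)
# Injectivity lets us verify the solution: g(eps_hat) should equal x.
```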
Running this program while feeding the placeholder generates a vector of labels.

Appendix C KL Uniqueness

An integral probability metric measures distance between two distributions p and q,

d_F(p, q) = sup_{f ∈ F} |E_p[f(x)] − E_q[f(x)]|.

Integral probability metrics have been used for parameter estimation in generative models [14] and for variational inference in models with tractable density [46]. In contrast to models with only local latent variables, to infer the posterior we need an integral probability metric between it and the variational approximation. The direct approach fails because sampling from the posterior is intractable. An indirect approach requires constructing a sufficiently broad class of functions with posterior expectation zero based on Stein's method [46]. These constructions require a likelihood function and its gradient. Working around the likelihood would require a form of nonparametric density estimation; unlike ratio estimation, we are unaware of a solution that sufficiently scales to high dimensions.

As another class of divergences, the f-divergence is

D_f(p || q) = E_q[f(p/q)].

Unlike integral probability metrics, f-divergences are naturally conducive to ratio estimation, enabling implicit p and implicit q. However, the challenge lies in scalable computation. To subsample data in hierarchical models, we need f to satisfy f(a·b) = f(a) + f(b) up to constants, so that the expectation becomes a sum over individual data points. For continuous functions, this is a defining property of the logarithm. This implies the KL divergence from q to p is the only f-divergence where the subsampling technique in our desiderata is possible.

Appendix D Hinge Loss

Let r(·; φ) output a real value, as with the log loss in § 4. The hinge loss is defined analogously to Eq. 5 as a proper scoring rule. We minimize this loss function by following unbiased gradients; the gradients are calculated analogously as for the log loss. The optimal r is again the log ratio.
Appendix E Comparing Bayesian GANs with MAP to GANs with MLE

In § 4, we argued that MAP estimation with a Bayesian GAN enables analysis over discrete data, but GANs, even with a maximum likelihood objective [20], cannot. This is a surprising result: assuming a flat prior for MAP, the two are ultimately optimizing the same objective. We compare the two below.

For GANs, assume the discriminator D outputs a logit probability, so that it is unconstrained instead of on [0, 1]. GANs with MLE use the discriminative problem

max_φ E_{q(x_n)}[log σ(D(x_n; φ))] + E_{p(x_n | θ)}[log(1 − σ(D(x_n; φ)))],

where q(x_n) is the empirical distribution, and the generative problem

max_θ E_{p(x_n | θ)}[exp(D(x_n; φ))].

Solving the generative problem with reparameterization gradients requires backpropagating through data generated from the model, x_n = g(ε_n | θ). This is not possible for discrete x_n. Further, the exponentiation also makes this objective numerically unstable and thus unusable in practice.

Contrast this with the Bayesian GAN with MLE (MAP and a flat prior). This applies a point mass variational approximation q(θ) = δ(θ − θ'). It maximizes the ELBO; the prior term is zero for a flat prior and point mass approximation, so the problem reduces to

max_{θ'} ∑_{n=1}^{N} r(x_n, θ'; φ).

Solving this is possible for discrete x_n: it only requires backpropagating gradients through r with respect to θ', all of which is differentiable. Further, the objective does not require a numerically unstable exponentiation.

Ultimately, the difference lies in the role of the ratio estimators. Recall that for the Bayesian GAN, the optimal ratio estimator is the log ratio log p(x_n | θ) − log q(x_n). Optimizing it with respect to θ reduces to optimizing the log-likelihood log p(x_n | θ). The optimal discriminator for GANs with MLE has the same ratio; however, it is a constant function with respect to θ. Hence one cannot immediately substitute it as a proxy to optimizing the likelihood. An alternative is to use importance sampling; the result is the former objective [20].

Appendix F Stability of Ratio Estimator

With implicit models, the difference from standard KL variational inference lies in the ratio estimation problem.
Thus we would like to assess the accuracy of the ratio estimator. We can check this by comparing r to the true ratio under a model with tractable likelihood. We apply Bayesian linear regression, which features a tractable posterior that we leverage in our analysis. We use 50 simulated data points. The optimal (log) ratio is

r*(x_n, β) = log p(x_n | β) − log q(x_n).

Note the log-likelihood log p(x_n | β) minus r* equals log q(x_n), the log empirical distribution, a constant. Therefore if a ratio estimator is accurate, its difference with the log-likelihood should be a constant with low variance across values of β.

See Fig. 5. The top graph displays the estimate of the ratio over updates of the variational approximation q(β); each estimate uses a sample from the current q(β). The ratio estimator is more accurate as q(β) converges to the posterior. This matches our intuition: if data generated from the model is close to the true data, then the ratio is more stable to estimate.

An alternative hypothesis for Fig. 5 is that the ratio estimator has simply accumulated information during training. This turns out to be untrue; see the bottom graphs. On the left, q(β) is fixed at a random initialization; the estimate of the ratio is displayed over updates of r. After many updates, r still produces unstable estimates. In contrast, the right shows the same procedure with q(β) fixed at the posterior; r is accurate after few updates.

Several practical insights appear for training. First, it is not helpful to update r multiple times before updating q (at least in initial iterations). Additionally, if the specified model poorly matches the data, training will be difficult across all iterations. The property that ratio estimation is more accurate as the variational approximation improves holds because q(x_n) is set to be the empirical distribution. (Note we could subtract any density from the ELBO in Eq. 4.) LFVI finds parameters that make the observed data likely under the model, i.e., the model gets closer to the empirical distribution at values sampled from q(β).

Letting q(x_n) be the empirical distribution means the ratio estimation problem will be less trivially solvable (and thus more accurate) as q improves. Note also this motivates why we do not subsume inference of β in the ratio in order to enable implicit global variables and implicit global variational approximations: estimation requires comparing samples between the prior and the posterior, and these rarely overlap for global variables.
5 Discussion
We developed a class of hierarchical implicit models and likelihoodfree variational inference, merging the idea of implicit densities with hierarchical Bayesian modeling and approximate posterior inference. This expands Bayesian analysis with the ability to apply neural samplers, physical simulators, and their combination with rich, interpretable latent structure.
More stable inference with ratio estimation is an open challenge. This is especially important when we analyze largescale real world applications of implicit models. Recent work for genomics offers a promising solution [53].
Acknowledgements.
We thank Balaji Lakshminarayanan for discussions which helped motivate this work. We also thank Christian Naesseth, Jaan Altosaar, and Adji Dieng for their feedback and comments. DT is supported by a Google Ph.D. Fellowship in Machine Learning and an Adobe Research Fellowship. This work is also supported by NSF IIS0745520, IIS1247664, IIS1009542, ONR N000141110651, DARPA FA87501420009, N6600115C4032, Facebook, Adobe, Amazon, and the John Templeton Foundation.
References
 Anelli et al. [2008] Anelli, G., Antchev, G., Aspell, P., Avati, V., Bagliesi, M., Berardi, V., Berretti, M., Boccone, V., Bottigli, U., Bozzo, M., et al. (2008). The TOTEM experiment at the CERN Large Hadron Collider. Journal of Instrumentation, 3(08):S08007.
 Bayer and Osendorfer [2014] Bayer, J. and Osendorfer, C. (2014). Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610.
 Beaumont [2010] Beaumont, M. A. (2010). Approximate Bayesian computation in evolution and ecology. Annual Review of Ecology, Evolution, and Systematics, 41:379–406.
 Blei et al. [2003] Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.
 Blundell et al. [2015] Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight uncertainty in neural network. In International Conference on Machine Learning.
 Bottou [2010] Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer.
 Chen et al. [2016] Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. (2016). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Neural Information Processing Systems.
 Cranmer et al. [2015] Cranmer, K., Pavez, J., and Louppe, G. (2015). Approximating likelihood ratios with calibrated discriminative classifiers. arXiv preprint arXiv:1506.02169.
 Diaconis et al. [2007] Diaconis, P., Holmes, S., and Montgomery, R. (2007). Dynamical bias in the coin toss. SIAM Review, 49(2):211–235.
 Dieng et al. [2017] Dieng, A. B., Tran, D., Ranganath, R., Paisley, J., and Blei, D. M. (2017). The χ-Divergence for Approximate Inference. In Neural Information Processing Systems.
 Diggle and Gratton [1984] Diggle, P. J. and Gratton, R. J. (1984). Monte Carlo methods of inference for implicit statistical models. Journal of the Royal Statistical Society: Series B (Methodological), pages 193–227.
 Donahue et al. [2017] Donahue, J., Krähenbühl, P., and Darrell, T. (2017). Adversarial feature learning. In International Conference on Learning Representations.
 Dumoulin et al. [2017] Dumoulin, V., Belghazi, I., Poole, B., Lamb, A., Arjovsky, M., Mastropietro, O., and Courville, A. (2017). Adversarially learned inference. In International Conference on Learning Representations.

 Dziugaite et al. [2015] Dziugaite, G. K., Roy, D. M., and Ghahramani, Z. (2015). Training generative neural networks via maximum mean discrepancy optimization. In Uncertainty in Artificial Intelligence.
 Fraccaro et al. [2016] Fraccaro, M., Sønderby, S. K., Paquet, U., and Winther, O. (2016). Sequential neural models with stochastic layers. In Neural Information Processing Systems.
 Gelman et al. [2013] Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013). Bayesian data analysis. Texts in Statistical Science Series. CRC Press, Boca Raton, FL.
 Gelman and Hill [2006] Gelman, A. and Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
 Gneiting and Raftery [2007] Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378.
 Goodfellow et al. [2014] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Neural Information Processing Systems.
 Goodfellow [2014] Goodfellow, I. J. (2014). On distinguishability criteria for estimating generative models. In ICLR Workshop.
 Gutmann et al. [2014] Gutmann, M. U., Dutta, R., Kaski, S., and Corander, J. (2014). Statistical Inference of Intractable Generative Models via Classification. arXiv preprint arXiv:1407.4981.
 Hartig et al. [2011] Hartig, F., Calabrese, J. M., Reineking, B., Wiegand, T., and Huth, A. (2011). Statistical inference for stochastic simulation models–theory and application. Ecology Letters, 14(8):816–827.
 Hernández-Lobato et al. [2016] Hernández-Lobato, J. M., Li, Y., Rowland, M., Hernández-Lobato, D., Bui, T., and Turner, R. E. (2016). Black-box α-divergence minimization. In International Conference on Machine Learning.
 Hoffman et al. [2013] Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. W. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347.
 Jordan et al. [1999] Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233.
 Karaletsos [2016] Karaletsos, T. (2016). Adversarial message passing for graphical models. In NIPS Workshop.
 Keller [1986] Keller, J. B. (1986). The probability of heads. The American Mathematical Monthly, 93(3):191–197.
 Kingma and Welling [2014] Kingma, D. P. and Welling, M. (2014). Auto-Encoding Variational Bayes. In International Conference on Learning Representations.
 Kusner and Hernández-Lobato [2016] Kusner, M. J. and Hernández-Lobato, J. M. (2016). GANs for sequences of discrete elements with the Gumbel-Softmax distribution. In NIPS Workshop.
 Larsen et al. [2016] Larsen, A. B. L., Sønderby, S. K., Larochelle, H., and Winther, O. (2016). Autoencoding beyond pixels using a learned similarity metric. In International Conference on Machine Learning.
 Li and Turner [2016] Li, Y. and Turner, R. E. (2016). Rényi Divergence Variational Inference. In Neural Information Processing Systems.
 Liu and Feng [2016] Liu, Q. and Feng, Y. (2016). Two methods for wild variational inference. arXiv preprint arXiv:1612.00081.
 MacKay [1992] MacKay, D. J. C. (1992). Bayesian methods for adaptive models. PhD thesis, California Institute of Technology.
 Makhzani et al. [2015] Makhzani, A., Shlens, J., Jaitly, N., and Goodfellow, I. (2015). Adversarial autoencoders. arXiv preprint arXiv:1511.05644.
 Marin et al. [2012] Marin, J.-M., Pudlo, P., Robert, C. P., and Ryder, R. J. (2012). Approximate Bayesian computational methods. Statistics and Computing, 22(6):1167–1180.
 Mescheder et al. [2017] Mescheder, L., Nowozin, S., and Geiger, A. (2017). Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. arXiv preprint arXiv:1701.04722.
 Mohamed and Lakshminarayanan [2016] Mohamed, S. and Lakshminarayanan, B. (2016). Learning in implicit generative models. arXiv preprint arXiv:1610.03483.
 Murphy [2012] Murphy, K. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
 Neal [1994] Neal, R. M. (1994). Bayesian Learning for Neural Networks. PhD thesis, University of Toronto.
 Papamakarios and Murray [2016] Papamakarios, G. and Murray, I. (2016). Fast free inference of simulation models with Bayesian conditional density estimation. In Neural Information Processing Systems.
 Pearl [2000] Pearl, J. (2000). Causality. Cambridge University Press.
 Pritchard et al. [1999] Pritchard, J. K., Seielstad, M. T., PerezLezaun, A., and Feldman, M. W. (1999). Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Molecular Biology and Evolution, 16(12):1791–1798.
 Radford et al. [2016] Radford, A., Metz, L., and Chintala, S. (2016). Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations.
 Ranganath et al. [2016a] Ranganath, R., Altosaar, J., Tran, D., and Blei, D. M. (2016a). Operator variational inference. In Neural Information Processing Systems.
 Ranganath et al. [2014] Ranganath, R., Gerrish, S., and Blei, D. M. (2014). Black box variational inference. In Artificial Intelligence and Statistics.
 Ranganath et al. [2016b] Ranganath, R., Tran, D., and Blei, D. M. (2016b). Hierarchical variational models. In International Conference on Machine Learning.
 Rezende and Mohamed [2015] Rezende, D. J. and Mohamed, S. (2015). Variational inference with normalizing flows. In International Conference on Machine Learning.
 Rezende et al. [2014] Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning.

 Salakhutdinov and Mnih [2008] Salakhutdinov, R. and Mnih, A. (2008). Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In International Conference on Machine Learning, pages 880–887. ACM.
 Salimans et al. [2015] Salimans, T., Kingma, D. P., and Welling, M. (2015). Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning.
 Sugiyama et al. [2012] Sugiyama, M., Suzuki, T., and Kanamori, T. (2012). Densityratio matching under the Bregman divergence: A unified framework of densityratio estimation. Annals of the Institute of Statistical Mathematics.
 Teh and Jordan [2010] Teh, Y. W. and Jordan, M. I. (2010). Hierarchical Bayesian nonparametric models with applications. Bayesian Nonparametrics, 1.
 Tran and Blei [2017] Tran, D. and Blei, D. M. (2017). Implicit causal models for genomewide association studies. arXiv preprint arXiv:1710.10742.
 Tran et al. [2015] Tran, D., Blei, D. M., and Airoldi, E. M. (2015). Copula variational inference. In Neural Information Processing Systems.
 Tran et al. [2016] Tran, D., Kucukelbir, A., Dieng, A. B., Rudolph, M., Liang, D., and Blei, D. M. (2016). Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787.
 Uehara et al. [2016] Uehara, M., Sato, I., Suzuki, M., Nakayama, K., and Matsuo, Y. (2016). Generative adversarial nets from a density ratio estimation perspective. arXiv preprint arXiv:1610.02920.
 Wilkinson [2011] Wilkinson, D. J. (2011). Stochastic modelling for systems biology. CRC press.
Appendix A Noise versus Latent Variables
HIMs have two sources of randomness for each data point: the latent variable $z_n$ and the noise $\epsilon_n$; these sources of randomness get transformed to produce $x_n$. Bayesian analysis infers posteriors on latent variables. A natural question is whether one should also infer the posterior of the noise.
The posterior’s shape, and ultimately whether it is meaningful, is determined by the dimensionality of the noise and by the transformation. For example, consider the GAN model, which has no local latent variable $z_n$. The conditional $p(x_n \mid \epsilon_n)$ is a point mass, fully determined by $\epsilon_n$. When the generator $g$ is injective, the posterior over the noise is also a point mass,
$$\epsilon^* = g^{\dagger}(x),$$
where $g^{\dagger}$ is the left inverse of $g$. This means that for injective functions of the randomness (both noise and latent variables), the “posterior” may be worth analyzing as a deterministic hidden representation [12], but it is not random.
The point mass posterior can be found via nonlinear least squares, which yields the iterative algorithm
$$\epsilon_{t+1} = \epsilon_t - \rho_t \nabla_{\epsilon} \| x - g(\epsilon_t) \|^2$$
for some step size sequence $\rho_t$. Note the updates can get stuck where the gradient of $g$ is zero. However, the injective property of $g$ allows an iterate to be checked for correctness (simply check whether $g(\epsilon^*) = x$).
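As a concrete illustration of this iteration, here is a minimal NumPy sketch with a hypothetical injective generator (a full-column-rank linear map); the generator, step size, and dimensions are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Hypothetical injective generator g: a linear map with full column
# rank, so the noise "posterior" is a point mass (illustrative only).
A = np.array([[1.0, 0.5],
              [0.0, 2.0],
              [1.0, -1.0]])

def g(eps):
    return A @ eps

def invert(x, steps=500, rho=0.05):
    """Gradient descent on the squared residual ||x - g(eps)||^2."""
    eps = np.zeros(A.shape[1])
    for _ in range(steps):
        grad = -2.0 * A.T @ (x - g(eps))  # gradient of the residual
        eps = eps - rho * grad
    return eps

eps_true = np.array([0.7, -1.3])
x = g(eps_true)
eps_star = invert(x)

# Injectivity lets us check the iterate for correctness: g(eps*) == x.
assert np.allclose(g(eps_star), x, atol=1e-6)
```

Because $g$ is injective here, the check at the end certifies that the recovered point mass is the unique preimage.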
Appendix B Implicit Model Examples in Edward
We demonstrate implicit models via example implementations in Edward [55].
Fig. 3 implements a 2-layer deep implicit model (DIM). It uses tf.layers to define neural networks: tf.layers.dense(x, 256) applies a fully connected layer with 256 hidden units to input x; weight and bias parameters are abstracted from the user. The program generates a set of data points using two layers of implicit latent variables and an implicit likelihood.
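The generative process of such a model can also be sketched outside Edward. The following NumPy sketch pushes noise through two latent layers to produce data; the layer sizes, nonlinearity, and weight values are hypothetical, and the point is only that we can sample x while its marginal density remains intractable.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, W, b):
    """Fully connected layer with a tanh nonlinearity."""
    return np.tanh(x @ W + b)

# Hypothetical weights for a 2-layer deep implicit model:
# noise -> z_2 -> z_1 -> x. All sizes and values are illustrative.
W2, b2 = rng.normal(size=(10, 32)), np.zeros(32)
W1, b1 = rng.normal(size=(32, 32)), np.zeros(32)
Wx, bx = rng.normal(size=(32, 5)), np.zeros(5)

def sample(n):
    """Draw n data points by pushing noise through the latent layers.

    We can sample x, but evaluating the marginal density p(x) would
    require integrating out z_1 and z_2; the model is implicit.
    """
    eps2 = rng.normal(size=(n, 10))
    z2 = dense(eps2, W2, b2)          # implicit latent layer z_2
    eps1 = rng.normal(size=(n, 32))
    z1 = dense(z2 + eps1, W1, b1)     # fresh noise injected at z_1
    eps_x = rng.normal(size=(n, 32))
    x = (z1 + eps_x) @ Wx + bx        # implicit likelihood for x
    return x

x = sample(100)
assert x.shape == (100, 5)
```

Fresh noise at every layer is what makes each conditional implicit rather than a deterministic function of the layer above.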
Fig. 4 implements a Bayesian GAN for classification. It manually defines a 2-layer neural network which, for each data index $n$, takes features $x_n$ concatenated with noise $\epsilon_n$ as input. The output is a label $y_n$, given by the sign of the last layer. We place a standard normal prior over all weights and biases. Running this program while feeding the features placeholder generates a vector of labels.
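A minimal NumPy sketch of this forward pass, using a single draw of the weights from the standard normal prior; all sizes and names are illustrative, not the code in Fig. 4.

```python
import numpy as np

rng = np.random.default_rng(1)

n, d, noise_dim, hidden = 8, 4, 2, 16

# One sample of the weights and biases from the standard normal prior.
W1 = rng.normal(size=(d + noise_dim, hidden))
b1 = rng.normal(size=hidden)
W2 = rng.normal(size=(hidden, 1))
b2 = rng.normal(size=1)

def classify(X):
    """Concatenate each feature row with fresh noise, push the result
    through the 2-layer network, and output the sign as the label."""
    eps = rng.normal(size=(X.shape[0], noise_dim))
    h = np.tanh(np.concatenate([X, eps], axis=1) @ W1 + b1)
    return np.sign(h @ W2 + b2).ravel()

X = rng.normal(size=(n, d))
y = classify(X)
assert y.shape == (n,)
```

Because the labels come from the sign of a real-valued network output, the generative process produces discrete data without ever defining a likelihood.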
Appendix C KL Uniqueness
An integral probability metric measures the distance between two distributions $p$ and $q$,
$$d(p, q) = \sup_{f \in \mathcal{F}} \big| \mathbb{E}_{p(x)}[f(x)] - \mathbb{E}_{q(x)}[f(x)] \big|.$$
Integral probability metrics have been used for parameter estimation in generative models [14] and for variational inference in models with tractable density [46]. In contrast to models with only local latent variables, to infer the posterior, we need an integral probability metric between it and the variational approximation. The direct approach fails because sampling from the posterior is intractable.
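For intuition, one widely used integral probability metric with a simple sample estimate is the kernel maximum mean discrepancy used for parameter estimation in [14]. A minimal NumPy sketch, where the kernel choice and bandwidth are assumptions:

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    """RBF kernel matrix between the rows of a and the rows of b."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def mmd2(x, y, gamma=0.5):
    """Plug-in estimate of squared MMD: an integral probability metric
    whose function class is the unit ball of the RBF kernel's RKHS."""
    return (rbf(x, x, gamma).mean()
            + rbf(y, y, gamma).mean()
            - 2.0 * rbf(x, y, gamma).mean())

rng = np.random.default_rng(0)
same = mmd2(rng.normal(size=(200, 1)), rng.normal(size=(200, 1)))
diff = mmd2(rng.normal(size=(200, 1)), rng.normal(loc=3.0, size=(200, 1)))
assert same < diff  # samples from the same distribution score lower
```

Note that computing this estimate only requires samples from both distributions, which is exactly why such metrics are attractive for implicit models with local structure.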
An indirect approach requires constructing a sufficiently broad class of functions with posterior expectation zero based on Stein’s method [46]. These constructions require a likelihood function and its gradient. Working around the likelihood would require a form of nonparametric density estimation; unlike ratio estimation, we are unaware of a solution that sufficiently scales to high dimensions.
As another class of divergences, the $f$-divergence is
$$D_f(p \,\|\, q) = \mathbb{E}_{q(x)}\!\left[ f\!\left( \frac{p(x)}{q(x)} \right) \right].$$
Unlike integral probability metrics, $f$-divergences are naturally conducive to ratio estimation, enabling implicit $p$ and implicit $q$. However, the challenge lies in scalable computation. To subsample data in hierarchical models, we need $f$ to satisfy $f(ab) = f(a) + f(b)$ up to constants, so that the expectation becomes a sum over individual data points. For continuous functions, this is a defining property of the logarithm. This implies the KL divergence from $q$ to $p$ is the only $f$-divergence where the subsampling technique in our desiderata is possible.
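The uniqueness step can be made explicit. A sketch, assuming $f$ is continuous and writing the multiplicative constraint as Cauchy's functional equation:

```latex
% Substitute g(u) := f(e^u). The constraint f(ab) = f(a) + f(b) becomes
g(u + v) = f(e^u e^v) = f(e^u) + f(e^v) = g(u) + g(v).
% This is Cauchy's functional equation; its continuous solutions are
% linear, g(u) = c u, hence
f(t) = g(\log t) = c \log t,
% and the corresponding divergence is a multiple of a KL divergence:
D_f(p \,\|\, q) = \mathbb{E}_{q(x)}\!\left[ c \log \frac{p(x)}{q(x)} \right]
                = -c \, \mathrm{KL}(q \,\|\, p).
```

The additive constants allowed in the constraint shift $D_f$ by a constant and do not change the optimization.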
Appendix D Hinge Loss
Let $r$ output a real value, as with the log loss in Section 4. The hinge loss is
$$D_{\text{hinge}} = \mathbb{E}_{p(x \mid \beta)}\big[\max(0,\, 1 - r(x; \beta))\big] + \mathbb{E}_{q(x)}\big[\max(0,\, 1 + r(x; \beta))\big].$$
We minimize this loss function by following unbiased gradients, which are calculated analogously to those for the log loss. The optimal $r^*$ is the log ratio.
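To make the hinge-loss estimator concrete, here is a minimal NumPy sketch that fits a linear $r$ by subgradient descent between two hypothetical Gaussians; the linear form of $r$ and the distributions are assumptions. We only check that the learned $r$ agrees in sign with the true log ratio, which for this pair crosses zero at $x = 0.5$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from p = N(1, 1) (label +1) and q = N(0, 1) (label -1).
xp = rng.normal(loc=1.0, size=2000)
xq = rng.normal(loc=0.0, size=2000)
x = np.concatenate([xp, xq])
ylab = np.concatenate([np.ones(2000), -np.ones(2000)])

# Linear ratio estimator r(x) = a x + b, trained by subgradient
# descent on the hinge loss E[max(0, 1 - y r(x))].
a, b, lr = 0.0, 0.0, 0.01
for _ in range(200):
    margin = ylab * (a * x + b)
    active = margin < 1.0  # points inside the margin contribute
    if active.any():
        a -= lr * -(ylab[active] * x[active]).mean()
        b -= lr * -(ylab[active]).mean()

# The true log ratio log p(x) - log q(x) = x - 0.5 changes sign at
# x = 0.5; the hinge-trained r should agree well away from there.
assert a * 2.0 + b > 0      # r positive deep in p's region
assert a * (-1.0) + b < 0   # r negative deep in q's region
```

The sign agreement is the weakest property one would want; in practice the log loss of Section 4 is the estimator whose exact optimum is the log ratio.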
Appendix E Comparing Bayesian GANs with MAP to GANs with MLE
In Section 4, we argued that MAP estimation with a Bayesian GAN enables analysis over discrete data, but GANs cannot, even with a maximum likelihood objective [20]. This is a surprising result: assuming a flat prior for MAP, the two ultimately optimize the same objective. We compare the two below.
For GAN, assume the discriminator $d$ outputs a logit probability, so that it is unconstrained instead of constrained to $[0, 1]$. GAN with MLE uses the discriminative problem
$$\max_{d}\; \mathbb{E}_{q(x)}[\log \sigma(d(x))] + \mathbb{E}_{p(x; \theta)}[\log(1 - \sigma(d(x)))],$$
where $\sigma$ is the logistic function. It uses the generative problem
$$\max_{\theta}\; \mathbb{E}_{q(\epsilon)}\big[\exp\big(d^*(g(\epsilon; \theta))\big)\big].$$
Solving the generative problem with reparameterization gradients requires backpropagating through data generated from the model, $x = g(\epsilon; \theta)$. This is not possible for discrete $x$. Further, the exponentiation makes this objective numerically unstable and thus unusable in practice.
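The instability of the exponentiation is easy to see numerically; a small sketch with illustrative logit values:

```python
import numpy as np

# The GAN-MLE generative objective exponentiates the discriminator's
# logit output; large logits overflow in double precision (exp
# overflows past roughly 709). Summing log ratios directly, as in the
# MAP objective below, stays finite.
logits = np.array([10.0, 500.0, 800.0])  # illustrative logit values
with np.errstate(over="ignore"):
    unstable = np.exp(logits)            # exp(800) overflows to inf
stable = logits.sum()                    # direct sum of log ratios
assert np.isinf(unstable[-1])
assert np.isfinite(stable)
```

This is the usual argument for working in log space whenever an objective is a product of ratios.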
Contrast this with Bayesian GAN with MLE (that is, MAP with a flat prior). This applies a point mass variational approximation $q(\theta) = \delta(\theta - \theta')$. It maximizes the ELBO,
$$\max_{\theta'}\; \log p(\theta') + \sum_{n} r^*(x_n; \theta').$$
The first term is zero for a flat prior and point mass approximation; the problem reduces to
$$\max_{\theta'}\; \sum_{n} r^*(x_n; \theta').$$
Solving this is possible for discrete $x_n$: it only requires backpropagating gradients through $r^*(x_n; \theta')$ with respect to $\theta'$, all of which is differentiable. Further, the objective does not require a numerically unstable exponentiation.
Ultimately, the difference lies in the role of the ratio estimators. Recall that for the Bayesian GAN, we use the ratio estimation problem of Section 4, whose optimal ratio estimator is the log ratio
$$r^*(x_n; \theta) = \log p(x_n \mid \theta) - \log q(x_n).$$
Optimizing it with respect to $\theta$ reduces to optimizing the log-likelihood $\log p(x_n \mid \theta)$. The optimal discriminator for GAN with MLE estimates the same ratio; however, it is a constant function with respect to $\theta$. Hence one cannot immediately substitute it as a proxy for optimizing the likelihood. An alternative is to use importance sampling; the result is the former objective [20].
Appendix F Stability of Ratio Estimator
With implicit models, the difference from standard KL variational inference lies in the ratio estimation problem. Thus we would like to assess the accuracy of the ratio estimator. We can check this by comparing to the true ratio under a model with tractable likelihood.
We apply Bayesian linear regression, which features a tractable posterior that we leverage in our analysis. We use 50 simulated data points. The optimal (log) ratio is
$$r^*(x_n; \beta) = \log p(x_n \mid \beta) - \log q(x_n).$$
Note the log-likelihood minus $r^*$ is equal to $\log q(x_n)$, the empirical distribution, a constant. Therefore if a ratio estimator $r$ is accurate, its difference with the log-likelihood should be a constant with low variance across values of $\beta$.
See Fig. 5. The top graph displays the estimate of this difference (log-likelihood minus $r$) over updates of the variational approximation $q(\beta)$; each estimate uses a sample from the current $q(\beta)$. The ratio estimator becomes more accurate as $q(\beta)$ converges to the posterior. This matches our intuition: if data generated from the model is close to the true data, then the ratio is more stable to estimate.
An alternative hypothesis for Fig. 5 is that the ratio estimator has simply accumulated information during training. This turns out to be untrue; see the bottom graphs. On the left, $q(\beta)$ is fixed at a random initialization; the estimate of the difference is displayed over updates of $r$. After many updates, $r$ still produces unstable estimates. In contrast, the right shows the same procedure with $q(\beta)$ fixed at the posterior: $r$ is accurate after few updates.
Several practical insights appear for training. First, it is not helpful to update $r$ multiple times before updating $q$ (at least in initial iterations). Additionally, if the specified model poorly matches the data, training will be difficult across all iterations.
The property that ratio estimation is more accurate as the variational approximation improves holds because $q(x)$ is set to be the empirical distribution. (Note we could subtract any density from the ELBO in Equation 4.) LFVI finds a variational approximation that makes the observed data likely under the model; that is, the model gets closer to the empirical distribution at values sampled from $q$. Letting $q(x)$ be the empirical distribution means the ratio estimation problem becomes less trivially solvable, and thus more accurate, as $q$ improves. Note also this motivates why we do not subsume inference of the global variables in the ratio in order to enable implicit global variables and implicit global variational approximations. Namely, such estimation requires comparing samples between the prior and the posterior; they rarely overlap for global variables.