Amortized Inference of Variational Bounds for Learning Noisy-OR

06/06/2019 ∙ by Yiming Yan, et al. ∙ 0

Classical approaches for approximate inference depend on cleverly designed variational distributions and bounds. Modern approaches employ amortized variational inference, which uses a neural network to approximate any posterior without leveraging the structures of the generative models. In this paper, we propose Amortized Conjugate Posterior (ACP), a hybrid approach taking advantage of both types of approaches. Specifically, we use the classical methods to derive specific forms of posterior distributions and then learn the variational parameters using amortized inference. We study the effectiveness of the proposed approach on the Noisy-OR model and compare it to both the classical and the modern approaches for approximate inference and parameter learning. Our results show that ACP outperforms other methods when there is a limited amount of training data.




1 Introduction

Classical techniques in probabilistic graphical models exploit (tractable) structures heavily for approximate inference (Koller et al., 2009). Well-studied examples are mean-field approaches that assume factorized forms of posteriors (Jordan et al., 1999), (conjugate) variational bounds on likelihoods (Jaakkola and Jordan, 1999, 2000), and others (Saul et al., 1996; Saul and Jordan, 1996). In these methods, the forms of the approximate posteriors depend on the model structures, the definitions of the conditional probability tables or distributions, and the priors. Deriving them is often non-trivial, especially for complex models. For instance, deriving variational bounds often requires identifying special properties such as convexity and concavity of likelihood (or partition) functions. In some cases, the derivation also depends on whether an upper-bound or lower-bound is needed (Jebara and Pentland, 2001).

In contrast, recent approaches in amortized variational inference (AVI) use neural networks to represent posteriors and reparameterization tricks to compute the likelihoods with Monte Carlo samples (Kingma and Welling, 2013; Mnih and Gregor, 2014). Even earlier, neural networks were applied to approximate posterior distributions in a supervised manner (Morris, 2001). What is appealing in this type of method is that selecting the inference neural network requires significantly less effort, and the structure of the generative model (and the corresponding likelihood function) does not directly come into play in determining the inference network. In other words, the inference network (i.e. the encoder) and the generative model (i.e. the decoder) are parameterized independently, without explicitly sharing information. As such, with a large amount of training data, a high-capacity inference network is able to approximate the posterior well, and learning the generative model can be effective as the variance of the Monte Carlo sampling can be reduced. However, when the amount of training data is small, the inference network can overfit and the estimate of the generative model has a high variance.

Is there a way to combine these two different types of approaches? In this paper, we take a step in this direction. Our main idea is to use the above-mentioned classical methods to derive approximate but tractable posteriors (and approximate likelihood functions) and then identify the optimal variational parameters by learning a neural network.

The key difference from the classical approaches is that the variational parameters are not optimized to give the tightest bounds on the likelihood functions but to maximize the evidence lower bound (ELBO). On the other hand, the key difference from the AVI is that the posterior and the generative model share information such that the posterior contributes explicitly to the gradient of the ELBO with respect to the generative model parameters.

As an example, we apply this new hybrid approach to the noisy-or model, which has been studied for binary observations and latent variables. Earlier work such as (Jaakkola and Jordan, 1999; Šingliar and Hauskrecht, 2006) proposed classical variational inference approaches. More recently, polynomial-time algorithms were designed to learn the structure and parameters of the noisy-or model (Jernite et al., 2013; Halpern and Sontag, 2013). However, they require strong assumptions on the structure of the graph. (Halpern, 2016) studied a semi-supervised method with stochastic variational learning on noisy-or and achieved good performance in parameter recovery. We show that the proposed approach outperforms both the classical approaches and AVI when there is only a small amount of training data.

In the following, we describe related work by reviewing different types of variational inference techniques in Section 2, in the context of the noisy-or model. We describe our approach in Section 3. In Section 4, we show our experiment results, followed by a discussion in Section 5.

2 Variational Inference for noisy-or

2.1 Basic Ideas

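We are interested in modeling a random variable $x$ with a generative latent variable model, where $z$ denotes the latent variables, $\theta$ denotes the model parameter, and $N$ denotes the total number of data points in the training set $\mathcal{D} = \{x_n\}_{n=1}^{N}$. The log-likelihood is defined as

$$\log p(\mathcal{D}; \theta) = \sum_{n=1}^{N} \log \int p(x_n \mid z_n; \theta)\, p(z_n; \theta)\, dz_n. \tag{1}$$

We assume the parameters for the prior and the conditional distribution are separable. For discrete latent variables, the integral is interpreted as summing over all possible configurations of $z_n$.

Due to the logarithm before the integral, estimating $\theta$ is intractable. We introduce a variational distribution $q(z_n \mid x_n; \phi)$ to approximate the posterior $p(z_n \mid x_n; \theta)$. This gives rise to maximizing the ELBO,

$$\mathcal{L}(\theta, \phi) = \sum_{n=1}^{N} \Big( \mathbb{E}_{q(z_n \mid x_n; \phi)}\big[\log p(x_n \mid z_n; \theta)\big] - \mathrm{KL}\big(q(z_n \mid x_n; \phi) \,\|\, p(z_n; \theta)\big) \Big), \tag{2}$$

where $\mathrm{KL}(q \,\|\, p)$ denotes the Kullback-Leibler (KL) divergence between the distributions $q$ and $p$, and $\phi$ is the variational parameter. To avoid cluttering the notation, we will drop the subscript $n$ and omit $\theta$ and $\phi$ whenever the context is clear.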

Classical approaches restrict the distribution family for $q$ so that the expected conditional likelihood (the first term in the ELBO) can be computed. For example, the mean-field approximation assumes a factorized form of the variational distribution. Unfortunately, even for some common likelihood functions $p(x \mid z)$, the factorized form does not make the expectation tractable. We describe one such model and then describe how the classical and recent approaches tackle such challenges.

2.2 noisy-or

Figure 1: noisy-or model in plate notations.

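noisy-or is a bipartite directed graph modeling the dependencies among binary observations. The structure is shown in Fig. 1, where $x = (x_1, \dots, x_D)$ represents the observed $D$-dimensional data and $z = (z_1, \dots, z_K)$ are the binary latent variables. The model defines the distribution

$$p(x, z) = \prod_{i=1}^{K} p(z_i) \prod_{j=1}^{D} p(x_j \mid z), \tag{3}$$

where $p(z_i)$ and $p(x_j \mid z)$ are Bernoulli distributions. In particular,

$$p(x_j = 0 \mid z) = (1 - \rho_{j0}) \prod_{i=1}^{K} (1 - \rho_{ji})^{z_i}, \tag{4}$$

where $\rho_{j0}$ is the “leak” probability and $\rho_{ji}$ is the probability that $z_i$ alone turns on $x_j$. Redefining $\theta_{j0} = -\log(1 - \rho_{j0})$ and $\theta_{ji} = -\log(1 - \rho_{ji})$, we have

$$p(x_j = 0 \mid z) = \exp\Big(-\theta_{j0} - \sum_{i=1}^{K} \theta_{ji} z_i\Big), \tag{5}$$

where we slightly abuse the notation on $\theta_{j0}$ and $\theta_{ji}$. In more compact form, we have

$$p(x_j \mid z) = \Big(1 - e^{-\theta_{j0} - \sum_i \theta_{ji} z_i}\Big)^{x_j} \Big(e^{-\theta_{j0} - \sum_i \theta_{ji} z_i}\Big)^{1 - x_j}, \tag{6}$$

where $x_j$ takes the value of either 0 or 1.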

The form of the likelihood of a positive observation, $p(x_j = 1 \mid z)$, makes the expectation of its logarithm intractable to compute, even when the posterior is approximated in factorized form.
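As an illustration of the conditional likelihood in this section, the following minimal sketch evaluates $\log p(x \mid z)$ for a single data point. It assumes the standard exponential parameterization of noisy-or, $p(x_j = 0 \mid z) = \exp(-\theta_{j0} - \sum_i \theta_{ji} z_i)$; the function name and argument layout (`theta[j][i]`, `theta0[j]`) are hypothetical choices made for this example.

```python
import math


def noisy_or_log_prob(x, z, theta, theta0):
    """Return log p(x | z) for one data point under a noisy-or model.

    x: list of D binary observations; z: list of K binary latent variables.
    theta[j][i] >= 0 is the weight from z_i to x_j (assumed layout);
    theta0[j] > 0 is the leak weight of x_j.
    """
    total = 0.0
    for j, x_j in enumerate(x):
        # Activation u_j = theta_j0 + sum_i theta_ji * z_i,
        # so that p(x_j = 0 | z) = exp(-u_j).
        u = theta0[j] + sum(t * z_i for t, z_i in zip(theta[j], z))
        if x_j == 0:
            total += -u
        else:
            # log p(x_j = 1 | z) = log(1 - exp(-u_j)); requires u_j > 0.
            total += math.log1p(-math.exp(-u))
    return total


# Toy example with D = 2 observations and K = 2 latent variables.
theta = [[1.0, 0.0], [0.5, 2.0]]
theta0 = [0.1, 0.1]
print(noisy_or_log_prob([1, 0], [1, 0], theta, theta0))
```

The negative branch is linear in `z`, which is why its expectation under a factorized posterior is easy; the positive branch is the term that requires either a bound or sampling.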

2.3 Variational Bound via Conjugate Dual

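The conjugate dual for approximating $p(x_j = 1 \mid z)$ of the noisy-or model was first introduced in (Jaakkola and Jordan, 1999):

$$p(x_j = 1 \mid z) = \exp\big(f(u_j)\big) \le \exp\big(\psi_j u_j - f^*(\psi_j)\big) =: \tilde{p}(x_j = 1 \mid z; \psi_j), \qquad u_j = \theta_{j0} + \sum_i \theta_{ji} z_i, \tag{7}$$

where $f(u) = \log(1 - e^{-u})$ is a concave function in $u$, $\psi_j$ is a variational parameter associated with $x_j$, and $f^*$ is the conjugate dual function of $f$. Here for $f$, we have

$$f^*(\psi_j) = -\psi_j \log \psi_j + (\psi_j + 1) \log(\psi_j + 1). \tag{8}$$

The detailed derivation can be found in (Jaakkola and Jordan, 1999).

Note that the resulting upper-bound has the appealing property that its logarithm is linear in $z$. Thus, computing its expectation with respect to a factorized (variational) posterior distribution is trivial. Hence we take $\tilde{p}(x_j = 1 \mid z; \psi_j)$ as the approximation to $p(x_j = 1 \mid z)$.

The classical methods choose the best $\psi$ to achieve the tightest upper-bound

$$\psi^* = \arg\min_{\psi} \log \sum_{z} p(z) \prod_{j: x_j = 1} \tilde{p}(x_j = 1 \mid z; \psi_j) \prod_{j: x_j = 0} p(x_j = 0 \mid z). \tag{9}$$

However, in the next section, we will show that we can use the upper-bound to select the form of the variational distribution and choose $\psi$ differently, leading to a different type of inference technique.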
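The conjugate upper-bound can be checked numerically. The sketch below assumes the form used in Jaakkola and Jordan (1999), namely $f(u) = \log(1 - e^{-u}) \le \psi u - f^*(\psi)$ with $f^*(\psi) = -\psi \log \psi + (\psi + 1) \log(\psi + 1)$, where $u > 0$ is the noisy-or activation and $\psi > 0$ is the variational parameter. The function names are hypothetical. The pointwise-tightest value $\psi = f'(u) = 1/(e^{u} - 1)$ follows from differentiating $f$.

```python
import math


def f(u):
    """Log-probability of a positive observation: log(1 - exp(-u)), u > 0."""
    return math.log1p(-math.exp(-u))


def f_conj(psi):
    """Conjugate dual of f, defined for psi > 0."""
    return -psi * math.log(psi) + (psi + 1.0) * math.log(psi + 1.0)


def log_upper_bound(u, psi):
    """Upper bound on f(u) that is linear in u for a fixed psi."""
    return psi * u - f_conj(psi)


def tightest_psi(u):
    """psi = f'(u) = 1 / (exp(u) - 1), at which the bound touches f(u)."""
    return 1.0 / math.expm1(u)


for u in (0.1, 1.0, 3.0):
    gap_any = log_upper_bound(u, 0.5) - f(u)           # non-negative for any psi > 0
    gap_best = log_upper_bound(u, tightest_psi(u)) - f(u)  # zero up to rounding
    print(u, gap_any, gap_best)
```

Because the bound is linear in `u`, and `u` is linear in the latent variables, the bound's expectation under a factorized posterior only needs the posterior means.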

2.4 Amortized Variational Inference

Under the auto-encoding variational Bayes framework (Kingma and Welling, 2013), AVI uses global parameters to predict the parameters of the approximate posterior distribution directly. For instance, (Kingma and Welling, 2013) predicts the parameters of $q(z_n \mid x_n)$ as $g_\phi(x_n)$. Here, $\phi$ is the global trainable parameter, which is shared across all $x_n$, $n = 1, \dots, N$. As a special case, $g_\phi$ can be a neural network with weights $\phi$.

The expectation of the log-likelihood requires sampling the latent variable $z$. However, the gradients cannot be back-propagated through the stochastic random variable $z$. Hence for certain types of distributions $q(z \mid x)$, the reparameterization trick can be applied to reparameterize the random variable $z$ using a differentiable function (such as a neural network) $z = h_\phi(\epsilon, x)$, where $\epsilon \sim p(\epsilon)$ (a known and easily sampled distribution). Then the expected log-likelihood can be rewritten as

$$\mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big] = \mathbb{E}_{p(\epsilon)}\big[\log p\big(x \mid h_\phi(\epsilon, x)\big)\big], \tag{10}$$

which can then be computed with Monte Carlo sampling from $p(\epsilon)$.

AVI uses deep neural networks as the inference model without explicitly deriving complex conjugate dual functions for likelihoods, and it can approximate posteriors flexibly. However, when there is not enough training data, the inference model might overfit and lead to a large variance in estimating the generative model. We introduce our method in the next section to tackle this problem.

3 A Hybrid Approach for Variational Inference

In this section, we introduce our inference strategy, Amortized Conjugate Posterior (ACP), combining ideas from both the classical methods and the recent AVI approaches.

3.1 Variational Posterior

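Instead of approximating a factorized posterior distribution using a feed-forward network as in AVI, the upper-bound described for $p(x_j = 1 \mid z)$ in eq. (7) can also be used to give rise to a factorized posterior distribution, after applying the Bayes rule. However, the posterior is a variational one:

$$q(z \mid x; \psi) \propto p(z) \prod_{j: x_j = 1} \tilde{p}(x_j = 1 \mid z; \psi_j) \prod_{j: x_j = 0} p(x_j = 0 \mid z), \tag{11}$$

where $\tilde{p}(x_j = 1 \mid z; \psi_j)$ denotes the upper-bound with variational parameter $\psi_j$. Note that we only use the upper-bound for positive observations where $x_j = 1$. For negative observations, we use the true likelihood. After re-organizing the terms, we observe that

$$q(z \mid x; \psi) = \prod_{i=1}^{K} q(z_i \mid x; \psi). \tag{12}$$

Namely, the variational posterior factorizes. And each factor is

$$q(z_i = 1 \mid x; \psi) = \sigma\Big(\log \frac{\mu_i}{1 - \mu_i} + \sum_{j=1}^{D} \theta_{ji} \big[x_j \psi_j - (1 - x_j)\big]\Big), \tag{13}$$

where $\mu_i = p(z_i = 1)$ is the prior probability, $\theta_{ji}$ is the noisy-or weight between $z_i$ and $x_j$, and $\sigma(\cdot)$ is the sigmoid function.

Note that, in classical methods, the variational parameters are optimized to tighten the upper-bounds according to eq. (9). However, this is not the only option. We describe alternative approaches next.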
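To make the factorized form concrete, here is a minimal sketch of the per-latent posterior under the assumptions used in this section: Bernoulli priors with $p(z_i = 1) = \mu_i$, exact exponential likelihoods for negative observations, and the conjugate bound $\exp(\psi_j u_j - f^*(\psi_j))$ for positive ones. Terms that do not depend on $z$ (the leak weights and $f^*$) cancel in the normalization, so they do not appear. All names are hypothetical.

```python
import math


def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def conjugate_posterior(x, psi, theta, mu):
    """Return q(z_i = 1 | x) for every latent variable i.

    x: D binary observations; psi: D positive variational parameters
    (only entries with x_j = 1 are used); theta[j][i]: noisy-or weight
    between z_i and x_j; mu[i]: prior probability p(z_i = 1).
    """
    q = []
    for i, mu_i in enumerate(mu):
        logit = math.log(mu_i) - math.log(1.0 - mu_i)  # prior log-odds
        for j, x_j in enumerate(x):
            if x_j == 1:
                logit += psi[j] * theta[j][i]  # from the bound's linear exponent
            else:
                logit -= theta[j][i]           # from the exact negative likelihood
        q.append(sigmoid(logit))
    return q


theta = [[1.0, 0.5], [0.2, 2.0]]
print(conjugate_posterior([1, 0], [0.7, 0.3], theta, [0.3, 0.6]))
```

Positive observations raise the posterior log-odds of their parents in proportion to $\psi_j \theta_{ji}$, while negative observations lower them by $\theta_{ji}$.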

3.2 Amortized Conjugate Posterior

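Instead of seeking the tightest upper-bounds for the likelihood function (for positive observations), we propose to optimize $\psi$ to maximize the ELBO in eq. (2). Moreover, to amortize the inference, we parameterize $\psi$ as in AVI,

$$\psi = g_\phi(x), \tag{14}$$

where the parameters $\phi$ of the neural network $g_\phi$ are shared by all data points. Namely, for each $x_n$, its variational distribution is

$$q(z_n \mid x_n) = \prod_{i=1}^{K} q\big(z_{ni} \mid x_n; \psi_n = g_\phi(x_n), \theta\big). \tag{15}$$

As in AVI, to optimize both $\theta$ and $\phi$, we use Monte Carlo sampling to compute the ELBO (and its gradients with respect to the parameters). Note that in noisy-or, the ELBO can be written as

$$\mathcal{L} = \mathbb{E}_{q(z \mid x)}\Big[\sum_{j: x_j = 1} \log p(x_j = 1 \mid z)\Big] + \mathbb{E}_{q(z \mid x)}\Big[\sum_{j: x_j = 0} \log p(x_j = 0 \mid z)\Big] - \mathrm{KL}\big(q(z \mid x) \,\|\, p(z)\big), \tag{16}$$

where the expectation of the positive log-likelihood (the first term in eq. (16)) is intractable while the expectation of the negative log-likelihood (the second term) and the KL divergence can be computed analytically. Hence, we need to estimate the first term using Monte Carlo sampling.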
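The split between analytic and sampled terms can be sketched as follows. This is an illustrative estimator rather than the authors' implementation: it assumes the exponential noisy-or parameterization $p(x_j = 0 \mid z) = \exp(-\theta_{j0} - \sum_i \theta_{ji} z_i)$, Bernoulli priors `mu`, a factorized posterior given by its means `q`, and exact discrete samples (so it yields an ELBO value but no reparameterized gradient). All names are hypothetical.

```python
import math
import random


def bernoulli_kl(q, mu):
    """KL(q || p) between factorized Bernoullis with means q and mu."""
    kl = 0.0
    for q_i, mu_i in zip(q, mu):
        if q_i > 0.0:
            kl += q_i * math.log(q_i / mu_i)
        if q_i < 1.0:
            kl += (1.0 - q_i) * math.log((1.0 - q_i) / (1.0 - mu_i))
    return kl


def elbo_estimate(x, q, theta, theta0, mu, num_samples=100, rng=random):
    """ELBO for one data point: sampled positive term, analytic rest."""
    # Negative observations: log p(x_j = 0 | z) is linear in z, so its
    # expectation only needs the posterior means q.
    neg = 0.0
    for j, x_j in enumerate(x):
        if x_j == 0:
            neg -= theta0[j] + sum(t * q_i for t, q_i in zip(theta[j], q))
    # Positive observations: Monte Carlo average of log(1 - exp(-u_j)).
    pos = 0.0
    for _ in range(num_samples):
        z = [1 if rng.random() < q_i else 0 for q_i in q]
        for j, x_j in enumerate(x):
            if x_j == 1:
                u = theta0[j] + sum(t * z_i for t, z_i in zip(theta[j], z))
                pos += math.log1p(-math.exp(-u))
    pos /= num_samples
    return pos + neg - bernoulli_kl(q, mu)


theta = [[1.0, 0.5], [0.2, 2.0]]
theta0 = [0.1, 0.1]
print(elbo_estimate([1, 0], [0.4, 0.2], theta, theta0, [0.3, 0.6],
                    num_samples=500, rng=random.Random(0)))
```

Only the positive term carries Monte Carlo noise, so data points with few positive observations get low-variance estimates.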

Specifically, for each training data point $x_n$, we use eq. (14) to compute the variational parameters and use the form of the posterior in eq. (13). We then sample $z_n$ from the posterior. Since each factor of the posterior is Bernoulli, we use the Gumbel-Softmax reparameterization trick (Jang et al., 2016). The samples are then used to compute the expectation of the true positive likelihoods (and their gradients).
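For a Bernoulli variable, the two-class Gumbel-Softmax relaxation reduces to adding logistic noise to the logit and applying a tempered sigmoid. The sketch below shows that sampling step only. It is written in plain Python for readability; in practice an automatic-differentiation framework would be needed to obtain gradients. The function name and the clipping constant are choices made for this example.

```python
import math
import random


def _stable_sigmoid(a):
    """Logistic function that avoids overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def relaxed_bernoulli_sample(q, temperature, rng=random):
    """Draw a relaxed sample in [0, 1] for a Bernoulli with P(z = 1) = q.

    The difference of two independent Gumbel(0, 1) variables follows a
    Logistic(0, 1) distribution, sampled here as log(u) - log(1 - u).
    """
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep the logs finite
    logit = math.log(q) - math.log(1.0 - q)
    noise = math.log(u) - math.log(1.0 - u)
    return _stable_sigmoid((logit + noise) / temperature)


rng = random.Random(0)
print([round(relaxed_bernoulli_sample(0.3, 1.0, rng), 3) for _ in range(5)])
print([round(relaxed_bernoulli_sample(0.3, 0.1, rng), 3) for _ in range(5)])
```

Lower temperatures push the samples toward 0 or 1; thresholding a relaxed sample at 0.5 gives an exact Bernoulli(`q`) draw at any temperature, since the sign of `logit + noise` does not depend on the temperature.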

The key difference from AVI is that the (variational) posterior also depends on the generative model parameters $\theta$. While they contribute indirectly to the gradients of the expected conditional likelihoods for positive observations in the ELBO, the expected likelihoods for negative observations and the KL divergence between the posterior and the prior are analytically tractable, and their gradients with respect to $\theta$ are directly used to optimize the parameters. In AVI, the expected conditional likelihood is a constant with respect to the generative model parameters and it does not contribute to the update.

The specific form eq. (13) of the variational posterior – how evidence is incorporated and the architecture is formed – constrains the capacity of the inference network in a typical AVI approach. However, this can be advantageous: when the amount of training data is limited, the constrained form could prevent overfitting and improve generalization. Our empirical results support this claim.

4 Experiments

In this section, we compare the three types of inference methods discussed before, namely the classical approach of conjugate dual inference (CDI), AVI and ACP. We perform empirical studies on synthetic datasets in sections 4.2 – 4.4. Results on real data are reported in section 4.5.

4.1 Experiment Setup

Synthetic Data

We created two types of synthetic datasets with the following generic procedures:

  • Select the parameters of the prior distributions $p(z_i)$, $i = 1, \dots, K$.

  • Select the generative model parameters $\theta_{ji}$ and the “leak” parameters $\theta_{j0}$. Both $\theta_{ji}$ and $\theta_{j0}$ need to be non-negative.

  • Sample latent variables $z_n$, $n = 1, \dots, N$, from the prior distribution $p(z)$.

  • Sample observed data points $x_n$ from the conditional probability $p(x \mid z_n)$.
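The generic procedure above can be sketched as ancestral sampling. The example assumes Bernoulli priors with means `mu` and the exponential noisy-or parameterization, in which $p(x_j = 1 \mid z) = 1 - \exp(-\theta_{j0} - \sum_i \theta_{ji} z_i)$; the function name and the toy parameter values are hypothetical and are not the settings used in the experiments.

```python
import math
import random


def sample_noisy_or(mu, theta, theta0, num_points, rng=random):
    """Return a list of (z, x) pairs drawn from a noisy-or model.

    mu[i]: prior probability p(z_i = 1); theta[j][i] >= 0: weight between
    z_i and x_j; theta0[j] >= 0: leak weight of x_j.
    """
    data = []
    for _ in range(num_points):
        # Step 1: sample the latent variables from the prior.
        z = [1 if rng.random() < m else 0 for m in mu]
        # Step 2: sample each observation given the latent variables.
        x = []
        for j in range(len(theta0)):
            u = theta0[j] + sum(t * z_i for t, z_i in zip(theta[j], z))
            p_on = 1.0 - math.exp(-u)
            x.append(1 if rng.random() < p_on else 0)
        data.append((z, x))
    return data


mu = [0.5, 0.2]
theta = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
theta0 = [0.01, 0.01, 0.01]
for z, x in sample_noisy_or(mu, theta, theta0, 3, random.Random(0)):
    print(z, x)
```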

In syn-pattern, and . All , are set to be . The model parameters and are reshaped from the patterns depicted in Fig. 2(a). Each pattern is a matrix and reshaped to a -dimensional vector representing , where the white pixels in the th pattern indicate that the corresponding parameters are 0. The ninth pattern is the pattern of the “leak” probability. The values of non-zero (i.e. black pixels) are which means . Fig. 2(b) shows some data points sampled from the given generative process. Each data point is a combination of the parameter patterns in Fig. 2(a) with missing parts.

Figure 2: The synthetic dataset. (a) shows the original patterns of the weight matrix . (b) shows some observed sampled from the generative procedure.

In syn-sparse, , and are sampled from distributions and . Moreover, we control the sparsity of the dataset by setting a sparsity level , . can be considered as the probability of removing the connection between and by setting , which enforces sparse connections between latent and observed variables. Randomly removing connections can leave some variables with no connection to others, making them useless. To avoid that, we randomly add a connection to each latent or observed variable that does not have any connection. We describe the configurations of syn-sparse in Table 1. We distinguish two types of datasets: a small version and a large version, where and are both small or both large.

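The Sparsity is defined as the percentage of negative observations in the dataset,

$$\text{Sparsity} = \frac{100}{N D} \sum_{n=1}^{N} \sum_{j=1}^{D} \mathbb{1}[x_{nj} = 0],$$

where $N$ is the number of data points, $D$ is the number of observed variables, and a lower Sparsity value indicates a denser dataset. We split data points for validation and test set, where for the small models and for the large models.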
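As a quick illustration of this definition, the following sketch computes the Sparsity of a small binary dataset; the function name and the toy data are hypothetical.

```python
def sparsity(data):
    """Percentage of zero entries in a list of binary observation vectors."""
    total = sum(len(x) for x in data)
    zeros = sum(1 for x in data for value in x if value == 0)
    return 100.0 * zeros / total


toy_data = [[0, 0, 1, 0], [1, 0, 0, 0]]
print(sparsity(toy_data))  # 6 of the 8 entries are zero
```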

Dataset                                               Sparsity
Small     1     5     1     10    0.95    50    100   94.2
          2     5     2     5     0.95    50    100   71.8
          2     5     2     5     0.9     50    100   51.4
Large     1     5     1     20    0.995   500   500   98.4
          1     20    1     20    0.95    500   500   95.3
          1     10    1     10    0.95    500   500   73.6
Table 1: Configurations of syn-sparse datasets

Implementation Details

All our experiments were performed using the Adam optimizer (Kingma and Ba, 2014) with a batch size of . During training, we set the number of Monte Carlo samples to for each data point to compute the ELBO. We rely on the Gumbel-Softmax reparameterization trick (Jang et al., 2016) to approximate sampling of the latent variables with continuous values, so that gradients can be back-propagated. Following (Jang et al., 2016), we schedule an exponential temperature decay, with the initial temperature to be and the minimum temperature to be . During testing, we use the true discrete samples from the posterior and sample times to compute the ELBO. For ACP, the variational parameter is the output of a neural network, which is constrained to be positive. Thus we use a softplus layer as the last layer of the neural network. The architecture (number of hidden layers and hidden dimensions) of the inference model for both AVI and ACP, as well as other hyperparameters including learning rate, momentum, temperature decay rate and temperature decay step, are sampled randomly for times. We only report the result with the best hyperparameters. All experiment results are averaged over different random initializations.
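An exponential temperature schedule of the kind described above can be sketched as follows. The exact schedule used in the experiments is not specified here, so the stepwise form and every numeric value in the usage lines are assumptions made for illustration; `decay_rate` and `decay_step` mirror the "temperature decay rate" and "temperature decay step" hyperparameters.

```python
import math


def temperature(step, tau_init, tau_min, decay_rate, decay_step):
    """Exponentially decayed temperature, updated every `decay_step` steps
    and clipped from below at `tau_min`."""
    num_updates = step // decay_step
    return max(tau_min, tau_init * math.exp(-decay_rate * num_updates))


# Hypothetical settings, for illustration only.
for step in (0, 1000, 10000, 100000):
    print(step, temperature(step, tau_init=1.0, tau_min=0.5,
                            decay_rate=0.01, decay_step=100))
```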

4.2 Inference

Here, we compare CDI, AVI and ACP on their ability to accurately approximate the posterior distribution. We fix the generative model at its ground truth parameters and evaluate the inference performance by evaluating the ELBO. Since the generative model is fixed, the ELBO is maximized when $q(z \mid x) = p(z \mid x)$. Hence a higher ELBO indicates better inference performance. Moreover, we compare the ground truth $z$ and the $z$ inferred from $q(z \mid x)$ using macro F1 and Exact Match (EM) scores as inference accuracy.
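The statement that the ELBO is maximized at the true posterior follows from the standard decomposition of the log-likelihood, written here with $q(z \mid x)$ for the variational posterior (the notation is assumed for this sketch):

```latex
\begin{aligned}
\log p(x)
  &= \mathbb{E}_{q(z \mid x)}\!\left[\log \frac{p(x, z)}{q(z \mid x)}\right]
   + \mathbb{E}_{q(z \mid x)}\!\left[\log \frac{q(z \mid x)}{p(z \mid x)}\right] \\
  &= \underbrace{\mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]
   - \mathrm{KL}\big(q(z \mid x) \,\|\, p(z)\big)}_{\text{ELBO}}
   + \mathrm{KL}\big(q(z \mid x) \,\|\, p(z \mid x)\big).
\end{aligned}
```

With the generative model fixed, $\log p(x)$ is a constant, so raising the ELBO is the same as lowering $\mathrm{KL}(q(z \mid x) \,\|\, p(z \mid x))$, which is non-negative and equals zero exactly when the two distributions coincide.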

The parameters in CDI are optimized following the optimization strategy in (Jaakkola and Jordan, 1999), where we find the tightest likelihood upper-bound using fixed-point optimization and use it to compute the posterior (eq. (9)). For AVI and ACP, we optimize the variational parameters to maximize the ELBO.

We evaluate the three methods on the syn-pattern dataset and report the performance of AVI and ACP on the held-out set with different amounts of training data. For CDI, we optimize the variational parameter on samples from the held-out set directly, which has data points.

# training data   Method   NELBO   F1     EM
1000              AVI      14.0    94.5   91
                  ACP      14.4    94.3   90.4
100               AVI      18.4    86.6   76.8
                  ACP      17.4    87.1   81.1
20                AVI      37.2    49.2   47.6
                  ACP      22.2    76.1   64.0
N/A               CDI      96.6    24.6   39.0
Table 2: Inference Performance on syn-pattern Dataset

Table 2 shows the experiment results, where the first column indicates the number of training data points. We observe that when we have sufficient training data (1000 points), AVI achieves slightly better performance due to its high flexibility in approximating the posterior. Yet when we have only a limited amount of training data, ACP gains a large advantage over AVI. As we reduce the amount of training data, the gap between ACP and AVI becomes wider.

Surprisingly, CDI achieves very poor performance. We plot the learning curve for CDI in Fig. 3, where LL-UB indicates the log-likelihood upper-bound obtained by replacing the likelihood of each positive observation with its conjugate upper-bound.

In the first rounds of fixed-point optimization, LL-UB becomes tighter as we run more iterations and then converges. However, with a tighter upper-bound, the ELBO first drops quickly and then improves slightly during optimization. The final ELBO after convergence is much worse than even the initial point. This observation indicates that a tighter likelihood upper-bound is not equivalent to a better approximate posterior. Hence we will not compare to CDI in the rest of the experiments.

Figure 3: The learning curve of CDI. It shows that tighter variational bound does not imply better approximate posterior.

4.3 Parameter Estimation

To further analyze the properties of AVI and ACP, we jointly train the generative and inference models on the syn-pattern dataset. The goal is to recover the patterns of the weight matrix in a fully unsupervised way, and to compare how the amount of training data affects the performance of AVI and ACP.

Fig. 4 shows the experiment results. From the left column, we observe that both AVI and ACP recover all the patterns and achieve similar performance. When we reduce the amount of training data to (right column), ACP still reconstructs all the patterns with slightly worse performance, whereas the performance of AVI degrades more severely. Specifically, two patterns out of the eight are not recovered (i.e. the middle left and middle right patterns in Fig. 4(b)). Additionally, some patterns are merged (i.e. the upper right and middle patterns in Fig. 4(b)).

This result indicates that the model-dependent posterior form is helpful in structured inference and in learning useful latent representations, especially when we have a small amount of training data.

(a) AVI,
(b) AVI,
(c) ACP,
(d) ACP,
Figure 4: The left and right columns show the recovered parameters after training with and data points, respectively. The first row uses AVI and the second row uses ACP.
Figure 5: Comparison of AVI and ACP in noisy-or model under different model structure and different dataset sparsity.

4.4 Generalization with Small and Sparse Training Data

In this part, we investigate in detail how ACP leads to better generalization. According to eq. (13), sparse data requires fewer parameters to be approximated. Thus, in addition to varying the amount of training data, we also control the sparsity of the dataset and evaluate how it affects the generalization performance. We evaluate the performance using the negative ELBO on the held-out set. We use the syn-sparse datasets described in Table 1 for these studies.

Fig. 5 shows the experiment results for AVI and ACP with different amounts of training data at various degrees of sparsity. ACP consistently outperforms AVI with a limited amount of data. As we increase the size of the training set, the performance of AVI improves quickly and becomes similar to that of ACP.

(Jaakkola and Jordan, 1999) claimed that when we decrease the sparsity of the dataset, CDI using the upper-bound approximation would achieve worse results. The reason is that the upper-bound approximation is only applied to positive observations, while the negative ones are treated exactly. Thus adding more positive observations introduces more approximation and leads to poor results. In Fig. 5, we can observe that for both small and large models, the relative likelihood gaps (the y-axis in the figures) become smaller as we decrease the sparsity of the data. For instance, when we have training points, the relative likelihood gap in Fig. 5(a) is approximately out of , while it reduces to out of in Fig. 5(c). This observation indicates that although ACP outperforms AVI with limited training data at all sparsity levels, ACP obtains a larger performance gain over AVI on sparse data.

4.5 Topic Models

ACP (PMI = 2.78)                                                AVI (PMI = 2.49)
Topic 1         Topic 2         Topic 3         Topic 4         Topic 1         Topic 2         Topic 3         Topic 4
classification  bayesian        sparse          application     process         process         multi           sparse
application     base            analysis        classification  inference       inference       bayesian        regression
via             optimization    deep            process         gaussian        bayesian        probabilistic   gaussian
method          adaptive        datum           linear          markov          analysis        information     via
process         method          method          stochastic      datum           gaussian        inference       estimation
multi           function        estimation      analysis        mixture         datum           dynamic         inference
bayesian        object          multi           multi           analysis        mixture         approach        analysis
kernel          estimation      feature         datum           bayesian        variational     function        linear
feature         information     convex          time            variational     approach        application     process
image           datum           probabilistic   via             dynamic         probabilistic   process         optimization
PMI: 3.82       PMI: 3.22       PMI: 3.18       PMI: 3.16       PMI: 3.74       PMI: 3.73       PMI: 3.69       PMI: 2.98
Table 3: Top 10 words inferred on NeurIPS Titles dataset for top 4 topics with 5241 training data.
ACP (PMI = 2.75)                                                AVI (PMI = 2.55)
Topic 1         Topic 2         Topic 3         Topic 4         Topic 1         Topic 2         Topic 3         Topic 4
image           kernel          method          inference       base            optimization    feature         process
feature         base            estimation      bayesian        adaptive        sparse          function        gaussian
optimization    image           sparse          process         process         recognition     base            via
application     process         non             classification  kernel          classification  datum           datum
recognition     dynamic         stochastic      kernel          method          method          information     inference
fast            estimation      optimization    multi           linear          base            adaptive        optimization
clustering      classification  application     analysis        datum           adaptive        reinforcement   time
via             deep            gradient        linear          system          linear          search          recognition
representation  optimal         multi           fast            sample          function        gaussian        base
reinforcement   sample          fast            application     dynamic         regression      bound           latent
PMI: 3.47       PMI: 3.40       PMI: 3.20       PMI: 3.19       PMI: 3.09       PMI: 3.00       PMI: 2.95       PMI: 2.91
Table 4: Top 10 words inferred on NeurIPS Titles dataset for top 4 topics with 2000 training data.

In this section, we compare the ability of AVI and ACP in topic modeling, using the learned latent variables. We use the titles of all the Neural Information Processing Systems (NeurIPS) papers from 1987 to 2016 [14]. Here each observed data point is a -dimensional binary vector representing a paper’s title, where is the size of the vocabulary. The value of , indicates the presence/absence of word in the -th title. After lemmatizing the words and removing stop words, the most common words and the words with fewer than occurrences in the whole corpus, we obtain a dataset with data points and unique words. The average length of the paper title is after pre-processing. We use data points for validation, for testing, and the remaining ones for training. We model the data with latent variables.

Each latent variable is interpreted as a topic capturing a distribution of words. To further show the semantic coherence of words in each topic, we report the point-wise mutual information (PMI) between word pairs of each topic, which has been shown to be highly correlated with human judgment in assessing word relatedness (Newman et al., 2009). To do so we use the whole English corpus, which consists of approximately 4 million documents and 2 billion words. The PMI between two words $w_a$ and $w_b$ is given by $\mathrm{PMI}(w_a, w_b) = \log \frac{p(w_a, w_b)}{p(w_a)\, p(w_b)}$, where $p(w_a)$ is the probability that word $w_a$ occurs in the corpus, and $p(w_a, w_b)$ is the probability that words $w_a$ and $w_b$ co-occur in a 5-word window in any document. A higher PMI indicates higher semantic coherence.
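The average pairwise PMI of a topic can be sketched as below, assuming the usual definition $\mathrm{PMI}(a, b) = \log \frac{p(a, b)}{p(a)\,p(b)}$ with natural logarithms. The probabilities are passed in as precomputed dictionaries; how they are estimated from the reference corpus is outside this sketch. Skipping word pairs that never co-occur is an assumption made here for simplicity, and all names and toy numbers are hypothetical.

```python
import math
from itertools import combinations


def average_pairwise_pmi(words, p_single, p_pair):
    """Mean PMI over all unordered pairs of `words`.

    p_single[w]: probability that word w occurs; p_pair[(a, b)]: probability
    that a and b co-occur within the window (keys in either order).
    Pairs with zero co-occurrence probability are skipped (assumption).
    """
    scores = []
    for a, b in combinations(words, 2):
        joint = p_pair.get((a, b), p_pair.get((b, a), 0.0))
        if joint > 0.0:
            scores.append(math.log(joint / (p_single[a] * p_single[b])))
    return sum(scores) / len(scores) if scores else 0.0


# Toy probabilities, for illustration only.
p_single = {"bayesian": 0.10, "inference": 0.20, "kernel": 0.05}
p_pair = {("bayesian", "inference"): 0.04, ("inference", "kernel"): 0.01}
print(average_pairwise_pmi(["bayesian", "inference", "kernel"], p_single, p_pair))
```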

Tables 3 and 4 compare ACP and AVI with 5241 and 2000 titles as training data, respectively. For each model, we report the best 4 topics in terms of PMI and show their top-10 words. The words for each topic are those with the highest parameters, which correspond to the highest likelihoods. We also report the average pairwise PMI between the top words within each topic, and the mean of the average pairwise PMI across all topics to evaluate the model.

With more training data (5241 titles), ACP and AVI achieve an average PMI of 2.78 and 2.49, respectively. When the training set is reduced to 2000 titles, all four selected topics in ACP have better average pairwise PMI scores than those in AVI.

5 Conclusion

We proposed ACP for variational inference, which is a hybrid approach combining the classical techniques of deriving variational bounds over likelihoods and recent approaches using neural networks for amortized variational inference. We showed that by constraining the form of approximate posterior using classical methods and learning the variational parameters to maximize ELBO, our approach can generalize well even with a small amount of training data. We compared our method to AVI on the noisy-or model, and evaluated the properties of our method on the tasks of inference, parameter estimation, topic modeling, etc. In all the experiments, ACP shows advantages over AVI when we have a limited amount of training data.


We appreciate the feedback from the reviewers. This work is partially supported by NSF Awards IIS-1513966/ 1632803/1833137, CCF-1139148, DARPA Award#: FA8750-18-2-0117, DARPA-D3M - Award UCB-00009528, Google Research Awards, gifts from Facebook and Netflix, and ARO# W911NF-12-1-0241 and W911NF-15-1-0484.


  • Y. Halpern and D. Sontag (2013) Unsupervised learning of noisy-or bayesian networks. arXiv preprint arXiv:1309.6834.
  • Y. Halpern (2016) Semi-supervised learning for electronic phenotyping in support of precision medicine. Ph.D. Thesis, New York University.
  • T. S. Jaakkola and M. I. Jordan (1999) Variational probabilistic inference and the qmr-dt network. Journal of artificial intelligence research 10, pp. 291–322.
  • T. S. Jaakkola and M. I. Jordan (2000) Bayesian parameter estimation via variational methods. Statistics and Computing 10 (1), pp. 25–37.
  • E. Jang, S. Gu, and B. Poole (2016) Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144.
  • T. Jebara and A. Pentland (2001) On reversing jensen’s inequality. In Advances in Neural Information Processing Systems, pp. 231–237.
  • Y. Jernite, Y. Halpern, and D. Sontag (2013) Discovering hidden variables in noisy-or networks using quartet tests. In Advances in Neural Information Processing Systems, pp. 2355–2363.
  • M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul (1999) An introduction to variational methods for graphical models. Machine learning 37 (2), pp. 183–233.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  • D. Koller, N. Friedman, and F. Bach (2009) Probabilistic graphical models: principles and techniques. MIT press.
  • A. Mnih and K. Gregor (2014) Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030.
  • Q. Morris (2001) Recognition networks for approximate inference in bn20 networks. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pp. 370–377.
  • [14] (2019) NeurIPS titles 1987–2016.
  • D. Newman, S. Karimi, and L. Cavedon (2009) External evaluation of topic models. In Australasian Doc. Comp. Symp., 2009.
  • L. K. Saul, T. Jaakkola, and M. I. Jordan (1996) Mean field theory for sigmoid belief networks. Journal of artificial intelligence research 4, pp. 61–76.
  • L. K. Saul and M. I. Jordan (1996) Exploiting tractable substructures in intractable networks. In Advances in neural information processing systems, pp. 486–492.
  • T. Šingliar and M. Hauskrecht (2006) Noisy-or component analysis and its application to link analysis. Journal of Machine Learning Research 7 (Oct), pp. 2189–2213.