Universal Marginalizer for Amortised Inference and Embedding of Generative Models

Robert Walecki et al.

Probabilistic graphical models are powerful tools which allow us to formalise our knowledge about the world and reason about its inherent uncertainty. There exist a considerable number of methods for performing inference in probabilistic graphical models; however, they can be computationally costly due to significant time and/or storage requirements, or they lack theoretical guarantees of convergence and accuracy when applied to large-scale graphical models. To this end, we propose the Universal Marginaliser Importance Sampler (UM-IS) -- a hybrid inference scheme that combines the flexibility of a deep neural network trained on samples from the model with the asymptotic guarantees of importance sampling. We show how combining samples drawn from the graphical model with an appropriate masking function allows us to train a single neural network to approximate any of the corresponding conditional marginal distributions, and thus amortise the cost of inference. We also show that the graph embeddings can be applied to tasks such as clustering, classification and interpretation of relationships between the nodes. Finally, we benchmark the method on a large graph (>1000 nodes), showing that UM-IS outperforms sampling-based methods by a large margin while being computationally efficient.



1 Introduction

Probabilistic Graphical Models (PGMs) provide a natural framework for expressing the conditional independence relationships between random variables. PGMs are used to formalise our knowledge about the world and for reasoning and decision-making. They have been successfully applied to problems in a wide range of real-life domains, including information technology, engineering, systems biology and medicine. In systems biology, the structure of a PGM is usually learned from data and used to infer different biological properties [4]; for this type of application, the structure (edges) of the network is the main output. A particular example of a PGM is a Bayesian Network (BN) in which all variables in the graphical model are discrete. These types of networks are widely used for medical applications such as diagnosis systems: the network structure is designed by experts and is then used to infer the conditional marginal probability of diseases given a set of evidence containing observations of risk factors and/or symptoms (see Fig. 4). In such domains, the penalty for errors during inference can be potentially life-threatening. This risk can be mitigated by choosing a more complex model of the underlying process. However, exact inference is often computationally intractable for complex models, and so approximate inference is required. Furthermore, if we increase the complexity of the models, then the cost of inference increases accordingly, limiting the feasibility of available algorithms. Common approximate inference methods include variational inference [28] and Monte Carlo methods such as importance sampling [20]. Variational inference methods can be fast but do not target the true posterior; Monte Carlo inference is consistent, but can be computationally expensive.

In this paper, we propose the Universal Marginaliser Importance Sampler (UM-IS), an amortised-inference-based method for graph representation and efficient computation of asymptotically exact marginals. To compute the marginals, the UM still relies on Importance Sampling (IS); however, rather than computing proposals from scratch every time we run the inference algorithm, we use an amortised-inference-based guiding framework that significantly improves the performance of the sampling algorithm. This speed-up allows us to apply our inference scheme to large PGMs in interactive applications with minimal errors. Furthermore, the neural network can be used to calculate a vectorised representation of the evidence nodes; this representation can then be used for various machine learning tasks such as node clustering and classification.

The main contributions of the proposed work are as follows:

  • We introduce UM-IS, a novel algorithm for amortised-inference-based importance sampling. The model has the flexibility of a deep neural network to perform amortised inference: the network is trained purely on samples from the model prior, yet it benefits from the asymptotic guarantees of importance sampling.

  • We demonstrate that the efficiency of importance sampling is significantly improved, which makes the proposed method applicable for interactive applications that rely on large PGMs.

  • We show on a variety of toy networks and on a medical knowledge graph (>1000 nodes) that the proposed UM-IS outperforms sampling-based and deep-learning-based methods by a large margin, while being computationally efficient.

  • We show that the network embeddings can serve as a vectorised representation of the provided evidence for tasks such as classification, clustering and the interpretation of node relationships.

Figure 1: Universal Marginaliser: The UM performs scalable and efficient inference on graphical models. This figure shows one pass through the network. First, (1) a sample is drawn from the PGM, (2) values are then masked and (3) the masked set is passed through the UM, which then, (4) computes the marginal posteriors.

2 Related Work

Current inference schemes in general PGMs use either message passing algorithms [19], variational inference [28, 21, 13, 12] or Markov Chain Monte Carlo [8]. Exact inference algorithms based on the junction tree are computationally expensive because their time complexity is exponential in the size of the maximal clique of the junction tree [13]. In some cases, exact methods can be computationally efficient, namely in the small-graph or sparse regime [9]. However, it has been shown that on larger graphs such methods converge to a local minimum [10] that can be very different from the real marginals. Importance sampling methods [2, 20] are well studied and converge asymptotically to the global optimum. The caveat is that constructing good importance sampling proposals for large PGMs is hard and requires expert knowledge [25]. For this reason, we focus on amortised inference: techniques which speed up sampling by allowing us to “flexibly reuse inferences so as to answer a variety of related queries” [6].

Amortised inference has been popular for Sequential Monte Carlo, where it has been used to learn in advance either parameters [7] or a discriminative model which provides conditional density estimates [18, 22]. These conditional density estimates can then be used as proposals for importance sampling. This approach was also explored in [16], where the authors use MADE, a fixed sequential density estimator [5]. In contrast, our method can be seen as a further extension of MADE into a general density estimator able to learn from arbitrary sets of evidence.

Feed-forward neural networks have recently been deployed to perform amortised inference [17, 23]. In these applications, neural networks serve as non-iterative approximate inference methods, trained by minimising the error between sets of evidence and predicted posteriors. They have been successfully applied to a variety of computer vision tasks, where the graphical model and its corresponding inference network are trained jointly by maximising the variational evidence lower bound [17]. In a similar fashion, [23] introduced stochastic back-propagation, a set of rules for gradient back-propagation through stochastic variables, which can be used to perform highly efficient inference in large scale PGMs.

Recently, probabilistic programming languages have become popular for describing and performing inference in a variety of PGMs, sparing the user the burden of implementing the inference method. For example, [24] applied deep amortised inference to learn network parameters and later perform approximate inference on a PGM. Such models either follow the control flow of a predefined sequential procedure, or are restricted to a fixed set of evidence.

3 Universal Marginalizer (UM)

The Universal Marginaliser (UM) is a feed-forward neural network used to perform fast, single-pass approximate inference on general PGMs at any scale. The UM can also be used as the proposal distribution for importance sampling, to obtain asymptotically exact results when estimating the marginals of interest. We refer to this hybrid model as the Universal Marginaliser Importance Sampler (UM-IS). In this section, we introduce the notation and the training algorithm for the UM (see supplementary material Section 1 for an introduction to importance sampling).

3.1 Notation

A Bayesian Network (BN) encodes a distribution over random variables X_1, …, X_N through a Directed Acyclic Graph (DAG): the random variables are the graph nodes, and the edges dictate the conditional independence relationships between them. Specifically, each variable is conditionally independent of its non-descendants given its parents, so the joint distribution factorises as p(X_1, …, X_N) = ∏_i p(X_i | pa(X_i)).
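As a toy illustration of this factorisation (a hypothetical three-node chain, not the paper's model), the joint probability of an assignment is simply the product of each node's conditional probability table (CPT) entry given its parent:

```python
# Hypothetical binary chain A -> B -> C with explicit CPTs:
# p(A, B, C) = p(A) * p(B | A) * p(C | B).
P_A = {1: 0.3, 0: 0.7}                              # p(A = a)
P_B = {1: {1: 0.8, 0: 0.2}, 0: {1: 0.1, 0: 0.9}}    # P_B[a][b] = p(B = b | A = a)
P_C = {1: {1: 0.9, 0: 0.1}, 0: {1: 0.2, 0: 0.8}}    # P_C[b][c] = p(C = c | B = b)

def joint(a, b, c):
    """Joint probability of one complete assignment via the DAG factorisation."""
    return P_A[a] * P_B[a][b] * P_C[b][c]

# Sanity check: the factorised joint sums to 1 over all 8 assignments.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```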

The random variables can be divided into two disjoint sets: X_O, the set of observed variables within the BN, and X_U, the set of unobserved variables.

We utilise a Neural Network (NN) as an approximation to the marginal posterior distribution of each variable given an instantiation of any set of observations. The input is an encoding of the instantiation that specifies which variables are observed, and what their values are (see Section 5.1). For a set of N binary variables X_1, …, X_N, the desired network maps this encoding to a vector in [0, 1]^N whose i-th entry approximates the posterior marginal p(X_i = 1 | X_O).


This NN is used as a function approximator; hence, it can approximate any posterior marginal distribution given an arbitrary set of evidence. For this reason, we call this discriminative model the Universal Marginaliser (UM). Indeed, if we consider the marginalisation operation in a Bayesian Network as a function from evidence encodings to posterior marginals, then the existence of a neural network which can approximate this function is a direct consequence of the Universal Function Approximation Theorem (UFAT) [11]. It states that, under mild assumptions of smoothness, any continuous function can be approximated to an arbitrary precision by a neural network with a finite, but sufficiently large, number of hidden units. Once the weights of the NN are optimised, the activations of those hidden units can be computed for any new set of evidence. They are a compressed vectorised representation of the evidence set and can be used for tasks such as node clustering or classification.
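As a minimal sketch of such an approximator (the layer sizes, node count and random stand-in weights are illustrative assumptions, not the paper's trained model), a single-hidden-layer network maps a 2N-dimensional evidence encoding to N sigmoid outputs, one approximate marginal per node:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6                                                  # number of nodes (illustrative)
W1, b1 = rng.normal(size=(2 * N, 32)), np.zeros(32)    # hidden layer (untrained stand-in)
W2, b2 = rng.normal(size=(32, N)), np.zeros(N)         # one output per node

def um_forward(encoded_evidence):
    """Single forward pass: evidence encoding -> approximate marginals p(X_i = 1 | evidence)."""
    h = np.maximum(encoded_evidence @ W1 + b1, 0.0)    # ReLU hidden activations
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))        # sigmoid outputs in (0, 1)

x = np.zeros(2 * N)
x[1] = 1.0                    # e.g. node 0 observed positive under a [neg, pos] encoding
marginals = um_forward(x)     # one probability per node, in a single pass
```

With trained weights, the hidden activations h would be the compressed evidence representation mentioned above.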

3.2 Training a UM

In this section, we describe each step of the UM's training algorithm for a given PGM. The model is typically a multi-output NN with one output per node in the PGM (i.e., one output per variable X_i). Once trained, this model can handle any input evidence instantiation and produce the corresponding approximate posterior marginals for all nodes.

The flow chart of the training algorithm is depicted in Fig. 2(a). For simplicity, we assume that the training data (samples from the PGM) is pre-computed, and that only one epoch is used to train the UM.

In practice, the following steps 1–4 are applied to each mini-batch separately, rather than to the full training set all at once. This improves memory efficiency during training and ensures that the network receives a large variety of evidence combinations, accounting for low-probability regions of the evidence distribution. The steps are as follows:

1. Acquiring samples from the PGM. The UM is trained offline by generating unbiased samples (i.e., complete assignments) from the PGM using ancestral sampling [15, Algorithm 12.2]. The PGM described here contains only binary variables, so each sample is a binary vector. In the next steps, these vectors will be partially masked as input and the UM will be trained to reconstruct the complete unmasked vectors as output.

2. Masking. In order for the network to approximate the marginal posteriors at test time for any input evidence, each sample must be partially masked: the network receives as input a binary vector in which a subset of the initially observed nodes has been hidden, or masked. This masking can be deterministic, i.e., always masking specific nodes, or probabilistic. We use a different masking distribution at every iteration of the optimisation process. This is achieved in two steps. First, we sample two random numbers i and j uniformly from {0, …, N}, where N is the number of nodes in the graph. Next, we mask the positive state of i randomly selected nodes and the negative state of j randomly selected nodes. In this way, the ratio between positive and negative evidence, as well as the total number of masked nodes, differs at every iteration. A network with a large enough capacity will eventually learn to capture all these possible representations.

There is some analogy here to dropout at the input layer, so this approach could also work well as a regulariser, independently of this problem [26]. Standard dropout, however, is not suitable for this problem because it applies a constant dropout probability to all nodes.

3. Encoding the masked elements. Masked elements in the input vectors artificially reproduce queries with unobserved variables, and so their encoding must be consistent with the one used at test time. The encodings are detailed in Section 5.1.

4. Training with Cross Entropy Loss. We train the NN by minimising the multi-label binary cross entropy between the sigmoid output layer and the unmasked samples.

5. Outputs: Posterior marginals. The desired posterior marginals are approximated by the output of the last NN layer. We can use these values directly as a first estimate of the marginal posteriors (UM approach); combined with importance sampling, however, these approximate values can be further refined (UM-IS approach). This is discussed in Section 4.1 and empirically verified in Section 5.2.
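The steps above can be sketched end to end on a hypothetical three-node chain A → B → C (the CPT values, the masking convention as we read it from step 2, and the dummy prediction are illustrative assumptions, not the paper's model):

```python
import math
import random

# Step 0: a hypothetical chain A -> B -> C with p(node = 1 | parent) tables.
P_A1 = 0.3
P_B1 = {1: 0.8, 0: 0.1}   # p(B = 1 | A = a)
P_C1 = {1: 0.9, 0: 0.2}   # p(C = 1 | B = b)

def ancestral_sample(rng):
    """Step 1: draw one complete assignment in topological order."""
    a = int(rng.random() < P_A1)
    b = int(rng.random() < P_B1[a])
    c = int(rng.random() < P_C1[b])
    return [a, b, c]

def mask_sample(sample, rng):
    """Step 2: hide the positive state of i nodes and the negative state of
    j nodes, with i and j drawn uniformly from {0, ..., N}."""
    n = len(sample)
    hide_pos = set(rng.sample(range(n), rng.randint(0, n)))
    hide_neg = set(rng.sample(range(n), rng.randint(0, n)))
    return [None if ((v == 1 and k in hide_pos) or (v == 0 and k in hide_neg))
            else v for k, v in enumerate(sample)]

def encode(masked):
    """Step 3: two bits per node -- [0, 1] observed-positive,
    [1, 0] observed-negative, [0, 0] unobserved/masked."""
    vec = []
    for v in masked:
        vec += [0, 0] if v is None else ([0, 1] if v == 1 else [1, 0])
    return vec

def bce(labels, preds, eps=1e-12):
    """Step 4: multi-label binary cross entropy against the unmasked sample."""
    return -sum(y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
                for y, p in zip(labels, preds)) / len(labels)

rng = random.Random(0)
sample = ancestral_sample(rng)        # step 1: unmasked label vector
masked = mask_sample(sample, rng)     # step 2: partially hidden input
x = encode(masked)                    # step 3: NN input of length 2N
loss = bce(sample, [0.5] * 3)         # step 4: loss of an uninformative prediction
```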

(a) UM Training: The process of training a Universal Marginaliser on binary data generated from a Bayesian Network: (1) draw samples from the PGM; (2) mask them, probabilistically and/or deterministically; (3) feed the encoded masked vectors as inputs, with the unmasked samples as labels; (4) train a neural network with sigmoid outputs by minimising the cross entropy loss; (5) output the predicted posterior marginals.


(b) Inference using UM-IS: the trained UM receives the current evidence as NN input and provides the proposal for the next node; the node is sampled and added to the evidence. This step is repeated for each node in topological order, yielding one sample from the joint.
Figure 2: Training and inference of the UM-IS.

4 Hybrid: UM-IS

4.1 Sequential UM for Importance Sampling

The UM is a discriminative model which, given a set of observations, approximates all the posterior marginals. While useful on its own, the estimated marginals are not guaranteed to be unbiased. To obtain a guarantee of asymptotic unbiasedness while making use of the speed of the approximate solution, we use the estimated marginals as proposals for importance sampling. A naïve approach is to sample each unobserved node independently from its approximate posterior marginal. However, the product of the (approximate) posterior marginals may be very different from the true posterior joint, even if the individual marginal approximations are good (see supplementary material Section 2 for more details).
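A two-variable toy example (ours, not from the paper) makes this failure mode concrete: for two perfectly correlated binary variables whose posterior puts all its mass on (0, 0) and (1, 1), each marginal is exactly 0.5, yet the independent product assigns probability 0.25 to an impossible state:

```python
# True posterior joint: perfectly correlated variables.
posterior = {(0, 0): 0.5, (1, 1): 0.5}

# Exact posterior marginals p(X1 = 1) and p(X2 = 1) -- both 0.5.
m1 = sum(p for (a, _), p in posterior.items() if a == 1)
m2 = sum(p for (_, b), p in posterior.items() if b == 1)

# Naive proposal: product of the (exact!) marginals over all four states.
proposal = {(a, b): (m1 if a else 1 - m1) * (m2 if b else 1 - m2)
            for a in (0, 1) for b in (0, 1)}
# proposal[(0, 1)] = 0.25, although that state has zero posterior mass.
```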

The universality of the UM makes the following scheme possible, which we call Sequential Universal Marginaliser Importance Sampling (SUM-IS). A single proposal is sampled sequentially as follows. First, a new partially observed state is introduced and initialised to the evidence. Then, we sample the first unobserved node from the UM's conditional estimate given this state, and update the state so that this node is now observed with the sampled value. We repeat this process, at each step sampling the next node from the UM conditioned on the current state, and updating the state to include the newly sampled value. Thus, we can approximate the conditional marginal of node X_i given the current sampled state x_1, …, x_{i−1} and the evidence X_O, and use it as the proposal:

q(X_i) = p(X_i | x_1, …, x_{i−1}, X_O).

Thus, the full sample is drawn from an implicit encoding of the approximate posterior joint distribution given by the UM: the product of the sampled conditional probabilities above approximates the posterior joint, and is therefore expected to yield low-variance importance weights when used as a proposal distribution.


The process by which we sample from these proposals is illustrated in Algorithm 1 and in Fig. 2(b).

1: Order the nodes topologically X_1, …, X_N, where N is the total number of nodes.
2: for i in [1, …, M] (where M is the total number of samples) do
3:     x ← ∅
4:     for j in [1, …, N] do
5:         sample node x_j from the proposal q(X_j) given by the UM conditioned on the evidence and the partial sample x
6:         add x_j to x
7:     end for
8:     w_i ← p(x) / q(x) (where p is the likelihood and q is the product of the sampled proposals)
9: estimate the posterior marginals as the weighted average of the samples with weights w_i (as in standard IS)
Algorithm 1 Sequential Universal Marginaliser Importance Sampling

The nodes are sampled sequentially, using the UM to provide a conditional probability estimate at each step. This requirement can affect computation time, depending on the parallelisation scheme used for sampling. In our experiments, we observed that some parallelisation efficiency can be recovered by increasing the number of samples per batch.
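Algorithm 1 can be sketched on a hypothetical three-node chain A → B → C with evidence C = 1, where the exact conditionals given the evidence stand in for the trained UM (a real UM would only approximate them); with this ideal proposal the importance weights are constant, which is precisely the low-variance behaviour the sequential scheme aims for:

```python
import random

# Hypothetical chain A -> B -> C, evidence C = 1; estimate p(A = 1 | C = 1).
P_A1 = 0.3
P_B1 = {1: 0.8, 0: 0.1}   # p(B = 1 | A = a)
P_C1 = {1: 0.9, 0: 0.2}   # p(C = 1 | B = b)

# Stand-in for the UM: exact conditionals given the evidence C = 1.
# p(C=1 | A=1) = 0.8*0.9 + 0.2*0.2 = 0.76;  p(C=1 | A=0) = 0.1*0.9 + 0.9*0.2 = 0.27
# p(C=1) = 0.3*0.76 + 0.7*0.27 = 0.417
def um_proposal(node, partial):
    if node == 0:                               # p(A = 1 | C = 1)
        return 0.3 * 0.76 / 0.417
    a = partial[0]                              # p(B = 1 | A = a, C = 1)
    return P_B1[a] * P_C1[1] / (0.76 if a == 1 else 0.27)

def prior(x):
    """Joint probability of a full assignment under the chain model."""
    a, b, c = x
    bern = lambda p, v: p if v == 1 else 1 - p
    return bern(P_A1, a) * bern(P_B1[a], b) * bern(P_C1[b], c)

def sum_is(n_samples, rng):
    """Sequentially sample the unobserved nodes A, B from the UM proposal,
    then weight each completed sample by prior/proposal (Algorithm 1)."""
    weights, first_node = [], []
    for _ in range(n_samples):
        x, q = [], 1.0
        for node in (0, 1):                     # unobserved nodes in topological order
            p1 = um_proposal(node, x)
            v = int(rng.random() < p1)
            q *= p1 if v == 1 else 1 - p1
            x.append(v)
        x.append(1)                             # clamp the evidence C = 1
        weights.append(prior(x) / q)            # importance weight
        first_node.append(x[0])
    z = sum(weights)
    return sum(w for w, a in zip(weights, first_node) if a == 1) / z, weights

rng = random.Random(0)
estimate, weights = sum_is(20000, rng)
# estimate converges to p(A = 1 | C = 1) = 0.228 / 0.417; every weight
# equals p(C = 1) = 0.417, since the stand-in proposal is exact here.
```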

4.2 UM Architecture

The architecture of the UM is shown in Fig. 3. It is broadly similar to a denoising auto-encoder (see [27]), but with multiple branches – one branch for each node of the graph. In our experiments, we noticed that the cross entropy loss of a node depends strongly on the number of its parents and on its depth in the graph. To simplify the network and reduce the number of parameters, we share the weights of all fully connected layers that correspond to a specific type of node. The types are defined by depth in the graph (type 1 nodes have no parents, type 2 nodes have only type 1 nodes as parents, etc.). The best performing model on the large medical graph has three types of nodes, and its embedding layer has 2048 hidden states (more details in Section 5.1).

(a) Directed Graphical Model
(b) UM architecture
Figure 3: Graphical Model and the corresponding UM architecture. The nodes of (a) the graph are categorized by their depth inside the network and the weights of (b) the UM neural network are shared for nodes of the same category.

5 Experiments

For the subsequent experiments, we chose the best performing UM in terms of Mean Absolute Error (MAE) on the test set. We use ReLU non-linearities, apply dropout [26, 1] on the last hidden layer, and use the Adam optimisation method [14] with a batch size of 2000 samples for parameter learning. We also include batch normalisation between the fully connected layers. Training the model on the large medical graphical model used a continuous stream of generated samples and took approximately 6 days on a single GPU.

5.1 Setup

Graph: We carry out our experiments on a large (>1000 nodes) proprietary Bayesian Network for medical diagnosis representing the relationships between risk factors, diseases and symptoms. An illustration of the model structure is given in Fig. 4(c).


We tried different NN architectures via a grid search over the hyperparameters: the number of hidden layers, the number of states per hidden layer, the learning rate and the strength of regularisation through dropout.

Test set: The quality of approximate conditional marginals was measured using a test set of posterior marginals computed for 200 sets of evidence via ancestral sampling with 300 million samples. The test evidence set for the medical graph was generated by experts from real data. The test evidence set for the synthetic graphs was sampled from a uniform distribution. We used standard importance sampling, which corresponds to the likelihood weighting algorithm for discrete Bayesian networks [15, Chapter 12], with 8 GPUs over the course of 5 days to compute precise marginal posteriors of all test sets.

Metrics: Two main metrics are considered: the Mean Absolute Error (MAE), given by the mean absolute difference between the true and predicted node posteriors, and the Pearson Correlation Coefficient (PCC) of the true and predicted marginal vectors. Note that we did not observe negative correlations, so both measures are bounded between 0 and 1. We also use the Effective Sample Size (ESS) statistic for the comparison with standard importance sampling; this statistic measures the efficiency of the different proposal distributions used during sampling. Since we do not have access to the normalising constant of the posterior distribution, the ESS is defined as (Σ_i w_i)² / Σ_i w_i², where the weights w_i are defined in Step 8 of Algorithm 1.
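As a small generic sanity check of this statistic (not tied to the paper's experiments): uniform weights give an ESS equal to the number of samples, while one dominant weight drives the ESS towards 1.

```python
def ess(weights):
    """Effective Sample Size: (sum_i w_i)^2 / sum_i w_i^2."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)

uniform_ess = ess([1.0] * 4)               # equal weights -> ESS = 4
skewed_ess = ess([100.0, 0.1, 0.1, 0.1])   # one dominant weight -> ESS near 1
```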

Data Representation: We consider a one-hot-style encoding for the unobserved and observed nodes. This representation requires only two binary values per node: one value indicates whether the node is observed and positive ([0,1]), and the other whether it is observed and negative ([1,0]). If the node is unobserved or masked, both values are set to zero ([0,0]).

5.2 Results

In this section, we first discuss the results of different architectures for the UM, then compare the performance of importance sampling with different proposal functions. Finally, we discuss the efficiency of the algorithm.

(a) Synthetic graph, 96 nodes.
(b) Synthetic graph, 768 nodes.
(c) Medical PGM, 1200 nodes.
Figure 4: Performance on three different graphical models. We applied inference through importance sampling with and without the support of a trained UM and evaluated it in terms of Pearson Correlation Coefficient (PCC), Mean Absolute Error (MAE) and Effective Sample Size (ESS). The medical PGM described in the paper was designed with the help of medical experts and contains 1200 nodes.

UM Architecture and Performance: We used a hyperparameter grid search over the different network architectures and data representations. Algorithmic performance was not greatly affected by the type of data representation. We hypothesise that this is because neural networks are flexible models capable of handling different types of inputs efficiently by capturing the representations within the hidden layers. In contrast, the network architecture of the UM strongly depends on the structure of the PGM, so a specific UM needs to be trained for each PGM. This training can be computationally expensive, but once the UM is trained, it can compute the approximate marginals in a single forward pass for any new, even previously unseen, set of evidence.

UM for Inference in PGMs: In order to evaluate the performance of the sampling algorithms, we monitor the change in PCC and MAE on the test sets with respect to the total number of samples. Across all experiments, we observe a faster increase in the PCC when the UM predictions are used as proposals for importance sampling, and this effect becomes more pronounced as the size of the graphical model increases. Fig. 4 shows that standard IS (blue line) reaches a PCC close to 1 and an MAE close to 0 on the small network with 96 nodes. For such very small graphs, both algorithms converge quickly to the exact solution; however, UM-IS (orange line) still outperforms IS and converges faster, as seen in Fig. 4(a). For the synthetic graph with 768 nodes, the UM-IS error is 3 times lower than that of standard IS for the same number of samples, and the same conclusion holds for the PCC. Most interestingly, on the large medical PGM (Fig. 4(c)), UM-IS achieves better MAE and PCC than standard IS while using an order of magnitude fewer samples.

(a) Diabetes embeddings.
(b) Smoke, Obesity embeddings.
Figure 5: The figures show the embeddings filtered for two sets of symptoms and risk factors, where each scatter point corresponds to a set of evidence. The displayed embedding vectors correspond to the first two principal components. They separate unrelated medical concepts quite well and show an overlap for concepts which are closely related.

In other words, the time (and computational cost) of the inference algorithm is reduced by a factor of ten or more. We expect this improvement to be even stronger on much larger graphical models (see supplementary material Section 3 for more details). We also include the results of a simple UM architecture as a baseline. This simple UM (UM-IS-Basic) has a single hidden layer that is shared across all nodes of the PGM. Its MAE and PCC still improve over standard IS; however, the UM-IS with multiple fully connected layers per group of nodes outperforms the basic UM by a large margin. There are two reasons for this. First, the model capacity of the UM is higher, which allows it to learn more complex structures from the data. Secondly, the losses in the UM are spread across all groups of nodes, and the gradient update steps are optimised with the right order of magnitude for each group. This prevents the model from overfitting to the states of a specific type of node with a significantly higher loss.

Graph Embedding: Extracting meaningful representations from the evidence set is an additional interesting feature of the UM. In this section, we present qualitative results for this application. The graph embeddings are extracted as the 2048-dimensional activations of the inner layer of the UM (see Fig. 3). They are a low-dimensional vectorised representation of the evidence set in which the graph structure is preserved: the distance between nodes that are tightly connected in the PGM should be smaller than the distance between nodes that are independent. To visualise this, we plot the first two principal components of the embeddings of different evidence sets which we know to be related. We use evidence sets from the medical PGM, whose nodes are diseases, risk factors and symptoms. Fig. 5(a) shows that the embeddings of sets with active Type-1 and Type-2 diabetes are collocated. Although the two diseases have different underlying causes and connections in the graphical model (i.e., pancreatic beta-cell atrophy and insulin resistance, respectively), they share similar symptoms and complications (e.g., cardiovascular diseases, neuropathy, increased risk of infections, etc.). A similar clustering can be seen in Fig. 5(b) for two cardiovascular risk factors, smoking and obesity, interestingly collocated with a sign seen in patients suffering from a severe heart condition (i.e., unstable angina or acute coronary syndrome): chest pain at rest.

Node Classification:

Table 1: Classification performance of two classifiers (a linear SVC and ridge regression), each trained on one of two feature sets: dense – the dense UM embedding as features – and input – the top layer (the UM input) as features. The target (output) is always the disease layer.

To further assess the quality of the UM embeddings, we performed node classification experiments with different features and two different classifiers. More precisely, we train an SVM and a ridge regression model with thresholded binary output for multi-task disease detection. These models were trained to detect the most frequent diseases from (a) the set of evidence or (b) the embedding of that set. We used standard 5-fold cross validation with a grid search over the hyperparameters of both models and over the number of PCA components used for data preprocessing. Table 1 shows the experimental results for the two types of features. As expected, the models trained on the UM embeddings reach significantly higher performance across all evaluation measures. This is mainly because the embeddings of the evidence set are effectively compressed and structured, and preserve the information from the graph structure. Note that the mapping from the evidence set to the embeddings was optimised over a large number of generated samples during the UM learning phase. Therefore, these representations can be used to build more robust machine learning methods for classification and clustering than using the raw evidence set of the PGM.

6 Conclusion

This paper introduces a Universal Marginaliser based on a neural network which can approximate all conditional marginal distributions of a PGM. We have shown that a UM can be used via a chain decomposition of the BN to approximate the joint posterior distribution, and thus the optimal proposal distribution for importance sampling. While this process is computationally intensive, a first-order approximation can be used requiring only a single evaluation of a UM per evidence set. We evaluated the UM on multiple datasets and also on a large medical PGM demonstrating that the UM significantly improves the efficiency of importance sampling. The UM was trained offline using a large amount of generated training samples and for this reason, the model learned an effective representation for amortising the cost of inference. This speed-up makes the UM (in combination with importance sampling) applicable for interactive applications that require a high performance on very large PGMs. Furthermore, we have explored the use of the UM embeddings and we have shown that they can be used for tasks such as classification, clustering and interpretability of node relations. These UM embeddings make it possible to build more robust machine learning applications that rely on large generative models.


  • Baldi & Sadowski [2014] Pierre Baldi and Peter Sadowski. The dropout learning algorithm. Artificial intelligence, 210:78–122, 2014.
  • Cheng & Druzdzel [2000] Jian Cheng and Marek J. Druzdzel. AIS-BN: An adaptive importance sampling algorithm for evidential reasoning in large Bayesian networks. Journal of Artificial Intelligence Research, 2000.
  • Fan et al. [2008] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. Liblinear: A library for large linear classification. Journal of machine learning research, 9(Aug):1871–1874, 2008.
  • Friedman [2004] Nir Friedman. Inferring cellular networks using probabilistic graphical models. Science, 303(5659):799–805, 2004.
  • Germain et al. [2015] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: Masked autoencoder for distribution estimation. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 881–889, 2015.
  • Gershman & Goodman [2014] Samuel Gershman and Noah Goodman. Amortized inference in probabilistic reasoning. In Proceedings of the Cognitive Science Society, volume 36, 2014.
  • Gu et al. [2015] Shixiang Gu, Zoubin Ghahramani, and Richard E Turner. Neural adaptive sequential monte carlo. In Advances in Neural Information Processing Systems, pp. 2629–2637, 2015.
  • Hastings [1970] W Keith Hastings. Monte carlo sampling methods using markov chains and their applications. Biometrika, 57(1):97–109, 1970.
  • Heckerman [1990] David Heckerman. A tractable inference algorithm for diagnosing multiple diseases. In Machine Intelligence and Pattern Recognition, volume 10, pp. 163–171. Elsevier, 1990.
  • Heskes [2003] Tom Heskes. Stable fixed points of loopy belief propagation are local minima of the Bethe free energy. In Advances in Neural Information Processing Systems, pp. 359–366, 2003.
  • Hornik et al. [1989] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
  • Jaakkola & Jordan [1999] Tommi S Jaakkola and Michael I Jordan. Variational probabilistic inference and the QMR-DT network. Journal of artificial intelligence research, 10:291–322, 1999.
  • Jordan et al. [1999] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
  • Kingma & Ba [2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Koller & Friedman [2009] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.
  • Le et al. [2017] Tuan Anh Le, Atilim Gunes Baydin, Robert Zinkov, and Frank Wood. Using synthetic data to train neural networks is model-based reasoning. arXiv preprint arXiv:1703.00868, 2017.
  • Mnih & Gregor [2014] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.
  • Morris [2001] Quaid Morris. Recognition networks for approximate inference in bn20 networks. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, UAI’01, pp. 370–377, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1-55860-800-1. URL http://dl.acm.org/citation.cfm?id=2074022.2074068.
  • Murphy et al. [1999] Kevin P Murphy, Yair Weiss, and Michael I Jordan. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 467–475. Morgan Kaufmann Publishers Inc., 1999.
  • Neal [2001] Radford M Neal. Annealed importance sampling. Statistics and computing, 11(2):125–139, 2001.
  • Ng & Jordan [2000] Andrew Y Ng and Michael I Jordan. Approximate inference algorithms for two-layer Bayesian networks. In Advances in Neural Information Processing Systems, pp. 533–539, 2000.
  • Paige & Wood [2016] Brooks Paige and Frank Wood. Inference networks for sequential Monte Carlo in graphical models. In International Conference on Machine Learning, pp. 3040–3049, 2016.
  • Rezende et al. [2014] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
  • Ritchie et al. [2016] Daniel Ritchie, Paul Horsfall, and Noah D Goodman. Deep amortized inference for probabilistic programs. arXiv preprint arXiv:1610.05735, 2016.
  • Shwe & Cooper [1991] Michael Shwe and Gregory Cooper. An empirical analysis of likelihood-weighting simulation on a large, multiply connected medical belief network. Computers and Biomedical Research, 24(5):453–475, 1991.
  • Srivastava et al. [2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014. URL http://jmlr.org/papers/v15/srivastava14a.html.
  • Vincent et al. [2008] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103. ACM, 2008.
  • Wainwright et al. [2008] Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.

7 Appendices

In this section, we review importance sampling (IS) and describe how it is used to compute the posterior marginals of a PGM given a set of evidence.

7.1 Sampling with a Proposal Distribution

In BN inference, importance sampling (IS) is used to provide estimates of the posterior marginals P(X_i = x_i | x_O), where x_O denotes the observed evidence. To do so, we draw samples x^(1), ..., x^(S) from a distribution Q, known as the proposal distribution. The proposal distribution must be defined such that we can both sample from it and evaluate it efficiently. Provided we can evaluate the joint P(x, x_O), and that Q is such that its samples cover the Markov boundary of X_i along with all its ancestors, IS allows us to form the posterior estimates

P(X_i = x_i | x_O) ≈ ( Σ_{s=1}^{S} w_s 1[x_i^(s) = x_i] ) / ( Σ_{s=1}^{S} w_s ),

where w_s = P(x^(s), x_O) / Q(x^(s)) are the importance sampling weights and 1[·] is the indicator function for the event X_i = x_i.

The simplest proposal distribution is the prior, Q = P. However, as the prior and the posterior may be very different (especially in large networks), this is often an inefficient approach. An alternative is to use an estimate of the posterior distribution as the proposal. In this work, we argue that the UM learns a near-optimal proposal distribution.
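As a concrete sketch, the estimator above can be run with the prior as proposal (likelihood weighting) on a hypothetical two-node network A -> B; the CPT values below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-node BN A -> B, both Bernoulli.
p_a = 0.3                      # P(A = 1)
p_b1 = {1: 0.9, 0: 0.2}        # P(B = 1 | A = a)

def estimate_p_a_given_b(b_obs, n_samples=200_000):
    """Estimate P(A = 1 | B = b_obs): sample A from its prior and
    weight each sample by the evidence likelihood P(B = b_obs | A)."""
    a = (rng.random(n_samples) < p_a).astype(int)
    like1 = np.where(a == 1, p_b1[1], p_b1[0])   # P(B = 1 | a)
    w = like1 if b_obs == 1 else 1.0 - like1
    return float((w * a).sum() / w.sum())

est = estimate_p_a_given_b(1)
# Exact answer by Bayes' rule, for comparison:
exact = p_a * p_b1[1] / (p_a * p_b1[1] + (1 - p_a) * p_b1[0])
```

Here the prior proposal works well because the network is tiny; the appendix example below shows how badly a mismatched proposal can behave in larger networks.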

7.2 Sampling from the Posterior Marginals

Take a BN of arbitrary size and shape with N Bernoulli nodes, and consider two specific nodes, A and B, such that B is caused only and always by A; that is, P(B=1 | A=1) = 1 and P(B=1 | A=0) = 0.

Given evidence E, we assume that the posterior marginals of A and B are known. We will now illustrate that using the posterior distribution as a proposal will not necessarily yield the best result.

Say the true posterior marginal is P(A=1 | E) = 10^-6, and therefore also P(B=1 | E) = 10^-6. Naively, we would expect the product of posterior marginals, Q(x) = ∏_k P(x_k | E), to be the optimal proposal distribution. However, sampling with this Q as the proposal illustrates the problem.

Each node k ∈ {1, ..., N} contributes a factor w_k = P(x_k | pa(x_k)) / Q(x_k) to the importance weight, and the total weight of the sample is w = ∏_k w_k.

The weights should be approximately 1 if Q is close to P. However, consider the pair (A, B). There are four combinations of A and B. Under Q, we will sample A=1, in expectation, only once every million samples; since B is drawn independently with Q(B=1) = 10^-6, such a sample almost always takes the inconsistent value B=0, which has probability zero under P and therefore weight zero. On the rare occasion that A=1 and B=1 are drawn together, the contribution from node B alone is P(B=1 | A=1) / Q(B=1) = 1/10^-6 = 10^6. This is not a problem in the limit; however, if it happens within, say, the first 1000 samples, it will outweigh all other samples so far. As soon as we have a network with many nodes whose conditional probabilities are much greater than their marginal proposals, this becomes almost inevitable. A further consequence of these high weights is that, since the entire sample is weighted by the same weight, every node's estimate is affected by this high variance.
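The effect can be checked by enumerating the four configurations of (A, B). This is a simplified sketch that treats the rest of the network as fixed: the true posterior couples A and B perfectly with marginal q, while the proposal draws the two nodes independently from those marginals:

```python
# Importance weights P(a, b | E) / (Q(a) * Q(b)) for all four states,
# with q standing in for the assumed posterior marginal 10^-6.
q = 1e-6
weights = {}
for a in (0, 1):
    for b in (0, 1):
        # True posterior: A and B are perfectly coupled, so P = 0 if a != b.
        p = (q if a == 1 else 1.0 - q) if a == b else 0.0
        # Proposal: product of the posterior marginals.
        prop = (q if a == 1 else 1.0 - q) * (q if b == 1 else 1.0 - q)
        weights[(a, b)] = p / prop

# (0,0) has weight ~1, the inconsistent states have weight 0, and the
# rare state (1,1) has weight q / q**2 = 1/q = 10^6.
```

The enumeration reproduces the numbers in the text: almost every sample carries weight near 1 or exactly 0, while the one-in-a-million consistent rare state carries weight 10^6, dominating the estimator whenever it appears.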

8 Performance on large graphical models

(a) Synthetic graph, 384 nodes.
(b) Synthetic graph, 1536 nodes.
Figure 6: Additional experiments on very large synthetic graphs.

9 Node Classification with UM Embedding

UM for Node Classification: The UM incorporates an encoding step which maps the input layer into one shared embedding layer (as discussed in Section 5.2). To assess the quality of this embedding, we perform classification experiments. First, we train a classifier from the input layer to the output layer. Then, we compare it to a classifier trained from the dense shared embedding to the output layer. The dense shared embedding should encode all of the information in the input layer that is needed for prediction (compare Fig. 5). We compute embeddings for samples with diseases and use them to train an SVM classifier [Fan et al., 2008] for disease detection.
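The comparison can be sketched end to end. The sparse binary "inputs", the toy disease label, and the random projection standing in for the UM's shared embedding are all hypothetical stand-ins; the closed-form ridge classifier matches one of the baselines in Table 2:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: rows of `inputs` are sparse binary evidence vectors,
# and a fixed random projection emulates the UM's dense shared embedding.
n, d, k = 1200, 200, 32
inputs = (rng.random((n, d)) < 0.05).astype(float)
labels = (inputs[:, :10].sum(axis=1) > 0).astype(int)   # toy disease label
proj = rng.standard_normal((d, k)) / np.sqrt(d)
embedding = np.tanh(inputs @ proj)

def with_bias(X):
    return np.hstack([X, np.ones((len(X), 1))])

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge classifier on +/-1 targets."""
    Xb = with_bias(X)
    return np.linalg.solve(Xb.T @ Xb + lam * np.eye(Xb.shape[1]),
                           Xb.T @ (2.0 * y - 1.0))

def accuracy(X, y, w):
    return float(((with_bias(X) @ w > 0).astype(int) == y).mean())

tr, te = slice(0, 1000), slice(1000, None)
results = {}
for name, feats in [("input", inputs), ("dense", embedding)]:
    w = ridge_fit(feats[tr], labels[tr])
    results[name] = accuracy(feats[te], labels[te], w)
```

On this synthetic data both feature sets are informative; the point is only the shape of the pipeline (same classifier, two feature sets, one target), not the relative scores reported in Table 2.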

See the comparison of classifier performance in Table 2. The learnt embedding significantly increases the performance of each classifier (by about one order of magnitude).

            Linear SVC       RBF SVC          Ridge
            dense | input    dense | input    dense | input

Table 2: Classifier performances for three different classifiers. Each classifier is trained on one of two feature sets: "dense", the dense embedding as features, and "input", the top layer (the UM input) as features. The target (output) is always the disease layer.