Multi-Source Neural Variational Inference

11/11/2018 ∙ by Richard Kurle, et al. ∙ Technische Universität München

Learning from multiple sources of information is an important problem in machine-learning research. The key challenges are learning representations and formulating inference methods that take into account the complementarity and redundancy of various information sources. In this paper we formulate a variational autoencoder based multi-source learning framework in which each encoder is conditioned on a different information source. This allows us to relate the sources via the shared latent variables by computing divergence measures between individual source's posterior approximations. We explore a variety of options to learn these encoders and to integrate the beliefs they compute into a consistent posterior approximation. We visualise learned beliefs on a toy dataset and evaluate our methods for learning shared representations and structured output prediction, showing trade-offs of learning separate encoders for each information source. Furthermore, we demonstrate how conflict detection and redundancy can increase robustness of inference in a multi-source setting.




1 Introduction

An essential feature of most living organisms is the ability to process, relate, and integrate information coming from a vast number of sensors and eventually from memories and predictions [Stein and Meredith1993]. While integrating information from complementary sources enables a coherent and unified description of the environment, redundant sources are beneficial for reducing uncertainty and ambiguity. Furthermore, when sources provide conflicting information, it can be inferred that some sources must be unreliable.

Replicating this feature is an important goal of multimodal machine learning [Baltrušaitis, Ahuja, and Morency 2017]. Learning joint representations of multiple modalities has been attempted using various methods, including neural networks [Ngiam et al. 2011], probabilistic graphical models [Srivastava and Salakhutdinov 2014], and canonical correlation analysis [Andrew et al. 2013]. These methods focus on learning joint representations and multimodal sensor fusion. However, it is challenging to relate information extracted from different modalities. In this work, we aim to learn probabilistic representations that can be related to each other via statistical divergence measures as well as translated from one modality to another. We make no assumptions about the nature of the data (i.e. multimodal or multi-view) and therefore adopt a more general problem formulation, namely learning from multiple information sources.

Probabilistic graphical models are a common choice to address the difficulties of learning from multiple sources: they model relationships between information sources, i.e. observed random variables, via unobserved (latent) random variables. Inferring the hidden variables is usually tractable only for simple linear models. For nonlinear models, one has to resort to approximate Bayesian methods. The variational autoencoder (VAE) [Kingma and Welling 2013; Rezende, Mohamed, and Wierstra 2014] is one such method, combining neural networks and variational inference for latent-variable models (LVMs).

We build on the VAE framework, jointly learning the generative and inference models from multiple information sources. In contrast to the VAE, we encapsulate individual inference models into separate “modules”. As a result, we obtain multiple posterior approximations, each informed by a different source. These posteriors represent the belief over the same latent variables of the LVM, conditioned on the available information in the respective source.

Modelling beliefs individually—but coupled by the generative model—enables computing meaningful quantities such as measures of surprise, redundancy, or conflict between beliefs. Exploiting these measures can in turn increase the robustness of the inference models. Furthermore, we explore different methods to integrate arbitrary subsets of these beliefs, to approximate the posterior for the respective subset of observations. We essentially modularise neural variational inference in the sense that information sources and their associated encoders can be flexibly interchanged and combined after training.

2 Background—Neural variational inference

Consider a dataset X = {x^(i)}_{i=1}^N of i.i.d. samples of some random variable x and the following generative model:

p_θ(x, z) = p_θ(x | z) p(z),

where θ are the parameters of a neural network defining the conditional distribution between the latent and observable random variables z and x, respectively. The variational autoencoder [Kingma and Welling 2013; Rezende, Mohamed, and Wierstra 2014] is an approximate inference method that enables learning the parameters of this model by optimising an evidence lower bound (ELBO) on the log marginal likelihood. A second neural network with parameters φ defines the parameters of an approximation q_φ(z | x) to the posterior distribution. Since the computational cost of inference is shared across data points by using a recognition model, some authors refer to this form of inference as amortised or neural variational inference [Gershman and Goodman 2014; Mnih and Gregor 2014].
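As a concrete illustration, a single-sample amortised ELBO estimate with the reparameterisation trick can be sketched in a few lines of NumPy. The `encode` and `decode` functions below are hypothetical stand-ins for the neural networks, and diagonal Gaussians are assumed throughout; this is a minimal sketch, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mu, var):
    # Log-density of a diagonal Gaussian, summed over dimensions.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def elbo_estimate(x, encode, decode):
    """Single-sample ELBO estimate log p(x, z) - log q(z | x), z ~ q(z | x).

    `encode` maps x to the mean/variance of q(z | x); `decode` maps z to
    the mean/variance of p(x | z).  Both stand in for neural networks.
    """
    mu_q, var_q = encode(x)
    # Reparameterised sample z ~ q(z | x):
    z = mu_q + np.sqrt(var_q) * rng.standard_normal(mu_q.shape)
    mu_p, var_p = decode(z)
    # log p(x, z) = log p(x | z) + log p(z), with a standard normal prior.
    log_joint = log_normal(x, mu_p, var_p) + log_normal(z, 0.0, np.ones_like(z))
    return log_joint - log_normal(z, mu_q, var_q)
```

Averaged over many samples, this estimate lower-bounds log p(x), with a gap equal to the KL divergence between q(z | x) and the true posterior.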

The importance weighted autoencoder (IWAE) [Burda, Grosse, and Salakhutdinov 2015] generalises the VAE by using a multi-sample importance-weighting estimate of the log-likelihood. The IWAE ELBO is given as

L_K(x) = E_{z_1, …, z_K ~ q_φ(z | x)} [ log (1/K) Σ_{k=1}^K w_k ],

where K is the number of importance samples and w_k are the importance weights:

w_k = p_θ(x, z_k) / q_φ(z_k | x).

Besides achieving a tighter lower bound, the IWAE was motivated by the observation that a multi-sample estimate does not require all samples from the variational distribution to have a high posterior probability. This enables the training of a generative model using samples from a variational distribution with higher uncertainty. Importantly, this distribution need not be the posterior of all observations in the generative model. It can be a good-enough proposal distribution, e.g. the belief from a partially informed source.
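Given the log importance weights, the multi-sample bound reduces to a numerically stable log-mean-exp. A minimal, model-agnostic sketch:

```python
import numpy as np

def iwae_bound(log_w):
    """IWAE objective log (1/K) Σ_k exp(log_w[k]) for one data point,
    computed stably via the log-sum-exp trick."""
    log_w = np.asarray(log_w, dtype=float)
    m = log_w.max()
    return float(m + np.log(np.mean(np.exp(log_w - m))))
```

With K = 1 this reduces to the standard single-sample ELBO estimate, and by Jensen's inequality the log-mean-exp never falls below the plain average of the log-weights.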

3 Multi-source neural variational inference

We are interested in datasets consisting of tuples x = (x_1, …, x_M), where we use m ∈ {1, …, M} to denote the index of the source. Each observation x_m may be embedded in a different space but is assumed to be generated from the same latent state z. Therefore, each x_m corresponds to a different, potentially limited source of information about the underlying state z. From now on, we refer to the x_m in the generative model as observations and to the same x_m in the inference model as information sources.

We model each observation x_m in the generative model with a distinct set of parameters θ_m, although some parameters could be shared. The likelihood function is given as

p_θ(x_1, …, x_M | z) = Π_{m=1}^M p_{θ_m}(x_m | z).

For inference, the VAE conditions on all observable data x_1, …, x_M. However, one can condition (amortise) the approximate posterior distribution on any subset of information sources. In this paper we limit ourselves to single sources x_m. An approximate posterior distribution q_{φ_m}(z | x_m) may then be interpreted as the belief of the respective information source about the latent variables underlying the generative process.

In contrast to the VAE, we want to calculate the beliefs from different information sources individually, compare them, and eventually integrate them. In the following, we address each of these desiderata.




(a) Individual inferences


(b) Mixture of experts inference


(c) Product of experts inference
Figure 1: Graphical models of the inference models. White circles denote hidden random variables; grey-shaded circles, observed random variables; diamonds, deterministic variables. N is the number of i.i.d. samples in the dataset. To better distinguish the mixture- or product-of-experts models from an IWAE with hard-wired integration in a neural-network layer, we explicitly draw the deterministic variables (the parameters of the variational distributions).

3.1 Learning individual beliefs

In order to learn individual inference models as in Fig. 1(a), we propose an average of ELBOs, one for each information source and its respective inference model. The resulting objective is itself an ELBO on the log marginal likelihood; we refer to it as L_MS:

L_MS = (1/N) Σ_{i=1}^N Σ_{m=1}^M α_m E_{z_1, …, z_K ~ q_{φ_m}(z | x_m^(i))} [ log (1/K) Σ_{k=1}^K w_{i,m,k} ],

w_{i,m,k} = p_θ(x^(i), z_k) / q_{φ_m}(z_k | x_m^(i)).

The indices i, m and k refer to the data sample, information source, and importance sample, respectively. The factors α_m are the weights of the ELBOs, satisfying α_m ≥ 0 and Σ_m α_m = 1. Although the α_m could be inferred, we set α_m = 1/M. This ensures that all parameters are optimised individually to their best possible extent instead of down-weighting less informative sources.

Since we are dealing with partially informed encoders q_{φ_m}(z | x_m) instead of q_φ(z | x_1, …, x_M), the beliefs can be more uncertain than the posterior of all observations. This in turn degrades the generative model, as it requires samples from the posterior distribution. We found that the generative model becomes biased towards generating averaged samples rather than samples from a diverse, multimodal distribution. This issue arises in VAE-based objectives, irrespective of the complexity of the variational family, because each Monte-Carlo sample of latent variables must predict all observations. To account for this, we propose to use importance-sampling estimates of the log-likelihood (see Sec. 2). The importance weighting and sampling-importance-resampling can be seen as feedback from the observations, allowing the true posterior to be approximated even with poorly informed beliefs.
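The objective above is simply a convex combination of per-source IWAE bounds. A sketch under the assumption that the log importance weights have already been computed for each source's K samples (array layout is ours, not the paper's):

```python
import numpy as np

def multi_source_elbo(log_w, alphas=None):
    """Sketch of the multi-source ELBO: a weighted average of per-source
    IWAE bounds.

    log_w[m, k] = log p(x, z_k) - log q_m(z_k | x_m) for K importance
    samples z_k drawn from the encoder of source m.
    """
    log_w = np.asarray(log_w, dtype=float)              # shape (M, K)
    M = log_w.shape[0]
    alphas = np.full(M, 1.0 / M) if alphas is None else np.asarray(alphas, dtype=float)
    # Stable per-source log-mean-exp, i.e. one IWAE bound per source.
    mx = log_w.max(axis=1, keepdims=True)
    per_source = mx[:, 0] + np.log(np.mean(np.exp(log_w - mx), axis=1))
    return float(np.sum(alphas * per_source))
```

Setting all α_m = 1/M recovers the uniform weighting used in the paper; a one-hot α recovers a single source's IWAE bound.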

3.2 Comparing beliefs

Encapsulating individual inferences has an appealing advantage over an uninterpretable, deterministic combination within a neural network: having obtained multiple beliefs w.r.t. the same latent variables, each informed by a distinct source, we can calculate meaningful quantities relating the sources, such as measures of redundancy, surprise, or conflict. Here we focus on conflict.

Detecting conflict between beliefs is crucial to avoid false inferences and thus increase the robustness of the model. Conflicting beliefs may stem from conflicting data or from unreliable (inference) models. The former is a form of data anomaly, e.g. due to a failing sensor. An unreliable model, on the other hand, may result from model misspecification or optimisation problems, i.e. from the approximation or amortisation gap, respectively [Cremer, Li, and Duvenaud 2018]. Distinguishing between the two causes of conflict is challenging, however, and requires evaluating the observed data under the likelihood functions.

Previous work has used the ratio of two KL divergences as a criterion to detect a conflict between a subjective prior and the data [Bousquet 2008]. The numerator is the KL between the posterior and the subjective prior, and the denominator is the KL between the posterior and a non-informative reference prior. The two KL divergences measure the information gain of the posterior—induced by the evidence—w.r.t. the subjective prior and the non-informative prior, respectively. The decision criterion for conflict is a ratio greater than 1.

We propose a similar ratio, replacing the subjective prior with the belief q_{φ_m}(z | x_m) and taking the prior p(z) as reference:

C(m, m') = KL( q_{φ_{m'}}(z | x_{m'}) || q_{φ_m}(z | x_m) ) / KL( q_{φ_{m'}}(z | x_{m'}) || p(z) ).     (2)
This measure has the property that it yields high values if the belief of source m is significantly more certain than that of m'. This is desirable for sources with redundant information. For complementary information sources, other conflict measures, e.g. the measure defined in [Dahl, Gåsemyr, and Natvig], may be more appropriate.
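For diagonal Gaussian beliefs, both KL terms are available in closed form, so the ratio can be sketched directly (variable names are ours; q_a plays the role of the belief being checked, q_b the role of the reference belief, analogous to m' and m above):

```python
import numpy as np

def kl_normal(mu1, var1, mu2, var2):
    # KL( N(mu1, var1) || N(mu2, var2) ) for diagonal Gaussians, summed over dims.
    mu1, var1, mu2, var2 = map(np.atleast_1d, (mu1, var1, mu2, var2))
    return float(np.sum(0.5 * (np.log(var2 / var1)
                               + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)))

def conflict(mu_a, var_a, mu_b, var_b, mu_p=0.0, var_p=1.0):
    """Ratio KL(q_a || q_b) / KL(q_a || prior); values > 1 signal conflict."""
    return kl_normal(mu_a, var_a, mu_b, var_b) / kl_normal(mu_a, var_a, mu_p, var_p)
```

Two confident beliefs with well-separated means yield a large ratio, while identical beliefs yield zero.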

3.3 Integrating beliefs

So far, we have shown how to learn separate beliefs from different sources and how to relate them. However, we have not yet integrated the information from these sources. This can be seen by noticing that the gap between L_MS and the log marginal likelihood is significantly larger than for an IWAE with an inflexible, hard-wired combination (see supplementary material). Here we propose two methods to combine the beliefs into an integrated belief q(z | x_1, …, x_M).

Disjunctive integration—Mixture of Experts

One approach to combining individual beliefs is to treat them as alternatives, which is justified if some (but not all) sources or their respective models are unreliable or in conflict [Khaleghi et al. 2013]. We propose a mixture-of-experts (MoE) distribution, where each component is the belief informed by a different source. The corresponding graphical model for inference is shown in Fig. 1(b). As in Sec. 3.1, the variational parameters are each predicted from one source individually, without communication between them. The difference is that each q_{φ_m}(z | x_m) is considered a mixture component, such that the whole mixture distribution approximates the true posterior.

Instead of learning individual beliefs by optimising L_MS and integrating them subsequently, we can design an objective function for learning the MoE posterior directly. We refer to the corresponding ELBO as L_MoE. It differs from L_MS only in the denominator of the importance weights, which uses the mixture distribution with component weights α_m:

w_{i,m,k} = p_θ(x^(i), z_k) / Σ_{m'=1}^M α_{m'} q_{φ_{m'}}(z_k | x_{m'}^(i)).
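The only new ingredient relative to the individual objective is the mixture density in the weight's denominator. For univariate Gaussian components, it can be evaluated stably as follows (a sketch with our own variable names):

```python
import numpy as np

def log_normal_pdf(z, mu, var):
    # Log-density of a univariate Gaussian.
    return -0.5 * (np.log(2 * np.pi * var) + (z - mu) ** 2 / var)

def log_moe_density(z, mus, variances, alphas):
    """log Σ_m α_m N(z; μ_m, σ_m²): the denominator of the MoE importance
    weights, computed via the log-sum-exp trick."""
    logs = np.array([np.log(a) + log_normal_pdf(z, m, v)
                     for a, m, v in zip(alphas, mus, variances)])
    mx = logs.max()
    return float(mx + np.log(np.sum(np.exp(logs - mx))))
```

With a single component (or identical components), the mixture density coincides with the component density, as expected.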

Conjunctive integration—Product of Experts

Another option for combining beliefs is conjunctive integration, which treats each belief as a constraint. These methods are applicable in the case of equally reliable and independent evidence [Khaleghi et al. 2013]. This can be seen by inspecting the mathematical form of the posterior distribution of all observations. Applying Bayes' rule twice reveals that the true posterior of a graphical model with conditionally independent observations can be decomposed as a product of experts (PoE) [Hinton 2002]:

p(z | x_1, …, x_M) = (1/Z) p(z) Π_{m=1}^M [ p(z | x_m) / p(z) ],     (3)

where Z is a normalisation constant.
We propose to approximate Eq. (3) by replacing the true posteriors of the single observations p(z | x_m) with the variational distributions q_{φ_m}(z | x_m), obtaining the inference model shown in Fig. 1(c). In order to make the PoE distribution computable, we further assume that the variational distributions and the prior are conjugate distributions in the exponential family. Probability distributions in the exponential family have the well-known property that their product is also in the exponential family. Hence, we can calculate the normalisation constant in Eq. (3) from the natural parameters. In this work, we focus on the popular case of normal distributions. For the derivation of the natural parameters and the normalisation constant, we refer to the supplementary material.
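For the Gaussian case, the product in Eq. (3) reduces to simple arithmetic on natural parameters: precisions add, with the prior's precision counted exactly once. A univariate sketch (it assumes each expert is at least as precise as the prior, so the combined precision stays positive):

```python
import numpy as np

def poe_normal(mus, variances, mu_p=0.0, var_p=1.0):
    """Mean and variance of q(z) ∝ p(z) Π_m [q_m(z) / p(z)] for univariate
    Gaussian experts q_m = N(μ_m, σ_m²) and prior p = N(μ_p, σ_p²)."""
    mus = np.asarray(mus, dtype=float)
    variances = np.asarray(variances, dtype=float)
    lam_p, eta_p = 1.0 / var_p, mu_p / var_p           # prior natural parameters
    lam = lam_p + np.sum(1.0 / variances - lam_p)      # combined precision
    eta = eta_p + np.sum(mus / variances - eta_p)      # combined precision-weighted mean
    return eta / lam, 1.0 / lam
```

A single expert equal to the prior leaves the prior unchanged, while redundant confident experts shrink the integrated variance, matching the intuition that redundancy reduces uncertainty.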

Analogous to the MoE case, we can design an objective to learn the PoE distribution directly, rather than integrating individual beliefs. We refer to the corresponding ELBO as L_PoE:

L_PoE = (1/N) Σ_{i=1}^N E_{z_1, …, z_K ~ q_PoE(z | x^(i))} [ log (1/K) Σ_{k=1}^K w_{i,k} ],

where w_{i,k} = p_θ(x^(i), z_k) / q_PoE(z_k | x^(i)) are the standard importance weights as in the IWAE and q_PoE is the PoE inference distribution. However, the natural parameters of the individual normal distributions are not uniquely identifiable from the natural parameters of the integrated normal distribution. Thus, optimising L_PoE leads to inseparable individual beliefs. To account for this, we propose a hybrid of the individual and integrated inference distributions:

L_hybrid = β L_MS + (1 − β) L_PoE,

where in practice we choose equal weighting (β = 1/2) for simplicity.

In Sec. 5, we evaluate the proposed integration methods both as learning objectives and as post-hoc integration of the beliefs obtained by optimising L_MS or L_hybrid. Note again, however, that L_MoE and L_PoE assume conditionally independent observations and equally reliable sources. In contrast, L_MS makes no assumptions about the structure of the generative model. This allows any appropriate integration method to be chosen after learning.

4 Related Work

Canonical correlation analysis (CCA) [Hotelling 1936] is an early attempt to examine the relationship between two sets of variables. CCA and its nonlinear variants [Shon et al. 2005; Andrew et al. 2013; Feng, Li, and Wang 2015] propose projections of pairs of features such that the transformed representations are maximally correlated. CCA variants have been widely used for learning from multiple information sources [Hardoon, Szedmak, and Shawe-Taylor 2004; Rasiwasia et al. 2010]. These methods have in common with ours that they learn a common representational space for multimodal data. Furthermore, a connection between linear CCA and probabilistic graphical models has been shown [Bach and Jordan 2005].

Dempster-Shafer theory [Dempster1967, Shafer1976] is a widely used framework for integration of uncertain information. Similar to our PoE integration method, Dempster’s rule of combination takes the pointwise product of belief functions and normalises subsequently. Due to apparently counterintuitive results obtained when dealing with conflicting information [Zadeh1986], the research community proposed various measures to detect conflicting belief functions and proposed alternative integration methods. These include disjunctive integration methods [Jiang et al.2016, Denœux2008, Deng2015, Murphy2000], similar to our MoE integration method.

A closely related line of research is that of multimodal autoencoders [Ngiam et al. 2011] and multimodal deep Boltzmann machines (DBMs) [Srivastava and Salakhutdinov 2014]. Multimodal autoencoders use a shared representation for inputs and reconstructions of different modalities. Since multimodal autoencoders learn only deterministic functions, the interpretability of the representations is limited. Multimodal DBMs, on the other hand, learn multimodal generative models with a joint representation between the modalities. However, DBMs have only been shown to work with binary latent variables and are notoriously hard to train.

More recently, variational autoencoders were applied to multimodal learning [Suzuki, Nakayama, and Matsuo 2016]. Their objective function maximises the ELBO using an encoder with hard-wired sources and additional KL-divergence loss terms to train individual encoders. The difference to our methods is that we maximise an ELBO for which we require only individual encoders. We may then integrate the beliefs of arbitrary subsets of information sources after training. In contrast, the method in [Suzuki, Nakayama, and Matsuo 2016] would require a separate encoder for each possible combination of sources. Similarly, [Vedantam et al. 2017] first trains a generative model with multiple observations, using a fully informed encoder. In a second training stage, they freeze the generative-model parameters and proceed by optimising the parameters of inference models that are informed by a single source. Since the topology of the latent space is fixed in the second stage, finding good weights for the inference models may be complicated.

Concurrently to this work, [Wu and Goodman 2018] proposed a method for weakly-supervised learning from multimodal data which is very similar to our hybrid method discussed in Sec. 3.3. Their method is based on the VAE, whereas we find it crucial to optimise the importance-sampling-based ELBO to prevent the generative models from producing averaged conditional samples (see Sec. 3.1).

5 Experiments

We visualise learned beliefs on a 2D toy problem, evaluate our methods for structured prediction and demonstrate how our framework can increase robustness of inference. Model and algorithm hyperparameters are summarised in the supplementary material.

5.1 Learning beliefs from complementary information sources

We begin our experiments with a toy dataset with complementary sources. As the generative process, we consider a mixture of bivariate normal distributions with 8 mixture components. The means of the mixture components are located on the unit circle at equidistant angles, and all components share the same standard deviation. To simulate complementary sources, we allow each source to perceive only one dimension of the data. As in all our experiments, we assume a zero-centred normal prior with unit variance. We optimise L_MS with two inference models q_{φ_1}(z | x_1) and q_{φ_2}(z | x_2) and two separate likelihood functions p_{θ_1}(x_1 | z) and p_{θ_2}(x_2 | z). Fig. 2(a) (right) shows the beliefs of both information sources for 8 test data points. These test points are the means of the 8 mixture components of the observable data, rotated by a small angle. The small rotation is only for visualisation purposes: each source perceives only one axis and would therefore produce indistinguishable beliefs for data points with identical values on the perceived axis. We visualise the two beliefs corresponding to the same data point with identical colours. The height and width of the ellipses correspond to the standard deviations of the beliefs. Fig. 2(a) (left) shows random samples in the observation space, generated from 10 random latent samples for each belief. The generated samples are colour-coded in correspondence to the figure on the right. The 8 circles in the background visualise the true data distribution with 1 and 2 standard deviations. The two types of markers distinguish the information sources x_1 and x_2 used for inference. As can be seen, the beliefs reflect the ambiguity that results from perceiving a single dimension. (The true posterior of a single source has two modes for most data points; the uni-modal Gaussian proposal distribution learns to cover both modes.)

(a) Individual beliefs and their predictions. Left: 8 coloured circles are centred at the 8 test inputs from a mixture of Gaussians toy dataset. The radii indicate 1 and 2 standard deviations of the normal distributions. The two types of markers represent generated data from random samples of one of the information sources (data axis 0 or 1). Right: Corresponding individual beliefs. Ellipses show 1 standard deviation of the individual approximate posterior distributions.
(b) Integrated belief and its predictions.
Figure 2: Approximate posterior distributions and samples from the predicted likelihood function with and without integration of beliefs

Next we integrate the two beliefs using Eq. (3). The resulting integrated belief and generated data from random latent samples of the belief are shown in Figs. 2(b) (right) and 2(b) (left) respectively. We can see that the integration resolves the ambiguity. In the supplementary material, we plot samples from the individual and integrated beliefs, before and after a sampling importance re-sampling procedure.

5.2 Learning and inference of shared representations for structured prediction

Models trained with L_MS or L_hybrid can be used to predict structured data of any modality, conditioned on any available information source. Equivalently, we may impute missing data if it is modelled explicitly as an information source:

p(x_T | x_S) ≈ (1/K) Σ_{k=1}^K p_θ(x_T | z_k),   z_k ~ q_{φ_S}(z | x_S),     (6)

where x_S denotes the observed (source) modalities and x_T the predicted (target) modalities.
MNIST variants

We created 3 variants of MNIST [Lecun et al. 1998], in which we simulate multiple information sources as follows:

  • MNIST-TB: one source perceives the top half and the other perceives the bottom half of the image.

  • MNIST-QU: 4 information sources, each perceiving one quarter of the image.

  • MNIST-NO: 4 information sources, each a copy of the image corrupted with independent bit-flip noise. We use these 4 sources to amortise inference. In the generative model, we use the standard, noise-free digits as observable variables.

First, we assess how well individual beliefs can be integrated after learning, and whether beliefs can be used individually when they are learnt as integrated inference distributions. On all MNIST variants, we train 5 different models by optimising the objectives L_MS, L_MoE and L_PoE, as well as L_hybrid with two different weighting settings. All other hyperparameters are identical. We then evaluate each model under the 3 objectives L_MS, L_MoE and L_PoE. For comparison, we also train a standard IWAE with hard-wired sources on MNIST and on MNIST-NO with a single noisy source. The ELBOs on the test set are estimated using importance sampling. The obtained estimates are summarised in Tab. 1.

          trained with:  L_MS     L_MoE    L_PoE    L_hybrid  L_hybrid'  IWAE
MNIST-TB  eval L_MS      102.20   102.40   265.59   104.03    108.97     -
          eval L_MoE     101.51   101.82   264.48   103.37    108.30     -
          eval L_PoE      94.38    94.39    87.59    90.07     90.81     88.79
MNIST-QU  eval L_MS      120.46   120.37   447.67   129.63    140.61     -
          eval L_MoE     119.10   119.98   446.02   128.16    139.19     -
          eval L_PoE     108.07   107.85    87.67    89.20     90.17     88.79
MNIST-NO  eval L_MS       94.81    94.86   101.20    96.27     95.31     -
          eval L_MoE      93.98    94.03   100.36    95.58     94.55     -
          eval L_PoE      94.52    94.65    92.27    92.21     94.49     94.95
Table 1: Negative evidence lower bounds on variants of randomly binarised MNIST (lower is better). Columns correspond to the training objective, where L_hybrid and L_hybrid' denote the two weighting settings of the hybrid objective; rows correspond to the dataset variant and evaluation objective. The IWAE uses hard-wired sources.

The results confirm that learning the PoE inference model directly leads to inseparable individual beliefs. As expected, learning individual inference models with L_MS and integrating them subsequently as a PoE comes with a trade-off, which is mostly due to the low entropy of the integrated distribution. On the other hand, optimising the model with the hybrid objective achieves good results for both individual and integrated beliefs. On MNIST-NO, integrating the beliefs of redundant sources yields an improvement of roughly 2.7 nats compared to the standard IWAE with a single source.

Next, we evaluate our method for conditional (structured) prediction using Eq. (6). Fig. 3(c) shows the means of the likelihood functions, with latent variables drawn from individual and integrated beliefs. To demonstrate conditional image generation from labels, we add a third encoder that perceives class labels. Fig. 3(d) shows the means of the likelihood functions, inferred from labels.

(c) Row 1: Original images. Row 2–4: Belief informed by top half of the image. Row 5–7: Informed by bottom half. Row 8–10: Integrated belief.
(d) Predictions from 10 random samples of the latent variables, inferred from one-hot class labels.
Figure 3: Predicted images, where latent variables are inferred from the variational distributions of different sources. Sources with partial information generate diverse samples; the integration resolves ambiguities. E.g. in Fig. 3(c), the lower half of digit 3 randomly generates digits 5 and 3, and the upper half generates digits 3 and 9, whereas the integrated belief resolves these ambiguities.

We also compare our method to the missing-data imputation procedure described in [Rezende, Mohamed, and Wierstra 2014] for MNIST-TB and MNIST-QU. We run the Markov chain for all samples in the test set for 150 steps each and calculate the log-likelihood of the imputed data at every step. The results, averaged over the dataset, are compared to our multimodal data generation method in Fig. 4. For large portions of missing data, as in MNIST-TB, the Markov chain often fails to converge to the marginal distribution. But even for MNIST-QU, with only a quarter of the image missing, our method outperforms the Markov-chain procedure by a large margin. Please consult the supplementary material for a visualisation of the stepwise generations during the inference procedure.

(a) MNIST-TB, where bottom half is missing.
(b) MNIST-QU, where bottom right quarter is missing.
Figure 4: Missing-data imputation with the Monte-Carlo procedure described in [Rezende, Mohamed, and Wierstra 2014] and with our method. For the Markov-chain procedure, the initial missing data is drawn at random and is imputed from the previous random generation in subsequent steps. MSNVI was trained with the hybrid objective. For MNIST-QU, we used the PoE belief of the three observed quarters. The plots show the log-likelihood at every step of the Markov chain, averaged over the dataset. Higher is better.

Caltech-UCSD Birds 200

Caltech-UCSD Birds 200 [Welinder et al. 2010] is a dataset of 6033 bird images, split into 3000 train and 3033 test images. As a second source, we use the segmentation masks provided by [Yang, Safar, and Yang 2014]. On this dataset, we assess whether learning with multiple modalities can be advantageous in scenarios where we are interested in only one particular modality. Therefore, we evaluate the ELBO for a single source and a single target observation, i.e. encoding images and decoding segmentation masks. We compare models that learnt from multiple modalities using L_MS and L_hybrid with models that learnt from a single modality. Additionally, we evaluate the segmentation accuracy using Eq. (6). The accuracy is estimated with 100 samples drawn from the belief informed by image data. The results are summarised in Tab. 2.

              L_MS      L_hybrid   L_MS*     L_hybrid*  IWAE
img-to-seg     5326      3264       5924      3337       3228
img-to-img   -26179    -26663     -29285    -29668     -30415
accuracy      0.808     0.870      0.810     0.872      0.855
Table 2: Negative ELBOs and segmentation accuracy on Caltech-UCSD Birds 200. The IWAE was trained with a single source and target observation. Models trained with L_MS and L_hybrid use all sources and targets; L_MS* and L_hybrid* use all sources for inference but learn the generative model of a single modality.

We distinguish between objectives that involve both modalities in the generative model and objectives where we learn only the generative model for the modality of interest (segmentation), denoted with an asterisk. Models that have to learn the generative models for both images and segmentations show worse ELBOs and accuracy when evaluated on one modality. In contrast, the accuracy is slightly increased when we learn the generative model of segmentations only but use both sources for inference.
We also refer the reader to the supplementary material, where we visualise conditionally generated images, showing that learning with the importance sampling estimate of the ELBO is crucial to generate diverse samples from partially informed sources.

5.3 Robustness via conflict detection and redundancy

Figure 5: Predictions (x- and y-coordinates) of the pendulum position (panels 1, 2, 3, 5, 6) and conflict measure (panel 4). For the predictions, latent variables are inferred from the images of 3 sensors with different views (top row) as well as from their integrated beliefs (bottom middle and right). The panels show predictions (of the static model) for different angles of the pendulum, performing 3 rotations. After 2 rotations, failure of sensor 0 is simulated by outputting noise only. Lines show the mean and shaded areas show 1 and 2 standard deviations, estimated using 500 random samples of latent variables. Bottom left: the conflict measure of Eq. (2) for different angles of the pendulum.

In this experiment, we demonstrate how a shared latent representation can increase robustness by exploiting sensor redundancy and the ability to detect conflicting data. We created a synthetic dataset of perspective images of a pendulum with different views of the same scene. The pendulum rotates about the z-axis and is centred at the origin. We simulate three cameras as information sources for inference and apply independent noise to all sources. Each sensor is directed towards the origin (the centre of rotation) from a different viewpoint: sensor 0 is aligned with one coordinate axis, and sensors 1 and 2 are rotated relative to it about the two remaining axes, respectively. The distance of all sensors to the origin is twice the radius of the pendulum rotation. For the generative model, we use the x- and y-coordinates of the pendulum rather than reconstructing the images. The model was trained with the hybrid objective.

In Fig. 5, we plot the mean and standard deviation of the predicted x- and y-coordinates, where latent variables are inferred from a single source as well as from the PoE posteriors of different subsets of sources. As expected, integrating the beliefs from redundant sensors reduces the predictive uncertainty. Additionally, we visualise the three images used as information sources above these plots.

Next, we simulate an anomaly in the form of a defective sensor 0, which outputs random noise after 2 rotations of the pendulum. This has a detrimental effect on all integrated beliefs of which sensor 0 is a part. We also plot the conflict measure of Eq. (2). As can be seen, the conflict measure for sensor 0 increases significantly when the sensor fails. In this case, one should conjunctively integrate only the two remaining sensors with low conflict.

6 Summary and future research directions

We extended neural variational inference to scenarios where multiple information sources are available. We proposed an objective function to learn individual inference models jointly with a shared generative model. We defined an exemplary measure (of conflict) to compare the beliefs from distinct inference models and their respective information sources. Furthermore, we proposed a disjunctive and a conjunctive integration method to combine arbitrary subsets of beliefs.

We compared the proposed objective functions experimentally, highlighting the advantages and drawbacks of each. Naive integration as a PoE (L_PoE) leads to inseparable individual beliefs, while optimising the sources only individually (L_MS) worsens the integration of the sources. A hybrid of the two objectives (L_hybrid) achieves a good trade-off between both desiderata. Moreover, we showed how our method can be applied to structured output prediction and how exploiting the comparability of beliefs can increase robustness.

This work offers several future research directions. As an initial step, we considered only static data and a simple latent variable model. However, we have made no assumptions about the type of information source. Interesting research directions are extensions to sequence models, hierarchical models and different forms of information sources such as external memory. Another important research direction is the combination of disjunctive and conjunctive integration methods, taking into account the conflict between sources.


Acknowledgements

We would like to thank Botond Cseke for valuable suggestions and discussions.


References

  • [Andrew et al.2013] Andrew, G.; Arora, R.; Bilmes, J.; and Livescu, K. 2013. Deep canonical correlation analysis. In Proceedings of the 30th International Conference on Machine Learning - Volume 28, ICML’13, III–1247–III–1255.
  • [Bach and Jordan2005] Bach, F., and Jordan, M. 2005. A probabilistic interpretation of canonical correlation analysis.
  • [Baltrušaitis, Ahuja, and Morency2017] Baltrušaitis, T.; Ahuja, C.; and Morency, L.-P. 2017. Multimodal machine learning: A survey and taxonomy. arXiv preprint arXiv:1705.09406.
  • [Bousquet2008] Bousquet, N. 2008. Diagnostics of prior-data agreement in applied Bayesian analysis. Journal of Applied Statistics 35(9):1011–1029.
  • [Burda, Grosse, and Salakhutdinov2015] Burda, Y.; Grosse, R. B.; and Salakhutdinov, R. 2015. Importance weighted autoencoders. CoRR abs/1509.00519.
  • [Cremer, Li, and Duvenaud2018] Cremer, C.; Li, X.; and Duvenaud, D. K. 2018. Inference suboptimality in variational autoencoders. CoRR abs/1801.03558.
  • [Dahl, Gåsemyr, and Natvig2007] Dahl, F. A.; Gåsemyr, J.; and Natvig, B. 2007. A robust conflict measure of inconsistencies in Bayesian hierarchical models. Scandinavian Journal of Statistics 34(4):816–828.
  • [Dempster1967] Dempster, A. P. 1967. Upper and lower probabilities induced by a multivalued mapping. Ann. Math. Statist. 38(2):325–339.
  • [Deng2015] Deng, Y. 2015. Generalized evidence theory. Applied Intelligence 43(3):530–543.
  • [Denœux2008] Denœux, T. 2008. Conjunctive and disjunctive combination of belief functions induced by nondistinct bodies of evidence. Artificial Intelligence 172(2):234 – 264.
  • [Feng, Li, and Wang2015] Feng, F.; Li, R.; and Wang, X. 2015. Deep correspondence restricted Boltzmann machine for cross-modal retrieval. Neurocomputing 154:50–60.
  • [Gershman and Goodman2014] Gershman, S., and Goodman, N. D. 2014. Amortized inference in probabilistic reasoning. In Proceedings of the 36th Annual Meeting of the Cognitive Science Society, CogSci 2014, Quebec City, Canada, July 23-26, 2014.
  • [Hardoon, Szedmak, and Shawe-taylor2004] Hardoon, D. R.; Szedmak, S. R.; and Shawe-taylor, J. R. 2004. Canonical correlation analysis: An overview with application to learning methods. Neural Comput. 16(12):2639–2664.
  • [Hinton2002] Hinton, G. E. 2002. Training products of experts by minimizing contrastive divergence. Neural Comput. 14(8):1771–1800.
  • [Hotelling1936] Hotelling, H. 1936. Relations between two sets of variates. Biometrika 28(3/4):321–377.
  • [Jiang et al.2016] Jiang, W.; Xie, C.; Zhuang, M.; Shou, Y.; and Tang, Y. 2016. Sensor data fusion with z-numbers and its application in fault diagnosis. Sensors 16(9).
  • [Khaleghi et al.2013] Khaleghi, B.; Khamis, A.; Karray, F.; and Razavi, S. 2013. Multisensor data fusion: A review of the state-of-the-art. Information Fusion 14(1):28–44.
  • [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980.
  • [Kingma and Welling2013] Kingma, D. P., and Welling, M. 2013. Auto-encoding variational Bayes. CoRR abs/1312.6114.
  • [Lecun et al.1998] Lecun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
  • [Mnih and Gregor2014] Mnih, A., and Gregor, K. 2014. Neural variational inference and learning in belief networks. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, 1791–1799.
  • [Murphy2000] Murphy, C. K. 2000. Combining belief functions when evidence conflicts. Decis. Support Syst. 29(1):1–9.
  • [Ngiam et al.2011] Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; and Ng, A. Y. 2011. Multimodal deep learning. In Getoor, L., and Scheffer, T., eds., ICML, 689–696. Omnipress.
  • [Rasiwasia et al.2010] Rasiwasia, N.; Costa Pereira, J.; Coviello, E.; Doyle, G.; Lanckriet, G. R.; Levy, R.; and Vasconcelos, N. 2010. A new approach to cross-modal multimedia retrieval. In Proceedings of the 18th ACM International Conference on Multimedia, MM ’10, 251–260. New York, NY, USA: ACM.
  • [Rezende, Mohamed, and Wierstra2014] Rezende, D. J.; Mohamed, S.; and Wierstra, D. 2014. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31th International Conference on Machine Learning (ICML), 1278–1286.
  • [Shafer1976] Shafer, G. 1976. A Mathematical Theory of Evidence. Princeton: Princeton University Press.
  • [Shon et al.2005] Shon, A. P.; Grochow, K.; Hertzmann, A.; and Rao, R. P. N. 2005. Learning shared latent structure for image synthesis and robotic imitation. In Proceedings of the 18th International Conference on Neural Information Processing Systems, NIPS’05, 1233–1240. Cambridge, MA, USA: MIT Press.
  • [Srivastava and Salakhutdinov2014] Srivastava, N., and Salakhutdinov, R. 2014. Multimodal learning with deep Boltzmann machines. Journal of Machine Learning Research 15:2949–2980.
  • [Stein and Meredith1993] Stein, B. E., and Meredith, M. A. 1993. The merging of the senses. Cambridge, MA, US: The MIT Press.
  • [Suzuki, Nakayama, and Matsuo2016] Suzuki, M.; Nakayama, K.; and Matsuo, Y. 2016. Joint multimodal learning with deep generative models.
  • [Vedantam et al.2017] Vedantam, R.; Fischer, I.; Huang, J.; and Murphy, K. 2017. Generative models of visually grounded imagination. CoRR abs/1705.10762.
  • [Welinder et al.2010] Welinder, P.; Branson, S.; Mita, T.; Wah, C.; Schroff, F.; Belongie, S.; and Perona, P. 2010. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology.
  • [Wu and Goodman2018] Wu, M., and Goodman, N. 2018. Multimodal generative models for scalable weakly-supervised learning. CoRR abs/1802.05335.
  • [Yang, Safar, and Yang2014] Yang, J.; Safar, S.; and Yang, M.-H. 2014. Max-margin Boltzmann machines for object segmentation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition.
  • [Zadeh1986] Zadeh, L. A. 1986. A simple view of the Dempster-Shafer theory of evidence and its implication for the rule of combination. AI Mag. 7(2):85–90.

7 Appendix

A Individual inferences

In this section we derive the objective based on individual inference models. Since any proposal distribution yields an ELBO on the log-marginal likelihood, a (weighted) average of the individual ELBOs is itself an ELBO.


The factors ω_s are the weights of the individual ELBO terms, satisfying ω_s ≥ 0 and Σ_s ω_s = 1.
The gap between this objective and the marginal log-likelihood is the weighted average of the Kullback-Leibler (KL) divergences between the individual approximate posteriors and the true posterior given all sources:

This gap can be further decomposed as:

To minimise this gap, not only the KL divergences between the individual approximate posteriors and the respective true posteriors need to be minimised, but also two additional terms that depend on the likelihood of the observations that were not used as information sources for inference.
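As a minimal illustration of this objective, the sketch below averages per-source ELBOs on a conjugate toy model (prior N(0, 1), likelihood N(z, 1)), where the bound can be checked against the exact log-marginal. The toy model and all names are illustrative assumptions for the example, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conjugate model: prior z ~ N(0, 1), likelihood x | z ~ N(z, 1), observed x = 0.
# The exact posterior is N(0, 1/2), and log p(x) = log N(x; 0, 2).
x = 0.0

def log_normal(v, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

def log_joint(z):
    return log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)

def elbo_individual(mean, var, n_samples=10000):
    """Monte Carlo ELBO for a single Gaussian proposal q_s(z) = N(mean, var)."""
    z = mean + np.sqrt(var) * rng.standard_normal(n_samples)
    return np.mean(log_joint(z) - log_normal(z, mean, var))

# Weighted average of individual ELBOs (weights sum to one): still a lower bound.
weights = [0.5, 0.5]
proposals = [(0.0, 0.5), (0.3, 0.5)]  # second proposal deliberately misses the posterior
avg_elbo = sum(w * elbo_individual(m, v) for w, (m, v) in zip(weights, proposals))

log_px = log_normal(x, 0.0, 2.0)  # exact log-marginal for comparison
```

When a proposal equals the exact posterior, its ELBO is tight; the mismatched proposal loosens the averaged bound by exactly its (weighted) KL divergence to the true posterior, mirroring the gap decomposition above.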

B Mixture of experts inference

The ELBO for the mixture distribution

can be derived similarly. We employ a Monte Carlo approximation only w.r.t. each mixture component, but not w.r.t. the mixture weights. That is, we enumerate all mixture components rather than sampling each from an indicator variable. This reduces the variance of the estimate and circumvents the problem of propagating gradients through the sampling process of discrete random variables.
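A minimal sketch of this estimator on a conjugate toy model is given below; enumerating the components means each term is a plain Monte Carlo average under one component, with the full mixture density appearing inside the logarithm. The toy model (prior N(0, 1), likelihood N(z, 1)) and all names are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy conjugate model: prior z ~ N(0, 1), x | z ~ N(z, 1), observed x = 0;
# the exact posterior is N(0, 1/2).
x = 0.0

def log_normal(v, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

def log_joint(z):
    return log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)

def moe_elbo(weights, components, n_samples=5000):
    """Mixture-of-experts ELBO: enumerate components instead of sampling indicators."""
    weights = np.asarray(weights, float)
    total = 0.0
    for w_s, (mean_s, var_s) in zip(weights, components):
        z = mean_s + np.sqrt(var_s) * rng.standard_normal(n_samples)
        # log of the full mixture density, evaluated at samples from component s
        log_q_mix = np.logaddexp.reduce(
            [np.log(w_t) + log_normal(z, m_t, v_t)
             for w_t, (m_t, v_t) in zip(weights, components)], axis=0)
        total += w_s * np.mean(log_joint(z) - log_q_mix)
    return total
```

If every component equals the exact posterior, the bound is tight at log p(x); spreading the components apart loosens it by the KL divergence between the mixture and the true posterior.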

This objective minimises the average KL divergence between the mixture of approximate posteriors and the true posterior given all sources:

C Product of Gaussian experts

Here we consider the popular case of individual Gaussian approximate posteriors and a zero-centred Gaussian prior. Let the normal distributions be represented in canonical form with natural parameters (Λ, η):

q(z) = exp( ηᵀz − ½ zᵀΛz − log Z(Λ, η) )

Λ denotes the precision matrix and η = Λμ, where μ is the mean. Furthermore, Z(Λ, η) is the partition function.

Let the subscripts s, p, and f indicate the s-th approximate distribution, the prior, and the integrated distribution, respectively. The natural parameters of the integrated variational posterior from Eq. (3) can then be calculated as follows:

Λ_f = Σ_s Λ_s − (S − 1) Λ_p,    η_f = Σ_s η_s − (S − 1) η_p

To obtain a valid integrated variational posterior, we require the precision matrix Λ_f to be positive definite. This imposes requirements on the individual precision matrices Λ_s. In the case of diagonal precision matrices, the necessary and sufficient condition is that Λ_f has only positive diagonal entries. A sufficient condition is that each entry satisfies Λ_s,ii > ((S − 1)/S) Λ_p,ii for all s.
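For diagonal Gaussians, this combination rule can be sketched as follows, assuming the integrated posterior takes the form ∏_s q_s / p^(S−1) from Eq. (3); the function name and the error handling are illustrative, not the paper's implementation.

```python
import numpy as np

def integrate_poe(means, stds, prior_mean=0.0, prior_std=1.0):
    """Combine S diagonal Gaussian beliefs q_s as prod_s q_s / p^(S-1),
    working in natural (precision) parameters."""
    means = np.asarray(means, float)                  # shape (S, D)
    precisions = 1.0 / np.asarray(stds, float) ** 2   # Lambda_s, shape (S, D)
    prior_prec = 1.0 / prior_std ** 2
    S = means.shape[0]
    # Lambda_f = sum_s Lambda_s - (S - 1) * Lambda_p
    prec_int = precisions.sum(axis=0) - (S - 1) * prior_prec
    if np.any(prec_int <= 0):
        raise ValueError("integrated precision not positive: beliefs too weak vs prior")
    # eta_f = sum_s eta_s - (S - 1) * eta_p, with eta = Lambda * mu
    eta_int = (precisions * means).sum(axis=0) - (S - 1) * prior_prec * prior_mean
    return eta_int / prec_int, 1.0 / np.sqrt(prec_int)
```

Combining S copies of the prior recovers the prior exactly, while combining two agreeing confident beliefs yields a belief more confident than either, consistent with the redundancy argument in the experiments.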

The partition function of the integrated belief can be calculated from the natural parameters (Λ_f, η_f):

log Z(Λ_f, η_f) = ½ η_fᵀ Λ_f⁻¹ η_f − ½ log det Λ_f + (D/2) log 2π


D Point-wise mutual information

Inspecting Eq. (3), we can see that the negative logarithm of the constant term corresponds to the pointwise mutual information (PMI) between the observations. We do not need to calculate this constant, since we impose assumptions on the parametric forms of the distributions and can calculate the partition function of the integrated belief using Eq. (7).

However, we can also calculate this partition function from the product of the individual partition functions and the above-mentioned constant in Eq. (3):

The PMI can then be calculated as:

The pointwise mutual information can be calculated between any subset of information sources. We note, however, that this relies on the assumption that all involved probability density functions (the prior and all approximate posterior distributions) are normal.

E Visualisation of samples from individual and integrated beliefs on mixture of bi-variate Gaussians dataset

For the mixture of bivariate Gaussians dataset, we show latent samples from both information sources in Fig. 6(a) (left) and samples obtained by sampling importance re-sampling (SIR) using the full likelihood model in Fig. 6(a) (right). We also show random samples from the integrated beliefs, as well as samples obtained by SIR, in Fig. 6(b) (left) and Fig. 6(b) (right), respectively. We conclude that the integrated beliefs are much better proposal distributions, resolving the ambiguity of the individual sources.
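Sampling importance re-sampling with a belief as proposal can be sketched as below. The toy model (Gaussian prior and likelihood) and all names are illustrative assumptions, not the paper's generative model; the point is only the draw-weight-resample pattern.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model: prior z ~ N(0, 1), likelihood x | z ~ N(z, 1), observed x = 1.
# The exact posterior is N(0.5, 1/2).
x_obs = 1.0

def log_normal(v, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

def sir(proposal_mean, proposal_var, n=20000):
    """Sampling importance re-sampling: draw from a proposal (e.g. an integrated
    belief), weight by prior * likelihood / proposal, then resample."""
    z = proposal_mean + np.sqrt(proposal_var) * rng.standard_normal(n)
    log_w = (log_normal(z, 0.0, 1.0)        # prior
             + log_normal(x_obs, z, 1.0)    # likelihood
             - log_normal(z, proposal_mean, proposal_var))
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    idx = rng.choice(n, size=n, p=w, replace=True)
    return z[idx]

samples = sir(0.0, 4.0)  # even a broad proposal recovers the posterior
```

A sharper proposal (such as an integrated belief) concentrates the importance weights and raises the effective sample size, which is why the integrated beliefs above make better proposal distributions than the individual ones.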

(a) Individual beliefs and their predictions. Left: Random samples from variational posterior without integration. Colours correspond to 8 test points, located at the means of the mixture of Gaussians data distribution. Right: Samples after sampling importance re-sampling using all likelihood functions.
(b) Integrated belief and its predictions. Left: Random samples from integrated variational posterior. Colours correspond to the test points. Right: Samples after sampling importance re-sampling using all likelihood functions.
Figure 6: Samples from individual and integrated beliefs and samples obtained after SIR

F Visualisation of missing data imputation

Fig. 7 shows the mean of the generated images over 50 steps of the Markov chain procedure for missing data imputation. As can be seen in Fig. 6(c), the chain does not converge for many digits within 50 steps if too large a portion of the data is missing. Indeed, we observed that the procedure randomly fails or succeeds in converging for the same input, even after 150 steps.

(c) Bottom half of the image is missing
(d) Bottom right quarter of the image is missing
Figure 7: Missing data imputation results: mean of generated images. Observed data (fixed binarised) is kept unchanged; missing data is replaced with the randomly generated (binary) image of the previous iteration. The initial missing data is initialised randomly. Each of the 10 rows shows an exemplary image of digits 0–9.

G Conditional generations on Caltech-UCSD Birds 200

We show conditional generations of images, inferred from images or from segmentation masks, in Fig. 8. When inferring from segmentation masks, the conditional distribution should be highly multimodal due to the missing colour information. This uncertainty should ideally be reflected in the uncertainty of the belief. As can be seen in Fig. 7(c), learning with a single importance sample leads to predictions of averaged images. For completeness, generated segmentation masks are shown in Fig. 9.

(a) Trained with .
(b) Trained with .
(c) Trained with , K=1.
Figure 8: Conditional image generations, where latent variables are inferred from different sources. Row 1: Target observations. Row 2–4: Latent variables inferred from images. Row 5–15: Latent variables inferred from segmentation masks.
(a) Trained with .
(b) Trained with .
(c) Trained with , K=1.
Figure 9: Conditional segmentation mask generations, where latent variables are inferred from different sources. Row 1: Target observations. Row 2–4: Latent variables inferred from segmentation masks. Row 5–15: Latent variables inferred from images.

H Experiment setups

All inference (generative) models use the same neural network architectures for the different sources, except for the first (last) layer, which depends on the dimensions of the data. We refer to the main part of each architecture, identical for every source, as the "stem". For inference models, the stem output feeds into dense layers with linear (no activation) and sigmoid activations, parameterising the mean and standard deviation of the approximate posterior distribution. For generative models, refer to the respective subsections.
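The stem-to-parameter head described above can be sketched as follows. This is a minimal numpy illustration under the stated design (linear head for the mean, sigmoid head for the standard deviation), not the paper's implementation; all weight shapes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_head(stem_dim, latent_dim):
    """Dense heads on top of a stem output: linear for the mean,
    sigmoid for the standard deviation."""
    W_mu = 0.1 * rng.standard_normal((latent_dim, stem_dim))
    b_mu = np.zeros(latent_dim)
    W_sd = 0.1 * rng.standard_normal((latent_dim, stem_dim))
    b_sd = np.zeros(latent_dim)

    def head(h):
        mean = W_mu @ h + b_mu                            # no activation
        std = 1.0 / (1.0 + np.exp(-(W_sd @ h + b_sd)))    # sigmoid bounds std in (0, 1)
        return mean, std

    return head

head = make_head(stem_dim=32, latent_dim=2)
mean, std = head(rng.standard_normal(32))
```

The sigmoid keeps the predicted standard deviation strictly positive and bounded, so every source's belief is a valid (and not overly confident) diagonal Gaussian.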

We use the Adam optimiser [Kingma and Ba2014] in all experiments. In the tables, "dense" denotes fully connected layers, "conv" refers to convolutional layers, "pool" refers to pooling (down-sampling), and "interpol" refers to bilinear interpolation (up-sampling). K is the number of importance-weighted samples; each latent dimension is modelled with a normal prior with zero mean and unit standard deviation.

Partially observable mixture of bi-variate Gaussians

In this experiment, we use two sources, corresponding to the x- and y-coordinates of samples from a mixture of bivariate Gaussians. The neural network stems and training hyperparameters are summarised in Tab. 3. The generative models are both 1D normal distributions, parameterised by linear dense layers taking inputs from their respective stems.

Inference models (stem)
  layer   activation   output shape
  dense   tanh         32
  dense   tanh         32

Generative models (stem)
  layer   activation   output shape
  dense   tanh         32
  dense   tanh         32

Hyperparameters
  K: 8, latent dimensions: 2, batch size: 32, learning rate: 0.0001, iterations: 25k
Table 3: Neural network architectures (stem) and hyperparameters used for experiments with partially observable mixture of bi-variate Gaussians

MNIST variants

The neural network stems are summarised in Tab. 4

. The data is modelled as Bernoulli distributions of dimensions 784 for MNIST-NO, 392 for MNIST-TB and 196 for MNIST-QU. The Bernoulli parameters are parameterised by linear dense layers, taking inputs from their respective stems.

Inference models (stem)
  layer   activation   output shape
  dense   elu          200
  dense   elu          200

Generative models (stem)
  layer   activation   output shape
  dense   elu          200
  dense   elu          200

Hyperparameters
  K: 16, latent dimensions: 16, batch size: 128, learning rate: 0.00005, iterations: 250k
Table 4: Neural network architectures and hyperparameters used for experiments with MNIST variants.


Perspective pendulum

In the pendulum experiment, we use three image sources for the inference models, but a single observation of the x- and y-coordinates of the pendulum centre. The generative model is assumed normal for both coordinates, where the mean is predicted by a linear dense layer taking inputs from the stem, and the standard deviation is a global variable. The neural network stems and training hyperparameters are summarised in Tab. 5.

Inference models (stem)
  layer   activation   output shape
  dense   tanh         32
  dense   tanh         32

Generative models (stem)
  layer   activation   output shape
  dense   tanh         256
  dense   tanh         64
  dense   tanh         16

Hyperparameters
  K: 16, latent dimensions: 2, batch size: 16, learning rate: 0.00005, iterations: 50k
Table 5: Neural network architectures and hyperparameters used for perspective pendulum experiments

Caltech-UCSD Birds 200

The neural network stems are summarised in Tab. 6. Images are modelled as diagonal normal distributions and segmentation masks as Bernoulli distributions. The generative model stem feeds into transposed convolutional layers with stride 2, yielding the mean of the likelihood function. The standard deviations are global and shared across all pixels. Activations are leaky rectified linear units (lrelu).


Inference models (stem)
  layer     stride   activation   output shape
  conv      1        lrelu        128x128x16
  conv      1        lrelu        128x128x16
  pool      2        -            64x64x16
  conv      1        lrelu        64x64x32
  conv      1        lrelu        64x64x32
  pool      2        -            32x32x32
  conv      1        lrelu        32x32x48
  conv      1        lrelu        32x32x48
  pool      2        -            16x16x48
  conv      1        lrelu        16x16x64
  conv      1        lrelu        16x16x64
  pool      2        -            8x8x64
  conv      1        lrelu        8x8x96
  conv      1        lrelu        8x8x96
  pool      2        -            4x4x96
  dense     -        linear       256

Generative models (stem)
  layer     stride   activation   output shape
  dense     -        linear       4x4x96
  conv      1        lrelu        4x4x64
  conv      1        lrelu        4x4x64
  interpol  2        -            8x8x64
  conv      1        lrelu        8x8x48
  conv      1        lrelu        8x8x48
  interpol  2        -            16x16x48
  conv      1        lrelu        16x16x32
  conv      1        lrelu        16x16x32
  interpol  2        -            32x32x32
  conv      1        lrelu        32x32x16
  conv      1        lrelu        32x32x16
  interpol  2        -            64x64x16

Hyperparameters
  K: 80, latent dimensions: 96, batch size: 16, learning rate: 0.0002, iterations: 25k
Table 6: Neural network architectures and hyperparameters used for Caltech-UCSD Birds 200 experiments