1 Introduction
An essential feature of most living organisms is the ability to process, relate, and integrate information coming from a vast number of sensors and eventually from memories and predictions [Stein and Meredith1993]. While integrating information from complementary sources enables a coherent and unified description of the environment, redundant sources are beneficial for reducing uncertainty and ambiguity. Furthermore, when sources provide conflicting information, it can be inferred that some sources must be unreliable.
Replicating this feature is an important goal of multimodal machine learning [Baltrušaitis, Ahuja, and Morency2017]. Learning joint representations of multiple modalities has been attempted using various methods, including neural networks [Ngiam et al.2011], probabilistic graphical models [Srivastava and Salakhutdinov2014], and canonical correlation analysis [Andrew et al.2013]. These methods focus on learning joint representations and multimodal sensor fusion. However, it is challenging to relate information extracted from different modalities. In this work, we aim at learning probabilistic representations that can be related to each other by statistical divergence measures as well as translated from one modality to another. We make no assumptions about the nature of the data (i.e. multimodal or multiview) and therefore adopt a more general problem formulation, namely learning from multiple information sources.
Probabilistic graphical models are a common choice to address the difficulties of learning from multiple sources by modelling relationships between information sources (i.e., observed random variables) via unobserved random variables. Inferring the hidden variables is usually only tractable for simple linear models. For nonlinear models, one has to resort to approximate Bayesian methods. The variational autoencoder (VAE) [Kingma and Welling2013, Rezende, Mohamed, and Wierstra2014] is one such method, combining neural networks and variational inference for latent-variable models (LVM).
We build on the VAE framework, jointly learning the generative and inference models from multiple information sources. In contrast to the VAE, we encapsulate individual inference models into separate “modules”. As a result, we obtain multiple posterior approximations, each informed by a different source. These posteriors represent the belief over the same latent variables of the LVM, conditioned on the available information in the respective source.
Modelling beliefs individually—but coupled by the generative model—enables computing meaningful quantities such as measures of surprise, redundancy, or conflict between beliefs. Exploiting these measures can in turn increase the robustness of the inference models. Furthermore, we explore different methods to integrate arbitrary subsets of these beliefs, to approximate the posterior for the respective subset of observations. We essentially modularise neural variational inference in the sense that information sources and their associated encoders can be flexibly interchanged and combined after training.
2 Background—Neural variational inference
Consider a dataset $X = \{x^{(n)}\}_{n=1}^{N}$ of $N$ i.i.d. samples of some random variable $x$ and the following generative model:
$$p_\theta(x^{(n)}) = \int p_\theta(x^{(n)} \mid z^{(n)})\, p(z^{(n)})\, \mathrm{d}z^{(n)},$$
where $\theta$ are the parameters of a neural network, defining the conditional distribution between latent and observable random variables $z$ and $x$, respectively. The variational autoencoder [Kingma and Welling2013, Rezende, Mohamed, and Wierstra2014] is an approximate inference method that enables learning the parameters of this model by optimising an evidence lower bound (ELBO) to the log marginal likelihood. A second neural network with parameters $\phi$ defines the parameters of an approximation $q_\phi(z \mid x)$ of the posterior distribution. Since the computational cost of inference for each data point is shared by using a recognition model, some authors refer to this form of inference as amortised or neural variational inference [Gershman and Goodman2014, Mnih and Gregor2014].
The importance weighted autoencoder (IWAE) [Burda, Grosse, and Salakhutdinov2015] generalises the VAE by using a multi-sample importance weighting estimate of the log-likelihood. The IWAE ELBO is given as:
$$\mathcal{L}^{(\mathrm{IW})} = \mathbb{E}_{z_{1:K} \sim q_\phi(z \mid x)}\left[\log \frac{1}{K} \sum_{k=1}^{K} w_k\right],$$
where $K$ is the number of importance samples, and $w_k$ are the importance weights:
$$w_k = \frac{p_\theta(x, z_k)}{q_\phi(z_k \mid x)}.$$
Besides achieving a tighter lower bound, the IWAE was motivated by noticing that a multi-sample estimate does not require all samples from the variational distribution to have a high posterior probability. This enables the training of a generative model using samples from a variational distribution with higher uncertainty. Importantly, this distribution need not be the posterior of all observations in the generative model. It can be a good enough proposal distribution, i.e. the belief from a partially informed source.
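As an illustration of the multi-sample bound, the following NumPy sketch estimates it on a toy model that is not taken from the paper: a scalar latent $z \sim \mathcal{N}(0, 1)$ with likelihood $x \mid z \sim \mathcal{N}(z, 1)$, chosen only because its marginal $\mathcal{N}(0, 2)$ and posterior $\mathcal{N}(x/2, 1/2)$ are known in closed form. The function names `log_normal` and `iwae_bound`, the observation value and the proposal parameters are assumptions made for this example.

```python
import numpy as np

def log_normal(v, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (v - mean) ** 2 / var)

def iwae_bound(x, q_mean, q_var, K, rng):
    """One Monte Carlo estimate of log(1/K * sum_k w_k) for the toy model."""
    z = q_mean + np.sqrt(q_var) * rng.standard_normal(K)         # z_k ~ q(z | x)
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)  # log p(x, z_k)
    log_w = log_joint - log_normal(z, q_mean, q_var)             # log w_k
    top = log_w.max()                                            # stabilise the log-sum-exp
    return top + np.log(np.mean(np.exp(log_w - top)))

rng = np.random.default_rng(0)
x = 1.3                              # arbitrary observation (assumption)
log_px = log_normal(x, 0.0, 2.0)     # exact log marginal likelihood of the toy model

# Proposal equal to the exact posterior: every weight equals p(x).
exact = iwae_bound(x, x / 2.0, 0.5, K=5, rng=rng)

# Broader proposal (here the prior): average many estimates for K=1 and K=50.
loose_k1 = np.mean([iwae_bound(x, 0.0, 1.0, 1, rng) for _ in range(2000)])
loose_k50 = np.mean([iwae_bound(x, 0.0, 1.0, 50, rng) for _ in range(2000)])
print(log_px, exact, loose_k1, loose_k50)
```

With the exact posterior as proposal the bound is tight for any $K$. With the less informed proposal, the averaged estimate for $K = 50$ lies much closer to the log marginal likelihood than the one for $K = 1$, which is the property that makes uncertain proposals usable.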
3 Multisource neural variational inference
We are interested in datasets consisting of tuples $\{x^{(n)} = (x^{(n)}_1, \dots, x^{(n)}_M)\}_{n=1}^{N}$, where $m \in \{1, \dots, M\}$ denotes the index of the source. Each observation $x^{(n)}_m$ may be embedded in a different space but is assumed to be generated from the same latent state $z^{(n)}$. Therefore, each $x^{(n)}_m$ corresponds to a different, potentially limited source of information about the underlying state $z^{(n)}$. From now on we will refer to $x_m$ in the generative model as observations and the same $x_m$ in the inference model as information sources.
We model each observation $x_m$ in the generative model with a distinct set of parameters $\theta_m$, although some parameters could be shared. The likelihood function is given as:
$$p_\theta(x \mid z) = \prod_{m=1}^{M} p_{\theta_m}(x_m \mid z).$$
For inference, the VAE conditions on all observable data . However, one can condition (amortize) the approximate posterior distribution on any set of information sources. In this paper we limit ourselves to . An approximate posterior distribution may then be interpreted as the belief of the respective information sources about the latent variables, underlying the generative process.
In contrast to the VAE, we want to calculate the beliefs from different information sources individually, compare them, and eventually integrate them. In the following, we address each of these desiderata.
3.1 Learning individual beliefs
In order to learn individual inference models as in Fig. 0(a), we propose an average of ELBOs, one for each information source and its respective inference model. The resulting objective is itself an ELBO to the log marginal likelihood and is referred to as $\mathcal{L}^{(\mathrm{ind})}$:
$$\mathcal{L}^{(\mathrm{ind})} = \sum_{n=1}^{N} \sum_{m=1}^{M} \pi_m\, \mathbb{E}_{z^{(n)}_{m,1:K} \sim q_{\phi_m}(z \mid x^{(n)}_m)}\left[\log \frac{1}{K} \sum_{k=1}^{K} w^{(n)}_{m,k}\right] \qquad (1)$$
with
$$w^{(n)}_{m,k} = \frac{p_\theta(x^{(n)}, z^{(n)}_{m,k})}{q_{\phi_m}(z^{(n)}_{m,k} \mid x^{(n)}_m)}.$$
The indices $n$, $m$ and $k$ refer to the data sample, information source, and importance sample index. The factors $\pi_m$ are the weights of the ELBOs, satisfying $0 \le \pi_m \le 1$ and $\sum_{m=1}^{M} \pi_m = 1$. Although the $\pi_m$ could be inferred, we set $\pi_m = 1/M$. This ensures that all parameters are optimised individually to their best possible extent instead of downweighting less informative sources.
Since we are dealing with partially informed encoders $q_{\phi_m}(z \mid x_m)$ instead of $q_\phi(z \mid x)$, the beliefs can be more uncertain than the posterior of all observations $p(z \mid x)$. This in turn degrades the generative model, as it requires samples from the posterior distribution. We found that the generative model becomes biased towards generating averaged samples rather than samples from a diverse, multimodal distribution. This issue arises in VAE-based objectives, irrespective of the complexity of the variational family, because each Monte Carlo sample of latent variables must predict all observations. To account for this, we propose to use importance sampling estimates of the log-likelihood (see Sec. 2). The importance weighting and sampling-importance-resampling can be seen as feedback from the observations, making it possible to approximate the true posterior even with poorly informed beliefs.
3.2 Comparing beliefs
Encapsulating individual inferences has an appealing advantage compared to an uninterpretable, deterministic combination within a neural network: Having obtained multiple beliefs w.r.t. the same latent variables, each informed by a distinct source, we can calculate meaningful quantities to relate the sources. Examples are measures of redundancy, surprise, or conflict. Here we focus on the latter.
Detecting conflict between beliefs is crucial to avoid false inferences and thus increase robustness of the model. Conflicting beliefs may stem from conflicting data or from unreliable (inference) models. The former is a form of data anomaly, e.g. due to a failing sensor. An unreliable model on the other hand may result from model misspecification or optimisation problems, i.e. due to the approximation or amortisation gap, respectively [Cremer, Li, and Duvenaud2018]. Distinguishing between the two causes of conflict is challenging however and requires evaluating the observed data under the likelihood functions.
Previous work has used the ratio of two KL divergences as a criterion to detect a conflict between a subjective prior and the data [Bousquet2008]. The numerator is the KL divergence between the posterior and the subjective prior, and the denominator is the KL divergence between the posterior and a non-informative reference prior. The two KL divergences measure the information gain of the posterior (induced by the evidence) w.r.t. the subjective prior and the non-informative prior, respectively. The decision criterion for conflict is a ratio greater than 1.
We propose a similar ratio, replacing the subjective prior with and taking the prior as reference:
(2) 
This measure has the property that it yields high values if the belief of source is significantly more certain than that of . This is desirable for sources with redundant information. For complementary information sources other conflict measures, e.g. the measure defined in [Dahl, Gåsemyr, and Navig], may be more appropriate.
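A ratio of this kind can be sketched for diagonal Gaussian beliefs as follows. The sketch assumes that the numerator is the KL divergence from the belief of one source to the belief of another source, and that the denominator is the KL divergence from the same belief to a standard normal prior. This ordering of the arguments is an assumption of the example, and the function names `kl_diag_gauss` and `conflict` as well as all numerical values are hypothetical.

```python
import numpy as np

def kl_diag_gauss(mu_a, var_a, mu_b, var_b):
    """KL( N(mu_a, diag(var_a)) || N(mu_b, diag(var_b)) )."""
    mu_a, var_a, mu_b, var_b = map(np.asarray, (mu_a, var_a, mu_b, var_b))
    return 0.5 * np.sum(np.log(var_b / var_a)
                        + (var_a + (mu_a - mu_b) ** 2) / var_b - 1.0)

def conflict(mu_m, var_m, mu_o, var_o):
    """Assumed ratio: KL(belief m || belief o) / KL(belief m || prior N(0, I))."""
    prior_mu, prior_var = np.zeros_like(mu_m), np.ones_like(var_m)
    return (kl_diag_gauss(mu_m, var_m, mu_o, var_o)
            / kl_diag_gauss(mu_m, var_m, prior_mu, prior_var))

# Two beliefs that roughly agree, and two that contradict each other.
agree = conflict(np.array([1.0, -0.5]), np.array([0.1, 0.1]),
                 np.array([1.1, -0.4]), np.array([0.1, 0.1]))
clash = conflict(np.array([1.0, -0.5]), np.array([0.1, 0.1]),
                 np.array([-1.0, 0.5]), np.array([0.1, 0.1]))
print(agree, clash)
```

In this example the agreeing pair yields a ratio well below 1 and the contradicting pair a ratio well above 1. Whether the threshold of 1 used in the cited prior work transfers to this ratio is not established here.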
3.3 Integrating beliefs
So far, we have shown how to learn separate beliefs from different sources and how to relate them. However, we have not yet integrated the information from these sources. This can be seen by noticing that the gap between the objective of Eq. (1) and the log marginal likelihood is significantly larger compared to an IWAE with an inflexible, hardwired combination (see supplementary material). Here we propose two methods to combine the individual beliefs into a single integrated belief.
Disjunctive integration—Mixture of Experts
One approach to combine individual beliefs is by treating them as alternatives, which is justified if some (but not all) sources or their respective models are unreliable or in conflict [Khaleghi et al.2013]. We propose a mixture of experts (MoE) distribution, where each component is the belief, informed by a different source. The corresponding graphical model for inference is shown in Fig. 0(b). As in Sec. 3.1, the variational parameters are each predicted from one source individually without communication between them. The difference is that each is considered as a mixture component, such that the whole mixture distribution approximates the true posterior.
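Instead of learning individual beliefs by optimising the objective of Eq. (1) and integrating them subsequently into a combined belief, we can design an objective function for learning the MoE posterior directly. We refer to the corresponding ELBO as $\mathcal{L}^{(\mathrm{MoE})}$. It differs from the objective of Eq. (1) only by the denominator of the importance weights, which uses the mixture distribution with component weights $\pi_m$:
$$\mathcal{L}^{(\mathrm{MoE})} = \sum_{n=1}^{N} \sum_{m=1}^{M} \pi_m\, \mathbb{E}_{z^{(n)}_{m,1:K} \sim q_{\phi_m}(z \mid x^{(n)}_m)}\left[\log \frac{1}{K} \sum_{k=1}^{K} w^{(n)}_{m,k}\right], \qquad w^{(n)}_{m,k} = \frac{p_\theta(x^{(n)}, z^{(n)}_{m,k})}{\sum_{m'=1}^{M} \pi_{m'}\, q_{\phi_{m'}}(z^{(n)}_{m,k} \mid x^{(n)}_{m'})}.$$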
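The change in the denominator of the importance weights can be sketched in NumPy as follows. The example is an assumption-laden illustration rather than the training procedure of the paper: it uses a toy model with a scalar latent $z \sim \mathcal{N}(0, 1)$ and two observations $x_m \mid z \sim \mathcal{N}(z, 1)$, takes the exact single-source posteriors $\mathcal{N}(x_m/2, 1/2)$ as mixture components, and draws the same number of samples from each component. The helper names `log_mixture` and `moe_log_weights` are hypothetical.

```python
import numpy as np

def log_normal(v, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (v - mean) ** 2 / var)

def log_mixture(z, means, variances, pi):
    """log sum_m pi_m N(z; mean_m, var_m), via a stable log-sum-exp."""
    comp = np.log(pi)[:, None] + log_normal(z[None, :], means[:, None], variances[:, None])
    top = comp.max(axis=0)
    return top + np.log(np.exp(comp - top).sum(axis=0))

def moe_log_weights(z, log_joint, means, variances, pi):
    """log w_k = log p(x, z_k) - log q_MoE(z_k | x)."""
    return log_joint - log_mixture(z, means, variances, pi)

rng = np.random.default_rng(1)
x = np.array([0.8, 1.4])                           # two observations of one latent (assumption)
means, variances = x / 2.0, np.array([0.5, 0.5])   # single-source posteriors of the toy model
pi = np.array([0.5, 0.5])
K = 2000

bound = 0.0
for m in range(2):                                 # K samples from each mixture component
    z = means[m] + np.sqrt(variances[m]) * rng.standard_normal(K)
    log_joint = (log_normal(z, 0.0, 1.0)
                 + log_normal(x[0], z, 1.0) + log_normal(x[1], z, 1.0))
    log_w = moe_log_weights(z, log_joint, means, variances, pi)
    bound += pi[m] * (log_w.max() + np.log(np.mean(np.exp(log_w - log_w.max()))))

# Exact log p(x_1, x_2) of the toy model: x ~ N(0, [[2, 1], [1, 2]]).
log_px = -np.log(2.0 * np.pi) - 0.5 * np.log(3.0) - (x[0] ** 2 - x[0] * x[1] + x[1] ** 2) / 3.0
print(bound, log_px)
```

Because the component-weighted average of the expected importance weights equals the marginal likelihood, Jensen's inequality guarantees that the weighted sum of log-averages stays below the log marginal likelihood in expectation.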
Conjunctive integration—Product of Experts
Another option for combining beliefs is to use conjunctive methods, which treat each belief as a constraint. These are applicable in the case of equally reliable and independent evidence [Khaleghi et al.2013]. This can be seen by inspecting the mathematical form of the posterior distribution of all observations. Applying Bayes’ rule twice reveals that the true posterior of a graphical model with conditionally independent observations can be decomposed as a product of experts (PoE) [Hinton2002]:
$$p(z \mid x) = \frac{p(z) \prod_{m=1}^{M} p(x_m \mid z)}{p(x)} = \frac{\prod_{m=1}^{M} p(x_m)}{p(x)} \cdot \frac{\prod_{m=1}^{M} p(z \mid x_m)}{p(z)^{M-1}} \qquad (3)$$
We propose to approximate Eq. (3) by replacing the true posteriors of single observations $p(z \mid x_m)$ by the variational distributions $q_{\phi_m}(z \mid x_m)$, obtaining the inference model shown in Fig. 0(c). In order to make the PoE distribution computable, we further assume that the variational distributions and the prior are conjugate distributions in the exponential family. Probability distributions in the exponential family have the well-known property that their product is also in the exponential family. Hence, we can calculate the normalisation constant in Eq. (3) from the natural parameters. In this work, we focus on the popular case of normal distributions. For the derivation of the natural parameters and normalisation constant, we refer to the supplementary material.
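For normal distributions, the product of the individual beliefs divided by the prior raised to the power $M - 1$ can be computed by adding and subtracting natural parameters (precision and precision times mean). The following is a minimal sketch under the assumption of diagonal Gaussians and a normal prior. The function name `poe_gaussian` and the test values are hypothetical, and the derivation in the supplementary material may be organised differently.

```python
import numpy as np

def poe_gaussian(means, variances, prior_mean=0.0, prior_var=1.0):
    """Combine M Gaussian beliefs as prod_m q_m(z) / p(z)^(M-1).

    means, variances: arrays whose first axis indexes the M sources.
    Returns the mean and variance of the resulting Gaussian.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    M = means.shape[0]
    # Natural parameters add for products and subtract for the divided prior.
    precision = np.sum(1.0 / variances, axis=0) - (M - 1) / prior_var
    eta = np.sum(means / variances, axis=0) - (M - 1) * prior_mean / prior_var
    if np.any(precision <= 0):
        raise ValueError("product is not normalisable: non-positive precision")
    return eta / precision, 1.0 / precision

# Check on a linear-Gaussian toy model: z ~ N(0, 1), x_m | z ~ N(z, s2_m).
x = np.array([0.8, 1.4, -0.3])
s2 = np.array([1.0, 0.25, 4.0])
single_prec = 1.0 + 1.0 / s2                # precision of p(z | x_m)
single_mean = (x / s2) / single_prec        # mean of p(z | x_m)
poe_mean, poe_var = poe_gaussian(single_mean, 1.0 / single_prec)

joint_prec = 1.0 + np.sum(1.0 / s2)         # precision of p(z | x_1, ..., x_M)
joint_mean = np.sum(x / s2) / joint_prec
print(poe_mean, joint_mean, poe_var, 1.0 / joint_prec)
```

In this toy model the single-source posteriors are exact, so the product recovers the posterior of all observations exactly. With learned variational beliefs the result is only an approximation, and the precision check guards against beliefs that are broader than the prior.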
Analogous to Sec. 3.3, we can design an objective to learn the PoE distribution directly, rather than integrating individual beliefs. We refer to the corresponding ELBO as :
(4) 
where are the standard importance weights as in the IWAE and where is the PoE inference distribution. However, the natural parameters of the individual normal distributions are not uniquely identifiable by the natural parameters of the integrated normal distribution. Thus, optimising leads to inseparable individual beliefs. To account for this, we propose a hybrid between individual and integrated inference distribution:
(5) 
where we choose in practice for simplicity.
In Sec. 5 we evaluate the proposed integration methods both as learning objectives, and for integrating the beliefs obtained by optimising or . Note again however, that or assume conditionally independent observations and equally reliable sources. In contrast, makes no assumptions about the structure of the generative model. This allows for any choice of appropriate integration method after learning.
4 Related Work
Canonical correlation analysis (CCA) [Hotelling1936] is an early attempt to examine the relationship between two sets of variables. CCA and nonlinear variants [Shon et al.2005, Andrew et al.2013, Feng, Li, and Wang2015] propose projections of pairs of features such that the transformed representations are maximally correlated. CCA variants have been widely used for learning from multiple information sources [Hardoon, Szedmak, and Shawetaylor2004, Rasiwasia et al.2010]. These methods have in common with ours, that they learn a common representational space for multimodal data. Furthermore, a connection between linear CCA and probabilistic graphical models has been shown [Bach and Jordan2005].
Dempster-Shafer theory [Dempster1967, Shafer1976] is a widely used framework for integration of uncertain information. Similar to our PoE integration method, Dempster’s rule of combination takes the pointwise product of belief functions and normalises subsequently. Due to apparently counterintuitive results obtained when dealing with conflicting information [Zadeh1986], the research community proposed various measures to detect conflicting belief functions and proposed alternative integration methods. These include disjunctive integration methods [Jiang et al.2016, Denœux2008, Deng2015, Murphy2000], similar to our MoE integration method.
A closely related line of research is that of multimodal autoencoders [Ngiam et al.2011] and multimodal Deep Boltzmann machines (DBM) [Srivastava and Salakhutdinov2014]. Multimodal autoencoders use a shared representation for input and reconstructions of different modalities. Since multimodal autoencoders learn only deterministic functions, the interpretability of the representations is limited. Multimodal DBMs on the other hand learn multimodal generative models with a joint representation between the modalities. However, DBMs have only been shown to work on binary latent variables and are notoriously hard to train.
More recently, variational autoencoders were applied to multimodal learning [Suzuki, Nakayama, and Matsuo2016]. Their objective function maximises the ELBO using an encoder with hardwired sources and additional KL divergence loss terms to train individual encoders. The difference to our methods is that we maximise an ELBO for which we require only individual encoders. We may then integrate the beliefs of arbitrary subsets of information sources after training. In contrast, the method in [Suzuki, Nakayama, and Matsuo2016] would require a separate encoder for each possible combination of sources. Similarly, [Vedantam et al.2017] first train a generative model with multiple observations, using a fully informed encoder. In a second training stage, they freeze the generative model parameters and proceed by optimising the parameters of inference models which are informed by a single source. Since the topology of the latent space is fixed in the second stage, finding good weights for the inference models may be complicated.
Concurrently to this work, [Wu and Goodman2018] proposed a method for weakly-supervised learning from multimodal data, which is very similar to our hybrid method discussed in Sec. 3.3. Their method is based on the VAE, whereas we find it crucial to optimise the importance-sampling-based ELBO to prevent the generative models from generating averaged conditional samples (see Sec. 3.1).
5 Experiments
We visualise learned beliefs on a 2D toy problem, evaluate our methods for structured prediction and demonstrate how our framework can increase robustness of inference. Model and algorithm hyperparameters are summarised in the supplementary material.
5.1 Learning beliefs from complementary information sources
We begin our experiments with a toy dataset with complementary sources. As a generative process, we consider a mixture of bivariate normal distributions with 8 mixture components. The means of each mixture component are located on the unit circle with equidistant angles, and the standard deviations are
. To simulate complementary sources, we allow each source to perceive only one dimension of the data. As with all our experiments, we assume a zerocentred normal prior with unit variance and
. We optimise with two inference models $q_{\phi_1}(z \mid x_1)$, $q_{\phi_2}(z \mid x_2)$ and two separate likelihood functions $p_{\theta_1}(x_1 \mid z)$, $p_{\theta_2}(x_2 \mid z)$. Fig. 2(a) (right) shows the beliefs of both information sources for 8 test data points. These test points are the means of the 8 mixture components of the observable data, rotated by . The small rotation is only for visualisation purposes, since each source is allowed to perceive only one axis and would therefore produce indistinguishable beliefs for data points with identical values on the perceived axis. We visualise the two beliefs corresponding to the same data point with identical colours. The height and width of the ellipses correspond to the standard deviations of the beliefs. Fig. 2(a) (left) shows random samples in the observation space, generated from 10 random latent samples for each belief. The generated samples are colour-coded in correspondence to the figure on the right. The 8 circles in the background visualise the true data distribution with 1 and 2 standard deviations. The two types of markers distinguish the information sources $x_1$ and $x_2$ used for inference. As can be seen, the beliefs reflect the ambiguity that results from perceiving a single dimension.
Footnote 1: The true posterior (of a single source) has two modes for most data points. The unimodal (Gaussian) proposal distribution learns to cover both modes.
Next we integrate the two beliefs using Eq. (3). The resulting integrated belief and generated data from random latent samples of the belief are shown in Figs. 2(b) (right) and 2(b) (left) respectively. We can see that the integration resolves the ambiguity. In the supplementary material, we plot samples from the individual and integrated beliefs, before and after a sampling importance resampling procedure.
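The generative process of the toy dataset described in this subsection can be sketched as follows. The standard deviation of the mixture components is not recoverable from the text, so `std=0.1` below is only a placeholder. Uniform mixture weights, the function name `sample_toy` and the sample size are likewise assumptions of the example.

```python
import numpy as np

def sample_toy(n, std, rng, components=8):
    """Draw n points from a mixture of bivariate normals with means on the unit circle."""
    angles = 2.0 * np.pi * np.arange(components) / components      # equidistant angles
    centres = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # means on the unit circle
    labels = rng.integers(components, size=n)                      # uniform weights (assumed)
    x = centres[labels] + std * rng.standard_normal((n, 2))
    # Complementary sources: source 1 perceives dimension 0, source 2 dimension 1.
    return x[:, :1], x[:, 1:], centres

rng = np.random.default_rng(0)
x1, x2, centres = sample_toy(1000, std=0.1, rng=rng)   # std is a placeholder value
print(x1.shape, x2.shape)
```

Each source on its own cannot tell apart mixture components that share a coordinate on its perceived axis, which is the ambiguity the individual beliefs have to represent.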
5.2 Learning and inference of shared representations for structured prediction
Models trained with or
can be used to predict structured data of any modality, conditioned on any available information source. Equivalently, we may impute missing data if modelled explicitly as an information source:
(6) 
MNIST variants
We created 3 variants of MNIST [Lecun et al.1998], where we simulate multiple information sources as follows:

MNISTTB: $x_1$ perceives the top half and $x_2$ perceives the bottom half of the image.

MNISTQU: 4 information sources that each perceive quarters of the image.

MNISTNO: 4 information sources with independent bitflip noise with . We use these 4 sources to amortise inference. In the generative model, we use the standard, noisefree digits as observable variables.
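The three ways of simulating information sources listed above can be sketched with NumPy slicing. The sketch assumes binarised images, interprets "quarters" as the four quadrants, and uses a placeholder flip probability `p=0.05` because the value is not recoverable from the text. The function names and the random stand-in image are hypothetical.

```python
import numpy as np

def split_top_bottom(img):
    """Two sources: top half and bottom half of the image."""
    h = img.shape[0] // 2
    return img[:h], img[h:]

def split_quarters(img):
    """Four sources: the four quadrants of the image (assumed meaning of 'quarters')."""
    h, w = img.shape[0] // 2, img.shape[1] // 2
    return img[:h, :w], img[:h, w:], img[h:, :w], img[h:, w:]

def bitflip(img, p, rng):
    """Flip each binary pixel independently with probability p."""
    flips = rng.random(img.shape) < p
    return np.where(flips, 1 - img, img)

rng = np.random.default_rng(0)
digit = (rng.random((28, 28)) > 0.5).astype(np.int64)   # stand-in for a binarised digit
top, bottom = split_top_bottom(digit)
quarters = split_quarters(digit)
noisy_sources = [bitflip(digit, p=0.05, rng=rng) for _ in range(4)]  # p is a placeholder
```

In the noisy variant the four corrupted copies would serve as inputs to the encoders, while the clean image remains the target of the generative model, as stated in the list.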
First, we assess how well individual beliefs can be integrated after learning, and whether beliefs can be used individually when learning them as integrated inference distributions. On all MNIST variants, we train 5 different models by optimising the objectives , , , and with , as well as with . All other hyperparameters are identical. We then evaluate each model under the 3 objectives , and . For comparison, we also train a standard IWAE with hardwired sources on MNIST and on MNISTNO with a single noisy source. The ELBOs on the test set are estimated using importance samples. The obtained estimates are summarised in Tab. 1.
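Table 1: ELBO estimates on the test set. Columns correspond to the five trained models, followed by the standard IWAE; rows correspond to the three evaluation objectives.

MNISTTB
       |        |        |        |        | IWAE
102.20 | 102.40 | 265.59 | 104.03 | 108.97 |
101.51 | 101.82 | 264.48 | 103.37 | 108.30 |
 94.38 |  94.39 |  87.59 |  90.07 |  90.81 | 88.79

MNISTQU
       |        |        |        |        | IWAE
120.46 | 120.37 | 447.67 | 129.63 | 140.61 |
119.10 | 119.98 | 446.02 | 128.16 | 139.19 |
108.07 | 107.85 |  87.67 |  89.20 |  90.17 | 88.79

MNISTNO
       |        |        |        |        | IWAE
 94.81 |  94.86 | 101.20 |  96.27 |  95.31 |
 93.98 |  94.03 | 100.36 |  95.58 |  94.55 |
 94.52 |  94.65 |  92.27 |  92.21 |  94.49 | 94.95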
The results confirm that learning the PoE inference model directly leads to inseparable individual beliefs. As expected, learning individual inference models and integrating them subsequently as a PoE comes with a tradeoff for , which is mostly due to the low entropy of the integrated distribution. On the other hand, optimising the model with achieves good results for both individual and integrated beliefs. On MNISTNO, we can get an improvement of nats by integrating the beliefs of redundant sources, compared to the standard IWAE with a single source.
Next, we evaluate our method for conditional (structured) prediction using Eq. (6). Fig. 2(c) shows the means of the likelihood functions, with latent variables drawn from individual and integrated beliefs. To demonstrate conditional image generation from labels, we add a third encoder that perceives class labels. Fig. 2(d) shows the means of the likelihood functions, inferred from labels.
We also compare our method to the missing data imputation procedure described in [Rezende, Mohamed, and Wierstra2014] for MNISTTB and MNISTQU. We run the Markov chain for all samples in the test set for 150 steps each and calculate the log likelihood of the imputed data at every step. The results, averaged over the dataset, are compared to our multimodal data generation method in Fig. 4. For large portions of missing data as in MNISTTB, the Markov chain often fails to converge to the marginal distribution. But even for MNISTQU with only a quarter of the image missing, our method outperforms the Markov chain procedure by a large margin. Please consult the supplementary material for a visualisation of the stepwise generations during the inference procedure.
CaltechUCSD Birds 200
CaltechUCSD Birds 200 [Welinder et al.2010] is a dataset with 6033 images of birds with resolutions, split into 3000 train and 3033 test images. As a second source, we use segmentation masks provided by [Yang, Safar, and Yang2014]. On this dataset we assess whether learning with multiple modalities can be advantageous in scenarios where we are interested only in one particular modality. Therefore, we evaluate the ELBO for a single source and a single target observation, i.e. encoding images and decoding segmentation masks. We compare models that learned with multiple modalities using and with models that learnt from a single modality. Additionally, we evaluate the segmentation accuracy using Eq. (6). The accuracy is estimated with 100 samples, drawn from the belief informed by image data. The results are summarised in Tab. 2.
*  *  IWAE  

imgtoseg  5326  3264  5924  3337  3228 
imgtoimg  26179  26663  29285  29668  30415 
accuracy  0.808  0.870  0.810  0.872  0.855 
We distinguish between objectives that involve both modalities in the generative model and objectives where we learn only the generative model for the modality of interest (segmentation), denoted with an asterisk.
Models that have to learn the generative models for images and segmentations show worse ELBOs and accuracy, when evaluated on one modality.
In contrast, the accuracy is slightly increased when we learn the generative model of segmentations only, but use both sources for inference.
We also refer the reader to the supplementary material, where we visualise conditionally generated images, showing that learning with the importance sampling estimate of the ELBO is crucial to generate diverse samples from partially informed sources.
5.3 Robustness via conflict detection and redundancy
In this experiment we demonstrate how a shared latent representation can increase robustness, by exploiting sensor redundancy and the ability to detect conflicting data. We created a synthetic dataset of perspective images of a pendulum with different views of the same scene. The pendulum rotates along the zaxis and is centred at the origin. We simulate three cameras with pixel resolution as information sources for inference and apply independent noise with std to all sources. Each sensor is directed towards the origin (centre of rotation) from different viewpoints: Sensor 0 is aligned with the axis, and sensor 1 and 2 are rotated by along the  and axis, respectively. The distance of all sensors to the origin is twice the radius of the pendulum rotation. For the generative model we use the  and coordinate of the pendulum rather than reconstructing the images. The model was trained with .
In Fig. 5, we plot the mean and standard deviation of predicted  and coordinates, where latent variables are inferred from a single source as well as from the PoE posteriors of different subsets. As expected, integrating the beliefs from redundant sensors reduces the predictive uncertainty. Additionally, we visualise the three images used as information sources above these plots.
Next, we simulate an anomaly in the form of a defective sensor 0, which outputs random noise after 2 rotations of the pendulum. This has a detrimental effect on the integrated beliefs whenever sensor 0 is part of the integration. We also plot the conflict measure of Eq. (2). As can be seen, the conflict measure for sensor 0 increases significantly when sensor 0 fails. In this case, one should integrate only the two remaining sensors with low conflict conjunctively.
6 Summary and future research directions
We extended neural variational inference to scenarios where multiple information sources are available. We proposed an objective function to learn individual inference models jointly with a shared generative model. We defined an exemplar measure (of conflict) to compare the beliefs from distinct inference models and their respective information sources. Furthermore, we proposed a disjunctive and a conjunctive integration method to combine arbitrary subsets of beliefs.
We compared the proposed objective functions experimentally, highlighting the advantages and drawbacks of each. Naive integration as a PoE () leads to inseparable individual beliefs, while optimising the sources only individually () worsens the integration of the sources. On the other hand, a hybrid of the two objectives () achieves a good tradeoff between both desiderata. Moreover, we showed how our method can be applied to structured output prediction and the benefits of exploiting the comparability of beliefs to increase robustness.
This work offers several future research directions. As an initial step, we considered only static data and a simple latent variable model. However, we have made no assumptions about the type of information source. Interesting research directions are extensions to sequence models, hierarchical models and different forms of information sources such as external memory. Another important research direction is the combination of disjunctive and conjunctive integration methods, taking into account the conflict between sources.
Acknowledgements
We would like to thank Botond Cseke for valuable suggestions and discussions.
References
 [Andrew et al.2013] Andrew, G.; Arora, R.; Bilmes, J.; and Livescu, K. 2013. Deep canonical correlation analysis. In Proceedings of the 30th International Conference on International Conference on Machine Learning  Volume 28, ICML’13, III–1247–III–1255. JMLR.org.
 [Bach and Jordan2005] Bach, F., and Jordan, M. 2005. A probabilistic interpretation of canonical correlation analysis.
 [Baltrušaitis, Ahuja, and Morency2017] Baltrušaitis, T.; Ahuja, C.; and Morency, L.P. 2017. Multimodal machine learning: A survey and taxonomy. arXiv preprint arXiv:1705.09406.
 [Bousquet2008] Bousquet, N. 2008. Diagnostics of priordata agreement in applied Bayesian analysis. Journal of Applied Statistics 35(9):1011–1029.
 [Burda, Grosse, and Salakhutdinov2015] Burda, Y.; Grosse, R. B.; and Salakhutdinov, R. 2015. Importance weighted autoencoders. CoRR abs/1509.00519.
 [Cremer, Li, and Duvenaud2018] Cremer, C.; Li, X.; and Duvenaud, D. K. 2018. Inference suboptimality in variational autoencoders. CoRR abs/1801.03558.
 [Dahl, Gåsemyr, and Navig] Dahl, F. A.; Gåsemyr, J.; and Navig, B. A robust conflict measure of inconsistencies in Bayesian hierarchical models. Scandinavian Journal of Statistics 34(4):816–828.
 [Dempster1967] Dempster, A. P. 1967. Upper and lower probabilities induced by a multivalued mapping. Ann. Math. Statist. 38(2):325–339.
 [Deng2015] Deng, Y. 2015. Generalized evidence theory. Applied Intelligence 43(3):530–543.
 [Denœux2008] Denœux, T. 2008. Conjunctive and disjunctive combination of belief functions induced by nondistinct bodies of evidence. Artificial Intelligence 172(2):234 – 264.

[Feng, Li, and Wang2015] Feng, F.; Li, R.; and Wang, X. 2015. Deep correspondence restricted Boltzmann machine for cross-modal retrieval. Neurocomputing 154:50–60.
 [Gershman and Goodman2014] Gershman, S., and Goodman, N. D. 2014. Amortized inference in probabilistic reasoning. In Proceedings of the 36th Annual Meeting of the Cognitive Science Society, CogSci 2014, Quebec City, Canada, July 23-26, 2014.
 [Hardoon, Szedmak, and Shawetaylor2004] Hardoon, D. R.; Szedmak, S. R.; and Shawetaylor, J. R. 2004. Canonical correlation analysis: An overview with application to learning methods. Neural Comput. 16(12):2639–2664.

[Hinton2002] Hinton, G. E. 2002. Training products of experts by minimizing contrastive divergence. Neural Comput. 14(8):1771–1800.
 [Hotelling1936] Hotelling, H. 1936. Relations between two sets of variates. Biometrika 28(3/4):321–377.
 [Jiang et al.2016] Jiang, W.; Xie, C.; Zhuang, M.; Shou, Y.; and Tang, Y. 2016. Sensor data fusion with znumbers and its application in fault diagnosis. Sensors 16(9).
 [Khaleghi et al.2013] Khaleghi, B.; Khamis, A.; Karray, F.; and Razavi, S. 2013. Multisensor data fusion: A review of the stateoftheart. 14.
 [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980.
 [Kingma and Welling2013] Kingma, D. P., and Welling, M. 2013. Autoencoding variational Bayes. CoRR abs/1312.6114.
 [Lecun et al.1998] Lecun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradientbased learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
 [Mnih and Gregor2014] Mnih, A., and Gregor, K. 2014. Neural variational inference and learning in belief networks. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 2126 June 2014, 1791–1799.
 [Murphy2000] Murphy, C. K. 2000. Combining belief functions when evidence conflicts. Decis. Support Syst. 29(1):1–9.

[Ngiam et al.2011] Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; and Ng, A. Y. 2011. Multimodal deep learning. In Getoor, L., and Scheffer, T., eds., ICML, 689–696. Omnipress.
 [Rasiwasia et al.2010] Rasiwasia, N.; Costa Pereira, J.; Coviello, E.; Doyle, G.; Lanckriet, G. R.; Levy, R.; and Vasconcelos, N. 2010. A new approach to cross-modal multimedia retrieval. In Proceedings of the 18th ACM International Conference on Multimedia, MM ’10, 251–260. New York, NY, USA: ACM.

 [Rezende, Mohamed, and Wierstra2014] Rezende, D. J.; Mohamed, S.; and Wierstra, D. 2014. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31th International Conference on Machine Learning (ICML), 1278–1286.
 [Shafer1976] Shafer, G. 1976. A Mathematical Theory of Evidence. Princeton: Princeton University Press.
 [Shon et al.2005] Shon, A. P.; Grochow, K.; Hertzmann, A.; and Rao, R. P. N. 2005. Learning shared latent structure for image synthesis and robotic imitation. In Proceedings of the 18th International Conference on Neural Information Processing Systems, NIPS’05, 1233–1240. Cambridge, MA, USA: MIT Press.
 [Srivastava and Salakhutdinov2014] Srivastava, N., and Salakhutdinov, R. 2014. Multimodal learning with deep boltzmann machines. Journal of Machine Learning Research 15:2949–2980.
 [Stein and Meredith1993] Stein, B. E., and Meredith, M. A. 1993. The merging of the senses. Cambridge, MA, US: The MIT Press.
 [Suzuki, Nakayama, and Matsuo2016] Suzuki, M.; Nakayama, K.; and Matsuo, Y. 2016. Joint multimodal learning with deep generative models.
 [Vedantam et al.2017] Vedantam, R.; Fischer, I.; Huang, J.; and Murphy, K. 2017. Generative models of visually grounded imagination. CoRR abs/1705.10762.
 [Welinder et al.2010] Welinder, P.; Branson, S.; Mita, T.; Wah, C.; Schroff, F.; Belongie, S.; and Perona, P. 2010. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology.
 [Wu and Goodman2018] Wu, M., and Goodman, N. 2018. Multimodal generative models for scalable weakly-supervised learning. CoRR abs/1802.05335.

 [Yang, Safar, and Yang2014] Yang, J.; Safar, S.; and Yang, M.-H. 2014. Max-margin Boltzmann machines for object segmentation. 2014 IEEE Conference on Computer Vision and Pattern Recognition 320–327.
 [Zadeh1986] Zadeh, L. A. 1986. A simple view of the Dempster-Shafer theory of evidence and its implication for the rule of combination. AI Mag. 7(2):85–90.
7 Appendix
A Individual inferences
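In this section we derive the ELBO for individual inferences, $\mathcal{L}^{(\mathrm{ind})}$. Since any proposal distribution yields an ELBO to the log-marginal likelihood, the (weighted) average is also an ELBO:
$$\log p(x) \geq \sum_{m=1}^{M} \pi_m \, \mathcal{L}_m =: \mathcal{L}^{(\mathrm{ind})},$$
where
$$\mathcal{L}_m = \mathbb{E}_{q(z \mid x_m)}\left[ \log \frac{p(x, z)}{q(z \mid x_m)} \right]$$
is the ELBO obtained with the approximate posterior of source $m$, and $x = \{x_1, \dots, x_M\}$ collects the observations from all $M$ sources.
The factors $\pi_m$ are the weights for each ELBO term, satisfying $0 \leq \pi_m \leq 1$ and $\sum_{m=1}^{M} \pi_m = 1$.
When $\pi_m = 1/M$, the gap between $\mathcal{L}^{(\mathrm{ind})}$ and the marginal log-likelihood is the average Kullback-Leibler (KL) divergence between individual approximate posteriors and the true posterior from all sources:
$$\log p(x) - \mathcal{L}^{(\mathrm{ind})} = \frac{1}{M} \sum_{m=1}^{M} \mathrm{KL}\left( q(z \mid x_m) \,\|\, p(z \mid x) \right).$$
This gap can be further decomposed as:
$$\log p(x) - \mathcal{L}^{(\mathrm{ind})} = \frac{1}{M} \sum_{m=1}^{M} \Big[ \mathrm{KL}\left( q(z \mid x_m) \,\|\, p(z \mid x_m) \right) - \mathbb{E}_{q(z \mid x_m)}\left[ \log p(x_{\setminus m} \mid z) \right] + \log p(x_{\setminus m} \mid x_m) \Big],$$
where $x_{\setminus m}$ denotes the observations from all sources except $m$, and the sources are conditionally independent given $z$.
To minimise this gap, not only the KL divergence between each individual approximate posterior and the respective true posterior needs to be minimised, but also two additional terms which depend on the likelihood of those observations that have not been used as an information source for inference.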
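The decomposition described above can be checked numerically on a toy model. The following is a minimal sketch, assuming a discrete latent variable with a handful of states and sources that are conditionally independent given the latent state; the function names and the toy setup are illustrative and not part of the original experiments.

```python
import numpy as np


def kl(q, p):
    """KL divergence between two discrete distributions with full support."""
    return float(np.sum(q * (np.log(q) - np.log(p))))


def gap_decomposition(prior, lik_used, lik_unused, q):
    """Compute KL(q(z|x_m) || p(z|x)) directly and via its decomposition.

    prior      : p(z) over K discrete latent states
    lik_used   : p(x_m | z), likelihood of the source used for inference
    lik_unused : p(x_rest | z), likelihood of the sources not used
    q          : approximate posterior q(z | x_m)
    """
    prior, lik_used, lik_unused, q = (
        np.asarray(a, dtype=float) for a in (prior, lik_used, lik_unused, q)
    )
    post_used = prior * lik_used
    post_used = post_used / post_used.sum()  # p(z | x_m)
    post_all = prior * lik_used * lik_unused
    post_all = post_all / post_all.sum()  # p(z | x)

    direct = kl(q, post_all)
    # E_q[log p(x_rest | z)]: expected likelihood of the unused observations
    expected_loglik = float(np.sum(q * np.log(lik_unused)))
    # log p(x_rest | x_m): predictive likelihood of the unused observations
    log_predictive = float(np.log(np.sum(post_used * lik_unused)))
    decomposed = kl(q, post_used) - expected_loglik + log_predictive
    return direct, decomposed
```

Both returned values agree up to floating-point error for any strictly positive inputs, which illustrates why the two likelihood-dependent terms appear in the gap.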
B Mixture of experts inference
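The ELBO for the mixture distribution
$$q(z \mid x) = \sum_{m=1}^{M} \pi_m \, q(z \mid x_m)$$
can be derived similarly. We employ a Monte Carlo approximation only w.r.t. each mixture component but not w.r.t. the mixture weights. That is, we enumerate all possible mixture components rather than sampling each from an indicator variable. This reduces the variance of the estimate and circumvents the problem of propagating gradients through the sampling process of discrete random variables.
Maximising the resulting ELBO, $\mathcal{L}^{(\mathrm{MoE})}$, minimises the KL divergence between the mixture of approximate posteriors and the true posterior from all sources, which is a weighted average over the mixture components:
$$\log p(x) - \mathcal{L}^{(\mathrm{MoE})} = \sum_{m=1}^{M} \pi_m \, \mathbb{E}_{q(z \mid x_m)}\left[ \log \frac{q(z \mid x)}{p(z \mid x)} \right] = \mathrm{KL}\left( q(z \mid x) \,\|\, p(z \mid x) \right).$$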
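The enumeration over mixture components described above can be sketched as follows. This is a minimal illustration, assuming one-dimensional Gaussian components and a user-supplied `log_joint` function returning $\log p(x, z)$; the names `mixture_elbo` and `log_normal` are hypothetical helpers, not the authors' implementation.

```python
import numpy as np


def log_normal(z, mean, std):
    """Log density of a univariate normal distribution."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(std) - 0.5 * ((z - mean) / std) ** 2


def mixture_elbo(log_joint, means, stds, weights, n_samples, rng):
    """Estimate the mixture ELBO by enumerating all components.

    Samples are drawn from every component (Monte Carlo w.r.t. the
    components), while the mixture weights enter in closed form.
    """
    means, stds, weights = (
        np.asarray(a, dtype=float) for a in (means, stds, weights)
    )
    elbo = 0.0
    for m in range(len(weights)):  # enumerate, do not sample the indicator
        # Reparameterised samples from component m
        z = means[m] + stds[m] * rng.standard_normal(n_samples)
        # Log density of the full mixture evaluated at these samples
        log_components = log_normal(z[:, None], means[None, :], stds[None, :])
        log_q = np.logaddexp.reduce(log_components + np.log(weights)[None, :], axis=1)
        elbo += weights[m] * np.mean(log_joint(z) - log_q)
    return elbo
```

Because the discrete indicator is never sampled, the estimate depends on the weights only through deterministic arithmetic, so gradients with respect to the weights and component parameters would not require a discrete reparameterisation.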
C Product of Gaussian experts
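Here we consider the popular case of individual Gaussian approximate posteriors and a zero-centred Gaussian prior. Let the normal distributions be represented in the canonical form with canonical parameters $\{\eta, \Lambda\}$:
$$p(z) = \frac{1}{Z(\eta, \Lambda)} \exp\left( \eta^\top z - \frac{1}{2} z^\top \Lambda z \right).$$
$\Lambda$ denotes the precision matrix and $\eta = \Lambda \mu$, where $\mu$ is the mean. Furthermore, $Z(\eta, \Lambda)$ is the partition function.
Let the subscripts $m$, $0$ and $*$ indicate the $m$-th approximate distribution, the prior and the integrated distribution. The natural parameters of the integrated variational posterior from Eq. (3) can then be calculated as follows:
$$\Lambda_* = \sum_{m=1}^{M} \Lambda_m - (M-1) \Lambda_0, \qquad \eta_* = \sum_{m=1}^{M} \eta_m - (M-1) \eta_0 = \sum_{m=1}^{M} \eta_m,$$
where the last equality uses $\eta_0 = 0$ for the zero-centred prior.
To obtain a valid integrated variational posterior, we require the precision matrix $\Lambda_*$ to be positive definite. This imposes requirements on the precision matrices $\Lambda_m$. In the case of diagonal precision matrices, the necessary and sufficient condition is that $\Lambda_*$ has all positive diagonal entries. A sufficient condition for each entry is $\Lambda_m \geq \Lambda_0$, since then $\Lambda_* \geq \Lambda_0 > 0$.
The partition function of the integrated belief can be calculated from the natural parameters:
$$\log Z_* = \frac{1}{2} \left( \eta_*^\top \Lambda_*^{-1} \eta_* - \log \lvert \Lambda_* \rvert + D \log 2\pi \right), \tag{7}$$
where $D$ is the dimensionality of $z$.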
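The integration of Gaussian experts in natural parameters can be sketched in a few lines. The example below is a minimal illustration, assuming diagonal covariances and a zero-mean prior; the function names and the array layout (experts along the first axis) are assumptions made for this sketch.

```python
import numpy as np


def product_of_gaussian_experts(means, variances, prior_variance=1.0):
    """Combine M diagonal Gaussian experts with a zero-mean Gaussian prior.

    means, variances : arrays of shape (M, D), one row per expert
    prior_variance   : scalar or array of shape (D,)
    Returns the mean and variance of the integrated belief.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    num_experts = means.shape[0]

    precisions = 1.0 / variances  # Lambda_m
    etas = precisions * means  # eta_m = Lambda_m * mu_m
    prior_precision = 1.0 / np.asarray(prior_variance, dtype=float)  # Lambda_0

    # The prior is divided out (M - 1) times; eta_0 = 0 for a zero-mean prior
    precision_star = precisions.sum(axis=0) - (num_experts - 1) * prior_precision
    if np.any(precision_star <= 0.0):
        raise ValueError("integrated precision must be positive in every dimension")
    eta_star = etas.sum(axis=0)
    return eta_star / precision_star, 1.0 / precision_star


def log_partition(eta, precision):
    """Log partition function of a diagonal Gaussian in canonical form."""
    eta = np.asarray(eta, dtype=float)
    precision = np.asarray(precision, dtype=float)
    return 0.5 * float(np.sum(eta**2 / precision - np.log(precision) + np.log(2.0 * np.pi)))
```

The explicit check on the integrated precision mirrors the validity requirement discussed above: experts that are less certain than the prior can make the integrated precision non-positive.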
D Pointwise mutual information
Inspecting Eq. (3), we can see that the negative logarithm of the constant term corresponds to the pointwise mutual information (PMI) between the observations. We do not need to calculate this constant since we impose assumptions about the parametric forms of the distributions and can calculate the partition function of the integrated belief using Eq. (7).
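However, we can also calculate this partition function from the product of individual partition functions and the above-mentioned constant in Eq. (3):
$$Z_* = \frac{p(x)}{\prod_{m=1}^{M} p(x_m)} \cdot \frac{\prod_{m=1}^{M} Z_m}{Z_0^{M-1}},$$
where $Z_*$, $Z_m$ and $Z_0$ are the partition functions of the integrated belief, the $m$-th approximate posterior and the prior, respectively.
The PMI can then be calculated as:
$$\mathrm{PMI}(x_1, \dots, x_M) = \log \frac{p(x)}{\prod_{m=1}^{M} p(x_m)} = \log Z_* + (M-1) \log Z_0 - \sum_{m=1}^{M} \log Z_m.$$
The pointwise mutual information can be calculated between any subset of information sources. We note, however, that it is based on the assumption that all involved probability density functions (the prior and all approximate posterior distributions) are normal distributions.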
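A PMI estimate under the Gaussian assumption can be sketched as below. This is a minimal illustration, assuming diagonal Gaussians in canonical form and a zero-mean prior, and using the relation that the PMI equals the log partition function of the integrated belief plus $(M-1)$ times that of the prior minus the sum over the individual posteriors; the function names are hypothetical.

```python
import numpy as np


def log_partition(eta, precision):
    """Log partition function of a diagonal Gaussian in canonical form."""
    return 0.5 * float(np.sum(eta**2 / precision - np.log(precision) + np.log(2.0 * np.pi)))


def gaussian_pmi(etas, precisions, prior_precision):
    """PMI between M sources from Gaussian posteriors and a zero-mean prior.

    etas, precisions : arrays of shape (M, D), natural parameters per source
    prior_precision  : array of shape (D,)
    """
    etas = np.asarray(etas, dtype=float)
    precisions = np.asarray(precisions, dtype=float)
    prior_precision = np.asarray(prior_precision, dtype=float)
    num_sources = etas.shape[0]

    # Natural parameters of the integrated belief (eta_0 = 0)
    precision_star = precisions.sum(axis=0) - (num_sources - 1) * prior_precision
    eta_star = etas.sum(axis=0)

    log_z_star = log_partition(eta_star, precision_star)
    log_z_prior = log_partition(np.zeros_like(prior_precision), prior_precision)
    log_z_sources = sum(log_partition(e, p) for e, p in zip(etas, precisions))
    return log_z_star + (num_sources - 1) * log_z_prior - log_z_sources
```

For a linear-Gaussian toy model, where the individual posteriors are exactly Gaussian, this quantity coincides with $\log p(x_1, x_2) - \log p(x_1) - \log p(x_2)$ computed from the marginal densities.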
E Visualisation of samples from individual and integrated beliefs on mixture of bivariate Gaussians dataset
For the mixture of bivariate Gaussians dataset, we show latent samples from both information sources in Fig. 6(a) (left) and samples obtained by sampling importance resampling (SIR) using the full likelihood model in 6(a) (right). We also show random samples from the integrated beliefs as well as samples obtained by SIR in Fig. 6(b) (left) and 6(b) (right), respectively. We conclude that the integrated beliefs are much better proposal distributions, resolving the ambiguity of the individual sources.
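For readers unfamiliar with SIR, the resampling step can be sketched as follows. This is a generic illustration, assuming log densities that may be unnormalised and a one-dimensional latent variable; it does not reproduce the exact setup used for the figures.

```python
import numpy as np


def sampling_importance_resampling(proposal_samples, log_target, log_proposal, n_resample, rng):
    """Resample proposal samples in proportion to their importance weights.

    log_target and log_proposal map an array of samples to (possibly
    unnormalised) log densities; the weights are self-normalised.
    """
    proposal_samples = np.asarray(proposal_samples, dtype=float)
    log_weights = log_target(proposal_samples) - log_proposal(proposal_samples)
    log_weights = log_weights - np.max(log_weights)  # numerical stability
    weights = np.exp(log_weights)
    weights = weights / weights.sum()
    indices = rng.choice(len(proposal_samples), size=n_resample, replace=True, p=weights)
    return proposal_samples[indices]
```

The closer the proposal is to the target, the more evenly the weights are spread and the fewer duplicates appear among the resampled points, which is the sense in which a better proposal yields better SIR samples.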
F Visualisation of missing data imputation
Fig. 7 shows the mean of generated images for 50 steps of the Markov chain procedure for missing data imputation. As can be seen in Fig. 6(c), the chain does not converge for many digits within 50 steps if too large portions of the data are missing. Indeed, we observed that the procedure randomly fails or succeeds to converge for the same input even after 150 steps.
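A Markov chain of this kind can be sketched as below. This is a generic illustration in the spirit of the imputation procedure of [Rezende, Mohamed, and Wierstra2014], assuming two hypothetical callables, `sample_latent` (draws a latent from the approximate posterior given the current data) and `decode_mean` (returns the mean of the generative model); the exact procedure used for Fig. 7 may differ.

```python
import numpy as np


def impute_missing(x_init, observed_mask, sample_latent, decode_mean, n_steps=50):
    """Iteratively impute missing entries while keeping observed ones fixed.

    x_init        : data vector with arbitrary values at missing positions
    observed_mask : boolean array, True where the entry is observed
    """
    observed_mask = np.asarray(observed_mask, dtype=bool)
    x_observed = np.array(x_init, dtype=float)
    x = x_observed.copy()
    trajectory = []
    for _ in range(n_steps):
        z = sample_latent(x)  # infer a latent from the current imputation
        x_generated = decode_mean(z)  # mean of the generated data
        # Overwrite only the missing entries with the generated values
        x = np.where(observed_mask, x_observed, x_generated)
        trajectory.append(x.copy())
    return x, trajectory
```

Observed entries are restored after every step, so only the missing part of the data evolves along the chain.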
G Conditional generations on Caltech-UCSD Birds 200
We show conditional generations of images, inferred from images or segmentation masks in Fig. 8. When inferring from segmentation masks, the conditional distribution should be highly multimodal due to the missing colour information. This uncertainty should ideally be covered in the uncertainty of the belief. As can be seen in Fig. 7(c), learning with a single importance sample leads to predictions of average images. For completeness, generated segmentation masks are shown in Fig. 9.
H Experiment setups
All inference (generative) models use the same neural network architectures for the different sources, except the first (last) layer, which depends on the dimensions of the data. We refer to the main parts of the architectures, identical for each source, as “stem”. In the case of inference models, the stem is the input to a dense layer with linear (no activation) and sigmoid activations, parameterising the mean and standard deviation of the approximate posterior distribution, respectively. For the generative models, refer to the respective subsections.
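The output layer of an inference model described above can be sketched as follows. This is a minimal NumPy illustration, assuming the stem output is a feature matrix and that mean and standard deviation use separate weight matrices; the parameter names are hypothetical.

```python
import numpy as np


def gaussian_head(stem_features, w_mean, b_mean, w_std, b_std):
    """Map stem features to the parameters of a diagonal Gaussian.

    stem_features : array of shape (batch, hidden)
    w_mean, w_std : arrays of shape (hidden, latent)
    b_mean, b_std : arrays of shape (latent,)
    """
    mean = stem_features @ w_mean + b_mean  # linear, no activation
    pre_activation = stem_features @ w_std + b_std
    stddev = 1.0 / (1.0 + np.exp(-pre_activation))  # sigmoid, values in (0, 1)
    return mean, stddev
```

A sigmoid activation bounds the standard deviation below one, so each approximate posterior is at least as precise as a unit-variance prior in every latent dimension.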
We use the Adam optimiser [Kingma and Ba2014] with and
in all experiments. In the tables, “dense” denotes fully connected layers, “conv” refers to convolutional layers, “pool” refers to pooling (downsampling), and “interpol” refers to a bilinear interpolation (upsampling).
is the number of importance-weighted samples and refers to the number of latent dimensions, each modelled with a diagonal normal distribution with zero mean and unit standard deviation.
Partially observable mixture of bivariate Gaussians
In this experiment, we use 2 sources, corresponding to the x and y coordinates of the sample from a mixture of bivariate Gaussians distribution. The neural network stems and training hyperparameters are summarised in Tab. 3. The generative models are both 1D Normal distributions, parameterised by linear dense layers, taking inputs from their respective stems.
MNIST variants
The neural network stems are summarised in Tab. 4. The data is modelled as Bernoulli distributions of dimensions 784 for MNIST-NO, 392 for MNIST-TB and 196 for MNIST-QU. The Bernoulli parameters are parameterised by linear dense layers, taking inputs from their respective stems.
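The source dimensions quoted above can be illustrated with a short sketch. It assumes, based only on the dimensions 392 and 196, that the TB variant splits a 28×28 image into top and bottom halves and the QU variant into four quadrants; the function names and the ordering of the sources are hypothetical.

```python
import numpy as np


def split_top_bottom(image):
    """Split a 28x28 image into two sources of 14 * 28 = 392 pixels each."""
    image = np.asarray(image).reshape(28, 28)
    return [image[:14].reshape(-1), image[14:].reshape(-1)]


def split_quadrants(image):
    """Split a 28x28 image into four sources of 14 * 14 = 196 pixels each."""
    image = np.asarray(image).reshape(28, 28)
    return [
        image[row:row + 14, col:col + 14].reshape(-1)
        for row in (0, 14)
        for col in (0, 14)
    ]
```

Under this assumption every pixel belongs to exactly one source, so the sources are complementary rather than redundant.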
Pendulum
In the pendulum experiment, we use 3 sources with images for the inference model, but a single observation of the x and y coordinates of the pendulum centre. The generative model is assumed Normal for both coordinates, where the mean is predicted by a linear dense layer taking inputs from the stem, and the standard deviation is a global variable. The neural network stems and training hyperparameters are summarised in Tab. 5.
Caltech-UCSD Birds 200
The neural network stems are summarised in Tab. 6. Images are modelled as diagonal normal distributions and segmentation masks as Bernoulli distributions. The generative model stem is the input to a
transposed convolutional layers with stride 2, yielding the mean of the likelihood function. The standard deviations are global and shared for all pixels. Leaky rectified linear units (lrelu) use
.