Masking schemes for universal marginalisers

by Divya Gautam, et al.
Babylon Health

We consider the effect of structure-agnostic and structure-dependent masking schemes when training a universal marginaliser (arXiv:1711.00695) in order to learn conditional distributions of the form P(x_i |x_b), where x_i is a given random variable and x_b is some arbitrary subset of all random variables of the generative model of interest. In other words, we mimic the self-supervised training of a denoising autoencoder, where a dataset of unlabelled data is used as partially observed input and the neural approximator is optimised to minimise reconstruction loss. We focus on studying the underlying process of the partially observed data—how good is the neural approximator at learning all conditional distributions when the observation process at prediction time differs from the masking process during training? We compare networks trained with different masking schemes in terms of their predictive performance and generalisation properties.




1 Motivation

In automated medical diagnosis a Bayesian network (BN) can be used as a statistical model for risk factors, diseases and symptoms in order to provide medical recommendations. For a BN to be appropriate for medical diagnosis, it typically needs to capture the relationships between thousands of diseases, symptoms and risk factors, making exact inference intractable. In this work, we focus on unlabeled binary data generated from a three-layer risk-disease-symptom BN as the generative model of interest.

During a typical diagnosis, only a subset of the evidence will be available, and hence we need to be able to efficiently (implicitly) marginalise over unobserved symptoms and risk factors. To obtain estimators of the posterior marginal distributions of diseases, we follow the amortised inference paradigm (GerGoo14) and train a denoising autoencoder (Vinetal10; Benetal13) using a variety of masking schemes for the input. This results in neural approximators for arbitrary conditional probabilities (Douetal17; Beletal19). In this scenario, the quantities of interest are the posterior probabilities of diseases given sets of evidence containing observations of risk factors and/or symptoms for each patient. These estimates can then be used to assess diagnostic accuracy, namely, how high the algorithm ranks the reference diagnoses on a differential diagnosis (Shwetal91dia). Since the goal is to provide medically accurate recommendations, it is crucial to obtain accurate marginal estimates as well as to have computationally feasible inference schemes given the exponential number of queries required.

The main contribution of this work is a study of the effect on learning conditional distributions when the pattern of observed values in the data is misspecified. We consider multiple masking distributions, some of which are independent of the underlying structure of the BN, and others which are more tailored to it.

2 Background and related work

Let $x = (x_1, \dots, x_N)$, $x_i \in \{0, 1\}$, denote the collection of random variables in a BN of $N$ nodes with joint distribution denoted by $P(x)$. $Q(b)$ is the masking distribution for the masking random variable $b \in \{0, 1\}^N$. The masks are applied to the nodes through the element-wise product: $x \odot b$. If $b_i = 1$, we say that the value $x_i$ is observed, otherwise it is unobserved or masked. We will also use the notation $x_b$ to denote the subset of unmasked nodes.

We use the universal marginaliser (UM) (Douetal17), a feedforward neural network that takes masked samples $x \odot b$ from a given BN as input, and learns to predict the marginals of all the nodes in the BN conditioned on the assignment of the observed nodes in the input. It is trained using the multi-label binary cross entropy loss:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{x \sim P(x),\, b \sim Q(b)} \sum_{i=1}^{N} \Bigl[ x_i \log f_\theta(x \odot b)_i + (1 - x_i) \log\bigl(1 - f_\theta(x \odot b)_i\bigr) \Bigr]$$

in the same manner as a denoising autoencoder (Vinetal10). As a consequence, the UM, denoted by $f_\theta$, learns to output $P(x_i \mid x_b)$, for all $i$. Joint posterior distributions can be estimated using the product rule and repeated evaluations of the UM. Section B in the supplementary material contains further details about our implementation.
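As an illustration of the training objective, the following is a minimal sketch of a multi-label binary cross entropy in plain Python. The function name `multilabel_bce`, the list-of-rows data layout and the clamping constant `eps` are assumptions made for this example and are not taken from the authors' implementation.

```python
import math


def multilabel_bce(targets, predictions, eps=1e-12):
    """Mean multi-label binary cross entropy over a batch.

    targets:     rows of 0/1 node states (the unmasked sample)
    predictions: rows of predicted probabilities that each node is 1,
                 produced by the network from the masked input
    eps:         hypothetical clamping constant to avoid log(0)
    """
    total = 0.0
    for x_row, p_row in zip(targets, predictions):
        for x_i, p_i in zip(x_row, p_row):
            p_i = min(max(p_i, eps), 1.0 - eps)
            total -= x_i * math.log(p_i) + (1 - x_i) * math.log(1.0 - p_i)
    return total / len(targets)


# An uninformative prediction of 0.5 for two nodes costs 2 * log(2).
loss = multilabel_bce([[1, 0]], [[0.5, 0.5]])
```

The loss is summed over all nodes, including the observed ones, which is what lets a single network be queried for the marginal of any node.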

There exist other types of universal marginalisers that use a variety of generative probabilistic models based on other neural networks to estimate arbitrary conditional distributions. GerGreMurLar15 modify an autoencoder neural network to estimate conditional probabilities with autoregressive constraints. The authors mask the weighted connections of a standard autoencoder to convert it into a distribution estimator. Ivaetal19 propose a variational autoencoder (KinDieWel14) with arbitrary conditioning. The authors use stochastic variational Bayes during training together with the reparameterisation trick as well as a masking distribution. Beletal19 propose a generative adversarial network (Gooetal14) to learn every single conditional distribution: they use gradient-based regularisation as well as a masking distribution during training. It is possible to apply the masking schemes proposed in Section 3 to these other types of universal marginalisers. Masking the input layer is a special case of dropout (srivastava2014dropout; baldi2014dropout), which is used to regularise over-parameterised neural networks by reducing the importance of individual neurons in the network during training.

3 Masking distributions

We consider a variety of masking distributions $Q(b)$ over masks $b \in \{0, 1\}^N$ that, together with the generative model $P(x)$, produce the data that we use to train the UM as $x \odot b$, where $\odot$ denotes element-wise product.

Uniform power setwise (uniform): Masks are sampled uniformly from the power set of possible masks $\{0, 1\}^N$.

Uniform sizewise (sizewise): The distribution of the number of ones in $b$ sampled using the uniform method above is $\mathrm{Binomial}(N, 1/2)$, and hence the samples contain on average $N/2$ ones; see Figure 2 in the supplementary material for an illustration of this. If more samples from the tails of the Binomial are desired, a distribution that selects uniformly in terms of the number of ones therein can be achieved by first sampling a mask size $k \sim \mathcal{U}\{0, \dots, N\}$, and then sampling a permutation of the array with $k$ ones and $N - k$ zeros.
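The difference between the two schemes above can be sketched in a few lines of Python. The function names are hypothetical, a mask entry of 1 is taken to mean "observed", and the mask size is assumed to range over 0 to n inclusive.

```python
import random


def sample_uniform_mask(n, rng):
    """Uniform over all 2**n masks: one independent fair coin per node."""
    return [rng.randint(0, 1) for _ in range(n)]


def sample_sizewise_mask(n, rng):
    """Uniform over mask sizes: draw the number of ones, then shuffle."""
    k = rng.randint(0, n)  # assumed range: 0..n inclusive
    mask = [1] * k + [0] * (n - k)
    rng.shuffle(mask)
    return mask


rng = random.Random(0)
uniform_sizes = [sum(sample_uniform_mask(24, rng)) for _ in range(5000)]
sizewise_sizes = [sum(sample_sizewise_mask(24, rng)) for _ in range(5000)]
# uniform_sizes concentrates around 12, while sizewise_sizes spreads
# evenly over 0..24, so very small and very large evidence sets appear.
```

Under the uniform scheme an empty or full mask has probability 2**-24 for a 24-node network, which is why such masks are effectively never seen, whereas the sizewise scheme draws each size with probability 1/25.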

Independent nodewise (nodewise): Nodes are masked individually and independently rather than sampling an entire mask. Specifically, for each batch of samples, we sample $p \sim \mathcal{U}[0, p_{\max}]$ and set $Q(b_i = 1) = p$ for that batch, where $p_{\max}$ is a hyperparameter of the model. As with sizewise masking above, this procedure controls the number of evidence sets of different sizes seen during training: each sampled value of $p$ effectively specifies a maximum evidence set size for the masks in that batch. In this work we use $p_{\max} = 1$ in order to sample from all possible evidence sets during training, but with a different distribution to the two methods described above.

Deterministic cycle (deterministic): This is a deterministic version of the nodewise method above: instead of sampling the probability $p$ of observing the nodes individually from a uniform distribution each batch, a list of values in $[0, 1]$ is preselected, and we cycle through these values, proceeding to the next one each batch. The rationale is to avoid clusters of values that can arise from sampling from the uniform distribution. Note that this method together with the above three are all structure-agnostic masking methods.
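The nodewise and deterministic cycle schemes differ only in how the per-batch observation probability is chosen, which the sketch below tries to make explicit. All names (`nodewise_batch_masks`, `deterministic_cycle_masks`, `p_values`) are hypothetical, and a mask entry of 1 is taken to mean "observed".

```python
import random


def bernoulli_masks(n, batch_size, p, rng):
    """Observe each node independently with probability p."""
    return [[1 if rng.random() < p else 0 for _ in range(n)]
            for _ in range(batch_size)]


def nodewise_batch_masks(n, batch_size, rng, p_max=1.0):
    """Nodewise: one observation probability is sampled per batch."""
    p = rng.uniform(0.0, p_max)
    return bernoulli_masks(n, batch_size, p, rng)


def deterministic_cycle_masks(n, batch_size, num_batches, p_values, rng):
    """Deterministic cycle: step through a preselected list of probabilities."""
    for t in range(num_batches):
        p = p_values[t % len(p_values)]
        yield bernoulli_masks(n, batch_size, p, rng)


rng = random.Random(0)
grid = [i / 10 for i in range(11)]  # hypothetical evenly spaced values
batches = list(deterministic_cycle_masks(24, 4, 22, grid, rng))
```

Only the choice of the probability is deterministic in the second function; the masks themselves are still sampled, so an evenly spaced list gives even coverage of evidence set sizes without fixing the masks.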

Markov blanket (markov): For each mask, all nodes are initially masked, then one disease node is chosen at random, and the nodes in its Markov blanket are unmasked using the independent nodewise scheme above. The Markov blanket of a node $x_i$ is given by $\mathrm{MB}(x_i) = \mathrm{pa}(x_i) \cup \mathrm{ch}(x_i) \cup \mathrm{pa}(\mathrm{ch}(x_i))$, where $\mathrm{pa}(x_i)$ and $\mathrm{ch}(x_i)$ correspond to the parents and children of $x_i$ respectively. This is an example of a structure-dependent masking method.
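A minimal sketch of Markov blanket masking on a toy graph follows. It assumes the standard definition of the Markov blanket (parents, children and the children's other parents), represents the BN as a hypothetical `parents` dictionary, and samples the observation probability once per mask; none of these details are confirmed by the text above.

```python
import random


def markov_blanket(node, parents):
    """Parents, children and co-parents of `node` (the node itself excluded)."""
    children = [c for c, ps in parents.items() if node in ps]
    blanket = set(parents.get(node, ())) | set(children)
    for child in children:
        blanket |= set(parents[child])
    blanket.discard(node)
    return blanket


def sample_markov_mask(nodes, parents, disease_nodes, rng, p_max=1.0):
    """Mask everything, then unmask blanket nodes with probability p."""
    mask = {v: 0 for v in nodes}
    disease = rng.choice(disease_nodes)
    p = rng.uniform(0.0, p_max)
    for v in sorted(markov_blanket(disease, parents)):
        if rng.random() < p:
            mask[v] = 1
    return disease, mask


# Toy three-layer graph: risk factors r*, diseases d*, symptoms s*.
toy_parents = {"d1": ["r1"], "d2": ["r1", "r2"],
               "s1": ["d1", "d2"], "s2": ["d2"]}
toy_nodes = ["r1", "r2", "d1", "d2", "s1", "s2"]
disease, mask = sample_markov_mask(toy_nodes, toy_parents,
                                   ["d1", "d2"], random.Random(0))
```

Because nodes outside the chosen blanket are never unmasked, many masks have probability zero under this scheme, which is relevant when the evidence at prediction time is not confined to a single blanket.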

4 Experiments

In this section, we compare the performance of the UM trained with loss function (2) using the different masking distributions described in Section 3. The target (synthetic) Noisy-OR BN consists of 24 nodes arranged in three layers of eight nodes each, representing risk factors, diseases and symptoms respectively. See Appendix B for details about the architecture and training of the UM, and Appendix C for the structure of the target BN.

We tested the UM’s ability to reproduce conditioned marginals for evidence sets which were randomly chosen under two different observation models. In the first case, we assume that evidence is observed under the uniform power setwise masking scheme. In the second case, the test set construction uses the Markov masking assumptions—evidence is picked only from a randomly chosen disease node’s Markov blanket. The assignments of the unmasked nodes in both cases were chosen arbitrarily. We constructed these test sets in order to capture the behaviour of each masking scheme when used for training a UM under different prediction-time observation patterns, mimicking the clinical diagnosis scenario.

Figure 1 (panels: uniform evidence, Markov evidence): Absolute difference between the UM's prediction and the ground truth as a function of evidence set size, where the states of the evidence are picked arbitrarily and the observation pattern is drawn from either the uniform power setwise masking distribution or the Markov blanket one. The queried disease node is fixed across bars and plots.

Our comparison looks at the absolute difference between the UM's prediction and the ground truth for a given disease node as a function of evidence set size. The ground truth can be obtained in small Bayesian networks using exact inference methods. In Figure 1, we see that in the case of uniform evidence all masking methods are competitive for small evidence sets, with deterministic masking being better on average in that regime. Markov masking gets progressively worse as the evidence set size increases, because Markov masking never chooses some of the masking patterns during training. When we mask the evidence with respect to the Markov distribution (Figure 1), Markov masking becomes competitive again across different evidence set sizes.
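For a network small enough to enumerate, the ground truth used in such a comparison can be computed by brute force. The sketch below is one way to do this under stated assumptions: `joint` is a hypothetical callable returning the joint probability of a full assignment, the two-node example is invented for illustration, and `um_prediction` is a made-up stand-in for a network output.

```python
import itertools


def exact_conditional(joint, n, query, evidence):
    """P(x_query = 1 | evidence) by summing the joint over all 2**n states.

    evidence maps node index -> observed value (0 or 1).
    """
    numerator = 0.0
    denominator = 0.0
    for x in itertools.product((0, 1), repeat=n):
        if any(x[i] != v for i, v in evidence.items()):
            continue  # state inconsistent with the evidence
        p = joint(x)
        denominator += p
        if x[query] == 1:
            numerator += p
    return numerator / denominator


def toy_joint(x):
    """Two-node chain a -> b with invented parameters."""
    a, b = x
    p_a = 0.2 if a == 1 else 0.8
    p_b1 = 0.9 if a == 1 else 0.1
    return p_a * (p_b1 if b == 1 else 1.0 - p_b1)


truth = exact_conditional(toy_joint, n=2, query=0, evidence={1: 1})
um_prediction = 0.70  # hypothetical network output
abs_error = abs(um_prediction - truth)
```

Enumeration costs 2**n evaluations of the joint, so it is only practical for small networks such as the 24-node one considered here.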

The UM in Figure 1 was trained with a large number of samples for each masking scheme, which makes the effect of the structure-agnostic masking distributions less apparent in the context of our small BN; see Appendix D for examples of the UM trained for fewer epochs where differences are more pronounced.

We conjecture that using different masking schemes for training and corruption has an effect due to the inherent structure in the dataset, for example when using data from a Bayesian network or image data. This effect could potentially be less pronounced if the dataset does not contain any structure. In Appendix D, we report an additional experiment that tests this conjecture empirically. We benchmarked some of our proposed masking schemes with a different universal marginaliser that uses the variational autoencoder architecture of Ivaetal19. The yeast dataset (Dua:2019) was used, which is an example of iid data, and different masking and corruption schemes were picked.

5 Conclusions

A number of structure-agnostic masking schemes, together with a structure-dependent one, are presented and their performance has been evaluated using a structured synthetic dataset from a Bayesian network. The main conclusion of this work is that the choice of masking scheme employed during training impacts both predictive performance and training efficiency; for maximal benefit, this choice should be informed by the quantities of interest at prediction time. Model generalisation is also affected by the choice of masking method when the corruption process differs from the masking used during training or is unknown, as in real-world scenarios.

A natural extension to this work is to benchmark the structure-dependent and structure-agnostic masking schemes on real-world clinical case data and larger Bayesian networks.

Finally, we have used our masking schemes for other universal marginalisers such as the variational autoencoder with arbitrary conditioning (Ivaetal19) to see the effect of masking schemes for iid data. Further experiments can be conducted with different neural network architectures such as the generative adversarial network from Beletal19 and more complex datasets to study the performance of universal marginalisers for structured data. We believe this is an interesting avenue of future work.


Appendix A Distribution of possible masks of a given size

Figure 2: Illustration of the non-uniform distribution of the possible masks of size $k$ for $k = 0, \dots, N$, for a BN of $N$ nodes.

Appendix B Neural network architecture and training

Figure 3 shows the neural network (NN) architecture, which consists of an input layer followed by a number of distinct multi-layer perceptron (MLP) branches, one for each layer of descendants in the BN, e.g. three branches for the risk factor, disease and symptom layers of a three-layer BN.

Figure 3: Neural network architecture. The three branches correspond to the risk factor, disease and symptom nodes of a three-layer BN.

Each MLP branch consists of 3 fully-connected layers of 512 hidden units each, each of which has batch normalisation (ioffe2015batch) applied before it and a sigmoid activation applied after it, followed by a final fully-connected layer with two hidden units per node of the corresponding BN layer (risk factors, diseases or symptoms). Lastly, softmax activation is applied on a per-node basis. This is done because of the two-Boolean parameterisation of the node assignments, which has separate codes for unobserved/masked, for observed and false, and for observed and true. Hence the NN ends up learning two marginals for each node: one corresponding to the probability that the node is false, and another corresponding to the probability that it is true. The softmax activation ensures these two probabilities sum to 1.
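The per-node softmax can be illustrated with a short sketch. It assumes the final layer emits two logits per node stored next to each other as (false, true) pairs; this interleaved layout and the function name `pairwise_softmax` are assumptions for the example only.

```python
import math


def pairwise_softmax(logits):
    """Turn [f_0, t_0, f_1, t_1, ...] into [(P(false), P(true)), ...]."""
    if len(logits) % 2 != 0:
        raise ValueError("expected two logits per node")
    probabilities = []
    for i in range(0, len(logits), 2):
        f, t = logits[i], logits[i + 1]
        m = max(f, t)  # subtract the max for numerical stability
        ef, et = math.exp(f - m), math.exp(t - m)
        z = ef + et
        probabilities.append((ef / z, et / z))
    return probabilities


# Two nodes: the first is undecided, the second leans towards "true".
marginals = pairwise_softmax([0.0, 0.0, -1.0, 2.0])
```

Normalising within each pair, rather than across the whole output layer, is what keeps the marginals of different nodes independent of one another.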

The NN is trained using the Adam optimizer (Kin14), with the default parameter values and a learning rate of .

Appendix C Noisy-OR model

The number of parameters in a BN can be potentially large if the network is dense. Namely, each node $x_i$ has $2^{n_i}$ parameters, where $n_i$ is the number of parents of node $x_i$. In order to have a more compact representation for its parameters, only storing $O(n_i)$ parameters per node, the Noisy-OR model was proposed (Shwetal91).

The top layer of nodes corresponds to risk factors, the second layer to disease nodes and the third layer to symptom nodes. The probability of a symptom node $s$ being false, given that only one disease $d$ is true and the rest are false, is $\lambda_{s,0}\,\lambda_{s,d}$, where $\lambda_{s,0}$ is the leak probability, which corresponds to the case where all diseases are false. Analogously, the probability of a disease $d$ being false, given that only one risk factor $r$ is true, is $\lambda_{d,0}\,\lambda_{d,r}$, where $\lambda_{d,0}$ is the leak probability for the case where all risk factors are false. The prior for risk factor $r$ being false is denoted by $\lambda_{r,0}$. The general form for the probability of a risk factor, disease or symptom node being false given the values of its parents can be written in the following way:

$$P\bigl(x_i = 0 \mid \mathrm{pa}(x_i)\bigr) = \lambda_{i,0} \prod_{x_j \in \mathrm{pa}(x_i)} \lambda_{i,j}^{\,x_j}$$

The prior probabilities of diseases and risk factors can be obtained from epidemiological data, where available. The conditional probabilities $\lambda_{s,d}$ and $\lambda_{d,r}$ can be obtained through elicitation from multiple independent medical experts or from electronic medical records. The network structure expresses the corresponding clinical knowledge in graphical form, namely, whether or not there exists a direct relationship between a given pair of nodes. See ShwCoo91 for details about how to obtain the corresponding priors from QMR frequencies for a two-layer BN, and Shwetal91 for another method to obtain the corresponding conditional probabilities. BauKolSin97 and TonKol00 propose online parameter estimation and active learning methods, respectively, for parameter estimation in BNs.

Pop80 discusses structure finding methods for BNs.
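As an illustration of the Noisy-OR parameterisation discussed in this appendix, the sketch below computes the probability that a node is false given its parents, using the standard Noisy-OR form: the leak probability multiplied by one factor per active parent. The function name, argument names and numerical values are assumptions for the example.

```python
def noisy_or_prob_false(leak, lambdas, parent_states):
    """P(node = 0 | parents) under a standard Noisy-OR model.

    leak:          probability the node is false when all parents are false
    lambdas:       one factor per parent, applied only when that parent is true
    parent_states: 0/1 states of the parents, in the same order as lambdas
    """
    if len(lambdas) != len(parent_states):
        raise ValueError("need one lambda per parent")
    p_false = leak
    for lam, state in zip(lambdas, parent_states):
        if state == 1:
            p_false *= lam
    return p_false


# A symptom with two parent diseases and invented parameters.
p_false_none = noisy_or_prob_false(0.9, [0.5, 0.2], [0, 0])   # leak only
p_false_first = noisy_or_prob_false(0.9, [0.5, 0.2], [1, 0])  # one active parent
p_true_both = 1.0 - noisy_or_prob_false(0.9, [0.5, 0.2], [1, 1])
```

Each node stores one factor per parent plus a leak term, rather than one entry per joint parent configuration, which is the compactness argument made above.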

Appendix D Additional Experiments

D.1 Masking schemes with data from a Bayesian network

In this appendix, we share a few more details from our experiments. Figure 4 is similar to the plot in the main paper, but for a different query disease node. In this example it is somewhat more apparent that the deterministic cycle masking is the best performing masking scheme for small evidence sets. In Figure 4, the UM trained using Markov masking does not generalise well to uniformly chosen evidence but is well behaved over all the Markov-chosen evidence set sizes for which it was trained.

Figure 4 (panels: uniform evidence, Markov evidence): These plots correspond to a different query and are otherwise similar to Figure 1. Deterministic cycle masking is the best performing masking scheme for small evidence sets.

Figure 5 (panels: uniform evidence, Markov evidence): Models trained for 20 epochs exhibit poorer performance on average and the effect of different masking schemes is more pronounced. One epoch corresponds to the UM using 10 million samples. Note the different scale of the y-axis compared to Figures 1 and 4.

Figure 6 (panels: uniform evidence, Markov evidence): Linear model plots corresponding to the same experiment as Figure 5. For a small number of epochs, the sizewise masking distribution is the best scheme to use for both the uniform and the Markov evidence.

During training with the different masking distributions, the UM saw a number of samples from the 24-node BN that is nearly four orders of magnitude larger than the size of its state space. In larger BNs with hundreds of nodes, the UM would not be able to see samples from substantial parts of the state space during training. In addition, for masking distributions that assign non-zero probability to every mask, it is expected that a denoising autoencoder will behave as a UM after seeing a sufficiently large number of samples, making comparisons between masking schemes difficult. UMs trained with uniform masking exhibit good performance (Douetal17) even for large BNs.

In order to simulate the scenario of training on a large BN with only the same small synthetic BN available, we trained another series of models for a much smaller number of epochs, to see whether differences between the masking schemes become more pronounced. Figures 5 and 6 correspond to this experiment. Across the various queries considered during this experiment (most of which are not shown), we noticed that the deterministic cycle scheme had the worst behaviour whereas the sizewise scheme performed best.

D.2 Masking schemes with iid data

Feature   MCAR-trained      Sizewise-trained
0         0.02748 (0.044)   0.02838 (0.046)
1         0.02160 (0.037)   0.02207 (0.038)
2         0.00936 (0.017)   0.00942 (0.017)
3         0.02511 (0.050)   0.02534 (0.051)
4         0.00386 (0.031)   0.00389 (0.031)
5         0.00856 (0.073)   0.00845 (0.073)
6         0.00404 (0.012)   0.00406 (0.012)
7         0.01597 (0.046)   0.01584 (0.046)

Table 1: MCAR-corrupted data. Mean squared error of reconstruction per feature, with standard deviation in parentheses.

Feature   MCAR-trained      Sizewise-trained
0         0.03181 (0.051)   0.03202 (0.051)
1         0.02286 (0.038)   0.02317 (0.038)
2         0.00980 (0.017)   0.01002 (0.017)
3         0.02619 (0.052)   0.02617 (0.051)
4         0.00370 (0.030)   0.00372 (0.030)
5         0.00878 (0.075)   0.00878 (0.075)
6         0.00410 (0.011)   0.00411 (0.011)
7         0.01539 (0.041)   0.01524 (0.042)

Table 2: Structured-corrupted data. Mean squared error of reconstruction per feature, with standard deviation in parentheses.

In this section, we include an experiment with the variational autoencoder with arbitrary conditioning (VAEAC) architecture (Ivaetal19), together with the yeast dataset (Dua:2019). Ivaetal19 consider a single corruption distribution called "missing completely at random" (MCAR), which corrupts the original data by randomly turning observed values into unobserved ones. MCAR with parameter equal to 0.5 corresponds to what we call the uniform power setwise masking distribution; see Section 3 for details. Their experimental setup is as follows: a VAE is first trained to reconstruct the corrupted dataset with the use of another MCAR distribution that is different from the one used to corrupt the dataset. Since the VAE can utilise any reasonable masking distribution in order to recover the original values of the features, we extended this experiment. Specifically, the dataset was corrupted and reconstructed using different masking distributions for each step. Then, the mean squared errors per feature were compared for each configuration of masking and corruption processes. In Table 1, the data were corrupted by an MCAR distribution applied independently to each feature in each row of the dataset. The VAE that uses a sizewise distribution during training does not have a considerable advantage over the MCAR case.

We also picked the corruption distribution to be different from the possible masking schemes used for training. In Table 2, we consider a corruption distribution that induces some structure, which we call a structured-corrupted (SC) distribution. The SC corruption scheme always masks neighbouring triplets of features, as opposed to MCAR, which corrupts different features independently. The SC corruption distribution emulates situations where only a subset of relevant features is observed at any given time. As expected, the VAEAC using MCAR or uniform sizewise masking during training exhibits larger error, on average, when the ground truth masking is SC. We conjecture that part of this error arises because the VAE cannot capture couplings between the predicted distributions given evidence, but further investigation is needed to verify this claim.
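The contrast between the two corruption processes can be sketched as follows. This is a hypothetical reading of SC in which a single block of three adjacent features is hidden per row at a random position; the exact triplets used in the experiment are not specified here, so the function names, the `block` parameter and the choice of start position are all assumptions. A mask entry of 1 means "observed".

```python
import random


def mcar_mask(n_features, p_observe, rng):
    """MCAR: each feature is observed independently with probability p_observe."""
    return [1 if rng.random() < p_observe else 0 for _ in range(n_features)]


def sc_mask(n_features, rng, block=3):
    """Structured corruption (assumed form): hide one run of adjacent features."""
    start = rng.randrange(0, n_features - block + 1)
    return [0 if start <= j < start + block else 1
            for j in range(n_features)]


rng = random.Random(0)
mcar_rows = [mcar_mask(8, 0.5, rng) for _ in range(4)]
sc_rows = [sc_mask(8, rng) for _ in range(4)]
# Every SC row hides exactly three neighbouring features, a pattern that
# an MCAR mask with eight features produces only rarely.
```

A model trained only on independent per-feature masks sees such contiguous gaps rarely, which is one plausible reading of the larger errors in Table 2.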